Train More with Less

A technique called dreaMLearning trains AI models directly on compressed data, cutting memory use by 10x and speeding up training by as much as 8.8x.

It is no secret that as deep learning (DL) advances, it sucks up ever more data and computing resources. In the rush to build a better large language model, image generator, or whatever the case may be, researchers are focusing more heavily on performance than on efficiency. This trend can only go on for so long before we hit a wall, however. In the real world, technological limitations cannot be ignored forever.

Traditionally, training DL models on large datasets requires the data to be fully decompressed and loaded into memory. This process demands a lot of storage, memory, and processing power. Methods like dataset distillation and coreset selection try to speed things up by training on smaller datasets, but they often involve complex calculations and still need access to the full dataset initially. Standard data compression also hogs resources, since everything must be decompressed before training can begin.
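
To make that bottleneck concrete, here is a minimal sketch of the conventional pipeline (the file name and toy data are hypothetical stand-ins): the entire dataset has to be inflated into memory before the first training step can run, so peak memory tracks the uncompressed size, not the file on disk.

```python
import gzip
import io
import numpy as np

# Build a toy compressed dataset so the sketch is self-contained.
X_full = np.random.default_rng(0).normal(size=(100_000, 16))
buf = io.BytesIO()
np.save(buf, X_full)
with gzip.open("dataset.npy.gz", "wb") as f:  # hypothetical archive
    f.write(buf.getvalue())

# Conventional decompress-then-train pipeline: the whole array must be
# decompressed into RAM before any training can start.
with gzip.open("dataset.npy.gz", "rb") as f:
    X = np.load(f)  # fully inflated here; memory ~ uncompressed size

# Only now can batches be drawn from X and training begin.
print(X.shape, X.nbytes // 2**20, "MiB in memory")
```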

dreaMLearning outperforms existing coreset selection methods (📷: X. Zhao et al.)

A novel technique developed at Aarhus University called dreaMLearning seeks to change this by allowing DL models to learn directly from compressed data without needing to decompress it first. This is made possible by a new compression technique called Entropy-based Generalized Deduplication (EntroGeDe). EntroGeDe intelligently organizes similar data points into a compact set of representative samples, guided by how much information (or entropy) each part of the data contains. This sets it apart from older compression methods that either sacrifice too much accuracy or still require the data to be decompressed before use.
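
The article does not spell out the authors' exact algorithm, but the core idea can be sketched: estimate how much information each feature carries, quantize informative features finely and near-constant ones coarsely, then collapse rows that land in the same cell into a single weighted representative. The Python below is a hedged illustration of that idea; the function names and the binning heuristic are my own assumptions, not the actual EntroGeDe implementation.

```python
import numpy as np

def feature_entropy(X, bins=16):
    """Estimate the Shannon entropy of each column from a histogram."""
    H = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        counts, _ = np.histogram(X[:, j], bins=bins)
        p = counts[counts > 0] / counts.sum()
        H[j] = -(p * np.log2(p)).sum()
    return H

def entropy_guided_dedup(X, base_bins=4):
    """Collapse similar rows into weighted representatives (sketch)."""
    H = feature_entropy(X)
    # Hypothetical heuristic: more bins for more informative features.
    bins = np.maximum(1, np.round(base_bins * H / max(H.max(), 1e-12))).astype(int)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    codes = np.minimum((X - lo) / span * bins, bins - 1).astype(int)
    # Rows sharing a code vector collapse into one representative.
    _, inverse, counts = np.unique(
        codes, axis=0, return_inverse=True, return_counts=True)
    reps = np.zeros((counts.size, X.shape[1]))
    np.add.at(reps, inverse.ravel(), X)  # sum each group's rows...
    reps /= counts[:, None]              # ...then average them
    return reps, counts

# Tiny demo: 10,000 noisy rows collapse into a few hundred
# representatives, while the weights still account for every row.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 4))
reps, w = entropy_guided_dedup(X)
print(reps.shape, int(w.sum()))
```

The key design point is that each representative keeps a count of the rows it replaces, so the compressed set still reflects how the original data was distributed.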

A key feature of dreaMLearning is its ability to streamline the data pipeline from storage to training. Instead of the conventional approach of decompressing and then potentially selecting subsets, dreaMLearning generates “training-ready” compressed datasets that retain the essential characteristics for effective learning. This is a significant departure from existing methods that rely on computationally intensive optimization steps.
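
As for what "training-ready" might mean in practice: if each representative carries a weight, a model can be fit on the compressed set directly by scaling each sample's contribution to the gradient by that weight. The snippet below illustrates this with weighted linear regression; it is a sketch of the general idea under that assumption, not the authors' training code.

```python
import numpy as np

def train_on_compressed(reps, weights, y, lr=0.1, epochs=500):
    """Weighted linear regression on compressed representatives.

    Each representative's gradient is scaled by its weight (the number
    of original rows it stands for), so one pass over the compressed
    set approximates one pass over the full dataset.
    """
    w = np.zeros(reps.shape[1])
    b = 0.0
    total = weights.sum()
    for _ in range(epochs):
        err = reps @ w + b - y  # per-representative error
        w -= lr * (reps.T @ (weights * err)) / total
        b -= lr * (weights * err).sum() / total
    return w, b

# Demo on synthetic data: recover y = 2*x0 - x1 + 1 from weighted rows.
rng = np.random.default_rng(1)
reps = rng.normal(size=(200, 2))            # stand-ins for compressed rows
weights = rng.integers(1, 50, size=200).astype(float)
y = 2 * reps[:, 0] - reps[:, 1] + 1
w, b = train_on_compressed(reps, weights, y)
print(np.round(w, 2), round(b, 2))          # ~[2, -1] and ~1
```

Under this framing, training cost and memory scale with the number of representatives rather than the number of original rows, which is where the reported speedups would come from.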

In a series of experiments conducted by the team, dreaMLearning was shown to accelerate training by as much as 8.8 times. Furthermore, it slashes memory usage by a factor of 10 and reduces storage requirements by 42%. All of these efficiency gains come with only a minimal impact on the performance of the trained model.

dreaMLearning can do more with less data (📷: X. Zhao et al.)

These characteristics make the technique particularly well-suited to certain areas of machine learning. In distributed and federated learning, for instance, where data is spread across numerous devices and bandwidth is often limited, dreaMLearning's ability to train on compressed data could significantly enhance scalability and efficiency. Similarly, for TinyML applications running on resource-constrained edge devices, the dramatic reduction in memory and storage demands opens up new possibilities for deploying sophisticated AI models on tiny hardware platforms.

The framework was designed to accommodate a wide range of data types, including tabular and image data, and is compatible with various machine learning tasks, such as regression and classification, as well as a variety of model architectures. This flexibility means that dreaMLearning is not a niche solution but a general-purpose framework that may benefit a broad spectrum of AI applications in the future.
