Intel researchers compress AI models without compromising accuracy

The size of AI models generally correlates with their training times: larger models take more time -- and consequently more compute -- to train. It's possible to optimize the connections among a model's mathematical functions (or neurons) through a process known as pruning, which reduces the model's overall size without compromising accuracy. But pruning can't be performed until after training.

That's why researchers at Intel devised a technique that approaches training from the opposite direction, beginning with a compact model and modifying the structure based on data during training. They claim it's more scalable and computationally efficient than starting with a large model followed by compression, because training operates directly on the compact model. And they claim that -- unlike past attempts -- it's able to train a small model with performance equivalent to a large pruned model.

By way of background, neural networks at the heart of most AI systems consist of neurons that are arranged in layers and transmit signals to other neurons. Those signals -- the product of data, or inputs, fed into the neural network -- travel from layer to layer and slowly "tune" the network by adjusting the synaptic strength (weights) of each connection. Over time, the network extracts features from the data set and identifies cross-sample trends, eventually learning to make predictions.

Neural networks don't ingest raw images, videos, audio, or text. Rather, samples from training corpora are transformed algebraically into numerical objects such as scalars (single numbers), vectors (ordered arrays of scalars), and matrices (scalars arranged into one or more columns and one or more rows). A fourth entity type that encapsulates scalars, vectors, and matrices -- the tensor -- adds in descriptions of valid linear transformations (or relations).
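To make the distinction concrete, here is a minimal sketch in plain Python that represents each object as a nested list and reports its shape. The `shape` helper and the example values are illustrative assumptions, not anything taken from the paper; real frameworks store these objects in dedicated array types.

```python
def shape(x):
    """Return the dimensions of a nested-list array; a bare number has shape ()."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        if not x:          # stop at an empty list rather than indexing into it
            break
        x = x[0]
    return tuple(dims)

scalar = 3.0                       # zero dimensions: a single number
vector = [1.0, 2.0, 3.0]           # one dimension: an ordered array of scalars
matrix = [[1.0, 2.0],
          [3.0, 4.0]]              # two dimensions: rows and columns
# Four dimensions, laid out here the way a convolution weight tensor often is
# (output channels, input channels, kernel height, kernel width).
tensor = [[[[0.0] * 3 for _ in range(3)] for _ in range(2)] for _ in range(4)]

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("tensor", tensor)]:
    print(name, shape(value))
```

The helper assumes every sub-list at a given depth has the same length, which is what makes a nested list a regular array.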

The team's scheme, then, which is described in a newly published paper accepted as an oral presentation at the International Conference on Machine Learning 2019, trains a type of neural network known as a deep convolutional neural network (CNN), where the majority of layers have sparse weight tensors, or tensors containing mostly zero values. All of these tensors are initialized at the same sparsity (percentage of zeros) level, and non-sparse parameters (function arguments that have one of a range of values) are used for most other layers.
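As a rough illustration of what "initialized at the same sparsity level" means, the sketch below builds a few weight tensors (flattened to plain lists) in which the same fraction of entries is zero. The layer names, sizes, the 90% figure, and the `init_sparse` helper are all hypothetical choices made for this example, not values from the paper.

```python
import random

def sparsity(weights):
    """Fraction of entries in a flat weight list that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)

def init_sparse(size, target_sparsity, rng):
    """Create `size` weights with round(size * target_sparsity) random entries set to zero."""
    n_zero = round(size * target_sparsity)
    zero_positions = set(rng.sample(range(size), n_zero))
    # Non-zero entries get small random values (an arbitrary choice for this sketch).
    return [0.0 if i in zero_positions else rng.gauss(0.0, 0.1)
            for i in range(size)]

rng = random.Random(0)
layer_sizes = {"conv1": 200, "conv2": 400, "conv3": 800}   # hypothetical layers
layers = {name: init_sparse(n, 0.9, rng) for name, n in layer_sizes.items()}

for name, weights in layers.items():
    print(name, len(weights), sparsity(weights))
```

Every layer starts with nine in ten weights at zero, regardless of its size, so larger layers simply begin with more non-zero parameters.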

Throughout training, the same total number of non-zero parameters in the network is maintained while parameters move within and across tensors every few hundred training iterations in two phases: a pruning phase followed immediately by a growth phase. A type of pruning dubbed magnitude-based pruning is used to remove the links with the smallest weights, and parameters are reallocated across layers during training.
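The prune-then-grow cycle can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it prunes a fixed number of weights (`n_move`, a hypothetical parameter) by magnitude across the whole network, then regrows the same number at randomly chosen zero positions with small non-zero values. How the paper picks its pruning threshold, distributes regrown parameters across layers, and initializes them is not reproduced here.

```python
import random

def count_nonzero(layers):
    """Total number of non-zero weights across all layers."""
    return sum(1 for weights in layers.values() for w in weights if w != 0.0)

def prune_and_grow(layers, n_move, rng):
    """Zero the n_move smallest-magnitude weights, then regrow as many elsewhere."""
    # Pruning phase: rank every non-zero weight in every layer by magnitude.
    nonzero = sorted((abs(w), name, i)
                     for name, weights in layers.items()
                     for i, w in enumerate(weights) if w != 0.0)
    pruned = set()
    for _, name, i in nonzero[:n_move]:
        layers[name][i] = 0.0
        pruned.add((name, i))
    # Growth phase: pick zero positions that were not just pruned, in any layer.
    free = [(name, i)
            for name, weights in layers.items()
            for i, w in enumerate(weights)
            if w == 0.0 and (name, i) not in pruned]
    for name, i in rng.sample(free, min(len(pruned), len(free))):
        # Small non-zero starting value (an assumption made for this sketch).
        layers[name][i] = rng.choice((-1.0, 1.0)) * rng.uniform(0.01, 0.1)

rng = random.Random(0)
layers = {   # hypothetical sparse weight tensors, flattened to lists
    "conv1": [0.5, 0.0, -0.02, 0.0, 0.0, 0.3],
    "conv2": [0.0, 0.01, 0.0, 0.0, -0.7, 0.0, 0.0, 0.2],
}
before = count_nonzero(layers)
prune_and_grow(layers, n_move=2, rng=rng)
after = count_nonzero(layers)
print(before, after)
```

Because pruning ranks weights across the whole network and regrowth can land in any layer, the per-layer parameter counts can shift from one cycle to the next while the network-wide total stays fixed.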

To address performance concerns, the researchers trained the neural networks for twice the usual number of epochs (an epoch being one full pass over the training data set), and they tested two of them -- WRN-28-2 and ResNet-50 -- on the Canadian Institute for Advanced Research's CIFAR-10 image data set and Stanford's ImageNet. They report that the method achieved better accuracies than static approaches for the same model size while requiring substantially less training, and it yielded better accuracies than previous dynamic methods.

"Experiments indicate that exploring network structure during training is essential to achieve best accuracy," wrote Hesham Mostafa, one of the paper's lead authors. "If a static ... sparse network is constructed that copies the final structure of the sparse network discovered by the dynamic parameterization scheme, this static network will fail to train to the same level of accuracy."
