MIT CSAIL details technique for shrinking neural networks without compromising accuracy

Deep neural networks -- layers of mathematical functions modeled after biological neurons -- are a versatile type of AI architecture capable of performing tasks from natural language processing to computer vision. That doesn't mean that they're without limitations, however. Deep neural nets are often quite large and require correspondingly large corpora, and training them can take days on even the priciest of purpose-built hardware.

But it might not have to be that way. In a new study ("The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks"), scientists at MIT’s Computer Science and Artificial Intelligence Lab (CSAIL) show that deep neural networks contain subnetworks as small as one-tenth the size of the full network that can be trained to make equally precise predictions, in some cases more quickly than the originals.

The work is scheduled to be presented at the International Conference on Learning Representations (ICLR) in New Orleans, where it was named one of the conference's top two papers out of roughly 1,600 submissions.

"If the initial network didn't have to be that big in the first place, why can't you just create one that's the right size at the beginning?" said PhD student and coauthor Jonathan Frankle in a statement. "With a neural network you randomly initialize this large structure, and after training it on a huge amount of data it magically works. This large structure is like buying a big bag of tickets, even though there's only a small number of tickets that will actually make you rich. But we still need a technique to find the winners without seeing the winning numbers first."

The researchers' approach builds on pruning, a process that eliminates unnecessary connections among the functions -- or neurons -- and is commonly used to adapt networks to low-powered devices. (They specifically removed the connections with the lowest "weights," which indicated that they were the least important.) After training a network, they pruned those connections, reset the surviving weights to their original starting values, and retrained the smaller network. By repeating this cycle and pruning additional connections each time, they determined how much could be removed without affecting the model's predictive ability.
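The train, prune, and reset cycle described above can be sketched in a few lines of NumPy. This is a minimal illustration on a toy linear model, not the researchers' code: the function names (`prune_lowest`, `train`, `find_ticket`), the 50% pruning rate, the number of rounds, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def prune_lowest(weights, mask, fraction):
    """Return a new mask with the smallest-magnitude surviving weights removed."""
    alive = np.flatnonzero(mask)                      # indices of unpruned weights
    n_prune = int(len(alive) * fraction)
    order = np.argsort(np.abs(weights.ravel()[alive]))
    new_mask = mask.ravel().copy()
    new_mask[alive[order[:n_prune]]] = False          # drop the weakest connections
    return new_mask.reshape(mask.shape)

def train(weights, mask, X, y, steps=200, lr=0.1):
    """Toy stand-in for training: gradient descent on a masked linear model."""
    w = weights * mask
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask                    # pruned weights stay at zero
    return w

def find_ticket(init, X, y, rounds=3, fraction=0.5):
    """Repeat: train from the ORIGINAL init, prune, reset survivors to init."""
    mask = np.ones_like(init, dtype=bool)
    for _ in range(rounds):
        trained = train(init, mask, X, y)             # every round restarts from init
        mask = prune_lowest(trained, mask, fraction)
    return mask, init * mask                          # the sparse, reset subnetwork

# Synthetic data (an assumption): only 2 of the 16 inputs actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
true_w = np.zeros(16)
true_w[[3, 11]] = [3.0, -2.0]
y = X @ true_w
init = rng.normal(size=16)

mask, ticket = find_ticket(init, X, y)
print("surviving connections:", np.flatnonzero(mask))
```

The key detail is that each round restarts from `init` rather than from the trained weights, so the subnetwork that survives is evaluated with its original starting values.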

After repeating the process tens of thousands of times on different networks in a range of conditions, they report that the AI models they identified were consistently less than 10% to 20% of the size of their fully connected parent networks.

"It was surprising to see that re-setting a well-performing network would often result in something better," says coauthor and assistant professor Michael Carbin. "This suggests that whatever we were doing the first time around wasn't exactly optimal and that there's room for improving how these models learn to improve themselves."

Carbin and Frankle note that they only considered vision-centric classification tasks on smaller data sets, and they leave to future work exploring why certain subnetworks are particularly adept at learning and ways to quickly spot these subnetworks. However, they believe that the results may have implications for transfer learning, a technique where networks trained for one task are adapted to another task.
