IBM researchers develop a pair of low-power, high-performance computer vision systems

Machine learning algorithms have improved by leaps and bounds in recent years. State-of-the-art systems such as Facebook's can train image classification algorithms in an hour without sacrificing accuracy. But many such systems are trained on high-end machines with powerful GPUs, and as the internet of things (IoT) industry moves toward edge computing, there's growing demand for low-power artificial intelligence (AI) models with low overhead.

Promising research out of IBM lays the foundation for much more efficient algorithms. At the 2018 Conference on Computer Vision and Pattern Recognition in Salt Lake City, Utah, this week, research scientists from the company are presenting two papers that deal with image classification.

BlockDrop

The first, titled "BlockDrop: Dynamic Inference Paths in Residual Networks," builds on Microsoft's work on residual networks that was published in 2015. Residual networks (ResNets for short) introduce identity connections between the layers in the neural network, allowing the layers to learn incremental, or residual, representations in the course of training.

IBM takes this idea one step further. Scientists introduced a lightweight secondary neural network -- referred to in the paper as a "policy network" -- that dynamically dropped residual blocks in a pre-trained ResNet. To ensure the performance gains didn't come at the cost of precision, the policy network was trained to use a minimal number of blocks and to preserve recognition accuracy.
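To make the idea concrete, here is a minimal Python sketch of a residual stack in which a per-input keep/drop decision skips individual blocks. It is an illustration only, not IBM's implementation: the names `make_block`, `forward`, and `toy_policy` are hypothetical, the blocks are toy scaling functions rather than trained layers, and the "policy" is a hand-written rule standing in for the trained policy network described in the paper.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def make_block(scale: float) -> Block:
    """Hypothetical residual function F(x): a toy elementwise scaling."""
    def residual(x: Vector) -> Vector:
        return [scale * v for v in x]
    return residual


def forward(x: Vector, blocks: Sequence[Block], keep: Sequence[int]) -> Vector:
    """Run a residual stack, skipping blocks whose keep flag is 0.

    A kept block computes x + F(x). A dropped block reduces to the identity
    connection, so F is never evaluated and its cost is saved.
    """
    if len(blocks) != len(keep):
        raise ValueError("need exactly one keep flag per block")
    for block, flag in zip(blocks, keep):
        if flag:
            fx = block(x)
            x = [a + b for a, b in zip(x, fx)]
    return x


def toy_policy(x: Vector, num_blocks: int) -> List[int]:
    """Stand-in for the policy network (assumption: a fixed rule, not a trained model).

    Inputs with small total magnitude are treated as "easy" and get half the blocks.
    """
    energy = sum(abs(v) for v in x)
    budget = num_blocks if energy > 1.0 else max(1, num_blocks // 2)
    return [1 if i < budget else 0 for i in range(num_blocks)]


blocks = [make_block(0.5) for _ in range(4)]
easy_input = [0.1, 0.2]
hard_input = [1.0, 2.0]

easy_mask = toy_policy(easy_input, len(blocks))   # uses 2 of 4 blocks
hard_mask = toy_policy(hard_input, len(blocks))   # uses all 4 blocks
easy_out = forward(easy_input, blocks, easy_mask)
hard_out = forward(hard_input, blocks, hard_mask)
```

Because a dropped block falls back to the identity connection, the shape of the data never changes, which is what makes it possible to skip blocks in an already-trained ResNet without rebuilding the network.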

"Generally speaking, if you add more layers to a model, you can improve its accuracy, but you increase the computational cost," IBM research manager Rogerio Feris told VentureBeat in a phone interview. "One issue with most current models today is that you have one-size-fits-all networks where the same computation is applied to all images. [Our] system allocates resources more efficiently and [can] accurately identify an image."

BlockDrop sped up image classification by 20 percent on average, and by as much as 36 percent in certain cases, all while maintaining 76.4 percent accuracy -- the same as the experiment's control.

Improving stereo vision

The second paper, "A Low Power, High Throughput, Full Event-Based Stereo System," tackled another problem in image processing: stereo vision.

As IBM researcher Alexander Andreopoulos explained, human eyes are centimeters apart from each other and see the world from slightly different perspectives. The brain's visual cortex seamlessly merges images from both eyes into one, allowing us to perceive depth, but two-camera robotics systems have a tougher time reconciling the disparity.

"In the case of computer vision, camera lenses have abnormalities, and this leads to noise and complicates the problem," Andreopoulos said.
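For readers unfamiliar with the term, disparity is the horizontal shift of the same point between the left and right images, and depth follows from it by simple geometry. The sketch below uses the standard pinhole stereo relation (depth = focal length x baseline / disparity). It is not taken from the IBM paper; the function name and the example numbers are assumptions chosen for illustration.

```python
def depth_from_disparity(disparity_px: float,
                         focal_length_px: float,
                         baseline_m: float) -> float:
    """Estimate depth in meters for an idealized, rectified stereo pair.

    disparity_px:    horizontal pixel shift of a point between the two images
    focal_length_px: camera focal length expressed in pixels
    baseline_m:      distance between the two camera centers in meters
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point at finite depth")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical rig: cameras 6.25 cm apart, focal length of 800 pixels.
near = depth_from_disparity(10.0, 800.0, 0.0625)        # 5.0 m
far = depth_from_disparity(1.0, 800.0, 0.0625)          # 50.0 m
far_off_by_half_px = depth_from_disparity(0.5, 800.0, 0.0625)  # 100.0 m
```

The last two lines hint at why noise matters: for distant points the disparity is small, so an error of half a pixel can double the estimated depth.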

The researchers' solution: a system running on IBM's TrueNorth neuromorphic chips, which have a highly parallelized architecture optimized for machine learning models. Using a cluster of nine processors, a pair of event-based cameras (cameras that only snap an image when they detect motion), and a laptop that distributed computations to the aforementioned chips, the algorithms captured and processed 400 disparity maps per second, up to a maximum of 2,000.

The use of event-based cameras significantly cut down on bandwidth and energy usage, Andreopoulos explained. "Stereo algorithms have been around for over 30 years, but most of these systems ... use an active approach to sensing the world. Ours uses a passive approach."

Overall, the system demonstrated a 200-fold improvement in power per pixel per disparity map compared to state-of-the-art systems with high-frame-rate cameras.

The results hold promise for robotics systems that depend on low-power, low-latency depth information to navigate the world, Andreopoulos said. "[I imagine] it being used in companion robots for the elderly ... [that] offer some kind of mobility assistance."
