Microsoft researchers develop assistive eye-tracking AI that works on any device

Gaze tracking has the potential to help people living with motor neuron diseases and disorders exert control over their environment and communicate with others. But estimating a person's gaze isn't a trivial task owing to variables including head pose, head position, eye rotation, distance, illumination, background noise, and the presence of glasses, face coverings, and assistive medical equipment. Commercially available gaze trackers exist -- they use specialized sensor assemblies -- but they tend to be expensive, costing up to thousands of dollars. Inexpensive software-based trackers, on the other hand, are often prone to lighting interference.

This challenge inspired a team of researchers at Microsoft to develop an ultra-precise, hardware-agnostic gaze tracker that works with any off-the-shelf webcam. In a preprint paper published earlier this month, they detail their work on a system that achieves an error of 1.8073 centimeters on GazeCapture, an MIT corpus containing eye-tracking data from over 1,450 people, without calibration or fine-tuning.
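
For readers wondering what a centimeter-level error means in practice, the sketch below assumes it is the average straight-line distance between where the model predicts a person is looking and where they are actually looking, which is how results on GazeCapture are typically reported. The function name and the sample points are hypothetical and are not taken from the paper.

```python
import math


def mean_gaze_error_cm(predicted, actual):
    """Average Euclidean distance (cm) between predicted and true gaze points."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for (px, py), (ax, ay) in zip(predicted, actual):
        total += math.hypot(px - ax, py - ay)
    return total / len(predicted)


# Hypothetical (x, y) gaze points in centimeters, for illustration only
predicted = [(1.0, 2.0), (4.0, 6.0)]
actual = [(1.0, 0.0), (1.0, 2.0)]
print(mean_gaze_error_cm(predicted, actual))  # 3.5
```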

Microsoft has a recent history of gaze-tracking research. In a previous study, researchers at the company experimented with multiple infrared lights around a display for eye-tracking, as well as with a camera and depth sensors. And Windows 10 was the first version of the operating system to offer Eye Control, a technology that allows users to use their eyes to control an on-screen mouse and keyboard experience, and the Eye Drive Library, which emulates a joystick via eye-tracking.

This latest effort from Microsoft engineers isn't the first attempt to build a more accurate software-based tracker. The MIT team behind GazeCapture designed iTracker, an AI model that performs gaze tracking on Apple devices using built-in cameras. Their system achieved a prediction error of 1.86 centimeters and 2.81 centimeters on smartphones and tablets, respectively, without calibration.

But the Microsoft researchers aimed to go a step further and enable mouse-like gestures on laptops and desktops in addition to smartphones and tablets.

Like iTracker, the coauthors' model takes the left eye, right eye, and face regions from camera images as input, along with a 25-by-25 face-grid indicating the positions of the face pixels in the captured images. These input images are passed through eye and face sub-models (which are based on the pretrained ResNet18 computer vision model), and the outputs of these sub-models are then processed into gaze point coordinates.
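
To make the face-grid idea concrete, here is a minimal sketch of how such a grid could be built, assuming the face location is available as a pixel bounding box. The function name, the `(x, y, w, h)` box format, and the rounding choices are illustrative assumptions rather than details from the paper.

```python
import math


def make_face_grid(frame_w, frame_h, face_box, grid_size=25):
    """Return a grid_size x grid_size binary mask marking where the face sits.

    face_box is a hypothetical (x, y, w, h) bounding box in pixels.
    """
    x, y, w, h = face_box
    # Scale pixel coordinates down to grid cells, rounding outward
    col0 = int(x * grid_size / frame_w)
    row0 = int(y * grid_size / frame_h)
    col1 = math.ceil((x + w) * grid_size / frame_w)
    row1 = math.ceil((y + h) * grid_size / frame_h)
    # Clamp to the grid in case the box spills past the frame edges
    col0, row0 = max(0, col0), max(0, row0)
    col1, row1 = min(grid_size, col1), min(grid_size, row1)

    grid = [[0] * grid_size for _ in range(grid_size)]
    for r in range(row0, row1):
        for c in range(col0, col1):
            grid[r][c] = 1
    return grid


# A face in the middle of a 640x480 frame
grid = make_face_grid(640, 480, (160, 120, 320, 240))
for row in grid:
    print("".join("#" if cell else "." for cell in row))
```

The grid tells the model where in the frame the face appears and roughly how large it is, which is information the cropped eye and face images alone do not carry.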

The researchers trained their model on portions of the GazeCapture dataset, but they also performed data augmentation to ensure the model would better handle variations it might encounter in the real world. They randomly changed the brightness, contrast, saturation, and hue of sample images and then resized them before randomly cropping them to add noise that prevented the model from overfitting. (Overfitting occurs when a model learns the detail in the training data so closely that its performance on new data suffers.)
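
The augmentation steps described above can be sketched in a few lines. This is a simplified illustration using only Python's standard library, not the researchers' code: the jitter ranges are made-up values, images are plain lists of RGB tuples with channels in the 0-to-1 range, and the resizing step is left out for brevity.

```python
import colorsys
import random


def clamp(value):
    return min(1.0, max(0.0, value))


def color_jitter(image, rng, brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05):
    """Randomly shift hue, saturation, brightness, and contrast of an RGB image."""
    b_scale = 1 + rng.uniform(-brightness, brightness)
    c_scale = 1 + rng.uniform(-contrast, contrast)
    s_scale = 1 + rng.uniform(-saturation, saturation)
    h_shift = rng.uniform(-hue, hue)

    jittered = []
    for row in image:
        new_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            h = (h + h_shift) % 1.0
            new_row.append(colorsys.hsv_to_rgb(h, clamp(s * s_scale), clamp(v * b_scale)))
        jittered.append(new_row)

    # Contrast: push every channel toward or away from the image-wide mean
    values = [ch for row in jittered for px in row for ch in px]
    mean = sum(values) / len(values)
    return [[tuple(clamp(mean + c_scale * (ch - mean)) for ch in px) for px in row]
            for row in jittered]


def random_crop(image, rng, crop_h, crop_w):
    """Cut a crop_h x crop_w window from a random position in the image."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


rng = random.Random(0)  # fixed seed so the example is repeatable
image = [[(0.5, 0.4, 0.3)] * 8 for _ in range(8)]  # hypothetical 8x8 image
augmented = random_crop(color_jitter(image, rng), rng, crop_h=6, crop_w=6)
print(len(augmented), len(augmented[0]))  # 6 6
```

Because each training pass sees a slightly different version of every image, the model has less opportunity to memorize incidental details such as exact lighting or framing.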

The coauthors combined the tracking model with a face detection library, Dlib, which they said resulted in more useful, consistent, and higher-quality detection. While GazeCapture doesn't include data above the eyebrow and below the lip regions, Dlib handles a range of head rotations in captured images. During detection, the library (and an OpenCV method called minAreaRect) fits a rectangle to identified facial landmarks and estimates the head angle before performing rotation correction and extracting face and eye crops. The rotation angle is encoded into the face-grid.
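
The rotation-correction idea can be illustrated without Dlib or OpenCV. The sketch below deliberately swaps in a simpler stand-in for the rectangle fit: it estimates head roll from the line joining two eye centers and then rotates the landmarks so that line is level. The function names and coordinates are hypothetical, and this is not the pipeline the researchers describe.

```python
import math


def roll_angle_deg(left_eye, right_eye):
    """Head roll in degrees, from the line joining the two eye centers."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def rotate_points(points, angle_deg, center):
    """Rotate points by -angle_deg about center, undoing the estimated roll."""
    theta = math.radians(-angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    rotated = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        rotated.append((cx + dx * cos_t - dy * sin_t,
                        cy + dx * sin_t + dy * cos_t))
    return rotated


# Hypothetical eye centers in pixel coordinates for a tilted head
left_eye, right_eye = (100.0, 100.0), (200.0, 200.0)
angle = roll_angle_deg(left_eye, right_eye)
midpoint = ((left_eye[0] + right_eye[0]) / 2, (left_eye[1] + right_eye[1]) / 2)
leveled = rotate_points([left_eye, right_eye], angle, midpoint)
print(round(angle, 1))  # 45.0
print([(round(x, 1), round(y, 1)) for x, y in leveled])
```

After correction both eye centers share the same vertical coordinate, so the eye and face crops handed to the model are upright regardless of how the head was tilted.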

The researchers attempted to root out potential biases and other issues by analyzing the areas of faces to which the system paid attention. They found that it mostly cared about the eye region but also the eyebrow and the lower edge of the eyelid -- in other words, the muscles activated when people move their eyes in certain directions. "Trigonometric models that focus only at the pupil and the iris would not necessarily pick these features and therefore, this is where deep learning could exploit beyond the obvious in order to improve the accuracy," the researchers wrote.

In future work, the coauthors plan to develop custom neural network architectures that improve performance even further. "Gaze-tracking as accessibility technology has many roadblocks including lack of interoperability and non-existence of a diverse and large-scale dataset covering issues of facial occlusion, head poses and various eye conditions," they continued. "This research shows promise that one day soon any computer, tablet, or phone will be controllable using just your eyes due to the prediction capabilities of deep neural networks."
