Teleoperation - Not your typical video stream

Teleoperation, a seemingly simple capability, depends on a multitude of technologies and systems to be implemented safely. In the first article of this series we established what teleoperation is and why it is critical for the future of autonomous vehicles (AVs). In the second article we showed the legislative traction this technology has gained. In the third article we explained one of the many technical challenges that must be overcome to enable remote vehicle assistance and operation. In this article, we’ll explore another major technical challenge: dynamic ultra-low latency video streaming without compromising on availability.

Don’t blink!

Everyone has had this experience on a smartphone: you go to tap something on a website and, at the very last moment, an ad loads and pushes the desired item out of place, so you tap the ad instead. This is perhaps one of the most frustrating kinds of advertisements today. Had the ad loaded together with the rest of the site, with no delay, you would have tapped exactly where you intended and nobody would be frustrated. Now imagine the same issue while you’re remotely driving a real car and the video feed, the one you’re using to understand the car’s surroundings, suddenly glitches.

Certainly, when driving in your own vehicle you see exactly what is happening and in real time. After all, you are in the vehicle and you can see (almost) all of your surroundings. However, when teleoperating a vehicle, or remote driving, you don’t have this natural advantage. Your eyes are replaced by cameras mounted on or in the vehicle and instead of going just through your own optic nerve, the feed is also being transmitted over public cellular networks. As mentioned in a previous article from this series, when optimizing for very low latency transmission, there is going to be packet loss in the transfer of data and, unfortunately, video frames are likely to get lost as a result.

Go back to the scenario with you in the vehicle. While driving down the streets of your city, you have to sneeze. As a reflex you, like everyone else, close your eyes for JUST a moment. You might even call it a frame. Once you reopen your eyes, you see your surroundings again and react accordingly, in real time. Now apply this to the teleoperator. A frame is lost. Not because of a literal sneeze but a figurative one. Their “eyes” close for JUST a moment. However, when they open again, the teleoperator has to react to a situation that took place a number of milliseconds ago and, when driving, milliseconds can mean a lot (depending on vehicle speed).
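To put "milliseconds can be a lot" into numbers, here is a minimal sketch of the arithmetic. The function name and the 100 ms figure are illustrative assumptions, not measurements from a real system.

```python
def distance_during_delay(speed_kmh: float, delay_ms: float) -> float:
    """Metres the vehicle travels while the operator's picture is stale."""
    speed_m_per_s = speed_kmh * 1000 / 3600   # km/h -> m/s
    return speed_m_per_s * (delay_ms / 1000)  # ms -> s


# Assumed example: the operator's view is 100 ms behind reality.
for speed_kmh in (30, 50, 100):
    metres = distance_during_delay(speed_kmh, 100)
    print(f"{speed_kmh} km/h, 100 ms stale -> {metres:.2f} m")
# 30 km/h, 100 ms stale -> 0.83 m
# 50 km/h, 100 ms stale -> 1.39 m
# 100 km/h, 100 ms stale -> 2.78 m
```

Even at city speeds, a tenth of a second of stale video corresponds to roughly a metre of travel.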

Only send what you can.

Every lane on a roadway translates to a certain number of vehicles that can travel along it at any given time. If there are more than this number of vehicles using the road, there are traffic jams, frustrated drivers and even road rage. The same happens with transmitting video over a network. If you transfer too many bits at a given moment the results are delay, corrupted frames and even frame loss. The video size needs to be adjusted so it can be relayed with the given network conditions.
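The traffic-jam analogy can be sketched with a deliberately simple queue model. It assumes a constant link capacity and an unbounded buffer, which real cellular links do not have, and all names and numbers are hypothetical.

```python
def queue_delay_ms(send_kbps: float, capacity_kbps: float, duration_s: float) -> float:
    """Extra delay built up after sending above capacity for duration_s seconds.

    Simplified model: bits that do not fit wait in a queue, and the
    queue drains at the link capacity.
    """
    backlog_kbits = max(0.0, send_kbps - capacity_kbps) * duration_s
    return backlog_kbits / capacity_kbps * 1000


# Assumed example: a 6 Mbit/s stream on a link that carries 5 Mbit/s.
print(queue_delay_ms(6000, 5000, 2))  # about 400 ms of added delay after 2 s
print(queue_delay_ms(4000, 5000, 2))  # 0.0, the stream fits
```

In practice buffers are finite, so the backlog eventually turns into dropped packets rather than unbounded delay, which matches the corrupted and lost frames described above.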

Unfortunately, unlike with roadways, it is very difficult to know how much the network can handle, as conditions shift constantly. Variables such as tunnels, dense urban areas, distance to the cellular antenna, and weather all affect network performance. There are two potential negative outcomes. The first is using too few bits, so video quality is unnecessarily degraded. The second is using too many bits, which causes video degradation and loss.

The network must be continuously monitored. By measuring data like channel latency, packet loss, and modem signals, and doing so at high frequency, the video can be adjusted dynamically to fit the network's total capacity at every moment. The result is sharp video when the connection is strong and somewhat blurrier video when the connection is weak but, most importantly, the video is resilient and continuous.
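One way to picture such a monitoring loop is the sketch below. It uses a simple additive-increase/multiplicative-decrease rule as a stand-in technique; it is not the method described in this series, and every threshold, name, and default value is an assumption chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class NetworkSample:
    rtt_ms: float      # measured round-trip latency
    loss_ratio: float  # fraction of packets lost, 0.0 to 1.0


class BitrateController:
    """Hypothetical AIMD-style target bitrate estimator."""

    def __init__(self, start_kbps=4000, min_kbps=300, max_kbps=8000,
                 rtt_limit_ms=120.0, loss_limit=0.02,
                 step_kbps=100, backoff=0.7):
        self.target_kbps = float(start_kbps)
        self.min_kbps = min_kbps
        self.max_kbps = max_kbps
        self.rtt_limit_ms = rtt_limit_ms
        self.loss_limit = loss_limit
        self.step_kbps = step_kbps
        self.backoff = backoff

    def update(self, sample: NetworkSample) -> float:
        congested = (sample.rtt_ms > self.rtt_limit_ms
                     or sample.loss_ratio > self.loss_limit)
        if congested:
            self.target_kbps *= self.backoff    # back off quickly
        else:
            self.target_kbps += self.step_kbps  # probe upward slowly
        # Keep the target inside the encoder's supported range.
        self.target_kbps = min(self.max_kbps,
                               max(self.min_kbps, self.target_kbps))
        return self.target_kbps


controller = BitrateController()
print(controller.update(NetworkSample(rtt_ms=50, loss_ratio=0.0)))   # rises by one step
print(controller.update(NetworkSample(rtt_ms=200, loss_ratio=0.0)))  # drops to 70% of the previous target
```

Calling `update` at high frequency with fresh measurements yields a target that drops quickly when the link degrades and recovers gradually when it improves. A production system would likely also use the modem signals mentioned above, which this sketch leaves out.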

Once the strength of the connection is estimated, the video can be sent to an on-board encoder that adapts it to the relevant bitrate. There are a few primary methods for doing that, each with its own benefits and drawbacks.

The first primary way an encoder can shrink the video size is by lowering the resolution. Every frame is made up of an enormous number of dots, or pixels. By removing a certain percentage of these pixels and “stretching” the remaining ones, the picture is still complete but less sharp. The same effect can be seen day to day when we enlarge a picture or zoom in too far with a digital zoom. Lowering the resolution is the most common way to change the bitrate, as the loss is the least problematic for teleoperation. After all, even if things are a little bit blurry, you can still easily tell the difference between a car, a person and a tree.
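As a rough illustration, resolution selection can be sketched as walking down a ladder until the estimated bitrate fits the budget. The ladder, the 0.1 bits-per-pixel figure, and the function names are all assumptions; the real relationship between resolution and bitrate depends on the codec and the scene.

```python
# Hypothetical resolution ladder, highest first.
LADDER = [(1920, 1080), (1280, 720), (960, 540), (640, 360)]


def estimate_kbps(width: int, height: int, fps: int,
                  bits_per_pixel: float = 0.1) -> float:
    """Very rough encoded bitrate estimate (assumed constant bits per pixel)."""
    return width * height * fps * bits_per_pixel / 1000


def pick_resolution(budget_kbps: float, fps: int = 30) -> tuple:
    """Return the largest ladder entry whose estimated bitrate fits the budget."""
    for width, height in LADDER:
        if estimate_kbps(width, height, fps) <= budget_kbps:
            return (width, height)
    return LADDER[-1]  # nothing fits: fall back to the smallest picture


print(pick_resolution(7000))  # (1920, 1080)
print(pick_resolution(3000))  # (1280, 720)
print(pick_resolution(500))   # (640, 360)
```

Because the pixel count scales with width times height, halving both dimensions cuts the number of pixels to a quarter, which is why modest resolution steps free up a lot of bitrate.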

The second primary way an encoder can shrink the video size is by lowering the frames per second (FPS). This method allows the resolution to stay the same, but it means a longer gap between images, so the teleoperator sees changes in the scene later. Seeing changes later means a longer reaction time by the teleoperator. A longer reaction time means an increased likelihood of collision. This type of solution would be reserved for use cases with fewer moving objects, like agriculture or mining, where there are a limited number of vehicles and few to no pedestrians but precision is critical.
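The cost of a lower frame rate is easy to quantify: the gap between consecutive images is simply the inverse of the FPS. The sketch below shows that arithmetic; the frame rates chosen are only examples.

```python
def frame_interval_ms(fps: float) -> float:
    """Time between two consecutive frames, in milliseconds."""
    return 1000.0 / fps


for fps in (30, 15, 10):
    print(f"{fps} FPS -> a new image every {frame_interval_ms(fps):.1f} ms")
# 30 FPS -> a new image every 33.3 ms
# 15 FPS -> a new image every 66.7 ms
# 10 FPS -> a new image every 100.0 ms
```

Dropping from 30 to 10 FPS triples the time during which the teleoperator is looking at an old image, on top of whatever delay the network itself adds.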

The third primary way an encoder can shrink the video size is by limiting the color scheme or even moving entirely to grayscale. With fewer or no colors to deal with, the image can be kept clear and the frame rate high. This would seem to be the ideal solution. However, when dealing with grayscale, many crucial details can be missed. A concrete block might look like a cardboard box. Or what looks like an advertisement might actually be a real person. Clearly the safety issues are of grave concern, and this method should also be reserved for specific use cases.

Testing the limits.

There are also some methods, aimed either at maximizing network capabilities or at reducing streaming throughput, that have not been fully explored yet. They do, however, merit mention and further investigation.

An additional tool for dealing with shifting network conditions and available bandwidth is Multi Resolution Encoding (MRE). Traditionally, an encoder is told to convert a single video feed from its current size to a different size. With MRE one hedges one's bets and gets both a high bitrate video and a low bitrate video. The result is that both are available at any given moment and the stream can be switched easily as needed. The downside is increased computing power and encoding time, as the encoder now has to perform double duty.
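With two renditions always available, the remaining question is when to switch. A minimal sketch of one possible switching rule follows; the bitrate, the headroom factor, and the function name are hypothetical, and the hysteresis is an added assumption meant to avoid flipping back and forth when capacity hovers near the threshold.

```python
def select_rendition(current: str, capacity_kbps: float,
                     high_kbps: float = 4000, headroom: float = 1.2) -> str:
    """Choose between the 'high' and 'low' streams produced by the encoder.

    Switch down as soon as the high stream no longer fits; switch up
    only when there is some spare capacity (headroom).
    """
    if current == "high" and capacity_kbps < high_kbps:
        return "low"
    if current == "low" and capacity_kbps >= high_kbps * headroom:
        return "high"
    return current


print(select_rendition("high", 3500))  # low
print(select_rendition("low", 4200))   # low (not enough headroom yet)
print(select_rendition("low", 5000))   # high
```

Since both streams are already encoded, the switch itself is cheap; the cost sits in running two encodes continuously, as noted above.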

Alternative methods for bitrate reduction include:

Synchronizing images from multiple cameras into one big picture. This might ease strain on the encoders as a whole. It might also overload a single encoder. It might also result in bursts of data instead of a smoother and more even flow.
Eliminating irrelevant parts of the picture. This would reduce the amount of data that needs to be transmitted. However, a big challenge here is establishing which parts are irrelevant. This is especially tricky because the autonomy system already has difficulty understanding the situation, which is why a teleoperator is involved in the first place, so a seemingly irrelevant part may actually be important.
Only transmitting shapes but no details. This is the flip side of grayscale in that the colors are all still there but the shapes are without any definition or detail. The problems it would cause are similar to grayscale in that details would be hard to discern and something that seems innocent enough could indeed be problematic or even dangerous.

Dynamic ultra-low latency video streaming without compromising on availability is the second crucial component of teleoperation. Like network connectivity, it is not easy to achieve if your mission is to adhere to strict automotive requirements and to save people's lives. An accurate estimate of what the network can handle has to be made continuously and the video needs to be resized accordingly, all in real time. As with the other technologies discussed in this series, it cannot be skipped, nor can its importance be minimized. The other components will be laid out and explained in future articles of this series.