This article was contributed by Can Kocagil, data scientist at OREDATA.

From spatial to spatiotemporal visual processing

Instance-based classification, segmentation, and object detection in images are fundamental problems in computer vision. Unlike image-level information retrieval, video-level problems aim to detect, segment, and track object instances in the spatiotemporal domain, which has both space and time dimensions.

Video domain learning is crucial for spatiotemporal understanding in camera- and drone-based systems, with applications in video editing, autonomous driving, pedestrian tracking, augmented reality, robot vision, and more. It also helps us decode raw spatiotemporal data into actionable insights, because video has richer content than purely spatial visual data. Adding the temporal dimension to the decoding process gives us further information from the video frames about:

  • Motion
  • Viewpoint variations
  • Illumination changes
  • Occlusions
  • Deformations
  • Local ambiguities

Because of this, video-level information retrieval has gained popularity as a research area, and video understanding attracts growing attention from the research community.

Conceptually, video-level information retrieval algorithms are mostly adapted from image-level methods by adding heads that capture temporal information. Aside from the simpler video-level classification and regression tasks, the most common tasks are video object detection, video object tracking, video captioning, and video instance segmentation.

To start with, let’s recall the image-level instance segmentation problem.

Image-level instance segmentation

Instance segmentation not only groups pixels into different semantic classes, but also groups them into different object instances. A two-stage paradigm is usually adopted: it first generates object proposals using a Region Proposal Network (RPN), and then predicts object bounding boxes and masks from aggregated RoI features. Unlike semantic segmentation, which only separates semantic classes, instance segmentation also separates the individual instances of each class.
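The difference between a semantic mask and an instance mask can be illustrated with a toy sketch. This is not the RPN-based two-stage approach described above; it simply assumes that instances of a class never touch, so each 4-connected region of that class can stand in for one object. The mask values and the `label_instances` helper are hypothetical and exist only for illustration.

```python
from collections import deque


def label_instances(semantic_mask, target_class):
    """Give every 4-connected region of target_class its own instance ID.

    Toy stand-in for instance segmentation: assumes separate objects
    of the same class never touch each other.
    """
    height, width = len(semantic_mask), len(semantic_mask[0])
    instance_ids = [[0] * width for _ in range(height)]  # 0 = no instance
    next_id = 0
    for row in range(height):
        for col in range(width):
            if semantic_mask[row][col] != target_class:
                continue
            if instance_ids[row][col] != 0:
                continue  # pixel already belongs to an instance
            next_id += 1
            instance_ids[row][col] = next_id
            queue = deque([(row, col)])
            # Breadth-first flood fill over neighbouring pixels of the class.
            while queue:
                r, c = queue.popleft()
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < height and 0 <= nc < width
                            and semantic_mask[nr][nc] == target_class
                            and instance_ids[nr][nc] == 0):
                        instance_ids[nr][nc] = next_id
                        queue.append((nr, nc))
    return instance_ids, next_id


# Hypothetical semantic mask: 0 = background, 1 = "person".
semantic_mask = [
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
]
instance_ids, num_instances = label_instances(semantic_mask, target_class=1)
for row in instance_ids:
    print(row)
```

The semantic mask marks all three regions with the same class label, while the instance mask gives each region its own ID.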

For the simpler video classification task, the core idea is that, given a set of video frames, we want to assign the video to one of a set of predefined classes.
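One simple way to turn per-frame predictions into a single video-level label is to average the class scores over the frames and pick the highest. The sketch below assumes some image classifier has already produced the per-frame scores; the class names, the scores, and the `classify_video` helper are made up for illustration, and practical systems often use temporal models rather than plain averaging.

```python
def classify_video(frame_scores, class_names):
    """Average per-frame class scores and return the best class.

    frame_scores: one list of class scores per frame (hypothetical
    output of a per-frame image classifier).
    """
    num_frames = len(frame_scores)
    mean_scores = [
        sum(scores[i] for scores in frame_scores) / num_frames
        for i in range(len(class_names))
    ]
    best_index = max(range(len(class_names)), key=lambda i: mean_scores[i])
    return class_names[best_index], mean_scores


# Hypothetical pre-defined classes and per-frame scores for three frames.
class_names = ["cooking", "cycling", "swimming"]
frame_scores = [
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.8, 0.1],
]
predicted_class, mean_scores = classify_video(frame_scores, class_names)
print(predicted_class)
```

The second frame on its own would be labeled "cooking", but pooling the scores over time yields "cycling" for the whole video.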

Video captioning

Video captioning is the task of generating captions for a video by understanding the actions and events in it, which in turn makes the video easier to retrieve through text. The idea here is that, given a set of video frames, we want to generate natural language that describes the concept and context of the video.

Video captioning is a multidisciplinary problem that requires algorithms from both computer vision (to extract features) and natural language processing (to map extracted features to natural language).

Video object detection (VOD)

Video object detection, which aims to detect objects in videos, was first proposed as part of the ImageNet visual challenge. Even though associating detections and assigning identities improves detection quality, this challenge is limited to spatially preserved evaluation metrics for per-frame detection and does not require joint object detection and tracking. In other words, unlike other video-level semantic tasks, it involves no joint detection, segmentation, and tracking.

The difference between image-level object detection and video object detection is that, in the video case, the machine learning model receives a time series of images, which carries temporal information that a single image lacks.

Video object tracking (VOT)

Video object tracking is the process of both localizing objects and tracking them across the video. Given an initial set of detections in the first frame, the algorithm generates a unique ID for each object at each timestamp and tries to match them across the video. For instance, if a particular object has the ID "P1" in the first frame, the model tries to assign the same ID "P1" to that object in the remaining frames.

Video object tracking tasks are generally categorized as detection-based and detection-free tracking approaches. In detection-based tracking algorithms, objects are jointly detected and tracked such that the tracking part improves the detection quality, whereas in detection-free approaches we're given an initial bounding box and try to track that object across video frames.
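The ID-matching idea behind detection-based tracking can be sketched with a greedy intersection-over-union (IoU) matcher. This is a minimal illustration, not any specific published tracker. It assumes per-frame bounding boxes are already available from some detector, and the box coordinates, the 0.3 threshold, and the `iou` and `track` helpers are hypothetical choices for the example. Practical trackers typically add motion models and more robust assignment.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def track(frames, iou_threshold=0.3):
    """Greedily propagate object IDs from frame to frame.

    frames: list of per-frame detection lists (boxes from any detector).
    Returns one {track_id: box} dict per frame.
    """
    previous = {}   # track_id -> box in the previous frame
    next_id = 1
    results = []
    for detections in frames:
        current = {}
        free_ids = set(previous)  # IDs not yet claimed in this frame
        for box in detections:
            best_id, best_overlap = None, iou_threshold
            for track_id in sorted(free_ids):
                overlap = iou(previous[track_id], box)
                if overlap > best_overlap:
                    best_id, best_overlap = track_id, overlap
            if best_id is None:
                # No sufficiently overlapping track: start a new one.
                best_id = f"P{next_id}"
                next_id += 1
            else:
                free_ids.discard(best_id)
            current[best_id] = box
        previous = current  # tracks with no match are simply dropped
        results.append(current)
    return results


# Hypothetical detections for three frames; the order flips in the last one.
frames = [
    [(10, 10, 50, 50), (100, 100, 140, 140)],
    [(12, 11, 52, 51), (103, 100, 143, 140)],
    [(105, 101, 145, 141), (15, 12, 55, 52)],
]
results = track(frames)
for frame_index, tracks in enumerate(results):
    print(frame_index, tracks)
```

Even though the detections arrive in a different order in the last frame, the object first labeled "P1" keeps that ID because its box overlaps most with the previous "P1" box.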
