Machine Learning in Video Analysis: Top Challenges and Achievements

According to Wyzowl’s survey, the average amount of video watched per week increased by 12% over the last year and now stands at 18 hours. Take YouTube: more than 500 hours of video content are uploaded every minute. As the volume of video data grows, so does the number of analysis tasks to solve, and machine learning algorithms for video analysis are also improving rapidly. This article focuses on the main challenges and achievements in video analysis.


Before we begin, let’s define some terms. A video is an ordered sequence of frames (images) of the same resolution, captured and displayed at a given frequency (frames per second). There are two types of video, each with its own approach to processing:

  • Video stream – an ongoing video for online real-time processing, where we don’t know future frames;
  • Video sequence – a video with a fixed length, where all frames are available at once, and we can process it as a whole object.
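
The difference between the two types can be sketched in a few lines of Python. This is a minimal illustration, assuming frames are NumPy arrays; the function names `running_brightness` and `normalized_brightness`, the toy clip, and the frame rate are hypothetical and chosen only for this example. The stream version may only use frames seen so far, while the sequence version can use statistics of the whole clip.

```python
import numpy as np

def running_brightness(frame_stream):
    """Stream processing: yield a causal running mean of frame brightness.

    Only frames seen so far are used, because future frames are unknown.
    """
    total, count = 0.0, 0
    for frame in frame_stream:
        total += float(frame.mean())
        count += 1
        yield total / count

def normalized_brightness(frames):
    """Sequence processing: all frames are available at once, so each frame's
    brightness can be compared against the mean of the whole clip."""
    per_frame = frames.mean(axis=(1, 2))
    return per_frame - per_frame.mean()

# Toy clip: 4 grayscale frames of 2x2 pixels with brightness 0, 10, 20, 30.
frames = np.stack([np.full((2, 2), v, dtype=np.float32) for v in (0, 10, 20, 30)])
fps = 25  # assumed frame rate; frame i is displayed at i / fps seconds

stream_result = list(running_brightness(iter(frames)))
sequence_result = normalized_brightness(frames)
timestamps = [i / fps for i in range(len(frames))]
```

The generator never looks ahead, so it would work equally well on a live camera feed, whereas `normalized_brightness` needs the full array in memory.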

Video Processing and Analysis

With the number of videos growing rapidly, they need to be processed and analyzed automatically so that the needed information can be found efficiently.

Video processing is a set of operations performed on each frame. First, decoding converts the video from a compressed format to raw frames. Next, we may perform some computation on each decoded frame. The final step is encoding, which converts the processed frame back to a compressed state. In addition, after lossy compression we may need super-resolution algorithms to reconstruct a high-resolution video. Throughout, we should keep the main criteria in mind: speed, flexibility, and accuracy. Speed optimization therefore plays a crucial role in video processing, and it relies on techniques such as parallelization, a well-designed pipeline, and splitting files into chunks.
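
The decode, compute, and encode steps, together with file splitting and parallelization, can be sketched as follows. This is a toy illustration under stated assumptions, not a production pipeline: `zlib` stands in for a real video codec, the frames are tiny NumPy arrays, and all function names are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

import numpy as np

FRAME_SHAPE = (4, 4)  # assumed tiny grayscale frames for the example

def decode(packet):
    """Compressed bytes -> raw frame (zlib stands in for a real video codec)."""
    raw = zlib.decompress(packet)
    return np.frombuffer(raw, dtype=np.uint8).reshape(FRAME_SHAPE)

def process(frame):
    """Per-frame computation; here a simple brightness inversion."""
    return 255 - frame

def encode(frame):
    """Raw frame -> compressed bytes."""
    return zlib.compress(frame.tobytes())

def process_chunk(packets):
    """Run decode -> process -> encode on one chunk of the video."""
    return [encode(process(decode(p))) for p in packets]

def process_video(packets, chunk_size=2, workers=2):
    """Split the video into chunks and process the chunks in parallel."""
    chunks = [packets[i:i + chunk_size] for i in range(0, len(packets), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(process_chunk, chunks)  # map preserves chunk order
    return [packet for chunk in results for packet in chunk]

# Build a toy "compressed video" of 5 frames and run the pipeline.
original = [np.full(FRAME_SHAPE, v, dtype=np.uint8) for v in (0, 50, 100, 150, 200)]
compressed = [encode(f) for f in original]
output = process_video(compressed)
```

Each toy frame is compressed independently here, so chunks can be cut anywhere. With real codecs, frames usually depend on their neighbors, so chunk boundaries generally have to fall on key frames.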

Machine learning is quite helpful in video compression. We can use convolutional neural networks for codec improvements, complexity reduction, reconstruction of features lost during compression, and post-processing. Another technique is to apply quantization to an auto-encoder bottleneck to learn meaningful latent representations and obtain a compressed image. Other examples include redundancy reduction, which uses neighboring blocks with the same appearance to reduce detail in an image without quality loss, and frame extrapolation, which improves the frame rate. A well-known example is Netflix, which uses machine learning to improve streaming quality through network quality characterization and prediction, video quality adaptation during playback, predictive caching, and device anomaly detection.
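
To make the bottleneck quantization idea more concrete, here is a minimal sketch of uniform scalar quantization applied to a latent vector. It assumes the encoder output lies roughly in [-1, 1] and uses 16 levels; the range, the number of levels, and the function names are illustrative assumptions rather than values from any specific codec.

```python
import numpy as np

def quantize(latent, num_levels=16, low=-1.0, high=1.0):
    """Map each latent value to the index of the nearest of num_levels
    evenly spaced values in [low, high]."""
    clipped = np.clip(latent, low, high)
    step = (high - low) / (num_levels - 1)
    return np.round((clipped - low) / step).astype(np.int64)

def dequantize(indices, num_levels=16, low=-1.0, high=1.0):
    """Map integer codes back to approximate latent values."""
    step = (high - low) / (num_levels - 1)
    return low + indices * step

latent = np.array([-1.0, -0.33, 0.0, 0.41, 1.0])  # stand-in for an encoder output
codes = quantize(latent)       # small integers that are cheap to store
restored = dequantize(codes)   # what a decoder would receive
max_error = np.abs(restored - latent).max()
```

With 16 levels each code fits in 4 bits, and the reconstruction error per value is at most half a quantization step. Rounding has zero gradient almost everywhere, so trainable systems typically replace it during training with an approximation such as additive noise or a straight-through estimator.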

During video analysis, our goal is to automatically extract information about objects and events in the video. Typical tasks include object detection, identification of object properties (human pose estimation, attribute identification), and identification and localization of people’s actions and events. Video summarization with deep learning, which aims to select the essential parts of a video, is one of the more difficult video analysis tasks.

Video Summarization

With the number of videos skyrocketing, video summarization has come to the forefront for researchers and businesses alike. The goal is to quickly extract the relevant information from a video by selecting its most important and meaningful parts. Let’s talk about the main challenges and recent achievements of deep learning video summarization.


Video summarization faces the same problem as most deep learning tasks: it requires a vast amount of data. Collecting summarization labels is time-consuming, and a small dataset is not sufficient. Available datasets contain only certain types of videos, so models perform poorly on videos from other categories. To deal with this, we can apply unsupervised, semi-supervised, or multi-task learning. Computational hardware requirements and development complexity are also well-known problems.

Because video summarization relies on users’ feedback, subjectivity is one of the main challenges: different people may consider different parts of the same video important. Depending on the specific business case, this can be addressed with personalization and a content-based recommender system. If several ground-truth summaries of a video are available, a practical solution is to learn from multiple losses, each capturing a different characteristic of a summary (summary length, closeness to the original video, etc.).

Recent Achievements

1. “Unsupervised Video Summarization with a Convolutional Attentive Adversarial Network”
The researchers build a summarizer in an unsupervised way to avoid expensive human annotation. The network consists of a generator and a discriminator that are trained against each other, which is what “adversarial” means. The generator predicts an importance score for each frame in the video. The discriminator tries to distinguish the original frame features from the same features weighted by the importance scores. The goal is to retain the semantic information of the original video in the summary.

One of the challenges is capturing temporal relationships in the video, so the generator combines a fully convolutional sequence network (FCSN) with a self-attention module. The FCSN extracts a global representation, and the self-attention mechanism captures dependencies between frames. Self-attention reduces to fast matrix computations and retains the dependencies between frames, which is why it has become prevalent in video summarization. The final step outputs normalized importance scores.
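
As a rough illustration of how self-attention relates every frame to every other frame with a few matrix products, here is a minimal NumPy sketch of scaled dot-product self-attention over per-frame features. It is a generic formulation, not the paper's exact module: the random weights, the feature sizes, and the sigmoid scoring head are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(features, w_q, w_k, w_v):
    """Scaled dot-product self-attention.

    features: (num_frames, dim) array of per-frame features.
    Returns the attended features and the (num_frames, num_frames) weights.
    """
    q, k, v = features @ w_q, features @ w_k, features @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # similarity of every frame pair
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
num_frames, dim = 6, 8
features = rng.normal(size=(num_frames, dim))  # stand-in for CNN frame features
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))
attended, weights = self_attention(features, w_q, w_k, w_v)

# Hypothetical scoring head: one linear layer followed by a sigmoid, which
# keeps each frame's importance score in (0, 1).
w_out = rng.normal(size=(dim,)) * 0.1
importance = 1.0 / (1.0 + np.exp(-(attended @ w_out)))
```

Row `i` of `weights` shows how much frame `i` attends to every other frame, so distant frames can influence each other in a single step.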

An LSTM-based discriminator provides a signal for training by distinguishing the raw frame features from the importance score weighted ones.

To learn the parameters, an objective function is composed of three losses:

  • Adversarial loss – to train the generator against the discriminator;
  • Sparsity loss – to limit the number of selected key frames;
  • Reconstruction loss – to keep the summary semantically close to the original video.
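
The three losses listed above can be sketched in generic form as follows. These are common textbook formulations chosen for illustration, not the exact definitions from the paper; the target ratio of key frames, the loss weights, and all function names are assumptions.

```python
import numpy as np

def sparsity_loss(scores, target_ratio=0.3):
    """Penalize a mean importance score that deviates from the desired
    fraction of key frames (target_ratio is an assumed hyperparameter)."""
    return abs(float(scores.mean()) - target_ratio)

def reconstruction_loss(features, scores):
    """Mean squared error between the original frame features and the
    same features weighted by the importance scores."""
    weighted = scores[:, None] * features
    return float(np.mean((features - weighted) ** 2))

def adversarial_loss(disc_prob_weighted, eps=1e-8):
    """Generator side of the adversarial game: low when the discriminator
    assigns a high 'original' probability to the weighted features."""
    return float(-np.mean(np.log(disc_prob_weighted + eps)))

def generator_objective(features, scores, disc_prob_weighted,
                        w_adv=1.0, w_sparse=1.0, w_rec=1.0):
    """Weighted sum of the three losses (the weights are assumptions)."""
    return (w_adv * adversarial_loss(disc_prob_weighted)
            + w_sparse * sparsity_loss(scores)
            + w_rec * reconstruction_loss(features, scores))

features = np.ones((4, 2))                # 4 frames with 2-dimensional features
scores = np.array([1.0, 0.0, 1.0, 0.0])   # importance score per frame
disc_prob = np.array([0.5])               # assumed discriminator output
loss = generator_objective(features, scores, disc_prob)
```

The sparsity and reconstruction terms pull in opposite directions: keeping every frame reconstructs the video perfectly but is not a summary, while keeping nothing is maximally sparse but loses all content.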

The proposed method was evaluated on two popular datasets for video summarization: TVSum and SumMe. It achieved better results than existing state-of-the-art unsupervised approaches and has beaten several supervised methods as well. In addition to this, it is applicable to video capture and image text recognition tasks. 

Figure: the network architecture of the given approach.

2. “Video Summarization through Reinforcement Learning with a 3D Spatio-Temporal U-Net”

This research is interesting because it applies reinforcement learning to video summarization. In contrast to the paper described above, the study works with medical videos. It investigates whether 3D spatio-temporal CNN features are better suited to representation learning than the more common 2D image features.

Using reinforcement learning (RL) is nothing new in video analysis tasks: it has proven successful in visual tracking, facial recognition, video captioning and object segmentation. 

While the FCSN (used in the previous paper) takes only 2D CNN features for each frame and applies 1-dimensional convolutions, this paper uses spatio-temporal features and full 3D convolutions. Since U-Net is one of the most popular fully convolutional networks (FCNs) and works effectively for semantic image feature extraction, the final approach in the paper is a 3D spatio-temporal U-Net (3DST-UNet).

The role of the RL agent here is to learn a policy that maximizes the reward received for its actions. The action in our case is to either select the current frame as a key frame or discard it.
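
The select-or-discard decision can be sketched as a Bernoulli policy trained with a REINFORCE-style update. This is a generic policy-gradient illustration, not the paper's training procedure: the toy reward (which only prefers keeping about 30% of the frames), the learning rate, and all names are hypothetical.

```python
import numpy as np

def sample_actions(probs, rng):
    """Bernoulli policy: action 1 keeps the frame as a key frame, 0 discards it."""
    return (rng.random(probs.shape) < probs).astype(np.int64)

def toy_reward(actions, target_ratio=0.3):
    """Hypothetical reward: highest when about target_ratio of frames are kept."""
    return 1.0 - abs(float(actions.mean()) - target_ratio)

def reinforce_gradient(logits, rng, num_episodes=64):
    """Monte Carlo REINFORCE estimate of d(expected reward) / d(logits)."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    episodes = [sample_actions(probs, rng) for _ in range(num_episodes)]
    rewards = np.array([toy_reward(a) for a in episodes])
    baseline = rewards.mean()  # subtracting a baseline reduces variance
    # For a Bernoulli policy with sigmoid logits,
    # d log pi(action) / d logit = action - prob.
    grads = [(a - probs) * (r - baseline) for a, r in zip(episodes, rewards)]
    return np.mean(grads, axis=0)

rng = np.random.default_rng(0)
logits = np.zeros(10)  # start by keeping each of 10 frames with probability 0.5
learning_rate = 1.0    # assumed step size
for _ in range(200):
    logits += learning_rate * reinforce_gradient(logits, rng)  # gradient ascent
keep_probs = 1.0 / (1.0 + np.exp(-logits))
```

Each update nudges the keep probabilities toward selections that earned an above-average reward. A real summarization reward would measure properties of the chosen frames, such as how well they represent the video, rather than just their count.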

Figure: how the 3DST-UNet-RL method works.

On both popular benchmarks, the CAAN model from the first paper achieved a higher F1-score than 3DST-UNet-RL.

Model          SumMe    TVSum
3DST-UNet-RL   58.1%    44.6%
CAAN           59.58%   50.81%

F1-scores for 3DST-UNet-RL and CAAN models on SumMe and TVSum benchmarks
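
For readers unfamiliar with the metric, the sketch below shows how an F1-score between a predicted summary and human-made summaries can be computed. It assumes summaries are binary per-frame selections of equal length. Averaging over annotators versus taking the best match is a choice that differs between benchmarks in the literature; it is offered here as an option, not as the exact protocol of the two papers.

```python
import numpy as np

def f1_score(pred, truth):
    """F1 between two binary key-frame selections of equal length."""
    pred, truth = np.asarray(pred, dtype=bool), np.asarray(truth, dtype=bool)
    overlap = np.logical_and(pred, truth).sum()
    if overlap == 0:
        return 0.0
    precision = overlap / pred.sum()   # share of selected frames that match
    recall = overlap / truth.sum()     # share of reference frames recovered
    return float(2 * precision * recall / (precision + recall))

def score_against_users(pred, user_summaries, reduce="avg"):
    """Compare one prediction with several human summaries and either
    average the F1-scores ("avg") or keep the best one ("max")."""
    scores = [f1_score(pred, u) for u in user_summaries]
    return max(scores) if reduce == "max" else sum(scores) / len(scores)

pred = [1, 1, 0, 0, 1, 0]             # predicted key frames
users = [[1, 1, 0, 0, 0, 0],          # summary from annotator 1
         [0, 1, 0, 0, 1, 1]]          # summary from annotator 2
avg_f1 = score_against_users(pred, users, "avg")
max_f1 = score_against_users(pred, users, "max")
```

Comparing against several annotators is also one way to soften the subjectivity problem: a prediction is not punished for disagreeing with a single person's taste.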

Business Use Cases

The range of applications is vast, especially for media organizations: summarization allows them to index, retrieve, and promote video content more effectively. It is also an excellent fit for personalized content recommendation systems, trailer generation in the entertainment industry, and highlights of important moments at events or in sports. In the security domain, the main application is retrieving the main activities from the last 24 hours of surveillance camera recordings, for time-efficient progress monitoring or security purposes.

The second paper has great potential in the medical field: it reduces the storage costs of ultrasound screening videos and can be helpful when browsing patient video data in a brief format that keeps only the crucial information.

Since video summarization is quite challenging, only a few products use this technique so far. The most common use case is YouTube video summarization, with products such as vdNote and Summary Panda. Another case is the summarization of video conversations, for example Spoke Enterprise. Summarization is also being actively implemented in surveillance systems, and Huawei offers an API for video clip summaries and video cover summaries.

Video Understanding


  “Long-Form Video Understanding”

TV shows, movies, YouTube tutorials, and educational lectures are all examples of long-form videos. We can call a video long-form if it lasts more than 10 minutes and tells a specific story with a beginning, middle, and end. These parts depend on one another, which means we can fully understand the context of the current scene only by relying on the previous ones. We need to watch a few minutes of video to grasp the relationships between the characters and understand the reasons for someone’s actions.

This is a challenge for modern visual systems, which work accurately on short videos but recognize only specific objects, places, shapes, and short actions or events. They cannot connect past, present, and future, and they operate on only a limited time window of the video. It is therefore hard for them to contextualize and “see” the full picture of what is happening. The first reason is that short-term models descend from image-based architectures, for which many top-performing state-of-the-art models exist. The second reason is that not all video recognition tasks require long-term reasoning (human motion recognition, for example, does not).

Figure: the difference between short-form and long-form video understanding.

Recent Achievements

“Towards Long-Form Video Understanding”

This recent paper explores the challenge of long-form video understanding. 

First, the model builds on recent advances in image-based architectures that provide high-performance object detection and tracking throughout the video. Existing short-term models effectively parse many valuable aspects of the video. This stage yields detected object instances, each tracked over an arbitrary length of time and described by visual features, which we can treat as the basic semantic elements. The paper uses object detection with categories and bounding boxes for each object, atomic action detection, instance tracking across time, shot transition detection, and other methods.

After that, the transformer-based architecture captures the complex interactions between these elements across time in order to preserve the context. It is worth pointing out that the model does not operate on frames, unlike the “frame-as-a-word” and “video-as-a-sentence” approaches of some recent research, where the goal was to capture the interaction between adjacent frames. Instead, it builds and uses object-centric representations. This design was inspired by the representation of instances as space-time trajectories, which is common in previous papers, but it is adapted to model inter-instance (inter-object) relationships. This is the key idea and novelty of the study and the reason the architecture is called Object Transformers.
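
The object-centric idea can be illustrated with a small data-structure sketch: detections are grouped by tracked instance into space-time trajectories, and those trajectories, rather than whole frames, become the tokens a transformer would attend over. The `Detection` fields, the token layout, and the function names are assumptions for illustration and do not reproduce the paper's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Detection:
    frame: int       # frame index where the object was detected
    track_id: int    # identity assigned by an (assumed) upstream tracker
    box: tuple       # (x1, y1, x2, y2) bounding box
    feature: tuple   # visual feature vector from an (assumed) detector

def build_trajectories(detections):
    """Group detections by tracked instance and order each group by time."""
    tracks = defaultdict(list)
    for det in detections:
        tracks[det.track_id].append(det)
    return {tid: sorted(dets, key=lambda d: d.frame) for tid, dets in tracks.items()}

def to_tokens(trajectories):
    """Flatten trajectories into (track_id, frame, box + feature) tokens,
    ordered by instance and then by time."""
    return [(tid, d.frame, d.box + d.feature)
            for tid, dets in sorted(trajectories.items())
            for d in dets]

detections = [
    Detection(frame=2, track_id=0, box=(1, 1, 5, 5), feature=(0.2, 0.8)),
    Detection(frame=0, track_id=0, box=(0, 0, 4, 4), feature=(0.1, 0.9)),
    Detection(frame=1, track_id=1, box=(6, 6, 9, 9), feature=(0.7, 0.3)),
]
trajectories = build_trajectories(detections)
tokens = to_tokens(trajectories)
```

Because every token carries an instance identity and a time index, attention between tokens can relate one object to another across long time spans instead of only relating adjacent frames.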

In the paper’s experiments, the model outperformed existing state-of-the-art methods for long-form tasks on existing datasets. It was evaluated on content understanding (relationship, speaking style, place), user engagement prediction (like ratio and popularity on YouTube), and movie metadata prediction (director, genre, writer, release year).

Business Use Cases

This work is a significant step forward in long-form video understanding and in visual understanding in general. It can be helpful in entertainment, education, retail, sports, surveillance, etc. YouTubers might use such an algorithm to evaluate and predict the popularity of their videos and improve them. The method can also be used for faster content navigation, video summarization, evaluation of customer satisfaction, recommender systems, video question answering, and more.

Valossa is an example of a complex video understanding system: it generates automatic video previews, monitors content, recognizes faces and emotions, extracts scene metadata, and analyzes brand mentions.


Research in video analysis is actively underway, and new neural network architectures and approaches appear every month. New models are increasingly learning to cope with existing challenges and limitations.

This article discussed the challenges of video summarization and long-form video understanding, as well as recent achievements that outperformed existing state-of-the-art models. These results are very encouraging and can be applied in various domains: entertainment, education, smart cities, security, e-commerce, etc.


What is video understanding?

Video understanding is a general term for tasks that aim to localize and detect objects, actions, and events in a video and to capture their relationships and temporal dependencies. It includes video classification, video summarization, visual object tracking, and action classification, which help capture the general context and speed up video indexing, retrieval, and content navigation.

What is intelligent video analytics?

The main goal of an intelligent video analytics system is to analyze video streams or recorded sequences, track and process information about objects and actions, and identify locations. Using analytics, users can search for specific events and evaluate them. For surveillance systems, the main tasks of intelligent video analytics are monitoring anomalous activity and sending alarm notifications when suspicious actions occur.