Top AI Achievements of 2025 (part 2)

You might have already noticed how dynamic the world is. New cool things appear in our lives every day, while many familiar things become obsolete. The same happens in the world of artificial intelligence. New approaches, architectures, models – all of these are contributing to the progress of AI.

It is usually quite challenging to track all the inventions. In this article, we have gathered all the most important ones, briefly describing their significance. We have already published a similar article about the innovations of the first quarter of 2021, so we will concentrate on everything that appeared during the second quarter. And although there were no such significant breakthroughs like CLIP or DALL-E during these few months, there is still a lot of exciting stuff going on.

Without further ado, let us begin!

Table of Contents:

DINO and PAWS
Perceiver
HuBERT
Honorable Mentions
Conclusion

DINO and PAWS

DINO and PAWS are two new methods for model training, simultaneously developed and released by Facebook AI. DINO allows training Vision Transformers (ViT) with no supervision. This is an incredible combination of two of the most promising recent innovations: self-supervised learning and transformers. It can discover and segment objects in an image or a video with no supervision or segmentation-targeted objective. PAWS is a semi-supervised approach that optimizes model training and produces state-of-the-art results using much less computing. DINO and PAWS combined can significantly enhance computer vision systems, making them more efficient and less dependent on labeled data.

From left to right: original image, supervision model segmentation example, DINO segmentation example (source)

How Does It Work?

To implement self-supervision, DINO utilizes the concept of self-distillation but with no labels at all. Basically, there are two identical networks, one defined as a student and one as a teacher. Both these networks take the same image as input but in different vector representations. The teacher gets the global idea of an image obtained from two partially overlapped patches of great dimensions. The student also receives a local representation of an image acquired by a series of smaller patches.

From the image of the two kittens above, two global views and several local views are extracted (source)

During the training, the student matches local views to the global ones, thus trying to understand that they represent the same image. The teacher network then performs classification based only on the global views while trying to match the output obtained by the student network.

Unlike DINO, PAWS needs a small amount of labeled data. Given an unlabeled training image, several views are generated using random data augmentations and transformations. The neural network is trained to make the representations of these views as similar as possible. The algorithm then uses a random subsample of labeled images to assign a pseudo-label to the unlabeled views. This assignment is conducted by comparing the representations of the unlabeled views and labeled samples. Finally, the model is updated by minimizing a standard classification loss between the pseudo-labels of pairs of views of the same unlabeled image. For more interesting examples refer to computer vision development services.

Why is It Innovative?

Training a computer vision model used to require a lot of time, labeled data, and computing power until this work came along. Algorithms proposed by Facebook AI solve all these problems. The process of self-supervised learning implemented in DINO allows us to train highly accurate models with unlabeled data. At the same time, PAWS dramatically reduces training time using a small set of labeled examples. The latter has also solved a common issue for self-supervised methods – collapsing representations, where all images get mapped to the same representation.

Use Cases:

Both DINO and PAWS greatly enhance the process of model training, making many computer vision tasks easier and the results more accurate. For example, semantic image segmentation services facilitate the tasks from swapping out the background of a video chat to teaching robots to navigate the environment. This is one of the most challenging tasks in computer vision which requires a deep understanding of an image. It is usually achieved by supervised learning. DINO offers an alternative – self-supervised learning. And best of all – the accuracy does not suffer!

k-NN classification accuracy for various models (source)

PAWS, on the other hand, is focused on efficiency rather than performance. It allows training state-of-the-art computer vision models without large-scale computing resources. For comparison, a standard ResNet-50 model needs only one percent of the labels in ImageNet when trained with PAWS. Additionally, it needs 10x fewer steps to reach the same accuracy as previous models. PAWS is an excellent approach for domains where there are few annotated images, like medical imaging.

Final words:

DINO represents a new level in image understanding and segmentation. Not only does it not need labeled data, but it also produces easily interpretable features. Interestingly, this algorithm is also one of the best at identifying image copies, even though it was not trained to do that.

DINO outperformed two current state-of-the-art models for near-duplicate detection, reaching an accuracy of 96.4% on Flickr 100M dataset (source)

With DINO and PAWS, it is possible to build computer vision models that don’t need a lot of labeled data or computing power. These algorithms are also a massive step towards the further development of self-supervised systems and vision transformers. They open a bunch of new possibilities and approaches.

Perceiver

Perceiver is a new state-of-the-art transformer-based model developed by DeepMind. Its uniqueness lies in the fact that it can work with multimodal data. Just like the human brain can simultaneously analyze data received from all our sense organs, Perceiver receives and processes data in different formats. It is a new approach and one more step towards real artificial intelligence.