Top AI Achievements of 2021 (So Far)

The 21st century is a dynamic, fast-changing era, and technology is evolving at enormous speed. Researchers are working harder than ever, and new results appear all the time. Even though we are only four months into the year, there have already been several fascinating breakthroughs. This article describes the most impressive innovations in artificial intelligence and related fields from the first months of 2021.

Without further ado, let’s begin!


DALL-E

DALL-E is a neural network that creates images from text captions. OpenAI announced it at the beginning of 2021, and it immediately became one of the year’s breakthroughs in computer vision. DALL-E is a smaller version of GPT-3 (though hardly small: it has 12 billion parameters against GPT-3’s 175 billion), and it shows equally impressive results. What sets DALL-E apart from earlier models is its zero-shot performance: it can generate high-quality images of abstract, unreal, and even absurd objects without being trained specifically for them.

Examples of images generated by DALL-E (source)

How does it work? DALL-E is a transformer language model trained on 250 million text-image pairs collected from the Internet. It is a 12-billion-parameter autoregressive transformer based on GPT-3. The model receives its input as a single sequence of tokens (tokens are simply symbols from a discrete vocabulary) and is trained to maximize the likelihood of generating those tokens one after another. Training is divided into two stages. In the first stage, each image is compressed into a 32×32 grid of image tokens, that is, a sequence of 1024 tokens. The purpose of this step is to shorten the sequence the transformer has to process without a considerable loss in image quality. In the second stage, up to 256 BPE-encoded text tokens are concatenated with the image tokens, and an autoregressive transformer is trained on the combined sequence.
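
The token arithmetic above can be sketched in a few lines of Python. This is only an illustration of the sizes quoted in the text, not OpenAI's code: the 256×256 RGB input resolution comes from the DALL-E paper rather than from this article, and the function names, the padding token, and the token ids are made up for the example.

```python
# Sizes quoted in the text above.
IMAGE_GRID = 32          # the image becomes a 32x32 grid of tokens
MAX_TEXT_TOKENS = 256    # up to 256 BPE-encoded text tokens

# Assumption taken from the DALL-E paper, not from this article:
# images are 256x256 RGB before compression.
IMAGE_SIDE = 256
CHANNELS = 3


def build_sequence(text_tokens, image_tokens, pad_token=0):
    """Concatenate padded text tokens with image tokens into one stream.

    pad_token is a hypothetical placeholder id used only in this sketch.
    """
    if len(text_tokens) > MAX_TEXT_TOKENS:
        raise ValueError("caption is longer than 256 tokens")
    if len(image_tokens) != IMAGE_GRID * IMAGE_GRID:
        raise ValueError("expected a 32x32 grid of image tokens")
    padding = [pad_token] * (MAX_TEXT_TOKENS - len(text_tokens))
    return list(text_tokens) + padding + list(image_tokens)


def compression_factor():
    """How many raw pixel values each image token stands in for."""
    raw_values = IMAGE_SIDE * IMAGE_SIDE * CHANNELS
    return raw_values // (IMAGE_GRID * IMAGE_GRID)


# Made-up token ids, just to exercise the functions.
sequence = build_sequence([5, 17, 42], list(range(1024)))
print(len(sequence))         # 1280
print(compression_factor())  # 192
```

Under these assumptions, the transformer sees 1280 tokens per example instead of nearly 200,000 raw pixel values, which is what makes autoregressive training on images practical.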

Why is it innovative? Unlike earlier generative models, DALL-E generalizes remarkably well to text prompts it has never seen. It can produce anthropomorphized versions of animals and objects, combine completely unrelated concepts, and apply transformations to existing images. DALL-E can modify an object’s attributes, control how many times the object appears in the image, and render it from different perspectives, even in three-dimensional space. It can even draw an object’s internal structure, which requires deep knowledge of the object itself and was not previously possible without task-specific training. Interestingly, DALL-E is not based on GANs, the most common approach to image generation, which makes it a new and highly promising direction.

A cross-section view of an engine (source)

Use Cases: Because of its performance, DALL-E has many potential use cases. It could be used in advertising, publishing, journalism, social media, and similar fields to create all types of illustrations, from mock-ups to complete designs. DALL-E might even ease copyright concerns, as it can create unique images for any purpose, whether a school project, a music album cover, or an advertising campaign. Finally, the model can generate design ideas: instead of drawing sketches, you can describe your idea in text and choose the best option DALL-E produces. For example, if you were a furniture designer, wouldn’t you find some of these armchairs in the shape of a Rubik’s cube awesome?

An armchair imitating a Rubik’s cube (source)

Final words: DALL-E, being a version of GPT-3, shows similarly state-of-the-art results. It is likely to be used across different industries and can help (if not replace) artists and designers. Even though many of the images generated by DALL-E might not be 100% accurate, the results are still impressive, and CLIP (described below) can help choose the best ones. We can expect the model to improve and expand its capabilities soon.


CLIP

CLIP (Contrastive Language-Image Pre-training) is another neural network developed by OpenAI. It provides a powerful bridge between computer vision and natural language processing and opens many new opportunities. CLIP was designed to address some of the main problems in computer vision, including labor-intensive and costly dataset labeling, low adaptability to new tasks, and poor real-world performance. Unlike traditional classification models, CLIP does not assign an image to one of a fixed set of classes; instead, it picks the most appropriate text description for the image. This approach makes it more flexible and more robust on non-standard datasets. CLIP outperformed the best publicly available ImageNet model on 20 out of the 26 datasets tested. Particularly interesting is its result on an adversarial dataset explicitly designed to confuse AI models: 77.1% accuracy, which shows that CLIP is robust to previously unseen images.

Comparison of ImageNet and CLIP on various datasets (source)

How does it work? CLIP was trained on 400 million image-text pairs: images found on the Internet together with their captions, for example, a picture of a puppy sitting on the grass paired with the phrase “A photo of a puppy on the grass.” The model consists of two sub-models: a text encoder and an image encoder. Each converts its input into a vector in a shared mathematical space, and these vector representations let us measure how close a text and an image are. Ideally, the phrase “A photo of a puppy on the grass” should be as close as possible to the picture of a puppy sitting on the grass. This is precisely what the training phase enforces: the model tries to maximize the similarity between texts and their corresponding images. For a batch of pairs, this yields a similarity matrix with matching texts and pictures on the main diagonal (their similarity should be as high as possible) and unrelated texts and images elsewhere (their similarity should be as low as possible). Once the model is trained, we can give it an image and a set of candidate texts, and it returns the text that describes the image best.
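
The similarity matrix and the "high on the diagonal, low elsewhere" objective can be sketched as follows. This is a minimal illustration in plain Python, not CLIP's actual implementation: the embeddings are made-up three-dimensional vectors standing in for encoder outputs, the symmetric cross-entropy form follows the pseudocode in the CLIP paper, and the temperature value of 0.07 is an assumption.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_matrix(image_embeddings, text_embeddings):
    """Entry [i][j] is the cosine similarity of image i and text j."""
    images = [normalize(v) for v in image_embeddings]
    texts = [normalize(v) for v in text_embeddings]
    return [[sum(a * b for a, b in zip(img, txt)) for txt in texts]
            for img in images]


def cross_entropy(logits, target_index):
    """Negative log-softmax of the target entry (numerically stable)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target_index]


def contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric loss that is low when matching pairs dominate the diagonal."""
    sims = similarity_matrix(image_embeddings, text_embeddings)
    n = len(sims)
    logits = [[s / temperature for s in row] for row in sims]
    # Image -> text: each row should peak on its own caption.
    loss_images = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text -> image: each column should peak on its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_texts = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (loss_images + loss_texts) / 2


# Made-up embeddings for three image-caption pairs.
aligned_images = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
aligned_texts = [[0.9, 0.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled_texts = [aligned_texts[1], aligned_texts[2], aligned_texts[0]]

print(contrastive_loss(aligned_images, aligned_texts))   # small loss
print(contrastive_loss(aligned_images, shuffled_texts))  # much larger loss
```

When the captions are shuffled so that they no longer match their images, the diagonal entries drop and the loss rises sharply, which is the signal that pushes the two encoders toward a shared space during training.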

CLIP’s pretraining phase (source)

Why is it innovative? CLIP differs from other computer vision models in several ways. First, it does not require a purpose-built labeled dataset, as it is trained on a massive number of images and captions taken from the Internet, so no additional labeling effort is needed. Second, CLIP has impressive zero-shot capabilities: it can perform well on tasks it has never seen before. It uses the semantics extracted from the text to add meaning to the images, which makes it possible to use CLIP for many different purposes without fine-tuning. This sets it apart from the usual classification model, which simply enumerates its classes and ignores the information contained in the text. Lastly, CLIP learns from unfiltered and noisy data, which adds to its flexibility and robustness and improves its accuracy on real-world data.
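
Zero-shot classification with text prompts can be sketched like this. It is a toy illustration only: the lookup table stands in for a trained text encoder, the image vector stands in for the output of an image encoder, and all names and numbers are invented for the example. The prompt template mirrors the “A photo of a puppy” style of caption mentioned above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, labels, embed_text):
    """Pick the label whose prompt embedding is closest to the image."""
    prompts = [f"A photo of a {label}" for label in labels]
    scores = [cosine(image_embedding, embed_text(p)) for p in prompts]
    best = max(range(len(labels)), key=lambda i: scores[i])
    return labels[best], scores


# Hypothetical stand-in for a trained text encoder: made-up vectors.
FAKE_TEXT_EMBEDDINGS = {
    "A photo of a puppy": [0.9, 0.1, 0.0],
    "A photo of a kitten": [0.1, 0.9, 0.0],
    "A photo of a car": [0.0, 0.1, 0.9],
}


def fake_embed_text(prompt):
    return FAKE_TEXT_EMBEDDINGS[prompt]


# Pretend this came from the image encoder for a photo of a puppy.
fake_image_embedding = [0.8, 0.2, 0.1]

label, scores = zero_shot_classify(
    fake_image_embedding, ["puppy", "kitten", "car"], fake_embed_text
)
print(label)  # puppy
```

Adding a new class here only means adding a new label string; nothing is retrained, which is the practical meaning of "zero-shot" in this setting.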

Use Cases: Even though only four months have passed since the release of CLIP, several exciting applications already exist. One project uses CLIP to create a text-based interface for another model (StyleGAN), which, in turn, can generate and manipulate images; the result can be, for example, a picture of a celebrity with a different hairstyle. Another project helps you find a particular moment in a video: you just enter a description of the moment and provide the video link. CLIP can also play the role of a Pictionary judge, deciding how similar a picture is to a phrase. Finally, CLIP can be used to rerank the output of DALL-E (described above): among the many images DALL-E generates, CLIP can choose the ones that match the prompt best, creating a robust image generation pipeline.

The result of StyleGAN+CLIP image manipulation (source)

Final words: Despite its power and adaptability, CLIP has its limitations. It struggles with more abstract or systematic tasks (for example, counting the objects in a picture) and has difficulties with fine-grained classification (such as determining a car model). However, CLIP also opens a wide variety of new possibilities and approaches. Not long ago, OpenAI researchers discovered multimodal neurons in CLIP that respond to the same concept whether it appears as a photo, a drawing, or text. This is a significant step towards understanding the associations and biases that exist in such models. We can expect further developments and new achievements in this area.


SEER

SEER (SElf-supERvised) is a self-supervised, billion-parameter computer vision model developed by Facebook AI. It can learn from any random set of images found on the Internet without preprocessing or labeling, which distinguishes it from most earlier computer vision models. SEER was pre-trained on a billion random, publicly available Instagram images and reached 84.2% top-1 accuracy on ImageNet, exceeding prior self-supervised results. The model has also demonstrated state-of-the-art performance on various tasks, including object detection, segmentation, and image classification.

Comparison of SEER and other algorithms accuracy on ImageNet (source)

How does it work? Two major components are required for SEER to work efficiently with complex images. The first is an algorithm that can learn from a large number of pictures without labels. SEER uses a recent algorithm called SwAV for this purpose, which quickly clusters images showing related visual concepts and makes use of their similarities. The second component is a convolutional neural network that can handle large and complex data without losing accuracy. RegNet models fit this need well: they can scale to billions of parameters while remaining accurate. In addition, Facebook AI released the all-purpose self-supervised learning library that was used to develop SEER.

Two components of SEER (source)

Why is it innovative? SEER has shown self-supervised learning to be a powerful instrument in AI development. It lets us use the data that already exists in the world rather than preparing it specifically, which opens new possibilities for future AI research and allows for much more efficient use of real-world data. Training models on real-life data improves their accuracy and ability to generalize, while saving the time and money that would otherwise go into manual data preparation and labeling. Apart from that, self-supervised learning mitigates the biases that might arise during data annotation.

Use Cases: Facebook believes it can use SEER on its platforms for safety purposes, such as rapidly identifying and removing hateful or racist images. Aside from that, the model could be used to automatically generate image descriptions and to better categorize items for sale. Finally, SEER’s efficiency and speed could make it useful for medical purposes, including disease diagnosis.

Final words: SEER is one step closer to AI that is capable of understanding the visual world. Research in self-supervised learning continues, and SEER is only one of the innovations to come.

Honorable Mentions

Apart from DALL-E, CLIP, and SEER, there were other innovations in 2021. They might not be as significant as these models, but they are still shaping the future of AI, so it is important to know about them as well.


Stretch

Stretch is a new multi-purpose robot designed specifically for warehouses and distribution centers. It was developed by Boston Dynamics, a global leader in mobile robotics. Stretch is aimed at improving efficiency and safety in the logistics industry. It is equipped with a custom-designed arm and a smart gripper with sensors to handle various boxes and cases. Stretch also uses computer vision technology to identify packages and navigate the space around it. This robot is likely to replace workers in dangerous and physically demanding activities, increase the flow of goods, and lower automation costs.

Boston Dynamics’ Stretch (source)

Image Restoration

Scientists from IIT Madras have developed a new approach to image restoration that divides the process into two stages. First, one neural network localizes the degraded part of the image. Then, another neural network restores the image. It turns out that solving the auxiliary task of predicting a degradation mask gives the network layers a crucial localizing ability. This ability is then transferred to the second network using attentive knowledge distillation, a mechanism for transferring knowledge from a large model to a smaller one. This additional knowledge helps the algorithm focus on refining the degraded regions. The approach improves image quality significantly and has several advantages. First, it simplifies data preprocessing in computer vision tasks by diminishing the effect of rain, snow, and haze on image quality. Second, it can improve various computer vision systems, such as self-driving cars and surveillance cameras, and may prevent many critical errors in them. Finally, the algorithm simply helps us keep our memories in high quality, with no more regrets over a bad picture.

Image restoration in action (source)

Switch Transformer

Google Brain has recently published a paper about the Switch Transformer, a new enormous transformer-based language model with up to 1.6 trillion parameters. It is significantly larger than other models (including GPT-3, which has only 175 billion parameters), yet, according to its authors, it pre-trains faster than comparable dense models. The intuition behind the model is that not all of its knowledge is relevant to a specific input. For example, technical documentation is not particularly useful when working with poetry. Therefore, a router inside the model acts like a switch and sends each token to a single specialized sub-network (an “expert”), so only a fraction of the parameters is used for any given input; hence the name Switch Transformer. This sparsity is the key to Switch’s efficiency. Since the model itself and an API have not been released yet, it is difficult to judge its performance, but it may turn out to be the next breakthrough of 2021.
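
The "switch" can be read as a router that sends each token to exactly one expert sub-network, which is how the Switch Transformer paper describes its layers. A minimal sketch of that top-1 routing idea follows; the router weights, the toy "experts", and the function names are all invented for illustration and are far simpler than the real feed-forward experts.

```python
import math


def softmax(values):
    """Turn router scores into probabilities (numerically stable)."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, router_weights):
    """Return (expert index, gate value) for the single best expert."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs[best]


def switch_layer(tokens, router_weights, experts):
    """Send each token to one expert and scale its output by the gate value."""
    outputs = []
    usage = [0] * len(experts)
    for token in tokens:
        index, gate = route_token(token, router_weights)
        usage[index] += 1
        outputs.append([gate * y for y in experts[index](token)])
    return outputs, usage


# Hypothetical toy setup: 2-dimensional tokens and three "experts".
router_weights = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]
experts = [
    lambda t: [x * 2 for x in t],  # expert 0
    lambda t: [x + 1 for x in t],  # expert 1
    lambda t: [-x for x in t],     # expert 2
]
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

outputs, usage = switch_layer(tokens, router_weights, experts)
print(usage)  # [1, 1, 1]
```

Each token touches only one expert, so adding more experts increases the parameter count without increasing the work done per token. That is the trade-off that lets such a model grow very large while staying comparatively cheap to train.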


TimeSformer

TimeSformer (Time-Space Transformer) is an entirely new architecture for video understanding developed by Facebook AI. It is the first video architecture based purely on transformers. TimeSformer has outperformed most earlier models while running faster. Additionally, the architecture is highly scalable, making it possible to build larger and more accurate models. TimeSformer is built exclusively on the self-attention mechanism used in transformers, which allows the model to capture space-time dependencies over the entire video. It also has a relatively low computational cost. This is achieved by decomposing the video into a small set of non-overlapping patches and applying a form of self-attention that avoids exhaustive comparisons between all pairs of patches. The architecture is a major step towards processing long videos, as existing models could only deal with clips a few seconds long.
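
A little arithmetic shows why avoiding exhaustive comparisons matters. The sketch below compares attention over all patches of all frames with the "divided" scheme described in the TimeSformer paper, where each patch first attends over time (the same location in every frame) and then over space (its own frame). The clip size of 8 frames at 224×224 pixels with 16×16 patches is an assumed example, and the extra classification token is ignored for simplicity.

```python
def patches_per_frame(frame_side, patch_side):
    """Number of non-overlapping square patches in one square frame."""
    per_side = frame_side // patch_side
    return per_side * per_side


def joint_comparisons(n_patches, n_frames):
    """Each patch is compared with every patch in every frame."""
    return n_patches * n_frames


def divided_comparisons(n_patches, n_frames):
    """Temporal pass (same location, all frames) plus spatial pass (same frame)."""
    return n_frames + n_patches


# Assumed example values: 8 frames of 224x224 pixels cut into 16x16 patches.
n = patches_per_frame(224, 16)
print(n)                          # 196
print(joint_comparisons(n, 8))    # 1568
print(divided_comparisons(n, 8))  # 204
```

The gap widens as clips get longer: the joint cost per patch grows with the product of patches and frames, while the divided cost grows only with their sum.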

Video classification accuracy of TimeSformer versus state-of-the-art 3D convolutional neural networks on different Kinetics datasets (source)

Attractive Faces Generation

Scientists at the University of Helsinki have released a neural network capable of creating personalized attractive faces. Thirty people were shown thousands of faces and instructed to focus their attention on the ones they found most beautiful, while their brain activity was recorded using EEG. Machine learning algorithms then determined the looks that each of the thirty people found attractive and established the connection between them. Finally, a GAN model produced new images based on traits from the analyzed faces. It turned out that 87% of the generated images were selected as being the most attractive. This study shows that it is possible to produce personalized results by connecting artificial intelligence and brain activity analysis. The creators hope that such systems will be used to understand individual preferences and perhaps to identify people’s unconscious attitudes.


Even though we are only four months into 2021, we have already witnessed three breakthroughs in artificial intelligence. DALL-E and CLIP are two revolutionary models that connect computer vision and natural language processing. At the same time, SEER has achieved state-of-the-art performance in computer vision and showed the power of self-supervised learning. Other related fields have had several innovations as well. We can only guess what other amazing things await us in the future. But we can be confident that they will change the world.