Transformers Are All You Need

A quick tour through the most popular Neural Net architecture

Have you ever thought about what happens when you read a book? Unless you have a unique ability, you don’t process and memorize every single word and special character, right? What happens is that we represent events and characters to build our understanding of the story. We do this with a selective memorization that allows us to keep the most relevant pieces of information without needing to accumulate each minor detail.

This is exactly what Hochreiter and Schmidhuber were looking for when they created the Long Short Term Memory (LSTM) model in 1997. LSTMs are types of Recurrent Neural Networks that emulate a selective memory approach, allowing them to store relevant information about the past in order to optimize a task. This impressive architecture ruled the sequential data landscape for over two decades and drove huge progress towards the way we understand different disciplines today. Unlike human memory, however, LSTMs can struggle when dealing with long-range contexts, as is the case with human language.

This limitation was particularly evident in language research that tried to move from keyword-based to more linguistic approaches in tasks like information searching, document classification, or question-answering. The nuances of human language were just too many for the Natural Language Processing discipline to keep up.

Natural Language Processing (NLP) is a field of Artificial Intelligence that gives the machines the ability to read, understand, and derive meaning from human languages. It represents the automatic handling of natural human language like speech or text, and, because, as humans, we excel at understanding our language, we tend to underestimate how hard it is for machines to do it.

Despite this,things have significantly changed in past years, and, although far from new, NLP is living a new age thanks to the invention of Transformers.

Transformers represent new architectures of Artificial Neural Networks (ANN) that generalize to many NLP tasks with incredible results.

Transformers have improved the performance of language models in a substantial way, significantly extending the model’s contextual processing. Interested in seeing how they perform? You can test Transformers here and here.

The road to Transformers

Before getting to Transformers, let’s start by understanding what an Artificial Neural Network (ANN) is.

ANNs are computing systems composed of neurons, where each neuron individually performs only a simple computation.

The power of an ANN comes from the complexity of the connections these neurons can form. ANNs accept input variables as information, weigh variables as knowledge, and output a prediction. Every ANN works this way.

The concept is far from new, since the first ANN (the Perceptron) was created in 1958. At the beginning, ANNs were built and used to solve basic tasks, but they rapidly evolved, becoming complex mechanisms able to solve challenges in areas like Computer Vision and Natural Language Processing.

The Perceptron

The Perceptron is the oldest Artificial Neural Network, created by Frank Rosenblatt in 1958. It has a single neuron and is the simplest form of a neural network. Source: IBM

Deep Neural Network

Artificial Neural Networks are composed of node layers, containing an input layer, one or more hidden layers, and an output layer. Each node, or artificial neuron, connects to another and has an associated weight and threshold. If the output of any individual node is above the specified threshold value, that node is activated, sending data to the next layer of the network. Otherwise, no data is passed along to the next layer of the network. Source: IBM

ANNs architectures were expanded and improved in order to fit the complexity of the data we started gathering; Convolutional Neural Networks (CNNs) were designed to process spatial data like images, while Recurrent Neural Networks (RNNs) and Long Short Term Memories (LSTMs) were built to process sequential data like text.

But it wasn’t until just recently that something changed the way we conceived most of our challenges. Introduced in 2017, Transformers rapidly showed effective results at modelling data with long-range dependencies. Originally thought to solve NLP tasks, the application of Transformers has expanded, reaching incredible accomplishments in many disciplines.


Many of the world’s greatest challenges, such as developing treatments for diseases or finding enzymes that break down industrial waste, are fundamentally tied to proteins and the role they play. A protein’s shape is closely linked to its function, and the ability to predict this structure unlocks a greater understanding of what it does and how it works. In a major scientific advance, DeepMind’s AlphaFold system has been recognised as a solution to this grand challenge.

Two examples of protein targets in the free modelling category. AlphaFold predicts highly accurate structures measured against experimental results. Source: DeepMind

Self-driving cars

Tesla’s strategy is built around its Artificial Neural Networks. Unlike many self-driving car companies, Tesla does not use lidar, a more expensive sensor that can see the world in 3D. It relies instead on interpreting scenes by using neural network algorithms to parse input from its cameras and radar.

Tesla’s neural networks can consistently detect objects in various visibility conditions. Source: VentureBeat

Art generation

DALL·E parses text prompts and then responds not with words, but in pictures. It has been specifically trained to generate images from text descriptions, using a dataset of text-image pairs. The amazing thing is that DALL·E can do more than just paint a pretty picture from a caption: It can also, in a sense, answer questions visually. DALL·E is often able to solve matrices that involve continuing simple patterns or basic geometric reasoning.

DALL-E Output Example

Example of an output from DALL·E when prompted to generate an “armchair in the shape of an avocado”. Source: OpenAI

These examples are impressive, and, although they seem unrelated, they have one common factor: They all use Transformers architectures. 

Now let’s see how Transformers work.

The anatomy of Transformers

It’s usually helpful to visualize things when trying to understand them. Let’s see what Transformers look like:

Transformer Architecture

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively. Source: HarvardNLP

Too much information, right? Let’s start with the basics. In very simple terms, a Transformer’s architecture consists of encoder and decoder components. The encoder receives an input (e.g. a sentence to be translated), processes it into a hidden representation, and passes it to the decoder, which returns an output (e.g. the translated sentence).

Encoding & Decoding Components

An encoding component, a decoding component, and connections between them. Source: Jay Alammar

Going back to the model, we can find the encoder and decoder components as follows: