The Age of Transformers

Maneesh Chaturvedi
7 min read · Feb 8, 2023

Transformers, Attention and Self-Attention

Transformers are a type of deep learning model used in natural language processing (NLP) tasks such as language translation, text summarization, and question answering. They are designed to handle sequences of data, such as sentences or paragraphs of text.

The key idea behind transformers is the use of attention mechanisms. Attention allows the model to focus on the most relevant parts of the input sequence when making predictions.

Imagine you are translating a sentence from English to Spanish. With attention, the model gives more weight to the words that matter most for the translation and less to those that are largely irrelevant. This makes it more effective and efficient than earlier sequence models that treat every word equally.

So, in short, transformers use attention to focus on the most relevant information in an input sequence, which allows them to achieve state-of-the-art results on a wide range of NLP tasks.

In the context of transformers, “self-attention” refers to the mechanism by which a sequence attends to itself: every element is compared with every other element, and the model weighs the importance of each one when building the representation used for predictions. Self-attention is computed several times in parallel (multi-head attention) and repeated across stacked layers, giving the model a detailed picture of the dependencies between the elements.
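
To make this concrete, here is a minimal sketch of single-head scaled dot-product self-attention in PyTorch. The sequence length, embedding size, and random projection matrices are purely illustrative assumptions, not taken from any particular model.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) input sequence.
    w_q, w_k, w_v: (d_model, d_model) projection matrices.
    """
    q = x @ w_q                                   # queries
    k = x @ w_k                                   # keys
    v = x @ w_v                                   # values
    d_k = q.size(-1)
    # Each position attends to every position; scaling keeps gradients stable.
    scores = q @ k.transpose(0, 1) / d_k ** 0.5   # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)           # one distribution per query position
    return weights @ v                            # weighted sum of values

# Toy usage: a "sentence" of 5 tokens with 16-dimensional embeddings.
torch.manual_seed(0)
x = torch.randn(5, 16)
w_q, w_k, w_v = (torch.randn(16, 16) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([5, 16])
```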

Encoder Decoder Architecture

The “encoder-decoder architecture” is a general framework used in NLP tasks such as language translation, where the input is first encoded into a hidden representation and then decoded into the desired output. The encoder maps the input sequence to a sequence of high-dimensional vector representations, and the decoder generates the output from them, typically one token at a time.

In the case of transformers, the encoder is a stack of self-attention and feed-forward layers that capture the relationships between the input elements. The decoder then generates the output token by token, applying self-attention over the tokens produced so far and cross-attention over the encoder representation, followed by feed-forward layers. This architecture lets transformers handle sequential inputs effectively, making them well suited for NLP tasks.
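
As an illustration, the sketch below wires a small encoder-decoder together from PyTorch's built-in transformer layers. The vocabulary size, model width, and layer counts are arbitrary choices for the example, and the causal mask needed when training a real decoder is omitted for brevity.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen only for illustration.
vocab_size, d_model, n_heads, n_layers = 1000, 64, 4, 2

embed = nn.Embedding(vocab_size, d_model)

# Encoder: stacked self-attention + feed-forward layers over the source sequence.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128, batch_first=True),
    num_layers=n_layers,
)

# Decoder: attends to its own previous outputs and to the encoder representation.
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, n_heads, dim_feedforward=128, batch_first=True),
    num_layers=n_layers,
)
to_vocab = nn.Linear(d_model, vocab_size)

src = torch.randint(0, vocab_size, (1, 7))   # e.g. an English sentence of 7 tokens
tgt = torch.randint(0, vocab_size, (1, 5))   # the partial translation produced so far

memory = encoder(embed(src))                 # encode source into hidden representations
out = decoder(embed(tgt), memory)            # decode conditioned on source + previous tokens
logits = to_vocab(out)                       # scores over the target vocabulary
print(logits.shape)                          # torch.Size([1, 5, 1000])
```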

Applications of Transformers

  1. Language: Transformers have been applied to various NLP tasks, such as language translation, text classification, named entity recognition, and question answering. The attention mechanism in transformers allows the model to weigh the importance of each word in the input sequence, which is essential for tasks such as language translation where the meaning of a sentence depends on the relationships between its words.
  2. Computer Vision: More recently, transformers have been applied to computer vision tasks such as object detection, segmentation, and image captioning. Here the transformer operates on image features rather than word embeddings. For example, in object detection, the model applies self-attention over the features of an image to weigh their importance, allowing it to focus on the most relevant regions of the image.
  3. Decision Making: Transformers have also been applied to decision-making tasks such as recommendation systems and reinforcement learning. In recommendation systems, transformers can be used to model user behavior by considering the relationships between items and users. In reinforcement learning, the transformer can be used to model the state representation of the environment and make decisions based on the state.

The attention mechanism and the encoder-decoder architecture make transformers well suited for processing sequential data, making them a versatile tool for a variety of tasks.

Mixture of Experts and Switch Transformers

“Mixture of Experts” (MoE) is a machine learning technique that combines the predictions of multiple models to produce a final prediction. In a MoE architecture, each expert model is responsible for making predictions for a different subset of the input space, and a gating network is used to determine which expert should be used for each input. This allows the model to capture complex relationships in the input data and achieve improved performance compared to a single model.
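
To make the mechanism concrete, here is a minimal sketch of a dense mixture of experts in PyTorch, assuming a few small feed-forward experts and a linear-plus-softmax gating network. The class name and all sizes are made up for illustration and not taken from any particular library.

```python
import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    """Dense MoE: the gate mixes every expert's prediction for each input."""

    def __init__(self, d_in, d_out, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, d_out))
             for _ in range(n_experts)]
        )
        self.gate = nn.Linear(d_in, n_experts)   # gating network scores each expert

    def forward(self, x):
        # (batch, n_experts) mixing weights, one distribution per input.
        weights = torch.softmax(self.gate(x), dim=-1)
        # (batch, n_experts, d_out) stacked expert predictions.
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)
        # Weighted combination of the experts' outputs.
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)

moe = MixtureOfExperts(d_in=16, d_out=3)      # e.g. 3 text categories
x = torch.randn(8, 16)                        # a batch of 8 feature vectors
print(moe(x).shape)                           # torch.Size([8, 3])
```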

Mixture of Experts: Example Use case: Text Classification

In this task, the goal is to classify a given text into one of several predefined categories (such as positive or negative sentiment). A Mixture of Experts model could be trained on a large dataset of texts, with each expert learning to handle a particular category or style of text. The gating network then decides how much weight to give each expert for a given text based on its characteristics, so the ensemble can capture patterns that a single model would miss.

“Switch Transformers” are a sparse variant of MoE built into the transformer architecture itself. In a Switch Transformer, the feed-forward sublayer of a transformer block is replaced by a set of expert feed-forward networks, and a lightweight router (a learned linear layer followed by a softmax) sends each token to exactly one expert. Because only a single expert runs per token, the model can grow to a very large number of parameters while keeping the computation per token roughly constant, which lets it handle complex sequential inputs efficiently.
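
A rough sketch of that top-1 routing step is shown below, assuming a simple linear router and small expert feed-forward networks. The class name SwitchFeedForward and all dimensions are illustrative; the published model adds details such as load-balancing losses and expert capacity limits that are omitted here.

```python
import torch
import torch.nn as nn

class SwitchFeedForward(nn.Module):
    """Switch-style layer: each token is routed to exactly one expert FFN."""

    def __init__(self, d_model=64, d_ff=128, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # lightweight gating network
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x):                     # x: (n_tokens, d_model)
        probs = torch.softmax(self.router(x), dim=-1)
        gate, expert_idx = probs.max(dim=-1)  # top-1: one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i            # tokens assigned to expert i
            if mask.any():
                # Scale by the gate value so the router stays differentiable.
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = SwitchFeedForward()
tokens = torch.randn(10, 64)                  # 10 token representations
print(layer(tokens).shape)                    # torch.Size([10, 64])
```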

Switch Transformers: Example Use case: Multi-Language Translation

In this task, the goal is to translate a given text from one language into another. A Switch Transformer could be trained on a large multilingual dataset, with the router free to send tokens from different languages to different experts. In practice the experts can come to specialize, for example by language or token type, so a single sparse model can cover many languages while activating only a small fraction of its parameters for any given input.

In summary, Mixture of Experts is a machine learning technique that combines multiple specialized models to improve performance, while Switch Transformers are a sparse implementation of MoE inside the transformer architecture, routing each token to a single expert feed-forward network. This sparsity lets Switch Transformers scale to very large parameter counts while keeping training and inference costs manageable.

Perceiver and Perceiver IO

“Perceiver” is a transformer-based architecture designed to handle inputs from arbitrary modalities, such as images, audio, point clouds, and text, with a single model, which makes it well suited to multi-modal tasks such as image-text matching and visual question answering. It was introduced by DeepMind in the paper “Perceiver: General Perception with Iterative Attention”.

The key idea behind Perceiver is to avoid applying self-attention directly to very large inputs, such as all the pixels of an image. Instead, the model keeps a small, learned latent array and uses cross-attention to repeatedly read the input into that latent space, where ordinary self-attention layers do the heavy processing. Because the latent array has a fixed size, the cost of attention grows only linearly with the input size, the same architecture can be reused across modalities, and the resulting representation can be fine-tuned for a target task.
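
As a rough illustration of that read-into-latent step, here is a minimal sketch using PyTorch's built-in attention modules. The latent size, input length, and dimensions are arbitrary assumptions, and a real Perceiver repeats the cross-attend/self-attend cycle several times.

```python
import torch
import torch.nn as nn

d_model, n_latents = 64, 16

# A small, learned latent array whose size is independent of the input size.
latents = nn.Parameter(torch.randn(n_latents, d_model))

# Cross-attention: latents are the queries, the raw inputs are keys and values.
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
# Ordinary self-attention then refines the latents in the small latent space.
latent_block = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=128,
                                          batch_first=True)

# Inputs can be very long (e.g. image pixels or text tokens projected to d_model).
inputs = torch.randn(1, 5000, d_model)

z = latents.unsqueeze(0)                              # (1, n_latents, d_model)
z, _ = cross_attn(query=z, key=inputs, value=inputs)  # compress inputs into latents
z = latent_block(z)                                   # heavy processing at latent size
print(z.shape)                                        # torch.Size([1, 16, 64])
```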

Perceiver: Use case: Image-Text Matching

In this task, the goal is to determine whether a given text description matches a given image. A Perceiver model could be trained on a large dataset of image-text pairs, reading both the image features and the text tokens into its latent array and learning to predict whether they belong together. Because the architecture makes no modality-specific assumptions, the same model can consume both inputs directly, and the learned representation can be fine-tuned for the matching task.

“Perceiver IO” is an extension of the Perceiver architecture that makes the output side just as flexible as the input side. Instead of a fixed task-specific head, the model is given a set of output queries, one per element of the desired output, and these queries attend to the latent array to decode the result. This lets the same architecture produce outputs of arbitrary size and structure, from single class labels to full sequences of text or dense per-pixel predictions.
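
Continuing the sketch from the Perceiver section, the decoding step can be illustrated with one more cross-attention in which a set of output queries reads from the latent array. The number of queries and the answer vocabulary size are again illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, n_latents, n_outputs = 64, 16, 10

latents = torch.randn(1, n_latents, d_model)      # latent array from the encoder
# One learned query per desired output element, e.g. 10 answer-token positions.
output_queries = nn.Parameter(torch.randn(1, n_outputs, d_model))

decode_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
to_logits = nn.Linear(d_model, 1000)              # e.g. a 1000-word answer vocabulary

# Output queries attend to the latents; the output size is set by the queries,
# not by the input, which is what makes the decoder flexible.
decoded, _ = decode_attn(query=output_queries, key=latents, value=latents)
print(to_logits(decoded).shape)                   # torch.Size([1, 10, 1000])
```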

Perceiver IO: Use case: Visual Question Answering

In this task, the goal is to answer a natural-language question about an image. A Perceiver IO model could be trained on a large dataset of images and questions, encoding the image and the question together into its latent array and then decoding the answer with a set of output queries, for example one query per answer token. Compared with the base Perceiver, the flexible query-based decoder makes it straightforward to produce structured outputs such as free-form answers.

In summary, Perceiver is a transformer-based architecture that scales to large, multi-modal inputs by attending into a small fixed-size latent array, while Perceiver IO extends it with a query-based decoder that can produce outputs of arbitrary size and structure.

Non-Parametric Transformers (NPT)

Non-parametric transformers refer to a class of transformer models whose effective capacity is not fixed in advance by the weights alone. Rather than relying only on a fixed set of parameters, they can draw on additional state, such as an external memory that grows with the input, allowing them to adapt to the complexity of the data dynamically.

The key idea is to introduce an external memory that stores and processes information adaptively. For example, the Transformer-XL model is often discussed in this spirit: it uses a segment-level recurrence mechanism that caches the hidden states of previous segments as a memory, so the model can attend over context far beyond the current segment while its number of trainable parameters stays constant.
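
A much-simplified sketch of that segment-level memory is shown below: the previous segment is cached and prepended to the attention context for the current one. The real Transformer-XL caches hidden states from every layer and uses relative positional encodings, both of which are omitted here for brevity.

```python
import torch
import torch.nn.functional as F

d_model = 32

def attend_with_memory(segment, memory, w_q, w_k, w_v):
    """Attend over the current segment plus a cached memory of past states."""
    # Keys/values cover memory + current segment; queries come only from the segment.
    context = torch.cat([memory, segment], dim=0) if memory is not None else segment
    q, k, v = segment @ w_q, context @ w_k, context @ w_v
    weights = F.softmax(q @ k.T / d_model ** 0.5, dim=-1)
    return weights @ v

torch.manual_seed(0)
w_q, w_k, w_v = (torch.randn(d_model, d_model) * 0.1 for _ in range(3))

memory = None
for step in range(3):                      # process a long sequence segment by segment
    segment = torch.randn(4, d_model)      # 4 tokens per segment
    out = attend_with_memory(segment, memory, w_q, w_k, w_v)
    # Cache this segment (detached) as memory for the next step;
    # the real model caches the layer's hidden states instead of raw inputs.
    memory = segment.detach()
    print(step, out.shape)                 # torch.Size([4, 32]) at each step
```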

Another related approach is the use of dynamic convolutions, as in the Dynamic Convolutional Transformer (DyConv), where the convolution kernels are generated on the fly as a function of the current input rather than being fixed learned weights. Because the computation adapts to the content and length of the sequence, such models handle inputs of varying length naturally, which makes them attractive for online applications.

Imagine a sentence-completion task, where you need to predict the next word given the previous words. A traditional transformer processes each input within a fixed-length context window, regardless of how long the text is. A model like Transformer-XL, by contrast, can grow its memory with the context it has already seen, making it well suited to both short and very long inputs.

In short, non-parametric transformers adapt to the complexity and length of the input dynamically, which can give them an edge over traditional transformers on long or highly variable sequences and makes them well suited to online applications.

Overall, non-parametric transformers are an active area of research, and they are being explored as a way to overcome the limitations of traditional transformers and improve performance on tasks such as language modeling and machine translation.
