Sentence Transformers: Meanings in Disguise

Transformers have wholly rebuilt the landscape of natural language processing (NLP). Before transformers, we had okay translation and language classification thanks to recurrent neural nets (RNNs), but their language comprehension was limited, they made many minor mistakes, and coherence over larger chunks of text was practically impossible.

Since the introduction of the first transformer model in the 2017 paper ‘Attention Is All You Need’, NLP has moved from RNNs to models like BERT and GPT. These new models can answer questions, write articles (maybe GPT-3 wrote this), enable incredibly intuitive semantic search - and much more.

The funny thing is, for many tasks, the latter parts of these models are the same as those in RNNs - often a couple of feedforward NNs that output model predictions. It’s the input to these layers that changed. The dense embeddings created by transformer models are so much richer in information that we get massive performance benefits despite using the same final outward layers.

These increasingly rich sentence embeddings can be used to quickly compare sentence similarity for various use cases, such as the following (a short code sketch follows the list):

- Semantic textual similarity (STS) - comparison of sentence pairs. We may want to identify patterns in datasets, but this is most often used for benchmarking.
- Semantic search - information retrieval (IR) using semantic meaning. Given a set of sentences, we can search using a ‘query’ sentence and identify the most similar records. This enables search to be performed on concepts (rather than specific words).
- Clustering - we can cluster our sentences, useful for topic modeling.
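To make these use cases concrete, here is a minimal sketch of semantic comparison with dense sentence embeddings. It assumes the sentence-transformers library is installed; the checkpoint name and the example sentences are illustrative choices rather than anything prescribed here.

```python
from sentence_transformers import SentenceTransformer, util

# Load a small pretrained sentence embedding model (example checkpoint).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "How do I compare the meaning of two sentences?",        # query
    "Measuring semantic similarity between sentence pairs.",  # related in meaning
    "The cat sat quietly on the windowsill.",                 # unrelated
]

# Encode every sentence into a dense vector.
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the query and the other sentences.
scores = util.cos_sim(embeddings[0], embeddings[1:])
print(scores)  # higher score = closer in meaning, regardless of shared words
```

Because the comparison happens between dense vectors rather than raw tokens, the second sentence should score far higher than the third even though it shares almost no words with the query - which is exactly what makes semantic search and clustering on concepts possible.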
In this article, we will explore how these embeddings have been adapted and applied to a range of semantic similarity applications by using a new breed of transformers called ‘sentence transformers’. Before we dive into sentence transformers, it might help to piece together why transformer embeddings are so much richer - and where the difference lies between a vanilla transformer and a sentence transformer.

Transformers are indirect descendants of the previous RNN models. These old recurrent models were typically built from many recurrent units like LSTMs or GRUs. In machine translation, we would find encoder-decoder networks: the first model encodes the original language into a context vector, and a second model decodes this into the target language.

[Figure: Encoder-decoder architecture with the single context vector shared between the two models; it acts as an information bottleneck, as all information must pass through this point.]

The problem here is that we create an information bottleneck between the two models. We’re creating a massive amount of information over multiple time steps and trying to squeeze it all through a single connection. This limits encoder-decoder performance because much of the information produced by the encoder is lost before reaching the decoder.

The attention mechanism provided a solution to the bottleneck issue. It offered another route for information to pass through. Still, it didn’t overwhelm the process, because it focused attention only on the most relevant information.

[Figure: Encoder-decoder with the attention mechanism.]

By passing a context vector from each timestep into the attention mechanism (producing annotation vectors), the information bottleneck is removed, and there is better information retention across longer sequences.

During decoding, the model decodes one word/timestep at a time. The attention mechanism considers all encoder output activations alongside the current decoder timestep’s activation, and uses them to modify the decoder outputs. For each step, an alignment (e.g., similarity) between the decoder word and all encoder annotations is calculated; higher alignment results in a greater weighting of that encoder annotation on the output of the decoder step. In other words, the mechanism calculates which encoder words to pay attention to. The best-performing RNN encoder-decoders all used this attention mechanism.

[Figure: Attention between an English-French encoder and decoder (source).]
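The alignment-and-weighting step described above can be sketched in a few lines of NumPy. This is a toy illustration, not the exact formulation from the original attention papers: the dot-product scoring stands in for a learned alignment model, and every array name and shape here is made up for the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_annotations = rng.normal(size=(5, 8))  # 5 source timesteps, 8-dim annotations
decoder_state = rng.normal(size=(8,))          # activation at the current decoder timestep

# Alignment (similarity) between the decoder state and every encoder annotation.
scores = encoder_annotations @ decoder_state   # shape: (5,)

# Higher alignment -> larger weight on that encoder annotation.
weights = softmax(scores)

# Context vector: encoder annotations blended in proportion to their weights.
context = weights @ encoder_annotations        # shape: (8,)
print(weights.round(3), context.shape)
```

The weights sum to one, so the decoder receives a blend of the encoder annotations dominated by whichever source words align best with the step it is currently decoding.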
In 2017, a paper titled ‘Attention Is All You Need’ was published. The authors demonstrated that we could remove the RNN networks and get superior performance using just the attention mechanism - with a few changes. This new attention-based model was named a ‘transformer’. Since then, the NLP ecosystem has shifted entirely from RNNs to transformers thanks to their vastly superior performance and incredible capability for generalization.

The first transformer removed the need for RNNs through the use of three key components: