MLWhiz: Recs|ML|GenAI

Understanding Transformers, the MLE Way

GenAI Series Part 1: What even are transformers?

May 29, 2026

Hey, Rahul here! 👋 Each week, I publish long-form posts covering ML, AI, and system design for MLWhiz. Paid subscribers also get how-to guides with full code walkthroughs, and I publish occasional extra articles.

Over the coming weeks, I’ll be writing more about GenAI, including topics like pre-training and post-training. This post is one of the foundational pieces meant to set up that series.

Understanding Transformers, the Data Science Way

Transformers have become the de facto standard architecture across much of machine learning. Though the architecture was introduced for NLP, it now powers computer vision, recommender systems, and, most importantly, the entire wave of modern LLMs.

Yet for all their ubiquity, transformers remain as hard to understand as ever.

It took me multiple readings of the Google research paper that first introduced transformers, along with a great many blog posts, to really understand how a transformer works.

So, I thought of putting the whole idea down in as simple words as possible, with some very basic math and some puns, as I am a proponent of having some fun while learning. I will try to keep both the jargon and the technicality to a minimum, though with a topic like this, I can only do so much. My goal is for the reader to understand even the goriest details of the Transformer by the end of this post.

Also, this is officially my longest post, both in the time taken to write it and in its length. Hence, I advise you to grab a coffee. ☕️

Before we dive in, here’s the path we’ll walk together: we’ll start with the big picture of what a transformer even does, then crack open the encoder stack (attention, feed-forward, positional encodings, and those mysterious “Add & Norm” boxes). From there, we’ll move to the decoder stack and the masking trick that makes it tick, bolt on an output head to actually get our German words, and finish with how the whole thing is trained and how it makes predictions at test time. Long road, but I promise the view is worth it. Onwards.


Q: So, why should I even understand the Transformer?

In the past, the LSTM and GRU architectures (as explained in my past post on NLP), along with the attention mechanism, were the state-of-the-art approach for language modeling problems (put very simply, predicting the next word) and translation systems. But the main problem with these architectures is that they are recurrent in nature: they take a sentence and process each word sequentially, so the runtime grows as the sentence gets longer.

The Transformer, a model architecture first introduced in the paper Attention Is All You Need, lets go of this recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output. Because all positions in a sentence can then be processed in parallel, it is FAST.
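To make the contrast concrete, here is a minimal NumPy sketch. The toy recurrent update and the weight-free dot-product attention are my own simplifications, not the exact layers from the paper, and the sizes are made up. The only point is that the recurrent version needs a loop over positions, while attention handles the whole sentence with a couple of matrix multiplications.

```python
import numpy as np

rng = np.random.default_rng(0)
S, D = 5, 8                      # toy sentence length and embedding size (assumed values)
x = rng.normal(size=(S, D))      # one embedding vector per word

# Recurrent style: step t needs the hidden state from step t-1,
# so the S steps have to run one after another.
W_x = rng.normal(size=(D, D)) * 0.1
W_h = rng.normal(size=(D, D)) * 0.1
h = np.zeros(D)
rnn_states = []
for t in range(S):
    h = np.tanh(x[t] @ W_x + h @ W_h)
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)              # shape (S, D)

# Attention style: every word is compared with every other word in one shot.
scores = x @ x.T / np.sqrt(D)                  # shape (S, S)
scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True) # each row is a softmax distribution
attended = weights @ x                         # shape (S, D)

print(rnn_states.shape, attended.shape)
```

Note that the attention score matrix is S by S, so the amount of work still grows with sentence length. The gain is that none of it has to wait on a previous time step.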



From the paper (Source: https://arxiv.org/pdf/1706.03762.pdf)

This is the picture of the full transformer, as taken from the paper, and it surely is intimidating. I will aim to demystify it in this post by going through each piece, so read ahead.


The Big Picture

Q: That sounds interesting. So, what does a transformer do exactly?

Essentially, a transformer can perform almost any NLP task. It can be used for language modeling, translation, or classification as required, and it does so fast by removing the sequential nature of the computation. In a machine translation application, the transformer converts one language to another; for a classification problem, it provides the class probabilities through an appropriate output layer.

It all depends on the final output layer of the network; the basic structure of the Transformer remains much the same for any task. For this particular post, I will continue with the machine translation example.

So, from a very high level, this is how the transformer looks for a translation task. It takes an English sentence as input and returns a German sentence.
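As a rough illustration of "same body, different output layer", here is a small NumPy sketch. Everything in it is assumed for illustration: `body_out` is a random stand-in for whatever the transformer body produces, the vocabulary and class counts are made up, and mean pooling is just one simple way to get a single vector for classification.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
S, D = 4, 512                     # sentence length and model dimension
VOCAB, N_CLASSES = 1000, 3        # hypothetical sizes
body_out = rng.normal(size=(S, D))   # stand-in for the transformer body's output

# Translation / language modeling head:
# one probability distribution over the vocabulary per position.
W_vocab = rng.normal(size=(D, VOCAB)) * 0.01
word_probs = softmax(body_out @ W_vocab)               # shape (S, VOCAB)

# Classification head:
# pool over positions, then one distribution over the classes.
W_cls = rng.normal(size=(D, N_CLASSES)) * 0.01
class_probs = softmax(body_out.mean(axis=0) @ W_cls)   # shape (N_CLASSES,)

print(word_probs.shape, class_probs.shape)
```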

Transformer for Translation

The Building Blocks

Q: That was too basic. 😎 Can you expand on it?

Okay, just remember in the end, you asked for it. Let’s go a little deeper and try to understand what a transformer is composed of.

So, a transformer is essentially composed of a stack of encoder layers and a stack of decoder layers. The role of the encoder layers is to encode the English sentence into a numerical form using the attention mechanism, while the decoder layers use that encoded information to produce the German translation of the English sentence.

In the figure below, the transformer is given an English sentence as input, which gets encoded using 6 encoder layers. The output from the final encoder layer then goes to each decoder layer to translate English to German.

Data Flow in a Transformer
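The data flow can be sketched in a few lines of Python. The two layer functions below are deliberately empty placeholders of my own (they do no attention at all). What the sketch shows is the wiring: the input passes through the six encoder layers in sequence, and the output of the last one is handed, unchanged, to every decoder layer. The paper uses six decoder layers as well.

```python
import numpy as np

N_LAYERS = 6   # encoder and decoder depth used in the paper
D = 512

def encoder_layer(x):
    # Placeholder: a real layer applies self-attention and a feed-forward network.
    return x + 0.0

def decoder_layer(y, memory):
    # Placeholder: a real layer attends over `memory`.
    # Here we only add the mean encoder vector to make the dependency visible.
    return y + memory.mean(axis=0)

def transformer_flow(src, tgt):
    memory = src
    for _ in range(N_LAYERS):
        memory = encoder_layer(memory)      # encoder layers feed one another
    out = tgt
    for _ in range(N_LAYERS):
        out = decoder_layer(out, memory)    # every decoder layer sees the same final encoder output
    return out

src = np.ones((7, D))    # stand-in for an embedded English sentence of 7 tokens
tgt = np.zeros((5, D))   # stand-in for 5 embedded German tokens
out = transformer_flow(src, tgt)
print(out.shape)
```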

1. Encoder Architecture

Q: That’s alright, but how does an encoder stack encode an English sentence exactly?

Patience, I am getting to it. So, as I said, the encoder stack contains six encoder layers stacked on top of each other (as given in the paper; later versions of transformers use even more layers). Each encoder layer in the stack has essentially two main sub-layers:

  • a multi-head self-attention layer, and

  • a position-wise fully connected feed-forward network

Very basic encoder Layer

They are a mouthful, right? Don’t lose me yet, as I will explain both of them in the coming sections. Right now, just remember that the encoder layer incorporates attention and a position-wise feed-forward network.
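If it helps to see the skeleton before the details, here is a stripped-down NumPy sketch of such a layer. It is a simplification under my own assumptions: a single attention head instead of multiple heads, random weights, and no "Add & Norm" steps. The inner feed-forward size of 2048 is the value used in the paper; the dictionary of parameters is just a convenient way to pass the weights around.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    # Single head only; the paper runs several such heads in parallel.
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (S, S): every word scored against every word
    return softmax(scores) @ v               # (S, D)

def feed_forward(x, W1, b1, W2, b2):
    # The same two-layer ReLU network is applied to each position (row) independently.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, p):
    attended = self_attention(x, p["W_q"], p["W_k"], p["W_v"])
    return feed_forward(attended, p["W1"], p["b1"], p["W2"], p["b2"])

rng = np.random.default_rng(0)
S, D, D_FF = 6, 512, 2048    # toy sentence length, model dimension, inner FFN dimension
params = {
    "W_q": rng.normal(size=(D, D)) * 0.02,
    "W_k": rng.normal(size=(D, D)) * 0.02,
    "W_v": rng.normal(size=(D, D)) * 0.02,
    "W1": rng.normal(size=(D, D_FF)) * 0.02, "b1": np.zeros(D_FF),
    "W2": rng.normal(size=(D_FF, D)) * 0.02, "b2": np.zeros(D),
}
x = rng.normal(size=(S, D))
out = encoder_layer(x, params)
print(out.shape)
```

The output has the same shape as the input, which is what allows identical layers to be stacked on top of each other.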

Q: But what kind of input does this layer expect?

This layer expects its inputs to be of shape S×D (as shown in the figure below), where S is the length of the source (English) sentence and D is the dimension of the embedding, whose weights can be trained with the network. In this post, we will use D = 512 by default throughout. S, on the other hand, is the maximum sentence length in a batch, so it normally changes from batch to batch.
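Here is a small NumPy sketch of how such an input could be built. The tiny vocabulary, the two example sentences, and the use of a padding token to bring shorter sentences up to length S are all my assumptions for illustration; in a real model the embedding table is learned rather than random.

```python
import numpy as np

D = 512
PAD = 0
# Hypothetical toy vocabulary; index 0 is reserved for padding.
vocab = {"<pad>": PAD, "i": 1, "love": 2, "quantum": 3, "mechanics": 4, "hello": 5}
batch = [["i", "love", "quantum", "mechanics"], ["hello"]]

# S is the longest sentence in this batch, so it changes from batch to batch.
S = max(len(sentence) for sentence in batch)
ids = np.full((len(batch), S), PAD)
for row, sentence in enumerate(batch):
    ids[row, :len(sentence)] = [vocab[word] for word in sentence]

rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), D))   # trainable weights in a real model
x = embedding[ids]                             # shape (batch size, S, D)

print(ids)
print(x.shape)
```

Each sentence in the batch, `x[i]`, is then an S×D matrix ready to be fed to the encoder layer.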
