Transformers, Attention, and the Evolution of LLMs - Chapter 2 Part 1
Chapter 2 of Laura Funderburk’s book opens with the big picture of large language models: where they came from, how they work inside, and where they are heading. If Chapter 1 was about pipelines, this chapter is about the models that sit at the center of those pipelines.
Let me break it down.
What LLMs Actually Do
LLMs are deep learning models built to process and produce text that looks like human communication. They are trained on huge datasets of books, websites, and articles, and they can generate sentences, answer questions, write code, and more.
Funderburk groups their use cases into four buckets:
- Understanding: sentiment analysis, text classification, named-entity recognition. Classic NLP stuff.
- Generating: writing marketing copy, code snippets, stories. GitHub Copilot is a famous example, and it has grown from a simple autocomplete into a full agentic system.
- Retrieving: grounding a model in actual data so it can answer questions from a knowledge base instead of making things up.
- Interacting: chatbots, virtual assistants, content moderation bots.
Nothing too surprising there. The real substance comes next: the book takes you through the architecture that makes all of this possible.
The Transformer Architecture
The engine behind all modern LLMs is the transformer, introduced in the 2017 paper “Attention Is All You Need” by Vaswani et al. Before transformers, recurrent models processed text one word at a time, in sequence. Transformers changed that by processing all tokens in parallel.
Here’s a simplified walkthrough of how it works:
- Inputs: Take a sentence, break it into tokens, give each token a unique ID, then turn those IDs into vectors using an embedding layer.
- Positional encoding: Transformers don’t naturally know word order, so you inject position information into the embeddings. The original paper used fixed absolute positions; later models switched to relative schemes for better performance.
- Attention mechanism: This is the key innovation. Self-attention lets the model weigh how relevant each word is to every other word in the sentence. Multi-head attention runs this process multiple times in parallel with different weights, then combines the results. That lets the model look at relationships from different angles at the same time.
- Feed-forward networks: Each layer has a small neural network that refines the representations further.
- Layer normalization: Applied after each block to stabilize and speed up training.
- Output layer: Projects the final representation into probabilities over the vocabulary. In the simplest (greedy) decoding, the token with the highest probability gets picked as the next word.
- Post-processing: Convert token IDs back into human-readable text.
Think of it like this in pseudo-code:
```python
tokens = tokenize("The cat sat on the")
embeddings = embed(tokens) + positional_encoding(tokens)
for layer in transformer_layers:
    attended = multi_head_attention(embeddings)
    embeddings = normalize(feed_forward(attended))
probs = softmax(linear(embeddings))  # probability distribution over the vocabulary
next_token = argmax(probs)           # greedy decoding picks the most likely token
```
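The attention step can be made concrete. Here is a minimal numpy sketch of single-head scaled dot-product attention; the shapes and weight matrices (`Wq`, `Wk`, `Wv`) are illustrative toy values, not anything from the book:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weigh each token's value vector by its relevance to every other token."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise relevance, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                # weighted sum of values

# Toy example: 3 tokens, each represented by a 4-dimensional vector
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # (3, 4): one updated vector per token
```

Multi-head attention simply runs several copies of this with different `Wq`/`Wk`/`Wv` matrices and concatenates the results.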
Three Families of Transformer Models
The transformer design spawned three families by 2023:
| Type | Architecture | Example | Good For |
|---|---|---|---|
| Auto-regressive | Decoder only | GPT | Text generation, chatbots |
| Auto-encoding | Bidirectional encoder | BERT | Understanding input, Q&A |
| Sequence-to-sequence | Encoder-decoder | BART | Translation, tasks needing both input understanding and generation |
GPT predicts the next word. BERT looks at words in both directions to understand context. BART combines both approaches. Each has its sweet spot.
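The difference between the GPT and BERT families shows up in the attention mask. A small sketch of the idea (not from the book): a decoder-only model masks out future positions so each token only sees what came before it, while a bidirectional encoder lets every token attend to every other token.

```python
import numpy as np

n = 4  # sequence length

# Decoder-only (GPT-style): token i may only attend to positions <= i
causal_mask = np.tril(np.ones((n, n), dtype=bool))

# Bidirectional encoder (BERT-style): every token attends to every token
bidirectional_mask = np.ones((n, n), dtype=bool)

print(causal_mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

An encoder-decoder model like BART uses the bidirectional mask on the input side and the causal mask on the output side.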
The 2023 Toolkit: Prompting, Fine-Tuning, and RAG
Back in 2023, the main problems with these big models were hallucinations and cost. Three tactics competed to fix them:
- Prompting: writing careful instructions to guide the model’s output.
- Fine-tuning: further training the model on domain-specific data. Techniques like PEFT (parameter-efficient fine-tuning) and LoRA (Low-Rank Adaptation) made this cheaper by only updating a subset of the model’s parameters.
- RAG: giving the model access to an external knowledge base so it can look up facts instead of guessing.
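The parameter savings behind LoRA are easy to see with a toy calculation. The dimensions below are made up for illustration: instead of updating a full d×d weight matrix, LoRA learns two small low-rank factors whose product approximates the update.

```python
import numpy as np

d, r = 4096, 8         # hidden size of a weight matrix, LoRA rank
W = np.zeros((d, d))   # frozen pretrained weight (stand-in values)

# LoRA: learn a low-rank update delta_W = B @ A instead of touching W
B = np.zeros((d, r))
A = np.zeros((r, d))

full_params = W.size           # parameters if we fine-tuned W directly
lora_params = B.size + A.size  # parameters LoRA actually trains
print(full_params, lora_params)                                # 16777216 65536
print(f"trainable fraction: {lora_params / full_params:.2%}")  # 0.39%
```

Training under half a percent of the original parameters per adapted matrix is why fine-tuning became affordable.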
The question back then was always “should I use RAG or should I fine-tune?” Funderburk makes the point that by 2025, this is no longer an either/or choice. The modern stack uses RAG AND specialized models together.
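A RAG step can be sketched in a few lines: retrieve the most relevant snippet, then inline it into the prompt. This toy version scores documents by word overlap instead of real vector embeddings, and the knowledge base is invented for illustration:

```python
# Toy RAG: score documents by word overlap with the question,
# then ground the prompt in the best match. Real systems use
# vector embeddings and a vector store instead.
knowledge_base = [
    "The transformer architecture was introduced in 2017.",
    "BERT is a bidirectional encoder model.",
    "LoRA fine-tunes models by learning low-rank weight updates.",
]

def retrieve(question, docs):
    q_words = set(question.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

question = "When was the transformer introduced?"
context = retrieve(question, knowledge_base)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(context)  # "The transformer architecture was introduced in 2017."
```

The prompt grounds the model in retrieved facts, which is exactly how RAG reduces hallucination without any retraining.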
The 2024-2025 Split: Small vs. Reasoning Models
Here’s where things get interesting. The one-size-fits-all era is over. The model landscape has forked into two directions.
Small Language Models (SLMs) are the “smaller, faster” branch. Under 10 billion parameters. Models like Microsoft’s Phi-3-mini (3.8B parameters) and Apple’s OpenELM. They run on phones, IoT devices, and edge hardware. They are great for classification, extraction, and lightweight tasks where you need low cost and low latency.
Reasoning Language Models (RLMs) are the “smarter, deeper” branch. These are not just bigger LLMs. They are a new hybrid architecture combining a traditional LLM (for knowledge), reinforcement learning (for strategy), and search heuristics (for exploring multiple solution paths). Think OpenAI’s o3 and the open-source DeepSeek-R1.
The DeepSeek-R1 release in January 2025 was a big deal. It matched the performance of the best proprietary reasoning models, was released under MIT license, and reportedly cost under $6 million to train. The book explains that DeepSeek’s secret sauce was GRPO (Group Relative Policy Optimization), a clever RL algorithm that eliminates the need for a separate critic model during training by comparing outputs within a group instead.
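The group-relative idea can be sketched with numpy. Instead of a learned critic estimating a baseline, GRPO samples several outputs for the same prompt and normalizes each reward against the group’s own statistics (the reward values below are invented):

```python
import numpy as np

# Rewards for a group of sampled outputs to the SAME prompt (made-up values)
rewards = np.array([0.2, 0.9, 0.5, 0.4])

# Group-relative advantage: no critic model, just the group's own baseline
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

print(advantages.round(2))
# Outputs above the group average get a positive advantage, below get negative
```

These advantages then weight the policy update, which is how the critic network drops out of the training loop entirely.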
The training pipeline for DeepSeek-R1 went through multiple stages: pure RL first (which worked but produced messy output), then supervised fine-tuning on chain-of-thought examples, then more RL, then rejection sampling to build clean training data, then another round of fine-tuning, and finally a polishing RL pass.
The takeaway from Funderburk is clear. For developers in 2025, the job is no longer picking one giant model. It is choosing the right model for the right job based on trade-offs between cost, latency, and reasoning depth.
This is post 4 of 24 in the Building Natural Language and LLM Pipelines series.