Normalization and Database Design - Data Engineering for Beginners (Ch.5 Part 2)
This is Part 2 of Chapter 5, continuing from data modeling basics.
If Part 1 was about drawing the blueprint, Part 2 is about keeping the building from falling apart. Normalization is one of those topics that sounds academic until you hit a real bug caused by duplicate data. Then it clicks fast.
Why Normalization?
The book starts with a good example. Imagine an Orders table where every row stores the customer name, their address, the product, the quantity, and the date. All in one big flat table.
Customer Jessica has two orders. Her address, “405 Whitehouse St, NY, USA,” shows up in both rows. Now Jessica moves. You update one row but miss the other. Congrats, your database now thinks Jessica lives in two places at once.
Normalization fixes this by splitting the data into separate tables. Customer info goes in a Customers table. Order info goes in an Orders table. Jessica’s address lives in exactly one place. Update it once, done.
-- Customers table: address stored once
-- CustomerID | Name | Address
-- 1 | Jessica | 405 Whitehouse St, NY, USA
-- Orders table: references CustomerID
-- OrderID | CustomerID | ProductName | Quantity | OrderDate
-- 1001 | 1 | Laptop | 1 | 2024-11-20
-- 1003 | 1 | Headphones | 1 | 2024-11-22
Simple. No duplication. One source of truth.
The Normal Forms
Normalization has rules, called normal forms. Each one builds on the previous.
First Normal Form (1NF)
Rule: every column holds one value. No lists, no sets, no comma-separated stuff.
Bad example: Bailey has phone numbers “123-456, 987-654” crammed into one cell. That breaks 1NF.
Fix: give Bailey two rows, one per phone number. Each cell holds exactly one value.
The book connects this to ACID. When each field holds one atomic value, updates are cleaner. You do not risk partially updating a list inside a single cell.
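The Bailey fix above can be sketched in a few lines of Python. This is my own illustrative helper (not from the book): it expands a row holding a comma-separated list into one row per value, which is exactly what 1NF asks for.

```python
# Hypothetical sketch of the 1NF fix: one (name, phone) row per number
# instead of a comma-separated list crammed into a single cell.

def normalize_phones(rows):
    """Expand rows like ("Bailey", "123-456, 987-654") into
    one (name, phone) row per phone number."""
    result = []
    for name, phones in rows:
        for phone in phones.split(","):
            result.append((name, phone.strip()))
    return result

unnormalized = [("Bailey", "123-456, 987-654")]
print(normalize_phones(unnormalized))
# [('Bailey', '123-456'), ('Bailey', '987-654')]
```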
Second Normal Form (2NF)
Rule: achieve 1NF, then remove partial dependencies. This matters when you have a composite primary key (a key made from two or more columns).
Example: a StudentCourses table with the key (StudentID, CourseID). The column CourseName depends only on CourseID, not on the full key. That is a partial dependency.
Fix: pull course details into their own table.
-- StudentCourses (just the relationship)
-- StudentID | CourseID
-- Courses (details about each course)
-- CourseID | CourseName | Instructor
-- CS 101 | Programming | Dr. Smith
-- CS 102 | Networking | Mrs. Logan
Now “Programming” and “Dr. Smith” are stored once, not repeated for every student taking CS 101.
Third Normal Form (3NF)
Rule: achieve 2NF, then remove transitive dependencies. That is when a non-key column depends on another non-key column instead of the primary key.
Example: a Students table where StudentID determines State, and State determines StateTaxCode. The chain goes: StudentID -> State -> StateTaxCode. That is a transitive dependency.
Fix: create a separate States table.
-- Students
-- StudentID | Name | City | State
-- States
-- State | StateTaxCode
-- New York | NY123
-- California | CA456
Now “NY123” lives in one row instead of being repeated for every New York student.
Boyce-Codd Normal Form (BCNF)
This is a stricter version of 3NF. The rule: for every functional dependency A -> B, A must be a superkey (it can uniquely identify every row).
The book uses a course scheduling example where Instructor determines Department, but Instructor is not a superkey. The fix is splitting into an Instructor_Department table and a Course_Instructor table.
In practice, most systems stop at 3NF. BCNF, 4NF, and 5NF exist for edge cases in large enterprise systems. The book covers them, but if you are starting out, get comfortable with 1NF through 3NF first. That covers 90% of real-world needs.
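To make the BCNF split concrete, here is a small sketch of the decomposition from the course scheduling example. The table and column names are my own, not the book's: the dependency Instructor -> Department gets factored into its own mapping, so each instructor's department is stored exactly once.

```python
# Illustrative BCNF decomposition: Instructor -> Department is pulled
# out into its own table; what remains is the course-instructor link.

def decompose(rows):
    """rows: (course, instructor, department) tuples."""
    instructor_department = {}   # instructor -> department, stored once
    course_instructor = []       # the remaining relationship
    for course, instructor, department in rows:
        instructor_department[instructor] = department
        course_instructor.append((course, instructor))
    return instructor_department, course_instructor

rows = [
    ("CS 101", "Dr. Smith", "Computer Science"),
    ("CS 102", "Dr. Smith", "Computer Science"),  # department repeated here...
]
inst_dept, course_inst = decompose(rows)
print(inst_dept)  # ...but stored once after the split
```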
Denormalization: When Rules Get in the Way
Here is the problem with heavy normalization: joins. The more you split tables, the more joins you need to put data back together. And joins are not free. They cost CPU and time.
Denormalization is when you deliberately combine tables or duplicate data to speed up reads. It is the opposite of normalization, and sometimes it is exactly what you need.
-- A denormalized orders table: everything in one place
-- OrderID | CustomerName | ProductName | Quantity | TotalPrice | OrderDate
-- 101 | Alexa | Laptop | 1 | 1000 | 2025-12-04
Yes, CustomerName is duplicated. Yes, TotalPrice is precomputed. But for a reporting dashboard that needs to show order summaries fast, this is much quicker than joining three tables every time.
Where denormalization makes sense:
- Data warehouses and analytics (read-heavy, not many writes)
- Real-time apps like stock trading or IoT, where latency matters
- Distributed systems where you want all needed data on one node
The book’s advice is good: normalize first, denormalize later. Get the structure right, then adjust for performance where you actually see bottlenecks.
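A minimal sketch of that tradeoff, with made-up table contents: keep the normalized tables as the source of truth, then pay the join cost once at refresh time to build a denormalized read model, instead of on every dashboard query.

```python
# "Normalize first, denormalize later": build a precomputed,
# read-optimized summary from two normalized tables.
# Table contents are illustrative, not from the book.

customers = {1: {"name": "Alexa"}}
orders = [
    {"order_id": 101, "customer_id": 1, "product": "Laptop",
     "quantity": 1, "unit_price": 1000},
]

def build_order_summaries(customers, orders):
    """Pay the join cost once at refresh time, not on every read."""
    return [
        {
            "order_id": o["order_id"],
            "customer_name": customers[o["customer_id"]]["name"],
            "product": o["product"],
            "total_price": o["quantity"] * o["unit_price"],
        }
        for o in orders
    ]

print(build_order_summaries(customers, orders))
```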
Data Modeling Best Practices
Nwokwu gives four practical tips:
Define the grain. What does one row represent? One transaction? One customer per month? One product per store per day? You cannot mix grains in the same table. Decide this with your business stakeholders, not in isolation.
Normalize now, denormalize later. Start clean. Then, when you see slow queries caused by too many joins, add targeted denormalization. Do not start with a messy flat table and hope for the best.
Choose the right data types. Use DECIMAL for money (not FLOAT). Use BIGINT for IDs that might grow past 2 billion. Use VARCHAR over TEXT when you know the max length. Small decisions, big impact at scale.
Use proper naming conventions. customer_id is clear. c1 is not. Pick a convention (snake_case, PascalCase) and stick to it across the whole database. Future you will be grateful.
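The DECIMAL-over-FLOAT tip is easy to demonstrate. Binary floats cannot represent 0.1 exactly, which is why money stored as FLOAT drifts by fractions of a cent; Python's decimal module behaves like SQL's DECIMAL here.

```python
# Why DECIMAL for money: binary floats cannot represent 0.1 exactly.
from decimal import Decimal

print(0.1 + 0.2 == 0.3)                                   # False: float rounding error
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True: exact decimal arithmetic
```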
Database Optimization
The last section of Chapter 5 tackles what to do when your database gets slow. The book uses a fictional e-commerce company called Sepora that grows from hundreds of daily transactions to thousands. On Christmas Eve, their website crawls because the database cannot keep up.
Indexing
An index is a data structure that helps the database find rows fast, without scanning the entire table. Think of it like the index at the back of a textbook.
-- Create an index on the product name column
CREATE INDEX idx_product_name ON products(name);
-- Now this query uses the index instead of scanning every row
SELECT * FROM products WHERE name = 'Laptop';
Without the index, the database reads every row in the products table. With it, the database jumps straight to matching rows. Big difference on a table with millions of rows.
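A toy model of that difference, using Python structures (my analogy, not the book's): the list comprehension is the full scan, and a dict keyed on the indexed column plays the role of the index, jumping straight to the matching rows.

```python
# Toy model of what an index buys you. Real indexes are usually
# B-trees, but a dict captures the idea: no full scan on lookup.

products = [
    {"id": 1, "name": "Laptop"},
    {"id": 2, "name": "Mouse"},
    {"id": 3, "name": "Laptop"},
]

# Full table scan: touches every row, O(n)
scan_hits = [p for p in products if p["name"] == "Laptop"]

# "Index" on name: built once, lookups are O(1) on average
index = {}
for p in products:
    index.setdefault(p["name"], []).append(p)
lookup_hits = index.get("Laptop", [])

print(scan_hits == lookup_hits)  # same rows, far less work at scale
```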
Partitioning
Partitioning splits a large table into smaller pieces, all on the same server. The database uses rules to know which partition to check.
-- MySQL syntax; note the primary key must include the partitioning
-- column (sale_date), or MySQL rejects the table definition
CREATE TABLE sales (
    sale_id INT,
    sale_date DATE,
    amount DECIMAL(10, 2),
    PRIMARY KEY (sale_id, sale_date)
) PARTITION BY RANGE (YEAR(sale_date)) (
    PARTITION p0 VALUES LESS THAN (2000),
    PARTITION p1 VALUES LESS THAN (2010),
    PARTITION p2 VALUES LESS THAN (2020),
    PARTITION p3 VALUES LESS THAN (MAXVALUE)
);
Now a query for 2021 sales only scans partition p3, not the entire table going back to 2005. Much faster.
Sharding
Sharding goes further than partitioning. Instead of splitting data within one server, you spread it across multiple servers. Each server (shard) holds a slice of the data.
Social media platforms do this all the time. User data gets split by user ID or region, so queries hit a nearby shard instead of one overloaded central database.
# Simple sharding logic: route a user to one of two shards by ID
def get_shard(user_id):
    if user_id % 2 == 0:
        return "shard1"
    else:
        return "shard2"
The key difference between partitioning and sharding: partitioning happens within one database server, sharding happens across multiple servers. Partitioning is for performance. Sharding is for scale.
Views
Views are virtual tables built from SQL queries. You define them once and reuse them. They simplify complex queries and can restrict what data users see (good for security).
CREATE VIEW EngineeringEmployees AS
SELECT id, name, salary
FROM employees
WHERE department = 'Engineering';
-- Now just query the view
SELECT name, salary FROM EngineeringEmployees WHERE salary > 50000;
Materialized views go one step further: they store the query results physically. Great for dashboards that do not need real-time data. Just refresh them on a schedule.
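The refresh-on-a-schedule idea can be sketched as a tiny cache (my own toy analogy, not a real database feature): the view stores the computed result, reads are cheap, and the data goes stale until the next refresh.

```python
# Toy sketch of the materialized-view idea: store query results
# physically and refresh on a schedule instead of recomputing per read.

class MaterializedView:
    def __init__(self, query_fn):
        self.query_fn = query_fn
        self.cached = None

    def refresh(self):     # run on a schedule, e.g. nightly
        self.cached = self.query_fn()

    def read(self):        # reads are cheap: no recomputation
        return self.cached

sales = [120, 80, 200]
view = MaterializedView(lambda: sum(sales))
view.refresh()
print(view.read())   # 400
sales.append(50)     # underlying data changed...
print(view.read())   # ...but the view stays stale until refreshed: still 400
```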
My Take
This is a dense chapter, but it is the kind of density that pays off. Normalization, indexing, partitioning, sharding. These are the tools you reach for when things break at scale. And things always break at scale.
If I had to pick one takeaway from this chapter: normalize first, denormalize later. It is the database equivalent of “measure twice, cut once.” Get your structure clean, then tune for speed where it actually hurts.
The next chapter moves from how to organize data within a database to where to put the data in the first place: warehouses, lakes, and lakehouses.
This is part 8 of 18 in my retelling of “Data Engineering for Beginners” by Chisom Nwokwu. See all posts in this series.
| < Previous: Data Modeling and ER Diagrams | Next: Data Warehouses, Data Lakes, and Lakehouses > |