SQL Advanced Queries: JOINs, Subqueries, and Window Functions
This is Part 2 of Chapter 4, continuing from the SQL basics.
In Part 1 we covered how to pull data from one table. Filter it, sort it, count it. But real databases have many tables. Customers in one, orders in another, products in a third. The interesting stuff happens when you combine them.
That is what this part is about: joins, subqueries, window functions, and some practical advice on writing clean SQL.
JOINs: combining tables
Remember our two tables? The Customer table has four customers (IDs 1 through 4). The Orders table has four orders, but order 104 has customer_id = 5, which does not exist in the customer table. This mismatch is what makes join examples interesting.
INNER JOIN
The most common join. It returns only rows where there is a match in both tables.
SELECT
orders.order_id,
customers.customer_name,
orders.total_amount
FROM orders
INNER JOIN customers
ON orders.customer_id = customers.customer_id;
Result: you get three rows. Ada Herbert (orders 101 and 102) and Grace Turing (order 103). Alan Hopper and Margaret Bill do not appear because they have no orders. Order 104 does not appear because customer 5 does not exist.
Here is how it works: the database looks at every combination of rows from both tables and keeps only the ones where customer_id matches on both sides. No match, no row.
LEFT JOIN
Returns all rows from the left (first) table, plus matching rows from the right table. If there is no match, you get NULLs.
SELECT
customers.customer_id,
customers.customer_name,
orders.order_id,
orders.total_amount
FROM customers
LEFT JOIN orders
ON customers.customer_id = orders.customer_id;
Now you see all four customers. Ada and Grace have their orders. Alan Hopper and Margaret Bill show up with NULL for order_id and total_amount. They are in the customer table but have never ordered anything.
This is useful when you want to find inactive customers or spot missing data. “Show me everyone, even if they have not done anything yet.”
RIGHT JOIN
The mirror image of LEFT JOIN. All rows from the right (second) table, matched rows from the left.
SELECT
customers.customer_name,
orders.order_id,
orders.total_amount
FROM customers
RIGHT JOIN orders
ON customers.customer_id = orders.customer_id;
Now order 104 shows up with NULL for customer_name. The order exists, but the customer does not. This can happen after data migrations, when records get lost or were never recorded properly.
Honestly, RIGHT JOINs are rare in practice. Most people just flip the table order and use a LEFT JOIN. But it is good to know they exist.
FULL OUTER JOIN
Returns everything from both tables. Matches where possible, NULLs where not.
SELECT
customers.customer_name,
orders.order_id,
orders.total_amount
FROM customers
FULL OUTER JOIN orders
ON customers.customer_id = orders.customer_id;
You get all customers (including those with no orders) and all orders (including those with no matching customer). This is your “show me the complete picture” join. Good for data quality checks and finding mismatches between tables.
Subqueries
Sometimes you need the answer to one question before you can ask another. Who has the highest order total? First, find the max. Then, find who placed it.
SELECT
customers.customer_name,
orders.order_id,
orders.total_amount
FROM orders
JOIN customers
ON orders.customer_id = customers.customer_id
WHERE orders.total_amount = (
SELECT MAX(total_amount)
FROM orders
);
The inner query finds the maximum total_amount. The outer query uses that result to filter. Grace Turing, order 103, 400 dollars.
Subqueries are queries nested inside other queries. They work. But they can get messy fast.
CTEs: a cleaner way
Common Table Expressions (CTEs) solve the readability problem. Instead of nesting queries inside each other, you define a named temporary result at the top, then reference it below.
WITH avg_order AS (
SELECT AVG(total_amount) AS average_amount
FROM orders
)
SELECT *
FROM orders, avg_order
WHERE orders.total_amount >= avg_order.average_amount;
The WITH clause creates avg_order as a temporary result. Then you use it like a regular table. No nesting, easy to read, easy to debug.
Here is the thing: if you need the same calculation in multiple places, a CTE saves you from repeating yourself. Write the logic once, reference it by name. Your future self will thank you.
Set operations
These combine results from two separate queries.
UNION merges two result sets and removes duplicates:
SELECT customer_id FROM customers
UNION
SELECT customer_id FROM orders;
You get IDs 1 through 5. Customer IDs from both tables, no duplicates.
UNION ALL does the same but keeps duplicates. Faster because it skips the dedup step.
INTERSECT returns only rows that appear in both queries:
SELECT customer_id FROM customers
INTERSECT
SELECT customer_id FROM orders;
Result: 1 and 2. These are customers who actually placed orders.
EXCEPT returns rows from the first query that are not in the second:
SELECT customer_id FROM customers
EXCEPT
SELECT customer_id FROM orders;
Result: 3 and 4. Customers who exist but never ordered.
One rule: both queries must have the same number of columns with compatible data types.
Window functions
This is where SQL gets powerful. Window functions do calculations across a set of rows related to the current row, but without collapsing everything into groups like GROUP BY does. You keep every row and add a calculated column.
Ranking functions
The book uses an employee sales table:
| employee | region | sales |
|---|---|---|
| David | West | 900 |
| Eve | West | 800 |
| Frank | West | 800 |
| Grace | West | 700 |
| Bob | East | 600 |
| Alice | East | 500 |
| Charlie | East | 500 |
| Hannah | East | 400 |
ROW_NUMBER() gives each row a unique number within its group:
SELECT
employee, region, sales,
ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC) AS row_num
FROM sales;
Eve and Frank both have 800 in sales, but they get different row numbers (2 and 3). ROW_NUMBER always assigns unique numbers, even for ties.
Let me break down the syntax. OVER() defines the window. PARTITION BY region means “restart the numbering for each region.” ORDER BY sales DESC means “highest sales first.”
RANK() handles ties differently. Eve and Frank both get rank 2, and then Grace gets rank 4. Rank 3 is skipped. Like a sports tournament where two people tie for second place.
DENSE_RANK() also handles ties, but does not skip. Eve and Frank both get rank 2, Grace gets rank 3. No gaps.
When to use which? ROW_NUMBER when you need unique positions. RANK when ties should skip numbers. DENSE_RANK when you want continuous ranking despite ties.
Aggregate window functions
You can use SUM(), AVG(), and other aggregates as window functions. The difference from regular GROUP BY is that you keep every row.
SELECT
employee, region, sales,
SUM(sales) OVER (PARTITION BY region) AS total_sales_region,
AVG(sales) OVER (PARTITION BY region) AS avg_sales_region
FROM sales;
Every row in the West region shows total_sales_region = 3200 and avg_sales_region = 800. You can instantly compare each employee to their regional average without losing any detail.
LAG and LEAD
These let you peek at previous or future rows.
-- Previous month's sales
SELECT
month, sales,
LAG(sales) OVER (ORDER BY month) AS prev_month_sales
FROM sales;
-- Next month's sales
SELECT
month, sales,
LEAD(sales) OVER (ORDER BY month) AS next_month_sales
FROM sales;
LAG looks backward. LEAD looks forward. The first row of LAG is NULL (nothing before it). The last row of LEAD is NULL (nothing after it). These are great for trend analysis. “Did sales go up or down compared to last month?”
Best practices
The chapter ends with practical advice. Here is what I found most useful:
Avoid SELECT *. Only pull the columns you need. It is faster and makes your intent clear.
Filter early. Apply WHERE conditions before joins and aggregations. The less data you process, the faster your query runs.
Use CTEs for complex logic. Break things into named steps. Your queries become readable and maintainable.
Capitalize SQL keywords. Write SELECT, not select. Write FROM, not from. It is a convention that makes queries easier to scan.
Test on real data. A query that works on four rows might crawl on four million. Check your execution plans.
Setting up a practice lab
The book walks through setting up SQL Server Express and SQL Server Management Studio (SSMS). You create the sample tables, insert the data, and run the queries yourself. If you are following along with the book, I recommend doing this. Reading SQL is one thing. Writing it and seeing results is how it actually sticks.
The sample code is also available on GitHub at github.com/Sommie09/sql-chapter-examples.
My take
This second half of Chapter 4 is where the book earns its keep. Joins and window functions are the tools that separate beginners from people who can actually work with data. Nwokwu explains them clearly, with consistent examples and good visual analogies.
If you learn nothing else from this chapter, learn INNER JOIN, LEFT JOIN, and how to read a window function. Those three things will cover 80% of the SQL you write as a data engineer.
The CTE section is also worth your time. Once you start writing CTEs instead of nested subqueries, you will never go back. Your SQL becomes something other people can actually read.
Next up: Chapter 5 on data modeling and ER diagrams. We go from writing queries to designing the tables themselves.
This is part 6 of 18 in my retelling of “Data Engineering for Beginners” by Chisom Nwokwu. See all posts in this series.
| < Previous: SQL Basics | Next: Data Modeling & ER Diagrams > |