Data Engineering with AWS Chapter 9 Part 1: Loading Data into a Data Mart with Redshift

This is post 14 in my Data Engineering with AWS retelling series.

Your data lake is humming along. Data lands in S3, gets cleaned, transformed, cataloged. Athena lets you run SQL queries on it. So why would you need anything else?

Because sometimes “good enough” performance is not good enough. When a business intelligence dashboard needs to refresh every few seconds, or analysts are running complex joins across billions of rows, querying files in S3 starts to feel slow. That is where data warehouses and data marts come in. Chapter 9 is about Amazon Redshift, the AWS service built to handle exactly this kind of hot data.

Not Everything Needs a Warehouse

Before diving into Redshift, the book makes a smart point about data temperature. Not all data deserves the same treatment.

Cold data is stuff nobody looks at regularly. Bank website logs from three years ago. Historical sensor readings from a factory floor. You keep it around for compliance or future ML training, but nobody queries it day to day. Stick it in S3 Glacier or Glacier Deep Archive and save money. Retrieval takes minutes to hours, but that is fine because you rarely need it.

Warm data is your active pipeline data. Daily ETL jobs process it, analysts query it on demand, and it flows through your landing, transformed, and curated zones. S3 Standard or S3 Standard-Infrequent Access handles this well. If you are not sure about access patterns, S3 Intelligent-Tiering automatically moves data between tiers based on how often it gets touched.

Hot data is the stuff that powers real-time dashboards, BI tools, and critical daily reports. Executives refreshing KPI dashboards, analysts running complex aggregations multiple times a day. This data needs low-latency, high-performance access. That is Redshift territory.

The key takeaway is that a data warehouse should only store hot data. Loading everything into Redshift just because you can is a waste of money and a misuse of the tool.

What Not to Do with a Data Warehouse

The book lists several anti-patterns that are worth memorizing.

Do not use it as a transactional database. Redshift is built for OLAP (online analytical processing), not OLTP (online transaction processing). It supports primary and foreign keys, but only as hints for the query optimizer. It does not enforce them. If you need a transactional database, use RDS with MySQL or PostgreSQL.

Do not use it as a data lake. Data warehouses require upfront schema design. They cannot store unstructured data like images or video. They cost more than S3 because you are paying for compute alongside storage. Keep your data lake as the single source of truth and only bring a curated subset into Redshift.

Do not stream individual records into it. Redshift is optimized for batch loading. If you have real-time IoT data, buffer it first with something like Kinesis Data Firehose, which can accumulate records for up to 15 minutes or 128 MB before loading a batch into Redshift.
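Once a batch has landed in S3, the usual way to load it is Redshift's COPY command, which ingests files in parallel across slices. A minimal sketch, assuming a hypothetical table, bucket path, and IAM role:

```sql
-- Load a batch of gzipped CSV files that Firehose delivered to S3.
-- The table name, S3 path, and IAM role here are placeholders.
COPY sales_events
FROM 's3://my-datalake/curated/sales_events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
GZIP;
```

COPY splits the file list across slices, which is why many medium-sized files load faster than one giant file.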

Do not store unstructured data in it. While Redshift can handle semi-structured data like JSON via the SUPER data type, it is not the place for images, videos, or deeply nested document formats. Use purpose-built tools like Amazon HealthLake for specialized data types like FHIR health records.

How Redshift Works Under the Hood

Understanding Redshift architecture helps you make better performance decisions.

A Redshift cluster has a leader node and one or more compute nodes. The leader node receives your query, creates an execution plan, and distributes work to the compute nodes. Each compute node is split into 2, 4, or 16 slices depending on the node type and size. Each slice gets its own portion of memory and storage and works in parallel with other slices.

Data is stored in a columnar format. Each column is persisted as 1 MB immutable blocks across the slices. When you run a query, the leader node figures out which slices have relevant data, the slices do their work in parallel, and the leader aggregates the results.

The critical thing to understand is that a query is only as fast as its slowest slice. If data is unevenly distributed, one slice ends up doing most of the work while others sit idle. That is why distribution style matters.

Distribution Styles: Where Your Rows Live

Redshift supports three distribution styles that determine how rows get spread across slices.

EVEN distribution spreads rows in round-robin fashion across all slices. Simple and balanced. Good for large fact tables where you want even workload distribution.

ALL distribution copies the entire table to every slice. This sounds wasteful, but it is perfect for small dimension tables. When you join a small dimension table (ALL) with a large fact table (EVEN), every slice already has the full dimension table locally. No data needs to shuffle between nodes. The join runs fast.

KEY distribution hashes a specific column value to determine which slice stores each row. If two tables both use the same column as their distribution key, rows that will be joined end up on the same slice. For example, if your products table and sales table both distribute on product_id, all rows with the same product ID land on the same slice. Joins on that column become very fast because no data moves between nodes.

But KEY distribution has a trap. If you filter on the distribution key column in a WHERE clause, all matching data lives on one slice, creating a bottleneck. Also, the column must have high cardinality and even distribution of values to avoid data skew.

The good news is that you do not have to get this right from day one. If you set the distribution style to AUTO (which is the default), Redshift uses machine learning to monitor your queries and automatically picks the best distribution style. Small tables start as ALL, grow into EVEN, and may eventually switch to KEY-based distribution as Redshift learns your query patterns.
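A sketch of what these styles look like in DDL, using an illustrative star-schema pair (table and column names are made up for this example):

```sql
-- Small dimension table: replicated to every slice so joins stay local.
CREATE TABLE dim_product (
    product_id   INTEGER NOT NULL,
    product_name VARCHAR(100)
) DISTSTYLE ALL;

-- Large fact table: DISTKEY implies DISTSTYLE KEY, so rows are hashed
-- on product_id and co-located with matching dimension rows.
-- Omitting DISTKEY and declaring DISTSTYLE EVEN would round-robin instead.
CREATE TABLE fact_sales (
    sale_id    BIGINT NOT NULL,
    product_id INTEGER NOT NULL,
    sale_date  DATE,
    amount     DECIMAL(12,2)
) DISTKEY (product_id);
```

Leaving the distribution clause off entirely gives you DISTSTYLE AUTO, the default described above.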

Sort Keys and Zone Maps

Beyond distribution, sorting matters a lot for performance.

Redshift maintains in-memory metadata called Zone Maps on the leader node. Zone Maps store the minimum and maximum values for each column in every 1 MB data block. When you run a query with a WHERE clause, Redshift checks the Zone Maps and skips blocks that cannot contain matching data. This can eliminate massive amounts of unnecessary disk reads.

Zone Maps work best when data within blocks is sorted. You control this with sort keys when creating a table.

Compound sort keys give priority to columns in the order you list them. The first column is sorted first, then the second within each group of the first, and so on. This is the default and recommended for most cases.

Interleaved sort keys give equal weight to each sort key column. Useful when you query on multiple columns equally, but less common in practice.

Sort keys should go on columns you frequently filter with range queries or use in aggregations. Like distribution style, sort keys can be set to AUTO, and Redshift will figure out the optimal sort key based on your actual query patterns.
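A compound sort key is declared at table creation; as a hedged sketch with hypothetical names:

```sql
-- Blocks are ordered by sale_date first, then region within each date,
-- so Zone Maps can skip blocks on date-range filters.
CREATE TABLE fact_orders (
    order_id  BIGINT,
    region    VARCHAR(20),
    sale_date DATE,
    amount    DECIMAL(12,2)
) COMPOUND SORTKEY (sale_date, region);

-- Or hand the decision back to Redshift based on observed query patterns.
ALTER TABLE fact_orders ALTER SORTKEY AUTO;
```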

Choosing the Right Node Type

Redshift offers three node families.

RA3 nodes decouple compute and storage. You pay separately for compute hours and managed storage. Hot data lives on local SSDs, and less frequently accessed data automatically moves to S3-backed managed storage. This is the recommended choice for most data warehouses over 1 TB.

DC2 nodes are compute-intensive with fixed local SSD storage. Compute and storage are coupled, meaning you add both when you need more of either. Good for smaller data warehouses under 1 TB.

DS2 nodes are legacy. Hard disk drives, coupled compute and storage. Not recommended for new clusters.

When you create a new cluster in the console, AWS can recommend a node type and count based on your data size and retention requirements.

Data Types: Pick the Smallest That Fits

Redshift supports six broad data type categories, and the book spends time on each one because picking the right type directly affects performance.

For strings, use CHAR for fixed-length single-byte data like phone numbers, and VARCHAR for variable-length or multi-byte data. Always use the smallest length that accommodates your data.

For numbers, use the smallest integer type that fits. SMALLINT for values up to 32,767, INTEGER up to about 2 billion, BIGINT for everything larger. Use DECIMAL when you need exact precision, and FLOAT types when approximation is acceptable.
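Put together, right-sizing looks something like this (the table and columns are illustrative, not from the book):

```sql
-- Smallest type that fits: less storage per block, more rows per
-- 1 MB block, fewer blocks scanned per query.
CREATE TABLE customer (
    customer_id  BIGINT,        -- surrogate keys can exceed 2 billion
    age          SMALLINT,      -- fits comfortably in 2 bytes
    country_code CHAR(2),       -- fixed-length, single-byte
    email        VARCHAR(254),  -- variable-length text
    balance      DECIMAL(12,2)  -- exact precision for money
);
```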

There are two special types worth knowing. HLLSKETCH stores results from the HyperLogLog algorithm, which efficiently estimates the number of unique values in huge datasets. Think counting unique daily visitors on a site with hundreds of millions of users. The error range is only 0.01 to 0.6 percent, and the speed and memory savings are enormous compared to a traditional COUNT DISTINCT.
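In practice you pre-aggregate sketches and merge them later. A sketch of the pattern, assuming hypothetical page_views data and Redshift's HLL functions:

```sql
-- One sketch per day instead of storing every raw user_id.
CREATE TABLE daily_visitor_sketch AS
SELECT visit_date, HLL_CREATE_SKETCH(user_id) AS visitors
FROM page_views
GROUP BY visit_date;

-- Approximate unique visitors over a week by merging daily sketches,
-- without rescanning the raw events.
SELECT HLL_CARDINALITY(HLL_COMBINE(visitors)) AS approx_unique_visitors
FROM daily_visitor_sketch
WHERE visit_date BETWEEN '2024-01-01' AND '2024-01-07';
```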

SUPER stores semi-structured data like JSON and arrays, up to 1 MB per column value. You can query it without imposing a schema first. Dynamic typing figures out data types at query time. If your JSON has hundreds of attributes, SUPER is significantly faster than unnesting everything into columns.
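A minimal SUPER example, with a made-up events table, using JSON_PARSE to ingest and PartiQL dot notation to query:

```sql
-- Store raw JSON in a SUPER column; no schema declared for the payload.
CREATE TABLE raw_events (
    event_id BIGINT,
    payload  SUPER
);

INSERT INTO raw_events
VALUES (1, JSON_PARSE('{"customer": {"id": 42, "plan": "pro"}, "action": "login"}'));

-- Dynamic typing resolves the attribute types at query time.
SELECT payload.customer.plan, payload.action
FROM raw_events
WHERE payload.customer.id = 42;
```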

Key Takeaway

Redshift is not a replacement for your data lake. It is a performance layer for hot data that needs fast, complex querying. Understanding its architecture – slices, distribution styles, sort keys, zone maps – helps you design tables that take full advantage of its parallel processing engine. But if you are just getting started, lean on AUTO settings and let Redshift optimize itself based on your actual query patterns.

In Part 2, we will look at how Redshift connects to your S3 data lake through Spectrum, external tables, and how data moves between the two.


Book: Data Engineering with AWS by Gareth Eagar | ISBN: 978-1-80056-041-3


Previous: Chapter 8 - Data Consumers
Next: Chapter 9 Part 2 - Bridging Data Lake and Data Warehouse
