Data Engineering with AWS Chapter 9 Part 2: Bridging Data Lake and Data Warehouse
This is post 15 in my Data Engineering with AWS retelling series.
In Part 1, we looked at Redshift internals – clusters, slices, distribution styles, sort keys. All the pieces that make a data warehouse fast. But a warehouse sitting in isolation is not very useful. Data needs to flow in from your data lake, and sometimes it needs to flow back out. Part 2 of Chapter 9 covers that bridge between S3 and Redshift, including Redshift Spectrum, the COPY and UNLOAD commands, and a hands-on exercise that ties it all together.
Table Types: More Than Just Regular Tables
Redshift supports several table types beyond the standard local table. Knowing when to use each one is key to building a system that is both fast and cost-effective.
Local tables are the default. Data lives on the compute node’s local disk, gets replicated for fault tolerance, and benefits from all of Redshift’s columnar compression and automatic optimization. This is where your hot data goes for maximum query performance.
External tables do not store data inside Redshift at all. They are schema objects that point to data in the AWS Glue Data Catalog or an Amazon EMR Hive Metastore. When you query an external table, Redshift Spectrum reads the data directly from S3. The performance is good but generally not as fast as querying local tables. The trade-off is that you avoid duplicating data and you can query your data lake’s single source of truth directly from Redshift.
Temporary tables are session-specific and get dropped automatically when the session ends. They are not replicated and do not trigger incremental backups, which makes insert operations faster. The most common use case is staging data for upsert-style operations. Since Redshift does not enforce primary keys or support native UPSERT, you load new data into a temporary table, then use an INNER JOIN against your permanent table to handle updates and inserts manually. The AWS documentation on merge operations covers this pattern in detail.
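The staging merge pattern described above can be sketched in a few statements. This is a minimal, illustrative sketch; the table names, key column, bucket path, and role ARN are all hypothetical placeholders:

```sql
-- Illustrative staging-merge sketch; table, column, and ARN values are placeholders.
BEGIN TRANSACTION;

-- Stage the incoming batch in a temp table shaped like the target.
CREATE TEMP TABLE sales_staging (LIKE sales);
COPY sales_staging FROM 's3://my-data-lake/incoming/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
FORMAT AS PARQUET;

-- Delete the rows that are about to be replaced...
DELETE FROM sales
USING sales_staging
WHERE sales.sale_id = sales_staging.sale_id;

-- ...then insert the fresh versions, covering both updates and new rows.
INSERT INTO sales SELECT * FROM sales_staging;

END TRANSACTION;
```

Because the temp table is dropped at session end and skips replication and backups, the staging load stays cheap even when you run it on every batch.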
Materialized views precompute expensive queries and store the results. If your BI dashboard runs the same complex join with aggregations every time someone loads it, a materialized view runs that query once and caches the output. Users then query the view instead of the base tables. The catch is that materialized views do not auto-refresh when underlying data changes. You need to run a REFRESH MATERIALIZED VIEW statement after data loads.
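As a minimal sketch of that lifecycle (the view and table names here are hypothetical, not from the book):

```sql
-- Hypothetical example: cache an expensive aggregation for a BI dashboard.
CREATE MATERIALIZED VIEW daily_revenue_mv AS
SELECT order_date, SUM(amount) AS total_revenue
FROM orders
GROUP BY order_date;

-- Dashboards query the view instead of the base table.
-- Rerun this after each data load to pick up new rows:
REFRESH MATERIALIZED VIEW daily_revenue_mv;
```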
Redshift Spectrum: Query S3 Without Loading Data
Redshift Spectrum is probably the most important feature for bridging the lake and the warehouse. It lets you run SQL queries against data sitting in S3, using the full power of Redshift’s MPP (Massively Parallel Processing) engine, without copying a single byte into your cluster.
Here is how it works. You create an external schema in Redshift that points to a database in the AWS Glue Data Catalog. That Glue database contains table definitions that map to files in S3. Once the schema is linked, you can query those external tables just like local tables. You can even join external tables with local tables in the same query.
A common real-world pattern is the 80/20 split. Say 80 percent of your queries hit data from the past 12 months. Load that into Redshift locally for maximum speed. The other 20 percent of queries need historical data from the past 5 years. Keep that in S3 and query it through Spectrum. Most users get fast results, and the occasional historical query still works, just a bit slower.
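A query spanning both tiers might look like this sketch, assuming a local table for the hot 12 months and a Spectrum external table for the cold history (schema and table names are illustrative):

```sql
-- Hypothetical 80/20 query: recent orders live in a local table,
-- older years stay in S3 behind a Spectrum external schema.
SELECT o.order_month, SUM(o.amount) AS revenue
FROM (
    SELECT DATE_TRUNC('month', order_date) AS order_month, amount
    FROM local_schema.orders_recent          -- hot 12 months, local storage
    UNION ALL
    SELECT DATE_TRUNC('month', order_date) AS order_month, amount
    FROM spectrum_schema.orders_history      -- cold 5 years, read from S3
) AS o
GROUP BY o.order_month
ORDER BY o.order_month;
```

To the person writing the query, the storage split is invisible; only the per-query scan cost and latency differ.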
There is a cost difference to understand. Redshift cluster charges are fixed – you pay for node type and count regardless of how many queries you run. Redshift Spectrum charges are based on the amount of data scanned per query. For frequently queried datasets, loading data locally is usually cheaper. For occasional queries against large historical datasets, Spectrum saves you from provisioning a much bigger cluster.
Spectrum also lets Redshift read file formats it does not support natively, such as Amazon Ion, Grok, RCFile, and SequenceFile.
Loading Data with COPY
The recommended way to bulk load data into Redshift is the COPY command. It can read from S3, DynamoDB, Amazon EMR, or remote SSH hosts. It supports CSV, Parquet, Avro, JSON, ORC, and several other formats.
The key to fast COPY operations is parallelism. Remember those slices from Part 1? Each slice can ingest a file independently. If your cluster has 16 slices, split your data into a number of files that is a multiple of 16. Each file should be between 1 MB and 1 GB after compression.
For example, a cluster with 4 ra3.4xlarge nodes has 16 slices (4 slices per node). If you have a 64 GB file, split it into 64 files of 1 GB each. Each slice ingests 4 files in parallel. This is significantly faster than loading one giant file.
A few important COPY details:
- The entire COPY operation is a single transaction. If one file fails, the whole operation rolls back.
- You need to specify credentials that Redshift can use to read the source data. AWS recommends an IAM role rather than access keys.
- Avoid using INSERT statements for bulk data. Single-row inserts are dramatically slower than COPY. If you must use INSERT, at least use multi-row insert syntax to batch multiple rows per statement.
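Putting those pieces together, a typical COPY looks like this sketch; the table name, S3 prefix, and role ARN are placeholders:

```sql
-- Illustrative COPY from S3; bucket, prefix, and role ARN are placeholders.
-- Every file under the prefix is loaded, spread across the slices in parallel.
COPY listings
FROM 's3://my-data-lake/landing/listings/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
FORMAT AS CSV
IGNOREHEADER 1;
```

Because the whole COPY is one transaction, either every file under the prefix lands or none of them do.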
You can also load data directly from Spark on EMR using the Spark-Redshift JDBC driver. Behind the scenes, Spark writes the DataFrame to a temporary S3 location and then executes a COPY command. AWS Glue has built-in Redshift connection support that works similarly.
Exporting Data with UNLOAD
Data does not always flow in one direction. Sometimes you process data in Redshift and need to export results back to the data lake for other consumers. The UNLOAD command handles this.
UNLOAD runs a SELECT query and writes the results to S3. Multiple slices write in parallel, with each output file up to 6.2 GB by default. For most use cases, set MAXFILESIZE to 1 GB for manageable file sizes.
Best practices for UNLOAD:
- Use Parquet format. Unloading as Parquet is up to twice as fast as text format, the files are compressed, and the data is analytics-ready for anything else in your lake.
- Use the PARTITION option. Specify columns to partition by, and Redshift writes data in Hive-style partitioning (for example, year=2021/month=July/000.parquet). This makes the exported data immediately queryable by Athena or Glue with partition pruning.
- Use ORDER BY. Especially if you plan to load the data back into Redshift later. Pre-sorted data loads more efficiently.
- Use CLEANPATH or ALLOWOVERWRITE for regular exports. CLEANPATH removes existing files in the path before writing. ALLOWOVERWRITE replaces them in place.
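An UNLOAD that applies these practices might look like the following sketch; the query, output path, and role ARN are placeholders:

```sql
-- Illustrative UNLOAD applying the practices above; paths and ARN are placeholders.
UNLOAD ('SELECT * FROM listings ORDER BY city, price')
TO 's3://my-data-lake/curated/listings/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole'
FORMAT AS PARQUET
PARTITION BY (city)
MAXFILESIZE 1 GB
CLEANPATH;
```

The result is a Hive-partitioned, compressed Parquet dataset under the target prefix, ready for Athena or any other lake consumer.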
The Hands-On: A Travel Agency Data Mart
The chapter ends with a practical exercise that puts all these concepts together. The scenario is a travel agency that needs to find the best Airbnb listings near popular New York City tourist attractions.
Here is the flow:
1. Upload data to S3. Airbnb listing CSV files for New York City and Jersey City go into the landing zone, partitioned by city. You can verify the upload using S3 Select to query the CSV files directly in the console.
2. Set up IAM roles. The Redshift cluster needs an IAM role with S3, Glue, and Athena permissions so Spectrum can read from the data lake and register external tables in the Glue catalog.
3. Create the Redshift cluster. A single dc2.large node using the free trial. Connect via the Redshift query editor.
4. Create an external schema and table. This links Redshift to the Glue Data Catalog:
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'accommodation'
IAM_ROLE 'arn:aws:iam::1234567890:role/AmazonRedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
Then define the external table with columns matching the CSV structure, partitioned by city. After adding partitions for each city, you can query the S3 data through Redshift Spectrum and also through Athena, since both use the same Glue catalog.
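Registering the city partitions can be sketched like this; the schema, table, and S3 paths are illustrative placeholders:

```sql
-- Hypothetical partition registration; schema, table, and paths are placeholders.
ALTER TABLE spectrum_schema.listings
ADD IF NOT EXISTS PARTITION (city = 'new_york_city')
LOCATION 's3://my-data-lake/landing/listings/city=new_york_city/';

ALTER TABLE spectrum_schema.listings
ADD IF NOT EXISTS PARTITION (city = 'jersey_city')
LOCATION 's3://my-data-lake/landing/listings/city=jersey_city/';
```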
5. Create a local table and load data. For hot data that gets queried often, create a local Redshift table with only the columns you need:
CREATE TABLE accommodation_local.listings(
listing_id INTEGER,
name VARCHAR(100),
neighbourhood_group VARCHAR(100),
neighbourhood VARCHAR(100),
latitude DECIMAL(8,6),
longitude DECIMAL(9,6),
room_type VARCHAR(100),
price SMALLINT,
minimum_nights SMALLINT,
city VARCHAR(40))
DISTKEY(listing_id)
SORTKEY(price);
Then use INSERT…SELECT to pull data from the external Spectrum table into the local table.
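That load is a single statement; the column list below assumes the local table definition above and is otherwise illustrative:

```sql
-- Illustrative hot-data load from the external Spectrum table into the local table.
INSERT INTO accommodation_local.listings
SELECT listing_id, name, neighbourhood_group, neighbourhood,
       latitude, longitude, room_type, price, minimum_nights, city
FROM spectrum_schema.listings;
```

From here on, the frequently queried columns live on local disk with the DISTKEY and SORTKEY applied, while the full raw dataset stays in S3.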
6. Run advanced queries. The exercise uses Redshift spatial functions to calculate distances between Airbnb listings and tourist spots like the Freedom Tower and Empire State Building. The ST_Point function converts longitude and latitude into point geometry, and ST_DistanceSphere calculates the distance between two points in meters, which you can divide by 1,000 for kilometers. You can find the 100 closest listings to any landmark, sorted by distance.
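A distance query in that style might look like this sketch; the coordinates are approximate and the table and column names are assumed, not taken from the book:

```sql
-- Hypothetical distance query: the 100 listings closest to the
-- Empire State Building (approx. longitude -73.9857, latitude 40.7484).
-- ST_DistanceSphere returns meters, so divide by 1000 for kilometers.
SELECT listing_id, name,
       ST_DistanceSphere(
           ST_Point(longitude, latitude),
           ST_Point(-73.9857, 40.7484)
       ) / 1000 AS distance_in_km
FROM accommodation_local.listings
ORDER BY distance_in_km
LIMIT 100;
```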
7. Create a materialized view. Since travel agents will run these distance queries repeatedly, a materialized view precomputes the distance between every listing and every tourist spot. Subsequent queries hit the view instead of recalculating everything:
SELECT * FROM listings_touristspot_distance_view
WHERE tourist_spot LIKE 'Empire%'
ORDER BY distance_in_km LIMIT 100;
Fast. No recalculation. The view just needs a refresh when new listing data gets loaded.
Key Takeaway
The real power of this chapter is not just Redshift in isolation. It is the bridge between your data lake and your data warehouse. Spectrum lets you query S3 directly when you do not need local performance. COPY lets you pull hot data into Redshift when you do. UNLOAD lets you push processed data back to the lake. And materialized views cache expensive computations so they run once instead of thousands of times.
The lakehouse pattern – data lake as single source of truth, data warehouse for performance-critical queries, Spectrum bridging the two – is how modern analytics architectures actually work in production. Chapter 9 gives you the practical tools to build it.
Next up, Chapter 10 covers orchestrating the data pipeline, tying all these individual components into an automated workflow.
Book: Data Engineering with AWS by Gareth Eagar | ISBN: 978-1-80056-041-3
Previous: Chapter 9 Part 1 - Loading Data into a Data Mart with Redshift
Next: Chapter 10 - Orchestrating the Data Pipeline