Data Engineering with AWS Chapter 11: Ad Hoc Queries with Amazon Athena

This is post 17 in my Data Engineering with AWS retelling series.

You have a data lake. Terabytes of files sitting in S3 across landing zones, clean zones, and transform zones. The data is there. But how do you actually ask it questions? You could spin up a database, load everything into it, and then query. But that defeats the purpose of having a data lake in the first place.

Chapter 11 introduces Amazon Athena, a service that lets you point SQL at your S3 files and get answers. No servers. No loading. Just write a query, run it, and pay for what you scan.

What Is Amazon Athena?

Athena is a serverless, interactive query service. You write standard SQL, Athena reads the files directly from S3 and returns results. That is it. No infrastructure to provision. No clusters to manage. No data to import.

Under the hood, Athena is based on Presto, an open source distributed SQL engine originally built at Facebook. AWS launched Athena in 2016 as a managed wrapper around Presto, so you get the power of a distributed query engine without any of the operational headaches.

The pricing model is simple: you pay $5 per terabyte of data scanned. Read that again. You are not paying for compute time or reserved capacity. You are paying based on how much data Athena has to read to answer your query. This means the way you store and format your data directly impacts your bill.

Optimizing How You Store Data

Since you pay per terabyte scanned, anything that reduces the amount of data Athena reads saves you money and makes queries faster. Eagar covers several optimization techniques.

Use Columnar Formats

CSV and JSON are row-based. If you query one column out of fifty, Athena still reads all fifty columns. Columnar formats like Apache Parquet and ORC store data by column. Query one column and Athena only reads that column. The difference can be massive.

These formats also compress much better than CSV. Similar values sitting next to each other in a column compress more efficiently than mixed values across a row.

You can convert existing CSV data to Parquet using a CTAS (Create Table As Select) statement right inside Athena:

CREATE TABLE customers_parquet
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY'
) AS
SELECT * FROM customers_csv;

One query, and your CSV table becomes a Parquet table stored in S3. Snappy compression gives you a good balance between compression ratio and speed.

Partition Your Data

Partitioning means organizing your S3 files into a folder structure based on column values. Instead of dumping all sales data into one folder, you break it up by date:

/datalake/transform_zone/sales/YEAR=2021/MONTH=9/DAY=30/

When someone queries WHERE YEAR=2021 AND MONTH=9, Athena only scans the files in that specific folder. It skips everything else entirely. For a dataset covering years of daily transactions, this can reduce your scan volume by 99%.
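As a sketch of what this looks like in practice (the table and column names here are illustrative, not from the book), you declare the folder keys as partition columns when creating the table, load the partition metadata, and then filter on them:

```sql
-- Hypothetical external table over the sales prefix shown above.
CREATE EXTERNAL TABLE sales (
  order_id string,
  amount   double
)
PARTITIONED BY (year int, month int, day int)
STORED AS PARQUET
LOCATION 's3://datalake/transform_zone/sales/';

-- Discover the Hive-style key=value folders as partitions.
MSCK REPAIR TABLE sales;

-- Only files under year=2021/month=9/ are scanned.
SELECT SUM(amount)
FROM sales
WHERE year = 2021 AND month = 9;
```

Note that the partition columns never exist inside the files themselves; Athena derives their values from the folder path, which is exactly why it can skip whole folders without opening a single file.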

Optimize File Sizes

Too many small files slow Athena down. Each file requires a separate request to S3, and the overhead of opening and reading thousands of tiny files adds up fast.

Too few giant files are also a problem. Athena cannot parallelize effectively if your entire dataset is one 50 GB file.

The sweet spot is 128 MB to 1 GB per file. If you have lots of small files, use a Glue job or a CTAS query to merge them into larger ones.
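One way to do the merge inside Athena itself is a bucketed CTAS, which caps the number of output files. This is a sketch with hypothetical table names; the right bucket count depends on your data volume:

```sql
-- Rewrite a table made of thousands of tiny files into
-- roughly 10 larger Parquet files via a bucketed CTAS.
CREATE TABLE sales_compacted
WITH (
  format = 'PARQUET',
  bucketed_by = ARRAY['order_id'],
  bucket_count = 10
) AS
SELECT * FROM sales_small_files;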

Optimizing Your SQL

Beyond storage format, the SQL you write matters too.

Stop Using SELECT *

This is the biggest one. SELECT * tells Athena to read every single column. If you only need three columns out of forty, specify them. With columnar formats, Athena skips the columns you do not ask for. With SELECT *, it reads everything.
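The fix is as simple as it sounds. Column names here are illustrative:

```sql
-- Reads every column, even from Parquet.
SELECT * FROM orders;

-- Reads only the three columns the report actually needs.
SELECT order_id, customer_id, amount FROM orders;
```

On a wide Parquet table, the second query can scan a small fraction of the bytes the first one does, and the bill scales accordingly.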

Use approx_distinct Instead of COUNT(DISTINCT)

If you need an exact count of unique values, COUNT(DISTINCT column) is the correct query. But exact counts are expensive on large datasets.

If you can tolerate a small margin of error (around 2.3%), approx_distinct is dramatically faster:

-- Exact but slow
SELECT COUNT(DISTINCT customer_id) FROM orders;

-- Fast approximate, ~2.3% standard error
SELECT approx_distinct(customer_id) FROM orders;

For dashboards and monitoring where “roughly 1.2 million unique customers” is just as useful as “exactly 1,198,437”, the approximate version saves time and money.

Use regexp_like Instead of LIKE

If you need to check a column against multiple patterns, the typical approach chains LIKE conditions together:

SELECT * FROM logs
WHERE url LIKE '%/admin%'
   OR url LIKE '%/login%'
   OR url LIKE '%/dashboard%';

With regexp_like, you combine them into a single regex pattern:

SELECT * FROM logs
WHERE regexp_like(url, '/admin|/login|/dashboard');

Cleaner syntax. And for large datasets, it can be more efficient than multiple OR conditions.

Athena Federated Query

This is where Athena gets interesting beyond just S3. Normally Athena queries files in S3. But what if the data you need lives in DynamoDB? Or an RDS MySQL database? Or Redshift? Or even CloudWatch Logs?

Athena Federated Query lets you query all of these through a single SQL statement. Behind the scenes, it uses Lambda-based connectors. AWS provides pre-built connectors for many data sources, and you can write custom ones.

Here is the idea: you write one SQL query that joins data from your S3 data lake with data in a DynamoDB table and data in a MySQL database. Athena sends the relevant parts of the query to each connector Lambda, collects the results, joins them, and returns the answer. One query, multiple sources.
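A sketch of what such a query can look like, assuming connectors have already been registered as data sources — the catalog names (`ddb`, `mysql`) and all table and column names below are hypothetical:

```sql
-- Hypothetical federated join across three sources:
--   datalake.sales  -> S3 via the Glue catalog
--   ddb.*           -> DynamoDB via a Lambda connector
--   mysql.*         -> RDS MySQL via a Lambda connector
SELECT s.order_id,
       c.loyalty_tier,
       p.list_price
FROM   datalake.sales          s
JOIN   "ddb".default.customers c ON c.customer_id = s.customer_id
JOIN   "mysql".shop.products   p ON p.product_id  = s.product_id
WHERE  s.year = 2021;
```

Each connector Lambda receives only the piece of the query relevant to its source; Athena handles the join.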

This is powerful for analytics. Instead of building ETL pipelines to copy data from DynamoDB into S3 just so you can join it with other data, you query it in place.

Athena Workgroups: Governance and Cost Control

Once Athena becomes popular in your organization (and it will, because people love being able to query data without filing infrastructure requests), you need governance. That is where Workgroups come in.

A workgroup is a way to separate Athena users into groups with different settings and controls. Think of it like creating separate rooms for different teams.

Each workgroup gets:

  • Its own query history and saved queries – the marketing team sees only their queries, not the engineering team’s.
  • Enforced query result location – all results from this workgroup go to a specific S3 bucket. No more random query results scattered everywhere.
  • Encryption settings – enforce that all query results are encrypted at rest.
  • Per-query data scan limits – set a maximum of, say, 1 TB per query. Any query that would scan more gets cancelled before it runs. This prevents someone from accidentally running a full table scan on a 50 TB dataset.
  • Per-workgroup data scan limits – cap the total data scanned by the entire workgroup per day, week, or month. When the limit is hit, queries stop or an SNS notification fires.

This is how you make Athena safe for self-service analytics. You give each team their own workgroup, set reasonable limits, and let them query freely within those guardrails. The finance team cannot blow through your monthly AWS budget with one bad query.

The Hands-On: Querying Film Data

The chapter exercise sets up a workgroup and runs queries against the streaming_films table that was loaded in earlier chapters. You create a new workgroup with a dedicated S3 results location, then run queries to explore the dataset.

Nothing fancy here, just practical experience with the Athena console, workgroup configuration, and writing queries against real data. The point is to make you comfortable with the workflow: create workgroup, set permissions, write SQL, review results, check costs.

Key Takeaway

Athena is one of those AWS services that makes the data lake concept actually practical. Having millions of files in S3 is useless if querying them requires setting up and maintaining a database cluster. Athena removes that barrier entirely.

But the pay-per-scan model means you have to think about optimization. Use Parquet, not CSV. Partition your data. Right-size your files. Write precise queries instead of SELECT *. These are not optional best practices. They are the difference between a $5 query and a $500 query.

For organizations with multiple teams querying the same data lake, workgroups turn Athena from a free-for-all into a governed, cost-controlled analytics platform. And federated queries let you reach beyond S3 into databases and services without building pipelines to copy data around.

Chapter 11 is a shorter chapter, but it covers a service you will use constantly in any AWS data engineering role. Athena is where the data lake stops being a storage system and starts being an analytics platform.


Book: Data Engineering with AWS by Gareth Eagar | ISBN: 978-1-80056-041-3


