Data Engineering with AWS Chapter 4 Part 2: Data Governance in Practice
In Part 1, we covered the theory: what data security and governance mean, how catalogs prevent your lake from becoming a swamp, and the core AWS services for encryption and identity. Now it is time to put it into practice.
This is post 7 in my Data Engineering with AWS retelling series.
This part focuses on AWS Lake Formation – the service that makes managing data lake permissions actually manageable – and walks through the hands-on exercise from Chapter 4 where you configure real permissions from scratch.
The Problem Lake Formation Solves
Before Lake Formation existed, managing data lake permissions was a JSON policy nightmare.
Think about it. Your data lake has dozens of datasets spread across multiple S3 buckets and prefixes. Different teams need access to different datasets. Every time you add a new dataset or a new team member, someone has to update a JSON policy document. For every user, you need three layers of IAM permissions:
- Glue catalog permissions – access to specific databases and tables in the catalog.
- S3 permissions – access to the underlying physical files (Parquet, CSV, whatever) in the right S3 locations.
- Service permissions – access to the analytics tools themselves (Athena, Glue, EMR).
Now multiply that by 50 users across 10 teams with 200 datasets. Those JSON policies get long, complicated, and fragile. One wrong character and someone either gets locked out or gets access to data they should never see.
Lake Formation fixes this by adding a permission layer that works on top of IAM, using familiar database concepts like GRANT and REVOKE instead of raw JSON.
How Lake Formation Permissions Work
Lake Formation does not replace IAM. It works alongside it. The recommended approach is:
- IAM policies provide broad, coarse-grained access. Give users access to the Glue service, the Lake Formation service, and their analytics tools.
- Lake Formation provides fine-grained access. Control exactly which databases, tables, and even columns a specific user can see.
Here is the key benefit: with Lake Formation, users do not need direct S3 permissions at all. When a user queries data through a compatible service like Athena, Lake Formation provides temporary credentials to access the underlying S3 files. The user never touches S3 directly.
Compatible services at the time of the book’s writing include:
- Amazon Athena
- Amazon QuickSight
- Apache Spark on Amazon EMR
- Amazon Redshift Spectrum
- AWS Glue
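To make the IAM versus Lake Formation split a bit more concrete, here is a minimal sketch of what the coarse-grained IAM side might look like. The action list is my own illustrative assumption, not a policy from the book, and it is not complete (Athena, for example, still needs write access to its query results bucket). The one action worth noticing is lakeformation:GetDataAccess, which is what lets a compatible service request temporary credentials on the user's behalf.

```python
import json


def coarse_iam_policy() -> dict:
    """Service-level access only: no s3:* actions and no per-table Glue ARNs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "athena:StartQueryExecution",
                    "athena:GetQueryExecution",
                    "athena:GetQueryResults",
                    "glue:GetDatabase",
                    "glue:GetTable",
                    "glue:GetPartitions",
                    # Lets the query engine ask Lake Formation for temporary credentials.
                    "lakeformation:GetDataAccess",
                ],
                "Resource": "*",
            }
        ],
    }


policy = coarse_iam_policy()
print(json.dumps(policy, indent=2))
```

Notice there is nothing about specific tables or S3 prefixes in here. Under this model, that detail would live in Lake Formation grants.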
The Pass-Through Permission Trick
When you first set up Lake Formation, every database and table has a special permission assigned to a group called IAMAllowedPrincipals. This is the pass-through permission – it tells Lake Formation to skip its own permission checks and just let IAM handle everything.
This exists so that Lake Formation does not break your existing setup the moment you enable it. Your IAM policies keep working exactly as before. Lake Formation just sits there, doing nothing, until you explicitly activate it.
To activate Lake Formation permissions on a specific database or table, you revoke the IAMAllowedPrincipals permission. Once revoked, a user needs both IAM permissions and Lake Formation permissions to access that resource. This lets you migrate to Lake Formation gradually – one database at a time – instead of flipping a switch on your entire data lake.
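The rule described above boils down to a small piece of boolean logic. This is a toy model I wrote to illustrate it, not how Lake Formation is implemented internally; the function and parameter names are made up.

```python
def can_access(iam_allows: bool, lf_grants: bool, pass_through: bool) -> bool:
    """Toy model of the access decision for one resource.

    pass_through=True means IAMAllowedPrincipals is still granted on the
    resource, so only IAM is consulted. Once it is revoked, both IAM and
    Lake Formation must allow the request.
    """
    if pass_through:
        return iam_allows
    return iam_allows and lf_grants


# Before revoking IAMAllowedPrincipals: IAM alone decides.
print(can_access(iam_allows=True, lf_grants=False, pass_through=True))
# After revoking it, with no Lake Formation grant yet: access is denied.
print(can_access(iam_allows=True, lf_grants=False, pass_through=False))
```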
Hands-On: Building It From Scratch
The book walks through a complete exercise. Here is what it covers, step by step.
Step 1: Create an IAM Policy for Your Data Lake User
Start by creating a custom IAM policy based on the AmazonAthenaFullAccess managed policy. The twist is that you restrict the Glue permissions to a specific database (CleanZoneDB) instead of allowing access to everything:
"Resource": [
    "arn:aws:glue:*:*:catalog",
    "arn:aws:glue:*:*:database/cleanzonedb",
    "arn:aws:glue:*:*:database/cleanzonedb*",
    "arn:aws:glue:*:*:table/cleanzonedb/*"
]
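If you end up writing this resource list for more than one database, a small helper saves some typos. This is a sketch of my own, not something from the book; the function name is hypothetical, and it lowercases the database name only to match the ARNs shown above.

```python
from typing import List


def glue_resources(database: str, region: str = "*", account: str = "*") -> List[str]:
    """Build the Glue ARNs that scope a policy to a single catalog database."""
    db = database.lower()
    prefix = f"arn:aws:glue:{region}:{account}"
    return [
        f"{prefix}:catalog",
        f"{prefix}:database/{db}",
        f"{prefix}:database/{db}*",
        f"{prefix}:table/{db}/*",
    ]


for arn in glue_resources("CleanZoneDB"):
    print(arn)
```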
You also add S3 permissions so the user can access the actual data files in the clean zone bucket:
{
    "Effect": "Allow",
    "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:ListMultipartUploadParts",
        "s3:AbortMultipartUpload",
        "s3:PutObject"
    ],
    "Resource": [
        "arn:aws:s3:::dataeng-clean-zone-<initials>/*"
    ]
}
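One thing to keep in mind with S3 policies in general: some of these actions (s3:ListBucket, s3:GetBucketLocation, s3:ListBucketMultipartUploads) apply to the bucket itself, while the others apply to objects inside it. The statement above uses only the object-style ARN. The sketch below is my own variation, not the book's policy, and splits the same seven actions into a bucket-level statement and an object-level statement. The helper name is hypothetical.

```python
import json
from typing import Dict, List

# Actions that S3 evaluates against the bucket ARN rather than object ARNs.
BUCKET_LEVEL_ACTIONS = {
    "s3:GetBucketLocation",
    "s3:ListBucket",
    "s3:ListBucketMultipartUploads",
}


def s3_statements(bucket: str, actions: List[str]) -> List[Dict]:
    """Split actions into one bucket-level and one object-level statement."""
    bucket_actions = [a for a in actions if a in BUCKET_LEVEL_ACTIONS]
    object_actions = [a for a in actions if a not in BUCKET_LEVEL_ACTIONS]
    return [
        {"Effect": "Allow", "Action": bucket_actions,
         "Resource": [f"arn:aws:s3:::{bucket}"]},
        {"Effect": "Allow", "Action": object_actions,
         "Resource": [f"arn:aws:s3:::{bucket}/*"]},
    ]


ACTIONS = [
    "s3:GetBucketLocation",
    "s3:GetObject",
    "s3:ListBucket",
    "s3:ListBucketMultipartUploads",
    "s3:ListMultipartUploadParts",
    "s3:AbortMultipartUpload",
    "s3:PutObject",
]

# The bucket name keeps the <initials> placeholder from the exercise.
statements = s3_statements("dataeng-clean-zone-<initials>", ACTIONS)
print(json.dumps(statements, indent=2))
```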
Step 2: Create a Data Lake User
Create a new IAM user called datalake-user with console access and attach the policy you just built. This gives them Athena access scoped to CleanZoneDB only.
You also create an S3 bucket for Athena query results (something like aws-athena-query-results-dataengbook-<initials>). Athena needs somewhere to write results, and this bucket is it.
Step 3: Verify IAM-Only Permissions Work
Log in as datalake-user, open Athena, set the query results location, and run:
SELECT * FROM cleanzonedb.csvparquet
If everything is configured right, you see the data. At this point, permissions are managed entirely through IAM – the traditional approach.
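The book does this step in the Athena console. If you would rather script the same check, a rough equivalent with boto3 might look like the sketch below. It assumes boto3 is installed and that your credentials belong to datalake-user; the helper names are mine, and the results location keeps the <initials> placeholder from Step 2.

```python
def athena_query_request(sql: str, database: str, output_location: str) -> dict:
    """Build the keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request: dict) -> str:
    """Submit the query and return its execution ID (needs AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    athena = boto3.client("athena")
    return athena.start_query_execution(**request)["QueryExecutionId"]


request = athena_query_request(
    sql="SELECT * FROM cleanzonedb.csvparquet",
    database="cleanzonedb",
    output_location="s3://aws-athena-query-results-dataengbook-<initials>/",
)
print(request)
# run_query(request)  # uncomment to actually submit the query
```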
Step 4: Activate Lake Formation
Now switch to the Lake Formation model. Log in as your admin user and open the Lake Formation console. First time in, it will ask you to designate a data lake administrator. Add yourself.
Navigate to the cleanzonedb database and view its permissions. You will see two entries:
- DataEngLambdaS3CWGlueRole – the role that originally created the database. It has full permissions.
- IAMAllowedPrincipals – the pass-through permission that lets IAM handle everything.
Revoke the IAMAllowedPrincipals permission on the database. Then do the same for the CSVParquet table inside that database.
At this point, if you log in as datalake-user and try the same Athena query, it will fail. Lake Formation permissions are now active, but datalake-user has not been granted any Lake Formation permissions yet.
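The exercise does the revoke through the console, but the same step can be sketched against the Lake Formation API. This assumes the pass-through grant is recorded for the principal identifier IAM_ALLOWED_PRINCIPALS with the ALL permission (shown as Super in the console), and that the table is stored in the catalog as csvparquet. The helper names are hypothetical.

```python
from typing import Dict, List

PASS_THROUGH = "IAM_ALLOWED_PRINCIPALS"


def revoke_pass_through_requests(database: str, tables: List[str]) -> List[Dict]:
    """Build one revoke request for the database and one per table."""
    principal = {"DataLakePrincipalIdentifier": PASS_THROUGH}
    requests = [{
        "Principal": principal,
        "Resource": {"Database": {"Name": database}},
        "Permissions": ["ALL"],
    }]
    for table in tables:
        requests.append({
            "Principal": principal,
            "Resource": {"Table": {"DatabaseName": database, "Name": table}},
            "Permissions": ["ALL"],
        })
    return requests


def apply_revokes(requests: List[Dict]) -> None:
    """Send the requests to Lake Formation (needs data lake admin credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    lakeformation = boto3.client("lakeformation")
    for request in requests:
        lakeformation.revoke_permissions(**request)


requests = revoke_pass_through_requests("cleanzonedb", ["csvparquet"])
for request in requests:
    print(request["Resource"])
# apply_revokes(requests)  # uncomment to actually revoke
```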
Step 5: Grant Lake Formation Permissions
Back in the Lake Formation console as your admin user:
- Navigate to the CSVParquet table.
- Click Actions, then Grant.
- Select datalake-user from the IAM users dropdown.
- Under Columns, choose Exclude columns and select Age.
- For table permissions, check Select.
- Click Grant.
This is where Lake Formation shows its power. You just granted SELECT access to a table while excluding a specific column. Column-level permissions are not possible with IAM alone. This is a Lake Formation feature.
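For reference, here is roughly what the same grant could look like through the API instead of the console. Treat it as a sketch under assumptions: the account ID 111122223333 is a placeholder, the helper name is mine, and I am assuming the catalog stores the column as lowercase age.

```python
from typing import Dict, List


def grant_select_excluding(principal_arn: str, database: str, table: str,
                           excluded_columns: List[str]) -> Dict:
    """Build a GrantPermissions request: SELECT on all columns except some."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


def apply_grant(request: Dict) -> None:
    """Send the grant to Lake Formation (needs data lake admin credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("lakeformation").grant_permissions(**request)


grant = grant_select_excluding(
    principal_arn="arn:aws:iam::111122223333:user/datalake-user",  # placeholder account ID
    database="cleanzonedb",
    table="csvparquet",
    excluded_columns=["age"],  # assumes the catalog column name is lowercase
)
print(grant)
# apply_grant(grant)  # uncomment to actually grant
```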
Step 6: Test Column-Level Access
Log in as datalake-user again and run the same query:
SELECT * FROM cleanzonedb.csvparquet
The query works, but the Age column is missing from the results. Lake Formation silently excluded it because you specified that column exclusion when granting permissions.
This is a real-world pattern. Think about an HR dataset where you want analysts to see employee names and departments but not salaries. Or a healthcare dataset where researchers can see diagnosis codes but not patient names. Column-level access makes this clean and manageable.
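If it helps to picture the effect, here is a toy simulation of column exclusion using the HR example. It only models the visible result, not how Lake Formation or Athena enforce it, and the rows are made-up sample data.

```python
from typing import Dict, Iterable, List


def visible_rows(rows: Iterable[Dict], excluded_columns: Iterable[str]) -> List[Dict]:
    """Return copies of the rows with the excluded columns dropped."""
    hidden = {column.lower() for column in excluded_columns}
    return [
        {key: value for key, value in row.items() if key.lower() not in hidden}
        for row in rows
    ]


# Made-up rows, purely for illustration.
employees = [
    {"name": "Ana", "department": "Finance", "salary": 100000},
    {"name": "Ben", "department": "Support", "salary": 80000},
]

analyst_view = visible_rows(employees, excluded_columns=["salary"])
for row in analyst_view:
    print(row)
```

The analyst still gets every row, just with fewer columns, which is the same thing you saw with the Age column in the exercise.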
The Migration Path
The book highlights an important practical point about transitioning to Lake Formation. You do not have to do it all at once. The pass-through permission exists specifically to allow gradual migration:
- Start with IAM-only permissions (the traditional approach).
- Pick one database and revoke IAMAllowedPrincipals.
- Grant Lake Formation permissions to the users who need access.
- Simplify the IAM policy for those users – remove the fine-grained Glue and S3 restrictions since Lake Formation now handles that.
- Repeat for the next database.
Once all databases use Lake Formation, you can replace your custom IAM policies with the standard AmazonAthenaFullAccess managed policy. All the fine-grained access control lives in Lake Formation, and IAM just provides the broad “you can use these AWS services” permissions.
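During a gradual migration it is handy to know which databases are still in pass-through mode. The sketch below assumes you have already fetched permission entries shaped like the PrincipalResourcePermissions items returned by the Lake Formation ListPermissions API; the sample entries and the database name landingzonedb are made up for illustration, and the function name is mine.

```python
from typing import Dict, Iterable, List

PASS_THROUGH = "IAM_ALLOWED_PRINCIPALS"


def databases_in_pass_through(entries: Iterable[Dict]) -> List[str]:
    """Return the databases that still carry the IAMAllowedPrincipals grant."""
    pending = set()
    for entry in entries:
        principal = entry["Principal"]["DataLakePrincipalIdentifier"]
        database = entry["Resource"].get("Database")
        if principal == PASS_THROUGH and database:
            pending.add(database["Name"])
    return sorted(pending)


# Hypothetical entries: cleanzonedb has been migrated, landingzonedb has not.
sample_entries = [
    {"Principal": {"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/datalake-user"},
     "Resource": {"Table": {"DatabaseName": "cleanzonedb", "Name": "csvparquet"}},
     "Permissions": ["SELECT"]},
    {"Principal": {"DataLakePrincipalIdentifier": PASS_THROUGH},
     "Resource": {"Database": {"Name": "landingzonedb"}},
     "Permissions": ["ALL"]},
]

print(databases_in_pass_through(sample_entries))
```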
Putting All the Pieces Together
Chapter 4 covers a lot of ground, but it all connects:
- Catalog your data so users can find it and understand it. Use Glue Crawlers to automate this. Enforce metadata policies so every dataset has an owner, a source, and a sensitivity classification.
- Encrypt everything. At rest with KMS. In transit with TLS. No exceptions.
- Tokenize PII as the first processing step after ingestion. Keep the tokenization system completely separate from your analytics environment.
- Use federated identity so that when someone leaves the company, their analytics access dies with their Active Directory account.
- Apply least privilege everywhere. Start with the minimum permissions and add more only when needed.
- Use Lake Formation for fine-grained access control on your data lake. It is cleaner, more manageable, and supports column-level permissions that IAM cannot do.
None of this is technically difficult. The hard part is doing it consistently and getting organizational buy-in to enforce the policies. But skip it and you are building a data pipeline on a foundation of sand.
In the next chapter, we step back and look at the big picture: how to architect a complete data engineering pipeline from end to end.
Book: Data Engineering with AWS by Gareth Eagar | ISBN: 978-1-80056-041-3