Data Engineering with AWS Chapter 4 Part 1: Data Cataloging and Security
You can have the fastest data pipeline on the planet. You can have the slickest dashboards, the fanciest machine learning models, the most optimized Parquet files. None of it matters if your data gets stolen, mishandled, or dumped into a lake that nobody can navigate.
This is post 6 in my Data Engineering with AWS retelling series.
Chapter 4 of Gareth Eagar’s book is about the stuff that is not glamorous but absolutely essential: cataloging your data so people can find it, securing it so bad actors cannot steal it, and governing it so your organization does not end up on the news for all the wrong reasons.
Why Data Security Is Not Optional
Let me hit you with some numbers. Equifax had a data breach in 2017 that exposed the personal and financial information of nearly 150 million people. The settlement cost them at least $575 million. Google got fined over $50 million for not properly complying with GDPR in Europe. According to CSO Online, data breach penalties have cost companies over $1.3 billion total.
And those are just the fines. The reputational damage is even worse. Once your customers stop trusting you, good luck getting that trust back.
As a data engineer, you are the person building the pipes that move sensitive data around. This is your problem to solve.
Security vs. Governance: Know the Difference
These two terms get thrown around together, but they mean different things.
Data security is about protecting data from unauthorized access. Encryption, firewalls, access controls, preventing ransomware attacks, keeping data from being stolen and sold on the dark web. It is the digital equivalent of locks on doors and cameras in hallways.
Data governance is about making sure the right people have access to the right data, and that your organization only uses personal data in approved ways. It is less about hackers and more about policies. Who should see this dataset? Are we allowed to store this information? Do we need to delete it if a customer asks?
Both are mandatory. Get either one wrong and you are in trouble.
The Regulatory Alphabet Soup
No matter where you operate, there are laws about how you handle personal data. Here are the big ones:
- GDPR (General Data Protection Regulation) in the European Union. This one has teeth. It applies to you even if your company is not in the EU, as long as you hold data on EU residents.
- CCPA/CPRA (California Consumer Privacy Act / California Privacy Rights Act) in California.
- PDP Bill (Personal Data Protection Bill) in India.
- POPIA (Protection of Personal Information Act) in South Africa.
- HIPAA for healthcare data in the US.
- PCI DSS for credit card data.
These laws generally give individuals the right to know what data a company holds about them, demand proper protection of that data, and in some cases request deletion. GDPR even requires some organizations to appoint a Data Protection Officer.
If your organization has a CISO or DPO, talk to them. Understand which regulations apply to your data before you build anything. AWS also offers a service called AWS Artifact that gives you on-demand access to AWS compliance reports if you ever face an audit.
Core Data Protection Concepts
Before we talk about AWS tools, you need to understand the building blocks of data protection.
PII and Personal Data
Personally Identifiable Information (PII) is anything that can identify a person: names, social security numbers, IP addresses, photos, medical conditions, even location data. GDPR uses the broader term “personal data” which covers basically the same ground plus a few extras.
Encryption: At Rest and In Transit
Encryption scrambles your data using a key so it becomes unreadable without that key. There are two types, and you need both:
- Encryption in transit protects data as it moves between systems. If someone intercepts the data stream, they get garbage. The standard approach is TLS (Transport Layer Security) for all communications.
- Encryption at rest protects data sitting on a disk. After processing, all persisted data should be encrypted.
Anonymization vs. Pseudonymization
Anonymized data has PII removed permanently. You cannot reverse it. The problem? Even minimal data can identify people. One study found that just zip code, gender, and date of birth can uniquely identify 87% of the US population.
Pseudonymized data (or tokenization) replaces PII with random tokens. The key difference is you can get the original data back through a secure tokenization system. The token itself is random and cannot be reverse-engineered. But the tokenization system that maps tokens to real values must be kept completely separate from your analytics systems and locked down tight.
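To make the idea concrete, here is a toy in-memory vault (my own illustration, not from the book). A production tokenization system would be a separate, hardened service; the point is only that the token is random, so it reveals nothing, and only the vault can map it back.

```python
import secrets


class TokenVault:
    """Toy tokenization vault. In practice this runs as a separate,
    locked-down service; analytics systems only ever see the tokens."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        # Reuse the existing token so the same value always maps to one token,
        # which keeps joins on tokenized columns working.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = secrets.token_hex(16)  # random, not derived from the value
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        # Only the vault can reverse a token; there is nothing to crack.
        return self._token_to_value[token]
```

Because the token is generated with `secrets` rather than computed from the value, an attacker holding only the analytics dataset has no way to recover the original PII.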
A word about hashing: it is the weakest approach for de-identifying data. The SHA-256 hash of “John Smith” always produces the same output. Rainbow tables exist that map common names and social security numbers to their hashes. You can literally Google a hash and find the original value. Even salting the hash is not great for data with limited possible values.
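You can see the weakness in a few lines of Python (my own demonstration): because SHA-256 is deterministic, anyone can precompute hashes of common values and look yours up.

```python
import hashlib


def sha256_hex(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# The hash of "John Smith" is identical every time, for everyone.
# An attacker precomputes hashes of common names (a rainbow table)
# and simply looks up the "de-identified" value.
rainbow_table = {sha256_hex(name): name for name in ["John Smith", "Jane Doe"]}

leaked_hash = sha256_hex("John Smith")
print(rainbow_table[leaked_hash])  # prints: John Smith
```

With a small value space (names, nine-digit social security numbers), brute-forcing every possibility is cheap even when the hash is salted, which is why tokenization is preferred.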
Authentication and Authorization
Authentication answers “who are you?” – verifying identity through passwords, MFA tokens, or federated identity systems like Active Directory.
Authorization answers “what can you access?” – once your identity is confirmed, determining which resources you are allowed to touch.
These work together. First you prove who you are, then the system decides what you can see and do.
The AWS Glue Data Catalog
Now let us talk about keeping your data organized. Without organization, your data lake becomes a data swamp. Eagar uses a great analogy: a lake is beautiful and useful. A swamp is a mess where things just sit and rot. Nobody wants to visit a swamp when they were promised a lake.
The fix is a data catalog – a searchable record of every dataset in your lake with metadata that tells users what the data is, where it came from, and whether they can trust it.
There are two flavors:
Technical catalogs map physical files to logical representations (databases and tables). They track file locations, schemas, column data types, partition info. The AWS Glue Data Catalog is a Hive Metastore-compatible technical catalog. Services like Amazon Athena use it to know where data lives and how to read it.
Business catalogs capture context: who owns this data, which business unit it belongs to, how often it updates, its sensitivity classification, and how it relates to other datasets. Tools like Collibra and Informatica are popular business catalog options.
The Glue Data Catalog does both to some extent. It handles the technical metadata natively and lets you attach key-value properties to tables for business metadata – things like data_owner: marketing_team or contains_pii: true.
Avoiding the Data Swamp
Having a catalog tool is not enough. You also need policies:
- Every dataset must be cataloged. If data enters the lake without a catalog entry, you are one step closer to swamp territory.
- Business metadata is mandatory. Technical metadata alone is not useful to business users. They need to know the data source, owner, sensitivity level, update frequency, and which data lake zone it belongs to.
- Automate catalog updates. AWS Glue Crawlers can scan data sources, infer schemas, and populate the catalog automatically. Set them to run after every data ingestion.
Key metadata to capture for every dataset:
- Data source
- Data owner
- Sensitivity classification (public, general, sensitive, confidential, PII)
- Data lake zone (raw, transformed, enriched)
- Cost allocation tag (business unit, department)
AWS Services for Encryption and Monitoring
AWS Key Management Service (KMS)
KMS is the backbone of encryption on AWS. It manages the keys you use to encrypt and decrypt data, and it integrates with almost every analytics service: S3, Athena, Redshift, EMR, Glue, Kinesis, Lambda, and more.
A practical tip from the book: use S3 Bucket Keys to encrypt all objects in a bucket with a single KMS key. It is significantly cheaper than encrypting each object individually with SSE-KMS.
Protect your KMS keys carefully. If a key gets deleted, any data encrypted with it is gone forever. AWS enforces a 7 to 30 day waiting period before deletion, and you should set up CloudWatch alarms that alert you whenever a key is scheduled for deletion so you can react during that window. If you use AWS Organizations, create a Service Control Policy to prevent anyone from deleting KMS keys in child accounts.
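Such a Service Control Policy is just a JSON document that denies the deletion actions. A sketch (policy name and statement are my own illustration):

```python
import json

# A Service Control Policy (sketch) that blocks KMS key deletion in any
# account it is attached to, even for administrators in that account.
deny_kms_deletion_scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyKmsKeyDeletion",
            "Effect": "Deny",
            "Action": [
                "kms:ScheduleKeyDeletion",
                "kms:DeleteImportedKeyMaterial",
            ],
            "Resource": "*",
        }
    ],
}

# Attach it from the management account with AWS Organizations, e.g.:
# boto3.client("organizations").create_policy(
#     Name="deny-kms-key-deletion",
#     Description="Prevent KMS key deletion in member accounts",
#     Type="SERVICE_CONTROL_POLICY",
#     Content=json.dumps(deny_kms_deletion_scp),
# )
```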
Amazon Macie
Macie uses machine learning and pattern matching to scan S3 buckets for sensitive data like names, addresses, and credit card numbers. It alerts you when it finds PII that should not be there and can trigger automated remediation through AWS Step Functions. Think of it as a security guard that never sleeps, constantly checking your data for things that should have been tokenized.
Amazon GuardDuty
GuardDuty monitors your entire AWS account for malicious activity by analyzing CloudTrail events, VPC flow logs, and DNS logs. It is not analytics-specific, but it protects the environment your analytics runs in.
IAM: The Gatekeeper
AWS Identity and Access Management handles both authentication and authorization. Here are the key identities:
- Root user – the email address that created the account. Has full access. Never use this for daily work.
- IAM Users – individual identities with usernames, passwords, and optional access keys for CLI/API use.
- IAM Groups – collections of users that share the same permissions.
- IAM Roles – identities without passwords that users or services can “assume” to get temporary permissions. Used heavily for Lambda functions, federated login, and cross-service access.
Permissions are granted through IAM policies – JSON documents that allow or deny specific actions on specific resources. There are three types:
- AWS managed policies – pre-built by AWS for common use cases like AdministratorAccess or DatabaseAdministrator.
- Customer managed policies – you build these for fine-grained control, like read-only access to one specific S3 bucket.
- Inline policies – tied directly to a single user, group, or role.
Here is a quick example of a customer managed policy granting read access to a specific bucket:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::de-landing-zone"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::de-landing-zone/*"]
    }
  ]
}
You can add conditions to restrict access by IP address, time of day, or other factors. The principle is always least privilege – give users only the minimum access they need to do their job.
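For example, the read-only policy above could be restricted to requests from a corporate network by adding a Condition block (the CIDR below is a documentation-range placeholder, not a real network):

```python
import json

# The same read-only access as the earlier policy, but only honored for
# requests coming from a specific IP range.
ip_restricted_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetObject"],
            "Resource": [
                "arn:aws:s3:::de-landing-zone",
                "arn:aws:s3:::de-landing-zone/*",
            ],
            # Requests from outside this range fall through to the
            # default deny, even with valid credentials.
            "Condition": {"IpAddress": {"aws:SourceIp": "203.0.113.0/24"}},
        }
    ],
}

print(json.dumps(ip_restricted_policy, indent=2))
```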
What Is Next
In Part 2, we will dig into AWS Lake Formation and how it simplifies data lake permissions management compared to raw IAM policies. We will also walk through the hands-on exercise from the book where you set up a data lake user, configure permissions, and see column-level access control in action.
The tools exist to keep your data organized and secure. The hard part is having the discipline to use them consistently from day one.
Book: Data Engineering with AWS by Gareth Eagar | ISBN: 978-1-80056-041-3
Previous: Chapter 3 Part 2 - Analytics and Processing
Next: Chapter 4 Part 2 - Data Governance