Data Security for Data Engineers - Chapter 9 Retelling
In 2016, hackers stole the personal data of 57 million Uber users and drivers. How? Someone left API credentials in a private GitHub repo. The attackers grabbed those keys, got into AWS, and downloaded everything. The public didn’t hear about it for a year. Uber paid the hackers $100,000 to delete the data and kept quiet about it until late 2017.
That story opens Chapter 9, and it sets the tone well. Data security is not optional. If you build data pipelines, you need to understand how to protect what flows through them.
The CIA Triad
The foundation of data security sits on three ideas, often called the CIA triad. Not the spy agency. It stands for Confidentiality, Integrity, and Availability.
Confidentiality means only the right people see the data. You protect this with encryption, data masking, and access controls.
Integrity means the data stays accurate and unaltered. Engineers use checksums for this: generate a code from your data, recalculate it on the other end. If the codes match, nothing changed. Audit logs help too, tracking who touched what and when.
Availability means the data is there when you need it. You get it through redundancy (copies in multiple places), failover mechanisms (another server picks up if one dies), and regular backups.
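The checksum idea from the Integrity paragraph can be sketched with Python's standard hashlib module. SHA-256 is my choice here, not something the book prescribes, and the function name and sample payloads are made up for illustration.

```python
import hashlib


def sha256_checksum(payload: bytes) -> str:
    """Return the SHA-256 digest of the payload as a hex string."""
    return hashlib.sha256(payload).hexdigest()


# Sender side: compute the checksum before shipping the data.
original = b"order_id,amount\n1001,25.00\n"
sent_checksum = sha256_checksum(original)

# Receiver side: recompute the checksum and compare.
received_ok = b"order_id,amount\n1001,25.00\n"
received_tampered = b"order_id,amount\n1001,95.00\n"

print(sha256_checksum(received_ok) == sent_checksum)        # True
print(sha256_checksum(received_tampered) == sent_checksum)  # False
```

A plain hash like this catches accidental corruption. If an attacker could change both the data and the checksum, a keyed HMAC would be the safer tool.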
Common Threats
The book lists the usual suspects:
- Malware and ransomware - malicious software, including the kind that locks your data and demands payment
- Phishing - fake emails that trick you into giving up your credentials
- Insider threats - sometimes the problem is someone who already has access
- SQL injection - malicious input that ends up executed as part of a query, often through an unsanitized form field
- Lack of encryption - data sitting in plain text, waiting to be read by anyone
None of these are exotic. They happen every day, at companies of every size.
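Of the threats listed, SQL injection is the easiest to show in a few lines. This is a minimal sketch using Python's built-in sqlite3 module. The table, the user names, and the malicious input are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "alice@example.com"), ("bob", "bob@example.com")],
)

malicious_input = "nobody' OR '1'='1"

# Unsafe: the input is pasted into the SQL text, so the OR clause becomes part of the query.
unsafe_sql = f"SELECT email FROM users WHERE name = '{malicious_input}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the ? placeholder passes the input as a value, never as SQL.
safe_rows = conn.execute(
    "SELECT email FROM users WHERE name = ?", (malicious_input,)
).fetchall()

print(len(unsafe_rows))  # 2: every row leaked
print(len(safe_rows))    # 0: no user has that literal name
```

Parameterized queries are the standard fix, and most database drivers support some form of placeholder.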
Encryption: Symmetric vs Asymmetric
Encryption turns readable data (plaintext) into scrambled data (ciphertext). Only someone with the right key can unscramble it.
Symmetric encryption uses one key for both locking and unlocking. Fast, good for encrypting large datasets at rest. The catch: you need to share that key with the other side. If someone intercepts it, they have everything.
Asymmetric encryption uses two keys: a public key to encrypt and a private key to decrypt. The private key never leaves the owner. It is slower, but it solves the key-sharing problem: the public key can be handed out freely, so no secret has to travel to the other side. That padlock in your browser? Asymmetric encryption is part of it.
# Pseudocode: encrypt() and decrypt() stand in for a real crypto library
# Symmetric: same key on both sides
encrypted = encrypt(data, key)
decrypted = decrypt(encrypted, key)
# Asymmetric: public key encrypts, private key decrypts
encrypted = encrypt(data, public_key)
decrypted = decrypt(encrypted, private_key)
For data at rest, symmetric is the go-to. For data in transit, you typically use TLS, which combines both: asymmetric cryptography to agree on a key, then symmetric encryption for the actual traffic.
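To make the one-key versus two-key difference concrete, here is a toy sketch that runs on the standard library alone. It is deliberately not secure and is not from the book: the symmetric half is a simple XOR with a random key, and the asymmetric half is textbook RSA with tiny primes and no padding. Real pipelines should rely on a vetted crypto library or a managed key service.

```python
import secrets

message = b"pipeline"

# --- Symmetric (toy): XOR with a shared random key of the same length ---
shared_key = secrets.token_bytes(len(message))


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the matching key byte; applying it twice restores the data."""
    return bytes(d ^ k for d, k in zip(data, key))


sym_ciphertext = xor_bytes(message, shared_key)        # encrypt with the key
sym_plaintext = xor_bytes(sym_ciphertext, shared_key)  # decrypt with the SAME key

# --- Asymmetric (toy): textbook RSA with tiny primes, no padding ---
p, q = 61, 53
n = p * q                # 3233, part of both keys
phi = (p - 1) * (q - 1)  # 3120
e = 17                   # public exponent: (e, n) is the public key
d = pow(e, -1, phi)      # private exponent: (d, n) is the private key

asym_ciphertext = [pow(byte, e, n) for byte in message]        # anyone can encrypt with (e, n)
asym_plaintext = bytes(pow(c, d, n) for c in asym_ciphertext)  # only the holder of d can decrypt

print(sym_plaintext == message)   # True
print(asym_plaintext == message)  # True
```

Notice what has to be shared in each case: the symmetric side needs `shared_key` on both ends, while the asymmetric side only ever publishes `(e, n)`.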
Data Masking
Data masking replaces sensitive info with fake or partial data. Developers need production-like data for testing, but you can’t hand them real SSNs. So you mask it:
| Full Name | SSN (Original) | SSN (Masked) |
|---|---|---|
| John Doe | 123-45-6789 | XXX-XX-6789 |
| Jane Smith | 234-56-7890 | XXX-XX-7890 |
The last four digits stay visible for verification. Nobody sees the full number. Standard practice at any company handling financial or health data.
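The masking rule in the table can be sketched as a small function. The name `mask_ssn` and the choice to fully mask anything that doesn't parse are my assumptions, not the book's.

```python
import re

# Matches an SSN written as NNN-NN-NNNN and captures the last four digits.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-(\d{4})")


def mask_ssn(ssn: str) -> str:
    """Mask all but the last four digits of an SSN formatted as NNN-NN-NNNN."""
    match = SSN_PATTERN.fullmatch(ssn)
    if match is None:
        # Fail closed: never echo back a value we could not parse.
        return "XXX-XX-XXXX"
    return f"XXX-XX-{match.group(1)}"


print(mask_ssn("123-45-6789"))  # XXX-XX-6789
print(mask_ssn("234-56-7890"))  # XXX-XX-7890
print(mask_ssn("not an ssn"))   # XXX-XX-XXXX
```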
Network Security and TLS
When data moves between systems, TLS (Transport Layer Security) protects it. It replaced the older SSL protocol. The handshake is simple: your browser says hello, the server sends a certificate proving it’s legit, your browser verifies it, and both sides agree on a secret key. After that, everything is encrypted.
For data engineers, this matters when pulling data from APIs or writing to cloud databases:
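import requests

# HTTPS means TLS handles encryption; requests verifies the server certificate by default
url = "https://api.example.com/data"
response = requests.get(url, timeout=10)  # a timeout avoids hanging on a stalled connection
response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error body
data = response.json()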
That https:// prefix means TLS is active. Your data is encrypted during the trip.
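If you want to see what "the client verifies the certificate" looks like in code, Python's standard ssl module exposes those settings. This sketch only builds and inspects a client-side TLS configuration without opening a connection. Requiring TLS 1.2 or newer is my own assumption about a sensible floor, not a rule from the book.

```python
import ssl

# The default context loads the system's trusted CAs and turns on verification.
context = ssl.create_default_context()

# Assumption: refuse anything older than TLS 1.2.
context.minimum_version = ssl.TLSVersion.TLSv1_2

print(context.verify_mode == ssl.CERT_REQUIRED)  # True: the server must present a valid certificate
print(context.check_hostname)                    # True: the certificate must match the hostname
```

A context like this can be passed to standard-library clients, for example `urllib.request.urlopen(url, context=context)`.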
Access Control: Who Gets In and What Can They Do
Access control has two parts: authentication (proving who you are) and authorization (deciding what you’re allowed to do).
Authentication methods include passwords, MFA (password plus a code from your phone), biometrics, and OAuth-based sign-in (the technology behind the “Sign in with Google” button). MFA is the most important one. Even if someone steals your password, they still need your phone.
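The "code from your phone" is commonly a time-based one-time password (TOTP, RFC 6238). The book doesn't go into the mechanism, so treat this as a sketch of that standard rather than the book's method. The `verify_login` helper is hypothetical, and the secret is the RFC's published test value.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, at_time: float, step: int = 30, digits: int = 6) -> str:
    """Compute a time-based one-time password (RFC 6238, HMAC-SHA1)."""
    counter = int(at_time) // step  # number of 30-second windows since the epoch
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F      # dynamic truncation from RFC 4226
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10**digits).zfill(digits)


def verify_login(password_ok: bool, submitted_code: str, secret: bytes, now: float) -> bool:
    """Both factors must pass: the password check and the current code."""
    return password_ok and hmac.compare_digest(submitted_code, totp(secret, now))


shared_secret = b"12345678901234567890"  # RFC test secret, not for real use

print(totp(shared_secret, at_time=59))                       # 287082
print(verify_login(True, "287082", shared_secret, now=59))   # True
print(verify_login(False, "287082", shared_secret, now=59))  # False: a stolen code alone is not enough
```

The server and the phone share the secret once at enrollment. After that, each side derives the same short-lived code from the clock, so a stolen password is useless without the device.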
Authorization decides what an authenticated user can do. Three models:
- RBAC (Role-Based) - permissions grouped by role. Admins can delete accounts, employees can only see their own data. Scales well.
- ABAC (Attribute-Based) - access depends on attributes like department, time of day, or data sensitivity. More flexible but more complex.
- ACLs (Access Control Lists) - attached directly to resources, specifying exactly who is allowed or denied read and write access. Fine-grained but messy at scale.
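The three models differ mainly in where the decision data lives. Here is a minimal side-by-side sketch. Every role, attribute, policy, and file name in it is hypothetical.

```python
from dataclasses import dataclass

# RBAC: permissions hang off roles, and users are assigned roles.
ROLE_PERMISSIONS = {
    "admin": {"read_own_data", "read_all_data", "delete_account"},
    "employee": {"read_own_data"},
}


def rbac_allows(role: str, permission: str) -> bool:
    return permission in ROLE_PERMISSIONS.get(role, set())


# ABAC: the decision is computed from attributes of the user, the resource, and the context.
@dataclass
class Request:
    department: str
    hour: int         # 0-23, local time of the request
    sensitivity: str  # "public" or "restricted"


def abac_allows(req: Request) -> bool:
    if req.sensitivity == "public":
        return True
    # Hypothetical policy: restricted data only for finance, during working hours.
    return req.department == "finance" and 9 <= req.hour < 17


# ACL: each resource carries its own list of who may do what.
ACLS = {"sales.csv": {"alice": {"read", "write"}, "bob": {"read"}}}


def acl_allows(resource: str, user: str, action: str) -> bool:
    return action in ACLS.get(resource, {}).get(user, set())


print(rbac_allows("employee", "delete_account"))          # False
print(abac_allows(Request("finance", 10, "restricted")))  # True
print(acl_allows("sales.csv", "bob", "write"))            # False
```

All three default to deny: an unknown role, a failed policy, or a missing ACL entry means no access.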
The Principle of Least Privilege
Here’s the thing about access: give people only what they need. Nothing more. A junior engineer shouldn’t have full admin access to the entire cloud environment. One wrong command and a critical database is gone.
The book recommends three practices:
- Use RBAC to match permissions to job functions
- Use just-in-time (JIT) access for temporary needs, revoked automatically when time’s up
- Audit access regularly, because people switch teams and leave companies but their permissions stick around
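Just-in-time access from the list above boils down to a grant that carries its own expiry. This is a small sketch under that assumption. The `Grant` type, the user name, and the permission string are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    user: str
    permission: str
    expires_at: datetime


def grant_temporary(user: str, permission: str, minutes: int, now: datetime) -> Grant:
    """Issue a grant that lapses on its own after `minutes`."""
    return Grant(user, permission, now + timedelta(minutes=minutes))


def is_active(grant: Grant, now: datetime) -> bool:
    """A grant counts only until its expiry; nobody has to remember to revoke it."""
    return now < grant.expires_at


start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
grant = grant_temporary("dana", "prod_db:write", minutes=30, now=start)

print(is_active(grant, start + timedelta(minutes=10)))  # True
print(is_active(grant, start + timedelta(minutes=45)))  # False
```

Because expiry is checked at use time, forgotten permissions can't pile up the way standing access does.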
Secrets Management
Passwords, API keys, encryption keys, tokens. These are all secrets. And they need proper handling.
The number one rule: never hardcode secrets in your code. Don’t put your database password in a config file that ends up in Git. That’s exactly what happened with Uber.
Instead, use tools built for this:
- AWS Secrets Manager
- HashiCorp Vault
- Azure Key Vault
- Google Secret Manager
These tools store secrets securely, rotate them on a schedule, and control who can access them through RBAC.
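Each of those tools has its own SDK, which I won't guess at here. One common pattern, and this is my assumption rather than the book's recommendation, is to have the platform inject the secret at runtime as an environment variable so the code never contains it. The variable name `DB_PASSWORD` and the helper below are hypothetical.

```python
import os


def get_secret(name: str) -> str:
    """Read a secret from the environment and fail loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required secret: {name}")
    return value


# In real use the deployment platform or secrets manager sets this variable.
# It is set here only so the sketch runs end to end.
os.environ["DB_PASSWORD"] = "example-only"

db_password = get_secret("DB_PASSWORD")
print(len(db_password) > 0)  # True; never print or log the secret itself
```

Failing at startup when a secret is missing beats discovering it halfway through a pipeline run.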
Data Security vs Data Privacy
The chapter closes with an important distinction. Security is about keeping the house locked (firewalls, encryption, access controls). Privacy is about who you invite inside and what rooms they can enter (consent, transparency, GDPR). You need both. A locked door means nothing if everyone inside sees everything.
My Take
This chapter covers a lot of ground, and it does it well. The Uber breach as an opener was a smart choice. It shows what happens when security basics are ignored.
If you’re just starting in data engineering, focus on these three things first: use TLS for all network calls, never hardcode secrets, and follow least privilege for access. Those three habits alone will keep you out of most trouble. The rest builds on top of that foundation.
This is part 13 of 18 in my retelling of “Data Engineering for Beginners” by Chisom Nwokwu.