Cloud Data Engineering - Storage, Compute, Networking, and Cost on the Cloud
Chapter 12 is the one where everything moves to the cloud. If you’ve been following along, we’ve been talking about databases, pipelines, data quality, security, governance, and big data. All of that can run on your own hardware. But most teams today don’t do that. They use cloud providers. This chapter explains why, and more importantly, how.
From Server Rooms to Someone Else’s Computer
The book starts with a quick history lesson. Back in the day, companies bought physical servers, put them in dedicated rooms with controlled temperature, and hired teams to maintain everything. Cooling, power supply, backups, disaster recovery, hardware failures. The whole works.
Here’s the problem: you had to predict your future computing needs. Guess too low and you can’t handle traffic spikes. Guess too high and you waste money on hardware collecting dust. Neither option is great.
Then in 2006, Amazon launched AWS. The idea was simple: what if you could rent computing power over the internet? Pay only for what you use. Scale up when you need more, scale down when you don’t. Microsoft Azure and Google Cloud Platform followed. And suddenly, you didn’t need a server room anymore.
That’s cloud computing in a nutshell. You’re using someone else’s computer. A very sophisticated one, but still someone else’s computer.
On-Premises vs Cloud
The book lays out both options fairly.
On-premises gives you full control. Your hardware, your network, your rules. Good for companies with strict security requirements and the budget to maintain it all. But it’s expensive upfront and you need a dedicated team to keep everything running.
Cloud gives you flexibility. Scale on demand, pay as you go, and the provider handles most of the hardware headaches. The trade-off is that you’re trusting the provider with your infrastructure, and sometimes with your data.
Many companies take a hybrid approach: keep sensitive workloads on-premises, run everything else in the cloud. It depends on what you need.
The Three Pillars: Storage, Compute, Networking
Cloud infrastructure sits on three core concepts.
Storage
There are three types of cloud storage, each with a specific purpose:
Object storage is for large amounts of unstructured data. Files, images, videos, logs. Instead of folders and files, everything is stored as “objects” with metadata and a unique ID. You don’t say “open file Y in folder X.” You say “give me object 12345 from bucket Z.” Think Amazon S3, Google Cloud Storage, Azure Blob Storage. It’s cheap, it scales well, and it works great for data lakes and long-term archiving.
Block storage is like a virtual hard drive. Data is split into chunks (blocks), each with its own address. Fast reads and writes. Good for databases and performance-heavy workloads. Think AWS EBS. You attach it directly to a virtual machine.
File storage is the traditional folder-and-file setup, but in the cloud. Multiple users or machines can access the same files. Good for shared scripts, configuration files, or team collaboration. Think Amazon EFS or Azure Files.
Here’s a quick comparison:
| Factor | Object Storage | Block Storage | File Storage |
|---|---|---|---|
| Scalability | Best (horizontal) | Good (vertical) | Limited |
| Performance | Slower access | Fastest | Middle ground |
| Cost | Cheapest | Most expensive | In between |
| Best for | Data lakes, backups, logs | Databases, VMs | Shared scripts, team files |
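The "give me object X from bucket Z" access model is easier to see in code. Here's a toy in-memory sketch of object-storage semantics (not a real provider SDK; the class, bucket, and key names are all illustrative): no folders, just objects addressed by key, each carrying metadata.

```python
# Toy "object store": objects live in flat buckets, addressed by a unique
# key, with metadata attached. There is no directory hierarchy to traverse.
class ObjectStore:
    def __init__(self):
        self._buckets = {}  # bucket name -> {key: (data, metadata)}

    def put_object(self, bucket, key, data, metadata):
        self._buckets.setdefault(bucket, {})[key] = (data, metadata)

    def get_object(self, bucket, key):
        return self._buckets[bucket][key]


store = ObjectStore()
store.put_object("logs", "2024/01/app.log", b"error: timeout",
                 {"content-type": "text/plain"})

# "Give me object 2024/01/app.log from bucket logs" -- one lookup by key.
data, meta = store.get_object("logs", "2024/01/app.log")
print(data.decode())          # error: timeout
print(meta["content-type"])   # text/plain
```

Real object stores like S3 work the same way conceptually: the "path" in the key is just a naming convention, not an actual folder structure.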
Compute
Two main options here:
Virtual machines (VMs) are cloud-hosted computers you spin up on demand. You choose the OS, install your tools, configure everything. Full control. Good for custom ETL setups with Airflow, Spark, or specific Python libraries. The downside is you manage the OS, patches, and scaling yourself.
Containers package your code, dependencies, and runtime into a single portable unit. Each piece of your pipeline gets its own sealed box. No more “it works on my machine” problems. Different teams can use different Python versions without conflicts. Containers behave the same way on a laptop and in production.
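To make "sealed box per pipeline step" concrete, here's a minimal, hypothetical Dockerfile for a single Python ETL job (the file names and versions are illustrative, not from the book):

```dockerfile
# Pin the base image so every environment runs the same Python version.
FROM python:3.11-slim
WORKDIR /app
# Install dependencies first so this layer is cached between code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the job code and define what the container runs.
COPY etl_job.py .
CMD ["python", "etl_job.py"]
```

Build it once, and the same image runs identically on a laptop, a CI server, or a cloud container service.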
Networking
Here’s the thing about cloud networking: if it’s not set up right, your pipeline will fail. Not because of bad code, but because services can’t talk to each other.
VPC (Virtual Private Cloud) is your isolated network space in the cloud. Think of it as your own private data center. You control which services can communicate and how traffic flows. Keep your ETL jobs, databases, and storage inside the same VPC for security.
Subnets divide your VPC into smaller segments. Public subnets can reach the internet (good for pulling data from external APIs). Private subnets stay internal (good for databases and sensitive processing). Your ingestion service might sit in a public subnet while your data warehouse stays in a private one.
Gateways are the doors between your VPC and the outside world. Without proper gateway setup, your pipeline can’t reach external data sources, download packages, or send logs to monitoring services.
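The subnet idea above is just address-range arithmetic, which Python's standard library can demonstrate. A sketch using an illustrative CIDR range (the public/private roles are assigned by you, via routing rules, not by the math):

```python
import ipaddress

# A VPC is assigned an address range; this CIDR block is illustrative.
vpc = ipaddress.ip_network("10.0.0.0/16")

# Carve the /16 into /24 subnets. Which ones are "public" vs "private"
# is a routing decision, e.g. whether they route to an internet gateway.
subnets = list(vpc.subnets(new_prefix=24))
public_subnet = subnets[0]    # ingestion service, reachable from outside
private_subnet = subnets[1]   # data warehouse, internal only

print(public_subnet)    # 10.0.0.0/24
print(private_subnet)   # 10.0.1.0/24
print(len(subnets))     # 256 possible /24 subnets inside one /16
```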
IaaS, PaaS, SaaS
These three models define how much you manage vs how much the cloud provider manages.
IaaS (Infrastructure as a Service) gives you raw building blocks. Virtual machines, storage, networking. You install and configure everything from the OS up. Maximum control, maximum responsibility. Good for teams that need custom setups.
PaaS (Platform as a Service) sits one level up. The provider handles the infrastructure, OS, and runtime. You just build and deploy your applications. Google Cloud Composer, Azure Data Factory, AWS Glue are PaaS examples. You spend time on data workflows instead of server configs. Less control, but much faster to deploy.
SaaS (Software as a Service) is fully managed software you access through a browser. Snowflake, BigQuery, Fivetran. You configure the workflow through a web interface. Minimal setup, but also minimal customization. Good for quick wins and teams that want to move fast.
Most real-world data teams use a mix of all three. IaaS for custom infrastructure, PaaS for managed platforms, SaaS for analytics and visualization.
Cloud Management Models
Beyond service models, there are three management approaches:
Serverless means you write code that runs in response to events. File uploaded? Trigger a function. New record? Process it automatically. No server running 24/7. You only pay for execution time. Great for lightweight, event-driven tasks. But watch out for cold start latency, execution time limits, and tricky debugging.
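The "file uploaded? trigger a function" shape looks roughly like this. A minimal sketch mimicking a Lambda-style handler; the event fields and names here are made up for illustration, not a real provider payload:

```python
# An event-driven handler: the platform invokes this function per event,
# so no server sits idle waiting. Event structure is hypothetical.
def handle_file_uploaded(event):
    bucket = event["bucket"]
    key = event["key"]
    # A real function would fetch and process the object here.
    return {"status": "processed", "object": f"{bucket}/{key}"}


# Simulate the platform delivering an "object created" event.
result = handle_file_uploaded({"bucket": "raw-data", "key": "2024/01/events.json"})
print(result)  # {'status': 'processed', 'object': 'raw-data/2024/01/events.json'}
```

You pay only for the milliseconds this function actually runs, which is why the model suits bursty, lightweight tasks.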
Managed services let the provider handle infrastructure, scaling, and often optimization. BigQuery and Snowflake are managed. You run queries and transformations without worrying about tuning servers. Built-in redundancy keeps things running even during failures. The trade-off is less flexibility and vendor lock-in.
Self-managed means you rent VMs and build everything yourself. Full control over every software version and configuration. Good for complex or performance-sensitive workloads. But you own all the maintenance, patching, and scaling headaches.
Cost: The Silent Killer
Here’s what I found most practical in this chapter. Cloud bills can get out of hand fast. The book covers several strategies:
Pricing models. On-demand instances are flexible but expensive for long-running workloads. Reserved instances save you 30-60% if you commit for 1-3 years. Spot instances use leftover capacity at a discount, but the provider can reclaim them anytime.
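A back-of-the-envelope comparison makes the reserved-instance math tangible. The prices below are invented for illustration, not real quotes:

```python
# Illustrative hourly rates for the same instance type.
on_demand_hourly = 0.10   # pay-as-you-go
reserved_hourly = 0.06    # with a 1-year commitment
hours_per_month = 730     # ~24 * 365 / 12

on_demand_monthly = on_demand_hourly * hours_per_month
reserved_monthly = reserved_hourly * hours_per_month
savings = 1 - reserved_monthly / on_demand_monthly

print(f"On-demand: ${on_demand_monthly:.2f}/mo")   # $73.00/mo
print(f"Reserved:  ${reserved_monthly:.2f}/mo")    # $43.80/mo
print(f"Savings:   {savings:.0%}")                 # 40%
```

For a workload that runs 24/7, the commitment pays off. For a job that runs two hours a night, on-demand or spot is usually cheaper.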
Right-sizing. If your job needs 2 vCPUs, don’t give it 8. Monitor actual usage and match your resources to the real need. I’ve seen teams triple their bill just by being “safe” with oversized machines.
Smart scheduling. Run batch jobs during off-peak hours. Use event-driven serverless for workloads that don’t need to run constantly.
Storage tiers. Hot storage for active data, cold storage for archives. Set up lifecycle policies that automatically move old data to cheaper tiers. Logs older than 30 days go cold. After 90 days, archive or delete.
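The 30/90-day rule above is exactly the kind of logic a lifecycle policy encodes. A sketch of the decision function (cloud providers let you declare this as configuration rather than code, but the thresholds work the same way):

```python
from datetime import date, timedelta

# Mirror the rule: hot up to 30 days, cold up to 90, then archive.
def storage_tier(last_modified, today):
    age_days = (today - last_modified).days
    if age_days > 90:
        return "archive"
    if age_days > 30:
        return "cold"
    return "hot"


today = date(2024, 6, 1)
print(storage_tier(today - timedelta(days=5), today))    # hot
print(storage_tier(today - timedelta(days=45), today))   # cold
print(storage_tier(today - timedelta(days=120), today))  # archive
```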
Shut down idle resources. Dev and test environments don’t need to run overnight or on weekends. Automate the shutdown. Clean up after demos. Forgotten resources quietly run up bills.
Monitor everything. Set budgets and alerts. Track daily usage. Catch anomalies early before they become expensive surprises.
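Even a naive daily check catches the most expensive surprises. A toy example with made-up numbers, flagging any day that blows past a budget threshold:

```python
# Hypothetical daily spend pulled from a billing export.
daily_costs = {
    "2024-06-01": 42.00,
    "2024-06-02": 40.50,
    "2024-06-03": 310.00,  # someone left a large cluster running
}
budget_per_day = 100.00

# Flag days over budget; real setups would page or email instead of print.
alerts = [day for day, cost in daily_costs.items() if cost > budget_per_day]
print(alerts)  # ['2024-06-03']
```

Cloud providers offer built-in budget alerts that do this for you; the point is to turn them on before the anomaly, not after the invoice.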
What I Think
This is a solid chapter for anyone who hasn’t worked with cloud infrastructure yet. The explanations are clear, and the comparisons between storage types, service models, and management approaches are useful. The cost section alone could save a beginner team thousands of dollars in their first year.
If I had to pick one takeaway: understand the fundamentals before you pick a specific cloud provider. Object storage works the same way whether it’s S3 or GCS. VPCs work the same way on AWS and Azure. Learn the concepts, and the specific tools become easy to pick up.
This is part 16 of 18 in my retelling of “Data Engineering for Beginners” by Chisom Nwokwu. See all posts in this series.
| < Previous: Big Data and Distributed Systems | Next: Building a Career in Data Engineering > |