Mastering AWS for Big Data: EC2, S3, and EMR
We’ve talked about the “what” and the “why” of the cloud. Now it’s time for the “how.” Chapter 12 of Sridhar Alla’s book is a deep look at Amazon Web Services (AWS), which is essentially the playground where most big data pros spend their time.
If you want to run Hadoop in the cloud, AWS is the place to do it. Let’s look at the foundational bricks.
Amazon EC2: Your Servers in the Sky
EC2 (Elastic Compute Cloud) is where your code runs. Instead of buying a physical server, you launch an Instance.
- AMIs (Amazon Machine Images): These are templates for your servers. You can pick an AMI that already has Linux and Hadoop installed.
- Instance Types: You can choose how much “juice” your server has. Need a lot of RAM for Spark? Pick an r5 instance. Doing heavy math? Go for c5.
- Auto Scaling: This is the best part. If your data job gets hit with more traffic than expected, AWS can automatically launch more servers to help out.
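As a toy illustration of that instance-type decision, here's a small Python sketch. The `pick_instance_family` helper and its thresholds are made up for illustration, but the family meanings follow AWS's naming: r5 is memory-optimized, c5 is compute-optimized, m5 is general purpose.

```python
# Toy helper illustrating the instance-family trade-off described above.
# The function and thresholds are hypothetical; the family characteristics
# (r5 ~8 GB RAM per vCPU, c5 ~2 GB, m5 ~4 GB) follow AWS's conventions.

def pick_instance_family(ram_gb_per_vcpu: float) -> str:
    """Suggest an EC2 instance family based on RAM needed per vCPU."""
    if ram_gb_per_vcpu >= 8:
        return "r5"   # memory-optimized: Spark jobs caching big datasets
    if ram_gb_per_vcpu <= 2:
        return "c5"   # compute-optimized: CPU-bound number crunching
    return "m5"       # general purpose: balanced workloads

print(pick_instance_family(16))  # RAM-hungry Spark job -> r5
print(pick_instance_family(1))   # heavy math -> c5
```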
Amazon S3: The Infinite Hard Drive
S3 (Simple Storage Service) is where your data lives. It’s designed for 99.999999999% durability. That’s “eleven nines.” To put that in perspective, if you store 10 million objects in S3, you can expect to lose a single one every 10,000 years.
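The “eleven nines” arithmetic checks out with a quick back-of-the-envelope calculation:

```python
# S3's design target: 99.999999999% annual durability ("eleven nines").
durability = 0.99999999999
annual_loss_prob = 1 - durability            # ~1e-11 per object per year

objects = 10_000_000
expected_losses_per_year = objects * annual_loss_prob  # ~1e-4

years_per_single_loss = 1 / expected_losses_per_year
print(round(years_per_single_loss))          # -> 10000
```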
For Hadoop users, S3 is often used as a replacement for HDFS. It’s cheaper, it’s easier to manage, and it can store an unlimited amount of data.
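To point Hadoop at S3 instead of HDFS, the usual route is the S3A connector. A minimal `core-site.xml` sketch might look like this (the bucket name is a placeholder, and using the EC2 instance profile for credentials is one common choice; exact properties can vary by Hadoop version):

```xml
<!-- core-site.xml: minimal S3A setup (sketch; bucket name is hypothetical) -->
<configuration>
  <!-- Make s3a:// the default filesystem instead of hdfs:// -->
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://my-data-lake-bucket</value>
  </property>
  <!-- Pick up credentials from the EC2 instance profile, not hard-coded keys -->
  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
  </property>
</configuration>
```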
Security and Networking
AWS is big on security. You use VPCs (Virtual Private Clouds) to create your own isolated network, and Security Groups to act as a virtual firewall. You also use Key Pairs (SSH keys) to log into your instances securely.
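To make the Security Group idea concrete, here's a minimal CloudFormation fragment that allows inbound SSH only. The resource names, the VPC reference, and the CIDR range are hypothetical placeholders, a sketch rather than a recommended policy:

```yaml
# CloudFormation sketch: a security group allowing only inbound SSH.
# Resource names, the VPC reference, and the CIDR are placeholders.
Resources:
  HadoopNodeSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow SSH from a trusted network only
      VpcId: !Ref MyVpc              # assumes a VPC defined elsewhere
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: 203.0.113.0/24     # example (documentation) IP range
```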
AWS Lambda: The “Serverless” Option
The book also touches on AWS Lambda. This is for when you don’t even want to manage a virtual server. You just upload your code, and AWS runs it whenever an event happens (like a new file being uploaded to S3). It’s perfect for small, quick data processing tasks.
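A Lambda triggered by an S3 upload receives the event as a JSON document. Here's a minimal Python handler sketch; the event shape follows AWS's documented S3 notification format, but the function body and names are illustrative, not from the book:

```python
# Minimal sketch of a Lambda handler for S3 "ObjectCreated" events.
# The event structure is the documented S3 notification format; what you
# do with each object is up to you (here we just collect the keys).

def handler(event, context):
    """Extract the bucket/key of each newly uploaded object."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Real code would fetch the object with boto3 and process it here.
        processed.append(f"s3://{bucket}/{key}")
    return {"processed": processed}

# Local smoke test with a hand-built event (shape follows the AWS docs):
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-bucket"},
                "object": {"key": "logs/day1.json"}}}
    ]
}
print(handler(sample_event, None))  # -> {'processed': ['s3://my-bucket/logs/day1.json']}
```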
Summary
EC2 and S3 are the bread and butter of big data on AWS. But if you really want to run Hadoop without the headache of manual setup, you need EMR. That's what we're looking at in the final post of this series.