Mastering AWS for Big Data: EC2, S3, and EMR
We’ve talked about the “what” and the “why” of the cloud. Now it’s time for the “how.” Chapter 12 of Sridhar Alla’s book is a deep look at Amazon Web Services (AWS), which is essentially the playground where most big data pros spend their time.
If you want to run Hadoop in the cloud, AWS is the place to do it. Let’s look at the foundational bricks.
Amazon EC2: Your Servers in the Sky
EC2 (Elastic Compute Cloud) is where your code runs. Instead of buying a physical server, you launch an Instance.
- AMIs (Amazon Machine Images): These are templates for your servers. You can pick an AMI that already has Linux and Hadoop installed.
- Instance Types: You can choose how much “juice” your server has. Need a lot of RAM for Spark? Pick an r5 instance. Doing heavy math? Go for c5.
- Auto Scaling: This is the best part. If your data job gets hit with more traffic than expected, AWS can automatically launch more servers to help out.
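As a toy illustration of that instance-type decision, here's a small Python sketch. The `pick_instance_family` helper and its thresholds are made up for illustration, but the family meanings follow AWS's naming: r5 is memory-optimized, c5 is compute-optimized, m5 is general purpose.

```python
# Toy helper illustrating the instance-family trade-off described above.
# The function and thresholds are hypothetical; the family characteristics
# (r5 ~8 GB RAM per vCPU, c5 ~2 GB, m5 ~4 GB) follow AWS's conventions.

def pick_instance_family(ram_gb_per_vcpu: float) -> str:
    """Suggest an EC2 instance family based on RAM needed per vCPU."""
    if ram_gb_per_vcpu >= 8:
        return "r5"   # memory-optimized: Spark jobs caching big datasets
    if ram_gb_per_vcpu <= 2:
        return "c5"   # compute-optimized: CPU-bound number crunching
    return "m5"       # general purpose: balanced workloads

print(pick_instance_family(16))  # RAM-hungry Spark job -> r5
print(pick_instance_family(1))   # heavy math -> c5
```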
Amazon S3: The Infinite Hard Drive
S3 (Simple Storage Service) is where your data lives. It’s designed for 99.999999999% durability. That’s “eleven nines.” To put that in perspective, if you store 10 million objects in S3, you can expect to lose a single one every 10,000 years.
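The “eleven nines” arithmetic checks out with a quick back-of-the-envelope calculation:

```python
# S3's design target: 99.999999999% annual durability ("eleven nines").
durability = 0.99999999999
annual_loss_prob = 1 - durability            # ~1e-11 per object per year

objects = 10_000_000
expected_losses_per_year = objects * annual_loss_prob  # ~1e-4

years_per_single_loss = 1 / expected_losses_per_year
print(round(years_per_single_loss))          # -> 10000
```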
For Hadoop users, S3 is often used as a replacement for HDFS. It’s cheaper, it’s easier to manage, and it can store an unlimited amount of data.
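To point Hadoop at S3 instead of HDFS, the usual route is the S3A connector. A minimal `core-site.xml` sketch might look like this (the bucket name is a placeholder, and using the EC2 instance profile for credentials is one common choice; exact properties can vary by Hadoop version):

```xml
<!-- core-site.xml: minimal S3A setup (sketch; bucket name is hypothetical) -->
<configuration>
  <!-- Make s3a:// the default filesystem instead of hdfs:// -->
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://my-data-lake-bucket</value>
  </property>
  <!-- Pick up credentials from the EC2 instance profile, not hard-coded keys -->
  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
  </property>
</configuration>
```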
Security and Networking
AWS is big on security. You use VPCs (Virtual Private Clouds) to create your own isolated network, and Security Groups to act as a virtual firewall. You also use Key Pairs (SSH keys) to log into your instances securely.
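To make the Security Group idea concrete, here's a minimal CloudFormation fragment that allows inbound SSH only. The resource names, the VPC reference, and the CIDR range are hypothetical placeholders, a sketch rather than a recommended policy:

```yaml
# CloudFormation sketch: a security group allowing only inbound SSH.
# Resource names, the VPC reference, and the CIDR are placeholders.
Resources:
  HadoopNodeSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow SSH from a trusted network only
      VpcId: !Ref MyVpc              # assumes a VPC defined elsewhere
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: 203.0.113.0/24     # example (documentation) IP range
```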
AWS Lambda: The “Serverless” Option
The book also touches on AWS Lambda. This is for when you don’t even want to manage a virtual server. You just upload your code, and AWS runs it whenever an event happens (like a new file being uploaded to S3). It’s perfect for small, quick data processing tasks.
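A Lambda triggered by an S3 upload receives the event as a JSON document. Here's a minimal Python handler sketch; the event shape follows AWS's documented S3 notification format, but the function body and names are illustrative, not from the book:

```python
# Minimal sketch of a Lambda handler for S3 "ObjectCreated" events.
# The event structure is the documented S3 notification format; what you
# do with each object is up to you (here we just collect the keys).

def handler(event, context):
    """Extract the bucket/key of each newly uploaded object."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Real code would fetch the object with boto3 and process it here.
        processed.append(f"s3://{bucket}/{key}")
    return {"processed": processed}

# Local smoke test with a hand-built event (shape follows the AWS docs):
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-bucket"},
                "object": {"key": "logs/day1.json"}}}
    ]
}
print(handler(sample_event, None))  # -> {'processed': ['s3://my-bucket/logs/day1.json']}
```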
Summary
EC2 and S3 are the bread and butter of big data on AWS. But if you really want to run Hadoop without the headache of manual setup, you need EMR. That's what we're looking at in the final post of this series.