Amazon Web Services (AWS) is a remarkable platform for building distributed systems. Moreover, it is just about an ideal environment for doing data science. Suppose, for example, you want to create a commodity cluster to run a Hadoop ecosystem. No problem. You can allocate the virtual machines, storage, etc. in a few hours. And you can connect to your VM’s (instances in AWS terminology) to install whatever software you need.
Not only is AWS a great set of resources, but it is also constantly getting better as Amazon adds new services–many geared to the needs of data scientists. If you are looking to upgrade your computing environment, maybe you should consider AWS. It certainly is a step up from doing projects on your favorite old PC or Mac, even if you are using virtualization to extend your machine.
The downside of adopting AWS for future projects is the learning curve. If you are a Linux user like me, you will no doubt feel comfortable with some aspects of AWS, but other things will be baffling, at least at the beginning. Also, you can’t learn everything at once, so it’s a good strategy to take a small subset of AWS’s capabilities and use them as a sort of project template. That’s the goal of this article: to present a project template for AWS that you can reuse for your own projects. Needless to say, what follows is for a Linux user.
The steps below take you through opening an AWS account, provisioning a virtual machine, providing all the requirements for that virtual machine, and finally running it and logging on. Each of these steps will be discussed in detail in later sections of this article.
- Open an account.
- Create a key pair for secure log on.
- Allocate a Virtual Private Cloud (VPC), the network where your virtual machine will reside.
- Create a security group to control access to your virtual machine.
- Get an elastic IP address, a static IP to point to your VM.
- Assign Domain Name.
- Allocate an Elastic Block Store (EBS) volume, your disk space for the VM.
- Launch a new instance (a virtual machine).
- Connect to the new instance using the secure shell command, ssh.
We will be using the AWS web console to accomplish these tasks.
A. Open an Account
The first step in using Amazon Web Services is to open an account. You’ll need an email address, which will become your ID. You’ll also have to provide a credit card to be billed against. The web address for creating an account is aws.amazon.com. It’s a quick, 5-minute process. A brand new account qualifies for “free tier” resources for the first 12 months. Free tier virtual machines are very low-powered, though. They are OK for becoming familiar with the AWS environment, but not much more (my opinion).
To log on to your new account, go to aws.amazon.com, click on the [Sign in to the Console] button, enter your email and password, and you’re in. You will be at the AWS Console, which is the home page for AWS. The top of the page has a menu bar; some things to note about it are:
- The icon on the left is a link that will always bring you back to the console.
- The Services tab toggles a display of available services in the body below. Services are groups of related tasks and resources. Although there are many services, you probably will be using only three or four on a regular basis. My advice is to ignore the ones you don’t use, otherwise you will go crazy trying to figure them out.
- Towards the right you will see your name on a drop down. You can click on this to look at your billing status. You get billed on a monthly basis for resources that you use, excluding the free tier ones. Most of the charges are for running virtual machines and data storage. You are not charged for running time while a virtual machine is in the off state, though storage attached to it is still billed. In practice, this means you can have many virtual machines for your projects and save money by turning them off when you are not using them.
- Next to your name on the right will be a location drop down. Your account has a default region (e.g., US East) and an availability zone within that region. Region and availability zone together point to the data center that hosts your account’s resources. My advice is to stay with your defaults.
B. Create a Key Pair for Secure Log On
A key pair is a set of two encryption keys–one public, the other private. They are used to communicate securely over a network. In the AWS environment, a key pair is a set of logon credentials that allow you to connect to a running instance.
The keys from a new key pair are stored separately by AWS in two places:
- The public key is maintained at your AWS account.
- The private key is placed in a .pem file and downloaded automatically to your local computer. Don’t lose this .pem file, as it is the only copy of the private key.
The procedure for creating a key pair is:
- Go to Console page => EC2 Service => Network & Security (left sidebar) => Key Pairs.
- Click on the [Create Key Pair] button, enter a name, and click on [Create]. Note: a good naming convention is <your-project>-keypair.
- AWS creates the key pair and downloads the private one to your local computer as a .pem file.
- Use the chmod command to set the permissions on this file to 400. This restricts access to your private key. Note: The secure shell command ssh throws an error if you try to use a .pem file with wider permissions.
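If you would rather script the permission step, here is a minimal Python sketch of the same thing the chmod command does. The function name lock_down_key is my own invention for illustration, and the path in the usage note is only an example; substitute wherever your browser saved the .pem file.

```python
import os
import stat


def lock_down_key(pem_path):
    """Set a private key file to mode 400 (read-only for the owner)."""
    os.chmod(pem_path, stat.S_IRUSR)  # stat.S_IRUSR is 0o400
    # Return only the permission bits so the caller can confirm the change.
    return stat.S_IMODE(os.stat(pem_path).st_mode)
```

Calling lock_down_key(os.path.expanduser("~/Downloads/myproject-keypair.pem")) should return 0o400 on Linux, which is the mode ssh expects for a private key file.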
C. Allocate a Virtual Private Cloud (VPC)
A virtual private cloud is a secure and flexible network that you own and administer. It is a home for virtual machines. When you launch a new virtual machine you add it to an existing VPC. Each VM has a local (private) IP address which can be used to communicate with other instances on the VPC. An instance can also have a public IP address for the Internet. Normally you lock down IP access for the virtual machines with a security group, allowing only the type of access that you need for your application.
One way of assigning a public IP to an instance is to allocate an Elastic IP in AWS and associate it with the instance. (See section on elastic IP’s below.) Another way is to let AWS assign it from a pool of temporary IP addresses when an instance is started. I don’t recommend this second method because the IP changes each time you stop/start an instance.
The private IP address for a VM does not change when you stop/start the VM.
To create a Virtual Private Cloud:
- Go to the Console => Networking and Content Delivery => VPC.
- Click the [Start VPC Wizard] button.
- Select VPC with single public subnet, click [Next].
- Enter VPC name. A good naming convention is <your-project>-vpc.
- Go with the default settings, click the [Create VPC] button.
The default settings will give you a public subnet with 256 addresses, enough for about 250 virtual machines since AWS reserves five addresses in every subnet, and assign it to an availability zone geographically nearby. The new VPC will appear in the list of available VPC’s. Note that AWS gives you a default VPC when you open an account. This should be on the list too.
To update or delete a Virtual Private Cloud:
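The size of the default network follows from its address range. The sketch below assumes the wizard’s default public subnet is a /24 block such as 10.0.0.0/24 (check the values shown in your own wizard), and uses the fact that AWS reserves five addresses in every subnet. The variable names are just for illustration.

```python
import ipaddress

# Assumption: the wizard's default public subnet is a /24 like this one.
subnet = ipaddress.ip_network("10.0.0.0/24")

# AWS reserves the first four addresses and the last one in each subnet.
AWS_RESERVED_PER_SUBNET = 5

total_addresses = subnet.num_addresses
usable_addresses = total_addresses - AWS_RESERVED_PER_SUBNET

print(total_addresses, usable_addresses)
```

So a /24 subnet has 256 addresses in total, of which 251 are left over for your instances.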
- Navigate to the VPC page. Select the VPC you want from the list by clicking the checkbox next to it.
- Click the [Actions] dropdown and pick the appropriate option.
D. Create a Security Group
A security group is a set of rules for a virtual machine that controls which ports on the VM are open to access by other computers. There are two types of rules: inbound and outbound. If you are running a web server that will listen on port 80 for connections, you will require an incoming rule for that port so that requests to port 80 will be allowed. Outgoing rules can be used to restrict connections from the VM to other computers; however, most security groups contain only inbound rules.
When you create a security group, you assign it to an existing virtual private cloud. The group can be applied to multiple VM’s in that VPC. Any changes you make to a group become effective immediately; there is no need to restart the associated VM’s.
Example Security Group:
The example security group has two inbound rules. The first permits remote computers to connect to port 80 (HTTP); the second allows access to port 22 (SSH).
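To make the idea of inbound rules concrete, here is a small Python model of a security group with the same two rules. This is only a sketch of the concept, not how AWS implements it; the InboundRule class and the is_allowed function are hypothetical names, and real rules also carry a protocol and can cover port ranges.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class InboundRule:
    port: int
    source_cidr: str  # "0.0.0.0/0" means any IPv4 address


# The two rules from the example: HTTP (80) and SSH (22) from anywhere.
EXAMPLE_GROUP = [
    InboundRule(port=80, source_cidr="0.0.0.0/0"),
    InboundRule(port=22, source_cidr="0.0.0.0/0"),
]


def is_allowed(rules, port, source_ip):
    """Return True if any inbound rule permits source_ip to reach port."""
    address = ipaddress.ip_address(source_ip)
    return any(
        rule.port == port and address in ipaddress.ip_network(rule.source_cidr)
        for rule in rules
    )
```

With this group, is_allowed(EXAMPLE_GROUP, 80, "203.0.113.7") is True, while a request to any port without a rule, say 3306, is refused. Narrowing source_cidr to a single address or range is how you would restrict ssh access to your own machine.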
To create a security group:
- Go to the Console => EC2 Service => Network & Security (left sidebar) => Security Groups.
- Click the [Create Security Group] button. Enter name and description (both required); select the target VPC. Note: a good naming convention for security groups is <your-project-name>-sg.
- Select the inbound tab, click on the [Add Rule] button.
- The Type drop-down lists well-known port names, for example HTTP. If you select HTTP, the rest of the rule will be filled in automatically. Using the well-known names is the way to go, unless you have a special requirement, in which case you can select one of the custom options and fill in the rule manually. The source field is an IP range that defaults to all. If you want to restrict a rule to a specific IP (or range) you can do that with source.
- If Source field is not pre-filled, select [Custom] => Anywhere.
- Add more rules if required and click the [Create] button.
To edit or delete an existing security group:
- Select the group you want from the list on the security groups page.
- Click the [Actions] drop-down and choose the appropriate action.
E. Get an Elastic IP Address
An elastic IP is a static IPv4 address that you can assign to an instance as its public IP. AWS puts a few restrictions on elastic IP’s, as they are relatively scarce resources. You are limited to a max of five in a given region. Also, while there is no charge for creating or using an elastic IP, AWS will impose a small fee for any one that you leave idle. The term “elastic” seems to be a reference to the fact that you can re-assign an elastic IP at any time to a different VM.
To create an elastic IP:
- Go to Console => EC2 Service => Network & Security => Elastic IP’s.
- Click the [Allocate New Address] button.
- Leave the scope at its default, “VPC”, and click [Allocate].
To assign an elastic IP to an instance:
- Go to elastic IP page.
- Select the one you want from the list.
- Click the [Actions] dropdown and choose “Associate Address.”
- Select the target instance from the dropdown list.
- Click [Associate].
Note: Other actions you can perform at the elastic IP page are releasing an address (deleting it) and disassociating an address.
F. Assign Domain Name
A domain name registrar is an organization or company that manages the reservation of domain names on the Internet. If you want to use the domain acmeproducts.com, for example, you would do this through the services of a registrar. If you don’t already have a registrar, you can use Amazon. The Route 53 service under Networking & Content Delivery is for this purpose. Incidentally, the name “Route 53” does not refer to a highway; rather, it is a reference to the well-known port 53, which is for DNS queries.
There are two steps to assigning a domain name:
- Get ownership of the name.
- Set up DNS entries to have it point to the elastic IP that you defined.
To get a domain name:
- Go to Console => Networking & Content Delivery => Route 53.
- Under Domain registration, click [Get Started Now] button.
- Click [Register Domain]. Follow the prompts to select and verify your domain name, then complete the process.
Note: AWS will charge you about $12/year to maintain your domain name.
To set up DNS:
- Go to Console => Networking & Content Delivery => Route 53.
- Under DNS management, click [Get Started Now] button.
- Click [Create Hosted Zone].
- Enter domain name, keep defaults, click [Create].
- Fill in the info for a type A record, which maps the domain name to your elastic IP.
Note: A hosted zone is an AWS term for a collection of DNS records.
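For reference, the same type A record can be written down as data. The sketch below builds the change batch structure that, as far as I know, the Route 53 API accepts for creating or updating a record. Nothing here talks to AWS; the function name a_record_change, the 300-second TTL, and the IP address 203.0.113.10 are all placeholders of mine.

```python
import json


def a_record_change(domain_name, ip_address, ttl=300):
    """Build a change batch that points domain_name at ip_address."""
    return {
        "Changes": [
            {
                "Action": "UPSERT",  # create the record, or update it if present
                "ResourceRecordSet": {
                    "Name": domain_name,
                    "Type": "A",  # A record: domain name to IPv4 address
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": ip_address}],
                },
            }
        ]
    }


batch = a_record_change("acmeproducts.com", "203.0.113.10")
print(json.dumps(batch, indent=2))
```

The printed JSON has the same three pieces of information you type into the console form: the name, the record type, and the address it should resolve to.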
G. Allocate an Elastic Block Store (EBS) Volume
Elastic Block Store is persistent storage you can assign to an instance. An EBS allocation is called a volume. You can assign more than one volume to an instance. From the instance’s point of view, each volume is simply a part of the file system. AWS replicates EBS volumes, so they have a degree of fault tolerance.
AWS has another type of storage called Simple Storage Service (S3). S3 and EBS are quite different products. An EBS volume can only be assigned to one instance at a time, whereas S3 is meant to be shared by multiple processes. EBS becomes part of an instance’s file system, thus programs running on the instance can access EBS data as ordinary files. To access S3 buckets, you go through the AWS API.
The simplest way to create an EBS volume is to allocate it when you first launch an instance. The create wizard for instances has a page where you can do this (see the section below).
H. Launch a New Instance
An instance is a virtual machine which you can configure and run under your account. The process of configuring and starting an instance is called launching it. This is done using a wizard on the EC2 Service page. You can stop and start an instance from the EC2 page under the Instances menu item.
Amazon charges you an hourly rate for running instances. A type t2.small VM bills at 2.3 cents per hour, or about $16.50 per month. You are not charged for instances that are stopped. This model of pricing is called “on demand.” There are other pricing models, but they are only suitable for large-scale systems.
You can connect to a running instance using the secure shell command, ssh. You would do this to install and configure software, stop/start services, etc.
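The monthly figure is simple arithmetic on the hourly rate quoted above. Here is a quick sketch; the 30-day month and the function name monthly_cost are my assumptions, and the rate is the one from this article, not a current price list.

```python
HOURLY_RATE_USD = 0.023  # on-demand hourly rate quoted in the article
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: a 30-day month


def monthly_cost(hourly_rate, hours_per_day=HOURS_PER_DAY, days=DAYS_PER_MONTH):
    """Estimate the monthly bill in dollars for one on-demand instance."""
    return round(hourly_rate * hours_per_day * days, 2)


always_on = monthly_cost(HOURLY_RATE_USD)
# Running only 8 hours a day on 22 working days shows the savings from stopping.
working_hours_only = monthly_cost(HOURLY_RATE_USD, hours_per_day=8, days=22)
print(always_on, working_hours_only)
```

Left running around the clock, the instance comes to about $16.56 for the month; run only during working hours, it drops to about $4.05.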
To launch an instance:
- Go to Console => EC2 Service
- Click [Launch Instance] button.
- Select OS image from list, in this case Amazon Linux.
- Select machine capacity, in this case t2.small.
- Click on [Configure Instance Details] button.
- Under Network dropdown, select your project’s VPC network. Leave the rest as defaults.
- Click [Next: Add Storage] button.
- Uncheck “delete on termination.” Click [Add Volume] button.
- Enter the number of GB for volume, leave rest as defaults. Ensure that “delete on termination” is not checked.
- Click [Add Tags], then [Next: Configure Security Group].
- Select a previously-defined security group.
- Click [Review and Launch] button.
- Click [Launch]. Select a previously-defined key pair, check ownership box. Click [Launch].
To start or stop an instance:
- Go to Console => EC2 Service => Instances (left sidebar).
- Select the instance you want by checking the box to its left.
- Open the [Actions] dropdown and choose Instance State to view the stop / start options, then select the appropriate one.
To delete (terminate) an instance:
- Follow procedure above to view options for instance state.
- Select “terminate.”
Note: A terminated instance will remain on the instance list until cleaned up by AWS, which might take a day or so.
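The stop, start, and terminate options can be pictured as a small state table. The sketch below is a simplified model of my own: it leaves out the short-lived in-between states that the console shows (pending, stopping, shutting-down), and the names TRANSITIONS and apply_action are made up for illustration.

```python
# Simplified instance lifecycle: (current state, action) -> resulting state.
TRANSITIONS = {
    ("stopped", "start"): "running",
    ("running", "stop"): "stopped",
    ("running", "terminate"): "terminated",
    ("stopped", "terminate"): "terminated",
}


def apply_action(state, action):
    """Return the state an instance ends up in after the given action."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} an instance that is {state}") from None
```

The point of the table is that terminate is a one-way street: there is no entry that leads out of the terminated state, so a terminated instance cannot be started again.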
I. Connect to Instance with ssh
Many admin tasks, for example installing software on an instance, require you to connect to the instance remotely. The way to do this is to execute the ssh command in a terminal window on your local computer. You need the private key file from your key pair to identify yourself to the instance. Creating and downloading a key pair file was covered above. You will also need the domain name or public IP address of the instance. Assume that you have already created an elastic IP address and assigned it to the instance.
To log on:
Assume that your key pair file is Downloads/myproject-keypair.pem and the IP address is 220.127.116.11. The ssh command would then be:
ssh -i Downloads/myproject-keypair.pem ec2-user@220.127.116.11
An AWS Linux instance comes pre-supplied with a user account, ec2-user. This is the user you will be when you connect to an instance. You can’t use root, as AWS blocks that. ec2-user has sudo privileges, so you can do anything root would do by prefixing your commands with “sudo ”.
An inactive ssh session will time out after about an hour. You’ll see a “broken pipe” message when this happens. The exit command from ssh is logout.
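If you connect often, it can help to build the ssh command in a small script. The following Python sketch assembles the command for the ec2-user account using the key file and IP address from the example above. The function name build_ssh_command is hypothetical, and the sketch only prints the command rather than running it.

```python
import shlex


def build_ssh_command(key_file, host, user="ec2-user"):
    """Return the argument list for an ssh login with a key pair file."""
    return ["ssh", "-i", key_file, f"{user}@{host}"]


command = build_ssh_command("Downloads/myproject-keypair.pem", "220.127.116.11")
print(shlex.join(command))
# prints: ssh -i Downloads/myproject-keypair.pem ec2-user@220.127.116.11
```

To actually open the session from Python, you could pass the same list to subprocess.run(command), but typing the printed line into a terminal works just as well.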
The goal of this article was to present a template for creating a data science project on the Amazon AWS platform. At this point you may think it is an awful lot of steps, and maybe not worth the trouble. But almost all the steps are very simple, and take very little time once you become familiar with AWS. And some are one-time events, like signing up for an account at Amazon.
On the plus side, the template gives you a way to quickly and reliably set up infrastructure for a project. Once you have that, you can go on to the next phase, which would be to install and configure software packages. My next article will use the template to do a project to analyze log activity.