CS 346 (Fall 20): Cloud Computing
Project #9

1 Examples
For the first part of this Project, you are going to run a demo job, which I have
already defined for you. (You don’t need to turn this in.) In the second part,
you will need to implement a MapReduce job of your own, and you will turn
in that piece.
1.1 Example of a Mapper
For our trivial example, we’ll be writing a program which scans through a text
file to find out how often certain words were used. Create a Python script
that reads the lines of stdin, breaks each line into words based on whitespace,
and removes all of the punctuation characters. For each non-empty word that
remains, it prints out the word, a tab, and then the number 1.
Try it out, make sure it works. Need a handy input file? You could always
just use the program as its own input….
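A minimal sketch of a mapper that matches the description above. It assumes that "punctuation" means the ASCII characters in Python's string.punctuation, and the helper name map_line is only for illustration; your own version may differ.

```python
#!/usr/bin/env python3
"""Word-count mapper sketch: prints one "word<TAB>1" line per word on stdin."""
import string
import sys

# Assumption: punctuation = the ASCII characters listed in string.punctuation.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def map_line(line):
    """Yield (word, 1) for each non-empty word left after removing punctuation."""
    for token in line.split():              # split on any run of whitespace
        word = token.translate(_STRIP_PUNCT)
        if word:                            # skip tokens that were only punctuation
            yield word, 1


def main():
    for line in sys.stdin:
        for word, count in map_line(line):
            print("%s\t%d" % (word, count))


if __name__ == "__main__":
    main()
```

Note that this sketch does not change case, so "The" and "the" are counted as different keys.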
1.1.1 What We’re Doing
When we run this inside of MapReduce, the word (the first thing on the line)
is the key, and the number after the tab is the value. Note that MapReduce
doesn’t know anything about what either of these means – it just treats them
both as text fields. In this example, the key is always text from the input,
and the value is always 1. But there’s no reason to assume that either of these
is true, in general. In fact, the value doesn’t even have to be an integer!
1.2 Example of a Reducer
A reducer collects all of the values associated with a single key, and produces a
single new (k,v) pair. The MapReduce system guarantees that the (k,v) pairs
arrive grouped by key, so that all of the values for one key are adjacent.
Otherwise, this process would consume a gigantic amount of memory, because the
reducer would have to keep track of many, many keys and their associated values
at once.
The simplest reducer is one that simply sums up all of the values. It’s so
common that Hadoop includes one that you can just use – but for practice, we’re
going to write it by hand. Create a Python program with the following code:
if cur is not None:
    print("%s\t%d" % (cur, total))
This program reads (k,v) pairs from stdin. It keeps track of the “current”
word that it’s reading, along with a sum of all of the counts so far. When it
hits a new (different) word (or EOF), it prints out the old word and its count.
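A sketch of a reducer along these lines, ending with the same final check on cur. The function name reduce_pairs is illustrative, and the sketch assumes every non-empty input line has the form key, tab, integer.

```python
#!/usr/bin/env python3
"""Summing reducer sketch: reads grouped "key<TAB>count" lines from stdin."""
import sys


def reduce_pairs(lines):
    """Yield (key, total) once for each run of identical keys in the input."""
    cur = None      # the key we are currently summing
    total = 0       # sum of the values seen so far for cur
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue                        # ignore blank lines
        word, value = line.split("\t", 1)
        if word != cur:
            if cur is not None:
                yield cur, total            # a new key starts: emit the old one
            cur = word
            total = 0
        total += int(value)
    if cur is not None:
        yield cur, total                    # emit the last key at EOF


def main():
    for cur, total in reduce_pairs(sys.stdin):
        print("%s\t%d" % (cur, total))


if __name__ == "__main__":
    main()
```

Because only one key and one running total are held at a time, memory use stays constant no matter how many distinct keys pass through.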
Test this program out, using the input file you’ve chosen and the mapper
you’ve written.
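When testing locally, remember that the reducer expects its input grouped by key, so the mapper output has to be sorted before it is fed to the reducer. One way to do that is sketched below. It assumes your scripts are saved as mapper.py and reducer.py (hypothetical file names), and the function name run_local is just for illustration.

```python
import subprocess
import sys


def run_local(input_path, mapper="mapper.py", reducer="reducer.py"):
    """Run mapper -> sort by key -> reducer on one file; return the reducer output."""
    # Step 1: feed the input file to the mapper on stdin.
    with open(input_path, "rb") as infile:
        mapped = subprocess.run(
            [sys.executable, mapper],
            stdin=infile, capture_output=True, check=True,
        ).stdout

    # Step 2: stand in for MapReduce's grouping by sorting lines on the key,
    # which is everything before the first tab.
    lines = mapped.splitlines(keepends=True)
    lines.sort(key=lambda line: line.split(b"\t", 1)[0])

    # Step 3: feed the grouped lines to the reducer.
    reduced = subprocess.run(
        [sys.executable, reducer],
        input=b"".join(lines), capture_output=True, check=True,
    ).stdout
    return reduced.decode()
```

Calling run_local("input.txt") and printing the result should show one line per distinct word. If you skip the sort, the same word can show up several times in the output with partial counts.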
2 Running MapReduce
Now, we’ll actually run the MapReduce job on Amazon EMR. When you
later write your own MapReduce job, you will follow the same basic instructions
to execute a different job.
2.1 Setting up S3
MapReduce applications need someplace to get their input data. In EMR, the
most common place to get that data is from S3 – Amazon’s Simple Storage
Service. S3 is a service which allows you to upload files – from small files
like scripts, up to gigantic, multi-GB files – and then access them anywhere. It
provides a “directory tree” of sorts (in that it has folders to organize your data),
but it usually is not used as a simple filesystem; normally, you navigate directly
to an object using its name.
Under the Free Tier, you get (for 12 months) 5 GB of free storage space,
plus limited numbers of GET and PUT requests. This will be more than sufficient
for our work, but remember that it won’t be free after the first 12 months.
S3 organizes data into “buckets,” which basically are the roots of your
directory trees. The buckets, because they are publicly addressable, must have
names that are globally unique.
To create a bucket, open your AWS console, go to the S3 service, and click
“Create Bucket.” Give your bucket a name; it must have only letters, numbers, hyphens, and periods (and must not end in a number). (Some of these
limitations may not be enforced by S3, but are important for EMR.)
I’d suggest a name of the form:
demo-emr-bucket-346-SEMESTER-YOUR NETID HERE
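If you want to sanity-check a name before typing it into the console, a small sketch follows. The function names are hypothetical. The lowercase-only and 3 to 63 character checks come from S3's general bucket naming rules rather than from this handout; the "must not end in a number" check is the restriction stated above.

```python
import re

# Lowercase letters, digits, hyphens and periods only.
_ALLOWED = re.compile(r"[a-z0-9.-]+")


def bucket_name_problems(name):
    """Return a list of rule violations for a proposed bucket name (empty if none)."""
    problems = []
    if not 3 <= len(name) <= 63:
        problems.append("length must be 3-63 characters")
    if not _ALLOWED.fullmatch(name):
        problems.append("only lowercase letters, digits, hyphens and periods allowed")
    if name and name[-1].isdigit():
        problems.append("must not end in a number (restriction noted for EMR)")
    return problems


def suggested_bucket_name(semester, netid):
    """Fill in the suggested pattern, lowercased."""
    return ("demo-emr-bucket-346-%s-%s" % (semester, netid)).lower()


if __name__ == "__main__":
    for candidate in (suggested_bucket_name("fall20", "jdoe"),
                      suggested_bucket_name("fall20", "jdoe1")):
        print(candidate, bucket_name_problems(candidate))
```

The second candidate is flagged because a NetID that ends in a digit makes the whole bucket name end in a digit. The check cannot tell you whether the name is globally unique; only S3 can.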
2.2 Populate S3
Upload your mapper, reducer, and input file to the bucket you created in S3.
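The console upload is all you need here. If you would rather script it, a sketch using the boto3 library follows; it assumes boto3 is installed and your AWS credentials are configured, and the function name and file names are hypothetical.

```python
import os


def upload_job_files(s3_client, bucket, paths):
    """Upload each local file to the top level of the bucket; return the S3 URIs."""
    uris = []
    for path in paths:
        key = os.path.basename(path)              # object name = file name
        s3_client.upload_file(path, bucket, key)  # boto3 S3 client: Filename, Bucket, Key
        uris.append("s3://%s/%s" % (bucket, key))
    return uris


# Example call (not run here; needs boto3 and valid credentials):
#   import boto3
#   upload_job_files(boto3.client("s3"),
#                    "demo-emr-bucket-346-fall20-jdoe",
#                    ["mapper.py", "reducer.py", "input.txt"])
```

The returned s3:// URIs are the form in which you refer to these files when you later describe the job to EMR.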
2.3 EC2 Key Pair
Make sure that you still have your EC2 login key pair. (The same one you use
to ssh to your instances.) If you don’t, then go create a new one.
2.4 Create an EMR Cluster
Open another AWS console in another window, and go to the EMR service.
Click “Create Cluster.” While the quick options are pretty good, we’re going to
use the more advanced options in order to help us control costs. Click on “Go
to advanced options.”
2.4.1 Setting Up the Job
Under Software Configuration, all of the defaults are pretty good, except that
we want to set up our steps (at the bottom of the page). Start by clicking the
“Auto-terminate cluster…” option; this tells Amazon to automatically shut
down your cluster once its steps have finished.
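For reference, the same two choices (a step to run, and auto-termination) can also be expressed through the EMR API. The sketch below only builds the relevant fragments of a boto3 run_job_flow request; it is not part of the console walkthrough. The function names, step name, and S3 paths are hypothetical, and the other required arguments (cluster name, release label, instance types, IAM roles) are deliberately left out.

```python
def streaming_step(bucket, input_key, output_prefix,
                   mapper="mapper.py", reducer="reducer.py"):
    """Describe a Hadoop streaming step that uses scripts stored in the bucket."""
    base = "s3://%s/" % bucket
    return {
        "Name": "word count",                       # hypothetical step name
        "ActionOnFailure": "TERMINATE_CLUSTER",     # don't leave a failed cluster running
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "%s%s,%s%s" % (base, mapper, base, reducer),
                "-mapper", mapper,
                "-reducer", reducer,
                "-input", base + input_key,
                "-output", base + output_prefix,    # must not already exist
            ],
        },
    }


def request_fragment(step):
    """Return only the parts of a run_job_flow request that concern this section."""
    return {
        # False = shut the cluster down when no steps remain (auto-terminate).
        "Instances": {"KeepJobFlowAliveWhenNoSteps": False},
        "Steps": [step],
    }
```

Setting KeepJobFlowAliveWhenNoSteps to False plays the same cost-control role as the console's auto-terminate option: the cluster does not sit idle, and billing, after the work is finished.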