CS6513 Big Data Assignment 1 – Hadoop HDFS & MapReduce
Abstract: Show proficiency using HDFS and writing a MapReduce program,
including submitting to Hadoop and getting results out of HDFS.
Instructions:
– Submit your solution as a zip file named with your netid. E.g., if my netid is
jcr365, my file will be jcr365-hw1.zip.
– Submit in Brightspace no later than the due date. Late assignments will not
be accepted.
– If I cannot run your code, you will not get full credit, but submit whatever
you have for partial credit.
– Give attribution to any code you use that is not your original code.
Running on any version of Hadoop (Docker or Peel), submit screen grabs (pictures
in JPG or another suitable format) of the following, all done via the hadoop
command line:
a) Create a directory in HDFS with this format: netid-semester (e.g. mine will
be ‘jcr365-hw1’).
b) Create a file in your local computer with any text. Show a screen grab of
this file on your computer file system.
c) Transfer your file in (b) to your HDFS directory created in (a) above. Show a
screen grab of the file listing in HDFS, showing the commands you used.
d) Create a directory for homework problem 1.2 (n-gram count), and extract all
input files into it. Name this directory netid-hw1-2, e.g. mine will be
jcr365-hw1-2. Submit a picture of the directory listing or otherwise show the
input files in it.
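As a rough sketch of steps (a), (c), and (d), the snippet below only builds and prints the `hadoop fs` commands; it does not run them. The helper name `hdfs_setup_commands` and the local file name `notes.txt` are made up for illustration, and relative HDFS paths are assumed to resolve to your HDFS home directory.

```python
import shlex

def hdfs_setup_commands(netid, local_file):
    """Build argv lists for parts (a), (c), and (d); nothing is executed here."""
    hw_dir = f"{netid}-hw1"        # part (a), following the example name in the text
    ngram_dir = f"{netid}-hw1-2"   # part (d)
    return [
        ["hadoop", "fs", "-mkdir", "-p", hw_dir],     # (a) create the directory
        ["hadoop", "fs", "-put", local_file, hw_dir], # (c) copy the local file in
        ["hadoop", "fs", "-ls", hw_dir],              # (c) list it for the screen grab
        ["hadoop", "fs", "-mkdir", "-p", ngram_dir],  # (d) directory for the input files
        ["hadoop", "fs", "-ls", ngram_dir],           # (d) list it once the files are uploaded
    ]

if __name__ == "__main__":
    # "notes.txt" is a hypothetical local file name.
    for argv in hdfs_setup_commands("jcr365", "notes.txt"):
        print(shlex.join(argv))
```

The extracted input files for part (d) can be uploaded with the same `-put` form before taking the final listing.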
Modify the MapReduce WordCount template code (shown and provided in class)
to create an n-gram language model. An n-gram is a contiguous sequence of n
words from text: https://en.wikipedia.org/wiki/N-gram.
For this exercise, n = 3 (that is, compute unigrams, bigrams, and trigrams). The
input files are provided with this assignment.
– Input is multiple files of lines of text; do not compute n-grams across line
boundaries.
– Ignore case (map all text to lowercase).
– Ignore non-alphanumeric characters (after lowercasing, replace every
character not in [a-z0-9] with a space).
– unigram: a single word
– bigram: two consecutive words in the input sequence
– trigram: three consecutive words in the input sequence
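The rules above can be sketched in Python as follows. The function names `tokenize` and `ngrams` are illustrative, not part of the class template; the sketch handles one line at a time, so no n-gram crosses a line boundary.

```python
import re

def tokenize(line):
    """Lowercase the line, turn every character outside [a-z0-9] into a space, and split."""
    return re.sub(r"[^a-z0-9]", " ", line.lower()).split()

def ngrams(tokens, max_n=3):
    """Yield every unigram, bigram, and trigram (as space-joined strings) from one line."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

if __name__ == "__main__":
    for gram in ngrams(tokenize("The Cat in the Hat")):
        print(gram)
```

Note that replacing punctuation with a space splits a word such as "Cat's" into the two tokens "cat" and "s", which follows from the replacement rule as stated.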
The language model output in this exercise is defined as a probability table based
on word counts and maximum likelihood estimates (MLE). For example, given the
following line of text: “The Cat in the Hat is the best cat book ever”, the output
would look something like this:
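The example table from the handout is not reproduced in this copy. As a stand-in, the sketch below computes counts and probabilities for the example sentence, under the assumption that each probability is the n-gram's count divided by the total number of n-grams of the same order; the handout's exact table layout may differ. The name `ngram_table` is illustrative.

```python
import re
from collections import Counter

def ngram_table(line, max_n=3):
    """Return rows (n, ngram, count, probability) for one line of text."""
    tokens = re.sub(r"[^a-z0-9]", " ", line.lower()).split()
    rows = []
    for n in range(1, max_n + 1):
        counts = Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        total = sum(counts.values())  # assumed denominator: all n-grams of this order
        for gram, count in sorted(counts.items()):
            rows.append((n, gram, count, count / total))
    return rows

if __name__ == "__main__":
    for n, gram, count, prob in ngram_table("The Cat in the Hat is the best cat book ever"):
        print(f"{n}\t{gram}\t{count}\t{prob:.4f}")
```

Under that assumption the sentence has 11 unigrams, 10 bigrams, and 9 trigrams, so "the" (3 occurrences) gets 3/11 ≈ 0.2727 and "the cat" (1 occurrence) gets 1/10.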
Dataset: coca-samples-text.zip
Specifically, for every file in the input directory, output the number of times each
n-gram appears in the input lines. Note that you can pass Hadoop individual
files, directories (which are recursed), or simple wildcard patterns to match
multiple files.
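A minimal Hadoop streaming sketch of the counting job might look like the following, with the mapper and reducer kept in one file and selected by a command-line argument (`map` or `reduce`). This is one possible layout, not the class template; it counts across all input files together, and it relies on Hadoop delivering the mapper output to the reducer sorted by key.

```python
import re
import sys
from itertools import groupby

def map_lines(lines):
    """Mapper: emit 'ngram<TAB>1' for every 1-, 2-, and 3-gram on each line."""
    for line in lines:
        tokens = re.sub(r"[^a-z0-9]", " ", line.lower()).split()
        for n in (1, 2, 3):
            for i in range(len(tokens) - n + 1):
                yield " ".join(tokens[i:i + n]) + "\t1"

def reduce_lines(sorted_lines):
    """Reducer: sum the values of consecutive lines that share the same key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"

# Run as "python3 ngram_count.py map" or "python3 ngram_count.py reduce";
# the file name ngram_count.py is hypothetical.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

The job can be checked locally before submitting with a shell pipeline of the form `cat input.txt | python3 ngram_count.py map | sort | python3 ngram_count.py reduce`.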
IMPORTANT: The denominator of the probabilities cannot be computed by a single
map/reduce program (Why?). You have to post-process the counts output by the
first map/reduce job to get the total number of n-grams and then compute the
probabilities. You must write the solution entirely as a sequence of map/reduce
programs.
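One possible decomposition of the post-processing, sketched under assumptions: a second job sums the counts per n-gram order (1, 2, or 3), and a third, map-only job divides each count by the total for its order. The function names are illustrative, the input is assumed to be the `ngram<TAB>count` lines from the first job, and how the three totals reach the third job (for example a small side file) is left open.

```python
from collections import defaultdict

def map_order(count_lines):
    """Job 2 mapper: 'ngram<TAB>count' -> 'n<TAB>count', where n is the n-gram order."""
    for line in count_lines:
        ngram, count = line.rstrip("\n").split("\t")
        yield f"{len(ngram.split(' '))}\t{count}"

def reduce_totals(order_lines):
    """Job 2 reducer: sum counts per order; at most three keys, so a dict is safe."""
    totals = defaultdict(int)
    for line in order_lines:
        n, count = line.rstrip("\n").split("\t")
        totals[int(n)] += int(count)
    return dict(totals)

def map_probability(count_lines, totals):
    """Job 3 (map-only): append count / total for the n-gram's order (assumed definition)."""
    for line in count_lines:
        ngram, count = line.rstrip("\n").split("\t")
        n = len(ngram.split(" "))
        yield f"{ngram}\t{count}\t{int(count) / totals[n]:.6f}"

if __name__ == "__main__":
    counts = ["the\t3", "cat\t1", "the cat\t1"]  # toy output of the first job
    totals = reduce_totals(map_order(counts))
    for row in map_probability(counts, totals):
        print(row)
```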
Write your own code in your Hadoop language of choice. Your code MUST run in
Hadoop MapReduce. For Python, use Hadoop streaming as shown in class.
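For reference, a Hadoop streaming submission and the retrieval of its output can be sketched as below. The snippet only builds and prints the commands. The jar path, script names, and directory names are placeholders (the streaming jar location depends on your Hadoop installation), and the scripts are assumed to sit in the current directory.

```python
import shlex

def streaming_job(jar, mapper, reducer, input_path, output_path):
    """Build the argv for one streaming job; generic options (-files) come first."""
    return [
        "hadoop", "jar", jar,
        "-files", f"{mapper},{reducer}",   # ship both scripts to the cluster
        "-mapper", f"python3 {mapper}",
        "-reducer", f"python3 {reducer}",
        "-input", input_path,
        "-output", output_path,            # must not already exist in HDFS
    ]

def fetch_results(output_path, local_file):
    """Build the argv that merges the job's part files into one local file."""
    return ["hadoop", "fs", "-getmerge", output_path, local_file]

if __name__ == "__main__":
    # All names below are placeholders, not values from the assignment.
    job = streaming_job("hadoop-streaming.jar", "mapper.py", "reducer.py",
                        "jcr365-hw1-2", "jcr365-hw1-2-out")
    print(shlex.join(job))
    print(shlex.join(fetch_results("jcr365-hw1-2-out", "ngram_counts.txt")))
```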
Submit the code and the result output (not the input).