COSC 2637/2633 Big Data Processing Assignment 3

Write Spark programs, which gives you the chance to apply the essential components you learned in lectures and
to understand the complexity of Spark programming.

The key course learning outcomes are:

• CLO 1 – Model and implement efficient big data solutions for various application areas using
appropriately selected algorithms and data structures.

• CLO 2 – Analyse methods and algorithms, to compare and evaluate them with respect to time and space
requirements and make appropriate design choices when solving real-world problems.

• CLO 4 – Explain the Big Data Fundamentals, including the evolution of Big Data, the characteristics of
Big Data and the challenges introduced.

• CLO 5 – Apply non-relational databases, the techniques for storing and processing large volumes of
structured and unstructured data, as well as streaming data.

• CLO 6 – Apply the novel architectures and platforms introduced for Big Data, i.e. Hadoop, MapReduce
and Spark.

Develop a Spark Streaming program with Scala Maven to monitor a folder in HDFS in real time, such that any
new file added to the folder is processed. The following three tasks are implemented in the same Scala
object (a sketch of the overall structure follows the task descriptions):

A. For each RDD of the DStream, count the word frequency and save the output in HDFS. Use a regular
expression to ensure that each word consists of alphabetic characters only (tip: findAllIn()). (10 marks)

B. For each RDD of the DStream, filter out the short words (i.e., words with fewer than 5 characters) and then count the co-occurrence
frequency of words (two words are considered to co-occur if they appear in the same line); save the output in
HDFS. (10 marks)

C. For the DStream, filter out the short words (i.e., words with fewer than 5 characters) and then count the co-occurrence
frequency of words (two words are considered to co-occur if they appear in the same line); save the output in
HDFS. Note that you are required to use the updateStateByKey operation to continuously update the co-
occurrence frequency of words as new information arrives. (10 marks)
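The sketch below shows one way the three tasks might sit together in a single Scala object. It is not a reference solution: the object name (StreamingTasks), the 10-second batch interval, the argument order, and the choice to count each unordered word pair once per line are all assumptions made for illustration; input and output paths are taken from the command line rather than hard-coded.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Hypothetical object name; all three tasks live in this single Scala object.
    object StreamingTasks {
      def main(args: Array[String]): Unit = {
        // Input and output paths come from the command line (requirement (g)).
        val inputDir = args(0)
        val outputA  = args(1)
        val outputB  = args(2)
        val outputC  = args(3)

        val conf = new SparkConf().setAppName("BDP-A3-Streaming")
        val ssc  = new StreamingContext(conf, Seconds(10))  // batch interval is an assumption
        ssc.checkpoint(outputC + "/_checkpoint")             // updateStateByKey needs a checkpoint dir

        // Monitor the HDFS folder; every file added to it is picked up in the next batch.
        val lines = ssc.textFileStream(inputDir)

        val wordPattern = "[a-zA-Z]+".r

        // Task A: per-batch word frequency, keeping alphabetic tokens only (findAllIn).
        val wordCounts = lines
          .flatMap(line => wordPattern.findAllIn(line))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        // saveAsTextFiles appends the batch time to the prefix, so every batch
        // gets its own timestamped output directory (requirement (i)).
        wordCounts.saveAsTextFiles(outputA + "/wordcount")

        // Words of at least 5 characters, paired up within each line; shared by Tasks B and C.
        val coOccurPairs = lines.flatMap { line =>
          val words = wordPattern.findAllIn(line).filter(_.length >= 5).toList.distinct
          for {
            a <- words
            b <- words
            if a.compareTo(b) < 0   // count each unordered pair once per line
          } yield ((a, b), 1)
        }

        // Task B: co-occurrence frequency within each batch RDD.
        coOccurPairs
          .reduceByKey(_ + _)
          .saveAsTextFiles(outputB + "/cooccurrence")

        // Task C: running co-occurrence frequency over the whole stream via updateStateByKey.
        coOccurPairs
          .updateStateByKey[Int] { (newCounts: Seq[Int], state: Option[Int]) =>
            Some(newCounts.sum + state.getOrElse(0))
          }
          .saveAsTextFiles(outputC + "/cooccurrence-running")

        ssc.start()
        ssc.awaitTermination()
      }
    }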

You should use Scala to develop your Spark program on AWS EMR (if you want to use another programming
language, contact the lecturer for approval).

Failure to follow the requirements below incurs a penalty:

(a) The source code is included in the Scala Maven project. (1 mark)
(b) You need to create a single Scala Maven project for all three tasks. (1 mark)
(c) Submit the developed Scala Maven project in a single .zip file with a standalone jar file. (1 mark)
(d) The zip file should be named sxxxxx_BDP_A3_2021.zip (replace sxxxxx with your student ID). (1 mark)
(e) You need to include a “README” file in the zip file. (1 mark)
(f) In the README, you must specify exactly how to run the standalone jar on the AWS EMR platform to
perform the tasks (see the example command after this list). (1 mark)
(g) Paths of input and output should not be hard-coded. (1 mark)
(h) Each task has its own output path. (1 mark)
(i) For each task, the output should be saved in HDFS under a name containing the current time or a unique
sequence number. (1 mark)
(j) All three tasks are implemented in the same Scala object. (5 marks)
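
For requirements (f) and (g), the README could document an invocation along the following lines. Everything shown is a placeholder that assumes the sketch above: the class name, the jar name, and the four HDFS paths passed as arguments.

    spark-submit --class StreamingTasks \
      --master yarn \
      sxxxxx_BDP_A3_2021.jar \
      hdfs:///user/hadoop/a3/input \
      hdfs:///user/hadoop/a3/outputA \
      hdfs:///user/hadoop/a3/outputB \
      hdfs:///user/hadoop/a3/outputC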