200 Points + 40 Extra Credit Points


1. Hadoop Map Reduce – Reservoir Sampling 100 points

Imagine you’re working with a terabyte-scale dataset and you have a MapReduce application you want to test with that dataset. Running your MapReduce application against the dataset may take hours, and constantly iterating with code refinements and rerunning against it isn’t an optimal workflow.

To solve this problem you look to sampling, which is a statistical methodology for extracting a relevant subset of a population. In the context of MapReduce, sampling provides an opportunity to work with large datasets without the overhead of having to wait for the entire dataset to be read and processed.

Reservoir Sampling

In the reservoir sampling algorithm, you first fill up an array of size K (the reservoir) with the rows being sampled.

Once the reservoir is full, each additional item i (where i > K, counting items from 1) may replace an item in the reservoir: choose a random integer j uniformly from 0 to i - 1. If j < K, item i replaces the reservoir element at index j; otherwise item i is discarded.

After all values are seen, the reservoir is the sample (of size K).
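The steps above can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not the MapReduce solution; the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for the example.

```python
import random


def reservoir_sample(rows, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for i, row in enumerate(rows, start=1):  # i counts items from 1
        if i <= k:
            # Fill phase: the first k items go straight into the reservoir.
            reservoir.append(row)
        else:
            # Replacement phase: j is uniform over 0 .. i - 1,
            # so item i is kept with probability k / i.
            j = rng.randrange(i)
            if j < k:
                reservoir[j] = row
    return reservoir
```

If the input has fewer than K rows, the function simply returns all of them.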

Dataset: (Daily hard disk failure log for the first quarter of 2019.)

SOLVE – (In Hadoop MapReduce code):



• Sample the dataset using Reservoir Sampling in MapReduce code. The value K can be supplied via a parameter (environment variable) or hard-coded in your code.
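One possible shape for this, sketched as Hadoop Streaming style functions in Python. It uses the random-sort variant of reservoir sampling (tag every row with a random number and keep the K rows with the smallest tags) rather than the index-replacement form described above, because per-mapper samples built this way can be merged in a reducer without bias. The environment variable name `RESERVOIR_K`, the function names, and the assumption of a single reducer are all illustrative choices, not part of the assignment. Header-row handling is omitted.

```python
import heapq
import os
import random


def read_k(env=None, default=1000):
    """Read K from a (hypothetical) RESERVOIR_K environment variable."""
    env = os.environ if env is None else env
    return int(env.get("RESERVOIR_K", default))


def mapper(lines, k, rng):
    """Tag each input line with a random number; emit the k smallest tags."""
    tagged = ((rng.random(), line) for line in lines)
    # heapq.nsmallest keeps only k candidates in memory at a time.
    for tag, line in heapq.nsmallest(k, tagged):
        yield f"{tag!r}\t{line}"


def reducer(tagged_lines, k):
    """Merge all mapper outputs and keep the k smallest tags overall."""
    pairs = []
    for record in tagged_lines:
        tag, line = record.split("\t", 1)
        pairs.append((float(tag), line))
    for _, line in heapq.nsmallest(k, pairs):
        yield line
```

In a streaming job, each script would feed `sys.stdin` (with newlines stripped) into `mapper` or `reducer` and print the results. Hadoop Streaming's `-cmdenv` option is one way to pass the environment variable to the tasks.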

NYU Tandon, Big Data Midterm

2. Spark Anomaly Detection – Hard Drive Failures

100 points

Data: same as Question 1. Reference:

Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior, called outliers.

Anomalies can be broadly categorized as:

1. Point anomalies: A single instance of data is anomalous if it’s too far off from the rest. Business use case: Detecting credit card fraud based on “amount spent.”

2. Contextual anomalies: The abnormality is context specific. This type of anomaly is common in time-series data. Business use case: Spending $100 on food every day during the holiday season is normal, but may be odd otherwise.

3. Collective anomalies: A set of data instances collectively helps in detecting anomalies. Business use case: Someone is trying to copy data from a remote machine to a local host unexpectedly, an anomaly that would be flagged as a potential cyber attack.

TO-DO, in Spark Scala, Spark Java, or PySpark (** not Pandas **):

– Define an anomaly as any value that deviates from the mean by more than one standard deviation.

– Find the mean and standard deviation of hard disk failures in the reservoir-sampled dataset from Question 1.

– List hard disk anomalies by model and serial_number, that is, hard disks whose total failure count exceeds our anomaly threshold.
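To make the arithmetic of the threshold concrete, here is a small sketch in plain Python. It is deliberately not Spark code and is not a submission for this question; it only illustrates the computation. It assumes each row has been reduced to a `(model, serial_number, failure)` tuple, where `failure` is a hypothetical 0/1 column name, that the population standard deviation is intended, and that only totals above mean + one standard deviation count as anomalies.

```python
from collections import defaultdict
from statistics import mean, pstdev


def find_anomalies(rows):
    """Return (mean, std, anomalous disks) for (model, serial, failure) rows."""
    # Total failure count per disk, keyed by (model, serial_number).
    totals = defaultdict(int)
    for model, serial_number, failure in rows:
        totals[(model, serial_number)] += failure

    counts = list(totals.values())
    mu = mean(counts)
    sigma = pstdev(counts)  # assumption: population standard deviation
    threshold = mu + sigma  # assumption: only the upper side is flagged

    anomalies = sorted(key for key, total in totals.items() if total > threshold)
    return mu, sigma, anomalies
```

In PySpark, the same quantities would come from a `groupBy` on the two key columns followed by aggregate functions such as `mean` and `stddev_pop` from `pyspark.sql.functions`.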

3. EXTRA CREDIT: Item Analysis

40 points


For the dataset in Homework 2, Question 1, show the two most-bought items and the two least-bought items, each as a tuple, per hour (over all days).

For example: (this data is made up):

9, (Coffee, Muffin), (Cookies, Jam)
10, (Coffee, Tea), (Bread, Mineral Water)
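The per-hour counting can be sketched in plain Python as follows. The input shape, a sequence of `(hour, item)` pairs, is an assumption, since the Homework 2 dataset's columns are not given here; the function name and the alphabetical tie-breaking rule are also illustrative choices.

```python
from collections import Counter, defaultdict


def top_and_bottom_items(purchases, n=2):
    """Return (hour, most-bought tuple, least-bought tuple) for each hour."""
    # Count purchases of each item within each hour, across all days.
    per_hour = defaultdict(Counter)
    for hour, item in purchases:
        per_hour[hour][item] += 1

    result = []
    for hour in sorted(per_hour):
        # Rank by descending count, breaking ties alphabetically.
        ranked = sorted(per_hour[hour].items(), key=lambda kv: (-kv[1], kv[0]))
        names = [item for item, _ in ranked]
        most = tuple(names[:n])
        least = tuple(reversed(names[-n:]))  # least-bought item first
        result.append((hour, most, least))
    return result
```

If an hour has fewer than four distinct items, the two tuples will overlap.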