
This assignment covers data processing for big data.
FIT5202 – Data processing for Big Data
Assignment 2: Detecting Linux system hacking activities Part B

1. Producing the data (30%)
In this task, we will implement two Apache Kafka producers (one for process and one for
memory) to simulate the real-time streaming of the data.
Important:
– In this task, you need to generate the event timestamp in the UTC timezone for each data record in the producer, and then convert the timestamp to unix-timestamp format (keeping the UTC timezone) to simulate the “ts” column. For example, if the current time is 2020-10-10 10:10:10 UTC, it should be converted to the value 1602324610 and stored in the “ts” column (a minimal sketch of this conversion follows this list)
– Do not use Spark in this task
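As a quick illustration of the timestamp requirement, the snippet below (a minimal sketch, not part of the official spec) generates a UTC timestamp, converts it to an integer unix timestamp, and checks the example value given above.

```python
import datetime as dt

def current_unix_ts():
    # Timezone-aware UTC "now", converted to integer seconds since the epoch.
    return int(dt.datetime.now(dt.timezone.utc).timestamp())

# Sanity check against the example in the spec:
# 2020-10-10 10:10:10 UTC -> 1602324610
example = dt.datetime(2020, 10, 10, 10, 10, 10, tzinfo=dt.timezone.utc)
assert int(example.timestamp()) == 1602324610
```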
1.1 Process Event Producer (15%)
Write a python program that loads all the data from “Streaming_Linux_process.csv”. Save
the file as Assignment-2B-Task1_process_producer.ipynb.
Your program should send X records from each machine, following the sequence, to the Kafka stream every 5 seconds (see the sketch after this list).
– The number X should be a random number between 10 and 50 (inclusive), regenerated for each machine in each cycle.
– You will need to append the event time in unix-timestamp format (as mentioned above).
– If the data is exhausted, restart from the first sequence again.
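The sketch below shows one possible shape of such a producer loop. It assumes the kafka-python package, a broker at localhost:9092, a topic named "process_topic", and a machine-identifier column named "machine" in the CSV; all of these names are assumptions rather than part of the spec, and the rows are assumed to already be in the required sequence order.

```python
import csv, json, random, time
import datetime as dt
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'))

with open('Streaming_Linux_process.csv') as f:
    rows = list(csv.DictReader(f))

# Group the rows by machine, preserving the original sequence order.
machines = {}
for r in rows:
    machines.setdefault(r['machine'], []).append(r)

positions = {m: 0 for m in machines}              # next record to send per machine
while True:
    ts = int(dt.datetime.now(dt.timezone.utc).timestamp())
    for m, data in machines.items():
        x = random.randint(10, 50)                # X regenerated per machine per cycle
        for _ in range(x):
            record = dict(data[positions[m]])
            record['ts'] = ts                     # append the event time (unix format)
            producer.send('process_topic', value=record)
            positions[m] = (positions[m] + 1) % len(data)   # wrap around when exhausted
    producer.flush()
    time.sleep(5)                                 # one batch every 5 seconds
```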
1.2 Memory Event Producer (15%)
Write a python program that loads all the data from “Streaming_Linux_memory.csv”. Save
the file as Assignment-2B-Task1_memory_producer.ipynb.
Your program should send X records from each machine, following the sequence, to the Kafka stream every 10 seconds. In the same cycle, also generate Y records with the same timestamp; these Y records are held back and sent after 10 seconds (i.e., in the next cycle). A figure demonstrating the timeline is shown below (Fig 2), followed by a sketch of one possible way to handle the delayed records.
– The number X should be a random number between 20 and 80 (inclusive), regenerated for each machine in each cycle.
– The number Y should be a random number between 0 and 5 (inclusive), regenerated for each machine in each cycle.
– You will need to append the event time in unix-timestamp format (as mentioned above).
– If the data is exhausted, restart from the first sequence again.
Fig 2: Timeline on the data generation and publication onto the stream
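A sketch of how the held-back Y records could be handled is shown below. It reuses the same producer, machines, and positions setup as the Task 1.1 sketch (but loading "Streaming_Linux_memory.csv") and assumes a topic named "memory_topic"; these details are assumptions, not part of the spec.

```python
import random, time
import datetime as dt

delayed = []                                      # records generated last cycle, sent this cycle
while True:
    ts = int(dt.datetime.now(dt.timezone.utc).timestamp())

    # 1. Send the Y records that were held back from the previous cycle.
    for record in delayed:
        producer.send('memory_topic', value=record)
    delayed = []

    # 2. Send X records per machine now; queue Y records (same ts) for the next cycle.
    for m, data in machines.items():
        x = random.randint(20, 80)
        y = random.randint(0, 5)
        for _ in range(x):
            record = dict(data[positions[m]])
            record['ts'] = ts
            producer.send('memory_topic', value=record)
            positions[m] = (positions[m] + 1) % len(data)
        for _ in range(y):
            record = dict(data[positions[m]])
            record['ts'] = ts
            delayed.append(record)
            positions[m] = (positions[m] + 1) % len(data)

    producer.flush()
    time.sleep(10)                                # one cycle every 10 seconds
```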
2. Consuming data using Kafka (10%)
In this task, we will implement multiple Apache Kafka consumers to consume the data from Task 1.
Important:
– In this task, use a Kafka consumer to consume the data from Task 1.
– Do not use Spark in this task
2.1 Process Event Consumer (5%)
Write a Python program that consumes the process events using a Kafka consumer and visualises the record counts in real time. Save the file as Assignment-2B-Task2_process_consumer.ipynb.
Your program should get the count of records arriving in each 2-min window for each machine, and use line charts to visualise the counts from the start to the most recent window (a consumer sketch is given after the hint below).
– Hint – the x-axis can be used to represent the timeline, while the y-axis can be used to represent the count; each machine’s line can be drawn in a different colour with a legend
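One possible consumer/plotting loop is sketched below. It assumes kafka-python, matplotlib, a topic named "process_topic", and a "machine" field in each record; these names are assumptions, and the exact refresh behaviour depends on the notebook's matplotlib backend (an interactive backend is needed for the chart to update in place).

```python
import json
from collections import defaultdict
from kafka import KafkaConsumer
import matplotlib.pyplot as plt

consumer = KafkaConsumer(
    'process_topic',
    bootstrap_servers=['localhost:9092'],
    value_deserializer=lambda v: json.loads(v.decode('utf-8')))

WINDOW = 120                                    # 2-minute windows, in seconds
counts = defaultdict(lambda: defaultdict(int))  # counts[machine][window_start] = n

fig, ax = plt.subplots()
for msg in consumer:
    record = msg.value
    window_start = int(record['ts']) // WINDOW * WINDOW
    counts[record['machine']][window_start] += 1

    # Redraw the line chart: one line per machine, x = window start, y = count.
    ax.clear()
    for machine, series in counts.items():
        xs = sorted(series)
        ax.plot(xs, [series[x] for x in xs], label=f'machine {machine}')
    ax.set_xlabel('2-min window start (unix time)')
    ax.set_ylabel('record count')
    ax.legend()
    fig.canvas.draw()
```

The memory event consumer in Task 2.2 follows the same pattern, only with the memory topic name.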
2.2 Memory Event Consumer (5%)
Write a Python program that consumes the memory events using a Kafka consumer and visualises the record counts in real time. Save the file as Assignment-2B-Task2_memory_consumer.ipynb.
Your program should get the count of records arriving in each 2-min window for each machine, and use line charts to visualise the counts from the start to the most recent window.
– Hint – the x-axis can be used to represent the timeline, while the y-axis can be used to represent the count; each machine’s line can be drawn in a different colour with a legend
3. Streaming application using Spark Structured Streaming (60%)
In this task, we will implement Spark Structured Streaming to consume the data from task 1
and perform predictive analytics.
Important:
– In this task, use Spark Structured Streaming together with Spark SQL and ML
– You are also provided with a set of pre-trained pipeline models: one for predicting attacks in process data and another for predicting attacks in memory data
Write a python program that achieves the following requirements. Save the file as
Assignment-2B-Task3_streaming_application.ipynb.
1. SparkSession is created using a SparkConf object, which should use two local cores with a proper application name, and use UTC as the timezone (4%). A minimal configuration sketch follows.
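One possible way to satisfy this requirement (the application name below is a placeholder):

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .setMaster('local[2]')                         # two local cores
        .setAppName('Assignment-2B-Streaming')         # any sensible application name
        .set('spark.sql.session.timeZone', 'UTC'))     # use UTC as the timezone

spark = SparkSession.builder.config(conf=conf).getOrCreate()
```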
2. From the Kafka producers in Task 1.1 and 1.2, ingest the streaming data into Spark Structured Streaming for both process and memory activities (3%), as sketched below
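A sketch of the Kafka source for the process stream is shown below; the topic name and broker address are assumptions carried over from the Task 1 sketches, and the memory stream is read the same way from its own topic.

```python
process_raw = (spark.readStream
               .format('kafka')
               .option('kafka.bootstrap.servers', 'localhost:9092')
               .option('subscribe', 'process_topic')
               .load()
               .selectExpr('CAST(value AS STRING) AS value'))   # raw JSON string, parsed in step 3
```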
3. Then the streaming data format should be transformed into the proper formats following the metadata file schema for both process and memory, similar to Assignment 2A (3%). A sketch of these transformations is given after this list.
– The numeric values with extra spaces or “K” / “M” / “G” suffixes should be properly transformed into their correct values
– The NICE value should also be restored based on the PRI value using their relationship
– Hint – There is a mapping between PRI (priority) and NICE, as long as the process has not yet finished during the last interval. For example,
– PRI 100 maps to NICE -20
– PRI 101 maps to NICE -19
– …
– PRI 139 maps to NICE 19
– Hint – If the process is finished, PRI and NICE would both be 0.
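The sketch below illustrates both transformations with Spark SQL functions. The column names ("VSIZE" as an example size column) and the 1024 multiplier for K/M/G are assumptions; use whatever the Assignment 2A metadata schema actually specifies. The PRI-to-NICE rule (NICE = PRI - 120 for a running process, both 0 for a finished process) follows from the hints above.

```python
import pyspark.sql.functions as F

def parse_size(col_name):
    """Trim extra spaces and expand K/M/G suffixes into plain numbers (sketch)."""
    c = F.trim(F.col(col_name))
    num = F.regexp_replace(c, '[KMG]$', '').cast('double')
    return (F.when(c.endswith('K'), num * 1024)
             .when(c.endswith('M'), num * 1024 * 1024)
             .when(c.endswith('G'), num * 1024 * 1024 * 1024)
             .otherwise(num))

# 'VSIZE' is a placeholder for any numeric column that may carry K/M/G suffixes;
# the NICE column in the process data is restored from PRI.
memory_df = memory_df.withColumn('VSIZE', parse_size('VSIZE'))
process_df = process_df.withColumn(
    'NICE', F.when(F.col('PRI') == 0, 0).otherwise(F.col('PRI') - 120))
```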
4. For process and memory respectively, create a new column “CMD_PID” concatenating the “CMD” and “PID” columns, and a new column “event_time” in timestamp format based on the unix time in the “ts” column (5%). A sketch follows this item.
– Allow a 20-second tolerance for possible data delay on “event_time” using watermarking
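For instance, for the process stream (the memory stream is handled identically); whether a separator is used inside "CMD_PID" is your own choice:

```python
import pyspark.sql.functions as F

process_df = (process_df
              .withColumn('CMD_PID', F.concat(F.col('CMD'), F.col('PID').cast('string')))
              .withColumn('event_time', F.col('ts').cast('timestamp'))   # unix seconds -> timestamp
              .withWatermark('event_time', '20 seconds'))                # 20-second late-data tolerance
```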
5. Persist the transformed streaming data in parquet format for both process and memory (5%), as sketched below
– The process data should be stored in “process.parquet” in the same folder as your notebook, and the memory data should be stored in “memory.parquet” in the same folder as your notebook
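A sketch of the parquet sink for the process stream; "checkpoint_process" is an arbitrary checkpoint directory name, and the memory stream is written the same way to "memory.parquet".

```python
process_sink = (process_df.writeStream
                .format('parquet')
                .outputMode('append')
                .option('path', 'process.parquet')
                .option('checkpointLocation', 'checkpoint_process')
                .start())
```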
6. Load the given machine learning models, and use the models to predict whether each process or memory streaming record is an attack event, respectively (5%). A loading sketch is given below.
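The model folder names below are placeholders for whatever the supplied pipeline models are actually called.

```python
from pyspark.ml import PipelineModel

process_model = PipelineModel.load('process_pipeline_model')   # path is an assumption
memory_model = PipelineModel.load('memory_pipeline_model')     # path is an assumption

process_pred = process_model.transform(process_df)   # adds a 'prediction' column
memory_pred = memory_model.transform(memory_df)
```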
7. Using the prediction results, monitor the data following the requirements below (30%)
a. If any program on one machine is predicted as an attack in EITHER the process or the memory activity prediction, it could be a false alarm or a potential attack. Keep track of the approximate count of such events in every 2-min window for each machine, for process and memory respectively, and write the stream into the Spark memory sink using complete mode (a sketch of this aggregation follows the example below)
– Your aggregated result should include the machine ID, the time window, and the counts
– Note that if there is more than one entry having the SAME “CMD_PID” in a 2-min window, get the approximate distinct count
– For example, if two or more records of the “atop” program with the exact same “CMD_PID” are predicted as an attack in the process stream between 2020-10-10 10:10:10 and 2020-10-10 10:11:09, this “atop” program attack only needs to be counted once.
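The sketch below shows one way to express requirement 7a for the process stream. It assumes the pipeline model writes its output to a 'prediction' column where 1.0 marks an attack, and that the machine identifier column is named 'machine'; adjust the names to the actual schema, and handle the memory stream the same way under its own query name.

```python
import pyspark.sql.functions as F

process_attacks = (process_pred
                   .filter(F.col('prediction') == 1.0)
                   .groupBy(F.window(F.col('event_time'), '2 minutes'),
                            F.col('machine'))
                   .agg(F.approx_count_distinct('CMD_PID').alias('attack_count')))

query = (process_attacks.writeStream
         .format('memory')
         .queryName('process_attack_counts')          # in-memory table name
         .outputMode('complete')
         .start())

# The in-memory table can then be inspected with, e.g.:
# spark.sql("SELECT * FROM process_attack_counts ORDER BY window").show(truncate=False)
```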