COSC 2670/2732 Practical Data Science with Python Project Assignment 1



The key objective of this assignment is to learn how to process messy text data using Python. Data is always going to start in a form you may not want, so you must massage it into a form you do want. Python is a perfect tool for processing large volumes of raw data.

Australians love sports, so it seems fitting to begin our data adventure by processing a bit of statistical data from Australian Rules Football. If you are unfamiliar with the sport, you might want to watch this video https:// and/or read a bit about the rules on Wikipedia https://

Provided files

The following template files are provided:

The Templates

We realise that many of you are just getting used to Python, and there is much to learn. So we have tried to provide you with a well-defined harness to guide you towards a working prototype. Once you have the data, you will apply some basic data analysis techniques to find useful information in it.

Creating Anaconda Environment

The first task is to create the correct Anaconda working environment. The steps may differ a little depending on your OS, but you should be able to find the right invocation for your platform of choice. I show the command line invocation here.

conda create -n PDSA1 python=3.8

conda activate PDSA1

pip install -r requirements.txt

That’s it. This will create a new environment that you can enter in Anaconda using “conda activate PDSA1” and exit using “conda deactivate”. For the first assignment you should not need anything except the Python core library and the packages related to Jupyter notebook.

The Data

The data you will be processing has been crawled from the web to produce a set of markdown files. It is raw statistical information for Australian Rules Football. You should study several of the input files in a text editor to get a feel for what you will have to parse. You will quickly start to recognize clear patterns in the line structure that you can exploit to process thousands of lines of raw data. The two main file types are team-based statistics. The data is not clean, but it is well-formed, which makes it very amenable to Python data wrangling to extract and aggregate all of the data into a more usable form: a dataframe.

pandas is a popular Python package that is used regularly in data science, and dataframes are a very useful way to organize and process columns of data, similar to a database table but without the overhead of a full RDBMS. However, as many of you are still learning Python, we will not use a dataframe in this project. It would be relatively easy to convert the data array we are using in this project into pandas, as you will have done all the hard work of cleaning the data, but we will save that challenge for another day.

Program output

When your program is combined with the datasets supplied, it should produce the output specified in the code template. The primary output format is a tsv file, which is similar to a csv file but uses tabs instead of commas to separate fields. When working with long strings of text, tabs are less likely to collide with the field content: any tabs in the text can be replaced with spaces without changing its meaning, whereas commas cannot. Once you have the data in tsv (or even csv) form, it is easy to serialize it out to a file for storage and reload it when you need the data again. You can also easily load some and not all of the data if there is a lot to process. Do not change the output functions provided, or you will fail the automated harness tests. All you need to do is implement the functions in the skeleton code. Once you get each of these functions to work, it will output the answers automatically. Each function is worth a subset of the 30 possible points you can get on the project.
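To make the tsv format concrete, here is a minimal sketch that writes one row as tab-separated text and reads it back using the standard csv module. The helper names rows_to_tsv and tsv_to_rows are hypothetical and are not part of the provided template; the template's own output functions remain the ones to use for submission.

```python
import csv
import io

# One row built from the sample line shown in this document.
sample_row = ["Richmond", "2021", "R1", "H", "Carlton", "3.3 8.5 10.8 15.15", "105"]


def rows_to_tsv(rows):
    """Serialize a list of lists into tab-separated text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def tsv_to_rows(text):
    """Parse tab-separated text back into a list of lists of strings."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


text = rows_to_tsv([sample_row])
print(text, end="")
print(tsv_to_rows(text) == [sample_row])
```

Every cell comes back as a string, so numeric columns need converting with int() after a reload.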

Processing Raw Team Statistics (15/30 marks)

The first challenge is the core of our first assignment. The basic idea is to process several semi-structured text files which contain outcomes for all AFL games for each team. For example, the start of the file contains the following information:

| Richmond |
| — |
|     2021 |
| — |
| Rnd | T | Opponent | Scoring | F | Scoring | A | R | M | W-D-L | Venue | Crowd | Date |
| R1 | H | Carlton | 3.3 8.5 10.8 15.15        | 105 | 3.2 6.6 8.12 11.14 | 80 | W | 25 | 1-0-0 | M.C.G. |        49218 | Thu 18-Mar-2021 7:25 PM |
…

There are only two fielded line types: the ones containing header information, like the first four, and the core statistics lines with 14 fields of data. Our goal is simply to walk each of these files and create a single two-dimensional array (a list of lists in Python). Each line of the array will contain the following 15 cells:

So, we can see that the columns we really want are already the way we want them to be, with the exception of team name and year, which you can extract easily from the files; we just need to split the lines into columns. We will talk in the lectorials and practicals about how to process a text file line by line, as well as how to locate and split the lines we want using the line.split('|') command. This may seem daunting at first, but once you get the hang of it, you will find that parsing data files is surprisingly easy in Python. You just have to know what you need to extract and set up guards to ensure you ignore the lines you do not care about. You will want to ignore the header lines, which repeat for each year, and the two lines of cumulative statistics at the end of each year. These lines will look like this:

| Totals | 188.162 |        1290 | 187.177 |      1299 | P:16 W:7 D:0 L:9 |                                              | 577583 |     |

| Averages |      12.10 |    81 |     12.11 |     81 |     |     |     36099 |     |

So if the first column contains Rnd, Totals, or Averages, you want to skip over it. You will skip the separator lines like | — | too. Everything else you will want to carefully capture as you walk each file in order to build complete rows in the final array.
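The walk described above can be sketched as follows. This is a minimal illustration, not the template's required implementation: the function names are hypothetical, and it assumes that the team name and the year each sit alone on a single-cell line (with the year being all digits), as in the sample shown earlier. The sample lines are copied from this document.

```python
SKIP_FIRST_CELLS = {"Rnd", "Totals", "Averages"}
DASH_CHARS = set("-—–:")  # characters seen in separator lines such as "| — |"


def split_cells(line):
    """Split a '|'-delimited line into stripped cells, dropping the empty edge pieces."""
    parts = [cell.strip() for cell in line.strip().split("|")]
    if parts and parts[0] == "":
        parts = parts[1:]
    if parts and parts[-1] == "":
        parts = parts[:-1]
    return parts


def is_separator(cells):
    """True when every cell is non-empty and made only of dash-like characters."""
    return all(cell != "" and set(cell) <= DASH_CHARS for cell in cells)


def parse_team_lines(lines):
    """Build [team, year, ...game cells] rows from the raw lines of one team file."""
    rows, team, year = [], None, None
    for line in lines:
        if "|" not in line:
            continue  # guard: not a fielded line
        cells = split_cells(line)
        if not cells or is_separator(cells):
            continue
        if len(cells) == 1:
            # Assumption: single-cell lines carry either the year or the team name.
            if cells[0].isdigit():
                year = cells[0]
            else:
                team = cells[0]
            continue
        if cells[0] in SKIP_FIRST_CELLS:
            continue  # repeated header, Totals, or Averages line
        rows.append([team, year] + cells)
    return rows


sample_lines = [
    "| Richmond |",
    "| — |",
    "|     2021 |",
    "| — |",
    "| Rnd | T | Opponent | Scoring | F | Scoring | A | R | M | W-D-L | Venue | Crowd | Date |",
    "| R1 | H | Carlton | 3.3 8.5 10.8 15.15        | 105 | 3.2 6.6 8.12 11.14 | 80 | W | 25 | 1-0-0 | M.C.G. |        49218 | Thu 18-Mar-2021 7:25 PM |",
    "| Averages |      12.10 |    81 |     12.11 |     81 |     |     |     36099 |     |",
]

rows = parse_team_lines(sample_lines)
print(len(rows), len(rows[0]))
```

Remembering the most recent team and year while walking the file is what lets each game row be completed without looking back through earlier lines.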

Challenge 2 Validate total scores, margins, and outcome (10/30 marks)

This task will challenge you a little more. Several rows of data are not guaranteed to be correct in the raw data. However, there is always enough information in a row to validate and correct the errors you encounter. More specifically, the total scores for home and away teams may be incorrect, which means the margin and final outcome may be wrong too. However, the quarterly scores for both home and away teams are always correct, so your goal is to parse these two fields and separate off the last recorded quarter score. Since these are cumulative, it is all you need to get the final score for the team. For example, given a game scoring of “1.0 1.4 4.5 5.8”, you would separate off “5.8”, split 5 and 8, ensure they are integers and not strings, multiply 5 by 6 as each goal is worth a total of six points, and add 8 as each behind is worth only one point.
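The arithmetic in this example can be sketched as a small function. The name final_score is hypothetical; the skeleton code may ask for a different name or signature.

```python
def final_score(quarter_scores):
    """Return the total points from a cumulative quarter-by-quarter scoring string."""
    # The scores are cumulative, so only the last recorded quarter matters.
    last_quarter = quarter_scores.split()[-1]
    goals_text, behinds_text = last_quarter.split(".")
    goals, behinds = int(goals_text), int(behinds_text)
    # Each goal is worth six points and each behind is worth one.
    return goals * 6 + behinds


print(final_score("1.0 1.4 4.5 5.8"))
print(final_score("3.3 8.5 10.8 15.15"))
```

For “1.0 1.4 4.5 5.8” this gives 5 × 6 + 8 = 38, and for the Richmond sample row it gives 15 × 6 + 15 = 105, which matches the F column of that row.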

In summary, you need to check the For Total, Against Total, Margin, and Result columns in every row to ensure that the scores, the margin, and the final outcome are all correct. All corrections are applied to the current array, which is then written out one last time.
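A per-row check along these lines might look like the sketch below. Several details are assumptions rather than facts from the specification: the column positions (team and year followed by the 13 header columns), the result codes "L" and "D" (only "W" appears in the sample), a signed margin computed as for minus against, and storing corrected values as strings. The helper names are hypothetical.

```python
# Assumed positions in a 15-cell row: team, year, Rnd, T, Opponent,
# Scoring (for), F, Scoring (against), A, R, M, W-D-L, Venue, Crowd, Date.
FOR_SCORING, FOR_TOTAL = 5, 6
AGAINST_SCORING, AGAINST_TOTAL = 7, 8
RESULT, MARGIN = 9, 10


def final_score(quarter_scores):
    """Total points from the last cumulative 'goals.behinds' entry."""
    goals, behinds = quarter_scores.split()[-1].split(".")
    return int(goals) * 6 + int(behinds)


def validate_row(row):
    """Return a copy of the row with totals, margin, and result recomputed."""
    fixed = list(row)
    score_for = final_score(row[FOR_SCORING])
    score_against = final_score(row[AGAINST_SCORING])
    margin = score_for - score_against  # assumption: negative for a loss
    if margin > 0:
        result = "W"
    elif margin < 0:
        result = "L"  # assumed code for a loss
    else:
        result = "D"  # assumed code for a draw
    fixed[FOR_TOTAL] = str(score_for)
    fixed[AGAINST_TOTAL] = str(score_against)
    fixed[MARGIN] = str(margin)
    fixed[RESULT] = result
    return fixed


# The sample row from this document, deliberately corrupted for illustration.
corrupted = ["Richmond", "2021", "R1", "H", "Carlton",
             "3.3 8.5 10.8 15.15", "999", "3.2 6.6 8.12 11.14", "80",
             "L", "0", "1-0-0", "M.C.G.", "49218", "Thu 18-Mar-2021 7:25 PM"]

repaired = validate_row(corrupted)
print(repaired[FOR_TOTAL], repaired[AGAINST_TOTAL], repaired[RESULT], repaired[MARGIN])
```

Recomputing all four columns unconditionally is simpler than testing each one first, and it leaves rows that were already correct unchanged.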

Challenge 3 Find top five wins home and away (5/30 marks)

This raises the bar one more time and will require you to use a non-trivial data structure / function to find the biggest wins of all time, home and away. This challenge is definitely easier if you know you have the correct margins for every game. If you do, you just need to find the rows with the largest margins for home wins and away wins in the data set.

Hint: There are definitely multiple ways to achieve this goal, but we will cover an example in the lectorials that shows a problem that is analogous to this one. So pay attention in lectorials and make sure you understand the heapq.nlargest call when we cover it.
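As one possible shape for this, the sketch below uses heapq.nlargest with a key function. It is not the lectorial example. The column positions, the "A" flag for away games (only "H" appears in the sample), and the helper names are assumptions, and the toy rows are invented placeholders rather than real results.

```python
import heapq

# Assumed positions in a 15-cell row (team, year, then the 13 header columns).
TEAM, T_COLUMN, RESULT, MARGIN = 0, 3, 9, 10


def make_toy_row(team, home_or_away, result, margin):
    """Build a 15-cell placeholder row; only the cells used below are filled in."""
    row = [""] * 15
    row[TEAM], row[T_COLUMN] = team, home_or_away
    row[RESULT], row[MARGIN] = result, str(margin)
    return row


def top_wins(rows, home_or_away, n=5):
    """Return the n winning rows with the largest margins for 'H' or 'A' games."""
    wins = [row for row in rows
            if row[T_COLUMN] == home_or_away and row[RESULT] == "W"]
    # Margins are stored as strings, so convert them before comparing.
    return heapq.nlargest(n, wins, key=lambda row: int(row[MARGIN]))


# Invented rows, for illustration only.
toy_rows = [
    make_toy_row("Team A", "H", "W", 30),
    make_toy_row("Team B", "H", "W", 12),
    make_toy_row("Team C", "A", "W", 45),
    make_toy_row("Team D", "H", "L", -50),
    make_toy_row("Team E", "H", "W", 77),
    make_toy_row("Team F", "A", "W", 3),
]

home_top = top_wins(toy_rows, "H", n=2)
away_top = top_wins(toy_rows, "A")
print([row[TEAM] for row in home_top])
print([row[TEAM] for row in away_top])
```

Filtering to wins before calling nlargest keeps losses out of the ranking, whichever sign convention the margin column uses.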