COSC 2670/2732 Practical Data Science with Python: Project Assignment 1


Objective

The key objective of this assignment is to learn how to process messy text data using Python. Data rarely starts out in the form you want, so you must massage it into the form you need. Python is a perfect tool for processing large volumes of raw data.

Australians love sports, so it seems fitting to begin our data adventure by processing a bit of statistical data from Australian Rules Football. If you are unfamiliar with the sport, you might want to watch this video https://youtu.be/Dtmu-1kMFZw and/or read a bit about the rules on Wikipedia https://en.wikipedia.org/wiki/Australian_Football_League.

Provided files

The following template files are provided:

The Templates

We realise that many of you are just getting used to Python, and there is much to learn. So we have tried to provide you with a well-defined harness to guide you towards a working prototype. Once you have the data, you will apply some basic data analysis techniques to find useful information in it.

Creating Anaconda Environment

The first task is to create the correct Anaconda working environment. The exact steps may differ a little depending on your OS, but you should be able to find the right invocation for your platform of choice. The command line invocation is shown here.

conda create -n PDSA1 python=3.8
conda activate PDSA1
pip install -r requirements.txt

That’s it. This will create a new environment that you can enter using “conda activate PDSA1” and exit using “conda deactivate”. For the first assignment you should not need anything except the Python core library and the packages related to Jupyter notebooks.

The Data

The data you will be processing has been crawled from the web to produce a set of markdown files containing raw statistical information for Australian Rules Football. You should study several of the input files in a text editor to get a feel for what you will have to parse. You will quickly start to recognize clear patterns in the line structure that you can exploit to process thousands of lines of raw data. The main files contain team-based statistics. The data is not clean, but it is well-formed, which makes it very amenable to Python data wrangling: you will extract and aggregate all of the data into a more usable form.

pandas is a popular Python package that is used regularly in data science, and you will find that its dataframes are a very useful way to organize and process columns of data, similar to a database table, without all the overhead of a full RDBMS. However, as many of you are still learning Python, we will not use a dataframe in this project. It would be relatively easy to convert the data array used in this project into pandas, as you will have done all the hard work of cleaning the data, but we will save that challenge for another day.

Program output

When your program is combined with the datasets supplied, it should produce the output specified in the code template. The primary output format is a TSV file, which is similar to a CSV file but uses tabs instead of commas to separate fields. When working with long strings of text, tabs make collisions easier to avoid: tabs can be removed from a text file by replacing them with spaces, but commas cannot. Once you have TSV files (or even CSV files), it is easy to serialize them out to a file for storage and reload them when you need the data again. You can also easily load some, rather than all, of the data if there is a lot to process. Do not change the output functions provided, or you will fail the automated harness tests. All you need to do is implement the functions in the skeleton code. Once you get each of these functions to work, it will output the answers automatically. Each function is worth a subset of the 30 possible points you can get on the project.
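As a concrete illustration of why tabs are convenient, here is a minimal sketch of serializing a list of lists to TSV and reloading it. The file name and example rows are made up for illustration; the assignment's own output functions, which you must not modify, handle this for you.

# Hedged sketch: write a list of lists as TSV, then load it back.
# 'rows' and 'games.tsv' are illustrative only.
rows = [['Richmond', '2021', 'R1', 'H', 'Carlton']]

# Write each row as one tab-separated line.
with open('games.tsv', 'w', encoding='utf-8') as out:
    for row in rows:
        out.write('\t'.join(str(cell) for cell in row) + '\n')

# Reload: one split per line recovers the original cells.
with open('games.tsv', encoding='utf-8') as src:
    reloaded = [line.rstrip('\n').split('\t') for line in src]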

Processing Raw Team Statistics (15/30 marks)

The first challenge is the core of our first assignment. The basic idea is to process several semi-structured text files which contain outcomes for all AFL games for each team. For example, the start of the richmond.md file contains the following information:

| Richmond |
| — |
|     2021 |
| — |
| Rnd | T | Opponent | Scoring | F | Scoring | A | R | M | W-D-L | Venue | Crowd | Date |
| R1 | H | Carlton | 3.3 8.5 10.8 15.15 | 105 | 3.2 6.6 8.12 11.14 | 80 | W | 25 | 1-0-0 | M.C.G. | 49218 | Thu 18-Mar-2021 7:25 PM | …

There are only two fielded line types: those containing header information, like the first four, and the core statistics lines with 14 fields of data. Our goal is simply to walk each of these files and create a single two-dimensional array (a list of lists in Python). Each line of the array will contain the following 15 cells: the team name, the year, and the statistics columns shown in the header row above.

So, with the exception of the team name and year, which you can extract easily from the files, the columns we really want are already the way we want them to be: we just need to split the lines into columns. We will discuss in the lectorials and practicals how to process a text file line by line, and how to locate and split the lines we want using line.split('|'). This may seem daunting at first, but once you get the hang of it, you will find that parsing data files is surprisingly easy in Python. You just have to know what you need to extract and set up guards to ensure you ignore the lines you do not care about. You will want to ignore the header lines, which repeat for each year, and the two lines of cumulative statistics at the end of each year. These lines will look like this:

| Totals | 188.162 |        1290 | 187.177 |      1299 | P:16 W:7 D:0 L:9 |                                              | 577583 |     |

| Averages |      12.10 |    81 |     12.11 |     81 |     |     |     36099 |     |

So if the first column contains Rnd, Totals, or Averages, you want to skip over it. You will skip separator lines like | — | too. Everything else you will want to carefully capture as you walk each file, in order to build complete rows in the final array. A sketch of this guard logic follows.
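Here is a minimal sketch of the control flow, assuming each table line begins and ends with '|' and that the team name can be supplied by the caller (for example, derived from the file name). The function name and signature are illustrative, not the skeleton's actual interface.

def parse_team_file(path, team):
    rows = []
    year = None
    with open(path, encoding='utf-8') as f:
        for raw in f:
            line = raw.strip()
            # Only table lines start and end with '|'; skip everything else.
            if not (line.startswith('|') and line.endswith('|')):
                continue
            cells = [c.strip() for c in line.split('|')[1:-1]]
            # Separator lines like "| --- |" carry no data.
            if all(c in ('', '-', '--', '---', '—') for c in cells):
                continue
            # Single-cell headers hold the team name or a year like "2021";
            # keep the year, skip the rest.
            if len(cells) == 1:
                if cells[0].isdigit():
                    year = cells[0]
                continue
            # Skip the column-header row and the two summary lines per year.
            if cells[0] in ('Rnd', 'Totals', 'Averages'):
                continue
            rows.append([team, year] + cells)
    return rows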

Challenge 2 Validate total scores, margins, and outcome (10/30 marks)

This task will challenge you a little more. Several rows of data are not guaranteed to be correct in the raw data; however, there is always enough information in a row to validate and correct the errors you encounter. More specifically, the total scores for home and away teams may be incorrect, which means the margin and final outcome may be wrong too. However, the quarterly scores for both home and away teams are always correct, so your goal is to parse these two fields and separate off the last recorded quarter score. Since the scores are cumulative, this is all you need to get the final score for the team. For example, given a game scoring of “1.0 1.4 4.5 5.8”, you would separate off “5.8”, split it into 5 and 8, ensure they are integers and not strings, multiply 5 by 6 (each goal is worth six points), and add 8 (each behind is worth one point), giving a final score of 38.
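A minimal sketch of that derivation, assuming the quarterly scoring string is space-separated with goals and behinds joined by a dot:

def final_score(scoring):
    # "1.0 1.4 4.5 5.8" -> last quarter "5.8" -> 5 goals, 8 behinds.
    goals, behinds = scoring.split()[-1].split('.')
    return int(goals) * 6 + int(behinds)   # 5*6 + 8 = 38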

In summary, you need to check the For Total, Against Total, Margin, and Result columns in every row to ensure that the scores, the margins, and the final outcomes are all correct. All corrections are applied to the current array, which is then written out one last time.
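Reusing final_score from the sketch above, the dependent columns might be recomputed as follows. The assumption that Result and Margin are taken from the “For” team's perspective matches the richmond.md example (105 against 80 gives W and 25); this is a sketch, not the skeleton's actual interface.

def corrected_columns(for_scoring, against_scoring):
    # Recompute both totals from the always-correct quarterly strings,
    # then derive the margin and outcome from them.
    f_total = final_score(for_scoring)
    a_total = final_score(against_scoring)
    margin = abs(f_total - a_total)
    result = 'W' if f_total > a_total else ('L' if f_total < a_total else 'D')
    return f_total, a_total, margin, result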

Challenge 3 Find top five wins home and away (5/30 marks)

This increases the bar one more time and will require you to use a non-trivial data structure / function to find the biggest wins of all time, home and away. This challenge is definitely easier if you know you have the correct margins for every game. If you do, you just need to find the rows with the largest margins for home wins and away wins in the data set.

Hint: There are definitely multiple ways to achieve this goal, but we will cover an example in the lectorials that shows a problem that is analogous to this one. So pay attention in lectorials and make sure you understand the heapq.nlargest call when we cover it.
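To illustrate the shape of such a solution, here is a minimal sketch using heapq.nlargest. The column indices are hypothetical assumptions about the row layout, not constants from the skeleton.

import heapq

# Hedged sketch: top five home wins by margin. T_COL, RESULT_COL and
# MARGIN_COL are illustrative indices into each row of the data array.
T_COL, RESULT_COL, MARGIN_COL = 3, 9, 10

def top_five_home_wins(data):
    # Keep only home games that the team won, then take the five
    # rows with the largest margins.
    home_wins = [row for row in data
                 if row[T_COL] == 'H' and row[RESULT_COL] == 'W']
    return heapq.nlargest(5, home_wins, key=lambda r: int(r[MARGIN_COL]))

The away-wins case is identical with 'A' in place of 'H'.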