Python辅导 | DS 6001: Practice and Application of Data Science

Please answer all of the following questions. Use text, Python and markdown code, and the output of code as needed to answer each question completely. Prepare your lab report as a Jupyter notebook. Upload this notebook to Collab prior to the due date and time.

In order to earn full credit for this assignment, you must not only demonstrate that you can use the techniques we discussed in class correctly, but also that you can communicate the results in a clear way by using a combination of markdown text and neatly formatted output. Your notebook should be sectioned to let us know what problem each code cell refers to.

Your lab report will be graded out of 10 points on the following criteria:

You are responsible for your own lab report. That said, you are encouraged to work collaboratively with other students. It is very important however not to cheat by copying text or code, or by allowing your text or code to be copied. If we see evidence of cheating, we will take measures such as initiating an action at the Honor Court, and we have to disallow this kind of collaboration in the future, and that’s worse for everyone.

Before we begin, load the following packages and functions:

If these commands return a “no module” error, use the pip install to download the package.

Why does a CSV file usually take less memory than a JSON formatted file for the samw data? Under what conditions could a JSON file be smaller in memory than a CSV file for the same data?

NASA has a dataset of all meteorites that have fallen to Earth between the years A.D. 860 and 2013. The data contain the name of each meteorite, along with the coordinates of the place where the meteorite hit, the mass of the meteorite, and the date of the collison. The data is stored as a JSON here: https://data.nasa.gov/resource/y77d-th95.json

Use pd.read_json() and json_normalize() to load the NASA data into a dataframe.

Save the NASA data from the previous question as a JSON string, reorganized so that the row and column names are stored as vectors, and the datapoints are stored as an array. [Hint: use a function to do this, do not try to type it out manually]

One limitation of json_normalize() is that it can only work with JSON data that is “semi-structured”. That is, it has to be apparent that different items in the list correspond to different observations in a dataframe. For example, we can pull data on the top 25 posts on reddit.com/r/all at the moment by using this URL: https://www.reddit.com/r/all.json. Take a moment and click on this link. The JSON is organized in levels: on the top level is a variable named “kind” which equals “listing” – this is metadata which tells us about the kind of data we can access here. The second item on the top level is named “data”, and it contains its own nesting of variables. Under “children” we finally see repeated lists named “0”, “1”, “2”, etc. through “24” for each of the 25 posts on Reddit’s /r/all frontpage. In this case, we cannot use json_normalize() to create a dataframe from the whole JSON, but we can use it on the sublist that refers to each of the top 25 posts.

For this problem, download the data from https://www.reddit.com/r/all.json as a series. Pull out the list that is contained in the “children” item of the “data” item of the JSON structure. Then pass this list to json_normalize() to create a dataframe. Finally, display the titles of the top 25 posts.

For this part of the assignment, download the “wines.csv”, “wine_tasters.csv”, “wine_locations.csv”, “wineries.csv”, and “wine_id.csv” files from this assignment’s post on Collab. Then proceed to the following questions:

Load each of the five CSV files listed above into Python as dataframes. (Note: no tricks here, all are ready to be loaded without special parameters or techniques.) Then create an SQL database called wine.db, and add all five dataframes to this database.

Create an Entity-Relationship diagram for the database. Use any software you like to create the diagram, or use a photo of a hand-drawn diagram. To display an image file in your notebook, create a markdown cell, and use code that looks like ![The image's caption](image.png), where image.png is the image file for your diagram.

Issue a query to wine.db to create a dataframe that contains the all of the available columns for wines that have review scores (points) of exactly 100. Within the SQL query, sort the rows from highest to lowest price.

Issue a query to wine.db to create a dataframe that contains the title, variety, points, and price for wines that cost at least $1000 per bottle. Within the SQL query, rename points to review_score, and sort the rows from highest to lowest values of review_score.

Issue a query to wine.db to create a dataframe that contains the title, description, variety, points, and price for wines whose descriptions contain the word “eucalyptus”. To account for the fact that sometimes this word is capitalized to begin a sentence, first recode the description column to be all upper-case. Sort the dataframe by price.

Issue a query to wine.db to create a dataframe that contains the title, description, variety, points, and price for wines whose descriptions contains all of the words “chocolate”, “leather”, and “wood’. Use the .format() method to do this. Only keep the first 30 rows in the data that meet this condition. To account for the fact that sometimes this word is capitalized to begin a sentence, first recode the description column to be all upper-case. Sort the dataframe by price.

Issue a query to wine.db to create a dataframe that contains one row per variety of wine, and columns for the count of the number of wines of that variety, the average points, and the average price. Keep only the rows for varieties with at least 1000 wines in the data, and sort the rows from the highest to lowest average points. [Hint: use GROUP BY and HAVING]

Issue a query to wine.db to create a dataframe that joins the wine table with the wine ID table.

Issue a query to wine.db to create a dataframe that joins the wine table with the wine ID table. Keep the wine ID, title, variety, and price from the wines data, and keep the wine ID and taster ID from the wine ID table. Use aliases to refer to the relevant data tables.

Issue a query to wine.db to create a dataframe that joins the wine table with the wine ID table, then proceeds to join that table with the taster, winery, and location tables . Keep all columns.