Python辅导 | BUSS6002 Assignment 1


Value: 15% of the total mark

Suppose the year is 2010 and you are working as a Data Scientist for an investment firm. The firm is assessing locations for investing in housing redevelopment in the United States. The firm has selected Ames, Iowa as a candidate location. As a consequence, the firm would need to purchase existing houses, which would be demolished to make space for the development.

In order to estimate the costs involved the firm needs to know the current value of the houses that it needs to purchase. You are working on a data science project aiming to build a model to estimate the house prices.

The Ames City Assessor’s Office has been collecting data since 2006 on house sales and the characteristics of each house that was sold. You have been given access to a copy of original database “housing.db”, which is an SQLite file. The Assessor’s Office have also provided you with a data dictionary “housing_data_description.txt”.

Hint: To list all tables in the database you can use the following query SELECT name FROM sqlite_master WHERE type=’table’ ORDER BY name;

You can download the dataset and detailed dataset description from the BUSS6002 Canvas site.

Question 1

To start your analysis, you wish to build a prototype model that will be demonstrated to a wider team. Therefore it needs to be easily understood by non-experts, meaning that you can only use a few variables.

To save you time, an experienced member of your team suggests to you that from their experience the above ground living area, basement size and the age of the house are most useful variables.

Perform EDA to determine which two of these features are most useful. Carefully explain your selection criteria and present the results to justify your choice.


Question 2

Suppose you are interested in using the above ground living area and basement size to estimate the price of a home.

Adhere to the same requirements from Question 1.

Question 3

The models you have built so far provide an approximate estimate of house prices.

However, to accurately estimate the costs of the redevelopment plan you must be able to estimate house prices as accurately as possible.

Your goal is now to improve your model as much as possible through feature engineering and feature selection.


Hint: Carefully read the data dictionary. In this dataset the “NA” (Not Applicable) code is represented by a NULL value in the database, which will be interpreted by Pandas as NaN. This means that the NaN values do not indicate whether the value is missing. For example, the “Garage Type” variable codes “No Garage” as “NA”, consequently it will be interpreted as NaN by Pandas.

If you wish to include such variables in your analysis, then you should recode NaN as a valid code. For example, to recode the Garage Type you could use:

df[‘Garage Type’] = df[‘Garage Type’].apply(lambda x: “NoGarage” if pd.isnull(x) else x)

Question 4

Suppose you have finished your analysis, now you need to report to your manager and reflect on what you have experimented with in your data science project:

The firm is also considering redevelopment projects in other locations. Comment on whether the model you have built can or cannot be applied in other locations. Justify your answer.