Machine Learning and Data Mining (Module 2)

Matteo Francia
DISI — University of Bologna
m.francia@unibo.it

The `California Housing` case study

Our task is to use California census data to forecast housing prices given the population, median income, and median housing price for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). We will just call them “districts” for short.

Code

import pandas as pd

# df = pd.read_csv("/content/sample_data/california_housing_train.csv", delimiter=",")
df = pd.read_csv("https://raw.githubusercontent.com/w4bo/handsOnDataPipelines/main/materials/datasets/housing.csv", delimiter=",")
df

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity
0	-122.23	37.88	41.0	880.0	129.0	322.0	126.0	8.3252	452600.0	NEAR BAY
1	-122.22	37.86	21.0	7099.0	1106.0	2401.0	1138.0	8.3014	358500.0	NEAR BAY
2	-122.24	37.85	52.0	1467.0	190.0	496.0	177.0	7.2574	352100.0	NEAR BAY
3	-122.25	37.85	52.0	1274.0	235.0	558.0	219.0	5.6431	341300.0	NEAR BAY
4	-122.25	37.85	52.0	1627.0	280.0	565.0	259.0	3.8462	342200.0	NEAR BAY
...	...	...	...	...	...	...	...	...	...	...
20635	-121.09	39.48	25.0	1665.0	374.0	845.0	330.0	1.5603	78100.0	INLAND
20636	-121.21	39.49	18.0	697.0	150.0	356.0	114.0	2.5568	77100.0	INLAND
20637	-121.22	39.43	17.0	2254.0	485.0	1007.0	433.0	1.7000	92300.0	INLAND
20638	-121.32	39.43	18.0	1860.0	409.0	741.0	349.0	1.8672	84700.0	INLAND
20639	-121.24	39.37	16.0	2785.0	616.0	1387.0	530.0	2.3886	89400.0	INLAND

20640 rows × 10 columns

… and now?

Answer some questions:

Which attributes (i.e., columns) are contained in the dataset?
Which is their semantics?

Code

df.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')

Dataset description

longitude: A measure of how far west a house is; a higher value is farther west
latitude: A measure of how far north a house is; a higher value is farther north
housingMedianAge: Median age of a house within a block; a lower number is a newer building
totalRooms: Total number of rooms within a block
totalBedrooms: Total number of bedrooms within a block
population: Total number of people residing within a block
households: Total number of households, a group of people residing within a home unit, for a block
medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
medianHouseValue: Median house value for households within a block (measured in US Dollars)
oceanProximity: Location of the house w.r.t ocean/sea

Data profiling

Code

# show some statistics on the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

Data profiling

Code

df.describe(include='all')

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity
count	20640.000000	20640.000000	20640.000000	20640.000000	20433.000000	20640.000000	20640.000000	20640.000000	20640.000000	20640
unique	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	5
top	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	<1H OCEAN
freq	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	9136
mean	-119.569704	35.631861	28.639486	2635.763081	537.870553	1425.476744	499.539680	3.870671	206855.816909	NaN
std	2.003532	2.135952	12.585558	2181.615252	421.385070	1132.462122	382.329753	1.899822	115395.615874	NaN
min	-124.350000	32.540000	1.000000	2.000000	1.000000	3.000000	1.000000	0.499900	14999.000000	NaN
25%	-121.800000	33.930000	18.000000	1447.750000	296.000000	787.000000	280.000000	2.563400	119600.000000	NaN
50%	-118.490000	34.260000	29.000000	2127.000000	435.000000	1166.000000	409.000000	3.534800	179700.000000	NaN
75%	-118.010000	37.710000	37.000000	3148.000000	647.000000	1725.000000	605.000000	4.743250	264725.000000	NaN
max	-114.310000	41.950000	52.000000	39320.000000	6445.000000	35682.000000	6082.000000	15.000100	500001.000000	NaN

… are you satisfied with the understanding?

… what about data visualization?

Can we exploit the nature of the data?

Code

ax = df.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4, s=df["population"] / 100, label="population", figsize=(16, 9), c="median_house_value", cmap="jet", colorbar=True)
ax.set_aspect('equal', adjustable='box')

What if we integrate open data?

Can we exploit the nature of the data… again?

California, USA

Merging the data sources

California, USA

Non-numeric attributes

ocean_proximity is a text attribute, but not all machine libraries can manipulated textual data types.

Code

df["ocean_proximity"].value_counts()

	count
ocean_proximity
<1H OCEAN	9136
INLAND	6551
NEAR OCEAN	2658
NEAR BAY	2290
ISLAND	5

dtype: int64

Code

df["ocean_proximity"].hist()

Code

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(16, 9))
sns.scatterplot(data=df, x="longitude", y="latitude", hue="ocean_proximity", alpha=0.6, s=20)
plt.gca().set_aspect('equal', adjustable='box')
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.tight_layout()

Missing values

There are some missing values for total_bedrooms. What should we do?

Most Machine Learning algorithms cannot work with missing features.

Memory usage

What if I change float64 to float32?

Code

dff = df.copy(deep=True)  # copy the dataframe
for x in df.columns:  # iterate over the columns
    if dff[x].dtype == 'float64':  # if the column has type `float64`
        dff[x] = dff[x].astype('float32')  # ... change it to `float32`
dff.info()  # show some statistics on the dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float32
 1   latitude            20640 non-null  float32
 2   housing_median_age  20640 non-null  float32
 3   total_rooms         20640 non-null  float32
 4   total_bedrooms      20433 non-null  float32
 5   population          20640 non-null  float32
 6   households          20640 non-null  float32
 7   median_income       20640 non-null  float32
 8   median_house_value  20640 non-null  float32
 9   ocean_proximity     20640 non-null  object 
dtypes: float32(9), object(1)
memory usage: 887.0+ KB

Visualization

Code

import matplotlib.pyplot as plt
%matplotlib inline
df.hist(bins=50, figsize=(20, 15))
plt.show()

Open questions

median_income should be in dollars. However, it has a strange range. Why? “you are told that the data has been scaled and capped at 15 (actually 15.0001) for higher median incomes, and at 0.5 (actually 0.4999) for lower median incomes. The numbers represent roughly tens of thousands of dollars. The numbers represent roughly tens of thousands of dollars”
housing_median_age and median_house_value are capped. As to median_house_value, this is a serious problem since it is your target attribute (your labels). Your Machine Learning algorithms may learn that prices never go beyond that limit. You need to check with your client team (the team that will use your system’s output) to see if this is a problem or not. If they tell you that they need precise predictions even beyond 500,000USD, then you have mainly two options: (a) collect proper labels for the districts whose labels were capped, (b) remove those districts from the training set.”
These attributes have very different scales. Should we scale them?
Many histograms are tail heavy: they extend much farther to the right of the median than to the left. This may make it a bit harder for some Machine Learning algorithms to detect patterns

Relationships between variables

Each numeric variable in data will by shared across the y-axes across a single row and the x-axes across a single column
The diagonal plots are treated differently: univariate distribution plots show the marginal distributions of the data

Code

import seaborn as sns

tmp = df[["median_income", "housing_median_age", "median_house_value", "households", "population", "total_rooms"]]
sns.pairplot(tmp.sample(n=1000, random_state=42), hue='median_house_value', markers='+')
plt.show()

Code

import numpy as np

correlation_matrix = df.select_dtypes(include=['number']).corr()  # Compute the correlation matrix only on numerical features
filtered_correlation_matrix = correlation_matrix[correlation_matrix.abs() >= 0.25]  # Mask correlations with absolute values below 0.25
filtered_correlation_matrix[np.eye(filtered_correlation_matrix.shape[0], dtype=bool)] = np.nan  # Mask the diagonal
plt.figure(figsize=(10, 10))
sns.heatmap(filtered_correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True)
plt.title("Pairwise Correlations (|correlation| >= 0.25)")
plt.tight_layout()

Pareto frontier

Which blocks represent a good tradeof between cost and number of rooms?

Code

fig, ax = plt.subplots(nrows=1, ncols=1)
df["room_per_household"] = df["total_rooms"] / df["households"]
df.plot.scatter(x='median_house_value', y='room_per_household', ax=ax)

Skyline

The skyline operator: optimization problem that computes the Pareto optimum on tuples with multiple dimensions

Skyline comes from the view on Manhattan from the Hudson River. A building is visible if it is not dominated by a building that is taller or closer to the river (two dimensions, distance to the river minimized, height maximized).
Users want hotels for a holidays to be cheap and close to the beach. However, hotels that are close to the beach are expensive.

Formally, a tuple dominates another if it is at least as good in all dimensions and better in at least one dimension.

NY Skyline

Pareto frontier

Code

from paretoset import paretoset

mask = paretoset(df[["median_house_value", "room_per_household"]], sense=["min", "max"])
df[mask].plot.scatter(x='median_house_value', y='room_per_household', ax=ax, c='red')
fig

Exercise 1

For each ocean_proximity, return the number of blocks
Return the ocean_proximity with the highest number of blocks
Print the mean value per ocean_proximity for every column
Print the mean, min and max values value per ocean_proximity for every column

Code

# Write your code here

Exercise 2

Add a new column: population_per_household = population / households
Add a new column: rooms_per_household = total_rooms / households
Add a new column: bedrooms_per_room = total_bedrooms / total_rooms

Code

# Write your code here

Exercise 3

Complete the prediction pipeline with data preprocessing and machine learning algorithms.

To help you, you can follow these steps:

Set the random seed to enable reproducibility
Feature pre-processing (e.g., remove useless features, impute missing values, encode some features)
Check pairwise correlations among features
Split training and test data. When splitting train and test datasets, the test dataset should contain 30% of the data.
Plot the training dataset in 2D, are the outcomes well-separated?
Train at least two ML classification models
Perform hyperparameter optimization for at least one model
Identify the most relevant features in the prediction

Machine Learning and Data Mining (Module 2)

The California Housing case study

… and now?

Answer some questions:

Data profiling

Data profiling

… are you satisfied with the understanding?

… what about data visualization?

What if we integrate open data?

Can we exploit the nature of the data… again?

Merging the data sources

Non-numeric attributes

Missing values

Memory usage

Visualization

Open questions

Relationships between variables

Pareto frontier

Skyline

Pareto frontier

Exercise 1

Exercise 2

Exercise 3

The `California Housing` case study