Intro

This project is the famous housing price prediction task. A description of the data fields can be found at the dataset source.

Overview

Here, I’ll show

  • EDA of numerical and categorical features with visualization
  • Feature engineering (some of these are, as far as I know, not existing methods)
    • Convert categorical features into numerical features
    • Impute empty entries in a way that has less impact on linear regression
    • Feature selection
    • Cleaning
  • Training
    • Hyperparameter tuning
      • Regularization strength and mixing-ratio (alpha, l1_ratio) tuning with ElasticNet
      • Tune feature selection
  • Test result
    • Visualize the fitting result and residuals
    • List important features
  • Conclusion
    • Show how precise this model is, using a practical metric for potential users.

Load dataset

# import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#import statsmodels.formula.api as smf

#from sklearn.decomposition import PCA # tested and didn't help
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, ElasticNetCV
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import StandardScaler
import sklearn.metrics as metrics
# figure cosmetic function
def fsize(w,h,c=False):
    
    # set figure size
    plt.rcParams["figure.figsize"] = [w, h]
    
    # adjust plot automatically
    plt.rcParams['figure.constrained_layout.use'] = c
# import training data
df = pd.read_csv("data/house.csv")
df_sub = pd.read_csv("data/house_test.csv")

df.info()
df.head(5)

# check duplicated entries
print(df.duplicated().value_counts())
print(df_sub.duplicated().value_counts())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallCond    1460 non-null   int64  
 19  YearBuilt      1460 non-null   int64  
 20  YearRemodAdd   1460 non-null   int64  
 21  RoofStyle      1460 non-null   object 
 22  RoofMatl       1460 non-null   object 
 23  Exterior1st    1460 non-null   object 
 24  Exterior2nd    1460 non-null   object 
 25  MasVnrType     1452 non-null   object 
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object 
 28  ExterCond      1460 non-null   object 
 29  Foundation     1460 non-null   object 
 30  BsmtQual       1423 non-null   object 
 31  BsmtCond       1423 non-null   object 
 32  BsmtExposure   1422 non-null   object 
 33  BsmtFinType1   1423 non-null   object 
 34  BsmtFinSF1     1460 non-null   int64  
 35  BsmtFinType2   1422 non-null   object 
 36  BsmtFinSF2     1460 non-null   int64  
 37  BsmtUnfSF      1460 non-null   int64  
 38  TotalBsmtSF    1460 non-null   int64  
 39  Heating        1460 non-null   object 
 40  HeatingQC      1460 non-null   object 
 41  CentralAir     1460 non-null   object 
 42  Electrical     1459 non-null   object 
 43  1stFlrSF       1460 non-null   int64  
 44  2ndFlrSF       1460 non-null   int64  
 45  LowQualFinSF   1460 non-null   int64  
 46  GrLivArea      1460 non-null   int64  
 47  BsmtFullBath   1460 non-null   int64  
 48  BsmtHalfBath   1460 non-null   int64  
 49  FullBath       1460 non-null   int64  
 50  HalfBath       1460 non-null   int64  
 51  BedroomAbvGr   1460 non-null   int64  
 52  KitchenAbvGr   1460 non-null   int64  
 53  KitchenQual    1460 non-null   object 
 54  TotRmsAbvGrd   1460 non-null   int64  
 55  Functional     1460 non-null   object 
 56  Fireplaces     1460 non-null   int64  
 57  FireplaceQu    770 non-null    object 
 58  GarageType     1379 non-null   object 
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object 
 61  GarageCars     1460 non-null   int64  
 62  GarageArea     1460 non-null   int64  
 63  GarageQual     1379 non-null   object 
 64  GarageCond     1379 non-null   object 
 65  PavedDrive     1460 non-null   object 
 66  WoodDeckSF     1460 non-null   int64  
 67  OpenPorchSF    1460 non-null   int64  
 68  EnclosedPorch  1460 non-null   int64  
 69  3SsnPorch      1460 non-null   int64  
 70  ScreenPorch    1460 non-null   int64  
 71  PoolArea       1460 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          281 non-null    object 
 74  MiscFeature    54 non-null     object 
 75  MiscVal        1460 non-null   int64  
 76  MoSold         1460 non-null   int64  
 77  YrSold         1460 non-null   int64  
 78  SaleType       1460 non-null   object 
 79  SaleCondition  1460 non-null   object 
 80  SalePrice      1460 non-null   int64  
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
False    1460
dtype: int64
False    1459
dtype: int64
# rename columns to follow variable naming for convenience
df.rename(columns={'1stFlrSF':'FlrSF1st', '2ndFlrSF':'FlrSF2nd', '3SsnPorch':'Porch3Ssn'}, inplace=True) 
df_sub.rename(columns={'1stFlrSF':'FlrSF1st', '2ndFlrSF':'FlrSF2nd', '3SsnPorch':'Porch3Ssn'}, inplace=True) 

Split train/dev/test set before exploration

The test set will be used only for testing purposes; it shouldn’t be used for exploration or feature engineering.

df, df_test = train_test_split(df, test_size=0.2, random_state=20)

EDA with Feature Engineering

Check the price distribution and convert it to log scale

# plot
fsize(8,5)
ax = plt.hist(df.SalePrice,bins=30)

The distribution is skewed. Let’s even it out for better linear regression performance.

# Change to log scale
df.SalePrice = np.log10(df.SalePrice) # train set
df_test.SalePrice = np.log10(df_test.SalePrice) # test set

# plot result
ax = plt.hist(df.SalePrice,bins=30)

Now it is close to a normal distribution.

Check numerical feature distributions and clean them a bit

#%%script false --no-raise-error

# get a list of numerical features
features = df.dtypes[df.dtypes!='object'].index.tolist()

# drop the target column
features = features[:-1] 
n = len(features)

# set figure size
fsize(16,n,True)

# plot histograms
for i in range(n):
             
    plt.subplot((n + 3) // 4, 4, i + 1)  # ceil(n/4) rows of 4 columns

    plt.hist(df[features[i]],bins=30)
    plt.title(features[i])
    plt.xlabel(features[i])

A little cleaning

# skewed features (we will use these lists soon)
left_skewed = ['LotFrontage','LotArea','MasVnrArea',
               'BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF',
               'FlrSF1st','FlrSF2nd','GrLivArea',
               'GarageArea','WoodDeckSF','OpenPorchSF','EnclosedPorch',
               'MiscVal','SalePrice'] # all sizes, makes sense
right_skewed = ['YearBuilt','YearRemodAdd','GarageYrBlt'] # all years, makes sense      

# drop redundant feature
df.drop(['Id'],axis=1,inplace=True) # train set
df_test.drop(['Id'],axis=1,inplace=True) # test

# this feature should be categorical
df = df.astype({'MSSubClass':str})
df_test = df_test.astype({'MSSubClass':str})
df_sub = df_sub.astype({'MSSubClass':str})

Impute numerical feature

None of the valid numerical features has a zero value, so set empty entries to zero for now. An empty entry means the house doesn’t have the corresponding material/space, i.e. N/A.

# get a list of numerical features after cleaning
features = df.dtypes[df.dtypes!='object'].index.tolist()
# drop the target column
features = features[:-1] 

# impute empty values with 0 for now
# a more careful imputation is applied later

for x in features:
    
    df.fillna({x: 0},inplace=True) 
    df_test.fillna({x: 0},inplace=True)
    df_sub.fillna({x: 0},inplace=True)

    #avg = df[x].mean()

    #df.fillna({x: avg},inplace=True) 
    #df_test.fillna({x: avg},inplace=True) 

Check categorical feature distributions and clean/impute them

#%%script false --no-raise-error

# get a list of categorical features
features = df.dtypes[df.dtypes=='object'].index.tolist()

n = len(features)

# set figure size
fsize(16,2*n,True)

# plot histograms
for i in range(n):
             
    plt.subplot(n//2+n%2, 2, i+1)

    sns.boxplot(x=features[i], y="SalePrice", data=df)
    sns.stripplot(x=features[i], y="SalePrice", data=df, alpha=0.3)
    plt.title(features[i])
    plt.xlabel(features[i])

A little cleaning and imputation

# impute empty data - empty for unknown reason
df.fillna({'MasVnrType':'Unknown'},inplace=True)
df.fillna({'Electrical':'Unknown'},inplace=True)

df_test.fillna({'MasVnrType':'Unknown'},inplace=True)
df_test.fillna({'Electrical':'Unknown'},inplace=True)

df_sub.fillna({'MasVnrType':'Unknown'},inplace=True)
df_sub.fillna({'Electrical':'Unknown'},inplace=True)

# get a list of categorical features
features = df.dtypes[df.dtypes=='object'].index.tolist()

# impute empty data - when a house doesn't have this material
for x in features:
    df.fillna({x: 'NotUsed'},inplace=True) 
    df_test.fillna({x: 'NotUsed'},inplace=True) 
    df_sub.fillna({x: 'NotUsed'},inplace=True) 

Change categorical features to numerical features

Here, I’ll convert each categorical feature into a numeric one, then perform linear regression. For each feature, the transformation is (a toy sketch follows the list):

  • calculate the mean value of SalePrice for each category
  • replace the category by that mean SalePrice value
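Before the actual implementation below, here is a minimal sketch of this encoding on a toy frame (hypothetical values, not from the dataset), with prices already in log10 scale:

import pandas as pd

toy = pd.DataFrame({'Neighborhood': ['A', 'A', 'B', 'B', 'C'],
                    'SalePrice':    [5.0, 5.2, 5.4, 5.6, 5.1]})  # log10 prices

# mean log-price per category, learned from the (toy) training rows
mapping = toy.groupby('Neighborhood').SalePrice.mean().to_dict()
print(mapping)  # -> roughly {'A': 5.1, 'B': 5.5, 'C': 5.1}

# replace each category by its mean; unseen categories would fall back to the overall mean
toy['Neighborhood'] = toy['Neighborhood'].map(mapping)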
# get a list of categorical features
features = df.dtypes[df.dtypes=='object'].index.tolist()

# Get the overall average to fill rare/unseen categories
avg = df.SalePrice.mean()

for x in features:
    
    # make a dictionary of {category: mean of SalePrice of that category}
    # use only train set
    dic = df.groupby([x]).SalePrice.mean().to_dict()
       
    # Change categorical values into the average sale price
    # dev/test/submission sets are filled with values learned from the train set
    def cat_to_num(cat):
        try:
            return dic[cat]
        except KeyError:
            # the rare category doesn't appear in the training set
            return avg


    df[x] = df[x].apply(cat_to_num)
    df_test[x] = df_test[x].apply(cat_to_num)
    df_sub[x] = df_sub[x].apply(cat_to_num)

# for nan entries of rare categories
# fill average of training set
for x in df_test.columns[:-1]:
    df_test[x].fillna(df[x].mean(), inplace=True)

for x in df_sub.columns[1:]:
    df_sub[x].fillna(df[x].mean(), inplace=True)
# check result
df.head(5)

# we shouldn't have nan now.
df.isna().value_counts()
(output omitted: a single all-False combination with count 1168, i.e. no NaN values remain in the 1168 training rows)

Numerical feature imputation and scaling for linear regression

For numerical features, a value of 0 means either that the house has no such material/space or that the entry was empty. Here, I made an imputation technique that fills an empty record as follows (a toy sketch of the idea is shown after the list):

    1. perform linear regression with the filled records ($y = mx + b$)
    2. get the average SalePrice of the empty records ($= y_0$)
    3. calculate the corresponding x value of $y_0$ on the regression line found at step 1 ($x_0 = (y_0-b)/m$)
    4. impute the empty records with $x_0$
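A minimal sketch of this idea on toy numbers (hypothetical values, not from the dataset), for a single feature:

import numpy as np
from sklearn.linear_model import LinearRegression

# filled records: feature value x and log10 price y
x_filled = np.array([[200.], [400.], [600.], [800.]])
y_filled = np.array([5.0, 5.2, 5.4, 5.6])

# step 1: fit y = m*x + b on the filled records
reg = LinearRegression().fit(x_filled, y_filled)
m, b = reg.coef_[0], reg.intercept_

# step 2: mean log10 price of the records whose feature is empty (toy numbers)
y0 = np.mean([5.05, 5.15])

# steps 3-4: back out the feature value that the fitted line maps to y0, and impute with it
x0 = (y0 - b) / m
print(x0)  # -> roughly 300.0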

Numerical feature imputation

def get_scaling_parameter(df):
    
    # feature scaling function
    def feature_scaling(df, x):
    
        mu = np.mean(df[x])
        sig = np.std(df[x])
    
        return mu, sig
    
    # collect, per feature, the scaling parameters and the shift used for singular (zero) entries
    lst_return = []

    for i in range(len(df.columns)-1):
        
        x = df.columns[i]
          
        regular  = df.loc[df[x]!=0, [x,'SalePrice']].copy()
        singular = df.loc[df[x]==0, [x,'SalePrice']].copy()
        
        if x in left_skewed:
            regular[x]  = np.log10(regular[x]+1)
            singular[x] = np.log10(singular[x]+1)
        
        if x in right_skewed:
            regular[x]  = np.log10(2030-regular[x])
            singular[x] = np.log10(2030-singular[x])
        

        mu, sig = feature_scaling(regular, x)
            
        #regular[x] = (regular[x]-mu)/sig
            
        #results = smf.ols('SalePrice'+'~'+x, data=regular).fit()
        #b, m = results.params
        #b_err, m_err = results.bse
        
        model = LinearRegression()
        model.fit(regular[x].to_numpy().reshape(-1,1),regular.SalePrice.to_numpy())
        
        b = model.intercept_
        m = model.coef_.squeeze()
                
        singular_y_mean = np.mean(singular['SalePrice']) 
        singular_x_shift = (singular_y_mean-b)/m
        
        lst_return.append([x,mu,sig,singular_x_shift])
            
    return lst_return

def sdp_transform(df, lst_scale_par):
    
    df_copy = df.copy()

    for item in lst_scale_par:
        
        x, mu, sig, shift = item
     
        if np.isnan(mu): 
            print('err')
            df_copy.drop([x], axis=1, inplace=True)

        else:
            
            regular  = df.loc[df[x]!=0, [x]].copy()
            singular = df.loc[df[x]==0, [x]].copy()
            
            if x in left_skewed:
                regular[x]  = np.log10(regular[x]+1)
                #singular[x] = np.log10(singular[x]+1)
        
            if x in right_skewed:
                regular[x]  = np.log10(2030-regular[x])
                #singular[x] = np.log10(2030-singular[x])
 
            #regular[x] = (regular[x]-mu)/sig
            singular[x] = shift
            
            df_add = regular[[x]]
                        
            if len(singular)>0 :
                df_add = pd.concat([df_add, singular[[x]]])
                
            df_copy[x] = df_add[x]

    return df_copy

lst_scale_par = get_scaling_parameter(df)


df = sdp_transform(df,lst_scale_par)
df_test = sdp_transform(df_test,lst_scale_par)
df_sub = sdp_transform(df_sub,lst_scale_par)
/Users/minjungkim/opt/anaconda3/lib/python3.8/site-packages/pandas/core/arraylike.py:397: RuntimeWarning: invalid value encountered in log10
  result = getattr(ufunc, method)(*inputs, **kwargs)

The “RuntimeWarning: invalid value encountered in log10” comes from a few NaN entries left by rare categories. It will be handled later.

Feature scaling

Feature scaling improves optimization performance and gives a fair weight to each feature. Here, I’m using standardization, which is more robust to outliers than min-max normalization: \(x_{j} \rightarrow \frac{x_{j}-\mu_{j}}{\sigma_{j}},\) where $x$ is the data value, $j$ is the feature index, $\mu_{j}$ is the mean of $x_{j}$, and $\sigma_{j}$ is the standard deviation of $x_{j}$.

scaler = StandardScaler()

# fit with training set
scaler.fit(df.drop('SalePrice',axis=1))

# transform all sets
df[df.columns[:-1]] = scaler.transform(df.drop('SalePrice',axis=1))
df_test[df_test.columns[:-1]] = scaler.transform(df_test.drop('SalePrice',axis=1))
df_sub[df_sub.columns[1:]] = scaler.transform(df_sub.drop('Id',axis=1))

SalePrice vs Feature

#%%script false --no-raise-error

# Plot one variable linear regression of each feature
# x: feature, y: SalePrice

n = len(df.columns)-1
fsize(16,n,True)

for i in range(n):
    
    x=df.columns[i]
    
    plt.subplot((n + 3) // 4, 4, i + 1)  # ceil(n/4) rows of 4 columns
    sns.regplot(x=x, y='SalePrice', data=df)
    plt.title(x)

Select input features

Sort features by the strength of their correlation with SalePrice, then remove features that are highly correlated with a feature already kept.

def select_feature(threshold0=0.5, threshold1=0.6):

    # Select features highly correlated with SalePrice
    features = df.corr().SalePrice.apply(lambda x: abs(x)).sort_values(ascending=False)
    high_corr_features = features[features>threshold0].drop('SalePrice')
    

    # Select highly correlating columns
    columns_to_drop = []
    for x in high_corr_features.index:

        if x in columns_to_drop:
            continue

        for y in high_corr_features.index:

            if x==y:
                continue
            val = df[x].corr(df[y])
            if val>threshold1:
                columns_to_drop.append(y)
                
    
    corr_features = [x for x in high_corr_features.index if not x in columns_to_drop]
    return corr_features


corr_features = select_feature(0.6, 0.6)
corr_features
['OverallQual', 'GarageArea', 'YearBuilt', 'GarageFinish', 'TotalBsmtSF']
# Plot one highly correlating example
fsize(8,6)
sns.regplot(x='OverallQual', y='SalePrice', data=df)
<AxesSubplot:xlabel='OverallQual', ylabel='SalePrice'>

Train

I’m using root mean squared error (RMSE) as the scoring metric. Note that SalePrice has already been log-transformed, so the RMSE is computed on log10 prices.

# Make a scorer
def cost_function(y, y_pred):
    
    # return negative RMSE; make_scorer with greater_is_better=False flips the
    # sign again, so cross-validation scores come out as positive RMSE values
    return -np.sqrt(np.mean(np.square(y_pred - y)))

scorer = metrics.make_scorer(cost_function, greater_is_better=False)

Plot learning curve

Plot the learning curve to check for signs of overfitting.

X_train = df[corr_features]
y_train = df.SalePrice

def plot_learning_curve(estimator, title, X, y, scoring=None, ylim=None, cv=None,
                        train_sizes=np.linspace(.1, 1.0, 5)):

    # Copied and modified scikit-learn document
    ax = plt.subplot()

    ax.set_title(title)
    if ylim is not None:
        ax.set_ylim(*ylim)
    ax.set_xlabel("Training examples size")
    ax.set_ylabel("std of difference in percents")

    train_sizes, train_scores, test_scores = \
        learning_curve(estimator, X, y, scoring=scoring, cv=cv, train_sizes=train_sizes)
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    # Plot learning curve
    ax.grid()
    ax.fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1,
                         color="r")
    ax.fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1,
                         color="g")
    ax.plot(train_sizes, train_scores_mean, 'o-', color="r",
                 label="Training score")
    ax.plot(train_sizes, test_scores_mean, 'o-', color="g",
                 label="Cross-validation score")
    ax.legend(loc="best")

    return plt

def learning_curve_wrapper(X,y):
    
    n_samples = X.shape[0]
    cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)

    model = LinearRegression()
    scorer = metrics.make_scorer(cost_function, greater_is_better=False)

    title = "Learning Curves"

    plot_learning_curve(model, title, X, y, scoring=scorer, train_sizes=np.linspace(.1, 1.0, 20))


    plt.show()
    
    return model

    
model = learning_curve_wrapper(X_train, y_train)

Both the training and validation curves converge at quite a low error. Good!

Hyperparameter tuning

  • Select features
  • Optimize the ElasticNet regularization parameters (alpha and l1_ratio)
# define a function to tune regularization and learning rate
# linear regression with ElasticNet
def tuneElasticNet(X_train,y_train):
    
    # 1st iteration to find scale
    model = ElasticNetCV(l1_ratio = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 0.6, 1],
                              alphas = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006, 
                                        0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6], 
                              max_iter = 10000, cv = 10, n_jobs=-1,
                              fit_intercept=True)

    model.fit(X_train, y_train)
    if (model.l1_ratio_ > 1):
        model.l1_ratio_ = 1    
        
    alpha = model.alpha_
    l1_ratio = model.l1_ratio_
    #print("1st iteration - l1_ratio, alpha :", ratio, alpha)


    # 2nd iteration for fine tuning
    
    l1_ratio_temp = [l1_ratio*0.5, l1_ratio*0.8, l1_ratio, l1_ratio*1.2, l1_ratio*1.5]
    
    
    model = ElasticNetCV(l1_ratio = [x if x<=1 else 1 for x in l1_ratio_temp ],
                              alphas = [alpha*0.1 , alpha*0.3, alpha, alpha*3, alpha*10], 
                              max_iter = 10000, cv = 5, n_jobs=-1,
                              fit_intercept=True)

    model.fit(X_train, y_train)
    if (model.l1_ratio_ > 1):
        model.l1_ratio_ = 1    

    alpha = model.alpha_
    l1_ratio = model.l1_ratio_
    #print("2nd iteration - l1_ratio, alpha :", ratio, alpha)


    # 3rd iteration for fine tuning
    
    l1_ratio_temp = [l1_ratio*0.8, l1_ratio*0.85, l1_ratio*0.9, l1_ratio*0.95, l1_ratio,
                     l1_ratio*1.05, l1_ratio*1.1, l1_ratio*1.15, l1_ratio*1.2]

    model = ElasticNetCV(l1_ratio = [x if x<=1 else 1 for x in l1_ratio_temp ],
                              alphas = [alpha*0.8, alpha*0.9, alpha, alpha*1.1, alpha*1.2], 
                              max_iter = 10000, cv = 5, n_jobs=-1,
                              fit_intercept=True)

    model.fit(X_train, y_train)
    if (model.l1_ratio_ > 1):
        model.l1_ratio_ = 1    
        
        
    alpha = model.alpha_
    l1_ratio = model.l1_ratio_
    #print("3rd iteration - l1_ratio, alpha :", ratio, alpha)

    # Cross validation score
    #print("Score:", cross_val_score(model, X_train, y_train, cv=5, scoring=scorer).mean())    
    
    return model, cross_val_score(model, X_train, y_train, cv=5, scoring=scorer).mean()
# Train and Hyperparameter tuning -- 1st iteration

params = []

# found out ranges through iterations
for th0 in (0, 0.1, 0.2): 
        #(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    for th1 in (0.7, 0.8):
        #(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
        
        corr_features = select_feature(th0, th1)

        X_train = df[corr_features]
        y_train = df.SalePrice
        
        model, score = tuneElasticNet(X_train,y_train)
        
        params.append((model,th0,th1,score))
def takeFourth(elem):
    return elem[3]

params.sort(key=takeFourth)
[params[i][1:] for i in range(0,5)]
[(0, 0.8, 0.054934925785856605),
 (0, 0.7, 0.055045021987165335),
 (0.1, 0.8, 0.05660590767588628),
 (0.1, 0.7, 0.05667749672828577),
 (0.2, 0.7, 0.058364893449542175)]

threshold0 = 0 and threshold1 = 0.7-0.8 gave the least error, about 0.055. Overall, the sorted results indicate that keeping all features and letting the regularization do the work gives the best result.
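As a rough sanity check on what this score means: since SalePrice is on a log10 scale, an RMSE of about 0.055 corresponds to a typical multiplicative error of $10^{0.055} \approx 1.13$, i.e. roughly 13% in price, which lines up with the practical metric reported in the Conclusion.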

Test and Residual Analysis

# final model
model, th0, th1, score = params[0]
corr_features = select_feature(th0, th1)

# test set
X_test = df_test[corr_features]
y_test = df_test.SalePrice

# check nan entry
X_test[X_test.isna().any(axis=1)]
(output omitted: 1 row × 71 columns; row 954 is the only row containing a NaN entry)

# there's just one row with a NaN entry
# let's fill it with the column average

for x in X_test.columns:
    X_test[x].fillna(X_test[x].mean(), inplace=True)
/var/folders/31/7v9nfdf14sz0sxn2xwnq90y00000gn/T/ipykernel_94749/8326161.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test[x].fillna(X_test[x].mean(), inplace=True)
# predict
y_pred = model.predict(X_test)
fsize(8,6)

# Plot predictions
plt.scatter(y_test, y_pred)
plt.title("Prediction")
plt.ylabel("Predicted price")
plt.xlabel("Real price")
plt.plot([4.5, 6], [4.5,6], c = "red")
plt.show()

print('R^2 of y_pred and y_test:',model.score(X_test,y_test))


# Plot residue
residuals = y_pred - y_test
plt.scatter(y_pred, residuals, alpha=0.4)
plt.title('Residue')
plt.xlabel("Predicted price")
plt.hlines(y = 0, xmin = 4.8, xmax = 5.7, color = "red")
plt.show()

R^2 of y_pred and y_test: 0.9025793776944269

There are small deviations at very low or high prices. Otherwise, good.

# Plot features with high coefficient (leading features)
coefs = pd.Series(model.coef_, index = corr_features)

coefs = pd.concat([coefs.sort_values().head(7),
                     coefs.sort_values().tail(8)])
coefs.plot.barh()
ax = plt.title("Coefficients of used features")

So, it’s good to have a large living area and high overall quality/condition, with a large garage, in an expensive neighborhood. Having a basement is good, but it had better not have a bathroom.

Conclusion

What should be the final metric? I think the most important metric for users is how precisely we predict the house price, in percent.

# so far, we've used log-scale prices
# now, convert back to real values
y_pred = 10**y_pred
y_test = 10**y_test

# difference in percent
y_diff = (y_pred-y_test)/y_test
plt.hist(y_diff, bins=30)

total_count = len(y_diff)

precise_count = len(y_diff[y_diff<0.2])
print('Predict housing price within 20% for ',precise_count/total_count*100,'% of data') 

precise_count = len(y_diff[y_diff<0.1])
print('Predict housing price within 10% for ',precise_count/total_count*100,'% of data')
Predict housing price within 20% for  95.2054794520548 % of data
Predict housing price within 10% for  86.64383561643835 % of data

Would you buy this model? I definitely would.

Submission to Kaggle

# test set
X_sub = df_sub[['Id']+corr_features]

# check nan entry
X_sub[X_sub.isna().any(axis=1)]
(output omitted: 2 rows × 72 columns; rows 1127 and 1399, with Id 2588 and 2860, are the only rows containing NaN entries)

# there are just two rows with NaN entries
# let's fill them with the column average

for x in X_sub.columns:
    X_sub[x].fillna(X_sub[x].mean(), inplace=True)
/var/folders/31/7v9nfdf14sz0sxn2xwnq90y00000gn/T/ipykernel_94749/172487346.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_sub[x].fillna(X_sub[x].mean(), inplace=True)
# predict
y_pred = model.predict(X_sub.drop('Id',axis=1))
y_pred = 10**y_pred
# make a submission format
X_sub['SalePrice'] = y_pred

submit = X_sub[['Id','SalePrice']]

# save
submit.to_csv('data/house_submission.csv',index=False)
/var/folders/31/7v9nfdf14sz0sxn2xwnq90y00000gn/T/ipykernel_94749/1023321855.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_sub['SalePrice'] = y_pred

Done.
