
Predicting House Prices Based on Their Features Using Linear Regression Models

Project by Mauricio Conde and Matías Zieleniec | December 2022


 
This post is under construction. For detailed information, please get in touch through the available contact methods.

Introduction


Our objective is to build a model in R that predicts the final sale price of a house in the city of Ames, Iowa, United States. To do so, we will use different features of the houses (such as lot size, number of rooms, etc.) as variables in our model.


To carry out this project, we used three Kaggle datasets (see link). The first dataset, train.csv, contains information on 1460 houses and 80 features, including the final price (you can check this link for more details on the variables). The second dataset, test.csv, includes 1459 houses that are different from those in the first dataset but have the same variables, except for the price, which is found in the sample_submission.csv dataset.


Next, we import train.csv, the dataset we will use to train our models. The file is located in our working directory. We print the first five rows and ten columns to get a better understanding of the data we are working with.


library(readr) 
train <- read_csv("train.csv")
train[1:5, 1:10]
## # A tibble: 5 × 10
##      Id MSSubClass MSZoning LotFr…¹ LotArea Street Alley LotSh…² LandC…³ Utili…⁴
##   <dbl>      <dbl> <chr>      <dbl>   <dbl> <chr>  <chr> <chr>   <chr>   <chr>
## 1     1         60 RL            65    8450 Pave   <NA>  Reg     Lvl     AllPub
## 2     2         20 RL            80    9600 Pave   <NA>  Reg     Lvl     AllPub
## 3     3         60 RL            68   11250 Pave   <NA>  IR1     Lvl     AllPub
## 4     4         70 RL            60    9550 Pave   <NA>  IR1     Lvl     AllPub
## 5     5         60 RL            84   14260 Pave   <NA>  IR1     Lvl     AllPub
## # … with abbreviated variable names ¹LotFrontage, ²LotShape, ³LandContour,
## #   ⁴Utilities
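The preview already shows missing values (the Alley column is all NA in these rows), so it can be worth counting them per column before modelling. The following is a minimal sketch in base R, not part of the original analysis: the helper name missing_summary is hypothetical, and the toy data frame is invented so that the example runs without train.csv.

```r
# Hypothetical helper: count missing values per column, keeping only
# the columns with at least one NA, sorted from most to fewest.
missing_summary <- function(df) {
  na_counts <- colSums(is.na(df))
  sort(na_counts[na_counts > 0], decreasing = TRUE)
}

# Toy data frame (invented for illustration, not taken from train.csv).
toy <- data.frame(
  LotFrontage = c(65, NA, 68),
  Alley       = c(NA, NA, "Pave"),
  LotArea     = c(8450, 9600, 11250)
)

dim(toy)              # number of rows and columns
missing_summary(toy)  # columns that contain NA values, with their counts
```

On the real data, the equivalent calls would be dim(train) and missing_summary(train).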
 

Required packages


To carry out this report we use the following packages:


library(tidyverse) # Loads ggplot2, dplyr, readr and forcats, among others.
library(ggplot2)
library(dplyr)
library(cowplot) # ggplot2 add-on (e.g. for arranging plots in a grid).
library(corrplot) # To plot the correlation matrix.
library(leaps) # For Forward and Backward methods.
library(rpart) # For decision trees.
library(rpart.plot) # For decision tree plots.

Exploratory analysis


One of the first questions we ask ourselves when tackling this problem is whether there is a strong relationship between the year a house was built and its final price. Does the year of construction influence the sale price? To answer this question, we will analyze the relationship between the YearBuilt variable (year of construction of the house) and the SalePrice variable (sale price). If the two are positively correlated, we expect to see that the newer the house, the higher its sale price. We will draw the same plot for YearRemodAdd (year of the last remodeling).


# Sale price against year of construction, with a smoothed trend line.
ggplot(data = train) +
    geom_point(mapping = aes(x = YearBuilt, y = SalePrice, color = SalePrice),
    show.legend = FALSE) +
    geom_smooth(mapping = aes(x = YearBuilt, y = SalePrice), color = "red") +
    ggtitle("Year of construction versus house price") +
    xlab("Year of construction") + ylab("House price") +
    theme(plot.title = element_text(size=13, face="bold", hjust = 0.5))

# Same plot using the year of the last remodeling.
ggplot(data = train) +
    geom_point(mapping = aes(x = YearRemodAdd, y = SalePrice, color = SalePrice),
    show.legend = FALSE) +
    geom_smooth(mapping = aes(x = YearRemodAdd, y = SalePrice), color = "red") +
    ggtitle("Year of remodeling versus house price") +
    xlab("Year of remodeling") + ylab("House price") +
    theme(plot.title = element_text(size=13, face="bold", hjust = 0.5))
 

When visually examining the relationship between the year homes were built and their final price, we can notice a slight upward trend: newer or recently renovated homes tend to have a higher sale price. This is not the only factor that influences the final price, but the plots suggest that our initial intuition was reasonable. Next, we will delve deeper into the analysis and compare the price of homes with both the year of construction and the year of renovation, dividing the data by neighborhood.
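The visual impression can be complemented with a number. The sketch below is an illustration rather than part of the original analysis: it computes the rank correlation between each year variable and the sale price. Spearman's method is our choice here (it does not assume a linear relationship), the helper name year_price_cor is hypothetical, and the toy rows are invented so that the example runs on its own.

```r
# Hypothetical helper: Spearman rank correlation between each year
# variable and the sale price, ignoring rows with missing values.
year_price_cor <- function(df) {
  sapply(c("YearBuilt", "YearRemodAdd"), function(v) {
    cor(df[[v]], df$SalePrice, method = "spearman", use = "complete.obs")
  })
}

# Toy rows (invented for illustration, not taken from train.csv).
toy <- data.frame(
  YearBuilt    = c(1920, 1955, 1978, 1999, 2006),
  YearRemodAdd = c(1950, 1955, 2001, 1999, 2007),
  SalePrice    = c(90000, 120000, 150000, 210000, 260000)
)

year_price_cor(toy)
```

With the real data the call would be year_price_cor(train); values close to 1 would support the trend seen in the plots, while values near 0 would indicate that it is weak.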


# One panel per neighborhood: year of construction in blue, year of remodeling in red.
ggplot(data = train) +
    geom_point(mapping = aes(x = YearBuilt, y = SalePrice), color = "blue", alpha = 0.3) +
    geom_point(mapping = aes(x = YearRemodAdd, y = SalePrice), color = "red", alpha = 0.2) +
    facet_wrap(~Neighborhood) +
    ggtitle("Price by neighborhood versus\nyear of construction (blue) and year of remodeling (red)") +
    xlab("Year") + ylab("House price") +
    theme(plot.title = element_text(size=13, face="bold", hjust = 0.5))


 
The plot suggests that remodeling became more frequent from the mid-1950s onwards. It also suggests that the newest neighborhoods have a larger number of high-value homes. To verify this second point, we will use a boxplot of sale prices by neighborhood.


# fct_reorder() orders the neighborhoods by median SalePrice (its default summary).
ggplot(data = train, mapping = aes(x = fct_reorder(Neighborhood, SalePrice),
                                   y = SalePrice)) +
    geom_boxplot(color = "#145A32") +
    coord_flip() +
    ggtitle("House prices by neighborhood") +
    xlab("Neighborhood") + ylab("Price") +
    theme(plot.title = element_text(size=13, face="bold", hjust = 0.5))


 

The boxplot shows that the median sale price of relatively new neighborhoods, such as NridgHt, is higher than that of older neighborhoods, such as OldTown. In addition, most neighborhoods with median prices above 200,000 (2e+05 on the price axis) are recently founded. However, there are exceptions to this trend: MeadowV, for example, is one of the newest neighborhoods but has the lowest median price, while Crawfor has the seventh-highest median price and is not one of the most recently founded.
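The comparison read off the boxplot can also be tabulated. The following is a minimal sketch with dplyr, not part of the original analysis: using the median construction year as a proxy for a neighborhood's age is our assumption, the helper name neighborhood_summary is hypothetical, and the toy rows (with made-up neighborhood names) are invented so that the example runs without train.csv.

```r
library(dplyr)

# Hypothetical helper: median price and median construction year per
# neighborhood, ordered from most to least expensive.
neighborhood_summary <- function(df) {
  df %>%
    group_by(Neighborhood) %>%
    summarise(
      median_price = median(SalePrice),
      median_year  = median(YearBuilt),
      n_homes      = n()
    ) %>%
    arrange(desc(median_price))
}

# Toy rows (invented for illustration, not taken from train.csv).
toy <- data.frame(
  Neighborhood = c("North", "North", "Center", "Center", "South"),
  YearBuilt    = c(2005, 2007, 1920, 1930, 1972),
  SalePrice    = c(300000, 320000, 110000, 125000, 90000)
)

neighborhood_summary(toy)
```

Calling neighborhood_summary(train) would put each neighborhood's median price next to its median construction year, which makes exceptions to the trend easier to spot than in the plot.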