Project by Mauricio Conde and Matías Zieleniec | December 2022
Our objective is to build an R model that allows us to predict the final price of a house in the city of Ames, in the state of Iowa, United States. To achieve this, we will use different features of the houses as variables in our model (such as lot size, number of rooms, etc.).
To carry out this project, we have used three Kaggle datasets (see link).
The first dataset, train.csv, contains information on 1459 houses and 80 features, including the final price (you can check this link for more details on the variables). The second dataset, test.csv, includes 1458 houses that are different from those in the first dataset but have the same variables, except for the price, which is found in the sample_submission.csv dataset.
Next, we will import train.csv, the dataset with which we will train our models. This file is located in our working directory. We will review the first rows and columns to get a better understanding of the data we are working with.
library(readr)
train <- read_csv("train.csv")
train[1:5, 1:10]
## # A tibble: 5 × 10
## Id MSSubClass MSZoning LotFr…¹ LotArea Street Alley LotSh…² LandC…³ Utili…⁴
## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl AllPub
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl AllPub
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl AllPub
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl AllPub
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl AllPub
## # … with abbreviated variable names ¹LotFrontage, ²LotShape, ³LandContour,
## # ⁴Utilities
To carry out this report we use the following packages:
library(tidyverse)
library(ggplot2)
library(dplyr)
library(cowplot) # To arrange ggplot2 plots in a grid.
library(corrplot) # To plot correlation matrix.
library(leaps) # For Forward and Backward methods.
library(rpart) # For decision trees.
library(rpart.plot) # For decision tree plot.
One of the first questions we ask ourselves when tackling this problem is whether there is a strong relationship between the year a house was built and its final price. Does the year of construction influence the sale price? To answer this question, we will analyze the relationship between the YearBuilt variable (year of construction of the house) and the SalePrice variable (sale price). If the two are positively correlated, we expect to see that the newer the house, the higher its sale price.
ggplot(data = train) +
  # One point per house, colored by sale price.
  geom_point(mapping = aes(x = YearBuilt, y = SalePrice, color = SalePrice),
             show.legend = FALSE) +
  # Smoothed trend line (ggplot2 picks the smoothing method automatically).
  geom_smooth(mapping = aes(x = YearBuilt, y = SalePrice), color = "red") +
  ggtitle("Year built vs. house sale price") +
  xlab("Year built") + ylab("Sale price") +
  theme(plot.title = element_text(size = 13, face = "bold", hjust = 0.5))
 
ggplot(data = train) +
  # Same plot as above, using the remodeling year on the x axis.
  geom_point(mapping = aes(x = YearRemodAdd, y = SalePrice, color = SalePrice),
             show.legend = FALSE) +
  geom_smooth(mapping = aes(x = YearRemodAdd, y = SalePrice), color = "red") +
  ggtitle("Year remodeled vs. house sale price") +
  xlab("Year remodeled") + ylab("Sale price") +
  theme(plot.title = element_text(size = 13, face = "bold", hjust = 0.5))
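The association suggested by the two scatter plots can also be summarized with a single number. The following is a minimal sketch, not part of the original analysis: it assumes `train` has been loaded as shown earlier, and `year_price_cor` is a hypothetical helper name introduced here for illustration. It uses Spearman's rank correlation, which only assumes a monotonic (not necessarily linear) relationship.

```r
# Hypothetical helper: Spearman rank correlation between a year column
# and the sale price, ignoring rows with missing values.
year_price_cor <- function(data, year_col, price_col = "SalePrice") {
  cor(data[[year_col]], data[[price_col]],
      method = "spearman", use = "complete.obs")
}

# Only runs if a data frame called `train` is available in the session.
if (exists("train") && is.data.frame(train)) {
  print(year_price_cor(train, "YearBuilt"))
  print(year_price_cor(train, "YearRemodAdd"))
}
```

Values close to 1 would indicate that newer (or more recently remodeled) houses consistently sell for more; values near 0 would indicate no monotonic relationship.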
 
Visual inspection of both plots suggests a slight upward trend: newer or more recently remodeled homes tend to sell for more. The year is clearly not the only factor that drives the final price, but the plots support our initial intuition. Next, we go one step further and compare the sale price against both the year built and the year remodeled, splitting the data by neighborhood.
ggplot(data = train) +
  # Blue points: year built. Red points: year remodeled.
  geom_point(mapping = aes(x = YearBuilt, y = SalePrice), color = "blue", alpha = 0.3) +
  geom_point(mapping = aes(x = YearRemodAdd, y = SalePrice), color = "red", alpha = 0.2) +
  # One panel per neighborhood.
  facet_wrap(~Neighborhood) +
  ggtitle("Sale price by neighborhood vs.\nyear built (blue) and year remodeled (red)") +
  xlab("Year") + ylab("Sale price") +
  theme(plot.title = element_text(size = 13, face = "bold", hjust = 0.5))
# Neighborhoods are ordered by median sale price (the default of fct_reorder).
ggplot(data = train, mapping = aes(x = fct_reorder(Neighborhood, SalePrice),
                                   y = SalePrice)) +
  geom_boxplot(color = "#145A32") +
  coord_flip() +
  ggtitle("House sale prices by neighborhood") +
  xlab("Neighborhood") + ylab("Sale price") +
  theme(plot.title = element_text(size = 13, face = "bold", hjust = 0.5))
The boxplot shows that the median sale price in relatively new neighborhoods, such as NridgHt, is higher than in longer-established ones, such as OldTown. Moreover, most neighborhoods with a median price above 2e+05 (200,000) are recently developed. There are exceptions to this trend, however: MeadowV is one of the newest neighborhoods but has the lowest median price, while Crawfor has the seventh-highest median and is not among the most recently developed.
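The neighborhood comparison can be cross-checked against a small summary table. The sketch below is an illustration rather than part of the original report: `neighborhood_summary` is a hypothetical helper, and the median of YearBuilt is used only as a rough proxy for how new a neighborhood is (an assumption, since the dataset has no founding date for neighborhoods).

```r
library(dplyr)

# Hypothetical helper: median sale price, median construction year and
# number of houses per neighborhood, sorted from most to least expensive.
neighborhood_summary <- function(data) {
  data %>%
    group_by(Neighborhood) %>%
    summarise(
      median_price      = median(SalePrice),
      median_year_built = median(YearBuilt),
      n_houses          = n(),
      .groups = "drop"
    ) %>%
    arrange(desc(median_price))
}

# Only runs if a data frame called `train` is available in the session.
if (exists("train") && is.data.frame(train)) {
  print(neighborhood_summary(train), n = Inf)
}
```

Reading the `median_price` and `median_year_built` columns side by side makes it easier to spot neighborhoods that depart from the general pattern, such as the MeadowV and Crawfor cases mentioned above.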