It turns out that the lower MSE value on the normalised-features model that we got in the lab was the result of a lucky split into training and test sets. With a different split, the MSE would most likely have been higher. To understand why, we first have to understand why you’re getting the same residuals. For simplicity, I’m going to talk about the one-variable case (i.e. you can assume that we’re just using “area”, but everything I’m writing also applies to the multivariable case).

Recall that what we’re trying to do with linear regression is to find the parameters \(a\) and \(b\) in the following equation that best fit our data:

\[ y = a + bx \]

Now, when we transform our data, we’re taking \(x\) and subtracting off its mean and dividing by its standard deviation. So we’re replacing it with \((x-m)/s\) where \(m\) is the mean and \(s\) is the standard deviation. So now our model becomes:

\[ y = a + b\frac{(x - m)}{s} = a + \frac{b}{s}x - \frac{bm}{s} \]

But notice that \(m\) and \(s\) are just constants, so from the linear regression algorithm’s point of view, it’s just estimating \[ y = a' + b'x \]

where \(a' = a - \frac{bm}{s}\) and \(b' = \frac{b}{s}\). I.e. it’s exactly the same problem - which makes sense when you think about it, because if we were trying to find the relationship between weight and height, we wouldn’t want it to matter whether weight was measured in pounds or kilograms or any other unit, and you can think of the normalised x values as a unit-free version of the original x variable.
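If you want to convince yourself of this numerically, here’s a quick sketch using some made-up data (none of these numbers come from the lab - they’re purely for illustration):

set.seed(1)
x <- rnorm(50, mean = 100, sd = 15)  # made-up feature
y <- 3 + 2 * x + rnorm(50)           # made-up response
z <- (x - mean(x)) / sd(x)           # normalised version of x
fit.raw <- lm(y ~ x)
fit.scaled <- lm(y ~ z)
all.equal(residuals(fit.raw), residuals(fit.scaled))                      # TRUE: identical residuals
all.equal(unname(coef(fit.raw)[2]), unname(coef(fit.scaled)[2]) / sd(x))  # TRUE: b' = b/s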

So the fact that you’re getting the same residuals means that the linear regression algorithm is doing its job correctly - i.e. it’s predicting the same spread of \(y\) values despite the fact that your features are now “unit free”.

Which raises the next question: if it’s producing the same model, why isn’t it performing the same on the test set - why did we get different MSE results?

setwd("~/Documents/ComputerScience/UBC/340 - Machine Learning/Labs/Lab 3/")
train <- read.table("housing-train.csv",sep = ",",header=TRUE)
train$X <- NULL
test <- read.table("housing-test.csv",sep = ",",header=TRUE)
test$X <- NULL
all <- rbind(train,test)

# Normalise the given columns using the mean and sd of *this* dataset
feature.scale <- function(dataset, columns = c("area", "bedrooms")) {
  for (column in columns) {
    sigma <- sd(dataset[, column])
    mu <- mean(dataset[, column])
    dataset[, column] <- (dataset[, column] - mu) / sigma
  }
  return(dataset)
}

# Helper function to fit the model and compute the MSE on the test set

get.mse <- function(trn, tst) {
  mod <- lm(price ~ ., data = trn)
  pred <- predict(mod, newdata = tst)
  return(mean((tst$price - pred)^2))
}

# Sanity check to see that we get the same results as before:

idx <- 1:37 # set the training indices
train <- all[idx,]
test <- all[-idx,]
unscaled <- get.mse(train, test)
scaled <- get.mse(feature.scale(train), feature.scale(test))
obs <- data.frame(unscaled = unscaled, scaled = scaled)
obs
##   unscaled scaled
## 1    51688  39723
# Monte Carlo simulation to check the MSE values for the scaled and unscaled test set

mse.samples <- data.frame(unscaled = c(), scaled = c())
for (i in 1:1000) {
  idx <- sample(1:nrow(all), 37)  # a random training set of the same size as the original
  train <- all[idx, ]
  test <- all[-idx, ]
  unscaled <- get.mse(train, test)
  scaled <- get.mse(feature.scale(train), feature.scale(test))
  obs <- data.frame(unscaled = unscaled, scaled = scaled)
  mse.samples <- rbind(obs, mse.samples)
}

summary(mse.samples)
##     unscaled         scaled      
##  Min.   :23359   Min.   : 25688  
##  1st Qu.:47498   1st Qu.: 52616  
##  Median :55140   Median : 63533  
##  Mean   :55724   Mean   : 66326  
##  3rd Qu.:62735   3rd Qu.: 76388  
##  Max.   :97807   Max.   :156672

It turns out this is one of the perils of preprocessing your data - you have to be really careful not to unintentionally introduce biases into your test set (which is what we did in this example). When we did feature normalisation, we normalised the test set and the training set separately, so we used different \(m\) and \(s\) values for the training and test sets. As a result, our estimated coefficients behaved differently on the test set than they did on the training set. Under normal conditions you wouldn’t notice such a difference, but because the dataset was very small, the means and standard deviations of the training set were potentially very different from those of the test set. It turns out that on average this bias actually leads to a larger MSE on the scaled data than on the unscaled data. The Monte Carlo simulation above shows this by fitting the same model on 1000 different training and test sets and then comparing the MSE values.
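For what it’s worth, the standard way to avoid this bias is to estimate \(m\) and \(s\) on the training set only and then reuse those same values to scale the test set. Here’s a minimal sketch of that idea (the function name is made up for illustration):

# Estimate the scaling parameters on the training set only,
# then apply those same parameters to the test set.
feature.scale.both <- function(trn, tst, columns = c("area", "bedrooms")) {
  for (column in columns) {
    mu <- mean(trn[, column])     # training-set mean
    sigma <- sd(trn[, column])    # training-set standard deviation
    trn[, column] <- (trn[, column] - mu) / sigma
    tst[, column] <- (tst[, column] - mu) / sigma  # note: same mu and sigma
  }
  return(list(train = trn, test = tst))
}

That way the test set is transformed with constants that were fixed before the model ever saw it, so no information leaks from the test set into the preprocessing step.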


Finally, does this mean that feature normalisation isn’t useful for linear regression? It turns out it is useful, but for optimisation reasons rather than accuracy reasons. On really big datasets, calculating the coefficient estimates directly via \((X'X)^{-1}X'y\) is far more computationally expensive than using a method called gradient descent, and there feature normalisation ensures that the gradient descent algorithm converges quickly.
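To give a flavour of what that means, here’s a toy gradient descent loop for the one-variable model (the learning rate alpha and the iteration count are arbitrary illustrative choices, not tuned values):

# Toy gradient descent for y = a + b*x, minimising the MSE.
grad.descent <- function(x, y, alpha = 0.1, iters = 1000) {
  a <- 0
  b <- 0
  for (i in 1:iters) {
    err <- (a + b * x) - y
    a <- a - alpha * mean(err)      # gradient of the MSE w.r.t. a (up to a factor of 2)
    b <- b - alpha * mean(err * x)  # gradient of the MSE w.r.t. b (up to a factor of 2)
  }
  c(intercept = a, slope = b)
}

With the raw “area” values, \(x\) is large, so the two gradients are on wildly different scales and you’re forced into a tiny alpha and many iterations to avoid divergence; with the normalised feature both gradients are on a similar scale and the same alpha converges much faster.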

Hope this helps!