This version of the tutorial 4 script shows how you can use RMarkdown to present your work. Most of it is pretty self-explanatory, with the exception of generating tables. In this script I use both the xtable and the kable commands to make the printed tables look nicer in the HTML file. We also need to set results='asis' in the R chunk header so that the table prints correctly.
For more info on Markdown and what commands are available, be sure to check out John Gruber’s homepage and google RMarkdown.
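For example, a minimal chunk that prints an HTML table via kable might look like this (the chunk header is shown as a comment, since chunks can't be nested here; this assumes the knitr package is installed and uses the built-in mtcars data for illustration):

```r
# In the .Rmd source, the chunk header would be: ```{r, results='asis'}
library(knitr)                            # provides kable()
kable(head(mtcars, 3), format = "html")   # emits raw HTML, passed through by results='asis'
```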
We’ll start by examining the data…
hd <- head(train)
print(xtable(hd),type="html", html.table.attributes = tab.attributes)
 | X | admit | gre | gpa | rank |
---|---|---|---|---|---|
1 | 254 | 1 | 540 | 3.55 | 4 |
2 | 90 | 1 | 660 | 4.00 | 2 |
3 | 177 | 0 | 520 | 2.62 | 2 |
4 | 195 | 1 | 600 | 3.47 | 2 |
5 | 124 | 0 | 500 | 2.98 | 3 |
6 | 96 | 0 | 660 | 3.33 | 2 |
summ <- summary(train)
print(xtable(summary(train)),type="html", html.table.attributes = tab.attributes)
 | X | admit | gre | gpa | rank |
---|---|---|---|---|---|
1 | Min. : 1 | Min. :0.0 | Min. :220 | Min. :2.42 | Min. :1.00 |
2 | 1st Qu.:101 | 1st Qu.:0.0 | 1st Qu.:520 | 1st Qu.:3.12 | 1st Qu.:2.00 |
3 | Median :205 | Median :0.0 | Median :580 | Median :3.36 | Median :2.00 |
4 | Mean :203 | Mean :0.3 | Mean :586 | Mean :3.37 | Mean :2.46 |
5 | 3rd Qu.:307 | 3rd Qu.:1.0 | 3rd Qu.:660 | 3rd Qu.:3.65 | 3rd Qu.:3.00 |
6 | Max. :399 | Max. :1.0 | Max. :800 | Max. :4.00 | Max. :4.00 |
Contingency tables make more sense for categorical data. Here we cross-tabulate admission (rows) against rank, and then against gre (columns)…
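Tables like the ones below come from the table() function, which counts the combinations of two vectors. A toy illustration with made-up data standing in for train$admit and train$rank:

```r
# table() cross-tabulates two vectors of equal length
admit <- c(0, 0, 1, 1, 0, 1)
rank  <- c(1, 2, 2, 3, 3, 1)
table(admit, rank)   # rows: admit (0/1), columns: rank
```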
 | 1 | 2 | 3 | 4 |
---|---|---|---|---|
0 | 20 | 83 | 68 | 39 |
1 | 23 | 42 | 15 | 10 |
 | 220 | 300 | 340 | 360 | 380 | 400 | 420 | 440 | 460 | 480 | 500 | 520 | 540 | 560 | 580 | 600 | 620 | 640 | 660 | 680 | 700 | 720 | 740 | 760 | 780 | 800 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | 4 | 5 | 8 | 4 | 6 | 6 | 8 | 15 | 12 | 15 | 13 | 18 | 11 | 13 | 13 | 10 | 10 | 14 | 4 | 4 | 1 | 0 | 12 |
1 | 0 | 1 | 1 | 0 | 0 | 2 | 0 | 1 | 2 | 6 | 1 | 8 | 7 | 5 | 4 | 6 | 9 | 2 | 8 | 6 | 3 | 1 | 2 | 4 | 4 | 7 |
Recall that when reading in the data, R tries to infer the type of each column, so columns with integer values are treated as numbers by default. We need them as categories, so we convert them with the factor command.
train$rank <- factor(train$rank)
head(train$rank)
## [1] 4 2 2 2 3 2
## Levels: 1 2 3 4
test$rank <- factor(test$rank)
train$admit <- factor(train$admit)
test$admit <- factor(test$admit)
model1 <- glm(admit ~ gre + gpa, data = train, family="binomial")
predict1 <- predict(model1, newdata = test, type = "response")
print(xtable(summary(model1)),type="html", html.table.attributes = tab.attributes)
 | Estimate | Std. Error | z value | Pr(>|z|) |
---|---|---|---|---|
(Intercept) | -3.3716 | 1.2228 | -2.76 | 0.0058 |
gre | 0.0027 | 0.0012 | 2.22 | 0.0266 |
gpa | 0.2783 | 0.3608 | 0.77 | 0.4404 |
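The coefficients are on the log-odds scale, so exponentiating them gives odds ratios. Using the gre estimate from the table above:

```r
# gre coefficient from the summary table above
b_gre <- 0.0027
exp(b_gre)         # odds multiplier per additional GRE point (about 1.003)
exp(b_gre * 100)   # per 100 GRE points (about 1.31)
```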
Look at the predict1 values: what do they represent? Since we used type = "response", they are predicted probabilities of admission, not 0/1 class labels.
#head(predict1, 50)
print(xtable(data.frame(Predict1 = head(predict1, 25))),type="html", html.table.attributes = 'border="0" style="border-collapse: collapse; text-align: right; width: 25%; "')
 | Predict1 |
---|---|
1 | 0.26 |
2 | 0.29 |
3 | 0.19 |
4 | 0.25 |
5 | 0.30 |
6 | 0.35 |
7 | 0.29 |
8 | 0.44 |
9 | 0.21 |
10 | 0.47 |
11 | 0.32 |
12 | 0.16 |
13 | 0.37 |
14 | 0.47 |
15 | 0.33 |
16 | 0.31 |
17 | 0.31 |
18 | 0.30 |
19 | 0.43 |
20 | 0.37 |
21 | 0.19 |
22 | 0.47 |
23 | 0.26 |
24 | 0.38 |
25 | 0.27 |
In order to turn the probabilities into a binary decision, we need to find a cutoff. We want a cutoff that gives a good balance between sensitivity and specificity.
perf <- function(cut, mod, y)
{
  yhat <- (mod$fit > cut)             # classify the fitted probabilities at the cutoff
  w <- which(y == 1)                  # indices of the actual positives
  sensitivity <- mean(yhat[w] == 1)   # true positive rate
  specificity <- mean(yhat[-w] == 0)  # true negative rate
  out <- t(as.matrix(c(sensitivity, specificity)))
  colnames(out) <- c("sensitivity", "specificity")
  return(out)
}
perf(0.1,model1,train$admit)
## sensitivity specificity
## [1,] 1 0
perf(0.5,model1,train$admit)
## sensitivity specificity
## [1,] 0 1
perf(0.8,model1,train$admit)
## sensitivity specificity
## [1,] 0 1
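The extremes behave as expected: a very low cutoff flags everyone as admitted (sensitivity 1, specificity 0), while a high cutoff does the reverse. A tiny self-contained check of the same logic, with toy values standing in for model1$fit and train$admit:

```r
# Toy fitted probabilities and labels
y    <- c(1, 1, 0, 0)
fit  <- c(0.9, 0.4, 0.6, 0.2)
yhat <- fit > 0.5
mean(yhat[y == 1])        # sensitivity: 0.5, one of two positives caught
mean(yhat[y == 0] == 0)   # specificity: 0.5, one of two negatives correct
```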
Let’s plot the sensitivity and specificity…
sensitivity <- c()
specificity <- c()
for(i in seq(0.1, 0.9, by=0.1)){
  sensitivity <- c(sensitivity, perf(i, model1, train$admit)[1])
  specificity <- c(specificity, perf(i, model1, train$admit)[2])
}
# find the cutoff where the two curves cross
plot(seq(0.1, 0.9, by=0.1), sensitivity, type="l", lty=1, col="blue", xlab="cutoff points", ylab="sensitivity/specificity", lwd=2)
lines(seq(0.1, 0.9, by=0.1), specificity, type="l", lty=1, col="red", lwd=2)
# prediction using the chosen cutoff
predict1 <- ifelse(predict(model1, newdata=test,type="response")>0.3, 1, 0)
confusionmatrix1 <- table(predict1, test$admit)
print(xtable(confusionmatrix1),type="html", html.table.attributes = tab.attributes)
 | 0 | 1 |
---|---|---|
0 | 37 | 10 |
1 | 26 | 27 |
# accuracy of the model on the test data:
sum(diag(confusionmatrix1))/sum(confusionmatrix1)
## [1] 0.64
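Accuracy here is the sum of the diagonal of the confusion matrix (the correct predictions) divided by the total. Reproducing the arithmetic with the counts from the table above:

```r
# Confusion matrix counts from the table above (rows: predicted, columns: actual)
cm <- matrix(c(37, 26, 10, 27), nrow = 2)   # column-major: 37, 26, 10, 27
sum(diag(cm)) / sum(cm)                     # (37 + 27) / 100 = 0.64
```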
We now fit a richer model that includes the rank of the school.
model2 <- glm(admit ~ gre + gpa + rank, data = train, family="binomial")
print(xtable(summary(model2)),type="html", html.table.attributes = tab.attributes)
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | -2.2490 | 1.3082 | -1.72 | 0.0856 |
gre | 0.0023 | 0.0012 | 1.87 | 0.0618 |
gpa | 0.2891 | 0.3741 | 0.77 | 0.4397 |
rank2 | -0.7935 | 0.3655 | -2.17 | 0.0299 |
rank3 | -1.6016 | 0.4233 | -3.78 | 0.0002 |
rank4 | -1.4306 | 0.4739 | -3.02 | 0.0025 |
sensitivity <- c()
specificity <- c()
for(i in seq(0.1, 0.9, by=0.1)){
  sensitivity <- c(sensitivity, perf(i, model2, train$admit)[1])
  specificity <- c(specificity, perf(i, model2, train$admit)[2])
}
plot(seq(0.1, 0.9, by=0.1), sensitivity, type="l", lty=1, col="blue", xlab="cutoff points", ylab="sensitivity/specificity", lwd=2)
lines(seq(0.1, 0.9, by=0.1), specificity, type="l", lty=1, col="red", lwd=2)
predict2 <- ifelse(predict(model2, newdata=test,type="response")>0.3, 1, 0)
confusionmatrix2 <- table(predict2, test$admit)
sum(diag(confusionmatrix2))/sum(confusionmatrix2)
## [1] 0.66
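The richer model improves test accuracy from 0.64 to 0.66. As an aside, instead of eyeballing the crossing point of the two curves, the cutoff can be chosen by maximizing sensitivity + specificity - 1 (Youden's J). This technique is not used in the script above; here is a self-contained sketch with toy data:

```r
# Youden's J over a grid of cutoffs, with made-up probabilities and labels
y    <- c(1, 1, 1, 0, 0, 0)
fit  <- c(0.8, 0.6, 0.3, 0.5, 0.2, 0.1)
cuts <- seq(0.1, 0.9, by = 0.1)
J <- sapply(cuts, function(cut) {
  yhat <- fit > cut
  mean(yhat[y == 1]) + mean(!yhat[y == 0]) - 1   # sensitivity + specificity - 1
})
cuts[which.max(J)]   # cutoff with the best J on this toy data
```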