This is a temporary project that exists only to complete mandatory classwork for the Open Data Science class at the University of Helsinki. I expect to hide or remove this repository after December 2020.


Introduction to Open Data Science - Course Project

Chapter 1: About the project

This repository is for an introductory open data science class for PhD students at University of Helsinki (with some students from other universities). The course is open to everyone, uses open source tools, and emphasizes open science. The statistics part seems fairly elementary, but I could use a brush-up and that is why I signed up for the course.

The course seems to cover GitHub, R, RStudio, linear and logistic regression, dimensionality reduction, clustering, hypothesis testing, and some other basic multivariate statistical analysis tools.

Links to my repositories and web pages for the class:


Chapter 2: Regression and model validation

Data wrangling exercise

Here is code and output from data/create_learning2014.R:

# Tatu Ylönen / University of Helsinki
# 1.11.2020
# RStudio Exercise 2 - data wrangling

# This uses the data set described in:
# Kimmo Vehkalahti: ASSIST 2014 - Phase 3 (end of Part 2), N=183
# Course: Johdatus yhteiskuntatilastotieteeseen, syksy 2014
# (Introduction to Social Statistics, fall 2014 - in Finnish),
# international survey of Approaches to Learning, made possible
# by Teachers' Academy funding for KV in 2013-2015.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Load the data into a data frame.
data <- read.csv(file="http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t")

# Print the dimensions of the data.
print("=== dimensions of the data JYTOPKYS3-data dataset")
## [1] "=== dimensions of the data JYTOPKYS3-data dataset"
dim(data)
## [1] 183  60
# Print a summary of what the data frame contains.
# Turns out there are 183 observations for 60 variables.
print("Structure of the JYTOPKYS3-data dataset")
## [1] "Structure of the JYTOPKYS3-data dataset"
str(data)
## 'data.frame':    183 obs. of  60 variables:
##  $ Aa      : int  3 2 4 4 3 4 4 3 2 3 ...
##  $ Ab      : int  1 2 1 2 2 2 1 1 1 2 ...
##  $ Ac      : int  2 2 1 3 2 1 2 2 2 1 ...
##  $ Ad      : int  1 2 1 2 1 1 2 1 1 1 ...
##  $ Ae      : int  1 1 1 1 2 1 1 1 1 1 ...
##  $ Af      : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ ST01    : int  4 4 3 3 4 4 5 4 4 4 ...
##  $ SU02    : int  2 2 1 3 2 3 2 2 1 2 ...
##  $ D03     : int  4 4 4 4 5 5 4 4 5 4 ...
##  $ ST04    : int  4 4 4 4 3 4 2 5 5 4 ...
##  $ SU05    : int  2 4 2 3 4 3 2 4 2 4 ...
##  $ D06     : int  4 2 3 4 4 5 3 3 4 4 ...
##  $ D07     : int  4 3 4 4 4 5 4 4 5 4 ...
##  $ SU08    : int  3 4 1 2 3 4 4 2 4 2 ...
##  $ ST09    : int  3 4 3 3 4 4 2 4 4 4 ...
##  $ SU10    : int  2 1 1 1 2 1 1 2 1 2 ...
##  $ D11     : int  3 4 4 3 4 5 5 3 4 4 ...
##  $ ST12    : int  3 1 4 3 2 3 2 4 4 4 ...
##  $ SU13    : int  3 3 2 2 3 1 1 2 1 2 ...
##  $ D14     : int  4 2 4 4 4 5 5 4 4 4 ...
##  $ D15     : int  3 3 2 3 3 4 2 2 3 4 ...
##  $ SU16    : int  2 4 3 2 3 2 3 3 4 4 ...
##  $ ST17    : int  3 4 3 3 4 3 4 3 4 4 ...
##  $ SU18    : int  2 2 1 1 1 2 1 2 1 2 ...
##  $ D19     : int  4 3 4 3 4 4 4 4 5 4 ...
##  $ ST20    : int  2 1 3 3 3 3 1 4 4 2 ...
##  $ SU21    : int  3 2 2 3 2 4 1 3 2 4 ...
##  $ D22     : int  3 2 4 3 3 5 4 2 4 4 ...
##  $ D23     : int  2 3 3 3 3 4 3 2 4 4 ...
##  $ SU24    : int  2 4 3 2 4 2 2 4 2 4 ...
##  $ ST25    : int  4 2 4 3 4 4 1 4 4 4 ...
##  $ SU26    : int  4 4 4 2 3 2 1 4 4 4 ...
##  $ D27     : int  4 2 3 3 3 5 4 4 5 4 ...
##  $ ST28    : int  4 2 5 3 5 4 1 4 5 2 ...
##  $ SU29    : int  3 3 2 3 3 2 1 2 1 2 ...
##  $ D30     : int  4 3 4 4 3 5 4 3 4 4 ...
##  $ D31     : int  4 4 3 4 4 5 4 4 5 4 ...
##  $ SU32    : int  3 5 5 3 4 3 4 4 3 4 ...
##  $ Ca      : int  2 4 3 3 2 3 4 2 3 2 ...
##  $ Cb      : int  4 4 5 4 4 5 5 4 5 4 ...
##  $ Cc      : int  3 4 4 4 4 4 4 4 4 4 ...
##  $ Cd      : int  4 5 4 4 3 4 4 5 5 5 ...
##  $ Ce      : int  3 5 3 3 3 3 4 3 3 4 ...
##  $ Cf      : int  2 3 4 4 3 4 5 3 3 4 ...
##  $ Cg      : int  3 2 4 4 4 5 5 3 5 4 ...
##  $ Ch      : int  4 4 2 3 4 4 3 3 5 4 ...
##  $ Da      : int  3 4 1 2 3 3 2 2 4 1 ...
##  $ Db      : int  4 3 4 4 4 5 4 4 2 4 ...
##  $ Dc      : int  4 3 4 5 4 4 4 4 4 4 ...
##  $ Dd      : int  5 4 1 2 4 4 5 3 5 2 ...
##  $ De      : int  4 3 4 5 4 4 5 4 4 2 ...
##  $ Df      : int  2 2 1 1 2 3 1 1 4 1 ...
##  $ Dg      : int  4 3 3 5 5 4 4 4 5 1 ...
##  $ Dh      : int  3 3 1 4 5 3 4 1 4 1 ...
##  $ Di      : int  4 2 1 2 3 3 2 1 4 1 ...
##  $ Dj      : int  4 4 5 5 3 5 4 5 2 4 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
# Create the analysis dataset with variables gender, age, attitude,
# deep, stra, surf and points.  First copy the data wrangling statements from
# the datacamp exercises.
data$attitude = data$Attitude / 10
deep_questions <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30",
                    "D06",  "D15", "D23", "D31")
surface_questions <- c("SU02","SU10","SU18","SU26", "SU05","SU13","SU21",
                       "SU29","SU08","SU16","SU24","SU32")
strategic_questions <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20",
                         "ST28")
deep_columns <- select(data, one_of(deep_questions))
data$deep <- rowMeans(deep_columns)
surface_columns <- select(data, one_of(surface_questions))
data$surf <- rowMeans(surface_columns)
strategic_columns <- select(data, one_of(strategic_questions))
data$stra = rowMeans(strategic_columns)

# Select only the desired fields for the analysis data set
anal <- select(data, c("gender", "Age", "attitude", "deep", "stra", "surf",
                       "Points"))
colnames(anal)[2] <- "age"
colnames(anal)[7] <- "points"
# Exclude observations where the exam points variable is zero from the
# analysis data set
anal <- filter(anal, points != 0)

# Show the structure of the analysis dataset
print("=== analysis dataset")
## [1] "=== analysis dataset"
str(anal)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
# Write the analysis data set into a file
write.csv(anal, "data/analysis-dataset.csv")

# Demonstrate that we can read back the analysis data set
readback <- read.csv("data/analysis-dataset.csv")
print("=== structure of the readback dataset")
## [1] "=== structure of the readback dataset"
str(readback)
## 'data.frame':    166 obs. of  8 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
print("=== head of the readback dataset")
## [1] "=== head of the readback dataset"
head(readback)
##   X gender age attitude     deep  stra     surf points
## 1 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6 6      F  38      3.8 4.750000 3.625 2.416667     21

Analysis exercise

Objective of the study

The objective of the study was to analyze how various factors influence the student’s skills in statistics.

Methods

In this exercise we analyzed the dataset described in Kimmo Vehkalahti: The Relationship between Learning Approaches and Students’ Achievements in an Introductory Statistics Course in Finland, 60th World Statistics Congress (ISI2015), Rio de Janeiro, Brazil, July 2015. The overall approach is covered more deeply in Tait, Entwistle and McKune: ASSIST: The Approaches and Study Skills Inventory for Students, 1998.

The data set was collected to understand how various factors influence students’ statistics skills. It includes data for 183 students. A total of 60 variables were collected. The variables can be grouped into those relating to the student’s background and scholastic achievements, deep questions (seeking meaning, relating ideas, use of evidence), surface questions (lack of purpose, unrelated memorizing, syllabus-boundedness), and strategic questions (organized studying, time management).

The student’s exam points are taken as a measure of the student’s skill level, which we try to explain using the other variables. For this analysis, we chose linear least squares regression with the student’s attitude and the averaged answers to the surface and strategic questions as explanatory variables. We chose not to include the deep questions, age, or gender as explanatory variables because the assignment instructions said to use three variables; extending the analysis to include additional variables would be straightforward.

We additionally assess the significance of the results using the t statistic. This includes examining the distribution of the regression errors and checking whether it is approximately normal, in order to verify that the chosen significance testing methods are applicable in this situation.

The analysis was performed using R (version 3.6.3). The R scripts and their output are included below.

Data preprocessing

The original data set can be found here and its detailed description here.

We preprocessed the data by averaging the answers to the deep, surface, and strategic questions (using rowMeans over the corresponding question columns). The general attitude towards statistics had already been precomputed as a sum of ten questions, so it was simply divided by 10. Additionally, the student’s exam points, age, and gender were retained in the analysis data set.

The data preprocessing was done using the script built for the data wrangling exercise, data/create_learning2014.R, shown above.

Observations where the exam points are zero were excluded from the analysis. This left 166 observations.

To get a quick overview of the data and the relationships between points and the possible explanatory variables, we first plot points against each variable.

library(ggplot2)
data <- read.csv("data/analysis-dataset.csv")
p <- ggplot(data, aes(x=deep, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'

p <- ggplot(data, aes(x=surf, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'

p <- ggplot(data, aes(x=stra, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'

p <- ggplot(data, aes(x=attitude, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'

p <- ggplot(data, aes(x=age, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'

p <- ggplot(data, aes(x=gender, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'

It is clear that gender, as a two-category variable, is not suitable for this kind of linear regression plot. The data also contains relatively few observations for ages above about 27, so age is probably not a very helpful explanatory variable either. The deep questions have most of their averages concentrated between 3 and 5, so they might be less useful for regression than the other variables; they also seem to have little influence on points based on the regression line. These observations contributed to choosing attitude, surf, and stra for our analysis.

Analysis

First, we compute a linear least squares regression, trying to explain points using the students’ attitude and the averaged answers for the surface and strategic questions.

library(dplyr)
data <- read.csv("data/analysis-dataset.csv")
model <- lm(points ~ attitude + surf + stra, data=data)
summary(model)
## 
## Call:
## lm(formula = points ~ attitude + surf + stra, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## surf         -0.5861     0.8014  -0.731  0.46563    
## stra          0.8531     0.5416   1.575  0.11716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

For statistically significant results we would expect p < 0.05. The estimated fit shows a positive coefficient for attitude (3.40, p = 1.9e-08, which is statistically significant). It also shows a negative coefficient for the surface questions and a positive coefficient for the strategic questions, but these are not statistically significant (p = 0.47 and p = 0.12, respectively).
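
As a complementary check (not part of the original analysis), confidence intervals for these coefficients could be extracted directly from the fitted model; a minimal sketch:

# Sketch: 95% confidence intervals for the coefficients of the model above;
# intervals that include zero correspond to the non-significant terms.
confint(model)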

The t statistic used for significance testing in this linear regression assumes that the distribution of errors follows the normal distribution. To validate this assumption, we first plot the residuals.

library(dplyr)
data <- read.csv("data/analysis-dataset.csv")
model <- lm(points ~ attitude + surf + stra, data=data)
plot(model)

The residuals vs. fitted plot shows no obvious pattern that would invalidate the p values. For a normal distribution we would expect the highest concentration of residuals around zero and the residuals should be independent of the fitted value. There is a small cluster of negative residuals near 24-27 and at extreme fitted values the residuals tend to be negative. However this effect seems small enough to not cause significant concern.

From the Q-Q plot we would expect the residual distribution to match the theoretical distribution. We can see that while the regression seems to work fairly well for most data points, extreme errors on both sides tend to be more negative than predicted by a normal distribution. However, the deviation is not very big. Overall, the plot looks to me sufficiently consistent with a bounded normal distribution to consider the p values reasonably reliable, but some caution would be warranted.

Finally, the residuals vs. leverage plot for a normal distribution should show the same spread regardless of leverage and the standardized residual density should follow a normal distribution. This appears to hold reasonably well in this case, though it is also possible that the spread decreases with leverage. There are too few data points at high leverage to be sure. Generally I would say the plot is consistent with the assumption of an approximately normal distribution of residuals.
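
A formal test could complement these visual checks; the following minimal sketch (not part of the original analysis) applies the Shapiro-Wilk normality test to the residuals of the model fitted above:

# Sketch: Shapiro-Wilk test of normality on the model residuals; a small
# p-value would indicate deviation from normality.
shapiro.test(residuals(model))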

We will now remove the variables that are not statistically significant from the regression. Thus, we compute the regression against just attitude.

library(dplyr)
data <- read.csv("data/analysis-dataset.csv")
model <- lm(points ~ attitude, data=data)
summary(model)
## 
## Call:
## lm(formula = points ~ attitude, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
plot(model)

There is little change in the coefficient of attitude compared to the previous regression (now 3.53, was 3.40). Its statistical significance has slightly improved (now p = 4.12e-09, which is highly significant). The intercept 11.64 is also highly significant (p = 1.95e-09).

Looking at the residuals vs fitted plot, it now looks like a better match against the normal distribution, though there seem to be fewer than expected samples further out. The Q-Q plot has not changed much, or perhaps the negative deviation at extremes has gotten stronger.

Given the very high significance from the t test and the approximate conformance with the normal distribution, it seems likely that the result for attitude truly is significant. However, I would not trust the absolute p value as there is some deviation from a normal distribution. I would seek to confirm the p value using nonparametric tests if submitting for a peer-reviewed scientific publication.
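
As one example of such a nonparametric check, Spearman’s rank correlation does not assume normally distributed errors; a minimal sketch:

# Sketch: Spearman rank correlation between attitude and points.
# (Ties in the data may trigger a warning about exact p-values.)
cor.test(data$attitude, data$points, method="spearman")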

Finally, we try to analyze the correlation and explanatory power of attitude on points. For this, we use the Pearson correlation coefficient and use R² to measure explanatory power. This assumes a normal distribution.

library(dplyr)
data <- read.csv("data/analysis-dataset.csv")
r = cor(data$attitude, data$points, method="pearson")
print(r)
## [1] 0.4365245
print(cat("correlation R", r, "\n"))
## correlation R 0.4365245 
## NULL
print(cat("R^2", r * r, "\n"))
## R^2 0.1905537 
## NULL

The R² value of 0.19 suggests that attitude explains about 19% of the variance in points.

Conclusion

The student’s attitude toward statistics seems to have a statistically significant relationship with the student’s points on the exam that can be approximated with a linear model: \[ points = 3.53 \cdot attitude + 11.64 \]
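
As an illustration of how the fitted model could be used, here is a minimal sketch that predicts the exam points for a hypothetical attitude score of 4.0:

# Sketch: predicted points for a hypothetical attitude score of 4.0, using
# the single-predictor model fitted above.
predict(model, newdata=data.frame(attitude=4.0))
# roughly 3.53 * 4.0 + 11.64, i.e., about 26 points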

This model explains about 19% of the variance in points. About 81% of variance is not explained by this model.

It should be noted that we have not analyzed whether higher points are caused by attitude or vice versa, or whether the results could be better explained by some other common cause. We have merely detected a correlation.


Chapter 3: Logistic regression

Data wrangling

See the script create_alc.R in the data directory on github.

Analysis

In this exercise we analyzed how students’ alcohol usage can be predicted using other variables in the data set. The data set is the Student Performance Data Set from the UCI Machine Learning Repository. The dataset is described in P. Cortez and A. Silva: Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira (eds): Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008), pp. 5-12, Porto, Portugal, April 2008.

(2) Loading data and describing it

First, let’s look at an overview of the data. Looking at available field names, we select a few fields as potential candidates for further investigation.

library(dplyr)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

alc <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt")

# Print column names
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
# Print overall structure of the data
str(alc)
## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

The data includes numerous factors describing the students’ background and lifestyle, as well as problems with their studies, such as failed classes or absences, and the students’ alcohol usage. Our task is to study the relationship between high/low alcohol consumption and the other data. In total, the data set includes 382 observations for 35 variables.

(3) Selecting four variables and hypotheses

Looking at the data, I decided to choose the following variables for closer inspection:

  • Medu (Mother’s education) - it is conceivable that parents’ education could have an influence (generally, higher parental education correlates with school success, I think)
  • Fedu (Father’s education) - same reason
  • goout (How much the student goes out with friends - people often drink when they go out)
  • famrel (Quality of family relationships) - conceivably this could have an influence, for example kids from troubled families having more problems of their own (i.e., a negative correlation in this case).

(4) Graphical and numerical analysis of the variables

Let’s now look at them graphically using scatterplots (with some position jitter to make the point density more visible, given the discrete values):

ggplot(alc, aes(x=Medu, y=alc_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5)) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

ggplot(alc, aes(x=Fedu, y=alc_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5)) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

ggplot(alc, aes(x=goout, y=alc_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5)) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

ggplot(alc, aes(x=famrel, y=alc_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5)) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

Let’s also look at the data numerically using covariances.

# Covariances help us understand the relationship between different variables.
sub <- select(alc, one_of(c("alc_use", "Medu", "Fedu", "goout", "famrel")))
cov(sub, sub)
##              alc_use       Medu        Fedu      goout      famrel
## alc_use  0.975635899 0.01153619 0.009385607 0.43383353 -0.10978961
## Medu     0.011536189 1.18022289 0.773865963 0.08223056  0.01192783
## Fedu     0.009385607 0.77386596 1.201742452 0.04117025  0.01313710
## goout    0.433833533 0.08223056 0.041170246 1.28125902  0.08291077
## famrel  -0.109789614 0.01192783 0.013137101 0.08291077  0.84938368

We can see that alc_use has a covariance of 0.43 with goout and -0.11 with famrel, while the covariances with Medu and Fedu are small.
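
Since covariances depend on the scales of the variables, the corresponding correlations may be easier to interpret; a minimal sketch (not part of the original numbers above):

# Sketch: scale-free correlation matrix for the same variables.
cor(sub)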

Visually and based on the linear regression line, it does not look like parents’ education has much impact on alcohol use. Nevertheless, it might have an effect when conditioning on other variables or after eliminating their influence, so let’s keep both variables in the analysis. Going out and family relationships, however, show a clear association with alcohol use.

(5) Logistic regression

Let’s now focus on high alcohol use, i.e., the binarized variable high_use, and use logistic regression to analyze the impact of the selected variables on it. We also load the boot library, which we will use later for cross-validation.

library(boot)

m <- glm(high_use ~ Medu + Fedu + goout + famrel, data=alc, family="binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ Medu + Fedu + goout + famrel, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6349  -0.7725  -0.5598   0.9912   2.4030  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.73933    0.68188  -2.551  0.01075 *  
## Medu        -0.10465    0.14883  -0.703  0.48194    
## Fedu         0.05742    0.14751   0.389  0.69707    
## goout        0.80690    0.11928   6.765 1.33e-11 ***
## famrel      -0.42459    0.13316  -3.189  0.00143 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 401.82  on 377  degrees of freedom
## AIC: 411.82
## 
## Number of Fisher Scoring iterations: 4

Looking at the significance level of the coefficients, we can see that Medu and Fedu are not statistically significant, while goout and famrel are statistically significant at the p=0.01 level. Let’s run the regression again but without Medu and Fedu.

library(boot)

m <- glm(high_use ~ goout + famrel, data=alc, family="binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ goout + famrel, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5586  -0.7385  -0.5676   0.9969   2.3758  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.8742     0.6015  -3.116  0.00183 ** 
## goout         0.8004     0.1187   6.743 1.55e-11 ***
## famrel       -0.4218     0.1330  -3.171  0.00152 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 402.32  on 379  degrees of freedom
## AIC: 408.32
## 
## Number of Fisher Scoring iterations: 4

We can see that goout and famrel are still statistically significant at the p=0.01 level, with coefficients of 0.80 and -0.42, respectively. Their standard errors are at most about one third of the absolute values of the coefficient estimates.

Let’s now look at the coefficients and their confidence intervals after exponentiation. Exponentiation effectively converts addition into multiplication, and exponentiated coefficients >1 indicate a positive impact on high_use while coefficients <1 indicate a negative impact on high_use.

OR <- coef(m) %>% exp
OR
## (Intercept)       goout      famrel 
##   0.1534844   2.2264964   0.6558574
CI <- exp(confint(m))
## Waiting for profiling to be done...
CI
##                  2.5 %    97.5 %
## (Intercept) 0.04577266 0.4877710
## goout       1.77580201 2.8309109
## famrel      0.50355350 0.8498297
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.1534844 0.04577266 0.4877710
## goout       2.2264964 1.77580201 2.8309109
## famrel      0.6558574 0.50355350 0.8498297

We can see that the results are in line with our hypothesis and with the results we got with the non-exponentiated coefficients (i.e., a negative value there corresponds to a value <1 after exponentiation).

The exponentiated coefficients relate directly to the odds ratios. Essentially, in this case, a one-unit increase in goout multiplies the odds of high alcohol use by about 2.2, and a one-unit increase in famrel multiplies the odds by about 0.66 (i.e., decreases them). The confidence intervals of these odds ratios are given directly by the confidence intervals of the exponentiated coefficients, as reported by exp(confint(m)) above.
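
To illustrate how the model turns these coefficients into a probability, here is a minimal sketch for a hypothetical student with goout = 5 and famrel = 2 (the coefficient values are taken from the model summary above):

# Sketch: predicted probability of high_use for a hypothetical student with
# goout = 5 and famrel = 2, using the coefficients of the model m above.
logit <- -1.8742 + 0.8004 * 5 - 0.4218 * 2
plogis(logit)  # inverse logit; about 0.78, so high use would be predicted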

(6) Predictive power and cross tabulation

Let’s now explore the predictive power of the model.

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Predict the probability of high_use using the model
probabilities <- predict(m, type="response")
alc <- mutate(alc, probability=probabilities)
alc <- mutate(alc, prediction=probability >= 0.5)

# Let's look at ten samples and how they were predicted vs. actual high_use
select(alc, high_use, prediction, Medu, Fedu, goout, famrel) %>% tail(10)
##     high_use prediction Medu Fedu goout famrel
## 373    FALSE      FALSE    1    1     2      4
## 374     TRUE      FALSE    4    2     3      5
## 375    FALSE      FALSE    2    2     3      5
## 376    FALSE      FALSE    4    4     3      4
## 377    FALSE      FALSE    2    3     2      5
## 378    FALSE      FALSE    3    1     4      4
## 379    FALSE      FALSE    1    1     1      1
## 380    FALSE      FALSE    1    1     1      1
## 381     TRUE       TRUE    3    1     5      2
## 382     TRUE      FALSE    3    2     1      4
# Let's also create the confusion matrix for the prediction
table(high_use=alc$high_use, prediction=alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   243   27
##    TRUE     64   48

Based on the ten samples, we see 8 correct predictions and 2 false negatives (actual high use predicted as not high). This looks reasonable. The confusion matrix shows that the prediction missed 64 cases of high alcohol use and correctly identified 48, while 27 students were incorrectly predicted to be high users.
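
The overall training error implied by the confusion matrix can also be computed directly (this is equivalent to applying loss_func to the fitted probabilities); a minimal sketch:

# Sketch: share of incorrect predictions on the training data; for the
# confusion matrix above this is (27 + 64) / 382, i.e., about 0.24.
mean(alc$high_use != alc$prediction)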

Let’s also look at a scatterplot of the predictions vs. actual values (though this is not very useful as we only have four possible combinations of values - but a plot with position jitter will show something).

ggplot(alc, aes(x=prediction, y=high_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5))

The plot supports our earlier analysis of the confusion matrix.

(7) Cross-validation

Let’s now use 10-fold cross-validation to estimate how sensitive the model is to sampling of the data. (Loss function was already defined above)

cv <- cv.glm(data=alc, cost=loss_func, glmfit = m, K = 10)

# average proportion of wrong predictions in the cross-validation, i.e., the error
cv$delta[1]
## [1] 0.2486911

The cross validation indicates an error of 0.25 for my model. This is better than the error of the model introduced in DataCamp (0.26). (Note, however, that due to the random nature of cross-validation, and using only 10-fold cross validation as mandated by the exercise, the accuracy of these errors is not very high and the errors will randomly vary somewhat from run to run. I’ve also seen runs where it is higher than 0.26.)
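
To get a feel for this run-to-run variation, the cross-validation could simply be repeated a few times; a minimal sketch (not part of the original analysis):

# Sketch: repeat the 10-fold cross-validation five times and look at the
# spread of the error estimates.
replicate(5, cv.glm(data=alc, cost=loss_func, glmfit=m, K=10)$delta[1])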

(8) Comparing different models using cross-validation

Let’s now see how the model error behaves as the number of variables decreases. For this analysis we’ll first use all variables (except high_use, alc_use, Dalc, Walc, probability, and prediction) as explanatory variables, and then start removing variables that have low significance. Finally, we will collect the cross-validation errors and plot the change in error as a function of the number of variables removed. We’ll use 29-fold cross-validation to reduce the estimation error while keeping running times reasonable (R warns about K=30; using 29 instead reduces garbage in this report).

# These columns are always excluded
always_exclude = c("high_use", "alc_use", "Dalc", "Walc",
  "prediction", "probability")

# Dataframe where we collect the errors from each step in simplification
# (this is updated by the step() function)
results = data.frame(count=integer(), error=double(), last=character())

# Estimate a model that excludes the variables given as argument (plus the
# variables that are always excluded), performs cross-validation to estimate
# the model error, and collects the errors into ``results``.
step <- function(additional_excludes) {
  excludes <- c(always_exclude, additional_excludes)
  cols <- colnames(alc)[!colnames(alc) %in% excludes]
  formula_text <- paste("high_use ~ ", paste(cols, collapse=" + "))
  formula = as.formula(formula_text)
  # Note: no family argument is given here, so glm() fits a gaussian (linear
  # probability) model rather than a logistic regression, as can be seen from
  # the dispersion parameter in the summary output below.
  m <- glm(formula, data=alc)
  probabilities <- predict(m, type="response")
  alc <- mutate(alc, probability=probabilities)
  alc <- mutate(alc, prediction=probability >= 0.5)
  cv <- cv.glm(data=alc, cost=loss_func, glmfit=m, K=29)
  error <- cv$delta[1]  # error
  count <- length(additional_excludes)
  if (length(additional_excludes) == 0) {
    last = "<none>"
  } else {
    last <- additional_excludes[[length(additional_excludes)]]
  }
  results <<- rbind(results, data.frame(count, error, last))
  return(m)
}

# Perform the steps for the exercise.  At each step one more variable is
# excluded; the removal order below drops, at each step, the variable with
# the lowest significance in the previous model (famsup had the lowest
# significance in the full model).  Note that "Fjob" appears twice in the
# order; the duplicate is harmless and is kept so that the step counts match
# the results shown below.
removal_order <- c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu", "Fjob",
  "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
  "Pstatus", "address", "age", "guardian", "romantic", "higher", "paid",
  "nursery")
m <- step(c())  # baseline with no additional exclusions
for (i in seq_along(removal_order)) {
  m <- step(removal_order[1:i])
}
summary(m)
## 
## Call:
## glm(formula = formula, data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7470  -0.2983  -0.1224   0.2964   1.0904  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.023323   0.105566   0.221  0.82526    
## sexM         0.181264   0.041632   4.354 1.72e-05 ***
## famrel      -0.077049   0.022620  -3.406  0.00073 ***
## goout        0.135759   0.018450   7.358 1.17e-12 ***
## absences     0.011949   0.002735   4.369 1.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1632952)
## 
##     Null deviance: 79.162  on 381  degrees of freedom
## Residual deviance: 61.562  on 377  degrees of freedom
## AIC: 398.78
## 
## Number of Fisher Scoring iterations: 2
results
##    count     error       last
## 1      0 0.2460733     <none>
## 2      1 0.2329843     famsup
## 3      2 0.2434555  schoolsup
## 4      3 0.2356021 traveltime
## 5      4 0.2434555       Fjob
## 6      5 0.2460733       Fedu
## 7      6 0.2408377       Fjob
## 8      7 0.2356021    famsize
## 9      8 0.2434555         G2
## 10     9 0.2408377   failures
## 11    10 0.2277487   internet
## 12    11 0.2356021         G1
## 13    12 0.2329843         G3
## 14    13 0.2225131  studytime
## 15    14 0.2251309 activities
## 16    15 0.2303665     school
## 17    16 0.2277487     health
## 18    17 0.2120419       Mjob
## 19    18 0.2015707       Medu
## 20    19 0.2041885   freetime
## 21    20 0.2120419     reason
## 22    21 0.2094241    Pstatus
## 23    22 0.2198953    address
## 24    23 0.2172775        age
## 25    24 0.2198953   guardian
## 26    25 0.2068063   romantic
## 27    26 0.2094241     higher
## 28    27 0.2172775       paid
## 29    28 0.2198953    nursery

This leaves us with four variables (goout, absences, sex, and famrel) that are all highly statistically significant (p < 0.001). The prediction error is around 0.22. Out of curiosity, let’s see what happens if we remove more.

m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
  "Pstatus", "address", "age", "guardian", "romantic", "higher", "paid",
  "nursery", "famrel"))
summary(m)
## 
## Call:
## glm(formula = formula, data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8046  -0.3032  -0.1538   0.3365   1.0327  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.263330   0.064620  -4.075 5.61e-05 ***
## sexM         0.172743   0.042136   4.100 5.07e-05 ***
## goout        0.130710   0.018646   7.010 1.10e-11 ***
## absences     0.012497   0.002768   4.514 8.50e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1678754)
## 
##     Null deviance: 79.162  on 381  degrees of freedom
## Residual deviance: 63.457  on 378  degrees of freedom
## AIC: 408.36
## 
## Number of Fisher Scoring iterations: 2
results
##    count     error       last
## 1      0 0.2460733     <none>
## 2      1 0.2329843     famsup
## 3      2 0.2434555  schoolsup
## 4      3 0.2356021 traveltime
## 5      4 0.2434555       Fjob
## 6      5 0.2460733       Fedu
## 7      6 0.2408377       Fjob
## 8      7 0.2356021    famsize
## 9      8 0.2434555         G2
## 10     9 0.2408377   failures
## 11    10 0.2277487   internet
## 12    11 0.2356021         G1
## 13    12 0.2329843         G3
## 14    13 0.2225131  studytime
## 15    14 0.2251309 activities
## 16    15 0.2303665     school
## 17    16 0.2277487     health
## 18    17 0.2120419       Mjob
## 19    18 0.2015707       Medu
## 20    19 0.2041885   freetime
## 21    20 0.2120419     reason
## 22    21 0.2094241    Pstatus
## 23    22 0.2198953    address
## 24    23 0.2172775        age
## 25    24 0.2198953   guardian
## 26    25 0.2068063   romantic
## 27    26 0.2094241     higher
## 28    27 0.2172775       paid
## 29    28 0.2198953    nursery
## 30    29 0.2094241     famrel

It turns out the prediction error did not really change when famrel was removed (even though it was statistically highly significant). Let’s see what happens if we also remove sex, the next least significant variable.

m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
  "Pstatus", "address", "age", "guardian", "romantic", "higher", "paid",
  "nursery", "famrel", "sex"))
summary(m)
## 
## Call:
## glm(formula = formula, data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8224  -0.2859  -0.1266   0.3944   1.0561  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.191905   0.063512  -3.022  0.00269 ** 
## goout        0.135835   0.018988   7.154 4.39e-12 ***
## absences     0.011713   0.002819   4.155 4.02e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1748772)
## 
##     Null deviance: 79.162  on 381  degrees of freedom
## Residual deviance: 66.278  on 379  degrees of freedom
## AIC: 422.97
## 
## Number of Fisher Scoring iterations: 2
results
##    count     error       last
## 1      0 0.2460733     <none>
## 2      1 0.2329843     famsup
## 3      2 0.2434555  schoolsup
## 4      3 0.2356021 traveltime
## 5      4 0.2434555       Fjob
## 6      5 0.2460733       Fedu
## 7      6 0.2408377       Fjob
## 8      7 0.2356021    famsize
## 9      8 0.2434555         G2
## 10     9 0.2408377   failures
## 11    10 0.2277487   internet
## 12    11 0.2356021         G1
## 13    12 0.2329843         G3
## 14    13 0.2225131  studytime
## 15    14 0.2251309 activities
## 16    15 0.2303665     school
## 17    16 0.2277487     health
## 18    17 0.2120419       Mjob
## 19    18 0.2015707       Medu
## 20    19 0.2041885   freetime
## 21    20 0.2120419     reason
## 22    21 0.2094241    Pstatus
## 23    22 0.2198953    address
## 24    23 0.2172775        age
## 25    24 0.2198953   guardian
## 26    25 0.2068063   romantic
## 27    26 0.2094241     higher
## 28    27 0.2172775       paid
## 29    28 0.2198953    nursery
## 30    29 0.2094241     famrel
## 31    30 0.2513089        sex

This time we see a significant increase in error, to about 0.25. We thus conclude that goout, absences, and sex form the best combination we can reach using this method. It is, however, possible that we are at a local optimum and that different choices earlier on would have led to a better final result, though I think that is unlikely.

Let’s finish with a plot of the error as a function of the number of variables removed.

data = select(results, count, error)
plot(data$count, data$error,
  xlab="variables removed", ylab="error")
title("Prediction error vs. number of variables removed")

Chapter 4: Clustering and classification

Analysis

Let’s start by loading the libraries we’ll use.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(ggplot2)
library(GGally)
library(corrplot)
## corrplot 0.84 loaded
library(dplyr)

(2) Loading the Boston dataset from the MASS library

In this exercise we are analyzing the “Boston” dataset from the R MASS library. This example dataset is further described in D. Harrison and D.L. Rubinfeld: Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5:81-102, 1978. The data set relates to the housing values in the suburbs of Boston and several variables that might predict housing prices.

data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

There are a total of 506 observations for 14 variables. All variables have numerical values.

(3) Graphical overview of the data

ggpairs(Boston, axisLabels="show")

We can see that the distributions of the variables vary quite a bit. Some have a neat near-normal distribution, while others are highly skewed and far from normally distributed.

Some variables are neatly linearly correlated, while others show non-linear correlation (e.g., having a “bump” in the middle in the 2-D plot).

We can see from the pairwise plots of the variables that there are significant correlations between many of the variables. In particular, if we look at the median house value (medv, bottom row / rightmost column), we can see that it might have correlations with all or most of the other variables, though in some cases the correlation might not be linear (e.g., age, dis, zn).

Let’s also plot the correlations between the variables to understand the strength of the correlations between each pair of variables:

cor_matrix <- cor(Boston)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)

We can see that medv has the highest positive or negative correlations with, e.g., rm (average number of rooms per dwelling) and lstat (percentage of population with lower status), but many other correlations are also significant.
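
The same information can be read numerically from the correlation matrix computed above; a minimal sketch that orders the correlations of medv with the other variables by absolute value:

# Sketch: correlations with medv, strongest first (medv itself, with
# correlation 1, appears at the top of the list).
sort(abs(cor_matrix[, "medv"]), decreasing=TRUE)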

(4) Standardizing the dataset, creating a categorical crime rate variable, and dividing into train and test sets

Let’s then standardize the data set to zero mean and unit standard deviation.

# Standardize the full data set
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# Create a new categorical variable with values low, med_low, med_high, high
# from the quantiles of the crime rate, replacing variable ``crim`` by ``crime``
bins = quantile(boston_scaled$crim)
labels = c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim,
             breaks=bins,
             labels=labels,
             include.lowest=TRUE)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
str(boston_scaled)
## 'data.frame':    506 obs. of  14 variables:
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
# Divide the dataset into train and test sets, so that 80% of the data belongs
# to the train set.
cnt <- nrow(boston_scaled)
ind <- sample(cnt, size=cnt * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
str(train)
## 'data.frame':    404 obs. of  14 variables:
##  $ zn     : num  -0.487 0.413 3.372 1.442 3.372 ...
##  $ indus  : num  -0.867 -0.801 -1.077 -1.122 -1.447 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 3.665 ...
##  $ nox    : num  -0.343 -0.998 -1.387 -1.016 -1.326 ...
##  $ rm     : num  -0.426 -0.458 1.664 0.386 2.332 ...
##  $ age    : num  -0.823 -0.813 -1.221 -1.402 -1.555 ...
##  $ dis    : num  0.483 1.434 1.207 0.366 0.993 ...
##  $ rad    : num  -0.522 -0.637 -0.752 -0.522 -0.982 ...
##  $ tax    : num  -1.0932 -0.9804 -0.9745 -0.0607 -1.2474 ...
##  $ ptratio: num  0.806 -0.765 -1.18 -1.504 -2.243 ...
##  $ black  : num  0.441 0.426 0.325 0.287 0.426 ...
##  $ lstat  : num  -0.408 0.112 -1.336 -1.133 -1.329 ...
##  $ medv   : num  -0.384 -0.308 2.334 0.79 2.987 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 2 1 2 1 4 2 3 3 2 ...
str(test)
## 'data.frame':    102 obs. of  14 variables:
##  $ zn     : num  -0.4872 -0.4872 0.0487 0.0487 0.0487 ...
##  $ indus  : num  -1.306 -1.306 -0.476 -0.476 -0.476 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.834 -0.834 -0.265 -0.265 -0.265 ...
##  $ rm     : num  1.227 0.207 -0.388 -0.93 -0.392 ...
##  $ age    : num  -0.5107 -0.3508 -0.0702 1.1164 0.5089 ...
##  $ dis    : num  1.077 1.077 0.838 1.086 1.155 ...
##  $ rad    : num  -0.752 -0.752 -0.522 -0.522 -0.522 ...
##  $ tax    : num  -1.105 -1.105 -0.577 -0.577 -0.577 ...
##  $ ptratio: num  0.113 0.113 -1.504 -1.504 -1.504 ...
##  $ black  : num  0.441 0.41 0.426 0.328 0.441 ...
##  $ lstat  : num  -1.0255 -1.0423 -0.0312 2.4194 0.0864 ...
##  $ medv   : num  1.486 0.6706 0.0399 -0.6559 -0.395 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 1 2 2 2 3 3 3 3 3 ...

(5) Fit linear discriminant analysis on the train set

# Linear discriminant analysis for ``crime`` using all other variables as
# predictor variables
model = lda(crime ~ ., data=train)
model
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2475248 0.2599010 0.2574257 0.2351485 
## 
## Group means:
##                  zn      indus        chas        nox         rm        age
## low       1.0286810 -0.9198913 -0.07547406 -0.9051447  0.4484172 -0.9061433
## med_low  -0.1115548 -0.3447577 -0.08484810 -0.5951334 -0.1382398 -0.3801100
## med_high -0.3882929  0.1758222  0.21980846  0.4567743  0.1650563  0.4641226
## high     -0.4872402  1.0172655 -0.06511327  1.1132401 -0.3407451  0.8465461
##                 dis        rad        tax     ptratio      black       lstat
## low       0.9181227 -0.6717851 -0.7563150 -0.49263761  0.3732627 -0.76149228
## med_low   0.3611591 -0.5465475 -0.4984575 -0.03884856  0.3105997 -0.14403840
## med_high -0.4344407 -0.4087420 -0.2972790 -0.30012966  0.1232377  0.01418048
## high     -0.8703655  1.6366336  1.5129868  0.77903654 -0.9213342  0.90291602
##                  medv
## low       0.567808478
## med_low   0.009791214
## med_high  0.198210533
## high     -0.731713849
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.11328573  0.63746659 -1.04541122
## indus    0.01130410 -0.11873888  0.16735248
## chas    -0.08889385 -0.06741197 -0.03742423
## nox      0.36836931 -0.78845730 -1.20696117
## rm      -0.10761628 -0.13318267 -0.20557784
## age      0.30736843 -0.35574127 -0.15091626
## dis     -0.07701285 -0.17522081  0.17895494
## rad      3.13547160  1.23834694 -0.21189112
## tax      0.04376011 -0.31959341  0.70311967
## ptratio  0.12054018 -0.04752169 -0.27128886
## black   -0.13840297 -0.02543248  0.08000891
## lstat    0.22073559 -0.13062063  0.37818243
## medv     0.20397111 -0.32935615 -0.08350609
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9395 0.0451 0.0154
classes = as.numeric(train$crime)
plot(model, dimen=2, col=classes, pch=classes)

From the plot we can see that high is relatively easily separable (except from a few med_high instances), whereas there is more difficulty with the other classes when mapped to two-dimensional space. However, they could still be easily separable in the higher-dimensional original space.

(6) Test LDA prediction

# Save the crime categories from the test set and remove the categorical
# crime variable from the test set
correct_classes = test$crime
test <- dplyr::select(test, -crime)
str(correct_classes)
##  Factor w/ 4 levels "low","med_low",..: 1 1 2 2 2 3 3 3 3 3 ...
str(test)
## 'data.frame':    102 obs. of  13 variables:
##  $ zn     : num  -0.4872 -0.4872 0.0487 0.0487 0.0487 ...
##  $ indus  : num  -1.306 -1.306 -0.476 -0.476 -0.476 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.834 -0.834 -0.265 -0.265 -0.265 ...
##  $ rm     : num  1.227 0.207 -0.388 -0.93 -0.392 ...
##  $ age    : num  -0.5107 -0.3508 -0.0702 1.1164 0.5089 ...
##  $ dis    : num  1.077 1.077 0.838 1.086 1.155 ...
##  $ rad    : num  -0.752 -0.752 -0.522 -0.522 -0.522 ...
##  $ tax    : num  -1.105 -1.105 -0.577 -0.577 -0.577 ...
##  $ ptratio: num  0.113 0.113 -1.504 -1.504 -1.504 ...
##  $ black  : num  0.441 0.41 0.426 0.328 0.441 ...
##  $ lstat  : num  -1.0255 -1.0423 -0.0312 2.4194 0.0864 ...
##  $ medv   : num  1.486 0.6706 0.0399 -0.6559 -0.395 ...
# Predict the classes with the LDA model on the test data
predictions = predict(model, newdata=test)

# Cross-tabulate the correct classes vs. predicted classes.
table(correct=correct_classes, predicted=predictions$class)
##           predicted
## correct    low med_low med_high high
##   low       13      12        2    0
##   med_low    4      10        7    0
##   med_high   0       7       14    1
##   high       0       0        0   32

The tabulation shows that a majority of all classifications were correct (i.e., they fall on the diagonal). However, many observations were also classified into neighboring classes. No observation was classified more than two steps away from its correct class.
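
As a quick numeric summary (my own addition, not required by the exercise), the overall accuracy is the share of predictions on the diagonal:

# Overall accuracy from the confusion matrix: correct predictions divided by
# all predictions (69/102, roughly 0.68, for the table above).
conf_matrix <- table(correct=correct_classes, predicted=predictions$class)
sum(diag(conf_matrix)) / sum(conf_matrix)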

(7) K-means clustering

# Reload the original Boston dataset
data("Boston")

# Scale the variables so that all are normalized (zero mean and unit standard
# deviation).
boston_scaled <- as.data.frame(scale(Boston))

# Calculate distances between the (scaled) observations.  We use the
# Euclidean distance here (i.e., L2-norm).  Are we supposed to use these
# distances somehow?  The instructions do not say so.
dist_eu <- dist(boston_scaled, method="euclidean")
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# Run k-means on the dataset.
km <- kmeans(boston_scaled, centers=4)

# Visualize the clustering for a few variable pairs as a sample
pairs(boston_scaled[6:10], col=km$cluster)

# Investigate the optimal number of clusters.  The "optimal" value is said to
# be where the WCSS drops radically.  (In real applications with overlapping
# clusters it is often not quite this simple.)
k_max = 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# Plot the WCSS as a function of the number of clusters.
qplot(x=1:k_max, y=twcss, geom="line")

We can see from the plot that the WCSS drops significantly between 1 and 2 clusters (though there is some additional decrease with more clusters). Based on this, we pick 2 as the optimal number of clusters and run k-means again with that number. It would not be unreasonable to choose a higher number of clusters, but there are no additional dramatic drops at higher cluster counts.
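
Since k-means starts from random initial centers, the WCSS curve can vary slightly between runs. Here is a small optional sketch (my own addition) that stabilizes the curve with a fixed seed and several random starts:

# Recompute the WCSS curve with a fixed seed and multiple random starts so
# that the elbow is not an artifact of one unlucky initialization.
set.seed(123)
twcss_stable <- sapply(1:k_max, function(k){
  kmeans(boston_scaled, centers=k, nstart=20)$tot.withinss
})
qplot(x=1:k_max, y=twcss_stable, geom="line")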

# Run k-means again with the "optimal" number of clusters
km <- kmeans(boston_scaled, centers=2)

# Visualize the clusters (the "optimal" number of them)
pairs(boston_scaled, col=km$cluster)

# Given that visualizing all pairs is near unreadable, visualize for a small
# sample of variables.  We could use select() to pick a list of named
# variables instead, but instructions don't require it.
pairs(boston_scaled[6:10], col=km$cluster)

Looking at the visualized variables, we can see that some pairs of variables separate the clusters pretty well. For example, tax, rad, and dis all seem to offer relatively good separation.

Bonus exercise: Clustering and visualization with arrows

# Reload the original Boston dataset
data("Boston")

# Scale the variables so that all are normalized (zero mean and unit standard
# deviation).
boston_scaled <- as.data.frame(scale(Boston))

# Cluster into 3 clusters using k-means
km <- kmeans(boston_scaled, centers=3)

# Perform LDA using the clusters as target classes
lda.fit <- lda(x=boston_scaled, grouping=km$cluster)

# print the lda.fit object
lda.fit
## Call:
## lda(boston_scaled, grouping = km$cluster)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.2470356 0.3260870 0.4268775 
## 
## Group means:
##         crim         zn      indus         chas        nox         rm
## 1 -0.3989700  1.2614609 -0.9791535 -0.020354653 -0.8573235  1.0090468
## 2  0.7982270 -0.4872402  1.1186734  0.014005495  1.1351215 -0.4596725
## 3 -0.3788713 -0.3578148 -0.2879024  0.001080671 -0.3709704 -0.2328004
##           age        dis        rad        tax     ptratio      black
## 1 -0.96130713  0.9497716 -0.5867985 -0.6709807 -0.80239137  0.3552363
## 2  0.79930921 -0.8549214  1.2113527  1.2873657  0.59162230 -0.6363367
## 3 -0.05427143  0.1034286 -0.5857564 -0.5951053  0.01241316  0.2805140
##        lstat        medv
## 1 -0.9571271  1.06668290
## 2  0.8622388 -0.67953738
## 3 -0.1047617 -0.09820229
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## crim    -0.03206338 -0.19094456
## zn       0.02935900 -1.07677218
## indus    0.63347352 -0.09917524
## chas     0.02460719  0.10009606
## nox      1.11749317 -0.75995105
## rm      -0.18841682 -0.57360135
## age     -0.12983139  0.47226685
## dis      0.04493809 -0.34585958
## rad      0.67004295 -0.08584353
## tax      1.03992455 -0.58075025
## ptratio  0.25864960 -0.02605279
## black   -0.01657236  0.01975686
## lstat    0.17365575 -0.41704235
## medv    -0.06819126 -0.79098605
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8506 0.1494
# Visualize the results using a biplot with arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col="black", pos=3)
}

# Plot the LDA results using two dimensions, with arrows
plot(lda.fit, dimen = 2, col=km$cluster, pch=km$cluster)
lda.arrows(lda.fit, myscale = 1)

# Plot again with much bigger scale so that we can read the labels on the
# smaller arrows
plot(lda.fit, dimen = 2, col=km$cluster, pch=km$cluster)
lda.arrows(lda.fit, myscale = 5)

# Plot again with even bigger scale so that we can read the labels on the
# smaller arrows
plot(lda.fit, dimen = 2, col=km$cluster, pch=km$cluster)
lda.arrows(lda.fit, myscale = 15)

The length of each arrow reflects how the corresponding variable (dimension) is oriented relative to the plane chosen for visualizing the clusters: the longer the arrow, the more nearly parallel the variable is to that plane; the shorter the arrow, the more nearly perpendicular it is.

Judging from the arrows and from the LD1 coefficients printed above, the plane that R chose for visualization (which seems to separate the clusters reasonably well) is most nearly parallel to nox (nitrogen oxides concentration), followed by tax (full-value property-tax rate) and rad (index of accessibility to radial highways); age (proportion of owner-occupied units built prior to 1940) plays a smaller role.
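
As a sanity check of this reading (my own addition, not required by the exercise), the arrow lengths in the LD1/LD2 plane can be computed directly from the scaling matrix of the fitted LDA:

# The arrow for each variable has coordinates given by its coefficients on
# LD1 and LD2, so its length is the Euclidean norm of that coefficient pair.
arrow_len <- sqrt(rowSums(lda.fit$scaling[, 1:2]^2))
sort(arrow_len, decreasing=TRUE)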

Super-Bonus: 3D plots

# Save gold standard classes from the training data
train_correct <- train$crime

# Recompute the LDA model from the training data
model_predictors <- dplyr::select(train, -crime)
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 14  2
model <- lda(crime ~ ., data=train)

# Map the datapoints into the visualization space
matrix_product <- as.matrix(model_predictors) %*% model$scaling
matrix_product <- as.data.frame(matrix_product)

# Create a 3D plot with color indicating the gold standard crime
# classes in the train set
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(matrix_product, x=~LD1, y=~LD2, z=~LD3,
        type="scatter3d", mode="markers",
        color=train_correct, marker=list(size=2))
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
# Create a 3D plot with color indicating the predicted crime classes.  Since
# the gold data used four classes, we'll predict four classes.
train_no_crime = dplyr::select(train, -crime)
km <- kmeans(train_no_crime, centers=4)
plot_ly(matrix_product, x=~LD1, y=~LD2, z=~LD3,
        type="scatter3d", mode="markers",
        color=km$cluster, marker=list(size=2))

Both plots contain the same training points, so their locations in the visualization space are identical. We can see that the green, orange, and blue classes in the original data overlap quite a bit and are not well separated. The clustering, however, produces much sharper class boundaries. This is as expected, because the clustering assigns each point to its nearest cluster center. While there are several dimensions not shown in these plots, one could expect those dimensions to partially reflect the distance from the cluster centers, which is consistent with what we see in the plot of the predicted classes. Apart from the sharper class boundaries and the different color scheme, the plots are fairly similar.
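
To put a number on the visual impression (a small check of my own), we can cross-tabulate the gold-standard crime classes against the k-means clusters used for the second plot:

# Cross-tabulate gold standard classes vs. k-means clusters.  Cluster numbers
# are arbitrary labels, so only the pattern of the table matters: heavy mixing
# across the middle classes would match the overlap seen in the 3D plots.
table(correct=train_correct, cluster=km$cluster)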

Data wrangling

The data wrangling script that was part of Exercise 4 can be found in my github repository in the data directory, here.


Chapter 5: Dimensionality reduction techniques

Let’s start by loading some libraries we might use.

library(MASS)
library(ggplot2)
library(GGally)
library(corrplot)
library(dplyr)
# Set default figure size
knitr::opts_chunk$set(fig.width=12, fig.height=8)

Data wrangling

The data wrangling exercise is performed by the data/create_human.R script in my github repository, as specified by the instructions. It can be found here. The first part of the script was prepared as part of Exercise 4; the new code for this Exercise 5 is towards the end of the file.

Analysis

Let’s start the analysis by loading the dataset we created as part of the data wrangling exercise. It should have 155 observations and 8 variables, with country as the row name.

human <- read.csv("data/human5.csv", row.name=1)

(1) Show graphical overview and summaries of the variables

# Plot an overview of the relationships and distribution of the variables
ggpairs(human)

# Plot correlations of the variables
corrplot(cor(human))

# Look at the variables numerically
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

We can see that some variables are highly correlated, for example Ado.Birth and Mat.Mor. Some variables are not significantly correlated (e.g., Edu2.FM and Labo.FM). Some variables show strong negative correlation (e.g., Mat.Mor and Edu2.FM).

Looking at the distributions of the variables, GNI has a very large range (581 to 123124), while the other variables have much smaller ranges, e.g., Edu2.FM has a range from 0.1717 to 1.4967.
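
The scale differences can be made explicit with a quick check (my own addition) of each variable’s standard deviation:

# Standard deviation of each (non-standardized) variable; GNI is several
# orders of magnitude larger than the rest, which matters for the PCA below.
sapply(human, sd)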

(2) Perform PCA on the original (non-standardized) data

# Perform PCA on the original (non-standardized) data
pca1 <- prcomp(human)
summary(pca1)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000

We can see that the first principal component (PC1) alone captures 99.99% of the variance. Let’s plot the observations using PC1 and PC2.

# Draw a biplot of the PCA representation (first two components) with arrows
biplot(pca1, choices=1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

It is easy to see from the plot that PC1 is essentially the same as GNI. This makes sense, as GNI has a much larger range and variance than the other variables (we did not explicitly compute the variances, but the quartiles in the summary of the original human data tell the story).
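
As a quick numeric check (my own addition), the correlation between the PC1 scores and GNI should be very close to +/-1 if PC1 really is essentially just GNI:

# Correlation between the first principal component scores and GNI in the
# original data.
cor(pca1$x[, 1], human$GNI)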

(3) Standardize the variables and repeat the above analysis

# Normalize the mean and variance of each variable
human_std <- scale(human)

# Perform PCA on the normalized data
pca2 <- prcomp(human_std)
summary(pca2)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000

We can see that PC1 now explains 53.6% of the variance, PC2 16.2%, PC3 9.6%, and so on. This is much more sensible. Let’s look at the plots. This time we also include custom axis labels that show which component each axis represents and what percentage of the variance it covers.

# Draw a biplot of the normalized PCA (first two components) with arrows and
# useful axis labels
s = summary(pca2)
pca_pr <- round(100 * s$importance[2,], digits=1)
pca_lab <- paste0("Human data ", names(pca_pr), " (", pca_pr, "%)")
biplot(pca2, choices=1:2, xlab=pca_lab[1], ylab=pca_lab[2])

We can see that the result is very different from the non-standardized case. This is because PCA identifies the directions of highest variance. When one variable has a much higher variance than the others, the direction of highest overall variance is closely aligned with that variable, and PC1 ends up essentially reproducing it. Such a plot tells us little about the overall relationships of the variables, since the scale of a variable does not necessarily reflect its importance in any way. If the variables are standardized, each variable carries equal weight when the principal components are determined, which usually gives much more useful results.
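
As a side note (a sketch of my own, not part of the exercise), the same standardization can be requested directly from prcomp, which should reproduce the result above:

# prcomp can center and scale the variables itself; this should be
# equivalent to running prcomp on the pre-scaled data.
pca2_alt <- prcomp(human, center=TRUE, scale.=TRUE)
summary(pca2_alt)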

(4) Give your personal interpretations of the first two PCA components

We can see that PC1 and PC2 are not directly aligned with any single variable, even though PC2 correlates strongly with Parli.F and Labo.FM and PC1 correlates strongly with the other variables.

I think we can interpret PC1 as describing material and technological wellbeing. Mat.Mor, Ado.Birth, and Life.Exp reflect the quality of health care, which probably correlates with general technical and scientific capability. Edu.Exp and Edu2.FM indicate the quality and availability of education, and it makes sense that these correlate with medical wellbeing, health, and life expectancy. GNI is also aligned with the same axis; it describes economic wealth, which usually correlates strongly with education and health: higher income makes it possible to provide health care and education, and in turn education and health care help achieve higher income through the ability to produce and sell technology, medicines, and other goods.

On the other hand, PC2 is correlated with Parli.F and Labo.FM, which relate to the role of women in society and particularly to the equality of women. There are countries that are wealthy and have good education and health care, but where women don’t have a big role in public life (politics and the workplace). I think some resource-rich countries (e.g., oil producers) may fall into this category, and it could also have described many Western countries quite well just a few decades ago. While female education seems to be strongly correlated with economic and medical well-being, in this data female participation in politics and the labor force does not necessarily appear to be.
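
This interpretation can be checked against the loadings of the standardized PCA (a small addition of my own):

# Loadings (rotation matrix) of the first two principal components.  Parli.F
# and Labo.FM should have their largest weights on PC2, while the remaining
# variables should load mainly on PC1.
round(pca2$rotation[, 1:2], 2)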

(5) Multiple correspondence analysis using the tea dataset

Let’s first load the tea dataset from FactoMineR, look at its structure, and visualize the dataset.

# Load the tea dataset from FactoMineR
library(FactoMineR)
data(tea)
# Look at its structure
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# Visualize the dataset.  There are too many variables to visualize it using
# ggpairs reasonably.  Let's instead visualize it with a barplot.
library(miscset)
## 
## Attaching package: 'miscset'
## The following object is masked from 'package:dplyr':
## 
##     collapse
ggplotGrid(ncol=4,
        lapply(colnames(tea),
          function(col) {
            ggplot(tea, aes_string(col)) + geom_bar()
          }))

Let’s then do Multiple Correspondence Analysis on the data. I’m going to only calculate it for the Tea, How, how, sugar, where, and lunch columns, as trying all columns does not seem to work out-of-the-box.

# Extract a subset of the data with only the selected columns
keep <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, one_of(keep))
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# Compute Multiple Correspondence Analysis of the tea data
mca1 <- MCA(tea_time, graph=FALSE)
# Print a summary of the model
summary(mca1)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |

Looking at the summary, we can see that no dimension explains more than 15.3% of the variance, and that the 7th dimension still explains 7.8%. Thus there does not appear to be a simple way to map the observations to a very low-dimensional space cleanly.
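
A scree plot of the MCA eigenvalues (my own addition) makes the same point graphically:

# Percentage of variance explained by each MCA dimension, taken from the
# eigenvalue table of the MCA result.
barplot(mca1$eig[, "percentage of variance"],
        names.arg=rownames(mca1$eig), las=2,
        ylab="% of variance")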

Let’s then draw a biplot of the variables, including the individuals in the same plot.

# Visualize the model using a biplot from the factoextra package
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_mca_biplot(mca1, repel=TRUE, ggtheme=theme_minimal())

We can see that the first two dimensions together account for roughly 30% of the variance. Quite a few individuals are mapped to exactly the same locations; they most likely have identical values for the variables we selected.
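
The overlap of individuals is easy to verify (my own check): individuals with identical answers to the six selected questions necessarily get identical coordinates.

# Count how many rows of the six-variable subset are exact duplicates of an
# earlier row; each such duplicate is plotted on top of another individual.
sum(duplicated(tea_time))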


Chapter 6: Analysis of longitudinal data

Data wrangling

The data wrangling for Exercise 6 is performed by the data/meet_and_repeat.R script in my github repository, as specified in the instructions. It can be found here.

Analysis

Let’s start by loading some libraries we might use.

library(ggplot2)
library(GGally)
library(corrplot)
library(dplyr)
library(tidyr)
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car 
##   influence.merMod                car 
##   dfbeta.influence.merMod         car 
##   dfbetas.influence.merMod        car
# Set default figure size
knitr::opts_chunk$set(fig.width=12, fig.height=8)

Let’s load the data sets created by the data wrangling script (we’ll do the factor conversion later).

RATS <- read.csv("data/meet_RATS.csv", row.name=1)
RATSL <- read.csv("data/meet_RATSL.csv", row.name=1)
BPRS <- read.csv("data/meet_BPRS.csv", row.name=1)
BPRSL <- read.csv("data/meet_BPRSL.csv", row.name=1)

(1) Implement the analyses of Chapter 8 of MABS using the RATS data

Ok, so it looks like basically we need to perform the analyses we did on Datacamp, but with the datasets swapped.

# First, convert ID and Group fields to factors
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)

# Let's look at the data
names(RATSL)
## [1] "ID"     "Group"  "WD"     "Weight" "Time"
str(RATSL)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...
summary(RATSL)
##        ID      Group        WD         Weight           Time      
##  1      : 11   1:88   WD1    :16   Min.   :225.0   Min.   : 1.00  
##  2      : 11   2:44   WD15   :16   1st Qu.:267.0   1st Qu.:15.00  
##  3      : 11   3:44   WD22   :16   Median :344.5   Median :36.00  
##  4      : 11          WD29   :16   Mean   :384.5   Mean   :33.55  
##  5      : 11          WD36   :16   3rd Qu.:511.2   3rd Qu.:50.00  
##  6      : 11          WD43   :16   Max.   :628.0   Max.   :64.00  
##  (Other):110          (Other):80
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3…
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1,…
## $ WD     <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD…
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 47…
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8,…
# Display the data graphically
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") +
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))

We can see that the rats in Group 1 had consistently lower weights than those in Groups 2 and 3. However, there were only eight rats in Group 1, four in Group 2, and four in Group 3. These numbers are so low that any statistical analysis based on them is fraught with uncertainty. That said, we are tasked with analyzing this data.

We can also see that in each group the rats are gaining weight as time passes.
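
The group sizes mentioned above are easy to confirm (my own check):

# Number of rats in each group (one row per rat in the wide-format data).
table(RATS$Group)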

Let’s see how the picture changes if we standardize the weights at each timepoint (keeping in mind that the number of observations is too low to do this with any accuracy).

# Standardize Weight
RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate(stdWeight = (Weight - mean(Weight)) / sd(Weight)) %>%
  ungroup()
str(RATSL)
## tibble [176 × 6] (S3: tbl_df/tbl/data.frame)
##  $ ID       : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group    : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD       : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight   : int [1:176] 240 225 245 260 255 260 275 245 410 405 ...
##  $ Time     : int [1:176] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stdWeight: num [1:176] -1.001 -1.12 -0.961 -0.842 -0.882 ...
# Plot again with standardized Weight
ggplot(RATSL, aes(x = Time, y = stdWeight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized Weight")

We can now see clearly that some rats are growing faster than others. Where a line goes down, the rat is gaining weight more slowly than the group mean at that timepoint, or may even be losing weight.

Let’s now plot the group means and their error bars. We should note that it is impossible to estimate the variance with any accuracy from just four observations, so the magnitudes of the error bars are only very rough estimates and should not be given much weight.

# Add mean and standard error computations.  Note that the n in standard error
# computation should be the number of samples used to compute that mean,
# not the total number of samples in the whole dataset for all Group/Time
# combinations (the Datacamp exercise is in error).
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), se = sd(Weight) / sqrt(length(Weight))) %>%
  ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
# Note: In view of the warning about regrouping above, verify that groups
# are correct inside summarize
dim(RATSL)
## [1] 176   6
X <- RATSL %>% group_by(Group, Time) %>% summarize(cnt=length(Weight)) %>% ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
X$cnt
##  [1] 8 8 8 8 8 8 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
# Yes, this matches what we see in the earlier plots - 8, 4, 4 rats in each
# group, for each timepoint.  So looks like summarize() worked correctly
# despite the weird warning.

# Let's glimpse at the data with timepoint means and standard errors.
# We leave aside the question whether standard error is reliable with
# just 4 or 8 observations in each grouping...  (it ISN'T very
# reliable with so few observations)
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36,…
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, …
## $ se    <dbl> 5.381640, 4.629100, 4.057346, 4.808614, 3.909409, 4.166190, 3.8…
# Plot the mean profiles for each Group, with error bars
ggplot(RATSS,
  aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8, 0.8)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

Let’s then check for outliers using a boxplot. (We should note, however, that the low number of observations makes any outlier detection very unreliable.)

# Quick check for outliers using a boxplot
RATSB <- RATSL %>%
  filter(Time > 0) %>%
  group_by(Group, ID) %>%
  summarise(mean=mean(Weight) ) %>%
  ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
glimpse(RATSB)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274…
ggplot(RATSB, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), weeks 1-8")
## Warning: `fun.y` is deprecated. Use `fun` instead.

We can see that there is a possible outlier in Group 2. However, given that there are only four rats in this group (which is already too little data), any variance estimate (and thus standard error and the quartiles) is highly suspect, and thus we cannot really justify removing any individuals from the data. They might not be outliers at all if we had a bit more data, and being already data-deprived, we would be even more so with only three rats.
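
For reference (my own addition), the candidate outlier can be identified from the per-rat means without removing it:

# List the Group 2 rats ordered by mean weight; the rat whose mean is far
# from the others is the candidate outlier flagged by the boxplot.
RATSB %>%
  filter(Group == "2") %>%
  arrange(desc(mean))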

Let’s then see if there is a statistically significant difference between the groups. We’ll use the dataset we created with the mean weight for each rat, but for the reasons above we did not do outlier elimination. There is a complication here, as the t-test only works between exactly two groups. For the purposes of this analysis, we’ll lump Groups 2 and 3 together and compare them against Group 1. Let’s add an auxiliary field for this.

# Create a helper variable indicating whether the observation is in Group 1
RATSB$isGroup1 <- RATSB$Group == "1"

# Perform a t-test between being in group 1 and not being in group 1
t.test(mean ~ isGroup1, data=RATSB, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by isGroup1
## t = 12.518, df = 14, p-value = 5.427e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  200.1507 282.9175
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            505.2500            263.7159
# Add a baseline from Time 1
RATSB$baseline <- RATS$WD1

# Fit a linear model with the mean as the response
fit <- lm(mean ~ baseline + isGroup1, data=RATSB)
fit
## 
## Call:
## lm(formula = mean ~ baseline + isGroup1, data = RATSB)
## 
## Coefficients:
##  (Intercept)      baseline  isGroup1TRUE  
##      78.6222        0.8865      -37.0852
# Compute an analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value    Pr(>F)    
## baseline   1 252125  252125 2173.0397 7.423e-16 ***
## isGroup1   1    570     570    4.9159   0.04505 *  
## Residuals 13   1508     116                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that the data suggests the baseline (WD1) is a very significant predictor of the mean weight, and that being in Group 1 is significant at the 95% level. However, even ignoring the unsuitability of the hypothesis testing method for our case (the observations are not independent), we had so few data points that the means are inaccurate, and the significance estimate was based on these inaccurately computed means. This adds uncertainty beyond what the hypothesis test accounts for, so we probably should not consider being in Group 1 significant given the baseline weight.

(2) Implement the analyses of Chapter 9 of MABS using the BPRS data

Let’s then look at analyzing the BPRS (brief psychiatric rating scale) data. We first convert certain fields to factors, look at the data, and plot the data.

In analyzing this data, we must first note that the subject field does not uniquely identify a person. Instead, the data is structured so that the subject field only identifies a subject within each treatment arm (logically, the two treatments being compared cannot be administered to the same person simultaneously in this kind of experiment). Also, the Datacamp exercises say the dataset contains data for 40 male subjects, but there are only 20 distinct values in the subject field. For this reason, I added a uniqueSubject field in the data wrangling phase to uniquely identify each person, as such unique identification is needed for mixed effects modeling with a random factor. Our analysis will use this field instead of subject to identify the person, because subject does not uniquely identify the individual!
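
A quick check (my own addition) confirms the difference between the two identifiers:

# subject only has 20 distinct values (one per subject within a treatment),
# whereas uniqueSubject has 40, one per actual person.
length(unique(BPRSL$subject))
length(unique(BPRSL$uniqueSubject))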

# First, convert the treatment and subject fields to factors
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)

# This field was added in data wrangling to uniquely identify each subject,
# as discussed above and in the comments in the data wrangling script.
BPRSL$uniqueSubject <- factor(BPRSL$uniqueSubject)

# Let's look at the data
names(BPRSL)
## [1] "treatment"     "subject"       "weeks"         "bprs"         
## [5] "week"          "uniqueSubject"
str(BPRSL)
## 'data.frame':    360 obs. of  6 variables:
##  $ treatment    : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject      : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks        : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ bprs         : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ uniqueSubject: Factor w/ 40 levels "1-1","1-2","10-1",..: 1 23 27 29 31 33 35 37 39 3 ...
summary(BPRSL)
##  treatment    subject        weeks          bprs            week  
##  1:180     1      : 18   week0  : 40   Min.   :18.00   Min.   :0  
##  2:180     2      : 18   week1  : 40   1st Qu.:27.00   1st Qu.:2  
##            3      : 18   week2  : 40   Median :35.00   Median :4  
##            4      : 18   week3  : 40   Mean   :37.66   Mean   :4  
##            5      : 18   week4  : 40   3rd Qu.:43.00   3rd Qu.:6  
##            6      : 18   week5  : 40   Max.   :95.00   Max.   :8  
##            (Other):252   (Other):120                              
##  uniqueSubject
##  1-1    :  9  
##  1-2    :  9  
##  10-1   :  9  
##  10-2   :  9  
##  11-1   :  9  
##  11-2   :  9  
##  (Other):306
glimpse(BPRSL)
## Rows: 360
## Columns: 6
## $ treatment     <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ subject       <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ weeks         <fct> week0, week0, week0, week0, week0, week0, week0, week0,…
## $ bprs          <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38,…
## $ week          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ uniqueSubject <fct> 1-1, 2-1, 3-1, 4-1, 5-1, 6-1, 7-1, 8-1, 9-1, 10-1, 11-1…
ggplot(BPRSL, aes(x=week, y=bprs, linetype=uniqueSubject)) +
  geom_line() +
  scale_linetype_manual(values=rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller=label_both) +
  scale_y_continuous(name="BPRS", limits=c(min(BPRSL$bprs), max(BPRSL$bprs))) +
  theme(legend.position="top")

Visually, it looks like there is a downward trend with treatment 1. The situation looks less clear for treatment 2.

Let’s then try a linear regression.

# Create a linear regression model
BPRS_reg <- lm(bprs ~ week + treatment, data=BPRSL)
# Print a summary of the model
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

In this model, it looks like the improvement with time is statistically significant, but this model does not show statistical significance for the treatment regime. However, we should note that the observations are not independent as we have repeated observations from the same individuals, so the assumptions of the significance test were violated and it cannot really be relied upon here.

Let’s then try to construct a mixed model for the same variables, using uniqueSubject as a random effect.

# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | uniqueSubject), data=BPRSL,
                 REML=FALSE)
# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | uniqueSubject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2582.9   2602.3  -1286.5   2572.9      355 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.27506 -0.59909 -0.06104  0.44226  3.15835 
## 
## Random effects:
##  Groups        Name        Variance Std.Dev.
##  uniqueSubject (Intercept) 97.39    9.869   
##  Residual                  54.23    7.364   
## Number of obs: 360, groups:  uniqueSubject, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.3521  19.750
## week         -2.2704     0.1503 -15.104
## treatment2    0.5722     3.2159   0.178
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.256       
## treatment2 -0.684  0.000

Let’s then fit a random intercept and random slope model, where both the intercept and the effect of week are allowed to vary between individuals (uniqueSubject).

# Create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | uniqueSubject),
                  data=BPRSL, REML=FALSE)

# Print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | uniqueSubject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2523.2   2550.4  -1254.6   2509.2      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4655 -0.5150 -0.0920  0.4347  3.7353 
## 
## Random effects:
##  Groups        Name        Variance Std.Dev. Corr 
##  uniqueSubject (Intercept) 167.827  12.955        
##                week          2.331   1.527   -0.67
##  Residual                   36.747   6.062        
## Number of obs: 360, groups:  uniqueSubject, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  45.9830     2.6470  17.372
## week         -2.2704     0.2713  -8.370
## treatment2    1.5139     3.1392   0.482
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.545       
## treatment2 -0.593  0.000
# Perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | uniqueSubject)
## BPRS_ref1: bprs ~ week + treatment + (week | uniqueSubject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## BPRS_ref     5 2582.9 2602.3 -1286.5   2572.9                         
## BPRS_ref1    7 2523.2 2550.4 -1254.6   2509.2 63.663  2  1.499e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the ANOVA comparison, the second model (random intercept and slope) fits the data clearly better than the first (random intercept only): the chi-squared test gives p ≈ 1.5e-14, far below the 0.05 threshold.

Let’s plot the original BPRS values again.

ggplot(BPRSL, aes(x=week, y=bprs, group=uniqueSubject)) +
  geom_line(aes(linetype=treatment)) +
  scale_x_continuous(name="Week") +
  facet_grid(. ~ treatment, labeller=label_both) +
  scale_y_continuous(name="BPRS") +
  theme(legend.position="top")

Let’s then plot the fit using the second model (BPRS_ref1) for comparison.

# Compute the fitted values of the BPRS_ref1 model for the BPRSL data
Fitted <- fitted(BPRS_ref1, BPRSL)

# Add a new column for the fitted data
BPRSL <- mutate(BPRSL, Fitted=Fitted)

# Draw a plot of BPRSL with the fitted bprs values for each subject
ggplot(BPRSL, aes(x=week, y=Fitted, group=uniqueSubject)) +
  geom_line(aes(linetype=treatment)) +
  scale_x_continuous(name="Week") +
  facet_grid(. ~ treatment, labeller=label_both) +
  scale_y_continuous(name="Fitted BPRS") +
  theme(legend.position="top")

I have to say that using a non-unique subject identifier in the BPRS dataset was a rather devious twist in the exercise.


Course diary

Week 1

Thoughts after the first week:

  • Created IODS-project repository
  • Installed R and Rstudio on Ubuntu 20.04 (Linux). Rstudio (latest 1.3.1093) keeps crashing very frequently. R seems to work. For now I’m using Rscript to generate HTML from the .Rmd files.
  • I’ve used github for years so that part was easy.
  • Rstudio did not clone the github repository as I would have liked. It did not store the username in the cloned repository correctly. Thus manual git pull or git push did not work properly with SSH keys (it asked for git username and password). I had to tweak it manually to make SSH keys work correctly.
  • The actual information on the lecture could have been presented much faster. I will probably read the information from the book/web pages/transcripts in the coming weeks rather than listening to lectures.

Week 2

Thoughts for the second week:

  • I went through the DataCamp exercises. The platform was simple and easy to use, even though this would not have been my preferred mode of learning. I would rather have read a description of the R philosophy and how main commands work and then approached it as a programming language. Now rather than learning the basics we are forced to learn snippets that may be useful in themselves, but we don’t learn to understand what options and commands really mean or what the philosophy and basic concepts are.
  • It took some time to find out how to use R. I’m not a fan of its programming language syntax, but it certainly seems useful for various statistical and plotting tasks once you learn the different libraries. The challenge is that you need to become familiar with a number of libraries. In some sense python, pandas, scikit, and matplotlib still feels easier for me.
  • I’ve used various types of linear regression many times in the past (least squares, Lasso, Ridge), but haven’t really looked at significance analysis and the distribution of residual errors in the past. I think that analysis may turn out to be a useful tool for me in the future. I would probably implement in Python for my machine learning applications rather than use R though.
  • Rmarkdown is kind of a cool idea and nice for prototyping or coursework, but I still fail to see how to utilize it for an academic paper. The nice thing is that it leaves a trail that makes repeating the procedure easy.

Week 3

Thoughts after the third week

  • I’m becoming frustrated. I read through the chapters of the book, but found them too general and vague, lacking precise analysis and description of the topics. Perhaps I’d prefer a more mathematical approach to the topics.
  • I’m finding I really dislike R syntax and the study approach taken in the data camp exercises. They are not hard, but they are performing tasks without first gaining an understanding of the available commands and operations. We haven’t looked at even the basic concepts of the programming language that R is, yet we are learning and memorizing snippets in the hope that they might be used for something useful. As someone with a long programming background, I’m finding this approach very frustrating, inefficient, and annoying. I’m waiting for a book on R programming to arrive.
  • I do appreciate the graphics and significance testing that are easily available with R packages.
  • The Super-Bonus exercise was fun.

Week 4

Thoughts after the fourth week

  • These exercises are taking quite a few hours to do. They are not difficult, but there is a lot of tedious detail. This is working to teach some rote skills, but the concepts and overall approach in R has not been discussed. R really looks like a hack, though a lot of people have put a lot of effort into it.
  • I got a book on R, which seems to help somewhat, but it doesn’t quite go as deep as I’d like either. It’s Nicholas J. Horton and Ken Kleinman: Using R and RStudio for Data Management, Statistical Analysis, and Graphics, 2nd ed, CRC Press, 2015. The book is helpful though.

Week 5

Thoughts after the fifth week

  • I’ve used PCA previously, but haven’t really paid attention to data normalization (perhaps it hasn’t been a major issue in my applications). Nevertheless, this exercise clearly points out how important normalization is in this context. This is a useful takeaway.

Week 6

  • I definitely hit the low point of motivation and high point of resentment towards this course during Exercise 6.
  • The way the Analysis task was defined in Exercise 6 was in my opinion unacceptable. It defines the tasks by reference to chapters of the book (MABS), but those chapters are not available in the “special edition” provided for the online course and are only downloadable through the university library if you haven’t used up your 100-page limit on EBSCO. I believe the university library is closed due to covid-19, so getting a physical copy from there is out of the question, and it is too late to order from Amazon. Now, I think it is fine to require a book for a course (if the requirement is announced at the beginning of the course), but I detest defining the exercise tasks by reference to chapters of a book that I did not expect to need for the course and that are not easily available.
  • Ok, I was told on the chat area that these chapters are at the end of the MABS special edition, out-of-place. The table of contents does not reflect this, and apparently the only way to find them and know that they are there is to page through the whole document. It never occurred to me that the chapters in a PDF document with a table of contents and numbered pages would be out-of-order.
  • I don’t know if using a non-unique subject identifier in the BPRS dataset was intentional or not. It was rather devious.
  • The Datacamp exercises seemed to have a few problems, the most serious being the incorrect value of n used in computing the standard error.
  • The more I use R, the more I dislike its programming language. I will grant, though, that it is quite handy for certain visualization and analysis tasks. I will probably look into Python and pandas next (I’ve used matplotlib, numpy, scipy, sklearn, etc. before).