This is a temporary project that exists only to complete mandatory classwork for the Open Data Science class at the University of Helsinki. I expect to hide or remove this repository after December 2020.
This repository is for an introductory open data science class for PhD students at University of Helsinki (with some students from other universities). The course is open to everyone, uses open source tools, and emphasizes open science. The statistics part seems fairly elementary, but I could use a brush-up and that is why I signed up for the course.
The course seems to cover github, R, Rstudio, linear and logistic regression, dimensionality reduction, clustering, hypothesis testing and some other basic multivariate statistical analysis tools.
Links to my repositories and web pages for the class:
Here is code and output from data/create_learning2014.R:
# Tatu Ylönen / University of Helsinki
# 1.11.2020
# RStudio Exercise 2 - data wrangling
# This uses the data set described in:
# Kimmo Vehkalahti: ASSIST 2014 - Phase 3 (end of Part 2), N=183
# Course: Johdatus yhteiskuntatilastotieteeseen, syksy 2014
# (Introduction to Social Statistics, fall 2014 - in Finnish),
# international survey of Approaches to Learning, made possible
# by Teachers' Academy funding for KV in 2013-2015.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the data into a data frame.
data <- read.csv(file="http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t")
# Print the dimensions of the data.
print("=== dimensions of the data JYTOPKYS3-data dataset")
## [1] "=== dimensions of the data JYTOPKYS3-data dataset"
dim(data)
## [1] 183 60
# Print a summary of what the data frame contains.
# Turns out there are 183 observations for 60 variables.
print("Structure of the JYTOPKYS3-data dataset")
## [1] "Structure of the JYTOPKYS3-data dataset"
str(data)
## 'data.frame': 183 obs. of 60 variables:
## $ Aa : int 3 2 4 4 3 4 4 3 2 3 ...
## $ Ab : int 1 2 1 2 2 2 1 1 1 2 ...
## $ Ac : int 2 2 1 3 2 1 2 2 2 1 ...
## $ Ad : int 1 2 1 2 1 1 2 1 1 1 ...
## $ Ae : int 1 1 1 1 2 1 1 1 1 1 ...
## $ Af : int 1 1 1 1 1 1 1 1 1 2 ...
## $ ST01 : int 4 4 3 3 4 4 5 4 4 4 ...
## $ SU02 : int 2 2 1 3 2 3 2 2 1 2 ...
## $ D03 : int 4 4 4 4 5 5 4 4 5 4 ...
## $ ST04 : int 4 4 4 4 3 4 2 5 5 4 ...
## $ SU05 : int 2 4 2 3 4 3 2 4 2 4 ...
## $ D06 : int 4 2 3 4 4 5 3 3 4 4 ...
## $ D07 : int 4 3 4 4 4 5 4 4 5 4 ...
## $ SU08 : int 3 4 1 2 3 4 4 2 4 2 ...
## $ ST09 : int 3 4 3 3 4 4 2 4 4 4 ...
## $ SU10 : int 2 1 1 1 2 1 1 2 1 2 ...
## $ D11 : int 3 4 4 3 4 5 5 3 4 4 ...
## $ ST12 : int 3 1 4 3 2 3 2 4 4 4 ...
## $ SU13 : int 3 3 2 2 3 1 1 2 1 2 ...
## $ D14 : int 4 2 4 4 4 5 5 4 4 4 ...
## $ D15 : int 3 3 2 3 3 4 2 2 3 4 ...
## $ SU16 : int 2 4 3 2 3 2 3 3 4 4 ...
## $ ST17 : int 3 4 3 3 4 3 4 3 4 4 ...
## $ SU18 : int 2 2 1 1 1 2 1 2 1 2 ...
## $ D19 : int 4 3 4 3 4 4 4 4 5 4 ...
## $ ST20 : int 2 1 3 3 3 3 1 4 4 2 ...
## $ SU21 : int 3 2 2 3 2 4 1 3 2 4 ...
## $ D22 : int 3 2 4 3 3 5 4 2 4 4 ...
## $ D23 : int 2 3 3 3 3 4 3 2 4 4 ...
## $ SU24 : int 2 4 3 2 4 2 2 4 2 4 ...
## $ ST25 : int 4 2 4 3 4 4 1 4 4 4 ...
## $ SU26 : int 4 4 4 2 3 2 1 4 4 4 ...
## $ D27 : int 4 2 3 3 3 5 4 4 5 4 ...
## $ ST28 : int 4 2 5 3 5 4 1 4 5 2 ...
## $ SU29 : int 3 3 2 3 3 2 1 2 1 2 ...
## $ D30 : int 4 3 4 4 3 5 4 3 4 4 ...
## $ D31 : int 4 4 3 4 4 5 4 4 5 4 ...
## $ SU32 : int 3 5 5 3 4 3 4 4 3 4 ...
## $ Ca : int 2 4 3 3 2 3 4 2 3 2 ...
## $ Cb : int 4 4 5 4 4 5 5 4 5 4 ...
## $ Cc : int 3 4 4 4 4 4 4 4 4 4 ...
## $ Cd : int 4 5 4 4 3 4 4 5 5 5 ...
## $ Ce : int 3 5 3 3 3 3 4 3 3 4 ...
## $ Cf : int 2 3 4 4 3 4 5 3 3 4 ...
## $ Cg : int 3 2 4 4 4 5 5 3 5 4 ...
## $ Ch : int 4 4 2 3 4 4 3 3 5 4 ...
## $ Da : int 3 4 1 2 3 3 2 2 4 1 ...
## $ Db : int 4 3 4 4 4 5 4 4 2 4 ...
## $ Dc : int 4 3 4 5 4 4 4 4 4 4 ...
## $ Dd : int 5 4 1 2 4 4 5 3 5 2 ...
## $ De : int 4 3 4 5 4 4 5 4 4 2 ...
## $ Df : int 2 2 1 1 2 3 1 1 4 1 ...
## $ Dg : int 4 3 3 5 5 4 4 4 5 1 ...
## $ Dh : int 3 3 1 4 5 3 4 1 4 1 ...
## $ Di : int 4 2 1 2 3 3 2 1 4 1 ...
## $ Dj : int 4 4 5 5 3 5 4 5 2 4 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
# Create the analysis dataset with variables gender, age, attitude,
# deep, stra, surf and points. First copy the data wrangling statements from
# the datacamp exercises.
data$attitude = data$Attitude / 10
deep_questions <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30",
"D06", "D15", "D23", "D31")
surface_questions <- c("SU02","SU10","SU18","SU26", "SU05","SU13","SU21",
"SU29","SU08","SU16","SU24","SU32")
strategic_questions <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20",
"ST28")
deep_columns <- select(data, one_of(deep_questions))
data$deep <- rowMeans(deep_columns)
surface_columns <- select(data, one_of(surface_questions))
data$surf <- rowMeans(surface_columns)
strategic_columns <- select(data, one_of(strategic_questions))
data$stra = rowMeans(strategic_columns)
# Select only the desired fields for the analysis data set
anal <- select(data, c("gender", "Age", "attitude", "deep", "stra", "surf",
"Points"))
colnames(anal)[2] <- "age"
colnames(anal)[7] <- "points"
# Exclude observations where the exam points variable is zero from the
# analysis data set
anal <- filter(anal, points != 0)
# Show the structure of the analysis dataset
print("=== analysis dataset")
## [1] "=== analysis dataset"
str(anal)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
# Write the analysis data set into a file
write.csv(anal, "data/analysis-dataset.csv")
# Demonstrate that we can read back the analysis data set
readback <- read.csv("data/analysis-dataset.csv")
print("=== structure of the readback dataset")
## [1] "=== structure of the readback dataset"
str(readback)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
print("=== head of the readback dataset")
## [1] "=== head of the readback dataset"
head(readback)
## X gender age attitude deep stra surf points
## 1 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 6 F 38 3.8 4.750000 3.625 2.416667 21
The objective of the study was to analyze how various factors influence the student’s skills in statistics.
In this exercise we analyzed the dataset described in Kimmo Vehkalahti: The Relationship between Learning Approaches and Students’ Achievements in an Introductory Statistics Course in Finland, 60th World Statistics Congress (ISI2015), Rio de Janeiro, Brazil, July 2015. The overall approach is covered more deeply in Tait, Entwistle and McKune: ASSIST: The Approaches and Study Skills Inventory for Students, 1998.
The data set was collected to understand how various factors influence students’ statistics skills. It includes data for 183 students. A total of 60 variables were colleted. The variables can be grouped into those relating to the student’s background and scholastic achievements, deep questions (seeking meaning, relating ideas, use of evidence), surface questions (lack of purpose, unrelated memorizing, syllabus-boundedness), and strategic questions (organized studying, time management).
Student’s exam points are taken as measuring the level of the student’s skill that we try to explain using the other variables. For this analysis, we chose to use linear least squares regression using the student’s attitude and the averaged answers to surface questions and strategic questions as explanatory variables. We chose to not include deep questions, age, or gender as a possible explanatory variables because the assignment instructions said we should use three variables; extending the analysis to include additional variables would be straightforward.
We additionally analyze the significance of the results using the t statistic, including a look at the distribution of the regression errors and whether their distribution is approximately normal to verify that the significance testing methods chosen can be used in this situation.
The analysis was performed using R (version 3.6.3). The R scripts and their output are included below.
The original data set can be found here and its detailed description here.
We preprocessed the data by averaging the answers to the deep questions, surface questions, and strategic questions. A sum of the answers had already been precomputed, so the sums were simply divided by the number of values (10) in each case. Additionally, student’s exam points, general attitude towards statistics, age, and gender were included as possible explanatory variables.
The data preprocessing was done using the script built for the data wrangling exercise, data/create_learning2014.R
, shown above.
Observations where the exam points are zero were excluded from the analysis. This left 166 observations.
To get a quick overview of the data and the relationships between points and the possible explanatory variables, we first plot points against each variable.
library(ggplot2)
data <- read.csv("data/analysis-dataset.csv")
p <- ggplot(data, aes(x=deep, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'
p <- ggplot(data, aes(x=surf, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'
p <- ggplot(data, aes(x=stra, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'
p <- ggplot(data, aes(x=attitude, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'
p <- ggplot(data, aes(x=age, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'
p <- ggplot(data, aes(x=gender, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'
It is clear that gender as a two-category variable is not suitable for linear regression. The data also contains relatively few observations for age above about 27, so it is probably not a very helpful explanatory variable either. Also the deep questions have most averages concentrated between 3 and 5, so it might be less useful for regression than the other variables; it also seems to have little influence on points based on the regression line. These observations contributed to choosing attitude, surf, and stra for our analysis.
First, we compute a linear least squares regression, trying to explain points using the students’ attitude and the averaged answers for the surface and strategic questions.
library(dplyr)
data <- read.csv("data/analysis-dataset.csv")
model <- lm(points ~ attitude + surf + stra, data=data)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + surf + stra, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## surf -0.5861 0.8014 -0.731 0.46563
## stra 0.8531 0.5416 1.575 0.11716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
For statistically significant results we would expect p < 0.05. The estimated fit shows a positive coefficient for attitude (3.68, p=1.9 * 10-8, which is statistically significant). It also shows a negative coefficient for surface questions and a positive cofficient for strategic questions, but these are not statistically significant (p=0.47 and p=0.12, respectively).
The t statistic used for significance testing in this linear regression assumes that the distribution of errors follows the normal distribution. To validate this assumption, we first plot the residuals.
library(dplyr)
data <- read.csv("data/analysis-dataset.csv")
model <- lm(points ~ attitude + surf + stra, data=data)
plot(model)
The residuals vs. fitted plot shows no obvious pattern that would invalidate the p values. For a normal distribution we would expect the highest concentration of residuals around zero and the residuals should be independent of the fitted value. There is a small cluster of negative residuals near 24-27 and at extreme fitted values the residuals tend to be negative. However this effect seems small enough to not cause significant concern.
From the Q-Q plot we would expect residual distribution to match the theoretical distribution. we can see that while the regression seems to work fairly well for most data points, extreme errors on both sides tend to be more negative than predicted by a normal distribution. However the deviation is not very big. Overall, the plot looks to me as sufficiently consistent with a bounded normal distribution to consider the p values reasonably reliable, but some caution woul be warranted.
Finally, the residuals vs. leverage plot for a normal distribution should show the same spread regardless of leverage and the standardized residual density should follow a normal distribution. This appears to hold reasonably well in this case, though it is also possible that the spread decreases with leverage. There are too few data points at high leverage to be sure. Generally I would say the plot is consistent with the assumption of an approximately normal distribution of residuals.
We will now remove the variables that are not statistically significant from the regression. Thus, we compute the regression against just attitude.
library(dplyr)
data <- read.csv("data/analysis-dataset.csv")
model <- lm(points ~ attitude, data=data)
summary(model)
##
## Call:
## lm(formula = points ~ attitude, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
plot(model)
There is little change in the coefficient of attitude compared to the previous regression (now 3.53, was 3.68). Its statistical significance has slightly improved (now p=4.12 * 10-9, which is highly significant). The intercept 11.64 is also highly significant (p=1.95 * 10-9).
Looking at the residuals vs fitted plot, it now looks like a better match against the normal distribution, though there seem to be fewer than expected samples further out. The Q-Q plot has not changed much, or perhaps the negative deviation at extremes has gotten stronger.
Given the very high significance from the t test and the approximate conformance with the normal distribution, it seems likely that the result for attitude truly is significant. However, I would not trust the absolute p value as there is some deviation from a normal distribution. I would seek to confirm the p value using nonparametric tests if submitting for a peer-reviewed scientific publication.
Finally, we try to analyze the correlation and explanatory power of attitude on points. For this, we use the Pearson correlation coefficient and use R2 to measure explanatory power. This assumes a normal distribution.
library(dplyr)
data <- read.csv("data/analysis-dataset.csv")
r = cor(data$attitude, data$points, method="pearson")
print(r)
## [1] 0.4365245
print(cat("correlation R", r, "\n"))
## correlation R 0.4365245
## NULL
print(cat("R^2", r * r, "\n"))
## R^2 0.1905537
## NULL
The R2 value of 0.19 suggests that attitude explains about 19% of the variance in points.
The student’s attitude toward statistics seems to have a statistically significant relationship with the student’s points on the exam that can be approximated with a linear model: \[ points = 3.53 \cdot attitude + 11.64 \]
This model explains about 19% of the variance in points. About 81% of variance is not explained by this model.
It should be noted that we have not analyzed whether higher points are caused by attitude or vice versa, or if the results could be better explained by some other common couse. We have merely detected a correlation.
See the script create_alc.R
in the data directory on github.
In this exercise we analyzed how students’ alcohol usage can be predicted using other variables in the data set. The data set is the Student Performance Data Set from the Machine Learning Repository at UC Irwine. The dataset is described in P. Cortez and A. Silva: Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira (eds): Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008), pp. 5-12, Porto, Portugal, April 2008.
First, let’s look at an overview of the data. Looking at available field names, we select a few fields as potential candidates for further investigation.
library(dplyr)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
alc <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt")
# Print column names
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
# Print overall structure of the data
str(alc)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
The data includes numerous factors describing the the student’s background and lifestyle, as well problems with the student’s studies, such as failed classes or absences, and the student’s alcohol usage. Our task is to study the relationship between high/low alcohol consumption and the other data. In total, the data set includes 382 observations for 35 variables.
Looking at the data, I decided to choose the following variables for closer inspection:
Let’s now look at them graphically using scatterplots (with some position jitter to make the point density more visible, given the discrete valus):
ggplot(alc, aes(x=Medu, y=alc_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5)) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(alc, aes(x=Fedu, y=alc_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5)) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(alc, aes(x=goout, y=alc_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5)) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(alc, aes(x=famrel, y=alc_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5)) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
Let’s also look at the data numerically using covariances.
# Covariances help us understand the relationship between different variables.
sub <- select(alc, one_of(c("alc_use", "Medu", "Fedu", "goout", "famrel")))
cov(sub, sub)
## alc_use Medu Fedu goout famrel
## alc_use 0.975635899 0.01153619 0.009385607 0.43383353 -0.10978961
## Medu 0.011536189 1.18022289 0.773865963 0.08223056 0.01192783
## Fedu 0.009385607 0.77386596 1.201742452 0.04117025 0.01313710
## goout 0.433833533 0.08223056 0.041170246 1.28125902 0.08291077
## famrel -0.109789614 0.01192783 0.013137101 0.08291077 0.84938368
We can see that alc_use has a covariance of 0.43 with goout
and -0.11
with famrel, while the covariances with Medu
and Fedu
are small.
Visually and based on the linear regression line, it does not look like parents’ education has much impact on alcohol use. Nevertheless, they might have an impact when conditioning on other variables or after eliminating the impact of other variables, so let’s keep them in the analysis. Going out and family relationships, however, show a clear impact.
Let’s now focus on high alcohol use, i.e., the binarized variable high_use
, and use logistic regression to analyze the impact of the selected variables on it. We will use the bootstrap method and
library(boot)
m <- glm(high_use ~ Medu + Fedu + goout + famrel, data=alc, family="binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ Medu + Fedu + goout + famrel, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6349 -0.7725 -0.5598 0.9912 2.4030
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.73933 0.68188 -2.551 0.01075 *
## Medu -0.10465 0.14883 -0.703 0.48194
## Fedu 0.05742 0.14751 0.389 0.69707
## goout 0.80690 0.11928 6.765 1.33e-11 ***
## famrel -0.42459 0.13316 -3.189 0.00143 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 401.82 on 377 degrees of freedom
## AIC: 411.82
##
## Number of Fisher Scoring iterations: 4
Looking at the significance level of the coefficients, we can see that Medu
and Fedu
are not statistically significant, while goout
and famrel
are statistically significant at the p=0.01 level. Let’s run the regression again but without Medu
and Fedu
.
library(boot)
m <- glm(high_use ~ goout + famrel, data=alc, family="binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ goout + famrel, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5586 -0.7385 -0.5676 0.9969 2.3758
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8742 0.6015 -3.116 0.00183 **
## goout 0.8004 0.1187 6.743 1.55e-11 ***
## famrel -0.4218 0.1330 -3.171 0.00152 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 402.32 on 379 degrees of freedom
## AIC: 408.32
##
## Number of Fisher Scoring iterations: 4
We can see that goout
and famrel
are still statistically significant at the p=0.01 level, with coefficients of 0.80 and -0.42, respectively. Their standard errors are at most of 1/3 of the absolute value of the means.
Let’s now look at the coefficients and their confidence intervals after exponentiation. Exponentiation effectively coverts addition into multiplication, and coefficients >1 indicate positive impact on high_use
while coefficients <1 indicate negative impact on high_use
.
OR <- coef(m) %>% exp
OR
## (Intercept) goout famrel
## 0.1534844 2.2264964 0.6558574
CI <- exp(confint(m))
## Waiting for profiling to be done...
CI
## 2.5 % 97.5 %
## (Intercept) 0.04577266 0.4877710
## goout 1.77580201 2.8309109
## famrel 0.50355350 0.8498297
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1534844 0.04577266 0.4877710
## goout 2.2264964 1.77580201 2.8309109
## famrel 0.6558574 0.50355350 0.8498297
We can see that the results are in line with our hypothesis and the results we got with non-exponentiated coefficients (i.e., negative value there corresponds to a value <1 in after exponentiation).
The exponentiated coefficients relate directly to the odds ratios. Essentially, in this case, an increase in goout
by increases the odds by a factor of 2.2, and an increase in famrel
decreases the odds by a factor of 0.66. The confidence intervals of these odds ratios are given directly by the confidence intervals for the exponentiated coefficients as reported by exp(confint(m))
above.
Let’s now explore the predictive power of the model.
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# Predict the probability of high_use using the model
probabilities <- predict(m, type="response")
alc <- mutate(alc, probability=probabilities)
alc <- mutate(alc, prediction=probability >= 0.5)
# Let's look at ten samples and how they were predicted vs. actual high_use
select(alc, high_use, prediction, Medu, Fedu, goout, famrel) %>% tail(10)
## high_use prediction Medu Fedu goout famrel
## 373 FALSE FALSE 1 1 2 4
## 374 TRUE FALSE 4 2 3 5
## 375 FALSE FALSE 2 2 3 5
## 376 FALSE FALSE 4 4 3 4
## 377 FALSE FALSE 2 3 2 5
## 378 FALSE FALSE 3 1 4 4
## 379 FALSE FALSE 1 1 1 1
## 380 FALSE FALSE 1 1 1 1
## 381 TRUE TRUE 3 1 5 2
## 382 TRUE FALSE 3 2 1 4
# Let's also create the confusion matrix for the prediction
table(high_use=alc$high_use, prediction=alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 243 27
## TRUE 64 48
Based on the ten samples, we see 8 correct predictions, one Type I error, and one Type II error. Looks reasonable. The confusion matrix shows that the prediction missed 70 cases of high alcohol use and correctly predicted 42.
Let’s also look at a scatterplot of the predictions vs. actual values (though this is not very useful as we only have four possible combinations of values - but a plot with position jitter will slow something).
ggplot(alc, aes(x=prediction, y=high_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5))
The plot supports our earlier analysis of the confusion matrics.
Let’s now use 10-fold cross-validation to estimate how sensitive the model is to sampling of the data. (Loss function was already defined above)
cv <- cv.glm(data=alc, cost=loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation, e.g., error
cv$delta[1]
## [1] 0.2486911
The cross validation indicates an error of 0.25 for my model. This is better than the error of the model introduced in DataCamp (0.26). (Note, however, that due to the random nature of cross-validation, and using only 10-fold cross validation as mandated by the exercise, the accuracy of these errors is not very high and the errors will randomly vary somewhat from run to run. I’ve also seen runs where it is higher than 0.26.)
Let’s now see how the model error behaves as the number of variables decreases. For this analysis we’ll first use all variables (except high_use
, alc_use
, Dalc
, Walc
, probability
, and prediction
) as explanatory variables, and then start removing variables that have low significance. Finally, we will collect the cross-validation errors and plot the change in error as a function of the number of variables removed. We’ll use 29-fold cross-validation to reduce the estimation error while keeping running times reasonable (R warns about K=30; using 29 instead reduces garbage in this report).
# These columns are always excluded
always_exclude = c("high_use", "alc_use", "Dalc", "Walc",
"prediction", "probability")
# Dataframe where we collect the errors from each step in simplification
# (this is updated by the step() function)
results = data.frame(count=integer(), error=double(), last=character())
# Estimate a model that excludes the variables given as argument (plus the
# variables that are always excluded), performs cross-validation to estimate
# the model error, and collects the errors into ``results``.
step <- function(additional_excludes) {
excludes <- c(always_exclude, additional_excludes)
cols <- colnames(alc)[!colnames(alc) %in% excludes]
formula_text <- paste("high_use ~ ", paste(cols, collapse=" + "))
formula = as.formula(formula_text)
m <- glm(formula, data=alc)
probabilities <- predict(m, type="response")
alc <- mutate(alc, probability=probabilities)
alc <- mutate(alc, prediction=probability >= 0.5)
cv <- cv.glm(data=alc, cost=loss_func, glmfit=m, K=29)
error <- cv$delta[1] # error
count <- length(additional_excludes)
if (length(additional_excludes) == 0) {
last = "<none>"
} else {
last <- additional_excludes[[length(additional_excludes)]]
}
results <<- rbind(results, data.frame(count, error, last))
return(m)
}
# Perform one step for the exercise. We can add steps by adding more fields
# to be excluded (as many as we like).
m <- step(c()) # famsup has lowest significance
m <- step(c("famsup"))
m <- step(c("famsup", "schoolsup"))
m <- step(c("famsup", "schoolsup", "traveltime"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
"activities"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
"activities", "school"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
"activities", "school", "health"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
"activities", "school", "health", "Mjob"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
"activities", "school", "health", "Mjob", "Medu"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
"activities", "school", "health", "Mjob", "Medu", "freetime"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
"activities", "school", "health", "Mjob", "Medu", "freetime", "reason"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
"activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
"Pstatus"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
"activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
"Pstatus", "address"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
"activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
"Pstatus", "address", "age"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
"activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
"Pstatus", "address", "age", "guardian"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
"activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
"Pstatus", "address", "age", "guardian", "romantic"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
"activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
"Pstatus", "address", "age", "guardian", "romantic", "higher"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
"activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
"Pstatus", "address", "age", "guardian", "romantic", "higher", "paid"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
"activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
"Pstatus", "address", "age", "guardian", "romantic", "higher", "paid",
"nursery"))
summary(m)
##
## Call:
## glm(formula = formula, data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7470 -0.2983 -0.1224 0.2964 1.0904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.023323 0.105566 0.221 0.82526
## sexM 0.181264 0.041632 4.354 1.72e-05 ***
## famrel -0.077049 0.022620 -3.406 0.00073 ***
## goout 0.135759 0.018450 7.358 1.17e-12 ***
## absences 0.011949 0.002735 4.369 1.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1632952)
##
## Null deviance: 79.162 on 381 degrees of freedom
## Residual deviance: 61.562 on 377 degrees of freedom
## AIC: 398.78
##
## Number of Fisher Scoring iterations: 2
results
## count error last
## 1 0 0.2460733 <none>
## 2 1 0.2329843 famsup
## 3 2 0.2434555 schoolsup
## 4 3 0.2356021 traveltime
## 5 4 0.2434555 Fjob
## 6 5 0.2460733 Fedu
## 7 6 0.2408377 Fjob
## 8 7 0.2356021 famsize
## 9 8 0.2434555 G2
## 10 9 0.2408377 failures
## 11 10 0.2277487 internet
## 12 11 0.2356021 G1
## 13 12 0.2329843 G3
## 14 13 0.2225131 studytime
## 15 14 0.2251309 activities
## 16 15 0.2303665 school
## 17 16 0.2277487 health
## 18 17 0.2120419 Mjob
## 19 18 0.2015707 Medu
## 20 19 0.2041885 freetime
## 21 20 0.2120419 reason
## 22 21 0.2094241 Pstatus
## 23 22 0.2198953 address
## 24 23 0.2172775 age
## 25 24 0.2198953 guardian
## 26 25 0.2068063 romantic
## 27 26 0.2094241 higher
## 28 27 0.2172775 paid
## 29 28 0.2198953 nursery
This leaves us with four variables, goout
, absences
, sex
, and famrel
that are all statistically highly significant (at p=0.001 level). The prediction error is at the 0.21 level. For the fun of it, let’s see what happens if we remove more.
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
"activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
"Pstatus", "address", "age", "guardian", "romantic", "higher", "paid",
"nursery", "famrel"))
summary(m)
##
## Call:
## glm(formula = formula, data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8046 -0.3032 -0.1538 0.3365 1.0327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.263330 0.064620 -4.075 5.61e-05 ***
## sexM 0.172743 0.042136 4.100 5.07e-05 ***
## goout 0.130710 0.018646 7.010 1.10e-11 ***
## absences 0.012497 0.002768 4.514 8.50e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1678754)
##
## Null deviance: 79.162 on 381 degrees of freedom
## Residual deviance: 63.457 on 378 degrees of freedom
## AIC: 408.36
##
## Number of Fisher Scoring iterations: 2
results
## count error last
## 1 0 0.2460733 <none>
## 2 1 0.2329843 famsup
## 3 2 0.2434555 schoolsup
## 4 3 0.2356021 traveltime
## 5 4 0.2434555 Fjob
## 6 5 0.2460733 Fedu
## 7 6 0.2408377 Fjob
## 8 7 0.2356021 famsize
## 9 8 0.2434555 G2
## 10 9 0.2408377 failures
## 11 10 0.2277487 internet
## 12 11 0.2356021 G1
## 13 12 0.2329843 G3
## 14 13 0.2225131 studytime
## 15 14 0.2251309 activities
## 16 15 0.2303665 school
## 17 16 0.2277487 health
## 18 17 0.2120419 Mjob
## 19 18 0.2015707 Medu
## 20 19 0.2041885 freetime
## 21 20 0.2120419 reason
## 22 21 0.2094241 Pstatus
## 23 22 0.2198953 address
## 24 23 0.2172775 age
## 25 24 0.2198953 guardian
## 26 25 0.2068063 romantic
## 27 26 0.2094241 higher
## 28 27 0.2172775 paid
## 29 28 0.2198953 nursery
## 30 29 0.2094241 famrel
Turns out prediction error did not really change from removing famrel
(even though it was statistically highly significant). Let’s see what happens if we not remove sex
, the next least significant variable.
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
"activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
"Pstatus", "address", "age", "guardian", "romantic", "higher", "paid",
"nursery", "famrel", "sex"))
summary(m)
##
## Call:
## glm(formula = formula, data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8224 -0.2859 -0.1266 0.3944 1.0561
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.191905 0.063512 -3.022 0.00269 **
## goout 0.135835 0.018988 7.154 4.39e-12 ***
## absences 0.011713 0.002819 4.155 4.02e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1748772)
##
## Null deviance: 79.162 on 381 degrees of freedom
## Residual deviance: 66.278 on 379 degrees of freedom
## AIC: 422.97
##
## Number of Fisher Scoring iterations: 2
results
## count error last
## 1 0 0.2460733 <none>
## 2 1 0.2329843 famsup
## 3 2 0.2434555 schoolsup
## 4 3 0.2356021 traveltime
## 5 4 0.2434555 Fjob
## 6 5 0.2460733 Fedu
## 7 6 0.2408377 Fjob
## 8 7 0.2356021 famsize
## 9 8 0.2434555 G2
## 10 9 0.2408377 failures
## 11 10 0.2277487 internet
## 12 11 0.2356021 G1
## 13 12 0.2329843 G3
## 14 13 0.2225131 studytime
## 15 14 0.2251309 activities
## 16 15 0.2303665 school
## 17 16 0.2277487 health
## 18 17 0.2120419 Mjob
## 19 18 0.2015707 Medu
## 20 19 0.2041885 freetime
## 21 20 0.2120419 reason
## 22 21 0.2094241 Pstatus
## 23 22 0.2198953 address
## 24 23 0.2172775 age
## 25 24 0.2198953 guardian
## 26 25 0.2068063 romantic
## 27 26 0.2094241 higher
## 28 27 0.2172775 paid
## 29 28 0.2198953 nursery
## 30 29 0.2094241 famrel
## 31 30 0.2513089 sex
This time we see a significant increase in error, to the 0.25 level. We thus conclude that goout
, absences
, and sex
is the best combination that we can get to using this method. It is, however, possible that we are at a local optimum and different choices earlier would have resulted in a better final result. However I think that is unlikely.
Let’s finish with a plot of the error as a function of the number of variables removed.
data = select(results, count, error)
plot(data$count, data$error,
xlab="variables removed", ylab="error") + title(
"Prediction error vs. number of variables removed")
## integer(0)
Let’s start by loading the libraries we’ll use.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(ggplot2)
library(GGally)
library(corrplot)
## corrplot 0.84 loaded
library(dplyr)
In this exercise we are analyzing the “Boston” dataset from the R MASS library. This example dataset is further described in D. Harrison and D.L. Rubinfeld: Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5:81-102, 1978. The data set relates to the housing values in the suburbs of Boston and several variables that might predict housing prices.
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Thre are a total of 506 observations for 14 variables. All variables have numerical values.
ggpairs(Boston, axislabels="show")
## Warning in warn_if_args_exist(list(...)): Extra arguments: 'axislabels' are
## being ignored. If these are meant to be aesthetics, submit them using the
## 'mapping' variable within ggpairs with ggplot2::aes or ggplot2::aes_string.
We can see that the distributions of the variables vary quite a bit. Some have a neat near-normal distribution, while others are highly skewed and far from normally distributed.
Some variables are neatly linearly correlated, while others so non-linear correlation (e.g., having a “bump” in the middle in the 2-D plot).
We can see from the pairwise plots of the variables that there are significant correlations between many of the variables. In particular, if we look at the median house value (medv
, bottom row / rightmost column), we can see that it might have correlations with all or most of the other variables, though in some cases the correlation might not be linear (e.g., age
, dis
, zn
).
Let’s also plot the correlations between the variables to understand the strength of the correlations between each pair of variables:
cor_matrix <- cor(Boston)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)
We can see that medv
has the highest positive or negative correlations with, e.g., rm
(average number of rooms per dwelling) and lstat
(percentage of population with lower status), but many other correlations are also significant.
Let’s then standardize the data set to zero mean and unit standard deviation.
# Standardize the test set
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# Create a new categorial variable with values High, Medium, Low from
# the crime rate, replacing variable ``crim`` by ``crime``
bins = quantile(boston_scaled$crim)
labels = c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim,
breaks=bins,
labels=labels,
include.lowest=TRUE)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
str(boston_scaled)
## 'data.frame': 506 obs. of 14 variables:
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
# Divide the dataset to train and test sets, so that 80% of the data belongs
# to the train set.
cnt <- nrow(boston_scaled)
ind <- sample(cnt, size=cnt * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
str(train)
## 'data.frame': 404 obs. of 14 variables:
## $ zn : num -0.487 0.413 3.372 1.442 3.372 ...
## $ indus : num -0.867 -0.801 -1.077 -1.122 -1.447 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 3.665 ...
## $ nox : num -0.343 -0.998 -1.387 -1.016 -1.326 ...
## $ rm : num -0.426 -0.458 1.664 0.386 2.332 ...
## $ age : num -0.823 -0.813 -1.221 -1.402 -1.555 ...
## $ dis : num 0.483 1.434 1.207 0.366 0.993 ...
## $ rad : num -0.522 -0.637 -0.752 -0.522 -0.982 ...
## $ tax : num -1.0932 -0.9804 -0.9745 -0.0607 -1.2474 ...
## $ ptratio: num 0.806 -0.765 -1.18 -1.504 -2.243 ...
## $ black : num 0.441 0.426 0.325 0.287 0.426 ...
## $ lstat : num -0.408 0.112 -1.336 -1.133 -1.329 ...
## $ medv : num -0.384 -0.308 2.334 0.79 2.987 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 2 1 2 1 4 2 3 3 2 ...
str(test)
## 'data.frame': 102 obs. of 14 variables:
## $ zn : num -0.4872 -0.4872 0.0487 0.0487 0.0487 ...
## $ indus : num -1.306 -1.306 -0.476 -0.476 -0.476 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.834 -0.834 -0.265 -0.265 -0.265 ...
## $ rm : num 1.227 0.207 -0.388 -0.93 -0.392 ...
## $ age : num -0.5107 -0.3508 -0.0702 1.1164 0.5089 ...
## $ dis : num 1.077 1.077 0.838 1.086 1.155 ...
## $ rad : num -0.752 -0.752 -0.522 -0.522 -0.522 ...
## $ tax : num -1.105 -1.105 -0.577 -0.577 -0.577 ...
## $ ptratio: num 0.113 0.113 -1.504 -1.504 -1.504 ...
## $ black : num 0.441 0.41 0.426 0.328 0.441 ...
## $ lstat : num -1.0255 -1.0423 -0.0312 2.4194 0.0864 ...
## $ medv : num 1.486 0.6706 0.0399 -0.6559 -0.395 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 1 2 2 2 3 3 3 3 3 ...
# Linear discriminant analysis for ``crime`` using all other variables as
# predictor variables
model = lda(crime ~ ., data=train)
model
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2475248 0.2599010 0.2574257 0.2351485
##
## Group means:
## zn indus chas nox rm age
## low 1.0286810 -0.9198913 -0.07547406 -0.9051447 0.4484172 -0.9061433
## med_low -0.1115548 -0.3447577 -0.08484810 -0.5951334 -0.1382398 -0.3801100
## med_high -0.3882929 0.1758222 0.21980846 0.4567743 0.1650563 0.4641226
## high -0.4872402 1.0172655 -0.06511327 1.1132401 -0.3407451 0.8465461
## dis rad tax ptratio black lstat
## low 0.9181227 -0.6717851 -0.7563150 -0.49263761 0.3732627 -0.76149228
## med_low 0.3611591 -0.5465475 -0.4984575 -0.03884856 0.3105997 -0.14403840
## med_high -0.4344407 -0.4087420 -0.2972790 -0.30012966 0.1232377 0.01418048
## high -0.8703655 1.6366336 1.5129868 0.77903654 -0.9213342 0.90291602
## medv
## low 0.567808478
## med_low 0.009791214
## med_high 0.198210533
## high -0.731713849
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11328573 0.63746659 -1.04541122
## indus 0.01130410 -0.11873888 0.16735248
## chas -0.08889385 -0.06741197 -0.03742423
## nox 0.36836931 -0.78845730 -1.20696117
## rm -0.10761628 -0.13318267 -0.20557784
## age 0.30736843 -0.35574127 -0.15091626
## dis -0.07701285 -0.17522081 0.17895494
## rad 3.13547160 1.23834694 -0.21189112
## tax 0.04376011 -0.31959341 0.70311967
## ptratio 0.12054018 -0.04752169 -0.27128886
## black -0.13840297 -0.02543248 0.08000891
## lstat 0.22073559 -0.13062063 0.37818243
## medv 0.20397111 -0.32935615 -0.08350609
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9395 0.0451 0.0154
classes = as.numeric(train$crime)
plot(model, dimen=2, col=classes, pch=classes)
From the plot we can see that high
is relatively easily separable (except from a few med_high
instances), whereas there is more difficulty with the other classes when mapped to two-dimensional space. However, they could still be easily separable in the higher-dimensional original space.
# Save the crime categories from the test set and remove the categorical
# crime variable from the test set
correct_classes = test$crime
test <- dplyr::select(test, -crime)
str(correct_classes)
## Factor w/ 4 levels "low","med_low",..: 1 1 2 2 2 3 3 3 3 3 ...
str(test)
## 'data.frame': 102 obs. of 13 variables:
## $ zn : num -0.4872 -0.4872 0.0487 0.0487 0.0487 ...
## $ indus : num -1.306 -1.306 -0.476 -0.476 -0.476 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.834 -0.834 -0.265 -0.265 -0.265 ...
## $ rm : num 1.227 0.207 -0.388 -0.93 -0.392 ...
## $ age : num -0.5107 -0.3508 -0.0702 1.1164 0.5089 ...
## $ dis : num 1.077 1.077 0.838 1.086 1.155 ...
## $ rad : num -0.752 -0.752 -0.522 -0.522 -0.522 ...
## $ tax : num -1.105 -1.105 -0.577 -0.577 -0.577 ...
## $ ptratio: num 0.113 0.113 -1.504 -1.504 -1.504 ...
## $ black : num 0.441 0.41 0.426 0.328 0.441 ...
## $ lstat : num -1.0255 -1.0423 -0.0312 2.4194 0.0864 ...
## $ medv : num 1.486 0.6706 0.0399 -0.6559 -0.395 ...
# Predict the classes with the LDA model on the test data
predictions = predict(model, newdata=test)
# Cross-tabulate the correct classes vs. predicted classes.
table(correct=correct_classes, predicted=predictions$class)
## predicted
## correct low med_low med_high high
## low 13 12 2 0
## med_low 4 10 7 0
## med_high 0 7 14 1
## high 0 0 0 32
The tabulation shows that a majority of all classifications were correct (i.e., they are on the diagonal). However, many were also classified into neighboring classes. No values were classed more than two steps away from the current class.
# Reload the original Boston dataset
data("Boston")
# Scale the variables to that all are normalized (zero mean and unit standard
# deviation).
boston_scaled <- as.data.frame(scale(Boston))
# Calculate distances between the (scaled) observations. We use the
# Euclidean distance here (i.e., L2-norm). Are we supposed to use these
# distances somehow? The instructions do not say so.
dist_eu <- dist(boston_scaled, method="euclidean")
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# Run k-means on the dataset.
km <- kmeans(boston_scaled, centers=4)
# Visualize the clustering for a few variable pairs as a sample
pairs(boston_scaled[6:10], col=km$cluster)
# Investigate the optimal number of clusters. The "optimal" value is said to
# be where the WCSS drops radically. (In real applications with overlapping
# clusters it is often not quite this simple.)
k_max = 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# Plot the WCSS as a function of the number of clusters.
qplot(x=1:k_max, y=twcss, geom="line")
We can see from the plot that WCSS drops significantly between 1 and 2 (though there is additional decrease with more clusters). Based on this, we pick 2 as the optimal number of clusters and run k-means again with that number of clusters. It would not be unreasonable to choose a higher number of classes, but there is no additional dramatic drops at higher numbers of classes.
# Run k-means again with the "optimal" number of clusters
km <- kmeans(boston_scaled, centers=2)
# Visualize the clusters (the "optimal" number of them)
pairs(boston_scaled, col=km$cluster)
# Given that visualizing all pairs is near unreadable, visualize for a small
# sample of variables. We could use select() to pick a list of named
# variables instead, but instructions don't require it.
pairs(boston_scaled[6:10], col=km$cluster)
Looking at the visualized variables, we can see that some pairs of variables cluster the data pretty well. For example, tax
, radf
, and dis
all seem to offer relatively good classifications.
# Reload the original Boston dataset
data("Boston")
# Scale the variables to that all are normalized (zero mean and unit standard
# deviation).
boston_scaled <- as.data.frame(scale(Boston))
# Cluster into 3 clusters using k-means
km <- kmeans(boston_scaled, centers=3)
# Perform LDA using the clusters as target classes
lda.fit <- lda(x=boston_scaled, grouping=km$cluster)
# print the lda.fit object
lda.fit
## Call:
## lda(boston_scaled, grouping = km$cluster)
##
## Prior probabilities of groups:
## 1 2 3
## 0.2470356 0.3260870 0.4268775
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3989700 1.2614609 -0.9791535 -0.020354653 -0.8573235 1.0090468
## 2 0.7982270 -0.4872402 1.1186734 0.014005495 1.1351215 -0.4596725
## 3 -0.3788713 -0.3578148 -0.2879024 0.001080671 -0.3709704 -0.2328004
## age dis rad tax ptratio black
## 1 -0.96130713 0.9497716 -0.5867985 -0.6709807 -0.80239137 0.3552363
## 2 0.79930921 -0.8549214 1.2113527 1.2873657 0.59162230 -0.6363367
## 3 -0.05427143 0.1034286 -0.5857564 -0.5951053 0.01241316 0.2805140
## lstat medv
## 1 -0.9571271 1.06668290
## 2 0.8622388 -0.67953738
## 3 -0.1047617 -0.09820229
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim -0.03206338 -0.19094456
## zn 0.02935900 -1.07677218
## indus 0.63347352 -0.09917524
## chas 0.02460719 0.10009606
## nox 1.11749317 -0.75995105
## rm -0.18841682 -0.57360135
## age -0.12983139 0.47226685
## dis 0.04493809 -0.34585958
## rad 0.67004295 -0.08584353
## tax 1.03992455 -0.58075025
## ptratio 0.25864960 -0.02605279
## black -0.01657236 0.01975686
## lstat 0.17365575 -0.41704235
## medv -0.06819126 -0.79098605
##
## Proportion of trace:
## LD1 LD2
## 0.8506 0.1494
# Visualize the results using a biplot with arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col="black", pos=3)
}
# Plot the LDA results using two dimensions, with arrows
plot(lda.fit, dimen = 2, col=km$cluster, pch=km$cluster)
lda.arrows(lda.fit, myscale = 1)
# Plot again with much bigger scale so that we can read the labels on the
# smaller arrows
plot(lda.fit, dimen = 2, col=km$cluster, pch=km$cluster)
lda.arrows(lda.fit, myscale = 5)
# Plot again with even bigger scale so that we can read the labels on the
# smaller arrows
plot(lda.fit, dimen = 2, col=km$cluster, pch=km$cluster)
lda.arrows(lda.fit, myscale = 15)
The length of the arrows reflects the directions of the different dimensions (variables) relative to the hyperplane chosen for visualizing the clusters. The longer the arrow, the closer it is to the direction of the hyperplane; the shorter, the closer it is perpendicular to the hyperplane.
It looks like the plane that R chose for visualization (which seems to separate the clusters reasonably well) is most parallel to rad
(the nitrogen oxygen concentration), the second most parallel to tax
(the accessibility of radial highways), and third most parallel to age
(proportion of owner-occupied units built prior to 1940).
# Save gold standard classes from the training data
train_correct <- train$crime
# Recompute the LDA model from the training data
model_predictors <- dplyr::select(train, -crime)
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 14 2
model <- lda(crime ~ ., data=train)
# Map the datapoints into the visualization space
matrix_product <- as.matrix(model_predictors) %*% model$scaling
matrix_product <- as.data.frame(matrix_product)
# Create a 3D plot with color indicating the gold standard crime
# classes in the train set
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(matrix_product, x=~LD1, y=~LD2, z=~LD3,
type="scatter3d", mode="markers",
color=train_correct, marker=list(size=2))
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
# Create a 3D plot with color indicating the predicted crime classes. Since
# the gold data used four classes, we'll predict four classes.
train_no_crime = dplyr::select(train, -crime)
km <- kmeans(train_no_crime, centers=4)
plot_ly(matrix_product, x=~LD1, y=~LD2, z=~LD3,
type="scatter3d", mode="markers",
color=km$cluster, marker=list(size=2))
Both plots contain the same training points, so their point locations mapped to the visualization space are identical. We can see that the green, orange, and blue classes in the original data overlap quite a bit and are not well separated. The prediction, however, produces much shaper class boundaries. This is as expected, because clustering is based on a distance to cluster centers. While there are several dimensions not shown in these plots, one could expect these dimensions to partially reflect the distance from cluster centers. This is consistent with what we are seeing in the plot for predicted data. Except for the sharpening of class boundaries and the selection of a different color scheme, the plots are fairly similar.
The data wranling script that was part of Exercise 4 can be found in my github repository in the data
directory, here.
Let’s start by loading some libraries might use.
library(MASS)
library(ggplot2)
library(GGally)
library(corrplot)
library(dplyr)
# Set default figure size
knitr::opts_chunk$set(fig.width=12, fig.height=8)
The data wrangling exercise is performed by the data/create_human.R
script in my github repository, as specified by the instructions. It can be found here. The first part of the script was prepared as part of Exercise 4; the new code for this Exercise 5 is towards the end of the file.
Let’s start the analysis by loading the dataset we created as part of the data wrangling exercise. It should have 155 observations and 8 variables, with country as the row name.
human <- read.csv("data/human5.csv", row.name=1)
# Plot an overview of the relationships and distribution of the variables
ggpairs(human)
# Plot correlations of the variables
corrplot(cor(human))
# Look at the variables numerically
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
We can see that some variables are highly correlated, for example Ado.Birth
and Mat.Mor
. Some variables are not significantly correlated (e.g., edu2.FM
and Labo.FM
). Some variables show strong negative correlation, e.g., Mat.Mor
and Edu2.FM
).
Looking at the distributions of the variables, GNI
has a very large range (581 to 123124), while the other variables have much smaller ranges, e.g., Edu2.FM
has a range from 0.1717 to 1.4967.
# Perform PCA on the original (non-standardized data)
pca1 <- prcomp(human)
summary(pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
We can see that the first principal component (PC1) alone captures 99.99% of the variance. Let’s plot the observations using PC1 and PC2.
# Draw a biplot of the PCA representation (first two components) with arrows
biplot(pca1, choices=1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
It is easy to see from the plot that PC1 is essentially same as GNI
. This makes sense, as GNI
had much higher range and variance than the other variables (we did not actually plot the variance, but the quartiles for the original human
data tell the story).
# Normalize the mean and variance of each variable
human_std <- scale(human)
# Perform PCA on the normalized data
pca2 <- prcomp(human_std)
summary(pca2)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
We can see that PC1 now explains 53.6% of the variance, PC2 16.24%, PC3 9.6%, etc. This is much more sensible. Let’s see the plots. This time we also include custom captions for the labels to show what they describe and what percentage of variance they cover.
# Draw a biplot of the normalized PCA (first two components) with arrows and
# and useful captions
s = summary(pca2)
pca_pr <- round(100 * s$importance[2,], digits=1)
pca_lab <- paste0("Human data ", names(pca_pr), " (", pca_pr, "%)")
biplot(pca2, choices=1:2, xlab=pca_lab[1], ylab=pca_lab[2])
We can see that result is very different from the non-standardized case. This is because PCA identifies the dimensions with the highest variance. When some variable has much higher variance than the others, the highest overall variance will be closely aligned with that variable, and PC1 will be closely aligned with that variable. The plot will not give very much information about the overall relationships of the variables. The scale of the values for the original variable does not necessarily reflect the importance of the variable in any way. If the variables are normalized, each variable will have equal weight in how the principal components are selected. This usually gives much more useful results.
Wwe can see that PC1 and PC2 are not directly aligned with any particular variable, even though PC2 has high correlation with Parli.F
and Labo.FM
and PC1 has high correlation with the other variables.
I think we can interpret PC1 as describing variables that relate to material and technological wellbeing. Mat.Mor
, Ado.Birth
, and Life.Exp
reflect quality of health care, which probably correlates with general technical and scientific prowess. Edu.Exp
and Edu2.FM
indicate the quality and availability of education, and it makes sense that that is correlated with medical well being, health, and life expectancy. GNI
is also correlated with the same axis; it describes economic wealth, which usually correlates strongly with education and health (higher income enables providing health care and education, and education and health care also help achieve higher income due to likely having technology, medicines, and other things to sell others and the ability to produce them).
On the other hand, PC2 is correlated with Parli.F
and Labo.F
, which relate to the role of women in the society and particularly to the equality of women in the society. There are countries that are wealthy, have good education, and good health care, but where women don’t have a big role in public life (politics and workplace). I think some resource-rich countries (e.g., oil producers) may fall into this category. It could also have described many Western countries quite well just a few decades ago. While female education seems to be strongly correlated with economic and medical well-being, in this data it would seem that female participation in politics and labor force would not necessarily be so.
Let’s first load the tea dataset from FactMineR, look at its structure, and visualize the dataset.
# Load the tea dataset from FactoMineR
library(FactoMineR)
data(tea)
# Look at its structure
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# Visualize the dataset. There are too many variables to visualize it using
# ggpairs reasonably. Let's instead visualize it with a barplot.
library(miscset)
##
## Attaching package: 'miscset'
## The following object is masked from 'package:dplyr':
##
## collapse
ggplotGrid(ncol=4,
lapply(colnames(tea),
function(col) {
ggplot(tea, aes_string(col)) + geom_bar()
}))
Let’s then do Multiple Correspondence Analysis on the data. I’m going to only calculate it for the Tea
, How
, how
, sugar
, where
, and lunch
columns, as trying all columns does not seem to work out-of-the-box.
# Extract a subset of the data with only the selected columns
keep <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, one_of(keep))
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# Compute Multiple Correspondence Analysis of the tea data
mca1 <- MCA(tea_time, graph=FALSE)
# Print a summary of the model
summary(mca1)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
Looking at the summary, we can see that no dimension explains more than 15.3% of the variance, and that the 7th dimension still explains 7.8%. Thus there does not appear to be a simple way to map the observations to a very low-dimensional space cleanly.
Let’s then draw the a biplot containing the variables. We also include the individuals in the same plot.
# Visualize the model using a biplot from the factoextra package
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_mca_biplot(mca1, repel=TRUE, ggtheme=theme_minimal())
We can see that the two first dimensions together account for roughly 30% of the variance. Quite a few individuals are mapped to the exact same locations. It seems likely that they have identical values for variables that looked at.
The data wrangling for Exercise 6 is performed by the data/meet_and_repeat.R
script in my github repository, as specified in the instructions. It can be found here.
Let’s start by loading some libraries we might use.
library(ggplot2)
library(GGally)
library(corrplot)
library(dplyr)
library(tidyr)
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Registered S3 methods overwritten by 'lme4':
## method from
## cooks.distance.influence.merMod car
## influence.merMod car
## dfbeta.influence.merMod car
## dfbetas.influence.merMod car
# Set default figure size
knitr::opts_chunk$set(fig.width=12, fig.height=8)
Let’s load the data sets created by the data wrangling script (we’ll do the factor conversion later).
RATS <- read.csv("data/meet_RATS.csv", row.name=1)
RATSL <- read.csv("data/meet_RATSL.csv", row.name=1)
BPRS <- read.csv("data/meet_BPRS.csv", row.name=1)
BPRSL <- read.csv("data/meet_BPRSL.csv", row.name=1)
Ok, so it looks like basically we need to perform the analyses we did on Datacamp, but with the datasets swapped.
# First, convert ID and Group fields to factors
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
# Let's look at the data
names(RATSL)
## [1] "ID" "Group" "WD" "Weight" "Time"
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
summary(RATSL)
## ID Group WD Weight Time
## 1 : 11 1:88 WD1 :16 Min. :225.0 Min. : 1.00
## 2 : 11 2:44 WD15 :16 1st Qu.:267.0 1st Qu.:15.00
## 3 : 11 3:44 WD22 :16 Median :344.5 Median :36.00
## 4 : 11 WD29 :16 Mean :384.5 Mean :33.55
## 5 : 11 WD36 :16 3rd Qu.:511.2 3rd Qu.:50.00
## 6 : 11 WD43 :16 Max. :628.0 Max. :64.00
## (Other):110 (Other):80
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1,…
## $ WD <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD…
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 47…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8,…
# Display the data graphically
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
We can see that rats in Group 1 had consistently lower weight than those in Groups 2 and 3. However, there were only eight rats in Group 1, four in Group 2, and four in Group 3. These numbers are so low that making any statistical analyses based on them are fraught with uncertainty. That said, we are tasked with analyzing this data.
We can also see that in each group the rats are gaining weight as time passes.
Let’s see how the picture changes if we standardize the data for each data point (keeping in mind that the number of observations is too low to do this with any accuracy).
# Standardize Weight
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdWeight = (Weight - mean(Weight)) / sd(Weight)) %>%
ungroup()
str(RATSL)
## tibble [176 × 6] (S3: tbl_df/tbl/data.frame)
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Weight : int [1:176] 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int [1:176] 1 1 1 1 1 1 1 1 1 1 ...
## $ stdWeight: num [1:176] -1.001 -1.12 -0.961 -0.842 -0.882 ...
# Plot again with standardized Weight
ggplot(RATSL, aes(x = Time, y = stdWeight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized Weight")
We can now see clearly that some rats are growing faster than the others. Where the line goes down, it means that the rat is gaining wait slower than the mean of the group, or might even be losing weight.
Let’s now plot the group means and their error bars. We should note that it is impossible to calculate variance with any accuracy from just four observations, and correspondingly the magnitude of the error bars is only a very rough estimate that shouldn’t be given much statistical significance.
# Add mean and standard error computations. Note that the n in standard error
# computation should be the number of samples used to compute that mean,
# not the total number of samples in the whole dataset for all Group/Time
# combinations (the Datacamp exercise is in error).
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise(mean = mean(Weight), se = sd(Weight) / sqrt(length(Weight))) %>%
ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
# Note: In view of the warning about regrouping above, verify that groups
# are correct inside summarize
dim(RATSL)
## [1] 176 6
X <- RATSL %>% group_by(Group, Time) %>% summarize(cnt=length(Weight)) %>% ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
X$cnt
## [1] 8 8 8 8 8 8 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
# Yes, this matches what we see in the earlier plots - 8, 4, 4 rats in each
# group, for each timepoint. So looks like summarize() worked correctly
# despite the weird warning.
# Let's glimpse at the data with timepoint means and standard errors.
# We leave aside the question whether standard error is reliable with
# just 4 or 8 observations in each grouping... (it ISN'T very
# reliable with so few observations)
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36,…
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, …
## $ se <dbl> 5.381640, 4.629100, 4.057346, 4.808614, 3.909409, 4.166190, 3.8…
# Plot the mean profiles for each Group, with error bars
ggplot(RATSS,
aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.8,0.8)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
## Warning in if (position != "none") {: the condition has length > 1 and only the
## first element will be used
## Warning in if (position == "manual") {: the condition has length > 1 and only
## the first element will be used
## Warning in if (position == "left") {: the condition has length > 1 and only the
## first element will be used
## Warning in if (position == "right") {: the condition has length > 1 and only the
## first element will be used
## Warning in if (position == "bottom") {: the condition has length > 1 and only
## the first element will be used
## Warning in if (position == "top") {: the condition has length > 1 and only the
## first element will be used
## Warning in if (position == "manual") {: the condition has length > 1 and only
## the first element will be used
Let’s then check for utliers using a boxplot. (We should note, however, that the low number of observations makes any outlier detection very unreliable.)
# Quick check for outliers using a boxplot
RATSB <- RATSL %>%
filter(Time > 0) %>%
group_by(Group, ID) %>%
summarise(mean=mean(Weight) ) %>%
ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
glimpse(RATSB)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274…
ggplot(RATSB, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), weeks 1-8")
## Warning: `fun.y` is deprecated. Use `fun` instead.
We can see that there is a possible outlier in Group 2. However, given that there are only four rats in this group (which is already too little data), any variance estimate (and thus standard error and the quartiles) is highly suspect, and thus we cannot really justify removing any individuals from the data. They might not be outliers at all if we had a bit more data, and being already data-deprived, we would be even more so with only three rats.
Let’s then see if there is a statistically significant difference between the groups. We’ll use the dataset we created with the mean for each group, but for the reasons above we did not do outlier elimination. There is a compication here, as the t test only works between exactly two groups. For the purposes of this analysis, we’ll lumk groups 2 and 3 together and compare them againt Group 1. Let’s add an auxiliary field for this.
# Create a helper variable indicating whether the observation is in Group 1
RATSB$isGroup1 <- RATSB$Group == "1"
# Perform a t-test between being in group 1 and not being in group 1
t.test(mean ~ isGroup1, data=RATSB, var.equal=TRUE)
##
## Two Sample t-test
##
## data: mean by isGroup1
## t = 12.518, df = 14, p-value = 5.427e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 200.1507 282.9175
## sample estimates:
## mean in group FALSE mean in group TRUE
## 505.2500 263.7159
# Add a baseline from Time 1
RATSB$baseline <- RATS$WD1
# Fit a linear model with the mean as the response
fit <- lm(mean ~ baseline + isGroup1, data=RATSB)
fit
##
## Call:
## lm(formula = mean ~ baseline + isGroup1, data = RATSB)
##
## Coefficients:
## (Intercept) baseline isGroup1TRUE
## 78.6222 0.8865 -37.0852
# Compute an analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 252125 252125 2173.0397 7.423e-16 ***
## isGroup1 1 570 570 4.9159 0.04505 *
## Residuals 13 1508 116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that the data suggests the baseline (WD1) to be a very significant predictor of the group mean weight and being in Group 1 being significant at 95% level. However, even ignoring the unsuitability of the hypothesis testing method testing method for our case (the observations are not independent), we had so few data points that the means are inaccurate, and this significance prediction was based on the inaccurately computed means. This adds to uncertainty beyond what was considered by hypothesis testing, and we probably should not consider being in Group 1 as significant given the baseline weight.
Let’s then look at analyzing the BPRS (brief psychiatric rating scale) data. We first convert certain fields to factors, look at the data, and plot the data.
In analyzing this data, we must first note that the subject field does not uniquely identify the person. Instead, the data is structured so that the subject
field only identifies a subject within each treatment regime (logically one cannot administer the two treatments to be compared to the same person simultaneously in this kind of experiment). Also, the Datacamp exercises say the dataset contains data for 40 male subjects, but there are only 20 distinct values in the subject
field. For this purpose, I added a uniqueSubject
field in the data wrangling phase to uniquely identify each person, as such unique identification will be needed for mixed effects modeling as a random factor. Our analysis will use this field instead of subject
to identify the person, because subject
does not uniquely identify the individual!
# First, convert the treatment and subject fields to factors
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)
# This field was added in data wrangling to uniquely identify each subject,
# as discussed above and in the comments in the data wrangling script.
BPRSL$uniqueSubject <- factor(BPRSL$uniqueSubject)
# Let's look at the data
names(BPRSL)
## [1] "treatment" "subject" "weeks" "bprs"
## [5] "week" "uniqueSubject"
str(BPRSL)
## 'data.frame': 360 obs. of 6 variables:
## $ treatment : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
## $ uniqueSubject: Factor w/ 40 levels "1-1","1-2","10-1",..: 1 23 27 29 31 33 35 37 39 3 ...
summary(BPRSL)
## treatment subject weeks bprs week
## 1:180 1 : 18 week0 : 40 Min. :18.00 Min. :0
## 2:180 2 : 18 week1 : 40 1st Qu.:27.00 1st Qu.:2
## 3 : 18 week2 : 40 Median :35.00 Median :4
## 4 : 18 week3 : 40 Mean :37.66 Mean :4
## 5 : 18 week4 : 40 3rd Qu.:43.00 3rd Qu.:6
## 6 : 18 week5 : 40 Max. :95.00 Max. :8
## (Other):252 (Other):120
## uniqueSubject
## 1-1 : 9
## 1-2 : 9
## 10-1 : 9
## 10-2 : 9
## 11-1 : 9
## 11-2 : 9
## (Other):306
glimpse(BPRSL)
## Rows: 360
## Columns: 6
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ weeks <fct> week0, week0, week0, week0, week0, week0, week0, week0,…
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38,…
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ uniqueSubject <fct> 1-1, 2-1, 3-1, 4-1, 5-1, 6-1, 7-1, 8-1, 9-1, 10-1, 11-1…
ggplot(BPRSL, aes(x=week, y=bprs, linetype=uniqueSubject)) +
geom_line() +
scale_linetype_manual(values=rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller=label_both) +
scale_y_continuous(name="BPRS", limits=c(min(BPRSL$bprs), max(BPRSL$bprs))) +
theme(legend.position="top")
Visually, it looks like there is a downward trend with treatment 1. The situation looks less clear for treatment 2.
Let’s then try a linear regression.
# Create a linear regression model
BPRS_reg <- lm(bprs ~ week + treatment, data=BPRSL)
# Print a summary of the model
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
In this model, it looks like the improvement with time is statistically significant, but this model does not show statistical significance for the treatment regime. However, we should note that the observations are not independent as we have repeated observations from the same individuals, so the assumptions of the significance test were violated and it cannot really be relied upon here.
Let’s then try to construct a mixed model for the same variables, using uniqueSubject
as a random effect.
# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | uniqueSubject), data=BPRSL,
REML=FALSE)
# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | uniqueSubject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2582.9 2602.3 -1286.5 2572.9 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.27506 -0.59909 -0.06104 0.44226 3.15835
##
## Random effects:
## Groups Name Variance Std.Dev.
## uniqueSubject (Intercept) 97.39 9.869
## Residual 54.23 7.364
## Number of obs: 360, groups: uniqueSubject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.3521 19.750
## week -2.2704 0.1503 -15.104
## treatment2 0.5722 3.2159 0.178
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.256
## treatment2 -0.684 0.000
Let’s then construct the model so that we treat both treatment and uniqueSubject
as random effects.
# Create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | uniqueSubject),
data=BPRSL, REML=FALSE)
# Print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | uniqueSubject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2523.2 2550.4 -1254.6 2509.2 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4655 -0.5150 -0.0920 0.4347 3.7353
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## uniqueSubject (Intercept) 167.827 12.955
## week 2.331 1.527 -0.67
## Residual 36.747 6.062
## Number of obs: 360, groups: uniqueSubject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 45.9830 2.6470 17.372
## week -2.2704 0.2713 -8.370
## treatment2 1.5139 3.1392 0.482
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.545
## treatment2 -0.593 0.000
# Perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | uniqueSubject)
## BPRS_ref1: bprs ~ week + treatment + (week | uniqueSubject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2582.9 2602.3 -1286.5 2572.9
## BPRS_ref1 7 2523.2 2550.4 -1254.6 2509.2 63.663 2 1.499e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It would seem that the second model fits the data better than the first model at the 95% significance level (p=0.026).
Let’s plot again the original BPRS values.
ggplot(BPRSL, aes(x=week, y=bprs, group=uniqueSubject)) +
geom_line(aes(linetype=treatment)) +
scale_x_continuous(name="Week") +
facet_grid(. ~ treatment, labeller=label_both) +
scale_y_continuous(name="BPRS") +
theme(legend.position="top")
Let’s then plot the fit using the second model (BPRS_ref1
) for comparison.
# Compute the fitted values and add them as a column in RATSL
Fitted <- fitted(BPRS_ref1, BPRSL)
# Add a new column for the fitted data
BPRSL <- mutate(BPRSL, Fitted=Fitted)
# Draw a plot of BPRSL with the observed bprs values for each subject
ggplot(BPRSL, aes(x=week, y=Fitted, group=uniqueSubject)) +
geom_line(aes(linetype=treatment)) +
scale_x_continuous(name="Week") +
facet_grid(. ~ treatment, labeller=label_both) +
scale_y_continuous(name="Fitted BPRS") +
theme(legend.position="top")
I have to say that using a non-unique subject idenfier in the BPRS dataset was a rather devious twist in the exercise.
Thoughts after the first week:
git pull
or git push
did not work properly with SSH keys (it asked for git username and password). I had to tweak it manually to make SSH keys work correctly.Thoughts for the second week:
Thoughts after the third week
Thoughts after the fourth week
Thoughts after the fifth week
n
in computing the standard error.