List

Here I work through the practice questions in Chapter 6, “Overfitting, Regularization, and Information Criteria,” of Statistical Rethinking (McElreath, 2016). I do my best to use only approaches and functions discussed so far in the book, as well as to name objects consistently with how the book does. If you find any typos or mistakes in my answers, or if you have any relevant questions, please feel free to add a comment below.

Here is the chapter summary from page 205:

This chapter has been a marathon. It began with the problem of overfitting, a universal phenomenon by which models with more parameters fit a sample better, even when the additional parameters are meaningless. Two common tools were introduced to address overfitting: regularizing priors and information criteria. Regularizing priors reduce overfitting during estimation, and information criteria help estimate the degree of overfitting. Practical functions compare() and ensemble() in the rethinking package were introduced to help analyze collections of models fit to the same data. In the chapters to follow, these tools will be applied to both new and old data examples. In all cases, keep in mind that these tools are heuristic. They provide no guarantees. No statistical procedure will ever substitute for iterative scientific investigation.

# Setup
library(rethinking)
data(Howell1)
set.seed(6)

Easy Difficulty

Practice Question 6E1

State the three motivating criteria that define information entropy. Try to express each in your own words.

The motivating criteria are described on pages 177-178. In my own words, they are that a measure of uncertainty should (1) be measured on a continuous scale such that the spacing between adjacent values is consistent, (2) capture the size of the possibility space such that its value scales with the number of possible outcomes, and (3) be additive for independent events such that it does not matter how the events are divided. This last criterion is the most confusing, but is also very important. It states that dividing a possibility space into four events (e.g., rain/hot, rain/cold, shine/hot, shine/cold) provides the same amount of information as dividing this same possibility space into two sets of two events (e.g., rain-or-shine and hot-or-cold).


Practice Question 6E2

Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?

Entropy is defined on page 178. We just need to plug in the appropriate probabilities.

\[H(p) = -\sum_{i=1}^n p_i \log(p_i) = -(p_H \log(p_H)+p_T \log(p_T))\]

p <- c(0.7, 1 - 0.7)
(H <- -sum(p * log(p)))
## [1] 0.6108643

The entropy of this coin is 0.61.


Practice Question 6E3

Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20%, “2” 25%, “3” 25%, and “4” 30% of the time. What is the entropy of this die?

We can use the same approach as above (page 178).

p <- c(0.20, 0.25, 0.25, 0.30)
(H <- -sum(p * log(p)))
## [1] 1.376227

The entropy of this die is 1.38.


Practice Question 6E4

Suppose another four-sided die is loaded such that it never shows “4”. The other three sides show equally often. What is the entropy of this die?

We can use the same approach as above (page 178). The only complication is that we need to use L’Hopital’s rule and ignore the side that never shows (page 179).

p <- c(1/3, 1/3, 1/3)
(H <- -sum(p * log(p)))
## [1] 1.098612

The entropy of the die is 1.10.


Medium Difficulty

Practice Question 6M1

Write down and compare the definitions of AIC, DIC, and WAIC. Which of these criteria is most general? Which assumptions are required to transform a more general criterion into a less general one?

AIC is defined as \(D_{train}+2p\) where \(D_{train}\) is the in-sample training deviance and \(p\) is the number of free parameters estimated in the model (page 189).

DIC is defined as \(\bar{D}+p_D=\bar{D}+(\bar{D}+\hat{D})\) where \(\bar{D}\) is the average of the posterior distribution of deviance and \(\hat{D}\) is the deviance calculated at the posterior mean (page 190).

WAIC is defined as \(-2(\mathrm{lppd}-p_{WAIC})=-2(\sum_{i=1}^N \log\Pr(y_i) – \sum_{i=1}^N V(y_i))\) where \(\Pr(y_i)\) is the average likelihood of observation \(i\) in the training sample and \(V(y_i)\) is the variance in log-likelihood for observation \(i\) in the training sample (pages 191-192).

All three definitions involve two components: an estimate or analog of the in-sample training deviance (i.e., \(D_{train}\) for AIC, \(\bar{D}\) for DIC, and \(\mathrm{lppd}\) for WAIC) and an estimate or analog for the number of free parameters estimated in the model (i.e., \(p\) in AIC, \(p_D\) in DIC, and \(p_{WAIC}\) in WAIC).

WAIC is the most general, followed by DIC, and finally AIC (page 190). To move from WAIC to DIC, we must assume that the posterior distribution is approximately multivariate Gaussian. To move from DIC to AIC, we must further assume that the priors are flat or overwhelmed by the likelihood.


Practice Question 6M2

Explain the difference between model selection and model averaging. What information is lost under model selection? What information is lost under model averaging?

Model selection is choosing to retain (e.g., use and interpret) the model with the lowest information criterion value and to discard all other models with higher values (page 195). This practice loses information about relative model accuracy contained in the differences among information criterion values; this is especially problematic when the selected model only outperforms its alternatives to a small degree. Model averaging is using Bayesian information criteria to construct a posterior predictive distribution (i.e., make predictions) that leverages the uncertainty in multiple models (page 196). This practice does not lose information on its own. However, when combined with undisclosed data dredging (page 205), it can lead to spurious findings (e.g., non-generalizable coincidence).


Practice Question 6M3

When comparing models with an information criterion, why must all models be fit to exactly the same observations? What would happen to the information criterion values, if the models were fit to different numbers of observations? Perform some experiments, if you are not sure.

Information criteria are based on deviance, which is accrued over observations without being divided by the number of observations (page 182). Thus, it is a sum and not an average. So, all else being equal, a model with more observations will have a higher deviance and thus worse accuracy according to information criteria. It would be an unfair comparison to contrast models fit to different numbers of observations.

As an experiment, we can calculate WAIC for models fit to increasingly small subsamples of the same data. The information criteria should decrease alongside the sample size. In order to get a large sample to begin with, I will return to the Howell1 database from last chapter.

d <- Howell1[complete.cases(Howell1), ]
d_500 <- d[sample(1:nrow(d), size = 500, replace = FALSE), ]
d_400 <- d[sample(1:nrow(d), size = 400, replace = FALSE), ]
d_300 <- d[sample(1:nrow(d), size = 300, replace = FALSE), ]
m_500 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)
  ),
  data = d_500,
  start = list(a = mean(d_500$height), b = 0, sigma = sd(d_500$height))
)
m_400 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)
  ),
  data = d_400,
  start = list(a = mean(d_400$height), b = 0, sigma = sd(d_400$height))
)
m_300 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * log(weight)
  ),
  data = d_300,
  start = list(a = mean(d_300$height), b = 0, sigma = sd(d_300$height))
)
(model.compare <- compare(m_500, m_400, m_300))
## Warning in compare(m_500, m_400, m_300): Different numbers of observations found for at least two models.
## Information criteria only valid for comparing models fit to exactly same observations.
## Number of observations for each model:
## m_500 500 
## m_400 400 
## m_300 300
## Warning in waic_ptw1 - waic_ptw2: longer object length is not a multiple of
## shorter object length

## Warning in waic_ptw1 - waic_ptw2: longer object length is not a multiple of
## shorter object length

## Warning in waic_ptw1 - waic_ptw2: longer object length is not a multiple of
## shorter object length
##         WAIC pWAIC  dWAIC weight    SE   dSE
## m_300 1835.6   3.6    0.0      1 29.97    NA
## m_400 2457.5   3.3  621.9      0 31.90 44.00
## m_500 3071.2   3.2 1235.6      0 35.49 50.32

Indeed, the WAIC increased from the \(N=300\) model to the \(N=400\) model and again to the \(N=500\) model. The compare() function also returns a warning about the number of observations being different.


Practice Question 6M4

What happens to the effective number of parameters, as measured by DIC or WAIC, as a prior becomes more concentrated? Why? Perform some experiments, if you are not sure.

As the prior becomes more concentrated (e.g., as in a regularizing prior), the effective number of parameters decreases. In the case of DIC, this is because \(p_D\) is a measure of how flexible the model is (page 191); with more concentrated priors, the model becomes less flexible (and therefore less prone to overfitting). In the case of WAIC, this is because \(p_{WAIC}\) is a measure of the variance in the log-likelihood for each observation in the training sample (page 191); with more concentrated priors, the likelihood will become more concentrated as well and thus variance will decrease.

As an experiment to demonstrate this, we can calculate WAIC for two models of the same data that only differ in the concentration of their priors.

d <- Howell1[complete.cases(Howell1), ]
d$height.log <- log(d$height)
d$height.log.z <- (d$height.log - mean(d$height.log)) / sd(d$height.log)
d$weight.log <- log(d$weight)
d$weight.log.z <- (d$weight.log - mean(d$weight.log)) / sd(d$weight.log)
m_wide <- map(
  alist(
    height.log.z ~ dnorm(mu, sigma),
    mu <- a + b * weight.log.z,
    a ~ dnorm(0, 10),
    b ~ dnorm(1, 10),
    sigma ~ dunif(0, 10)
  ),
  data = d
)
m_narrow <- map(
  alist(
    height.log.z ~ dnorm(mu, sigma),
    mu <- a + b * weight.log.z,
    a ~ dnorm(0, 0.10),
    b ~ dnorm(1, 0.10),
    sigma ~ dunif(0, 1)
  ),
  data = d
)
WAIC(m_wide, refresh = 0)
## [1] -102.7533
## attr(,"lppd")
## [1] 55.64182
## attr(,"pWAIC")
## [1] 4.265166
## attr(,"se")
## [1] 36.52922
WAIC(m_narrow, refresh = 0)
## [1] -102.7489
## attr(,"lppd")
## [1] 55.60564
## attr(,"pWAIC")
## [1] 4.231208
## attr(,"se")
## [1] 36.47835

Indeed, the \(p_{WAIC}\) decreases as the priors become more concentrated. However, note that the regularizing priors must be chosen carefully in order for this to occur (see 6M6).


Practice Question 6M5

Provide an informal explanation of why informative priors reduce overfitting.

Informative priors reduce overfitting because they constrain the flexibility of the model; they make it less likely for extreme parameter values to be assigned high posterior probability. In simpler terms, informative priors reduce overfitting by forcing the model to learn less from the sample data.


Practice Question 6M6

Provide an informal explanation of why overly informative priors result in underfitting.

Overly informative priors result in underfitting because they constrain the flexibility of the model too much; they make it less likely for “correct” parameter values to be assigned high posterior probability. In simpler terms, overly informative priors result in underfitting by preventing the model from learning enough from the sample data.


Hard Difficulty

All practice problems to follow use the same data. Pull out the old Howell !Kung demography data and split it into two equally sized data frames. Here’s code to do it:

d <- Howell1

d$age <- (d$age - mean(d$age)) / sd(d$age)

set.seed(1000)

i <- sample(1:nrow(d), size = nrow(d) / 2)

d1 <- d[i, ]

d2 <- d[-i, ]

You now have two randomly formed data frames, each with 272 rows. The notion here is to use the cases in d1 to fit models and the cases in d2 to evaluate them. The set.seed() command just ensures that everyone works with the same randomly shuffled data.

Now let \(h_i\) and \(x_i\) be the height and centered age values, respectively, on row \(i\). Fit the following models to the data in d1:

\[
\begin{aligned}
\mathcal{M}_1: h_i &\sim \mathrm{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_1 x_i \\
\mathcal{M}_2: h_i &\sim \mathrm{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_1 x_i + \beta_2 x_i^2 \\
\mathcal{M}_3: h_i &\sim \mathrm{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 \\
\mathcal{M}_4: h_i &\sim \mathrm{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \beta_4 x_i^4 \\
\mathcal{M}_5: h_i &\sim \mathrm{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \beta_4 x_i^4 + \beta_5 x_i^5 \\
\mathcal{M}_6: h_i &\sim \mathrm{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \beta_4 x_i^4 + \beta_5 x_i^5 + \beta_6 x_i^6
\end{aligned}
\]

Use map() to fit these. Use weakly regularizing priors for all parameters.

Note that fitting all of these polynomials to the height-by-age relationship is not a good way to derive insight. It would be better to have a simpler approach that would allow for more insight, like perhaps a piecewise linear model. But the set of polynomial families above will serve to help you practice and understand model comparison and averaging.

Practice Question 6H1

Compare the models above, using WAIC. Compare the model rankings, as well as the WAIC weights.

First let’s prep the data.

d <- Howell1
d$age <- (d$age - mean(d$age)) / sd(d$age)
set.seed(1000)
i <- sample(1:nrow(d), size = nrow(d) / 2)
d1 <- d[i, ]
d2 <- d[-i, ]

Now we can fit the models to d1 using map():

m1 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b1 * age,
    a ~ dnorm(140, 30),
    b1 ~ dnorm(0, 50), 
    sigma ~ dnorm(30, 10)
  ),
  data = d1
)
m2 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b1 * age + b2 * I(age^2),
    a ~ dnorm(140, 30),
    b1 ~ dnorm(0, 50), 
    b2 ~ dnorm(0, 50),
    sigma ~ dnorm(30, 10)
  ),
  data = d1
)
m3 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b1 * age + b2 * I(age^2) + b3 * I(age^3),
    a ~ dnorm(140, 30),
    b1 ~ dnorm(0, 50),
    b2 ~ dnorm(0, 50),
    b3 ~ dnorm(0, 50),
    sigma ~ dnorm(30, 10)
  ),
  data = d1
)
m4 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b1 * age + b2 * I(age^2) + b3 * I(age^3) + b4 * I(age^4),
    a ~ dnorm(140, 30),
    b1 ~ dnorm(0, 50), 
    b2 ~ dnorm(0, 50),
    b3 ~ dnorm(0, 50),
    b4 ~ dnorm(0, 50),
    sigma ~ dnorm(30, 10)
  ),
  data = d1
)
m5 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b1 * age + b2 * I(age^2) + b3 * I(age^3) + b4 * I(age^4) + b5 * I(age^5),
    a ~ dnorm(140, 30),
    b1 ~ dnorm(0, 50), 
    b2 ~ dnorm(0, 50),
    b3 ~ dnorm(0, 50),
    b4 ~ dnorm(0, 50),
    b5 ~ dnorm(0, 50),
    sigma ~ dnorm(30, 10)
  ),
  data = d1
)
m6 <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b1 * age + b2 * I(age^2) + b3 * I(age^3) + b4 * I(age^4) + b5 * I(age^5) + b6 * I(age^6),
    a ~ dnorm(140, 30),
    b1 ~ dnorm(0, 50), 
    b2 ~ dnorm(0, 50),
    b3 ~ dnorm(0, 50),
    b4 ~ dnorm(0, 50),
    b5 ~ dnorm(0, 50),
    b6 ~ dnorm(0, 50),
    sigma ~ dnorm(30, 10)
  ),
  data = d1
)

Now let’s compare the models with WAIC:

(comparison <- compare(m1, m2, m3, m4, m5, m6, refresh = 0))
##      WAIC pWAIC dWAIC weight    SE   dSE
## m4 1926.0   5.6   0.0   0.56 25.30    NA
## m5 1927.5   6.4   1.4   0.28 25.16  1.07
## m6 1928.6   7.7   2.5   0.16 24.88  2.80
## m3 1952.3   5.4  26.2   0.00 24.03 10.73
## m2 2149.8   5.0 223.7   0.00 22.38 26.48
## m1 2395.4   3.4 469.4   0.00 22.78 30.84

The ranking of the models is as follows, in descending order from best to worst WAIC: \(\mathcal{M}_4\), \(\mathcal{M}_5\), \(\mathcal{M}_6\), \(\mathcal{M}_3\), \(\mathcal{M}_2\), \(\mathcal{M}_1\). The Akaike weights indicate that \(\mathcal{M}_4\) has 56% of the weight, \(\mathcal{M}_5\) has 28% of the weight, and \(\mathcal{M}_6\) has 16% of the weight; all other models had weights of zero.


Practice Question 6H2

For each model, produce a plot with model averaged mean and 97% confidence interval of the mean, superimposed on the raw data. How do the predictions differ across models?

To begin, let’s get the prediction data ready.

age.seq <- seq(from = -2.0, to = 4.0, length.out = 30)
d.predict <- list(
  height = rep(0, 30),
  age = age.seq
)

Now we can make a quick function to create the plot. The creation of functions has not been taught in the class, but is a pretty important part of R that will save us some repetition here. If you don’t understand functions, just imagine ignoring the first and last lines of the next chunk and replace m with whatever model you are interested in. That’s basically what the function does anyway.

age_plot <- function(m, d1, d.predict) {
  pred <- link(m, data = d.predict, refresh = 0)
  mu <- apply(pred, 2, mean)
  mu.PI <- apply(pred, 2, PI, prob = 0.97)
  plot(height ~ age, d1, col = rangi2)
  lines(d.predict$age, mu)
  shade(mu.PI, d.predict$age)
}
age_plot(m1, d1, d.predict)

plot of chunk unnamed-chunk-11

age_plot(m2, d1, d.predict)

plot of chunk unnamed-chunk-11

age_plot(m3, d1, d.predict)

plot of chunk unnamed-chunk-11

age_plot(m4, d1, d.predict)

plot of chunk unnamed-chunk-11

age_plot(m5, d1, d.predict)

plot of chunk unnamed-chunk-11

age_plot(m6, d1, d.predict)

plot of chunk unnamed-chunk-11

The predictions differ pretty dramatically across models. \(\mathcal{M}_1\) is underfit and only makes good predictions when age is around -1 and 1. \(\mathcal{M}_2\) does a better job but makes poor predictions below -1.5 and above 1.5. \(\mathcal{M}_3\) does a better job still but introduces an unlikely upward tick after around 2.0. \(\mathcal{M}_4\) does the best job of all, but has a wide fanning out of its interval after around 2.0. \(\mathcal{M}_5\) and \(\mathcal{M}_6\) are similar to \(\mathcal{M}_4\) for the majority of the age range, until around 2.5 when they both fan out even worse around either a down or up-tick.


Practice Question 6H3

Now also plot the model averaged predictions, across all models. In what ways do the averaged predictions differ from the predictions of the model with the lowest WAIC value?

We can use the ensemble() function and reuse the d.predict data frame (page 204).

# Plot the lowest WAIC model with dashed lines
pred.m4 <- link(m4, data = d.predict, refresh = 0)
mu <- apply(pred.m4, 2, mean)
mu.PI <- apply(pred.m4, 2, PI, prob = 0.97)
plot(height ~ age, d1, col = rangi2)
lines(age.seq, mu, lty = 2)
lines(age.seq, mu.PI[1, ], lty = 2)
lines(age.seq, mu.PI[2, ], lty = 2)
# Calculate and plot model averaged predictions
height.ensemble <- ensemble(m1, m2, m3, m4, m5, m6, data = d.predict, refresh = 0)
mu <- apply(height.ensemble$link, 2, mean)
mu.PI <- apply(height.ensemble$link, 2, PI, prob = 0.97)
lines(age.seq, mu)
shade(mu.PI, age.seq)

plot of chunk unnamed-chunk-12

The averaged predictions only differ from the lowest WAIC model (i.e., \(\mathcal{M}_4\)) by showing wider confidence intervals at the extreme ends of the age range. It is thus more conservative and communicates the uncertainty in this age range more clearly.


Practice Question 6H4

Compute the test-sample deviance for each model. This means calculating deviance, but using the data in d2 now. You can compute the log-likelihood of the height data with:

sum(dnorm(d2$height, mu, sigma, log = TRUE))

where mu is a vector of predicted means (based upon age values and MAP parameters) and sigma is the MAP standard deviation.

We can use the approach on page 183 to compute the test-sample deviance. The important intuitions here are that we need to calculate mu for each observation in d2 and then multiply the resulting log-likelihood by -2.

# M1
theta <- coef(m1)
(m1.d <- -2 * sum(
  dnorm(
    x = d2$height, 
    mean = theta[1] + theta[2]*d2$age,
    sd = theta[3], 
    log = TRUE
  )
))
## [1] 2421.868
# M2 
theta <- coef(m2)
(m2.d <- -2 * sum(
  dnorm(
    x = d2$height, 
    mean = theta[1] + theta[2]*d2$age + theta[3]*d2$age^2,
    sd = theta[4],
    log = TRUE
  )
))
## [1] 2137.56
# M3 
theta <- coef(m3)
(m3.d <- -2 * sum(
  dnorm(
    x = d2$height, 
    mean = theta[1] + theta[2]*d2$age + theta[3]*d2$age^2 + theta[4]*d2$age^3,
    sd = theta[5],
    log = TRUE
  )
))
## [1] 1932.346
# M4
theta <- coef(m4)
(m4.d <- -2 * sum(
  dnorm(
    x = d2$height, 
    mean = theta[1] + theta[2]*d2$age + theta[3]*d2$age^2 + theta[4]*d2$age^3 + theta[5]*d2$age^4,
    sd = theta[6],
    log = TRUE
  )
))
## [1] 1876.83
# M5
theta <- coef(m5)
(m5.d <- -2 * sum(
  dnorm(
    x = d2$height, 
    mean = theta[1] + theta[2]*d2$age + theta[3]*d2$age^2 + theta[4]*d2$age^3 + theta[5]*d2$age^4 +
      theta[6]*d2$age^5,
    sd = theta[7],
    log = TRUE
  )
))
## [1] 1878.554
# M6
theta <- coef(m6)
(m6.d <- -2 * sum(
  dnorm(
    x = d2$height, 
    mean = theta[1] + theta[2]*d2$age + theta[3]*d2$age^2 + theta[4]*d2$age^3 + theta[5]*d2$age^4 +
      theta[6]*d2$age^5 + theta[7]*d2$age^6,
    sd = theta[8],
    log = TRUE
  )
))
## [1] 1877.788

Practice Question 6H5

Compare the deviances from 6H4 to the WAIC values. It might be easier to compare if you subtract the smallest value in each list from the others. For example, subtract the minimum WAIC from all of the WAIC values so that the best WAIC is normalized to zero. Which model makes the best out-of-sample predictions in this case? Does WAIC do a good job of estimating the test deviance?

Let’s begin by collecting the deviance and WAIC values in a data frame.

model.stats <- data.frame(
  Model = c("M1", "M2", "M3", "M4", "M5", "M6"),
  Deviance = c(m1.d, m2.d, m3.d, m4.d, m5.d, m6.d),
  WAIC = c(WAIC(m1, refresh = 0), WAIC(m2, refresh = 0), WAIC(m3, refresh = 0),
    WAIC(m4, refresh = 0), WAIC(m5, refresh = 0), WAIC(m6, refresh = 0))
)
model.stats$Deviance.C <- model.stats$Deviance - min(model.stats$Deviance)
model.stats$WAIC.C <- model.stats$WAIC - min(model.stats$WAIC)
model.stats
##   Model Deviance     WAIC  Deviance.C     WAIC.C
## 1    M1 2421.868 2395.402 545.0372266 469.070460
## 2    M2 2137.560 2150.128 260.7297604 223.795917
## 3    M3 1932.346 1953.196  55.5152180  26.863549
## 4    M4 1876.830 1926.332   0.0000000   0.000000
## 5    M5 1878.554 1927.757   1.7240836   1.425026
## 6    M6 1877.788 1928.429   0.9573314   2.096813

\(\mathcal{M}_4\) makes the best out-of-sample predictions as well as in-sample predictions. The ranking of models is also very similar by both WAIC and out-of-sample deviance. The only differences in ranking were swapping \(\mathcal{M}_5\) and \(\mathcal{M}_6\), although both were similar in magnitude. In this case, WAIC did a good job of estimating test deviance, although it tended to underestimate deviance for poorly fitting models. It’s also worth noting that we assigned observations to train and test sets randomly, which increases the likelihood that they are representative of one another.


Practice Question 6H6

Consider the following model.

\begin{aligned}
h_i &\sim \mathrm{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \beta_4 x_i^4 + \beta_5 x_i^5 + \beta_6 x_i^6 \\
\beta_1 &\sim \mathrm{Normal}(0, 5) \\
\beta_2 &\sim \mathrm{Normal}(0, 5) \\
\beta_3 &\sim \mathrm{Normal}(0, 5) \\
\beta_4 &\sim \mathrm{Normal}(0, 5) \\
\beta_5 &\sim \mathrm{Normal}(0, 5) \\
\beta_6 &\sim \mathrm{Normal}(0, 5) \\
\end{aligned}

and assume flat (or nearly flat) priors on \(\alpha\) and \(\sigma\). This model contains more strongly regularizing priors on the coefficients.

First, fit this model to the data in d1. Report the MAP estimates and plot the implied predictions. Then compute the out-of-sample deviance using the data in d2, using MAP estimates from the model fit to d1 only. How does this model, using regularizing priors, compare to the best WAIC model from earlier? How do you interpret this result?

First, let’s calculate and report the MAP estimates.

m <- map(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b1*age + b2*I(age^2) + b3*I(age^3) + b4*I(age^4) + b5*I(age^5) + b6*I(age^6),
    a ~ dunif(50, 200),
    b1 ~ dnorm(0, 5),
    b2 ~ dnorm(0, 5),
    b3 ~ dnorm(0, 5),
    b4 ~ dnorm(0, 5),
    b5 ~ dnorm(0, 5),
    b6 ~ dnorm(0, 5),
    sigma ~ dunif(0, 100)
  ),
  data = d1
)
precis(m)
##         Mean StdDev   5.5%  94.5%
## a     155.86   0.88 154.46 157.26
## b1      5.97   1.81   3.08   8.86
## b2    -16.60   2.13 -20.01 -13.20
## b3     12.09   2.72   7.75  16.44
## b4     -3.51   1.16  -5.37  -1.65
## b5      0.23   1.04  -1.43   1.89
## b6      0.05   0.33  -0.48   0.58
## sigma   8.20   0.36   7.63   8.77

Next, let’s plot the implied predictions using the age_plot() function we created earlier.

age_plot(m, d1, d.predict)

plot of chunk unnamed-chunk-16

Finally, let’s compute the out-of-sample deviance using the data in d2.

theta <- coef(m)
(m.d <- -2 * sum(
  dnorm(
    x = d2$height, 
    mean = theta[1] + theta[2]*d2$age + theta[3]*d2$age^2 + theta[4]*d2$age^3 + theta[5]*d2$age^4 +
      theta[6]*d2$age^5 + theta[7]*d2$age^6,
    sd = theta[8],
    log = TRUE
  )
))
## [1] 1875.434
m.d - m4.d
## [1] -1.396751

This out-of-sample deviance estimate is slightly better than that of \(\mathcal{M}_4\) (i.e., the best WAIC model from earlier). I interpret this result as an example of regularizing priors allowing the model to learn some, but not too much, from the data. For examples, this model includes fifth and sixth order polynomial terms, but their estimates are very imprecise in this case (more so than in \(\mathcal{M}_6\)).


Session Info

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows >= 8 x64 (build 9200)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] RWordPress_0.2-3     knitr_1.21           rethinking_1.59     
## [4] rstan_2.18.2         StanHeaders_2.18.0-1 ggplot2_3.1.0       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0         mvtnorm_1.0-8      lattice_0.20-35   
##  [4] prettyunits_1.0.2  ps_1.3.0           assertthat_0.2.0  
##  [7] rprojroot_1.3-2    digest_0.6.18      mime_0.6          
## [10] R6_2.3.0           plyr_1.8.4         backports_1.1.3   
## [13] stats4_3.5.1       evaluate_0.12      coda_0.19-2       
## [16] highr_0.7          pillar_1.3.1       rlang_0.3.0.1     
## [19] curl_3.2           lazyeval_0.2.1     rstudioapi_0.8    
## [22] callr_3.1.1        rmarkdown_1.11     desc_1.2.0        
## [25] devtools_2.0.1     stringr_1.3.1      loo_2.0.0         
## [28] RCurl_1.95-4.11    munsell_0.5.0      compiler_3.5.1    
## [31] xfun_0.4           pkgconfig_2.0.2    base64enc_0.1-3   
## [34] pkgbuild_1.0.2     htmltools_0.3.6    tidyselect_0.2.5  
## [37] tibble_1.4.2       gridExtra_2.3      matrixStats_0.54.0
## [40] XML_3.98-1.16      crayon_1.3.4       dplyr_0.7.8       
## [43] withr_2.1.2        bitops_1.0-6       MASS_7.3-50       
## [46] grid_3.5.1         jsonlite_1.6       gtable_0.2.0      
## [49] magrittr_1.5       scales_1.0.0       cli_1.0.1         
## [52] stringi_1.2.4      fs_1.2.6           remotes_2.0.2     
## [55] bindrcpp_0.2.2     testthat_2.0.1     tools_3.5.1       
## [58] glue_1.3.0         markdown_0.9       purrr_0.2.5       
## [61] rsconnect_0.8.12   processx_3.2.1     pkgload_1.0.2     
## [64] yaml_2.2.0         inline_0.3.15      colorspace_1.3-2  
## [67] sessioninfo_1.1.1  memoise_1.1.0      bindr_0.1.1       
## [70] XMLRPC_0.3-1       usethis_1.4.0

References

McElreath, R. (2016). Statistical rethinking: A Bayesian course with examples in R and Stan. New York, NY: CRC Press.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

  Posts

January 5th, 2019

Statistical Rethinking: Chapter 7 Practice

Here I work through the practice questions in Chapter 7, “Interactions,” of Statistical Rethinking (McElreath, 2016). I do my best […]

December 31st, 2018

Statistical Rethinking: Chapter 6 Practice

Here I work through the practice questions in Chapter 6, “Overfitting, Regularization, and Information Criteria,” of Statistical Rethinking (McElreath, 2016). […]

December 27th, 2018

Statistical Rethinking: Chapter 5 Practice

Here I work through the practice questions in Chapter 5, “Multivariate Linear Models,” of Statistical Rethinking (McElreath, 2016). I do […]

December 26th, 2018

Statistical Rethinking: Chapter 4 Practice

Here I work through the practice questions in Chapter 4, “Linear Models,” of Statistical Rethinking (McElreath, 2016). I do my […]

December 25th, 2018

Statistical Rethinking: Chapter 3 Practice

Here I work through the practice questions in Chapter 3, “Sampling the Imaginary,” of Statistical Rethinking (McElreath, 2016). I do […]

December 23rd, 2018

Statistical Rethinking: Chapter 2 Practice

Here I work through the practice questions in Chapter 2, “Small Worlds and Large Worlds,” of Statistical Rethinking (McElreath, 2016). […]

October 11th, 2018

Data Visualization Principles and Practices

Data Visualization Principles and Practice Tutorial on the principles and practice of data visualization, including an introduction to the layered […]

May 14th, 2018

Statistical Methods for Affective Computing (FG 2018)

Statistical Methods for Affective Computing Tutorial presented at the IEEE International Conference on Automatic Face and Gesture Recognition Statistical methods […]