To show off how R can help you explore interesting and even fun questions using data that is freely available online, I thought I’d put together a quick tutorial.

First, I will download the most recent “basic information” datafile from the Internet Movie Database (IMDB) and explore the length (i.e., runtime) of movies. To do so, I will use functions from base R and the tidyverse family of packages.

# Load packages
library(tidyverse)
library(glue)

To download the file, we can use the aptly named download.file() function, and to give it somewhere temporary to live on disk, we can use the tempfile() function, which returns the path of a temporary file. Of course, we could also have downloaded the file using a web browser and then loaded it into R from wherever it was saved.

# Download data file from IMDB
url <- ""
tmp <- tempfile()
download.file(url, tmp)

Next, we need to read the data from the temporary file, which we know (from its file extension) is a tab-separated values (tsv) file that has been compressed using gzip. So we need to uncompress it using the gzfile() function and then read the tsv data using the read_tsv() function. We can spell out the file’s formatting by passing additional arguments (e.g., col_names, quote, na, and col_types) to the read_tsv() function. This process will take a little while.

# Import downloaded data
imdb_all <- readr::read_tsv(
  file = gzfile(tmp),
  col_names = TRUE, 
  quote = "",
  na = "\\N",
  col_types = cols(
    tconst = col_character(),
    titleType = col_character(),
    primaryTitle = col_character(),
    originalTitle = col_character(),
    isAdult = col_logical(),
    startYear = col_integer(),
    endYear = col_integer(),
    runtimeMinutes = col_double(),
    genres = col_character()
  ),
  progress = FALSE
)
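To see what the quote and na settings do without waiting on the full download, here is a minimal sketch that writes a tiny gzipped tsv file and reads it back. It uses base R's read.delim() rather than read_tsv() so that it runs without extra packages, and the three rows are made up for illustration (they are not real IMDB records).

```r
# Toy stand-in for the IMDB file: tab-separated, with \N marking missing values
# (these rows are invented for illustration)
toy_lines <- c(
  "tconst\tprimaryTitle\tstartYear\truntimeMinutes",
  "tt0000001\tFirst Toy Movie\t1918\t65",
  "tt0000002\tSecond \"Toy\" Movie\t1950\t\\N",
  "tt0000003\tThird Toy Movie\t\\N\t90"
)

# tempfile() only returns a path; the file exists once we write to it
toy_path <- tempfile(fileext = ".tsv.gz")
con <- gzfile(toy_path, open = "wt")
writeLines(toy_lines, con)
close(con)

# Read it back: gzfile() decompresses on the fly, quote = "" treats quotation
# marks in titles as ordinary characters, and na.strings maps \N to NA
toy <- read.delim(
  gzfile(toy_path),
  quote = "",
  na.strings = "\\N",
  stringsAsFactors = FALSE
)
print(toy)

# Clean up the temporary file
unlink(toy_path)
```

Without quote = "", the quotation marks in the second title would be treated as field delimiters and stripped, which is why the main import turns quoting off.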

Now that we have imported the data into the imdb_all data frame, we can select a subset of columns and observations. For the purposes of this tutorial, let’s use the filter() function to exclude non-movies (e.g., TV series, shorts, and video games), adult movies, movies that are over 4 hours long (these are rare at only 0.287% of all movies), and movies released before 1918 or after 2018. Let’s also select just each movie’s primary title, release year, runtime, and genre listing. Finally, let’s sort by release year and then by title and output a preview of the resulting data.

# Enforce exclusion criteria
imdb_sel <- imdb_all %>% 
  dplyr::filter(
    titleType == "movie",  # exclude non-movies
    isAdult == 0,          # exclude adult movies
    runtimeMinutes <= 240, # exclude movies over 4 hours long
    startYear >= 1918,     # exclude movies more than 100 years old
    startYear <= 2018      # exclude movies from incomplete years
  ) %>% 
  dplyr::select(primaryTitle, startYear, runtimeMinutes, genres) %>% 
  dplyr::arrange(startYear, primaryTitle) %>% 
  print()
## # A tibble: 307,120 x 4
##    primaryTitle           startYear runtimeMinutes genres        
##    <chr>                      <int>          <dbl> <chr>         
##  1 'Blue Blazes' Rawden        1918             65 Drama,Western 
##  2 $5,000 Reward               1918             50 Mystery       
##  3 500 Pounds Reward           1918             55 <NA>          
##  4 A bánya titka               1918             97 <NA>          
##  5 A Burglar for a Night       1918             50 Comedy,Drama  
##  6 A Desert Wooing             1918             58 Drama         
##  7 A Doll's House              1918             50 Drama         
##  8 A Japanese Nightingale      1918             50 Drama         
##  9 A Lady's Name               1918             50 Comedy,Romance
## 10 A Law Unto Herself          1918             50 Drama         
## # ... with 307,110 more rows
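A figure like the share of movies over 4 hours can be computed as the mean of a logical vector. The sketch below shows the idea on a handful of made-up runtimes (not IMDB data), and also illustrates that a comparison against a missing runtime yields NA, which is why rows with unknown runtimes drop out when filtering on runtimeMinutes.

```r
# Made-up runtimes in minutes, including one missing value
runtimes <- c(65, 90, 250, NA, 120, 300, 88, 95, 101, 45)

# TRUE where a movie runs longer than 4 hours; NA where the runtime is unknown
over_4h <- runtimes > 240

# Share of movies with a known runtime that exceed 4 hours
share_over_4h <- mean(over_4h, na.rm = TRUE)
print(share_over_4h)

# Rows that would survive a "runtime <= 240" criterion: known and short enough
keep <- !is.na(runtimes) & runtimes <= 240
print(sum(keep))
```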

Let’s visualize the distribution of runtimes across all included movies. We can do so using several types of visualization. First, let’s use the trusty histogram and plot the count of movies for each possible runtime (grouped in intervals of 5 minutes).

# Visualize the overall distribution of runtimes using a histogram
t <- glue("Runtime distribution for {sum(!is.na(imdb_sel$runtimeMinutes))} movies in IMDB (1918-2018)")
imdb_sel %>% 
  ggplot(aes(x = runtimeMinutes)) +
  geom_histogram(binwidth = 5, fill = "white", color = "black") +
  scale_x_continuous(breaks = seq(0, 240, 60)) +
  scale_y_continuous(
    labels = scales::unit_format(scale = 1e-3, suffix = "k"), 
    limits = c(0, 45e3)
  ) +
  labs(
    x = "Runtime (minutes)", 
    y = "Count (of movies)", 
    title = t
  )

[Figure: histogram of movie runtimes]
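To make the binning behind a histogram concrete, here is a small base R sketch that assigns a few made-up runtimes to 5-minute bins and counts them. The runtimes are invented, and placing the bin edges at multiples of 5 is my assumption; geom_histogram() may position its bins differently.

```r
# Made-up runtimes in minutes
runtimes <- c(58, 61, 64, 65, 89, 90, 91, 94, 120)

# Assign each runtime to a 5-minute interval that is closed on the left,
# e.g. [90,95) holds runtimes from 90 up to (but not including) 95
bins <- cut(runtimes, breaks = seq(0, 240, by = 5), right = FALSE)

# Count the movies per bin and show only the non-empty bins
counts <- table(bins)
print(counts[counts > 0])
```

The height of each histogram bar is just one of these counts.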

Next, we can visualize the same distribution using a density plot, which is like a smoothed histogram. Note that the y-axis is the kernel density estimate and not the proportion of each runtime value; this is an important distinction to make because density values do not have to add up to 1 (it is the area under the curve that equals 1), whereas proportions do.

# Visualize the overall distribution of runtimes using a density plot
t <- glue("Runtime distribution for {sum(!is.na(imdb_sel$runtimeMinutes))} movies in IMDB (1918-2018)")
imdb_sel %>% 
  ggplot(aes(x = runtimeMinutes)) +
  geom_density(fill = "white") +
  scale_x_continuous(breaks = seq(0, 240, 60)) +
  labs(
    x = "Runtime (minutes)", 
    y = "Density (of movies)", 
    title = t
  )

[Figure: density plot of movie runtimes]
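The difference between densities and proportions is easy to check numerically. The sketch below uses simulated runtimes (random numbers, not IMDB data) and base R's density() function, which I am assuming is close enough to what geom_density() computes for the purposes of illustration.

```r
# Simulated runtimes: 1000 draws centered on 90 minutes (not real data)
set.seed(1)
runtimes <- rnorm(1000, mean = 90, sd = 15)

# Kernel density estimate evaluated on an evenly spaced grid of x values
d <- density(runtimes)
step <- diff(d$x)[1]  # spacing between neighboring grid points

# The density values themselves do not sum to 1 ...
density_sum <- sum(d$y)
# ... but the area under the curve (height times width) is approximately 1
density_area <- sum(d$y) * step

# Proportions per 5-minute bin, by contrast, do sum to 1
props <- prop.table(table(cut(runtimes, breaks = seq(0, 200, by = 5))))
prop_sum <- sum(props)

print(c(density_sum = density_sum, density_area = density_area, prop_sum = prop_sum))
```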

Another way to visualize this distribution is the boxplot. The boxplot below shows the middle 50% of the data as a white box (i.e., the box’s left and right sides are the 25th and 75th percentiles, respectively), with the 50th percentile (i.e., median) shown as a vertical line within the box. The light horizontal lines extending from the edges of the box are called “whiskers”; each one reaches out to the most extreme data point that lies within 1.5 times the inter-quartile range (IQR), which is the width of the box, of that edge. Finally, the black dots (which are grouped so closely in this figure that they look like thicker horizontal lines) are data points that are more than 1.5 times the IQR away from the box (i.e., outliers). Note that boxplots can be depicted horizontally, as below, or vertically.

# Visualize the overall distribution of runtimes using a boxplot
t <- glue("Runtime distribution for {sum(!is.na(imdb_sel$runtimeMinutes))} movies in IMDB (1918-2018)")
imdb_sel %>% 
  ggplot(aes(y = runtimeMinutes)) +
  geom_boxplot(fill = "white") +
  scale_y_continuous(breaks = seq(0, 240, 60)) +
  labs(y = "Runtime (minutes)", title = t) +
  coord_flip() + theme(axis.text.y = element_blank())

[Figure: horizontal boxplot of movie runtimes]
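The pieces of a boxplot can be worked out by hand. Here is a minimal sketch on eight made-up runtimes that computes the quartiles, the IQR, the 1.5 × IQR cutoffs, the whisker ends, and the outliers. It relies on quantile()'s default method, and plotting functions may compute the box edges slightly differently.

```r
# Made-up runtimes in minutes, with one very short and one very long movie
runtimes <- c(50, 80, 85, 90, 95, 100, 105, 180)

# Box edges (25th and 75th percentiles) and the median
q <- quantile(runtimes, c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Cutoffs located 1.5 times the IQR beyond each edge of the box
lower_cutoff <- q[["25%"]] - 1.5 * iqr
upper_cutoff <- q[["75%"]] + 1.5 * iqr

# Whiskers end at the most extreme data points inside the cutoffs;
# anything outside the cutoffs is drawn as an outlier
inside <- runtimes >= lower_cutoff & runtimes <= upper_cutoff
whiskers <- range(runtimes[inside])
outliers <- runtimes[!inside]

print(q)
print(whiskers)
print(outliers)
```

Note that the whiskers stop at actual data points (80 and 105 here) rather than at the cutoffs themselves.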

Next, let’s examine the runtimes per year to see if there have been trends over time. We can do this effectively by plotting a vertical boxplot for each year and placing them side by side. We can see below that the median runtime rose from around 60 min in the early 1920s to around 90 min by 1950 or so, and has been remarkably stable since then. The 75th and especially the 25th percentiles (i.e., the top and bottom of the boxes) have seen a bit more variability over time. It appears that runtimes were relatively more clumped around 90 min between 1949 and 1999, but saw more variability before and after this range; it would be fascinating for film scholars to weigh in on what factors may have contributed to these changes.

# Visualize the distribution of runtimes per year using boxplots
t <- glue("Runtime distributions by year for {sum(!is.na(imdb_sel$runtimeMinutes))} movies in IMDB (1918-2018)")
imdb_sel %>% 
  ggplot(aes(y = runtimeMinutes, x = factor(startYear))) + 
  geom_boxplot(fill = "white") + 
  scale_x_discrete(breaks = seq(1910, 2020, 10)) +
  scale_y_continuous(breaks = seq(0, 240, 60)) +
  labs(
    x = "Release Year",
    y = "Runtime (minutes)",
    title = t
  )

[Figure: boxplots of movie runtimes by release year]
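If you would rather read the per-year trend as numbers than eyeball it from the boxes, the same summaries can be tabulated directly. The sketch below does this with base R's tapply() on a tiny made-up data frame; the column names mirror those used above, but the values are invented.

```r
# Made-up data: three movies in each of three release years
toy <- data.frame(
  startYear = c(1920, 1920, 1920, 1950, 1950, 1950, 2000, 2000, 2000),
  runtimeMinutes = c(55, 60, 70, 85, 90, 95, 80, 92, 120)
)

# Median runtime per year (the line inside each box)
median_by_year <- tapply(toy$runtimeMinutes, toy$startYear, median)

# Inter-quartile range per year (the height of each box)
iqr_by_year <- tapply(toy$runtimeMinutes, toy$startYear, IQR)

print(median_by_year)
print(iqr_by_year)
```

A larger IQR for a given year corresponds to a taller box, which is one way to quantify how clumped the runtimes were in that year.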
