library(tidyverse)
library(tidymodels)
library(schrute)
library(lubridate)
The Office
Use theoffice data from the schrute package to predict IMDB scores for episodes of The Office.
glimpse(theoffice)
Rows: 55,130
Columns: 12
$ index <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode_name <chr> "Pilot", "Pilot", "Pilot", "Pilot", "Pilot", "Pilot",…
$ director <chr> "Ken Kwapis", "Ken Kwapis", "Ken Kwapis", "Ken Kwapis…
$ writer <chr> "Ricky Gervais;Stephen Merchant;Greg Daniels", "Ricky…
$ character <chr> "Michael", "Jim", "Michael", "Jim", "Michael", "Micha…
$ text <chr> "All right Jim. Your quarterlies look very good. How …
$ text_w_direction <chr> "All right Jim. Your quarterlies look very good. How …
$ imdb_rating <dbl> 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6…
$ total_votes <int> 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706,…
$ air_date <chr> "2005-03-24", "2005-03-24", "2005-03-24", "2005-03-24…
Convert air_date from a character string to a date for later use.
theoffice <- theoffice %>%
  mutate(air_date = ymd(as.character(air_date)))
We will
- engineer features based on episode scripts
- train a model
- perform cross validation
- make predictions
Note: The episodes listed in theoffice don’t match the ones listed in the data we used in the cross validation lesson.
theoffice %>%
  distinct(season, episode)
# A tibble: 186 × 2
season episode
<int> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 2 1
8 2 2
9 2 3
10 2 4
# ℹ 176 more rows
Exercise 1 - Calculate the percentage of lines spoken by Jim, Pam, Michael, and Dwight for each episode of The Office.
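One possible starting point for Exercise 1, shown on a small made-up table rather than the real data. The toy_lines tibble and the pct_* column names are hypothetical; the idea is that the mean of a logical comparison within each episode gives the share of lines for that character.

```r
library(dplyr)

# Hypothetical stand-in for theoffice: one row per spoken line
toy_lines <- tibble(
  season    = c(1, 1, 1, 1, 1, 1),
  episode   = c(1, 1, 1, 1, 2, 2),
  character = c("Michael", "Jim", "Michael", "Oscar", "Dwight", "Pam")
)

# Share of lines (in percent) per episode for each of the four characters
line_pcts <- toy_lines %>%
  group_by(season, episode) %>%
  summarise(
    pct_jim     = 100 * mean(character == "Jim"),
    pct_pam     = 100 * mean(character == "Pam"),
    pct_michael = 100 * mean(character == "Michael"),
    pct_dwight  = 100 * mean(character == "Dwight"),
    .groups = "drop"
  )

line_pcts
```

The same pipeline should carry over to theoffice, since it has season, episode, and character columns, though character names in the real data may need cleaning first.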
Exercise 2 - Identify episodes that touch on Halloween, Valentine’s Day, and Christmas.
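A minimal sketch for Exercise 2, assuming that an episode "touches on" a holiday when any of its lines mentions the holiday by name. The toy_script tibble and the keyword choices are hypothetical; a fuller keyword list would probably catch more episodes.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for theoffice: one row per spoken line
toy_script <- tibble(
  season  = c(2, 2, 2, 3),
  episode = c(5, 5, 10, 1),
  text    = c(
    "Happy Halloween, everyone.",
    "Where is my stapler?",
    "It's a Christmas party!",
    "Good morning."
  )
)

# Flag an episode with 1 if any line mentions the holiday keyword, else 0
holiday_flags <- toy_script %>%
  mutate(text = str_to_lower(text)) %>%
  group_by(season, episode) %>%
  summarise(
    halloween = as.integer(any(str_detect(text, "halloween"))),
    valentine = as.integer(any(str_detect(text, "valentine"))),
    christmas = as.integer(any(str_detect(text, "christmas"))),
    .groups = "drop"
  )

holiday_flags
```

Lower-casing the text first keeps the match case-insensitive without writing a regular expression for each keyword.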
Exercise 3 - Put together a modeling dataset that includes features you’ve engineered. Also add an indicator variable called michael which takes the value 1 if Michael Scott (Steve Carell) was still on the show, and 0 if not. Note: Michael Scott (Steve Carell) left the show at the end of Season 7.
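A small sketch of the michael indicator, assuming it is defined purely by season as the note suggests (seasons 1 through 7 get 1, later seasons get 0). The toy_episodes tibble is made up for illustration.

```r
library(dplyr)

# Hypothetical episode-level table: one row per episode
toy_episodes <- tibble(
  season  = c(1, 7, 8, 9),
  episode = c(1, 22, 1, 23)
)

# michael = 1 through the end of Season 7, 0 afterwards
toy_episodes <- toy_episodes %>%
  mutate(michael = if_else(season <= 7, 1, 0))

toy_episodes
```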
Exercise 4 - Split the data into training (75%) and testing (25%).
set.seed(1122)
Exercise 5 - Specify a linear regression model.
Exercise 6 - Create a recipe that updates the role of episode_name to not be a predictor, removes air_date as a predictor, uses season as a factor, and removes all zero variance predictors.
Exercise 7 - Build a workflow for fitting the model specified earlier and using the recipe you developed to preprocess the data.
Exercise 8 - Fit the model to training data and interpret a couple of the slope coefficients.
Exercise 9 - Perform 5-fold cross validation and view model performance metrics.
#set.seed(345)
#folds <- vfold_cv(___, v = ___)
#folds
#
#set.seed(456)
#office_fit_rs <- ___ %>%
# ___(___)
#
#___(office_fit_rs)
Exercise 10 - Use your model to make predictions for the testing data and calculate the RMSE. Also use the model developed in the cross validation lesson to make predictions for the testing data and calculate the RMSE as well. Which model did a better job in predicting IMDB scores for the testing data?
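As a reminder of what is being compared, RMSE is the square root of the mean squared difference between observed and predicted ratings. The sketch below computes it by hand in base R on made-up numbers; truth and pred are hypothetical vectors, not results from either model.

```r
# Hypothetical observed and predicted IMDB ratings for four test episodes
truth <- c(8.0, 7.5, 9.0, 8.5)
pred  <- c(8.2, 7.1, 8.8, 8.5)

# RMSE: square the errors, average them, then take the square root
rmse_manual <- sqrt(mean((truth - pred)^2))
rmse_manual
```

In a tidymodels workflow the same quantity is usually obtained with rmse() from yardstick applied to a data frame of predictions; the model with the lower test RMSE predicted the held-out ratings more closely.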
New model
Old model
TO DO: See what ___ is.
#| label: old-model
#| error: true
office_mod_old <- linear_reg() %>%
  set_engine("lm")

office_rec_old <- recipe(imdb_rating ~ season + episode + total_votes + air_date, data = office_train) %>%
  # extract month of air_date
  step_date(air_date, features = "month") %>%
  step_rm(air_date) %>%
  # make dummy variables of month
  step_dummy(contains("month")) %>%
  # remove zero variance predictors
  step_zv(all_predictors())

office_wflow_old <- workflow() %>%
  add_model(office_mod_old) %>%
  add_recipe(office_rec_old)

office_fit_old <- office_wflow_old %>%
  fit(data = office_train)

tidy(office_fit_old)
___