Class Meeting 6 Intro to data wrangling, Part I

Worksheet: You can find a worksheet template for today here.

6.1 Today’s Lessons

Today we’ll introduce the dplyr package. Specifically, we’ll look at these three lessons:

  • Intro to dplyr syntax
  • The dplyr advantage
  • Relational/comparison and logical operators in R

6.2 Resources

All three of today’s lessons are closely aligned with the stat545: dplyr-intro chapter.

More detail can be found in the r4ds: transform chapter, up to and including the select() section. Section 5.2 also elaborates on relational/comparison and logical operators in R.

Here are some supplementary resources:

6.3 Participation

To get participation points for today, fill out the cm006-exercise.Rmd file and add it to your participation repo.

6.4 Intro to dplyr syntax

6.4.1 Learning Objectives

Here are the concepts we’ll be exploring in this lesson:

  • tidyverse
  • dplyr functions:
    • select
    • arrange
  • piping

By the end of this lesson, students are expected to be able to:

  • subset and rearrange data with dplyr
  • use piping (%>%) when implementing function chains
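As a minimal sketch of these objectives, the chain below subsets columns with select() and reorders rows with arrange(), linked by the pipe. It uses R's built-in mtcars data rather than the data from the class exercise, an assumption made only so the example runs without extra downloads.

```r
library(dplyr)

# Keep three columns, then sort rows from highest to lowest mpg.
# mtcars is a built-in data set, used here only for illustration.
top_mpg <- mtcars %>%
  select(mpg, cyl, wt) %>%
  arrange(desc(mpg))

head(top_mpg, 3)

# The same steps without the pipe have to be read inside-out.
nested <- arrange(select(mtcars, mpg, cyl, wt), desc(mpg))
```

Both versions give the same result; the piped version reads top to bottom in the order the steps happen.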

6.4.2 Preamble

Let’s talk about:

  • The history of dplyr: plyr
  • tibbles are a special type of data frame
  • the tidyverse
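To make the tibble point concrete, here is a small sketch comparing an ordinary data frame with its tibble version. The built-in mtcars data and the object names are assumptions chosen for illustration.

```r
library(tibble)

cars_df  <- mtcars              # an ordinary data frame
cars_tbl <- as_tibble(mtcars)   # the same data as a tibble

class(cars_tbl)          # includes "data.frame" alongside the tibble classes
is.data.frame(cars_tbl)  # TRUE: functions expecting a data frame still work

# One practical difference: selecting a single column with [ , ]
class(cars_df[, "mpg"])   # "numeric": a base data frame drops to a vector
class(cars_tbl[, "mpg"])  # still a tibble
```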

6.4.3 Demonstration

Let’s get started with the exercise:

  1. Open RStudio, and install the tidyverse meta-package by running install.packages("tidyverse") in the R console.
  2. Optional: open the STAT545_participation RStudio project in RStudio.
  3. With RStudio, open the cm006-exercise.Rmd file you downloaded and committed earlier.
  4. Follow the instructions in the .Rmd file until the “resume lecture” section.

6.5 Small break

Here are some things you might choose to do on this break:

  • Talk with a TA, Vincenzo, or your neighbour(s) about the content so far.
  • Attempt the bonus exercises on the cm006-exercise.Rmd file.
  • Work on an assignment.

6.6 The dplyr advantage

6.6.1 Learning Objectives

By the end of this lesson, students are expected to be able to:

  • Describe why dplyr has advantages over the “base R” approach with respect to good coding practice.

Why?

  • Having this in the back of your mind will help you recognize the qualities of a readable analysis, and produce one yourself.

6.6.2 Compare base R to dplyr

Self-documenting code.

This is where the tidyverse shines.

Example of dplyr vs base R:

gapminder %>%
  filter(country == "Cambodia") %>%
  select(year, lifeExp)

vs.

gapminder[gapminder$country == "Cambodia", c("year", "lifeExp")]

No need to take excerpts: avoid storing subsets of the data as separate objects.

Wrangle with dplyr first, then pipe into a plot/analysis.

OR, use the subset argument that’s often offered by R functions like lm().
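A sketch of both options, fitting the same linear model either way. The built-in mtcars data and the model formula are assumptions for illustration, not part of the class exercise.

```r
library(dplyr)

# Option 1: wrangle first, then pipe the result into the analysis.
# The dot tells the %>% pipe where the piped data should go.
fit_piped <- mtcars %>%
  filter(cyl == 4) %>%
  lm(mpg ~ wt, data = .)

# Option 2: let lm() do the subsetting through its subset argument.
fit_subset <- lm(mpg ~ wt, data = mtcars, subset = cyl == 4)

# Both fits use the same four-cylinder cars, so the coefficients agree.
all.equal(coef(fit_piped), coef(fit_subset))
```

Neither option leaves an intermediate excerpt of the data lying around in the workspace.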

Especially don’t use magic numbers to subset!
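Here is one way to see why magic numbers are fragile, sketched with the built-in mtcars data (an assumption for illustration). Positional subsetting silently picks different rows once the data are reordered, while a condition keeps selecting the same cars.

```r
library(dplyr)

# Fragile: what do 1:2 mean? The answer changes if the rows are reordered.
by_position <- mtcars[1:2, ]
reordered   <- mtcars %>% arrange(mpg)
identical(by_position$mpg, reordered[1:2, ]$mpg)  # FALSE: different cars now

# Self-documenting: the condition says which rows are wanted,
# and it selects the same cars regardless of row order.
six_manual   <- mtcars %>% filter(cyl == 6, am == 1)
six_manual_2 <- reordered %>% filter(cyl == 6, am == 1)
setequal(six_manual$mpg, six_manual_2$mpg)        # TRUE
```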

Note that you need to use the assignment operator to store changes!
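A short sketch of what that means in practice: dplyr functions return a new object and leave their input alone. The name my_cars and the built-in mtcars data are assumptions for illustration.

```r
library(dplyr)

my_cars <- mtcars

# This prints a sorted copy, but my_cars itself is untouched.
my_cars %>% arrange(mpg) %>% head(3)
unchanged <- identical(my_cars, mtcars)   # TRUE

# Assign the result to keep it.
my_cars <- my_cars %>% arrange(mpg)
my_cars$mpg[1]   # now the smallest mpg in the data
```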

6.7 Relational/Comparison and Logical Operators in R

6.7.1 Learning Objectives

Here are the concepts we’ll be exploring in this lesson:

  • Relational/Comparison operators
  • Logical operators
  • dplyr functions:
    • filter
    • mutate

By the end of this lesson, students are expected to be able to:

  • Predict the output of R code containing the above operators.
  • Explain the difference between &/&& and |/||, and name a situation where one should be used over the other.
  • Subset and transform data using filter and mutate.
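As a rough preview of the last objective, the sketch below combines a relational and a logical operator inside filter(), then adds a column with mutate(). The built-in mtcars data and the new column name wt_lb are assumptions for illustration.

```r
library(dplyr)

# filter() keeps the rows where the condition is TRUE;
# mutate() adds (or changes) columns.
efficient <- mtcars %>%
  filter(mpg > 25 & cyl == 4) %>%
  mutate(wt_lb = wt * 1000)   # mtcars records wt in units of 1000 lbs

efficient
```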

6.7.2 R Operators

Arithmetic operators allow us to carry out mathematical operations:

Operator Description
+ Add
- Subtract
* Multiply
/ Divide
^ Exponent
%% Modulus (remainder from division)
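A quick sketch of these operators applied to a small vector (the vector x is made up for illustration):

```r
x <- c(10, 7, 3)

x + 1    # 11 8 4
x * 2    # 20 14 6
x / 2    # 5.0 3.5 1.5
x ^ 2    # 100 49 9
x %% 3   # 1 1 0  (remainders after dividing by 3)
```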

Relational operators allow us to compare values:

Operator Description
< Less than
> Greater than
<= Less than or equal to
>= Greater than or equal to
== Equal to
!= Not equal to
  • Arithmetic and relational operators are vectorized: they work element-wise on vectors.
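For instance, comparing a vector against a single value returns one TRUE or FALSE per element. The vector ages below is made up for illustration.

```r
ages <- c(15, 18, 21)

ages >= 18        # FALSE TRUE TRUE
ages == 18        # FALSE TRUE FALSE
ages != 18        # TRUE FALSE TRUE

# The single value 18 is recycled against every element of the vector.
sum(ages >= 18)   # 2, because each TRUE counts as 1
```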

Logical operators allow us to carry out boolean operations:

Operator Description
! Not
| Or (element-wise)
& And (element-wise)
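|| Or (single values, short-circuiting)
&& And (single values, short-circuiting)
  • The difference between | and || is that | works element-wise on whole vectors, whereas || expects a single logical value on each side and evaluates the right side only if it is needed. Older versions of R silently used just the first element of a longer vector with ||; R 4.3.0 and later raise an error instead. The same holds for & and &&.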
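A small sketch of when each form is useful. The vectors a and b and the variable x are made up for illustration, and && and || are only given single values here.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

# Element-wise forms: one result per position. Use these inside filter().
a & b   # TRUE FALSE FALSE
a | b   # TRUE TRUE TRUE
!a      # FALSE TRUE FALSE

# && and || are meant for single values, for example inside if ().
# They short-circuit: the right side runs only when it is needed.
x <- NULL
if (!is.null(x) && x > 0) {
  message("x is positive")
}
# With & instead, x > 0 would also be evaluated, the condition would
# have length zero, and if () would stop with an error.

short_circuit <- TRUE || stop("never evaluated")   # TRUE
```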

6.7.3 Demonstration

Continue along with the cm006-exercise.Rmd file.

6.8 If there’s time remaining

  1. Let’s do the bonus exercises together, in the cm006-exercise.Rmd file.
  2. Another “break”