Class Meeting 14 The Model-Fitting Paradigm in R

14.1 Today

  1. Recap from last class on principles of effective visualizations
  2. Introduction and motivation for model-fitting in R.
  3. Instructor and TA evaluations
  4. Worksheet where we do a full model-fitting analysis together
  5. Deep thoughts on data analytic work (go through material at home)

Worksheet: the model-fitting paradigm in R (available as Rmd and as html).

14.2 Recap: principles of effective visualizations

Here are some general principles for effective figures:

  1. Apply Principle of proportional ink
    • Definition: “The amount of ink used to indicate a value should be proportional to the value itself.”
    • Example: Truncating the y-axis on a bar chart to exaggerate the difference between bars violates the principle of proportional ink
  2. Maintain a high data-to-ink ratio: less is more
    • Definition: remove distracting visual elements to focus attention on the data
    • Examples: Lighten line weights, remove backgrounds, never use 3D or special effects, remove unnecessary/redundant labels, etc.
  3. Always update axis labels and titles on your plots
    • In STAT545/547 we take principles of effective visualizations very seriously and you will lose marks if this isn’t followed
  4. Choose your scale-type carefully
    • Whether you choose a linear, logarithmic, or square-root scale depends on your data, context, and purpose
  5. Choose your graph-type carefully
    • Examples: here is a great directory of plots
  6. Choose colours with accessibility and readability in mind
    • Examples: here is a great set of colour schemes that are colour-blind friendly and perceptually uniform

14.3 Resources

If you’re interested in learning more about the actual statistical / machine learning methods for fitting models, I highly recommend the book “An Introduction to Statistical Learning” (freely available online). It uses R, too.

14.4 Introduction and motivation for model-fitting in R

We will go through this tutorial from Dr. Benjamin Soltoff.
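Before the tutorial, here is a minimal sketch of the fit/inspect/predict workflow using base R's `lm()` and the built-in `mtcars` data (the specific variables chosen here are just for illustration, not taken from the tutorial):

```r
# The model-fitting paradigm in a nutshell: fit, inspect, predict.

# 1. Fit: regress fuel efficiency (mpg) on weight (wt).
fit <- lm(mpg ~ wt, data = mtcars)

# 2. Inspect: coefficients and a full summary.
coef(fit)     # intercept and slope
summary(fit)  # standard errors, R-squared, etc.

# 3. Predict: fitted values for new data.
predict(fit, newdata = data.frame(wt = c(2.5, 3.5)))
```

Packages like broom tidy up step 2 (e.g., `broom::tidy(fit)` returns the coefficient table as a data frame), which the tutorial covers in more depth.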

14.5 Instructor and TA evaluations

Since I was only your instructor for the last two weeks, I probably will not have a course evaluation on Canvas, but I’d still like your feedback. Please give it to me (anonymously) using this brief survey.

To fill out an evaluation for **Dr. Coia** (cm001 to cm010), go to the STAT545 Canvas course, and in the left sidebar, click “Course Evaluation”.

TA evaluations are still done on paper so I will hand out the TA evaluation forms and a pencil if you need it. Your TAs for this course were: Alejandra, Victor, Hossam, and Yulia.

Let’s take some time to fill these out; feedback is very important to all of us.

14.6 Break

14.7 Worksheet

14.8 Deep Thoughts about Data Analytic Work

(“Deep Thoughts” refers to Jack Handey’s sketch on Saturday Night Live)

14.8.1 Reproducibility

This practice is so important, I’m giving it its own section.

  • Jenny Bryan: “If the thought of re-running your analysis makes you ill… you’re not doing it right”
  • Progress is not about stacking bricks. Don’t ever be afraid to revisit “earlier” points in the analysis.

The concept goes beyond programming, and many argue as to its meaning, but we’re taking it to mean two things as relevant to data analytic work:

1. How easily someone can reproduce output.

Example: generating a plot; or a report.

  • Worst case scenario: no source code is available; the output must be regenerated from scratch.
  • Best case scenario: output is regenerated with the click of a button.

Concepts to live by:

  • Source is real. Output is transient.
    • Source >> Output.
  • Working interactively? Frequently nuke your session and run from the top. It should still work.
  • Don’t save workspace upon closing.
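The “nuke your session” habit can be sketched in base R (the throwaway objects below are just stand-ins for leftovers from interactive work):

```r
# Pretend these are stale objects left over from interactive fiddling:
x <- 1
y <- 2

# Nuke the global environment, then re-run your script from the top --
# it should still work without these leftovers.
rm(list = ls())
ls()  # character(0): the environment is empty
```

Relatedly, launching with `R --no-save` (or turning off “save workspace” in the RStudio options) keeps stale objects from leaking between sessions.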

Who is this “someone” I speak of?

  • Future you
  • Collaborators
  • Critics
  • Successors (your capstone partners)

2. How easily someone can reproduce your frame of mind / conceptual framework at some point in time.

Example: All the code files relating to your thesis.

  • Worst case scenario: No explanations.
  • Best case scenario: READMEs concisely summarize what’s present; comments or markdown in reproducible reports outline big-picture logic.

Broader examples:

  • Conclusions drawn from an analysis.
  • Understanding of a concept in class.
  • Even taking photos to capture moments in time.

Concepts to live by:

  • Avoid technical debt, and document the big-picture structure both within files (comments) and between files (READMEs).
    • An unsuspecting data scientist stumbles upon your work. Can they figure out what’s going on?

14.8.2 Coding Practices: Naming

  • Some people use camelCase, snake_case, or.periods.
  • Functions should be verbs (based on the one thing that it achieves). Objects should be nouns. Functions that return functions should be adverbs.
  • Don’t over-create. The more objects there are in your global environment, the more confusing it will be – especially if you don’t have a consistent naming rule.
  • Don’t under-create: avoid magic numbers!
    • Bad: `x <- rnorm(100); y <- x + rnorm(100)`
    • Good: `n <- 100; x <- rnorm(n); y <- x + rnorm(n)`
  • Disambiguate from left-to-right, not right-to-left. Think tab completion.
    • Bad: canada_gdp and china_gdp.
    • Good: gdp_canada and gdp_china.
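The naming rules above can be sketched together in a few lines (all of the names and values here are invented for illustration):

```r
# Objects are nouns, disambiguated left-to-right for tab completion:
gdp_canada <- c(1.6, 1.7, 1.8)    # hypothetical values, trillions USD
gdp_china  <- c(14.3, 14.7, 15.0)

# Functions are verbs, named for the one thing they achieve.
# No magic numbers: the vector's own length drives the computation.
summarize_growth <- function(x) {
  # average period-over-period growth rate
  mean(diff(x) / head(x, -1))
}

summarize_growth(gdp_canada)
```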

14.8.3 Coding Practices: Documenting code

Self-documenting code = code that speaks for itself.

  • Write self-documenting code with meaningful variable names, and ideally using the tidyverse. Metaprogramming shines here, too.
    • Base R: `mtcars[mtcars$cyl < 8, c("cyl", "mpg")]`
    • tidyverse: `mtcars %>% filter(cyl < 8) %>% select(cyl, mpg)`

Think before you comment your code! It’s easy for comments and code to misalign, either when originally written or as things shift around.

  • Use comments to describe high-level decisions on the analysis.
    • Bad: # Lag the river discharge twice.
    • Good: # create predictors.
  • Don’t use comments to describe what your code is doing on a low-level.
    • Bad: # make data frame of cylinders less than 8, with variables 'cyl' and 'mpg'
    • Good: No comment at all.
    • Better: Use the tidyverse.

14.8.4 Coding Practices: The DRY principle

DRY = don’t repeat yourself. Instead,

  • Write functions
  • Write functions that do a very specific thing

Example of a function that does too much: one that normalizes a vector AND possibly computes the location and scale.

Some other best practices with functions:

  • Functions should be self-contained, not drawing on global variables
    • For readability
    • For portability (a function that depends on globals can’t easily be exported to its own package or reused in other analyses, for example)
  • Write unit tests for the functions (look at the testthat package)
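Putting these together, here is a hedged sketch of the normalize example done right: one self-contained function that does exactly one thing, plus a quick unit test (with testthat you would wrap the checks in `test_that()` and `expect_equal()` instead):

```r
# A self-contained function that does one specific thing:
# centre and scale a vector. No globals, no side computations.
normalize <- function(x) {
  (x - mean(x)) / sd(x)
}

# A quick unit test: the result should have mean 0 and sd 1.
z <- normalize(c(1, 2, 3))
stopifnot(abs(mean(z)) < 1e-12, abs(sd(z) - 1) < 1e-12)
```

If you also wanted the location and scale, that would be a separate function, per the DRY-but-specific advice above.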

14.8.5 Coding Practices: Style Guides

When writing code, at the very least, it’s important to be consistent with your style. We highly recommend: