Class Meeting 11 File input/output (I/O)

Today’s class is all about reading in and writing out data files into and out of R.

11.1 Worksheet

Normally there would be a separate worksheet file, but we’ll do everything in-line for this class

11.2 Resources

11.2.1 References and tutorials

  • Jenny Bryan’s notes on file I/O
  • Tutorial on importing excel file
  • Tutorial on relative paths and how RStudio treats .R script files and .Rmd files

11.3 Writing data to disk (10 mins)

Let’s first load the built-in gapminder dataset and the tidyverse:

Next, let’s filter the data only from 2007 and only in the Asia continent and save it to a variable.

## # A tibble: 33 x 6
##    country          continent  year lifeExp        pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>      <int>     <dbl>
##  1 Afghanistan      Asia       2007    43.8   31889923      975.
##  2 Bahrain          Asia       2007    75.6     708573    29796.
##  3 Bangladesh       Asia       2007    64.1  150448339     1391.
##  4 Cambodia         Asia       2007    59.7   14131858     1714.
##  5 China            Asia       2007    73.0 1318683096     4959.
##  6 Hong Kong, China Asia       2007    82.2    6980412    39725.
##  7 India            Asia       2007    64.7 1110396331     2452.
##  8 Indonesia        Asia       2007    70.6  223547000     3541.
##  9 Iran             Asia       2007    71.0   69453570    11606.
## 10 Iraq             Asia       2007    59.5   27499638     4471.
## # … with 23 more rows

We can write this to a comma-separated value (csv) file with just one command:

write_csv(gap_asia_2007,"exported_file.csv")

But let’s not just write the csv file anywhere, we should download the file in a sensible location. The next section talks about the here package

11.4 Making the case for here::here()

If you wanted to make your Rproj more portable and accessible to more users in a cross-platform (between Mac, Unix, Windows users), rather than specifying every path explicitly, here::here() allows you to set relative paths much more easily.

For example, create a new folder “data” and within it a subfolder called “cm011_data”, specify here() and then save a file to that location with these commands:

11.4.1

More thorough notes to come…

In the meantime, read this short article for an excellent explanation of why we use the here::here() package in R.

11.5 Reading data from disk (5 mins)

The same csv file that we just saved to disk can be imported into R again by specifying the path where it exists:

## Parsed with column specification:
## cols(
##   country = col_character(),
##   continent = col_character(),
##   year = col_double(),
##   lifeExp = col_double(),
##   pop = col_double(),
##   gdpPercap = col_double()
## )
## # A tibble: 33 x 6
##    country          continent  year lifeExp        pop gdpPercap
##    <chr>            <chr>     <dbl>   <dbl>      <dbl>     <dbl>
##  1 Afghanistan      Asia       2007    43.8   31889923      975.
##  2 Bahrain          Asia       2007    75.6     708573    29796.
##  3 Bangladesh       Asia       2007    64.1  150448339     1391.
##  4 Cambodia         Asia       2007    59.7   14131858     1714.
##  5 China            Asia       2007    73.0 1318683096     4959.
##  6 Hong Kong, China Asia       2007    82.2    6980412    39725.
##  7 India            Asia       2007    64.7 1110396331     2452.
##  8 Indonesia        Asia       2007    70.6  223547000     3541.
##  9 Iran             Asia       2007    71.0   69453570    11606.
## 10 Iraq             Asia       2007    59.5   27499638     4471.
## # … with 23 more rows

Normally we would store the imported data into a new variable and you can use that by assigning the output to a variable. Notice that the output of the imported file is the same as the original tibble, and read_csv was intelligent enough to detect the types of the columns. This won’t always be true so it’s worth checking! The read_csv package has many additional options including the ability to skip columns, skip rows, rename columns on import, trim whitespace, and more…

11.6 Import a file from the web/cloud

11.6.1 Import a csv file from the internet

To import a csv file from a web, assign the URL to a variable

and then apply read_csv file to the url.

## Parsed with column specification:
## cols(
##   Magazine = col_character(),
##   AdRevenue = col_double(),
##   AdPages = col_double(),
##   SubRevenue = col_double(),
##   NewsRevenue = col_double()
## )
## # A tibble: 204 x 5
##    Magazine                             AdRevenue AdPages SubRevenue NewsRevenue
##    <chr>                                    <dbl>   <dbl>      <dbl>       <dbl>
##  1 Weekly World News                         2280    300         854       16568
##  2 National Examiner                         3382    380         968       27215
##  3 J-14                                      4218    250        2206       12453
##  4 Soap Opera Weekly                         4622    439        5555       24282
##  5 Easyriders                                5121    524.       4155        9929
##  6 Mary Engelbreit's Home Companion          5259    189        9048        4363
##  7 Official Xbox Magazine                    5838    542.       4311       10320
##  8 Weight Watchers                           6986    287.       9202        4048
##  9 Globe                                     7634    380        2180       63771
## 10 PSM: 100% Independent PlayStation 2…      8034    720.       6846        5271
## # … with 194 more rows

11.6.2 Import an excel file (.xls) from the internet

First, we’ll need the package to load in excel files:

’Datafiles from this tutorial were obtained from: https://beanumber.github.io/sds192/lab-import.html#data_from_an_excel_file

To import an .xls or .xlsx file from the internet, you first need to download it locally. The read_excel function from the readxl package can help us read it after we download it. To download it, create a new variable called xls_url, as well the name of the destination file you would like to download the data into.

NOTE: If you are on windows and end up downloading a corrupt file, you need to add an extra argument: download.file(…, mode=“wb”). More details about this behaviour can be found here.

Naming a file “some_file” is extremely bad practice (hard to keep track of the files) and I would strongly encourage you to name the file similar (or the same) to the original file. Let me show you a handy trick to extract the filename from the URL:

Now we can import the file:

## # A tibble: 50 x 8
##    Rank  Name  Background `2003-07 Given … Causes `Estimated lift… `Net Worth`
##    <chr> <chr> <chr>      <chr>            <chr>             <dbl> <chr>      
##  1 1     Warr… Berkshire…  40,650          Healt…            40780 52000      
##  2 2     Bill… Microsoft…  3,519           Globa…            28144 59000      
##  3 3     Geor… Oil and g…  2,271           Pover…             2522 11000      
##  4 4     Geor… Investor    2,109           Open …             6401 8800       
##  5 5     Gord… Intel co-…  2,067           Envir…             7404 4500       
##  6 6     Walt… Family of…  1,475           Educa…             2015 82500      
##  7 7     Herb… Golden We…  1,368           Medic…             1389 2400       
##  8 8     Eli … SunAmeric…  1,216           Publi…             2286 7000       
##  9 9     Dona… Real esta…  915             Educa…             1326 13000      
## 10 10    Jon … Huntsman …  800             Cance…             1233 1900       
## # … with 40 more rows, and 1 more variable: `Giving%` <chr>

11.7 Read in a sample Excel file. (Optional)

Let’s load in a sample dataset from a PhD research project in MRI (magnetic resonance imaging). In MRI, subjects are imaged and the data collected can be visualized in “slices”. If a human head was being imaged, the first slice might be a cross-section of the neck. The next slice would be an image in the same plane as the first slice, but a few mm above the first slice, and so on and so forth. The exact details of the dataset aren’t important, but here are key details we can extract from the dataset:

  • the first column contains the subject ID (e.g., “HerS18Bs01.BS1/8”)
  • Each row contains 8 measurements, each from a different slice in the image
  • The next column is a weighted average, and the final column is the volume measurement of all 8 slices
  • The data is not in tidy format
  • The subjects are divided into two treatment groups: “Avastin” and “Herceptin”, but unfortunately, this information is not captured in the table
  • The relevant data can be found in the range A1:K12 since the rest of the data contains either footer rows or repeated data.

Viewing the data using the View() function will allow you to investigate how the imported data looks in R. As hinted above, it looks like we have to do one preliminary data-processing steps before we can import all the data.

Notice that below row 12, two subsets of the data from the first 12 rows is duplicated into two groups: “Avastin” and “Herceptin” treated. Indeed, the person who did the analysis copied the raw data and split it into the two groups. While it is possible to do this processing in R, in this case it is better that we do this directly in the excel file. Let’s do this and save a new file called Firas-MRI_minor_cleaning.xlsx

Note also that column J is a calculated column of weighted averages. Rather than bring this column in a-is, we want to remove this column and then later calculate it in R if we need to.

We are now ready to import in the newly processed file:

## New names:
## * `` -> ...12

Finally, let’s make our data tidy using pivot_longer:

## # A tibble: 88 x 5
##    `Animal ID`      Volume ...12 slice_no value
##    <chr>             <dbl> <lgl> <chr>    <dbl>
##  1 HerS18Bs01.BS1/8   523. NA    Slice 1  11.1 
##  2 HerS18Bs01.BS1/8   523. NA    Slice 2  14.7 
##  3 HerS18Bs01.BS1/8   523. NA    Slice 3  16.5 
##  4 HerS18Bs01.BS1/8   523. NA    Slice 4  14.3 
##  5 HerS18Bs01.BS1/8   523. NA    Slice 5  18.9 
##  6 HerS18Bs01.BS1/8   523. NA    Slice 6  22.5 
##  7 HerS18Bs01.BS1/8   523. NA    Slice 7  19.0 
##  8 HerS18Bs01.BS1/8   523. NA    Slice 8  18.8 
##  9 HerS18Bs02.BS1/7   400. NA    Slice 1  10.3 
## 10 HerS18Bs02.BS1/7   400. NA    Slice 2   7.90
## # … with 78 more rows

And we are done! Ready to explore trends in this dataset.