03: Getting Data In and Out

files
formats
directories
I/O
R scripts
Author

Thomas Manke

Goal: Ultimately we want to access our own data and write results to file.

A small detour: R Markdowns

It is important to document your work and to keep your workflows reproducible.

  1. One common solution is to write a script (e.g. my_script.R) that can run a series of commands. But (apart from brief comments) such scripts often lack proper documentation, which is crucial for understanding the rationale and motivation behind more complex analysis steps.

  2. A modern solution is to use R Markdown or Quarto documents (Rmd/qmd). These combine scripts with powerful text formatting for data analysis and analysis reporting.

The markdown documents can be run (“rendered”, “knit”) to yield standardised analysis reports in HTML (or PDF) format.

Task

In RStudio,

  • Open a new R Markdown document with File > New File > R Markdown... (This will open a template for an Rmd file that can be knit.)
  • To convert this file into HTML, press Knit. Try it out! You may have to save it first, e.g. as “first.Rmd”.
  • There is a YAML header that contains parameters which affect the rendering process. Customize them.
  • The rest of the document consists of text blocks (with simple formatting instructions) and code blocks (with R code).
  • In RStudio, the code blocks can also be run individually using the embedded Play button. Try it out.
  • Play time: Modify the YAML header, text blocks or code, or all of them. “Knit” the document and observe the changes.

Notice: many modern development tools (RStudio, Jupyter, VS Code) allow for convenient editing and testing of markdown documents.

Avoid the console from now on, and work in a markdown document instead.


CSV files

Comma-separated text files (ASCII) are both human- and machine-readable. Other separators may be chosen (tab or “|”). This format is frequently used for simple data, such as rows of different samples/observations and columns of multiple variables (per sample).

Important: Make sure that you know the precise location of your data file and provide this as filename.

Topics:

  • home and working directory
  • relative and absolute path
Code
getwd()                     # working directory
dir()                       # display content
filename='data/iris.csv'    # relative to wd
d = read.csv(filename)      # file content --> memory (d)
str(d)
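
The topics above (home and working directory, relative and absolute paths) can be sketched in a few lines. This is a minimal illustration, not part of the original exercise: the variable names `rel_path` and `abs_path` are made up here, and the file data/iris.csv is only assumed to exist below your working directory.

```r
# Sketch: relative vs. absolute paths (data/iris.csv is assumed, not guaranteed)
wd   <- getwd()              # current working directory (an absolute path)
home <- path.expand("~")     # home directory

rel_path <- file.path("data", "iris.csv")  # relative to the working directory
abs_path <- file.path(wd, rel_path)        # the same file as an absolute path

cat("working directory:", wd, "\n")
cat("home directory:   ", home, "\n")
cat("relative path:    ", rel_path, "\n")
cat("absolute path:    ", abs_path, "\n")

# Check the path first, so that a wrong location gives a clear message
if (file.exists(rel_path)) {
  d <- read.csv(rel_path)
  str(d)
} else {
  message("File not found: ", abs_path)
}
```

Checking with file.exists() before reading is optional, but it helps to distinguish a wrong path from a badly formatted file.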

There are many different ways to load such data into memory and to customize the loading.
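
As a rough illustration of such customization, the sketch below reads a tiny made-up table from a string (via the `text` argument) instead of a file. The data, the “;” separator and the “-” code for missing values are assumptions chosen only for this example.

```r
# Made-up example data: ";" as separator, "NA" and "-" mark missing values
txt <- "sample;length;species
s1;5.1;setosa
s2;-;virginica
s3;6.3;NA"

d_small <- read.csv(
  text = txt,                  # read from a string instead of a file
  sep = ";",                   # field separator
  header = TRUE,               # first line contains the column names
  na.strings = c("NA", "-"),   # strings to be treated as missing values
  stringsAsFactors = TRUE      # convert text columns to factors
)
str(d_small)
```

The same arguments work when a filename is given instead of `text`.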

Tasks

  • Explore ?read.csv to get a first overview of how this function can be customized.
  • How would you read only the first 10 lines?
  • Explore the data object d
  • Optional bonus: try your own file and brace yourself! Is it clean enough?

From URL

Notice that files do not need to be available locally, but may be provided via a URL.

Be aware that in those cases loading may be significantly slower, depending on your network connection.

Code
filename='https://raw.githubusercontent.com/maxplanck-ie/Rintro/2024.04/data/iris.tsv'
d = read.csv(filename, sep='\t')  
str(d)

Compressed formats

Especially for big data, it is common to store files in a compressed format (e.g. *.gz) to reduce the storage footprint and speed up data transfer. Such files are binary and not human-readable, but they can also be read:

Code
cmd = "gunzip -c data/iris.tsv.gz"   # command to uncompress
d = read.csv(pipe(cmd), sep='\t')       # read as pipe
str(d)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
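
The pipe() approach above relies on an external gunzip command being installed. As an alternative sketch, base R can open gzip-compressed files itself through a gzfile() connection. The example writes a temporary compressed copy of the built-in iris data first so that it is self-contained; the temporary filename is an assumption for illustration only.

```r
# Create a small gzip-compressed TSV file to read (temporary, for illustration)
tmp <- tempfile(fileext = ".tsv.gz")
write.table(iris, gzfile(tmp), sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back through a gzfile() connection, no external command needed
d_gz <- read.csv(gzfile(tmp), sep = "\t")
str(d_gz)

unlink(tmp)   # remove the temporary file
```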

Writing data

There are many ways to save data to text files. One of the simplest uses write.csv.

Code
write.csv(iris, file="output/iris.csv", row.names=FALSE, quote=FALSE)

For large data you may prefer to write a compressed version:

Code
write.csv(iris, gzfile("output/iris.csv.gz"))
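
Both calls above assume that the directory output/ already exists; write.csv() does not create it. The sketch below creates the directory first and then compares the sizes of the plain and the compressed file. It writes below tempdir() only to stay self-contained; in your project you would use "output" directly. The names `out_dir`, `plain` and `packed` are made up for this example.

```r
# Hypothetical output location; replace with "output" in your own project
out_dir <- file.path(tempdir(), "output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

plain  <- file.path(out_dir, "iris.csv")
packed <- file.path(out_dir, "iris.csv.gz")

write.csv(iris, file = plain, row.names = FALSE, quote = FALSE)     # plain text
write.csv(iris, gzfile(packed), row.names = FALSE, quote = FALSE)   # gzip-compressed

file.size(c(plain, packed))   # sizes in bytes
```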

Task

  • Change some of the parameters (row.names, quote) and observe their effect on the resulting file
  • Save only the subset of flowers where Species=“setosa” to a file setosa.tsv

RData

In the context of the R programming language, RData is a very convenient (binary) format that can be used to save multiple data structures or even whole environments. It is very efficient when you exchange your data with other R users (or your future self).

Specific objects

Code
d = iris                # copy of iris data
fn="output/iris.RData"  # filename (and extension) of choice
save(d, file=fn)
rm(d)             # remove object d for illustration - and watch global env
load(fn)          # reload object d from file - and watch global env
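
To confirm the round trip without watching the environment pane, the same steps can be checked in code. This is a sketch that writes to a temporary file rather than output/iris.RData; the name `restored` is made up here. load() invisibly returns the names of the objects it restored.

```r
d  <- iris                           # copy of iris data
fn <- tempfile(fileext = ".RData")   # temporary file, for illustration only
save(d, file = fn)

rm(d)
exists("d", inherits = FALSE)        # is d still defined in this environment?

restored <- load(fn)                 # names of the objects that were loaded
restored
identical(d, iris)                   # is the reloaded object the same as the original?
```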

Task: All objects

Sometimes we want to save all objects and variables that have accumulated in the “Global Environment”, just to be sure. This task tests some jargon, familiarity with directory structure and the ability to find help. Please try it yourself.

  • Create a new data object for the iris data set as before and additional variables for your favorite numbers and perhaps some favorite strings.
  • Save the whole environment (using save.image())
  • Delete the whole environment aka “workspace”; e.g. using rm(list=ls())
  • Reload the environment and confirm that you successfully recreated all objects
  • Determine your current working directory (getwd())
  • Locate the saved image on disk and inspect its size. Delete it if you prefer.
Code
# your code snippet here

Notice: The suffix (.RData) is not strictly necessary, but it is best practice and used consistently by the community.


Review

  • Many different data sources, formats & structures
    • text files: .tsv, .csv, …
    • compressed files: *.bed.gz
    • application specific: .RData, (.xls)
  • Reading Data: many ways
    • read.csv(), read.table(), scan(), …
    • from URL
    • customization with parameters
    • and there is more: special packages
  • Writing Data: many ways
    • write.csv()
    • save() → load()
    • save.image() → load()
  • Data I/O can be challenging:
    • file → memory
    • know your paths, format, type, size
    • ensure clean and structured data
    • bring time and patience
  • R scripts: writing and running (source)
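
The last point can be illustrated with a minimal sketch: a short script is written to a temporary file and then run with source(). The script content and the variable names (`n_rows`, `mean_sepal`, `env`) are made up for this example; normally you would write my_script.R in an editor.

```r
# Write a tiny script to a temporary file (stand-in for my_script.R)
script <- tempfile(fileext = ".R")
writeLines(c(
  "# a short series of commands",
  "d <- iris",
  "n_rows <- nrow(d)",
  "mean_sepal <- mean(d$Sepal.Length)"
), script)

# Run the script; its objects are created in a separate environment
env <- new.env()
source(script, local = env)

ls(env)       # objects created by the script
env$n_rows    # number of rows in the iris data
```

Calling source(script) without `local` would create the objects in the Global Environment instead. From a terminal, the same file can be run with Rscript.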