03: Getting Data In and Out

files

formats

directories

I/O

R scripts

Author

Thomas Manke

Goal: Ultimately we want to access our own data and write results to file.

A small detour: R Markdowns

It is important to document your work and to build reproducible workflows.

One common solution is to write a script (e.g. my_script.R) that can run a series of commands. But (apart from brief comments) such scripts often lack proper documentation, which is crucial for understanding the rationale and motivation behind more complex analysis steps.
A modern solution is to use R/Quarto markdown documents (*.Rmd/*.qmd). These combine scripts with powerful text formatting for data analysis and analysis reporting.

The markdown documents can be run (“rendered”, “knit”) to yield standardised analysis reports in html (or pdf) format.

Task

In RStudio,

open a new R Markdown document with File > New File > R Markdown... (this opens a template for an Rmd file that can be knit).
To convert this file into html press Knit - try it out! You may have to save it first, e.g. as “first.Rmd”.
There is a YAML header that contains parameters which affect the rendering process - customize them.
The rest of the document consists of text blocks (with simple format instructions) and code blocks (with R code).
In RStudio, the code blocks can also be run individually using the embedded Play button - try it out.
Play time: Modify the YAML header, text blocks or code - or all of it. “Knit” the document and observe the changes.
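
The three parts named in the task (YAML header, text block, code block) can be sketched by writing a minimal file from R. The file name first.Rmd follows the example above; the title, the text and the chunk content are hypothetical placeholders. The rendering call assumes that the rmarkdown package is installed, so it is left commented out.

```r
# Build a minimal R Markdown file in a temporary directory.
fence <- strrep("`", 3)                 # three backticks open/close a code block
rmd_lines <- c(
  "---",                                # YAML header starts
  'title: "My first report"',           # hypothetical title
  "output: html_document",
  "---",                                # YAML header ends
  "",
  "Some *formatted* text describing the analysis.",
  "",
  paste0(fence, "{r}"),                 # code block starts
  "summary(iris)",
  fence                                 # code block ends
)
rmd_file <- file.path(tempdir(), "first.Rmd")
writeLines(rmd_lines, rmd_file)
cat(readLines(rmd_file), sep = "\n")    # inspect the file content

# Rendering from code (assumes the rmarkdown package is installed):
# rmarkdown::render(rmd_file)
```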

Notice: many modern development tools (RStudio, Jupyter, VSCode) allow for convenient editing and testing of markdown documents.

Avoid the console from now on, and work in a markdown document instead.

CSV files

Comma-separated text files (ASCII) are both human- and machine-readable. Other separators may be chosen (tab or “|”). This format is frequently used for simple data, such as rows of different samples/observations and columns of multiple variables (per sample).

Important: Make sure that you know the precise location of your data file and provide this as filename.

Topics:

home and working directory
relative and absolute path

Code

getwd()                     # working directory
dir()                       # display content
filename='data/iris.csv'    # relative to wd
d = read.csv(filename)      # file content --> memory (d)
str(d)
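
The two topics above (home and working directory, relative and absolute path) can be illustrated with a few base R functions. This is only a sketch: the path data/iris.csv is taken from the example, and whether the file exists depends on your own directory layout.

```r
# Where am I, and where is "home"?
wd   <- getwd()              # current working directory (an absolute path)
home <- path.expand("~")     # home directory
print(wd)
print(home)

# A relative path is interpreted relative to the working directory.
rel_path <- file.path("data", "iris.csv")   # same as "data/iris.csv"
abs_path <- file.path(wd, rel_path)         # the corresponding absolute path
print(rel_path)
print(abs_path)

# Check before reading: is there a file at that location?
print(file.exists(rel_path))
```

If the last line prints FALSE, either the working directory or the relative path needs to be adjusted before calling read.csv().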

There are many different ways to load such data into memory and to customize the loading.
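
As a small sketch of such customization, the example below reads a pipe-separated table with a non-standard marker for missing values. The table content is made up for illustration, and it is passed as a string (argument text) so that the example does not depend on any file.

```r
# A small pipe-separated table given as text (hypothetical values).
txt <- "sample|length|group
s1|5.1|a
s2|NA|b
s3|-|a"

d_custom <- read.csv(
  text             = txt,          # read from a string instead of a file
  sep              = "|",          # field separator
  na.strings       = c("NA", "-"), # strings to be treated as missing values
  stringsAsFactors = TRUE          # convert text columns to factors
)
str(d_custom)
```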

Tasks

Explore ?read.csv to get a first overview of how this function can be customized.
How would you read only the first 10 lines?
Explore the data object d
Optional bonus: try your own file and brace yourself! Is it clean enough?

From URL

Notice that files do not need to be available locally, but might be provided by some URL.

Be aware that in those cases loading may be significantly slower, depending on your network connection.

Code

filename='https://raw.githubusercontent.com/maxplanck-ie/Rintro/2024.04/data/iris.tsv'
d = read.csv(filename, sep='\t')  
str(d)
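
If the same remote file is needed repeatedly, one possible way to avoid the slow network access is to keep a local copy. The helper below is a minimal sketch; read_cached() and its argument names are hypothetical and not part of base R.

```r
# Download a remote file only if no local copy exists yet, then read it.
# read_cached(), url and dest are hypothetical names used for illustration.
read_cached <- function(url, dest, ...) {
  if (!file.exists(dest)) {
    dir.create(dirname(dest), showWarnings = FALSE, recursive = TRUE)
    download.file(url, destfile = dest, quiet = TRUE)
  }
  read.csv(dest, ...)   # further arguments (e.g. sep) are passed on
}
```

For the file above this could be called as `read_cached(filename, "data/iris.tsv", sep = "\t")`. Only the first call would use the network; later calls read the local copy.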

Compressed formats

Especially for big data, it is common to store files in a compressed format (e.g. *.gz) to reduce the storage footprint and speed up data transfer. Such files are binary and not human-readable, but they can also be read into R.

Code

cmd = "gunzip -c data/iris.tsv.gz"   # command to uncompress
d = read.csv(pipe(cmd), sep='\t')       # read as pipe
str(d)

'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
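
The pipe approach above relies on an external gunzip command being available on the system. As an alternative sketch, base R can read gzip-compressed files through a gzfile() connection. The example first creates its own compressed file in a temporary directory, so the file location is an assumption for illustration only.

```r
# Create a small gzip-compressed TSV file to work with (temporary location).
gz_file <- file.path(tempdir(), "iris.tsv.gz")
con <- gzfile(gz_file, "w")
write.table(iris, con, sep = "\t", row.names = FALSE, quote = FALSE)
close(con)

# Read through a gzfile() connection; no external command is needed.
d_gz <- read.csv(gzfile(gz_file), sep = "\t")
str(d_gz)

# read.csv() also accepts the path itself, since R detects gzip compression.
d_gz2 <- read.csv(gz_file, sep = "\t")
print(identical(d_gz, d_gz2))
```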

Writing data

There are many ways to save data to text files. One of the simplest uses write.csv.

Code

write.csv(iris, file="output/iris.csv", row.names=FALSE, quote=FALSE)

For large data you may prefer to write a compressed version:

Code

write.csv(iris, gzfile("output/iris.csv.gz"))
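
A quick way to check that a file was written as intended is to read it back and compare it with the original data. The sketch below writes into a temporary directory (an assumption, so that the example runs anywhere) and creates the output folder first, since write.csv() fails if the target directory does not exist.

```r
# Create the output folder inside a temporary directory.
out_dir <- file.path(tempdir(), "output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Write the data as in the example above.
out_file <- file.path(out_dir, "iris.csv")
write.csv(iris, file = out_file, row.names = FALSE, quote = FALSE)

# Read the file back and compare with the original data.
# stringsAsFactors = TRUE restores Species as a factor, as in iris.
back <- read.csv(out_file, stringsAsFactors = TRUE)
print(all.equal(iris, back))
```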

Task

Change some of the parameters (row.names, quote) and observe their effect on the resulting file
Save only the subset of flowers where Species=“setosa” to a file setosa.tsv

RData

In the context of the R programming language, RData is a very convenient (binary) format that can be used to save multiple data structures or even whole environments. It is very efficient when you exchange your data with other R users (or your future self).

Specific objects

Code

d = iris                # copy of iris data
fn="output/iris.RData"  # filename (and extension) of choice
save(d, file=fn)
rm(d)             # remove object d for illustration - and watch global env
load(fn)          # reload object d from file - and watch global env
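
Because load() restores objects under their original names, it can silently overwrite objects that already exist. One possible safeguard, sketched below, is to load into a separate environment and inspect it first. The file location (a temporary directory) and the extra object n are assumptions for illustration.

```r
# Save two objects into one file (temporary location).
d <- iris
n <- 42                                   # hypothetical extra object
rdata_file <- file.path(tempdir(), "iris.RData")
save(d, n, file = rdata_file)

# load() returns the names of the restored objects (invisibly).
# Loading into a separate environment avoids overwriting existing objects.
e <- new.env()
loaded <- load(rdata_file, envir = e)
print(loaded)
str(e$d)
```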

Task: All objects

Sometimes we want to save all objects and variables that have accumulated in the “Global Environment” - just to be sure. This task tests some jargon, familiarity with the directory structure and the ability to find help. Please try it yourself.

Create a new data object for the iris data set as before and additional variables for your favorite numbers and perhaps some favorite strings.
Save the whole environment (using save.image())
Delete the whole environment aka “workspace”; e.g. using rm(list=ls())
Reload the environment and confirm that you successfully recreated all objects
Determine your current working directory (using getwd())
Locate the saved image on disk and inspect its size. Delete it if you prefer.

Code

# your code snippet here

Notice: The suffix (.RData) is not strictly necessary, but it is best practice and used consistently by the community.

Review

Many different data sources, formats & structures
- text files: *.tsv, *.csv, …
- compressed files: *.bed.gz
- application specific: *.RData, (*.xls)
Reading Data: many ways
- read.csv(), read.table(), scan(), …
- from URL
- customization with parameters
- and there is more: special packages
Writing Data: many ways
- write.csv()
- save() → load()
- save.image() → load()
Data I/O can be challenging:
- file → memory
- know your paths, format, type, size
- ensure clean and structured data
- bring time and patience
R scripts: writing and running (source)
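
The last point, writing and running R scripts, can be sketched as follows. The script name my_script.R is taken from the beginning of this section; its content and the temporary location are hypothetical.

```r
# Write a tiny R script to a temporary file (content is a made-up example).
script_file <- file.path(tempdir(), "my_script.R")
writeLines(c(
  "# my_script.R: summarise the iris data",
  "mean_sepal <- mean(iris$Sepal.Length)",
  "print(mean_sepal)"
), script_file)

# Run all commands of the script in the current session.
source(script_file)
```

After source() has finished, the objects created by the script (here mean_sepal) are available in the global environment.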
