03: Getting Data In and Out

files
formats
directories
I/O
R scripts
Author

Thomas Manke

Goal: Ultimately we want to access our own data and write results to file.

CSV files

Comma-separated text files (ASCII) are both human and machine readible. Other separators may be chosen (tab or “|”). This format is frequently used for simple data, such as rows of different samples/observations and columns of multiple variables (per sample)

Important: Make sure that you know the precise location of your data file and provide this as filename.

Topics:

  • home and working directory
  • relative and absolute path
Code
getwd()                     # working directory
dir()                       # display content
filename='data/iris.csv'    # relative to wd
d = read.csv(filename)      # file content --> memory (d)
str(d)

There are many different ways to load such data into memory and to customize the loading.

Tasks:

  • Explore ?read.csv to get a first overview how this function can be customized.
  • How would you read only the first 10 lines?
  • Explore the data object d
  • Optional bonus: try your own file and brace! Is it clean enough?

From URL

Notice that files do not need to be available locally, but might be provided by some URL.

Be aware that in those cases there might be significant reduced loading speed, depending on your network connections.

Code
filename='https://raw.githubusercontent.com/maxplanck-ie/Rintro/2024.04/data/iris.tsv'
d = read.csv(filename, sep='\t')  
str(d)

Compressed formats

Especially for big data it is common to store them in compressed format (e.g. *gz) to reduced the storage footprint and speed-up data transfer. Such files are not human readable (binary) can also be read

Code
cmd = "gunzip -c data/iris.tsv.gz"   # command to uncompress
d = read.csv(pipe(cmd), sep='\t')       # read as pipe
str(d)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...

Writing data

There are many ways to save data to text files. One of the simplest uses write.csv.

Code
write.csv(iris, file="output/iris.csv", row.names=FALSE, quote=FALSE)

For large data you may prefer to write compressed version:

Code
write.csv(iris, gzfile("output/iris.csv.gz"))

Task:

  • Change some of the parameters (row.names, quote) and observed their effect on the resulting file
  • Save only the subset of flowers where Species=“setosa” to a file setosa.tsv

RData

In the context of the R-programming language, RData is a very convenient (binary) format that can be used to save multiple data structures or even whole environments It’s very efficient when you exchange your data with other R-users (or your future self)

Specific objects

Code
d = iris                # copy of iris data
fn="output/iris.RData"  # filename (and extension) of choice
save(d, file=fn)
rm(d)             # remove object d for illustration - and watch global env
load(fn)          # reload object d from file - and watch global env

Task: All objects

Sometimes we want to save all objects and variable that have accumulated in the “Global Environment” - just to be sure. This task tests some jargon, familiarity with directory structure and ability to find help. Please try it yourself.

  • Create a new data object for the iris data set as before and additional variables for your favorite numbers and perhaps some favorite strings.
  • Save the whole environment (using save.image())
  • Delete the whole environment aka “workspace”; e.g. using rm(list=ls())
  • Reload the environment and confirm that you successfully recreated all objects
  • Determine your current working directory (>getwd())
  • Locate saved image on disk and inspect its size. Delete it if you prefer.
Code
# your code snippet here

Notice: The suffix (.RData) is not strictly necessary, but it is best practice and used consistently by the community.


R Scripts

Typical data analyses involve many successive steps.
To record everything that was done - and ultimately to be able to reproduce this - we use scripts.

Basically these are just lines of code that are collected in a text file.

In Rstudio they can be created using File > New File > R script

Task (10 min):

  • Create a new R-script filter_iris.R to do the following:
    • Starting from the pre-compiled iris data, create a new data_frame d of all flowers where Sepal.Length is greater or equal to 7 cm.
    • determine the number flower in the new data_frame. Save this number as variable nf
    • How many different species are in the new data_frame. Save this number as variable ns (hint: there are two useful functions: unique() and length())
    • write the new data_frame d in comma seperated file iris_big_sepal.csv
    • save the whole environment in a file analysis.RData
  • Save the script and run it (source)
  • Bonus: Delete all whole environment reload it from the image file

Query: After the filtering, how many flowers and how many species are left?


Review

  • Many different data sources, formats & structures
    • text files: .tsv, .csv, …
    • compressed files: *.bed.gz
    • application specific: .RData, (.xls)
  • Reading Data: many ways
    • read.csv(), read.table(), scan(), …
    • from URL
    • customization with parameters
    • and there is more: special packages
  • Writing Data: many ways
    • write.csv()
    • save() \(\to\) load()
    • save.image() \(\to\) load()
  • Data I/O can be challenging:
    • file \(\to\) memory
    • know your paths, format, type, size
    • ensure clean and structured data
    • bring time and patience
  • R scripts: writing and running (source)




Appendix: reality of I/O

Many external data is badly formatted, leading to much wasted time - before any analysis.

The code snippets below are not run - but try them out

Code
file="data/GeneList.tsv"     
# file="https://raw.githubusercontent.com/maxplanck-ie/Rintro/master/data/GeneList.tsv"  
read.csv(file)      
read.csv(file, comment.char = "%")                        # comment lines
read.csv(file, comment.char = "%", sep="\t")              # separators
Code
# use ncol, is.na(), na.omit()
ic = colSums(is.na(d)) == ncol(d) # columns where all enries are NA
d1 = d[, !ic]                     # exclude those columns

d1
na.omit(d) 
na.omit(d1)

Messages:

  • This file is a mess
  • Many data come like this
  • read functions have many parameters for better control
  • there are also other dedicated software packages to help (somewhat)