10: Tidy Data

Juan Caballero

Our dataset

library(tidyverse)

rnaseq_file = "https://raw.githubusercontent.com/maxplanck-ie/Rintro/2024.04/qmd/data/rnaseq.csv"

rna = read_csv(rnaseq_file) 

Dplyr in action - combining tables

Dplyr can also merge tables with the basic table operations:

  1. An inner_join keeps observations that are present in both tables

  2. A left_join keeps observations that are present in the left (first) table, dropping those that are only present in the other

  3. A right_join keeps observations that are present in the right (second) table, dropping those that are only present in the other

  4. A full_join keeps all observations

Dplyr in action - combining tables

Dplyr in action - combining tables

As an example, let’s subset a part of the rna table

rna_mini <- rna %>%
   select(gene, sample, expression) %>%
   head(10)

rna_mini
# A tibble: 10 × 3
   gene    sample     expression
   <chr>   <chr>           <dbl>
 1 Asl     GSM2545336       1170
 2 Apod    GSM2545336      36194
 3 Cyp2d22 GSM2545336       4060
 4 Klk6    GSM2545336        287
 5 Fcrls   GSM2545336         85
 6 Slc2a4  GSM2545336        782
 7 Exd2    GSM2545336       1619
 8 Gjc2    GSM2545336        288
 9 Plp1    GSM2545336      43217
10 Gnb4    GSM2545336       1071

Dplyr in action - combining tables

Load the annotation table

annot_file = "https://raw.githubusercontent.com/maxplanck-ie/Rintro/2024.04/qmd/data/annot1.csv"

annot = read_csv(annot_file)

annot
# A tibble: 10 × 2
   gene    gene_description                                                     
   <chr>   <chr>                                                                
 1 Cyp2d22 cytochrome P450, family 2, subfamily d, polypeptide 22 [Source:MGI S…
 2 Klk6    kallikrein related-peptidase 6 [Source:MGI Symbol;Acc:MGI:1343166]   
 3 Fcrls   Fc receptor-like S, scavenger receptor [Source:MGI Symbol;Acc:MGI:19…
 4 Plp1    proteolipid protein (myelin) 1 [Source:MGI Symbol;Acc:MGI:97623]     
 5 Exd2    exonuclease 3'-5' domain containing 2 [Source:MGI Symbol;Acc:MGI:192…
 6 Apod    apolipoprotein D [Source:MGI Symbol;Acc:MGI:88056]                   
 7 Gnb4    guanine nucleotide binding protein (G protein), beta 4 [Source:MGI S…
 8 Slc2a4  solute carrier family 2 (facilitated glucose transporter), member 4 …
 9 Asl     argininosuccinate lyase [Source:MGI Symbol;Acc:MGI:88084]            
10 Gjc2    gap junction protein, gamma 2 [Source:MGI Symbol;Acc:MGI:2153060]    

Dplyr in action - combining tables

Combine rna_mini with annot

full_join(rna_mini, annot)
# A tibble: 10 × 4
   gene    sample     expression gene_description                               
   <chr>   <chr>           <dbl> <chr>                                          
 1 Asl     GSM2545336       1170 argininosuccinate lyase [Source:MGI Symbol;Acc…
 2 Apod    GSM2545336      36194 apolipoprotein D [Source:MGI Symbol;Acc:MGI:88…
 3 Cyp2d22 GSM2545336       4060 cytochrome P450, family 2, subfamily d, polype…
 4 Klk6    GSM2545336        287 kallikrein related-peptidase 6 [Source:MGI Sym…
 5 Fcrls   GSM2545336         85 Fc receptor-like S, scavenger receptor [Source…
 6 Slc2a4  GSM2545336        782 solute carrier family 2 (facilitated glucose t…
 7 Exd2    GSM2545336       1619 exonuclease 3'-5' domain containing 2 [Source:…
 8 Gjc2    GSM2545336        288 gap junction protein, gamma 2 [Source:MGI Symb…
 9 Plp1    GSM2545336      43217 proteolipid protein (myelin) 1 [Source:MGI Sym…
10 Gnb4    GSM2545336       1071 guanine nucleotide binding protein (G protein)…

Dplyr in action - combining tables

Note: both tables are compared in terms of variables, only matching variables are used to merge, in our case, both tables have a common column gene

In case the names don’t match, we need to define specific variables names for merging

full_join(rna_mini, annot, by = c("gene" = "gene_symbol"))

Any questions?