Introduction to R - 3rd lesson (import/export)

Nathalie Villa-Vialaneix - http://www.nathalievialaneix.eu
September 14-16th, 2015

Master TIDE, Université Paris 1

Get and set your working directory

For import/export operations, R resolves relative file paths from a working directory, which is returned by

getwd()
[1] "/home/nathalie/Private/Travail/Enseignements/masterTIDE"

that can be changed using

setwd("~/Rlesson3") # not run

or using the menu Session/Set working directory in RStudio.
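As a minimal sketch of how the working directory affects relative paths, the example below switches to a temporary directory, writes a file with a relative path, and then restores the initial directory. The names lesson.dir, old.wd and note.txt are hypothetical and only serve the illustration.

```r
# Hypothetical directory used only for this illustration
lesson.dir <- file.path(tempdir(), "Rlesson3")
dir.create(lesson.dir, showWarnings = FALSE)

old.wd <- setwd(lesson.dir)   # setwd() invisibly returns the previous directory
new.wd <- getwd()             # now points to lesson.dir

# A relative path is resolved from the working directory
writeLines("test", "note.txt")
note.found <- file.exists(file.path(lesson.dir, "note.txt"))

setwd(old.wd)                 # restore the initial working directory
```

Storing the value returned by setwd() is a convenient way to return to the previous directory at the end of a script.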

Importing data in R

Kinds of data that can be imported into R:

  • text (and csv) files
  • rda (R data) files
  • more complicated/specific files (Excel, HTML, JSON, SQL…) with specific packages

You can import files from:

  • local files
  • internet (http, https, ftp)
  • database servers (MySQL)…

Download a file from the internet

fileURL <- "http://www.nathalievialaneix.eu/doc/csv/ex-data-tide.csv"
dir.create("data")
download.file(fileURL, destfile="data/ex-data.csv")
list.files("data")
[1] "ex-data.csv"
ls()
[1] "fileURL"

Import a text/csv file in R

df <- read.table("data/ex-data.csv", sep=";", header=TRUE,
                 stringsAsFactors=TRUE) # factors, as in the summary below
summary(df[,1:3])
     annee           age        nee.france
 Min.   :2007   Min.   :17.00   Non:10    
 1st Qu.:2008   1st Qu.:18.00   Oui:89    
 Median :2009   Median :19.00             
 Mean   :2010   Mean   :19.05             
 3rd Qu.:2010   3rd Qu.:20.00             
 Max.   :2013   Max.   :26.00             

Important options for read.table

  • sep: column separator character (default: white space)
  • header (TRUE/FALSE): are column names contained in the first line? (default: FALSE)
  • dec: decimal separator character (default: a dot, ".")
  • quote: quoting characters (default: both " and ')
  • row.names: a number giving the column which contains the row names (default: the file contains no column with row names)
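  • na.strings: strings to be interpreted as NA (default: "NA"; blank fields are also read as NA in numeric and logical columns)
  • stringsAsFactors: are strings imported as factors (TRUE, the default before R 4.0.0) or as characters (FALSE, the default since R 4.0.0)?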
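The options listed above can be tried on a small inline dataset. This is only a sketch: the data are made up, and the text argument of read.table is used so that no file is needed.

```r
# Made-up data: ";" separates columns, "," is the decimal mark,
# and "?" marks a missing value
raw <- c("id;height;city",
         "a;1,75;Paris",
         "b;?;Toulouse",
         "c;1,62;Lyon")

toy <- read.table(text = raw, sep = ";", header = TRUE, dec = ",",
                  na.strings = "?", row.names = 1,
                  stringsAsFactors = FALSE)

str(toy)        # height is numeric, city is character
rownames(toy)   # taken from the first column (id)
```

Changing one option at a time (for instance dropping dec = ",") and calling str(toy) again is a quick way to see what each option does.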

Quick import of CSV files

read.csv (English standard format: comma separator, dot as decimal mark) and read.csv2 (French standard format: semicolon separator, comma as decimal mark) can be used to import CSV files

df <- read.csv2("data/ex-data.csv", stringsAsFactors=FALSE)
summary(df[,4:5])
  cp.naissance       sexe          
 Min.   : 6600   Length:99         
 1st Qu.:11000   Class :character  
 Median :33000   Mode  :character  
 Mean   :42042                     
 3rd Qu.:69000                     
 Max.   :98714                     
 NA's   :14                        
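To make the difference between the two conventions concrete, here is a small sketch with made-up data written in both formats. It also shows what happens when the decimal mark is not declared.

```r
# Made-up data written in both conventions
english <- c("name,score", "x,1.5", "y,2.5")
french  <- c("name;score", "x;1,5", "y;2,5")

en <- read.csv(text = english)    # sep = ",", dec = "."
fr <- read.csv2(text = french)    # sep = ";", dec = ","

all.equal(en, fr)   # compare the two imported data frames

# Same French data, but the decimal mark is left at its default ("."):
# the score column is no longer read as numbers
bad <- read.table(text = french, sep = ";", header = TRUE)
is.numeric(bad$score)
```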

Reading files as text

Files can be read as strings (and processed inside R).

cur.conn <- url(fileURL) # create a connection to the URL
df2 <- readLines(cur.conn, n=3)
lapply(df2, substr, start=1, stop=15) # first 15 characters
[[1]]
[1] "\"annee\";\"age\";\""

[[2]]
[1] "2007;19;\"Oui\";7"

[[3]]
[1] "2007;19;\"Oui\";1"
close(cur.conn) # close the connection
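The same approach works with local files. The sketch below assumes a small temporary file created on the spot (tmp.file and its content are hypothetical); it reads the header and the data lines separately, then splits the lines by hand.

```r
# Hypothetical small file created for the illustration
tmp.file <- tempfile(fileext = ".csv")
writeLines(c("\"year\";\"age\"", "2007;19", "2008;21"), tmp.file)

con <- file(tmp.file, open = "r")      # open a read connection
header.line <- readLines(con, n = 1)   # reads line 1
data.lines  <- readLines(con)          # continues from line 2 to the end
close(con)

# Split each data line on ";" and convert the pieces to numbers
fields <- strsplit(data.lines, ";", fixed = TRUE)
values <- do.call(rbind, lapply(fields, as.numeric))
values
```

Because the connection is opened explicitly, successive calls to readLines continue where the previous one stopped.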

Exporting matrices, data frames and vectors to text files

write.table(df, file="data/export-data.txt")
write.csv2(df, file="data/export-data.csv",
           row.names=FALSE)

These functions take approximately the same options as read.table.
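A simple way to check an export is to read the file back and compare it with the original object. The following sketch does this round trip on a made-up data frame written to a temporary file (toy, out.file and back are hypothetical names).

```r
toy <- data.frame(name = c("a", "b"), score = c(1.5, NA),
                  stringsAsFactors = FALSE)
out.file <- tempfile(fileext = ".csv")

# write.csv2 uses ";" as separator and "," as decimal mark;
# na = "" writes missing values as empty fields
write.csv2(toy, file = out.file, row.names = FALSE, na = "")

readLines(out.file)   # inspect the raw text that was written

back <- read.csv2(out.file, stringsAsFactors = FALSE)
all.equal(toy, back)  # compare the re-imported data with the original
```

Setting row.names = FALSE avoids an extra unnamed first column when the file is imported again.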

Exporting objects to an Rdata file

If you want to save more complicated variables or several variables in a single file, you can use the Rdata format:

data(iris); ls()
[1] "cur.conn" "df"       "df2"      "fileURL"  "iris"    
save(df, iris, file="data/export-ws.rda")

Loading an Rdata file

Rdata files are loaded with:

rm(list=ls()); ls()
character(0)
load("data/export-ws.rda"); ls()
[1] "df"   "iris"
load("data/export-ws.rda", verbose=TRUE)
Loading objects:
  df
  iris
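Since load restores objects under their saved names, it silently overwrites objects with the same names in the workspace. One possible way around this, sketched below with hypothetical objects, is to load the file into a separate environment using the envir argument.

```r
x <- 1:5
params <- list(alpha = 0.1, label = "demo")
rda.file <- tempfile(fileext = ".rda")
save(x, params, file = rda.file)

# The workspace version of x changes after the save
x <- "modified afterwards"

# Loading into a fresh environment leaves the workspace untouched
sandbox <- new.env()
loaded.names <- load(rda.file, envir = sandbox)  # returns the object names

loaded.names
sandbox$x   # the saved version
x           # the workspace version is unchanged
```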

Importing more complicated data

  • Excel files: see xlsx package
  • HTML and XML files: see XML package
  • JSON data: see jsonlite and rjson packages
  • MySQL databases: see RMySQL package
  • Big data: see rhdf5 (HDF5) and rhdfs (hadoop) packages
  • … (compressed files, Stata files, SAS files, SPSS files, Octave files …)

Exercise 1

Using the file at http://www.nathalievialaneix.eu/doc/csv/co2.csv, import the corresponding data and answer the following questions:

  • What is the dimension of the data?

  • What are the variables included in the data? What are their types?

  • Make a contingency table of the variables Type and Treatment.

  • What is the median uptake value for each plant Type?