Introduction to R - 5th lesson (statistics)

Nathalie Villa-Vialaneix - http://www.nathalievialaneix.eu
September 14-16th, 2015

Master TIDE, Université Paris 1

R

Probability distribution

R

  • functions related to probability distributions
  • random number generations
  • reproducibility

Gaussian distribution

dnorm(0.5); dnorm(c(-0.5,0,0.5))
[1] 0.3520653
[1] 0.3520653 0.3989423 0.3520653
curve(dnorm, xlim=c(-3,3))

plot of chunk densGaussian

Gaussian cumulative distribution

pnorm(0.5); pnorm(c(-0.5,0,0.5))
[1] 0.6914625
[1] 0.3085375 0.5000000 0.6914625
curve(pnorm, xlim=c(-3,3))

plot of chunk cumDistGaussian

Inverse Gaussian distribution

qnorm(0.975) # quantile of order 97.5%
[1] 1.959964
curve(qnorm, xlim=c(0,1))

plot of chunk invGaussian

Continuous density distributions

The same functions are available for other continuous distributions, as: beta, Cauchy, exponential, Fisher, gamma, log-normal, uniform,, \( \chi^2 \)… see help(Distributions).

Poisson distribution

dpois(1, lambda=3)
[1] 0.1493612
barplot(dpois(1:10, lambda=3), col="darkred",
        names=1:10)

plot of chunk densPois

Poisson cumulative distribution

ppois(1, lambda=3)
[1] 0.1991483
barplot(ppois(1:10, lambda=3), col="darkred",
        names=1:10)

plot of chunk cumDensPois

Inverse Poisson distribution

qpois(0.95, lambda=3) # quantile of order 95%
[1] 6
x <- seq(0,1,length=1000)
plot(x, qpois(x,lambda=3), type="h")

plot of chunk invPois

Discrete density distributions

The same functions are available for other discrete distributions, as: binomial, multinomial, negative binomial, … see help(Distributions).

Random sampling

R can generate pseudo-random numbers or samples. A random sample is taken from a list using the function sample (with or without replacement):

sample(1:12, 5, replace=FALSE)
[1]  1  9  8 12  6
sample(1:12, 20, replace=TRUE)
 [1] 12  3  7  2  9  3 11  3  4  7 10  8 12  3  1  8  3  1  7 12

Random number generation

Random numbers are generated from a given distribution using the prefix “r” with the previously introduced functions:

rnorm(3); hist(rnorm(50, mean=2, sd=3))
[1]  0.1460172 -0.3260648 -1.2325926

plot of chunk random2

Random number generation

Large samples approximate well the probability distribution:

hist(rnorm(1000), col="darkred", freq=FALSE)
curve(dnorm, lwd=4, col="green", add=TRUE)

plot of chunk random3

Reproductible random sampling

Random functions produce a different result each time they are called. Reproducible results can be obtained using set.seed with an arbitrary random seed:

rnorm(3) # varies each time it is called
[1] -0.4628617  1.1976318  2.0578466
set.seed(2807); rnorm(3)
[1] -1.1624644 -0.1750661  0.8881384

Statistical tests

R

  • normality tests
  • equality tests
  • independence tests

Normality tests

Standard tests to check the normality of a sample are the Kolmogorov-Smirnov test and the Shapiro-Wilk test:

to.test <- rnorm(100, mean=5)
shapiro.test(to.test)

    Shapiro-Wilk normality test

data:  to.test
W = 0.99396, p-value = 0.9389

Normality tests

to.test <- runif(100)
res <- ks.test(to.test, "dnorm")
res; res$statistic; res$p.value

    One-sample Kolmogorov-Smirnov test

data:  to.test
D = 0.75723, p-value < 2.2e-16
alternative hypothesis: two-sided
        D 
0.7572269 
[1] 0

Student test

Student test is used to test an average value:

to.test <- rnorm(100, mean=5)
t.test(to.test, mu=5)

    One Sample t-test

data:  to.test
t = 0.49773, df = 99, p-value = 0.6198
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
 4.847839 5.254059
sample estimates:
mean of x 
 5.050949 

Student test

Student test is used to test the mean equality between two samples:

to.test <- rnorm(100)
to.test2 <- rnorm(100, mean=1)
res <- t.test(to.test, to.test2, var.equal=TRUE)
res$p.value; res$statistic; qt(c(0.025,0.975), df=99)
[1] 5.562276e-09
        t 
-6.096964 
[1] -1.984217  1.984217

Wilcoxon test

Wilcoxon test is a non parametric test to test the equality between two distributions:

to.test <- rnorm(100); to.test2 <- runif(200)
wilcox.test(to.test, to.test2)

    Wilcoxon rank sum test with continuity correction

data:  to.test and to.test2
W = 7163, p-value = 6.208e-05
alternative hypothesis: true location shift is not equal to 0

Chi square test

\( \chi^2 \) test test the dependancy between two variables in a contingency table:

CT <- as.table(rbind(c(762, 327, 468),
                    c(484, 239, 477)))
dimnames(CT) <- list(gender = c("M","F"),
                    party =
c("Democrat","Independent", "Republican"))
res <- chisq.test(CT)  ; res

    Pearson's Chi-squared test

data:  CT
X-squared = 30.07, df = 2, p-value = 2.954e-07

Chi square test

The resulting object is a list with further information:

names(res)
[1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed" 
[7] "expected"  "residuals" "stdres"   
res$expected
      party
gender Democrat Independent Republican
     M 703.6714    319.6453   533.6834
     F 542.3286    246.3547   411.3166

Linear regression

R

  • fit a linear regression
  • interpretation and tests
  • prediction

Fit a linear regression

data(airquality)
res <- lm(Ozone~Wind, data=airquality); res

Call:
lm(formula = Ozone ~ Wind, data = airquality)

Coefficients:
(Intercept)         Wind  
     96.873       -5.551  

Interpret results

names(res)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "na.action"     "xlevels"       "call"          "terms"        
[13] "model"        
res$coefficients
(Intercept)        Wind 
  96.872895   -5.550923 

Interpret results

head(res$fitted)
       1        2        3        4        6        7 
55.79607 52.46551 26.93127 33.03728 14.16414 49.13496 
head(res$residuals)
        1         2         3         4         6         7 
-14.79607 -16.46551 -14.93127 -15.03728  13.83586 -26.13496 
cor(airquality$Ozone, airquality$Wind,
    use="complete.obs")
[1] -0.6015465

Graphics

R is an object oriented program: the class of res is “lm” that has a dedicated plot function: plot(res) plots four charts

par(mfrow=c(2,2))
plot(res)

plot of chunk plotLM

Graphics

plot(Ozone~Wind, data=airquality, pch="+")
abline(res$coefficients, col="blue", lwd=2)

plot of chunk plotLM2

Graphics

Check residues' normality and heteroscedasticity

par(mfrow=c(1,2))
hist(res$residuals, col="grey")
plot(res$residuals~res$fitted, pch=19)

plot of chunk plotLM3

Tests

cor.test(airquality$Ozone, airquality$Wind)

    Pearson's product-moment correlation

data:  airquality$Ozone and airquality$Wind
t = -8.0401, df = 114, p-value = 9.272e-13
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7063918 -0.4708713
sample estimates:
       cor 
-0.6015465 

Tests

anova(res)
Analysis of Variance Table

Response: Ozone
           Df Sum Sq Mean Sq F value    Pr(>F)    
Wind        1  45284   45284  64.644 9.272e-13 ***
Residuals 114  79859     701                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Prediction

predict(res, data.frame(Wind=c(5,20)))
        1         2 
 69.11828 -14.14556 
# predict(res, c(5,20)) 
## does not work because 'Wind' is not set

Exercise 1

data(Orange)
  • What is the correlation coefficient between age and circumference?

  • Is the correlation significant?

  • What are the coefficients of the regression of circumference on age?

  • What is the expected uptake rate for a Orange concentration equal to the median of age?

Exercise 1

Make the following chart: plot of chunk ex1-res5

Exercise 1

set.seed(1976)

Generate a random Gaussian vector with the same mean and variance than circumference and fit a linear model of this variable on age. Compare the results with those of the previous model.

plot of chunk ex1-res6