Introduction to R - 5th lesson (statistics)

Nathalie Villa-Vialaneix - http://www.nathalievialaneix.eu
September 14-16th, 2015

Master TIDE, Université Paris 1

Probability distribution

functions related to probability distributions
random number generations
reproducibility

Gaussian distribution

dnorm(0.5); dnorm(c(-0.5,0,0.5))

[1] 0.3520653

[1] 0.3520653 0.3989423 0.3520653

curve(dnorm, xlim=c(-3,3))

plot of chunk densGaussian

Gaussian cumulative distribution

pnorm(0.5); pnorm(c(-0.5,0,0.5))

[1] 0.6914625

[1] 0.3085375 0.5000000 0.6914625

curve(pnorm, xlim=c(-3,3))

plot of chunk cumDistGaussian

Inverse Gaussian distribution

qnorm(0.975) # quantile of order 97.5%

[1] 1.959964

curve(qnorm, xlim=c(0,1))

plot of chunk invGaussian

Continuous density distributions

The same functions are available for other continuous distributions, as: beta, Cauchy, exponential, Fisher, gamma, log-normal, uniform,, \( \chi^2 \)… see help(Distributions).

Poisson distribution

dpois(1, lambda=3)

[1] 0.1493612

barplot(dpois(1:10, lambda=3), col="darkred",
        names=1:10)

plot of chunk densPois

Poisson cumulative distribution

ppois(1, lambda=3)

[1] 0.1991483

barplot(ppois(1:10, lambda=3), col="darkred",
        names=1:10)

plot of chunk cumDensPois

Inverse Poisson distribution

qpois(0.95, lambda=3) # quantile of order 95%

[1] 6

x <- seq(0,1,length=1000)
plot(x, qpois(x,lambda=3), type="h")

plot of chunk invPois

Discrete density distributions

The same functions are available for other discrete distributions, as: binomial, multinomial, negative binomial, … see help(Distributions).

Random sampling

R can generate pseudo-random numbers or samples. A random sample is taken from a list using the function sample (with or without replacement):

sample(1:12, 5, replace=FALSE)

[1]  1  9  8 12  6

sample(1:12, 20, replace=TRUE)

 [1] 12  3  7  2  9  3 11  3  4  7 10  8 12  3  1  8  3  1  7 12

Random number generation

Random numbers are generated from a given distribution using the prefix “r” with the previously introduced functions:

rnorm(3); hist(rnorm(50, mean=2, sd=3))

[1]  0.1460172 -0.3260648 -1.2325926

plot of chunk random2

Random number generation

Large samples approximate well the probability distribution:

hist(rnorm(1000), col="darkred", freq=FALSE)
curve(dnorm, lwd=4, col="green", add=TRUE)

plot of chunk random3

Reproductible random sampling

Random functions produce a different result each time they are called. Reproducible results can be obtained using set.seed with an arbitrary random seed:

rnorm(3) # varies each time it is called

[1] -0.4628617  1.1976318  2.0578466

set.seed(2807); rnorm(3)

[1] -1.1624644 -0.1750661  0.8881384

Statistical tests

normality tests
equality tests
independence tests

Normality tests

Standard tests to check the normality of a sample are the Kolmogorov-Smirnov test and the Shapiro-Wilk test:

to.test <- rnorm(100, mean=5)
shapiro.test(to.test)


    Shapiro-Wilk normality test

data:  to.test
W = 0.99396, p-value = 0.9389

Normality tests

to.test <- runif(100)
res <- ks.test(to.test, "dnorm")
res; res$statistic; res$p.value


    One-sample Kolmogorov-Smirnov test

data:  to.test
D = 0.75723, p-value < 2.2e-16
alternative hypothesis: two-sided

        D 
0.7572269

[1] 0

Student test

Student test is used to test an average value:

to.test <- rnorm(100, mean=5)
t.test(to.test, mu=5)


    One Sample t-test

data:  to.test
t = 0.49773, df = 99, p-value = 0.6198
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
 4.847839 5.254059
sample estimates:
mean of x 
 5.050949

Student test

Student test is used to test the mean equality between two samples:

to.test <- rnorm(100)
to.test2 <- rnorm(100, mean=1)
res <- t.test(to.test, to.test2, var.equal=TRUE)
res$p.value; res$statistic; qt(c(0.025,0.975), df=99)

[1] 5.562276e-09

        t 
-6.096964

[1] -1.984217  1.984217

Wilcoxon test

Wilcoxon test is a non parametric test to test the equality between two distributions:

to.test <- rnorm(100); to.test2 <- runif(200)
wilcox.test(to.test, to.test2)


    Wilcoxon rank sum test with continuity correction

data:  to.test and to.test2
W = 7163, p-value = 6.208e-05
alternative hypothesis: true location shift is not equal to 0

Chi square test

\( \chi^2 \) test test the dependancy between two variables in a contingency table:

CT <- as.table(rbind(c(762, 327, 468),
                    c(484, 239, 477)))
dimnames(CT) <- list(gender = c("M","F"),
                    party =
c("Democrat","Independent", "Republican"))
res <- chisq.test(CT)  ; res


    Pearson's Chi-squared test

data:  CT
X-squared = 30.07, df = 2, p-value = 2.954e-07

Chi square test

The resulting object is a list with further information:

names(res)

[1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed" 
[7] "expected"  "residuals" "stdres"

res$expected

      party
gender Democrat Independent Republican
     M 703.6714    319.6453   533.6834
     F 542.3286    246.3547   411.3166

Linear regression

fit a linear regression
interpretation and tests
prediction

Fit a linear regression

data(airquality)
res <- lm(Ozone~Wind, data=airquality); res


Call:
lm(formula = Ozone ~ Wind, data = airquality)

Coefficients:
(Intercept)         Wind  
     96.873       -5.551

Interpret results

names(res)

 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "na.action"     "xlevels"       "call"          "terms"        
[13] "model"

res$coefficients

(Intercept)        Wind 
  96.872895   -5.550923

Interpret results

head(res$fitted)

       1        2        3        4        6        7 
55.79607 52.46551 26.93127 33.03728 14.16414 49.13496

head(res$residuals)

        1         2         3         4         6         7 
-14.79607 -16.46551 -14.93127 -15.03728  13.83586 -26.13496

cor(airquality$Ozone, airquality$Wind,
    use="complete.obs")

[1] -0.6015465

Graphics

R is an object oriented program: the class of res is “lm” that has a dedicated plot function: plot(res) plots four charts

par(mfrow=c(2,2))
plot(res)

plot of chunk plotLM

Graphics

plot(Ozone~Wind, data=airquality, pch="+")
abline(res$coefficients, col="blue", lwd=2)

plot of chunk plotLM2

Graphics

Check residues' normality and heteroscedasticity

par(mfrow=c(1,2))
hist(res$residuals, col="grey")
plot(res$residuals~res$fitted, pch=19)

plot of chunk plotLM3

Tests

cor.test(airquality$Ozone, airquality$Wind)


    Pearson's product-moment correlation

data:  airquality$Ozone and airquality$Wind
t = -8.0401, df = 114, p-value = 9.272e-13
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7063918 -0.4708713
sample estimates:
       cor 
-0.6015465

Tests

anova(res)

Analysis of Variance Table

Response: Ozone
           Df Sum Sq Mean Sq F value    Pr(>F)    
Wind        1  45284   45284  64.644 9.272e-13 ***
Residuals 114  79859     701                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Prediction

predict(res, data.frame(Wind=c(5,20)))

        1         2 
 69.11828 -14.14556

# predict(res, c(5,20)) 
## does not work because 'Wind' is not set

Exercise 1

data(Orange)

What is the correlation coefficient between age and circumference?
Is the correlation significant?
What are the coefficients of the regression of circumference on age?
What is the expected uptake rate for a Orange concentration equal to the median of age?

Exercise 1

Make the following chart: plot of chunk ex1-res5

Exercise 1

set.seed(1976)

Generate a random Gaussian vector with the same mean and variance than circumference and fit a linear model of this variable on age. Compare the results with those of the previous model.

plot of chunk ex1-res6