Nathalie Villa-Vialaneix - http://www.nathalievialaneix.eu
September 14-16th, 2015
Master TIDE, Université Paris 1
dnorm(0.5); dnorm(c(-0.5,0,0.5))
[1] 0.3520653
[1] 0.3520653 0.3989423 0.3520653
curve(dnorm, xlim=c(-3,3))
pnorm(0.5); pnorm(c(-0.5,0,0.5))
[1] 0.6914625
[1] 0.3085375 0.5000000 0.6914625
curve(pnorm, xlim=c(-3,3))
qnorm(0.975) # quantile of order 97.5%
[1] 1.959964
curve(qnorm, xlim=c(0,1))
The same functions are available for other continuous distributions, as: beta,
Cauchy, exponential, Fisher, gamma, log-normal, uniform,, \( \chi^2 \)…
see help(Distributions)
.
dpois(1, lambda=3)
[1] 0.1493612
barplot(dpois(1:10, lambda=3), col="darkred",
names=1:10)
ppois(1, lambda=3)
[1] 0.1991483
barplot(ppois(1:10, lambda=3), col="darkred",
names=1:10)
qpois(0.95, lambda=3) # quantile of order 95%
[1] 6
x <- seq(0,1,length=1000)
plot(x, qpois(x,lambda=3), type="h")
The same functions are available for other discrete distributions, as: binomial,
multinomial, negative binomial, … see help(Distributions)
.
R can generate pseudo-random numbers or samples. A random sample is taken
from a list using the function sample
(with or without replacement):
sample(1:12, 5, replace=FALSE)
[1] 1 9 8 12 6
sample(1:12, 20, replace=TRUE)
[1] 12 3 7 2 9 3 11 3 4 7 10 8 12 3 1 8 3 1 7 12
Random numbers are generated from a given distribution using the prefix “r” with the previously introduced functions:
rnorm(3); hist(rnorm(50, mean=2, sd=3))
[1] 0.1460172 -0.3260648 -1.2325926
Large samples approximate well the probability distribution:
hist(rnorm(1000), col="darkred", freq=FALSE)
curve(dnorm, lwd=4, col="green", add=TRUE)
Random functions produce a different result each time they are called.
Reproducible results can be obtained using set.seed
with an arbitrary
random seed:
rnorm(3) # varies each time it is called
[1] -0.4628617 1.1976318 2.0578466
set.seed(2807); rnorm(3)
[1] -1.1624644 -0.1750661 0.8881384
Standard tests to check the normality of a sample are the Kolmogorov-Smirnov test and the Shapiro-Wilk test:
to.test <- rnorm(100, mean=5)
shapiro.test(to.test)
Shapiro-Wilk normality test
data: to.test
W = 0.99396, p-value = 0.9389
to.test <- runif(100)
res <- ks.test(to.test, "dnorm")
res; res$statistic; res$p.value
One-sample Kolmogorov-Smirnov test
data: to.test
D = 0.75723, p-value < 2.2e-16
alternative hypothesis: two-sided
D
0.7572269
[1] 0
Student test is used to test an average value:
to.test <- rnorm(100, mean=5)
t.test(to.test, mu=5)
One Sample t-test
data: to.test
t = 0.49773, df = 99, p-value = 0.6198
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
4.847839 5.254059
sample estimates:
mean of x
5.050949
Student test is used to test the mean equality between two samples:
to.test <- rnorm(100)
to.test2 <- rnorm(100, mean=1)
res <- t.test(to.test, to.test2, var.equal=TRUE)
res$p.value; res$statistic; qt(c(0.025,0.975), df=99)
[1] 5.562276e-09
t
-6.096964
[1] -1.984217 1.984217
Wilcoxon test is a non parametric test to test the equality between two distributions:
to.test <- rnorm(100); to.test2 <- runif(200)
wilcox.test(to.test, to.test2)
Wilcoxon rank sum test with continuity correction
data: to.test and to.test2
W = 7163, p-value = 6.208e-05
alternative hypothesis: true location shift is not equal to 0
\( \chi^2 \) test test the dependancy between two variables in a contingency table:
CT <- as.table(rbind(c(762, 327, 468),
c(484, 239, 477)))
dimnames(CT) <- list(gender = c("M","F"),
party =
c("Democrat","Independent", "Republican"))
res <- chisq.test(CT) ; res
Pearson's Chi-squared test
data: CT
X-squared = 30.07, df = 2, p-value = 2.954e-07
The resulting object is a list with further information:
names(res)
[1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
[7] "expected" "residuals" "stdres"
res$expected
party
gender Democrat Independent Republican
M 703.6714 319.6453 533.6834
F 542.3286 246.3547 411.3166
data(airquality)
res <- lm(Ozone~Wind, data=airquality); res
Call:
lm(formula = Ozone ~ Wind, data = airquality)
Coefficients:
(Intercept) Wind
96.873 -5.551
names(res)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "na.action" "xlevels" "call" "terms"
[13] "model"
res$coefficients
(Intercept) Wind
96.872895 -5.550923
head(res$fitted)
1 2 3 4 6 7
55.79607 52.46551 26.93127 33.03728 14.16414 49.13496
head(res$residuals)
1 2 3 4 6 7
-14.79607 -16.46551 -14.93127 -15.03728 13.83586 -26.13496
cor(airquality$Ozone, airquality$Wind,
use="complete.obs")
[1] -0.6015465
R is an object oriented program: the class of res
is “lm”
that has a dedicated plot
function: plot(res)
plots four charts
par(mfrow=c(2,2))
plot(res)
plot(Ozone~Wind, data=airquality, pch="+")
abline(res$coefficients, col="blue", lwd=2)
Check residues' normality and heteroscedasticity
par(mfrow=c(1,2))
hist(res$residuals, col="grey")
plot(res$residuals~res$fitted, pch=19)
cor.test(airquality$Ozone, airquality$Wind)
Pearson's product-moment correlation
data: airquality$Ozone and airquality$Wind
t = -8.0401, df = 114, p-value = 9.272e-13
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.7063918 -0.4708713
sample estimates:
cor
-0.6015465
anova(res)
Analysis of Variance Table
Response: Ozone
Df Sum Sq Mean Sq F value Pr(>F)
Wind 1 45284 45284 64.644 9.272e-13 ***
Residuals 114 79859 701
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
predict(res, data.frame(Wind=c(5,20)))
1 2
69.11828 -14.14556
# predict(res, c(5,20))
## does not work because 'Wind' is not set
data(Orange)
What is the correlation coefficient between age
and circumference
?
Is the correlation significant?
What are the coefficients of the regression of circumference
on age
?
What is the expected uptake rate for a Orange concentration equal to the
median of age
?
Make the following chart:
set.seed(1976)
Generate a random Gaussian vector with the same mean and variance
than circumference
and fit a linear model of this variable on age
.
Compare the results with those of the previous model.