class: center, middle, inverse, title-slide

# Introductory statistics

### Nathalie Vialaneix & Sandrine Laguerre

### Toulouse, March 14–16, 2022

---

<!-- Section 1: introduction, definition, types of variables -->

## An elementary map of statistics

.center[<img src="img/statistics-map.png" height="500">]

---

## An elementary map of statistics

.center[<img src="img/statistics-map-filled.png" height="500">]

---

## Before you start: tidy your data!

.left[<img src="img/real-data.png" height="300">]

---

## Before you start: tidy your data!

.left[<img src="img/real-data.png" height="300"> <img src="img/clean-data.png" height="300"> <img src="img/tidy_data.png" width="350">]

**clean data**: one sample in each row, one variable in each column, one value in each cell

.center[“Like families, tidy datasets are all alike but every messy dataset is messy in its own way.”]

<font size="3">Hadley Wickham (2014) Tidy Data. Journal of Statistical Software, 59(10).</font>

---

## Types of variables

* numeric (discrete or continuous)

* non numeric (ordered or not)

<img src="img/warning.png" height="50"> pay attention to how things are encoded!

.right[<img src="img/clean-data.png" height="350">]

---

class: center, inverse, middle

<!-- Section 2: univariate statistics (and graphics) -->

## Univariate statistics

---

## Statistics (i.e., numerical characteristics)

### Purpose

summarize a series of values by one numeric value

--

### central characteristics

*indicateurs de tendance centrale*

### dispersion characteristics

*indicateurs de dispersion*

---

## Central characteristics

- **Mean** (**Moyenne**): sum of the observed values divided by the number of observations: `\(\overline{X} = \frac{1}{n} \sum_{i=1}^n x_i\)`

- **Median** (**Médiane**): value that splits the sample into two subsamples of equal size

- **Mode** (**Mode**): most frequently observed value

- **Quartiles** (**Quartiles**): 3 values that split the sample into 4 equal-size subsamples

- **Deciles** (**Déciles**): 9 values that split the sample into 10 equal-size subsamples

- **Percentiles** (**Percentiles**): 99 values that split the sample into 100 equal-size subsamples

- **Quantiles** (**Quantiles**): generalize all of the above

---

## Never heard of percentiles?

.center[<img src="img/carnet-sante.png" height="500">]

---

## Mean and median: the mean is not robust!

How to increase the mean salary in a company?

--

* increase all salaries by `\(x\)`\%

* increase the salary of the best-paid person

* cut a few low-paid jobs

---

## Are central characteristics enough?
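A quick illustration (a hypothetical example, not the marks plotted below): two series can share the same mean and the same median while being spread out very differently.

```r
# hypothetical example: identical mean and median, very different dispersion
a <- c(10, 10, 10, 10, 10)
b <- c(0, 5, 10, 15, 20)
mean(a); mean(b)      # both equal to 10
median(a); median(b)  # both equal to 10
sd(a); sd(b)          # 0 versus about 7.9
```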
![](intro-stat_files/figure-html/central-1.png)<!-- --> 10 marks for 5 students: same mean, same median --- ## Dispersion characteristics * **variance**: average squared distance to the mean `\(\mbox{Var}(X) = \frac{1}{n} \sum_{i=1}^n (x_i - \overline{X})^2\)` and **standard deviation** (*écart type*): `\(\sigma_X = \sqrt{\mbox{Var}(X)}\)` -- <br> <br> * **range** (*étendue*): difference between the largest and the smallest values * **inter-quartile range** (*écart inter-quartile*): difference between the 1st and the 3rd quartiles (half of the observations lie between these two quantities) --- ## A few properties for the standard deviation * positive (or null if all the observations have the same value) * does not change when values are translated * sensible to extreme values (like the mean) * expressed in the same unit than the original variable (like the mean) <br> <br> -- **Consequences** * mean and standard deviation can be added (confidence interval) * they can also be divided: `\(\mbox{CV}(X) = \frac{\sigma_X}{\overline{X}}\)` (*coefficient de variation*) can be used to compare the respective variability of two series --- ## Example of the impact of variability « **Les filles brillent en classe, les garçons aux concours** *LE MONDE | 07.09.09 - Article paru dans l'édition du 08.09.09. Philippe Jacqué* Elles obtiennent de meilleurs résultats en cours de scolarité, mais réussissent moins bien les concours des meilleures grandes écoles que les hommes. [...] Pour vérifier [cette hypothèse], trois économistes - Evren Örs, professeur à HEC, Eloïc Peyrache, directeur d'HEC, et Frédéric Palomino, ancien de l'école parisienne et actuel professeur associé à l'Edhec Lille - ont étudié à la loupe les résultats obtenus entre 2005 et 2007 au concours d'admission en première année d'HEC, une des écoles de management les plus réputées. [...] « D'un point de vue technique, il semble que la structure du concours HEC crée d'avantage d'hétérogénéité chez les hommes que chez les femmes », estime M. Peyrache. Si, « en moyenne », les performances des hommes et des femmes sont similaires, « les notes des femmes sont concentrées autour de la moyenne, tandis que celles des hommes sont très dispersées avec beaucoup de très bonnes notes et de très mauvaises. Mécaniquement, quand on sélectionne les 380 premiers résultats, on a un peu plus d'hommes ». --- ## Standard modifications of data * **binarization of a numeric variable** (*discrétisation*): transform a numeric variable into a factor by: <ul> <li>creating intervals of equal width</li> <li>creating intervals of equal number of observations (<em>how to do that?</em>)</li> <li>other...</li> </ul> -- <br> <br> Which solution sounds the best? What are the advantages/drawbacks of such a transformation? --- ## Standard modifications of data * **centering and scaling (to unit variance)** (*centrage et reduction*), often called **Z-score**: <ul> <li>centering: removing the mean</li> <li>scaling: dividing by the standard deviation</li> </ul> `\(z_i = \frac{x_i - \overline{X}}{\sigma_X}\)` <br> After centering and scaling, the mean of the variable is 0 and its standard deviation is 1. 
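In R, this transformation is what `scale()` performs; a minimal sketch on a hypothetical vector (note that `sd()` uses the `\(n-1\)` denominator):

```r
x <- c(2, 5, 7, 10, 16)       # hypothetical values
z <- (x - mean(x)) / sd(x)    # z-scores computed by hand
z_scaled <- scale(x)          # same result with scale()
mean(z); sd(z)                # mean 0 (up to rounding), standard deviation 1
```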
![](intro-stat_files/figure-html/zscore-1.png)<!-- --> --- ## Log transformations ![](intro-stat_files/figure-html/log-1.png)<!-- --> * useful for asymetric distribution to make the variable fit a Gaussian distribution (after transformation) `\(\Rightarrow\)` often performed before tests * useful for ratios (because a value twice or half the other have the same log with opposite signs) * for `\(p\)`-values, `\(\log_{10}\)` is often used * most frequent logs: <ul> <li> `\(y = \log_2(x) \Leftrightarrow x = 2^y\)` </li> <li> `\(y = \log_{10}(x) \Leftrightarrow x = 10^y\)` </li> <li> `\(y = \ln(x) \Leftrightarrow x = \exp(y)\)` </li> </ul> --- ## Other transformations * compute ratios * normalization * other functions ( `\(\sqrt{.}\)` , ...) --- ## Display a series of values with a chart ### In theory, a graphic should: * show the data * help looking at it and understanding the data structure somehow * avoid data distorsion * plot many data in a simple way **References** * Edward Tufte (1983) *The Visual Display of Quantitative Information*, Graphics Press. * http://r-graph-gallery.com/ --- ## Common graphics for univariate analyses The type of chart depends on... -- .. the variable type: * factors: -- <ul> <li>pie charts</li> <li>bar charts</li> <li>spider charts</li> <li>...</li> </ul> .center[<img src="img/pie.png" height="200"> <img src="img/bar.png" height="200"> <img src="img/spider.png" height="200">] --- ## Common graphics for univariate analyses The type of chart depends on... .. the variable type: * numeric: -- <ul> <li>histograms / density plots</li> <li>boxplot / violin plots</li> <li>...</li> </ul> .center[<img src="img/histogram.png" height="150"> <img src="img/density.png" height="150"> <img src="img/boxplot.png" height="150"> <img src="img/violin.png" height="150">] --- ## Boxplots? .center[<img src="img/boxplot_explanation.png" height="500">] --- ## Most frequently seen mistakes Try to keep the lie factor close to 1: `\(\frac{\textrm{effect size on graphic}}{\textrm{effect size in data}}\)` * *Which category is the most frequent between C and A?* .center[<img src="img/pie_ok.png" height="200"> <img src="img/pie_3d.png" height="200">] --- ## Most frequently seen mistakes Try to keep the lie factor close to 1: `\(\frac{\textrm{effect size on graphic}}{\textrm{effect size in data}}\)` * *Are the differences between the categories important?* .center[<img src="img/bars_distorted.png" height="400">] --- ## Hum, hum... Try to keep the lie factor close to 1: `\(\frac{\textrm{effect size on graphic}}{\textrm{effect size in data}}\)` .center[<img src="img/hum_chart_1.png" height="500">] --- ## Hum, hum... Try to keep the lie factor close to 1: `\(\frac{\textrm{effect size on graphic}}{\textrm{effect size in data}}\)` .center[<img src="img/hum_chart_2.png" height="500">] --- ## Hum, hum... Try to keep the lie factor close to 1: `\(\frac{\textrm{effect size on graphic}}{\textrm{effect size in data}}\)` .center[<img src="img/hum_chart_3.png" height="250">] **Une année de présidence Hollande en chiffres** by http://www.lemonde.fr/politique/visuel/2013/05/06/une-annee-de-presidence-hollande-en-chiffres_3170215_823448.html Comments by readers: * « Vous êtes vous interrogés sur le contenu de votre infographie ? sur les polices de caractères utilisées ? Sur les zooms ? Où est l'objectivité et le factuel ? » --- ## Hum, hum... 
Try to keep the lie factor close to 1: `\(\frac{\textrm{effect size on graphic}}{\textrm{effect size in data}}\)` .center[<img src="img/hum_chart_3.png" height="250">] Comments by readers: * « Merci de corriger votre 1er graphique, qui fait croire à une augmentation du chômage plus lente sur la dernière période, après une augmentation plus rapide sur la période précédente, ce qui est l'EXACT CONTRAIRE de la réalité : Février 08 à Avril 11 (38 mois) : 2 696 300-1 983 100 = 713 200 / 38 mois = 18 768 par mois. Avril 11 à Mai 12 (13 mois) : 2 927 600-2 696 300 = 231 300 / 13 mois = 17792 par mois (976 de -). Mai 12 à Mars 13 (10 mois) : 3 224 600-2 927 600 = 297000 / 10 mois = 29 700 par mois (11 908 de +). » --- class: center, inverse, middle <!-- Section 3: bivariate statistics (and graphics) --> ## Bivariate statistics --- ## Standard numerical summaries * factor vs factor -- ### contingency tables .center[<img src="img/contingency_table.png" height="200">] ... with variants (row / column profiles, percentages, ...) --- ## Standard numerical summaries * factor vs factor ### Cramer's `\(V\)` (also linked to `\(\phi\)` coefficient) `\(V = \sqrt{\frac{\chi^2}{n \times \min(\textrm{nrow} - 1,\ \textrm{ncol} - 1)}}\)` * `\(0 \leq V \leq 1\)` * `\(V = 0 \Leftrightarrow X\)` and `\(Y\)` are perfectly independent * `\(V = 1 \Leftrightarrow\)` knowing the value of `\(X\)` (resp. `\(Y\)`), you know the value of `\(Y\)` (resp. `\(X\)`) -- /!\ **Cramer's `\(V\)`**: * is a **descriptive** statistics (does not provide any evidence of meaningful correlation) * tends to be biased (overestimates the strength of the correlation) --- ## Standard numerical summaries * numeric vs numeric -- ### covariance `\(\mbox{Cov}(X,Y) = \frac{1}{n} \sum_{i=1}^n (x_i - \overline{X}) (y_i - \overline{Y})\)` .center[<img src="img/covariance.png" height="200">] sum of negative green areas and positive blue areas scales as the product of the variable units `\(\Rightarrow\)` coefficient of correlation --- ## Coefficient of correlation <ul> <li><strong>Pearson correlation coefficient</strong>: `\(r(X,Y) = \frac{\mbox{Cov}(X,Y)}{\sigma_X \sigma_Y}\)`</li> </ul> ranges between -1 and 1, with `\(\pm 1\)` indicating perfect linear correlations<br> positive when the two variables vary in the same direction<br> is sensitive to outliers and can only detect linear relations -- <ul> <li><strong>Spearman correlation coefficient</strong>: Pearson correlation for the <strong>ranks</strong></li> </ul> ranges between -1 and 1, with `\(\pm 1\)` indicating identical or opposite ranks between the two variables<br> positive when the two variables vary in the same direction<br> is not sensitive to outliers and can detect any monotonic relation --- ## Correlation: examples ``` ## `geom_smooth()` using formula 'y ~ x' ``` ![](intro-stat_files/figure-html/correlationEx-1.png)<!-- --> --- ## Do not over-interpret the correlation coefficient! ``` ## `geom_smooth()` using formula 'y ~ x' ``` ![](intro-stat_files/figure-html/corrCoef-1.png)<!-- --> With `\(n=50\)` observations the correlation coefficient is significatively different from 0 at approximately 0.27 (for a 5% risk). --- ## Correlation `\(\neq\)` Causality! Spurious correlations: http://www.tylervigen.com/spurious-correlations .center[<img src="img/divorce-vs-margarine.png" height="450">] --- ## Correlation `\(\neq\)` Causality! Spurious correlations: http://www.tylervigen.com/spurious-correlations .center[<img src="img/drowning-vs-nicholascage.png" height="450">] --- ## ... 
an often forgotten fact! ### Faut-il manger du chocolat pour avoir des prix Nobel ? .left[<img src="img/nobel_vs_chocolat.jpg" height="150"> *©REUTERS/Denis Balibouse*] La revue médicale New England Journal of Medicine vient de publier une étude qui fait le lien entre une forte consommation de chocolat et l'attribution des Nobel. New England Journal of Medicine, hebdomadaire américain, publié depuis 1812, est considéré comme la revue médicale la plus prestigieuse. [...] Le docteur Franz Messerli, de l'université Columbia à New York et auteur de l'étude, explique « qu"il y a une corrélation significative surprenante entre la consommation de chocolat per capita et le nombre de lauréats du Nobel pour dix millions d'habitants dans un total de 23 pays ». [...] Seule exception de l'étude : la Suède. Les habitants ne consomment « que » 6,4 kilos de chocolat par an et par personne pour un total de 32 Nobel. Qu'à cela ne tienne, pour les chercheurs il s'agirait d'un simple favoritisme du comité Nobel. Si une corrélation est montrée pour les pays, l'étude ne dit rien en revanche sur le niveau de consommation individuel de chocolat des lauréats du Nobel. --- ## ... but: ### Le chocolat engendre des tueurs en série *23/11/2012 | Mise à jour: 11:12 Réactions (24) Par Hayat Gazzane* http://plus.lefigaro.fr/lien/le-chocolat-engendre-des-tueurs-en-serie-20121123-1589103 Des chercheurs britanniques se sont amusés à démonter l'étude de Franz Messerli qui établit une corrélation forte entre consommation de chocolat et prix Nobel. En utilisant la même méthodologie, ils arrivent à prouver que les pays où l'on mange beaucoup de chocolat sont aussi ceux qui engendrent le plus de serial killer et d'accidents de la route (étude en anglais). ### and also « la statistique expliquée à mon chat » https://www.youtube.com/watch?v=aOX0pIwBCvw --- ## Confounding variable: the Simpson's paradox ### Shark attacks are correlated with ice cream sales -- .center[<img src="img/sharks_vs_icecream.png" height="500">] --- ## How can that be handled? <img src="img/needs-partialcor.png" height="150"> If you know the potential confounding variables, use **partial correlation**! ```r set.seed(2807) x <- rnorm(100) y <- 2*x + 1 + rnorm(100,0,0.1) z <- -2*x + 2 + rnorm(100,0,0.1) cor(x,y); cor(x,z); cor(y,z) ``` ``` ## [1] 0.9988261 ``` ``` ## [1] -0.998756 ``` ``` ## [1] -0.9980506 ``` --- ## How can that be handled? <img src="img/needs-partialcor.png" height="150"> If you know the potential confounding variables, use **partial correlation**! 
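The residual-based computation below is one way to obtain it; equivalently (a minimal sketch, assuming the `x`, `y`, `z` simulated on the previous slide are still available), the partial correlation of two variables given a third one can be derived from the pairwise correlations:

```r
# partial correlation of x and y given the confounder z
rxy <- cor(x, y); rxz <- cor(x, z); ryz <- cor(y, z)
(rxy - rxz * ryz) / sqrt((1 - rxz^2) * (1 - ryz^2))
```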
```r cor(lm(x~z)$residuals, lm(y~z)$residuals) ``` ``` ## [1] 0.6481373 ``` ```r cor(lm(x~y)$residuals, lm(z~y)$residuals) ``` ``` ## [1] -0.6208908 ``` ```r cor(lm(y~x)$residuals, lm(z~x)$residuals) ``` ``` ## [1] -0.1933699 ``` --- ## Standard numerical summaries * numeric ( `\(X\)` ) vs factor ( `\(Y \in \{1, ..., K\}\)` ) -- ### correlation ratio `\(Y\)` can be seen as a variable that defines *groups of individuals* **within-group variance**: `\(\mbox{Var}_{\textrm{intra}} = \frac{1}{n} \sum_{k=1}^K n_k \sigma_{X,k}^2\)` in which `\(n_k\)` is the number of individuals for which `\(Y = k\)` and `\(\sigma_{X,k}^2\)` is the variance of `\(X\)` for individuals for which `\(Y=k\)` *average variance of `\(X\)` within the groups defined by `\(Y\)` * -- **between-group variance**: `\(\mbox{Var}_{\textrm{inter}} = \frac{1}{n} \sum_{k=1}^K n_k (\overline{X}_{k} - \overline{X})^2\)` in which `\(\overline{X}_{k}\)` is the mean of `\(X\)` for individuals for which `\(Y=k\)` *variance between means of `\(X\)` within the groups defined by `\(Y\)` * --- ## Standard numerical summaries * numeric ( `\(X\)` ) vs factor ( `\(Y \in \{1, ..., K\}\)` ) -- ### correlation ratio It turns out that: `\(\mbox{Var}_{\textrm{intra}} + \mbox{Var}_{\textrm{inter}} = \mbox{Var}(X)\)` -- The correlation ratio is defined as: `\(\eta(X|Y) = \sqrt{\frac{\mbox{Var}_{\textrm{inter}}}{\mbox{Var}_{\textrm{X}}}}\)` *proportion of the variance explained by the groups* -- **Properties**: <ul> <li>between 0 and 1</li> <li>equal to 1 if and only if the individuals within a given group all have the same value for `\(X\)` for all groups</li> <li>equal to 0 if and only if the average values for the groups are exactly identical</li> </ul> --- ## Common graphics for bivariate exploration The type of chart depends on... .. the variable types: * a numeric variable vs a factor: -- <ul> <li>numeric charts but made parallel: superimposed histograms / density plots, parallel boxplot / violin plots...</li> <li>dot plots</li> </ul> .center[<img src="img/parallel_histograms.png" height="150"> <img src="img/parallel_densities.png" height="150"> <img src="img/boxplot.png" height="150"> <img src="img/violin.png" height="150"> <img src="img/dot_plot.png" height="150">] --- ## Common graphics for bivariate exploration The type of chart depends on... .. the variable types: * two factors: -- barplots (side-by-side, stacked, ...) .center[<img src="img/barplot_stacked.png" height="250">] --- ## Common graphics for bivariate exploration The type of chart depends on... .. the variable types: * two numeric variables: -- scatterplot or scatterplot matrices for more than two variables... .center[<img src="img/scatterplot.png" height="250"> <img src="img/scatterplot-matrix.png" height="250">] --- class: center, inverse, middle <!-- Section 4: inference (tests) --> ## Inference (tests) --- ## Inference .center[**From a sample, obtain general conclusions (with a control of the error) on the whole population from which the sample has been taken.**] <br> <br> * **confidence interval**: from the sample, define an interval in which the average value for a given variable is likely to be <br> <br> * **statistical test**: from the observations made on a sample, can we invalidate an assumption made on the whole population? --- ## Different steps in hypothesis testing 1. 
formulate a **hypothesis `\(H_0\)`**:
.center[ `\(H_0\)`: the average count for gene `\(g\)` in the control samples is the same as the average count in the treated samples]
which is tested against an alternative `\(H_1\)`: the average count for gene `\(g\)` in the control samples is different from the average count in the treated samples

.center[<img src="img/samples-counts-1.png" height="300">]

---

## Different steps in hypothesis testing

1. formulate a **hypothesis `\(H_0\)`**

2. **from observations**, calculate a **test statistic** (*e.g.*, the standardized difference between the means of the two samples)

.center[<img src="img/samples-summary-1.png" height="300">]

---

## Different steps in hypothesis testing

1. formulate a **hypothesis `\(H_0\)`**

2. **from observations**, calculate a **test statistic**

3. find the **theoretical distribution of the test statistic under `\(H_0\)`**

.center[<img src="img/theoretical-dist.png" height="300">]

---

## Different steps in hypothesis testing

1. formulate a **hypothesis `\(H_0\)`**

2. **from observations**, calculate a **test statistic**

3. find the **theoretical distribution of the test statistic under `\(H_0\)`**

4. deduce the probability, under `\(H_0\)`, of observing a test statistic at least as extreme as the one computed: this is called the **p-value**

.center[<img src="img/theoretical-dist-sample-1.png" height="300">]

---

## Different steps in hypothesis testing

1. formulate a **hypothesis `\(H_0\)`**

2. **from observations**, calculate a **test statistic**

3. find the **theoretical distribution of the test statistic under `\(H_0\)`**

4. deduce the probability, under `\(H_0\)`, of observing a test statistic at least as extreme as the one computed: this is called the **p-value**

5. conclude: if the p-value is low (usually below `\(\alpha=5\)`\% as a convention), `\(H_0\)` is unlikely: we say that "`\(H_0\)` is rejected".
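In practice, a test function carries out steps 2 to 4 in a single call; a minimal sketch with simulated (hypothetical) data rather than the gene counts shown above:

```r
set.seed(42)
control <- rnorm(10, mean = 100, sd = 10)   # hypothetical control sample
treated <- rnorm(10, mean = 110, sd = 10)   # hypothetical treated sample
t.test(control, treated)    # returns the test statistic and the p-value
```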
We have that: .center[<font size="6"> `\(\alpha = \mathbb{P}_{H_0} (H_0\mbox{ is rejected})\)` </font>] --- ## Statistics premises in short <br> .center[<font size="6"> `\(H_0 \Rightarrow\)` theoretical distribution for a given test statistics</font>] <br> <br> .center[**then**] <br> <br> .center[<font size="6"> observed value has a low probability under the theoretical distribution `\(\Rightarrow\)` `\(H_0\)` is unlikely] --- ## Summary of the possible decisions .center[<table> <tr> <td><img src="img/samples-summary-1.png" height="200"></td> <td> </td> <td><img src="img/samples-summary-2.png" height="200"></td> </tr> <tr> <td><img src="img/theoretical-dist-sample-1.png" height="200"></td> <td> </td> <td><img src="img/theoretical-dist-sample-2.png" height="200"></td> </tr> <tr> <td>can not reject `\(H_0\)` </td> <td> </td> <td>reject `\(H_0\)` </td> </tr> </table>] --- ## Types of errors in tests .center[<img src="img/error_types.png" height="400">] `\(\mathbb{P}(\mbox{Type I error}) = \alpha\)` (risk) <br> `\(\mathbb{P}(\mbox{Type II error}) = 1-\beta\)` with `\(\beta\)` : power --- ## A few remarks on p-values * The p-value is the probability to observe the current value of the test statistics or a more extreme one under the null hypothesis <br> <br> Hence: <br> .center[<font size="6">the smaller the p-value, the smaller the risk to make an error while rejecting `\(H_0\)` </font>] <br> <br> <br> -- .center[<font size="6">results of tests are not just white/black: the p-value gives a degree of confidence in the conclusion</font>] -- * decision is not symmetric: rejecting `\(H_0\)` means that the error risk is under control (the test is said **significant**) while not rejecting `\(H_0\)` provides no guarantee on the error risk ( `\(1-\beta\)` ; the test is said **not significant**) --- ## Controlling `\(\alpha\)` AND `\(\beta\)` The only way to simultaneously control `\(\alpha\)` and increasing `\(\beta\)` is to **increase the sample size**. -- ### Basics on computing sample size Facts on `\(\beta\)` : `\(\beta\)` increases when: * `\(\alpha\)` increases * the sample size increases * the effect size increases (*i.e.*, the alternative hypothesis becomes more distinct from the null hypothesis) `\(\Rightarrow\)` defining *a priori* values for `\(\alpha\)`, `\(n\)` and the effect size can be used to give an estimate of the power (and reciprocally, fixing a power can be used to target a relevant sample size) --- ## Example of a sample size computation Extract of `?power.t.test`: <img src="img/help_power.png" height="500"> --- ## Example of a sample size computation ```r power.t.test(n = 25, delta = 1, sd = 1, sig.level = 0.05, power = NULL) ``` ``` ## ## Two-sample t test power calculation ## ## n = 25 ## delta = 1 ## sd = 1 ## sig.level = 0.05 ## power = 0.9337076 ## alternative = two.sided ## ## NOTE: n is number in *each* group ``` ```r power.t.test(n = NULL, delta = 1, sd = 1, sig.level = 0.05, power = 0.9) ``` ``` ## ## Two-sample t test power calculation ## ## n = 22.0211 ## delta = 1 ## sd = 1 ## sig.level = 0.05 ## power = 0.9 ## alternative = two.sided ## ## NOTE: n is number in *each* group ``` --- ## What about confidence intervals? 1. **from observations**, calculate a **test statistics** `\(S_n\)` 2. find the **theoretical distribution of the test statistics** 3. 
using this theoretical distribution, find an interval, IC, in which the test statistic `\(S_n\)` (as defined in the population) lies with a high probability ( `\(1-\alpha\)` ):

$$ \mathbb{P}(S_n \in \textrm{IC}) \geq 1-\alpha $$

<img src="img/intervalle-confiance.jpg" alt="IC" width = "500" />

`\(H_0\)`: "`\(S\)` is equal to 0" is not rejected `\(\Leftrightarrow\)` `\(0 \in \textrm{IC}\)`

---

## Why is the Gaussian (normal) distribution so central in many tests / confidence intervals?

.center[<img src="img/galton_box.png" height="300"> <img src="img/galton_box_2.jpg" height="300">]

---

## Tests for one variable

### parametric tests

**comparison of a mean to a given value ( `\(H_0\)`: the mean of `\(X\)` is equal to 0): `\(t\)` test (Student)**

<ul>
<li> `\(X\)` is supposed to follow a Gaussian distribution (with unknown variance) </li>
<li> the test statistic is: `\(T = \frac{\overline{X} - 0}{\sigma_X / \sqrt{n}}\)` </li>
<li> under the null hypothesis, its theoretical distribution is the Student distribution with `\(n-1\)` degrees of freedom</li>
<li> not rejecting the null hypothesis is equivalent to having the tested value (0 in the example above) included in the confidence interval for the mean of `\(X\)`</li>
</ul>

--

### non parametric tests

**goodness of fit of the distribution of `\(X\)` to a given distribution (very often, the Gaussian distribution)**

<ul>
<li>no assumption on the distribution of `\(X\)` </li>
<li>median: Wilcoxon test</li>
<li>normality: Shapiro-Wilk test</li>
<li>any theoretical distribution: Kolmogorov-Smirnov test, `\(\chi^2\)` test (compare observed frequencies in intervals with theoretical frequencies)</li>
</ul>

---

## Tests with two variables

### factor vs factor

independence between the two variables

`\(\chi^2\)` test (non parametric) on a contingency table or Fisher exact test (non parametric but limited to small contingency tables and sample sizes)

**Titanic (males, adults)**:

.center[ <img src="img/titanic_table.png" height="100"> <img src="img/titanic_profiles.png" height="100"> ]

.center[ contingency table row profiles ]

Independence means the same survival rate whatever the class (for instance, 20% survival in every class).
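A minimal sketch using R's built-in `Titanic` dataset (its counts may not match the table shown above exactly):

```r
# Class x Survived contingency table for adult males
tab <- Titanic[, "Male", "Adult", ]
tab
chisq.test(tab)      # H0: survival is independent of the class
# fisher.test(tab)   # exact alternative, better suited to small tables
```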
--- ## Tests with two variables ### numeric variable vs numeric variable test for correlation/association between two numeric variables <ul> <li> `\(X\)` and `\(Y\)` are Gaussian: Pearson correlation (parametric / only for linear relations)</li> <li> otherwise: Spearman correlation (non parametric / not restricted to linear relations)</li> </ul> -- ### numeric variable vs factor factor can be seen as a group variable `\(\Rightarrow\)` these are called comparison of `\(K\)` samples tests can be for **paired** samples or **unpaired** samples --- ## Comparison between samples `\(K = 2\)` ### Comparison of the distribution of `\(X\)` in group 1 and in group 2 (unpaired samples) Kolmogorov-Smirnov -- ### Comparison of a central characteristic of `\(X\)` in group 1 and in group 2 .center[<img src="img/table-comptest-2.png" height="250">] --- ## Comparison between samples `\(K = 2\)` ### Comparison of the variance / dispersion of `\(X\)` in group 1 and in group 2 Fisher test ( `\(K=2\)` ): `\(X\)` is supposed to be Gaussian (parametric) Siegel-Tukey test (non parametric, can be used with an ordinal variable) --- ## Comparison between samples `\(K \geq 2\)` ### Comparison of a central characteristic of `\(X\)` in groups .center[<img src="img/table-comptest-2more.png" height="100">] -- ### Comparison of the variance / dispersion of `\(X\)` in groups Levene (also Brown–Forsythe) or Bartlett tests (parametric) --- ## Need help? GIYF: choose a statistics test .center[<img src="img/choose_test.jpg" height="300">] --- ## A last note on how data must be prepared for paired / unpaired tests ### not paired ``` ## extra group ID ## 9 0.0 A 9 ## 12 0.8 B 2 ## 6 3.4 A 6 ## 15 -0.1 B 5 ## 18 1.6 B 8 ## 7 3.7 A 7 ``` ### paired ``` ## before after ## 1 200.1 392.9 ## 2 190.9 393.2 ## 3 192.7 345.1 ## 4 213.0 393.0 ## 5 241.4 434.0 ## 6 196.9 427.9 ``` --- class: center, inverse, middle <!-- Section 5: Inference (linear models) --> ## Inference (linear models) --- ## Linear regression, linear models **Case where `\(Y\)` (numeric) is explained by one or several explanatory variables `\(X_j\)` (all numeric)**: $$ Y = a + b X + \epsilon $$ **Very important remark**: Testing `\(H_0:\ b=0\)` is exactly equivalent to testing `\(\textrm{Cor}(X,Y) = 0\)` (Pearson correlation test)! ![](intro-stat_files/figure-html/linearReg-1.png)<!-- --> $$ Y = a + b_1 X_1 + b_2 X_2 + \epsilon $$ --- ## Linear regression `\(Y = a + b X + \epsilon\)` ``` ## ## Call: ## lm(formula = Sepal.Width ~ Petal.Width, data = iris) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.09907 -0.23626 -0.01064 0.23345 1.17532 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.30843 0.06210 53.278 < 2e-16 *** ## Petal.Width -0.20936 0.04374 -4.786 4.07e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1
## 
## Residual standard error: 0.407 on 148 degrees of freedom
## Multiple R-squared:  0.134, Adjusted R-squared:  0.1282 
## F-statistic: 22.91 on 1 and 148 DF,  p-value: 4.073e-06
```

* interpretation: one additional unit of Petal.Width corresponds to a decrease of about 0.21 unit of Sepal.Width (the estimated slope is `\(-0.209\)`)

* the `\(t\)` value corresponds to a Student test of `\(H_0:\ b = 0\)`

* this test is equivalent to the (Pearson) correlation test between `\(X\)` and `\(Y\)`

---

## Linear models

**Case where `\(Y\)` (numeric) is explained by one or several explanatory variables `\(X_j\)` (all numeric)**:

$$ Y = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + \epsilon $$

```
## 
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length + Petal.Width + Petal.Length, 
##     data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88045 -0.20945  0.01426  0.17942  0.78125 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.04309    0.27058   3.855 0.000173 ***
## Sepal.Length  0.60707    0.06217   9.765  < 2e-16 ***
## Petal.Width   0.55803    0.12256   4.553  1.1e-05 ***
## Petal.Length -0.58603    0.06214  -9.431  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3038 on 146 degrees of freedom
## Multiple R-squared:  0.524, Adjusted R-squared:  0.5142 
## F-statistic: 53.58 on 3 and 146 DF,  p-value: < 2.2e-16
```

---

## What happens if `\(X\)` is a factor? `\(\Rightarrow\)` ANOVA

`\(X\)` with `\(K\)` levels is (silently) recoded into 0/1 variables: if `\(X\)` is {blue, red}, then the linear model is:

`\(Y_i = \alpha_b \mathbf{1}_{\{X_i \textrm{ is blue}\}} + \alpha_r \mathbf{1}_{\{X_i \textrm{ is red}\}} + \epsilon\)`

--

or (usually preferred version)

`\(Y_i = \underbrace{\beta_0}_{\textrm{basal level of }Y} + \underbrace{\beta_r}_{\textrm{additional level when }X_i\textrm{ is red}} \mathbf{1}_{\{X_i \textrm{ is red}\}} + \epsilon\)`

--

with the relation: `\(\left\{\begin{array}{l} \alpha_b = \beta_0\\ \alpha_r = \beta_0 + \beta_r \end{array}\right.\)`

--

**Interpretation of coefficients**:

* in the first version: testing `\(H_0:\ \alpha_r = \alpha_b\)` is exactly equivalent to an ANOVA (or Student test) of `\(Y\)` between the two groups of `\(X\)` ( `\(Y \sim X\)` )

* in the second version: testing `\(H_0:\ \beta_r = 0\)` is also exactly equivalent to an ANOVA (or Student test) of `\(Y\)` between the two groups of `\(X\)` ( `\(Y \sim X\)` )

---

## What happens if `\(X\)` is a factor? `\(\Rightarrow\)` ANOVA

```
## 
## Call:
## lm(formula = Sepal.Width ~ Species, data = iris)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.128 -0.228  0.026  0.226  0.972 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.42800    0.04804  71.359  < 2e-16 ***
## Speciesversicolor -0.65800    0.06794  -9.685  < 2e-16 ***
## Speciesvirginica  -0.45400    0.06794  -6.683 4.54e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3397 on 147 degrees of freedom
## Multiple R-squared:  0.4008, Adjusted R-squared:  0.3926 
## F-statistic: 49.16 on 2 and 147 DF,  p-value: < 2.2e-16
```

---

## What happens if `\(X\)` is a factor? `\(\Rightarrow\)` ANOVA

`\(X\)` with `\(K\)` levels is recoded automatically into ( `\(K-1\)` ) 0/1 variables:

```
## Single term deletions
## 
## Model:
## Sepal.Width ~ Species
##         Df Sum of Sq    RSS     AIC F value    Pr(>F)    
## <none>               16.962 -320.95                      
## Species  2    11.345 28.307 -248.13   49.16 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1 ``` --- ## Condition of applicability * residuals ( `\(\epsilon\)` ) are not correlated, with the same variance, not correlated to `\(X\)` and Gaussian * the number of observations is larger (much larger is better) than the number of variables --- ## What happens if `\(Y\)` is a binary variable (0/1)? `\(\Rightarrow\)` GLM (logit) ``` ## ## Call: ## glm(formula = Species ~ Sepal.Width, family = binomial(link = logit), ## data = iris) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -3.2145 -0.4929 0.0104 0.5318 2.0823 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 20.230 4.165 4.857 1.19e-06 *** ## Sepal.Width -6.552 1.350 -4.853 1.22e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 138.629 on 99 degrees of freedom ## Residual deviance: 71.421 on 98 degrees of freedom ## AIC: 75.421 ## ## Number of Fisher Scoring iterations: 6 ``` --- ## What happens if `\(Y\)` is a binary variable (0/1)? `\(\Rightarrow\)` GLM (logit) * The model is $$ \log\left[\frac{\mathbb{P}(Y = \textrm{versicolor})}{1-\mathbb{P}(Y = \textrm{versicolor})}\right] = a + b X $$ * `\(b\)` is also interpretable: `\(\exp(b)\)` is the odd ratio of the outcome when `\(X\)` increases of one unit (or when `\(X\)` is 1 if `\(X\)` is a binary variable) --- class: center, inverse, middle <!-- Section 6: multiple testing correction --> ## Multiple testing correction --- ## Why performing a large number of tests might be a problem? **Framework**: Suppose you are performing `\(G\)` tests at level `\(\alpha\)`, `\(\mathbb{P}(\mbox{at least one FP if }H_0\mbox{ is always true}) = 1 - (1-\alpha)^G\)` <br> **Ex**: for `\(\alpha=5\)`\% and `\(G=20\)` , `\(\mathbb{P}(\mbox{at least one FP if } H_0\mbox{ is always true}) \simeq 64\)` \%!!! -- <img src="intro-stat_files/figure-html/probFB-1.png" style="display: block; margin: auto;" /> For more than 75 tests and if `\(H_0\)` is always true, the probability to have at least one false positive is very close to 100\%! --- ## www.xkcd.com .center[<img src="img/multiple_testing_xkcd.png" height="450">] --- ## Notations for multiple tests **Number of decisions for `\(G\)` independent tests**: .center[<img src="img/multiple_tests_error.png" height="200">] Instead of the risk `\(\alpha\)`, control: * <strong>familywise error rate (FWER)</strong>: FWER `\(= \mathbb{P}(U>0)\)` (*i.e.*, probability to have at least one false positive decision) * <strong>false discovery rate (FDR)</strong>: FDR `\(= \mathbb{E}(Q)\)` with `\(Q = \left\{ \begin{array}{cl} U/R & \mbox{if }R>0\\ 0 & \mbox{otherwise} \end{array} \right.\)` --- ## Adjusted p-values **Settings**: p-values `\(p_1\)`, ..., `\(p_G\)` (*e.g.*, corresponding to `\(G\)` tests on `\(G\)` different genes) .center[**adjusted p-values**<br> adjusted p-values are `\(\tilde{p}_1\)`, ..., `\(\tilde{p}_G\)` such that:<br><br> Rejecting tests such that `\(\tilde{p}_g < \alpha \quad \Longleftrightarrow \quad \mathbb{P}(U > 0) \leq \alpha\)` <br> or<br> `\(\mathbb{E}(Q) \leq \alpha\)` ] -- **Calculating p-values (for independant tests)** <ol> <li>order the p-values `\(p_{(1)} \leq p_{(2)} \leq ... 
\leq p_{(G)}\)` </li> <li>compute `\(\tilde{p}_ {(g)} = a_g p_{(g)}\)` <ul> <li>with <strong>Bonferroni</strong> method: `\(a_g=G\)` (FWER)</li> <li>with <strong>Benjamini & Hochberg</strong> method: `\(a_g=G/g\)` (FDR)</li> </ul> </li> <li>if adjusted p-values `\(\tilde{p}_{(g)}\)` are larger than 1, correct `\(\tilde{p}_{(g)} \gets \min \{\tilde{p}_{(g)},1\}\)`</li> </ul> --- ## Special case of multivariate linear model When using a multivariate model like: $$ Y = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + \epsilon $$ you usually performed tests in two steps: * step 1: test the full model against the null model `\(Y = a + \epsilon\)`; * step 2: if test at step 1 is significant, test each coefficient using a correction called Tukey's Honestly Significant Difference (Tukey's HSD). --- class: center, inverse, middle <!-- Section 7: PCA --> ## Principal Component Analysis (PCA) --- ## Purpose of PCA **Main objective**: summarize a large number of numeric variables using a few number of combinations from these variables `\(n \textrm{ individuals } \left\{ \begin{array}{c} ...\\ ...\\ \underbrace{...}_{p \textrm{ variables}} \end{array} \right.\)` --- ## Dimension and graphical representation data: `\(X = \left( \begin{array}{cc} 1 & 3 \\ 2 & 4 \\ 2 & 1 \end{array}\right)\)` can be represented by: .center[ ![](intro-stat_files/figure-html/unnamed-chunk-1-1.png)<!-- --> ] -- But what can we do if more than 2 or 3 columns? --- ## Solution: projection (dimension reduction) .center[<img src="img/projection_fish.png" height="500">] --- ## Solution: projection (dimension reduction) **PCA**: * tries to maximize variability in the projection. * creates components that are linear combinations of the original variables --- ## Examples on simulated data Data: 50 individuals, 3 variables **Example 1**: no correlation between variables ```r set.seed(0103) df1 <- data.frame("x1" = rnorm(50), "x2" = rnorm(50), "x3" = rnorm(50)) scatterplotMatrix(df1, smooth = FALSE, regLine = FALSE) ``` <img src="intro-stat_files/figure-html/exPCA1-1.png" style="display: block; margin: auto;" /> --- ## Examples on simulated data **Example 2**: linear correlation between x1 and x2 ```r set.seed(1503) x1 <- rnorm(50) df2 <- data.frame("x1" = x1, "x2" = x1 + rnorm(50, sd = 0.5), "x3" = rnorm(50)) scatterplotMatrix(df2, smooth = FALSE, regLine = FALSE) ``` <img src="intro-stat_files/figure-html/exPCA2-1.png" style="display: block; margin: auto;" /> --- ## Examples on simulated data **Example 3**: linear correlation between x1 and x2 and x3 ```r set.seed(1506) x1 <- rnorm(50) df3 <- data.frame("x1" = x1, "x2" = x1 + rnorm(50, sd = 0.5), "x3" = x1 + rnorm(50, sd = 0.5)) scatterplotMatrix(df3, smooth = FALSE, regLine = FALSE) ``` <img src="intro-stat_files/figure-html/exPCA3-1.png" style="display: block; margin: auto;" /> --- ## Screeplots ![](intro-stat_files/figure-html/screeplots-1.png)<!-- --> --- ## Representation of variables ![](intro-stat_files/figure-html/variables-1.png)<!-- --> --- ## How to interpret PCA? <iframe width="720" height="480" src="img/pca-animation.mp4" align="middle" frameborder="0" allowfullscreen></iframe> --- ## How to choose a number of PCs? 
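A common starting point (a minimal sketch on the simulated `df3` used earlier) is to look at the proportion of variance carried by each component and at the elbow of the screeplot:

```r
pca3 <- prcomp(df3, scale. = TRUE)   # PCA on centered and scaled variables
summary(pca3)                        # proportion and cumulative proportion of variance
screeplot(pca3, type = "lines")      # elbow plot
```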
.center[<img src="img/choix-nombre-composantes.png">] --- ## Using PCA as a quality control cool .center[<img src="img/design-messy.png" width="400"> <img src="img/design-tidy.png" width="400">] --- class: center, inverse, middle <!-- Section 8: Clustering --> ## Clustering --- ## Purpose of clustering Purpose: group individuals that look alike -- Depends on: * the number of groups * what "look alike" means (distance choice) -- Here: two types of clustering (HC and `\(k\)`-means) --- ## Hierarchical (Agglomerative) Clustering (HC) Is based on: * a **distance** between individuals (usually, Euclidean distance) -- * **linkage** (distance between groups of individuals): common linkage is Ward's linkage --- ## Hierarchical (Agglomerative) Clustering (HC) Is **iterative**: 1. each individual is a group 2. find the two closest group and merge them in a new group (for Ward's: minimize loss of within-group variability) 3. end when there is only one group with all individuals --- ## HC representation: dendrogram <iframe width="720" height="480" src="img/dendogramme-animation.mp4" align="middle" frameborder="0" allowfullscreen></iframe> --- ## HC representation: reading the dendrogram Distances between individuals are read following branches: .center[<img src="img/dendogramme-figure2.png">] --- ## `\(k\)`-means .center[<img src="img/K-means.png" width="1000">] --- ## Differences between the two methods * HC provides a solution for any number of clusters but can be very difficult to use if `\(n\)` is large * `\(k\)`-means requires that the number of clusters is chosen in advance but it is fastest * `\(k\)`-means is stochastic: it gives different solutions at each run depending on the initialization * HC with Ward's linkage can be seen as an approximate solution of the "best" `\(k\)`-means **In practice**: if you can, use HC first and initialize `\(k\)`-means with HC results. --- ## Credits Heavily inspired from **Sébastien Déjean**'s previous version of the class. With ideas taken from http://r-graph-gallery.com/ as well (for the part on graphics) * slide 31: contingency table from Wikimedia Commons, by ASnieckus "Table of gender by major.png" * slide 46: dot plot from http://www.sthda.com/english/wiki/wiki.php?id_contents=7868 * slide 47: barplot from ggplot2 documentation * slide 48: scatterplot matrix from http://www.sthda.com/english/wiki/scatter-plot-matrices-r-base-graphs * slide 64: Galton boxes from Wikimedia Commons, by Marcin Floryan "Galton_Box.svg" and Antoine Taveneaux "Planche_de_Galton.jpg" * slide : `\(k\)`-means clustering from Wikimedia Commons, by Mquantin https://commons.wikimedia.org/w/index.php?curid=61321400