library("RColorBrewer")
library("car")
## Loading required package: carData
library("scales")
The data are provided as a CSV file, HNP_2019Data.csv. They contain the Health Nutrition and Population Statistics database, which is hosted by the World Bank Group and provides key indicators related to health, population dynamics and nutrition. The data have been gathered from 258 countries and the current dataset is restricted to the year 2019. It comes from https://datacatalog.worldbank.org/dataset/health-nutrition-and-population-statistics
The first two columns contain the country name and the country code, and the following columns provide indicators for that country as measured in 2019. The meaning of the different columns is given in a separate spreadsheet, HNP_StatsSeries.xlsx.
Even though CSV files can be opened with Excel, we strongly discourage this practice and we will use R directly for this task:
hnp_data <- read.table("HNP_2019Data.csv", sep = ",", header = TRUE,
stringsAsFactors = FALSE)
where:
* sep gives the character that separates columns;
* header = TRUE indicates that column names are included in the file (in the first row);
* stringsAsFactors = FALSE indicates that strings must not be converted to type factor (this has been the default behavior since R 4.0.0).
Other information and options are available in the help page ?read.table or in ?read.csv, ?read.csv2, ?read.delim, ?read.delim2.
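As a minimal sketch of how these options interact, the lines below write a tiny CSV file to a temporary location and read it back in three ways. The file content, the country names and the object names (tmp, toy_tab, toy_csv, toy_fac) are made up for illustration and are not part of the HNP data.

```r
# Write a tiny illustrative CSV (made-up values, not the HNP data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Country.Name,Country.Code,SP.POP.TOTL",
             "Aland,ALA,1000",
             "Borduria,BOR,"), tmp)

# read.table() needs the separator and header to be stated explicitly
toy_tab <- read.table(tmp, sep = ",", header = TRUE, stringsAsFactors = FALSE)

# read.csv() is a wrapper that presets sep = "," and header = TRUE
toy_csv <- read.csv(tmp, stringsAsFactors = FALSE)

# With stringsAsFactors = TRUE, character columns become factors
toy_fac <- read.csv(tmp, stringsAsFactors = TRUE)

class(toy_csv$Country.Name)  # character column
class(toy_fac$Country.Name)  # factor column
toy_csv$SP.POP.TOTL          # the empty field is read as NA

unlink(tmp)  # remove the temporary file
```

On a simple file like this one, read.table() with explicit arguments and read.csv() should return the same data frame; they differ mainly in their default values.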
We can take a first look at the data with:
head(hnp_data)
## Country.Name Country.Code SP.ADO.TFRT
## 1 Arab World ARB NA
## 2 Caribbean small states CSS NA
## 3 Central Europe and the Baltics CEB NA
## 4 Early-demographic dividend EAR NA
## 5 East Asia & Pacific EAS NA
## 6 East Asia & Pacific (excluding high income) EAP NA
## SH.HIV.TOTL SH.HIV.INCD.TL SH.DYN.AIDS SH.HIV.INCD SP.DYN.SMAM.FE
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SP.DYN.SMAM.MA SP.POP.DPND SP.POP.DPND.OL SP.POP.DPND.YG SP.POP.AG00.FE.IN
## 1 NA 61.08180 7.432003 52.12709 NA
## 2 NA 48.15854 13.329092 34.76142 NA
## 3 NA 51.99869 28.799254 23.15305 NA
## 4 NA 53.15051 9.118569 43.63640 NA
## 5 NA 44.87090 16.159883 28.39655 NA
## 6 NA 43.59206 14.248537 29.22363 NA
## SP.POP.AG00.MA.IN SP.POP.AG01.FE.IN SP.POP.AG01.MA.IN SP.POP.AG02.FE.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG02.MA.IN SP.POP.AG03.FE.IN SP.POP.AG03.MA.IN SP.POP.AG04.FE.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG04.MA.IN SP.POP.AG05.FE.IN SP.POP.AG05.MA.IN SP.POP.AG06.FE.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG06.MA.IN SP.POP.AG07.FE.IN SP.POP.AG07.MA.IN SP.POP.AG08.FE.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG08.MA.IN SP.POP.AG09.FE.IN SP.POP.AG09.MA.IN SP.POP.AG10.FE.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG10.MA.IN SP.POP.AG11.FE.IN SP.POP.AG11.MA.IN SP.POP.AG12.FE.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG12.MA.IN SP.POP.AG13.FE.IN SP.POP.AG13.MA.IN SP.POP.AG14.FE.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG14.MA.IN SP.POP.AG15.FE.IN SP.POP.AG15.MA.IN SP.POP.AG16.FE.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG16.MA.IN SP.POP.AG17.FE.IN SP.POP.AG17.MA.IN SP.POP.AG18.FE.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG18.MA.IN SP.POP.AG19.FE.IN SP.POP.AG19.MA.IN SP.POP.AG20.FE.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG20.MA.IN SP.POP.AG21.FE.IN SP.POP.AG21.MA.IN SP.POP.AG22.FE.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG22.MA.IN SP.POP.AG23.FE.IN SP.POP.AG23.MA.IN SP.POP.AG24.FE.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG24.MA.IN SP.POP.AG25.FE.IN SP.POP.AG25.MA.IN SH.DYN.AIDS.DH
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SH.HIV.ARTC.ZS SH.HIV.PMTC.ZS SH.STA.ARIC.ZS SP.DYN.CBRT.IN SH.STA.BRTC.ZS
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SH.XPD.KHEX.GD.ZS SH.DTH.COMM.ZS SH.DTH.INJR.ZS SH.DTH.NCOM.ZS SH.HIV.0014
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SH.HIV.INCD.14 SH.HIV.ORPH SH.MLR.TRET.ZS SH.MED.CMHW.P3 SP.REG.BRTH.ZS
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SP.REG.BRTH.FE.ZS SP.REG.BRTH.MA.ZS SP.REG.BRTH.RU.ZS SP.REG.BRTH.UR.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.REG.DTHS.ZS SH.HIV.1524.KW.FE.ZS SH.HIV.1524.KW.MA.ZS SH.HIV.KNOW.FE.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SH.HIV.KNOW.MA.ZS SH.CON.AIDS.FE.ZS SH.CON.AIDS.MA.ZS SH.CON.1524.FE.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SH.CON.1524.MA.ZS SN.ITK.SALT.ZS SP.DYN.CONU.ZS SP.DYN.CONM.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SH.XPD.CHEX.GD.ZS SH.XPD.CHEX.PC.CD SH.XPD.CHEX.PP.CD SP.DYN.CDRT.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SH.FPL.SATI.ZS SH.FPL.SATM.ZS SH.STA.DIAB.ZS SH.STA.ORCF.ZS SH.STA.ORTH
## 1 NA NA 12.536001 NA NA
## 2 NA NA 11.629347 NA NA
## 3 NA NA 6.296002 NA NA
## 4 NA NA 10.090910 NA NA
## 5 NA NA 8.202608 NA NA
## 6 NA NA 8.488062 NA NA
## SH.XPD.GHED.CH.ZS SH.XPD.GHED.GD.ZS SH.XPD.GHED.GE.ZS SH.XPD.GHED.PC.CD
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SH.XPD.GHED.PP.CD SH.XPD.PVTD.CH.ZS SH.XPD.PVTD.PC.CD SH.XPD.PVTD.PP.CD
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SH.STA.BFED.ZS SH.XPD.EHEX.CH.ZS SH.XPD.EHEX.EH.ZS SH.XPD.EHEX.PC.CD
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SH.XPD.EHEX.PP.CD SP.HOU.FEMA.ZS SP.DYN.TFRT.IN NY.GNP.PCAP.CD SH.MED.BEDS.ZS
## 1 NA NA NA 6502.415 NA
## 2 NA NA NA 9804.796 NA
## 3 NA NA NA 15796.772 NA
## 4 NA NA NA 3645.529 NA
## 5 NA NA NA 11725.701 NA
## 6 NA NA NA 8299.220 NA
## HD.HCI.OVRL HD.HCI.OVRL.FE HD.HCI.OVRL.LB.FE HD.HCI.OVRL.UB.FE HD.HCI.OVRL.LB
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## HD.HCI.OVRL.MA HD.HCI.OVRL.LB.MA HD.HCI.OVRL.UB.MA HD.HCI.OVRL.UB SH.IMM.IBCG
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SH.IMM.IDPT SH.IMM.HEPB SH.IMM.HIB3 SH.IMM.MEAS SH.IMM.POL3 SH.HIV.INCD.ZS
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA NA
## SH.MLR.INCD.P3 SH.TBS.INCD SH.UHC.NOP1.ZG SH.UHC.NOP1.CG SH.UHC.NOP2.ZG
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SH.UHC.NOP2.CG SH.STA.IYCF.ZS SH.MLR.IPTP.ZS SL.TLF.TOTL.FE.ZS SL.TLF.TOTL.IN
## 1 NA NA NA 20.71772 138180908
## 2 NA NA NA 43.93480 3406886
## 3 NA NA NA 44.99905 49355626
## 4 NA NA NA 29.81833 1303740803
## 5 NA NA NA 43.32029 1265785108
## 6 NA NA NA 43.23804 1132518723
## SP.DYN.LE00.FE.IN SP.DYN.LE00.MA.IN SP.DYN.LE00.IN SH.MMR.RISK.ZS SH.MMR.RISK
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SE.ADT.LITR.FE.ZS SE.ADT.LITR.MA.ZS SE.ADT.LITR.ZS SE.ADT.1524.LT.MA.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SE.ADT.1524.LT.ZS SH.STA.BRTW.ZS SH.STA.MALR SH.STA.STNT.ZS SH.STA.STNT.FE.ZS
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA 11 NA
## 6 NA NA NA NA NA
## SH.STA.STNT.MA.ZS SH.STA.MALN.ZS SH.STA.MALN.FE.ZS SH.STA.MALN.MA.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA 5.3 NA NA
## 6 NA NA NA NA
## SH.MMR.WAGE.ZS SH.STA.MMRT SH.STA.MMRT.NE SH.STA.TRAF.P5 SH.DYN.NCOM.ZS
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SH.DYN.NCOM.FE.ZS SH.DYN.NCOM.MA.ZS SH.STA.AIRP.P5 SH.STA.AIRP.FE.P5
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SH.STA.AIRP.MA.P5 SH.STA.POIS.P5 SH.STA.POIS.P5.FE SH.STA.POIS.P5.MA
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SH.STA.WASH.P5 SP.DYN.AMRT.FE SP.DYN.AMRT.MA SP.DYN.IMRT.IN SP.DYN.IMRT.FE.IN
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SP.DYN.IMRT.MA.IN SH.DYN.NMRT SH.DYN.MORT SH.DYN.MORT.FE SH.DYN.MORT.MA
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SM.POP.NETM SH.VAC.TTNS.ZS SH.DTH.0514 SH.DTH.IMRT SH.MMR.DTHS SH.DTH.NMRT
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA NA
## SH.UHC.NOP1.TO SH.UHC.NOP2.TO SH.UHC.OOPC.10.TO SH.UHC.OOPC.25.TO SN.ITK.DEFC
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SH.SGR.PROC.P5 SH.DTH.MORT SH.MMR.LEVE SH.MED.NUMW.P3 SH.XPD.OOPC.CH.ZS
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SH.XPD.OOPC.PC.CD SH.XPD.OOPC.PP.CD SH.STA.ODFC.ZS SH.STA.ODFC.RU.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SH.STA.ODFC.UR.ZS SH.H2O.BASW.ZS SH.H2O.BASW.RU.ZS SH.H2O.BASW.UR.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SH.STA.BASS.ZS SH.STA.BASS.RU.ZS SH.STA.BASS.UR.ZS SH.H2O.SMDW.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SH.H2O.SMDW.RU.ZS SH.H2O.SMDW.UR.ZS SH.STA.SMSS.ZS SH.STA.SMSS.RU.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SH.STA.SMSS.UR.ZS SH.STA.HYGN.ZS SH.STA.HYGN.RU.ZS SH.STA.HYGN.UR.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SH.MED.PHYS.ZS SP.POP.0004.FE SP.POP.0004.FE.5Y SP.POP.0004.MA
## 1 NA 24854512 12.058287 26008786
## 2 NA 277110 7.557271 289767
## 3 NA 2456961 4.651194 2596717
## 4 NA 153419754 9.501407 163715691
## 5 NA 73077265 6.330451 80054159
## 6 NA 67939534 6.601297 74614334
## SP.POP.0004.MA.5Y SP.POP.0014.TO SP.POP.0014.TO.ZS SP.POP.0014.FE.IN
## 1 11.728850 139782891 32.66946 68286720
## 2 8.026905 1708078 23.47309 837203
## 3 5.240148 15599476 15.23705 7592488
## 4 9.770659 939913302 28.56627 454379983
## 5 6.749714 459749160 19.64392 218362329
## 6 7.010223 426431937 20.36885 202188095
## SP.POP.0014.FE.ZS SP.POP.0014.MA.IN SP.POP.0014.MA.ZS SP.POP.0509.FE
## 1 33.12963 71493710 32.24060 23284558
## 2 22.83201 870864 24.12401 280062
## 3 14.37311 8007008 16.15805 2543150
## 4 28.14011 485527118 28.97657 151781082
## 5 18.91603 241386466 20.35234 73854649
## 6 19.64546 224243661 21.06831 68353500
## SP.POP.0509.FE.5Y SP.POP.0509.MA SP.POP.0509.MA.5Y SP.POP.1014.FE
## 1 11.296616 24402277 11.004383 20147651
## 2 7.637777 291784 8.082778 280032
## 3 4.814359 2681766 5.411775 2592379
## 4 9.399923 161980152 9.667081 149179148
## 5 6.397793 81722934 6.890416 71430417
## 6 6.641520 75893633 7.130416 65895060
## SP.POP.1014.FE.5Y SP.POP.1014.MA SP.POP.1014.MA.5Y SP.POP.1519.FE
## 1 9.774730 21082648 9.507372 18115624
## 2 7.636958 289313 8.014328 290255
## 3 4.907553 2728522 5.506129 2358496
## 4 9.238783 159831279 9.538834 144632997
## 5 6.187789 79609371 6.712212 71678473
## 6 6.402647 73735694 6.927672 65814658
## SP.POP.1519.FE.5Y SP.POP.1519.MA SP.POP.1519.MA.5Y SP.POP.1564.TO.ZS
## 1 8.788883 18928186 8.535802 62.67271
## 2 7.915755 300451 8.322862 67.52627
## 3 4.464793 2474088 4.992683 65.81012
## 4 8.957236 155781135 9.297119 65.46432
## 5 6.209278 79216105 6.679054 69.17714
## 6 6.394835 72971742 6.855897 69.69993
## SP.POP.1564.FE.IN SP.POP.1564.FE.ZS SP.POP.1564.MA.IN SP.POP.1564.MA.ZS
## 1 127238287 61.73027 140922288 63.54993
## 2 2479088 67.60900 2434643 67.44254
## 3 33502065 63.42171 33873432 68.35620
## 4 1055828098 65.38827 1098144784 65.53799
## 5 793779704 68.76261 825251623 69.58055
## 6 714792393 69.45230 744409915 69.93937
## SP.POP.1564.TO SP.POP.2024.FE SP.POP.2024.FE.5Y SP.POP.2024.MA
## 1 268157877 17449218 8.465573 18738921
## 2 4913718 300422 8.193025 311444
## 3 67375470 2615950 4.952177 2746660
## 4 2153966405 140447231 8.698008 150203474
## 5 1619031919 75434106 6.534616 82351224
## 6 1459202369 68779490 6.682911 75223409
## SP.POP.2024.MA.5Y SP.POP.2529.FE SP.POP.2529.FE.5Y SP.POP.2529.MA
## 1 8.450452 17272753 8.379960 19452856
## 2 8.627407 293522 8.004852 300181
## 3 5.542731 3178398 6.016926 3378765
## 4 8.964241 135077139 8.365434 142819714
## 5 6.943390 83221802 7.209239 89215677
## 6 7.067447 76193573 7.403295 81698887
## SP.POP.2529.MA.5Y SP.POP.3034.FE SP.POP.3034.FE.5Y SP.POP.3034.MA
## 1 8.772407 16467310 7.989196 18887678
## 2 8.315384 275619 7.516610 276692
## 3 6.818307 3556613 6.732912 3816946
## 4 8.523573 128305461 7.946059 133696383
## 5 7.522161 94044814 8.146803 98224849
## 6 7.675836 86553698 8.409929 90437096
## SP.POP.3034.MA.5Y SP.POP.3539.FE SP.POP.3539.FE.5Y SP.POP.3539.MA
## 1 8.517536 14684769 7.124387 16743388
## 2 7.664714 270183 7.368360 258544
## 3 7.702555 3700371 7.005055 3917777
## 4 7.979087 117819461 7.296653 121995348
## 5 8.281764 80135756 6.941905 82809779
## 6 8.496815 71767388 6.973228 74253414
## SP.POP.3539.MA.5Y SP.POP.4044.FE SP.POP.4044.FE.5Y SP.POP.4044.MA
## 1 7.550551 12394573 6.013287 14241425
## 2 7.161990 236542 6.450911 222783
## 3 7.906032 3958123 7.493001 4087420
## 4 7.280761 102624792 6.355635 105546723
## 5 6.982052 80156003 6.943659 82559993
## 6 6.976313 71292738 6.927109 73576869
## SP.POP.4044.MA.5Y SP.POP.4549.FE SP.POP.4549.FE.5Y SP.POP.4549.MA
## 1 6.422273 10182884 4.940276 11540918
## 2 6.171365 233078 6.356442 220353
## 3 8.248366 3652805 6.915010 3715541
## 4 6.299097 89875870 5.566084 91728576
## 5 6.960992 91142844 7.895414 93325370
## 6 6.912750 81599425 7.928551 83664862
## SP.POP.4549.MA.5Y SP.POP.5054.FE SP.POP.5054.FE.5Y SP.POP.5054.MA
## 1 5.204460 8454292 4.101641 9299194
## 2 6.104051 216462 5.903294 206274
## 3 7.497920 3351256 6.344158 3302637
## 4 5.474421 77851846 4.821427 78642889
## 5 7.868668 87443149 7.574921 88003627
## 6 7.860545 78621452 7.639198 79203035
## SP.POP.5054.MA.5Y SP.POP.5559.FE SP.POP.5559.FE.5Y SP.POP.5559.MA
## 1 4.193539 6837653 3.317320 7481678
## 2 5.714045 200948 5.480202 190426
## 3 6.664681 3337347 6.317829 3123393
## 4 4.693459 65974962 4.085882 65698486
## 5 7.419968 71038323 6.153824 71212363
## 6 7.441344 62592679 6.081774 62798224
## SP.POP.5559.MA.5Y SP.POP.6064.FE SP.POP.6064.FE.5Y SP.POP.6064.MA
## 1 3.373917 5379212 2.609751 5608048
## 2 5.275037 162056 4.419549 147491
## 3 6.302974 3792702 7.179846 3310202
## 4 3.920928 53218344 3.295854 52032062
## 5 6.004224 59484439 5.152947 58332643
## 6 5.900066 51577296 5.011472 50582383
## SP.POP.6064.MA.5Y SP.POP.65UP.TO.ZS SP.POP.65UP.FE.IN SP.POP.65UP.FE.ZS
## 1 2.528990 4.657838 10594745 5.140092
## 2 4.085684 9.000639 350509 9.558994
## 3 6.679950 18.952825 11729732 22.205188
## 4 3.105307 5.969409 104497550 6.471616
## 5 4.918279 11.178944 142234944 12.321360
## 6 4.752355 9.931220 112204121 10.902235
## SP.POP.65UP.MA.IN SP.POP.65UP.MA.ZS SP.POP.65UP.TO SP.POP.6569.FE
## 1 9334519 4.209469 19929502 4114802
## 2 304443 8.433451 654954 119486
## 3 7673854 15.485749 19403633 3620088
## 4 91913084 5.485433 196410915 40197975
## 5 119399672 10.067105 261633659 52949684
## 6 95711107 8.992323 207914985 45327228
## SP.POP.6569.FE.5Y SP.POP.6569.MA SP.POP.6569.MA.5Y SP.POP.7074.FE
## 1 1.996316 3992940 1.800645 2772099
## 2 3.258592 110047 3.048439 90619
## 3 6.853076 2889446 5.830869 2827052
## 4 2.489493 38273259 2.284173 27574050
## 5 4.586862 50218440 4.234135 35225618
## 6 4.404188 43013941 4.041278 28082534
## SP.POP.7074.FE.5Y SP.POP.7074.MA SP.POP.7074.MA.5Y SP.POP.7579.FE
## 1 1.344897 2503846 1.129128 1867325
## 2 2.471339 81572 2.259646 63441
## 3 5.351802 2019510 4.075346 2044326
## 4 1.707683 24398827 1.456138 18492383
## 5 3.051483 31003463 2.614037 23727138
## 6 2.728620 24606417 2.311841 18165806
## SP.POP.7579.FE.5Y SP.POP.7579.MA SP.POP.7579.MA.5Y SP.POP.80UP.FE
## 1 0.9059418 1530529 0.6902031 1840519
## 2 1.7301473 52522 1.4549250 76963
## 3 3.8700498 1246412 2.5152458 3238266
## 4 1.1452479 15552858 0.9282047 18233141
## 5 2.0554064 19302553 1.6274822 30332504
## 6 1.7650677 14846206 1.3948420 20628553
## SP.POP.80UP.MA SP.POP.80UP.MA.5Y SP.POP.80UP.FE.5Y SP.POP.GROW
## 1 1307203 0.5894928 0.8929369 1.9246927
## 2 60302 1.6704404 2.0989159 0.5763854
## 3 1518485 3.0642878 6.1302590 -0.1545266
## 4 13688140 0.8169171 1.1291929 1.2945836
## 5 18875215 1.5914515 2.6276082 0.5364899
## 6 13244543 1.2443614 2.0043588 0.5781955
## SP.POP.TOTL.FE.IN SP.POP.TOTL.FE.ZS SP.POP.TOTL.MA.IN SP.POP.TOTL.MA.ZS
## 1 206119753 48.17342 221750517 51.82658
## 2 3666801 50.39065 3609949 49.60935
## 3 52824286 51.59701 49554293 48.40299
## 4 1614705634 49.07486 1675584988 50.92514
## 5 1154376978 49.32361 1186037760 50.67639
## 6 1029184609 49.15980 1064364682 50.84020
## SP.POP.TOTL SH.STA.PNVC.ZS SI.POV.NAHC SH.STA.ANVC.ZS SH.STA.ANV4.ZS
## 1 427870270 NA NA NA NA
## 2 7401381 NA NA NA NA
## 3 102378579 NA NA NA NA
## 4 3290290622 NA NA NA NA
## 5 2340628292 NA NA NA NA
## 6 2093675040 NA NA NA NA
## SH.ANM.CHLD.ZS SH.ANM.NPRG.ZS SH.PRG.ANEM SH.ANM.ALLW.ZS SH.HIV.1524.FE.ZS
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SH.HIV.1524.MA.ZS SH.DYN.AIDS.ZS SH.STA.OWAD.ZS SH.STA.OWGH.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA 6.8
## 6 NA NA NA NA
## SH.STA.OWGH.FE.ZS SH.STA.OWAD.FE.ZS SH.STA.OWGH.MA.ZS SH.STA.OWAD.MA.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SH.SVR.WAST.ZS SH.SVR.WAST.FE.ZS SH.SVR.WAST.MA.ZS SH.PRG.SYPH.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 1.4 NA NA NA
## 6 NA NA NA NA
## SN.ITK.DEFC.ZS SH.STA.WAST.ZS SH.STA.WAST.FE.ZS SH.STA.WAST.MA.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA 3.7 NA NA
## 6 NA NA NA NA
## SE.PRM.CMPT.FE.ZS SE.PRM.CMPT.MA.ZS SE.PRM.CMPT.ZS SH.DYN.0514 SH.UHC.NOP1.ZS
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SH.UHC.NOP2.ZS SH.UHC.OOPC.10.ZS SH.UHC.OOPC.25.ZS SE.XPD.TOTL.GD.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SE.ENR.ORPH SE.ADT.1524.LT.FM.ZS SH.SGR.CRSK.ZS SH.SGR.IRSK.ZS SP.RUR.TOTL
## 1 NA NA NA NA 174564024
## 2 NA NA NA NA 3599290
## 3 NA NA NA NA 38455076
## 4 NA NA NA NA 1791866077
## 5 NA NA NA NA 938907166
## 6 NA NA NA NA 909282997
## SP.RUR.TOTL.ZS SP.RUR.TOTL.ZG SI.POV.RUHC SE.PRM.ENRR SE.PRM.NENR
## 1 40.79835 1.2392457 NA NA NA
## 2 48.62996 0.2246310 NA NA NA
## 3 37.56164 -0.4317326 NA NA NA
## 4 54.45920 0.5354880 NA NA NA
## 5 40.11347 -1.5127732 NA NA NA
## 6 43.43000 -1.5452493 NA NA NA
## SE.PRM.ENRR.FE SE.PRM.NENR.FE SE.PRM.ENRR.MA SE.PRM.NENR.MA SE.SEC.ENRR
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SE.SEC.NENR SE.SEC.ENRR.FE SE.SEC.NENR.FE SE.SEC.ENRR.MA SE.SEC.NENR.MA
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SE.TER.ENRR SE.TER.ENRR.FE SP.POP.BRTH.MF SL.EMP.INSV.FE.ZS SH.PRV.SMOK.FE
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SH.PRV.SMOK.MA SH.PRV.SMOK SH.MED.SAOP.P5 SH.STA.SUIC.P5 SH.STA.SUIC.FE.P5
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SH.STA.SUIC.MA.P5 SP.DYN.TO65.FE.ZS SP.DYN.TO65.MA.ZS SP.MTR.1519.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SH.ALC.PCAP.LI SH.ALC.PCAP.FE.LI SH.ALC.PCAP.MA.LI SH.TBS.DTEC.ZS SH.TBS.MORT
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SH.TBS.CURE.ZS SH.UHC.SRVS.CV.XD SL.UEM.TOTL.FE.ZS SL.UEM.TOTL.MA.ZS
## 1 NA NA 19.954200 7.824012
## 2 NA NA 10.127709 6.352640
## 3 NA NA 3.926424 3.824401
## 4 NA NA 6.671477 5.453449
## 5 NA NA 3.424570 4.139991
## 6 NA NA 3.457939 4.228728
## SL.UEM.TOTL.ZS SP.UWT.TFRT SP.URB.TOTL SP.URB.TOTL.IN.ZS SP.URB.GROW
## 1 10.336946 NA 253306246 59.20165 2.4024908
## 2 8.011520 NA 3802091 51.37004 0.9116601
## 3 3.870378 NA 63923503 62.43836 0.0129797
## 4 5.816614 NA 1498424545 45.54080 2.2175228
## 5 3.830288 NA 1401721126 59.88653 1.9575029
## 6 3.895681 NA 1184392043 56.57000 2.2716088
## SI.POV.URHC SH.MLR.NETS.ZS SN.ITK.VITA.ZS SP.DYN.WFRT SP.M15.2024.FE.ZS
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SP.M18.2024.FE.ZS SH.DYN.AIDS.FE.ZS
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
which shows that many values are missing (identified with NA). The dimensions of the dataset are obtained with:
dim(hnp_data)
## [1] 259 407
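With several hundred columns, printing whole rows is hard to read. A possible alternative, sketched here on a made-up data frame (toy and corner are hypothetical names, not objects from this tutorial), is to look only at a corner of the table and to summarise the overall proportion of missing values.

```r
# Toy wide data frame: one identifier column and 20 numeric columns,
# all missing except X3 (made-up data for illustration only)
toy <- data.frame(id = paste0("C", 1:8),
                  matrix(NA_real_, nrow = 8, ncol = 20))
toy$X3 <- 1:8

# Look at the first 6 rows and the first 5 columns only
corner <- toy[1:6, 1:5]
corner
dim(corner)

# Compact description of the column types for a few columns
str(toy[, 1:5])

# Overall proportion of missing cells in the table
mean(is.na(toy))
```

The same indexing, for instance hnp_data[1:6, 1:5], could be applied to the real dataset.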
A simple first cleaning consists of:
* removing all columns with only missing values;
* removing all rows with only missing values.
The number of missing values in each column is given by:
missing_in_cols <- apply(hnp_data, 2, function(acol) sum(is.na(acol)))
head(missing_in_cols)
## Country.Name Country.Code SP.ADO.TFRT SH.HIV.TOTL SH.HIV.INCD.TL
## 0 0 259 259 259
## SH.DYN.AIDS
## 259
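The apply() call above converts the data frame to a matrix before counting. As a hedged alternative, colSums() applied to is.na() gives the same counts in a vectorised way. The sketch below compares both on a small made-up data frame (toy, via_apply and via_colsums are illustrative names).

```r
# Small made-up data frame with a fully missing column
toy <- data.frame(
  name = c("A", "B", "C"),
  x = c(1, NA, 3),
  y = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Column-wise count of missing values, as done in the text
via_apply <- apply(toy, 2, function(acol) sum(is.na(acol)))

# Same counts: is.na() returns a logical matrix, colSums() adds up the TRUEs
via_colsums <- colSums(is.na(toy))

via_apply
via_colsums
```

Both vectors are named after the columns; apply() returns integer counts here whereas colSums() returns doubles, so they are equal in value but not identical() in type.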
Exercise: Perform the cleaning steps described above.
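# keep only the columns that have at least one observed value
hnp_data <- hnp_data[, which(missing_in_cols < nrow(hnp_data))]
# count missing values per row; the first 2 columns (country name and code)
# are never missing, so a row is empty when all other columns are NA
missing_in_rows <- apply(hnp_data, 1, function(arow) sum(is.na(arow)))
hnp_data <- hnp_data[which(missing_in_rows < (ncol(hnp_data) - 2)), ]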
If these steps have been properly performed, your dataset should look like this:
head(hnp_data)
## Country.Name Country.Code SP.POP.DPND
## 1 Arab World ARB 61.08180
## 2 Caribbean small states CSS 48.15854
## 3 Central Europe and the Baltics CEB 51.99869
## 4 Early-demographic dividend EAR 53.15051
## 5 East Asia & Pacific EAS 44.87090
## 6 East Asia & Pacific (excluding high income) EAP 43.59206
## SP.POP.DPND.OL SP.POP.DPND.YG SP.POP.AG00.FE.IN SP.POP.AG00.MA.IN
## 1 7.432003 52.12709 NA NA
## 2 13.329092 34.76142 NA NA
## 3 28.799254 23.15305 NA NA
## 4 9.118569 43.63640 NA NA
## 5 16.159883 28.39655 NA NA
## 6 14.248537 29.22363 NA NA
## SP.POP.AG01.FE.IN SP.POP.AG01.MA.IN SP.POP.AG02.FE.IN SP.POP.AG02.MA.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG03.FE.IN SP.POP.AG03.MA.IN SP.POP.AG04.FE.IN SP.POP.AG04.MA.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG05.FE.IN SP.POP.AG05.MA.IN SP.POP.AG06.FE.IN SP.POP.AG06.MA.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG07.FE.IN SP.POP.AG07.MA.IN SP.POP.AG08.FE.IN SP.POP.AG08.MA.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG09.FE.IN SP.POP.AG09.MA.IN SP.POP.AG10.FE.IN SP.POP.AG10.MA.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG11.FE.IN SP.POP.AG11.MA.IN SP.POP.AG12.FE.IN SP.POP.AG12.MA.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG13.FE.IN SP.POP.AG13.MA.IN SP.POP.AG14.FE.IN SP.POP.AG14.MA.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG15.FE.IN SP.POP.AG15.MA.IN SP.POP.AG16.FE.IN SP.POP.AG16.MA.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG17.FE.IN SP.POP.AG17.MA.IN SP.POP.AG18.FE.IN SP.POP.AG18.MA.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG19.FE.IN SP.POP.AG19.MA.IN SP.POP.AG20.FE.IN SP.POP.AG20.MA.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG21.FE.IN SP.POP.AG21.MA.IN SP.POP.AG22.FE.IN SP.POP.AG22.MA.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG23.FE.IN SP.POP.AG23.MA.IN SP.POP.AG24.FE.IN SP.POP.AG24.MA.IN
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.POP.AG25.FE.IN SP.POP.AG25.MA.IN SH.STA.DIAB.ZS NY.GNP.PCAP.CD
## 1 NA NA 12.536001 6502.415
## 2 NA NA 11.629347 9804.796
## 3 NA NA 6.296002 15796.772
## 4 NA NA 10.090910 3645.529
## 5 NA NA 8.202608 11725.701
## 6 NA NA 8.488062 8299.220
## SL.TLF.TOTL.FE.ZS SL.TLF.TOTL.IN SH.STA.STNT.ZS SH.STA.STNT.FE.ZS
## 1 20.71772 138180908 NA NA
## 2 43.93480 3406886 NA NA
## 3 44.99905 49355626 NA NA
## 4 29.81833 1303740803 NA NA
## 5 43.32029 1265785108 11 NA
## 6 43.23804 1132518723 NA NA
## SH.STA.STNT.MA.ZS SH.STA.MALN.ZS SH.STA.MALN.FE.ZS SH.STA.MALN.MA.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA 5.3 NA NA
## 6 NA NA NA NA
## SP.POP.0004.FE SP.POP.0004.FE.5Y SP.POP.0004.MA SP.POP.0004.MA.5Y
## 1 24854512 12.058287 26008786 11.728850
## 2 277110 7.557271 289767 8.026905
## 3 2456961 4.651194 2596717 5.240148
## 4 153419754 9.501407 163715691 9.770659
## 5 73077265 6.330451 80054159 6.749714
## 6 67939534 6.601297 74614334 7.010223
## SP.POP.0014.TO SP.POP.0014.TO.ZS SP.POP.0014.FE.IN SP.POP.0014.FE.ZS
## 1 139782891 32.66946 68286720 33.12963
## 2 1708078 23.47309 837203 22.83201
## 3 15599476 15.23705 7592488 14.37311
## 4 939913302 28.56627 454379983 28.14011
## 5 459749160 19.64392 218362329 18.91603
## 6 426431937 20.36885 202188095 19.64546
## SP.POP.0014.MA.IN SP.POP.0014.MA.ZS SP.POP.0509.FE SP.POP.0509.FE.5Y
## 1 71493710 32.24060 23284558 11.296616
## 2 870864 24.12401 280062 7.637777
## 3 8007008 16.15805 2543150 4.814359
## 4 485527118 28.97657 151781082 9.399923
## 5 241386466 20.35234 73854649 6.397793
## 6 224243661 21.06831 68353500 6.641520
## SP.POP.0509.MA SP.POP.0509.MA.5Y SP.POP.1014.FE SP.POP.1014.FE.5Y
## 1 24402277 11.004383 20147651 9.774730
## 2 291784 8.082778 280032 7.636958
## 3 2681766 5.411775 2592379 4.907553
## 4 161980152 9.667081 149179148 9.238783
## 5 81722934 6.890416 71430417 6.187789
## 6 75893633 7.130416 65895060 6.402647
## SP.POP.1014.MA SP.POP.1014.MA.5Y SP.POP.1519.FE SP.POP.1519.FE.5Y
## 1 21082648 9.507372 18115624 8.788883
## 2 289313 8.014328 290255 7.915755
## 3 2728522 5.506129 2358496 4.464793
## 4 159831279 9.538834 144632997 8.957236
## 5 79609371 6.712212 71678473 6.209278
## 6 73735694 6.927672 65814658 6.394835
## SP.POP.1519.MA SP.POP.1519.MA.5Y SP.POP.1564.TO.ZS SP.POP.1564.FE.IN
## 1 18928186 8.535802 62.67271 127238287
## 2 300451 8.322862 67.52627 2479088
## 3 2474088 4.992683 65.81012 33502065
## 4 155781135 9.297119 65.46432 1055828098
## 5 79216105 6.679054 69.17714 793779704
## 6 72971742 6.855897 69.69993 714792393
## SP.POP.1564.FE.ZS SP.POP.1564.MA.IN SP.POP.1564.MA.ZS SP.POP.1564.TO
## 1 61.73027 140922288 63.54993 268157877
## 2 67.60900 2434643 67.44254 4913718
## 3 63.42171 33873432 68.35620 67375470
## 4 65.38827 1098144784 65.53799 2153966405
## 5 68.76261 825251623 69.58055 1619031919
## 6 69.45230 744409915 69.93937 1459202369
## SP.POP.2024.FE SP.POP.2024.FE.5Y SP.POP.2024.MA SP.POP.2024.MA.5Y
## 1 17449218 8.465573 18738921 8.450452
## 2 300422 8.193025 311444 8.627407
## 3 2615950 4.952177 2746660 5.542731
## 4 140447231 8.698008 150203474 8.964241
## 5 75434106 6.534616 82351224 6.943390
## 6 68779490 6.682911 75223409 7.067447
## SP.POP.2529.FE SP.POP.2529.FE.5Y SP.POP.2529.MA SP.POP.2529.MA.5Y
## 1 17272753 8.379960 19452856 8.772407
## 2 293522 8.004852 300181 8.315384
## 3 3178398 6.016926 3378765 6.818307
## 4 135077139 8.365434 142819714 8.523573
## 5 83221802 7.209239 89215677 7.522161
## 6 76193573 7.403295 81698887 7.675836
## SP.POP.3034.FE SP.POP.3034.FE.5Y SP.POP.3034.MA SP.POP.3034.MA.5Y
## 1 16467310 7.989196 18887678 8.517536
## 2 275619 7.516610 276692 7.664714
## 3 3556613 6.732912 3816946 7.702555
## 4 128305461 7.946059 133696383 7.979087
## 5 94044814 8.146803 98224849 8.281764
## 6 86553698 8.409929 90437096 8.496815
## SP.POP.3539.FE SP.POP.3539.FE.5Y SP.POP.3539.MA SP.POP.3539.MA.5Y
## 1 14684769 7.124387 16743388 7.550551
## 2 270183 7.368360 258544 7.161990
## 3 3700371 7.005055 3917777 7.906032
## 4 117819461 7.296653 121995348 7.280761
## 5 80135756 6.941905 82809779 6.982052
## 6 71767388 6.973228 74253414 6.976313
## SP.POP.4044.FE SP.POP.4044.FE.5Y SP.POP.4044.MA SP.POP.4044.MA.5Y
## 1 12394573 6.013287 14241425 6.422273
## 2 236542 6.450911 222783 6.171365
## 3 3958123 7.493001 4087420 8.248366
## 4 102624792 6.355635 105546723 6.299097
## 5 80156003 6.943659 82559993 6.960992
## 6 71292738 6.927109 73576869 6.912750
## SP.POP.4549.FE SP.POP.4549.FE.5Y SP.POP.4549.MA SP.POP.4549.MA.5Y
## 1 10182884 4.940276 11540918 5.204460
## 2 233078 6.356442 220353 6.104051
## 3 3652805 6.915010 3715541 7.497920
## 4 89875870 5.566084 91728576 5.474421
## 5 91142844 7.895414 93325370 7.868668
## 6 81599425 7.928551 83664862 7.860545
## SP.POP.5054.FE SP.POP.5054.FE.5Y SP.POP.5054.MA SP.POP.5054.MA.5Y
## 1 8454292 4.101641 9299194 4.193539
## 2 216462 5.903294 206274 5.714045
## 3 3351256 6.344158 3302637 6.664681
## 4 77851846 4.821427 78642889 4.693459
## 5 87443149 7.574921 88003627 7.419968
## 6 78621452 7.639198 79203035 7.441344
## SP.POP.5559.FE SP.POP.5559.FE.5Y SP.POP.5559.MA SP.POP.5559.MA.5Y
## 1 6837653 3.317320 7481678 3.373917
## 2 200948 5.480202 190426 5.275037
## 3 3337347 6.317829 3123393 6.302974
## 4 65974962 4.085882 65698486 3.920928
## 5 71038323 6.153824 71212363 6.004224
## 6 62592679 6.081774 62798224 5.900066
## SP.POP.6064.FE SP.POP.6064.FE.5Y SP.POP.6064.MA SP.POP.6064.MA.5Y
## 1 5379212 2.609751 5608048 2.528990
## 2 162056 4.419549 147491 4.085684
## 3 3792702 7.179846 3310202 6.679950
## 4 53218344 3.295854 52032062 3.105307
## 5 59484439 5.152947 58332643 4.918279
## 6 51577296 5.011472 50582383 4.752355
## SP.POP.65UP.TO.ZS SP.POP.65UP.FE.IN SP.POP.65UP.FE.ZS SP.POP.65UP.MA.IN
## 1 4.657838 10594745 5.140092 9334519
## 2 9.000639 350509 9.558994 304443
## 3 18.952825 11729732 22.205188 7673854
## 4 5.969409 104497550 6.471616 91913084
## 5 11.178944 142234944 12.321360 119399672
## 6 9.931220 112204121 10.902235 95711107
## SP.POP.65UP.MA.ZS SP.POP.65UP.TO SP.POP.6569.FE SP.POP.6569.FE.5Y
## 1 4.209469 19929502 4114802 1.996316
## 2 8.433451 654954 119486 3.258592
## 3 15.485749 19403633 3620088 6.853076
## 4 5.485433 196410915 40197975 2.489493
## 5 10.067105 261633659 52949684 4.586862
## 6 8.992323 207914985 45327228 4.404188
## SP.POP.6569.MA SP.POP.6569.MA.5Y SP.POP.7074.FE SP.POP.7074.FE.5Y
## 1 3992940 1.800645 2772099 1.344897
## 2 110047 3.048439 90619 2.471339
## 3 2889446 5.830869 2827052 5.351802
## 4 38273259 2.284173 27574050 1.707683
## 5 50218440 4.234135 35225618 3.051483
## 6 43013941 4.041278 28082534 2.728620
## SP.POP.7074.MA SP.POP.7074.MA.5Y SP.POP.7579.FE SP.POP.7579.FE.5Y
## 1 2503846 1.129128 1867325 0.9059418
## 2 81572 2.259646 63441 1.7301473
## 3 2019510 4.075346 2044326 3.8700498
## 4 24398827 1.456138 18492383 1.1452479
## 5 31003463 2.614037 23727138 2.0554064
## 6 24606417 2.311841 18165806 1.7650677
## SP.POP.7579.MA SP.POP.7579.MA.5Y SP.POP.80UP.FE SP.POP.80UP.MA
## 1 1530529 0.6902031 1840519 1307203
## 2 52522 1.4549250 76963 60302
## 3 1246412 2.5152458 3238266 1518485
## 4 15552858 0.9282047 18233141 13688140
## 5 19302553 1.6274822 30332504 18875215
## 6 14846206 1.3948420 20628553 13244543
## SP.POP.80UP.MA.5Y SP.POP.80UP.FE.5Y SP.POP.GROW SP.POP.TOTL.FE.IN
## 1 0.5894928 0.8929369 1.9246927 206119753
## 2 1.6704404 2.0989159 0.5763854 3666801
## 3 3.0642878 6.1302590 -0.1545266 52824286
## 4 0.8169171 1.1291929 1.2945836 1614705634
## 5 1.5914515 2.6276082 0.5364899 1154376978
## 6 1.2443614 2.0043588 0.5781955 1029184609
## SP.POP.TOTL.FE.ZS SP.POP.TOTL.MA.IN SP.POP.TOTL.MA.ZS SP.POP.TOTL SI.POV.NAHC
## 1 48.17342 221750517 51.82658 427870270 NA
## 2 50.39065 3609949 49.60935 7401381 NA
## 3 51.59701 49554293 48.40299 102378579 NA
## 4 49.07486 1675584988 50.92514 3290290622 NA
## 5 49.32361 1186037760 50.67639 2340628292 NA
## 6 49.15980 1064364682 50.84020 2093675040 NA
## SH.STA.OWGH.ZS SH.STA.OWGH.FE.ZS SH.STA.OWGH.MA.ZS SH.SVR.WAST.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 6.8 NA NA 1.4
## 6 NA NA NA NA
## SH.SVR.WAST.FE.ZS SH.SVR.WAST.MA.ZS SH.STA.WAST.ZS SH.STA.WAST.FE.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA 3.7 NA
## 6 NA NA NA NA
## SH.STA.WAST.MA.ZS SE.PRM.CMPT.FE.ZS SE.PRM.CMPT.MA.ZS SE.PRM.CMPT.ZS
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## SP.RUR.TOTL SP.RUR.TOTL.ZS SP.RUR.TOTL.ZG SE.PRM.ENRR SE.PRM.NENR
## 1 174564024 40.79835 1.2392457 NA NA
## 2 3599290 48.62996 0.2246310 NA NA
## 3 38455076 37.56164 -0.4317326 NA NA
## 4 1791866077 54.45920 0.5354880 NA NA
## 5 938907166 40.11347 -1.5127732 NA NA
## 6 909282997 43.43000 -1.5452493 NA NA
## SE.PRM.ENRR.FE SE.PRM.NENR.FE SE.PRM.ENRR.MA SE.PRM.NENR.MA SE.SEC.ENRR
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SE.SEC.NENR SE.SEC.ENRR.FE SE.SEC.NENR.FE SE.SEC.ENRR.MA SE.SEC.NENR.MA
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 NA NA NA NA NA
## 6 NA NA NA NA NA
## SE.TER.ENRR SE.TER.ENRR.FE SL.UEM.TOTL.FE.ZS SL.UEM.TOTL.MA.ZS SL.UEM.TOTL.ZS
## 1 NA NA 19.954200 7.824012 10.336946
## 2 NA NA 10.127709 6.352640 8.011520
## 3 NA NA 3.926424 3.824401 3.870378
## 4 NA NA 6.671477 5.453449 5.816614
## 5 NA NA 3.424570 4.139991 3.830288
## 6 NA NA 3.457939 4.228728 3.895681
## SP.UWT.TFRT SP.URB.TOTL SP.URB.TOTL.IN.ZS SP.URB.GROW
## 1 NA 253306246 59.20165 2.4024908
## 2 NA 3802091 51.37004 0.9116601
## 3 NA 63923503 62.43836 0.0129797
## 4 NA 1498424545 45.54080 2.2175228
## 5 NA 1401721126 59.88653 1.9575029
## 6 NA 1184392043 56.57000 2.2716088
and have the following dimensions:
dim(hnp_data)
## [1] 258 196
Information on column types can be obtained with:
sapply(hnp_data, class)
## Country.Name Country.Code SP.POP.DPND SP.POP.DPND.OL
## "character" "character" "numeric" "numeric"
## SP.POP.DPND.YG SP.POP.AG00.FE.IN SP.POP.AG00.MA.IN SP.POP.AG01.FE.IN
## "numeric" "integer" "integer" "integer"
## SP.POP.AG01.MA.IN SP.POP.AG02.FE.IN SP.POP.AG02.MA.IN SP.POP.AG03.FE.IN
## "integer" "integer" "integer" "integer"
## SP.POP.AG03.MA.IN SP.POP.AG04.FE.IN SP.POP.AG04.MA.IN SP.POP.AG05.FE.IN
## "integer" "integer" "integer" "integer"
## SP.POP.AG05.MA.IN SP.POP.AG06.FE.IN SP.POP.AG06.MA.IN SP.POP.AG07.FE.IN
## "integer" "integer" "integer" "integer"
## SP.POP.AG07.MA.IN SP.POP.AG08.FE.IN SP.POP.AG08.MA.IN SP.POP.AG09.FE.IN
## "integer" "integer" "integer" "integer"
## SP.POP.AG09.MA.IN SP.POP.AG10.FE.IN SP.POP.AG10.MA.IN SP.POP.AG11.FE.IN
## "integer" "integer" "integer" "integer"
## SP.POP.AG11.MA.IN SP.POP.AG12.FE.IN SP.POP.AG12.MA.IN SP.POP.AG13.FE.IN
## "integer" "integer" "integer" "integer"
## SP.POP.AG13.MA.IN SP.POP.AG14.FE.IN SP.POP.AG14.MA.IN SP.POP.AG15.FE.IN
## "integer" "integer" "integer" "integer"
## SP.POP.AG15.MA.IN SP.POP.AG16.FE.IN SP.POP.AG16.MA.IN SP.POP.AG17.FE.IN
## "integer" "integer" "integer" "integer"
## SP.POP.AG17.MA.IN SP.POP.AG18.FE.IN SP.POP.AG18.MA.IN SP.POP.AG19.FE.IN
## "integer" "integer" "integer" "integer"
## SP.POP.AG19.MA.IN SP.POP.AG20.FE.IN SP.POP.AG20.MA.IN SP.POP.AG21.FE.IN
## "integer" "integer" "integer" "integer"
## SP.POP.AG21.MA.IN SP.POP.AG22.FE.IN SP.POP.AG22.MA.IN SP.POP.AG23.FE.IN
## "integer" "integer" "integer" "integer"
## SP.POP.AG23.MA.IN SP.POP.AG24.FE.IN SP.POP.AG24.MA.IN SP.POP.AG25.FE.IN
## "integer" "integer" "integer" "integer"
## SP.POP.AG25.MA.IN SH.STA.DIAB.ZS NY.GNP.PCAP.CD SL.TLF.TOTL.FE.ZS
## "integer" "numeric" "numeric" "numeric"
## SL.TLF.TOTL.IN SH.STA.STNT.ZS SH.STA.STNT.FE.ZS SH.STA.STNT.MA.ZS
## "numeric" "numeric" "numeric" "numeric"
## SH.STA.MALN.ZS SH.STA.MALN.FE.ZS SH.STA.MALN.MA.ZS SP.POP.0004.FE
## "numeric" "numeric" "numeric" "integer"
## SP.POP.0004.FE.5Y SP.POP.0004.MA SP.POP.0004.MA.5Y SP.POP.0014.TO
## "numeric" "integer" "numeric" "integer"
## SP.POP.0014.TO.ZS SP.POP.0014.FE.IN SP.POP.0014.FE.ZS SP.POP.0014.MA.IN
## "numeric" "integer" "numeric" "integer"
## SP.POP.0014.MA.ZS SP.POP.0509.FE SP.POP.0509.FE.5Y SP.POP.0509.MA
## "numeric" "integer" "numeric" "integer"
## SP.POP.0509.MA.5Y SP.POP.1014.FE SP.POP.1014.FE.5Y SP.POP.1014.MA
## "numeric" "integer" "numeric" "integer"
## SP.POP.1014.MA.5Y SP.POP.1519.FE SP.POP.1519.FE.5Y SP.POP.1519.MA
## "numeric" "integer" "numeric" "integer"
## SP.POP.1519.MA.5Y SP.POP.1564.TO.ZS SP.POP.1564.FE.IN SP.POP.1564.FE.ZS
## "numeric" "numeric" "numeric" "numeric"
## SP.POP.1564.MA.IN SP.POP.1564.MA.ZS SP.POP.1564.TO SP.POP.2024.FE
## "numeric" "numeric" "numeric" "integer"
## SP.POP.2024.FE.5Y SP.POP.2024.MA SP.POP.2024.MA.5Y SP.POP.2529.FE
## "numeric" "integer" "numeric" "integer"
## SP.POP.2529.FE.5Y SP.POP.2529.MA SP.POP.2529.MA.5Y SP.POP.3034.FE
## "numeric" "integer" "numeric" "integer"
## SP.POP.3034.FE.5Y SP.POP.3034.MA SP.POP.3034.MA.5Y SP.POP.3539.FE
## "numeric" "integer" "numeric" "integer"
## SP.POP.3539.FE.5Y SP.POP.3539.MA SP.POP.3539.MA.5Y SP.POP.4044.FE
## "numeric" "integer" "numeric" "integer"
## SP.POP.4044.FE.5Y SP.POP.4044.MA SP.POP.4044.MA.5Y SP.POP.4549.FE
## "numeric" "integer" "numeric" "integer"
## SP.POP.4549.FE.5Y SP.POP.4549.MA SP.POP.4549.MA.5Y SP.POP.5054.FE
## "numeric" "integer" "numeric" "integer"
## SP.POP.5054.FE.5Y SP.POP.5054.MA SP.POP.5054.MA.5Y SP.POP.5559.FE
## "numeric" "integer" "numeric" "integer"
## SP.POP.5559.FE.5Y SP.POP.5559.MA SP.POP.5559.MA.5Y SP.POP.6064.FE
## "numeric" "integer" "numeric" "integer"
## SP.POP.6064.FE.5Y SP.POP.6064.MA SP.POP.6064.MA.5Y SP.POP.65UP.TO.ZS
## "numeric" "integer" "numeric" "numeric"
## SP.POP.65UP.FE.IN SP.POP.65UP.FE.ZS SP.POP.65UP.MA.IN SP.POP.65UP.MA.ZS
## "integer" "numeric" "integer" "numeric"
## SP.POP.65UP.TO SP.POP.6569.FE SP.POP.6569.FE.5Y SP.POP.6569.MA
## "integer" "integer" "numeric" "integer"
## SP.POP.6569.MA.5Y SP.POP.7074.FE SP.POP.7074.FE.5Y SP.POP.7074.MA
## "numeric" "integer" "numeric" "integer"
## SP.POP.7074.MA.5Y SP.POP.7579.FE SP.POP.7579.FE.5Y SP.POP.7579.MA
## "numeric" "integer" "numeric" "integer"
## SP.POP.7579.MA.5Y SP.POP.80UP.FE SP.POP.80UP.MA SP.POP.80UP.MA.5Y
## "numeric" "integer" "integer" "numeric"
## SP.POP.80UP.FE.5Y SP.POP.GROW SP.POP.TOTL.FE.IN SP.POP.TOTL.FE.ZS
## "numeric" "numeric" "numeric" "numeric"
## SP.POP.TOTL.MA.IN SP.POP.TOTL.MA.ZS SP.POP.TOTL SI.POV.NAHC
## "numeric" "numeric" "numeric" "integer"
## SH.STA.OWGH.ZS SH.STA.OWGH.FE.ZS SH.STA.OWGH.MA.ZS SH.SVR.WAST.ZS
## "numeric" "numeric" "numeric" "numeric"
## SH.SVR.WAST.FE.ZS SH.SVR.WAST.MA.ZS SH.STA.WAST.ZS SH.STA.WAST.FE.ZS
## "numeric" "numeric" "numeric" "numeric"
## SH.STA.WAST.MA.ZS SE.PRM.CMPT.FE.ZS SE.PRM.CMPT.MA.ZS SE.PRM.CMPT.ZS
## "numeric" "numeric" "numeric" "numeric"
## SP.RUR.TOTL SP.RUR.TOTL.ZS SP.RUR.TOTL.ZG SE.PRM.ENRR
## "numeric" "numeric" "numeric" "numeric"
## SE.PRM.NENR SE.PRM.ENRR.FE SE.PRM.NENR.FE SE.PRM.ENRR.MA
## "numeric" "numeric" "numeric" "numeric"
## SE.PRM.NENR.MA SE.SEC.ENRR SE.SEC.NENR SE.SEC.ENRR.FE
## "numeric" "numeric" "numeric" "numeric"
## SE.SEC.NENR.FE SE.SEC.ENRR.MA SE.SEC.NENR.MA SE.TER.ENRR
## "numeric" "numeric" "numeric" "numeric"
## SE.TER.ENRR.FE SL.UEM.TOTL.FE.ZS SL.UEM.TOTL.MA.ZS SL.UEM.TOTL.ZS
## "numeric" "numeric" "numeric" "numeric"
## SP.UWT.TFRT SP.URB.TOTL SP.URB.TOTL.IN.ZS SP.URB.GROW
## "numeric" "numeric" "numeric" "numeric"
which indicates that all columns are numeric (including integer) except for the first two. Sometimes, numeric variables (and more specifically integers) are used to code for factors but this is not the case in this dataset.
The variable SP.RUR.TOTL.ZS is the share of the population (%) living in rural areas. The numerical characteristics of this variable are:
sum(is.na(hnp_data$SP.RUR.TOTL.ZS)) # number of missing values
## [1] 3
mean(hnp_data$SP.RUR.TOTL.ZS, na.rm = TRUE) # mean
## [1] 39.30523
median(hnp_data$SP.RUR.TOTL.ZS, na.rm = TRUE) # median
## [1] 38.121
min(hnp_data$SP.RUR.TOTL.ZS, na.rm = TRUE) # minimum
## [1] 0
max(hnp_data$SP.RUR.TOTL.ZS, na.rm = TRUE) # maximum
## [1] 86.75
# quartiles and min/max
quantile(hnp_data$SP.RUR.TOTL.ZS, probs = c(0, 0.25, 0.5, 0.75, 1),
na.rm = TRUE)
## 0% 25% 50% 75% 100%
## 0.0000 19.5815 38.1210 57.6340 86.7500
The option na.rm = TRUE
must be used when you have missing values that you don’t want to be taken into account for the computation (otherwise, most of these functions would return the value NA
).
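The effect of na.rm can be checked on a small example. This is a minimal sketch on a made-up vector x (illustrative values, not taken from the dataset):

```r
# Toy vector with one missing value (illustrative data, not from hnp_data)
x <- c(12.5, 40, NA, 63.2)

mean(x)                 # NA: the missing value propagates to the result
mean(x, na.rm = TRUE)   # mean of the three observed values
sum(!is.na(x))          # number of values actually used in the computation
```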
Exercise: How to interpret these values? More precisely, what do they say about the variable distribution?
Some of these values are also available with
summary(hnp_data$SP.RUR.TOTL.ZS)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 19.58 38.12 39.31 57.63 86.75 3
and can be computed for several columns at once with:
colMeans(hnp_data[ ,3:5], na.rm = TRUE)
## SP.POP.DPND SP.POP.DPND.OL SP.POP.DPND.YG
## 58.63381 14.03833 44.48563
apply(hnp_data[ ,3:5], 2, median, na.rm = TRUE)
## SP.POP.DPND SP.POP.DPND.OL SP.POP.DPND.YG
## 54.66260 10.62035 38.51829
Dispersion characteristics are obtained with:
var(hnp_data$SP.RUR.TOTL.ZS, na.rm = TRUE) # variance
## [1] 521.175
sd(hnp_data$SP.RUR.TOTL.ZS, na.rm = TRUE) # standard deviation
## [1] 22.82926
range(hnp_data$SP.RUR.TOTL.ZS, na.rm = TRUE) # range
## [1] 0.00 86.75
diff(range(hnp_data$SP.RUR.TOTL.ZS, na.rm = TRUE))
## [1] 86.75
Exercise: How would you compute the inter-quartile range (in just one line of code)?
## 75%
## 43.375
Exercise: What is the coefficient of variation (CV) for this variable? How does it compare with the CV of SP.RUR.TOTL
? Is it expected (and why)?
## [1] 0.5808198
## [1] 3.463223
Standard modifications of data include:
hnp_data$cut1 <- cut(hnp_data$SP.RUR.TOTL.ZS, breaks = 5)
hnp_data$cut2 <- cut(hnp_data$SP.RUR.TOTL.ZS, breaks = 5, labels = FALSE)
table(hnp_data$cut1)
##
## (-0.0867,17.4] (17.4,34.7] (34.7,52.1] (52.1,69.4] (69.4,86.8]
## 47 68 61 51 28
table(hnp_data$cut2)
##
## 1 2 3 4 5
## 47 68 61 51 28
hnp_data$cut3 <- cut(hnp_data$SP.RUR.TOTL.ZS, breaks = c(0, 20, 40, 70, 100))
table(hnp_data$cut3)
##
## (0,20] (20,40] (40,70] (70,100]
## 56 66 95 28
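Note that the counts of cut3 sum to 245 whereas those of cut1 sum to 255. By default, cut builds intervals that are open on the left, so values equal to the first break (here 0, the minimum of the variable) do not belong to (0,20] and become NA. A minimal sketch on a made-up vector, showing the argument include.lowest of cut:

```r
# Toy percentages including the boundary value 0 (illustrative data)
x <- c(0, 5, 20, 35, 70, 100)
brks <- c(0, 20, 40, 70, 100)

# Default: intervals are (a, b], so 0 falls in no class and becomes NA
default_cut <- cut(x, breaks = brks)
table(default_cut, useNA = "ifany")

# include.lowest = TRUE closes the first interval, which becomes [0,20]
closed_cut <- cut(x, breaks = brks, include.lowest = TRUE)
table(closed_cut, useNA = "ifany")
```

Whether boundary values should be kept depends on the analysis; the point is only that they are dropped silently otherwise.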
Exercise: What is the mode of cut3
?
hnp_data$scaled <- as.vector(scale(hnp_data$SP.RUR.TOTL))
summary(hnp_data$scaled)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.2887 -0.2878 -0.2803 0.0000 -0.2417 7.9674 3
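For a single vector, scale subtracts the mean and divides by the standard deviation, ignoring missing values. The sketch below checks this on a made-up skewed vector (illustrative values only); a strongly negative median combined with a large maximum, as in the summary above, is typical of a right-skewed variable.

```r
# Toy right-skewed vector with a missing value (illustrative data)
x <- c(2, 3, 5, 8, 120, NA)

z_scale  <- as.vector(scale(x))
z_manual <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

all.equal(z_scale, z_manual)   # both computations agree
mean(z_scale, na.rm = TRUE)    # 0 up to rounding error
sd(z_scale, na.rm = TRUE)      # 1
```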
Exercise: SH.STA.DIAB.ZS is the diabetes prevalence. Compared to the rest of the world, does France rank higher for its number of people in rural areas or for its diabetes prevalence?
## scaled scaled2
## 111 -0.2573116 -0.8108099
Before we start, a short note on color palettes:
display.brewer.all()
Charts for a factor (here cut2):
pie(table(hnp_data$cut2), col = brewer.pal(5, "Set1"))
barplot(table(hnp_data$cut2), col = "darkgreen")
hist(hnp_data$SP.RUR.TOTL)
hist(log(hnp_data$SP.RUR.TOTL), main = "number of people in rural areas",
xlab = "Number of people", breaks = 20)
plot(density(na.omit(log(hnp_data$SP.RUR.TOTL))))
hist(log(hnp_data$SP.RUR.TOTL), main = "number of people in rural areas",
xlab = "Number of people", breaks = 20, freq = FALSE)
lines(density(na.omit(log(hnp_data$SP.RUR.TOTL))), col = "red", lwd = 2)
boxplot(log(hnp_data$SP.RUR.TOTL))
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 1 is not drawn
boxplot(log(hnp_data$SP.RUR.TOTL), horizontal = TRUE, notch = TRUE)
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 1 is not drawn
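The warnings come from observations equal to 0, for which log returns -Inf. A minimal sketch on made-up counts (illustrative values) shows two common ways to handle them: dropping the zeros before taking the logarithm, or using log1p, which computes log(1 + x).

```r
# Toy counts including a zero and a missing value (illustrative data)
counts <- c(0, 10, 1000, 250000, NA)

log(counts)   # the first element is -Inf

# Option 1: keep only strictly positive, observed values
log_ok <- log(counts[counts > 0 & !is.na(counts)])

# Option 2: shift by one before taking the log; zero is mapped to 0
log_shift <- log1p(counts)
```

The second option changes the scale slightly for small counts, which is usually negligible for population sizes.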
Using a discretization of the variables SP.RUR.TOTL.ZS (as already obtained in cut2) and of SH.STA.DIAB.ZS, we can obtain a contingency table of the two resulting variables (they are encoded as numbers but are actually factors).
Exercise: Discretize SH.STA.DIAB.ZS into 5 classes with approximately equal frequencies, labelled with numbers (1, 2, …). Call the new variable cutdiab.
cont_table <- table(hnp_data$cut2, hnp_data$cutdiab)
cont_table
##
## 1 2 3 4 5
## 1 21 16 6 0 0
## 2 29 33 4 1 1
## 3 25 29 3 3 0
## 4 32 11 5 3 0
## 5 12 12 3 1 0
cont_table / rowSums(cont_table)
##
## 1 2 3 4 5
## 1 0.48837209 0.37209302 0.13953488 0.00000000 0.00000000
## 2 0.42647059 0.48529412 0.05882353 0.01470588 0.01470588
## 3 0.41666667 0.48333333 0.05000000 0.05000000 0.00000000
## 4 0.62745098 0.21568627 0.09803922 0.05882353 0.00000000
## 5 0.42857143 0.42857143 0.10714286 0.03571429 0.00000000
Exercise: How should the numbers in the last table (called “row profiles”) be interpreted? Compute the column profiles.
##
## 1 2 3 4 5
## 1 0.1764706 0.1584158 0.2857143 0.0000000 0.0000000
## 2 0.2436975 0.3267327 0.1904762 0.1250000 1.0000000
## 3 0.2100840 0.2871287 0.1428571 0.3750000 0.0000000
## 4 0.2689076 0.1089109 0.2380952 0.3750000 0.0000000
## 5 0.1008403 0.1188119 0.1428571 0.1250000 0.0000000
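Row and column profiles can also be obtained with prop.table and its margin argument. The sketch below uses a made-up 2 x 2 table (illustrative counts) and also shows a common pitfall: dividing a table directly by colSums recycles the vector down the rows, so the result is not a table of column profiles (a typical symptom is a value larger than 1 or columns that do not sum to 1).

```r
# Toy 2 x 2 contingency table (illustrative counts)
tab <- as.table(matrix(c(10, 30, 20, 40), nrow = 2,
                       dimnames = list(g = c("a", "b"), h = c("x", "y"))))

row_prof <- prop.table(tab, margin = 1)   # each row sums to 1
col_prof <- prop.table(tab, margin = 2)   # each column sums to 1

# Equivalent explicit form for column profiles
col_prof2 <- sweep(tab, 2, colSums(tab), "/")

# Pitfall: the vector colSums(tab) is recycled along the rows
wrong <- tab / colSums(tab)
colSums(wrong)   # not equal to 1
```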
Barplots are obtained using the contingency table as well:
# bars correspond to the columns of the table (cutdiab); the legend shows the rows (cut2)
barplot(cont_table, legend.text = TRUE, xlab = "cutdiab", ylab = "frequency")
barplot(cont_table, legend.text = TRUE, xlab = "cutdiab", ylab = "frequency",
beside = TRUE, col = brewer.pal(5, "Set3"))
Covariances can be obtained using the function cov
:
cov(hnp_data$SP.RUR.TOTL.ZS, hnp_data$SH.STA.DIAB.ZS, use = "complete.obs")
## [1] -2.327047
cov(hnp_data$SP.RUR.TOTL.ZS, hnp_data$SH.STA.DIAB.ZS, method = "spearman",
use = "complete.obs")
## [1] -273.6426
cov(hnp_data[ ,c("SP.RUR.TOTL.ZS", "SH.STA.DIAB.ZS", "SP.RUR.TOTL")],
use = "complete.obs")
## SP.RUR.TOTL.ZS SH.STA.DIAB.ZS SP.RUR.TOTL
## SP.RUR.TOTL.ZS 5.139864e+02 -2.327047e+00 1.194007e+09
## SH.STA.DIAB.ZS -2.327047e+00 1.955978e+01 3.647091e+07
## SP.RUR.TOTL 1.194007e+09 3.647091e+07 1.724506e+17
cov(hnp_data[ ,c("SP.RUR.TOTL.ZS", "SH.STA.DIAB.ZS", "SP.RUR.TOTL")],
use = "pairwise.complete.obs")
## SP.RUR.TOTL.ZS SH.STA.DIAB.ZS SP.RUR.TOTL
## SP.RUR.TOTL.ZS 5.211750e+02 -2.327047e+00 1.228466e+09
## SH.STA.DIAB.ZS -2.327047e+00 1.952482e+01 3.647091e+07
## SP.RUR.TOTL 1.228466e+09 3.647091e+07 1.693394e+17
and correlations are obtained with the function cor, which takes the same arguments as cov.
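The two options for the argument use, and the link between covariance and correlation, can be illustrated on the built-in airquality data (chosen here only because it contains missing values; it is not part of the tutorial data). A minimal sketch:

```r
# Built-in airquality data: Ozone and Solar.R contain missing values
vars <- airquality[ , c("Ozone", "Solar.R", "Wind")]

# "complete.obs": a common set of complete rows is used for all pairs
c_complete <- cor(vars, use = "complete.obs")

# "pairwise.complete.obs": each pair uses the rows where both are observed
c_pairwise <- cor(vars, use = "pairwise.complete.obs")

# A correlation matrix is a covariance matrix rescaled by standard deviations
c_from_cov <- cov2cor(cov(vars, use = "complete.obs"))
all.equal(c_complete, c_from_cov)
```

With "pairwise.complete.obs", the entries of the matrix are not all computed on the same observations, which is why the two matrices differ.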
Exercise: Compute the correlation between SP.RUR.TOTL.ZS
, SH.STA.DIAB.ZS
, SP.RUR.TOTL
using a common set of observations.
## SP.RUR.TOTL.ZS SH.STA.DIAB.ZS SP.RUR.TOTL
## SP.RUR.TOTL.ZS 1.00000000 -0.02320851 0.12682315
## SH.STA.DIAB.ZS -0.02320851 1.00000000 0.01985785
## SP.RUR.TOTL 0.12682315 0.01985785 1.00000000
Partial correlation between SP.RUR.TOTL.ZS
and SH.STA.DIAB.ZS
given SP.POP.DPND
is obtained with:
with(na.omit(hnp_data[ ,c("SP.RUR.TOTL.ZS", "SH.STA.DIAB.ZS", "SP.POP.DPND")]),
cor(lm(SP.RUR.TOTL.ZS ~ SP.POP.DPND)$residuals,
lm(SH.STA.DIAB.ZS ~ SP.POP.DPND)$residuals))
## [1] 0.1819244
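The residual-based computation above can be cross-checked against the closed-form expression of a first-order partial correlation, \(r_{xy \cdot z} = (r_{xy} - r_{xz} r_{yz}) / \sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}\). A minimal sketch on the built-in mtcars data (used only for illustration):

```r
# Built-in mtcars data: partial correlation of mpg and hp given wt
d <- mtcars[ , c("mpg", "hp", "wt")]

# Residual approach: remove the linear effect of wt from both variables
r_resid <- cor(residuals(lm(mpg ~ wt, data = d)),
               residuals(lm(hp ~ wt, data = d)))

# Closed-form expression based on the three pairwise correlations
r <- cor(d)
r_formula <- (r["mpg", "hp"] - r["mpg", "wt"] * r["hp", "wt"]) /
  sqrt((1 - r["mpg", "wt"]^2) * (1 - r["hp", "wt"]^2))

all.equal(r_resid, r_formula)   # the two computations agree
```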
Dot plots (scatterplots) between two variables are obtained with:
plot(hnp_data$SP.RUR.TOTL.ZS, hnp_data$SH.STA.DIAB.ZS, pch = 19)
or with scatterplot matrices:
hnp_data$log <- log(hnp_data$SP.RUR.TOTL + 1)
scatterplotMatrix(hnp_data[ ,c("SP.RUR.TOTL.ZS", "SH.STA.DIAB.ZS", "log")])
scatterplotMatrix(hnp_data[ ,c("SP.RUR.TOTL.ZS", "SH.STA.DIAB.ZS", "log")],
col = "black", regLine = FALSE, smooth = FALSE, pch = "+")
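Within- and between-group variances of SH.STA.DIAB.ZS with respect to cut2 are obtained from the function anova (in the column Mean Sq; the within-group variance corresponds to the row Residuals). Since cut2 is stored as a number, it must be converted with factor; otherwise lm fits a single slope on the class index (1 degree of freedom) instead of comparing the 5 group means:
with(na.omit(hnp_data[ ,c("SH.STA.DIAB.ZS", "cut2")]),
anova(lm(SH.STA.DIAB.ZS ~ factor(cut2))))
## Analysis of Variance Table
##
## Response: SH.STA.DIAB.ZS
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(cut2) 4 22.3 5.5794 0.282 0.8895
## Residuals 245 4848.1 19.7880
and the correlation ratio is the square root of the between-group sum of squares divided by the total sum of squares (column Sum Sq), which is about sqrt(22.3 / (22.3 + 4848.1)), that is, about 0.068 here:
cor_ratio <- with(na.omit(hnp_data[ ,c("SH.STA.DIAB.ZS", "cut2")]),
anova(lm(SH.STA.DIAB.ZS ~ factor(cut2))))
sqrt(cor_ratio["factor(cut2)", "Sum Sq"] / sum(cor_ratio[ , "Sum Sq"]))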
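A usual definition of the correlation ratio between a numeric variable and a factor is \(\eta = \sqrt{SS_{between} / SS_{total}}\), which lies between 0 and 1. The sketch below computes it from an ANOVA table on the built-in iris data (used only for illustration) and checks that \(\eta^2\) equals the R-squared of the corresponding one-factor linear model.

```r
# Built-in iris data: Sepal.Length explained by the factor Species
fit <- lm(Sepal.Length ~ Species, data = iris)
aov_tab <- anova(fit)

ss_between <- aov_tab["Species", "Sum Sq"]
ss_within  <- aov_tab["Residuals", "Sum Sq"]

eta2 <- ss_between / (ss_between + ss_within)   # squared correlation ratio
eta  <- sqrt(eta2)                              # correlation ratio

# For a one-factor model, eta^2 is the R-squared of the linear model
all.equal(eta2, summary(fit)$r.squared)
```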
Parallel boxplots or dot plots are obtained using the same syntax with the ~
:
with(hnp_data[ ,c("SH.STA.DIAB.ZS", "cut2")],
boxplot(SH.STA.DIAB.ZS ~ cut2))
with(hnp_data[ ,c("SH.STA.DIAB.ZS", "cut2")],
plot(SH.STA.DIAB.ZS ~ cut2, pch = 19, col = alpha("black", alpha = 0.2)))
To obtain multiple histograms on the same plot, you can use:
par(mfrow = c(2, 3))
for (clust in 1:5) {
# which() drops the rows where cut2 is missing
tmp <- hnp_data[which(hnp_data$cut2 == clust), "SH.STA.DIAB.ZS"]
hist(tmp, main = paste("Histogram of diabetes % for class", clust),
xlab = "diabetes %", xlim = c(0, 35))
}
Exercise: Redo the histograms so that they all share the same scale on the \(y\) axis.
To test whether the mean of SP.RUR.TOTL.ZS is equal to 50%, we first check the normality of the variable:
shapiro.test(hnp_data$SP.RUR.TOTL.ZS)
##
## Shapiro-Wilk normality test
##
## data: hnp_data$SP.RUR.TOTL.ZS
## W = 0.97328, p-value = 0.0001012
Despite the significant deviation from normality, we perform a Student's t-test (for the sake of the example):
res <- t.test(hnp_data$SP.RUR.TOTL.ZS, mu = 50, conf.level = 0.99)
res
##
## One Sample t-test
##
## data: hnp_data$SP.RUR.TOTL.ZS
## t = -7.4808, df = 254, p-value = 1.205e-12
## alternative hypothesis: true mean is not equal to 50
## 99 percent confidence interval:
## 35.59490 43.01557
## sample estimates:
## mean of x
## 39.30523
res$statistic
## t
## -7.480826
res$p.value
## [1] 1.204999e-12
res$conf.int
## [1] 35.59490 43.01557
## attr(,"conf.level")
## [1] 0.99
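The statistic and the confidence interval returned by t.test can be rebuilt by hand, which helps to read the output. A minimal sketch on the built-in airquality data (the variable Wind and the reference value 10 are arbitrary choices made for the illustration):

```r
# Built-in airquality data: is the mean wind speed equal to 10?
x  <- airquality$Wind
mu <- 10
n  <- length(x)

# t statistic: distance between the sample mean and mu, in standard errors
t_manual <- (mean(x) - mu) / (sd(x) / sqrt(n))
res_t <- t.test(x, mu = mu, conf.level = 0.99)
all.equal(unname(res_t$statistic), t_manual)

# 99% confidence interval rebuilt from the Student quantile
half_width <- qt(0.995, df = n - 1) * sd(x) / sqrt(n)
ci_manual <- c(mean(x) - half_width, mean(x) + half_width)
all.equal(as.vector(res_t$conf.int), ci_manual)
```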
res <- wilcox.test(hnp_data$SP.RUR.TOTL.ZS, mu = 50, conf.int = TRUE)
res
##
## Wilcoxon signed rank test with continuity correction
##
## data: hnp_data$SP.RUR.TOTL.ZS
## V = 8557, p-value = 4.57e-11
## alternative hypothesis: true location is not equal to 50
## 95 percent confidence interval:
## 35.97255 42.06726
## sample estimates:
## (pseudo)median
## 39.0099
res$statistic
## V
## 8557
res$p.value
## [1] 4.570157e-11
res$conf.int
## [1] 35.97255 42.06726
## attr(,"conf.level")
## [1] 0.95
For small contingency tables, the independence between rows and columns can be tested with a Fisher’s exact test (to be preferred) or a \(\chi^2\) test (only when Fisher’s exact test is computationally too heavy to run or when the sample size is sufficiently large).
res <- with(hnp_data[ ,c("cut2", "cutdiab")], fisher.test(table(cut2, cutdiab)))
# Error in fisher.test(table(cut2, cutdiab)) : FEXACT error 6 (f5xact). LDKEY=621 is too small for this problem: kval=279274087. Try increasing the size of the workspace.
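When the exact computation fails as above, fisher.test offers two documented ways out: a larger workspace, or a Monte Carlo p-value through simulate.p.value = TRUE (with B replicates). A minimal sketch on a made-up 3 x 3 table (illustrative counts, not the tutorial's table):

```r
# Toy 3 x 3 table of counts (illustrative data)
tab <- matrix(c(12, 5,  3,
                 7, 9,  4,
                 2, 6, 11), nrow = 3, byrow = TRUE)

set.seed(1)   # the simulated p-value depends on the random seed
res_sim <- fisher.test(tab, simulate.p.value = TRUE, B = 10000)
res_sim$p.value

# Alternative: give the exact algorithm more memory (may still be slow)
# fisher.test(tab, workspace = 2e7)
```

The simulated p-value is an approximation whose precision improves with B.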
In addition to being better suited to large contingency tables, the \(\chi^2\) test also provides useful statistics for interpreting the results:
res <- with(hnp_data[ ,c("cut2", "cutdiab")], chisq.test(table(cut2, cutdiab)))
## Warning in chisq.test(table(cut2, cutdiab)): Chi-squared approximation may be
## incorrect
res
##
## Pearson's Chi-squared test
##
## data: table(cut2, cutdiab)
## X-squared = 19.742, df = 16, p-value = 0.2321
res$observed
## cutdiab
## cut2 1 2 3 4 5
## 1 21 16 6 0 0
## 2 29 33 4 1 1
## 3 25 29 3 3 0
## 4 32 11 5 3 0
## 5 12 12 3 1 0
res$expected
## cutdiab
## cut2 1 2 3 4 5
## 1 20.468 17.372 3.612 1.376 0.172
## 2 32.368 27.472 5.712 2.176 0.272
## 3 28.560 24.240 5.040 1.920 0.240
## 4 24.276 20.604 4.284 1.632 0.204
## 5 13.328 11.312 2.352 0.896 0.112
res$residuals^2
## cutdiab
## cut2 1 2 3 4 5
## 1 0.01382763 0.10835736 1.57877741 1.37600000 0.17200000
## 2 0.35045180 1.11236109 0.51312045 0.63555882 1.94847059
## 3 0.44375350 0.93471947 0.82571429 0.60750000 0.24000000
## 4 2.45757851 4.47664609 0.11966760 1.14670588 0.20400000
## 5 0.13232173 0.04184441 0.17853061 0.01207143 0.11200000
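The squared residuals are the contributions of each cell to the \(\chi^2\) statistic, and the expected counts are the products of the margins divided by the total. A minimal sketch that checks both facts on the built-in HairEyeColor data, collapsed over Sex (used only for illustration):

```r
# Built-in HairEyeColor data, collapsed over Sex: Hair x Eye table
tab <- margin.table(HairEyeColor, margin = c(1, 2))
res_chi <- chisq.test(tab)

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
all.equal(as.vector(expected), as.vector(res_chi$expected))

# Squared Pearson residuals are the per-cell contributions to the statistic
contrib <- res_chi$residuals^2
sum(contrib)                                     # equals res_chi$statistic
which(contrib == max(contrib), arr.ind = TRUE)   # cell deviating the most
```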
Exercise: Titanic[ , Sex = "Male", Age = "Adult", ] is the contingency table of adult males on the Titanic for the variables Class and Survived. Are these two variables independent? Which classes deviate the most from independence? Same questions for adult women.
##
## Fisher's Exact Test for Count Data
##
## data: Titanic[, Sex = "Male", Age = "Adult", ]
## p-value = 1.559e-08
## alternative hypothesis: two.sided
##
## Pearson's Chi-squared test
##
## data: Titanic[, Sex = "Male", Age = "Adult", ]
## X-squared = 37.988, df = 3, p-value = 2.843e-08
## Survived
## Class No Yes
## 1st 118 57
## 2nd 154 14
## 3rd 387 75
## Crew 670 192
## Survived
## Class No Yes
## 1st 139.5171 35.48290
## 2nd 133.9364 34.06359
## 3rd 368.3251 93.67487
## Crew 687.2214 174.77864
## Survived
## Class No Yes
## 1st 3.3184854 13.0481274
## 2nd 3.0055123 11.8175321
## 3rd 0.9468552 3.7229900
## Crew 0.4315569 1.6968612
## Survived
## Class No Yes
## 1st 4 140
## 2nd 13 80
## 3rd 89 76
## Crew 3 20
## Survived
## Class No Yes
## 1st 36.931765 107.06824
## 2nd 23.851765 69.14824
## 3rd 42.317647 122.68235
## Crew 5.898824 17.10118
## Survived
## Class No Yes
## 1st 29.3649961 10.1290651
## 2nd 4.9371943 1.7030196
## 3rd 51.4972412 17.7632889
## Crew 1.4245515 0.4913801
Correlation tests are performed with:
with(na.omit(hnp_data[ ,c("SP.RUR.TOTL.ZS", "SH.STA.DIAB.ZS")]),
cor.test(SP.RUR.TOTL.ZS, SH.STA.DIAB.ZS))
##
## Pearson's product-moment correlation
##
## data: SP.RUR.TOTL.ZS and SH.STA.DIAB.ZS
## t = -0.36559, df = 248, p-value = 0.715
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1468527 0.1011497
## sample estimates:
## cor
## -0.02320851
with(na.omit(hnp_data[ ,c("SP.RUR.TOTL.ZS", "SH.STA.DIAB.ZS")]),
cor.test(SP.RUR.TOTL.ZS, SH.STA.DIAB.ZS, method = "spearman"))
## Warning in cor.test.default(SP.RUR.TOTL.ZS, SH.STA.DIAB.ZS, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: SP.RUR.TOTL.ZS and SH.STA.DIAB.ZS
## S = 2740422, p-value = 0.41
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.0523388
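Spearman's coefficient is Pearson's coefficient computed on the ranks of the observations (ties receive average ranks, which is why the exact p-value cannot be computed above). In the same way, cov with method = "spearman" is a covariance between ranks, which explains why its value is not on the scale of the original variables. A minimal sketch on the built-in mtcars data (used only for illustration):

```r
# Built-in mtcars data: horsepower and fuel consumption
x <- mtcars$hp
y <- mtcars$mpg

rho_spearman <- cor(x, y, method = "spearman")
rho_ranks    <- cor(rank(x), rank(y))   # Pearson correlation of the ranks
all.equal(rho_spearman, rho_ranks)

cov_spearman <- cov(x, y, method = "spearman")
cov_ranks    <- cov(rank(x), rank(y))   # covariance of the ranks
all.equal(cov_spearman, cov_ranks)
```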
Exercise: The object Soils contains characteristics of soil samples. Variables 6-13 are numerical characteristics. Make a scatterplot matrix of these variables and pick two variables that you think are linearly correlated. Test your hypothesis.
##
## Pearson's product-moment correlation
##
## data: pH and Ca
## t = 9.3221, df = 46, p-value = 3.614e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6809493 0.8885997
## sample estimates:
## cor
## 0.8086293
Comparing the means of the numeric variable SH.STA.DIAB.ZS in the first two groups defined by the factor cut2 (under normality assumption):
with(na.omit(hnp_data[hnp_data$cut2 %in% c(1, 2), c("SH.STA.DIAB.ZS", "cut2")]),
t.test(SH.STA.DIAB.ZS ~ factor(cut2)))
##
## Welch Two Sample t-test
##
## data: SH.STA.DIAB.ZS by factor(cut2)
## t = -0.096037, df = 94.203, p-value = 0.9237
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.687694 1.531959
## sample estimates:
## mean in group 1 mean in group 2
## 8.460465 8.538333
t.test(SH.STA.DIAB.ZS ~ cut2, var.equal = TRUE,
data = na.omit(hnp_data[hnp_data$cut2 %in% c(1, 2),
c("SH.STA.DIAB.ZS", "cut2")]))
##
## Two Sample t-test
##
## data: SH.STA.DIAB.ZS by cut2
## t = -0.094493, df = 109, p-value = 0.9249
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.711122 1.555387
## sample estimates:
## mean in group 1 mean in group 2
## 8.460465 8.538333
How can we check whether two variances are equal? With Bartlett's test or Fisher's F test:
with(na.omit(hnp_data[hnp_data$cut2 %in% c(1, 2), c("SH.STA.DIAB.ZS", "cut2")]),
bartlett.test(SH.STA.DIAB.ZS ~ cut2))
##
## Bartlett test of homogeneity of variances
##
## data: SH.STA.DIAB.ZS by cut2
## Bartlett's K-squared = 0.25803, df = 1, p-value = 0.6115
with(na.omit(hnp_data[hnp_data$cut2 %in% c(1, 2), c("SH.STA.DIAB.ZS", "cut2")]),
var.test(SH.STA.DIAB.ZS ~ cut2))
##
## F test to compare two variances
##
## data: SH.STA.DIAB.ZS by cut2
## F = 0.86683, num df = 42, denom df = 67, p-value = 0.6264
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.5080829 1.5319545
## sample estimates:
## ratio of variances
## 0.8668313
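The F statistic reported by var.test is simply the ratio of the two sample variances (first group over second group), and the choice var.equal = TRUE changes the degrees of freedom of the t-test to \(n_1 + n_2 - 2\). A minimal sketch on the built-in ToothGrowth data (used only for illustration):

```r
# Built-in ToothGrowth data: tooth length by supplement type (OJ or VC)
v <- tapply(ToothGrowth$len, ToothGrowth$supp, var)

res_f <- var.test(len ~ supp, data = ToothGrowth)
all.equal(unname(res_f$statistic), unname(v["OJ"] / v["VC"]))

# Pooled-variance t-test: df = n1 + n2 - 2
res_pooled <- t.test(len ~ supp, data = ToothGrowth, var.equal = TRUE)
# Welch t-test (default): df is adjusted and generally not an integer
res_welch  <- t.test(len ~ supp, data = ToothGrowth)
c(pooled = unname(res_pooled$parameter), welch = unname(res_welch$parameter))
```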
Comparing the locations (medians) of the numeric variable SH.STA.DIAB.ZS in the first two groups defined by the factor cut2 (without normality assumption):
with(na.omit(hnp_data[hnp_data$cut2 %in% c(1, 2), c("SH.STA.DIAB.ZS", "cut2")]),
wilcox.test(SH.STA.DIAB.ZS ~ cut2))
##
## Wilcoxon rank sum test with continuity correction
##
## data: SH.STA.DIAB.ZS by cut2
## W = 1411, p-value = 0.7598
## alternative hypothesis: true location shift is not equal to 0
Exercise: In the dataset Soils
, does the pH significantly differ between samples on depressions and on slopes (variable Contour
)? Visually confirm with the appropriate plot.
##
## Shapiro-Wilk normality test
##
## data: Soils$pH[Soils$Contour == "Depression"]
## W = 0.96308, p-value = 0.7179
##
## Shapiro-Wilk normality test
##
## data: Soils$pH[Soils$Contour == "Slope"]
## W = 0.89835, p-value = 0.07564
##
## Bartlett test of homogeneity of variances
##
## data: pH by Contour
## Bartlett's K-squared = 2.8034, df = 1, p-value = 0.09407
##
## Two Sample t-test
##
## data: pH by Contour
## t = -0.21586, df = 30, p-value = 0.8306
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5688226 0.4600726
## sample estimates:
## mean in group Depression mean in group Slope
## 4.691875 4.746250
ANOVA is based on a previously fitted linear model. For instance, to check whether the means of SH.STA.DIAB.ZS differ between the groups induced by cut2, we run (under normality and equal variance assumptions):
with(na.omit(hnp_data[ , c("SH.STA.DIAB.ZS", "cut2")]),
anova(lm(SH.STA.DIAB.ZS ~ factor(cut2))))
## Analysis of Variance Table
##
## Response: SH.STA.DIAB.ZS
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(cut2) 4 22.3 5.5794 0.282 0.8895
## Residuals 245 4848.1 19.7880
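One-way ANOVA generalizes the equal-variance t-test to more than two groups: with exactly two groups, the F statistic is the square of the t statistic and the p-values coincide. A minimal sketch on the built-in sleep data (used here as two independent groups, only to illustrate the identity):

```r
# Built-in sleep data, treated here as two independent groups
aov_tab <- anova(lm(extra ~ group, data = sleep))
f_value <- aov_tab["group", "F value"]

res_t <- t.test(extra ~ group, data = sleep, var.equal = TRUE)
t_value <- unname(res_t$statistic)

all.equal(f_value, t_value^2)                          # F = t^2
all.equal(aov_tab["group", "Pr(>F)"], res_t$p.value)   # same p-value
```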
Kruskal-Wallis nonparametric test is performed with:
with(na.omit(hnp_data[ , c("SH.STA.DIAB.ZS", "cut2")]),
kruskal.test(SH.STA.DIAB.ZS ~ cut2))
##
## Kruskal-Wallis rank sum test
##
## data: SH.STA.DIAB.ZS by cut2
## Kruskal-Wallis chi-squared = 4.3969, df = 4, p-value = 0.355
Exercise: In the dataset Soils
, does the pH significantly differ between the different contours? Visually confirm with the appropriate plot.
##
## Shapiro-Wilk normality test
##
## data: Soils$pH[Soils$Contour == "Top"]
## W = 0.92004, p-value = 0.1689
##
## Bartlett test of homogeneity of variances
##
## data: pH by Contour
## Bartlett's K-squared = 3.1948, df = 2, p-value = 0.2024
## Analysis of Variance Table
##
## Response: pH
## Df Sum Sq Mean Sq F value Pr(>F)
## Contour 2 0.2607 0.13033 0.2799 0.7572
## Residuals 45 20.9546 0.46566
The sleep
data show the effect (extra
) of two soporific drugs (group
) on 10 patients (ID
):
head(sleep)
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
sleep2 <- reshape(sleep, direction = "wide", idvar = "ID", timevar = "group")
head(sleep2)
## ID extra.1 extra.2
## 1 1 0.7 1.9
## 2 2 -1.6 0.8
## 3 3 -0.2 1.1
## 4 4 -1.2 0.1
## 5 5 -0.1 -0.1
## 6 6 3.4 4.4
Paired comparison tests are performed with (under normality assumption):
with(sleep2, t.test(extra.1, extra.2, paired = TRUE))
##
## Paired t-test
##
## data: extra.1 and extra.2
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.4598858 -0.7001142
## sample estimates:
## mean of the differences
## -1.58
and with a nonparametric test:
with(sleep2, wilcox.test(extra.1, extra.2, paired = TRUE))
## Warning in wilcox.test.default(extra.1, extra.2, paired = TRUE): cannot compute
## exact p-value with ties
## Warning in wilcox.test.default(extra.1, extra.2, paired = TRUE): cannot compute
## exact p-value with zeroes
##
## Wilcoxon signed rank test with continuity correction
##
## data: extra.1 and extra.2
## V = 0, p-value = 0.009091
## alternative hypothesis: true location shift is not equal to 0
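A paired t-test is a one-sample t-test on the within-patient differences. The sketch below checks this on the sleep data, extracting the two groups directly (it assumes, as in the built-in dataset, that patients are listed in the same order within each group):

```r
# Within-patient differences between the two drugs
g1 <- sleep$extra[sleep$group == 1]
g2 <- sleep$extra[sleep$group == 2]
d  <- g1 - g2

res_paired <- t.test(g1, g2, paired = TRUE)
res_diff   <- t.test(d, mu = 0)   # one-sample test on the differences

all.equal(unname(res_paired$statistic), unname(res_diff$statistic))
all.equal(res_paired$p.value, res_diff$p.value)
```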
res <- lm(SP.RUR.TOTL.ZS ~ SH.STA.DIAB.ZS,
data = na.omit(hnp_data[ ,c("SP.RUR.TOTL.ZS", "SH.STA.DIAB.ZS")]))
res
##
## Call:
## lm(formula = SP.RUR.TOTL.ZS ~ SH.STA.DIAB.ZS, data = na.omit(hnp_data[,
## c("SP.RUR.TOTL.ZS", "SH.STA.DIAB.ZS")]))
##
## Coefficients:
## (Intercept) SH.STA.DIAB.ZS
## 40.790 -0.119
summary(res)
##
## Call:
## lm(formula = SP.RUR.TOTL.ZS ~ SH.STA.DIAB.ZS, data = na.omit(hnp_data[,
## c("SP.RUR.TOTL.ZS", "SH.STA.DIAB.ZS")]))
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.445 -19.212 -0.658 17.940 48.090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.7900 3.0868 13.214 <2e-16 ***
## SH.STA.DIAB.ZS -0.1190 0.3254 -0.366 0.715
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.71 on 248 degrees of freedom
## Multiple R-squared: 0.0005386, Adjusted R-squared: -0.003491
## F-statistic: 0.1337 on 1 and 248 DF, p-value: 0.715
coefficients(res)
## (Intercept) SH.STA.DIAB.ZS
## 40.790002 -0.118971
confint(res)
## 2.5 % 97.5 %
## (Intercept) 34.7103759 46.8696285
## SH.STA.DIAB.ZS -0.7599203 0.5219783
plot(res)
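In simple linear regression, the test on the slope is equivalent to the Pearson correlation test, and the multiple R-squared is the squared correlation. This is visible above: the slope has the same p-value (0.715) as the correlation test, and 0.0005386 is the square of -0.02320851. A minimal sketch checking the identity on the built-in mtcars data (used only for illustration):

```r
# Built-in mtcars data: fuel consumption explained by weight
fit <- lm(mpg ~ wt, data = mtcars)

p_slope <- summary(fit)$coefficients["wt", "Pr(>|t|)"]
p_cor   <- cor.test(mtcars$mpg, mtcars$wt)$p.value
all.equal(p_slope, p_cor)   # same test, same p-value

r2 <- summary(fit)$r.squared
all.equal(r2, cor(mtcars$mpg, mtcars$wt)^2)   # R-squared = r^2
```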
df <- na.omit(hnp_data[ ,c("SP.RUR.TOTL.ZS", "SH.STA.DIAB.ZS", "SP.POP.DPND")])
res <- lm(SP.RUR.TOTL.ZS ~ ., data = df)
summary(res)
##
## Call:
## lm(formula = SP.RUR.TOTL.ZS ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.225 -13.988 -1.157 11.861 52.140
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.82141 6.65836 -1.775 0.07714 .
## SH.STA.DIAB.ZS 0.92084 0.32747 2.812 0.00535 **
## SP.POP.DPND 0.76833 0.08606 8.928 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.04 on 231 degrees of freedom
## Multiple R-squared: 0.2572, Adjusted R-squared: 0.2508
## F-statistic: 39.99 on 2 and 231 DF, p-value: 1.217e-15
plot(res)
res <- with(na.omit(hnp_data[ , c("SH.STA.DIAB.ZS", "cut2")]),
lm(SH.STA.DIAB.ZS ~ factor(cut2)))
summary(res)
##
## Call:
## lm(formula = SH.STA.DIAB.ZS ~ factor(cut2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.816 -2.666 -1.337 2.185 21.962
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.46047 0.67837 12.472 <2e-16 ***
## factor(cut2)2 0.07787 0.86671 0.090 0.928
## factor(cut2)3 0.09395 0.88881 0.106 0.916
## factor(cut2)4 -0.64497 0.92097 -0.700 0.484
## factor(cut2)5 0.20739 1.08023 0.192 0.848
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.448 on 245 degrees of freedom
## Multiple R-squared: 0.004582, Adjusted R-squared: -0.01167
## F-statistic: 0.282 on 4 and 245 DF, p-value: 0.8895
anova(res)
## Analysis of Variance Table
##
## Response: SH.STA.DIAB.ZS
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(cut2) 4 22.3 5.5794 0.282 0.8895
## Residuals 245 4848.1 19.7880
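The F value in this table is the ratio of the two mean squares, and the p-value is the upper tail of the corresponding Fisher distribution. A short sketch, with the numbers copied from the table above (the variable names are illustrative):

```r
# Mean squares copied from the anova(res) table above
ms_factor <- 5.5794    # Mean Sq of factor(cut2), 4 degrees of freedom
ms_resid  <- 19.7880   # Mean Sq of the residuals, 245 degrees of freedom

# F statistic and its p-value (upper tail of the F distribution)
f_value <- ms_factor / ms_resid
p_value <- pf(f_value, df1 = 4, df2 = 245, lower.tail = FALSE)
c(F = f_value, p = p_value)
```

These are the same quantities as the F-statistic and p-value printed on the last line of summary(res): for a model with a single factor, the global F-test and the one-way ANOVA test coincide.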
Exercise: Explain the pH of Soils, first with an additive model in Contour and Ca, and then with a model that includes Contour, Depth and their interaction.
##
## Call:
## lm(formula = pH ~ Contour + Ca, data = Soils)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55729 -0.25762 -0.08993 0.16989 1.10483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.41955 0.14900 22.951 < 2e-16 ***
## ContourSlope -0.12625 0.12918 -0.977 0.33377
## ContourTop -0.44012 0.13146 -3.348 0.00168 **
## Ca 0.17917 0.01666 10.754 6.76e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3623 on 44 degrees of freedom
## Multiple R-squared: 0.7278, Adjusted R-squared: 0.7092
## F-statistic: 39.21 on 3 and 44 DF, p-value: 1.711e-12
##
## Call:
## lm(formula = pH ~ Contour + Depth + Contour:Depth, data = Soils)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7125 -0.1775 -0.0125 0.1681 1.3875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3525 0.1951 27.438 < 2e-16 ***
## ContourSlope 0.1550 0.2759 0.562 0.577703
## ContourTop -0.0200 0.2759 -0.072 0.942608
## Depth10-30 -0.4725 0.2759 -1.713 0.095366 .
## Depth30-60 -0.9900 0.2759 -3.589 0.000982 ***
## Depth60-90 -1.1800 0.2759 -4.277 0.000133 ***
## ContourSlope:Depth10-30 0.2475 0.3901 0.634 0.529848
## ContourTop:Depth10-30 -0.0100 0.3901 -0.026 0.979693
## ContourSlope:Depth30-60 -0.2500 0.3901 -0.641 0.525723
## ContourTop:Depth30-60 -0.1375 0.3901 -0.352 0.726571
## ContourSlope:Depth60-90 -0.4000 0.3901 -1.025 0.312085
## ContourTop:Depth60-90 -0.2600 0.3901 -0.666 0.509396
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3901 on 36 degrees of freedom
## Multiple R-squared: 0.7417, Adjusted R-squared: 0.6628
## F-statistic: 9.398 on 11 and 36 DF, p-value: 1.188e-07
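The two outputs above can be reproduced with calls along the following lines. The formulas are those shown in the Call: fields; the object names fit_additive and fit_interaction are only illustrative, and the sketch assumes that the carData package (attached together with car) is installed.

```r
# Soils is shipped with carData, which is attached when car is loaded
data("Soils", package = "carData")

# Additive model: Contour (factor) and Ca (numeric), no interaction
fit_additive <- lm(pH ~ Contour + Ca, data = Soils)
summary(fit_additive)

# Two factors with interaction; same model as pH ~ Contour * Depth
fit_interaction <- lm(pH ~ Contour + Depth + Contour:Depth, data = Soils)
summary(fit_interaction)
```

In the second model, each of the 12 combinations of Contour and Depth gets its own mean, which is why 12 coefficients are estimated.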
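# Keep only the rows where cut2 is 1 or 5, so that the response is binary
df <- na.omit(hnp_data[hnp_data$cut2 %in% c(1,5), c("SH.STA.DIAB.ZS", "cut2")])
df$cut2 <- factor(df$cut2)
# Logistic regression of the class (1 or 5) on SH.STA.DIAB.ZS
res <- glm(cut2 ~ SH.STA.DIAB.ZS, data = df, family = binomial(link = "logit"))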
summary(res)
##
## Call:
## glm(formula = cut2 ~ SH.STA.DIAB.ZS, family = binomial(link = "logit"),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0495 -1.0003 -0.9849 1.3542 1.3968
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.53079 0.55448 -0.957 0.338
## SH.STA.DIAB.ZS 0.01189 0.05806 0.205 0.838
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 95.234 on 70 degrees of freedom
## Residual deviance: 95.192 on 69 degrees of freedom
## AIC: 99.192
##
## Number of Fisher Scoring iterations: 4
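The coefficients of a logistic regression are on the log-odds scale. With a factor response, glm treats the first level as "failure", so the model describes the probability that cut2 equals 5. The sketch below converts the slope into an odds ratio and recomputes the Wald test by hand. The two numbers are copied from the output above, and the variable names are illustrative.

```r
# Values copied from the summary(res) output above
log_odds_slope <- 0.01189   # estimate for SH.STA.DIAB.ZS
std_error      <- 0.05806   # its standard error

# Odds ratio associated with a one-unit increase of SH.STA.DIAB.ZS
odds_ratio <- exp(log_odds_slope)

# Wald z-test and 95% Wald interval, back-transformed to the odds-ratio scale
z_value       <- log_odds_slope / std_error
p_value       <- 2 * pnorm(-abs(z_value))
ci_odds_ratio <- exp(log_odds_slope + c(-1, 1) * qnorm(0.975) * std_error)

round(c(OR = odds_ratio, lower = ci_odds_ratio[1],
        upper = ci_odds_ratio[2], p = p_value), 3)
```

An odds ratio close to 1, with an interval that contains 1, tells the same story as the p-value of 0.838. Note that this is a Wald interval; confint(res) uses profile likelihood for glm objects and may return slightly different bounds.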
If we test, for each numeric variable in Soils, whether its mean differs between the levels of Group (the 12 combinations of Contour and Depth), we obtain 9 p-values on which a multiple testing correction can be performed:
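# Columns 6 to 14 of Soils hold the 9 numeric variables: for each of them, fit
# a one-way ANOVA against Group and keep the p-value of the F-test
all_pvals <- apply(Soils[ ,6:14], 2, function(avar)
  anova(lm(avar ~ Soils$Group))[1,"Pr(>F)"])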
all_pvals
## pH N Dens P Ca Mg
## 1.187636e-07 5.331419e-10 3.456879e-08 1.861579e-09 1.333073e-10 1.579912e-03
## K Na Conduc
## 2.456518e-06 1.369506e-15 3.425749e-17
p.adjust(all_pvals, method = "BH")
## pH N Dens P Ca Mg
## 1.526961e-07 1.199569e-09 5.185319e-08 3.350841e-09 3.999220e-10 1.579912e-03
## K Na Conduc
## 2.763582e-06 6.162778e-15 3.083174e-16
p.adjust(all_pvals, method = "bonferroni")
## pH N Dens P Ca Mg
## 1.068873e-06 4.798277e-09 3.111191e-07 1.675421e-08 1.199766e-09 1.421921e-02
## K Na Conduc
## 2.210866e-05 1.232556e-14 3.083174e-16
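To see what the two corrections do, they can be written out by hand and compared with p.adjust. In this sketch, raw_pvals is a copy of the p-values printed above and bh_by_hand is a hypothetical helper written for illustration only. It is not part of any package.

```r
# Raw p-values copied from the all_pvals output above
raw_pvals <- c(pH = 1.187636e-07, N = 5.331419e-10, Dens = 3.456879e-08,
               P = 1.861579e-09, Ca = 1.333073e-10, Mg = 1.579912e-03,
               K = 2.456518e-06, Na = 1.369506e-15, Conduc = 3.425749e-17)

# Bonferroni: multiply each p-value by the number of tests, cap at 1
bonf_by_hand <- pmin(raw_pvals * length(raw_pvals), 1)

# Benjamini-Hochberg, written out by hand (hypothetical helper)
bh_by_hand <- function(p) {
  n <- length(p)
  o <- order(p, decreasing = TRUE)      # indices, largest p-value first
  scaled <- p[o] * n / (n:1)            # p * n / rank, rank n for the largest
  adjusted_sorted <- pmin(cummin(scaled), 1)  # running minimum keeps it monotone
  adjusted <- numeric(n)
  adjusted[o] <- adjusted_sorted        # back to the original order
  names(adjusted) <- names(p)
  adjusted
}

all.equal(bh_by_hand(raw_pvals), p.adjust(raw_pvals, method = "BH"))
all.equal(bonf_by_hand, p.adjust(raw_pvals, method = "bonferroni"))
```

The largest p-value (Mg) is unchanged by BH because it is multiplied by n / n, whereas Bonferroni multiplies every p-value by 9. This is why the BH-adjusted values are never larger than the Bonferroni ones.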
This file has been compiled with the following system configuration:
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scales_1.1.1 car_3.0-9 carData_3.0-4 RColorBrewer_1.1-2
##
## loaded via a namespace (and not attached):
## [1] zip_2.1.0 Rcpp_1.0.5 compiler_4.0.2 pillar_1.4.6
## [5] cellranger_1.1.0 forcats_0.5.0 tools_4.0.2 digest_0.6.25
## [9] evaluate_0.14 lifecycle_0.2.0 tibble_3.0.3 pkgconfig_2.0.3
## [13] rlang_0.4.7 openxlsx_4.1.5 curl_4.3 yaml_2.2.1
## [17] haven_2.3.1 xfun_0.16 rio_0.5.16 stringr_1.4.0
## [21] knitr_1.29 vctrs_0.3.2 hms_0.5.3 data.table_1.13.0
## [25] R6_2.4.1 readxl_1.3.1 foreign_0.8-79 rmarkdown_2.3
## [29] farver_2.0.3 magrittr_1.5 ellipsis_0.3.1 htmltools_0.5.0
## [33] abind_1.4-5 colorspace_1.4-1 stringi_1.4.6 munsell_0.5.0
## [37] crayon_1.3.4