::p_load(
pacman# file locator
here, # data management and ggplot2 graphics
tidyverse, # get overview of data
skimr, # produce and adorn tabulations and cross-tabulations
janitor )
Task 2: Datasets combination
Introduction
During the first task, data loading and cleaning has been performed. For each of the datasets, the final files were stored in the “/data/clean/” project folder by saving them as a RDS file, which creates a serialized version of the dataset and then saves it with gzip.
By doing this we avoid problems in further reading, since the dataset will be loaded as an R object exactly as when the cleaning of the dataset were finished.
First, we need to load packages:
CNE + AEMET
Import clean datasets:
<- readRDS(here("data", "clean", "cne_data.rds"))
cne_clean_data cne_clean_data
# A tibble: 42,588 × 11
provincia_iso fecha num_casos num_casos_prueba_pcr num_casos_prueba_tes…
<chr> <date> <dbl> <dbl> <dbl>
1 A 2020-01-01 0 0 0
2 AB 2020-01-01 0 0 0
3 AL 2020-01-01 0 0 0
4 AV 2020-01-01 0 0 0
5 B 2020-01-01 0 0 0
6 BA 2020-01-01 0 0 0
7 BI 2020-01-01 0 0 0
8 BU 2020-01-01 0 0 0
9 C 2020-01-01 0 0 0
10 CA 2020-01-01 0 0 0
# … with 42,578 more rows, and 6 more variables: num_casos_prueba_ag <dbl>,
# num_casos_prueba_elisa <dbl>, num_casos_prueba_desconocida <dbl>,
# num_hosp <dbl>, num_uci <dbl>, num_def <dbl>
<- readRDS(here("data", "clean", "aemet_data.rds"))
aemet_clean_data aemet_clean_data
# A tibble: 4,105 × 10
fecha provincia tmed prec tmin tmax wd ws ws_max sol
<date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2020-01-01 Asturias 7.4 0 3.8 11 99 2.5 7.8 7.9
2 2020-01-02 Asturias 7.3 0 3.2 11.4 22 2.8 10.3 2.8
3 2020-01-03 Asturias 10.7 0.7 7.4 14 99 2.8 9.7 1.9
4 2020-01-04 Asturias 8.4 2.4 5.6 11.1 99 3.6 7.2 0.7
5 2020-01-05 Asturias 8.2 0 4.7 11.8 13 2.5 7.2 8.6
6 2020-01-06 Asturias 7.6 0 2.8 12.3 22 3.1 8.9 6.3
7 2020-01-07 Asturias 8.8 0 4.3 13.3 99 1.9 8.9 8
8 2020-01-08 Asturias 13.4 0 9 17.7 22 2.8 8.3 6.1
9 2020-01-09 Asturias 14.4 5.5 8.7 20.2 29 7.8 24.2 4.3
10 2020-01-10 Asturias 8.2 0.6 4.9 11.6 28 2.2 16.1 1.9
# … with 4,095 more rows
In the following analyses we will focus on the provinces:
- Asturias
- Barcelona
- Madrid
- Málaga
- Sevilla
Meteorological data from AEMET already corresponds to those provinces so we will proceed to filter the data from CNE.
# Ref iso code provinces https://es.wikipedia.org/wiki/ISO_3166-2:ES
<- cne_clean_data %>%
cne_clean_data filter(provincia_iso %in% c("O", "B", "M", "MA", "SE")) %>%
mutate(
provincia_iso = case_when(
== "O" ~ "Asturias",
provincia_iso == "B" ~ "Barcelona",
provincia_iso == "M" ~ "Madrid",
provincia_iso == "MA" ~ "Málaga",
provincia_iso == "SE" ~ "Sevilla",
provincia_iso TRUE ~ provincia_iso
)%>%
) rename(provincia = provincia_iso)
cne_clean_data
# A tibble: 4,095 × 11
provincia fecha num_casos num_casos_prueba_pcr num_casos_prueba_test_ac
<chr> <date> <dbl> <dbl> <dbl>
1 Barcelona 2020-01-01 0 0 0
2 Madrid 2020-01-01 1 1 0
3 Málaga 2020-01-01 0 0 0
4 Asturias 2020-01-01 0 0 0
5 Sevilla 2020-01-01 0 0 0
6 Barcelona 2020-01-02 0 0 0
7 Madrid 2020-01-02 0 0 0
8 Málaga 2020-01-02 0 0 0
9 Asturias 2020-01-02 0 0 0
10 Sevilla 2020-01-02 0 0 0
# … with 4,085 more rows, and 6 more variables: num_casos_prueba_ag <dbl>,
# num_casos_prueba_elisa <dbl>, num_casos_prueba_desconocida <dbl>,
# num_hosp <dbl>, num_uci <dbl>, num_def <dbl>
<- left_join(cne_clean_data, aemet_clean_data, by=c("fecha", "provincia"))
covid_data covid_data
# A tibble: 4,095 × 19
provincia fecha num_casos num_casos_prueba_pcr num_casos_prueba_test_ac
<chr> <date> <dbl> <dbl> <dbl>
1 Barcelona 2020-01-01 0 0 0
2 Madrid 2020-01-01 1 1 0
3 Málaga 2020-01-01 0 0 0
4 Asturias 2020-01-01 0 0 0
5 Sevilla 2020-01-01 0 0 0
6 Barcelona 2020-01-02 0 0 0
7 Madrid 2020-01-02 0 0 0
8 Málaga 2020-01-02 0 0 0
9 Asturias 2020-01-02 0 0 0
10 Sevilla 2020-01-02 0 0 0
# … with 4,085 more rows, and 14 more variables: num_casos_prueba_ag <dbl>,
# num_casos_prueba_elisa <dbl>, num_casos_prueba_desconocida <dbl>,
# num_hosp <dbl>, num_uci <dbl>, num_def <dbl>, tmed <dbl>, prec <dbl>,
# tmin <dbl>, tmax <dbl>, wd <dbl>, ws <dbl>, ws_max <dbl>, sol <dbl>
Check if missing information:
skim(covid_data)
Name | covid_data |
Number of rows | 4095 |
Number of columns | 19 |
_______________________ | |
Column type frequency: | |
character | 1 |
Date | 1 |
numeric | 17 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
provincia | 0 | 1 | 6 | 9 | 0 | 5 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
fecha | 0 | 1 | 2020-01-01 | 2022-03-29 | 2021-02-13 | 819 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
num_casos | 0 | 1 | 1014.00 | 2598.89 | 0.0 | 64.0 | 280.0 | 984.00 | 34701.0 | ▇▁▁▁▁ |
num_casos_prueba_pcr | 0 | 1 | 495.68 | 854.53 | 0.0 | 50.0 | 176.0 | 565.50 | 10167.0 | ▇▁▁▁▁ |
num_casos_prueba_test_ac | 0 | 1 | 0.03 | 0.25 | 0.0 | 0.0 | 0.0 | 0.00 | 5.0 | ▇▁▁▁▁ |
num_casos_prueba_ag | 0 | 1 | 504.14 | 2018.32 | 0.0 | 0.0 | 43.0 | 240.00 | 31073.0 | ▇▁▁▁▁ |
num_casos_prueba_elisa | 0 | 1 | 0.71 | 3.42 | 0.0 | 0.0 | 0.0 | 0.00 | 55.0 | ▇▁▁▁▁ |
num_casos_prueba_desconocida | 0 | 1 | 13.44 | 53.64 | 0.0 | 0.0 | 0.0 | 0.00 | 519.0 | ▇▁▁▁▁ |
num_hosp | 0 | 1 | 51.28 | 119.70 | 0.0 | 6.0 | 18.0 | 48.00 | 1934.0 | ▇▁▁▁▁ |
num_uci | 0 | 1 | 4.45 | 9.74 | 0.0 | 0.0 | 1.0 | 4.00 | 135.0 | ▇▁▁▁▁ |
num_def | 0 | 1 | 9.78 | 23.79 | 0.0 | 0.0 | 3.0 | 10.00 | 334.0 | ▇▁▁▁▁ |
tmed | 0 | 1 | 16.63 | 6.45 | -6.2 | 12.0 | 15.6 | 21.20 | 34.5 | ▁▂▇▅▂ |
prec | 0 | 1 | 1.75 | 6.14 | 0.0 | 0.0 | 0.0 | 0.20 | 87.9 | ▇▁▁▁▁ |
tmin | 0 | 1 | 11.69 | 6.23 | -13.4 | 7.4 | 11.3 | 16.10 | 27.5 | ▁▁▇▇▂ |
tmax | 0 | 1 | 21.57 | 7.30 | 0.3 | 16.2 | 20.3 | 26.40 | 44.9 | ▁▇▇▃▁ |
wd | 0 | 1 | 41.76 | 36.25 | 1.0 | 15.0 | 26.0 | 99.00 | 99.0 | ▆▇▁▁▆ |
ws | 0 | 1 | 3.72 | 1.73 | 0.3 | 2.5 | 3.6 | 4.40 | 18.9 | ▇▅▁▁▁ |
ws_max | 0 | 1 | 10.24 | 3.61 | 2.5 | 7.8 | 9.7 | 12.05 | 31.9 | ▅▇▂▁▁ |
sol | 0 | 1 | 6.98 | 4.12 | 0.0 | 3.5 | 7.7 | 10.30 | 14.3 | ▇▅▆▇▅ |
CNE/AEMET + GOOGLE
Load clean data:
<- readRDS(here("data", "clean", "google_data.rds")) google_data
Rename provinces to match previous dataset:
<- google_data %>%
google_data filter(province %in% c("Asturias", "Barcelona", "Madrid", "Málaga", "Seville")) %>%
mutate(province = ifelse(province == "Seville", "Sevilla", province)) %>%
select(-CA, -iso_3166_2_code, provincia = province)
google_data
# A tibble: 4,025 × 8
# Groups: provincia [5]
provincia fecha mob_grocery_pharmacy mob_parks mob_residential
<chr> <date> <dbl> <dbl> <dbl>
1 Asturias 2020-02-15 -1 20 -1
2 Asturias 2020-02-16 1 11 -2
3 Asturias 2020-02-17 0 -13 -1
4 Asturias 2020-02-18 1 11 -1
5 Asturias 2020-02-19 1 29 -2
6 Asturias 2020-02-20 0 32 -3
7 Asturias 2020-02-21 -1 18 0
8 Asturias 2020-02-22 1 29 -1
9 Asturias 2020-02-23 6 23 -3
10 Asturias 2020-02-24 5 48 -1
# … with 4,015 more rows, and 3 more variables: mob_retail_recreation <dbl>,
# mob_transit_stations <dbl>, mob_workplaces <dbl>
Statistics for Google data
skim(google_data)
Name | google_data |
Number of rows | 4025 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
Date | 1 |
numeric | 6 |
________________________ | |
Group variables | provincia |
Variable type: Date
skim_variable | provincia | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|---|
fecha | Asturias | 0 | 1 | 2020-02-15 | 2022-04-29 | 2021-03-23 | 805 |
fecha | Barcelona | 0 | 1 | 2020-02-15 | 2022-04-29 | 2021-03-23 | 805 |
fecha | Madrid | 0 | 1 | 2020-02-15 | 2022-04-29 | 2021-03-23 | 805 |
fecha | Málaga | 0 | 1 | 2020-02-15 | 2022-04-29 | 2021-03-23 | 805 |
fecha | Sevilla | 0 | 1 | 2020-02-15 | 2022-04-29 | 2021-03-23 | 805 |
Variable type: numeric
skim_variable | provincia | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
mob_grocery_pharmacy | Asturias | 0 | 1 | -0.76 | 19.96 | -94 | -3 | 3 | 9 | 42 | ▁▁▁▇▁ |
mob_grocery_pharmacy | Barcelona | 0 | 1 | -4.14 | 20.77 | -91 | -10 | 0 | 9 | 58 | ▁▁▆▇▁ |
mob_grocery_pharmacy | Madrid | 0 | 1 | -6.40 | 19.37 | -87 | -13 | -3 | 6 | 50 | ▁▁▆▇▁ |
mob_grocery_pharmacy | Málaga | 0 | 1 | 2.88 | 30.53 | -91 | -8 | 4 | 14 | 156 | ▁▇▆▁▁ |
mob_grocery_pharmacy | Sevilla | 0 | 1 | -4.90 | 22.74 | -91 | -12 | -1 | 7 | 117 | ▁▃▇▁▁ |
mob_parks | Asturias | 0 | 1 | 45.19 | 71.02 | -88 | 5 | 33 | 70 | 333 | ▂▇▂▁▁ |
mob_parks | Barcelona | 0 | 1 | -10.48 | 23.79 | -93 | -19 | -6 | 5 | 33 | ▁▁▃▇▃ |
mob_parks | Madrid | 0 | 1 | -9.83 | 28.19 | -92 | -22 | -6 | 10 | 77 | ▁▂▇▃▁ |
mob_parks | Málaga | 0 | 1 | 6.13 | 43.88 | -92 | -18 | 2 | 27 | 152 | ▂▇▆▂▁ |
mob_parks | Sevilla | 0 | 1 | -17.95 | 25.70 | -93 | -30 | -13 | -3 | 95 | ▁▅▇▁▁ |
mob_residential | Asturias | 0 | 1 | 5.34 | 7.44 | -8 | 1 | 3 | 8 | 40 | ▆▇▁▁▁ |
mob_residential | Barcelona | 0 | 1 | 9.00 | 8.45 | -7 | 4 | 7 | 11 | 47 | ▃▇▁▁▁ |
mob_residential | Madrid | 0 | 1 | 8.95 | 9.18 | -10 | 4 | 7 | 11 | 46 | ▂▇▂▁▁ |
mob_residential | Málaga | 0 | 1 | 6.22 | 7.37 | -4 | 2 | 4 | 8 | 40 | ▇▆▁▁▁ |
mob_residential | Sevilla | 0 | 1 | 5.59 | 7.59 | -7 | 1 | 4 | 8 | 42 | ▆▇▁▁▁ |
mob_retail_recreation | Asturias | 0 | 1 | -26.49 | 24.13 | -97 | -34 | -21 | -11 | 15 | ▁▁▃▇▃ |
mob_retail_recreation | Barcelona | 0 | 1 | -32.91 | 21.91 | -97 | -43 | -27 | -17 | 3 | ▁▁▃▇▅ |
mob_retail_recreation | Madrid | 0 | 1 | -32.62 | 21.35 | -96 | -38 | -28 | -19 | 4 | ▁▁▃▇▂ |
mob_retail_recreation | Málaga | 0 | 1 | -21.14 | 26.07 | -97 | -31 | -16 | -4 | 30 | ▁▁▅▇▂ |
mob_retail_recreation | Sevilla | 0 | 1 | -26.91 | 22.30 | -97 | -35 | -23 | -11 | 23 | ▁▁▆▇▁ |
mob_transit_stations | Asturias | 0 | 1 | -17.30 | 20.87 | -85 | -26 | -14 | -4 | 26 | ▁▁▅▇▂ |
mob_transit_stations | Barcelona | 0 | 1 | -27.65 | 18.95 | -91 | -35 | -25 | -15 | 10 | ▁▁▃▇▂ |
mob_transit_stations | Madrid | 0 | 1 | -33.40 | 19.63 | -93 | -42 | -32 | -19 | 10 | ▁▁▇▇▁ |
mob_transit_stations | Málaga | 0 | 1 | -22.89 | 25.52 | -93 | -37 | -22 | -3 | 33 | ▂▂▇▆▂ |
mob_transit_stations | Sevilla | 0 | 1 | -28.81 | 21.99 | -94 | -40 | -27 | -14 | 52 | ▁▃▇▂▁ |
mob_workplaces | Asturias | 0 | 1 | -23.33 | 17.05 | -88 | -28 | -19 | -13 | 12 | ▁▁▂▇▁ |
mob_workplaces | Barcelona | 0 | 1 | -30.01 | 19.76 | -92 | -39 | -26 | -17 | 6 | ▁▂▃▇▃ |
mob_workplaces | Madrid | 0 | 1 | -30.44 | 22.60 | -92 | -45 | -27 | -15 | 20 | ▂▃▇▇▂ |
mob_workplaces | Málaga | 0 | 1 | -21.63 | 19.72 | -89 | -29 | -18 | -10 | 29 | ▁▁▆▇▁ |
mob_workplaces | Sevilla | 0 | 1 | -23.28 | 19.04 | -90 | -32 | -17 | -10 | 9 | ▁▁▃▇▅ |
Since everything is right, we will combine the data with the already combined dataset from CNE and AEMET:
<- left_join(covid_data, google_data, by=c("fecha", "provincia"))
covid_data covid_data
# A tibble: 4,095 × 25
provincia fecha num_casos num_casos_prueba_pcr num_casos_prueba_test_ac
<chr> <date> <dbl> <dbl> <dbl>
1 Barcelona 2020-01-01 0 0 0
2 Madrid 2020-01-01 1 1 0
3 Málaga 2020-01-01 0 0 0
4 Asturias 2020-01-01 0 0 0
5 Sevilla 2020-01-01 0 0 0
6 Barcelona 2020-01-02 0 0 0
7 Madrid 2020-01-02 0 0 0
8 Málaga 2020-01-02 0 0 0
9 Asturias 2020-01-02 0 0 0
10 Sevilla 2020-01-02 0 0 0
# … with 4,085 more rows, and 20 more variables: num_casos_prueba_ag <dbl>,
# num_casos_prueba_elisa <dbl>, num_casos_prueba_desconocida <dbl>,
# num_hosp <dbl>, num_uci <dbl>, num_def <dbl>, tmed <dbl>, prec <dbl>,
# tmin <dbl>, tmax <dbl>, wd <dbl>, ws <dbl>, ws_max <dbl>, sol <dbl>,
# mob_grocery_pharmacy <dbl>, mob_parks <dbl>, mob_residential <dbl>,
# mob_retail_recreation <dbl>, mob_transit_stations <dbl>,
# mob_workplaces <dbl>
It is important to note that Google provides data as of February 15nd. Before that day, there is not information related to Google´s mobility.
CNE/AEMET/GOOGLE + INE
Load clean INE data:
<- readRDS(here("data", "clean", "ine_data.rds")) ine_data
Rename provinces to match previous datasets:
<- ine_data %>%
ine_data filter(province %in% c("Asturias", "Barcelona", "Madrid", "Málaga", "Sevilla")) %>%
rename(
provincia = province,
mob_flujo = flujo
) ine_data
# A tibble: 3,270 × 3
fecha provincia mob_flujo
<date> <chr> <dbl>
1 2020-03-16 Málaga 11.4
2 2020-03-16 Sevilla 12.3
3 2020-03-16 Asturias 13.1
4 2020-03-16 Barcelona 14.6
5 2020-03-16 Madrid 13.9
6 2020-03-17 Málaga 10.7
7 2020-03-17 Sevilla 11.5
8 2020-03-17 Asturias 12.5
9 2020-03-17 Barcelona 14.0
10 2020-03-17 Madrid 13.4
# … with 3,260 more rows
Combination with previous datasets:
<- left_join(covid_data, ine_data, by=c("fecha", "provincia"))
covid_data covid_data
# A tibble: 4,095 × 26
provincia fecha num_casos num_casos_prueba_pcr num_casos_prueba_test_ac
<chr> <date> <dbl> <dbl> <dbl>
1 Barcelona 2020-01-01 0 0 0
2 Madrid 2020-01-01 1 1 0
3 Málaga 2020-01-01 0 0 0
4 Asturias 2020-01-01 0 0 0
5 Sevilla 2020-01-01 0 0 0
6 Barcelona 2020-01-02 0 0 0
7 Madrid 2020-01-02 0 0 0
8 Málaga 2020-01-02 0 0 0
9 Asturias 2020-01-02 0 0 0
10 Sevilla 2020-01-02 0 0 0
# … with 4,085 more rows, and 21 more variables: num_casos_prueba_ag <dbl>,
# num_casos_prueba_elisa <dbl>, num_casos_prueba_desconocida <dbl>,
# num_hosp <dbl>, num_uci <dbl>, num_def <dbl>, tmed <dbl>, prec <dbl>,
# tmin <dbl>, tmax <dbl>, wd <dbl>, ws <dbl>, ws_max <dbl>, sol <dbl>,
# mob_grocery_pharmacy <dbl>, mob_parks <dbl>, mob_residential <dbl>,
# mob_retail_recreation <dbl>, mob_transit_stations <dbl>,
# mob_workplaces <dbl>, mob_flujo <dbl>
Mobility information published by INE is available as of March 16th which we will consider the beginning of our analysis.
Final covid dataset
The statistics of the final dataset looks like:
skim(covid_data)
Name | covid_data |
Number of rows | 4095 |
Number of columns | 26 |
_______________________ | |
Column type frequency: | |
character | 1 |
Date | 1 |
numeric | 24 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
provincia | 0 | 1 | 6 | 9 | 0 | 5 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
fecha | 0 | 1 | 2020-01-01 | 2022-03-29 | 2021-02-13 | 819 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
num_casos | 0 | 1.00 | 1014.00 | 2598.89 | 0.00 | 64.00 | 280.00 | 984.00 | 34701.00 | ▇▁▁▁▁ |
num_casos_prueba_pcr | 0 | 1.00 | 495.68 | 854.53 | 0.00 | 50.00 | 176.00 | 565.50 | 10167.00 | ▇▁▁▁▁ |
num_casos_prueba_test_ac | 0 | 1.00 | 0.03 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 5.00 | ▇▁▁▁▁ |
num_casos_prueba_ag | 0 | 1.00 | 504.14 | 2018.32 | 0.00 | 0.00 | 43.00 | 240.00 | 31073.00 | ▇▁▁▁▁ |
num_casos_prueba_elisa | 0 | 1.00 | 0.71 | 3.42 | 0.00 | 0.00 | 0.00 | 0.00 | 55.00 | ▇▁▁▁▁ |
num_casos_prueba_desconocida | 0 | 1.00 | 13.44 | 53.64 | 0.00 | 0.00 | 0.00 | 0.00 | 519.00 | ▇▁▁▁▁ |
num_hosp | 0 | 1.00 | 51.28 | 119.70 | 0.00 | 6.00 | 18.00 | 48.00 | 1934.00 | ▇▁▁▁▁ |
num_uci | 0 | 1.00 | 4.45 | 9.74 | 0.00 | 0.00 | 1.00 | 4.00 | 135.00 | ▇▁▁▁▁ |
num_def | 0 | 1.00 | 9.78 | 23.79 | 0.00 | 0.00 | 3.00 | 10.00 | 334.00 | ▇▁▁▁▁ |
tmed | 0 | 1.00 | 16.63 | 6.45 | -6.20 | 12.00 | 15.60 | 21.20 | 34.50 | ▁▂▇▅▂ |
prec | 0 | 1.00 | 1.75 | 6.14 | 0.00 | 0.00 | 0.00 | 0.20 | 87.90 | ▇▁▁▁▁ |
tmin | 0 | 1.00 | 11.69 | 6.23 | -13.40 | 7.40 | 11.30 | 16.10 | 27.50 | ▁▁▇▇▂ |
tmax | 0 | 1.00 | 21.57 | 7.30 | 0.30 | 16.20 | 20.30 | 26.40 | 44.90 | ▁▇▇▃▁ |
wd | 0 | 1.00 | 41.76 | 36.25 | 1.00 | 15.00 | 26.00 | 99.00 | 99.00 | ▆▇▁▁▆ |
ws | 0 | 1.00 | 3.72 | 1.73 | 0.30 | 2.50 | 3.60 | 4.40 | 18.90 | ▇▅▁▁▁ |
ws_max | 0 | 1.00 | 10.24 | 3.61 | 2.50 | 7.80 | 9.70 | 12.05 | 31.90 | ▅▇▂▁▁ |
sol | 0 | 1.00 | 6.98 | 4.12 | 0.00 | 3.50 | 7.70 | 10.30 | 14.30 | ▇▅▆▇▅ |
mob_grocery_pharmacy | 225 | 0.95 | -2.89 | 23.44 | -94.00 | -10.00 | 0.00 | 9.00 | 156.00 | ▁▇▅▁▁ |
mob_parks | 225 | 0.95 | 2.16 | 48.40 | -93.00 | -20.00 | -3.00 | 15.75 | 333.00 | ▇▇▁▁▁ |
mob_residential | 225 | 0.95 | 7.24 | 8.26 | -8.00 | 2.00 | 5.00 | 10.00 | 47.00 | ▆▇▁▁▁ |
mob_retail_recreation | 225 | 0.95 | -28.46 | 23.87 | -97.00 | -38.00 | -24.00 | -13.00 | 30.00 | ▂▂▇▇▁ |
mob_transit_stations | 225 | 0.95 | -26.74 | 22.16 | -94.00 | -38.00 | -25.00 | -12.00 | 26.00 | ▁▂▇▇▂ |
mob_workplaces | 225 | 0.95 | -26.11 | 20.04 | -92.00 | -35.00 | -22.00 | -12.00 | 29.00 | ▁▂▇▇▁ |
mob_flujo | 825 | 0.80 | 15.75 | 3.84 | 4.56 | 13.31 | 15.95 | 18.44 | 26.24 | ▁▃▇▅▁ |
The missing information corresponds to the periods for which no data have been provided. These are placed at the beginning of the dataset, so as of March 16, 2020 all the information is available up to December 29, 2021.
%>%
covid_data filter(!is.na(mob_flujo)) %>%
summarise(
initial_data_without_nas = min(fecha),
final_data_without_nas = max(fecha)
)
# A tibble: 1 × 2
initial_data_without_nas final_data_without_nas
<date> <date>
1 2020-03-16 2021-12-29
Statistics for that period:
%>%
covid_data filter(fecha >= as.Date("2020-03-16", format = "%Y-%m-%d") &
<= as.Date("2021-12-29", format = "%Y-%m-%d")) %>%
fecha skim()
Name | Piped data |
Number of rows | 3270 |
Number of columns | 26 |
_______________________ | |
Column type frequency: | |
character | 1 |
Date | 1 |
numeric | 24 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
provincia | 0 | 1 | 6 | 9 | 0 | 5 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
fecha | 0 | 1 | 2020-03-16 | 2021-12-29 | 2021-02-05 | 654 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
num_casos | 0 | 1 | 780.02 | 1607.55 | 0.00 | 89.00 | 271.00 | 923.75 | 23811.00 | ▇▁▁▁▁ |
num_casos_prueba_pcr | 0 | 1 | 491.53 | 831.31 | 0.00 | 65.25 | 179.00 | 580.00 | 10167.00 | ▇▁▁▁▁ |
num_casos_prueba_test_ac | 0 | 1 | 0.04 | 0.27 | 0.00 | 0.00 | 0.00 | 0.00 | 5.00 | ▇▁▁▁▁ |
num_casos_prueba_ag | 0 | 1 | 273.56 | 942.12 | 0.00 | 1.00 | 40.00 | 182.00 | 16226.00 | ▇▁▁▁▁ |
num_casos_prueba_elisa | 0 | 1 | 0.87 | 3.79 | 0.00 | 0.00 | 0.00 | 0.00 | 55.00 | ▇▁▁▁▁ |
num_casos_prueba_desconocida | 0 | 1 | 14.02 | 56.25 | 0.00 | 0.00 | 0.00 | 0.00 | 519.00 | ▇▁▁▁▁ |
num_hosp | 0 | 1 | 55.27 | 127.95 | 0.00 | 7.00 | 20.00 | 54.00 | 1934.00 | ▇▁▁▁▁ |
num_uci | 0 | 1 | 4.97 | 10.32 | 0.00 | 0.00 | 2.00 | 5.00 | 135.00 | ▇▁▁▁▁ |
num_def | 0 | 1 | 11.10 | 26.14 | 0.00 | 1.00 | 3.00 | 11.00 | 334.00 | ▇▁▁▁▁ |
tmed | 0 | 1 | 17.85 | 6.48 | -6.20 | 13.20 | 17.40 | 23.10 | 34.50 | ▁▂▇▆▂ |
prec | 0 | 1 | 1.78 | 6.06 | 0.00 | 0.00 | 0.00 | 0.20 | 78.50 | ▇▁▁▁▁ |
tmin | 0 | 1 | 12.89 | 6.11 | -13.40 | 8.70 | 12.90 | 17.40 | 27.50 | ▁▁▆▇▃ |
tmax | 0 | 1 | 22.80 | 7.46 | 0.30 | 17.22 | 21.70 | 28.28 | 44.90 | ▁▆▇▅▁ |
wd | 0 | 1 | 41.45 | 35.80 | 1.00 | 16.00 | 26.00 | 99.00 | 99.00 | ▆▇▁▁▅ |
ws | 0 | 1 | 3.73 | 1.62 | 0.30 | 2.50 | 3.60 | 4.40 | 13.10 | ▅▇▂▁▁ |
ws_max | 0 | 1 | 10.30 | 3.50 | 2.50 | 7.80 | 9.70 | 12.06 | 31.90 | ▅▇▂▁▁ |
sol | 0 | 1 | 7.32 | 4.20 | 0.00 | 3.80 | 8.10 | 10.90 | 14.30 | ▆▅▆▇▆ |
mob_grocery_pharmacy | 0 | 1 | -4.38 | 24.03 | -94.00 | -12.00 | -1.00 | 8.00 | 156.00 | ▁▇▃▁▁ |
mob_parks | 0 | 1 | 2.22 | 51.72 | -93.00 | -22.00 | -4.00 | 16.00 | 333.00 | ▇▇▁▁▁ |
mob_residential | 0 | 1 | 7.92 | 8.62 | -8.00 | 2.00 | 6.00 | 11.00 | 47.00 | ▅▇▁▁▁ |
mob_retail_recreation | 0 | 1 | -30.76 | 24.53 | -97.00 | -40.00 | -26.00 | -14.00 | 30.00 | ▂▂▇▆▁ |
mob_transit_stations | 0 | 1 | -29.40 | 22.41 | -94.00 | -40.00 | -28.00 | -14.00 | 26.00 | ▁▂▇▅▁ |
mob_workplaces | 0 | 1 | -28.32 | 20.09 | -92.00 | -37.00 | -25.00 | -14.00 | 29.00 | ▁▂▇▆▁ |
mob_flujo | 0 | 1 | 15.75 | 3.84 | 4.56 | 13.31 | 15.95 | 18.44 | 26.24 | ▁▃▇▅▁ |
Aditional dataset
In addition, we have the clean dataset corresponding to hospitalizations which provide detail by demographics.
We will not combine it with the previous dataset, but we will treat it additionally in case of needing more detail.
<- readRDS(here("data", "clean", "cne_hosp_data.rds"))
cne_hosp_data cne_hosp_data
# A tibble: 1,277,640 × 8
provincia_iso sexo grupo_edad fecha num_casos num_hosp num_uci num_def
<chr> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl>
1 A H 0-9 2020-01-01 0 0 0 0
2 A H 10-19 2020-01-01 0 0 0 0
3 A H 20-29 2020-01-01 0 0 0 0
4 A H 30-39 2020-01-01 0 0 0 0
5 A H 40-49 2020-01-01 0 0 0 0
6 A H 50-59 2020-01-01 0 0 0 0
7 A H 60-69 2020-01-01 0 0 0 0
8 A H 70-79 2020-01-01 0 0 0 0
9 A H 80+ 2020-01-01 0 0 0 0
10 A H NC 2020-01-01 0 0 0 0
# … with 1,277,630 more rows
<- cne_hosp_data %>%
cne_hosp_data filter(provincia_iso %in% c("O", "B", "M", "MA", "SE")) %>%
mutate(
provincia_iso = case_when(
== "O" ~ "Asturias",
provincia_iso == "B" ~ "Barcelona",
provincia_iso == "M" ~ "Madrid",
provincia_iso == "MA" ~ "Málaga",
provincia_iso == "SE" ~ "Sevilla",
provincia_iso TRUE ~ provincia_iso
)%>%
) rename(provincia = provincia_iso)
cne_hosp_data
# A tibble: 122,850 × 8
provincia sexo grupo_edad fecha num_casos num_hosp num_uci num_def
<chr> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl>
1 Barcelona H 0-9 2020-01-01 0 0 0 0
2 Barcelona H 10-19 2020-01-01 0 0 0 0
3 Barcelona H 20-29 2020-01-01 0 0 0 0
4 Barcelona H 30-39 2020-01-01 0 0 0 0
5 Barcelona H 40-49 2020-01-01 0 0 0 0
6 Barcelona H 50-59 2020-01-01 0 0 0 0
7 Barcelona H 60-69 2020-01-01 0 0 0 0
8 Barcelona H 70-79 2020-01-01 0 0 0 0
9 Barcelona H 80+ 2020-01-01 0 0 0 0
10 Barcelona H NC 2020-01-01 0 0 0 0
# … with 122,840 more rows
skim(cne_hosp_data)
Name | cne_hosp_data |
Number of rows | 122850 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
character | 3 |
Date | 1 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
provincia | 0 | 1 | 6 | 9 | 0 | 5 | 0 |
sexo | 0 | 1 | 1 | 2 | 0 | 3 | 0 |
grupo_edad | 0 | 1 | 2 | 5 | 0 | 10 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
fecha | 0 | 1 | 2020-01-01 | 2022-03-29 | 2021-02-13 | 819 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
num_casos | 0 | 1 | 33.80 | 137.91 | 0 | 0 | 0 | 18 | 3741 | ▇▁▁▁▁ |
num_hosp | 0 | 1 | 1.71 | 7.27 | 0 | 0 | 0 | 1 | 270 | ▇▁▁▁▁ |
num_uci | 0 | 1 | 0.15 | 0.79 | 0 | 0 | 0 | 0 | 35 | ▇▁▁▁▁ |
num_def | 0 | 1 | 0.33 | 2.27 | 0 | 0 | 0 | 0 | 100 | ▇▁▁▁▁ |
Note
Final clean data are stored in the project’s folder “/data/clean/” both in rds and csv format.
- final_covid_data.csv (CNE+AEMET+GOOGLE+INE data)
- final_hosp_data.csv (CNE data with demographic detail)