Task 2: Datasets combination

Introduction

During the first task, data loading and cleaning has been performed. For each of the datasets, the final files were stored in the “/data/clean/” project folder by saving them as a RDS file, which creates a serialized version of the dataset and then saves it with gzip.

By doing this we avoid problems in further reading, since the dataset will be loaded as an R object exactly as when the cleaning of the dataset were finished.

First, we need to load packages:

pacman::p_load(
      here,      # file locator
      tidyverse, # data management and ggplot2 graphics
      skimr,     # get overview of data
      janitor    # produce and adorn tabulations and cross-tabulations
)

CNE + AEMET

Import clean datasets:

cne_clean_data <- readRDS(here("data", "clean", "cne_data.rds"))
cne_clean_data

# A tibble: 42,588 × 11
   provincia_iso fecha      num_casos num_casos_prueba_pcr num_casos_prueba_tes…
   <chr>         <date>         <dbl>                <dbl>                 <dbl>
 1 A             2020-01-01         0                    0                     0
 2 AB            2020-01-01         0                    0                     0
 3 AL            2020-01-01         0                    0                     0
 4 AV            2020-01-01         0                    0                     0
 5 B             2020-01-01         0                    0                     0
 6 BA            2020-01-01         0                    0                     0
 7 BI            2020-01-01         0                    0                     0
 8 BU            2020-01-01         0                    0                     0
 9 C             2020-01-01         0                    0                     0
10 CA            2020-01-01         0                    0                     0
# … with 42,578 more rows, and 6 more variables: num_casos_prueba_ag <dbl>,
#   num_casos_prueba_elisa <dbl>, num_casos_prueba_desconocida <dbl>,
#   num_hosp <dbl>, num_uci <dbl>, num_def <dbl>

aemet_clean_data <- readRDS(here("data", "clean", "aemet_data.rds"))
aemet_clean_data

# A tibble: 4,105 × 10
   fecha      provincia  tmed  prec  tmin  tmax    wd    ws ws_max   sol
   <date>     <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
 1 2020-01-01 Asturias    7.4   0     3.8  11      99   2.5    7.8   7.9
 2 2020-01-02 Asturias    7.3   0     3.2  11.4    22   2.8   10.3   2.8
 3 2020-01-03 Asturias   10.7   0.7   7.4  14      99   2.8    9.7   1.9
 4 2020-01-04 Asturias    8.4   2.4   5.6  11.1    99   3.6    7.2   0.7
 5 2020-01-05 Asturias    8.2   0     4.7  11.8    13   2.5    7.2   8.6
 6 2020-01-06 Asturias    7.6   0     2.8  12.3    22   3.1    8.9   6.3
 7 2020-01-07 Asturias    8.8   0     4.3  13.3    99   1.9    8.9   8  
 8 2020-01-08 Asturias   13.4   0     9    17.7    22   2.8    8.3   6.1
 9 2020-01-09 Asturias   14.4   5.5   8.7  20.2    29   7.8   24.2   4.3
10 2020-01-10 Asturias    8.2   0.6   4.9  11.6    28   2.2   16.1   1.9
# … with 4,095 more rows

In the following analyses we will focus on the provinces:

Asturias
Barcelona
Madrid
Málaga
Sevilla

Meteorological data from AEMET already corresponds to those provinces so we will proceed to filter the data from CNE.

# Ref iso code provinces https://es.wikipedia.org/wiki/ISO_3166-2:ES
cne_clean_data <- cne_clean_data %>% 
      filter(provincia_iso %in% c("O", "B", "M", "MA", "SE")) %>% 
      mutate(
            provincia_iso = case_when(
                  provincia_iso == "O" ~ "Asturias",
                  provincia_iso == "B" ~ "Barcelona",
                  provincia_iso == "M" ~ "Madrid",
                  provincia_iso == "MA" ~ "Málaga",
                  provincia_iso == "SE" ~ "Sevilla",
                  TRUE ~ provincia_iso
            )
      ) %>% 
      rename(provincia = provincia_iso)
cne_clean_data

# A tibble: 4,095 × 11
   provincia fecha      num_casos num_casos_prueba_pcr num_casos_prueba_test_ac
   <chr>     <date>         <dbl>                <dbl>                    <dbl>
 1 Barcelona 2020-01-01         0                    0                        0
 2 Madrid    2020-01-01         1                    1                        0
 3 Málaga    2020-01-01         0                    0                        0
 4 Asturias  2020-01-01         0                    0                        0
 5 Sevilla   2020-01-01         0                    0                        0
 6 Barcelona 2020-01-02         0                    0                        0
 7 Madrid    2020-01-02         0                    0                        0
 8 Málaga    2020-01-02         0                    0                        0
 9 Asturias  2020-01-02         0                    0                        0
10 Sevilla   2020-01-02         0                    0                        0
# … with 4,085 more rows, and 6 more variables: num_casos_prueba_ag <dbl>,
#   num_casos_prueba_elisa <dbl>, num_casos_prueba_desconocida <dbl>,
#   num_hosp <dbl>, num_uci <dbl>, num_def <dbl>

covid_data <- left_join(cne_clean_data, aemet_clean_data, by=c("fecha", "provincia"))
covid_data

# A tibble: 4,095 × 19
   provincia fecha      num_casos num_casos_prueba_pcr num_casos_prueba_test_ac
   <chr>     <date>         <dbl>                <dbl>                    <dbl>
 1 Barcelona 2020-01-01         0                    0                        0
 2 Madrid    2020-01-01         1                    1                        0
 3 Málaga    2020-01-01         0                    0                        0
 4 Asturias  2020-01-01         0                    0                        0
 5 Sevilla   2020-01-01         0                    0                        0
 6 Barcelona 2020-01-02         0                    0                        0
 7 Madrid    2020-01-02         0                    0                        0
 8 Málaga    2020-01-02         0                    0                        0
 9 Asturias  2020-01-02         0                    0                        0
10 Sevilla   2020-01-02         0                    0                        0
# … with 4,085 more rows, and 14 more variables: num_casos_prueba_ag <dbl>,
#   num_casos_prueba_elisa <dbl>, num_casos_prueba_desconocida <dbl>,
#   num_hosp <dbl>, num_uci <dbl>, num_def <dbl>, tmed <dbl>, prec <dbl>,
#   tmin <dbl>, tmax <dbl>, wd <dbl>, ws <dbl>, ws_max <dbl>, sol <dbl>

Check if missing information:

skim(covid_data)

Data summary
Name	covid_data
Number of rows	4095
Number of columns	19
_______________________
Column type frequency:
character	1
Date	1
numeric	17
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
provincia	0	1	6	9	0	5	0

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
fecha	0	1	2020-01-01	2022-03-29	2021-02-13	819

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
num_casos	1	1014.00	2598.89	0.0	64.0	280.0	984.00	34701.0	▇▁▁▁▁
num_casos_prueba_pcr	1	495.68	854.53	0.0	50.0	176.0	565.50	10167.0	▇▁▁▁▁
num_casos_prueba_test_ac	1	0.03	0.25	0.0	0.0	0.0	0.00	5.0	▇▁▁▁▁
num_casos_prueba_ag	1	504.14	2018.32	0.0	0.0	43.0	240.00	31073.0	▇▁▁▁▁
num_casos_prueba_elisa	1	0.71	3.42	0.0	0.0	0.0	0.00	55.0	▇▁▁▁▁
num_casos_prueba_desconocida	1	13.44	53.64	0.0	0.0	0.0	0.00	519.0	▇▁▁▁▁
num_hosp	1	51.28	119.70	0.0	6.0	18.0	48.00	1934.0	▇▁▁▁▁
num_uci	1	4.45	9.74	0.0	0.0	1.0	4.00	135.0	▇▁▁▁▁
num_def	1	9.78	23.79	0.0	0.0	3.0	10.00	334.0	▇▁▁▁▁
tmed	1	16.63	6.45	-6.2	12.0	15.6	21.20	34.5	▁▂▇▅▂
prec	1	1.75	6.14	0.0	0.0	0.0	0.20	87.9	▇▁▁▁▁
tmin	1	11.69	6.23	-13.4	7.4	11.3	16.10	27.5	▁▁▇▇▂
tmax	1	21.57	7.30	0.3	16.2	20.3	26.40	44.9	▁▇▇▃▁
wd	1	41.76	36.25	1.0	15.0	26.0	99.00	99.0	▆▇▁▁▆
ws	1	3.72	1.73	0.3	2.5	3.6	4.40	18.9	▇▅▁▁▁
ws_max	1	10.24	3.61	2.5	7.8	9.7	12.05	31.9	▅▇▂▁▁
sol	1	6.98	4.12	0.0	3.5	7.7	10.30	14.3	▇▅▆▇▅

CNE/AEMET + GOOGLE

Load clean data:

google_data <- readRDS(here("data", "clean", "google_data.rds"))

Rename provinces to match previous dataset:

google_data <- google_data %>% 
      filter(province %in% c("Asturias", "Barcelona", "Madrid", "Málaga", "Seville")) %>% 
      mutate(province = ifelse(province == "Seville", "Sevilla", province)) %>% 
      select(-CA, -iso_3166_2_code, provincia = province)
google_data

# A tibble: 4,025 × 8
# Groups:   provincia [5]
   provincia fecha      mob_grocery_pharmacy mob_parks mob_residential
   <chr>     <date>                    <dbl>     <dbl>           <dbl>
 1 Asturias  2020-02-15                   -1        20              -1
 2 Asturias  2020-02-16                    1        11              -2
 3 Asturias  2020-02-17                    0       -13              -1
 4 Asturias  2020-02-18                    1        11              -1
 5 Asturias  2020-02-19                    1        29              -2
 6 Asturias  2020-02-20                    0        32              -3
 7 Asturias  2020-02-21                   -1        18               0
 8 Asturias  2020-02-22                    1        29              -1
 9 Asturias  2020-02-23                    6        23              -3
10 Asturias  2020-02-24                    5        48              -1
# … with 4,015 more rows, and 3 more variables: mob_retail_recreation <dbl>,
#   mob_transit_stations <dbl>, mob_workplaces <dbl>

Statistics for Google data

skim(google_data)

Data summary
Name	google_data
Number of rows	4025
Number of columns	8
_______________________
Column type frequency:
Date	1
numeric	6
________________________
Group variables	provincia

Variable type: Date

skim_variable	provincia	complete_rate	min	max	median	n_unique
fecha	Asturias	1	2020-02-15	2022-04-29	2021-03-23	805
fecha	Barcelona	1	2020-02-15	2022-04-29	2021-03-23	805
fecha	Madrid	1	2020-02-15	2022-04-29	2021-03-23	805
fecha	Málaga	1	2020-02-15	2022-04-29	2021-03-23	805
fecha	Sevilla	1	2020-02-15	2022-04-29	2021-03-23	805

Variable type: numeric

skim_variable	provincia	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
mob_grocery_pharmacy	Asturias	1	-0.76	19.96	-94	-3	3	9	42	▁▁▁▇▁
mob_grocery_pharmacy	Barcelona	1	-4.14	20.77	-91	-10	0	9	58	▁▁▆▇▁
mob_grocery_pharmacy	Madrid	1	-6.40	19.37	-87	-13	-3	6	50	▁▁▆▇▁
mob_grocery_pharmacy	Málaga	1	2.88	30.53	-91	-8	4	14	156	▁▇▆▁▁
mob_grocery_pharmacy	Sevilla	1	-4.90	22.74	-91	-12	-1	7	117	▁▃▇▁▁
mob_parks	Asturias	1	45.19	71.02	-88	5	33	70	333	▂▇▂▁▁
mob_parks	Barcelona	1	-10.48	23.79	-93	-19	-6	5	33	▁▁▃▇▃
mob_parks	Madrid	1	-9.83	28.19	-92	-22	-6	10	77	▁▂▇▃▁
mob_parks	Málaga	1	6.13	43.88	-92	-18	2	27	152	▂▇▆▂▁
mob_parks	Sevilla	1	-17.95	25.70	-93	-30	-13	-3	95	▁▅▇▁▁
mob_residential	Asturias	1	5.34	7.44	-8	1	3	8	40	▆▇▁▁▁
mob_residential	Barcelona	1	9.00	8.45	-7	4	7	11	47	▃▇▁▁▁
mob_residential	Madrid	1	8.95	9.18	-10	4	7	11	46	▂▇▂▁▁
mob_residential	Málaga	1	6.22	7.37	-4	2	4	8	40	▇▆▁▁▁
mob_residential	Sevilla	1	5.59	7.59	-7	1	4	8	42	▆▇▁▁▁
mob_retail_recreation	Asturias	1	-26.49	24.13	-97	-34	-21	-11	15	▁▁▃▇▃
mob_retail_recreation	Barcelona	1	-32.91	21.91	-97	-43	-27	-17	3	▁▁▃▇▅
mob_retail_recreation	Madrid	1	-32.62	21.35	-96	-38	-28	-19	4	▁▁▃▇▂
mob_retail_recreation	Málaga	1	-21.14	26.07	-97	-31	-16	-4	30	▁▁▅▇▂
mob_retail_recreation	Sevilla	1	-26.91	22.30	-97	-35	-23	-11	23	▁▁▆▇▁
mob_transit_stations	Asturias	1	-17.30	20.87	-85	-26	-14	-4	26	▁▁▅▇▂
mob_transit_stations	Barcelona	1	-27.65	18.95	-91	-35	-25	-15	10	▁▁▃▇▂
mob_transit_stations	Madrid	1	-33.40	19.63	-93	-42	-32	-19	10	▁▁▇▇▁
mob_transit_stations	Málaga	1	-22.89	25.52	-93	-37	-22	-3	33	▂▂▇▆▂
mob_transit_stations	Sevilla	1	-28.81	21.99	-94	-40	-27	-14	52	▁▃▇▂▁
mob_workplaces	Asturias	1	-23.33	17.05	-88	-28	-19	-13	12	▁▁▂▇▁
mob_workplaces	Barcelona	1	-30.01	19.76	-92	-39	-26	-17	6	▁▂▃▇▃
mob_workplaces	Madrid	1	-30.44	22.60	-92	-45	-27	-15	20	▂▃▇▇▂
mob_workplaces	Málaga	1	-21.63	19.72	-89	-29	-18	-10	29	▁▁▆▇▁
mob_workplaces	Sevilla	1	-23.28	19.04	-90	-32	-17	-10	9	▁▁▃▇▅

Since everything is right, we will combine the data with the already combined dataset from CNE and AEMET:

covid_data <- left_join(covid_data, google_data, by=c("fecha", "provincia"))
covid_data

# A tibble: 4,095 × 25
   provincia fecha      num_casos num_casos_prueba_pcr num_casos_prueba_test_ac
   <chr>     <date>         <dbl>                <dbl>                    <dbl>
 1 Barcelona 2020-01-01         0                    0                        0
 2 Madrid    2020-01-01         1                    1                        0
 3 Málaga    2020-01-01         0                    0                        0
 4 Asturias  2020-01-01         0                    0                        0
 5 Sevilla   2020-01-01         0                    0                        0
 6 Barcelona 2020-01-02         0                    0                        0
 7 Madrid    2020-01-02         0                    0                        0
 8 Málaga    2020-01-02         0                    0                        0
 9 Asturias  2020-01-02         0                    0                        0
10 Sevilla   2020-01-02         0                    0                        0
# … with 4,085 more rows, and 20 more variables: num_casos_prueba_ag <dbl>,
#   num_casos_prueba_elisa <dbl>, num_casos_prueba_desconocida <dbl>,
#   num_hosp <dbl>, num_uci <dbl>, num_def <dbl>, tmed <dbl>, prec <dbl>,
#   tmin <dbl>, tmax <dbl>, wd <dbl>, ws <dbl>, ws_max <dbl>, sol <dbl>,
#   mob_grocery_pharmacy <dbl>, mob_parks <dbl>, mob_residential <dbl>,
#   mob_retail_recreation <dbl>, mob_transit_stations <dbl>,
#   mob_workplaces <dbl>

It is important to note that Google provides data as of February 15nd. Before that day, there is not information related to Google´s mobility.

CNE/AEMET/GOOGLE + INE

Load clean INE data:

ine_data <- readRDS(here("data", "clean", "ine_data.rds"))

Rename provinces to match previous datasets:

ine_data <- ine_data %>% 
      filter(province %in% c("Asturias", "Barcelona", "Madrid", "Málaga", "Sevilla")) %>% 
      rename(
            provincia = province,
            mob_flujo = flujo
      )
ine_data

# A tibble: 3,270 × 3
   fecha      provincia mob_flujo
   <date>     <chr>         <dbl>
 1 2020-03-16 Málaga         11.4
 2 2020-03-16 Sevilla        12.3
 3 2020-03-16 Asturias       13.1
 4 2020-03-16 Barcelona      14.6
 5 2020-03-16 Madrid         13.9
 6 2020-03-17 Málaga         10.7
 7 2020-03-17 Sevilla        11.5
 8 2020-03-17 Asturias       12.5
 9 2020-03-17 Barcelona      14.0
10 2020-03-17 Madrid         13.4
# … with 3,260 more rows

Combination with previous datasets:

covid_data <- left_join(covid_data, ine_data, by=c("fecha", "provincia"))
covid_data

# A tibble: 4,095 × 26
   provincia fecha      num_casos num_casos_prueba_pcr num_casos_prueba_test_ac
   <chr>     <date>         <dbl>                <dbl>                    <dbl>
 1 Barcelona 2020-01-01         0                    0                        0
 2 Madrid    2020-01-01         1                    1                        0
 3 Málaga    2020-01-01         0                    0                        0
 4 Asturias  2020-01-01         0                    0                        0
 5 Sevilla   2020-01-01         0                    0                        0
 6 Barcelona 2020-01-02         0                    0                        0
 7 Madrid    2020-01-02         0                    0                        0
 8 Málaga    2020-01-02         0                    0                        0
 9 Asturias  2020-01-02         0                    0                        0
10 Sevilla   2020-01-02         0                    0                        0
# … with 4,085 more rows, and 21 more variables: num_casos_prueba_ag <dbl>,
#   num_casos_prueba_elisa <dbl>, num_casos_prueba_desconocida <dbl>,
#   num_hosp <dbl>, num_uci <dbl>, num_def <dbl>, tmed <dbl>, prec <dbl>,
#   tmin <dbl>, tmax <dbl>, wd <dbl>, ws <dbl>, ws_max <dbl>, sol <dbl>,
#   mob_grocery_pharmacy <dbl>, mob_parks <dbl>, mob_residential <dbl>,
#   mob_retail_recreation <dbl>, mob_transit_stations <dbl>,
#   mob_workplaces <dbl>, mob_flujo <dbl>

Mobility information published by INE is available as of March 16th which we will consider the beginning of our analysis.

Final covid dataset

The statistics of the final dataset looks like:

skim(covid_data)

Data summary
Name	covid_data
Number of rows	4095
Number of columns	26
_______________________
Column type frequency:
character	1
Date	1
numeric	24
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
provincia	0	1	6	9	0	5	0

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
fecha	0	1	2020-01-01	2022-03-29	2021-02-13	819

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
num_casos	0	1.00	1014.00	2598.89	0.00	64.00	280.00	984.00	34701.00	▇▁▁▁▁
num_casos_prueba_pcr	0	1.00	495.68	854.53	0.00	50.00	176.00	565.50	10167.00	▇▁▁▁▁
num_casos_prueba_test_ac	0	1.00	0.03	0.25	0.00	0.00	0.00	0.00	5.00	▇▁▁▁▁
num_casos_prueba_ag	0	1.00	504.14	2018.32	0.00	0.00	43.00	240.00	31073.00	▇▁▁▁▁
num_casos_prueba_elisa	0	1.00	0.71	3.42	0.00	0.00	0.00	0.00	55.00	▇▁▁▁▁
num_casos_prueba_desconocida	0	1.00	13.44	53.64	0.00	0.00	0.00	0.00	519.00	▇▁▁▁▁
num_hosp	0	1.00	51.28	119.70	0.00	6.00	18.00	48.00	1934.00	▇▁▁▁▁
num_uci	0	1.00	4.45	9.74	0.00	0.00	1.00	4.00	135.00	▇▁▁▁▁
num_def	0	1.00	9.78	23.79	0.00	0.00	3.00	10.00	334.00	▇▁▁▁▁
tmed	0	1.00	16.63	6.45	-6.20	12.00	15.60	21.20	34.50	▁▂▇▅▂
prec	0	1.00	1.75	6.14	0.00	0.00	0.00	0.20	87.90	▇▁▁▁▁
tmin	0	1.00	11.69	6.23	-13.40	7.40	11.30	16.10	27.50	▁▁▇▇▂
tmax	0	1.00	21.57	7.30	0.30	16.20	20.30	26.40	44.90	▁▇▇▃▁
wd	0	1.00	41.76	36.25	1.00	15.00	26.00	99.00	99.00	▆▇▁▁▆
ws	0	1.00	3.72	1.73	0.30	2.50	3.60	4.40	18.90	▇▅▁▁▁
ws_max	0	1.00	10.24	3.61	2.50	7.80	9.70	12.05	31.90	▅▇▂▁▁
sol	0	1.00	6.98	4.12	0.00	3.50	7.70	10.30	14.30	▇▅▆▇▅
mob_grocery_pharmacy	225	0.95	-2.89	23.44	-94.00	-10.00	0.00	9.00	156.00	▁▇▅▁▁
mob_parks	225	0.95	2.16	48.40	-93.00	-20.00	-3.00	15.75	333.00	▇▇▁▁▁
mob_residential	225	0.95	7.24	8.26	-8.00	2.00	5.00	10.00	47.00	▆▇▁▁▁
mob_retail_recreation	225	0.95	-28.46	23.87	-97.00	-38.00	-24.00	-13.00	30.00	▂▂▇▇▁
mob_transit_stations	225	0.95	-26.74	22.16	-94.00	-38.00	-25.00	-12.00	26.00	▁▂▇▇▂
mob_workplaces	225	0.95	-26.11	20.04	-92.00	-35.00	-22.00	-12.00	29.00	▁▂▇▇▁
mob_flujo	825	0.80	15.75	3.84	4.56	13.31	15.95	18.44	26.24	▁▃▇▅▁

The missing information corresponds to the periods for which no data have been provided. These are placed at the beginning of the dataset, so as of March 16, 2020 all the information is available up to December 29, 2021.

covid_data %>% 
      filter(!is.na(mob_flujo)) %>% 
      summarise(
            initial_data_without_nas = min(fecha),
            final_data_without_nas = max(fecha)
      )

# A tibble: 1 × 2
  initial_data_without_nas final_data_without_nas
  <date>                   <date>                
1 2020-03-16               2021-12-29

Statistics for that period:

covid_data %>% 
      filter(fecha >= as.Date("2020-03-16", format = "%Y-%m-%d") &
                   fecha <= as.Date("2021-12-29", format = "%Y-%m-%d")) %>% 
      skim()

Data summary
Name	Piped data
Number of rows	3270
Number of columns	26
_______________________
Column type frequency:
character	1
Date	1
numeric	24
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
provincia	0	1	6	9	0	5	0

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
fecha	0	1	2020-03-16	2021-12-29	2021-02-05	654

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
num_casos	1	780.02	1607.55	0.00	89.00	271.00	923.75	23811.00	▇▁▁▁▁
num_casos_prueba_pcr	1	491.53	831.31	0.00	65.25	179.00	580.00	10167.00	▇▁▁▁▁
num_casos_prueba_test_ac	1	0.04	0.27	0.00	0.00	0.00	0.00	5.00	▇▁▁▁▁
num_casos_prueba_ag	1	273.56	942.12	0.00	1.00	40.00	182.00	16226.00	▇▁▁▁▁
num_casos_prueba_elisa	1	0.87	3.79	0.00	0.00	0.00	0.00	55.00	▇▁▁▁▁
num_casos_prueba_desconocida	1	14.02	56.25	0.00	0.00	0.00	0.00	519.00	▇▁▁▁▁
num_hosp	1	55.27	127.95	0.00	7.00	20.00	54.00	1934.00	▇▁▁▁▁
num_uci	1	4.97	10.32	0.00	0.00	2.00	5.00	135.00	▇▁▁▁▁
num_def	1	11.10	26.14	0.00	1.00	3.00	11.00	334.00	▇▁▁▁▁
tmed	1	17.85	6.48	-6.20	13.20	17.40	23.10	34.50	▁▂▇▆▂
prec	1	1.78	6.06	0.00	0.00	0.00	0.20	78.50	▇▁▁▁▁
tmin	1	12.89	6.11	-13.40	8.70	12.90	17.40	27.50	▁▁▆▇▃
tmax	1	22.80	7.46	0.30	17.22	21.70	28.28	44.90	▁▆▇▅▁
wd	1	41.45	35.80	1.00	16.00	26.00	99.00	99.00	▆▇▁▁▅
ws	1	3.73	1.62	0.30	2.50	3.60	4.40	13.10	▅▇▂▁▁
ws_max	1	10.30	3.50	2.50	7.80	9.70	12.06	31.90	▅▇▂▁▁
sol	1	7.32	4.20	0.00	3.80	8.10	10.90	14.30	▆▅▆▇▆
mob_grocery_pharmacy	1	-4.38	24.03	-94.00	-12.00	-1.00	8.00	156.00	▁▇▃▁▁
mob_parks	1	2.22	51.72	-93.00	-22.00	-4.00	16.00	333.00	▇▇▁▁▁
mob_residential	1	7.92	8.62	-8.00	2.00	6.00	11.00	47.00	▅▇▁▁▁
mob_retail_recreation	1	-30.76	24.53	-97.00	-40.00	-26.00	-14.00	30.00	▂▂▇▆▁
mob_transit_stations	1	-29.40	22.41	-94.00	-40.00	-28.00	-14.00	26.00	▁▂▇▅▁
mob_workplaces	1	-28.32	20.09	-92.00	-37.00	-25.00	-14.00	29.00	▁▂▇▆▁
mob_flujo	1	15.75	3.84	4.56	13.31	15.95	18.44	26.24	▁▃▇▅▁

Aditional dataset

In addition, we have the clean dataset corresponding to hospitalizations which provide detail by demographics.

We will not combine it with the previous dataset, but we will treat it additionally in case of needing more detail.

cne_hosp_data <- readRDS(here("data", "clean", "cne_hosp_data.rds"))
cne_hosp_data

# A tibble: 1,277,640 × 8
   provincia_iso sexo  grupo_edad fecha      num_casos num_hosp num_uci num_def
   <chr>         <chr> <chr>      <date>         <dbl>    <dbl>   <dbl>   <dbl>
 1 A             H     0-9        2020-01-01         0        0       0       0
 2 A             H     10-19      2020-01-01         0        0       0       0
 3 A             H     20-29      2020-01-01         0        0       0       0
 4 A             H     30-39      2020-01-01         0        0       0       0
 5 A             H     40-49      2020-01-01         0        0       0       0
 6 A             H     50-59      2020-01-01         0        0       0       0
 7 A             H     60-69      2020-01-01         0        0       0       0
 8 A             H     70-79      2020-01-01         0        0       0       0
 9 A             H     80+        2020-01-01         0        0       0       0
10 A             H     NC         2020-01-01         0        0       0       0
# … with 1,277,630 more rows

cne_hosp_data <- cne_hosp_data %>% 
      filter(provincia_iso %in% c("O", "B", "M", "MA", "SE")) %>% 
      mutate(
            provincia_iso = case_when(
                  provincia_iso == "O" ~ "Asturias",
                  provincia_iso == "B" ~ "Barcelona",
                  provincia_iso == "M" ~ "Madrid",
                  provincia_iso == "MA" ~ "Málaga",
                  provincia_iso == "SE" ~ "Sevilla",
                  TRUE ~ provincia_iso
            )
      ) %>% 
      rename(provincia = provincia_iso)
cne_hosp_data

# A tibble: 122,850 × 8
   provincia sexo  grupo_edad fecha      num_casos num_hosp num_uci num_def
   <chr>     <chr> <chr>      <date>         <dbl>    <dbl>   <dbl>   <dbl>
 1 Barcelona H     0-9        2020-01-01         0        0       0       0
 2 Barcelona H     10-19      2020-01-01         0        0       0       0
 3 Barcelona H     20-29      2020-01-01         0        0       0       0
 4 Barcelona H     30-39      2020-01-01         0        0       0       0
 5 Barcelona H     40-49      2020-01-01         0        0       0       0
 6 Barcelona H     50-59      2020-01-01         0        0       0       0
 7 Barcelona H     60-69      2020-01-01         0        0       0       0
 8 Barcelona H     70-79      2020-01-01         0        0       0       0
 9 Barcelona H     80+        2020-01-01         0        0       0       0
10 Barcelona H     NC         2020-01-01         0        0       0       0
# … with 122,840 more rows

skim(cne_hosp_data)

Data summary
Name	cne_hosp_data
Number of rows	122850
Number of columns	8
_______________________
Column type frequency:
character	3
Date	1
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
provincia	1	6	9	5
sexo	1	1	2	3
grupo_edad	1	2	5	10

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
fecha	0	1	2020-01-01	2022-03-29	2021-02-13	819

Variable type: numeric

skim_variable	complete_rate	mean	sd	p75	p100	hist
num_casos	1	33.80	137.91	18	3741	▇▁▁▁▁
num_hosp	1	1.71	7.27	1	270	▇▁▁▁▁
num_uci	1	0.15	0.79	0	35	▇▁▁▁▁
num_def	1	0.33	2.27	0	100	▇▁▁▁▁

Note

Final clean data are stored in the project’s folder “/data/clean/” both in rds and csv format.

final_covid_data.csv (CNE+AEMET+GOOGLE+INE data)
final_hosp_data.csv (CNE data with demographic detail)