Task 2: Datasets combination

Introduction

During the first task, data loading and cleaning has been performed. For each of the datasets, the final files were stored in the “/data/clean/” project folder by saving them as a RDS file, which creates a serialized version of the dataset and then saves it with gzip.

By doing this we avoid problems in further reading, since the dataset will be loaded as an R object exactly as when the cleaning of the dataset were finished.

First, we need to load packages:

pacman::p_load(
      here,      # file locator
      tidyverse, # data management and ggplot2 graphics
      skimr,     # get overview of data
      janitor    # produce and adorn tabulations and cross-tabulations
)

CNE + AEMET

Import clean datasets:

cne_clean_data <- readRDS(here("data", "clean", "cne_data.rds"))
cne_clean_data
# A tibble: 42,588 × 11
   provincia_iso fecha      num_casos num_casos_prueba_pcr num_casos_prueba_tes…
   <chr>         <date>         <dbl>                <dbl>                 <dbl>
 1 A             2020-01-01         0                    0                     0
 2 AB            2020-01-01         0                    0                     0
 3 AL            2020-01-01         0                    0                     0
 4 AV            2020-01-01         0                    0                     0
 5 B             2020-01-01         0                    0                     0
 6 BA            2020-01-01         0                    0                     0
 7 BI            2020-01-01         0                    0                     0
 8 BU            2020-01-01         0                    0                     0
 9 C             2020-01-01         0                    0                     0
10 CA            2020-01-01         0                    0                     0
# … with 42,578 more rows, and 6 more variables: num_casos_prueba_ag <dbl>,
#   num_casos_prueba_elisa <dbl>, num_casos_prueba_desconocida <dbl>,
#   num_hosp <dbl>, num_uci <dbl>, num_def <dbl>
aemet_clean_data <- readRDS(here("data", "clean", "aemet_data.rds"))
aemet_clean_data
# A tibble: 4,105 × 10
   fecha      provincia  tmed  prec  tmin  tmax    wd    ws ws_max   sol
   <date>     <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
 1 2020-01-01 Asturias    7.4   0     3.8  11      99   2.5    7.8   7.9
 2 2020-01-02 Asturias    7.3   0     3.2  11.4    22   2.8   10.3   2.8
 3 2020-01-03 Asturias   10.7   0.7   7.4  14      99   2.8    9.7   1.9
 4 2020-01-04 Asturias    8.4   2.4   5.6  11.1    99   3.6    7.2   0.7
 5 2020-01-05 Asturias    8.2   0     4.7  11.8    13   2.5    7.2   8.6
 6 2020-01-06 Asturias    7.6   0     2.8  12.3    22   3.1    8.9   6.3
 7 2020-01-07 Asturias    8.8   0     4.3  13.3    99   1.9    8.9   8  
 8 2020-01-08 Asturias   13.4   0     9    17.7    22   2.8    8.3   6.1
 9 2020-01-09 Asturias   14.4   5.5   8.7  20.2    29   7.8   24.2   4.3
10 2020-01-10 Asturias    8.2   0.6   4.9  11.6    28   2.2   16.1   1.9
# … with 4,095 more rows

In the following analyses we will focus on the provinces:

  • Asturias
  • Barcelona
  • Madrid
  • Málaga
  • Sevilla

Meteorological data from AEMET already corresponds to those provinces so we will proceed to filter the data from CNE.

# Ref iso code provinces https://es.wikipedia.org/wiki/ISO_3166-2:ES
cne_clean_data <- cne_clean_data %>% 
      filter(provincia_iso %in% c("O", "B", "M", "MA", "SE")) %>% 
      mutate(
            provincia_iso = case_when(
                  provincia_iso == "O" ~ "Asturias",
                  provincia_iso == "B" ~ "Barcelona",
                  provincia_iso == "M" ~ "Madrid",
                  provincia_iso == "MA" ~ "Málaga",
                  provincia_iso == "SE" ~ "Sevilla",
                  TRUE ~ provincia_iso
            )
      ) %>% 
      rename(provincia = provincia_iso)
cne_clean_data
# A tibble: 4,095 × 11
   provincia fecha      num_casos num_casos_prueba_pcr num_casos_prueba_test_ac
   <chr>     <date>         <dbl>                <dbl>                    <dbl>
 1 Barcelona 2020-01-01         0                    0                        0
 2 Madrid    2020-01-01         1                    1                        0
 3 Málaga    2020-01-01         0                    0                        0
 4 Asturias  2020-01-01         0                    0                        0
 5 Sevilla   2020-01-01         0                    0                        0
 6 Barcelona 2020-01-02         0                    0                        0
 7 Madrid    2020-01-02         0                    0                        0
 8 Málaga    2020-01-02         0                    0                        0
 9 Asturias  2020-01-02         0                    0                        0
10 Sevilla   2020-01-02         0                    0                        0
# … with 4,085 more rows, and 6 more variables: num_casos_prueba_ag <dbl>,
#   num_casos_prueba_elisa <dbl>, num_casos_prueba_desconocida <dbl>,
#   num_hosp <dbl>, num_uci <dbl>, num_def <dbl>
covid_data <- left_join(cne_clean_data, aemet_clean_data, by=c("fecha", "provincia"))
covid_data
# A tibble: 4,095 × 19
   provincia fecha      num_casos num_casos_prueba_pcr num_casos_prueba_test_ac
   <chr>     <date>         <dbl>                <dbl>                    <dbl>
 1 Barcelona 2020-01-01         0                    0                        0
 2 Madrid    2020-01-01         1                    1                        0
 3 Málaga    2020-01-01         0                    0                        0
 4 Asturias  2020-01-01         0                    0                        0
 5 Sevilla   2020-01-01         0                    0                        0
 6 Barcelona 2020-01-02         0                    0                        0
 7 Madrid    2020-01-02         0                    0                        0
 8 Málaga    2020-01-02         0                    0                        0
 9 Asturias  2020-01-02         0                    0                        0
10 Sevilla   2020-01-02         0                    0                        0
# … with 4,085 more rows, and 14 more variables: num_casos_prueba_ag <dbl>,
#   num_casos_prueba_elisa <dbl>, num_casos_prueba_desconocida <dbl>,
#   num_hosp <dbl>, num_uci <dbl>, num_def <dbl>, tmed <dbl>, prec <dbl>,
#   tmin <dbl>, tmax <dbl>, wd <dbl>, ws <dbl>, ws_max <dbl>, sol <dbl>

Check if missing information:

skim(covid_data)
Data summary
Name covid_data
Number of rows 4095
Number of columns 19
_______________________
Column type frequency:
character 1
Date 1
numeric 17
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
provincia 0 1 6 9 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
fecha 0 1 2020-01-01 2022-03-29 2021-02-13 819

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
num_casos 0 1 1014.00 2598.89 0.0 64.0 280.0 984.00 34701.0 ▇▁▁▁▁
num_casos_prueba_pcr 0 1 495.68 854.53 0.0 50.0 176.0 565.50 10167.0 ▇▁▁▁▁
num_casos_prueba_test_ac 0 1 0.03 0.25 0.0 0.0 0.0 0.00 5.0 ▇▁▁▁▁
num_casos_prueba_ag 0 1 504.14 2018.32 0.0 0.0 43.0 240.00 31073.0 ▇▁▁▁▁
num_casos_prueba_elisa 0 1 0.71 3.42 0.0 0.0 0.0 0.00 55.0 ▇▁▁▁▁
num_casos_prueba_desconocida 0 1 13.44 53.64 0.0 0.0 0.0 0.00 519.0 ▇▁▁▁▁
num_hosp 0 1 51.28 119.70 0.0 6.0 18.0 48.00 1934.0 ▇▁▁▁▁
num_uci 0 1 4.45 9.74 0.0 0.0 1.0 4.00 135.0 ▇▁▁▁▁
num_def 0 1 9.78 23.79 0.0 0.0 3.0 10.00 334.0 ▇▁▁▁▁
tmed 0 1 16.63 6.45 -6.2 12.0 15.6 21.20 34.5 ▁▂▇▅▂
prec 0 1 1.75 6.14 0.0 0.0 0.0 0.20 87.9 ▇▁▁▁▁
tmin 0 1 11.69 6.23 -13.4 7.4 11.3 16.10 27.5 ▁▁▇▇▂
tmax 0 1 21.57 7.30 0.3 16.2 20.3 26.40 44.9 ▁▇▇▃▁
wd 0 1 41.76 36.25 1.0 15.0 26.0 99.00 99.0 ▆▇▁▁▆
ws 0 1 3.72 1.73 0.3 2.5 3.6 4.40 18.9 ▇▅▁▁▁
ws_max 0 1 10.24 3.61 2.5 7.8 9.7 12.05 31.9 ▅▇▂▁▁
sol 0 1 6.98 4.12 0.0 3.5 7.7 10.30 14.3 ▇▅▆▇▅

CNE/AEMET + GOOGLE

Load clean data:

google_data <- readRDS(here("data", "clean", "google_data.rds"))

Rename provinces to match previous dataset:

google_data <- google_data %>% 
      filter(province %in% c("Asturias", "Barcelona", "Madrid", "Málaga", "Seville")) %>% 
      mutate(province = ifelse(province == "Seville", "Sevilla", province)) %>% 
      select(-CA, -iso_3166_2_code, provincia = province)
google_data
# A tibble: 4,025 × 8
# Groups:   provincia [5]
   provincia fecha      mob_grocery_pharmacy mob_parks mob_residential
   <chr>     <date>                    <dbl>     <dbl>           <dbl>
 1 Asturias  2020-02-15                   -1        20              -1
 2 Asturias  2020-02-16                    1        11              -2
 3 Asturias  2020-02-17                    0       -13              -1
 4 Asturias  2020-02-18                    1        11              -1
 5 Asturias  2020-02-19                    1        29              -2
 6 Asturias  2020-02-20                    0        32              -3
 7 Asturias  2020-02-21                   -1        18               0
 8 Asturias  2020-02-22                    1        29              -1
 9 Asturias  2020-02-23                    6        23              -3
10 Asturias  2020-02-24                    5        48              -1
# … with 4,015 more rows, and 3 more variables: mob_retail_recreation <dbl>,
#   mob_transit_stations <dbl>, mob_workplaces <dbl>

Statistics for Google data

skim(google_data)
Data summary
Name google_data
Number of rows 4025
Number of columns 8
_______________________
Column type frequency:
Date 1
numeric 6
________________________
Group variables provincia

Variable type: Date

skim_variable provincia n_missing complete_rate min max median n_unique
fecha Asturias 0 1 2020-02-15 2022-04-29 2021-03-23 805
fecha Barcelona 0 1 2020-02-15 2022-04-29 2021-03-23 805
fecha Madrid 0 1 2020-02-15 2022-04-29 2021-03-23 805
fecha Málaga 0 1 2020-02-15 2022-04-29 2021-03-23 805
fecha Sevilla 0 1 2020-02-15 2022-04-29 2021-03-23 805

Variable type: numeric

skim_variable provincia n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mob_grocery_pharmacy Asturias 0 1 -0.76 19.96 -94 -3 3 9 42 ▁▁▁▇▁
mob_grocery_pharmacy Barcelona 0 1 -4.14 20.77 -91 -10 0 9 58 ▁▁▆▇▁
mob_grocery_pharmacy Madrid 0 1 -6.40 19.37 -87 -13 -3 6 50 ▁▁▆▇▁
mob_grocery_pharmacy Málaga 0 1 2.88 30.53 -91 -8 4 14 156 ▁▇▆▁▁
mob_grocery_pharmacy Sevilla 0 1 -4.90 22.74 -91 -12 -1 7 117 ▁▃▇▁▁
mob_parks Asturias 0 1 45.19 71.02 -88 5 33 70 333 ▂▇▂▁▁
mob_parks Barcelona 0 1 -10.48 23.79 -93 -19 -6 5 33 ▁▁▃▇▃
mob_parks Madrid 0 1 -9.83 28.19 -92 -22 -6 10 77 ▁▂▇▃▁
mob_parks Málaga 0 1 6.13 43.88 -92 -18 2 27 152 ▂▇▆▂▁
mob_parks Sevilla 0 1 -17.95 25.70 -93 -30 -13 -3 95 ▁▅▇▁▁
mob_residential Asturias 0 1 5.34 7.44 -8 1 3 8 40 ▆▇▁▁▁
mob_residential Barcelona 0 1 9.00 8.45 -7 4 7 11 47 ▃▇▁▁▁
mob_residential Madrid 0 1 8.95 9.18 -10 4 7 11 46 ▂▇▂▁▁
mob_residential Málaga 0 1 6.22 7.37 -4 2 4 8 40 ▇▆▁▁▁
mob_residential Sevilla 0 1 5.59 7.59 -7 1 4 8 42 ▆▇▁▁▁
mob_retail_recreation Asturias 0 1 -26.49 24.13 -97 -34 -21 -11 15 ▁▁▃▇▃
mob_retail_recreation Barcelona 0 1 -32.91 21.91 -97 -43 -27 -17 3 ▁▁▃▇▅
mob_retail_recreation Madrid 0 1 -32.62 21.35 -96 -38 -28 -19 4 ▁▁▃▇▂
mob_retail_recreation Málaga 0 1 -21.14 26.07 -97 -31 -16 -4 30 ▁▁▅▇▂
mob_retail_recreation Sevilla 0 1 -26.91 22.30 -97 -35 -23 -11 23 ▁▁▆▇▁
mob_transit_stations Asturias 0 1 -17.30 20.87 -85 -26 -14 -4 26 ▁▁▅▇▂
mob_transit_stations Barcelona 0 1 -27.65 18.95 -91 -35 -25 -15 10 ▁▁▃▇▂
mob_transit_stations Madrid 0 1 -33.40 19.63 -93 -42 -32 -19 10 ▁▁▇▇▁
mob_transit_stations Málaga 0 1 -22.89 25.52 -93 -37 -22 -3 33 ▂▂▇▆▂
mob_transit_stations Sevilla 0 1 -28.81 21.99 -94 -40 -27 -14 52 ▁▃▇▂▁
mob_workplaces Asturias 0 1 -23.33 17.05 -88 -28 -19 -13 12 ▁▁▂▇▁
mob_workplaces Barcelona 0 1 -30.01 19.76 -92 -39 -26 -17 6 ▁▂▃▇▃
mob_workplaces Madrid 0 1 -30.44 22.60 -92 -45 -27 -15 20 ▂▃▇▇▂
mob_workplaces Málaga 0 1 -21.63 19.72 -89 -29 -18 -10 29 ▁▁▆▇▁
mob_workplaces Sevilla 0 1 -23.28 19.04 -90 -32 -17 -10 9 ▁▁▃▇▅

Since everything is right, we will combine the data with the already combined dataset from CNE and AEMET:

covid_data <- left_join(covid_data, google_data, by=c("fecha", "provincia"))
covid_data
# A tibble: 4,095 × 25
   provincia fecha      num_casos num_casos_prueba_pcr num_casos_prueba_test_ac
   <chr>     <date>         <dbl>                <dbl>                    <dbl>
 1 Barcelona 2020-01-01         0                    0                        0
 2 Madrid    2020-01-01         1                    1                        0
 3 Málaga    2020-01-01         0                    0                        0
 4 Asturias  2020-01-01         0                    0                        0
 5 Sevilla   2020-01-01         0                    0                        0
 6 Barcelona 2020-01-02         0                    0                        0
 7 Madrid    2020-01-02         0                    0                        0
 8 Málaga    2020-01-02         0                    0                        0
 9 Asturias  2020-01-02         0                    0                        0
10 Sevilla   2020-01-02         0                    0                        0
# … with 4,085 more rows, and 20 more variables: num_casos_prueba_ag <dbl>,
#   num_casos_prueba_elisa <dbl>, num_casos_prueba_desconocida <dbl>,
#   num_hosp <dbl>, num_uci <dbl>, num_def <dbl>, tmed <dbl>, prec <dbl>,
#   tmin <dbl>, tmax <dbl>, wd <dbl>, ws <dbl>, ws_max <dbl>, sol <dbl>,
#   mob_grocery_pharmacy <dbl>, mob_parks <dbl>, mob_residential <dbl>,
#   mob_retail_recreation <dbl>, mob_transit_stations <dbl>,
#   mob_workplaces <dbl>

It is important to note that Google provides data as of February 15nd. Before that day, there is not information related to Google´s mobility.

CNE/AEMET/GOOGLE + INE

Load clean INE data:

ine_data <- readRDS(here("data", "clean", "ine_data.rds"))

Rename provinces to match previous datasets:

ine_data <- ine_data %>% 
      filter(province %in% c("Asturias", "Barcelona", "Madrid", "Málaga", "Sevilla")) %>% 
      rename(
            provincia = province,
            mob_flujo = flujo
      )
ine_data
# A tibble: 3,270 × 3
   fecha      provincia mob_flujo
   <date>     <chr>         <dbl>
 1 2020-03-16 Málaga         11.4
 2 2020-03-16 Sevilla        12.3
 3 2020-03-16 Asturias       13.1
 4 2020-03-16 Barcelona      14.6
 5 2020-03-16 Madrid         13.9
 6 2020-03-17 Málaga         10.7
 7 2020-03-17 Sevilla        11.5
 8 2020-03-17 Asturias       12.5
 9 2020-03-17 Barcelona      14.0
10 2020-03-17 Madrid         13.4
# … with 3,260 more rows

Combination with previous datasets:

covid_data <- left_join(covid_data, ine_data, by=c("fecha", "provincia"))
covid_data
# A tibble: 4,095 × 26
   provincia fecha      num_casos num_casos_prueba_pcr num_casos_prueba_test_ac
   <chr>     <date>         <dbl>                <dbl>                    <dbl>
 1 Barcelona 2020-01-01         0                    0                        0
 2 Madrid    2020-01-01         1                    1                        0
 3 Málaga    2020-01-01         0                    0                        0
 4 Asturias  2020-01-01         0                    0                        0
 5 Sevilla   2020-01-01         0                    0                        0
 6 Barcelona 2020-01-02         0                    0                        0
 7 Madrid    2020-01-02         0                    0                        0
 8 Málaga    2020-01-02         0                    0                        0
 9 Asturias  2020-01-02         0                    0                        0
10 Sevilla   2020-01-02         0                    0                        0
# … with 4,085 more rows, and 21 more variables: num_casos_prueba_ag <dbl>,
#   num_casos_prueba_elisa <dbl>, num_casos_prueba_desconocida <dbl>,
#   num_hosp <dbl>, num_uci <dbl>, num_def <dbl>, tmed <dbl>, prec <dbl>,
#   tmin <dbl>, tmax <dbl>, wd <dbl>, ws <dbl>, ws_max <dbl>, sol <dbl>,
#   mob_grocery_pharmacy <dbl>, mob_parks <dbl>, mob_residential <dbl>,
#   mob_retail_recreation <dbl>, mob_transit_stations <dbl>,
#   mob_workplaces <dbl>, mob_flujo <dbl>

Mobility information published by INE is available as of March 16th which we will consider the beginning of our analysis.

Final covid dataset

The statistics of the final dataset looks like:

skim(covid_data)
Data summary
Name covid_data
Number of rows 4095
Number of columns 26
_______________________
Column type frequency:
character 1
Date 1
numeric 24
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
provincia 0 1 6 9 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
fecha 0 1 2020-01-01 2022-03-29 2021-02-13 819

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
num_casos 0 1.00 1014.00 2598.89 0.00 64.00 280.00 984.00 34701.00 ▇▁▁▁▁
num_casos_prueba_pcr 0 1.00 495.68 854.53 0.00 50.00 176.00 565.50 10167.00 ▇▁▁▁▁
num_casos_prueba_test_ac 0 1.00 0.03 0.25 0.00 0.00 0.00 0.00 5.00 ▇▁▁▁▁
num_casos_prueba_ag 0 1.00 504.14 2018.32 0.00 0.00 43.00 240.00 31073.00 ▇▁▁▁▁
num_casos_prueba_elisa 0 1.00 0.71 3.42 0.00 0.00 0.00 0.00 55.00 ▇▁▁▁▁
num_casos_prueba_desconocida 0 1.00 13.44 53.64 0.00 0.00 0.00 0.00 519.00 ▇▁▁▁▁
num_hosp 0 1.00 51.28 119.70 0.00 6.00 18.00 48.00 1934.00 ▇▁▁▁▁
num_uci 0 1.00 4.45 9.74 0.00 0.00 1.00 4.00 135.00 ▇▁▁▁▁
num_def 0 1.00 9.78 23.79 0.00 0.00 3.00 10.00 334.00 ▇▁▁▁▁
tmed 0 1.00 16.63 6.45 -6.20 12.00 15.60 21.20 34.50 ▁▂▇▅▂
prec 0 1.00 1.75 6.14 0.00 0.00 0.00 0.20 87.90 ▇▁▁▁▁
tmin 0 1.00 11.69 6.23 -13.40 7.40 11.30 16.10 27.50 ▁▁▇▇▂
tmax 0 1.00 21.57 7.30 0.30 16.20 20.30 26.40 44.90 ▁▇▇▃▁
wd 0 1.00 41.76 36.25 1.00 15.00 26.00 99.00 99.00 ▆▇▁▁▆
ws 0 1.00 3.72 1.73 0.30 2.50 3.60 4.40 18.90 ▇▅▁▁▁
ws_max 0 1.00 10.24 3.61 2.50 7.80 9.70 12.05 31.90 ▅▇▂▁▁
sol 0 1.00 6.98 4.12 0.00 3.50 7.70 10.30 14.30 ▇▅▆▇▅
mob_grocery_pharmacy 225 0.95 -2.89 23.44 -94.00 -10.00 0.00 9.00 156.00 ▁▇▅▁▁
mob_parks 225 0.95 2.16 48.40 -93.00 -20.00 -3.00 15.75 333.00 ▇▇▁▁▁
mob_residential 225 0.95 7.24 8.26 -8.00 2.00 5.00 10.00 47.00 ▆▇▁▁▁
mob_retail_recreation 225 0.95 -28.46 23.87 -97.00 -38.00 -24.00 -13.00 30.00 ▂▂▇▇▁
mob_transit_stations 225 0.95 -26.74 22.16 -94.00 -38.00 -25.00 -12.00 26.00 ▁▂▇▇▂
mob_workplaces 225 0.95 -26.11 20.04 -92.00 -35.00 -22.00 -12.00 29.00 ▁▂▇▇▁
mob_flujo 825 0.80 15.75 3.84 4.56 13.31 15.95 18.44 26.24 ▁▃▇▅▁

The missing information corresponds to the periods for which no data have been provided. These are placed at the beginning of the dataset, so as of March 16, 2020 all the information is available up to December 29, 2021.

covid_data %>% 
      filter(!is.na(mob_flujo)) %>% 
      summarise(
            initial_data_without_nas = min(fecha),
            final_data_without_nas = max(fecha)
      )
# A tibble: 1 × 2
  initial_data_without_nas final_data_without_nas
  <date>                   <date>                
1 2020-03-16               2021-12-29            

Statistics for that period:

covid_data %>% 
      filter(fecha >= as.Date("2020-03-16", format = "%Y-%m-%d") &
                   fecha <= as.Date("2021-12-29", format = "%Y-%m-%d")) %>% 
      skim()
Data summary
Name Piped data
Number of rows 3270
Number of columns 26
_______________________
Column type frequency:
character 1
Date 1
numeric 24
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
provincia 0 1 6 9 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
fecha 0 1 2020-03-16 2021-12-29 2021-02-05 654

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
num_casos 0 1 780.02 1607.55 0.00 89.00 271.00 923.75 23811.00 ▇▁▁▁▁
num_casos_prueba_pcr 0 1 491.53 831.31 0.00 65.25 179.00 580.00 10167.00 ▇▁▁▁▁
num_casos_prueba_test_ac 0 1 0.04 0.27 0.00 0.00 0.00 0.00 5.00 ▇▁▁▁▁
num_casos_prueba_ag 0 1 273.56 942.12 0.00 1.00 40.00 182.00 16226.00 ▇▁▁▁▁
num_casos_prueba_elisa 0 1 0.87 3.79 0.00 0.00 0.00 0.00 55.00 ▇▁▁▁▁
num_casos_prueba_desconocida 0 1 14.02 56.25 0.00 0.00 0.00 0.00 519.00 ▇▁▁▁▁
num_hosp 0 1 55.27 127.95 0.00 7.00 20.00 54.00 1934.00 ▇▁▁▁▁
num_uci 0 1 4.97 10.32 0.00 0.00 2.00 5.00 135.00 ▇▁▁▁▁
num_def 0 1 11.10 26.14 0.00 1.00 3.00 11.00 334.00 ▇▁▁▁▁
tmed 0 1 17.85 6.48 -6.20 13.20 17.40 23.10 34.50 ▁▂▇▆▂
prec 0 1 1.78 6.06 0.00 0.00 0.00 0.20 78.50 ▇▁▁▁▁
tmin 0 1 12.89 6.11 -13.40 8.70 12.90 17.40 27.50 ▁▁▆▇▃
tmax 0 1 22.80 7.46 0.30 17.22 21.70 28.28 44.90 ▁▆▇▅▁
wd 0 1 41.45 35.80 1.00 16.00 26.00 99.00 99.00 ▆▇▁▁▅
ws 0 1 3.73 1.62 0.30 2.50 3.60 4.40 13.10 ▅▇▂▁▁
ws_max 0 1 10.30 3.50 2.50 7.80 9.70 12.06 31.90 ▅▇▂▁▁
sol 0 1 7.32 4.20 0.00 3.80 8.10 10.90 14.30 ▆▅▆▇▆
mob_grocery_pharmacy 0 1 -4.38 24.03 -94.00 -12.00 -1.00 8.00 156.00 ▁▇▃▁▁
mob_parks 0 1 2.22 51.72 -93.00 -22.00 -4.00 16.00 333.00 ▇▇▁▁▁
mob_residential 0 1 7.92 8.62 -8.00 2.00 6.00 11.00 47.00 ▅▇▁▁▁
mob_retail_recreation 0 1 -30.76 24.53 -97.00 -40.00 -26.00 -14.00 30.00 ▂▂▇▆▁
mob_transit_stations 0 1 -29.40 22.41 -94.00 -40.00 -28.00 -14.00 26.00 ▁▂▇▅▁
mob_workplaces 0 1 -28.32 20.09 -92.00 -37.00 -25.00 -14.00 29.00 ▁▂▇▆▁
mob_flujo 0 1 15.75 3.84 4.56 13.31 15.95 18.44 26.24 ▁▃▇▅▁

Aditional dataset

In addition, we have the clean dataset corresponding to hospitalizations which provide detail by demographics.

We will not combine it with the previous dataset, but we will treat it additionally in case of needing more detail.

cne_hosp_data <- readRDS(here("data", "clean", "cne_hosp_data.rds"))
cne_hosp_data
# A tibble: 1,277,640 × 8
   provincia_iso sexo  grupo_edad fecha      num_casos num_hosp num_uci num_def
   <chr>         <chr> <chr>      <date>         <dbl>    <dbl>   <dbl>   <dbl>
 1 A             H     0-9        2020-01-01         0        0       0       0
 2 A             H     10-19      2020-01-01         0        0       0       0
 3 A             H     20-29      2020-01-01         0        0       0       0
 4 A             H     30-39      2020-01-01         0        0       0       0
 5 A             H     40-49      2020-01-01         0        0       0       0
 6 A             H     50-59      2020-01-01         0        0       0       0
 7 A             H     60-69      2020-01-01         0        0       0       0
 8 A             H     70-79      2020-01-01         0        0       0       0
 9 A             H     80+        2020-01-01         0        0       0       0
10 A             H     NC         2020-01-01         0        0       0       0
# … with 1,277,630 more rows
cne_hosp_data <- cne_hosp_data %>% 
      filter(provincia_iso %in% c("O", "B", "M", "MA", "SE")) %>% 
      mutate(
            provincia_iso = case_when(
                  provincia_iso == "O" ~ "Asturias",
                  provincia_iso == "B" ~ "Barcelona",
                  provincia_iso == "M" ~ "Madrid",
                  provincia_iso == "MA" ~ "Málaga",
                  provincia_iso == "SE" ~ "Sevilla",
                  TRUE ~ provincia_iso
            )
      ) %>% 
      rename(provincia = provincia_iso)
cne_hosp_data
# A tibble: 122,850 × 8
   provincia sexo  grupo_edad fecha      num_casos num_hosp num_uci num_def
   <chr>     <chr> <chr>      <date>         <dbl>    <dbl>   <dbl>   <dbl>
 1 Barcelona H     0-9        2020-01-01         0        0       0       0
 2 Barcelona H     10-19      2020-01-01         0        0       0       0
 3 Barcelona H     20-29      2020-01-01         0        0       0       0
 4 Barcelona H     30-39      2020-01-01         0        0       0       0
 5 Barcelona H     40-49      2020-01-01         0        0       0       0
 6 Barcelona H     50-59      2020-01-01         0        0       0       0
 7 Barcelona H     60-69      2020-01-01         0        0       0       0
 8 Barcelona H     70-79      2020-01-01         0        0       0       0
 9 Barcelona H     80+        2020-01-01         0        0       0       0
10 Barcelona H     NC         2020-01-01         0        0       0       0
# … with 122,840 more rows
skim(cne_hosp_data)
Data summary
Name cne_hosp_data
Number of rows 122850
Number of columns 8
_______________________
Column type frequency:
character 3
Date 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
provincia 0 1 6 9 0 5 0
sexo 0 1 1 2 0 3 0
grupo_edad 0 1 2 5 0 10 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
fecha 0 1 2020-01-01 2022-03-29 2021-02-13 819

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
num_casos 0 1 33.80 137.91 0 0 0 18 3741 ▇▁▁▁▁
num_hosp 0 1 1.71 7.27 0 0 0 1 270 ▇▁▁▁▁
num_uci 0 1 0.15 0.79 0 0 0 0 35 ▇▁▁▁▁
num_def 0 1 0.33 2.27 0 0 0 0 100 ▇▁▁▁▁

Note

Final clean data are stored in the project’s folder “/data/clean/” both in rds and csv format.

  • final_covid_data.csv (CNE+AEMET+GOOGLE+INE data)
  • final_hosp_data.csv (CNE data with demographic detail)