Passer au contenu

Introduction

Before using the functions that produce safe tables, it is necessary to analyze the tables that need to be protected. Indeed, the disseminated tables differ from the tables to be protected. This is explained in detail in Chapter 3 of the Manuel de protection des tableaux. This phase of analyzing common margins between different tables can be complex. Therefore, functions have been developed in rtauargus to facilitate this work.

How Does Automatic Metadata Analysis Work?

Creating Metadata

Metadata must adhere to a specific format. Each line represents a table, and for each table, the following must be specified:

  • table_name : the name of the table ;
  • field : the field (often, geographical area, studied population, year of data collection) ;
  • hrc_field : whether there is a hierarchical link between this field and another field. If there is no hierarchical link, this field should be NA. If a hierarchical link exists, a name must be given to this link and the same name must be written in the other tables ;
  • indicator : the table’s indicator name. This is the name of the variable that is counted in each cell of the table. For example, if my table presents the turnover of companies, my variable will likely be “Turnover”. It is best to use the variable name as described in the data;
  • hrc_indicator : whether there is a hierarchical link between this indicator and another indicator. If there is no hierarchical link, this field should be NA. If a hierarchical link exists, a name must be given to this link and the same name must be written in the other tables;
  • spanning_1 : the name of the first cross-tabulation (spanning) variable. Additional columns are added for each cross-tabulation variable, following the naming convention: spanning_1, spanning_2, spanning_3, etc ;
  • hrc_spanning_1 : whether there is a hierarchical link between this cross-tabulation variable and another cross-tabulation variable. If there is no hierarchical link, this field should be NA. If a hierarchical link exists, a name must be given to this link and the same name must be written in the other tables. Additional columns are added for each cross-tabulation variable, following the naming convention: hrc_spanning_1, hrc_spanning_2, hrc_spanning_3, etc. Note: there must be as many spanning_X columns as hrc_spanning_X columns.

Formatted Input: Eurostat Templates

When the tables to be processed are Eurostat tables, the metadata file can be automatically created from the template of the published cells. Indeed, for Eurostat publications, National Statistical Institutes (NSIs) must provide a template that groups the published cells.

To do this, it is essential to understand each column of the template. Determine for each one whether it is the response variable/indicator, a cross-tabulation variable, or a field, in order to specify this in the arguments of the function . It is particularly important to verify that the template adheres to the following: one cross-tabulation variable = one column in the template, and thus one supertotal. Indeed, if a column groups multiple cross-tabulation variables, then as many columns as cross-tabulation variables must be created.

The Arguments of analyse_metadata

args(analyse_metadata)
## function (df_metadata, df_eq_indicator = NULL, verbose = FALSE) 
## NULL

The function arguments are as follows:

  • df_metadata: a dataframe structured as specified in section .
  • df_eq_indicator : a dataframe of equations on indicators.
  • verbose: a variable of type logical. If TRUE he function returns a list with the different steps of the analysis; if FALSE, it returns only the final dataframe with the cluster indicator.

The df_eq_indicator data frame consists of three columns that must follow the format below:

data.frame(eq_name = c("eq1"),
           eq_indicator = c("ca_salades = ca_batavia + ca_mache"),
           unit = c("EUR"))
##   eq_name                       eq_indicator unit
## 1     eq1 ca_salades = ca_batavia + ca_mache  EUR

The Arguments of format_template

args(format_template)
## function (data, indicator_column, spanning_var_tot, field_columns) 
## NULL

The function arguments are as follows:

  • data: the Eurostat template data frame containing all published cells.
  • indicator_column: the name of the column containing the indicators.
  • spanning_var_tot: a named list of the crossing variables and their totals.
  • field_columns: a vector of all columns representing fields (e.g., vintage/year).

Examples

Case Where the Input is Metadata

In this example, we start with a metadata file presenting 12 tables to be published on the turnover of pizza and salad sales.

str(metadata_pizza_lettuce)
## 'data.frame':    12 obs. of  9 variables:
##  $ table_name    : chr  "T1" "T2" "T3" "T4" ...
##  $ field         : chr  "france_entreprises_2023" "france_entreprises_2023" "france_entreprises_2023" "france_entreprises_2023" ...
##  $ hrc_field     : logi  NA NA NA NA NA NA ...
##  $ indicator     : chr  "to_pizza" "to_pizza" "to_pizza" "to_pizza" ...
##  $ hrc_indicator : chr  NA NA NA NA ...
##  $ spanning_1    : chr  "nuts2" "nuts3" "a10" "a10" ...
##  $ hrc_spanning_1: chr  "hrc_nuts" "hrc_nuts" "hrc_naf" "hrc_naf" ...
##  $ spanning_2    : chr  "size" "size" "nuts2" "nuts3" ...
##  $ hrc_spanning_2: chr  NA NA "hrc_nuts" "hrc_nuts" ...

Example code:

library(rtauargus)

data(metadata_pizza_lettuce)

# Analyse complète, avec les étapes
detailed_analysis <- analyse_metadata(metadata_pizza_lettuce, verbose = TRUE)

# Output simplifié, uniquement le dataframe avec l'indicatrice de cluster
cluster_id_dataframe <- analyse_metadata(metadata_pizza_lettuce, verbose = FALSE)

The output is a dataframe showing how to process the tables for applying the secret.

cluster_id_dataframe
##                               cluster  table_name                   field indicator spanning_1 spanning_2    spanning_3
## 1 france_entreprises_2023.hrc_lettuce  T10.T12.T8 france_entreprises_2023   LETTUCE    HRC_NAF         cj HRC_LETTUCE^h
## 2 france_entreprises_2023.hrc_lettuce   T11.T7.T9 france_entreprises_2023   LETTUCE    HRC_NAF       size HRC_LETTUCE^h
## 3    france_entreprises_2023.to_pizza       T1.T2 france_entreprises_2023  to_pizza   HRC_NUTS       size          <NA>
## 4    france_entreprises_2023.to_pizza T3.T4.T5.T6 france_entreprises_2023  to_pizza    HRC_NAF   HRC_NUTS          <NA>
##   hrc_spanning_1 hrc_spanning_2 hrc_spanning_3
## 1        hrc_naf           <NA>    hrc_lettuce
## 2        hrc_naf           <NA>    hrc_lettuce
## 3       hrc_nuts           <NA>           <NA>
## 4        hrc_naf       hrc_nuts           <NA>

For the 12 tables to be published, it is sufficient to protect 4 tables. These tables are distributed across two different clusters. Therefore, tab_multi_manager() needs to be called twice. A new spanning variable appeared HRC_LETTUCE^h. This variable takes 3 values batavia, arugula and lettuce the margin. The indicator LETTUCE here refers to the turnover (to) of salad sales. For example, the table T10.T12.T8 presents the revenue from salad sales based on the activity and category of the company, as well as the type of salad.

To obtain more information about the metadata analysis, the verbose = TRUE argument can be specified. This way, the object returned by the function will be a list of the different steps of the analysis.

names(detailed_analysis)
## [1] "identify_hrc"        "info_var"            "split_in_clusters"   "create_edges"        "grp_tab_names"      
## [6] "grp_tab_in_clusters" "tab_to_treat"        "df_tab_to_treat"

One finds the dataframe with the cluster indicator df_tab_to_treat. The result is the same but in list format: each element of the list is an independent cluster tab_to_treat. Additionally, the 6 steps of the analysis are included.

Case Where the Input is a Eurostat Template

The format_template() function allows creating the metadata to be used as input for analyse_metadata() from the Eurostat template of the published cells.

data(enterprise_template)

str(enterprise_template)
## 'data.frame':    3168 obs. of  5 variables:
##  $ TIME_PERIOD: int  2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
##  $ INDICATOR  : chr  "SAL" "SAL" "SAL" "SAL" ...
##  $ ACTIVITY   : chr  "B" "B" "B" "B" ...
##  $ NUMBER_EMPL: chr  "E0" "E1T4" "E5T9" "EGE10" ...
##  $ LEGAL_FORM : chr  "_T" "_T" "_T" "_T" ...
template_formatted <- format_template(
  data = enterprise_template,
  indicator_column = "INDICATOR",
  spanning_var_tot = list(
    ACTIVITY = "BTSXO_S94",
    NUMBER_EMPL = "_T",
    LEGAL_FORM = "_T"),
  field_columns = c("TIME_PERIOD")
)
## treating the field 2021 
## treating the field 2022
template_formatted$metadata
##             table_name field indicator spanning_1  spanning_2   hrc_spanning_1    hrc_spanning_2
## 1 table_2021_SAL_DTH_1  2021   SAL_DTH   ACTIVITY  LEGAL_FORM hrc_activity_131  hrc_legal_form_3
## 2 table_2021_SAL_DTH_2  2021   SAL_DTH   ACTIVITY NUMBER_EMPL hrc_activity_131 hrc_number_empl_4
## 3     table_2022_SAL_1  2022       SAL   ACTIVITY  LEGAL_FORM hrc_activity_131  hrc_legal_form_3
## 4     table_2022_SAL_2  2022       SAL   ACTIVITY NUMBER_EMPL hrc_activity_131 hrc_number_empl_4
## 5 table_2022_SAL_DTH_1  2022   SAL_DTH   ACTIVITY  LEGAL_FORM hrc_activity_131  hrc_legal_form_3
## 6 table_2022_SAL_DTH_2  2022   SAL_DTH   ACTIVITY NUMBER_EMPL hrc_activity_131 hrc_number_empl_4

There are 6 tables to be published.

Remark: format_template() does not construct the hrc_indicator column because the information contained in the template does not allow it to be determined. Therefore, one must create it based on the information available about the indicators.

Here, there is no hierarchical link between SAL (employees of active companies) and SAL_DTH (employees in deaths).

# cas où il n'y a aucune hiérarchie sur les indicateurs
metadata_template <- template_formatted$metadata %>%
  mutate(hrc_indicator = NA) %>%
  select(table_name,field,indicator,hrc_indicator, everything())

metadata_template
##             table_name field indicator hrc_indicator spanning_1  spanning_2   hrc_spanning_1    hrc_spanning_2
## 1 table_2021_SAL_DTH_1  2021   SAL_DTH            NA   ACTIVITY  LEGAL_FORM hrc_activity_131  hrc_legal_form_3
## 2 table_2021_SAL_DTH_2  2021   SAL_DTH            NA   ACTIVITY NUMBER_EMPL hrc_activity_131 hrc_number_empl_4
## 3     table_2022_SAL_1  2022       SAL            NA   ACTIVITY  LEGAL_FORM hrc_activity_131  hrc_legal_form_3
## 4     table_2022_SAL_2  2022       SAL            NA   ACTIVITY NUMBER_EMPL hrc_activity_131 hrc_number_empl_4
## 5 table_2022_SAL_DTH_1  2022   SAL_DTH            NA   ACTIVITY  LEGAL_FORM hrc_activity_131  hrc_legal_form_3
## 6 table_2022_SAL_DTH_2  2022   SAL_DTH            NA   ACTIVITY NUMBER_EMPL hrc_activity_131 hrc_number_empl_4

Next, this dataframe is used as input for the analysis function.

# Analyse complète, avec les étapes
detailed_analysis <- analyse_metadata(metadata_template, verbose = TRUE)
## Error in `check_column_names()`:
## ! Error: The dataframe is missing one or more required columns: table_name, field, hrc_field, indicator, hrc_indicator.
# Output simplifié, uniquement le dataframe avec l'indicatrice de cluster
cluster_id_dataframe <- analyse_metadata(metadata_template, verbose = FALSE)
## Error in `check_column_names()`:
## ! Error: The dataframe is missing one or more required columns: table_name, field, hrc_field, indicator, hrc_indicator.
# visualisation du résultat de l'analyse
cluster_id_dataframe
##                               cluster  table_name                   field indicator spanning_1 spanning_2    spanning_3
## 1 france_entreprises_2023.hrc_lettuce  T10.T12.T8 france_entreprises_2023   LETTUCE    HRC_NAF         cj HRC_LETTUCE^h
## 2 france_entreprises_2023.hrc_lettuce   T11.T7.T9 france_entreprises_2023   LETTUCE    HRC_NAF       size HRC_LETTUCE^h
## 3    france_entreprises_2023.to_pizza       T1.T2 france_entreprises_2023  to_pizza   HRC_NUTS       size          <NA>
## 4    france_entreprises_2023.to_pizza T3.T4.T5.T6 france_entreprises_2023  to_pizza    HRC_NAF   HRC_NUTS          <NA>
##   hrc_spanning_1 hrc_spanning_2 hrc_spanning_3
## 1        hrc_naf           <NA>    hrc_lettuce
## 2        hrc_naf           <NA>    hrc_lettuce
## 3       hrc_nuts           <NA>           <NA>
## 4        hrc_naf       hrc_nuts           <NA>

Ultimately, there are 6 tables to process in 3 different clusters. In other words, tab_multi_manager() will need to be called three times.

Equations between indicators

In this example, we start from a metadata file describing 12 tables to be published on the turnover from pizza and salad sales.

str(metadata_pizza_lettuce)
## 'data.frame':    12 obs. of  9 variables:
##  $ table_name    : chr  "T1" "T2" "T3" "T4" ...
##  $ field         : chr  "france_entreprises_2023" "france_entreprises_2023" "france_entreprises_2023" "france_entreprises_2023" ...
##  $ hrc_field     : logi  NA NA NA NA NA NA ...
##  $ indicator     : chr  "to_pizza" "to_pizza" "to_pizza" "to_pizza" ...
##  $ hrc_indicator : chr  NA NA NA NA ...
##  $ spanning_1    : chr  "nuts2" "nuts3" "a10" "a10" ...
##  $ hrc_spanning_1: chr  "hrc_nuts" "hrc_nuts" "hrc_naf" "hrc_naf" ...
##  $ spanning_2    : chr  "size" "size" "nuts2" "nuts3" ...
##  $ hrc_spanning_2: chr  NA NA "hrc_nuts" "hrc_nuts" ...

This time a data frame describing the relationships between indicators is added.

liens_eq <- data.frame(eq_name = c("eq1"),
                       eq_indicator = c("ca_salades = ca_batavia + ca_mache"),
                       unit = c("EUR"))

Example code :

library(rtauargus)

data(metadata_pizza_lettuce)

# Full analysis, with intermediate steps
detailed_analysis <- analyse_metadata(metadata_pizza_lettuce,
                                      df_eq_indicator = liens_eq,
                                      verbose = TRUE)
## Warning in analyse_metadata(metadata_pizza_lettuce, df_eq_indicator = liens_eq, : For the variables part of equations specified in df_eq_indicator,
##     the hrc_indicator column will be ignored.
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()` always returns an ungrouped data frame and adjust
##   accordingly.
## ℹ The deprecated feature was likely used in the rtauargus package.
##   Please report the issue at <https://github.com/inseefrlab/rtauargus/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
# Simplified output, only the data frame with the cluster indicator
cluster_id_dataframe <- analyse_metadata(metadata_pizza_lettuce, verbose = FALSE)

Here, a warning informs us that the hrc_indicator column will be ignored since the relationships between indicators have been specified using the df_eq_indicator argument. Therefore, links should not be defined in hrc_indicator when using the df_eq_indicator argument.

The result is the same as the first example.

cluster_id_dataframe
##                               cluster  table_name                   field indicator spanning_1 spanning_2    spanning_3
## 1 france_entreprises_2023.hrc_lettuce  T10.T12.T8 france_entreprises_2023   LETTUCE    HRC_NAF         cj HRC_LETTUCE^h
## 2 france_entreprises_2023.hrc_lettuce   T11.T7.T9 france_entreprises_2023   LETTUCE    HRC_NAF       size HRC_LETTUCE^h
## 3    france_entreprises_2023.to_pizza       T1.T2 france_entreprises_2023  to_pizza   HRC_NUTS       size          <NA>
## 4    france_entreprises_2023.to_pizza T3.T4.T5.T6 france_entreprises_2023  to_pizza    HRC_NAF   HRC_NUTS          <NA>
##   hrc_spanning_1 hrc_spanning_2 hrc_spanning_3
## 1        hrc_naf           <NA>    hrc_lettuce
## 2        hrc_naf           <NA>    hrc_lettuce
## 3       hrc_nuts           <NA>           <NA>
## 4        hrc_naf       hrc_nuts           <NA>

Going Further: Visualizing Inclusions

The create_edges step in the metadata analysis identifies tables included within other tables. For example, XXXXX is included in XXXXX. The following code allows visualizing these inclusions using graphs to better understand the analysis procedure.

## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
## Error in `library()`:
## ! there is no package called 'visNetwork'
graph_links_tab <- function(list_desc_links){
  list_desc_links %>% purrr::imap(function(ss_dem,i){
    if(!is.null(ss_dem)){
      nodes <- data.frame(id = unique(unlist(ss_dem)))
      visNetwork(nodes = nodes, edges = ss_dem, main = i) %>%
        visIgraphLayout() %>%
        visEdges(shadow = TRUE,
                 arrows =list(to = list(enabled = TRUE, scaleFactor = 0.5)),
                 color = list(color = "steelblue", highlight = "red")) %>%
        visOptions(highlightNearest = list(enabled = T, hover = T),
                   nodesIdSelection = T)
    }
  })
}

data(metadata_pizza_lettuce)
detailed_analysis <- analyse_metadata(metadata_pizza_lettuce, verbose = TRUE)

graph_links_tab(detailed_analysis$create_edges)
## Error in `map2()`:
## ℹ In index: 1.
## ℹ With name: france_entreprises_2023.hrc_lettuce.
## Caused by error in `visOptions()`:
## ! could not find function "visOptions"