Automatic analysis of metadata

library(rtauargus)
library(dplyr)

Introduction

Before using the functions that produce safe tables, it is necessary to analyze the tables that need to be protected. Indeed, the disseminated tables differ from the tables to be protected. This is explained in detail in Chapter 3 of the Manuel de protection des tableaux. This phase of analyzing common margins between different tables can be complex. Therefore, functions have been developed in rtauargus to facilitate this work.

How Does Automatic Metadata Analysis Work?

Creating Metadata

Metadata must adhere to a specific format. Each line represents a table, and for each table, the following must be specified:

table_name : the name of the table ;
field : the field (often, geographical area, studied population, year of data collection) ;
hrc_field : whether there is a hierarchical link between this field and another field. If there is no hierarchical link, this field should be NA. If a hierarchical link exists, a name must be given to this link and the same name must be written in the other tables ;
indicator : the table’s indicator name. This is the name of the variable that is counted in each cell of the table. For example, if my table presents the turnover of companies, my variable will likely be “Turnover”. It is best to use the variable name as described in the data;
hrc_indicator : whether there is a hierarchical link between this indicator and another indicator. If there is no hierarchical link, this field should be NA. If a hierarchical link exists, a name must be given to this link and the same name must be written in the other tables;
spanning_1 : the name of the first cross-tabulation (spanning) variable. Additional columns are added for each cross-tabulation variable, following the naming convention: spanning_1, spanning_2, spanning_3, etc ;
hrc_spanning_1 : whether there is a hierarchical link between this cross-tabulation variable and another cross-tabulation variable. If there is no hierarchical link, this field should be NA. If a hierarchical link exists, a name must be given to this link and the same name must be written in the other tables. Additional columns are added for each cross-tabulation variable, following the naming convention: hrc_spanning_1, hrc_spanning_2, hrc_spanning_3, etc. Note: there must be as many spanning_X columns as hrc_spanning_X columns.

Formatted Input: Eurostat Templates

When the tables to be processed are Eurostat tables, the metadata file can be automatically created from the template of the published cells. Indeed, for Eurostat publications, National Statistical Institutes (NSIs) must provide a template that groups the published cells.

To do this, it is essential to understand each column of the template. Determine for each one whether it is the response variable/indicator, a cross-tabulation variable, or a field, in order to specify this in the arguments of the function . It is particularly important to verify that the template adheres to the following: one cross-tabulation variable = one column in the template, and thus one supertotal. Indeed, if a column groups multiple cross-tabulation variables, then as many columns as cross-tabulation variables must be created.

The Arguments of `analyse_metadata`

args(analyse_metadata)

## function (df_metadata, verbose = FALSE) 
## NULL

The function arguments are as follows:

df_metadata: a dataframe structured as specified in section .
verbose: a variable of type logical. If TRUE he function returns a list with the different steps of the analysis; if FALSE, it returns only the final dataframe with the cluster indicator.

The Arguments of `format_template`

args(format_template)

## function (data, indicator_column, spanning_var_tot, field_columns) 
## NULL

Les arguments de la fonction sont les suivants :

data : dataframe du template Eurostat contenant toutes les cellules publiées.
indicator_column : nom de la colonne dans laquelle se trouvent les indicateurs.
spanning_var_tot : liste nommée des variables de croisements et de leurs totaux.
field_columns : vecteur de toutes les colonnes représentant des champs (ex : millésime).

Examples

Case Where the Input is Metadata

In this example, we start with a metadata file presenting 12 tables to be published on the turnover of pizza and salad sales.

str(metadata_pizza_lettuce)

## 'data.frame':    12 obs. of  9 variables:
##  $ table_name    : chr  "T1" "T2" "T3" "T4" ...
##  $ field         : chr  "france_entreprises_2023" "france_entreprises_2023" "france_entreprises_2023" "france_entreprises_2023" ...
##  $ hrc_field     : logi  NA NA NA NA NA NA ...
##  $ indicator     : chr  "to_pizza" "to_pizza" "to_pizza" "to_pizza" ...
##  $ hrc_indicator : chr  NA NA NA NA ...
##  $ spanning_1    : chr  "nuts2" "nuts3" "a10" "a10" ...
##  $ hrc_spanning_1: chr  "hrc_nuts" "hrc_nuts" "hrc_naf" "hrc_naf" ...
##  $ spanning_2    : chr  "size" "size" "nuts2" "nuts3" ...
##  $ hrc_spanning_2: chr  NA NA "hrc_nuts" "hrc_nuts" ...

Example code:

library(rtauargus)

data(metadata_pizza_lettuce)

# Analyse complète, avec les étapes
detailed_analysis <- analyse_metadata(metadata_pizza_lettuce, verbose = TRUE)

# Output simplifié, uniquement le dataframe avec l'indicatrice de cluster
cluster_id_dataframe <- analyse_metadata(metadata_pizza_lettuce, verbose = FALSE)

The output is a dataframe showing how to process the tables for applying the secret.

cluster_id_dataframe

## # A tibble: 4 x 10
## # Groups:   table_name [4]
##   cluster        table_name field indicator spanning_1 spanning_2 spanning_3 hrc_spanning_1 hrc_spanning_2 hrc_spanning_3
##   <chr>          <chr>      <chr> <chr>     <chr>      <chr>      <chr>      <chr>          <chr>          <chr>         
## 1 france_entrep~ T10.T12.T8 fran~ LETTUCE   HRC_NAF    cj         HRC_LETTU~ hrc_naf        <NA>           hrc_lettuce   
## 2 france_entrep~ T11.T7.T9  fran~ LETTUCE   HRC_NAF    size       HRC_LETTU~ hrc_naf        <NA>           hrc_lettuce   
## 3 france_entrep~ T1.T2      fran~ to_pizza  HRC_NUTS   size       <NA>       hrc_nuts       <NA>           <NA>          
## 4 france_entrep~ T3.T4.T5.~ fran~ to_pizza  HRC_NAF    HRC_NUTS   <NA>       hrc_naf        hrc_nuts       <NA>

For the 12 tables to be published, it is sufficient to protect 4 tables. These tables are distributed across two different clusters. Therefore, tab_multi_manager() needs to be called twice. A new spanning variable appeared HRC_LETTUCE^h. This variable takes 3 values batavia, arugula and lettuce the margin. The indicator LETTUCE here refers to the turnover (to) of salad sales. For example, the table T10.T12.T8 presents the revenue from salad sales based on the activity and category of the company, as well as the type of salad.

To obtain more information about the metadata analysis, the verbose = TRUE argument can be specified. This way, the object returned by the function will be a list of the different steps of the analysis.

names(detailed_analysis)

## [1] "identify_hrc"        "info_var"            "split_in_clusters"   "create_edges"        "grp_tab_names"      
## [6] "grp_tab_in_clusters" "tab_to_treat"        "df_tab_to_treat"

One finds the dataframe with the cluster indicator df_tab_to_treat. The result is the same but in list format: each element of the list is an independent cluster tab_to_treat. Additionally, the 6 steps of the analysis are included.

Case Where the Input is a Eurostat Template

The format_template() function allows creating the metadata to be used as input for analyse_metadata() from the Eurostat template of the published cells.

data(enterprise_template)

str(enterprise_template)

## 'data.frame':    3168 obs. of  5 variables:
##  $ TIME_PERIOD: int  2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
##  $ INDICATOR  : chr  "SAL" "SAL" "SAL" "SAL" ...
##  $ ACTIVITY   : chr  "B" "B" "B" "B" ...
##  $ NUMBER_EMPL: chr  "E0" "E1T4" "E5T9" "EGE10" ...
##  $ LEGAL_FORM : chr  "_T" "_T" "_T" "_T" ...

template_formatted <- format_template(
  data = enterprise_template,
  indicator_column = "INDICATOR",
  spanning_var_tot = list(
    ACTIVITY = "BTSXO_S94",
    NUMBER_EMPL = "_T",
    LEGAL_FORM = "_T"),
  field_columns = c("TIME_PERIOD")
)

template_formatted$metadata

##             table_name field indicator spanning_1  spanning_2   hrc_spanning_1    hrc_spanning_2
## 1 table_2021_SAL_DTH_1  2021   SAL_DTH   ACTIVITY  LEGAL_FORM hrc_activity_131  hrc_legal_form_3
## 2 table_2021_SAL_DTH_2  2021   SAL_DTH   ACTIVITY NUMBER_EMPL hrc_activity_131 hrc_number_empl_4
## 3     table_2022_SAL_1  2022       SAL   ACTIVITY  LEGAL_FORM hrc_activity_131  hrc_legal_form_3
## 4     table_2022_SAL_2  2022       SAL   ACTIVITY NUMBER_EMPL hrc_activity_131 hrc_number_empl_4
## 5 table_2022_SAL_DTH_1  2022   SAL_DTH   ACTIVITY  LEGAL_FORM hrc_activity_131  hrc_legal_form_3
## 6 table_2022_SAL_DTH_2  2022   SAL_DTH   ACTIVITY NUMBER_EMPL hrc_activity_131 hrc_number_empl_4

There are 6 tables to be published.

Remark: format_template() does not construct the hrc_indicator column because the information contained in the template does not allow it to be determined. Therefore, one must create it based on the information available about the indicators.

Here, there is no hierarchical link between SAL (employees of active companies) and SAL_DTH (employees in deaths).

# cas où il n'y a aucune hiérarchie sur les indicateurs
metadata_template <- template_formatted$metadata %>%
  mutate(hrc_indicator = NA) %>%
  select(table_name,field,indicator,hrc_indicator, everything())

metadata_template

##             table_name field indicator hrc_indicator spanning_1  spanning_2   hrc_spanning_1    hrc_spanning_2
## 1 table_2021_SAL_DTH_1  2021   SAL_DTH            NA   ACTIVITY  LEGAL_FORM hrc_activity_131  hrc_legal_form_3
## 2 table_2021_SAL_DTH_2  2021   SAL_DTH            NA   ACTIVITY NUMBER_EMPL hrc_activity_131 hrc_number_empl_4
## 3     table_2022_SAL_1  2022       SAL            NA   ACTIVITY  LEGAL_FORM hrc_activity_131  hrc_legal_form_3
## 4     table_2022_SAL_2  2022       SAL            NA   ACTIVITY NUMBER_EMPL hrc_activity_131 hrc_number_empl_4
## 5 table_2022_SAL_DTH_1  2022   SAL_DTH            NA   ACTIVITY  LEGAL_FORM hrc_activity_131  hrc_legal_form_3
## 6 table_2022_SAL_DTH_2  2022   SAL_DTH            NA   ACTIVITY NUMBER_EMPL hrc_activity_131 hrc_number_empl_4

Next, this dataframe is used as input for the analysis function.

# Analyse complète, avec les étapes
detailed_analysis <- analyse_metadata(metadata_template, verbose = TRUE)

# Output simplifié, uniquement le dataframe avec l'indicatrice de cluster
cluster_id_dataframe <- analyse_metadata(metadata_template, verbose = FALSE)

# visualisation du résultat de l'analyse
cluster_id_dataframe

## # A tibble: 6 x 8
## # Groups:   table_name [6]
##   cluster      table_name           field indicator spanning_1       spanning_2        hrc_spanning_1   hrc_spanning_2   
##   <chr>        <chr>                <chr> <chr>     <chr>            <chr>             <chr>            <chr>            
## 1 2021.SAL_DTH table_2021_SAL_DTH_1 2021  SAL_DTH   HRC_ACTIVITY_131 HRC_LEGAL_FORM_3  hrc_activity_131 hrc_legal_form_3 
## 2 2021.SAL_DTH table_2021_SAL_DTH_2 2021  SAL_DTH   HRC_ACTIVITY_131 HRC_NUMBER_EMPL_4 hrc_activity_131 hrc_number_empl_4
## 3 2022.SAL     table_2022_SAL_1     2022  SAL       HRC_ACTIVITY_131 HRC_LEGAL_FORM_3  hrc_activity_131 hrc_legal_form_3 
## 4 2022.SAL     table_2022_SAL_2     2022  SAL       HRC_ACTIVITY_131 HRC_NUMBER_EMPL_4 hrc_activity_131 hrc_number_empl_4
## 5 2022.SAL_DTH table_2022_SAL_DTH_1 2022  SAL_DTH   HRC_ACTIVITY_131 HRC_LEGAL_FORM_3  hrc_activity_131 hrc_legal_form_3 
## 6 2022.SAL_DTH table_2022_SAL_DTH_2 2022  SAL_DTH   HRC_ACTIVITY_131 HRC_NUMBER_EMPL_4 hrc_activity_131 hrc_number_empl_4

Ultimately, there are 6 tables to process in 3 different clusters. In other words, tab_multi_manager() will need to be called three times.

Going Further: Visualizing Inclusions

The create_edges step in the metadata analysis identifies tables included within other tables. For example, XXXXX is included in XXXXX. The following code allows visualizing these inclusions using graphs to better understand the analysis procedure.

library(rtauargus)
library(igraph)
library(visNetwork)

graph_links_tab <- function(list_desc_links){
  list_desc_links %>% purrr::imap(function(ss_dem,i){
    if(!is.null(ss_dem)){
      nodes <- data.frame(id = unique(unlist(ss_dem)))
      visNetwork(nodes = nodes, edges = ss_dem, main = i) %>%
        visIgraphLayout() %>%
        visEdges(shadow = TRUE,
                 arrows =list(to = list(enabled = TRUE, scaleFactor = 0.5)),
                 color = list(color = "steelblue", highlight = "red")) %>%
        visOptions(highlightNearest = list(enabled = T, hover = T),
                   nodesIdSelection = T)
    }
  })
}

data(metadata_pizza_lettuce)
detailed_analysis <- analyse_metadata(metadata_pizza_lettuce, verbose = TRUE)

graph_links_tab(detailed_analysis$create_edges)

## $france_entreprises_2023.hrc_lettuce
## 
## $france_entreprises_2023.to_pizza

Package {rtauargus}