Protect several linked tables at once
¶
Package {rtauargus} ¶
Source: Package {rtauargus} ¶
vignettes/protect_multi_tables.Rmd
protect_multi_tables.Rmd
Introduction
In rtauargus
package, it is now possible to protect a
set of linked tables at once. The function to do this is called
tab_multi_manager()
. The very simple algorithm implemented
to handle the protection doesn’t depend on the number of tables. Hence,
the function can theoretically deal with an indefinite number of tables.
Of course, the practical use is limited by the power of the computer.
However, we are confident on the ability of the function to treat the
most cases. We are interested in cases that you would encounter and the
function can’t handle.
A link between two (or more) tables describes the common cells of the tables, that is the cells appearing in them. The function can handle all types of links between tables:
- margin links: when two tables share one (or more) variables;
- links between the response variables, when exists some equations connecting the variables;
- non-nested hierarchies: when an explanatory variable is broken down into two (or more) non-nested hierarchies.
Actually, the last two types can be handled as the first one. The challenge for the user is to properly define the set of the tables. Once the set is correctly defined, the function takes care of everything.
Firstly, the vignette will present how the function works, secondly, the parameters are quickly described. In the third part, some examples are developped to show how to use it in diverse situations.
How does tab_multi_manager()
handle the protection of a
set of linked tables ?
Firstly, the function merges all the tables of the set. In the resulted table, boolean variables are added to mention whether a cell owns to a given table of the set. So, there are as many boolean variables as the number of tables in the set. The merged table is a very effective way to quickly detect the common cells. Thus, the report of the suppressions is automatic.
The protection process is sequential: one table at a time. To handle the protection of all the tables and not to forget one table or some links, the implemented algorithm works with two lists:
-
todolist
: list of tables which have to be protected; -
remainlist
: list of original tables which haven’t yet been protected at all.
Both lists are initialized as follows:
-
todolist
= first table in the original list; -
remainlist
= all the original tables except the firs one.
Then, the algorithm proceeds as shown in the following figure:
Results
The function returns the original list of tables with some other variables which are boolean variables (TRUE if the cell has to be masked, otherwise FALSE), describing all steps of the suppression process. Each step takes the previous one into account and the last variable indicates the final step. Final status of cell is easily computable with this last variable and the primary suppression status previously computed by the user.
In addition, the function writes all τ-Argus files created during the
process. At the end, only the last process for each table is available
in the chosen directory. A journal (journal.txt
) is also
provided. It describes each step of the suppression process: Which table
is being protected, how many new common cells have been hit by secondary
suppression. At the end of the process, it provides a description of all
common cells that have been hit, mentionning the iteration of when each
has been hit.
Some details on the parameters of
tab_multi_manager()
dir_name the directory that will contain all the created files, if non existing it will be created. example : dir_name = “my_directory”
hrc A named vector specifying the path to the hrc file for each hierarchical variables. example : c(ACTIVITY = “path_to_file/act.hrc”, NUTS = “path_to_file/nuts.hrc” )
-
alt_hrc A named list useful for dealing with non-nested hierachies. The names of the list are the names of the tables when alternative (non-nested in general) hierarchies. example : If T1 and T2 have one explanatory variable, called ACTIVITY, and the same response variable, but the
ACTIVITY
variable has not the same hierarchy in the two tables. Let’s assume that the hierarchies (act1.hrc and act2.hrc) are not nested. In that case, we write the two arguments as follows:- hrc = c(ACTIVITY = “path_to_file/act_1.hrc”): By default, this hierarchy will be used.
- alt_hrc = list(T2 = c(ACTIVITY = “path_to_file/act_2.hrc”)): In the table T2, the alternative hierarchy will be used.
totcode The code for total for each explanatory variables. It is recommanded to use the same totcode for each variables. If for example the code is “Total” for all variables. This syntax is allowed : totcode = “Total”
Otherwise the expected input is a list specifying the totcode for each explanatory vars. For example : totcode = list(ACTIVITY = “Total”, NUTS = “FR”, SIZE =“Ensemble”, CJ = “Total”) Default : “Total”
Warning : If the totals are not in the table, they will be computed by Tau-Argus, but they won’t be eligible for primary suppression. We, then, advise users to provide totals in their tables.
alt_totcode A named list for alternative total codes. See
alt_hrc
for use.value The colname of the response variable in the tables, it MUST be the same name for each tables. For example : value = “turnover” Default : “value”
freq The colname of the frequency variable in the tables, it MUST be the same name for each tables. For example : freq = “frequency” Default : “freq”
secret_var The name of the boolean variable specifying primary suppression in the tables, it MUST be the same name for each tables. For example : secret_var = “is_secret_prim” Default : “is_secret_prim”
suppress The algorithm required to perform secondary suppression, explained in the safety_rules vignette. If Modular approach is chosen, after the first iteration on a given table, the singleton, multi-singleton and minFreq options are deactivated. default : MOD(1,5,1,0,0)”
ip_start The manual safety range for the first iteration on a table (integer) default : 10
ip_end The manual safety range for the second iteration on a table (integer) default : 0
num_iter_max This parameter is here to ensure the fact that the function will stop. default : 10
Some examples
Let’s specify the location of the TauArgus.exe file in our computer:
options(
rtauargus.tauargus_exe =
"Y:/Logiciels/TauArgus/TauArgus4.2.3/TauArgus.exe"
)
About protecting 4 linked tables at once
In the following example we are going to protect a set of 4 linked tables sharing the same response variable, that is turnover.
Preparing the data
data("turnover_act_size")
data("turnover_act_cj")
data ("turnover_nuts_size")
data("turnover_nuts_cj")
ACTIVITY X SIZE named turnover_act_size ACTIVITY X CJ named turnover_act_cj NUTS X SIZE named turnover_nuts_size NUTS X CJ named turnover_nuts_cj
str(turnover_act_size)
#> tibble [414 x 5] (S3: tbl_df/tbl/data.frame)
#> $ ACTIVITY: chr [1:414] "AZ" "BE" "FZ" "GI" ...
#> $ SIZE : chr [1:414] "Total" "Total" "Total" "Total" ...
#> $ N_OBS : int [1:414] 405 12878 28043 62053 8135 8140 11961 41359 26686 25108 ...
#> $ TOT : num [1:414] 44475 24827613 8907311 26962063 8584917 ...
#> $ MAX : num [1:414] 6212 1442029 1065833 3084242 3957364 ...
str(turnover_nuts_cj)
#> tibble [452 x 5] (S3: tbl_df/tbl/data.frame)
#> $ NUTS : chr [1:452] "FR10" "FR21" "FR22" "FR23" ...
#> $ CJ : chr [1:452] "Total" "Total" "Total" "Total" ...
#> $ N_OBS: int [1:452] 38462 6769 4561 5090 8611 7811 5643 10411 8179 5163 ...
#> $ TOT : num [1:452] 33026385 2947560 1917663 3701935 5089279 ...
#> $ MAX : num [1:452] 3084242 544763 651848 298134 1165019 ...
The first step is to create a list containing our four tables, it is advised to give a name to each tables. By doing so it will be easier to track secondary suppression.
list_data_4_tabs <- list(
act_size = turnover_act_size,
act_cj = turnover_act_cj,
nuts_size = turnover_nuts_size,
nuts_cj = turnover_nuts_cj
)
str(list_data_4_tabs)
#> List of 4
#> $ act_size : tibble [414 x 5] (S3: tbl_df/tbl/data.frame)
#> ..$ ACTIVITY: chr [1:414] "AZ" "BE" "FZ" "GI" ...
#> ..$ SIZE : chr [1:414] "Total" "Total" "Total" "Total" ...
#> ..$ N_OBS : int [1:414] 405 12878 28043 62053 8135 8140 11961 41359 26686 25108 ...
#> ..$ TOT : num [1:414] 44475 24827613 8907311 26962063 8584917 ...
#> ..$ MAX : num [1:414] 6212 1442029 1065833 3084242 3957364 ...
#> $ act_cj : tibble [406 x 5] (S3: tbl_df/tbl/data.frame)
#> ..$ ACTIVITY: chr [1:406] "AZ" "BE" "FZ" "GI" ...
#> ..$ CJ : chr [1:406] "Total" "Total" "Total" "Total" ...
#> ..$ N_OBS : int [1:406] 405 12878 28043 62053 8135 8140 11961 41359 26686 25108 ...
#> ..$ TOT : num [1:406] 44475 24827613 8907311 26962063 8584917 ...
#> ..$ MAX : num [1:406] 6212 1442029 1065833 3084242 3957364 ...
#> $ nuts_size: tibble [460 x 5] (S3: tbl_df/tbl/data.frame)
#> ..$ NUTS : chr [1:460] "FR10" "FR21" "FR22" "FR23" ...
#> ..$ SIZE : chr [1:460] "Total" "Total" "Total" "Total" ...
#> ..$ N_OBS: int [1:460] 38462 6769 4561 5090 8611 7811 5643 10411 8179 5163 ...
#> ..$ TOT : num [1:460] 33026385 2947560 1917663 3701935 5089279 ...
#> ..$ MAX : num [1:460] 3084242 544763 651848 298134 1165019 ...
#> $ nuts_cj : tibble [452 x 5] (S3: tbl_df/tbl/data.frame)
#> ..$ NUTS : chr [1:452] "FR10" "FR21" "FR22" "FR23" ...
#> ..$ CJ : chr [1:452] "Total" "Total" "Total" "Total" ...
#> ..$ N_OBS: int [1:452] 38462 6769 4561 5090 8611 7811 5643 10411 8179 5163 ...
#> ..$ TOT : num [1:452] 33026385 2947560 1917663 3701935 5089279 ...
#> ..$ MAX : num [1:452] 3084242 544763 651848 298134 1165019 ...
Applying primary suppression rules
Then we need to apply the primary suppression for each table. Here we apply 2 rules : The dominance rule NK(1,85) and the frequency rule with a threshold set to 3.
list_data_4_tabs <- list_data_4_tabs %>%
purrr::map(
function(df){
df %>%
mutate(
is_secret_freq = N_OBS > 0 & N_OBS < 3,
is_secret_dom = ifelse(MAX == 0, FALSE, MAX/TOT>0.85),
is_secret_prim = is_secret_freq | is_secret_dom
)
}
)
str(list_data_4_tabs)
#> List of 4
#> $ act_size : tibble [414 x 8] (S3: tbl_df/tbl/data.frame)
#> ..$ ACTIVITY : chr [1:414] "AZ" "BE" "FZ" "GI" ...
#> ..$ SIZE : chr [1:414] "Total" "Total" "Total" "Total" ...
#> ..$ N_OBS : int [1:414] 405 12878 28043 62053 8135 8140 11961 41359 26686 25108 ...
#> ..$ TOT : num [1:414] 44475 24827613 8907311 26962063 8584917 ...
#> ..$ MAX : num [1:414] 6212 1442029 1065833 3084242 3957364 ...
#> ..$ is_secret_freq: logi [1:414] FALSE FALSE FALSE FALSE FALSE FALSE ...
#> ..$ is_secret_dom : logi [1:414] FALSE FALSE FALSE FALSE FALSE FALSE ...
#> ..$ is_secret_prim: logi [1:414] FALSE FALSE FALSE FALSE FALSE FALSE ...
#> $ act_cj : tibble [406 x 8] (S3: tbl_df/tbl/data.frame)
#> ..$ ACTIVITY : chr [1:406] "AZ" "BE" "FZ" "GI" ...
#> ..$ CJ : chr [1:406] "Total" "Total" "Total" "Total" ...
#> ..$ N_OBS : int [1:406] 405 12878 28043 62053 8135 8140 11961 41359 26686 25108 ...
#> ..$ TOT : num [1:406] 44475 24827613 8907311 26962063 8584917 ...
#> ..$ MAX : num [1:406] 6212 1442029 1065833 3084242 3957364 ...
#> ..$ is_secret_freq: logi [1:406] FALSE FALSE FALSE FALSE FALSE FALSE ...
#> ..$ is_secret_dom : logi [1:406] FALSE FALSE FALSE FALSE FALSE FALSE ...
#> ..$ is_secret_prim: logi [1:406] FALSE FALSE FALSE FALSE FALSE FALSE ...
#> $ nuts_size: tibble [460 x 8] (S3: tbl_df/tbl/data.frame)
#> ..$ NUTS : chr [1:460] "FR10" "FR21" "FR22" "FR23" ...
#> ..$ SIZE : chr [1:460] "Total" "Total" "Total" "Total" ...
#> ..$ N_OBS : int [1:460] 38462 6769 4561 5090 8611 7811 5643 10411 8179 5163 ...
#> ..$ TOT : num [1:460] 33026385 2947560 1917663 3701935 5089279 ...
#> ..$ MAX : num [1:460] 3084242 544763 651848 298134 1165019 ...
#> ..$ is_secret_freq: logi [1:460] FALSE FALSE FALSE FALSE FALSE FALSE ...
#> ..$ is_secret_dom : logi [1:460] FALSE FALSE FALSE FALSE FALSE FALSE ...
#> ..$ is_secret_prim: logi [1:460] FALSE FALSE FALSE FALSE FALSE FALSE ...
#> $ nuts_cj : tibble [452 x 8] (S3: tbl_df/tbl/data.frame)
#> ..$ NUTS : chr [1:452] "FR10" "FR21" "FR22" "FR23" ...
#> ..$ CJ : chr [1:452] "Total" "Total" "Total" "Total" ...
#> ..$ N_OBS : int [1:452] 38462 6769 4561 5090 8611 7811 5643 10411 8179 5163 ...
#> ..$ TOT : num [1:452] 33026385 2947560 1917663 3701935 5089279 ...
#> ..$ MAX : num [1:452] 3084242 544763 651848 298134 1165019 ...
#> ..$ is_secret_freq: logi [1:452] FALSE FALSE FALSE FALSE FALSE FALSE ...
#> ..$ is_secret_dom : logi [1:452] FALSE FALSE FALSE FALSE FALSE FALSE ...
#> ..$ is_secret_prim: logi [1:452] FALSE FALSE FALSE FALSE FALSE FALSE ...
Three variables have been added to each table. Only the last one will be used during the protection process. The two others will be useful to state the final status of the cells.
Preparing the list of explanatory variables
Then we are going to create the list of explanatory variables of the tables. In our case, the explanatory variables are the 2 first columns of each table.
nom_var_list <- purrr::map(
list_data_4_tabs,
function(data) colnames(data)[1:2]
)
nom_var_list
#> $act_size
#> [1] "ACTIVITY" "SIZE"
#>
#> $act_cj
#> [1] "ACTIVITY" "CJ"
#>
#> $nuts_size
#> [1] "NUTS" "SIZE"
#>
#> $nuts_cj
#> [1] "NUTS" "CJ"
The names of the list have to be the names of the corresponding
tables in the list_tables
argument.
Preparing the list of total codes
The labels used to mention the total for each variable have to declared by the user, in a named list. However, when all of the variables have the same label to refer to the total, the information can be mentionned as follows:
total_codes <- "Total"
Making the hierarchical files
ACTIVITY
and NUTS
are two hierarchical
variables. We use a correspondance table for each of them to create the
.hrc
files that tau-Argus need.
hrc_file_activity <- write_hrc2(
corr_table = activity_corr_table,
file_name = "hrc/activity.hrc"
)
hrc_file_nuts <- write_hrc2(
corr_table = nuts23_fr_corr_table,
file_name = "hrc/nuts.hrc"
)
To mention the names of the hrc files, we create a named vector (or list) as follows:
hrc_files <- list(
ACTIVITY = hrc_file_activity,
NUTS = hrc_file_nuts
)
The names of the vector have to be the names of the corresponding hierarchical variables.
Running the protection of all the tables at once
res <- tab_multi_manager(
list_tables = list_data_4_tabs,
list_explanatory_vars = nom_var_list,
dir_name = "tauargus_files/ex4",
hrc = hrc_files,
totcode = total_codes,
value = "TOT",
freq = "N_OBS",
secret_var = "is_secret_prim"
)
#> --- Current table to treat: act_size ---
#> --- Current table to treat: act_cj ---
#> --- Current table to treat: act_size ---
#> --- Current table to treat: nuts_size ---
#> --- Current table to treat: nuts_cj ---
Results
list_with_status <- res %>%
purrr::map(
function(df){
df %>%
rename_with(~"final_suppress", last_col()) %>%
mutate(
status = case_when(
is_secret_freq ~ "A",
is_secret_dom ~ "B",
final_suppress ~ "D",
TRUE ~"V"
)
) %>%
select(1:2, TOT, N_OBS, status)
}
)
str(list_with_status$act_size)
#> 'data.frame': 414 obs. of 5 variables:
#> $ ACTIVITY: chr "01" "01" "02" "02" ...
#> $ SIZE : chr "Total" "tr1" "Total" "tr1" ...
#> $ TOT : num 853 853 43623 35503 8120 ...
#> $ N_OBS : int 18 18 387 381 6 1 1 4 4 84 ...
#> $ status : chr "V" "V" "V" "V" ...
list_with_status %>%
purrr::iwalk(
function(df,name){
cat(name, "\n")
df %>% count(status) %>% print()
}
)
#> act_size
#> status n
#> 1 A 52
#> 2 B 25
#> 3 D 83
#> 4 V 254
#> act_cj
#> status n
#> 1 A 35
#> 2 B 25
#> 3 D 88
#> 4 V 258
#> nuts_size
#> status n
#> 1 A 55
#> 2 B 17
#> 3 D 82
#> 4 V 306
#> nuts_cj
#> status n
#> 1 A 45
#> 2 B 20
#> 3 D 101
#> 4 V 286
Dealing with non-nested hierarchies
Let’s assume that we’d like to protect the two tables crossing
ACTIVITY
and SIZE
in one hand and
ACTIVITY
and CJ
in the other hand. The
ACTIVITY
variable is a hierarchical one. Moreover, in that
case, we’d like to release a subtotal, called D_TO_M
, in
addition to those in the activity_corr_table
. This subtotal
is the sum of the D
to M
labels of the
A21
level. As this subtotal can’t be inserted in the main
hierarchy, this leads us to a case of non-nested hierarchies.
To confirm the non-nesting of the two hierachies, let’s display an extract of both.
Extract of the main hierarchy:
sdcHierarchies::hier_create(
root="Total",
nodes = activity_corr_table$A10 %>% unique()
) %>%
sdcHierarchies::hier_display()
#> Total
#> +-AZ
#> +-BE
#> +-FZ
#> +-GI
#> +-JZ
#> +-KZ
#> +-LZ
#> +-MN
#> +-OQ
#> \-RU
Extract of the alternative hierarchy:
sdcHierarchies::hier_create(
root="D_TO_M",
nodes = activity_corr_table_D_TO_M$A21
) %>%
sdcHierarchies::hier_display()
#> Error in sdcHierarchies::hier_create(root = "D_TO_M", nodes = activity_corr_table_D_TO_M$A21): object 'activity_corr_table_D_TO_M' not found
To handle this case, the preferred approach is to create a third
table crossing ACTIVITY
and SIZE
with D to M
labels and the subtotal D_TO_M
:
turnover_act_size_D_TO_M <- turnover_act_size %>%
filter(
ACTIVITY %in% LETTERS[4:13]
) %>%
bind_rows(
turnover_act_size %>%
filter(
ACTIVITY %in% LETTERS[4:13]
) %>%
group_by(SIZE) %>%
summarise(N_OBS = sum(N_OBS), TOT = sum(TOT), MAX = max(MAX)) %>%
mutate(ACTIVITY = "D_TO_M")
)
str(turnover_act_size_D_TO_M)
#> tibble [44 x 5] (S3: tbl_df/tbl/data.frame)
#> $ ACTIVITY: chr [1:44] "D" "E" "F" "G" ...
#> $ SIZE : chr [1:44] "Total" "Total" "Total" "Total" ...
#> $ N_OBS : int [1:44] 1411 828 28043 41624 6524 13905 8135 8140 11961 28221 ...
#> $ TOT : num [1:44] 2438454 2264393 8907311 18244309 6273334 ...
#> $ MAX : num [1:44] 981369 306905 1065833 765244 3084242 ...
Let’s create the alternative hierarchy file.
activity_corr_table_D_TO_M <- activity_corr_table %>%
filter(A21 %in% LETTERS[4:13]) %>%
select(-A88) %>%
mutate(A10 = "D_TO_M") %>%
unique()
hrc_file_activity_D_TO_M <- activity_corr_table_D_TO_M %>%
select(-1) %>%
write_hrc2(file_name = "hrc/activity_D_TO_M")
Then, let’s build the list of tables and apply the primary suppression rules on them.
list_data_3_tabs_nn <- list(
act_size = turnover_act_size,
act_size_D_TO_M = turnover_act_size_D_TO_M,
act_cj = turnover_act_cj
) %>%
purrr::map(
function(df){
df %>%
mutate(
is_secret_freq = N_OBS > 0 & N_OBS < 3,
is_secret_dom = ifelse(MAX == 0, FALSE, MAX/TOT>0.85),
is_secret_prim = is_secret_freq | is_secret_dom
)
}
)
The hrc
and alt_hrc
arguments are used to
mention both hierarchies of the ACTIVITY
variable:
-
hrc
is used as int the previous example (a named vector where names are the different hierarchical variables appearingin the tables and values are the files names). Hence, the hierarchy file will be taken in this vector by default. -
alt_hrc
is only used to mention the alternative hierarchies. This argument is a named list where names are the tables where the alternative hierarchies have to be applied.
The same rationale is used to fill the alt_totcode
argument.
res <- tab_multi_manager(
list_tables = list_data_3_tabs_nn,
list_explanatory_vars = list(
act_size = c("ACTIVITY", "SIZE"),
act_size_D_TO_M = c("ACTIVITY", "SIZE"),
act_cj = c("ACTIVITY", "CJ")
),
hrc = c(ACTIVITY = hrc_file_activity),
alt_hrc = list(
act_size_D_TO_M = c(ACTIVITY = hrc_file_activity_D_TO_M)
),
dir_name = "tauargus_files/ex5",
value = "TOT",
freq = "N_OBS",
secret_var = "is_secret_prim",
totcode = "Total",
alt_totcode = list(
act_size_D_TO_M = c(ACTIVITY = "D_TO_M")
)
)
#> --- Current table to treat: act_size ---
#> --- Current table to treat: act_size_D_TO_M ---
#> --- Current table to treat: act_cj ---
#> --- Current table to treat: act_size ---
#> --- Current table to treat: act_size_D_TO_M ---
#> --- Current table to treat: act_size ---
#> --- Current table to treat: act_cj ---
About this vignette
- Authors: Julien Jamme & Nathanael Rastout
- Last update: 21/09/2023
- Version of rtauargus used: 1.2.0
- Version of τ-Argus used : TauArgus 4.2.3
- R version used : 4.2.3