Generating synthetic data

Author

Insee, Département des Méthodes Statistiques

library(readr)
library(purrr)
library(dplyr)
library(synthpop)

Importing the data

source("../R/fun_import_data.R")
lfs_2023 <- import_lfs()

head(lfs_2023)

      REG    DEP    ARR   SEXE   AGE   AGE6  ACTEU   DIP7  PCS1Q ANCCHOM  HHID
   <fctr> <fctr> <fctr> <fctr> <int> <fctr> <fctr> <fctr> <fctr>  <fctr> <int>
1:     28     76    761      1    53     50      1      4     30      99  3558
2:     28     76    761      2    43     25      1      7     52      99  3558
3:     28     76    761      2    17     15      3      5     99      99  3558
4:     28     76    761      1    17     15      3      4     99      99  3558
5:     11     92    922      1    42     25      1      7     62      99  5973
6:     11     92    922      2    54     50      1      7     62      99  5973
   HH_TAILLE HH_AGE HH_DIP HH_PCS IS_CHOM
      <fctr> <fctr> <fctr> <fctr>   <int>
1:         4     53      4     30       0
2:         4     53      4     30       0
3:         4     53      4     30       0
4:         4     53      4     30       0
5:         2     54      7     62       0
6:         2     54      7     62       0

For more information on the data, see the “Présentation des données” (data overview) sheet.

Preparing the data

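# Drop the household-level variables (names starting with "HH") and ARR,
# then store every variable except AGE as character.
lfs_orig <- lfs_2023 |>
    select(-starts_with("HH")) |>
    select(-ARR) |>
    mutate(across(-AGE, as.character))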

Variable order

The order of the variables strongly affects both processing time and the quality of the synthesis. In our experience, a reasonably efficient order places the numeric variables before the categorical ones and sorts the categorical variables by increasing number of categories.

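# Order the categorical variables by increasing number of distinct categories.
seq1 <- lfs_orig |>
    select(where(is.character)) |>
    unique() |>
    tidyr::pivot_longer(everything(), names_to = "var", values_to = "mod") |>
    unique() |>      # one row per (variable, category) pair
    count(var) |>    # number of categories per variable
    arrange(n) |>
    pull(var)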
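
The ordering rule described above also covers numeric variables, which `seq1` leaves out because it only looks at character columns. As a minimal sketch in base R, assuming a small made-up table (the values and the name `visit_order` are illustrative, not taken from the LFS data), the full rule could look like this:

```r
# Toy data: hypothetical values, used only to illustrate the ordering rule.
toy <- data.frame(
  AGE  = c(53L, 43L, 17L, 17L, 42L, 54L),
  SEXE = c("1", "2", "2", "1", "1", "2"),
  REG  = c("28", "28", "11", "11", "76", "76"),
  DIP7 = c("4", "7", "5", "4", "7", "1"),
  stringsAsFactors = FALSE
)

# Numeric variables go first.
is_num <- vapply(toy, is.numeric, logical(1))

# Categorical variables follow, sorted by increasing number of categories.
n_mod <- vapply(toy[!is_num], function(x) length(unique(x)), integer(1))

visit_order <- c(names(toy)[is_num], names(sort(n_mod)))
visit_order
# "AGE" "SEXE" "REG" "DIP7"
```

A vector built this way can be passed to the `visit.sequence` argument of `synthpop::syn()`, provided every name it contains is a column of the data being synthesised.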

Synthesising

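# AGE is dropped here, so only the categorical variables are synthesised (with CART),
# visited in the order given by seq1.
lfs_syn <- synthpop::syn(lfs_orig |> select(-AGE), method = "cart", visit.sequence = seq1)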


Variable(s): REG, DEP, SEXE, AGE6, ACTEU, DIP7, PCS1Q, ANCCHOM, IS_CHOM have been changed for synthesis from character to factor.

Synthesis
-----------
 IS_CHOM SEXE ACTEU REG AGE6 DIP7 ANCCHOM PCS1Q DEP
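
To experiment with the same call without access to the LFS extract, the sketch below runs it on `SD2011`, the example dataset shipped with synthpop. The choice of columns, the 500-row extract and the seed value are assumptions made for illustration. Fixing `seed` makes the run reproducible, and the returned object exposes the synthetic data in `$syn` and the per-variable methods in `$method`.

```r
library(synthpop)

# Small extract of the dataset bundled with synthpop (illustrative choice of columns).
ods <- SD2011[1:500, c("sex", "edu", "marital", "region")]

# Visit the categorical variables by increasing number of levels.
n_lev <- vapply(ods, nlevels, integer(1))
vs <- names(sort(n_lev))

# Same kind of call as above, with a fixed seed and console output switched off.
s <- syn(ods, method = "cart", visit.sequence = vs, seed = 2023, print.flag = FALSE)

head(s$syn)   # synthetic data frame, one row per original record by default
s$method      # synthesising method actually used for each variable
```

Note that the first variable in the visit sequence has no predictors, so synthpop draws it by random sampling from the observed values rather than with CART; `s$method` shows this.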

Risk / Utility

synthpop::disclosure(lfs_syn, lfs_orig, keys = c("AGE6", "SEXE", "REG", "DEP"), target = c("IS_CHOM"))

-------------------Synthesis 1 --------------------
Table for target IS_CHOM from GT alone with keys has 2 rows 240 colums.
Table for target  IS_CHOM from GT & SD with all key combinations has 2 rows 242 colums.

Disclosure measures from synthesis for 34053 records in original data.

Identity  measures for keys REG DEP SEXE AGE6 
and attribute measures for IS_CHOM from the same keys
            Identity (UiO/repU) Attrib (Dorig/DiSCO)
Original                   0.01                20.71
Synthesis 1                0.00                22.23

The 1 way distributions of IS_CHOM has a large contribution to disclosure from level Check  IS_CHOM  level  0 
Please add 'check_1way' to to.print (e.g. print(disclosure_result, to.print = 'check_1way'),
and look at original data to decide if backround knowledge would make this disclosure likely
Consider excluding this level with the not.targetlev parameter to the disclosure function.


Details of target-key pairs contributing disproportionately to disclosure of IS_CHOM 
37 pairs need checks 

Please examine component $check_2way of the disclosure object and look at original data.
Consider excluding these key-target pairs with some the following parameters to disclosure:
exclude_ov_denom_lim = TRUE or defining key-target combinations from exclude.targetlevs,
exclude.keys and exclude.keylevs
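
The messages above point to two follow-up checks: printing the `check_1way` component and examining `$check_2way`. As a hedged sketch, the steps below store the result of `disclosure()` in an object and inspect both. The example uses the `SD2011` dataset shipped with synthpop rather than the LFS data, and the keys, target and seed are illustrative assumptions.

```r
library(synthpop)

# Illustrative extract and synthesis (not the LFS data).
ods <- SD2011[1:500, c("sex", "edu", "marital", "region")]
s <- syn(ods, method = "cart", seed = 2023, print.flag = FALSE)

# Keep the disclosure result instead of only printing it.
d <- disclosure(s, ods,
                keys = c("sex", "edu", "region"),
                target = "marital",
                print.flag = FALSE)

# Levels of the target that contribute heavily to disclosure on their own.
print(d, to.print = "check_1way")

# Key-target pairs flagged as contributing disproportionately.
d$check_2way
```

Applied to the objects in this document, the same pattern would store the result of the `disclosure(lfs_syn, lfs_orig, ...)` call in a variable. If the flagged level of `IS_CHOM` turns out to be uninformative for an intruder, the output suggests rerunning the function with the `not.targetlev` parameter to exclude it.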