UNECE Machine Learning for Official Statistics Workshop 2023
5 June 2023
Several major changes:
Observation: Sicore is no longer a suitable tool ➨ 30% automated coding.
Consequence: Ideal moment to propose a new methodology for automated NACE coding.
\(\approx\) 10 million observations from Sirene 3 covering the period 2014-2022.
Data labeled both by Sicore and manually.
An observation consists of:
| Level | NACE | Title | Size |
|---|---|---|---|
| Section | H | Transportation and storage | 21 |
| Division | 52 | Warehousing and support activities for transportation | 88 |
| Group | 522 | Support activities for transportation | 272 |
| Class | 5224 | Cargo handling | 615 |
| Subclass | 5224A | Harbour handling | 732 |

C++) “bag of n-grams” model.| Text | NAT | TYP | EVT | SUR |
|---|---|---|---|---|
| Cours de musique | NaN | X | 01P | NaN |
“Cours de musique NAT_NaN TYP_X EVT_01P SUR_NaN”
| Transformation | Text description |
|---|---|
| Input | 3 D: La Deratisation - La Desinsectisation - La Desinfection |
| Lower-case conversion | 3 d: la deratisation - la desinsectisation - la desinfection |
| Punctuations removal | 3 d la deratisation la desinsectisation la desinfection |
| Transformation | Text description |
|---|---|
| Input | 3 D: La Deratisation - La Desinsectisation - La Desinfection |
| … | … |
| Numbers removal | d la deratisation la desinsectisation la desinfection |
| One-letter word removal | la deratisation la desinsectisation la desinfection |
| Stopwords removal | deratisation desinsectisation desinfection |
| Transformation | Text description |
|---|---|
| Input | 3 D: La Deratisation - La Desinsectisation - La Desinfection |
| … | … |
| NaN removal | deratisation desinsectisation desinfection |
| Stemming | deratis desinsectis desinfect |
Figure 1: Accuracy for various level of the NACE nomenclature.
Figure 2: Top-\(k\) accuracy per sample.
Figure 3: Distribution of the confidence index based on prediction results.
Figure 4: Accuracy for various shares of manual coding.