NACE Rév. 2.1 update: Retraining an ML model in production using LLMs

UNECE Generative AI and Official Statistics Workshop 2025

1️⃣ Introduction

ML Model Robustness

  • ML models are trained using reference datasets tailored for a specific task.
  • In practice, real-world data often drifts over time away from these reference datasets ➡️ risk of performance degradation.
  • Regular retraining becomes essential.
  • A particularly challenging case in official statistics is a change in the classification itself.

2️⃣ Transition towards NACE Rév. 2.1

Timeline for Adoption 🗓️

  • Phased adoption approach:
  • 2025 ➡️ dual labeling in both NACE Rév. 2 and 2.1
  • 2026 ➡️ improving the NACE Rév. 2.1 classifier model
  • 2027 ➡️ full NACE Rév. 2.1 classification while maintaining legacy NACE Rév. 2 codes for specific usages.

What’s New in NACE Rév. 2.1?

  • At level 5: 746 sub-classes compared to 732 before.
  • Mainly fine-grained splits at class level (level 4), but not exclusively.
  • 551 unambiguous mappings, i.e., one-to-one correspondence ➡️ ideal case! 👌
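For the one-to-one codes, recoding amounts to a mechanical join against the correspondence table. A minimal Python sketch, using hypothetical file and column names (nace2_to_nace21_mapping.csv, registry_stock.csv, columns nace2 / nace21), not the production code:

import pandas as pd

mapping = pd.read_csv("nace2_to_nace21_mapping.csv", dtype=str)   # expert correspondence table
registry = pd.read_csv("registry_stock.csv", dtype=str)           # registry stock to recode

# A NACE Rév. 2 code is unambiguous when it maps to exactly one NACE Rév. 2.1 code.
fanout = mapping.groupby("nace2")["nace21"].nunique()
one_to_one = mapping[mapping["nace2"].isin(fanout[fanout == 1].index)]

# Unambiguous entries are recoded by a simple join; the rest needs expert or LLM review.
recoded = registry.merge(one_to_one[["nace2", "nace21"]], on="nace2", how="inner")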

Ambiguous Cases

  • 181 ambiguous mappings, i.e., one-to-many ➡️ challenging! 🚩
  • Requires expert review for proper recoding.
Show distribution of ambiguous codes

  1-to-N   # occurrences
       2             109
       3              30
       4              24
       5               6
       6               4
       8               1
       9               2
      21               1
      27               1
      36               1
      38               2

Multiple Challenges

  • Need to recode the stock of registry forms ➡️ over 14 million entries.
  • Building a classifier for the new data flow requires a clean stock as a training dataset.
  • Previous fastText model trained on over 10 million labeled entries.
  • Performance is highly sensitive to training data volume.

Available Data

  • Old registry dataset: \(\sim 10\) million entries, but poorly suited for the new labels.
  • New registry dataset: \(\sim 2.7\) million entries.
    • Unambiguous: \(1.3\) million, covering 504 sub-classes.
    • Ambiguous: \(1.4\) million, covering 177 sub-classes.
  • Manual annotation campaign is critical.
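Splitting the new registry stock into these two populations follows directly from the fan-out of the mapping. A short sketch, reusing the hypothetical names from the previous snippet:

import pandas as pd

mapping = pd.read_csv("nace2_to_nace21_mapping.csv", dtype=str)
registry = pd.read_csv("registry_stock.csv", dtype=str)

# Codes with a single NACE Rév. 2.1 counterpart vs. codes with several.
fanout = mapping.groupby("nace2")["nace21"].nunique()
unambiguous_codes = set(fanout[fanout == 1].index)

mask = registry["nace2"].isin(unambiguous_codes)
unambiguous = registry[mask]      # ~1.3 million entries, recoded by a join
ambiguous = registry[~mask]       # ~1.4 million entries, need annotation or LLM labeling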

Annotation Campaign

  • Annotation campaign running since mid-2024.
  • Focused solely on ambiguous cases.
  • 🎯 Dual objectives:
    • Assign a NACE Rév. 2.1 code 🚀
    • Assess NACE Rév. 2 code quality 🗸
  • Current count: \(\sim 27k\) annotated entries… still insufficient!

3️⃣ Methodology applied

Methodology

  • 🎯 Goal: Build the most comprehensive training dataset possible.
  • One-shot experiment ➡️ not intended to be reproduced in production.
  • Leveraging LLMs for automated NACE Rév. 2.1 labeling.
  • Data used:
    1. New registry stock dataset (\(\sim 2.7M\) records)
    2. Mapping table from NACE experts
    3. NACE explanatory notes
    4. Manually annotated data (\(\sim 27k\) entries)

Methodology

Leveraging LLMs

  • Augmented generation (RAG/CAG) vs fine-tuning
  1. RAG: unstructured prior knowledge, based on the similarity of explanatory-note embeddings
  2. CAG: structured prior knowledge, based on the known mappings
  • 💡 Core idea ➡️ Provide key information to the LLM to translate NACE Rév. 2 into 2.1

Warning

RAG can act as a zero-shot classifier, whereas CAG is not a classifier in itself, since it relies on prior knowledge (the existing NACE Rév. 2 code and its known mappings).
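In both settings, the only difference is how the candidate list handed to the LLM is built. A schematic sketch with illustrative names (embed stands for any sentence-embedding model), not the production code:

import numpy as np

def cag_candidates(nace2_code, mapping):
    # CAG: structured prior knowledge, candidates come straight from the expert mapping table.
    return mapping[nace2_code]

def rag_candidates(description, embed, note_embeddings, note_codes, k=10):
    # RAG: unstructured prior knowledge, candidates are the NACE Rév. 2.1 codes whose
    # explanatory notes are most similar to the activity description (cosine similarity).
    q = embed(description)
    sims = note_embeddings @ q / (
        np.linalg.norm(note_embeddings, axis=1) * np.linalg.norm(q) + 1e-12)
    top = np.argsort(-sims)[:k]
    return [note_codes[i] for i in top]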

Prompt Design

  • A common system prompt for all entries

    Show the system prompt

    You are an expert in the Statistical Classification of Economic Activities in the European Community (NACE). You are in charge of carrying out the change of classification. Your task is to assign a NACE 2025 code to a business, based on the description of its activity and a list of candidate codes (identified from its existing NACE 2008 code). Here are the instructions to follow:
    1. Analyse the description of the business's main activity and the NACE 2008 code provided by the user.
    2. From the list of available NACE 2025 codes, identify the most appropriate category matching the business's main activity.
    3. Return the NACE 2025 code in the JSON format specified by the user. If the description of the business's activity is not precise enough to identify a suitable NACE 2025 code, return `null` in the JSON.
    4. Assess the consistency between the provided NACE 2008 code and the description of the business's activity. If the NACE 2008 code does not seem to match this description, return `False` in the `nace08_valid` field of the JSON. Note that if you manage to classify the description of the business's activity under a NACE 2025 code, the `nace08_valid` field should be `True`; otherwise there is an inconsistency.
    5. Respond only with the completed JSON; no other information must be returned.
  • Each observation gets a custom prompt including:

    • Business activity description
    • Original NACE Rév. 2 code (in CAG)
    • Candidate codes list from retriever
  • Instructions on the required output format.
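A possible way to assemble the per-observation user prompt, with illustrative function and field names (the exact wording used in production is not shown here):

def build_user_prompt(description, nace2_code, candidates):
    # candidates: dict mapping each candidate NACE Rév. 2.1 code to its label / explanatory note.
    lines = [f"Activity description: {description}"]
    if nace2_code is not None:          # only provided in the CAG setting
        lines.append(f"NACE 2008 code: {nace2_code}")
    lines.append("Candidate NACE 2025 codes:")
    lines += [f"- {code}: {label}" for code, label in candidates.items()]
    lines.append('Expected output (JSON only): '
                 '{"codable": bool, "nace_2008_valid": bool, "nace2025": "code or null"}')
    return "\n".join(lines)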

Output Validation

  • LLMs tend to be overly verbose
  • Responses shaped into structured, minimal format
  • JSON is the preferred schema.
Show expected response format
{
    "codable": true,
    "nace_2008_valid": true,
    "nace2025": "0147J" 
}
  • Response parsing:
    1. Format check
    2. Detecting hallucinations
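A minimal validation sketch, assuming that hallucination detection means rejecting any code outside the candidate list supplied in the prompt (illustrative names):

import json

EXPECTED_FIELDS = {"codable", "nace_2008_valid", "nace2025"}

def parse_llm_response(raw, candidates):
    # Format check: the reply must be valid JSON carrying the expected fields.
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None                              # malformed output, discard
    if not isinstance(payload, dict) or not EXPECTED_FIELDS <= payload.keys():
        return None                              # missing fields, discard
    # Hallucination check: a proposed code must belong to the candidate list.
    if payload["codable"] and payload["nace2025"] not in candidates:
        return None
    return payload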

4️⃣ Results

Evaluation Challenge

  • ❓ Key question: how to evaluate an LLM?

  • Classification seems simpler… but the complexity of the taxonomy matters.

  • Used the \(\sim 27k\) manual annotations as the benchmark 🥇

  • 3 performance metrics:

    • Overall accuracy
    • Accuracy among codable entries
    • Accuracy of LLM only
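The first two metrics can be read as accuracies over nested subsets of the benchmark; the third depends on how predictions produced by the LLM itself are flagged. A sketch with hypothetical file and column names:

import pandas as pd

# Hypothetical benchmark file with columns y_true (manual label), y_pred, codable.
bench = pd.read_csv("benchmark_27k.csv", dtype=str)

correct = bench["y_pred"] == bench["y_true"]

overall_accuracy = correct.mean()                      # uncodable / null answers count as errors
codable_accuracy = correct[bench["codable"] == "true"].mean()  # restricted to entries the LLM coded
# "Accuracy of LLM only" further restricts to predictions made by the LLM itself
# (e.g. excluding entries resolved by the mapping alone), depending on its exact definition.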

Performance of models

Reconstructing Ambiguous Dataset

  • 💡Idea: Treat LLMs as additional annotators
  • ❓Can we boost performance via ensemble methods?
  • Built 3 additional annotation sets:
  1. Cascade selection
  2. Majority vote
  3. Weighted vote
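Treating each LLM configuration as an annotator, the three fusion rules can be sketched as below (illustrative code; the cascade simply takes the first annotator that returns a code, in a fixed order of preference):

from collections import Counter

def cascade(annotations):
    # Cascade selection: first non-null answer in a fixed preference order.
    return next((code for code in annotations if code is not None), None)

def majority_vote(annotations):
    # Most frequent non-null code, kept only if it wins a strict majority.
    votes = Counter(code for code in annotations if code is not None)
    if not votes:
        return None
    code, count = votes.most_common(1)[0]
    return code if count > len(annotations) / 2 else None

def weighted_vote(annotations, weights):
    # Same idea, each annotator weighted (e.g. by its accuracy on the benchmark).
    scores = Counter()
    for code, w in zip(annotations, weights):
        if code is not None:
            scores[code] += w
    return scores.most_common(1)[0][0] if scores else None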

Annotation Fusion

Retraining with NACE Rév. 2.1

  • Rebuilt new registry dataset with NACE Rév. 2.1 (~2M records)
  • Data distribution unchanged
  • New registry variables used
  • Achieved comparable performance to NACE Rév. 2 model
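Retraining then follows the same recipe as the previous NACE Rév. 2 model. A minimal fastText sketch, assuming the rebuilt ~2M-record dataset has been exported to fastText's __label__ text format (file names and hyperparameters are illustrative):

import fasttext

# train_nace21.txt: one record per line, e.g.
# __label__0147J  <activity description and other registry variables>
model = fasttext.train_supervised(
    input="train_nace21.txt",
    lr=0.2, epoch=25, wordNgrams=2, dim=100, loss="ova",
)
model.save_model("nace_rev21_classifier.bin")

print(model.test("test_nace21.txt"))   # (N, precision@1, recall@1)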

Retraining Accuracy

The NACE Rév. 2.1 Model