Methodological work on the Family Budget survey
Modernisation of the family budget survey using automatic classification tools
1 Jan 2022
1 January 2021
| Automatic coding of occupations in the PCS 2020 nomenclature | |
|---|---|
| Project details | The 2020 renewal of the PCS nomenclature is accompanied by the promotion of an autocompletion tool for occupation labels. The tool draws on a list of enriched headings and allows responses to be coded directly into a category of the nomenclature, or into ad hoc groupings that complement it. However, the autocompletion tool will only be available for computerised data collection, and it also allows respondents to answer “off-list”. To integrate the new PCS 2020 into the population census for the 2024 collection, and to code paper forms as well as computerised “off-list” responses, an algorithm that automatically codes these forms in PCS 2020 must be developed. Following the postponement of the 2021 census survey, managers annotated 119,000 forms from the 2020 annual census survey (EAR 2020) in PCS 2020, with double coding and arbitration. The aim of the experiment is to test and compare different statistical learning models and pre-processing methods for coding occupations in the PCS 2020 nomenclature, based on the occupation wording and the ancillary variables used during the annotation phase (employer status, etc.), and to select the best-performing model. The goal is to keep the rate of correct codings and the rate of cases sent for manual coding at levels similar to the current situation. |
| Stakeholders | Insee |
| Project products and documentation | - Application of machine learning techniques to code occupations in the nomenclature of occupations and socio-professional categories 2020, 2022 Statistical Methodology Days (Journées de méthodologie statistique 2022) |
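
The trade-off described in the project details, between the rate of correct codings and the rate of cases sent for manual coding, can be illustrated with a small sketch. It is not the model selected by the project: it uses a simple nearest-centroid classifier over character trigrams as a stand-in for the statistical learning models that were compared, it ignores the ancillary variables, and the occupation labels and the codes `CODE_BAKER` and `CODE_NURSE` are hypothetical placeholders rather than real PCS 2020 categories. A label is coded automatically only when its similarity score reaches a threshold; otherwise it is routed to manual coding.

```python
import math
from collections import Counter


def char_ngrams(label, n=3):
    """Pre-processing: lowercase, pad with spaces, count character n-grams."""
    text = f" {label.lower().strip()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fit_centroids(examples):
    """Sum the n-gram counts of all annotated labels sharing a code."""
    centroids = {}
    for label, code in examples:
        centroids.setdefault(code, Counter()).update(char_ngrams(label))
    return centroids


def code_label(label, centroids, threshold=0.5):
    """Return (code, score); code is None when the label goes to manual coding."""
    grams = char_ngrams(label)
    best_code, best_score = max(
        ((code, cosine(grams, centroid)) for code, centroid in centroids.items()),
        key=lambda pair: pair[1],
    )
    return (best_code if best_score >= threshold else None), best_score


def evaluate(test_set, centroids, threshold=0.5):
    """Return (accuracy among auto-coded labels, share sent to manual coding)."""
    predictions = [(code_label(label, centroids, threshold)[0], truth)
                   for label, truth in test_set]
    coded = [(pred, truth) for pred, truth in predictions if pred is not None]
    manual_rate = 1 - len(coded) / len(predictions)
    accuracy = sum(pred == truth for pred, truth in coded) / len(coded) if coded else 0.0
    return accuracy, manual_rate


# Hypothetical annotated labels (placeholder codes, not real PCS 2020 codes).
training = [
    ("boulanger", "CODE_BAKER"),
    ("boulangère", "CODE_BAKER"),
    ("boulanger pâtissier", "CODE_BAKER"),
    ("infirmier", "CODE_NURSE"),
    ("infirmière", "CODE_NURSE"),
    ("infirmière libérale", "CODE_NURSE"),
]
held_out = [
    ("boulangere", "CODE_BAKER"),
    ("infirmier liberal", "CODE_NURSE"),
    ("zzz", "CODE_BAKER"),  # unrecognisable wording, expected to go to manual coding
]

centroids = fit_centroids(training)
accuracy, manual_rate = evaluate(held_out, centroids, threshold=0.5)
print(f"accuracy on auto-coded labels: {accuracy:.2f}, manual rate: {manual_rate:.2f}")
```

Raising `threshold` sends more labels to manual coding and usually raises accuracy on the labels that are still coded automatically. Comparing candidate models at a fixed manual rate is one way to frame the selection of the best-performing model.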