Project summary
| Automatic coding of occupations in the PCS 2020 nomenclature | |
|---|---|
| Project details | The renewal of the PCS nomenclature in 2020 is accompanied by the promotion of an autocompletion tool that matches occupation labels against an enriched list of headings, enabling direct coding into a category of the nomenclature or into ad hoc groupings that complement it. However, the autocompletion tool will only be available for computerised data collection, and it also allows “off-list” answers. In order to integrate the new PCS 2020 into the population census for the 2024 collection, and to code paper forms as well as computerised “off-list” responses, an algorithm that automatically codes these forms in PCS 2020 must be built. Following the postponement of the 2021 census survey, managers annotated 119,000 forms from the 2020 annual census survey (EAR 2020) in PCS 2020, with double coding and arbitration. The aim of the experiment is to test and compare different statistical learning models and pre-processing methods for coding occupations in the PCS 2020 nomenclature, on the basis of the occupation label and the ancillary variables used during the annotation phase (employer status, etc.), and to select the best-performing model. The goal is to keep the rate of correct codings and the rate of forms sent for manual processing at levels similar to the current situation. |
| Participants | Insee |
| Project results | - report on the experiment and results (December 2021) - paper and presentation at the “statistiques 2022” methodology days: “Application of machine learning techniques to code occupations in the nomenclature of occupations and socio-professional categories 2020” - article and slides presented at the Q2022 Conference: “Machine learning for coding occupations in the Census: first lessons from experiments to production” |
| Project code |
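
The project details name two rates to keep stable: the rate of correct codings and the rate of responses handled manually. One common way to relate them, shown here only as a minimal sketch and not as the method Insee actually used, is a confidence threshold: predictions scoring below the threshold are sent for manual processing. The `Prediction` type, the `evaluate` function, the threshold values, and the toy data (with placeholder labels `A`, `B`, `C` rather than real PCS 2020 codes) are all assumptions made for illustration.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    predicted: str     # code proposed by the model (placeholder label, not a real PCS 2020 code)
    confidence: float  # model score in [0, 1]
    truth: str         # code assigned by the annotators


def evaluate(preds: list[Prediction], threshold: float) -> tuple[float, float]:
    """Return (accuracy among automatically coded forms, manual processing rate)."""
    if not preds:
        return 0.0, 0.0
    # Forms whose score reaches the threshold are coded automatically.
    auto = [p for p in preds if p.confidence >= threshold]
    # All other forms are sent for manual processing.
    manual_rate = (len(preds) - len(auto)) / len(preds)
    accuracy = sum(p.predicted == p.truth for p in auto) / len(auto) if auto else 0.0
    return accuracy, manual_rate


# Hypothetical toy data, invented purely to exercise the function.
preds = [
    Prediction("A", 0.95, "A"),
    Prediction("B", 0.90, "B"),
    Prediction("A", 0.60, "C"),
    Prediction("C", 0.55, "C"),
    Prediction("B", 0.30, "A"),
]

for threshold in (0.5, 0.8):
    accuracy, manual_rate = evaluate(preds, threshold)
    print(f"threshold={threshold}: accuracy={accuracy:.2f}, manual rate={manual_rate:.2f}")
```

Raising the threshold generally improves accuracy on the automatically coded forms at the cost of more manual work, which is why the two rates have to be compared together when selecting a model.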