Project summary
| Enriching checkout data with nutritional information to analyse the impact of consumption on health | |
|---|---|
| Project details | In addition to the many well-documented dimensions of inequality (income, wealth, education, housing, access to healthcare and public services in general), disparities in food consumption are also likely to be a source of health inequalities, as well as social and territorial markers. Checkout data could provide a very rich description of local consumption, provided that product identifiers allow enrichment with external sources, such as nutritional information. The aim of this project is to enrich supermarket data with nutritional information extracted from Open Food Facts, supplemented by Ciqual data from Anses. |
| Participants | Insee |
| Project results | To compensate for incomplete barcode matching, a method for efficiently matching short product labels was implemented. After a pre-processing stage that normalises the short labels, fuzzy matching techniques are applied: a custom Elasticsearch index built on several tokenizers (including n-grams) is queried, and the candidate matches it returns are validated with a Levenshtein distance. The pipeline consists of several stages that successively relax constraints to find relevant candidates. The final match is then evaluated using a similarity measure based on word embeddings obtained by training a Siamese network on the exact barcode matches. The data used, covering the period 2015-2018, comes from several chains belonging to the same mass retail group (relevanC). |
| Project products and documentation | - Enriching checkout data with nutritional information: a fuzzy matching approach on high-dimensional data, 2022 Statistical Methodology Days (Journées de méthodologie statistique 2022) |
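
The label-matching steps described under "Project results" can be illustrated with a minimal sketch. It is not the project's implementation: an in-memory Jaccard overlap on character trigrams stands in for the Elasticsearch index query, the function names (`normalise`, `char_ngrams`, `levenshtein`, `match`) and the thresholds `min_overlap` and `max_ratio` are hypothetical, and the final scoring by Siamese-network embeddings is left out.

```python
import re
import unicodedata


def normalise(label: str) -> str:
    """Strip accents and punctuation, upper-case, collapse whitespace."""
    text = unicodedata.normalize("NFKD", label)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^A-Za-z0-9 ]+", " ", text).upper()
    return " ".join(text.split())


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Character n-grams of the label, padded with one space on each side."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match(label, reference, min_overlap=0.3, max_ratio=0.4):
    """Return the best reference name for a short label, or None.

    Stage 1 keeps candidates whose trigram Jaccard overlap reaches
    min_overlap (a stand-in for the index query). Stage 2 validates them
    with a length-normalised Levenshtein distance. Both thresholds are
    assumptions chosen for illustration.
    """
    query = normalise(label)
    query_grams = char_ngrams(query)
    best = None
    for name in reference:
        candidate = normalise(name)
        candidate_grams = char_ngrams(candidate)
        union = query_grams | candidate_grams
        if not union:
            continue
        if len(query_grams & candidate_grams) / len(union) < min_overlap:
            continue
        ratio = levenshtein(query, candidate) / max(len(query), len(candidate), 1)
        if ratio <= max_ratio and (best is None or ratio < best[1]):
            best = (name, ratio)
    return best[0] if best else None
```

A staged pipeline in the spirit of the description could call `match` repeatedly with progressively looser thresholds, stopping at the first stage that returns a candidate.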











