Automatic extraction of the table of subsidiaries and holdings from company accounts

Extract information from company accounts tables, in particular tables of subsidiaries and holdings, contained in scanned images made available by INPI via an API
Python
data extraction
API
machine learning
in production
Author

Nicolas

Published

1 January 2021

Project summary

Automatic extraction of the table of subsidiaries and holdings from company accounts
Project details

A trial of automated extraction of the table of subsidiaries and holdings from company accounts was started in 2022, following an internship in the PTGU division. This table is used manually during company profiling operations, which is tedious work. The experiment was conducted with the Banque de France, which shares this use case with Insee, and had two main objectives:

- to develop a prototype application enabling users to retrieve a table of subsidiaries and holdings automatically for a given Siren number and year;
- to compare the performance of different automatic table extraction methods, based on open-source tools on the one hand and commercial solutions on the other.
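
To make the first objective concrete, here is a minimal sketch of what the input to such a prototype might look like: a query made of a Siren number and a year, checked before any document is fetched. The `TableQuery` class, the `is_valid_siren` helper and the accepted year range are assumptions made for illustration and are not taken from the project's code. The only external fact relied on is that a Siren number has nine digits, the last of which is a Luhn check digit.

```python
from dataclasses import dataclass


def is_valid_siren(siren: str) -> bool:
    """Return True if `siren` has 9 ASCII digits and passes the Luhn checksum."""
    if len(siren) != 9 or not (siren.isascii() and siren.isdigit()):
        return False
    total = 0
    # Luhn: walking from the rightmost digit, double every second digit
    # and subtract 9 whenever the doubled value exceeds 9.
    for position, char in enumerate(reversed(siren)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


@dataclass(frozen=True)
class TableQuery:
    """Hypothetical request: one company (Siren) and one accounting year."""

    siren: str
    year: int

    def __post_init__(self) -> None:
        if not is_valid_siren(self.siren):
            raise ValueError(f"invalid Siren number: {self.siren!r}")
        # The plausible range below is an arbitrary assumption.
        if not 1900 <= self.year <= 2100:
            raise ValueError(f"implausible year: {self.year}")


query = TableQuery(siren="732829320", year=2022)
```

Rejecting malformed identifiers up front avoids spending an API call and an extraction run on a request that cannot match any company.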

Players

Insee, Inpi, Banque de France

Project results

- A new experimental API and an experimental interface were put in place to meet the business need.
- The project was also presented at an internal Insee seminar on 27 June 2024 (slides).

Project code

- https://github.com/InseeFrLab/ca-document-querier/ : Python wrapper for Inpi's enterprise API
- https://github.com/InseeFrLab/extraction-comptes-sociaux : "core" repository containing elements for page detection and table extraction
- https://github.com/InseeFrLab/extract-table-ui : code repository for the experimental interface
- https://github.com/InseeFrLab/extraction-comptes-sociaux-llm : source code for the overall architecture, comprising containerised microservices orchestrated by Kubernetes. It combines calls to external APIs (INPI), PDF processing and the use of language models (LLMs) for data analysis and extraction.
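
The way these repositories fit together can be pictured as a chain of three stages: fetching the accounts document, locating the page that holds the table, and extracting the table itself. The sketch below is only an illustration of that chaining. Every name in it (`run_pipeline`, the stage signatures, the stand-in functions) is hypothetical, and nothing here reproduces the actual interfaces of the repositories or of the Inpi API.

```python
from typing import Callable, List, Optional

# Each stage is a plain callable, so it could live behind its own service.
FetchPdf = Callable[[str, int], bytes]                  # (siren, year) -> PDF bytes
FindPage = Callable[[bytes], Optional[int]]             # PDF bytes -> page index, or None
ExtractTable = Callable[[bytes, int], List[List[str]]]  # (PDF, page) -> table rows


def run_pipeline(
    siren: str,
    year: int,
    fetch_pdf: FetchPdf,
    find_page: FindPage,
    extract_table: ExtractTable,
) -> List[List[str]]:
    """Chain the three stages; return an empty table if no page is found."""
    pdf = fetch_pdf(siren, year)
    page = find_page(pdf)
    if page is None:
        return []
    return extract_table(pdf, page)


# In-memory stand-ins so the sketch runs without any network or model.
def fake_fetch(siren: str, year: int) -> bytes:
    return f"accounts of {siren} for {year}".encode()


def fake_find_page(pdf: bytes) -> Optional[int]:
    return 3 if pdf else None


def fake_extract(pdf: bytes, page: int) -> List[List[str]]:
    return [["Subsidiary", "Share held"], ["Example SA", "100%"]]


rows = run_pipeline("732829320", 2022, fake_fetch, fake_find_page, fake_extract)
```

Keeping the stages as interchangeable callables also suits the comparison objective, since an open-source extractor and a commercial one could be swapped in behind the same signature.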