Reproducibility
At the INSEE Innovation Team, we place great value on reproducibility in scientific research. Reproducibility is crucial for the credibility and trustworthiness of scientific findings, which is why we prioritize documenting our research methodologies, data analyses, and experimental processes in a transparent and accessible manner. Our commitment to reproducibility is driven by a desire to contribute to the advancement of scientific knowledge and to promote rigorous research practices.
Getting started
To ensure full reproducibility of the results, the project is accompanied by a Docker image that contains all the necessary packages and dependencies. After installing Docker, you can pull the image using the following command in your terminal:
docker pull inseefrlab/esa-nowcasting-2023:latest
Alternatively, you can use the Onyxia instance SSPCloud (Comte, Degorre, and Lesur (2022)), a datalab developed by the French National Institute of Statistics and Economic Studies (INSEE) that provides an easy-to-use interface for running the Docker image.
To get started with SSPCloud:
- Step 0: Go to https://datalab.sspcloud.fr/home. Click on Sign In and then Create an account with your academic or institutional email address.
- Step 1: Click on the orange badge at the top of the page.
- Step 2: Open the service and follow the instructions regarding username and credentials.
- Step 3: Open a new project by clicking the following file: `~/work/ESA-Nowcasting-2023/ESA-Nowcasting-2023.Rproj`.
- Step 4: Ensure all necessary packages are installed by executing the `renv::restore()` command in the console. If prompted to proceed with the installation, enter `y`.
You are all set!
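For reference, step 4 in the console looks roughly like this (the exact prompt wording depends on your renv version):

```r
# Restore the project library from the renv.lock lockfile
renv::restore()
# When prompted to proceed with the installation, answer y
```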
Codes
Functions
All functions used in the project are organized by theme in the `R/` folder:
ESA-Nowcasting-2023
└─── R
│ data_preprocessing.R
│ data_retrieval.R
│ dfms_functions.R
│ ets_functions.R
│ lstm_functions.R
│ post_mortem_functions.R
│ regarima_functions.R
│ saving_functions.R
│ XGBoost_functions.R
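If you want to experiment with these functions interactively, one way to load them all at once is the short sketch below (this is for exploration only; the pipelines take care of loading what they need when run):

```r
# Source every function file in the R/ folder into the current session
invisible(lapply(
  list.files("R", pattern = "\\.R$", full.names = TRUE),
  source
))
```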
Configuration files
The project is composed of three configuration files that enable the operation of the models and the challenges as a whole. The first file, `challenges.yaml`, contains information about the challenges themselves, including the countries used for each challenge and the current dates.
The second file, `models.yaml`, is the backbone of the project, as it contains all of the parameters used for all the models and challenges. This file is responsible for ensuring that the models are appropriately tuned. Any adjustment made to this file can have a significant impact on the accuracy of the models, so it is vital that the parameters are fine-tuned carefully.
Finally, the `data.yaml` configuration file specifies all the relevant information about the data sources used in the challenge. It is essential to keep this file up to date, as changes or updates to data sources can have a significant impact on the accuracy of the models.
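Since all three files are plain YAML, they can be inspected directly from R. A minimal sketch, assuming the `yaml` package is installed:

```r
library(yaml)

# Load the three configuration files into R lists
challenges <- yaml::read_yaml("challenges.yaml")
models     <- yaml::read_yaml("models.yaml")
data_cfg   <- yaml::read_yaml("data.yaml")

# Peek at the top-level structure, e.g. countries and dates per challenge
str(challenges, max.level = 1)
```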
Pipelines
The project relies heavily on the `targets` package, a tool for creating and running reproducible pipelines in R. `targets` is particularly useful for managing large or complex data sets, as it allows you to define each task in a pipeline as a separate function and then run the pipeline by calling the `targets::tar_make()` function. This ensures that tasks run in the correct order and saves time by only running tasks that are out of date or have not been run before.
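To illustrate the idea, here is a minimal, hypothetical `targets` pipeline in the spirit of the project; the target and function names are illustrative, not the project's actual ones:

```r
# _targets.R — a minimal sketch of a targets pipeline
library(targets)

# Load the helper functions defined in the R/ folder
tar_source("R")

list(
  tar_target(raw_data, retrieve_data()),           # hypothetical, cf. data_retrieval.R
  tar_target(clean_data, preprocess(raw_data)),    # hypothetical, cf. data_preprocessing.R
  tar_target(predictions, run_models(clean_data))  # hypothetical model step
)
```

Calling `targets::tar_make()` then rebuilds only the targets whose code or upstream dependencies have changed.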
The project is decomposed into four different pipelines, specified in the `_targets.yaml` file:
- data: `run_data.R`
- ppi: `run_ppi.R`
- pvi: `run_pvi.R`
- tourism: `run_tourism.R`
The first pipeline retrieves all the data necessary for the different challenges, while the other three run the five models for each challenge independently. Each pipeline can be run using the following command: `targets::tar_make(script = "run_***.R")`.
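For instance, reproducing everything from scratch amounts to running the data pipeline first (the other three consume its output) and then each challenge pipeline:

```r
# Retrieve the data, then run the models for each challenge
targets::tar_make(script = "run_data.R")
targets::tar_make(script = "run_ppi.R")
targets::tar_make(script = "run_pvi.R")
targets::tar_make(script = "run_tourism.R")
```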
Note that the data used for the challenges is stored in a private bucket, and writing permissions are required to run the pipelines as is. Hence, if you don't have access to our private bucket, you have to run all four pipelines with the parameter `SAVE_TO_S3` set to `False`.
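As a purely hypothetical sketch, if `SAVE_TO_S3` is a flag read by the run scripts, disabling the S3 writes could look like this (check the `run_*.R` scripts for where the parameter is actually defined):

```r
# Hypothetical: disable writes to the private S3 bucket before running a pipeline
SAVE_TO_S3 <- FALSE
targets::tar_make(script = "run_data.R")
```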
Replicating past results
We have made it a priority to ensure the full reproducibility of all our past submissions. To achieve this, the data used for each specific submission is automatically saved in a publicly accessible S3 bucket, so anyone can easily access the exact datasets that were used in our analyses. If the model code has changed since a submission, it is simply a matter of checking out the commit corresponding to the submission date and adjusting the relevant date variables in the `challenges.yaml` configuration file. By combining code retrieval with the availability of the specific datasets, we have established a robust framework that enables the replication and verification of our past results. This commitment to transparency and reproducibility ensures that the findings and outcomes of our submissions can be reliably validated and built upon by anyone.
If you encounter any difficulties or require assistance in replicating our past results, please do not hesitate to reach out to us. We understand that the replication process can sometimes be challenging, and we are here to provide support and guidance. Our team is available to answer any questions, clarify any uncertainties, and offer further explanations regarding the methodologies, data, or code used in our previous submissions.