Project 3 - Population estimation from census data

The goal of this project is to perform a quick statistical analysis of a dataset whose format is not directly suited to analysis in Python. We will use the pandas library exclusively for the data analysis. To reproduce a situation you might encounter in practice, we strongly encourage you to consult the library’s documentation.

We will focus on the population estimates as of January 1st of each year. These estimates are derived from censuses and models of population change. The data is accessible on the Insee website at the following address: https://www.insee.fr/en/statistics/1893198. The file we will use can be downloaded directly via this url: https://www.insee.fr/fr/statistiques/fichier/1893198/estim-pop-dep-sexe-aq-1975-2023.xls.

import copy
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
import seaborn as sns

import solutions

Part 1: Downloading and formatting data

Before downloading the data with Python, it is necessary to know its format. In our case, it is the Excel format (.xls). It is also useful to look at what the data we want to import looks like, especially when its layout is not standard. So, before starting, take the time to glance at the data.

Question 0

Download the file from the direct download URL given above and open it with your favorite spreadsheet software. Analyze the data structure.

Question 1

Define the load_data() function, which takes no parameters and returns a dict whose keys are the names of the sheets (tabs) of our file and whose values are the data of the corresponding sheets as DataFrames. To do this, use a function from the pandas library with the appropriate parameters.

Expected result

data = solutions.load_data()
data["2022"]

Your turn!

def load_data():
    # Your code here
    return data
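
As a hint, `pandas.read_excel` returns a dictionary of DataFrames keyed by sheet name when it is called with `sheet_name=None`. The sketch below is only an illustration under assumptions, not the reference solution: the helper name `load_all_sheets` is hypothetical, and layout options such as `skiprows` or `header` are left to the caller because they depend on what you observed in Question 0. Reading a legacy `.xls` file also requires an Excel engine such as `xlrd` to be installed.

```python
import pandas as pd

# Direct download URL quoted in the introduction of this project.
INSEE_URL = "https://www.insee.fr/fr/statistiques/fichier/1893198/estim-pop-dep-sexe-aq-1975-2023.xls"


def load_all_sheets(path_or_url=INSEE_URL, **read_excel_kwargs):
    """Hypothetical helper (not the reference solution): read every sheet."""
    # sheet_name=None makes pandas return {sheet name: DataFrame} for all sheets.
    # Layout options such as skiprows or header are passed through unchanged.
    return pd.read_excel(path_or_url, sheet_name=None, **read_excel_kwargs)
```

Nothing is downloaded until the function is called, for example with `load_all_sheets()`.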

Question 2

Now that the data is imported, we will format it into a single DataFrame with the columns:

  • gender;
  • age;
  • population;
  • dep_code;
  • dep;
  • year.

2.1 - To do this, create a function reshape_table_by_year(df, year) which takes as arguments a DataFrame from your data dict and the corresponding year, and returns the reshaped DataFrame for that year.

Expected result

year = 2022
df = solutions.reshape_table_by_year(data[f"{year}"], year)
df

Your turn!

def reshape_table_by_year(df, year):
    # Your code here
    return df
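
One possible way to go from a wide table to the long format listed above is to move the column levels into rows. The sketch below uses a made-up toy table, not the Insee layout: it assumes the headers have already been cleaned into a two-level column index named `gender` and `age`, and that the departments sit in the row index. The helper name `to_long` and all numbers are hypothetical.

```python
import pandas as pd

# Toy wide table (made-up numbers): one row per department,
# one column per (gender, age group) pair.
wide = pd.DataFrame(
    {
        ("Females", "0 to 19"): [10, 30],
        ("Females", "20 to 39"): [20, 40],
        ("Males", "0 to 19"): [11, 31],
        ("Males", "20 to 39"): [21, 41],
    },
    index=pd.MultiIndex.from_tuples(
        [("01", "Ain"), ("02", "Aisne")], names=["dep_code", "dep"]
    ),
)
wide.columns.names = ["gender", "age"]


def to_long(wide_table, year):
    """Hypothetical helper: wide (gender, age) columns -> one row per value."""
    long_table = (
        wide_table.stack(["gender", "age"])  # move both column levels into the index
        .rename("population")                # name the resulting Series
        .reset_index()                       # index levels become ordinary columns
        .assign(year=year)
    )
    # Reorder the columns to match the target layout.
    return long_table[["gender", "age", "population", "dep_code", "dep", "year"]]


long_table = to_long(wide, 2022)
```

On the real sheets, most of the work is likely to be in cleaning the header rows so that a structure like this one can be built.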

2.2 - Create a function reshape_data(data) which produces a DataFrame with the data for all the years between 1975 and 2022.

Expected result

df = solutions.reshape_data(data)
df

Your turn!

def reshape_data(data):
    # Your code here
    return df
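
Combining per-year tables usually comes down to `pandas.concat`. The sketch below works on a toy dictionary with made-up numbers. The helper name `stack_years` is hypothetical, and the check on digit-only sheet names is an assumption: it guards against workbooks that also contain sheets whose names are not years.

```python
import pandas as pd

# Toy stand-in for the dict of sheets: keys are sheet names.
toy_sheets = {
    "1975": pd.DataFrame({"dep_code": ["01"], "population": [100]}),
    "1976": pd.DataFrame({"dep_code": ["01"], "population": [110]}),
    "Notes": pd.DataFrame({"text": ["not a data sheet"]}),
}


def stack_years(sheets):
    """Hypothetical helper: tag each yearly table and stack them vertically."""
    frames = [
        table.assign(year=int(name))
        for name, table in sheets.items()
        if str(name).isdigit()  # keep only sheets named after a year
    ]
    # ignore_index=True rebuilds a clean 0..n-1 index for the combined table.
    return pd.concat(frames, ignore_index=True)


combined = stack_years(toy_sheets)
```

In the exercise, each table would first go through reshape_table_by_year before being concatenated.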

Part 2: Data visualization

We now have a dataset ready to be analyzed! Let’s start by visualizing the population evolution for different departments.

Question 3

Write a function plot_population_by_gender_per_department(df, department_code) which draws a graph of the population over time, by gender, in a given department. Use the matplotlib library. You can look at the data for Haute-Garonne (31), Loir-et-Cher (41), and Réunion (974) to see how differently their populations have evolved.

Expected result

solutions.plot_population_by_gender_per_department(df, "31")

Your turn!

def plot_population_by_gender_per_department(df, department_code):
    # Your code here
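
A common pattern for this kind of plot is to filter, aggregate, then draw one line per group. The sketch below uses a toy table with made-up numbers and the column names from Question 2. The helper name `plot_by_gender` is hypothetical. It also assumes the table holds no "total" rows for gender or age; if yours does, exclude them before summing.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy long-format table (made-up numbers), two age groups per gender and year.
toy = pd.DataFrame(
    {
        "gender": ["Females", "Females", "Males", "Males"] * 2,
        "age": ["A", "B"] * 4,
        "population": [10, 20, 11, 21, 12, 22, 13, 23],
        "dep_code": ["31"] * 8,
        "year": [2021] * 4 + [2022] * 4,
    }
)


def plot_by_gender(df, department_code, ax=None):
    """Hypothetical helper: one line per gender for a single department."""
    if ax is None:
        _, ax = plt.subplots()
    subset = df[df["dep_code"] == department_code]
    # Sum over age groups, then put one gender per column.
    totals = subset.groupby(["year", "gender"])["population"].sum().unstack("gender")
    for gender in totals.columns:
        ax.plot(totals.index, totals[gender], label=gender)
    ax.set_xlabel("Year")
    ax.set_ylabel("Population")
    ax.set_title(f"Population by gender, department {department_code}")
    ax.legend()
    return ax


ax = plot_by_gender(toy, "31")
```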

Question 4

To compare two graphs, it can be useful to display them side by side. The plt.subplots() function of matplotlib makes this easy to achieve in Python. To see this, we will draw the age pyramid of France in 1975 and in 2022.

4.1- Define the get_age_pyramid_data(df, year) function which, from the DataFrame generated by the reshape_data() function, returns a DataFrame with the columns age, Females, Males. The age column should contain all the age groups present in the dataset, the Females/Males columns correspond to the female/male population for a given age group. For aesthetics, the Males column will be multiplied by -1 beforehand.

Expected result

pyramide_data = solutions.get_age_pyramid_data(df, 2022)
pyramide_data

Your turn!

def get_age_pyramid_data(df, year):
    # Your code here
    return pyramide_data
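
The table described in 4.1 is essentially a pivot: age groups as rows, genders as columns, populations summed over departments. The sketch below shows this on a toy table with made-up numbers. The helper name `pyramid_table` is hypothetical, and the gender labels `Females` and `Males` are assumed to match those in your reshaped data.

```python
import pandas as pd

# Toy long-format table (made-up numbers): two departments, two age groups.
toy = pd.DataFrame(
    {
        "gender": ["Females", "Females", "Males", "Males"] * 2 + ["Females"],
        "age": ["0 to 19", "20 to 39"] * 4 + ["0 to 19"],
        "population": [10, 20, 11, 21, 30, 40, 31, 41, 999],
        "dep_code": ["01"] * 4 + ["02"] * 4 + ["01"],
        "year": [2022] * 8 + [1975],
    }
)


def pyramid_table(df, year):
    """Hypothetical helper: columns age, Females, Males (Males negated)."""
    subset = df[df["year"] == year]
    table = subset.pivot_table(
        index="age", columns="gender", values="population", aggfunc="sum"
    )
    table["Males"] = -table["Males"]  # negative values draw to the left of zero
    # Turn the age index into a column and drop the leftover "gender" axis label.
    return table.reset_index().rename_axis(columns=None)


pyramid = pyramid_table(toy, 2022)
```

Note that `pivot_table` sorts the age labels alphabetically, which may not match their natural order on real age-group labels.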

4.2- Define the plot_age_pyramid(df, year, ax=None) function which represents the age pyramid of France for a given year. You can get inspiration from what was done in this blog.

Expected result

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

solutions.plot_age_pyramid(df, 1975, ax=ax1)
solutions.plot_age_pyramid(df, 2022, ax=ax2)

Your turn!

def plot_age_pyramid(df, year, ax=None):
    if ax is None:
        ax = plt.gca()
    # Your code here
    return ax
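
A pyramid can be drawn with two horizontal bar charts sharing the same axis, one per gender. The sketch below is a rough illustration on a toy table with made-up numbers, and the helper name `draw_pyramid` is hypothetical. It assumes the Males column is already negated, as in 4.1, and formats the tick labels so that they show absolute values.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import FuncFormatter

# Toy pyramid table (made-up numbers); Males is already multiplied by -1.
pyramid = pd.DataFrame(
    {"age": ["0 to 19", "20 to 39"], "Females": [40, 60], "Males": [-42, -62]}
)


def draw_pyramid(table, title, ax=None):
    """Hypothetical helper: horizontal bars, males to the left, females to the right."""
    if ax is None:
        ax = plt.gca()
    ax.barh(table["age"], table["Males"], label="Males")
    ax.barh(table["age"], table["Females"], label="Females")
    # Show absolute values on the x axis despite the negated Males column.
    ax.xaxis.set_major_formatter(FuncFormatter(lambda value, _: f"{abs(value):,.0f}"))
    ax.set_title(title)
    ax.legend()
    return ax


# Two panels side by side, as in the expected result above.
fig, (left, right) = plt.subplots(1, 2, figsize=(15, 6))
draw_pyramid(pyramid, "Toy data, panel 1", ax=left)
draw_pyramid(pyramid, "Toy data, panel 2", ax=right)
```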

Part 3: An introduction to geographical data

Geographical data makes it possible to visualize and analyze information tied to specific locations on Earth. It can be used to build maps, 3D visualizations, and spatial analyses that reveal trends, patterns, and relationships. Python libraries such as GeoPandas and Folium make it easy to manipulate and visualize this kind of data.

To represent geographical data graphically, we need the boundary data (for example a shapefile or a GeoJSON file) of the areas we want to draw. The goal of this part is to create a choropleth map of the regions, colored according to their population.

The data we currently have contains information by department and not by region. First, it is necessary to assign each department to its corresponding region. For this, you can use the .json file available at the following address: https://static.data.gouv.fr/resources/departements-et-leurs-regions/20190815-175403/departements-region.json.

Question 5

Create a DataFrame from the .json file of French departments and regions mentioned earlier. Ensure that the columns are in the correct format.

Expected result

df_matching = solutions.load_departements_regions("https://static.data.gouv.fr/resources/departements-et-leurs-regions/20190815-175403/departements-region.json")
df_matching

Your turn!

def load_departements_regions(url):
    # Your code here
    return df_matching
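
"Correct format" here mostly concerns the department code, which needs to be a string comparable with dep_code (for example "01" rather than 1). The sketch below works on an inline sample instead of the real file. The field names `num_dep`, `dep_name` and `region_name` are assumptions about the JSON structure and should be checked against the file. For the real data, `pandas.read_json(url)` is one way to load the records.

```python
import json

import pandas as pd

# Inline sample standing in for the downloaded file (field names are assumed).
sample = """
[
  {"num_dep": 1, "dep_name": "Ain", "region_name": "Auvergne-Rhône-Alpes"},
  {"num_dep": "2A", "dep_name": "Corse-du-Sud", "region_name": "Corse"},
  {"num_dep": 974, "dep_name": "La Réunion", "region_name": "La Réunion"}
]
"""

df_sample = pd.DataFrame(json.loads(sample))

# Codes arrive as a mix of integers and strings: convert everything to
# strings and left-pad with zeros so that 1 becomes "01".
df_sample["num_dep"] = df_sample["num_dep"].astype(str).str.zfill(2)
```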

Question 6

Match the DataFrame containing population data by department with the DataFrame of French regions.

Expected result

df_regions = solutions.match_department_regions(df, df_matching)
df_regions

Your turn!

def match_department_regions(df, df_matching):
    # Your code here
    return df_regions
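
Matching the two tables is a join on the department code. The sketch below uses toy tables with made-up populations and the same assumed column names as before (`num_dep`, `region_name`). A left join keeps every population row, and `indicator=True` makes it easy to spot codes that found no region.

```python
import pandas as pd

# Toy tables (made-up populations); "99" has no matching region on purpose.
population = pd.DataFrame(
    {"dep_code": ["01", "31", "99"], "population": [100, 200, 300]}
)
matching = pd.DataFrame(
    {"num_dep": ["01", "31"], "region_name": ["Auvergne-Rhône-Alpes", "Occitanie"]}
)

merged = population.merge(
    matching, left_on="dep_code", right_on="num_dep", how="left", indicator=True
)

# Department codes present in the population table but absent from the mapping.
unmatched = merged.loc[merged["_merge"] == "left_only", "dep_code"].tolist()
```

Checking `unmatched` on the real data is a quick way to detect formatting mismatches between the two code columns.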

Question 7

Download the geographical boundary data of the regions, published by the cartiflette project, using the geopandas library. The data is accessible at the URL used in the call below.

Expected result

geo = solutions.load_geo_data("https://minio.lab.sspcloud.fr/projet-cartiflette/diffusion/shapefiles-test1/year=2022/administrative_level=REGION/crs=4326/FRANCE_ENTIERE=metropole/vectorfile_format='geojson'/provider='IGN'/source='EXPRESS-COG-CARTO-TERRITOIRE'/raw.geojson")
geo

Your turn!

def load_geo_data(url):
    # Your code here
    return geo

Question 8

Produce a choropleth map of the 2022 population of French regions. You can consult the geopandas documentation for guidance.

Expected result

solutions.plot_population_by_regions(df_regions, geo, 2022)

Your turn!

def plot_population_by_regions(df, geo, year):
    # Your code here
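
With geopandas, a choropleth is usually a join between the shapes and the values, followed by `plot(column=...)`. The sketch below uses two made-up square "regions" rather than the real boundaries. The column name `region_name` is an assumption, and the name of the region column in the downloaded GeoJSON should be checked before joining.

```python
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
from shapely.geometry import box

# Two toy square regions standing in for the real boundaries.
shapes = gpd.GeoDataFrame(
    {"region_name": ["West", "East"], "geometry": [box(0, 0, 1, 1), box(1, 0, 2, 1)]},
    crs="EPSG:4326",
)
# Made-up population totals per region.
population = pd.DataFrame({"region_name": ["West", "East"], "population": [100, 250]})

# Keeping the GeoDataFrame on the left preserves the geometry column.
merged = shapes.merge(population, on="region_name", how="left")

fig, ax = plt.subplots()
merged.plot(column="population", legend=True, ax=ax)  # color regions by value
ax.set_axis_off()
ax.set_title("Toy choropleth")
```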

Question 9

Total population alone is not sufficient to analyze the demographics of a region. It is also interesting to look at how fast the population is growing.

9.1- Write a function compute_population_growth_per_region(df) which calculates the annual population growth percentage for each region.

Expected result

df_growth = solutions.compute_population_growth_per_region(df_regions)
df_growth

Your turn!

def compute_population_growth_per_region(df_regions):
    # Your code here
    return df_growth
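
Annual growth in percent is the change relative to the previous year: 100 × (P(t) − P(t−1)) / P(t−1). In pandas this can be written with a grouped `pct_change`. The sketch below uses a toy table with made-up numbers, a hypothetical helper name `annual_growth`, and an assumed column name `region_name`. It also assumes the table contains no "total" rows that would be counted twice.

```python
import pandas as pd

# Toy table (made-up numbers): region A has two departments, region B has one.
toy = pd.DataFrame(
    {
        "region_name": ["A", "A", "A", "A", "B", "B"],
        "dep_code": ["01", "02", "01", "02", "03", "03"],
        "year": [2020, 2020, 2021, 2021, 2020, 2021],
        "population": [60, 40, 66, 44, 200, 190],
    }
)


def annual_growth(df):
    """Hypothetical helper: yearly growth in percent for each region."""
    totals = (
        df.groupby(["region_name", "year"], as_index=False)["population"]
        .sum()
        .sort_values(["region_name", "year"])  # pct_change relies on row order
    )
    # Change relative to the previous year, computed within each region.
    totals["growth_pct"] = totals.groupby("region_name")["population"].pct_change() * 100
    return totals


growth = annual_growth(toy)
```

The first year of each region has no previous year, so its growth is missing (NaN).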

9.2- Write a function compute_mean_population_growth_per_region(df, min_year, max_year) which calculates the average population growth between two given years.

Expected result

df_growth = solutions.compute_mean_population_growth_per_region(df_regions, 2015, 2022)
df_growth

Your turn!

def compute_mean_population_growth_per_region(df, min_year, max_year):
    # Your code here
    return df_growth
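
Averaging over a period amounts to filtering the yearly growth rates and taking the mean per region. The sketch below starts from a toy growth table with made-up numbers. The helper name `mean_growth` and the columns `region_name` and `growth_pct` are assumptions. Whether the growth recorded for `min_year` itself (relative to the year before) belongs in the average is a convention to decide; here both bounds are included.

```python
import pandas as pd

# Toy yearly growth table (made-up numbers), in percent.
growth = pd.DataFrame(
    {
        "region_name": ["A", "A", "A", "B", "B", "B"],
        "year": [2020, 2021, 2022, 2020, 2021, 2022],
        "growth_pct": [float("nan"), 10.0, 20.0, float("nan"), -5.0, 5.0],
    }
)


def mean_growth(growth_table, min_year, max_year):
    """Hypothetical helper: mean yearly growth per region over a period."""
    in_range = growth_table["year"].between(min_year, max_year)  # inclusive bounds
    return (
        growth_table[in_range]
        .groupby("region_name", as_index=False)["growth_pct"]
        .mean()  # missing values are skipped by default
    )


mean_table = mean_growth(growth, 2021, 2022)
```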

9.3- Write a function plot_growth_population_by_regions(df, geo, min_year, max_year) which represents the average population growth between two given years for all French regions on a choropleth map.

Expected result

solutions.plot_growth_population_by_regions(df_regions, geo, 2015, 2022)

Your turn!

def plot_growth_population_by_regions(df, geo, min_year, max_year):
    # Your code here