import copy
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
import seaborn as sns
import solutions
Project 3 - Population estimation from census data
The goal of this project is to perform a quick statistical analysis of a dataset whose format is not directly suited to analysis in Python. We will use the pandas library exclusively for data analysis. To best reproduce a situation you might encounter, we strongly encourage you to consult the library's documentation.
We will focus on the population estimate as of January 1st of each year, this estimate being made from censuses and population evolution models. The data is accessible on the Insee website at the following address: https://www.insee.fr/en/statistics/1893198. The file we will use can be downloaded directly via this url: https://www.insee.fr/fr/statistiques/fichier/1893198/estim-pop-dep-sexe-aq-1975-2023.xls.
Part 1: Downloading and formatting data
Before downloading the data with Python, it is necessary to know the format of our data. In our case, it is an Excel file (.xls). It is also useful to look at what the data we want to import looks like, especially when its layout is not standard. So, before starting, take the time to glance at the data.
Question 0
Download the file from the URL given above and open it with your favorite spreadsheet software. Analyze the data structure.
Question 1
Define the load_data() function, which has no parameters and returns a Dict where the keys are the names of the tabs of our file and the values are the data of the corresponding spreadsheets. To do this, use a function from the pandas library, specifying the correct parameters.
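As a hint rather than the expected solution, here is a minimal sketch of reading every tab of a workbook in one call. It assumes `pd.read_excel` with `sheet_name=None` is the relevant pandas function; the helper name `load_all_sheets` and the `skiprows` default are hypothetical and must be adapted after inspecting the file.

```python
import pandas as pd

# URL quoted in the introduction of this project.
URL = "https://www.insee.fr/fr/statistiques/fichier/1893198/estim-pop-dep-sexe-aq-1975-2023.xls"


def load_all_sheets(path_or_url, skiprows=0):
    """Return a dict mapping each sheet name to a DataFrame.

    sheet_name=None asks pandas to read every sheet of the workbook.
    skiprows is a placeholder: set it to the number of title rows
    observed when opening the file in a spreadsheet.
    """
    return pd.read_excel(path_or_url, sheet_name=None, skiprows=skiprows)


# Example call (needs network access and an engine able to read .xls files,
# such as xlrd):
# sheets = load_all_sheets(URL)
# list(sheets)
```

The call is left commented out so that the sketch does not depend on a download.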
Expected result
data = solutions.load_data()
data["2022"]
Your turn!
def load_data():
    # Your code here
    return data
Question 2
Now that the data is imported, we will format it into a single DataFrame with the columns gender, age, population, dep_code, dep, and year.
2.1 - To do this, create a function reshape_table_by_year(df, year) which takes as arguments a DataFrame from your Dict data and a given year.
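The reshaping from a wide sheet to a long table can be illustrated on a toy example. This is only a sketch under assumptions: the column names and figures below are invented, the real sheet has a more complex header, and `melt` is one possible technique among others.

```python
import pandas as pd

# Toy wide table: one row per department, one column per age group.
# Column names and values are invented for illustration only.
wide = pd.DataFrame({
    "dep_code": ["01", "02"],
    "dep": ["Ain", "Aisne"],
    "0 to 19 years": [100, 80],
    "20 to 39 years": [120, 90],
})

# melt keeps the identifier columns and stacks the remaining ones
# into (age, population) pairs.
tidy = wide.melt(
    id_vars=["dep_code", "dep"],
    var_name="age",
    value_name="population",
)

# Attach the constants that are not stored in the cells themselves.
tidy["gender"] = "Females"
tidy["year"] = 2022

print(tidy)
```

Each department now appears once per age group, which is the long layout the later questions rely on.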
Expected result
year = 2022
df = solutions.reshape_table_by_year(data[f"{year}"], year)
df
Your turn!
def reshape_table_by_year(df, year):
    # Your code here
    return df
2.2 - Create a function reshape_data(data) which produces a DataFrame with the data for all the years between 1975 and 2022.
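Stacking one table per year into a single DataFrame can be sketched as follows. The dictionary below is a stand-in with invented values, and filtering on numeric tab names is an assumption about how the workbook is organized.

```python
import pandas as pd

# Stand-in for the dict of sheets: keys are tab names, values are DataFrames.
sheets = {
    "Notes": pd.DataFrame({"text": ["not a data tab"]}),
    "2021": pd.DataFrame({"dep_code": ["01"], "population": [650]}),
    "2022": pd.DataFrame({"dep_code": ["01"], "population": [660]}),
}

frames = []
for name, sheet in sheets.items():
    # Skip tabs whose name is not a year (for example a documentation tab).
    if not name.isdigit():
        continue
    frames.append(sheet.assign(year=int(name)))

# ignore_index=True rebuilds a clean 0..n-1 index for the stacked result.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```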
Expected result
df = solutions.reshape_data(data)
df
Your turn!
def reshape_data(data):
    # Your code here
    return df
Part 2: Data visualization
We now have a dataset ready to be analyzed! Let’s start by visualizing the population evolution for different departments.
Question 3
Write a function plot_population_by_gender_per_department(df, department_code) which returns a graph representing the population evolution by gender in a given department. Use the matplotlib library. You can look at the data for Haute-Garonne (31), Loir-et-Cher (41), and Réunion (974) to see disparities in evolution.
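One possible shape for such a plot, shown on invented figures: the population is summed over age groups, then one line is drawn per gender. The helper name `plot_by_gender` and the gender labels are assumptions, not the expected solution.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented figures in the long format described in Part 1.
toy = pd.DataFrame({
    "dep_code": ["31"] * 4,
    "gender": ["Females", "Males", "Females", "Males"],
    "year": [2021, 2021, 2022, 2022],
    "population": [720, 690, 730, 700],
})


def plot_by_gender(df, department_code):
    """Draw one line per gender for the chosen department."""
    subset = df[df["dep_code"] == department_code]
    # Sum over age groups so that each (year, gender) pair gives one point.
    totals = subset.groupby(["year", "gender"])["population"].sum().unstack("gender")
    fig, ax = plt.subplots()
    for gender in totals.columns:
        ax.plot(totals.index, totals[gender], label=gender)
    ax.set_xlabel("Year")
    ax.set_ylabel("Population")
    ax.set_title(f"Population by gender, department {department_code}")
    ax.legend()
    return ax


ax = plot_by_gender(toy, "31")
```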
Expected result
solutions.plot_population_by_gender_per_department(df, "31")
Your turn!
def plot_population_by_gender_per_department(df, department_code):
    # Your code here
    pass
Question 4
To compare two graphs, it can be useful to display them side by side. Thanks to the subplots() function of matplotlib, this is very easy to achieve in Python. To see this, we will represent the age pyramid of France in 1975 and in 2022.
4.1- Define the get_age_pyramid_data(df, year) function which, from the DataFrame generated by the reshape_data() function, returns a DataFrame with the columns age, Females, and Males. The age column should contain all the age groups present in the dataset; the Females and Males columns hold the female and male population for each age group. For aesthetics, the Males column is multiplied by -1 so that its bars extend to the left of the axis.
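The long-to-wide step described here can be sketched with `pivot_table` on invented data. The helper name `pyramid_table` and the gender labels are assumptions; note that `pivot_table` sorts age labels alphabetically, which may differ from the natural order of the age groups.

```python
import pandas as pd

# Invented figures in the long format produced in Part 1.
toy = pd.DataFrame({
    "age": ["0 to 19", "0 to 19", "20 to 39", "20 to 39"],
    "gender": ["Females", "Males", "Females", "Males"],
    "year": [2022] * 4,
    "population": [100, 110, 120, 115],
})


def pyramid_table(df, year):
    """Return one row per age group with Females and Males columns."""
    one_year = df[df["year"] == year]
    # aggfunc="sum" adds up all rows sharing the same age group and gender
    # (in the real data, this sums over departments).
    table = one_year.pivot_table(
        index="age", columns="gender", values="population", aggfunc="sum"
    ).reset_index()
    table.columns.name = None
    # Negative values make the male bars extend to the left.
    table["Males"] = -table["Males"]
    return table


pyramid = pyramid_table(toy, 2022)
print(pyramid)
```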
Expected result
pyramide_data = solutions.get_age_pyramid_data(df, 2022)
pyramide_data
Your turn!
def get_age_pyramid_data(df, year):
    # Your code here
    return pyramide_data
4.2- Define the plot_age_pyramid(df, year, ax=None) function which represents the age pyramid of France for a given year. You can get inspiration from what was done in this blog.
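The role of the `ax` argument can be illustrated with plain matplotlib horizontal bars on two invented tables. This is a sketch, not the approach of the blog mentioned above; `draw_pyramid` is a hypothetical helper.

```python
import matplotlib.pyplot as plt
import pandas as pd


def draw_pyramid(table, title, ax=None):
    """Draw mirrored horizontal bars on the given axes (or the current one)."""
    if ax is None:
        ax = plt.gca()
    # Males are stored as negative values, so their bars go to the left.
    ax.barh(table["age"], table["Males"], label="Males")
    ax.barh(table["age"], table["Females"], label="Females")
    ax.set_title(title)
    ax.set_xlabel("Population")
    ax.legend()
    return ax


# Invented tables with the columns age, Females, Males.
old = pd.DataFrame({"age": ["0 to 19", "20 to 39"], "Females": [90, 100], "Males": [-95, -98]})
new = pd.DataFrame({"age": ["0 to 19", "20 to 39"], "Females": [100, 120], "Males": [-110, -115]})

# One row, two columns: each call draws on its own axes.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6), sharey=True)
draw_pyramid(old, "Earlier year", ax=ax1)
draw_pyramid(new, "Later year", ax=ax2)
```

Passing the axes explicitly is what lets the same function fill either panel of the figure.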
Expected result
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
solutions.plot_age_pyramid(df, 1975, ax=ax1)
solutions.plot_age_pyramid(df, 2022, ax=ax2)
Your turn!
def plot_age_pyramid(df, year, ax=None):
    if ax is None:
        ax = plt.gca()
    # Your code here
    return df
Part 3: An introduction to geographical data
Geographical data is very useful because it allows for visualizing and analyzing information related to specific locations on Earth. It can be used to create maps, 3D visualizations, and spatial analyses to understand trends, patterns, and relationships in the data. With Python libraries such as Geopandas or Folium, you can easily manipulate and visualize geographical data to meet your analytical needs.
To graphically represent geographical data, it is necessary to obtain the contour data (shapefile) of the areas we want to represent. The goal of this part is to create a choropleth map of regions based on their respective population.
The data we currently have contains information by department and not by region. First, it is necessary to assign each department to its corresponding region. For this, you can use the .json file available at the following address: https://static.data.gouv.fr/resources/departements-et-leurs-regions/20190815-175403/departements-region.json.
Question 5
Create a DataFrame from the .json file of French departments and regions mentioned earlier. Ensure that the columns are in the correct format.
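What "correct format" may involve can be sketched on two invented records. The field names `num_dep`, `dep_name`, and `region_name` are an assumption about the file layout and should be checked against the actual JSON; the point illustrated is that department codes mix numbers and strings such as "2A", so they are best stored as zero-padded text.

```python
import json

import pandas as pd

# Invented records mimicking the assumed layout of the file.
raw_text = (
    '[{"num_dep": 1, "dep_name": "Ain", "region_name": "Auvergne-Rhône-Alpes"},'
    ' {"num_dep": "2A", "dep_name": "Corse-du-Sud", "region_name": "Corse"}]'
)

# For the real file, pd.read_json(url) can fetch and parse the document directly.
matching = pd.DataFrame(json.loads(raw_text))

# Store codes as text and pad single digits: 1 becomes "01", "2A" is unchanged.
matching["num_dep"] = matching["num_dep"].astype(str).str.zfill(2)
print(matching)
```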
Expected result
df_matching = solutions.load_departements_regions("https://static.data.gouv.fr/resources/departements-et-leurs-regions/20190815-175403/departements-region.json")
df_matching
Your turn!
def load_departements_regions(url):
    # Your code here
    return df_matching
Question 6
Match the DataFrame containing population data by department with the DataFrame of French regions.
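Matching two tables on a department code can be sketched with `DataFrame.merge`. The tables and column names below are invented; the `indicator=True` option is used here only to reveal codes that find no region, which is a useful check before aggregating.

```python
import pandas as pd

# Invented population table and department-to-region table.
population = pd.DataFrame({
    "dep_code": ["01", "2A", "99"],
    "population": [650, 160, 10],
})
matching = pd.DataFrame({
    "num_dep": ["01", "2A"],
    "region_name": ["Auvergne-Rhône-Alpes", "Corse"],
})

# A left join keeps every population row, matched or not.
merged = population.merge(
    matching, left_on="dep_code", right_on="num_dep", how="left", indicator=True
)

# Rows flagged "left_only" have no counterpart in the matching table.
unmatched = merged.loc[merged["_merge"] == "left_only", "dep_code"].tolist()
print(unmatched)
```

If both key columns are not of the same type (for example integers on one side and strings on the other), no row will match, which is why the formatting step of the previous question matters.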
Expected result
df_regions = solutions.match_department_regions(df, df_matching)
df_regions
Your turn!
def match_department_regions(df, df_matching):
    # Your code here
    return df_regions
Question 7
Download the geographical contour data of the regions using the cartiflette package and the geopandas library. The data is accessible at the URL passed to load_geo_data() in the expected result.
Expected result
geo = solutions.load_geo_data("https://minio.lab.sspcloud.fr/projet-cartiflette/diffusion/shapefiles-test1/year=2022/administrative_level=REGION/crs=4326/FRANCE_ENTIERE=metropole/vectorfile_format='geojson'/provider='IGN'/source='EXPRESS-COG-CARTO-TERRITOIRE'/raw.geojson")
geo
Your turn!
def load_geo_data(url):
    # Your code here
    return geo
Question 8
Produce a choropleth map of the 2022 population of French regions. You can consult the geopandas documentation.
Expected result
solutions.plot_population_by_regions(df_regions, geo, 2022)
Your turn!
def plot_population_by_regions(df, geo, year):
    # Your code here
    pass
Question 9
The total population of a region is not sufficient to analyze the demographics of a region. It can be interesting to look at demographic growth.
9.1- Write a function compute_population_growth_per_region(df) which calculates the annual population growth percentage for each region.
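The annual growth rate is the change relative to the previous year, 100 × (P_t − P_{t−1}) / P_{t−1}, computed separately within each region. A minimal sketch on invented figures, assuming a `region_name` column and using `pct_change` on a grouped series:

```python
import pandas as pd

# Invented regional figures; the real table also has gender and age columns.
toy = pd.DataFrame({
    "region_name": ["A", "A", "A", "B", "B", "B"],
    "year": [2020, 2021, 2022, 2020, 2021, 2022],
    "population": [100, 110, 121, 200, 190, 190],
})

# One total per region and year, sorted so that consecutive rows are consecutive years.
totals = (
    toy.groupby(["region_name", "year"], as_index=False)["population"].sum()
       .sort_values(["region_name", "year"])
)

# pct_change within each region: the first year of a region has no
# predecessor and therefore gets NaN.
totals["growth_pct"] = totals.groupby("region_name")["population"].pct_change() * 100
print(totals)
```

Grouping before calling `pct_change` prevents the last year of one region from being compared with the first year of the next.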
Expected result
df_growth = solutions.compute_population_growth_per_region(df_regions)
df_growth
Your turn!
def compute_population_growth_per_region(df_regions):
    # Your code here
    return df_growth
9.2- Write a function compute_mean_population_growth_per_region(df, min_year, max_year) which calculates the average population growth between two given years.
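One reading of "average growth" is the arithmetic mean of the annual rates over the chosen window; that interpretation is an assumption, and a compound rate would be another valid choice. A sketch on an invented growth table, with `mean_growth` as a hypothetical helper:

```python
import pandas as pd

# Invented annual growth rates (in percent), one row per region and year.
growth = pd.DataFrame({
    "region_name": ["A", "A", "A", "B", "B", "B"],
    "year": [2020, 2021, 2022, 2020, 2021, 2022],
    "growth_pct": [None, 10.0, 10.0, None, -5.0, 0.0],
})


def mean_growth(df, min_year, max_year):
    """Average the annual rates of each region over [min_year, max_year]."""
    # between() includes both bounds by default.
    window = df[df["year"].between(min_year, max_year)]
    return (
        window.groupby("region_name", as_index=False)["growth_pct"].mean()
              .rename(columns={"growth_pct": "mean_growth_pct"})
    )


result = mean_growth(growth, 2021, 2022)
print(result)
```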
Expected result
df_growth = solutions.compute_mean_population_growth_per_region(df_regions, 2015, 2022)
df_growth
Your turn!
def compute_mean_population_growth_per_region(df, min_year, max_year):
    # Your code here
    return df_growth
9.3- Write a function plot_growth_population_by_regions(df, geo, min_year, max_year) which represents the average population growth between two given years for all French regions on a choropleth map.
Expected result
solutions.plot_growth_population_by_regions(df_regions, geo, 2015, 2022)
Your turn!
def plot_growth_population_by_regions(df, geo, min_year, max_year):
    # Your code here
    pass