import copy
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
import seaborn as sns
import solutions
Project 3 - Population estimation from census data
The goal of this project is to perform a quick statistical analysis of a dataset whose format is not directly suited to analysis in Python. We will use the pandas library exclusively for data analysis. To reproduce a situation you might actually encounter, we strongly encourage you to consult the library's documentation.
We will focus on the population estimate as of January 1st of each year. This estimate is derived from censuses and population evolution models. The data is accessible on the Insee website at the following address: https://www.insee.fr/en/statistics/1893198. The file we will use can be downloaded directly via this url: https://www.insee.fr/fr/statistiques/fichier/1893198/estim-pop-dep-sexe-aq-1975-2023.xls.
Part 1: Downloading and formatting data
Before downloading the data with Python, it is necessary to know its format. In our case, it is the Excel format (.xls). It is also useful to look at what the data we want to import looks like, especially when its layout is not standard. So, before starting, take the time to glance at the data.
Question 0
Download the data from the URL given above and open it with your favorite spreadsheet software. Analyze the data structure.
Question 1
Define the load_data() function which has no parameters and returns a Dict where the keys correspond to the names of the tabs of our file and the values correspond to the data of the different spreadsheets. To do this, use a function from the pandas library by specifying the correct parameters.
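As a hint on the pandas side, here is a minimal sketch of reading every sheet of a workbook at once: passing sheet_name=None to pd.read_excel returns a dict keyed by sheet name. The helper name read_all_sheets and its skiprows argument are illustrative assumptions, not part of the exercise. The number of rows to skip depends on what you observed in Question 0, and reading a legacy .xls file requires the xlrd package to be installed.

```python
import pandas as pd


def read_all_sheets(source, skiprows=None):
    """Hypothetical helper: read every sheet of an Excel file into a dict.

    `source` can be a local path or a URL. `skiprows` is left to the caller
    because the number of title rows depends on the file layout.
    """
    # sheet_name=None asks pandas for all sheets: {sheet name: DataFrame}
    return pd.read_excel(source, sheet_name=None, skiprows=skiprows)
```

The function is only defined here, not called, so nothing is downloaded until you pass it the URL of the file.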
Expected result
data = solutions.load_data()
data["2022"]
Your turn!
def load_data():
    # Your code here
    return data
Question 2
Now that the data is imported, we will format it into a single DataFrame with the columns:
gender;age;population;dep_code;dep;year.
2.1 - To do this, create a function reshape_table_by_year(df, year) which takes as arguments a DataFrame from your Dict data and the corresponding year.
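The wide-to-long step can be sketched on a toy table. The layout assumed here (one column per "gender_age" pair, made-up values) is a simplification and the real sheet is organized differently, so treat this as an illustration of DataFrame.melt rather than a solution.

```python
import pandas as pd

# Toy wide table with made-up values: one row per department and one
# column per "gender_age" pair (the real sheet layout may differ).
wide = pd.DataFrame(
    {
        "dep_code": ["31", "41"],
        "dep": ["Haute-Garonne", "Loir-et-Cher"],
        "Females_0-19": [10, 20],
        "Females_20-39": [11, 21],
        "Males_0-19": [12, 22],
        "Males_20-39": [13, 23],
    }
)

# melt turns the value columns into (variable, population) rows
long = wide.melt(id_vars=["dep_code", "dep"], value_name="population")

# Split the former column labels into the two dimensions they encode
long[["gender", "age"]] = long["variable"].str.split("_", expand=True)

# Tag the rows with the year of the sheet and order the columns
long = long.assign(year=2022)[
    ["gender", "age", "population", "dep_code", "dep", "year"]
]
```

Each department row becomes one row per (gender, age) pair, which is the shape the target columns describe.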
Expected result
year = 2022
df = solutions.reshape_table_by_year(data[f"{year}"], year)
df
Your turn!
def reshape_table_by_year(df, year):
    # Your code here
    return df
2.2 - Create a function reshape_data(data) which produces a single DataFrame with the data for all the years between 1975 and 2022.
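Combining the per-year tables is a concatenation. A minimal sketch with pd.concat on toy tables follows. The values are made up, and the isdigit() filter is an assumption about how to skip any sheet whose name is not a year.

```python
import pandas as pd

# Toy per-year tables (made-up values), keyed by sheet name as in `data`
tables_by_year = {
    "2021": pd.DataFrame({"dep_code": ["31", "41"], "population": [100, 50]}),
    "2022": pd.DataFrame({"dep_code": ["31", "41"], "population": [102, 49]}),
    "Notes": pd.DataFrame({"dep_code": [], "population": []}),
}

# Add the year to each table, skipping sheets whose name is not a year
frames = [
    table.assign(year=int(sheet_name))
    for sheet_name, table in tables_by_year.items()
    if sheet_name.isdigit()
]

# Stack all years into one DataFrame with a fresh 0..n-1 index
combined = pd.concat(frames, ignore_index=True)
```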
Expected result
df = solutions.reshape_data(data)
df
Your turn!
def reshape_data(data):
    # Your code here
    return df
Part 2: Data visualization
We now have a dataset ready to be analyzed! Let’s start by visualizing the population evolution for different departments.
Question 3
Write a function plot_population_by_gender_per_department(df, department_code) which returns a graph representing the population evolution in a given department. Use the matplotlib library. You can look at the data for Haute Garonne (31), Loir-et-Cher (41), and Réunion (974) to see disparities in evolution.
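One possible way to structure such a plot, shown on made-up data: aggregate over age groups, put one column per gender, then draw one line per column. The figure labels are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy long-format data (made-up values) for a single department
toy = pd.DataFrame(
    {
        "gender": ["Females", "Males"] * 3,
        "age": ["0-19"] * 6,
        "population": [100, 95, 104, 98, 109, 101],
        "dep_code": ["31"] * 6,
        "year": [2020, 2020, 2021, 2021, 2022, 2022],
    }
)

# Sum over age groups, then put one column per gender indexed by year
by_gender = (
    toy[toy["dep_code"] == "31"]
    .groupby(["year", "gender"])["population"]
    .sum()
    .unstack("gender")
)

# One line per gender on a single set of axes
fig, ax = plt.subplots()
for gender in by_gender.columns:
    ax.plot(by_gender.index, by_gender[gender], label=gender)
ax.set_xlabel("Year")
ax.set_ylabel("Population")
ax.set_title("Population by gender, department 31")
ax.legend()
```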
Expected result
solutions.plot_population_by_gender_per_department(df, "31")
Your turn!
def plot_population_by_gender_per_department(df, department_code):
    # Your code here
Question 4
To compare two graphs, it can be useful to display them side by side. The subplots() function of matplotlib makes this easy in Python. To illustrate, we will plot the age pyramid of France in 1975 and in 2022.
4.1- Define the get_age_pyramid_data(df, year) function which, from the DataFrame generated by the reshape_data() function, returns a DataFrame with the columns age, Females, Males. The age column should contain all the age groups present in the dataset, the Females/Males columns correspond to the female/male population for a given age group. For aesthetics, the Males column will be multiplied by -1 beforehand.
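The reshaping described here is a pivot: rows indexed by age group, one column per gender, summed over departments. A sketch on made-up values, using pivot_table:

```python
import pandas as pd

# Toy long-format data (made-up values): two departments, two age groups
toy = pd.DataFrame(
    {
        "gender": ["Females", "Males"] * 4,
        "age": ["0-19", "0-19", "20-39", "20-39"] * 2,
        "population": [10, 12, 20, 21, 30, 32, 40, 41],
        "dep_code": ["31"] * 4 + ["41"] * 4,
        "year": [2022] * 8,
    }
)

# Sum over departments: one row per age group, one column per gender
pyramid = (
    toy[toy["year"] == 2022]
    .pivot_table(index="age", columns="gender", values="population", aggfunc="sum")
    .reset_index()
)
pyramid.columns.name = None  # drop the leftover "gender" axis label

# Negate the male counts so their bars extend to the left of zero
pyramid["Males"] = -pyramid["Males"]
```

Note that age labels stored as strings sort alphabetically, which may not match the natural age order in the real dataset.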
Expected result
pyramide_data = solutions.get_age_pyramid_data(df, 2022)
pyramide_data
Your turn!
def get_age_pyramid_data(df, year):
    # Your code here
    return pyramide_data
4.2- Define the plot_age_pyramid(df, year, ax=None) function which represents the age pyramid of France for a given year. You can get inspiration from what was done in this blog.
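A pyramid can be drawn as two horizontal bar series sharing the same categories, with the negated male counts going left. The sketch below uses made-up values and a hypothetical helper named draw_pyramid. It also shows how an optional ax argument lets the same function fill either panel of a subplots() grid.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy pyramid data (made-up values); Males is already negated
pyramid = pd.DataFrame(
    {
        "age": ["0-19", "20-39", "40-59"],
        "Females": [40, 60, 55],
        "Males": [-44, -62, -50],
    }
)


def draw_pyramid(pyramid, title, ax=None):
    """Hypothetical helper: draw both genders as horizontal bars on `ax`."""
    if ax is None:
        ax = plt.gca()  # fall back to the current axes
    ax.barh(pyramid["age"], pyramid["Females"], label="Females")
    ax.barh(pyramid["age"], pyramid["Males"], label="Males")
    ax.set_xlabel("Population")
    ax.set_title(title)
    ax.legend()
    return ax


# Two panels side by side, each filled by the same function
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
draw_pyramid(pyramid, "Toy pyramid A", ax=ax1)
draw_pyramid(pyramid, "Toy pyramid B", ax=ax2)
```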
Expected result
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
solutions.plot_age_pyramid(df, 1975, ax=ax1)
solutions.plot_age_pyramid(df, 2022, ax=ax2)
Your turn!
def plot_age_pyramid(df, year, ax=None):
    if ax is None:
        ax = plt.gca()
    # Your code here
    return df
Part 3: An introduction to geographical data
Geographical data is very useful because it allows for visualizing and analyzing information related to specific locations on Earth. Geographical data can be used to create maps, 3D visualizations, and spatial analyses to understand trends, patterns, and relationships in the data. By using Python libraries such as Geopandas or Folium, you can easily manipulate and visualize geographical data to meet your analytical needs.
To graphically represent geographical data, it is necessary to obtain the contour data (shapefile) of the areas we want to represent. The goal of this part is to create a choropleth map of regions based on their respective population.
The data we currently have contains information by department and not by region. First, it is necessary to assign each department to its corresponding region. For this, you can use the .json file available at the following address: https://static.data.gouv.fr/resources/departements-et-leurs-regions/20190815-175403/departements-region.json.
Question 5
Create a DataFrame from the .json file of French departments and regions mentioned earlier. Ensure that the columns are in the correct format.
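"Correct format" mostly concerns department codes, which should be strings such as "01" or "2A" rather than integers. The sketch below works on an inline sample instead of the remote file. The field names num_dep, dep_name and region_name are an assumption about the file's record layout and should be checked against the actual download.

```python
import io
import json

import pandas as pd

# Inline sample mimicking the assumed record layout of the file
sample = io.StringIO(
    """
    [
      {"num_dep": 1, "dep_name": "Ain", "region_name": "Auvergne-Rhône-Alpes"},
      {"num_dep": "2A", "dep_name": "Corse-du-Sud", "region_name": "Corse"},
      {"num_dep": 974, "dep_name": "La Réunion", "region_name": "La Réunion"}
    ]
    """
)

df_matching = pd.DataFrame(json.load(sample))

# Codes arrive as a mix of integers and strings: normalize to
# zero-padded strings so that 1 becomes "01" and "2A" is unchanged
df_matching["num_dep"] = df_matching["num_dep"].astype(str).str.zfill(2)
```

For the remote file, pd.read_json accepts a URL directly, after which the same normalization applies.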
Expected result
df_matching = solutions.load_departements_regions("https://static.data.gouv.fr/resources/departements-et-leurs-regions/20190815-175403/departements-region.json")
df_matching
Your turn!
def load_departements_regions(url):
    # Your code here
    return df_matching
Question 6
Match the DataFrame containing population data by department with the DataFrame of French regions.
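Matching the two tables is a many-to-one join on the department code. A sketch on toy frames follows. The column names num_dep and region_name are assumptions, and the population figures are made up.

```python
import pandas as pd

# Toy population table (made-up values) and toy department/region table
population = pd.DataFrame(
    {
        "dep_code": ["01", "31", "31"],
        "year": [2022, 2021, 2022],
        "population": [650, 1400, 1420],
    }
)
matching = pd.DataFrame(
    {
        "num_dep": ["01", "31"],
        "region_name": ["Auvergne-Rhône-Alpes", "Occitanie"],
    }
)

# Left join keeps every population row; validate raises if a department
# code appears more than once in the matching table
merged = population.merge(
    matching,
    left_on="dep_code",
    right_on="num_dep",
    how="left",
    validate="many_to_one",
).drop(columns="num_dep")

# Rows without a region would reveal codes that failed to match
unmatched = merged["region_name"].isna().sum()
```

Checking unmatched after the join is a cheap way to catch formatting mismatches such as "1" versus "01".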
Expected result
df_regions = solutions.match_department_regions(df, df_matching)
df_regions
Your turn!
def match_department_regions(df, df_matching):
    # Your code here
    return df_regions
Question 7
Download the geographical contour data of the regions using the cartiflette package and the geopandas library. The data is accessible at the URL passed to load_geo_data in the expected result below.
Expected result
geo = solutions.load_geo_data("https://minio.lab.sspcloud.fr/projet-cartiflette/diffusion/shapefiles-test1/year=2022/administrative_level=REGION/crs=4326/FRANCE_ENTIERE=metropole/vectorfile_format='geojson'/provider='IGN'/source='EXPRESS-COG-CARTO-TERRITOIRE'/raw.geojson")
geo
Your turn!
def load_geo_data(url):
    # Your code here
    return geo
Question 8
Produce a choropleth map of the 2022 population of French regions. You can consult the geopandas documentation here.
Expected result
solutions.plot_population_by_regions(df_regions, geo, 2022)
Your turn!
def plot_population_by_regions(df, geo, year):
    # Your code here
Question 9
The total population of a region is not sufficient to analyze the demographics of a region. It can be interesting to look at demographic growth.
9.1- Write a function compute_population_growth_per_region(df) which calculates the annual population growth percentage for each region.
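Annual growth can be obtained by first summing the population per region and year, then applying a grouped pct_change so that each region is compared only with its own previous year. A sketch on made-up values, where the column name region_name is an assumption:

```python
import pandas as pd

# Toy data (made-up values): region A has two departments, region B one
regions = pd.DataFrame(
    {
        "region_name": ["A", "A", "A", "A", "B", "B"],
        "dep_code": ["01", "02", "01", "02", "03", "03"],
        "year": [2020, 2020, 2021, 2021, 2020, 2021],
        "population": [60, 40, 70, 40, 200, 190],
    }
)

# Total population per region and year, in chronological order
totals = (
    regions.groupby(["region_name", "year"], as_index=False)["population"]
    .sum()
    .sort_values(["region_name", "year"])
)

# Year-over-year change within each region, expressed in percent;
# the first year of each region has no predecessor and stays NaN
totals["growth"] = totals.groupby("region_name")["population"].pct_change() * 100
```

Sorting by year before calling pct_change matters, since the change is computed against the previous row of each group.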
Expected result
df_growth = solutions.compute_population_growth_per_region(df_regions)
df_growth
Your turn!
def compute_population_growth_per_region(df_regions):
    # Your code here
    return df_growth
9.2- Write a function compute_mean_population_growth_per_region(df, min_year, max_year) which calculates the average population growth between two given years.
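Averaging over a period amounts to filtering the yearly growth rates and taking a grouped mean. The sketch below uses a hypothetical helper mean_growth and made-up rates. It treats both bounds as inclusive, which is one possible reading of "between two given years".

```python
import pandas as pd

# Toy yearly growth rates in percent (made-up values)
growth = pd.DataFrame(
    {
        "region_name": ["A"] * 3 + ["B"] * 3,
        "year": [2020, 2021, 2022] * 2,
        "growth": [0.5, 1.0, 3.0, -1.0, 0.0, 2.0],
    }
)


def mean_growth(growth, min_year, max_year):
    """Hypothetical helper: mean yearly growth per region over a period."""
    # between() keeps both bounds, so min_year and max_year are included
    in_range = growth[growth["year"].between(min_year, max_year)]
    return in_range.groupby("region_name", as_index=False)["growth"].mean()


result = mean_growth(growth, 2021, 2022)
```

Keep in mind that the growth attached to min_year measures the change from the year before it, so whether to include that bound is a design choice.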
Expected result
df_growth = solutions.compute_mean_population_growth_per_region(df_regions, 2015, 2022)
df_growth
Your turn!
def compute_mean_population_growth_per_region(df, min_year, max_year):
    # Your code here
    return df_growth
9.3- Write a function plot_growth_population_by_regions(df, geo, min_year, max_year) which represents the average population growth between two given years for all French regions on a choropleth map.
Expected result
solutions.plot_growth_population_by_regions(df_regions, geo, 2015, 2022)
Your turn!
def plot_growth_population_by_regions(df, geo, min_year, max_year):
    # Your code here