Working with CSV and JSON files

In the previous tutorial, we learned how to use functions from modules and how to read and write text files. In this tutorial, we will leverage these new skills by focusing on two types of text files commonly used for storing and sharing data: CSV and JSON files. We will learn to handle these two types of files using Python modules dedicated to their respective processing: the csv module and the json module.

Handling CSV files

CSV files

CSV stands for comma-separated values. CSV files aim to reproduce the structure of data from spreadsheet software like Microsoft Excel or LibreOffice Calc, reduced to strictly textual data (no formatting, no column types, etc.).

We will use the CSV file containing the list of departments in 2021 from the Official Geographic Code (COG) as an example. Let’s look at the first few lines of this file using a shell command to keep in mind the structure of such a file.

!head -n 5 departement2021.csv
DEP,REG,CHEFLIEU,TNCC,NCC,NCCENR,LIBELLE
01,84,01053,5,AIN,Ain,Ain
02,32,02408,5,AISNE,Aisne,Aisne
03,84,03190,5,ALLIER,Allier,Allier
04,93,04070,4,ALPES DE HAUTE PROVENCE,Alpes-de-Haute-Provence,Alpes-de-Haute-Provence

In analogy with a spreadsheet file, each line of the file represents a row in the spreadsheet, and the cells in a row are separated by commas. The first line may contain a header, i.e., the column names, but this is not always the case.

The main advantages of CSV files are:

  • their simplicity: they contain raw textual data, so they are very lightweight and can be easily edited using any text editor or programming language
  • their universality: they are widely used as a standard format for data exchange

The csv module

Since the data in a CSV file is textual, one might wonder why a dedicated module is needed to manipulate it and why the tools we saw in the previous tutorial would not suffice. The main reason is that CSV files have some subtleties and conventions, often invisible to the user, but very important in practice. For example: if we separate values with commas, what happens when a value itself contains a comma?

This is why we use the csv module to interact with such files, leveraging the fact that others have already considered these questions, thus avoiding reinventing the wheel every time we import a CSV file.
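To see this subtlety in action, here is a minimal sketch (using an in-memory buffer rather than a file, and a made-up address) showing how the csv module quotes a value that itself contains a comma:

```python
import csv
import io

# A row in which one value contains a comma (e.g. an address)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Maurice", "12, rue de la Paix", 12])
print(buffer.getvalue())  # Maurice,"12, rue de la Paix",12

# Reading it back recovers the original three values (as strings)
buffer.seek(0)
row = next(csv.reader(buffer))
print(row)  # ['Maurice', '12, rue de la Paix', '12']
```

The writer encloses the ambiguous value in quotes, and the reader uses those quotes to split the line correctly.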

In practice, we tend to handle this type of data in the form of DataFrames (like in R) to take advantage of their tabular structure. We will study the Pandas package, which allows us to do this in Python in a future tutorial. However, it is always useful to know how to handle CSV data as textual data and, thus, know the csv module.

Reading

import csv

The syntax for reading and manipulating CSV files in Python is very similar to that for simple text files. The only difference is that you must create a reader object from the file object to iterate over the lines.

rows = []

with open("departement2021.csv") as file_in:
    csv_reader = csv.reader(file_in)
    for row in csv_reader:
        rows.append(row)

rows[:4]
[['DEP', 'REG', 'CHEFLIEU', 'TNCC', 'NCC', 'NCCENR', 'LIBELLE'],
 ['01', '84', '01053', '5', 'AIN', 'Ain', 'Ain'],
 ['02', '32', '02408', '5', 'AISNE', 'Aisne', 'Aisne'],
 ['03', '84', '03190', '5', 'ALLIER', 'Allier', 'Allier']]

The syntax is the same as for simple text files: once the reader is created, you can iterate over the lines and perform operations with them, such as storing them in a list as shown above.

When you have a CSV file with column names like in our case, it is useful to use them to manipulate named data rather than by position using a simple list. To do this, use a DictReader instead of a reader. Now, when iterating over the DictReader object, each line is a dictionary, with the key being the column name and the value being the cell data.

To illustrate its usefulness, let's display the names of the departments whose department code (the DEP column) starts with 2.

with open("departement2021.csv") as file_in:
    dict_reader = csv.DictReader(file_in)
    for row in dict_reader:
        if row["DEP"].startswith("2"):
            print(row["LIBELLE"])
Côte-d'Or
Côtes-d'Armor
Creuse
Dordogne
Doubs
Drôme
Eure
Eure-et-Loir
Finistère
Corse-du-Sud
Haute-Corse

The code is much more readable: you can easily understand which data is being manipulated and how.

Writing

The syntax for writing is again quite similar to that for text files. The difference is that the data is two-dimensional (rows x columns), so instead of passing a string to write, you pass each row as a list of values.

header = ["name", "class", "age"]
row1 = ["Maurice", "5thB", 12]
row2 = ["Manuela", "6thA", 11]

with open("test.csv", "w") as file_out:
    csv_writer = csv.writer(file_out)
    csv_writer.writerow(header)
    csv_writer.writerow(row1)
    csv_writer.writerow(row2)

Let’s check that our raw CSV file looks as expected.

# Shell command to display the content of a file
!cat test.csv
name,class,age
Maurice,5thB,12
Manuela,6thA,11
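As a counterpart to DictReader, the csv module also provides a DictWriter, which writes rows from dictionaries instead of lists. A short sketch (the file name test_dict.csv is arbitrary):

```python
import csv

rows = [
    {"name": "Maurice", "class": "5thB", "age": 12},
    {"name": "Manuela", "class": "6thA", "age": 11},
]

with open("test_dict.csv", "w") as file_out:
    csv_writer = csv.DictWriter(file_out, fieldnames=["name", "class", "age"])
    csv_writer.writeheader()    # writes the header row from fieldnames
    csv_writer.writerows(rows)  # writes one row per dictionary
```

The fieldnames parameter fixes both the column order and the header, so the resulting file is identical to the one produced above with plain lists.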

The header

Like in a spreadsheet document, the first line of a CSV file usually contains the variable names (columns). This line is called the header. This line is not mandatory in theory, but it is quite handy for quickly understanding the nature of the data in a CSV file. Therefore, it is good practice to include a header when generating a CSV file.

We saw in the previous example that writing the header is done like writing any other data row. It’s during reading that things get complicated because you need to retrieve the header separately from the other data if the CSV file contains one. Let’s use the CSV generated in the previous step to illustrate this.

data = []
with open("test.csv", "r") as file_in:
    csv_reader = csv.reader(file_in)
    header = next(csv_reader)
    for row in csv_reader:
        data.append(row)
print(header)
['name', 'class', 'age']
print(data)
[['Maurice', '5thB', '12'], ['Manuela', '6thA', '11']]

To retrieve the header, use the built-in next function, which calls the __next__ method of the reader object and advances it one step. The first call to next returns the first line of the file; if the file contains a header (which you must verify), that first line is the header. You then retrieve the rest of the data with a loop over the reader object, typically storing it in a list of lists (one list per row).

Importance of the delimiter

The delimiter is the character used to separate successive values in a line in a CSV file.

As its name suggests, the CSV standard uses the comma as the delimiter, but this can be changed, and it is not uncommon to encounter CSV files that use a different one. In such cases, look directly at the raw text to identify the delimiter. For example, you often find tab-separated values (the delimiter is the tab character, \t), and such files may carry the extension .tsv for tab-separated values. In this case, specify the delimiter with the delimiter parameter when creating the reader.

In practice, as with text file encoding, there is rarely a good reason to change the delimiter. Even if commas appear in the values of a file, in an address for example, those values are then enclosed in quotes, which allows the values to be separated correctly in most cases.
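As a sketch, here is how the delimiter parameter changes the parsing of a small tab-separated snippet (the content is made up for the illustration):

```python
import csv
import io

# Hypothetical tab-separated content, as found in a .tsv file
tsv_content = "DEP\tREG\tLIBELLE\n01\t84\tAin\n"

reader = csv.reader(io.StringIO(tsv_content), delimiter="\t")
rows = list(reader)
print(rows)  # [['DEP', 'REG', 'LIBELLE'], ['01', '84', 'Ain']]
```

Without delimiter="\t", each line would be parsed as a single field, since it contains no comma.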

Handling JSON files

JSON files

JSON (JavaScript Object Notation) is a very popular file format for writing and exchanging data in the form of a single, human-readable string — at least in theory.

As its name suggests, JSON is linked to the JavaScript language, as it is derived from the notation of objects in that language. However, the format is now independent of any programming language but is widely used in various languages.

The JSON format is particularly important for statisticians and data scientists because it is the quasi-standard response format for APIs. Interacting with APIs goes beyond this introductory course’s scope. However, as APIs are becoming the standard mode of communication for data exchange, it is essential to master the basics of the JSON format to handle API responses when interacting with them.

Since JSON stores objects as key-value pairs and the values can be arrays — a broad concept in computing that includes lists we know — it closely resembles Python dictionaries. Thus, it is a natural file format for serializing them, i.e., converting a data structure in memory (here, a dictionary) to a byte sequence that any computer can read. Let’s look at the JSON representation of a Python dictionary as an example.

cv = {
    "marc": {"position": "manager", "experience": 7, "hobbies": ["sewing", "frisbee"]},
    "miranda": {"position": "engineer", "experience": 5, "hobbies": ["trekking"]}
}

print(cv)
{'marc': {'position': 'manager', 'experience': 7, 'hobbies': ['sewing', 'frisbee']}, 'miranda': {'position': 'engineer', 'experience': 5, 'hobbies': ['trekking']}}
import json

print(json.dumps(cv))
{"marc": {"position": "manager", "experience": 7, "hobbies": ["sewing", "frisbee"]}, "miranda": {"position": "engineer", "experience": 5, "hobbies": ["trekking"]}}

You can see that the JSON representation is very similar to the Python dictionary, with a few peculiarities. For example, by default, non-ASCII characters such as accented letters are escaped as \uXXXX sequences in the JSON output.
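Since the example above contains no accented characters, here is a small sketch showing this escaping, along with the ensure_ascii parameter of dumps that disables it:

```python
import json

data = {"ville": "Besançon"}
print(json.dumps(data))                      # {"ville": "Besan\u00e7on"}
print(json.dumps(data, ensure_ascii=False))  # {"ville": "Besançon"}
```

Both strings decode back to the same dictionary; the escaped form is simply safer to transmit over channels that only handle ASCII.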

The json module

The json module handles importing JSON files and exporting Python objects to JSON format. It takes care of handling the conversion constraints to JSON mentioned earlier, such as accents.

In particular, JSON can represent most of the built-in Python types we have seen so far (strings, numeric values, Booleans, lists, dictionaries, None), but it cannot directly represent Python objects created manually via classes.
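A quick sketch of these limits: a tuple is silently converted to a JSON array, while a set (a built-in type with no JSON equivalent) raises a TypeError:

```python
import json

# Tuples are serialized as JSON arrays, just like lists
print(json.dumps((1, 2, 3)))  # [1, 2, 3]

# Sets, however, are not serializable
try:
    json.dumps({1, 2, 3})
except TypeError as error:
    print(error)  # Object of type set is not JSON serializable
```

Note that the round trip is therefore not always exact: a tuple written to JSON comes back as a list.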

Writing

Let’s start with writing. As we saw in the previous example, the dumps function (for dump string) converts a serializable Python value to its JSON representation as a string.

x = "test"
json.dumps(x)
'"test"'
x = [1, 2, 3]
json.dumps(x)
'[1, 2, 3]'

Writing a JSON file from Python simply means writing this representation into a text file, which we will give the .json extension to clearly indicate that it is a particular text file. As this operation is very common, there is a similar function, dump, which performs both conversion and writing.

with open("cv.json", "w") as file_out:
    json.dump(cv, file_out)
!cat cv.json
{"marc": {"position": "manager", "experience": 7, "hobbies": ["sewing", "frisbee"]}, "miranda": {"position": "engineer", "experience": 5, "hobbies": ["trekking"]}}

In a single operation, we serialized a Python dictionary (the cv object) into a JSON file.

Reading

The json module provides the load and loads functions, which respectively perform the opposite operations of the dump and dumps functions:

  • The load function reads JSON content from a text file and converts it into a Python object (typically a dictionary).
  • The loads function performs the same conversion on JSON content stored in a string.

Let’s reuse the CV we serialized earlier into JSON format to illustrate reading from a file.

with open("cv.json", "r") as file_in:
    data = json.load(file_in)
    
data
{'marc': {'position': 'manager',
  'experience': 7,
  'hobbies': ['sewing', 'frisbee']},
 'miranda': {'position': 'engineer', 'experience': 5, 'hobbies': ['trekking']}}

We will illustrate reading JSON content from a string with a realistic example: querying an API. For example, we will query the French National Address Base (BAN), which allows geolocating any national address.

Querying an API in Python is straightforward with the requests library. Let's see how we can retrieve geographical information about streets whose name contains "comedie" in France in just a few lines of code.

import requests
response = requests.get("https://api-adresse.data.gouv.fr/search/?q=comedie&type=street")
r_text = response.text
print(r_text[:150])
{"type":"FeatureCollection","version":"draft","features":[{"type":"Feature","geometry":{"type":"Point","coordinates":[3.063832,50.635191]},"properties

The API sends us a response, from which we extract the textual content. As with most APIs, this content is JSON. We can then import it into a Python dictionary using the loads function (for load string) to manipulate the data it contains.

r_dict = json.loads(r_text)
r_dict.keys()
dict_keys(['type', 'version', 'features', 'attribution', 'licence', 'query', 'filters', 'limit'])
type(r_dict["features"])
list

The results we are interested in are contained in the dictionary value associated with the features key, which is a list of dictionaries, one per result.

r_dict["features"][0]
{'type': 'Feature',
 'geometry': {'type': 'Point', 'coordinates': [3.063832, 50.635191]},
 'properties': {'label': 'Rue de la Vieille Comédie 59800 Lille',
  'score': 0.7018699999999999,
  'id': '59350_9149',
  'banId': '01a7bf20-86a1-479c-a4c8-36fbba754195',
  'name': 'Rue de la Vieille Comédie',
  'postcode': '59800',
  'citycode': '59350',
  'oldcitycode': '59350',
  'x': 704523.56,
  'y': 7059804.52,
  'city': 'Lille',
  'oldcity': 'Lille',
  'context': '59, Nord, Hauts-de-France',
  'type': 'street',
  'importance': 0.72057,
  'street': 'Rue de la Vieille Comédie'}}
r_dict["features"][1]
{'type': 'Feature',
 'geometry': {'type': 'Point', 'coordinates': [3.879638, 43.608525]},
 'properties': {'label': 'Place de la Comédie 34000 Montpellier',
  'score': 0.70161,
  'id': '34172_1485',
  'name': 'Place de la Comédie',
  'postcode': '34000',
  'citycode': '34172',
  'x': 771035.57,
  'y': 6279225.95,
  'city': 'Montpellier',
  'context': '34, Hérault, Occitanie',
  'type': 'street',
  'importance': 0.71771,
  'street': 'Place de la Comédie'}}

Exercises

Questions of understanding

  • 1/ What is a CSV file?
  • 2/ What are the advantages of the CSV format?
  • 3/ Why do we use the csv module to read and write CSV files?
  • 4/ Are the data in a CSV file always separated by commas?
  • 5/ What is the header of a CSV file? Does it necessarily exist?
  • 6/ Why is the JSON format widely used in data manipulation?
  • 7/ What Python object does JSON content resemble?
  • 8/ What types of Python objects can be converted to JSON?
  • 9/ What is the serialization of a Python object?
  • 10/ What is the main similarity between CSV and JSON files?
  • 11/ Does a file with a .json extension necessarily contain JSON?
Show solution
  • 1/ A CSV is a text file that represents the raw data of a spreadsheet-like document. Each line of the file represents a row in the spreadsheet, and the cells in a row are separated by commas. The first line may contain a header (column names), but this is not always the case.
  • 2/ Simplicity of reading and editing, universality.
  • 3/ Even though the CSV format is very simple, it has some characteristics (delimiter, end-of-line character, etc.) that need to be considered when reading or editing CSV. The csv module provides functions that account for these peculiarities.
  • 4/ No, data can theoretically be separated by any character or sequence of characters. The comma is the convention, but other delimiters (such as the semicolon or the tab) are common in practice.
  • 5/ It is the first line of the CSV file, which usually contains the variable names, but this is not always the case.
  • 6/ It is the primary response format for APIs, which are widely used for data dissemination and exchange.
  • 7/ Dictionaries.
  • 8/ All serializable objects, which include most of the basic objects we have seen, but not manually created objects via classes.
  • 9/ The serialization of a (serializable) Python object is converting the data contained in that object into a byte sequence, i.e., a message that any computer can understand.
  • 10/ They are text files.
  • 11/ No, JSON files like CSV files are text files. The extension is a convention that allows, in most cases, knowing what the file contains, but it cannot guarantee it.

Sort the keys when writing a JSON

The following cell contains a dictionary. The goal of the exercise is to write this data to a JSON file, sorting the dictionary keys alphabetically.

Hint: The dump function of the json module contains a parameter that allows sorting the keys. Read the function documentation to determine it.

data = {"id": 1, "name": "Isidore", "age": 29}
# Test your answer in this cell
Show solution
import json

data = {"id": 1, "name": "Isidore", "age": 29}

with open("data_sorted.json", "w") as file_out:
    json.dump(data, file_out, sort_keys=True)

Convert a non-serializable object to JSON

We have seen that objects created manually via classes are generally not serializable. The following cell shows an example with our Citron object used in the OOP tutorial. Trying to convert the object directly to JSON raises an error.

You must modify the following code to serialize the object. To do this, you must:

  • Convert the mon_citron instance to a dictionary using the built-in __dict__ attribute that all Python objects have.
  • Convert the obtained dictionary to JSON as a string.
import json

class Citron:

    def __init__(self, color, juice_qty):
        self.flavor = "acidic"
        self.color = color
        self.juice = juice_qty
        
mon_citron = Citron(color="yellow", juice_qty=45)
json.dumps(mon_citron)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[58], line 11
      8         self.juice = juice_qty
     10 mon_citron = Citron(color="yellow", juice_qty=45)
---> 11 json.dumps(mon_citron)

File /opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/json/__init__.py:231, in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    226 # cached encoder
    227 if (not skipkeys and ensure_ascii and
    228     check_circular and allow_nan and
    229     cls is None and indent is None and separators is None and
    230     default is None and not sort_keys and not kw):
--> 231     return _default_encoder.encode(obj)
    232 if cls is None:
    233     cls = JSONEncoder

File /opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/json/encoder.py:199, in JSONEncoder.encode(self, o)
    195         return encode_basestring(o)
    196 # This doesn't pass the iterator directly to ''.join() because the
    197 # exceptions aren't as detailed.  The list call should be roughly
    198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
    200 if not isinstance(chunks, (list, tuple)):
    201     chunks = list(chunks)

File /opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/json/encoder.py:257, in JSONEncoder.iterencode(self, o, _one_shot)
    252 else:
    253     _iterencode = _make_iterencode(
    254         markers, self.default, _encoder, self.indent, floatstr,
    255         self.key_separator, self.item_separator, self.sort_keys,
    256         self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)

File /opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/json/encoder.py:179, in JSONEncoder.default(self, o)
    160 def default(self, o):
    161     """Implement this method in a subclass such that it returns
    162     a serializable object for ``o``, or calls the base implementation
    163     (to raise a ``TypeError``).
   (...)
    177 
    178     """
--> 179     raise TypeError(f'Object of type {o.__class__.__name__} '
    180                     f'is not JSON serializable')

TypeError: Object of type Citron is not JSON serializable
# Test your answer in this cell
Show solution
import json

class Citron:

    def __init__(self, color, juice_qty):
        self.flavor = "acidic"
        self.color = color
        self.juice = juice_qty
        
mon_citron = Citron(color="yellow", juice_qty=45)
mon_citron_dict = mon_citron.__dict__

json.dumps(mon_citron_dict)
'{"flavor": "acidic", "color": "yellow", "juice": 45}'

Change the delimiter of a CSV file

Your current directory contains the file nat2020.csv. It is the file of first names published by Insee, containing data on the first names given to children born in France between 1900 and 2020.

Problem: Contrary to the CSV standard, the delimiter used is not a comma. You must:

  • Find the delimiter used (via the Jupyter text editor, a shell command, or by testing with the csv module in Python) to read the file correctly.
  • Generate a new CSV file nat2020_corr.csv containing the same data, but this time with a comma as the separator.
# Test your answer in this cell
Show solution
# Let's find the delimiter used with a shell command
!head -n 3 nat2020.csv

with open('nat2020.csv', 'r') as file_in:
    # Read the existing CSV file
    reader = csv.reader(file_in, delimiter=';')
    with open('nat2020_corr.csv', 'w') as file_out:
        # Write to the new CSV file
        writer = csv.writer(file_out)  # By default, the delimiter is a comma
        for row in reader:
            writer.writerow(row)
            
# Verify with a shell command
!head -n 3 nat2020_corr.csv
sexe;preusuel;annais;nombre
1;TRESOR;2002;4
2;AWATEF;1972;3
sexe,preusuel,annais,nombre
1,TRESOR,2002,4
2,AWATEF,1972,3

Extract and save data from an API

The exercise is to make a request to the National Address Base API and save the results in a CSV file. Here are the steps to implement:

  • Make a street name request with a keyword like in the tutorial (if you want to make a more complex request, you can check the API documentation) and store the results in a dictionary.
  • Create a CSV file resultats_ban.csv in which we will store the following information: ‘name’, ‘city’, ‘city_code’, ‘longitude’, ‘latitude’.
  • Using a writer object and a loop on the results returned by the API, write each line to the CSV.

For example, for the query of streets containing the word “comédie”, here is the CSV to obtain:

name,city,city_code,longitude,latitude
Rue de la Vieille Comedie,Lille,59350,3.063832,50.635192
Place de la Comédie,Montpellier,34172,3.879638,43.608525
Rue de la Comédie,Cherbourg-en-Cotentin,50129,-1.629732,49.641574
Allee de la Comedie,Villeneuve-d'Ascq,59009,3.162808,50.64628
Rue de l’Ancienne Comedie,Poitiers,86194,0.342649,46.580457
# Test your answer in this cell
Show solution
response = requests.get("https://api-adresse.data.gouv.fr/search/?q=comedie&type=street")
r_text = response.text
r_dict = json.loads(r_text)

with open('resultats_ban.csv', 'w') as file_out:
    header = ['name', 'city', 'city_code', 'longitude', 'latitude']
    csv_writer = csv.writer(file_out)
    csv_writer.writerow(header)
    for result in r_dict['features']:
        name = result['properties']['name']
        city = result['properties']['city']
        city_code = result['properties']['citycode']
        long, lat = result['geometry']['coordinates']
        row = [name, city, city_code, long, lat]
        csv_writer.writerow(row)

Split the department base by regions

The goal of this exercise is to split the CSV file of French departments used in the tutorial into several small CSV files, one per region. This type of operation can be useful, for example, when working with a very large file that does not fit in memory; splitting it into several files to process independently, when possible, reduces the volume.

Here are the operations to perform:

  • Create a dep folder in the current directory using the pathlib module (cf. previous tutorial).
  • With a csv module reader object, loop through the lines of the department CSV file. Be careful not to include the header, using the next function to skip the first line. For each following line:
    • Retrieve the region code (variable REG).
    • Generate the path of the CSV file dep/{REG}.csv where {REG} is to be replaced by the region code of the line.
    • Open this CSV file in append mode to write the line at the end of the file.
# Test your answer in this cell
Show solution
from pathlib import Path

path_dep = Path("dep/")
path_dep.mkdir(exist_ok=True)

with open('departement2021.csv', 'r') as file_in:
    csv_reader = csv.reader(file_in)
    next(csv_reader)  # Skip the header
    for row in csv_reader:
        reg = row[1]
        filename = reg + '.csv'
        path_reg_file = path_dep / filename  # Path of the region csv file
        with open(path_reg_file, 'a') as file_reg_in:
            writer = csv.writer(file_reg_in)
            writer.writerow(row)

Add missing headers

In the previous exercise, we split the CSV file of French departments into several CSV files, one per region. However, we did not include the header in the different files, i.e., the first line containing the column names. We will add it manually to each of the CSV files created in the previous exercise.

Here are the operations to perform:

  • Read the complete department file and retrieve the header in a list with the next function.
  • Record in a list the paths of the different CSV files contained in the dep folder using the glob method of pathlib (cf. previous tutorial).
  • For each path:
    • Open the existing CSV file and retrieve the data as a list of lists (one list per line).
    • Open the CSV file in write mode to reset it, write the header first, and then write the data previously saved in a list of lists.
# Test your answer in this cell
Show solution
from pathlib import Path

with open('departement2021.csv', 'r') as file_in:
    csv_reader = csv.reader(file_in)
    header = next(csv_reader)

dep_files_paths = list(Path("dep/").glob('*.csv'))

for path in dep_files_paths:
    # Read the existing file, storing the lines in a list
    with open(path, 'r') as file_dep_in:
        reader = csv.reader(file_dep_in)
        dep_rows = []
        for row in reader:
            dep_rows.append(row)
    # Rewrite the output file, adding the header first
    with open(path, 'w') as file_dep_out:
        writer = csv.writer(file_dep_out)
        writer.writerow(header)
        for row in dep_rows:
            writer.writerow(row)