Data Structures 2: Dictionaries and Sets

In the previous tutorial, we worked with sequential data structures: lists and tuples. Now, we will discover dictionaries and sets, which are unordered data structures: objects are no longer stored by position (or index) but by key, which is a unique identifier.

Dictionaries

Definition

Dictionaries are unordered collections of key-value pairs. A dictionary is defined using the following syntax: d = {'key1': 'value1', 'key2': 'value2'}.

inventory = {'coffee': '500g', 'milk': '1.5L'}
inventory
{'coffee': '500g', 'milk': '1.5L'}
type(inventory)
dict

It is possible to have as many keys as desired in a dictionary. However, keys are unique to uniquely identify the associated value. If you try to define a dictionary with a duplicated key, Python does not return an error, but only the last duplicated key is taken into account.

inventory = {'coffee': '500g', 'milk': '1.5L', 'coffee': '300g'}
inventory
{'coffee': '300g', 'milk': '1.5L'}

What can a dictionary contain? Keys can be of different types, but strings or integers are generally used. The values of a dictionary can be any type of Python object.

Usage

Since dictionaries are unordered, there is no notion of position: you access a value by its associated key. For example, to retrieve the value ('1.5L') associated with the key 'milk':

inventory['milk']
'1.5L'

Additional key-value pairs can be added to an existing dictionary using variable assignment syntax.

inventory["cereal"] = "250g"
inventory
{'coffee': '300g', 'milk': '1.5L', 'cereal': '250g'}

Unlike lists, keys do not necessarily start at 0 and can be any number.

dic1 = {12: "Aveyron", 33: "Gironde"}

print("The department 33 is " + dic1[33])  # String concatenation!
The department 33 is Gironde

Similarly, values can be of different types, including data containers.

dic2 = {"scale": "C major",
        "notes": ["do", "re", "mi", "fa", "sol", "la", "si"]}

dic2["notes"]
['do', 're', 'mi', 'fa', 'sol', 'la', 'si']

Dictionaries can also contain other dictionaries. This makes them particularly suitable for representing hierarchical data structures.

resume = {
    "marc": {"position": "manager", "experience": 7, "hobbies": ["sewing", "frisbee"]},
    "mirande": {"position": "engineer", "experience": 5, "hobbies": ["trekking"]}
}

print(resume["marc"])
print(resume["marc"]["hobbies"][0])
{'position': 'manager', 'experience': 7, 'hobbies': ['sewing', 'frisbee']}
sewing

Let’s repeat: dictionaries have no notion of order. Therefore, there is no sense in querying the element at position 0 in a dictionary (unless the key 0 exists). Querying a non-existent key returns an error.

dic1 = {12: "Aveyron", 33: "Gironde"}

dic1[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[50], line 3
      1 dic1 = {12: "Aveyron", 33: "Gironde"}
----> 3 dic1[0]

KeyError: 0

Modifying Elements

It is possible to modify a value associated with an existing key in the dictionary. The new value can be of a different type from the original.

inventory = {'coffee': '500g', 'milk': '1.5L'}
inventory['coffee'] = {'arabica': '250g', 'robusta': '400g'}
inventory
{'coffee': {'arabica': '250g', 'robusta': '400g'}, 'milk': '1.5L'}

Deleting Elements

To delete a key (and the associated value), you can use the same operations as those used to delete elements from a list.

# Using the `del` operator
inventory = {'coffee': '500g', 'milk': '1.5L'}
del inventory['milk']
inventory
{'coffee': '500g'}
# Using the `pop` method
inventory = {'coffee': '500g', 'milk': '1.5L'}
inventory.pop('milk')
inventory
{'coffee': '500g'}

Some Useful Methods

We saw earlier that querying a non-existent key returns an error. The .get() method allows querying a key without being sure of its existence, as it does not return an error in this case, but the None object, which we will see in a future tutorial.

inventory = {'coffee': '500g', 'milk': '1.5L'}
inventory.get('honey')

You can also specify a default value when the key does not exist.

inventory.get('honey', 'not found')
'not found'

The .keys(), .values(), and .items() methods return the keys, values, and key-value pairs of a dictionary, respectively. The objects returned by these methods are a bit complex, but they can be converted to lists using the list function to query them by position.

resume = {
    "marc": {"position": "manager", "experience": 7, "hobbies": ["sewing", "frisbee"]},
    "mirande": {"position": "engineer", "experience": 5, "hobbies": ["triathlon"]}
}

list(resume.keys())
['marc', 'mirande']
list(resume.values())
[{'position': 'manager', 'experience': 7, 'hobbies': ['sewing', 'frisbee']},
 {'position': 'engineer', 'experience': 5, 'hobbies': ['triathlon']}]
list(resume.items())
[('marc',
  {'position': 'manager', 'experience': 7, 'hobbies': ['sewing', 'frisbee']}),
 ('mirande',
  {'position': 'engineer', 'experience': 5, 'hobbies': ['triathlon']})]

Sets

Definition

Sets are unordered collections of unique elements. As such, they can be seen as dictionaries without values, where only the keys (which are unique by definition in a dictionary) are kept. Another analogy is that of mathematical sets, whose elements are also unordered and unique.

Due to their similarity to dictionaries, sets are also defined using curly braces {}.

x = {3, 2, 1}
x
{1, 2, 3}
type(x)
set

Just like dictionaries, sets are unordered, so there is no notion of position. Asking for the element at position i, as in a list, returns an error.

x = {3, 2, 1}
x[0]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[61], line 2
      1 x = {3, 2, 1}
----> 2 x[0]

TypeError: 'set' object is not subscriptable

Modifying Elements

It is possible to add an element to a set using the add method.

x = {3, 2, 1}
x.add("4")
x
{1, 2, 3, '4'}

Adding an existing element changes nothing by definition.

x = {3, 2, 1}
x.add(2)
x
{1, 2, 3}

It is possible to remove an element from a set using the remove method.

x = {3, 2, 1}
x.remove(2)
x
{1, 3}

Usage

Sets are not very often used in practice, but they are quite useful in certain specific situations. Due to the uniqueness of the elements they contain, sets allow you to simply and effectively remove duplicates from a sequential container, such as a list.

Deduplication

Suppose you want to remove duplicates from a given list. By definition, converting a list to a set removes duplicates. However, you generally want to return to a list, as sets do not offer the same flexibility as lists (for example, the ability to find an element by position). It is therefore common to perform the list -> set -> list chain of operations to deduplicate a list.

l = [1, 2, 3, 3, 2, 1]
l_dedup = list(set(l))
l_dedup
[1, 2, 3]

Set Operations

Since sets programmatically represent mathematical sets, it is not surprising that they allow for elementary set operations. For example, union and intersection.

l1 = [5, 3, 2, 3, 3, 5, 8, 9]
l2 = [3, 7, 0, 0, 1, 9, 4, 6]
# Union: elements in l1, l2, or both
l_union = list(set(l1) | set(l2))
l_union
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Intersection: elements in both l1 and l2
l_inter = list(set(l1) & set(l2))
l_inter
[9, 3]

Membership Tests

Sets are also very useful for membership tests, as they offer much better performance than lists for this type of test.

The concept of testing will be covered in a future tutorial. For now, let’s note that a membership test such as “is element a in list l” is written in Python as a in l and returns True or False depending on whether a is actually present in list l.

l = [1, 2, 3]
2 in l
True
4 in l
False

Now, imagine performing this test on a list containing millions of elements. Exaggerating, the Python interpreter would then have to go through all the elements of the list one by one until it finds the element in question, which can take a long time.

In contrast, since the elements of a set are unique, Python can easily keep track of the list of unique elements contained in the set and thus conclude the test very quickly. We will see a performance comparison in an end-of-tutorial exercise.

NB: The implementation of the concepts mentioned above is called a “hash table”. Interested readers can find more information about this data structure here.

Exercises

Comprehension Questions

  1. Can you access the ith element of a dictionary? of a set?

  2. What types of objects can be used as dictionary keys? As values?

  3. For what types of data is it beneficial to use a dictionary?

  4. Can a dictionary have duplicate keys?

  5. Why can we say that a set is a special type of dictionary?

  6. Why are sets used to deduplicate lists?

  7. Why are sets more suitable than lists for membership tests?

Show the solution
  1. No, dictionaries and sets are unordered collections of objects.

  2. For values: any type of object. For keys, we generally restrict to strings and/or integers.

  3. Hierarchical data.

  4. No, keys are unique.

  5. A set contains only unique elements and is written with curly braces. It can thus be seen as a special dictionary containing only keys.

  6. By definition, the elements of a set are unique. Converting a list to a set removes duplicates.

  7. Due to the uniqueness of elements, Python can keep track of the positions of different elements. Membership tests are therefore highly optimized compared to performing them with a list.

Querying a Dictionary

Given the dictionary defined in the cell below.

Display using print operations:

  • the list of class names
  • Miranda’s history grade
  • the list of grades obtained by Hypolyte
  • the list of student names in 6emeB
  • the list of subjects taught in 6emeA
  • the list of all subjects taught
  • the list of grades obtained by girls in both classes
results = {
    "6emeA": {"Miranda" : {"grades": {"physics": 16, "history": 12}},
              "Celestin": {"grades": {"physics": "absent", "history": 18}}
             },
    "6emeB": {"Hypolyte": {"grades": {"math": 11, "english": 0}},
              "Josephine": {"grades": {"math": 16, "english": 20}}
             }
}
# Test your answer in this cell
Show the solution
print(list(results.keys()))

print(results["6emeA"]["Miranda"]["grades"]["history"])

print(list(results["6emeB"]["Hypolyte"]["grades"].values()))

print(list(results["6emeB"].keys()))

print(list(results["6emeA"]["Miranda"]["grades"].keys()))

print(list(results["6emeA"]["Miranda"]["grades"].keys()) 
      + list(results["6emeB"]["Josephine"]["grades"].keys()))

print(list(results["6emeA"]["Miranda"]["grades"].values()) 
      + list(results["6emeB"]["Josephine"]["grades"].values()))
['6emeA', '6emeB']
12
[11, 0]
['Hypolyte', 'Josephine']
['physics', 'history']
['physics', 'history', 'math', 'english']
[16, 12, 16, 20]

Dictionary Length

In previous tutorials, we saw the len function, which counts the number of elements in a sequence. Does this function work with dictionaries? What does it count?

# Test your answer in this cell
Show the solution
resume = {
    "marc": {"position": "manager", "experience": 7, "hobbies": ["sewing", "frisbee"]},
    "miranda": {"position": "engineer", "experience": 5, "hobbies": ["trekking"]}
}

print(len(resume))
print(len(resume["marc"]))
2
3

The len function applied to a dictionary counts the number of keys.

Deleting Dictionary Elements

We saw that we can delete a key from a dictionary in two different ways:

  • with the del operator: del my_dict[key]
  • with the pop method: my_dict.pop(key)

Beyond syntax, what are the two major differences between these two ways of deleting a key from a dictionary? Feel free to experiment with examples of your choice.

# Test your answer in this cell
Show the solution
inventory = {'coffee': '500g', 'milk': '1.5L'}

print(inventory.pop('coffee'))
print(inventory.pop('orange', 'unavailable'))
500g
unavailable

1st difference: when deleting an existing key with the pop method, the value associated with the key is returned. The del operation returns nothing (actually, a None object).

2nd difference: the pop method allows specifying a default value in case the key does not exist, and thus does not return an error in this case. The del operation necessarily returns an error when the key does not exist.

Renaming a Dictionary Key

By exploiting the fact that the pop method used to delete a key from a dictionary returns the value associated with that key, propose a method to rename a dictionary key in a single operation.

# Test your answer in this cell
Show the solution
inventory = {'coffee': '500g', 'milk': '1.5L'}

inventory['water'] = inventory.pop('milk')

inventory
{'coffee': '500g', 'water': '1.5L'}

Dictionary Membership Tests

Given the following dictionary:

animals = {'cats': 5, 'dogs': 12}

What will the following membership tests return? Verify your predictions.

  • 'cats' in animals.keys()
  • 'cats' in animals.values()
  • 'cats' in animals
# Test your answer in this cell
Show the solution
animals = {'cats': 5, 'dogs': 12}

print(animals.keys())
print('cats' in animals.keys()) 
# True: 'cats' is indeed in the keys of `animals`

print()
print(animals.values())
print('cats' in animals.values()) 
# False: 'cats' is not a value in `animals`

print()
print(animals)
print('cats' in animals) 
# True: this test is strictly equivalent to 'cats' in animals.keys()
dict_keys(['cats', 'dogs'])
True

dict_values([5, 12])
False

{'cats': 5, 'dogs': 12}
True

Deleting a Non-Existent Key

We saw that the del operation returns an error when used to delete a non-existent key from a dictionary. Using your new knowledge of membership tests, can you imagine a method (without necessarily implementing it) that would handle this case without an error?

# Test your answer in this cell
Show the solution
inventory = {'coffee': '500g', 'milk': '1.5L'}

key = 'chocolate'

if key in inventory.keys():
    del inventory[key]

We use a membership test: if the key exists in the dictionary’s keys, it is deleted. Otherwise, nothing happens. This syntax will become clearer with the next tutorial.

Deduplication

Given the following string with repetitions:

x = "cdabcdabcdabcdabcdabcdabcdabcdabcdab"

Construct a list of unique characters in this string, sorted alphabetically, i.e.:

l = ['a', 'b', 'c', 'd']

Hint: the procedure is similar to removing duplicates from a list.

# Test your answer in this cell
Show the solution
x = "cdabcdabcdabcdabcdabcdabcdabcdabcdab"
l = list(set(x))
l.sort()
l
['a', 'b', 'c', 'd']

Intersections and Unions of Strings

Given the following two strings:

cyrano1 = 'C’est un roc ! … c’est un pic ! … c’est un cap !'

cyrano2 = 'Que dis-je, c’est un cap ? … C’est une péninsule !'

Question 1: find the characters that appear in both strings.

Question 2: find the characters that appear in at least one of the two texts.

# Test your answer in this cell
Show the solution
cyrano1 = 'C’est un roc ! … c’est un pic ! … c’est un cap !'
cyrano2 = 'Que dis-je, c’est un cap ? … C’est une péninsule !'

# Question 1

inter = list(set(cyrano1) & set(cyrano2))
print(inter)

# Question 2

union = list(set(cyrano1) | set(cyrano2))
print(union)
['p', 'a', 'c', '’', ' ', 'e', 'u', 't', 'i', 'C', 's', '!', '…', 'n']
['p', 'a', 'c', 'j', 'e', 't', ',', '-', ' ', 'C', 'u', '!', '…', 'n', 'r', 'Q', '’', 'd', 'o', 'i', 's', 'l', 'é', '?']

The Usefulness of Sets for Membership Tests

The code below generates a list with the letters a, b, c, and d repeated 1 million times. Then, it performs a membership test for a letter that does not exist in the list and calculates the time taken by the Python interpreter to perform the test.

Using this syntax, compare the time taken for the same membership test when the list is converted to a set beforehand.

x = ['a', 'b', 'c', 'd'] * 1000000
print(x[:10])

%time 'e' in x
['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd', 'a', 'b']
CPU times: user 39.7 ms, sys: 32 μs, total: 39.7 ms
Wall time: 39.7 ms
False
# Test your answer in this cell
Show the solution
x = ['a', 'b', 'c', 'd'] * 1000000
print(x[:10])

x_set = set(x)
print(x_set)

%time 'e' in x
%time 'e' in x_set
['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd', 'a', 'b']
{'a', 'd', 'b', 'c'}
CPU times: user 39.9 ms, sys: 53 μs, total: 40 ms
Wall time: 40 ms
CPU times: user 3 μs, sys: 0 ns, total: 3 μs
Wall time: 7.39 μs
False

The initial membership test is measured in milliseconds. The one performed on the set is measured in microseconds. The test is much faster when converted to a set beforehand.