High Performance Python (2024)

Chapter 4. Dictionaries and Sets

Questions You’ll Be Able to Answer After This Chapter

  • What are dictionaries and sets good for?
  • How are dictionaries and sets the same?
  • What is the overhead when using a dictionary?
  • How can I optimize the performance of a dictionary?
  • How does Python use dictionaries to keep track of namespaces?

Sets and dictionaries are ideal data structures to be used when your data has no intrinsic order, but does have a unique object that can be used to reference it (the reference object is normally a string, but can be any hashable type). This reference object is called the “key,” while the data is the “value.” Dictionaries and sets are almost identical, except that sets do not actually contain values: a set is simply a collection of unique keys. As the name implies, sets are very useful for doing set operations.
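As a quick illustration of those set operations (the names here are invented for the example):

```python
# Union, intersection, and difference are single operations on sets.
weekday_staff = {"Alice", "Bob", "Carol"}
weekend_staff = {"Carol", "Dan"}

print(weekday_staff | weekend_staff)  # union: everyone on staff
print(weekday_staff & weekend_staff)  # intersection: works both shifts
print(weekday_staff - weekend_staff)  # difference: weekdays only
```

Each of these would take nested loops to express with lists; with sets they are single, hash-based operations.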

Note

A hashable type is one that implements both the __hash__ magic function and either __eq__ or __cmp__. All native types in Python already implement these, and any user classes have default values. See Hash Functions and Entropy for more details.

While we saw in the previous chapter that we are restricted to, at best, O(log n) lookup time on lists/tuples with no intrinsic order (through a search operation), dictionaries and sets give us O(1) lookups based on an arbitrary index: the key. In addition, like lists/tuples, dictionaries and sets have O(1) insertion time.[4] As we will see in How Do Dictionaries and Sets Work?, this speed is accomplished through the use of an open address hash table as the underlying data structure.

However, there is a cost to using dictionaries and sets. First, they generally take up a larger footprint in memory. Also, although the complexity for insertions/lookups is O(1), the actual speed depends greatly on the hashing function that is in use. If the hash function is slow to evaluate, then any operations on dictionaries or sets will be similarly slow.

Let’s look at an example. Say we want to store contact information for everyone in the phone book. We would like to store this in a form that will make it simple to answer the question, “What is John Doe’s phone number?” in the future. With lists, we would store the phone numbers and names sequentially and scan through the entire list to find the phone number we required, as shown in Example 4-1.

Example 4-1. Phone book lookup with a list

def find_phonenumber(phonebook, name):
    for n, p in phonebook:
        if n == name:
            return p
    return None

phonebook = [
    ("John Doe", "555-555-5555"),
    ("Albert Einstein", "212-555-5555"),
]

print("John Doe's phone number is", find_phonenumber(phonebook, "John Doe"))

Note

We could also do this by sorting the list and using the bisect module in order to get O(log n) performance.
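A minimal sketch of that bisect approach (assuming the phone book is a list of (name, number) tuples kept sorted by name; the function name here is ours, not the book's):

```python
from bisect import bisect_left

def find_phonenumber_sorted(phonebook, name):
    # phonebook must be sorted by name; bisect_left finds the leftmost
    # position where (name,) could be inserted, which is exactly the
    # first entry whose name field equals `name` (if one exists).
    i = bisect_left(phonebook, (name,))
    if i < len(phonebook) and phonebook[i][0] == name:
        return phonebook[i][1]
    return None

phonebook = sorted([
    ("John Doe", "555-555-5555"),
    ("Albert Einstein", "212-555-5555"),
])
print(find_phonenumber_sorted(phonebook, "John Doe"))
```

The lookup is now O(log n), at the cost of keeping the list sorted on every insertion.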

With a dictionary, however, we can simply have the “index” be the names and the “values” be the phone numbers, as shown in Example 4-2. This allows us to simply look up the value we need and get a direct reference to it, instead of having to read every value in our dataset.

Example 4-2. Phone book lookup with a dictionary

phonebook = {
    "John Doe": "555-555-5555",
    "Albert Einstein": "212-555-5555",
}

print("John Doe's phone number is", phonebook["John Doe"])

For large phone books, the difference between the O(1) lookup of the dictionary and the O(n) time for a linear search over the list (or, at best, the O(log n) lookup with the bisect module) is quite substantial.

Tip

Create a script that times the performance of the list-bisect method versus a dictionary for finding a number in a phone book. How does the timing scale as the size of the phone book grows?
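One possible sketch of such a benchmark (the sizes, the random-name scheme, and the helper names are all arbitrary choices of ours):

```python
import random
import string
import timeit
from bisect import bisect_left

def random_name():
    # 10 lowercase letters makes accidental duplicates vanishingly rare
    return "".join(random.choices(string.ascii_lowercase, k=10))

def build(n):
    names = [random_name() for _ in range(n)]
    numbers = ["555-%04d" % i for i in range(n)]
    pairs = sorted(zip(names, numbers))  # sorted list for bisect
    return pairs, dict(zip(names, numbers)), names

def bisect_lookup(pairs, name):
    i = bisect_left(pairs, (name,))
    return pairs[i][1] if i < len(pairs) and pairs[i][0] == name else None

for n in (1_000, 10_000, 100_000):
    pairs, lookup, names = build(n)
    target = random.choice(names)
    t_list = timeit.timeit(lambda: bisect_lookup(pairs, target), number=10_000)
    t_dict = timeit.timeit(lambda: lookup[target], number=10_000)
    print(f"n={n}: bisect {t_list:.4f}s, dict {t_dict:.4f}s")
```

You should see the bisect timings creep up with n while the dictionary timings stay flat.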

If, on the other hand, we wanted to answer the question, “How many unique first names are there in my phone book?” we could use the power of sets. Recall that a set is simply a collection of unique keys—this is the exact property we would like to enforce in our data. This is in stark contrast to a list-based approach, where that property needs to be enforced separately from the data structure by comparing all names with all other names. Example 4-3 illustrates.

Example 4-3. Finding unique names with lists and sets

def list_unique_names(phonebook):
    unique_names = []
    for name, phonenumber in phonebook:  # (1)
        first_name, last_name = name.split(" ", 1)
        for unique in unique_names:  # (2)
            if unique == first_name:
                break
        else:
            unique_names.append(first_name)
    return len(unique_names)

def set_unique_names(phonebook):
    unique_names = set()
    for name, phonenumber in phonebook:  # (3)
        first_name, last_name = name.split(" ", 1)
        unique_names.add(first_name)  # (4)
    return len(unique_names)

phonebook = [
    ("John Doe", "555-555-5555"),
    ("Albert Einstein", "212-555-5555"),
    ("John Murphey", "202-555-5555"),
    ("Albert Rutherford", "647-555-5555"),
    ("Elaine Bodian", "301-555-5555"),
]

print("Number of unique names from set method:", set_unique_names(phonebook))
print("Number of unique names from list method:", list_unique_names(phonebook))

(1), (3): We must go over all the items in our phone book, so each of these loops costs O(n).

(2): Here, we must check the current name against all the unique names we have already seen. If it is a new unique name, we add it to our list of unique names. We then continue through the list, performing this step for every item in the phone book.

(4): For the set method, instead of iterating over all unique names we have already seen, we can simply add the current name to our set of unique names. Because sets guarantee the uniqueness of the keys they contain, if you try to add an item that is already in the set, that item simply won’t be added. Furthermore, this operation costs O(1).

The list algorithm’s inner loop iterates over unique_names, which starts out empty and then grows, in the worst case (when all names are unique), to the size of phonebook. This can be seen as performing a linear search for each name in the phone book over a constantly growing list. Thus, the complete algorithm performs as O(n × u), where u is the number of unique first names: the outer loop contributes the O(n) factor, while the inner loop scans a list that can hold up to u entries. When most names are unique, u approaches n and the algorithm degrades toward O(n²).

On the other hand, the set algorithm has no inner loop; the set.add operation is an O(1) process that completes in a fixed number of operations regardless of how large the phone book is (there are some minor caveats to this, which we will cover while discussing the implementation of dictionaries and sets). Thus, the only nonconstant contribution to the complexity of this algorithm is the loop over the phone book, making this algorithm perform in O(n).

When timing these two algorithms using a phonebook with 10,000 entries and 7,422 unique first names, we see how drastic the difference between the linear set algorithm and the nearly quadratic list algorithm can be:

>>> %timeit list_unique_names(large_phonebook)
1 loops, best of 3: 2.56 s per loop

>>> %timeit set_unique_names(large_phonebook)
100 loops, best of 3: 9.57 ms per loop

In other words, the set algorithm gave us a 267x speedup! In addition, as the size of the phone book grows, the speed gains increase (we get a 557x speedup with a phonebook with 100,000 entries and 15,574 unique first names).

How Do Dictionaries and Sets Work?

Dictionaries and sets use hash tables in order to achieve their O(1) lookups and insertions. This efficiency is the result of a very clever usage of a hash function to turn an arbitrary key (i.e., a string or object) into an index for a list. The hash function and list can later be used to determine where any particular piece of data is right away, without a search. By turning the data’s key into something that can be used like a list index, we can get the same performance as with a list. In addition, instead of having to refer to data by a numerical index, which itself implies some ordering to the data, we can refer to it by this arbitrary key.

Inserting and Retrieving

In order to create a hash table from scratch, we start with some allocated memory, similar to what we started with for arrays. For an array, if we want to insert data, we simply find the smallest unused bucket and insert our data there (and resize if necessary). For hash tables, we must first figure out the placement of the data in this contiguous chunk of memory.

The placement of the new data is contingent on two properties of the data we are inserting: the hashed value of the key and how the key compares to other objects. This is because when we insert data, the key is first hashed and masked so that it turns into an effective index in an array.[5] The mask makes sure that the hash value, which can take the value of any integer, fits within the allocated number of buckets. So, if we have allocated 8 blocks of memory and our hash value is 28975, we consider the bucket at index 28975 & 0b111 = 7. If, however, our dictionary has grown to require 512 blocks of memory, then the mask becomes 0b111111111 (and in this case, we would consider the bucket at index 28975 & 0b111111111 = 303). Now we must check if this bucket is already in use. If it is empty, we can insert the key and the value into this block of memory. We store the key so that we can make sure we are retrieving the correct value on lookups. If the bucket is in use and its stored key is equal to the key we wish to insert, then the key/value pair is already in the hash table and we can return. However, if the keys don’t match, we must find a new place to put the data.
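The hash-and-mask step can be sketched directly. In CPython, small nonnegative integers hash to their own value, which makes the arithmetic easy to follow; string hashes, by contrast, are randomized between interpreter runs:

```python
def bucket_index(key, num_buckets):
    # num_buckets must be a power of 2, so the mask is num_buckets - 1
    mask = num_buckets - 1
    return hash(key) & mask

print(bucket_index(28975, 8))       # 7
print(bucket_index(28975, 512))     # 303
print(bucket_index("John Doe", 8))  # varies run to run (hash randomization)
```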

To find the new index, we compute a new index using a simple linear function, a method called probing. Python’s probing mechanism adds a contribution from the higher-order bits of the original hash (recall that for a table of length 8 we only considered the last 3 bits of the hash for the initial index, through the use of a mask value of mask = 0b111 = bin(8 - 1)). Using these higher-order bits gives each hash a different sequence of next possible hashes, which helps to avoid future collisions. There is a lot of freedom when picking the algorithm to generate a new index; however, it is quite important that the scheme visits every possible index in order to evenly distribute the data in the table. The fraction of the table that is in use is called the “load factor”; how evenly the data is distributed throughout the table is related to the entropy of the hash function. The pseudocode in Example 4-4 illustrates the calculation of hash indices used in CPython 2.7.

Example 4-4. Dictionary lookup sequence

def index_sequence(key, mask=0b111, PERTURB_SHIFT=5):
    perturb = hash(key)  # seed the perturbation with the full hash value
    i = perturb & mask
    yield i
    while True:
        i = (i << 2) + i + perturb + 1  # i.e., i = 5 * i + perturb + 1
        perturb >>= PERTURB_SHIFT
        yield i & mask

hash returns an integer, while the actual C code in CPython uses an unsigned integer. Because of this, this pseudocode doesn’t 100% replicate the behavior in CPython; however, it is a good approximation.

This probing is a modification of the naive method of “linear probing.” In linear probing, we simply yield the values i = (5 * i + 1) & mask, where i is initialized to the hash value of the key and the value 5 is unimportant to the current discussion.[6] An important thing to note is that linear probing deals only with the last several bits of the hash and disregards the rest (i.e., for a dictionary with eight elements, we look only at the last 3 bits, since at that point the mask is 0b111). This means that if hashing two items gives the same last three binary digits, we will not only have a collision, but the entire sequence of probed indices will be the same. The perturbed scheme that Python uses starts taking into consideration more bits from the items’ hashes in order to resolve this problem.
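We can watch the perturbation at work by feeding the scheme from Example 4-4 two raw hash values that share their low bits (a contrived pair, chosen only for illustration; the variant below starts from a raw hash rather than a key):

```python
from itertools import islice

def index_sequence_from_hash(h, mask=0b111, PERTURB_SHIFT=5):
    # Same scheme as Example 4-4, but starting from a raw hash value
    perturb = h
    i = perturb & mask
    yield i
    while True:
        i = (i << 2) + i + perturb + 1
        perturb >>= PERTURB_SHIFT
        yield i & mask

# 0b10000 and 0b110000 share their last three bits (000) ...
a = list(islice(index_sequence_from_hash(0b10000), 5))
b = list(islice(index_sequence_from_hash(0b110000), 5))
print(a, b)  # the first index collides; the sequences then diverge
```

With pure linear probing, both hashes would walk identical index sequences; the perturbation pulls in the differing high bits and separates them.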

A similar procedure is done when we are performing lookups on a specific key:the given key is transformed into an index and that index is examined. If thekey in that index matches (recall that we also store the original key when doinginsert operations), then we can return that value. If it doesn’t, we keepcreating new indices using the same scheme, until we either find thedata or hit an empty bucket. If we hit an empty bucket, we can conclude thatthe data does not exist in the table.

Figure 4-1 illustrates the process of adding some data into a hash table. Here, we chose to create a hash function that simply uses the first letter of the input. We accomplish this by using Python’s ord function on the first letter of the input to get the integer representation of that letter (recall that hash functions must return integers). As we’ll see in Hash Functions and Entropy, Python provides hashing functions for most of its types, so typically you won’t have to provide one yourself except in extreme situations.


Figure 4-1. The resulting hash table from inserting with collisions

Insertion of the key “Barcelona” causes a collision, and a new index is calculated using the scheme in Example 4-4. This dictionary can also be created in Python using the code in Example 4-5.

Example 4-5. Custom hashing function

class City(str):
    def __hash__(self):
        return ord(self[0])

# We create a dictionary where we assign arbitrary values to cities
data = {
    City("Rome"): 4,
    City("San Francisco"): 3,
    City("New York"): 5,
    City("Barcelona"): 2,
}

In this case, “Barcelona” and “Rome” cause the hash collision (Figure 4-1 shows the outcome of this insertion). We see this because, for a dictionary with four elements, we have a mask value of 0b111. As a result, “Barcelona” will try to use index ord("B") & 0b111 = 66 & 0b111 = 0b1000010 & 0b111 = 0b010 = 2. Similarly, “Rome” will try to use the index ord("R") & 0b111 = 82 & 0b111 = 0b1010010 & 0b111 = 0b010 = 2.

Tip

Work through the following problems. A discussion of hash collisions follows:

  1. Finding an element—Using the dictionary created in Example 4-5, what would a lookup on the key “Johannesburg” look like? What indices would be checked?
  2. Deleting an element—Using the dictionary created in Example 4-5, how would you handle the deletion of the key “Rome”? How would subsequent lookups for the keys “Rome” and “Barcelona” be handled?
  3. Hash collisions—Considering the dictionary created in Example 4-5, how many hash collisions could you expect if 500 cities, with names all starting with an uppercase letter, were added into a hash table? How about 1,000 cities? Can you think of a way of lowering this number?

For 500 cities, there would be approximately 474 dictionary elements that collided with a previous value (500 − 26), with each hash having 500 / 26 = 19.2 cities associated with it on average. For 1,000 cities, 974 elements would collide, and each hash would have 1,000 / 26 = 38.4 cities associated with it. This is because the hash is simply based on the numerical value of the first letter, which can only take a value from A–Z, allowing for only 26 independent hash values. This means that a lookup in this table could require as many as 38 subsequent lookups to find the correct value. In order to fix this, we must increase the number of possible hash values by considering other aspects of the city in the hash. The default hash function on a string considers every character in order to maximize the number of possible values. See Hash Functions and Entropy for more explanation.

When a value is deleted from a hash table, we cannot simply write a NULL tothat bucket of memory. This is because we have used NULLs as a sentinel valuewhile probing for hash collisions. As a result, we must write a special valuethat signifies that the bucket is empty, but there still may be values after itto consider when resolving a hash collision. These empty slots can be written toin the future and are removed when the hash table is resized.

Resizing

As more items are inserted into the hash table, the table itself must be resized to accommodate them. It can be shown that a table that is no more than two-thirds full will have optimal space savings while still having a good bound on the number of collisions to expect. Thus, when a table reaches this critical point, it is grown. In order to do this, a larger table is allocated (i.e., more buckets in memory are reserved), the mask is adjusted to fit the new table, and all elements of the old table are reinserted into the new one. This requires recomputing indices, since the changed mask will change the resulting index. As a result, resizing large hash tables can be quite expensive! However, since we only do this resizing operation when the table is too small, as opposed to on every insert, the amortized cost of an insert is still O(1).

By default, the smallest size of a dictionary or set is 8 (that is, if you are only storing three values, Python will still allocate eight elements). On resize, the number of buckets increases by 4x until we reach 50,000 elements, after which the size is increased by 2x. This gives the following possible sizes:

8, 32, 128, 512, 2048, 8192, 32768, 131072, 262144, ...

It is important to note that resizing can happen to make a hash table larger orsmaller. That is, if sufficiently many elements of a hash table are deleted,the table can be scaled down in size. However, resizing only happens during an insert.
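The stepwise growth is visible from the outside with sys.getsizeof (the exact byte counts depend on the Python version and platform; only the pattern of jumps matters):

```python
import sys

d = {}
last = sys.getsizeof(d)
print(f"empty dict: {last} bytes")
for i in range(2048):
    d[i] = None
    size = sys.getsizeof(d)
    if size != last:
        # each jump corresponds to a resize of the underlying table
        print(f"{len(d):5d} items -> {size} bytes")
        last = size
```

Running this shows long flat stretches punctuated by sudden jumps: the flat stretches are O(1) inserts into spare buckets, and each jump is one expensive reallocation.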

Hash Functions and Entropy

Objects in Python are generally hashable, since they already have built-in __hash__ and __cmp__ functions associated with them. For numerical types (int and float), the hash is simply based on the bit value of the number they represent. Tuples and strings have a hash value that is based on their contents. Lists, on the other hand, do not support hashing because their values can change. Since a list’s values can change, so could the hash that represents the list, which would change the relative placement of that key in the hash table.[7]
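A few of these behaviors can be checked directly at the interpreter (the integer identity below holds in CPython for small nonnegative values):

```python
assert hash(42) == 42                     # small ints hash to their own value
assert hash((1, 2)) == hash((1, 2))       # tuples hash by content
assert hash("spam") == hash("spa" + "m")  # strings too: equal content, equal hash

try:
    hash([1, 2])  # lists are mutable, hence unhashable
except TypeError as err:
    print(err)
```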

User-defined classes also have default hash and comparison functions. The default __hash__ function simply returns the object’s placement in memory as given by the built-in id function. Similarly, the __cmp__ operator compares the numerical value of the object’s placement in memory.

This is generally acceptable, since two instances of a class are generally different and should not collide in a hash table. However, in some cases we would like to use set or dict objects to disambiguate between items. Take the following class definition:

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

If we were to instantiate multiple Point objects with the same values for x and y, they would all be independent objects in memory and thus have different placements in memory, which would give them all different hash values. This means that putting them all into a set would result in all of them having individual entries:

>>> p1 = Point(1, 1)
>>> p2 = Point(1, 1)
>>> set([p1, p2])
set([<__main__.Point at 0x1099bfc90>, <__main__.Point at 0x1099bfbd0>])
>>> Point(1, 1) in set([p1, p2])
False

We can remedy this by forming a custom hash function that is based on the actual contents of the object as opposed to the object’s placement in memory. The hash function can be arbitrary as long as it consistently gives the same result for the same object. (There are also considerations regarding the entropy of the hashing function, which we will discuss later.) The following redefinition of the Point class will yield the results we expect:

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __hash__(self):
        return hash((self.x, self.y))

    def __eq__(self, other):
        return self.x == other.x and self.y == other.y

This allows us to create entries in a set or dictionary indexed by the properties of the Point object as opposed to the memory address of the instantiated object:

>>> p1 = Point(1, 1)
>>> p2 = Point(1, 1)
>>> set([p1, p2])
set([<__main__.Point at 0x109b95910>])
>>> Point(1, 1) in set([p1, p2])
True

As alluded to in the earlier note where we discussed hash collisions, a custom-selected hash function should be careful to evenly distribute hash values in order to avoid collisions. Having many collisions will degrade the performance of a hash table: if most keys have collisions, we need to constantly “probe” the other values, effectively walking a potentially large portion of the dictionary in order to find the key in question. In the worst case, when all keys in a dictionary collide, the performance of lookups in the dictionary is O(n) and thus the same as if we were searching through a list.

If we know that we are storing 5,000 values in a dictionary and we need to create a hashing function for the object we wish to use as a key, we must be aware that the dictionary will be stored in a hash table of size 32,768, and thus only the last 15 bits of our hash are being used to create an index (for a hash table of this size, the mask is bin(32768 - 1) = 0b111111111111111).

This idea of “how well distributed my hash function is” is called the entropy of the hash function, defined as

    S = -\sum_{i} p(i) \cdot \log\left( p(i) \right)

where p(i) is the probability that the hash function gives hash i. Entropy is maximized when every hash value has equal probability of being chosen. A hash function that maximizes entropy is called an ideal hash function, since it guarantees the minimal number of collisions.

For an infinitely large dictionary, the hash function used for integers is ideal. This is because the hash value for an integer is simply the integer itself! For an infinitely large dictionary, the mask value is infinite, and thus we consider all bits in the hash value. Therefore, given any two numbers, we can guarantee that their hash values will not be the same.

However, if we made this dictionary finite, we could no longer have this guarantee. For example, for a dictionary with four elements, the mask we use is 0b111. Thus, the number 5 maps to bucket 5 & 0b111 = 5, and the number 501 also maps to bucket 501 & 0b111 = 5, so their entries will collide.
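Seen in code (relying on the CPython property that small integers hash to their own value):

```python
mask = 0b111  # the mask for an 8-bucket table

assert hash(5) & mask == 5
assert hash(501) & mask == 5  # 5 and 501 land in the same bucket

# In fact, every eighth integer shares that bucket:
collisions = sum(1 for n in range(10_000) if hash(n) & mask == 5)
print(collisions)  # 1250
```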

Note

To find the mask for a dictionary with an arbitrary number of elements, N, we first find the minimum number of buckets that dictionary must have to still be no more than two-thirds full (N × 3 / 2). Then we find the smallest table size that will hold this number of buckets (8; 32; 128; 512; 2,048; etc.) and find the number of bits necessary to hold this number. For example, if N = 1039, we must have at least 1,559 buckets, which means we need a table with 2,048 buckets. Thus, the mask is bin(2048 - 1) = 0b11111111111.
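That calculation can be written down directly as a sketch (the function name is ours; the growth pattern follows the list of sizes given earlier, and the two-thirds-full requirement means buckets ≥ 3N/2):

```python
def table_size_and_mask(n_items):
    # Minimum buckets to stay no more than two-thirds full
    min_buckets = n_items * 3 / 2
    size = 8  # the smallest table Python allocates
    while size < min_buckets:
        size *= 4 if size < 50000 else 2  # growth pattern from the text
    return size, size - 1

size, mask = table_size_and_mask(1039)
print(size, bin(mask))  # 2048 0b11111111111
```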

There is no single best hash function to use when using a finite dictionary. However, knowing up front what range of values will be used and how large the dictionary will be helps in making a good selection. For example, if we are storing all 676 combinations of two lowercase letters as keys in a dictionary (aa, ab, ac, etc.), then a good hashing function would be the one shown in Example 4-6.

Example 4-6. Optimal two-letter hashing function

def twoletter_hash(key):
    offset = ord('a')
    k1, k2 = key
    return (ord(k2) - offset) + 26 * (ord(k1) - offset)

This gives no hash collisions for any combination of two lowercase letters, considering a mask of 0b11111111111 (a dictionary of 676 values will be held in a hash table of length 2,048, which has a mask of bin(2048 - 1) = 0b11111111111).
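We can verify the no-collision claim directly by hashing all 676 keys and counting distinct buckets:

```python
import string

def twoletter_hash(key):
    offset = ord('a')
    k1, k2 = key
    return (ord(k2) - offset) + 26 * (ord(k1) - offset)

keys = [a + b for a in string.ascii_lowercase for b in string.ascii_lowercase]
hashes = [twoletter_hash(k) for k in keys]
mask = 0b11111111111  # the mask for a 2,048-bucket table

# 676 keys -> 676 distinct buckets: no collisions
print(len(set(h & mask for h in hashes)))  # 676
print(min(hashes), max(hashes))  # 0 675
```

Because the hash values span exactly 0–675, any mask of 10 or more bits leaves them untouched, which is why the function is ideal for these table sizes.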

Example 4-7 very explicitly shows the ramifications of having a bad hashing function for a user-defined class—here, the cost of a bad hash function (in fact, it is the worst possible hash function!) is a 21.8x slowdown of lookups.

Example 4-7. Timing differences between good and bad hashing functions

import string
import timeit

class BadHash(str):
    def __hash__(self):
        return 42

class GoodHash(str):
    def __hash__(self):
        """
        This is a slightly optimized version of twoletter_hash
        """
        return ord(self[1]) + 26 * ord(self[0]) - 2619

baddict = set()
gooddict = set()

for i in string.ascii_lowercase:
    for j in string.ascii_lowercase:
        key = i + j
        baddict.add(BadHash(key))
        gooddict.add(GoodHash(key))

badtime = timeit.repeat(
    "key in baddict",
    setup="from __main__ import baddict, BadHash; key = BadHash('zz')",
    repeat=3,
    number=1000000,
)
goodtime = timeit.repeat(
    "key in gooddict",
    setup="from __main__ import gooddict, GoodHash; key = GoodHash('zz')",
    repeat=3,
    number=1000000,
)

print("Min lookup time for baddict: ", min(badtime))
print("Min lookup time for gooddict: ", min(goodtime))

# Results:
#   Min lookup time for baddict:  16.3375990391
#   Min lookup time for gooddict: 0.748275995255

Tip

  1. Show that for an infinite dictionary (and thus an infinite mask), using an integer’s value as its hash gives no collisions.
  2. Show that the hashing function given in Example 4-6 is ideal for a hash table of size 1,024. Why is it not ideal for smaller hash tables?

Dictionaries and Namespaces

Doing a lookup on a dictionary is fast; however, doing it unnecessarily will slow down your code, just as any extraneous lines will. One area where this surfaces is in Python’s namespace management, which heavily uses dictionaries to do its lookups.

Whenever a variable, function, or module is invoked in Python, there is a hierarchy that determines where it looks for these objects. First, Python looks inside the locals() array, which has entries for all local variables. Python works hard to make local variable lookups fast, and this is the only part of the chain that doesn’t require a dictionary lookup. If the name doesn’t exist there, the globals() dictionary is searched. Finally, if the object isn’t found there, the __builtin__ object is searched. It is important to note that while locals() and globals() are explicitly dictionaries and __builtin__ is technically a module object, when searching __builtin__ for a given property we are just doing a dictionary lookup inside of its locals() map (this is the case for all module objects and class objects!).

To make this clearer, let’s look at a simple example of calling functions that are defined in different scopes (Example 4-8). We can disassemble the functions with the dis module (Example 4-9) to get a better understanding of how these namespace lookups are happening.

Example 4-8. Namespace lookups

import math
from math import sin

def test1(x):
    """
    >>> %timeit test1(123456)
    1000000 loops, best of 3: 381 ns per loop
    """
    return math.sin(x)

def test2(x):
    """
    >>> %timeit test2(123456)
    1000000 loops, best of 3: 311 ns per loop
    """
    return sin(x)

def test3(x, sin=math.sin):
    """
    >>> %timeit test3(123456)
    1000000 loops, best of 3: 306 ns per loop
    """
    return sin(x)

Example 4-9. Namespace lookups disassembled

>>> dis.dis(test1)
  9   0 LOAD_GLOBAL      0 (math)  # Dictionary lookup
      3 LOAD_ATTR        1 (sin)   # Dictionary lookup
      6 LOAD_FAST        0 (x)     # Local lookup
      9 CALL_FUNCTION    1
     12 RETURN_VALUE

>>> dis.dis(test2)
 15   0 LOAD_GLOBAL      0 (sin)   # Dictionary lookup
      3 LOAD_FAST        0 (x)     # Local lookup
      6 CALL_FUNCTION    1
      9 RETURN_VALUE

>>> dis.dis(test3)
 21   0 LOAD_FAST        1 (sin)   # Local lookup
      3 LOAD_FAST        0 (x)     # Local lookup
      6 CALL_FUNCTION    1
      9 RETURN_VALUE

The first function, test1, makes the call to sin by explicitly looking at the math library. This is also evident in the bytecode that is produced: first a reference to the math module must be loaded, and then we do an attribute lookup on this module until we finally have a reference to the sin function. This is done through two dictionary lookups: one to find the math module and one to find the sin function within the module.

On the other hand, test2 explicitly imports the sin function from the math module, and the function is then directly accessible within the global namespace. This means we can avoid the lookup of the math module and the subsequent attribute lookup. However, we still must find the sin function within the global namespace. This is yet another reason to be explicit about what functions you are importing from a module. This practice not only makes code more readable, because the reader knows exactly what functionality is required from external sources, but it also speeds up code!

Finally, test3 defines the sin function as a keyword argument, with its default value being a reference to the sin function within the math module. While we still do need to find a reference to this function within the module, this is necessary only when the test3 function is first defined. After this, the reference to the sin function is stored within the function definition as a local variable in the form of a default keyword argument. As mentioned previously, local variables do not need a dictionary lookup to be found; they are stored in a very slim array that has very fast lookup times. Because of this, finding the function is quite fast!

While these effects are an interesting result of the way namespaces in Python are managed, test3 is definitely not “Pythonic.” Luckily, these extra dictionary lookups begin to degrade performance only when they are called a lot (i.e., in the innermost block of a very fast loop, such as in the Julia set example). With this in mind, a more readable solution would be to set a local variable with the global reference before the loop is started. We’ll still have to do the global lookup once whenever the function is called, but all the calls to that function in the loop will be made faster. This speaks to the fact that even minute slowdowns in code can be amplified if that code is being run millions of times. Even though a dictionary lookup may take only several hundred nanoseconds, if we loop millions of times over this lookup, it can quickly add up. In fact, looking at Example 4-10, we see a 9.4% speedup simply by making the sin function local to the tight loop that calls it.

Example 4-10. Effects of slow namespace lookups in loops

from math import sin

def tight_loop_slow(iterations):
    """
    >>> %timeit tight_loop_slow(10000000)
    1 loops, best of 3: 2.21 s per loop
    """
    result = 0
    for i in range(iterations):
        # this call to sin requires a global lookup
        result += sin(i)

def tight_loop_fast(iterations):
    """
    >>> %timeit tight_loop_fast(10000000)
    1 loops, best of 3: 2.02 s per loop
    """
    result = 0
    local_sin = sin
    for i in range(iterations):
        # this call to local_sin requires a local lookup
        result += local_sin(i)

Wrap-Up

Dictionaries and sets provide a fantastic way to store data that can be indexed by a key. The way this key is used, through the hashing function, can greatly affect the resulting performance of the data structure. Furthermore, understanding how dictionaries work gives you a better understanding not only of how to organize your data, but also of how to organize your code, since dictionaries are an intrinsic part of Python’s internal functionality.

In the next chapter we will explore generators, which allow us to provide data to code with more control over ordering and without having to store full datasets in memory beforehand. This lets us sidestep many of the possible hurdles that one might encounter when using any of Python’s intrinsic data structures.


[4] As we will discuss in Hash Functions and Entropy, dictionaries and sets are very dependent on their hash functions. If the hash function for a particular datatype is not O(1), any dictionary or set containing that type will no longer have its O(1) guarantee.

[5] A mask is a binary number that truncates the value of another number. So, 0b1111101 & 0b111 = 0b101 = 5 represents the operation of 0b111 masking the number 0b1111101. This operation can also be thought of as taking a certain number of the least-significant digits of a number.

[6] The value of 5 comes from the properties of a linear congruential generator (LCG), which is used in generating random numbers.

[7] More information about this can be found at http://wiki.python.org/moin/DictionaryKeys.
