Python dictionary past, present, future
Dmitry Alimov Senior Software Engineer
Zodiac Interactive
2016
SPb Python Interest Group
Dictionary in Python
>>> d = {} # the same as d = dict()
>>> d['a'] = 123
>>> d['b'] = 345
>>> d['c'] = 678
>>> d
{'a': 123, 'c': 678, 'b': 345}
>>> d['b']
345
>>> del d['c']
>>> d
{'a': 123, 'b': 345}
Dictionary keys must be hashable
An object is hashable if it has a hash value that never changes during its lifetime
>>> d[list()] = 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> d[set()] = 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'set'
>>> d[dict()] = 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
All of Python’s immutable built-in objects are hashable (immutable containers such as tuples only if all of their elements are hashable)
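The hashability rule can be checked directly by calling hash(). A minimal sketch for Python 3; the helper name is_hashable is made up for this illustration and is not part of the standard library.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds (illustrative helper)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


samples = [1, 2.5, 'text', (1, 2), frozenset([1]), [1, 2], {1}, {'a': 1}, (1, [2])]
for obj in samples:
    print('%-16r hashable: %s' % (obj, is_hashable(obj)))
```

The last sample is a tuple that contains a list: the tuple itself is immutable, but it is not hashable because one of its elements is not.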
import random

class A(object):
    def __init__(self, index):
        self.index = index
    def __eq__(self, other):
        return True
    def __hash__(self):
        return random.randint(0, 3)
    def __repr__(self):
        return 'A%d' % self.index

d = {A(0): 0, A(1): 1, A(2): 2}
print('keys: %s' % d.keys())
print('values: %s' % d.values())
for k in d:
    print('%s = %s' % (k, d.get(k, 'not found')))
Random hash is a bad idea
Run 1
keys: [A1, A2, A0]
values: [1, 2, 0]
A1 = 1
A2 = not found
A0 = 0
Run 2
keys: [A1, A0]
values: [2, 0]
A1 = not found
A0 = not found
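For contrast, a sketch of a key type whose hash is stable and consistent with equality. The class name Point and its fields are hypothetical, chosen only for this example.

```python
class Point(object):
    """Hypothetical key type: the hash is derived from the fields used by __eq__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __ne__(self, other):  # needed on Python 2, harmless on Python 3
        return not self == other

    def __hash__(self):
        # Equal objects produce equal hashes, and the value does not change
        # as long as x and y are not modified.
        return hash((self.x, self.y))

    def __repr__(self):
        return 'Point(%d, %d)' % (self.x, self.y)


d = {Point(0, 0): 'origin', Point(1, 2): 'p'}
print(d[Point(1, 2)])  # p
print(len(d))          # 2
```

The fields should be treated as read-only while the object is used as a key; changing them would change the hash and reproduce the "not found" behavior shown above.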
Past
Three kinds of slots in the table: 1) Unused 2) Active 3) Dummy
typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;
- Hash table
- Open addressing collision resolution strategy
- Initial size = 8
- Load factor = 2/3
- Growth rate = 2 or 4 (depending on the number of cells used)
- “/Include/dictobject.h”, “/Objects/dictobject.c”, “/Objects/dictnotes.txt”
Dictionary in CPython >2.1
ma_fill – number of non-NULL keys (sum of Active and Dummy slots)
ma_used – number of Active slots
ma_mask – table size - 1 (initially PyDict_MINSIZE - 1)
ma_lookup – lookup function (lookdict_string by default)

#define PyDict_MINSIZE 8

typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;
    Py_ssize_t ma_used;
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};
Good hash functions are needed
>>> map(hash, [0, 1, 2, 3, 4])
[0, 1, 2, 3, 4]
>>> map(hash, ['abca', 'abcb', 'abcc', 'abcd', 'abce'])
[1540938117, 1540938118, 1540938119, 1540938112, 1540938113]
Modified FNV (Fowler–Noll–Vo) hash function for strings
“-R” option – turns on hash randomization, so that the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value
>>> map(hash, ['abca', 'abcb', 'abcc', 'abcd', 'abce'])
[-218138032, -218138029, -218138030, -218138027, -218138028]
Hash functions
Collision resolution
A collision occurs when two distinct keys have the same hash value (or map to the same slot of the table). Probing is a scheme for resolving collisions in a hash table: alternative slots are tried in a fixed sequence until the key or an unused slot is found. CPython uses pseudo-random probing.
PERTURB_SHIFT = 5
perturb = hash(key)
j = hash(key) % 2**i    # first slot probed (the table size is 2**i)
while True:
    j = (5 * j) + 1 + perturb
    perturb >>= PERTURB_SHIFT
    index = j % 2**i    # next slot to probe
See “/Objects/dictobject.c”
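Why the recurrence uses 5 * j + 1: without the perturb term it visits every slot of a power-of-two table exactly once before repeating, so a free slot is always found eventually. A small sketch for Python 3; the function name cycle is made up here.

```python
def cycle(table_bits, start=0):
    """Slots visited by j = (5*j + 1) % 2**table_bits, ignoring perturb."""
    size = 2 ** table_bits
    j = start
    visited = []
    for _ in range(size):
        visited.append(j)
        j = (5 * j + 1) % size
    return visited


print(cycle(3))  # [0, 1, 6, 7, 4, 5, 2, 3]
```

The perturb term only changes the early probes by mixing in the higher bits of the hash; once it has been shifted down to zero, the sequence continues as the plain recurrence above.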
CPython before 2.2 used polynomial-based index computation
>>> PyDict_MINSIZE = 8
>>> key = 123
>>> hash(key) % PyDict_MINSIZE
3
Index computing
>>> mask = PyDict_MINSIZE - 1
>>> hash(key) & mask
3
Instead of the modulo operation, use bitwise "AND" with the mask
Get the least significant bits of the hash:
2 ** i = PyDict_MINSIZE, hence i = 3, i.e. the three least significant bits are enough
hash(123) = 123 = 0b1111011
mask = PyDict_MINSIZE - 1 = 8 - 1 = 7 = 0b111
index = hash(123) & mask = 0b1111011 & 0b111 = 0b011 = 3
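The equivalence between masking and modulo for power-of-two table sizes can be checked in a few lines. A sketch for Python 3; the sample hash values are arbitrary apart from 123 and the negative value used elsewhere in these slides.

```python
PyDict_MINSIZE = 8
mask = PyDict_MINSIZE - 1

for h in (123, 0b1111011, -1297030748, 2 ** 40 + 5):
    # For a power-of-two size, keeping the low bits equals taking the remainder.
    assert h & mask == h % PyDict_MINSIZE
    print('%d -> slot %d' % (h, h & mask))
```

Python's % always returns a non-negative result for a positive divisor, so the two agree for negative hashes as well. In C the remainder of a negative number can be negative, which is one more reason the C code masks an unsigned value instead.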
mask = PyDict_MINSIZE - 1
index = hash(123) & mask
Integers
Strings
mask = PyDict_MINSIZE - 1
index = hash(123) & mask
Dictionary in CPython >2.1
Dictionary initialization
PyDict_New()
ma_used = 0
ma_fill = 0
ma_mask = PyDict_MINSIZE - 1
ma_table = ma_smalltable
ma_lookup = lookdict_string
Add an item
PyDict_SetItem()
insertdict()
ma_used += 1
ma_fill += 1
dictresize() if ma_fill >= 2/3 * size
Delete an item
PyDict_DelItem()
ma_used -= 1
Add item
hash('!!!') = -1297030748
i = -1297030748 & 7 = 4

perturb = -1297030748
# i = (i * 5) + 1 + perturb
i = (4 * 5) + 1 + (-1297030748) = -1297030727
index = -1297030727 & 7 = 1

# perturb = perturb >> PERTURB_SHIFT
perturb = -1297030748 >> 5 = -40532211
# i = (i * 5) + 1 + perturb
i = (-1297030727 * 5) + 1 + (-40532211) = -6525685845
index = -6525685845 & 7 = 3
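The worked example can be reproduced with a short function. This is a sketch in Python 3 that follows the arithmetic on the slide with Python's signed integers; the C code works on unsigned values, which gives the same low bits here. The name probe_sequence is made up, and count is assumed to be at least 1.

```python
PERTURB_SHIFT = 5


def probe_sequence(hash_value, table_size, count):
    """Return the first `count` slot indices tried for `hash_value`."""
    mask = table_size - 1
    i = hash_value & mask          # first probe: low bits of the hash
    slots = [i]
    perturb = hash_value
    while len(slots) < count:
        i = (i * 5) + 1 + perturb  # mix in the higher bits of the hash
        perturb >>= PERTURB_SHIFT
        slots.append(i & mask)
    return slots


print(probe_sequence(-1297030748, 8, 3))  # [4, 1, 3]
```

The hash value is the one quoted for '!!!' on the slide; string hashes differ between builds and with hash randomization, so it is passed in as a number rather than recomputed.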
>>> d
{'python': 2, 'article': 4, '!!!': 5, 'dict': 3, 'a key': 1}
>>> d.__sizeof__()
248
Add item
Hash table resize
>>> d
{'!!!': 5, 'python': 2, 'dict': 3, 'a key': 1, 'article': 4, ';)': 6}
>>> d.__sizeof__()
1016
Hash table resize
dictresize(PyDictObject *mp, Py_ssize_t minused)
{
    ...
    /* Find the smallest table size > minused. */
    for (newsize = 8; newsize <= minused && newsize > 0; newsize <<= 1)
        ;
    ...
}

PyDict_SetItem(...)
{
    ...
    dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used);
    ...
}
In the example:
ma_fill = 6 > (8 * 2 / 3)
ma_used = 6
Hence minused = 4 * 6 = 24, therefore newsize = 32
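The sizing arithmetic can be written out as a small Python 3 function. This is a sketch that mirrors only the two C fragments quoted above (the growth factor and the power-of-two loop); new_table_size is a made-up name.

```python
def new_table_size(ma_used, min_size=8):
    """Table size chosen on resize, following the C fragments quoted above."""
    minused = (2 if ma_used > 50000 else 4) * ma_used
    newsize = min_size
    while newsize <= minused:  # smallest power of two strictly greater than minused
        newsize <<= 1
    return newsize


print(new_table_size(6))  # 32
print(new_table_size(1))  # 8
```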
Addition order
>>> d1 = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}
>>> d2 = {'three': 3, 'two': 2, 'five': 5, 'four': 4, 'one': 1}
>>> d1 == d2
True
>>> d1.keys()
['four', 'three', 'five', 'two', 'one']
>>> d2.keys()
['four', 'one', 'five', 'three', 'two']
The order of items in the dictionary depends on the order in which they were added and on the items already in it
>>> 7.0 == 7 == (7+0j)
True
>>> d = {}
>>> d[7.0] = 'float'
>>> d
{7.0: 'float'}
>>> d[7] = 'int'
>>> d
{7.0: 'int'}
>>> d[7+0j] = 'complex'
>>> d
{7.0: 'complex'}
>>> type(d.keys()[0])
<type 'float'>
int, float, complex
>>> hash(7)
7
>>> hash(7.0)
7
>>> hash(7+0j)
7
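The same behavior as a script for Python 3, where d.keys() is no longer a list: keys that compare equal and hash equally share one slot, and the key object stored first is kept while the value is replaced.

```python
d = {}
d[7.0] = 'float'
d[7] = 'int'           # equal to 7.0 with the same hash: value replaced, key kept
d[7 + 0j] = 'complex'

print(d)                                     # {7.0: 'complex'}
print(type(next(iter(d))))                   # <class 'float'>
print(hash(7) == hash(7.0) == hash(7 + 0j))  # True
```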
>>> d = {'a': 1}
>>> for i in d:
... d['new item'] = 123
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration
Adding item during iteration
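A common way around the error is to iterate over a snapshot of the keys and modify the dictionary itself. A minimal sketch; the '_copy' suffix is arbitrary.

```python
d = {'a': 1, 'b': 2}

# list(d) copies the keys first, so the dict can grow inside the loop.
for key in list(d):
    d[key + '_copy'] = d[key]

print(sorted(d))  # ['a', 'a_copy', 'b', 'b_copy']
```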
Delete item
dummy = PyString_FromString("<dummy key>");
Interesting case
ma_fill = 6 > (8 * 2 / 3), so dictresize() is called
ma_used = 1, hence minused = 4 * 1 = 4, therefore newsize = 8
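The case can be replayed with a toy model that tracks only the counters and stores no data. A sketch for Python 3 under stated simplifications: every new key is assumed to land in an Unused slot, and the 50000-item threshold is ignored. The class and method names are made up.

```python
class ToyCounters(object):
    """Tracks ma_fill / ma_used / table size only (illustrative model)."""

    def __init__(self):
        self.size, self.fill, self.used = 8, 0, 0

    def add_new_key(self):
        # Assumption: the key goes into an Unused slot, not a Dummy one.
        self.fill += 1
        self.used += 1
        if self.fill * 3 >= self.size * 2:
            self._resize(4 * self.used)

    def delete_key(self):
        self.used -= 1  # the slot becomes Dummy, so fill is unchanged

    def _resize(self, minused):
        newsize = 8
        while newsize <= minused:
            newsize <<= 1
        self.size = newsize
        self.fill = self.used  # Dummy slots are not carried over


t = ToyCounters()
for _ in range(5):
    t.add_new_key()
for _ in range(5):
    t.delete_key()
t.add_new_key()                # fill reaches 6 while used is only 1
print(t.size, t.fill, t.used)  # 8 1 1
```

The "resize" keeps the table at 8 slots; its only effect is to clear the Dummy slots left by the deletions.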
Cache
PyDictEntry ma_smalltable[8];
On 32-bit x86 with 64 bytes per cache line: 64 / (4 * 3) = 5.333 entries (each entry holds three 4-byte fields)
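The same arithmetic for two field sizes, as a small sketch. It assumes that me_hash has the same width as a pointer and that a cache line is 64 bytes; the 8-byte case is an extrapolation to 64-bit builds and is not taken from the slides.

```python
CACHE_LINE_BYTES = 64
FIELDS_PER_ENTRY = 3  # me_hash, me_key, me_value


def entries_per_cache_line(field_bytes):
    """How many three-field entries fit into one cache line."""
    return CACHE_LINE_BYTES / float(FIELDS_PER_ENTRY * field_bytes)


for field_bytes in (4, 8):  # assumed 32-bit and 64-bit builds
    print('%d-byte fields: %.3f entries'
          % (field_bytes, entries_per_cache_line(field_bytes)))
```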
typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;
Cache locality and collisions
See “/Objects/dictnotes.txt”
Source      Access time
L1 Cache    1 ns
L2 Cache    4 ns
RAM         100 ns
Open addressing vs separate chaining
Note: the illustration shows linear probing rather than the pseudo-random probing used in CPython
OrderedDict
from collections import OrderedDict
- Internal dict
- Circular doubly linked list
- “/Lib/collections/__init__.py”
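A short usage sketch of what the linked list provides on top of the internal dict: iteration in insertion order, and equality that takes the order into account.

```python
from collections import OrderedDict

od = OrderedDict()
od['one'] = 1
od['two'] = 2
od['three'] = 3
print(list(od))  # ['one', 'two', 'three']

del od['two']
od['two'] = 2    # a re-inserted key goes to the end
print(list(od))  # ['one', 'three', 'two']

# Unlike plain dicts, equality between two OrderedDicts is order-sensitive.
print(od == OrderedDict([('one', 1), ('two', 2), ('three', 3)]))  # False
```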
Present
Dictionary in CPython 3.5
- PEP 412 - Key-Sharing Dictionary
- The DictObject can be in one of two forms: combined table or split table
- Initial size = 4 (split table) or 8 (combined table)
- Maximum dictionary load = (2*n+1)/3
- Growth rate = used*2 + capacity/2
- “/Objects/dict-common.h”, “/Include/dictobject.h”, “/Objects/dictobject.c”, “/Objects/dictnotes.txt”

typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;

struct _dictkeysobject {
    Py_ssize_t dk_refcnt;
    Py_ssize_t dk_size;
    dict_lookup_func dk_lookup;
    Py_ssize_t dk_usable;
    PyDictKeyEntry dk_entries[1];
};

typedef struct {
    PyObject_HEAD
    Py_ssize_t ma_used;
    PyDictKeysObject *ma_keys;
    PyObject **ma_values;
} PyDictObject;
Combined table vs split table
Combined table
- For explicit dictionaries (dict() and {})
- ma_values = NULL, dk_refcnt = 1
- Never becomes a split-table dictionary
Split table
- For attribute dictionaries (the __dict__ attribute of an object)
- ma_values != NULL, dk_refcnt >= 1
- Only string (unicode) keys are allowed
- Values are stored in the ma_values array
- When a split dictionary is resized, it is converted to a combined table (but if the resize results from storing an instance attribute, and there is only one instance of the class, the dictionary is re-split immediately)
- Lookup function = lookdict_split
Dictionary in CPython 3.5
A new kind of slot: 1) Unused 2) Active 3) Dummy 4) Pending (me_key != NULL, me_key != dummy and me_value == NULL)
typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;
Split table
Initial size = 4
Maximum dictionary load = (2*n+1)/3 = (2*4+1)/3 = 3, i.e. initially ma_keys->dk_usable = 3
Split table
class A():
    def __init__(self):
        self.a = 1
        self.b = 2
        self.c = 3

a = A()
print(a.__dict__.__sizeof__())  # 72
setattr(a, 'd', 4)  # re-split
print(a.__dict__.__sizeof__())  # 168
print({}.__sizeof__())  # 264
Initial size = 4
Maximum dictionary load = (2*n+1)/3 = (2*4+1)/3 = 3
Growth rate = used*2 + capacity/2 = 3*2 + 4/2 = 8,
hence minused = 8, therefore newsize = 16 (see dictresize)
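The three steps of this calculation as a Python 3 sketch. The function names are made up; the formulas are the ones quoted above, with integer division, and the final step assumes the same "smallest power of two greater than minused, at least 8" rule as dictresize.

```python
def usable_slots(n):
    """Maximum dictionary load for a table of n slots."""
    return (2 * n + 1) // 3


def grown_minused(used, capacity):
    """Growth rate formula: used*2 + capacity/2."""
    return used * 2 + capacity // 2


def table_size_for(minused, min_size=8):
    # Smallest power of two strictly greater than minused, never below min_size.
    return max(min_size, 1 << minused.bit_length())


used = usable_slots(4)
minused = grown_minused(used, 4)
print(used, minused, table_size_for(minused))  # 3 8 16
```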
class A():
    def __init__(self):
        self.a = 1
        self.b = 2
        self.c = 3

a = A()
print(a.__dict__.__sizeof__())  # 72
b = A()
setattr(a, 'd', 4)  # no re-split because of b
print(a.__dict__.__sizeof__())  # 456
Split table
Split table is converted to a combined table
Key differences between this implementation and CPython 2.x:
- The table can be split into two parts: the keys and the values
- A new kind of slot
- No more ma_smalltable embedded in the dict
- General dictionaries are slightly larger
- All object dictionaries of a single class can share a single key table, saving about 60% memory for such cases (according to https://github.com/python/cpython/blob/3.5/Objects/dictnotes.txt)
Bugs still happen: unbounded memory growth when resizing split-table dicts (https://bugs.python.org/issue28147)
Summary
Hash functions in CPython 3.5
SipHash for strings and bytes (>= CPython 3.4)
PEP 456 – Secure and interchangeable hash algorithm
- Resistant against hash flooding DoS attacks
- Successfully used in many other languages
Slightly modified hash function for float
hash(float("+inf")) == 314159
hash(float("-inf")) == -314159 (was -271828)
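These values can be inspected at run time through sys.hash_info (available since Python 3.2, with the algorithm field added in 3.4). A short sketch; the algorithm name printed depends on the interpreter version and build.

```python
import sys

print(sys.hash_info.algorithm)  # for example 'siphash24'
print(sys.hash_info.inf)        # 314159

print(hash(float('inf')) == sys.hash_info.inf)    # True
print(hash(float('-inf')) == -sys.hash_info.inf)  # True
```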
OrderedDict in CPython 3.5
- Doubly linked list
- od_fast_nodes hash table that mirrors the od_dict table
- “/Include/odictobject.h”, “/Objects/odictobject.c”
Alternative versions
Dictionary in PyPy
- Starting from PyPy 2.5.0, ordereddict is used by default
- Initial size = 16
- Load factor up to 2/3
- Growth rate = 4 (up to 30000 items) or 2
- If a lot of items are deleted, compaction is performed
- “/rpython/rtyper/lltypesystem/rordereddict.py”

struct dicttable {
    int num_live_items;
    int num_ever_used_items;
    int resize_counter;
    variable_int *indexes; // byte, short, int, long
    dictentry *entries;
    ...
}

struct dictentry {
    PyObject *key;
    PyObject *value;
    long hash;
    bool valid;
}
Dictionary in PyPy
struct dicttable {
    variable_int *indexes;
    dictentry *entries;
    ...
}

FREE = 0
DELETED = 1
VALID_OFFSET = 2
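One way to read these constants: the values 0 and 1 in the indexes array are reserved as markers, so a position in entries is stored shifted by VALID_OFFSET. The sketch below illustrates that reading in Python 3 and is an assumption about the encoding, not code taken from PyPy; encode and describe are made-up names.

```python
FREE = 0
DELETED = 1
VALID_OFFSET = 2


def encode(entry_index):
    """Value stored in `indexes` for a live entry (assumed encoding)."""
    return entry_index + VALID_OFFSET


def describe(stored):
    """Interpret one value read from `indexes`."""
    if stored == FREE:
        return 'free'
    if stored == DELETED:
        return 'deleted'
    return 'entry %d' % (stored - VALID_OFFSET)


indexes = [FREE, encode(0), DELETED, encode(1)]
print([describe(v) for v in indexes])  # ['free', 'entry 0', 'deleted', 'entry 1']
```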
PyDictionary in Jython
- Based on ConcurrentHashMap
- Separate chaining collision resolution
- Initial size = 16, load factor = 0.75, growth rate = 2
- Segments and thread safety
PythonDictionary in IronPython
- Based on Dictionary (.NET)
- Separate chaining collision resolution
- Initial size = 0, load factor = 1.0
- Rehashing if the number of collisions >= 100
- Growth rate = 2 (the new size is the next higher prime number taken from the set primes = {3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107, …, 4999559, 5999471, 7199369})
Future
Raymond Hettinger is happy
Dictionary in CPython 3.6
typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;

typedef struct {
    PyObject_HEAD
    Py_ssize_t ma_used;       /* number of items in the dictionary */
    uint64_t ma_version_tag;  /* unique, changes when dict modified */
    PyDictKeysObject *ma_keys;
    PyObject **ma_values;
} PyDictObject;

- ma_version_tag is added (PEP 509 – Add a private version to dict)
- Initial size = 8 (for split table too)
- Maximum dictionary load = (2*n)/3
- Contributed by INADA Naoki in https://bugs.python.org/issue27350

Four kinds of slots in the table:
1) Unused (index == DKIX_EMPTY == -1)
2) Active (index >= 0, me_key != NULL and me_value != NULL)
3) Dummy (index == DKIX_DUMMY == -2, only for combined table)
4) Pending (index >= 0, me_key != NULL and me_value == NULL, only for split table)
Dictionary in CPython 3.6
- Added dk_nentries and dk_indices
struct _dictkeysobject {
    Py_ssize_t dk_refcnt;
    Py_ssize_t dk_size;         /* Size of the hash table (dk_indices) */
    dict_lookup_func dk_lookup; /* Function to lookup in dk_indices */
    Py_ssize_t dk_usable;       /* Number of usable entries in dk_entries */
    Py_ssize_t dk_nentries;     /* Number of used entries in dk_entries */
    union {
        int8_t as_1[8];
        int16_t as_2[4];
        int32_t as_4[2];
#if SIZEOF_VOID_P > 4
        int64_t as_8[1];
#endif
    } dk_indices;
    PyDictKeyEntry dk_entries[dk_usable]; /* using DK_ENTRIES macro */
};
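The idea behind dk_indices and dk_entries can be shown with a toy model in Python 3: a sparse table of small integers that point into a dense, insertion-ordered list of entries. This is a sketch under heavy simplifications (fixed size, no deletion, linear probing instead of CPython's perturbed probing); ToyCompactDict and its methods are made-up names.

```python
EMPTY = -1  # plays the role of DKIX_EMPTY


class ToyCompactDict(object):
    """Sparse index table plus dense, insertion-ordered entries (toy model)."""

    def __init__(self, size=8):
        self.indices = [EMPTY] * size  # plays the role of dk_indices
        self.entries = []              # plays the role of dk_entries

    def _slot(self, key):
        """Return the slot holding `key`, or the first empty slot on its path."""
        mask = len(self.indices) - 1
        i = hash(key) & mask
        while True:
            ix = self.indices[i]
            if ix == EMPTY or self.entries[ix][1] == key:
                return i
            i = (i + 1) & mask  # linear probing, a simplification

    def __setitem__(self, key, value):
        i = self._slot(key)
        if self.indices[i] != EMPTY:
            self.entries[self.indices[i]][2] = value  # existing key: update
            return
        if len(self.entries) >= len(self.indices) - 1:
            raise OverflowError('toy table is full (no resizing implemented)')
        self.indices[i] = len(self.entries)
        self.entries.append([hash(key), key, value])

    def __getitem__(self, key):
        ix = self.indices[self._slot(key)]
        if ix == EMPTY:
            raise KeyError(key)
        return self.entries[ix][2]

    def keys(self):
        return [key for _, key, _ in self.entries]


d = ToyCompactDict()
d['python'] = 2
d['dict'] = 3
d['a key'] = 1
d['dict'] = 30
print(d.keys())   # ['python', 'dict', 'a key']
print(d['dict'])  # 30
```

Iterating over entries gives insertion order for free, and the sparse part of the table holds only small integers instead of three-field entries, which is where the memory saving comes from.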
Dictionary in CPython 3.6 (Combined table)
Key differences between this implementation and CPython 3.5:
- Compact and ordered
- Added dk_indices, whose element type depends on the size of the dictionary
- Added ma_version_tag (PEP 509)
- Initial size for split table is changed to 8
- Maximum dictionary load changed to (2*n)/3
- Deleting an item from a split-table dict converts it to a combined table
- Preserving the order of **kwargs in a function (PEP 468) is implemented
- Preserving Class Attribute Definition Order (PEP 520) is implemented
- The memory usage of the new dict() is between 20% and 25% smaller compared to Python 3.5 (https://docs.python.org/3.6/whatsnew/3.6.html#other-language-changes)
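The two ordering guarantees (PEP 468 and PEP 520) are visible from plain Python on 3.6 and later. A small sketch; the function collect and the class Config are hypothetical names used only here.

```python
def collect(**kwargs):
    """Return keyword argument names in the order they were passed."""
    return list(kwargs)


print(collect(zeta=1, alpha=2, mid=3))  # ['zeta', 'alpha', 'mid']


class Config:
    first = 1
    second = 2
    third = 3


# The class namespace keeps definition order; dunder entries are filtered out.
names = [name for name in vars(Config) if not name.startswith('__')]
print(names)  # ['first', 'second', 'third']
```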
Summary
References
1. The implementation of a dictionary in Python 2.7 https://habrahabr.ru/post/247843/
2. Python hash calculation algorithms http://delimitry.blogspot.com/2014/07/python-hash-calculation-algorithms.html
3. PEP 412 - Key-Sharing Dictionary https://www.python.org/dev/peps/pep-0412/
4. PEP 456 - Secure and interchangeable hash algorithm https://www.python.org/dev/peps/pep-0456/
5. Mirror of the CPython repository https://github.com/python/cpython/
6. Faster, more memory efficient and more ordered dictionaries on PyPy https://morepypy.blogspot.com/2015/01/faster-more-memory-efficient-and-more.html
7. PyDictionary (Jython API documentation) http://www.jython.org/javadoc/org/python/core/PyDictionary.html
8. Jython repository https://bitbucket.org/jython/jython
9. Java theory and practice: Building a better HashMap http://www.ibm.com/developerworks/library/j-jtp08223/
10. Back to basics: Dictionary part 2, .NET implementation https://blog.markvincze.com/back-to-basics-dictionary-part-2-net-implementation/
11. http://referencesource.microsoft.com/mscorlib/system/collections/generic/dictionary.cs.html
12. https://github.com/IronLanguages/main/blob/ipy-2.7-maint/Languages/IronPython/IronPython/
13. https://bitbucket.org/pypy/pypy/
14. https://twitter.com/raymondh
15. PEP 509 - Add a private version to dict https://www.python.org/dev/peps/pep-0509/
16. Compact and ordered dict http://bugs.python.org/issue27350
17. What’s New In Python 3.6 https://docs.python.org/3.6/whatsnew/3.6.html
18. PEP 468 - Preserving the order of **kwargs in a function https://www.python.org/dev/peps/pep-0468/
19. PEP 520 - Preserving Class Attribute Definition Order https://www.python.org/dev/peps/pep-0520/
20. https://en.wikipedia.org/
Images from:
http://www.rcreptiles.com/blog/index.php/2008/06/28/read_the_operating_manual_first
http://kiwigamer450.deviantart.com/art/Back-to-The-Past-Logo-567858767
http://beyondplm.com/wp-content/uploads/2014/04/time-paradox-past-future-present.jpg
http://itband.ru/wp-content/uploads/2014/10/Future.jpg
https://en.wikipedia.org/wiki/Hash_table
Q & A
@delimitry
spbpython.guru
SPb Python Interest Group
Additional slides
Separate chaining collision resolution
Open addressing collision resolution (pseudo-random probing)