unintentional nondeterminism in a Python program

There’s a Python 3 program I’ve been developing off-and-on for three years now. The program takes an XML file as input, does some processing, and outputs a binary file.  After some recent changes, I find that the program hangs. If I hit control-C, I get a traceback.  I added a print in one of the loops to tell me which object it is processing, and was astounded to find that if I run the program multiple times with the same input, it processes the objects in different orders.  There is no (intentional) non-determinism in my program.

Python has as one of its basic data types a dictionary, which is a key to value mapping. Internally it is implemented as a hash table, so the key can be of any data type that can be hashed, including numbers and strings. If you enumerate the keys that are present in a dictionary, you will get them in an unpredictable order. Historically it was unpredictable but deterministic, based on the hash values.

For my purposes, the dictionary enumeration order being unpredictable is fine, but I want it to be deterministic, so that multiple runs of the program with identical input will produce identical output.

It turns out that starting with Python 3.3, by default the hash function used is randomized, so the same program with the same dictionary contents will result in different enumeration orders on different runs:

http://stackoverflow.com/questions/14956313/why-is-dictionary-ordering-non-deterministic

The hash randomization is claimed to be a security feature.  I’m unclear on what security it provides.  It is possible to disable hash randomization using an environment variable passed to the Python interpreter.

I’m using Python 3.5.  Starting with Python 3.6, the implementation of dictionaries is different, so even though hash randomization is still the default, the dictionary enumeration order is no longer based on the hash, and is once again deterministic.

At some point, Python added support for a new dictionary type OrderedDict, which enumerates in the order of element insertion, which happens to be the same thing that Python 3.6 does with normal dictionaries. After changing one line in my program to initialize a particular variable as an OrderedDict rather than a “plain” dictionary, my program is once again deterministic, even with hash randomization enabled.

This entry was posted in Computing, Software. Bookmark the permalink.

Leave a Reply