Aidan Mitchell

An exercise in profiling and enhancing some Python

A couple of days ago JC/yosignals/thecontractor wrote a blog about generating a wordlist for Three-Word Password Attacks using Python. The source dictionary for generation is the Oxford 5000 corpus which is an expansion on the Oxford 3000, a list of the 3,000 core words that English language learners should know.

The prototype in Python is neat and straightforward. The 5000 word (newline-separated) source is ingested to a list. The Python itertools.combinations function is used to generated all possible 3-word combinations from the list and write them back to a file on disk. The expected content of the file resembles a newline-separated list of three-word combinations like applebananacarrot.

1with open('input.txt', 'r') as f:
2    words =
4with open('output.txt', 'w') as out_file:
5    for combo in combinations(words, 3):
6        out_file.write(''.join(combo) + '\n')

The use of the itertools.combinations function makes this program somewhat low intensity since the function returns a generator, rather than a list. Each combination is generated lazily (on demand) which results in significantly lower memory usage. However, the execution of the program when consuming a 5000 word source file is significant (suggested at 20 hours in JC’s blog).

Optimise and fail

I originally tried optimising this code with Go and then with Rust. When targeting the full wordlist, there seems to be some proportional degree of improvement, primarily due to the (seemingly) more optimised asynchronous functionality in these systems languages. It’s hard to do lots of iterations on the 5000 word list because of how long it takes which impedes researching why and where these implementations are faster. However, after returning to the problem in Python and profiling the code instead (as I should have done originally), I realised that for wordlists <= 1,000 (give or take a 1,000), Python was notably faster due to the hugely optimised nature of itertools.combinations. Unsurprisingly, the optimised function effortlessly beats out of my fragile Go and Rust algorithms.

So, let’s profile the hell out of it, try some optimisations on the least efficient aspects of the Python code and see what speed ups can be achieved.

Profiling variations of the original

I wrote 3 varying optimisations on the original, profiled them with cProfile, and also ran them through Python’s timeit function to get average execution times across 10,000 iterations. All the profile outputs below have been trimmed to remove calls or groups of calls with <0.001 cumulative time.

The profile of the original code, run with the first 100 words from the Oxford 5000, shows that the join and write operations occur once for every generated combination. That’s 161,700 operations, twice, making up the vast majority of the calls and tottime. They account for 0.022 seconds of the 0.065 seconds for total execution time.

    323413 function calls in 0.065 seconds

    Ordered by: cumulative time

    ncalls  tottime percall cumtime percall filename:lineno(function)
        1   0.042   0.042   0.065   0.065
    161700  0.012   0.000   0.012   0.000   {method 'join' of 'str' objects}
    161700  0.010   0.000   0.010   0.000   {method 'write' of '_io.TextIOWrapper' objects}

Optimise the join

The first optimisation moves the join operation into a list comprehension so the full list of candidates is built first and then concatenated. Since it’s now a list, we can use writelines to flush the content to a file in one operation.

 1def optimised_join():
 2    with open("input.txt", "r") as f:
 3        words =
 5    combined_strings = [
 6        "".join(combo) + "\n" for combo in itertools.combinations(words, 3)
 7    ]
 9    with open("output.txt", "w") as out_file:
10        out_file.writelines(combined_strings)

The profile of this optimisation, perhaps unintuitively, does not rid us of the 161,700 join calls but instead flattens the corresponding write calls to one writelines call instead.

    161715 function calls in 0.046 seconds

    Ordered by: cumulative time

    ncalls  tottime percall cumtime percall filename:lineno(function)
        1   0.000   0.000   0.046   0.046
        1   0.025   0.025   0.038   0.038<listcomp>)
    161700  0.012   0.000   0.012   0.000   {method 'join' of 'str' objects}

We get about a 0.020 second speed up, which is great, but now our whole combinations list has to be held in memory. This puts a hard limit on the number of combinations that can be generated and, since JC points out in his blog that his eventual wordlist is over 400GB, that makes this optimisation less than ideal.

Optimise the write

To exhaust the alternative, I re-wrote the function to again send all the combinations to a list but not via a list comprehension before writing them out using writelines. This was mainly to compare how much improvement came from flattening the write operations to a single one alone vs. the list comprehension.

 1def optimise_write():
 2    with open("input.txt", "r") as f:
 3        words =
 5    combined_strings = []
 6    for combo in itertools.combinations(words, 3):
 7        combined_strings.append("".join(combo) + "\n")
 9    with open("output.txt", "w") as out_file:
10        out_file.writelines(combined_strings)

Short answer: writelines is only useful with the list comprehension.

    323414 function calls in 0.066 seconds

    Ordered by: cumulative time

    ncalls  tottime percall cumtime percall filename:lineno(function)
        1   0.040   0.040   0.066   0.066
    161700  0.013   0.000   0.013   0.000   {method 'join' of 'str' objects}
        1   0.008   0.008   0.008   0.008   {method 'writelines' of '_io._IOBase' objects}
    161700  0.005   0.000   0.005   0.000   {method 'append' of 'list' objects}

Without the comprehension optimisation, we end up replacing our 161,700 write operations with append to list operations instead and push our execution time back to match the original code. Also, we haven’t solved any issues with storing the whole list in memory.

Optimise the write and join

Since it’s established that Python generators are pretty handy for reducing memory usage, I suspected that a combination of both might lead us to a speed up.

1def optimise_write_and_join():
2    with open('input.txt', 'r') as f:
3        words =
5    with open('output.txt', 'w') as out_file:
6        out_file.write(
7            '\n'.join(''.join(combo) 
8            for combo in itertools.combinations(words, 3))
9        )

This configuration offered no speed up over the original list comprehension optimisation. I wasn’t expecting much, but no change is disappointing.

    323416 function calls (161716 primitive calls) in 0.049 seconds

    Ordered by: cumulative time

    ncalls      tottime percall cumtime percall filename:lineno(function)
        1       0.000   0.000   0.049   0.049
    161701/1    0.023   0.000   0.048   0.048 {method 'join' of 'str' objects}
    161701      0.025   0.000   0.038   0.000<genexpr>)

So using the join or write+join optimisation, it’s possible to achieve a ~30% speed up over the original code. However, both these methods end up inflating the memory usage of the program and make it untenable when processing the full 5,000 word list. At even 100 words, they use ~30% more memory than the original code. If you’ve got a load of RAM, then this might be the choice for you. If, like me, you don’t then it’s back to the drawing board, I guess.

Timing statistics

I ran the various implementations using timeit across 10,000 iterations to get an idea of how long each took for execution. The input file was only 100 words since 10k iterations across a bigger wordlist will take a prohibitively long time for little practical value. I also included a variation using a concurrent.futures.ThreadPoolExecutor to try and naively parallelise things.

Function 'original' took an average of 0.02161585 seconds per iteration.
Function 'optimised_join' took an average of 0.02477935 seconds per iteration.
Function 'optimise_write' took an average of 0.02531100 seconds per iteration.
Function 'optimise_write_and_join' took an average of 0.01664675 seconds per iteration.
Function 'optimise_futures' took an average of 0.01653588 seconds per iteration.

The results are mostly unsurprising given the profiling exercise. The original implementation and the independent join and write optimisations take around the same time which is a little deviation from the single execution of each, where the original code was about 30% slower. Without proper profiling tools, I’m hard pushed to explain exactly why this occurred but I’d hazard a guess that it’s due to caching or some other optimisation by the CPU.

The latter two implementations are quicker, as expected from the profiling, but it’s important to remember they have a heavy memory cost.

Ideas for possible optimisations

I have a pie in the sky idea that might lead to improving the code, for speed or otherwise. I also have plans to investigate the more optimised itertools functions available to Rust

Chunk it

As the first step in the approach, all combinations generated would be stored to a database with fast, asynchronous access. This introduces an I/O overhead vs. flushing memory to disk but opens up some opportunities to de-duplicate in the background.

With deduplication occurring outside of the Python process, there’s a possible approach of breaking the source file into chunks and randomly distributing those to multiple threads. Each thread would be expected to eventually consume all chunks. Simultaneously, each instance of the Python program can generate combinations from its current chunk, write them to the database, and the proceed to load the next chunk (without discarding the previous chunk). As this process repeats, the execution time would grow and each process would eventually repeat the work of all the others.

However, the final wordlist is a known predicate based on multiplying the wordlist by the combinations (3) and the average word length (maybe 8-12). A warden process could run which checks the total unique combinations stored to the database and terminates all the Python processes when it reaches the pre-calculated total. De-duplication would need to be a background process operating on the database asynchronously to enable this.

The obvious trade off here is a multiplication of the required processing power but that isn’t an existing bottleneck for the original code so is worth pushing into.

Rust’s itertools library

Rust has an itertools library which has a combinations implementation. I’ve put together a rudimentary attempt at using it - Rust is not my forte. It seems to execute about 10ms slower across 100 words but that could well be my poor code. This will be for further exploration.