NEUROMANCER BLUES: THREADING VS MULTIPROCESSING - PART 2

12/30/2019

"Neuromancer Blues" is a series of posts where I would like the reader to find guidance on general data science topics such as data wrangling, database connectivity, applied mathematics and programming tips to boost code efficiency, readability and speed. My examples and code snippets will be as short and sweet as possible in order to convey the key idea, rather than providing lengthy code with poor readability that defeats the purpose of the post.

The first part of this topic briefly explained the key differences between threading and multiprocessing, and focused on threading applications to boost our productivity when executing I/O-bound tasks. Because an image is worth a thousand words, the picture below neatly illustrates the key points for understanding these concepts, courtesy of Corey Schafer.

As highlighted in the first part of this series, multiprocessing is advised for CPU-bound tasks, as it allows the programmer to run multiple processes across the cores of a given CPU, each with its own memory and its own interpreter, so the GIL of one process does not limit the others. Another big difference between threading and multiprocessing is that threads share global variables, whereas processes are completely separate, i.e. one process cannot affect another's variables.
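The difference in how variables are shared can be checked with a minimal sketch. The names counter, increment(), run_in_thread() and run_in_process() are hypothetical and exist only for this illustration:

```python
import multiprocessing
import threading

counter = 0  # module-level global used only for this illustration


def increment():
    """Add one to the global counter of whichever process runs this."""
    global counter
    counter += 1


def run_in_thread():
    """Run increment() in a thread and return the caller's counter."""
    worker = threading.Thread(target=increment)
    worker.start()
    worker.join()
    return counter


def run_in_process():
    """Run increment() in a child process and return the caller's counter."""
    worker = multiprocessing.Process(target=increment)
    worker.start()
    worker.join()
    return counter


if __name__ == "__main__":
    # The thread shares our memory, so the global changes from 0 to 1.
    print("after thread: ", run_in_thread())
    # The child process increments its own copy; ours is still 1.
    print("after process:", run_in_process())
```

Both lines print 1: the thread modified the parent's global, while the child process only modified its own copy.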

Classic Approach: Using Multiprocessing Module
The built-in multiprocessing module is the simplest way in Python to spawn multiple processes that run in parallel, rather than concurrently as threads do. Its interface is intuitive and closely mirrors that of the threading module. Bear in mind that, although the coding experience is similar with both modules, very different things happen backstage, as previously explained. The big difference is that threads share global variables, whereas multiprocessing runs separate processes simultaneously, each with its own variables.

Learning by doing is the best way to convey these concepts, so let's get down to business. Since multiprocessing works best with CPU-bound tasks, we will define a CPU-bound function that trains a naive machine learning model based on SVM and delivers its results. We use the word "naive" for obvious reasons: we don't conduct data pre-processing, feature selection, cross-validation or any kind of hyperparameter optimization, since the purpose of this post is to boost our code efficiency using multiprocessing. Future posts in other series will focus on enhancing machine learning models applied to investments.

Firstly, we proceed to store in an object some data that our CPU-bound task will need in order to perform the SVM model training.

Secondly, the task is wrapped in the function cpu_task(), which trains our naive SVM model and delivers the model accuracy and cumulative return for each ticker passed as a parameter. Note that a time.sleep(3) line is included to lengthen each task and mimic the completion time of a heavier, more realistic workload; the sleep itself does not consume CPU.
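The two steps above can be sketched with the standard library alone. This is not the SVM code of the post: the tickers are hypothetical, the returns are synthetic random numbers rather than market data, and a naive "same direction as yesterday" rule stands in for the model. Only the shape of cpu_task() (one ticker in, accuracy and cumulative return out) follows the description, and the time.sleep(3) line is left out.

```python
import random

# Hypothetical tickers; the returns below are synthetic, NOT real market data.
TICKERS = ["AAA", "BBB", "CCC", "DDD"]


def make_returns(ticker, n_days=250):
    """Generate reproducible synthetic daily returns for one ticker."""
    rng = random.Random(ticker)  # seeding on the ticker makes every run identical
    return [rng.gauss(0.0005, 0.01) for _ in range(n_days)]


# Step 1: store the data that the task needs in one object.
DATA = {ticker: make_returns(ticker) for ticker in TICKERS}


def cpu_task(ticker):
    """Step 2: stand-in for the SVM training (assumption, see lead-in)."""
    returns = DATA[ticker]
    hits = 0
    cumulative = 1.0
    for previous, current in zip(returns, returns[1:]):
        predicted_up = previous > 0
        if predicted_up == (current > 0):
            hits += 1
        # Go long when the rule predicts an up day, short otherwise.
        cumulative *= 1 + (current if predicted_up else -current)
    accuracy = hits / (len(returns) - 1)
    return ticker, accuracy, cumulative - 1


if __name__ == "__main__":
    for ticker in TICKERS:
        name, accuracy, total_return = cpu_task(ticker)
        print(f"{name}: accuracy={accuracy:.2%}, cumulative return={total_return:.2%}")
```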

Now that we have our data and the task/worker function to be run for each one of our companies, let's review different multiprocessing approaches to conduct this analysis. As illustrated in Part 1 of this series, the next code snippet provides an idea about the efficiency of sequential execution i.e. run one task at a time with the next task only starting when the last task is complete.
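A sequential baseline might look like the sketch below. The cpu_task() here is a hypothetical arithmetic stand-in for the per-ticker model training, and the tickers and run_sequential() are illustrative names, so the timing it prints will not match the figures quoted in this post.

```python
import time


def cpu_task(ticker, n=200_000):
    """Hypothetical CPU-bound stand-in for the per-ticker model training."""
    total = 0
    for i in range(n):
        total += (i * i) % 7
    return ticker, total


def run_sequential(tickers):
    """Run the task for one ticker at a time and time the whole loop."""
    start = time.perf_counter()
    results = [cpu_task(ticker) for ticker in tickers]
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run_sequential(["AAA", "BBB", "CCC", "DDD"])
    print(f"Sequential run finished in {elapsed:.2f} s")
```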

Our CPU task takes approximately 5 seconds per ticker to complete. That seems quick at first, yet bear in mind our aim is to make our program as scalable as possible. At this pace, it would take our program more than 40 minutes (500 tickers x 5 seconds, roughly 2,500 seconds) to conduct the same analysis for the whole S&P 500 universe, so improvements are needed here. Enter the multiprocessing module:

The multiprocessing implementation cuts the run time of our CPU task almost in half compared with the sequential approach. New users of the multiprocessing module will probably want further explanation of the lines highlighted with numerical comments:

1. if __name__=='__main__' is mandatory for Windows users. This line tests whether the script is being run directly or being imported. On platforms that start child processes with the "spawn" method, each new process imports the calling module; without the guard, that import would try to start new processes again, triggering an endless succession of them (or until your machine runs out of resources), which recent Python versions stop with a RuntimeError. Since Python 3.8, macOS also uses spawn by default, so Mac users should keep the guard as well. On platforms where child processes are created with fork (historically the default on Linux), the script also runs without the guard, but keeping it is harmless.
2. Loop necessary to open a process per ticker.
3. multiprocessing.Process() creates a process object p from our task worker function and each ticker. The next line starts each process so that they all work in parallel.
4. join() is essential to prevent our script from jumping to "finish" before our processes are done. Without this line, the elapsed time obtained would be a misleading figure close to nil. Hence, join() is tantamount to telling our script to wait for a process to complete.
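The four points above can be sketched as follows. As before, cpu_task() is a hypothetical arithmetic stand-in for the model training, and the tickers and run_processes() are illustrative names rather than the original script.

```python
import multiprocessing
import time


def cpu_task(ticker, n=200_000):
    """Hypothetical CPU-bound stand-in for the per-ticker model training."""
    total = sum((i * i) % 7 for i in range(n))
    print(f"{ticker}: done")
    return total


def run_processes(tickers):
    """Open one process per ticker and wait for all of them."""
    processes = []
    for ticker in tickers:  # 2. loop to open a process per ticker
        p = multiprocessing.Process(target=cpu_task, args=(ticker,))  # 3. create
        p.start()  # 3. start working in parallel
        processes.append(p)
    for p in processes:  # 4. wait for every process to finish
        p.join()
    return [p.exitcode for p in processes]


if __name__ == "__main__":  # 1. guard against re-running this on import
    start = time.perf_counter()
    exit_codes = run_processes(["AAA", "BBB", "CCC", "DDD"])
    print(f"Finished in {time.perf_counter() - start:.2f} s, exit codes {exit_codes}")
```

Note that the value returned by cpu_task() is discarded when it runs inside a Process; collecting results would require something like a multiprocessing.Queue or a pool.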

The best thing about the multiprocessing module is that it supports a "pooling" mode, which brings significant readability gains to our script, avoids unnecessary looping and harvests some gains in execution time as well:

Once again, the syntax above might require further clarification for those not familiar with this module:

1. if __name__=='__main__' is mandatory for Windows users. Please read the paragraph above, as the same explanation applies here.
2. Define the number of processes to be opened. In our example this is equal to the number of tickers.
3. multiprocessing.Pool() creates a pool object "pool". The next line runs the task in parallel across the pool's processes using the class's map method. Note how readability improves: map blocks until every result is ready, so it is no longer necessary to call join() on each process.
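A minimal pooling sketch along the lines of the three points above. The worker cpu_task() is again a hypothetical arithmetic stand-in, and the tickers and run_pool() are illustrative names.

```python
import multiprocessing
import time


def cpu_task(ticker, n=200_000):
    """Hypothetical CPU-bound stand-in for the per-ticker model training."""
    total = sum((i * i) % 7 for i in range(n))
    return ticker, total


def run_pool(tickers):
    """Map the task over the tickers using a pool of worker processes."""
    n_processes = len(tickers)  # 2. one process per ticker, as in the text
    with multiprocessing.Pool(processes=n_processes) as pool:  # 3. create the pool
        # map() blocks until all results are ready and keeps the input order.
        return pool.map(cpu_task, tickers)


if __name__ == "__main__":  # 1. guard against re-running this on import
    start = time.perf_counter()
    results = run_pool(["AAA", "BBB", "CCC", "DDD"])
    print(f"Finished in {time.perf_counter() - start:.2f} s: {results}")
```

Unlike the Process version, the pool hands the return values of cpu_task() back to the parent as a list.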

New Approach: Process Pooling using Concurrent.Futures Module

As reported in Part 1 of this series, the concurrent.futures module provides a more straightforward and readable way of conducting both threading and multiprocessing. This module is an abstraction layer on top of Python's threading and multiprocessing modules that simplifies their use. Nonetheless, there is a trade-off between higher code simplicity and lower code flexibility. Hence, the user might prefer either multiprocessing or concurrent.futures depending on the program's complexity and requirements.

See below the code for running our program using this new library:

This new approach performs a bit worse than our previous multiprocessing options, yet it is still significantly quicker than the standard sequential loop. Several things need to be underlined regarding the use of either concurrent.futures or the native multiprocessing module:

concurrent.futures.ProcessPoolExecutor() is built on top of the native multiprocessing module. Therefore, the same limitations of multiprocessing apply (e.g. objects need to be picklable).
concurrent.futures provides a single API to remember whether you use threading or multiprocessing, e.g. for I/O-bound tasks write ThreadPoolExecutor() instead of ProcessPoolExecutor() and you are done.
concurrent.futures makes maintenance much easier over the long run on the back of its simple API. The multiprocessing library, by contrast, trades some readability and ease of maintenance for more flexibility.
Time performance is slightly worse when using concurrent.futures than when using multiprocessing.Pool. The reason for the difference is that multiprocessing.Pool batches the iterable passed to map into chunks and then passes the chunks to the worker processes, which reduces the communication overhead between the parent and its children. By contrast, concurrent.futures.ProcessPoolExecutor by default passes one item from the iterable at a time to the children (the chunksize argument of its map method defaults to 1), which results in inferior time performance unless a larger chunksize is set.
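The concurrent.futures version, including the chunksize argument mentioned in the last point, can be sketched as below. The worker cpu_task() is once more a hypothetical arithmetic stand-in, and the tickers and run_executor() are illustrative names.

```python
import concurrent.futures
import time


def cpu_task(ticker, n=200_000):
    """Hypothetical CPU-bound stand-in for the per-ticker model training."""
    total = sum((i * i) % 7 for i in range(n))
    return ticker, total


def run_executor(tickers, chunksize=1):
    """Map the task over the tickers with a process pool executor."""
    with concurrent.futures.ProcessPoolExecutor(max_workers=len(tickers)) as executor:
        # chunksize=1 (the default) sends one item at a time to the workers;
        # a larger value batches items, similar to multiprocessing.Pool.map.
        return list(executor.map(cpu_task, tickers, chunksize=chunksize))


if __name__ == "__main__":  # the same guard is needed as with multiprocessing
    start = time.perf_counter()
    results = run_executor(["AAA", "BBB", "CCC", "DDD"])
    print(f"Finished in {time.perf_counter() - start:.2f} s: {results}")
```

Swapping ProcessPoolExecutor for ThreadPoolExecutor is the only change needed to run the same code with threads, although the chunksize argument has no effect there.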

To sum up, the multiprocessing and concurrent.futures modules deliver similar performance, although the former provides more flexibility for custom tasks at the expense of code readability. This is the same conclusion we reached when comparing both modules for threading purposes.

This has been a short introduction to both threading and multiprocessing, with simple and short code scripts to convey the key ideas.

My objective in future posts of the Neuromancer series is to delve into programming efficiency and other data science topics useful for investment purposes.

Recommended Resources:

Best Multiprocessing YouTube tutorial: Corey Schafer
Python Adventures: concurrent.futures
Github: Multiprocessing Script
