NEUROMANCER BLUES: THREADING VS MULTIPROCESSING - PART 1

NEUROMANCER BLUES: Threading VS Multiprocessing - Part 1

12/19/2019

"Neuromancer Blues" is a series of posts where I would like the reader to find guidance about overall data science topics such as data wrangling, database connectivity, applied mathematics and programming tips to boost code efficiency, readability and speed. My examples and coding snippets will be as short and sweet as possible in order to convey the key idea instead of providing a lengthy code with poor readability that damages the purpose of the post.

Threading (AKA multithreading) and multiprocessing is a topic I wanted to write for a long time. This first post will be focused on introducing both concepts with emphasis on threading, and why is so important for developers in finance. Future Neuromancer series posts will spend more time on multiprocessing and programming efficiency issues such as race conditions or deadlocks.

The Basics
Let's clarify key concepts that will be recurrent in this post:

Concurrency: when two tasks can start, run, and complete in overlapping time periods.
Parallelism: when tasks literally run at the same time e.g. multi-core processor
I/O-bound task: tasks that spends most of the time in i/o (input/output) state i.e. network read/write, database queries, etc.
CPU-bond task: tasks that spends most of the time in CPU i.e. browsing multiple websites, machine algo training, etc.
GIL (Global Interpreter Lock): GIL is a lock that avoids a single python process to run threads in parallel (using multiple cores) at any point in time, yet they can be run concurrently. Although not visible for single-threaded tasks, GIL becomes an issue when performing multi-threaded code or CPU-bound tasks.
Deadlock: event that occurs when two or more threads/tasks are waiting on each other, not executing and blocking the whole program.
Race conditions: event that occurs when two or more threads/tasks run in parallel but end up delivering a flawed result as a consequence of an incorrect sequence of the operations.

In Coding we need to avoid both race conditions and deadlocks. The best recipe to avoid race conditions is to apply Thread Safety policies using either approaches that a void a shared state or methods related to synchronization. In addition, we wish also to avoid deadlocks by having processes crossing over building mutual dependencies i.e. reduce the need to lock anything as much as you can. These topic is more advanced and deserve another series of posts in the future.

Threading vs Multiprocessing: Brief Intro
Now we are in the same page thus let's answer what is the difference between threading and multiprocessing and why it matters so much. The table below provides a comprehensive comparison between methods.

In a nutshell, threading is used to run multiple threads/tasks at the same time inside the same process , yet it will not enhance speed if we are already using 100 % CPU time. On the other hand, multiprocessing allows the programmer to open multiple processors on a given CPU, each one of them with their own memory and with no GIL limitations.

Python threads are used in I/O-bound tasks mainly where the execution involves some waiting time. In Finance a straightforward example is when querying an external database, for which reason we will simulate a similar i/O-bound task using yahoo finance data.

As I mentioned in the introduction, I would like to concentrate on threading for this post so let's get down to business.

Classic Approach: Using Threading Module
The threading module provided with Python includes a simple-to-implement locking mechanism that allows you to synchronize threads. In other words, this module allows you to have different parts of your program run concurrently and improve the code readability.

Let's first understand what's the point of doing threading. The code snippet underneath shows a looping process where we execute a I/O-bound tasks reading financial data from yahoo finance using pandas datareader module. Although pandas datareader allows bulking - downloading data from a list of tickers at once - we are going to naively run each I/O-task on a stand-alone basis per ticker and following a sequential execution approach i.e. a new call only starts when the former call is finished.

Please notice about the one second delay introduced within our io_task() function with time module. For the sake of simplicity, we are only downloading price data for less than five years, for which reason this is a pretty fast query by nature. The introduction of this one second delay simulates that our query is taking more time e.g. downloading data such as +100 fundamental indicators, which is a more realistic simulation.

Running the code will produce output that looks like this:

As we see above, sequential execution is quite time consuming. Is there any way we can optimize and trim down waiting time? Enter the Threading module:

Well, that was fast. Threading outperforms dramatically a sequential approach cutting down time by more than six seconds tantamount to more than two thirds of waiting time. New users of Threading module probably require further description upon the lines highlighted with numerical comments:

A thread is created for each ticker in our list. This is a new class created by the Threading module.
The thread is started and the task is executed in a separate thread control.
Makes the program wait until the thread terminates. This line is important as if we ignore it, the program will print out the time before all the threads are finished and, consequently, it will give us a flawed time measurement.

New Approach: Thread Pooling using Concurrent.Futures Module
The concurrent.futures module provides a more straightforward and readable way of conducting both threading and multiprocessing. This module is an abstraction layer on top of Python’s threading and multiprocessing modules that simplifies their use. Nonetheless, it should be noted that there's a trade-off between higher code simplicity and lower code flexibility. Hence, the user might be interested in using either Threading or Concurrent.Futures depending on the program complexity and requirements.

But let's run our I/O-bound task now with this new module;

Pool Threading delivers similar significant time savings with regards the naive sequential approach plus showing a much better readability than our Threading module's first snippet. Once more, you can find below more explanations for those lines with numerical comments:

Helper function necessary to be used as input of the map function embedded into concurrent.futures module. It's simply our original I/O-bound task function, yet because it requires two arguments instead of one, we cannot use it in the executor.map() object defined in 3.
Executor object from concurrent.futures module. This is an abstract class that needs to be used as one of its two subclasses: ThreadPoolExecutor or ProcessPoolExecutor. Threading requires the former, yet the multiprocessing will require the latter as we'll see in future posts.
Executor.map() is a method embedded into the executor object that works similarly as Python's vanilla map() function by applying a function on all the elements of a specified iterable and returns map object. In this case, the method executes calls asynchronously creating a threads per ticker.

To sum up, both Threading and Concurrent.futures module have similar performance, although the former provides more flexibility for custom tasks while sacrificing code readability.

This has been a short introduction to both threading and multiprocessing with quite simple and short code script to convey the key idea. This time we were assigning a task per thread, although we could have assigned more jobs/tasks per thread if needed.

My objective in future post of the Neuromancer series is to delve into more threading specifics examples, discuss multiprocessing and other data science topics useful for investment purposes.

Recommended Resources:

Best Threading youtube tutorial: Corey Schafer
Python Adventures: concurrent.futures
Github: Threading Script

1 Comment

NEUROMANCER BLUES: Threading VS Multiprocessing - Part 1

Leave a Reply.