"Neuromancer Blues" is a series of posts where I would like the reader to find guidance about overall data science topics such as data wrangling, database connectivity, applied mathematics and programming tips to boost code efficiency, readability and speed. My examples and coding snippets will be as short and sweet as possible in order to convey the key idea instead of providing a lengthy code with poor readability that damages the purpose of the post. Threading (AKA multithreading) and multiprocessing is a topic I wanted to write for a long time. This first post will be focused on introducing both concepts with emphasis on threading, and why is so important for developers in finance. Future Neuromancer series posts will spend more time on multiprocessing and programming efficiency issues such as race conditions or deadlocks. The Basics Let's clarify key concepts that will be recurrent in this post:
In coding we need to avoid both race conditions and deadlocks. The best recipe against race conditions is to apply thread-safety policies, using either approaches that avoid shared state or synchronization methods such as locks. We also want to avoid deadlocks, which arise when processes build mutual dependencies on each other's resources, i.e. reduce the need to lock anything as much as you can. This topic is more advanced and deserves its own series of posts in the future.

Threading vs Multiprocessing: Brief Intro

Now that we are on the same page, let's answer what the difference between threading and multiprocessing is and why it matters so much. The table below provides a comprehensive comparison between the two methods. In a nutshell, threading is used to run multiple threads/tasks at the same time inside the same process, yet it will not enhance speed if we are already using 100% of CPU time. On the other hand, multiprocessing allows the programmer to spawn multiple processes on a given machine, each with its own memory space and free of GIL limitations. Python threads are mainly used for I/O-bound tasks, where execution involves some waiting time. In finance, a straightforward example is querying an external database, which is why we will simulate a similar I/O-bound task using Yahoo Finance data. As I mentioned in the introduction, I would like to concentrate on threading in this post, so let's get down to business.

Classic Approach: Using the Threading Module

The threading module shipped with Python includes a simple-to-implement locking mechanism that allows you to synchronize threads. In other words, this module lets different parts of your program run concurrently while improving code readability. Let's first understand what the point of threading is. The code snippet underneath shows a looping process where we execute an I/O-bound task reading financial data from Yahoo Finance using the pandas-datareader module.
Although pandas-datareader allows bulk downloads - fetching data for a list of tickers at once - we are going to naively run each I/O task on a stand-alone basis per ticker, following a sequential execution approach, i.e. a new call only starts once the former call has finished. Please note the one-second delay introduced within our io_task() function via the time module. For the sake of simplicity, we are only downloading price data for less than five years, which makes this a pretty fast query by nature. The one-second delay simulates a query that takes more time, e.g. downloading 100+ fundamental indicators, which is a more realistic scenario.
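Since the original embedded snippet did not survive, here is a minimal sketch of the sequential version. The ticker list, the io_task() name, and the date range are assumptions, and the actual pandas-datareader call is left as a comment so the sketch runs without network access; the time.sleep(1) stands in for the full query latency.

```python
import time

TICKERS = ["AAPL", "MSFT", "AMZN", "GOOG"]  # hypothetical ticker list

def io_task(ticker):
    """Simulated I/O-bound query for one ticker.

    A real version would fetch prices from Yahoo Finance, e.g.:
        from pandas_datareader import data as pdr
        prices = pdr.DataReader(ticker, "yahoo", "2017-01-01", "2021-12-31")
    Here we only sleep to mimic the network wait.
    """
    time.sleep(1)  # the one-second delay discussed above
    return ticker

start = time.perf_counter()
# Sequential execution: each call starts only after the previous one finishes.
results = [io_task(t) for t in TICKERS]
elapsed = time.perf_counter() - start
print(f"Sequential run took {elapsed:.2f}s")  # roughly 4 seconds for 4 tickers
```

With four tickers the total wall time is roughly four seconds, since the one-second waits cannot overlap.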
Running the code confirms that sequential execution is quite time consuming: every one-second call waits for the previous one to finish. Is there any way we can optimize and trim down the waiting time? Enter the threading module. Well, that was fast. Threading dramatically outperforms the sequential approach, cutting the runtime down by more than six seconds, tantamount to more than two thirds of the waiting time. New users of the threading module will mainly want to focus on how the threads are created, started, and then joined back together.
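The threaded version of the same task can be sketched as follows, again with the simulated one-second query standing in for the real pandas-datareader call. The shared results list is an assumption, and it is guarded with a Lock to keep the appends thread-safe, in line with the thread-safety policies mentioned earlier.

```python
import threading
import time

TICKERS = ["AAPL", "MSFT", "AMZN", "GOOG"]  # hypothetical ticker list
results = []
lock = threading.Lock()  # protects the shared results list

def io_task(ticker):
    """Simulated I/O-bound query (a real version would hit Yahoo Finance)."""
    time.sleep(1)  # stand-in for the network round trip
    with lock:  # thread-safe append to shared state
        results.append(ticker)

start = time.perf_counter()
# One thread per ticker: the one-second waits now overlap.
threads = [threading.Thread(target=io_task, args=(t,)) for t in TICKERS]
for t in threads:
    t.start()  # launch every thread
for t in threads:
    t.join()   # wait for all of them to finish
elapsed = time.perf_counter() - start
print(f"Threaded run took {elapsed:.2f}s")  # roughly 1 second instead of 4
```

Because all four sleeps run concurrently, the wall time collapses to roughly the duration of a single query.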
New Approach: Thread Pooling with the concurrent.futures Module

The concurrent.futures module provides a more straightforward and readable way of conducting both threading and multiprocessing. It is an abstraction layer on top of Python's threading and multiprocessing modules that simplifies their use. Nonetheless, it should be noted that there is a trade-off between higher code simplicity and lower code flexibility. Hence, you might prefer either threading or concurrent.futures depending on the program's complexity and requirements. Let's run our I/O-bound task now with this new module:
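A thread-pool sketch of the same task is shown below, using ThreadPoolExecutor from concurrent.futures; as before, io_task() simulates the Yahoo Finance query with a one-second sleep, and the ticker list is an assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor

TICKERS = ["AAPL", "MSFT", "AMZN", "GOOG"]  # hypothetical ticker list

def io_task(ticker):
    """Simulated I/O-bound query (a real version would call pandas-datareader)."""
    time.sleep(1)
    return ticker

start = time.perf_counter()
# The executor creates, schedules, and joins the threads for us.
with ThreadPoolExecutor(max_workers=len(TICKERS)) as executor:
    results = list(executor.map(io_task, TICKERS))  # preserves input order
elapsed = time.perf_counter() - start
print(f"Thread-pool run took {elapsed:.2f}s")
```

Note how executor.map removes the explicit start/join bookkeeping and returns the results in input order, which is where the readability gain over the raw threading module comes from.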
Pool threading delivers similarly significant time savings relative to the naive sequential approach, while offering much better readability than our first threading-module snippet: the executor handles thread creation, scheduling, and joining on our behalf.
To sum up, both the threading and concurrent.futures modules deliver similar performance, although the former provides more flexibility for custom tasks at the cost of code readability. This has been a short introduction to both threading and multiprocessing, with quite simple and short code scripts to convey the key idea. This time we assigned one task per thread, although we could have assigned more jobs/tasks per thread if needed. My objective in future posts of the Neuromancer series is to delve into more specific threading examples, discuss multiprocessing, and cover other data science topics useful for investment purposes. Recommended Resources:
Carlos Salas