In this post, different approaches to concurrent and parallel programming in Python will be presented through concise code snippets. We will use three standard Python modules, namely threading, multiprocessing and
concurrent.futures. This post will not cover the
subprocess module and the new asyncio module.
The aim of this article is to show different ways of concurrent and parallel programming in Python 3. However, in the first place, we need to understand the differences between these two forms of execution. For the following explanation, suppose that we have a system with two CPU cores, referred to as
CPU A and
CPU B. Moreover, for illustration, assume that every thread or process is formed from several tasks. Each task name consists of a number and a letter: the number specifies the point in time when the task is executed, whereas the letter specifies which CPU core is responsible for executing it. Rectangular tasks are CPU-bound (e.g. computationally intensive operations), while circular ones are I/O-bound (e.g. GUI or network operations).
For the sake of completeness, we start with the sequential execution depicted in the following figure:
From the picture we can see that there is one process with eight tasks. Some of these tasks are executed on
CPU A, while others are executed on
CPU B. Moreover, from the picture, it is obvious that no number representing a point in time appears more than once. Therefore, even though there are two CPU cores in the system, no two tasks are executed at the same moment. Indeed, they are executed one by one, i.e. sequentially. This may cause your GUI application to become unresponsive, as the GUI has to wait until all computations are finished.
A slightly different approach, particularly the concurrent execution, is illustrated in the subsequent figure:
In this scenario, there are two threads, each having four tasks. Like in the previous case, one part of them is executed on
CPU A, whereas the other part is executed on
CPU B. However, similarly to the sequential execution, we can see that no two tasks are executed at the same time. Unlike the sequential execution, the tasks from both threads are executed in an interleaved way. Firstly, the
1A task from the first thread is executed on
CPU A. Then, the
2B task from the second thread is executed on
CPU B. After that,
CPU A executes the
3A task from the second thread, and this behavioural pattern continues until all tasks from both threads are executed. We say that the tasks are executed concurrently. This is beneficial in situations where GUI operations are interleaved with computations (the GUI will always remain responsive).
As we will see later, in Python, the aforementioned types of execution have one big drawback: at any moment, only one CPU core is working. This is very inefficient. However, this drawback is solved by the parallel execution that is depicted in the following image:
From the figure, we can see that there are two processes, each with four tasks. Like in the previous examples, some of these tasks are executed on
CPU A, while others are executed on
CPU B. However, the tasks are executed in parallel. That means that the
1A task from the first process is executed on
CPU A and at the same moment the
1B task from the second process is executed on
CPU B. This pattern continues until all tasks are processed. Therefore, both CPU cores are fully used. This is beneficial when all processes perform computationally intensive operations.
Now that the basics of the different types of execution are understood, we can continue further with this post. The first part is devoted to threads, while the second part describes processes. In both parts, the pros and cons of the given unit of execution are covered, appropriate use cases for threads and processes are discussed, and short source code examples with concise explanations are provided for illustration. All the presented code is easy to follow and does not require you to know other Python modules. Therefore, it is suitable even for infants :-). For our purposes, three Python modules are employed, namely the
threading module, the
multiprocessing module and the concurrent.futures module.
The first one,
threading, defines higher-level interfaces built on top of the lower-level
thread module. The
thread module has been informally considered deprecated for a long period of time and users have been encouraged to use the
threading module instead. In Python 3, this deprecation becomes more formal and the
thread module is not available anymore. To be more accurate, it is still there, hanging around, but it has been renamed to
_thread. Unlike the
thread module, the
threading module was designed with inheritance in mind. Therefore, every thread is treated as an object and not as a function as in the case of the
thread module. This module is intended for concurrent programming only.
The second module used is
multiprocessing. Unlike the threading module,
multiprocessing can be used for both concurrent and parallel programming. In addition,
multiprocessing.dummy provides the same API as the
multiprocessing module but uses threads instead of processes.
The third module, concurrent.futures, is a new addition to Python since version 3.2. It provides new high-level interfaces for concurrent and parallel execution performed by threads and processes.
A thread is the smallest unit of execution that can be scheduled by an operating system. Different operating systems provide different implementations of threads. However, usually a thread is contained inside a process and a single process can encapsulate multiple threads, as can be seen in the figure below:
In Python, threads in one process implicitly share the interpreter, the program code and global variables. Local variables can be shared explicitly if they are passed to other threads. Moreover, threads are executed concurrently, independently of one another and in a non-deterministic order. The execution of a blocking operation, like reading from a file, can put a thread to sleep.
Threads possess some advantages in comparison to processes. Thanks to sharing the memory space, creating and using threads is less memory demanding than working with processes. From the viewpoint of system performance, it is easier to create and run one hundred threads than one hundred processes. Moreover, communication between threads is performed more easily than communication between processes.
However, there are also several disadvantages related to using threads. Despite the fact that the shared memory space can be advantageous in some cases, threads may corrupt shared data if their synchronization is not taken into account. This phenomenon is called a race condition and its consequences can be severe. Depending on how threads are scheduled, your program can generate correct results, even for a long period of time. Then, out of the blue, it produces incorrect results. Debugging this kind of error is a guaranteed way to make your head explode.
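To make the danger concrete, here is a minimal sketch (not part of the original examples) of the classic shared-counter case. The read-modify-write sequence counter += 1 is not atomic, so a threading.Lock is used to make it safe; without the lock, updates could be lost:

```python
import threading

counter = 0
lock = threading.Lock()

def increment(times):
    global counter
    for _ in range(times):
        # The lock makes the read-modify-write sequence atomic;
        # without it, two threads could read the same old value
        # and one of the increments would be lost.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000 with the lock in place
```

Removing the `with lock:` line reintroduces the race condition, and the final count may then be anything up to 400000.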
If you are a Python beginner coming from the Java or C++ world, the next drawback of Python threads may surprise you. Otherwise, if you are a more experienced Python programmer, this is probably the one thing in Python you complain about a lot. So, beginners, sit down, please. The thing is: full parallelism is not possible with Python threads because of the Global Interpreter Lock, usually abbreviated as GIL. To be more precise, the GIL is not an integral part of the Python language specification. Indeed, the GIL is a property of the Python interpreter's implementation rather than of the language itself. CPython, the reference implementation of Python written in C, uses the GIL because some parts of the C implementation of the interpreter are not completely thread-safe. Therefore, the interpreter is protected by the GIL, which allows only one Python thread to be executed at any given time. The noticeable effect of the GIL is that a Python application only runs on a single CPU core no matter how many threads it uses. Fortunately, in the community of CPython developers, there are some efforts to remove the GIL (or to perform a gilectomy :-)). Frankly speaking, there are different Python implementations: some of them (e.g. PyPy) use a GIL and others (e.g. Jython, IronPython) can do without it.
After reading the previous horror story about GIL, you may think that the
threading module in Python is useless. However, the GIL tends to affect only CPU-bound applications. Using threads for computationally intensive tasks, where you are trying to achieve full parallelism on multiple CPU cores, is a bad idea resulting in a slower execution time. On the other hand, the GIL does not present much of an issue for I/O-bound applications. Therefore, threads are much better suited for concurrent programming that performs blocking operations, like waiting for disks, networks or database results. So, if your application mostly performs I/O handling, threads are often a reasonable choice because they spend most of their time waiting in the kernel. When one thread is waiting in the kernel, another one can run. As we will see, this can give us a significant speedup.
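This speedup can be illustrated without any network access. In the following sketch, time.sleep() stands in for an I/O wait (the 0.2-second delay and the four tasks are made-up numbers for illustration); because a sleeping thread releases the GIL, the threaded waits overlap:

```python
import threading
import time

def fake_io_task():
    time.sleep(0.2)  # stands in for waiting on a network or disk operation

# Sequential: the waits add up (~0.8 s for four tasks).
start = time.perf_counter()
for _ in range(4):
    fake_io_task()
sequential = time.perf_counter() - start

# Threaded: the waits overlap (~0.2 s), because sleeping threads
# release the GIL and let the others run.
start = time.perf_counter()
threads = [threading.Thread(target=fake_io_task) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f'sequential: {sequential:.2f}s, threaded: {threaded:.2f}s')
```

The same overlap is what makes the download examples below faster with threads.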
Let's move to the real stuff. Suppose that we need to download several web pages and show the length of their content. Assume that the URLs of these web pages are stored in the URLS list defined as follows:

```python
URLS = [
    'https://xkcd.com/138/',
    'https://xkcd.com/149/',
    'https://xkcd.com/285/',
    'https://xkcd.com/303/',
    'https://xkcd.com/327/',
    'https://xkcd.com/387/',
    'https://xkcd.com/612/',
    'https://xkcd.com/648/',
]
```
The definition of
URLS will stay the same for all subsequent examples. In the following parts, different approaches to this task will be explained. Firstly, the task will be performed sequentially and then different variants of using threads to accomplish the given task will be provided.
The following code will carry out the task sequentially:
```python
import requests

def get_content_len(url):
    r = requests.get(url)
    return url, len(r.text)

if __name__ == '__main__':
    for url in URLS:
        result = get_content_len(url)
        print(result)
```
In this case, the web pages are downloaded in the order in which they are stored in
URLS. Since the
requests.get() call is blocking, nothing else can be done while waiting for one web page to be downloaded.
For the sake of completeness, the output of this program is listed below:
```
$ python sequential_download.py
('https://xkcd.com/138/', 6660)
('https://xkcd.com/149/', 6439)
('https://xkcd.com/285/', 6611)
('https://xkcd.com/303/', 6706)
('https://xkcd.com/327/', 6875)
('https://xkcd.com/387/', 6451)
('https://xkcd.com/612/', 6852)
('https://xkcd.com/648/', 6721)
```
As we know from the previous part, downloading web pages is a suitable task for employing threads. However, there are several ways to create and use threads in Python. The simplest one is passing a callable and its arguments to the constructor of the
threading.Thread class. This callable is then executed in its own thread:
```python
import queue
import threading

import requests

def get_content_len(url, results):
    r = requests.get(url)
    results.put((url, len(r.text)))

if __name__ == '__main__':
    results = queue.Queue()
    threads = []
    for url in URLS:
        t = threading.Thread(target=get_content_len, args=(url, results))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    while not results.empty():
        print(results.get())
```
The first difference we can see in this source code in comparison to the sequential solution lies in the
get_content_len() function. This function does not return the result. Instead, it stores the result into the provided queue named
results. The queue is an instance of the
queue.Queue class and it represents a recommended way to transfer information between threads. Since
queue.Queue already has all of the required locking, it can be safely shared by as many threads as you wish. By the way, a simple return does not work in this case, because the thread machinery discards the target's return value.
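To see why a plain return is not enough, note that threading.Thread keeps no reference to the target's return value and that join() itself always returns None. A tiny sketch (a toy function, not the download example):

```python
import threading

def answer():
    return 42  # this return value is silently discarded by the Thread machinery

t = threading.Thread(target=answer)
t.start()
result = t.join()  # join() only waits for termination; it always returns None

print(result)  # None
```

This is exactly why the example above hands the result back through a queue instead.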
Then, in the first
for loop, a thread is created for every web page that needs to be downloaded by passing
get_content_len() and its arguments to the constructor of the thread. After creation, the thread has to be started with the
start() method. The thread is also stored in the
threads list, so the consequent joining of the threads may be performed easily. Another way of joining the threads is by using
threading.enumerate(), as shown in the following code snippet:
```python
if __name__ == '__main__':
    results = queue.Queue()
    for url in URLS:
        t = threading.Thread(target=get_content_len, args=(url, results))
        t.start()
    for t in threading.enumerate():
        if t is not threading.main_thread():
            t.join()
    ...  # The rest is the same as before.
```
Notice that this solution joins all threads in the program. In this case, this does not make a difference. However, if you create threads in different places in your program, the approach using
threading.enumerate() causes joining all of them. Be aware of this behaviour. Moreover, joining the current thread,
threading.main_thread() in our example, would cause a deadlock. In this situation, the
join() method would raise the
RuntimeError exception, so we exclude the current thread from joining.
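This deadlock protection can be observed directly; CPython refuses the self-join and raises RuntimeError:

```python
import threading

# Attempting to join the thread we are currently running in would
# deadlock, so CPython detects it and raises RuntimeError instead.
try:
    threading.current_thread().join()
except RuntimeError as e:
    error = e

print(error)  # cannot join current thread
```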
Finally, the results are printed to the screen:
```
$ python threading_no_inheritance.py
('https://xkcd.com/327/', 6875)
('https://xkcd.com/303/', 6706)
('https://xkcd.com/648/', 6721)
('https://xkcd.com/387/', 6451)
('https://xkcd.com/612/', 6852)
('https://xkcd.com/149/', 6439)
('https://xkcd.com/138/', 6660)
('https://xkcd.com/285/', 6611)
```
Notice that the web pages are downloaded in an arbitrary order. This results from the fact that if one thread is waiting for a response, another one can carry out a download. For instance, if the thread responsible for downloading the first URL from
URLS is waiting for a response, another thread may perform its download and store the result into the output queue before the waiting thread. Hence, the results are not ordered with respect to the input URLS list.

Another way of using threads is to define our own thread class by subclassing threading.Thread and overriding its run() method:

```python
import threading

import requests

class ContentLenThread(threading.Thread):
    def __init__(self, url):
        super().__init__()
        self.url = url
        self.content_len = None

    def run(self):
        r = requests.get(self.url)
        self.content_len = len(r.text)

if __name__ == '__main__':
    threads = []
    for url in URLS:
        t = ContentLenThread(url)
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    for t in threads:
        print((t.url, t.content_len))
```
The only difference in comparison to the foregoing example lies in returning results from the threads. Since we implemented our own thread class,
ContentLenThread, results can be stored in its instances. Thus, no queue is needed. Like in the previous example, the web pages are downloaded in an arbitrary order. However, the results are printed in the order in which the threads were created. In other words, they are printed from the first thread, responsible for downloading the first URL, to the last one, responsible for downloading the last URL. Therefore, they appear ordered according to the input URLS list:
```
$ python threading_inheritance.py
('https://xkcd.com/138/', 6660)
('https://xkcd.com/149/', 6439)
('https://xkcd.com/285/', 6611)
('https://xkcd.com/303/', 6706)
('https://xkcd.com/327/', 6875)
('https://xkcd.com/387/', 6451)
('https://xkcd.com/612/', 6852)
('https://xkcd.com/648/', 6721)
```
In this and in the preceding approach, there is no upper limit on the number of created threads. If your
URLS contains one thousand URLs, then one thousand threads will be created. Although this approach works, it may not be the best idea and you should avoid writing applications that permit unlimited growth in the number of threads. Imagine that you have a server accepting requests from clients and creating a new thread for every request. A malicious user could launch an attack that forces the server to create so many threads that your application runs out of resources and crashes.
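As a hedged sketch (not from the original examples), a threading.Semaphore can cap how many worker threads exist at once: the main thread acquires a slot before creating each thread and the worker releases it when done. The cap of 3, the 10 tasks and the sleep standing in for real work are all arbitrary:

```python
import threading
import time

MAX_WORKERS = 3  # arbitrary cap, for illustration only
slots = threading.Semaphore(MAX_WORKERS)
alive = 0
peak = 0
lock = threading.Lock()

def worker():
    global alive, peak
    with lock:
        alive += 1
        peak = max(peak, alive)
    try:
        time.sleep(0.05)  # stand-in for real work, e.g. a download
    finally:
        with lock:
            alive -= 1
        slots.release()  # free a slot for the next thread

threads = []
for _ in range(10):
    slots.acquire()  # block until fewer than MAX_WORKERS threads are working
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()
for t in threads:
    t.join()

print(peak)  # never exceeds MAX_WORKERS
```

A thread pool, introduced below, solves the same problem more cleanly and also recycles threads.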
The two previous solutions pose a security risk. Also, the overhead is higher because a new thread is created for every URL. Suppose that there are one thousand URLs in
URLS; hence, one thousand threads will be created. Then, every thread will download one web page and cease to exist. But what if there were a way to recycle threads, a way in which one thread does more than just one task before its termination? An answer to both problems is using
multiprocessing.pool.ThreadPool or its alternative
multiprocessing.dummy.Pool (it replicates the interface of
multiprocessing.pool.Pool but uses threads instead of processes). Both put an upper limit on the number of created threads and make recycling of threads possible. Unfortunately, the
multiprocessing.pool.ThreadPool class has not been documented yet. However, the next example shows its usage:
```python
from multiprocessing.pool import ThreadPool
# alternative:
# from multiprocessing.dummy import Pool as ThreadPool

import requests

def get_content_len(url):
    r = requests.get(url)
    return url, len(r.text)

if __name__ == '__main__':
    with ThreadPool() as pool:
        results = pool.map(get_content_len, URLS)
    for result in results:
        print(result)
```
As we can see in this implementation, the
get_content_len() function returns its result in a usual way. To create a pool, we use a context manager (the
with statement). The number of worker threads in the pool may be specified as the first argument. If it is not provided,
os.cpu_count() will be used for this purpose. The returned number defines how many CPU cores are available on the given machine. Then, the
map() method takes care of everything else including sending the tasks to the threads in the pool and gathering the results from the threads. The call of
map() blocks until all results are ready. After that, it returns a list of results:
```
$ python multiprocessing_threadpool.py
('https://xkcd.com/138/', 6660)
('https://xkcd.com/149/', 6439)
('https://xkcd.com/285/', 6611)
('https://xkcd.com/303/', 6706)
('https://xkcd.com/327/', 6875)
('https://xkcd.com/387/', 6451)
('https://xkcd.com/612/', 6852)
('https://xkcd.com/648/', 6721)
```
As we can see, the results are printed in the order in which their URLs are stored in the input URLS list. The
map() method guarantees this behaviour. However, the multiprocessing.pool.ThreadPool and
multiprocessing.dummy.Pool classes also provide other methods that can be used instead of
map(). These methods have slightly different behaviours. Therefore, if you need a little bit different functionality than the
map() method offers, feel free to use these other methods.
The above source code eliminates both previously mentioned problems related to the use of
threading.Thread itself, and it is more concise. Also, we do not have to do any extra work like joining threads.
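As a sketch of those alternative pool methods with a toy function (not the download example): apply_async() submits a single task and returns an AsyncResult, while imap_unordered() yields results as soon as they are ready, in no particular order:

```python
from multiprocessing.pool import ThreadPool

def square(n):
    return n * n

with ThreadPool(4) as pool:
    # apply_async() schedules one call and returns an AsyncResult;
    # get() blocks until the result is available.
    async_result = pool.apply_async(square, (7,))
    single = async_result.get()

    # imap_unordered() yields results in completion order, which may
    # differ from the input order, so we sort them for display.
    unordered = sorted(pool.imap_unordered(square, [1, 2, 3, 4]))

print(single)     # 49
print(unordered)  # [1, 4, 9, 16]
```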
There is another way of creating a thread pool. This alternative approach is accessible via the new Python module called
concurrent.futures. The implementation is very similar to the previous example:
```python
import os

from concurrent.futures import ThreadPoolExecutor

import requests

def get_content_len(url):
    r = requests.get(url)
    return url, len(r.text)

if __name__ == '__main__':
    # max_workers is required only for Python < 3.5.
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        results = pool.map(get_content_len, URLS)
    for result in results:
        print(result)
```
Obviously, no modification to the
get_content_len() function is necessary. The pool creation is almost identical to the preceding implementation with three minor differences. Firstly,
concurrent.futures.ThreadPoolExecutor is employed and the maximal number of worker threads is provided through the
max_workers parameter. For Python 3.5 or newer, this parameter does not need to be given or can be explicitly set to
None. In these circumstances, and this is the second difference, the number of worker threads is set to the number of processors on the machine multiplied by 5. Lastly,
map() returns an iterator over the results, not a list. The output looks as follows:
```
$ python concurrent_futures_threadpoolexecutor.py
('https://xkcd.com/138/', 6660)
('https://xkcd.com/149/', 6439)
('https://xkcd.com/285/', 6611)
('https://xkcd.com/303/', 6706)
('https://xkcd.com/327/', 6875)
('https://xkcd.com/387/', 6451)
('https://xkcd.com/612/', 6852)
('https://xkcd.com/648/', 6721)
```
As in the
multiprocessing.pool.ThreadPool example, the results are ordered with respect to the order of URLs in the URLS list.
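When the input order does not matter, the executor also offers submit() together with as_completed(), which yields futures in completion order. A small sketch with a toy function (not the download example):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def square(n):
    return n * n

with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() schedules a single call and returns a Future.
    futures = [pool.submit(square, n) for n in [1, 2, 3, 4]]
    # as_completed() yields each future as soon as its result is ready,
    # so the order here reflects completion, not submission.
    results = sorted(f.result() for f in as_completed(futures))

print(results)  # [1, 4, 9, 16]
```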
From the overhead perspective, using a thread pool is the fastest implementation. All threads are created once at the beginning and then tasks are sent to them.
Like a thread, a process is a unit of execution. In other words, a process is a computer program in action. Generally, every process wraps a thread. And, as can be seen in the previous section, multiple threads can coexist inside a single process. In addition, a process can create another subprocess. Hence, one program can use multiple processes. The exact procedure of creating a subprocess depends on the hosting operating system. Usually, this subprocess is called a child process and the initiating process is referred to as a parent process.
In comparison to threads in Python, processes provide full parallelism. This allows your application to utilize all CPU cores. Also, processes are independent from each other, which makes it less likely to accidentally modify shared data.
However, full parallelism and the independence of processes come at a cost. Creating and running processes is more performance and memory demanding than using threads. For every process, there has to be one running interpreter. Another weakness of processes is the lack of a shared memory space. Thus, the need for exchanging information and synchronizing operations between processes has to be addressed via interprocess communication, abbreviated as IPC. This brings an extra communication overhead. Since every process works in its own interpreter, data exchanged between these interpreters have to be serialized. In other words, they have to be compatible with
pickle. Moving a lot of data between processes can introduce a tremendous amount of overhead because these data need to be pickled on the sending side and unpickled on the receiving side. There are also limitations to the function which implements process behaviour. It needs to be defined with the
def statement. Other types of function definitions, like lambdas, closures and callable instances, are not supported. Also, the function arguments and return value must be compatible with pickle.
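The pickle restriction can be checked directly: a module-level function defined with def pickles fine, while a lambda does not (the exact exception type may vary slightly across Python versions, so several are caught below):

```python
import pickle

def square(n):
    return n * n

# A plain def function is pickled by reference to its module-level name,
# so the round trip yields the very same function object.
data = pickle.dumps(square)
restored = pickle.loads(data)

# A lambda has no importable name, so pickling it fails.
try:
    pickle.dumps(lambda n: n * n)
    lambda_failed = False
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_failed = True

print(lambda_failed)  # True
```

This is why the worker functions handed to multiprocessing pools in the examples below are all plain module-level def functions.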
When deciding whether threads or processes will be employed in your program, you need to think about the following question: where will my program spend most of its execution time? If your answer is waiting for I/O, threads are very likely the right choice. However, if you work with computationally intensive tasks, processes are what you are looking for. So, if you want your program to make better use of the computational resources of multi-core machines, the multiprocessing and
concurrent.futures modules are recommended. For CPU-bound applications, they are the right choice.
To illustrate the use of the multiprocessing and
concurrent.futures modules, a sequence of Fibonacci numbers will be computed. Suppose that the inputs for the Fibonacci function are stored in the
NUMS variable as defined below:
```python
NUMS = list(range(25, 33))
```
The definition of
NUMS will stay the same for all subsequent examples and will not be repeated.
For the sake of clarity, the following code demonstrates how to perform the required computations sequentially:
```python
def fib(n):
    if n <= 1:
        return 1
    else:
        return fib(n - 1) + fib(n - 2)

if __name__ == '__main__':
    for n in NUMS:
        result = fib(n)
        print((n, result))
```
As we can see, the
fib() function recursively computes the Fibonacci number for the parameter
n. Then, this function is called for all numbers stored in
NUMS and results are printed. The definition of
fib() will be the same in all the following examples. For the sake of brevity, it will not be repeated again.
The output of this program is as follows:
```
$ python sequential_fibonacci.py
(25, 121393)
(26, 196418)
(27, 317811)
(28, 514229)
(29, 832040)
(30, 1346269)
(31, 2178309)
(32, 3524578)
```
The easiest way to incorporate processes into the previous source code is using the
multiprocessing.Process class. The use is very similar to the one presented in the example employing
threading.Thread. In order to create a process, a callable and the arguments for this callable need to be passed to the constructor of multiprocessing.Process:

```python
import multiprocessing

def n_fib(n, results):
    results.put((n, fib(n)))

if __name__ == '__main__':
    results = multiprocessing.Queue()
    processes = []
    for n in NUMS:
        p = multiprocessing.Process(target=n_fib, args=(n, results))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
    for _ in range(len(NUMS)):
        print(results.get())
```
Since we want to preserve the form of output from the previous example, a new wrapper function called
n_fib() is defined. Subsequently, this function is used as the target function in the process instantiation instead of
fib(). Similarly to the
threading.Thread example, there is no straightforward way to return a result from a process. Therefore, the
multiprocessing.Queue class is employed to store the results. To track all the created processes, we initialize
processes to an empty list. Then, a process is instantiated for every number from
NUMS, added to the tracking list and started. Thanks to this tracking list, joining processes later in the program is simple. Finally, the output is printed to the screen (we cannot use
multiprocessing.Queue.empty() because it is not reliable):
```
$ python multiprocessing_no_inheritance.py
(25, 121393)
(26, 196418)
(27, 317811)
(29, 832040)
(28, 514229)
(30, 1346269)
(31, 2178309)
(32, 3524578)
```
Another approach to parallelize the computation of the Fibonacci numbers is implementing our own representation of a process by subclassing the
multiprocessing.Process class. Only two methods need to be overridden, namely __init__() and run():

```python
import multiprocessing

class FibProcess(multiprocessing.Process):
    def __init__(self, n):
        super().__init__()
        self.n = n
        self.result = multiprocessing.Value('L')

    def run(self):
        self.result.value = fib(self.n)

if __name__ == '__main__':
    processes = []
    for n in NUMS:
        p = FibProcess(n)
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
    for p in processes:
        print((p.n, p.result.value))
```
The implementation is very similar to the previous one. However, there is one small nuance. Thanks to the
FibProcess class, the results can be stored in the instances of this class and no queue is necessary. Please, realize that the instances of
FibProcess are stored in the memory space of the main process. Therefore, in order to store the results, we need to access the
FibProcess instances from other processes that are represented by the
run() methods of these instances. Consequently, it is not enough to use an ordinary variable for the
result attribute, so
multiprocessing.Value has to be employed. The first and only argument used during the instantiation of
multiprocessing.Value determines the type of the
result attribute. In our case, the 'L' typecode stands for an unsigned long.
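The typecode characters are the same as those of the array module ('L' for unsigned long, 'd' for double, and so on). A quick sketch of multiprocessing.Value with made-up values, run within a single process for simplicity:

```python
import multiprocessing

# 'L' -> unsigned long, 'd' -> double; the codes match the array module.
count = multiprocessing.Value('L', 121393)
ratio = multiprocessing.Value('d', 1.618)

# Each Value carries its own lock for safe updates when it is actually
# shared across processes.
with count.get_lock():
    count.value += 1

print(count.value)  # 121394
print(ratio.value)  # 1.618
```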
As may be expected, the output of this program is the same as in the previous examples:
```
$ python multiprocessing_inheritance.py
(25, 121393)
(26, 196418)
(27, 317811)
(28, 514229)
(29, 832040)
(30, 1346269)
(31, 2178309)
(32, 3524578)
```
For this and the preceding example, the number of created processes is not limited. It solely depends on the input size. If the input,
NUMS in this particular case, grows constantly, new processes will be created, too. From the first part dealing with threads, we know that this is not an ideal situation. So, in the last two examples, a process pool is used to overcome this shortcoming.
In the previous examples, each process after its birth computes one Fibonacci number and terminates. Since process creation is not for free, it is only reasonable to assign more tasks to each process. Of course, this can be easily achieved in Python with
multiprocessing.pool.Pool or, more succinctly,
multiprocessing.Pool. So, let’s get ecological and start recycling our processes:
```python
from multiprocessing import Pool

def n_fib(n):
    return n, fib(n)

if __name__ == '__main__':
    with Pool() as pool:
        results = pool.map(n_fib, NUMS)
    for result in results:
        print(result)
```
In the implementation of
n_fib(), there is a small change. Indeed, this function returns its result in the usual way. Therefore, unlike the previous examples, no queue or extra class implementation is required. Moreover, we can see that the program is getting more concise. There is no need for storing the created processes, joining them, using a queue or implementing a new class to communicate the results back to the main process. The instance of
multiprocessing.Pool, named as
pool in our code, takes care of everything. Particularly, the
map() method divides the tasks, sends them to the processes and collects the results returned from the processes into a list. If the number of worker processes is not specified during the instantiation of multiprocessing.Pool,
os.cpu_count() will be used.
Of course, the output printed to the screen is the same as in the previous examples:
```
$ python multiprocessing_pool.py
(25, 121393)
(26, 196418)
(27, 317811)
(28, 514229)
(29, 832040)
(30, 1346269)
(31, 2178309)
(32, 3524578)
```
Python provides a very similar alternative way to create a process pool via the
concurrent.futures.ProcessPoolExecutor class. Indeed, these two implementations are like two peas in a pod:
```python
from concurrent.futures import ProcessPoolExecutor

def n_fib(n):
    return n, fib(n)

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        results = pool.map(n_fib, NUMS)
    for result in results:
        print(result)
```
The number of worker processes created in the pool can be specified in the constructor as an optional argument. If it is not provided, the number of CPU cores detected on the machine will be used. If one of the worker processes in the pool terminates abruptly, the
BrokenProcessPool exception is raised. The
map() method from the
concurrent.futures.ProcessPoolExecutor class has almost the same functionality as
map() from the
multiprocessing.Pool class. A small difference between them lies in their return values: the former returns an iterator over the results, whereas the latter returns a list of results.
As expected, the output looks the same as in the preceding examples:
```
$ python concurrent_futures_processpoolexecutor.py
(25, 121393)
(26, 196418)
(27, 317811)
(28, 514229)
(29, 832040)
(30, 1346269)
(31, 2178309)
(32, 3524578)
```
The complete source code of all the presented examples is available on GitHub. All code was written and tested in Python 3.4 and should also work in newer Python versions. Potential differences in newer versions, if any exist, were mentioned in the appropriate sections.