Comparing two files using Python Part II: Parallel Processing with the Multiprocessing module

Before I begin, I would like to let you know that yes, I know, the title of this post is a little misleading, as we are not actually comparing files in our program. We are, however, going to continue the discussion of performance that we started in Part I of this series, because resource utilization is an important element of development when dealing with large data sets. In Part I we compared two large files containing millions of numbers, seeking common values between the two using next(). That program processes the data serially: it goes to one file, reads a value, and saves that value to memory; it then proceeds to the second file, repeats the reading process, and compares the two numbers. Depending on the outcome of the comparison, one or both numbers are discarded from memory, and the process repeats until a StopIteration exception is raised. While that program works for that particular task, what if we wanted to work on two files simultaneously? This is certainly possible if we design our program to take advantage of parallel processing by utilizing our machine's resources effectively.
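As a quick refresher, here is a minimal sketch of that serial approach. This is illustrative only, not the exact code from Part I, and it assumes both files contain one number per line in ascending order:

def find_common(path_a, path_b):
    """Collect values that appear in both sorted files (one number per line)."""
    common = []
    with open(path_a) as a, open(path_b) as b:
        try:
            x = int(next(a))
            y = int(next(b))
            while True:
                if x == y:            # common value found; advance both files
                    common.append(x)
                    x = int(next(a))
                    y = int(next(b))
                elif x < y:           # discard the smaller value and read the next one
                    x = int(next(a))
                else:
                    y = int(next(b))
        except StopIteration:         # one file is exhausted; we're done
            pass
    return common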

Parallel Processing

When discussing parallel processing in this post, we are referring to the act of dividing a program's tasks among multiple cores of a processor (or processors) and having each core execute its tasks simultaneously, with the goal of completing the overall job faster because we aren't waiting for one task to finish before beginning another (serial processing); this is also known as multiprocessing. By default, Python's interpreter was designed to execute code serially for simplicity's sake. One way to implement multiprocessing in Python is the multiprocessing module. This module allows developers to spawn multiple processes concurrently while supporting synchronization, communication, and state sharing between them. Each process gets its own interpreter, the operating system can schedule the processes onto separate cores, and the results of all the processes are collected once they are finished. For a detailed overview of the module, see: Multiprocessing. The multiprocessing module provides several classes for concurrent processing. Here, we are going to focus on the Pool class, which allows us to apply a function to multiple inputs and have them processed simultaneously. In future posts, we will explore the other multiprocessing classes. Personally, I don't think the example from Part I is a great fit for demonstrating basic parallel processing, so we are going to use a different example to show off the module's capabilities.

Multiprocessing: Analyzing Three Files Simultaneously

For our first example, we are going to reuse the two.txt and five.txt files from Part I of this series. The first file, two.txt, contains 250 million numbers, all multiples of two, and the second file, five.txt, contains 100 million numbers, all multiples of five. We will also add a third file, eleven.txt, which contains approximately 46 million numbers, all multiples of 11. We will iterate through each value in each file, find the odd numbers, sum all of the odds in each file, multiply each file's sum by two, and finally, add the results together.

Visual Example

Code

As with any Python module, we first import the multiprocessing module, specifically the Pool class. Next, we define our function in the usual Pythonic way. The aggregateodds() function takes a file as input, iterates through each line, and determines whether each number is odd by dividing the number by 2 and examining the remainder. If the number is not evenly divisible by two (that is, the remainder is 1), it gets added to the variable sumofodds. This process continues until the program reaches the end of the file.
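A minimal sketch of such a function, assuming each file contains one number per line, might look like this:

def aggregateodds(filename):
    """Sum every odd number in the given file (one number per line)."""
    sumofodds = 0
    with open(filename) as f:
        for line in f:
            number = int(line)
            if number % 2 != 0:   # a remainder of 1 means the number is odd
                sumofodds += number
    return sumofodds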

After defining our function, we make use of the Pool class by specifying the number of processes to create. My machine has a single CPU consisting of 4 cores. Since I'm applying the function to 3 large files, I specified 3 processes. Next, I'm taking advantage of Pool's map() method, which is equivalent to Python's built-in map function in that it applies a function to every element of an iterable. In our case, p.map() applies the aggregateodds() function to the list of files that we specified: ['two.txt', 'eleven.txt', 'five.txt']. Finally, once the three processes have completed, the program takes the three results, multiplies each of them by two, and sums the numbers together. The final result of this program is 36363636636363638.
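Putting it together, the driver portion of the program might look something like this sketch, built from the description above (the exact names may differ from the original):

from multiprocessing import Pool

# aggregateodds() is the function defined earlier
if __name__ == '__main__':
    # One worker process per file; with 4 cores available, 3 fits comfortably
    p = Pool(processes=3)
    results = p.map(aggregateodds, ['two.txt', 'eleven.txt', 'five.txt'])
    p.close()
    p.join()
    # Double each per-file sum, then add the doubled sums together
    print(sum(result * 2 for result in results))

Note that p.map() blocks until all three workers have returned, and it hands back the results in the same order as the input list, so no extra bookkeeping is needed to match each sum to its file.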

Using the time command, we find that the total elapsed time for this program when using parallel processing is 3 minutes and 52 seconds, with a CPU time of approximately 6 minutes and 16 seconds. The CPU time is higher than the elapsed time because multiple cores are working on the program concurrently, and CPU time is totaled across all of them. If we compare these results to the results of running the program without multiprocessing, we see that the program is much faster when taking advantage of parallel processing.

Using Multiprocessing

NOT Using Multiprocessing

In future posts, we will dive deeper into Python's multiprocessing module, how it works around the constraints of Python's Global Interpreter Lock (GIL), other implementations of Python, and additional examples using the multiprocessing module.
