Recently, I was tasked with comparing two files and writing the entries they have in common to a new file. Initially, each file was read into memory as a list, the common elements were appended to an empty list, and that list was written to a text file. This worked for what we were trying to do, but it isn’t the best approach for larger files: it holds every element in memory, which makes the program resource intensive. To reduce memory usage, I decided to iterate through the files one line at a time and perform the comparison that way. In this post, I will walk through my method of file iteration and line comparison using example files I created specifically for this tutorial.
Creating the files for comparison
For this tutorial, I created two generators. The first writes 250 million numbers, all multiples of two, on individual lines to a text file titled two.txt; the second writes 100 million numbers, all multiples of five, on individual lines to a text file titled five.txt. These two files will be compared, and the numbers common to both will be written to a text file titled common.txt.
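The generator code is not reproduced here, so the following is only a minimal sketch of what such a script could look like. It assumes each file starts at the first positive multiple (2 and 5), and the names multiples and write_multiples are hypothetical helpers chosen for illustration.

```python
def multiples(step, count):
    """Yield the first `count` positive multiples of `step`, one at a time."""
    for i in range(1, count + 1):
        yield step * i


def write_multiples(path, step, count):
    """Write the multiples to `path`, one number per line.

    Values are produced lazily, so the full sequence is never held in memory.
    """
    with open(path, "w") as out:
        out.writelines(f"{n}\n" for n in multiples(step, count))


# Full-size files as described in the post (slow, and several GB on disk):
# write_multiples("two.txt", 2, 250_000_000)
# write_multiples("five.txt", 5, 100_000_000)
```

The two full-size calls are left commented out so the sketch can be tried with small counts first.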
After running these generators, we are left with two uncompressed files: five.txt at 977.8 MB and two.txt at 2.44 GB. Now that we have these two files saved, we can use them for comparison. For this, we first want to open three files: two.txt and five.txt in read mode, and a new file titled common.txt in write mode:
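One way to open all three files together is a single with statement, which also closes them when the block ends. This is a sketch rather than the original code; the wrapper name open_comparison_files and its default paths are assumptions made for illustration.

```python
from contextlib import contextmanager


@contextmanager
def open_comparison_files(two_path="two.txt", five_path="five.txt",
                          common_path="common.txt"):
    """Open both inputs for reading and the output for writing.

    All three files are closed automatically when the with block exits.
    """
    with open(two_path, "r") as two, \
         open(five_path, "r") as five, \
         open(common_path, "w") as common:
        yield two, five, common
```

It would be used as: with open_comparison_files() as (two, five, common): followed by the comparison code. Opening a file does not read it into memory; lines are only read as they are requested.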
The file comparison process
To start the comparison process, we will want to grab the first element in both two.txt and five.txt and read them in as integers using the next() function. This is a built-in function that returns the next element from an iterator and raises the StopIteration exception once the iterator is exhausted. A file object is an iterator over its lines, so each call to next() reads exactly one line.
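To illustrate how next() behaves, here is a small sketch that uses io.StringIO as an in-memory stand-in for a file (an assumption made so the example runs without the large files). The helper name next_int is hypothetical.

```python
import io


def next_int(lines):
    """Return the next line of `lines` as an int.

    StopIteration propagates to the caller when no lines remain.
    """
    return int(next(lines))  # int() ignores the trailing newline


def demo():
    two = io.StringIO("2\n4\n")  # behaves like a file with two lines
    first = next_int(two)
    second = next_int(two)
    try:
        next_int(two)            # no lines left
        exhausted = False
    except StopIteration:
        exhausted = True
    return first, second, exhausted
```

Calling demo() returns (2, 4, True): two successful reads, then StopIteration on the third.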
With the first element of each file stored in memory, the numbers two and five, we will want to compare them using the following while loop:
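The original loop is not reproduced here. The sketch below is one way to write it under the assumptions stated in this post: both inputs are sorted in ascending order with one integer per line. The function name write_common is hypothetical, and catching StopIteration to end the loop is a choice made for this sketch.

```python
def write_common(two, five, common):
    """Write every number present in both sorted inputs to `common`.

    `two` and `five` are iterators over lines (such as open files).
    Returns the count of numbers written. Only one number from each
    input is held in memory at any time.
    """
    written = 0
    try:
        a = int(next(two))
        b = int(next(five))
        while True:
            if a == b:
                # Match: record it and advance both inputs.
                common.write(f"{a}\n")
                written += 1
                a = int(next(two))
                b = int(next(five))
            elif a < b:
                # The smaller number cannot appear later in the other file.
                a = int(next(two))
            else:
                b = int(next(five))
    except StopIteration:
        # One input is exhausted, so no further matches are possible.
        pass
    return written
```

Because the function only relies on next() and write(), it can be tried on small in-memory inputs before running it on the multi-gigabyte files.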
Before diving into the logic behind the code, it is worth mentioning that the data in both files is sorted in ascending order. The approach depends on this: because the numbers only increase, a number that is smaller than the current number in the other file cannot appear later in that file and can be skipped safely. Now as for the logic, we are pulling a number from one line in each file and comparing them. If the numbers are the same, the program writes the common number to common.txt and proceeds to the next number in each file. When the numbers are not equal, the program moves on to the next number in whichever file presented the smaller number during the comparison. Here is a visual example:
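As a textual counterpart to the visual, the following sketch records each comparison and the action taken. It works on small in-memory lists rather than files, and the function name trace_comparison and the action labels are illustrative assumptions.

```python
def trace_comparison(two_values, five_values):
    """Return a list of (a, b, action) tuples, one per comparison."""
    two, five = iter(two_values), iter(five_values)
    steps = []
    try:
        a, b = next(two), next(five)
        while True:
            if a == b:
                steps.append((a, b, "write"))
                a, b = next(two), next(five)
            elif a < b:
                steps.append((a, b, "advance two"))
                a = next(two)
            else:
                steps.append((a, b, "advance five"))
                b = next(five)
    except StopIteration:
        pass  # one input ran out, so the comparison is finished
    return steps
```

For the inputs 2, 4, 6, 8, 10 and 5, 10, the trace is: (2, 5) advance two, (4, 5) advance two, (6, 5) advance five, (6, 10) advance two, (8, 10) advance two, (10, 10) write.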
Once the program reaches the end of two.txt and the StopIteration exception is raised, we are left with a 488.9 MB common.txt file containing the numbers common to the multiples of two and five.
Running the program under the time command yields: