One of the motivations for all of the effort and research spent on sorting was the need to coordinate processing of two or more sequential files at the same time. Because early processing was mostly batch processing, it was necessary to devise a way that the contents of two files could be processed reliably and without the intervention of a human operator. The typical scenario was for a file of transactions to be collected throughout the day and either punched onto cards or keyed to a magnetic file. At the end of the day, the transaction file would be input to a program that would also use a master file as input. One output of the program, among others, would be an updated master file that reflected the results of the transactions.
Clearly, taking the transactions in the random order that they arrived and searching sequentially for the corresponding record in the master file would be unacceptable. Even if the master file were sorted and the information in key subblocks were used, it would still take multiple time-consuming passes over the data. However, if both files are sorted on the same key field in the same sequence, it may be possible to devise logic to process both files completely with only a single pass over each file.
The origin of the catchy name seems to be undocumented but probably derives from the 'balancing' of key values against each other that forms the heart of the algorithm. Johnson and Cooper (1981, p. 59) point out that this is more than an algorithm; it is a design approach and many customized routines may be based on it. This discussion will concern itself with the two-file problem; however, it may be extended to as many files as necessary.
The fundamental assumption of the balance line is that the files be sorted on the same key fields and in the same sequence. For all of the following discussion, the assumption is that the files are sorted in ascending sequence. Obviously, the key fields must be comparable, containing data in like formats and values from the same domain of values. Continuing with the transaction-file/master-file scenario established above, let File 1 with Key 1 be the transaction file and File 2 with Key 2 be the master file. Further, it is assumed that the transactions are only updates; additions and deletions are not used in this process. (This is not very realistic, but it does provide a starting point for discussion.) Consider the following:
+-- if (key1 < key2)
|
|       /* THIS IS AN ERROR CONDITION
|          REPORT THE ERROR AND READ
|          ANOTHER TRANSACTION RECORD */
|
+-- else
|
|   +-- if (key1 > key2)
|   |
|   |       /* THERE IS NOTHING FOR THIS MASTER RECORD
|   |          WRITE THE RECORD TO THE NEW MASTER AND
|   |          READ ANOTHER MASTER RECORD */
|   |
|   +-- else /* keys must be equal */
|   |
|   |       /* APPLY TRANSACTION TO THE MASTER RECORD
|   |          AND READ ANOTHER TRANSACTION RECORD */
|   |
|   +-- end if
|
+-- end if
The above logic is nested in a loop that continues until both input files are at end of file. Most implementations of the balance line ensure that the process runs to proper completion by moving a high value (some value larger than any proper key) to the key field of a file that has reached end of file. Because this high value is larger than any value that occurs naturally in the data, its presence in the key field of the exhausted file ensures that the other file will continue to be processed until it, too, reaches end of file. The other control elements follow.
read(&record1, sizeof(record1), 1, file1)
if (eof(file1)) key1 = 0xFFFFU
read(&record2, sizeof(record2), 1, file2)
if (eof(file2)) key2 = 0xFFFFU

+== while ( not (key1 == 0xFFFFU AND key2 == 0xFFFFU))
|
|       /* THE IF LOGIC ABOVE GOES HERE */
|
+== end while
The key fields above are implemented as unsigned integers, and 0xFFFFU represents the highest value that can be represented in a 16-bit unsigned integer. The initializing reads set up the conditions for the loop, but other actions may be desired should the first access to either file be an end of file, perhaps a branch to an abrupt exit from the program. Other than initial end-of-file conditions, the code above, coupled with the nested if statements and the appropriate actions on each condition (including moving a high value to the key field when end of file is reached), will run to completion and not miss any records. It is suggested that the student simulate this by hand with selected key values to understand how the logic performs in all cases.
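The control elements and the nested if logic can be put together into a short runnable sketch. The version below is in Python (the pseudocode above is C-flavored, but the control logic is language independent); sorted lists of (key, value) pairs stand in for the files, and float("inf") stands in for the 0xFFFFU sentinel. All function and variable names are illustrative, not part of any library.

```python
HIGH = float("inf")   # stands in for the 0xFFFFU end-of-file sentinel

def next_key(records, pos):
    """Read the next key; return (HIGH, pos) at end of file."""
    return (records[pos][0], pos + 1) if pos < len(records) else (HIGH, pos)

def balance_line_update(transactions, masters):
    """Update-only balance line. Both inputs are sorted lists of
    (key, value) pairs; returns (new_master, errors)."""
    new_master, errors = [], []
    t = m = 0
    key1, t = next_key(transactions, t)
    key2, m = next_key(masters, m)
    current = list(masters[m - 1]) if key2 != HIGH else None
    while not (key1 == HIGH and key2 == HIGH):
        if key1 < key2:
            errors.append(transactions[t - 1])     # no matching master: error
            key1, t = next_key(transactions, t)
        elif key1 > key2:
            new_master.append(tuple(current))      # nothing (more) for this master
            key2, m = next_key(masters, m)
            current = list(masters[m - 1]) if key2 != HIGH else None
        else:
            current[1] = transactions[t - 1][1]    # apply transaction, hold record
            key1, t = next_key(transactions, t)
    return new_master, errors
```

Note that a matched master record is held as the current record and written only when its key is passed, so several updates to the same key are applied before the record is output.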
To process multiple types of transactions, there must be codes in the transaction file to indicate the type of transaction. The specific transaction types that need to be addressed are those that deal with adding, updating, or deleting records from the master file. The codes to represent these three transactions must be chosen so that the transaction file may be sorted on the transaction code as a minor sort key and ordered so that the add code will appear ahead of updates and that updates will appear before deletes. 1, 2, and 3 have been suggested as codes; A(dd), C(hange), and D(elete) would work as well. In implementing the balance line, the logic must be expanded to check the transaction code when KEY1 is less than KEY2 and when KEY1 is equal to KEY2 and the appropriate actions taken for each combination of conditions. These combinations of conditions must be planned out carefully and tested thoroughly before moving the code into production; otherwise, serious corruption of the master files upon which the business depends might occur.
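The combinations of conditions can be sketched in Python as follows. The structure is rearranged slightly from the pseudocode: all transactions for one key are applied while that key's record is active, which keeps the add/change/delete cases in one place. The record layouts and names are illustrative only.

```python
HIGH = float("inf")   # end-of-file sentinel, larger than any real key

def balance_line_post(transactions, masters):
    """Post Add/Change/Delete transactions to a sorted master list.
    transactions: sorted (key, code, value) with code "A" < "C" < "D"
    as the minor sort key; masters: sorted (key, value).
    Returns (new_master, errors)."""
    new_master, errors = [], []
    t = m = 0
    while t < len(transactions) or m < len(masters):
        key1 = transactions[t][0] if t < len(transactions) else HIGH
        key2 = masters[m][0] if m < len(masters) else HIGH
        if key2 < key1:
            new_master.append(masters[m])    # no transactions for this master
            m += 1
            continue
        key = min(key1, key2)
        if key2 == key:                      # make this master the active record
            current = list(masters[m])
            m += 1
        else:
            current = None                   # no master record with this key yet
        while t < len(transactions) and transactions[t][0] == key:
            code, value = transactions[t][1], transactions[t][2]
            if code == "A":
                if current is None:
                    current = [key, value]               # add the new record
                else:
                    errors.append(transactions[t])       # add, but key exists
            elif code == "C":
                if current is None:
                    errors.append(transactions[t])       # change, but no record
                else:
                    current[1] = value                   # apply the update
            else:                                        # "D"
                if current is None:
                    errors.append(transactions[t])       # delete, but no record
                else:
                    current = None                       # drop the record
            t += 1
        if current is not None:
            new_master.append(tuple(current))
    return new_master, errors
```

Because the codes sort A before C before D, an added record can be changed or deleted by later transactions in the same batch, exactly the ordering requirement described above.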
The balance line is also the basis for merging files, but the logic must be changed. The fundamental assumption remains as before, both files must be sorted on the same keys in the same sequence. Normally, both files might be expected to have the same record format, as would be the case when merging runs from the same original file; however, it is not an absolute requirement.
Consider the simplest of the problems, merging two files with the same record format sorted on the same key in the same sequence. First, it must be decided what to do with matches. For the purpose of this example, it will be assumed that neither of the matching records will be output to the result file; rather, both of the matching records will be output to a separate file for matches that will be examined to determine if they are duplicates or if one is to be retained over the other. Second, it must be decided whether to double-check for sorting errors and for duplicates within each of the files; to do so will require additional logic and memory to keep a copy of the previous record from each file and to compare the record just read with it. To keep this simple, this "sequence checking" will not be implemented, although the student must understand that sequence checking must be implemented in any production code. With these considerations in mind, examine the following example which is built on merging two lists of city names.
fgets(key1, file1)
if (eof(file1)) strcpy(key1, "~")
fgets(key2, file2)
if (eof(file2)) strcpy(key2, "~")

+== while ( not (key1 == '~' AND key2 == '~'))
|
|   result = strcmp(key1, key2)
|
|   +-- if (result < 0)
|   |
|   |       fputs(key1, file3)
|   |       fgets(key1, file1)
|   |       if (eof(file1)) strcpy(key1, "~")
|   |
|   +-- else
|   |
|   |   +-- if (result > 0)
|   |   |
|   |   |       fputs(key2, file3)
|   |   |       fgets(key2, file2)
|   |   |       if (eof(file2)) strcpy(key2, "~")
|   |   |
|   |   +-- else /* keys must be equal */
|   |   |
|   |   |       fputs(key1, file4)
|   |   |       fputs(key2, file4)   /* This assumes that both matches are wanted in the output. */
|   |   |       fgets(key1, file1)
|   |   |       if (eof(file1)) strcpy(key1, "~")
|   |   |       fgets(key2, file2)
|   |   |       if (eof(file2)) strcpy(key2, "~")
|   |   |
|   |   +-- end if
|   |
|   +-- end if
|
+== end while
Some commentary on the code is in order. First, when either input file reaches end of file, a tilde is moved to the first character of the key field. The tilde is larger than any other printable ASCII character, so the strcmp() function will evaluate the field containing the tilde as larger than the other field. The strcmp() function is needed to evaluate string keys; if you are using a language like COBOL that allows direct comparison, the tilde will still cause a correct evaluation. Finally, and obviously, the above code is appropriate for cases where the key field is a string.
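The merge logic above translates directly into a runnable sketch. Here, in Python, sorted lists of city names stand in for the two input files, and the merged output (file3) and match output (file4) are returned as lists; the "~" sentinel is kept from the pseudocode and assumes real keys sort below it.

```python
TILDE = "~"   # end-of-file sentinel; sorts after any ordinary city name

def merge_with_matches(file1, file2):
    """Merge two sorted lists of strings. Matching keys go to `matches`
    (both copies, as in the pseudocode); everything else goes to `merged`."""
    merged, matches = [], []
    i = j = 0
    key1 = file1[i] if i < len(file1) else TILDE
    key2 = file2[j] if j < len(file2) else TILDE
    while not (key1 == TILDE and key2 == TILDE):
        if key1 < key2:
            merged.append(key1)                       # file1's key is smaller
            i += 1
            key1 = file1[i] if i < len(file1) else TILDE
        elif key1 > key2:
            merged.append(key2)                       # file2's key is smaller
            j += 1
            key2 = file2[j] if j < len(file2) else TILDE
        else:
            matches.append(key1)                      # keys equal: divert both
            matches.append(key2)
            i += 1
            j += 1
            key1 = file1[i] if i < len(file1) else TILDE
            key2 = file2[j] if j < len(file2) else TILDE
    return merged, matches
```

As in the pseudocode, the sketch performs no sequence checking; production code would verify that each key read is not smaller than the previous one from the same file.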
Students are often uncertain of how to implement the logic for handling three files simultaneously. Remember, the fundamental assumption still applies. The logic is based on determining which key is smaller among the three at hand. The only tricky part is the case in which two equal keys are smaller than the other; it must be decided how to pick among the ties. In the case of merging three files, an arbitrary choice is appropriate. Consider the following control structure based on integer keys.
+-- if (key1 <= key2 AND key1 <= key3)
|
|       /* whatever is appropriate when key1 is smallest */
|       /* this would also cover the case of all three keys equal */
|       /* if one set of conditions will not cover the case of all three keys equal,
|          then a separate test for all three keys equal should be placed first */
|
+-- else
|
|   +-- if (key2 <= key1 AND key2 <= key3)
|   |
|   |       /* whatever is appropriate when key2 is smallest */
|   |
|   +-- else
|   |
|   |   +-- if (key3 <= key1 AND key3 <= key2)
|   |   |
|   |   |       /* whatever is appropriate when key3 is smallest */
|   |   |
|   |   +-- end if
|   |
|   +-- end if
|
+-- end if
All that must be added to each branch is code to check each key against a copy of the last key output. If the current minimum key is equal to the last key output, do not write the key to the merged file; simply discard that key and read a new key from the source file. Of course, every time a key is written to the merged file, a copy of that key must be made in the field reserved for the last key output. This logic must reside on each of the branches to handle the problem of duplicate keys in the files that are being merged.
The student will realize that the nested-if code structure above will become difficult to manage if it is extended to handle more files. An elegant way to avoid this complexity is to set up an array of file pointers, one position for each of the runs to be merged, and an array of keys, one for each of the runs. These arrays may be of any size necessary. This is usually referred to as a k-way merge. The simplest way is to use the same subscript to access each of these arrays. The program would first open the runs and place each file reference into one of the file pointers. Then a loop would initialize the array of keys with one key from each of the runs. After that a second loop would work through the array of keys, selecting the smallest, writing it to output, and then replacing it with another key from the appropriate run. The second loop would continue until end of file was reached on all of the runs. Pseudocode for these processes is given below.
open all files          /* k is a value passed into the program after the runs have been formed. */

+== for (i = 0; i < k; i++)
|       getkey(i)       /* a function that uses an array of file pointers to access the runs */
+== end for

set more_keys to TRUE

+== while (more_keys)
|
|   lowest = 0
|
|   +== for (i = 1; i < k; i++)
|   |       if (key[i] < key[lowest]) lowest = i
|   +== end for
|
|   output_record(lowest)   /* a function that will take the record from the
|                              appropriate run and write it to the merge file */
|   getkey(lowest)
|
|   set more_keys to FALSE
|
|   +== for (i = 0; i < k; i++)
|   |
|   |   +-- if (key[i] != HIGH_VALUES)
|   |   |       set more_keys to TRUE
|   |   |       break
|   |   +-- end if
|   |
|   +== end for
|
+== end while

close files
This is a merge that will handle any number of input files. It avoids the complications of a large number of nested if statements. Not shown in the pseudocode is the logic for preserving the value of the last key output and comparing the selected lowest key to it. That logic would reside just above the reference to the output_record function. What we have is a rather simple and flexible way to handle file merging.
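The pseudocode, together with the last-key-output logic just described, can be sketched in Python. Lists stand in for the runs, float("inf") stands in for HIGH_VALUES, and all names are illustrative.

```python
HIGH_VALUES = float("inf")   # stands in for the pseudocode's HIGH_VALUES

def k_way_merge(runs):
    """k-way merge per the pseudocode above, plus the last-key-output
    logic: a key equal to the last key written is discarded."""
    k = len(runs)
    pos = [0] * k
    key = [runs[i][0] if runs[i] else HIGH_VALUES for i in range(k)]

    def getkey(i):
        """Replace key[i] with the next key from run i (HIGH_VALUES at EOF)."""
        pos[i] += 1
        key[i] = runs[i][pos[i]] if pos[i] < len(runs[i]) else HIGH_VALUES

    merged = []
    last_key = None
    more_keys = any(key[i] != HIGH_VALUES for i in range(k))
    while more_keys:
        lowest = 0
        for i in range(1, k):                 # linear scan for the smallest key
            if key[i] < key[lowest]:
                lowest = i
        if key[lowest] != last_key:           # suppress duplicate keys
            merged.append(key[lowest])
            last_key = key[lowest]
        getkey(lowest)
        more_keys = any(key[i] != HIGH_VALUES for i in range(k))
    return merged
```

Ties between runs are resolved arbitrarily (the scan keeps the first run holding the smallest key), which, as the text notes, is appropriate for a merge.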
In Chapter 5, several sorting algorithms were addressed. One that was deliberately given short mention was the sort-merge. Now that the balance line has been demonstrated, the sort-merge may be addressed.
Remember, all of our sorting efforts were limited to files that would fit into main memory. Large files that would not fit into memory were a problem. But now the k-way merge promises to solve the problem of files too large for main memory. Consider this solution. Sort as much of the file as possible using a heapsort, and then write that sorted portion out into a work file called a run. Then load another portion of the original file into memory, sort it, and write it into a second run. Continue until all of the original file is exhausted, and you have k runs. Then, using a k-way merge, merge the k runs into a single sorted file. Figure 1 is a conceptual diagram of this process.
Figure 1. Sort-Merge with intermediate runs.
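The process in Figure 1 can be sketched end to end in a few lines of Python. This is illustrative only: sorted() stands in for the heapsort, in-memory lists stand in for the work files, and the capacity parameter plays the role of the limited main-memory work space.

```python
def sort_merge(records, capacity):
    """Sketch of Figure 1: sort `capacity` records at a time into runs,
    then k-way merge the runs into one sorted output."""
    # Phases 1-2: form sorted runs (sorted() stands in for the heapsort)
    runs = [sorted(records[i:i + capacity])
            for i in range(0, len(records), capacity)]
    # Phases 3-4: k-way merge the runs with a linear scan for the smallest key
    pos = [0] * len(runs)
    out = []
    while True:
        lowest = -1
        for i in range(len(runs)):
            if pos[i] < len(runs[i]) and (
                    lowest < 0 or runs[i][pos[i]] < runs[lowest][pos[lowest]]):
                lowest = i
        if lowest < 0:
            break                   # every run is exhausted
        out.append(runs[lowest][pos[lowest]])
        pos[lowest] += 1
    return out
```

With 7 records and a capacity of 3, for instance, three runs are formed and then merged; the result is the fully sorted file.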
Before continuing, a brief review of the heapsort is in order. Recall that one of the attractive characteristics of the heapsort is that it overlaps input and heap building. This is most efficiently accomplished by using locate mode buffering. Locate mode buffering eliminates the need for an additional transfer of the data from one place in RAM to another as would be the case with move mode buffering. The heap array is divided into buffer-sized subdivisions. The first subdivision is filled and then heap building begins and continues while the second subdivision is filled. Then the second subdivision is added to the heap while the third subdivision is filled. This continues until the entire array is filled and processed. This arrangement does not add appreciably to the time it takes to read the data into memory.
On output the first record in the heap is output to a buffer, and the record in the last position is brought to the top of the heap and pushed down to its proper position. In this shuffle, the length of the heap array is shortened by 1. Then the next record, which was moved to the top of the heap, is output, and the record at the end of the array is brought to the top of the heap and pushed down to its proper position. In this shuffle, the length of the heap array is shortened again. When the output buffer has been filled, the heap array has been shortened by the size of one buffer, freeing space for more sorting and output buffer operations while the physical output of the first buffer takes place. This continues until the entire heap has been output as a run. Then a new input phase begins again. This continues until the original input file reaches end of file.
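The mechanics of this output phase (though not the overlap with physical I/O, which depends on the buffering system) can be sketched with Python's heapq module: records are drawn off the shrinking heap and "written" a buffer at a time. The names and block structure are illustrative.

```python
import heapq

def heap_sort_run(blocks, buffer_size):
    """Sketch of forming one run: input arrives block by block (each block
    stands in for one input buffer) and is added to the heap; output is
    drawn off in fixed-size buffers, shrinking the heap as it goes."""
    heap = []
    for block in blocks:
        for rec in block:
            heapq.heappush(heap, rec)        # heap building, block by block
    run, buffer = [], []
    while heap:
        buffer.append(heapq.heappop(heap))   # heap shrinks by one each pop
        if len(buffer) == buffer_size:
            run.extend(buffer)               # "physical write" of a full buffer
            buffer = []
    run.extend(buffer)                       # flush any partial final buffer
    return run
```

Each heappop removes the smallest record and shortens the heap by one, so by the time a buffer is full the heap has shrunk by exactly one buffer's worth of records, as described above.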
Before analysis, some statements of assumptions are needed. These assumptions set up a best-case situation when it comes to performance. First, it is assumed that each file is located on contiguous sectors of the same track (extents) and all extents are in the same cylinder; when another cylinder is needed it will be in an adjacent cylinder that will minimize seek time. Further, it is assumed that from track to track and cylinder to cylinder, extents are staggered so that rotational delay is eliminated.
Now for some observations. Reference to Figure 1 may be helpful. Working from left to right, there are four times when I/O occurs. The first is input to the sort procedure. All of the input to the sort procedure is sequential. The only seeking is to access the file for input of each successive heap. The second I/O time is output from the sort procedure. All of the output from the sort is sequential, and the only seeking is to position the heads at the start of the output of each run. The third time is to input the runs to the merge procedure. All of the access to the runs for merging is sequential, but there will be seeking between these runs to read a portion of each run into its buffer for merging. Fourth and finally, all output from the merge to the sorted file is sequential, but there will be seeking to position the heads for the output of each buffer of data because the heads will almost certainly have to be moved to input a portion of a run between output operations.
Now to the problem. Let's say there is a file of 300-byte records and the total record count is 5 million records. That makes the total file size 1.5 gigabytes (GB). The main memory you have to work with allows 30 MB of work space for sorting, not counting the space for the executable program and other variables, the operating system, and system area. This means that at most there can be 100,000 records in main memory at one time. Dividing 100,000 into 5,000,000 records gives 50, the number of runs that will be created.
For this analysis, the drive performance parameters given in Table 1 will be used. These parameters represent the capabilities of high-capacity hard disk drives available for use in desktop machines and servers. The purpose of this analysis is not to develop an accurate forecast of performance, but rather to demonstrate the considerations that go into this analysis and give the student an appreciation for the performance of the sort-merge approach in the hypothetical environment that has been established. The student is reminded of the assumptions above that establish an optimum situation for the sort-merge.

Table 1. Hard Disk Drive Performance Parameters

    Average seek time           4.2 ms
    Average rotational delay    3 ms
    Maximum transfer rate       6 Gb/s
First, the input to the sort. There will be one seek to the original file. Then there is the transmission time. That is it. From Table 1, the average seek time is 4.2 ms and the average rotational delay is 3 ms. Adding these two together gives 7.2 ms to access the correct position in the file. The transfer rate is in terms of bits, but the file size is in bytes; therefore, multiplying the file size by 8 will get consistent units. The file size becomes 12 gigabits, and dividing by 50, the number of runs, gives the amount transferred for each run, 240 Mb. Dividing the maximum transfer rate of 6 Gb/s into the run size gives 40 ms for the transfer time for each run. That gives a total time of 47.2 ms for each run. The total time for the file is 2.360 seconds.
The second phase is output from the sort procedure to the run. Seek time and rotational delay are the same, and so is the transfer time. Therefore, the total time to output each run is 47.2 ms, and the total time for the file is 2.360 seconds.
Time for some intermediate bookkeeping. For both input and output for one run, the total time is 94.4 ms. Multiplying this value by 50, the number of runs, gives 4720 ms or 4.72 seconds to produce the 50 sorted runs. Not too bad.
The third phase gets a little messy. First, the matter of buffering is not as straightforward as before. Releasing the 30 MB of work space for duty as input buffers for the merge gives 50 buffers of 600 KB each. Dividing the buffer size by the record size gives 2,000, the maximum number of records from each run that can be in memory at any time. Each run is made from 100,000 records; taking these 2,000 at a time means that there must be 50 accesses to each run. That means there will be 2,500 accesses to all of the runs. The access time is seek time plus rotational delay, or 7.2 ms. Multiplying that value by 2,500 gives the total access time of 18,000 ms or 18 seconds. The total transfer time is for all of the data, 12 Gb as calculated above. Dividing 12 Gb by 6 Gb/s gives 2 seconds as the transfer time. The total for this third phase, the input to the merge, is 20 seconds.
The output buffers become a problem at this point. Let us assume that two 60 KB buffers can be allocated; this means that each buffer can hold 200 records. This means that there will be 25,000 outputs (5 million total records divided by 200) to the sorted file. Each output will require an access of 7.2 ms. Multiplying 7.2 ms by 25,000 gives 180,000 ms, or 180 seconds. Transfer time remains the same, namely 2 seconds. The point that shows up here is that the small buffers substantially increase the overall time for the process. The total for this phase is 182 seconds, or 3 minutes and 2 seconds. Not too good.
The limiting factor is the size of main memory. First, more memory for the work area would reduce the number of runs and thus reduce some of the seeking. More memory for input buffers would cut some of the seeking in phase 3, but the real improvement would be in Phase 4. Being able to increase the size of the output buffers could cut the number of outputs significantly. For example, being able to double the size of the buffers to 120 KB each would cut the number of outputs in half. Increasing the size to 300 KB each would divide the number of outputs by 5.
Another way to cut the amount of seeking is to write the sorted file to a different device, and not have to seek at all for each output. After the first seek, there would be only rotational delay and transfer time. We would have 3 ms of rotational delay times the 25,000 outputs, giving 75 seconds. Adding in the transfer time of 2 seconds, the conclusion is that with a dedicated drive, the time would be slightly over 77 seconds. Quite an improvement over the 182 seconds calculated above.
Now, let's consider what a solid state drive would do for the whole process. It would eliminate the seek time and rotational delay in each phase: 360 milliseconds in each of Phases 1 and 2, 18 seconds in Phase 3, and a whopping 180 seconds in Phase 4; in all, a savings of almost 3.33 minutes!
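The arithmetic of the whole analysis is easy to check in a few lines. Every figure below comes from the text above (5 million 300-byte records, 30 MB of work space, 4.2 ms average seek, 3 ms average rotational delay, 6 Gb/s maximum transfer rate); nothing new is assumed.

```python
records      = 5_000_000
record_bytes = 300
work_bytes   = 30_000_000

runs      = records // (work_bytes // record_bytes)   # 100,000 records per load
access_ms = 4.2 + 3.0                                 # seek + rotational delay
file_bits = records * record_bytes * 8                # 12 Gb
xfer_s    = file_bits / 6e9                           # 2 s to transfer the whole file

phase1_s = runs * (access_ms / 1000) + xfer_s         # input to the sort
phase2_s = phase1_s                                   # output of the runs
phase3_s = 2500 * (access_ms / 1000) + xfer_s         # 50 reads per run x 50 runs
phase4_s = 25000 * (access_ms / 1000) + xfer_s        # 200-record output buffers

dedicated_s = 25000 * 0.003 + xfer_s                  # separate drive: rotation only
ssd_saved_s = phase1_s + phase2_s + phase3_s + phase4_s - 4 * xfer_s
```

Running these lines reproduces the figures in the text: 50 runs, 2.36 s for each of the first two phases, 20 s for the third, 182 s for the fourth, 77 s with a dedicated output drive, and roughly 199 s (almost 3.33 minutes) of seek and rotational delay that a solid state drive would eliminate.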
For speed purposes, the answer is hardware. More memory will produce improvements, up to the limit supported by the motherboard. But real savings can be achieved by having a separate drive dedicated to accepting the sorted output file, and huge savings in time by using solid state drives.
The calculations above were for what is a relatively small file in today's world of "big data." It is no wonder that, regardless of cost, firms are turning increasingly to solid state drives.
The usefulness of the balance line logic is that it can be adapted to implement many business processes. In addition to posting transactions to a master file, it may be used to clean duplicates from two or more lists and to merge two or more files into one. All that needs to be changed is the detailed processing logic for each set of conditions. Balance line control logic remains a mainstay in the toolkit of the business programmer. Merge sorting is a notable application of the balance line, and it functions best when given maximum hardware support.