Page 254 - DCAP605_ADVANCED_DATA_STRUCTURE_AND_ALGORITHMS
P. 254
Unit 12: Sorting
When only one active run remains the algorithm finishes up as shown in lines 30 and 32 of Notes
TWOWAY-MERGE-it just copies all the remaining elements to the end of file X. Figure 12.1
visualizes multiway merging.
Figure 12.1: Main-memory Organization for Multiway Merging
Input buffers, one for
each unfinished run
Select smallest
unchosen Bf 0
element for
output
P 0
………..
Pointers to first
unchosen elements
It is easy to see that phase 2 of the two-phase, multiway merge-sort algorithm performs only θ
(n) I/O operations and this is also the running time of the whole algorithm. In spite of this, the
algorithm has a limitation-it can not sort very large fi les.
If phase 1 of the algorithm produces more than m – 1 runs (N/M > m – 1), all runs can not be
merged in one go in phase 2, because each run requires a one-page input buffer in main-memory
and one page of main-memory is reserved for the output buffer. How large should the file be for
this to happen?
Multiway Merge Sort of Very Large Files
Sometimes there may be a need to sort extremely large files or there is only a small amount of
available main memory. As described in the previous section, two-phase, multiway merge sort
may not work in such situations.
A natural way to extend the two-phase, multiway merge sort for files of any size is to do not
one but many iterations in phase 2 of the algorithm. That is, we employ the external memory
merge-sort algorithm from Section 3, but instead of using TWOWAY-MERGE, we use the
multiway merging (as described in the previous section) to merge m – 1 runs from file Y into one
run in file X. Then, in each iteration of the main loop of phase 2, we reduce the number of runs
by a factor of m – 1.
What is the running time of this algorithm, which we call simply multiway merge sort. Phase 1
and each iteration of the main loop of phase 2 takes θ(n) I/O operations. After phase 1, we start
up with [N/M] = [n/m] runs, each iteration of the main loop of phase 2 reduces the number of
runs by a factor of m – 1, and we stop when we have just one run. Thus, there are log m–1 (n/m)
iterations of the main loop of phase 2. Therefore, the total running time of the algorithm is
θ(n log (n/m)) = θ (n log n – n log m) = θ (n log n – n) = θ (n log n).
m–1 m m m m
Remember that the cost of the external-memory merge-sort algorithm from Section 3 is
θ(n log (n/m)). Thus, multiway merge sort is faster by a factor of (1 – 1 /log n)log m). Actually,
2 m 2
LOVELY PROFESSIONAL UNIVERSITY 249