External Sorting
Merge Sort, Replacement Selection

Overview

Structure:
1. What is “External Sorting”?
2. How does “Merge Sort” work?
   • Balanced n-way-merging
   • Improvements
3. What are the advantages of a “Selection Tree”?
4. What is “Replacement Selection”?
   • Snow-plow example
   • Improvements
5. Applicability and efficiency

1. Principle

Conventional sort algorithms (e.g. Quick Sort, Heap Sort, Selection Sort, ...): very efficient, but all data needs to fit completely into main memory.

External sorting: performing sorting operations on amounts of data that are too large to fit into main memory. External sorting cannot be done in one step.

Multiple steps:
1. Split the data into pieces that fit into main memory
2. Sort the pieces with conventional sorting algorithms
3. Merge those so-called runs and build the completely sorted data set

→ Internal Sorting and External Merging

2. Merge Sort

Principle: Internal Sorting and External Merging

[Figure: pieces of the source data are read from the source hard disk, sorted in main memory, and written to the working hard disk as initial runs.]

2.1 Balanced n-way-merge

Step 1:

Unsorted data set:
535 288 351 354 412 198 451 852 291 448 898 165 217 366 756 665

In this example, four elements at a time fit into main memory.

Creation of initial runs:
Piece read into memory     Sorted initial run
535 288 351 354            RUN1: 288 351 354 535
412 198 451 852            RUN2: 198 412 451 852
291 448 898 165            RUN3: 165 291 448 898
217 366 756 665            RUN4: 217 366 665 756

Step 2:

Initial runs:
RUN1: 288 351 354 535
RUN2: 198 412 451 852
RUN3: 165 291 448 898
RUN4: 217 366 665 756

Merging of initial runs:
RUN1 + RUN2 → RUN5: 198 288 351 354 412 451 535 852
RUN3 + RUN4 → RUN6: 165 217 291 366 448 665 756 898

Step 3:

Merged runs:
RUN5: 198 288 351 354 412 451 535 852
RUN6: 165 217 291 366 448 665 756 898

Re-merging:
RUN5 + RUN6 → FINAL RUN: 165 198 217 288 291 351 354 366 412 448 451 535 665 756 852 898

Result: After two merge passes, the formerly unsorted set is completely sorted and the merge sort is complete.

Explanation:

This procedure is called balanced 2-way-merging.
• Balanced: source space as well as working space is required
• 2-way: out of 2 merged runs, one new run is formed

Example:

The merging procedure can of course be applied to more than two runs at a time. It is then termed n-way-merge or multiway merge. A balanced 3-way merge would be implemented as follows:

RUN1, RUN2, RUN3  → RUN1~ (1-3)
RUN4, RUN5, RUN6  → RUN2~ (4-6)
RUN7, RUN8        → RUN3~ (7-8)
RUN1~, RUN2~, RUN3~ → RUN1~ (1-8)

2.2 Sophisticated n-way-merge

Optimizations:
• Algorithms like polyphase merge and cascade merge
• Reducing the number of intermediate steps by implementing n-way-merging with large values of n
• Maximizing speed by increasing the number of storage drives to minimize access time
• Saving time by distributing the runs optimally across the storage media

Disadvantages:
• Additional cost and effort

Example:

Significant speed increase by storing all runs on different drives for minimal access time:

Drive 1: RUN1
Drive 2: RUN2
Drive 3: RUN3
Drive 4: RUN4
Drive 5: RUN5
Drive 6: RUN1~ (1-5), the merged result

3. Selection Tree

Problem:

Selecting the smallest element is very time-consuming. With p runs, a naive algorithm needs p - 1 comparisons for every element it selects:

RUN1: 217 351 354 535
RUN2: 198 412 451 852
RUN3: 165 291 448 898
RUN4: 288 366 665 756

The first element (217) is compared in turn with the first elements of all remaining p - 1 runs.

Solution:

Building a selection tree saves many comparisons and speeds up the selection process: only about log2 p comparisons are then necessary per selected element.

• The smallest element is always taken from the top of the tree
• New elements are pulled forward in the current branch
• This repeats until all branches of the selection tree are empty

Start: building the selection tree
  RUN1: 288 351 354 535
  RUN2: 198 412 451 852     smaller of RUN1/RUN2: 198
  RUN3: 165 291 448 898
  RUN4: 217 366 665 756     smaller of RUN3/RUN4: 165
  Top of the tree: 165
  Output so far: (none)

Step 1: pulling smallest elements forward
  RUN1: 288 351 354 535
  RUN2: 198 412 451 852     smaller of RUN1/RUN2: 198
  RUN3: 291 448 898
  RUN4: 217 366 665 756     smaller of RUN3/RUN4: 217
  Top of the tree: 198
  Output so far: 165

Step 3: pulling smallest elements forward
  RUN1: 288 351 354 535
  RUN2: 412 451 852         smaller of RUN1/RUN2: 288
  RUN3: 291 448 898
  RUN4: 366 665 756         smaller of RUN3/RUN4: 291
  Top of the tree: 288
  Output so far: 165 198 217

Step 5: pulling smallest elements forward
  RUN1: 351 354 535
  RUN2: 412 451 852         smaller of RUN1/RUN2: 351
  RUN3: 448 898
  RUN4: 366 665 756         smaller of RUN3/RUN4: 366
  Top of the tree: 351
  Output so far: 165 198 217 288 291

4.1 Replacement selection

It is most efficient to keep the number of initial runs very low.
→ The runs have to be as long as possible.

Conventional run creation: the maximum size of a run is limited by the available size of main memory.

Modification: records are replaced in memory while a run is being written, so that runs can become longer than the available memory. This technique is called replacement selection.

Example of a replacement selection sequence (four elements at a time fit into main memory; tagged records are shown in parentheses):

Values in memory           Written to run
12    42    2     21       2
12    42    73    21       12
(5)   42    73    21       21
(5)   42    73    39       39
(5)   42    73    (17)     42
(5)   (18)  73    (17)     73
(5)   (18)  (11)  (17)     (End)

Length of run: 6
Available memory: 4
→ size of run > size of memory

What happened:
1. The smallest record in memory is written to the run
2. Right after that, a new record is loaded into its position in memory
3. If this new record is smaller than the last element of the current run, it is tagged, because it cannot be used in this run
4. Records keep being replaced in memory in this way, so the run becomes longer than the available memory

Result:
• Long runs, especially when the data is presorted
• Statistically, the run length levels off at 2 * size of memory
• In practice, runs tend to contain even more records, because data in commercial applications is very often presorted

4.2 Replacement selection

Demonstration:

There is a well-known way to illustrate why initial runs of length 2 * q can be expected when q is the size of main memory: a snowplow is clearing a circular road on which snow is falling evenly all over.

Because snow is falling at a constant rate, this stable situation will never change:
• The rectangle is cut in half by the line representing the current snow level
• The volume of snow removed in one lap (namely the length of a run) is twice the amount that is present on the track at any time
• The level of existing snow represents the records in main memory
• At the end of the road, there is no snow left from the previous lap
• All records remaining from the last run are tagged with the marker, so a new run has to be created

5. Applicability and efficiency

Most popular algorithms:
1. Internal sorting: creates short runs with a constant maximum length equal to the size of main memory
2. Replacement selection: the most widely used, creates long runs
As well as:
3. Delayed reconstitution of the runs
4. Replacement selection with natural selection

Today, the speed and efficiency of external sorting depend less on the algorithm than on the hardware used.

6. Conclusion

Speed:
• Cannot compete by far with the speed of internal sort algorithms

Intention:
• Minimize accesses to slow external media
• Provide a suitable and affordable solution

Advantage:
• In practice, data records are often presorted in some way. In this case, replacement selection can produce extremely long runs

Development:
• Increase of speed because of more sophisticated algorithms
• Increase of speed because of much faster external hardware
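
The replacement selection steps from section 4.1 and the merging from section 2.1 can be tried out with a short Python sketch. It is only an illustration under stated assumptions, not the implementation behind the slides: the function names replacement_selection and merge_runs are made up for this example, the records are plain integers held in a list instead of being read from disk, and a binary heap from Python's heapq module stands in for the selection tree.

```python
import heapq
from itertools import islice
from typing import Iterable, Iterator, List

_END = object()  # sentinel marking the end of the input


def replacement_selection(records: Iterable[int], memory_size: int) -> Iterator[List[int]]:
    """Yield sorted initial runs while holding at most memory_size records."""
    source = iter(records)
    # Heap entries are (run_number, value). A "tagged" record gets the number
    # of the next run, so it sorts behind every record of the current run.
    heap = [(0, value) for value in islice(source, memory_size)]
    heapq.heapify(heap)
    current_run, run = 0, []
    while heap:
        run_number, value = heapq.heappop(heap)
        if run_number != current_run:
            # Only tagged records are left in memory: close the current run.
            yield run
            current_run, run = run_number, []
        run.append(value)  # step 1: smallest record goes to the run
        new = next(source, _END)  # step 2: load a new record in its place
        if new is not _END:
            # Step 3: tag the record if it is smaller than the last output.
            target_run = current_run + 1 if new < value else current_run
            heapq.heappush(heap, (target_run, new))
    if run:
        yield run


def merge_runs(runs: Iterable[List[int]]) -> List[int]:
    """Merge all sorted runs in one n-way merge (heapq.merge uses a heap)."""
    return list(heapq.merge(*runs))


if __name__ == "__main__":
    # Input order reconstructed from the memory table in section 4.1.
    slide_input = [12, 42, 2, 21, 73, 5, 39, 17, 18, 11]
    runs = list(replacement_selection(slide_input, memory_size=4))
    print(runs)              # [[2, 12, 21, 39, 42, 73], [5, 11, 17, 18]]
    print(merge_runs(runs))  # [2, 5, 11, 12, 17, 18, 21, 39, 42, 73]
```

With a memory of four records, the first run has six elements, matching the example in section 4.1. Storing the run number next to each value is one possible way to model the tagging. A real external sort would stream records from and to disk rather than hold the runs in lists.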