ADFS Study Material Unit-3

US03CBCA03 (Advanced Data & File Structure)
Unit - III
CHARUTAR VIDYA MANDAL’S
SEMCOM
Vallabh Vidyanagar
Faculty Name: Ami D. Trivedi
Class: SYBCA
Subject: US03CBCA03 (Advanced Data & File Structure)
UNIT – 3 (SORTING AND SEARCHING)
INTRODUCTION TO SORTING
Sorting is the task of rearranging data into an order such as ascending, descending or lexicographic.
Sorting also means rearranging a set of records based on their key values when the records are
stored in a file.
Sorting is a fundamental operation in Computer Science. It is performed frequently in business
data processing applications and has also become important in many scientific applications.
Humans can perform a sorting task naturally. However, a computer program has to follow a
sequence of exact instructions to do sorting. This sequence of instructions is called an algorithm.
A sorting algorithm is a method that can be used to put a list of unordered items into an ordered
sequence.
Various sorting algorithms exist, and they differ in terms of their efficiency and performance.
Some important and well-known sorting algorithms are bubble sort, selection sort, insertion sort
and quick sort.
Sorting techniques can be classified into two broad categories: Internal sorting and External
sorting.
Internal sort: When a set of data is small enough that the entire sort can be performed in the
computer’s internal storage (primary memory), the sorting is called an internal sort.
External sort: Sorting of a large set of data that is stored in the computer’s low speed external
memory (such as hard disk, magnetic tape, etc.) is called an external sort. It involves a large amount
of data transfer between external memory (low speed) and main memory (high speed).
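The external sort idea can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the function name external_sort, the parameter chunk_size and the use of temporary text files are hypothetical choices made for this example. Each chunk that fits in main memory is sorted internally and written to external storage as a sorted run, and the runs are then merged with the standard library function heapq.merge.

```python
import heapq
import tempfile


def external_sort(numbers, chunk_size):
    """Sort integers using sorted runs on disk plus a k-way merge (sketch)."""
    runs = []    # one temporary file per sorted run
    chunk = []   # the part of the data currently held in main memory

    def write_run():
        # Internal sort of one chunk, then store it in external memory.
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(f"{value}\n" for value in sorted(chunk))
        run.seek(0)
        runs.append(run)
        chunk.clear()

    for value in numbers:
        chunk.append(value)
        if len(chunk) == chunk_size:
            write_run()
    if chunk:
        write_run()

    # Merge the runs; heapq.merge reads each run lazily, line by line.
    # The result is collected in a list only to keep the example short.
    merged = list(heapq.merge(*[(int(line) for line in run) for run in runs]))
    for run in runs:
        run.close()
    return merged
```

A real external sort would write the merged output back to a file instead of returning a list; the list is used here only so that the result is easy to inspect.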
Many sorting techniques have been developed, for example:
Bubble sort
Selection sort
Merge sort
Insertion sort
Quick sort
Heap sort
Shuttle sort
Radix sort
Bucket sort
Flash sort
Address calculation sort
Partition exchange sort
Two way merge sort
Shell sort / Comb sort
Simple pancake sort
Spaghetti (Poll) sort
Distribution sort
Tournament sort
LSD Radix sort
MSD Radix sort
Postman sort
Mineral sort
Shaker sort
Timsort
Introsort
Cycle sort
Library sort
Strand sort
Smoothsort
Bogosort
Counting sort
Cocktail sort
Gnome sort
Pigeonhole sort
Spread sort
Burst sort
Stooge sort
Sample sort
Odd-even sort
Bead sort
A programmer has to choose from a variety of sorting methods. Three points that should
affect a programmer’s decision are:
1. Programming time
2. Execution time of program
3. Memory or secondary memory space required
BASIC SORTING TECHNIQUES
1. BUBBLE SORT
Algorithm: BUBBLE_SORT (K, N)
This algorithm sorts elements into ascending order.
K      Vector (Array)
N      Number of elements in a vector
PASS   Pass counter
LAST   Position of last unsorted element
I      Index (subscript) used for vector elements
EXCHS  Used to count number of exchanges done in any pass

1. [ Initialize ]
   LAST ← N (entire list assumed unsorted at this point)
2. [ Loop on pass index ]
   Repeat thru step 5 for PASS = 1, 2, ..., N-1
3. [ Initialize exchanges counter for this pass ]
   EXCHS ← 0
4. [ Perform pair wise comparisons on unsorted elements ]
   Repeat for I = 1, 2, ..., LAST-1
      If K [ I ] > K [ I + 1 ] then
         K [ I ] ↔ K [ I + 1 ]
         EXCHS ← EXCHS + 1
5. [ Were any exchanges made on this pass ? ]
   If EXCHS = 0 then
      Return (mission accomplished; return early)
   else
      LAST ← LAST – 1 (reduce size of unsorted list)
6. [ Finished ]
   Return (maximum number of passes required)
Explanation
Bubble sort is a well-known sorting method. In this algorithm, at most n-1 passes (rounds) are
required. Here, n is the number of elements.
During the 1st pass (round), elements K1 and K2 are compared. If K1 is greater than K2 then they are
interchanged (swapped). This process is repeated for K2 and K3, K3 and K4 and so on.
This method makes small values move up gradually like a bubble, while the largest value of each
pass sinks to the end of the unsorted part. After the 1st pass, the largest value will be at the nth position.
In the following passes, the next largest elements are placed at positions n-1, n-2, ..., 2
respectively, which leaves the smallest element at position 1.
After each pass, a check is made to find out whether any interchanges (exchanges) were made
in that pass. If no interchanges were required, the data is sorted and no further pass is
required.
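The passes described above can be sketched in Python as follows. This is a minimal sketch, not part of the original algorithm statement: it uses 0-based list indices instead of the 1-based vector K, and the function name bubble_sort and its return value (the number of passes made) are assumptions chosen for illustration.

```python
def bubble_sort(k):
    """Sort list k in place into ascending order; return the number of passes made."""
    last = len(k) - 1              # index of the last unsorted element (0-based)
    passes = 0
    for _ in range(len(k) - 1):    # at most n-1 passes
        exchanges = 0
        passes += 1
        for i in range(last):      # compare k[i] with k[i+1] in the unsorted part
            if k[i] > k[i + 1]:
                k[i], k[i + 1] = k[i + 1], k[i]
                exchanges += 1
        if exchanges == 0:         # no exchanges: the list is already sorted
            break
        last -= 1                  # the largest element is now in its final place
    return passes
```

For the array 5 1 4 2 8 this sketch makes three passes; for data that is already sorted it stops after a single pass.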
Step-by-step example
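Let us take the array of numbers "5 1 4 2 8", and sort the array from lowest number to greatest
number using bubble sort. In each step, the pair of elements being compared is shown in square
brackets. Three passes will be required.
First Pass:
( [5 1] 4 2 8 ) → ( 1 5 4 2 8 )   Compare the first two elements and swap them since 5 > 1
( 1 [5 4] 2 8 ) → ( 1 4 5 2 8 )   Swap since 5 > 4
( 1 4 [5 2] 8 ) → ( 1 4 2 5 8 )   Swap since 5 > 2
( 1 4 2 [5 8] ) → ( 1 4 2 5 8 )   These elements are already in order (8 > 5), algorithm
                                  does not swap them.
Second Pass:
( [1 4] 2 5 8 ) → ( 1 4 2 5 8 )
( 1 [4 2] 5 8 ) → ( 1 2 4 5 8 )   Swap since 4 > 2
( 1 2 [4 5] 8 ) → ( 1 2 4 5 8 )
( 1 2 4 [5 8] ) → ( 1 2 4 5 8 )
Third Pass:
( [1 2] 4 5 8 ) → ( 1 2 4 5 8 )
( 1 [2 4] 5 8 ) → ( 1 2 4 5 8 )
( 1 2 [4 5] 8 ) → ( 1 2 4 5 8 )
( 1 2 4 [5 8] ) → ( 1 2 4 5 8 )
No exchanges are made in the third pass, so the array is sorted and the algorithm stops.
Note: this trace compares all four adjacent pairs in every pass. BUBBLE_SORT (K, N) reduces LAST
after each pass, so it skips the comparisons that involve elements already in their final
positions; the passes and exchanges are otherwise the same.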
Advantages of Bubble sort
1. Easy to understand.
2. Easy to implement.
3. Works well for almost sorted data, because it can stop as soon as a pass makes no exchanges.
Disadvantages of Bubble sort
Large amount of data movement required if data is in random order or reverse sorted order.
2. SELECTION SORT
Algorithm: SELECTION_SORT (K, N)
This algorithm sorts elements into ascending order.
K          Vector (Array)
N          Number of elements in a vector
PASS       Pass counter and position of first element in the vector which will
           be checked during a particular pass
MIN_INDEX  Position of smallest element found so far in a particular pass
I          Index (subscript) used for vector elements

1. [ Loop on pass index ]
   Repeat thru step 4 for PASS = 1, 2, ..., N-1
2. [ Initialize minimum index ]
   MIN_INDEX ← PASS
3. [ Make a pass and obtain element with smallest value ]
   Repeat for I = PASS + 1, PASS + 2, ..., N
      If K [ I ] < K [ MIN_INDEX ] then
         MIN_INDEX ← I
4. [ Exchange elements ]
   If MIN_INDEX ≠ PASS then
      K [ PASS ] ↔ K [ MIN_INDEX ]
5. [ Finished ]
   Return
Step-by-step example
[Figure: step-by-step example of selection sort]
Explanation
Selection sort is one of the easiest ways to sort. In this algorithm, n-1 passes (rounds) are required. Here, n is
the number of elements.
In the 1st pass we begin with the first element K1 in the list, considering it as the minimum. The position of the first
element is remembered as the minimum position. Then K2 is compared with the element at the minimum
position. If K2 is smaller, the position of K2 is remembered as the minimum position. This
process is repeated for K3, K4 and so on.
At the end of the 1st pass, we have the position of the smallest element in the list. The element at this
position is interchanged with the 1st element. Note that no interchange is
required if the minimum position has not changed, which means that K1 is already the minimum.
In the 2nd pass we begin with the second element K2 in the list, considering it as the minimum, and repeat
the above procedure. After n-1 passes, we get a sorted array.
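The procedure above can be sketched in Python as follows. This is a minimal sketch under assumptions: it uses 0-based list indices instead of the 1-based vector K, and the function name selection_sort and its return value (the number of interchanges performed) are chosen for illustration only.

```python
def selection_sort(k):
    """Sort list k in place into ascending order; return the number of interchanges."""
    interchanges = 0
    for p in range(len(k) - 1):            # n-1 passes
        min_index = p                      # assume the first unsorted element is the minimum
        for i in range(p + 1, len(k)):     # look for a smaller element in the rest
            if k[i] < k[min_index]:
                min_index = i
        if min_index != p:                 # interchange only when the minimum moved
            k[p], k[min_index] = k[min_index], k[p]
            interchanges += 1
    return interchanges
```

For the list 5 1 4 2 8 the sketch performs only two interchanges (5 with 1, then 5 with 2), which illustrates that each pass needs at most one interchange.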
Advantages of Selection sort
1. Easy to understand.
2. Easy to implement.
3. Generally faster than bubble sort on unordered data, because each pass requires at most one
   interchange of data.
4. It performs well on a small list.
Disadvantages of Selection sort
Takes more time than bubble sort for almost sorted data because it needs all passes to be
performed.
3. MERGE SORT
Algorithm: SIMPLE_MERGE (K, FIRST, SECOND, THIRD)
This algorithm merges two ordered vectors stored in K into a single vector in ascending order.
K       Vector (Array) contains two ordered arrays
TEMP    Temporary vector
FIRST   Position of first element of First vector in K vector
SECOND  Position of first element of Second vector in K vector
THIRD   Position of last element of Second vector in K vector
I       Index (subscript) used for first vector elements
J       Index (subscript) used for second vector elements
L       Index (subscript) used for TEMP vector elements

1. [ Initialize ]
   I ← FIRST
   J ← SECOND
   L ← 0
2. [ Compare corresponding elements and output the smallest ]
   Repeat while I < SECOND and J ≤ THIRD
      If K [ I ] ≤ K [ J ] then
         L ← L + 1
         TEMP [ L ] ← K [ I ]
         I ← I + 1
      Else
         L ← L + 1
         TEMP [ L ] ← K [ J ]
         J ← J + 1
3. [ Copy the remaining unprocessed elements in output area ]
   If I ≥ SECOND then
      Repeat while J ≤ THIRD
         L ← L + 1
         TEMP [ L ] ← K [ J ]
         J ← J + 1
   Else
      Repeat while I < SECOND
         L ← L + 1
         TEMP [ L ] ← K [ I ]
         I ← I + 1
4. [ Copy elements of temporary vector into original area ]
   Repeat for I = 1, 2, ..., L
      K [ FIRST – 1 + I ] ← TEMP [ I ]
5. [ Finished ]
   Return
Explanation
Sorting is closely related to merging. This algorithm merges two sorted vectors into a single
sorted vector.
This is done by repeatedly selecting the smaller of the current front items of the two vectors and
placing it in a new vector.
Here, two sorted vectors are stored in a common vector K as follows.
K:   11      23      42      9        25
     FIRST                   SECOND   THIRD
The elements of the first vector are stored from position FIRST to SECOND-1 and the elements of
the second vector are stored from position SECOND to THIRD.
A loop is executed to compare the elements of the two vectors. First, the 1st element of the first
vector is compared with the 1st element of the second vector.
The smaller of the two is copied to the temporary vector. The subscript of the vector whose element was
copied is incremented by one to point to the next element of the same vector. This loop
terminates when it reaches the end of one of the vectors.
Then the rest of the elements of the remaining vector are copied to the temporary vector. Finally,
the sorted elements from the temporary vector are copied back to the original vector.
Note: This algorithm can be generalized to merge k sorted tables into a single sorted table. Such
a merging operation is called multiple merging or k-way merging.
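The merging steps above can be sketched in Python as follows. This is a minimal sketch under assumptions: positions are 0-based rather than 1-based, the temporary vector is an ordinary Python list, and the function name simple_merge mirrors the algorithm name for illustration only.

```python
def simple_merge(k, first, second, third):
    """Merge sorted runs k[first:second] and k[second:third + 1] in place (0-based)."""
    temp = []
    i, j = first, second
    # Compare the front elements of both runs and copy the smaller one.
    while i < second and j <= third:
        if k[i] <= k[j]:
            temp.append(k[i])
            i += 1
        else:
            temp.append(k[j])
            j += 1
    # Copy whatever is left; at most one of these two slices is non-empty.
    temp.extend(k[i:second])
    temp.extend(k[j:third + 1])
    # Copy the merged elements back into the original area.
    k[first:third + 1] = temp
```

With K = 11 23 42 9 25, the call simple_merge(k, 0, 3, 4) leaves k as 9 11 23 25 42. For k-way merging of several sorted sequences, Python's standard library provides heapq.merge.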
Step-by-step example
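K:   11   23   42   9   25      (first vector: 11 23 42, second vector: 9 25)

Step   K [ I ]   K [ J ]   Element copied to TEMP     TEMP after the step
1      11        9         9  (J moves to 25)         9
2      11        25        11 (I moves to 23)         9 11
3      23        25        23 (I moves to 42)         9 11 23
4      42        25        25 (second vector ends)    9 11 23 25
5      42        -         42 (remaining element)     9 11 23 25 42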
Advantages of Merge sort
Easy to merge already sorted lists into a new sorted list with merge sort.
Disadvantages of Merge sort
Merge sort requires extra storage space for temporary vector.
APPLICATION OF SORTING
Sorting algorithms are essential in a broad variety of applications.
1. Commercial computing
Government organizations, financial institutions, and commercial enterprises organize
much of their information by sorting it.
Accounts are sorted by name or number, transactions by time or place, mail by postal
code or address, and files by name or date. Processing such data requires the use of
a sorting algorithm.
2. Search for information
Keeping data in sorted order makes it possible to efficiently search through it using the
classic binary search algorithm. Speeding up searching is perhaps the most important
application of sorting.
3. Operations research
We can arrange jobs as per increasing order of processing time to complete jobs in such a
way that it maximizes customer satisfaction by minimizing the average completion time of
the jobs.
Suppose that we have N jobs to complete, where job j requires t_j seconds of
processing time. We need to complete all of the jobs, but want to maximize
customer satisfaction by minimizing the average completion time of the jobs. We
schedule jobs in increasing order of processing time as per “shortest processing
time first” rule to accomplish this goal.
4. Event-driven simulation
Many scientific applications involve simulation, which models some aspect of the real world to
understand it in a better way. In event-driven simulation, pending events are kept sorted by
event time to save the time required to find the next event.
Take the example of a bank simulator. We are given the number of cashiers, the arrival time of
each customer and the time required to serve each customer. Our goal is to design a
simulator that will tell us how long each customer waits in line. Start a simulation
clock at 0 ticks. At each iteration, advance the clock to the time of the next event. Pending
events are organized as a priority queue, sorted by event time.
5. Numerical computations
Scientific computing is often concerned with accuracy (how close are we to the true
answer?). Some numerical algorithms use priority queues and sorting to control accuracy in
calculations.
Accuracy is extremely important when we are performing millions of computations
with estimated values such as the floating-point representation of real numbers
that we commonly use on computers.
For example, one way to do numerical integration (where the goal is to estimate the
area under a curve) is to maintain a priority queue with accuracy estimates for a set of
subintervals that comprise the whole interval. The process is to remove the
least accurate subinterval, split it in half (thus achieving better accuracy for the two
halves) and put the two halves back onto the priority queue, continuing until the
desired tolerance is reached.
6. String processing
String processing algorithms are often based on sorting. For example, one algorithm for
finding the longest repeated substring in a given string is based on first sorting the suffixes
of the string.
7. Records with multiple keys
Sorting is done on different keys as per the requirement. E.g. transaction data can be sorted
on customer number or on date.
In typical applications, records have multiple data members that might need to
serve as sort keys. For example, one client may need to sort the transaction list by
account number; another client might need to sort the list by place; and other
clients might need to use other fields as sort keys.
8. Display Google PageRank results
PageRank results indicate the importance of a page. Important pages have a high
PageRank. Arranging pages according to their PageRank helps to find reputable
pages.
PageRank is an algorithm that calculates a web metric which shows
how reputable a particular page is according to Google. This rank depends not
only on the quality and the quantity of the incoming links but also on several other
parameters such as the number of outgoing links per page (on the linking
webpage), the position/visibility of the links and more.
9. Find the median
The median of a finite list of numbers can be found by arranging all the
observations from lowest value to highest value and picking the middle one. If
there is an even number of observations, then there is no single middle value; the
median is then usually defined to be the mean of the two middle values.
10. Frequency distribution and find the mode
The mode of a set of n items is the element that occurs the largest number of
times.
A frequency distribution displays the number of observations within a given interval.
For both of the above tasks, sort the items and do a linear scan.
11. Find the closest pair
Given n numbers, the closest pair is the pair of numbers with the smallest difference between them. Once
the numbers are sorted, the closest pair will be next to each other in sorted order. So a
linear scan will quickly complete the task of finding the closest pair in sorted data.
12. Identify statistical outliers
In statistics, an outlier is an observation that is numerically distant from the rest of the data.
To find outliers, the first step is to sort the data.
13. Find duplicates in a mailing list
14. Organize an MP3 library
15. Element uniqueness
Given a set of n items, are they all unique or are there any duplicates? To check this, sort
items and do a linear scan to check all adjacent pairs.
16. Stability
A stable sorting method keeps records with equal keys in their original relative order after the sort.
A sorting method is stable if it preserves the relative order of equal keys in the
array. For example, suppose, in our internet commerce application, that we enter
transactions into an array as they arrive, so they are in order of the time field in the
array. Now suppose that the application requires that the transactions be
separated out by location for further processing. One easy way to do so is to sort
the array by location. If the sort is unstable, the transactions for each city may not
necessarily be in order by time after the sort. Some of the sorting methods are
stable (insertion sort and mergesort); many are not (selection sort, shellsort,
quicksort, and heapsort).
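Several of the applications above (shortest processing time first, median, mode, closest pair, element uniqueness) follow the same pattern: sort first, then do one linear scan. The sketch below illustrates that pattern in Python, together with a small demonstration of stability. All function names and the sample transactions are hypothetical and chosen only for this illustration; the built-in sorted function is used for the sorting step, and it is a stable sort.

```python
from itertools import groupby


def average_completion_time(times):
    """Average completion time when jobs run in 'shortest processing time first' order."""
    finished, total = 0, 0
    for t in sorted(times):
        finished += t              # moment at which this job completes
        total += finished
    return total / len(times)


def median(values):
    """Middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def mode(values):
    """Most frequent item: equal items are adjacent after sorting."""
    best_item, best_count = None, 0
    for item, group in groupby(sorted(values)):
        count = len(list(group))
        if count > best_count:
            best_item, best_count = item, count
    return best_item


def closest_pair(numbers):
    """Pair of neighbours in sorted order with the smallest difference (needs n >= 2)."""
    ordered = sorted(numbers)
    return min(zip(ordered, ordered[1:]), key=lambda pair: pair[1] - pair[0])


def all_unique(items):
    """True if no two adjacent items in sorted order are equal."""
    ordered = sorted(items)
    return all(a != b for a, b in zip(ordered, ordered[1:]))


# Stability: the (hypothetical) transactions arrive in time order. After a stable
# sort by city, the transactions of each city are still in time order.
transactions = [("09:00", "Anand"), ("09:05", "Baroda"),
                ("09:10", "Anand"), ("09:20", "Baroda")]
by_city = sorted(transactions, key=lambda t: t[1])
```

After the last statement, by_city lists both Anand transactions (09:00, then 09:10) before both Baroda transactions (09:05, then 09:20), so the time order within each city is preserved.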
INTRODUCTION TO SEARCHING
Searching is the process of finding an element within a list of elements. The list of elements may
be stored in sorted order or in random order.
A search algorithm is an algorithm for finding an item with specified properties among
a collection of items.
BASIC SEARCHING TECHNIQUES
Searching is divided into two categories:
1. Linear Search (Sequential Search)
2. Binary Search
1. LINEAR SEARCH
Suppose we want to search for an element in a given unordered list of elements. The simplest technique
is to scan every element in a sequential manner and check whether it is the desired element or not.
A search is unsuccessful if all the elements are accessed and the desired element is not found.
If the desired element is present in the first position then only one comparison is required. If the
desired element is at the last position then n comparisons are required. Here n is the number of
elements.
Linear searching is the basic and simple method of searching.
Algorithm: LINEAR_SEARCH (K, N, X)
This algorithm searches an element from unordered / ordered vector.
K   Vector (Array) consisting of N+1 elements
N   Number of elements in a vector
X   Element to be searched

1. [ Initialize search ]
   I ← 1
   K [ N + 1 ] ← X
2. [ Search the vector ]
   Repeat while K [ I ] ≠ X
      I ← I + 1
3. [ Successful search ? ]
   If I = N + 1 then
      Write (‘UNSUCCESSFUL SEARCH’)
      Return ( 0 )
   else
      Write (‘SUCCESSFUL SEARCH’)
      Return ( I )
Explanation
In the first step, we store the element to be searched at position n+1 of the array. A sequential search is
then performed on the n+1 elements.
A loop is executed to compare the array elements with X (the element to be searched).
The loop terminates when the desired element is found. At this time I contains the position of the desired
element in the array.
If the element is not found up to the nth position then the element at position n+1 (which is X) will match
X, and the loop terminates. This time I contains the value n+1.
After the loop, we check I: if I is equal to n+1, it is an unsuccessful search.
Otherwise it is a successful search.
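The search with an extra element at position n+1 (often called a sentinel) can be sketched in Python as follows. This is a minimal sketch under assumptions: the list is indexed from 0 internally but the function returns a 1-based position, as LINEAR_SEARCH does, and the sentinel is appended to the list and removed again before returning. The function name linear_search is chosen for illustration.

```python
def linear_search(k, x):
    """Return the 1-based position of x in list k, or 0 if x is absent."""
    n = len(k)
    k.append(x)              # sentinel at position n+1 guarantees the loop stops
    i = 0
    while k[i] != x:         # no separate end-of-list test is needed
        i += 1
    k.pop()                  # remove the sentinel again
    return 0 if i == n else i + 1
```

Because the sentinel always matches, the loop needs only one test per element instead of two (element match and end of list).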
Step-by-step example
[Figure: linear search for key = 05]
Advantages of Linear search
1. Linear searching is the basic and simple method of searching.
2. Easy to implement.
3. Useful for searching an element in an unordered or ordered list.
Disadvantages of Linear search
Linear search is time consuming for large lists, because up to n comparisons may be required.
2. BINARY SEARCH
Binary search is a very efficient algorithm. This search technique finds the given item in
very few comparisons, because each comparison halves the part of the array that remains to be searched.
The array elements must be sorted in increasing order to perform binary search. Binary search
takes less time than linear search to find an element in a sorted list of elements,
so the binary search method is more efficient than linear search.
Updating an ordered array due to insertions or deletions is a time consuming task. So, binary search is
not useful when the array changes often.
Algorithm: BINARY_SEARCH (K, N, X)
This algorithm searches an element from an ordered vector.
K       Vector (Array)
N       Number of elements in a vector
X       Element to be searched
LOW     Lower limit of search interval
MIDDLE  Middle limit of search interval
HIGH    Upper limit of search interval

1. [ Initialize ]
   LOW ← 1
   HIGH ← N
2. [ Perform search ]
   Repeat thru step 4 while LOW ≤ HIGH
3. [ Obtain index of midpoint of interval ]
   MIDDLE ← ⌊ ( LOW + HIGH ) / 2 ⌋
4. [ Compare ]
   If X < K [ MIDDLE ] then
      HIGH ← MIDDLE – 1
   else if X > K [ MIDDLE ] then
      LOW ← MIDDLE + 1
   else
      Write (‘SUCCESSFUL SEARCH’)
      Return ( MIDDLE )
5. [ Unsuccessful search ]
   Write (‘UNSUCCESSFUL SEARCH’)
   Return ( 0 )
Step-by-step example
[Figures: binary search in case of successful search; binary search in case of unsuccessful search]
Explanation
The logic of this technique is:
1. First find the middle element of the array.
2. Compare the middle element with X.
3. There are 3 possibilities:
a. It is the desired element, so the search is successful.
b. If X is less than the middle element then search only the first half of the array.
c. If X is greater than the middle element then search only the second half of the array.
For case b, the new search area will be lower limit to middle-1. For case c, the new search area will be
middle+1 to upper limit.
Repeat the same steps until the element is found or the search area becomes empty (unsuccessful search).
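The logic above can be sketched in Python as follows. This is a minimal sketch under assumptions: the list is indexed from 0 internally but the function returns a 1-based position (0 for an unsuccessful search), as BINARY_SEARCH does, and the input list is assumed to be sorted in increasing order. The function name binary_search is chosen for illustration.

```python
def binary_search(k, x):
    """Return the 1-based position of x in sorted list k, or 0 if x is absent."""
    low, high = 0, len(k) - 1
    while low <= high:
        middle = (low + high) // 2     # index of the midpoint of the interval
        if x < k[middle]:
            high = middle - 1          # continue in the first half
        elif x > k[middle]:
            low = middle + 1           # continue in the second half
        else:
            return middle + 1          # successful search
    return 0                           # unsuccessful search
```

Note that x is compared with the element k[middle], not with the index middle itself.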
Advantages of Binary search
1. Binary search is a very efficient algorithm.
2. Requires fewer comparisons than linear search.
Disadvantages of Binary search
1. Binary search is not useful when the array elements are frequently changed.
2. Array must be sorted to perform binary search.
APPLICATION OF SEARCHING
1. Search algorithms can be used to find solutions or objects with specified properties and
constraints in a large solution search space or among a collection of objects.
2. There are search algorithms which are designed for the prospective quantum computer.
A quantum computer is a device that uses quantum mechanical phenomena
(quantum physics) to perform operations on data.
3. In text editors, we might want to search through a very large document for the occurrence
of a given string.
4. In text retrieval tools, we may want to search through thousands of such documents.
5. String matching algorithms are used as part of a more complex algorithm (e.g., the Unix program
"diff" that works out the differences between two similar text files).
String matching / string searching algorithms try to find a place where one or
more strings (patterns) are found within a larger string or text.
6. To search in binary strings (i.e., sequences of 0s and 1s). For example, the "pbm" graphics
format is based on sequences of 1s and 0s.
7. Implementing a "switch() ... case:" construct in a virtual machine where the case labels are
individual integers. If you have 100 cases, you can find the correct entry in 6 to 7 steps
using binary search, whereas a sequence of conditional branches takes on average 50
comparisons.
8. Binary search is widely used in 3D games and applications. Space is divided into a
tree structure and a binary search is used to retrieve which subdivisions to display
according to a 3D position and camera.
9. Binary search offers a feature of finding non-exact matches (closest matches).
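The closest-match use of binary search can be sketched in Python as follows. This is a minimal sketch under assumptions: the function name closest_match is hypothetical, the input list is assumed to be non-empty and sorted in increasing order, and a tie between two equally close neighbours is resolved in favour of the smaller one. The binary search itself is done by the standard library function bisect.bisect_left, which returns the position where x would be inserted.

```python
from bisect import bisect_left


def closest_match(sorted_values, x):
    """Return the element of a non-empty sorted list that is nearest to x."""
    pos = bisect_left(sorted_values, x)      # binary search for the insertion point
    if pos == 0:
        return sorted_values[0]              # x is smaller than every element
    if pos == len(sorted_values):
        return sorted_values[-1]             # x is larger than every element
    before, after = sorted_values[pos - 1], sorted_values[pos]
    return before if x - before <= after - x else after
```

Only the two neighbours of the insertion point have to be checked, so the closest match is found with about the same number of comparisons as an exact search. For a table of 100 sorted entries, a binary search needs at most 7 probes.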
SORTING V/S SEARCHING
1. Sorting: The process of arranging the data elements or data records of a data structure in a
   particular order is called Sorting.
   Searching: The process of finding data elements or data records in a data structure is called
   Searching.
2. Sorting: There are various sorting techniques such as selection sort, bubble sort, insertion sort,
   quick sort, merge sort, shell sort.
   Searching: The two basic searching techniques are sequential search and binary search.
3. Sorting: After performing sorting techniques, the positions of data elements or data records
   are changed.
   Searching: After performing searching techniques, the positions of data elements or data
   records are not changed.
4. Sorting: After performing sorting, searching becomes easy.
   Searching: Without performing sorting, searching becomes difficult.
5. Sorting: Output of sorting algorithms is sorted elements.
   Searching: Output of searching algorithm is successful or unsuccessful search.
6. Sorting: If insertions and deletions occur very frequently then sorting is time consuming
   for a large array.
   Searching: Insertion and deletion in an unordered array will only increase / decrease a few
   comparisons in linear search.
Disclaimer
The study material is compiled by Ami D. Trivedi. The basic objective of this material is to
supplement teaching and discussion in the classroom in the subject. Students are required to go
for extra reading in the subject through library work.