Objects

Searching and Sorting
A Summary on Searching
Copyright © 2009-2016 by Curt Hill
The lesson from Neural
Networks
• Neural networks are only used when
there are no algorithms that always
work
• We only use them on hard problems
• In NN there are never any absolute
answers
• Instead each project is different and
we experiment with our options until
we are happy
• We are never sure that this is the best
answer; we only hope that it is
acceptable
Apply the lesson
• We face the same problem when
choosing data structures for
programs
• It is extremely rare for us to know in
advance all the things that would make
the decision easy:
– The frequency or number of insertions,
deletions, lookups in an average run
– The frequency distribution of the key
– The density of the key
– What will be the optimal container class
– How the next revision will change all of
this
Therefore
• We make fuzzy choices based on
incomplete information
• We then become good at spotting
trends that favor one structure over
another
• With that in mind let us come back
and re-examine searching and
sorting
Why consider both at once?
• Our containers will fall into one of three
categories:
– Unordered
– Ordered by key
– Ordered or partially ordered by something
other than a key
• In the first case there is no notion of sorting
• In the other two there is
– In some cases we must sort before we can
start searching
– In both cases an insert must do some type of
partial sort to restore the order
– A delete may also affect the sorted order but
is often easier to correct
Searching Arrays or
Vectors
• Three areas to review here
• Linear
• Binary
• Self-organizing lists
Linear searching
• This is both the best and the worst search
• Advantages
– It is the easiest to code
• Not much more than a for loop
– Does the best for small tables, typically fewer
than 10 entries
– Applicable to lists as well as tables
• Disadvantages
– Finding an item that is present needs about
½N probes on average when every item is
equally likely to be sought
– Finding that an item is not present requires
looking at all N items
– Clearly an O(N) algorithm, which is the worst
for a search
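As a minimal sketch of the "not much more than a for loop" point, assuming a table of int held in a std::vector (the function name linear_search is made up for this illustration):

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the first element equal to key, or -1 if absent.
// A successful search stops early; an unsuccessful one inspects all N items.
long linear_search(const std::vector<int>& table, int key) {
    for (std::size_t i = 0; i < table.size(); ++i) {
        if (table[i] == key) {
            return static_cast<long>(i);
        }
    }
    return -1;  // every item was examined without a match
}
```

The table does not need to be sorted, which is why this remains the default choice early in a project.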
Why use?
• When the advantages outweigh the
disadvantages
• For small tables it is the preferred
choice
• Often chosen early in a project
– If and when performance becomes a
problem then upgrade the search
based on what you now know about the
project
– It may be that the vector is large but
searched infrequently so it is not a
problem
Sequential Search in C
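• There is a sequential search function, lfind
– It is not standard C: the header below is from a
stdlib.h that provides it (_USERENTRY is
Borland-specific), while POSIX systems declare
lfind in search.h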
• It will search an array of items using a
user defined comparison
• The header is:
void* lfind (
const void * key,
const void * base,
size_t * num,
size_t width,
int (_USERENTRY * fcmp)
(const void *,
const void *)
);
Notes
• It uses void * to represent any
pointer
• The key is a pointer to what is being
searched for
• The base is an array, not necessarily
of the same type as the key
– It may contain the key and other stuff
• The array has *num entries (num points to
the count) and each entry is width bytes long
The passed function
• fcmp is a user defined routine to
compare the key with a base item
• Key is the first parameter
• An array entry is the second
• Returns zero for equal and anything
else for not equal
• If the item is found then lfind returns a
pointer to it, and NULL otherwise
Commentary
• Actually figuring out how to use this
thing is probably harder than coding
it from scratch
• The library version may be carefully
tuned machine code
• However, every comparison is a call
through a function pointer, so do not
assume it beats a simple C style loop
without measuring
• There is also in the C libraries:
– A binary search we will see later
– A quick sort routine
Example
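int fcmpe(const void * a, const void *b){
  /* zero means equal, anything else means not equal */
  if(*(const int *)a == *(const int *)b){
    return 0;
  }
  return +1;
}
...
size_t s = tablesize; // number of entries, passed by address
int key;
int * unsorted; // dynamic array
...
int * hit = (int *)lfind(&key, unsorted, &s, sizeof(int), fcmpe);
// hit is NULL when key is not in the array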
Commentary
• The lfind is classic C
• It is not a template function, but it
can be used much like a template
function
• The price of that generality:
– Everything is passed as void * pointers
– The user must specify the count and the width
– The user must supply a function for
comparison
• Then it will work on any array
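To make the void * technique concrete, here is a hand-written sketch of a routine with the same flavor as lfind. It is not the library source: generic_find, cmp_int, cmp_record and Record are names invented for this illustration, and the count is passed by value rather than by pointer.

```cpp
#include <cstddef>

// Hypothetical stand-in for a library routine such as lfind. The caller
// supplies the element count, the element width in bytes, and a comparison.
const void* generic_find(const void* key, const void* base,
                         std::size_t num, std::size_t width,
                         int (*cmp)(const void*, const void*)) {
    const char* p = static_cast<const char*>(base);  // step through bytes
    for (std::size_t i = 0; i < num; ++i, p += width) {
        if (cmp(key, p) == 0) {
            return p;  // pointer to the matching entry
        }
    }
    return nullptr;  // not found
}

// Comparison for a plain array of int: zero means equal.
int cmp_int(const void* a, const void* b) {
    return *static_cast<const int*>(a) == *static_cast<const int*>(b) ? 0 : 1;
}

// The array entry need not have the key's type: here each entry holds
// the key and other stuff, and the key is just an int.
struct Record {
    int id;
    double value;
};

int cmp_record(const void* key, const void* entry) {
    return *static_cast<const int*>(key) ==
           static_cast<const Record*>(entry)->id ? 0 : 1;
}
```

The same generic_find serves both arrays; only the width and the comparison function change.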
STL considerations
• The STL has a search algorithm that is
interesting in its own right
• It may search for an item or a range
of items
– In any container
• The header looks like this:
FI search(First1, Last1,
First2, Last2)
STL Notes
• The result and all parameters are
forward iterators
• First1 through Last1 are in the
container to be searched
• First2 through Last2 hold the items
sought and may be in another container,
even one of another type
• Both ranges are half open: to seek just
one item, Last2 is one past First2
STL Results
• If search finds the sequence the result
is the beginning of its first occurrence
• Otherwise it returns Last1
• In order to use it, the stored types
must be suitable for the equality
operator
• You may also provide your own
predicate
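A short sketch of std::search in use may help here. The wrapper name find_run is made up for this illustration; it looks for the items of a std::list inside a std::vector, so the two ranges come from different containers.

```cpp
#include <algorithm>
#include <list>
#include <vector>

// Returns the zero-based position where `pattern` first occurs inside
// `haystack`, or -1 when std::search signals failure by returning Last1.
long find_run(const std::vector<int>& haystack, const std::list<int>& pattern) {
    auto hit = std::search(haystack.begin(), haystack.end(),
                           pattern.begin(), pattern.end());
    if (hit == haystack.end()) {
        return -1;
    }
    return static_cast<long>(hit - haystack.begin());
}
```

Passing a one-element list seeks a single item. Note that an empty pattern is treated as found at the very start of a non-empty haystack.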
Binary search
• The binary search requires a sorted
table
• The sort order may be either
ascending or descending
– For this presentation assumed
ascending
Basic algorithm
• Set low to 0, high to the last used
• While low <= high
– Set mid to be halfway between low and
high
– Compare the mid item with key
– If the mid item is equal you are done
– If the mid item is less than the key
• Remove the lower half of the table
• Set low to mid + 1
– If the mid item is greater than the key
• Remove the upper half of the table
• Set high to mid - 1
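One way to code these steps is sketched below, assuming an ascending std::vector of int. The name binary_search_index is invented here. The sketch moves the bounds to mid + 1 and mid - 1 and loops while low <= high, so the range shrinks on every pass and the loop cannot stall.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of key in an ascending table, or -1 if it is absent.
long binary_search_index(const std::vector<int>& table, int key) {
    assert(std::is_sorted(table.begin(), table.end()));  // precondition
    long low = 0;
    long high = static_cast<long>(table.size()) - 1;  // last used entry
    while (low <= high) {
        const long mid = low + (high - low) / 2;  // halfway, without overflow
        const int item = table[static_cast<std::size_t>(mid)];
        if (item == key) {
            return mid;       // found
        }
        if (item < key) {
            low = mid + 1;    // discard the lower half, including mid
        } else {
            high = mid - 1;   // discard the upper half, including mid
        }
    }
    return -1;  // bounds crossed: key is not present
}
```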
Commentary
• The loop terminates when we find the
item or the high and low bounds
collapse
• We determine which after the loop
• The advantages
– The search is O(log₂N) because at each
iteration we eliminate half of what is left
• The disadvantages
– The loop is much more complicated
• Most people do not get it right the first time
– The array must be sorted before we get
started
Sorting
• Since sorting is either O(N²) or O(N
log₂N), this is a very serious
cost
– You have to do quite a few searches to
pay for that sort
• If the table will allow insertions it
complicates that as well
– The search to find the position is O(log₂N),
however the insertion itself is
linear in an array, since we have to
slide all the following items down one
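The two costs of a sorted insertion can be seen side by side in a small sketch. It assumes an ascending std::vector of int, and sorted_insert is a name chosen for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Inserts key into an ascending vector so that it stays sorted.
// Returns the index at which the key was placed.
std::size_t sorted_insert(std::vector<int>& table, int key) {
    // Finding the position is a binary search: O(log N) comparisons.
    auto pos = std::lower_bound(table.begin(), table.end(), key);
    const std::size_t index = static_cast<std::size_t>(pos - table.begin());
    // Making room is linear: every later item slides down one place.
    table.insert(pos, key);
    return index;
}
```

So the fast search does not make the insertion fast; the sliding dominates for large tables.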
C Function
• There is a binary search function in
stdlib.h
• It will search a sorted array of items
using a user defined comparison
void* bsearch (
const void * key,
const void * base,
size_t num,
size_t width,
int (_USERENTRY * fcmp)
(const void *,
const void *) );
Commentary
• It uses void * to represent any
pointer
• The key is a pointer to what is being
searched for
• The base is an array, not necessarily
of the same type as the key
– It may contain the key and other stuff
– The array has num entries and each
entry is width bytes long
User Defined Function
• fcmp is a user defined routine to
compare the key with a base item
• It returns a negative if the first
parameter is less than second
• It returns a zero if the first
parameter is equal to second
• It returns a positive if the first
parameter is greater than second
• If the item is found then bsearch returns a
pointer to it, and NULL otherwise
Example
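int fcmp(const void * a, const void *b){
  if(*(const int *)a < *(const int *)b)
    return -1; // less
  if(*(const int *)a == *(const int *)b)
    return 0;  // equal
  return +1;   // greater
}
...
int key;
int * table; // sorted ascending
...
int * hit = (int *)bsearch(&key, table, tablesize, sizeof(int), fcmp);
// hit is NULL when key is not in the table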
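For a complete, compilable variant of this fragment, here is a sketch that calls the standard bsearch through the C++ header cstdlib. The names compare_ints and find_sorted are made up for this illustration, and the table is assumed to be sorted ascending.

```cpp
#include <cstddef>
#include <cstdlib>

// Three-way comparison: negative, zero, or positive, as bsearch expects.
int compare_ints(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

// Returns a pointer to the matching entry, or nullptr if key is absent.
// `table` must hold `count` ints in ascending order.
const int* find_sorted(const int* table, std::size_t count, int key) {
    return static_cast<const int*>(
        std::bsearch(&key, table, count, sizeof(int), compare_ints));
}
```

Using sizeof(int) rather than a literal 4 keeps the call correct on platforms where int has a different size.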
STL considerations
• There is also a binary search in the
STL
• The header is:
bool binary_search(first, last, const
T& value, comp)
• first and last are ForwardIterators in
the container
• value is the item looked for
• comp is an optional comparison object to
allow you to specify the comparison
• The result only says whether value is
present, not where it is
• Of course, the range must already be ordered
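A brief sketch of both forms follows. The function names are invented for this illustration: the first passes a comparison object for a table sorted in descending order, and the second uses std::lower_bound, a related STL algorithm, for the case where the position is wanted and not just a yes or no.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// The table is sorted descending, so the same ordering must be passed in.
bool contains_descending(const std::vector<int>& table, int key) {
    return std::binary_search(table.begin(), table.end(), key,
                              std::greater<int>());
}

// Returns the index of key in an ascending table, or -1 if absent.
long position_ascending(const std::vector<int>& table, int key) {
    auto it = std::lower_bound(table.begin(), table.end(), key);
    if (it == table.end() || *it != key) {
        return -1;
    }
    return static_cast<long>(it - table.begin());
}
```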
Segmented Search
• Intermediate between binary and
linear search
– Easier to code than binary search
– Faster than linear
• Requires sorted array
• Depending on size of table may
come in two to four stages
Two Stage
• Divide the array into segments
– Segment size is close to square root of
size
• First find the segment that contains
desired item
– Use a linear search but with segment
size increment
• Once segment is found find the
desired item
– Again with linear search
Example code
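• Assume that size of table is 64, the segment
size is 8 and the table is sorted ascending:
int first = 0, last = 64;
for(int i = 7;i<64;i+=8){ // last item of each segment
  if(key <= arr[i]){      // key can only be in this segment
    last = i+1;
    break;
  }
  first = i+1;
}
int found = -1;           // stays -1 if key is absent
for(int j = first;j<last;j++)
  if(key==arr[j]){ found = j; break; }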
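The two-stage idea generalizes to any table size. The sketch below is one possible version, assuming an ascending std::vector of int and a segment size of the truncated square root of the size; segmented_search is a name chosen for this illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Two-stage segmented search over an ascending table.
// Returns the index of key, or -1 if it is absent.
long segmented_search(const std::vector<int>& table, int key) {
    const std::size_t n = table.size();
    if (n == 0) {
        return -1;
    }
    // Segment size close to the square root of the table size (at least 1).
    const std::size_t step =
        static_cast<std::size_t>(std::sqrt(static_cast<double>(n)));

    // Stage one: hop segment by segment, looking at each segment's last item.
    std::size_t first = 0;
    while (first + step < n && table[first + step - 1] < key) {
        first += step;
    }

    // Stage two: linear search inside the one segment that can hold the key.
    const std::size_t last = std::min(first + step, n);
    for (std::size_t j = first; j < last && table[j] <= key; ++j) {
        if (table[j] == key) {
            return static_cast<long>(j);
        }
    }
    return -1;
}
```

On a table of 64 this makes at most 7 probes in stage one and 8 in stage two.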
Commentary
• The search should be O(N½), about 2N½
probes at worst
• On the above table of 64 a successful
linear search would take on
average 32 probes
• The segmented search will take no
more than 16
– Average is about 8
• A binary search would average
5.? and take at most 7
Other searches
• If the frequency of lookup is not
uniform you may do some other
things
• One option is storing the most commonly
accessed items at the beginning of
the list
• The most developed form of this idea
is the self organizing list
Hashing
• Often the best vector search
technique
• Should be O(1), constant time, if done well
• No restrictions on the key
• The problems are well known and
discourage many from using it
Problems with hashing
• Insertions and deletions
• Hash function does not generalize
well
– No such thing as a general hash function
– A good hash function is most often
constructed with knowledge of the data
• Performance degrades when full
• Processing the data in a sorted
order requires an extra sort
• Making the hash as robust as the
tree is quite difficult
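Several of these problems show up even in a toy table. The sketch below is only an illustration under stated assumptions: keys are non-negative ints, the capacity is fixed, collisions are resolved by linear probing, and the hash is simply the key modulo the capacity. The class name IntHashSet and the constant kEmptySlot are invented here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int kEmptySlot = -1;  // marks an unused slot, so keys must be >= 0

class IntHashSet {
public:
    explicit IntHashSet(std::size_t capacity) : slots_(capacity, kEmptySlot) {}

    // Returns true if key is in the table afterwards, false if the table
    // is full. Probing gets longer as the table fills up.
    bool insert(int key) {
        assert(key >= 0);
        if (slots_.empty()) {
            return false;
        }
        std::size_t i = home(key);
        for (std::size_t probes = 0; probes < slots_.size(); ++probes) {
            if (slots_[i] == key) {
                return true;          // already present
            }
            if (slots_[i] == kEmptySlot) {
                slots_[i] = key;      // first free slot after the home slot
                return true;
            }
            i = (i + 1) % slots_.size();
        }
        return false;                 // every slot is occupied
    }

    bool contains(int key) const {
        assert(key >= 0);
        if (slots_.empty()) {
            return false;
        }
        std::size_t i = home(key);
        for (std::size_t probes = 0; probes < slots_.size(); ++probes) {
            if (slots_[i] == key) {
                return true;
            }
            if (slots_[i] == kEmptySlot) {
                return false;         // an empty slot ends the probe chain
            }
            i = (i + 1) % slots_.size();
        }
        return false;
    }

private:
    // A deliberately naive hash; a good one depends on the data.
    std::size_t home(int key) const {
        return static_cast<std::size_t>(key) % slots_.size();
    }

    std::vector<int> slots_;
};
```

Deletion is left out on purpose: simply emptying a slot would break the probe chains of later keys, which is one reason deletions appear on the list of problems.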
Sermon
• Programmers usually avoid the hash
because of these problems
• Very often this is the best of the
search techniques
• The only question is: Is the work
needed to make the hash the search
technique of choice worth the effort?
– Depends on the application
Other containers
• Pointer based
• Lists
• Trees
Lists
• Most of our techniques translate into lists
rather easily
• Insertions and deletions are much easier
• The lack of needing to know the size in
advance is also helpful
– Dynamic arrays, including the STL vector, are
as convenient
– There is a substantial run-time penalty when
the array has to be recopied to another larger
array
• The main exception is the binary search
– The binary search cannot be done since a list
is not a random access container
– Most sorts do not work on a list either
– Quick sort should work on a doubly linked list
Self organizing lists
• Only list that is recommended for
searching
– Only with very narrow criteria
• Types
– Move to top
• Delete the item and push onto front
– Transpose
• Remember the prior pointer and exchange
the two contents
– Sort by frequency is the hardest of the
SOLs because you can move up a
variable amount
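The first two types can be sketched with std::list, assuming a list of int. The function names are made up for this illustration.

```cpp
#include <algorithm>
#include <iterator>
#include <list>

// Move to top: a found item is unlinked and relinked at the front.
bool find_move_to_front(std::list<int>& items, int key) {
    auto it = std::find(items.begin(), items.end(), key);
    if (it == items.end()) {
        return false;
    }
    items.splice(items.begin(), items, it);  // relinks the node, no copying
    return true;
}

// Transpose: a found item swaps contents with the item just before it.
bool find_transpose(std::list<int>& items, int key) {
    auto it = std::find(items.begin(), items.end(), key);
    if (it == items.end()) {
        return false;
    }
    if (it != items.begin()) {
        std::iter_swap(std::prev(it), it);  // exchange the two contents
    }
    return true;
}
```

Move to top reacts quickly to a change in access pattern, while transpose moves an item up only one place per hit.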
Lists
• A self organizing list will provide
good results only if:
– Few items dominate the sought items
– The list is relatively short
• Other than this lists are not a good
search container unless
– Search, insertion and deletion are very
infrequent
– None are coded by the programmer
Skip Lists
• Way more complicated than
ordinary lists
• That complication buys
O(log₂N) searching on average
– Which makes it much more
performance oriented
Trees
• Trees are inherently ordered
– There is nothing like an unsorted list
• Flavors to consider
– Unbalanced
– Balanced
– Optimal Search
– B-tree
– Trie
Unbalanced tree
• Normal searches perform slightly
worse than binary searches
– Rarely balanced
• Advantage of O(log₂N) average insertion time
• When the search fails, you are at
the location that you want to insert
at with no additional work
• The worst case tree deletion is
better than the average table
deletion and the average case is
O(log₂N)
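The point about a failed search ending at the insertion spot can be shown with a small unbalanced binary search tree. This is a sketch only: Node, locate, bst_contains, bst_insert and height are names chosen for this illustration, keys are ints, and duplicates are rejected.

```cpp
#include <algorithm>
#include <memory>

struct Node {
    int key;
    std::unique_ptr<Node> left;
    std::unique_ptr<Node> right;
    explicit Node(int k) : key(k) {}
};

// Walks down from the root. Returns the link that holds the key, or the
// empty link where a failed search stopped, which is where it would go.
std::unique_ptr<Node>& locate(std::unique_ptr<Node>& root, int key) {
    std::unique_ptr<Node>* link = &root;
    while (*link && (*link)->key != key) {
        link = key < (*link)->key ? &(*link)->left : &(*link)->right;
    }
    return *link;
}

bool bst_contains(std::unique_ptr<Node>& root, int key) {
    return locate(root, key) != nullptr;
}

// Insertion is just a failed search plus one allocation.
bool bst_insert(std::unique_ptr<Node>& root, int key) {
    std::unique_ptr<Node>& link = locate(root, key);
    if (link) {
        return false;  // already present
    }
    link.reset(new Node(key));
    return true;
}

// Number of levels; equals the worst-case probe count for a search.
int height(const std::unique_ptr<Node>& node) {
    if (!node) {
        return 0;
    }
    return 1 + std::max(height(node->left), height(node->right));
}
```

Nothing here rebalances, so inserting keys in sorted order produces a chain whose height equals the number of keys, and searching it is no better than a linear search.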
Balanced trees
• Search comparable to binary search
• Insertions and deletions are
generally less painful
• A rebalance can be quite extensive
and expensive
– Generally a rebalance is less painful
than an insertion or deletion in a table
because the sliding affects all the table
to the end
– Recopying table is hidden cost
Tries
• Most of the advantages of the tree but it
has two requirements to be useful:
• Dense key
• The key should have a small alphabet and
short length
– This is not much of a consideration if the key
is truly dense
• A binary tree has an O(log₂N) search time
– While a trie has search time linear on the
length of the key rather than the number of
entries
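A minimal trie sketch shows why the search time follows the key length. It assumes keys made only of the lowercase letters a to z, a small alphabet of 26; TrieNode, trie_insert and trie_contains are names invented for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <string>

struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> child;  // one link per letter
    bool is_key = false;                              // a stored key ends here
};

// Returns false, storing nothing, if key has a character outside a-z.
bool trie_insert(TrieNode& root, const std::string& key) {
    for (char c : key) {
        if (c < 'a' || c > 'z') {
            return false;
        }
    }
    TrieNode* node = &root;
    for (char c : key) {
        std::unique_ptr<TrieNode>& next =
            node->child[static_cast<std::size_t>(c - 'a')];
        if (!next) {
            next.reset(new TrieNode());
        }
        node = next.get();
    }
    node->is_key = true;
    return true;
}

// One step per character of the key, however many keys are stored.
bool trie_contains(const TrieNode& root, const std::string& key) {
    const TrieNode* node = &root;
    for (char c : key) {
        if (c < 'a' || c > 'z') {
            return false;
        }
        node = node->child[static_cast<std::size_t>(c - 'a')].get();
        if (!node) {
            return false;
        }
    }
    return node->is_key;
}
```

Each node carries 26 links, most of them empty unless the keys are dense, which is the memory cost behind the two requirements above.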
Optimal search trees
• Somewhat similar to a list with
optimal static order but faster
– Requires knowledge of the frequencies
• Like a binary search it generally
cuts the remaining items in half on
each pass
– The halves are measured by frequency, not
by number of keys
• Like a self organizing list it tends to
find high frequency items quite
quickly
Optimal search tree
• A standard unbalanced tree may be
used
– Need a prior program that orders the
keys based on frequency
• Generally not used if insertions and
deletions are possible
• May be used for a set of keys that
changes from day to day
• Keep the counts in every node and then write
out tomorrow's tree based on the frequencies
• Could be quite effective but complicated to
implement
B-Trees
• Offer no advantages in memory
– Searching the node offsets the
shallowness of the tree
• Preferred for disks
• No DBMS should be without them
Finally
• These are the tools we have to work
with
• The trick is figuring out what works
in the problem at hand
• In general, one size does not fit all
