
Search Algorithms

The next chapter of the book Algorithms for Dummies is searching. Last time I wrote about sorting algorithms, and I discussed how I had a little bit of a problem understanding the different sorting algorithms at a high-level, conceptual level. The book provided Python programs that did the sorting, and I tried to understand what the programs were doing and distill that into a high-level concept. For searching, no such programs are provided; instead, the author points to a search package that has already been written.

My post related to Algorithms for Dummies

And that is how it should be. Typically, developers reuse algorithms already written rather than reinvent the wheel. Occasionally, developers might try to see if they can create a better search program for a particular situation, so there are multiple search programs out there. But, for the most part, developers reuse existing programs to shorten development time.

When I studied the sorting section, the overall impression I received was that sorting was done at a very granular level, item by item, whether numbers, text, or objects. There is no way around the fact that every single item needs to be looked at in order to place the items in sorted order. But even then, some sorting methods were better than others for certain situations; the speed had to do with the number of times an item had to be exchanged or placed. In searching, though, there are techniques one can employ to avoid examining every item, and to me, most of those techniques rely on a “divide and conquer” method. Typically, I think, the items need to be sorted first before doing the “divide and conquer” search.

“Divide and Conquer” Search

The first “divide and conquer” search does require that the list of items or data be sorted first. Say you are looking for “x” in a dataset “Z”. First sort dataset “Z”, then divide the sorted dataset in half. Because the original list was sorted, the two halves will be sorted also, and you can search for “x” by asking which half (or bucket) “x” falls into, namely by asking whether “x” is greater than or equal to the first item and less than or equal to the last item of that half. Whichever half “x” falls into, you then divide that half in two and ask the same question. You keep dividing until you find your “x” or you run out of data, in which case your dataset does not contain “x”. By dividing this way, you don’t have to look at every item in the dataset while searching.
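The halving procedure described above is the classic binary search. The book doesn’t show code for it, so here is my own minimal sketch of the idea in Python:

```python
def binary_search(sorted_items, target):
    """Repeatedly halve the search range until the target is found
    or the range is empty (the "divide and conquer" idea)."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2          # split the current range in half
        if sorted_items[mid] == target:
            return mid                   # found "x": return its position
        elif sorted_items[mid] < target:
            low = mid + 1                # "x" can only be in the upper half
        else:
            high = mid - 1               # "x" can only be in the lower half
    return None                          # ran out of data: "x" is not present

data = sorted([7, 3, 9, 1, 5])           # the dataset must be sorted first
print(binary_search(data, 5))            # → 2 (position of 5 in [1, 3, 5, 7, 9])
print(binary_search(data, 4))            # → None (4 is not in the dataset)
```

Each pass discards half of the remaining items, which is why you never have to look at every element.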

“Trees”

Another type of “divide and conquer” is the tree method: the binary search tree and the binary heap. The book is not clear, but I think the dataset has to be sorted before the items can be placed in position on a tree. But…I just Googled both terms, and on Wikipedia, the binary search tree kind of sounds like it’s sorted, though I’m unclear whether the sorting happens before creating the tree or during the process of creating it. The Wikipedia article on the heap leaves me with the impression that you could build the tree without sorting.

So, I’m not really clear. I’ll have to do some further research to understand. Sigh.

One thing I just noticed upon re-reading the book is that the binary search tree (BST) and the binary heap are used to hold the keys rather than the data elements, so we’re searching for the keys.

Difference between binary search trees and binary heap trees

Let’s see if I’ve got the difference between those two types of trees. In the binary search tree, according to the book (though it’s not really clear in Wikipedia), the node will be a value in the middle of the range of numbers found in its left and right subtrees. All values/keys in the left part of the tree will be less than the node, whereas in the right part of the tree, the keys/values will be greater than the node. So the very top node, the parent of all nodes in the tree, will be a value somewhere in the middle of all the values found in the branches. The binary heap is structured differently, depending on whether it is a binary max heap or a binary min heap. If the tree is a binary max heap, then the top-most node is the maximum value/key in the dataset. Going down the branches, each level’s keys will be less than (or equal to) the keys in the level above; unlike the binary search tree, though, there is no requirement that the left side be less than the right side.
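A small sketch of a binary search tree helps with the pre-sorting question from earlier: the keys don’t need to be sorted in advance, because the left-less-than-node-less-than-right ordering emerges as each key is inserted. This is my own illustration, not the book’s code:

```python
class Node:
    """One node of a binary search tree, holding a single key."""
    def __init__(self, key):
        self.key = key
        self.left = None     # subtree of keys smaller than this key
        self.right = None    # subtree of keys larger than this key

def insert(node, key):
    """Place a key in the tree; the ordering emerges during insertion."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    return node

def search(node, key):
    """Walk down from the root, going left or right at each node."""
    while node is not None and node.key != key:
        node = node.left if key < node.key else node.right
    return node is not None

root = None
for k in [50, 30, 70, 20, 40]:   # inserted unsorted: no pre-sort needed
    root = insert(root, k)
print(search(root, 40))          # → True
print(search(root, 99))          # → False
```

Each comparison discards an entire subtree, which is the same “divide and conquer” effect as binary search.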

These types of trees work best under certain scenarios. It sounds like searching a binary search tree is faster than searching a binary heap, but the binary search tree really works best in situations where you will be doing more searching than tree building. If you are going to have to keep rebuilding the tree, because you are adding, changing, or deleting data, then using a binary heap will be better.
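Python’s standard library happens to ship a binary heap, `heapq`, which makes the cheap-to-update point easy to see. Note it implements a *min* heap (smallest key on top), the mirror image of the max heap described above:

```python
import heapq

# heapq maintains a binary min heap inside a plain Python list.
keys = [50, 30, 70, 20, 40]
heap = []
for k in keys:
    heapq.heappush(heap, k)   # each insert costs only O(log n)

print(heap[0])                # → 20: the minimum key sits at the top
heapq.heappush(heap, 10)      # adding a new key is again just O(log n)
print(heap[0])                # → 10: the new minimum bubbled to the top
```

No full re-sort is ever needed when a key is added, which is why heaps suit datasets that keep changing.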

And that’s about all I know about binary trees. I’m still not really clear on the actual mechanics or implementation. Maybe that will come later in the book.

“Hashing”

The “divide and conquer” search and the trees all depend on some type of sorting, and sorting can take a while because you have to “touch” every data element in order to place it in its proper order. There is a method of searching that does away with sorting, and it’s called “hashing”.

This is a very interesting concept.

First, you are working with keys rather than data elements. An example of a key might be an employee number or Social Security number that leads to the employee’s name, phone, and address. Handling keys rather than all of the data elements is more efficient.

Next, you apply some type of formula to the key to arrive at a number that serves as a pointer to where in the hash table the key can be found. As an example, the book uses dividing your key by the number of positions or “slots” in the hash table and taking the remainder (the modulus) as the pointer into the hash table. Now, every time you want to find a data element, you just apply the formula to the key, and the result leads you to the position in the hash table where the data element is located.
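Here is a minimal sketch of that modulus formula in Python. The table size and employee numbers are made up for illustration; only the key-mod-slots idea comes from the book:

```python
SLOTS = 11                       # number of positions in the hash table

def hash_key(key):
    """The book's example formula: the remainder after dividing the
    key by the number of slots points into the table."""
    return key % SLOTS

table = [None] * SLOTS
employees = {1001: "Alice", 2003: "Bob"}   # hypothetical key -> data

for emp_no, name in employees.items():
    table[hash_key(emp_no)] = name          # store at the computed slot

# Searching never scans the table: just recompute the slot from the key.
print(table[hash_key(1001)])                # → Alice
```

No sorting ever happens; the formula jumps straight to the right slot.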

Now, there is an issue called “collisions”, where the formula can produce the same numerical result, or “slot”, in the hash table for different keys, and the book discusses several options for how to handle them.
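I don’t know which options the book settles on, but one well-known way to handle collisions is “chaining”: each slot holds a small list, so keys that hash to the same slot simply share it. A sketch, with made-up keys chosen to collide:

```python
SLOTS = 11

def hash_key(key):
    return key % SLOTS

# Chaining: every slot starts as an empty list of (key, value) pairs.
table = [[] for _ in range(SLOTS)]

def store(key, value):
    table[hash_key(key)].append((key, value))

def lookup(key):
    for k, v in table[hash_key(key)]:   # scan only this one short chain
        if k == key:
            return v
    return None                          # key not present

store(12, "first")    # 12 % 11 == 1
store(23, "second")   # 23 % 11 == 1 -> collides with 12, shares slot 1
print(lookup(23))     # → second
```

The cost of a collision is a scan of one short chain rather than the whole table.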

Hashing is an interesting concept, and it feels like it could have some really useful applications. I have to keep it in the back of my mind.

By the way, this hashing is behind the use of MD5 when submitting healthcare reporting. Here, the use is for data integrity rather than searching. You provide the government your hash result during submission, and the government checks whether it gets the same hash result on its end. If its hash is different, then something was altered or corrupted during the submission, and the submitted data should not be used.
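Python’s standard `hashlib` module makes that integrity check easy to sketch. The healthcare-report contents below are invented; only the hash-and-compare pattern is the real technique:

```python
import hashlib

# Both sides hash the same bytes and compare digests.
report = b"patient totals: 42, 17, 9"          # stand-in for submitted data

sender_digest = hashlib.md5(report).hexdigest()     # computed before sending
receiver_digest = hashlib.md5(report).hexdigest()   # recomputed on arrival

print(sender_digest == receiver_digest)             # → True: nothing altered

corrupted = b"patient totals: 42, 17, 8"            # a single byte changed
print(hashlib.md5(corrupted).hexdigest() == sender_digest)  # → False
```

Even a one-byte change produces a completely different digest, which is what makes the check work.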
