Hash tables in graph theory

Hash tables in graph theory - algorithm

I am reading an article on Hash tables. Here is the text snippet.
A hash table is useful for any graph theory problem where the nodes
have real names instead of numbers. Here, as the input is read,
vertices are assigned integers from 1 onwards by order of appearance.
Again, the input is likely to have large groups of alphabetized
entries. For example, the vertices could be computers. Then if one
particular installation lists its computers as ibm1, ibm2, ibm3, . .
. , there could be a dramatic effect on efficiency if a search tree is
used.
My quesitons on above text
What does author mean "as input is read, verticies are assigned integers from 1 onward" don't we have calculate hash key for input read?
What does author mean by "there could be a dramatic effect on efficiency if a search tree is used."?
How hash tables are helpful in graph theory problems when compared to search tree?
Thanks!

(1) It's a map not a set, they calculate a hash value of course, but the node is mapped to an integer, and that's the purpose of the map.
(2) A search tree is O(logn) search, using a map based on search tree will increase time complexity of all ops *O(logn). [for example, BFS will take O(logV*[V+E)) instead of O(V+E), because of the lookup time.
(3) Hash table is O(1), so time complexity will be better for hash tables, in the average case.

(1) the author could be referencing methods of representing graphs in matrix form, like this example
(2) not sure about "search trees", though if you can represent hash tables as a graph, then there are some methods to optimize them, like this example.

Related

Base 3 or more search? [duplicate]

I recently heard about ternary search in which we divide an array into 3 parts and compare. Here there will be two comparisons but it reduces the array to n/3. Why don't people use this much?

Actually, people do use k-ary trees for arbitrary k.
This is, however, a tradeoff.
To find an element in a k-ary tree, you need around k*ln(N)/ln(k) operations (remember the change-of-base formula). The larger your k is, the more overall operations you need.
The logical extension of what you are saying is "why don't people use an N-ary tree for N data elements?". Which, of course, would be an array.

A ternary search will still give you the same asymptotic complexity O(log N) search time, and adds complexity to the implementation.
The same argument can be said for why you would not want a quad search or any other higher order.

Searching 1 billion (a US billion - 1,000,000,000) sorted items would take an average of about 15 compares with binary search and about 9 compares with a ternary search - not a huge advantage. And note that each 'ternary compare' might involve 2 actual comparisons.

Wow. The top voted answers miss the boat on this one, I think.
Your CPU doesn't support ternary logic as a single operation; it breaks ternary logic into several steps of binary logic. The most optimal code for the CPU is binary logic. If chips were common that supported ternary logic as a single operation, you'd be right.
B-Trees can have multiple branches at each node; a order-3 B-tree is ternary logic. Each step down the tree will take two comparisons instead of one, and this will probably cause it to be slower in CPU time.
B-Trees, however, are pretty common. If you assume that every node in the tree will be stored somewhere separately on disk, you're going to spend most of your time reading from disk... and the CPU won't be a bottleneck, but the disk will be. So you take a B-tree with 100,000 children per node, or whatever else will barely fit into one block of memory. B-trees with that kind of branching factor would rarely be more than three nodes high, and you'd only have three disk reads - three stops at a bottleneck - to search an enormous, enormous dataset.
Reviewing:
Ternary trees aren't supported by hardware, so they run less quickly.
B-tress with orders much, much, much higher than 3 are common for disk-optimization of large datasets; once you've gone past 2, go higher than 3.

The only way a ternary search can be faster than a binary search is if a 3-way partition determination can be done for less than about 1.55 times the cost of a 2-way comparison. If the items are stored in a sorted array, the 3-way determination will on average be 1.66 times as expensive as a 2-way determination. If information is stored in a tree, however, the cost to fetch information is high relative to the cost of actually comparing, and cache locality means the cost of randomly fetching a pair of related data is not much worse than the cost of fetching a single datum, a ternary or n-way tree may improve efficiency greatly.

What makes you think Ternary search should be faster?
Average number of comparisons:
in ternary search = ((1/3)*1 + (2/3)*2) * ln(n)/ln(3) ~ 1.517*ln(n)
in binary search = 1 * ln(n)/ln(2) ~ 1.443*ln(n).
Worst number of comparisons:
in ternary search = 2 * ln(n)/ln(3) ~ 1.820*ln(n)
in binary search = 1 * ln(n)/ln(2) ~ 1.443*ln(n).
So it looks like ternary search is worse.

Also, note that this sequence generalizes to linear search if we go on
Binary search
Ternary search
...
...
n-ary search ≡ linear search
So, in an n-ary search, we will have "one only COMPARE" which might take upto n actual comparisons.

"Terinary" (ternary?) search is more efficient in the best case, which would involve searching for the first element (or perhaps the last, depending on which comparison you do first). For elements farther from the end you're checking first, while two comparisons would narrow the array by 2/3 each time, the same two comparisons with binary search would narrow the search space by 3/4.
Add to that, binary search is simpler. You just compare and get one half or the other, rather than compare, if less than get the first third, else compare, if less than get the second third, else get the last third.

Ternary search can be effectively used on parallel architectures - FPGAs and ASICs. For example if internal FPGA memory required for search is less than half of the FPGA resource, you can make a duplicate memory block. This would allow to simultaneously access two different memory addresses and do all comparisons in a single clock cycle. This is one of the reasons why 100MHz FPGA can sometimes outperform the 4GHz CPU :)

Here's some random experimental evidence that I haven't vetted at all showing that it's slower than binary search.

Almost all textbooks and websites on binary search trees do not really talk about binary trees! They show you ternary search trees! True binary trees store data in their leaves not internal nodes (except for keys to navigate). Some call these leaf trees and make the distinction between node trees shown in textbooks:
J. Nievergelt, C.-K. Wong: Upper Bounds for the Total Path Length of Binary Trees,
Journal ACM 20 (1973) 1–6.
The following about this is from Peter Brass's book on data structures.
2.1 Two Models of Search Trees
In the outline just given, we supressed an important point that at first seems
trivial, but indeed it leads to two different models of search trees, either of
which can be combined with much of the following material, but one of which
is strongly preferable.
If we compare in each node the query key with the key contained in the
node and follow the left branch if the query key is smaller and the right branch
if the query key is larger, then what happens if they are equal? The two models
of search trees are as follows:
Take left branch if query key is smaller than node key; otherwise take the
right branch, until you reach a leaf of the tree. The keys in the interior node
of the tree are only for comparison; all the objects are in the leaves.
Take left branch if query key is smaller than node key; take the right branch
if the query key is larger than the node key; and take the object contained
in the node if they are equal.
This minor point has a number of consequences:
{ In model 1, the underlying tree is a binary tree, whereas in model 2, each
tree node is really a ternary node with a special middle neighbor.
{ In model 1, each interior node has a left and a right subtree (each possibly a
leaf node of the tree), whereas in model 2, we have to allow incomplete
nodes, where left or right subtree might be missing, and only the
comparison object and key are guaranteed to exist.
So the structure of a search tree of model 1 is more regular than that of a tree
of model 2; this is, at least for the implementation, a clear advantage.
{ In model 1, traversing an interior node requires only one comparison,
whereas in model 2, we need two comparisons to check the three
possibilities.
Indeed, trees of the same height in models 1 and 2 contain at most approximately
the same number of objects, but one needs twice as many comparisons in model
2 to reach the deepest objects of the tree. Of course, in model 2, there are also
some objects that are reached much earlier; the object in the root is found
with only two comparisons, but almost all objects are on or near the deepest
level.
Theorem. A tree of height h and model 1 contains at most 2^h objects.
A tree of height h and model 2 contains at most 2^h+1 − 1 objects.
This is easily seen because the tree of height h has as left and right subtrees a
tree of height at most h − 1 each, and in model 2 one additional object between
them.
{ In model 1, keys in interior nodes serve only for comparisons and may
reappear in the leaves for the identification of the objects. In model 2, each
key appears only once, together with its object.
It is even possible in model 1 that there are keys used for comparison that
do not belong to any object, for example, if the object has been deleted. By
conceptually separating these functions of comparison and identification, this
is not surprising, and in later structures we might even need to define artificial
tests not corresponding to any object, just to get a good division of the search
space. All keys used for comparison are necessarily distinct because in a model
1 tree, each interior node has nonempty left and right subtrees. So each key
occurs at most twice, once as comparison key and once as identification key in
the leaf.
Model 2 became the preferred textbook version because in most textbooks
the distinction between object and its key is not made: the key is the object.
Then it becomes unnatural to duplicate the key in the tree structure. But in
all real applications, the distinction between key and object is quite important.
One almost never wishes to keep track of just a set of numbers; the numbers
are normally associated with some further information, which is often much
larger than the key itself.

You may have heard ternary search being used in those riddles that involve weighing things on scales. Those scales can return 3 answers: left is lighter, both are the same, or left is heavier. So in a ternary search, it only takes 1 comparison.
However, computers use boolean logic, which only has 2 answers. To do the ternary search, you'd actually have to do 2 comparisons instead of 1.
I guess there are some cases where this is still faster as earlier posters mentioned, but you can see that ternary search isn't always better, and it's more confusing and less natural to implement on a computer.

Theoretically the minimum of k/ln(k) is achieved at e and since 3 is closer to e than 2 it requires less comparisons. You can check that 3/ln(3) = 2.73.. and 2/ln(2) = 2.88.. The reason why binary search could be faster is that the code for it will have less branches and will run faster on modern CPUs.

I have just posted a blog about the ternary search and I have shown some results. I have also provided some initial level implementations on my git repo I totally agree with every one about the theory part of the ternary search but why not give it a try? As per the implementation that part is easy enough if you have three years of coding experience.
I found that if you have huge data set and you need to search it many times ternary search has an advantage.
If you think you can do better with a ternary search go for it.

Although you get the same big-O complexity (ln n) in both search trees, the difference is in the constants. You have to do more comparisons for a ternary search tree at each level. So the difference boils down to k/ln(k) for a k-ary search tree. This has a minimum value at e=2.7 and k=2 provides the optimal result.

Indexing strategy for finding similar strings

I am working on devising indexing strategy for finding similar hashes. The hashes are generated for images. i.e
String A = "00007c3fff1f3b06738f390079c627c3ffe3fb11f0007c00fff07ff03f003000" //Image 1
String B = "6000fc3efb1f1b06638f1b0071c667c7fff3e738d0007c00fff03ff03f803000" //Image 2
These two hashes are similar (based on Hamming distance and Levenshtein distance) and hence similar images. I have more than 190 million such hashes. I have to select a suitable indexing data structure where the worst case complexity for finding similar hash is not O(n). Hash data structure won't work because it will search for <, = and > (or will it?). I can find Hamming distance or other distance to calculate the similarity but in worst case I will end up calculating it 190 million times.
This is my strategy now:
Currently I am working on BTree where I will rank all the keys in a node based on no. of consecutive same characters and traverse the key which is highest ranked and if the child's keys rank is less than other key's rank in parent node, I will start traversing that key in the parent node. If all the rank of parent is same I will do normal BTree traverse (givenkey < nodeKey --> go to Child node of nodeKey..using ASCII comparison) which is where my issue is.
Because it would lead to lot of false negatives in search. As in the worst case I will traverse only one part of tree where potentially similar key can be found in other traversals. Else I have to search entire tree which is again O(n) where I might as well not have tree.
I feel there has to be a better way and right now I am stuck and it would be great to hear any inputs on breaking down the problem. Please share your thoughts.
P.S : and I cannot use any external database.

First, this is a very difficult problem. Don't expect neat, tidy answers.
One approximate data structure I have seen is Spatial Approximation Sample Hierarchy (SASH).
A SASH (Spatial Approximation Sample Hierarchy) is a general-purpose data structure for efficiently computing approximate answers for similarity queries. Similarity queries naturally arise in a number of important computing contexts, in particular content-based retrieval on multimedia databases, and nearest-neighbor methods for clustering and classification.
SASH uses only a distance function to build a data structure, so the distance function (and in your case, the image hash function as well) needs to be "good". The basic intuition is roughly that if A ~ B (image A is close to image B) and B ~ C, then usually A ~ C. The data structure creates links between items that are relatively close, and you prune your search by only looking for things that are closer to your query. Whether this strategy actually works depends on the nature of your data and the distance function.
It has been 10 years or so since I looked at SASH, so there are probably newer developments as well. Michael Houle's page seems to indicate he has newer research on something called Rank Cover Trees, which seem similar in purpose to SASH. This should at least get you started on research in the area; read some papers and follow the reference trail.

Data structure for Phonebook

A cellular phone company is going to launch new model of an existing smart phone having maximum of 2 gigabytes memory. Being a programmer, you are given a task to develop application for better utilization of its phone book resource.
You should keep in mind the fact that a single contact can be stored as “First Name”, “Last Name” and “phone number” in alphabetical order. With the passage of time phone book updates as new contact comes or removed from the phone book.
Following are two factors which you must keep in mind while performing the required task.
Space limitations, as you know the available space is limited.Time required for accessing a particular contact, which must not exceed a given threshold.
As a programmer, which data structure will you use to perform the said task, provide proper reasons to support your answer?

I will use trie.
In computer science, a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is an
ordered tree data structure that is used to store a dynamic set or
associative array where the keys are usually strings. Unlike a binary
search tree, no node in the tree stores the key associated with that
node; instead, its position in the tree defines the key with which it
is associated. All the descendants of a node have a common prefix of
the string associated with that node, and the root is associated with
the empty string. Values are not necessarily associated with every
node. Rather, values tend only to be associated with leaves, and with
some inner nodes that correspond to keys of interest. For the
space-optimized presentation of prefix tree, see compact prefix tree.
In the example shown, keys are listed in the nodes and values below
them. Each complete English word has an arbitrary integer value
associated with it. A trie can be seen as a tree-shaped deterministic
finite automaton. Each finite language is generated by a trie
automaton, and each trie can be compressed into a deterministic
acyclic finite state automaton.
Image of trie from Wikipedia page
A trie has a number of advantages over binary search trees.A trie can also be used to replace a hash table, over which it has the following advantages:
Looking up data in a trie is faster in the worst case, O(m) time
(where m is the length of a search string), compared to an imperfect
hash table. An imperfect hash table can have key collisions. The
worst-case lookup speed in an imperfect hash table is O(N) time, but
far more typically is O(1), with O(m) time spent evaluating the
hash.
There is no need to provide a hash function or to change hash
functions as more keys are added to a trie.
A trie can provide an alphabetical ordering of the entries by key.
According to Wikipedia page, Trie is a well-suited data structure for representing Predictive Text or Autocomplete dictionary.
For storing the phone numbers, we just need to add an additional node at the end of the trie which contains the phone number.
Also, we need to build another trie for storing the numbers. In this case, instead of letters, number become a node in the trie. The last node, that is leaf node contains the name of the person who owns that number. By using these two tries, we can easily implement phone book. And we can search with respect to the number and/or name of the person.
A Paragraph from Wikipedia article:
A common application of a trie is storing a predictive text or
autocomplete dictionary, such as found on a mobile telephone. Such
applications take advantage of a trie's ability to quickly search for,
insert, and delete entries

I'm not very experienced in programming, but I think that Hashing with Chaining could be an appropriate method to follow for the phonebook. I believe that this kind of structure covers all the requirements you asked for.
It allocates only the memory it needs for the data to store plus the pointers for the next node as it is implemented by using dynamically allocated nodes in linked lists.
Search, insertion and deletion all have O(n) worst case. More often 0(hash(x)).
If you hash the elements by the first letter of the Last name you can gain some sorting time. You will get 26 lists (if all first names begin with letters) which you will need to sort. And Linked Lists have O(n logn) worst case.
I hope i didn't mess up with my answer.

What are the advantages of storing all elements in the leaf nodes?

I'm reading Advanced Data Structures by Peter Brass.
In the beginning of the chapter on search trees, he stated that there is two models of search trees - one where nodes contain the actual object (the value if the tree is used as a dictionary), and an other where all objects are stored in leaves and internal nodes are only for comparisons.
What are the advantages of the second model over the first one?

One of the big advantages of a binary tree where data is only in the leaf nodes is that you can partition based on elements that are not in your dataset.
For example, if I have a possible dataset of 0-1 million, but the vast majority of items are either at the high end or low end but not in the middle, I may still want my first compare against 500,000 - even though that number is not in my data set. If every node had data, I could not do this. While not normally needed in theory, I've run into many times that partitioning based on a value outside my data simplified implementation.

B+ trees are an example of a case where all key/values are stored in leaf nodes. The primary advantage here is that since all items are in the leaf nodes, the leaf nodes can be linked together to form a linked list which allows rapid in-order traversal. If you access a particular element, you can always find the next element in the sequence without visiting any parents because the leaf nodes are linked together. Filesystems and database storage systems can take advantage of this structures for range searches and stuff.

Lets say you are building tree over some objects on some complex criteria. On example calculated from multiple properties. Sometimes you can't change this object to store calculated value and calculating this criteria is expansive. So you calculate this criteria only once, and store objects in leafs based on criteria result. Then when your tree is complete you can find required object much faster because you don't have to calculate criteria for each tree node in your path.

well storing information objects in the nodes, we talking in this case about a trie, is usefull for fast retrival of information(faster than storing stuff in an array/hashtable, where the worst case auf acces is O(n), in the trie this is O(m) [m is the lenght of n])
look here:
https://en.wikipedia.org/wiki/Trie
In a search tree this oerations can be much more complicated(look AVL Tree O(log n) ) and so can be slower and is more compley to implement.
What data structure to choose??
Well this depends on what u want to do

Why use binary search if there's ternary search?

I recently heard about ternary search in which we divide an array into 3 parts and compare. Here there will be two comparisons but it reduces the array to n/3. Why don't people use this much?

Actually, people do use k-ary trees for arbitrary k.
This is, however, a tradeoff.
To find an element in a k-ary tree, you need around k*ln(N)/ln(k) operations (remember the change-of-base formula). The larger your k is, the more overall operations you need.
The logical extension of what you are saying is "why don't people use an N-ary tree for N data elements?". Which, of course, would be an array.

A ternary search will still give you the same asymptotic complexity O(log N) search time, and adds complexity to the implementation.
The same argument can be said for why you would not want a quad search or any other higher order.

Searching 1 billion (a US billion - 1,000,000,000) sorted items would take an average of about 15 compares with binary search and about 9 compares with a ternary search - not a huge advantage. And note that each 'ternary compare' might involve 2 actual comparisons.

Wow. The top voted answers miss the boat on this one, I think.
Your CPU doesn't support ternary logic as a single operation; it breaks ternary logic into several steps of binary logic. The most optimal code for the CPU is binary logic. If chips were common that supported ternary logic as a single operation, you'd be right.
B-Trees can have multiple branches at each node; a order-3 B-tree is ternary logic. Each step down the tree will take two comparisons instead of one, and this will probably cause it to be slower in CPU time.
B-Trees, however, are pretty common. If you assume that every node in the tree will be stored somewhere separately on disk, you're going to spend most of your time reading from disk... and the CPU won't be a bottleneck, but the disk will be. So you take a B-tree with 100,000 children per node, or whatever else will barely fit into one block of memory. B-trees with that kind of branching factor would rarely be more than three nodes high, and you'd only have three disk reads - three stops at a bottleneck - to search an enormous, enormous dataset.
Reviewing:
Ternary trees aren't supported by hardware, so they run less quickly.
B-tress with orders much, much, much higher than 3 are common for disk-optimization of large datasets; once you've gone past 2, go higher than 3.

The only way a ternary search can be faster than a binary search is if a 3-way partition determination can be done for less than about 1.55 times the cost of a 2-way comparison. If the items are stored in a sorted array, the 3-way determination will on average be 1.66 times as expensive as a 2-way determination. If information is stored in a tree, however, the cost to fetch information is high relative to the cost of actually comparing, and cache locality means the cost of randomly fetching a pair of related data is not much worse than the cost of fetching a single datum, a ternary or n-way tree may improve efficiency greatly.

What makes you think Ternary search should be faster?
Average number of comparisons:
in ternary search = ((1/3)*1 + (2/3)*2) * ln(n)/ln(3) ~ 1.517*ln(n)
in binary search = 1 * ln(n)/ln(2) ~ 1.443*ln(n).
Worst number of comparisons:
in ternary search = 2 * ln(n)/ln(3) ~ 1.820*ln(n)
in binary search = 1 * ln(n)/ln(2) ~ 1.443*ln(n).
So it looks like ternary search is worse.

Also, note that this sequence generalizes to linear search if we go on
Binary search
Ternary search
...
...
n-ary search ≡ linear search
So, in an n-ary search, we will have "one only COMPARE" which might take upto n actual comparisons.

"Terinary" (ternary?) search is more efficient in the best case, which would involve searching for the first element (or perhaps the last, depending on which comparison you do first). For elements farther from the end you're checking first, while two comparisons would narrow the array by 2/3 each time, the same two comparisons with binary search would narrow the search space by 3/4.
Add to that, binary search is simpler. You just compare and get one half or the other, rather than compare, if less than get the first third, else compare, if less than get the second third, else get the last third.

Ternary search can be effectively used on parallel architectures - FPGAs and ASICs. For example if internal FPGA memory required for search is less than half of the FPGA resource, you can make a duplicate memory block. This would allow to simultaneously access two different memory addresses and do all comparisons in a single clock cycle. This is one of the reasons why 100MHz FPGA can sometimes outperform the 4GHz CPU :)

Here's some random experimental evidence that I haven't vetted at all showing that it's slower than binary search.

You may have heard ternary search being used in those riddles that involve weighing things on scales. Those scales can return 3 answers: left is lighter, both are the same, or left is heavier. So in a ternary search, it only takes 1 comparison.
However, computers use boolean logic, which only has 2 answers. To do the ternary search, you'd actually have to do 2 comparisons instead of 1.
I guess there are some cases where this is still faster as earlier posters mentioned, but you can see that ternary search isn't always better, and it's more confusing and less natural to implement on a computer.

Theoretically the minimum of k/ln(k) is achieved at e and since 3 is closer to e than 2 it requires less comparisons. You can check that 3/ln(3) = 2.73.. and 2/ln(2) = 2.88.. The reason why binary search could be faster is that the code for it will have less branches and will run faster on modern CPUs.

I have just posted a blog about the ternary search and I have shown some results. I have also provided some initial level implementations on my git repo I totally agree with every one about the theory part of the ternary search but why not give it a try? As per the implementation that part is easy enough if you have three years of coding experience.
I found that if you have huge data set and you need to search it many times ternary search has an advantage.
If you think you can do better with a ternary search go for it.

Although you get the same big-O complexity (ln n) in both search trees, the difference is in the constants. You have to do more comparisons for a ternary search tree at each level. So the difference boils down to k/ln(k) for a k-ary search tree. This has a minimum value at e=2.7 and k=2 provides the optimal result.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio