Indexing strategy for finding similar strings

I am working on devising an indexing strategy for finding similar hashes. The hashes are generated from images, e.g.:
String A = "00007c3fff1f3b06738f390079c627c3ffe3fb11f0007c00fff07ff03f003000" //Image 1
String B = "6000fc3efb1f1b06638f1b0071c667c7fff3e738d0007c00fff03ff03f803000" //Image 2
These two hashes are similar (by Hamming distance and Levenshtein distance), and hence the images are similar. I have more than 190 million such hashes. I have to select a suitable indexing data structure for which the worst-case complexity of finding a similar hash is not O(n). A hash data structure won't work because it will search for <, = and > (or will it?). I can compute the Hamming distance or some other distance to measure similarity, but in the worst case I will end up computing it 190 million times.
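For reference, the Hamming distance and the O(n) scan described above can be sketched as follows. This is a minimal illustration, assuming the hashes are equal-length hex strings; the function names `hamming_hex` and `linear_scan` and the `max_dist` threshold are made up for the example.

```python
def hamming_hex(a, b):
    """Number of differing bits between two equal-length hex strings."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(a, 16) ^ int(b, 16)).count("1")


def linear_scan(query, hashes, max_dist):
    """Brute-force baseline: one distance computation per stored hash, O(n)."""
    return [h for h in hashes if hamming_hex(query, h) <= max_dist]


a = "00007c3fff1f3b06738f390079c627c3ffe3fb11f0007c00fff07ff03f003000"
b = "6000fc3efb1f1b06638f1b0071c667c7fff3e738d0007c00fff03ff03f803000"
print(hamming_hex(a, b))
```

Any index has to beat `linear_scan`, which touches every stored hash once per query.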
This is my strategy now:
Currently I am working on a B-tree in which I rank all the keys in a node by the number of consecutive identical characters and traverse the highest-ranked key. If the rank of the child's keys is lower than the rank of another key in the parent node, I go back and traverse that key in the parent node. If all the ranks in the parent are the same, I do a normal B-tree traversal (givenKey < nodeKey --> go to the child node of nodeKey, using ASCII comparison), which is where my issue is.
This would lead to a lot of false negatives in search: in the worst case I traverse only one part of the tree, while potentially similar keys could be found along other paths. Otherwise I have to search the entire tree, which is again O(n), and then I might as well not have a tree.
I feel there has to be a better way and right now I am stuck and it would be great to hear any inputs on breaking down the problem. Please share your thoughts.
P.S : and I cannot use any external database.

First, this is a very difficult problem. Don't expect neat, tidy answers.
One approximate data structure I have seen is Spatial Approximation Sample Hierarchy (SASH).
A SASH (Spatial Approximation Sample Hierarchy) is a general-purpose data structure for efficiently computing approximate answers for similarity queries. Similarity queries naturally arise in a number of important computing contexts, in particular content-based retrieval on multimedia databases, and nearest-neighbor methods for clustering and classification.
SASH uses only a distance function to build a data structure, so the distance function (and in your case, the image hash function as well) needs to be "good". The basic intuition is roughly that if A ~ B (image A is close to image B) and B ~ C, then usually A ~ C. The data structure creates links between items that are relatively close, and you prune your search by only looking for things that are closer to your query. Whether this strategy actually works depends on the nature of your data and the distance function.
It has been 10 years or so since I looked at SASH, so there are probably newer developments as well. Michael Houle's page seems to indicate he has newer research on something called Rank Cover Trees, which seem similar in purpose to SASH. This should at least get you started on research in the area; read some papers and follow the reference trail.

Related

Base 3 or more search? [duplicate]

I recently heard about ternary search in which we divide an array into 3 parts and compare. Here there will be two comparisons but it reduces the array to n/3. Why don't people use this much?

Actually, people do use k-ary trees for arbitrary k.
This is, however, a tradeoff.
To find an element in a k-ary tree, you need around k*ln(N)/ln(k) operations (remember the change-of-base formula). For k above 3, the larger your k is, the more overall operations you need.
The logical extension of what you are saying is "why don't people use an N-ary tree for N data elements?". Which, of course, would be an array.

A ternary search will still give you the same asymptotic complexity O(log N) search time, and adds complexity to the implementation.
The same argument can be said for why you would not want a quad search or any other higher order.

Searching 1 billion (a US billion - 1,000,000,000) sorted items would take about 30 compares with binary search (log2 of 10^9 is about 29.9) and about 19 steps with a ternary search (log3 of 10^9 is about 18.9) - not a huge advantage. And note that each 'ternary compare' might involve 2 actual comparisons.

Wow. The top voted answers miss the boat on this one, I think.
Your CPU doesn't support ternary logic as a single operation; it breaks ternary logic into several steps of binary logic. The most optimal code for the CPU is binary logic. If chips were common that supported ternary logic as a single operation, you'd be right.
B-Trees can have multiple branches at each node; an order-3 B-tree is ternary logic. Each step down the tree will take two comparisons instead of one, and this will probably cause it to be slower in CPU time.
B-Trees, however, are pretty common. If you assume that every node in the tree will be stored somewhere separately on disk, you're going to spend most of your time reading from disk... and the CPU won't be a bottleneck, but the disk will be. So you take a B-tree with 100,000 children per node, or whatever else will barely fit into one block of memory. B-trees with that kind of branching factor would rarely be more than three nodes high, and you'd only have three disk reads - three stops at a bottleneck - to search an enormous, enormous dataset.
Reviewing:
Ternary trees aren't supported by hardware, so they run less quickly.
B-trees with orders much, much, much higher than 3 are common for disk-optimization of large datasets; once you've gone past 2, go higher than 3.
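The "three disk reads" estimate is easy to check with a rough sketch. It assumes an idealized tree in which every node is completely full, so `levels_needed` (a name made up here) is a lower bound rather than a model of any real B-tree implementation.

```python
def levels_needed(n_keys, branching):
    """Smallest number of levels h such that branching**h >= n_keys."""
    levels, reach = 0, 1
    while reach < n_keys:
        reach *= branching  # one more level multiplies the reachable keys
        levels += 1
    return levels


for branching in (2, 3, 100_000):
    print(branching, levels_needed(10**15, branching))
```

With 100,000 children per node, three levels already cover 10^15 keys, while a binary tree over the same data needs 50 levels.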

The only way a ternary search can be faster than a binary search is if a 3-way partition determination can be done for less than about 1.58 times (log2 of 3) the cost of a 2-way comparison. If the items are stored in a sorted array, the 3-way determination will on average be about 1.67 times as expensive as a 2-way determination. If the information is stored in a tree, however, where the cost of fetching information is high relative to the cost of actually comparing, and cache locality means that randomly fetching a pair of related data costs not much more than fetching a single datum, a ternary or n-way tree may improve efficiency greatly.

What makes you think Ternary search should be faster?
Average number of comparisons:
in ternary search = ((1/3)*1 + (2/3)*2) * ln(n)/ln(3) ~ 1.517*ln(n)
in binary search = 1 * ln(n)/ln(2) ~ 1.443*ln(n).
Worst number of comparisons:
in ternary search = 2 * ln(n)/ln(3) ~ 1.820*ln(n)
in binary search = 1 * ln(n)/ln(2) ~ 1.443*ln(n).
So it looks like ternary search is worse.
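The constants above can be reproduced with a few lines. This is only a check of the arithmetic, under the same assumption as the formulas: a ternary step costs 1 comparison one third of the time and 2 comparisons otherwise.

```python
from math import log

# Comparisons per unit of ln(n), as in the formulas above.
avg_ternary = ((1 / 3) * 1 + (2 / 3) * 2) / log(3)
worst_ternary = 2 / log(3)
binary = 1 / log(2)

n = 10**9
for name, c in (("ternary avg", avg_ternary),
                ("ternary worst", worst_ternary),
                ("binary", binary)):
    print(f"{name}: {c:.3f} * ln(n), about {c * log(n):.1f} comparisons for n = 10^9")
```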

Also, note that this sequence generalizes to linear search if we go on
Binary search
Ternary search
...
...
n-ary search ≡ linear search
So, in an n-ary search, we will have only one "COMPARE", which might take up to n actual comparisons.

"Terinary" (ternary?) search is more efficient in the best case, which would involve searching for the first element (or perhaps the last, depending on which comparison you do first). For elements farther from the end you're checking first, while two comparisons would narrow the array by 2/3 each time, the same two comparisons with binary search would narrow the search space by 3/4.
Add to that, binary search is simpler. You just compare and get one half or the other, rather than compare, if less than get the first third, else compare, if less than get the second third, else get the last third.

Ternary search can be used effectively on parallel architectures - FPGAs and ASICs. For example, if the internal FPGA memory required for the search is less than half of the FPGA resources, you can make a duplicate memory block. This allows you to access two different memory addresses simultaneously and do all comparisons in a single clock cycle. This is one of the reasons why a 100MHz FPGA can sometimes outperform a 4GHz CPU :)

Here's some random experimental evidence that I haven't vetted at all showing that it's slower than binary search.

Almost all textbooks and websites on binary search trees do not really talk about binary trees! They show you ternary search trees! True binary trees store data in their leaves not internal nodes (except for keys to navigate). Some call these leaf trees and make the distinction between node trees shown in textbooks:
J. Nievergelt, C.-K. Wong: Upper Bounds for the Total Path Length of Binary Trees,
Journal ACM 20 (1973) 1–6.
The following about this is from Peter Brass's book on data structures.
2.1 Two Models of Search Trees
In the outline just given, we suppressed an important point that at first seems
trivial, but indeed it leads to two different models of search trees, either of
which can be combined with much of the following material, but one of which
is strongly preferable.
If we compare in each node the query key with the key contained in the
node and follow the left branch if the query key is smaller and the right branch
if the query key is larger, then what happens if they are equal? The two models
of search trees are as follows:
1. Take left branch if query key is smaller than node key; otherwise take the
   right branch, until you reach a leaf of the tree. The keys in the interior node
   of the tree are only for comparison; all the objects are in the leaves.
2. Take left branch if query key is smaller than node key; take the right branch
   if the query key is larger than the node key; and take the object contained
   in the node if they are equal.
This minor point has a number of consequences:
- In model 1, the underlying tree is a binary tree, whereas in model 2, each
tree node is really a ternary node with a special middle neighbor.
- In model 1, each interior node has a left and a right subtree (each possibly a
leaf node of the tree), whereas in model 2, we have to allow incomplete
nodes, where left or right subtree might be missing, and only the
comparison object and key are guaranteed to exist.
So the structure of a search tree of model 1 is more regular than that of a tree
of model 2; this is, at least for the implementation, a clear advantage.
- In model 1, traversing an interior node requires only one comparison,
whereas in model 2, we need two comparisons to check the three
possibilities.
Indeed, trees of the same height in models 1 and 2 contain at most approximately
the same number of objects, but one needs twice as many comparisons in model
2 to reach the deepest objects of the tree. Of course, in model 2, there are also
some objects that are reached much earlier; the object in the root is found
with only two comparisons, but almost all objects are on or near the deepest
level.
Theorem. A tree of height h and model 1 contains at most 2^h objects.
A tree of height h and model 2 contains at most 2^(h+1) − 1 objects.
This is easily seen because the tree of height h has as left and right subtrees a
tree of height at most h − 1 each, and in model 2 one additional object between
them.
- In model 1, keys in interior nodes serve only for comparisons and may
reappear in the leaves for the identification of the objects. In model 2, each
key appears only once, together with its object.
It is even possible in model 1 that there are keys used for comparison that
do not belong to any object, for example, if the object has been deleted. By
conceptually separating these functions of comparison and identification, this
is not surprising, and in later structures we might even need to define artificial
tests not corresponding to any object, just to get a good division of the search
space. All keys used for comparison are necessarily distinct because in a model
1 tree, each interior node has nonempty left and right subtrees. So each key
occurs at most twice, once as comparison key and once as identification key in
the leaf.
Model 2 became the preferred textbook version because in most textbooks
the distinction between object and its key is not made: the key is the object.
Then it becomes unnatural to duplicate the key in the tree structure. But in
all real applications, the distinction between key and object is quite important.
One almost never wishes to keep track of just a set of numbers; the numbers
are normally associated with some further information, which is often much
larger than the key itself.

You may have heard ternary search being used in those riddles that involve weighing things on scales. Those scales can return 3 answers: left is lighter, both are the same, or left is heavier. So in a ternary search, it only takes 1 comparison.
However, computers use boolean logic, which only has 2 answers. To do the ternary search, you'd actually have to do 2 comparisons instead of 1.
I guess there are some cases where this is still faster as earlier posters mentioned, but you can see that ternary search isn't always better, and it's more confusing and less natural to implement on a computer.

Theoretically the minimum of k/ln(k) is achieved at e, and since 3 is closer to e than 2 is, it requires fewer comparisons. You can check that 3/ln(3) = 2.73.. and 2/ln(2) = 2.88.. The reason why binary search could be faster is that the code for it will have fewer branches and will run faster on modern CPUs.
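A quick way to tabulate the k/ln(k) factor for small k, as a sketch of the cost model used above (k comparisons per level, log_k(N) levels); `cost_factor` is just an illustrative name.

```python
from math import log


def cost_factor(k):
    """k comparisons per level times 1/ln(k) levels per unit of ln(N)."""
    return k / log(k)


for k in range(2, 7):
    print(k, round(cost_factor(k), 3))

best_k = min(range(2, 10), key=cost_factor)
print("smallest factor at k =", best_k)
```

Under this particular model k = 3 comes out lowest among integers, with k = 2 and k = 4 tied just behind it.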

I have just posted a blog post about ternary search in which I show some results. I have also provided some initial implementations in my git repo. I totally agree with everyone about the theory of ternary search, but why not give it a try? As for the implementation, that part is easy enough if you have three years of coding experience.
I found that if you have huge data set and you need to search it many times ternary search has an advantage.
If you think you can do better with a ternary search go for it.

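Although you get the same big-O complexity, O(log n), in both search trees, the difference is in the constants. You have to do more comparisons at each level of a ternary search tree. If each level of a k-ary search tree costs k comparisons, the difference boils down to k/ln(k), which has its minimum at k = e ≈ 2.718 (among integers, at k = 3). If each level costs k - 1 comparisons, as in the worst case of a two-comparison ternary step, the factor is (k - 1)/ln(k), and k = 2 provides the optimal result.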

Decision Tree Binary Classifier shortcut (sorting)

Normally, at each node of the decision tree, we consider all features and all splitting points for each feature. We calculate the difference between the entropy of the entire node and the weighted avg of the entropies of potential left and right branches, and the feature + splitting feature_value that gives us the greatest entropy drop is chosen as the splitting criterion for that particular node.
Can someone explain why the above process, which requires (2^m -2)/2 tries for each feature at each node, where m is the number of distinct feature_values at the node, is the same as trying ONLY m-1 splits:
sort the m distinct feature_values by the percentage of 1's of the samples within the node that takes that feature_value for that feature.
Only try the m-1 ways of splitting the sorted list.
This 'trying only m-1 splits' method is mentioned as a 'shortcut' in the article below, which (by definition of 'shortcut') means the results of the two methods which differ drastically in runtime are exactly the same.
The quote:"For regression and binary classification problems, with K = 2 response classes, there is a computational shortcut [1]. The tree can order the categories by mean response (for regression) or class probability for one of the classes (for classification). Then, the optimal split is one of the L – 1 splits for the ordered list. "
The article:
http://www.mathworks.com/help/stats/splitting-categorical-predictors-for-multiclass-classification.html?s_tid=gn_loc_drop&requestedDomain=uk.mathworks.com
Note that I'm talking only about categorical variables.
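To make the two procedures concrete, here is a small sketch that runs both on made-up counts for one categorical feature with binary labels. The data and all function names are hypothetical. Maximizing the entropy drop is the same as minimizing the weighted entropy of the two branches, so that is what both searches minimize.

```python
from itertools import combinations
from math import log2


def entropy(pos, n):
    """Binary entropy of a node with `pos` positives out of `n` samples."""
    if n == 0 or pos == 0 or pos == n:
        return 0.0
    p = pos / n
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def split_entropy(left, counts):
    """Weighted average entropy of the branches when `left` goes left."""
    total_pos = sum(pos for pos, _ in counts.values())
    total_n = sum(n for _, n in counts.values())
    left_pos = sum(counts[c][0] for c in left)
    left_n = sum(counts[c][1] for c in left)
    right_pos, right_n = total_pos - left_pos, total_n - left_n
    return (left_n * entropy(left_pos, left_n)
            + right_n * entropy(right_pos, right_n)) / total_n


def best_exhaustive(counts):
    """Try all (2^m - 2) / 2 two-way partitions of the m categories."""
    first, *rest = sorted(counts)  # pin one category left to skip mirror images
    return min(
        split_entropy((first, *extra), counts)
        for r in range(len(rest))  # never put every category on the left
        for extra in combinations(rest, r)
    )


def best_sorted(counts):
    """Sort categories by fraction of 1's, then try only the m - 1 cuts."""
    order = sorted(counts, key=lambda c: counts[c][0] / counts[c][1])
    return min(split_entropy(order[:i], counts) for i in range(1, len(order)))


# category -> (number of samples labelled 1, total samples); made-up numbers
counts = {"a": (1, 10), "b": (6, 8), "c": (3, 9), "d": (10, 12), "e": (2, 4)}
print(best_exhaustive(counts), best_sorted(counts))
```

On this toy data both searches report the same best value, which is consistent with the quoted claim for the K = 2 case; a single example is of course not a proof.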

Can someone explain why the above process, which requires (2^m -2)/2 tries for each feature at each node, where m is the number of distinct feature_values at the node, is the same as trying ONLY m-1 splits:
The answer is simple: both procedures just aren't the same. As you noticed, splitting in the exact way is an NP-hard problem and thus hardly feasible for any problem in practice. Moreover, due to overfitting that would usually not be the optimal result in terms of generalization.
Instead, the exhaustive search is replaced by some kind of greedy procedure which goes like: sort first, then try all ordered splits. In general this leads to different results than the exact splitting.
To improve on the greedy result, one often further applies pruning (which can be seen as another greedy and heuristic method). And newer methods like random forests or BART deal with this problem effectively by averaging over several trees -- so that the deviation of a single tree becomes less important.

Efficiently querying a B+ Tree holding multidimensional data

I have a collection of tuples (x,y) of 64-bit integers that make up my dataset. I have, say, trillions of these tuples; it is not feasible to keep the dataset in memory on any machine on earth. However, it is quite reasonable to store them on disk.
I have an on-disk store (a B+-tree) that allows for quick, concurrent querying of data in a single dimension. However, some of my queries rely on both dimensions.
Query examples:
Find the tuple whose x is greater than or equal to some given value
Find the tuple whose x is as small as possible s.t. its y is greater than or equal to some given value
Find the tuple whose x is as small as possible s.t. its y is less than or equal to some given value
Perform maintenance operations (insert some tuple, remove some tuple)
The best bet I have found is Z-order curves, but I cannot seem to figure out how to conduct the queries given my two-dimensional dataset.
Solutions that are not acceptable include a sequential scan of the data; this could be far too slow.

I think, the most appropriate data structures for your requirements are R-tree and its variants (R*-tree, R+-tree, Hilbert R-tree). R-tree is similar to B+-tree, but also allows multidimensional queries.
Another relevant data structure is the Priority Search Tree. It is good for queries like your examples 1 .. 3, but not very efficient if you need frequent updates or an on-disk store. For details see this paper or this book: "Handbook of Data Structures and Applications" (Chapter 18.5).

Are you saying you don't know how to query z-order curves? The Wikipedia page describes how you do range searches.
A z-curve divides your space into nested rectangles, where each additional bit in the key divides the space in half. To search for a point:
Start with the largest rectangle that might contain your point.
Recursively:
    Create a result set of rectangles
    For each rectangle in your set
        If the rectangle is a single point, you are done, it is what you are looking for.
        Otherwise, divide the rectangle in two (specify one additional bit of the z-curve)
            If both halves contain a point
                If one half is better
                    Add that rectangle to your result set of rectangles
                Otherwise
                    Add both rectangles to your result set of rectangles
            Otherwise, only one half contains a point
                Add that rectangle to your result set of rectangles
    Search your result set of rectangles
Worst case performance is bad, of course. You can adjust it by changing how you construct your z-order index.
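For the key construction itself, here is a minimal sketch of bit interleaving for two unsigned 64-bit coordinates. Putting x in the even bits and y in the odd bits is an arbitrary choice made for this example, and `z_encode` / `z_decode` are illustrative names. The resulting 128-bit integer could serve as the single-dimensional key in the B+-tree.

```python
def z_encode(x, y, bits=64):
    """Interleave the bits of x (even positions) and y (odd positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def z_decode(z, bits=64):
    """Inverse of z_encode: split the key back into (x, y)."""
    x = y = 0
    for i in range(bits):
        x |= ((z >> (2 * i)) & 1) << i
        y |= ((z >> (2 * i + 1)) & 1) << i
    return x, y


key = z_encode(3, 0)
print(key, z_decode(key))
```

Every key prefix fixes the high bits of both coordinates, which is why a prefix corresponds to one of the nested rectangles described above.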

I'm currently working on designing a data structure which is essentially a 'stacked' B+ tree (or a d+ tree where d is the number of dimensions) for multidimensional data. I believe it would suit your data perfectly and is being designed specifically for your use case.
The basic idea is this:
Each dimension is a B+ tree and is linked to the next dimension's B+ tree. Search through the first dimension normally, once a leaf is reached it contains a pointer to the root of the next B+ tree which belongs to the next dimension. Everything in the second B+ tree belongs to the same x value.
The original plan was to only store the unique values for each dimension along with their counts. This employs a very simple compression algorithm (if you can even call it that) while still allowing for the entire data set to be represented. This 'linked' dimension scheme could allow for extra dimensions to be added later as they are simply added to the stack of B+ trees.
Total insert/search/delete time for 2 dimensions would be something similar to this:
log_b(card(x)) + log_b(card(y))
where b is the base of each B+ tree and card(x) would be the cardinality of the x dimension.
I hope that makes sense. I'm still working on an implementation, however feel free to use or even augment the idea.

http://fallabs.com/tokyocabinet/
Tokyo Cabinet is a library of routines for managing a database. The database is a simple data file containing records, each is a pair of a key and a value. Every key and value is serial bytes with variable length. Both binary data and character string can be used as a key and a value. There is neither concept of data tables nor data types. Records are organized in hash table, B+ tree, or fixed-length array.
Tokyo Cabinet is written in the C language, and provided as API of C, Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms which have API conforming to C99 and POSIX. Tokyo Cabinet is a free software licensed under the GNU Lesser General Public License.
Maybe it would be easy for you to embed?

kNN with dynamic insertions in high-dim space

I am looking for a method to do fast nearest neighbour (hopefully O(log n)) for high dimensional points (typically ~11-13 dimensional). I would like it to behave optimally during insertions after having initialized the structure. KD tree came to my mind but if you do not do bulk loading but do dynamic insertions, then kd tree ceases to be balanced and afaik balancing is an expensive operation.
So, I wanted to know what data structures would you prefer for such kind of setting. You have high dimensional points and you would like to do insertions and query for nearest neighbour.

Another data structure that comes to mind is the cover tree. Unlike KD trees which were originally developed to answer range queries, this data structure is optimal for nearest neighbor queries. It has been used in n-body problems that involve computing the k nearest neighbors of all the data points. Such problems also occur in density estimation schemes (Parzen windows).
I don't know enough about your specific problem, but I do know that there are online versions of this data structure. Check out Alexander Gray's page and this link

The Curse of Dimensionality gets in the way here. You might consider applying Principal Component Analysis (PCA) to reduce the dimensionality, but as far as I know, nobody has a great answer for this.
I have dealt with this type of problem before (in audio and video fingerprinting), sometimes with up to 30 dimensions. Analysis usually revealed that some of the dimensions did not contain relevant information for searches (actually fuzzy searches, my main goal), so I omitted them from the index structures used to access the data, but included them in the logic to determine matches from a list of candidates found during the search. This effectively reduced the dimensionality to a tractable level.
I simplified things further by quantizing the remaining dimensions severely, such that the entire multidimensional space was mapped into a 32-bit integer. I used this as the key in an STL map (a red-black tree), though I could have used a hash table. I was able to add millions of records dynamically to such a structure (RAM-based, of course) in about a minute or two, and searches took about a millisecond on average, though the data was by no means evenly distributed. Searches required careful enumeration of values in the dimensions that were mapped into the 32-bit key, but were reliable enough to use in a commercial product. I believe it is used to this day in iTunes Match, if my sources are correct. :)
The bottom line is that I recommend you take a look at your data and do something custom that exploits features in it to make for fast indexing and searching. Find the dimensions that vary the most and are the most independent of each other. Quantize those and use them as the key in an index. Each bucket in the index contains all items that share that key (there will likely be more than one). To find nearest neighbors, look at "nearby" keys and within each bucket, look for nearby values. Good luck.
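A rough sketch of the quantize-and-bucket idea follows. It is an illustration only: the cell size `CELL`, the choice of indexed dimensions and the class name `BucketIndex` are assumptions for the example, and it uses a tuple of cell coordinates as the dictionary key rather than a packed 32-bit integer.

```python
from collections import defaultdict
from itertools import product
from math import dist

CELL = 0.25  # hypothetical quantization step; tune it to the data


def cell_of(point, dims):
    """Quantize only the indexed dimensions into integer cell coordinates."""
    return tuple(int(point[d] // CELL) for d in dims)


class BucketIndex:
    def __init__(self, dims):
        self.dims = dims                  # dimensions used for the key
        self.buckets = defaultdict(list)  # key -> all points sharing it

    def insert(self, point):
        self.buckets[cell_of(point, self.dims)].append(point)

    def nearest(self, query):
        """Look in the query's cell and its neighbours, then compare exactly."""
        base = cell_of(query, self.dims)
        candidates = []
        for offset in product((-1, 0, 1), repeat=len(self.dims)):
            key = tuple(b + o for b, o in zip(base, offset))
            candidates.extend(self.buckets.get(key, ()))
        # The final comparison uses every dimension, indexed or not.
        return min(candidates, key=lambda p: dist(p, query), default=None)


index = BucketIndex(dims=(0, 1))
for p in [(0.1, 0.1, 5.0), (0.9, 0.9, 0.0), (0.3, 0.2, 1.0)]:
    index.insert(p)
print(index.nearest((0.12, 0.1, 5.0)))
```

The search is approximate: a true nearest neighbour lying outside the neighbouring cells is missed, and the number of cells visited grows as 3^d with the number of indexed dimensions, so only a few dimensions should go into the key.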
p.s. I wrote a paper on my technique, available here. Sorry about the paywall. Perhaps you can find a free copy elsewhere. Let me know if you have any questions about it.

If you use a Bucket Kd-Tree with a reasonably large bucket size, it lets the tree get a better idea of where to split when the leaves get too full. The guys in Robocode do this under extremely harsh time constraints, with random insertions happening on the fly and kNN with k > 80, d > 10 and n > 30k in under 1ms. Check out this kD-Tree Tutorial, which explains a bunch of kD-Tree enhancements and how to implement them.

In my experience, 11-13 dimensions is not too bad -- if you bulk-load. Both bulk-loaded R-trees (in contrast to k-d-trees these remain balanced!) and k-d-trees should still work much better than linear scanning.
Once you go fully dynamic, my experiences are much worse. Roughly: with bulk-loaded trees I'm seeing 20x speedups, with incrementally built R-trees just 7x. So it does pay off to rebuild the tree frequently. And depending on how you organize your data, it may be much faster than you think. The bulk load for the k-d-tree that I'm using is O(n log n), and I read that there is an O(n log log n) variant, too, with a low constant factor. For the R-tree, Sort-Tile-Recursive is the best bulk load I have seen so far, and it is also O(n log n) with a low constant factor.
So yes, in high dimensionality I would consider just reloading the tree from time to time.

Clustering tree structured data

Suppose we are given data in a semi-structured format as a tree. As an example, the tree can be formed as a valid XML document or as a valid JSON document. You could imagine it being a lisp-like S-expression or an (G)Algebraic Data Type in Haskell or Ocaml.
We are given a large number of "documents" in the tree structure. Our goal is to cluster documents which are similar. By clustering, we mean a way to divide the documents into j groups, such that the elements in each group look like each other.
I am sure there are papers out there which describe approaches, but since I am not very familiar with the area of AI/Clustering/MachineLearning, I want to ask somebody who is what to look for and where to dig.
My current approach is something like this:
I want to convert each document into an N-dimensional vector set up for a K-means clustering.
To do this, I recursively walk the document tree and for each level I calculate a vector. If I am at a tree vertex, I recur on all subvertices and then sum their vectors. Also, whenever I recur, a power factor is applied so that it matters less and less the further down the tree I go. The document's final vector is the one computed at the root of the tree.
Depending on the data at a tree leaf, I apply a function which takes the data into a vector.
But surely, there are better approaches. One weakness of my approach is that it will only similarity-cluster trees which have a top structure much like each other. If the similarity is present, but occurs farther down the tree, then my approach probably won't work very well.
I imagine there are solutions in full-text-search as well, but I do want to take advantage of the semi-structure present in the data.
Distance function
As suggested, one needs to define a distance function between documents. Without this function, we can't apply a clustering algorithm.
In fact, it may be that the question is about that very distance function and examples thereof. I want documents where elements near the root are the same to cluster close to each other. The farther down the tree we go, the less it matters.
The take-one-step-back viewpoint:
I want to cluster stack traces from programs. These are well-formed tree structures, where the functions close to the root are the inner functions which fail. I need a decent distance function between stack traces, so that traces which probably occur because the same event happened in code end up close together.

Given the nature of your problem (stack trace), I would reduce it to a string matching problem. Representing a stack trace as a tree is a bit of overhead: for each element in the stack trace, you have exactly one parent.
If string matching would indeed be more appropriate for your problem, you can run through your data, map each node onto a hash and create for each 'document' its n-grams.
Example:
Mapping:
Exception A -> 0
Exception B -> 1
Exception C -> 2
Exception D -> 3
Doc A: 0-1-2
Doc B: 1-2-3
2-grams for doc A:
X0, 01, 12, 2X
2-grams for doc B:
X1, 12, 23, 3X
Using the n-grams, you will be able to cluster similar sequences of events regardless of the root node (in this example, event 12).
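The example above can be reproduced with a short sketch. The padding symbol "X" follows the example; the Jaccard similarity used to compare two documents is an assumption added here for illustration, since no particular measure is specified.

```python
def ngrams(seq, n=2, pad="X"):
    """Set of n-grams of a sequence, padded at both ends as in the example."""
    padded = [pad] * (n - 1) + list(seq) + [pad] * (n - 1)
    return {tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Shared n-grams divided by all n-grams seen in either document."""
    return len(a & b) / len(a | b)


doc_a = ngrams([0, 1, 2])  # Exception A, B, C
doc_b = ngrams([1, 2, 3])  # Exception B, C, D
print(sorted(doc_a & doc_b), jaccard(doc_a, doc_b))
```

Here the two documents share only the 2-gram (1, 2) out of seven distinct 2-grams. One minus this similarity could then be used as a distance for clustering.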
However, if you are still convinced that you need trees, instead of strings, you must consider the following: finding similarities for trees is a lot more complex. You will want to find similar subtrees, with subtrees that are similar over a greater depth resulting in a better similarity score. For this purpose, you will need to discover closed subtrees (subtrees that are the base subtrees for trees that extend it). What you don't want is a data collection containing subtrees that are very rare, or that are present in each document you are processing (which you will get if you do not look for frequent patterns).
Here are some pointers:
http://portal.acm.org/citation.cfm?id=1227182
http://www.springerlink.com/content/yu0bajqnn4cvh9w9/
Once you have your frequent subtrees, you can use them in the same way as you would use the n-grams for clustering.

Here you may find a paper that seems closely related to your problem.
From the abstract:
This thesis presents Ixor, a system which collects, stores, and analyzes
stack traces in distributed Java systems. When combined with third-party
clustering software and adaptive cluster filtering, unusual executions can be
identified.
