Any good nearest-neighbors algorithm for similar images?

I am looking for an algorithm that can search for similar images in a large collection.
I'm currently using a SURF implementation in OpenCL.
At first I used a KNN search to compare each image's interest points against the rest of the collection, but tests revealed that it doesn't scale well. I also tried a Hadoop implementation of KNN-Join, which consumes a great deal of temporary space in HDFS, far too much relative to the amount of input data. In fact, the pairwise-distance approach isn't really appropriate given the dimensionality of my input vectors (64).
I heard of Locality-Sensitive Hashing and wondered whether there is a free implementation, or whether it's worth implementing myself. Maybe there's another algorithm I'm not aware of?
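For concreteness, the kind of thing I imagine LSH doing for my 64-dim SURF descriptors is something like this toy random-hyperplane sketch (numpy only; the number of hyperplanes and the single-table design are made-up toy parameters, a real setup would use several hash tables to boost recall):

```python
# Toy random-hyperplane LSH for 64-dim descriptors (single hash table).
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(0)
planes = rng.normal(size=(16, 64))   # 16 random hyperplanes -> 16-bit bucket keys

def lsh_key(vec):
    # The sign pattern of the projections is the bucket key.
    return tuple(bool(b) for b in planes @ vec > 0)

index = defaultdict(list)            # bucket key -> [(image_id, descriptor), ...]

def add(image_id, desc):
    index[lsh_key(desc)].append((image_id, desc))

def query(desc):
    # Only descriptors in the same bucket are candidates; re-rank by true distance.
    cands = index[lsh_key(desc)]
    return sorted(cands, key=lambda c: np.linalg.norm(c[1] - desc))

v = rng.normal(size=64)
add("img1", v)
add("img2", rng.normal(size=64))
print(query(v)[0][0])                # -> 'img1' (same bucket, distance 0)
```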

IIRC, FLANN (Fast Library for Approximate Nearest Neighbors) is a good compromise:
http://people.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN
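For what it's worth, OpenCV also ships a FLANN-based matcher, so you can try it without a separate install. A minimal sketch for 64-dim SURF descriptors (assumes an opencv-contrib build where SURF is enabled; file names and parameters are placeholders):

```python
# Sketch: matching SURF descriptors between two images with OpenCV's FLANN wrapper.
import cv2

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)  # needs opencv-contrib

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("candidate.jpg", cv2.IMREAD_GRAYSCALE)
_, desc1 = surf.detectAndCompute(img1, None)
_, desc2 = surf.detectAndCompute(img2, None)

# Randomized kd-trees; `trees` and `checks` trade speed against recall.
flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5),  # 1 = FLANN_INDEX_KDTREE
                              dict(checks=50))
matches = flann.knnMatch(desc1, desc2, k=2)

# Lowe's ratio test; guard against keypoints with fewer than 2 matches.
good = [m for pair in matches if len(pair) == 2
        for m, n in [pair] if m.distance < 0.7 * n.distance]
print(len(good), "good matches")
```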

Related

Distributed/memory-bound cloudpoint registration algorithm

From my (brief) overview of pointcloud registration algorithms (PRA), it seems they all require that at least one of the pointclouds (if not both) be fully loaded into memory. Is there an algorithm without this requirement, i.e. one that works with subsets of the two pointclouds at any given time, so that it can scale to arbitrarily large pointclouds and (ideally) be parallelized across a distributed cluster?
Possible answers, from best to worst:
There is/are such algorithm(s) with an open source implementation available (bonus if it's in Python / has Python bindings).
There is/are such algorithm(s) presented in research literature, but without open source implementations currently available.
There isn't such an algorithm off-the-shelf, but there are (precise or heuristic) ways to break up the problem into subproblems which can be solved by any existing PRA and combine the partial solutions to form the full solution.
There cannot be such an algorithm in the general case, at least not without some extra conditions/assumptions about the pointclouds to be registered.
There cannot be such an algorithm at all.
Note: I am not looking for approaches that reduce the size of the original input(s), e.g. sampling, point aggregation, feature extraction, etc.; I am only interested in algorithms that work with the full original pointclouds.

Algorithm for 2D nearest-neighbour queries with dynamic points

I am trying to find a fast algorithm for finding the (approximate, if need be) nearest neighbours of a given point in a two-dimensional space where points are frequently removed from the dataset and new points are added.
(Relatedly, there are two variants of this problem that interest me: one in which points can be thought of as being added and removed randomly and another in which all the points are in constant motion.)
Some thoughts:
kd-trees offer good performance, but are only suitable for static point sets
R*-trees seem to offer good performance for a variety of dimensions, but the generality of their design (arbitrary dimensions, general content geometries) suggests the possibility that a more specific algorithm might offer performance advantages
Algorithms with existing implementations are preferable (though this is not necessary)
What's a good choice here?
I agree with (almost) everything that @gsamaras said; just to add a few things:
In my experience (using large datasets with >= 500,000 points), the kNN performance of KD-trees is worse than that of pretty much any other spatial index, by a factor of 10 to 100. I tested them (two KD-tree variants and various other indexes) on a large OpenStreetMap dataset. In the diagram, taken from this document, the KD-trees are called KDL and KDS and the 2D dataset is called OSM-P (left diagram); see the bullet points below for more information.
This research describes an indexing method for moving objects, in case you keep (re-)inserting the same points in slightly different positions.
Quadtrees are not too bad either; they can be very fast in 2D, with excellent kNN performance for datasets under 1,000,000 entries.
If you are looking for Java implementations, have a look at my index library. It has implementations of quadtrees, R*-trees, PH-trees, and others, all with a common API that also supports kNN. The library was written for TinSpin, a framework for testing multidimensional indexes. Some results are available there (the page doesn't really describe the test data, but the 'OSM-P' results are based on OpenStreetMap data with up to 50,000,000 2D points).
Depending on your scenario, you may also want to consider PH-trees. They appear to be slower for kNN queries than R-trees in low dimensionality (though still faster than KD-trees), but they are faster than R-trees for removals and updates. If you have a lot of removals/insertions, this may be a better choice (see the TinSpin results, Figures 2 and 46). C++ versions are available here and here.
Check the Bkd-Tree, which is:
an I/O-efficient dynamic data structure based on the kd-tree. [..] the Bkd-tree maintains its high space utilization and excellent query and update performance regardless of the number of updates performed on it.
However, this data structure is multi-dimensional and not specialized for lower dimensions (like the kd-tree).
Play with it in bkdtree.
Dynamic quadtrees can also be a candidate, with O(log n) query time and O(Q(n)) insertion/deletion time, where Q(n) is the time to perform a query in the data structure used. Note that this data structure is specialized for 2D. For 3D, however, we have octrees, and in a similar way the structure can be generalized to higher dimensions. An implementation is QuadTree; a toy sketch is also included at the end of this answer.
R*-tree is another choice, but I agree with you on the generality. An R*-tree implementation exists too.
A cover tree could be considered as well, but I am not sure it fits your description. Read more here, and check the implementation in CoverTree.
The kd-tree should still be considered, since its performance is remarkable in 2 dimensions and its insertion complexity is logarithmic in the number of points.
nanoflann and CGAL are just two implementations of it; the first requires no installation, while the second does, but may be more performant.
In any case, I would try more than one approach and benchmark (since all of them have implementations and these data structures are usually affected by the nature of your data).
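To make the quadtree suggestion above concrete, here is a toy point quadtree with a pruned nearest-neighbour search (a sketch for intuition only; it assumes square bounds and no more than `capacity` coincident points):

```python
import math

class QuadTree:
    """Toy point quadtree over a square cell (center + half-width)."""
    def __init__(self, cx, cy, half, capacity=4):
        self.cx, self.cy, self.half = cx, cy, half
        self.capacity = capacity
        self.points = []
        self.children = None            # four sub-quadrants once split

    def insert(self, p):
        if abs(p[0] - self.cx) > self.half or abs(p[1] - self.cy) > self.half:
            return False                # point lies outside this cell
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append(p)
                return True
            self._split()
        return any(c.insert(p) for c in self.children)

    def _split(self):
        h = self.half / 2
        self.children = [QuadTree(self.cx + dx, self.cy + dy, h, self.capacity)
                         for dx in (-h, h) for dy in (-h, h)]
        pts, self.points = self.points, []
        for p in pts:
            any(c.insert(p) for c in self.children)

    def nearest(self, q, best=None, best_d=math.inf):
        # Prune: skip cells that cannot contain anything closer than best_d.
        dx = max(abs(q[0] - self.cx) - self.half, 0)
        dy = max(abs(q[1] - self.cy) - self.half, 0)
        if dx * dx + dy * dy >= best_d:
            return best, best_d
        for p in self.points:
            d = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
            if d < best_d:
                best, best_d = p, d
        if self.children:
            for c in self.children:
                best, best_d = c.nearest(q, best, best_d)
        return best, best_d

tree = QuadTree(0.5, 0.5, 0.5)          # covers the unit square
for p in [(0.1, 0.2), (0.8, 0.3), (0.5, 0.9)]:
    tree.insert(p)
print(tree.nearest((0.75, 0.25))[0])    # -> (0.8, 0.3)
```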

Performance issue: edit distance for large strings, LCP vs Levenshtein vs SIFT

So I'm trying to calculate the distance between two large strings (about 20-100 characters each).
The obstacle is performance: I need to run 20k distance comparisons, and it currently takes hours.
After investigating, I came across a few algorithms, and I'm having trouble deciding which to choose (based on performance vs. accuracy).
https://github.com/tdebatty/java-string-similarity - performance list for each of the algorithms.
** EDITED **
Is SIFT4 algorithm well-proven / reliable?
Is SIFT4 the right algorithm for the task?
How come it's so much faster than LCP-based / Levenshtein algorithms?
Is SIFT also used in image processing, or is it a different thing? (answered by AMH)
Thanks.
As far as I know, the Scale-Invariant Feature Transform (SIFT) is a computer-vision algorithm that detects and describes local features in images.
Also, if you want to find similar images, you must compare their local features to each other by calculating distances, which may do what you intend. But local features are vectors of numbers, as I remember. It uses a brute-force matcher: Feature Matching - OpenCV Library - SIFT.
please read about SIFT here: http://docs.opencv.org/3.1.0/da/df5/tutorial_py_sift_intro.html
The SIFT4 mentioned in your link is a completely different thing: a string distance algorithm, not related to the image-processing SIFT.
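Regarding the performance part of the question: for 20-100 character strings, a plain dynamic-programming Levenshtein with an early-exit threshold is often fast enough on its own. A pure-Python sketch (the row-minimum early exit is the important part, and it is valid because row minima never decrease):

```python
def levenshtein(a, b, max_dist=None):
    """Edit distance; returns max_dist + 1 as soon as the result must exceed max_dist."""
    if len(a) > len(b):
        a, b = b, a
    prev = list(range(len(a) + 1))
    for j, cb in enumerate(b, 1):
        cur = [j]
        for i, ca in enumerate(a, 1):
            cur.append(min(prev[i] + 1,                  # deletion
                           cur[i - 1] + 1,               # insertion
                           prev[i - 1] + (ca != cb)))    # substitution
        if max_dist is not None and min(cur) > max_dist:
            return max_dist + 1   # no alignment can come back under the threshold
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))              # 3
print(levenshtein("kitten", "sitting", max_dist=1))  # 2, meaning "more than 1"
```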

Efficient Algorithm for Detecting Text Duplicates in Big Dataset

I'm working on detecting duplicates in a list of around 5 million addresses, and was wondering whether there is a consensus on an efficient algorithm for this purpose. I've looked at the Dedupe library on GitHub (https://github.com/datamade/dedupe), but based on the documentation I'm not clear that it would scale well to a large application.
As an aside, I'm just looking to define duplicates based on textual similarity; I have already done a lot of cleaning on the addresses. I've been using a crude method based on Levenshtein distance, but was wondering whether there's anything more efficient for large datasets.
Thanks,
Dedupe should work fine for data of that size.
There has been some excellent work by Michael Wick and Beka Steorts that has better complexity than dedupe.
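On the efficiency point: the standard trick for millions of records is blocking, i.e. generating a cheap key so the expensive string comparison only runs within small candidate groups instead of over all pairs (dedupe does a learned version of this). A toy sketch, where the key function is a made-up example:

```python
# Blocking sketch: only compare addresses that share a cheap blocking key.
from collections import defaultdict
from itertools import combinations

def block_key(addr):
    # Hypothetical key: first 4 chars of the normalized address; a real system
    # would use several keys (e.g. zip code + street-name prefix).
    return addr.lower().replace(" ", "")[:4]

def candidate_pairs(addresses):
    blocks = defaultdict(list)
    for i, a in enumerate(addresses):
        blocks[block_key(a)].append(i)
    for members in blocks.values():
        yield from combinations(members, 2)   # compare only within a block

addrs = ["12 Main St", "12 Main Street", "99 Elm Ave"]
for i, j in candidate_pairs(addrs):
    print(addrs[i], "<->", addrs[j])          # 12 Main St <-> 12 Main Street
```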

Efficient data structure for quality threshold clustering algorithm

I'm trying to implement the quality threshold clustering algorithm. The outline of it (taken from here) is listed below:
Initialize the threshold distance allowed for clusters and the minimum cluster size
Build a candidate cluster for each data point by including the closest point, the next closest, and so on, until the distance of the cluster surpasses the threshold
Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration
Repeat with the reduced set of points until no more clusters having the minimum cluster size can be formed
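For reference, a direct brute-force transcription of this outline (slow, but it makes clear that the closest-point scan in step 2 is exactly the part I want a data structure to speed up; I read "distance of the cluster" as its diameter, though other definitions exist):

```python
import numpy as np

def qt_cluster(points, threshold, min_size):
    """Naive quality threshold clustering, following the outline above."""
    points = np.asarray(points, dtype=float)
    active = set(range(len(points)))
    clusters = []
    while True:
        best = []
        for seed in active:
            # Step 2: grow a candidate around `seed`, closest point first.
            order = sorted(active, key=lambda j: np.linalg.norm(points[j] - points[seed]))
            cand = []
            for j in order:
                cand.append(j)
                diam = max(np.linalg.norm(points[a] - points[b])
                           for a in cand for b in cand)
                if diam > threshold:
                    cand.pop()   # adding j pushed the cluster past the threshold
                    break
            if len(cand) > len(best):
                best = cand
        if len(best) < min_size:   # step 4: stop when no cluster is large enough
            return clusters
        clusters.append(best)      # step 3: keep the largest candidate...
        active -= set(best)        # ...and remove its points from consideration

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(qt_cluster(pts, threshold=2.0, min_size=2))
# two clusters: {0, 1, 2} and {3, 4} (member order may vary)
```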
I've been reading up on some nearest neighbor search algorithms and space partitioning data structures, as they seem to be the kind of thing I need, but I cannot determine which one to use or if I'm supposed to be looking at something else.
I want to implement the data structure myself for educational purposes, and I need one that can successively return the nearest points of some point. However, since I don't know in advance how many neighbours I will need (I keep taking the next closest until the threshold is exceeded), I can't use fixed-k nearest-neighbour algorithms. I've been looking mostly at quadtrees and k-d trees.
Additionally, since the algorithm constantly builds new candidate clusters, it would be interesting to use a modified data structure that uses cached information to speed up subsequent queries (but also taking point removal into account).
This algorithm sounds like a predecessor of DBSCAN (Wikipedia), which is known to work very well with R*-Tree indexes (Wikipedia). But of course, kd-trees are also an option. The main difference between these two is that R*-trees are meant for database use - they support online insertions and deletions very well, and are block oriented - while kd-trees are more of an in-memory data structure based on binary splits. R*-trees perform rebalancing, while kd-trees will slowly become unbalanced and will need to be rebuilt.
I find nearest-neighbor search in R*-trees much more understandable than in k-d trees, because the bounding rectangles are very intuitive.
DBSCAN also "removes" points from further consideration, but simply by marking them as already assigned. That way you don't need to update the index; it's sufficient to bulk-load it once at the beginning. You should be able to do this for QT, too. So unless I'm mistaken, you can get QT clustering efficiently by running DBSCAN with epsilon set to the QT threshold distance and minPts=2 (although one would prefer higher values in proper DBSCAN).
There are a number of DBSCAN implementations around. The one in Weka is exceptionally crappy, so stay away from it. The fpc implementation in R is okay, but could still be a lot faster. ELKI seems to be the only one with full index support, and the speed difference is massive: their benchmark shows a 12x speed gain from using an index on this data set, allowing them to cluster it in 50 seconds instead of 603 (without index). Weka took an incredible 37,917 seconds, and R's fpc 4,339. That aligns with my experience: Weka has a reputation for being quite slow, and R only kicks ass at vectorized operations; once the R interpreter has to do real work, it is significantly slower than anything native. But it is a good example of how differently the same algorithm can perform when implemented by different people. I would have expected the difference to be 2x-5x, but apparently it can easily be 50x from one programmer to another.
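If you want to sanity-check the DBSCAN route before setting up ELKI, a quick toy emulation with scikit-learn works too (not one of the implementations benchmarked above; eps plays the role of the QT threshold distance):

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.random.rand(1000, 2)
labels = DBSCAN(eps=0.05, min_samples=2).fit_predict(points)  # minPts=2 as suggested
print(len(set(labels) - {-1}), "clusters,", int((labels == -1).sum()), "noise points")
```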
