Knn search for large data? - algorithm

I'm interested in performing kNN search on a large dataset.
There are some libraries, such as ANN and FLANN, but I'm interested in the question: how do you organize the search if you have a database that does not fit entirely into memory (RAM)?

I suppose it depends on how much bigger your index is than the available memory. Here are my first spontaneous ideas:
Supposing it was tens of times the size of the RAM, I would try to cluster my data using, for instance, hierarchical clustering trees (implemented in FLANN). I would modify the implementation of the trees so that they keep the branches in memory and save the leaves (the clusters) on the disk. Therefore, the appropriate cluster would have to be loaded each time. You could then try to optimize this in different ways.
If it were not that much bigger (let's say twice the size of the RAM), I would split the dataset into two parts and create one index for each. I would then need to find the nearest neighbor in each part and choose between the two results.
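A rough sketch of that split-and-merge idea, assuming each part of the dataset can be streamed from disk one at a time (the Partition interface and the brute-force scan below are illustrative placeholders for whatever per-part index, e.g. a FLANN index, you would really use):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    /** Sketch: exact kNN over a dataset split into parts that are searched one at a time. */
    class PartitionedKnn {
        /** Assumed to stream one chunk of the dataset from disk. */
        interface Partition {
            Iterable<float[]> points();
        }

        static float squaredDistance(float[] a, float[] b) {
            float d = 0;
            for (int i = 0; i < a.length; i++) {
                float diff = a[i] - b[i];
                d += diff * diff;
            }
            return d;
        }

        /** Search each partition separately and keep a single global top-k across all of them. */
        static List<float[]> knn(float[] query, List<Partition> partitions, int k) {
            // Max-heap by distance, so the worst of the current k candidates sits on top.
            PriorityQueue<float[]> best = new PriorityQueue<>(
                    Comparator.comparingDouble((float[] p) -> squaredDistance(query, p)).reversed());
            for (Partition part : partitions) {       // only one partition needs to be in RAM at a time
                for (float[] p : part.points()) {
                    best.add(p);
                    if (best.size() > k) best.poll(); // drop the current worst candidate
                }
            }
            return new ArrayList<>(best);
        }
    }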

It depends if your data is very high-dimensional or not. If it is relatively low-dimensional, you can use an existing on-disk R-Tree implementation, such as Spatialite.
If it is higher-dimensional data, you can use X-Trees, but I don't know of any on-disk implementations off the top of my head.
Alternatively, you can implement locality-sensitive hashing with on-disk persistence, for example using mmap.
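A minimal sketch of that last idea, using random-hyperplane locality-sensitive hashing with one file per bucket (the class name and file layout here are made up for illustration, not any particular library's format; mmap would be a further refinement):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.Random;

    /** Sketch: random-hyperplane LSH where each bucket lives in its own file on disk. */
    class DiskLsh {
        private final float[][] hyperplanes;  // one random hyperplane per hash bit
        private final Path dir;

        DiskLsh(int bits, int dim, Path dir, long seed) throws IOException {
            this.dir = Files.createDirectories(dir);
            Random rnd = new Random(seed);
            hyperplanes = new float[bits][dim];
            for (float[] h : hyperplanes)
                for (int i = 0; i < dim; i++) h[i] = (float) rnd.nextGaussian();
        }

        int hash(float[] v) {
            int code = 0;
            for (int b = 0; b < hyperplanes.length; b++) {
                float dot = 0;
                for (int i = 0; i < v.length; i++) dot += hyperplanes[b][i] * v[i];
                if (dot >= 0) code |= 1 << b;  // one bit per side of each hyperplane
            }
            return code;
        }

        /** Append a vector to the file of its bucket; similar vectors tend to share a bucket. */
        void add(float[] v) throws IOException {
            Path bucket = dir.resolve("bucket-" + hash(v) + ".txt");
            StringBuilder line = new StringBuilder();
            for (float x : v) line.append(x).append(' ');
            line.append('\n');
            Files.write(bucket, line.toString().getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
        // At query time only the query's bucket (and perhaps a few neighbouring codes)
        // is read back from disk and scanned, instead of the whole dataset.
    }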

Related

What is locality in Graph Matching problem and Distributed models?

I'm a beginner in the field of graph matching and parallel computing. I read a paper that talks about an efficient parallel matching algorithm. The authors explained the importance of locality, but I don't know what it represents, or what counts as good and bad locality.
Our distributed memory parallelization (using MPI) on p processing elements (PEs or MPI processes) assigns nodes to PEs and stores all edges incident to a node locally. This can be done in a load balanced way if no node has degree exceeding m/p. The second pass of the basic algorithm from Section 2 has to exchange information on candidate edges that cross a PE boundary. In the worst case, this can involve all edges handled by a PE, i.e., we can expect better performance if we manage to keep most edges locally. In our experiments, one PE owns nodes whose numbers are a consecutive range of the input numbers. Thus, depending on how much locality the input numbering contains we have a highly local or a highly non-local situation.
Generally speaking, locality in distributed models is basically the extent to which a global solution for a computational problem can be obtained from locally available data.
Good locality is when most nodes can construct solutions using local data, since they'll require less communication to get any missing data. Bad locality would be if a node spends more than desirable time fetching data, rather than finding a solution using local data.
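To make that concrete, here is a tiny sketch (not from the paper) of one simple locality measure for the setup quoted above: the fraction of edges whose endpoints are owned by different PEs when consecutive node numbers are assigned to the same PE. A low fraction means good locality.

    /** Sketch: fraction of edges crossing PE boundaries under a block distribution of node IDs. */
    class EdgeLocality {
        /** Nodes 0..n-1 are split into p consecutive blocks; node u belongs to PE u / ceil(n/p). */
        static int owner(int node, int n, int p) {
            int blockSize = (n + p - 1) / p;
            return node / blockSize;
        }

        /** edges[i] = {u, v}; a low return value means little communication is needed. */
        static double crossingFraction(int[][] edges, int n, int p) {
            int crossing = 0;
            for (int[] e : edges)
                if (owner(e[0], n, p) != owner(e[1], n, p)) crossing++;
            return (double) crossing / edges.length;
        }
    }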
Think of a simple distributed computer system which comprises a collection of computers each somewhat like a desktop PC, in as much as each one has a CPU and some RAM. (These are the nodes mentioned in the question.) They are assembled into a distributed system by plugging them all into the same network.
Each CPU has memory-bus access (very fast) to data stored in its local RAM. The same CPU's access to data in the RAM on another computer in the system will run across the network (much slower) and may require co-operation with the CPU on that other computer.
Locality is a property of the data used in the algorithm: local data is on the same computer as the CPU; non-local data is elsewhere in the distributed system. I trust it is clear that parallel computations can proceed more quickly the more each CPU has to work only with local data. So the designers of parallel programs for distributed systems pay great attention to the placement of data, often seeking to minimise the number and sizes of exchanges of data between processing elements.
Complication, unnecessary for understanding the key issues: of course on real distributed systems many of the individual CPUs are multi-core, and in some designs multiple multi-core CPUs will share the same enclosure and have approximately memory-bus-speed access to all the RAM in the same enclosure. Which makes for a node which itself is a shared-memory computer. But that's just detail and a topic for another answer.

K-Means clustering in OpenIMAJ library

I'm not very experienced in machine learning and cluster analysis, but I have the following problem:
I have ~100kk-1000kk pieces of data which I cannot load into memory all at once, and I need to divide them into a number of classes (like 1-10k or even 100k classes) for further analysis. To do that I've chosen the K-Means algorithm implemented in the OpenIMAJ library (the FloatKMeans class).
I understand that K-Means algorithm can be divided into 2 phases:
Learning phase - where I pass in all the data I have to create/fill the classes
Assignment phase - where I can ask the model which class a given piece of data belongs to
I'm planning to build the cluster model using the Hadoop reduce phase, where I'll receive the data pieces one by one (that's why I cannot pass the data to the algorithm all at once).
My questions are:
Is the OpenIMAJ implementation optimal for such a 'big data' use case? Won't it take forever to compute?
Is it possible to 'stream' the data into the algorithm during the Hadoop reduce phase in order to learn the clusters?
Is it possible to save the learned cluster model as bytes in order to pass it to the next Hadoop job?
Is it OK to run the algorithm's assignment phase during the Hadoop map phase?
Thanks for the help.
K-Means clustering is an iterative algorithm that makes multiple passes over the data. In each pass, points are assigned to cluster centroids and then after all points have been assigned, the cluster centroids are recomputed to be the mean of the assigned points. You can't "stream" data to the algorithm in the traditional sense as you'll need to come back to it during the subsequent iterations.
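For illustration, here is a minimal sketch of one such pass (plain Java, not the OpenIMAJ implementation): every iteration has to see all of the data again, which is why true one-shot streaming is not possible.

    /** Sketch: one K-Means pass -- assign every point to its nearest centroid, then recompute centroids. */
    class KMeansIteration {
        static float[][] iterate(float[][] points, float[][] centroids) {
            int k = centroids.length, dim = centroids[0].length;
            float[][] sums = new float[k][dim];
            int[] counts = new int[k];

            for (float[] p : points) {                    // the full pass over the data
                int best = 0;
                float bestDist = Float.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    float d = 0;
                    for (int i = 0; i < dim; i++) {
                        float diff = p[i] - centroids[c][i];
                        d += diff * diff;
                    }
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                counts[best]++;
                for (int i = 0; i < dim; i++) sums[best][i] += p[i];
            }

            float[][] next = new float[k][dim];
            for (int c = 0; c < k; c++)
                for (int i = 0; i < dim; i++)             // keep empty clusters where they were
                    next[c][i] = counts[c] > 0 ? sums[c][i] / counts[c] : centroids[c][i];
            return next;
        }
    }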
Regarding the OpenIMAJ FloatKMeans implementation: yes, this can handle "big data" in the sense that it doesn't mind where it gets the data from - the DataSource instance that it takes as input can read data from disk if necessary. The only requirement is that you can hold all the centroids in memory during the runtime of the algorithm. The implementation is multi-threaded, so all CPU cores can be used during computation. There is example code here: https://github.com/openimaj/openimaj/blob/master/demos/examples/src/main/java/org/openimaj/examples/ml/clustering/kmeans/BigDataClusterExample.java.
The OpenIMAJ IOUtils.writeBinary(...) methods can be used to save the resultant cluster centroids in the FloatCentroidsResult object.
One of the biggest costs in K-Means is the computation of distances between each data point and each cluster centroid in order to find the closest. The cost of this is related to the dimensionality of the data and the number of centroids. If you've got a large number of centroids and high dimensional data, then using an approximate K-Means implementation can have big speed benefits at the cost of a slight loss in accuracy (see FloatKMeans.createKDTreeEnsemble() for example -- this uses an ensemble of KD-Trees to speed neighbour computations).
Regarding integration with Hadoop, it is possible to implement K-Means as a series of Map-Reduce tasks (each Map-Reduce pair corresponds to an iteration of the algorithm). See this paper for a discussion: http://eprints.soton.ac.uk/344243/1/paper.pdf . If you want to go down this route, OpenIMAJ has a very rough implementation here, which you could build off: https://github.com/openimaj/openimaj/tree/master/hadoop/tools/HadoopFastKMeans. As mentioned in the linked paper, Apache Mahout also contains an implementation: https://mahout.apache.org. One problem with both of these implementations is that they require quite a lot of data to be transferred between the mappers and the reducer (each mapper emits the current data point and its assigned cluster id). Depending on the dataset and the processing resources you have available, it could therefore be faster to use a non-Hadoop implementation of the algorithm. The data transfer between map and reduce could probably also be reduced with a clever Hadoop Combiner that computes weighted centroids from subsets of the data and then passes these to the (modified) reducer to compute the actual centroids.
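The combiner idea in that last sentence amounts to emitting one partial sum and count per cluster instead of every individual point; a schematic of such a record (plain Java rather than actual Hadoop Writable types) might look like this:

    /** Sketch: what a K-Means combiner could emit per cluster -- a partial sum and a count, not raw points. */
    class PartialCentroid {
        final float[] sum;
        long count;

        PartialCentroid(int dim) { this.sum = new float[dim]; }

        void addPoint(float[] p) {              // called for each point assigned to this cluster on one mapper
            for (int i = 0; i < p.length; i++) sum[i] += p[i];
            count++;
        }

        void merge(PartialCentroid other) {     // the reducer merges the partials from all mappers
            for (int i = 0; i < sum.length; i++) sum[i] += other.sum[i];
            count += other.count;
        }

        float[] centroid() {                    // the recomputed centroid for the next iteration
            float[] c = new float[sum.length];
            for (int i = 0; i < sum.length; i++) c[i] = sum[i] / count;
            return c;
        }
    }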

How to decide order of a B-tree

B-trees are said to be particularly useful for huge amounts of data that cannot fit in main memory.
My question, then, is how do we decide the order of a B-tree, i.e., how many keys to store in a node, or how many children a node should have?
I've seen examples everywhere that use 4 or 5 keys per node. How does that solve the huge-data and disk-read problem?
Typically, you'd choose the order so that the resulting node is as large as possible while still fitting into the block device page size. If you're trying to build a B-tree for an on-disk database, you'd probably pick the order such that each node fits into a single disk page, thereby minimizing the number of disk reads and writes necessary to perform each operation. If you wanted to build an in-memory B-tree, you'd likely pick either the L2 or L3 cache line sizes as your target and try to fit as many keys as possible into a node without exceeding that size. In either case, you'd have to look up the specs to determine what size to use.
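As a back-of-the-envelope example of that page-size calculation (the page, key, pointer, and header sizes below are illustrative assumptions, not measured values):

    /** Sketch: pick the largest order m such that a node still fits into one disk page. */
    class BTreeOrder {
        static int chooseOrder(int pageSizeBytes, int keyBytes, int childPointerBytes, int headerBytes) {
            // An order-m node holds up to m child pointers and m - 1 keys, so solve
            // headerBytes + (m - 1) * keyBytes + m * childPointerBytes <= pageSizeBytes for m.
            return (pageSizeBytes - headerBytes + keyBytes) / (keyBytes + childPointerBytes);
        }

        public static void main(String[] args) {
            // 4096-byte pages, 8-byte keys, 8-byte child pointers, 16-byte node header:
            System.out.println(chooseOrder(4096, 8, 8, 16));  // 255 children, i.e. ~254 keys per node
        }
    }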
Of course, you could always just experiment and try to determine this empirically as well. :-)
Hope this helps!

What indexing algorithm does MongoDB use? A binary tree?

I would like to know what kind of internal indexing algorithm MongoDB uses, because I have some data I want to store, and each document (row) has an id, which is probably a unique hash value (e.g. generated by md5() or another hash algorithm). So I would like to understand which hash method I should use to create the id, so that it is fast for MongoDB to index it. :)
Yes, MongoDB uses B-trees; from the documentation:
An index is a data structure that collects information about the values of the specified fields in the documents of a collection. This data structure is used by Mongo's query optimizer to quickly sort through and order the documents in a collection. Formally speaking, these indexes are implemented as "B-Tree" indexes.
I suggest using MongoDB's ObjectId for the collection _id and not worrying about how to create _id at all, because that is a task for MongoDB, not for the developer. It is better to spend your effort on the schema, indexes, etc.
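For example, with the MongoDB Java driver (a sketch assuming the modern synchronous driver; the connection string and collection names are placeholders), you simply omit _id and let the driver generate an ObjectId:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import org.bson.types.ObjectId;

    public class ObjectIdExample {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> coll = client.getDatabase("test").getCollection("items");

                Document doc = new Document("name", "example"); // no _id set by hand
                coll.insertOne(doc);                            // the driver fills in a generated ObjectId

                ObjectId id = doc.getObjectId("_id");
                System.out.println("generated _id: " + id.toHexString());
            }
        }
    }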
For Mongo 3.2+, the default storage engine is WiredTiger, and a B+ tree is used to store data.
WiredTiger maintains a table's data in memory using a data structure called a B-Tree (B+ Tree to be specific), referring to the nodes of a B-Tree as pages. Internal pages carry only keys. The leaf pages store both keys and values.
An LSM tree can also be used to store data:
WiredTiger supports Log-Structured Merge Trees, where updates are buffered in small files that fit in cache for fast random updates, then automatically merged into larger files in the background so that read latency approaches that of traditional B-tree files. LSM trees automatically create Bloom filters to avoid unnecessary reads from files that cannot contain matching keys.
LSM
Pros:
* More space efficient by using append-only writes and periodic compaction
* Better performance with fast-growing data and sequential writes
* No fragmentation penalty because of how SSTable files are written and updated
Cons:
* CPU overhead for compaction can meaningfully affect performance and efficiency if not tuned appropriately
* More tuning options increase flexibility but can seem complex to developers and operators
* Read/search performance can be optimized with the use of bloom filters
B+ Tree
Pros:
* Excellent performance with read-intensive workloads
Cons:
* Increased space overhead to deal with fragmentation
* Uses random writes which causes slower create/insert behavior
* Concurrent writes may require locks which slows write performance
* Scaling challenges, especially with >50% write transactions
Which to choose?
If you don't require extreme write throughput, a B-tree is likely the better choice: read throughput is better, and high volumes of writes can still be sustained.
If your workload requires very high write throughput, an LSM tree is the better choice.
Source:
LSM vs B tree
WiredTiger Doc

Data structures and Algorithm analysis question

I'm looking for an answer to this question that comes from a class on data structures and algorithms. I learned about the merge sort but don't remember clusters and buffers. I'm not quite sure I understand the question. Can someone help explain or answer it?
A file of size 1 million clusters is to be sorted using 128 input buffers of one cluster size. There is an output buffer of one cluster size. How many disk I/Os will be needed if the balanced k-way merge sort (a multi-step merge) algorithm is used?
It is asking about the total number of disk operations; a cluster here can be any size.
You need to know how many Disk IOs are needed per iteration of a balanced k-way merge sort.
(hint: every merge pass requires reading and writing every value in the array from and to disk once)
Then you work out how many merge passes must be performed to sort your data.
The total number of Disk IOs can then be calculated.
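Under one common set of assumptions (each read or write of a single cluster counts as one I/O, and the initial run formation also reads and writes everything once), the arithmetic can be sketched like this:

    /** Sketch: disk I/Os for an external k-way merge sort, one I/O per cluster read or written. */
    class ExternalSortIOs {
        static long totalDiskIOs(long fileClusters, int inputBuffers) {
            long runs = (fileClusters + inputBuffers - 1) / inputBuffers; // runs from the initial in-memory sort
            int mergePasses = 0;
            while (runs > 1) {                                            // each pass merges up to k runs into one
                runs = (runs + inputBuffers - 1) / inputBuffers;
                mergePasses++;
            }
            // Run formation reads and writes every cluster once; so does every merge pass.
            return 2L * fileClusters * (1 + mergePasses);
        }

        public static void main(String[] args) {
            // 1,000,000 clusters and 128 input buffers: run formation plus two 128-way merge passes.
            System.out.println(totalDiskIOs(1_000_000, 128)); // 6000000
        }
    }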
