Our team is currently evaluating Neo4j, and graph databases as a whole, as a candidate for our backend solution.
The upsides - the flexible data model, fast traversals in a native graph store - are all very applicable to our problem space.
However, we also need to perform large-scale aggregations on our datasets. I'm testing a very simple use case with a simple data model: (s:Specimen)-[:DONOR]->(d:Donor)
A Specimen has an edge relating it to a Donor.
The dataset I loaded has ~6 million Specimens, and a few hundred Donors. The aggregation query I want to perform is simple:
MATCH (s: Specimen)-[e: DONOR]->(d: Donor)
WITH d.sex AS sex, COUNT(s.id) AS count
RETURN count, sex
Performance is very slow - the result takes ~9 seconds to come back. We need sub-second response times for this solution to work.
We are running Neo4j on an EC2 instance with 32 vCPUs and 256GB of memory, so compute power shouldn't be a blocker here. The database itself is only 15GB.
We also have indexes on both the Specimen and Donor nodes, as well as an index on the Donor.sex property.
Any suggestions on improving the query times? Or are Graph Databases simply not cut out for such large-scale aggregations?
You will more than likely need to refactor your graph model. For example, you may want to investigate using multiple labels (e.g. something like Specimen:Male / Specimen:Female), if that is appropriate for your domain, as the label acts as a pre-filter before the database is scanned.
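A minimal sketch of what that could look like, assuming you are willing to denormalise the donor's sex onto the specimen as an extra label (the label names are illustrative, and this assumes sex only takes these two values):

// One-off migration: copy the donor's sex onto each specimen as an extra label.
MATCH (s:Specimen)-[:DONOR]->(d:Donor) WHERE d.sex = 'male' SET s:Male;
MATCH (s:Specimen)-[:DONOR]->(d:Donor) WHERE d.sex = 'female' SET s:Female;

// The aggregation then becomes a label count that never traverses the relationship:
MATCH (s:Specimen:Male) RETURN 'male' AS sex, count(s) AS count;
MATCH (s:Specimen:Female) RETURN 'female' AS sex, count(s) AS count;

For ~6 million nodes the one-off SET would normally be batched (e.g. with apoc.periodic.iterate), and the extra labels have to be kept in sync as new specimens are loaded. If you can afford one dedicated label per category, the count can even be answered from Neo4j's count store without any scan at all.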
You may find the following blog posts helpful:
Modelling categorical variables
Modelling relationships
Modelling flights, which talks about dealing with dense nodes
I am trying to find a fast algorithm for finding the (approximate, if need be) nearest neighbours of a given point in a two-dimensional space where points are frequently removed from the dataset and new points are added.
(Relatedly, there are two variants of this problem that interest me: one in which points can be thought of as being added and removed randomly and another in which all the points are in constant motion.)
Some thoughts:
kd-trees offer good performance, but are only suitable for static point sets
R*-trees seem to offer good performance for a variety of dimensions, but the generality of their design (arbitrary dimensions, general content geometries) suggests the possibility that a more specific algorithm might offer performance advantages
Algorithms with existing implementations are preferable (though this is not necessary)
What's a good choice here?
I agree with (almost) everything that #gsamaras said, just to add a few things:
In my experience (using large datasets with >= 500,000 points), the kNN performance of KD-trees is worse than that of pretty much any other spatial index, by a factor of 10 to 100. I tested them (2 KD-trees and various other indexes) on a large OpenStreetMap dataset. In the diagram there (taken from this document; see the bullet points below for more information), the KD-trees are called KDL and KDS, and the 2D dataset is called OSM-P (left diagram).
This research describes an indexing method for moving objects, in case you keep (re-)inserting the same points in slightly different positions.
Quadtrees are not too bad either; they can be very fast in 2D, with excellent kNN performance for datasets of < 1,000,000 entries.
If you are looking for Java implementations, have a look at my index library. It has implementations of quadtrees, R*-trees, PH-trees, and others, all with a common API that also supports kNN. The library was written for TinSpin, a framework for testing multidimensional indexes. Some results can be found here (the page doesn't really describe the test data, but the 'OSM-P' results are based on OpenStreetMap data with up to 50,000,000 2D points).
Depending on your scenario, you may also want to consider PH-trees. They appear to be slower for kNN queries than R-trees in low dimensionality (though still faster than KD-trees), but they are faster for removal and updates than R-trees. If you have a lot of removal/insertion, this may be a better choice (see the TinSpin results, Figures 2 and 46). C++ versions are available here and here.
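To give a feel for how little machinery a dynamic 2D index needs, here is a minimal, illustrative sketch in Java of a uniform-grid (spatial hashing) index with insert, remove, and an approximate nearest-neighbour query via an expanding ring search. The class and method names are my own (this is not the TinSpin API), and the cell size is an assumption you would tune to your point density:

import java.util.*;

// Illustrative uniform-grid index for dynamic 2D points (not the TinSpin API).
class GridIndex {
    private final double cellSize;                              // assumption: tune to point density
    private final Map<Long, List<double[]>> cells = new HashMap<>();

    GridIndex(double cellSize) { this.cellSize = cellSize; }

    private long key(long cx, long cy) { return (cx << 32) ^ (cy & 0xffffffffL); }
    private long cellOf(double v) { return (long) Math.floor(v / cellSize); }

    void insert(double x, double y) {
        cells.computeIfAbsent(key(cellOf(x), cellOf(y)), k -> new ArrayList<>())
             .add(new double[]{x, y});
    }

    boolean remove(double x, double y) {
        List<double[]> cell = cells.get(key(cellOf(x), cellOf(y)));
        return cell != null && cell.removeIf(p -> p[0] == x && p[1] == y);
    }

    // Approximate nearest neighbour: scan square "rings" of cells around the query cell,
    // and once something is found, scan one extra ring before stopping (a heuristic that
    // catches most closer points sitting just across a cell boundary).
    double[] nearest(double x, double y, int maxRings) {
        long cx = cellOf(x), cy = cellOf(y);
        double[] best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        int foundAtRing = -1;
        for (int r = 0; r <= maxRings; r++) {
            for (long i = cx - r; i <= cx + r; i++) {
                for (long j = cy - r; j <= cy + r; j++) {
                    if (Math.max(Math.abs(i - cx), Math.abs(j - cy)) != r) continue; // ring cells only
                    List<double[]> cell = cells.get(key(i, j));
                    if (cell == null) continue;
                    for (double[] p : cell) {
                        double d = Math.hypot(p[0] - x, p[1] - y);
                        if (d < bestDist) { bestDist = d; best = p; }
                    }
                }
            }
            if (best != null && foundAtRing < 0) foundAtRing = r;
            if (foundAtRing >= 0 && r >= foundAtRing + 1) break;
        }
        return best;
    }
}

Any real library will do better than this, but it shows why grid- and tree-based structures handle frequent insertion and removal so naturally: both operations are cheap local updates.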
Check the Bkd-Tree, which is:
an I/O-efficient dynamic data structure based on the kd-tree. [...] the Bkd-tree maintains its high space utilization and excellent query and update performance regardless of the number of updates performed on it.
However, this data structure is multi-dimensional (like the kd-tree) and not specialized for low dimensions.
Play with it in bkdtree.
Dynamic Quadtrees can also be a candidate, with O(log n) query time and O(Q(n)) insertion/deletion time, where Q(n) is the time to perform a query in the data structure used. Note that this data structure is specialized for 2D. For 3D however, we have octrees, and in a similar way the structure can be generalized to higher dimensions.
An implementation is QuadTree.
The R*-tree is another choice, but I agree with you on the generality. An R*-tree implementation exists too.
A Cover tree could be considered as well, but I am not sure if it fits your description. Read more here, and check the implementation at CoverTree.
The kd-tree should still be considered, since its performance is remarkable in 2 dimensions and its insertion complexity is logarithmic in the number of points.
nanoflann and CGAL are just two implementations of it; the first requires no installation, while the second does, but may be more performant.
In any case, I would try more than one approach and benchmark (since all of them have implementations and these data structures are usually affected by the nature of your data).
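Since benchmarking is really the deciding step, here is a tiny Java harness sketch for timing bulk insert, query, and removal of any candidate structure behind a common interface; PointIndex is a made-up adapter interface, so you would adapt it to whatever libraries you end up testing:

import java.util.*;

// Tiny benchmark sketch; PointIndex is a hypothetical adapter interface, not a library API.
interface PointIndex {
    void insert(double x, double y);
    void remove(double x, double y);
    double[] nearest(double x, double y);
}

class IndexBench {
    static void run(String name, PointIndex index, double[][] points, double[][] queries) {
        long t0 = System.nanoTime();
        for (double[] p : points) index.insert(p[0], p[1]);
        long t1 = System.nanoTime();
        for (double[] q : queries) index.nearest(q[0], q[1]);
        long t2 = System.nanoTime();
        for (double[] p : points) index.remove(p[0], p[1]);
        long t3 = System.nanoTime();
        System.out.printf("%s: insert %.1f ms, query %.1f ms, remove %.1f ms%n",
                name, (t1 - t0) / 1e6, (t2 - t1) / 1e6, (t3 - t2) / 1e6);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        double[][] points = new double[100_000][];
        for (int i = 0; i < points.length; i++)
            points[i] = new double[]{rnd.nextDouble() * 1000, rnd.nextDouble() * 1000};
        double[][] queries = Arrays.copyOfRange(points, 0, 10_000);
        // run("kd-tree", new KdTreeAdapter(), points, queries);   // plug each candidate in here
    }
}

In a real benchmark you would also interleave queries with insertions/removals and warm up the JVM first, but this is enough to get a first ordering of the candidates on your own data.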
We know that Elasticsearch (which uses Lucene) and the famous Google search engine keep the offsets/positions of words in the indexed documents to produce better results. Both of these systems perform indexing and searching over very large amounts of data. What special index (or data structure) or algorithm do they use internally to make this efficient and fast? And what about the cost (time and space)? Is there a web page or document that explains the word offset/distance-based algorithm used by Google or Elasticsearch (Lucene)? Below is a picture of what I would like to create myself.
Check TF-IDF: https://en.wikipedia.org/wiki/Tf-idf
That's pretty much it.
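Regarding the offset part of the question: Lucene-style engines keep an inverted index from each term to the documents it occurs in, and optionally the positions within each document, which is what makes phrase and proximity queries possible on top of TF-IDF scoring. A minimal, illustrative sketch of that idea in Java (class and method names are mine, not Lucene's internals):

import java.util.*;

// Minimal positional inverted index: term -> (docId -> positions of the term in that doc).
// This illustrates the idea only; it is not how Lucene actually lays out its postings.
class PositionalIndex {
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();
    private final Map<Integer, Integer> docLengths = new HashMap<>();

    void addDocument(int docId, String text) {
        String[] tokens = text.toLowerCase().split("\\W+");
        docLengths.put(docId, tokens.length);
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Positions enable phrase queries: two terms form a phrase in a document if some
    // position of the second term equals a position of the first term plus one.
    boolean containsPhrase(int docId, String first, String second) {
        List<Integer> a = postings.getOrDefault(first, Map.of()).getOrDefault(docId, List.of());
        List<Integer> b = postings.getOrDefault(second, Map.of()).getOrDefault(docId, List.of());
        Set<Integer> bSet = new HashSet<>(b);
        for (int p : a) if (bSet.contains(p + 1)) return true;
        return false;
    }

    // Simple tf-idf weight for ranking (log-scaled idf; many variants exist).
    double tfIdf(String term, int docId, int totalDocs) {
        Map<Integer, List<Integer>> docs = postings.getOrDefault(term, Map.of());
        List<Integer> positions = docs.getOrDefault(docId, List.of());
        if (positions.isEmpty()) return 0.0;
        double tf = (double) positions.size() / docLengths.get(docId);
        double idf = Math.log((double) totalDocs / docs.size());
        return tf * idf;
    }
}

Real engines compress these postings lists heavily (e.g. delta-encoding document IDs and positions), which is where most of the time/space cost the question asks about lives.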
Does Sphinx provide a way to precompute document similarity matrices? I have looked at Sphinx/Solr/Lucene; it seems Lucene is able to do this indirectly using term vectors (see Computing Document Similarity with Term Vectors).
Currently I am using the tf-idf-similarity gem to do these calculations, but it is incredibly slow as the dataset grows, since every document has to be compared against every other document.
I'm currently trying to find a faster alternative. Lucene seems like a potential solution, but it doesn't have as much support within the Ruby community, so if Sphinx has a good way of doing this, that would be ideal.
Just to clarify: I am not trying to do live search similarity matching, which appears to be the most common use case for both Lucene and Sphinx; I am trying to precompute a similarity matrix that relates every document in the dataset to every other. This will subsequently be used in data visualizations for different types of user analysis.
Also, for anyone with prior experience doing this, I'm curious about benchmarks: how long did it take to process, and how much computing power and/or parallelization were you using, relative to the number of documents and the average document size?
Currently it is taking about 40 minutes for me to process roughly 4,000 documents and about 2 hours to process 6,400 records. I'm providing the two different sizes and times here to give an indication of the growth rate, so you can see how slow this would become with significantly larger datasets.
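For what it's worth, the pairwise matrix is inherently O(n²) in the number of documents, but each pair can be made cheap: if the tf-idf vectors are built once as sparse maps and L2-normalised, each matrix entry is just a sparse dot product. A rough illustrative sketch in Java (this is not the Lucene or Sphinx API, nor the tf-idf-similarity gem's internals):

import java.util.*;

// Sketch: precompute L2-normalised sparse tf-idf vectors once, then fill the
// similarity matrix with sparse dot products (cosine similarity).
class SimilarityMatrix {
    static double[][] compute(List<Map<String, Double>> tfidfVectors) {
        // Normalise each vector to unit length so that dot product == cosine similarity.
        for (Map<String, Double> v : tfidfVectors) {
            double norm = Math.sqrt(v.values().stream().mapToDouble(x -> x * x).sum());
            if (norm > 0) v.replaceAll((term, w) -> w / norm);
        }
        int n = tfidfVectors.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            sim[i][i] = 1.0;                               // each document matches itself
            for (int j = i + 1; j < n; j++) {
                // Iterate over the smaller vector and look terms up in the larger one.
                Map<String, Double> a = tfidfVectors.get(i), b = tfidfVectors.get(j);
                if (a.size() > b.size()) { Map<String, Double> t = a; a = b; b = t; }
                double dot = 0.0;
                for (Map.Entry<String, Double> e : a.entrySet()) {
                    Double w = b.get(e.getKey());
                    if (w != null) dot += e.getValue() * w;
                }
                sim[i][j] = sim[j][i] = dot;
            }
        }
        return sim;
    }
}

Even so, 6,400 documents means roughly 20 million pairs; beyond a few tens of thousands of documents you would want sparse matrix multiplication or an approximate method (e.g. LSH) rather than the explicit double loop.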
I'm back with a question on data mining, working with Weka and WekaSharp. Through WekaSharp I have been doing some analysis on a fairly large dataset, the KDD Cup 1999 10% database (~70 MB). I have had good results with the J48 decision tree algorithm and the Naive Bayes algorithm, each taking between 10 and 30 minutes to complete. When I run this same data through the kNN algorithm, it never finishes the analysis; it does not error out, it simply runs forever. I have tried all different parameters with no effect. When I run the same kNN algorithm on a smaller sample dataset such as iris.arff it finishes with no difficulty. Here is the setup I have for the kNN parameters:
"-K 1 -W 0 -A \"weka.core.neighboursearch.KDTree -A \\"weka.core.EuclideanDistance -R first-last\\"\""
Is there an inherent issue with kNN and large datasets, or is there a setup issue? Thank you very much.
kNN is subject to the "curse of dimensionality": spatial queries over high-dimensional datasets cannot be optimized the way lower-dimensional queries can, which effectively turns them into brute-force searches.
Naive Bayes laughs at dimensionality because it basically treats each dimension independently. Many decision tree variants are also fairly good at dealing with high-dimensional data. kNN does not like high-dimensional data. Expect to wait a long time.
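One practical way to see this for yourself from the Weka Java API is to run the same IBk/KDTree setup on a resampled fraction of the data and watch how the runtime grows as you raise the sample size. A rough sketch (the ARFF path is a placeholder):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.neighboursearch.KDTree;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.Resample;

// Sketch only: same kNN configuration as above, evaluated on a fraction of the data.
public class KnnKddSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("kddcup99_10_percent.arff");   // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Work on a 10% subsample while experimenting; raise this once it finishes quickly.
        Resample resample = new Resample();
        resample.setSampleSizePercent(10.0);
        resample.setInputFormat(data);
        Instances sample = Filter.useFilter(data, resample);

        IBk knn = new IBk();
        knn.setKNN(1);
        knn.setNearestNeighbourSearchAlgorithm(new KDTree());   // same as the -A option above

        Evaluation eval = new Evaluation(sample);
        eval.crossValidateModel(knn, sample, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}

The real fix is usually to reduce what each distance computation costs: drop or encode the symbolic attributes, normalise the numeric ones, and/or evaluate on a stratified sample, because IBk has essentially no training phase and pays the full cost at query time.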