Our team is currently evaluating Neo4j, and graph databases as a whole, as a candidate for our backend solution.
The upsides - the flexible data model, fast traversals in a native graph store - are all very applicable to our problem space.
However, we also need to perform large-scale aggregations on our datasets. I'm testing a very simple use case with a simple data model: (s:Specimen)-[:DONOR]->(d:Donor)
A Specimen has an edge relating it to a Donor.
The dataset I loaded has ~6 million Specimens, and a few hundred Donors. The aggregation query I want to perform is simple:
MATCH (s: Specimen)-[e: DONOR]->(d: Donor)
WITH d.sex AS sex, COUNT(s.id) AS count
RETURN count, sex
Performance is very slow: the result takes ~9 seconds to return. We need sub-second return times for this solution to work.
We are running Neo4j on an EC2 instance with 32 vCPUs and 256GB of memory, so compute power shouldn't be a blocker here. The database itself is only 15GB.
We also have indexes on both the Specimen and Donor nodes, as well as an index on the Donor.sex property.
Any suggestions on improving the query times? Or are Graph Databases simply not cut out for such large-scale aggregations?
You will more than likely need to refactor your graph model. For example, you may want to investigate using multiple labels (e.g. something like Specimen:Male/Specimen:Female) if it is appropriate to do so, as the label will act as a pre-filter before scanning the db.
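To make that concrete, here is a minimal sketch (Python with the official neo4j driver; the connection details and the :Male/:Female labels are assumptions, not part of your current model) of how a single-label count avoids expanding millions of DONOR relationships:

# Hedged sketch: Specimen nodes are given an extra label derived from their
# donor's sex at load time (assumed here to be :Male / :Female). A count over
# a single label can be answered from Neo4j's count store rather than by
# traversing ~6M relationships.
from neo4j import GraphDatabase

# Assumed connection details; replace with your own.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

COUNT_MALE_SPECIMENS = """
MATCH (s:Male)            // single label, no relationship expansion
RETURN count(s) AS male_specimens
"""

with driver.session() as session:
    record = session.run(COUNT_MALE_SPECIMENS).single()
    print(record["male_specimens"])

driver.close()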
You may find the following blog posts helpful:
Modelling categorical variables
Modelling relationships
Modelling flights, which talks about dealing with dense nodes
I came across multiple opinions that graph databases tend to have problems with aggregation operations. For example, if you have a set of users and want to get the maximum age, an RDBMS will outperform a graph database. Is this true, and if it is, what is the reason behind it? As far as I understand, the key difference between a relational and a graph database is that each graph database node somehow includes references to the nodes it is connected to. How does that impact a "get max age"-like query?
Disclaimer: most of what I have read was about Neo4j, but I suppose if these limitations exist, they should apply to any graph db.
The use of graph databases like Neo4j is recommended when dealing with connected data and complex queries.
The book Learning Neo4j by Rik Van Bruggen states that you should not use graph databases when dealing with simple, aggregate-oriented queries:
From the book:
(...) simple queries, where write patterns and read patterns align to the aggregates that we are trying to store, are typically served quite inefficiently in a graph, and would be more efficiently handled by an aggregate-oriented Key-Value or Document store. If complexity is low, the advantage of using a graph database system will be lower too.
The reason behind this is closely related to the nature of the persistence model. It is easier to compute a sum, max or avg over tabular data than over data stored as a graph.
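To make the contrast concrete, here is a hedged sketch of the "max age" case (the User label, the users table and the connection details are illustrative assumptions):

# Hedged sketch: the same aggregation expressed against a graph and a table.
# In the graph case the engine generally has to visit every User node and read
# the age property off each record; in the relational case it scans (or seeks
# an index on) a single column.
from neo4j import GraphDatabase
import sqlite3

MAX_AGE_CYPHER = "MATCH (u:User) RETURN max(u.age) AS max_age"
MAX_AGE_SQL = "SELECT MAX(age) FROM users"

# Graph side (assumed connection details).
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    print(session.run(MAX_AGE_CYPHER).single()["max_age"])
driver.close()

# Relational side (in-memory table purely for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER)")
conn.executemany("INSERT INTO users (age) VALUES (?)", [(23,), (35,), (41,)])
print(conn.execute(MAX_AGE_SQL).fetchone()[0])
conn.close()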
I am trying to extract some consumption patterns of certain demographic groups from large multidimensional datasets built for other purposes. I am using clustering and regression analysis with manual methods (SPSS). Is this considered to be secondary analysis or data mining? I understand the difference between statistical analysis and data mining, but this case seems to sit somewhere in between... Thanks
"Secondary analysis" means that the data was collected for "primary" research project A, but then was analyzed again for "secondary" project B with a very different objective that was not originally planned. Maybe much later maybe by different people. Fairly common in medicine if you want to avoid the cost of doing the experiments yourself, and someone else has published suitable data.
A theoretical example:
Research group A does a clinical trial on drug B, and measures body mass and insulin levels.
Data is published, for both the study group (with drug B) and the control group (without drug B).
... ten years later ...
Research group C wants to know if there is a correlation between body mass and insulin levels. They do not care about drug B, so they only look at the control group. They join the data with the data of many other groups instead of doing their own experiments.
This is not a "meta" study, because they disregard any results with respect to drug B. They do not use the results of group A, only their data, for a different purpose. Since this is secondary use of the data, it is called "secondary analysis".
The analysis could be as simple as computing a correlation - something usually not considered to be "data mining" (you do not search, nor use advanced statistics) but traditional statistical hypothesis testing.
I am a little bottlenecked with the design approach to my application. My plan is to create a Product model and add a filter that uses Country, Region and City.
Country -> Region -> City are one to many relations respectively.
How well does Apache Solr handle tree structures like these, and how fast are they? Is it better to use something like ancestry (https://github.com/stefankroes/ancestry) with Solr, or is it better to have separate models for City, Region and Country?
The system will need to handle a lot of requests, and this tree can become gigantic.
Thanks for any advice on this.
Have you considered Neo4j, a graph database?
It supports Rails.
It supports Lucene query syntax.
Friendly advice:
I had huge performance problems some 5 years back while trying to fit a large tree into a relational db. After replacing it with Neo4j, an SP's computation time went from 4 hours to 0 (the need for it simply vanished, since I could suddenly do all relationship-based algebra at run time).
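If it helps, here is a minimal sketch of that kind of tree in Neo4j from Python (labels, relationship types, property names and connection details are all illustrative assumptions, not an established schema):

# Hedged sketch: the Country -> Region -> City tree modelled as a graph, with
# products attached to cities, and one query that walks the whole branch.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CREATE_BRANCH = """
MERGE (c:Country {name: $country})
MERGE (c)-[:HAS_REGION]->(r:Region {name: $region})
MERGE (r)-[:HAS_CITY]->(:City {name: $city})
"""

# All products anywhere under a given country.
PRODUCTS_IN_COUNTRY = """
MATCH (:Country {name: $country})-[:HAS_REGION]->(:Region)-[:HAS_CITY]->(city:City),
      (p:Product)-[:SOLD_IN]->(city)
RETURN p.name AS product
"""

with driver.session() as session:
    session.run(CREATE_BRANCH, country="France", region="Ile-de-France", city="Paris")
    for record in session.run(PRODUCTS_IN_COUNTRY, country="France"):
        print(record["product"])

driver.close()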
I have a MongoDB collection containing attributes such as:
longitude, latitude, start_date, end_date, price
I have over 500 million documents.
My question is how to search by lat/long, date range and price as efficiently as possible?
As I see it my options are:
Create a geospatial index on lat/long and use MongoDB's proximity search... and then filter this based on date range and price.
I have yet to test this, but I am worried that the amount of data would be too much to search quickly when we have around 1 search a second.
Have you had experience with how MongoDB would react under these circumstances?
Split the data into multiple collections by location, i.e. by cities like london_collection, paris_collection, new_york_collection.
I would then have to query by lat/long first, find the nearest city collection and then do a MongoDB spatial search on the subset of data in that collection with date and price filters.
I would have uneven distribution of documents as some cities would have more documents than others.
Create collections by dates instead of location. Same as above, but each document is allocated a collection based on its date range.
The problem is with searches whose date range straddles multiple collections.
Create unique ids based on city_start_date_end_date for each document.
Again I would have to use my lat/long query to find the nearest city, then append the date range to build the key. This seems pretty fast, but I don't really like the city lookup aspect... it seems a bit ugly.
I am in the process of experimenting with option 1), but would really like to hear your ideas before I go too far down one particular path.
How do search engines split up and manage their data? This must be a similar kind of problem.
Also, I do not have to use MongoDB; I'm open to other options.
Many thanks.
Indexing and data access performance is a deep and complex subject. A lot of factors can affect the most efficient solution, including the size of your data sets, the read-to-write ratio, the relative performance of your IO and backing store, etc.
While I can't give you a concrete answer, I can suggest investigating Morton numbers as an efficient way of combining multiple similar numeric values, like lat/longs, into a single key.
Morton number
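For illustration, a minimal sketch of a Morton (Z-order) code in Python, interleaving the bits of two quantized coordinates into one sortable integer (the quantization scale is an arbitrary assumption):

# Hedged sketch: interleave the bits of two non-negative integers (a Morton /
# Z-order code), so that points close together in 2D tend to get nearby 1D keys.
def interleave_bits(x: int, y: int, bits: int = 32) -> int:
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def morton_from_lat_lng(lat: float, lng: float, bits: int = 32) -> int:
    # Quantize lat/lng onto an unsigned integer grid first (assumed scale).
    scale = (1 << bits) - 1
    x = int((lng + 180.0) / 360.0 * scale)
    y = int((lat + 90.0) / 180.0 * scale)
    return interleave_bits(x, y, bits)

print(morton_from_lat_lng(51.5074, -0.1278))  # e.g. London

With a key like this you can range-scan a single indexed field, though Z-order ranges include some false positives, so you still re-check the exact bounds on the matched documents.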
Why do you think option 1 would be too slow? Is this the result of a real world test or is this merely an assumption that it might eventually not work out?
MongoDB has native support for geohashing and turns coordinates into a single number which can then be searched by a BTree traversal. This should be reasonably fast. Messing around with multiple collections does not seem like a very good idea to me. All it does is replace one level of BTree traversal on the database with some code you still need to write, test and maintain.
Don't reinvent the wheel, but try to optimize the most obvious path (1) first (a pymongo sketch of these steps follows the list):
Set up geo indexes
Use explain to make sure your queries actually use the index
Make sure your indexes fit into RAM
Profile the database using the built-in profiler
Don't measure performance on a 'cold' system where the indexes didn't have a chance to go to RAM yet
If possible, try not to use geoNear, and stick to the faster (but not perfectly spherical) near queries
If you're still hitting limits, look at sharding to distribute reads and writes to multiple machines.
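Here is a hedged pymongo sketch of those steps for option (1); the collection and field names are assumptions, and it presumes the coordinates are stored as a GeoJSON point rather than as separate longitude/latitude fields:

# Hedged sketch: a 2dsphere index plus a $near query with date and price
# filters. Assumes documents look like:
#   {location: {type: "Point", coordinates: [lng, lat]},
#    start_date: ..., end_date: ..., price: ...}
from datetime import datetime
from pymongo import MongoClient, GEOSPHERE

client = MongoClient("mongodb://localhost:27017")   # assumed connection string
col = client.mydb.listings                          # assumed db/collection names

col.create_index([("location", GEOSPHERE)])
col.create_index([("start_date", 1), ("end_date", 1), ("price", 1)])

target_date = datetime(2015, 7, 1)                  # illustrative values
query = {
    "location": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [-0.1278, 51.5074]},
            "$maxDistance": 5000,                   # metres
        }
    },
    "start_date": {"$lte": target_date},
    "end_date": {"$gte": target_date},
    "price": {"$lte": 200},
}

for doc in col.find(query).limit(20):
    print(doc["_id"], doc.get("price"))

# Confirm the geo index is actually used before measuring anything.
print(col.find(query).explain())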
I have a graph (and it is a graph because one node might have many parents) that contains nodes with the following data:
Keyword Id
Keyword Label
Number of previous searches
Depth of keyword promotion
The relevance is rated with a number starting from 1.
The relevance of a child node is determined by the distance from the parent node to the child node, minus the depth of the keyword's promotion.
The display order of child nodes at the same depth is determined by the number of previous searches.
Is there an algorithm that is able to search such a data structure?
Do I have an efficiency issue if I need to traverse all nodes, cache the generated results and display them by pages, considering that this should scale well for a large number of users? If I do have an issue, how can it be resolved?
What kind of database do I need to use? A NoSQL one, a relational one, or a graph database?
What would the schema look like?
Can this be done using django-haystack?
It seems you're trying to compute a top-k query over a graph. There is a variety of algorithms fit to solve this problem; the simplest one that I believe will help you is the Threshold Algorithm (TA), with the traversal over the graph done in a BFS fashion. Other top-k algorithms include the Lawler-Murty procedure, and other TA variations exist.
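To illustrate the shape of the traversal, here is a minimal in-memory sketch; it is a plain BFS with a bounded result heap rather than a full Threshold Algorithm, and the node data, adjacency list and "lower relevance number is better" ordering are all assumptions for illustration:

# Hedged sketch: BFS from a root keyword, scoring each child as
# (distance from the root) - (promotion depth), and returning the top-k
# ordered by relevance, then by number of previous searches (more first).
# A full Threshold Algorithm would add sorted access lists and a stopping
# threshold on top of this.
import heapq
from collections import deque

# keyword_id -> node data (label, previous searches, promotion depth)
nodes = {
    1: {"label": "shoes",   "searches": 120, "promotion": 0},
    2: {"label": "sandals", "searches": 40,  "promotion": 1},
    3: {"label": "boots",   "searches": 75,  "promotion": 0},
}
children = {1: [2, 3], 2: [], 3: []}  # adjacency list (child edges)

def top_k_keywords(root: int, k: int):
    results = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        node_id, depth = queue.popleft()
        if node_id != root:
            relevance = depth - nodes[node_id]["promotion"]
            # Lower relevance number first, then more previous searches first.
            heapq.heappush(results, (relevance, -nodes[node_id]["searches"], node_id))
        for child in children.get(node_id, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return [heapq.heappop(results)[2] for _ in range(min(k, len(results)))]

print(top_k_keywords(1, 2))  # -> [2, 3] with the sample data above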
Regarding efficiency - computing the query itself might take exponential time, simply due to the exponentially many results to be returned, but when using a TA the time between outputting successive results should be relatively short. As far as caching and scale are concerned, the usual considerations apply: you'll probably want to use a distributed system when the scale grows, along with the appropriate TA variant (such as the Threshold Join Algorithm). Of course you'll need to consider the scaling and caching issues when choosing which database solution to use as well.
As far as the database goes, you should definitely use one that supports graphs as first-class citizens (those tend to be known as graph databases), and I believe it doesn't matter whether the storage engine behind the graph database is relational or NoSQL. One point to note is that you'll probably want to make sure the database you choose can scale to the scale you require (so for large scale, perhaps, you'll want to look into more distributed solutions). The schema will depend on the database you choose (assuming it won't be a schema-less database).
Last but not least - Haystack. As Haystack will work with everything that the search engine you choose will work with, there should be at least one possible way to do it (combining Apache Solr for search and Neo4j or GoldenOrb for the database), and maybe more (as I'm not really familiar with Haystack or the search engines it supports other than Solr).