Do graph databases have problems with aggregation operations? - performance

I came across multiple opinions that graph databases tend to have problems with aggregation operations. For example, if you have a set of users and want to get the maximum age, an RDBMS will outperform a graph database. Is this true, and if it is, what is the reason behind it? As far as I understand, the key difference between relational and graph databases is that each graph database node somehow includes references to the nodes it is connected to. How does that impact a "get max age"-like query?
Disclaimer: most of what I have read was about Neo4j, but I suppose that if these limitations exist, they apply to any graph database.

The use of graph databases like Neo4j is recommended when dealing with connected data and complex queries.
The book Learning Neo4j by Rik Van Bruggen states that you should not use graph databases for simple, aggregate-oriented queries:
From the book:
(...) simple queries, where write patterns and read patterns align to
the aggregates that we are trying to store, are typically served quite
inefficiently in a graph, and would be more efficiently handled by an
aggregate-oriented Key-Value or Document store. If complexity is low,
the advantage of using a graph database system will be lower too.
The reason is closely tied to the nature of the persistence model: it is easier to compute a sum, max, or avg over tabular data than over data stored as a graph.
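To make that concrete, here is a toy sketch in plain Python (not any database's actual storage engine) of the same "max age" aggregate over a contiguous column versus over linked node records:

```python
# Illustrative sketch only: the same aggregate over two data layouts.

# Tabular/columnar layout: the ages sit together, so the aggregate is
# one tight scan over a contiguous sequence of values.
ages = [34, 27, 61, 45, 19]
max_age_tabular = max(ages)

# Graph-style layout: each user is a record carrying its properties
# plus adjacency (references to connected nodes). Answering the same
# query still means visiting every node record and plucking one
# property out of each -- the adjacency information buys nothing here.
users = [
    {"id": i, "age": a, "friends": []}  # 'friends' would hold node refs
    for i, a in enumerate(ages)
]
max_age_graph = max(u["age"] for u in users)

assert max_age_tabular == max_age_graph == 61
```

The point is not that `max` over dicts is slow in Python, but that a graph's per-node record layout forces a node-by-node visit for a query that never uses the relationships the graph is optimized for.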

Related

Neo4j Large Scale Aggregation - sub-second time possible?

Our team is currently evaluating Neo4j, and graph databases as a whole, as a candidate for our backend solution.
The upsides - the flexible data model, fast traversals in a native graph store - are all very applicable to our problem space.
However, we also need to perform large-scale aggregations on our datasets. I'm testing a very simple use case with a simple data model: (s:Specimen)-[:DONOR]->(d:Donor)
A Specimen has an edge relating it to a Donor.
The dataset I loaded has ~6 million Specimens, and a few hundred Donors. The aggregation query I want to perform is simple:
MATCH (s: Specimen)-[e: DONOR]->(d: Donor)
WITH d.sex AS sex, COUNT(s.id) AS count
RETURN count, sex
Performance is very slow: the result takes ~9 seconds to return. We need sub-second response times for this solution to work.
We are running Neo4j on an EC2 instance with 32vCPU units and 256GB of memory, so compute power shouldn't be a blocker here. The database itself is only 15GB.
We also have indexes on both the Specimen and Donor nodes, as well as an index on the Donor.sex property.
Any suggestions on improving the query times? Or are Graph Databases simply not cut out for such large-scale aggregations?
You will more than likely need to refactor your graph model. For example, you may want to investigate using multiple labels (e.g. something like Specimen:Male/Specimen:Female), if appropriate, as this will act as a pre-filter before scanning the DB.
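A hedged sketch of that refactor (the label names and the 'sex' values are assumptions; adapt them to your data):

```cypher
// One-off migration step: copy the donor's sex onto each Specimen as an
// extra label, so the aggregate no longer has to touch the DONOR edge.
MATCH (s:Specimen)-[:DONOR]->(d:Donor)
WHERE d.sex = 'male'
SET s:Male;

// The aggregate then becomes a label scan; for a single label, Neo4j
// can often answer count(s) straight from its internal count store.
MATCH (s:Male)
RETURN count(s) AS males;
```

The trade-off is write-time work and a denormalized model in exchange for an aggregation that never traverses the ~6 million edges.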
You may find the following blog posts helpful:
Modelling categorical variables
Modelling relationships
Modelling flights, which talks about dealing with dense nodes

Is a GraphQL related to Graph Database?

According to wikipedia: Graph Database
In computing, a graph database (GDB) is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data.[1] A key concept of the system is the graph (or edge or relationship). The graph relates the data items in the store to a collection of nodes and edges, the edges representing the relationships between the nodes.
If a database has a GraphQL API, is this database a Graph database?
Both terms sound very similar.
They are not related. GraphQL is just an API technology, often compared to REST. Think of it as another way to implement a Web API; it has nothing to do with where the data is actually stored or the storage technology behind the scenes. For example, it can be used as a Web API in front of PostgreSQL too.
But since GraphQL treats the data as an object graph, in terms of API implementation it may be a better match when working with a graph database. It may be easier to implement, as we can delegate some of the graph-loading work to the graph database rather than solving it ourselves.
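To make the distinction concrete, here is a hypothetical GraphQL query (the schema and field names are invented for illustration):

```graphql
# A GraphQL query describes a tree of fields over an object graph.
# Nothing here says how the data is stored: a resolver may answer
# "friends" with a SQL JOIN in PostgreSQL just as well as with a
# graph traversal in Neo4j.
{
  user(id: "42") {
    name
    friends {
      name
    }
  }
}
```

Tools exist on both sides of the divide (for instance, the Neo4j GraphQL Library for Neo4j and PostGraphile for PostgreSQL both generate such resolvers), which illustrates that the API layer and the storage model are independent choices.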

Neptune Graph Database Performance Cost to List Edges of a Type

The use case for the graph database is users and contents (vertices) linked by likes, favorites, and reports relations (edges). The problem is that I will sometimes need to show the reported contents (from any user). Since this is not a standard graph traversal, I fear it would incur a big performance hit.
Is it possible to index the edges of type "reports" to quickly get the list of all contents that have been reported? Is there a better way to do this?
No, you cannot (and do not need to) explicitly manage indices. Neptune uses a novel indexing strategy based on semi-clustered indices and offers excellent index performance out of the box. There is no need for custom indices.
From Neptune FAQs: https://aws.amazon.com/neptune/faqs/
Do I need to create indices on my data with Amazon Neptune?
No, existing graph database users are often forced to try and outguess the vendor implementation. Explicitly maintaining indices is just one aspect of that. Amazon Neptune does not require you to create specific indices to achieve good query performance, and it minimizes the need for such second guessing of the database design.
Can you share some details on the specific queries that you are looking for?

Efficient traversal/search algorithm to fetch data from RDF?

I have my data as an RDF graph in a DB, and I am retrieving it using SPARQL. Now the nodes (objects) in the graph have grown huge, and traversal/search has become much slower.
a. Can anyone suggest an efficient traversal/search algorithm to fetch the data?
As a next step, I have federated data, i.e. data from external applications like SAP. In this case, the search becomes even slower.
b. What efficient search algorithm should I use in this case?
This seems like a common issue in large enterprise systems, and any input on how these problems have been solved in such systems would also be helpful.
I had a similar problem. I was doing a lot of graph traversal using SPARQL property paths, and it was too slow with an RDF-based repository. I was using Jena TDB, which is supposed to be fast, but it was still too slow!
As @Mikos suggested, I tried Neo4j. It then got much faster. As Mark Watson says in this blog entry,
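The kind of property-path query involved might look like this (a hypothetical illustration; the prefix and predicate are standard FOAF, but the data is invented):

```sparql
# A transitive property path: "everyone reachable from alice via
# foaf:knows, at any depth". Paths like knows+ force the store to
# materialize an unbounded traversal, which is where RDF stores can
# struggle compared with native graph traversal engines.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person WHERE {
  <http://example.org/alice> foaf:knows+ ?person .
}
```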
RDF data stores support SPARQL queries: good for matching patterns in data.
Neo4j supports arbitrary graph structures and seems best for exploring
a neighborhood of a graph: start at a node and explore the connected
nodes. (graph traversal)
I used Neo4j, but you can try any tool built for graph traversal. I have read that AllegroGraph 4 is RDF-based and has good graph traversal speed.
Now I'm using Neo4j, but I didn't give up on RDF. I still use URIs as identifiers and try to reuse the popular RDF vocabularies and relations. Later I'll add a feature to render my graphs as RDF. I know that with Neo4j you can also use Tinkerpop to render RDF, but I haven't tried it myself.
Graph traversal and efficient querying is a wide-ranging problem and the approach to use is dependent on your situation. I would suggest looking at a data-store like Neo4j and complementing it with a tool like Lucene.

Algorithms for searching a graph that represents relevance to a certain keyword

I have a graph (and it is a graph because one node might have many parents) that contains nodes with the following data:
Keyword Id
Keyword Label
Number of previous searches
Depth of keyword promotion
The relevance is rated with a number starting from 1.
The relevance of a child node is determined by the distance from the parent node to the child node, minus the depth of the keyword's promotion.
The display order of child nodes at the same depth is determined by the number of previous searches.
Is there an algorithm that is able to search such a data structure?
Do I have an efficiency issue if I need to traverse all nodes, cache the generated results, and display them in pages, considering that this should scale well for a large number of users? If I do have an issue, how can it be resolved?
What kind of database do I need to use? A NoSQL, a relational one or a graph database?
What would the schema look like?
Can this be done using django-haystack?
It seems you're trying to compute a top-k query over a graph. There is a variety of algorithms fit to solve this problem; the simplest one that I believe will help you is the Threshold Algorithm (TA), where the traversal over the graph is done in BFS fashion. Other top-k algorithms include the Lawler-Murty procedure and further TA variations.
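A minimal sketch of TA in Python may help. The data layout is an assumption for illustration: one list per attribute (here relevance and popularity), each sorted by value descending, with every item present in every list and a monotone aggregation function such as `sum`:

```python
import heapq

def threshold_algorithm(sorted_lists, score, k):
    """Fagin's Threshold Algorithm (TA), minimal sketch.

    sorted_lists: one list per attribute, each of (item, value) pairs
                  sorted by value descending.
    score:        monotone aggregation over one value per attribute.
    Returns the k (score, item) pairs with the highest aggregate score.
    """
    # Random-access lookup tables, one per attribute.
    lookup = [dict(lst) for lst in sorted_lists]
    top = []          # min-heap of (score, item), size <= k
    seen = set()
    for row in zip(*sorted_lists):            # sorted access, in parallel
        for item, _ in row:
            if item in seen:
                continue
            seen.add(item)
            s = score([tbl[item] for tbl in lookup])    # random access
            if len(top) < k:
                heapq.heappush(top, (s, item))
            elif s > top[0][0]:
                heapq.heapreplace(top, (s, item))
        # Threshold: best score any unseen item could still achieve,
        # built from the last value seen in each sorted list.
        threshold = score([value for _, value in row])
        if len(top) == k and top[0][0] >= threshold:
            break                                       # TA stopping rule
    return sorted(top, reverse=True)

# Toy example: two ranked attribute lists (relevance, popularity).
relevance  = [("a", 0.9), ("b", 0.8), ("c", 0.1)]
popularity = [("b", 0.9), ("a", 0.5), ("c", 0.4)]
top2 = threshold_algorithm([relevance, popularity], sum, 2)
# top2 holds "b" (score 1.7) then "a" (score 1.4); TA stopped after two
# rounds of sorted access without ever fully scoring "c".
```

The early-stopping rule is the whole point: once the current top-k all beat the threshold, no unseen item can displace them, so the traversal can halt without visiting the rest of the graph.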
Regarding efficiency: computing the query itself might take exponential time, simply due to the exponential number of results to be returned, but when using a TA the time between outputting results should be relatively short. As far as caching and scale are concerned, the usual considerations apply: you'll probably want a distributed system when the scale grows, along with an appropriate TA version (such as the Threshold Join Algorithm). Of course, you'll need to consider the scaling and caching issues when choosing a database solution as well.
As far as the database goes, you should definitely use one that supports graphs as first-class citizens (these tend to be known as graph databases), and I believe it doesn't matter whether the storage engine behind the graph database is relational or NoSQL. One point to note: you'll probably want to make sure the database you choose can scale to the level you require (so for large scale you may want to look into more distributed solutions). The schema will depend on the database you choose (assuming it isn't a schema-less database).
Last but not least: Haystack. Since Haystack works with whatever search engine you choose, there should be at least one possible way to do it (combining Apache Solr for search with Neo4j or GoldenOrb for the database), and maybe more (I'm not really familiar with Haystack or the search engines it supports other than Solr).