I'm solving a complex pathfinding problem in my database that can't be expressed in Cypher, so I need to divide it up into multiple queries (and write a complex recursive set of functions).
My question is about the performance of running multiple queries that involve the same node. When query A returns a node X and node X is needed in the next query B, what is the best way of telling Neo4j to look for node X in query B?
The simplest way would be to give every node a name, return X.name in query A, and use WHERE X.name = ... in query B. I suspect this is really slow, because Neo4j would have to check the name of every node in the database. Is there a faster way, or is this actually the best option?
EDIT: because the question might not be completely clear, I'll give some more information on the problem I'm solving
I want to get the person that has the best knowledge of a given skill, for example physics. In the database there's a connection between physics and another skill, for example maths, that tells that knowledge of maths is useful for physics. But now I need to know how skilled every person is in maths, which is the same process again. This would make sense to do recursively, but as far as I know there's no recursion in Cypher, so I'll have to split it up into multiple queries.
What I want to avoid is that, when a link is found between physics and maths, the function that calculates every person's knowledge of maths has to go through every node in the database to find the one where name = 'maths', which is very inefficient.
I don't know if I understood your question completely, but I think that a good starting point is to create an index on the name property of your nodes. From the docs:
A database index is a redundant copy of information in the database for the purpose of making retrieving said data more efficient. This comes at the cost of additional storage space and slower writes, so deciding what to index and what not to index is an important and often non-trivial task.
CREATE INDEX ON :NodeLabel(name)
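To make that concrete, here is a rough sketch of the two-query pattern from your example. The Skill and Person labels and the USEFUL_FOR and KNOWS relationship types are just guesses, so rename them to fit your model, and note that the index has to be created for the label you actually match on:

CREATE INDEX ON :Skill(name)

// Query A: find the skills that are useful for 'physics' and return their names.
MATCH (s:Skill {name: 'physics'})-[:USEFUL_FOR]-(related:Skill)
RETURN related.name

// Query B: look the returned skill up again by name. With the index in place,
// this is an index seek on :Skill(name), not a scan over every node.
// (In practice you would pass the name as a query parameter rather than a literal.)
MATCH (m:Skill {name: 'maths'})<-[:KNOWS]-(p:Person)
RETURN p.name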
Related
I can't quite tell if this is a bad question, but I think it has a definitive answer...
I'm working on building my first graph database. It will hold nodes that are references to content. These nodes will be connected to term nodes. Each term node can be one of about seven types (Person, Organization, Jargon, etc.).
What is the best way to implement the types of terms in the database as it relates to query speed? Users will search for content based on the terms and the idea is to allow them to filter the terms based on their types.
Storing the type as a property seems out of the question, as it would require accessing a JSON object for every term during a query.
(contentNode:content)-[:TAGGED_WITH]-(termNode:term {type: {"people":false,"organizations":false,"physicalObjects":true,"concepts":true,...}})
Labels intuitively make sense to me as the different types really are just labeling the term nodes more specifically. Each term node could have the label 'term' as well as the relevant types. I have some confusion about this, but it seems labels cannot be used as dynamic properties in a cypher query as it prevents the query from being cached/properly indexed.
(contentNode:content)-[:TAGGED_WITH]-(termNode:term:physicalObject:jargon:...)
The last option I can think of would be to have a node for each of the term 'types' and connect the term to the relevant type nodes. Right now this seems like the best option (despite being the most verbose).
(contentNode:content)-[:TAGGED_WITH]-(termNode:term)-[:IS_TYPE]-(typeNode:termType {name: 'jargon'}), (termNode:term)-[:IS_TYPE]-(typeNode:termType {name: 'physical object'}), (termNode:term)-[:IS_TYPE]-(typeNode:termType {name: ...})
Can anyone with more experience/knowledge weigh in on this? Thanks a lot.
I'm not sure I completely understand what you're trying to do but I wanted to answer a few of the points and then maybe you can elaborate:
but it seems labels cannot be used as dynamic properties in a cypher query as it prevents
the query from being cached/properly indexed.
Using dynamic labels won't have an impact on indexing, but you're partially right about the caching. The Cypher parser keeps a cache of queries that it's seen before so that it doesn't have to regenerate the query plan each time. Given that you only have a limited number of labels, it wouldn't take long until you've cached all the combinations anyway.
I would suggest trying out the various models with a subset of your data and measuring the query time and query readability of each.
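For example, prefixing the candidate queries with PROFILE (using the labels from your question, so rename as needed) shows the query plan and database hits for each model, which you can compare directly:

// Model 1: types as extra labels on the term nodes.
PROFILE
MATCH (c:content)-[:TAGGED_WITH]-(t:term:physicalObject)
RETURN c

// Model 2: types as separate nodes connected to the terms.
PROFILE
MATCH (c:content)-[:TAGGED_WITH]-(t:term)-[:IS_TYPE]-(:termType {name: 'physical object'})
RETURN c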
Mark
For example, there are 5 object stores. I am thinking of inserting documents into them, but not in sequential order. Initially it might be sequential, but if I could insert using some ranking method, it would be easier to know which object store to search to find a document. The goal is to reduce the number of object store searches. This can only be achieved if the insertion uses some intelligent algorithm.
One method I found useful is using the current year MOD N (the number of object stores) to determine where a document goes. Are there better approaches to this?
If you want fast access there are a couple of criteria:
The hash function has to be reproducible based on the data that is queried. This means a lot depends on the queries you expect.
You usually want to distribute your objects as evenly across stores as possible. If you want to go parallel, you want the documents for a given query to come from different stores, so they will not block each other. Hence your hashing function should spread similar documents across different stores as much as possible. If you expect documents related to the same query to be from the same year, do not use the year directly.
This assumes you want fast queries that can be parallelised. If you instead have a system in which you first have to open a potentially expensive connection to the store, then most documents related to the same query should go into the same store, and you should not take my advice above.
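To illustrate the first two points, here is a toy Python sketch. The key (a document id string) is just an assumption; hash whatever field your queries actually look up by:

# Toy sketch: a reproducible hash of the query key, spread evenly over N stores.
import hashlib

NUM_STORES = 5

def store_for(document_key):
    """Return the index (0..NUM_STORES-1) of the object store for this key."""
    digest = hashlib.sha1(document_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_STORES

# The same key always maps to the same store, while similar keys (for example
# documents from the same year) still scatter across different stores.
print(store_for("invoice-2014-0001"), store_for("invoice-2014-0002"))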
Your criterion for "what goes in a FileNet object store?" is basically "what documents logically belong together?".
I would like to represent a mutable graph in Prolog in an efficient manner. I will be searching for subsets in the graph and replacing them with other subsets.
I've managed to get something working using the database as my 'graph storage'. For instance, I have:
:- dynamic step/2.
% step(Type, Name).
:- dynamic sequence/2.
% sequence(Step, NextStep).
I then use a few rules to retract subsets I've matched and replace them with new steps using assert. I'm really liking this method... it's easy to read and deal with, and I let Prolog do a lot of the heavy pattern-matching work.
The other way I know to represent graphs is using lists of nodes and adjacency connections. I've seen plenty of websites using this method, but I'm a bit hesitant because it's more overhead.
Execution time is important to me, as is ease-of-development for myself.
What are the pros/cons for either approach?
As usual: using the dynamic database gives you indexing, which may speed things up (on look-up) and slow you down (on asserting). In general, the dynamic database is not so good when you assert more often than you look up. The main drawback, though, is that it also significantly complicates testing and debugging, because you cannot test your predicates in isolation, and need to keep the current implicit state of the database in mind.
Lists of nodes and adjacency connections are a good representation in many cases.
A different representation I like a lot, especially if you need to store further attributes for nodes and edges, is to use one variable for each node, and use variable attributes (get_attr/3 and put_attr/3 in SWI-Prolog) to store edges on them, for example [edge_to(E1,N_1),edge_to(E2,N_2),...] where the N_i are the variables representing other nodes (with their own attributes), and the E_j are also variables onto which you can attach further attributes to store additional information (weight, capacity etc.) about each edge if needed.
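A minimal sketch of that last representation in SWI-Prolog could look as follows (the module name graph_attrs and the attribute layout are just one way to set it up):

:- module(graph_attrs, [add_edge/3, edge/3]).

% Attach a directed edge From -> To; the edge itself is a fresh variable E
% carrying its weight as an attribute.
add_edge(From, To, Weight) :-
    (   get_attr(From, graph_attrs, Edges0)
    ->  true
    ;   Edges0 = []
    ),
    put_attr(E, graph_attrs, weight(Weight)),
    put_attr(From, graph_attrs, [edge_to(E,To)|Edges0]).

% Enumerate the edges leaving a node, together with their weights.
edge(From, To, Weight) :-
    get_attr(From, graph_attrs, Edges),
    member(edge_to(E,To), Edges),
    get_attr(E, graph_attrs, weight(Weight)).

% Hook invoked if such a variable is ever unified with a non-variable;
% here we simply disallow it, keeping nodes distinct.
attr_unify_hook(_, _) :- false.

% Example (nodes are plain variables, so the graph lives in whatever scope
% shares those variables):
% ?- add_edge(A, B, 3), add_edge(A, C, 5), edge(A, To, W).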
Have you considered using SWI-Prolog's RDF database? http://www.swi-prolog.org/pldoc/package/semweb.html
As mat said, dynamic predicates have an extra cost.
If, however, you can construct the graph once and don't need to change it afterwards, you can compile the predicate and it will be as fast as a normal predicate.
Usually in SWI-Prolog predicate lookup is done using hash tables on the first argument (they are resized in the case of dynamic predicates).
Another solution is association lists, where the cost of lookup etc. is O(log n).
Once you understand how they work, you could easily write an interface if needed; a small sketch follows below.
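For example, a minimal adjacency-map interface over library(assoc) could look like this (the predicate names are just suggestions):

:- use_module(library(assoc)).

% Build an adjacency map (node -> list of neighbours) from From-To pairs.
edges_to_graph(Edges, Graph) :-
    empty_assoc(Empty),
    foldl(add_pair, Edges, Empty, Graph).

add_pair(From-To, Graph0, Graph) :-
    (   get_assoc(From, Graph0, Ns)
    ->  true
    ;   Ns = []
    ),
    put_assoc(From, Graph0, [To|Ns], Graph).

% Look up the neighbours of a node in O(log n).
neighbours(Graph, Node, Neighbours) :-
    get_assoc(Node, Graph, Neighbours).

% Example: ?- edges_to_graph([a-b, a-c, b-c], G), neighbours(G, a, Ns).
%          Ns = [c, b].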
In the end, you can always use an SQL database and the ODBC interface to submit queries (although that sounds like overkill for the application you mentioned).
I have a graph (and it is a graph, because one node might have many parents) that contains nodes with the following data:
Keyword Id
Keyword Label
Number of previous searches
Depth of keyword promotion
The relevance is rated with a number starting from 1.
The relevance of a child node is determined by the distance from the parent node to the child node, minus the depth of the keyword's promotion.
The display order of child nodes at the same depth is determined by the number of previous searches.
Is there an algorithm that is able to search such a data structure?
Will I have an efficiency issue if I need to traverse all nodes, cache the generated results, and display them in pages, considering that this should scale well for a large number of users? If I do have an issue, how can it be resolved?
What kind of database do I need to use? A NoSQL, a relational one or a graph database?
What would the schema look like?
Can this be done using django-haystack?
It seems you're trying to compute a top-k query over a graph. There is a variety of algorithms fit to solve this problem; the simplest one that I believe will help you here is the Threshold Algorithm (TA), with the traversal over the graph done in a BFS fashion. Other top-k algorithms include the Lawler-Murty procedure and various TA variants.
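To make the BFS part concrete, here is a rough Python sketch (not the full TA). It assumes 'distance' means distance from the searched keyword, and the dictionary layout and field names are made up for illustration:

from collections import deque
import heapq

# Rough sketch: BFS over the keyword graph, scoring each child as
# (distance from the searched keyword) - (its promotion depth), breaking ties
# at the same relevance by the number of previous searches.
# `graph` maps a keyword id to {'promotion': int, 'searches': int, 'children': [ids]}.
def top_k_keywords(graph, root, k):
    seen = {root}
    queue = deque([(root, 0)])        # (keyword id, distance from root)
    scored = []                       # (relevance, keyword id)
    while queue:
        node, dist = queue.popleft()
        for child in graph[node]['children']:
            if child in seen:
                continue
            seen.add(child)
            relevance = (dist + 1) - graph[child]['promotion']
            scored.append((relevance, child))
            queue.append((child, dist + 1))
    # Lower relevance values are better (they start from 1); more previous
    # searches wins within the same relevance.
    return heapq.nsmallest(k, scored,
                           key=lambda t: (t[0], -graph[t[1]]['searches']))

# Example:
# graph = {'physics': {'promotion': 0, 'searches': 10, 'children': ['maths']},
#          'maths':   {'promotion': 0, 'searches': 7,  'children': []}}
# top_k_keywords(graph, 'physics', 5)  ->  [(1, 'maths')]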
Regarding efficiency: computing the query itself might take exponential time, simply due to the exponential number of results to be returned, but when using a TA the time between outputting consecutive results should be relatively short. As far as caching and scale are concerned, the usual considerations apply: you'll probably want to use a distributed system once the scale gets large, together with the appropriate TA version (such as the Threshold Join Algorithm). Of course you'll need to consider the scaling and caching issues when choosing which database solution to use as well.
As far as the database goes, you should definitely use one that supports graphs as first-class citizens (these tend to be known as graph databases), and I believe it doesn't matter whether the storage engine behind the graph database is relational or NoSQL. One point to note is that you'll probably want to make sure the database you choose can handle the scale you require (so for large scale you'll perhaps want to look into more distributed solutions). The schema will depend on the database you choose (assuming it isn't a schema-less database).
Last but not least: Haystack. As Haystack will work with whatever the search engine you choose works with, there should be at least one possible way to do it (combining Apache Solr for search with Neo4j or GoldenOrb for the database), and maybe more (I'm not really familiar with Haystack or the search engines it supports other than Solr).
I'm not really trying to compress a database. This is more of a logical problem. Is there any algorithm that will take a data table with lots of columns and repeated data and find a way to organize it into many tables with IDs, in such a way that in total there are as few cells as possible, and that these tables can then be joined with a query to replicate the original one?
I don't care about any particular database engine or language. I just want to see if there is a logical way of doing it. If you will post code, I like C# and SQL but you can use any.
I don't know of any automated algorithms, but what you really need to do is heavily normalize your database. This means looking at your actual functional dependencies and breaking them off wherever it makes sense.
The problem with trying to do this in a computer program is that it isn't always clear whether your current set of stored data represents all possible problem cases. You can't only look at the number of distinct values either: it makes little sense to break off booleans into their own table just because they have only two values, for example, and this is only the tip of the iceberg.
I think that at this point, nothing is going to beat good ol' patient, hand-crafted normalization. This is something to do by hand. Any possible computer algorithm will either make a total mess of things or make you define the relationships such that you might as well do it all yourself.
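For what it's worth, here is a tiny hand-made example (with made-up table and column names) of what that breaking off looks like in SQL: the repeated values move into their own table with an ID, and a join reproduces the original wide table.

-- Before: every order row repeats the customer's name and city.
--   orders(order_id, customer_name, customer_city, amount)

-- After: the repeated customer data lives in its own table, referenced by ID.
CREATE TABLE customers (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100),
    customer_city VARCHAR(100)
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT REFERENCES customers(customer_id),
    amount      DECIMAL(10,2)
);

-- The original wide table can be reproduced with a join:
SELECT o.order_id, c.customer_name, c.customer_city, o.amount
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;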