We are trying to optimize a query, but its execution time explodes (~20 seconds) with only around 40K nodes in the database; it should be much faster.
First, a simplified description of our schema. We have the following nodes:
Usergroup
Feature
Asset
Section
We also have the following relationships:
A Feature has only one Section (IS_IN_SECTION)
A Feature has one or more Assets (CONTAINS_ASSET)
An asset may be restricted for a Usergroup (HAS_RESTRICTED_ASSET)
A Feature may be restricted for a Usergroup (HAS_RESTRICTED_FEATURE)
A Section, and therefore all the Features in that Section, may be restricted for a Usergroup (HAS_RESTRICTED_SECTION)
A Usergroup may have a parent Usergroup (HAS_PARENT_GROUP) and must fulfill both its own restrictions and those of its parents
The goal is, given a Usergroup, to list the top 20 Assets ordered by date that have no restrictions for that Usergroup.
The current query is similar to:
(1)
MATCH path=(:UserGroup {uid: $usergroup_uid})-[:HAS_PARENT_GROUP*0..]->(root:UserGroup)
WHERE NOT (root)-[:HAS_PARENT_GROUP]->(:UserGroup)
WITH nodes(path) AS usergroups
UNWIND usergroups AS ug
(2)
MATCH (node:Asset)
WHERE NOT (node)<-[:CONTAINS_ASSET]-(:Feature)-[:IS_IN_SECTION]->(:Section)<-[:HAS_RESTRICTED_SECTION {restriction_type: "view"}]-(ug)
AND NOT (node)<-[:HAS_RESTRICTED_ASSET {restriction_type: "view"}]-(ug)
AND NOT (node)<-[:CONTAINS_ASSET]-(:Feature)<-[:HAS_RESTRICTED_FEATURE {restriction_type: "view"}]-(ug)
RETURN DISTINCT node
ORDER BY node.date DESC
SKIP 0
LIMIT 20
We have a few more types of restrictions, but this is the main idea.
Some observations we have made are:
If we execute only part (1) of the query, adding RETURN ug after the UNWIND, it completes in about 1 ms.
If we replace part (1) with MATCH (ug:Usergroup {uid: $usergroup_uid}), ignoring the parent groups, the whole query completes in around 800 ms. If we then put the original part (1) back, it takes 8 seconds, even when the Usergroup has no parents.
Our current database is small compared to the expected size (~6 million nodes), the number of restrictions will grow, and we need to optimize this kind of query.
For that, we have these questions:
Are the NOT <restriction> conditions (e.g. NOT (node)<-[:HAS_RESTRICTED_ASSET {restriction_type: "view"}]-(ug)) appropriate in this kind of situation, or are there other approaches that get the job done more efficiently?
Do we need any type of index?
Is the structure of the schema right, or are there any inefficiencies?
How can we rewrite part (1) of the query, and what do you think is causing its overhead?
The database version is Neo4j 3.5.X
Thanks in advance.
Let me answer your questions one by one:
The NOT <restriction> type of condition can prove inefficient when the restricting paths overlap, because it can lead to duplicate work. Consider the following two restrictions in your query:
NOT (node)<-[:CONTAINS_ASSET]-(:Feature)-[:IS_IN_SECTION]->(:Section)<-[:HAS_RESTRICTED_SECTION {restriction_type: "view"}]-(ug)
and NOT (node)<-[:CONTAINS_ASSET]-(:Feature)<-[:HAS_RESTRICTED_FEATURE {restriction_type: "view"}]-(ug)
In both of these checks, Neo4j may expand the CONTAINS_ASSET relationship and the Feature nodes separately, once for the first path match and again for the second. This duplicate work should be reduced if it is happening. Profile your query in Neo4j Browser (prefix it with PROFILE) to see how the planner actually plans and executes it.
In terms of indexes, you can create two: one on the uid property of UserGroup, and one on the date property of Asset; the latter may help if you have many Asset nodes, especially if date is stored as a string. Again, profile your query to see which indexes actually come into play during execution.
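For reference, in Neo4j 3.5 syntax those indexes would look something like this (assuming the labels and property names match the ones in your query):
CREATE INDEX ON :UserGroup(uid);
CREATE INDEX ON :Asset(date);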
In terms of schema, I noticed that in the second part of your query the three checks
NOT (node)<-[:CONTAINS_ASSET]-(:Feature)-[:IS_IN_SECTION]->(:Section)<-[:HAS_RESTRICTED_SECTION {restriction_type: "view"}]-(ug)
NOT (node)<-[:HAS_RESTRICTED_ASSET {restriction_type: "view"}]-(ug)
NOT (node)<-[:CONTAINS_ASSET]-(:Feature)<-[:HAS_RESTRICTED_FEATURE {restriction_type: "view"}]-(ug)
are all asking the same question: is this asset restricted for the user group, whether directly, via a feature, or via a section? One thing you can do here is create intermediate relationships between Asset and UserGroup nodes. For example, you could add an IS_RESTRICTED_DUE_TO_A_FEATURE relationship between an Asset and a UserGroup whenever the asset belongs to a feature the user group is restricted from. That way the path match shrinks from NOT (node)<-[:CONTAINS_ASSET]-(:Feature)<-[:HAS_RESTRICTED_FEATURE {restriction_type: "view"}]-(ug) to NOT (node)<-[:IS_RESTRICTED_DUE_TO_A_FEATURE]-(ug), which should be faster. Obviously, this change will affect your other CRUD operations, and you may want to store some properties on the new relationships as well.
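As a rough, untested sketch (the relationship name and the copied property are just suggestions), the feature-level restrictions could be materialized like this:
MATCH (ug:UserGroup)-[r:HAS_RESTRICTED_FEATURE {restriction_type: "view"}]->(f:Feature)-[:CONTAINS_ASSET]->(a:Asset)
MERGE (ug)-[:IS_RESTRICTED_DUE_TO_A_FEATURE {restriction_type: r.restriction_type}]->(a)
A similar statement would cover the section-level restrictions, and these derived relationships would need to be kept in sync whenever restrictions, features, or assets change.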
As for part (1), I am not sure what is causing the overhead, but I suggest you add the index on the UserGroup uid property, if it is not already present, and then simplify the first part so it matches the whole parent chain directly, without building paths and unwinding:
MATCH (:UserGroup {uid: $usergroup_uid})-[:HAS_PARENT_GROUP*0..]->(ug:UserGroup)
RETURN DISTINCT ug
If it performs well, then try modifying your second part to this:
MATCH (node:Asset)
OPTIONAL MATCH (node)<-[:CONTAINS_ASSET]-(f:Feature)-[:IS_IN_SECTION]->(s:Section)
WITH ug, node, f, s
WHERE (s IS NULL OR NOT (s)<-[:HAS_RESTRICTED_SECTION {restriction_type: "view"}]-(ug))
AND NOT (node)<-[:HAS_RESTRICTED_ASSET {restriction_type: "view"}]-(ug)
AND (f IS NULL OR NOT (f)<-[:HAS_RESTRICTED_FEATURE {restriction_type: "view"}]-(ug))
RETURN DISTINCT node
ORDER BY node.date DESC
SKIP 0
LIMIT 20
Please try out the above suggestions. The queries are untested, so adjust them if the output is not what you expect, and do profile them to find the slowest parts. Hopefully this helps.
I am trying to build out a social graph between 100k users. Users can sync other social media platforms or upload their own contacts. Building each relationship takes about 200ms. Currently, I have everything uploaded on a queue so it can run in the background, but ideally, I can complete it within the HTTP request window. I've tried a few things and received a few warnings.
Added an index to the field pn
I am getting the warning "This query builds a cartesian product between disconnected patterns." I understand why I am getting it, but no relationship exists yet, and creating one is the point of this initial call.
MATCH (p1:Person {userId: "....."}), (p2:Person) WHERE p2.pn = "....." MERGE (p1)-[:REL]->(p2) RETURN p1, p2
Any advice on how to make it faster? Ideally, each relationship creation is around 1-2ms.
You may want to EXPLAIN the query and make sure that NodeIndexSeeks are being used, and not NodeByLabelScan. You also mentioned an index on :Person(pn), but you have a lookup on :Person(userId), so you might be missing an index there, unless that was a typo.
Regarding the cartesian product warning, disregard it. The cartesian product is necessary to get the two nodes between which the relationship is created. This should be a 1 x 1 = 1 row operation, so it will only be costly if multiple nodes are matched per side or if index lookups aren't being used.
If these are part of some batch load operation, then you may want to apply your query in batches. If 100 contacts are being loaded for a user, you do NOT want to execute 100 separate queries, each adding a single contact. Instead, pass the list of contacts as a parameter, then UNWIND the list and run the query once to process the entire batch.
Something like:
UNWIND $batch as row
MATCH (p1:Person {pn: row.p1}), (p2:Person {pn: row.p2})
MERGE (p1)-[:REL]->(p2)
RETURN p1, p2
It's usually okay to batch 10k or so entries at a time, though you can adjust that depending on the complexity of the query.
Check out this blog entry for how to apply this approach.
https://dzone.com/articles/tips-for-fast-batch-updates-of-graph-structures-wi
You can use the index you created on Person by suggesting a planner hint.
Reference: https://neo4j.com/docs/cypher-manual/current/query-tuning/using/#query-using-index-hint
CREATE INDEX ON :Person(pn);
MATCH (p1:Person {userId: "....."})
WITH p1
MATCH (p2:Person) USING INDEX p2:Person(pn)
WHERE p2.pn = "....."
MERGE (p1)-[:REL]->(p2)
RETURN p1, p2
I have this situation: starting from a table, I have to check all the records that match a key. If records are found, I have to check another table using a key from the first table, and so on, more or less over five levels. Is there a way to do this recursively, or do I have to write all the code "by hand"? The language I am using is Visual FoxPro. If this is not possible, is it at least possible to use recursion to populate a treeview?
You can set a relation between tables. For example:
USE table_1.dbf IN 0 SHARED
USE table_2.dbf IN 0 SHARED
SET ORDER TO TAG key_field OF table_2.cdx IN table_2
SET RELATION TO key_field INTO table_2 ADDITIVE IN table_1
The first two commands open table_1 and table_2. Then you set the order/index of table_2; if you don't have an index on the key field, this will not work. The final command sets the relation between the two tables on the key field.
From here you can browse both tables and table_2's records will be filtered based on table_1's key field. Hope this helps.
If the tables have similar structure or you only need to look at a few fields, you could write a recursive routine that receives the name of the table, the key to check, and perhaps the fields you need to check as parameters. The tricky part, I guess, is knowing what to pass down to the next call.
I don't think I can offer any more advice without at least seeing some table structures.
Sorry for answering so late, but the problem was indeed that recursion wasn't a viable solution, since I had to search across multiple tables. So I resolved it by doing a simple two-level search in the tables I needed.
Thank you very much for the help, and sorry again for answering so late.
I have a huge graph with about 11M relationships.
When I run the query MATCH (n) DETACH DELETE n, it seems to take forever to finish.
I did some research and found that I should delete the relationships in batches with a limit, and then the nodes, using this query:
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
WITH r LIMIT 200000
DELETE r
RETURN count(r) as deletedCount
Yet, since I'm doing some performance comparisons, it does not seem right to me to sum the per-batch deletion times to get the time to delete the whole graph, especially since that total changes with the limit value (deleting 2,000 relationships at a time is not the same as deleting 20,000 at once).
How can I solve that problem ?
Any help would be appreciated
You can use apoc.periodic.commit to help you with batching. It requires the APOC plugin, which has lots of useful functions that extend Cypher.
You can use the following cypher query.
call apoc.periodic.commit("
match (node)
with node limit {limit}
DETACH DELETE node
RETURN count(*)
",{limit:10000})
This runs the inner query in batches until it returns 0, which in this case means no nodes are left in the database. You can play around with different limit settings to see what works best.
Hope this helps
Suppose I have a large (300-500k) collection of text documents stored in the relational database. Each document can belong to one or more (up to six) categories. I need users to be able to randomly select documents in a specific category so that a single entity is never repeated, much like how StumbleUpon works.
I don't really see a way to implement this with slow NOT IN queries given a large number of users and documents, so I figured I might need to implement some custom data structure for this purpose. Perhaps there is already a paper describing an algorithm that could be adapted to my needs?
Currently I'm considering the following approach:
Read all the entries from the database
Create a linked-list-based index for each category from the IDs of documents belonging to that category. Shuffle it.
Create a Bloom Filter containing all of the entries viewed by a particular user
Traverse the index with the iterator, randomly selecting items and using the Bloom filter to pick only unviewed ones.
If you track in a table which entries the user has seen, try this. I'm going to use MySQL because that's the quickest example I can think of, but the gist should be clear.
On a link being 'used'...
insert into viewed (userid, url_id) values ("jj", 123)
On looking for a link...
select p.url_id
from pages p left join viewed v on v.url_id = p.url_id
where v.url_id is null
order by rand()
limit 1
This makes the database do a one-for-one join, and you're limiting the query to return only one entry that the user has not seen yet.
Just a suggestion.
Edit: It is possible to make this one operation but there's no guarantee that the url will be passed successfully to the user.
It depends on how users get their random entries.
Option 1:
A user pages through some entities and stops after a couple of them. For example, the user sees the current random entity, moves to the next one, reads it, continues a couple of times, and that's it.
The next time this user (or another) requests an entity from this category, the set of already-viewed entities is cleared, so you may return an entity that was viewed before.
In that option I would recommend saving a (hash) set of already-viewed entity IDs; every time the user asks for a random entity, randomly choose one from the DB and check that it is not already in the set.
Because the set is so small and your data is so big, the chance of drawing an already-viewed ID is tiny, so this will take O(1) most of the time.
Option 2:
A user pages through the entities, and the viewed entities are persisted for every user across visits to your page.
In that case you will probably go through all the entities in each category, and storing all the viewed entities plus checking whether an entity has been viewed will take some time.
In that option I would get all the IDs for the category, shuffle them, and store them in a linked list. When you want a random not-yet-viewed entity, just take the head of the list and delete it (O(1)).
I assume that for any given <user, category> pair, the number of documents viewed is pretty small relative to the total number of documents available in that category.
So can you just store indexed triples <user, category, document> indicating which documents have been viewed, and then just take an optimistic approach with respect to randomly selected documents? In the vast majority of cases, the randomly selected document will be unread by the user. And you can check quickly because the triples are indexed.
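For illustration only (the table and column names here are made up), the triples and the optimistic check could look roughly like this:
CREATE TABLE viewed (
  user_id     INT NOT NULL,
  category_id INT NOT NULL,
  document_id INT NOT NULL,
  PRIMARY KEY (user_id, category_id, document_id)
);
-- after randomly picking candidate_id for this user and category,
-- this indexed lookup tells you whether it was already viewed
SELECT 1 FROM viewed
WHERE user_id = ? AND category_id = ? AND document_id = ?;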
I would opt for a pseudorandom approach:
1.) Determine number of elements in category to be viewed (SELECT COUNT(*) WHERE ...)
2.) Pick a random number in range 1 ... count.
3.) Select a single document (SELECT * FROM ... WHERE [same as when counting] ORDER BY [some stable order]). Depending on the SQL dialect in use, there are different clauses for retrieving only the part of the result set you want (MySQL's LIMIT clause, SQL Server's TOP clause, etc.).
If the number of documents is large, the chance of serving the same user the same document twice is negligibly small. Using the scheme described above, you don't have to store any state information at all.
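A minimal sketch of that scheme, assuming a documents table and a document_categories join table (names are illustrative) and MySQL-style LIMIT/OFFSET:
-- 1) count the documents in the category
SELECT COUNT(*) FROM documents d
JOIN document_categories dc ON dc.document_id = d.id
WHERE dc.category_id = ?;
-- 2) pick a random n in [0, count - 1] in application code, then
-- 3) fetch the n-th document in a stable order
SELECT d.* FROM documents d
JOIN document_categories dc ON dc.document_id = d.id
WHERE dc.category_id = ?
ORDER BY d.id
LIMIT 1 OFFSET ?;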
You may want to consider a nosql solution like Apache Cassandra. These seem to be ideally suited to your needs. There are many ways to design the algorithm you need in an environment where you can easily add new columns to a table (column family) on the fly, with excellent support for a very sparsely populated table.
Edit: one of many possible solutions below.
Create a CF (column family, i.e. table) for each category (creating these on the fly is quite easy).
Add a row to each category CF for each document belonging to that category.
Whenever a user hits a document, you add a column named after that user to the document's row and set it to true. Obviously this table will be huge, with millions of columns and probably quite sparsely populated, but that's no problem; reading it is still constant time.
Now finding a new document for a user in a category is simply a matter of selecting any row where the column for that user is null.
You should get constant-time writes and reads, excellent scalability, etc., if you can accept Cassandra's "eventually consistent" model (i.e., it is not mission-critical that a user never gets a duplicate document).
I've solved a similar problem in the past by indexing the relational database into a document-oriented form using Apache Lucene. This was before the recent rise of NoSQL servers and is basically the same thing, but it's still a valid alternative approach.
You would create a Lucene Document for each of your texts with a textId (relational database id) field and multi valued categoryId and userId fields. Populate the categoryId field appropriately. When a user reads a text, add their id to the userId field. A simple query will return the set of documents with a given categoryId and without a given userId - pick one randomly and display it.
Store a user's past X selections in a cookie or something.
Return the last selections to the server with the user's new criteria.
Randomly choose one of the texts satisfying the criteria until it is not a member of the last X selections of the user.
Return this choice of text and update the list of last X selections.
I would experiment to find the best value of X, but I have in mind something like 16.
May the order of rows in an unordered query (like SELECT * FROM some_table) differ between executions (in the same session or a different one) if there are no updates to the database table?
The order or rows returned from a query should never be relied upon unless you have included a specific ORDER BY clause in your query.
You may find that even without the ORDER BY the results appear in the same order, but you cannot guarantee this will be the case, and relying on it would be foolish, especially when an ORDER BY clause will fulfill your requirements.
See this question: Default row ordering for select query in oracle
It has an excellent quote from Tom Kyte about record ordering.
So to answer your question: yes, the order of rows in an unordered query may differ between queries and sessions, as it depends on multiple factors over which you may have no control (if you are not a DBA, etc.).
Hope this helps...