Here are two queries that I would expect to be identical in terms of processing. Each was run multiple times to avoid cache effects distorting the timings:
MATCH (p:Pathway {name: {PNAME}}), (t:target {symbol: {TNAME}}) MERGE (p)-[:INVOLVES]->(t)
The above runs at 11,100 commands per second.
UNWIND {LIST} AS i MATCH (p:Pathway {name: i.PNAME}), (t:target {symbol: i.TNAME}) MERGE (p)-[:INVOLVES]->(t)
The above runs at 547 commands per second on the same data set.
Windows 10 Pro, 64 GB RAM, SSD, Python 3.7
There are unique constraints on both properties used in the statements above, and both indexes are online.
In other situations the LIST form is dramatically faster, so I like using it for bulk operations. I tested on Neo4j 3.4 and today on 3.4.4, with Python 3.6 and 3.7 and the latest neo4j-driver; same results. My guess is that the query planner is not using the indexes. There are about 40,000 nodes in Pathway and 25,000 in target.
Any suggestions? Thanks in advance.
Query plan when using a list. For this profile, the list contained one record.
Suggestion: could the query planner estimate the number of records in the list to decide whether to scan in all nodes or to use the unique index for each record? Maybe set a threshold: if fewer than 10% of the rows will be needed, use the unique index. Just a thought for the Neo4j developers. In the meantime I have dropped the LIST version.
The UNWIND {LIST} is what's killing you. You are completely changing the dynamics of the query from one form to the other.
The first query is a simple two-node lookup. The second query creates a batch of rows and then does a per-row two-node match.
In the first example it is obvious to the Cypher planner that it should use the indexes to match. In the latter, the planner doesn't know for sure what the best way to proceed is: run against the index for every row, or scan all the nodes and try to get everything it needs in one pass (or something else)?
You can use Cypher hints to try to help the planner choose the right approach, but in your case, use the first query. It is simpler and easier for the Cypher planner to plan, and the planner caches the plan so that it doesn't need to figure out what to do each time you re-run it. (The second query is cached too, but as far as I can tell it is only trying, and failing, to reproduce the performance boost of a parameterized query, so why not just use the mechanism Neo4j has built in?)
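As a rough sketch of that advice in Python (the connection details and the pairs data below are placeholders, not from the original post): send the first query row by row, batched into a single explicit transaction so the commit cost is paid once, and let the plan cache do its job.

from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# $pname/$tname is the newer parameter syntax; on Neo4j 3.x use {pname}/{tname}.
QUERY = (
    "MATCH (p:Pathway {name: $pname}), (t:target {symbol: $tname}) "
    "MERGE (p)-[:INVOLVES]->(t)"
)

pairs = [("Glycolysis", "HK1"), ("Glycolysis", "PFKM")]  # hypothetical rows

with driver.session() as session:
    tx = session.begin_transaction()
    for pname, tname in pairs:
        # One indexed two-node lookup per pair; the plan is compiled once
        # and reused because the query text never changes.
        tx.run(QUERY, pname=pname, tname=tname)
    tx.commit()

driver.close()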
Related
I have a database with 500K nodes and 700K relationships. I created 500 additional relationships with a new type, DummyEdge, with edge_id attributes from "1" to "500". Now I want to query and modify these relationships. Running a query like
MATCH ()-[e:DummyEdge {edge_id:"123"}]->() SET e.property="value"
is really slow: it takes around 300 ms, so running 500 such queries takes around 2-3 minutes. I also called CREATE INDEX ON :DummyEdge(edge_id), but it didn't speed up query execution.
Is there any way to make such bulk relationship modification faster?
CREATE INDEX creates an index for nodes, so such an index would make no difference in the performance of your query.
Since your MATCH pattern, ()-[e:DummyEdge {edge_id:"123"}]->(), provides no information about the end nodes, neo4j has to scan every relationship in the DB to find the ones you want. That is why your query is so slow.
It would be much more efficient if (as @MichaelHunger stated) your query provided useful information (like a label, or an indexed label/property pair) for either of the nodes in your MATCH pattern. That would help neo4j narrow down the number of relationships that need to be scanned. As an example, let's say the start node must have the Foo label:
MATCH (:Foo)-[e:DummyEdge {edge_id:"123"}]->()
SET e.property="value"
With the above query, neo4j would only have to look at the outgoing relationships of Foo nodes, which is much faster since neo4j can quickly find nodes with a given label (or via an index).
Now, neo4j also supports full-text schema indexes, which do cover relationships. However, those kinds of indexes require much more effort on your part, and may be overkill for your use case.
There are now relationship indexes that should speed up your operation massively:
https://neo4j.com/docs/cypher-manual/current/indexes-for-search-performance/#administration-indexes-create-a-single-property-b-tree-index-for-relationships
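As a minimal sketch of both ideas against a recent Neo4j (4.3+, where the relationship property index described at the link above exists; credentials and the property value are placeholders):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Neo4j 4.3+ syntax for an index on a relationship property.
    session.run(
        "CREATE INDEX dummy_edge_id IF NOT EXISTS "
        "FOR ()-[e:DummyEdge]-() ON (e.edge_id)"
    )
    # Batch all 500 updates into one statement instead of 500 round trips.
    rows = [{"edge_id": str(i), "value": "value"} for i in range(1, 501)]
    session.run(
        "UNWIND $rows AS row "
        "MATCH ()-[e:DummyEdge {edge_id: row.edge_id}]->() "
        "SET e.property = row.value",
        rows=rows,
    )

driver.close()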
I want to save a large graph in Redis and was trying to accomplish this using RedisGraph. To test this I was creating a test-graph first to check the performance characteristics.
The graph is rather small for the purposes we need.
Vertices: about 3.5 million
Edges: about 18 million
And this is very limited for our purposes, we would need to be able to increase this to 100's of millions of edges in a single database.
In any case, I was checking space and performance requirements, but stopped after loading only the vertices and seeing that the performance of a:
GRAPH.QUERY gid 'MATCH (t:token {token: "some-string"}) RETURN t'
is over 300 milliseconds for just this retrieval, which is absolutely unacceptable.
Am I missing an obvious way to improve the retrieval performance, or is that currently the limit of RedisGraph?
Thanks
Adding an index will speed things up a lot when matching.
CREATE INDEX ON :token(token)
From my investigations, I think that at least one instance of the item must exist for an index to be created, but I haven't measured the extra overhead of creating the index early and then adding most of the new nodes, versus adding the index after all items are in the tree so they can be indexed en masse.
If all nodes are labeled as "token", then redisgraph will have to scan 3.5 million entities, comparing each entity's "token" attribute against the value you've provided ("some-string").
For a speed-up I would recommend either adding an index, or limiting the number of results you would like to receive using LIMIT.
Also worth mentioning is that the first query to be served might take a while longer than subsequent queries due to internal memory management.
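For example, a sketch with the plain redis-py client (host and port are placeholders; the graph key gid and the label/property come from the question):

import redis

r = redis.Redis(host="localhost", port=6379)

# One-time: build an exact-match index on :token(token).
r.execute_command("GRAPH.QUERY", "gid", "CREATE INDEX ON :token(token)")

# The lookup can now use the index instead of scanning ~3.5M nodes,
# and LIMIT caps the result set.
reply = r.execute_command(
    "GRAPH.QUERY", "gid",
    'MATCH (t:token {token: "some-string"}) RETURN t LIMIT 1'
)
print(reply)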
I have a query like this as a key component of my application:
MATCH (group:GroupType)
WHERE group.Name = "String"
MATCH (node:NodeType)
WHERE (node)-[:MEMBER_OF]->(group)
RETURN node
There is an index on :GroupType(Name)
In a database of roughly 10,000 elements this query uses nearly 1 million database hits. Here is the PROFILE of the query:
However, this slight variation of the query which performs an identical search is MUCH faster:
MATCH (group:GroupType)
WHERE group.Name = "String"
MATCH (node:NodeType)-[:MEMBER_OF]->(group)
RETURN node
The only difference is that the node:NodeType match and the relationship match are merged into a single MATCH instead of a MATCH ... WHERE. This query uses 1/70th of the database hits of the previous query and is more than 10 times faster, despite performing an identical search:
I thought Cypher treated MATCH ... WHERE statements as single search expressions, so the two queries should compile to identical operations, but these two queries seem to be performing vastly different operations. Why is this?
I would like to start by saying that this is not actually a Cypher problem. Cypher describes what you want, not how to get it, so the performance of this query will vary vastly between, say, Neo4j 3.1.1 and Neo4j 3.2.3.
As the one executing the Cypher is the one that decides how to do it, the real question is "Why doesn't the Neo4j Cypher planner treat these the same?"
Ideally, both of these Cyphers should be equivalent to
MATCH (node:NodeType)-[:MEMBER_OF]->(group:GroupType {Name: "String"})
RETURN node
because they should all produce the same results.
In reality, there are a lot of subtle nuances to dynamically planning a query that has very many 'equivalent' expressions, and a subtle shift in context can break that equivalence, say if you made this adjustment:
MATCH (group:GroupType)
WHERE group.Name = "String"
MATCH (node:NodeType)
WHERE (node)-[:MEMBER_OF]->(group) OR SIZE(group.members) = 1
RETURN node
Now the two queries are almost nothing alike in their results. In order to scale, the query planner must make decision shortcuts to come up with an efficient plan as quickly as possible.
In short, the performance depends on the server you are throwing the query at, because coming up with an actionable lookup strategy for a language that lets you ask for ANYTHING/EVERYTHING is hard!
RELATED READING
Optimizing performance
What is Cypher?
MATCH ... WHERE <pattern> isn't the same as MATCH <pattern>.
The first query performs the match, then uses the pattern as a filter applied to all of the built-up rows.
You can see in the query plan that what's happening is a cartesian product between your first match results and all :NodeType nodes. Then, for each row of the cartesian product, the WHERE checks whether the :GroupType node on that row is connected to the :NodeType node on that row by the given pattern (this is the Expand(Into) operation).
The second query, by contrast, expands the pattern from the previously matched group nodes, so the nodes considered by the expansion are far fewer and almost immediately relevant, requiring only a final filter to ensure that they are :NodeType nodes.
EDIT
As Tezra points out, Cypher operates by having you define what you want, not how to get it; the "how" is the planner's job. In the current version of Neo4j (3.2.3) my explanation stands, in that the planner interprets each of the queries differently and generates different plans for each, but that may change as Cypher evolves and the planner improves.
In these cases, you should be running PROFILEs on your queries and tuning accordingly.
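A minimal sketch of doing that from the Python driver (connection details are placeholders; consume() returning a summary that carries the profile is how recent driver versions behave):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    result = session.run(
        "PROFILE "
        "MATCH (node:NodeType)-[:MEMBER_OF]->(group:GroupType {Name: $name}) "
        "RETURN node",
        name="String",
    )
    summary = result.consume()  # exhaust the stream so the profile is collected
    print(summary.profile)      # operator tree with db hits per operator

driver.close()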
I'm working on a simple index containing one million docs with 30 fields each.
A q=*:* query with a very low start value (0, for instance) takes only a few milliseconds (~1, actually).
The higher the start value is, the slower Solr gets:
start=100000 => 171 ms
start=500000 => 844 ms
start=1000000 => 1274 ms
I'm a bit surprised by this performance degradation, and I'm concerned since the index is supposed to grow to over a hundred million documents within a few months.
Did I maybe do something wrong in the schema? Or is this normal behavior, given that slicing docs beyond the first few hundred should usually not happen :)
EDIT
Thanks guys for those explanations. I was guessing something like that, but I preferred to be sure that it was not related to the way the schema was described. So the question is solved for me.
Every time you make a search query to Solr, it collects all the documents matching the query, then skips documents until the start value is reached, and then returns the results.
Another point to note is that every time you make the same search query with a higher start value, those documents are also not present in the cache, so it might refresh the cache as well (depending on the size and type of cache you have configured).
Pagination naively works by retrieving all the documents up to the cut-off point, throwing them away, then fetching enough documents to satisfy the number requested, and returning those.
If you're doing deep paging (going far into the dataset) this becomes expensive, so CursorMark support was implemented (see "Fetching A Large Number of Sorted Results: Cursors") to allow near-instant pagination into a large set of documents.
Yonik also has a good blog post about deep pagination in Solr.
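A sketch of what cursor paging looks like over plain HTTP (the core name mycore and the uniqueKey field id are assumptions; cursorMark requires a sort that includes the uniqueKey):

import requests

url = "http://localhost:8983/solr/mycore/select"
params = {
    "q": "*:*",
    "rows": 100,
    "sort": "id asc",   # must be a total order that includes the uniqueKey
    "cursorMark": "*",  # "*" starts the cursor at the beginning
    "wt": "json",
}

while True:
    resp = requests.get(url, params=params).json()
    docs = resp["response"]["docs"]
    # ... process docs ...
    next_cursor = resp["nextCursorMark"]
    if next_cursor == params["cursorMark"]:
        break  # cursor did not advance: no more results
    params["cursorMark"] = next_cursor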
We recently did some research on how to speed things up a bit in sphinxsearch.
We found a great way to speed things up is to use a distributed index.
We ran real-life tests, and found that queries execute somewhere between 35-40% faster when a distributed index is used.
What I mean by distributed is basically our regular index split into 4 parts (the box hosting this index has 4 cores), by adding AND id % 4 = 0 (and = 1, = 2, = 3 respectively) into the source for each part of the index.
FYI, id is our primary key / auto increment.
So, instead of having one huge index, this splits it up into 4.
And then we just use index type = distributed + local .... local .... local .... local .... for a 'put all the parts together' index.
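A rough sphinx.conf sketch of that layout (all names and the SQL are hypothetical; only the shape is the point):

source src_part0 : src_base
{
    # each shard takes every 4th row by primary key
    sql_query = SELECT id, title, body FROM documents WHERE id % 4 = 0
}
# ... src_part1 through src_part3 use id % 4 = 1 / 2 / 3 ...

index idx_part0
{
    source = src_part0
    path   = /var/data/idx_part0
}
# ... idx_part1 through idx_part3 are defined the same way ...

index idx_all
{
    type  = distributed
    local = idx_part0
    local = idx_part1
    local = idx_part2
    local = idx_part3
}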
We did some quick testing and the same results came back... only 35-40% faster :)
So, before we implement this site wide, we would like to know:
Does switching to a distributed index like the one mentioned above impact sorting in any way?
We ask this because we use sphinx for a number of SEO related items, and we NEED to keep the order of the results the same.
I should also mention that the queries, all query options, etc. stay the same; any and all changes were done on the daemon end.
Thanks!
Sorting should be unaffected. You suffer a bigger performance hit when using distributed indexes with high offsets, but the first few pages will be fine.
As far as I know the gotchas are grouping/clustering and kill-lists, but if you're not using those, there should be nothing to worry about.