Match query with relationship is taking too long to retrieve results - does it mean we need to upgrade Neo4j or the memory allocated? - performance

I'm trying to understand why the query below is taking so long to retrieve results. I have mocked up the values, but the query itself is correct and returns 40 records (the a node has 8 distinct values and the z node has 5, so 40 combinations in total). It takes 2.5 minutes to return those 40 records. Please let me know what the issue is here; I suspect it is the Neo4j version and the infrastructure we're using in production.
After this query we run algo.kShortestPaths.stream, so the whole thing together takes more than 5 minutes. What do you suggest? Is there no other way to handle such combinations (more than 40 a and z node combinations) within 5 minutes?
Infrastructure details: Neo4j 3.5 Community Edition
2 separate datacenters with a sync job; 64GB mem, 16GB CPU, 4 cores
Cypher Query:
MATCH (s:SiteNode {siteName: 'siteName1'})-[rl:CONNECTED_TO]-(a:EquipmentNode)
WHERE a.locationClli = s.siteName AND toUpper(a.networkType) = 'networkType1' AND NOT (toUpper(a.equipmentTid) CONTAINS 'TEST')
WITH a.equipmentTid AS tid_A
MATCH pp = (a:EquipmentNode)-[rel:CONNECTED_TO]-(a1:EquipmentNode)
WHERE a.equipmentTid = tid_A AND ALL( t IN relationships(pp)
WHERE t.type IN ['Type1'] AND (t.totalChannels > 0 AND t.totalChannelsUsed < t.totalChannels) AND t.networkId IN ['networkId1'] AND t.status IN ['status1', 'status2'] )
WITH a
MATCH (d:SiteNode {siteName: 'siteName2'})-[rl:CONNECTED_TO]-(z:EquipmentNode)
WHERE z.locationClli = d.siteName AND toUpper(z.networkType) = 'networkType2' AND NOT (toUpper(z.equipmentTid) CONTAINS 'TEST')
WITH z.equipmentTid AS tid_Z, a
MATCH pp = (z:EquipmentNode)-[rel:CONNECTED_TO]-(z1:EquipmentNode)
WHERE z.equipmentTid = tid_Z AND ALL(t IN relationships(pp)
WHERE t.type IN ['Type2'] AND (t.totalChannels > 0 AND t.totalChannelsUsed < t.totalChannels) AND t.networkId IN ['networkId2'] AND t.status IN ['status1', 'status2'])
WITH DISTINCT z, a
RETURN a.equipmentTid, z.equipmentTid
This query was built when we had small numbers of combinations (up to 4 a and z node combinations in total), but today we might have more than 10, 40, or 100, so it times out. I'm not sure whether there's a better way to write the query to improve performance, assuming Community Edition is good enough for our case.
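One thing that stands out: each WITH keeps only the equipmentTid string and then re-matches (a:EquipmentNode) by that property, which forces a per-row label scan unless :EquipmentNode(equipmentTid) is indexed. A sketch of the same logic that keeps the node variables bound instead (labels, properties and the single-hop pattern are taken from the query above; the index is an assumption about your schema):
// assumption: an index on the lookup property (Neo4j 3.5 syntax)
CREATE INDEX ON :EquipmentNode(equipmentTid);
// same filters, but a and z stay bound, so nothing is re-matched by property
MATCH (s:SiteNode {siteName: 'siteName1'})-[:CONNECTED_TO]-(a:EquipmentNode)
WHERE a.locationClli = s.siteName AND toUpper(a.networkType) = 'networkType1'
  AND NOT (toUpper(a.equipmentTid) CONTAINS 'TEST')
MATCH (a)-[t:CONNECTED_TO]-(:EquipmentNode)
WHERE t.type IN ['Type1'] AND t.totalChannels > 0 AND t.totalChannelsUsed < t.totalChannels
  AND t.networkId IN ['networkId1'] AND t.status IN ['status1', 'status2']
WITH DISTINCT a
MATCH (d:SiteNode {siteName: 'siteName2'})-[:CONNECTED_TO]-(z:EquipmentNode)
WHERE z.locationClli = d.siteName AND toUpper(z.networkType) = 'networkType2'
  AND NOT (toUpper(z.equipmentTid) CONTAINS 'TEST')
MATCH (z)-[t2:CONNECTED_TO]-(:EquipmentNode)
WHERE t2.type IN ['Type2'] AND t2.totalChannels > 0 AND t2.totalChannelsUsed < t2.totalChannels
  AND t2.networkId IN ['networkId2'] AND t2.status IN ['status1', 'status2']
WITH DISTINCT a, z
RETURN a.equipmentTid, z.equipmentTid
Running both versions under PROFILE should show whether NodeByLabelScan disappears. Note also that the toUpper(...) predicates cannot use an index, so storing those properties pre-normalised would help further.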

Related

Neo4j cypher query improvement (performance)

I have the following cypher query:
CALL apoc.index.nodes('node_auto_index','pref_label:(Foo)')
YIELD node, weight
WHERE node.corpus = 'my_corpus'
WITH node, weight
MATCH (selected:ontoterm{corpus:'my_corpus'})-[:spotted_in]->(:WEBSITE)<-[:spotted_in]-(node:ontoterm{corpus:'my_corpus'})
WHERE selected.uri = 'http://uri1'
OR selected.uri = 'http://uri2'
OR selected.uri = 'http://uri3'
RETURN DISTINCT node, weight
ORDER BY weight DESC LIMIT 10
The first part (up to the WITH) runs very fast (Lucene legacy index) and returns ~100 nodes. The uri property is also unique (selected = 3 nodes).
I have ~300 WEBSITE nodes. The execution time is 48749 ms.
Profile: (query plan screenshot not reproduced here)
How can I restructure the query to improve performance? And why are there ~13.8 million rows in the profile?
I think the problem was in the WITH clause, which expanded the results enormously. InverseFalcon's answer makes the query faster: 49 -> 18 sec (but still not fast enough). To avoid the enormous expansion I collected the websites. The following query takes 60 ms:
MATCH (selected:ontoterm)-[:spotted_in]->(w:WEBSITE)
WHERE selected.uri IN ['http://avgl.net/carbon_terms/Faser', 'http://avgl.net/carbon_terms/Carbon', 'http://avgl.net/carbon_terms/Leichtbau']
AND selected.corpus = 'carbon_terms'
WITH collect(DISTINCT w) AS websites
CALL apoc.index.nodes('node_auto_index','pref_label:(Fas OR Fas*)^10 OR pref_label_deco:(Fas OR Fas*)^3 OR alt_label:(Fa)^5') YIELD node, weight
WHERE node.corpus = 'carbon_terms' AND node:ontoterm
WITH websites, node, weight
MATCH (node)-[:spotted_in]->(w:WEBSITE)
WHERE w IN websites
RETURN node, weight
ORDER BY weight DESC
LIMIT 10
I don't see any occurrence of NodeUniqueIndexSeek in your plan, so the selected node isn't being looked up efficiently.
Make sure you have a unique constraint on :ontoterm(uri).
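For a 3.x install (implied by the legacy Lucene index usage), that would be:
CREATE CONSTRAINT ON (o:ontoterm) ASSERT o.uri IS UNIQUE;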
After the unique constraint is up, give this a try:
PROFILE CALL apoc.index.nodes('node_auto_index','pref_label:(Foo)')
YIELD node, weight
WHERE node.corpus = 'my_corpus' AND node:ontoterm
WITH node, weight
MATCH (selected:ontoterm)
WHERE selected.uri in ['http://uri1', 'http://uri2', 'http://uri3']
AND selected.corpus = 'my_corpus'
WITH node, weight, selected
MATCH (selected)-[:spotted_in]->(:WEBSITE)<-[:spotted_in]-(node)
RETURN DISTINCT node, weight
ORDER BY weight DESC LIMIT 10
Take a look at the query plan. You should see a NodeUniqueIndexSeek somewhere in there, and hopefully you should see a drop in db hits.

Why this simple ArangoDB query sometimes takes very long time

I am querying an ArangoDB collection of about 500k documents via arangojs's query() with this very simple query:
"FOR c IN Entity FILTER c.id == 261764 RETURN c"
It is a node in a node-link graph.
But sometimes it takes more than 10 seconds, and the ArangoDB log also contains a warning about the query taking too long. It often happens when a new session is used in the browser. Is this a problem with arangodb, arangojs, or is my query itself not optimized?
-------------------Edit----------------------
Added the explain output:
Query string:
FOR c IN Entity FILTER c.id == 211764 RETURN c
Execution plan:
Id  NodeType                  Est.    Comment
 1  SingletonNode                  1  * ROOT
 2  EnumerateCollectionNode   140270    - FOR c IN Entity  /* full collection scan */
 3  CalculationNode           140270      - LET #1 = (c.`id` == 211764)  /* simple expression */  /* collections used: c : Entity */
 4  FilterNode                140270      - FILTER #1
 5  ReturnNode                140270      - RETURN c
Indexes used:
none
Optimization rules applied:
none
As your explain output shows, your query doesn't utilize any indices but does a full collection scan.
Depending on where it finds the match (at the start or the end of the collection), execution times may vary.
See the Indexing chapter for creating indices, and the AQL Execution and Performance chapter for how to analyse the output of db._explain().
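A sketch of the index creation in arangosh (assuming ArangoDB 3.x; the attribute name comes from the query above):
// a persistent index on c.id lets the optimizer replace the
// full collection scan with an index lookup
db.Entity.ensureIndex({ type: "persistent", fields: ["id"] });
After that, db._explain() should list the index under "Indexes used" instead of showing an EnumerateCollectionNode with a full collection scan.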

Expressions and Filters SSRS

I have a few things I am struggling with, so hopefully I can ask them all at once.
I am using VS 2010 (I think with VB.NET) to build reports from SQL databases - I am mainly using matrix tables.
I have a report that combines multiple tables in one, but I'm not sure how to make it still show the tables that have no data - currently a blank one messes up the look of the full report.
In another scenario, how can I use an expression/custom code to filter out items in one row - for example, in a calculation where I only want to sum 3 items of 5, etc.?
How can I work out the % of a row or column based on criteria or filters, so if the total is 30 items and item 1 is 5, its % will be 17%, and all items total to 100%?
How can I work out the growth of a row/column, so if year 1 is 50 and year 2 is 60 the growth/variance will be 20%?
There are some issues with the expressions:
=IIF(Fields!Total_Amount__Excl_VAT_.Value = 0
OR Fields!Total_Amount__Excl_VAT_.Value = "", 0, Sum(Fields!Total_Amount__Excl_VAT_.Value))
The SUM should be around the IIF:
=SUM(IIF(Fields!Total_Amount__Excl_VAT_.Value = 0
OR Fields!Total_Amount__Excl_VAT_.Value = "", 0, Fields!Total_Amount__Excl_VAT_.Value))
The same issue, and the same fix, applies to the second, identical expression.
The growth formula looks correct - are you getting a different result than expected?
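For the percent-of-total and growth questions, expressions along these lines are typical (Amount, RowGroup, Year1 and Year2 are placeholder names, not fields from your report):
Percent of a row/column total - scope the inner SUM to the group that represents the total:
=Fields!Amount.Value / Sum(Fields!Amount.Value, "RowGroup")
Growth between two periods:
=(Fields!Year2.Value - Fields!Year1.Value) / Fields!Year1.Value
Give both textboxes a percentage format (e.g. P0) so 5/30 shows as 17% and (60-50)/50 as 20%.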

pig - need tips after performance tuning gone wrong

I have a Pig script that took around 10 minutes to finish and I thought that there was still room for some performance improvement.
So, I started by putting the JOINs and GROUPs in a nested FOREACH and also putting the previous FILTERs inside the same FOREACH.
I also added using 'replicated'.
The problem now is that instead of taking 10 minutes, it's taking over 30 minutes.
Is there a place with best practices and performance-tuning tips besides Pig's documentation?
So that you can get a better picture, here's some code:
--before
previous_join = JOIN A by id, B by id; -- for simplification
filtering = FILTER previous_join BY ((year_min > 1995 ? year_min - 1 : year_min) <= list_year and (year_max > 2015 ? year_max - 1 : year_max) >= list_year);
final_filtered = FOREACH filtering GENERATE user_id as user_id, list_year;
--after
final_filtered = FOREACH (JOIN A by id, B by id) {
tmp = FILTER group BY ((A::year_min > 1995 ? A::year_min - 1 : A::year_min) <= B::list_year and (A::year_max > 2015 ? A::year_max - 1 : A::year_max) >= B::list_year and A::premium == 'true');
GENERATE A::user_id AS user_id, B::list_year AS list_year;
};
Am I doing something wrong or is this the wrong approach?
Thanks.
In the prior ("before") case you are performing the filter and projection after the join is performed.
It would be helpful to log the time taken by each operation and identify the bottleneck operation.
Can you also try splitting your filter statement into multiple relations rather than just one, and check the difference in filter timing?
filter_by_min_year = FILTER previous_join BY ((A::year_min > 1995 ? A::year_min - 1 : A::year_min) <= B::list_year);
filter_by_max_year = FILTER filter_by_min_year BY ((A::year_max > 2015 ? A::year_max - 1 : A::year_max) >= B::list_year);
Overall you want to find ids (plus some more columns) with A::year_min <= B::list_year and A::year_max >= B::list_year.
Instead of performing the join on raw A and B, you can project both of them down to only the columns needed for the join and later operations:
A_projected = FOREACH A GENERATE id, year_min, year_max;
B_projected = FOREACH B GENERATE id, list_year;
C = JOIN A_projected BY id, B_projected BY id USING 'replicated';
If either A_projected or B_projected is a small set that can be loaded into memory, use a replicated join; I am assuming B_projected to be the smaller of the two.
If this doesn't apply to your case, please skip this option.
You can also try setting the number of reducers used for this join with the PARALLEL keyword.
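For example (the reducer count of 10 is arbitrary):
C = JOIN A_projected BY id, B_projected BY id PARALLEL 10;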
After applying filter you will get a list of required id's that you can use to fetch other information from A or B.
Also consider tweaking MapReduce properties like io.sort.mb, mapred.job.shuffle.input.buffer.percent etc.
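These can be set from within the Pig script itself (the values below are only illustrative):
set io.sort.mb 512;
set mapred.job.shuffle.input.buffer.percent 0.70;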
Hope this helps.

Elasticsearch: get total of GROUP BY and COUNT

I'm using the Elasticsearch Java client and I want to work out how many distinct items there are matching a combination. I'm currently using facets like this:
client.prepareSearch(indexName)
.addFacet(Efb.termsFacet("hotels")
.field("hotel.specialCode")
.script("term + ';;;' + _source.hotel.name + ';;;' + _source.offer.destinationId")
.facetFilter(filter)
.size(50))
I then get the total using:
case tf: TermsFacet =>
val entries = tf.getEntries
val totalHotels = entries.size()
This is wrong because I am asking for the total returned, which in my query I limited to 50 - so this number will never exceed 50.
How can I get the total number of facets, without having to pull them all back into memory? Or is there a better way to do the equivalent of a SQL GROUP BY and COUNT in elasticsearch?
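If the client is on 1.x or later, the aggregations API offers a cardinality aggregation that estimates the distinct count server-side, so the terms never have to be pulled back into memory. A sketch under that assumption, reusing the script from the facet above (names like distinct_hotels are placeholders):
// approximate distinct count, computed server-side; no buckets are returned
SearchResponse resp = client.prepareSearch(indexName)
    .setSearchType(SearchType.COUNT)
    .addAggregation(AggregationBuilders.cardinality("distinct_hotels")
        .script("doc['hotel.specialCode'].value + ';;;' + _source.hotel.name + ';;;' + _source.offer.destinationId"))
    .execute().actionGet();
Cardinality agg = resp.getAggregations().get("distinct_hotels");
long total = agg.getValue();
The facetFilter would need to move into the query itself (e.g. a filtered query), since aggregations are computed over the query scope.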
