I am querying an ArangoDB collection of about 500k documents via arangojs.query() with this very simple query:
"FOR c IN Entity FILTER c.id == 261764 RETURN c"
It is a node in a node-link graph.
But sometimes it takes more than 10 seconds, and the ArangoDB log also contains warnings about the query taking too long. It often happens when a new browser session is used. Is this a problem with ArangoDB, with arangojs, or is my query itself not optimized?
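For reference, the query is issued through arangojs roughly like this (a sketch; the connection details and the helper function are illustrative, not my exact code):

const { Database, aql } = require("arangojs");
const db = new Database({ url: "http://localhost:8529" }); // illustrative connection

async function findEntity(id) {
  // the aql template passes id as a bind parameter instead of concatenating it into the query string
  const cursor = await db.query(aql`FOR c IN Entity FILTER c.id == ${id} RETURN c`);
  return cursor.all();
}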
-------------------Edit----------------------
Added db.explain output:
Query string:
FOR c IN Entity FILTER c.id == 211764 RETURN c
Execution plan:
 Id   NodeType                    Est.   Comment
  1   SingletonNode                  1   * ROOT
  2   EnumerateCollectionNode   140270     - FOR c IN Entity   /* full collection scan */
  3   CalculationNode           140270       - LET #1 = (c.`id` == 211764)   /* simple expression */   /* collections used: c : Entity */
  4   FilterNode                140270       - FILTER #1
  5   ReturnNode                140270       - RETURN c
Indexes used:
none
Optimization rules applied:
none
As your explain output shows, your query doesn't use any indexes but does a full collection scan.
Depending on where it finds the match (at the start or the end of the collection), execution times may vary.
See the Indexing chapter for creating indexes, and the AQL Execution and Performance chapter for how to analyse the output of db._explain().
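For example, from arangosh you could create a persistent index on id (a sketch; on older 3.x versions a hash index serves the same purpose) and re-run the explain to verify it is picked up:

db.Entity.ensureIndex({ type: "persistent", fields: ["id"] });
db._explain("FOR c IN Entity FILTER c.id == 261764 RETURN c"); // should now show an IndexNode instead of a full collection scan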
I'm trying to understand why the query below takes so long to retrieve results. I have mocked up the values used, but the query itself is correct and returns 40 records (the a node has 8 different values and the z node has 5, so 40 combinations in total). It takes 2.5 minutes to return those 40 records. Please let me know what the issue is here. I suspect the Neo4j version and the infrastructure we're using in production.
After this query we run algo.kShortestPaths.stream, so the whole thing together takes more than 5 minutes. What do you suggest? Is there no other way to handle such combinations (more than 40 a/z node combinations) within 5 minutes?
Infrastructure details: Neo4j 3.5 Community Edition
2 separate datacenters, sync job; 64 GB mem, 16 GB CPU, 4 cores
Cypher Query:
MATCH (s:SiteNode {siteName: 'siteName1'})-[rl:CONNECTED_TO]-(a:EquipmentNode)
WHERE a.locationClli = s.siteName AND toUpper(a.networkType) = 'networkType1' AND NOT (toUpper(a.equipmentTid) CONTAINS 'TEST')
WITH a.equipmentTid AS tid_A
MATCH pp = (a:EquipmentNode)-[rel:CONNECTED_TO]-(a1:EquipmentNode)
WHERE a.equipmentTid = tid_A AND ALL( t IN relationships(pp)
WHERE t.type IN ['Type1'] AND (t.totalChannels > 0 AND t.totalChannelsUsed < t.totalChannels) AND t.networkId IN ['networkId1'] AND t.status IN ['status1', 'status2'] )
WITH a
MATCH (d:SiteNode {siteName: 'siteName2'})-[rl:CONNECTED_TO]-(z:EquipmentNode)
WHERE z.locationClli = d.siteName AND toUpper(z.networkType) = 'networkType2' AND NOT (toUpper(z.equipmentTid) CONTAINS 'TEST')
WITH z.equipmentTid AS tid_Z, a
MATCH pp = (z:EquipmentNode)-[rel:CONNECTED_TO]-(z1:EquipmentNode)
WHERE z.equipmentTid=tid_Z AND ALL(t IN relationships(pp)
WHERE t.type IN ['Type2'] AND (t.totalChannels > 0 AND t.totalChannelsUsed < t.totalChannels) AND t.networkId IN ['networkId2'] AND t.status IN ['status1', 'status2'])
WITH DISTINCT z, a
RETURN a.equipmentTid, z.equipmentTid
This query was built to handle small numbers of combinations (up to 4 total a/z node combinations), but today we might have more than 10, 40, or 100 combinations, so it times out. I'm not sure whether there's a better way to write the query to improve performance, assuming the Community Edition is good enough for our case.
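For context, the follow-up call is along these lines (a sketch with placeholder TIDs, k value, and config; not our exact invocation):

MATCH (a:EquipmentNode {equipmentTid: 'tidA1'}), (z:EquipmentNode {equipmentTid: 'tidZ1'})
// k = 3 and the null (unweighted) weight property are placeholders
CALL algo.kShortestPaths.stream(a, z, 3, null, {})
YIELD index, nodeIds, costs
RETURN index, nodeIds, costs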
I am currently working on a social media site with the usual timeline features: users can follow each other, create and share posts, block, unblock, etc. For that we have created two node labels, "User" and "Post", and several relationship types such as follow, block, private, etc.
Currently we have approximately 41,000 nodes and 650,000 relationships.
Hardware configuration:
8 GB RAM
2 cores
50 GB HDD
1 master and 2 slaves
We use the following query to get a user's timeline:
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'}),(p:Post{is_deleted:'0'}),(po:User{user_id:p.owner_id})
WHERE (p.post_type = '1' OR p.post_type = '4') WITH n,p,po
WHERE po.is_active='1' AND (n)-[:CREATED{own_status:'1'}]->(p) OR
(n)-[:FOLLOWS{follow_status:'1'}]->(:User{is_active:'1'})-[:CREATED{own_status:'1'}]->(p)
OR (n)-[:FOLLOWS{follow_status:'1'}]->(:Keyword{is_deleted:'0'})-[:KEYWORD]->(p)
WITH n,p,po
OPTIONAL MATCH (n)-[fr:FOLLOWS]->(po)
WHERE fr.follow_status='1' WITH p,n,po,fr
WHERE NOT ((n)-[:FOLLOWS{is_blocked:true}]->(po) OR (n)-[:FOLLOWS{is_mute:true}]->(po)) WITH p,n,po,fr
WHERE NOT (n)<-[:FOLLOWS{is_blocked:true}]-(po) WITH p,n,po,fr
WHERE (fr is not null and toInteger(po.is_private) <= 1 AND po.user_id <> n.user_id)
OR (toInteger(po.is_private) <= 1 AND po.user_id = n.user_id)
OR (toInteger(po.is_private) = 0 AND po.user_id <> n.user_id) WITH p,n,po
RETURN p,po,SIZE(()-[:LIKED]->(p)) as likecount,
SIZE((n)-[:LIKED]->(p)) as likestatus,count(*) as postcount
ORDER BY p.created_at DESC
SKIP 0 LIMIT 10
This query takes more than 10 seconds, which is far too slow.
Here is the profile of the above query.
Here is the index list.
Can anybody suggest what I am doing wrong?
If you're trying to get a user's timeline, I would think you'd start with the specific user, then connect to other nodes via the relationships you're interested in. The current query isn't taking advantage of pattern matching or the connected nature of a graph database.
The first MATCH statement of the query as it's currently written finds a specific user, then all Post nodes that have the property is_deleted:'0', and then all User nodes that are connected to any of the Post nodes. Searching this way gives you more database hits (54,984) in the first Expand(All) than there are nodes in the database (41,000).
Where you should get the most lift in optimizing this query is to focus your search on the single user then expand out from there using the relationships:
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'})-[r]-(p:Post{is_deleted:'0'})
This will match the user and all qualifying posts connected to the user via a relationship. Note, if a user isn't connected to any qualifying posts, there won't be any matches even if that user does exist in the database.
If you only want to include certain relationship types, you can specify that in this first MATCH statement like this:
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'})-[r:CREATED|FOLLOWS|KEYWORD]-(p:Post{is_deleted:'0'})
Or you can put it in the WHERE clause like this:
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'})-[r]-(p:Post{is_deleted:'0'})
WHERE type(r) in ['CREATED', 'FOLLOWS' , 'KEYWORD']
I didn't follow all your conditional statements (and I think you might be able to remove some of them once you convert it to pattern matching), but once you have your initial pattern you can add in whatever conditional statements you need. Example:
WHERE (p.post_type = '1' OR p.post_type = '4')
AND (r.own_status = '1' OR r.follow_status = '1')
AND NOT r.is_blocked = true
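Putting those pieces together, a rough sketch of the reworked query (the privacy/mute checks and the like/post counts from your original would still need to be layered back in):

MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342', is_active:'1'})-[r:CREATED|FOLLOWS|KEYWORD]-(p:Post {is_deleted:'0'})
WHERE (p.post_type = '1' OR p.post_type = '4')
  AND (r.own_status = '1' OR r.follow_status = '1')
  AND NOT r.is_blocked = true
RETURN p
ORDER BY p.created_at DESC
LIMIT 10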
For more on pattern matching, check out section 2.9 of the Neo4j Cypher Manual.
To support the application logic I have implemented my own Cypher query builder, which creates complex queries based on user input from the UI.
This is an example of such a query:
MATCH (dg:DecisionGroup)-[:CONTAINS]->(childD:Decision)
WHERE dg.id = {decisionGroupId}
MATCH (childD)-[relationshipValueRel2:HAS_VALUE_ON]-(filterCharacteristic2:Characteristic)
WHERE filterCharacteristic2.id = 2
WITH relationshipValueRel2, childD, dg
WHERE (ANY (id IN {relationshipValueRel21} WHERE id IN relationshipValueRel2.value )) AND ( relationshipValueRel2.`property.1.3` = {property2} ) OR ( relationshipValueRel2.`property.1.3` = {property3} )
WITH childD, dg
MATCH (childD)-[relationshipValueRel3:HAS_VALUE_ON]-(filterCharacteristic3:Characteristic)
WHERE filterCharacteristic3.id = 3
WITH relationshipValueRel3, childD, dg
WHERE (relationshipValueRel3.`value` = 'London')
WITH childD , dg
SKIP 0 LIMIT 100
WITH *
MATCH (childD)-[ru:CREATED_BY]->(u:User)
OPTIONAL MATCH (childD)-[rup:UPDATED_BY]->(up:User)
RETURN ru, u, rup, up, childD AS decision,
[ (dg)<-[:DEFINED_BY]-(entityGroup)-[:CONTAINS]->(entity)<-[:COMMENTED_ON]-(comg:CommentGroup)-[:COMMENTED_FOR]->(childD)
| {entityId: toInt(entity.id), types: labels(entity), totalComments: toInt(comg.totalComments)} ] AS commentGroups,
[ (c1)<-[vg1:HAS_VOTE_ON]-(childD)
| {criterionId: toInt(c1.id), weight: vg1.avgVotesWeight, totalVotes: toInt(vg1.totalVotes)} ] AS weightedCriteria,
[ (dg)<-[:DEFINED_BY]-(chg:CharacteristicGroup)-[rchgch1:CONTAINS]->(ch1:Characteristic)<-[v1:HAS_VALUE_ON]-(childD)
| {characteristicId: toInt(ch1.id), v1:v1, optionIds: v1.optionIds, valueIds: v1.valueIds, value: v1.value, available: v1.available, totalHistoryValues: v1.totalHistoryValues, totalFlags: v1.totalFlags, description: v1.description, valueType: ch1.valueType, visualMode: ch1.visualMode} ] AS valuedCharacteristics
With a small number of childD:Decision nodes (for example < 1000) it works pretty fast, but performance degrades noticeably as the data grows.
The primary reason for this is the filtering operations based on relationship properties, like:
WHERE (ANY (id IN {relationshipValueRel21}
WHERE id IN relationshipValueRel2.value ))
AND ( relationshipValueRel2.`property.1.3` = {property2} )
OR ( relationshipValueRel2.`property.1.3` = {property3} )
Please note that this is just one particular case; depending on the user input, the query can contain tens of such conditions with <, <=, =, >, >=, IN, ALL IN, ANY IN operations, where the property value can be a single value or an array.
Right now I'm thinking about how to improve the performance of this query, because I expect tens or hundreds of thousands of childD:Decision nodes. To that end, I'm looking into the APOC manual indexes on relationship properties.
Taking this into account, does it make sense to reimplement my query builder to use APOC manual indexes on relationship properties for the mentioned filtering operations, or will that not help here?
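For reference, the manual-index calls I have in mind look roughly like this (a sketch against APOC for Neo4j 3.x; the property names come from the query above, and the exact procedure signatures would need to be checked against the installed APOC version):

// index existing HAS_VALUE_ON relationships by the properties used in the filters
MATCH (:Decision)-[r:HAS_VALUE_ON]-(:Characteristic)
CALL apoc.index.addRelationship(r, ['value', 'property.1.3'])
RETURN count(*);

// query the manual index instead of scanning all HAS_VALUE_ON relationships
CALL apoc.index.relationships('HAS_VALUE_ON', 'value:London')
YIELD rel, start, end
RETURN start, end;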
I have a Pig script that took around 10 minutes to finish and I thought that there was still room for some performance improvement.
So, I started by putting the JOINs and GROUPs in a nested FOREACH and also putting the previous FILTERs inside the same FOREACH.
I also added USING 'replicated'.
The problem now is that instead of taking 10 minutes, it's taking over 30 minutes.
Is there a place with best practices and performance-improvement tips besides Pig's documentation?
So that you can get a better picture, here's some code:
--before
previous_join = JOIN A by id, B by id; --for simplification
filtering = FILTER previous_join BY ((year_min > 1995 ? year_min - 1 : year_min) <= list_year and (year_max > 2015 ? year_max - 1 : year_max) >= list_year);
final_filtered = FOREACH filtering GENERATE user_id as user_id, list_year;
--after
final_filtered = FOREACH (JOIN A by id, B by id) {
tmp = FILTER group BY ((A::year_min > 1995 ? A::year_min - 1 : A::year_min) <= B::list_year and (A::year_max > 2015 ? A::year_max - 1 : A::year_max) >= B::list_year and A::premium == 'true');
GENERATE A::user_id AS user_id, B::list_year AS list_year;
};
Am I doing something wrong or is this the wrong approach?
Thanks.
In the prior case [before] you are performing the filter and projection after the join is performed.
It would be helpful to log the time taken by each operation and identify the bottleneck.
Can you also try splitting your filter statement into multiple relations rather than just one, and check the difference in filter timing?
filter_by_min_year = FILTER previous_join BY ((A::year_min > 1995 ? A::year_min - 1 : A::year_min) <= B::list_year);
filter_by_max_year = FILTER filter_by_min_year BY ((A::year_max > 2015 ? A::year_max - 1 : A::year_max) >= B::list_year);
Overall you want to find ids (plus some more columns) with A::year_min <= B::list_year and A::year_max >= B::list_year.
Instead of performing the join on the raw A and B, you can try projecting both down to only the columns needed for the join and later operations:
A_projected = foreach A generate id, year_min, year_max;
B_projected = foreach B generate id, list_year;
C = join A_projected by id, B_projected by id USING 'replicated';
If either A_projected or B_projected is a small set that can be loaded into memory, use a replicated join; I am assuming B_projected is smaller than A_projected.
If this doesn't apply to your case, skip this option.
You can also try setting the number of reducers used for this join with the PARALLEL keyword.
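For example (10 reducers is just an illustrative number; tune it to your cluster):

-- PARALLEL controls the number of reducers for a reduce-side join;
-- it does not apply to a replicated join, which runs map-side
C = join A_projected by id, B_projected by id PARALLEL 10;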
After applying the filter you will get the list of required ids, which you can use to fetch other information from A or B.
Also consider tweaking MapReduce properties like io.sort.mb, mapred.job.shuffle.input.buffer.percent, etc.
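These can be set at the top of the Pig script; the values below are illustrative starting points, not recommendations:

SET io.sort.mb 512;                                 -- map-side sort buffer, in MB
SET mapred.job.shuffle.input.buffer.percent 0.70;   -- fraction of reducer heap used for the shuffle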
Hope this helps.
When I use the following LINQ query in LINQPad I get 25 results returned:
var result = (from l in LandlordPreferences
where l.Name == "Wants Student" && l.IsSelected == true
join t in Tenants on l.IsSelected equals t.IsStudent
select new { Tenant = t});
result.Dump();
When I add .Distinct() to the end I only get 5 results, so I'm guessing I'm getting 5 instances of each result with the query above.
I'm new to LINQ, so I'm wondering: is this because of a poorly built query, or is this the way LINQ always behaves? Surely not. If a query returned 500 rows with .Distinct(), would that mean 2,500 rows were returned without it? Would this compromise performance?
It's a poorly built query.
You are joining LandlordPreferences with Tenants on a boolean value instead of a foreign key.
So, most likely, you have 5 selected landlords and 5 tenants who are students. Each student is returned for each landlord: 5 × 5 = 25. This is a Cartesian product and has nothing to do with LINQ; a similar query in SQL would behave the same way.
If you added the landlord to your result (select new { Tenant = t, Landlord = l }), you would see that no two results are actually the same.
If you can't fix the query somehow, Distinct is your only option.
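For example, if Tenants carried a foreign key back to the landlord, the join would no longer produce a Cartesian product (LandlordId here is hypothetical; substitute whatever key your schema actually has):

var result = from l in LandlordPreferences
             where l.Name == "Wants Student" && l.IsSelected
             // join on a real key instead of a boolean flag (LandlordId is an assumed column)
             join t in Tenants on l.LandlordId equals t.LandlordId
             where t.IsStudent
             select new { Tenant = t };
result.Dump();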