Optimize Neo4j cypher query on huge dataset - performance

The following query can't run on a dataset with ~2M nodes. What should I do to make it run faster?
MATCH (cc:ConComp)-[r1:IN_CONCOMP]-(p1:Person)-[r2:SAME_CLUSTER]-(p2:Person)
WHERE cc.cluster_type = "household"
MERGE (cluster:Cluster {CLUSTER_TMP_ID:cc.CONCOMP_ID + '|' + r2.root_id, cluster_type:cc.cluster_type })
MERGE (cluster)-[r3:IN_CLUSTER]-(p1)

A number of suggestions:
Adding directions to your relationships will decrease the number of paths in the MATCH.
Make sure that you have indexes on all the properties that you MERGE on.
In the second MERGE, also add a direction (see the sketch below).
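A minimal sketch of the directed version of the query; the arrow on IN_CONCOMP (pointing from Person to ConComp) and on the second MERGE are assumptions about the data model and should be adjusted to match yours:
MATCH (cc:ConComp)<-[r1:IN_CONCOMP]-(p1:Person)-[r2:SAME_CLUSTER]-(p2:Person)
WHERE cc.cluster_type = "household"
MERGE (cluster:Cluster {CLUSTER_TMP_ID: cc.CONCOMP_ID + '|' + r2.root_id, cluster_type: cc.cluster_type})
MERGE (cluster)-[r3:IN_CLUSTER]->(p1)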

I finally found a solution by using the following query (and by indexing cc.cluster_type and cc.CONCOMP_ID):
CALL apoc.periodic.iterate('MATCH (cc:ConComp)<-[r1:IN_CONCOMP]-(p1:Person)-[r2:SAME_CLUSTER]-(p2:Person) WHERE cc.cluster_type = "household" WITH DISTINCT cc.CONCOMP_ID + "|" + r2.root_id as id_name, cc.cluster_type as cluster_type_name, p1 RETURN id_name, cluster_type_name, p1', '
MERGE (cluster:Cluster {CLUSTER_TMP_ID: id_name, cluster_type: cluster_type_name})
MERGE (cluster)-[r3:IN_CLUSTER]->(p1)', {batchSize:10000, parallel:false})
I should point out that I had previously run the query from my original question with apoc.periodic.iterate, without success.
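For reference, the supporting indexes mentioned above could be created with something like the statements below; the exact syntax depends on the Neo4j version (this is the 4.x form; on 3.x the older CREATE INDEX ON :ConComp(cluster_type) form applies), and the index names are purely illustrative:
CREATE INDEX concomp_cluster_type IF NOT EXISTS FOR (c:ConComp) ON (c.cluster_type);
CREATE INDEX concomp_id IF NOT EXISTS FOR (c:ConComp) ON (c.CONCOMP_ID);
CREATE INDEX cluster_tmp_id IF NOT EXISTS FOR (c:Cluster) ON (c.CLUSTER_TMP_ID); // also speeds up the MERGE on the Cluster node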

Related

How to get uniques and sum based on criteria set in Google Sheets?

I'm trying to get these summed, but I can't wrap my head around it.
Here is a visual demo:
Here is the link to the sheet, in case you feel like jumping in:
https://docs.google.com/spreadsheets/d/1gh5w0czg2JuoA3i5wPu8_eOpC4Q4TXIRhmUrg53nKMU/edit?usp=sharing
I did an INDEX+MATCH, but of course this isn't going to get me anywhere:
=iferror(INDEX(E:E;MATCH(1;(F:F=I$6)*(A:A=$H$7);0));"-")
You can use two nested QUERY calls to find the uniques (by grouping and counting them), and then select only the column of numbers. The inner QUERY also filters by date and type:
=SUM(QUERY(QUERY(A:F;"SELECT A,B,E,COUNT(A) where A = '"&H7&"' AND F = date'"&TEXT(I6;"yyyy-mm-dd")&"' group by A,B,E");"SELECT Col3"))
As an array formula, to cover all the dates:
=MAKEARRAY(2;COUNTA(I6:6);LAMBDA(r;c;SUM(QUERY(QUERY(A:F;"SELECT A,B,E,COUNT(A) where A = '"&INDEX(H7:H8;r)&"' AND F = date'"&TEXT(INDEX(I6:6;;c);"yyyy-mm-dd")&"' group by A,B,E");"SELECT Col3"))))
I added a working solution to your sheet here:
=MAKEARRAY(2;COUNTA(I6:O6);LAMBDA(r;c;SUM(LAMBDA(z;MAP(INDEX(z;;1);INDEX(z;;2);INDEX(z;;3);LAMBDA(a;e;f;IFNA(FILTER(e;a=INDEX(H7:H8;r);f=INDEX(I6:O6;;c))))))(UNIQUE({A:A\E:E\F:F})))))

Multiple consecutive join operations on PySpark

I am running a PySpark application where we are comparing two large datasets of 3 GB each. There are some differences in the datasets, which we are isolating via an outer join.
mismatch_ids_row = (sourceonedf.join(sourcetwodf, on=primary_key, how='outer')
                    .where(condition)
                    .select(primary_key))
mismatch_ids_row.count()
So the output of the join, based on the count, is a small dataset of, say, 10 records. The number of shuffle partitions at this point is about 30, which was derived as amount of data / partition size (100 MB).
After this join, each of the two original datasets is joined with the resulting dataframe to filter out the mismatched data for each side.
df_1 = sourceonedf.join(mismatch_ids_row, on=primary_key, how='inner').dropDuplicates()
df_2 = sourcetwodf.join(mismatch_ids_row, on=primary_key, how='inner').dropDuplicates()
Here we drop duplicates, since the result of the first join will contain doubled rows from the outer join, where some values are null.
These two dataframes are then joined to do a column-level comparison and pinpoint exactly where the data is mismatched.
df = df_1.join(df_2, on=some_condition, how="full_outer")
df.count()
The resultant dataframe is then displayed with:
df.show()
The issue is that the first join, which involves the most data, uses a sort-merge join with 30 shuffle partitions, which is fine since that dataset is fairly large.
Once the first join is done, there are only 10 mismatched rows, yet joining them back against the 3 GB datasets is a costly operation, and using broadcast didn't help.
The major issue, in my opinion, comes when the two small resultant datasets are joined in the second join to produce the result. Here, too many shuffle partitions are killing the performance.
The application runs in client mode as a Spark run for testing purposes, and the parameters are sufficient for it to run on the driver node.
Here is the DAG for the last operation:
As an example:
data1 = [(335008138387,83165192,"yellow","2017-03-03",225,46),
(335008138384,83165189,"yellow","2017-03-03",220,4),
(335008138385,83165193,"yellow","2017-03-03",210,11),
(335008138386,83165194,"yellow","2017-03-03",230,12),
(335008138387,83165195,"yellow","2017-03-03",240,13),
(335008138388,83165196,"yellow","2017-03-03",250,14)
]
data2 = [(335008138387,83165192,"yellow","2017-03-03",300,46),
(335008138384,83165189,"yellow","2017-03-03",220,10),
(335008138385,83165193,"yellow","2017-03-03",210,11),
(335008138386,83165194,"yellow","2017-03-03",230,12),
(335008138387,83165195,"yellow","2017-03-03",240,13),
(335008138388,83165196,"yellow","2017-03-03",250,14)
]
from pyspark.sql.types import StructType, StructField, LongType, IntegerType, StringType

field = [
StructField("row_num",LongType(),True),
StructField("tripid",IntegerType(),True),
StructField("car_type",StringType(),True),
StructField("dates", StringType(), True),
StructField("pickup_location_id", IntegerType(), True),
StructField("trips", IntegerType(), True)
]
schema = StructType(field)
sourceonedf = spark.createDataFrame(data=data1,schema=schema)
sourcetwodf = spark.createDataFrame(data=data2,schema=schema)
They have just two differences; on a larger dataset, think of these as 10 or more differences.
df_1 will get rows from sourceonedf based on mismatch_ids_row, and so will df_2 from sourcetwodf. They are then joined to create another resultant dataframe which outputs the data.
How can we optimize this piece of code so that the optimum number of partitions is used and it performs faster than it does now?
At this point it takes ~500 seconds to do the whole activity, when it could take about 200 seconds less. And why does the show() take time as well? There are only 10 records, so it should print pretty fast if they are all in one partition, I guess.
Any suggestions are appreciated.
You should be able to go without df_1 and df_2. After the first 'outer' join you have all the data in that table already.
Cache the result of the first join (as you said, the dataframe is small):
# (Removed the select after the first join)
mismatch_ids_row = sourceonedf.join(sourcetwodf, on=primary_key, how='outer').where(condition)
mismatch_ids_row.cache()
mismatch_ids_row.count()
Then you should be able to create a self-join condition. When joining, use dataframe aliases for explicit control:
result_df = (
    mismatch_ids_row.alias('a')
    .join(mismatch_ids_row.alias('b'), on=some condition...)
    .select(...)
)
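Putting this together on the example data from the question, a minimal sketch could look like the following; the join keys, the mismatch condition, the selected columns, and the lowered spark.sql.shuffle.partitions value are illustrative assumptions rather than the asker's exact logic:
from pyspark.sql import functions as F

# Small data after the first join, so keep the number of shuffle partitions low
spark.conf.set("spark.sql.shuffle.partitions", "8")

# One outer join, filtered down to the mismatching rows and cached
mismatches = (
    sourceonedf.alias("a")
    .join(sourcetwodf.alias("b"), on=["row_num", "tripid"], how="outer")
    .where(
        (F.col("a.pickup_location_id") != F.col("b.pickup_location_id"))
        | (F.col("a.trips") != F.col("b.trips"))
    )
)
mismatches.cache()
mismatches.count()  # materializes the cache and gives the number of mismatched rows

# Column-level comparison straight from the cached join, without df_1/df_2
result_df = mismatches.select(
    "row_num",
    "tripid",
    F.col("a.pickup_location_id").alias("pickup_location_id_one"),
    F.col("b.pickup_location_id").alias("pickup_location_id_two"),
    F.col("a.trips").alias("trips_one"),
    F.col("b.trips").alias("trips_two"),
)
result_df.show()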

Search Multiple Indexes with condition

Here is the requirement I am working on.
There are multiple indexes with name content_ssc, content_teal, content_mmy.
These indexes can have common data (co_code is one of the fields in the documents of these indexes):
a. content_ssc can have documents with co_code = teal/ssc/mmy
b. content_mmy can have documents with co_code = ssc/mmy
I need to get the data using the below condition (this is one approach to getting unique data from these indexes):
a. (Index = content_ssc and site_code = ssc) OR (Index = content_mmy and site_code = mmy)
Basically, I am currently getting duplicate data from these indexes, so I need a solution that fetches unique data from them using the above condition.
I have tried using a boolean query with multiple indices from this link, but it didn't produce unique results.
Please suggest.
You can use a distinct query, and you will get unique results.
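For what it's worth, the condition from the question can also be expressed directly as a bool query across both indexes, filtering each branch on the _index metadata field. This is only a sketch: it assumes site_code is the field to match on (the question mentions both co_code and site_code) and that it is mapped as a keyword:
GET content_ssc,content_mmy/_search
{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        {
          "bool": {
            "must": [
              { "term": { "_index": "content_ssc" } },
              { "term": { "site_code": "ssc" } }
            ]
          }
        },
        {
          "bool": {
            "must": [
              { "term": { "_index": "content_mmy" } },
              { "term": { "site_code": "mmy" } }
            ]
          }
        }
      ]
    }
  }
}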

Performance issues related to Neo4j path query

I have the following Cypher query that is basically trying to find paths between the same set of nodes, such that the paths returned contain all 5 specified relationship types.
match p=(n)-[r*1..10]->(m)
where (m.URI IN ['http://.../x86_64/2#this', ... ,'http://.../CSP52369'])
AND (n.URI IN ['http://.../x86_64/2#this', ... ,'http://.../CSP52369'])
AND filter(x IN r where type(x)=~'.*isOfFormat.*')
AND filter(y IN r where type(y)=~'.*Processors.*')
AND filter(z IN r where type(z)=~'.*hasProducts.*')
AND filter(u IN r where type(u)=~'.*ProcessorFamilies.*')
AND filter(v IN r where type(v)=~'.*hasProductCategory.*')
return p;
The query I had above worked just fine and I got the paths I wanted. However, the execution time for the query was quite long. Below is some information about the query and the graph I used:
1) the graph contains 107,387 nodes and 226,468 relationships;
2) the size of the set of source(destination) nodes is 120; in other words, there are 120 strings in (n.URI IN ['x86_64/2#this', ... ,'/CSP52369']) and (m.URI IN ['x86_64/2#this', ...,'/CSP52369'];
The query execution time for the above query is 212,840 ms.
Then, in order to find nodes with the URI property faster, I added a label URI to those nodes and created an index on :URI(URI). I then modified the query, and the new query looks like:
match p=(n:URI)-[r*1..10]->(m:URI)
where (m.URI IN ['http://.../x86_64/2#this', ... ,'http://.../CSP52369'])
AND (n.URI IN ['http://.../x86_64/2#this', ... ,'http://.../CSP52369'])
AND filter(x IN r where type(x)=~'.*isOfFormat.*')
AND filter(y IN r where type(y)=~'.*Processors.*')
AND filter(z IN r where type(z)=~'.*hasProducts.*')
AND filter(u IN r where type(u)=~'.*ProcessorFamilies.*')
AND filter(v IN r where type(v)=~'.*hasProductCategory.*')
return p;
I ran the query again and the execution time was 5,841 ms, so it did improve the performance a lot. However, I am not sure how the index helped here. I actually profiled both queries; below is what I got.
The figure on the top/bottom is the profiling result for the first/second query.
Comparing the two execution plans, I didn't see any index-related operators such as NodeIndexSeek. Further, according to both plans, the system first computed all paths between n and m and then chose the ones to keep with the filter. In that case, how would the index help?
Can anybody help me clear up my doubts? Thanks in advance!!!
It seems your query runs with the rule-based optimizer while it should run with the cost-based one. I hope you are using the latest Neo4j version, 2.3.1.
I would also change your filter() expressions (filter() returns a collection, it is not really a predicate) into ALL()/ANY() predicates, as shown below. You might also need to add the index lookup hints:
WITH ["isOfFormat","Processors","hasProducts","ProcessorFamilies","hasProductCategory"] as types
MATCH p=(n:URI)-[rels*1..10]->(m:URI)
USING INDEX n:URI(URI)
USING INDEX m:URI(URI)
WHERE (m.URI IN ['http://.../x86_64/2#this', ... ,'http://.../CSP52369'])
AND (n.URI IN ['http://.../x86_64/2#this', ... ,'http://.../CSP52369'])
AND ALL(t in types WHERE ANY(r in rels WHERE type(r) = t))
RETURN p;
But you might be better off expressing your concrete path as a concrete pattern with the relevant relationship types in between, like Nicole suggested:
MATCH (n:URI)-[rels:isOfFormat|:Processors|:hasProducts|
:ProcessorFamilies|:hasProductCategory*..10]-(m:URI)
...
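For illustration only, a fully concrete pattern could look something like the sketch below; the order of the relationship types along the path is an assumption about the data model, not something given in the question:
MATCH p=(n:URI)-[:hasProducts]->()-[:hasProductCategory]->()-[:Processors]->()-[:ProcessorFamilies]->()-[:isOfFormat]->(m:URI)
WHERE n.URI IN ['http://.../x86_64/2#this', ... ,'http://.../CSP52369']
  AND m.URI IN ['http://.../x86_64/2#this', ... ,'http://.../CSP52369']
RETURN p;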

mongoDB geoNear command with count

I am using the geoNear command with Mongoid in order to retrieve a document collection ordered by distance. I need the distance for each document in the collection, which is why I am having to resort to the geoNear command.
Given the following command:
category_ids = ["list", "of", "ids"]
cmd = Hash.new
cmd[:geoNear] = :poi
cmd[:near] = [params[:location][:x], params[:location][:y]]
cmd[:query] = {
  "$or" => [
    {primary_category_id: {"$in" => category_ids}},
    {category_ids: {"$in" => category_ids}}
  ]
}
cmd[:spherical] = true
cmd[:num] = num
res = Poi.collection.database.command cmd
My problem is that I require the total number of results in the collection. Sure, I could just run another query that counts the number of items satisfying the query part of the command, but that would be pretty inefficient and also not very extensible, as every change I make in the command would have to be reflected in the count query. Just adding a maxDistance would land me in a whole heap of trouble.
Another option would be to go with find and calculate the distance manually but again I would like to avoid that.
So my question is: is there a clever way of getting the number of documents matched by the command (ignoring the num limit) without having to run a separate query, or having to calculate the distance manually and go with find?
You can use $facet for this: after a $geoNear stage, add a $facet stage in which one facet projects the documents and the other groups by _id: null (or uses $count) to get the total number of documents.
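A rough sketch of that aggregation through the Ruby driver, reusing the variables from the question; it assumes MongoDB 4.2+ (where $facet is available and $geoNear no longer applies its own default limit), a 2dsphere index on the location field, and the facet names are purely illustrative:
pipeline = [
  {
    "$geoNear" => {
      "near" => { "type" => "Point", "coordinates" => [params[:location][:x], params[:location][:y]] },
      "distanceField" => "distance",
      "spherical" => true,
      "query" => {
        "$or" => [
          { primary_category_id: { "$in" => category_ids } },
          { category_ids: { "$in" => category_ids } }
        ]
      }
    }
  },
  {
    "$facet" => {
      "results" => [{ "$limit" => num }],      # the documents, ordered by distance
      "total"   => [{ "$count" => "count" }]   # total matches, ignoring the limit
    }
  }
]

res = Poi.collection.aggregate(pipeline).first
documents = res["results"]
total     = res["total"].first ? res["total"].first["count"] : 0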
