Optimize Cypher Query for Movie Recommendation on Large Dataset - performance

I'm currently working on movie recommendation with the MovieLens 20M dataset after reading https://markorodriguez.com/2011/09/22/a-graph-based-movie-recommender-engine/. A Movie node connects to a Genre node with a hasGenre relationship, and a Movie node connects to a User node with a hasRating relationship. I'm trying to retrieve all movies that are most highly co-rated (both ratings > 3.0) with a query movie (e.g. Toy Story) and that share all genres with Toy Story. Here's my Cypher query:
MATCH (inputMovie:Movie {movieId: 1})-[r:hasGenre]-(h:Genre)
WITH inputMovie, COLLECT (h) as inputGenres
MATCH (inputMovie)<-[r:hasRating]-(User)-[o:hasRating]->(movie)-[:hasGenre]->(genre)
WITH inputGenres, r, o, movie, COLLECT(genre) AS genres
WHERE ALL(h in inputGenres where h in genres) and (r.rating>3 and o.rating>3)
RETURN movie.title,movie.movieId, count(*)
ORDER BY count(*) DESC
However, it seems my system cannot handle it (16 GB of RAM, 4th-gen Core i7, and an SSD). When I run the query, RAM usage peaks at 97% and then Neo4j shuts down unexpectedly (probably due to the heap size, or else the RAM size).
Is my query correct? I'm a newbie in Neo4j, so I may well have written it incorrectly.
How can I optimize this query?
How can I configure Neo4j so it can handle a large dataset on my system's specs for this kind of query?
Thanks in advance.

First, your Cypher can be simplified for more efficient planning by matching only what we need and handling the rest in the WHERE clause (so that filtering can potentially be done while matching):
MATCH (inputMovie:Movie {movieId: 1})-[:hasGenre]->(h:Genre)
WITH inputMovie, COLLECT(h) AS inputGenres
MATCH (inputMovie)<-[r:hasRating]-(user)-[o:hasRating]->(movie)
WHERE (r.rating > 3 AND o.rating > 3)
  AND ALL(genre IN inputGenres WHERE (movie)-[:hasGenre]->(genre))
RETURN movie.title, movie.movieId, count(*)
ORDER BY count(*) DESC
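Independently of the query shape, the initial lookup by movieId benefits from an index if you have not created one already (an assumption on my part; skip this if it exists). A minimal sketch using the legacy Neo4j 3.x syntax; newer versions use CREATE INDEX FOR (m:Movie) ON (m.movieId):
// Assumption: no index exists yet on :Movie(movieId)
CREATE INDEX ON :Movie(movieId);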
Now, if you don't mind adding data to the graph to find the data you want, another thing you can do is split the query into small pieces and "cache" the results. For example:
// Cypher 1
MATCH (inputMovie:Movie {movieId: 1})-[r:hasGenre]->(h:Genre)
WITH inputMovie, COLLECT (h) as inputGenres
MATCH (movie:Movie)
WHERE ALL(genre in inputGenres WHERE (movie)-[:hasGenre]->(genre))
// Merge so that multiple runs don't create extra copies
MERGE (inputMovie)-[:isLike]->(movie)
// Cypher 2
MATCH (movie:Movie)<-[r:hasRating]-(user)
WHERE r.rating>3
// Merge so that multiple runs don't create extra copies
MERGE (user)-[:reallyLikes]->(movie)
// Cypher 3
MATCH (inputMovie:Movie {movieId: 1})<-[:reallyLikes]-(user)-[:reallyLikes]->(movie:Movie)<-[:isLike]-(inputMovie)
RETURN movie.title, movie.movieId, count(*)
ORDER BY count(*) DESC

Related

Neo4j building initial graph is slow

I am trying to build out a social graph between 100k users. Users can sync other social media platforms or upload their own contacts. Building each relationship takes about 200ms. Currently, I have everything uploaded on a queue so it can run in the background, but ideally, I can complete it within the HTTP request window. I've tried a few things and received a few warnings.
Added an index to the field pn
Getting a warning: "This query builds a cartesian product between disconnected patterns." I understand why I am getting this warning, but no relationship exists yet, and that's exactly what I am building in this initial call.
MATCH (p1:Person {userId: "....."}), (p2:Person) WHERE p2.pn = "....." MERGE (p1)-[:REL]->(p2) RETURN p1, p2
Any advice on how to make it faster? Ideally, each relationship creation is around 1-2ms.
You may want to EXPLAIN the query and make sure that NodeIndexSeeks are being used, and not NodeByLabelScan. You also mentioned an index on :Person(pn), but you have a lookup on :Person(userId), so you might be missing an index there, unless that was a typo.
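If the userId index is in fact missing, it is cheap to add; a sketch, using the same legacy CREATE INDEX syntax shown further down (assuming Neo4j 3.x):
// Assumption: :Person(userId) is not indexed yet
CREATE INDEX ON :Person(userId);
// Then re-run EXPLAIN and check that both lookups use NodeIndexSeek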
Regarding the cartesian product warning, you can disregard it: the cartesian product is necessary in order to get the two nodes for the relationship. This should be a 1 x 1 = 1 row operation, so it's only going to be costly if multiple nodes are being matched per side, or if index lookups aren't being used.
If these are part of some batch load operation, then you may want to make your query apply in batches. So if 100 contacts are being loaded by a user, you do NOT want to execute 100 separate queries, with each query adding a single contact. Instead, pass the list of contacts as a parameter, then UNWIND the list and apply the query once to process the entire batch.
Something like:
UNWIND $batch AS row
MATCH (p1:Person {pn: row.p1}), (p2:Person {pn: row.p2})
MERGE (p1)-[:REL]->(p2)
RETURN p1, p2
It's usually okay to batch 10k or so entries at a time, though you can adjust that depending on the complexity of the query
Check out this blog entry for how to apply this approach.
https://dzone.com/articles/tips-for-fast-batch-updates-of-graph-structures-wi
You can use the index you created on Person by suggesting a planner hint.
Reference: https://neo4j.com/docs/cypher-manual/current/query-tuning/using/#query-using-index-hint
CREATE INDEX ON :Person(pn);
MATCH (p1:Person {userId: "....."})
WITH p1
MATCH (p2:Person)
USING INDEX p2:Person(pn)
WHERE p2.pn = "....."
MERGE (p1)-[:REL]->(p2)
RETURN p1, p2

Oracle database help optimizing LIKE searches

I am on Oracle 11g and we have these 3 core tables:
Customer - CUSTOMERID|DOB
CustomerName - CUSTOMERNAMEID|CustomerID|FNAME|LNAME
Address - ADDRESSID|CUSTOMERID|STREET|CITY|STATE|POSTALCODE
I have about 60 million rows on each of the tables and the data is a mix of US and Canadian population.
I have a front-end application that calls a web service and they do a last name and partial zip search. So my query basically has
where CUSTOMERNAME.LNAME = ? and ADDRESS.POSTALCODE LIKE '?%'
They typically provide the first 3 digits of the zip.
The address table has an index on all street/city/state/zip and another one on state and zip.
I did try adding an index exclusively for the zip and forcing Oracle to use that index in my query, but that didn't make any difference.
For returning about 100 rows (I have pagination to return only 100 at a time) it takes about 30 seconds, which isn't ideal. What can I do to make this better?
The problem is that the filters you are applying are not very selective and they apply to different tables. This is bad for an old-fashioned B-tree index. If the content is very static, you could try bitmap indexes: more precisely, a function-based bitmap join index on the first three letters of the last name and a bitmap join index on the postal code column. This assumes that very few people whose last name starts with certain letters live in an area with a certain postal code.
CREATE BITMAP INDEX ix_customer_custname ON customer(SUBSTR(cn.lname,1,3))
FROM customer c, customername cn
WHERE c.customerid = cn.customerid;
CREATE BITMAP INDEX ix_customer_postalcode ON customer(SUBSTR(a.postalcode,1,3))
FROM customer c, address a
WHERE c.customerid = a.customerid;
If you are successful, you should see the two bitmap indexes being AND-combined in the execution plan. The execution time should drop to a couple of seconds. It will not be as fast as a B-tree index.
Remarks:
You may have to play around a bit to see whether it is more efficient to make one index or two, and whether the function-based expressions are actually helpful.
If you decide to go function-based, you should include the exact same function calls in the WHERE clause of your query; otherwise the index will not be used (see the sketch after these remarks).
DML operations will be considerably slower. This is only useful for tables with static data. Note that DML operations will block whole row "ranges". Concurrent DML operations will run into problems.
Response time will probably still be seconds, not near-instantaneous like a B-tree index.
AFAIK this will work only on the enterprise edition. The syntax is untested because I do not have an enterprise db available at the moment.
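To illustrate the remark about matching function calls: if the indexes are built on SUBSTR(...) expressions as above, the query has to repeat exactly those expressions in its WHERE clause. A sketch only (untested, like the indexes themselves; the bind variable names :lname and :zip_prefix are placeholders of mine):
SELECT c.customerid, cn.lname, a.postalcode
FROM   customer c
JOIN   customername cn ON cn.customerid = c.customerid
JOIN   address a       ON a.customerid  = c.customerid
WHERE  SUBSTR(cn.lname, 1, 3)     = SUBSTR(:lname, 1, 3)  -- same expression as the index
AND    cn.lname                   = :lname                -- keep the original exact filter too
AND    SUBSTR(a.postalcode, 1, 3) = :zip_prefix;          -- first three digits of the zip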
If this is still not fast enough, you can create a materialized view with customerid, last name and postal code and put a B-tree index on it. But that is kind of expensive, too.
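A sketch of that materialized view idea, under the assumption that a periodic complete refresh is acceptable for your data (the object names are placeholders):
CREATE MATERIALIZED VIEW mv_cust_search
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
SELECT c.customerid, cn.lname, a.postalcode
FROM   customer c
JOIN   customername cn ON cn.customerid = c.customerid
JOIN   address a       ON a.customerid  = c.customerid;

CREATE INDEX ix_mv_cust_search ON mv_cust_search (lname, postalcode);
The composite index can then serve the equality on lname plus the leading-prefix LIKE on postalcode in a single range scan.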

Neo4j delete graph big data

I have a huge graph with about 11M relationships.
When I run the query MATCH (n) DETACH DELETE n, it seems to take forever to finish.
I did some research and found that I should delete the relationships in batches with a limit, and then the nodes, using this query:
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
WITH r LIMIT 200000
DELETE r
RETURN count(r) as deletedCount
Yet, as I'm doing some performance comparisons, it does not seem right to me to sum the per-batch deletion times to get the total time to delete the whole graph, because that total changes when I change the limit on how many relationships are deleted at once (deleting 2,000 relationships per batch does not give the same total as 20,000 at once).
How can I solve that problem ?
Any help would be appreciated
You can use apoc.periodic.commit to help you with batching. You must install the APOC plugin, which has lots of cool functions that enhance Cypher.
You can use the following cypher query.
call apoc.periodic.commit("
match (node)
with node limit {limit}
DETACH DELETE node
RETURN count(*)
",{limit:10000})
This will run the query in batches until the inner query no longer matches anything (count(*) returns 0), which in this case means no nodes are left in the database. You can play around with different limit settings to see what works best.
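If your APOC version includes it, apoc.periodic.iterate is a closely related option: the first statement streams the nodes to delete and the second is applied to them in batched transactions, instead of re-running the whole match for every batch. A sketch, assuming the same batch size:
call apoc.periodic.iterate("
MATCH (node) RETURN node
","
DETACH DELETE node
",{batchSize:10000})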
Hope this helps

Improve SQLite anti-join performance

Check out the update at the bottom of this question: the cause of the unexpected variance in query times noted below has been identified as the result of a sqliteman quirk.
I have the following two tables in a SQLite DB (the structure might seem pointless, I know, but bear with me):
+-----------------------+
| source |
+-----------------------+
| item_id | time | data |
+-----------------------+
+----------------+
| target |
+----------------+
| item_id | time |
+----------------+
--Both tables have a multi column index on item_id and time
The source table contains around 500,000 rows; there will never be more than one matching record in the target table, and in practice almost all source rows will have a matching target row.
I am attempting to perform a fairly standard anti-join to find all records in source without corresponding rows in target, but am finding it difficult to create a query with an acceptable execution time.
The query I am using is:
SELECT
source.item_id,
source.time,
source.data
FROM source
LEFT JOIN target USING (item_id, time)
WHERE target.item_id IS NULL;
Just the LEFT JOIN without the WHERE clause takes around 200ms to complete, with it this increases to 5000ms.
While I originally noticed the slow query from within my consuming application the timings above were obtained by executing the statements directly from within sqliteman.
Is there a particular reason why this seemingly simple clause so dramatically increases execution time and is there some way I can restructure this query to improve it?
I have also tried the following with the same result. (I imagine the underlying query plan is the same)
SELECT
source.item_id,
source.time,
source.data
FROM source
WHERE NOT EXISTS (
SELECT 1 FROM target
WHERE target.item_id = source.item_id
AND target.time = source.time
);
Thanks very much!
Update
Terribly sorry, it turns out that these apparent results are actually due to a quirk with sqliteman.
It seems sqliteman arbitrarily limits the number of rows returned to 256, and will load more dynamically as you scroll through them. This makes a query over a large dataset appear much quicker than it actually is, making it a poor choice for estimating query performance.
Nonetheless, is there any obvious way to improve the performance of this query, or am I simply hitting the limits of what SQLite is capable of?
This is the query plan of your query (either one):
0|0|0|SCAN TABLE source
0|1|1|SEARCH TABLE target USING COVERING INDEX ti (item_id=? AND time=?)
This is pretty much as efficient as possible: every row in source must be checked by searching for a matching row in target.
It might be possible to make one little improvement.
The source rows are probably not ordered, so the target search will do a lookup at a random position in the index.
If we can force the source scan to be in index order, the target lookups will be in order too, which makes it more likely for these index pages to already be in the cache.
SQLite will use the source index if we do not use any columns not in the index, i.e., if we drop the data column:
> EXPLAIN QUERY PLAN
SELECT source.item_id, source.time
FROM source
LEFT JOIN target USING (item_id, time)
WHERE target.item_id IS NULL;
0|0|0|SCAN TABLE source USING COVERING INDEX si
0|1|1|SEARCH TABLE target USING COVERING INDEX ti (item_id=? AND time=?)
This might not help much.
But if it helps, and if you want the other columns in source, you can do this by doing the join first, and then looking up the source rows by their rowid (the extra lookup should not hurt if you have very few results):
SELECT *
FROM source
WHERE rowid IN (SELECT source.rowid
FROM source
LEFT JOIN target USING (item_id, time)
WHERE target.item_id IS NULL)

Making an inner join in LINQ vs. using [table].[joiningTable].column: are they the same?

I have recently started working with LINQ, and I was wondering: suppose I have 2 related tables, Project (with the foreign key AccessLevelId) and AccessLevel, and I just want to select values from both tables. There are 2 ways I can select values from these tables.
The one I commonly use is:
(from P in DataContext.Projects
 join AL in DataContext.AccessLevel
     on P.AccessLevelId equals AL.AccessLevelId
 select new
 {
     ProjectName = P.Name,
     Access = AL.AccessName
 })
Another way of doing this would be:
(from P in DataContext.Projects
 select new
 {
     ProjectName = P.Name,
     Access = P.AccessLevel.AccessName
 })
What I wanted to know is: which of these ways is more efficient if we increase the number of tables to, say, 5-6, with 1-2 of the tables containing thousands of records?
You should take a look at the SQL that gets generated. You have to understand that there are several potential performance bottlenecks in a LINQ query (in this case I assume LINQ to SQL); the usual main bottleneck is the SQL query on the server.
Typically SQL Server has a very good optimizer, so given the same query refactored in different ways, the performance is pretty uniform.
However, in your case there is a very real difference between the two queries. A project with no AccessLevel would not appear in the first query, whilst the second query would return it with a null AccessName. In effect you would be comparing a LEFT JOIN to an INNER JOIN.
TL;DR: For SQL Server, LINQ to SQL/Entity Framework queries that do the same thing should give similar performance. However, your queries are far from similar.
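To make the difference concrete, here is what the explicit join would have to look like to behave like the navigation-property version, i.e. as a left outer join. This is only a sketch against the same DataContext and property names from the question:
// Left outer join: projects with no matching AccessLevel row are kept,
// with Access coming back as null (as with P.AccessLevel.AccessName).
var result =
    (from P in DataContext.Projects
     join AL in DataContext.AccessLevel
         on P.AccessLevelId equals AL.AccessLevelId into grouped
     from AL in grouped.DefaultIfEmpty()
     select new
     {
         ProjectName = P.Name,
         Access = AL == null ? null : AL.AccessName
     });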
