Neo4j Cypher query performance issue with movie recommendation query

I'm currently working on a movie recommendation query that should return the movies with the most "recommendation impact", using the following Cypher query:
match (m:Movie)
with m, size((m)<-[:LIKED]-(:User)-[:LIKED]->(:Movie)) as score
order by score desc
limit 10
return m.title, score
After reading the graph database (Neo4j) e-book, my assumption was that this would be an easy query for Neo4j, but execution took 32737 ms, which is not what I was expecting. Does anyone have experience with this kind of query and have any suggestions to improve performance? Or should this query perform well, and do I need to do some Neo4j / Java configuration tuning?
The profile of the query and the result: (screenshots not available)

Maybe this is something you can pre-calculate.
Your score is related to the number of movies liked by each user. Why not calculate and store the number of movies liked by each user (assuming a user can only like a movie once, not multiple times)?
Note that this only makes sense if you only care about the number of movies liked by each user, and are okay with adding those up, even if they represent multiple likes of the same movie across many users.
MATCH (u:User)
SET u.likedCount = SIZE((u)-[:LIKED]->(:Movie))
You will need to update this every time the user likes (or unlikes) another movie.
When this is pre-populated for all users, your scoring query now becomes:
MATCH (m:Movie)
WITH m
MATCH (m)<-[:LIKED]-(u:User)
WITH m, SUM(u.likedCount) as score
ORDER BY score desc
LIMIT 10
RETURN m.title, score
EDIT
This of course includes each user's like of the movie in question. If you really need to account for this, you'll need to adjust your WITH clause to:
WITH m, SUM(u.likedCount) - count(u) as score
If you only want to count distinct movies liked by users in your scoring, then you can't pre-calculate and have to use something like stdob--'s answer.
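To make the difference between the scoring variants concrete, here is a small Python model. The users, movie titles, and function names are hypothetical toy data invented for illustration, and it assumes a user can like a movie at most once. It only models the arithmetic and says nothing about how Neo4j executes the queries.

```python
# Hypothetical toy data: user -> set of movies liked (at most one like per movie).
likes = {
    "alice": {"Alien", "Brazil", "Casablanca"},
    "bob": {"Alien", "Brazil"},
    "carol": {"Alien"},
}


def path_count_score(likes, movie):
    """Count paths movie <- user -> other movie, as the original size(...) pattern does."""
    return sum(len(liked) - 1 for liked in likes.values() if movie in liked)


def precomputed_score(likes, movie):
    """SUM(u.likedCount) - count(u) over the users who liked the movie."""
    liked_count = {user: len(liked) for user, liked in likes.items()}  # stands in for u.likedCount
    fans = [user for user, liked in likes.items() if movie in liked]
    return sum(liked_count[user] for user in fans) - len(fans)


def distinct_score(likes, movie):
    """Count distinct other movies reachable through any user who liked the movie."""
    reachable = set()
    for liked in likes.values():
        if movie in liked:
            reachable |= liked - {movie}
    return len(reachable)


print(path_count_score(likes, "Alien"))   # path-based score
print(precomputed_score(likes, "Alien"))  # same value, from stored counts
print(distinct_score(likes, "Alien"))     # smaller: "Brazil" is only counted once
```

The first two functions agree on every movie in this data, while the distinct variant cannot be derived from per-user counts alone, which is why it cannot be pre-calculated this way.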

Try this query:
MATCH (M:Movie)<-[:LIKED]-(:User)-[:LIKED]->(R:Movie)
WITH M,
size( collect(distinct R) ) as score
RETURN M.title as title,
score
ORDER BY score DESC LIMIT 10
As an option:
MATCH (M:Movie)<-[:LIKED]-(:User)-[:LIKED]->(R:Movie)
RETURN M.title as title,
count(R) as score
ORDER BY score DESC LIMIT 10

Related

Performance with pagination

Question
Given the following query:
MATCH (t:Tenant)-[:lives_in]->(:Apartment)-[:is_in]->(:City {name: 'City1'})
RETURN t
ORDER BY t.id
LIMIT 10
So: "Give me the first 10 tenants that live in City1"
With the sample data below, the database will get hit for every single apartment in City1 and for every tenant that lives in each of these apartments.
If I remove the ORDER BY this doesn't happen.
I am trying to implement pagination so I need the ORDER BY. How to improve the performance on this?
Sample data
UNWIND range(1, 5) as CityIndex
CREATE (c:City { id: CityIndex, name: 'City' + CityIndex})
WITH c, CityIndex
UNWIND range(1, 5000) as ApartmentIndex
CREATE (a:Apartment { id: CityIndex * 1000 + ApartmentIndex, name: 'Apartment'+CityIndex+'_'+ApartmentIndex})
CREATE (a)-[:is_in]->(c)
WITH c, a, CityIndex, ApartmentIndex
UNWIND range(1, 3) as TenantIndex
CREATE (t:Tenant { id: (CityIndex * 1000 + ApartmentIndex) * 10 + TenantIndex, name: 'Tenant'+CityIndex+'_'+ApartmentIndex+'_'+TenantIndex})
CREATE (t)-[:lives_in]->(a)
Without the ORDER BY, Cypher can lazily evaluate the tenants and stop at 10 rather than matching every tenant in City1. However, because you need the tenants in order, it has to fetch them all before it can tell which 10 come first.
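As a rough illustration in Python (not a description of what Neo4j's planner actually does, which you should check with PROFILE), the two functions below both have to read every matching tenant id. The function names and the generated ids are assumptions that mirror the sample data; the only thing that differs is how much sorting work is done once all rows have been read.

```python
import heapq


def first_page_full_sort(tenant_ids, limit):
    """Sort every id, then keep the first `limit`."""
    return sorted(tenant_ids)[:limit]


def first_page_top_k(tenant_ids, limit):
    """Read every id once, keeping only the `limit` smallest seen so far."""
    return heapq.nsmallest(limit, tenant_ids)


# Ids shaped like the sample data for City1: 5000 apartments, 3 tenants each.
city1_tenant_ids = [
    (1 * 1000 + apartment) * 10 + tenant
    for apartment in range(1, 5001)
    for tenant in range(1, 4)
]

print(first_page_top_k(city1_tenant_ids, 10))
```

Both approaches return the same page; neither can skip reading the 15,000 ids, which is the cost the question is observing.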
If Tenant nodes are the only nodes that can live in apartments, then you could possibly save a Filter step by dropping the Tenant label from your query, like MATCH (t)-[:lives_in]->(:Apartment)....
You might want to check the profile of your query as well and see whether it uses an index-backed ORDER BY.
What sort of numbers are you expecting back from this query? What's the worst case number of tenants in a given city?
EDIT
I was hoping a USING JOIN on t would use the index to improve the plan but it does not.
The query performs slightly better if you add a redundant relationship from the tenant to the city:
MATCH (t:Tenant)-[:CITY]->(:City {name: 'City1'})
RETURN t
ORDER BY t.id
LIMIT 10
The same goes for embedding the city name on the tenant: no major gains. I tested with 150,000 tenants in City1; perhaps the gains become more visible as you approach millions, but I'm not sure.

Neo4j - return only one node that has multiple relations

I'm having a small issue finding out how to return one node, that has multiple outgoing relations.
So what I want is to display only one node, even if it has more than one relation; this is my query:
MATCH total=(n:Employee)-[r:WorkedOn]->(p:Project)
RETURN toFloat(p.total_efficiency) / toFloat(count(p)) as score , n.first_name as name, n.last_name as surname, r.role as role, n.start_date_of_work as startDate, n.experience as experience,
n.email as email, n.age as age, collect(p.name) as projects ORDER BY score DESC LIMIT {l}
but this returns a table like this:
How do I solve the double 'Jari Van Melckebeke' records? I only want one.
I could also remove the 'role' property, but I need the Project object anyway to calculate the score...
Thanks in advance,
Jari Van Melckebeke
You have two options to collapse this into one row. Either, as you suggested, removing role from your return, or returning COLLECT(r.role) as roles.
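The reason both options work is that every non-aggregated column in the RETURN acts as a grouping key. Here is a minimal Python sketch of that grouping, using hypothetical rows and a hypothetical helper `group_rows` (the real data and property values are not shown in the question):

```python
from collections import defaultdict

# Hypothetical rows, one per (employee)-[:WorkedOn]->(project) match.
rows = [
    {"name": "Jari", "role": "developer", "project": "A"},
    {"name": "Jari", "role": "tester", "project": "B"},
    {"name": "Ann", "role": "developer", "project": "A"},
]


def group_rows(rows, key_fields):
    """Group rows by the given fields, the way non-aggregated RETURN columns do."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        groups[key].append(row)
    return dict(groups)


# Returning r.role as a plain column: role is part of the key, so Jari appears twice.
by_name_and_role = group_rows(rows, ["name", "role"])

# Returning COLLECT(r.role): role is aggregated, so there is one row per employee.
by_name = group_rows(rows, ["name"])
roles_per_name = {key[0]: [row["role"] for row in group] for key, group in by_name.items()}

print(len(by_name_and_role), len(by_name))
print(roles_per_name)
```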

OFFSET/LIMIT only count DISTINCT values in Activerecord query

I am running this query
Playlistship.order("created_at desc").select("distinct playlist_id").limit(12).offset(2)
This query does not necessarily return 12 records. It returns the number of distinct records in the set of 12 defined by the LIMIT, OFFSET and ORDER parameters.
For example, if the Playlistships between id=13 and id=24 had playlist_ids of [2,3,3,5,6,3,5,6,8,11,12,12], then this query will only return 7 records, corresponding to the first ones having the playlist_ids [2,3,5,6,8,11,12].
What I would like to find is a query that yields 12 records with distinct playlist_ids, with the correct offset, so that running this query again with an OFFSET of 3 would yield the next 12 records with distinct playlist_ids.
Hopefully I didn't "over explain" this one, as I think it's a relatively straightforward question. Please ask for more details if you need them.
Thanks!
Have you tried with subqueries? Give this a try:
Playlistship.select("distinct playlist_id").limit(12).where(playlist_id: Playlistship.order("created_at desc").select('playlist_id').offset(2))
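To pin down the behaviour being asked for, here is a plain-Python model of the two orderings of operations. It models the question's description only, not the SQL that ActiveRecord generates, and the function names are made up for illustration.

```python
def limit_then_distinct(playlist_ids, limit, offset):
    """Behaviour described in the question: take the window first, then de-duplicate."""
    window = playlist_ids[offset:offset + limit]
    return list(dict.fromkeys(window))  # keeps first occurrences, in order


def distinct_page(playlist_ids, limit, offset):
    """Behaviour wanted: de-duplicate first, then apply offset and limit."""
    seen = set()
    distinct = []
    for playlist_id in playlist_ids:
        if playlist_id not in seen:
            seen.add(playlist_id)
            distinct.append(playlist_id)
    return distinct[offset:offset + limit]


# The playlist_ids from the question, already ordered by created_at desc.
ids = [2, 3, 3, 5, 6, 3, 5, 6, 8, 11, 12, 12]

print(limit_then_distinct(ids, 12, 0))  # fewer than 12 entries
print(distinct_page(ids, 3, 0))         # first page of 3 distinct ids
print(distinct_page(ids, 3, 3))         # next page, no overlap with the first
```

Whatever query you end up with, it should behave like `distinct_page`: the offset and limit have to count distinct playlist_ids, not raw rows.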

Slow Neo4j query despite indices

Here I'm trying to find all Twitter users who are followed by and who follow any members of some group G:
MATCH (x:User)-[:FOLLOWS]->(t:User)-[:FOLLOWS]->(y:User)
WHERE (x.screen_name IN {{G_SCREEN_NAMES}} OR x.id IN {{G_IDS}})
AND (y.screen_name IN {{G_SCREEN_NAMES}} OR y.id IN {{G_IDS}})
RETURN t.id
But for the group G I sometimes have their screen names and sometimes have their ids, hence the OR clause above. Unfortunately this query is long running and doesn't appear to ever return.
I have indices and constraints on both id and screen_name:
Indexes
ON :User(screen_name) ONLINE (for uniqueness constraint)
ON :User(id) ONLINE (for uniqueness constraint)
Constraints
ON (user:User) ASSERT user.screen_name IS UNIQUE
ON (user:User) ASSERT user.id IS UNIQUE
If I get rid of the OR clause (for instance if I happen to have all screen_names or all ids for group G) then the query runs quite fast.
I'm using neo4j-community-2.1.3 on a Mac. My graph has 286039 nodes, all of which have the User label.
Any ideas to improve this? Otherwise I'll have to chop this up into 4 queries to get all possible combinations of members. This is even more problematic because I really want to keep track of how commonly a user appears in a G-->user-->G relationship, and I'll need to do a lot of extra bookkeeping if the counts are spread among 4 different queries.
Update
I created an issue related to this: https://github.com/neo4j/neo4j/issues/2834
I ended up using
MATCH (x:User) WHERE x.screen_name IN ["apple","banana","coconut"]
WITH collect(id(x)) as x_ids
MATCH (x:User) WHERE x.id in [12345,98765]
WITH x_ids+collect(id(x)) as x_ids
MATCH (y:User) WHERE y.screen_name IN ["apple","banana","coconut"]
WITH x_ids,collect(id(y)) as y_ids
MATCH (y:User) WHERE y.id in [12345,98765]
WITH x_ids,y_ids+collect(id(y)) as y_ids
MATCH (x:User)-[:FOLLOWS]->(t:User)-[:FOLLOWS]->(y:User)
WHERE id(x) in x_ids AND id(y) in y_ids
RETURN count(*) as c, t.screen_name,t.id
ORDER BY c DESC
LIMIT 1000
But this basically represents a hack to get around a place where neo4j isn't using the indices that it could be.
I guess the query does not make use of indexes due to the OR condition; you can verify this by prefixing the query with PROFILE and running it in neo4j-shell.
If the profile shows no index usage, you might split the query up into two parts. The first one fetches the combined list of node ids; instead of the OR, we do a UNION of two queries (each using an index lookup):
MATCH (x:User) WHERE x.screen_name in {G_SCREEN_NAMES} RETURN id(x) as ids UNION
MATCH (x:User) WHERE x.id in {G_IDS} RETURN id(x) as ids
On the client side, use the list of node ids as parameter for the next query:
MATCH (x:User)-[:FOLLOWS]->(t)-[:FOLLOWS]->(y)
WHERE id(x) in {ids} AND id(y) in {ids}
RETURN t.id
I've intentionally removed the labels for t and y, on the assumption that you can only follow User nodes and no other kind of node. This removes an unnecessary label check.
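The two-step idea can be modelled in a few lines of Python. Everything here is a hypothetical stand-in: the toy graph, the two dictionaries playing the role of the unique-constraint indexes, and the function names. The counting step mirrors the count(*) in the asker's workaround, so the per-user tally stays in one place.

```python
from collections import Counter

# Hypothetical toy graph: internal node id -> user properties, plus FOLLOWS edges.
users = {
    1: {"id": 12345, "screen_name": "apple"},
    2: {"id": 98765, "screen_name": "banana"},
    3: {"id": 55555, "screen_name": "coconut"},
    4: {"id": 77777, "screen_name": "durian"},
}
follows = {1: {4}, 2: {4}, 3: set(), 4: {2, 3}}

# Two single-property lookups, standing in for the indexes on screen_name and id.
by_screen_name = {props["screen_name"]: node for node, props in users.items()}
by_id = {props["id"]: node for node, props in users.items()}


def group_node_ids(screen_names, ids):
    """Step 1: two separate index lookups, combined like the UNION."""
    from_names = {by_screen_name[name] for name in screen_names if name in by_screen_name}
    from_ids = {by_id[user_id] for user_id in ids if user_id in by_id}
    return from_names | from_ids


def middle_users(group):
    """Step 2: count G -> t -> G paths per middle user t."""
    counts = Counter()
    for x in group:
        for t in follows.get(x, ()):
            for y in follows.get(t, ()):
                if y in group:
                    counts[t] += 1
    return counts


group = group_node_ids(["apple", "coconut"], [98765])
print(middle_users(group))
```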
JnBrymn,
How about this query?
MATCH (x:User)
WHERE x.screen_name IN {{G_SCREEN_NAMES}} OR x.id IN {{G_IDS}}
WITH x
MATCH (x)-[:FOLLOWS]->(t:User)
WITH t
MATCH (t)-[:FOLLOWS]->(y:User)
WHERE y.screen_name IN {{G_SCREEN_NAMES}} OR y.id IN {{G_IDS}}
RETURN t.id
Grace and peace,
Jim

Linq to Entities: complex query getting "average" restaurant rating

So I'm building a Restaurant Review site for my community. I need to
extract data from the following tables: RESTAURANT, CUISINE, CITY,
PRICE and RATING (customer ratings).
The query should return all restaurants of a selected CUISINE_ID and
return the RESTAURANT_NAME, CUISINE_NAME, CITY_NAME, PRICE_CODE and it
should average all the reviews RATING_CODE and return a calculated
value. I'm fine with returning all the data except the average
rating.
I've only been working with LINQ to Entities 2 days and LINQ for about
3 weeks, so I'm really a newbie; I'm waiting for my LINQ book to be
delivered from Amazon.com. Your help and guidance would be appreciated!
It should end up looking something like this:
var avgForMatches =
(from r in context.Restaurants
where r.Cuisines.Any(c => c.CuisineName == cuisineName)
where r.Prices.Any(p => p.PriceCode == priceCode)
//... same pattern for other searches.
select r.RatingCode)
.Average();
Read about aggregate methods (including average) within the 101 linq samples - http://msdn.microsoft.com/en-us/vcsharp/aa336747
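If what you need is one average per restaurant rather than a single overall number, the shape of the computation is a filter followed by a per-restaurant average. Here is a minimal sketch of that shape in Python rather than LINQ; the records, field names, and the `average_ratings` helper are hypothetical and do not reflect your actual entity model.

```python
from statistics import mean

# Hypothetical records: each restaurant carries its list of customer rating codes.
restaurants = [
    {"name": "Luigi's", "cuisine": "Italian", "ratings": [4, 5, 3]},
    {"name": "Roma", "cuisine": "Italian", "ratings": [2, 4]},
    {"name": "Sakura", "cuisine": "Japanese", "ratings": [5]},
    {"name": "Nonna", "cuisine": "Italian", "ratings": []},
]


def average_ratings(restaurants, cuisine):
    """Average rating per restaurant of the given cuisine, skipping unrated ones."""
    return {
        restaurant["name"]: mean(restaurant["ratings"])
        for restaurant in restaurants
        if restaurant["cuisine"] == cuisine and restaurant["ratings"]
    }


print(average_ratings(restaurants, "Italian"))
```

Restaurants with no ratings are skipped here because an average over an empty list is undefined; you may prefer to return them with a null rating instead.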

Resources