Using COLLECT and COLLECT AGGREGATE to search large dataset in ArangoDB - performance

I want to improve performance on some custom queries, the objective is to update group with most users daily.
It's a large dataset composed of 3 main collections:
Users (document)
Groups (document)
Connections (edges)
Users can participate in many groups (1:N).
Groups can have many participants (N:1), witch are users or other groups. Ex: Fighter1 is in RaidParty1 and RaidParty1 is in Guild1.
COLLECT party= user.partyID WITH COUNT INTO num_users
SORT num_users DESC
RETURN {"Party": party,
"Num. Users" : num_users}
This takes 4 ms in the sample dataset, over a second on complete dataset, and is very memmory intensive.
I understand I could use something like
FOR user IN userSample
COLLECT party= user.partyID WITH COUNT INTO num_users
COLLECT AGGREGATE max_num_users= MAX(num_users)
RETURN {"Top Guild": max_num_users}
But this isn't unsing the aggregate optimization, as forming the num_users every time is the time consuming part. Is there a way to collect and aggregate at the same time?
BONUS: Any ideia how to make it a top 10 list?

You could replace WITH COUNT INTO with an AGGREGATE clause:
FOR user IN userSample
COLLECT party = user.partyID AGGREGATE num_users = LENGTH(1)
SORT num_users DESC
RETURN {
"Party": party,
"Num. Users": num_users
}
However, both variants lead to the exact same execution plan (at least in the latest version of ArangoDB).
If I understand correctly, then you want to count the number of users per party and then find the maximum among these counts. This isn't possible with a single COLLECT operation:
an aggregate expression must not refer to variables introduced by the COLLECT itself
Grouping and aggregation with COLLECT in AQL | ArangoDB Documentation
For a top 10 list, this should work:
FOR user IN userSample
COLLECT party = user.partyID WITH COUNT INTO num_users
SORT num_users DESC
LIMIT 10
RETURN {
"Party": party,
"Num. Users": num_users
}

Related

NRediSearch - Getting total documents matched count

Is there a way to get a total results count when calling Aggregate function?
Note that I'm not using Aggregate function to aggregate results, but as an advanced search query, because Search function does not allow to sort by multiple fields.
RediSearch returns total documents matched count, but I can't find a way to get this number using NRediSearch library.
With NRediSearch
Using NRediSearch, you would need to build and execute aggregation that will run a GROUPBY 0 and the COUNT reducer, say you have a person-idx index and you want to count all the Person documents in Redis:
var client = new Client("person-idx", muxer.GetDatabase());
var result = await client.AggregateAsync(new AggregationBuilder().GroupBy(new List<string>(), new List<Reducer>{Reducers.Count()}));
Console.WriteLine(result.GetResults().First().Values.First());
Will get the count you are looking for.
With Redis.OM
There's a newer library Redis.OM which you can also use to make these aggregations a bit simpler, the same operation would be done with the following:
var peopleAggregations = provider.AggregationSet<Person>();
Console.WriteLine(peopleAggregations.Count());

Performance with pagination

Question
Given the following query:
MATCH (t:Tenant)-[:lives_in]->(:Apartment)-[:is_in]->(:City {name: 'City1'})
RETURN t
ORDER BY t.id
LIMIT 10
So: "Give me the first 10 tenants that live in City1"
With the sample data below, the database will get hit for every single apartment in City1 and for every tenant that lives in each of these apartments.
If I remove the ORDER BY this doesn't happen.
I am trying to implement pagination so I need the ORDER BY. How to improve the performance on this?
Sample data
UNWIND range(1, 5) as CityIndex
CREATE (c:City { id: CityIndex, name: 'City' + CityIndex})
WITH c, CityIndex
UNWIND range(1, 5000) as ApartmentIndex
CREATE (a:Apartment { id: CityIndex * 1000 + ApartmentIndex, name: 'Apartment'+CityIndex+'_'+ApartmentIndex})
CREATE (a)-[:is_in]->(c)
WITH c, a, CityIndex, ApartmentIndex
UNWIND range(1, 3) as TenantIndex
CREATE (t:Tenant { id: (CityIndex * 1000 + ApartmentIndex) * 10 + TenantIndex, name: 'Tenant'+CityIndex+'_'+ApartmentIndex+'_'+TenantIndex})
CREATE (t)-[:lives_in]->(a)
Without the ORDER BY, cypher can lazily evaluate the tenants and stop at 10 rather than matching every tenant in City1. However, because you need to order the tenants, the only way it can do that is to fetch them all and then sort.
If the only labels that can live in apartments is Tenants then you could possibly save a Filter step by removing the Tenant in your query like MATCH (t)-[:lives_in]->(:Apartment)....
You might want to check the profile of your query as well and see if it uses the index backed order by
What sort of numbers are you expecting back from this query? What's the worst case number of tenants in a given city?
EDIT
I was hoping a USING JOIN on t would use the index to improve the plan but it does not.
The query performs slightly better if you add a redundant relation from the tenant to the city:
MATCH (t:Tenant)-[:CITY]->(:City {name: 'City1'})
RETURN t
ORDER BY t.id
LIMIT 10
and similarly by embedding the city name onto the tenant- no major gains. I tested for 150,000 tenants in City1, perhaps the gains are more visible as you approach millions, but not sure.

Cloudant couchdb query custom sort

I want to sort the results of couchdb query a.k.a mango queries based on custom sort. I need custom sort because if Status can be one of the following:
Active = 1
Sold = 2
Contingent = 4
Pending = 3
I want to sort the results on Status but not in an alphabetical order, rather my own weightage I assign to each value which can be seen in the above list. Here's the selector for Status query I'm using:
{type:"Property", Status:{"$or":[{$eq: "Pending"}, {$eq:"Active"}, {$eq: "Sold"}]}}
If I use the sort array in my json with Status I think it'll sort alphabetically which I don't want.
You are actually looking for results based on "Status". You can create a view similar to this:
function(doc) { if (doc.type == "Property") { emit(doc.Status, doc);}}
When you use it, invoke it 4 times in the order you need and you'll get the result you need. This would eliminate the need to sort.

Neo4j cypher query performance issue with movie recommendation query

I'm currently working on a movie recommendation query that should return the movies with the most "recommendation impact" using the following cypher query:
match (m:Movie)
with m, size((m)<-[:LIKED]-(:User)-[:LIKED]->(:Movie)) as score
order by score desc
limit 10
return m.title, score
After reading the graphdb (neo4j) e-book my assumption was that this whould be an easy query for neo4j but the execution time took 32737 ms which is not what I was expecting. Does any one have experience with these kind of queries and has any suggestions to improve performance? Or should this query perform well and do I need to do some neo4j / java configuration tuning?
The profile of the query:
The result:
Maybe this is something you can pre-calculate.
Your score is related to the number of movies liked by each user. Why not calculate and store the number of movies liked by each user (assuming a user can only like a movie once, not multiple times)?
Note that this only makes sense if you only care about the number of movies liked by each user, and are okay with adding those up, even if they represent multiple likes of the same movie across many users.
MATCH (u:User)
SET u.likedCount = SIZE((u)-[:LIKED]->(:Movie))
You will need to update this every time the user likes (or unlikes) another movie.
When this is pre-populated for all users, your scoring query now becomes:
MATCH (m:Movie)
WITH m
MATCH (m)<-[:LIKED]-(u:User)
WITH m, SUM(u.likedCount) as score
ORDER BY score desc
LIMIT 10
RETURN m.title, score
EDIT
This of course includes the likes from each user of the movie in question. If you really need to account for this, you'll need to adjust your with to:
WITH m, SUM(u.likedCount) - count(u) as score
If you only want to count distinct movies liked by users in your scoring, then you can't pre-calculate and have to use something like stdob--'s answer.
Try this query:
MATCH (M:Movie)<-[:LIKED]-(:User)-[:LIKED]->(R:Movie)
WITH M,
size( collect(distinct R) ) as score
RETURN M.title as title,
score
ORDER BY score DESC LIMIT 10
As an option:
MATCH (M:Movie)<-[:LIKED]-(:User)-[:LIKED]->(R:Movie)
RETURN M.title as title,
count(R) as score
ORDER BY score DESC LIMIT 10

Linq Query Where Contains

I'm attempting to make a linq where contains query quicker.
The data set contains 256,999 clients. The Ids is just a simple list of GUID'S and this would could only contain 3 records.
The below query can take up to a min to return the 3 records. This is because the logic will go through the 256,999 record to see if any of the 256,999 records are within the List of 3 records.
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Where(x => ids.Contains(x.ClientId)).ToList();
I would like to and get the query to check if the three records are within the pot of 256,999. So in a way this should be much quicker.
I don't want to do a loop as the 3 records could be far more (thousands). The more loops the more hits to the db.
I don't want to grap all the db records (256,999) and then do the query as it would take nearly the same amount of time.
If I grap just the Ids for all the 256,999 from the DB it would take a second. This is where the Ids come from. (A filtered, small and simple list)
Any Ideas?
Thanks
You've said "I don't want to grab all the db records (256,999) and then do the query as it would take nearly the same amount of time," but also "If I grab just the Ids for all the 256,999 from the DB it would take a second." So does this really take "just as long"?
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Select(x => x.ClientId).ToList().Where(x => ids.Contains(x)).ToList();
Unfortunately, even if this is fast, it's not an answer, as you'll still need effectively the original query to actually extract the full records for the Ids matched :-(
So, adding an index is likely your best option.
The reason the Id query is quicker is due to one field being returned and its only a single table query.
The main query contains sub queries (below). So I get the Ids from a quick and easy query, then use the Ids to get the more details information.
SELECT Clients.Id as ClientId, Clients.ClientRef as ClientRef, Clients.Title + ' ' + Clients.Forename + ' ' + Clients.Surname as FullName,
[Address1] ,[Address2],[Address3],[Town],[County],[Postcode],
Clients.Consent AS Consent,
CONVERT(nvarchar(10), Clients.Dob, 103) as FormatedDOB,
CASE WHEN Clients.IsMale = 1 THEN 'Male' WHEN Clients.IsMale = 0 THEN 'Female' END As Gender,
Convert(nvarchar(10), Max(Assessments.TestDate),103) as LastVisit, ";
CASE WHEN Max(Convert(integer,Assessments.Submitted)) = 1 Then 'true' ELSE 'false' END AS Submitted,
CASE WHEN Max(Convert(integer,Assessments.GPSubmit)) = 1 Then 'true' ELSE 'false' END AS GPSubmit,
CASE WHEN Max(Convert(integer,Assessments.QualForPay)) = 1 Then 'true' ELSE 'false' END AS QualForPay,
Clients.UserIds AS LinkedUsers
FROM Clients
Left JOIN Assessments ON Clients.Id = Assessments.ClientId
Left JOIN Layouts ON Layouts.Id = Assessments.LayoutId
GROUP BY Clients.Id, Clients.ClientRef, Clients.Title, Clients.Forename, Clients.Surname, [Address1] ,[Address2],[Address3],[Town],[County],[Postcode],Clients.Consent, Clients.Dob, Clients.IsMale,Clients.UserIds";//,Layouts.LayoutName, Layouts.SubmissionProcess
ORDER BY ClientRef
I was hoping there was an easier way to do the Contain element. As the pool of Ids would be smaller than the main pool.
A way I've speeded it up for now is. I've done a Stinrg.Join to the list of Ids and added them as a WHERE within the main SQL. This has reduced the time down to a seconds or so now.

Resources