MongoDB dynamic ranking - performance

I use MongoDB and have a collection with about 100000 entries.
The entries contain data like this:
{"page": "page1", "user_count": 1400}
{"page": "page2", "user_count": 1100}
{"page": "page3", "user_count": 900}
...
I want to output a ranking of the entries according to the user_count like:
#1 - page1
#2 - page2
#3 - page3
...
...so far so good. I can simply use a loop counter if I just output a sorted list.
But I also have to support various search queries. So for example I get 20 results and want to show the rank of each result. Like:
#432 - page1232
#32 - page223
#345 - page332
...
What's the best way to do that? I don't really want to store the ranking in the collection since the collection constantly changes. I tried to solve it with a lookup dictionary that I built on the fly, but it was really slow. Does MongoDB have any special functionality for such cases that could help?

There's no single command that you can use to do this, but you can do it with count:
var doc = db.pages.findOne(); // Or however you get your document
var n = db.pages.find({user_count : {$gt : doc.user_count}}).count(); // This is the number of documents with a higher user_count
var ranking = n+1; // Your doc is next in a ranking
A separate question is whether you should do this. Consider the following:
You'll need an index on user_count. You may already have this.
You'll need to perform a count query for each record you are displaying. There's no way to batch these up.
Given this, you may hurt your performance more than if you stored the ranking in the collection, depending on the CRUD profile of your application. It's up to you to decide which is the best option.
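To make the counting rule concrete without a database, here is a minimal in-memory sketch in plain JavaScript. The `pages` array and the `rankOf` helper are hypothetical stand-ins, not MongoDB API; the filter inside mirrors the `{user_count : {$gt : doc.user_count}}` count query above.

```javascript
// In-memory stand-in for the collection (sample data from the question).
const pages = [
  { page: "page1", user_count: 1400 },
  { page: "page2", user_count: 1100 },
  { page: "page3", user_count: 900 },
];

// Hypothetical helper: rank = 1 + number of documents with a strictly
// higher user_count, the same rule as the $gt count query.
function rankOf(docs, doc) {
  const higher = docs.filter((d) => d.user_count > doc.user_count).length;
  return higher + 1;
}

// One count per displayed result, just as with one count query per record.
const results = [pages[2], pages[0]];
const ranked = results.map((d) => ({ page: d.page, rank: rankOf(pages, d) }));
console.log(ranked);
```

Under this rule, documents with the same user_count share the same rank, because only strictly higher counts are counted.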

There's no simple approach to solve this problem with MongoDB.
If it is possible, I would advise you to look at Redis and its Sorted Sets. As the documentation says:
With Sorted Sets you can: Take a leader board in a massive online game, where every time a new score is submitted you update it using ZADD. You can easily take the top users using ZRANGE, you can also, given an user name, return its rank in the listing using ZRANK. Using ZRANK and ZRANGE together you can show users with a score similar to a given user. All very quickly.
You can easily fetch the ranks of arbitrary pages by using a MULTI/EXEC block. So I think it's the best approach for your task, and it will be much faster than using MapReduce or re-ranking with MongoDB.

Starting in MongoDB 5.0, it's a perfect use case for the new $setWindowFields aggregation stage:
// { page: "page1", user_count: 1400 }
// { page: "page2", user_count: 1100 }
// { page: "page3", user_count: 900 }
db.test.aggregate([
{ $setWindowFields: {
sortBy: { user_count: -1 },
output: { rank: { $rank: {} } }
}},
// { page: "page1", user_count: 1400, rank: 1 }
// { page: "page2", user_count: 1100, rank: 2 }
// { page: "page3", user_count: 900, rank: 3 }
{ $match: { page: "page2" } }
])
// { page: "page2", user_count: 1100, rank: 2 }
The $setWindowFields stage adds the global rank by:
sorting documents in decreasing order of user_count (sortBy: { user_count: -1 })
and adding a rank field to each document (output: { rank: { $rank: {} } }),
which is the rank of the document among all documents based on the sort field user_count.
The $match stage is there to simulate your filtering requirement.
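For the search use case, the pipeline above could be wrapped in a small helper that takes the search filter as a parameter. This is only a sketch: `buildRankPipeline` is a hypothetical name, and the `$in` filter is an assumed example of a search query.

```javascript
// Hypothetical helper: builds the aggregation pipeline shown above, with the
// search filter passed in instead of a hard-coded $match.
function buildRankPipeline(searchFilter) {
  return [
    // Rank every document in the collection first ...
    {
      $setWindowFields: {
        sortBy: { user_count: -1 },
        output: { rank: { $rank: {} } },
      },
    },
    // ... then keep only the documents that match the search.
    { $match: searchFilter },
  ];
}

// Assumed example filter: two pages returned by some search.
const pipeline = buildRankPipeline({ page: { $in: ["page2", "page3"] } });
console.log(JSON.stringify(pipeline, null, 2));
// In mongosh this would be passed as db.test.aggregate(pipeline).
```

The $match stage has to come after $setWindowFields. If the filter ran first, the ranks would be computed among the matching documents only, not across the whole collection.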

Related

Counting occurrences of search terms in Elasticsearch function score script

I have an Elasticsearch index with document structure like below.
{
"id": "foo",
"tags": ["Tag1", "Tag2", "Tag3"],
"special_tags": ["SpecialTag1", "SpecialTag2", "SpecialTag3"],
"reserved_tags": ["ReservedTag1", "ReservedTag2", "Tag1", "SpecialTag2"],
// rest of the document
}
The fields tags, special_tags, reserved_tags are stored separately for multiple use cases. In one of the queries, I want to order the documents by number of occurrences for searched tags in all the three fields.
For example, if I am searching with three tags Tag1, Tag4 and SpecialTag3, total occurrences are 2 in the above document. Using this number, I want to add a custom score to this document and sort by the score.
I am already using function_score as there are a few other attributes on which the scoring depends. To compute the number of matches, I tried a Painless script like the one below.
def matchedTags = 0;
def searchedTags = ["Tag1", "Tag4", "SpecialTag3"];
for (int i = 0; i < searchedTags.length; ++i) {
if (doc['tags'].contains(searchedTags[i])) {
matchedTags++;
continue;
}
if (doc['special_tags'].contains(searchedTags[i])) {
matchedTags++;
continue;
}
if (doc['reserved_tags'].contains(searchedTags[i])) {
matchedTags++;
}
}
// logic to score on matchedTags (returning matchedTags for simplicity)
return matchedTags;
This runs as expected, but it is extremely slow. I assume that ES has to count the occurrences for each doc and cannot use indexes here. (If someone can shed light on how this works internally or provide links to documentation/resources, that would be helpful.)
I want to have two scoring functions:
1. Score as a function of the number of occurrences.
2. Score higher for higher occurrences. This is basically the same as 1, but the repeated occurrences would be counted.
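The difference between the two counts can be checked on the sample document with a small sketch in plain JavaScript. It runs outside Elasticsearch, the helper names are hypothetical, and the second variant is only one reading of "repeated occurrences would be counted".

```javascript
// Sample document from the question (other fields omitted).
const doc = {
  tags: ["Tag1", "Tag2", "Tag3"],
  special_tags: ["SpecialTag1", "SpecialTag2", "SpecialTag3"],
  reserved_tags: ["ReservedTag1", "ReservedTag2", "Tag1", "SpecialTag2"],
};
const searchedTags = ["Tag1", "Tag4", "SpecialTag3"];
const fields = ["tags", "special_tags", "reserved_tags"];

// Variant 1: each searched tag counts at most once, however many fields hold it.
function countDistinctMatches(d, searched) {
  return searched.filter((t) => fields.some((f) => d[f].includes(t))).length;
}

// Variant 2: every field that holds the tag counts separately.
function countAllOccurrences(d, searched) {
  let total = 0;
  for (const t of searched) {
    for (const f of fields) {
      if (d[f].includes(t)) total += 1;
    }
  }
  return total;
}

console.log(countDistinctMatches(doc, searchedTags)); // 2
console.log(countAllOccurrences(doc, searchedTags)); // 3
```

Tag1 appears in both tags and reserved_tags, which is why the second count is 3 while the first stays at 2.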
Is there any way where I can get benefits of both faster searching and also the custom scoring using script?
Any help is appreciated. Thanks.

Binning Data With Two Timestamps

I'm posting because I have found no content surrounding this topic.
My goal is essentially to produce a time-binned graph that plots some aggregated value. For example. Usually this would be a doddle, since there is a single timestamp for each value, making it relatively straightforward to bin.
However, my problem lies in having two timestamps for each value - a start and an end. Similar to a Gantt chart, here is an example of my plotted data. I essentially want to bin the values (average) for when the timelines exist within said bin (bin boundaries could be where a new/old task starts/ends). Like so.
I'm looking for a basic example, or an answer to whether this is even supported in Vega-Lite. My current working example would yield no benefit to this discussion.
I see that you found a Vega solution, but I think in Vega-Lite what you were looking for was something like the following. You put the start field in "x" and the end field in x2, add bin and type to x and all should work.
"encoding": {
"x": {
"field": "start_time",
"bin": { "binned": true },
"type": "temporal",
"title": "Time"
},
"x2": {
"field": "end_time"
}
}
I lost my old account, but I was the person who posted this question. Here is my solution. The value I am aggregating here is the total time that each data point's timeline spends inside each bin.
First you want to use a join aggregate to get the max and min times your data extend to. You could also hardcode this.
{
type: joinaggregate
fields: [
startTime
endTime
]
ops: [
min
max
]
as: [
min
max
]
}
You need a step for your bins. You can hard-code it, or compute it with a formula and write it into a new field.
Next, create two new fields in your data: one is a sequence from the min to the max, and the other is the same sequence offset by your step.
{
type: formula
expr: sequence(datum.min, datum.max, datum.step)
as: startBin
}
{
type: formula
expr: sequence(datum.min + datum.step, datum.max + datum.step, datum.step)
as: endBin
}
The new fields will be arrays. So if we go ahead and use a flatten transform we will get a row for each data value in each bin.
{
type: flatten
fields: [
startBin
endBin
]
}
You then want to calculate the time each data point spends within each specific bin. To do this, clamp both the start time and the end time to the bin's range: a time before the bin start becomes the bin start, and a time after the bin end becomes the bin end. Then take the difference between the clamped end and start times.
{
type: formula
expr: if(datum.startTime<datum.startBin, datum.startBin, if(datum.startTime>datum.endBin, datum.endBin, datum.startTime))
as: startBinTime
}
{
type: formula
expr: if(datum.endTime<datum.startBin, datum.startBin, if(datum.endTime>datum.endBin, datum.endBin, datum.endTime))
as: endBinTime
}
{
type: formula
expr: datum.endBinTime - datum.startBinTime
as: timeInBin
}
Finally, you just need to aggregate the data by the bins and sum up these times. Then your data is ready to be plotted.
{
type: aggregate
groupby: [
startBin
endBin
]
fields: [
timeInBin
]
ops: [
sum
]
as: [
timeInBin
]
}
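The whole chain of transforms can be sanity-checked outside Vega with a plain JavaScript sketch. The `tasks` data and the `step` of 10 are made-up assumptions, times are plain numbers rather than timestamps, and the loops only imitate what the joinaggregate, sequence, flatten, formula and aggregate steps do.

```javascript
// Hypothetical sample data: numeric start/end times in arbitrary units.
const tasks = [
  { startTime: 0, endTime: 15 },
  { startTime: 5, endTime: 30 },
];
const step = 10; // assumed bin width

// joinaggregate: overall min start time and max end time.
const min = Math.min(...tasks.map((t) => t.startTime));
const max = Math.max(...tasks.map((t) => t.endTime));

// sequence + flatten: one (startBin, endBin) pair per step.
const bins = [];
for (let startBin = min; startBin < max; startBin += step) {
  bins.push({ startBin, endBin: startBin + step });
}

// Clamp a value into the range [lo, hi].
const clamp = (v, lo, hi) => Math.min(Math.max(v, lo), hi);

// formula + aggregate: sum each task's clamped overlap per bin.
const binned = bins.map(({ startBin, endBin }) => ({
  startBin,
  endBin,
  timeInBin: tasks.reduce(
    (sum, t) =>
      sum + clamp(t.endTime, startBin, endBin) - clamp(t.startTime, startBin, endBin),
    0
  ),
}));
console.log(binned);
```

With this data the three bins receive 15, 15 and 10 units, which add up to the 40 units of total task duration. A task that lies entirely outside a bin contributes 0 because both of its clamped times collapse to the same bin edge.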
Although this solution is long, it is relatively easy to implement in the transform section of your data. In my experience this runs fast, and it shows how versatile Vega can be. Freedom to visualisations!

FaunaDB search document and get its ranking based on a score

I have the following Collection of documents with structure:
type Streak struct {
UserID string `fauna:"user_id"`
Username string `fauna:"username"`
Count int `fauna:"count"`
UpdatedAt time.Time `fauna:"updated_at"`
CreatedAt time.Time `fauna:"created_at"`
}
This looks like the following in FaunaDB Collections:
{
"ref": Ref(Collection("streaks"), "288597420809388544"),
"ts": 1611486798180000,
"data": {
"count": 1,
"updated_at": Time("2021-01-24T11:13:17.859483176Z"),
"user_id": "276989300",
"username": "yodanparry"
}
}
Basically I need a lambda or a function that takes in a user_id and spits out its rank within the collection. rank is simply sorted by the count field. For example, let's say I have the following documents (I ignored other fields for simplicity):
user_id   count
abc       12
xyz       10
fgh       999
If I throw in fgh as an input for this lambda function, I want it to spit out 1 (or 0 if you start counting from 0).
I already have an index for user_id so I can query and match a document reference from this index. I also have an index sorted_count that sorts documents by the count field in ascending order.
My current solution was to query all documents by sorted_count index, then get the rank by iterating through the array. I think there should be a better solution for this. I'm just not seeing it.
Please help. Thank you!
Counting things in Fauna isn't as easy as one might expect. But you might still be able to do something more efficient than you describe.
Assuming you have:
CreateIndex(
{
name: "sorted_count",
source: Collection("streaks"),
values: [
{ field: ["data", "count"] }
]
}
)
Then you can query this index like so:
Count(
Paginate(
Match(Index("sorted_count")),
{ after: 10, size: 100000 }
)
)
Which will return an object like this one:
{
before: [10],
data: [123]
}
Which tells you that there are 123 documents with count >= 10, which I think is what you want.
This means that, in order to get a user's rank based on their user_id, you'll need to implement this two-step process:
Determine the count of the user in question using your index on user_id.
Query sorted_count using the user's count as described above.
Note that, in case your collection has more than 100,000 documents, you'll need your Go code to iterate through all the pages based on the returned object's after field. 100,000 is Fauna's maximum allowed page size. See the Fauna docs on pagination for details.
Also note that this might not reflect whatever your desired logic is for resolving ties.
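The two-step process and its effect on ties can be sketched in plain JavaScript over the sample data from the question. This is an in-memory illustration only: `rankByUserId` is a hypothetical helper, not Fauna API, and the array stands in for the collection and its indexes.

```javascript
// Sample data from the question, as an in-memory stand-in for the collection.
const streaks = [
  { user_id: "abc", count: 12 },
  { user_id: "xyz", count: 10 },
  { user_id: "fgh", count: 999 },
];

// Hypothetical helper, not Fauna API.
function rankByUserId(docs, userId) {
  // Step 1: look up the user's count (the user_id index in Fauna).
  const user = docs.find((d) => d.user_id === userId);
  if (user === undefined) return null;
  // Step 2: count documents with count >= the user's count
  // (the Count/Paginate query on sorted_count).
  return docs.filter((d) => d.count >= user.count).length;
}

console.log(rankByUserId(streaks, "fgh")); // 1
console.log(rankByUserId(streaks, "xyz")); // 3
```

Because the comparison is >=, users with equal counts all receive the lowest rank of their tied group. Counting only strictly greater values and adding 1 would instead give them the best rank of the group.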

Execute bulk find queries with sorting and limit mongo

I have documents that contain a category field and I want to take the top 10 documents from each category based on popularity, seniority and rise in popularity. I plan on using a votes field within the document to determine the popularity, the _id field for seniority and a votesPerDay field to determine which are rising in popularity. There are a total of 12 categories.
Typical documents will look like this.
{
name : 'alpaca',
category : 'blue',
votes : 500,
_id : Object.Id,
votesPerDay : 50
}
{
name : 'muon',
category : 'green',
votes : 100,
_id : Object.Id,
votesPerDay : 20
}
I have an object that needs to store all the categories and within each category it will store the most popular, the newest and those rising in popularity. The object will be refreshed every 24 hours.
Where I am running into trouble is whenever I want to query mongo for all 12 different categories.
I have tried having a for loop run three queries for each category and store the results once they arrive, but this fails: it seems that I am overloading the MongoDB server and it crashes.
The architecture of the object I am trying to build will look something like this.
var myObj = {category1 : {
newest : [0,2....9],
mostPopular : [0,2....9],
risingInPopularity : [0,2....9]
} ..... with 12 of such objects};
The way I initially thought of performing the queries (although I am a bit uncomfortable doing it this way) was:
categories.forEach(function(category){
  var query = {category : category};
  // newest: highest _id first
  var sort = {_id : -1};
  performQuery(query,sort,function(results){
    myObj[category].newest = results;
  });
  // most popular: most votes first
  sort = {votes : -1};
  performQuery(query,sort,function(results){
    myObj[category].mostPopular = results;
  });
  // rising in popularity: most votesPerDay first
  sort = {votesPerDay : -1};
  performQuery(query,sort,function(results){
    myObj[category].risingInPopularity = results;
  });
  //limit has to be set to 10 documents
});
So my question is how is it best to perform this type of mass query where I will need to obtain documents based on category and retrieve only those top 10 documents that have the most votes, are the most recently added (based on _id), and have the most votesPerDay.
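One way to picture the workload is to write the 36 queries down as data before running any of them. The sketch below is plain JavaScript with hypothetical names (`sortCriteria`, `buildQuerySpecs`); the sort fields come from the description above, while the descending direction and the shape of each spec are assumptions.

```javascript
// Sort criteria taken from the question's description:
// newest by _id, most popular by votes, rising by votesPerDay.
const sortCriteria = {
  newest: { _id: -1 },
  mostPopular: { votes: -1 },
  risingInPopularity: { votesPerDay: -1 },
};

// Hypothetical helper: one query description per (category, criterion) pair.
function buildQuerySpecs(categories) {
  const specs = [];
  for (const category of categories) {
    for (const [key, sort] of Object.entries(sortCriteria)) {
      specs.push({ category, key, query: { category }, sort, limit: 10 });
    }
  }
  return specs;
}

const specs = buildQuerySpecs(["blue", "green"]);
console.log(specs.length); // 6
console.log(specs[0]);
```

With the specs in a list, the queries can be issued one after another (or a few at a time) and each result stored under `myObj[spec.category][spec.key]`, instead of firing all of them at the server at once.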

Sorting values in couchdb

I'm trying to pull scores back from this view
"scores": {
"map": "function(doc) { emit(doc.appName, {_id:doc.username, username:doc.username, credits:doc.credits, avatar:doc.avatar}) }"
},
I pass in the appname and it returns the scores. The issue is that the scores won't sort correctly. When I try to sort them, they only sort by the first digit, like so:
1500
50
7
900
As you can see the first numbers are sorted ASC but the whole number itself isn't. Is it possible to have couchdb sort the scores if the appname is the key?
Is doc.appName a string? Turn it into a number:
function(doc) {
emit(parseInt(doc.appName), {_id:doc.username, username:doc.username, credits:doc.credits, avatar:doc.avatar});
// ^^^^^^^^
}
Use a complex key:
emit([doc.appName, doc.score], null)
Then query using a range:
startkey=["app1", 0]&endkey=["app1", {}]
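The ordering in the question (1500, 50, 7, 900) is what string comparison produces, which suggests the values are being compared as strings rather than numbers. The sketch below uses JavaScript's default array sort as a stand-in for string ordering; it does not run inside CouchDB, and that the scores are stored as strings is an assumption.

```javascript
const asStrings = ["900", "7", "50", "1500"];
const asNumbers = [900, 7, 50, 1500];

// Strings compare character by character, so "1500" sorts before "50".
const stringOrder = [...asStrings].sort();

// Numbers compare by value.
const numericOrder = [...asNumbers].sort((a, b) => a - b);

console.log(stringOrder); // [ '1500', '50', '7', '900' ]
console.log(numericOrder); // [ 7, 50, 900, 1500 ]
```

The same applies to the complex key: the second element only sorts numerically if it is emitted as a number, so a string value would need converting before it goes into the key.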
