Paginating Distinct Values in Mongo - performance

I would like to retrieve distinct values for a field from a collection in my database. The distinct command is the obvious solution. The problem is that some fields have a large number of possible values and are not simple primitive values (i.e. they are complex sub-documents instead of just a string). This means the results are large, which overloads the client I am delivering the results to.
The obvious solution is to paginate the resulting distinct values. But I cannot find a good way to optimally do this. Since distinct doesn't have pagination options (limit, skip, etc) I am turning to the aggregation framework. My basic pipeline is:
[
  {$match: {... the documents I am interested in ...}},
  {$group: {_id: '$myfield'}},
  {$sort: {_id: 1}},
  {$limit: 10},
]
This gives me the first 10 unique values for myfield. To get the next page a $skip operator is added to the pipeline. So:
[
  {$match: {... the documents I am interested in ...}},
  {$group: {_id: '$myfield'}},
  {$sort: {_id: 1}},
  {$skip: 10},
  {$limit: 10},
]
But sometimes the field I am collecting unique values from is an array. This means I have to unwind it before grouping. So:
[
  {$match: {... the documents I am interested in ...}},
  {$unwind: '$myfield'},
  {$group: {_id: '$myfield'}},
  {$sort: {_id: 1}},
  {$skip: 10},
  {$limit: 10},
]
Other times the field I am getting unique values for may not be an array, but its parent node might be an array. So:
[
  {$match: {... the documents I am interested in ...}},
  {$unwind: '$list'},
  {$group: {_id: '$list.myfield'}},
  {$sort: {_id: 1}},
  {$skip: 10},
  {$limit: 10},
]
Finally sometimes I also need to filter on data inside the field I am getting distinct values. This means I sometimes need another match operator after the unwind:
[
  {$match: {... the documents I am interested in ...}},
  {$unwind: '$list'},
  {$match: {... filter within list.myfield ...}},
  {$group: {_id: '$list.myfield'}},
  {$sort: {_id: 1}},
  {$skip: 10},
  {$limit: 10},
]
The above are all simplifications of my actual data. Here is a real pipeline from the application:
[
{"$match": {
"source_name":"fmls",
"data.surroundings.school.nearby.district.id": 1300120,
"compiled_at":{"$exists":true},
"data.status.current":{"$ne":"Sold"}
}},
{"$unwind":"$data.surroundings.school.nearby"},
{"$match": {
"source_name":"fmls",
"data.surroundings.school.nearby.district.id":1300120,
"compiled_at":{"$exists":true},
"data.status.current":{"$ne":"Sold"}
}},
{"$group":{"_id":"$data.surroundings.school.nearby"}},
{"$sort":{"_id":1}},
{"$skip":10},
{"$limit":10}
]
I hand the same $match document to both the initial filter and the filter after the unwind because the $match document is somewhat opaque from a 3rd party so I don't really know which parts of the query are filtering outside my unwind vs inside the data I am getting distinct values for.
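To make the variations above concrete, here is a minimal sketch of a helper that assembles the pipeline for each case. The function name buildDistinctPipeline and its option names are hypothetical, and it assumes the same $match document is reused after the $unwind, as described above.

```javascript
// Hypothetical helper: assembles the paginated distinct-values pipeline.
// `match` is the (opaque) filter document, `field` the path to collect
// distinct values from, `unwindPath` the array to unwind (if any).
function buildDistinctPipeline({ match, field, unwindPath = null, page = 0, pageSize = 10 }) {
  const pipeline = [{ $match: match }];
  if (unwindPath) {
    pipeline.push({ $unwind: `$${unwindPath}` });
    // Re-apply the same filter so conditions on unwound elements take effect.
    pipeline.push({ $match: match });
  }
  pipeline.push({ $group: { _id: `$${field}` } });
  pipeline.push({ $sort: { _id: 1 } });
  if (page > 0) {
    pipeline.push({ $skip: page * pageSize });
  }
  pipeline.push({ $limit: pageSize });
  return pipeline;
}

// Second page of distinct list.myfield values.
const secondPage = buildDistinctPipeline({
  match: { source_name: "fmls" },
  field: "list.myfield",
  unwindPath: "list",
  page: 1,
});
console.log(JSON.stringify(secondPage, null, 2));
```

The result is a plain array, so it can be passed to whatever aggregate call the driver in use provides.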
Is there an obviously better way to go about this? In general, my strategy works, but some queries take 10-15 seconds to return results. There are about 200,000 documents in the collection, although only about 60,000 remain after the first $match in the pipeline is applied (which can use indexes).

Related

Can I apply facets on the entire solr result instead of the filtered solr result?

Let's say I have a collection of solr documents:
{'id': 1, 'title': 'shirt', 'brand': 'adidas'},
{'id': 2, 'title': 'shirt2', 'brand': 'adidas'},
{'id': 3, 'title': 'shirt', 'brand': 'nike'},
{'id': 4, 'title': 'shirt', 'brand': 'puma'}
Now if I try to apply faceting on this with facet.field=brand, I will get the following
{'adidas': 2, 'nike': 1, 'puma': 1}
In addition, if I apply a filter brand:adidas, I get the following for faceting:
{'adidas': 2, 'nike': 0, 'puma': 0}
Is it possible to still get the count of documents returned by solr before the filtering happened? I would like to still see {'adidas': 2, 'nike': 1, 'puma': 1} regardless of what filters I apply
You can use tagging and excluding filters as shown in the reference guide.
q=yourquery&fq={!tag=brand}brand:adidas&facet=true&facet.field={!ex=brand}brand
The {!ex} instruction will then exclude any filters matching that tag from the active filters when generating the facet results, allowing you to keep the complete set of brands matching the query available as facets.
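As a rough sketch, the request parameters above could be assembled like this. The helper name buildFacetQuery is hypothetical, the tag is simply named after the field, and the filter value is not escaped for Solr's query syntax.

```javascript
// Hypothetical helper: builds Solr parameters that filter on `field`
// but exclude that filter when computing the facet counts for `field`.
function buildFacetQuery(query, field, value) {
  const params = new URLSearchParams();
  params.append("q", query);
  // Tag the filter so that it can be excluded by name later.
  params.append("fq", `{!tag=${field}}${field}:${value}`);
  params.append("facet", "true");
  // Exclude the tagged filter for this facet only.
  params.append("facet.field", `{!ex=${field}}${field}`);
  return params;
}

const params = buildFacetQuery("shirt", "brand", "adidas");
console.log(params.get("fq")); // {!tag=brand}brand:adidas
console.log(params.toString()); // URL-encoded query string
```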

Generic way to get prev/next search results by id in Elasticsearch

Say I have a million (many) documents in my index. I execute a search query sorting the items by some key X.
Now I have a very long list of results: [..., id1, id2, id3, ...]
Question: how do I get id1 and id3 if I know id2 but don't want to execute the whole search/don't want to get all ids?
I'm looking for a generic solution that works for any search query: given an id that is known to exist in the results of a query, how do I get the prev/next items for that id? The query should NOT need prior knowledge of anything other than the id whose prev/next are being searched for. (In other words, if the results are ordered by title and I search for the prev/next of id X, the title of X is not known at query time, only X's id.)
It is of course possible to execute multiple search queries and achieve the same end result by getting id2 and then playing with ordering to get ids 1 and 3.
EDIT:
I think Luc E's answer isn't what I'm looking for. In that scenario, knowledge of the original objects title is required to query for prev/next. I'm looking for a solution where only the id is known at query time.
Example data looks like this:
[...
{id: 32, title: 'AAA'},
{id: 12, title: 'BBB'},
{id: 99, title: 'CCC'},
{id: 3, title: 'DDD'},
{id: 1001, title: 'EEE'},
...]
What I know: id 99. What I don't know: what is title of id 99.
What I want: ids of the prev/next items sorted by title field (=3 and 12).
To put it yet another way: I have id 99 but not the title in my hand. I want a query that gives me ids 3 and 12 (they are prev/next sorted by title).
What you want to do is called deep scrolling, and there are only two ways to do it:
scroll
search_after
The easiest way is search_after, but you will need to make two requests:
One request for id3
Another one for id1
In this example I am looking for the neighbours of id2, which is 128. The documents are sorted on the title field, and I have fetched the title of id2 beforehand; it is title_of_128.
To perform the search_after, I have to add _id as a secondary sort condition.
Here is my query:
POST test/_search
{
  "size": 2,
  "search_after": ["title_of_128", "128"],
  "sort": [
    { "title": { "order": "asc" } },
    { "_id": { "order": "asc" } }
  ]
}
The result of this query is id2 and id3
Now I reverse the direction of the sort in order to retrieve id1:
POST test/_search
{
  "size": 2,
  "search_after": ["title_of_128", "128"],
  "sort": [
    { "title": { "order": "desc" } },
    { "_id": { "order": "desc" } }
  ]
}
The result of this query is id2 and id1.
Note that sorting on _id is deprecated; the best practice is to copy the _id into another field if you want to use it with search_after.
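The two requests differ only in sort direction, so they can be generated from one template. The following is a minimal sketch; the function name buildNeighbourQueries is hypothetical, and the idField parameter is an assumption that lets you point the tiebreaker at a copied id field instead of _id.

```javascript
// Hypothetical helper: builds the "next" and "previous" search bodies
// for a document whose sort value and id are already known.
function buildNeighbourQueries(sortField, sortValue, id, idField = "_id", size = 2) {
  const body = (order) => ({
    size,
    search_after: [sortValue, id],
    sort: [
      { [sortField]: { order } },
      { [idField]: { order } }, // tiebreaker for equal sort values
    ],
  });
  return { next: body("asc"), previous: body("desc") };
}

const queries = buildNeighbourQueries("title", "title_of_128", "128");
console.log(JSON.stringify(queries.next, null, 2));
console.log(JSON.stringify(queries.previous, null, 2));
```

Each body would be sent as a separate POST to the index's _search endpoint, as in the requests above.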

Elasticsearch array scoring

I'm using elasticsearch to search multiple array fields in my type, which looks something like
t1 = { field1: ["foo", "bar"],
       field2: ["foo", "foo", "foo", "foo"],
       field3: ["foo", "foo", "foo", "foo", "foo", "foo"]
     }
And then I'm using a multi_match query to get matches, something along
multi_match: { query: "foo",
               fields: "field*"
             }
When computing the score of t1, elasticsearch adds the score of queries in field1, field2 and field3 which is what I want. However, they are not contributing equally, field3 contributes to the score the most since "foo" occurs multiple times there.
I want now to compute the score within each array field by not adding up the score of all array entries, but by just taking the maximum of them. In my example, all fields contained would have the same score then since they all have one exact match.
This question was already asked on the elasticsearch forum, but has not been answered so far.
I've been stumped on this myself, it really seems like there should be a simple, builtin way to just specify max instead of sum.
Not sure if this is exactly what you're going for, because you lose the match score on any particular item in the array. So you're not getting max of the match score of the best particular item, just a boolean value if anything matches. If it's something more nuanced (say a person's full name, where you want a better match for first and last vs just one or the other) this may not be acceptable because you're throwing out your scores.
If it is acceptable, this workaround seems to work:
{function_score: {
query: {bool: {should: [
{term: {field1: 'foo'}},
{term: {field2: 'foo'}},
{term: {field3: 'foo'}},
]}},
functions: [
{filter: {term: {field1: 'foo'}}, weight: 1},
{filter: {term: {field2: 'foo'}}, weight: 1},
{filter: {term: {field3: 'foo'}}, weight: 1},
],
score_mode: 'sum',
boost_mode: 'replace',
}}
We need the "query" part to give us the results to further filter, even though we discard the score. This seems like it should really be a filter, but just wrapping this same thing in the filtered query doesn't work. There may be a better option here.
Then, the weight functions just basically give a 1 if there's a match on that field and 0 otherwise. The score_mode tells it to sum those weights, so in your case they all match so we get 3. The boost_mode tells how to combine with the original query, "replace" tells it to ignore the original query score (which has the problem you mentioned that multiple matches in an array are being summed). So, the total score of this query is 3 because there are 3 matches.
It seems more complicated to me, but in my relatively limited testing I haven't noticed performance issues or anything. I'd love to see a better answer if someone more familiar with elasticsearch has one.
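Since the should clauses and the weight functions mirror each other, the query can be generated from a list of field names. This is only a sketch of the workaround above; the function name buildPerFieldMatchQuery is hypothetical.

```javascript
// Hypothetical helper: one should clause and one weight-1 function per field,
// so the final score is the number of fields that contain the term.
function buildPerFieldMatchQuery(fields, term) {
  const clauses = fields.map((field) => ({ term: { [field]: term } }));
  return {
    function_score: {
      query: { bool: { should: clauses } },
      functions: clauses.map((filter) => ({ filter, weight: 1 })),
      score_mode: "sum", // add up the per-field weights
      boost_mode: "replace", // ignore the original query score
    },
  };
}

const query = buildPerFieldMatchQuery(["field1", "field2", "field3"], "foo");
console.log(JSON.stringify(query, null, 2));
```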

ElasticSearch custom score script does not preserve array ordering

I am using ElasticSearch with a function_score property to retrieve documents sorted by createdOn. The createdOn field is stored as an Array representing date values, i.e.
"createdOn": [ 2014, 4, 24, 22, 11, 47 ]
Where createdOn[0] is year, createdOn[1] is month, createdOn[2] is day, etc. I am testing the following query, which should return documents scored by year. However, the doc['createdOn'] array does not preserve the value of the elements. In this query, doc['createdOn'].values[0] returns 4, not 2014.
POST /example/1
{
  "name": "apple",
  "createdOn": [2014, 8, 22, 5, 12, 32]
}
POST /example/2
{
  "name": "apple",
  "createdOn": [2011, 8, 22, 5, 12, 32]
}
POST /example/3
{
  "name": "apple",
  "createdOn": [2013, 8, 22, 5, 12, 32]
}
POST /example/_search
{
"query":
{
"function_score": {
"boost_mode": "replace",
"query": {
"match_all": {}
},
"script_score" : {
"script": "doc['createdOn'].values[0]"
}
}
}
}
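A simplified model of this behaviour, assuming the values of a multi-valued field are exposed to scripts in sorted order rather than in source order (this is a plain JavaScript simulation, not Elasticsearch code, and the function name is hypothetical):

```javascript
// Simulates reading a multi-valued numeric field whose values are
// stored sorted, instead of in the order they appear in the source.
function valuesAsSeenByScript(sourceValues) {
  return [...sourceValues].sort((a, b) => a - b);
}

const createdOn = [2014, 4, 24, 22, 11, 47];
console.log(createdOn[0]); // 2014 (source order)
console.log(valuesAsSeenByScript(createdOn)[0]); // 4 (sorted order)
```

Under that assumption, index 0 is the smallest element of the array, not the year.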
The only apparent solution, other than using the source method (which is slow), is to use nested queries. Any ideas on how I could rewrite my query using nested queries? It seems like the only efficient way to sort this query by year.
It appears that this behaviour is due to the way ElasticSearch caches data, as explained in http://elasticsearch-users.115913.n3.nabble.com/Accessing-array-field-within-Native-Plugin-td4042848.html:
The docFieldDoubles method gets its values from the in-memory
structures of the field data cache. This is done for performance. The
field data cache is not loaded from the source of the document (because
this would be slow) but from the Lucene index, where the values are
sorted (for lookup speed). The get API does work based on the original
document source, which is why you see those values in order (note: ES
doesn't parse the source for the get API, it just gives you back
what you've put in it).
You can access the original document (which will be parsed) using the
SourceLookup (available from the source method) but it will be slow as
it needs to go to disk for every document.
I'm not sure about the exact semantics of what you are trying to
achieve, but did you try looking at nested objects? Those allow you to
store a list of objects in a way that keeps values together, like [{
"key": "k1" , "value" : "v1"},...].

CouchDB - hierarchical comments with ranking. Hacker News style

I'm trying to implement a basic way of displaying comments in the way that Hacker News provides, using CouchDB. Not only ordered hierarchically, but also, each level of the tree should be ordered by a "points" variable.
The idea is that I want a view to return the comments in the order I expect, rather than making many Ajax calls, for example, to retrieve them and then arranging them so they look correctly ordered.
This is what I got so far:
Each document is a "comment".
Each comment has a property path which is an ordered list containing all its parents.
So for example, imagine I have 4 comments (with _id 1, 2, 3 and 4). Comment 2 is a child of 1, comment 3 is a child of 2, and comment 4 is also a child of 1. This is what the data would look like:
{ _id: 1, path: ["1"] },
{ _id: 2, path: ["1", "2"] },
{ _id: 3, path: ["1", "2", "3"] },
{ _id: 4, path: ["1", "4"] }
This works quite well for the hierarchy. A simple view will already return things ordered the way I want it.
The issue comes when I want to order each "level" of the tree independently. So for example documents 2 and 4 belong to the same branch, but are ordered, on that level, by their ID. Instead I want them ordered based on a "points" variable that I want to add to the path - but can't seem to understand where I could be adding this variable for it to work the way I want it.
Is there a way to do this? Consider that the "points" variable will change in time.
Because each level needs to be sorted recursively by score, Couch needs to know the score of each parent to make this work the way you want it to.
Taking your example with the following scores (1: 10, 2: 10, 3: 10, 4: 20)
In this case you'd want the ordering to come out like the following:
.1
.1.4
.1.2
.1.2.3
Your document needs a scores array like this:
{ _id: 1, path: [1], scores: [10] },
{ _id: 2, path: [1, 2], scores: [10,10] },
{ _id: 3, path: [1, 2, 3], scores: [10,10,10] },
{ _id: 4, path: [1, 4], scores: [10,20] }
Then you'll use the following sort key in your view.
emit([doc.scores, doc.path], doc)
The path gets used as a tiebreaker because there will be cases where sibling comments have the exact same score. Without the tiebreaker, their descendants could lose their grouping (by chain of ancestry).
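To see how this key sorts, here is a small simulation using the four example documents. The compareKeys function is a simplified stand-in for CouchDB's view collation that handles only numbers and arrays (arrays compare element by element, and a shorter array sorts before a longer one that it prefixes); it is an assumption for illustration, not CouchDB's implementation.

```javascript
// Simplified stand-in for view collation: numbers and (nested) arrays only.
function compareKeys(a, b) {
  if (Array.isArray(a) && Array.isArray(b)) {
    const shared = Math.min(a.length, b.length);
    for (let i = 0; i < shared; i++) {
      const result = compareKeys(a[i], b[i]);
      if (result !== 0) return result;
    }
    return a.length - b.length; // a prefix sorts before its extensions
  }
  return a - b;
}

const comments = [
  { _id: 4, path: [1, 4], scores: [10, 20] },
  { _id: 3, path: [1, 2, 3], scores: [10, 10, 10] },
  { _id: 1, path: [1], scores: [10] },
  { _id: 2, path: [1, 2], scores: [10, 10] },
];

// Mirrors emit([doc.scores, doc.path], doc) followed by an ascending read.
function viewOrder(docs) {
  return docs
    .map((doc) => ({ key: [doc.scores, doc.path], id: doc._id }))
    .sort((x, y) => compareKeys(x.key, y.key))
    .map((row) => row.id);
}

console.log(viewOrder(comments)); // [ 1, 2, 3, 4 ]
```

With the raw scores, every parent precedes its children, but comment 4 (score 20) lands after comment 2 (score 10), so within a level the lower score comes first.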
Note: This approach returns scores from low to high, whereas you probably want scores high to low and the path tiebreaker low to high. A workaround is to populate the scores array with the inverse (1/score) of each score, like this:
{ _id: 1, path: [1], scores: [0.1] },
{ _id: 2, path: [1, 2], scores: [0.1,0.1] },
{ _id: 3, path: [1, 2, 3], scores: [0.1,0.1,0.1] },
{ _id: 4, path: [1, 4], scores: [0.1,0.05] }
and then request the view in its default ascending order. Higher scores now have smaller keys, so they sort first, while each parent still sorts before its children. (Using descending=true would also reverse the path ordering and put children ahead of their parents.)
For anyone interested, there is a mailing list thread on this question with several alternative solutions:
http://mail-archives.apache.org/mod_mbox/couchdb-dev/201205.mbox/thread (subject "Hierarchical comments Hacker News style", 16/05/2012)
