Elasticsearch Score Function Depending on Neighbor Documents

I have an ElasticSearch index with 2 mappings (types).
In the app I need to display a paginated feed containing items of both types.
Currently the items are sorted just by creation date, but I also want control over how the items of the two types alternate with each other on the page.
For example, I want to set a rule for sequence "3 items of type A, 1 item of type B, and so on".
I need it to make sure items of both types are displayed on each page and equally distributed across the pages.
But as far as I can see, it's not possible to access other documents from a custom score function script.
Of course it would be easy to implement this directly in the app logic, but then it's not clear how to implement pagination.
Any ideas on how to achieve that?

I don't think you can do this.
One approach (that doesn't work) would be to keep a global variable in a script and increment it each time a document is returned/processed, then take that number modulo 3 and sort the docs based on the result. But "global" variables are not possible in scripts.
The only two approaches that I can think of are these. One is to use a script to generate a random number and sort based on that; this way you get some chance of having a "mixed" list of types.
Or, if you want the simplest deterministic way of sorting the docs, still in a script, take the ID of the document (which you said is a number), compute it modulo 3, and use that value to sort (sketched after the random example below).
For the random approach:
"sort": [
{
"date": {
"order": "desc"
}
},
{
"_script": {
"script": "Math.random()",
"type": "number",
"order": "asc"
}
}
]
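For the deterministic approach, the same sort with the random script swapped for a modulo on the ID might look roughly like this (assuming the numeric ID is indexed in a field called id; adjust the field name to whatever you actually use):
"sort": [
  {
    "date": {
      "order": "desc"
    }
  },
  {
    "_script": {
      "script": "doc['id'].value % 3",
      "type": "number",
      "order": "asc"
    }
  }
]
Being deterministic, this also keeps the ordering stable across pages, which the random version does not.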

Related

Elasticsearch - Limit of total fields [1000] in index exceeded

I saw that there are some concerns about raising the total limit on fields above 1000.
I have a situation where I am not sure how to approach it from the design point of view.
I have lots of simple key value pairs:
key1:15, key2:45, key99999:1313123.
Here the key is a string and the value is an integer that I would like to sort my results on, so that if a certain document has a given key it gets sorted by that key's value.
I ended up creating an object and just putting the key-value pairs inside so I can match them easily.
For example, I sort on "object.key".
I was wondering: if I just use a simple object with a bunch of strings inside that are only there for exact matching, should I worry about raising this limit to 10k or 20k?
Because I now have an issue where there can be more than 1k of these records. I've found I could use nested sorting, but it still has a default limit of 10k.
Is there a good design pattern approach for this or should I not be worried by raising the field limits?
Simplified version of the query:
GET products/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "sortingObject.someSortingKey1": {
        "order": "desc",
        "missing": 2,
        "unmapped_type": "float"
      }
    }
  ]
}
The point is that I get the sorting key from the request and use it to sort my results. There can be, for example, 100k different ways to sort the results.
There were some recent improvements (in 7.16) that should help there, but 10K or 20K fields is still a lot of overhead.
I'm not sure what kind of queries you need to run on those keyX fields, but maybe the flattened data-type would work for you? https://www.elastic.co/guide/en/elasticsearch/reference/current/flattened.html
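If that route looks interesting, a minimal sketch of the mapping (index and field names taken from your example query) could be:
PUT products
{
  "mappings": {
    "properties": {
      "sortingObject": {
        "type": "flattened"
      }
    }
  }
}
The whole object then counts as a single mapped field no matter how many keys it contains, which is the point here. The trade-off is that all leaf values are indexed as keywords, so they sort and compare as strings rather than numbers; whether that is acceptable for your integer values is something you'd have to check.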

Navigating terms aggregation in Elastic with very large number of buckets

Hope everyone is staying safe!
I am trying to explore the proper way to tackle the following use case in Elasticsearch.
Let's say that I have about 700000 docs which I would like to bucket on the basis of a field (let's call it primary_id). This primary id can be the same for more than one doc (usually up to 2-3 docs will have the same primary_id). In all other cases the primary_id is not repeated in any other doc.
So on average, out of every 10 docs I will have 8 unique primary ids, and 1 primary id shared between 2 docs.
To ensure uniqueness I tried using the terms aggregation and I ended up getting buckets in response to my search request but not for the subsequent scroll requests. Upon googling, I found that scroll queries do not support aggregations.
As a result, I tried finding alternate solutions, and tried the solution in this link as well: https://lukasmestan.com/learn-how-to-use-scroll-elasticsearch-aggregation/
It suggests using multiple search requests, each specifying the partition number to fetch (depending on how many partitions you divide your result into). But I receive client timeouts even with high timeout settings on the client side.
Ideally, I want to know the best way to go about data where the cardinality of the field that forms the buckets is almost equal to the number of docs. The SQL equivalent would be select DISTINCT(primary_id) from .....
But in Elasticsearch, distinct values can only be obtained via bucketing (terms aggregation).
I also use top_hits as a sub-aggregation under the terms aggregation to fetch the _source fields.
Any help would be extremely appreciated!
Thanks!
There are 3 ways to paginate aggregations.
Composite aggregation
Partition
Bucket sort
Partition you have already tried.
Composite aggregation: can combine multiple sources into a single set of buckets and allows pagination and sorting on them. It can only paginate linearly using after_key, i.e. you cannot jump from page 1 to page 3. You can fetch "n" records, then pass the returned after_key and fetch the next "n" records (a follow-up request is sketched after the example below).
GET index22/_search
{
  "size": 0,
  "aggs": {
    "ValueCount": {
      "value_count": {
        "field": "id.keyword"
      }
    },
    "pagination": {
      "composite": {
        "size": 2,
        "sources": [
          {
            "TradeRef": {
              "terms": {
                "field": "id.keyword"
              }
            }
          }
        ]
      }
    }
  }
}
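For the next page, a sketch of the follow-up request; the "after" value is whatever after_key the previous response returned, the one shown here is just a placeholder:
GET index22/_search
{
  "size": 0,
  "aggs": {
    "pagination": {
      "composite": {
        "size": 2,
        "sources": [
          {
            "TradeRef": {
              "terms": {
                "field": "id.keyword"
              }
            }
          }
        ],
        "after": { "TradeRef": "previous-after-key-here" }
      }
    }
  }
}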
Bucket sort
The bucket_sort aggregation, like all pipeline aggregations, is executed after all other non-pipeline aggregations. This means the sorting only applies to whatever buckets are already returned from the parent aggregation. For example, if the parent aggregation is terms and its size is set to 10, the bucket_sort will only sort over those 10 returned term buckets.
So this isn't suitable for your case.
You can increase the result size to a value greater than 10K by updating the index.max_result_window setting. Setting too big a size can cause out-of-memory issues, so you need to test how much your hardware can support.
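For reference, that setting can be changed per index roughly like this (index name reused from the example above, and 20000 is just an illustrative value):
PUT index22/_settings
{
  "index.max_result_window": 20000
}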
A better option is to use the scroll API and perform the distinct at the client side, as sketched below.
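A rough sketch of that approach: open a scroll that returns only primary_id, keep pulling pages with the scroll ID, and deduplicate the values on the client (index name reused from the example above; the scroll ID is a placeholder for the one returned by the first call):
POST index22/_search?scroll=1m
{
  "size": 1000,
  "_source": ["primary_id"],
  "query": {
    "match_all": {}
  }
}
POST _search/scroll
{
  "scroll": "1m",
  "scroll_id": "<scroll_id returned by the previous call>"
}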

Search After (pagination) in Elasticsearch when sorting by score

search_after in Elasticsearch must match its sorting parameters in count and order. So I was wondering how to get the score from the previous result (e.g. page 1) to use as search_after for the next page.
I faced an issue when using the score of the last document in the previous search. The score was 1.0, and since all documents have a score of 1.0, the result for the next page turned out to be null (empty).
That actually makes sense, since I am asking Elasticsearch for results that have a lower rank (score) than 1.0, of which there are none. So which score do I use to get the next page?
Note:
I am sorting by score and then by TieBreakerID, so one possible solution is using a high value (say 1000) for the score.
What you're doing sounds like it should work, as explained by an Elastic team member. It works for me (in ES 7.7) even with tied scores when using the document ID (copied into another indexed field) as a tiebreaker. It's true that indexing additional documents while paginating will make your scores slightly unstable, but not likely enough to cause a significant problem for an end user. If you need it to be reliable for a batch job, the Scroll API is the better choice.
{
  "query": {
    ...
  },
  "search_after": [
    12.276552,
    14173
  ],
  "sort": [
    { "_score": "desc" },
    { "id": "asc" }
  ]
}
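For context, the two values passed to search_after above are simply the "sort" array Elasticsearch returns with each hit once a sort is specified, so the last hit of page 1 might look roughly like this (values illustrative, other response fields omitted):
{
  "_id": "...",
  "_score": 12.276552,
  "_source": { ... },
  "sort": [
    12.276552,
    14173
  ]
}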

Possible to have a document always return above certain position

I've got a bunch of documents from a query which are sorted by a modified date. However I'd like certain documents (identified by a field value) to always return in the top ten results regardless of whether there are ten or more documents with a more recent modified date.
From what I've read about the various ways of sorting in Elasticsearch (score, boost, scripts) I don't think I have any way of determining the actual position of a document in the search results, let alone some way of manipulating the score to push a document into the top ten.
Assuming that you have a field called "important_field" which contains the value 1 for documents you want on top and, say, 0 for all other documents, you can use multi-field sorting as below:
{
  "sort": [
    { "important_field": { "order": "desc" } },
    { "modified_date": { "order": "desc" } }
  ]
}
This way of sorting means it will sort by the important_field value, and if those values are the same, documents will then be sorted by modified_date. So all documents with important_field value 1 will come on top, and the rest will still be sorted by modified_date.
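If some documents don't carry important_field at all, a hedged variant of the same sort using the missing option keeps such documents in the normal date order:
{
  "sort": [
    { "important_field": { "order": "desc", "missing": 0 } },
    { "modified_date": { "order": "desc" } }
  ]
}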

Add custom comparatorClass class in Solr

I am a newbie in Solr. I want to add a custom comparatorClass in Solr. I also need to use the fields term and count, which I have defined in my schema.xml, in my custom class.
Structure of indexing document :
"docs": [
{
"count": 98,
"term": "age",
},
{
"count": 6,
"term": "age assan",
},
{
"count": 5,
"term": "age but",
},
{
"count": 10,
"term": "age salman",
}]
I have stored ngrams with their term and count, but Solr computes its own frequency, which I don't need. I want to use the count I have defined for each term. Using that term and count, I want to sort by frequency (count) and then by edit distance, which I think I need to implement by creating my own comparator class, unless there is something else that would help. Please share.
How can I do this? Any help would be appreciated.
Thanks.
You should be able to do this without implementing a custom similarity class. The first requirement is (from your description) a straightforward sort on the count value, while the latter can be implemented by sorting on the value from the strdist() function. You can also multiply or weight these values against each other in a single sort statement by using several functions.
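A minimal sketch of such a sort clause, assuming term is a single-valued string field with docValues, count is numeric, and "age" stands in for the user's input string:
sort=count desc, strdist("age", term, edit) desc
strdist() returns a similarity between 0 and 1 (1 meaning identical), so sorting it descending puts the closest matches first after the count ordering.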
If you really, really need to build your own scorer (which I don't think you need to do from your description) - these are usually written to explore ranking algorithms other than tf/idf, BM25, etc. for larger corpora. A search on Google gives you many resources with pre-made, easy-to-adopt solutions. I particularly want to point out "This is the Nuclear Option" in Build Your Own Custom Lucene Query and Scorer:
Unless you just want the educational experience, building a custom Lucene Query should be the “nuclear option” for search relevancy. It’s very fiddly and there are many ins-and-outs. If you’re actually considering this to solve a real problem, you’ve already gone down the following paths [...]
