Elasticsearch simultaneous sort

I have multiple ES indices - each of the tables from the database is kept in a separate one.
Each of those indices has a different mapping, although they present similar data. For dates I have:
In index1 - First_date, which is a date type (2018-05-22).
In index2 - First_date, which is an integer (2018).
In index3 - justDate, an integer (2017).
In index4 - date, a string ("May 2018").
Is there a way of sorting by all of these fields simultaneously? I guess the answer might be script sorting, but I'm interested in whether this can be achieved in any other way.
If not, maybe at least the same can be done for fields with the same field type.

It could look like this:
POST index1,index2,index3,index4/_search
{
  "sort": [
    {"First_date": {"order": "desc"}},
    {"justDate": {"order": "desc"}},
    {"date": {"order": "desc"}}
  ]
}
But I assume you want to sort by all of these fields in date order, which this query will not give you.
Solving this task with a script adds unnecessary computation at query time.
I would suggest creating a date-formatted field in each index and filling it at index time. In that case the query above will work as is.
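For instance, a minimal sketch using the Python client, assuming a hypothetical common field named sort_date that each index maps as a date (the pipeline below normalizes index2's bare year; the pipeline and field names are illustrative, not from the question):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical ingest pipeline for index2, whose First_date is a bare
# year: copy it into the common sort_date field as a full date.
es.ingest.put_pipeline(
    id="normalize-index2-date",
    body={
        "description": "Copy the integer year into a common sort_date field",
        "processors": [
            {"set": {"field": "sort_date", "value": "{{First_date}}-01-01"}}
        ]
    }
)

# Index through the pipeline so sort_date is filled at index time.
es.index(index="index2", body={"First_date": 2018},
         pipeline="normalize-index2-date")

# Once every index fills sort_date, one sort works across all of them.
es.search(
    index="index1,index2,index3,index4",
    body={"sort": [{"sort_date": {"order": "desc"}}]}
)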

Related

Elasticsearch - Limit of total fields [1000] in index exceeded

I saw that there are some concerns about raising the total limit on fields above 1000.
I have a situation where I am not sure how to approach it from the design point of view.
I have lots of simple key-value pairs:
key1:15, key2:45, key99999:1313123
where the key is a string and the value is an integer I would like to sort my results on: if a document has a certain key, it gets sorted by that key's value.
I ended up creating an object and just putting the key-value pairs inside so I can match them easily.
For example, I have sorting: "object.key".
I was wondering, if I just use a simple object with a bunch of strings inside that are only there for exact matching, should I worry about raising this limit to 10k or 20k?
Because I now have an issue where there can be more than 1k of these records. I've found I could use nested sorting, but it still has a default limit of 10k.
Is there a good design-pattern approach for this, or should I not be worried about raising the field limits?
Simplified version of the query:
GET products/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "sortingObject.someSortingKey1": {
        "order": "desc",
        "missing": 2,
        "unmapped_type": "float"
      }
    }
  ]
}
The point is that I get the sorting key from the request and use it to sort my results; there can be, for example, 100k different ways to sort the result.
There were some recent improvements (in 7.16) that should help there, but 10K or 20K fields is still a lot of overhead.
I'm not sure what kind of queries you need to run on those keyX fields, but maybe the flattened data-type would work for you? https://www.elastic.co/guide/en/elasticsearch/reference/current/flattened.html
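A rough sketch of that with the Python client (products and someSortingKey1 are taken from the question's example; note that flattened subfield values are indexed as keywords, so numbers compare lexicographically unless you zero-pad them, and sorting on flattened subfields needs a reasonably recent Elasticsearch version):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One flattened field counts as a single mapped field no matter how
# many distinct keys the documents carry, so the 1000-field limit
# never comes into play.
es.indices.create(
    index="products",
    body={
        "mappings": {
            "properties": {
                "sortingObject": {"type": "flattened"}
            }
        }
    }
)

# Zero-padded values keep keyword (lexicographic) order consistent
# with numeric order.
es.index(index="products", body={"sortingObject": {"someSortingKey1": "00015"}})

# Sort on a single key of the flattened field.
es.search(
    index="products",
    body={"sort": [{"sortingObject.someSortingKey1": {"order": "desc"}}]}
)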

Search After (pagination) in Elasticsearch when sorting by score

Search after in Elasticsearch must match its sorting parameters in count and order. So I was wondering how to get the score from a previous result (say, page 1) to use as the search_after for the next page.
I faced an issue when using the score of the last document in the previous search. The score was 1.0, and since all documents have a 1.0 score, the result for the next page turned out to be null (empty).
That actually makes sense, since I am asking Elasticsearch for results that have a lower rank (score) than 1.0, of which there are none. So which score do I use to get the next page?
Note:
I am sorting by score then by TieBreakerID, so one possible solution is using a high value (say 1000) for the score.
What you're doing sounds like it should work, as explained by an Elastic team member. It works for me (in ES 7.7) even with tied scores when using the document ID (copied into another indexed field) as a tiebreaker. It's true that indexing additional documents while paginating will make your scores slightly unstable, but not likely enough to cause a significant problem for an end user. If you need it to be reliable for a batch job, the Scroll API is the better choice.
{
  "query": {
    ...
  },
  "search_after": [
    12.276552,
    14173
  ],
  "sort": [
    { "_score": "desc" },
    { "id": "asc" }
  ]
}
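In case it helps, here is a minimal sketch of the paging loop with the Python client (index name and query are hypothetical): the trick is to reuse the sort array Elasticsearch returns with each hit instead of reading _score yourself, so tied 1.0 scores are carried over exactly.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

body = {
    "query": {"match": {"title": "example"}},  # hypothetical query
    "sort": [{"_score": "desc"}, {"id": "asc"}],
    "size": 100,
}

search_after = None
while True:
    if search_after is not None:
        body["search_after"] = search_after
    page = es.search(index="my-index", body=body)
    hits = page["hits"]["hits"]
    if not hits:
        break
    # Each hit carries the exact sort values to resume from.
    search_after = hits[-1]["sort"]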

Really huge query or optimizing an elasticsearch update

I'm working on document visualization for binary classification of a large number of documents (around 150,000). The challenge is how to present general visual information to end users, so they can get an idea of the main "concepts" in each category (positive/negative). As each document has an associated set of topics, I thought about asking Elasticsearch, through aggregations, for the top 20 topics in positively classified documents, and then the same for the negatives.
I created a Python script that downloads the data from Elastic and classifies the docs, BUT the problem is that the predictions on the dataset are not registered in Elasticsearch, so I cannot ask for the top 20 topics in a certain category. First I thought about creating a query in Elastic to ask for the aggregations, passing a match clause per document ID.
As I have the ids of the positive/negative documents, I can write a query to retrieve the aggregation of topics, BUT in the query I would have to provide a really large number of document IDs to indicate, for instance, just the positive documents. That is impossible, since there is a limit on the endpoint and I cannot pass 50,000 ids like:
"query": {
"bool": {
"should": [
{"match": {"id_str": "939490553510748161"}},
{"match": {"id_str": "939496983510742348"}}
...
],
"minimum_should_match" : 1
}
},
"aggs" : { ... }
So I tried to register the predicted categories of the classification in the Elastic index, but as the number of documents is really huge, it takes about half an hour (compared to less than a minute for running the classification)... which is a LOT of time just for storing the predictions. Then I also need to query the index to get the right data for the visualization. To update the documents, I am using:
for id in docs_ids:
    es.update(
        index=kwargs["index"],
        doc_type=kwargs["doc_type"],
        id=id,
        body={"doc": {
            "prediction": kwargs["category"]
        }}
    )
Do you know an alternative to update the predictions faster?
You could use a bulk query, which lets you serialize your requests and hit Elasticsearch only once to execute many operations.
Try:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("myurl")
list_ids = ["1", "2", "3"]
query_list = []
for id in list_ids:
    query_dict = {
        '_op_type': 'update',
        '_index': kwargs["index"],
        '_type': kwargs["doc_type"],
        '_id': id,
        'doc': {"prediction": kwargs["category"]}
    }
    query_list.append(query_dict)
helpers.bulk(client=es, actions=query_list)
Please have a read here
Regarding querying the list of ids: to get a faster response you shouldn't match on the id_str value, as you did in the question, but on the _id field. That lets you use a multi-get query, a bulk query for the get operation, which is also in the Python library. Try:
my_ids_list = [<some_ids_here>]
es.mget(index=kwargs["index"],
        doc_type=kwargs["doc_type"],
        body={'ids': my_ids_list})

Elasticsearch: Aggregate documents based on date range

I have a set of documents in Elasticsearch 5.5 with two date fields: start_date and end_date.
I want to aggregate them into date histogram buckets (e.g. weekly) such that if start_date < week X < end_date, then the document lands in the "week X" bucket.
This means that a single document might be in multiple buckets.
Consider the following concrete example: I have a set of documents describing company employees, and each employee has a hire date and (optionally) a termination date. I want to build a date histogram of the number of active employees for the trailing twelve months.
Sample doc content:
{
  "start_date": "2013-01-12T00:00:00.000Z",
  "end_date": "2016-12-08T00:00:00.000Z",
  "id": "123123123"
}
Is there a way to do this in ES?
I have found one way to do this, using filter aggregations (https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-bucket-filter-aggregation.html). If I need, say, a trailing 12 months report, then I would create 12 buckets, where each bucket defines filter conditions such as:
"bool":{
"must":[{
"range":{
"start_date":{
"lte":"2016-01-01T00:00:00.000Z"
}
}
},{
{
"range":{
"end_date":{
"gt":"2016-02-01T00:00:00.000Z"
}
}
}]
}
However, I feel it would be nice if there were an easier way to do this, since if I want, say, trailing 365 days, I have to create 365 bucket filters, which makes the resulting query very large.
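Since the query is large but entirely mechanical, one workaround is to generate the bucket filters programmatically; a sketch in Python (field names from the question, months of 2016 used for brevity):

from datetime import date

# Active during a month = hired on/before the month start and not
# terminated before the next month starts (same shape as the filter
# in the question).
def month_bucket(start, end):
    return {"bool": {"must": [
        {"range": {"start_date": {"lte": start.isoformat()}}},
        {"range": {"end_date": {"gt": end.isoformat()}}},
    ]}}

months = [(date(2016, m, 1), date(2016, m + 1, 1)) for m in range(1, 12)]
aggs = {
    "active_per_month": {
        "filters": {
            "filters": {start.isoformat(): month_bucket(start, end)
                        for start, end in months}
        }
    }
}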
I know this question is quite old, but as it's still open I am sharing my knowledge on it. The question does not clearly explain what kind of output is expected, but I still think this can be achieved using the "Date Histogram Aggregation" and the "Bucket Script Aggregation".
Here are the documentation links for both of these aggregations.
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-aggregations-bucket-datehistogram-aggregation.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-aggregations-pipeline-bucket-script-aggregation.html

Conditional Sorting in ElasticSearch

I have some documents that I would like to sort on a date field. For documents with the date equal to a specified date (for example, today) and all dates after it, I would like to sort ascending. For dates before the specified date I would like to sort in descending order.
Is this possible in ElasticSearch? If so, could you suggest any literature or an approach?
date is of type "date" and format "dateOptionalTime".
Thanks
Yes, this is possible in ElasticSearch using a script, either for sorting or for scoring.
My preference would be for a scoring script, because 'script based score' is going to be quicker (according to the documentation).
Using a scoring script, you could store the date field as a Unix timestamp of type int/long and use an MVEL script in the custom_score query. You might need to re-index your documents. You would also need to convert the searched-for time into a Unix timestamp to send to ElasticSearch.
The script would then subtract the requested timestamp from each document's timestamp and take the absolute value. The results are then sorted in ascending order - the lowest 'distance' is the best.
So when looking for documents dated about a year ago, it would look something like:
"query": {
"custom_score" : {
"query" : {
....
},
"params" : {
"req_date_stamp" : 1348438345,
},
"script" : "abs(doc['timestamp'].value - req_date_timestamp)"
}
},
"sort": {
"_score": {
'order': 'asc'
}
}
(Apologies for any mistakes in my JSON - I tested this idea in pyes)
You might need to tweak this to get the rounding right - for example, your question mentions matching days, so you might want to round the generated timestamps to the nearest day.
For "full" info you can check out the Custom Score Query docs and follow the link to MVEL scripting.
For this kind of specific use case, you should use a sorting script.
See the "script based sorting" section in the Sort documentation page.
My English is poor. My solution is boosting.
My data:
{"terms_id": [20211011,20211012,20211013,20211014], "sort_value": 1}
{"terms_id": [20211012,20211013,20211014], "sort_value": 2}
{"terms_id": [20211013,20211014,20211015], "sort_value": 1}
My query:
{
  "bool": {
    "must": [],
    "should": [
      {"bool": {"must": [{"terms": {"terms_id": [20211012], "boost": 5}}], "must_not": []}},
      {"bool": {"must_not": [{"terms": {"terms_id": [20211012]}}]}}
    ],
    "minimum_should_match": 1
  }
}
My sort:
{"_score": {"order": "desc"}, "sort_value": {"order": "desc"}}
The result:
{"terms_id": [20211012,20211013,20211014], "sort_value": 2}
{"terms_id": [20211011,20211012,20211013,20211014], "sort_value": 1}
{"terms_id": [20211013,20211014,20211015], "sort_value": 1}
