Elasticsearch: Sorting giving weird results

When I run the search for the first time, it sorts all documents and gives me the first 5 records. However, if I run the same query with the sort direction changed (ASC -> DESC), it does not sort all documents again: it takes the 5 documents retrieved by the previous query, sorts them in descending order, and returns those. I was expecting it to sort all available documents in DESC order and then return the first 5 results.
Am I doing something wrong, or have I missed a concept?
My search query:
{
"sort": {
"taskid": {
"order": "ASC"
}
},
"from": 0,
"size": 5,
"query": {
"filtered": {
"query": {
"match_all": []
}
}
}
}
I have data with taskid 1 to 100. The query above returned the records with taskid 1 to 5 on the first attempt. When I changed the sort direction to desc, I expected the documents with taskid 96-100 (in the sequence 100, 99, 98, 97, 96), but I got the documents with taskid 5, 4, 3, 2, 1 in that sequence. This suggests that the sort was applied only to the previously returned result.
Please note that taskid and _id are the same in my documents: I added a redundant field to my mapping that holds the same value as _id.

Just change the value of the order key to lowercase and you are good to go.
{
"sort": {
"taskid": {
"order": "asc" // or "desc"
}
},
"from": 0,
"size": 5,
"query": {
"filtered": {
"query": {
"match_all": {}
}
}
}
}
Hope this helps..

In Elasticsearch, the sort is applied to the documents that match the query. For the query in your question, the results are first filtered by the search criteria, and the sort is then applied to that filtered set.
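To make the expected behaviour concrete, here is a small plain-Python model of "filter, then sort, then take from/size". It is only a sketch of the semantics described above, not how Elasticsearch is implemented; the `search` function and its parameters are hypothetical, and the lowercase check on `order` simply illustrates the advice from the other answer.

```python
def search(docs, matches, sort_field, order="asc", from_=0, size=5):
    """Toy model: filter all docs, sort the matches, then paginate."""
    if order not in ("asc", "desc"):
        # Illustrative guard only: keep the order value lowercase.
        raise ValueError("order must be 'asc' or 'desc', got %r" % order)
    hits = [d for d in docs if matches(d)]
    hits.sort(key=lambda d: d[sort_field], reverse=(order == "desc"))
    return hits[from_:from_ + size]


# Same data as in the question: taskid 1 to 100.
docs = [{"taskid": i} for i in range(1, 101)]
match_all = lambda d: True

first_asc = [d["taskid"] for d in search(docs, match_all, "taskid", "asc")]
first_desc = [d["taskid"] for d in search(docs, match_all, "taskid", "desc")]
```

With the data from the question, `first_desc` holds 100 down to 96, which is what the asker expected to get back.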

If it looks like you are only getting results based on an old subset of your data, it may be that your newer data is not searchable yet. This happens easily in an automated test, but it is less likely with manual testing.
By default the index is refreshed every second, and newly indexed documents only become searchable after a refresh, so adding a delay/sleep of about a second between indexing and searching should fix your test if this is the problem.
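Rather than a fixed sleep, a test can poll until the expected number of documents is visible. The helper below is a minimal sketch: `wait_for_hits` and its `count_hits` callable are hypothetical names, and `count_hits` stands in for whatever call your client uses to count matching documents. (Index requests also accept a `refresh` parameter, which is another way to avoid the delay in tests.)

```python
import time


def wait_for_hits(count_hits, expected, timeout=5.0, interval=0.2):
    """Poll count_hits() until it reports at least `expected` documents.

    Returns True as soon as enough documents are visible, or False if
    the timeout elapses first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if count_hits() >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In a test you would index the documents, call `wait_for_hits(...)`, and only then run the search you want to assert on.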


elasticsearch: count appearance of terms aggregation on other fields

I want to count how many times the unique values (the result of a terms aggregation) appear in other fields, in the same query. Let's say:
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"unique_products": {
"terms": {
"field": "products.name.keyword",
"min_doc_count": 10
}
}
}
}
What I want is to count how many times each of the keys returned in the buckets appears in another field.
My ideal output is:
"aggregations": {
"product_stat": {
"key": "<product_name>"
"sold": "<#>" #I want to know how many times the key appears in another field, like sold
"bought": "<#>"
}
}
Elasticsearch cannot run a single terms aggregation over multiple fields. In short, if it could, aggregations would not be blazing fast.
As the documentation suggests, there are two options:
use a script in the terms aggregation (with a performance penalty),
change how the documents are indexed so that a normal terms aggregation can be used.
Depending on the structure of your data and your use cases, you might get by with a more complex aggregation plus some processing on the client side. This can be done with sub-aggregations, for example.
Hope that helps!
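As a rough illustration of the "aggregation plus client-side processing" route, the sketch below runs one terms aggregation per field and merges the bucket counts in Python. The field names `sold.keyword` and `bought.keyword`, the aggregation names, and both helper functions are assumptions made up for this example; also note that `doc_count` counts documents containing the term, not individual occurrences.

```python
def build_request(fields, size=1000):
    """One terms aggregation per field; `fields` maps agg name -> field."""
    return {
        "size": 0,
        "aggs": {
            name: {"terms": {"field": field, "size": size}}
            for name, field in fields.items()
        },
    }


def merge_counts(aggregations, primary, others, min_doc_count=10):
    """For each key in the primary agg, look up its count in the other aggs."""
    def as_counts(name):
        return {b["key"]: b["doc_count"] for b in aggregations[name]["buckets"]}

    other_counts = {name: as_counts(name) for name in others}
    stats = []
    for key, count in as_counts(primary).items():
        if count < min_doc_count:
            continue
        row = {"key": key}
        for name in others:
            row[name] = other_counts[name].get(key, 0)
        stats.append(row)
    return stats


request_body = build_request({
    "unique_products": "products.name.keyword",
    "sold": "sold.keyword",      # hypothetical field
    "bought": "bought.keyword",  # hypothetical field
})
```

You would send `request_body` to `_search` and pass the `aggregations` part of the response to `merge_counts(aggs, "unique_products", ["sold", "bought"])`.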

Elasticsearch: get documents only when value changes

I have an ES index with such kind of documents:
from_1,to_1,timestamp_1
from_1,to_1,timestamp_2
from_1,to_2,timestamp_3
from_2,to_3,timestamp_4
from_1,to_2,timestamp_5
from_2,to_3,timestamp_6
from_1,to_1,timestamp_7
from_2,to_4,timestamp_8
I need a query that returns a document only if its combination of from and to values differs from that of the previously seen document with the same from value.
So with the provided sample above:
document with timestamp_1 should be in the result because there is no earlier document with from_1+to_1 combination
document with timestamp_2 must be skipped because its from+to combination is exactly the same as the last seen document with from = from_1
document with timestamp_3 should be in the result because its to field (to_2) is different from the value of the last seen document with the same from (to_1 in the document with timestamp_2)
document with timestamp_4 should be in the result
document with timestamp_5 must not be in the result because it has the same combination of from+to as the last seen with from_1 (document with timestamp_3)
document with timestamp_6 must not be in the result because it has the same combination of from+to as the last seen with from_2 (document with timestamp_4)
document with timestamp_7 should be in the result because it has a different combination of from+to than the last seen with from_1 (document with timestamp_5)
document with timestamp_8 should be in the result because its combination is completely new so far
I need to fetch all such "semi-unique" documents from the index, so it would be nice if it were possible to use a scroll request, or after_key if an aggregation is used.
Any ideas how to approach it?
The closest thing I could come up with is the following (let me know if it does not work with your data).
{
"size": 0,
"aggs": {
"from_and_to": {
"composite" : {
"size": 5,
"sources": [
{
"from_to_collected":{
"terms": {
"script": {
"lang": "painless",
"source": "doc['from'].value + '_' + doc['to'].value"
}
}
}
}]
},
"aggs": {
"top_from_and_to_hits": {
"top_hits": {
"size": 1,
"sort": [{"timestamp":{"order":"asc"}}],
"_source": {"includes": ["_id"]}
}
}
}
}
}
}
Keep in mind that the terms aggregation is probabilistic.
The composite aggregation lets you page through the next set of buckets over the from_to_collected key by passing the returned after_key.
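If the rule from the question turns out to be hard to express in a single query, one alternative is to stream the documents sorted by timestamp (for instance with a scroll) and apply the rule on the client. The sketch below is that client-side approach, not the aggregation above; `changed_only` is a hypothetical helper and the field names `from`, `to` and `timestamp` are taken from the question.

```python
def changed_only(docs):
    """Yield a doc only when its `to` differs from the last `to` seen
    for the same `from`. `docs` must be ordered by timestamp ascending."""
    last_to = {}  # maps each `from` value to the most recent `to` value
    for doc in docs:
        key = doc["from"]
        if key not in last_to or last_to[key] != doc["to"]:
            yield doc
        last_to[key] = doc["to"]


# The sample from the question, with timestamps written as 1..8.
sample = [
    {"from": "from_1", "to": "to_1", "timestamp": 1},
    {"from": "from_1", "to": "to_1", "timestamp": 2},
    {"from": "from_1", "to": "to_2", "timestamp": 3},
    {"from": "from_2", "to": "to_3", "timestamp": 4},
    {"from": "from_1", "to": "to_2", "timestamp": 5},
    {"from": "from_2", "to": "to_3", "timestamp": 6},
    {"from": "from_1", "to": "to_1", "timestamp": 7},
    {"from": "from_2", "to": "to_4", "timestamp": 8},
]
kept = [d["timestamp"] for d in changed_only(sample)]
```

The memory cost is one entry per distinct from value, so this stays cheap as long as the number of distinct from values is manageable.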

ES: How do quasi-join queries using global aggregation compare to parent-child / nested queries?

At my work, I came across the following pattern for doing quasi-joins in Elasticsearch. I wonder whether this is a good idea, performance-wise.
The pattern:
Connects docs in one index in one-to-many relationship.
Somewhat like ES parent-child, but implemented without it.
Child docs need to be indexed with a field called e.g. "my_parent_id", with value being the parent ID.
Can be used when querying for parent, knowing its ID in advance, to also get the children in the same query.
The query with quasi-join (assume 123 is parent ID):
GET /my-index/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"id": {
"value": 123
}
}
}
]
}
},
"aggs": {
"my-global-agg" : {
"global" : {},
"aggs" : {
"my-filtering-all-but-children": {
"filter": {
"term": {
"my_parent_id": 123
}
},
"aggs": {
"my-returning-children": {
"top_hits": {
"_source": {
"includes": [
"my_child_field1_to_return",
"my_child_field2_to_return"
]
},
"size": 1000
}
}
}
}
}
}
}
}
This query returns:
the parent (as search query result), and
its children (as the aggregation result).
Performance-wise, is the above:
definitively a good idea,
definitively a bad idea,
hard to tell / it depends?
It depends ;-) The idea is good. However, by default the maximum number of hits you can return in a top_hits aggregation is 100; if you try 1000, you'll get an error like this:
Top hits result window is too large, the top hits aggregator [hits]'s from + size must be less than or equal to: [100] but was [1000]. This limit can be set by changing the [index.max_inner_result_window] index level setting.
As the error states, you can increase this limit by changing the index.max_inner_result_window index setting. But if there's a default, there's usually a good reason for it. I would take that as a hint that it might not be a great idea to increase it too much.
So, if your parent documents have no more than 100 children, why not; otherwise I'd seriously consider another approach.
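One way to avoid hitting that error at query time is to check the requested size before sending the request. The builder below is only a sketch: `build_quasi_join` is a hypothetical helper, the aggregation and field names are copied from the question, and the limit of 100 is the default quoted in the error message.

```python
DEFAULT_MAX_INNER_RESULT_WINDOW = 100  # default quoted in the error message


def build_quasi_join(parent_id, child_fields, max_children,
                     limit=DEFAULT_MAX_INNER_RESULT_WINDOW):
    """Build the parent query plus a global/filter/top_hits agg for children."""
    if max_children > limit:
        raise ValueError(
            "top_hits size %d exceeds index.max_inner_result_window (%d)"
            % (max_children, limit)
        )
    return {
        "query": {"bool": {"must": [{"term": {"id": {"value": parent_id}}}]}},
        "aggs": {
            "my-global-agg": {
                "global": {},
                "aggs": {
                    "my-filtering-all-but-children": {
                        "filter": {"term": {"my_parent_id": parent_id}},
                        "aggs": {
                            "my-returning-children": {
                                "top_hits": {
                                    "_source": {"includes": list(child_fields)},
                                    "size": max_children,
                                }
                            }
                        },
                    }
                },
            }
        },
    }
```

If you have raised the index setting, pass the new value as `limit` so the guard matches your index.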

How to find all duplicate documents in ElasticSearch

We have a need to walk over all of the documents in our AWS ElasticSearch cluster, version 6.0, and gather a count of all the duplicate user ids.
I have tried using a Data Visualization to aggregate counts on the user ids and export them, but the numbers don't match another source of our data that is searchable via traditional SQL.
What we would like to see is like this:
USER ID COUNT
userid1 4
userid22 3
...
I am not an advanced Lucene query person and have yet to find an answer to this question. If anyone can provide some insight into how to do this, I would be appreciative.
The following query counts each id and filters out the ids that occur fewer than 2 times, so you'll get something like:
id:2, count:2
id:4, count:15
GET /index/_search
{
"query":{
"match_all":{}
},
"aggs":{
"user_id":{
"terms":{
"field":"user_id",
"size":100000,
"min_doc_count":2
}
}
}
}
More here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
If you want to get all duplicate userids with their counts:
First you need to know the maximum size to use for the terms aggregation.
Find the number of distinct userids with a cardinality aggregation.
GET index/type/_search
{
"size": 0,
"aggs": {
"maximum_match_counts": {
"cardinality": {
"field": "userid",
"precision_threshold": 100
}
}
}
}
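Take the value of the maximum_match_counts aggregation from the response.
Now you can get all duplicate userids, using that value as the size (maximum_match_counts in the query is a placeholder for the number):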
GET index/type/_search
{
"size": 0,
"aggs": {
"userIds": {
"terms": {
"field": "userid",
"size": maximum_match_counts,
"min_doc_count": 2
}
}
}
}
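The two steps can be chained in a small script. This is a sketch under a few assumptions: the helper names are made up, the aggregation names are the ones used in the queries above, and since the cardinality aggregation returns an approximate count, the derived size may need some headroom on large indices.

```python
def build_cardinality_request(field="userid", precision_threshold=100):
    """Step 1: ask how many distinct values the field has."""
    return {
        "size": 0,
        "aggs": {
            "maximum_match_counts": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


def build_duplicates_request(cardinality_response, field="userid"):
    """Step 2: use the distinct count from step 1 as the terms size."""
    distinct = cardinality_response["aggregations"]["maximum_match_counts"]["value"]
    return {
        "size": 0,
        "aggs": {
            "userIds": {
                "terms": {
                    "field": field,
                    "size": max(1, int(distinct)),  # size must be positive
                    "min_doc_count": 2,
                }
            }
        },
    }


def duplicate_counts(terms_response):
    """Turn the step 2 response into {userid: count}."""
    buckets = terms_response["aggregations"]["userIds"]["buckets"]
    return {b["key"]: b["doc_count"] for b in buckets}
```

Each builder returns a plain dict, so it can be sent with whichever client you already use.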
When you go with the terms aggregation (Bharat's suggestion) and set the aggregation size to more than 10K, you will get a warning that this approach will throw an error in future releases.
Instead of the terms aggregation, you should use the composite aggregation to scan all of your documents with the pagination/after_key method.
The composite aggregation can be used to paginate all buckets from a multi-level aggregation efficiently. This aggregation provides a way to stream all buckets of a specific aggregation, similarly to what scroll does for documents.
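A paging loop over the composite aggregation could look roughly like the sketch below. `iter_duplicates` is a hypothetical helper, `search` is any callable that takes a request body and returns the parsed response (your client call goes there), and the aggregation and source names `by_user` and `uid` are made up. The duplicate filter is applied on the client side here.

```python
def iter_duplicates(search, field="userid", page_size=1000):
    """Page through all composite buckets, yielding (value, count)
    for values that occur in at least two documents."""
    after = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"uid": {"terms": {"field": field}}}],
        }
        if after is not None:
            composite["after"] = after  # resume after the last bucket seen
        body = {"size": 0, "aggs": {"by_user": {"composite": composite}}}

        agg = search(body)["aggregations"]["by_user"]
        buckets = agg["buckets"]
        if not buckets:
            return
        for bucket in buckets:
            if bucket["doc_count"] >= 2:
                yield bucket["key"]["uid"], bucket["doc_count"]

        after = agg.get("after_key")
        if after is None:
            return
```

Because it is a generator, the results can be written out as they arrive instead of being held in memory.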

How check elasticsearch unique document

As I understand it, ES can't create unique constraints on index.
But, on creation and updating of the documents, I need to check that some fields are unique in index.
Can ES find matches of content, not a query? Thanks!
After you've updated your record you'd have to run a query to find how many others have that field value. Something like:
GET index1/test/_search
{
"size": 0,
"query": {
"filtered": {
"filter": {
"term": {
"field123": 10
}
}
}
}
}
Note the size of zero: this saves time by not returning any records, while the response still includes the total number of matching records.
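Here is a small sketch of how the check could be wrapped in code. The helper names are made up, and the request uses a bool query with a filter clause, which is the form newer versions expect now that the filtered query has been removed. Depending on the version, hits.total is either a plain number or an object with a value key, so the sketch handles both. Keep in mind the check is not atomic: two writers can both pass it at the same time.

```python
def build_uniqueness_check(field, value):
    """Count-only search for documents whose `field` equals `value`."""
    return {
        "size": 0,  # we only need the total, not the documents
        "query": {"bool": {"filter": [{"term": {field: value}}]}},
    }


def total_hits(response):
    """Read hits.total in either its numeric or its object form."""
    total = response["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


def is_unique(response):
    """After an update, the value is unique if only one document has it."""
    return total_hits(response) <= 1
```

For the example above you would send `build_uniqueness_check("field123", 10)` to `index1/_search` and pass the parsed response to `is_unique`.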
