Elasticsearch order by _score or max_score from SearchResponse Java API

I have an index that contains documents sharing the same employee name and email address but differing in other information, such as the meeting attended and the amount spent.
{
"emp_name" : "Raju",
"emp_email" : "raju#abc.com",
"meeting" : "World cup 2019",
"cost" : "2000"
}
{
"emp_name" : "Sanju",
"emp_email" : "sanju#abc.com",
"meeting" : "International Academy",
"cost" : "3000"
}
{
"emp_name" : "Sanju",
"emp_email" : "sanju#abc.com",
"meeting" : "School of Education",
"cost" : "4000"
}
{
"emp_name" : "Sanju",
"emp_email" : "sanju#abc.com",
"meeting" : "Water world",
"cost" : "1200"
}
{
"emp_name" : "Sanju",
"emp_email" : "sanju#abc.com",
"meeting" : "Event of Tech",
"cost" : "5200"
}
{
"emp_name" : "Bajaj",
"emp_email" : "bajaju#abc.com",
"meeting" : "Event of Tech",
"cost" : "4500"
}
Now, when I search on the emp_name field with a term like "raj", I should get one document each for Raju, Sanju and Bajaj, since I am using fuzzy search (fuzziness(auto)).
I am implementing this with the Elasticsearch Java High Level REST Client 6.8 API.
TermsAggregationBuilder termAggregation = AggregationBuilders.terms("employees")
.field("emp_email.keyword")
.size(2000);
TopHitsAggregationBuilder termAggregation1 = AggregationBuilders.topHits("distinct")
.sort(new ScoreSortBuilder().order(SortOrder.DESC))
.size(1)
.fetchSource(includeFields, excludeFields);
With the above code I get distinct documents, but Raju's record is not at the top of the response; Sanju's document comes first instead, because its bucket has the higher document count.
Below is the JSON generated from the search request.
{
"size": 0,
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "raj",
"fields": [
"emp_name^1.0",
"emp_email^1.0"
],
"boost": 1.0
}
}
],
"filter": [
{
"range": {
"meeting_date": {
"from": "2019-12-01",
"to": null,
"boost": 1.0
}
}
}
],
"adjust_pure_negative": true,
"boost": 1.0
}
},
"aggregations": {
"employees": {
"terms": {
"field": "emp_email.keyword",
"size": 2000,
"min_doc_count": 1,
"shard_min_doc_count": 0,
"show_term_doc_count_error": false,
"order": [
{
"_count": "desc"
},
{
"_key": "asc"
}
]
},
"aggregations": {
"distinct": {
"top_hits": {
"from": 0,
"size": 1,
"version": false,
"explain": false,
"_source": {
"includes": [
"all_uid",
"emp_name",
"emp_email",
"meeting",
"country",
"cost"
],
"excludes": [
]
},
"sort": [
{
"_score": {
"order": "desc"
}
}
]
}
}
}
}
}
}
I think that if the buckets were ordered by max_score or _score, Raju's record would be at the top of the response.
Could you please let me know how to order the response by the _score or max_score of the returned documents?
A sample response is:
{
"took": 264,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 232,
"max_score": 0.0,
"hits": [
]
},
"aggregations": {
"sterms#employees": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Sanju",
"doc_count": 4,
"top_hits#distinct": {
"hits": {
"total": 4,
"max_score": 35.71312,
"hits": [
{
"_index": "indexone",
"_type": "employeedocs",
"_id": "1920424",
"_score": 35.71312,
"_source": {
"emp_name": "Sanju",
...
}
}
]
}
}
},
{
"key": "Raju",
"doc_count": 1,
"top_hits#distinct": {
"hits": {
"total": 1,
"max_score": 89.12312,
"hits": [
{
"_index": "indexone",
"_type": "employeedocs",
"_id": "1920424",
"_score": 89.12312,
"_source": {
"emp_name": "Raju",
...
}
}
]
}
}
}
Let me know if you have any questions.
Note: I have seen many similar questions, but none of them helped me. Please advise.
Thanks,
Chetan
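One way to approach the ordering asked about above is to give each terms bucket a max sub-aggregation computed from a script on _score, and order the buckets by that sub-aggregation instead of by _count. The sketch below only builds the request body as a Python dict; the function name and the sub-aggregation name best_score are hypothetical, and the query is passed in unchanged.

```python
def build_distinct_by_score_request(query, key_field, size=2000):
    """Build a search body whose terms buckets are ordered by their best _score.

    `best_score` is an illustrative name for the max sub-aggregation;
    `query` is whatever query the search already uses.
    """
    return {
        "size": 0,
        "query": query,
        "aggs": {
            "employees": {
                "terms": {
                    "field": key_field,
                    "size": size,
                    # Order buckets by the sub-aggregation instead of _count.
                    "order": {"best_score": "desc"},
                },
                "aggs": {
                    # Highest _score among the documents in the bucket.
                    "best_score": {"max": {"script": {"source": "_score"}}},
                    # Best-scoring document of each bucket.
                    "distinct": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{"_score": {"order": "desc"}}],
                        }
                    },
                },
            }
        },
    }
```

With the Java client, the same ordering would presumably be expressed through BucketOrder.aggregation("best_score", false) on the terms builder plus a max sub-aggregation with a Script on _score; this should be verified against the 6.8 API before relying on it.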

Related

Is it possible to use a query result into another query in ElasticSearch?

I have two queries that I want to combine, the first one returns a document with some fields.
Now I want to use one of these fields into the new query without creating two separates ones.
Is there a way to combine them in order to accomplish my task?
This is the first query
{
"_source": {
"includes": [
"data.session"
]
},
"query": {
"bool": {
"must": [
{
"match": {
"field1": "9419"
}
},
{
"match": {
"field2": "5387"
}
}
],
"filter": [
{
"range": {
"timestamp": {
"time_zone": "+00:00",
"gte": "2020-10-24 10:16",
"lte": "2020-10-24 11:16"
}
}
}
]
}
},
"size" : 1
}
And this is the response returned:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 109,
"relation": "eq"
},
"max_score": 3.4183793,
"hits": [
{
"_index": "file",
"_type": "_doc",
"_id": "UBYCkgsEzLKoXh",
"_score": 3.4183793,
"_source": {
"data": {
"session": "123456789"
}
}
}
]
}
}
I want to use that "data.session" into another query, instead of rewriting the value of the field by passing the result of the first query.
{
"_source": {
"includes": [
"data.session"
]
},
"query": {
"bool": {
"must": [
{
"match": {
"data.session": "123456789"
}
}
]
}
},
"sort": [
{
"timestamp": {
"order": "asc"
}
}
]
}
If you mean using the result of the first query as input to the second query, that is not possible in Elasticsearch. But if you share your query and use case, we might be able to suggest a better way.
Elasticsearch does not support subqueries or inner queries.
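Since the chaining cannot happen inside Elasticsearch, it has to happen in the client: run the first query, read data.session from its first hit, and build the second query from that value. A minimal sketch of the two client-side steps follows, assuming the response shape shown in the question; both function names are hypothetical and the actual HTTP calls are left out.

```python
def extract_session(first_response):
    """Return data.session from the first hit, or None if nothing matched."""
    hits = first_response["hits"]["hits"]
    if not hits:
        return None
    return hits[0]["_source"]["data"]["session"]


def build_session_query(session):
    """Build the second request body from the value found by the first query."""
    return {
        "_source": {"includes": ["data.session"]},
        "query": {"bool": {"must": [{"match": {"data.session": session}}]}},
        "sort": [{"timestamp": {"order": "asc"}}],
    }
```

The client then sends the body returned by build_session_query as a second search request, so two round trips are still needed.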

Elasticsearch separate aggregation based on values from first

I'm using Elasticsearch 6.8.8 and trying to aggregate the number of entities and relationships over a given time period.
Here is the data structure, with example values, of the index:
date           entityOrRelationshipId  startId  endId    type
=================================================================
DATETIMESTAMP  ENT1_ID                 null     null     ENTITY
DATETIMESTAMP  ENT2_ID                 null     null     ENTITY
DATETIMESTAMP  ENT3_ID                 null     null     ENTITY
DATETIMESTAMP  REL1_ID                 ENT1_ID  ENT2_ID  RELATIONSHIP
DATETIMESTAMP  REL2_ID                 ENT3_ID  ENT1_ID  RELATIONSHIP
etc.
For a given entity ID, I want to get the top 50 relationships. I have started with the following query.
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"date": {
"gte": "2020-04-01T00:00:00.000+00:00",
"lt": "2020-04-28T00:00:00.000+00:00"
}
}
}
]
}
},
"aggs": {
"my_rels": {
"filter": {
"bool": {
"must": [
{
"term": {
"type": "RELATIONSHIP"
}
},
{
"bool": {
"should": [
{
"term": {"startId": "ENT1_ID"}
},
{
"term": {"endId": "ENT1_ID"}
}
]
}
}
]
}
},
"aggs": {
"my_rels2": {
"terms": {
"field": "entityOrRelationshipId",
"size": 50
},
"aggs": {
"my_rels3": {
"top_hits": {
"_source": {
"includes": ["startId","endId"]
},
"size": 1
}
}
}
}
}
}
}
}
This produces the following results:
{
"took": 54,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 93122,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"my_rels": {
"doc_count": 332,
"my_rels2": {
"doc_count_error_upper_bound": 6,
"sum_other_doc_count": 259,
"buckets": [
{
"key": "REL1_ID",
"doc_count": 47,
"my_rels3": {
"hits": {
"total": 47,
"max_score": 1.0,
"hits": [
{
"_index": "trends",
"_type": "trend",
"_score": 1.0,
"_source": {
"endId": "ENT2_ID",
"startId": "ENT1_ID"
}
}
]
}
}
},
{
"key": "REL2_ID",
"doc_count": 26,
"my_rels3": {
"hits": {
"total": 26,
"max_score": 1.0,
"hits": [
{
"_index": "trends",
"_type": "trend",
"_score": 1.0,
"_source": {
"endId": "ENT1_ID",
"startId": "ENT3_ID"
}
}
]
}
}
}
]
}
}
}
}
This lists the top 50 relationships. For each relationship it lists the relationship ID, the count and the entity ids (startId, endId). What I would like to do now is produce another aggregation of entity counts for those distinct entities. Ideally this would not be a nested aggregation but a separate one using the rel ids identified in the first aggregation.
Is that possible to do in this query?
Unfortunately you cannot aggregate over the results of top_hits in Elasticsearch.
Here is the link to GitHub issue.
You can have other aggregations at the same level as top_hits, but you cannot have any sub-aggregation below top_hits.
For example, a parallel aggregation looks like this:
"aggs": {
"top_hits_agg": {
"top_hits": {
"size": 10,
"_source": {
"includes": ["score"]
}
}
},
"avg_agg": {
"avg": {
"field": "score"
}
}
}
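Because top_hits results cannot feed another aggregation, one workaround is to do the second step in the client: collect the distinct entity IDs from the top_hits of the first response and send a follow-up request that counts those entities. The sketch below assumes the response shape and aggregation names from the question (my_rels, my_rels2, my_rels3); the function names and the entity_counts aggregation name are hypothetical, and the date range filter would still have to be added by the caller.

```python
def distinct_entity_ids(response):
    """Collect the startId/endId values found in the per-relationship top hits."""
    buckets = response["aggregations"]["my_rels"]["my_rels2"]["buckets"]
    ids = set()
    for bucket in buckets:
        for hit in bucket["my_rels3"]["hits"]["hits"]:
            source = hit["_source"]
            ids.add(source["startId"])
            ids.add(source["endId"])
    return sorted(ids)


def build_entity_count_request(entity_ids):
    """Build a follow-up request that counts documents per collected entity ID."""
    return {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"type": "ENTITY"}},
                    {"terms": {"entityOrRelationshipId": entity_ids}},
                ]
            }
        },
        "aggs": {
            "entity_counts": {
                "terms": {
                    "field": "entityOrRelationshipId",
                    # One bucket per requested ID; size must stay positive.
                    "size": max(len(entity_ids), 1),
                }
            }
        },
    }
```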

Elasticsearch: Querying nested objects

Dear Elasticsearch experts,
I have a problem querying nested objects. Let's use the following simplified mapping:
{
"mappings" : {
"_doc" : {
"properties" : {
"companies" : {
"type": "nested",
"properties" : {
"company_id": { "type": "long" },
"name": { "type": "text" }
}
},
"title": { "type": "text" }
}
}
}
}
And put some documents in the index:
PUT my_index/_doc/1
{
"title" : "CPU release",
"companies" : [
{ "company_id" : 1, "name" : "AMD" },
{ "company_id" : 2, "name" : "Intel" }
]
}
PUT my_index/_doc/2
{
"title" : "GPU release 2018-01-10",
"companies" : [
{ "company_id" : 1, "name" : "AMD" },
{ "company_id" : 3, "name" : "Nvidia" }
]
}
PUT my_index/_doc/3
{
"title" : "GPU release 2018-03-01",
"companies" : [
{ "company_id" : 3, "name" : "Nvidia" }
]
}
PUT my_index/_doc/4
{
"title" : "Chipset release",
"companies" : [
{ "company_id" : 2, "name" : "Intel" }
]
}
Now I want to execute queries like this:
{
"query": {
"bool": {
"must": [
{ "match": { "title": "GPU" } },
{ "nested": {
"path": "companies",
"query": {
"bool": {
"must": [
{ "match": { "companies.name": "AMD" } }
]
}
},
"inner_hits" : {}
}
}
]
}
}
}
As a result, I want to get the matching companies with the number of matching documents. So the above query should give me:
[
{ "company_id" : 1, "name" : "AMD", "matched_documents": 1 }
]
The following query:
{
"query": {
"bool": {
"must": [
{ "match": { "title": "GPU" } },
{ "nested": {
"path": "companies",
"query": { "match_all": {} },
"inner_hits" : {}
}
}
]
}
}
}
should give me all companies assigned to a document whose title contains "GPU", with the number of matching documents:
[
{ "company_id" : 1, "name" : "AMD", "matched_documents": 1 },
{ "company_id" : 3, "name" : "Nvidia", "matched_documents": 2 }
]
Is there any possibility with good performance to achieve this result? I'm explicitly not interested in the matching documents, only in the number of matched documents and the nested objects.
Thanks for your help.
What you need to do in terms of Elasticsearch is:
filter "parent" documents on desired criteria (like having GPU in title, or also mentioning Nvidia in the companies list);
group "nested" documents by a certain criteria, a bucket (e.g. company_id);
count how many "nested" documents there are per each bucket.
Each of the nested objects in the array is indexed as a separate hidden document, which complicates life a bit. Let's see how to aggregate on them.
So how to aggregate and count the nested documents?
You can achieve this with a combination of a nested, terms and top_hits aggregation:
POST my_index/_doc/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"title": "GPU"
}
},
{
"nested": {
"path": "companies",
"query": {
"match_all": {}
}
}
}
]
}
},
"aggs": {
"Extract nested": {
"nested": {
"path": "companies"
},
"aggs": {
"By company id": {
"terms": {
"field": "companies.company_id"
},
"aggs": {
"Examples of such company_id": {
"top_hits": {
"size": 1
}
}
}
}
}
}
}
}
This will give the following output:
{
...
"hits": { ... },
"aggregations": {
"Extract nested": {
"doc_count": 3, <== How many "nested" documents there were?
"By company id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3, <== this bucket's key: "company_id": 3
"doc_count": 2, <== how many "nested" documents there were with such company_id?
"Examples of such company_id": {
"hits": {
"total": 2,
"max_score": 1.5897496,
"hits": [ <== an example, "top hit" for such company_id
{
"_nested": {
"field": "companies",
"offset": 1
},
"_score": 1.5897496,
"_source": {
"company_id": 3,
"name": "Nvidia"
}
}
]
}
}
},
{
"key": 1,
"doc_count": 1,
"Examples of such company_id": {
"hits": {
"total": 1,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 0
},
"_score": 1.5897496,
"_source": {
"company_id": 1,
"name": "AMD"
}
}
]
}
}
}
]
}
}
}
}
Notice that for Nvidia we have "doc_count": 2.
But what if we want to count the number of "parent" documents that have Nvidia vs. Intel?
What if we want to count parent objects based on a nested bucket?
It can be achieved with the reverse_nested aggregation.
We need to change our query just a little bit:
POST my_index/_doc/_search
{
"query": { ... },
"aggs": {
"Extract nested": {
"nested": {
"path": "companies"
},
"aggs": {
"By company id": {
"terms": {
"field": "companies.company_id"
},
"aggs": {
"Examples of such company_id": {
"top_hits": {
"size": 1
}
},
"original doc count": { <== we ask ES to count how many there are parent docs
"reverse_nested": {}
}
}
}
}
}
}
}
The result will look like this:
{
...
"hits": { ... },
"aggregations": {
"Extract nested": {
"doc_count": 3,
"By company id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3,
"doc_count": 2,
"original doc count": {
"doc_count": 2 <== how many "parent" documents have such company_id
},
"Examples of such company_id": {
"hits": {
"total": 2,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 1
},
"_score": 1.5897496,
"_source": {
"company_id": 3,
"name": "Nvidia"
}
}
]
}
}
},
{
"key": 1,
"doc_count": 1,
"original doc count": {
"doc_count": 1
},
"Examples of such company_id": {
"hits": {
"total": 1,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 0
},
"_score": 1.5897496,
"_source": {
"company_id": 1,
"name": "AMD"
}
}
]
}
}
}
]
}
}
}
}
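The response above can be flattened in the client into the list the question asks for. A minimal sketch, assuming the aggregation names used in this answer ("Extract nested", "By company id", "Examples of such company_id", "original doc count"); the function name is hypothetical.

```python
def companies_with_counts(response):
    """Turn the nested/terms/reverse_nested response into a flat list of companies."""
    buckets = response["aggregations"]["Extract nested"]["By company id"]["buckets"]
    result = []
    for bucket in buckets:
        # The single top hit carries the nested object's own fields.
        example = bucket["Examples of such company_id"]["hits"]["hits"][0]["_source"]
        result.append({
            "company_id": bucket["key"],
            "name": example["name"],
            # Parent documents, not nested objects, per company.
            "matched_documents": bucket["original doc count"]["doc_count"],
        })
    return result
```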
How can I spot the difference?
To make the difference evident, let's change the data a bit and add another Nvidia item to the companies list of document 2:
PUT my_index/_doc/2
{
"title" : "GPU release 2018-01-10",
"companies" : [
{ "company_id" : 1, "name" : "AMD" },
{ "company_id" : 3, "name" : "Nvidia" },
{ "company_id" : 3, "name" : "Nvidia" }
]
}
The last query (the one with reverse_nested) will give us the following:
"By company id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3,
"doc_count": 3, <== 3 "nested" documents with Nvidia
"original doc count": {
"doc_count": 2 <== but only 2 "parent" documents
},
"Examples of such company_id": {
"hits": {
"total": 3,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 2
},
"_score": 1.5897496,
"_source": {
"company_id": 3,
"name": "Nvidia"
}
}
]
}
}
},
As you can see, this is a subtle difference that is hard to grasp, but it changes the semantics completely.
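The difference between the two counts can be reproduced without Elasticsearch. The sketch below is only a plain-Python simulation of the counting semantics on the sample documents (with the extra Nvidia entry in document 2); it is not how Elasticsearch executes the aggregations, and the title match is approximated by a lowercase whitespace split.

```python
from collections import Counter

# Sample documents from the question, with the duplicated Nvidia entry.
DOCS = [
    {"title": "CPU release", "companies": [
        {"company_id": 1, "name": "AMD"}, {"company_id": 2, "name": "Intel"}]},
    {"title": "GPU release 2018-01-10", "companies": [
        {"company_id": 1, "name": "AMD"}, {"company_id": 3, "name": "Nvidia"},
        {"company_id": 3, "name": "Nvidia"}]},
    {"title": "GPU release 2018-03-01", "companies": [
        {"company_id": 3, "name": "Nvidia"}]},
    {"title": "Chipset release", "companies": [
        {"company_id": 2, "name": "Intel"}]},
]


def count_companies(docs, title_term):
    """Return (nested object counts, parent document counts) per company_id."""
    nested_counts = Counter()   # what the terms bucket's doc_count reports
    parent_counts = Counter()   # what reverse_nested reports
    for doc in docs:
        if title_term.lower() not in doc["title"].lower().split():
            continue
        ids = [company["company_id"] for company in doc["companies"]]
        nested_counts.update(ids)        # every nested object counts
        parent_counts.update(set(ids))   # each parent counts once per company
    return dict(nested_counts), dict(parent_counts)


nested, parents = count_companies(DOCS, "GPU")
print(nested)   # {1: 1, 3: 3}
print(parents)  # {1: 1, 3: 2}
```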
What about performance?
For most cases the performance of nested queries and aggregations should be sufficient, but it comes at a certain cost. It is therefore recommended to avoid using nested or parent-child types when tuning for search speed.
In Elasticsearch the best performance is often achieved through denormalization, although there is no single recipe and you should select the data model depending on your needs.
Hope this clarifies this nested thing for you a bit!

Elasticsearch query that requires all values in array to be present

Here's a sample query:
{
"query":{
"constant_score":{
"filter":{
"terms":{
"genres_slugs":["simulator", "strategy", "adventure"]
}
}
}
},
"sort":{
"name.raw":{
"order":"asc"
}
}
}
The value mapped to the genres_slugs property is just a simple array.
What I'm trying to do here is match all games that have all the values in the array: ["simulator","strategy","adventure"]
As in, the resulting items MUST have all those values. What's returned instead are results that have only one of the values and not the others.
Been going at this for 6 hours now :(
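OK, if the resulting items MUST have all those values, use a bool query with one term clause per value under must, instead of a single terms filter (terms matches documents that contain any of the listed values).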
{ "query":
{ "constant_score" :
{ "filter" :
{ "bool" :
{ "must" : [
{ "term" :
{"genres_slugs":"simulator"}
},
{ "term" :
{"genres_slugs":"strategy"}
},
{ "term" :
{"genres_slugs":"adventure"}
}]
}
}
}
}
}
This returns:
{
"took": 54,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "try",
"_type": "stackoverflowtry",
"_id": "123",
"_score": 1,
"_source": {
"genres_slugs": [
"simulator",
"strategy",
"adventure"
]
}
},
{
"_index": "try",
"_type": "stackoverflowtry",
"_id": "126",
"_score": 1,
"_source": {
"genres_slugs": [
"simulator",
"strategy",
"adventure"
]
}
}
]
}
}
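The query above can be generated for any list of required values rather than written by hand. A minimal sketch that builds the same constant_score / bool / must structure as a Python dict, plus a local check that mirrors the "all values present" rule; both function names are hypothetical.

```python
def build_all_terms_query(field, values):
    """Build a query matching documents whose field contains every given value."""
    return {
        "query": {
            "constant_score": {
                "filter": {
                    "bool": {
                        # One term clause per value: all of them must match.
                        "must": [{"term": {field: value}} for value in values]
                    }
                }
            }
        }
    }


def has_all_values(doc_values, required):
    """Local equivalent of the rule: every required value is in the document."""
    return set(required) <= set(doc_values)
```

For example, has_all_values(["simulator", "strategy"], ["simulator", "strategy", "adventure"]) is False, which is the case the original terms query let through.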
Doc:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/_finding_multiple_exact_values.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html

How to aggregate different field by a date field in Elasticsearch

REST API call:
GET test10/LREmail10/_search/
{
"size": 10,
"query": {
"range": {
"ALARM DATE": {
"gte": "now-15d/d",
"lt": "now/d"
}
}
},
"fields": [
"ALARM DATE",
"CLASSIFICATION"
]
}
Part of the output is:
"took": 25,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 490,
"max_score": 1,
"hits": [
{
"_index": "test10",
"_type": "LREmail10",
"_id": "AVM5g6XaShke4hy5dziK",
"_score": 1,
"fields": {
"CLASSIFICATION": [
"Attack"
],
"ALARM DATE": [
"25/02/2016 8:35:22 AM(UTC-08:00)"
]
}
},
{
"_index": "test10",
"_type": "LREmail10",
"_id": "AVM5g6e_Shke4hy5dziL",
"_score": 1,
"fields": {
"CLASSIFICATION": [
"Compromise"
],
"ALARM DATE": [
"25/02/2016 8:36:16 AM(UTC-08:00)"
]
}
},
What I really want to do here is aggregate CLASSIFICATION by ALARM DATE. The default date format includes minutes, seconds and the time zone, but I want to aggregate all the classifications for each date. So "25/02/2016 8:36:16 AM(UTC-08:00)" and "25/02/2016 8:35:22 AM(UTC-08:00)" should both be treated as the date "25/02/2016", and I want to get all the classifications that belong to a single date.
I hope I have explained the question properly. If you need any more details, let me know.
Even a hint about which area of Elasticsearch to look at would be very helpful.
Use a date_histogram aggregation, like below.
{
"size" :0 ,
"aggs": {
"classification of day": {
"date_histogram": {
"field": "ALARM DATE",
"interval": "day"
},
"aggs": {
"classification": {
"terms": {
"field": "CLASSIFICATION"
}
}
}
}
}
}
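To illustrate what the date_histogram with a nested terms aggregation computes, the same grouping can be done by hand on the two hits shown in the question. This is only a client-side sketch: the function name is hypothetical, it keeps just the day part of the "ALARM DATE" string, and it ignores the time zone offset, which a real date field mapping would take into account.

```python
from collections import Counter, defaultdict
from datetime import datetime


def classifications_per_day(hits):
    """Group CLASSIFICATION values by the calendar day of ALARM DATE."""
    per_day = defaultdict(Counter)
    for hit in hits:
        fields = hit["fields"]
        raw = fields["ALARM DATE"][0]            # e.g. "25/02/2016 8:35:22 AM(UTC-08:00)"
        day_text = raw.split(" ", 1)[0]          # keep only "25/02/2016"
        day = datetime.strptime(day_text, "%d/%m/%Y").date().isoformat()
        per_day[day].update(fields["CLASSIFICATION"])
    return {day: dict(counts) for day, counts in per_day.items()}


hits = [
    {"fields": {"CLASSIFICATION": ["Attack"],
                "ALARM DATE": ["25/02/2016 8:35:22 AM(UTC-08:00)"]}},
    {"fields": {"CLASSIFICATION": ["Compromise"],
                "ALARM DATE": ["25/02/2016 8:36:16 AM(UTC-08:00)"]}},
]
print(classifications_per_day(hits))
# {'2016-02-25': {'Attack': 1, 'Compromise': 1}}
```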
