Elasticsearch Aggregation + Sorting on Non-Numeric Field 5.3 - elasticsearch

I want to aggregate the data on one field and also get the aggregated data back sorted by the name field.
My data is:
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp001_local000000000000001",
"_score": 10.0,
"_source": {
"name": [
"Person 01"
],
"groupbyid": [
"group0001"
],
"ranking": [
"2.0"
]
}
},
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp002_local000000000000001",
"_score": 85146.375,
"_source": {
"name": [
"Person 02"
],
"groupbyid": [
"group0001"
],
"ranking": [
"10.0"
]
}
},
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp003_local000000000000001",
"_score": 20.0,
"_source": {
"name": [
"Person 03"
],
"groupbyid": [
"group0002"
],
"ranking": [
"-1.0"
]
}
},
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp004_local000000000000001",
"_score": 5.0,
"_source": {
"name": [
"Person 04"
],
"groupbyid": [
"group0002"
],
"ranking": [
"2.0"
]
}
}
My query:
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "name:emp*^1000.0"
          }
        }
      ]
    }
  },
  "aggs": {
    "groupbyid": {
      "terms": {
        "field": "groupbyid.raw",
        "order": {
          "top_hit_agg": "desc"
        },
        "size": 10
      },
      "aggs": {
        "top_hit_agg": {
          "terms": {
            "field": "name"
          }
        }
      }
    }
  }
}
My mapping is:
{
  "name": {
    "type": "text",
    "fielddata": true,
    "fields": {
      "lower_case_sort": {
        "type": "text",
        "fielddata": true,
        "analyzer": "case_insensitive_sort"
      }
    }
  },
  "groupbyid": {
    "type": "text",
    "fielddata": true,
    "index": "analyzed",
    "fields": {
      "raw": {
        "type": "keyword",
        "index": "not_analyzed"
      }
    }
  }
}
I am getting buckets ordered by the average relevance of the grouped records. What I want is to first club the records based on the groupid and then, within each bucket, sort the data based on the name field.
In other words: grouping on one field, then sorting within each grouped bucket on another field. This is sample data; there are other fields like created_on and updated_on, and I also want to be able to sort on those fields, as well as order the buckets alphabetically.
I want to sort on a non-numeric (string) data type. I can do it for numeric types: it works for the ranking field, but not for the name field, which gives the below error.
Expected numeric type on field [name], but got [text];

You're asking for a few things, so I'll try to answer them in turn.
Step 1: Sorting buckets by relevance
I am getting data based on the average of the relevance of grouped records.
If this is what you're attempting to do, it's not what the aggregation you wrote is doing. Terms aggregations default to sorting the buckets by the number of documents in each bucket, descending. To sort the groups by "average relevance" (which I'll interpret as "average _score of documents in the group"), you'd need to add a sub-aggregation on the score and sort the terms aggregation by that:
"aggregations": {
"most_relevant_groups": {
"terms": {
"field": "groupbyid.raw",
"order": {
"average_score": "desc"
}
},
"aggs": {
"average_score": {
"avg": {
"script": {
"inline": "_score",
"lang": "painless",
}
}
}
}
}
}
Step 2: Sorting employees by name
Now, what I wanted is the first club the records based on the groupid and then in each bucket sort the data based on the name field.
To sort the documents within each bucket, you can use a top_hits aggregation:
"aggregations": {
"most_relevant_groups": {
"terms": {
"field": "groupbyid.raw",
"order": {
"average_score": "desc"
}
},
"aggs": {
"employees": {
"top_hits": {
"size": 10, // Default will be 10 - change to whatever
"sort": [
{
"name.lower_case_sort": {
"order": "asc"
}
}
]
}
}
}
}
}
Step 3: Putting it all together
Putting both of the above together, the following aggregation should suit your needs (note that I used a function_score query to simulate "relevance" based on ranking; your query can be any query that produces the relevance you need):
POST /testing-aggregation/employee/_search
{
  "size": 0,
  "query": {
    "function_score": {
      "functions": [
        {
          "field_value_factor": {
            "field": "ranking"
          }
        }
      ]
    }
  },
  "aggs": {
    "groupbyid": {
      "terms": {
        "field": "groupbyid.raw",
        "size": 10,
        "order": {
          "average_score": "desc"
        }
      },
      "aggs": {
        "average_score": {
          "avg": {
            "script": {
              "inline": "_score",
              "lang": "painless"
            }
          }
        },
        "employees": {
          "top_hits": {
            "size": 10,
            "sort": [
              {
                "name.lower_case_sort": {
                  "order": "asc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

Related

Elasticsearch matched results on top and remaining after them

I am using Elasticsearch in my application and I am new to it.
I have an index called files with some tags associated with it. I want to query them using tags, something like this maybe:
{
"query": {
"terms": {
"tags": [
"xxx",
"yyy"
]
}
},
"sort": [
{
"created_at": {
"order": "desc"
}
}
]
}
The above query returns only the matched documents, but I need all the results, with the matched ones on top, also sorted by created_at. How do I do it?
I TRIED THIS:
{
"query": {
"bool": {
"should": [
{
"terms": {
"name": [
"cool",
"co"
]
}
}
],
"minimum_should_match": 0
}
},
"sort": [
{
"_score": {
"order": "desc"
}
},
{
"created_at": {
"order": "desc"
}
}
]
}
But it always returns zero results.
You can use a bool query with should.
Since you want all the docs, put a match_all in must; when a must clause is present, should only affects the scoring and not whether documents are included or not.
{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "should": [
        {
          "terms": {
            "tags": [
              "xxx",
              "yyy"
            ]
          }
        }
      ]
    }
  },
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    },
    {
      "created_at": {
        "order": "desc"
      }
    }
  ]
}
Also, sort takes an array, so you can pass in multiple fields on which the results should be sorted.
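The tie-breaking behavior of such a multi-field sort can be sketched in plain Python (the documents and dates below are made up for illustration, not from the question):

```python
# Elasticsearch applies the sort fields in order: documents are compared
# by _score first, and created_at only breaks ties between equal scores.
docs = [
    {"_score": 2.0, "created_at": "2024-01-01"},
    {"_score": 1.0, "created_at": "2024-03-01"},
    {"_score": 2.0, "created_at": "2024-02-01"},
]
# Both keys descending; ISO date strings compare correctly as plain strings.
ranked = sorted(docs, key=lambda d: (d["_score"], d["created_at"]), reverse=True)
assert [d["created_at"] for d in ranked] == ["2024-02-01", "2024-01-01", "2024-03-01"]
```

The two score-2.0 documents stay ahead of the score-1.0 one even though the latter has the most recent date, which is exactly the "matched results on top" behavior being asked for.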

elasticsearch Saved Search with Group by

index_name: my_data-2020-12-01
ticket_number: T123
ticket_status: OPEN
ticket_updated_time: 2020-12-01 12:22:12
index_name: my_data-2020-12-01
ticket_number: T124
ticket_status: OPEN
ticket_updated_time: 2020-12-01 12:32:11
index_name: my_data-2020-12-02
ticket_number: T123
ticket_status: INPROGRESS
ticket_updated_time: 2020-12-02 12:33:12
index_name: my_data-2020-12-02
ticket_number: T125
ticket_status: OPEN
ticket_updated_time: 2020-12-02 14:11:45
I want to create a saved search grouped by the ticket_number field that gets one unique doc per ticket with the latest ticket status (ticket_status). Is that possible?
You can simply query again; I am assuming you are using Kibana for visualization purposes. In your query, you need to filter based on ticket_number and sort based on ticket_updated_time.
Working example
Index mapping
{
"mappings": {
"properties": {
"ticket_updated_time": {
"type": "date"
},
"ticket_number" :{
"type" : "text"
},
"ticket_status" : {
"type" : "text"
}
}
}
}
Index sample docs
{
"ticket_number": "T123",
"ticket_status": "OPEN",
"ticket_updated_time": "2020-12-01T12:22:12"
}
{
"ticket_number": "T123",
"ticket_status": "INPROGRESS",
"ticket_updated_time": "2020-12-02T12:33:12"
}
Now as you can see, both the sample documents belong to the same ticket_number with different status and updated time.
Search query
{
"size" : 1, // fetch only the latest status document, if you remove this, will get other ticket with different status.
"query": {
"bool": {
"filter": [
{
"match": {
"ticket_number": "T123"
}
}
]
}
},
"sort": [
{
"ticket_updated_time": {
"order": "desc"
}
}
]
}
And the search result:
"hits": [
{
"_index": "65180491",
"_type": "_doc",
"_id": "2",
"_score": null,
"_source": {
"ticket_number": "T123",
"ticket_status": "INPROGRESS",
"ticket_updated_time": "2020-12-02T12:33:12"
},
"sort": [
1606912392000
]
}
]
If you need to group by the ticket_number field, then you can use an aggregation as well.
Index Mapping:
{
"mappings": {
"properties": {
"ticket_updated_time": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
}
}
}
}
Search Query:
{
"size": 0,
"aggs": {
"unique_id": {
"terms": {
"field": "ticket_number.keyword",
"order": {
"latestOrder": "desc"
}
},
"aggs": {
"latestOrder": {
"max": {
"field": "ticket_updated_time"
}
}
}
}
}
}
Search Result:
"buckets": [
{
"key": "T125",
"doc_count": 1,
"latestOrder": {
"value": 1.606918305E12,
"value_as_string": "2020-12-02 14:11:45"
}
},
{
"key": "T123",
"doc_count": 2,
"latestOrder": {
"value": 1.606912392E12,
"value_as_string": "2020-12-02 12:33:12"
}
},
{
"key": "T124",
"doc_count": 1,
"latestOrder": {
"value": 1.606825931E12,
"value_as_string": "2020-12-01 12:32:11"
}
}
]

How to aggregate documents in different buckets and then apply filters to the result

I have many elasticsearch documents in this format:
{
"_index": "testIndex",
"_type": "_doc",
"_id": "0kt102sBt5sWDQMwsMNJ",
"_score": 1.4376891,
"_source": {
"id": "8dJs76YI",
"entity": "movie",
"actor": "Pier",
"action": "like",
"source": "tablet",
"tag": [
"drama"
],
"location": "3.698492,-73.697308",
"country": "",
"city": "",
"timestamp": "2019-07-04T05:35:01Z"
}
}
This index stores all the activities performed against a movie entity: id is the movie id, action can be like, view, share, etc., and actor is the name of the user.
I want to apply an aggregation and get those movies which have total likes between 1000 and 10000, are liked by the actor Pier, and are tagged comedy.
The query needs to combine bool, terms and range queries along with aggregations. I have tried the filters aggregation, but the official documentation example is not proving to be enough.
Can anyone please give an example of how to prepare this query?
Thanks.
So I'd begin writing the query with the criteria that aren't part of the aggregation, which are actor and tag.
{
"query": {
"bool": {
"filter": [
{
"term": {
"actor": "Pier"
}
},
{
"term": {
"tag": "comedy"
}
},
{
"term": {
"action": "like"
}
}
]
}
}
}
This should keep only the like actions performed by Pier on comedy movies.
The next thing is aggregating and getting counts per movie, so it certainly makes sense to use terms aggregation to group everything by id.
{
"query": {
"bool": {
"filter": [
{
"term": {
"actor": "Pier"
}
},
{
"term": {
"tag": "comedy"
}
},
{
"term": {
"action": "like"
}
}
]
}
},
"aggs": {
"movies": {
"terms": {
"field": "id",
"min_doc_count": 1000
}
}
}
}
So with this query you should already have counts per movie. Given the filters above, these counts are for likes of comedy movies by Pier; each bucket now has to be filtered to ensure the wanted number of likes.
So an upper bound on likes per movie still needs to be added. You'll need to use a bucket_selector aggregation for that:
{
"query": {
"bool": {
"filter": [
{
"term": {
"actor": "Pier"
}
},
{
"term": {
"tag": "comedy"
}
},
{
"term": {
"action": "like"
}
}
]
}
},
"aggs": {
"movieIds": {
"terms": {
"field": "id",
"min_doc_count": 1000
},
"aggs": {
"likesWithinRange": {
"bucket_selector": {
"buckets_path": {
"doc_count": "_count"
},
"script": {
"inline": "params.doc_count < 10000"
}
}
}
}
}
}
}
Hopefully that works, or at least points you in the right direction.
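As a plain-Python sketch of what the two bounds do together (the movie ids and counts below are made up): min_doc_count enforces the lower bound before buckets are returned, and the bucket_selector script drops buckets at or above the upper bound.

```python
# Hypothetical per-movie like counts; the filtering is equivalent to
# min_doc_count: 1000 (lower bound) plus a bucket_selector with
# "params.doc_count < 10000" (upper bound).
counts = {"movie_a": 500, "movie_b": 2500, "movie_c": 15000}
kept = {movie: n for movie, n in counts.items() if n >= 1000 and n < 10000}
assert kept == {"movie_b": 2500}
```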

Aggregations elasticsearch 5

My Elasticsearch index has the following type of entries:
{
"_index": "employees",
"_type": "employee",
"_id": "10000",
"_score": 1.3640093,
"_source": {
"itms": {
"depc": [
"IT",
"MGT",
"FIN"
],
"dep": [
{
"depn": "Information Technology",
"depc": "IT"
},
{
"depn": "Management",
"depc": "MGT"
},
{
"depn": "Finance",
"depc": "FIN"
},
{
"depn": "Finance",
"depc": "FIN"
}
]
}
}
}
Now I am trying to get a unique department list including department code (depc) and department name (depn).
I was trying the following, but it doesn't give the result I expected.
{
"size": 0,
"query": {},
"aggs": {
"departments": {
"terms": {
"field": "itms.dep.depc",
"size": 10000,
"order": {
"_term": "asc"
}
},
"aggs": {
"department": {
"terms": {
"field": "itms.dep.depn",
"size": 10
}
}
}
}
}
}
Any suggestions are appreciated.
Thank you.
From your agg query, it seems like the mapping type for itms.dep is object and not nested
Lucene has no concept of inner objects, so Elasticsearch flattens
object hierarchies into a simple list of field names and values.
Hence, your doc has internally been transformed to:
{
"depc" : ["IT","MGT","FIN"],
"dep.depc" : [ "IT","MGT","FIN"],
"dep.depn" : [ "Information Technology", "Management", "Finance" ]
}
i.e. you have lost the association between depc and depn.
To fix this:
You need to change your object type to nested
Use a nested aggregation
The structure of your existing agg query seems fine to me, but you will have to convert it to a nested aggregation after the mapping update.
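For illustration, a sketch of what that could look like on ES 5.x. The index name, the employee type, and the keyword sub-types are assumptions based on the question, so adjust them to your real setup; the point is the nested type on itms.dep and the nested wrapper around the existing terms aggregations:

```
PUT /employees
{
  "mappings": {
    "employee": {
      "properties": {
        "itms": {
          "properties": {
            "dep": {
              "type": "nested",
              "properties": {
                "depc": { "type": "keyword" },
                "depn": { "type": "keyword" }
              }
            }
          }
        }
      }
    }
  }
}

POST /employees/_search
{
  "size": 0,
  "aggs": {
    "deps": {
      "nested": {
        "path": "itms.dep"
      },
      "aggs": {
        "departments": {
          "terms": {
            "field": "itms.dep.depc",
            "size": 10000,
            "order": { "_term": "asc" }
          },
          "aggs": {
            "department": {
              "terms": { "field": "itms.dep.depn" }
            }
          }
        }
      }
    }
  }
}
```

Because each dep object is now indexed as its own hidden document, each depc bucket will contain only the depn values that actually belong to that code.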

Elastic search date range aggregation

I have JSON data:
"hits": [
{
"_index": "outboxprov1",
"_type": "deleted-connector",
"_id": "AHkuN5_iRGO-R5dtaOvz6w",
"_score": 1,
"_source": {
"user_id": "1a9d05586a8dc3f29b4c8147997391f9",
"deleted_date": "2014-08-02T04:55:04.509Z"
}
},
{
"_index": "outboxprov1",
"_type": "deleted-connector",
"_id": "Busk7MDFQ4emtL3x5AQyZA",
"_score": 1,
"_source": {
"user_id": "1a9d05586a8dc3f29b4c8147997391f9",
"deleted_date": "2014-08-02T04:58:31.440Z"
}
},
{
"_index": "outboxprov1",
"_type": "deleted-connector",
"_id": "4AN0zKe9SaSF1trz1IixfA",
"_score": 1,
"_source": {
"user_id": "1a9d05586a8dc3f29b4c8147997391f9",
"deleted_date": "2014-07-02T04:53:07.010Z"
}
}
]
I am trying to write an aggregation query which will find records in a particular "deleted_date" range.
This is my query
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"daily_team": {
"date_range": {
"field": "deleted_date",
"format": "YYYY-MM-DD",
"ranges": [
{
"from": "2014-08-02"
},
{
"to": "2014-08-02"
}
]
},
"aggs": {
"daily_team_count": {
"terms": {
"field": "user_id"
}
}
}
}
}
}
My problem is that I am not getting the correct number of records in a particular date range; whatever date I put, I get some doc_count number. I am new to Elasticsearch and not sure whether this is the right way to write a range aggregation query. Please help me solve this issue.
I think the problem is that you are confusing the "from" and "to" of the date_range aggregation with those of the range filter. The range filter includes both dates (from and to) by default, but the date_range aggregation includes the from value and excludes the to value for each range.
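That half-open interval behavior can be sketched in plain Python (the dates are just for illustration):

```python
from datetime import date

# A date_range bucket covers [from, to): "from" is inclusive, "to" exclusive.
def in_bucket(d, frm=None, to=None):
    if frm is not None and d < frm:
        return False
    if to is not None and d >= to:
        return False
    return True

assert in_bucket(date(2014, 8, 2), frm=date(2014, 8, 2))     # "from" includes the date
assert not in_bucket(date(2014, 8, 2), to=date(2014, 8, 2))  # "to" excludes it
assert in_bucket(date(2014, 8, 2), to=date(2014, 8, 3))      # so extend "to" by one day
```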
In your query,
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "daily_team": {
      "date_range": {
        "field": "deleted_date",
        "format": "YYYY-MM-DD",
        "ranges": [
          {
            "from": "2014-08-02"
          },
          {
            "to": "2014-08-03" // if you want to include 2014-08-02, increase the date by one, so 08-02 is included
          }
        ]
      },
      "aggs": {
        "daily_team_count": {
          "terms": {
            "field": "user_id"
          }
        }
      }
    }
  }
}
I ran into this as well, and I think your problem is the same.
FYI, look at the link.
What OP is looking for is InternalDateRange query. Try this instead:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "daily_team": {
      "date_range": {
        "field": "deleted_date",
        "format": "YYYY-MM-DD",
        "ranges": [
          {
            "from": "2014-08-02||/d", // /d rounds off to the day -> 2014-08-02T00:00:00.000Z
            "to": "2014-08-03||/d"    // -> 2014-08-03T00:00:00.000Z
          }
        ]
      },
      "aggs": {
        "daily_team_count": {
          "terms": {
            "field": "user_id"
          }
        }
      }
    }
  }
}
This will return the count of matching results in a single bucket named daily_team:
"buckets": [
{
"key": "2014-08-02T00:00:00.000Z-2014-08-03T00:00:00.000Z",
"from": 1470096000000, //test data value
"from_as_string": "2014-08-02T00:00:00.000Z",
"to": 1470182400000, //test data value
"to_as_string": "2014-08-03T00:00:00.000Z",
"doc_count": 0
}
]
That is, a single bucket containing the matching doc_count. Compare with the original ranges:
"ranges": [
{
"from": "2014-08-02"
},
{
"to": "2014-08-02"
}
Using the above ranges will return 2 buckets, one each for the from and to date ranges:
from -> 2014-08-02-*
to -> *-2014-08-02
as shown on the official documentation page.
