Elasticsearch date range aggregation

I have this JSON data:
"hits": [
{
"_index": "outboxprov1",
"_type": "deleted-connector",
"_id": "AHkuN5_iRGO-R5dtaOvz6w",
"_score": 1,
"_source": {
"user_id": "1a9d05586a8dc3f29b4c8147997391f9",
"deleted_date": "2014-08-02T04:55:04.509Z"
}
},
{
"_index": "outboxprov1",
"_type": "deleted-connector",
"_id": "Busk7MDFQ4emtL3x5AQyZA",
"_score": 1,
"_source": {
"user_id": "1a9d05586a8dc3f29b4c8147997391f9",
"deleted_date": "2014-08-02T04:58:31.440Z"
}
},
{
"_index": "outboxprov1",
"_type": "deleted-connector",
"_id": "4AN0zKe9SaSF1trz1IixfA",
"_score": 1,
"_source": {
"user_id": "1a9d05586a8dc3f29b4c8147997391f9",
"deleted_date": "2014-07-02T04:53:07.010Z"
}
}
]
I'm trying to write an aggregation query that will find records in a particular "deleted_date" range.
This is my query:
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"daily_team": {
"date_range": {
"field": "deleted_date",
"format": "YYYY-MM-DD",
"ranges": [
{
"from": "2014-08-02"
},
{
"to": "2014-08-02"
}
]
},
"aggs": {
"daily_team_count": {
"terms": {
"field": "user_id"
}
}
}
}
}
}
My problem is that I'm not getting the correct number of records for a particular date range. Whatever date I put in, I get some doc_count number. I'm new to Elasticsearch and not sure this is the way to write a range aggregation query. Please help me solve this issue.

I think the problem is that you are confusing the "from" and "to" of the date_range aggregation with those of the range filter. The range filter includes both dates (from and to) by default, but the date_range aggregation includes the from value and excludes the to value for each range.
In your query,
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"daily_team": {
"date_range": {
"field": "deleted_date",
"format": "YYYY-MM-DD",
"ranges": [
{
"from": "2014-08-02"
},
{
**"to": "2014-08-02"** -- > if you want to include 2014-08-02 date then do,
"to" : "2014-08-03" (increase date by one, so 08-02 is included)
}
]
},
"aggs": {
"daily_team_count": {
"terms": {
"field": "user_id"
}
}
}
}
}
}
I ran into this myself, and I think your problem is the same.
FYI, look at the link.

What the OP is looking for is an InternalDateRange query. Try this instead:
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"daily_team": {
"date_range": {
"field": "deleted_date",
"format": "YYYY-MM-DD",
"ranges": [
{
"from": "2014-08-02||/d", // /d rounds off to day
// from value -> 2014-08-02T00:00:00.000Z
"to": "2014-08-03||/d" // to value -> 2014-08-03T00:00:00.000Z
}
]
},
"aggs": {
"daily_team_count": {
"terms": {
"field": "user_id"
}
}
}
}
}
}
This will return the count of matching results in a single bucket named daily_team.
"buckets": [
{
"key": "2014-08-02T00:00:00.000Z-2014-08-03T00:00:00.000Z",
"from": 1470096000000, //test data value
"from_as_string": "2014-08-02T00:00:00.000Z",
"to": 1470182400000, //test data value
"to_as_string": "2014-08-03T00:00:00.000Z",
"doc_count": 0
}
]
Using your original ranges, on the other hand:
"ranges": [
{
"from": "2014-08-02"
},
{
"to": "2014-08-02"
}
will return 2 buckets, one each for the from and the to range:
from -> 2014-08-02-*
to -> *-2014-08-02
as shown on the official documentation page.
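For illustration, with the corrected "yyyy-MM-dd" format the two-range request above returns buckets shaped roughly like this (the doc_count values here are placeholders):
"buckets": [
{
"key": "2014-08-02-*",
"from": 1406937600000,
"from_as_string": "2014-08-02",
"doc_count": 2
},
{
"key": "*-2014-08-02",
"to": 1406937600000,
"to_as_string": "2014-08-02",
"doc_count": 1
}
]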

Related

Elasticsearch Saved Search with Group by

index_name: my_data-2020-12-01
ticket_number: T123
ticket_status: OPEN
ticket_updated_time: 2020-12-01 12:22:12
index_name: my_data-2020-12-01
ticket_number: T124
ticket_status: OPEN
ticket_updated_time: 2020-12-01 12:32:11
index_name: my_data-2020-12-02
ticket_number: T123
ticket_status: INPROGRESS
ticket_updated_time: 2020-12-02 12:33:12
index_name: my_data-2020-12-02
ticket_number: T125
ticket_status: OPEN
ticket_updated_time: 2020-12-02 14:11:45
I want to create a saved search that groups by the ticket_number field and gets a unique doc with the latest ticket status (ticket_status). Is it possible?
You can simply query again. I am assuming you are using Kibana for visualization purposes; in your query, you need to filter based on the ticket_number and sort based on ticket_updated_time.
Working example
Index mapping
{
"mappings": {
"properties": {
"ticket_updated_time": {
"type": "date"
},
"ticket_number" :{
"type" : "text"
},
"ticket_status" : {
"type" : "text"
}
}
}
}
Index sample docs
{
"ticket_number": "T123",
"ticket_status": "OPEN",
"ticket_updated_time": "2020-12-01T12:22:12"
}
{
"ticket_number": "T123",
"ticket_status": "INPROGRESS",
"ticket_updated_time": "2020-12-02T12:33:12"
}
Now, as you can see, both sample documents belong to the same ticket_number with different statuses and updated times.
Search query
{
"size" : 1, // fetch only the latest status document, if you remove this, will get other ticket with different status.
"query": {
"bool": {
"filter": [
{
"match": {
"ticket_number": "T123"
}
}
]
}
},
"sort": [
{
"ticket_updated_time": {
"order": "desc"
}
}
]
}
And the search result:
"hits": [
{
"_index": "65180491",
"_type": "_doc",
"_id": "2",
"_score": null,
"_source": {
"ticket_number": "T123",
"ticket_status": "INPROGRESS",
"ticket_updated_time": "2020-12-02T12:33:12"
},
"sort": [
1606912392000
]
}
]
If you need to group by the ticket_number field, then you can use an aggregation as well.
Index Mapping:
{
"mappings": {
"properties": {
"ticket_updated_time": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
}
}
}
}
Search Query:
{
"size": 0,
"aggs": {
"unique_id": {
"terms": {
"field": "ticket_number.keyword",
"order": {
"latestOrder": "desc"
}
},
"aggs": {
"latestOrder": {
"max": {
"field": "ticket_updated_time"
}
}
}
}
}
}
Search Result:
"buckets": [
{
"key": "T125",
"doc_count": 1,
"latestOrder": {
"value": 1.606918305E12,
"value_as_string": "2020-12-02 14:11:45"
}
},
{
"key": "T123",
"doc_count": 2,
"latestOrder": {
"value": 1.606912392E12,
"value_as_string": "2020-12-02 12:33:12"
}
},
{
"key": "T124",
"doc_count": 1,
"latestOrder": {
"value": 1.606825931E12,
"value_as_string": "2020-12-01 12:32:11"
}
}
]
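If you also need the latest ticket_status itself per ticket, not just the latest timestamp, one option is a top_hits sub-aggregation sorted by ticket_updated_time descending. A minimal sketch under the same mapping (latest_doc is a name chosen here, and ticket_number.keyword assumes the default dynamic mapping):
{
"size": 0,
"aggs": {
"unique_id": {
"terms": {
"field": "ticket_number.keyword",
"order": {
"latestOrder": "desc"
}
},
"aggs": {
"latestOrder": {
"max": {
"field": "ticket_updated_time"
}
},
"latest_doc": {
"top_hits": {
"size": 1, // one latest document per ticket_number
"sort": [
{
"ticket_updated_time": {
"order": "desc"
}
}
],
"_source": ["ticket_status", "ticket_updated_time"]
}
}
}
}
}
}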

Elasticsearch Aggregation + Sorting on Non-Numeric Field 5.3

I want to aggregate the data on one field and also get the aggregated data sorted by the name field.
My data is:
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp001_local000000000000001",
"_score": 10.0,
"_source": {
"name": [
"Person 01"
],
"groupbyid": [
"group0001"
],
"ranking": [
"2.0"
]
}
},
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp002_local000000000000001",
"_score": 85146.375,
"_source": {
"name": [
"Person 02"
],
"groupbyid": [
"group0001"
],
"ranking": [
"10.0"
]
}
},
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp003_local000000000000001",
"_score": 20.0,
"_source": {
"name": [
"Person 03"
],
"groupbyid": [
"group0002"
],
"ranking": [
"-1.0"
]
}
},
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp004_local000000000000001",
"_score": 5.0,
"_source": {
"name": [
"Person 04"
],
"groupbyid": [
"group0002"
],
"ranking": [
"2.0"
]
}
}
My query:
{
"size": 0,
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "name:emp*^1000.0"
}
}
]
}
},
"aggs": {
"groupbyid": {
"terms": {
"field": "groupbyid.raw",
"order": {
"top_hit_agg": "desc"
},
"size": 10
},
"aggs": {
"top_hit_agg": {
"terms": {
"field": "name"
}
}
}
}
}
}
My mapping is:
{
"name": {
"type": "text",
"fielddata": true,
"fields": {
"lower_case_sort": {
"type": "text",
"fielddata": true,
"analyzer": "case_insensitive_sort"
}
}
},
"groupbyid": {
"type": "text",
"fielddata": true,
"index": "analyzed",
"fields": {
"raw": {
"type": "keyword",
"index": "not_analyzed"
}
}
}
}
I am getting data based on the average relevance of the grouped records. What I want now is to first club the records based on groupbyid and then, within each bucket, sort the data based on the name field.
In other words: group on one field, then sort each resulting bucket on another field. The above is sample data; there are other fields like created_on and updated_on, and I also want to be able to sort on those, as well as get the groups alphabetically.
I want to sort on a non-numeric data type (string); I can already do it for numeric types. It works for the ranking field but not for the name field, which gives the error below.
Expected numeric type on field [name], but got [text];
You're asking for a few things, so I'll try to answer them in turn.
Step 1: Sorting buckets by relevance
I am getting data based on the average relevance of the grouped records.
If this is what you're attempting to do, it's not what the aggregation you wrote is doing. Terms aggregations default to sorting the buckets by the number of documents in each bucket, descending. To sort the groups by "average relevance" (which I'll interpret as "average _score of documents in the group"), you'd need to add a sub-aggregation on the score and sort the terms aggregation by that:
"aggregations": {
"most_relevant_groups": {
"terms": {
"field": "groupbyid.raw",
"order": {
"average_score": "desc"
}
},
"aggs": {
"average_score": {
"avg": {
"script": {
"inline": "_score",
"lang": "painless",
}
}
}
}
}
}
Step 2: Sorting employees by name
What I want now is to first club the records based on groupbyid and then, within each bucket, sort the data based on the name field.
To sort the documents within each bucket, you can use a top_hits aggregation:
"aggregations": {
"most_relevant_groups": {
"terms": {
"field": "groupbyid.raw",
"order": {
"average_score": "desc"
}
},
"aggs": {
"employees": {
"top_hits": {
"size": 10, // Default will be 10 - change to whatever
"sort": [
{
"name.lower_case_sort": {
"order": "asc"
}
}
]
}
}
}
}
}
Step 3: Putting it all together
Putting both of the above together, the following aggregation should suit your needs (note that I used a function_score query to simulate "relevance" based on ranking; your query can be anything that produces the relevance you need):
POST /testing-aggregation/employee/_search
{
"size": 0,
"query": {
"function_score": {
"functions": [
{
"field_value_factor": {
"field": "ranking"
}
}
]
}
},
"aggs": {
"groupbyid": {
"terms": {
"field": "groupbyid.raw",
"size": 10,
"order": {
"average_score": "desc"
}
},
"aggs": {
"average_score": {
"avg": {
"script": {
"inline": "_score",
"lang": "painless"
}
}
},
"employees": {
"top_hits": {
"size": 10,
"sort": [
{
"name.lower_case_sort": {
"order": "asc"
}
}
]
}
}
}
}
}
}

Aggregations and filters in Elastic - find the last hits and filter them afterwards

I'm trying to work with Elastic (5.6) and to find a way to retrieve the top documents per category.
I have an index with the following kind of documents:
{
"#timestamp": "2018-03-22T00:31:00.004+01:00",
"statusInfo": {
"status": "OFFLINE",
"timestamp": 1521675034892
},
"name": "myServiceName",
"id": "xxxx",
"type": "Http",
"key": "key1",
"httpStatusCode": 200
}
What I'm trying to do with these is retrieve the last document (#timestamp-based) per name (my categories), see if its statusInfo.status is OFFLINE or UP, and fetch these results into the hits part of a response so I can put it in a Kibana count dashboard or somewhere else (a REST-based tool I do not control and can't modify myself).
Basically, I want to know how many of my services (name) are OFFLINE (statusInfo.status) in their last update (#timestamp), for monitoring purposes.
I'm stuck at the "get how many of my services" part.
My query so far:
GET actuator/_search
{
"size": 0,
"aggs": {
"name_agg": {
"terms": {
"field": "name.raw",
"size": 1000
},
"aggs": {
"last_document": {
"top_hits": {
"_source": ["#timestamp", "name", "statusInfo.status"],
"size": 1,
"sort": [
{
"#timestamp": {
"order": "desc"
}
}
]
}
}
}
}
},
"post_filter": {
"bool": {
"must_not": {
"term": {
"statusInfo.status.raw": "UP"
}
}
}
}
}
This provides the following response:
{
"all_the_meta":{...},
"hits": {
"total": 1234,
"max_score": 0,
"hits": []
},
"aggregations": {
"name_agg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "myCategory1",
"doc_count": 225,
"last_document": {
"hits": {
"total": 225,
"max_score": null,
"hits": [
{
"_index": "myIndex",
"_type": "Http",
"_id": "dummy id",
"_score": null,
"_source": {
"#timestamp": "2018-04-06T00:06:00.005+02:00",
"statusInfo": {
"status": "UP"
},
"name": "myCategory1"
},
"sort": [
1522965960005
]
}
]
}
}
},
{other_buckets...}
]
}
}
}
Removing the size makes the result contain ALL of the documents, which is not what I need; I only need each bucket's content (each bucket holds a single document).
Removing the post_filter does not appear to do much.
I think this would be feasible in Oracle SQL with a PARTITION BY ... OVER clause, followed by a condition.
Does somebody know how this could be achieved?
If I understand you correctly, you are looking for the latest doc that has a status of OFFLINE in each group (grouped by name)? In that case you can try the query below; the number of items in the buckets should give you the "how many are down" count (for UP you would change the term in the filter).
NOTE: this is done in the latest version, so it uses the keyword field instead of raw.
POST /index/_search
{
"size": 0,
"query":{
"bool":{
"filter":{
"term": {"statusInfo.status.keyword": "OFFLINE"}
}
}
},
"aggs":{
"services_agg":{
"terms":{
"field": "name.keyword"
},
"aggs":{
"latest_doc":{
"top_hits": {
"sort": [
{
"#timestamp":{
"order": "desc"
}
}
],
"size": 1,
"_source": ["#timestamp", "name", "statusInfo.status"]
}
}
}
}
}
}

Aggregation on geo_point in Elasticsearch

Is there a way to aggregate on a geo_point field and receive the actual lat/lon?
All I managed to get is the geohash.
What I did so far:
Creating the index:
PUT geo_test
{
"mappings": {
"sharon_test": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
}
Adding X docs with different lat/lon values:
POST geo_test/sharon_test
{
"location": {
"lat": 45,
"lon": -7
}
}
Then I ran this aggregation:
GET geo_test/sharon_test/_search
{
"query": {
"bool": {
"must": [
{
"match_all": {}
}
]
}
},
"aggs": {
"locationsAgg": {
"geohash_grid": {
"field": "location",
"precision" : 12
}
}
}
}
I got this result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "geo_test",
"_type": "sharon_test",
"_id": "fGb4uGEBfEDTRjcEmr6i",
"_score": 1,
"_source": {
"location": {
"lat": 41.12,
"lon": -71.34
}
}
},
{
"_index": "geo_test",
"_type": "sharon_test",
"_id": "oWb4uGEBfEDTRjcE7b6R",
"_score": 1,
"_source": {
"location": {
"lat": 4,
"lon": -7
}
}
}
]
},
"aggregations": {
"locationsAgg": {
"buckets": [
{
"key": "ebenb8nv8nj9",
"doc_count": 1
},
{
"key": "drm3btev3e86",
"doc_count": 1
}
]
}
}
}
I want to know if I can get one of the 2:
1. Convert the "key", which currently holds a geohash, back to the source's lat/lon.
2. Show the lat/lon in the aggregation in the first place.
Thanks!
P.S.
I also tried the other geo aggregations, but all they give me is the number of docs that fit my aggregation conditions; I need the actual values.
E.g. I wanted this aggregation to return all the locations I had in my index, but it only returned the count:
GET geo_test/sharon_test/_search
{
"query": {
"bool": {
"must": [
{
"match_all": {}
}
]
}
},
"aggs": {
"distanceRanges": {
"geo_distance": {
"field": "location",
"origin": "50.0338, 36.2242 ",
"unit": "meters",
"ranges": [
{
"key": "All Locations",
"from": 1
}
]
}
}
}
}
You can actually use a geo_bounds sub-aggregation inside the geohash_grid to get a bounding box that narrows each cell down precisely, but to get the exact location you will need to decode the geohash:
GET geo_test/sharon_test/_search
{
"query":{
"bool":{
"must":[
{
"match_all":{
}
}
]
}
},
"aggs":{
"locationsAgg":{
"geohash_grid":{
"field":"location",
"precision":12
},
"aggs":{
"cell":{
"geo_bounds":{
"field":"location"
}
}
}
}
}
}
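Alternatively, if a few representative points per cell are enough, a top_hits sub-aggregation can return the original lat/lon straight from _source, with no geohash decoding needed. A sketch along these lines (points is a name chosen here):
GET geo_test/sharon_test/_search
{
"size": 0,
"aggs": {
"locationsAgg": {
"geohash_grid": {
"field": "location",
"precision": 12
},
"aggs": {
"points": {
"top_hits": {
"size": 1, // raise this to return more points per cell
"_source": ["location"]
}
}
}
}
}
}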

Elasticsearch embedded script optimization

Is there a way to simplify and optimize the following query:
{
"query": {
"filtered": {
"filter": {
"and": [
{
"range": {
"ts": {
"gte": "2014-12-18",
"lte": "2014-12-18"
}
}
}
]
},
"query": {
"match": {
"track_events.event": "render"
}
}
}
},
"aggs": {
"per_type": {
"terms": {
"field": "type",
"order": {
"_count": "desc"
},
"size": 0
},
"aggs": {
"per_hour": {
"terms": {
"script": "(doc[\"track_events.ts\"].value - doc[\"ts\"].value)/(1000 * 3600)",
"order": {
"_count": "desc"
},
"size": 0
}
}
}
}
}
}
The index in Elasticsearch contains documents with the fields track_events.ts and ts. The purpose is to count how many occurrences fall into each hourly interval between track_events.ts and ts.
Example response:
"buckets": [{
"key": "0",
"doc_count": 67736997
},
{
"key": "1",
"doc_count": 7193214
},
{
"key": "2",
"doc_count": 3406966
},
{
"key": "3",
"doc_count": 1988135
}]
which means that 67736997 documents were found with a time difference of less than 1 hour, 7193214 with a difference between 1 and 2 hours, etc.
The biggest performance gain would come from replacing the script.
That is, instead of computing
(doc["track_events.ts"].value - doc["ts"].value) / (1000 * 3600)
at query time, pre-calculate this value when loading the data into Elasticsearch and store it in another field, then run the terms aggregation on that field instead.
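For example, assuming the precomputed field is called hours_diff (a hypothetical name) and holds (track_events.ts - ts) / 3600000 truncated to an integer, the aggregation part reduces to a plain terms lookup while the filtered query stays unchanged:
"aggs": {
"per_type": {
"terms": {
"field": "type",
"order": {
"_count": "desc"
},
"size": 0
},
"aggs": {
"per_hour": {
"terms": {
"field": "hours_diff", // hypothetical precomputed field
"order": {
"_count": "desc"
},
"size": 0
}
}
}
}
}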
