How do I get the latest document grouped by a field? - elasticsearch

I have an index with many documents in this format:
{
"userId": 1234,
"locationDate": "2016-07-19T19:24:51+0000",
"location": {
"lat": -47.38163,
"lon": 26.38916
}
}
This index holds incremental positions for each user, updated every few seconds.
I would like to run a search that returns the latest position (sorted by locationDate) for each user (grouped by userId).
Is this possible with Elasticsearch? The best I could do was get all the positions from the last 30 seconds, using this:
{"query":{
"filtered" : {
"query" : {
"match_all" : { }
},
"filter" : {
"range" : {
"locationDate" : {
"from" : "2016-07-19T18:54:51+0000",
"to" : null,
"include_lower" : true,
"include_upper" : true
}
}
}
}
}}
I then sort them out by hand, but I would like to do this directly in Elasticsearch.
IMPORTANT: I am using Elasticsearch 1.5.2.

Try this (with aggregations):
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"range": {
"locationDate": {
"from": "2016-07-19T18:54:51+0000",
"to": null,
"include_lower": true,
"include_upper": true
}
}
}
}
},
"aggs": {
"byUser": {
"terms": {
"field": "userId",
"size": 10
},
"aggs": {
"firstOne": {
"top_hits": {
"size": 1,
"sort": [
{
"locationDate": {
"order": "desc"
}
}
]
}
}
}
}
}
}
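To read the result in application code, something like the following Python sketch could work. It assumes the response has already been parsed into a dict with the layout this query produces (`aggregations.byUser.buckets[].firstOne.hits.hits`); the function name and the sample response are illustrative only. Also note that `"size": 10` in the terms aggregation returns only ten users, so it would need to be raised to cover every user.

```python
def latest_position_per_user(response):
    """Map each userId bucket key to the _source of its single top hit."""
    latest = {}
    for bucket in response["aggregations"]["byUser"]["buckets"]:
        hits = bucket["firstOne"]["hits"]["hits"]
        if hits:  # top_hits with size 1 yields at most one document
            latest[bucket["key"]] = hits[0]["_source"]
    return latest


# Hand-written sample shaped like an aggregation response (not real output).
sample_response = {
    "aggregations": {"byUser": {"buckets": [
        {"key": 1234, "doc_count": 3, "firstOne": {"hits": {"hits": [
            {"_source": {"userId": 1234,
                         "locationDate": "2016-07-19T19:24:51+0000",
                         "location": {"lat": -47.38163, "lon": 26.38916}}}
        ]}}}
    ]}}
}

positions = latest_position_per_user(sample_response)
print(positions[1234]["locationDate"])
```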

Related

Aggregation performance issue in Elasticsearch with hotel availability data

I am building a small app to find hotel room availability like booking.com using Elasticsearch 6.8.0.
Basically, I have a document per day and room that specifies whether the room is available and the rate for that day. I need to run a query with these requirements:
Input:
The days of the desired staying.
The max amount of money I am willing to spend.
The page of the results I want to see.
The number of results per page.
Output:
List of the cheapest offer per hotel that fulfills the requirements, sorted in ascending order of price.
Documents schema:
{
"mappings": {
"_doc": {
"properties": {
"room_id": {
"type": "keyword"
},
"available": {
"type": "boolean"
},
"rate": {
"type": "float"
},
"hotel_id": {
"type": "keyword"
},
"day": {
"type": "date",
"format": "yyyyMMdd"
}
}
}
}
}
I have an index per month, and at the moment I only search within the same month.
I came up with this query:
GET /hotels_201910/_search?filter_path=aggregations.hotel.buckets.min_price.value,aggregations.hotel.buckets.key
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"day": { "gte" : "20191001", "lte" : "20191010" }
}
},
{
"term": {
"available": true
}
}
]
}
},
"aggs": {
"hotel": {
"terms": {
"field": "hotel_id",
"min_doc_count": 1,
"size" : 1000000
},
"aggs": {
"room": {
"terms": {
"field": "room_id",
"min_doc_count": 10,
"size" : 1000000
},
"aggs": {
"sum_price": {
"sum": {
"field": "rate"
}
},
"max_price": {
"bucket_selector": {
"buckets_path": {
"price": "sum_price"
},
"script": "params.price <= 600"
}
}
}
},
"min_price": {
"min_bucket": {
"buckets_path": "room>sum_price"
}
},
"sort_by_min_price" : {
"bucket_sort" :{
"sort": [{"min_price" : { "order" : "asc" }}],
"from" : 0,
"size" : 20
}
}
}
}
}
}
It works, but it has several issues:
It is too slow. With 100K daily rooms, it takes about 500 ms to return on my computer, where no other query is running, so in a live system it would be very bad.
I need to set "size" to a big number in the terms aggregations, otherwise not all hotels and rooms are considered.
Is there a way to improve the performance of this aggregation? I have tried splitting the index into multiple shards, but it did not help.
I am almost sure that the approach is wrong, and that is why it is slow. Any recommendations on how to achieve a faster query response time in this case?
Before going to the answer, I didn't understand why you are using the condition/aggregation below:
"min_price": {
"min_bucket": {
"buckets_path": "room>sum_price"
}
}
Can you clarify why you need this?
Now, the answer to your main question:
Why do you want to do a terms aggregation by room_id as well as by hotel_id? You can get all the rooms matching your search and then group them by hotel_id on the application side.
The logic below will get you all docs grouped by room_id, with the sum metric. You can use the same bucket_selector script for the 600 price limit.
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"day": { "gte" : "20191001", "lte" : "20191010" }
}
},
{
"term": {
"available": true
}
}
]
}
},
"aggs": {
"by_room_id": {
"composite" : {
"size": 100,
"sources" : [
{
"room_id": {
"terms" : {
"field": "room_id"
}
}
}
]
},
"aggregations": {
"price_on_required_dates": {
"sum": { "field": "rate" }
},
"include_source": {
"top_hits": {
"size": 1,
"_source": true
}
},
"price_bucket_sort": {
"bucket_sort": {
"sort": [
{"price_on_required_dates": {"order": "desc"}}
]
}
}
}
}
}
}
Also, to improve search performance, see:
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html
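The application-side grouping suggested above could look roughly like this Python sketch. It assumes the composite buckets have already been collected into a list shaped like the aggregation above (`key.room_id`, `doc_count`, `price_on_required_dates.value`, `include_source` top hit). The function name, the `nights` parameter and the sample buckets are hypothetical, and the `doc_count` check is only my stand-in for the `min_doc_count: 10` condition in the question.

```python
def cheapest_offer_per_hotel(room_buckets, max_price, nights, page=0, per_page=20):
    """Group room buckets by hotel and return the cheapest offer per hotel."""
    cheapest = {}
    for bucket in room_buckets:
        if bucket["doc_count"] < nights:
            continue  # room is not available on every requested day
        total = bucket["price_on_required_dates"]["value"]
        if total > max_price:
            continue  # same role as the bucket_selector script in the question
        source = bucket["include_source"]["hits"]["hits"][0]["_source"]
        hotel_id = source["hotel_id"]
        if hotel_id not in cheapest or total < cheapest[hotel_id]["price"]:
            cheapest[hotel_id] = {"hotel_id": hotel_id,
                                  "room_id": bucket["key"]["room_id"],
                                  "price": total}
    # Cheapest hotels first, then cut out the requested page.
    ordered = sorted(cheapest.values(), key=lambda offer: offer["price"])
    start = page * per_page
    return ordered[start:start + per_page]


def _bucket(room_id, hotel_id, total, nights=10):
    """Build a hand-written sample bucket (not real Elasticsearch output)."""
    return {"key": {"room_id": room_id}, "doc_count": nights,
            "price_on_required_dates": {"value": total},
            "include_source": {"hits": {"hits": [
                {"_source": {"room_id": room_id, "hotel_id": hotel_id}}]}}}


sample_buckets = [
    _bucket("r1", "h1", 550.0),
    _bucket("r2", "h1", 480.0),
    _bucket("r3", "h2", 700.0),            # over budget, dropped
    _bucket("r4", "h3", 300.0, nights=7),  # missing some days, dropped
    _bucket("r5", "h3", 590.0),
]
offers = cheapest_offer_per_hotel(sample_buckets, max_price=600, nights=10)
print(offers)
```

Paging happens only after every room bucket has been seen, so the whole composite result has to be read first; whether that beats the original query depends on how many rooms match.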

elasticsearch composite aggs with nested object

I have an object with nested field.
"parameters": {
"type": "nested",
"properties": {
"id": {
"type": "integer"
},
"values": {
"type": "keyword"
}
}
}
I am trying aggregate operation:
GET places/place/_search?size=0
{
"query": {
"match_all": {}
},
"aggs": {
"parameters": {
"nested": {
"path": "parameters"
},
"aggs": {
"parameters_cnt_i": {
"terms": {
"field": "parameters.id",
"size": 100
},
"aggs": {
"parameters_cnt_v": {
"terms": {
"field": "parameters.values",
"size": 100
}
}
}
}
}
}
}
}
But this is not good, because I have to set "size" too large.
The docs say:
If you want to retrieve all terms or all combinations of terms in a nested terms aggregation you should use the Composite aggregation
But I can't understand how to use a composite aggregation with a nested object. Is it possible?
My solution:
{
"size": 0,
"aggs" : {
"parameters" : {
"nested" : {
"path" : "parameters"
},
"aggs": {
"group":{
"composite" : {
"size": 100, // your size
"sources" : [
{ "id": { "terms" : { "field": "parameters.id"} }}
]
}
}
}
}
}
}
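To actually retrieve all buckets, the composite aggregation has to be paged by feeding the key of the last bucket back in as `after`. A minimal Python sketch of that loop follows, assuming a caller-supplied `run_search(body)` function that sends the request and returns the parsed response; that function, `iter_composite_buckets` and the fake backend below are hypothetical, not part of any client library.

```python
import copy


def iter_composite_buckets(run_search, body, agg_path=("parameters", "group")):
    """Yield every bucket of a composite aggregation, page by page."""
    body = copy.deepcopy(body)  # do not mutate the caller's request
    # Locate the composite definition inside the request body.
    request_agg = body["aggs"][agg_path[0]]
    for name in agg_path[1:]:
        request_agg = request_agg["aggs"][name]
    composite = request_agg["composite"]
    while True:
        result = run_search(body)["aggregations"]
        for name in agg_path:
            result = result[name]
        buckets = result["buckets"]
        if not buckets:
            return  # an empty page means everything has been read
        yield from buckets
        # Use after_key when the response has it, else the last bucket's key.
        composite["after"] = result.get("after_key", buckets[-1]["key"])


def fake_search(body):
    """Stand-in for a real search call, serving ids 1..5 in pages."""
    composite = body["aggs"]["parameters"]["aggs"]["group"]["composite"]
    after = composite.get("after", {}).get("id", 0)
    page = [i for i in [1, 2, 3, 4, 5] if i > after][:composite["size"]]
    group = {"buckets": [{"key": {"id": i}, "doc_count": 1} for i in page]}
    if page:
        group["after_key"] = {"id": page[-1]}
    return {"aggregations": {"parameters": {"group": group}}}


request = {"size": 0, "aggs": {"parameters": {
    "nested": {"path": "parameters"},
    "aggs": {"group": {"composite": {
        "size": 2,
        "sources": [{"id": {"terms": {"field": "parameters.id"}}}]}}}}}}

all_ids = [b["key"]["id"] for b in iter_composite_buckets(fake_search, request)]
print(all_ids)
```

Adding a second entry for `parameters.values` to `sources` should give one bucket per id/value combination, which is what the nested terms aggregations in the question were after.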
Try dropping your 3rd "aggs", like this:
{
"aggs": {
"parameters": {
"nested": {
"path": "parameters"
},
"aggs": {
"count_item_one": {
"terms" : {
"field": "parameters.item_one",
"size": 100
}
},
"count_item_two": {
"terms" : {
"field": "parameters.item_two",
"size": 100
}
}
}
}
}
}
If your 2nd item is nested again, you may have to set up your nested params again as you did with your 1st "aggs".

Aggregation using elastic search

My search query to fetch the latest 5000 documents from my Elasticsearch index is below:
{
"size": 5000,
"from": 0,
"query": {
"range" : {
"hostTimestamp" : {
"gte" : 1499674634382,
"lte" : 1499680034000
}
}
},
"sort": [
{
"hostTimestamp": {
"order": "desc"
}
}
]
}
Now, among the documents fetched by this query, I want to count the number of documents with eventSeverity of Alert or Critical. How can this be achieved?
You can achieve that with a terms aggregation on the eventSeverity field:
{
"size": 5000,
"from": 0,
"query": {
"range" : {
"hostTimestamp" : {
"gte" : 1499674634382,
"lte" : 1499680034000
}
}
},
"sort": [
{
"hostTimestamp": {
"order": "desc"
}
}
],
"aggs": { <--- add this part
"severities": {
"terms": {
"field": "eventSeverity"
}
}
}
}
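One caveat: as far as I know, aggregations are computed over every document matching the query, not only over the 5000 hits returned. If more than 5000 documents fall in the time range, the bucket counts will cover all of them. If the count must be limited to the returned documents, it can be done on the client, roughly like this Python sketch (the function name and the sample response are made up for illustration):

```python
from collections import Counter


def count_severities(response, wanted=("Alert", "Critical")):
    """Count the returned hits whose eventSeverity is one of the wanted values."""
    counts = Counter(hit["_source"].get("eventSeverity")
                     for hit in response["hits"]["hits"])
    return {severity: counts.get(severity, 0) for severity in wanted}


# Hand-written sample shaped like a search response (not real output).
sample_response = {"hits": {"hits": [
    {"_source": {"hostTimestamp": 1499680033000, "eventSeverity": "Alert"}},
    {"_source": {"hostTimestamp": 1499680032000, "eventSeverity": "Info"}},
    {"_source": {"hostTimestamp": 1499680031000, "eventSeverity": "Critical"}},
    {"_source": {"hostTimestamp": 1499680030000, "eventSeverity": "Alert"}},
]}}

severity_counts = count_severities(sample_response)
print(severity_counts)
```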

How can I aggregate over the _score

I tried to run an aggregation over the _score field in Elasticsearch, with no results. It seems it is not possible to use the _score field, maybe because it is not a field of the document. How can I aggregate over the _score?
This is my query:
{
"_source": false, "explain": false, "from": 0, "size": 0,
"aggs" : {
"score_ranges" : {
"range" : {
"field" : "_score",
"ranges" : [
{ "to" : 50 },
{ "from" : 50, "to" : 75 },
{ "from" : 75 }
]
}
}
},
"query": {
"function_score": {
"query": {
"match_all": { }
}
}
}
}
You can access _score through a script in the aggregation, for example with a histogram:
"aggs": {
"scores_histogram": {
"histogram": {
"script": "return _score.doubleValue() * 10",
"interval": 3
}
}
}
or, with ranges:
"aggs": {
"score_ranges": {
"range": {
"script": "_score",
"ranges": [
{
"to": 50
},
{
"from": 50,
"to": 75
},
{
"from": 75
}
]
}
}
}
Note that you need to enable dynamic scripting for this to work.
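To make the bucket boundaries concrete: in a range aggregation "from" is inclusive and "to" is exclusive, so a score of exactly 50 lands in the second bucket and 75 in the third. A small Python sketch that mimics this on a plain list of scores (the function and the sample scores are illustrative only, not Elasticsearch code):

```python
def bucket_scores(scores, ranges):
    """Count scores per range, treating 'from' as inclusive and 'to' as exclusive."""
    counts = []
    for r in ranges:
        low = r.get("from", float("-inf"))
        high = r.get("to", float("inf"))
        counts.append(sum(1 for score in scores if low <= score < high))
    return counts


ranges = [{"to": 50}, {"from": 50, "to": 75}, {"from": 75}]
score_counts = bucket_scores([10, 50, 74.9, 75, 120], ranges)
print(score_counts)
```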

Query or Filter for minimum field value?

Example: a document stored in an index represents test scores and meta data about each test.
{ "test": 1, "user":1, "score":100, "meta":"other data" },
{ "test": 2, "user":2, "score":65, "meta":"other data" },
{ "test": 3, "user":2, "score":88, "meta":"other data" },
{ "test": 4, "user":1, "score":23, "meta":"other data" }
I need to be able to filter out all but the lowest test score and return the associated metadata with that test for each test taker. So my expected result set would be:
{ "test": 2, "user":2, "score":65, "meta":"other data" },
{ "test": 4, "user":1, "score":23, "meta":"other data" }
The only way I see to do this now is by first doing a terms aggregation by user with a nested min aggregation to get their lowest score.
POST user/tests/_search
{
"aggs" : {
"users" : {
"terms" : {
"field" : "user",
"order" : { "lowest_score" : "asc" }
},
"aggs" : {
"lowest_score" : { "min" : { "field" : "score" } }
}
}
},"size":0
}
Then I'd have to take the results of that query and do a filtered query for EACH user and filter on the lowest score value to grab the rest of the metadata. Yuk.
POST user/tests/_search
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{"term": { "user": {"value": "1" }}},
{"term": { "score": {"value": "23" }}}
]
}
}
}
}
}
I'd like to know if there is a way to return one response that has the lowest test score for each test taker and includes the original _source document.
Solutions?
UPDATE - SOLVED
The following gives me the lowest score document for each user and is ordered by the overall lowest score. And, it includes the original document.
GET user/tests/_search?search_type=count
{
"aggs": {
"users": {
"terms": {
"field": "user",
"order" : { "lowest_score" : "asc" }
},
"aggs": {
"lowest_score": { "min": { "field": "score" }},
"lowest_score_top_hits": {
"top_hits": {
"size":1,
"sort": [{"score": {"order": "asc"}}]
}
}
}
}
}
}
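For checking the aggregation result against the sample data, here is the same selection done in plain Python: the lowest-score document per user, ordered by the overall lowest score. It is only a reference sketch over an in-memory list, and the function name is made up.

```python
def lowest_score_per_user(docs):
    """Return each user's lowest-scoring document, lowest score first."""
    lowest = {}
    for doc in docs:
        current = lowest.get(doc["user"])
        if current is None or doc["score"] < current["score"]:
            lowest[doc["user"]] = doc
    return sorted(lowest.values(), key=lambda doc: doc["score"])


tests = [
    {"test": 1, "user": 1, "score": 100, "meta": "other data"},
    {"test": 2, "user": 2, "score": 65, "meta": "other data"},
    {"test": 3, "user": 2, "score": 88, "meta": "other data"},
    {"test": 4, "user": 1, "score": 23, "meta": "other data"},
]

lowest_docs = lowest_score_per_user(tests)
print(lowest_docs)
```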
Maybe you could try this with top hits aggregation:
GET user/tests/_search?search_type=count
{
"aggs": {
"users": {
"terms": {
"field": "user",
"order": {
"_term": "asc"
}
},
"aggs": {
"lowest_score": {
"min": {
"field": "score"
}
},
"agg_top": {
"top_hits": {"size":1, "sort": [{"score": {"order": "asc"}}]}
}
}
}
},
"size": 20
}
