ElasticSearch filtering by field1 THEN field2 THEN take max of field3

I am struggling to get the information that I need from ElasticSearch.
My log statements are like this:
field1: Example
field2: Example2
field3: Example3
I would like to search a timeframe (the last 24 hours) to find all data that has this in field1 and that in field2.
There may then be multiple this.that.[field3] entries, so I want to return only the maximum of that field.
In fact, in my data, field3 is actually the key of the entry.
What is the best way of retrieving the information I need? I have managed to get results back using aggs, but the data is in buckets, and I am only interested in the data with the max value of field3.
I have added an example of the query that I am looking to do: https://jsonblob.com/54535d49e4b0d117eeaf6bb4
{
  "size": 0,
  "aggs": {
    "agg_129": {
      "filters": {
        "filters": {
          "CarName: Toyota": {
            "query": {
              "query_string": {
                "query": "CarName: Toyota"
              }
            }
          }
        }
      },
      "aggs": {
        "agg_130": {
          "filters": {
            "filters": {
              "Attribute: TimeUsed": {
                "query": {
                  "query_string": {
                    "query": "Attribute: TimeUsed"
                  }
                }
              }
            }
          },
          "aggs": {
            "agg_131": {
              "terms": {
                "field": "#timestamp",
                "size": 0,
                "order": {
                  "_count": "desc"
                }
              }
            }
          }
        }
      }
    }
  },
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "#timestamp": {
                  "gte": "2014-10-27T00:00:00.000Z",
                  "lte": "2014-10-28T23:59:59.999Z"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  }
}
So, the example above shows only the documents that have CarName = Toyota and Attribute = TimeUsed.
My data is as follows:
There are x cars (CarName), each car has y Attributes, and each of those Attributes has a document with a timestamp.
To begin with, I was looking for a query for CarName.Attribute.timestamp (latest). However, if I could use just ONE query to get the latest timestamp for EVERY Attribute of EVERY CarName, that would cut the number of query calls from ~50 down to one.

If you are using Elasticsearch v1.3+, you can add a top_hits aggregation with size: 1 and a descending sort on the field3 value.
This will return the whole document with the maximum value of that field, as you wish.
This example in the documentation might do the trick.
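For illustration, a minimal sketch of that approach (untested; the aggregation names are placeholders, and #timestamp stands in for field3, which the question says is the timestamp):

{
  "size": 0,
  "aggs": {
    "by_carname": {
      "terms": { "field": "CarName" },
      "aggs": {
        "latest_doc": {
          "top_hits": {
            "size": 1,
            "sort": [
              { "#timestamp": { "order": "desc" } }
            ]
          }
        }
      }
    }
  }
}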
Edit:
Ok, it seems you don't need the whole document, but only the maximum timestamp value. You can use a max aggregation instead of a top_hits one.
The following query (not tested) should give you the maximum timestamp value for each of the top 10 Attribute values of each of the top 10 CarName values, in a single request.
A terms aggregation is like a GROUP BY clause, and you should not have to query 50 times to retrieve the values of each CarName/Attribute combination: this is the point of nesting a terms aggregation for Attribute inside the CarName aggregation.
Note that, for this to work properly, the CarName and Attribute fields should be not_analyzed (see the mapping sketch after the query). If that's not the case, you will get "funny" results in your buckets. The problem (and a possible solution) is very well described here.
Feel free to change the size parameter of the terms aggregations to fit your case.
{
  "size": 0,
  "aggs": {
    "by_carnames": {
      "terms": {
        "field": "CarName",
        "size": 10
      },
      "aggs": {
        "by_attribute": {
          "terms": {
            "field": "Attribute",
            "size": 10
          },
          "aggs": {
            "max_timestamp": {
              "max": {
                "field": "#timestamp"
              }
            }
          }
        }
      }
    }
  },
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "#timestamp": {
                  "gte": "2014-10-27T00:00:00.000Z",
                  "lte": "2014-10-28T23:59:59.999Z"
                }
              }
            }
          ]
        }
      }
    }
  }
}
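For reference, the response to this query comes back as nested buckets, roughly shaped like this (abridged; keys and values are purely illustrative):

{
  "aggregations": {
    "by_carnames": {
      "buckets": [
        {
          "key": "Toyota",
          "doc_count": 42,
          "by_attribute": {
            "buckets": [
              {
                "key": "TimeUsed",
                "doc_count": 7,
                "max_timestamp": { "value": 1414454400000 }
              }
            ]
          }
        }
      ]
    }
  }
}

As for the not_analyzed note above, this is a sketch of what such a mapping could look like in ES 1.x syntax (my_index and my_type are placeholders):

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "CarName": { "type": "string", "index": "not_analyzed" },
        "Attribute": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}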

Related

Query returns result with small size that is not my intention in elasticsearch

I am using the REST API to query results from Elasticsearch.
Below is the query:
GET /..../_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "#timestamp": {
              "time_zone": "+09:00",
              "gte": "2023-01-24T00:00:00.000Z",
              "lt": "2023-01-24T03:03:00.000Z"
            }
          }
        },
        {
          "term": {
            "serviceid.keyword": {
              "value": "430011397"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "by_day": {
      "auto_date_histogram": {
        "field": "#timestamp",
        "minimum_interval": "minute"
      },
      "aggs": {
        "agg-type": {
          "terms": {
            "field": "nxlogtype.keyword",
            "size": 100000
          },
          "aggs": {
            "my-sub-agg-name": {
              "avg": {
                "field": "size"
              }
            }
          }
        }
      }
    }
  }
}
As you can see, I specified a time range of about three hours in the gte and lt fields.
However, the result returns only 6 buckets, each covering a 30-minute interval.
I expected many buckets at one-minute intervals across the time range I specified, but the result is always the same, even when I extend the time range.
Since I am quite new to Elasticsearch, I am not familiar with query usage.
How can I resolve this issue?
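Worth knowing here: auto_date_histogram chooses its own interval so that the total number of buckets stays at or below a target count (the buckets parameter, which defaults to 10); minimum_interval only sets a floor on the rounding, which is consistent with the ~6 buckets seen above. If fixed one-minute buckets are actually wanted, a plain date_histogram is the usual route. A minimal sketch (assuming Elasticsearch 7.x+; the aggregation name is a placeholder):

{
  "size": 0,
  "aggs": {
    "by_minute": {
      "date_histogram": {
        "field": "#timestamp",
        "fixed_interval": "1m"
      }
    }
  }
}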

Elasticsearch scoped aggregation not desired results

I have the following query, but the aggregation doesn't seem to be acting on top of the query.
The query returns 3 results, yet there are 10 items in the aggregation. It looks like the aggregation is acting on top of all queried results.
Basically, how do I get the aggregation to take the given query as its input?
{
  "query": {
    "filtered": {
      "filter": {
        "and": [
          {
            "geo_distance": {
              "coordinates": [
                -79.3931,
                43.6709
              ],
              "distance": "15km"
            }
          },
          {
            "term": {
              "user.type": "2"
            }
          }
        ]
      },
      "query": {
        "match": {
          "user.shoes": "314"
        }
      }
    }
  },
  "aggs": {
    "dedup": {
      "terms": { "field": "user.id" },
      "aggs": {
        "dedup_docs": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
So, as it turns out, I was expecting the aggregation to act on the paginated results returned by the query, and that's incorrect.
The aggregation takes as input all hits matching the query, not just the current page.
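A minimal sketch to make that concrete (field names borrowed from the query above): here size: 3 only limits the hits array that is returned; the dedup buckets are still computed over every document matching the query.

{
  "from": 0,
  "size": 3,
  "query": {
    "match": { "user.shoes": "314" }
  },
  "aggs": {
    "dedup": {
      "terms": { "field": "user.id" }
    }
  }
}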

elasticsearch averaging a field on a bucket

I am a newbie to Elasticsearch, trying to understand how aggregations and metrics work. In particular, I was running an aggregation query to retrieve the average number of bytesOut grouped by clientIPHash from an Elasticsearch instance. The query I created (using Kibana) is as follows:
{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "*",
          "analyze_wildcard": true
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "#timestamp": {
                  "gte": 1476177616965,
                  "lte": 1481361616965,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "aggs": {
    "2": {
      "terms": {
        "field": "ClientIP_Hash",
        "size": 50,
        "order": {
          "1": "desc"
        }
      },
      "aggs": {
        "1": {
          "avg": {
            "field": "Bytes Out"
          }
        }
      }
    }
  }
}
It gives me some output (supposed to be the avg), grouped on clientIPHash, like below:

ClientIP_Hash (descending)            Average Bytes Out
64e6b1f6447fd044c5368740c3018f49      1,302,210
4ff8598a995e5fa6930889b8751708df      94,038
33b559ac9299151d881fec7508e2d943      68,527
c2095c87a0e2f254e8a37f937a68a2c0      67,083
...
The problem is, if I replace the avg with sum or min or any other metric type, I still get the same values.

ClientIP_Hash (descending)            Sum of Bytes Out
64e6b1f6447fd044c5368740c3018f49      1,302,210
4ff8598a995e5fa6930889b8751708df      94,038
33b559ac9299151d881fec7508e2d943      68,527
c2095c87a0e2f254e8a37f937a68a2c0      67,083
I checked the query generated by Kibana, and it seems to put the keyword 'sum' or 'avg' in correctly. I am puzzled why I get the same values for avg and sum or any other metric.
Could you check whether your sample data set has more than one value per bucket? Min, max, sum and avg all return the same result if each bucket contains only a single value.
Thanks
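One way to verify this, as a sketch against the same query (the num_values name is a placeholder): add a value_count metric alongside the avg, so each bucket reports how many values it actually aggregated.

"aggs": {
  "2": {
    "terms": {
      "field": "ClientIP_Hash",
      "size": 50
    },
    "aggs": {
      "1": {
        "avg": { "field": "Bytes Out" }
      },
      "num_values": {
        "value_count": { "field": "Bytes Out" }
      }
    }
  }
}

If num_values comes back as 1 for these buckets, then avg, sum, min and max will all legitimately return the same number.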

Filter/Query support in Elasticsearch Top hits Aggregation

The Elasticsearch documentation states that "The top_hits aggregation returns regular search hits, because of this many per hit features can be supported", and, crucially, the listed features include "Named filters and queries".
But trying to add any filter or query throws SearchParseException: Unknown key for a START_OBJECT.
Use case: I have items, each of which has a list of nested comments:
items {id} -> comments {date, rating}
I want to get the top-rated comment for each item from the last week.
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "items": {
      "terms": {
        "field": "id",
        "size": 10
      },
      "aggs": {
        "comment": {
          "nested": {
            "path": "comments"
          },
          "aggs": {
            "top_comment": {
              "top_hits": {
                "size": 1,
                // need a filter here to select only comments from last week
                "sort": {
                  "comments.rating": {
                    "order": "desc"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
So is the documentation wrong, or is there a way to add a filter?
https://www.elastic.co/guide/en/elasticsearch/reference/2.1/search-aggregations-metrics-top-hits-aggregation.html
Are you sure you have mapped them as nested? I've just tried to execute such a query on my data and it worked fine.
If so, you could simply add a filter aggregation right after the nested aggregation (hopefully I haven't messed up the curly brackets):
POST data/_search
{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "nested": {
          "path": "comments",
          "query": {
            "range": {
              "comments.date": {
                "gte": "now-1w",
                "lte": "now"
              }
            }
          }
        }
      }
    }
  },
  "aggs": {
    "items": {
      "terms": {
        "field": "id",
        "size": 10
      },
      "aggs": {
        "nested": {
          "nested": {
            "path": "comments"
          },
          "aggs": {
            "filterComments": {
              "filter": {
                "range": {
                  "comments.date": {
                    "gte": "now-1w",
                    "lte": "now"
                  }
                }
              },
              "aggs": {
                "topComments": {
                  "top_hits": {
                    "size": 1,
                    "sort": {
                      "comments.rating": "desc"
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
P.S. Always include the FULL path for nested objects.
So this query will:
1. Filter documents that have comments younger than one week, both to narrow down the documents for aggregation and to find those that actually have such comments (filtered query)
2. Do a terms aggregation based on the id field
3. Open the nested sub-documents (comments)
4. Filter them by date
5. Return the most badass one (top-rated)

Faster query filter for getting a document with "greatest" field value?

In an Elasticsearch index I have documents with the fields fooId and fooField.
I would like to fetch the document with a given fooId value and the largest value of fooField. Right now, I have a filtered query with an aggregation like this one:
"aggs": {
"topHits_agg": {
"top_hits": {
"sort": [{
"fooField": {
"order": "desc"
}
}],
size: 1
}
}
}
However, the performance is not good. Is there any way to make this better?
If I understand correctly, you don't need an aggregation; you can sort on fooField directly, like this:
GET your_index/_search
{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "fooId": "your_specific_id"
        }
      }
    }
  },
  "sort": [
    {
      "fooField": {
        "order": "desc"
      }
    }
  ],
  "size": 1
}
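As a design note (a general observation, not benchmarked here): sorting the query hits directly avoids building aggregation buckets altogether, so with "size": 1 Elasticsearch only has to keep the single top document per shard, which is typically cheaper than running a top_hits aggregation for the same result.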
