How to use pagination (size and from) in Elasticsearch aggregation? - elasticsearch

How do I use pagination (size and from) in an Elasticsearch aggregation? When I use size and from in an aggregation, it throws an exception.
For example, I want to query like this:
GET /index/nameorder/_search
{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "projectId": "10057"
              }
            }
          ]
        }
      },
      "filter": {
        "range": {
          "purchasedDate": {
            "from": "2012-02-05T00:00:00",
            "to": "2015-02-11T23:59:59"
          }
        }
      }
    }
  },
  "aggs": {
    "group_by_a": {
      "terms": {
        "field": "promocode",
        "size": 40,
        "from": 40
      },
      "aggs": {
        "TotalPrice": {
          "sum": {
            "field": "subtotalPrice"
          }
        }
      }
    }
  }
}

As of now, this feature is not supported.
There is an open issue for it, but it is still under discussion.
Issue - https://github.com/elasticsearch/elasticsearch/issues/4915

In order to implement pagination on top of an aggregation in Elasticsearch, you need to do the following:
Define the size of each batch.
Run a cardinality aggregation to count the distinct terms.
Then, from that cardinality, derive the number of partitions: num_partitions = cardinality count / size (this size must be smaller than the fetch size).
Now you can iterate over the partitions using the partition filter of the terms aggregation, as sketched below. Please note that size must be big enough, because the results are not split equally between the buckets.
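A minimal sketch of one such partitioned request, reusing the field and aggregation names from the question and assuming an Elasticsearch version (5.2+) whose terms aggregation supports the include/num_partitions syntax; the num_partitions value of 20 is only illustrative, and the query clause is omitted for brevity:
GET /index/nameorder/_search
{
  "size": 0,
  "aggs": {
    "group_by_a": {
      "terms": {
        "field": "promocode",
        "size": 40,
        "include": {
          "partition": 0,
          "num_partitions": 20
        }
      },
      "aggs": {
        "TotalPrice": {
          "sum": {
            "field": "subtotalPrice"
          }
        }
      }
    }
  }
}
Incrementing partition from 0 up to num_partitions - 1 across successive requests walks through all the buckets, one batch at a time.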

Related

Deduplicate and perform composite aggregation on the deduplicated result

I have an index in Elasticsearch which contains daily transaction data. Each doc mainly has these fields:
TxnId, Status, TxnType, userId
Two documents can have the same TxnId.
I'm looking for a query that aggregates over status and TxnType for unique txnIds. Basically I'm looking for something like: select unique txnIds from user_table group by status, txnType.
I have an ES query which will dedup on TxnId, and another ES query which can perform a composite aggregation on status and txnType. I want to do both things in a single query.
I tried the collapse feature, and also the cardinality and dedup features, but the query is not giving the correct output:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "streamSource": 3
          }
        }
      ]
    }
  },
  "collapse": {
    "field": "txnId"
  },
  "aggs": {
    "buckets": {
      "composite": {
        "size": 30,
        "sources": [
          {
            "status": {
              "terms": {
                "field": "status"
              }
            }
          },
          {
            "txnType": {
              "terms": {
                "field": "txnType"
              }
            }
          }
        ]
      }
    }
  }
}
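One thing to keep in mind is that collapse only deduplicates the returned hits; it does not affect aggregations. A possible sketch (not tested, field names taken from the question) is to nest a cardinality aggregation on txnId inside the composite aggregation, which yields an approximate count of unique txnIds per status/txnType pair:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "streamSource": 3
          }
        }
      ]
    }
  },
  "aggs": {
    "buckets": {
      "composite": {
        "size": 30,
        "sources": [
          {
            "status": {
              "terms": {
                "field": "status"
              }
            }
          },
          {
            "txnType": {
              "terms": {
                "field": "txnType"
              }
            }
          }
        ]
      },
      "aggs": {
        "unique_txns": {
          "cardinality": {
            "field": "txnId"
          }
        }
      }
    }
  }
}
Note that cardinality is an approximate metric, so the counts may deviate slightly for very high-cardinality fields.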

Suspiciously low result on Elasticsearch

The following query returns 24 buckets:
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "partnerCategory": 6
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "uniqcnpjs": {
      "terms": {
        "field": "partnerId"
      }
    }
  }
}
The expected result is about 750 buckets; 24 is very low.
Also, if you add up the "doc_count" of each bucket, it doesn't match the number of hits you get without the aggregation: the sum of the bucket doc_counts should be at least 20k, but it's only 2.5k.
So, can anyone tell me what's going on? Am I doing something wrong?
Have you tried setting the size option of the terms aggregation to a very high value? e.g.,
"aggs": {
"uniqcnpjs": {
"terms": {
"field": "partnerId",
"size": 1000
}
}
}
Also, check whether the result of a cardinality aggregation is also lower than what you expect, e.g.,
"aggs": {
"cardinality_partnerid": {
"cardinality": {
"field": "partnerId"
}
}
}
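For reference, a single request body (a sketch assuming the same index and field names as the question) that combines both suggestions, returning the bucket list and the distinct count in one call:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "partnerCategory": 6
          }
        }
      ]
    }
  },
  "aggs": {
    "uniqcnpjs": {
      "terms": {
        "field": "partnerId",
        "size": 1000
      }
    },
    "cardinality_partnerid": {
      "cardinality": {
        "field": "partnerId"
      }
    }
  }
}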

elasticsearch averaging a field on a bucket

I am a newbie to Elasticsearch, trying to understand how aggregations and metrics work. In particular, I was running an aggregation query to retrieve the average number of bytes out per client IP hash from an Elasticsearch instance. The query I created (using Kibana) is as follows:
{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "*",
          "analyze_wildcard": true
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "@timestamp": {
                  "gte": 1476177616965,
                  "lte": 1481361616965,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "aggs": {
    "2": {
      "terms": {
        "field": "ClientIP_Hash",
        "size": 50,
        "order": {
          "1": "desc"
        }
      },
      "aggs": {
        "1": {
          "avg": {
            "field": "Bytes Out"
          }
        }
      }
    }
  }
}
It gives me some output (supposed to be avg) grouped on clientIPHash like below:
ClientIP_Hash: Descending Average Bytes Out
64e6b1f6447fd044c5368740c3018f49 1,302,210
4ff8598a995e5fa6930889b8751708df 94,038
33b559ac9299151d881fec7508e2d943 68,527
c2095c87a0e2f254e8a37f937a68a2c0 67,083
...
The problem is, if I replace the avg with sum or min or any other metric type, I still get the same values.
ClientIP_Hash: Descending Sum of Bytes Out
64e6b1f6447fd044c5368740c3018f49 1,302,210
4ff8598a995e5fa6930889b8751708df 94,038
33b559ac9299151d881fec7508e2d943 68,527
c2095c87a0e2f254e8a37f937a68a2c0 67,083
I checked the query generated by Kibana, and it seems to use the keyword 'sum' or 'avg' correctly. I am puzzled as to why I get the same values for avg, sum, or any other metric.
Could you check whether your sample data set has more than one value per bucket? Min, max, and avg all remain the same if there is only one value.
Thanks
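One way to verify this (a sketch reusing the field names from the question; the sub-aggregation names here are arbitrary) is to add a value_count metric next to the avg, so each bucket reports how many Bytes Out values it actually contains:
"aggs": {
  "2": {
    "terms": {
      "field": "ClientIP_Hash",
      "size": 50
    },
    "aggs": {
      "avg_bytes": {
        "avg": {
          "field": "Bytes Out"
        }
      },
      "value_count_bytes": {
        "value_count": {
          "field": "Bytes Out"
        }
      }
    }
  }
}
If value_count_bytes is 1 for every bucket, then avg, sum, min, and max will all be identical.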

Compute the "fill rate" of a field in Elasticsearch

I would like to compute the ratio of documents that have a value for a given field in my index.
I managed to count how many documents are missing the field:
GET profiles/_search
{
  "aggs": {
    "profiles_wo_country": {
      "missing": {
        "field": "country"
      }
    }
  },
  "size": 0
}
I also managed to count how many documents have the field:
GET profiles/_search
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "exists": {
          "field": "country"
        }
      }
    }
  },
  "size": 0
}
Naturally I can also get the total number of documents in the index. How can I compute the ratio?
An easy way to get the numbers you need from a single query is the following:
POST profiles/_search?filter_path=hits.total,aggregations.existing.doc_count
{
  "size": 0,
  "aggs": {
    "existing": {
      "filter": {
        "exists": {
          "field": "tag"
        }
      }
    }
  }
}
You'll get a response like this one:
{
  "hits": {
    "total": 37258601
  },
  "aggregations": {
    "existing": {
      "doc_count": 9287160
    }
  }
}
And then in your client code, you can simply do
fill_rate = (aggregations.existing.doc_count / hits.total) * 100
And you're good to go.

ElasticSearch filtering by field1 THEN field2 THEN take max of field3

I am struggling to get the information that I need from ElasticSearch.
My log statements are like this:
field1: Example
field2: Example2
field3: Example3
I would like to search a timeframe (the last 24 hours) to find all data that has this in field1 and that in field2.
There may then be multiple this.that.[field3] entries, so I want to return only the maximum of that field.
In fact, in my data, field3 is actually the key of the entry.
What is the best way of retrieving the information I need? I have managed to get the results returned using aggs, but the data is in buckets, and I am only interested in the data with the max value of field3.
I have added an example of the query that I am looking to do: https://jsonblob.com/54535d49e4b0d117eeaf6bb4
{
  "size": 0,
  "aggs": {
    "agg_129": {
      "filters": {
        "filters": {
          "CarName: Toyota": {
            "query": {
              "query_string": {
                "query": "CarName: Toyota"
              }
            }
          }
        }
      },
      "aggs": {
        "agg_130": {
          "filters": {
            "filters": {
              "Attribute: TimeUsed": {
                "query": {
                  "query_string": {
                    "query": "Attribute: TimeUsed"
                  }
                }
              }
            }
          },
          "aggs": {
            "agg_131": {
              "terms": {
                "field": "@timestamp",
                "size": 0,
                "order": {
                  "_count": "desc"
                }
              }
            }
          }
        }
      }
    }
  },
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "@timestamp": {
                  "gte": "2014-10-27T00:00:00.000Z",
                  "lte": "2014-10-28T23:59:59.999Z"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  }
}
So, that example above is showing only those that have CarName = Toyota and Attribute = TimeUsed.
My data is as follows:
There are x cars (CarName), each car has y Attributes, and each of those Attributes has documents with a timestamp.
To begin with, I was looking for a query for CarName.Attribute.timestamp (latest); however, if I am able to use just ONE query to get the latest timestamp for EVERY attribute of EVERY CarName, that would decrease the query calls from ~50 to one.
If you are using Elasticsearch v1.3+, you can add a top_hits aggregation with size: 1 and a descending sort on the field3 value.
This will return the whole document with the maximum value of the field, as you wish.
This example in the documentation might do the trick.
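A minimal top_hits sketch along those lines, using the placeholder field name field3 from the question; the surrounding field1/field2 filtering is omitted and the aggregation name is an arbitrary choice:
"aggs": {
  "max_field3_doc": {
    "top_hits": {
      "size": 1,
      "sort": [
        {
          "field3": {
            "order": "desc"
          }
        }
      ]
    }
  }
}
The single hit returned is the document with the highest field3 value.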
Edit:
Ok, it seems you don't need the whole document, but only the maximum timestamp value. You can use a max aggregation instead of using a top_hits one.
The following query (not tested) should give you the maximum timestamp value for each of the top 10 Attribute values of each of the top 10 CarName values, in only one request.
A terms aggregation is like a GROUP BY clause, and you should not have to query 50 times to retrieve the values of each CarName/Attribute combination: this is the point of nesting a terms aggregation for Attribute inside the CarName aggregation.
Note that, for this to work properly, the CarName and Attribute fields should be not_analyzed (see the mapping sketch below). If that's not the case, you will get "funny" results in your buckets. The problem (and a possible solution) is very well described here.
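A mapping sketch with both fields set to not_analyzed, using the string syntax of Elasticsearch 1.x that this answer targets; the type name car_doc is only an assumed placeholder:
{
  "mappings": {
    "car_doc": {
      "properties": {
        "CarName": {
          "type": "string",
          "index": "not_analyzed"
        },
        "Attribute": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}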
Feel free to change the size parameter of the terms aggregations to fit your case.
{
  "size": 0,
  "aggs": {
    "by_carnames": {
      "terms": {
        "field": "CarName",
        "size": 10
      },
      "aggs": {
        "by_attribute": {
          "terms": {
            "field": "Attribute",
            "size": 10
          },
          "aggs": {
            "max_timestamp": {
              "max": {
                "field": "@timestamp"
              }
            }
          }
        }
      }
    }
  },
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "@timestamp": {
                  "gte": "2014-10-27T00:00:00.000Z",
                  "lte": "2014-10-28T23:59:59.999Z"
                }
              }
            }
          ]
        }
      }
    }
  }
}
