Suspiciously low result on Elasticsearch

The following query returns 24 buckets:
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "partnerCategory": 6
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "uniqcnpjs": {
      "terms": {
        "field": "partnerId"
      }
    }
  }
}
I expect about 750 buckets, so 24 is very low.
On top of that, if you add up the "doc_count" of each bucket, the total doesn't match the number of hits the same query returns without the aggregation: the sum should be at least 20k, but it's only 2.5k.
So, can anyone tell me what's going on? Am I doing something wrong?

Have you tried setting the size option of the terms aggregation to a very high value? e.g.,
"aggs": {
"uniqcnpjs": {
"terms": {
"field": "partnerId",
"size": 1000
}
}
}
Also, check whether the result of a cardinality aggregation is also lower than what you expect. e.g.,
"aggs": {
"cardinality_partnerid": {
"cardinality": {
"field": "partnerId"
}
}
}
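As an additional sanity check, the terms aggregation response itself reports how much was left out: sum_other_doc_count is the number of documents that fell into buckets beyond the ones returned, and doc_count_error_upper_bound hints at shard-level inaccuracy. A sketch of what a truncated response might look like (the keys and numbers here are purely illustrative, not from your data):
"aggregations": {
  "uniqcnpjs": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 17500,
    "buckets": [
      { "key": 1001, "doc_count": 310 },
      { "key": 1002, "doc_count": 295 }
    ]
  }
}
A large sum_other_doc_count relative to the bucket totals would confirm that the bucket list, not the underlying data, is being truncated.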

Related

Query returns a result with smaller size than intended in Elasticsearch

I am using the REST API to query results from Elasticsearch.
Below is the query:
GET /..../_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "#timestamp": {
              "time_zone": "+09:00",
              "gte": "2023-01-24T00:00:00.000Z",
              "lt": "2023-01-24T03:03:00.000Z"
            }
          }
        },
        {
          "term": {
            "serviceid.keyword": {
              "value": "430011397"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "by_day": {
      "auto_date_histogram": {
        "field": "#timestamp",
        "minimum_interval": "minute"
      },
      "aggs": {
        "agg-type": {
          "terms": {
            "field": "nxlogtype.keyword",
            "size": 100000
          },
          "aggs": {
            "my-sub-agg-name": {
              "avg": {
                "field": "size"
              }
            }
          }
        }
      }
    }
  }
}
As you can see, I specified a time range of about three hours in the gte and lt fields.
However, the result returns only 6 buckets, each covering a 30-minute interval.
I expected many buckets at one-minute intervals over the time range I specified, but the result stays the same even when I extend the range.
Since I am quite new to Elasticsearch, I am not familiar with the query usage.
How can I resolve my issue?
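If I read the docs right, auto_date_histogram does not promise the smallest allowed interval: it targets a number of buckets (the buckets parameter, which defaults to 10) and picks whatever interval fits that budget, with minimum_interval acting only as a lower bound. For a three-hour range and a budget of about 10 buckets, 30-minute intervals are exactly what it would choose. Assuming you really want fixed one-minute buckets, a plain date_histogram should do it; a sketch (untested; "by_minute" is just a placeholder name, and fixed_interval requires a reasonably recent Elasticsearch, older versions use interval):
"aggs": {
  "by_minute": {
    "date_histogram": {
      "field": "#timestamp",
      "fixed_interval": "1m"
    }
  }
}
Alternatively, keeping auto_date_histogram and raising its buckets parameter (e.g. to 180 for three hours of minutes) should have a similar effect.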

Compute the "fill rate" of a field in Elasticsearch

I would like to compute the ratio of documents that have a value for a given field in my index.
I managed to count how many documents are missing the field:
GET profiles/_search
{
  "aggs": {
    "profiles_wo_country": {
      "missing": {
        "field": "country"
      }
    }
  },
  "size": 0
}
I also managed to count how many documents have the field:
GET profiles/_search
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "exists": {
          "field": "country"
        }
      }
    }
  },
  "size": 0
}
Naturally I can also get the total number of documents in the index. How can I compute the ratio?
An easy way to get the numbers you need is to use the following query:
POST profiles/_search?filter_path=hits.total,aggregations.existing.doc_count
{
  "size": 0,
  "aggs": {
    "existing": {
      "filter": {
        "exists": {
          "field": "tag"
        }
      }
    }
  }
}
You'll get a response like this one:
{
  "hits": {
    "total": 37258601
  },
  "aggregations": {
    "existing": {
      "doc_count": 9287160
    }
  }
}
And then in your client code, you can simply do
fill_rate = (aggregations.existing.doc_count / hits.total) * 100
And you're good to go.
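If you'd rather have Elasticsearch return the ratio directly, the average of a 0/1 indicator over all documents is exactly the fill rate. A sketch using a scripted avg aggregation (untested; assumes a recent version with Painless scripting and that country has doc values, e.g. a keyword field; "country_fill_rate" is just a placeholder name):
GET profiles/_search
{
  "size": 0,
  "aggs": {
    "country_fill_rate": {
      "avg": {
        "script": {
          "source": "doc['country'].size() > 0 ? 1 : 0"
        }
      }
    }
  }
}
The returned average is the fraction of documents with a value; multiply by 100 for a percentage.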

How to use pagination (size and from) in an Elasticsearch aggregation?

How do I use pagination (size and from) in an Elasticsearch aggregation? When I used size and from in the aggregation, it threw an exception.
I want a query like this:
GET /index/nameorder/_search
{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "projectId": "10057"
              }
            }
          ]
        }
      },
      "filter": {
        "range": {
          "purchasedDate": {
            "from": "2012-02-05T00:00:00",
            "to": "2015-02-11T23:59:59"
          }
        }
      }
    }
  },
  "aggs": {
    "group_by_a": {
      "terms": {
        "field": "promocode",
        "size": 40,
        "from": 40
      },
      "aggs": {
        "TotalPrice": {
          "sum": {
            "field": "subtotalPrice"
          }
        }
      }
    }
  }
}
As of now, this feature is not supported.
There is an open issue about it, but it's still being discussed.
Issue - https://github.com/elasticsearch/elasticsearch/issues/4915
In order to implement pagination on top of an aggregation in Elasticsearch, you need to do the following (see the sketch below):
Define the size of each batch.
Run a cardinality aggregation to count the distinct terms.
Then, based on the cardinality, define the number of partitions = (cardinality count / size); this size must be smaller than the fetch size.
Now you can iterate over the partitions using a partition filter. Please note the size must be big enough, because the results are not split equally between the buckets.
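The partition filter mentioned above is the include.partition option of the terms aggregation, available in more recent Elasticsearch versions. A sketch, assuming a cardinality of roughly 800 promo codes split into 20 partitions of at most 40 terms each; you would rerun the request with partition 0, 1, ..., 19 (the partition counts here are illustrative):
GET /index/nameorder/_search
{
  "size": 0,
  "aggs": {
    "group_by_a": {
      "terms": {
        "field": "promocode",
        "include": {
          "partition": 0,
          "num_partitions": 20
        },
        "size": 40
      },
      "aggs": {
        "TotalPrice": {
          "sum": {
            "field": "subtotalPrice"
          }
        }
      }
    }
  }
}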

Elasticsearch significant terms minimum

I've got something like this:
GET index_*/_search?search_type=count
{
  "aggs": {
    "products": {
      "terms": {
        "field": "products_id",
        "size": 100
      },
      "aggs": {
        "significant_products": {
          "significant_terms": {
            "field": "also_purchased_id",
            "size": 40
          }
        }
      }
    }
  }
}
And I want significant_terms to give me more results. It sometimes returns only 10 even when the doc_count says 400. If I add "min_doc_count": 10 to significant_terms, it just does weird things: some keys give me no results at all and some only 3 or 4. How can I fix that?
Thanks!
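One knob that may help here, if I understand significant_terms correctly: candidate terms are collected per shard, so a low shard_size can starve the final result even when size is large. Raising shard_size (and keeping min_doc_count modest) is worth a try; a sketch, untested, with illustrative values:
"aggs": {
  "significant_products": {
    "significant_terms": {
      "field": "also_purchased_id",
      "size": 40,
      "shard_size": 400,
      "min_doc_count": 3
    }
  }
}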

ElasticSearch filtering by field1 THEN field2 THEN take max of field3

I am struggling to get the information that I need from ElasticSearch.
My log statements are like this:
field1: Example
field2: Example2
field3: Example3
I would like to search a timeframe (the last 24 hours) to find all data that has this in field1 and that in field2.
There may then be multiple this.that.[field3] entries, so I want to return only the maximum of that field.
In fact, in my data, field3 is actually the key of the entry.
What is the best way to retrieve the information I need? I have managed to get results using aggs, but the data comes back in buckets, and I am only interested in the data with the maximum value of field3.
I have added an example of the query that I am looking to do: https://jsonblob.com/54535d49e4b0d117eeaf6bb4
{
  "size": 0,
  "aggs": {
    "agg_129": {
      "filters": {
        "filters": {
          "CarName: Toyota": {
            "query": {
              "query_string": {
                "query": "CarName: Toyota"
              }
            }
          }
        }
      },
      "aggs": {
        "agg_130": {
          "filters": {
            "filters": {
              "Attribute: TimeUsed": {
                "query": {
                  "query_string": {
                    "query": "Attribute: TimeUsed"
                  }
                }
              }
            }
          },
          "aggs": {
            "agg_131": {
              "terms": {
                "field": "#timestamp",
                "size": 0,
                "order": {
                  "_count": "desc"
                }
              }
            }
          }
        }
      }
    }
  },
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "#timestamp": {
                  "gte": "2014-10-27T00:00:00.000Z",
                  "lte": "2014-10-28T23:59:59.999Z"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  }
}
So, the example above shows only the entries that have CarName = Toyota and Attribute = TimeUsed.
My data is as follows:
there are x cars (CarName), each car has y Attributes, and each of those Attributes has a document with a timestamp.
To begin with, I was looking for a query for CarName.Attribute.timestamp (latest); however, if I can use just ONE query to get the latest timestamp for EVERY Attribute of EVERY CarName, that would cut the query calls from ~50 down to one.
If you are using Elasticsearch v1.3+, you can add a top_hits aggregation with the parameter size: 1 and a descending sort on the field3 value.
This will return the whole document with maximum value on the field, as you wish.
This example in the documentation might do the trick.
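For reference, a sketch of that shape (untested; it sorts on #timestamp, which is field3 in your data, and "latest_entry" is just a placeholder name):
"aggs": {
  "latest_entry": {
    "top_hits": {
      "size": 1,
      "sort": [
        { "#timestamp": { "order": "desc" } }
      ]
    }
  }
}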
Edit:
Ok, it seems you don't need the whole document, but only the maximum timestamp value. You can use a max aggregation instead of a top_hits one.
The following query (not tested) should give you the maximum timestamp value for each of the top 10 Attribute values of each of the top 10 CarName values, in only one request.
A terms aggregation is like a GROUP BY clause, and you should not have to query 50 times to retrieve the values of each CarName/Attribute combination: this is the point of nesting a terms aggregation for Attribute inside the CarName aggregation.
Note that, to work properly, the CarName and Attribute fields should be not_analyzed. If it's not the case, you will have "funny" results in your buckets. The problem (and possible solution) is very well described here.
Feel free to change the size parameter of the terms aggregation to fit to your case.
{
  "size": 0,
  "aggs": {
    "by_carnames": {
      "terms": {
        "field": "CarName",
        "size": 10
      },
      "aggs": {
        "by_attribute": {
          "terms": {
            "field": "Attribute",
            "size": 10
          },
          "aggs": {
            "max_timestamp": {
              "max": {
                "field": "#timestamp"
              }
            }
          }
        }
      }
    }
  },
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "#timestamp": {
                  "gte": "2014-10-27T00:00:00.000Z",
                  "lte": "2014-10-28T23:59:59.999Z"
                }
              }
            }
          ]
        }
      }
    }
  }
}