Elasticsearch: accuracy on a filter aggregation

I'm fairly new to Elasticsearch (using version 2.2).
To simplify my question: I have documents with a field named termination, which can sometimes take the value transfer.
I currently run this request to aggregate, by month, the number of documents that have that termination:
{
  "size": 0,
  "sort": [{
    "#timestamp": {
      "order": "desc",
      "unmapped_type": "boolean"
    }
  }],
  "query": { "match_all": {} },
  "aggs": {
    "report": {
      "date_histogram": {
        "field": "#timestamp",
        "interval": "month",
        "min_doc_count": 0
      },
      "aggs": {
        "documents_with_termination_transfer": {
          "filter": {
            "term": {
              "termination": "transfer"
            }
          }
        }
      }
    }
  }
}
Here is the response:
{
  "_shards": {
    "failed": 0,
    "successful": 206,
    "total": 206
  },
  "aggregations": {
    "report": {
      "buckets": [
        {
          "documents_with_termination_transfer": {
            "doc_count": 209163
          },
          "doc_count": 278100,
          "key": 1451606400000,
          "key_as_string": "2016-01-01T00:00:00.000Z"
        },
        {
          "documents_with_termination_transfer": {
            "doc_count": 107244
          },
          "doc_count": 136597,
          "key": 1454284800000,
          "key_as_string": "2016-02-01T00:00:00.000Z"
        }
      ]
    }
  },
  "hits": {
    "hits": [],
    "max_score": 0.0,
    "total": 414699
  },
  "timed_out": false,
  "took": 90
}
Why is the number of hits (414699) greater than the sum of the bucket document counts (278100 + 136597 = 414697)? I had read about accuracy problems, but they didn't seem to apply in the case of filter aggregations...
Is there also an accuracy problem if I sum the numbers of documents with transfer termination?

My guess is that some documents are missing the #timestamp field; such documents never land in any date_histogram bucket.
You could verify this by running an exists query on that field.
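A minimal sketch of that check (the index name is a placeholder); hits.total gives the number of documents missing the field:
GET /my_index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "#timestamp"
        }
      }
    }
  }
}
If this returns 2, those documents account for the gap between 414699 and 414697.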


Elasticsearch aggregation shows incorrect total

Elasticsearch version is 7.4.2
I suck at Elasticsearch and I'm trying to figure out what's wrong with this query.
{
  "size": 10,
  "from": 0,
  "query": {
    "bool": {
      "must": [
        {
          "exists": {
            "field": "firstName"
          }
        },
        {
          "query_string": {
            "query": "*",
            "fields": [
              "params.display",
              "params.description",
              "params.name",
              "lastName"
            ]
          }
        },
        {
          "match": {
            "status": "DONE"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "success": true
          }
        }
      ]
    }
  },
  "sort": {
    "createDate": "desc"
  },
  "collapse": {
    "field": "lastName.keyword",
    "inner_hits": {
      "name": "lastChange",
      "size": 1,
      "sort": [
        {
          "createDate": "desc"
        }
      ]
    }
  },
  "aggs": {
    "total": {
      "cardinality": {
        "field": "lastName.keyword"
      }
    }
  }
}
It returns:
"aggregations": {
"total": {
"value": 429896
}
}
So ~430k results, but pagination stops returning results around the 426k mark. Meaning, when I run the query with
{
  "size": 10,
  "from": 427000,
  ...
}
I get:
{
  "took": 2215,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "total": {
      "value": 429896
    }
  }
}
But if I change from to 426000, I still get results.
You are comparing the cardinality aggregation value of your field lastName.keyword to the total number of documents in the index, which are two different things.
You can check the total number of documents in your index using the count API. The from/size you define at query level applies to the documents matching your search query, and as you don't set track_total_hits, the response shows 10000 with relation gte, meaning there are more than 10,000 documents matching your search query.
When it comes to your aggregation, in both cases it returns 429896, because the aggregation does not depend on the from/size you specify for your query.
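If you want the exact hit count, a minimal sketch is to add track_total_hits to the request (everything else stays as in the query above):
{
  "size": 10,
  "from": 0,
  "track_total_hits": true,
  ...
}
With that flag set, hits.total.relation becomes eq and hits.total.value reports the exact number of matching documents instead of capping at 10,000.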
I was surprised when I found out that the cardinality aggregation has precision control (the precision_threshold parameter).
Setting it to the maximum value was the solution for me.
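For reference, a sketch of the aggregation with precision control; 40000 is the maximum precision_threshold Elasticsearch accepts, and counts stay close to exact up to that many distinct values:
"aggs": {
  "total": {
    "cardinality": {
      "field": "lastName.keyword",
      "precision_threshold": 40000
    }
  }
}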

How to get the number of documents for each occurrence in Elastic?

I have an Elastic index (say file) where I append a document every time a file is downloaded by a client.
Each document is quite basic: it contains a filename field and a when date field indicating the time of the download.
What I want to achieve is to get, for each file, the number of times it has been downloaded in the last 3 months.
For the moment, the closest I get is with this query:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads": {
      "terms": {
        "field": "filename.keyword"
      }
    }
  }
}
The result is something like this:
{
  "took": 793,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "file",
        "_type": "_doc",
        "_id": "8DkTFHQB3kG435svAA3O",
        "_score": 1.0,
        "_source": {
          "filename": "taz",
          "id": 24009,
          "when": "2020-08-21T08:11:54.943Z"
        }
      },
      ...
    ]
  },
  "aggregations": {
    "downloads": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 418486,
      "buckets": [
        {
          "key": "file1",
          "doc_count": 313873
        },
        {
          "key": "file2",
          "doc_count": 281504
        },
        ...,
        {
          "key": "file10",
          "doc_count": 10662
        }
      ]
    }
  }
}
So I am quite interested in aggregations.downloads.buckets, but it is limited to 10 results.
What do I need to change in my query to get the full list (in my case, I will have ~15,000 different files)?
Thanks.
The size of a terms aggregation defaults to 10 buckets. If you want to increase it, go with
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads": {
      "terms": {
        "field": "filename.keyword",
        "size": 15000     <-------
      }
    }
  }
}
Note that there are strategies to paginate those buckets using a composite aggregation.
Also note that as your index grows, you may hit the search.max_buckets limit (10,000 by default) as well. It's a dynamic cluster-wide setting, so it can be changed.
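A sketch of the composite variant, under the same range filter (the page size of 1000 is arbitrary); each response includes an after_key that you pass back as after in the next request to fetch the following page:
{
  "size": 0,
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads": {
      "composite": {
        "size": 1000,
        "sources": [
          { "file": { "terms": { "field": "filename.keyword" } } }
        ]
      }
    }
  }
}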

elasticsearch date_histogram offset and extended_bounds don't work together?

I'm doing a date_histogram with a week interval. I need my weeks to start on Sundays rather than Mondays, and I need the result to include weeks in which there are no docs (empty records).
To get that, I use offset = -1d to shift the start to Sunday, and extended_bounds to get the empty records.
Elasticsearch nicely figures out the first day of the interval, so if I supply a start date that's, say, a Wednesday, I get a record for the week starting the previous Sunday.
The problem is, if I set offset = -1d, I get an extra week. My hypothesis is that it calculates the first day of the interval without taking the offset into account.
In the example shown, I would not expect to get the 2017-09-24 record:
query:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "utility.utility_uuid.orig": "17245998142979832061"
          }
        },
        {
          "range": {
            "user.date_created": {
              "gte": "2017-10-01",
              "lt": "2017-10-31"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "eow_accounts_and_users": {
      "date_histogram": {
        "format": "yyyy-MM-dd",
        "interval": "week",
        "offset": "-1d",
        "time_zone": "US/Pacific",
        "field": "user.date_created",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2017-10-01",
          "max": "2017-10-31"
        }
      }
    }
  }
}
result:
{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "eow_accounts_and_users": {
      "buckets": [
        {
          "key_as_string": "2017-09-24",
          "key": 1506236400000,
          "doc_count": 0
        },
        {
          "key_as_string": "2017-10-01",
          "key": 1506841200000,
          "doc_count": 0
        },
        {
          "key_as_string": "2017-10-08",
          "key": 1507446000000,
          "doc_count": 0
        },
        {
          "key_as_string": "2017-10-15",
          "key": 1508050800000,
          "doc_count": 0
        },
        {
          "key_as_string": "2017-10-22",
          "key": 1508655600000,
          "doc_count": 0
        },
        {
          "key_as_string": "2017-10-29",
          "key": 1509260400000,
          "doc_count": 0
        }
      ]
    }
  }
}
Add an extra day to the extended bounds:
"extended_bounds": {
  "min": dateParams.startTime + 86400000, // an extra day
  "max": dateParams.endTime
}
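Applied to the query above, that would be the following sketch. Consistent with the hypothesis in the question: 2017-10-01 is a Sunday, so Elasticsearch first rounds it down to the Monday week boundary 2017-09-25 and only then applies the -1d offset, landing on 2017-09-24; 2017-10-02 is a Monday, which rounds to itself and offsets to 2017-10-01, so the extra bucket disappears:
"extended_bounds": {
  "min": "2017-10-02",
  "max": "2017-10-31"
}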

Boosting elastic aggregation result

I have an elastic index for products. Each product has a Brand attribute, and I "have to" create an aggregation that returns the Brands of the products.
My Sample Query:
GET /products/product/_search
{
  "size": 0,
  "aggs": {
    "myFancyFilter": {
      "filter": {
        "match_all": {}
      },
      "aggs": {
        "inner": {
          "terms": {
            "field": "Brand",
            "size": 3
          }
        }
      }
    }
  },
  "query": {
    "match_all": {}
  }
}
And the result:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 236952,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "myFancyFilter": {
      "doc_count": 236952,
      "inner": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 139267,
        "buckets": [
          {
            "key": "Brand1",
            "doc_count": 3144
          },
          {
            "key": "Brand2",
            "doc_count": 1759
          },
          {
            "key": "Brand3",
            "doc_count": 1737
          }
        ]
      }
    }
  }
}
It works perfectly for me. Elastic sorts buckets according to doc_count; however, I would like to manipulate the bucket order in the result. For example, assume that I have Brand5 and I want to promote it to position #2, so the result comes in the order Brand1, Brand5, and Brand3.
If this were in a query rather than an aggregation, I could use function_score, but here I don't have an idea. Any clues?
What you are looking for is a custom sort order applied to the aggregation in Elasticsearch. I've been able to come up with a solution by renaming the aggregation terms in the following manner:
Brand1 to a_Brand1
Brand5 to b_Brand5
Brand3 to c_Brand3
and then sorting on the terms so that the sort happens lexicographically.
Of course, this may not be the exact or the best solution, but I feel it can help.
Below is the query that I've used. Please note that my field name is brand, it is a multi-field, and I'm using brand.keyword.
POST testdataindex/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "myFancyFilter": {
      "filter": {
        "match_all": {}
      },
      "aggs": {
        "inner": {
          "terms": {
            "script": {
              "lang": "painless",
              "inline": "if (params.newNames.containsKey(doc['brand.keyword'].value)) { return params.newNames[doc['brand.keyword'].value]; } return null;",
              "params": {
                "newNames": {
                  "Brand1": "a_Brand1",
                  "Brand5": "b_Brand5",
                  "Brand3": "c_Brand3"
                }
              }
            },
            "order": {
              "_term": "asc"
            }
          }
        }
      }
    }
  }
}
I've created sample data with brand names Brand1, Brand3, and Brand5; below is how the results appear. Note the change in the term names.
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 8,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "myFancyFilter": {
      "doc_count": 8,
      "inner": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "a_Brand1",
            "doc_count": 2
          },
          {
            "key": "b_Brand5",
            "doc_count": 4
          },
          {
            "key": "c_Brand3",
            "doc_count": 2
          }
        ]
      }
    }
  }
}
Hope it helps!

ElasticSearch count multiple fields grouped by

I have documents like
{"domain":"US", "zipcode":"11111", "eventType":"click", "id":"1", "time":100}
{"domain":"US", "zipcode":"22222", "eventType":"sell", "id":"2", "time":200}
{"domain":"US", "zipcode":"22222", "eventType":"click", "id":"3","time":150}
{"domain":"US", "zipcode":"11111", "eventType":"sell", "id":"4","time":350}
{"domain":"US", "zipcode":"33333", "eventType":"sell", "id":"5","time":225}
{"domain":"EU", "zipcode":"44444", "eventType":"click", "id":"5","time":120}
I want to filter these documents by eventType=sell and time between 125 and 400, group by domain followed by zipcode, and count the documents in each bucket. So my output would be like this (the first and last docs are ignored by the filters):
US, 11111, 1
US, 22222, 1
US, 33333, 1
In SQL, this would be straightforward, but I am not able to get it to work in Elasticsearch. Could someone please help me out here?
How do I write the Elasticsearch query to accomplish the above?
This query seems to do what you want:
POST /test_index/_search
{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "eventType": "sell"
              }
            },
            {
              "range": {
                "time": {
                  "gte": 125,
                  "lte": 400
                }
              }
            }
          ]
        }
      }
    }
  },
  "aggs": {
    "zipcode_terms": {
      "terms": {
        "field": "zipcode"
      }
    }
  }
}
returning
{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "zipcode_terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "11111",
          "doc_count": 1
        },
        {
          "key": "22222",
          "doc_count": 1
        },
        {
          "key": "33333",
          "doc_count": 1
        }
      ]
    }
  }
}
(Note that there is only 1 "sell" at "22222", not 2.)
Here is some code I used to test it:
http://sense.qbox.io/gist/1c4cb591ab72a6f3ae681df30fe023ddfca4225b
You might want to take a look at terms aggregations, the bool filter, and range filters.
EDIT: I just realized I left out the domain part, but it should be straightforward to add a bucket aggregation on that as well if you need to, as sketched below.
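A sketch of that nesting, using the field names from the sample documents (depending on your mapping, the string fields may need a .keyword suffix):
"aggs": {
  "domain_terms": {
    "terms": {
      "field": "domain"
    },
    "aggs": {
      "zipcode_terms": {
        "terms": {
          "field": "zipcode"
        }
      }
    }
  }
}
Each domain bucket then carries its own zipcode_terms sub-buckets, yielding the domain/zipcode/count triples from the expected output.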
