Elasticsearch aggregation shows incorrect total

Elasticsearch version is 7.4.2
I suck at Elasticsearch and I'm trying to figure out what's wrong with this query.
{
"size": 10,
"from": 0,
"query": {
"bool": {
"must": [
{
"exists": {
"field": "firstName"
}
},
{
"query_string": {
"query": "*",
"fields": [
"params.display",
"params.description",
"params.name",
"lastName"
]
}
},
{
"match": {
"status": "DONE"
}
}
],
"filter": [
{
"term": {
"success": true
}
}
]
}
},
"sort": {
"createDate": "desc"
},
"collapse": {
"field": "lastName.keyword",
"inner_hits": {
"name": "lastChange",
"size": 1,
"sort": [
{
"createDate": "desc"
}
]
}
},
"aggs": {
"total": {
"cardinality": {
"field": "lastName.keyword"
}
}
}
}
It returns:
"aggregations": {
"total": {
"value": 429896
}
}
So ~430k results, but in pagination we stop getting results around the 426k mark. Meaning, when I run the query with
{
"size": 10,
"from": 427000,
...
}
I get:
{
"took": 2215,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"total": {
"value": 429896
}
}
}
But if I change from to be 426000 I still get results.

You are comparing the cardinality aggregation value of your lastName.keyword field with the total number of documents in the index, which are two different things.
You can check the total number of documents in your index with the count API. The from/size you set apply at the query level, i.e. they page through the documents matching your search query. Because you have not set track_total_hits, the total shows 10000 with relation gte, which means 10,000 or more documents match your search query.
As for your aggregation, it returns 429896 in both cases because the aggregation does not depend on the from/size you set for the query.
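To make the track_total_hits point concrete, here is a minimal sketch in Python that only builds the request body (no client library is involved). The helper name with_exact_total is made up for this example, and the query is a trimmed copy of the one in the question. Note that with collapse, hits.total still counts documents before collapsing, so even an exact total is not expected to equal the number of distinct lastName.keyword values.

```python
import json


def with_exact_total(search_body):
    """Return a copy of a search body that asks for an exact hit count."""
    body = dict(search_body)         # shallow copy; the caller's dict is not modified
    body["track_total_hits"] = True  # count every match instead of stopping at 10,000
    return body


# Trimmed version of the request from the question (assumption: the other
# clauses do not matter for this illustration).
question_body = {
    "size": 10,
    "from": 0,
    "query": {"bool": {"filter": [{"term": {"success": True}}]}},
    "collapse": {"field": "lastName.keyword"},
}

exact_body = with_exact_total(question_body)
print(json.dumps(exact_body, indent=2))
```

With this flag set, the response should report the total with relation eq instead of gte.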

I was surprised to find out that the cardinality aggregation has a precision control (the precision_threshold parameter).
Setting it to the maximum value was the solution for me.
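For reference, a small sketch of what that looks like, written as a Python helper that builds the aggregation body. The helper name cardinality_agg is hypothetical. The maximum supported precision_threshold is 40000 (the default is 3000), and larger values behave like 40000, which the helper mirrors by clamping. Keep in mind that with roughly 430k distinct values the result is still an estimate, since counts are only expected to be near exact below the threshold.

```python
import json

MAX_PRECISION_THRESHOLD = 40000  # highest value the cardinality aggregation honours


def cardinality_agg(field, precision_threshold=MAX_PRECISION_THRESHOLD):
    """Build a cardinality aggregation with an explicit precision_threshold."""
    if precision_threshold < 0:
        raise ValueError("precision_threshold must not be negative")
    # Values above the maximum have the same effect as the maximum, so clamp.
    threshold = min(precision_threshold, MAX_PRECISION_THRESHOLD)
    return {"cardinality": {"field": field, "precision_threshold": threshold}}


# Same aggregation name and field as in the question.
aggs = {"total": cardinality_agg("lastName.keyword")}
print(json.dumps({"aggs": aggs}, indent=2))
```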

Related

How can we get the minimum and maximum dates of the data in each index in Elasticsearch?

How can we get the minimum and maximum dates of the data in each index in Elasticsearch?
I was going through the documentation but could not work out which API returns the min and max dates of the data in each index.
TL;DR
I believe you could address this with Elasticsearch's aggregation feature.
You could use the min and max metric aggregations.
Solution
Using the search API, this would look like:
GET <The Index>/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"min": {
"min": {
"field": "<The date>"
}
},
"max":{
"max": {
"field": "<The date>"
}
}
}
}
The result should look like so:
{
"took": 0,
"timed_out": false,
"_shards": {
...
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"min": {
"value": 1660521603764,
"value_as_string": "2022-08-15T00:00:03.764Z"
},
"max": {
"value": 1665431791459,
"value_as_string": "2022-10-10T19:56:31.459Z"
}
}
}
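Since the question asks for the dates per index, one possible extension is to search several indices at once and bucket by the built-in _index metadata field, with the min and max aggregations nested inside each bucket. The sketch below builds such a body in Python and reads the result back. The function names, the aggregation names (per_index, min_date, max_date) and the index name in the sample response are all made up for illustration, and the sample response is hand-written, not real output.

```python
def min_max_per_index_body(date_field, max_indices=100):
    """Build a search body that reports min/max of date_field for each index."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "aggs": {
            "per_index": {
                # _index is a metadata field holding the name of the index
                "terms": {"field": "_index", "size": max_indices},
                "aggs": {
                    "min_date": {"min": {"field": date_field}},
                    "max_date": {"max": {"field": date_field}},
                },
            }
        },
    }


def date_ranges(response):
    """Map each index name to its (min, max) date strings."""
    buckets = response["aggregations"]["per_index"]["buckets"]
    return {
        b["key"]: (b["min_date"]["value_as_string"], b["max_date"]["value_as_string"])
        for b in buckets
    }


# Hand-written fragment for illustration: the index name is hypothetical and
# the dates are copied from the example result above.
sample_response = {
    "aggregations": {
        "per_index": {
            "buckets": [
                {
                    "key": "my-index",
                    "doc_count": 3,
                    "min_date": {"value_as_string": "2022-08-15T00:00:03.764Z"},
                    "max_date": {"value_as_string": "2022-10-10T19:56:31.459Z"},
                }
            ]
        }
    }
}

body = min_max_per_index_body("<The date>")
ranges = date_ranges(sample_response)
print(ranges)
```

The body would be sent to a multi-index search such as GET index1,index2/_search; max_indices should be at least the number of indices being searched, otherwise some buckets are dropped.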

Elasticsearch multiple indices suggestions

I have the following problem. This is my implementation of a "did you mean" query. If I use only one index, the results fit perfectly. If I use multiple indices, I don't get any results.
Does this query only work for single indices?
GET index1/_search
{
"suggest": {
"text": "exmple",
"multi_phrase": {
"phrase": {
"field": "all",
"size": 5,
"gram_size": 3,
"collate": {
"query": {
"source": {
"bool": {
"must": [
{
"match_all": {}
}
],
"filter": {
"multi_match": {
"query": "{{suggestion}}",
"type": "cross_fields",
"fields": [
"name",
"name2"
],
"operator": "AND",
"lenient": true
}
}
}
}
},
"params": {
"field_name": "all"
}
}
}
}
}
}
If I run this query against a single index, everything works fine. If I use multiple indices, the results are empty.
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": 0,
"hits": []
},
"suggest": {
"multi_phrase": [
{
"text": "example",
"offset": 0,
"length": 9,
"options": []
}
]
}
}
I found the solution on my own: I had to set the confidence parameter. From the phrase suggester documentation:
The confidence level defines a factor applied to the input phrases score which is used as a threshold for other suggest candidates. Only candidates that score higher than the threshold will be included in the result. For instance a confidence level of 1.0 will only return suggestions that score higher than the input phrase. If set to 0.0 the top N candidates are returned. The default is 1.0.
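As a rough sketch of where the parameter goes, the Python helper below builds a phrase suggester body with confidence set explicitly. The helper name phrase_suggest_body is hypothetical, the collate section from the question is left out for brevity, and 0.0 is just one possible value: it returns the top N candidates regardless of how the input phrase scores, so a value between 0.0 and 1.0 may suit a real "did you mean" feature better.

```python
import json


def phrase_suggest_body(text, field, confidence=0.0, size=5, gram_size=3):
    """Build a phrase suggester request with an explicit confidence level."""
    return {
        "suggest": {
            "text": text,
            "multi_phrase": {
                "phrase": {
                    "field": field,
                    "size": size,
                    "gram_size": gram_size,
                    # Threshold factor applied to the input phrase score;
                    # 0.0 means the top `size` candidates are returned.
                    "confidence": confidence,
                }
            },
        }
    }


# Same text and field as in the question.
suggest_body = phrase_suggest_body("exmple", "all")
print(json.dumps(suggest_body, indent=2))
```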

Elasticsearch: accuracy on a filter aggregation

I'm fairly new to Elasticsearch (using version 2.2).
To simplify my question, I have documents that have a field named termination, which can sometimes take the value transfer.
I currently run this request to count, per month, the documents that have that termination:
{
"size": 0,
"sort": [{
"#timestamp": {
"order": "desc",
"unmapped_type": "boolean"
}
}],
"query": { "match_all": {} },
"aggs": {
"report": {
"date_histogram": {
"field": "#timestamp",
"interval": "month",
"min_doc_count": 0
},
"aggs": {
"documents_with_termination_transfer": {
"filter": {
"term": {
"termination": "transfer"
}
}
}
}
}
}
}
Here is the response:
{
"_shards": {
"failed": 0,
"successful": 206,
"total": 206
},
"aggregations": {
"report": {
"buckets": [
{
"calls_with_termination_transfer": {
"doc_count": 209163
},
"doc_count": 278100,
"key": 1451606400000,
"key_as_string": "2016-01-01T00:00:00.000Z"
},
{
"calls_with_termination_transfer": {
"doc_count": 107244
},
"doc_count": 136597,
"key": 1454284800000,
"key_as_string": "2016-02-01T00:00:00.000Z"
}
]
}
},
"hits": {
"hits": [],
"max_score": 0.0,
"total": 414699
},
"timed_out": false,
"took": 90
}
Why is the number of hits (414699) greater than the total number of document counts (278100 + 136597 = 414697)? I had read about accuracy problems but it didn't seem to apply in the case of filters...
Is there also an accuracy problem if I sum the numbers of documents with the transfer termination?
My guess is that some documents are missing the #timestamp field, so they do not fall into any date_histogram bucket.
You could verify this by running an exists query on this field.
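A minimal sketch of that check, written as Python that builds the request body: wrapping the exists query in a bool must_not matches the documents that lack the field, and with size 0 the hits total is the number of such documents. The helper name missing_field_body is made up, and the field name is passed in exactly as it appears in the question. The arithmetic at the end uses the numbers from the response above.

```python
import json


def missing_field_body(field):
    """Build a search body that matches documents without a value for `field`."""
    return {
        "size": 0,  # only the hit count matters here
        "query": {"bool": {"must_not": [{"exists": {"field": field}}]}},
    }


check_body = missing_field_body("#timestamp")
print(json.dumps(check_body, indent=2))

# Numbers taken from the response in the question.
hits_total = 414699
bucket_counts = [278100, 136597]
unbucketed = hits_total - sum(bucket_counts)
print(unbucketed)  # prints 2
```

If the guess is right, running check_body should report a total of 2, matching the gap between the hits and the bucket counts.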

Elasticsearch: find total hits for a date

I have a requirement to find the total number of records in my user table for a particular date. I am able to find the total hits, but I cannot find a query that fetches the data for a particular date.
Query
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"daily_team": {
"filter": {
"range": {
"created_date": {
"from": "2015-01-02",
"to": "2015-01-02"
}
}
}
}
}
}
Result
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 33,
"max_score": 0,
"hits": []
},
"aggregations": {
"daily_team": {
"doc_count": 1
}
}
}
Here "total": 33, but that is the total number of records in my database. I have only 22 records from the "starting date" to "2015-01-02". Could you please help me find a query for this? Thanks
I found a solution: I just removed the "from" parameter from the range, and now I can get the count from doc_count.
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"daily_team": {
"filter": {
"range": {
"created_date": {
"to": "2015-01-02"
}
}
}
}
}
}
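The fix above counts everything up to the given date. If the goal is instead the records of one calendar day, one option is a half-open range from that day up to, but not including, the next day, which avoids depending on how a date without a time part is rounded. The sketch below builds both variants in Python. The helper names are hypothetical, and it assumes created_date accepts yyyy-MM-dd strings, as it does in the question.

```python
import json
from datetime import date, timedelta


def single_day_range(field, day):
    """Range clause covering exactly one calendar day: [day, day + 1)."""
    next_day = day + timedelta(days=1)
    return {"range": {field: {"gte": day.isoformat(), "lt": next_day.isoformat()}}}


def up_to_day_range(field, day):
    """Range clause covering everything up to and including `day`."""
    next_day = day + timedelta(days=1)
    return {"range": {field: {"lt": next_day.isoformat()}}}


def count_body(range_clause, agg_name="daily_team"):
    """Wrap a range clause in a filter aggregation; doc_count holds the count."""
    return {
        "size": 0,
        "query": {"match_all": {}},
        "aggs": {agg_name: {"filter": range_clause}},
    }


day = date(2015, 1, 2)
one_day_body = count_body(single_day_range("created_date", day))
up_to_body = count_body(up_to_day_range("created_date", day))
print(json.dumps(one_day_body, indent=2))
```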

Elasticsearch Cardinality Aggregation giving completely wrong results

I am saving each page view of a website in an ES index, where each page is recognized by an entity_id.
I need to get the total count of unique page views since a given point in time.
I have the following mapping:
{
"my_index": {
"mappings": {
"page_views": {
"_all": {
"enabled": true
},
"properties": {
"created": {
"type": "long"
},
"entity_id": {
"type": "integer"
}
}
}
}
}
}
According to the Elasticsearch docs, the way to do that is using a cardinality aggregation.
Here is my search request:
GET my_index/page_views/_search
{
"filter": {
"bool": {
"must": [
[
{
"range": {
"created": {
"gte": 9999999999
}
}
}
]
]
}
},
"aggs": {
"distinct_entities": {
"cardinality": {
"field": "entity_id",
"precision_threshold": 100
}
}
}
}
Note that I have used a timestamp in the future, so no results are returned.
And the result I'm getting is:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
},
"aggregations": {
"distinct_entities": {
"value": 116
}
}
}
I don't understand how the unique page visits could be 116, given that there are no page visits at all for the search query. What am I doing wrong?
Your aggregation is returning the global value for the cardinality. If you want it to return only the cardinality of the filtered set, one way you could do that is to use a filter aggregation, then nest your cardinality aggregation inside that. Leaving out the filtered query for clarity (you can add it back in easily enough), the query I tried looks like:
curl -XPOST "http://localhost:9200/my_index/page_views/_search" -d'
{
"size": 0,
"aggs": {
"filtered_entities": {
"filter": {
"bool": {
"must": [
[
{
"range": {
"created": {
"gte": 9999999999
}
}
}
]
]
}
},
"aggs": {
"distinct_entities": {
"cardinality": {
"field": "entity_id",
"precision_threshold": 100
}
}
}
}
}
}'
which returns:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0,
"hits": []
},
"aggregations": {
"filtered_entities": {
"doc_count": 0,
"distinct_entities": {
"value": 0
}
}
}
}
Here is some code you can play with:
http://sense.qbox.io/gist/bd90a74839ca56329e8de28c457190872d19fc1b
I used Elasticsearch 1.3.4, by the way.
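Another way to get the same effect is to put the range into the query itself, because aggregations are computed over the documents matched by the query, while a top-level filter (spelled post_filter on later versions) is only applied to the hits afterwards. The sketch below builds such a body in Python. It assumes a version where bool accepts a filter clause (2.0 or later; on 1.x the filtered query plays the same role), and the helper name distinct_entities_since is made up.

```python
import json


def distinct_entities_since(created_gte, precision_threshold=100):
    """Count distinct entity_id values among documents created at or after a timestamp."""
    return {
        "size": 0,
        # The range lives in the query, so it also limits what the aggregation sees.
        "query": {"bool": {"filter": [{"range": {"created": {"gte": created_gte}}}]}},
        "aggs": {
            "distinct_entities": {
                "cardinality": {
                    "field": "entity_id",
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


# Same future timestamp as in the question; the expected cardinality is then 0.
scoped_body = distinct_entities_since(9999999999)
print(json.dumps(scoped_body, indent=2))
```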
