When I run an aggregation query, what scope is it applied to: all entries in the index, or just the first 10000?
For example, here is a response I got for a script metric aggregation:
{
"took": 76,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"number_of_operations_in_progress": {
"value": 2
}
}
}
hits.total.value is 10000, which makes me think that the aggregation is applied to the first 10000 entries only, not to the whole data set in the index.
Is my understanding correct? If so, is there a way to apply an aggregation to all entries?
Aggregations are always applied to the whole document set selected by the query.
hits.total.value only hints at how many documents match the query; here the relation "gte" means that at least 10,000 documents match.
You can use track_total_hits to control how the total number of hits is tracked. Setting it to true returns the exact count:
POST index1/_search
{
"track_total_hits": true,
"query": {
"match_all": {}
},
"aggs": {
"groupbyk1": {
"terms": {
"field": "k1"
}
}
}
}
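To make the distinction concrete, here is a minimal Python sketch of how a client might read hits.total. The helper name describe_total is hypothetical (it is not part of any Elasticsearch client), and it assumes the response has already been parsed into a dict. It handles both the object form shown above and the plain integer that versions before 7.0 return.

```python
def describe_total(response):
    """Summarise hits.total from a parsed search response (hypothetical helper)."""
    total = response["hits"]["total"]
    # Before 7.0, hits.total is a plain integer, which is always exact.
    if isinstance(total, int):
        return f"exactly {total}"
    # From 7.0 on it is an object; "gte" means the value is only a lower bound.
    if total["relation"] == "gte":
        return f"at least {total['value']}"
    return f"exactly {total['value']}"


# The response from the question: hit counting stopped at the default 10,000 limit.
capped = {"hits": {"total": {"value": 10000, "relation": "gte"}}}
print(describe_total(capped))
```

Either way, the aggregations section is computed over every matching document; only the reported hit count is capped.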
This is my query, sent as an HTTP POST.
URL : http://127.0.0.1:9200/*-2023.02.*/_search?timeout=10ms
Request :
{
"query": {
"bool": {
"must": [
{
"match": {
"event.code": "1"
}
}
]
}
},
"sort": [
{
"#timestamp": {
"order": "asc"
}
}
],
"size": 10000
}
Response :
{
"took": 1557,
"timed_out": false,
"_shards": {
"total": 984,
"successful": 984,
"skipped": 826,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
Why does the query take 1557 ms (took) when I set the timeout to 10ms?
How can I set a timeout so that Elasticsearch actually terminates the query?
Elasticsearch version: 7.8.1.
The timeout parameter applies per shard. If the time spent on one shard exceeds the timeout value, the search on that shard is cancelled and the hits gathered up to that point are returned.
As you can see, you have 984 shards, so on a single node with a single processor a search with a 10ms timeout could in theory take up to 9.84 seconds to return. That is probably not your case, since the query returned in about 1.5 seconds, but it illustrates that the timeout does not work the way you expect it to.
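The worst-case figure above is simple arithmetic. Here is a small Python sketch of it, assuming every shard uses its full timeout and that shards are searched strictly one after another (the single-node, single-processor scenario described). The function name and the concurrent_shards parameter are made up for illustration.

```python
def worst_case_seconds(shard_count, timeout_ms, concurrent_shards=1):
    """Upper bound on search time if every shard uses its whole timeout.

    concurrent_shards is an illustrative assumption about how many shards
    are searched at the same time; 1 means strictly sequential.
    """
    # Number of sequential rounds needed to visit every shard (ceiling division).
    rounds = -(-shard_count // concurrent_shards)
    return rounds * timeout_ms / 1000


# 984 shards, 10 ms each, searched one at a time.
print(worst_case_seconds(984, 10))
```

The timed_out flag in the response reports whether any shard actually hit the limit, so it is worth checking alongside took.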
I am new to aggregations in Elasticsearch.
Below is my query in Kibana:
GET /vehicles/cars/_search
{
"aggs": {
"popular_cars": {
"terms": {"field": "make.keyword","size": 1000
},
"aggs": {
"avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
My output from elastic search
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 16,
"max_score": 1,
"hits": [Here it contains the long list of hits]
}
}
}
My confusion is that the aggregations field is not displayed in the output. Am I missing something here?
I'm sending a request like this:
{
"from": 0,
"query": {
"match": {
"_all": "presidencia"
}
}
,
"aggs": {
//... some aggregations
}
,
"highlight": {
"fields": {
"nomeOrgaoSuperior": {}
}
}
}
But my response doesn't come with highlight field.
Response:
{
"took": 68,
"timed_out": false,
"_shards": {"total": 15, "successful": 15, "failed": 0},
"hits": {
"total": 692785,
"max_score": 0.48536316,
"hits": [
//Some hits...
]
},
"aggregations": {
//some aggs ...
}
}
Do I need some extra configuration on my index, or what?
Found the problem. I was trying to use highlight on a field that wasn't analysed by my analyser. So my search text was analysed, but the field I was trying to highlight wasn't. As a result, the highlighter never found a match to return.
Is there a way to access the total number of terms in an index through ES API?
I need to estimate the prior probability of a term occurring in the index:
total_term_frequency/total_terms_in_index
I can access ttf, but not the total number of terms stored in the index.
I think the cardinality aggregation is what you're looking for.
For example:
POST /test_index/_search
{
"size": 0,
"aggs": {
"term_count": {
"cardinality": {
"field": "doc_text"
}
}
}
}
...
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0,
"hits": []
},
"aggregations": {
"term_count": {
"value": 161
}
}
}
Here is some code I used to play around with it:
http://sense.qbox.io/gist/d5625c80946f332718b0fa166bba27efd264b76e
I have records of the type
{
"_index": "constant",
"_type": "host",
"_id": "AU7TX249tNLhGJRMfUXb",
"_score": 1,
"_source": {
"private": true,
"host-ip": "172.22.69.64"
}
}
If I look for aggregates of private and host-ip via
POST constant/host/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs":{
"test":{
"cardinality":{
"field": "host-ip"
}
},
"test2":{
"cardinality":{
"field": "private"
}
}
}
}
I get as a result
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 7730,
"max_score": 0,
"hits": []
},
"aggregations": {
"test": {
"value": 7860
},
"test2": {
"value": 2
}
}
}
My understanding of the result above is the following:
there is a total of 7730 documents of type host in the index constant
there are two different values for private (this is expected)
What I do not understand is how it is possible to have 7860 distinct values of host-ip when the total number of documents in the index is 7730?
Is my understanding of total in hits correct?
The cardinality aggregation is not exact. As the documentation says:
A single-value metrics aggregation that calculates an approximate count of distinct values
That is why the reported number of distinct values can be greater than the number of documents.
You can raise the precision_threshold option to make the results more accurate, but it will consume more resources.
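As a sketch of what that looks like, the Python snippet below builds the request body from the question with precision_threshold added to each cardinality aggregation. The helper name, the aggregation names, and the threshold of 10000 are illustrative choices. The Elasticsearch documentation gives 40000 as the maximum supported value, and counts below the threshold are expected to be close to accurate.

```python
import json


def cardinality_body(fields, precision_threshold):
    """Build a search body with one cardinality aggregation per field."""
    return {
        "query": {"match_all": {}},
        "size": 0,  # only the aggregation results are needed, not the hits
        "aggs": {
            f"distinct_{field}": {
                "cardinality": {
                    "field": field,
                    # Higher values trade memory for accuracy.
                    "precision_threshold": precision_threshold,
                }
            }
            for field in fields
        },
    }


body = cardinality_body(["host-ip", "private"], precision_threshold=10000)
# Print the JSON to send as the body of POST constant/host/_search.
print(json.dumps(body, indent=2))
```

With about 7,730 documents, a threshold above that number should keep the estimate close to the true distinct count, though the result is still not guaranteed to be exact.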