How can an aggregation be greater than the total number of hits?

I have records of the following type:
{
"_index": "constant",
"_type": "host",
"_id": "AU7TX249tNLhGJRMfUXb",
"_score": 1,
"_source": {
"private": true,
"host-ip": "172.22.69.64"
}
}
If I look for aggregates of private and host-ip via
POST constant/host/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs":{
"test":{
"cardinality":{
"field": "host-ip"
}
},
"test2":{
"cardinality":{
"field": "private"
}
}
}
}
I get as a result
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 7730,
"max_score": 0,
"hits": []
},
"aggregations": {
"test": {
"value": 7860
},
"test2": {
"value": 2
}
}
}
My understanding of the result above is the following:
there is a total of 7730 documents of type host in the index constant
there are two different values for private (this is expected)
What I do not understand is how it is possible to have 7860 distinct values of host-ip when the total number of documents in the index is 7730?
Is my understanding of total in hits correct?

The cardinality aggregation is not exact. As the documentation says:
A single-value metrics aggregation that calculates an approximate count of distinct values
It is based on the HyperLogLog++ algorithm, so the estimate can land slightly above the true count, which is how it can exceed the number of documents. Your reading of total in hits is correct.
You can raise the precision_threshold option to make the results more accurate, at the cost of more memory.
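To put the numbers from the question in perspective, here is a minimal Python sketch. It builds the same request body with an explicit precision_threshold and computes how far the reported 7860 overshoots the 7730 documents. The helper names and the threshold value of 10000 are illustrative assumptions, and the overshoot figure assumes host-ip holds a single value per document, so that the true distinct count cannot exceed the document count.

```python
import json


def cardinality_body(field, precision_threshold=None):
    """Build a search body with one cardinality aggregation (hypothetical helper)."""
    agg = {"field": field}
    if precision_threshold is not None:
        # A higher threshold trades memory for accuracy.
        agg["precision_threshold"] = precision_threshold
    return {
        "size": 0,
        "query": {"match_all": {}},
        "aggs": {"test": {"cardinality": agg}},
    }


def overshoot(estimate, doc_count):
    """Relative amount by which the estimate exceeds the document count."""
    return (estimate - doc_count) / doc_count


# 10000 is an arbitrary example value, chosen to sit above the 7730 documents.
body = cardinality_body("host-ip", precision_threshold=10000)
print(json.dumps(body, indent=2))
print(f"overshoot: {overshoot(7860, 7730):.2%}")
```

The printed body can be sent with POST constant/host/_search as in the question. An overshoot of a couple of percent is within what an approximate counter would be expected to produce.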

Related

Elasticsearch aggregation limitation

When I run an aggregation query, what scope is it applied to: all entries in the index, or just the first 10000?
For example, here is a response I got for a script metric aggregation:
{
"took": 76,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"number_of_operations_in_progress": {
"value": 2
}
}
}
hits->total->value is 10000, which makes me think that the aggregate function is applied to the first 10000 entries only, not to the whole data set in the index.
Is my understanding correct? If yes, is there a way to apply an aggregate function to all entries?
Aggregations are always applied to the whole document set that is selected by the query.
hits.total.value is only a lower bound here: the "gte" relation means that at least 10,000 documents match the query, because by default hits are counted exactly only up to 10,000.
You can use track_total_hits to control how the total number of hits is tracked:
POST index1/_search
{
"track_total_hits": true,
"query": {
"match_all": {}
},
"aggs": {
"groupbyk1": {
"terms": {
"field": "k1"
}
}
}
}
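On the client side, the two shapes of hits.total seen in these responses can be read with a small helper. This is a sketch only: read_total is a hypothetical name, and it assumes the response has already been parsed into a Python dict.

```python
def read_total(hits):
    """Return (count, is_exact) from the "hits" section of a search response.

    Handles both the older plain-integer form and the newer
    {"value": ..., "relation": ...} object form.
    """
    total = hits["total"]
    if isinstance(total, dict):
        # "eq" means the count is exact, "gte" means it is a lower bound.
        return total["value"], total.get("relation", "eq") == "eq"
    return total, True


capped = {"total": {"value": 10000, "relation": "gte"}, "max_score": None, "hits": []}
legacy = {"total": 7730, "max_score": 0, "hits": []}

print(read_total(capped))
print(read_total(legacy))
```

When is_exact comes back False, the aggregation results still cover every matching document; only the reported hit count is capped.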

ElasticSearch. Total number of unique terms in an index

Is there a way to access the total number of terms in an index through the ES API?
I need to estimate the prior probability of a term occurring in the index:
total_term_frequency/total_terms_in_index
I can access ttf but not the total number of terms stored in the index.
I think the cardinality aggregation is what you're looking for.
For example:
POST /test_index/_search
{
"size": 0,
"aggs": {
"term_count": {
"cardinality": {
"field": "doc_text"
}
}
}
}
...
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0,
"hits": []
},
"aggregations": {
"term_count": {
"value": 161
}
}
}
Here is some code I used to play around with it:
http://sense.qbox.io/gist/d5625c80946f332718b0fa166bba27efd264b76e
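For the estimate in the question, the arithmetic itself is a single division. The sketch below uses a hypothetical helper and made-up numbers. Note that the cardinality value is an approximate count of distinct values in the field, which may differ from the total number of term occurrences that the formula calls for. One possible source for that denominator is the sum_ttf field statistic reported by the term vectors API, though whether it fits your setup is an assumption here.

```python
def term_prior(ttf, total_terms):
    """Prior probability of a term: its total frequency over all term occurrences."""
    if total_terms <= 0:
        raise ValueError("total_terms must be positive")
    if not 0 <= ttf <= total_terms:
        raise ValueError("ttf must lie between 0 and total_terms")
    return ttf / total_terms


# Made-up numbers for illustration only: a term seen 12 times among 4800 occurrences.
prior = term_prior(12, 4800)
print(prior)
```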

How to filter out elements from an array that doesn’t match the query?

stackoverflow won't let me write that much example code so I put it on gist.
So I have this index
with this mapping
here is a sample document I insert into newly created mapping
this is my query
GET products/paramSuggestions/_search
{
"size": 10,
"query": {
"filtered": {
"query": {
"match": {
"paramName": {
"query": "col",
"operator": "and"
}
}
}
}
}
}
this is the unwanted result I get from previous query
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.33217794,
"hits": [
{
"_index": "products",
"_type": "paramSuggestions",
"_id": "1",
"_score": 0.33217794,
"_source": {
"productName": "iphone 6",
"params": [
{
"paramName": "color",
"value": "white"
},
{
"paramName": "capacity",
"value": "32GB"
}
]
}
}
]
}
}
And finally, this is how I want the query result to look:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.33217794,
"hits": [
{
"_index": "products",
"_type": "paramSuggestions",
"_id": "1",
"_score": 0.33217794,
"_source": {
"productName": "iphone 6",
"params": [
{
"paramName": "color",
"value": "white"
}
]
}
}
]
}
}
What should the query look like to return only the array items that match the query? In other words, non-matching array items should not appear in the final result.
The final result is the _source document that you indexed. There is no feature that lets you mask field elements of your document out of the Elasticsearch response.
That said, depending on your goal, you can look into how Highlighters and Suggesters identify result terms matching the query, or possibly, roll-your-own client-side masking using info returned from setting "explain": true in your query.
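A minimal sketch of the client-side masking idea, in Python. The function name is hypothetical, and the case-insensitive prefix test is an assumption standing in for whatever analysis the real mapping applies (the mapping is not shown here), so it may not match Elasticsearch's own matching in every case.

```python
import copy


def mask_params(hit, needle):
    """Return a copy of a hit whose params array keeps only matching items."""
    masked = copy.deepcopy(hit)  # leave the original response untouched
    source = masked["_source"]
    wanted = needle.lower()
    source["params"] = [
        item
        for item in source.get("params", [])
        # Assumed rule: keep items whose paramName starts with the search text.
        if item.get("paramName", "").lower().startswith(wanted)
    ]
    return masked


hit = {
    "_index": "products",
    "_type": "paramSuggestions",
    "_id": "1",
    "_score": 0.33217794,
    "_source": {
        "productName": "iphone 6",
        "params": [
            {"paramName": "color", "value": "white"},
            {"paramName": "capacity", "value": "32GB"},
        ],
    },
}

masked_hit = mask_params(hit, "col")
print(masked_hit["_source"]["params"])
```

Applied to every entry of hits.hits, this yields the shape of the wanted result while the query itself stays unchanged.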

Difference between a "plain" terms query and a terms query using a filter

I am trying to understand what the difference is between:
a "plain" elasticsearch query that is going to match a terms query and return a certain number of hits.
and a filtered query (therefore using a filter) that is going to return the same number of hits.
Here is the terms query:
GET _search
{
"query": {
"terms": {
"childcareTypes": [
"SOLE_CHARGE",
"OUT_OF_SCHOOL",
"BABY_SITTING"
],
"minimum_match": 3
}
}
}
Here is the filtered version:
GET _search
{
"query": {
"filtered": {
"filter": {
"terms": {
"childcareTypes": [
"SOLE_CHARGE",
"OUT_OF_SCHOOL",
"BABY_SITTING"
],
"execution": "and"
}
}
}
}
}
Both return a total hits of 8000 (against my index).
Here is the result from the "plain" terms query:
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 8000,
"max_score": 5.134171,
"hits": [
{
"_index": "bignibou",
"_type": "advertisement",
"_id": "AUs2T2lt3L5LNr7nkot2",
"_score": 5.134171,
"_source": {
"childcareWorkerType": "AUXILIAIRE_PARENTALE",
"childcareTypes": [
"SOLE_CHARGE",
"OUT_OF_SCHOOL",
"BABY_SITTING"
],
"address": {
"latitude": 48.8532558,
"longitude": 2.36584
},
"giveBath": "EMPTY"
}
},
...
Here is the result from the "filtered" query:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 8000,
"max_score": 1,
"hits": [
{
"_index": "bignibou",
"_type": "advertisement",
"_id": "AUs2T2lt3L5LNr7nkot2",
"_score": 1,
"_source": {
"childcareWorkerType": "AUXILIAIRE_PARENTALE",
"childcareTypes": [
"SOLE_CHARGE",
"OUT_OF_SCHOOL",
"BABY_SITTING"
],
"address": {
"latitude": 48.8532558,
"longitude": 2.36584
},
"giveBath": "EMPTY"
}
},
....
Then what are the differences between the two?
This comes down to the difference between queries and filters.
In your case, unlike the terms query, the terms filter:
is cached
doesn't compute a score: all matching documents get the same _score of 1 (look at your results)
Consequently, the main practical difference is that the filtered query will usually be faster than the "plain" terms query.
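The filtered query and the execution option shown in the question belong to older Elasticsearch releases; later versions express the same non-scoring match with a bool query and a filter clause. The Python sketch below builds such a body. The helper name is hypothetical, and using one term clause per value is one way, assumed here, to reproduce the "and" behaviour where every value must be present.

```python
import json


def all_terms_filter(field, values):
    """Build a non-scoring bool/filter body that requires every value on the field."""
    return {
        "query": {
            "bool": {
                # Filter clauses do not contribute to the score; all of them must match.
                "filter": [{"term": {field: value}} for value in values]
            }
        }
    }


filter_body = all_terms_filter(
    "childcareTypes", ["SOLE_CHARGE", "OUT_OF_SCHOOL", "BABY_SITTING"]
)
print(json.dumps(filter_body, indent=2))
```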

How to return number of matches according to specific term in search query?

In my search query I have this:
...
term: { CategoryId: [1,2,3] }
...
I need to return how many matches were found for each category. For now just total number of matches is returned. Is it possible? I think this might be related to aggregation, however I can't find the right solution...
A sample query using a terms aggregation could be:
POST /test/products/_search
{
"size": 0,
"aggs": {
"category": {
"terms": {
"field": "category"
}
}
}
}
The response then looks like this:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 10,
"max_score": 0,
"hits": []
},
"aggregations": {
"category": {
"buckets": [
{
"key": "1",
"doc_count": 10
},
{
"key": "2",
"doc_count": 12
}
]
}
}
}
This gives the number of documents for each category.
Hope this helps!!
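Once the response is parsed, the per-category counts can be pulled out of the buckets with a few lines of Python. This is a sketch with a hypothetical helper name; the sample data is the aggregations section of the response above.

```python
def bucket_counts(response, agg_name):
    """Map each bucket key of a terms aggregation to its doc_count."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


response = {
    "aggregations": {
        "category": {
            "buckets": [
                {"key": "1", "doc_count": 10},
                {"key": "2", "doc_count": 12},
            ]
        }
    }
}

counts = bucket_counts(response, "category")
print(counts)
```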
