Elasticsearch: total number of unique terms in an index

Is there a way to access the total number of terms in an index through ES API?
I need to estimate the prior probability of a term occurring in the index:
total_term_frequency/total_terms_in_index
I can access ttf, but not the total number of terms stored in the index.

I think the cardinality aggregation is what you're looking for.
For example:
POST /test_index/_search
{
"size": 0,
"aggs": {
"term_count": {
"cardinality": {
"field": "doc_text"
}
}
}
}
...
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0,
"hits": []
},
"aggregations": {
"term_count": {
"value": 161
}
}
}
Here is some code I used to play around with it:
http://sense.qbox.io/gist/d5625c80946f332718b0fa166bba27efd264b76e
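If you are calling this from a script, the request and the value lookup can be sketched in Python roughly as follows. This is only a sketch: the function names are hypothetical, it builds the request body without sending it, and the sample response is the one shown above, trimmed to the relevant part.

```python
def cardinality_request(field, agg_name="term_count"):
    """Build a search body that returns only a distinct-value count for `field`."""
    return {
        "size": 0,  # skip the hits, only the aggregation is needed
        "aggs": {agg_name: {"cardinality": {"field": field}}},
    }


def read_cardinality(response, agg_name="term_count"):
    """Pull the distinct count reported by the cardinality aggregation."""
    return response["aggregations"][agg_name]["value"]


# Sample response copied from the answer above (trimmed).
sample_response = {
    "hits": {"total": 4, "max_score": 0, "hits": []},
    "aggregations": {"term_count": {"value": 161}},
}

body = cardinality_request("doc_text")
unique_terms = read_cardinality(sample_response)
print(unique_terms)  # 161
```

The body can be passed to whatever HTTP or Elasticsearch client you already use.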

Related

Elasticsearch aggregation limitation

When I create an aggregation query, what scope is it applied to: all entries in an index or just the first 10000?
For example, here is a response I got for a script metric aggregation:
{
"took": 76,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"number_of_operations_in_progress": {
"value": 2
}
}
}
hits->total->value is 10000, which makes me think that the aggregation is applied to the first 10000 entries only, not to the whole data set in the index.
Is my understanding correct? If yes, is there a way to apply an aggregate function to all entries?
Aggregations are always applied to the whole document set that is selected by the query.
hits.total.value only indicates how many documents match the query; here the relation "gte" means that at least 10,000 documents match.
You can use track_total_hits to control how the total number of hits is tracked:
POST index1/_search
{
"track_total_hits": true,
"query": {
"match_all": {}
},
"aggs": {
"groupbyk1": {
"terms": {
"field": "k1"
}
}
}
}
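To make the meaning of hits.total explicit on the client side, something like the following Python helper could be used. It is a sketch with a hypothetical function name; it assumes the two response shapes seen on this page, a plain integer in older responses and an object with "value" and "relation" in newer ones.

```python
def describe_total_hits(response):
    """Return (count, is_exact) for both the integer and the object form."""
    total = response["hits"]["total"]
    if isinstance(total, dict):
        # Newer responses: {"value": N, "relation": "eq" | "gte"}
        return total["value"], total["relation"] == "eq"
    # Older responses report a plain integer.
    return total, True


# The response from the question: the count is a lower bound, not a limit
# on what the aggregation saw.
capped = {"hits": {"total": {"value": 10000, "relation": "gte"}}}
print(describe_total_hits(capped))  # (10000, False)
```

A result of (10000, False) means "at least 10000 matching documents"; the aggregation values are still computed over every document the query selects.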

Elasticsearch range query not working as expected

I am trying to fetch data by applying a range on a date-type field ("timeA" in this case).
My query is:
{
"query": {
"bool": {
"must": [
{
"match_phrase": {
"name": "A"
}
},
{
"range": {
"timeA": {
"lte": 9999
}
}
}
]
}
}
}
I don't have any data less than 1558891800000 in the timeA field.
So the expected output should be:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
But the actual output I'm getting is:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.287682,
"hits": [
{
"_index": "checktimestamp",
"_type": "doc",
"_id": "AWr4sdJv_fFf5JZrQhXl",
"_score": 1.287682,
"_source": {
"name": "A",
"timeA": 1558899000000,
"timeLocal": "27-1AM"
}
}
]
}
}
The type of the timeA field is date.
My Elasticsearch version is 5.6.10 and Kibana version is 5.6.10.
Please suggest what the problem is and how I can resolve it.
Thanks in advance.
Elasticsearch parses the 4 digits as a year, meaning the range matches documents with a year less than or equal to 9999, which I'm assuming is all your data.
To avoid this, you need to define a strict format for your date field in the mapping, which will not allow a "yyyy" format to sneak in.
Alternatively, don't use numbers with fewer than 5 digits in those queries.
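The two readings of 9999 can be compared with a few lines of Python. This is an illustrative sketch, not what Elasticsearch runs internally: the helper name is hypothetical, and the mapping fragment at the end is an assumption showing one way to restrict the field to the built-in epoch_millis format.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def year_start_millis(year):
    """Epoch milliseconds for 1 January of `year` (UTC)."""
    start = datetime(year, 1, 1, tzinfo=timezone.utc)
    return (start - EPOCH) // timedelta(milliseconds=1)


doc_time = 1558899000000  # timeA of the document that matched unexpectedly

# Read as epoch milliseconds, 9999 is far below the document's timestamp ...
matches_as_millis = doc_time <= 9999
# ... but read as the year 9999, the bound lies far in the future.
matches_as_year = doc_time <= year_start_millis(9999)

# Hypothetical mapping fragment: a field limited to epoch_millis has no
# year-style format that a bare number could be parsed with.
strict_mapping = {"properties": {"timeA": {"type": "date", "format": "epoch_millis"}}}

print(matches_as_millis, matches_as_year)  # False True
```

The second comparison is the one that explains the unexpected hit.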

Aggregation Field Missing in output of ElasticSearch

I am a newbie learning aggregations in Elasticsearch.
Below is my query in Kibana:
GET /vehicles/cars/_search
{
"aggs": {
"popular_cars": {
"terms": {"field": "make.keyword","size": 1000
},
"aggs": {
"avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
My output from Elasticsearch:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 16,
"max_score": 1,
"hits": [Here it contains the long list of hits]
}
}
}
My confusion is that the aggregations field is not displayed in the output. Am I missing something here?

How can an aggregation be greater than the total number of hits?

I have records of the type
{
"_index": "constant",
"_type": "host",
"_id": "AU7TX249tNLhGJRMfUXb",
"_score": 1,
"_source": {
"private": true,
"host-ip": "172.22.69.64"
}
}
If I look for aggregates of private and host-ip via
POST constant/host/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs":{
"test":{
"cardinality":{
"field": "host-ip"
}
},
"test2":{
"cardinality":{
"field": "private"
}
}
}
}
I get as a result
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 7730,
"max_score": 0,
"hits": []
},
"aggregations": {
"test": {
"value": 7860
},
"test2": {
"value": 2
}
}
}
My understanding of the result above is the following:
there is a total of 7730 documents of type host in the index constant
there are two different values for private (this is expected)
What I do not understand is how it is possible to have 7860 distinct values of host-ip when the total number of documents in the index is 7730?
Is my understanding of total in hits correct?
The cardinality aggregation is not exact. As the docs say:
A single-value metrics aggregation that calculates an approximate count of distinct values
So that's the reason behind the greater number.
You can tune the precision_threshold option to make the results more accurate, but it will consume more resources.
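As a rough Python sketch, the same request with an explicit precision_threshold could be built like this. The helper names are hypothetical, and the threshold of 10000 is an assumption picked only because it is above the 7730 documents reported in the question.

```python
def cardinality_with_threshold(field, precision_threshold):
    """Build a cardinality aggregation with an explicit precision_threshold."""
    return {"cardinality": {"field": field, "precision_threshold": precision_threshold}}


body = {
    "query": {"match_all": {}},
    "size": 0,
    "aggs": {"test": cardinality_with_threshold("host-ip", 10000)},  # assumed value
}


def exceeds_hits(response, agg_name):
    """True when the approximate distinct count is larger than the hit total."""
    return response["aggregations"][agg_name]["value"] > response["hits"]["total"]


# Response from the question (trimmed): the estimate for host-ip overshoots.
sample = {
    "hits": {"total": 7730},
    "aggregations": {"test": {"value": 7860}, "test2": {"value": 2}},
}
print(exceeds_hits(sample, "test"))  # True
```

A check like exceeds_hits is a cheap way to notice when an estimate is clearly off for a single-valued field.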

How to return number of matches according to specific term in search query?

In my search query I have this:
...
term: { CategoryId: [1,2,3] }
...
I need to return how many matches were found for each category. For now only the total number of matches is returned. Is it possible? I think this might be related to aggregations, but I can't find the right solution...
A sample query could be:
POST /test/products/_search
{
"size": 0,
"aggs": {
"category": {
"terms": {
"field": "category"
}
}
}
}
The response then looks like this:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 10,
"max_score": 0,
"hits": []
},
"aggregations": {
"category": {
"buckets": [
{
"key": "1",
"doc_count": 10
},
{
"key": "2",
"doc_count": 12
}
]
}
}
}
This gives the number of documents for each category.
Hope this helps!
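If it helps, here is a small Python sketch that combines the category filter from the question with the terms aggregation and flattens the buckets into a dict. The function names are hypothetical, the field name CategoryId is taken from the question, and using a terms query for the list of ids is my assumption about what the elided query intends.

```python
def category_counts_request(category_ids, field="CategoryId"):
    """Filter on several category ids and count the matches per category."""
    return {
        "size": 0,  # only the per-category counts are needed
        "query": {"terms": {field: list(category_ids)}},
        "aggs": {"category": {"terms": {"field": field}}},
    }


def bucket_counts(response, agg_name="category"):
    """Turn the terms aggregation buckets into {key: doc_count}."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Buckets copied from the sample response above.
sample = {
    "aggregations": {
        "category": {
            "buckets": [
                {"key": "1", "doc_count": 10},
                {"key": "2", "doc_count": 12},
            ]
        }
    }
}
print(bucket_counts(sample))  # {'1': 10, '2': 12}
```

Because the aggregation runs on the documents selected by the query, the counts cover only the requested categories.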
