The uniq_gender terms aggregation returns only 10 values, whereas I need all the unique values - elasticsearch

Problem statement: I need the list of unique values of the field host.name.keyword from the complete index. Currently I am using the query below, which returns only 10 values even though more values exist in the index.
Query:
GET nw-metricbeats-7.10.0-2021.07.16/_search
{
  "size": 0,
  "aggs": {
    "uniq_gender": {
      "terms": {
        "field": "host.name.keyword"
      }
    }
  }
}
Currently, it returns only 10 buckets, like below:
{
  "took" : 68,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "uniq_gender" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 1011615,
      "buckets" : [
        { "key" : "service1", "doc_count" : 303710 },
        { "key" : "service2", "doc_count" : 155110 },
        { "key" : "service3", "doc_count" : 154074 },
        { "key" : "service4", "doc_count" : 148499 },
        { "key" : "service5", "doc_count" : 145033 },
        { "key" : "service6", "doc_count" : 144226 },
        { "key" : "service7", "doc_count" : 139367 },
        { "key" : "service8", "doc_count" : 137063 },
        { "key" : "service9", "doc_count" : 135586 },
        { "key" : "service10", "doc_count" : 134794 }
      ]
    }
  }
}
Can someone help me with a query that can return all (or N) unique values of the field?

You have two options. If you have a rough idea of the number of distinct values the field can take, you can pass a size parameter larger than that number.
{
  "size": 0,
  "aggs": {
    "uniq_gender": {
      "terms": {
        "field": "host.name.keyword",
        "size": 500
      }
    }
  }
}
This might not be the best solution for you, because:
1. You have to pass a fixed value for size.
2. The result might not be completely accurate.
The Elasticsearch docs advise using a
composite aggregation as an alternative.
{
"size": 0,
"aggs": {
"my_buckets": {
"composite": {
"sources": [
{ "uniq_gender": { "terms": { "field": "host.name.keyword" } } }
]
}
}
}
}
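The composite aggregation is paginated: each response carries an after_key, which you pass back as after in the next request, repeating until no more buckets are returned. A sketch of a follow-up page (the size and the after value shown are illustrative):
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 100,
        "sources": [
          { "uniq_gender": { "terms": { "field": "host.name.keyword" } } }
        ],
        "after": { "uniq_gender": "service5" }
      }
    }
  }
}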

Your terms agg also accepts a size parameter that sets the number of buckets to be returned. The default is 10.
I would caution you against relying on this approach to find all indexed values of any field that has very high cardinality, as that is a notorious way to blow up the heap use of your nodes. A composite agg is provided for that purpose.

Related

Aggregate by custom defined buckets, according to field value

I'm interested in aggregating my data into buckets, but I want to put two distinct values into the same bucket.
This is what I mean:
Say I have this query:
GET _search
{
  "size": 0,
  "aggs": {
    "my-agg-name": {
      "terms": {
        "field": "ecs.version"
      }
    }
  }
}
it returns this response:
"aggregations" : {
  "my-agg-name" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      { "key" : "1.12.0", "doc_count" : 642826144 },
      { "key" : "8.0.0", "doc_count" : 204064845 },
      { "key" : "1.1.0", "doc_count" : 16508253 },
      { "key" : "1.0.0", "doc_count" : 9162928 },
      { "key" : "1.6.0", "doc_count" : 1111542 },
      { "key" : "1.5.0", "doc_count" : 10445 }
    ]
  }
}
Every distinct value of the field ecs.version is in its own bucket.
But say I wanted to define my buckets such that:
bucket1: [1.12.0, 8.0.0]
bucket2: [1.6.0, 8.4.0]
bucket3: [1.0.0, 8.8.0]
Is this possible in anyway?
I know I can just return all the buckets and do the sums programmatically, but the list can be very long, so I don't think that would be efficient. Am I wrong?
You can use a runtime mapping to generate a runtime field, and that field can then be used for the aggregation. I tested the example below on ES 7.16.
I indexed some sample documents; below is the aggregation output without joining multiple values:
"aggregations" : {
  "version" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      { "key" : "1.12.0", "doc_count" : 3 },
      { "key" : "1.6.0", "doc_count" : 3 },
      { "key" : "8.4.0", "doc_count" : 3 },
      { "key" : "8.0.0", "doc_count" : 2 }
    ]
  }
}
You can use the query below with a runtime mapping, but you will need to add an if condition for each of your version groupings:
{
  "size": 0,
  "runtime_mappings": {
    "normalized_version": {
      "type": "keyword",
      "script": """
        String version = doc['version.keyword'].value;
        if (version.equals('1.12.0') || version.equals('8.0.0')) {
          emit('1.12.0, 8.0.0');
        } else if (version.equals('1.6.0') || version.equals('8.4.0')) {
          emit('1.6.0, 8.4.0');
        } else {
          emit(version);
        }
      """
    }
  },
  "aggs": {
    "genres": {
      "terms": {
        "field": "normalized_version"
      }
    }
  }
}
Below is the output of the above aggregation query:
"aggregations" : {
  "genres" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      { "key" : "1.6.0, 8.4.0", "doc_count" : 6 },
      { "key" : "1.12.0, 8.0.0", "doc_count" : 5 }
    ]
  }
}
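If you would rather avoid scripting, a filters aggregation can express the same grouping directly, with one named filter per bucket and a terms query listing the values that belong together. A sketch (the bucket names are illustrative):
{
  "size": 0,
  "aggs": {
    "version_groups": {
      "filters": {
        "filters": {
          "bucket1": { "terms": { "ecs.version": ["1.12.0", "8.0.0"] } },
          "bucket2": { "terms": { "ecs.version": ["1.6.0", "8.4.0"] } },
          "bucket3": { "terms": { "ecs.version": ["1.0.0", "8.8.0"] } }
        }
      }
    }
  }
}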

Bucket Script Aggregation - Elastic Search

I'm trying to build a query in Elasticsearch, in order to get the difference of two values:
Here's the code I'm using:
GET /monitora/_search
{
  "size": 0,
  "aggs": {
    "CALC_DIFF": {
      "filters": {
        "filters": {
          "FTS_callback": { "term": { "msgType": "panorama_fts" } },
          "FTS_position": { "term": { "msgType": "panorama_position" } }
        }
      },
      "aggs": {
        "subtract": {
          "bucket_script": {
            "buckets_path": {
              "PCountCall": "_count",
              "PcountPos": "_count"
            },
            "script": "params.PCountCall - params.PcountPos"
          }
        }
      }
    }
  }
}
And this is what I get back when I run it:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "CALC_DIFF" : {
      "buckets" : {
        "FTS_callback" : {
          "doc_count" : 73530,
          "subtract" : { "value" : 0.0 }
        },
        "FTS_position" : {
          "doc_count" : 156418,
          "subtract" : { "value" : 0.0 }
        }
      }
    }
  }
}
However, instead of getting the subtraction inside each bucket (which will always be zero, since both buckets_path entries resolve to the same bucket's _count), I was looking for the difference between the counts of the two buckets, which in this example would be (73530 - 156418).
After that, I would like to display the result as a "metric" visualization element in Kibana. Is it possible?
Could anyone give me a hand to get it right?
Thanks in advance!
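One common workaround (a sketch, untested against this index) is to wrap everything in a single match_all filters bucket and compute the two counts as filter sub-aggregations, so that bucket_script can reference them as sibling paths:
GET /monitora/_search
{
  "size": 0,
  "aggs": {
    "all_docs": {
      "filters": {
        "filters": { "all": { "match_all": {} } }
      },
      "aggs": {
        "callback_count": {
          "filter": { "term": { "msgType": "panorama_fts" } }
        },
        "position_count": {
          "filter": { "term": { "msgType": "panorama_position" } }
        },
        "subtract": {
          "bucket_script": {
            "buckets_path": {
              "PCountCall": "callback_count._count",
              "PcountPos": "position_count._count"
            },
            "script": "params.PCountCall - params.PcountPos"
          }
        }
      }
    }
  }
}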

Get an aggregate count in elasticsearch based on particular uniqueid field

I have created an index and indexed documents in Elasticsearch, and it's working fine. The challenge is that I have to get an aggregate count of the Category field based on UserID. My sample documents are given below.
{
  "UserID": "A1001",
  "Category": "initiated",
  "policyno": "5221"
},
{
  "UserID": "A1001",
  "Category": "pending",
  "policyno": "5222"
},
{
  "UserID": "A1001",
  "Category": "pending",
  "policyno": "5223"
},
{
  "UserID": "A1002",
  "Category": "completed",
  "policyno": "5224"
}
**Sample output for UserID - "A1001"**
initiated-1
pending-2
**Sample output for UserID - "A1002"**
completed-1
How do I get the aggregate counts shown in the sample output above from the given JSON documents?
I suggest a terms aggregation with a nested terms sub-aggregation, as shown in the following:
{
  "size": 0,
  "aggs": {
    "By_ID": {
      "terms": {
        "field": "UserID.keyword"
      },
      "aggs": {
        "By_Category": {
          "terms": {
            "field": "Category.keyword"
          }
        }
      }
    }
  }
}
Here is a snippet of the response:
"hits" : {
  "total" : {
    "value" : 4,
    "relation" : "eq"
  },
  "max_score" : null,
  "hits" : [ ]
},
"aggregations" : {
  "By_ID" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : "A1001",
        "doc_count" : 3,
        "By_Category" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            { "key" : "pending", "doc_count" : 2 },
            { "key" : "initiated", "doc_count" : 1 }
          ]
        }
      },
      {
        "key" : "A1002",
        "doc_count" : 1,
        "By_Category" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            { "key" : "completed", "doc_count" : 1 }
          ]
        }
      }
    ]
  }
}
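If you only need the counts for a single UserID, you can combine a term query with the category aggregation instead of bucketing by every user (a sketch; the index name is illustrative):
GET /your_index/_search
{
  "size": 0,
  "query": { "term": { "UserID.keyword": "A1001" } },
  "aggs": {
    "By_Category": {
      "terms": { "field": "Category.keyword" }
    }
  }
}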

Elasticsearch field breaks into multiple values

I am using the ELK stack for shipping logs.
The problem I'm dealing with is that one of the fields breaks down into multiple values.
To make it clear, for the field product, my values should be:
Anti Malware, New Anti Virus, VPN-1 & FireWall-1 and some more.
However, when running:
curl --user admin:111111 -XPOST 'localhost:9200/filebeat-2016.07.14/_search?pretty' -d '
{
  "size": 0,
  "aggs": {
    "group_by_product": {
      "terms": {
        "field": "product",
        "script": "_value"
      }
    }
  }
}'
The output is:
{
  "took" : 116,
  "timed_out" : false,
  "_shards" : {
    "total" : 20,
    "successful" : 20,
    "failed" : 0
  },
  "hits" : {
    "total" : 2624573,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_product" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 8748,
      "buckets" : [
        { "key" : "1", "doc_count" : 2439769 },
        { "key" : "firewall", "doc_count" : 2439769 },
        { "key" : "vpn", "doc_count" : 2439769 },
        { "key" : "anti", "doc_count" : 166522 },
        { "key" : "malware", "doc_count" : 87399 },
        { "key" : "new", "doc_count" : 79123 },
        { "key" : "virus", "doc_count" : 79123 },
        { "key" : "blade", "doc_count" : 8249 },
        { "key" : "compliance", "doc_count" : 8249 },
        { "key" : "identity", "doc_count" : 5176 }
      ]
    }
  }
}
So the value VPN-1 & FireWall-1 breaks into vpn, firewall, and 1.
I saw that it has something to do with the field being analyzed, but I cannot define the field as not_analyzed because the fields are created dynamically.
Thanks.
You need to use dynamic templates. Refer here.
You just need to make sure that dynamically created fields follow a certain pattern, or else just use * if you want the template to apply to all fields. Set your analyzer to keyword; this analyzer passes the string through as is.
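A sketch of such a template in the ES 2.x-era syntax implied by the question (the template name and index pattern are illustrative):
PUT _template/filebeat_strings
{
  "template": "filebeat-*",
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "strings_as_keyword": {
            "match_mapping_type": "string",
            "match": "*",
            "mapping": {
              "type": "string",
              "analyzer": "keyword"
            }
          }
        }
      ]
    }
  }
}
Note that the template only affects fields created after it is installed, so existing indices would need to be reindexed to pick it up.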

ElasticSearch: retriving documents belonging to buckets

I am trying to retrieve documents for the past year, bucketed into 1-month-wide buckets. I will take the documents for each 1-month bucket and then further analyze them (out of scope of my problem here). From the description, it seems "Bucket Aggregation" is the way to go, but in the "bucket" response I am getting only the count of documents in each bucket, not the raw documents themselves. What am I missing?
GET command
{
  "aggs" : {
    "DateHistogram" : {
      "date_histogram" : {
        "field" : "timestamp",
        "interval": "month"
      }
    }
  },
  "size" : 0
}
Resulting Output
{
  "took" : 138,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1313058,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "DateHistogram" : {
      "buckets" : [
        {
          "key_as_string" : "2015-02-01T00:00:00.000Z",
          "key" : 1422748800000,
          "doc_count" : 270
        },
        {
          "key_as_string" : "2015-03-01T00:00:00.000Z",
          "key" : 1425168000000,
          "doc_count" : 459
        },
        (...and all the other months...)
        {
          "key_as_string" : "2016-03-01T00:00:00.000Z",
          "key" : 1456790400000,
          "doc_count" : 136009
        }
      ]
    }
  }
}
You're almost there; you simply need to add a top_hits sub-aggregation in order to retrieve some documents for each bucket:
POST /your_index/_search
{
  "aggs" : {
    "DateHistogram" : {
      "date_histogram" : {
        "field" : "timestamp",
        "interval": "month"
      },
      "aggs": {            <--- add this
        "docs": {
          "top_hits": {
            "size": 10
          }
        }
      }
    }
  },
  "size" : 0
}
