Elasticsearch field breaks into multiple values - ruby

I am using the ELK stack for shipping logs.
The problem I'm dealing with is that one of the fields breaks down into multiple values.
To make it clear: for the field product, my values should be
Anti Malware, New Anti Virus, VPN-1 & FireWall-1, and some more.
However, when running:
curl --user admin:111111 -XPOST 'localhost:9200/filebeat-2016.07.14/_search?pretty' -d '
{
"size": 0,
"aggs": {
"group_by_product": {
"terms": {
"field": "product",
"script": "_value"
}
}
}
}'
The output is:
{
"took" : 116,
"timed_out" : false,
"_shards" : {
"total" : 20,
"successful" : 20,
"failed" : 0
},
"hits" : {
"total" : 2624573,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"group_by_product" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 8748,
"buckets" : [ {
"key" : "1",
"doc_count" : 2439769
}, {
"key" : "firewall",
"doc_count" : 2439769
}, {
"key" : "vpn",
"doc_count" : 2439769
}, {
"key" : "anti",
"doc_count" : 166522
}, {
"key" : "malware",
"doc_count" : 87399
}, {
"key" : "new",
"doc_count" : 79123
}, {
"key" : "virus",
"doc_count" : 79123
}, {
"key" : "blade",
"doc_count" : 8249
}, {
"key" : "compliance",
"doc_count" : 8249
}, {
"key" : "identity",
"doc_count" : 5176
} ]
}
}
}
So the value VPN-1 & FireWall-1 breaks into vpn, firewall, and 1.
I saw that it has something to do with the field being analyzed, but I cannot define the field as not analyzed because the fields are created dynamically.
Thanks.

You need to use dynamic templates; refer to the dynamic templates section of the Elasticsearch mapping documentation.
You just need to make sure the dynamically created fields follow a certain pattern, or use * if you want the template to apply to all fields. Set the analyzer to keyword; this analyzer passes the string through as is.
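A minimal sketch of such an index template, assuming Elasticsearch 2.x as in the question (where string fields are analyzed by default; on 5.x and later you would map strings to the keyword type instead) and the default filebeat-* index pattern:
PUT _template/filebeat_keyword_strings
{
  "template": "filebeat-*",
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "strings_as_keyword": {
            "match_mapping_type": "string",
            "match": "*",
            "mapping": {
              "type": "string",
              "analyzer": "keyword"
            }
          }
        }
      ]
    }
  }
}
A template only applies to indices created after it is installed, so the product field will come back whole starting from the next daily Filebeat index; existing indices would need a reindex.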

Related

Aggregating all fields for an object in a search query, without manually specifying the fields

I have an index products that has an inner object attributes, which looks like:
{
properties: {
id: {...},
name: {...},
colors: {...},
// remaining fields
}
}
I'm trying to produce a search query of this form, and I need to figure out how to write the aggs object:
{ query: {...}, aggs: {...} }
I can write this out manually for two fields to get the desired result; however, the object contains 50+ fields, so I need it to be handled automatically:
"aggs": {
"attributes.color_group.id": {
"terms": {
"field": "attributes.color_group.id.keyword"
}
},
"attributes.product_type.id": {
"terms": {
"field": "attributes.product_type.id.keyword"
}
}
}
This gives me the result:
"aggregations" : {
"attributes.product_type.id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 34,
"buckets" : [
{
"key" : "374",
"doc_count" : 203
},
{
"key" : "439",
"doc_count" : 79
},
{
"key" : "460",
"doc_count" : 28
},
{
"key" : "451",
"doc_count" : 24
},
{
"key" : "558",
"doc_count" : 18
},
{
"key" : "500",
"doc_count" : 10
},
{
"key" : "1559",
"doc_count" : 9
},
{
"key" : "1560",
"doc_count" : 9
},
{
"key" : "455",
"doc_count" : 7
},
{
"key" : "501",
"doc_count" : 6
}
]
},
"attributes.color_group.id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 35,
"buckets" : [
{
"key" : "12",
"doc_count" : 98
},
{
"key" : "54",
"doc_count" : 48
},
{
"key" : "118",
"doc_count" : 43
},
{
"key" : "110",
"doc_count" : 41
},
{
"key" : "111",
"doc_count" : 35
},
{
"key" : "71",
"doc_count" : 35
},
{
"key" : "119",
"doc_count" : 24
},
{
"key" : "62",
"doc_count" : 21
},
{
"key" : "115",
"doc_count" : 20
},
{
"key" : "113",
"doc_count" : 15
}
]
}
}
Which is exactly what I want. After some research I found that you can use query_string, which would allow me to find everything starting with attributes.; however, it does not seem to work inside aggregations.
As far as I know, what you are asking is not possible with the built-in functionality of Elasticsearch, but there are some workarounds, such as:
Use a search template:
Below is an example of a search template where you provide the list of fields as an array and it creates a terms aggregation for each provided field. You can store the search template using the stored scripts API and use the template's id when calling the search request.
POST dyagg/_search/template
{
"source": """{
"query": {
"match_all": {}
},
"aggs": {
{{#filter}}
"{{.}}": {
"terms": {
"field": "{{.}}",
"size": 10
}
}, {{/filter}}
"name": {
"terms": {
"field": "name",
"size": 10
}
}
}
}""",
"params": {
"filter":["lastname","firstname","city","country"]
}
}
Response:
"aggregations" : {
"country" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "India",
"doc_count" : 4
}
]
},
"firstname" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Rajan",
"doc_count" : 1
},
{
"key" : "Sagar",
"doc_count" : 1
},
{
"key" : "Sajan",
"doc_count" : 1
},
{
"key" : "Sunny",
"doc_count" : 1
}
]
},
"city" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Mumbai",
"doc_count" : 2
},
{
"key" : "Pune",
"doc_count" : 2
}
]
},
"name" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Rajan Desai",
"doc_count" : 1
},
{
"key" : "Sagar Patel",
"doc_count" : 1
},
{
"key" : "Sajan Patel",
"doc_count" : 1
},
{
"key" : "Sunny Desai",
"doc_count" : 1
}
]
},
"lastname" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Desai",
"doc_count" : 2
},
{
"key" : "Patel",
"doc_count" : 2
}
]
}
}
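As mentioned above, the template can also be stored and then invoked by id; a sketch, using a hypothetical script id dyagg-template (the source is the same template as before, collapsed into one JSON string):
PUT _scripts/dyagg-template
{
  "script": {
    "lang": "mustache",
    "source": "{\"query\":{\"match_all\":{}},\"aggs\":{ {{#filter}}\"{{.}}\":{\"terms\":{\"field\":\"{{.}}\",\"size\":10}}, {{/filter}}\"name\":{\"terms\":{\"field\":\"name\",\"size\":10}}}}"
  }
}
GET dyagg/_search/template
{
  "id": "dyagg-template",
  "params": {
    "filter": ["lastname", "firstname", "city", "country"]
  }
}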
The second way is to build the aggregation programmatically, as sketched below. Please check this Stack Overflow answer, which shows how to do it in PHP; you can follow the same approach in any other language.
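In plain REST terms (products and attributes are the names from the question): first pull the field list from the mapping, then have the client build one terms aggregation per field.
GET products/_mapping
Your code walks the properties under attributes in the response and emits one "<field>": { "terms": { "field": "<field>.keyword" } } entry per field, producing exactly the manual aggs object shown in the question.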
NOTE:
If you look at the search template, I added one static aggregation on the name field. The reason is to avoid a trailing comma after the last iteration of the {{#filter}} loop; if you do not add it, you will get a json_parse_exception.

The uniq_gender terms aggregation returns only 10 values, whereas I need all the unique values

Problem statement: I require the list of unique values of the field host.name.keyword from the complete index. Currently I am using the query below, which gives only 10 values, but more values exist in the index.
Query:
GET nw-metricbeats-7.10.0-2021.07.16/_search
{
"size":"0",
"aggs" :
{
"uniq_gender" :
{
"terms" :
{
"field" : "host.name.keyword"
}
}
}
}
Currently it returns only 10 values, like below:
{
"took" : 68,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"uniq_gender" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 1011615,
"buckets" : [
{
"key" : "service1",
"doc_count" : 303710
},
{
"key" : "service2",
"doc_count" : 155110
},
{
"key" : "service3",
"doc_count" : 154074
},
{
"key" : "service4",
"doc_count" : 148499
},
{
"key" : "service5",
"doc_count" : 145033
},
{
"key" : "service6",
"doc_count" : 144226
},
{
"key" : "service7",
"doc_count" : 139367
},
{
"key" : "service8",
"doc_count" : 137063
},
{
"key" : "service9",
"doc_count" : 135586
},
{
"key" : "service10",
"doc_count" : 134794
}
]
}
}
}
Can someone help me with a query that can return all N unique values of the metric?
You have two options. If you have a rough idea of the number of distinct values the field can take, you can pass a size parameter larger than that number:
{
"size":"0",
"aggs" :
{
"uniq_gender" :
{
"terms" :
{
"field" : "host.name.keyword",
"size" : 500
}
}
}
}
This might not be the best solution for you, because:
1. You have to pass a fixed value for size.
2. The result might not be completely accurate.
The Elasticsearch docs advise using a composite aggregation as an alternative:
{
"size": 0,
"aggs": {
"my_buckets": {
"composite": {
"sources": [
{ "uniq_gender": { "terms": { "field": "host.name.keyword" } } }
]
}
}
}
}
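To actually collect every unique value, page through the composite buckets: each response contains an after_key, which you pass back via the after parameter and repeat until no buckets are returned. A sketch of a follow-up page (the key shown is illustrative):
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 100,
        "sources": [
          { "uniq_gender": { "terms": { "field": "host.name.keyword" } } }
        ],
        "after": { "uniq_gender": "service10" }
      }
    }
  }
}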
Your terms agg also accepts a size parameter that sets the number of buckets to be returned. The default is 10.
I would caution you against relying on this approach to find all indexed values of any field that has very high cardinality, as that is a notorious way to blow up the heap use of your nodes. A composite agg is provided for that purpose.

Get an aggregate count in elasticsearch based on particular uniqueid field

I have created an index and indexed documents in Elasticsearch, and it's working fine. The challenge is that I have to get an aggregate count of the Category field based on the UserID field. I have given my sample documents below.
{
"UserID":"A1001",
"Category":"initiated",
"policyno":"5221"
},
{
"UserID":"A1001",
"Category":"pending",
"policyno":"5222"
},
{
"UserID":"A1001",
"Category":"pending",
"policyno":"5223"
},
{
"UserID":"A1002",
"Category":"completed",
"policyno":"5224"
}
Sample output for UserID "A1001":
initiated-1
pending-2
Sample output for UserID "A1002":
completed-1
How do I get aggregate counts like the sample output above from the JSON documents given?
I suggest a terms aggregation as shown in the following:
{
"size": 0,
"aggs": {
"By_ID": {
"terms": {
"field": "UserID.keyword"
},
"aggs": {
"By_Category": {
"terms": {
"field": "Category.keyword"
}
}
}
}
}
}
Here is a snippet of the response:
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"By_ID" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "A1001",
"doc_count" : 3,
"By_Category" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "pending",
"doc_count" : 2
},
{
"key" : "initiated",
"doc_count" : 1
}
]
}
},
{
"key" : "A1002",
"doc_count" : 1,
"By_Category" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "completed",
"doc_count" : 1
}
]
}
}
]
}
}

How to compare 2 fields in elasticsearch

OK, here is an example result from my data in Elasticsearch:
"hits" : [
{
"_index" : "solutionpedia_data",
"_type" : "doc",
"_id" : "nyODP24BA840z5O6WguE",
"_score" : 46.63439,
"_source" : {
"ID" : "1",
"PRODUCT_NAME" : "ATM",
"UPDATEDATE" : "13-FEB-18",
"PROPOSAL" : [
{
}
],
"MARKETING_KIT" : [ ],
"VIDEO" : [ ]
}
},
{
"_index" : "classification",
"_type" : "doc",
"_id" : "5M-r5m4BNYha4zuWalJa",
"_score" : 39.25268,
"_source" : {
"productId" : "1",
"productName" : "ATM",
"productIconUrl" : "media/8ae0f0c3-1402-4559-901e-7ec9b874ce68-prod032.webp",
"type" : "nonconnectivity",
"businessLineId" : "",
"subsidiaries" : "",
"segment" : [],
"productType" : "Efisien",
"tariff" : null,
"tags" : [ ],
"contact" : [],
"mediaId" : [
"Med391"
],
"documentId" : [
"doc260",
"doc261"
],
"createdAt" : "2019-09-22T05:22:46.956Z",
"updatedAt" : "2019-09-22T05:22:46.956Z",
"totalClick" : 46
}
}
]
This is a result from my alias. Can we search for the same data based on 2 different fields? In the example above these are the ID and productId fields. Can we put these 2 objects in one bucket, or compare them?
I tried with an aggregation, but got nothing:
{
"query": {
"match_all": {}
},
"size": 0,
"aggregations": {
"product catalog": {
"terms": {
"field": "productId.keyword",
"min_doc_count": 2,
"size": 100
},
"aggregations": {
"product solped": {
"terms": {
"field": "ID.keyword",
"min_doc_count": 2
}
}
}
}
}
}
Result:
{
"took" : 9,
"timed_out" : false,
"_shards" : {
"total" : 10,
"successful" : 10,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1276,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"product catalog" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
}
You can achieve this with a scripted bucket aggregation, using script logic to define your buckets (pseudocode: if field a exists, use the value of field a; otherwise use the value of field b).
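A minimal sketch of that idea as a terms aggregation over a Painless script, run across both indices (assuming both ID and productId have keyword sub-fields, as in the question's aggregation):
POST solutionpedia_data,classification/_search
{
  "size": 0,
  "aggs": {
    "product_ids": {
      "terms": {
        "script": {
          "lang": "painless",
          "source": "doc.containsKey('productId.keyword') && doc['productId.keyword'].size() > 0 ? doc['productId.keyword'].value : (doc.containsKey('ID.keyword') && doc['ID.keyword'].size() > 0 ? doc['ID.keyword'].value : null)"
        },
        "min_doc_count": 2,
        "size": 100
      }
    }
  }
}
Buckets with a doc_count of at least 2 then hold the ids that occur in both indices. Note that script-based terms aggregations are much slower than field-based ones.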
Another (and better) way to achieve this is to change your data model and indexing logic on the Elasticsearch side and store the information in a field with the same name in both indices.
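One way to do that without changing the producers is an ingest pipeline that renames either source field to one common name at index time; a sketch, with product_id as a hypothetical common field:
PUT _ingest/pipeline/unify-product-id
{
  "processors": [
    { "rename": { "field": "productId", "target_field": "product_id", "ignore_missing": true } },
    { "rename": { "field": "ID", "target_field": "product_id", "ignore_missing": true } }
  ]
}
Documents indexed with ?pipeline=unify-product-id then share a single product_id field that you can aggregate on directly.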
You could also consider the alias data type to make fields with different names in different indices accessible under one common field name. This is also the approach Elastic takes with the Elastic Common Schema specification.
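A sketch of the alias route (available from Elasticsearch 6.4; on 6.x include the mapping type in the URL, e.g. _mapping/doc; product_id is again a hypothetical common name):
PUT classification/_mapping
{
  "properties": {
    "product_id": {
      "type": "alias",
      "path": "productId.keyword"
    }
  }
}
PUT solutionpedia_data/_mapping
{
  "properties": {
    "product_id": {
      "type": "alias",
      "path": "ID.keyword"
    }
  }
}
A terms aggregation on product_id then sees both indices' values under one name.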

ElasticSearch: retrieving documents belonging to buckets

I am trying to retrieve documents for the past year, bucketed into 1-month-wide buckets. I will take the documents from each 1-month bucket and then analyze them further (out of scope of my problem here). From the description, it seems a bucket aggregation is the way to go, but in the bucket response I am getting only the count of documents in each bucket, not the raw documents themselves. What am I missing?
GET command
{
"aggs" : {
"DateHistogram" : {
"date_histogram" : {
"field" : "timestamp",
"interval": "month"
}
}
},
"size" : 0
}
Resulting Output
{
"took" : 138,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1313058,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"DateHistogram" : {
"buckets" : [ {
"key_as_string" : "2015-02-01T00:00:00.000Z",
"key" : 1422748800000,
"doc_count" : 270
}, {
"key_as_string" : "2015-03-01T00:00:00.000Z",
"key" : 1425168000000,
"doc_count" : 459
},
(...and all the other months...)
{
"key_as_string" : "2016-03-01T00:00:00.000Z",
"key" : 1456790400000,
"doc_count" : 136009
} ]
}
}
}
You're almost there; you simply need to add a top_hits sub-aggregation in order to retrieve some documents for each bucket:
POST /your_index/_search
{
"aggs" : {
"DateHistogram" : {
"date_histogram" : {
"field" : "timestamp",
"interval": "month"
},
"aggs": { <--- add this
"docs": {
"top_hits": {
"size": 10
}
}
}
}
},
"size" : 0
}
