Find list of Distinct string Values stored in a field in ElasticSearch - elasticsearch

I have stored my data in Elasticsearch as shown below. A terms aggregation on the field returns only the distinct words in it, not the entire distinct phrases.
{
"_index" : "test01",
"_type" : "whatever01",
"_id" : "1234",
"_score" : 1.0,
"_source" : {
"company_name" : "State Bank of India",
"user" : ""
}
},
{
"_index" : "test01",
"_type" : "whatever01",
"_id" : "5678",
"_score" : 1.0,
"_source" : {
"company_name" : "State Bank of India",
"user" : ""
}
},
{
"_index" : "test01",
"_type" : "whatever01",
"_id" : "8901",
"_score" : 1.0,
"_source" : {
"company_name" : "Kotak Mahindra Bank",
"user" : ""
}
}
I tried using a terms aggregation:
GET /test01/_search/
{
"aggs" : {
"genres":
{
"terms" :
{ "field": "company_name"}
}
}
}
I get the following output
"aggregations" : {
"genres" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 10531,
"buckets" : [
{
"key" : "bank",
"doc_count" : 2818
},
{
"key" : "mahindra",
"doc_count" : 1641
},
{
"key" : "state",
"doc_count" : 1504
}]
}}
How can I get the entire string in the field "company_name", with only distinct values, as shown below?
"aggregations" : {
"genres" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 10531,
"buckets" : [
{
"key" : "Kotak Mahindra Bank",
"doc_count" : 2818
},
{
"key" : "State Bank of India",
"doc_count" : 1641
}
]
}}

It appears that you've set "fielddata": "true" for your field company_name, which is of type text. This is not a good idea, as it can end up consuming a lot of heap space, as mentioned in this link.
Furthermore, the values of a text field are broken down into tokens and saved in the inverted index by a process called analysis. Setting fielddata on a text field therefore makes the aggregation work on individual tokens, which is the behavior you describe in your question.
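To make the effect of analysis concrete, here is a small Python sketch that imitates a terms aggregation over an analyzed text field and over an exact keyword field. The `analyze` helper is a hypothetical stand-in that only lowercases and splits on whitespace; the real analyzer does more than this, so treat it as an illustration rather than a description of Elasticsearch internals.

```python
from collections import Counter

# Sample values taken from the documents in the question.
company_names = [
    "State Bank of India",
    "State Bank of India",
    "Kotak Mahindra Bank",
]

def analyze(text):
    """Hypothetical stand-in for text analysis: lowercase, split on whitespace."""
    return text.lower().split()

def terms_on_text(values):
    """Count documents per token, roughly what a terms aggregation sees on a text field."""
    counts = Counter()
    for value in values:
        # A document is counted once per distinct token it contains.
        counts.update(set(analyze(value)))
    return counts

def terms_on_keyword(values):
    """Count documents per whole value, as on a keyword field."""
    return Counter(values)

print(terms_on_text(company_names))
print(terms_on_keyword(company_names))
```

The first counter is keyed by single words such as "bank", while the second is keyed by the full company names.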
What you need to do is create a sibling field of type keyword, as mentioned in this link, and perform the aggregation on that field.
Basically modify your mapping for company_name as below:
Mapping:
PUT <your_index_name>
{
"mappings": {
"mydocs": {
"properties": {
"company_name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
}
Run the aggregation query below on the company_name.keyword field and you should get what you are looking for.
Query:
POST <your_index_name>/_search
{
"aggs": {
"unique_names": {
"terms": {
"field": "company_name.keyword", <----- Run on this field
"size": 10
}
}
}
}
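If you call the query from code, the request body and the extraction of the distinct values can be sketched in Python as below. The helper names are hypothetical, the top-level "size": 0 is an addition of mine to skip returning hits, and the sample response is hand-written in the shape of the expected output from the question; no client library is assumed.

```python
def distinct_values_request(field, size=10):
    """Build a terms aggregation body; the top-level size of 0 skips returning hits."""
    return {
        "size": 0,
        "aggs": {"unique_names": {"terms": {"field": field, "size": size}}},
    }

def distinct_values(response):
    """Extract the bucket keys (the distinct values) from a search response."""
    buckets = response["aggregations"]["unique_names"]["buckets"]
    return [bucket["key"] for bucket in buckets]

# Hand-written sample, shaped like the expected output in the question.
sample_response = {
    "aggregations": {
        "unique_names": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "Kotak Mahindra Bank", "doc_count": 2818},
                {"key": "State Bank of India", "doc_count": 1641},
            ],
        }
    }
}

body = distinct_values_request("company_name.keyword")
print(body)
print(distinct_values(sample_response))
```

A terms aggregation returns only the top `size` buckets, so raise that value if you expect more distinct names than that.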
Hope this helps!

Related

How to get different aggregations for different params in bool in Elasticsearch

Document structure:
{
"_index" : "admin",
"_type" : "_doc",
"_id" : "9rEy94EB7k-V-3UYmchn",
"_source" : {
"entity_title" : "Title CPP",
"entity_type" : "type1",
"entity_score" : 185346,
"entity" : {
"customer_id" : "cid1",
"customer_name" : "cname1"
}
}
}
{
"_index" : "admin",
"_type" : "_doc",
"_id" : "9rEy94EB7k-V-3UYmchn",
"_source" : {
"entity_title" : "Title APP",
"entity_type" : "type1",
"entity_score" : 12,
"entity" : {
"customer_id" : "cid2",
"customer_name" : "cname2"
}
}
}
My query
GET /admin/_search
{
"size": 0,
"query" : {
"bool" : {
"should" : [
{
"query_string" : {"default_field" : "entity_title", "query" : "app*"}
},
{
"fuzzy": {"entity_title": {"value": "app"}}
}
]
}
},
"aggs": {
"by_entity_type": {
"terms":{
"field":"entity_type",
"size": 4 <total number of entity types>
},
"aggs": {
"by_top_score":{"top_hits":{"size":10, "sort": {"entity_score": {"order" : "desc", "mode" : "avg"}}}}
}
}
}
}
I need to:
Aggregate all search results by entity_type.
Sort the results of the matched query (query_string) by _score.
Sort the results of the fuzzy search by entity_score.
Please help me fetch these, either in separate aggregations or in the same one.
Thanks.

Is there a way to get related documents if a match occurs after the query?

I am currently doing a fuzzy name search on some documents. These documents can be related to each other (for example, the name field of one document may contain a person's name and another may contain an alias for the same person). I will give related documents the same unique identifier. My question is: can I get all documents with the same unique identifier if a match occurs in any of them?
Suppose that there are 4 documents like this.
[
{
"name": "Bob",
"uid": "1"
},
{
"name": "Bilbo",
"uid": "1"
},
{
"name": "Jack",
"uid": "2"
},
{
"name": "Mary",
"uid": "3"
}
]
When I query name "Bob", I expect to get both documents with "uid" = "1"
[
{
"name": "Bob",
"uid": "1"
},
{
"name": "Bilbo",
"uid": "1"
}
]
Elasticsearch doesn't have the concept of JOINs, so documents cannot be fetched by joining on "uid". There are two ways around this:
1. Using two queries
i. Get documents with name "Bob"
{
"query": {
"term": {
"name.keyword": {
"value": "Bob"
}
}
}
}
ii. Fetch documents using the uid values returned above.
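The two-query approach can be sketched in Python as follows, assuming that the "ids" in step ii are the uid values of the matched documents. The function names are hypothetical, the first response is hand-written, and the second request uses a terms query on uid.keyword; sending the bodies to a cluster is left to whichever client you use.

```python
def name_query(name):
    """Step i: exact match on the keyword sub-field of name."""
    return {"query": {"term": {"name.keyword": {"value": name}}}}

def collect_uids(response):
    """Pull the distinct uid values out of the hits of the first response."""
    return {hit["_source"]["uid"] for hit in response["hits"]["hits"]}

def uid_query(uids):
    """Step ii: fetch every document that shares any of the given uids."""
    return {"query": {"terms": {"uid.keyword": sorted(uids)}}}

# Hand-written response for step i, shaped like a search result.
first_response = {"hits": {"hits": [{"_source": {"name": "Bob", "uid": "1"}}]}}

second_body = uid_query(collect_uids(first_response))
print(name_query("Bob"))
print(second_body)
```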
2. Using terms and bucket selector aggregation
Mapping:
{
"<mapping_name>" : {
"mappings" : {
"properties" : {
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"uid" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
Query:
1. Create a bucket (collection) per uid.
2. Create a sub-bucket on name that includes only "Bob", so uid 1 will have a sub-bucket with key Bob and the sub-buckets of the other uids will be empty.
3. Use a bucket_selector aggregation to keep only the buckets whose name sub-bucket count is greater than or equal to 1. This removes the uids that have no document named Bob.
4. Use a top_hits aggregation to get the documents.
{
"size": 0,
"aggs": {
"uid": {
"terms": {
"field": "uid.keyword",
"size": 10
},
"aggs": {
"documents":{
"top_hits": { --> to get documents under parent term
"size": 10
}
},
"name": {
"terms": {
"field": "name.keyword", --> terms need non_analyzed field so keyword
"include":"Bob", --> get terms with name bob
"size": 10
}
},
"my_bucket":{
"bucket_selector": { --> select buckets which have atleast one name
"buckets_path": {"count":"name._bucket_count"},
"script": "if(params.count>=1) return true;"
}
}
}
}
}
}
Result: All documents with uid 1 (the same uid as "Bob") are returned
"aggregations" : {
"uid" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 2,
"documents" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "index61",
"_type" : "_doc",
"_id" : "uCP1-nAB_Wo5RvhlZM6k",
"_score" : 1.0,
"_source" : {
"name" : "Bob",
"uid" : "1"
}
},
{
"_index" : "index61",
"_type" : "_doc",
"_id" : "uSP1-nAB_Wo5Rvhlbc4S",
"_score" : 1.0,
"_source" : {
"name" : "Bilbo",
"uid" : "1"
}
}
]
}
},
"name" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Bob",
"doc_count" : 1
}
]
}
}
]
}
}
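The four steps of this aggregation can be mimicked over an in-memory list in plain Python, which may help when reasoning about which buckets survive. This is only a simulation of the logic under the sample data from the question, not how Elasticsearch executes it, and `related_documents` is a hypothetical name.

```python
from collections import defaultdict

# The four sample documents from the question.
docs = [
    {"name": "Bob", "uid": "1"},
    {"name": "Bilbo", "uid": "1"},
    {"name": "Jack", "uid": "2"},
    {"name": "Mary", "uid": "3"},
]

def related_documents(documents, name):
    # Step 1: bucket documents by uid (the outer terms aggregation).
    by_uid = defaultdict(list)
    for doc in documents:
        by_uid[doc["uid"]].append(doc)

    result = {}
    for uid, bucket in by_uid.items():
        # Step 2: sub-bucket restricted to the requested name ("include").
        name_bucket_count = 1 if any(d["name"] == name for d in bucket) else 0
        # Step 3: bucket_selector keeps uids whose name sub-bucket is non-empty.
        if name_bucket_count >= 1:
            # Step 4: top_hits returns the documents of the surviving bucket.
            result[uid] = bucket
    return result

print(related_documents(docs, "Bob"))
```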

Elasticsearch returns 0.0 for metrics sum aggregation

Elasticsearch returns 0.0 for a metrics sum aggregation. The expected output is the sum of the metric probe_http_duration_seconds.
Elasticsearch version: 7.1.1
Query used for aggregation:
GET some_metric/_search
{
"query": {
"bool": {
"must": [
{
"range": { "time": { "gte" : "now-1m", "lt": "now" } }
},
{
"match": {"name": "probe_http_duration_seconds"}
},
{
"match": {"labels.instance": "some-instance"}
}
]
}
},
"aggs" : {
"sum_is" : { "sum": { "field" : "value" } }
}
}
The above query returns the matching documents, followed by:
"aggregations" : {
"sum_is" : {
"value" : 0.0
}
}
Each document in the index looks like:
{
"_index" : "some_metric-2019.12.03-000004",
"_type" : "_doc",
"_id" : "_wCjz24Bk6FPpmW1lC31",
"_score" : 5.3475914,
"_source" : {
"name" : "probe_http_duration_seconds",
"time" : 1575441630181,
"value" : 0,
"labels" : {
"__name__" : "probe_http_duration_seconds",
"app" : "some-events",
"i" : "some_metric",
"instance" : "some-instance",
"job" : "someproject-k8s-service",
"kubernetes_name" : "some-events",
"kubernetes_namespace" : "deploytest",
"phase" : "connect",
"t" : "type",
"v" : "1"
}
}
}
When I change must to should in the query, I get:
"aggregations" : {
"sum_is" : {
"value" : 1.5389155527088604E16
}
}
The index dynamic mapping looks something like this:
"mappings" : {
"dynamic_templates" : [
{
"strings" : {
"unmatch" : "*seconds*",
"match_mapping_type" : "string",
"mapping" : {
"type" : "keyword"
}
}
},
{
"to_float" : {
"match" : "*seconds*",
"mapping" : {
"type" : "float"
}
}
}
],
However, our requirement is that results match all of the clauses in the query.
For metrics aggregations Elasticsearch converts everything to double, but that still doesn't explain a result of zero.
Any pointers will be helpful. Thanks for your attention.
NOTE: In the example document the value field is zero; I may have made a mistake while drafting or editing.
Below is the result for the past 2 minutes. It shows that the value field actually holds floats.
Query:
GET some_metric/_search?size=3
{
"_source": ["value"],
"query": {
"bool": {
"must": [
{
"range": { "time": { "gte" : "now-2m", "lt": "now" } }
},
{
"match": {"name": "probe_http_duration_seconds"}
},
{
"match": {"labels.instance": "some-instance"}
}
]
}
}
}
Result:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10,
"relation" : "eq"
},
"max_score" : 14.551308,
"hits" : [
{
"_index" : "some_metric-2019.12.04-000005",
"_type" : "_doc",
"_id" : "7oog0G4Bk6EPplW1ibD1",
"_score" : 14.551308,
"_source" : {
"value" : 0.040022423
}
},
{
"_index" : "some_metric-2019.12.04-000005",
"_type" : "_doc",
"_id" : "74og0G4Bk6EPplW1ibD1",
"_score" : 14.551308,
"_source" : {
"value" : 3.734E-5
}
},
{
"_index" : "some_metric-2019.12.04-000005",
"_type" : "_doc",
"_id" : "A4og0G4Bk6EPplW1ibH1",
"_score" : 14.551308,
"_source" : {
"value" : 0.015694122
}
}
]
}
}
What you see is just what you indexed in the source document; ES never modifies your source document. However, if the value field is mapped as long, as I suspect, ES indexes each float value as a long and not as a float. Fractional values such as 0.04 are then truncated to 0, which is why the sum comes out as 0.0.
This usually happens when the very first document to be indexed has an integer value, such as 0.
You can either reindex your data with the proper mapping, or, since you have time-based indexes, just modify the dynamic template so that tomorrow's index is created correctly.
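The arithmetic behind this can be checked with a few lines of Python. The sketch assumes that a long field drops the fractional part of an incoming float, uses the three values from the question's second result, and approximates the dynamic template's "*seconds*" pattern with a shell-style glob; `index_as_long` and `sum_aggregation` are hypothetical names.

```python
from fnmatch import fnmatchcase

# Values from the second search result in the question.
values = [0.040022423, 3.734e-5, 0.015694122]

def index_as_long(value):
    """Assumed coercion for a long field: the fractional part is dropped."""
    return int(value)

def sum_aggregation(indexed_values):
    """Metric aggregations report doubles, hence the float result."""
    return float(sum(indexed_values))

as_long = [index_as_long(v) for v in values]
print(sum_aggregation(as_long))  # sum over what a long field would hold
print(sum_aggregation(values))   # sum over what a float field would hold

# The "to_float" template matches field names like "*seconds*",
# but the field being summed is named "value".
print(fnmatchcase("value", "*seconds*"))
```

If that reading is right, the template in the question never applies to the value field, so its type is decided by the first document indexed.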

ElasticSearch join data within the same index

I am quite new to Elasticsearch and I am collecting some application logs in a single index. They have this format:
{
"_index" : "app_logs",
"_type" : "_doc",
"_id" : "JVMYi20B0a2qSId4rt12",
"_source" : {
"username" : "mapred",
"app_id" : "application_1569623930006_490200",
"event_type" : "STARTED",
"ts" : "2019-10-02T08:11:53Z"
}
}
I can have different event types. In this case I am interested in STARTED and FINISHED. I would like to query ES to get all the apps that started on a certain day and enrich them with their end time. Basically I want to create start/end pairs (an end might be missing, but that's fine).
I have realized that SQL-style join relations cannot be used in ES, and I was wondering whether I can exploit some other feature to get this result in one query.
Edit: these are the details of the index mapping
{
"app_logs" : {
"mappings" : {
"_doc" : {
"properties" : {
"event_type" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"app_id" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"ts" : {
"type" : "date"
},
"event_type" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}}}}
If I understand correctly, you want to collate the documents that share the same app_id and have a status of either STARTED or FINISHED.
Elasticsearch is not really meant to perform JOIN operations. You can do it, but then you have to design your documents as mentioned in this link.
What you need is an aggregation query.
Below are a sample mapping, sample documents, the aggregation query and its response, which should help you get the desired result.
Mapping:
PUT mystatusindex
{
"mappings": {
"properties": {
"username":{
"type": "keyword"
},
"app_id":{
"type": "keyword"
},
"event_type":{
"type":"keyword"
},
"ts":{
"type": "date"
}
}
}
}
Sample Documents
POST mystatusindex/_doc/1
{
"username" : "mapred",
"app_id" : "application_1569623930006_490200",
"event_type" : "STARTED",
"ts" : "2019-10-02T08:11:53Z"
}
POST mystatusindex/_doc/2
{
"username" : "mapred",
"app_id" : "application_1569623930006_490200",
"event_type" : "FINISHED",
"ts" : "2019-10-02T08:12:53Z"
}
POST mystatusindex/_doc/3
{
"username" : "mapred",
"app_id" : "application_1569623930006_490201",
"event_type" : "STARTED",
"ts" : "2019-10-02T09:30:53Z"
}
POST mystatusindex/_doc/4
{
"username" : "mapred",
"app_id" : "application_1569623930006_490202",
"event_type" : "STARTED",
"ts" : "2019-10-02T09:45:53Z"
}
POST mystatusindex/_doc/5
{
"username" : "mapred",
"app_id" : "application_1569623930006_490202",
"event_type" : "FINISHED",
"ts" : "2019-10-02T09:45:53Z"
}
POST mystatusindex/_doc/6
{
"username" : "mapred",
"app_id" : "application_1569623930006_490203",
"event_type" : "STARTED",
"ts" : "2019-10-03T09:30:53Z"
}
POST mystatusindex/_doc/7
{
"username" : "mapred",
"app_id" : "application_1569623930006_490203",
"event_type" : "FINISHED",
"ts" : "2019-10-03T09:45:53Z"
}
Query:
POST mystatusindex/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"ts": {
"gte": "2019-10-02T00:00:00Z",
"lte": "2019-10-02T23:59:59Z"
}
}
}
],
"should": [
{
"match": {
"event_type": "STARTED"
}
},
{
"match": {
"event_type": "FINISHED"
}
}
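],
"minimum_should_match": 1 <----- Without this, should clauses only affect scoring when must is present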
}
},
"aggs": {
"application_IDs": {
"terms": {
"field": "app_id"
},
"aggs": {
"ids": {
"top_hits": {
"size": 10,
"_source": ["event_type", "app_id"],
"sort": [
{ "event_type": { "order": "desc"}}
]
}
}
}
}
}
}
Notice that for filtering I've used a range query, since you only want documents for that date, and added bool should clauses to filter on STARTED and FINISHED.
Once I have the documents, I use a terms aggregation and a top_hits aggregation to get the desired result.
Result
{
"took" : 12,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"application_IDs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "application_1569623930006_490200", <----- APP ID
"doc_count" : 2,
"ids" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "mystatusindex",
"_type" : "_doc",
"_id" : "1", <--- Document with STARTED status
"_score" : null,
"_source" : {
"event_type" : "STARTED",
"app_id" : "application_1569623930006_490200"
},
"sort" : [
"STARTED"
]
},
{
"_index" : "mystatusindex",
"_type" : "_doc",
"_id" : "2", <--- Document with FINISHED status
"_score" : null,
"_source" : {
"event_type" : "FINISHED",
"app_id" : "application_1569623930006_490200"
},
"sort" : [
"FINISHED"
]
}
]
}
}
},
{
"key" : "application_1569623930006_490202",
"doc_count" : 2,
"ids" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "mystatusindex",
"_type" : "_doc",
"_id" : "4",
"_score" : null,
"_source" : {
"event_type" : "STARTED",
"app_id" : "application_1569623930006_490202"
},
"sort" : [
"STARTED"
]
},
{
"_index" : "mystatusindex",
"_type" : "_doc",
"_id" : "5",
"_score" : null,
"_source" : {
"event_type" : "FINISHED",
"app_id" : "application_1569623930006_490202"
},
"sort" : [
"FINISHED"
]
}
]
}
}
},
{
"key" : "application_1569623930006_490201",
"doc_count" : 1,
"ids" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "mystatusindex",
"_type" : "_doc",
"_id" : "3",
"_score" : null,
"_source" : {
"event_type" : "STARTED",
"app_id" : "application_1569623930006_490201"
},
"sort" : [
"STARTED"
]
}
]
}
}
}
]
}
}
}
Note that the last document with only STARTED appears in the aggregation result as well.
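To turn buckets like these into start/end pairs on the client side, something along the following lines could work. It is a sketch that assumes "ts" is added to the "_source" list of the top_hits aggregation (the response above only returns event_type and app_id); `pair_events` is a hypothetical helper and the sample buckets are hand-written from the sample documents.

```python
def pair_events(buckets):
    """Turn terms/top_hits buckets into (app_id, start_ts, end_ts) tuples."""
    pairs = []
    for bucket in buckets:
        times = {}
        for hit in bucket["ids"]["hits"]["hits"]:
            source = hit["_source"]
            times[source["event_type"]] = source.get("ts")
        # A missing FINISHED event simply yields None for the end time.
        pairs.append((bucket["key"], times.get("STARTED"), times.get("FINISHED")))
    return pairs

# Hand-written buckets shaped like the response above, with "ts" added.
sample_buckets = [
    {"key": "application_1569623930006_490200", "ids": {"hits": {"hits": [
        {"_source": {"event_type": "STARTED", "ts": "2019-10-02T08:11:53Z"}},
        {"_source": {"event_type": "FINISHED", "ts": "2019-10-02T08:12:53Z"}},
    ]}}},
    {"key": "application_1569623930006_490201", "ids": {"hits": {"hits": [
        {"_source": {"event_type": "STARTED", "ts": "2019-10-02T09:30:53Z"}},
    ]}}},
]

for row in pair_events(sample_buckets):
    print(row)
```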
Updated Answer
{
"size":0,
"query":{
"bool":{
"must":[
{
"range":{
"ts":{
"gte":"2019-10-02T00:00:00Z",
"lte":"2019-10-02T23:59:59Z"
}
}
}
],
"should":[
{
"term":{
"event_type.keyword":"STARTED" <----- Changed this
}
},
{
"term":{
"event_type.keyword":"FINISHED" <----- Changed this
}
}
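],
"minimum_should_match":1 <----- Without this, should clauses only affect scoring when must is present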
}
},
"aggs":{
"application_IDs":{
"terms":{
"field":"app_id.keyword" <----- Changed this
},
"aggs":{
"ids":{
"top_hits":{
"size":10,
"_source":[
"event_type",
"app_id"
],
"sort":[
{
"event_type.keyword":{ <----- Changed this
"order":"desc"
}
}
]
}
}
}
}
}
}
Note the changes I've made. Whenever you need exact matches or want to use an aggregation, you need to use a field of keyword type.
In the mapping you've shared, there is no username field but there are two event_type fields. I'm assuming it's just a human error and that one of the fields should be username.
Now if you look carefully, the field event_type is of type text and has a sibling keyword field. I've modified the query to use the keyword field, and when doing that I use a term query.
Try this out and let me know if it helps!

enabled fielddata on text field in ElasticSearch but aggregation is not working

According to the documentation, you can run Elasticsearch aggregations on fields that are of type keyword, on fields that are not text fields, or on text fields that have fielddata set to true in the index mapping.
I am trying to count city_names in an nginx log. It works fine with the int field result, but it does not work with the field city_name, even after I updated the index mapping to set fielddata=true. That should not have been required, as it was of type keyword.
By "does not work" I mean that the buckets come back empty:
"aggregations" : {
"cities" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
Here is the field mapping:
"city_name" : {
"type" : "text",
"fielddata" : true
},
And here is the aggregation query:
curl -XGET --user $pwd --header 'Content-Type: application/json' https://58571402f5464923883e7be42a037917.eu-central-1.aws.cloud.es.io:9243/logstash/_search?pretty -d '{
"aggs" : {
"cities": {
"terms" : { "field": "city_name"}
}
}
}'
If you don't get any error when executing your search, it looks more like a problem with the data. Are you sure you have at least one document with the field city_name filled in?
I tried to reproduce your issue with Elasticsearch 6.6.2.
I created an index:
PUT cities
{
"mappings": {
"city": {
"dynamic": "true",
"properties": {
"id": {
"type": "long"
},
"city_name": {
"type": "text",
"fielddata": true
}
}
}
}
}
I added one document without the city_name
PUT cities/city/1
{
"id": "1"
}
When I performed the search:
GET cities/_search
{
"aggs": {
"cities": {
"terms" : { "field": "city_name"}
}
}
}
I got no buckets in the cities aggregation. But when I added one document with the city name filled:
PUT cities/city/2
{
"id": "2",
"city_name": "London"
}
I got the expected result:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [
{
"_index" : "cities",
"_type" : "city",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"id" : "2",
"city_name" : "London"
}
},
{
"_index" : "cities",
"_type" : "city",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"id" : "1"
}
}
]
},
"aggregations" : {
"cities" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "london",
"doc_count" : 1
}
]
}
}
}
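The reproduction above boils down to two observations: documents that lack the field contribute no bucket, and the bucket key is the analyzed (lowercased) token rather than the stored value. A rough Python imitation, with a hypothetical `terms_aggregation` helper and lowercasing plus whitespace splitting standing in for real analysis:

```python
from collections import Counter

def terms_aggregation(documents, field):
    """Count documents per lowercased token; documents lacking the field are skipped."""
    counts = Counter()
    for doc in documents:
        value = doc.get(field)
        if value is None:
            continue  # nothing indexed for this field, so no bucket
        counts.update(set(value.lower().split()))
    return sorted(counts.items())

only_missing = [{"id": "1"}]
with_city = only_missing + [{"id": "2", "city_name": "London"}]

print(terms_aggregation(only_missing, "city_name"))
print(terms_aggregation(with_city, "city_name"))
```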
