Fielddata is disabled on text fields by default in elasticsearch

I have a problem: I updated from Elasticsearch 2.x to 5.1, and now some of my queries no longer work because "Fielddata is disabled on text fields by default" https://www.elastic.co/guide/en/elasticsearch/reference/5.1/fielddata.html. In 2.x it was enabled by default, it seems.
Is there a way to enable fielddata automatically for text fields?
I tried code like this:
curl -XPUT http://localhost:9200/_template/template_1 -d '
{
"template": "*",
"mappings": {
"_default_": {
"properties": {
"fielddata-*": {
"type": "text",
"fielddata": true
}
}
}
}
}'
but it looks like Elasticsearch does not understand the wildcard in the field name there. My temporary workaround is a Python script that runs every 30 minutes, scans all indices, and adds fielddata=true to any new fields.
The problem is that I have string data like "this is cool" in Elasticsearch.
curl -XPUT 'http://localhost:9200/example/exampleworking/1' -d '
{
"myfield": "this is cool"
}'
when trying to aggregate that:
curl 'http://localhost:9200/example/_search?pretty=true' -d '
{
"aggs": {
"foobar": {
"terms": {
"field": "myfield"
}
}
}
}'
"Fielddata is disabled on text fields by default. Set fielddata=true on [myfield]"
The Elasticsearch documentation suggests using .keyword instead of enabling fielddata. However, that does not return the data I want.
curl 'http://localhost:9200/example/_search?pretty=true' -d '
{
"aggs": {
"foobar": {
"terms": {
"field": "myfield.keyword"
}
}
}
}'
returns:
"buckets" : [
{
"key" : "this is cool",
"doc_count" : 1
}
]
which is not correct. When I add fielddata=true, everything works:
curl -XPUT 'http://localhost:9200/example/_mapping/exampleworking' -d '
{
"properties": {
"myfield": {
"type": "text",
"fielddata": true
}
}
}'
and then aggregate
curl 'http://localhost:9200/example/_search?pretty=true' -d '
{
"aggs": {
"foobar": {
"terms": {
"field": "myfield"
}
}
}
}'
returns the correct result:
"buckets" : [
{
"key" : "cool",
"doc_count" : 1
},
{
"key" : "is",
"doc_count" : 1
},
{
"key" : "this",
"doc_count" : 1
}
]
How can I add fielddata=true automatically to all text fields in all indices? Is that even possible? In Elasticsearch 2.x this worked out of the box.

I will answer myself:
curl -XPUT http://localhost:9200/_template/template_1 -d '
{
"template": "*",
"mappings": {
"_default_": {
"dynamic_templates": [
{
"strings2": {
"match_mapping_type": "string",
"mapping": {
"type": "text",
"fielddata": true
}
}
}
]
}
}
}'
This does what I want: now all indices map new string fields as text with fielddata enabled by default.
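The dynamic template above matches on the detected JSON type of a new field, not on the field name, which is why the earlier wildcard inside "properties" failed. Here is a plain-Python sketch of that matching logic; it only mimics the semantics, it is not Elasticsearch code:

```python
# Sketch of how a dynamic template with match_mapping_type "string"
# assigns a mapping to newly seen fields. Illustrative only.

def dynamic_mapping(value):
    """Return the mapping a new field would get under the template."""
    if isinstance(value, bool):          # bool must be checked before int
        return {"type": "boolean"}
    if isinstance(value, str):
        # "strings2" template: every new string field becomes a
        # fielddata-enabled text field.
        return {"type": "text", "fielddata": True}
    if isinstance(value, int):
        return {"type": "long"}
    return None  # other JSON types elided in this sketch

doc = {"myfield": "this is cool", "count": 7}
mapping = {k: dynamic_mapping(v) for k, v in doc.items()}
# mapping["myfield"] is {"type": "text", "fielddata": True}
```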

Adding "fielddata": true allows the text field to be aggregated, but this has performance problems at scale. A better solution is to use a multi-field mapping.
Unfortunately, this is hidden a bit deep in Elasticsearch's documentation, in a warning under the fielddata mapping parameter: https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html#before-enabling-fielddata
Here's a complete example of how this helps with a terms aggregation, tested on Elasticsearch 7.12 as of 2021-04-24:
Mapping (in ES7, under the mappings property of the body of a "put index template" request etc):
{
"properties": {
"bio": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
Four documents indexed:
{
"bio": "Dogs are the best pet."
}
{
"bio": "Cats are cute."
}
{
"bio": "Cats are cute."
}
{
"bio": "Cats are the greatest."
}
Aggregation query:
{
"size": 0,
"aggs": {
"bios_with_cats": {
"filter": {
"match": {
"bio": "cats"
}
},
"aggs": {
"bios": {
"terms": {
"field": "bio.keyword"
}
}
}
}
}
}
Aggregation query results:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"bios_with_cats": {
"doc_count": 3,
"bios": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Cats are cute.",
"doc_count": 2
},
{
"key": "Cats are the greatest.",
"doc_count": 1
}
]
}
}
}
}
Basically, this aggregation says "Of the documents whose bios are like 'cats', how many of each distinct bio are there?" The one document without "cats" in its bio property is excluded, and then the remaining documents are grouped into buckets, one of which has one document and the other has two documents.
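The difference between aggregating on the text field (fielddata=true) and on the .keyword sub-field comes down to which terms are indexed: analyzed tokens versus the whole value. A plain-Python sketch of both behaviors, using a naive lowercase-and-split as a stand-in for the standard analyzer:

```python
from collections import Counter

def analyze(text):
    """Naive stand-in for the standard analyzer: lowercase, split on
    whitespace, strip trailing punctuation. Not the real analyzer."""
    return [t.strip(".,!?") for t in text.lower().split()]

docs = ["Cats are cute.", "Cats are cute.", "Cats are the greatest."]

# Aggregating on the text field with fielddata: one bucket per analyzed
# token; a doc counts once per distinct token it contains.
token_buckets = Counter(t for doc in docs for t in set(analyze(doc)))

# Aggregating on the .keyword sub-field: one bucket per whole value.
keyword_buckets = Counter(docs)
# token_buckets has a "cats" bucket covering all three docs;
# keyword_buckets groups the two identical bios into one bucket.
```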

Related

ELK query to return one record for each product with the max timestamp

On Kibana, I can view logs for various products (product.name) along with timestamp and other information. Here is one of the logs:
{
"_index": "xxx-2017.08.30",
"_type": "logs",
"_id": "xxxx",
"_version": 1,
"_score": null,
"_source": {
"v": "1.0",
"level": "INFO",
"timestamp": "2017-01-30T18:31:50.761Z",
"product": {
"name": "zzz",
"version": "2.1.0-111"
},
"context": {
...
...
}
},
"fields": {
"timestamp": [
1504117910761
]
},
"sort": [
1504117910761
]
}
There are several other logs for the same product, and also several logs for different products.
However, I want to write a query that returns a single record for a given product.name (the one with the maximum timestamp value), and does the same for all other products. That is, one log returned per product, and for each product it should be the one with the maximum timestamp.
How do I achieve this?
I tried to follow the approach listed in:
How to get latest values for each group with an Elasticsearch query?
And created a query:
{
"aggs": {
"group": {
"terms": {
"field": "product.name"
},
"aggs": {
"group_docs": {
"top_hits": {
"size": 1,
"sort": [
{
"timestamp": {
"order": "desc"
}
}
]
}
}
}
}
}
}
But, I got an error that said:
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "Fielddata is disabled on text fields by default. Set fielddata=true on [product.name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
}
],
Do I absolutely need to set fielddata=true for this field in this case? If no, what should I do? If yes, I am not sure how to set it. I tried doing it this way:
curl -XGET 'localhost:9200/xxx*/_search?pretty' -H 'Content-Type: application/json' -d'
{
"properties": {
"product.name": {
"type": "text",
"fielddata": true
}
},
"aggs": {
"group": {
"terms": {
"field": "product.name"
},
"aggs": {
"group_docs": {
"top_hits": {
"size": 1,
"sort": [
{
"timestamp": {
"order": "desc"
}
}
]
}
}
}
}
}
}'
But I think there is something wrong with it (syntactically?) and I get this error:
{
"error" : {
"root_cause" : [
{
"type" : "parsing_exception",
"reason" : "Unknown key for a START_OBJECT in [properties].",
"line" : 3,
"col" : 19
}
],
The reason you got the error is that you tried to aggregate on a text field (product.name); you can't do that in Elasticsearch 5.
You don't need to set fielddata to true. What you need to do is define product.name in the mapping as a multi-field: one field product.name (text) and a second field product.name.keyword.
Like this:
{
"product.name":
{
"type" "text",
"fields":
{
"keyword":
{
"type": "keyword",
"ignore_above": 256
}
}
}
}
Then you need to do the aggregation on product.name.keyword
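What the terms + top_hits aggregation computes can be sketched in plain Python: group the logs by product name, then keep the single log with the maximum timestamp per group. The product names and timestamps below are made up for illustration:

```python
# Sketch of terms(product.name.keyword) + top_hits(size=1, sort timestamp desc):
# group logs by product, keep the newest log per product.
logs = [
    {"product": "zzz", "timestamp": "2017-01-30T18:31:50.761Z"},
    {"product": "zzz", "timestamp": "2017-02-01T10:00:00.000Z"},
    {"product": "yyy", "timestamp": "2017-01-15T09:30:00.000Z"},
]

latest = {}
for log in logs:
    name = log["product"]
    # ISO-8601 timestamps in the same zone compare correctly as strings
    if name not in latest or log["timestamp"] > latest[name]["timestamp"]:
        latest[name] = log
# latest now holds one log per product: the most recent one.
```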

Elasticsearch fielddata - should I use it?

Given an index with documents that have a brand property, we need to create a term aggregation that is case insensitive.
Index definition
Please note the use of fielddata below:
PUT demo_products
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"product": {
"properties": {
"brand": {
"type": "text",
"analyzer": "my_custom_analyzer",
"fielddata": true,
}
}
}
}
}
Data
POST demo_products/product
{
"brand": "New York Jets"
}
POST demo_products/product
{
"brand": "new york jets"
}
POST demo_products/product
{
"brand": "Washington Redskins"
}
Query
GET demo_products/product/_search
{
"size": 0,
"aggs": {
"brand_facet": {
"terms": {
"field": "brand"
}
}
}
}
Result
"aggregations": {
"brand_facet": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "new york jets",
"doc_count": 2
},
{
"key": "washington redskins",
"doc_count": 1
}
]
}
}
If we use keyword instead of text, we end up with two buckets for New York Jets because of the difference in casing.
We're concerned about the performance implications of using fielddata. However, if fielddata is disabled we get the dreaded "Fielddata is disabled on text fields by default."
Any other tips to resolve this, or should we not be so concerned about fielddata?
Starting with ES 5.2 (out today), you can use normalizers on keyword fields in order to (e.g.) lowercase the value.
Normalizers play a role a bit like analyzers do for text fields, though what you can do with them is more restricted. That should help with the issue you're facing.
You'd create the index like this:
PUT demo_products
{
"settings": {
"analysis": {
"normalizer": {
"my_normalizer": {
"type": "custom",
"filter": [ "lowercase" ]
}
}
}
},
"mappings": {
"product": {
"properties": {
"brand": {
"type": "keyword",
"normalizer": "my_normalizer"
}
}
}
}
}
And your query would return this:
"aggregations" : {
"brand_facet" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "new york jets",
"doc_count" : 2
},
{
"key" : "washington redskins",
"doc_count" : 1
}
]
}
}
Best of both worlds!
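The normalizer keeps each value as a single term but lowercases it at index time, so buckets collapse across casings. In plain Python, the effect on the bucket counts is simply this (data mirrors the example above):

```python
from collections import Counter

def normalize(value):
    """Sketch of a custom keyword normalizer with a lowercase filter:
    the value stays one single term, only its casing changes."""
    return value.lower()

brands = ["New York Jets", "new york jets", "Washington Redskins"]

# One bucket per normalized term, exactly as in the aggregation result.
buckets = Counter(normalize(b) for b in brands)
# {"new york jets": 2, "washington redskins": 1}
```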
You can lowercase the aggregation at query time if you use a script. It won't perform as well as a normalized keyword field, but is still quite fast in my experience. For example, your query would be:
GET demo_products/product/_search
{
"size": 0,
"aggs": {
"brand_facet": {
"terms": {
"script": "doc['brand'].value.toLowerCase()"
}
}
}
}

Removing stopwords from basic Terms aggregation in Elasticsearch?

I'm a little new to Elasticsearch, but basically I have a single index called posts with multiple post documents that take the following form:
"post": {
"id": 123,
"message": "Some message"
}
I'm trying to get the most frequently occurring words in the message field across the entire index, with a simple Terms aggregation:
curl -XPOST 'localhost:9200/posts/_search?pretty' -d '
{
"aggs": {
"frequent_words": {
"terms": {
"field": "message"
}
}
}
}
'
Unfortunately, this aggregation includes stopwords, so I end up with a list of words like "and", "the", "then", etc. instead of more meaningful words.
I've tried applying an analyzer to exclude those stopwords, but to no avail:
curl -XPUT 'localhost:9200/posts/?pretty' -d '
{
"settings": {
"analysis": {
"analyzer": {
"standard": {
"type": "standard",
"stopwords": "_english_"
}
}
}
}
}'
Am I applying the analyzer correctly, or am I going about this the wrong way? Thanks!
I guess you forgot to set an analyzer on the message field of your type. Elasticsearch uses its indexed terms when aggregating, which means Elasticsearch won't return your stopwords if you analyze the field correctly. You can check this link. I used the Sense plugin of Kibana to execute the following requests. Check the mapping create request:
PUT /posts
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "standard",
"stopwords": ["test", "testable"]
}
}
}
}
}
# Don't forget these lines
POST /posts/post/_mapping
{
"properties": {
"message": {
"type": "string",
"analyzer": "my_analyzer"
}
}
}
POST posts/post/1
{
"id": 1,
"message": "Some messages"
}
POST posts/post/2
{
"id": 2,
"message": "Some testable message"
}
POST posts/post/3
{
"id": 3,
"message": "Some test message"
}
POST /posts/_search
{
"aggs": {
"frequent_words": {
"terms": {
"field": "message"
}
}
}
}
This is my resultset for this search request :
{
"hits": {
...
},
"aggregations": {
"frequent_words": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "some",
"doc_count": 3
},
{
"key": "message",
"doc_count": 2
},
{
"key": "messages",
"doc_count": 1
}
]
}
}
}
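The key point in the answer above is that a terms aggregation reads the analyzed terms, so stopwords must be removed at index time by the analyzer. The effect can be sketched in plain Python, with a naive lowercase-and-split standing in for the analyzer:

```python
from collections import Counter

STOPWORDS = {"test", "testable"}  # matches the custom analyzer above

def analyze(text):
    """Naive stand-in for a standard analyzer with a stopword filter."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

docs = ["Some messages", "Some testable message", "Some test message"]

# doc_count semantics: each doc contributes each distinct term once.
buckets = Counter(t for doc in docs for t in set(analyze(doc)))
# {"some": 3, "message": 2, "messages": 1} - no stopwords in the buckets
```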
In the latest version, 5.5, the string type has been changed to text/keyword. I enabled stopwords for the title field and it works for search: if I search for "the", nothing is returned. But if I use the following for aggregation:
"field": "message_analyzed.keyword"
I get the stopwords in the aggregation buckets too.
Any suggestion are welcome.
Thanks

Broken aggregation in elasticsearch

I'm getting erroneous results when performing a terms aggregation on the names field in the index.
The following is the mapping I have used for the names field:
{
"dbnames": {
"properties": {
"names": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
Here are the results I'm getting from a simple terms aggregation on the field:
"aggregations": {
"names": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "John Martin",
"doc_count": 1
},
{
"key": "John martin",
"doc_count": 1
},
{
"key": " Victor Moses",
"doc_count": 1
}
]
}
}
As you can see, the same name with different casing is shown as different buckets in the aggregation. What I want is that, irrespective of case, the names are clubbed together.
The easiest way would be to make sure you properly case the value of your names field at indexing time.
If that is not an option, the other way to go about it is to define an analyzer that will do it for you and set that analyzer as index_analyzer for the names field. Such a custom analyzer would need to use the keyword tokenizer (i.e. take the whole value of the field as a single token) and the lowercase token filter (i.e. lowercase the value)
curl -XPUT localhost:9200/your_index -d '{
"settings": {
"index": {
"analysis": {
"analyzer": {
"casing": { <--- custom casing analyzer
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"names": {
"type": "string",
"index_analyzer": "casing" <--- use your custom analyzer
}
}
}
}
}'
Then we can index some data:
curl -XPOST localhost:9200/your_index/your_type/_bulk -d '
{"index":{}}
{"names": "John Martin"}
{"index":{}}
{"names": "John martin"}
{"index":{}}
{"names": "Victor Moses"}
'
And finally, the terms aggregation on the names field returns the expected results:
curl -XPOST localhost:9200/your_index/your_type/_search -d '{
"size": 0,
"aggs": {
"dbnames": {
"terms": {
"field": "names"
}
}
}
}'
Results:
{
"dbnames": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "john martin",
"doc_count": 2
},
{
"key": "victor moses",
"doc_count": 1
}
]
}
}
There are 2 options here:
1. The not_analyzed option - this has the disadvantage that the same string with different casing won't be seen as one term.
2. keyword tokenizer + lowercase filter - this does not have the above issue.
I have neatly outlined these two approaches and how to use them here - https://qbox.io/blog/elasticsearch-aggregation-custom-analyzer

Elasticsearch return unique values for a field

I am trying to build an Elasticsearch query that will return only unique values for a particular field.
I do not want to return all the values for that field nor count them.
For example, say the field currently contains 50 different values, and I do a search that returns only 20 hits (size=20). I want each of the 20 results to have a unique value for that field; I don't care about the 30 other values not represented in the result.
For example with the following search (pseudo code - not checked):
{
from: 0,
size: 20,
query: {
bool: {
must: {
range: { field1: { gte: 50 }},
term: { field2: 'salt' },
/**
* I want to return only unique values for "field3", but I
* don't want to return all of them or count them.
*
* How do I specify this in my query?
**/
unique: 'field3',
},
mustnot: {
match: { field4: 'pepper'},
}
}
}
}
You should be able to do this pretty easily with a terms aggregation.
Here's an example. I defined a simple index, containing a field that has "index": "not_analyzed" so we can get the full text of each field as a unique value, rather than terms generated from tokenizing it, etc.
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
Then I add a few docs with the bulk API.
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"title":"first doc"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"title":"second doc"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"title":"third doc"}
{"index":{"_index":"test_index","_type":"doc","_id":4}}
{"title":"third doc"}
Now we can run our terms aggregation:
POST /test_index/_search?search_type=count
{
"aggs": {
"unique_vals": {
"terms": {
"field": "title"
}
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0,
"hits": []
},
"aggregations": {
"unique_vals": {
"buckets": [
{
"key": "third doc",
"doc_count": 2
},
{
"key": "first doc",
"doc_count": 1
},
{
"key": "second doc",
"doc_count": 1
}
]
}
}
}
I'm very surprised a filter aggregation hasn't been suggested. It goes back all the way to ES version 1.3.
The filter aggregation is similar to a regular filter query, but it can be nested into an aggregation chain to exclude documents that don't meet particular criteria, giving you sub-aggregation results based only on the documents that do.
First, we'll put our mapping.
curl --request PUT \
--url http://localhost:9200/items \
--header 'content-type: application/json' \
--data '{
"mappings": {
"item": {
"properties": {
"field1" : { "type": "integer" },
"field2" : { "type": "keyword" },
"field3" : { "type": "keyword" },
"field4" : { "type": "keyword" }
}
}
}
}
'
Then let's load some data.
curl --request PUT \
--url http://localhost:9200/items/_bulk \
--header 'content-type: application/json' \
--data '{"index":{"_index":"items","_type":"item","_id":1}}
{"field1":50, "field2":["salt", "vinegar"], "field3":["garlic", "onion"], "field4":"paprika"}
{"index":{"_index":"items","_type":"item","_id":2}}
{"field1":40, "field2":["salt", "pepper"], "field3":["onion"]}
{"index":{"_index":"items","_type":"item","_id":3}}
{"field1":100, "field2":["salt", "vinegar"], "field3":["garlic", "chives"], "field4":"pepper"}
{"index":{"_index":"items","_type":"item","_id":4}}
{"field1":90, "field2":["vinegar"], "field3":["chives", "garlic"]}
{"index":{"_index":"items","_type":"item","_id":5}}
{"field1":900, "field2":["salt", "vinegar"], "field3":["garlic", "chives"], "field4":"paprika"}
'
Notice that only the documents with ids 1 and 5 pass the criteria, so we are left to aggregate over these two field3 arrays, four values in total: ["garlic", "chives"] and ["garlic", "onion"]. Also notice that field3 can be an array or a single value in the data, but I'm making them arrays to illustrate how the counts work.
curl --request POST \
--url http://localhost:9200/items/item/_search \
--header 'content-type: application/json' \
--data '{
"size": 0,
"aggregations": {
"top_filter_agg" : {
"filter" : {
"bool": {
"must":[
{
"range" : { "field1" : { "gte":50} }
},
{
"term" : { "field2" : "salt" }
}
],
"must_not":[
{
"term" : { "field4" : "pepper" }
}
]
}
},
"aggs" : {
"field3_terms_agg" : { "terms" : { "field" : "field3" } }
}
}
}
}
'
After running the combined filter/terms aggregation, we have a count of only four terms on field3, and three unique terms altogether.
{
"took": 46,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"top_filter_agg": {
"doc_count": 2,
"field3_terms_agg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "garlic",
"doc_count": 2
},
{
"key": "chives",
"doc_count": 1
},
{
"key": "onion",
"doc_count": 1
}
]
}
}
}
}
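The filter-then-bucket logic of that aggregation can be sketched in plain Python; the data mirrors the bulk request above:

```python
from collections import Counter

items = [
    {"field1": 50,  "field2": ["salt", "vinegar"], "field3": ["garlic", "onion"],  "field4": "paprika"},
    {"field1": 40,  "field2": ["salt", "pepper"],  "field3": ["onion"]},
    {"field1": 100, "field2": ["salt", "vinegar"], "field3": ["garlic", "chives"], "field4": "pepper"},
    {"field1": 90,  "field2": ["vinegar"],         "field3": ["chives", "garlic"]},
    {"field1": 900, "field2": ["salt", "vinegar"], "field3": ["garlic", "chives"], "field4": "paprika"},
]

def matches(doc):
    """Sketch of the bool filter: field1 >= 50, field2 contains 'salt',
    and field4 must not be 'pepper'."""
    return (doc["field1"] >= 50
            and "salt" in doc["field2"]
            and doc.get("field4") != "pepper")

hits = [doc for doc in items if matches(doc)]          # docs 1 and 5
buckets = Counter(t for doc in hits for t in doc["field3"])
# {"garlic": 2, "onion": 1, "chives": 1}
```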