Elasticsearch 5.2.2: terms aggregation case insensitive

I am attempting to do a case-insensitive aggregation on a keyword-type field, but I'm having trouble getting this to work.
What I have tried so far is to add a custom analyzer called "lowercase", which uses the "keyword" tokenizer and the "lowercase" filter. I then added a sub-field called "use_lowercase" to the mapping for the field I want to work with. I want to retain the existing "text" and "keyword" parts of the field as well, since I may still want to search for terms within it.
Here is the index definition, including the custom analyzer:
PUT authors
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": "lowercase"
        }
      }
    }
  },
  "mappings": {
    "famousbooks": {
      "properties": {
        "Author": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            },
            "use_lowercase": {
              "type": "text",
              "analyzer": "lowercase"
            }
          }
        }
      }
    }
  }
}
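A quick way to verify the analyzer is the _analyze API, which should show a single lowercased token for a whole value:
GET authors/_analyze
{
  "analyzer": "lowercase",
  "text": "Agatha Christie"
}
This should return one token, agatha christie, rather than separate words.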
Now I add 2 records with the same Author, but with different case:
POST authors/famousbooks/1
{
  "Book": "The Mysterious Affair at Styles",
  "Year": 1920,
  "Price": 5.92,
  "Genre": "Crime Novel",
  "Author": "Agatha Christie"
}
POST authors/famousbooks/2
{
  "Book": "And Then There Were None",
  "Year": 1939,
  "Price": 6.99,
  "Genre": "Mystery Novel",
  "Author": "Agatha christie"
}
So far so good. Now if I do a terms aggregation based on Author,
GET authors/famousbooks/_search
{
  "size": 0,
  "aggs": {
    "authors-aggs": {
      "terms": {
        "field": "Author.use_lowercase"
      }
    }
  }
}
I get the following result:
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [Author.use_lowercase] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "authors",
        "node": "yxcoq_eKRL2r6JGDkshjxg",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [Author.use_lowercase] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
        }
      }
    ],
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [Author.use_lowercase] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
    }
  },
  "status": 400
}
So it seems the aggregation treats the field as text rather than keyword, hence the fielddata error. I would have thought ES would be sophisticated enough to recognize that the field is effectively a single keyword (via the custom analyzer with the keyword tokenizer) and therefore aggregatable, but that doesn't appear to be the case.
If I add "fielddata": true to the mapping for Author, the aggregation works fine, but I'm hesitant to do this given the dire warnings about high heap usage when setting this value.
Is there a best practice for this kind of case-insensitive keyword aggregation? I was hoping I could just say "type": "keyword", "filter": "lowercase" in the mappings section, but that option doesn't seem to exist.
It feels like I'm reaching for too big a stick if I go the "fielddata": true route. Any help on this would be appreciated!

Turns out the solution is to use a custom normalizer instead of a custom analyzer.
PUT authors
{
  "settings": {
    "analysis": {
      "normalizer": {
        "myLowercase": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "famousbooks": {
      "properties": {
        "Author": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            },
            "use_lowercase": {
              "type": "keyword",
              "normalizer": "myLowercase",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
This then allows terms aggregation using field Author.use_lowercase without issue.
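As a quick check of the fix: after re-indexing the two sample documents into this index, the same terms aggregation from the question should now return a single merged bucket:
GET authors/famousbooks/_search
{
  "size": 0,
  "aggs": {
    "authors-aggs": {
      "terms": {
        "field": "Author.use_lowercase"
      }
    }
  }
}
The response should contain one bucket with key agatha christie and doc_count 2 instead of the earlier fielddata error.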

It seems this is not possible by default (without the "lowercase" normalizer described above), but there is a workaround: translate the string into a case-insensitive regular expression and match against that.
For example, for the string "bar", the case-insensitive regex is "[bB][aA][rR]".
I used a Python helper to build such patterns:
import itertools

def case_insensitive_regex_from_string(v):
    if not v:
        return v
    # Pair each character with its swapped-case twin inside a [xX] character class,
    # e.g. "bar" -> "[bB][aA][rR]".
    zip_obj = zip(itertools.cycle('['), v, v.swapcase(), itertools.cycle(']'))
    return ''.join(''.join(x) for x in zip_obj)
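The resulting pattern can then be used anywhere Elasticsearch accepts a Lucene regular expression, for example in a regexp query or in a terms aggregation's include parameter. A sketch, with an illustrative index and field name:
GET myindex/_search
{
  "query": {
    "regexp": {
      "myfield.keyword": "[bB][aA][rR]"
    }
  }
}
Note that this only matches case-insensitively; unlike the normalizer approach, it does not merge differently-cased values into a single aggregation bucket.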

Well you did define use_lowercase as text:
"use_lowercase": {
  "type": "text",
  "analyzer": "lowercase"
}
Try defining it as type keyword instead; that helped me with a similar problem I had with sorting.
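In other words, make the sub-field a keyword. Combined with the normalizer from the accepted answer above, that sub-field would look something like this (a sketch):
"use_lowercase": {
  "type": "keyword",
  "normalizer": "myLowercase"
}
A plain keyword sub-field on its own is also aggregatable and sortable, but it stays case-sensitive.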

Related

Kibana keyword occurrence across documents

I have been unable to show word occurrences in Kibana for a full_text field mapped as "type": "keyword" across the documents in the index.
My first attempt involved adding an analyzer. However, I have been unable to change the documents in any way: the index mapping reflects the analyzer, but no field reflects the analysis.
This is the simplified mapping:
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            },
            "analyzed": {
              "type": "text",
              "analyzer": "rebuilt"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt": {
          "tokenizer": "standard"
        }
      }
    },
    "index.mapping.ignore_malformed": true,
    "index.mapping.total_fields.limit": 2000
  }
}
But I still cannot see the array of words that I expect to be saved under the text.analyzed field; in fact that field does not exist, and I'm wondering why.
It turns out that setting fielddata=true, despite being heavily discouraged, solved my problem (at least for now) and lets me visualize in Kibana the occurrence (absolute frequency) of each word of the text field across documents.
The final version of the proposed simplified mapping therefore became:
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "rebuilt",
          "fielddata": true,
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt": {
          "tokenizer": "standard"
        }
      }
    },
    "index.mapping.ignore_malformed": true,
    "index.mapping.total_fields.limit": 2000
  }
}
This gets rid of the now-useless analyzed field.
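For reference, the Kibana visualization described here boils down to a terms aggregation on the analyzed text field, which is what enabling fielddata makes possible. A sketch, with an illustrative index name and bucket size:
GET myindex/_search
{
  "size": 0,
  "aggs": {
    "word_counts": {
      "terms": {
        "field": "text",
        "size": 50
      }
    }
  }
}
Each bucket key is a token produced by the rebuilt analyzer, and doc_count is the number of documents containing that word.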
I still have to check how Kibana performs. If someone has a performance-safe solution to this problem, please do not hesitate to share it.
Thanks.

Empty value generates mapper_parsing_exception for Elasticsearch completion suggester field

I have a name field which is a completion suggester, and indexing generates a mapper_parsing_exception error, stating value must have a length > 0.
There are indeed some empty values in this field. How do I accommodate them?
ignore_malformed had no effect, either at the properties or index level.
I tried filtering out empty strings in the analyzer, setting a min length:
PUT /genes
{
  "settings": {
    "analysis": {
      "filter": {
        "remove_empty": {
          "type": "length",
          "min": 1
        }
      },
      "analyzer": {
        "keyword_lowercase": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "remove_empty"
          ]
        }
      }
    }
  },
  "mappings": {
    "gene": {
      "properties": {
        "name": {
          "type": "completion",
          "analyzer": "keyword_lowercase"
        }
      }
    }
  }
}
Or filter empty strings as a stopword:
"remove_empty": {
  "type": "stop",
  "stopwords": [""]
}
Attempting to apply a filter to the name mapping generates an unsupported parameter error:
"mappings": {
  "gene": {
    "properties": {
      "name": {
        "type": "completion",
        "analyzer": "keyword_lowercase",
        "filter": "remove_empty"
      }
    }
  }
}
This sure feels like it ought to be simple. Is there a way to do this?
Thanks!
I have faced the same issue. After some research, it seems to me that currently the only option is to change the data (e.g. replace empty values with some dummy non-empty value) before indexing.
But there is also good news: this issue exists on GitHub and was resolved about a month ago. The fix is planned for release in version 6.4.0.
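If the cleanup can't happen in the client, one possible workaround (just a sketch: the pipeline name and placeholder value are illustrative, and it assumes an ES 6.x cluster) is an ingest pipeline with a script processor that substitutes a placeholder for missing or empty names before they reach the completion field:
PUT _ingest/pipeline/non_empty_name
{
  "description": "Replace missing or empty name values with a placeholder",
  "processors": [
    {
      "script": {
        "source": "if (ctx.name == null || ctx.name.isEmpty()) { ctx.name = \"unknown\" }"
      }
    }
  ]
}
Documents are then indexed with the pipeline, e.g. PUT genes/gene/1?pipeline=non_empty_name. Whether a dummy value is acceptable depends on your data.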

Elasticsearch keyword and lowercase and aggregation

I have previously stored some fields with the "keyword" mapping, but they are case sensitive.
To solve this, it is possible to use an analyzer such as:
{
  "index": {
    "analysis": {
      "analyzer": {
        "keyword_lowercase": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
with the mapping
{
  "properties": {
    "field": {
      "type": "string",
      "analyzer": "keyword_lowercase"
    }
  }
}
But then the terms aggregation does not work:
Caused by: java.lang.IllegalArgumentException: Fielddata is disabled on text fields by default. Set fielddata=true on [a] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory.
It works with type=keyword, but type=keyword does not seem to allow an analyzer.
How do I index it as a lowercase keyword and still make it possible to aggregate on it without setting fielddata=true?
If you're using ES 5.2 or above, you can now leverage normalizers for keyword fields. Simply declare your index settings and mappings like this and you're good to go
PUT index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "keyword_lowercase": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "field": {
          "type": "keyword",
          "normalizer": "keyword_lowercase"
        }
      }
    }
  }
}
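Since the fields were indexed under the old mapping, the existing data has to be re-indexed into an index created with these settings (for example with the _reindex API) before the normalizer takes effect. After that, a terms aggregation works directly on the field; a sketch re-using the answer's index and field names:
GET index/_search
{
  "size": 0,
  "aggs": {
    "by_field": {
      "terms": {
        "field": "field"
      }
    }
  }
}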

elasticsearch 5.2 sorting with ICU plugin needs fielddata = true?

I want to sort Elasticsearch result documents with the icu_collation filter, so I have these settings for the index:
"settings": {
  "analysis": {
    "analyzer": {
      "ducet_sort": {
        "tokenizer": "keyword",
        "filter": [ "icu_collation" ]
      }
    }
  }
}
and this mapping:
"mappings": {
  "card": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "sort": {
            "type": "text",
            "analyzer": "ducet_sort",
            "index": false
          }
        }
      }
    }
  }
}
and query:
{
"sort": ["title.sort"]
}
But the query failed:
"caused_by": {
  "type": "illegal_argument_exception",
  "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [title.sort] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
}
The data type suggested for sorting in the documentation is keyword, but the keyword type doesn't support analyzers. In addition, fielddata is not recommended.
So is there a way to sort documents in Elasticsearch with a specific collation, e.g. icu_collation, without fielddata=true?
Thank you.
In Kibana, open the Dev Tools option from the left menu and execute the query below, after adjusting it to your settings:
PUT INDEX_NAME/_mapping/TYPE_NAME?update_all_types
{
  "properties": {
    "FIELD_NAME": {
      "type": "text",
      "fielddata": true
    }
  }
}
Or, with curl from a terminal such as Cygwin (on Windows), execute the query below, again adjusted to your settings:
curl -XPUT http://DOCKER_MACHINE_IP:9200/INDEX_NAME -d '{
  "mappings": {
    "type": {
      "properties": {
        "FIELD_NAME": {
          "type": "text",
          "fielddata": true
        }
      }
    }
  }
}'

ElasticSearch 5.1 Fielddata is disabled in text field by default [ERROR: trying to use aggregation on field]

Having this field in my mapping:
"answer": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
},
I try to execute this aggregation:
"aggs": {
  "answer": {
    "terms": {
      "field": "answer"
    }
  }
}
but I get this error:
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [answer] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
Do I have to change my mapping, or am I using the wrong aggregation? (I just updated from 2.x to 5.1.)
You need to aggregate on the keyword sub-field, like this:
"aggs": {
  "answer": {
    "terms": {
      "field": "answer.keyword"
    }
  }
}
That will work.
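Put together, a complete request would look something like this (the index and type names are placeholders):
GET my_index/my_type/_search
{
  "size": 0,
  "aggs": {
    "answer": {
      "terms": {
        "field": "answer.keyword"
      }
    }
  }
}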
In the aggregation, just add .keyword to answer; it worked for me. For text fields we need to aggregate on the keyword sub-field:
"field": "answer.keyword"
Adding to @Val's answer, you can also set fielddata to true in the mapping itself:
"answer": {
  "type": "text",
  "fielddata": true, <-- add this line
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
},
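If the index already exists, fielddata can also be switched on afterwards with a mapping update instead of re-creating the index, since fielddata is one of the few mapping settings that can be changed on an existing field. A sketch, with placeholder index and type names:
PUT my_index/_mapping/my_type
{
  "properties": {
    "answer": {
      "type": "text",
      "fielddata": true
    }
  }
}
After that the aggregation can target answer directly, though the heap-usage caveat from the error message still applies.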
