Elasticsearch keyword and lowercase and aggregation

I have previously stored some fields with the mapping "keyword", but they are case sensitive.
To solve this, it is possible to use an analyzer, such as
{
  "index": {
    "analysis": {
      "analyzer": {
        "keyword_lowercase": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
with the mapping
{
  "properties": {
    "field": {
      "type": "string",
      "analyzer": "keyword_lowercase"
    }
  }
}
But then the terms aggregation on the field does not work:
Caused by: java.lang.IllegalArgumentException: Fielddata is disabled on text fields by default. Set fielddata=true on [a] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory.
Aggregation works with type=keyword, but type=keyword does not seem to allow an analyzer.
How do I index it as a lowercase keyword but still make it possible to use aggregation without setting fielddata=true?

If you're using ES 5.2 or above, you can now leverage normalizers for keyword fields. Simply declare your index settings and mappings like this and you're good to go
PUT index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "keyword_lowercase": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "field": {
          "type": "keyword",
          "normalizer": "keyword_lowercase"
        }
      }
    }
  }
}
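As a quick sanity check (a sketch; the index, type, and field names simply follow the example above), indexing two values that differ only in case and then aggregating should produce a single lowercased bucket:
PUT index/type/1
{
  "field": "Foo BAR"
}

PUT index/type/2
{
  "field": "foo bar"
}

GET index/_search
{
  "size": 0,
  "aggs": {
    "distinct_fields": {
      "terms": {
        "field": "field"
      }
    }
  }
}
Both documents should land in one "foo bar" bucket with a doc_count of 2.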

Related

kibana keyword occurrence across documents

I have been unable to show word occurrences across documents in Kibana for a full_text field mapped as "type": "keyword".
My first attempt involved the use of an analyzer. However, I have been unable to change the documents in any way: the index mapping reflects the analyzer, but no field reflects the analysis.
This is the simplified mapping:
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            },
            "analyzed": {
              "type": "text",
              "analyzer": "rebuilt"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt": {
          "tokenizer": "standard"
        }
      }
    },
    "index.mapping.ignore_malformed": true,
    "index.mapping.total_fields.limit": 2000
  }
}
but I'm still unable to see the array of words that I expect to be saved under the text.analyzed field; indeed, that field does not exist and I'm wondering why.
It seems that setting fielddata=true, in spite of being heavily discouraged, solved my problem (at least for now), and allows me to visualize in Kibana the occurrence (or absolute frequency) of each word in the text field across documents.
The final version of the proposed simplified mapping therefore became:
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "rebuilt",
          "fielddata": true,
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt": {
          "tokenizer": "standard"
        }
      }
    },
    "index.mapping.ignore_malformed": true,
    "index.mapping.total_fields.limit": 2000
  }
}
This also gets rid of the now-useless analyzed field.
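For reference, the word counts Kibana displays correspond to a terms aggregation over the analyzed tokens, which is exactly what fielddata enables; a minimal sketch against the mapping above (the index name and aggregation name are placeholders):
GET myindex/_search
{
  "size": 0,
  "aggs": {
    "word_counts": {
      "terms": {
        "field": "text",
        "size": 50
      }
    }
  }
}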
I still have to check the performance in Kibana. If someone has a performance-safe solution to this problem, please do not hesitate to share it.
Thanks.

Empty value generates mapper_parsing_exception for Elasticsearch completion suggester field

I have a name field which is a completion suggester, and indexing generates a mapper_parsing_exception error, stating value must have a length > 0.
There are indeed some empty values in this field. How do I accommodate them?
ignore_malformed had no effect, either at the properties or index level.
I tried filtering out empty strings in the analyzer, setting a min length:
PUT /genes
{
  "settings": {
    "analysis": {
      "filter": {
        "remove_empty": {
          "type": "length",
          "min": 1
        }
      },
      "analyzer": {
        "keyword_lowercase": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "remove_empty"
          ]
        }
      }
    }
  },
  "mappings": {
    "gene": {
      "properties": {
        "name": {
          "type": "completion",
          "analyzer": "keyword_lowercase"
        }
      }
    }
  }
}
Or filter empty strings as a stopword:
"remove_empty": {
"type": "stop",
"stopwords": [""]
}
Attempting to apply a filter to the name mapping generates an unsupported parameter error:
"mappings": {
"gene": {
"name": {
"type": "completion",
"analyzer": "keyword_lowercase",
"filter": "remove_empty"
}
}
}
}
This sure feels like it ought to be simple. Is there a way to do this?
Thanks!
I have faced the same issue. After some research, it seems to me that currently the only option is to change the data (e.g. replace empty values with some dummy non-empty values) before indexing.
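If changing the data upstream is inconvenient, one way to do the replacement inside Elasticsearch itself is an ingest pipeline that substitutes a dummy value for empty names before they reach the completion field. A sketch, assuming an ES 5.x node with ingest enabled (the pipeline name and the "unknown" placeholder are made up; newer versions use "source" instead of "inline"):
PUT _ingest/pipeline/fill_empty_name
{
  "description": "Replace empty name values so the completion field accepts the document",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "inline": "if (ctx.name == null || ctx.name == '') { ctx.name = 'unknown' }"
      }
    }
  ]
}

PUT genes/gene/1?pipeline=fill_empty_name
{
  "name": ""
}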
But there is also good news: this issue exists on GitHub and was resolved about a month ago. The fix is planned for release in version 6.4.0.

Elasticsearch 5.2.2: terms aggregation case insensitive

I am attempting to do a case-insensitive aggregation on a keyword type field, but I'm having issues in getting this to work.
What I have tried so far is to add a custom analyzer called "lowercase" which uses the "keyword" tokenizer and the "lowercase" filter. I then added a sub-field called "use_lowercase" to the mapping of the field I want to work with. I wanted to retain the existing "text" and "keyword" field components as well, since I may want to search for terms within the field.
Here is the index definition, including the custom analyzer:
PUT authors
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": "lowercase"
        }
      }
    }
  },
  "mappings": {
    "famousbooks": {
      "properties": {
        "Author": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            },
            "use_lowercase": {
              "type": "text",
              "analyzer": "lowercase"
            }
          }
        }
      }
    }
  }
}
Now I add 2 records with the same Author, but with different case:
POST authors/famousbooks/1
{
  "Book": "The Mysterious Affair at Styles",
  "Year": 1920,
  "Price": 5.92,
  "Genre": "Crime Novel",
  "Author": "Agatha Christie"
}

POST authors/famousbooks/2
{
  "Book": "And Then There Were None",
  "Year": 1939,
  "Price": 6.99,
  "Genre": "Mystery Novel",
  "Author": "Agatha christie"
}
So far so good. Now if I do a terms aggregation based on Author,
GET authors/famousbooks/_search
{
  "size": 0,
  "aggs": {
    "authors-aggs": {
      "terms": {
        "field": "Author.use_lowercase"
      }
    }
  }
}
I get the following result:
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [Author.use_lowercase] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "authors",
        "node": "yxcoq_eKRL2r6JGDkshjxg",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [Author.use_lowercase] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
        }
      }
    ],
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [Author.use_lowercase] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
    }
  },
  "status": 400
}
So it seems the aggregation treats the target field as text instead of keyword, hence the fielddata error. I would have thought ES would be sophisticated enough to recognize that the terms field is in fact a keyword (via the custom analyzer) and therefore aggregatable, but that doesn't appear to be the case.
If I add "fielddata": true to the mapping for Author, the aggregation then works fine, but I'm hesitant to do this given the dire warnings of high heap usage when setting this value.
Is there a best practice for doing this type of case-insensitive keyword aggregation? I was hoping I could just say "type": "keyword", "filter": "lowercase" in the mappings section, but that does not appear to be available.
It feels like I'm having to use too big a stick to get this to work if I go the "fielddata": true route. Any help on this would be appreciated!
Turns out the solution is to use a custom normalizer instead of a custom analyzer.
PUT authors
{
  "settings": {
    "analysis": {
      "normalizer": {
        "myLowercase": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "famousbooks": {
      "properties": {
        "Author": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            },
            "use_lowercase": {
              "type": "keyword",
              "normalizer": "myLowercase",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
This then allows terms aggregation using field Author.use_lowercase without issue.
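With this in place, re-running the earlier aggregation request should collapse both spellings into a single bucket; the relevant part of the response would look roughly like this (abridged):
{
  "aggregations": {
    "authors-aggs": {
      "buckets": [
        {
          "key": "agatha christie",
          "doc_count": 2
        }
      ]
    }
  }
}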
It seems this is not possible by default (without the "lowercase" normalizer), but without it you can use a trick: translate the string into a case-insensitive regex.
E.g. for the string "bar", a case-insensitive regex would be "[bB][aA][rR]".
I used a Python helper to do this:
import itertools

def case_insensitive_regex_from_string(v):
    # Turn e.g. "bar" into "[bB][aA][rR]": each character becomes a
    # [xX] character class pairing it with its swapped-case twin.
    if not v:
        return v
    zip_obj = zip(itertools.cycle('['), v, v.swapcase(), itertools.cycle(']'))
    return ''.join(''.join(x) for x in zip_obj)
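The helper's output can then be used, for example, in a regexp query against the raw keyword sub-field, which must match the whole term (a sketch reusing the authors index above; note the helper turns the space into a character class of spaces and does not escape regex-special characters):
GET authors/famousbooks/_search
{
  "query": {
    "regexp": {
      "Author.keyword": "[aA][gG][aA][tT][hH][aA][  ][cC][hH][rR][iI][sS][tT][iI][eE]"
    }
  }
}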
Well you did define use_lowercase as text:
"use_lowercase": {
"type": "text",
"analyzer": "lowercase"
}
Try defining it as type: keyword - It helped me with a similar problem I had with sorting.

elasticsearch 5.2 sorting with ICU plugin needs fielddata = true?

I want to sort Elasticsearch result documents with the icu_collation filter. So I have these settings for the index:
"settings": {
"analysis": {
"analyzer": {
"ducet_sort": {
"tokenizer": "keyword",
"filter": [ "icu_collation" ]
}
}
}
}
and mappings
"mappings": {
"card": {
"properties": {
"title": {
"type": "text",
"fields": {
"sort": {
"type": "text",
"analyzer": "ducet_sort",
"index": false
}
}
}
}}}
and query:
{
  "sort": ["title.sort"]
}
But the query failed:
"caused_by": {
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [title.sort] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
}
The documentation suggests the keyword data type for sorting, but the keyword type doesn't support an analyzer. In addition, the documentation discourages enabling fielddata.
So is there a way to sort documents in Elasticsearch with a specific collation, e.g. icu_collation, without fielddata=true?
Thank you.
In Kibana, open the Dev Tools option from the left menu and execute the query below, after updating it according to your settings.
PUT INDEX_NAME/_mapping/TYPE_NAME?update_all_types
{
  "properties": {
    "FIELD_NAME": {
      "type": "text",
      "fielddata": true
    }
  }
}
or, through curl or a terminal like Cygwin (for Windows), execute the query below, after updating it according to your settings:
curl -XPUT http://DOCKER_MACHINE_IP:9200/INDEX_NAME -d '{
  "mappings": {
    "type": {
      "properties": {
        "FIELD_NAME": {
          "type": "text",
          "fielddata": true
        }
      }
    }
  }
}'

How to implement case sensitive search in elasticsearch?

I have a field in my indexed documents where I need the search to be case sensitive. I am using the match query to fetch results.
An example of my data document is :
{
  "name": "binoy",
  "age": 26,
  "country": "India"
}
Now when I give the following query:
{
  "query": {
    "match": {
      "name": "Binoy"
    }
  }
}
It gives me a match for "binoy" against "Binoy". I want the search to be case sensitive. It seems that, by default, Elasticsearch goes with case-insensitive matching. How do I make the search case sensitive in Elasticsearch?
In the mapping you can define the field as not_analyzed.
curl -X PUT "http://localhost:9200/sample" -d '{
  "index": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'
echo
curl -X PUT "http://localhost:9200/sample/data/_mapping" -d '{
  "data": {
    "properties": {
      "name": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}'
Now if you index and search as normal, the field won't be analyzed, which ensures the search is case sensitive.
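To illustrate (a sketch against the sample index created above), index one document and search with both casings; only the exact-case query should hit, because the stored value is compared verbatim:
curl -X PUT "http://localhost:9200/sample/data/1" -d '{
  "name": "binoy"
}'
echo
curl -X GET "http://localhost:9200/sample/data/_search" -d '{
  "query": { "match": { "name": "binoy" } }
}'
echo
curl -X GET "http://localhost:9200/sample/data/_search" -d '{
  "query": { "match": { "name": "Binoy" } }
}'
The first search returns the document; the second returns no hits, since "Binoy" differs in case from the unanalyzed value.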
It depends on the mapping you have defined for your field name. If you haven't defined any mapping, Elasticsearch will treat it as a string and use the standard analyzer (which lower-cases the tokens) to generate tokens. Your query will also use the same analyzer for search, hence matching is done on lower-cased input. That's why "Binoy" matches "binoy".
To solve it, you can define a custom analyzer without the lowercase filter and use it for your field name. You can define the analyzer as below:
"analyzer": {
"casesensitive_text": {
"type": "custom",
"tokenizer": "standard",
"filter": ["stop", "porter_stem" ]
}
}
You can define the mapping for name as below
"name": {
"type": "string",
"analyzer": "casesensitive_text"
}
Now you can do the search on name.
Note: the analyzer above is for example purposes; you may need to change it to suit your needs.
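To verify that the analyzer preserves case, you can run it through the _analyze API (a sketch; assumes the analyzer is registered in the settings of an index called myindex):
GET myindex/_analyze
{
  "analyzer": "casesensitive_text",
  "text": "Binoy is here"
}
The returned tokens should keep their original casing (e.g. Binoy), with English stopwords such as "is" removed by the stop filter and stemming applied by porter_stem.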
Have your mapping like:
PUT /whatever
{
  "settings": {
    "analysis": {
      "analyzer": {
        "mine": {
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "mine"
        }
      }
    }
  }
}
meaning, no lowercase filter for that custom analyzer.
Here is the full index template which worked for my Elasticsearch 5.6:
{
  "template": "logstash-*",
  "settings": {
    "analysis": {
      "analyzer": {
        "case_sensitive": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["stop", "porter_stem"]
        }
      }
    },
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "fluentd": {
      "properties": {
        "message": {
          "type": "text",
          "fields": {
            "case_sensitive": {
              "type": "text",
              "analyzer": "case_sensitive"
            }
          }
        }
      }
    }
  }
}
As you can see, the logs come from Fluentd and are saved into a time-based index logstash-*. To make sure I can still execute wildcard queries on the message field, I put a multi-field mapping on that field. Wildcard/analyzed queries can be run against the message field, and the case-sensitive ones against the message.case_sensitive field.
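A query that should only match the exact casing would then go against the sub-field, roughly like this (the search term is illustrative):
GET logstash-*/_search
{
  "query": {
    "match": {
      "message.case_sensitive": "Error"
    }
  }
}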
