Analyze all uppercase tokens in a field - elasticsearch

I would like to analyze the value of a text field in two ways: using standard analysis, and using a custom analysis that indexes only the all-uppercase tokens in the text.
For example, if the value is "This WHITE cat is very CUTE.", the only tokens that should be indexed by the custom analysis are "WHITE" and "CUTE". For this, I am using the Pattern Capture Token Filter with the pattern "(\b[A-Z]+\b)+?". But this indexes all tokens, not just the uppercase ones.
Is the Pattern Capture Token Filter the right one to use for this task? If yes, what am I doing wrong? If not, how do I get this done? Please help.

You should instead use a pattern_replace char_filter:
PUT test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "filter_lowercase": {
          "type": "pattern_replace",
          "pattern": "[A-Z][a-z]+|[a-z]+",
          "replacement": ""
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "filter_lowercase"
          ]
        }
      }
    }
  }
}

GET test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "This WHITE cat is very CUTE"
}
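To index the field both ways, as the question asks, a multi-field along these lines should work; this is a sketch in which the field name body and the sub-field name uppercase are placeholders, and on versions before 7.x you would also need to include the mapping type:
PUT test/_mapping
{
  "properties": {
    "body": {
      "type": "text",
      "analyzer": "standard",
      "fields": {
        "uppercase": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
Queries against body then use standard analysis, while body.uppercase contains only the all-uppercase tokens.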

Related

Extend Elasticsearch's standard Analyzer with additional characters to tokenize on

I basically want the functionality of the built-in standard analyzer, but additionally tokenizing on underscores.
Currently the standard analyzer keeps brown_fox_has as a single token, but I want [brown, fox, has] instead. The simple analyzer loses some functionality compared to the standard one, so I want to keep the standard behaviour as much as possible.
The docs only show how to add filters and other non-tokenizer changes, but I want to keep all of the standard tokenizer while additionally splitting on underscores.
I could create a character filter to map _ to - and the standard tokenizer would do the job for me, but is there a better way?
es.indices.create(index="mine", body={
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "custom",
                    # "tokenize_on_chars": ["_"],  # I want this to work with the standard tokenizer without using char group
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        },
    }
})
res = es.indices.analyze(index="mine", body={
    "field": "text",
    "text": "the quick brown_fox_has to be split"
})
Use a mapping char_filter and define it along with your preferred standard tokenizer:
POST /_analyze
{
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "_ =>\\u0020" // replace underscore with whitespace
      ]
    }
  ],
  "tokenizer": "standard",
  "text": "the quick brown_fox_has to be split"
}
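If you want this behaviour baked into the index rather than tested through an ad-hoc _analyze call, the same mapping char_filter can be wired into the default analyzer. A minimal sketch, in which the char_filter name underscore_to_space is illustrative:
PUT mine
{
  "settings": {
    "analysis": {
      "char_filter": {
        "underscore_to_space": {
          "type": "mapping",
          "mappings": [ "_ =>\\u0020" ]
        }
      },
      "analyzer": {
        "default": {
          "type": "custom",
          "char_filter": [ "underscore_to_space" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
brown_fox_has is then turned into "brown fox has" before the standard tokenizer runs, so it splits into [brown, fox, has].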

Defining a custom tokenizer in Elasticsearch

This is how I am trying to define a custom tokenizer in ES:
"pattern" : "[\-s+]",
but when I run this I get the response shown below:
"pattern" : """[-s+]""",
Notice that in the output I get additional quotes at the beginning and the end: "pattern" : """[-s+]""". If I don't use any escape characters this works fine, but when using an escape character it results in double quotes being appended. Any help?
\ is a reserved character in Lucene's regex syntax, so you have to escape it.
https://www.elastic.co/guide/en/elasticsearch/reference/current/regexp-syntax.html
Please try it this way:
PUT test_varun
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[\\-s+]"
        }
      }
    }
  }
}
If that doesn't do it, please attach an example input/output so I can reproduce it on my end.
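For reference, once the index exists you can sanity-check the analyzer with the _analyze API; the sample text here is only an illustration, which should split on the - and + characters:
GET test_varun/_analyze
{
  "analyzer": "my_analyzer",
  "text": "one-two+three"
}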

Elasticsearch strange filter behaviour

I'm trying to replace a particular string inside a field. So I used custom analyser and character filter just as it's described in the docs, but it didn't work.
Here are my index settings:
{
  "settings": {
    "analysis": {
      "char_filter": {
        "doule_colon_to_space": {
          "type": "mapping",
          "mappings": [ "::=> " ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "doule_colon_to_space" ],
          "tokenizer": "standard"
        }
      }
    }
  }
}
which should replace all double colons (::) in a field with spaces. I then update my mapping to use the analyzer:
{
  "posts": {
    "properties": {
      "id": {
        "type": "long"
      },
      "title": {
        "type": "string",
        "analyzer": "my_analyzer",
        "fields": {
          "simple": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}
Then I put a document in the index:
{
  "id": 1,
  "title": "Person::Bruce Wayne"
}
I then test whether the analyzer works, but it appears it doesn't - when I send this https://localhost:/first_test/_analyze?analyzer=my_analyzer&text=Person::Someone+Close, I get two tokens back - 'PersonSomeone' (joined together) and 'Close'. Am I doing this right? Maybe I should escape the space somehow? I use Elasticsearch 1.3.4.
I think the whitespace in your char_filter pattern is being ignored. Try using the unicode escape sequence for a single space instead:
"mappings": [ "::=>\\u0020"]
Update:
In response to your comment, the short answer is yes, the example is wrong. The docs do suggest that you can use a mapping character filter to replace a token with another one which is padded by whitespace, but the code disagrees.
The source code for the MappingCharFilterFactory uses this regex to parse the settings:
// source => target
private static Pattern rulePattern = Pattern.compile("(.*)\\s*=>\\s*(.*)\\s*$");
This regex matches (and effectively discards) any whitespace (\\s*) surrounding the second replacement token ((.*)), so it seems that you cannot use leading or trailing whitespace as part of your replacement mapping (though it could include interstitial whitespace). Even if the regex were different, the matched token is trim()ed, which would have removed any leading and trailing whitespace.
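Putting that together, a corrected version of the original settings would look roughly like this; only the mapping value changes, everything else stays as posted:
{
  "settings": {
    "analysis": {
      "char_filter": {
        "doule_colon_to_space": {
          "type": "mapping",
          "mappings": [ "::=>\\u0020" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "doule_colon_to_space" ],
          "tokenizer": "standard"
        }
      }
    }
  }
}
With this, the test text Person::Someone Close should produce the tokens Person, Someone and Close instead of the joined PersonSomeone.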

Elasticsearch Analysis token filter doesn't capture pattern

I made a custom analyzer in my test index:
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "myFilter": {
          "type": "pattern_capture",
          "patterns": ["\\d+(,\\d+)*(\\.\\d+)?[%$€£¥]?"],
          "preserve_original": 1
        }
      },
      "analyzer": {
        "myAnalyzer": {
          "type": "custom",
          "tokenizer": "myTokenizer",
          "filters": ["myFilter"]
        }
      },
      "tokenizer": {
        "myTokenizer": {
          "type": "pattern",
          "pattern": "([^\\p{N}\\p{L}%$€£¥##'\\-&]+)|((?<=[^\\p{L}])['\\-&]|^['\\-&]|['\\-&](?=[^\\p{L}])|['\\-&]$)|((?<=[^\\p{N}])[$€£¥%]|^[$€£¥%]|(?<=[$€£¥%])(?=\\d))"
        }
      }
    }
  }
}
It is supposed to keep numbers like 123,234.56$ as a single token,
but when such a number is provided it spits out three tokens: 123, 234 and 56$.
The sample of the failing test query:
GET test/Stam/_termvector?pretty=true
{
  "doc": {
    "Stam": {
      "fld": "John Doe",
      "txt": "100,234.54%"
    }
  },
  "per_field_analyzer": {
    "Stam.txt": "myAnalyzer"
  },
  "fields": ["Stam.txt"],
  "offsets": true,
  "positions": false,
  "payloads": false,
  "term_statistics": false,
  "field_statistics": false
}
Can anyone figure out what the reason is?
In every other case ',' and '.' are definitely delimiters, which is why I added the filter for this purpose, but unfortunately it doesn't work.
Thanks in advance.
The answer is quite simple: a token filter cannot combine tokens, by design. This has to be done through char filters, which are applied to the character stream before the tokenizer starts splitting it into tokens.
I only had to make sure that the custom tokenizer would not split my tokens.
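A quick way to see this is to run the analyzer over the failing input with the _analyze API (assuming the test index above was created) and inspect which tokens come out:
GET test/_analyze
{
  "analyzer": "myAnalyzer",
  "text": "100,234.54%"
}
If the number already comes back in pieces, the split is happening in the tokenizer pattern, not in the pattern_capture filter, since a token filter only sees the tokens the tokenizer hands it.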

Semi-exact (complete) match in ElasticSearch

Is there a way to require a complete (though not necessarily exact) match in ElasticSearch?
For instance, if a field has the term "I am a little teapot short and stout", I would like to match on " i am a LITTLE TeaPot short and stout! " but not just "teapot short and stout". I've tried the term filter, but that requires an actual exact match.
If your "not necessarily exact" definition refers to uppercase/lowercase letters combination and the punctuation marks (like ! you have in your example), this would be a solution, not too simple and obvious tough:
The mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword_lowercase": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "trim",
            "my_pattern_replace"
          ]
        }
      },
      "filter": {
        "my_pattern_replace": {
          "type": "pattern_replace",
          "pattern": "!",
          "replacement": ""
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_keyword_lowercase"
        }
      }
    }
  }
}
The idea here is the following:
use a keyword tokenizer to keep the text as is, so it is not split into tokens
use the lowercase filter to get rid of the mixed uppercase/lowercase characters
use the trim filter to get rid of leading and trailing whitespace
use a pattern_replace filter to get rid of the punctuation. This is needed because the keyword tokenizer won't do anything to the characters inside the text; the standard analyzer would strip punctuation, but it would also split the text, whereas you need it as is
And this is the query you would use for the mapping above:
{
  "query": {
    "match": {
      "text": " i am a LITTLE TeaPot short and stout! "
    }
  }
}
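To verify that the query string and the stored value normalize to the same single token, the same chain can be tried inline with the _analyze API; this mirrors the my_keyword_lowercase analyzer above:
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    "lowercase",
    "trim",
    { "type": "pattern_replace", "pattern": "!", "replacement": "" }
  ],
  "text": " i am a LITTLE TeaPot short and stout! "
}
Both this string and "I am a little teapot short and stout" should reduce to the single token i am a little teapot short and stout, which is why the match query above finds the document.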
