Elasticsearch custom analyzer with two output tokens

The requirement is to create a custom analyzer that can generate two tokens, as shown in the scenario below.
E.g.
Input -> B.tech in
Output Tokens ->
- btechin
- b.tech in
I am able to remove the non-alphanumeric characters, but how do I also retain the original token in the output token list? Below is the custom analyzer that I have created.
"alphanumericStringAnalyzer": {
"filter": [
"lowercase",
"minLength_filter"],
"char_filter": [
"specialCharactersFilter"
],
"type": "custom",
"tokenizer": "keyword"
}
"char_filter": {
"specialCharactersFilter": {
"pattern": "[^A-Za-z0-9]",
"type": "pattern_replace",
"replacement": ""
}
},
This analyzer generates the single token "btechin" for the input "B.tech in", but I also want the original text "b.tech in" in the token list.
Thanks!

You can use the word_delimiter token filter, as described in the documentation.
Here is an example of a word_delimiter configuration:
POST _analyze
{
  "text": "B.tech in",
  "tokenizer": "keyword",
  "filter": [
    "lowercase",
    {
      "type": "word_delimiter",
      "catenate_all": true,
      "preserve_original": true,
      "generate_word_parts": false
    }
  ]
}
results :
{
  "tokens": [
    {
      "token": "b.tech in",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    },
    {
      "token": "btechin",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    }
  ]
}
I hope it will fulfill your requirements!
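If you want the same behaviour baked into the index settings as a named custom analyzer (the index name and the filter name preserve_original_delimiter below are my own illustrative choices, not from the original question), a minimal sketch could look like this:
PUT my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "preserve_original_delimiter": {
          "type": "word_delimiter",
          "catenate_all": true,
          "preserve_original": true,
          "generate_word_parts": false
        }
      },
      "analyzer": {
        "alphanumericStringAnalyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "preserve_original_delimiter"
          ]
        }
      }
    }
  }
}
Analyzing "B.tech in" with this analyzer should then yield both "b.tech in" and "btechin", as in the _analyze example above.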

Related

Synonym token filter

I created a test index with a synonym token filter:
PUT /synonyms-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "shares", "equity", "stock"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}
Then I ran the analyze API:
POST synonyms-index/_analyze
{
  "analyzer": "my_synonyms",
  "text": "equity awesome"
}
I got the following response, showing which tokens go into the inverted index. I was expecting "shares" and "stock" to be added as per the synonym rule, but that does not seem to happen. Am I missing anything here?
{
  "tokens": [
    {
      "token": "equity",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "awesome",
      "start_offset": 7,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
Posting the answer for the community:
This is a common pitfall with the synonyms JSON syntax. Everything that constitutes one rule must go inside a single double-quoted string (this one uses simple expansion):
"synonyms": [ "shares,equity,stock" ]
rather than
"synonyms": [ "shares", "equity", "stock" ]
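For completeness, the corrected index definition only changes that one line; a sketch (same names as above) looks like this:
PUT /synonyms-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [ "shares,equity,stock" ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_synonym_filter" ]
        }
      }
    }
  }
}
Re-running the same _analyze call on "equity awesome" should now also return "shares" and "stock" at position 0 (typed as SYNONYM), alongside "equity" and "awesome".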

How to optimize elasticsearch's full text search to match strings like 'C++'

We have a search engine for text content which contains strings like c++ or c#. The switch to Elasticsearch has shown that searches do not match terms like 'c++'; the ++ is removed.
How can we teach Elasticsearch to match correctly in a full-text search and not remove special characters? Characters like the comma should of course still be removed.
You need to create your own custom analyzer which generates tokens as per your requirement. For your example, I created the custom analyzer below with a text field named language and indexed some sample docs.
Index creation with a custom analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "char_filter": [
            "replace_comma"
          ]
        }
      },
      "char_filter": {
        "replace_comma": {
          "type": "mapping",
          "mappings": [
            ", => \\u0020"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "language": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Tokens generated for text like c++, c# and c,java.
POST http://{{hostname}}:{{port}}/{{index}}/_analyze
{
  "text": "c#",
  "analyzer": "my_analyzer"
}

{
  "tokens": [
    {
      "token": "c#",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    }
  ]
}
For c,java it generates two separate tokens, c and java, since the comma is replaced with whitespace, as shown below:
{
  "text": "c, java",
  "analyzer": "my_analyzer"
}

{
  "tokens": [
    {
      "token": "c",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "java",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    }
  ]
}
Note: You need to understand the analysis process and modify your custom analyzer accordingly to make it work for all of your use cases. My example might not cover every edge case, but I hope it gives you an idea of how to handle such requirements.
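As a quick sanity check (the index name and sample document below are my own illustration), you could index a document and search it with a match query; since the language field uses my_analyzer at both index and search time, both sides produce the single token c++, so the document should come back:
PUT my-index/_doc/1
{
  "language": "c++"
}

GET my-index/_search
{
  "query": {
    "match": {
      "language": "c++"
    }
  }
}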

Elasticsearch custom analyser

Is it possible to create a custom Elasticsearch analyzer which splits the input on spaces and then creates two tokens: one with everything before the space, and a second with the whole string?
For example, I have a stored record with a field containing the text '35 G'.
Now I want to retrieve that record by querying that field with either '35' or '35 G'.
So Elasticsearch should create exactly two tokens, ['35', '35 G'], and no more.
If it's possible, how can I achieve it?
This is doable using the path_hierarchy tokenizer.
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": " "
        }
      }
    }
  }
  ...
}
And now
POST test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "35 G"
}
outputs
{
  "tokens": [
    {
      "token": "35",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "35 G",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}
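The "..." in the PUT above stands for the mappings, which the answer leaves out; to actually run queries against a field with this analyzer, the mapping needs to reference it, roughly like this (the field name code is just an example):
"mappings": {
  "properties": {
    "code": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}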

In a span_first query, can we specify the "end" parameter based on the actual string stored in ES, or do I have to specify it in terms of the tokens stored in ES?

I asked a previous question here, Query in Elasticsearch for retrieving strings that start with a particular word, and my problem was solved by using a span_first query. Now my problem has changed a bit: my mapping has changed, because I now want to store words ending in apostrophe-s as "word", "words", and "word's". For example, see the case below:
"joseph's" -> "joseph's", "josephs", "joseph"
My mapping is given below
curl -X PUT "http://localhost:9200/colleges/" -d
'{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "apostrophe_comma": {
            "type": "pattern_replace",
            "pattern": "\\b((\\w+)\\u0027S)\\b",
            "replacement": "$1 $2s $2"
          }
        },
        "analyzer": {
          "simple_wildcard": {
            "type": "custom",
            "tokenizer": "whitespace",
            "char_filter": ["apostrophe_comma"],
            "filter": ["lowercase", "unique"]
          }
        }
      }
    }
  },
  "mappings": {
    "college": {
      "properties": {
        "college_name": { "type": "string", "index": "analyzed", "analyzer": "simple_wildcard" }
      }
    }
  }
}'
The span_first query I was using:
"span_first" : {
"match" : {
"span_term" : { "college_name" : first_string[0] }
},
"end" : 1
}
Now the problem I am facing is the following. Consider this example:
Suppose I have "Donald Duck's". If anyone searches for "Donald Duck", "Donald Duck's", "Donald Ducks", etc., I want the result to be "Donald Duck's". But with the span_first query this does not happen, because due to my mapping I now have four tokens: "Donald", "Duck", "Ducks" and "Duck's". For "Donald" the "end" used in the span_first query is 1, but for the other three I used 2; since "end" differs between tokens of the same word, I am not getting the desired result.
In short, my problem is that the span_first query uses the "end" parameter to describe the position from the beginning within which my token must appear. Because my mapping breaks the one word "Duck's" into "Duck's", "Ducks" and "Duck", each of these has a different "end" value, but while querying I can only use one "end" parameter, so I don't know how to get my desired output.
If any of you have worked with the span_first query, please help me.
You can use the English possessive stemmer to remove the trailing 's, and an English stemmer (which uses the Porter stemming algorithm) to handle plurals.
POST colleges
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "simple_wildcard": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "lowercase",
              "unique",
              "english_possessive_stemmer",
              "light_english_stemmer"
            ]
          }
        },
        "filter": {
          "light_english_stemmer": {
            "type": "stemmer",
            "language": "english"
          },
          "english_possessive_stemmer": {
            "type": "stemmer",
            "language": "possessive_english"
          }
        }
      }
    }
  },
  "mappings": {
    "college": {
      "properties": {
        "college_name": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "simple_wildcard"
        }
      }
    }
  }
}
After that you will have to make two requests to get the right result. First, run the user query through the analyze API to get the tokens, which you then pass to the span queries.
GET colleges/_analyze
{
  "text": "donald ducks duck's",
  "analyzer": "simple_wildcard"
}
The output is the list of tokens to pass to the next phase, i.e. the span query.
{
  "tokens": [
    {
      "token": "donald",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "duck",
      "start_offset": 7,
      "end_offset": 12,
      "type": "word",
      "position": 1
    },
    {
      "token": "duck",
      "start_offset": 13,
      "end_offset": 19,
      "type": "word",
      "position": 2
    }
  ]
}
The tokens donald, duck, duck are then passed with end positions of 1, 2 and 3 respectively.
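Put together, the second request could then use one span_first clause per token; wrapping them in a bool query is my own illustration rather than part of the original answer:
GET colleges/_search
{
  "query": {
    "bool": {
      "must": [
        { "span_first": { "match": { "span_term": { "college_name": "donald" } }, "end": 1 } },
        { "span_first": { "match": { "span_term": { "college_name": "duck" } }, "end": 2 } },
        { "span_first": { "match": { "span_term": { "college_name": "duck" } }, "end": 3 } }
      ]
    }
  }
}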
NOTE: No stemming algorithm is 100% accurate; you might miss some singular/plural combinations. For these you could log your queries and then use either a synonym token filter or a mapping char filter.
Hope this solves the problem.

Synonym filter with "&" not working in elasticsearch suggest with standard tokenizer

My goal is that if I have something like "s & p indices" indexed, I can also suggest it if the user searches for s and p, s & p, or s p. However, there seems to be something peculiar about the &, as the synonym set-up below does not work for it. I have the below mapping for my suggest index.
{
  "settings": {
    "analysis": {
      "analyzer": {
        "suggest_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_synonym_filter" ]
        }
      },
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [ "&, and", "foo, bar" ]
        }
      }
    }
  }
}
And I have the following mapping for my type:
{
  "properties": {
    "name": { "type": "string" },
    "name_suggest": {
      "type": "completion",
      "index_analyzer": "suggest_analyzer",
      "search_analyzer": "suggest_analyzer"
    }
  }
}
If I index the following object:
{
  "name": "s & p indices",
  "name_suggest": {
    "input": [ "s & p indices" ]
  }
}
Searching for "s and" does not return the indexed suggestion. However, the synonym for foo and bar works as expected.
I assume it is probably related to how the standard tokenizer handles the &, but I do not know how to work around the issue. Is there a way to stop the tokenizer from splitting on the &, and/or to treat it differently?
Your current issue apparently lies in the choice of tokenizer for suggest_analyzer. The standard tokenizer does not generate a token for &, so the token stream passed to your filters never contains an & token for them to replace. You can see how this works using the _analyze endpoint.
In this case, the tokens generated by the standard tokenizer look like this for the text s & p:
"tokens": [
{
"token": "s",
"start_offset": 5,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "p",
"start_offset": 9,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 2
}
]
The standard tokenizer eats the &. The simplest way to get everything to work here is to change your analyzer to use the whitespace analyzer, which will not strip out special characters or do much work at all, its job is to split on white space.
I modified your mapping to be this:
"settings": {
"analysis": {
"analyzer": {
"suggest_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [ "lowercase", "my_synonym_filter" ]
}
},
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": [
"&, and",
"foo, bar" ]
}
}
}
}
That will get you results like this:
{
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "name_suggest": [
    {
      "text": "s and",
      "offset": 0,
      "length": 5,
      "options": [
        {
          "text": "s & p",
          "score": 1
        }
      ]
    }
  ]
}
Another option is to replace ampersands before they hit the tokenizer, using a char filter, like so:
...
"char_filter": {
  "replace_ampersands": {
    "type": "mapping",
    "mappings": ["&=>and"]
  }
},
"analyzer": {
  "autocomplete": {
    "type": "custom",
    "tokenizer": "standard",
    "char_filter": ["replace_ampersands"],
    "filter": [
      "lowercase",
      "addy_synonym_filter",
      "autocomplete_filter"
    ]
  }
}
...
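If you just want to verify the char filter behaviour without creating an index (addy_synonym_filter and autocomplete_filter are assumed to be defined elsewhere in those settings), you can pass an ad-hoc definition to _analyze:
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [ "&=>and" ]
    }
  ],
  "text": "s & p indices"
}
This should return the tokens s, and, p, indices, which the synonym filter can then act on.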
