Analyzer for '&' and 'and' - elasticsearch

I want to build a search on ElasticSearch, but I'm stuck on this:
Queries for:
H and M
H&M
H & M
need to find a document with this field value:
H&M
How do I deal with this?

You should be using a Pattern Replace Char Filter and add it to your analyzer.
For instance, this would be a minimal reproduction:
POST /hm
{
"index": {
"analysis": {
"char_filter": {
"my_pattern": {
"type": "pattern_replace",
"pattern": "(\\s+)?&(\\s+)?|(\\s+)?and(\\s+)?",
"replacement": "and"
}
},
"analyzer": {
"custom_with_char_filter": {
"tokenizer": "standard",
"char_filter": [
"my_pattern"
]
}
}
}
}
}
It replaces & or and, together with any surrounding whitespace, with and. You can now check how this analyzer behaves by running these requests:
GET /hm/_analyze?analyzer=custom_with_char_filter&text=h%26m
GET /hm/_analyze?analyzer=custom_with_char_filter&text=h%20%26%20m
GET /hm/_analyze?analyzer=custom_with_char_filter&text=handm
All of these return the very same token:
{
"tokens": [
{
"token": "handm",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Which means that whenever you search for any of these:
HandM
H and M
H&M
H & M
it will return the same result.
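To see the effect end to end, you can map a field to this analyzer and query it with a match query. This is only a sketch: the doc type and the brand field are placeholders, and it uses the same pre-5.x syntax as the example above:
PUT /hm/_mapping/doc
{
  "doc": {
    "properties": {
      "brand": {
        "type": "string",
        "analyzer": "custom_with_char_filter"
      }
    }
  }
}

PUT /hm/doc/1
{ "brand": "H&M" }

GET /hm/_search
{
  "query": {
    "match": { "brand": "H & M" }
  }
}
// should match document 1, since both indexing and search go through custom_with_char_filter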

Related

ElasticSearch - exclude special character from standard stemmer

I'm using the standard analyzer for my ElasticSearch index, and I have noticed that when I search a query with % in it, the analyzer drops the % during analysis (on the query "2% milk"):
GET index_name/_analyze
{
"field": "text.english",
"text": "2% milk"
}
The response is the following 2 tokens (2 and milk):
{
"tokens": [
{
"token": "2",
"start_offset": 0,
"end_offset": 1,
"type": "<NUM>",
"position": 0
},
{
"token": "milk",
"start_offset": 3,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Meaning, the 2% becomes 2.
I want to keep the standard analyzer's handling of punctuation; I don't want to switch to the whitespace analyzer or some other non-standard one, but I do want the <number>% sign to be kept as a term in the index.
Is there a way to configure the analyzer to keep the special character when it's next to a number, or, worst case, not to drop it at all?
Thanks!
You can achieve the desired behavior by configuring a custom analyzer with a character filter that prevents the "%" character from getting stripped away.
Check the Elasticsearch documentation on configuring the built-in analyzers and use that configuration as a blueprint for your custom analyzer (see the Elasticsearch Reference: english analyzer).
Add a character filter that maps the percent character to a different string, as demonstrated in the following code snippet:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_percent_char_filter"
]
}
},
"char_filter": {
"my_percent_char_filter": {
"type": "mapping",
"mappings": [
"0% => 0_percent",
"1% => 1_percent",
"2% => 2_percent",
"3% => 3_percent",
"4% => 4_percent",
"5% => 5_percent",
"6% => 6_percent",
"7% => 7_percent",
"8% => 8_percent",
"9% => 9_percent"
]
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "The fee is between 0.93% or 2%"
}
With this, you can even search for specific percentages (like 2%)!
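For example, a match query against a field mapped with my_analyzer will rewrite the query text 2% into the same 2_percent token that was indexed (the my_text field name is a placeholder for this sketch):
GET my_index/_search
{
  "query": {
    "match": { "my_text": "2%" }
  }
}
// assumes "my_text" is mapped with "analyzer": "my_analyzer"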
Alternative Solution
If you simply want to remove the percent character, you can use the very same approach and just map the %-character to an empty string, as shown in the following code snippet:
"char_filter": {
"my_percent_char_removal_filter": {
"type": "mapping",
"mappings": [
"% => "]
}
}
BTW: this approach is not considered a "hack"; it's the standard approach for modifying your original string before it gets sent to the tokenizer.
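For completeness, a minimal sketch of how that removal filter could be wired into index settings of its own (the index and analyzer names are placeholders):
PUT my_other_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_removal_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_percent_char_removal_filter"
          ]
        }
      },
      "char_filter": {
        "my_percent_char_removal_filter": {
          "type": "mapping",
          "mappings": [
            "% => "
          ]
        }
      }
    }
  }
}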

Search special characters with elasticsearch

I have a problem with Elasticsearch: a business requirement needs searches that contain special characters. For example, some of the query strings might contain (space, #, &, ^, (), !). I have some similar use cases below.
foo&bar123 (an exact match)
foo & bar123 (white space between word)
foobar123 (No special chars)
foobar 123 (No special chars with whitespace)
foo bar 123 (No special chars with whitespace between word)
FOO&BAR123 (Upper case)
All of them should match the same results. Can anyone please give me some help with this? Note that right now I can search other strings with no special characters perfectly.
{
"settings": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "custom_tokenizer"
}
},
"tokenizer": {
"custom_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 30,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"index": {
"properties": {
"some_field": {
"type": "text",
"analyzer": "autocomplete"
},
"some_field_2": {
"type": "text",
"analyzer": "autocomplete"
}
}
}
}
}
EDIT:
There are two things to check here:
(1) Is the special character being analysed when we index the document?
The _analyze API tells us no:
POST localhost:9200/index-name/_analyze
{
"analyzer": "autocomplete",
"text": "foo&bar"
}
// returns
fo, foo, foob, fooba, foobar, oo, oob, // ...etc: the & has been ignored
This is because of the "token_chars" in your mapping: "letter" and "digit". These two groups do not include punctuation such as '&'. Hence, when you upload "foo&bar" to the index, the & is actually ignored.
To include the & in the index, you want to add "punctuation" to your "token_chars" list. You may also want the "symbol" group for some of your other characters:
"tokenizer": {
"custom_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 30,
"token_chars": [
"letter",
"digit",
"symbol",
"punctuation"
]
}
}
Now we see the terms being analyzed appropriately:
POST localhost:9200/index-name/_analyze
{
"analyzer": "autocomplete",
"text": "foo&bar"
}
// returns
fo, foo, foo&, foo&b, foo&ba, foo&bar, oo, oo&, // ...etc
(2) Is my search query doing what I expect?
Now that we know the 'foo&bar' document is being indexed (analyzed) correctly, we need to check that the search returns the result. The following query works:
POST localhost:9200/index-name/_doc/_search
{
"query": {
"match": { "some_field": "foo&bar" }
}
}
As does the GET query http://localhost:9200/index-name/_search?q=foo%26bar
Other queries may have unexpected results. According to the docs, you probably want to declare your search_analyzer to be different from your index analyzer (e.g. an ngram index analyzer and a standard search analyzer), however this is up to you.
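Following that suggestion, a minimal sketch of what the mapping from the question could look like with a separate search analyzer (standard here is just one reasonable choice, and the same applies to some_field_2):
"mappings": {
  "index": {
    "properties": {
      "some_field": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}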

Mapping international character to multiple options

What I want to achieve is the ability for people to search for individuals without being language-aware, while not punishing those who are. What I mean is:
Given I build index:
Jorgensen
Jörgensen
Jørgensen
I want to be able to allow such conversions:
ö to o
ö to oe
ø to o
ø to oe
so if someone searches for:
QUERY | RESULT (I include only IDs, but they would be full records in reality)
Jorgensen return - 1,2,3
Jörgensen return - 1,2
Jørgensen return - 1,3
Joergensen return - 2,3
Starting with that, I tried to create an index analyzer and filter like this:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": [
"my_char_filter"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"ö => o",
"ö => oe"
]
}
}
}
}
}
But that is invalid, because it tries to map the same character twice.
What am I missing? Do I need multiple analyzers? Any direction would be appreciated.
Since a custom mapping isn't enough in your case, as the comments above show, let's play with your data and character normalization.
In your case, plain Unicode normalization isn't enough because of the ø and oe conversions. Example:
import unicodedata

def strip_accents(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]
for b in body_matches:
    print(b, strip_accents(b))

>>>> Jorgensen Jorgensen
>>>> Jörgensen Jorgensen
>>>> Jørgensen Jørgensen
>>>> Joergensen Joergensen
So, we need a custom translation. For now I've only included the characters you've shown, but feel free to complete the list.
accented_letters = {
    u'ö': [u'o', u'oe'],
    u'ø': [u'o', u'oe'],
}
Then we can normalize words and store them in a special property, body_normalized for instance, and index them as a field of your Elasticsearch records.
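A minimal sketch of such a mapping (your_index and your_type are placeholders matching the es_kwargs used below, and the string type fits the older Elasticsearch version that doc_type implies):
PUT your_index
{
  "mappings": {
    "your_type": {
      "properties": {
        "body": { "type": "string" },
        "body_normalized": { "type": "string" }
      }
    }
  }
}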
Once they are inserted, you could perform two types of search:
exact search: user input isn't normalized and the Elasticsearch query runs against the body field, which isn't normalized either.
similar search: user input is normalized and we search against the body_normalized field.
Let's see an example:
# es is an elasticsearch.Elasticsearch() client; normalize_word applies the
# accented_letters mapping above (both are defined in the linked notebook).
body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]

print("------EXACT MATCH------")
for body_match in body_matches:
    elasticsearch_query = {
        "query": {
            "match": {
                "body": body_match
            }
        }
    }
    es_kwargs = {
        "doc_type": "your_type",
        "index": "your_index",
        "body": elasticsearch_query
    }
    res = es.search(**es_kwargs)
    print(body_match, "MATCHING BODIES=", res['hits']['total'])
    for r in res['hits']['hits']:
        print("-", r['_source'].get('body', ''))

print("\n------SIMILAR MATCHES------")
for body_match in body_matches:
    body_match = normalize_word(body_match)
    elasticsearch_query = {
        "query": {
            "match": {
                "body_normalized": body_match
            }
        }
    }
    es_kwargs = {
        "doc_type": "your_type",
        "index": "your_index",
        "body": elasticsearch_query
    }
    res = es.search(**es_kwargs)
    print(body_match, "MATCHING NORMALIZED BODIES=", res['hits']['total'])
    for r in res['hits']['hits']:
        print("-", r['_source'].get('body', ''))
You can see a running example in this notebook
After playing with it quite a bit more, this is the approach I have come up with so far:
We cannot store multiple representations of data in one field. That does make sense, so instead, as was suggested, we store multiple representations of the same field in sub-fields. I did everything with Kibana and/or Postman.
Create the index with the following settings:
PUT surname
{
"mappings": {
"individual": {
"_all": { "enabled": false },
"properties": {
"id": { "type": "integer" },
"name" : {
"type": "string",
"analyzer": "not_folded",
"fields": {
"double": {
"type": "string",
"analyzer": "double_folder"
},
"single": {
"type": "string",
"analyzer": "folded"
}
}
}
}
}
},
"settings": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"double_folder": {
"tokenizer": "icu_tokenizer",
"filter" : [
"icu_folding"
],
"char_filter": [
"my_char_filter"
]
},
"folded": {
"tokenizer": "icu_tokenizer",
"filter": [
"icu_folding"
]
},
"not_folded": {
"tokenizer": "icu_tokenizer",
"filter": [
"lowercase"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"ö => oe"
]
}
}
}
}
}
In this case it stores all names in 3 different formats:
The way it was entered
Folded to multiple symbols where I want it to
Folded to a single symbol
Setting the number of shards to one is an important bit for testing, since having multiple shards doesn't work well when there is not enough data. Read more in "Relevance is broken".
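To check what each sub-field will actually index, you can run the analyzers directly, as in the sketch below (the exact token output depends on the ICU plugin, so it is omitted here):
GET surname/_analyze
{
  "analyzer": "not_folded",
  "text": "Jörgensen"
}

GET surname/_analyze
{
  "analyzer": "folded",
  "text": "Jörgensen"
}

GET surname/_analyze
{
  "analyzer": "double_folder",
  "text": "Jörgensen"
}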
Then we can add test data to our index:
POST surname/individual/_bulk
{ "index": { "_id": 1}}
{ "id": "1", "name": "Matt Jorgensen"}
{ "index": { "_id": 2}}
{ "id": "2", "name": "Matt Jörgensen"}
{ "index": { "_id": 3}}
{ "id": "3", "name": "Matt Jørgensen"}
{ "index": { "_id": 4}}
{ "id": "4", "name": "Matt Joergensen"}
All that is left is to test whether we get the proper response:
GET surname/_search
{
"query": {
"multi_match": {
"type": "most_fields",
"query": "Jorgensen",
"fields": [ "name","name.double", "name.single" ]
}
}
}

Elasticsearch Analysis token filter doesn't capture pattern

I made a custom analyzer in my test index:
PUT test
{
"settings": {
"analysis": {
"filter": {
"myFilter": {
"type": "pattern_capture",
"patterns": ["\\d+(,\\d+)*(\\.\\d+)?[%$€£¥]?"],
"preserve_original": 1
}
},
"analyzer": {
"myAnalyzer": {
"type": "custom",
"tokenizer": "myTokenizer",
"filters":["myFilter"]
}
},
"tokenizer": {
"myTokenizer":{
"type":"pattern",
"pattern":"([^\\p{N}\\p{L}%$€£¥##'\\-&]+)|((?<=[^\\p{L}])['\\-&]|^['\\-&]|['\\-&](?=[^\\p{L}])|['\\-&]$)|((?<=[^\\p{N}])[$€£¥%]|^[$€£¥%]|(?<=[$€£¥%])(?=\\d))"
}
}
}
}
}
It is supposed to emit numbers like 123,234.56$ as a single token.
But when such a number is provided, it emits 3 tokens instead: 123, 234, 56$.
A sample of the failing test query:
GET test/Stam/_termvector?pretty=true
{
"doc": {
"Stam" : {
"fld" : "John Doe",
"txt": "100,234.54%"
}
},
"per_field_analyzer" : {
"Stam.txt": "myAnalyzer"
},
"fields" : ["Stam.txt"],
"offsets":true,
"positions":false,
"payloads":false,
"term_statistics":false,
"field_statistics":false
}
Can anyone figure out what the reason is?
For every other case ',' and '.' are definitely delimiters, which is why I added the filter for that purpose, but unfortunately it doesn't work.
Thanks in advance.
The answer is quite simple: a token filter cannot combine tokens by design. This has to be done through char filters, which are applied to the character stream even before the tokenizer starts splitting it into tokens.
I only had to make sure that the custom tokenizer would not split my tokens.
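To illustrate the principle (a standalone sketch with a plain whitespace tokenizer, not the author's final configuration): a pattern_replace char filter can rewrite the number before any tokenization happens, which is something no token filter can do:
PUT charfilter_demo
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_thousands_separator": {
          "type": "pattern_replace",
          "pattern": "(?<=\\d),(?=\\d)",
          "replacement": ""
        }
      },
      "analyzer": {
        "number_friendly": {
          "tokenizer": "whitespace",
          "char_filter": ["strip_thousands_separator"]
        }
      }
    }
  }
}

GET charfilter_demo/_analyze
{
  "analyzer": "number_friendly",
  "text": "100,234.54% costs 123,234.56$"
}
// "100,234.54%" and "123,234.56$" each come back as a single token (with the
// comma removed), because the char filter ran before the whitespace tokenizer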

Elasticsearch Query String Query with # symbol and wildcards

I defined a custom analyzer that I was surprised isn't built in.
analyzer": {
"keyword_lowercase": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
Then my mapping for this field is:
"email": {
"type": "string",
"analyzer": "keyword_lowercase"
}
This works great. (http://.../_analyze?field=email&text=me#example.com) ->
"tokens": [
{
"token": "me#example.com",
"start_offset": 0,
"end_offset": 16,
"type": "word",
"position": 1
}
]
Finding by that keyword works great. http://.../_search?q=me#example.com yields results.
The problem is trying to incorporate wildcards anywhere in the Query String Query. http://.../_search?q=*me#example.com yields no results. I would expect results containing emails such as "me#example.com" and "some#example.com".
It looks like elasticsearch performs the search with the default analyzer, which doesn't make sense. Shouldn't it perform the search with each field's own default analyzer?
I.e. http://.../_search?q=email:*me#example.com returns results because I am telling it which analyzer to use based upon the field.
Can elasticsearch not do this?
See http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
Set analyze_wildcard to true, as it is false by default.
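For example, using the request-body form of the query (my_index is a placeholder; the same option can also be passed as &analyze_wildcard=true on a URI search):
GET /my_index/_search
{
  "query": {
    "query_string": {
      "default_field": "email",
      "query": "*me#example.com",
      "analyze_wildcard": true
    }
  }
}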
