ElasticSearch - exclude special character from standard stemmer - elasticsearch

I'm using the standard analyzer for my ElasticSearch index, and I have noticed that when I search a query containing %, the % gets dropped during analysis (for example on the query "2% milk"):
GET index_name/_analyze
{
"field": "text.english",
"text": "2% milk"
}
The response is the following 2 tokens (2 and milk):
{
"tokens": [
{
"token": "2",
"start_offset": 0,
"end_offset": 1,
"type": "<NUM>",
"position": 0
},
{
"token": "milk",
"start_offset": 3,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Meaning, the 2% becomes 2
I want to keep the standard analyzer so that punctuation is stripped; I don't want to switch to the whitespace analyzer or another non-standard one, but I do want <number>% to be kept as a term in the index.
Is there a way to configure the analyzer to leave the special character alone when it's next to a number, or, worst case, to leave it alone everywhere?
Thanks!

You can achieve the desired behavior by configuring a custom analyzer with a character filter that keeps the % character from being stripped away.
Check the Elasticsearch documentation on how the built-in analyzers are configured and use that configuration as a blueprint for your custom analyzer (see Elasticsearch Reference: english analyzer).
Add a character filter that maps the percent character to a different string, as demonstrated in the following code snippet:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_percent_char_filter"
]
}
},
"char_filter": {
"my_percent_char_filter": {
"type": "mapping",
"mappings": [
"0% => 0_percent",
"1% => 1_percent",
"2% => 2_percent",
"3% => 3_percent",
"4% => 4_percent",
"5% => 5_percent",
"6% => 6_percent",
"7% => 7_percent",
"8% => 8_percent",
"9% => 9_percent"
]
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "The fee is between 0.93% or 2%"
}
With this, you can even search for specific percentages (like 2%)!
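To actually search against it, the custom analyzer has to be attached to a field. A minimal sketch, assuming a hypothetical field called my_text and the typeless mapping syntax of Elasticsearch 7+:
PUT my_index/_mapping
{
  "properties": {
    "my_text": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
GET my_index/_search
{
  "query": {
    "match": {
      "my_text": "2%"
    }
  }
}
Because no separate search_analyzer is defined, the query string "2%" is analyzed with my_analyzer as well and becomes the same 2_percent term that was indexed.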
Alternative Solution
If you simply want to remove the percentage character, you can use the very same approach, but simply map the %-character to an empty string, as shown in the following code snippet
"char_filter": {
"my_percent_char_removal_filter": {
"type": "mapping",
"mappings": [
"% => "]
}
}
BTW: This approach is not considered a "hack"; it's the standard way to modify the original string before it gets sent to the tokenizer.

Related

Elasticsearch : Problem with querying document where "." is included in field

I have an index where some entries are like
{
"name" : " Stefan Drumm"
}
...
{
"name" : "Dr. med. Elisabeth Bauer"
}
The mapping of the name field is
{
"name": {
"type": "text",
"analyzer": "index_name_analyzer",
"search_analyzer": "search_cross_fields_analyzer"
}
}
When I use the below query
GET my_index/_search
{"size":10,"query":
{"bool":
{"must":
[{"match":{"name":{"query":"Stefan Drumm","operator":"AND"}}}]
,"boost":1.0}},
"min_score":0.0}
It returns the first document.
But when I try to get the second document using the query below
GET my_index/_search
{"size":10,"query":
{"bool":
{"must":
[{"match":{"name":{"query":"Dr. med. Elisabeth Bauer","operator":"AND"}}}]
,"boost":1.0}},
"min_score":0.0}
it is not returning anything.
Things I can't do:
can't change the index
can't use the term query
can't change the operator to 'OR', because in that case it would return multiple entries, which I don't want
What am I doing wrong, and how can I achieve this by modifying the query?
You have configured different analyzers for indexing and searching (index_name_analyzer and search_cross_fields_analyzer). If these analyzers process the input Dr. med. Elisabeth Bauer in an incompatible way, the search isn't going to match. This is described in more detail in Index and search analysis, as well as in Controlling Analysis.
You don't provide the definition of these two analyzers, so it's hard to guess from your question what they are doing. Depending on the analyzers, it may be possible to preprocess your query string (e.g. by removing .) before executing the search so that the search will match.
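Purely as an illustration (whether this helps depends entirely on the two analyzer definitions), a preprocessed version of your query with the periods stripped would look like this:
GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": { "query": "Dr med Elisabeth Bauer", "operator": "AND" } } }
      ]
    }
  }
}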
You can investigate how analysis affects your search by using the _analyze API, as described in Testing analyzers. For your example, the commands
GET my_index/_analyze
{
"analyzer": "index_name_analyzer",
"text": "Dr. med. Elisabeth Bauer"
}
and
GET my_index/_analyze
{
"analyzer": "search_cross_fields_analyzer",
"text": "Dr. med. Elisabeth Bauer"
}
should show you how the two analyzers configured for your index treat the target string, which might provide you with a clue about what's wrong. The response will be something like
{
"tokens": [
{
"token": "dr",
"start_offset": 0,
"end_offset": 2,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "med",
"start_offset": 4,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "elisabeth",
"start_offset": 9,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "bauer",
"start_offset": 19,
"end_offset": 24,
"type": "<ALPHANUM>",
"position": 3
}
]
}
For the example output above, the analyzer has split the input into one token per word, lowercased each word, and discarded all punctuation.
My guess would be that index_name_analyzer preserves punctuation, while search_cross_fields_analyzer discards it, so that the tokens won't match. If this is the case, and you can't change the index configuration (as you state in your question), one other option would be to specify a different analyzer when running the query:
GET my_index/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": {
"query": "Dr. med. Elisabeth Bauer",
"operator": "AND",
"analyzer": "index_name_analyzer"
}
}
}
],
"boost": 1
}
},
"min_score": 0
}
In the query above, the analyzer parameter has been set to override the search analysis to use the same analyzer (index_name_analyzer) as the one used when indexing. What analyzer might make sense to use depends on your setup. Ideally, you should configure the analyzers to align so that you don't have to override at search time, but it sounds like you are not living in an ideal world.

Search special characters with elasticsearch

I have a problem with Elasticsearch: I have a business requirement to search with special characters. For example, some of the query strings might contain space, #, &, ^, (), or !. I have some similar use cases below.
foo&bar123 (an exact match)
foo & bar123 (white space between word)
foobar123 (No special chars)
foobar 123 (No special chars with whitespace)
foo bar 123 (No special chars with whitespace between word)
FOO&BAR123 (Upper case)
All of them should match the same results. Can anyone please give me some help with this? Note that right now I can search strings with no special characters perfectly.
{
"settings": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "custom_tokenizer"
}
},
"tokenizer": {
"custom_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 30,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"index": {
"properties": {
"some_field": {
"type": "text",
"analyzer": "autocomplete"
},
"some_field_2": {
"type": "text",
"analyzer": "autocomplete"
}
}
}
}
}
There are two things to check here:
(1) Is the special character kept when we index the document?
The _analyze API tells us no:
POST localhost:9200/index-name/_analyze
{
"analyzer": "autocomplete",
"text": "foo&bar"
}
// returns
fo, foo, foob, fooba, foobar, oo, oob, // ...etc: the & has been ignored
This is because of the "token_chars" in your mapping: "letter" and "digit". These two groups do not include punctuation such as '&'. Hence, when you upload "foo&bar" to the index, the & is simply ignored.
To include the & in the index, you want to add "punctuation" to your "token_chars" list. You may also want the "symbol" group for some of your other characters:
"tokenizer": {
"custom_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 30,
"token_chars": [
"letter",
"digit",
"symbol",
"punctuation"
]
}
}
Now we see the terms being analyzed appropriately:
POST localhost:9200/index-name/_analyze
{
"analyzer": "autocomplete",
"text": "foo&bar"
}
// returns
fo, foo, foo&, foo&b, foo&ba, foo&bar, oo, oo&, // ...etc
(2) Is my search query doing what I expect?
Now that we know the 'foo&bar' document is being indexed (analyzed) correctly, we need to check that the search returns the result. The following query works:
POST localhost:9200/index-name/_doc/_search
{
"query": {
"match": { "some_field": "foo&bar" }
}
}
As does the GET query http://localhost:9200/index-name/_search?q=foo%26bar
Other queries may have unexpected results. According to the docs, you probably want to declare a search_analyzer that is different from your index analyzer (e.g. an ngram index analyzer and a standard search analyzer); however, this is up to you.
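As a rough sketch of that last point, the mappings section could declare a separate search analyzer per field; the choice of the standard analyzer here is an assumption, and the typeless mapping syntax of Elasticsearch 7+ is used:
"mappings": {
  "properties": {
    "some_field": {
      "type": "text",
      "analyzer": "autocomplete",
      "search_analyzer": "standard"
    },
    "some_field_2": {
      "type": "text",
      "analyzer": "autocomplete",
      "search_analyzer": "standard"
    }
  }
}
That way the ngrams are still produced at index time, but the query string itself is not broken into ngrams at search time, which otherwise tends to match far more documents than intended.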

Allowing hypen based words to be tokenized in elasticsearch

I have the following mapping for a field name that will hold product names for ecommerce.
'properties': {
'name': {
'type': 'text',
'analyzer': 'standard',
'fields': {
'english': {
'type': 'text',
'analyzer': 'english'
},
}
},
Assuming that I have the following string to be indexed/searched.
A pack of 3 T-shirts
Both of the analyzers produce the terms [t, shirts] and [t, shirt] respectively.
This gives me the problem of not getting any results when a user types "mens tshirts".
How can I get terms in the inverted index like [t, shirts, shirt, tshirt, tshirts]?
I tried looking into stemmer exclusions but couldn't find anything to deal with hyphens. A more generic solution would also be more helpful than adding exclusions manually, because there could be many possibilities I don't know about yet, e.g. emails, e-mails.
The whitespace tokenizer could do the job:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-tokenizer.html
POST _analyze
{
"tokenizer": "whitespace",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
will produce
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
I found one solution which I guess could help me achieve the desired results. However, I would still like to see if there is a better, recommended approach to this problem.
Basically I will use multi-fields for this problem, where the first analyzer will be standard and the second will be my custom one.
According to the Elasticsearch documentation, char_filter runs before the tokenizer. So the idea is to replace - with an empty string, which turns t-shirts into tshirts. The tokenizer then puts the whole term tshirts into the inverted index.
GET _analyze
{
"tokenizer": "standard",
"filter": [
"lowercase",
{"type": "stop", "stopwords": "_english_"}
],
"char_filter" : [
"html_strip",
{"type": "mapping", "mappings": ["- => "]}
],
"text": "these are t-shirts <table>"
}
will give the following tokens
{
"tokens": [
{
"token": "tshirts",
"start_offset": 10,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 2
}
]
}
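Wired into index settings, that ad-hoc test could become a named analyzer on an extra sub-field next to the existing standard and english ones. A rough sketch (the index, analyzer, and char filter names here are made up):
PUT products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "hyphen_removal": {
          "type": "mapping",
          "mappings": ["- => "]
        }
      },
      "analyzer": {
        "dehyphenated": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": ["html_strip", "hyphen_removal"],
          "filter": ["lowercase", "stop"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "english": { "type": "text", "analyzer": "english" },
          "dehyphenated": { "type": "text", "analyzer": "dehyphenated" }
        }
      }
    }
  }
}
A multi_match query across name, name.english and name.dehyphenated should then match both "t-shirts" and "tshirts".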

Elasticsearch and Drupal : how to add filter lowercase and asciifolding

I created an index in Drupal, and my queries work.
Now I'm trying to add the lowercase and asciifolding filters in the elasticsearch.yml file, but without success.
I added these lines:
index:
analysis:
analyzer:
default:
filter : [standard, lowercase, asciifolding]
I get an error: IndexCreationException: [myindex] failed to create index.
But 'myindex' already exists; I'm just trying to add filters to this existing index.
How can I add these filters so that indexing works the way I want?
Thank you very much for your help.
The reason you get this exception is that it's not possible to update the settings of an index by calling the general create-index endpoint. In order to update analyzers you have to call the '_settings' endpoint.
I've made a small example for you on how to do this:
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"new_analyzer": {
"tokenizer": "standard"
}
}
}
}
}
GET test/_analyze
{
"analyzer": "new_analyzer",
"text": "NoLowercasse"
}
POST test/_close
PUT test/_settings
{
"analysis": {
"analyzer": {
"new_analyzer": {
"tokenizer": "standard",
"filter": [
"asciifolding",
"lowercase"
]
}
}
}
}
POST test/_open
GET test/_analyze
{
"analyzer": "new_analyzer",
"text": "LowerCaseAdded"
}
Response:
{
"tokens": [
{
"token": "lowercaseadded",
"start_offset": 0,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 0
}
]
}
You can see that after the second analysis call, the lowercase filter is applied. The reason you have to close the index is that the analyzer needs to be rebuilt. You will also notice that the new analyzer won't work as expected on previously added documents, since they were indexed with the old analyzer, the one without asciifolding and lowercase.
In order to fix this, you'll have to rebuild your index (with the Reindex API, for example).
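A minimal reindex sketch, assuming a new index (here called test_v2) has already been created with the corrected analyzer settings:
POST _reindex
{
  "source": {
    "index": "test"
  },
  "dest": {
    "index": "test_v2"
  }
}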
Hope this helps!
Edit: I may have been a bit too quick in responding, as this is not a Drupal-specific solution, but it might point you in the right direction. To be honest, I'm not familiar with running ES in combination with Drupal.

Elasticsearch Query String Query with # symbol and wildcards

I defined a custom analyzer that I was surprised is not built in.
"analyzer": {
"keyword_lowercase": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
Then my mapping for this field is:
"email": {
"type": "string",
"analyzer": "keyword_lowercase"
}
This works great. (http://.../_analyze?field=email&text=me#example.com) ->
"tokens": [
{
"token": "me#example.com",
"start_offset": 0,
"end_offset": 16,
"type": "word",
"position": 1
}
]
Finding by that keyword works great. http://.../_search?q=me#example.com yields results.
The problem is trying to incorporate wildcards anywhere in the Query String Query. http://.../_search?q=*me#example.com yields no results. I would expect results containing emails such as "me#example.com" and "some#example.com".
It looks like elasticsearch performs the search with the default analyzer, which doesn't make sense. Shouldn't it perform the search with each field's own default analyzer?
I.e. http://.../_search?q=email:*me#example.com returns results because I am telling it which analyzer to use based on the field.
Can elasticsearch not do this?
See http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
Set analyze_wildcard to true, as it is false by default.
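In the request-body form of the query this is the analyze_wildcard flag on the query_string query. A minimal sketch, assuming an index called my_index with the email field mapped as above:
GET my_index/_search
{
  "query": {
    "query_string": {
      "default_field": "email",
      "query": "*me#example.com",
      "analyze_wildcard": true
    }
  }
}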
