Search special characters with elasticsearch - elasticsearch

I have a problem with Elasticsearch: a business requirement means I need to search strings that contain special characters. For example, a query string might contain space, #, &, ^, (), or !. Some similar use cases are below.
foo&bar123 (an exact match)
foo & bar123 (white space between word)
foobar123 (No special chars)
foobar 123 (No special chars with whitespace)
foo bar 123 (No special chars with whitespace between word)
FOO&BAR123 (Upper case)
All of them should return the same results. Can anyone please give me some help with this? Note that right now I can search strings with no special characters perfectly. Here are my index settings and mappings:
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "custom_tokenizer"
        }
      },
      "tokenizer": {
        "custom_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 30,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "index": {
      "properties": {
        "some_field": {
          "type": "text",
          "analyzer": "autocomplete"
        },
        "some_field_2": {
          "type": "text",
          "analyzer": "autocomplete"
        }
      }
    }
  }
}

EDIT:
There are two things to check here:
(1) Is the special character being analysed when we index the document?
The _analyze API tells us no:
POST localhost:9200/index-name/_analyze
{
  "analyzer": "autocomplete",
  "text": "foo&bar"
}
// returns
fo, foo, foob, fooba, foobar, oo, oob, // ...etc: the & has been ignored
This is because of the "token_chars" in your mapping: "letter" and "digit". These two groups do not include punctuation such as '&'. Hence, when you upload "foo&bar" to the index, the & is actually ignored.
To include the & in the index, you want to add "punctuation" to your "token_chars" list. You may also want the "symbol" group for some of your other characters:
"tokenizer": {
  "custom_tokenizer": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 30,
    "token_chars": [
      "letter",
      "digit",
      "symbol",
      "punctuation"
    ]
  }
}
Now we see the terms being analyzed appropriately:
POST localhost:9200/index-name/_analyze
{
  "analyzer": "autocomplete",
  "text": "foo&bar"
}
// returns
fo, foo, foo&, foo&b, foo&ba, foo&bar, oo, oo&, // ...etc
(2) Is my search query doing what I expect?
Now that we know the 'foo&bar' document is being indexed (analyzed) correctly, we need to check that the search returns the result. The following query works:
POST localhost:9200/index-name/_doc/_search
{
  "query": {
    "match": { "some_field": "foo&bar" }
  }
}
As does the GET query http://localhost:9200/index-name/_search?q=foo%26bar
Other queries may return unexpected results. According to the docs, you probably want to declare your search_analyzer to be different from your index analyzer (e.g. an ngram index analyzer and a standard search analyzer)... however this is up to you.
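For reference, a minimal sketch of what that could look like, reusing the "autocomplete" analyzer from the settings above (the mapping structure mirrors the question; adjust for your Elasticsearch version). The field is ngrammed at index time, while query text is analyzed with the standard analyzer:
PUT localhost:9200/index-name
{
  "mappings": {
    "index": {
      "properties": {
        "some_field": {
          "type": "text",
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
        }
      }
    }
  }
}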

Related

Elasticsearch became case-sensitive after adding a synonym analyzer

After I added a synonym analyzer to my_index, the index became case-sensitive.
I have one property called nationality that uses a synonym analyzer, but it seems that this property has become case-sensitive because of the synonym analyzer.
Here is my /my_index/_mappings
{
  "my_index": {
    "mappings": {
      "items": {
        "properties": {
          .
          .
          .
          "nationality": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            },
            "analyzer": "synonym"
          },
          .
          .
          .
        }
      }
    }
  }
}
Inside the index, I have the phrase "India COUNTRY". When I search for "India nation" using the command below, I get the result.
POST /my_index/_search
{
  "query": {
    "match": {
      "nationality": "India nation"
    }
  }
}
But when I search for "india" (notice the letter i is lowercase), I get nothing.
My assumption is that this happened because I put an uppercase filter before the synonym filter. I did this because the synonyms are uppercased, so the query "India" becomes "INDIA" after passing through this filter.
Here is my /my_index/_settings
{
  "my_index": {
    "settings": {
      "index": {
        "number_of_shards": "1",
        "provided_name": "my_index",
        "similarity": {
          "default": {
            "type": "BM25",
            "b": "0.9",
            "k1": "1.8"
          }
        },
        "creation_date": "1647924292297",
        "analysis": {
          "filter": {
            "synonym": {
              "type": "synonym",
              "lenient": "true",
              "synonyms": [
                "NATION, COUNTRY, FLAG"
              ]
            }
          },
          "analyzer": {
            "synonym": {
              "filter": [
                "uppercase",
                "synonym"
              ],
              "tokenizer": "whitespace"
            }
          }
        },
        "number_of_replicas": "1",
        "version": {
          "created": "6080099"
        }
      }
    }
  }
}
Is there a way I can keep this property case-insensitive? All the solutions I've found only show that I should make all the text inside nationality either lowercase or uppercase. But what if I have both uppercase and lowercase letters inside the index?
Did you apply the synonym filter after adding your data to the index?
If so, the phrase "India COUNTRY" was probably indexed exactly as "India COUNTRY". When you sent a match query to the index, your query was analyzed and sent as "INDIA COUNTRY" because of your uppercase filter. It still matched because a match query only needs one of the words to match, and "COUNTRY" provided that.
But when you sent the one-word query "india", it was analyzed and converted to "INDIA" because of your uppercase filter, and there is no matching term in your index; you just have a document containing "India COUNTRY".
My answer involves a bit of assumption, but I hope it is useful for understanding your problem.
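One way to check this reasoning (a sketch, using the index and analyzer names from the question) is to run the text through the _analyze API and compare the output with the terms you expect to have been stored:
POST /my_index/_analyze
{
  "analyzer": "synonym",
  "text": "India COUNTRY"
}
// with the uppercase + synonym filters this yields terms such as
// INDIA, COUNTRY, NATION, FLAG -- but a document indexed before the
// uppercase filter existed still holds the original-case term "India"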
I have found the solution!
I didn't realize that the filters applied in the settings only take effect when data is indexed and searched; they are not applied retroactively to documents already in the index. At first, my steps were:
Create index with synonym filter
Insert data
Add uppercase before synonym filter
Because of that order, the uppercase filter was never applied to my data. What I should have done is:
Create index with uppercase & synonym filter (pay attention to the order)
Insert data
Then the filter will be applied to my data.
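For reference, a condensed sketch of step 1, reusing the filter and analyzer definitions from the question's settings; the key point is that "uppercase" sits before "synonym" in the filter chain, and that the index is created this way before any documents are inserted:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym": {
          "type": "synonym",
          "lenient": "true",
          "synonyms": ["NATION, COUNTRY, FLAG"]
        }
      },
      "analyzer": {
        "synonym": {
          "tokenizer": "whitespace",
          "filter": ["uppercase", "synonym"]
        }
      }
    }
  }
}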

Elastic search with Java: exclude matches with random leading characters in a letter

I am new to using Elasticsearch. I managed to get things working somewhat close to what I intended. I am using the following configuration.
{
  "analysis": {
    "filter": {
      "shingle_filter": {
        "type": "shingle",
        "min_shingle_size": 2,
        "max_shingle_size": 3,
        "output_unigrams": true,
        "token_separator": ""
      },
      "autocomplete_filter": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 20
      }
    },
    "analyzer": {
      "shingle_search": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase"
        ]
      },
      "shingle_index": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "shingle_filter",
          "autocomplete_filter"
        ]
      }
    }
  }
}
I have this applied over multiple fields and am doing a multi-match query. The following is the Java code:
NativeSearchQuery searchQuery = new NativeSearchQueryBuilder()
        .withQuery(QueryBuilders.multiMatchQuery(i)
                .field("title")
                .field("alias")
                .fuzziness(Fuzziness.ONE)
                .type(MultiMatchQueryBuilder.Type.BEST_FIELDS))
        .build();
The problem is that it also matches terms where the search input appears with leading characters.
For example, if my search input is "ron" I want it to match "ron mathews", but I don't want it to match "iron". How can I make sure that I only match terms with no leading characters?
Update-1
Turning off fuzzy transposition seems to improve search results. But I think we can make it better.
You probably want to score "ron" higher than "ronaldo", and an exact match on the complete field ("ron") even higher, so the best option here would be to use a few subfields with standard and keyword analyzers and boost those fields in your multi_match query.
Also, as you figured out yourself, be careful with the fuzziness. It might make sense to run two queries in a should clause, one fuzzy and one boosted, so that exact matches are ranked higher.
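A rough sketch of that idea in query DSL (the index name, keyword subfields, and boost values below are assumptions, not part of the original mapping): one boosted clause matches the keyword subfields exactly, the other keeps the fuzzy behaviour:
POST /my-index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "ron",
            "fields": ["title.keyword^3", "alias.keyword^3"]
          }
        },
        {
          "multi_match": {
            "query": "ron",
            "fields": ["title", "alias"],
            "fuzziness": 1
          }
        }
      ]
    }
  }
}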

Elasticsearch autocomplete searching middle word

I've been stuck on this for a while.
How can I get Elasticsearch suggestions to complete my phrase even when I type a middle term?
For example, my data contains "Alan Turing is great"; when I start typing "turi", I would like to see the suggestion "Alan Turing is great".
I am using Elasticsearch 6.3.2 and have tried queries similar to these:
curl -X GET "http://127.0.0.1:9200/my_index/_search" -H 'Content-Type: application/json' -d '{"_source":false,"suggest":{"show-suggest":{"prefix":"turi","completion":{"field":"auto_suggest"}}}}'
or
curl -X GET "http://127.0.0.1:9200/my_index/_search" -H 'Content-Type: application/json' -d '{"_source":false,"suggest":{"show-suggest":{"text":"turi","completion":{"field":"auto_suggest"}}}}'
but it only works if I search for "alan", and then it shows all the terms.
index:
"tokenizer": {
  "my_tokenizer": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 4,
    "token_chars": [
      "letter",
      "digit"
    ]
  }
}
}
"mappings": {
  "poielement": {
    "numeric_detection": false,
    "date_detection": false,
    "dynamic_templates": [
      {
        "suggestions": {
          "match": "suggest_*",
          "mapping": {
            "type": "text",
            "analyzer": "my_analyzer",
            "search_analyzer": "my_analyzer",
            "copy_to": "auto_suggest",
            "store": true
          }
        }
      },
      {
        "property": {
          "match": "*",
          "mapping": {
            "analyzer": "my_analyzer",
            "search_analyzer": "my_analyzer"
          }
        }
      }
    ],
    "properties": {
      "auto_suggest": {
        "type": "completion"
      },
      "name_suggest": {
        "type": "completion"
      }
    }
  }
}
We have a very similar use case, and this is how we solved it. What you are looking for is substring search.
Create a custom substring analyzer for your field; the Java code for it is below:
// tokenize on whitespace, then lowercase each token
TokenStream result = new WhitespaceTokenizer(SearchManager.LUCENE_VERSION_301, reader);
result = new LowerCaseFilter(SearchManager.LUCENE_VERSION_301, result);
// custom filter that emits every substring of each token with length >= minSize
result = new SubstringFilter(result, minSize);
return result;
In the above code I first use the WhitespaceTokenizer, then pass the stream through a LowerCaseFilter, and finally through my custom SubstringFilter, which is configurable by the minimum number of characters you want in your tokens.
The above code will generate the tokens below for a string like helloworld if you set the minimum substring length to 3. Because it generates a lot of tokens, here is a public URL listing the tokens produced for the string helloworld with minimum substring length 3:
https://justpaste.it/4i6gh
You can also test the tokens your custom analyzer generates using the _analyze API, https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html:
http://localhost:9200/jaipur/_analyze?text=helloworld&analyzer=substring
Here jaipur is my index name and helloworld is the string for which I want to generate tokens using the substring analyzer.
EDIT:
As suggested by Nishant in the comments, you can use the built-in ngram token filter that Elasticsearch provides instead of the custom substring filter.
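A minimal sketch of that built-in approach (the index and analyzer names are illustrative): the ngram token filter after lowercase produces the mid-word fragments, so a search for "turi" can hit "Turing". Note that a larger gap between min_gram and max_gram may require raising the index.max_ngram_diff setting:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "substring_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      },
      "analyzer": {
        "substring": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "substring_filter"]
        }
      }
    }
  }
}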

Choosing right Tokenizer in Elastic 5.4 for emulate contains like queries

I am using Elastic 5.4 to implement suggestion/completion-like functionality and am facing an issue choosing the right tokenizer for my requirements. Below is an example.
There are 5 documents in the index, with the following content:
DOC 1: Applause
DOC 2: Apple
DOC 3: It is an Apple
DOC 4: Applications
DOC 5: There is_an_appl
Queries
Query 1: Query String 'App' should return all 5 documents.
Query 2: Query String 'Apple' should return only document 2 and document 3.
Query 3: Query String 'Applications' should return only document 4.
Query 4: Query String 'appl' should return all 5 documents.
Tokenizer
I am using the following tokenizer in Elastic and I am seeing all documents returned for Query 2 and Query 3.
The analyzer is applied to fields of type 'text'.
"settings": {
"analysis": {
"analyzer": {
"my_ngram_analyzer": {
"tokenizer": "my_ngram_tokenizer"
}
},
"tokenizer": {
"my_ngram_tokenizer": {
"type": "ngram",
"min_gram": "3",
"max_gram": "3",
"token_chars": [
"letter",
"digit"
]
}
}
}
}
How can I restrict the results to documents that contain an exact match of the query string, either as part of an existing word, a phrase, or an exact word (the expected results are described in the queries above)?
That's because you're using an nGram tokenizer instead of an edgeNGram one. The latter only indexes prefixes, while the former indexes prefixes, suffixes and also sub-parts of your data.
Change your analyzer definition to this instead and it should work as expected:
"settings": {
"analysis": {
"analyzer": {
"my_ngram_analyzer": {
"tokenizer": "my_ngram_tokenizer"
}
},
"tokenizer": {
"my_ngram_tokenizer": {
"type": "edge_ngram", <---- change this
"min_gram": "3",
"max_gram": "3",
"token_chars": [
"letter",
"digit"
]
}
}
}
}
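You can verify the change with the _analyze API (the index name below is illustrative). With edge_ngram and min_gram/max_gram of 3, "Applause" produces only the prefix "App", whereas the plain ngram tokenizer also emitted interior grams such as "ppl", which is what made 'Apple' match every document:
POST /my_index/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "Applause"
}
// edge_ngram (3..3): App
// ngram      (3..3): App, ppl, pla, lau, aus, use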

Analyzer for '&' and 'and'

I want to build a search with Elasticsearch, but I am stuck on this:
A query for any of:
H and M
H&M
H & M
needs to find a document with this field value:
H&M
How do I deal with this?
You should use the Pattern Replace Char Filter and append it to your analyzer.
For instance, this would be a minimal reproduction:
POST /hm
{
  "index": {
    "analysis": {
      "char_filter": {
        "my_pattern": {
          "type": "pattern_replace",
          "pattern": "(\\s+)?&(\\s+)?|(\\s+)?and(\\s+)?",
          "replacement": "and"
        }
      },
      "analyzer": {
        "custom_with_char_filter": {
          "tokenizer": "standard",
          "char_filter": [
            "my_pattern"
          ]
        }
      }
    }
  }
}
It replaces & or and, with optional surrounding whitespace, with and. You can now check how this analyzer works by running these statements:
GET /hm/_analyze?analyzer=custom_with_char_filter&text=h%26m
GET /hm/_analyze?analyzer=custom_with_char_filter&text=h %26 m
GET /hm/_analyze?analyzer=custom_with_char_filter&text=handm
All of these bring back the very same token:
{
  "tokens": [
    {
      "token": "handm",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
This means that whenever you search for any of these:
HandM
H and M
H&M
H & M
it will bring back the same result.
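To use this in an actual search, the analyzer still has to be attached to the field you query. A sketch (the field name brand is only an example, not from the original post): with "analyzer": "custom_with_char_filter" set on that field, both the indexed value H&M and the query text H & M reduce to the same token, so a plain match query finds the document:
POST /hm/_search
{
  "query": {
    "match": { "brand": "H & M" }
  }
}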
