Requirement: Search with special characters in a text field.
My solution so far: use a wildcard query with a custom analyzer. I want to use wildcards because they seem the easiest way to do partial searches in a long string with multiple search keys. See the ES query below.
I have an index called "invoices" whose documents contain a field like this:
"searchString" : "I000010-1 000010 3901 North Saginaw Road add 2 Midland MI 48640 US MS Dhoni MSD-Company MSD (777) 777-7777 (333) 333-3333 sandeep#xyz.io msd-company msdhoni Dhoni, MS (3241480)"
Note: This field acts as the deprecated _all field in ES.
Index Mapping for this field:
"searchString": {"type": "text","analyzer": "multi_level_analyzer"},
Analyzer settings:
PUT invoices
{
"settings": {
"analysis": {
"analyzer": {
"multi_level_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"char_filter": [
"html_strip"
],
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
}
}
My query looks something like this:
GET invoices/_search
{
"query": {
"bool": {
"must": [{
"wildcard": {
"searchString": {
"value": "msd-company*",
"boost": 1.0
}
}
},
{
"wildcard": {
"searchString": {
"value": "Saginaw*",
"boost": 1.0
}
}
}
]
}
}
}
My question:
Earlier, when I was not using a custom analyzer, the above query worked, but I was not able to search for words with special characters like "msd-company".
After attaching the custom analyzer (multi_level_analyzer), the above query fails to return any result. I changed the wildcard query and prepended an asterisk to the search key, and for some reason it works now (I referred to another answer for this).
I want to know the impact of using "*msd-company*" instead of "msd-company*" in the wildcard query for the text field.
How can I still use the wildcard query "msd-company*" with the custom analyzer?
I am open to suggestions for any other approach to my problem statement.
I have solved my problem by changing the mapping of the said field to this:
"searchString": {"type": "text","analyzer": "multi_level_analyzer", "search_analyzer": "standard"},
But since wildcard queries are expensive, I would still like to know if there exists a better solution to satisfy my search use case.
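To make the tokenization difference concrete, here is a rough Python simulation of term-level wildcard matching. It is only a sketch: `whitespace_lowercase_tokens` imitates the whitespace tokenizer plus lowercase filter from multi_level_analyzer, `rough_standard_tokens` is a crude stand-in for the default analyzer (the real standard tokenizer follows Unicode segmentation rules), and `wildcard_hits` is a hypothetical helper that compares the pattern with each stored token, case-sensitively. Whether Elasticsearch lowercases the wildcard pattern for you depends on the version and the field setup, so that part is an assumption.

```python
import re
from fnmatch import fnmatchcase

SEARCH_STRING = (
    "I000010-1 000010 3901 North Saginaw Road add 2 Midland MI 48640 US "
    "MS Dhoni MSD-Company MSD (777) 777-7777 (333) 333-3333 sandeep#xyz.io "
    "msd-company msdhoni Dhoni, MS (3241480)"
)


def whitespace_lowercase_tokens(text):
    """Imitate the whitespace tokenizer followed by the lowercase filter."""
    return [token.lower() for token in text.split()]


def rough_standard_tokens(text):
    """Crude approximation of the default analyzer: split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def wildcard_hits(tokens, pattern):
    """Return the distinct tokens matching the pattern, compared term by term."""
    return sorted({token for token in tokens if fnmatchcase(token, pattern)})


custom = whitespace_lowercase_tokens(SEARCH_STRING)
default = rough_standard_tokens(SEARCH_STRING)

print(wildcard_hits(custom, "msd-company*"))   # hyphenated token is kept whole
print(wildcard_hits(default, "msd-company*"))  # hyphenated token was split
print(wildcard_hits(custom, "Saginaw*"))       # pattern case vs. lowercased tokens
```

In this model the hyphenated key only exists as a single term under the whitespace-based analyzer, and a pattern with an uppercase letter cannot match the lowercased terms unless something lowercases the pattern first.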
Say I have a synonym file with just the two synonym lines below:
ft , synonym_1
10 ft , synonym_2
When I use this file in an analyzer and analyze the text "10 ft", I get the following:
{
"tokens": [
{
"token": "10"
},
{
"token": "ft"
},
{
"token": "synonym_2"
}
]
}
synonym_1 doesn't appear, even though "ft" matched a token in the analyzed text. Is this because of some precedence with single tokens and phrases? Does "10 ft" match more of the analyzed text and therefore it's the only synonym that takes effect? Is there some way to get the first synonym to work in this case?
Note: I'm using a whitespace tokenizer and analyzing the text "30 ft" gives me synonym_1. It's only when "10 ft" appears exactly that the first synonym is broken.
"simplified_analyzer": {
"filter": [
"lowercase",
"stemmer",
"synonyms",
"edge_ngrams",
"remove_duplicates"
],
"char_filter": ["remove_html", "remove_non_alphanumeric"],
"tokenizer": "whitespace"
}
Do I have to use a second synonym filter to handle single words?
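One way to reason about the output above: Lucene's synonym filter is documented as parsing greedily, so the rule that starts earliest and consumes the most tokens wins. The Python below is only a toy model of that behaviour, not the real filter. The `RULES` table and `apply_synonyms` are hypothetical names, token positions are ignored, and the stemmer and edge_ngrams filters from the analyzer are left out.

```python
# Each rule maps a token sequence from the synonym file to the extra token
# that is emitted alongside it (hypothetical simplified representation).
RULES = {
    ("ft",): "synonym_1",
    ("10", "ft"): "synonym_2",
}


def apply_synonyms(tokens, rules=RULES):
    """Greedy longest-match pass: at each position the longest rule wins."""
    max_len = max(len(key) for key in rules)
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest possible window first, then shrink it.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            window = tuple(tokens[i:i + size])
            if window in rules:
                output.extend(window)
                output.append(rules[window])
                i += size  # the matched tokens are consumed
                break
        else:
            # No rule starts here: pass the token through unchanged.
            output.append(tokens[i])
            i += 1
    return output


print(apply_synonyms(["10", "ft"]))
print(apply_synonyms(["30", "ft"]))
```

In this model "10 ft" consumes the "ft" token, so the single-token rule never gets a chance to fire, while "30 ft" leaves "ft" free to match it.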
I'm building an analyzer to provide partial search on terms, so I want to use a 2-5 ngram tokenizer at index time and a 5-5 ngram tokenizer at search time.
The rationale for using 2-5 ngrams at index time is that a partial term query of length 2 should match.
At search time, if the search term is shorter than 5 characters, it can be looked up directly in the inverted index. If it has 5 or more characters, the term is tokenized into 5-grams and matches only if all tokens match.
However, in Elasticsearch, a 5-5 ngram tokenizer won't create any token if the query term is shorter than 5 characters.
One solution would be to use the same 2-5 tokenizer at search time as for indexing, but this would also search for all the 2-gram, 3-gram and 4-gram tokens, which is useless (the 5-gram tokens are sufficient).
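The problem can be reproduced outside Elasticsearch with a small Python sketch. The `ngrams` function below is a hypothetical helper that imitates what an ngram tokenizer emits for a single term with no token_chars restrictions. It is an illustration, not the actual Lucene implementation.

```python
def ngrams(term, min_gram, max_gram):
    """Return every substring of term whose length lies in [min_gram, max_gram]."""
    return [
        term[start:start + size]
        for start in range(len(term))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(term)
    ]


# A 5-5 tokenizer emits nothing for a 3-character search term ...
print(ngrams("ema", 5, 5))
# ... but works as intended once the term has 5 or more characters.
print(ngrams("emanuel", 5, 5))
# A 2-5 tokenizer at search time emits the short grams as well.
print(ngrams("ema", 2, 5))
```

With the 5-5 tokenizer the short term produces an empty token list, so there is nothing left to search for.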
Here is my current index mapping:
{
"settings" : {
"analysis":{
"analyzer":{
"index_partial":{
"type":"custom",
"tokenizer":"2-5_ngram_token"
},
"search_partial":{
"type":"custom",
"tokenizer": "5-5_ngram_token"
}
},
"tokenizer":{
"2-5_ngram_token": {
"type":"nGram",
"min_gram":"2",
"max_gram":"5"
},
"5-5_ngram_token": {
"type":"nGram",
"min_gram":"5",
"max_gram":"5"
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"name_trans": {
"type": "text",
"fields": {
"partial": {
"type":"text",
"analyzer":"index_partial",
"search_analyzer":"search_partial"
}
}
}
}
}
}
So my question is: how can I create an analyzer that is a no-op if the search query is shorter than 5 characters, and that creates 5-gram tokens if it has 5 or more characters?
----------------------UPDATE WITH WORK AROUND SOLUTION-----------------------
It seems it is not possible to create an analyzer that is a no-op if len < 5 and a 5-5 ngram if len >= 5.
There are two workarounds to perform partial search:
1- As mentioned by @Amit Khandelwal, one solution is to use the maximum ngram length at index time. If your field has 30 chars max, use a tokenizer with ngrams 2-30 and, at search time, search for the exact term without processing it with the ngram analyzer (either via a term query or by setting the search analyzer to keyword).
The drawback of this solution is that it could result in a huge inverted index, depending on the max length.
2- The other solution is to create two fields:
- one for short search terms, which can be looked up in the inverted index directly without being tokenized
- one for longer search terms, which shall be tokenized
Depending on the length of the search term, the search is performed on one of those two fields.
Below is the mapping I used for solution 2 (the limit between short and long term I chose is len=5):
PUT name_test
{
"settings" : {
"max_ngram_diff": 3,
"analysis":{
"analyzer":{
"2-4nGrams":{
"type":"custom",
"tokenizer":"2-4_ngram_token",
"filter": ["lowercase"]
},
"5-5nGrams":{
"type":"custom",
"tokenizer": "5-5_ngram_token",
"filter": ["lowercase"]
}
},
"tokenizer":{
"2-4_ngram_token": {
"type":"nGram",
"min_gram":"2",
"max_gram":"4"
},
"5-5_ngram_token": {
"type":"nGram",
"min_gram":"5",
"max_gram":"5"
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"name_trans": {
"type": "text",
"fields": {
"2-4partial": {
"type":"text",
"analyzer":"2-4nGrams",
"search_analyzer":"keyword"
},
"5-5partial": {
"type":"text",
"analyzer":"5-5nGrams"
}
}
}
}
}
}
And here are the two kinds of request to use with this mapping, depending on the search term length:
GET name_test/_search
{
"query": {
"match": {
"name_trans.2-4partial": {
"query": "ema",
"operator": "and",
"fuzziness": 0
}
}
}
}
GET name_test/_search
{
"query": {
"match": {
"name_trans.5-5partial": {
"query": "emanue",
"operator": "and",
"fuzziness": 0
}
}
}
}
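The choice between the two requests can be made in the application. Below is a minimal Python sketch of that step, assuming the field names from the mapping above. `build_partial_query` and the constants are hypothetical names, and sending the body to Elasticsearch is left to whatever client you use.

```python
import json

SHORT_FIELD = "name_trans.2-4partial"  # indexed as 2-4 grams, searched as-is
LONG_FIELD = "name_trans.5-5partial"   # indexed and searched as 5-grams
THRESHOLD = 5                          # limit between short and long terms


def build_partial_query(term):
    """Build the match request body for the field that fits the term length."""
    field = SHORT_FIELD if len(term) < THRESHOLD else LONG_FIELD
    return {
        "query": {
            "match": {
                field: {"query": term, "operator": "and", "fuzziness": 0}
            }
        }
    }


print(json.dumps(build_partial_query("ema"), indent=2))
print(json.dumps(build_partial_query("emanue"), indent=2))
```

The first call targets the 2-4 gram field and the second the 5-gram field, mirroring the two requests above.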
Maybe this will help someone someday :)
I am not sure whether this is possible in Elasticsearch, but I can suggest a workaround that we also use in our application, although our use case was different.
Create a custom analyzer using a 2-5 ngram tokenizer on the fields you want to use for partial search. This stores the ngram tokens of those fields in the inverted index; for example, for a field containing foobar as a value, it stores fo, foo, foob, fooba, oo, oob, ooba, oobar, ob, oba, obar, ba, bar, ar.
Now, instead of a match query, use a term query on the partial fields; unlike a match query, a term query does not analyze the search term.
In this case it doesn't matter whether the search term is shorter than 5 characters or not: as long as it is no longer than max_gram, it matches a stored token and you get the results.
Now let's dry-run this on the field containing foobar as a value and test it against some search terms.
Case 1: If the search term contains fewer than 5 chars, like fo, oo, ar, bar, oob, oba and ooba, it still matches, as these tokens are present in the inverted index.
Case 2: If the search term contains exactly 5 chars, like fooba or oobar, the document is also returned, as the index contains these tokens. A longer term such as foobar is not stored with max_gram 5, so max_gram has to cover the longest term you want to search for.
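The dry run above can be checked with a few lines of Python. This is only a sketch: `index_tokens` imitates the tokens a 2-5 ngram tokenizer would store for one value, and `term_query_matches` is a hypothetical stand-in for an unanalyzed term lookup.

```python
def index_tokens(value, min_gram=2, max_gram=5):
    """Collect the ngram tokens that would be stored for a single field value."""
    return {
        value[start:start + size]
        for start in range(len(value))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(value)
    }


def term_query_matches(stored_tokens, term):
    """A term lookup is an exact comparison against the stored tokens."""
    return term in stored_tokens


stored = index_tokens("foobar")
print(sorted(stored))

for term in ["fo", "ooba", "fooba", "oobar", "foobar"]:
    print(term, term_query_matches(stored, term))
```

Every term of 2 to 5 characters taken from foobar is found, while the full 6-character value is not, because no stored token is longer than max_gram.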
Let me know if it's clear or if you need additional clarification.
Is there a way to set the default operator for sub-tokens (tokens generated by the analyzer)? It currently seems to default to OR, and setting operator does not work.
I'm using the validate API to see how Elasticsearch understands the query:
/myIndex/mapping/_validate/query?explain=true
{
"query":{
"multi_match":{
"type":"phrase_prefix",
"query":"test123",
"fields":[
"message"
],
"lenient":true,
"analyzer":"myAnalyzer"
}
}
}
Which returns
+(message:test123 message:test message:123)
What I want is
+message:test123 +message:test +message:123
Is there any way to do this without using a script or splitting the terms and creating a more complex query in the application?
EDIT
Using operator or minimum_should_match does not make a difference.
My Elasticsearch analysis settings for myAnalyzer are:
{
"analysis":{
"filter":{
"foldAscii":{
"type":"asciifolding",
"preserve_original":"1"
},
"capturePattern":{
"type":"pattern_capture",
"patterns":[
"(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+(?!\\p{Ll}+))",
"(\\d+)"
]
},
"noDuplicates":{
"type":"unique",
"only_on_same_position":"true"
}
},
"myAnalyzer":{
"filter":[
"capturePattern",
"lowercase",
"foldAscii",
"noDuplicates"
],
"tokenizer":"standard"
}
}
}
SELECT * FROM Customers
WHERE City IN ('Paris','London')
How do I convert the above query to Elasticsearch?
You can use a terms query:
GET _search
{
"query" : {
"terms" : {
"city" : ["Paris", "London"]
}
}
}
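However, please make sure that city is marked as not_analyzed in your mapping (a keyword field in Elasticsearch 5.x and later), because a terms query does not analyze its input.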
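If the list of values comes from application code, the request body can be built programmatically. Here is a minimal Python sketch; `sql_in_to_terms` is a hypothetical helper, and the field name city is taken from the query above.

```python
import json


def sql_in_to_terms(field, values):
    """Build a terms query body equivalent to `WHERE field IN (values)`."""
    return {"query": {"terms": {field: list(values)}}}


body = sql_in_to_terms("city", ["Paris", "London"])
print(json.dumps(body, indent=2))
```

The values are passed through unchanged, so they must match the indexed terms exactly, including case.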
To make your searches case-insensitive, there are two ways I can think of:
Lowercase your terms both while indexing and while querying; this is the easy way.
Create a custom analyzer that lowercases the input without tokenizing it, and use a match query instead of a terms query. A terms query is not analyzed, so it only matches the exact terms stored in the index.
A sample lowercase analyzer would look like this:
"analyzer": {
"lowercase_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
}
}
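Your query should look like this.
The POST request (assuming customers is your index; index names must be lowercase):
http://localhost:9200/customers/_search
and the request body:
{
"query":{
"query_string":{
"query":"City:(Paris London)"
}
}
}
In a query_string query, City:(Paris London) matches documents whose City field contains either term, because the default operator is OR; this plays the role of IN in SQL. Hope this helps!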