ElasticSearch apostrophe handling

I have a problem with handling apostrophes in ElasticSearch.
I have a doc with a field value of 'Pangk’ok', and I want to be able to find the document with any of these search requests: 'Pangk’ok', 'Pangkok' or 'Pangk ok'.
I've tried to do this with the following analyzer:
"filter": {
"my_word_delimiter_graph": {
"type": "word_delimiter_graph",
"catenate_words": true
}
},
"analyzer": {
"source_analyzer": {
"tokenizer": "icu_tokenizer",
"filter": [ "my_word_delimiter_graph", "icu_folding" ],
"type": "custom"
}
}
It works for the match query for all of these searches, but it fails when the value is part of a phrase and I use a match_phrase query.
This case is actually described in the Elasticsearch documentation:
catenate_words
(Optional, Boolean) If true, the filter produces catenated tokens for chains of alphabetical characters separated by non-alphabetic delimiters. For example: super-duper-xl → [ super, superduperxl, duper, xl ]. Defaults to false.
When used for search analysis, catenated tokens can cause problems for the match_phrase query and other queries that rely on token position for matching. Avoid setting this parameter to true if you plan to use these queries.
So, my question is: is there some other solution to my problem?
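One idea I'm exploring, though I haven't verified it (the index, field and analyzer names below are placeholders, and it assumes the ICU plugin I'm already using): keep token positions intact by indexing the field twice via a multi-field, one sub-field with a mapping char_filter that removes the apostrophe before tokenization, and one sub-field that only splits on it (word_delimiter_graph without catenation), then query both with a phrase-type multi_match:
PUT place_names
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_apostrophe": {
          "type": "mapping",
          "mappings": [ "’ =>", "' =>" ]
        }
      },
      "analyzer": {
        "joined_analyzer": {
          "type": "custom",
          "char_filter": [ "strip_apostrophe" ],
          "tokenizer": "icu_tokenizer",
          "filter": [ "icu_folding" ]
        },
        "split_analyzer": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": [ "word_delimiter_graph", "icu_folding" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "joined_analyzer",
        "fields": {
          "split": {
            "type": "text",
            "analyzer": "split_analyzer"
          }
        }
      }
    }
  }
}
GET place_names/_search
{
  "query": {
    "multi_match": {
      "type": "phrase",
      "query": "Pangk ok",
      "fields": [ "name", "name.split" ]
    }
  }
}
For 'Pangk’ok', the joined sub-field should index pangkok while the split sub-field should index pangk and ok at consecutive positions, so each of the three searches should phrase-match on at least one of the two fields.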

Related

Elasticsearch: search with wildcard and custom analyzer

Requirement: Search with special characters in a text field.
My solution so far: use a wildcard query with a custom analyzer. I want to use wildcards because it seems the easiest way to do partial searches in a long string with multiple search keys. See the ES query below.
I have an index called "invoices", and it has a document with one of its fields as
"searchString" : "I000010-1 000010 3901 North Saginaw Road add 2 Midland MI 48640 US MS Dhoni MSD-Company MSD (777) 777-7777 (333) 333-3333 sandeep#xyz.io msd-company msdhoni Dhoni, MS (3241480)"
Note: This field acts as the deprecated _all field in ES.
Index Mapping for this field:
"searchString": {"type": "text","analyzer": "multi_level_analyzer"},
Analyzer settings:
PUT invoices
{
  "settings": {
    "analysis": {
      "analyzer": {
        "multi_level_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
My query looks something like this:
GET invoices/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "searchString": {
              "value": "msd-company*",
              "boost": 1.0
            }
          }
        },
        {
          "wildcard": {
            "searchString": {
              "value": "Saginaw*",
              "boost": 1.0
            }
          }
        }
      ]
    }
  }
}
My question:
Earlier, when I was not using a custom analyzer, the above query worked, BUT I was not able to search for words with special characters like "msd-company".
After attaching the custom analyzer (multi_level_analyzer), the above query fails to return any results. I changed the wildcard query and appended an asterisk before the search key, and for some reason it works now (referred to this answer).
I want to know the impact of using "* msd-company*" instead of "msd-company*" in the wildcard query for the text field.
How can I still use the wildcard query "msd-company*" with custom analyzer?
Open to suggestions for any other approach to my problem statement.
I have solved my problem by changing the mapping of the said field to this:
"searchString": {"type": "text","analyzer": "multi_level_analyzer", "search_analyzer": "standard"},
But since wildcard queries are expensive, I would still like to know if there exists a better solution to satisfy my search use case.
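One avenue worth trying (an untested sketch: invoices_v2 and the analyzer names are made up, and the 2/20 gram bounds would need tuning to your data): index edge n-grams of each whitespace token, so an ordinary match query can do the prefix matching and no wildcard query is needed at all:
PUT invoices_v2
{
  "settings": {
    "analysis": {
      "filter": {
        "edge_2_20": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "prefix_index_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": [ "html_strip" ],
          "filter": [ "lowercase", "asciifolding", "edge_2_20" ]
        },
        "prefix_search_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": [ "html_strip" ],
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "searchString": {
        "type": "text",
        "analyzer": "prefix_index_analyzer",
        "search_analyzer": "prefix_search_analyzer"
      }
    }
  }
}
GET invoices_v2/_search
{
  "query": {
    "match": {
      "searchString": {
        "query": "msd-company saginaw",
        "operator": "and"
      }
    }
  }
}
Each query token then only has to match an indexed prefix, so "msd-company" and "saginaw" should both be found by a cheap match query with operator and, at the cost of a larger index.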

Elasticsearch cannot find standalone reserved characters

I use Kibana to execute Elasticsearch queries (the query_string query).
When I search for a word that includes escapable characters (reserved characters like: '\', '+', '-', '&&', '||', '!', '(', ')', '{', '}', '[', ']', '^', '"', '~', '*', '?', ':', '/'), I get the expected result.
My example uses '!'.
But when I search for a single reserved character, I get nothing.
How can I search for a single reserved character?
TL;DR You'll need to specify an analyzer (+ a tokenizer) which ensures that special chars like ! won't be stripped away during the ingestion phase.
In the first screenshot you've correctly tried running _analyze. Let's use it to our advantage.
See, when you don't specify any analyzer, ES will default to the standard analyzer which is, by definition, constrained by the standard tokenizer which'll strip away any special chars (except the apostrophe ' and some other chars).
Running
GET dev_application/_analyze?filter_path=tokens.token
{
  "tokenizer": "standard",
  "text": "Se, det ble grønt ! a"
}
thus yields:
["Se", "det", "ble", "grønt", "a"]
This means you'll need to use some other tokenizer which'll preserve these chars instead. There are a few built-in ones available, the simplest of which would be the whitespace tokenizer.
Running
GET _analyze?filter_path=tokens.token
{
  "tokenizer": "whitespace",
  "text": "Se, det ble grønt ! a"
}
retains the !:
["Se,", "det", "ble", "grønt", "!", "a"]
So,
1. Drop your index:
DELETE dev_application
2. Then set the mappings anew:
(I chose the multi-field approach which'll preserve the original, standard analyzer and only apply the whitespace tokenizer on the name.splitByWhitespace subfield.)
PUT dev_application
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "splitByWhitespaceAnalyzer": {
            "tokenizer": "whitespace"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "splitByWhitespace": {
            "type": "text",
            "analyzer": "splitByWhitespaceAnalyzer"
          }
        }
      }
    }
  }
}
3. Reindex
POST dev_application/_doc
{
  "name": "Se, det ble grønt ! a"
}
4. Search freely for special chars:
GET dev_application/_search
{
  "query": {
    "query_string": {
      "default_field": "name.splitByWhitespace",
      "query": "*\\!*",
      "default_operator": "AND"
    }
  }
}
Do note that if you leave the default_field out, it won't work because of the standard analyzer.
Indeed, you could reverse this approach, apply whitespace by default, and create a multi-field mapping for the "original" indexing strategy (-> the only config being "type": "text").
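A rough sketch of that reversed mapping could look like this (untested; the standard sub-field name is just an example):
PUT dev_application
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "splitByWhitespaceAnalyzer": {
            "tokenizer": "whitespace"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "splitByWhitespaceAnalyzer",
        "fields": {
          "standard": {
            "type": "text"
          }
        }
      }
    }
  }
}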
Shameless plug: I wrote a book on Elasticsearch and you may find it useful!
Standard analyzer
The standard analyzer is the default analyzer which is used if none is specified. It provides grammar based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.
so no token is generated for the hyphen. If you want to find text with a hyphen, you need to look into keyword fields and use a wildcard query for a full-text match:
{
  "query": {
    "query_string": {
      "query": "*\\-*"
    }
  }
}
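For example (an untested sketch; the index and field names are made up), a keyword sub-field keeps the raw value, so an unanalyzed wildcard query can find hyphenated values:
PUT hyphen_test
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}
GET hyphen_test/_search
{
  "query": {
    "wildcard": {
      "name.raw": {
        "value": "*-*"
      }
    }
  }
}
Since the wildcard query runs against the keyword sub-field and is not analyzed, the hyphen does not need to be escaped there.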

analyser with ngram token depending on term length

I'm building an analyzer to provide partial search on terms. So I want to use a 2-5 ngram tokenizer at index time and a 5-5 ngram at search time.
The rationale for using a 2-5 ngram at index time is that a partial term query of length 2 should match.
At search time, if the search term has a length lower than 5, the term can be looked up directly in the inverted index. If it has a length greater than 5, then the term is tokenized into 5-grams and matches if all tokens match.
However, in Elastic, using a 5-5 ngram tokenizer won't create any token if the query term has a length lower than 5.
The solution could be to use a 2-5 tokenizer at search time, same as for indexing, but this would result in searching all the 2-gram, 3-gram and 4-gram tokens, which is useless (the 5-gram tokens are sufficient).
Here is my current index mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_partial": {
          "type": "custom",
          "tokenizer": "2-5_ngram_token"
        },
        "search_partial": {
          "type": "custom",
          "tokenizer": "5-5_ngram_token"
        }
      },
      "tokenizer": {
        "2-5_ngram_token": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "5"
        },
        "5-5_ngram_token": {
          "type": "nGram",
          "min_gram": "5",
          "max_gram": "5"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "name_trans": {
        "type": "text",
        "fields": {
          "partial": {
            "type": "text",
            "analyzer": "index_partial",
            "search_analyzer": "search_partial"
          }
        }
      }
    }
  }
}
So my question is: how can I create an analyzer that is a no-op if the search query has a length lower than 5, and that creates 5-gram tokens if the length is 5 or greater?
---------------------- UPDATE WITH WORKAROUND SOLUTION ----------------------
It seems it is not possible to create an analyzer that is a no-op if len < 5 and produces 5-5 ngrams if len >= 5.
There are two workaround solutions to perform partial search:
1- As mentioned by @Amit Khandelwal, one solution is to use max ngrams at index time. If your field has 30 chars max, use a tokenizer with ngram 2-30 and, at search time, search for the exact term without processing it with the ngram analyzer (either via a term query or by setting the search analyzer to keyword); a sketch of this variant is shown after this list.
Drawback of this solution is that it could result in a huge inverted index, depending on the max length.
2- The other solution is to create two fields:
- one for short search query terms, which can be looked up in the inverted index directly, without being tokenized
- one for longer search query terms, which shall be tokenized
Depending on the length of the search query term, the search shall be performed on one of those two fields.
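For completeness, solution 1 could look roughly like this (an untested sketch: the index name, the 2-30 range and the lowercased keyword search analyzer are just examples, and max_ngram_diff has to allow the wide gram range):
PUT name_test_maxgram
{
  "settings": {
    "max_ngram_diff": 28,
    "analysis": {
      "analyzer": {
        "2-30nGrams": {
          "type": "custom",
          "tokenizer": "2-30_ngram_token",
          "filter": [ "lowercase" ]
        },
        "keyword_lowercase": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "2-30_ngram_token": {
          "type": "ngram",
          "min_gram": "2",
          "max_gram": "30"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name_trans": {
        "type": "text",
        "fields": {
          "partial": {
            "type": "text",
            "analyzer": "2-30nGrams",
            "search_analyzer": "keyword_lowercase"
          }
        }
      }
    }
  }
}
A plain match query on name_trans.partial then looks up the whole (lowercased) search term as a single token against the indexed ngrams.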
Below is the mapping I used for solution 2 (the limit I chose between short and long terms is a length of 5):
PUT name_test
{
  "settings": {
    "max_ngram_diff": 3,
    "analysis": {
      "analyzer": {
        "2-4nGrams": {
          "type": "custom",
          "tokenizer": "2-4_ngram_token",
          "filter": [ "lowercase" ]
        },
        "5-5nGrams": {
          "type": "custom",
          "tokenizer": "5-5_ngram_token",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "2-4_ngram_token": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "4"
        },
        "5-5_ngram_token": {
          "type": "nGram",
          "min_gram": "5",
          "max_gram": "5"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "name_trans": {
        "type": "text",
        "fields": {
          "2-4partial": {
            "type": "text",
            "analyzer": "2-4nGrams",
            "search_analyzer": "keyword"
          },
          "5-5partial": {
            "type": "text",
            "analyzer": "5-5nGrams"
          }
        }
      }
    }
  }
}
And here are the two kinds of requests to be used with this mapping, depending on the search term length:
GET name_test/_search
{
  "query": {
    "match": {
      "name_trans.2-4partial": {
        "query": "ema",
        "operator": "and",
        "fuzziness": 0
      }
    }
  }
}
GET name_test/_search
{
  "query": {
    "match": {
      "name_trans.5-5partial": {
        "query": "emanue",
        "operator": "and",
        "fuzziness": 0
      }
    }
  }
}
Maybe this will help someone someday :)
I am not sure if it's possible in Elasticsearch or not, but I can suggest a workaround which we also use in our application, although our use case was different.
Create a custom analyzer using a 2-5 ngram tokenizer on the fields which you want to use for the partial search. This will store the ngram tokens of the fields in the inverted index; for example, for a field containing foobar as a value, it will store fo, foo, foob, fooba, oo, oob, ooba, oobar, ob, oba, obar, ba, bar, ar.
Now, instead of a match query, use the term query on the partial fields, which is not analyzed; you can read about the difference between these here.
So now, in this case, it doesn't matter whether the search term is smaller than 5 or not; it will still match the tokens and you will get the results.
Now let's dry-run this on a field containing foobar as a value and test it against some search terms.
Case 1: The search term contains fewer than 5 chars, like fo, oo, ar, bar, oob, oba and ooba. It will still match, as these tokens are present in the inverted index.
Case 2: The search term contains 5 or more chars, like fooba or oobar. The document is also returned, as the index contains these tokens.
Let me know if it's clear or if you require additional clarification.
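For reference, a minimal sketch of this setup could look like the following (untested; the index and field names are examples, and because a term query is not analyzed the search term has to be lowercased by the caller):
PUT partial_test
{
  "settings": {
    "max_ngram_diff": 3,
    "analysis": {
      "analyzer": {
        "2-5nGrams": {
          "type": "custom",
          "tokenizer": "2-5_ngram_token",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "2-5_ngram_token": {
          "type": "ngram",
          "min_gram": "2",
          "max_gram": "5"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name_trans": {
        "type": "text",
        "fields": {
          "partial": {
            "type": "text",
            "analyzer": "2-5nGrams"
          }
        }
      }
    }
  }
}
GET partial_test/_search
{
  "query": {
    "term": {
      "name_trans.partial": "ooba"
    }
  }
}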

How to handle auto completion on multi word text?

My input text is multiword English text and I have the requirement to implement an autocompletion feature for that text.
I initially looked at search completion suggesters, only to figure out that those can only match the first characters of the input. This is fine for autocompletion of product names or addresses, but not very useful when requiring autocompletion on any word in the input text.
After that I set up an edge_ngram analyzer and query to locate those documents which contain the input string. That works just fine, but I don't know how to use this information to provide options for my autocompletion.
I could use a highlighter in order to show the words which match the query. That data could in turn be used to set up a list of options. This solution seems rather hacky and not very elegant, and I wonder how this problem is usually solved.
I'm unfortunately not able to maintain another field which could include the autocompletion options for the documents.
I'm currently using highlight information of the query in order to construct the autocomplete options.
My Query:
{
  "query": {
    "match": {
      "fields.content.auto": {
        "query": "content co",
        "analyzer": "standard"
      }
    }
  },
  "highlight": {
    "fields": {
      "fields.content.auto": {
        "fragment_size": 0,
        "number_of_fragments": 10,
        "pre_tags": [ "%ha%" ],
        "post_tags": [ "%he%" ]
      }
    }
  },
  "_source": [ "uuid", "language" ]
}
My auto field used the autocomplete analyzer.
"auto": {
"type": "string",
"analyzer": "autocomplete"
}
And this is the index configuration that I'm using:
{
  "analysis": {
    "filter": {
      "my_stop": {
        "type": "stop",
        "stopwords": "_english_"
      },
      "autocomplete_filter": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 20
      }
    },
    "analyzer": {
      "autocomplete": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "my_stop",
          "autocomplete_filter"
        ]
      }
    }
  }
}
The solution was mainly inspired by the Search-as-you-type post.
I process the response JSON in order to get the autocomplete options.
The highlight information is used to extract all found tokens. These tokens are then used to construct the potential autocomplete phrase, comparing them to the phrase that the user has already entered. The neat thing is that a stop word filter can be applied, so stopwords will never be highlighted and in turn never be used for autocomplete suggestions.
A PoC Java code of this processor can be found here
I'm not yet sure whether I'll run with this solution but I want to share it anyway.
I think your best option is to create a dedicated index for storing just the suggestions using the edge_ngram analyzer. If you use the completion suggesters you need to explicitly define your actual suggestions anyway. The completion suggester is also document centric in ES 5.x so if you index multiple documents with the same suggestions you will get duplicate suggestions returned on a match. There is a de-duplication option in ES 6, but that has only just been released.
If you have a dedicated suggestion index you can use a hash of the suggestion as a document ID to avoid duplicates. You can start indexing document titles and other useful meta data as suggestions. Later on you could include historical searches entered by users that are seen as successful due to the user ultimately clicking on or purchasing the returned results.
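A sketch of such a dedicated suggestion index could look like this (untested; the index and field names are examples, and the document ID would be a hash of the suggestion text, shown here only as a placeholder):
PUT suggestions
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "autocomplete_filter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "suggestion": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
PUT suggestions/_doc/hash-of-suggestion-text
{
  "suggestion": "content consumption report"
}
GET suggestions/_search
{
  "query": {
    "match": {
      "suggestion": {
        "query": "content co",
        "operator": "and"
      }
    }
  }
}
Using the hash of the suggestion text as the document ID makes re-indexing the same suggestion idempotent, which avoids duplicates without any extra de-duplication step.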

Is Simple Query Search compatible with shingles?

I am wondering if it is possible to use shingles with the Simple Query String query. My mapping for the relevant field looks like this:
{
  "text_2": {
    "type": "string",
    "analyzer": "shingle_analyzer"
  }
}
The analyzer and filters are defined as follows:
"analyzer": {
"shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "custom_delimiter", "lowercase", "stop", "snowball", "filter_shingle"]
}
},
"filter": {
"filter_shingle":{
"type":"shingle",
"max_shingle_size":5,
"min_shingle_size":2,
"output_unigrams":"true"
},
"custom_delimiter": {
"type": "word_delimiter",
"preserve_original": True
}
}
I am performing the following search:
{
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "analyzer": "shingle_analyzer",
            "fields": [
              "text_2"
            ],
            "lenient": "false",
            "default_operator": "and",
            "query": "porsches small red"
          }
        }
      ]
    }
  }
}
Now, I have a document with text_2 = small red porsches. Since I am using the AND operator, I would expect my document NOT to match, since the above query should produce a shingle of "porsches small red", which is in a different order. However, when I look at the match explanation, I am only seeing the single-word tokens "red", "small", "porsche", which of course match.
Is SQS incompatible with shingles?
The answer is "Yes, but...".
What you're seeing is normal given the fact that the text_2 field probably has the standard index analyzer in your mapping (according to the explanation you're seeing), i.e. the only tokens that have been produced and indexed for small red porsches are small, red and porsches.
On the query side, you're probably using a shingle analyzer with output_unigrams set to true (default), which means that the unigram tokens will also be produced in addition to the bigrams (again according to the explanation you're seeing). Those unigrams are the only reason why you get matches at all. If you want to match on bigrams, then one solution is to use the shingle analyzer at indexing time, too, so that bigrams small red and red porsches can be produced and indexed as well in addition to the unigrams small, red and porsches.
Then at query time, the unigrams would match as well but small red bigram would definitely match, too. In order to only match on the bigrams, you can have another shingle analyzer just for query time whose output_unigrams is set to false, so that only bigrams get generated out of your search input. And in case your query only contains one single word (e.g. porsches), then that shingle analyzer would only generate a single unigram (because output_unigrams_if_no_shingles is true) and the query would still match your document. If that's not desired you can simply set output_unigrams_if_no_shingles to false in your shingle search analyzer.
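To illustrate (an untested sketch: the index and analyzer names are made up, and the mapping uses the newer text type rather than the string type shown in the question), the index-time and query-time shingle analyzers could be wired up like this:
PUT shingle_test
{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 5,
          "output_unigrams": true
        },
        "query_shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 5,
          "output_unigrams": false,
          "output_unigrams_if_no_shingles": true
        }
      },
      "analyzer": {
        "shingle_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "filter_shingle" ]
        },
        "shingle_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "query_shingle" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text_2": {
        "type": "text",
        "analyzer": "shingle_index_analyzer",
        "search_analyzer": "shingle_search_analyzer"
      }
    }
  }
}
With this setup the query-time analyzer emits only shingles, so with the and operator a query for porsches small red shouldn't match a document containing only small red porsches, while a single-word query still falls back to a unigram because output_unigrams_if_no_shingles is true.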
