Elasticsearch custom analyzer breaks words containing special characters

If a user searches for foo(bar), Elasticsearch breaks it into foo and bar.
What I'm trying to achieve is that when a user types, say, i want a foo(bar), I match exactly an item named foo(bar). The name is fixed and will be used by a filter, so it is set to the keyword type.
The approximate steps I took:
define a custom analyzer
define a dictionary containing foo(bar)
define a synonym mapping containing abc => foo(bar)
Now, when I search for abc, elasticsearch translates it to foo(bar), but then it breaks it into foo and bar.
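Roughly, the relevant part of my index settings looks like the sketch below (index, analyzer and filter names are made up for this post, and the dictionary entry for foo(bar) lives in the analysis plugin's dictionary file, so it isn't shown):
PUT my_items
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["abc => foo(bar)"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["my_synonyms"]
        }
      }
    }
  }
}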
The question, as you may have guessed, is: how do I preserve special characters in an Elasticsearch analyzer?
I tried using quotes (") in the dictionary file, like "foo(bar)", but it didn't work.
Or is there maybe another way to work around this problem?
By the way, I'm using foo(bar) here just for simplicity, the actual case is much more complicated.
Thanks in advance.
---- edit ----
Thanks to #star67, I now realize that this is an issue with the tokenizer.
I am using the medcl/elasticsearch-analysis-ik plugin, which provides the ik_smart tokenizer, designed for the Chinese language.
Although I now know what the real problem is, I still don't know how to solve it. I have to use the ik_smart tokenizer, but how do I modify it to exclude certain special characters from tokenization?
I know that I can define a custom pattern tokenizer, as #star67 suggested:
{
  "tokenizer": {
    "my_tokenizer": {
      "type": "pattern",
      "pattern": "[^\\w\\(\\)]+"
    }
  }
}
But I also want to use the ik_smart tokenizer, because in Chinese, words are not separated by spaces; for example, 弹性搜索很厉害 should be tokenized as ['弹性', '搜索', '很', '厉害']. Words can only be split based on a dictionary, so the default behavior is not desirable. What I want is maybe something like this:
{
  "tokenizer": {
    "my_tokenizer": {
      "tokenizer": "ik_smart",
      "ignores": "[\\w\\(\\)]+"
    }
  }
}
But I couldn't find an equivalent setting in Elasticsearch.
Do I have to build my own plugin to achieve this?

You might want to use another tokenizer in your custom analyzer for your index.
For example, the standard tokenizer (the one used by the standard analyzer) splits on roughly all non-word characters (\W+):
POST _analyze
{
  "analyzer": "standard",
  "text": "foo(bar)"
}
==>
{
  "tokens" : [
    {
      "token" : "foo",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "bar",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
Compare this to a custom tokenizer that splits on all non-word characters except ( and ) (i.e. the pattern [^\w\(\)]+):
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\w\\(\\)]+"
        }
      }
    }
  }
}
POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "foo(bar)"
}
===>
{
  "tokens" : [
    {
      "token" : "foo(bar)",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "word",
      "position" : 0
    }
  ]
}
I used a pattern tokenizer as an example to exclude certain symbols (( and ) in your case) from being used as token delimiters.
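As a follow-up sketch (the name field and its mapping below are my assumption, not part of your setup): if the field is mapped with my_analyzer, a match query analyzes the query text with the same pattern tokenizer, so foo(bar) stays a single term and matches exactly:
PUT my-index-000001/_mapping
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
GET my-index-000001/_search
{
  "query": {
    "match": {
      "name": "i want a foo(bar)"
    }
  }
}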

Related

Elasticsearch's minimumShouldMatch for each member of an array

Consider an Elasticsearch entity:
{
  "id": 123456,
  "keywords": ["apples", "bananas"]
}
Now, imagine I would like to find this entity by searching for apple.
{
  "match" : {
    "keywords" : {
      "query" : "apple",
      "operator" : "AND",
      "minimum_should_match" : "75%"
    }
  }
}
The problem is that the 75% minimum for matching would be required across both of the strings in the array, so nothing will be found. Is there a way to say something like minimumShouldMatch: "75% of any array field"?
Note that I need to use AND as each item of keywords may be composed of longer text.
EDIT:
I tried the proposed solutions, but none of them gave the expected results. I guess the problem is that the text might be quite long, e.g.:
["national gallery in prague", "narodni galerie v praze"]
I guess the fuzzy expansion is just not able to expand such long strings if you start searching with "national g".
Would this maybe be possible somehow via nested objects?
{ "keywords": [{ "keyword": "apples" }, { "keyword": "bananas" }] }
and then have minimumShouldMatch=1 on keywords and 75% on each keyword?
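Something along these lines is what I have in mind (index and field names here are just for illustration):
PUT my-index
{
  "mappings": {
    "properties": {
      "keywords": {
        "type": "nested",
        "properties": {
          "keyword": { "type": "text" }
        }
      }
    }
  }
}
GET my-index/_search
{
  "query": {
    "nested": {
      "path": "keywords",
      "query": {
        "match": {
          "keywords.keyword": {
            "query": "national gallery",
            "minimum_should_match": "75%"
          }
        }
      }
    }
  }
}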
As per the docs:
The match query is of type boolean. It means that the text provided is analyzed and the analysis process constructs a boolean query from the provided text. The operator parameter can be set to or or and to control the boolean clauses (defaults to or). The minimum number of optional should clauses to match can be set using the minimum_should_match parameter.
If you are searching for multiple tokens, for example "apples mangoes", and set the minimum to 100%, it means both tokens should be present in the document. If you set it to 50%, it means at least one of them should be present.
If you want to match tokens partially, you can use the fuzziness parameter. Using fuzziness you can set the maximum edit distance allowed for matching:
{
  "query": {
    "match": {
      "keywords": {
        "query": "apple",
        "fuzziness": "auto"
      }
    }
  }
}
If you are trying to match a word to its root form, you can use the "stemmer" token filter:
PUT index-name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [ "stemmer" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keywords": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Tokens generated
GET index-name/_analyze
{
  "text": ["apples", "bananas"],
  "analyzer": "my_analyzer"
}
"tokens" : [
{
"token" : "appl",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "banana",
"start_offset" : 7,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 101
}
]
Stemming reduces words to their root form.
You can also explore n-grams and edge n-grams for partial matching.
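For example, a minimal edge_ngram setup could look like the sketch below (analyzer and filter names are illustrative). Prefixes of each keyword token get indexed, while the plain standard analyzer is used at search time, so a partial query such as "national gal" can still match:
PUT index-name
{
  "settings": {
    "analysis": {
      "filter": {
        "edge_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      },
      "analyzer": {
        "autocomplete": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "edge_filter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keywords": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}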

Indexing the last word of a string in Elasticsearch

I'm looking for a way to index the last word (or more generally: the last token) of a field into a separate sub-field. I've looked into the predicate script token filter, but the Painless script API in that context only provides the absolute position of the token from the start of the original input string, so I could find the first token like this:
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "predicate_token_filter",
      "script": {
        "source": """
          token.position == 0
        """
      }
    }
  ],
  "text": "the fox jumps the lazy dog"
}
This works and results in:
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
But I need the last token, not the first. Is there any way to achieve this without preparing a separate field pre-indexing, outside of Elasticsearch?
You're on the right path!! The solution is not that far from what you have... You know you can easily fetch the first token, but what you need is the last one... so just reverse the string...
The following analyzer will output just the token you need, i.e. dog.
We first reverse the whole string, then split it into tokens, use your predicate script to select only the first one, and finally reverse that token again. Voilà!
POST test/_analyze
{
  "text": "the fox jumps the lazy dog",
  "tokenizer": "keyword",
  "filter": [
    "reverse",
    "word_delimiter",
    {
      "type": "predicate_token_filter",
      "script": {
        "source": """
          token.position == 0
        """
      }
    },
    "reverse"
  ]
}
Result:
{
  "tokens" : [
    {
      "token" : "dog",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    }
  ]
}
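If you want to persist this into a separate sub-field at index time (as asked in the question), a sketch of the index settings could look like this (field, filter and analyzer names are mine, just for illustration):
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "first_token_only": {
          "type": "predicate_token_filter",
          "script": {
            "source": "token.position == 0"
          }
        }
      },
      "analyzer": {
        "last_token": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "reverse", "word_delimiter", "first_token_only", "reverse" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "last_word": {
            "type": "text",
            "analyzer": "last_token"
          }
        }
      }
    }
  }
}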

Simple Elasticsearch PDF Text Search using german language

I can handle/extract the text from my PDF files, but I don't quite know if I am going about storing my content in Elasticsearch the right way.
My PDF texts are mostly German, with letters like "ö", "ä", etc.
In order to store EVERY character of the content, I "escape" the necessary characters and encode them properly as JSON so I can store them.
For example:
I want to store the following (PDF) text:
Öffentliche Verkehrsmittel. TestPath: C:\Windows\explorer.exe
I convert and upload it to Elasticsearch like this:
{"text":"\\u00D6ffentliche Verkehrsmittel. TestPath: C:\\\\Windows\\\\explorer.exe"}
My question is: Is this the right way to store documents like this?
Elasticsearch comes with a wide range of built-in language-specific analyzers. If you create a text field and store your data, the standard analyzer is used by default; you can change it like below:
{
  "mappings": {
    "properties": {
      "title.german": {
        "type": "text",
        "analyzer": "german"
      }
    }
  }
}
You can also check the tokens generated by the language analyzer (german in your case) using the _analyze API:
{
  "text" : "Öffentliche",
  "analyzer" : "german"
}
And the generated token:
{
  "tokens": [
    {
      "token": "offentlich",
      "start_offset": 0,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
Tokens for Ö
{
  "text" : "Ö",
  "analyzer" : "german"
}
{
  "tokens": [
    {
      "token": "o",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
Note: it converted the character to its plain ASCII form, so now whether you search for Ö or ö the document will show up in the search results, as the same analyzer is applied at query time if you use a match query.
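For example, a match query like the sketch below (the index name is illustrative and assumes the title.german mapping from above) finds the document whether the user types Öffentliche or offentliche, because both are normalized to the same token offentlich:
GET my-index/_search
{
  "query": {
    "match": {
      "title.german": "offentliche"
    }
  }
}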

Elasticsearch modify asciifolding

The ASCII folding token filter folds the "Ə"/"ə" characters (U+018F / U+0259) to "A"/"a". I need to modify this, or add a fold to "E"/"e". A char_filter doesn't help and doesn't preserve the original.
Adding the analyzer:
curl -XPUT 'localhost:9200/myix/_settings?pretty' -H 'Content-Type: application/json' -d'
{
  "analysis" : {
    "analyzer" : {
      "default" : {
        "tokenizer" : "standard",
        "filter" : ["standard", "my_ascii_folding"]
      }
    },
    "filter" : {
      "my_ascii_folding" : {
        "type" : "asciifolding",
        "preserve_original" : true
      }
    }
  }
}
'
Test result:
http://localhost:9200/myix/_analyze?text=üöğıəçşi_ÜÖĞIƏÇŞİ&filter=my_ascii_folding
{
  "tokens": [
    {
      "token": "uogiacsi_UOGIACSI",
      "start_offset": 0,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "üöğıəçşi_ÜÖĞIƏÇŞİ",
      "start_offset": 0,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
Looking at Lucene's ASCIIFoldingFilter.java source file, it does indeed seem like Ə gets folded into an A and not an E. Even the ICU folding filter, which is asciifolding on steroids, does the same folding.
However, there's an interesting discussion on the subject, and it seems that given the pronunciation it should indeed be folded into an a and not an e:
A quick search on English or French Wikipedia, where it currently gets folded, shows that it gets folded to an a! I would have expected an e based on orthography, but a makes sense in terms of pronunciation (in English, at least).
Someone else even thinks that neither a nor e makes sense:
That seems like a really bad decision. I don't think ə should fold to either of a or e.
Anyway, I don't think there is a way around this except using a char_filter or extending the ASCIIFoldingFilter and bundling it into an ES analysis plugin yourself.
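For reference, the char_filter route could look like the sketch below (the index name and char_filter name are mine). As you noted, though, the character is rewritten before tokenization, so the original ə form is not preserved in the folded token:
PUT myix2
{
  "settings": {
    "analysis": {
      "char_filter": {
        "schwa_to_e": {
          "type": "mapping",
          "mappings": [ "ə => e", "Ə => E" ]
        }
      },
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "char_filter": [ "schwa_to_e" ],
          "filter": [ "my_ascii_folding" ]
        }
      }
    }
  }
}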

How can I index a field using two different analyzers in Elasticsearch

Say that I have a field "productTitle" which I want my users to use when searching for products.
I also want to apply autocomplete functionality, so I'm using an autocomplete_analyzer with the following filter:
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10
}
However, at the same time, when users run an actual search I don't want the edge_ngram to be applied, since it produces a lot of irrelevant results.
For example, when users want to search for "mi" and start typing "m", "mi", they should get the results starting with m, mi as auto-complete options. However, when they actually run the query, they should only get results containing the word "mi". Currently they also see results with "mini" etc.
Therefore, is it possible to have "productTitle" indexed using two different analyzers? Is the multi-field type an option for me?
EDIT: Mapping for productTitle
"productTitle" : {
"type" : "string",
"index_analyzer" : "second",
"search_analyzer" : "standard",
"fields" : {
"raw" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
The "second" analyzer:
"analyzer": {
"second": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"trim",
"autocomplete_filter"
]
}
So when I'm querying with:
"filtered" : {
"query" : {
"match" : {
"productTitle" : {
"query" : "mi",
"type" : "boolean",
"minimum_should_match" : "2<75%"
}
}
}
}
I also get results like "mini", but I need to get only the results containing just "mi".
Thank you
Hmm... as far as I know, there is no way to apply multiple analyzers to the same field... what you can do is use "Multi Fields".
Here is an example of how to apply different analyzers to "sub-fields":
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html#_multi_fields_with_multiple_analyzers
The correct way of preventing what you describe in your question is to specify both analyzer and search_analyzer in your field mapping, like this:
"productTitle": {
"type": "string",
"analyzer": "autocomplete_analyzer",
"search_analyzer": "standard"
}
The autocomplete analyzer will kick in at indexing time and tokenize your title according to your edge_ngram configuration, and the standard analyzer will kick in at search time without applying the edge_ngram stuff.
In this context, there is no need for multi-fields unless you need to tokenize the productTitle field in different ways.
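If you do want both behaviours available at query time, a multi-field sketch (keeping the older string type used in the question) could look like this; you would then query productTitle for exact word matches and productTitle.autocomplete for as-you-type suggestions:
"productTitle": {
  "type": "string",
  "analyzer": "standard",
  "fields": {
    "autocomplete": {
      "type": "string",
      "analyzer": "autocomplete_analyzer",
      "search_analyzer": "standard"
    },
    "raw": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}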
