I have a case where I want to use Elasticsearch as a text search engine for fairly long Arabic HTML text.
The search works fine except for words with diacritics: it doesn't seem to be able to recognize them.
For example:
This sentence: ' وَهَكَذَا في كُلّ عَقْدٍ' (this is the one stored in the db)
is the exact same as this: 'وهكذا في كل عقد' (this is what the user enters for search)
It's exactly the same text except for the added diacritics, which computers handle as separate characters (they are just rendered on top of the other characters).
I want to know if there's a way to make the search ignore all diacritics.
The first method I'm thinking of is telling Elasticsearch to completely ignore diacritics when indexing (kind of like stopwords?).
If not, would it be reasonable to add another field to the document (text_normalized) where I manually remove the diacritics before indexing, and would that be efficient?
To solve your problem you can use the arabic_normalization token filter; it removes diacritics from the text before indexing. You need to define a custom analyzer, which should look something like this:
"analyzer": {
"rebuilt_arabic": {
"tokenizer": "standard",
"filter": [
"lowercase",
"decimal_digit",
"arabic_stop",
"arabic_normalization",
"arabic_keywords",
"arabic_stemmer"
]
}
}
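Since arabic_stop, arabic_keywords, and arabic_stemmer are names of custom filters, they also have to be defined under analysis.filter. A full index-creation sketch, closely following the "rebuilt" Arabic analyzer example in the Elasticsearch documentation (the index name and the keyword_marker list are placeholders):
PUT /arabic_example
{
  "settings": {
    "analysis": {
      "filter": {
        "arabic_stop": {
          "type": "stop",
          "stopwords": "_arabic_"
        },
        "arabic_keywords": {
          "type": "keyword_marker",
          "keywords": ["مثال"]
        },
        "arabic_stemmer": {
          "type": "stemmer",
          "language": "arabic"
        }
      },
      "analyzer": {
        "rebuilt_arabic": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "decimal_digit",
            "arabic_stop",
            "arabic_normalization",
            "arabic_keywords",
            "arabic_stemmer"
          ]
        }
      }
    }
  }
}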
Checking with the _analyze API:
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["arabic_normalization"],
  "text" : "وَهَكَذَا في كُلّ عَقْدٍ"
}
Result from Analyzer:
{
"tokens" : [
{
"token" : "وهكذا",
"start_offset" : 0,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "في",
"start_offset" : 10,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "كل",
"start_offset" : 13,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "عقد",
"start_offset" : 18,
"end_offset" : 24,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
As you can see, the diacritics are removed. For more information, you can check the Elasticsearch documentation on language analyzers and the arabic_normalization token filter.
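Regarding the second part of the question: with this approach a separate text_normalized field isn't needed, because the same analyzer normalizes both the indexed text and the query text. A minimal sketch of wiring the analyzer into a field and searching it (the field name is a placeholder, and the mapping syntax assumes Elasticsearch 7.x or later):
PUT /arabic_example/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "rebuilt_arabic"
    }
  }
}

GET /arabic_example/_search
{
  "query": {
    "match": {
      "content": "وهكذا في كل عقد"
    }
  }
}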
Related
Our documents have a field of keyword type called "version_numbers", which essentially stores an array of version numbers. For example: ["1.00.100.01", "2.00.470.00"].
This versioning follows a specific pattern, where each group should be associated with search keywords. Here's a breakdown of the versioning pattern:
1.00.240.15
  1.00 → major version
  240  → minor version
  15   → maintenance version
I want to build an analyzer such that:
major version is associated with keyword APP_VERSION and can be searched with queries like APP_VERSION1.0, APP_VERSION1.00, APP_VERSION1 etc.
minor version is associated with keyword XP and can be searched with queries like XP24, XP 24, XP240 etc.
maintenance version is associated with keyword Rev and can be searched with queries like Rev15, Rev 15 etc.
documents also can be queried in a combination of all three, like APP_VERSION1 XP240 Rev15 etc.
How do I associate each group of version pattern with keywords specified above?
Here's what I've tried so far to tokenize versions:
{
  "analysis": {
    "analyzer": {
      "version_analyzer": {
        "tokenizer": "version_tokenizer"
      }
    },
    "tokenizer": {
      "version_tokenizer": {
        "type": "simple_pattern_split",
        "pattern": "\\d"
      }
    }
  }
}
But this seems to be splitting by dots only.
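For reference, this is roughly how token output like the one below can be inspected with the _analyze API against the custom analyzer (the index name is a placeholder):
POST /my-index/_analyze
{
  "analyzer": "version_analyzer",
  "text": "2.00.470.00"
}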
{
"tokens" : [
{
"token" : "2",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "00",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 1
},
{
"token" : "470",
"start_offset" : 5,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "00",
"start_offset" : 9,
"end_offset" : 11,
"type" : "word",
"position" : 3
}
]
}
I'm new to Elasticsearch, so I'd highly appreciate any hints or guidance on this.
I'm building an Elasticsearch-powered layered navigation module for an ecommerce site. It's all working great: I can fetch the options from my external source and display them. Selecting them works too, but I've run into a snag where one of the filter options has these choices:
FINISHES:
Finished (1)
Semi-Finished (16)
Semi Finished (1)
Clearly the two variations with and without a hyphen should be tidied up, but ignoring that for a moment, when I apply the following to my collection:
$client = $this->clientBuilder;
$params .... etc
$params['body']['query']['bool']['must'][] = ['match_phrase' => [$split[0] => "$selected"]];
$response = $client->search($params);
Where $split[0] is the Elasticsearch field ref for 'FINISHES' and $selected is the chosen value. If I click on any of the options, I get all 18 records back, no doubt because they all contain one of the words being searched, 'finished'.
How can I make this search for the exact term only? I've tried escaping the hyphen with \- which didn't help. I've also tried checking whether the searched term has spaces or hyphens and forcibly adding them to 'must_not', but that didn't work either:
if (!$space) {
    $params['body']['query']['bool']['must_not'][] = ['match' => [$split[0] => ' ']];
}
if (!$hyphen) {
    $params['body']['query']['bool']['must_not'][] = ['match' => [$split[0] => '\\-']];
}
By default, the standard analyzer is applied to all text fields. So in your case, for the value Semi-Finished the inverted index will contain two terms, semi and finished, and every time you search for finished it matches, since the standard analyzer breaks the text on the hyphen.
POST _analyze
{
  "analyzer": "standard",
  "text": ["Semi-Finished"]
}
Result:
{
"tokens" : [
{
"token" : "semi",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "finished",
"start_offset" : 5,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
The .keyword sub-field searches against the original, non-analyzed text. In your case, querying fieldname.keyword should work.
POST _analyze
{
  "analyzer": "keyword",
  "text": ["Semi-Finished"]
}
Result:
{
"tokens" : [
{
"token" : "Semi-Finished",
"start_offset" : 0,
"end_offset" : 13,
"type" : "word",
"position" : 0
}
]
}
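Applied to the PHP snippet from the question, that would mean querying the keyword sub-field, for example with a term query. A sketch, assuming the default .keyword sub-field exists in the mapping for that field:
$params['body']['query']['bool']['must'][] = [
    'term' => [$split[0] . '.keyword' => "$selected"]
];
$response = $client->search($params);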
In Elasticsearch, is there a way to set up an analyzer that would produce position gaps between tokens when line breaks or punctuation marks are encountered?
Let's say I index an object with the following nonsensical string (with line break) as one of its fields:
The quick brown fox runs after the rabbit.
Then comes the jumpy frog.
The standard analyzer will yield the following tokens with respective positions:
0 the
1 quick
2 brown
3 fox
4 runs
5 after
6 the
7 rabbit
8 then
9 comes
10 the
11 jumpy
12 frog
This means that a match_phrase query of the rabbit then comes will match this document as a hit.
Is there a way to introduce a position gap between rabbit and then so that it doesn't match unless a slop is introduced?
Of course, a workaround could be to transform the single string into an array (one line per entry) and use position_offset_gap in the field mapping, but I would really rather keep a single string with newlines (and an ultimate solution would involve larger position gaps for newlines than for, say, punctuation marks).
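For reference, that array-based workaround would look roughly like this in a 1.x mapping (index, type, and field names are placeholders; in Elasticsearch 2.0+ the parameter was renamed to position_increment_gap):
PUT /index
{
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "position_offset_gap": 100
        }
      }
    }
  }
}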
I eventually figured out a solution using a char_filter to introduce extra tokens on line breaks and punctuation marks:
PUT /index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": [ ".=>\\n_PERIOD_\\n", "\\n=>\\n_NEWLINE_\\n" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_mapping"],
          "filter": ["lowercase"]
        }
      }
    }
  }
}
Testing with the example string
POST /index/_analyze?analyzer=my_analyzer&pretty
The quick brown fox runs after the rabbit.
Then comes the jumpy frog.
yields the following result:
{
"tokens" : [ {
"token" : "the",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 1
}, {
... snip ...
"token" : "rabbit",
"start_offset" : 35,
"end_offset" : 41,
"type" : "<ALPHANUM>",
"position" : 8
}, {
"token" : "_period_",
"start_offset" : 41,
"end_offset" : 41,
"type" : "<ALPHANUM>",
"position" : 9
}, {
"token" : "_newline_",
"start_offset" : 42,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 10
}, {
"token" : "then",
"start_offset" : 43,
"end_offset" : 47,
"type" : "<ALPHANUM>",
"position" : 11
... snip ...
} ]
}
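With this analyzer applied to the field, a match_phrase query spanning the sentence boundary no longer matches unless the slop is large enough to cover the injected _period_ and _newline_ tokens sitting between rabbit and then. A quick sketch, assuming the field is called body and is mapped with my_analyzer:
POST /index/_search
{
  "query": {
    "match_phrase": {
      "body": "the rabbit then comes"
    }
  }
}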
I'm using Elasticsearch 1.7.1 on Mac.
Here are my index settings:
{
  "settings": {
    "analysis": {
      "filter": {
        "my_edgengram": {
          "type": "edgeNGram",
          "min_gram": 1,
          "max_gram": 15,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      },
      "analyzer": {
        "stop_edgengram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "stop",
            "my_edgengram"
          ]
        }
      }
    }
  }
}
Debugging the analyzer:
$ curl -XGET 'http://localhost:9200/objects/_analyze?analyzer=stop_edgengram_analyzer&text=America,s&pretty=True'
{
"tokens" : [
... skipped ...
, {
"token" : "america",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
}, {
"token" : "america,",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
}, {
"token" : "america,s",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 1
} ]
}
Why is the america,s token in the output?
, is a punctuation symbol. I expect letters and digits only, as specified in the token_chars property of the my_edgengram filter.
You are confusing the edge_ngram tokenizer with the edge_ngram token filter.
From the documentation:
Tokenizers are used to break a string down into a stream of terms or
tokens.
In the example provided in the question, whitespace is the tokenizer being used.
A token filter, on the other hand, will:
accept a stream of tokens from a tokenizer and can
modify tokens (eg lowercasing), delete tokens (eg remove stopwords) or
add tokens (eg synonyms).
In the example provided in the OP, the edge_ngram token filter is being used.
token_chars is not supported by the edge_ngram token filter and is hence ignored.
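If the goal is to emit letter-and-digit tokens only, one option (a sketch, not part of the original answer) is to move the edge n-gramming into an edgeNGram tokenizer, which does support token_chars. The stop filter is dropped here, since stopword removal no longer sees whole words once n-gramming happens in the tokenizer:
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_edgengram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": 1,
          "max_gram": 15,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      },
      "analyzer": {
        "stop_edgengram_analyzer": {
          "type": "custom",
          "tokenizer": "my_edgengram_tokenizer",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
With this setup the comma in America,s acts as a token boundary, so only america (and its edge n-grams) and s are produced.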
I have been using different types of tokenizers for test and demonstration purposes. I need to check how a particular text field is tokenized using different tokenizers and also see the tokens generated.
How can I achieve that?
You can use the _analyze endpoint for this purpose.
For instance, using the standard analyzer, you can analyze this is a test like this
curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'this is a test'
And this produces the following tokens:
{
"tokens" : [ {
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 3
}, {
"token" : "test",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 4
} ]
}
Of course, you can use any of the existing analyzers, and you can also specify tokenizers using the tokenizer parameter, token filters using the token_filters parameter, and character filters using the char_filters parameter. For instance, analyzing the HTML input THIS is a <b>TEST</b> with the keyword tokenizer, the lowercase token filter, and the html_strip character filter yields the following, i.e. a single lowercase token without the HTML markup:
curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&token_filters=lowercase&char_filters=html_strip' -d 'THIS is a <b>TEST</b>'
{
"tokens" : [ {
"token" : "this is a test",
"start_offset" : 0,
"end_offset" : 21,
"type" : "word",
"position" : 1
} ]
}
Apart from what #Val has mentioned, you can also try out term vectors if you intend to study the workings of tokenizers. You can try something like this just to examine the tokenization happening in a field:
GET /index-name/type-name/doc-id/_termvector?fields=field-to-be-examined
To know more about tokenizers and their operations, you can refer to this blog.