Getting an exact match with Elasticsearch 6 and PHP ClientBuilder - elasticsearch

I'm building an Elasticsearch-powered layered navigation module for an ecommerce site. It's all working great: I can fetch the options from my external source and display them. Selecting them works too, but I've run into a snag where one of the filter options has these choices:
FINISHES:
Finished (1)
Semi-Finished (16)
Semi Finished (1)
Clearly the two variations with and without a hyphen should be tidied up, but ignoring that for a moment, I apply the following to my collection:
$client = $this->clientBuilder;
$params .... etc
$params['body']['query']['bool']['must'][] = ['match_phrase' => [$split[0] => "$selected"]];
$response = $client->search($params);
where $split[0] is the Elasticsearch field reference for 'FINISHES' and $selected is the chosen value. Whichever option I click, I get all 18 records back, no doubt because they all contain one of the words being searched: 'finished'.
How can I make this search for the exact term only? I've tried escaping the hyphen with \- which didn't help. I've also tried checking whether the searched term has spaces or hyphens and forcibly adding them to 'must_not', but that didn't work either:
if(!$space) {
$params['body']['query']['bool']['must_not'][] = ['match' => [$split[0] => ' ']];
}
if(!$hyphen) {
$params['body']['query']['bool']['must_not'][] = ['match' => [$split[0] => '\\-']];
}

By default, the standard analyzer is applied to all text fields. So in your case the value Semi-Finished is indexed as the two terms semi and finished, and every time you search for finished it matches, because the standard analyzer breaks the text on the hyphen.
POST _analyze
{
"analyzer": "standard",
"text": ["Semi-Finished"]
}
Result:
{
"tokens" : [
{
"token" : "semi",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "finished",
"start_offset" : 5,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
The .keyword sub-field searches against the original, non-analyzed text (it behaves like the keyword analyzer shown below), so in your case querying fieldname.keyword should work.
POST _analyze
{
"analyzer": "keyword",
"text": ["Semi-Finished"]
}
Result:
{
"tokens" : [
{
"token" : "Semi-Finished",
"start_offset" : 0,
"end_offset" : 13,
"type" : "word",
"position" : 0
}
]
}
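For example, assuming the default dynamic mapping created a .keyword sub-field for your FINISHES field (the field name finishes and the index name products below are placeholders), a term query against that sub-field matches only the exact stored value:
GET products/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "finishes.keyword": "Semi-Finished" } }
      ]
    }
  }
}
In the PHP client this amounts to appending .keyword to $split[0] in your query clause.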

Related

How do I index non-standard version numbers in Elasticsearch?

Our documents have a field of keyword type called "version_numbers", which essentially stores an array of version numbers. For example: ["1.00.100.01", "2.00.470.00"].
This versioning follows a specific pattern, where each group should be associated with search keywords. Here's a breakdown of the versioning pattern:
1.00.240.15
1.00 -> major version
240  -> minor version
15   -> maintenance version
I want to build an analyzer such that:
the major version is associated with the keyword APP_VERSION and can be searched with queries like APP_VERSION1.0, APP_VERSION1.00, APP_VERSION1, etc.
the minor version is associated with the keyword XP and can be searched with queries like XP24, XP 24, XP240, etc.
the maintenance version is associated with the keyword Rev and can be searched with queries like Rev15, Rev 15, etc.
documents can also be queried with a combination of all three, like APP_VERSION1 XP240 Rev15
How do I associate each group of the version pattern with the keywords specified above?
Here's what I've tried so far to tokenize versions:
{
"analysis": {
"analyzer": {
"version_analyzer": {
"tokenizer": "version_tokenizer"
}
},
"tokenizer": {
"version_tokenizer": {
"type": "simple_pattern_split",
"pattern": "\\d"
}
}
}
}
But this seems to be splitting by dots only.
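Output like the one below can be obtained by testing the custom analyzer with the _analyze API, for example (my-index is a placeholder for an index that defines version_analyzer):
POST my-index/_analyze
{
  "analyzer": "version_analyzer",
  "text": "2.00.470.00"
}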
{
"tokens" : [
{
"token" : "2",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "00",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 1
},
{
"token" : "470",
"start_offset" : 5,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "00",
"start_offset" : 9,
"end_offset" : 11,
"type" : "word",
"position" : 3
}
]
}
I'm new to Elasticsearch, so I highly appreciate any hints/guidance on this.

Arabic diacritics ignoring in elasticsearch

I have a case where I want to use Elasticsearch as a text search engine for fairly long Arabic HTML text.
The search works fine, except for words with diacritics; it doesn't seem to recognize them.
For example:
This sentence: ' وَهَكَذَا في كُلّ عَقْدٍ' (this is the one stored in the db)
is the exact same as this: 'وهكذا في كل عقد' (this is what the user enters for search)
It's exactly the same except for the added diacritics, which computers handle as separate characters (even though they are just rendered on top of other characters).
I want to know if there's a way to make the search ignore all diacritics.
The first method I'm thinking of is telling Elasticsearch to completely ignore diacritics when indexing (kind of like stop words?).
If not, is it reasonable to have another field in the document (text_normalized) where I manually remove the diacritics before adding it to Elasticsearch? Would that be efficient?
To solve your problem you can use the arabic_normalization token filter, which removes diacritics from the text before indexing. You need to define a custom analyzer, which should look something like this:
"analyzer": {
"rebuilt_arabic": {
"tokenizer": "standard",
"filter": [
"lowercase",
"decimal_digit",
"arabic_stop",
"arabic_normalization",
"arabic_keywords",
"arabic_stemmer"
]
}
}
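Note that the custom filter names referenced above (arabic_stop, arabic_keywords, arabic_stemmer) also have to be defined in the index settings. A sketch based on the rebuilt arabic analyzer from the Elasticsearch docs (the keyword_marker list is just an example) would be:
"filter": {
  "arabic_stop": {
    "type": "stop",
    "stopwords": "_arabic_"
  },
  "arabic_keywords": {
    "type": "keyword_marker",
    "keywords": ["مثال"]
  },
  "arabic_stemmer": {
    "type": "stemmer",
    "language": "arabic"
  }
}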
You can check with the _analyze API:
GET /_analyze
{
"tokenizer" : "standard",
"filter" : ["arabic_normalization"],
"text" : "وَهَكَذَا في كُلّ عَقْدٍ"
}
Result from Analyzer:
{
"tokens" : [
{
"token" : "وهكذا",
"start_offset" : 0,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "في",
"start_offset" : 10,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "كل",
"start_offset" : 13,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "عقد",
"start_offset" : 18,
"end_offset" : 24,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
As you can see, the diacritics are removed. For more information you can check the Elasticsearch documentation.

ElasticSearch / NEST search with synonyms, plurals and misspellings

I want to do a search that implements the following items.
Right now I have implemented all of this through regex; it is far from covering everything, and I would like to know how much of it I could do with Elasticsearch instead:
Synonyms
My understanding is that this is implemented when the index is created.
indexSettings.Analysis.TokenFilters.Add("synonym", new SynonymTokenFilter { Synonyms = new[] { "tire => tyre", "aluminum => aluminium" }, IgnoreCase = true, Tokenizer = "whitespace" });
but do I need to include the plurals as well? or,
Singular words (shoes and shoe should be an identical match)
does that mean that I need to put 'shoes' in the synonym list? or is there another way?
Small misspellings, substitutions and omissions should be allowed
so that 'automobile', 'automoble' or 'automoblie' would match. I don't know if this is even possible.
Ignore all stop words
right now I'm removing all the 'the', 'this', 'my', etc. through regex
All my search terms are plain English words and numbers; nothing else is allowed.
All of this is possible through configuring/writing a custom analyzer within Elasticsearch. To answer each question in turn:
Synonyms
Synonyms can be applied at index time, search time, or both. There are tradeoffs to consider whichever approach you choose:
Applying synonyms at index time results in faster searches than applying them at search time, at the cost of more disk space, lower indexing throughput, and less ease and flexibility in adding or removing synonyms.
Applying synonyms at search time allows for greater flexibility at the expense of search speed.
You also need to consider the size of the synonym list and how frequently, if ever, it changes. I would try both and decide which works best for your scenario and requirements; a search-time setup is sketched below.
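As a rough sketch of the search-time approach (index and field names are made up, and this assumes Elasticsearch 7+ syntax without mapping types), you would define a synonym filter inside an analyzer that is applied only as the search_analyzer:
PUT my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "tire, tyre",
            "aluminum, aluminium"
          ]
        }
      },
      "analyzer": {
        "synonym_search": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "synonym_search"
      }
    }
  }
}
Applying the synonym filter in the index-time analyzer instead (and dropping search_analyzer) gives the index-time variant.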
Singular words (shoes and shoe should be an identical match)
You may consider using stemming to reduce plural and singular words to their root form, using an algorithmic or dictionary-based stemmer. Perhaps start with the English Snowball stemmer and see how it works for you.
You should also consider whether you need to index the original word form as well, e.g. should exact word matches rank higher than matches on the stemmed root form? One way to do this is with a multi-field, shown below.
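A minimal multi-field sketch (again with made-up index and field names, Elasticsearch 7+ syntax) that indexes the original form alongside a stemmed sub-field, then boosts exact-form matches at query time:
PUT my-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "stemmed": {
            "type": "text",
            "analyzer": "english"
          }
        }
      }
    }
  }
}

GET my-index/_search
{
  "query": {
    "multi_match": {
      "query": "shoes",
      "fields": ["content^2", "content.stemmed"]
    }
  }
}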
Small misspellings, substitutions and omissions should be allowed
Consider using queries that support fuzziness to handle typos and misspellings. If there are spelling errors in the indexed data, consider some form of sanitization before indexing; as with any data store, garbage in, garbage out :)
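A hypothetical fuzzy match query (the content field is just an example) might look like this:
GET my-index/_search
{
  "query": {
    "match": {
      "content": {
        "query": "automoble",
        "fuzziness": "AUTO"
      }
    }
  }
}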
Ignore all stop words
Use an English Stop token filter to remove stop words.
Putting all of this together, an example analyzer might look like this:
void Main()
{
var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
var defaultIndex = "default-index";
var connectionSettings = new ConnectionSettings(pool)
.DefaultIndex(defaultIndex);
var client = new ElasticClient(connectionSettings);
if (client.IndexExists(defaultIndex).Exists)
client.DeleteIndex(defaultIndex);
client.CreateIndex(defaultIndex, c => c
.Settings(s => s
.Analysis(a => a
.TokenFilters(t => t
.Stop("my_stop", st => st
.StopWords("_english_", "i've")
.RemoveTrailing()
)
.Synonym("my_synonym", st => st
.Synonyms(
"dap, sneaker, pump, trainer",
"soccer => football"
)
)
.Snowball("my_snowball", st => st
.Language(SnowballLanguage.English)
)
)
.Analyzers(an => an
.Custom("my_analyzer", ca => ca
.Tokenizer("standard")
.Filters(
"lowercase",
"my_stop",
"my_snowball",
"my_synonym"
)
)
)
)
)
.Mappings(m => m
.Map<Message>(mm => mm
.Properties(p => p
.Text(t => t
.Name(n => n.Content)
.Analyzer("my_analyzer")
)
)
)
)
);
client.Analyze(a => a
.Index(defaultIndex)
.Field<Message>(f => f.Content)
.Text("Loving those Billy! Them is the maddest soccer trainers I've ever seen!")
);
}
public class Message
{
public string Content { get; set; }
}
my_analyzer produces the following tokens for the text above:
{
"tokens" : [
{
"token" : "love",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "those",
"start_offset" : 7,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "billi",
"start_offset" : 13,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "them",
"start_offset" : 20,
"end_offset" : 24,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "maddest",
"start_offset" : 32,
"end_offset" : 39,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "football",
"start_offset" : 40,
"end_offset" : 46,
"type" : "SYNONYM",
"position" : 7
},
{
"token" : "trainer",
"start_offset" : 47,
"end_offset" : 55,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "dap",
"start_offset" : 47,
"end_offset" : 55,
"type" : "SYNONYM",
"position" : 8
},
{
"token" : "sneaker",
"start_offset" : 47,
"end_offset" : 55,
"type" : "SYNONYM",
"position" : 8
},
{
"token" : "pump",
"start_offset" : 47,
"end_offset" : 55,
"type" : "SYNONYM",
"position" : 8
},
{
"token" : "ever",
"start_offset" : 61,
"end_offset" : 65,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "seen",
"start_offset" : 66,
"end_offset" : 70,
"type" : "<ALPHANUM>",
"position" : 11
}
]
}

Elasticsearch is not returning expected results for term query

This is how my article data looks in Elasticsearch:
id:123,
title:xyz,
keywords:"Test Example"
id:124,
title:xyzz,
keywords:"Test Example|test1"
When a keyword is clicked on the front end, say for example 'Test Example', I should get the articles having that keyword (I should get the above two articles as my results). But I am getting only the first article as my result. Below is my mapping:
"keywords":
{
"type":"string",
"index":"not_analysed"
}
How can I get both articles in the search results? Thank you.
Term query searches for exact terms. That's why when you search for Test Example you get only one result: there is only one record that exactly matches Test Example. If you want both results you need to use something like match or query_string. You can use query_string like this:
{
"query": {
"query_string": {
"default_field": "keywords",
"query": "Test Example*"
}
}
}
You have to query with query_string; a term query searches only for the exact term.
You set your keywords field to not_analyzed: if you want the field to be searchable you should remove the index clause, like so:
"keywords": {
"type":"string"
}
Searching this field with a match query, however, will return results containing a superset of the provided query: searching for test will return both documents even though the tag is actually Test Example.
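For example, with the analyzed field a plain match query like this (a sketch, not part of the original answer) would match both documents:
{
  "query": {
    "match": {
      "keywords": "test"
    }
  }
}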
If you can change your documents to something like this
id:123,
title:xyz,
keywords:"Test Example"
id:124,
title:xyzz,
keywords: ["Test Example", "test1"]
you can use your original mapping with "index":"not_analyzed" and a term query will return only documents containing exactly the tag you were looking for:
{
"query": {
"term": {
"keywords": "test1"
}
}
}
Another option is to use a pattern tokenizer to split your tag string on the | character:
"tokenizer": {
"split_tags": {
"type": "pattern",
"group": "-1",
"pattern": "\|"
}
}
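To use that tokenizer it has to be referenced from a custom analyzer in the index settings and assigned to the keywords field. A rough sketch (the index name articles and the analyzer name tag_analyzer are made up, and this uses current Elasticsearch syntax rather than the older string mapping from the question):
PUT articles
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_tags": {
          "type": "pattern",
          "pattern": "\\|",
          "group": -1
        }
      },
      "analyzer": {
        "tag_analyzer": {
          "tokenizer": "split_tags"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keywords": {
        "type": "text",
        "analyzer": "tag_analyzer"
      }
    }
  }
}
Running POST articles/_analyze with that analyzer on "Test Example|test1" should then yield the tokens Test Example and test1.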
I have got it working with the following tokenizer:
"split_keywords": {
"type": "pattern",
"group": "0",
"pattern": "([^|]+)"
}
Keywords will be split at the pipe character (below is an example):
{
"tokens" : [ {
"token" : "TestExample",
"start_offset" : 0,
"end_offset" : 12,
"type" : "word",
"position" : 1
}, {
"token" : "test",
"start_offset" : 13,
"end_offset" : 17,
"type" : "word",
"position" : 2
}, {
"token" : "1",
"start_offset" : 17,
"end_offset" : 18,
"type" : "word",
"position" : 3
}, {
"token" : "test1",
"start_offset" : 13,
"end_offset" : 18,
"type" : "word",
"position" : 3
} ]
}
Now when I search for 'TestExample', I get the above two articles.
Thanks a lot for your help :)

How to check the tokens generated for different tokenizers in Elasticsearch

I have been using different types of tokenizers for testing and demonstration purposes. I need to check how a particular text field is tokenized using different tokenizers and also see the tokens generated.
How can I achieve that?
You can use the _analyze endpoint for this purpose.
For instance, using the standard analyzer, you can analyze this is a test like this
curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'this is a test'
And this produces the following tokens:
{
"tokens" : [ {
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 3
}, {
"token" : "test",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 4
} ]
}
Of course, you can use any of the existing analyzers, and you can also specify tokenizers using the tokenizer parameter, token filters using the token_filters parameter, and character filters using the char_filters parameter. For instance, analyzing the HTML snippet THIS is a <b>TEST</b> with the keyword tokenizer, the lowercase token filter, and the html_strip character filter yields a single lowercase token without the HTML markup:
curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&token_filters=lowercase&char_filters=html_strip' -d 'THIS is a <b>TEST</b>'
{
"tokens" : [ {
"token" : "this is a test",
"start_offset" : 0,
"end_offset" : 21,
"type" : "word",
"position" : 1
} ]
}
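Note that in recent Elasticsearch versions the query-string form of _analyze has been removed; the same options are passed in a request body instead, roughly:
GET _analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "char_filter": ["html_strip"],
  "text": "THIS is a <b>TEST</b>"
}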
Apart from what @Val has mentioned, you can try out the term vectors API if you intend to study how tokenizers work. You can try something like this to examine the tokenization happening in a field:
GET /index-name/type-name/doc-id/_termvector?fields=field-to-be-examined
To learn more about tokenizers and how they work, you can refer to this blog.
