Elasticsearch Synonyms - How is precedence determined?

Say I have a synonym file with just the two synonym lines below
ft , synonym_1
10 ft , synonym_2
When I use this file in an analyzer and analyze the word "10 ft" I get the following:
{
  "tokens": [
    { "token": "10" },
    { "token": "ft" },
    { "token": "synonym_2" }
  ]
}
synonym_1 doesn't appear, even though "ft" matched a token in the analyzed text. Is this because of some precedence with single tokens and phrases? Does "10 ft" match more of the analyzed text and therefore it's the only synonym that takes effect? Is there some way to get the first synonym to work in this case?
Note: I'm using a whitespace tokenizer and analyzing the text "30 ft" gives me synonym_1. It's only when "10 ft" appears exactly that the first synonym is broken.
"simplified_analyzer": {
"filter": [
"lowercase",
"stemmer",
"synonyms",
"edge_ngrams",
"remove_duplicates"
],
"char_filter" => ["remove_html", "remove_non_alphanumeric"],
"tokenizer" => "whitespace"
}
Do I have to use a second synonym filter to handle single words?
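One approach that may work is chaining two synonym filters, so the single-word rule still sees the original ft token after the phrase rule has fired. The sketch below is an assumption, not a confirmed fix: the filter names phrase_synonyms and single_word_synonyms are placeholders, the rules are inlined rather than read from the synonym file, and the rest of the original filter chain is omitted. Running _analyze on "10 ft" against it would show whether both synonym_1 and synonym_2 come out.
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "phrase_synonyms": {
          "type": "synonym",
          "synonyms": ["10 ft, synonym_2"]
        },
        "single_word_synonyms": {
          "type": "synonym",
          "synonyms": ["ft, synonym_1"]
        }
      },
      "analyzer": {
        "simplified_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "phrase_synonyms", "single_word_synonyms"]
        }
      }
    }
  }
}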

Related

ES: how does the field below match, and how to avoid removing a token in a particular scenario?

I have the below mapping:
"sub":{"type":"text", "analyzer":"stop_analyzer"}
I have a query
{
  "_source": ["sub"],
  "query": {
    "fuzzy": { "sub": "Thr" }
  }
}
Analyzer:
{
  "analysis": {
    "analyzer": {
      "stop_analyzer": {
        "tokenizer": "lowercase",
        "filter": ["synonym_graph", "stop_el_filter"]
      }
    },
    "filter": {
      "stop_el_filter": {
        "type": "stop",
        "stopwords": "_english_"
      },
      "synonym_graph": {
        "type": "synonym_graph",
        "lenient": true,
        "synonyms": [
          "americas, us, usa, u.s.a, america => america",
          "americas-us public sector, america ps, ps america, ps usa => ps"
        ]
      }
    }
  }
}
How does the below string match:
(USER_TRIGGERED (ALL:MAINT=8hr ARL of Nodes 02-A/B))
The Analyze API provides the below tokens:
"token": "user"
"token": "triggered"
"token": "all"
"token": "maint",
"token": "hr"
"token": "arl"
"token": "nodes"
"token": "b"
Why is Thr matching this doc? When I analyze Thr, it results in thr.
Is it because fuzzy removes the t to match hr? - Yes, I think I am correct.
And is there any way to not remove that A from A/B - i.e., not treating it as a stop word in particular cases [not tokenizing A when it is not surrounded by spaces]?
Thr is matching your doc because of the fuzzy query, which allows an edit distance of 1 for a term of that length. Hence, fuzzy(Thr) matches the hr token.
Regarding your second question, A is removed because it is an English stop word and you're using the stop token filter. So if you remove that filter, the A will be indexed as well.
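As a sketch of that second point (same settings as above, just without the stop filter; not a confirmed fix): dropping stop_el_filter from the analyzer keeps the a token, while the lowercase tokenizer will still split A/B into a and b because it breaks on non-letters. Alternatively, the stop filter accepts an explicit stopwords array, so you could supply a list that omits a instead of using _english_.
"stop_analyzer": {
  "tokenizer": "lowercase",
  "filter": ["synonym_graph"]
}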

ElasticSearch - exclude special character from standard stemmer

I'm using the standard analyzer for my ElasticSearch index, and I have noticed that when I search a query with % in it, the analyzer drops the % during tokenization (on the query "2% milk"):
GET index_name/_analyze
{
"field": "text.english",
"text": "2% milk"
}
The response is the following 2 tokens (2 and milk):
{
  "tokens": [
    {
      "token": "2",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "milk",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
Meaning, the 2% becomes 2.
I want to keep the standard tokenizer's stripping of punctuation; I don't want to switch to the whitespace tokenizer or another non-standard one, but I do want the <number>% to be a term in the index.
Is there a way to configure the analyzer to keep the special character when it's next to a number, or, worst case, to not drop it at all?
Thanks!
You can achieve the desired behavior by configuring a custom analyzer with a character filter that prevents the "%" character from getting stripped away.
Check the Elasticsearch documentation on configuring the built-in analyzers and use that configuration as a blueprint for your custom analyzer (see Elasticsearch Reference: english analyzer).
Add a character filter that maps the percent character to a different string, as demonstrated in the following code snippet:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_percent_char_filter"]
        }
      },
      "char_filter": {
        "my_percent_char_filter": {
          "type": "mapping",
          "mappings": [
            "0% => 0_percent",
            "1% => 1_percent",
            "2% => 2_percent",
            "3% => 3_percent",
            "4% => 4_percent",
            "5% => 5_percent",
            "6% => 6_percent",
            "7% => 7_percent",
            "8% => 8_percent",
            "9% => 9_percent"
          ]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The fee is between 0.93% or 2%"
}
With this, you can even search for specific percentages (like 2%)!
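For example, assuming a text field mapped with my_analyzer (the field name description below is a placeholder, the mapping is not part of the original answer, and 7.x+ mapping syntax is assumed), a match query for 2% should then look up the 2_percent term rather than a bare 2:
PUT my_index/_mapping
{
  "properties": {
    "description": { "type": "text", "analyzer": "my_analyzer" }
  }
}
GET my_index/_search
{
  "query": {
    "match": { "description": "2%" }
  }
}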
Alternative Solution
If you simply want to remove the percent character, you can use the very same approach, but map the %-character to an empty string, as shown in the following code snippet:
"char_filter": {
"my_percent_char_removal_filter": {
"type": "mapping",
"mappings": [
"% => "]
}
}
BTW: This approach is not considered a "hack"; it's the standard approach for modifying your original string before it gets sent to the tokenizer.
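If enumerating every digit feels verbose, a single generic rule is a possible variant (a sketch, not part of the original answer): it rewrites every %, including ones not directly preceded by a digit, so weigh that trade-off against your data.
"char_filter": {
  "generic_percent_char_filter": {
    "type": "mapping",
    "mappings": ["% => _percent"]
  }
}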

How to find numbers with comma in Elasticsearch?

Querying a number in data like the below doesn't return any result, but with a space after the commas it can be found.
Sample data:
{
"data":"34543,2525,5674,879"
}
Query:
"query": {
"query_string" : {
"query" : "(data:2525)"
}
}
Settings:
"analysis":{
"filter":{
"my_ascii_folding":{
"type":"asciifolding",
"preserve_original":"true"
}
},
"analyzer":{
"default":{
"filter":[
"lowercase",
"my_ascii_folding"
],
"char_filter":[
"html_strip"
],
"tokenizer":"standard"
}
}
}
For example, querying 2525 in "34543, 2525, 5674, 879" is found, but in "34543,2525,5674,879" it isn't.
Without any more information it looks like you're probably using the standard tokenizer. You can show how your tokens are analyzed by using
GET users/_analyze
{
"text": "34543, 2525, 5674, 879"
}
or
GET users/_analyze
{
"text": "34543,2525,5674,879"
}
If you're using the standard tokenizer then 34543,2525,5674,879 is only one token in your inverted index. When you search for 2525 it won't match that token. On the other hand, 34543, 2525, 5674, 879 is tokenized into four tokens without commas, and 2525 matches the second token.
If you want to solve this problem you'll need to use a different tokenizer that always tokenizes on a comma, rather than only when it's at the beginning or end of a token (see Indexing a comma-separated value field in Elastic Search).
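One possible configuration (a sketch, not taken from the linked question) swaps the standard tokenizer for a char_group tokenizer that splits on whitespace and commas. Note that, unlike standard, it won't strip other punctuation, so the character list may need adjusting to your data; the tokenizer name below is a placeholder.
"analysis": {
  "tokenizer": {
    "comma_whitespace_tokenizer": {
      "type": "char_group",
      "tokenize_on_chars": ["whitespace", ","]
    }
  },
  "filter": {
    "my_ascii_folding": {
      "type": "asciifolding",
      "preserve_original": "true"
    }
  },
  "analyzer": {
    "default": {
      "tokenizer": "comma_whitespace_tokenizer",
      "filter": ["lowercase", "my_ascii_folding"],
      "char_filter": ["html_strip"]
    }
  }
}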

Allowing hyphen-based words to be tokenized in elasticsearch

I have the following mapping for a field name that will hold product names for ecommerce.
'properties': {
    'name': {
        'type': 'text',
        'analyzer': 'standard',
        'fields': {
            'english': {
                'type': 'text',
                'analyzer': 'english'
            }
        }
    }
}
Assuming that I have the following string to be indexed/searched.
A pack of 3 T-shirts
Both of the analyzers produce the terms [t, shirts] and [t, shirt] respectively.
This gives me the problem of not getting any result when a user types "mens tshirts".
How can I get terms in the inverted index like [t, shirts, shirt, tshirt, tshirts]?
I tried to look into stemmer exclusions but I couldn't find anything to deal with hyphens. It would also be helpful if a more generic solution were found rather than doing exclusions manually, because there could be many possibilities I don't know of now, e.g. emails, e-mails.
whitespace tokenizer could do the job
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-tokenizer.html
POST _analyze
{
"tokenizer": "whitespace",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
will produce
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
I found one solution which I guess could help me achieve the desired results. However, I would still like to see if there is a good and recommended approach for this problem.
Basically I will use multi-fields for this problem, where the first analyzer will be standard and the second will be my custom one.
According to the Elasticsearch documentation, char_filter runs before the tokenizer. So the idea is to replace - with an empty character, which turns t-shirts into tshirts. Hence the tokenizer will emit the whole term as tshirts into the inverted index.
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "stop", "stopwords": "_english_" }
  ],
  "char_filter": [
    "html_strip",
    { "type": "mapping", "mappings": ["- => "] }
  ],
  "text": "these are t-shirts <table>"
}
will give the following tokens
{
  "tokens": [
    {
      "token": "tshirts",
      "start_offset": 10,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
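A sketch of what that multi-field setup could look like (the index name, analyzer name and sub-field name below are placeholders, not from the question, and 7.x+ mapping syntax is assumed): the hyphen-stripping analyzer is added as a third sub-field next to the existing standard and english ones, and searches would then target name, name.english and name.no_hyphen together, e.g. via multi_match.
PUT products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "hyphen_removal": {
          "type": "mapping",
          "mappings": ["- => "]
        }
      },
      "analyzer": {
        "no_hyphen_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["html_strip", "hyphen_removal"],
          "filter": ["lowercase", "stop"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "english": { "type": "text", "analyzer": "english" },
          "no_hyphen": { "type": "text", "analyzer": "no_hyphen_analyzer" }
        }
      }
    }
  }
}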

Is Simple Query Search compatible with shingles?

I am wondering if it is possible to use shingles with the Simple Query String query. My mapping for the relevant field looks like this:
{
  "text_2": {
    "type": "string",
    "analyzer": "shingle_analyzer"
  }
}
The analyzer and filters are defined as follows:
"analyzer": {
"shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "custom_delimiter", "lowercase", "stop", "snowball", "filter_shingle"]
}
},
"filter": {
"filter_shingle":{
"type":"shingle",
"max_shingle_size":5,
"min_shingle_size":2,
"output_unigrams":"true"
},
"custom_delimiter": {
"type": "word_delimiter",
"preserve_original": True
}
}
I am performing the following search:
{
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "analyzer": "shingle_analyzer",
            "fields": ["text_2"],
            "lenient": "false",
            "default_operator": "and",
            "query": "porsches small red"
          }
        }
      ]
    }
  }
}
Now, I have a document with text_2 = small red porsches. Since I am using the AND operator, I would expect my document to NOT match, since the above query should produce a shingle of "porsches small red", which is a different order. However, when I look at the match explanation I am only seeing the single word tokens "red" "small" "porsche", which of course match.
Is SQS incompatible with shingles?
The answer is "Yes, but...".
What you're seeing is normal given the fact that the text_2 field probably has the standard index analyzer in your mapping (according to the explanation you're seeing), i.e. the only tokens that have been produced and indexed for small red porsches are small, red and porsches.
On the query side, you're probably using a shingle analyzer with output_unigrams set to true (default), which means that the unigram tokens will also be produced in addition to the bigrams (again according to the explanation you're seeing). Those unigrams are the only reason why you get matches at all. If you want to match on bigrams, then one solution is to use the shingle analyzer at indexing time, too, so that bigrams small red and red porsches can be produced and indexed as well in addition to the unigrams small, red and porsches.
Then at query time, the unigrams would match as well, but the small red bigram would definitely match, too. In order to only match on the bigrams, you can have another shingle analyzer just for query time whose output_unigrams is set to false, so that only bigrams get generated out of your search input. And in case your query contains only one single word (e.g. porsches), that shingle analyzer would still generate a single unigram (provided output_unigrams_if_no_shingles is set to true) and the query would still match your document. If that's not desired you can simply set output_unigrams_if_no_shingles to false in your shingle search analyzer.
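A sketch of that index-time/search-time split (the analyzer and filter names are placeholders, and the mapping uses the current text type rather than the legacy string type from the question):
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "index_shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 5,
          "output_unigrams": true
        },
        "search_shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 5,
          "output_unigrams": false,
          "output_unigrams_if_no_shingles": true
        }
      },
      "analyzer": {
        "shingle_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "index_shingle"]
        },
        "shingle_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "search_shingle"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text_2": {
        "type": "text",
        "analyzer": "shingle_index_analyzer",
        "search_analyzer": "shingle_search_analyzer"
      }
    }
  }
}
With default_operator and, the simple_query_string for porsches small red would then have to match both the porsches small and small red bigrams, so a document indexed as small red porsches should no longer match.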
