Allowing hyphen-based words to be tokenized in Elasticsearch

I have the following mapping for a field name that will hold product names for e-commerce.
'properties': {
    'name': {
        'type': 'text',
        'analyzer': 'standard',
        'fields': {
            'english': {
                'type': 'text',
                'analyzer': 'english'
            }
        }
    }
}
Assume I have the following string to be indexed/searched:
A pack of 3 T-shirts
Both analyzers produce the terms [t, shirts] and [t, shirt] respectively.
This gives me the problem of not getting any result when a user types "mens tshirts".
How can I get terms in the inverted index like [t, shirts, shirt, tshirt, tshirts]?
I tried looking into stemmer exclusions, but I couldn't find anything to deal with hyphens. It would also be helpful if a more generic solution could be found rather than doing exclusions manually, because there could be many possibilities I don't know about yet, e.g. emails, e-mails.

The whitespace tokenizer could do the job:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-tokenizer.html
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
will produce
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
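For a quick sanity check without a running cluster, the whitespace tokenizer's behavior can be approximated in plain Python (a simplification: the real tokenizer also emits offsets and positions):

```python
def whitespace_tokenize(text):
    """Split on whitespace only, keeping punctuation attached to tokens,
    which is what Elasticsearch's whitespace tokenizer does."""
    return text.split()

tokens = whitespace_tokenize("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.")
print(tokens)
# ['The', '2', 'QUICK', 'Brown-Foxes', 'jumped', 'over', 'the', 'lazy', "dog's", 'bone.']
```

Note that "Brown-Foxes" stays one token, so for the original question this only helps if users actually type the hyphenated form; a search for "tshirts" would still miss "t-shirts".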

I found one solution which I guess could help me achieve the desired results. However, I would still like to see if there is some good, recommended approach for this problem.
Basically, I will use multi-fields for this problem, where the first analyzer will be standard and the second will be my custom one.
According to the Elasticsearch documentation, char_filter runs before the tokenizer. So the idea is to replace - with an empty string, which turns t-shirts into tshirts. The tokenizer then puts the whole term tshirts into the inverted index.
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {"type": "stop", "stopwords": "_english_"}
  ],
  "char_filter": [
    "html_strip",
    {"type": "mapping", "mappings": ["- => "]}
  ],
  "text": "these are t-shirts <table>"
}
will give the following tokens:
{
  "tokens": [
    {
      "token": "tshirts",
      "start_offset": 10,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
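The effect of this analysis chain can be sketched in plain Python (a rough approximation: strip HTML, delete hyphens, split on non-alphanumerics like the standard tokenizer, lowercase, then drop stopwords; the stopword set here is a tiny stand-in for _english_):

```python
import re

STOPWORDS = {"these", "are", "the", "a", "an", "of"}  # tiny subset of _english_

def custom_analyze(text):
    text = re.sub(r"<[^>]+>", " ", text)          # html_strip char filter
    text = text.replace("-", "")                  # mapping char filter: "- => "
    tokens = re.findall(r"[0-9A-Za-z']+", text)   # rough standard tokenizer
    tokens = [t.lower() for t in tokens]          # lowercase filter
    return [t for t in tokens if t not in STOPWORDS]  # stop filter

print(custom_analyze("these are t-shirts <table>"))
# ['tshirts']
```

Because the hyphen is removed before tokenization, "t-shirts" indexes as the single term "tshirts", so a query for "tshirts" now matches.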


Elasticsearch with Java: exclude matches with random leading characters in a letter

I am new to Elasticsearch. I managed to get things working somewhat close to what I intended. I am using the following configuration:
{
  "analysis": {
    "filter": {
      "shingle_filter": {
        "type": "shingle",
        "min_shingle_size": 2,
        "max_shingle_size": 3,
        "output_unigrams": true,
        "token_separator": ""
      },
      "autocomplete_filter": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 20
      }
    },
    "analyzer": {
      "shingle_search": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase"
        ]
      },
      "shingle_index": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "shingle_filter",
          "autocomplete_filter"
        ]
      }
    }
  }
}
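To see what terms the shingle_index analyzer actually produces, the shingle and edge_ngram filters can be imitated in Python (a sketch that ignores positions and offsets):

```python
def shingles(tokens, min_size=2, max_size=3, sep="", unigrams=True):
    """Mimics the shingle filter: joins adjacent tokens with token_separator."""
    out = list(tokens) if unigrams else []
    for n in range(min_size, max_size + 1):
        for i in range(len(tokens) - n + 1):
            out.append(sep.join(tokens[i:i + n]))
    return out

def edge_ngrams(tokens, min_gram=1, max_gram=20):
    """Mimics the edge_ngram filter: prefixes of each token."""
    return [t[:n] for t in tokens for n in range(min_gram, min(len(t), max_gram) + 1)]

terms = edge_ngrams(shingles(["ron", "mathews"]))
print(terms[:6])
# ['r', 'ro', 'ron', 'm', 'ma', 'mat']
```

Note that no term starting mid-word (like "iron") is produced, since edge n-grams are prefixes only; "iron" matches "ron" because of the fuzziness in the query, not because of these filters.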
I have this applied over multiple fields and I am doing a multi-match query.
Following is the Java code:
NativeSearchQuery searchQuery = new NativeSearchQueryBuilder()
    .withQuery(QueryBuilders.multiMatchQuery(i)
        .field("title")
        .field("alias")
        .fuzziness(Fuzziness.ONE)
        .type(MultiMatchQueryBuilder.Type.BEST_FIELDS))
    .build();
The problem is that it matches fields whose letters have some leading characters.
For example, if my search input is "ron" I want it to match "ron mathews", but I don't want it to match "iron". How can I make sure that I am only matching terms with no leading characters?
Update-1
Turning off fuzzy transpositions seems to improve the search results, but I think we can make it better.
You probably want to score "ron" higher than "ronaldo", and an exact match on the complete field "ron" even higher, so the best option here would be to use a few subfields with standard and keyword analyzers and boost those fields in your multi_match query.
Also, as you figured out yourself, be careful with the fuzziness. It might make sense to run 2 queries in a should clause, one fuzzy and one boosted, so that exact matches are ranked higher.
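A sketch of that suggested request body, built as a plain dict (the keyword subfield names title.keyword / alias.keyword and the boost value 3 are assumptions for illustration, not something from the original mapping):

```python
def build_query(text):
    """bool/should: one fuzzy multi_match for recall, plus one boosted
    non-fuzzy multi_match over keyword subfields so exact matches rank higher."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"multi_match": {
                        "query": text,
                        "fields": ["title", "alias"],
                        "fuzziness": 1}},
                    {"multi_match": {
                        "query": text,
                        "fields": ["title.keyword^3", "alias.keyword^3"]}},
                ]
            }
        }
    }

print(build_query("ron"))
```

A document matching both clauses (an exact "ron") accumulates the score of both, while a fuzzy-only match like "iron" gets only the unboosted clause.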

elasticsearch synonyms & shingle conflict

Let me jump straight to the code.
PUT /test_1
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms": [
            "university of tokyo => university_of_tokyo, u_tokyo",
            "university => college, educational_institute, school"
          ],
          "tokenizer": "whitespace"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "shingle",
            "synonym"
          ]
        }
      }
    }
  }
}
The output:
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Token filter [shingle] cannot be used to parse synonyms"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Token filter [shingle] cannot be used to parse synonyms"
  },
  "status": 400
}
Basically, let's say I have the following index-time synonyms:
"university => university, college, educational_institute, school"
"tokyo => tokyo, japan_capitol"
"university of tokyo => university_of_tokyo, u_tokyo"
If I search for "college" I expect it to match "university of tokyo",
but since the index contains only university_of_tokyo and u_tokyo, the search fails.
I was expecting that if I use analyzer {"filter": ["shingle", "synonym"]}, then:
university of tokyo -shingle-> university -synonym-> college, institute
How do I obtain the desired behaviour?
I was getting a similar error, though I was using synonym_graph.
I tried using lenient=true in the synonym graph definition and got rid of the error. Not sure if there is a downside...
"graph_synonyms": {
  "lenient": "true",
  "type": "synonym_graph",
  "synonyms_path": "synonyms.txt"
}
According to this link, tokenizers should produce single tokens before a synonym filter.
But to answer your problem: first of all, your second rule should be modified like this to make all of the terms synonyms of each other:
university , college, educational_institute, school
Second, because of the underscore at the tail of the first rule (university_of_tokyo), all occurrences of "university of tokyo" are indexed as university_of_tokyo, which is not aware of its single tokens. To overcome this problem I would suggest a char filter with a rule like this:
university of tokyo => university_of_tokyo university of tokyo
and then in your synonym rules:
university_of_tokyo , u_tokyo
This is a way to handle the multi-term synonym problem as well.
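The combined effect of that char filter rule and the synonym rules can be sketched in Python (a simplification of the analysis chain; the synonym table mirrors the rules above):

```python
def char_filter(text):
    # "university of tokyo => university_of_tokyo university of tokyo"
    return text.replace("university of tokyo",
                        "university_of_tokyo university of tokyo")

SYNONYMS = {
    "university_of_tokyo": ["u_tokyo"],
    "university": ["college", "educational_institute", "school"],
}

def index_terms(text):
    """char filter, whitespace split, then single-token synonym expansion."""
    out = []
    for t in char_filter(text).split():
        out.append(t)
        out.extend(SYNONYMS.get(t, []))
    return out

print(index_terms("university of tokyo"))
# ['university_of_tokyo', 'u_tokyo', 'university', 'college',
#  'educational_institute', 'school', 'of', 'tokyo']
```

Because the char filter keeps the original words alongside the concatenated form, the single-token synonym rule for "university" still fires, so a search for "college" now matches "university of tokyo".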

ElasticSearch - exclude special character from standard stemmer

I'm using the standard analyzer for my Elasticsearch index, and I have noticed that when I search a query with % in it, the analyzer drops the % during analysis (e.g. on the query "2% milk"):
GET index_name/_analyze
{
  "field": "text.english",
  "text": "2% milk"
}
The response is the following 2 tokens (2 and milk):
{
  "tokens": [
    {
      "token": "2",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "milk",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
Meaning, the 2% becomes 2.
I want to keep the standard tokenizer's handling of punctuation; I don't want to switch to the whitespace tokenizer or some other non-standard one, but I do want the <number>% sign kept as a term in the index.
Is there a way to configure the analyzer to ignore the special character when it's next to a number? Worst case, not to strip it at all?
Thanks!
You can achieve the desired behavior by configuring a custom analyzer with a character filter that prevents the "%" character from getting stripped away.
Check the Elasticsearch documentation about the configuration of the built-in analyzers and use that as a blueprint for your custom analyzer (see Elasticsearch Reference: english analyzer).
Add a character filter that maps the percent character to a different string, as demonstrated in the following code snippet:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_percent_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_percent_char_filter": {
          "type": "mapping",
          "mappings": [
            "0% => 0_percent",
            "1% => 1_percent",
            "2% => 2_percent",
            "3% => 3_percent",
            "4% => 4_percent",
            "5% => 5_percent",
            "6% => 6_percent",
            "7% => 7_percent",
            "8% => 8_percent",
            "9% => 9_percent"
          ]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The fee is between 0.93% or 2%"
}
With this, you can even search for specific percentages (like 2%)!
Alternative Solution
If you simply want to remove the percent character, you can use the very same approach, but map the % character to an empty string, as shown in the following code snippet:
"char_filter": {
  "my_percent_char_removal_filter": {
    "type": "mapping",
    "mappings": [
      "% => "
    ]
  }
}
BTW: this approach is not considered a "hack"; it's the standard approach to modify your original string before it gets sent to the tokenizer.
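The ten mapping rules can be sanity-checked with a small Python stand-in for the char filter (mapping each digit+% pair before a rough standard-tokenizer split):

```python
import re

def percent_char_filter(text):
    # same effect as the ten "N% => N_percent" mapping rules above
    return re.sub(r"(\d)%", r"\1_percent", text)

def analyze_percent(text):
    # rough stand-in for the standard tokenizer + lowercase filter
    return [t.lower() for t in re.findall(r"[\w.]+", percent_char_filter(text))]

print(analyze_percent("The fee is between 0.93% or 2%"))
# ['the', 'fee', 'is', 'between', '0.93_percent', 'or', '2_percent']
```

Since the replacement happens before tokenization, "2%" survives as the single searchable term "2_percent", and the same rewrite applied at query time makes a search for "2%" match it.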

Elasticsearch Synonyms - How is precedence determined?

Say I have a synonym file with just the two synonym lines below:
ft , synonym_1
10 ft , synonym_2
When I use this file in an analyzer and analyze the text "10 ft", I get the following:
{
  "tokens": [
    {
      "token": "10"
    },
    {
      "token": "ft"
    },
    {
      "token": "synonym_2"
    }
  ]
}
synonym_1 doesn't appear, even though "ft" matched a token in the analyzed text. Is this because of some precedence between single tokens and phrases? Does "10 ft" match more of the analyzed text, so it's the only synonym that takes effect? Is there some way to get the first synonym to work in this case?
Note: I'm using a whitespace tokenizer, and analyzing the text "30 ft" gives me synonym_1. It's only when "10 ft" appears exactly that the first synonym is broken.
"simplified_analyzer": {
  "filter": [
    "lowercase",
    "stemmer",
    "synonyms",
    "edge_ngrams",
    "remove_duplicates"
  ],
  "char_filter": ["remove_html", "remove_non_alphanumeric"],
  "tokenizer": "whitespace"
}
Do I have to use a second synonym filter to handle single words?
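The output is consistent with the synonym filter consuming the longest rule that matches at each position: once "10 ft" matches the two-token rule, both tokens are consumed and the single-token "ft" rule never gets a chance. A toy longest-match sketch of that reading (my interpretation of the behavior, not Lucene's actual code):

```python
RULES = {("ft",): "synonym_1", ("10", "ft"): "synonym_2"}

def apply_synonyms(tokens):
    out, i = [], 0
    max_len = max(len(k) for k in RULES)
    while i < len(tokens):
        # try the longest rule first; the first hit wins and consumes its tokens
        for n in range(max_len, 0, -1):
            key = tuple(tokens[i:i + n])
            if len(key) == n and key in RULES:
                out.extend(key)          # keep the originals, add the synonym
                out.append(RULES[key])
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(apply_synonyms(["10", "ft"]))   # ['10', 'ft', 'synonym_2']
print(apply_synonyms(["30", "ft"]))   # ['30', 'ft', 'synonym_1']
```

This reproduces both observations from the question: "30 ft" triggers synonym_1, while "10 ft" yields only synonym_2.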

Elasticsearch Query String Query with # symbol and wildcards

I defined a custom analyzer that I was surprised isn't built-in:
"analyzer": {
  "keyword_lowercase": {
    "type": "custom",
    "filter": [
      "lowercase"
    ],
    "tokenizer": "keyword"
  }
}
Then my mapping for this field is:
"email": {
  "type": "string",
  "analyzer": "keyword_lowercase"
}
This works great. (http://.../_analyze?field=email&text=me#example.com) ->
"tokens": [
  {
    "token": "me#example.com",
    "start_offset": 0,
    "end_offset": 16,
    "type": "word",
    "position": 1
  }
]
Finding by that keyword works great: http://.../_search?q=me#example.com yields results.
The problem is trying to incorporate wildcards anywhere in the query string query: http://.../_search?q=*me#example.com yields no results. I would expect results containing emails such as "me#example.com" and "some#example.com".
It looks like Elasticsearch performs the search with the default analyzer, which doesn't make sense. Shouldn't it perform the search with each field's own analyzer?
I.e. http://.../_search?q=email:*me#example.com returns results because I am telling it which analyzer to use based on the field.
Can Elasticsearch not do this?
See http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
Set analyze_wildcard to true, as it is false by default.
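In request-body form, the same option looks like this, built here as a plain dict (the field name email comes from the question's mapping):

```python
def email_wildcard_query(pattern):
    """query_string with analyze_wildcard=True, so the wildcard term is run
    through the field's analyzer (keyword_lowercase) instead of being left
    unanalyzed as it is by default."""
    return {
        "query": {
            "query_string": {
                "default_field": "email",
                "query": pattern,
                "analyze_wildcard": True,
            }
        }
    }

print(email_wildcard_query("*me#example.com"))
```

With analyze_wildcard enabled, "*me#example.com" is lowercased by the keyword_lowercase analyzer before matching, so it behaves the same as the explicit email: field query.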
