How can I refine my Elasticsearch query to better distinguish its results?

My challenge here is to create an autocomplete field (Django and ES), where I could search for "apeni", "rua apen" or "roa apen" and get "rua apeninos" as the main (or only) option. I have already tried suggest and completion in ES, but both are prefix-based (they don't work with "apen"). I tried wildcards as well, but couldn't combine them with fuzziness (doesn't work with "roa apeni" or "apini"). So now I am trying a match query with fuzziness.
But even when the query term is different, like "rua ape" or "rua apot", it returns the same two docs, with street_desc equal to "rua apeninos" and "rua apotribu", both with a score of 1.0.
Query:
{
  "aggs": {
    "addresses": {
      "filters": {
        "filters": {
          "street": {
            "match": {
              "street_desc": {
                "query": "rua ape",
                "fuzziness": "AUTO",
                "prefix_length": 0,
                "max_expansions": 50
              }
            }
          }
        }
      },
      "aggs": {
        "street_bucket": {
          "significant_terms": {
            "field": "street_desc.raw",
            "size": 3
          }
        }
      }
    }
  },
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    }
  ]
}
Index:
{
  "catalogs": {
    "mappings": {
      "properties": {
        "street_desc": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            }
          },
          "analyzer": "suggest_analyzer"
        }
      }
    }
  }
}
Analyzer (Python):
from elasticsearch_dsl import analyzer, token_filter, tokenizer

suggest_analyzer = analyzer(
    'suggest_analyzer',
    tokenizer=tokenizer("lowercase"),
    filter=[token_filter('stopbr', 'stop', stopwords="_brazilian_")],
    language="brazilian",
    char_filter=["html_strip"]
)

Adding an end-to-end working example, which I tested against all the given search terms.
Index-mapping
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
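To see why this mapping works, here is a rough Python sketch of the tokens the autocomplete analyzer produces at index time. It is an approximation of the edge_ngram filter for illustration only, not the actual Lucene implementation:

```python
def autocomplete_tokens(text, min_gram=1, max_gram=10):
    """Approximate the index-time chain: tokenize on whitespace,
    lowercase, then emit edge n-grams (prefixes) of each token."""
    tokens = []
    for word in text.lower().split():
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens

# "rua apeninos" is indexed under every prefix of each word, so the
# search terms "apen" and "apeni" (analyzed with the plain "standard"
# search analyzer, i.e. left intact) hit the index directly.
print(autocomplete_tokens("rua apeninos"))
```

The `search_analyzer: standard` part is what keeps the query side from being n-grammed too; the whole query term is looked up against the stored prefixes.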
Index sample docs
{
  "title": "rua apotribu"
}
{
  "title": "rua apeninos"
}
Search queries
{
  "query": {
    "match": {
      "title": {
        "query": "apeni",
        "fuzziness": "AUTO"
      }
    }
  }
}
And the search result:
"hits": [
  {
    "_index": "64881760",
    "_type": "_doc",
    "_id": "1",
    "_score": 1.1026623,
    "_source": {
      "title": "rua apeninos"
    }
  }
]
With "apen" it also gives a search result:
"hits": [
  {
    "_index": "64881760",
    "_type": "_doc",
    "_id": "1",
    "_score": 2.517861,
    "_source": {
      "title": "rua apeninos"
    }
  }
]
And when the query term is different, like "rua apot", it returns both docs, with a much higher score for "rua apotribu", as shown in the search result below.
"hits": [
  {
    "_index": "64881760",
    "_type": "_doc",
    "_id": "2",
    "_score": 2.9289336,
    "_source": {
      "title": "rua apotribu"
    }
  },
  {
    "_index": "64881760",
    "_type": "_doc",
    "_id": "1",
    "_score": 0.41107285,
    "_source": {
      "title": "rua apeninos"
    }
  }
]
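The "fuzziness": "AUTO" used in the queries above maps a term's length to a maximum Levenshtein edit distance (0 edits for terms of 1-2 characters, 1 for 3-5 characters, 2 for longer terms). A small sketch of that rule, assuming plain Levenshtein distance rather than Lucene's automaton-based implementation:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def auto_fuzziness(term):
    """AUTO: allowed edits grow with term length."""
    if len(term) < 3:
        return 0
    return 1 if len(term) <= 5 else 2

# "roa" is within the AUTO edit budget of "rua", which is why the
# misspelled street name still matches.
print(levenshtein("roa", "rua"), auto_fuzziness("roa"))
```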

Related

Elasticsearch. How to stem plurals ending in es?

For the EN language I have a custom analyzer using porter_stem. I want queries with the words "virus" and "viruses" to return the same results.
What I am finding is that Porter stems virus -> viru and viruses -> virus. Consequently, I get differing results.
How can I handle this?
You can achieve your use case (queries with the words "virus" and "viruses" returning the same results) by using the snowball token filter, which stems words to their root form.
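The reason this fixes the mismatch is that the same filter runs at both index time and query time, so both forms meet at one root in the inverted index. A deliberately tiny Python illustration of that idea, using a single hard-coded suffix rule as a stand-in for the real Snowball algorithm (illustrative only):

```python
def toy_stem(word):
    """Toy stand-in for a stemmer: collapse 'viruses' and 'virus'
    onto the same root. Not the real Snowball algorithm."""
    w = word.lower()
    if w.endswith("ses"):      # "viruses" -> "virus"
        return w[:-2]
    return w

# Build a minimal inverted index with the filter applied at index time.
index = {}
for doc_id, text in enumerate(["viruses", "virus"]):
    index.setdefault(toy_stem(text), set()).add(doc_id)

def search(term):
    # The same filter is applied at query time.
    return index.get(toy_stem(term), set())
```

Because both "virus" and "viruses" stem to "virus", either query returns both documents, which is exactly the behavior the search result below shows.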
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_snow"
          ]
        }
      },
      "filter": {
        "my_snow": {
          "type": "snowball",
          "language": "English"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "desc": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Analyze API
GET /_analyze
{
  "analyzer": "my_analyzer",
  "text": "viruses"
}
The following tokens are generated:
{
  "tokens": [
    {
      "token": "virus",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
Index Data:
{
  "desc": "viruses"
}
{
  "desc": "virus"
}
Search Query:
{
  "query": {
    "match": {
      "desc": {
        "query": "viruses"
      }
    }
  }
}
Search Result:
"hits": [
  {
    "_index": "65707743",
    "_type": "_doc",
    "_id": "1",
    "_score": 0.18232156,
    "_source": {
      "desc": "viruses"
    }
  },
  {
    "_index": "65707743",
    "_type": "_doc",
    "_id": "2",
    "_score": 0.18232156,
    "_source": {
      "desc": "virus"
    }
  }
]

Elasticsearch query to return part of words searched for

I would like to know how I can return "thanks" or "thanking" if I search for "thank".
Currently I have a multi_match query which returns only occurrences of "thank", like "thank you", but not "thanksgiving" or "thanks". I am using Elasticsearch 7.9.1.
query: {
  bool: {
    must: [
      { match: { accountId } },
      {
        multi_match: {
          query: "thank",
          type: "most_fields",
          fields: ["text", "address", "description", "notes", "name"],
        }
      }
    ],
    filter: { match: { type: "personaldetails" } }
  }
},
Also, is it possible to combine the multi_match query with a query_string on one of the fields? (Say description, where I would do a query_string search only on description and a phrase match on the other fields.)
{
  "query": {
    "query_string": {
      "query": "(new york city) OR (big apple)",
      "default_field": "content"
    }
  }
}
Any input is appreciated.
thanks
You can use the edge_ngram tokenizer, which first breaks text down into words whenever it encounters one of a list of specified characters, then emits N-grams of each word where the start of the N-gram is anchored to the beginning of the word.
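In the mapping below, min_gram is 5, so every word of at least five letters is indexed under all of its prefixes of length 5 up to 20; the whole query term "thank" is therefore itself one of the indexed tokens of "thanksgiving". A rough Python sketch of this simplified view of the tokenizer (an approximation, not the Lucene implementation):

```python
import re

def edge_ngram_tokens(text, min_gram=5, max_gram=20):
    """Approximate edge_ngram tokenization: split on anything that is
    not a letter or digit, then emit prefixes of each word."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9]+", text.lower()):
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens

# "thanksgiving" is indexed as thank, thanks, thanksg, ... and the
# query "thank" (left whole by the standard search analyzer) matches
# the first of those tokens. Words shorter than min_gram emit nothing.
print(edge_ngram_tokens("thanksgiving")[:3])
```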
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 5,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "max_ngram_diff": 50
  },
  "mappings": {
    "properties": {
      "notes": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "standard" // note this
      }
    }
  }
}
Index Data:
{
  "notes": "thank"
}
{
  "notes": "thank you"
}
{
  "notes": "thanks"
}
{
  "notes": "thanksgiving"
}
Search Query:
{
  "query": {
    "multi_match": {
      "query": "thank",
      "fields": [ "notes", "name" ]
    }
  }
}
Search Result:
"hits": [
  {
    "_index": "65511630",
    "_type": "_doc",
    "_id": "1",
    "_score": 0.1448707,
    "_source": {
      "notes": "thank"
    }
  },
  {
    "_index": "65511630",
    "_type": "_doc",
    "_id": "3",
    "_score": 0.1448707,
    "_source": {
      "notes": "thank you"
    }
  },
  {
    "_index": "65511630",
    "_type": "_doc",
    "_id": "2",
    "_score": 0.12199639,
    "_source": {
      "notes": "thanks"
    }
  },
  {
    "_index": "65511630",
    "_type": "_doc",
    "_id": "4",
    "_score": 0.06264679,
    "_source": {
      "notes": "thanksgiving"
    }
  }
]
To combine the multi_match query with a query_string, use the query below:
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "thank",
          "fields": [
            "notes",
            "name"
          ]
        }
      },
      "should": {
        "query_string": {
          "query": "(new york city) OR (big apple)",
          "default_field": "content"
        }
      }
    }
  }
}

Elasticsearch - How to specify the same analyzer for search and index

I'm working on a Spanish search engine. (I don't speak Spanish.) Based on my research, the goal is more or less this: 1. filter stopwords like "dos", "de", "la"...; 2. stem the words for both search and index, e.g. if you search "primera", then "primero" and "primer" should also show up.
My attempt:
es_analyzer = {
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": "_spanish_"
        },
        "spanish_stemmer": {
          "type": "stemmer",
          "language": "spanish"
        }
      },
      "analyzer": {
        "default_search": {
          "type": "spanish"
        },
        "rebuilt_spanish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "spanish_stop",
            "spanish_stemmer"
          ]
        }
      }
    }
  }
}
The problem:
When I use "type": "spanish" in "default_search", my query "primera" gets stemmed to "primer", which is correct. But even though I specified "spanish_stemmer" in the filter, the documents in the index aren't stemmed, so when I search for "primera" it only shows exact matches for "primer". Any suggestions on fixing this?
Potential fixes, though I haven't figured out the syntax:
Using the built-in "spanish" analyzer in the filter. What's the syntax?
Adding the Spanish stemmer and stopwords in "default_search". But I don't know how to use compound settings there.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": "_spanish_"
        },
        "spanish_stemmer": {
          "type": "stemmer",
          "language": "spanish"
        }
      },
      "analyzer": {
        "default_search": {
          "type": "spanish",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "spanish_stop",
            "spanish_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "default_search"
      }
    }
  }
}
Index Data:
{
  "title": "primer"
}
{
  "title": "primera"
}
{
  "title": "primero"
}
Search Query:
{
  "query": {
    "match": {
      "title": "primer"
    }
  }
}
Search Result:
"hits": [
  {
    "_index": "stof_64420517",
    "_type": "_doc",
    "_id": "3",
    "_score": 0.13353139,
    "_source": {
      "title": "primer"
    }
  },
  {
    "_index": "stof_64420517",
    "_type": "_doc",
    "_id": "1",
    "_score": 0.13353139,
    "_source": {
      "title": "primera"
    }
  },
  {
    "_index": "stof_64420517",
    "_type": "_doc",
    "_id": "2",
    "_score": 0.13353139,
    "_source": {
      "title": "primero"
    }
  }
]

ElasticSearch match score

I have a simple field of type "text" in my index.
"keywordName": {
  "type": "text"
}
And I have these documents already inserted: "samsung", "samsung galaxy", "samsung cover", "samsung charger".
If I make a simple match query, the results are puzzling:
Query:
GET keywords/_search
{
  "query": {
    "match": {
      "keywordName": "samsung"
    }
  }
}
Results:
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1.113083,
    "hits": [
      {
        "_index": "keywords",
        "_type": "keyword",
        "_id": "samsung galaxy",
        "_score": 1.113083,
        "_source": {
          "keywordName": "samsung galaxy"
        }
      },
      {
        "_index": "keywords",
        "_type": "keyword",
        "_id": "samsung charger",
        "_score": 0.9433406,
        "_source": {
          "keywordName": "samsung charger"
        }
      },
      {
        "_index": "keywords",
        "_type": "keyword",
        "_id": "samsung",
        "_score": 0.8405092,
        "_source": {
          "keywordName": "samsung"
        }
      },
      {
        "_index": "keywords",
        "_type": "keyword",
        "_id": "samsung cover",
        "_score": 0.58279467,
        "_source": {
          "keywordName": "samsung cover"
        }
      }
    ]
  }
}
First question: why does "samsung" not have the highest score?
Second question: how can I make a query or analyzer which gives "samsung" the highest score?
Starting from the same index settings (analyzers, filters, mappings) as in my previous reply, I suggest the following solution. But, as I mentioned, you need to lay down all the requirements in terms of what you need to search for in this index and consider all of this as a complete solution.
DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_stop": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "my_stop",
            "my_snow",
            "asciifolding"
          ]
        }
      },
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": "_french_"
        },
        "my_snow": {
          "type": "snowball",
          "language": "French"
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "keywordName": {
          "type": "text",
          "analyzer": "custom_stop",
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
POST /test/test/_bulk
{"index":{}}
{"keywordName":"samsung galaxy"}
{"index":{}}
{"keywordName":"samsung charger"}
{"index":{}}
{"keywordName":"samsung cover"}
{"index":{}}
{"keywordName":"samsung"}
GET /test/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "keywordName": {
              "query": "samsungs",
              "operator": "and"
            }
          }
        },
        {
          "term": {
            "keywordName.raw": {
              "value": "samsungs"
            }
          }
        },
        {
          "fuzzy": {
            "keywordName.raw": {
              "value": "samsungs",
              "fuzziness": 1
            }
          }
        }
      ]
    }
  },
  "size": 10
}
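As to the first question: the response shows 5 shards, and by default each shard computes term statistics locally, which is the usual reason the exact-match doc does not come out on top in a small index. On a single shard, BM25's field-length normalization already favors the shortest matching field. A sketch computing textbook BM25 by hand for the four documents (standard formula with k1=1.2, b=0.75; an illustration, not Lucene's exact scoring):

```python
import math

def bm25_scores(docs, term, k1=1.2, b=0.75):
    """Plain BM25 for a one-term query over tiny in-memory docs."""
    texts = [d.split() for d in docs]
    N = len(texts)
    df = sum(term in t for t in texts)
    avgdl = sum(len(t) for t in texts) / N
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    scores = {}
    for d, t in zip(docs, texts):
        f = t.count(term)                                  # term frequency
        norm = f + k1 * (1 - b + b * len(t) / avgdl)       # length normalization
        scores[d] = idf * f * (k1 + 1) / norm
    return scores

docs = ["samsung", "samsung galaxy", "samsung cover", "samsung charger"]
scores = bm25_scores(docs, "samsung")
top = max(scores, key=scores.get)  # shortest matching field wins on one shard
```

So with one shard (or `search_type=dfs_query_then_fetch`, which gathers global term statistics), the plain match query already ranks "samsung" first without the extra should clauses.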

Elasticsearch search for Turkish characters

I have some documents that I am indexing with Elasticsearch, but some of the documents are written in upper case with the Turkish characters changed. For example, "kürşat" is written as "KURSAT".
I want to find this document by searching "kürşat". How can I do that?
Thanks
Take a look at the asciifolding token filter.
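The effect of asciifolding can be approximated in Python with Unicode decomposition. This is only an approximation (the real token filter uses a large hand-maintained mapping), but NFKD covers the common accented Latin characters, including ü and ş:

```python
import unicodedata

def ascii_fold(text, preserve_original=False):
    """Lowercase, then strip diacritics: decompose to NFKD and
    drop the combining marks that fall outside ASCII."""
    lowered = text.lower()
    folded = unicodedata.normalize("NFKD", lowered)
    folded = folded.encode("ascii", "ignore").decode("ascii")
    if preserve_original and folded != lowered:
        return [folded, lowered]   # keep both tokens, like the filter
    return [folded]

print(ascii_fold("kürşat", preserve_original=True))
```

Both "kürşat" and "KURSAT" analyze to the token "kursat", which is why either query form matches both documents in the example below.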
Here is a small example for you to try out in Sense:
Index:
DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "analyzer": {
        "turkish_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_ascii_folding"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "turkish_analyzer"
        }
      }
    }
  }
}
POST test/test/1
{
  "name": "kürşat"
}
POST test/test/2
{
  "name": "KURSAT"
}
Query:
GET test/_search
{
  "query": {
    "match": {
      "name": "kursat"
    }
  }
}
Response:
"hits": {
  "total": 2,
  "max_score": 0.30685282,
  "hits": [
    {
      "_index": "test",
      "_type": "test",
      "_id": "2",
      "_score": 0.30685282,
      "_source": {
        "name": "KURSAT"
      }
    },
    {
      "_index": "test",
      "_type": "test",
      "_id": "1",
      "_score": 0.30685282,
      "_source": {
        "name": "kürşat"
      }
    }
  ]
}
Query:
GET test/_search
{
  "query": {
    "match": {
      "name": "kürşat"
    }
  }
}
Response:
"hits": {
  "total": 2,
  "max_score": 0.4339554,
  "hits": [
    {
      "_index": "test",
      "_type": "test",
      "_id": "1",
      "_score": 0.4339554,
      "_source": {
        "name": "kürşat"
      }
    },
    {
      "_index": "test",
      "_type": "test",
      "_id": "2",
      "_score": 0.09001608,
      "_source": {
        "name": "KURSAT"
      }
    }
  ]
}
The 'preserve_original' flag ensures that if a user types 'kürşat', documents with that exact form rank higher than documents that only have 'kursat' (notice the difference in scores between the two query responses).
If you want the scores to be equal, you can set the flag to false.
Hope I got your problem right!