Position-aware search results in Elasticsearch autocompletion - elasticsearch

I want to implement address autocompletion using Elasticsearch.
The current approach I am investigating is based on search_as_you_type field type.
Consider these two addresses:
3543JN Carl Zellerhof 8 Utrecht (3543JN is postcode)
1234JN The Street 3543 Utrecht
It is important to prioritize some address parts over others. For instance, the postcode should carry more weight than the house number: when a user types 3543, the first address should come first in the search results.
I see two solutions here:
Combine address into one string and give weight based on position within the combined string
Do search on multiple fields (then weight can be adjusted per field, but it seems more complex to me, how to ensure the same address part is not matched several times?)
I am leaning more towards the one-string solution, but the implementation shown below gives both addresses the same weight for the 3543 search query.
Please advise how to implement this.
(It is also desirable to allow some fuzziness)
UPD:
It seems that adding a postcode field to the multi_match fields gives me what I want. Are there any disadvantages to this approach?
The index:
{
  "mappings": {
    "properties": {
      "search": {
        "type": "search_as_you_type"
      }
    }
  }
}
The search query:
{
  "query": {
    "multi_match": {
      "query": "3543",
      "type": "bool_prefix",
      "fields": [
        "search",
        "search._2gram",
        "search._3gram"
      ]
    }
  }
}
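The idea from the update (a separate postcode field added to the multi_match fields) could be sketched as follows. This is only a sketch under assumptions: it presumes a second search_as_you_type field named postcode exists in the mapping (it is not in the mapping shown above), and the boost value 3.0 and the helper names are made up for illustration. The field^boost syntax and the fuzziness parameter are standard multi_match options; with bool_prefix, fuzziness applies to the complete terms but not to the final term, which is matched as a prefix.

```python
def with_grams(field, boost=None):
    """Return a search_as_you_type field plus its shingle subfields, optionally boosted."""
    suffix = f"^{boost}" if boost is not None else ""
    return [f"{field}{suffix}", f"{field}._2gram{suffix}", f"{field}._3gram{suffix}"]


def build_address_query(text, postcode_boost=3.0, fuzziness="AUTO"):
    """Build a bool_prefix multi_match body over 'search' and a boosted 'postcode'.

    'postcode' is an assumed extra search_as_you_type field, not part of the
    mapping in the question.
    """
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "bool_prefix",
                # Fuzziness affects all terms except the last one (the prefix).
                "fuzziness": fuzziness,
                "fields": with_grams("search") + with_grams("postcode", postcode_boost),
            }
        }
    }
```

Calling build_address_query("3543") gives a request body in which a hit on postcode contributes more to the score than a hit on the combined search string, so the address whose postcode starts with 3543 would be expected to rank first.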

Related

Fuzzy Results Score in Complete Suggester Elastic Search

I have just started using Elasticsearch and am stuck with the following use case.
I am using the completion suggester in Elasticsearch with the fuzziness set to auto to get city suggestions as output. The city names in my completion field have weights according to popularity. The problem is the ordering of fuzzy results.
For example, if the user types "dilh", I would want "delhi" to rank above "digha" or "dighwara" owing to popularity, i.e. the weights assigned to the different cities.
Right now "digha", "dighwara", "Dihira" etc. are coming above more relevant cities like "delhi" or "dalhousie". Since the edit distance is the same, can anyone tell me how to configure this so that the order follows the weights of the cities?
Attaching sample request:
{
  "suggest": {
    "loc-suggest2": {
      "prefix": "dilh",
      "completion": {
        "field": "suggestedNames",
        "size": 20,
        "fuzzy": {
          "fuzziness": "AUTO"
        }
      }
    }
  }
}

Retrieve distinct values for search as you type in Elasticsearch

We have a field title and the type is search_as_you_type,
{
  "mappings": {
    "properties": {
      "title": {
        "type": "search_as_you_type"
      }
    }
  }
}
and when we search with
{
  "query": {
    "match_phrase_prefix": {
      "title": "red"
    }
  }
}
we get duplicate results:
red car
red icecream
red car
This is because we have documents with the same title values.
Is there a way to indicate that the results must have distinct values?
You can check whether a terms aggregation on your title field works with search_as_you_type by following the example given in this SO answer. You can also check this blog, which explains how to get unique values from Elasticsearch.
Also, check whether the duplicates in your results are the same document returned more than once or different documents that happen to have the same title value.
Edit: As discussed in the comments, in this case the completion suggester was more useful, as it deals with duplicates, and it solved the issue.
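The terms aggregation route mentioned in the answer might look roughly like this. It is a sketch under assumptions: a terms aggregation needs a keyword-typed field, so it presumes the same value is also indexed in a keyword field, called title.keyword here (hypothetical, not part of the mapping in the question). The function names and the aggregation name distinct_titles are made up for illustration.

```python
def build_distinct_titles_request(prefix, max_titles=10):
    """Build a search body that returns one bucket per distinct matching title."""
    return {
        "size": 0,  # no hits needed, only the aggregation buckets
        "query": {"match_phrase_prefix": {"title": prefix}},
        "aggs": {
            "distinct_titles": {
                # 'title.keyword' is an assumed keyword field with the same value
                "terms": {"field": "title.keyword", "size": max_titles}
            }
        },
    }


def distinct_titles(response):
    """Extract the distinct title strings from a search response dict."""
    buckets = response["aggregations"]["distinct_titles"]["buckets"]
    return [bucket["key"] for bucket in buckets]
```

Each bucket key is one distinct title, so "red car" would appear once no matter how many documents carry it; the bucket's doc_count says how many documents share that title.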

Custom score for exact, phonetic and fuzzy matching in elasticsearch

I have a requirement where there needs to be custom scoring on name. To keep it simple, let's say that if I search for 'Smith' against names in the index, the logic should be:
if input = exact 'Smith' then score = 100%
else
if input = phonetic match then
score = <depending upon fuzziness match of input with name>%
end if
end if;
I'm able to search documents with a fuzziness of 1 but I don't know how to give it custom score depending upon how fuzzy it is. Thanks!
Update:
I went through a post that had the same requirement as mine and it was mentioned that the person solved it by using native scripts. My question still remains, how to actually get the score based on the similarity distance such that it can be used in the native scripts:
The post for reference:
https://discuss.elastic.co/t/fuzzy-query-scoring-based-on-levenshtein-distance/11116
The text to look for in the post:
"For future readers I solved this issue by creating a custom score query and
writing a (native) script to handle the scoring."
You can implement this search logic using the function_score query (docs here).
Here is a possible example:
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "input": "Smith"
        }
      },
      "boost": 5,
      "functions": [
        {
          "filter": { "match": { "input.keyword": "Smith" } },
          "weight": 23
        }
      ]
    }
  }
}
In this example we have a mapping with the input field indexed both as text and as keyword (input.keyword is for the exact match). Documents that exactly match the term "Smith" are re-scored higher than the other documents matched by the first query (in the example it is a match query, but in your case it will be the query with fuzziness).
You can control the re-scoring effect by tuning the weight parameter.
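Adapted to the fuzzy case from the question, the same idea might be sketched like this. The field names input and input.keyword follow the answer's example; the function name, the default weight and the use of a term filter for the exact match are illustrative choices, not something the answer prescribes. Documents that do not pass the filter keep their original fuzzy-match score, because a function_score query with no matching function leaves the score unchanged.

```python
def build_name_query(name, exact_weight=23, fuzziness=1):
    """Fuzzy match on 'input', with exact keyword matches multiplied by exact_weight."""
    return {
        "query": {
            "function_score": {
                # Base query: fuzzy match, as in the question.
                "query": {
                    "match": {"input": {"query": name, "fuzziness": fuzziness}}
                },
                "functions": [
                    {
                        # Only exact matches on the keyword subfield get the weight.
                        "filter": {"term": {"input.keyword": name}},
                        "weight": exact_weight,
                    }
                ],
                # Multiply the base score by the function result (the default mode).
                "boost_mode": "multiply",
            }
        }
    }
```

This does not turn the edit distance itself into a score, which is what the question ultimately asks for; it only separates exact matches from fuzzy ones.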

Document Frequency and Null Values in Elasticsearch

I have an app that uses the following Elasticsearch autocomplete query to allow users to search for products based on the product name and its alternate names.
{
  "query": {
    "dis_max": {
      "queries": [
        {
          "match": {
            "name.autocomplete": {
              "query": "p",
              "analyzer": "autocomplete_search",
              "boost": 3
            }
          }
        },
        {
          "nested": {
            "path": "alternate_names",
            "score_mode": "max",
            "query": {
              "match": {
                "alternate_names.name.autocomplete": {
                  "query": "p",
                  "analyzer": "autocomplete_search"
                }
              }
            }
          }
        }
      ]
    }
  }
}
All products have a name but only about 10% of products have an alternate name (stored as a nested field under products). Even though I am boosting matches on name over matches on alternate name, I noticed that sometimes after typing one letter, it would return a match on alternate name over a match on name.
After doing some digging with the explain API, I discovered that this is happening because the document frequency for alternate names is calculated as the number of matching documents divided by the total number of documents (as expected). However, the results are skewed in this case because so many products have a null alternate name. So if 10% of products have a name that includes a word beginning with 'p', and 10% of alternate names have a word beginning with 'p', the query considers the match on the alternate name to be much more relevant, because only 1% of all products have an alternate name beginning with 'p'.
I'm wondering if anyone has run into this issue and knows a way to address it. Ideally, I would like to exclude documents with null alternate names from the total doc count in the document frequency calculation for matches on alternate name. Alternatively, I could use the same tf/idf calculation across both fields, so that alternate names are treated as just other names in terms of how unique a specific word is. However, I have not been able to figure out how to do either of these things.
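To make the skew concrete, here is a small worked calculation. It assumes the BM25-style inverse document frequency used by Lucene's BM25Similarity, idf = ln(1 + (N - n + 0.5) / (n + 0.5)); whether the index in question uses exactly this formula is an assumption, and the product count of 10,000 is invented for illustration.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """BM25-style idf: rarer terms (small doc_freq relative to doc_count) score higher."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


total_products = 10_000                 # hypothetical catalogue size
with_alt_name = total_products // 10    # 10% of products have an alternate name

name_hits = total_products // 10        # 10% of names contain a 'p' word
alt_hits = with_alt_name // 10          # 10% of alternate names contain a 'p' word

# Counting against all products, the alternate-name match looks much rarer.
idf_name = bm25_idf(name_hits, total_products)
idf_alt_skewed = bm25_idf(alt_hits, total_products)

# Counting only products that actually have an alternate name removes the skew.
idf_alt_fair = bm25_idf(alt_hits, with_alt_name)

print(f"name: {idf_name:.3f}, alt (all docs): {idf_alt_skewed:.3f}, "
      f"alt (docs with alt name): {idf_alt_fair:.3f}")
```

With these numbers the idf for the alternate-name match is roughly twice that of the name match when every product counts toward the total, and nearly equal to it when only products with an alternate name are counted, which is the behaviour the question is asking for.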

Productsearch with Elasticsearch

I am relatively new to elasticsearch and I want to perform a search for products with brand and type names.
I already tried a bit but I think I am missing something important to have a solid search algorithm. Here is my approach:
A product looks e.g. like this:
{
  brandName: "Samsung",
  typeName: "PS-50Q7HX",
  ...
}
I will have a single input field. The user can search for a brand/type only or for a brand in combination with a type name. E.g.
Samsung | Samsung PS-50Q7HX | PS-50Q7HX
To tolerate typos in the typeName field I use an ngram tokenizer, which works great when I search for types only. But in combination with the brandName field I run into trouble. Using something like this does not work well (especially when I use an ngram tokenizer on the brandName field too):
{
  "query": {
    "multi_match": {
      "query": "Samsung PS 50Q 7HX",
      "type": "cross_fields",
      "fields": ["brandName", "typeName"]
    }
  }
}
Of course I know why this does not work well with two ngram tokenizers and a mixed field, but I am not sure how best to solve it.
I think the main problem is that I do not know whether the user entered a brand name or not. I thought about using a second index filled with all available brands, which I would use to perform a "pre-search" for a brand name possibly present in my query string. If I find a match, I can split the search string into type and brand name and perform a more specific search, like this one:
{
  "query": {
    "bool": {
      "must": [
        { "match": { "brandName": "Samsung" } },
        { "match": { "typeName": "PS-50Q7HX" } }
      ]
    }
  }
}
Does this sound like a good approach? Or does anyone see a better way?
Any help is appreciated!
Thank you very much and best regards,
Stefan
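The "pre-search" idea described in the question could be sketched roughly as follows. This is only an illustration under assumptions: the set of known brands is passed in directly (in the question it would come from the second index), brands are assumed to be single words, and the function names are made up.

```python
def split_brand(query, known_brands):
    """Return (brand, remaining_text); brand is None if no token is a known brand."""
    tokens = query.split()
    by_lower = {brand.lower(): brand for brand in known_brands}
    for i, token in enumerate(tokens):
        brand = by_lower.get(token.lower())
        if brand is not None:
            # Everything except the brand token is treated as the type name.
            return brand, " ".join(tokens[:i] + tokens[i + 1:])
    return None, " ".join(tokens)


def build_product_query(query, known_brands):
    """Build a bool query that matches brand and type separately when possible."""
    brand, type_text = split_brand(query, known_brands)
    must = []
    if brand:
        must.append({"match": {"brandName": brand}})
    if type_text:
        must.append({"match": {"typeName": type_text}})
    return {"query": {"bool": {"must": must}}}
```

For the three inputs listed in the question (brand only, brand plus type, type only) this produces one or two match clauses accordingly. Multi-word brands and typos in the brand name itself would need extra handling.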
To handle typos made by the user, you used an ngram analyzer, which is a costly one. You could use a stemming analyzer instead, which provides some flexible options for handling typos.
In my opinion, instead of indexing this in two different fields, you could index it as a single field.
e.g. "FIELD_NAME": "Samsung|PS-50Q7HX"
The brand name and product name are joined with a delimiter (I used |). Analyze the field values by splitting on that delimiter, so your content will be indexed as follows:
Samsung
PS-50Q7HX
Then you could search with the following query:
{
  "query": {
    "query_string": {
      "query": "Samsung PS-50Q7HX",
      "default_operator": "or",
      "fields": [
        "FIELD_NAME"
      ]
    }
  }
}
This will retrieve the documents that have Samsung as the brand name or PS-50Q7HX as the product name. You could also use a prefix search, and if you set default_operator to and, your search will be more precise.
