Document Frequency and Null Values in Elasticsearch

I have an app that uses the following Elasticsearch autocomplete query to allow users to search for products based on the product name and its alternate names.
"query": {
"dis_max": {
"queries": [
{
"match": {
"name.autocomplete": {
"query": "p",
"analyzer": "autocomplete_search",
"boost": 3
}
}
},
{
"nested": {
"path": "alternate_names",
"score_mode": "max",
"query": {
"match": {
"alternate_names.name.autocomplete": {
"query": "p",
"analyzer": "autocomplete_search"
}
}
}
}
}
]
}
}
All products have a name, but only about 10% of products have an alternate name (stored as a nested field under products). Even though I am boosting matches on name over matches on alternate name, I noticed that after typing a single letter the query sometimes ranks a match on alternate name above a match on name.
After doing some digging with the explain API, I discovered that this happens because the document frequency for alternate names is calculated as the number of matching documents divided by the total number of documents (as expected). However, the results are skewed in this case because so many products have a null alternate name. So if 10% of products have a name that includes a word beginning with 'p', and 10% of alternate names have a word beginning with 'p', the query considers the match on the alternate name much more relevant, because only 1% of all products have an alternate name beginning with 'p'.
I'm wondering if anyone has run into this issue and knows a way to address it. Ideally, I would like to exclude documents with null alternate names from the total doc count in the document frequency calculation for matches on alternate name. Alternatively, I could use the same tf/idf calculation across both fields, so that alternate names are treated as just other names in terms of how unique a specific word is. However, I have not been able to figure out how to do either of these things.
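As a rough illustration of the skew described above, here is a small Python sketch that plugs the question's percentages into Lucene's classic TF/IDF formula, idf = 1 + ln(N / (n + 1)). The index size of 10,000 products and the simplification of one alternate name per product are assumptions made for the arithmetic. Newer Elasticsearch versions default to BM25, whose idf formula differs, so treat the numbers as indicative only.

```python
import math


def classic_idf(doc_freq: int, doc_count: int) -> float:
    """Classic Lucene TF/IDF inverse document frequency: 1 + ln(N / (n + 1))."""
    return 1.0 + math.log(doc_count / (doc_freq + 1))


TOTAL_PRODUCTS = 10_000                  # assumed index size (hypothetical)
WITH_ALT_NAME = TOTAL_PRODUCTS // 10     # 10% of products have an alternate name

name_hits = TOTAL_PRODUCTS // 10         # 10% of names contain a word starting with "p"
alt_hits = WITH_ALT_NAME // 10           # 10% of alternate names do, i.e. 1% of all products

# idf for the name field, counted over all products
idf_name = classic_idf(name_hits, TOTAL_PRODUCTS)
# idf for alternate names when products without one still count towards N
idf_alt_all_docs = classic_idf(alt_hits, TOTAL_PRODUCTS)
# idf for alternate names when N only counts products that have one
idf_alt_field_only = classic_idf(alt_hits, WITH_ALT_NAME)

print(f"name:                          {idf_name:.2f}")
print(f"alternate name (all docs):     {idf_alt_all_docs:.2f}")
print(f"alternate name (field only):   {idf_alt_field_only:.2f}")
```

Under these assumptions the alternate-name idf computed over all documents is clearly larger than the name idf, while restricting the document count to products that actually have an alternate name brings the two values close together, which is the behaviour asked for above.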

Related

Position aware search results in Elasticsearch autocompletion

I want to implement address autocompletion using Elasticsearch.
The current approach I am investigating is based on the search_as_you_type field type.
Consider these two addresses:
3543JN Carl Zellerhof 8 Utrecht (3543JN is postcode)
1234JN The Street 3543 Utrecht
It is important to prioritize some address parts over others. For instance, the postcode should have more weight than the house number, e.g. when a user types 3543, the first address should come first in the search results.
I see two solutions here:
Combine the address into one string and assign weight based on position within the combined string.
Search on multiple fields (the weight can then be adjusted per field, but this seems more complex to me: how do I ensure the same address part is not matched several times?).
I am leaning towards the one-string solution, but the implementation below gives both addresses the same weight for the search query 3543.
Please advise how to implement this.
(It is also desirable to allow some fuzziness.)
Update:
It seems that adding a postcode field to the multi_match fields gives me what I want. Are there any disadvantages to this approach?
The index:
{
"mappings": {
"properties": {
"search": {
"type": "search_as_you_type"
}
}
}
}
The search query:
{
"query": {
"multi_match": {
"query": "3543",
"type": "bool_prefix",
"fields": [
"search",
"search._2gram",
"search._3gram"
]
}
}
}
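A minimal sketch of the variant mentioned in the update, written as a Python helper that builds the request body. It assumes each address document also carries a separate postcode field (not shown in the mapping above) and that a boost of 2 is a reasonable starting point. Both the field and the boost value are illustrative assumptions to tune against real data.

```python
import json


def build_address_query(text: str, postcode_boost: float = 2.0) -> dict:
    """Build a bool_prefix multi_match body that also searches a boosted postcode field.

    The "postcode" field and its boost are assumptions, not part of the mapping above.
    """
    fields = [
        "search",
        "search._2gram",
        "search._3gram",
        f"postcode^{postcode_boost}",  # field^boost syntax weights postcode matches higher
    ]
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "bool_prefix",
                "fields": fields,
            }
        }
    }


address_query = build_address_query("3543")
print(json.dumps(address_query, indent=2))
```

With this shape, a document whose postcode starts with 3543 gets an additional, boosted match, while a document that only has 3543 as a house number matches on the search field alone.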

Find one result based on a term query or a list of results based on a match query

I have an index of documents, each containing an id and name field. Each document name happens to be unique.
I want to perform a query on the name field that returns one exact result if possible, or falls back to returning a list of similar results. For example, if the search term is Acme Incorporated and there is an exact result, return only that. Otherwise return similar matches, e.g. ACME Inc., acme, Ace, etc.
I assumed that I need to somehow combine a keyword-based term query for an exact match, and a text-based match query for the similar matches. I am still getting to grips with compound queries so my first attempt was pretty naive:
{
"query": {
"bool": {
"should": [
{
"term": {
"name.exact": "Acme Incorporated"
}
},
{
"match": {
"name": "Acme Incorporated"
}
}
]
}
}
}
This returns a list of similar matches AND an exact match if present, because at least one query should succeed. This is obviously not correct.
In order to facilitate the keyword-based term query above, I added name.exact to my document mapping:
{
"mappings": {
"properties": {
"id": {
"type": "integer"
},
"name": {
"type": "text",
"fields": {
"exact": {
"type": "keyword"
}
}
}
}
}
}
I suppose another approach is to use the Multi Search API to perform the above queries separately. This allows me to look at the responses and decide to use the match query results if the term query result set is empty. This will work for my use case, but I suspect that it is not an optimal approach.
I assume this is a common use-case but I am not sure what the solution is.
Edit
My current thinking is to go with a Multi Search request as described above: the first query is the same keyword-based term query that attempts to find an exact result, and the second is the following compound bool query, which excludes an exact result.
{
"query": {
"bool": {
"must": {
"match": {
"name": "Acme Incorporated"
}
},
"must_not": {
"term": {
"name.exact": "Acme Incorporated"
}
}
}
}
}
In the end, the MultiSearch API suited my use case:
The multi search API executes several searches from a single API request. The format of the request is similar to the bulk API format and makes use of the newline delimited JSON (NDJSON) format.
I used this to perform two queries in one request:
Find any exact results with a keyword-based term query on the document name field.
Find any similar results with a bool query, comprising a match query on the document name field and a must_not of the first query to filter out any exact results.
A Multi Search body is constructed of one or more pairs of an (optionally) empty header and body (a single query) delimited by newlines; e.g:
GET /myindex/_msearch
{}
{"query": {"constant_score": {"filter": {"term": {"name.exact": "Acme Incorporated"}}}}}
{}
{"query": {"bool": {"must": {"match": {"name": "Acme Incorporated"}}, "must_not": {"term": {"name.exact": "Acme Incorporated"}}}}}
The request body is in NDJSON format, whose specification states that "Each Line is a Valid JSON Value". Each query must therefore be serialized on a single line, which is not very readable but is not an issue if you are using a library to construct queries.
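For anyone building the body by hand, here is a small Python sketch of the two steps above: serializing the two queries as NDJSON and then choosing between the two result sets. The helper names are hypothetical, and the exact-match field defaults to the name.exact sub-field from the mapping shown earlier. Sending the request is left out, so this only covers constructing the body and reading the standard msearch response shape (a "responses" array in request order).

```python
import json


def exact_or_similar_queries(name, exact_field="name.exact"):
    """Return the exact-match query and the similar-but-not-exact query."""
    exact = {"query": {"constant_score": {"filter": {"term": {exact_field: name}}}}}
    similar = {
        "query": {
            "bool": {
                "must": {"match": {"name": name}},
                "must_not": {"term": {exact_field: name}},
            }
        }
    }
    return [exact, similar]


def build_msearch_body(queries):
    """Serialize queries as NDJSON: an empty header line, then the query on one line."""
    lines = []
    for query in queries:
        lines.append("{}")               # empty header: fall back to the index in the URL
        lines.append(json.dumps(query))  # json.dumps emits a single line by default
    return "\n".join(lines) + "\n"       # the body has to end with a newline


def pick_results(msearch_response):
    """Return the exact hits if there are any, otherwise the similar hits."""
    exact, similar = msearch_response["responses"]
    exact_hits = exact["hits"]["hits"]
    return exact_hits if exact_hits else similar["hits"]["hits"]


body = build_msearch_body(exact_or_similar_queries("Acme Incorporated"))
print(body)
```

The body string can then be posted to the _msearch endpoint with whatever HTTP or Elasticsearch client you already use, and the parsed JSON response passed to pick_results.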

Elasticsearch index field with wildcard and search for it

I have a document with a field "serial number". That serial number is ABC.XXX.DEF where XXX indicates wildcards. XXX can be \d{3}[a-zA-Z0-9].
So users can search for:
ABC.123.DEF
ABC.234.DEF
ABC.XYZ.DEF
while the document only includes
ABC.XXX.DEF
When a user queries ABC.123.DEF, I need a hit on the document containing ABC.XXX.DEF. Other documents might contain ABC.DEF.XXX and must not be hit, and with my basic Elasticsearch knowledge I am running out of ideas.
Do I have to attack the problem from the query side, or when analyzing/tokenizing the pattern?
Can anyone give me an example of how to approach this problem?
As long as the serial number format is well defined, the first solution that comes to mind is to split the serial number into three parts ("part1", "part2" and "part3", for example) and index them as three separate fields. Parts consisting of wildcards should either get a special value or not be indexed at all. At query time I would then split the serial number provided by the user in the same way. Assuming that parts consisting of wildcards are not indexed, my query would look like this:
"query": {
"bool": {
"must":[
{
"bool": {
"should": [
{
"match": {
"part1": "ABC"
}
},
{
"bool": {
"must_not": {
"exists": {
"field": "part1"
}
}
}
}
]
}
},
... // Similar code for other parts
]
}
}
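The splitting on both the indexing and the query side could be sketched in Python roughly like this. The part1/part2/part3 field names follow the answer above. The helper names, the literal "XXX" wildcard marker, and the local doc_matches check (which uses plain string equality instead of Elasticsearch analysis) are assumptions made for illustration.

```python
WILDCARD = "XXX"                          # assumed marker for a wildcard part
PART_FIELDS = ("part1", "part2", "part3")


def split_serial(serial):
    """Split 'ABC.123.DEF' into {'part1': 'ABC', 'part2': '123', 'part3': 'DEF'}."""
    parts = serial.split(".")
    if len(parts) != len(PART_FIELDS):
        raise ValueError(f"expected {len(PART_FIELDS)} dot-separated parts: {serial!r}")
    return dict(zip(PART_FIELDS, parts))


def to_indexed_fields(serial):
    """Fields to store for a document; wildcard parts are simply not indexed."""
    return {f: v for f, v in split_serial(serial).items() if v != WILDCARD}


def build_serial_query(serial):
    """For every part: either the part matches, or the document has no such field."""
    must = []
    for field, value in split_serial(serial).items():
        must.append({
            "bool": {
                "should": [
                    {"match": {field: value}},
                    {"bool": {"must_not": {"exists": {"field": field}}}},
                ]
            }
        })
    return {"query": {"bool": {"must": must}}}


def doc_matches(indexed_fields, serial):
    """Local approximation of the query above, using exact string comparison."""
    return all(
        field not in indexed_fields or indexed_fields[field] == value
        for field, value in split_serial(serial).items()
    )


stored = to_indexed_fields("ABC.XXX.DEF")
print(stored)
print(doc_matches(stored, "ABC.123.DEF"))
print(doc_matches(to_indexed_fields("ABC.DEF.XXX"), "ABC.123.DEF"))
```

The document stored from ABC.XXX.DEF has no part2 field, so any middle part is accepted, whereas a document stored from ABC.DEF.XXX fails on part2 for the query ABC.123.DEF.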

Custom score for exact, phonetic and fuzzy matching in elasticsearch

I have a requirement where there needs to be custom scoring on name. To keep it simple, let's say that if I search for 'Smith' against names in the index, the logic should be:
if input = exact 'Smith' then score = 100%
else
if input = phonetic match then
score = <depending upon fuzziness match of input with name>%
end if
end if;
I'm able to search documents with a fuzziness of 1 but I don't know how to give it custom score depending upon how fuzzy it is. Thanks!
Update:
I went through a post that had the same requirement as mine and it was mentioned that the person solved it by using native scripts. My question still remains, how to actually get the score based on the similarity distance such that it can be used in the native scripts:
The post for reference:
https://discuss.elastic.co/t/fuzzy-query-scoring-based-on-levenshtein-distance/11116
The text to look for in the post:
"For future readers I solved this issue by creating a custom score query and
writing a (native) script to handle the scoring."
You can implement this search logic using the function_score query.
Here is a possible example:
{
"query": {
"function_score": {
"query": { "match": {
"input": "Smith"
} },
"boost": "5",
"functions": [
{
"filter": { "match": { "input.keyword": "Smith" } },
"weight": 23
}
]
}
}
}
In this example the mapping has the input field indexed both as text and as keyword (input.keyword is for exact matches). Documents that exactly match the term "Smith" are re-scored higher than the other documents matched by the main query (in the example it is a match query, but in your case it will be the query with fuzziness).
You can control the re-scoring effect by tuning the weight parameter.
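If the score really has to be a percentage derived from the edit distance, as in the pseudocode in the question, one option is to compute it in application code on the hits returned by the fuzzy query. The following Python sketch is only one possible definition: it is not how Elasticsearch scores fuzzy matches, it ignores the phonetic step, and the formula (100 for an exact match, otherwise scaled by Levenshtein distance over the longer length, compared case-insensitively) is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein edit distance using the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (free when characters match)
            ))
        prev = curr
    return prev[-1]


def similarity_percent(query: str, name: str) -> float:
    """100 for an exact match, otherwise 100 * (1 - distance / longer length)."""
    if query == name:
        return 100.0
    longest = max(len(query), len(name))  # non-zero here, since the strings differ
    distance = levenshtein(query.lower(), name.lower())
    return 100.0 * (1 - distance / longest)


for candidate in ["Smith", "Smyth", "Smithe", "Jones"]:
    print(candidate, round(similarity_percent("Smith", candidate), 1))
```

The hits from the fuzzy query can then be re-ordered by this value on the client side, or the same arithmetic can be ported into a scoring script if it has to run inside the cluster.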

Product search with Elasticsearch

I am relatively new to elasticsearch and I want to perform a search for products with brand and type names.
I already tried a bit but I think I am missing something important to have a solid search algorithm. Here is my approach:
A product looks e.g. like this:
{
brandName: "Samsung",
typeName: "PS-50Q7HX",
...
}
I will have a single input field. The user can search for a brand/type only or for a brand in combination with a type name. E.g.
Samsung | Samsung PS-50Q7HX | PS-50Q7HX
To tolerate mistyping in the typeName field I use an ngram tokenizer, which works great when I search for types only. But in combination with the brandName field I run into trouble. Using something like this does not work well (especially when I use an ngram tokenizer on the brandName field too):
{
"query" : {
"multi_match" : {
"query": "Samsung PS 50Q 7HX",
"type": "cross_fields",
"fields": ["brandName", "typeName"]
}
}
}
Of course I know why this does not work well with two ngram tokenizers and a mixed field, but I am not sure of the best way to solve it.
I think the main problem is that I do not know whether the user entered a brand name or not. I thought about using a second index filled with all available brands, which I would use to perform a "pre-search" for a brand name that may be present in my query string. If I find a match, I can split the search string into type and brand name and perform a more specific search, like this one:
{
"query": {
"bool": {
"must": [
{ "match": { "brandName": "Samsung" } },
{ "match": { "typeName": "PS-50Q7HX" } }
]
}
}
}
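The pre-search idea can be sketched in Python as follows. Here the second index of brands is stood in for by an in-memory set of lower-cased brand names, and only single-word brands are handled. The helper names and those two simplifications are assumptions made for illustration.

```python
def split_brand(query, known_brands):
    """Return (brand, rest) if a token is a known brand, otherwise (None, query)."""
    tokens = query.split()
    for i, token in enumerate(tokens):
        if token.lower() in known_brands:
            rest = " ".join(tokens[:i] + tokens[i + 1:])
            return token, rest
    return None, query


def build_product_query(query, known_brands):
    """Build the more specific bool query once the brand has been separated out."""
    brand, type_name = split_brand(query, known_brands)
    must = []
    if brand:
        must.append({"match": {"brandName": brand}})
    if type_name:
        must.append({"match": {"typeName": type_name}})
    return {"query": {"bool": {"must": must}}}


BRANDS = {"samsung", "sony"}  # stand-in for the brand index (hypothetical data)

for text in ["Samsung", "Samsung PS-50Q7HX", "PS-50Q7HX"]:
    print(text, "->", build_product_query(text, BRANDS))
```

All three input styles from the question end up as a bool query that only contains the clauses for the parts the user actually typed.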
Does this sound like a good approach? Or does anyone see a better way?
Any help is appreciated!
Thank you very much and best regards,
Stefan
To handle typos made by the user you used an ngram analyzer, which is costly. You could use a stemming analyzer instead, which provides some flexible options for handling typos.
In my opinion, instead of indexing this in two different fields, you could index it as a single field.
For example: "FIELD_NAME": "Samsung|PS-50Q7HX"
Join the brand name and product name with a delimiter (I used |) and analyze the field values by splitting on that delimiter, so your data will be indexed as follows:
Samsung
PS-50Q7HX
Then you could search with the following query:
{
"query": {
"query_string": {
"query": "Samsung PS-50Q7HX",
"default_operator": "or",
"fields": [
"FIELD_NAME"
]
}
}
}
This will retrieve the documents that have the brand name Samsung or the product name PS-50Q7HX from the index. You could also use a prefix search, and if you set default_operator to and, your search will be the most accurate.
