Elasticsearch - best query and index for partial and fuzzy search

I thought this scenario must be quite common, but I was unable to find the best way to do it.
I have a big dataset of products. All the products have this kind of schema:
{
"productID": 1,
"productName": "Whatever",
"productBoost": 1234
}
My problem is combining partial matching (as with query_string wildcards) and fuzzy matching in one query.
The index holds about 1.5M records, each with the name of the product and a boost value that reflects the product's popularity (more popular products have a higher value, less popular ones a lower value).
I would like to use function_score to take that boost into account.
What I am trying to achieve is search-as-you-type with function_score and fuzziness.
I'm not sure if this is the best approach.
The query I'm currently using is this:
"query": {
"function_score": {
"query": {
"match": {
"productName": {
"query": "word",
"fuzziness": "AUTO",
"operator": "AND"
}
}
},
"field_value_factor": {
"field": "productBoost",
"factor": 1,
"modifier": "square"
}
}
}
This works reasonably well, but the problem is that I want products like
"Cabbage raw" to come up before "Cabernet red wine" when I search for the string "cab", because the boost is much higher on "Cabbage raw".
Another problem is that when I search for the word "cabage" (a typo of "cabbage"), only one product is returned, even though many products contain "cabbage".
If query_string supported fuzziness together with wildcards, that would be ideal for this, I think.
Also, because this is a match query, partial matching does not work either.
I tried query_string with wildcards, but the downside is that I cannot use fuzziness with that kind of query.
I've also tried nGrams and edge nGrams, but I'm not sure how to implement them in this scenario or how to combine the search score with the boost I already have.
The only thing I haven't properly tried, and that might fix this issue, is suggesters.
I couldn't make them work with function_score.
If anyone has any ideas on implementing this, it would be really helpful.
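One possible way to combine the pieces mentioned in the question (edge nGrams for the partial part, fuzziness for typos, and productBoost for ranking) is sketched below. It is only a sketch under assumptions: the sub-field name autocomplete, the analyzer and filter names, the gram sizes, the log1p modifier, and the typeless mapping layout (Elasticsearch 7 or later) are all choices made for illustration and are not taken from the question. The request bodies are built as plain Python dicts so they can be inspected without a running cluster.

```python
import json

# Index settings and mapping. "autocomplete" and "autocomplete_filter" are
# hypothetical names; the gram sizes are assumptions to tune for your data.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 15,
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "autocomplete_filter"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "productID": {"type": "integer"},
            "productBoost": {"type": "integer"},
            "productName": {
                "type": "text",
                "fields": {
                    # Prefixes are produced at index time only; the search
                    # text itself is analyzed with the standard analyzer.
                    "autocomplete": {
                        "type": "text",
                        "analyzer": "autocomplete",
                        "search_analyzer": "standard",
                    }
                },
            },
        }
    },
}


def build_search(text):
    """Build a search body that accepts a prefix match or a fuzzy match."""
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        "should": [
                            # Partial match: "cab" hits the indexed prefix of "cabbage".
                            {"match": {"productName.autocomplete": {
                                "query": text, "operator": "and"}}},
                            # Typo tolerance on the full words.
                            {"match": {"productName": {
                                "query": text, "fuzziness": "AUTO", "operator": "and"}}},
                        ],
                        "minimum_should_match": 1,
                    }
                },
                # Popularity multiplies the text score; log1p is an assumption
                # that dampens very large boost values.
                "field_value_factor": {
                    "field": "productBoost",
                    "modifier": "log1p",
                    "missing": 1,
                },
                "boost_mode": "multiply",
            }
        }
    }


print(json.dumps(build_search("cab"), indent=2))
```

The index body would be sent when creating the index and the search body to the _search endpoint. Whether log1p or the original square modifier gives the better ordering depends on how the boost values are distributed, so that part is worth experimenting with.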

Related

Search in two fields on elasticsearch with kibana

Assuming I have an index with two fields: title and loc, I would like to search in this two fields and get the "best" match. So if I have three items:
{"title": "castle", "loc": "something"},
{"title": "something castle something", "loc": "something,pontivy,something"},
{"title": "something else", "loc": "something"}
... I would like to get the second one which has "castle" in its title and "pontivy" in its loc. I tried to simplify the example and the base, it's a bit more complicated. So I tried this query, but it seems not accurate (it's a feeling, not really easy to explain):
GET merimee/_search/?
{
"query": {
"multi_match" : {
"query": "castle pontivy",
"fields": [ "title", "loc" ]
}
}
}
Is this the right way to search across several fields and get the document that matches in all of them?
Not sure my question is clear enough, I can edit if required.
EDIT:
The story is: the user types "castle pontivy" and I want to get the "best" result for this query, which is the second item, because it contains "castle" in "title" and "pontivy" in "loc". In other words, I want the result that matches best across both fields.
As the other poster suggested, you could use a bool query, but that might not work for your use case, since you have a single search box whose input you want to query against multiple fields.
I recommend looking at a Simple Query String query as that will likely give you the functionality you're looking for. See: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html
So you could do something similar to this:
{
"query": {
"simple_query_string" : {
"query": "castle pontivy",
"fields": ["title", "loc"],
"default_operator": "and"
}
}
}
This returns the documents in which both terms match, each in either of those fields, ranked by relevance. The default operator is set to AND here because it is otherwise OR, which might not give you the expected results.
It is worthwhile to experiment with the other options available for this query type as well. You might also explore the Query String query, as it gives more flexibility, but the Simple Query String query works very well for most cases.
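To make the "every term, in either field" behaviour concrete, here is a rough hand-written equivalent as a bool query, built in Python. It is an illustration of the semantics only, assuming simple whitespace-separated terms; it is not the query that Elasticsearch builds internally, and the helper name all_terms_any_field is made up for this sketch.

```python
def all_terms_any_field(text, fields):
    """Require every whitespace-separated term to match in at least one field."""
    per_term = [
        {
            "bool": {
                # The term may match in any one of the fields...
                "should": [{"match": {field: term}} for field in fields],
                "minimum_should_match": 1,
            }
        }
        for term in text.split()
    ]
    # ...but every term has to match somewhere.
    return {"query": {"bool": {"must": per_term}}}


body = all_terms_any_field("castle pontivy", ["title", "loc"])
```

With the three sample items from the question, only the second one has both "castle" and "pontivy" somewhere in title or loc, so it is the only one that satisfies every must clause.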
This can be done by using a bool query and matching each field separately.
GET _search
{
"query": {
"bool": {
"must": [
{ "match": { "title": "castle" } },
{ "match": { "loc": "pontivy" } }
]
}
}
}

Custom score for exact, phonetic and fuzzy matching in elasticsearch

I have a requirement where there needs to be custom scoring on name. To keep it simple, let's say, if I search for 'Smith' against names in the index, the logic should be:
if input = exact 'Smith' then score = 100%
else
if input = phonetic match then
score = <depending upon fuzziness match of input with name>%
end if
end if;
I'm able to search documents with a fuzziness of 1 but I don't know how to give it custom score depending upon how fuzzy it is. Thanks!
Update:
I went through a post that had the same requirement as mine and it was mentioned that the person solved it by using native scripts. My question still remains, how to actually get the score based on the similarity distance such that it can be used in the native scripts:
The post for reference:
https://discuss.elastic.co/t/fuzzy-query-scoring-based-on-levenshtein-distance/11116
The text to look for in the post:
"For future readers I solved this issue by creating a custom score query and
writing a (native) script to handle the scoring."
You can implement this search logic using the function_score query.
Here is a possible example:
{
"query": {
"function_score": {
"query": { "match": {
"input": "Smith"
} },
"boost": 5,
"functions": [
{
"filter": { "term": { "input.keyword": "Smith" } },
"weight": 23
}
]
}
}
}
This example assumes a mapping in which the input field is indexed both as text and as keyword (input.keyword is used for the exact match). Documents that exactly match the term "Smith" get a higher score than the other documents matched by the main query (a plain match in the example, but in your case it would be the query with fuzziness).
You can control the size of this boost by tuning the weight parameter.
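The function_score approach above separates exact hits from the rest, but it does not by itself express "how fuzzy" a hit is. One option, sketched here as an assumption and not as an Elasticsearch feature, is to compute the edit distance on the client for each returned name and turn it into a percentage. The function names are hypothetical. Note that this is plain Levenshtein distance, whereas fuzzy queries count a transposition of two adjacent characters as a single edit by default, so the numbers can differ slightly.

```python
def levenshtein(a, b):
    """Edit distance where insertions, deletions and substitutions each cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity_percent(query, name):
    """100 for identical strings (ignoring case), lower as more edits are needed."""
    longest = max(len(query), len(name))
    if longest == 0:
        return 100.0
    distance = levenshtein(query.lower(), name.lower())
    return 100.0 * (longest - distance) / longest


for candidate in ["Smith", "Smyth", "Smithe"]:
    print(candidate, similarity_percent("Smith", candidate))
```

The hits returned by the fuzzy query could then be re-sorted by this percentage in application code.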

elastic search search index for keywords phrases or keywords

I'm new to Elastic Search and have an index with lots of articles in it. I have 3 main fields I use; title, snippet and date. I want to find the most common or top key-phrases or keywords for a specific date in the title field. I was hoping someone can provide an example on how to do this or at least point me in the right direction.
Many Thanks!
I think you are looking for terms aggregation. Try something like this
{
"query": {
"match": {
"date": {
"query": "your_date"
}
}
},
"size": 0,
"aggs": {
"common_words": {
"terms": {
"field": "title",
"size": 10
}
}
}
}
The most common words appear at the top, as the buckets are ordered by count.
If you are looking for phrases, you might have to analyze your title field accordingly. You can map title with multiple analyzers, e.g. the standard analyzer for common words and a shingle analyzer for common phrases.
You might also want to look into the significant terms aggregation if you want to find something unusual.
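A sketch of the multi-analyzer mapping described above, with the request bodies built as Python dicts. The names title_shingles, title_phrases and the phrases sub-field are hypothetical, the shingle sizes are assumptions, and the mapping layout assumes Elasticsearch 7 or later. In recent versions a terms aggregation on a text field also needs fielddata enabled, which can use a lot of heap memory on a large index, so treat this as a starting point only.

```python
# Index body with a shingle-based sub-field for two- and three-word phrases.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "title_shingles": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 3,
                    "output_unigrams": False,  # keep phrases only in this sub-field
                }
            },
            "analyzer": {
                "title_phrases": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "title_shingles"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "date": {"type": "date"},
            "title": {
                "type": "text",
                "fielddata": True,  # single words, for the aggregation on "title"
                "fields": {
                    "phrases": {
                        "type": "text",
                        "analyzer": "title_phrases",
                        "fielddata": True,
                    }
                },
            },
        }
    },
}


def top_title_terms(day, field="title.phrases", size=10):
    """Aggregation body for the most frequent terms of `field` on a given date."""
    return {
        "query": {"match": {"date": {"query": day}}},
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "aggs": {"common_terms": {"terms": {"field": field, "size": size}}},
    }


phrases_body = top_title_terms("your_date")
words_body = top_title_terms("your_date", field="title")
```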

Productsearch with Elasticsearch

I am relatively new to elasticsearch and I want to perform a search for products with brand and type names.
I already tried a bit but I think I am missing something important to have a solid search algorithm. Here is my approach:
A product looks e.g. like this:
{
brandName: "Samsung",
typeName: "PS-50Q7HX",
...
}
I will have a single input field. The user can search for a brand/type only or for a brand in combination with a type name. E.g.
Samsung | Samsung PS-50Q7HX | PS-50Q7HX
To tolerate mistyping in the typeName field I use an ngram tokenizer, which works great when I search for types only. But in combination with the brandName field I run into trouble. Using something like this does not work well (especially when I use an ngram tokenizer on the brandName field too):
{
"query" : {
"multi_match" : {
"query": "Samsung PS 50Q 7HX",
"type": "cross_fields",
"fields": ["brandName", "typeName"]
}
}
}
Of course I know why this does not work well with two ngram tokenizers and a mixed query, but I am not sure how best to solve it.
I think the main problem is that I do not know whether the user entered a brand name or not. I thought about using a second index filled with all available brands, which I would use to perform a "pre-search" for a brand name that may be present in my query string. If I find a match, I can split the search string into type and brand name and perform a more specific search, like this one:
{
"query": {
"bool": {
"must": [
{ "match": { "brandName": "Samsung" } },
{ "match": { "typeName": "PS-50Q7HX" } }
]
}
}
}
Does this sound like a good approach? Or does anyone see a better way?
Any help is appreciated!
Thank you very much and best regards,
Stefan
To handle typos made by the user you used an ngram analyzer, which is a costly one. You could use a stemming analyzer instead, which provides some flexible options for tolerating such mistakes.
In my opinion, instead of indexing this in two different fields, you could index it as a single field,
e.g. "FIELD_NAME": "Samsung|PS-50Q7HX"
Here the brand name and product name are joined with a delimiter (I used |). Analyze the field so that its value is split on the delimiter; your content will then be indexed as follows:
Samsung
PS-50Q7HX
Then you could search by the following query
{
"query": {
"query_string": {
"query": "Samsung PS-50Q7HX",
"default_operator": "or",
"fields": [
"FIELD_NAME"
]
}
}
}
This retrieves the documents that have Samsung as the brand name or PS-50Q7HX as the product name. You could also use a prefix search, and if you set default_operator to and, your search results will be more precise.
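For comparison, the "pre-search" idea from the question could be sketched as follows. This is an illustration under assumptions: known_brands stands in for the result of looking up the hypothetical second brand index, only single-word brand names are handled, and the helper names are made up. When no brand is recognised, the sketch falls back to a multi_match over both fields.

```python
def split_brand(text, known_brands):
    """Return (brand, rest); brand is None when no token is a known brand."""
    lookup = {brand.lower(): brand for brand in known_brands}
    tokens = text.split()
    for i, token in enumerate(tokens):
        if token.lower() in lookup:
            rest = " ".join(tokens[:i] + tokens[i + 1:])
            return lookup[token.lower()], rest
    return None, " ".join(tokens)


def build_query(text, known_brands):
    """Build a field-specific bool query when a brand is recognised."""
    brand, rest = split_brand(text, known_brands)
    if brand is None:
        # No brand found: search both fields with the whole input.
        return {"query": {"multi_match": {
            "query": rest, "fields": ["brandName", "typeName"]}}}
    must = [{"match": {"brandName": brand}}]
    if rest:
        # Whatever is left over is treated as the type name.
        must.append({"match": {"typeName": rest}})
    return {"query": {"bool": {"must": must}}}


brands = ["Samsung"]  # assumption: would come from the brand lookup
for user_input in ["Samsung", "Samsung PS-50Q7HX", "PS-50Q7HX"]:
    print(user_input, "->", build_query(user_input, brands))
```

Brand names made of several words would need a longest-match lookup instead of the per-token check used here.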

Is there a way to score fuzzy hits with the same score as exact hits?

I'm trying to use Elasticsearch as an integration tool that can match records from different sources. I'm combining filters and a query for this. The filters remove irrelevant records and let candidate matches through. Then all of those candidates are scored. I'm using a fuzzy match because some of the records might contain a misspelling (Nicolson Way/Nicholson Way). I would like them to be scored equally, regardless of whether it is a fuzzy match or an exact match.
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/fuzzy-scoring.html
Is there a way to achieve this with Elasticsearch?
Use a constant_score to give it a score of your choice:
{
"query": {
"constant_score": {
"filter": {
"query": {
"fuzzy": {"text": "whatever"}
}
},
"boost": 1
}
}
}
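The body above uses the older form in which a query is wrapped inside the filter. In more recent Elasticsearch versions the filter clause of constant_score takes the query directly, so an equivalent could look like the sketch below, built as a Python dict. Using a match query with fuzziness instead of the term-level fuzzy query is an assumption made here so that multi-word input such as "Nicolson Way" is analyzed first; the helper name and the defaults are hypothetical.

```python
def constant_fuzzy(field, value, fuzziness="AUTO", score=1.0):
    """Every hit, fuzzy or exact, gets the same score (the boost value)."""
    return {
        "query": {
            "constant_score": {
                # The filter clause does not compute relevance, so an exact
                # and a fuzzy hit are indistinguishable in the final score.
                "filter": {
                    "match": {
                        field: {
                            "query": value,
                            "fuzziness": fuzziness,
                            "operator": "and",
                        }
                    }
                },
                "boost": score,
            }
        }
    }


body = constant_fuzzy("text", "Nicolson Way")
```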
