Elasticsearch fuzziness with multi_match and bool_prefix type - elasticsearch

I have a set of search_as_you_type fields I need to search against. Here is my mapping:
{
  "mappings" : {
    "properties" : {
      "description" : {
        "type" : "search_as_you_type",
        "doc_values" : false,
        "max_shingle_size" : 3
      },
      "questions" : {
        "properties" : {
          "content" : {
            "type" : "search_as_you_type",
            "doc_values" : false,
            "max_shingle_size" : 3
          },
          "tags" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword"
              }
            }
          }
        }
      },
      "title" : {
        "type" : "search_as_you_type",
        "doc_values" : false,
        "max_shingle_size" : 3
      }
    }
  }
}
I am using a multi_match query with the bool_prefix type:
{
  "query": {
    "multi_match": {
      "query": "triangle",
      "type": "bool_prefix",
      "fields": [
        "title",
        "title._2gram",
        "title._3gram",
        "description",
        "description._2gram",
        "description._3gram",
        "questions.content",
        "questions.content._2gram",
        "questions.content._3gram",
        "questions.tags",
        "questions.tags._2gram",
        "questions.tags._3gram"
      ]
    }
  }
}
So far this works fine. Now I want to add typo tolerance, which is called fuzziness in ES. However, it looks like bool_prefix conflicts with it. If I modify my query by adding "fuzziness": "AUTO" and make an error in the word "triangle" -> "triangld", I don't get any results.
However, if I search for the phrase "right triangle", I see different behavior:
1. Even if no typo is made, I get more results just by adding "fuzziness": "AUTO" (1759 vs 1267).
2. If I add a typo to the 2nd word ("right triangdd"), it seems to work, but the results containing "right" without "triangle" ("The Bill of Rights", "Due process and right to privacy" etc.) are now pushed to the front.
3. If I make a typo in the 1st word ("righd triangle") or in both ("righd triangdd"), the results seem to be just fine. So this is probably the only correct behavior.
I've seen a couple of articles and even GitHub issues saying that fuzziness does not work properly with a multi_match query of type bool_prefix, but I can't find a workaround. I've tried changing the query type, but bool_prefix looks like the only one that supports search-as-you-type, and I need to return results as soon as the user starts typing.
Since all requests to ES are made from our backend, I can also manipulate the query string to build different query types if needed, for example one type for single-word searches and another for multi-word searches. But I basically need to maintain the current behavior.
I've also tried appending "~", "~1" or "~2" to the string, which seems to be another way of specifying fuzziness, but the results are rather unclear and performance (search speed) seems to be worse.
My questions are:
1. How can I achieve fuzziness for 1-word searches, so that the query "triangld" returns documents containing "triangle" etc.?
2. How can I get correct search results when the typo is in the 2nd (last?) word of the query? As mentioned above it works, but see point 2 above.
3. Why does just adding fuzziness (see point 1) return more results even if the phrase is spelled correctly?
4. Is there anything I need to change in my analyzers etc.?

To achieve the desired behavior, we did the following:
Changed the query type to "query_string".
Added query string preprocessing on the backend. We split the query string on whitespace and append "~1" or "~2" to each word if it is longer than 4 or 8 characters respectively ("~" is the fuzziness syntax in ES). However, we don't add this to the word currently being typed until the user types a whitespace. For example, while the user is typing [t, tr, tri, ... triangle] there is no fuzziness, but once the input is "triangle " it becomes "triangle~2". This is because fuzziness on the last, unfinished word produces unexpected results.
Removed all ngram fields from the search fields, as we get the same results but performance is a bit better.
Added "default_operator": "AND" to the query so that, for phrase queries, all words have to match within a single field.
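The preprocessing described above could be sketched roughly as follows. This is a minimal sketch, not the authors' actual backend code: the function names `add_fuzziness` and `build_query` are hypothetical, and the length thresholds are an assumption. The text says "more than 4 or 8 chars", but the example turns the 8-character "triangle" into "triangle~2", so 8 is treated as inclusive here. Escaping of query_string special characters is left out.

```python
from typing import List


def add_fuzziness(raw_query: str) -> str:
    """Append ~1 / ~2 to finished words; leave the word still being typed as-is."""
    words = raw_query.split()
    if not words:
        return ""
    # A trailing whitespace means the user has finished typing the last word.
    last_word_complete = raw_query[-1].isspace()
    result = []
    for i, word in enumerate(words):
        is_last = i == len(words) - 1
        if is_last and not last_word_complete:
            result.append(word)          # still being typed: no fuzziness
        elif len(word) >= 8:             # assumed threshold (see note above)
            result.append(word + "~2")
        elif len(word) > 4:              # assumed threshold
            result.append(word + "~1")
        else:
            result.append(word)
    return " ".join(result)


def build_query(raw_query: str, fields: List[str]) -> dict:
    """Build the query_string request body with default_operator AND."""
    return {
        "query": {
            "query_string": {
                "query": add_fuzziness(raw_query),
                "fields": fields,
                "default_operator": "AND",
            }
        }
    }
```

For example, `build_query("right triangle ", ["title", "description", "questions.content"])` would send "right~1 triangle~2" against the three base fields, without the ngram subfields.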

Related

How to give more weight to specific keywords while searching for similar text using Elasticsearch?

I am using Elasticsearch to get relevant blog articles from a database of articles. I want results that contain particular words to be given a higher score than results that do not contain them.
I have tried adding stop words and giving more weight to other fields, but the results are not quite as expected. I am using the developer console of the Kibana interface of Elasticsearch.
"""
GET blog-desc/_search
{
  "query": {
    "more_like_this" : {
      "fields" : ["Meta description", "Title^5", "Short title^0.5"],
      "like" : "Harry had a silver wand he likes to play with! Among his friends he has the most expensive one. The only difference between his wand and his sister's is that in the color",
      "min_term_freq" : 1,
      "max_query_terms" : 12,
      "minimum_should_match": "30%",
      "stop_words": ["difference", "play", "among"],
      "boost_terms": 1
    }
  }
}
"""
In the sample code above, I would want search results that contain the word "silver" to be given a higher score than articles that do not contain that word.

How to decrease the TF score in Elasticsearch?

I have two docs: 1. "Some Important Company", 2. "Some Important Company Important branch".
Since "Important" has a high docCount (many docs contain the word "Important"), when I search for "Some Important Company"
the 2nd doc gets a higher score, even though the 1st doc is an exact match.
So my question is: how can I boost the score for an exact match, or decrease the TF score?
My query is a multi_match on customerName and usedName, but usedName is always "" in this case.
I assume the field of your document is indexed using the standard text analyzer or something similar. I would combine a match query and a match_phrase query using a dis_max compound query.
This would give something like this:
{
  "query": {
    "dis_max" : {
      "queries" : [
        { "match" : { "myField" : "Some Important Company" }},
        { "match_phrase" : { "myField" : "Some Important Company" }}
      ],
      "tie_breaker" : 0.7
    }
  }
}
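There's no notion of "matching an exact phrase" with the match query. For this you need to use the match_phrase query. That's why you combine the two here. With dis_max, a document that matches both queries scores higher than one that matches only the match query: its score is the best clause's score plus tie_breaker (0.7 here) times the score of the other matching clause. You can read more about dis_max and match_phrase: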
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-dis-max-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html
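To make the boost concrete, here is a small worked sketch of how dis_max combines clause scores: the highest clause score plus tie_breaker times the sum of the remaining matching clause scores. The function name `dis_max_score` and the clause scores used below are made up for illustration; real scores come from Elasticsearch's similarity function.

```python
from typing import List


def dis_max_score(clause_scores: List[float], tie_breaker: float = 0.0) -> float:
    """Combine the scores of the matching clauses the way dis_max does."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    others = sum(clause_scores) - best
    return best + tie_breaker * others


# Hypothetical scores: a doc matching only `match` vs. one matching both clauses.
only_match = dis_max_score([2.0], tie_breaker=0.7)
match_and_phrase = dis_max_score([2.0, 1.5], tie_breaker=0.7)
print(only_match < match_and_phrase)
```

With tie_breaker set to 0, a document matching both clauses would score no higher than its best clause alone, which is why the answer sets it to a non-zero value.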

Finding fields Elasticsearch has matched on

I am using Elasticsearch to search for a group a user should join. I have the user data nested into the search query. On return I get back the closest matched group that user should be in.
The field I am searching on is a nested field as follows:
`{"interests": [
{"topics":["python", "stackoverflow", "elasticsearch"]},
{"topics":["arts", "textiles"]}
]}`
However, how do you find out why a document matched?
Elasticsearch does have an explain function which shows what the score is made up of using TF-IDF, but not specifically which terms were used.
For example, if I search for 'Textile', the doc should match on 'textiles'. Thus I want the term 'textiles' to be returned in explain or in some other way.
The only way I see to achieve this is to store the search and the retrieved document, and then process both to discover the words ES has most likely matched on.
EDIT - for some more clarity of the question
An example from my index is a group which has "interests": ['arts', 'fine arts', 'art painting', 'arts and crafts', 'sports']
In my search I am looking for Arts and many other things. The term I am searching for comes up in this list many times, and thus should always be a contributor.
What I want in the response is a statement that these words were matched: ['arts', 'fine arts', 'art painting', 'arts and crafts'], along with the degree to which they match, i.e. 'arts' should be higher than the others, but all others are also relevant.
Elasticsearch allows you to specify the _name parameter for all queries and filters. This means that you can separate your query into different parts with separate names, which will allow you to determine which parts matched.
For example:
{
  "query" : {
    "bool" : {
      "should" : [
        {"match" : { "interests.topics" : {"query" : "python", "_name" : "py-topic"} }},
        {"match" : { "interests.topics" : {"query" : "arts", "_name" : "arts-topic"} }}
      ]
    }
  }
}
Then, in your response, each hit comes with a matched_queries array listing the names of the queries (or filters) that matched, so you can determine whether the py-topic query and/or the arts-topic query matched above.
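As a rough sketch of how this could be used from application code: one helper builds a named match clause per search term, and another reads the matched_queries array of each hit from the parsed JSON response. The helper names, the "<term>-topic" naming scheme, and the sample response are hypothetical and only illustrate the shape of the data.

```python
from typing import Dict, List


def named_match_clauses(field: str, terms: List[str]) -> List[dict]:
    """One match clause per term, each tagged with its own _name."""
    return [
        {"match": {field: {"query": term, "_name": f"{term}-topic"}}}
        for term in terms
    ]


def matched_queries_by_hit(response: dict) -> Dict[str, List[str]]:
    """Map each hit's _id to the names of the query clauses that matched it."""
    return {
        hit["_id"]: hit.get("matched_queries", [])
        for hit in response["hits"]["hits"]
    }


query = {"query": {"bool": {"should": named_match_clauses("interests.topics", ["python", "arts"])}}}

# Hand-written response fragment, trimmed to the relevant keys (illustrative only).
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "_score": 1.2, "matched_queries": ["arts-topic"]},
            {"_id": "2", "_score": 0.4, "matched_queries": ["python-topic", "arts-topic"]},
        ]
    }
}
print(matched_queries_by_hit(sample_response))
```

This tells you which named clauses contributed to each hit, though not the degree to which each one matched.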

How to enable fuzziness for phrase queries in Elasticsearch

We're using Elasticsearch to search through millions of tags. Our users should be able to include boolean operators (+, -, "xy", AND, OR, brackets). If no hits are returned, we fall back to a spelling suggestion provided by ES and search again. This is our query:
$ curl -XGET 'http://127.0.0.1:9200/my_index/my_type/_search' -d '
{
  "query" : {
    "query_string" : {
      "query" : "some test query +bools -included",
      "default_operator" : "AND"
    }
  },
  "suggest" : {
    "text" : "some test query +bools -included",
    "simple_phrase" : {
      "phrase" : {
        "field" : "my_tags_field",
        "size" : 1
      }
    }
  }
}'
Instead of only providing a fallback to spelling suggestions, we'd like to enable fuzzy matching. If, for example, a user searches for "stackoverfolw", ES should return matches for "stackoverflow".
Additional question: What's the better performing method for "correcting" spelling errors? As it is now, we have to perform two subsequent requests, first with the original search term, then with the term suggested by ES.
The query_string query does support some fuzziness, but only when using the ~ operator, which I think doesn't fit your use case. I would add a fuzzy query and combine it with the existing query_string using OR. For instance, you can use a bool query and add the fuzzy query as a should clause, keeping the original query_string as a must clause.
As for your additional question about how to correct spelling mistakes: I would use fuzzy queries to correct them automatically, and two subsequent requests if you want the user to select the right correction from a list (e.g. "Did you mean"), but your approach sounds good too.
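A minimal sketch of the OR combination as a request body built in Python. Note that it puts both queries in should clauses with minimum_should_match set to 1, because a must clause still has to match on its own, so a misspelled term that fails the query_string would not be rescued by a should clause next to it. The function name is hypothetical, the field name is taken from the suggest block in the question, and "fuzziness": "AUTO" assumes an Elasticsearch version whose fuzzy query supports that parameter.

```python
def fuzzy_or_query_string(user_query: str, term: str, field: str = "my_tags_field") -> dict:
    """Match documents satisfying the query_string OR a fuzzy match on one term."""
    return {
        "query": {
            "bool": {
                "should": [
                    # The user's original boolean query, unchanged.
                    {"query_string": {"query": user_query, "default_operator": "AND"}},
                    # Typo-tolerant fallback on a single term.
                    {"fuzzy": {field: {"value": term, "fuzziness": "AUTO"}}},
                ],
                "minimum_should_match": 1,
            }
        }
    }


body = fuzzy_or_query_string("stackoverfolw", "stackoverfolw")
```

Documents matching both clauses would score higher than those matching only the fuzzy one, so exact hits should still tend to rank first.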

How to index the following for multifaceting in Elasticsearch?

Suppose I have a People collection. Each person may have multiple hobbies (e.g. Running, Climbing, Swimming, Jumping Jacks).
How would I index a single person with all those attributes such that I could apply a facet to them? Could someone provide a sample of how the data should be indexed given the following:
Person | Hobbies
Joe | Chess, Jumping Jacks, Swimming
Person | Hobbies
Bob | Rowing
And how would I go about getting facets for the "hobbies" key? (Note that "Jumping Jacks" is a single value, but consists of two whitespace-separated words.)
If you both want to search on the hobbies field and make a facet on it, you need to use a multi_field. That's how you can index the same field in different ways. Usually the version for search needs to be tokenized and at least lowercased, plus language dependent analysis if you want, while the facet version doesn't even need to be analyzed since the facet entries need to be the same that you had in your source documents.
{
  "people" : {
    "properties" : {
      "hobbies" : {
        "type" : "multi_field",
        "fields" : {
          "hobbies" : {"type" : "string", "index" : "analyzed"},
          "facet" : {"type" : "string", "index" : "not_analyzed"}
        }
      }
    }
  }
}
The above mapping would create two different fields from the same hobbies input. The first one, which you can refer to in your queries simply as hobbies, uses the default standard analyzer; the second one is not analyzed and can be used for the facet. You can refer to it as hobbies.facet.
As a result you can search for jumping and find a match, but your facet will look like the following:
Chess (1)
Jumping Jacks (1)
Swimming (1)
Hobbies (1)
Rowing (1)
