Allow wildcards in proximity searches with multiple words - elasticsearch

I am using Elasticsearch 5.6 on Ubuntu 16.04. My problem arises when I try to use wildcards inside a proximity search with multiple words.
Example:
"hell* worl*"~3
Basically, I would like to match all words that start with "hell" and "worl" and are within a maximum distance of 3 of each other.
I do not get any error, but no documents are found. It seems that the wildcards are not analyzed, even though I have set analyze_wildcard: true.
The docs say:
By default, wildcards terms in a query string are not analyzed. By
setting this value to true, a best effort will be made to analyze
those as well.
However, only the following query works:
"hello world"~3 # this works
This is my query:
{
"size":15,
"from":0,
"query":{
"bool":{
"must":[
{
"query_string":{
"query":"\"hell* worl*\"~3",
"analyze_wildcard":true
}
}
]
}
}
}
Reference:
Proximity Searches
Wildcards

You can use span queries to achieve what you want, but be careful, because the terms are not analyzed here.
{
"size": 15,
"from": 0,
"query": {
"span_near": {
"clauses": [
{
"span_multi": {
"match": {
"wildcard": {
"t": "hell*"
}
}
}
},
{
"span_multi": {
"match": {
"wildcard": {
"t": "worl*"
}
}
}
}
],
"slop": 3,
"in_order": true
}
}
}
The problem in your query_string is that the * character is not treated as a wildcard inside quotes. What you get is a plain sloppy phrase query, similar to "hell# worl#"~3, because special characters have no meaning inside quotes.
Be careful, though: span queries are much slower than a simple phrase search (although, to my surprise, they still seem to be faster than sloppy phrases).
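Since the wildcard terms are not analyzed, the request body has to carry them already normalized. A minimal Python sketch of building the span query above from a list of prefixes; the helper name `span_near_prefixes` is hypothetical, and lowercasing by hand assumes the field's analyzer lowercases tokens at index time:

```python
import json


def span_near_prefixes(field, prefixes, slop=3, in_order=True):
    """Build a span_near query with one span_multi/wildcard clause per prefix."""
    clauses = []
    for prefix in prefixes:
        # Wildcard terms bypass analysis, so normalize them here.
        # Assumption: the indexed tokens are lowercase.
        pattern = prefix.lower().rstrip("*") + "*"
        clauses.append({"span_multi": {"match": {"wildcard": {field: pattern}}}})
    return {"span_near": {"clauses": clauses, "slop": slop, "in_order": in_order}}


def search_body(field, prefixes, size=15, offset=0):
    """Wrap the span query in a search request body."""
    return {"size": size, "from": offset, "query": span_near_prefixes(field, prefixes)}


if __name__ == "__main__":
    print(json.dumps(search_body("t", ["Hell", "worl*"]), indent=2))
```

The resulting dictionary can be sent as the JSON body of a `_search` request with whatever client you already use.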
A better option, if you can still prepare your data for this scenario, is to use ngrams. With ngrams, a simple "hell worl"~3 would match what you want.
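One way the ngram idea could look, as a sketch only: it assumes edge n-grams (prefix grams) applied at index time with a plain analyzer at search time, and the names `prefix_grams`, `prefix_index` and the mapping type `doc` are made up for illustration. The gram sizes are arbitrary and would need tuning.

```python
def edge_ngram_index_body(field="t", min_gram=2, max_gram=10):
    """Index settings and mapping that store prefixes of every token."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Hypothetical filter name; emits "he", "hel", "hell", ...
                    "prefix_grams": {
                        "type": "edge_ngram",
                        "min_gram": min_gram,
                        "max_gram": max_gram,
                    }
                },
                "analyzer": {
                    # Hypothetical analyzer name, used at index time only.
                    "prefix_index": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "prefix_grams"],
                    }
                },
            }
        },
        "mappings": {
            "doc": {
                "properties": {
                    field: {
                        "type": "text",
                        "analyzer": "prefix_index",
                        # Query terms are not n-grammed, only lowercased.
                        "search_analyzer": "standard",
                    }
                }
            }
        },
    }


def proximity_query(field, text, slop=3):
    """Sloppy phrase query; the prefixes are matched as ordinary terms."""
    return {"query": {"match_phrase": {field: {"query": text, "slop": slop}}}}
```

With such a mapping, `proximity_query("t", "hell worl")` plays the role of "hell worl"~3, at the cost of a larger index.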

Related

Elasticsearch: how to write bool query that will contain multiple conditions on the same token?

I have a field with a tokenizer that splits on dots.
At search time, the value aaa.bbb is split into two terms, aaa and bbb.
My question is how to write a bool query that contains multiple conditions on the same term.
For example, I want to get all docs whose field contains a term that matches a fuzzy search for gmail, where that same term must not contain gamil.
Here are some examples of what I want to achieve:
bmail // MATCH: it matches the fuzzy search and is not gamil
gamil.bmail // MATCH: the term bmail matches the fuzzy search and is not gamil
gamil // NO MATCH: it matches the fuzzy search but equals gamil
NOTE: the following query does NOT work: if one term matches the first condition and a different term matches the second, the document is still considered a hit.
{
...
"body": {
"query": {
"bool": {
"must": [
{
"fuzzy": {
"my_field": {
"value": "gmail",
"fuzziness": 1,
"max_expansions": 2100000000
}
}
},
{
"bool": {
"must_not": [
{
"query_string": {
"default_field": "my_field",
"query": "*gamil*",
"analyzer": "keyword"
}
}
]
}
}
]
}
}
},
}
I ended up using highlighting: I execute the fuzzy (or any other) query and then programmatically filter the results by the returned highlight object.
Span queries might also be a good option if you don't need regular expressions, or if you can make sure you don't exceed the boolean query clause limit.
(see more details in the provided link)
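The highlight-based filtering could be sketched roughly as follows in Python. It assumes the search request asked for highlighting on `my_field` with the default `<em>` tags, and the function names and the sample response are invented for illustration:

```python
import re

# Default highlighter tags wrap each matching term in <em>...</em>.
EM_TERM = re.compile(r"<em>(.*?)</em>")


def highlighted_terms(hit, field):
    """Return the lowercased terms the highlighter marked for one hit."""
    fragments = hit.get("highlight", {}).get(field, [])
    return [term.lower() for frag in fragments for term in EM_TERM.findall(frag)]


def filter_hits(response, field, banned):
    """Keep hits where at least one highlighted term is not the banned one."""
    return [
        hit
        for hit in response["hits"]["hits"]
        if any(term != banned for term in highlighted_terms(hit, field))
    ]


# Hypothetical response for the three example values from the question.
sample = {
    "hits": {
        "hits": [
            {"_id": "1", "highlight": {"my_field": ["<em>bmail</em>"]}},
            {"_id": "2", "highlight": {"my_field": ["<em>gamil</em>.<em>bmail</em>"]}},
            {"_id": "3", "highlight": {"my_field": ["<em>gamil</em>"]}},
        ]
    }
}

kept_ids = [hit["_id"] for hit in filter_hits(sample, "my_field", "gamil")]
```

Because the filtering happens client-side, `size` and paging no longer line up with the number of hits you end up keeping.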

Elasticsearch: Get documents which have minimum matching percentage

Suppose I have the following two documents indexed:
[
{
"name": "John Doe"
},
{
"name": "John A"
}
]
The match percentage of the word John against the name field is 50 for the first document and 66.7 for the second.
Now the question is: how can I find all matches whose match percentage is greater than X, where 0<=X<=100? The match should always be a prefix match.
The only way I see to do it is the use of a script query in a filter to determine a minimum length of the field (you can calculate it with your percentage and your term length):
{
"query": {
"bool": {
"filter": {
"bool": {
"must": [
// Your name: 'John' match
{
"script": {
"script": {
"params": {
"min_size": 4
},
// In ES <5.6 versions, use "inline" instead of "source"
"source": "doc['name'].values.length() > params.min_size"
}
}
}
]
}
}
}
}
}
But you will have to enable fielddata on your field.
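The length threshold can be worked out from the percentage and the term length. A small Python sketch of that arithmetic, using the question's own definition (term length divided by field length, spaces included); the function names are hypothetical. Under this definition the percentage rises as the field gets shorter, so a minimum percentage corresponds to an upper bound on the field length, and the comparison direction in any script filter should be checked against that:

```python
import math


def match_percent(term, value):
    """Percentage of the field value covered by a prefix match of term."""
    if not value.startswith(term):
        return 0.0
    return 100.0 * len(term) / len(value)


def max_field_length(term, min_percent):
    """Longest field value that still reaches min_percent for this term.

    Returns None when min_percent is 0, i.e. there is no length limit.
    """
    if not 0 <= min_percent <= 100:
        raise ValueError("min_percent must be between 0 and 100")
    if min_percent == 0:
        return None
    return math.floor(len(term) * 100 / min_percent)


if __name__ == "__main__":
    for name in ("John Doe", "John A"):
        print(name, round(match_percent("John", name), 1))
    print("limit for X=60:", max_field_length("John", 60))
```

The computed limit is what would be passed as the script parameter.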
While you can build something like this with scripting (as Julien TASSIN describes), this is not what you want:
Unless you have a filter criteria or very little data, this will be slow, since Elasticsearch needs to do some heavy calculations for every search.
Elasticsearch generally operates on tokens. While you can do a lot of things with scripting, your use case sounds like you are either using it wrong or Elasticsearch is probably not a great fit; though I don't know any other system that would work very well for this specific requirement.

Elasticsearch case-insensitive query_string query with wildcards

In my ES mapping I have a 'uri' field which is currently set to not_analyzed, and I'm not allowed to change the mapping. I wanted to search for URI parts with a query_string query like this (the query is autogenerated, which is why it is a bit complicated, but let's just focus on the query_string part):
{
"sort": [{"updated": {"order": "desc"}}],
"query": {
"bool": {
"must":[{
"query_string": {
"query":"*w3\\.org\\/2014\\/01\\/a*",
"lowercase_expanded_terms": true,
"default_field": "uri"
}
}],
"minimum_number_should_match": 1
}
}, "size": 50}
This usually works, but I have the following (fictional) URL stored: http://w3.org/2014/01/Abc.html, and this query does not return it because of the A/a difference. Setting lowercase_expanded_terms to false does not solve this either. What should I do to make this query case-insensitive?
Thanks for the help in advance.
From the docs, it seems like you need a new analyzer that first transforms to lowercase and then can run the search. Have you tried that?
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/sorting-collations.html
As I read it, your pattern, lowercase_expanded_terms, only applies to expansions, not to regular words
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
lowercase_expanded_terms
Whether terms of wildcard, prefix, fuzzy, and range queries are to be automatically lower-cased or not (since they are not analyzed). Default is true.
Try using a match query instead of a query string query.
{
"sort": [
{
"updated": {
"order": "desc"
}
}
],
"query": {
"bool": {
"must": [
{
"match": {
"uri": "*w3\\.org\\/2014\\/01\\/a*"
}
}
]
}
},
"size": 50
}
Wildcard terms in a query_string query are not analyzed, whereas match queries are analyzed.
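A different technique from the ones suggested above, shown only as a possible workaround when the mapping cannot change: a `regexp` query whose pattern spells out both cases of every letter. This is a sketch under the assumption that the field holds the whole URI as a single term; the helper names are hypothetical, and patterns with a leading `.*` can be slow on large indices.

```python
import re


def case_insensitive_regexp(fragment):
    """Turn a literal fragment into a regex that ignores letter case."""
    parts = []
    for ch in fragment:
        if ch.lower() != ch.upper():
            # Cased letter: accept either form, e.g. "a" -> "[aA]".
            parts.append("[" + ch.lower() + ch.upper() + "]")
        elif ch.isalnum():
            parts.append(ch)
        else:
            # Escape punctuation so it is matched literally.
            parts.append("\\" + ch)
    # The regexp query must match the whole term, hence the surrounding ".*".
    return ".*" + "".join(parts) + ".*"


def uri_query(fragment, field="uri"):
    """Search body using the regexp query on the untouched field."""
    return {"query": {"regexp": {field: case_insensitive_regexp(fragment)}}}


pattern = case_insensitive_regexp("w3.org/2014/01/a")
# The same pattern happens to be valid for Python's re, which allows a
# quick local check against the fictional URL from the question.
matches_locally = re.fullmatch(pattern, "http://w3.org/2014/01/Abc.html") is not None
```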

query_string vs group match in elasticsearch

What is the difference between this query:
"query": {
"bool": {
...
"should": [
{
"match": {
"description": {
"query": "test"
}
}
},
{
"match": {
"address": {
"query": "test"
}
}
},
{
"match": {
"country": {
"query": "test"
}
}
},
{
"match": {
"city": {
"query": "test"
}
}
}
]
}}
and that one:
"query": {
"bool": {
...
"should": [
{
"query_string": {
"query": "test",
"fields": [
"description",
"address",
"country",
"city"
]
}
}
]
}}
Performance, relevance?
Thanks in advance!
The query is analyzed depending on the field analyzer (unless you specify the analyzer in the query itself), thus querying multiple fields with a single query doesn't necessarily mean analyzing the query only once.
Keep in mind that query_string supports the Lucene query syntax: AND and OR operators, querying on specific fields, wildcards, phrase queries, etc. It therefore needs to be parsed, which I don't think makes much difference here in terms of performance, but parsing is error-prone: malformed input can make the query fail. If you don't need all that power, stick to the match query, and if you want to run the same query on multiple fields, have a look at the multi_match query, which does what you did with your query_string but translates internally to multiple match queries.
Also, the scores returned if you compare the output of multiple match queries and your query_string might be quite different. Using a bool query you effectively build a lucene boolean query, while the query_string uses by default "use_dis_max":"true", which means it uses internally a dis_max query by default. Same happens using the multi_match query. If you set use_dis_max to false a bool query is going to be used internally instead.
In terms of performance, I would say that the second query has the advantage, because the first query requires the query string to be analyzed for each of the four match clauses, while in the second only one query string needs to be analyzed.
Apart from that, there are some comparisons done over here that you can look at.
I am not quite sure about the relevance differences, but you can always run both queries and compare the results.
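For comparison, here is a sketch of the two shapes discussed in this thread, built in Python: the explicit bool/should of match queries and the multi_match query mentioned above. The helper names are hypothetical. The `type` values are the standard multi_match modes: `best_fields` scores in the dis_max style, while `most_fields` combines the per-field scores the way a bool query does.

```python
def should_of_matches(text, fields):
    """One match clause per field, combined in a bool/should."""
    return {
        "bool": {
            "should": [{"match": {field: {"query": text}}} for field in fields]
        }
    }


def multi_match(text, fields, mode="best_fields"):
    """Single multi_match clause over the same fields."""
    return {"multi_match": {"query": text, "fields": list(fields), "type": mode}}


FIELDS = ["description", "address", "country", "city"]

explicit = should_of_matches("test", FIELDS)
compact = multi_match("test", FIELDS)
bool_like = multi_match("test", FIELDS, mode="most_fields")
```

Running `explicit` and `bool_like` side by side is an easy way to see whether the scores actually differ on your data.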

elasticsearch boost importance of exact phrase match

Is there a way in Elasticsearch to boost the importance of an exact phrase appearing in the document?
For example if I was searching for the phrase "web developer" and if the words "web developer" appeared together they would be boosted by 5 compared to "web" and "developer" appearing separately throughout the document. Thereby any document that contained "web developer" together would appear first in the results.
You can combine different queries using a bool query, and you can assign a different boost to each of them. Let's say you have a regular match query for both terms, regardless of their positions, and then a phrase query with a higher boost.
Something like the following:
{
"query": {
"bool": {
"should": [
{
"match": {
"field": "web developer"
}
},
{
"match_phrase": {
"field": {
"query": "web developer",
"boost": 5
}
}
}
],
"minimum_number_should_match": 1
}
}
}
As an alternative to javanna's answer, you could do something similar with must and should clauses within a bool query:
{
"query": {
"bool": {
"must": {
"match": {
"field": {
"query": "web developer",
"operator": "and"
}
}
},
"should": {
"match_phrase": {
"field": "web developer"
}
}
}
}
}
Untested, but I believe the must clause here will match results containing both 'web' and 'developer' and the should clause will score phrases matching 'web developer' higher.
You could try using rescore to run an exact phrase match on your initial results. From the docs:
"Rescoring can help to improve precision by reordering just the top (eg 100 - 500) documents returned by the query and post_filter phases, using a secondary (usually more costly) algorithm, instead of applying the costly algorithm to all documents in the index."
https://www.elastic.co/guide/en/elasticsearch/reference/current/filter-search-results.html#rescore
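A rescore request along those lines might look like the following Python sketch. The helper name, the window size and the weights are illustrative assumptions, not recommended values:

```python
def phrase_rescore_body(field, text, window_size=100, phrase_weight=5.0):
    """Cheap match query first, then a phrase rescore of the top hits."""
    return {
        # First pass: any document containing the terms.
        "query": {"match": {field: {"query": text}}},
        "rescore": {
            # Only the top window_size hits per shard are rescored.
            "window_size": window_size,
            "query": {
                "rescore_query": {"match_phrase": {field: {"query": text}}},
                "query_weight": 1.0,
                # Weight of the phrase score when it is combined
                # with the original score.
                "rescore_query_weight": phrase_weight,
            },
        },
    }


body = phrase_rescore_body("field", "web developer")
```

Documents outside the rescore window keep their original order, so the window should be at least as large as the number of results you display.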
I used the sample query below in my case, and it works. It returns exact and fuzzy results, with the exact ones boosted!
{ "query": {
"bool": {
"should": [
{
"match": {
"name": "pala"
}
},
{
"fuzzy": {
"name": "pala"
}
}
]
}}}
I do not have enough reputation to comment on James Adison's answer, which I agree with.
What is still missing is the boost factor, which can be done using the following syntax:
{
"match_phrase":
{
"fieldName": {
"query": "query string for exact match",
"boost": 10
}
}
}
I think this is already the default behaviour of the match query with the "or" operator. It will match the phrase "web developer" first and then terms like "web" or "developer". You can still boost your query using the answers above, though. Correct me if I'm wrong.
