Which analyzer is used while using fuzzy operator with query_string clause? - elasticsearch

Suppose I have a query clause like,
{
"query":
{
"query_string": {
"query": "ads spark~",
"fields": [
"flowName",
"projectName"
],
"default_operator": "and"
}
}
}
For this the explain output is:
"explanation": "+(projectName:ads | flowName:ads) +(projectName:spark~1 | flowName:spark~1)"
Whereas if I remove the fuzzy operator from query. Updated query clause below,
{
"query":
{
"query_string": {
"query": "ads spark",
"fields": [
"flowName",
"projectName"
],
"default_operator": "and"
}
}
}
I get a different explain output,
"explanation": "(projectName:ads spark | flowName:ads spark)"
Any idea why the tokens generated as different in both cases?

When you use fuzzy queries the way the query is parsed and constructed in Lucene differs from the normal behavior.
The one you see with the explanation is the Lucene query built from the query text.
When using fuzziness most of the text analysis is not done, only the filters that work on a per-character basis are allowed, as you can read in the documentation [1][2].
In this first case, since you are using fuzziness, the query text is split by whitespaces. Then, for each term a mandatory clause is built (the AND operator states that each term MUST appear in the document). You can call this a "term centric" query. Then each term is searched across the multiple fields in input with a disjunction (|) clause.
You therefore see "ads MUST be in projectName OR flowName, AND spark (with variations within the Levenshtein_distance) MUST be in projectName OR flowName".
In the second case, no fuzziness is used. Here the query is passed to each field and then the terms will follow the corresponding field text analysis (if any). You may call this a "field centric" query. Therefore you see "ads spark MUST be in projectName OR flowName" to have a document match.
You are effectively moving from an "I want all the terms to appear in the document" (it could be in different fields) to "I want all terms to appear in a single field".
If you want an in-depth analysis you can read this blog post https://sease.io/2021/05/apache-solr-sow-parameter-split-on-whitespace-and-multi-field-full-text-search.html. This is relative to Solr but Elasticsearch applies the same behavior.

Related

Elasticsearch: What is the difference between a match and a term in a filter?

I was following an ES tutorial, and at some point I wrote a query using term in the filter instead the recommended solution using match. My understanding is that match was used in the query part to get scoring, while term was used in the filter part to just remove hits before enter the query part. To my surprise match also works in the filter part.
What is the difference between:
GET blogs/_search
{
"query": {
"bool": {
"filter": {
"match": {
"category.keyword": "News"
}
}
}
}
}
and:
GET blogs/_search
{
"query": {
"bool": {
"filter": {
"term": {
"category.keyword": "News"
}
}
}
}
}
Both returns the same hits, and the score is 0 for all hits.
What is the behaviour or match in a filter clause? I would expect it to yield some score, but it does not.
What I thought:
term : does not analyze either the parameter or the field, and it is a yes/no scenario.
match : analyzes parameter and field and calculates a score of how good they match.
But when using match against a keyword in the filter part of the query, how does it behave?
The match query is a high-level query that resorts to using a term query if it needs to.
Scoring has nothing to do with using match instead of term. Scoring kicks in when you use bool/must/should instead of bool/filter.
Here is how the match query works:
First, it checks the type of the field.
If it's a text field then the value will be analyzed, either with the analyzer specified in the query (if any), or with the search- or index-time analyzer specified in the mapping.
If it's a keyword field (like in your case), then the input is not analyzed and taken "as is"
Since you're using the match query on a keyword field and your input is a single term, nothing is analyzed and the match query resorts to using a term query underneath. This is why you're seeing the same results.
In general, it's always best to use a match query as it is smart enough to know what to do given the field you're querying and the input data you're searching for.
You can read more about the difference between the two here.

Elastic Search Multimatch: Is there a way to search all fields except one?

We have an Elastic Search structure that specifies fields in a multi_match query like this:
"multi_match": {
"query": "find this string",
"fields": ["*_id^20", "*_name^20", "*"]
}
This works great - except under certain circumstances like when query is "Find NOWAK". This is because "NOW" is a reserved word for date searching and field "*" matches fields that are defined as dates.
So what I would like to do is ignore fields that match "*_at".
Is there way to tell Elastic Search to ignore certain fields in a multi_match query?
If the answer to that is "no" then the follow up question is how to escape the search term so that it won't trigger key words
Running version 6.7
Try this:
Exclude a field on a Elasticsearch query
curl -XGET 'localhost:9200/testidx/items/_search?pretty=true' -d '{
"query" : {
"query_string": {
"fields": ["title", "field2", "field3"], <-- add this
"query": "Titulo"
}},
"_source" : {
"exclude" : ["*.body"]
}
}'
Apparently the answer is "No: there is not a way to tell ElasticSearch to ignore certain fields in a multi_match query"
For my particular issue I found an inexpensive way to find the necessary white-listed fields (this is performed outside the scope of ElasticSearch otherwise I would post it here) and list those in place of the "*" when building the query.
I am hopeful someone will tell me I'm wrong, but I don't think I am.

Elasticsearch query multi_match

I'm trying to create an elasticsearch query that looks for multiple fields. This works fine so far. However, I would like to refine this.
Let's say the word was indexed: "test". However, when I search for "tes" he does not find that word for me, but I would like to show it already - but the combination with my query brings me to a challenge.
{
"multi_match" : {
"query": "*" + query + "*",
"type": "cross_fields",
"operator": "and",
"fields": ["article.number^1","article.name_de^1", "article.name_en^5", "article.name_fr^5", "article.description^1"],
"tie_breaker": 0,
}
Depending on your constraints, here are your options.
If you wish to use wildcard before/after your search term, you can use wildcard query. This has high processing cost at query time.
If you are fine with additional storage cost, you can opt to tokenize your input during analysis process. See ngram tokenizer. Beware that if you have long strings, it can quickly explode the storage requirement.

Is it possible to chain fquery filters in elastic search with exact matches?

I have been having trouble writing a method that will take in various search parameters in elasticsearch. I was working with queries that looked like this:
body:
{query:
{filtered:
{filter:
{and:
[
{term: {some_term: "foo"}},
{term: {is_visible: true}},
{term: {"term_two": "something"}}]
}
}
}
}
Using this syntax I thought I could chain these terms together and programatically generate these queries. I was using simple strings and if there was a term like "person_name" I could split the query into two and say "where person_name match 'JOHN'" and where person_name match 'SMITH'" getting accurate results.
However, I just came across the "fquery" upon asking this question:
Escaping slash in elasticsearch
I was not able to use this "and"/"term" filter searching a value with slashes in it, so I learned that I can use fquery to search for the full value, like this
"fquery": {
"query": {
"match": {
"by_line": "John Smith"
But how can I search like this for multiple items? IT seems that when i combine fquery and my filtered/filter/and/term queries, my "and" term queries are ignored. What is the best practice for making nested / chained queries using elastic search ?
As in the comment below, yes I can just add fquery to the "and" block like so
{:filtered=>
{:filter=>
{:and=>[
{:term=>{:is_visible=>true}},
{:term=>{:is_private=>false}},
{:fquery=>
{:query=>{:match=>{:sub_location=>"New JErsey"}}}}]}}}
Why would elasticsearch also return results with "sub_location" = "new York"? I would like to only return "new jersey" here.
A match query analyzes the input and by default it is a boolean OR query if there are multiple terms after the analysis. In your case, "New JErsey" gets analyzed into the terms "new" and "jersey". The match query that you are using will search for documents in which the indexed value of field "sub_location" is either "new" or "jersey". That is why your query also matches documents where the value of field "sub_location" is "new York" because of the common term "new".
To only match for "new jersey", you can use the following version of the match query:
{
"query": {
"match": {
"sub_location": {
"query": "New JErsey",
"operator": "and"
}
}
}
}
This will not match documents where the value of field "sub_location" is "New York". But, it will match documents where the value of field "sub_location" is say "York New" because the query finally translates into a boolean query like "York" AND "New". If you are fine with this behaviour, well and good, else read further.
All these issues arise because you are using the default analyzer for the field "sub_location" which breaks tokens at word boundaries and indexes them. If you really do not care about partial matches and want to always match the entire string, you can make use of custom analyzers to use Keyword Tokenizer and Lowercase Token Filter. Mind you, going ahead with this approach will need you to re-index all your documents again.

ElasticSearch "Match" with FullText Matches

I'm working with ElasticSearch.
When I do this query:
{query: "blackberry -q10"}
I get exactly what I want (all results which have reference to BlackBerry but not Q10).
However, I want to restrict the fields which are searched to just the "title" field. Eg, the _source documents have titles, body, tags, etc. and I only want to search the title. The ElasticSearch "Match" seems right for me...
{query: {match: {title: "blackberry -q10"}}}
While this succeeds in only searching the title, it still returns results with have Q10 in the title, unlike the search above.
I'm looking at the match documentation but can't seem to figure it out.
Thanks!
The Match query doesn't use negation syntax like that. E.g you can't use a "minus" to negate a term. It will be parsed as a hyphen by the default search analyzer.
I would use a filtered query in this case. You could add the negation in a query...but a filter will be much faster.
{
"filtered":{
"query":{
"match":{
"title":"blackberry"
}
},
"filter":{
"bool":{
"must_not":{
"term":{
"title":"q10"
}
}
}
}
}
}
Note, you may need to change the term filter, depending on how you analyzed the field at index time.
EDIT:
Based on your comment below, if you really want to keep the ability to do negations "inline", you would use the field query (a more specific version of query_string, which would also work). This query uses Lucene syntax, which allows inline negation
{
"field" : {
"title" : "blackberry -q10"
}
}
The reason query_string and it's derivatives are not recommended is because it is easy to shoot yourself in the foot. Or rather, it's easy for your users to shoot your server in the face. Query_string demands proper syntax and will simply die if the users enter it incorrectly. It also allows your users to make some horrible inefficient queries, usually through wildcards
You want to match all the titles that have "blackberry" AND do not have have q10, not all the titles that have "blackberry" OR do not have q10.
The default boolean operator for a match is (in most cases) OR. Try adding an "operator": "and" clause to your query instead.

Resources