elasticsearch: or operator, number of matches - elasticsearch

Is it possible to score my searches according to the number of matches when using operator "or"?
Currently query looks like this:
"query": {
"function_score": {
"query": {
"match": {
"tags.eng": {
"query": "apples banana juice",
"operator": "or",
"fuzziness": "AUTO"
}
}
},
"script_score": {
"script": # TODO
},
"boost_mode": "replace"
}
}
I don't want to use "and" operator, since I want documents containing "apple juice" to be found, as well as documents containing only "juice", etc. However a document containing the three words should score more than documents containing two words or a single word, and so on.
I found a possible solution here https://github.com/elastic/elasticsearch/issues/13806
which uses bool queries. However I don't know how to access the tokens (in this example: apples, banana, juice) generated by the analyzer.
Any help?

Based on the discussions above I came up with the following solution, which is a bit different that I imagined when I asked the question, but works for my case.
First of all I defined a new similarity:
"settings": {
"similarity": {
"boost_similarity": {
"type": "scripted",
"script": {
"source": "return 1;"
}
}
}
...
}
Then I had the following problem:
a query for "apple banana juice" had the same score for a doc with tags ["apple juice", "apple"] and another doc with tag ["banana", "apple juice"]. Although I would like to score the second one higher.
From the this other discussion I found out that this issue was caused because I had a nested field. And I created a usual text field to address it.
But I also was wanted to distinguish between a doc with tags ["apple", "banana", "juice"] and another doc with tag ["apple banana juice"] (all three words in the same tag). The final solution was therefore to keep both fields (a nested and a text field) for my tags.
Finally the query consists of bool query with two should clauses: the first should clause is performed on the text field and uses an "or" operator. The second should clause is performed on the nested field and uses and "and operator"
Despite I found a solution for this specific issue, I still face a few other problems when using ES to search for tagged documents. The examples in the documentation seem to work very well when searching for full texts. But does someone know where I can find something more specific to tagged documents?

Related

How to highlight regexp in elasticsearch to with patterns that include spaces

Am trying to use regexp in elasticsearch to find some patterns and highlight it, the pattern am trying to find contain spaces,
I also used keyword as to not analyze the text
{
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
the exact pattern is ".*( [a-zA-Z]( |,|.)){3,5}.*" the query looks like this
{
"_source": false,
"query": {
"bool": {
"should": [
{
"regexp": {
"transcript_data.transcript.keyword": {
"value": ".*( [a-zA-Z]){3,5}.*"
}
}
}
]
}
},
"highlight": {
"fields": {
"transcript_data.transcript.keyword": {}
}
}
}
the highlight seems to highlight the whole document (start to end) eventhough the pattern lies in the middle of text.
to clarify, consider the below difference between the 2 images
V.S.
for example
It's it's it's a steal. A hot eight mining and b k k t. I think these have went too the output should be <em>b k k t</em> I get ... <em>It's it's it's a steal. A hot eight mining and b k k t. I think these have went too</em> and I believe this because the .* but this also seem to be the way regex work in ES, what am doing wrong ?
As far as I know you cannot search for text type of field when regex have white space because text field are analzyed and it is split into multipul tokens. So when you do search on text field with whitespace in regex then it will not return any result.
Currently you are trying to search keyword, type of field which is not analyzed field and that's why you are able to search it. Also, it is highlighting entire field because keyword field is not analyzed and it will store entire single value.
You can use "require_field_match": "false" in highlighting if you want to search on other field and highlight on different field but this will also not work in your case.
You can try out shingle another and then try to search on keyword field and highlight on shingle field, but I am not sure if this will fit your usecase completelty.

Requiring Phrase Matches in Elasticsearch SimpleStringQuery

I'm creating a simple search engine using Elasticsearch 7.7 and the python elasticsearch_dsl package version 7.0.0. I'm using the simple_query_string search, because I'd like to enable most common search functionality (boolean operators, phrase search) without having to parse the query myself. This is largely working well except for the phrase match functionality.
I would like to ensure all results will include a phrase match if one is in the query. E.g. How google works - If I search for "green eggs" and ham, there will be no results that do not include "green eggs".
Let's assume I have 3 documents in my index:
{
"question":"I love my phrase",
"background: "dont you"
},
{
"question":"I love my phrase",
"background: "and other terms"
},
{
"question":"I have other terms",
"background: "and more"
}
What I am seeing now:
As expected, the below query only returns the first two documents, which have "my phrase" in one of the fields.
{
'simple_query_string':
{
'query': '"my phrase"',
'fields': ['question', 'background']
}
}
Contrary to what I expect, the below query will return all 3 results, with the 3rd one scored higher than the 1st.
{
'simple_query_string':
{
'query': '"my phrase" other terms',
'fields': ['question', 'background']
}
}
How can I alter my query so that searching for '"my phrase" other terms' will not return the 3rd document because it does not contain the phrase search, but score the 2nd document higher than the 1st because it contains additional search terms outside of the phrase?
Things I have tried that have not worked:
'query': '"my phrase" AND (other terms)'
'query': '"my phrase" AND other terms'
Thank you
Contrary to what I expect, the below query will return all 3 results
By default words in query combine with OR operator: see description for default_operator parameter in simple_query_string documentation. Your second query is interpreted as "my phrase" OR other OR terms, so it will return all 3 results: each document contains at least one of the terms "my phrase", other, terms.
How can I alter my query so that searching for '"my phrase" other terms' will not return the 3rd document because it does not contain the phrase search, but score the 2nd document higher than the 1st because it contains additional search terms outside of the phrase?
AFAIK, this isn't possible with simple_query_string search. You can try to use query_string search, which have feature named boolean operators. Using that feature you can write query which provide desired result:
{
"query": {
"query_string": {
"query": "+\"my phrase\" other terms",
"fields": ["question", "background"]
}
}
}

Fuzzy Matching Fails But Exact Match Passes

I've been constructing an ElasticSearch query using Fuzzy Matching to match a user in the system. When running it against a specific group of users (ones with my name), the query appears to work perfectly, but when running it against a random selection of users, it appears to fail.
For the purposes of my testing, I'm passing in the exact values of a specific user, so I would expect at least 1 match.
In narrowing this down, I found that an exact match against a name returns the data as expected, but putting the same value into a fuzzy block causes it to return 0 results.
For Instance, this query returns a user record as expected:
{
"from": 0,
"size": 1,
"query": {
"bool": {
"must": [
{
"match": {
"firstName": {
"query": "sVxGBCkPYZ",
"boost": 30
}
}
}
],
"should": [
]
}
},
"fields": [
"id",
"firstName"
]
}
However replacing the match element with the below fails to return any records:
{
"fuzzy": {
"firstName": {
"value": "sVxGBCkPYZ",
"fuzziness": 2,
"boost": 30,
"min_similarity": 0.3
}
}
}
Why would this be happening, and is there anything I can do to remedy the situation?
For reference. This is the ES version i'm currently using:
"version": {
"number": "1.7.1",
"build_hash": "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
"build_timestamp": "2015-07-29T09:54:16Z",
"build_snapshot": false,
"lucene_version": "4.10.4"
}
The match fails because fuzzy searches are term level queries meaning the query string would not be analysed while the data that got indexed, I assume, if of type text with standard analyzer, would be converted to svxgbckpyz in the inverted index.
You can instead, implement fuzziness with match query as below:
POST testindex/_search
{
"query":{
"match":{
"firstname":{
"query":"sVxGBCkPYZ",
"fuzziness":"AUTO"
}
}
}
}
You can change the value from AUTO to 2 or 3 depending on your use case.
The exact match you mentioned also works because query string would get analysed and converts the input string into lower case, which is available in inverted index.
As for how fuzzy query (that you've mentioned) works behind the scene, as per this LINK, is as follows:
The fuzzy query works by taking the original term and building a
Levenshtein automaton—like a big graph representing all the strings
that are within the specified edit distance of the original string.
The fuzzy query then uses the automaton to step efficiently through
all of the terms in the term dictionary to see if they match. Once it
has collected all of the matching terms that exist in the term
dictionary, it can compute the list of matching documents.
Of course, depending on the type of data stored in the index, a fuzzy
query with an edit distance of 2 can match a very large number of
terms and perform very badly.
Note this statement in particular, representing all the strings that are within the specified edit distance of the original string
For e.g. some of the words with distance of 1 for life would be aife, bife, cife, dife....lifz.
So in your case, fuzzy search's automaton would not be able to create term svxgbckpyz from input string sVxGBCkPYZ firstly because the distance between them is 7 (Remember distance is 1 between A and a) which I don't think AUTO option can create and even if you configure it to 7, it may not create the string as there would be huge list of words with distance 7
Adding one more LINK for more info. Hope it helps!

Is it possible to chain fquery filters in elastic search with exact matches?

I have been having trouble writing a method that will take in various search parameters in elasticsearch. I was working with queries that looked like this:
body:
{query:
{filtered:
{filter:
{and:
[
{term: {some_term: "foo"}},
{term: {is_visible: true}},
{term: {"term_two": "something"}}]
}
}
}
}
Using this syntax I thought I could chain these terms together and programatically generate these queries. I was using simple strings and if there was a term like "person_name" I could split the query into two and say "where person_name match 'JOHN'" and where person_name match 'SMITH'" getting accurate results.
However, I just came across the "fquery" upon asking this question:
Escaping slash in elasticsearch
I was not able to use this "and"/"term" filter searching a value with slashes in it, so I learned that I can use fquery to search for the full value, like this
"fquery": {
"query": {
"match": {
"by_line": "John Smith"
But how can I search like this for multiple items? IT seems that when i combine fquery and my filtered/filter/and/term queries, my "and" term queries are ignored. What is the best practice for making nested / chained queries using elastic search ?
As in the comment below, yes I can just add fquery to the "and" block like so
{:filtered=>
{:filter=>
{:and=>[
{:term=>{:is_visible=>true}},
{:term=>{:is_private=>false}},
{:fquery=>
{:query=>{:match=>{:sub_location=>"New JErsey"}}}}]}}}
Why would elasticsearch also return results with "sub_location" = "new York"? I would like to only return "new jersey" here.
A match query analyzes the input and by default it is a boolean OR query if there are multiple terms after the analysis. In your case, "New JErsey" gets analyzed into the terms "new" and "jersey". The match query that you are using will search for documents in which the indexed value of field "sub_location" is either "new" or "jersey". That is why your query also matches documents where the value of field "sub_location" is "new York" because of the common term "new".
To only match for "new jersey", you can use the following version of the match query:
{
"query": {
"match": {
"sub_location": {
"query": "New JErsey",
"operator": "and"
}
}
}
}
This will not match documents where the value of field "sub_location" is "New York". But, it will match documents where the value of field "sub_location" is say "York New" because the query finally translates into a boolean query like "York" AND "New". If you are fine with this behaviour, well and good, else read further.
All these issues arise because you are using the default analyzer for the field "sub_location" which breaks tokens at word boundaries and indexes them. If you really do not care about partial matches and want to always match the entire string, you can make use of custom analyzers to use Keyword Tokenizer and Lowercase Token Filter. Mind you, going ahead with this approach will need you to re-index all your documents again.

How to filter results based on frequency of repeating terms in an array in elasticsearch

I have an array field with a lot of keywords and i need to sort the documents on the basis on how many times a particular keyword repetation in those arrays.
For eg,if my field name is "nationality" and for document 1, it consists of the following
doc1
nationality :
["US","UK","Australia","India","US","US"]
and for doc2
nationality:
["US","UK","US","US","US","China"]
I want only those documents to be shown where the term "US" occurs more than 3 times. That would make only doc2 to be shown. How to do this?
You can use scripting for this to be implemented.
{
"query": {
"filtered": {
"filter": {
"script": {
"script": "_index['nationality']['US'].tf() > 3"
}
}
}
}
}
Here in this scripy the array "nationality" is checked for the term "US" and the count is taken by tf (term frequency). Now only the documents with term frequency greater than three are shown in the results. You can learn more about the filter operations here

Resources