Elasticsearch: how to match documents whose field tokens are a subset of the query tokens

I have a keyword/key-phrase field that I tokenize using the standard analyzer. I want this field to match if the search phrase contains all of the field's tokens.
For example, if the field value is "veni, vidi, vici" and the search phrase is "Ceaser veni,vidi,vici", I want the search phrase to match, but the search phrase "veni, vidi" should not match.
I also need "vidi, veni, vici" (weird!) to match, so the positions and ordering of the terms are not really important. A phrase match would not quite work for me, I think.
I can use a "bool query" with the "minimum_should_match" parameter for this specific example, but that is not really what I want, since minimum_should_match is about the ratio/number of tokens in the search phrase.

A pure ES solution would go like this. You will need two requests.
1) First, pass the user query through the _analyze API to get all the search tokens.
curl -XGET 'localhost:9200/_analyze' -d '
{
  "analyzer" : "standard",
  "text" : "Ceaser veni,vidi,vici"
}'
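The response will look roughly like this (abbreviated here to just the token text and position; the full _analyze output also includes offsets and token types):
{
  "tokens": [
    { "token": "ceaser", "position": 0 },
    { "token": "veni",   "position": 1 },
    { "token": "vidi",   "position": 2 },
    { "token": "vici",   "position": 3 }
  ]
}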
You will get 4 tokens: ceaser, veni, vidi, vici. You need to pass these tokens as an array to the next search request.
2) Second, search for documents whose tokens are a subset of the search tokens.
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "query": {
                "match": {
                  "title": "Ceaser veni,vidi,vici"
                }
              }
            },
            {
              "script": {
                "script": "return search_tokens.containsAll(doc['title'].values)",
                "params": {
                  "search_tokens": [
                    "ceaser",
                    "veni",
                    "vidi",
                    "vici"
                  ]
                }
              }
            }
          ]
        }
      }
    }
  }
}
The job of the first match query inside the filter is to narrow down the set of documents the script has to run on. The containsAll call checks whether the document's tokens are a sublist of the search tokens. This will be slow, but it will do the job with your current setup. One big improvement you can make is to store the tokens as an array at index time, so that doc['title'].values can be replaced with that field, which will speed up the script.
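For example (a rough sketch: title_tokens is a hypothetical field you would populate yourself at index time, e.g. from the same _analyze output, and it should be mapped as not_analyzed, or keyword on newer versions, so that doc values return the tokens unchanged):
{
  "title": "veni, vidi, vici",
  "title_tokens": ["veni", "vidi", "vici"]
}
Only the script line in the query above then changes:
"script": "return search_tokens.containsAll(doc['title_tokens'].values)"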
Hope this helps!

No built-in solution, but this works:
1) Add an extra field with the number of terms in the field for each document. So in your "veni, vidi, vici" example, you would have a field like "field_term_count": 3.
2) Perform a separate match search for each token in the search query.
3) Sum the number of searches that matched for each document with at least one match (e.g. a hashtable with the document ID as key and the count as value).
4) Compare the count from step 3 to the "field_term_count" field for each of the documents with matches. If they are equal, then the document is a match.
Then "Ceaser veni,vidi,vici" will match but the search phrase "veni, vidi" will not, as desired. It should be quite fast for reasonable numbers of matches. A sketch of step 2 follows below.

Related

Elasticsearch: What is the difference between a match and a term in a filter?

I was following an ES tutorial, and at some point I wrote a query using term in the filter instead of the recommended solution using match. My understanding is that match is used in the query part to get scoring, while term is used in the filter part to just remove hits before entering the query part. To my surprise, match also works in the filter part.
What is the difference between:
GET blogs/_search
{
  "query": {
    "bool": {
      "filter": {
        "match": {
          "category.keyword": "News"
        }
      }
    }
  }
}
and:
GET blogs/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "category.keyword": "News"
        }
      }
    }
  }
}
Both return the same hits, and the score is 0 for all hits.
What is the behaviour of match in a filter clause? I would expect it to yield some score, but it does not.
What I thought:
term: does not analyze either the parameter or the field; it is a yes/no scenario.
match: analyzes the parameter and the field, and calculates a score for how well they match.
But when using match against a keyword in the filter part of the query, how does it behave?
The match query is a high-level query that resorts to using a term query if it needs to.
Scoring has nothing to do with using match instead of term. Scoring kicks in when you use bool/must/should instead of bool/filter.
Here is how the match query works:
First, it checks the type of the field.
If it's a text field then the value will be analyzed, either with the analyzer specified in the query (if any), or with the search- or index-time analyzer specified in the mapping.
If it's a keyword field (like in your case), then the input is not analyzed and is taken "as is".
Since you're using the match query on a keyword field and your input is a single term, nothing is analyzed and the match query resorts to using a term query underneath. This is why you're seeing the same results.
In general, it's always best to use a match query as it is smart enough to know what to do given the field you're querying and the input data you're searching for.
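To see a case where the two do behave differently, query the analyzed text field instead of its keyword sub-field (this assumes the default dynamic mapping, where category is a text field with a category.keyword sub-field). The term query below returns nothing, because the indexed token is lowercased to news while the un-analyzed input stays News; the match query analyzes the input the same way the field was analyzed and still matches:
GET blogs/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "category": "News"
        }
      }
    }
  }
}

GET blogs/_search
{
  "query": {
    "bool": {
      "filter": {
        "match": {
          "category": "News"
        }
      }
    }
  }
}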
You can read more about the difference between the two here.

Less restrictive search doesn't return any hits in ElasticSearch

The query below returns hits, for example where name is "Balances by bank":
GET /_search
{
  "query": {
    "multi_match": {
      "query": "Balances",
      "fields": ["name", "descrip", "notes"]
    }
  }
}
So why does this one not return anything? Note that the query is less restrictive; the word is "Balance" and not "Balances" with an s.
GET /_search
{
  "query": {
    "multi_match": {
      "query": "Balance",
      "fields": ["name", "descrip", "notes"]
    }
  }
}
What search would return both?
You need to change your mapping to be able to do that.
If you didn't specify a mapping with specific analyzers when creating your index, Elasticsearch uses the default dynamic mapping and the default analyzer.
The default mapping maps each text field as both text and keyword, so you can perform full-text search (match part of the string) and keyword search (match the whole string), but it uses the standard analyzer.
With the standard analyzer, your example Balances by bank becomes the following list of tokens: [balances, by, bank]. Those terms are added to the inverted index, and Elasticsearch can find the documents when you search for any of them.
When you search for just Balance, it is analyzed to the term balance, which does not exist in the inverted index, so Elasticsearch returns nothing.
To be able to match both Balance and Balances, you need to change your mapping and use the analyzer for the English language. This analyzer reduces your terms to their stem and will match Balance and Balances, as well as Balancing, Balanced, Balancer, etc.
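A minimal mapping sketch (assuming a recent Elasticsearch version without mapping types; the index name my_index is a placeholder and the field names are taken from your query):
PUT /my_index
{
  "mappings": {
    "properties": {
      "name":    { "type": "text", "analyzer": "english" },
      "descrip": { "type": "text", "analyzer": "english" },
      "notes":   { "type": "text", "analyzer": "english" }
    }
  }
}
After re-indexing your documents into this mapping, Balance and Balances are reduced to the same stem at both index and search time, so either query returns the document.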
Look at this part of the documentation to see how the analysis process works.
And of course, you can also search for Balance* and it will return both Balance and Balances, but that is a different kind of query.

Fuzzy Matching Fails But Exact Match Passes

I've been constructing an ElasticSearch query using Fuzzy Matching to match a user in the system. When running it against a specific group of users (ones with my name), the query appears to work perfectly, but when running it against a random selection of users, it appears to fail.
For the purposes of my testing, I'm passing in the exact values of a specific user, so I would expect at least 1 match.
In narrowing this down, I found that an exact match against a name returns the data as expected, but putting the same value into a fuzzy block causes it to return 0 results.
For instance, this query returns a user record as expected:
{
  "from": 0,
  "size": 1,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "firstName": {
              "query": "sVxGBCkPYZ",
              "boost": 30
            }
          }
        }
      ],
      "should": []
    }
  },
  "fields": [
    "id",
    "firstName"
  ]
}
However, replacing the match element with the below fails to return any records:
{
  "fuzzy": {
    "firstName": {
      "value": "sVxGBCkPYZ",
      "fuzziness": 2,
      "boost": 30,
      "min_similarity": 0.3
    }
  }
}
Why would this be happening, and is there anything I can do to remedy the situation?
For reference, this is the ES version I'm currently using:
"version": {
"number": "1.7.1",
"build_hash": "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
"build_timestamp": "2015-07-29T09:54:16Z",
"build_snapshot": false,
"lucene_version": "4.10.4"
}
The match fails because fuzzy searches are term-level queries, meaning the query string is not analysed, while the data that got indexed (assuming the field is of type text with the standard analyzer) would have been converted to svxgbckpyz in the inverted index.
You can instead implement fuzziness with a match query, as below:
POST testindex/_search
{
  "query": {
    "match": {
      "firstname": {
        "query": "sVxGBCkPYZ",
        "fuzziness": "AUTO"
      }
    }
  }
}
You can change the value from AUTO to 2 or 3 depending on your use case.
The exact match you mentioned also works because the query string gets analysed, which converts the input string into lower case, and that lower-cased term is available in the inverted index.
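You can check what the standard analyzer does to the input with the _analyze API (shown here in the current JSON syntax; on a 1.x cluster like yours you can pass analyzer and text as URL parameters instead). It should return the single token svxgbckpyz:
GET /_analyze
{
  "analyzer": "standard",
  "text": "sVxGBCkPYZ"
}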
As for how the fuzzy query (that you've mentioned) works behind the scenes, as per this LINK, it is as follows:
The fuzzy query works by taking the original term and building a Levenshtein automaton, like a big graph representing all the strings that are within the specified edit distance of the original string. The fuzzy query then uses the automaton to step efficiently through all of the terms in the term dictionary to see if they match. Once it has collected all of the matching terms that exist in the term dictionary, it can compute the list of matching documents.
Of course, depending on the type of data stored in the index, a fuzzy query with an edit distance of 2 can match a very large number of terms and perform very badly.
Note this statement in particular: "representing all the strings that are within the specified edit distance of the original string".
For example, some of the words within an edit distance of 1 from life would be aife, bife, cife, dife, ..., lifz.
So in your case, the fuzzy query's automaton cannot reach the indexed term svxgbckpyz from the input string sVxGBCkPYZ, because the edit distance between them is 7 (remember, the distance between A and a is 1). The AUTO option never allows a distance that large, and even if you could configure a fuzziness of 7, it would not be practical, since there would be a huge number of terms within distance 7.
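To convince yourself that casing is the culprit, you could run the same fuzzy query with the already-lowercased value; assuming the field was indexed with the standard analyzer, this should match, since the edit distance to the stored term is 0:
{
  "fuzzy": {
    "firstName": {
      "value": "svxgbckpyz",
      "fuzziness": 2,
      "boost": 30
    }
  }
}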
Adding one more LINK for more info. Hope it helps!

What is the difference between must and filter in Query DSL in elasticsearch?

I am new to Elasticsearch and I am confused about the difference between must and filter. I want to perform an AND operation between my terms, so I did this:
POST /xyz/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "city": "city1"
          }
        },
        {
          "term": {
            "saleType": "sale_type1"
          }
        }
      ]
    }
  }
}
which gave me the required results, matching both terms. And on using filter like this:
POST /xyz/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "city": "city1"
          }
        }
      ],
      "filter": {
        "term": {
          "saleType": "sale_type1"
        }
      }
    }
  }
}
I get the same result, so when should I use must and when should I use filter? What is the difference?
must contributes to the score. In filter, the score of the query is ignored.
In both must and filter, the clause (query) must appear in matching documents. This is the reason you get the same results.
You may check this link
Score
The relevance score of each document is represented by a positive floating-point number called the _score. The higher the _score, the more relevant the document.
A query clause generates a _score for each document.
To know how the score is calculated, refer to this link.
must returns a score for every matching document. This score helps you rank the matching documents and compare the relative relevance between documents (using the magnitude of each document's score).
With this, one can say how many times more relevant Doc 1 is than Doc 2, or that Docs 1 to 7 are of much higher relevance than Doc 8 onwards.
For how the relative score is determined, you can refer to the references below.
Briefly, it is related to the number of term occurrences in the document, the document length, and the average number of term occurrences in your index.
filter doesn't return a score. All one can say is that all matching documents are relevant, but it won't help in evaluating whether one is more relevant than another. You can think of filter as a must with only two possible scores, zero or non-zero, where all zero-scored documents are dropped.
filter is helpful if you just want to whitelist/blacklist documents, e.g. all documents belonging to the topic "pets".
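The two are commonly combined in one bool query: clauses that should influence ranking go in must, and hard yes/no restrictions go in filter. A sketch using the fields from your example (whether city is better served by match or term depends on your mapping):
POST /xyz/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "city": "city1" } }
      ],
      "filter": [
        { "term": { "saleType": "sale_type1" } }
      ]
    }
  }
}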
In summary, there are 3 points that will help you in deciding when to use what:
must is your only choice when comparing/ranking documents by relevance
filter excludes all documents that don't match
filter is a lot faster because Elasticsearch doesn't need to compute the relative score
References:
Query vs Filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html
Computation of Relevance: https://www.infoq.com/articles/similarity-scoring-elasticsearch/

Is it possible to chain fquery filters in elastic search with exact matches?

I have been having trouble writing a method that will take in various search parameters in Elasticsearch. I was working with queries that looked like this:
body:
{query:
  {filtered:
    {filter:
      {and:
        [
          {term: {some_term: "foo"}},
          {term: {is_visible: true}},
          {term: {"term_two": "something"}}
        ]
      }
    }
  }
}
Using this syntax I thought I could chain these terms together and programmatically generate these queries. I was using simple strings, and if there was a term like "person_name" I could split the query into two and say "where person_name matches 'JOHN'" and "where person_name matches 'SMITH'", getting accurate results.
However, I just came across "fquery" upon asking this question:
Escaping slash in elasticsearch
I was not able to use this "and"/"term" filter to search a value with slashes in it, so I learned that I can use fquery to search for the full value, like this:
"fquery": {
"query": {
"match": {
"by_line": "John Smith"
But how can I search like this for multiple items? It seems that when I combine fquery and my filtered/filter/and/term queries, my "and" term queries are ignored. What is the best practice for making nested/chained queries using Elasticsearch?
As in the comment below, yes, I can just add fquery to the "and" block like so:
{:filtered=>
  {:filter=>
    {:and=>[
      {:term=>{:is_visible=>true}},
      {:term=>{:is_private=>false}},
      {:fquery=>
        {:query=>{:match=>{:sub_location=>"New JErsey"}}}}]}}}
Why would Elasticsearch also return results with "sub_location" = "new York"? I would like to only return "new jersey" here.
A match query analyzes the input, and by default it is a boolean OR query if there are multiple terms after analysis. In your case, "New JErsey" gets analyzed into the terms "new" and "jersey". The match query that you are using will therefore match documents in which the indexed value of the field "sub_location" contains either "new" or "jersey". That is why your query also matches documents where the value of "sub_location" is "new York": they share the term "new".
To only match for "new jersey", you can use the following version of the match query:
{
  "query": {
    "match": {
      "sub_location": {
        "query": "New JErsey",
        "operator": "and"
      }
    }
  }
}
This will not match documents where the value of the field "sub_location" is "New York". But it will match documents where the value is, say, "York New", because the query ultimately translates into a boolean query like "york" AND "new". If you are fine with this behaviour, well and good; else read further.
All these issues arise because you are using the default analyzer for the field "sub_location", which breaks the text into tokens at word boundaries and indexes them. If you really do not care about partial matches and want to always match the entire string, you can use a custom analyzer built from the keyword tokenizer and the lowercase token filter. Mind you, going ahead with this approach will require you to re-index all your documents.
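A sketch of such a mapping (the index name and analyzer name are placeholders, and it is written in the current syntax without mapping types, so adapt it if you are on the older 1.x/2.x cluster that the filtered/fquery syntax suggests):
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "sub_location": {
        "type": "text",
        "analyzer": "lowercase_keyword"
      }
    }
  }
}
With this mapping, both the indexed value "New JErsey" and the query input are reduced to the single token "new jersey", so a match query only matches the whole phrase, case-insensitively.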
