ElasticSearch search query processing - elasticsearch

I have been reading up on ElasticSearch and couldn't find an answer for how to do the following:
Say you have some records with "study" in the title and a user searches with the word "studying" instead of "study". How would you set up ElasticSearch to match this?
Thanks,
Alex
ps: Sorry, if this is a duplicate. Wasn't sure what to search for!

You might be interested in this: http://www.elasticsearch.org/guide/reference/query-dsl/flt-query/
For example, I have indexed book titles, and with this query:
{
  "query": {
    "bool": {
      "must": [
        {
          "fuzzy": {
            "book": {
              "value": "ringing",
              "min_similarity": "0.3"
            }
          }
        }
      ]
    }
  }
}
I got
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [
      {
        "_index" : "library",
        "_type" : "book",
        "_id" : "3",
        "_score" : 0.19178301,
        "_source" : {
          "book" : "The Lord of the Rings",
          "author" : "J R R Tolkein"
        }
      }
    ]
  }
}
which is the only correct result.
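Note that min_similarity is the parameter name from older Elasticsearch versions; more recent releases use a fuzziness parameter in the fuzzy query instead, so a roughly equivalent query (a sketch, not tested against your index) would be:
{
  "query": {
    "fuzzy": {
      "book": {
        "value": "ringing",
        "fuzziness": "AUTO"
      }
    }
  }
}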

You could apply stemming to your documents, so that when you index studying you are actually indexing study underneath. When you query you do the same, so that when you search for studying you'll again be searching for study and you'll find a match, whether you look for study or studying.
Stemming depends of course on the language, and there are different techniques; for English, snowball is fine. What happens is that you lose some information when you index data, since as you can see you cannot really distinguish between studying and study anymore. If you want to keep that distinction you could index the same text in different ways using a multi_field and apply different text analysis to each version. That way you could search on multiple fields, both the non-stemmed and the stemmed version, maybe giving different weights to them.
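In current Elasticsearch versions the multi_field idea is expressed with the fields property of the mapping. A minimal sketch of it (index and field names are assumptions), indexing the title once with the standard analyzer and once with the built-in english analyzer, which stems, and then querying both variants with different weights:
PUT /library
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "stemmed": {
            "type": "text",
            "analyzer": "english"
          }
        }
      }
    }
  }
}

GET /library/_search
{
  "query": {
    "multi_match": {
      "query": "studying",
      "fields": ["title^2", "title.stemmed"]
    }
  }
}
With this, "studying" matches the stemmed sub-field of a title containing "study", while exact matches on the non-stemmed field score higher because of the boost.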

Related

ElasticSearch results are inaccurate

My current query is:
GET /index/_search
{
  "query": {
    "simple_query_string": {
      "query": "(\"cheaper+than,+therapy\")",
      "analyzer": "standard",
      "flags": "OR|AND",
      "fields": ["name"]
    }
  }
}
My main problem is that at the moment this still finds matches like "GOLF . . . CHEAPER THAN THERAPY". I don't want matches like this. I want to fix possible typos and normalize the search query, but I don't want to extend it. So in this result the TMs "GOLF . . . CHEAPER THAN THERAPY" and "RUNNING IS: CHEAPER THAN THERAPY" should not appear.
The results should only contain entries that are almost the same as my search query.
I tried something with fuzziness and so on, but it did not help.
The field name is a text field.
I expect the following results:
CHEAPER THAN THERAPY
CHEAPER THAN, THERAPY
I don't expect the following results:
GOLF . . . CHEAPER THAN THERAPY
"CHEAPER THAN THERAPY" MOORENKO'S
SHOPPING IS CHEAPER THAN THERAPY!
RUNNING IS: CHEAPER THAN THERAPY
CHEAPER THAN THERAPY AND WAY MORE FUN!
What do I have to do to get more accurate results?
You can use a fuzzy query on the keyword field.
The standard analyzer is the default analyzer, used if none is specified. It provides grammar-based tokenization; basically, it breaks text into a number of tokens.
So when you use simple_query_string it just checks whether any document contains the tokens ["CHEAPER", "THAN", "THERAPY"].
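To see those tokens you can run the text through the _analyze API (a quick check, independent of your index):
POST /_analyze
{
  "analyzer": "standard",
  "text": "GOLF . . . CHEAPER THAN THERAPY"
}
It returns the tokens golf, cheaper, than and therapy, which is why the longer titles also match the simple_query_string query.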
You can use a fuzzy query on text.keyword, which will match against the whole string:
{
  "query": {
    "fuzzy": {
      "text.keyword": {
        "value": "CHEAPER THAN THERAPY",
        "fuzziness": "AUTO"
      }
    }
  }
}
Result
[
  {
    "_index" : "index129",
    "_type" : "_doc",
    "_id" : "pnXJM3oBX7bKb5rQ30Vb",
    "_score" : 1.6739764,
    "_source" : {
      "text" : "CHEAPER THAN THERAPY"
    }
  },
  {
    "_index" : "index129",
    "_type" : "_doc",
    "_id" : "p3XJM3oBX7bKb5rQ60UT",
    "_score" : 1.5902774,
    "_source" : {
      "text" : "CHEAPER THAN, THERAPY"
    }
  }
]

Change compound token default behaviour in lucene/elasticsearch

Lucene/Elasticsearch provides the possibility of compound tokens / subtokens. This is an important feature, e.g. for German with its composed words. The default behaviour of Lucene is to combine the subtokens with an OR, in order not to hurt recall by excluding documents from being returned. In specific situations, however, the opposite is required.
Assume that I want to index the following two documents:
Document 1:
PUT /idxwith/_doc/1
{
  "name": "stockfisch"
}
Document 2:
PUT /idxwith/_doc/2
{
  "name": "laufstock"
}
Where the words will be decomposed as follows:
stockfisch ==> stock, fisch
laufstock ==> lauf, stock
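The question doesn't show the analyzer used, but a decompounding setup along these lines could be built with the dictionary_decompounder token filter (a sketch; the word list and analyzer name are assumptions):
PUT /idxwith
{
  "settings": {
    "analysis": {
      "filter": {
        "decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["stock", "fisch", "lauf"]
        }
      },
      "analyzer": {
        "german_compound": {
          "tokenizer": "standard",
          "filter": ["lowercase", "decompounder"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "german_compound"
      }
    }
  }
}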
Now with the following search query:
POST /idxwith/_search
{
  "query": {
    "match": {
      "name": {
        "query": "stockfisch"
      }
    }
  }
}
I'd expect only the first document to be returned, which is not the case. As the subtokens are combined with OR, both documents will be returned (hurting the precision of my search):
"hits" : [
  {
    "_index" : "idxwith",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 0.3287766,
    "_source" : {
      "name" : "stockfisch"
    }
  },
  {
    "_index" : "idxwith",
    "_type" : "_doc",
    "_id" : "2",
    "_score" : 0.241631,
    "_source" : {
      "name" : "laufstock"
    }
  }
]
I'm looking for hints on how to adapt lucene (or elastic) to make this behaviour configurable, i.e. to be able to define that subtokens are combined with an AND if necessary.
Thanks!
To solve this problem you can use a match_phrase query like this:
POST /idxwith/_search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "stockfisch"
      }
    }
  }
}
A phrase query matches terms up to a configurable slop (which defaults to 0) in any order. Transposed terms have a slop of 2. For more info about match_phrase check here.
It is also possible to use the operator parameter in the match query, which means all terms must be present; more info here.
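For reference, the operator syntax the answer refers to looks like this (a sketch against the same index; whether it is sufficient depends on how your analyzer positions the subtokens):
POST /idxwith/_search
{
  "query": {
    "match": {
      "name": {
        "query": "stockfisch",
        "operator": "and"
      }
    }
  }
}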
In your specific case I think match_phrase is the better option, since the order of the terms is important.

How to make Elastic Engine understand a field is not to be analyzed for an exact match?

The question is based on the previous post, where the exact search did not work with either Match or MatchPhrasePrefix.
Then I found a similar kind of post here where the search field is set to not_analyzed in the mapping definition (by @Russ Cam).
But I am using
<package id="Elasticsearch.Net" version="7.6.0" targetFramework="net461" />
<package id="NEST" version="7.6.0" targetFramework="net461" />
and maybe for that reason the solution did not work.
Because if I pass "SOME", it matches both "SOME" and "SOME OTHER LOAN", which should not be the case (as in my earlier post for "product value").
How can I do the same using NEST 7.6.0?
Well, I'm not aware of how your current mapping looks, and I don't know about NEST either, but I will explain
How to make Elastic Engine understand a field is not to be analyzed for an exact match?
with an example using the Elasticsearch query DSL.
For an exact (case-sensitive) match, all you need to do is define the field type as keyword. For a field of type keyword the data is indexed as-is, without applying any analyzer, which makes it perfect for exact matching.
PUT test
{
  "mappings": {
    "properties": {
      "field1": {
        "type": "keyword"
      }
    }
  }
}
Now let's index some docs:
POST test/_doc/1
{
  "field1": "SOME"
}

POST test/_doc/2
{
  "field1": "SOME OTHER LOAN"
}
For exact matching we can use a term query. Let's search for "SOME"; we should get document 1.
GET test/_search
{
  "query": {
    "term": {
      "field1": "SOME"
    }
  }
}
The output we get:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.6931472,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931472,
        "_source" : {
          "field1" : "SOME"
        }
      }
    ]
  }
}
So the crux of it: make the field type keyword and use a term query.
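If you cannot remap the existing text field, another common option (an assumption about your setup, since the real mapping isn't shown) is a keyword sub-field, queried with term:
PUT test2
{
  "mappings": {
    "properties": {
      "field1": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

GET test2/_search
{
  "query": {
    "term": {
      "field1.raw": "SOME"
    }
  }
}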

Get from ElasticSearch why a result is a hit

In the ElasticSearch query below I search for the word Balances in two fields, name and notes:
GET /_search
{
  "query": {
    "multi_match": {
      "query": "Balances",
      "fields": ["name", "notes"]
    }
  }
}
And the result in the name field:
"hits" : {
  "total" : {
    "value" : 1,
    "relation" : "eq"
  },
  "max_score" : 1.673515,
  "hits" : [
    {
      "_index" : "idx",
      "_type" : "_doc",
      "_id" : "25",
      "_score" : 1.673515,
      "_source" : {
        "name" : "Deposits checking accounts balances",
        "notes" : "These are the notes",
        "@timestamp" : "2019-04-18T21:05:00.387Z",
        "id" : 25,
        "@version" : "1"
      }
    }
  ]
}
Now, I want to know in which field ElasticSearch found the value. I could evaluate the result and see if the searched text is in name or notes, but I cannot do that if it's a fuzzy search.
Can ElasticSearch tell me in which field the text was found, and in addition provide a snippet with 5 words to the left and to the right of the result to tell the user why the result is a hit?
What I want to achieve is similar to Google highlighting in bold the text that was found within a phrase.
I think the 2 solutions in Find out which fields matched in a multi match query are still valid (both are sketched below):
Highlight to find it.
Split the query up into multiple named match queries.
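A minimal sketch of both, assuming the same name and notes fields as above: highlighting returns the matching fragments per field, and the _name markers make each hit report which clause matched via matched_queries.
GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "name":  { "query": "Balances", "_name": "name_match" } } },
        { "match": { "notes": { "query": "Balances", "_name": "notes_match" } } }
      ]
    }
  },
  "highlight": {
    "fields": {
      "name": {},
      "notes": {}
    }
  }
}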

dis_max query isn't looking for the best matching clause

I'm testing the dis_max query with the documents below:
PUT /blog/post/1
{
  "title": "Quick brown rabbits",
  "body": "Brown rabbits are commonly seen."
}

PUT /blog/post/2
{
  "title": "Keeping pets healthy",
  "body": "My quick brown fox eats rabbits on a regular basis."
}
This example is taken from the book "Elasticsearch: The Definitive Guide", which explains that the query below should return equal _score values for both documents.
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "Quick pets" }},
        { "match": { "body": "Quick pets" }}
      ]
    }
  }
}
But, as you can see, the result of the query shows different _score values:
{
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.02250402,
    "hits" : [
      {
        "_index" : "blog",
        "_type" : "post",
        "_id" : "2",
        "_score" : 0.02250402,
        "_source" : {
          "title" : "Keeping pets healthy",
          "body" : "My quick brown fox eats rabbits on a regular basis."
        }
      },
      {
        "_index" : "blog",
        "_type" : "post",
        "_id" : "1",
        "_score" : 0.016645055,
        "_source" : {
          "title" : "Quick brown rabbits",
          "body" : "Brown rabbits are commonly seen."
        }
      }
    ]
  }
}
Elasticsearch is not returning the _score from the best matching clause but is, somehow, blending the results. How can I fix it?
I've got the answer.
This confusing behavior happens because the index used in the example has 5 shards (the default number of shards). The _score is not calculated over the index as a whole but within each individual shard, and the per-shard results are then merged before the user gets the answer.
This is not an issue when you have a huge number of documents, which is not my case.
So, to test my thesis, I deleted my index:
DELETE /blog
And then created a new index using only 1 shard:
PUT /blog
{
  "settings": {
    "number_of_shards": 1
  }
}
So, I performed my query again and got both documents with the same _score: 0.12713557
Sweet =)
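As a side note, instead of re-creating the index you can also ask Elasticsearch to compute term statistics across all shards first by using the dfs_query_then_fetch search type (a sketch of the same query):
GET /blog/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "Quick pets" }},
        { "match": { "body": "Quick pets" }}
      ]
    }
  }
}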
