Elasticsearch results are inaccurate

My current query is:
GET /index/_search
{
  "query": {
    "simple_query_string": {
      "query": "(\"cheaper+than,+therapy\")",
      "analyzer": "standard",
      "flags": "OR|AND",
      "fields": ["name"]
    }
  }
}
My main problem is that this still finds matches like "GOLF . . . CHEAPER THAN THERAPY". I don't want matches like that. I'd like to tolerate a typo and normalize the search query, but I don't want it to match longer strings. So the trademarks "GOLF . . . CHEAPER THAN THERAPY" and "RUNNING IS: CHEAPER THAN THERAPY" should not be in the result.
The results should only contain names that are almost identical to my search query.
I tried fuzziness and similar options, but it didn't help.
The name field is a text field.
I expect the following results:
CHEAPER THAN THERAPY
CHEAPER THAN, THERAPY
I don't expect the following results:
GOLF . . . CHEAPER THAN THERAPY
"CHEAPER THAN THERAPY" MOORENKO'S
SHOPPING IS CHEAPER THAN THERAPY!
RUNNING IS: CHEAPER THAN THERAPY
CHEAPER THAN THERAPY AND WAY MORE FUN!
What do I have to do to get more accurate results?

You can use a fuzzy query on a keyword field.
The standard analyzer is the default analyzer, used if none is specified. It provides grammar-based tokenization; basically it breaks the text into a number of tokens.
So when you use simple_query_string, it just checks whether any document contains the tokens ["cheaper", "than", "therapy"].
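You can verify this with the _analyze API:
POST _analyze
{
  "analyzer": "standard",
  "text": "CHEAPER THAN THERAPY"
}
which returns the tokens cheaper, than and therapy.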
You can instead run a fuzzy query on text.keyword, which matches against the whole string:
{
  "query": {
    "fuzzy": {
      "text.keyword": {
        "value": "CHEAPER THAN THERAPY",
        "fuzziness": "AUTO"
      }
    }
  }
}
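Note that this assumes the field has a keyword sub-field in its mapping (for the question's name field the equivalent would be name.keyword). A minimal sketch of such a mapping, with illustrative index and field names:
PUT index129
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      }
    }
  }
}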
Result
[
  {
    "_index" : "index129",
    "_type" : "_doc",
    "_id" : "pnXJM3oBX7bKb5rQ30Vb",
    "_score" : 1.6739764,
    "_source" : {
      "text" : "CHEAPER THAN THERAPY"
    }
  },
  {
    "_index" : "index129",
    "_type" : "_doc",
    "_id" : "p3XJM3oBX7bKb5rQ60UT",
    "_score" : 1.5902774,
    "_source" : {
      "text" : "CHEAPER THAN, THERAPY"
    }
  }
]
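This works because "fuzziness": "AUTO" allows at most 2 edits for values longer than 5 characters. "CHEAPER THAN, THERAPY" is 1 edit (an inserted comma) away from the whole-string keyword value, so it matches, while strings like "GOLF . . . CHEAPER THAN THERAPY" or "SHOPPING IS CHEAPER THAN THERAPY!" are far more than 2 edits away and are excluded.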

How to build an Elasticsearch query that will take into account the distance between words?

I'm running with elasticsearch:7.6.2
I have an index with 4 simple documents:
PUT demo_idx/_doc/1
{
  "content": "Distributed nature, simple REST APIs, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, the end"
}
PUT demo_idx/_doc/2
{
  "content": "Distributed tmp nature, simple REST APIs, speed, and scalability"
}
PUT demo_idx/_doc/3
{
  "content": "Distributed nature, simple REST APIs, speed, and scalability"
}
PUT demo_idx/_doc/4
{
  "content": "Distributed tmp tmp nature"
}
I want to search for the text distributed nature and get the results in the following order:
Doc id: 3
Doc id: 1
Doc id: 2
Doc id: 4
i.e. documents with an exact match (docs 3 and 1) should be displayed before documents matching with a small slop (doc 2), and documents that only match with a big slop should be displayed last (doc 4).
I read this post:
How to build an Elasticsearch query that will take into account the distance between words and the exactitude of the word, but it didn't help me.
I have tried the following search query:
"query": {
"bool": {
"must":
[{
"match_phrase": {
"content": {
"query": query,
"slop": 2
}
}
}]
}
}
But it didn't give me the required results.
I got the following results:
Doc id: 3, Score: 0.22949813
Doc id: 4, Score: 0.15556586
Doc id: 1, Score: 0.15401536
Doc id: 2, Score: 0.14397088
How can I write the query to get the results I want?
You can rank the documents that match "Distributed nature" exactly first by using two bool should clauses. The first clause boosts the score of documents that match "Distributed nature" exactly, without any slop, while the second clause still matches documents within a slop of 2.
POST demo_idx/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "content": {
              "query": "Distributed nature"
            }
          }
        },
        {
          "match_phrase": {
            "content": {
              "query": "Distributed nature",
              "slop": 2
            }
          }
        }
      ]
    }
  }
}
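A document that contains the exact phrase satisfies both should clauses, so the scores of the two clauses add up; a document that only matches with slop is scored by the second clause alone. That is what lifts docs 3 and 1 above docs 4 and 2.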
Search Response will be:
"hits" : [
{
"_index" : "demo_idx",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.45899627,
"_source" : {
"content" : "Distributed nature, simple REST APIs, speed, and scalability"
}
},
{
"_index" : "demo_idx",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.30803072,
"_source" : {
"content" : "Distributed nature, simple REST APIs, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, the end"
}
},
{
"_index" : "demo_idx",
"_type" : "_doc",
"_id" : "4",
"_score" : 0.15556586,
"_source" : {
"content" : "Distributed tmp tmp nature"
}
},
{
"_index" : "demo_idx",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.14397088,
"_source" : {
"content" : "Distributed tmp nature, simple REST APIs, speed, and scalability"
}
}
]
Update 1:
To remove the impact of field length on scoring, disable the "norms" param for the content field using the update mapping API:
PUT demo_idx/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "norms": false
    }
  }
}
After this, reindex the documents, because norms are not removed from already-indexed documents instantly; they only disappear as segments merge.
Now run the search query again; the response will be in the order you expect.
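If you prefer not to wait for segment merges, another option is to create a fresh index with norms disabled and copy the data over with the Reindex API; a sketch, where the index name demo_idx_v2 is illustrative:
PUT demo_idx_v2
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "norms": false
      }
    }
  }
}
POST _reindex
{
  "source": { "index": "demo_idx" },
  "dest": { "index": "demo_idx_v2" }
}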

Change compound token default behaviour in lucene/elasticsearch

Lucene/Elasticsearch provide the possibility of compound tokens / subtokens. This is an important feature for languages with compound words, e.g. German. The default behaviour of Lucene is to combine the subtokens with an OR, so as not to hurt recall by excluding documents from being returned. In specific situations, however, the opposite is required.
Assume that I want to index the following two documents:
Document 1:
PUT /idxwith/_doc/1
{
  "name": "stockfisch"
}
Document 2:
PUT /idxwith/_doc/2
{
  "name": "laufstock"
}
Where the words will be decomposed as follows:
stockfisch ==> stock, fisch
laufstock ==> lauf, stock
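For reference, this kind of decomposition typically comes from a decompounder token filter; a minimal sketch with an illustrative word list and analyzer name:
PUT /idxwith
{
  "settings": {
    "analysis": {
      "filter": {
        "german_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["stock", "fisch", "lauf"]
        }
      },
      "analyzer": {
        "decompound_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "german_decompounder"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "decompound_analyzer"
      }
    }
  }
}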
Now with the following search query:
POST /idxwith/_search
{
  "query": {
    "match": {
      "name": {
        "query": "stockfisch"
      }
    }
  }
}
I'd expect only the first document to be returned - which is not the case. As the subtokens are combined with OR, both documents will be returned (hurting the precision of my search):
"hits" : [
{
"_index" : "idxwith",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.3287766,
"_source" : {
"name" : "stockfisch"
}
},
{
"_index" : "idxwith",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.241631,
"_source" : {
"name" : "laufstock"
}
}
]
I'm looking for hints on how to adapt lucene (or elastic) to make this behaviour configurable, i.e. to be able to define that subtokens are combined with an AND if necessary.
Thanks!
To solve this problem you can use a match_phrase query like this:
POST /idxwith/_search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "stockfisch"
      }
    }
  }
}
A phrase query matches terms up to a configurable slop (which defaults to 0) in any order. Transposed terms have a slop of 2. For more info about match_phrase, check here.
It is also possible to use the operator parameter in the match query, which requires all terms to be present in the field (see the sketch below); more info here.
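A sketch of that operator variant against the same index:
POST /idxwith/_search
{
  "query": {
    "match": {
      "name": {
        "query": "stockfisch",
        "operator": "and"
      }
    }
  }
}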
In your specific case I think match_phrase is the better option, since the order of the terms is important.

URI Search fails with parse_exception when query string contains slash

I get an error message when including a slash in the query string.
The query looks as below:
"query_string": {
"query": "usr0\\/7\\/0\\/20",
"default_field": "logmsg"
"analyzer": "keyword"
}
My document looks as below:
{
  "_index" : "logstash-log-2016.11.03",
  "_type" : "log",
  "_id" : "AVgpFuqyvHnB4OYqM9QE",
  "_score" : 2.2499034,
  "_source" : {
    "message" : "#<SNMP::SNMP_Trap:0x5383e289 #request_id=63766, #error_index=0, #error_status=0, #value=#<SNMP::TimeTicks:0x3cbfc0fd #value=2033549672>>,blablabla>",
    "#timestamp" : "2016-11-03T07:28:37.177Z",
    "type" : "usrinfo",
    "logmsg" : "DISMAN-EVENT-MIB::sysUpTimeInstance:235 days, 08:44:56.72,SNMPv2-MIB::snmpTrapOID:IF-MIB::linkUp,IF-MIB::ifIndex.132:132,IF-MIB::ifDescr.132:usr0/7/0/20,IF-MIB::ifType.132:6,CISCO-IF-EXTENSION-MIB::cieIfStateChangeReason.132:up",
    "error_status" : "0"
  }
}
I want to get documents whose logmsg contains the keyword "usr0/7/0/20", but I get no hits back.
This occurs with ES 2.3.5.
The backslash escapes the forward slash, but you also need to escape the backslash itself, like this:
{
  "query": {
    "query_string": {
      "query": "user0\\/0\\/0\\/2",
      "default_field": "name"
    }
  }
}
However, this will not work if your goal is to search for the token user0/0/0/2 in your message field. You either need to use a term query or add "analyzer": "keyword" to your query_string query (a sketch follows below), otherwise user0/0/0/2 will get tokenized to user0, 0 and 2.
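Adapted to the question's field and value, the keyword-analyzer variant would look like this; whether it matches depends on how logmsg was analyzed at index time, since with the standard analyzer the combined token usr0/7/0/20 never exists in the index:
{
  "query": {
    "query_string": {
      "query": "usr0\\/7\\/0\\/20",
      "default_field": "logmsg",
      "analyzer": "keyword"
    }
  }
}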

Elasticsearch: more_like_this query returns no hits

I am trying to find documents similar to one document in Elasticsearch (the document with id '4' in this case) in my sandbox, based on a field (the 'town' field in this case).
So I wrote this query, which returns no hits:
GET _search
{
  "query": {
    "more_like_this" : {
      "fields" : ["town"],
      "docs" : [
        {
          "_index" : "app",
          "_type" : "house",
          "_id" : "4"
        }
      ],
      "min_term_freq" : 1,
      "max_query_terms" : 12
    }
  }
}
In my dataset, document #4 is located in a town named 'Paris'. So when I run the following query, document #4 is in the hits along with a lot of other results:
GET _search
{
  "query": {
    "match": { "town": "Paris" }
  }
}
I don't understand why the more_like_this query does not return results even though other documents have the same value in that field.
I verified the _index, _type and _id parameters using a match_all query.
My query looks like the second example in this official Elasticsearch resource: http://www.elastic.co/guide/en/elasticsearch/reference/1.5/query-dsl-mlt-query.html
What's wrong with my more_like_this query?
I am assuming you have only a small number of documents. more_like_this skips terms that appear in fewer documents than min_doc_freq, which defaults to 5, so in a small dataset every term can get filtered out. Set min_doc_freq to 1 and try again.
Also use POST for search:
POST _search
{
  "query": {
    "more_like_this" : {
      "fields" : ["town"],
      "docs" : [
        {
          "_index" : "app",
          "_type" : "house",
          "_id" : "4"
        }
      ],
      "min_term_freq" : 1,
      "max_query_terms" : 12,
      "min_doc_freq" : 1
    }
  }
}

ElasticSearch search query processing

I have been reading up on ElasticSearch and couldn't find an answer for how to do the following:
Say you have some records with "study" in the title, and a user uses the word "studying" instead of "study". How would you set up Elasticsearch to match this?
Thanks,
Alex
ps: Sorry if this is a duplicate. Wasn't sure what to search for!
You might be interested in this: http://www.elasticsearch.org/guide/reference/query-dsl/flt-query/
For example, I have indexed book titles, and with this query:
{
  "query": {
    "bool": {
      "must": [
        {
          "fuzzy": {
            "book": {
              "value": "ringing",
              "min_similarity": "0.3"
            }
          }
        }
      ]
    }
  }
}
I got
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [
      {
        "_index" : "library",
        "_type" : "book",
        "_id" : "3",
        "_score" : 0.19178301,
        "_source" : {
          "book" : "The Lord of the Rings",
          "author" : "J R R Tolkein"
        }
      }
    ]
  }
}
which is the only correct result.
You could apply stemming to your documents, so that when you index studying, underneath you are actually indexing study. The same happens at query time: when you search for studying, you'll actually be searching for study, and you'll find a match whether you look for study or studying.
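You can see the effect with the _analyze API; with the English snowball stemmer, both terms reduce to the same stem, studi:
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "snowball"],
  "text": "study studying"
}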
Stemming of course depends on the language, and there are different techniques; for English, snowball is fine. What happens is that you lose some information when you index data, since, as you can see, you can no longer distinguish between studying and study. If you want to keep that distinction, you could index the same text in different ways using a multi-field and apply different text analysis to each version. That way you could search on multiple fields, both the non-stemmed and the stemmed version, maybe giving different weights to them; a sketch follows below.
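A minimal sketch of that setup, with illustrative index, field and analyzer names (shown with the modern fields syntax rather than the legacy multi_field type):
PUT library
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_snowball": {
          "tokenizer": "standard",
          "filter": ["lowercase", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "book": {
        "type": "text",
        "fields": {
          "stemmed": {
            "type": "text",
            "analyzer": "english_snowball"
          }
        }
      }
    }
  }
}
POST library/_search
{
  "query": {
    "multi_match": {
      "query": "studying",
      "fields": ["book^2", "book.stemmed"]
    }
  }
}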
