Change compound token default behaviour in Lucene/Elasticsearch

Lucene/Elasticsearch provide the possibility of compound tokens / subtokens. This is an important feature for languages such as German with its compound words. The default behaviour of Lucene is to combine the subtokens with an OR, so as not to hurt recall by excluding documents from being returned. In specific situations, however, the opposite is required.
Assume that I want to index the following two documents:
Document 1:
PUT /idxwith/_doc/1
{
"name": "stockfisch"
}
Document 2:
PUT /idxwith/_doc/2
{
"name" : "laufstock"
}
Where the words will be decomposed as follows:
stockfisch ==> stock, fisch
laufstock ==> lauf, stock
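For context, a minimal index setup that would produce such a decomposition could look like the following (the analyzer name and the dictionary word list are assumptions for illustration, not part of the original setup):
PUT /idxwith
{
  "settings": {
    "analysis": {
      "filter": {
        "german_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["stock", "fisch", "lauf"]
        }
      },
      "analyzer": {
        "decompound_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "german_decompounder"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "decompound_analyzer"
      }
    }
  }
}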
Now with the following search query:
POST /idxwith/_search
{
"query": {
"match": {
"name": {
"query": "stockfisch"
}
}
}
}
I'd expect only the first document to be returned - which is not the case. As the subtokens are combined with OR, both documents will be returned (hurting the precision of my search):
"hits" : [
{
"_index" : "idxwith",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.3287766,
"_source" : {
"name" : "stockfisch"
}
},
{
"_index" : "idxwith",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.241631,
"_source" : {
"name" : "laufstock"
}
}
]
I'm looking for hints on how to adapt Lucene (or Elasticsearch) to make this behaviour configurable, i.e. to be able to define that subtokens are combined with an AND if necessary.
Thanks!

To solve this problem you can use a match_phrase query like this:
POST /idxwith/_search
{
"query": {
"match_phrase": {
"name": {
"query": "stockfisch"
}
}
}
}
A phrase query matches terms up to a configurable slop (which defaults to 0) in any order. Transposed terms have a slop of 2. For more info about match_phrase check here.
It is also possible to use the operator parameter in a match query, which means that all terms have to be present in the field; more info here.
In your specific case I think match_phrase is a much better option, since the order of the terms is important.
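For completeness, a sketch of the operator variant mentioned above (index and field names are taken from the question):
POST /idxwith/_search
{
  "query": {
    "match": {
      "name": {
        "query": "stockfisch",
        "operator": "and"
      }
    }
  }
}
Whether this behaves as an AND across the subtokens depends on how your decompounding analyzer positions them, so it is worth verifying against your own analyzer chain.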

How to build an Elasticsearch query that will take into account the distance between words?

I'm running with elasticsearch:7.6.2
I have an index with 4 simple documents:
PUT demo_idx/_doc/1
{
"content": "Distributed nature, simple REST APIs, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, the end"
}
PUT demo_idx/_doc/2
{
"content": "Distributed tmp nature, simple REST APIs, speed, and scalability"
}
PUT demo_idx/_doc/3
{
"content": "Distributed nature, simple REST APIs, speed, and scalability"
}
PUT demo_idx/_doc/4
{
"content": "Distributed tmp tmp nature"
}
I want to search for the text distributed nature and get the following result order:
Doc id: 3
Doc id: 1
Doc id: 2
Doc id: 4
i.e. documents with an exact match (doc 3 & doc 1) should be displayed before documents with a small slop (doc 2), and documents that only match with a big slop (doc 4) should be displayed last.
I read this post:
How to build an Elasticsearch query that will take into account the distance between words and the exactitude of the word, but it didn't help me.
I have tried the following search query:
"query": {
"bool": {
"must":
[{
"match_phrase": {
"content": {
"query": query,
"slop": 2
}
}
}]
}
}
But it didn't give me the required results.
I got the following results:
Doc id: 3 ,Score: 0.22949813
Doc id: 4 ,Score: 0.15556586
Doc id: 1 ,Score: 0.15401536
Doc id: 2 ,Score: 0.14397088
How can I write the query in order to get the results I want?
You can rank the documents that match "Distributed nature" exactly higher by using a bool query with two should clauses. The first clause will boost the score of those documents that match "Distributed nature" exactly, without any slop.
POST demo_idx/_search
{
"query": {
"bool": {
"should": [
{
"match_phrase": {
"content": {
"query": "Distributed nature"
}
}
},
{
"match_phrase": {
"content": {
"query": "Distributed nature",
"slop": 2
}
}
}
]
}
}
}
Search Response will be:
"hits" : [
{
"_index" : "demo_idx",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.45899627,
"_source" : {
"content" : "Distributed nature, simple REST APIs, speed, and scalability"
}
},
{
"_index" : "demo_idx",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.30803072,
"_source" : {
"content" : "Distributed nature, simple REST APIs, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, the end"
}
},
{
"_index" : "demo_idx",
"_type" : "_doc",
"_id" : "4",
"_score" : 0.15556586,
"_source" : {
"content" : "Distributed tmp tmp nature"
}
},
{
"_index" : "demo_idx",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.14397088,
"_source" : {
"content" : "Distributed tmp nature, simple REST APIs, speed, and scalability"
}
}
]
Update 1:
In order to avoid the impact of the field length on the search query scoring, you need to disable the "norms" parameter for the "content" field, using the update mapping API:
PUT demo_idx/_mapping
{
"properties": {
"content": {
"type": "text",
"norms": "false"
}
}
}
After this, reindex the documents, since norms are not removed instantly from documents that were already indexed; they only disappear as old segments are merged.
Now hit the search query, the search response will be in the order you expect to get.
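Alternatively, if you would rather set this up front, you could create a fresh index with norms disabled from the start and copy the documents over with the Reindex API (the index name demo_idx_v2 is hypothetical):
PUT demo_idx_v2
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "norms": false
      }
    }
  }
}
POST _reindex
{
  "source": { "index": "demo_idx" },
  "dest": { "index": "demo_idx_v2" }
}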

ElasticSearch results are inaccurate

My current query is:
GET /index/_search
{
"query": {
"simple_query_string": {
"query": "(\"cheaper+than,+therapy\")",
"analyzer": "standard",
"flags": "OR|AND",
"fields": ["name"]
}
}
}
My main problem is that at the moment this still finds matches like "GOLF . . . CHEAPER THAN THERAPY". I don't want matches like this. I want to maybe fix some typos and normalize the search query, but I don't want to extend it. So in this result the TMs "GOLF . . . CHEAPER THAN THERAPY" and "RUNNING IS: CHEAPER THAN THERAPY" should not be a result.
So the result should only show entries which are almost the same as my search query.
I tried something with fuzziness and so on but it did not help me.
The field name is a text field.
I expect the following results:
CHEAPER THAN THERAPY
CHEAPER THAN, THERAPY
I don't expect the following results:
GOLF . . . CHEAPER THAN THERAPY
"CHEAPER THAN THERAPY" MOORENKO'S
SHOPPING IS CHEAPER THAN THERAPY!
RUNNING IS: CHEAPER THAN THERAPY
CHEAPER THAN THERAPY AND WAY MORE FUN!
What do I have to do to get more accurate results?
You can use a fuzzy query on the keyword field.
The standard analyzer is the default analyzer, which is used if none is specified. It provides grammar-based tokenization; basically, it breaks the text into a number of tokens.
So when you are using simple_query_string it is just checking whether any document has the tokens ["CHEAPER", "THAN", "THERAPY"] in it.
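To see exactly which tokens the standard analyzer produces for such a title, you can run the _analyze API (a quick check using one of the titles from the question):
POST _analyze
{
  "analyzer": "standard",
  "text": "GOLF . . . CHEAPER THAN THERAPY"
}
This returns the tokens golf, cheaper, than and therapy, which is why the simple_query_string query matches that document as well.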
You can use a fuzzy query on text.keyword, which will match against the whole string:
{
"query": {
"fuzzy": {
"text.keyword": {
"value": "CHEAPER THAN THERAPY",
"fuzziness": "AUTO"
}
}
}
}
Result
[
{
"_index" : "index129",
"_type" : "_doc",
"_id" : "pnXJM3oBX7bKb5rQ30Vb",
"_score" : 1.6739764,
"_source" : {
"text" : "CHEAPER THAN THERAPY"
}
},
{
"_index" : "index129",
"_type" : "_doc",
"_id" : "p3XJM3oBX7bKb5rQ60UT",
"_score" : 1.5902774,
"_source" : {
"text" : "CHEAPER THAN, THERAPY"
}
}
]

In Elasticsearch, how to get the "_id" value of a document by providing unique content present in the document

I have a few documents ingested in Elasticsearch. A sample document is as below.
"_index": "author_index",
"_type": "_doc",
"_id": "cOPf2wrYBik0KF", --Automatically generated by Elastic search after ingestion
"_score": 0.13956004,
"_source": {
"author_data": {
"author": "xyz"
"author_id": "123" -- This is unique id for each document
"publish_year" : "2016"
}
}
Is there a way to get the auto-generated _id by sending author_id from High-level Rest Client?
I tried researching solutions, but all of them only fetch the document using the _id. I need the reverse operation.
Actual Output expected: cOPf2wrYBik0KF
The SearchHit provides access to basic information like the index, document ID and score of each search hit, so with the Search API you can do it this way in Java:
String index = hit.getIndex();
String id = hit.getId();
OR, with the high-level REST client, something like this:
SearchRequest searchRequest = new SearchRequest("author_index");
searchRequest.source(new SearchSourceBuilder()
        .query(QueryBuilders.termQuery("author_data.author_id", "123")));
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
for (SearchHit hit : searchResponse.getHits()) {
    String yourId = hit.getId();
}
SEE HERE: https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high-search.html#java-rest-high-search-response
You can use source filtering. You can turn off _source retrieval, as you are interested in just the _id. The _source parameter accepts one or more wildcard patterns to control which parts of the _source should be returned (https://www.elastic.co/guide/en/elasticsearch/reference/7.0/search-request-source-filtering.html):
GET /author_index/_search
{
"_source" : false,
"query" : {
"term" : { "author_data.author_id" : "123" }
}
}
Another approach will also give the _id for the search. Note that the stored_fields parameter is about fields that are explicitly marked as stored in the mapping, which is off by default and generally not recommended:
GET /author_index/_search
{
"stored_fields" : ["author_data.author_id", "_id"],
"query" : {
"term" : { "author_data.author_id" : "123" }
}
}
Output for both of the above queries:
"hits" : [
{
"_index" : "author_index",
"_type" : "_doc",
"_id" : "cOPf2wrYBik0KF",
"_score" : 6.4966354
}
]
More details here: https://www.elastic.co/guide/en/elasticsearch/reference/7.0/search-request-stored-fields.html

Get from ElasticSearch why a result is a hit

In the ElasticSearch below I search for the word Balances in two fields name and notes:
GET /_search
{ "query": {
"multi_match": { "query": "Balances",
"fields": ["name","notes"]
}
}
}
And the result in the name field:
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.673515,
"hits" : [
{
"_index" : "idx",
"_type" : "_doc",
"_id" : "25",
"_score" : 1.673515,
"_source" : {
"name" : "Deposits checking accounts balances",
"notes" : "These are the notes",
"#timestamp" : "2019-04-18T21:05:00.387Z",
"id" : 25,
"#version" : "1"
}
}
]
}
Now, I want to know in which field ElasticSearch found the value. I could evaluate the result and see if the searched text is in name or notes, but I cannot do that if it's a fuzzy search.
Can ElasticSearch tell me in which field the text was found, and in addition provide a snippet with 5 words to the left and to the right of the result to tell the user why the result is a hit?
What I want to achieve is similar to Google highlighting in bold the text that was found within a phrase.
I think the 2 solutions in Find out which fields matched in a multi match query are still the valid solutions (both sketched below):
Highlight to find it.
Split the query up into multiple named match queries.
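As a rough sketch of both options against the query from the question (the field names name and notes are taken from it), named match clauses report which clause matched via matched_queries, and highlighting returns the matching fragments per field:
GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "name":  { "query": "Balances", "_name": "name_match" } } },
        { "match": { "notes": { "query": "Balances", "_name": "notes_match" } } }
      ]
    }
  },
  "highlight": {
    "fields": {
      "name": {},
      "notes": {}
    }
  }
}
Each hit then carries a matched_queries array (e.g. ["name_match"]) and a highlight section with the matching snippets wrapped in <em> tags, which also works when the match is fuzzy.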

ElasticSearch search query processing

I have been reading up on Elasticsearch and couldn't find an answer for how to do the following:
Say you have some records with "study" in the title and a user uses the word "studying" instead of "study". How would you set up Elasticsearch to match this?
Thanks,
Alex
ps: Sorry, if this is a duplicate. Wasn't sure what to search for!
You might be interested in this: http://www.elasticsearch.org/guide/reference/query-dsl/flt-query/
For example, I have indexed book titles, and with this query:
{
"query": {
"bool": {
"must": [
{
"fuzzy": {
"book": {
"value": "ringing",
"min_similarity": "0.3"
}
}
}
]
}
}
}
I got
{
"took" : "1",
"timed_out" : "false",
"_shards" : {
"total" : "5",
"successful" : "5",
"failed" : "0"
},
"hits" : {
"total" : "1",
"max_score" : "0.19178301",
"hits" : [
{
"_index" : "library",
"_type" : "book",
"_id" : "3",
"_score" : "0.19178301",
"_source" : {
"book" : "The Lord of the Rings",
"author" : "J R R Tolkein"
}
}
]
}
}
which is the only correct result.
You could apply stemming to your documents, so that when you index studying, you are under the hood indexing study. And when you query you do the same, so that when you search for studying again, you'll actually be searching for study and you'll find a match, both when looking for study and for studying.
Stemming of course depends on the language and there are different techniques; for English, snowball is fine. What happens is that you lose some information when you index data, since as you can see you cannot really distinguish between studying and study anymore. If you want to keep that distinction you could index the same text in different ways using a multi_field and apply different text analysis to each version. That way you could search on multiple fields, both the non-stemmed and the stemmed version, maybe giving different weights to them.
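A rough sketch of that multi-field setup with current mapping syntax (the index, field and analyzer names are made up for illustration; the original answer predates this syntax):
PUT /library
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_stemmed": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "stemmed": {
            "type": "text",
            "analyzer": "english_stemmed"
          }
        }
      }
    }
  }
}
POST /library/_search
{
  "query": {
    "multi_match": {
      "query": "studying",
      "fields": ["title^2", "title.stemmed"]
    }
  }
}
Here both studying and study are reduced to the same stem on title.stemmed, so the query matches, while exact matches on the unstemmed title field get a higher weight.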
