I get invalid search results every time with Elasticsearch. I ran a query with explain: true and checked the results. I was surprised that the 'messy' output entries have different values for the score and the explained score:
"_score": 0.32287252,
...
"_explanation": {
"value": 1.6143626,
"description": "product of:",
...
If those (messy) entries had the explained score's value in _score, the output would look perfect. Does anybody know how to fix this?
PS: I tried changing the number of shards from 5 to 1. Nothing changed; the output is still invalid.
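To see how widespread the mismatch is, each hit's _score can be compared with the top-level value of its _explanation. This is only a minimal sketch, assuming the response has already been parsed into a dict; the helper name mismatched_hits, the tolerance, and the _id in the sample are made up for illustration:

```python
import math

def mismatched_hits(response, rel_tol=1e-6):
    """Return (id, _score, explained value) for hits where the two disagree."""
    out = []
    for hit in response["hits"]["hits"]:
        explained = hit["_explanation"]["value"]
        if not math.isclose(hit["_score"], explained, rel_tol=rel_tol):
            out.append((hit.get("_id"), hit["_score"], explained))
    return out

# Sample built from the two numbers quoted above; the _id is invented.
sample = {"hits": {"hits": [
    {"_id": "example", "_score": 0.32287252,
     "_explanation": {"value": 1.6143626, "description": "product of:"}},
]}}

for doc_id, score, explained in mismatched_hits(sample):
    print(doc_id, score, explained, "ratio:", round(explained / score, 4))
```

For the two values quoted above, the explained score is almost exactly 5 times the _score. What that factor corresponds to cannot be told from the excerpt alone.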
I would like to boost scores of documents based on how "recent" a document is. I am trying to do this using a function_score. Here is an example of me doing this on a field called updated_at:
{
"function_score": {
"boost_mode": "sum",
"functions": [
{
"exp": {
"updated_at": {
"origin": "now",
"scale": "1h",
"decay": 0.01,
},
},
"weight": 1,
}
],
"query": query
},
}
I would expect documents whose updated_at is close to now to get a function score close to 1, and documents about one scale away from now to get a score close to decay (as described in the docs). Therefore, I'm using the boost_mode sum to keep the original document scores and increase them depending on how close the updated_at value is to now. (Also, the query score is useful, so I would rather add than multiply, which is the default.)
To test this scenario, I create a document (A) that returns a query score of about 2. I then duplicate it (B) and modify the new document's updated_at timestamp to be an hour in the past.
In this scenario, I would expect (A) to have a higher score and (B) to have a lower score. However, when I run this scenario, I get the exact opposite. (B) ends up with a score of 3 and (A) ends up with a score of 2.
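For reference, the expected numbers can be worked out from the documented exp decay formula, exp(λ · max(0, |value − origin| − offset)) with λ = ln(decay) / scale. The sketch below assumes a query score of exactly 2, no offset, and a single function with weight 1; the function and variable names are only illustrative:

```python
import math

def exp_decay(distance_s, scale_s, decay, offset_s=0.0):
    """Exponential decay: exp(lambda * max(0, distance - offset)), lambda = ln(decay) / scale."""
    lam = math.log(decay) / scale_s
    return math.exp(lam * max(0.0, distance_s - offset_s))

QUERY_SCORE = 2.0   # approximate query score from the scenario
SCALE_S = 3600      # "1h" expressed in seconds
DECAY = 0.01

# boost_mode "sum": final score = query score + function score (weight 1)
score_a = QUERY_SCORE + exp_decay(0, SCALE_S, DECAY)      # updated just now
score_b = QUERY_SCORE + exp_decay(3600, SCALE_S, DECAY)   # updated one hour ago
print(score_a, score_b)
```

Under these assumptions (A) should come out at about 3 and (B) at about 2.01. A score of 3 for (B) therefore suggests that Elasticsearch sees (B), not (A), as sitting at the origin.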
What am I misunderstanding here to cause this to happen? And how would I modify my function score to do what I would like?
This turned out to be a timezone issue.
I ended up using the explain API to look at what was contributing to the score. When doing that, I noticed that the origin set to now was actually in a different timezone to the one I was setting in the documents.
I fixed this by manually providing a UTC timestamp in the elasticsearch query rather than using now as the value.
(If there is a better way to do this, please let me know)
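A minimal sketch of that fix, assuming the query is built in Python and that updated_at uses a date format that accepts ISO 8601 timestamps with a Z suffix. The function name recency_function_score, the match_all placeholder, and the fixed date are invented for the example:

```python
from datetime import datetime, timezone

def recency_function_score(query, scale="1h", decay=0.01, now=None):
    """Build the function_score body with an explicit UTC origin instead of "now"."""
    now = now or datetime.now(timezone.utc)
    origin = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "function_score": {
            "boost_mode": "sum",
            "functions": [
                {"exp": {"updated_at": {"origin": origin, "scale": scale, "decay": decay}},
                 "weight": 1}
            ],
            "query": query,
        }
    }

# Fixed timestamp so the output is reproducible; in real use, omit `now`.
body = recency_function_score({"match_all": {}},
                              now=datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc))
print(body["function_score"]["functions"][0]["exp"]["updated_at"]["origin"])
```

This only helps if the updated_at values in the documents are also stored in UTC, or carry an explicit offset.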
search_after in Elasticsearch must match the sort parameters in count and order. So I was wondering how to get the score from the previous result (for example, page 1) to use as the search_after value for the next page.
I ran into an issue when using the score of the last document from the previous search. The score was 1.0, and since all documents have a score of 1.0, the result for the next page turned out to be empty.
That actually makes sense, since I am asking Elasticsearch for results that rank lower than 1.0, and there are none. So which score do I use to get the next page?
Note:
I am sorting by score and then by TieBreakerID, so one possible solution is to use a high value (say 1000) for the score.
What you're doing sounds like it should work, as explained by an Elastic team member. It works for me (in ES 7.7) even with tied scores when using the document ID (copied into another indexed field) as a tiebreaker. It's true that indexing additional documents while paginating will make your scores slightly unstable, but not likely enough to cause a significant problem for an end user. If you need it to be reliable for a batch job, the Scroll API is the better choice.
{
"query": {
...
},
"search_after": [
12.276552,
14173
],
"sort": [
{ "_score": "desc" },
{ "id": "asc" }
]
}
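In practice, the search_after values do not need to be assembled by hand: when a sort is specified, each hit in the response carries a sort array that can be passed back as-is. A minimal sketch, assuming the response is already a parsed dict; next_page_body, the match_all placeholder, and the sample values are invented:

```python
import copy

def next_page_body(body, response):
    """Return a copy of `body` whose search_after points past the last hit, or None."""
    hits = response["hits"]["hits"]
    if not hits:
        return None  # no more pages
    nxt = copy.deepcopy(body)
    # Each hit carries a "sort" array with the values for every sort clause,
    # in the same order as the "sort" section of the request.
    nxt["search_after"] = hits[-1]["sort"]
    return nxt

first_page = {
    "query": {"match_all": {}},   # placeholder query
    "size": 2,
    "sort": [{"_score": "desc"}, {"id": "asc"}],
}
# Hand-written response fragment with tied scores; the values are invented.
fake_response = {"hits": {"hits": [
    {"_id": "a", "_score": 1.0, "sort": [1.0, 14172]},
    {"_id": "b", "_score": 1.0, "sort": [1.0, 14173]},
]}}
second_page = next_page_body(first_page, fake_response)
print(second_page["search_after"])
```

With tied scores, the tiebreaker value is what moves the cursor forward, which is why the second sort field needs to be unique per document.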
I'm trying to get good search results out of an Elasticsearch installation.
But I run into problems when I try to run a fuzzy search on some very simple data.
Somehow documents matching multiple words (some of them only partially) are scored too low, and they only get scored higher when more letters of the word are present in the search query.
Let me explain:
I have a simple index built with two simple documents.
{
"name": "Product with good qualities and awesome sound system"
},
{
"name": "Another Product that has better acustics than the other one"
}
Now I query the index with these parameters:
{
"query": {
"multi_match": {
"fields": ["name"],
"query": "product acust",
"fuzziness": "auto"
}
}
}
And the results look like this:
"hits": [
{
"_index": "test_products",
"_type": "_doc",
"_id": "1",
"_score": 0.19100355,
"_source": {
"name": "Product with good qualities and awesome sound system"
}
},
{
"_index": "test_products",
"_type": "_doc",
"_id": "2",
"_score": 0.17439455,
"_source": {
"name": "Another Product that has better acustics than the other one"
}
}
]
As you can see, the product with the ID 2 is scored lower than the other product, even though it arguably has more similarity with the given query string because it has 1 full word match and 1 partial word match.
If the query looked like "product acusti", the results would start to behave correctly.
I've already fiddled around with a bool query, but the results are identical.
Any ideas how I can get the wanted results sooner, without having to type in almost the whole second word?
As far as I know, Elasticsearch does not do partial word matching by default, so the term acust is not matched in either of your documents.
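The fuzziness setting does not change this for acust. With "auto" and its default thresholds, terms of 3 to 5 characters may differ by one edit and longer terms by two, and the whole indexed term has to be within that distance. A rough sketch of the check, using plain Levenshtein distance as a simplification (Elasticsearch also counts a transposition as a single edit); the helper names are made up:

```python
def levenshtein(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def auto_max_edits(term):
    """Edits allowed by fuzziness AUTO with its default 3,6 thresholds."""
    n = len(term)
    return 0 if n < 3 else 1 if n < 6 else 2

for query_term in ("acust", "acusti"):
    dist = levenshtein(query_term, "acustics")
    print(query_term, "distance:", dist, "allowed:", auto_max_edits(query_term),
          "match:", dist <= auto_max_edits(query_term))
```

acust is three edits away from acustics but only one is allowed, while acusti is two edits away with two allowed. That would explain why the results change once the sixth letter is typed.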
The reason you are getting a higher score in the first document is that your matched term, product, appears in a shorter sentence:
Product with good qualities and awesome sound system
But as for the second document, product appears in a longer sentence:
Another Product that has better acustics than the other one
So your second document is getting a lower score because the ratio of your match term (product) to the number of terms in the sentence is lower.
In other words, it has a lower field-length norm:
norm = 1/sqrt(numFieldTerms)
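Plugging the two documents into that formula gives a feel for the difference. This is only a sketch: it counts terms with a whitespace split as a stand-in for the analyzer, and the similarity actually configured on the index may compute length normalization differently:

```python
import math

docs = {
    "1": "Product with good qualities and awesome sound system",
    "2": "Another Product that has better acustics than the other one",
}

for doc_id, name in docs.items():
    num_terms = len(name.split())        # whitespace split stands in for the analyzer
    norm = 1 / math.sqrt(num_terms)
    print(doc_id, num_terms, round(norm, 4))
```

Document 1 has 8 terms and document 2 has 10, so the same single match on product is weighted lower in document 2.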
Now, if you want to be able to do partial prefix matching, you need to tokenize your term into ngrams. For example, you can create the following ngrams for the term "acoustics":
"ac", "aco", "acou", "acous", "acoust", "acousti", "acoustic", "acoustics"
You have 2 options to achieve this; see the answer by Russ Cam on this question:
1. Use the Analyze API with an analyzer that will tokenize the field into tokens/terms from which you would want to partial prefix match, and index this collection as the input to the completion field. The Standard analyzer may be a good one to start with...
2. Don't use the Completion Suggester here and instead set up your field (name) as a text datatype with multi-fields that include the different ways that name should be analyzed (or not analyzed, with a keyword sub field for example). Spend some time with the Analyze API to build an analyzer that will allow for partial prefix of terms anywhere in the name. As a start, something like the Standard tokenizer, Lowercase token filter, Edgengram token filter and possibly Stop token filter would get you running...
You may also find this guide helpful.
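To make the second option more concrete, here is a rough sketch written as Python dicts. It assumes a 7.x-style typeless mapping and the edge_ngram token filter name; the names name_edge_ngram, name_prefix and name.prefix, and the min_gram/max_gram values, are invented for the example. The small helper only mimics what such a filter emits for a single term:

```python
import json

def edge_ngrams(term, min_gram=2, max_gram=None):
    """Return the prefixes of `term` from min_gram up to max_gram characters."""
    max_gram = len(term) if max_gram is None else min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, max_gram + 1)]

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "name_edge_ngram": {"type": "edge_ngram", "min_gram": 2, "max_gram": 15}
            },
            "analyzer": {
                "name_prefix": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "name_edge_ngram"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "fields": {
                    # Prefixes at index time, whole words at search time.
                    "prefix": {
                        "type": "text",
                        "analyzer": "name_prefix",
                        "search_analyzer": "standard",
                    },
                    "keyword": {"type": "keyword"},
                },
            }
        }
    },
}

# Query both the plain field and the prefix sub-field.
search_body = {
    "query": {
        "multi_match": {"query": "product acust", "fields": ["name", "name.prefix"]}
    }
}

print(edge_ngrams("acoustics"))
print(json.dumps(index_body, indent=2))
```

Using a separate search_analyzer keeps the query text from being split into prefixes as well, so acust is looked up as a single token against the indexed prefixes.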
I have the more_like_this query below, which I send to Elasticsearch.
I run it in a loop 15 times, with a different art_title and art_tags each time. For some articles it takes very little time, but for others it takes too long to execute. Is there anything I can do to optimize this query? Any help is appreciated.
bodyquery={
"query":
{"bool":
{"should":
[
{"more_like_this":
{
"like_text": art_title,
"fields": ["title"],
"max_query_terms": 30,
"boost": 5,
"min_term_freq": 1
}
},
{"more_like_this":
{
"like_text": art_tags,
"fields": ["tags"],
"max_query_terms": 30,
"boost": 5,
"min_term_freq": 1
}
}
]
}
}
}
I believe you might have solved this already by now, but just in case, I want to add: depending on the content of your indexed docs and the analyzers applied to the fields you are looking at, this can take a wide range of time to complete. Think about how similarity works and how it will be calculated for your documents, and you will probably find the answer. Also, you can use the explain param to get a detailed, step-by-step response from Lucene.
It is virtually impossible to determine anything more without further details:
What your mappings look like
How are those fields analyzed
What version of ES are you using
Your ES setup
Also, describe in English what you are trying to retrieve, for example: "I want documents in the catalog index that have a title similar to art_title and/or a tag similar to art_tag".
There is a reference for the syntax here, if you are using the latest version of ES.
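As a starting point for that kind of investigation, the body from the question could be wrapped in a small builder with explain switched on. This is just a sketch: mlt_clause and build_body are made-up names, the inputs are invented, and like_text is kept exactly as in the question (newer releases may expect like instead):

```python
def mlt_clause(text, field, boost=5):
    """One more_like_this clause, with the parameters from the question."""
    return {
        "more_like_this": {
            "like_text": text,      # as in the question; newer releases may use "like"
            "fields": [field],
            "max_query_terms": 30,
            "boost": boost,
            "min_term_freq": 1,
        }
    }

def build_body(art_title, art_tags, explain=False):
    """Bool/should of the two clauses; explain=True asks for per-hit score details."""
    body = {"query": {"bool": {"should": [
        mlt_clause(art_title, "title"),
        mlt_clause(art_tags, "tags"),
    ]}}}
    if explain:
        body["explain"] = True
    return body

body = build_body("example title", "tag1 tag2", explain=True)  # invented inputs
print(body)
```

Comparing the took value (in milliseconds) of each response in the loop should also show which articles are the slow ones.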
Cheers
I'm trying to create a filter against ElasticSearch that requires more than one match before the result is returned. For example, in the following text:
If you're uneasy at the idea of riding in a vehicle that drives itself, just wait till you see Google's new car. It has no gas pedal, no brake and no steering wheel. Google has been demonstrating its driverless technology for several years by retrofitting Toyotas, Lexuses and other cars with cameras and sensors. But now, for the first time, the company has unveiled a prototype of its own: a cute little car that looks like a cross between a VW Beetle and a golf cart.
If I set the minimum number of matches to 2 and searched for Google, I would expect this result because Google appears in the text twice. However, searching on Toyota with the same number of expected matches should not result in this article.
How do I construct this filter?
This is probably not exactly what you are looking for, but you could add explain to your query and then filter on the client side by the number of term matches. From the docs, the query would look like this:
GET /_search?explain
{
"query" : { "match" : { "tweet" : "honeymoon" }}
}
Results would look like this:
"_explanation": {
"description": "weight(tweet:honeymoon in 0) [PerFieldSimilarity], result of:",
"value": 0.076713204,
"details": [
{
"description": "fieldWeight in 0, product of:",
"value": 0.076713204,
"details": [
{
"description": "tf(freq=1.0), with freq of:",
"value": 1,
"details": [
{
"description": "termFreq=1.0",
"value": 1
}
]
},
{
"description": "idf(docFreq=1, maxDocs=1)",
"value": 0.30685282
},
{
"description": "fieldNorm(doc=0)",
"value": 0.25
}
]
}
]
}
You could then filter on the description field for term frequency and look for a value > 1.
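A minimal sketch of that client-side filter, assuming the hit has been parsed into a dict and that the explanation contains termFreq=... nodes as in the output above (the wording of these descriptions can differ between versions). The helper names are made up:

```python
def term_freqs(explanation):
    """Yield the value of every 'termFreq=...' node in an _explanation tree."""
    if explanation.get("description", "").startswith("termFreq="):
        yield explanation["value"]
    for child in explanation.get("details", []):
        yield from term_freqs(child)

def matches_at_least(hit, min_freq):
    """True if any matched term occurs at least min_freq times in the hit."""
    return any(freq >= min_freq for freq in term_freqs(hit["_explanation"]))

# Trimmed copy of the explanation shown above.
hit = {"_explanation": {
    "description": "weight(tweet:honeymoon in 0) [PerFieldSimilarity], result of:",
    "value": 0.076713204,
    "details": [{
        "description": "fieldWeight in 0, product of:",
        "value": 0.076713204,
        "details": [{
            "description": "tf(freq=1.0), with freq of:",
            "value": 1,
            "details": [{"description": "termFreq=1.0", "value": 1}],
        }],
    }],
}}
print(matches_at_least(hit, 1), matches_at_least(hit, 2))
```

The sample hit contains honeymoon once, so it passes a minimum of 1 and is dropped for a minimum of 2.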
I believe you may be able to do this directly (no client side filtering) by using scripting, as you can get reference to term frequency:
Term statistics:
Term statistics for a field can be accessed with a subscript operator like this: _index['FIELD']['TERM']. This will never return null, even if the term or field does not exist. If you do not need the term frequency, call _index['FIELD'].get('TERM', 0) to avoid unnecessary initialization of the frequencies. The flag will only have an effect if you set the index_options to docs (see mapping documentation).
_index['FIELD']['TERM'].df()
df of term TERM in field FIELD. Will be returned, even if the term is not present in the current document.
_index['FIELD']['TERM'].ttf()
The sum of term frequencies of term TERM in field FIELD over all documents. Will be returned, even if the term is not present in the current document.
_index['FIELD']['TERM'].tf()
tf of term TERM in field FIELD. Will be 0 if the term is not present in the current document.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-advanced-scripting.html
However, I've not done this and there are the normal concerns about both security and performance when using server side scripting.
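For what it's worth, a request along those lines might look like the sketch below. It is untested and assumes an older release where the filtered query, the script filter and the _index script variable are available and dynamic scripting is enabled; as far as I know, recent versions no longer support _index. The field name body is made up, and the term is lowercased on the assumption that the field is analyzed:

```python
import json

def min_tf_query(field, term, min_tf):
    """Old-style filtered query keeping only docs where `term` occurs >= min_tf times."""
    # Hypothetical helper: field and term are pasted into the script text,
    # so they must not contain quotes.
    script = "_index['%s']['%s'].tf() >= min_tf" % (field, term)
    return {
        "query": {
            "filtered": {
                "query": {"match": {field: term}},
                "filter": {"script": {"script": script, "params": {"min_tf": min_tf}}},
            }
        }
    }

# "body" is a made-up field name for the article text.
print(json.dumps(min_tf_query("body", "google", 2), indent=2))
```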