ElasticSearch and highlighting performance - plain vs. fast vector highlighter

I am running into performance issues when running a query that uses both slop and the fast vector highlighter. Interestingly, the performance issue goes away when performing the same query with the plain highlighter, and I am not sure why this is the case.
Here's the metadata for the field being searched:
contents: {
store: true
search_analyzer: mySearchAnalyzer
term_vector: with_positions_offsets
type: string
}
The following query, which uses the fast vector highlighter, takes over 60 seconds:
{
"size": 500,
"query": {
"query_string": {
"query": "\"CATERPILLAR FINANCIAL SERVICES ASIA PTE LTD\"~5",
"fields": [
"contents"
],
"default_operator": "and"
}
},
"highlight": {
"fields": {
"contents": {}
}
}
}
However, if I change the query to use the plain highlighter, then it takes only a few milliseconds:
{
"size": 500,
"query": {
"query_string": {
"query": "\"CATERPILLAR FINANCIAL SERVICES ASIA PTE LTD\"~5",
"fields": [
"contents"
],
"default_operator": "and"
}
},
"highlight": {
"fields": {
"contents": {"type" : "plain"}
}
}
}
I have looked at different options for the highlighters (like fragment_size, fragment_offset, phrase_limit), but none of them is an obvious candidate for improving performance.
Any ideas on what is going on here? Or what type of settings I can try to improve the performance?
Note: One reason we switched from the plain to the fast vector highlighter was that some queries were failing with the plain highlighter.
Edit: I've added the reproduction steps which demonstrate the issue in the following link:
https://drive.google.com/file/d/0B-IfDOojIDnIQmpkY2RNN2pMREE/edit?usp=sharing
I think the key is that there is a field which contains lots of similar values (e.g. in this case, Caterpillar is referenced many times).

While not strictly an answer: based on comments from Duc.Duong, who was not able to reproduce the issue, I tried reproducing this with the version we are using (0.90.3) and with the latest version (1.3.2). It turns out that the issue no longer reproduces with the latest version; the search returns right away.
So, bottom line, this issue does not reproduce with the latest version. Not sure where it was fixed, but the problem occurs in 0.90.3.

Related

Erratic search results from Elastic when sorting on a field

We just upgraded to Elasticsearch 2.3.1 (from 1.7) and we're getting strange search behavior that I can't explain. What seems to happen is that a search request containing a bool query and a sort clause is returning:
Documents that don't seem to match the given search terms in any way.
Wildly different estimates on the total of matching documents each request
A minimal example of a request with this behavior:
post pim_search_1/_search
{
"explain": false,
"track_scores": false,
"sort": [
{
"product_id": {
"order": "desc"
}
}
],
"query": {
"bool": {
"filter": [
{
"terms": {
"publication": [
"public"
]
}
},
{
"query_string": {
"query": "iphone",
"default_operator": "and"
}
}
]
}
}
}
So in this case, a query string for "iphone" returns no iPhones at all. Setting explain to true yields this for the documents that appear to have no matching terms at all:
"_explanation": {
"value": 0,
"description": "Failure to meet condition(s) of required/prohibited clause(s)",
"details": [
{
"value": 0,
"description": "no match on required clause (#ConstantScore(publication:public) #_all:iphone)",
So the document has no matching clauses, but it's still returned?
We've found two workarounds for this behavior:
Sort on _score or leave out the sort clause entirely. Sorting on anything else, like the field above or on _doc gives the wonky behavior.
Include track_scores : true on the request.
So it appears to have something to do with scoring and relevancy. But since we're sorting on a field of our own, we're not interested in relevancy or score. Without the workarounds, the max_score on the response is null and so is the _score of every document.
Is this behavior something that can be explained in any way, or should we be looking at cluster health/configuration/corruption? According to the cluster, its health is green and all shards for this index appear healthy. It's currently a small index with 3 shards (1 replica per shard) over 3 nodes.
Update
I've further investigated the issue and it seems cache related. Specifically, the fielddata cache for the _all field (I'm not very familiar with the internals of Elasticsearch, so please correct me if that's not a thing).
Steps to reproduce
I have a data set that reproduces the problem; leave a comment and I can send it to you.
Use the following query:
post pim_search_1/_search
{
"fields": [
"_all"
],
"explain": true,
"size": 100,
"sort": [
{
"product_id": {
"order": "desc"
}
}
],
"query": {
"bool": {
"must": [
{
"query_string": {
"default_field": "_all",
"query": "surface",
"default_operator": "and"
}
}
],
"filter": [
{
"terms": {
"publication": [
"public"
]
}
}
]
}
}
}
Execute the query. You're searching for "surface" in the query string here and this should result in 22 hits total. This is correct. Execute this query a bunch of times (this seems to matter for step 2).
Change the query string to "iphone". This will result in 22 hits still, even though the dataset contains only one item that should match. The _explanation also mentions that the found documents don't actually match, like my example above.
Execute this: post pim_search_1/_cache/clear
Execute the query again for "iphone". It should now only return 1 hit, which is correct. Also execute this one a bunch of times.
Execute the query again for "surface", this will now return only 1 hit and again the _explanation states that it didn't get a match on the resulting document.
Remove the sort clause from the query and everything appears normal. The same is true for including "track_scores" : true.
Instead of _cache/clear it also works to just restart the cluster.
I say it's related to the _all field because changing the default_field of the query_string to the primitive_name field (an analyzed field) results in the correct behavior. For this example, I've made _all a stored field (it isn't normally with us) and it's returned in the search results so you can inspect it (doesn't appear to contain anything weird).
The above was done on a single node cluster (my local PC) on Elasticsearch 2.3.5.
This GitHub question seems to be about the same issue as mine, but it could not be reproduced at the time and was closed.
This has been fixed in Elasticsearch 2.4:
https://github.com/elastic/elasticsearch/pull/20196

ElasticSearch: Highlights every word in phrase query

How can I get Elastic Search to only highlight words that caused the document to be returned?
I have the following index
{
"mappings": {
"document": {
"properties": {
"content": {
"type": "string",
"fields": {
"english": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
}
}
Let's say I have indexed:
Nuclear power is the use of nuclear reactions that release nuclear
energy[5] to generate heat, which most frequently is then used in
steam turbines to produce electricity in a nuclear power station. The
term includes nuclear fission, nuclear decay and nuclear fusion.
Presently, the nuclear fission of elements in the actinide series of
the periodic table produce the vast majority of nuclear energy in the
direct service of humankind, with nuclear decay processes, primarily
in the form of geothermal energy, and radioisotope thermoelectric
generators, in niche uses making up the rest.
And search for "nuclear elements"~2
I only want "nuclear fission of elements" or parts of "nuclear fission of elements" to be highlighted but every single occurrence of nuclear is now highlighted.
This is my query if it helps:
{
"fields": [
],
"query": {
"query_string": {
"query": "\"nuclear elements\"~2",
"fields": [
"content.english"
]
}
},
"highlight": {
"pre_tags": [
"<em class='h'>"
],
"post_tags": [
"</em>"
],
"fragment_size": 500,
"number_of_fragments": 20,
"fields": {
"content.english": {}
}
}
}
There is a highlighting bug in ES 2.1, which was caused by this change. It has been fixed by this Pull Request.
According to an ES developer:
This is a bug that I introduced in #13239 while thinking that the
differences were due to changes in Lucene: extractUnknownQuery is also
called when span extraction already succeeded, so we should only fall
back to Weight.extractTerms if no spans have been extracted yet.
It works in older versions up to and including 2.0, and should work as expected in future versions.

Manipulate score in elasticsearch

I would like to manipulate the score I get when I do a search on elasticsearch.
I already use the boost option, but it does not give me the results I would like to have. After some reading I think the function_score query is the solution to my problem.
I understand how it works, but I can’t figure out how I can change my current query to use it with the function_score query.
"query": {
"filtered": {
"query": {
"bool": {
"should": [{
"multi_match": {
"type": "most_fields",
"query": "paus",
"operator": "and",
"boost": 2,
"fields": [
"fullname^2",
"fullname.folded",
"alias^2",
"name^2"
],
"fuzziness": 0
}
}, {
"multi_match": {
"type": "most_fields",
"query": "paus",
"operator": "and",
"boost": 1.9,
"fields": [
"taggings.tag.name^1.9",
"function",
"relations.master.name^1.9",
"relations.master.first_name^1.9",
"relations.master.last_name^1.9",
"relations.slave.name^1.9",
"relations.slave.first_name^1.9",
"relations.slave.last_name^1.9"
],
"fuzziness": 0
}
}, {
"multi_match": {
"type": "most_fields",
"query": "paus",
"operator": "and",
"fields": [
"fullname",
"alias",
"name"
],
"boost": 0.2,
"fuzziness": 1
}
}, {
"match": {
"extra": {
"query": "paus",
"fuzziness": 0,
"boost": 0.1
}
}
}]
}
},
"filter": {
"bool": {
"must": [
{
"terms": {
"type": ["Person"]
}
},
{
"term": {
"deleted": false
}
}
]
}
}
}
}
As you can see we have four kinds of matches.
Boost 2: when there are exact matches on the name
Boost 1.9: when there are exact matches on the taggings
Boost 0.2: when there are matches on the name but with one character written wrong
Boost 0.1: when there are matches in the extra (description) field
The problem I am facing is that the matches with one character written wrong and no tagging score higher than the matches with the right tagging and the whole word written wrong. It should be the other way around.
Any help would be appreciated :)
There is no clear answer to this. Your best friend is the Explain API. It will tell you how each document's score is calculated.
The most important thing to remember is that boost is simply one of the factors considered while calculating the score. From the docs:
Practically, there is no simple formula for deciding on the “correct” boost value for a particular query clause. It’s a matter of try-it-and-see. Remember that boost is just one of the factors involved in the relevance score; it has to compete with the other factors
It would help you a lot if you go through Theory and Lucene's Practical Scoring Function. This is the formula used by Lucene.
score(q,d) =
queryNorm(q)
· coord(q,d)
· ∑ (
tf(t in d)
· idf(t)²
· t.getBoost()
· norm(t,d)
) (t in q)
Now, one of the several reasons you are not getting the expected results could be norm(t,d) and idf(t)². For example, if the extra field contains just paus me while other fields contain something like my name is some paus something, the shorter field gets a higher field-length norm, i.e. a higher norm(t,d). Also, if you have, say, 10000 documents and only one document has paus in the extra field, the inverse document frequency will be pretty high, because it is calculated as idf(t) = 1 + log(numDocs / (docFreq + 1)); here numDocs=10000 and docFreq=1, and this value is then squared. I had exactly this problem in my dataset.
The fuzzy query scoring higher could be related to this issue, which is basically a Lucene issue. It is fixed in the latest version.
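To make the idf arithmetic above concrete, here is a small Python sketch of the formula quoted in this answer. The function name and the second, "common term" case (docFreq=5000) are illustrative assumptions and do not come from the question's data.

```python
import math

def idf(num_docs, doc_freq):
    """idf as quoted above: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Rare term: 10000 documents, only one of them contains "paus" in the field.
rare = idf(10000, 1)
# Hypothetical common term for comparison: half of the documents contain it.
common = idf(10000, 5000)

# The scoring formula uses idf squared, which widens the gap further.
rare_sq = rare ** 2
common_sq = common ** 2

print("rare:  ", round(rare, 2), "squared:", round(rare_sq, 2))
print("common:", round(common, 2), "squared:", round(common_sq, 2))
```

Looking at the idf factor alone, the rare term's squared idf is roughly 30 times that of the common term, so a clause boosted by 0.1 on a rare term can still outweigh a clause boosted by 2 on a common one.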
One way that might work is wrapping the last two clauses in constant_score and giving the first two clauses a boost of, say, 5. This would help in understanding what is going on.
Try to solve this issue step by step: start with two clauses and look at the output of the Explain API, then try with three, and finally with all four. Also remove the field boosting and try with the query boost only. Gradually you will figure it out.
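The constant_score suggestion could look roughly like the sketch below, which builds the bool query as a Python dict. It is only an illustration: the helper names are made up, the tagging field list is shortened, field boosts are dropped, and the "query" key inside constant_score assumes a 1.x-style API (newer versions expect "filter" there).

```python
import json

TERM = "paus"

def multi_match(fields, fuzziness, boost=None):
    """Build one multi_match clause in the shape used in the question."""
    body = {
        "type": "most_fields",
        "query": TERM,
        "operator": "and",
        "fields": fields,
        "fuzziness": fuzziness,
    }
    if boost is not None:
        body["boost"] = boost
    return {"multi_match": body}

def constant(clause, boost):
    """Wrap a clause so that any match contributes a fixed score of `boost`.

    Assumption: 1.x-style "query" key; newer versions use "filter".
    """
    return {"constant_score": {"query": clause, "boost": boost}}

should = [
    # Exact matches keep normal scoring, with a larger query boost.
    multi_match(["fullname", "fullname.folded", "alias", "name"], 0, boost=5),
    multi_match(["taggings.tag.name", "function"], 0, boost=5),  # list shortened
    # Fuzzy and description matches get a flat score, so idf and norms
    # can no longer push them above the exact matches.
    constant(multi_match(["fullname", "alias", "name"], 1), 0.2),
    constant({"match": {"extra": {"query": TERM, "fuzziness": 0}}}, 0.1),
]
query = {"bool": {"should": should}}

print(json.dumps(query, indent=2))
```

The resulting dict would replace the inner bool query of the original request; the filter part stays as it is.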
I hope this helps!!

Diversified results on Elasticsearch search

I've done a complex query using the popularity to improve the results of social media documents using Elasticsearch.
The query works really fine and the top results are always centered on the query and with interesting elements.
However, it has a problem: for some queries the first results are all from the same user.
I would like to downscore a document if the same user already appeared in a higher-ranked document. This way I expect to get more diverse results.
Note that I don't want them to be removed, as in some cases it may still be interesting to find more documents of the same user, but I would like them to be in a lower position.
Can anybody suggest a way to make it work?
As suggested in some comments, here is a simplified version of my query:
query = {"function_score": {
"functions": [
{"gauss": {"createdAt":
{"origin": "now", "scale": "30d", "offset": "7d", "decay" :0.9 }
}},
{"gauss": {"shares.last.twitter_retweets_log":
{"origin": 4.52, "scale": 2.61, "decay" : 0.9}
}},
],
"query": {"bool":{"must":[
{"exists":{"field": "images"}},
{"multi_match":{"query": "foo boo", "fields":["text", "link.title"]}}
]}},
"score_mode": "multiply"
}};
P.S: some documents that may be interesting, as they talk about diversity, but I'm not sure how to apply:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-sampler-aggregation.html?q=sampler
https://lucene.apache.org/core/5_1_0/misc/org/apache/lucene/search/DiversifiedTopDocsCollector.html
You can couple the sampler with the top_hits aggregation to get diversified results.
{
"query": {
"match": {
"query": "iphone"
}
},
"size":0,
"aggs": {
"sample": {
"sampler": {
"shard_size": 200,
"field" : "user.id"
},
"aggs": {
"diversifiedMatches": {
"top_hits": {
"size":10
}
}
}
}
}
}
There are some caveats, e.g.:
1) Deduplication is per-shard not global
2) Choice of diversification field must be a single-value field
3) No support for pagination
4) No support for sorting on anything other than score
Addressing the above issues would be hard and would require expensive/complex co-ordination internally plus more guidance from the client about when and where "duplicate" results can be re-introduced (page 2? page 3? how many?) etc.
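Because the sampler deduplicates per shard and does not paginate, one alternative is to re-rank on the client after fetching the hits. The sketch below is not part of the sampler approach above; it is a plain Python illustration in which the function name, the `penalty` factor and the hit layout (`user` and `score` keys) are all assumptions. Each repeat appearance of a user multiplies that hit's score by `penalty` once more, so repeated users sink in the list without being removed.

```python
from collections import defaultdict

def diversify(hits, penalty=0.5):
    """Demote, but never drop, hits whose user already appeared higher up.

    hits: list of dicts with 'user' and 'score', sorted by score descending.
    """
    seen = defaultdict(int)
    rescored = []
    for hit in hits:
        k = seen[hit["user"]]  # earlier hits from the same user
        rescored.append(dict(hit, score=hit["score"] * penalty ** k))
        seen[hit["user"]] += 1
    # sorted() is stable, so ties keep their original relative order.
    return sorted(rescored, key=lambda h: h["score"], reverse=True)

hits = [
    {"id": 1, "user": "a", "score": 10.0},
    {"id": 2, "user": "a", "score": 9.0},
    {"id": 3, "user": "b", "score": 8.0},
    {"id": 4, "user": "a", "score": 7.0},
]
reranked = diversify(hits)
print([h["id"] for h in reranked])
```

In this toy input the second and third documents of user "a" drop below the document of user "b", while all four hits remain in the result.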

Adjusting Elasticsearch _score based on field value, relative to other matching document's field value

We're updating our search system from Solr to Elasticsearch. We've already improved lots of things, but something we haven't got right yet is boosting a document's (product's) score by the popularity of the product (it's an ecommerce website).
This is what we have currently (with lots of irrelevant bits stripped out):
{
"query": {
"function_score": {
"query": {
"multi_match" : {
"query": "renal dog food",
"fields": [ "family_name^20", "parent_categories^2", "description^0.2", "product_suffixes^8", "facet_values^5" ],
"operator": "and",
"type": "best_fields",
"tie_breaker": 0.3
}
},
"functions": [{
"script_score": {
"script": "_score * log1p(1 + doc['popularity_score'].value)"
}
}],
"score_mode": "sum"
}
},
"sort": [
{ "_score": "desc" }
]
}
The popularity_score field contains the total number of orders containing this item in the last 6 weeks. Some items will have never been ordered and some will have had up to 30,000 (with potentially a lot more as we continue to grow the business). It's quite a big range.
The problem we have is that a document (product) might be a really good match text-wise but not very popular. We then have another not-very-relevant product that does just about match the query, but because it is very popular it jumps up the list. What we are looking for is something that will allow the popularity_score to be taken relative to the popularity_score of other matching results and get some form of normalisation, rather than just being taken as is (log1p doesn't seem to be enough sometimes). Does anyone have any suggestions or ideas?
Thank you!
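To illustrate the range involved, here is a quick Python sketch of the multiplier the script above applies. It assumes the script's log1p is the natural-log log1p (ln(1 + x)), so log1p(1 + popularity) is ln(2 + popularity); the function name is made up for the example.

```python
import math

def popularity_factor(popularity_score):
    """Multiplier from the script: log1p(1 + popularity) = ln(2 + popularity)."""
    return math.log1p(1 + popularity_score)

never_ordered = popularity_factor(0)      # product with no orders
best_seller = popularity_factor(30000)    # product with 30,000 orders
ratio = best_seller / never_ordered

print(round(never_ordered, 3), round(best_seller, 3), round(ratio, 1))
```

Under that assumption the best seller's multiplier is roughly 15 times that of a never-ordered product, so a popular product can overtake an unpopular one even when its text score is many times lower.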

Resources