Return high-quality results in Elasticsearch - elasticsearch

I'm sending search queries to my ES index and getting multiple results back. Often the results with lower scores are irrelevant, and I want to remove them and return only high-quality results (which mostly have higher scores).
My index contains 1000 documents of type text, each 100-500 words. For example:
{"text": "AVENGERS: ENDGAME is set after Thanos' catastrophic use of the Infinity Stones randomly wiped out half of Earth's population in Avengers: Infinity War. Those left behind are desperate to do something -- anything -- to bring back their lost loved ones. But after an initial attempt -- with extra help from Captain Marvel -- creates more problems than solutions, the grieving, purposeless Avengers think all hope is lost."}
If the user searches for 'Captain Marvel aka Brie Larson kills Thanos in the movie', the above document should be returned as a result since it contains similar terms.
Currently, I am using min_score to set the threshold, but I know it's not best practice and the scores vary depending on the number of documents in the index (which will keep growing). So this approach doesn't seem scalable.
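For reference, the min_score approach described above looks roughly like this as a search request body (the index/field usage follows the example document; the threshold of 15 is an arbitrary illustration, not a recommendation):

```json
{
  "min_score": 15,
  "query": {
    "match": {
      "text": "Captain Marvel aka Brie Larson kills Thanos in the movie"
    }
  }
}
```

Any hit scoring below min_score is dropped from the response, which is exactly why a fixed constant becomes fragile as the corpus (and the score distribution) changes.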
I also tried multiple ways of tuning the query to get high-quality results back, such as the More Like This query:
{
  "query": {
    "bool": {
      "must": [
        {
          "more_like_this": {
            "fields": field_list,
            "like": query_data,
            "min_term_freq": 1,
            "max_query_terms": 50,
            "min_doc_freq": 1,
            "minimum_should_match": "50%"
          }
        }
      ]
    }
  }
}
But I'm still getting results with low scores (around 1.5), whereas a good-quality result usually scores around 20. Is there a good way to tune the query further, or to make min_score dynamic so that only highly relevant documents are returned? Any help would be appreciated!

Related

Elasticsearch Track total hits alternative with approximation

Based on this article - link, there are serious performance implications to having the track_total_hits property set to true.
We currently use it to get the number of matching documents after a user searches; the user can then paginate through the results. The number of documents for such a search usually ranges from 10k to 5M.
Example of a user work flow:
User performs a search which matches 150,000 documents.
We show them the first 200 results, which they can scroll through, but we also show the total number of documents found by the search.
Since we always show the number of matching documents, and those numbers can be quite high, we need some way to get that count. I'm not sure, but if we almost always perform paginated searches, I would assume a lot of the data would already be in memory - maybe this actually affects us less than the provided article suggests?
Some kind of an approximation and not an exact count would be ok for us if it would improve performance.
Is there such an option in Elasticsearch where we can get approximated count on search requests ?
There is no option to get an approximate count, but you may want to consider setting track_total_hits to a lower bound (an integer) instead of true, which is a good compromise from a performance standpoint (https://www.elastic.co/guide/en/elasticsearch/reference/master/search-your-data.html#track-total-hits).
That way, you can show users that there are at least k results - but there could be more.
Also, try using search_after (if you are not using it already) for pagination.
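A sketch of both suggestions combined in one search request body (the index fields created_at and _id used for sorting are illustrative; the search_after values must match the sort of the last hit from the previous page):

```json
{
  "track_total_hits": 10000,
  "size": 200,
  "sort": [
    { "created_at": "desc" },
    { "_id": "asc" }
  ],
  "search_after": [1589219037000, "user_doc_198"],
  "query": {
    "match": { "body": "user search terms" }
  }
}
```

When more than 10,000 documents match, the response reports the total as {"value": 10000, "relation": "gte"}, i.e. "at least 10,000", so counting stops early instead of scanning every match.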

Speeding up elasticsearch more_like_this query

I was interested in fetching documents similar to a given input document (similar to kNN). Vectorizing documents of very different lengths (using doc2vec) yields inconsistent document vectors, and computing a vector for the user's input - which may be just a few terms or sentences, compared with the hundreds or thousands of words per document the doc2vec model was trained on - and then finding the k nearest neighbours would produce poor results due to the lack of features.
Hence, I went ahead with the more_like_this query, which does a similar job to kNN irrespective of the size of the user's input, since I'm only interested in analyzing text fields.
But I was concerned about the performance when I have millions of documents indexed in elasticsearch. The documentation says that using term_vector to store the term vectors at the index time can speed up the analysis.
But what I don't understand is which type of term vector the documentation refers to in this context, as there are three different kinds of information: term information, term statistics, and field statistics.
Since term statistics and field statistics measure the frequency of terms relative to other documents in the index, wouldn't these vectors become outdated when I add new documents to the index?
Hence I presume that the more_like_this documentation refers to the term information (i.e., the information about the terms within one particular document, irrespective of the others).
Can anyone let me know if computing only the term information vector at the index time is sufficient to speed up more_like_this?
There shouldn't be any worries about term vectors becoming outdated, since they are stored per document and updated whenever the document is reindexed.
For More Like This it is enough to have term_vector: yes; you don't need offsets and positions. So, if you don't plan on using highlighting, you should be fine with just the default.
So, for your text field, you would need to have mappings like this and it will be enough to speed up MLT execution:
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "term_vector": "yes"
      }
    }
  }
}
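With that mapping in place, a more_like_this query against the field can read the stored term vectors instead of re-analyzing each liked document at query time. A sketch of such a search request body (the document _id and the parameter values are illustrative):

```json
{
  "query": {
    "more_like_this": {
      "fields": ["text"],
      "like": [
        { "_id": "1" }
      ],
      "min_term_freq": 1,
      "max_query_terms": 25
    }
  }
}
```

The speed-up applies when "like" references indexed documents (as above) rather than free text, since free text always has to be analyzed on the fly.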

Elasticsearch - Best way to trim results by score?

Some of my search results return a total of over 10k documents, varying from a high score (~75 in my most recent search) to a very low one (less than 5). Other queries return a high score of ~20 and a low score of ~1.
Does anyone have a good solution for trimming off the less relevant documents? A Java or query implementation would work. I've thought about using min_score, but I'm wary of it since it has to be a constant, and in some of my responses the scores are much closer together than the above. I suppose I could derive a cutoff for each response from the returned scores, but I was curious whether anyone has come up with a solution to a similar use case?

Elasticsearch: What would be the lowest-cost, highest-impact change I can make to decrease response times?

Unguided beginners in any field often find themselves barking up the wrong tree in trying to solve a problem — this question is asked in hoping that it'll vector my approach in a more direct path towards solving the problem.
On to the question:
I'm about a month into working with ES and so far it's been awesome. I've been incrementally indexing to ES from a set of data I've got in CSV, and I'm beginning to encounter slow response times. I want to bring the response time down, but don't know what's a good way / the best way to approach it.
My research thus far tells me that it really depends on a number of variables. So, listed below are details on the ES variables which might help you with writing an answer:
Shards & Stuff
I say "& Stuff" because I don't know enough to know what's significant here.
Running the default ES settings, 5 shards, 1 node.
Running index-time-search-as-you-type, exactly as-is from the ES guide. There's a bit in there which PUTs settings for the indices: "number_of_shards": 1. I'm not sure how that affects things.
Index
2 indices with similar mappings (mirror a DB, so don't want to combine them)
Multi-language, but at the moment I only care about English.
As mentioned above, configured for index-time-search-as-you-type (min: 3, max: 20).
Documents
Have currently indexed ~1mil documents.
Have total of ~4mil documents to index.
Very short documents, like 5 fields of 10 english words per doc.
Total CSV filesize of all ~4mil rows is only ~400MB.
Queries
Main query is run as a bool (should) query.
Heavy on score scripting.
Heavy on script-sorted aggregations.
Fuzzy search (fuzziness: 1).
Hardware
Running Linode's $20/mo VPS.
Response Time
Queries on terms with very high frequencies in the index (typically a single English word) take forever (~7,000-9,000 ms) to return results.
More specific queries (>= 2 English words) return in more acceptable times (~2,000-3,000 ms).
Ideally, all response times should be <2s.
If there are other variables which are important, and I've missed out, let me know and I'll edit them in.
Thank you!
I turned caching on (kinda a stop-gap measure) and it has helped a bunch. Meets my needs for now.
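The answer above doesn't say which cache was enabled; one candidate is the shard request cache, which can be switched on per index with a settings update (index name illustrative) - the request body for PUT /my-index/_settings would be:

```json
{
  "index.requests.cache.enable": true
}
```

Note this cache only helps repeated identical requests (and by default only size: 0 requests such as aggregations), so it is indeed a stop-gap rather than a fix for slow first-time queries.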

ElasticSearch -- is there any way to retrieve multiple result sets, or top results for facets?

I'm curious if there's a way to query ElasticSearch so that it will return the top results for various facets. For example, let's pretend we have some users writing tweets,
user: kimchy
user_eye_color: blue
tweet: elasticsearch training early bird discounts
# Lots of other message from blue eye color users mentioning 'bird'
user: lord_oliver
user_eye_color: amber-green
tweet: vanquished and consumed the twitter bird. today is a good day.
If there are enough blue-eyed users (or other colors more common than amber-green) writing tweets mentioning "bird", searching for "bird" will never surface Lord Oliver's tweet, even if Lord Oliver's tweet has a reasonably high score.
This is a problem because [in this hypothetical example], I want to surface results from a diversity of users. One current solution would be to add facets on eye color,
facets:
  eye_color:
    terms: { "field": "user_eye_color" }
and then perform multiple filtered searches afterward. This seems rather inefficient, however.
Question: Is there any way in ElasticSearch to return multiple result sets, either by returning top results from different facets (in this case, user_eye_color=amber-green), writing a stateful custom scoring function, or any other creative solution?
The justification for why I want to do this is that it's sometimes difficult to put a total order (floating point score) on all search results. Suppose that all amber-green eye color users happen to be cats, and they write different types of documents (tweets). Instead of trying to force all cat-written documents into a total order with all documents, I want pareto-optimal documents -- those optimal within the X-eye-color categories. I could then do more sensible postfiltering, for example, dropping cat-written documents if there's nothing good, and otherwise doing some kind of sensible interleaving of results. Dropping in some kind of score multiplier [based on eye color] would likely be less effective.
If you don't like my toy example (or its underhand satire), consider cases where you have an index with different document types, say tweets and FBI reports ...
This can now be done using a top_hits aggregation.
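A sketch of that approach for the eye-color example (assuming user_eye_color is mapped as a keyword field; names match the toy documents above):

```json
{
  "size": 0,
  "query": {
    "match": { "tweet": "bird" }
  },
  "aggs": {
    "by_eye_color": {
      "terms": { "field": "user_eye_color" },
      "aggs": {
        "top_tweets": {
          "top_hits": { "size": 3 }
        }
      }
    }
  }
}
```

Each eye-color bucket then carries its own top 3 scoring tweets, so Lord Oliver's tweet surfaces in the amber-green bucket no matter how many blue-eyed users mention "bird" - one request instead of a filtered search per facet value.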
