How to get ElasticSearch to return scores independent of case?

I would like ElasticSearch to return result scores that are independent of case. As an example, suppose I query for the string "HOUSE" (or "house"); I obtain the following results:
"House" => score: 0.6868894,
"House on the hill" => score: 0.52345484
"HOUSE" => score: 0.52200186
In an ideal world, both "House" and "HOUSE" would have a score of 1.0 and "House on the hill" a score of 0.5.
So far I've tried adding a custom analyser and am now looking at the omit_norms option. I'm also considering pattern analysers, since they have a CASE_INSENSITIVE flag. Unfortunately, the official documentation is short on examples and code snippets...
Can anyone provide code snippets/examples of a query including the parameters required to achieve scores independent of case? Extra recognition to anyone who can provide a solution using Tire for Rails.
MAPPING
mapping _source: {} do
  indexes :id,    type: 'integer'
  indexes :value, analyzer: 'string_lowercase'
end
** string_lowercase is the custom analyser mentioned above
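The string_lowercase analyser itself isn't defined in the snippet above; a minimal sketch of index settings matching that name might look like the following (the keyword tokenizer plus a lowercase filter indexes the whole value as one lowercased token; this exact definition is an assumption):

{
  "settings": {
    "analysis": {
      "analyzer": {
        "string_lowercase": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

If per-word matching is wanted instead, swapping the keyword tokenizer for the standard tokenizer keeps the lowercasing behaviour.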
QUERY
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "house"
        }
      }
    }
  },
  "fields": ["value"],
  "from": 0,
  "size": 50,
  "sort": {
    "_score": {
      "order": "desc"
    }
  },
  "explain": true
}
ElasticSearch 0.90.5;
Rails 4.0.0;
Tire (gem) 0.6.0

It turns out the problem is caused by ES using multiple shards (5 by default) to score the documents: each shard computes its scores using only the documents allocated to it. Since I'm using test data and my DB is practically empty, the scores were completely off. The answer is to use the dfs_query_then_fetch search type (at least while developing). I'm still looking for how to set it in Rails/Tire or make it the default in ES.
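A minimal sketch of setting the search type as a URL parameter on the request (index name is a placeholder):

GET /my_index/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "query_string": {
      "query": "house"
    }
  }
}

Tire appears to accept extra options on its search call and pass them through as URL parameters, so something like search_type: 'dfs_query_then_fetch' may work there as well, but I haven't verified that.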
Cheers,
nic

Related

Elasticsearch - best query and index for partial and fuzzy search

I thought this scenario must be quite common, but I was unable to find the best way to do it.
I have a big dataset of products. All the products have this kind of schema:
{
  "productID": 1,
  "productName": "Whatever",
  "productBoost": 1234
}
My problem is combining a partial (query_string) query with a fuzzy query.
I have about 1.5M records in an index, each listing the name of the product and a boost value, which acts like a popularity score (common products have a higher value, less popular ones a lower value).
For this I would like to use function_score.
What I'm trying to achieve is search-as-you-type combined with the function score and fuzziness.
I'm not sure if this is the best approach.
The query I'm currently using is this:
"query": {
"function_score": {
"query": {
"match": {
"productName": {
"query": "word",
"fuzziness": "AUTO",
"operator": "AND"
}
}
},
"field_value_factor": {
"field": "productBoost",
"factor": 1,
"modifier": "square"
}
}
}
This works reasonably well, but the problem is that I want products like "Cabbage raw" to come up before "Cabernet red wine" when I search for the string "cab", because the boost is much higher on "Cabbage raw".
Another problem is when I search for the word "cabage" (a typo of "cabbage"): only one product comes back, even though there are a lot of products containing "cabbage".
If query_string supported fuzziness together with wildcards, that would be ideal for this, I think.
Also, since this is a match query, the partial matching doesn't work either.
I tried using query_string with wildcards, but the downside is that I can't use fuzziness with that kind of query.
I've also tried nGrams and edge nGrams, but I'm not sure how to implement them in this scenario, or how to combine the resulting search score with the boost I already have.
The only thing I haven't tried that might fix this is suggesters, but I couldn't make them work with function_score.
If anyone has any ideas on implementing this, it would be really helpful.
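One possible direction, sketched under assumptions (the index name products, the subfield productName.ngram, the analyzer names, and the typeless 7.x mapping format are all made up here): index productName with an extra edge_ngram subfield for the search-as-you-type part, then combine a prefix-style match on that subfield with a fuzzy match on the full field inside the existing function_score.

PUT /products
{
  "settings": {
    "analysis": {
      "filter": {
        "edge_ngram_filter": { "type": "edge_ngram", "min_gram": 2, "max_gram": 15 }
      },
      "analyzer": {
        "edge_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "edge_ngram_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "productName": {
        "type": "text",
        "fields": {
          "ngram": {
            "type": "text",
            "analyzer": "edge_ngram_analyzer",
            "search_analyzer": "standard"
          }
        }
      },
      "productBoost": { "type": "long" }
    }
  }
}

GET /products/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "should": [
            { "match": { "productName.ngram": "cab" } },
            { "match": { "productName": { "query": "cab", "fuzziness": "AUTO" } } }
          ],
          "minimum_should_match": 1
        }
      },
      "field_value_factor": {
        "field": "productBoost",
        "factor": 1,
        "modifier": "square"
      }
    }
  }
}

The ngram clause handles partial input like "cab" while the fuzzy clause handles typos like "cabage"; field_value_factor keeps multiplying in productBoost as before. The gram sizes and the should/minimum_should_match combination are starting points rather than tuned values.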

Why is Elasticsearch with Wildcard Query always 1.0?

When I do a search in Elasticsearch with a wildcard query (wildcard at the end), the score for all hits is 1.0.
Is this by design? Can I change this behavior somewhere?
Elasticsearch is basically saying that all results are equally relevant, as you've provided an unqualified search (a wildcard, equivalent to a match_all). As soon as you add some additional context through the various types of queries, you will notice changes in the scoring.
Depending on your ultimate goal, you may want to look into the Function Score query - reference: https://www.elastic.co/guide/en/elasticsearch/reference/6.7/query-dsl-function-score-query.html
The first example provided would give you essentially random scores for all documents in your cluster:
GET /_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "boost": "5",
      "random_score": {},
      "boost_mode": "multiply"
    }
  }
}
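As a small illustration of the point about adding context (the field and terms below are made up): keeping the wildcard as a non-scoring filter and adding a scored clause brings differentiated scores back.

GET /_search
{
  "query": {
    "bool": {
      "filter": [
        { "wildcard": { "title": "hous*" } }
      ],
      "should": [
        { "match": { "title": "house" } }
      ]
    }
  }
}

Here the filter clause contributes nothing to the score, while the match clause scores documents by relevance.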

Custom score for exact, phonetic and fuzzy matching in elasticsearch

I have a requirement for custom scoring on a name field. To keep it simple, let's say that if I search for 'Smith' against the names in the index, the logic should be:
if input = exact 'Smith' then
    score = 100%
else if input = phonetic match then
    score = <depending on how fuzzily the input matches the name>%
end if
I'm able to search documents with a fuzziness of 1, but I don't know how to give them a custom score depending on how fuzzy the match is. Thanks!
Update:
I went through a post with the same requirement as mine, where the author mentions solving it using native scripts. My question still remains: how do I actually get a score based on the similarity distance so that it can be used in the native scripts?
The post for reference:
https://discuss.elastic.co/t/fuzzy-query-scoring-based-on-levenshtein-distance/11116
The text to look for in the post:
"For future readers I solved this issue by creating a custom score query and
writing a (native) script to handle the scoring."
You can implement this search logic using the function_score query (see the Elasticsearch docs).
Here is a possible example:
{
  "query": {
    "function_score": {
      "query": {
        "match": { "input": "Smith" }
      },
      "boost": "5",
      "functions": [
        {
          "filter": { "match": { "input.keyword": "Smith" } },
          "weight": 23
        }
      ]
    }
  }
}
In this example the mapping indexes the input field both as text and as keyword (input.keyword is used for the exact match). Documents that match the term "Smith" exactly are scored higher than the rest of the documents matched by the main query (here a plain match; in your case it would be the query with fuzziness).
You can control the size of the boost by tuning the weight parameter.
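For completeness, a sketch of the kind of mapping this answer assumes, with input indexed as text plus a keyword subfield for the exact match (the index name and the typeless 7.x mapping format are assumptions):

PUT /my_index
{
  "mappings": {
    "properties": {
      "input": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      }
    }
  }
}

If phonetic matching is also required, a further subfield using a phonetic analyzer (via the analysis-phonetic plugin) could be matched by another filtered function with its own weight, but that part is untested here.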

How to get only x results from elastic and then stop searching?

My whole index is about 700M docs. This query:
{
  "query": {
    "term": {
      "SOME_FIELD": "SOME_TERM"
    }
  },
  "size": 10
}
applies to ca. 5M docs. SOME_FIELD is indexed, not analysed.
The query takes about 1 s on an average Hetzner server. Too slow :) I don't care about pagination, sorting or scoring. I just want the first 10 "random" matching docs.
Is there a way to do it with scoring disabled, the "MySQL way"?
filter and constant_score do not help.
If you go with filters, that will remove the score computation and should provide faster query speeds:
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "SOME_FIELD": "SOME_TERM"
        }
      }
    }
  },
  "size": 10
}
If that's still too slow, you could consider using document routing, but it may not be a viable option for you as you might have just 1 shard or very few terms for SOME_FIELD.
I also suggest you go over the production deployment document by Elastic; it gives you an overview of how to configure your cluster optimally, and it can yield a serious performance boost if your cluster is currently misconfigured, e.g. running on a strong machine but keeping the default ES_HEAP_SIZE value.
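A rough illustration of the routing idea (index name, request paths and values are placeholders, shown in 7.x-style syntax): route each document by its SOME_FIELD value at index time, then pass the same routing value at search time so only the relevant shard is searched.

PUT /my_index/_doc/1?routing=SOME_TERM
{
  "SOME_FIELD": "SOME_TERM"
}

GET /my_index/_search?routing=SOME_TERM
{
  "query": {
    "term": { "SOME_FIELD": "SOME_TERM" }
  },
  "size": 10
}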
The option I was looking for is "terminate_after". Unfortunately it is not very well documented, see:
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-limit-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-count.html#_request_parameters
so, my query looks like this:
{
  "query": {
    "term": {
      "SOME_FIELD": "SOME_TERM"
    }
  },
  "size": 10,
  "terminate_after": 10
}
Don't use "10" instead of 10. Elastic does not cast it to integer and ignores the parameter

Elasticsearch to recommend book authors: how to limit maximum 3 books per author?

I use Elasticsearch to recommend authors (my Elasticsearch documents represent books, with a title, a summary and a list of author ids).
The user queries my index with some text (e.g. Georgia or Paris) and I need to aggregate the scores of individual books at the author level (meaning: recommend an author who writes about Paris).
I began with a simple aggregation; however, experimentally (cross-validation), it is best to stop aggregating the score for each author after at most 4 books per author. This way an author with 200 books cannot "dominate" the results. Let me explain in pseudocode:
# the aggregated score of each author
Map<Author, Double> author_scores = new Map()
# the number of books (hits) that contributed to each author
Map<Author, Integer> author_cnt = new Map()

# iterate over ES query results
for Document doc in hits:
    # stop aggregating once 4 books from this author have been counted
    if author_cnt.get(doc.author_id, default=0) < 4:
        author_scores.increment_by(doc.author_id, doc.score)
        author_cnt.increment_by(doc.author_id, 1)

the_result = author_scores.sort_map_by_value(reverse=true)
So far, I have implemented the above aggregation in custom application code, but I was wondering if it was possible to rewrite it using Elasticsearch's query DSL or org.elasticsearch.search.aggregations.Aggregator interface.
My opinion is that you cannot do this with the features ES offers out of the box. The closest thing I could find to your requirement is the "top_hits" aggregation. With it you run your query, aggregate on whatever you want, and then ask for only the top X hits ordered by some criterion.
For your particular scenario, the query is a "match" on "Paris", the aggregation is on author id, and you tell ES to return only the first 3 books per author, ordered by score. The good part is that ES gives you the best three books for each author, ordered by relevance, rather than all of them or none. The not-so-good part is that "top_hits" doesn't allow a sub-aggregation that would sum the scores of only those top hits, so you would still need to compute the per-author sum of scores yourself.
And a sample query:
{
  "query": {
    "match": {
      "title": "Paris"
    }
  },
  "aggs": {
    "top-authors": {
      "terms": {
        "field": "author_ids"
      },
      "aggs": {
        "top_books_hits": {
          "top_hits": {
            "sort": [
              {
                "_score": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "include": ["title"]
            },
            "size": 3
          }
        }
      }
    }
  }
}
