Does Algolia have a boost score feature like Elasticsearch? - elasticsearch

I need to rank results so that a document whose field matches a given value scores higher than other documents. Can Algolia do this?

To reflect the importance of one attribute relative to another, the way to go with Algolia is to order the attributes you want to search in the searchableAttributes index setting.
For instance, if you want to search in both title and description, but title is more important, you should go for:
searchableAttributes:
- title
- description
Compared to the boosting approach, this ensures that the number of match occurrences won't impact the overall ranking (a common issue in ES: are 4 words matching here and there in description better than 2 words matching exactly in title?).
With Algolia, the longest matching expression in each object (in terms of proximity between query words in the text) is always used to identify the best matching attribute, and the results are then sorted according to the importance of that attribute.
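As a rough illustration of the difference, the toy ranker below orders records by the position of the best matching attribute in searchableAttributes rather than by the number of matching words. It is a simplified sketch, not Algolia's actual ranking formula: the rank function, the sample records, and the whole-phrase substring match are all assumptions made for the example.

```python
# Toy illustration only: rank by the first attribute (in priority order)
# that contains the whole query, ignoring how many times words occur.
SEARCHABLE_ATTRIBUTES = ["title", "description"]  # most important first


def best_attribute_rank(record, query):
    """Return the index of the first attribute containing the query, or None."""
    needle = query.lower()
    for position, attribute in enumerate(SEARCHABLE_ATTRIBUTES):
        if needle in record.get(attribute, "").lower():
            return position
    return None


def rank(records, query):
    """Keep matching records and sort them by best matching attribute."""
    matches = [(best_attribute_rank(r, query), r) for r in records]
    matches = [(pos, r) for pos, r in matches if pos is not None]
    # sorted() is stable, so ties keep their original order.
    return [r for pos, r in sorted(matches, key=lambda pair: pair[0])]


records = [
    {"id": 1, "title": "Cooking basics",
     "description": "red shoes, red shoes and more red shoes"},
    {"id": 2, "title": "Red shoes", "description": "A pair of sneakers"},
]
ranked_ids = [r["id"] for r in rank(records, "red shoes")]
```

Record 2 comes first because it matches in title, even though record 1 repeats the phrase three times in description.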

Related

IDF recalculation for existing documents in index?

I have gone through Theory behind relevance scoring and have two related questions.
Q1: The IDF formula is idf(t) = 1 + log(numDocs / (docFreq + 1)), where numDocs is the total number of documents in the index. Does this mean that each time a new document is added to the index, we need to recalculate the IDF of each word for all existing documents in the index?
Q2: The link includes the statement below. Is there any reason why the TF/IDF score is calculated against each field instead of the complete document?
When we refer to documents in the preceding formulae, we are actually
talking about a field within a document. Each field has its own
inverted index and thus, for TF/IDF purposes, the value of the field
is the value of the document.
You only calculate the score at query time and not at insert time. Lucene has the right statistics to make this a fast calculation and the values are always fresh.
The frequency only really makes sense against a single field, since you are interested in the values for that specific field. Assume we have multiple fields and we search a single one; then we're only interested in the frequency of that one. When searching multiple fields, you still want control over the individual fields (such as boosting "title" over "body") or want to define how to combine them. If you have a use case where this doesn't make sense (I'm not sure I have a good example right now; it's IMO far less common), then you could combine multiple fields into one with copy_to and search on that.
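To make the answer to Q1 concrete, here is a minimal sketch using the formula from the question. It assumes a natural logarithm and a hypothetical FieldStats class that stands in for the per-field counters an index maintains; indexing only updates the counters, and the IDF is derived from them when a query runs.

```python
import math
from collections import Counter


class FieldStats:
    """Hypothetical stand-in for the statistics kept for one field."""

    def __init__(self):
        self.num_docs = 0
        self.doc_freq = Counter()  # term -> number of documents containing it

    def add_document(self, text):
        # Indexing only bumps counters; no stored score is touched.
        self.num_docs += 1
        for term in set(text.lower().split()):
            self.doc_freq[term] += 1

    def idf(self, term):
        # Formula from the question, evaluated at query time.
        return 1 + math.log(self.num_docs / (self.doc_freq[term] + 1))


title_stats = FieldStats()
title_stats.add_document("quick brown fox")
title_stats.add_document("quick red fox")
idf_before = title_stats.idf("brown")

title_stats.add_document("lazy dog")
idf_after = title_stats.idf("brown")  # reflects the new document automatically
```

Keeping one FieldStats per field mirrors the quoted statement that each field has its own inverted index.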

Configuring ElasticSearch relevance score to prefer a match on all words over a match with some words?

For example, with a search for "stack overflow" I want a document containing both "stack" and "overflow" to have a higher score than a document containing only one of those words.
Right now, I am seeing cases where a document that contains "stack" 0 times and "overflow" 50 times gets ranked above a document that contains "stack" 1 time and "overflow" 1 time.
A secondary concern is ranking documents higher that have the exact word as opposed to a word variant. For example, a document containing "stack" should be ranked higher than a document containing "stacking".
A third concern is ranking documents higher that have the words adjacent. For example a document "How to use stack overflow" should be ranked higher than a document "The stack of papers caused the inbox to overflow."
If you put those three concerns together, here is an example of the desired rank of results for "stack overflow":
Is it possible to configure an index or a query to calculate score this way?
Here you are trying to achieve multiple things in a single query. First, you should understand how ES produces the results it returns.
A document containing "overflow" 50 times gets ranked above a document that contains "stack" 1 time and "overflow" 1 time because ES scoring is based on TF/IDF. In this case, "overflow" occurs 50 times, which is far higher than the combined frequency of the two terms in the other document.
Note: You can disable this calculation, as mentioned in the link.
If you don’t care about how often a term appears in a field and all
you care about is that the term is present, then you can disable term
frequencies in the field mapping:
You are getting results containing the term "stacking" due to stemming. If you don't want documents containing "stacking" to appear in the search results, then don't index documents in stemmed form, or do some post-processing after getting the results from ES to reduce their score; I'm not sure whether ES provides this out of the box.
The third thing you want is a phrase search.
Also, use the explain API to understand how ES calculates the score of a document for your query. It will help you construct the right query for your requirements.
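One way to combine these ideas in a single request is a bool query, sketched below as a Python dictionary in the Elasticsearch query DSL. The field name body and the boost values are assumptions chosen for illustration; the right numbers depend on your data, so it is worth checking the outcome with the explain API.

```python
import json

QUERY_TEXT = "stack overflow"
FIELD = "body"  # hypothetical field name

search_body = {
    "query": {
        "bool": {
            # Any document containing at least one of the words can match.
            "must": [
                {"match": {FIELD: {"query": QUERY_TEXT}}}
            ],
            "should": [
                # Extra score when every word is present.
                {"match": {FIELD: {"query": QUERY_TEXT,
                                   "operator": "and",
                                   "boost": 2}}},
                # Extra score again when the words are adjacent.
                {"match_phrase": {FIELD: {"query": QUERY_TEXT,
                                          "boost": 3}}},
            ],
        }
    }
}

print(json.dumps(search_body, indent=2))
```

The should clauses never exclude a document; they only add to the score of documents that already satisfy the must clause.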

Does Elasticsearch score different length shingles with the same IDF?

In Elasticsearch 5.6.5 I'm searching against a field with the following filter applied:
"filter_shingle": {
    "max_shingle_size": "4",
    "min_shingle_size": "2",
    "output_unigrams": "true",
    "type": "shingle"
}
When I perform a search for depreciation tax against a document with that exact text, I see the following explanation of the score:
weight(Synonym(content:depreciation content:depreciation tax)) .... [7.65]
weight(content:tax) ... [6.02]
If I change the search to depreciation taffy against the exact same document with depreciation tax in the content I get this explanation:
weight(Synonym(content:depreciation content:depreciation taffy)) .... [7.64]
This is not what I expected. I thought a match on the bigram token for depreciation tax would get a much higher score than a match on the unigram depreciation. However, this scoring seems to reflect a simple unigram match. There is an extremely small difference, and digging further, this is because termFreq=28 under the depreciation taffy match and termFreq=29 under the depreciation tax match. I'm also not sure how this relates, as I imagine that across the shard holding this document there are very different counts for depreciation, depreciation tax and depreciation taffy.
Is this expected behavior? Is ES treating all the different sized shingles, including unigrams, with the same IDF value? Do I need to split out each shingle size into different sub fields with different analyzers to get the behavior I expect?
TL;DR
Shingles and synonyms are broken in Elastic/Lucene, and a lot of hacks need to be applied until a fix is released (accurate as of ES 6).
Put unigrams, bigrams and so on in individual subfields and search them separately, combining the scores for an overall match. Don't use a single shingle filter that produces multiple n-gram sizes on one field.
Don't combine a synonym filter and a shingle filter on the same field.
In my case I do a must match with synonyms on a unigram field, then a series of should matches, without synonyms, to boost the score on shingles of each size.
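A sketch of that layout, written as Python dictionaries in the Elasticsearch settings and query DSL. The analyzer, filter, and subfield names (bigrams_filter, content.bigrams, and so on) are assumptions for illustration; the shingle options are the same ones used in the question, with output_unigrams turned off so that each subfield holds a single shingle size. Synonym handling on the main field is left out to keep the example short.

```python
def shingle_filter(size):
    """Shingle filter that emits only n-grams of exactly `size` words."""
    return {
        "type": "shingle",
        "min_shingle_size": size,
        "max_shingle_size": size,
        "output_unigrams": False,
    }


SIZES = {"bigrams": 2, "trigrams": 3}  # hypothetical subfield names

analysis = {
    "filter": {f"{name}_filter": shingle_filter(n) for name, n in SIZES.items()},
    "analyzer": {
        f"{name}_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", f"{name}_filter"],
        }
        for name in SIZES
    },
}

# Unigrams live in the main field; each shingle size gets its own subfield.
content_mapping = {
    "type": "text",
    "fields": {
        name: {"type": "text", "analyzer": f"{name}_analyzer"} for name in SIZES
    },
}


def build_query(text):
    """Require a unigram match, then let each shingle subfield add score."""
    return {
        "bool": {
            "must": [{"match": {"content": text}}],
            "should": [{"match": {f"content.{name}": text}} for name in SIZES],
        }
    }


query = build_query("depreciation tax")
```

Because each subfield only contains tokens of one size, no two tokens share a position, which is the condition the details below identify as the trigger for the synonym-style scoring.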
Details
I got an answer on the elastic support forums:
https://discuss.elastic.co/t/does-elasticsearch-score-different-length-shingles-with-the-same-idf/126653/2
Yep, this is mostly expected.
It's not really the shingles causing the scoring oddness, but the fact
that SynonymQueries do the frequency blending behavior that you're
seeing. They use frequency of the original token for all the
subsequent 'synonym' tokens, as a way to help prevent skewing the
score results. Synonyms are often relatively rare, and would
drastically affect the scoring if they each used their individual
df's.
From the Lucene docs:
For scoring purposes, this query tries to score the terms as if you
had indexed them as one term: it will match any of the terms but only
invoke the similarity a single time, scoring the sum of all term
frequencies for the document.
The SynonymQuery also sets the docFrequency to the maximum
docFrequency of the terms in the document. So for example, if:
"deprecation"df == 5 "deprecation tax"df == 2, "deprecation taffy"df
== 1, it will use 5 as the docFrequency for scoring purposes.
The bigger issue is that Lucene doesn't have a way to differentiate
shingles from synonyms... they both use tokens that overlap the
position of other tokens in the token stream. So if unigrams are mixed
with bi-(or larger)-grams, Lucene is tricked into thinking it's
actually a synonym situation.
The fix is to keep your unigrams and bi-plus-grams in different
fields. That way Lucene won't attempt to use SynonymQueries in these
situations, because the positions won't be overlapping anymore.
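The blending described in the quote can be reproduced with a few lines of arithmetic. This is only a model of the behavior as quoted (summed term frequency, maximum document frequency), not Lucene's code. The document frequencies are the example values from the quote, spelled "depreciation" here to match the question; the per-term frequencies are assumptions chosen so that their sums equal the termFreq values reported in the question.

```python
def blended_stats(term_freqs, doc_freqs):
    """Combine overlapping tokens the way the quote describes.

    term_freqs: term -> occurrences in the document being scored
    doc_freqs:  term -> number of documents containing the term
    """
    terms = term_freqs.keys() & doc_freqs.keys()
    return {
        "term_freq": sum(term_freqs[t] for t in terms),
        "doc_freq": max(doc_freqs[t] for t in terms),
    }


doc_freqs = {"depreciation": 5, "depreciation tax": 2, "depreciation taffy": 1}

# Assumed split: the unigram occurs 28 times, the bigram "depreciation tax" once.
tax_stats = blended_stats(
    {"depreciation": 28, "depreciation tax": 1}, doc_freqs
)
taffy_stats = blended_stats(
    {"depreciation": 28, "depreciation taffy": 0}, doc_freqs
)
# Both searches are scored with the unigram's document frequency (5),
# so the bigram match only adds one occurrence to the term frequency.
```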
Here's another related question I asked, about how actual synonyms also get mangled when combined with shingles. https://discuss.elastic.co/t/es-5-4-synonyms-and-shingles-dont-seem-to-work-together/127552
Elastic/Lucene expands the synonym set, injects the synonyms into the token stream, then creates shingles. E.g. Query: econ supply and demand => econ, economics, supply, demand. Document: "... econ foo ..." => econ, foo. Now we get the shingle "econ economics" from the query, and somehow this matches the document. I have no idea why, since I only applied synonyms to the query, not the document, so I don't see where the match comes from. The way the shingles are created from the query is wrong too.
This is a known problem, and it is still not fully resolved. A number
of Lucene filters can't consume graphs as their inputs.
There is currently active work being done on developing a fixed
shingle filter, and also an idea to have a sub-field for indexing
shingles.

List items in some indices first in Elasticsearch search results

I'm scraping a few sites and relisting their products; each site has its own index in Elasticsearch. Some sites have affiliate programs, and I'd like to list those first in my search results.
Is there a way for me to "boost" results from a certain index?
Should I write a field hasAffiliate: true into ES when I'm scraping and then boost the query clauses that have that value? Or is there a better way?
With boost, it could be difficult to guarantee that they appear first in the search. According to the official guide:
Practically, there is no simple formula for deciding on the “correct”
boost value for a particular query clause. It’s a matter of
try-it-and-see. Remember that boost is just one of the factors
involved in the relevance score
https://www.elastic.co/guide/en/elasticsearch/guide/current/query-time-boosting.html
It depends on the type of queries you are running, but here are a couple of other options:
A score function with weights: this could be a more predictable option.
Simply sorting by hasAffiliate (the easiest one).
Note: I'm not sure whether sorting by a boolean field is possible; if it isn't, you could map hasAffiliate as an integer type such as byte (the smallest one), setting it to 1 when true.
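Both options can be sketched as request bodies in the Elasticsearch query DSL, written here as Python dictionaries. The name field, the query text, and the weight of 10 are assumptions for illustration. A weight only scales the score, so it does not strictly guarantee first place, whereas the sort does.

```python
import json

base_query = {"match": {"name": "running shoes"}}  # hypothetical field and text

# Option 1: multiply the score of affiliate documents by a fixed weight.
weighted_body = {
    "query": {
        "function_score": {
            "query": base_query,
            "functions": [
                {"filter": {"term": {"hasAffiliate": True}}, "weight": 10}
            ],
            "boost_mode": "multiply",
        }
    }
}

# Option 2: sort on the flag first, then by relevance within each group.
sorted_body = {
    "query": base_query,
    "sort": [
        {"hasAffiliate": {"order": "desc"}},
        {"_score": {"order": "desc"}},
    ],
}

print(json.dumps(weighted_body, indent=2))
print(json.dumps(sorted_body, indent=2))
```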

Solr Boosting Logic Concepts

I'm trying to understand boosting and if boosting is the answer to my problem.
I have an index and that has different types of data.
E.g. an index named Animals. One of the fields is animaltype. Its value can be Carnivorous, Herbivorous, etc.
Now, when we run a search query, I want to show results of type carnivorous at the top, and then the herbivorous type.
Also, would it be possible to show only, say, the top 3 results from one type and then the remaining results from other types?
Let's assume that for the herbivorous type we have a field named vegetables. It will have values only for the herbivorous animaltype.
Now, can it be possible to have boosting rules specified as follows:
Boost Levels:
animaltype:Carnivorous
then animaltype:Herbivorous and vegetablesfield: spinach
then animaltype:Herbivorous and vegetablesfield: carrot
etc. Basically, boosting on various fields at various levels. I'm new to this concept. It would be really helpful to get some input/guidance.
Thanks,
Kasturi Chavan
Your example is closer to sorting than boosting, as you have a priority list for how important each document is, while boosting (in Solr) is usually applied more fluidly, meaning that there is no hard line between documents of type X and type Y.
However, boosting with appropriately large values will in effect give you the same result, putting the documents into different score "areas" which will then give you the sort order you're looking for. You can see the score contributed by each term by appending debugQuery=true to your query. Boosting says that 'a document with this value is z times more important than those with a different value', but if a boosted document only contains low scoring tokens from the search (usually words that are very common), while other documents contain high scoring tokens (words that are infrequent), the latter documents might still be considered more important.
Example: Searching for "city paris", where most documents contain the word 'city', but only a few contain the word 'paris' (and those do not contain 'city'). Even if you boost all documents assigned to country 'germany', the score contributed by 'city' might still be lower, even with the boost factor, than what 'paris' contributes alone. This might not occur in real life, but you should know what the boost actually changes.
Using the edismax handler, you can apply the boost in two different ways: one is to use boost=, which is multiplicative; the other is to use bq= or bf=, which are additive. The difference is how the boost contributes to the end score.
For your example, the easiest way to get something similar to what you're asking, is to use bq (boost query):
bq=animaltype:Carnivorous^1000&
bq=animaltype:Herbivorous^10
These boosts will probably be large enough to move all documents matching these queries into their own buckets, without moving between groups. To create "different levels" as your example shows, you'll need to tweak these values (and remember, multiple boosts can be applied to the same document if something is both herbivorous and eats spinach).
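For reference, here is a small sketch of how such a request could be assembled in Python. The host, the core name animals, and the user query lion are hypothetical; the point is that bq can be repeated and that the boost values need URL encoding.

```python
from urllib.parse import urlencode

params = [
    ("q", "lion"),                 # hypothetical user query
    ("defType", "edismax"),
    ("bq", "animaltype:Carnivorous^1000"),
    ("bq", "animaltype:Herbivorous^10"),
    ("debugQuery", "true"),        # shows the score contributed by each clause
]

# A list of pairs (rather than a dict) lets the bq parameter appear twice.
query_string = urlencode(params)
url = "http://localhost:8983/solr/animals/select?" + query_string
print(url)
```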
A different approach would be to create a function query using query, if and similar functions to result in a single integer value that you can use as a sorting value. You can also calculate this value when indexing the document if it's static (which your example is), and then sort by that field instead. It will require you to reindex your documents if the sorting values change, but it might be an easy and effective solution.
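A minimal sketch of the index-time variant, assuming the levels from the question and a hypothetical priority field; the sample documents are made up.

```python
# Lower number = shown earlier. The levels follow the question's list.
def priority(doc):
    animal_type = doc.get("animaltype", "").lower()
    vegetables = [v.lower() for v in doc.get("vegetables", [])]
    if animal_type == "carnivorous":
        return 0
    if animal_type == "herbivorous" and "spinach" in vegetables:
        return 1
    if animal_type == "herbivorous" and "carrot" in vegetables:
        return 2
    return 3  # everything else


docs = [
    {"id": "rabbit", "animaltype": "Herbivorous", "vegetables": ["carrot"]},
    {"id": "lion", "animaltype": "Carnivorous"},
    {"id": "iguana", "animaltype": "Herbivorous", "vegetables": ["spinach"]},
]

# Store the value at index time in the hypothetical `priority` field,
# then sort on it in the query (sort=priority asc, score desc).
for doc in docs:
    doc["priority"] = priority(doc)

ordered_ids = [d["id"] for d in sorted(docs, key=lambda d: d["priority"])]
```

Sorting by the stored field first and by score second keeps relevance ordering inside each level.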
To achieve the "Top 3 results from a type", you're probably going to want to look at Result Grouping, which makes it possible to get "x documents" for each value in a single field. There is, as far as I know, no way to say "I want three of these at the top, then the rest from other values", except by issuing multiple queries (and excluding the three you've already retrieved from the second query). Usually, issuing multiple queries works just as well (or better) performance-wise.
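The multiple-query approach can be sketched on the client side as follows. The two hit lists stand in for the responses of two separate Solr queries and are made up; top_then_rest is a hypothetical helper, and the exclusion is done in Python here rather than in the second query.

```python
def top_then_rest(primary_hits, all_hits, top_n=3):
    """Put the first `top_n` primary hits on top, then everything else once."""
    top = primary_hits[:top_n]
    seen = {hit["id"] for hit in top}
    rest = [hit for hit in all_hits if hit["id"] not in seen]
    return top + rest


# Stand-ins for two query responses, each already ordered by score.
carnivores = [{"id": i} for i in ("lion", "tiger", "wolf", "fox")]
everything = [{"id": i} for i in ("cow", "lion", "fox", "tiger", "wolf")]

merged_ids = [hit["id"] for hit in top_then_rest(carnivores, everything)]
```

The fourth carnivore (fox) is not dropped; it simply falls back into the general ordering after the top three.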
