Solr Boosting Logic Concepts

I'm trying to understand boosting and whether boosting is the answer to my problem.
I have an index that holds different types of data.
E.g.: an index Animals. One of the fields is animaltype. Its value can be Carnivorous, Herbivorous, etc.
Now, when we run a search query, I want to show results of type Carnivorous at the top, and then the Herbivorous type.
Also, would it be possible to show only, say, the top 3 results from one type and then the remaining results from other types?
Let's assume that for the Herbivorous type we have a field named vegetables. It will have values only for the Herbivorous animaltype.
Now, is it possible to specify boosting rules as follows?
Boost Levels:
animaltype:Carnivorous
then animaltype:Herbivorous and vegetables:spinach
then animaltype:Herbivorous and vegetables:carrot
etc. Basically, boosting on various fields at various levels. I'm new to this concept. It would be really helpful to get some input/guidance.
Thanks,
Kasturi Chavan

Your example is closer to sorting than boosting, as you have a priority list for how important each document is, while boosting (in Solr) is usually applied more fluidly, meaning that there is no hard line between documents of type X and type Y.
However, boosting with appropriately large values will in effect give you the same result, putting the documents into different score "areas" which will then give you the sort order you're looking for. You can see the score contributed by each term by appending debugQuery=true to your query. Boosting says that 'a document with this value is z times more important than those with a different value', but if a document only contains low-scoring tokens from the search (usually words that are very common), while other documents contain high-scoring tokens (words that are infrequent), the latter documents might still be considered more important.
Example: Searching for "city paris", where most documents contain the word 'city', but only a few contain the word 'paris' (and do not contain 'city'). Even if you boost all documents assigned to country 'germany', the score contributed by 'city' might still be lower, even with the boost factor, than what 'paris' contributes alone. This might not occur in real life, but you should know what the boost actually changes.
Using the edismax handler, you can apply the boost in two different ways - one is to use boost=, which is multiplicative, or to use either bq= or bf=, which are additive. The difference is how the boost contributes to the end score.
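To make the additive/multiplicative distinction concrete, here is a minimal sketch in Python. The base scores and document names are made up for illustration, and the model is deliberately simplified: in real Solr, bq adds the score of the boost query clause rather than a raw number, so treat this as arithmetic intuition only.

```python
# Illustrative, made-up base scores (not real Solr output).
base_scores = {
    "doc_germany_city": 0.2,  # matches only the common word 'city'; in the boosted country
    "doc_paris": 3.0,         # matches the rare word 'paris'; not boosted
}

def additive(score, boost):
    """Simplified bq/bf style: the boost is added to the score."""
    return score + boost

def multiplicative(score, boost):
    """Simplified boost= style: the score is multiplied by the boost."""
    return score * boost

def winner(boost_fn, boost):
    """Return the id of the document that ranks first after boosting."""
    boosted = boost_fn(base_scores["doc_germany_city"], boost)
    if boosted > base_scores["doc_paris"]:
        return "doc_germany_city"
    return "doc_paris"

print(winner(additive, 1.0))          # small boost: the rare term still wins
print(winner(additive, 1000.0))       # large boost: the boosted document wins
print(winner(multiplicative, 10.0))   # 0.2 * 10 is still below 3.0
print(winner(multiplicative, 100.0))  # 0.2 * 100 overtakes 3.0
```

With a small boost the unboosted 'paris' document still ranks first, which is the situation described in the example above; only a sufficiently large boost pushes the boosted document into its own score range.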
For your example, the easiest way to get something similar to what you're asking, is to use bq (boost query):
bq=animaltype:Carnivorous^1000&
bq=animaltype:Herbivorous^10
These boosts will probably be large enough to move all documents matching these queries into their own buckets, without moving between groups. To create "different levels" as your example shows, you'll need to tweak these values (and remember, multiple boosts can be applied to the same document if something is both herbivorous and eats spinach).
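As a rough sketch, the request with both boost queries could be assembled like this in Python. The host, port and core name (`animals`) are assumptions, not something from the question; only the `defType`, `q`, `bq` and `debugQuery` parameters come from the discussion above.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name; adjust to your installation.
SOLR_SELECT = "http://localhost:8983/solr/animals/select"

def build_boosted_query(user_query):
    """Build a select URL that applies the two additive boost queries."""
    params = [
        ("defType", "edismax"),
        ("q", user_query),
        # bq may be repeated; each one adds to the score of matching documents.
        ("bq", "animaltype:Carnivorous^1000"),
        ("bq", "animaltype:Herbivorous^10"),
        # Ask Solr to explain how each document's score was computed.
        ("debugQuery", "true"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)

url = build_boosted_query("lion")
print(url)
```

Passing the parameters as a list of pairs (rather than a dict) is what allows `bq` to appear twice, and `urlencode` takes care of escaping the `:` and `^` characters.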
A different approach would be to create a function query using query, if and similar functions to result in a single integer value that you can use as a sorting value. You can also calculate this value when indexing the document if it's static (which your example is), and then sort by that field instead. It will require you to reindex your documents if the sorting values change, but it might be an easy and effective solution.
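A minimal sketch of the index-time variant, in Python: compute one integer per document before sending it to Solr, then sort on it. The field name `sort_rank`, the sample documents and the rank values are hypothetical; the levels mirror the ones listed in the question.

```python
def sort_rank(doc):
    """Map a document to a priority level; lower values sort first."""
    animaltype = doc.get("animaltype", "").lower()
    vegetables = [v.lower() for v in doc.get("vegetables", [])]
    if animaltype == "carnivorous":
        return 0
    if animaltype == "herbivorous" and "spinach" in vegetables:
        return 1
    if animaltype == "herbivorous" and "carrot" in vegetables:
        return 2
    return 3  # everything else

# Made-up sample documents.
docs = [
    {"id": "rabbit", "animaltype": "Herbivorous", "vegetables": ["carrot"]},
    {"id": "lion", "animaltype": "Carnivorous"},
    {"id": "cow", "animaltype": "Herbivorous", "vegetables": ["spinach"]},
    {"id": "bear", "animaltype": "Omnivorous"},
]

# Attach the rank before indexing (hypothetical field name).
for doc in docs:
    doc["sort_rank"] = sort_rank(doc)

# Locally, this reproduces the order a sort on that field would give.
ordered = [d["id"] for d in sorted(docs, key=lambda d: d["sort_rank"])]
print(ordered)
```

At query time you would then sort with something like `sort=sort_rank asc, score desc`, so that relevance only decides the order within each level.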
To achieve the "Top 3 results from a type" you're probably going to want to look at Solr's result grouping support, which makes it possible to get "x documents" for each value in a single field. There is, as far as I know, no way to say "I want three of these at the top, then the rest from other values", except for doing multiple queries (and excluding the three you've already retrieved from the second query). Usually, issuing multiple queries works just as well (or better) performance-wise.
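The two-query approach can be sketched as follows. This is a pure-Python simulation: the `search` function is a stand-in for a real Solr request, and the documents and scores are invented.

```python
def search(index, matches, rows=None, exclude_ids=()):
    """Stand-in for a Solr query: filter, sort by score, limit to `rows`."""
    hits = [d for d in index if matches(d) and d["id"] not in exclude_ids]
    hits.sort(key=lambda d: d["score"], reverse=True)
    return hits if rows is None else hits[:rows]

# Made-up documents with made-up relevance scores.
index = [
    {"id": 1, "animaltype": "Carnivorous", "score": 2.0},
    {"id": 2, "animaltype": "Carnivorous", "score": 5.0},
    {"id": 3, "animaltype": "Herbivorous", "score": 9.0},
    {"id": 4, "animaltype": "Carnivorous", "score": 1.0},
    {"id": 5, "animaltype": "Carnivorous", "score": 3.0},
    {"id": 6, "animaltype": "Herbivorous", "score": 4.0},
]

# Query 1: the top 3 documents of the preferred type.
top = search(index, lambda d: d["animaltype"] == "Carnivorous", rows=3)

# Query 2: everything else, excluding what query 1 already returned.
top_ids = {d["id"] for d in top}
rest = search(index, lambda d: True, exclude_ids=top_ids)

page = [d["id"] for d in top + rest]
print(page)
```

In a real setup the exclusion in the second query would typically be expressed as a negative filter on the ids returned by the first one.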

Related

How do "term not present" queries work in Lucene?

I have started reading about indexing in Lucene and sharding in Elasticsearch.
One thing I have not been able to understand is how queries like this one look up indexes:
field-x contains term1 but not term2
Does Lucene look up the stored field for this?
The data in a stored field could be relatively large (it could be the text of an entire book). How would you efficiently search that text for an "exclusion" term? By indexing it!
You've already done that, to support field-x contains term1. So, no, you would not use a stored field for this. Instead, you would just use the indexed data to find term2 - and remove those results from the term1 results.
(I'm not saying this is the exact algorithm Lucene uses, because there may be significant optimizations Lucene makes, behind the scenes. But it will not be using the contents of the stored field.)
Also, if your indexed data does not contain any stored fields, the query would still work. You can try that for yourself.
Stored fields are useful when presenting results. From the Field documentation:
StoredField: Stored-only value for retrieving in summary results
In reality you would probably never want to store a large amount of data (e.g. a complete book) in a stored field. You could store a summary of the data - and that would make it unsuitable for use by queries, anyway.
Another consideration: You might as well ask "how does field-x contains term1 and also term2 work?" It works the same way as the first example, except that you aren't removing the term2 results from the term1 results. Instead, you are finding the intersection of the two sets of results (if both terms are mandatory) or the union of the two sets (if both terms are optional), and so on.
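Conceptually, these combinations are plain set operations on the document ids that the index holds for each term. A minimal Python sketch with made-up postings follows; as noted above, this is not Lucene's actual algorithm, only the idea behind it.

```python
# Made-up postings for field-x: term -> ids of the documents containing it.
postings = {
    "term1": {1, 2, 3, 5},
    "term2": {2, 5, 8},
}

def docs_with(term):
    """Look up the indexed document ids for a term (empty set if unknown)."""
    return postings.get(term, set())

# "contains term1 but not term2": remove the term2 hits from the term1 hits.
term1_not_term2 = docs_with("term1") - docs_with("term2")

# Both terms mandatory: intersection.
both_terms = docs_with("term1") & docs_with("term2")

# Both terms optional: union.
either_term = docs_with("term1") | docs_with("term2")

print(sorted(term1_not_term2))
print(sorted(both_terms))
print(sorted(either_term))
```

No stored field content is read at any point; only the indexed term-to-document mapping is consulted.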

IDF recalculation for existing documents in index?

I have gone through "Theory behind relevance scoring" and have two related questions.
Q1: The IDF formula is idf(t) = 1 + log ( numDocs / (docFreq + 1)), where numDocs is the total number of documents in the index. Does this mean that each time a new document is added to the index, we need to recalculate the IDF of each word for all existing documents in the index?
Q2: The link contains the statement below. Is there any reason why the TF/IDF score is calculated against each field instead of the complete document?
When we refer to documents in the preceding formulae, we are actually
talking about a field within a document. Each field has its own
inverted index and thus, for TF/IDF purposes, the value of the field
is the value of the document.
You only calculate the score at query time and not at insert time. Lucene has the right statistics to make this a fast calculation and the values are always fresh.
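A small Python sketch of why nothing needs to be recalculated on insert: the index only keeps counters (numDocs and docFreq per term), and the IDF is derived from them whenever a query asks for it. The `TinyIndex` class is a hypothetical toy for a single field, not Lucene's data structure; the formula is the one quoted in the question.

```python
import math

def idf(num_docs, doc_freq):
    """idf(t) = 1 + log(numDocs / (docFreq + 1)), as quoted in the question."""
    return 1 + math.log(num_docs / (doc_freq + 1))

class TinyIndex:
    """Toy single-field index that stores only the statistics IDF needs."""

    def __init__(self):
        self.num_docs = 0
        self.doc_freq = {}  # term -> number of documents containing it

    def add(self, text):
        # Inserting a document only bumps counters; no scores are stored.
        self.num_docs += 1
        for term in set(text.lower().split()):
            self.doc_freq[term] = self.doc_freq.get(term, 0) + 1

    def idf(self, term):
        # Computed at query time from the current statistics.
        return idf(self.num_docs, self.doc_freq.get(term, 0))

index = TinyIndex()
index.add("city paris")
index.add("city berlin")
before = index.idf("paris")

index.add("city rome")
after = index.idf("paris")

print(before, after)
```

After the third document is added, the IDF of "paris" changes automatically on the next lookup, because it is derived from the updated counters rather than stored per document.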
The frequency only really makes sense against a single field since you are interested in the values for that specific field. Assume we have multiple fields and we search a single one, then we're only interested in the frequency of that one. Searching multiple ones you still want control over the individual fields (such as boosting "title" over "body") or want to define how to combine them. If you have a use-case where this doesn't make sense (not sure I have a good example right now — it's IMO far less common) then you could combine multiple fields into one with copy_to and search on that.

Elasticsearch filter only if no matches to first filter

My use case is searching UK addresses. There is a well-defined postal code system; however, my users may still make mistakes in the postcode. I want to use a filter because in most cases the user will get the postcode right and I do not want to make Elasticsearch work harder than it needs to, but I also want to avoid round trips from my application to ES.
I am using an edge n-gram analyzer as described in the docs, so, taking the postcode ME4 4NR as an example I have ME4 4NR, ME4 4N, ME4 4 and ME4 indexed. I want to first filter by ME4 4NR and only widen to ME4 4N if this yields no matches.
Can I achieve this in my ES query or do I need to implement this in my application logic? Any advice would be much appreciated. I could use a boolean filter with a must on the ME4 and shoulds on the others but I wondered if there is a better way?
I think you are over-complicating the matter a bit here. This if-this-then-that-else-something-else logic can be achieved with ES, but the cases where it is possible are limited. For example, in this question, the "else" part was a must in which the statement was a bool filter that first checked another must with a missing "condition". So, something must still be true in order for the other part of an "if-then-else" statement to be applied. It is not strictly a matter of doing this only if "a certain condition" is true or false, as in programming. You need to approach this the Elasticsearch way, not the programming way.
Your solution (a must on ME4 and shoulds on the others) is not necessary, imo. If the field's analyzer is set to an edge n-gram analyzer, then by default the same analyzer is used at indexing time and at search time. This means that, depending on the query/filter used, your input text will be analyzed before the search is performed.
For example, if you use a match query at search time, the input text you provide is analyzed. This means that if you input ME4 4N as the search text, ES will first edge n-gram the input text and then use the resulting tokens to search the inverted index. So there is no need to do this in your own code or to come up with multiple shoulds in your ES query.
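To illustrate what analyzing both sides means, here is a rough Python imitation of an edge n-gram analyzer. It is a simplification, not the Elasticsearch implementation: the minimum gram length of 3 and the stripping of trailing spaces are assumptions chosen to reproduce the token list given in the question.

```python
def edge_ngrams(text, min_gram=3):
    """Return the prefixes of `text` from `min_gram` characters upwards."""
    grams = []
    for end in range(min_gram, len(text) + 1):
        gram = text[:end].strip()  # assumption: drop trailing whitespace
        if gram not in grams:
            grams.append(gram)
    return grams

def shared_tokens(indexed_value, search_text):
    """Tokens that the analyzed search text has in common with the indexed value."""
    indexed = set(edge_ngrams(indexed_value))
    return [t for t in edge_ngrams(search_text) if t in indexed]

print(edge_ngrams("ME4 4NR"))
print(shared_tokens("ME4 4NR", "ME4 4N"))   # shortened input still matches
print(shared_tokens("ME4 4NR", "ME4 4NT"))  # a typo in the last character still shares prefixes
```

Because the search text is broken into the same prefixes, a mistyped postcode still overlaps with the indexed tokens on its leading part, while an exact postcode shares the most tokens.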
My suggestion here is to have a well-defined set of requirements set up properly first. Meaning, know what you want your search to do: think about the tokens that should be put in the inverted index and think about what users input. Decide if you need analysis at index time, but also at search time. Depending on this, think about the ways to use filters/queries at search time, meaning which ones analyze the input text and which don't (term doesn't, for example, while match does). Then, test your approaches and look at the performance. Don't assume something is putting more work on ES than it should, because you might be wrong. Test and compare the results, then start improving and coming up with other ideas.

Difference between Elasticsearch Range Query and Range Filter

I want to query elasticsearch documents within a date range. I have two options now, both work fine for me. Have tested both of them.
1. Range Query
2. Range Filter
Since I have a small data set for now, I am unable to test the performance of both. What is the difference between the two, and which one would result in faster retrieval of documents and a faster response?
The main difference between queries and filters has to do with scoring. Queries return documents with a relative ranked score for each document. Filters do not. This difference allows a filter to be faster for two reasons. First, it does not incur the cost of calculating the score for each document. Second, it can cache the results as it does not have to deal with possible changes in the score from moment to moment - it's just a boolean really, does the document match or not?
From the documentation:
Filters are usually faster than queries because:
they don’t have to calculate the relevance _score for each document: the answer is just a boolean “Yes, the document matches the filter” or “No, the document does not match the filter”.
the results from most filters can be cached in memory, making subsequent executions faster.
As a practical matter, the question is: do you use the relevance score in any way? If not, filters are the way to go. If you do, filters may still be of use, but should be used where they make sense. For instance, if you had a language field (let's say language: "EN" as an example) in your documents and wanted to query by language along with a relevance score, you would combine a query for the text search with a filter for language. The filter would cache the document ids for all documents in English, and then the query could be applied to that subset.
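As a sketch, such a combined request body could be built like this in Python. It assumes a newer Elasticsearch version (2.0 or later), where the filter clause of a bool query takes the place of the filtered query linked below; the field names `body`, `language` and `created_at` are hypothetical.

```python
import json

def build_search_body(text, language, date_from, date_to):
    """Scored text search restricted by non-scoring filters."""
    return {
        "query": {
            "bool": {
                # Contributes to the relevance score.
                "must": [{"match": {"body": text}}],
                # Yes/no conditions only: no scoring, results can be cached.
                "filter": [
                    # term is not analyzed, so the field must hold the exact value.
                    {"term": {"language": language}},
                    {"range": {"created_at": {"gte": date_from, "lte": date_to}}},
                ],
            }
        }
    }

body = build_search_body("inverted index", "EN", "2014-01-01", "2014-12-31")
print(json.dumps(body, indent=2))
```

The date range sits in the filter part because a document either falls inside the range or it does not; there is nothing to rank.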
I'm over simplifying a bit, but that's the basics. Good places to read up on this:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/query-dsl-filtered-query.html
http://exploringelasticsearch.com/searching_data.html
http://elasticsearch-users.115913.n3.nabble.com/Filters-vs-Queries-td3219558.html
Filters are cached so they are faster!
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/filter-caching.html

PHP MYSQL search engine using keywords

I am trying to implement a search engine based on keyword search.
Can anyone tell me which is the best (fastest) algorithm for implementing a keyword search?
What I need is:
My keywords:
search, faster, profitable
Their synonyms:
search: grope, google, identify, search
faster: smart, quick, faster
profitable: gain, profit
Now I should search all possible permutations of the above synonyms in a database to identify the best-matching words.
The best solution would be to use an existing search engine, like Lucene or one of its alternatives (see "Which are the best alternatives to Lucene?").
Now, if you want to implement that yourself (it's really a great and exciting problem), you should have a look at the concept of an inverted index. That's what Google and other search engines use. Of course, they have a LOT of additional systems on top of it, but that's the basis.
The idea of an inverted index is that for each keyword (and its synonyms), you store the ids of the documents that contain the keyword. It's then very easy to look up the matching documents for a set of keywords, because you just calculate an intersection (or a union, depending on what you want to do) of their lists in the inverted index. Example:
Let's assume this is your inverted index:
smart: [42,35]
gain: [42]
profit: [55]
Now if you have a query "smart, gain", your matching documents are the intersection (or the union) of [42, 35] and [42].
To handle synonyms, you just need to extend your query to include all synonyms for the words in the initial query. Based on your example, a query for "faster, profitable" would become "faster, smart, quick, profitable, gain, profit".
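Here is a minimal Python sketch of the lookup with synonym expansion, using the inverted index and synonym lists from above. Python is used only for illustration (the same logic ports directly to PHP), and the function names are made up.

```python
# Inverted index from the example: keyword -> ids of documents containing it.
inverted_index = {
    "smart": [42, 35],
    "gain": [42],
    "profit": [55],
}

# Synonym lists from the question.
synonyms = {
    "search": ["grope", "google", "identify", "search"],
    "faster": ["smart", "quick", "faster"],
    "profitable": ["gain", "profit"],
}

def expand(keyword):
    """The keyword itself plus all of its synonyms, without duplicates."""
    return [keyword] + [s for s in synonyms.get(keyword, []) if s != keyword]

def docs_for(keyword):
    """Ids of documents containing the keyword or any of its synonyms."""
    ids = set()
    for term in expand(keyword):
        ids.update(inverted_index.get(term, []))
    return ids

def search_all(keywords):
    """Documents matching every keyword (intersection)."""
    sets = [docs_for(k) for k in keywords]
    return set.intersection(*sets) if sets else set()

def search_any(keywords):
    """Documents matching at least one keyword (union)."""
    return set().union(*(docs_for(k) for k in keywords))

print(sorted(search_all(["smart", "gain"])))
print(sorted(search_all(["faster", "profitable"])))
print(sorted(search_any(["faster", "profitable"])))
```

Each lookup touches only the lists of the query terms and their synonyms, never the documents themselves. In a MySQL setup the same mapping would typically live in a table of (keyword, document id) rows.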
Once you've implemented that, a nice improvement is to add TF-IDF weighting to your keywords. That's basically a way to weight rare words (e.g. "programming") more heavily than common ones (e.g. "the").
The other approach is to just go through all your documents and find the ones that contain your words (or their synonyms). The inverted index will be MUCH faster though, because you don't have to go through all your documents every time. The time-consuming operation is building the index, which only has to be done once.

Resources