IDF recalculation for existing documents in index? - elasticsearch

I have gone through [Theory behind relevance scoring][1] and have two related questions.
Q1: The IDF formula is idf(t) = 1 + log(numDocs / (docFreq + 1)), where numDocs is the total number of documents in the index. Does this mean that each time a new document is added to the index, the IDF of every word has to be recalculated for all existing documents in the index?
Q2: The link contains the statement below. Is there any reason why the TF/IDF score is calculated per field instead of over the complete document?
When we refer to documents in the preceding formulae, we are actually
talking about a field within a document. Each field has its own
inverted index and thus, for TF/IDF purposes, the value of the field
is the value of the document.

You only calculate the score at query time and not at insert time. Lucene has the right statistics to make this a fast calculation and the values are always fresh.
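To illustrate why nothing has to be recalculated for existing documents, here is a minimal sketch that keeps only the two counters the formula from the question needs and computes IDF on demand. The class FieldStats and its method names are hypothetical and only model the idea; this is not how Lucene is implemented internally.

```python
import math


class FieldStats:
    """Toy per-field statistics: a document count and per-term document frequencies."""

    def __init__(self):
        self.num_docs = 0
        self.doc_freq = {}

    def add_document(self, terms):
        # Indexing only updates counters; no stored score is touched.
        self.num_docs += 1
        for term in set(terms):
            self.doc_freq[term] = self.doc_freq.get(term, 0) + 1

    def idf(self, term):
        # Formula from the question, evaluated at query time.
        return 1 + math.log(self.num_docs / (self.doc_freq.get(term, 0) + 1))


stats = FieldStats()
stats.add_document(["quick", "brown", "fox"])
stats.add_document(["lazy", "dog"])
idf_before = stats.idf("fox")

stats.add_document(["another", "dog"])
idf_after = stats.idf("fox")  # reflects the new document without any reindexing
```

Adding the third document changes the IDF of "fox" the next time it is asked for, because the value is derived from the current counters rather than stored per document.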
The frequency only really makes sense against a single field since you are interested in the values for that specific field. Assume we have multiple fields and we search a single one, then we're only interested in the frequency of that one. Searching multiple ones you still want control over the individual fields (such as boosting "title" over "body") or want to define how to combine them. If you have a use-case where this doesn't make sense (not sure I have a good example right now — it's IMO far less common) then you could combine multiple fields into one with copy_to and search on that.

Related

Elastic search calculation with data from different indexes

Good day, everyone. I have a use case for Elasticsearch that is a little bit unusual for me.
There are two different indexes, and each index contains one data type.
The first type contains the following data that matters for this case:
keyword (text,keyword),
URL (text,keyword)
position (number).
The second type contains the following data fields:
keyword (text,keyword)
numberValue (number).
I need to do the following:
1. Group data from the first index by URL.
2. For each object in a group, calculate a new metric (metric A) with this simple formula: position*numberValue*Param
3. For each group, calculate the sum of the metric A values calculated in step 2.
4. Order the resulting groups in descending order by the sums calculated in step 3.
5. Take some interval of the resulting groups.
Param is a parameter I need to set for the calculation; it is not stored in Elasticsearch.
This is not a difficult algorithm, but the data is in different indices, and I don't know how to do it fast. I would prefer to do it at the Elasticsearch level.
I don't know how to build an effective search or data processing pipeline that would help me implement this case.
I use ES version 6.2.3, if that is important.
Please give me some advice on how I can implement this algorithm.
From step 2, you seem to assume that keyword is some sort of primary key. Elasticsearch is not a relational database and can only reason over one document at a time, so unless numberValue and position are (indexed) fields of the same document, you can't combine them.
The rest of the steps seem achievable with aggregations.
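For reference, the five steps from the question can be sketched in plain Python once both values are available for each row. This is a client-side illustration of the arithmetic only, not an Elasticsearch query; the function rank_urls, its parameters, and the sample data are hypothetical. It assumes the second index has been loaded into a keyword-to-numberValue dictionary.

```python
from collections import defaultdict


def rank_urls(rows, number_values, param, offset=0, size=10):
    """Group rows by URL, sum position*numberValue*param, sort descending, slice."""
    totals = defaultdict(float)
    for row in rows:
        number_value = number_values.get(row["keyword"])
        if number_value is None:
            continue  # no matching keyword in the second data set
        totals[row["url"]] += row["position"] * number_value * param
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[offset:offset + size]


rows = [
    {"keyword": "a", "url": "u1", "position": 2},
    {"keyword": "b", "url": "u1", "position": 1},
    {"keyword": "a", "url": "u2", "position": 5},
]
number_values = {"a": 10, "b": 3}
result = rank_urls(rows, number_values, param=2)
```

If numberValue were copied into each document of the first index at index time, the same grouping, summing and ordering could be expressed with aggregations instead.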

Solr Boosting Logic Concepts

I'm trying to understand boosting and if boosting is the answer to my problem.
I have an index and that has different types of data.
E.g.: an index named Animals. One of the fields is animaltype. Its value can be Carnivorous, Herbivorous, etc.
Now when we query in search, I want to show results of type Carnivorous at the top, and then the Herbivorous type.
Also, would it be possible to show only, say, the top 3 results from one type and then the remaining results from other types?
Let's assume that for the Herbivorous type we have a field named vegetables. It will have values only for the Herbivorous animaltype.
Now, would it be possible to have boosting rules specified as follows:
Boost Levels:
animaltype:Carnivorous
then animaltype:Herbivorous and vegetablesfield: spinach
then animaltype:Herbivorous and vegetablesfield: carrot
etc. Basically, boosting on various fields at various levels. I'm new to this concept. It would be really helpful to get some input/guidance.
Thanks,
Kasturi Chavan
Your example is closer to sorting than boosting, as you have a priority list for how important each document is, while boosting (in Solr) is usually applied more fluidly, meaning that there is no hard line between documents of type X and type Y.
However - boosting with appropriately large values will in effect give you the same result, putting the documents into different score "areas" which will then give you the sort order you're looking for. You can see the score contributed by each term by appending debugQuery=true to your query. Boosting says that 'a document with this value is z times more important than those with a different value', but if the document only contains low scoring tokens from the search (usually words that are very common), while other documents contain high scoring tokens (words that are infrequent), the latter document might still be considered more important.
Example: Searching for "city paris", where most documents contain the word 'city', but only a few contain the word 'paris' (and those do not contain 'city'). Even if you boost all documents assigned to country 'germany', the score contributed by 'city' might still be lower, even with the boost factor, than what 'paris' contributes alone. This might not occur in real life, but you should know what the boost actually changes.
Using the edismax handler, you can apply the boost in two different ways - one is to use boost=, which is multiplicative, or to use either bq= or bf=, which are additive. The difference is how the boost contributes to the end score.
For your example, the easiest way to get something similar to what you're asking, is to use bq (boost query):
bq=animaltype:Carnivorous^1000&
bq=animaltype:Herbivorous^10
These boosts will probably be large enough to move all documents matching these queries into their own buckets, without moving between groups. To create "different levels" as your example shows, you'll need to tweak these values (and remember, multiple boosts can be applied to the same document if something is both herbivorous and eats spinach).
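As a small sketch of how the repeated bq parameters could be assembled on the client side, the snippet below builds a URL-encoded query string. The helper build_boost_params is hypothetical; q, defType, bq and debugQuery are the Solr request parameters discussed above, and the weights are just the example values.

```python
from urllib.parse import parse_qs, urlencode


def build_boost_params(user_query, boosts):
    """Return a URL-encoded edismax query string with one bq per boost."""
    params = [("q", user_query), ("defType", "edismax"), ("debugQuery", "true")]
    for field_query, weight in boosts:
        # bq may be repeated; each one adds to the score of matching documents.
        params.append(("bq", f"{field_query}^{weight}"))
    return urlencode(params)


query_string = build_boost_params(
    "*:*",
    [("animaltype:Carnivorous", 1000), ("animaltype:Herbivorous", 10)],
)
print(parse_qs(query_string)["bq"])
```

A list of tuples is used instead of a dictionary so that the same parameter name can appear more than once.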
A different approach would be to create a function query using query, if and similar functions to result in a single integer value that you can use as a sorting value. You can also calculate this value when indexing the document if it's static (which your example is), and then sort by that field instead. It will require you to reindex your documents if the sorting values change, but it might be an easy and effective solution.
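A rough sketch of the index-time variant: compute one integer per document before sending it to Solr and sort ascending on that field. The function sort_priority and the numeric levels are hypothetical, and the field names animaltype and vegetables are taken from the question.

```python
def sort_priority(doc):
    """Map a document to a sort bucket; lower values should sort first."""
    animal_type = doc.get("animaltype", "").lower()
    vegetables = {v.lower() for v in doc.get("vegetables", [])}
    if animal_type == "carnivorous":
        return 0
    if animal_type == "herbivorous" and "spinach" in vegetables:
        return 1
    if animal_type == "herbivorous" and "carrot" in vegetables:
        return 2
    return 3  # everything else goes last


docs = [
    {"id": "rabbit", "animaltype": "Herbivorous", "vegetables": ["carrot"]},
    {"id": "lion", "animaltype": "Carnivorous"},
    {"id": "goat", "animaltype": "Herbivorous", "vegetables": ["spinach"]},
]
ordered_ids = [d["id"] for d in sorted(docs, key=sort_priority)]
```

Storing the result in its own field keeps the query simple, at the cost of reindexing whenever the rules change.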
To achieve the "Top 3 results from a type" you're probably going to want to look at Result grouping support, which makes it possible to get "x documents" for each value in a single field. There is, as far as I know, no way to say "I want three of these at the top, then the rest from other values", except for doing multiple queries (and excluding the three you've already retrieved from the second query). Usually, issuing multiple queries performs just as well (or better).

Performance of Elastic Time Range Queries against Out of Range Indexes

It is common to have elastic indices with dates, in particular from something like logstash.
So for example, you have indices like foo-2016.05.01, foo-2016.05.02, etc...
When doing a time range query for data, what is the cost of querying indexes that I already know won't have data for that time range?
For example, suppose the time range query only asks for data from 2016.05.02 but I also include the foo-2016.05.01 index in my query.
Is that basically a quick one-op per index where the index knows it has no data in that date range, or will doing this be costly to performance? I'm hoping not only to know the yes/no answer, but to understand why it behaves the way it does.
Short version: it's probably going to be expensive. The cost will be n where n is the number of distinct field values for the date data. If all entries in the index had an identical date field value, it'd be a cheap query of 1 check (and would be pointless since it'd be a binary "all or nothing" response at that point). Of course, the reality is usually that every single doc has a unique date field value (which is incrementing such as in a log), depending on how granular the date is (assuming here that the time is included to seconds or milliseconds). Elasticsearch will check each aggregated, unique date field value of the included indices to try and find documents that match on the field by satisfying the predicates of the range query. This is the nature of the inverted index (indexing documents by their fields).
An easy way to improve performance is to change the Range Query to a Range Filter, which caches results and improves performance for requests beyond the first one. Of course, this is only valuable if you're repeating the same range filter over time (the cache is read more than it is written), and if the range is not part of scoring the documents (that is to say, those in range are not more valuable than those not in range when returning a set of both, also known as "boosting").
Another way to improve performance is by convention. If you query by day, store each day in its own rolling index and then do pre-search logic to select the indexes to query. This eliminates the need for the filter or query entirely.
Elasticsearch doesn't care about the index name (that includes the date) and it doesn't automagically exclude that index from your range query. It will query all the shards (a copy - be it replica or primary) of all the indices specified in the query. Period.
Kibana, on the other hand, knows based on the time range selected to query specific indices only.
If you know your range will not make sense on some indices, then exclude those from the query before creating the query.
A common approach for the logging use case, when the current day is the most frequently queried, is to create an alias. Give it a meaningful name, like today, that always points to today's index. Retention periods are also common with time-based indices. For these two tasks, managing the aliases and deleting the "expired" indices, you can use Curator.
If you mostly care about the current day, use that alias; that way the earlier days are left out of the query.
If not, filter the indices based on the range before deciding which indices to run the query against.
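The pre-search step of picking indices from the range can be sketched as follows. The function indices_for_range is hypothetical; it assumes one index per day named with the prefix-YYYY.MM.DD pattern from the question.

```python
from datetime import date, timedelta


def indices_for_range(prefix, start, end):
    """List the daily index names covering start..end (both inclusive)."""
    day_count = (end - start).days + 1
    return [
        f"{prefix}-{start + timedelta(days=offset):%Y.%m.%d}"
        for offset in range(day_count)
    ]


names = indices_for_range("foo", date(2016, 5, 1), date(2016, 5, 3))
# Joined with commas, the names can be used as the index part of a search request.
index_path = ",".join(names)
```

An empty list comes back when end is before start, so the caller can skip the search entirely in that case.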

Elasticsearch: Modifying Field Normalization at Query Time (omit_norms in queries)

Elasticsearch takes the length of a document into account when ranking (they call this field normalization). The default behavior is to rank shorter matching documents higher than longer matching documents.
Is there any way to turn off or modify field normalization at query time? I am aware of the index-time omit_norms option, but I would prefer not to reindex everything to try this out.
Also, instead of simply turning off field normalization, I wanted to try out a few things. I would like to take field length into account, but not as heavily as elasticsearch currently does. With the default behavior, a document will rank 2 times higher than a document which is two times longer. I wanted to try a non-linear relationship between ranking and length.

Difference between Elasticsearch Range Query and Range Filter

I want to query elasticsearch documents within a date range. I have two options now, both work fine for me. Have tested both of them.
1. Range Query
2. Range Filter
Since I have a small data set for now, I am unable to test the performance of both of them. What is the difference between these two, and which one would result in faster retrieval of documents and a faster response?
The main difference between queries and filters has to do with scoring. Queries return documents with a relative ranked score for each document. Filters do not. This difference allows a filter to be faster for two reasons. First, it does not incur the cost of calculating the score for each document. Second, it can cache the results as it does not have to deal with possible changes in the score from moment to moment - it's just a boolean really, does the document match or not?
From the documentation:
Filters are usually faster than queries because:
They don’t have to calculate the relevance _score for each document: the answer is just a boolean “Yes, the document matches the filter” or “No, the document does not match the filter”.
The results from most filters can be cached in memory, making subsequent executions faster.
As a practical matter, the question is: do you use the relevance score in any way? If not, filters are the way to go. If you do, filters may still be of use but should be applied where they make sense. For instance, if you had a language field (let's say language: "EN" as an example) in your documents and wanted to query by language along with a relevance score, you would combine a query for the text search with a filter for language. The filter would cache the document ids for all documents in English and then the query could be applied to that subset.
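A sketch of what that combination could look like as a request body, written as a Python dictionary. It uses the bool query with a filter clause from newer Elasticsearch versions rather than the filtered query described in the older docs linked here; the function build_search and the field name body are hypothetical.

```python
import json


def build_search(text, language):
    """Scored full-text match restricted by a non-scoring language filter."""
    return {
        "query": {
            "bool": {
                # Contributes to the relevance score.
                "must": [{"match": {"body": text}}],
                # Yes/no only; does not affect the score and can be cached.
                "filter": [{"term": {"language": language}}],
            }
        }
    }


request_body = build_search("relevance scoring", "EN")
print(json.dumps(request_body, indent=2))
```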
I'm oversimplifying a bit, but those are the basics. Good places to read up on this:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/query-dsl-filtered-query.html
http://exploringelasticsearch.com/searching_data.html
http://elasticsearch-users.115913.n3.nabble.com/Filters-vs-Queries-td3219558.html
Filters are cached so they are faster!
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/filter-caching.html
