Lucene: "breaking" filter/query - filter

Are there any Lucene filters/queries that return the result immediately after finding the first match? Has anyone come across something like that?
TIA

So, the question is about performance.
A Lucene search proceeds in three steps:
1. Look up the terms you are searching for in the inverted index. Each term has a postings list (the set of documents that contain it). That's typically fast.
2. Apply the boolean query: combine those document sets using the boolean clauses (AND, OR, NOT...). Typically fast.
3. Compute the score of each matching document. This is where performance can suffer: it typically means iterating over all the documents found and computing a full-text score for each (vector space model or BM25).
If you are not interested in the full-text score, you can use the search method that takes another sort order and therefore lets you disable scoring:
    TopFieldDocs search(Query query,
                        int n,
                        Sort sort,
                        boolean doDocScores,
                        boolean doMaxScore)
You can use SortField.FIELD_DOC (which sorts by index order) as an arbitrary sort order.
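To make the steps above concrete, here is a toy sketch in Python. It is not Lucene code: the index contents and the function name `first_match` are made up for illustration. It looks up the postings lists, combines them with AND, skips scoring entirely, and stops at the first matching document.

```python
from typing import Dict, List, Optional

# Hypothetical toy inverted index: term -> sorted list of document ids.
INDEX: Dict[str, List[int]] = {
    "lucene": [1, 4, 7, 9],
    "filter": [2, 4, 9],
    "query": [4, 5, 9],
}


def first_match(terms: List[str], index: Dict[str, List[int]]) -> Optional[int]:
    """Return the smallest doc id that contains all terms, or None.

    No score is computed; the scan stops as soon as one document matches.
    """
    postings = [index.get(term, []) for term in terms]
    if not postings or any(not p for p in postings):
        return None
    # Drive the scan from the shortest list and probe the others.
    postings.sort(key=len)
    others = [set(p) for p in postings[1:]]
    for doc_id in postings[0]:
        if all(doc_id in other for other in others):
            return doc_id  # early exit: first match found
    return None
```

With the sample index, `first_match(["lucene", "filter", "query"], INDEX)` stops at document 4 without ever looking at document 9.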

Related

List items in some indices first in Elasticsearch search results

I'm scraping a few sites and relisting their products; each site has its own index in Elasticsearch. Some sites have affiliate programs, and I'd like to list those first in my search results.
Is there a way for me to "boost" results from a certain index?
Should I write a field hasAffiliate: true into ES when I'm scraping and then boost the query clauses that match that value? Or is there a better way?
With boost, it can be difficult to guarantee that they appear first in the search results. According to the official guide:
Practically, there is no simple formula for deciding on the “correct”
boost value for a particular query clause. It’s a matter of
try-it-and-see. Remember that boost is just one of the factors
involved in the relevance score
https://www.elastic.co/guide/en/elasticsearch/guide/current/query-time-boosting.html
It depends on the type of queries you are running, but here are a couple of other options:
A score function with weights: could be a more predictable option.
Simply using a sort by hasAffiliate (the easiest one).
Note: I'm not sure whether sorting by a boolean field is possible. If it is not, you could map hasAffiliate as a byte (the smallest integer type) and set it to 1 when true.
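As a rough sketch of both options, the request bodies could be built like this in Python. Only the field name hasAffiliate comes from the question; the function names are made up, the bodies are not sent to a cluster here, and the exact syntax accepted may depend on your Elasticsearch version.

```python
from typing import Any, Dict


def sort_affiliates_first(query: Dict[str, Any]) -> Dict[str, Any]:
    """Option 2: sort on hasAffiliate first, then on relevance."""
    return {
        "query": query,
        "sort": [
            {"hasAffiliate": {"order": "desc"}},  # affiliate documents first
            "_score",                             # then by relevance
        ],
    }


def weight_affiliates(query: Dict[str, Any], weight: float = 10.0) -> Dict[str, Any]:
    """Option 1: function_score that multiplies the score of affiliate documents."""
    return {
        "query": {
            "function_score": {
                "query": query,
                "functions": [
                    {"filter": {"term": {"hasAffiliate": True}}, "weight": weight}
                ],
                "boost_mode": "multiply",
            }
        }
    }
```

The sort variant guarantees the ordering; the weight variant only shifts scores, so a very relevant non-affiliate document can still come out on top.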

User defined termvectors in ElasticSearch

How (if at all possible) can one insert any term-vector in an ElasticSearch index?
ES computes term-vectors, behind the scenes, in order to carry out its text mining tasks, but it would be useful to be able to enter any list of (term, weight) pairs instead.
Why?
Well, for instance, though ES enables kNN (k-nearest-neighbors) for k=2, in the context of geographic proximity, it doesn't have any explicit k>2 functionality. If we were able to insert our own term-vectors, we could hack a k>2 functionality by harnessing ES's built in text-indexing methods.
Any indications on this issue?
As far as I know, there's no way to do that in Elasticsearch (I'm still looking for the fastest real-time KNN search approach myself, and Elasticsearch is one of my candidates).
Elasticsearch is based on an inverted index, so each term in the term vector (which may come from a sentence) is indexed in a sorted list. When we run a query, the query is analyzed into a term vector and Elasticsearch (Lucene, actually) searches the index for each term.
But KNN requires calculating the distance between two vectors even when they don't share any terms, and the traditional inverted index is not designed for that.
As you said, Elasticsearch can do real-time KNN search when k = 2 by using a geo query, but I don't think it can support k > 2.
By the way, if you have found any approach that supports real-time KNN search where K may be very large (100000?) on a huge data set (in number of vectors), please tell me, thanks :)
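For contrast, this is roughly what exact nearest-neighbour search over (term, weight) vectors looks like when done by brute force in plain Python. It is only a sketch with made-up names (`euclidean`, `knn`), not an Elasticsearch feature: every stored vector is compared with the query, including vectors that share no term with it, which is exactly what a term lookup in an inverted index does not do.

```python
import heapq
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]  # sparse (term -> weight) vector


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance; a term missing from a vector counts as weight 0."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def knn(query: Vector, docs: Dict[str, Vector], k: int) -> List[Tuple[str, float]]:
    """Return the k (doc id, distance) pairs closest to the query.

    Every document is scanned, so the cost grows linearly with the data set.
    """
    scored = ((doc_id, euclidean(query, vec)) for doc_id, vec in docs.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])
```

The linear scan is the reason this does not scale to huge data sets without some approximate index on top.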

Elasticsearch scoring

I'm using elasticsearch to find similar documents to a given document using the "more like this" query.
Is there an easy way to get the Elasticsearch score between 0 and 1 (using cosine similarity)?
Thanks!
You may want to look into the Function Score functionality of Elasticsearch, more specifically the script_score and field_value_factor functions. This will allow you to take the score from the default scoring (_score) and enhance or replace it in other ways. It really depends on what sort of boosting or transformation you'd like. The default scoring model takes the vector space model into account, but other things as well.
I don't think that's possible to retrieve directly.
But perhaps this workaround would make sense?
Elasticsearch returns max_score alongside the hits in the response.
You could divide each document's _score by max_score. The document with the highest score will then score 1, and documents that are less like the given one will score less.
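A minimal sketch of that workaround in Python, assuming the usual response layout with hits.max_score and hits.hits[]._score (the function name is made up). Note that this only rescales scores relative to the best hit of this particular query; it is not a cosine similarity.

```python
from typing import Any, Dict, List, Tuple


def normalized_scores(response: Dict[str, Any]) -> List[Tuple[str, float]]:
    """Return (document id, _score / max_score) pairs for a search response."""
    hits = response["hits"]
    max_score = hits.get("max_score")
    if not max_score:
        # max_score can be missing or null (e.g. when sorting by a field).
        return [(hit["_id"], 0.0) for hit in hits["hits"]]
    return [(hit["_id"], hit["_score"] / max_score) for hit in hits["hits"]]
```

For a response whose two hits score 2.5 and 1.25, this yields 1.0 and 0.5.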
Elasticsearch uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse document frequency and the vector space model but adds more-modern features like a coordination factor, field length normalization, and term or query clause boosting.

Difference between Elasticsearch Range Query and Range Filter

I want to query elasticsearch documents within a date range. I have two options now, both work fine for me. Have tested both of them.
1. Range Query
2. Range Filter
Since I have a small data set for now, I am unable to test the performance of both. What is the difference between the two, and which one would result in faster retrieval of documents and a faster response?
The main difference between queries and filters has to do with scoring. Queries return documents with a relative ranked score for each document. Filters do not. This difference allows a filter to be faster for two reasons. First, it does not incur the cost of calculating the score for each document. Second, it can cache the results as it does not have to deal with possible changes in the score from moment to moment - it's just a boolean really, does the document match or not?
From the documentation:
Filters are usually faster than queries because:
they don’t have to calculate the relevance _score for each document — 
the answer is just a boolean “Yes, the document matches the filter” or
“No, the document does not match the filter”.
the results from most filters can be cached in memory, making subsequent executions faster.
As a practical matter, the question is: do you use the relevance score in any way? If not, filters are the way to go. If you do, filters may still be of use, but only where they make sense. For instance, if you had a language field (let's say language: "EN" as an example) in your documents and wanted to query by language along with a relevance score, you would combine a query for the text search with a filter for language. The filter would cache the document ids for all documents in English and then the query could be applied to that subset.
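As a hedged sketch, that combination could be expressed as the following request body, built here as a Python dict. It uses the `bool` query with a `filter` clause, which is the Elasticsearch 2.0+ way of writing it; older versions used the `filtered` query instead. The field names `body` and `published` and the function name are assumptions for illustration; only `language` comes from the example above.

```python
from typing import Any, Dict


def search_body(text: str, language: str, date_from: str, date_to: str) -> Dict[str, Any]:
    """Scored full-text match, narrowed by unscored language and date filters."""
    return {
        "query": {
            "bool": {
                # Scored part: contributes to the relevance _score.
                "must": [{"match": {"body": text}}],
                # Unscored part: a plain yes/no per document.
                "filter": [
                    {"term": {"language": language}},
                    {"range": {"published": {"gte": date_from, "lte": date_to}}},
                ],
            }
        }
    }
```

If you do not need the score at all, you can drop the `must` clause and keep only the `filter` list.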
I'm oversimplifying a bit, but those are the basics. Good places to read up on this:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/query-dsl-filtered-query.html
http://exploringelasticsearch.com/searching_data.html
http://elasticsearch-users.115913.n3.nabble.com/Filters-vs-Queries-td3219558.html
Filters are cached so they are faster!
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/filter-caching.html

PHP MYSQL search engine using keywords

I am trying to implement a search engine based on keyword search.
Can anyone tell me which is the best (fastest) algorithm for implementing a keyword search?
What I need is:
My keywords:
search, faster, profitable
Their synonyms:
search: grope, google, identify, search
faster: smart, quick, faster
profitable: gain, profit
Now I should search all possible permutations of the above synonyms in a database to find the best-matching words.
The best solution would be to use an existing search engine, like Lucene or one of its alternatives (see "Which are the best alternatives to Lucene?").
Now, if you want to implement that yourself (it's really a great and exciting problem), you should have a look at the concept of the inverted index. That's what Google and other search engines use. Of course, they have a LOT of additional systems on top of it, but that's the basis.
The idea of an inverted index is that for each keyword (and its synonyms), you store the ids of the documents that contain the keyword. It's then very easy to look up the matching documents for a set of keywords, because you just compute the intersection (or the union, depending on what you want to do) of their lists in the inverted index. Example:
Let's assume that this is your inverted index:
smart: [42,35]
gain: [42]
profit: [55]
Now if you have a query "smart, gain", your matching documents are the intersection (or the union) of [42, 35] and [42].
To handle synonyms, you just need to extend your query to include all synonyms of the words in the initial query. Based on your example, a query for "faster, profitable" would become "smart, quick, faster, gain, profit, profitable".
Once you've implemented that, a nice improvement is to add TF-IDF weighting to your keywords. That's basically a way to weight rare words (programming) more than common ones (the).
The other approach is to just go through all your documents and find the ones that contain your words (or their synonyms). The inverted index will be MUCH faster though, because you don't have to go through all your documents every time. The time-consuming operation is building the index, which only has to be done once.
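Here is a small Python sketch of the inverted-index approach described above, using the sample index and the synonyms from the question. The function names are made up, and a real implementation would build the index from the documents instead of hard-coding it.

```python
from typing import Dict, List, Set

# Sample inverted index from the example: keyword -> ids of matching documents.
INVERTED_INDEX: Dict[str, Set[int]] = {
    "smart": {42, 35},
    "gain": {42},
    "profit": {55},
}

# Synonyms from the question.
SYNONYMS: Dict[str, List[str]] = {
    "search": ["grope", "google", "identify", "search"],
    "faster": ["smart", "quick", "faster"],
    "profitable": ["gain", "profit"],
}


def expand(keyword: str) -> Set[str]:
    """Return the keyword together with all of its synonyms."""
    return set(SYNONYMS.get(keyword, [])) | {keyword}


def docs_for(keyword: str) -> Set[int]:
    """Union of the postings of the keyword and its synonyms."""
    result: Set[int] = set()
    for word in expand(keyword):
        result |= INVERTED_INDEX.get(word, set())
    return result


def search(keywords: List[str], match_all: bool = True) -> Set[int]:
    """Intersect (match_all=True) or unite the documents found per keyword."""
    per_keyword = [docs_for(keyword) for keyword in keywords]
    if not per_keyword:
        return set()
    if match_all:
        return set.intersection(*per_keyword)
    return set.union(*per_keyword)
```

With this data, `search(["faster", "profitable"])` returns `{42}`, and passing `match_all=False` returns `{35, 42, 55}`.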

Resources