How (if at all possible) can one insert any term-vector in an ElasticSearch index?
ES computes term-vectors, behind the scenes, in order to carry out it's text mining tasks, but it would be useful to be able to enter any list of (term, weight) pairs instead.
Why?
Well, for instance, though ES enables kNN (k-nearest-neighbors) for k=2, in the context of geographic proximity, it doesn't have any explicit k>2 functionality. If we were able to insert our own term-vectors, we could hack a k>2 functionality by harnessing ES's built in text-indexing methods.
Any indications on this issue?
As far as I know, there's no way to do that by elasticsearch (I'm still looking for the fastest KNN real time search approach, elasticsearch is one of my choices).
Elasticsearch is based on inverted index, so each term in the term vector (which may comes from a sentence) will be indexed in a sorted list. When we're searching a query, the query will be analyzed into a term vector and elasticsearch (lucene actually) will search the indices for each term.
But KNN requires calculating the distance between two vectors even they don't share the same term, the traditional inverted index is not designed for this requirement.
As you have said, elasticsearch could implement the real time KNN search when k = 2 by geo query, but I don't think it could support k > 2.
By the way, if you have found any approach that could help implement real time KNN search that K may be a very large number ( 100000 ?) and on a huge data set (number of vectors), please tell me, thx :)
Related
I try to migrate my program from lucene based search to Elasticsearch and I put same collection of documents into lucene and Elasticsearch to do compare test. in ES, I used only one shard and change mapping for each filed's similarity to classic. but still I got different results when I use public TopDocs search(Query query, int n) to fetch N top score docs. when N is small, the results lucene based and ES has diff while when N is bigger and bigger, the diff ration is reduce. so I'd like to know what cause this diff and how can I achieve no diff.
below is my assumption to the problem:
since TopScoreCollector return topN docs while score is calculated based on segment.
so if N documents divide into different number of segments, the score will be influenced since tf/idf will changed.
this is my overall understanding and I'd like to know if my understanding is right or wrong.
I'm trying to use elastic search to do a fuzzy query for strings. According to this link (https://www.elastic.co/guide/en/elasticsearch/reference/1.6/common-options.html#fuzziness), the maximum fuzziness allowed is 2, so the query will only return results that are 2 edits away using the Levenshtein Edit Distance. The site says that Fuzzy Like This Query supports fuzzy searches with a fuzziness greater than 2, but so far using the Fuzzy Like This Query has only allowed me to search for results within two edits of the search. Is there any workaround for this constraint?
It looks like this was a bug which was fixed quite a while back. Which Elasticsearch version are you using?
For context, the reason why Edit Distance is now limited to [0,1,2] for most Fuzzy operations has to do with a massive performance improvement of fuzzy/wildcard/regexp matching in Lucene 4 using Finite State Transducers.
Executing a fuzzy query via an FST requires knowing the desired edit-distance at the time the transducer is constructed (at index-time). This was likely capped at an edit-distance of 2 to keep the FST size requirements manageable. But also possibly, because for many applications, an edit-distance of greater than 2 introduces a whole lot of noise.
The previous fuzzy query implementation required visiting each document to calculate edit distance at query-time and was impractical for large collections.
It sounds like Elasticsearch (1.x) is still using the original (non-performant) implementation for the FuzzyLikeThisQuery, which is why the edit-distance can increase beyond 2. However, FuzzyLikeThis has been deprecated as of 1.6 and won't be supported in 2.0.
I'm using elasticsearch to find similar documents to a given document using the "more like this" query.
Is there an easy way to get the elasticsearch scoring between 0 and 1 (using cosine similarity) ?
Thanks!
You may want to look into the Function Score functionality of Elasticsearch, more specifically the script_score and field_value_factor functions. This will allow you to take the score from default scoring (_score) and enhance or replace it in other ways. It really depends on what sort of boosting or transformation you'd like. The default scoring model takes into account the Vector model but other things as well .
I don't think that's possible to retrieve directly.
But perhaps this workaround would make sense?
Elasticsearch always bring back max_score in hits document.
You can potentially divide your document _score by max_score. Report with highest value will score as 1, documents, that are not so like given one, will score less.
The Elasticsearch uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse document frequency and the vector space model but adds more-modern features like a coordination factor, field length normalization, and term or query clause boosting.
I understand that a fundamental aspect of full-text search is the use of inverted indexes. So, with an inverted index a one-word query becomes trivial to answer. Assuming the index is structured like this:
some-word -> [doc385, doc211, doc39977, ...] (sorted by rank, descending)
To answer the query for that word the solution is just to find the correct entry in the index (which takes O(log n) time) and present some given number of documents (e.g. the first 10) from the list specified in the index.
But what about queries which return documents that match, say, two words? The most straightforward implementation would be the following:
set A to be the set of documents which have word 1 (by searching the index).
set B to be the set of documents which have word 2 (ditto).
compute the intersection of A and B.
Now, step three probably takes O(n log n) time to perform. For very large A and Bs that could make the query slow to answer. But search engines like Google always return their answer in a few milliseconds. So that can't be the full answer.
One obvious optimization is that since a search engine like Google doesn't return all the matching documents anyway, we don't have to compute the whole intersection. We can start with the smallest set (e.g. B) and find enough entries which also belong to the other set (e.g. A).
But can't we still have the following worst case? If we have set A be the set of documents matching a common word, and set B be the set of documents matching another common word, there might still be cases where A ∩ B is very small (i.e. the combination is rare). That means that the search engine has to linearly go through a all elements x member of B, checking if they are also elements of A, to find the few that match both conditions.
Linear isn't fast. And you can have way more than two words to search for, so just employing parallelism surely isn't the whole solution. So, how are these cases optimized? Do large-scale full-text search engines use some kind of compound indexes? Bloom filters? Any ideas?
As you said some-word -> [doc385, doc211, doc39977, ...] (sorted by rank, descending), I think the search engine may not do this, the doc list should be sorted by doc ID, each doc has a rank according to the word.
When a query comes, it contains several keywords. For each word, you can find a doc list. For all keywords, you can do merge operations, and compute the relevance of doc to query. Finally return the top ranked relevance doc to user.
And the query process can be distributed to gain better performance.
Even without ranking, I wonder how the intersection of two sets is computed so fast by google.
Obviously the worst-case scenario for computing the intersection for some words A, B, C is when their indexes are very big and the intersection very small. A typical case would be a search for some very common ("popular" in DB terms) words in different languages.
Let's try "concrete" and 位置 ("site", "location") in chinese and 極端な ("extreme") in japanese.
Google search for 位置 returns "About 1,500,000,000 results (0.28 seconds) "
Google search for "concrete" returns "About 2,020,000,000 results (0.46 seconds) "
Google search for "極端な" About 7,590,000 results (0.25 seconds)
It is extremly improbable that all three terms would ever appear in the same document, but let's google them:
Google search for "concrete 位置 極端な" returns "About 174,000 results (0.13 seconds)"
Adding a russian word "игра" (game)
Search игра: About 212,000,000 results (0.37 seconds)
Search for all of them: " игра concrete 位置 極端な " returns About 12,600 results (0.33 seconds)
Of course the returned search results are nonsense and they do not contain all the search terms.
But looking at the query time for the composed ones, I wonder if there is some intersection computed on the word indexes at all. Even if everything is in RAM and heavily sharded, computing the intersection of two sets with 1,500,000,000 and 2,020,000,000 entries is O(n) and can hardly be done in <0.5 sec, since the data is on different machines and they have to communicate.
There must be some join computation, but at least for popular words, this is surely not done on the whole word index. Adding the fact that the results are fuzzy, it seems evident that Google uses some optimization of kind "give back some high-ranked results, and stop after 0,5 sec".
How this is implemented, I don't know. Any ideas?
Most systems somehow implement TF-IDF in one way or another. TF-IDF is a product of functions term frequency and inverse document frequency.
The IDF function relates the document frequency to the total number of documents in a collection. The common intuition for this function says that it should give a higher value for terms that appear in few documents and lower value for terms that appear in all documents making them irrelevant.
You mention Google, but Google optimises search with PageRank (links in/out) as well as term frequency and proximity. Google distributes the data and uses Map/Reduce to parallelise operations - to compute PageRank+TF-IDF.
There's a great explanation of the theory behind this in Information Retrieval: Implementing Search Engines chapter 2. Another idea to investigate further is also to look how Solr implements this.
Google does not need to actually find all results, only the top ones.
The index can be sorted by grade first and only then by id. Since the same ID always has the same grade this does not hurt sets intersection time.
So google starts intersection until it finds 10 results , and then does a statistical estimation to tell you how many more results it found.
A worst case is almost impossible.
If all words are "common" then intersection will give the first 10 results very fast. If there is a rare word, then intersection is fast because complexity is O(N long M) where N is the smallest group.
You need to remember that google keeps it's indexes in memory and uses parallel computing.For example U can split the problem into two searches each searching only half of the web, and then marge result and take the best. Google has millions of computes
I am trying to implement search engine based on keywords search.
Can anyone tell me which is the best (fastest) algorithm to implement a search for key words?
What I need is:
My keywords:
search, faster, profitable
Their synonyms:
search: grope, google, identify, search
faster: smart, quick, faster
profitable: gain, profit
Now I should search all possible permutations of the above synonyms in a Database to identify the most matching words.
The best solution would be to use an existing search engine, like Lucene or one of its alternative ( see Which are the best alternatives to Lucene? ).
Now, if you want to implement that yourself (it's really a great and existing problem), you should have a look at the concept of Inverted Index. That's what Google and other search engines use. Of course, they have a LOT of additional systems on top of it, but that's the basic.
The idea of an inverted index, is that for each keyword (and synonyms), you store the id of the documents that contain the keyword. It's then very easy to lookup the matching documents for a set of keyword, because you just calculate an intersection (or an union depending on what you want to do) of their list in the inverted index. Example :
Let's assume that is your inverted index :
smart: [42,35]
gain: [42]
profit: [55]
Now if you have a query "smart, gain", your matching documents are the intersection (or the union) of [42, 35] and [42].
To handle synonyms, you just need to extend your query to include all synonyms for the words in the initial query. Based on your example, you query would become "faster, quick, gain, profit, profitable".
Once you've implemented that, a nice improvement is to add TFIDF weighting to your keywords. That's basically a way to weight rare words (programming) more than common ones (the).
The other approach is to just go through all your documents and find the ones that contain your words (or their synonyms). The inverted index will be MUCH faster though, because you don't have to go through all your documents every time. The time-consuming operation is building the index, which only has to be done once.