The texts I search (and the queries themselves) contain 11 words on average (up to about 25). I want my query to return a match only if at least half of the words in the query are matched in the text.
For example, this is what my initial Lucene query looks like (for simplicity it has only 4 words):
jakarta~ apache~ lucene~ stackoverflow~
It returns a match if at least one of the words is fuzzy matched, but I want it to return a match only if at least two (half of 4) of the words are fuzzy matched.
Is it possible in Lucene?
I could split my query like this (OR is default operator in Lucene):
(jakarta~ apache~) AND (lucene~ stackoverflow~)
But that wouldn’t return a match if both jakarta and apache are matched but neither lucene nor stackoverflow is.
I could change my query to:
(jakarta~ AND apache~) (jakarta~ AND lucene~) (jakarta~ AND stackoverflow~)
(apache~ AND lucene~) (apache~ AND stackoverflow~) (lucene~ AND stackoverflow~)
Would that be efficient? On average my expression would consist of 462 AND clauses (binomial coefficient of 11 and 6), in the worst case of 5200300 AND clauses (binomial coefficient of 25 and 13).
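The clause counts above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name clause_count and its percent parameter are made up for illustration.

```python
from math import comb


def clause_count(num_words: int, percent: int = 50) -> int:
    """Number of AND clauses needed to enumerate every minimal matching subset.

    'percent' is the required share of matched words; the required number of
    words is rounded up, so 50% of 11 words means 6 words.
    """
    required = -(-num_words * percent // 100)  # integer ceiling division
    return comb(num_words, required)


print(clause_count(4))   # 6 pairs, as in the query above
print(clause_count(11))  # 462
print(clause_count(25))  # 5200300
```

The count grows combinatorially with the query length, which is why enumerating the clauses by hand does not scale.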
If it is not possible (or doesn’t make sense performance-wise) to do this in Lucene, is it possible in Elasticsearch or Solr?
It should work fast (<= 0.5 sec/search) for at least 10 000 texts in the database.
It would be even better if I could easily later change the minimum matches percentage (e.g. 40% instead of 50%) but I may not need this.
All three options support a minimum should match functionality among optional query clauses.
Lucene: Set in BooleanQueries via the BooleanQuery.Builder.setMinimumNumberShouldMatch method.
Solr: The DisMax mm parameter.
Elasticsearch: The minimum_should_match parameter, in Bool queries, Multi Match queries, etc.
In Solr, you can use the minimum should match (mm) parameter with DisMax and eDisMax to specify the percentage of clauses that must match.
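For the Elasticsearch variant, here is a minimal sketch of what the request body could look like, built as a plain Python dict. The field name "text" and the helper build_min_match_query are assumptions for illustration. The fuzzy query does not analyze its input, so the words are assumed to be lowercased already.

```python
import json


def build_min_match_query(field, words, min_match="50%"):
    """Build a bool query with one optional fuzzy clause per word.

    'field' is a hypothetical field name; 'min_match' is passed through
    unchanged as minimum_should_match.
    """
    should = [
        {"fuzzy": {field: {"value": word, "fuzziness": "AUTO"}}}
        for word in words
    ]
    return {"query": {"bool": {"should": should,
                               "minimum_should_match": min_match}}}


body = build_min_match_query(
    "text", ["jakarta", "apache", "lucene", "stackoverflow"])
print(json.dumps(body, indent=2))
```

Changing the threshold later only means passing a different value such as "40%". Note that Elasticsearch rounds the number computed from a percentage down, so "50%" of 11 clauses requires 5 matches, not 6; pass an explicit integer if you need to round up.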
I'm trying to use Elasticsearch to do a fuzzy query for strings. According to this link (https://www.elastic.co/guide/en/elasticsearch/reference/1.6/common-options.html#fuzziness), the maximum fuzziness allowed is 2, so the query will only return results that are at most 2 edits away by Levenshtein edit distance. The site says that the Fuzzy Like This Query supports fuzzy searches with a fuzziness greater than 2, but so far the Fuzzy Like This Query has only let me search for results within two edits of the search term. Is there any workaround for this constraint?
It looks like this was a bug which was fixed quite a while back. Which Elasticsearch version are you using?
For context, the reason why Edit Distance is now limited to [0,1,2] for most Fuzzy operations has to do with a massive performance improvement of fuzzy/wildcard/regexp matching in Lucene 4 using finite-state automata.
Executing a fuzzy query this way means building a Levenshtein automaton for the query term and the desired edit-distance, then intersecting it with the term dictionary. The supported edit-distance was likely capped at 2 to keep the size of the precomputed automaton tables manageable. But also possibly because, for many applications, an edit-distance greater than 2 introduces a whole lot of noise.
The previous fuzzy query implementation required visiting each term in the index to calculate edit distance at query-time and was impractical for large collections.
It sounds like Elasticsearch (1.x) is still using the original (non-performant) implementation for the FuzzyLikeThisQuery, which is why the edit-distance can increase beyond 2. However, FuzzyLikeThis has been deprecated as of 1.6 and won't be supported in 2.0.
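To make the brute-force approach concrete (computing the edit distance for every candidate at query time), here is a minimal sketch in plain Python. It is not Lucene's or Elasticsearch's code; edit_distance and fuzzy_scan are illustrative names.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def fuzzy_scan(candidates, query, max_edits):
    """Check every candidate: the cost grows with the number of candidates."""
    return [c for c in candidates if edit_distance(c, query) <= max_edits]


print(fuzzy_scan(["lucene", "lucent", "solr"], "lucene", 2))
```

A scan like this accepts any max_edits, which is the flexibility the old implementation had, but it touches every candidate on every query.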
How can I omit the field-length norm at search time? I have some documents I want to search, and I want extraneous words to be ignored so that a query like "parrot" matches "multiple parrots" just as well.
So, how can I ignore the field length norm at search time?
I had the same problem, though I think for efficiency reasons this is normally done at index time. Instead of using tf-idf I used BM25 (which is supposedly better). BM25 has a coefficient (b) on the field-length normalization term which can be set to 0 so that field length doesn't affect the score.
https://stackoverflow.com/a/38362244/3071643
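To see why setting that coefficient to 0 removes the length effect, here is a sketch of the commonly cited BM25 per-term formula. The function name and the default values k1=1.2 and b=0.75 are assumptions for illustration, and the exact IDF variant differs between implementations.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs,
                    k1=1.2, b=0.75):
    """Score contribution of one term in one document (textbook BM25)."""
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # With b = 0 this factor is exactly 1, so doc_len drops out of the score.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


short = bm25_term_score(tf=1, doc_len=2, avg_doc_len=10, doc_freq=5,
                        num_docs=100, b=0)
long_ = bm25_term_score(tf=1, doc_len=200, avg_doc_len=10, doc_freq=5,
                        num_docs=100, b=0)
print(short == long_)
```

With the default b, the shorter document would score higher for the same term frequency.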
I am currently using Lucene to search a large number of documents.
Most commonly it is being searched on the name of the object in the document.
I am using the StandardAnalyzer with an empty list of stop words. This means words like 'and' will be searchable.
The search term looks like this: (+keys:bunker +keys:s*)(keys:0x000bunkers*)
The 0x000 is a prefix to make sure that the match comes higher up the list of results.
The 'keys' field also contains other information, like postcode.
So a document must match at least one of those clauses.
With the background done, on to the main problem.
For some reason, when I search for a term with a single character, whether it is just 's' or bunker 's', it takes around 1.7 seconds, compared to say 'bunk', which takes less than 0.5 seconds.
I have sorting enabled; I have tried with and without it and it makes no difference. I have also tried with and without the prefix.
Just wondering if anyone else has come across anything like this, or will have any inkling of why it would do this.
Thank you.
The most commonly used terms in your index will be the slowest terms to search on.
You're using StandardAnalyzer which does not remove any stop words. Further, it splits words on punctuation, so John's is indexed as two terms John and s. These splits are likely creating a lot of occurrences of s in your index.
The more occurrences of a term in your index, the more work Lucene has to do at search-time. A term like bunk likely occurs orders of magnitude less often in your index, so it requires a lot less work to process at search-time.
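A toy illustration of the effect described above. The tokenizer here is a stand-in that simply splits on anything that is not a letter or digit, which mimics the splitting described in this answer; it is not the actual StandardAnalyzer, and the sample names are made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for text in docs:
        df.update(set(tokenize(text)))
    return df


docs = ["John's bunker", "Mary's house", "Bob's bunk bed", "The Smiths' garden"]
df = document_frequencies(docs)
print(df["s"], df["bunk"])
```

Every possessive contributes an s term, so its document frequency, and with it the work per query, grows with the size of the index.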
I understand that a fundamental aspect of full-text search is the use of inverted indexes. So, with an inverted index a one-word query becomes trivial to answer. Assuming the index is structured like this:
some-word -> [doc385, doc211, doc39977, ...] (sorted by rank, descending)
To answer the query for that word the solution is just to find the correct entry in the index (which takes O(log n) time) and present some given number of documents (e.g. the first 10) from the list specified in the index.
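The one-word lookup described above could be sketched like this. The class name and data are illustrative; the terms are kept in a sorted list so that the lookup is a binary search, matching the O(log n) figure.

```python
from bisect import bisect_left


class InvertedIndex:
    """Toy index: sorted terms, each with a posting list ordered by rank."""

    def __init__(self, postings):
        self.terms = sorted(postings)
        self.lists = [postings[t] for t in self.terms]

    def lookup(self, term, limit=10):
        i = bisect_left(self.terms, term)  # binary search over the terms
        if i < len(self.terms) and self.terms[i] == term:
            return self.lists[i][:limit]   # the first 'limit' documents
        return []


index = InvertedIndex({"some-word": [385, 211, 39977], "other": [1]})
print(index.lookup("some-word", limit=2))
```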
But what about queries which return documents that match, say, two words? The most straightforward implementation would be the following:
set A to be the set of documents which have word 1 (by searching the index).
set B to be the set of documents which have word 2 (ditto).
compute the intersection of A and B.
Now, step three probably takes O(n log n) time to perform. For very large A and B that could make the query slow to answer. But search engines like Google always return their answer in a few milliseconds. So that can't be the full answer.
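For reference, if both document lists are kept sorted by document ID (an assumption, not something the steps above require), step three can be done in a single linear pass rather than O(n log n). A minimal sketch:

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of doc IDs with a two-pointer merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot occur in b any more
        else:
            j += 1  # b[j] cannot occur in a any more
    return out


print(intersect_sorted([1, 3, 5, 7, 9], [3, 4, 5, 9, 10]))
```

This is still linear in the combined size of both lists, so the worry about very large A and B remains.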
One obvious optimization is that since a search engine like Google doesn't return all the matching documents anyway, we don't have to compute the whole intersection. We can start with the smallest set (e.g. B) and find enough entries which also belong to the other set (e.g. A).
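The early-termination idea can be sketched as follows, assuming the larger set is already available as a hash set (building it on the fly would cost linear time by itself). The function name is illustrative.

```python
from itertools import islice


def first_k_matches(smaller, larger_set, k=10):
    """Walk the smaller collection and stop as soon as k matches are found."""
    matches = (doc for doc in smaller if doc in larger_set)
    return list(islice(matches, k))  # islice stops pulling after k items


b = [2, 4, 6, 8, 10]
a = set(range(0, 100, 4))
print(first_k_matches(b, a, k=2))
```

When matches are plentiful this stops early; when the combination is rare it still walks all of the smaller collection.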
But can't we still have the following worst case? If we have set A be the set of documents matching a common word, and set B be the set of documents matching another common word, there might still be cases where A ∩ B is very small (i.e. the combination is rare). That means that the search engine has to go linearly through all elements x of B, checking whether they are also elements of A, to find the few that match both conditions.
Linear isn't fast. And you can have way more than two words to search for, so just employing parallelism surely isn't the whole solution. So, how are these cases optimized? Do large-scale full-text search engines use some kind of compound indexes? Bloom filters? Any ideas?
You describe the index as some-word -> [doc385, doc211, doc39977, ...] (sorted by rank, descending). I don't think a search engine does this. The doc list should be sorted by doc ID, and each doc in it carries a rank with respect to that word.
When a query comes in, it contains several keywords. For each keyword, you can find a doc list. You can then merge the lists for all keywords and compute the relevance of each doc to the query. Finally, return the top-ranked docs to the user.
And the query process can be distributed to gain better performance.
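A minimal sketch of that merge step, under the assumptions that each posting list holds (doc ID, score) pairs sorted by doc ID, that a document appears at most once per list, and that relevance is simply the sum of the per-word scores. None of this is taken from a real engine.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def top_k(posting_lists, k=10):
    """Merge doc-ID-sorted lists, keep docs present in all, rank by score."""
    merged = heapq.merge(*posting_lists, key=itemgetter(0))
    scored = []
    for doc_id, group in groupby(merged, key=itemgetter(0)):
        scores = [score for _, score in group]
        if len(scores) == len(posting_lists):  # doc contains every keyword
            scored.append((sum(scores), doc_id))
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


word1 = [(1, 0.5), (3, 0.2), (7, 0.9)]
word2 = [(3, 0.4), (7, 0.1), (9, 0.8)]
print(top_k([word1, word2], k=2))
```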
Even without ranking, I wonder how Google computes the intersection of two sets so fast.
Obviously the worst-case scenario for computing the intersection for some words A, B, C is when their indexes are very big and the intersection very small. A typical case would be a search for some very common ("popular" in DB terms) words in different languages.
Let's try "concrete", 位置 ("site", "location") in Chinese, and 極端な ("extreme") in Japanese.
Google search for 位置 returns "About 1,500,000,000 results (0.28 seconds)"
Google search for "concrete" returns "About 2,020,000,000 results (0.46 seconds)"
Google search for 極端な returns "About 7,590,000 results (0.25 seconds)"
It is extremely improbable that all three terms would ever appear in the same document, but let's google them:
Google search for "concrete 位置 極端な" returns "About 174,000 results (0.13 seconds)"
Adding a Russian word, игра ("game"):
Google search for игра returns "About 212,000,000 results (0.37 seconds)"
Google search for all of them, "игра concrete 位置 極端な", returns "About 12,600 results (0.33 seconds)"
Of course the returned search results are nonsense and they do not contain all the search terms.
But looking at the query time for the composed ones, I wonder if there is some intersection computed on the word indexes at all. Even if everything is in RAM and heavily sharded, computing the intersection of two sets with 1,500,000,000 and 2,020,000,000 entries is O(n) and can hardly be done in <0.5 sec, since the data is on different machines and they have to communicate.
There must be some join computation, but at least for popular words, this is surely not done on the whole word index. Adding the fact that the results are fuzzy, it seems evident that Google uses some optimization of the kind "give back some high-ranked results, and stop after 0.5 sec".
How this is implemented, I don't know. Any ideas?
Most systems implement TF-IDF in one way or another. TF-IDF is the product of two functions: term frequency and inverse document frequency.
The IDF function relates the document frequency of a term to the total number of documents in a collection. The common intuition is that it should give a higher value to terms that appear in few documents and a lower value to terms that appear in all documents, since the latter say little about relevance.
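As a rough sketch of that product, here is one simple TF-IDF variant (relative term frequency times the log of inverse document frequency). Real systems use different weightings and smoothing; the function name and the tiny corpus are made up.

```python
import math


def tf_idf(term, doc_tokens, corpus):
    """TF-IDF of 'term' in one tokenized document, given the whole corpus."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf


corpus = [["the", "cat"], ["the", "dog"], ["the", "parrot", "parrot"]]
print(tf_idf("the", corpus[0], corpus))     # appears everywhere: idf is log(1)
print(tf_idf("parrot", corpus[2], corpus))  # rare term: positive weight
```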
You mention Google, but Google optimises search with PageRank (links in/out) as well as term frequency and proximity. Google distributes the data and uses Map/Reduce to parallelise operations - to compute PageRank+TF-IDF.
There's a great explanation of the theory behind this in chapter 2 of Information Retrieval: Implementing and Evaluating Search Engines. It is also worth looking at how Solr implements this.
Google does not need to actually find all results, only the top ones.
The index can be sorted by grade first and only then by ID. Since the same ID always has the same grade, this does not hurt set intersection time.
So Google runs the intersection until it finds 10 results, and then uses a statistical estimate to tell you how many more results there are.
A worst case is almost impossible.
If all words are "common", the intersection yields the first 10 results very quickly. If there is a rare word, the intersection is fast because its complexity is O(N log M), where N is the size of the smallest group and M is the size of the larger one.
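The O(N log M) bound corresponds to looking up each element of the small list in the large sorted list by binary search. A minimal sketch, assuming both lists are sorted ascending by the same key:

```python
from bisect import bisect_left


def intersect_by_search(small, large):
    """For each of the N items in 'small', binary-search the M-item 'large'."""
    out = []
    lo = 0  # both lists ascend, so the search never needs to go backwards
    for x in small:
        lo = bisect_left(large, x, lo)
        if lo == len(large):
            break  # everything left in 'small' is beyond the end of 'large'
        if large[lo] == x:
            out.append(x)
    return out


rare = [5, 40, 77]
common = list(range(0, 100, 5))
print(intersect_by_search(rare, common))
```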
You need to remember that Google keeps its indexes in memory and uses parallel computing. For example, you can split the problem into two searches, each covering only half of the web, then merge the results and take the best. Google has millions of computers.
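The split-and-merge idea can be sketched like this. Each shard is assumed to be a list of (score, doc) pairs, and threads stand in for separate machines; the names are illustrative and say nothing about how Google actually does it.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, k):
    """Each shard only has to return its own k best hits."""
    return heapq.nlargest(k, shard)


def search_all(shards, k=10):
    """Query all shards in parallel, then merge the partial top-k lists."""
    with ThreadPoolExecutor() as pool:
        partial = list(pool.map(lambda s: search_shard(s, k), shards))
    return heapq.nlargest(k, (hit for hits in partial for hit in hits))


half1 = [(0.9, "a"), (0.1, "b")]
half2 = [(0.5, "c"), (0.95, "d")]
print(search_all([half1, half2], k=2))
```

The merge step only ever sees k hits per shard, so its cost does not depend on the size of the index.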