Lucene - limit amount of results for specific term in search query - performance

The problem is that one of our terms could be very common (for example number "3"). In that case I would like to limit the amount of search result Scored while Lucene is running the Query. Is that even possible?
Just to emphasize - I don't want just to limit Lucene search results (that could easily be done using second number parameter in IndexSearher.Search method). I want to tell Lucene something like - don't spent too much time searching hits for that specific term. In case you found, let's say, a 1,000,000 - stop looking and go to other terms.

No, you can't. As you might know, absolute scores are meaningless in Lucene, so there's no support for them.
Because the term is really common, the idf will be high (or low, depending on your perspective) so it will probably be relatively inconsequential due to Lucene's pruning algorithms. You can always change the boost to make it matter even less, but I'd double check that this is really your performance bottleneck.

Related

Strategies to compare performance of two Elasticsearch queries?

Since actual query runtime varies, it's not always useful to just check the runtime of two queries to determine which is generally faster. What are some ways to generally test whether one query is more efficient than another?
As an example of what I'm after, in MongoDB I can run explain on a query to get the number of documents iterated vs. returned. If the documents iterated is several orders of magnitude higher than what it's actually returning, I know I have an inefficient query. I know that since Elasticsearch indexes data much differently than other dbs, this may not translate well, but I'm wondering if there's some rough equivalent.
I'm looking at the Profile API which looks like a good starting place. Are fields like next_doc and next_doc_count what I'm after? Are there any others I should look for? Thanks!!

Does BM25 use query coordinator?

In Lucene's practical scoring function there is a query coordinator which punishes documents that fail to match all the query terms. does Okapi BM25 use the same trick?
The reason I'm curious about it is that I'm using Elasticsearch with BM25 similarity module and sometimes I feel this algorithm does not favor documents with more matches. There are cases that a document contains one or two terms a lot, outscores a document containing all query terms.
Yes and no.
No, it doesn't use a coord factor as described by the old Lucene default similarity (note: Lucene core now uses BM25 by default, as well).
Yes, it does weigh hits on more of the query terms more heavily than a bunch of hits on the same term. It does this with better term saturation, making the old coord factor effectively obsolete.
It is, however, always possible that many hits on less terms will outscore few hits on more terms using either algorithm.

Performance of "function_score"

I'm working on a solution for custom score boosting in Elasticsearch.
I wanted to ask if using function_score is a good idea. Because the index size is great but the result of the query should not be that big.
Does function_score work on a query result or rather as a part of query logic? If former, it might be fast, is it?
PS. Initially query boost operator seemed like a best option, but I can't get it to raise a score much above the normal range for one of the match. I've checked _explain API and it says that queryNorm normalizes my boost and I still get values below normal range (0.1 .. 4).
In principle - yes, it will slow down the performance of the search. Of course real penalty will depend on the complexity of your script. It will work during so called 'search' phase, so it means, that it will be applied for all matched docs.
You could try to make your logic faster, if your case is suitable for rescoring functionality, cause it's applied only to the top N (configurable in rescore API) results.
More information about rescoring - https://www.elastic.co/guide/en/elasticsearch/guide/current/_improving_performance.html#rescore-api

Lucene Which would be better: many queries or massive OR query?

Problem I have a large list of keywords that I want to see if the are contained in a document or documents. (My users want to know when a document is published, if it has any of their saved keywords)
So I could make many queries; one for each keyword.
Or I could construct a query something like: "coffee OR tea OR milk OR sugar OR beer"
Now lets say there are over 1,000 key words.
Which one is likely to lead to pain and suffering?
Would one be better over the other when running against one document or many documents?
(I am leaning towards the OR version but I am am worried I will hit some query length (performance) limit if I go too far)
Once I have enough data I will run some comparisons and report back.
Any hints between now and then would be great though.
Single Giant Query Pro: You get ranking by the Lucene's scoring algorithm for all of the keywords.
Single Giant Query Con: You make Lucene use a huge amount of memory, as it needs to remember each subquery's result (or part of it) in order to give you that nice ranking that takes all keywords into account. The bigger the OR query, the more memory Lucene needs to do it, and the slower it does it.
I'd say, if at all possible for your purposes, break it up, since OR queries are The Devil (even though it's sometimes necessary to deal with them); but benchmark should be better than asking random people for opinions :P

How does a search engine rank millions of pages within 1 second?

I understand the basics of search engine ranking, including the ideas of "reverse index", "vector space model", "cosine similarity", "PageRank", etc.
However, when a user submits a popular query term, it is very likely that millions of pages containing this term. As a result, a search engine still needs to sort these millions of pages in real time. For example, I just tried searching "Barack Obama" in Google. It shows "About 937,000,000 results (0.49 seconds)". Ranking over 900M items within 0.5 seconds? That really blows my mind!
How does a search engine sort such a large number of items within 1 second? Can anyone give me some intuitive ideas or point out references?
Thanks!
UPDATE:
Most of the responses (including some older discussions) so far seem to contribute the credit to "reverse index". However, as far as I know, reverse index only helps find the "relevant pages". In other words, by inverse index Google could obtain the 900M pages containing "Barack Obama" (out of over several billions of pages). However, it is still not clear how to "rank" these millions of "relevant pages" based on the threads I read so far.
MapReduce framework is unlikely to be the key component for real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow to meet our request.
The question would be really relevant if we were sure that the ranking was complete. It is quite possible that the ordering provided is approximate.
Given the fluidity of the ranking results, no answer that looks reasonable could be considered incorrect. For example, if an entire section of the web were excluded from the top results, you would not notice, provided they were included later.
This gives the developers a degree of latitude entirely unavailable in almost all other domains.
The real question to ask is - how precisely do the results match the actual rank assigned to each page?
There are two major factors that influence the time it takes for you to get a response from your search engine.
The first is if you're storing your index on hard disk. If you're using a database, it's very likely that you're using the hard disk at least a little. From a cold boot, your queries will be slow until the data necessary for those queries has been pulled into the database cache.
The other is having a cache for your popular queries. It takes a lot longer to search for a query than it does to return results from a cache. Now, the random access time for a disk is too slow, so they need to have it stored in RAM.
To solve both of these problems, Google uses memcached. It's an application that caches the output of the Google search engine and feeds slightly old results to users. This is fine because most of the time the web doesn't change fast enough for it to be a problem, and because of the significant overlap in searches. You can be almost guaranteed that Barack Obama has been searched for recently.
Another issue that effects search engine latency is the network overheads.
Google have been using a custom variant of the Linux (IIRC) that has been optimised for use as a web server. They've managed to reduce some of the time it takes to start turning around results to a query.
The moment a query hits their servers, the server immediately responds back to the user with the header for the HTTP response, even before Google has finished processing the query terms.
I'm sure they have a bunch of other tricks up their sleeves, too.
EDIT:
They also keep their inverted lists sorted already, from the indexing process (it's better to process once than for each query).
With these pre-sorted lists, the most expensive operation is list intersection. Although I'm fairly sure Google doesn't rely on a vector space model, so list intersection isn't so much a factor for them.
The models that pay off the best according to the literature are the probabilistic models. As an example, you may wish to look up Okapi BM25. It does fairly well in practice within my area of research (XML Retrieval). When working with probabilistic models, it tends to be much more efficient to process document at a time instead of term at a time. What this means is that instead of getting a list of all of the documents that contain a term, we look at each document and rank it based on the terms it contains from our query (skipping documents that have no terms).
But if we want to be smart, we can approach the problem in a different way (but only when it appears to be better). If there's a query term that is extremely rare, we can rank with that first, because it has the highest impact. Then we rank with the next best term, and we continue until we've determined if it's likely that this document will be within our top k results.
One possible strategy is just rank the top-k instead of the entire list.
For example, to find the top 100 results from 1 millions hits, by selection algorithm the time complexity is O(n log k). Since k = 100 and n = 1,000,000, in practice we could ignore log(k).
Now, you only need O(n) to obtain the top 100 results out of 1 million hits.
Also I guess the use of NoSQL databases instead of RDBMS helps.
NoSQL databases scales horizontally better, and don't generate bottlenecks. Big guys like Google Facebook or Twitter use them.
As other comments/answers suggested the data might be already sorted, and they are returning offsets of the data found instead of the whole batch.
The real question is not how they sort that many results that quickly, but how do they do it when tens or hundreds of millions of people around the world are querying google at the same time xD
As Xiao said, just rank the top-k instead of the entire list.
Google tells you there are 937,000,000 results, but it won't show them all to you. If you keep scrolling page after page, after a while it will truncate the results :)
Here you go, i looked it up for you and this is what i found! http://computer.howstuffworks.com/internet/basics/search-engine.htm
This ia my theory...Its highly impossible that you are the first guy to search for a keyword.So for every keyword (or a combination) searched on a search engine, it maintains a hash of links to relevent web pages. Everytime you click a link in search results it gets a vote-up on the hashset of that keyword combination. Unfortunatly if you are the first guy, it saves your search keyword(for suggesting future searches) and starts the hashing of that keyword. So you end up with a fewer or no results at all.
The page ranking as you might be knowing depends on many other factors too like backlinks,no. Of pages refering a keyword in seaech. etc.
Regarding your update:
MapReduce framework is unlikely to be the key component for real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow to meet our request.
MapReduce is not just designed for batch tasks. There are quite a lot MapReduce frameworks supporting real time computing: Apache Spark, Storm, Infinispan Distributed Executor, Hazelcast Distributed Executor Service.
Back to your question MapReduce is the key to distribute the query task to multiple nodes, and then merge the result together.
There's no way you expect to get an accurate answer to this question here ;) Anyway, here's a couple of things to consider - Google uses a unique infrastructure, in every part of it. We cannot even guess the order of complexity of their network equipment or their database storage. That is all I know about the hardware component of this problem.
Now, for the software implementation - like the name says the PageRank is a rank by itself. It doesn't rank the pages when you enter the search query. I assume it ranks it on a totally independent part of the infrastructure every hour. And we already know that Google crawler bots are roaming the Web 24/7 so I assume that new pages are added into an "unsorted" hash map and then they are ranked on the next run of the algorithm.
Next, when you type your query, thousands of CPUs independently scan thousands of different parts of the PageRank database with a gapping factor. For example if the gapping factor is 10, one machine queries the part of the database that has PageRank values from 0-9.99, the other one queries the database from 10-19.99 etc. Since resources aren't an obstacle for Google they can set the gapping factor so low (for example 1) in order for each machine to query less than 100k pages which isn't to much for their hardware. Then when they need to compile the results of your query, since they know which machine ranks exactly which part of the database they can use the 'fill the pool' principle. Let n be the number of links on each Google page. The algorithm that combines the pages returned from queries ran on all those machines against all the different parts of database needs to only fill the first n results. So they take the results from the machine querying against the highest rank of the database. If it is greater than n they're done, if not they move to the next machine. This takes only O(q*g/r) where s is the quantity of the pages Google serves, g is the gapping factor and r is the highest value of PageRank. This assumption is encouraged by the fact that when you turn to second page your query is ran once again (notice the different time taken to generate it) .
This is just my two cents, but I think I'm pretty accurate with this hypothesis.
EDIT: You might want to check this out for complexity of high-order queries.
I don't know what Google really does, but surely they use approximation. For example if the search query is 'Search engine' then the number of results will be = (no. of documents where there is one or more occurrence of the word 'search' + no. of documents where there is one or more occurrence of the word 'engine' ). This can be done in O(1) time complexity. For details read the basic structure of Google http://infolab.stanford.edu/~backrub/google.html.

Resources