Lucene Which would be better: many queries or massive OR query? - performance

Problem: I have a large list of keywords, and I want to check whether they are contained in a document or documents. (My users want to know, when a document is published, whether it contains any of their saved keywords.)
So I could make many queries; one for each keyword.
Or I could construct a query something like: "coffee OR tea OR milk OR sugar OR beer"
Now let's say there are over 1,000 keywords.
Which one is likely to lead to pain and suffering?
Would one be better over the other when running against one document or many documents?
(I am leaning towards the OR version, but I am worried I will hit some query-length or performance limit if I go too far.)
Once I have enough data I will run some comparisons and report back.
Any hints between now and then would be great though.

Single Giant Query Pro: You get ranking by Lucene's scoring algorithm across all of the keywords.
Single Giant Query Con: You make Lucene use a huge amount of memory, as it needs to remember each subquery's result (or part of it) in order to give you that nice ranking that takes all keywords into account. The bigger the OR query, the more memory Lucene needs to do it, and the slower it does it.
I'd say, if at all possible for your purposes, break it up, since OR queries are The Devil (even though it's sometimes necessary to deal with them); but a benchmark should be better than asking random people for opinions :P
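A middle ground between "one query per keyword" and "one giant OR" is to batch the keywords into several moderately sized OR queries. The sketch below only builds the query strings. The function name `or_queries` is made up for illustration, and the limit of 1024 is an assumption that mirrors the default maximum clause count of Lucene's BooleanQuery, so check the limit configured in your own setup.

```python
from typing import Iterator, List

# Assumption: mirrors Lucene's default maximum number of boolean clauses.
MAX_CLAUSES = 1024


def or_queries(keywords: List[str], max_clauses: int = MAX_CLAUSES) -> Iterator[str]:
    """Yield OR query strings, each containing at most max_clauses keywords."""
    for start in range(0, len(keywords), max_clauses):
        batch = keywords[start:start + max_clauses]
        # Quote each keyword so that multi-word keywords stay a single phrase.
        yield " OR ".join('"{}"'.format(k.replace('"', '\\"')) for k in batch)


# Tiny batch size here only to make the batching visible.
queries = list(or_queries(["coffee", "tea", "milk", "sugar", "beer"], max_clauses=2))
```

Each batch can then be timed separately, which also gives a simple way to benchmark how query cost grows with the number of clauses.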

Related

Strategies to compare performance of two Elasticsearch queries?

Since actual query runtime varies, it's not always useful to just check the runtime of two queries to determine which is generally faster. What are some ways to generally test whether one query is more efficient than another?
As an example of what I'm after, in MongoDB I can run explain on a query to get the number of documents iterated vs. returned. If the documents iterated is several orders of magnitude higher than what it's actually returning, I know I have an inefficient query. I know that since Elasticsearch indexes data much differently than other dbs, this may not translate well, but I'm wondering if there's some rough equivalent.
I'm looking at the Profile API which looks like a good starting place. Are fields like next_doc and next_doc_count what I'm after? Are there any others I should look for? Thanks!!
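As a rough sketch of how the Profile API output could be compared across two queries: the helpers below wrap a query with `"profile": true` and then read `time_in_nanos` and `next_doc_count` out of a response. The helper names and the sample response are made up for illustration, and the field layout is assumed to match recent Elasticsearch versions (older versions report timings differently).

```python
from typing import Any, Dict


def profiled(query: Dict[str, Any]) -> Dict[str, Any]:
    """Build a search body that asks Elasticsearch to profile the query."""
    return {"profile": True, "query": query}


def query_time_per_shard(response: Dict[str, Any]) -> Dict[str, int]:
    """Sum the top-level query timings per shard (parents include their children)."""
    times = {}
    for shard in response["profile"]["shards"]:
        times[shard["id"]] = sum(
            node["time_in_nanos"]
            for search in shard["searches"]
            for node in search["query"]
        )
    return times


def next_doc_calls(node: Dict[str, Any]) -> int:
    """Count next_doc invocations for a query node and all of its children."""
    own = node.get("breakdown", {}).get("next_doc_count", 0)
    return own + sum(next_doc_calls(child) for child in node.get("children", []))


# Hypothetical, heavily trimmed profile response used only for illustration.
sample_response = {
    "profile": {"shards": [{
        "id": "[node-1][my-index][0]",
        "searches": [{"query": [{
            "type": "BooleanQuery",
            "time_in_nanos": 1800,
            "breakdown": {"next_doc": 500, "next_doc_count": 5},
            "children": [
                {"type": "TermQuery", "time_in_nanos": 700,
                 "breakdown": {"next_doc": 200, "next_doc_count": 3}},
                {"type": "TermQuery", "time_in_nanos": 600,
                 "breakdown": {"next_doc": 150, "next_doc_count": 2}},
            ],
        }]}],
    }]}
}

body = profiled({"match": {"title": "coffee"}})
shard_times = query_time_per_shard(sample_response)
```

Because wall-clock timings fluctuate, counters such as `next_doc_count` are probably the closer analogue to MongoDB's "documents examined", while `time_in_nanos` is best averaged over several runs.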

How does a search engine rank millions of pages within 1 second?

I understand the basics of search engine ranking, including the ideas of "reverse index", "vector space model", "cosine similarity", "PageRank", etc.
However, when a user submits a popular query term, it is very likely that millions of pages contain this term. As a result, a search engine still needs to sort these millions of pages in real time. For example, I just tried searching "Barack Obama" in Google. It shows "About 937,000,000 results (0.49 seconds)". Ranking over 900M items within 0.5 seconds? That really blows my mind!
How does a search engine sort such a large number of items within 1 second? Can anyone give me some intuitive ideas or point out references?
Thanks!
UPDATE:
Most of the responses (including some older discussions) so far seem to attribute the credit to the "reverse index". However, as far as I know, a reverse index only helps find the "relevant pages". In other words, using the reverse index Google could obtain the 900M pages containing "Barack Obama" (out of several billion pages). However, based on the threads I have read so far, it is still not clear how to "rank" these millions of "relevant pages".
MapReduce framework is unlikely to be the key component for real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow to meet our request.
The question would be really relevant if we were sure that the ranking was complete. It is quite possible that the ordering provided is approximate.
Given the fluidity of the ranking results, no answer that looks reasonable could be considered incorrect. For example, if an entire section of the web were excluded from the top results, you would not notice, provided they were included later.
This gives the developers a degree of latitude entirely unavailable in almost all other domains.
The real question to ask is - how precisely do the results match the actual rank assigned to each page?
There are two major factors that influence the time it takes for you to get a response from your search engine.
The first is if you're storing your index on hard disk. If you're using a database, it's very likely that you're using the hard disk at least a little. From a cold boot, your queries will be slow until the data necessary for those queries has been pulled into the database cache.
The other is having a cache for your popular queries. It takes a lot longer to search for a query than it does to return results from a cache. Now, the random access time for a disk is too slow, so they need to have it stored in RAM.
To solve both of these problems, Google uses memcached. It's an application that caches the output of the Google search engine and feeds slightly old results to users. This is fine because most of the time the web doesn't change fast enough for it to be a problem, and because of the significant overlap in searches. You can be almost guaranteed that Barack Obama has been searched for recently.
Another issue that affects search engine latency is network overhead.
Google has been using a custom variant of Linux (IIRC) that has been optimised for use as a web server. They've managed to reduce some of the time it takes to start turning around results for a query.
The moment a query hits their servers, the server immediately responds back to the user with the header for the HTTP response, even before Google has finished processing the query terms.
I'm sure they have a bunch of other tricks up their sleeves, too.
EDIT:
They also keep their inverted lists sorted already, from the indexing process (it's better to process once than for each query).
With these pre-sorted lists, the most expensive operation is list intersection. Although I'm fairly sure Google doesn't rely on a vector space model, so list intersection isn't so much a factor for them.
The models that pay off the best according to the literature are the probabilistic models. As an example, you may wish to look up Okapi BM25. It does fairly well in practice within my area of research (XML Retrieval). When working with probabilistic models, it tends to be much more efficient to process document at a time instead of term at a time. What this means is that instead of getting a list of all of the documents that contain a term, we look at each document and rank it based on the terms it contains from our query (skipping documents that have no terms).
But if we want to be smart, we can approach the problem in a different way (but only when it appears to be better). If there's a query term that is extremely rare, we can rank with that first, because it has the highest impact. Then we rank with the next best term, and we continue until we've determined if it's likely that this document will be within our top k results.
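To make the document-at-a-time idea concrete, here is a minimal sketch that scores each document against the query with one common variant of the BM25 formula and skips documents that contain none of the query terms. The function name, the toy corpus, and the parameter values `k1 = 1.2` and `b = 0.75` are illustrative assumptions, and a real engine would walk inverted lists rather than scanning every document.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(docs: List[List[str]], query: List[str],
                k1: float = 1.2, b: float = 0.75) -> Dict[int, float]:
    """Score documents one at a time; documents with no query term are skipped."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = {}
    for doc_id, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in set(query):
            f = tf[term]
            if f == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * f * (k1 + 1) / (f + length_norm)
        if score > 0:
            scores[doc_id] = score
    return scores


# Toy corpus, already tokenised.
docs = [
    ["barack", "obama", "speech"],
    ["barack", "street", "map"],
    ["coffee", "tea", "milk"],
]
scores = bm25_scores(docs, ["barack", "obama"])
```

In this toy corpus the rarer term "obama" contributes more to the score than "barack", which is the intuition behind processing the rarest query term first.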
One possible strategy is to rank just the top k instead of the entire list.
For example, to find the top 100 results from 1 million hits with a heap-based selection algorithm, the time complexity is O(n log k). Since k = 100 and n = 1,000,000, in practice we can treat log(k) as a small constant.
Now, you only need O(n) to obtain the top 100 results out of 1 million hits.
Also, I guess the use of NoSQL databases instead of an RDBMS helps.
NoSQL databases scale horizontally better and don't generate bottlenecks. Big players like Google, Facebook, or Twitter use them.
As other comments/answers suggested the data might be already sorted, and they are returning offsets of the data found instead of the whole batch.
The real question is not how they sort that many results that quickly, but how they do it when tens or hundreds of millions of people around the world are querying Google at the same time xD
As Xiao said, just rank the top-k instead of the entire list.
Google tells you there are 937,000,000 results, but it won't show them all to you. If you keep scrolling page after page, after a while it will truncate the results :)
Here you go, I looked it up for you and this is what I found: http://computer.howstuffworks.com/internet/basics/search-engine.htm
This is my theory: it's highly unlikely that you are the first person to search for a keyword. So for every keyword (or combination of keywords) searched on a search engine, it maintains a hash of links to relevant web pages. Every time you click a link in the search results, it gets an up-vote in the hash set of that keyword combination. Unfortunately, if you are the first person, it saves your search keyword (for suggesting future searches) and starts the hashing of that keyword, so you end up with few or no results at all.
Page ranking, as you might know, depends on many other factors too, like backlinks, the number of pages referring to a keyword in a search, etc.
Regarding your update:
MapReduce framework is unlikely to be the key component for real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow to meet our request.
MapReduce is not just designed for batch tasks. There are quite a lot of MapReduce frameworks supporting real-time computing: Apache Spark, Storm, Infinispan Distributed Executor, Hazelcast Distributed Executor Service.
Back to your question: MapReduce is the key to distributing the query task to multiple nodes and then merging the results together.
There's no way you expect to get an accurate answer to this question here ;) Anyway, here's a couple of things to consider - Google uses a unique infrastructure, in every part of it. We cannot even guess the order of complexity of their network equipment or their database storage. That is all I know about the hardware component of this problem.
Now, for the software implementation - like the name says the PageRank is a rank by itself. It doesn't rank the pages when you enter the search query. I assume it ranks it on a totally independent part of the infrastructure every hour. And we already know that Google crawler bots are roaming the Web 24/7 so I assume that new pages are added into an "unsorted" hash map and then they are ranked on the next run of the algorithm.
Next, when you type your query, thousands of CPUs independently scan thousands of different parts of the PageRank database with a gapping factor. For example, if the gapping factor is 10, one machine queries the part of the database that has PageRank values from 0-9.99, another one queries the part from 10-19.99, etc. Since resources aren't an obstacle for Google, they can set the gapping factor low enough (for example 1) for each machine to query fewer than 100k pages, which isn't too much for their hardware. Then, when they need to compile the results of your query, since they know exactly which machine ranks which part of the database, they can use the 'fill the pool' principle. Let n be the number of links on each Google results page. The algorithm that combines the pages returned from the queries run on all those machines against all the different parts of the database only needs to fill the first n results. So they take the results from the machine querying the highest-ranked part of the database. If the number of results is greater than n they're done; if not, they move on to the next machine. This takes only O(q*g/r), where q is the quantity of pages Google serves, g is the gapping factor and r is the highest value of PageRank. This assumption is supported by the fact that when you turn to the second page your query is run once again (notice the different time taken to generate it).
This is just my two cents, but I think I'm pretty accurate with this hypothesis.
EDIT: You might want to check this out for complexity of high-order queries.
I don't know what Google really does, but surely they use approximation. For example if the search query is 'Search engine' then the number of results will be = (no. of documents where there is one or more occurrence of the word 'search' + no. of documents where there is one or more occurrence of the word 'engine' ). This can be done in O(1) time complexity. For details read the basic structure of Google http://infolab.stanford.edu/~backrub/google.html.

Benefits and trade offs for improving text search on small data in PostgreSQL

I have 4 text columns of interest.
Each column is up to about 100 characters.
The text in 3 of the columns is mostly Latin words. (The data is a biological catalog, and these are names of things.)
The data is currently about 500 rows. I don't expect this to grow beyond 1000.
A small number of users (under 10) will have editing privileges to add, update, and delete data. I do not expect these users to put a heavy load on the database.
So all this suggests a pretty small data set to consider.
I need to perform a search on all 4 columns for rows where at least 1 column contains the search text (case insensitive). The query will be issued (and the results served) via a web application. I'm a bit lost about how to approach it.
PostgreSQL offers a few options for improving text searching speed. The possible options built into PostgreSQL I've been considering are
1. Don't try to index this at all. Just use ILIKE, LIKE on lower, or similar. (Without an index?)
2. Index with pg_trgm to improve search speed. I would assume that I would need to index the concatenation somehow.
3. Full text searching. I assume this would involve concatenating for the index also.
Unfortunately, I'm not really familiar with the expected performance of any of these or their benefits and trade-offs, so it's hard to know what things I should try first and what things I shouldn't even consider. Some things I have read suggest that doing the indexing for 2 and 3 is pretty slow, which conflicts with the fact that I'll be having occasional modifications going on. And the mixed language makes full text search seem unattractive since it appears to be language based, unless it can handle multiple languages simultaneously. Would I expect that for data this small, a simple ILIKE or maybe a LIKE on lower is probably fast enough? Or maybe the indexing is fast enough for the low load of modifications on data this small? Would I be better off looking for something outside the database?
Granted, I would have to actually benchmark all these to really know for sure what's fastest, but unfortunately, I don't have much time for this project. So what are the benefits and trade offs of these methods? What of these options are not appropriate for solving this type of problem? What are some other types of solutions (including potentially outside the database) worth considering?
(I suppose I might find some kind of beginner's tutorial on text searching in PG useful, but my searches turn up Full Text Search for the most part, which I don't even know if it's useful for me.)
I'm on PG 9.2.4, so any goodies pre-9.3 are an option.
Update: I've expanded this answer into a detailed blog post.
Rather than focusing purely on speed, please consider search semantics first. Define your requirements.
For example, do users need to be able to differentiate based on the order of terms? Should
radiata pinus
find:
pinus radiata
? Does the same rule apply to words within a column as between columns?
Are spaces always word separators, or are spaces within a column part of the search term?
Do you need wildcards? If so, do you need only left-anchored wildcards (think staph%) or do you need right-anchored or infix wildcards too (%ccus, p%s)? Only pg_trgm will help you with infix wildcards. Suffix wildcards can be handled by an index on the reverse() of a word, but that gets clumsy quickly, so in practice pg_trgm is the best option there.
If you're mostly searching for discrete words and word-order isn't important, Pg's full-text search with to_tsvector and to_tsquery will be desirable. It supports left-anchored wildcard searches, weighting, categories, etc.
If you're mostly doing prefix searches of discrete columns then simple LIKE queries on a regular b-tree index per column will be the way to go.
So. Figure out what you need, then how to do it. Your current uncertainty probably stems partly from not really knowing quite what you want.
For 1,000 rows, I would guess that LIKE together with lower() should be fast enough. After a couple of queries the table will most probably be completely cached.
Regarding the indexing using pg_trgm: you are talking about "occasional" updates/inserts to the table. I would think that the additional costs of using a trigram index would only show up when you update/insert that table a lot - like several times a second.
If "occasional" only means several times an hour (or even less), then I doubt you'd see the difference in real life. I think somewhere in Depesz's blog there was also an article that compared the insert speed with and without a trigram index, but I can't find it anymore.
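For the unindexed LIKE-on-lower() approach, a hedged sketch of how the web application might build the query: the table name `catalog` and the four column names are hypothetical, and the `%s` placeholders assume a psycopg2-style driver. The search text is escaped so that `%` and `_` typed by a user are matched literally (backslash is PostgreSQL's default LIKE escape character).

```python
from typing import List, Tuple

# Hypothetical column names standing in for the four text columns.
COLUMNS = ["genus", "species", "common_name", "notes"]


def escape_like(term: str) -> str:
    """Escape LIKE wildcards so user input is matched literally."""
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def build_search(term: str, table: str = "catalog") -> Tuple[str, List[str]]:
    """Return (sql, params) matching rows where any column contains term."""
    pattern = "%" + escape_like(term.lower()) + "%"
    where = " OR ".join("lower({}) LIKE %s".format(col) for col in COLUMNS)
    sql = "SELECT * FROM {} WHERE {}".format(table, where)
    # One parameter per column; the driver handles quoting of the value.
    return sql, [pattern] * len(COLUMNS)


sql, params = build_search("Pinus_R")
```

With a driver such as psycopg2 this would be run as `cursor.execute(sql, params)`. If the sequential scan ever becomes too slow, the same predicate is the kind of infix search a pg_trgm index is meant for.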

MongoDB text index search slow for common words in large table

I am hosting a mongodb database for a service that supports full text searching on a collection with 6.8 million records.
Its text index includes ten fields with varying weights.
Most searches take less than a second. Some searches take two to three seconds. However, some searches take 15 - 60 seconds! The 15-60 second search cases are unacceptable for my application. I need to find a way to speed those up.
Searching takes 15-60 seconds when words that are very common in the index are used in the search query.
It seems that the text search feature does not support lazy parameters. My first thought was to cache a list of the 50 most common words in my text index and then ask mongodb to evaluate those last (lazy) and on top of the filtered results returned by the less common parameters. Hopefully people are still with me. For example, say I have a query "products chocolate", where products is common and chocolate is uncommon. I would like to be able to ask mongodb to evaluate "chocolate" first, and then filter those results with the "products" term. Does anyone know of a way to achieve this?
I can achieve the above scenario by omitting the most common words (i.e. "products") from the db query and then reapplying the common-term filter on the application side after it has received the records found by the db. It is preferable for all query logic to happen on the database, but I am open to application-side processing for a speed payoff.
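The application-side workaround described above can be sketched as follows. Everything here is illustrative: `COMMON_WORDS` stands in for the cached list of frequent index terms, and `db_search` is a hypothetical callable that runs the MongoDB text query and returns the text of each matching record.

```python
from typing import Callable, List, Tuple

# Hypothetical cached list of the most common words in the text index.
COMMON_WORDS = {"products", "store"}


def split_terms(query: str) -> Tuple[List[str], List[str]]:
    """Split a query into (rare, common) terms."""
    terms = query.lower().split()
    rare = [t for t in terms if t not in COMMON_WORDS]
    common = [t for t in terms if t in COMMON_WORDS]
    return rare, common


def matches_all(record_text: str, terms: List[str]) -> bool:
    """True if every term occurs as a whole word in the record text."""
    words = set(record_text.lower().split())
    return all(t in words for t in terms)


def search(query: str, db_search: Callable[[str], List[str]]) -> List[str]:
    rare, common = split_terms(query)
    if not rare:
        # Only common terms: there is no choice but to send them all to the database.
        return db_search(" ".join(common))
    # Query the database with the rare terms only, then filter locally.
    candidates = db_search(" ".join(rare))
    return [r for r in candidates if matches_all(r, common)]


# Stand-in for the database: returns records containing any of the terms.
RECORDS = ["chocolate products", "chocolate cake", "garden products"]


def fake_db_search(terms: str) -> List[str]:
    wanted = set(terms.split())
    return [r for r in RECORDS if wanted & set(r.split())]


results = search("products chocolate", fake_db_search)
```

Note that this local filter matches whole words exactly, whereas MongoDB's text search applies stemming, so the two may disagree on word variants.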
There are still some holes in this design. If a user only searches common terms, I have no choice but to hit the database with all the terms. From preliminary reading, I gather that it is not recommended (or not supported) to have multiple text indexes (with different names) on the same collection. My plan is to create two identical tables, each with my 6.8M records, with different indexes - one for common words and one for uncommon words. This feels kludgy and clunky, but am willing to do this for a speed increase.
Does anyone have any insight and/or advice on how to speed up this system. I'd like as much processing to happen on the database as possible to keep it fast. I'm sure my little 6.8M record table is not the largest that mongodb has seen. Thanks!
Well I worked around these performance issues by allowing MongoDB full text search to search in OR based format. I'm prioritizing my results by fine tuning the weights on my indexed fields and just ordering by rank. I do get more results than desired, but that's not a huge problem because my weighted results that appear at the top will most likely be consumed before my user gets to less relevant results at the bottom.
If anyone is struggling with MongoDB text search performance using AND searching only, just switch back to OR and control your results using weights. It performs leaps better.
hth
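For reference, a sketch of what such an OR-style, score-ordered text query might look like when expressed as plain Python dictionaries. Space-separated terms in `$search` are ORed, and `{"$meta": "textScore"}` exposes the weighted relevance score. The helper name is made up, and the pymongo call in the comment assumes a collection that already has a weighted text index.

```python
from typing import Any, Dict, List, Tuple


def text_search_spec(search: str) -> Tuple[Dict[str, Any], Dict[str, Any], List[Tuple[str, Any]]]:
    """Return (filter, projection, sort) for a relevance-ordered text search."""
    filter_doc = {"$text": {"$search": search}}
    projection = {"score": {"$meta": "textScore"}}
    sort = [("score", {"$meta": "textScore"})]
    return filter_doc, projection, sort


filter_doc, projection, sort = text_search_spec("products chocolate")

# Assumed usage with pymongo (not executed here):
#   cursor = collection.find(filter_doc, projection).sort(sort).limit(20)
```

Adding a `limit` keeps the long tail of weakly matching records from being transferred at all.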
This is the exact same issue as $all versus $in. $all only uses the index for the first keyword in the array. I believe you're seeing the same issue here, which is why the OR (a.k.a. IN) works for you.

In elasticsearch, is there some method to reduce the importance of a set of search terms?

Ideally, I would like to reduce the importance of certain words such as "store", "shop", "restaurant".
I would like "Jimmy's Steak Restaurant" to be about as important as "Ralph's Steak House" when a user searches for "Steak Restaurant". I hope to accomplish this by severely diminishing the importance of the word "restaurant" (along with 20-50 other words).
Stop words work well for some words, such as "a", "the", "of", etc, but they are all-or-nothing.
Is there a way to provide a weighting or boost value per word at the index or mapping level?
I can probably accomplish this at the query level, but that could be very bad if I have 50 words whose impact I need to reduce.
This was a generalized example. In my actual complex solution, I really do need to reduce the impact of quite a few search terms.
I don't believe it is possible to specify a term-level boost during indexing. In this thread, Shay mentions that it is possible in Lucene, but that it's a tricky feature to surface through the API.
Another relevant thread, suggesting the same thing. Shay recommends trying to sort it out using a custom_score query:
I think that you should first try and solve it on the search side. If you know the weights when you do a search, you can either construct a query that applies different boosts depending on the tag, or use custom_score query.
Custom_score query is slower than other queries, but I suggest you run and check if it's ok for you (with actual data, and relevant index size). The good thing is that if it's slow for you (and slow here means both latency and QPS under load), you can always add more replicas and more machines to separate the load.
Here is an example of a custom_score query that boosts on a somewhat-similar term level (except it's for a special field that only has one category term, so this may not apply). It might be easier to break the script out into a native script, instead of using mvel, since you'll have a big list of words.
As an alternative, perhaps add a synonym token filter that interchanges words like "shop", "restaurant", "store", etc?
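As a sketch of the query-side option, the search text can be split into terms and each term given its own boost inside a `bool` query, so the list of de-emphasised words lives in one place rather than being hand-written into every query. The word list, the boost value `0.1`, and the field name `name` are assumptions for illustration.

```python
from typing import Any, Dict

# Hypothetical list of words whose importance should be reduced.
WEAK_TERMS = {"store", "shop", "restaurant"}
WEAK_BOOST = 0.1  # assumed value; tune against real data


def weighted_query(text: str, field: str = "name") -> Dict[str, Any]:
    """Build a bool/should query with a reduced boost for weak terms."""
    should = []
    for term in text.lower().split():
        boost = WEAK_BOOST if term in WEAK_TERMS else 1.0
        should.append({"match": {field: {"query": term, "boost": boost}}})
    return {"query": {"bool": {"should": should}}}


body = weighted_query("Steak Restaurant")
```

With this body, a document matching only "steak" should score close to one matching both "steak" and "restaurant", which is the behaviour asked for in the question.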

Resources