At my workplace, we can't decide on the best way to implement paging and sorting. We have two different solutions and currently use the second one. We do it because it seems like a good idea, but we don't have any real evidence for that.
First solution:
The client passes PageSize, CurrentPage, and SortingColumn parameters to the stored procedure. Per the requirements, every output column can be sorted on. It might look like this (a fuller sketch follows the parameter list):
GetSomething
CurrentPage (INTEGER)
PageSize (INTEGER)
SortingColumn (VARCHAR)
SortingDirection (BIT, 0:Ascending, 1:Descending)
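A minimal sketch of what such a procedure might look like in T-SQL (the table, column names, and body are invented for illustration; the real procedure would expose more columns):

CREATE PROCEDURE GetSomething
    @CurrentPage      INT,
    @PageSize         INT,
    @SortingColumn    VARCHAR(128),
    @SortingDirection BIT   -- 0: ascending, 1: descending
AS
BEGIN
    SELECT [ID], [Name]
    FROM dbo.Something
    ORDER BY
        -- CASE-based dynamic sort; SQL_VARIANT unifies the differing column types
        CASE WHEN @SortingDirection = 0 THEN
            CASE @SortingColumn
                WHEN 'ID'   THEN CAST([ID] AS SQL_VARIANT)
                WHEN 'Name' THEN CAST([Name] AS SQL_VARIANT)
            END
        END ASC,
        CASE WHEN @SortingDirection = 1 THEN
            CASE @SortingColumn
                WHEN 'ID'   THEN CAST([ID] AS SQL_VARIANT)
                WHEN 'Name' THEN CAST([Name] AS SQL_VARIANT)
            END
        END DESC
    OFFSET (@CurrentPage - 1) * @PageSize ROWS
    FETCH NEXT @PageSize ROWS ONLY;
END

Note that CASE-based sorting like this generally prevents the optimizer from using an index on the sort column; dynamic SQL via sp_executesql is the usual alternative, with its own trade-offs.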
Second solution:
The stored procedure doesn't implement any of this; instead, we specify paging and sorting in the query itself.
SELECT [...]
FROM GetSomething(...)
ORDER BY [ID]
OFFSET 100 ROWS
FETCH NEXT 100 ROWS ONLY
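Bound to the same paging parameters, and keeping the placeholders from the snippet above, the caller's query would presumably look like this (@CurrentPage assumed to be 1-based):

DECLARE @CurrentPage INT = 2, @PageSize INT = 100;

SELECT s.[ID]
FROM GetSomething(...) AS s
ORDER BY s.[ID]
OFFSET (@CurrentPage - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;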
Are both solutions really equally well suited to implementing paging and sorting, or does one have benefits over the other? If it helps to know, we operate on a 1 TB database, which might benefit from small performance optimizations that would normally not be worth considering.
P.S.: I don't have the privileges required to access the database as a DBA, so I cannot simply try it out and benchmark. The actual DBAs, for their part, don't think this is worth investigating, so I thought I'd ask Stack Overflow in case somebody has hard facts :-)
I have usually used the first option. I feel the little extra control you get over the sorting is worth it. I have yet to see any noticeable difference, so you could use either one. In my case, I can change the sorting if I need to and don't have to mess with the query.
I understand the basics of search engine ranking, including ideas like the "inverted index", the "vector space model", "cosine similarity", "PageRank", etc.
However, when a user submits a popular query term, it is very likely that millions of pages contain it. As a result, a search engine still needs to sort those millions of pages in real time. For example, I just tried searching "Barack Obama" on Google. It shows "About 937,000,000 results (0.49 seconds)". Ranking over 900M items within 0.5 seconds? That really blows my mind!
How does a search engine sort such a large number of items within 1 second? Can anyone give me some intuitive ideas or point out references?
Thanks!
UPDATE:
Most of the responses so far (including some older discussions) seem to credit the "inverted index". However, as far as I know, the inverted index only helps find the "relevant pages". In other words, with an inverted index Google can obtain the 900M pages containing "Barack Obama" (out of several billion pages). But based on the threads I have read so far, it is still not clear how those millions of "relevant pages" get ranked.
The MapReduce framework is unlikely to be the key component of real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow to meet this requirement.
The question would only really be relevant if we were sure that the ranking was complete. It is quite possible that the ordering provided is approximate.
Given the fluidity of the ranking results, no answer that looks reasonable could be considered incorrect. For example, if an entire section of the web were excluded from the top results, you would not notice, provided they were included later.
This gives the developers a degree of latitude entirely unavailable in almost all other domains.
The real question to ask is - how precisely do the results match the actual rank assigned to each page?
There are two major factors that influence the time it takes for you to get a response from your search engine.
The first is whether you're storing your index on a hard disk. If you're using a database, it's very likely that you're hitting the hard disk at least a little. From a cold boot, your queries will be slow until the data necessary for those queries has been pulled into the database cache.
The other is having a cache for your popular queries. It takes a lot longer to search for a query than it does to return results from a cache, and the random access time of a disk is too slow for this, so the cache needs to be stored in RAM.
To solve both of these problems, Google uses memcached. It's an application that caches the output of the Google search engine and feeds slightly old results to users. This is fine because most of the time the web doesn't change fast enough for it to be a problem, and because of the significant overlap in searches. You can be almost guaranteed that Barack Obama has been searched for recently.
Another issue that affects search engine latency is network overhead.
Google has been using a custom variant of Linux (IIRC) that has been optimised for use as a web server. They've managed to reduce some of the time it takes to start turning around results for a query.
The moment a query hits their servers, the server immediately responds to the user with the header of the HTTP response, even before Google has finished processing the query terms.
I'm sure they have a bunch of other tricks up their sleeves, too.
EDIT:
They also keep their inverted lists sorted already, from the indexing process (it's better to process once than for each query).
With these pre-sorted lists, the most expensive operation is list intersection. I'm fairly sure Google doesn't rely on a vector space model, though, so list intersection isn't as much of a factor for them.
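For intuition, intersecting two pre-sorted posting lists is essentially what a merge join does; a relational sketch against a hypothetical postings(term, doc_id) table:

-- Both sides come back ordered by doc_id (via an index on (term, doc_id)),
-- so the intersection is a single merge pass rather than a sort.
SELECT a.doc_id
FROM postings AS a
JOIN postings AS b
  ON b.doc_id = a.doc_id
WHERE a.term = 'barack'
  AND b.term = 'obama';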
The models that pay off best, according to the literature, are probabilistic models. As an example, you may wish to look up Okapi BM25. It does fairly well in practice within my area of research (XML retrieval). When working with probabilistic models, it tends to be much more efficient to process document-at-a-time instead of term-at-a-time. This means that instead of getting a list of all the documents that contain a term, we look at each document and rank it based on the query terms it contains (skipping documents that contain none of them).
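For reference, the usual statement of the BM25 scoring function for a document D and query Q is

$$\operatorname{score}(D, Q) = \sum_{q_i \in Q} \operatorname{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$

where f(q_i, D) is the frequency of term q_i in document D, |D| is the document's length, avgdl is the average document length in the collection, and k_1 and b are free parameters (commonly k_1 between 1.2 and 2.0, and b = 0.75).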
But if we want to be smart, we can approach the problem in a different way (but only when it appears to be better). If one query term is extremely rare, we can rank by it first, because it has the highest impact. Then we rank by the next best term, and we continue until we've determined whether this document is likely to be within our top k results.
One possible strategy is to rank just the top k instead of the entire list.
For example, to find the top 100 results from 1 million hits with a heap-based selection algorithm, the time complexity is O(n log k). Since k = 100 is a constant here, log(k) is a constant factor too.
So, in effect, you only need O(n) to obtain the top 100 results out of 1 million hits.
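The same idea shows up in SQL engines: asking for only the top k rows lets the engine keep a bounded buffer of k candidates (a "top-N sort") instead of sorting everything. A sketch with invented names:

-- The engine only tracks the 100 best rows seen so far,
-- rather than sorting all one million hits.
SELECT TOP (100) doc_id, score
FROM hits
ORDER BY score DESC;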
I also guess the use of NoSQL databases instead of an RDBMS helps.
NoSQL databases scale horizontally better and don't create the same bottlenecks. Big players like Google, Facebook, and Twitter use them.
As other comments/answers have suggested, the data might already be sorted, and they may be returning offsets into that sorted data instead of processing the whole batch.
The real question is not how they sort that many results that quickly, but how they do it when tens or hundreds of millions of people around the world are querying Google at the same time xD
As Xiao said, rank just the top k instead of the entire list.
Google tells you there are 937,000,000 results, but it won't show them all to you. If you keep scrolling page after page, after a while it will truncate the results :)
Here you go, I looked it up for you and this is what I found! http://computer.howstuffworks.com/internet/basics/search-engine.htm
This is my theory... It's highly unlikely that you are the first person to search for a given keyword. So for every keyword (or combination) searched on a search engine, it maintains a hash of links to relevant web pages. Every time you click a link in the search results, it gets an up-vote in the hash set for that keyword combination. Unfortunately, if you are the first person, it saves your search keyword (for suggesting future searches) and starts building the hash for that keyword, so you end up with few or no results at all.
The page ranking, as you may know, also depends on many other factors, like backlinks, the number of pages referencing a keyword in a search, etc.
Regarding your update:
The MapReduce framework is unlikely to be the key component of real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow to meet this requirement.
MapReduce is not designed only for batch tasks. There are quite a few MapReduce-style frameworks that support real-time computing: Apache Spark, Storm, Infinispan Distributed Executor, Hazelcast Distributed Executor Service.
Back to your question: MapReduce is the key to distributing the query task across multiple nodes and then merging the results together.
There's no way you can expect to get an accurate answer to this question here ;) Anyway, here are a couple of things to consider: Google uses a unique infrastructure in every part of its stack. We cannot even guess at the order of complexity of their network equipment or their database storage. That is all I know about the hardware component of this problem.
Now, for the software implementation: as the name says, PageRank is a rank by itself. It doesn't rank the pages when you enter a search query. I assume it ranks them on a totally independent part of the infrastructure every hour. And we already know that Google crawler bots roam the Web 24/7, so I assume new pages are added to an "unsorted" hash map and then ranked on the next run of the algorithm.
Next, when you type your query, thousands of CPUs independently scan thousands of different parts of the PageRank database with a gapping factor. For example, if the gapping factor is 10, one machine queries the part of the database holding PageRank values from 0 to 9.99, another queries the part from 10 to 19.99, etc. Since resources aren't an obstacle for Google, they can set the gapping factor very low (for example, 1) so that each machine queries fewer than 100k pages, which isn't too much for their hardware. Then, when they need to compile the results of your query, since they know which machine ranks exactly which part of the database, they can use the "fill the pool" principle. Let n be the number of links on each Google results page. The algorithm that combines the pages returned from queries run on all those machines, against all the different parts of the database, only needs to fill the first n results. So they take the results from the machine querying the highest-ranked part of the database; if that yields at least n results, they're done, and if not, they move on to the next machine. This takes only O(n*g/r), where n is the number of results Google serves per page, g is the gapping factor, and r is the highest PageRank value. This assumption is supported by the fact that when you turn to the second page, your query is run once again (notice the different time taken to generate it).
This is just my two cents, but I think I'm pretty accurate with this hypothesis.
EDIT: You might want to check this out for complexity of high-order queries.
I don't know what Google really does, but surely they use approximation. For example, if the search query is "search engine", then the number of results is approximated as (number of documents with one or more occurrences of the word "search") + (number of documents with one or more occurrences of the word "engine"). This can be done in O(1) time. For details, read about the basic structure of Google: http://infolab.stanford.edu/~backrub/google.html
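In other words, the result count is presumably just a sum of per-term document frequencies precomputed at indexing time; a sketch against a hypothetical term_stats table:

-- doc_count is maintained during indexing, so this reads two rows
-- no matter how many documents actually match.
SELECT SUM(doc_count) AS approx_results
FROM term_stats
WHERE term IN ('search', 'engine');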
I have 4 text columns of interest.
Each column is up to about 100 characters.
The text in 3 of the columns is mostly Latin words. (The data is a biological catalog, and these are names of things.)
The data is currently about 500 rows. I don't expect this to grow beyond 1000.
A small number of users (under 10) will have editing privileges to add, update, and delete data. I do not expect these users to put a heavy load on the database.
So all this suggests a pretty small data set to consider.
I need to perform a search on all 4 columns for rows where at least 1 column contains the search text (case-insensitive). The query will be issued (and the results served) via a web application. I'm a bit lost on how to approach this.
PostgreSQL offers a few options for improving text search speed. The options built into PostgreSQL that I've been considering are the following (rough sketches follow the list):
Don't try to index this at all. Just use ILIKE, LIKE on lower, or similar. (Without an index?)
Index with pg_trgm to improve search speed. I would assume that I would need to index the concatenation somehow.
Full text searching. I assume this would involve concatenating for the index also.
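Roughly, and with invented table/column names, the three options would look like this:

-- 1. No index: scan the table and match case-insensitively.
SELECT * FROM names
WHERE col1 ILIKE '%pinus%' OR col2 ILIKE '%pinus%'
   OR col3 ILIKE '%pinus%' OR col4 ILIKE '%pinus%';

-- 2. Trigram index on the concatenation; ILIKE queries must be written
--    against the same expression for the index to be usable.
CREATE EXTENSION pg_trgm;
CREATE INDEX names_trgm_idx ON names USING gin (
    (coalesce(col1,'') || ' ' || coalesce(col2,'') || ' ' ||
     coalesce(col3,'') || ' ' || coalesce(col4,'')) gin_trgm_ops);

-- 3. Full text search; the 'simple' configuration does no language-specific
--    stemming, which sidesteps the mixed-language concern.
CREATE INDEX names_fts_idx ON names USING gin (
    to_tsvector('simple',
        coalesce(col1,'') || ' ' || coalesce(col2,'') || ' ' ||
        coalesce(col3,'') || ' ' || coalesce(col4,'')));
SELECT * FROM names
WHERE to_tsvector('simple',
          coalesce(col1,'') || ' ' || coalesce(col2,'') || ' ' ||
          coalesce(col3,'') || ' ' || coalesce(col4,''))
      @@ plainto_tsquery('simple', 'pinus');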
Unfortunately, I'm not really familiar with the expected performance of any of these, or with their benefits and trade-offs, so it's hard to know what I should try first and what I shouldn't even consider. Some things I have read suggest that building the indexes for options 2 and 3 is pretty slow, which conflicts with the fact that I'll have occasional modifications going on. And the mixed languages make full text search seem unattractive, since it appears to be language-based, unless it can handle multiple languages simultaneously. Should I expect that for data this small, a simple ILIKE, or maybe a LIKE on lower(), is probably fast enough? Or is the indexing fast enough for the low rate of modifications on data this small? Would I be better off looking for something outside the database?
Granted, I would have to actually benchmark all of these to know for sure what's fastest, but unfortunately I don't have much time for this project. So what are the benefits and trade-offs of these methods? Which of these options are not appropriate for this type of problem? What other kinds of solutions (including ones potentially outside the database) are worth considering?
(I suppose I might find some kind of beginner's tutorial on text searching in PG useful, but my searches mostly turn up Full Text Search, and I don't even know whether that's useful for me.)
I'm on PG 9.2.4, so any goodies pre-9.3 are an option.
Update: I've expanded this answer into a detailed blog post.
Rather than focusing purely on speed, please consider search semantics first. Define your requirements.
For example, do users need to be able to differentiate based on the order of terms? Should
radiata pinus
find:
pinus radiata
? Does the same rule apply to words within a column as between columns?
Are spaces always word separators, or are spaces within a column part of the search term?
Do you need wildcards? If so, do you need only left-anchored wildcards (think staph%), or do you need right-anchored or infix wildcards too (%ccus, p%s)? Only pg_trgm will help you with infix wildcards. Suffix wildcards can be handled by an index on the reverse() of a word, but that gets clumsy quickly, so in practice pg_trgm is the best option there.
If you're mostly searching for discrete words and word order isn't important, Pg's full-text search with to_tsvector and to_tsquery will be desirable. It supports left-anchored wildcard searches, weighting, categories, etc.
If you're mostly doing prefix searches of discrete columns, then simple LIKE queries on a regular b-tree index per column will be the way to go.
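One caveat, sketched with invented names: unless the database was initialized with the C locale, a plain b-tree index cannot serve LIKE prefix matches, and the index needs the text_pattern_ops operator class:

CREATE INDEX names_col1_prefix_idx ON names (col1 text_pattern_ops);

-- Left-anchored searches can now use the index.
SELECT * FROM names WHERE col1 LIKE 'Pinus%';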
So. Figure out what you need, then how to do it. Your current uncertainty probably stems partly from not really knowing quite what you want.
For 1000 rows, I would guess that LIKE together with lower() should be fast enough. After a couple of queries, the table will most probably be completely cached.
Regarding indexing with pg_trgm: you are talking about "occasional" updates/inserts to the table. I would think the additional cost of a trigram index would only show up if you updated/inserted into that table a lot, like several times a second.
If "occasional" only means several times an hour (or even less), then I doubt you'd see the difference in real life. I think somewhere on Depesz's blog there was also an article that compared insert speed with and without a trigram index, but I can't find it anymore.
Counting the number of records in a collection matched by a query, even on an indexed field, takes too much time. For example, let's say there is a collection consisting of 10,000 records, with an index on the creationDate field. Getting the last ten records from the collection is faster than counting the number of records created in the last day. The count takes more than 5 seconds, sometimes even up to 70 seconds, to return a result. Do you have any idea how to solve this problem? What is the best way to deal with this issue?
By the way, we also use Morphia, and we saw that getting the count through Morphia is even slower, so for count queries we transform the Morphia query into a Java driver query. Has anyone encountered a similar situation? Why does Morphia respond even more slowly? Does this happen only for count queries, or is it slow in general compared to using the Java driver directly?
Help, suggestions, or workarounds would be really appreciated; our application relies heavily on count queries, and the slowness of the system is really annoying for us right now.
Thanks in advance.
While this might not be the final answer, let's get started and evolve this further:
Your indexes should always fit into RAM, otherwise you will get really bad performance.
To evaluate how much RAM is used, you can either use 10gen's MMS or check with various tools. For a description, plus possible reasons for low (resident) memory usage, see http://www.kchodorow.com/blog/2012/05/10/thursday-5-diagnosing-high-readahead/. Or you simply haven't accessed enough data yet, in which case you can use MongoDB's touch command (but I doubt that, since you're already having performance issues).
Besides adding RAM and making sure you use all the available RAM, you could also drop unused indexes or use compound indexes where possible.
Wait for a fix.
As Asya Kamsky pointed out, count performance is really bad on 2.2. The only workaround we've found is to avoid counts as much as possible.
(There are other things that are inexplicably slow in MongoDB, such as aggregation queries; most of those have associated JIRA issues and are being worked on/scheduled.)
I'm not really trying to compress a database; this is more of a logical problem. Is there an algorithm that will take a data table with lots of columns and repeated data and find a way to reorganize it into many tables with IDs, such that the total number of cells is as small as possible, and such that those tables can be joined in a query to replicate the original one?
I don't care about any particular database engine or language. I just want to see if there is a logical way of doing it. If you post code, I like C# and SQL, but you can use any language.
I don't know of any automated algorithms, but what you really need to do is heavily normalize your database. This means looking at your actual functional dependencies and breaking things off wherever it makes sense.
The problem with trying to do this in a computer program is that it isn't always clear whether your current set of stored data represents all possible cases. You can't just look at the number of distinct values, either. It makes little sense to break booleans off into their own table just because they have only two values, for example, and that is only the tip of the iceberg.
I think that at this point, nothing is going to beat good ol' patient, hand-crafted normalization. This is something to do by hand. Any computer algorithm will either make a total mess of things or make you specify the relationships in such detail that you might as well do it all yourself.
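As a contrived sketch of the kind of hand-done split meant here (all names invented): a wide table repeating city/country data is broken into two tables, and a join reproduces the original.

-- Before: country is repeated for every person in the same city.
CREATE TABLE person_flat (
    id      INT PRIMARY KEY,
    name    VARCHAR(100),
    city    VARCHAR(100),
    country VARCHAR(100)
);

-- After: the functional dependency city -> country gets its own table.
CREATE TABLE city (
    id      INT PRIMARY KEY,
    name    VARCHAR(100),
    country VARCHAR(100)
);

CREATE TABLE person (
    id      INT PRIMARY KEY,
    name    VARCHAR(100),
    city_id INT REFERENCES city (id)
);

-- The original table comes back with a join.
SELECT p.id, p.name, c.name AS city, c.country
FROM person p
JOIN city c ON c.id = p.city_id;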