Practical importance of efficient sorting algorithms

I've been looking around trying to find practical applications where sorting is needed and its efficiency matters, but I couldn't find anything.
The only examples I could find or think of either did not need a total sort (like finding the top 100 results or the median) or were cases where sorting efficiency hardly mattered (like sorting a spreadsheet of student names or past transactions once a year).
When sorting web search results, only a few dozen top-ranked results need to be found and sorted, not all of the Internet, so classical sorting algorithms are neither needed nor practical.
When sorting a spreadsheet, it hardly matters whether it is sorted by a triple-pivot Las Vegas randomised quicksort or by insertion sort.
Using sorted arrays as sets or associative arrays seems to be practically less efficient than using hash tables.
So my question is: what are practical ("real-life") examples where a total sorting is required and its efficiency is a bottleneck? I am particularly curious about applications for comparison sorting.
Update.
I've stumbled upon this phrase in lecture notes by Steven Skiena:
Computers spend more time sorting than anything else, historically 25% on mainframes.
With some details, that could make a perfect answer to my question. Where can I find the source for this statistic, ideally with some details about the kind of sorting done on mainframes and what it was used for?

In some graphics rendering algorithms, objects need to be drawn in back to front order. A good example is transparent particles: there can be hundreds of thousands of them, and because of the transparency, traditional depth buffering doesn't work. So you need to sort these particles by distance from the camera, and keep them sorted, at 60 frames per second.
Interestingly, if the order of the particles doesn't change much (relatively slow particle motion, little camera movement), then the array of particles will already be "mostly sorted" in the next frame, and a simple bubble sort or insertion sort can actually work fine. But on frames where many particles are created, or the camera moves quickly, sort performance can become important, simply because there are so many other things to do each frame.
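As a rough sketch of that idea (the particle representation and numbers are made up, not from any particular engine), re-sorting with insertion sort is nearly linear when the previous frame's order is almost right:

    # Minimal sketch: re-sort particles back-to-front each frame for alpha blending.
    # A "particle" here is just an (x, y, z) tuple; a real engine would have its own type.
    import random

    def depth_sq(p, cam):
        return sum((a - b) ** 2 for a, b in zip(p, cam))

    def resort_back_to_front(particles, cam):
        # Insertion sort: ~O(n) when the previous frame's order is almost right,
        # O(n^2) only in the worst case (camera teleports, burst of new particles).
        keyed = [(depth_sq(p, cam), p) for p in particles]
        for i in range(1, len(keyed)):
            item = keyed[i]
            j = i - 1
            while j >= 0 and keyed[j][0] < item[0]:   # farthest first
                keyed[j + 1] = keyed[j]
                j -= 1
            keyed[j + 1] = item
        particles[:] = [p for _, p in keyed]

    cam = (0.0, 0.0, 5.0)
    particles = [tuple(random.uniform(-10, 10) for _ in range(3)) for _ in range(1000)]
    resort_back_to_front(particles, cam)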

Imagine you have a daily list of transactions (deposits and withdrawals) for bank accounts. There are millions of accounts, and millions of transactions per day. Each night, you have to update the accounts to reflect those transactions, compute the interest accrued that day, and print a report, ordered by account, that shows each account with its daily activity.
One way to do that is to go through the list sequentially, reading a transaction and updating the account in the database. That will work, but it has several drawbacks, including:
1. If there are many transactions for a single account, you pay the price of retrieving and updating the account for every transaction. Considering that a business account can have thousands of transactions per day, those costs add up.
2. The typical rule is that deposits are recorded before withdrawals, so as to prevent an overdraft. If an account's balance is 0, and the transactions list has a withdrawal of $5 ahead of a $10 deposit, the system will record an overdraft when it shouldn't.
3. Printing the report would require a separate scan of the database, after all transactions are recorded.
The solution to those problems is to sort the transactions list by account and type (deposits first). Then, the update is a simple merge operation. You read the database and the transactions list in account number order, apply any transactions for that account, compute interest, print the output line, and write the updated record to the database.
The result is much faster than doing a read-update-write for every single transaction, and it eliminates the problems #2 and #3 I outlined above. Sort-and-merge makes the difference between the update taking all night, and the update taking a few hours.
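A minimal sketch of that sort-and-merge pass, with simplified stand-ins for the record layout, interest rule, and file handling (a real system would stream both files from disk):

    # Sketch of the nightly sort-and-merge update described above; all names and
    # the interest rate are hypothetical stand-ins.
    DAILY_RATE = 0.0001   # hypothetical daily interest rate

    def nightly_update(accounts, transactions):
        # accounts: master records, already sorted by account number
        # transactions: (account_no, type, amount); type "D" = deposit, "W" = withdrawal
        # Sort by account number, with deposits before withdrawals within each account.
        transactions.sort(key=lambda t: (t[0], 0 if t[1] == "D" else 1))
        report, i = [], 0
        for acct in accounts:                     # one sequential pass over the master file
            while i < len(transactions) and transactions[i][0] < acct["no"]:
                i += 1                            # transaction for an unknown account: skip
            while i < len(transactions) and transactions[i][0] == acct["no"]:
                _, kind, amount = transactions[i]
                acct["balance"] += amount if kind == "D" else -amount
                i += 1
            acct["balance"] += acct["balance"] * DAILY_RATE
            report.append((acct["no"], round(acct["balance"], 2)))
        return report

    accounts = [{"no": 1, "balance": 100.0}, {"no": 2, "balance": 0.0}]
    transactions = [(2, "W", 5.0), (1, "D", 50.0), (2, "D", 10.0)]
    print(nightly_update(accounts, transactions))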
Also, MapReduce (and Hadoop), used for processing big data, makes heavy use of sorting. Those programming models simply would not be possible without high-performance sorting algorithms.
Any time you need to merge multiple large data streams into a single output stream (and those applications are legion), the sort-and-merge approach is useful. There are times when other techniques might be faster, but the sort-and-merge is reliable and durable, and, as shown by MapReduce, scales well.

Related

What are best known algorithms/techniques for updating edges on huge graph structures like social networks?

On social networks like Twitter, where millions follow a single account, it must be very challenging to update all followers instantly when a new tweet is posted. Similarly, on Facebook there are fan pages with millions of followers, and we see updates from them instantly when something is posted on the page. I am wondering what the best known techniques and algorithms to achieve this are. I understand that with a billion accounts they have huge data centers across the globe, but even if we reduce the problem to a single computer as follows - 100,000 nodes with an average of 200 edges per node - then every single node update will require 200 edge updates. So what are the best techniques/algorithms to optimize such large updates? Thanks!
The best way is usually just to do all the updates. You say they can be seen "instantly", but actually the updates probably propagate through the network and can take up to a few seconds to show up in followers' feeds.
Having to do all those updates may seem like a lot, but on average a follower will check for updates much more often than the person being followed will produce them, and checking for updates has to be much faster.
The choices are:
1. Update 1 million followers, a couple of times a day, within a few seconds; or
2. Respond to checks from 1 million followers, a couple hundred times a day, within 1/10 second or so.
There are in-between strategies involving clustering users and stuff, but usage patterns like you see on Facebook and Twitter are probably so heavily biased toward option (1) that such strategies don't pay off.
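For illustration, a toy sketch of option (1), fan-out on write, with in-memory structures and hypothetical names (real systems would shard these maps across machines): the cost is paid once per post, so reads stay cheap.

    # Toy sketch of fan-out on write: push each new post into every follower's feed.
    from collections import defaultdict, deque

    followers = defaultdict(set)                       # author -> set of follower ids
    feeds = defaultdict(lambda: deque(maxlen=1000))    # user -> most recent post ids

    def follow(user, author):
        followers[author].add(user)

    def post(author, post_id):
        # O(number of followers) work at write time...
        for user in followers[author]:
            feeds[user].appendleft(post_id)

    def read_feed(user):
        # ...so reading a feed is O(1) per item, with no sorting or merging needed.
        return list(feeds[user])

    follow("alice", "celebrity")
    follow("bob", "celebrity")
    post("celebrity", "post-42")
    print(read_feed("alice"))   # ['post-42']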

How does a search engine rank millions of pages within 1 second?

I understand the basics of search engine ranking, including the ideas of "reverse index", "vector space model", "cosine similarity", "PageRank", etc.
However, when a user submits a popular query term, it is very likely that millions of pages contain this term. As a result, a search engine still needs to sort these millions of pages in real time. For example, I just tried searching "Barack Obama" in Google. It shows "About 937,000,000 results (0.49 seconds)". Ranking over 900M items within 0.5 seconds? That really blows my mind!
How does a search engine sort such a large number of items within 1 second? Can anyone give me some intuitive ideas or point out references?
Thanks!
UPDATE:
Most of the responses (including some older discussions) so far seem to attribute the credit to the "reverse index". However, as far as I know, the reverse index only helps find the "relevant pages". In other words, using the inverted index Google can obtain the 900M pages containing "Barack Obama" (out of several billion pages). But it is still not clear from the threads I have read so far how to "rank" these millions of "relevant pages".
MapReduce framework is unlikely to be the key component for real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow to meet our request.
The question would be really relevant if we were sure that the ranking was complete. It is quite possible that the ordering provided is approximate.
Given the fluidity of the ranking results, no answer that looks reasonable could be considered incorrect. For example, if an entire section of the web were excluded from the top results, you would not notice, provided they were included later.
This gives the developers a degree of latitude entirely unavailable in almost all other domains.
The real question to ask is - how precisely do the results match the actual rank assigned to each page?
There are two major factors that influence the time it takes for you to get a response from your search engine.
The first is whether you're storing your index on hard disk. If you're using a database, it's very likely that you're using the hard disk at least a little. From a cold boot, your queries will be slow until the data necessary for those queries has been pulled into the database cache.
The other is having a cache for popular queries. Executing a query from scratch takes a lot longer than returning results from a cache. And since the random access time of a disk is too slow, the cache needs to be kept in RAM.
To solve both of these problems, Google uses memcached. It's an application that caches the output of the Google search engine and feeds slightly old results to users. This is fine because most of the time the web doesn't change fast enough for it to be a problem, and because of the significant overlap in searches. You can be almost guaranteed that Barack Obama has been searched for recently.
Another issue that affects search engine latency is network overhead.
Google has been using a custom variant of Linux (IIRC) that has been optimised for use as a web server. That has shaved off some of the time it takes to start turning results around for a query.
The moment a query hits their servers, the server immediately responds back to the user with the header for the HTTP response, even before Google has finished processing the query terms.
I'm sure they have a bunch of other tricks up their sleeves, too.
EDIT:
They also keep their inverted lists sorted already, from the indexing process (it's better to process once than for each query).
With these pre-sorted lists, the most expensive operation is list intersection. That said, I'm fairly sure Google doesn't rely on a pure vector space model, so list intersection isn't as much of a factor for them.
The models that pay off best according to the literature are the probabilistic models. As an example, you may wish to look up Okapi BM25. It does fairly well in practice within my area of research (XML retrieval). When working with probabilistic models, it tends to be much more efficient to process document-at-a-time instead of term-at-a-time. What this means is that instead of getting a list of all the documents that contain a term, we look at each document and rank it based on the query terms it contains (skipping documents that contain none of them).
But if we want to be smart, we can approach the problem in a different way (but only when it appears to be better). If there's a query term that is extremely rare, we can rank with that first, because it has the highest impact. Then we rank with the next best term, and we continue until we've determined if it's likely that this document will be within our top k results.
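As a rough illustration of document-at-a-time scoring with a probabilistic model, here is a toy Okapi BM25 sketch over a tiny in-memory corpus; the constants and corpus are illustrative, and a real engine would of course work over compressed postings lists rather than Python lists.

    # Toy document-at-a-time Okapi BM25 scoring; each document is scored in turn
    # against the query and the whole list is then ordered by score.
    import math

    k1, b = 1.5, 0.75   # common BM25 constants, chosen for illustration

    def bm25_score(query_terms, doc_tf, doc_len, avg_len, doc_freq, n_docs):
        score = 0.0
        for t in query_terms:
            f = doc_tf.get(t, 0)
            if f == 0:
                continue   # term absent from this document
            idf = math.log(1 + (n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avg_len))
        return score

    docs = [["barack", "obama", "president"], ["obama", "speech"], ["weather", "report"]]
    doc_freq = {}
    for d in docs:
        for t in set(d):
            doc_freq[t] = doc_freq.get(t, 0) + 1
    avg_len = sum(len(d) for d in docs) / len(docs)

    query = ["barack", "obama"]
    ranked = sorted(
        range(len(docs)),
        key=lambda i: bm25_score(query, {t: docs[i].count(t) for t in set(docs[i])},
                                 len(docs[i]), avg_len, doc_freq, len(docs)),
        reverse=True,
    )
    print(ranked)   # document indices ordered by BM25 score, best first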
One possible strategy is to rank just the top k instead of the entire list.
For example, to find the top 100 results from 1 million hits with a bounded heap, the time complexity is O(n log k) (a selection algorithm such as quickselect can even do it in O(n)). Since k = 100 and n = 1,000,000, log(k) is a small constant in practice.
So you effectively only need O(n) work to obtain the top 100 results out of 1 million hits.
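A minimal sketch of that top-k selection with a bounded heap (the scores and ids here are made up):

    # Selecting the top 100 of 1,000,000 scored hits with a bounded heap:
    # O(n log k) time and O(k) extra memory, no full sort required.
    import heapq, random

    hits = [(random.random(), doc_id) for doc_id in range(1_000_000)]   # (score, id)
    top_100 = heapq.nlargest(100, hits)   # keeps only a 100-element heap internally
    print(top_100[:3])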
Also, I guess the use of NoSQL databases instead of an RDBMS helps.
NoSQL databases scale horizontally better and don't create a single bottleneck. Big players like Google, Facebook, and Twitter use them.
As other comments/answers suggested the data might be already sorted, and they are returning offsets of the data found instead of the whole batch.
The real question is not how they sort that many results that quickly, but how they do it when tens or hundreds of millions of people around the world are querying Google at the same time.
As Xiao said, just rank the top-k instead of the entire list.
Google tells you there are 937,000,000 results, but it won't show them all to you. If you keep scrolling page after page, after a while it will truncate the results :)
Here you go, I looked it up for you and this is what I found! http://computer.howstuffworks.com/internet/basics/search-engine.htm
This is my theory... It's highly unlikely that you are the first person to search for a keyword. So for every keyword (or combination) searched on a search engine, it maintains a hash of links to relevant web pages. Every time you click a link in the search results, it gets a vote-up in the hash set for that keyword combination. Unfortunately, if you are the first person, it saves your search keyword (for suggesting future searches) and starts hashing that keyword, so you end up with few or no results at all.
The page ranking, as you may know, also depends on many other factors, like backlinks, the number of pages referring to a keyword in the search, etc.
Regarding your update:
MapReduce framework is unlikely to be the key component for real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow to meet our request.
MapReduce is not only designed for batch tasks. There are quite a lot of MapReduce frameworks supporting real-time computing: Apache Spark, Storm, Infinispan Distributed Executor, Hazelcast Distributed Executor Service.
Back to your question: MapReduce is the key to distributing the query task to multiple nodes and then merging the results together.
There's no way you can expect to get an accurate answer to this question here ;) Anyway, here are a couple of things to consider. Google uses a unique infrastructure in every part of it. We cannot even guess the order of complexity of their network equipment or their database storage. That is all I know about the hardware component of this problem.
Now, for the software implementation: as the name says, PageRank is a rank by itself. It doesn't rank the pages when you enter the search query. I assume it ranks them on a totally independent part of the infrastructure every hour. And we already know that Google crawler bots roam the Web 24/7, so I assume that new pages are added into an "unsorted" hash map and then ranked on the next run of the algorithm.
Next, when you type your query, thousands of CPUs independently scan thousands of different parts of the PageRank database with a gapping factor. For example, if the gapping factor is 10, one machine queries the part of the database that has PageRank values from 0-9.99, another queries the part from 10-19.99, and so on. Since resources aren't an obstacle for Google, they can set the gapping factor very low (for example 1) so that each machine queries fewer than 100k pages, which isn't too much for their hardware.
Then, when they need to compile the results of your query, since they know which machine ranks exactly which part of the database, they can use the 'fill the pool' principle. Let n be the number of links on each Google results page. The algorithm that combines the pages returned from the queries run on all those machines only needs to fill the first n results. So they take the results from the machine querying the highest-ranked part of the database; if that yields at least n results they're done, and if not they move on to the next machine. This takes only O(n*g/r), where n is the number of results Google serves per page, g is the gapping factor, and r is the highest PageRank value. This assumption is supported by the fact that when you turn to the second page, your query is run once again (notice the different time taken to generate it).
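Purely to illustrate the 'fill the pool' idea hypothesised in this answer (not how Google actually works), here is a toy merge over shards ordered from the highest PageRank band to the lowest:

    # Toy illustration of the "fill the pool" hypothesis: each shard holds results
    # from one PageRank band, already sorted best-first; results are drained from
    # the highest band downward until n results are collected.
    def fill_the_pool(shards, n):
        results = []
        for shard in shards:          # shards ordered from highest band to lowest
            for page in shard:
                results.append(page)
                if len(results) == n:
                    return results
        return results

    shards = [["page-a", "page-b"], ["page-c"], ["page-d", "page-e", "page-f"]]
    print(fill_the_pool(shards, 4))   # ['page-a', 'page-b', 'page-c', 'page-d']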
This is just my two cents, but I think I'm pretty accurate with this hypothesis.
EDIT: You might want to check this out for complexity of high-order queries.
I don't know what Google really does, but surely they use approximation. For example, if the search query is 'search engine', the reported number of results might simply be (number of documents containing the word 'search') + (number of documents containing the word 'engine'). With precomputed document frequencies, this can be done in O(1) time. For details, read about the basic structure of Google: http://infolab.stanford.edu/~backrub/google.html

Efficient distributed counting

I have a series of events flowing through a system (e.g a pizza ordering system) and I want to count certain properties of each event through time. For example, I might want to see how many unique people ordered pepperoni pizza in the last 5 minutes, or how many pizzas John Doe ordered in the past week.
It is a LOT of events, so we're using something like Cassandra or HBase because even the counts can't be stored in memory. Also, since we need to keep track of set membership (in order to count unique people ordering a particular kind of pizza, for example), it gets bigger.
We could store a list of orders and then query to count, but this is slow. And we mostly don't care who ordered pepperoni pizza, just how many unique orders were made, and in a given time window.
What's the best way to store this information, for example in Cassandra, such that the information can be retrieved in some time intervals?
I tried at first to use Redis + bloom filters, but storing a bloom filter bit vector would require transactions to avoid race conditions, so then I used redis sets.
Then I realized the whole thing was too big to just be in memory, so I decided to switch to a disk-backed store. However, there are no native sets like in redis.
I looked at sketches / streaming algorithms like HyperLogLog, but the conclusion was that to save the HyperLogLog object I would need to store its bit array (or pickle the object, or whatever)... is that kosher, and what are the best practices for this, if this is indeed the solution?
I was tempted to save each event individually with a timestamp, then query and count on demand, but this is slow. I'm looking for something better, if it exists.
Example Requests:
How many unique people had a pepperoni pizza order in the past 10 minutes
How many unique pepperoni pizzas were ordered by some person John Doe in the past 30 minutes
There are a few ways to approach this problem from what I have learned.
1. Use locking plus a set-membership / counting data structure, e.g. a HyperLogLog or a Bloom filter. As long as there's not much contention on a particular lock, things should be okay.
2. Use a database that has built-in set/collection support. Such databases pretty much implement #1 internally.
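As a rough sketch of these approaches, here is time-bucketed unique counting with exact sets (names and bucket granularity are illustrative); the same structure works with one HyperLogLog or Bloom filter per bucket if memory is the constraint.

    # Bucket orders per minute and count unique customers over a sliding window.
    import time
    from collections import defaultdict

    buckets = defaultdict(set)   # (topping, minute) -> set of customer ids

    def record_order(topping, customer, ts=None):
        minute = int((ts or time.time()) // 60)
        buckets[(topping, minute)].add(customer)

    def unique_customers(topping, window_minutes, ts=None):
        now = int((ts or time.time()) // 60)
        seen = set()
        for m in range(now - window_minutes + 1, now + 1):
            seen |= buckets.get((topping, m), set())   # union the buckets in the window
        return len(seen)

    record_order("pepperoni", "john.doe")
    record_order("pepperoni", "jane.roe")
    record_order("pepperoni", "john.doe")
    print(unique_customers("pepperoni", 10))   # 2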
My guesses:
Cassandra supports counters - I think I saw an incr operation, which should work concurrently. By using a free-running counter per event, you just need to set up something that samples all counters at specified intervals (5 minutes?); then you can give estimates between two samples.
(http://wiki.apache.org/cassandra/Counters)
Cassandra can also expire a column after a timeout. I never really used that, but it might be worth a try.

Is there a well established incremental algorithm to maintain a history of values with accumulation over specific time frames?

I have practically completed one, but wanted to compare mine with a well-researched and possibly academic algorithm. There may be a library of statistical objects which, either directly or in combination, solves my particular need.
My system (which I intend to open-source) has a stream of NetFlow data. Rather than storing it in a database and using SQL functions, I prefer a database-free system that maintains a set of statistics, updated for each new flow and scrolled per second (or faster).
My solution involves a single array of uint, effectively forming a jagged array of sizes [60, 59, 23, 6, ...], representing seconds, minutes, hours, days, weeks, etc.
Each slot contains the total number of bytes for that time period. So after 60 seconds a single minute statistic is created as Avg(seconds). This continues on up the time scale.
Rather than simply keeping thousands of one-second increments, I roll up the time scales because of:
Memory constraints and the potential to have more statistical nodes; and
Ideal presentation to users.
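For illustration only, here is a minimal sketch of the roll-up structure described above; the class name, method names, and slot counts are my own illustrative choices, not the poster's actual implementation.

    # Each completed second is written into a seconds array; when a level fills,
    # its average is rolled up into one slot of the next coarser level.
    SLOTS = [60, 60, 24, 7]   # seconds/minute, minutes/hour, hours/day, days/week

    class RollingStats:
        def __init__(self):
            self.levels = [[0] * n for n in SLOTS]   # levels[0] = seconds ... levels[3] = days
            self.cursor = [0] * len(SLOTS)
            self.current_second = 0

        def add_bytes(self, nbytes):
            self.current_second += nbytes            # a GenerateDelta-style update lands here

        def scroll(self):
            # Called once per second: close the current second and cascade roll-ups.
            self._roll_up(0, self.current_second)
            self.current_second = 0

        def _roll_up(self, lvl, value):
            if lvl == len(self.levels):
                return                               # coarser than the last scale: drop it
            self.levels[lvl][self.cursor[lvl]] = value
            self.cursor[lvl] += 1
            if self.cursor[lvl] == len(self.levels[lvl]):
                self.cursor[lvl] = 0                 # level completed a full cycle
                avg = sum(self.levels[lvl]) // len(self.levels[lvl])
                self._roll_up(lvl + 1, avg)          # e.g. 60 seconds become one minute slot

    stats = RollingStats()
    stats.add_bytes(1500)
    stats.scroll()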
Given that a flow may be applied to several nodes in a hierarchy of statistics (WAN Link, IP Address, Destination Address, SourcePort-DestinationPort), I calculate the delta once (GenerateDelta) and then simply apply it at every node which is both active and matches the flow metadata.
A statistic on a given node would be "scrolled", in the following potential cases:
When being read/displayed (via an HTTP/JSON AJAX request)
When a delta is being applied (due to relevant flow)
Simply every n-seconds (n is typically 1)
Overall, there may be a well-established algorithm for keeping running totals over time (with seconds, minutes, ...). Failing that, there may also be suitable algorithms for comparison against smaller sub-sections of my code:
GenerateDelta - not likely, as this is specific to breaking down and averaging a flow with a duration over slots in the statistics array.
Scroll - if there were only seconds this would of course be simple; however my solution requires the 60 seconds to be combined into a new minute total every 60 seconds, and so on.
I do not wish responders to suggest any of their own algorithms; I have already (almost) completed my own without any problems and with many performance considerations, and others will be able to have a look at my algorithm when I have finished and published it as open source.
What I do wish to see is any "well established" algorithms for comparison. Perhaps mine will be better, perhaps mine will be worse. Google isn't good at this sort of question, I need your help.
Thanks!
Thanks to a comment from @rici, I found that the "stream statistics" domain is what I was looking for. There are Data Stream Management Systems (DSMS) for dealing with stream statistics. Whereas an SQL RDBMS stores data and produces statistics via SQL queries, a Data Stream Management System enables the processing of a continuous stream of data against one or more standing queries.
This paper describes a DSMS as:
Being willing to sacrifice exactness for approximate answers
Being single-pass, because the data is vast
Having queries that treat data as sequences, not sets
And more...
This one depicts a diagram of such a DSMS and references the network traffic analysis problem domain.
This paper describes StreamSQL, an SQL-like syntax for defining continuous queries.
Even though proprietary solutions are not accessible, there certainly are well-established algorithms, so I can test the performance of my specialised system against general stream-query tools.
Several DSMS products/prototypes can be found on this wiki page; Odysseus in particular is of interest, being Java-based and open source.

Does someone really sort terabytes of data?

I recently spoke to someone who works for Amazon, and he asked me: how would I go about sorting terabytes of data using a programming language?
I'm a C++ guy, and of course we spoke about merge sort; one possible technique is to split the data into smaller chunks, sort each of them, and finally merge them.
But in reality, do companies like Amazon or eBay sort terabytes of data? I know they store tons of information, but do they sort it?
In a nutshell, my question is: why wouldn't they keep the data sorted in the first place, instead of sorting terabytes of it?
But in reality, do companies like Amazon/eBay sort terabytes of data? I know they store tons of info, but do they sort it?
Yes. Last time I checked Google processed over 20 petabytes of data daily.
Why wouldn't they keep them sorted in the first place instead of sorting terabytes of data, is my question in a nutshell.
EDIT: relet makes a very good point; you only need to keep indexes and have those sorted. You can easily and efficiently retrieve sorted data that way. You don't have to sort the entire dataset.
Consider log data from servers, Amazon must have a huge amount of data. The log data is generally stored as it is received, that is, sorted according to time. Thus if you want it sorted by product, you would need to sort the whole data set.
Another issue is that many times the data needs to be sorted according to the processing requirement, which might not be known beforehand.
For example: though not a terabyte, I recently sorted around 24 GB of Twitter follower-network data using merge sort. The implementation that I used was by Prof. Daniel Lemire.
http://www.daniel-lemire.com/blog/archives/2010/04/06/external-memory-sorting-in-java-the-first-release/
The data was sorted by user id, and each line contained a user id followed by the user id of the person following them. However, in my case I wanted data about who follows whom, so I had to sort the data again by the second user id on each line.
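For anyone curious, here is a plain external merge sort sketch (not Prof. Lemire's implementation): split the file into chunks that fit in memory, sort each chunk, then stream a k-way merge of the sorted chunks.

    # External merge sort sketch: chunk, sort, k-way merge via a heap.
    import heapq, os, tempfile

    def external_sort(in_path, out_path, key, max_lines=1_000_000):
        chunk_paths = []
        with open(in_path) as f:
            while True:
                lines = [line for _, line in zip(range(max_lines), f)]
                if not lines:
                    break
                lines = [l if l.endswith("\n") else l + "\n" for l in lines]
                lines.sort(key=key)                      # in-memory sort of one chunk
                tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".chunk")
                tmp.writelines(lines)
                tmp.close()
                chunk_paths.append(tmp.name)

        # heapq.merge streams the sorted chunks, keeping only one line per chunk in memory.
        chunks = [open(p) for p in chunk_paths]
        with open(out_path, "w") as out:
            out.writelines(heapq.merge(*chunks, key=key))
        for c in chunks:
            c.close()
            os.unlink(c.name)

    # e.g. re-sort "follower followee" lines by the second user id:
    # external_sort("edges.txt", "by_followee.txt", key=lambda line: line.split()[1])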
However, for sorting 1 TB I would use MapReduce with Hadoop.
Sort is the default step after the map function. Thus I would choose the identity map function and none as the reduce function, and set up streaming jobs.
Hadoop uses HDFS, which stores data in huge blocks of 64 MB (this value can be changed). By default it runs a single map task per block. After the map function runs, its output is sorted, I guess by an algorithm similar to merge sort.
Here is the link to the identity mapper:
http://hadoop.apache.org/common/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html
If you want to sort by some element in that data, then I would make that element a key in XXX and the line the value of the map output.
Yes, certain companies certainly sort at least that much data every day.
Google has a framework called MapReduce that splits work - like a merge sort - onto different boxes, and handles hardware and network failures smoothly.
Hadoop is a similar Apache project you can play with yourself, to enable splitting a sort algorithm over a cluster of computers.
Every database index is a sorted representation of some part of your data. If you index it, you sort the keys - even if you do not necessarily reorder the entire dataset.
Yes, some companies certainly do, and maybe even individuals. Take high-frequency traders as an example. Some of them are well known, say Goldman Sachs. They run very sophisticated algorithms against the market, taking into account tick data for the last couple of years: every change in the price offering, real deal prices (trades, AKA prints), etc. For highly volatile instruments such as stocks, futures and options, there are gigabytes of data every day, and they have to do scientific research on data for thousands of instruments over the last couple of years. Not to mention the news that they correlate with the market, weather conditions and even moon phases. So, yes, there are guys who sort terabytes of data. Maybe not every day, but they do.
Scientific datasets can easily run into terabytes. You may sort them and store them in one way (say by date) when you gather the data. However, at some point someone will want the data sorted by another method, e.g. by latitude if you're using data about the Earth.
Big companies do sort terabytes and petabytes of data regularly. I've worked for more than one of them. Like Dean J said, companies rely on frameworks built to handle such tasks efficiently and consistently, so the users of the data do not need to implement their own sorting. But the people who built the framework had to figure out how to do certain things (not just sorting, but key extraction, enriching, etc.) at massive scale. Despite all that, there may be situations where you need to implement your own sorting. For example, I recently worked on a data project that involved processing log files with events coming from mobile apps.
For security/privacy reasons, certain fields in the log files needed to be encrypted before the data could be moved on for further processing. That meant applying a custom encryption algorithm to each row. However, since the same field value appeared hundreds of times in a file, it was more efficient to sort the file first, encrypt each distinct value once, and reuse the cached result for repeated values.
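A minimal sketch of that sort-first trick, with a hash standing in as a placeholder for the real (expensive) custom encryption routine:

    # Sort rows by the sensitive field so repeated values are adjacent, encrypt each
    # distinct value once, and reuse the result for the duplicates that follow.
    import hashlib

    def encrypt(value):
        return hashlib.sha256(value.encode()).hexdigest()   # placeholder for the real cipher

    def encrypt_field(rows, field_index):
        rows.sort(key=lambda r: r[field_index])
        last_plain, last_cipher = None, None
        for row in rows:
            if row[field_index] != last_plain:               # only pay for new values
                last_plain = row[field_index]
                last_cipher = encrypt(last_plain)
            row[field_index] = last_cipher
        return rows

    rows = [["evt1", "user-a"], ["evt2", "user-b"], ["evt3", "user-a"]]
    print(encrypt_field(rows, 1))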
