Parallel matrix product - algorithm

To compute the product of two matrices A and B (of dimension n x m) in parallel, I have the following restrictions: the server sends each client a number of rows from matrix A and a number of rows from matrix B. This cannot be changed. The clients may then exchange information with each other so that the matrix product can be computed, but they cannot ask the server to send any other data.
This should be done as efficiently as possible, meaning by minimizing the number of messages sent between processes - considered an expensive operation - and by doing the small calculations in parallel as much as possible.
From what I have researched, the highest number of messages exchanged between the clients is in practice n^2, in the case where each process broadcasts its rows to all the others. The problem is that if I minimize the number of messages sent - this would be around log(n) for distributing the input data - the computation would then be done by only one process (or a few), so it is no longer done in parallel, which was the main idea of the problem.
What could be a more efficient algorithm, that would compute this product?
(I am using MPI, if it makes any difference).

To compute the matrix product C = A x B element-by-element you simply calculate C(i,j) = dot_product(A(i,:),B(:,j)). That is, the (i,j) element of C is the dot product of row i of A and column j of B.
If you insist on sending rows of A and rows of B around then you are going to have a tough time writing a parallel program whose performance exceeds a straightforward serial program. Rather, what you ought to do is send rows of A and columns of B to processors for computation of elements of C. If you are constrained to send rows of A and rows of B, then I suggest that you do that, but compute the product on the server. That is, ignore all the worker processors and just perform the calculation serially.
One alternative would be to compute partial dot-products on worker processors and to accumulate the partial results. This will require some tricky programming; it can be done but I will be very surprised if, at your first attempt, you can write a program which outperforms (in execution speed) a simple serial program.
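A minimal single-process sketch of the accumulate-partial-results idea, assuming the rows are dealt out in contiguous blocks and that the clients pass their block of B around a ring. The names `split_rows` and `ring_matmul` are made up for illustration, and the loop over clients stands in for what would be MPI sends and receives in a real program; it is one possible arrangement, not necessarily what the problem-setter intended.

```python
def split_rows(n_rows, parts):
    """Return contiguous (start, stop) row ranges, one per client."""
    base, extra = divmod(n_rows, parts)
    ranges, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def ring_matmul(A, B, clients):
    """Simulate C = A x B when client p owns a row block of A and a row block of B."""
    n, m, q = len(A), len(B), len(B[0])
    a_ranges = split_rows(n, clients)
    b_ranges = split_rows(m, clients)
    C = [[0.0] * q for _ in range(n)]
    for step in range(clients):
        for p in range(clients):
            # At this step client p holds the B block that started on client (p + step) % clients.
            k_lo, k_hi = b_ranges[(p + step) % clients]
            i_lo, i_hi = a_ranges[p]
            for i in range(i_lo, i_hi):
                for k in range(k_lo, k_hi):
                    a_ik = A[i][k]
                    # Row k of B contributes a_ik * B[k][:] to row i of C.
                    for j in range(q):
                        C[i][j] += a_ik * B[k][j]
    return C
```

In this arrangement every client computes at every step, and with P clients the B blocks are shifted P-1 times, so about P*(P-1) block messages are exchanged in total.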
(Yes, there are other approaches to decomposing matrix-matrix products for parallel execution, but they are more complicated than the foregoing. If you want to investigate these then Matrix Computations is the place to start reading.)
You need also to think hard about your proposed measures of efficiency -- the most efficient message-passing program will be the one which passes no messages. If the cost of message-passing far outweighs the cost of computation then the no-message-passing implementation will be the most efficient by both measures. Generally though, measures of the efficiency of parallel programs are ratios of speedup to number of processors: so 8 times speedup on 8 processors is perfectly efficient (and usually impossible to achieve).
As stated yours is not a sensible problem. Either the problem-setter has mis-specified it, or you have mis-stated (or mis-understood) a correct specification.

Something's not right: if both matrices have n x m dimensions, then they cannot be multiplied together (unless n = m). In the case of A*B, A has to have as many columns as B has rows. Are you sure that the server isn't sending rows of B's transpose? That would be equivalent to sending columns of B, in which case the solution is trivial.
Assuming that all those check out, and your clients do indeed get rows from A and B: probably the easiest solution would be for each client to send its rows of matrix B to client #0, which reassembles the original matrix B and then sends its columns back out to the other clients. Basically, client #0 would act as a server that actually knows how to decompose the data efficiently. This would be 2*(n-1) messages (not counting the ones used to reunite the product matrix), but considering that you already need n messages to distribute the A and B matrices between the clients, there's no significant performance loss (it's still O(n) messages).
The biggest bottleneck here is obviously the initial gathering and redistribution of the matrix B, which scales terribly, so if you have fairly small matrices and a lot of processes, you might just be better off calculating the product serially on the server.

I don't know if this is homework. But if it is not homework, then you should probably use a library. One option is ScaLAPACK:
http://www.netlib.org/scalapack/scalapack_home.html
ScaLAPACK is written in Fortran, but you can call it from C++.

Related

Performance comparison: Algorithm S and Algorithm Z

Recently I ran into two sampling algorithms: Algorithm S and Algorithm Z.
Suppose we want to sample n items from a data set. Let N be the size of the data set.
When N is known, we can use Algorithm S
When N is unknown, we can use Algorithm Z (optimized atop Algorithm R)
Performance of the two algorithms:
Algorithm S
Time complexity: the average number of scanned items is n(N+1)/(n+1) (I computed this result myself; Knuth's book leaves it as an exercise), so we can say it is O(N)
Space complexity: O(1) or O(n)(if returning an array)
Algorithm Z (I searched the web and found the paper https://www.cs.umd.edu/~samir/498/vitter.pdf)
Time complexity: O(n(1+log(N/n)))
Space complexity: in TAOCP vol2 3.4.2, it mentions Algorithm R's space complexity is O(n(1+log(N/n))), so I suppose Algorithm Z might be the same
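For reference, a minimal sketch of selection sampling in the style of Algorithm S, assuming the data set is a list of known length N and that n <= N. The function name `selection_sample` and the `rng` parameter are illustrative choices, not taken from TAOCP.

```python
import random


def selection_sample(items, n, rng=random):
    """Pick n of the N items in a single pass, preserving their order."""
    N = len(items)
    chosen = []
    for t, item in enumerate(items):
        # Select with probability (items still needed) / (items not yet scanned).
        if rng.random() * (N - t) < n - len(chosen):
            chosen.append(item)
            if len(chosen) == n:
                break  # the sample is complete, so the scan can stop early
    return chosen
```

The early `break` is what makes the average number of scanned items smaller than N.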
My question
The model for Algorithm Z is: keep calling next method on the data set until we reach the end. So for the problem that N is known, we can still use Algorithm Z.
Based on the above performance comparison, Algorithm Z has better time complexity than Algorithm S, and worse space complexity.
If space is not a problem, should we use Algorithm Z even when N is known?
Is my understanding correct? Thanks!
Is the Postgres code mentioned in your comment actually used in production? In my opinion, it really should be reviewed by someone who has at least some understanding of the problem domain. The problem with random sampling algorithms, and random algorithms in general, is that it is very hard to diagnose biased sampling bugs. Most samples "look random" if you don't look too hard, and biased sampling is only obvious when you do a biased sample of a biased dataset. Or when your biased sample results in a prediction which is catastrophically divergent from reality, which will eventually happen but maybe not when you're doing the code review.
Anyway, by way of trying to answer the questions, both the one actually in the text of this post and the ones added or implied in the comment stream:
Properly implemented, Vitter's algorithm Z is much faster than Knuth's algorithm S. If you have a use case in which reservoir sampling is indicated, then you should probably use Vitter, subject to the code testing advice above: Vitter's algorithm is more complicated and it might not be obvious how to validate the implementation.
I noticed in the Postgres code that it just uses the threshold value of 22 to decide whether to use the more complicated code, based on testing done almost 40 years ago on hardware which you'd be hard pressed to find today. It's possible that 22 is not a bad threshold, but it's just a number pulled out of thin air. At least some attempt should be made to verify or, more likely, correct it.
Forty years ago, when those algorithms were developed, large datasets were typically stored on magnetic tape. Magnetic tape is still used today, but applications have changed; I think that you're not likely to find a Postgres installation in which a live database is stored on tape. This matters because the way you get data off a tape drive is radically different from the way you get data from a file server. Or a sharded distributed collection of file servers, which also has its particular needs.
Data on a reel of tape can only be accessed linearly, although it is possible to skip tape somewhat faster than you can read it. On a file server, data is random access; there may be a slight penalty for jumping around in a file, but there might not. (On the sharded distributed model, it might well be faster than linear reads.) But trying to read out of order on a tape drive might turn an input operation which takes an hour into an operation which takes a week. So it's very important to access the sample in order. Moreover, you really don't want to have to read the tape twice, which would take twice as long.
One of the other assumptions that was made in those algorithms is that you might not have enough memory to store the entire sample; in 1985, main memory was horribly expensive and databases were already quite large. So a common way to collect a large sample from a huge database was to copy the sampled blocks onto secondary memory, such as another tape drive. But there's a bit of a catch with reservoir sampling: as the sampling algorithm proceeds, some items which were initially inserted in the sample are later replaced with other items. But you can't replace data written on tape, so you need to just keep on appending the newly selected samples. What you do hold in random access memory is a list of locations of the sample; once you've finished selecting the sample, you can sort this list of locations and then use it to read out the final selection in storage order, skipping over the rejected items. That means that the temporary sample storage ends up holding both the final sample, and some number of later rejected items. The O(n(1+log(N/n))) space complexity in Algorithm R refers to precisely this storage, and it's actually a reasonably small multiplier, considering.
All that is irrelevant if you can just allocate enough random access storage somewhere to hold the entire sample. Or, even better, if you can read data directly from the database. There could well still be good reasons to read the sample into local storage, but nothing stops you from updating a block of local storage with a different block.
On the other hand, in many common cases, you don't need to read the data in order to sample it. You can just take a list of item numbers, select a sample of the desired size from that list, and then set about acquiring the sample from the list of selected item numbers. And that presents a rather different problem: how to choose an unbiased sample of size k from a set of K item indexes.
There's a fast and simple solution to that (also described by Knuth, unsurprisingly): make an array of all the item numbers (say, the integers from 0 to K-1), and then shuffle the array using the standard Knuth/Fisher-Yates shuffle, with a slight modification: you run the algorithm from front to back (instead of back to front, as it is often presented), and stop after k iterations. At that point the first k elements in the partially shuffled array are an unbiased sample. (In fact, you don't need the entire vector of K indices, as long as k is much smaller than K. You're only going to touch O(k) of the values, and you can keep the ones you touched in a hash table of size O(k).)
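The partial shuffle with a hash table might look like the following sketch, assuming the items are numbered 0 to K-1 and k <= K. The name `partial_shuffle_sample` is made up for illustration, and a Python dict plays the role of the hash table.

```python
import random


def partial_shuffle_sample(K, k, rng=random):
    """Return k distinct indices from range(K) using a front-to-back partial Fisher-Yates shuffle."""
    moved = {}   # sparse view of the array: slot -> value, only for slots that were touched
    sample = []
    for i in range(k):
        j = rng.randrange(i, K)       # pick a slot from the not-yet-fixed tail
        value_j = moved.get(j, j)     # untouched slots still hold their own index
        moved[j] = moved.get(i, i)    # swap: slot j receives what slot i held
        sample.append(value_j)        # slot i is now final and is never visited again
    return sample
```

The dict never holds more than k entries, so memory stays O(k) regardless of K.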
And there's an even simpler algorithm, again for the case where the sample is small relative to the dataset: just keep one bit for each item in the dataset, which indicates that the item has been selected. Now select k items at random, marking the bit vector as you go; if the relevant bit is already marked, then that item is already in the sample; you just ignore that selection and continue with the next random choice. The expected number of ignored selections is very small unless the sample size is a significant fraction of the dataset size.
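A short sketch of the bit-vector approach, assuming K items numbered 0 to K-1 and k <= K. The name `bitmap_sample` is hypothetical, and a `bytearray` is used as the bit vector.

```python
import random


def bitmap_sample(K, k, rng=random):
    """Return k distinct indices from range(K), retrying whenever a pick is already marked."""
    selected = bytearray((K + 7) // 8)   # one bit per item, all initially clear
    sample = []
    while len(sample) < k:
        i = rng.randrange(K)
        byte, mask = i >> 3, 1 << (i & 7)
        if selected[byte] & mask:
            continue                     # already in the sample: ignore and draw again
        selected[byte] |= mask
        sample.append(i)
    return sample
```

Each accepted item can be processed as soon as it is appended, since it is never removed later.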
There's one other criterion which weighed on the minds of Vitter and Knuth: you'll normally want to do something with the selected sample. And given the amount of time it takes to read through a tape, you want to be able to start processing each item immediately as it is accepted. That precludes algorithms which include, for example, "sort the selected indices and then read the indicated items". (See above.) For immediate processing to be possible, you must not depend on being able to "deselect" already selected items.
Fortunately, both the quick algorithms mentioned at the end of point 2 do satisfy this requirement. In both cases, an item once selected will never be later rejected.
There is at least one use case for reservoir sampling which is still very much relevant: sampling a datastream which is too voluminous or too high-bandwidth to store. That might be some kind of massive social media feed, or it might be telemetry data from a large sensor array, or whatever. In that case, you might want to reduce the size of the datastream by extracting only a small sample, and reservoir sampling is a good candidate. However, that has nothing to do with the Postgres example.
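For the streaming case, a minimal reservoir sketch could look like this. Note that it is the simple Algorithm R, not Vitter's faster Algorithm Z (which skips ahead instead of drawing a random number per item); the name `reservoir_sample` is illustrative.

```python
import random


def reservoir_sample(stream, n, rng=random):
    """Keep a uniform sample of n items from an iterable whose length is not known in advance."""
    reservoir = []
    for t, item in enumerate(stream):
        if t < n:
            reservoir.append(item)        # the first n items fill the reservoir
        else:
            j = rng.randrange(t + 1)      # uniform over 0..t
            if j < n:
                reservoir[j] = item       # replace a previously chosen item
    return reservoir
```

Because the stream is consumed one item at a time, it works with a generator that is never materialized in memory.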
In summary:
Yes, you can (and probably should) use Vitter's Algorithm Z in preference to Knuth's Algorithm S, even if you know how big the data set is.
But there are certainly better algorithms, some of which are outlined above.

Statistical Analysis of Runtime Measurements of a Parallel Algorithm

Problem Introduction
Assume we have a parallel algorithm f(<params>) running on P cores whereas
<params>: Parameters for algorithm
P: Number of cores it runs on (i.e. threads, cores, processors)
We further assume that our implementation actually consists of three parts:
A - Distribution: We distribute the input to all processors
B - Run the algorithm: We run f(<params>) ("on each processor")
C - Collection: We collect the computed data from all processors
After fixing <params> and P like input size, number of processors etc. the algorithm itself is deterministic i.e. we can write down an exact cDAG for it.
I'm now trying to answer the question: "For a given set of parameters, what is the execution time for a given system?"
With "given system" I mean e.g. "my computer" or "the university super computer" because obviously, the runtime does depend on the system it runs on and obviously the system itself does introduce non-determinism because you never really know the state of the system.
So in short: While the algorithm might be deterministic, runtime measurements aren't. (but e.g. communication measurements would be deterministic.) So we need to do a proper statistical analysis. And this is where I'm unsure.
Measuring Runtime: Basic idea
We are interested in how long part "B - Run the Algorithm" takes. Since the algorithm actually runs on P cores, we'll make a measurement on each core and so get P values; let's call those P values P_measurements. Some cores might finish before others, so which value represents the runtime of the whole algorithm? I think a good choice is to simply take the value of the core that took the longest, i.e. max(P_measurements).
Now there are two things that need consideration here:
We have to repeat the measurement n times since it's a non-deterministic value
Once we have those n*P values, we need to know how to properly summarize them.
(An additional concern would be how to communicate those results in the end, but that's not part of this question.)
Measuring Runtime: Statistical Analysis
So here's what I'd do and this is also the part where I'm very unsure.
We measure the runtime of f(<params>) on each of the P cores. We get P_measurements
We take max(P_measurements)
We repeat steps 1 and 2 n times and end up with maxes, a list of the n values max(P_measurements).
We check whether maxes is normally distributed using a Q-Q plot. If not, we normalize. We do expect it to be right-skewed.
Now we take the median of maxes. (If we normalized, we use the normalized values)
We compute the standard deviation, the population mean and the 95% confidence interval.
We might want to say that all values are of an error of e.g. 5% so we check if all the values lie between +-5% of the population mean i.e. the confidence interval should be rather "thin".
We got ourselves some nice runtime measurement.
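The summarization in the steps above could be sketched as follows, using only the standard library. This is an illustration under stated assumptions: `summarize_runs` is a made-up name, the Q-Q plot and normalization step is left out, the mean is the sample mean, and the interval uses the normal approximation (z = 1.96) rather than the t-distribution, which would give a wider interval for small n.

```python
import math
import statistics


def summarize_runs(runs, z=1.96):
    """runs: list of n repetitions, each a list of the P per-core times in seconds."""
    maxes = [max(per_core) for per_core in runs]    # steps 1-3: slowest core per repetition
    n = len(maxes)
    mean = statistics.mean(maxes)
    stdev = statistics.stdev(maxes)                 # sample standard deviation
    half_width = z * stdev / math.sqrt(n)           # assumption: normal approximation
    return {
        "maxes": maxes,
        "median": statistics.median(maxes),
        "mean": mean,
        "stdev": stdev,
        "ci95": (mean - half_width, mean + half_width),
        "ci_within_5pct": half_width <= 0.05 * mean,   # is the interval "thin" enough?
    }
```

For example, `summarize_runs([[1.0, 2.0], [1.5, 2.2], [0.9, 1.8]])` reduces the three repetitions to the maxima 2.0, 2.2 and 1.8 before computing the statistics.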
Clarifications:
Step 4. was necessary because computing the CI in step 6 uses the t-distribution and because later on I want to measure a different implementation of the same algorithm. So I'll have to compare two values and for that I need to do e.g. a t-test. So I need to make sure, the prerequisites for the t-test are met, which are: iid & normally distributed. Iid is assumed.
Question
I am very unsure whether what I did is statistically sound, especially steps 1-3. I'm not sure if I can do that kind of summarization (just taking the max) here. I know that we might have an outlier that's "especially" high, but since we only measure on super computers we can assume the noise to be low, and since we take the median in the end any outliers shouldn't have a big impact.
I hope for good input since it's a rather complex topic and I'm very interested in doing it right. I mostly followed the following paper, which I can recommend: http://spcl.inf.ethz.ch/Teaching/2020-dphpc/hoefler-scientific-benchmarking.pdf
But even with the paper, I'm not used to doing statistical analysis and thus would just like to get some input from people who actually know this stuff. :)

Quick and Merge sort for multiple CPUs

Both merge sort and quick sort can work in parallel. Each time we split a problem in two sub-problems we can run those sub-problems in parallel. However it looks sub-optimal.
Suppose we have 4 CPUs. On the 1st iteration we split the problem into only 2 sub-problems and two CPUs are idle. On the 2nd iteration all CPUs are busy, but on the 3rd iteration we do not have enough CPUs. So, we should adapt the algorithm for the case when CPUs << log(N).
Does it make sense? How would you adapt the sorting algorithms to these cases?
First off, the best parallel implementation will depend highly on the environment. Some factors to consider:
Shared Memory (a 4-core computer) vs. Not Shared (4 single-core computers)
Size of data to sort
Speed of comparing two elements
Speed of swapping/moving two elements
Memory available
Is each computer/core identical or are there differences in speeds, network latency to communicate between parts, cache effects, etc.
Fault tolerance: what if one computer/core broke down in the middle of the operation.
etc.
Now moving back to the theoretical:
Suppose I have 1024 cards, and 7 other people to help me sort them.
Merge Sort
I quickly split the stack into 8 sections of somewhat equal size. It won't be perfectly equal since I am going fast. Actually since my friends can start sorting their part as soon as they get their section, I should give my first friend a stack bigger than the rest and get smaller towards the end.
Each person sorts their part however they like sequentially. (radix sort, quick sort, merge sort, etc.)
Now for the hard part ... merging.
In real life I would probably have the first two people that are ready form a pair and start merging their decks together. Perhaps they could work together, one person merging from the front and the other from the back. Perhaps they could both work from the front while calling their numbers out.
Soon enough other people will be done with their individual sorting, and can start merging. I would have them form pairs as they find convenient and keep going until all the cards are merged.
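A rough sketch of the split-then-sort-then-merge scheme, assuming a fixed number of workers. `chunked_merge_sort` is a made-up name, and two simplifications are worth flagging: threads are used only to keep the example self-contained (in CPython they will not give a CPU-bound speed-up, so a process pool would be the natural substitute), and the final step is a single k-way merge rather than the pairwise merging described above.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def chunked_merge_sort(data, workers=4):
    """Split data into one chunk per worker, sort the chunks concurrently, then merge them."""
    if not data:
        return []
    size = -(-len(data) // workers)     # ceiling division: chunk length
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_chunks = list(pool.map(sorted, chunks))   # each worker sorts however it likes
    return list(heapq.merge(*sorted_chunks))             # k-way merge of the sorted chunks
```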
Quick Sort
The real trick here is to try to parallelize the partitioning, since the rest is pretty easy to do.
I will start by breaking the stack into 8 parts, and hand one part out to each friend. While doing this, I will choose one of the cards that looks like it might end up towards the middle of the sorted deck. I call out that number.
Each of my friends will partition their smaller stack into three piles, less than the called out number, equal to the called out number, and greater than the called out number. If one friend is faster than the others, he/she can steal some cards from a neighboring friend.
When they are finished with that, I collect all the less thans into one pile and give that to friends 0 through 3, I set aside the equal to's, and give the greater's to friends 4 through 7.
Friends 0 through 3, will divide their stack into four somewhat equal parts, will choose a card to partition around, and repeat the process amongst themselves.
This repeats until each friend has their own stack.
(Note that if the partitioning card wasn't chosen well, rather than dividing up the work 50-50, maybe I would only assign 2 friends to work on the less thans, and let the other 6 work on the greater thans.)
At the end, I just collect all of the stacks in the right order, along with the partition cards.
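The card-partitioning procedure can be sketched like this, with the same caveat that threads merely stand in for the friends. The names `three_way_partition` and `team_quicksort` are hypothetical, the pivot is simply the middle element, and in this simplified version the two halves of the team are processed one after the other rather than at the same time.

```python
from concurrent.futures import ThreadPoolExecutor


def three_way_partition(chunk, pivot):
    """One friend's job: split a stack into less-than, equal-to and greater-than piles."""
    less = [x for x in chunk if x < pivot]
    equal = [x for x in chunk if x == pivot]
    greater = [x for x in chunk if x > pivot]
    return less, equal, greater


def team_quicksort(data, workers=8):
    if workers <= 1 or len(data) <= 1:
        return sorted(data)                  # a single friend sorts their own stack
    pivot = data[len(data) // 2]             # a guess at a card near the middle
    size = -(-len(data) // workers)          # ceiling division: stack size per friend
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda c: three_way_partition(c, pivot), chunks))
    less = [x for p in parts for x in p[0]]
    equal = [x for p in parts for x in p[1]]
    greater = [x for p in parts for x in p[2]]
    total = len(less) + len(greater)
    if total == 0:
        return equal
    # Split the team in proportion to the pile sizes, keeping at least one friend per side.
    left = max(1, min(workers - 1, round(workers * len(less) / total)))
    return team_quicksort(less, left) + equal + team_quicksort(greater, workers - left)
```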
Conclusion
While it is true that some approaches are faster on a computer than in real life, I think the preceding is a good start. Different computers or cores or threads will perform their work at different speeds, unless you are implementing the sort in hardware. (If you are, you might want to look into "Sorting Networks" and/or "Optimal Sorting Networks").
If you are sorting numbers, you will need a large dataset to be helped by parallelizing it.
However, if you are sorting images by comparing the summed Manhattan distance between corresponding pixels' red, green and blue values, you will find it less difficult to get a speed-up of just under k times with k CPUs.
Lastly, you will want to time the sequential version(s) and compare as you go along, since cache effects, memory usage, network costs, etc. might just make a difference.

Coming up with factors for a weighted algorithm?

I'm trying to come up with a weighted algorithm for an application. In the application, there is a limited amount of space available for different elements. Once all the space is occupied, the algorithm should choose the best element(s) to remove in order to make space for new elements.
There are different attributes which should affect this decision. For example:
T: Time since last accessed. (It's best to replace something that hasn't been accessed in a while.)
N: Number of times accessed. (It's best to replace something which hasn't been accessed many times.)
R: Number of elements which need to be removed in order to make space for the new element. (It's best to replace the least amount of elements. Ideally this should also take into consideration the T and N attributes of each element being replaced.)
I have 2 problems:
Figuring out how much weight to give each of these attributes.
Figuring out how to calculate the weight for an element.
(1) I realize that coming up with the weight for something like this is very subjective, but I was hoping that there's a standard method or something that can help me in deciding how much weight to give each attribute. For example, I was thinking that one method might be to come up with a set of two sample elements and then manually compare the two and decide which one should ultimately be chosen. Here's an example:
Element A: N = 5, T = 2 hours ago.
Element B: N = 4, T = 10 minutes ago.
In this example, I would probably want A to be the element that is chosen to be replaced since although it was accessed one more time, it hasn't been accessed in a lot of time compared with B. This method seems like it would take a lot of time, and would involve making a lot of tough, subjective decisions. Additionally, it may not be trivial to come up with the resulting weights at the end.
Another method I came up with was to just arbitrarily choose weights for the different attributes and then use the application for a while. If I notice anything obviously wrong with the algorithm, I could then go in and slightly modify the weights. This is basically a "guess and check" method.
Both of these methods don't seem that great and I'm hoping there's a better solution.
(2) Once I do figure out the weight, I'm not sure which way is best to calculate the weight. Should I just add everything? (In these examples, I'm assuming that whichever element has the highest replacementWeight should be the one that's going to be replaced.)
replacementWeight = .4*T - .1*N - 2*R
or multiply everything?
replacementWeight = (T) * (.5*N) * (.1*R)
What about not using constants for the weights? For example, sure "Time" (T) may be important, but once a specific amount of time has passed, it starts not making that much of a difference. Essentially I would lump it all in an "a lot of time has passed" bin. (e.g. even though 8 hours and 7 hours have an hour difference between the two, this difference might not be as significant as the difference between 1 minute and 5 minutes since these two are much more recent.) (Or another example: replacing (R) 1 or 2 elements is fine, but when I start needing to replace 5 or 6, that should be heavily weighted down... therefore it shouldn't be linear.)
replacementWeight = 1/T + sqrt(N) - R*R
Obviously (1) and (2) are closely related, which is why I'm hoping that there's a better way to come up with this sort of algorithm.
What you are describing is the classic problem of choosing a cache replacement policy. Which policy is best for you, depends on your data, but the following usually works well:
First, always store a new object in the cache, evicting the R worst one(s). There is no way to know a priori if an object should be stored or not. If the object is not useful, it will fall out of the cache again soon.
The popular squid cache implements the following cache replacement algorithms:
Least Recently Used (LRU):
replacementKey = -T
Least Frequently Used with Dynamic Aging (LFUDA):
replacementKey = N + C
Greedy-Dual-Size-Frequency (GDSF):
replacementKey = (N/R) + C
C refers to a cache age factor here. C is basically the replacementKey of the item that was evicted last (or zero).
NOTE: The replacementKey is calculated when an object is inserted or accessed, and stored alongside the object. The object with the smallest replacementKey is evicted.
LRU is simple and often good enough. The bigger your cache, the better it performs.
LFUDA and GDSF both are tradeoffs. LFUDA prefers to keep large objects even if they are less popular, under the assumption that one hit to a large object makes up lots of hits for smaller objects. GDSF basically makes the opposite tradeoff, keeping many smaller objects over fewer large objects. From what you write, the latter might be a good fit.
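A toy sketch of the GDSF rule as stated above, assuming R is read as the size of an object and that no single object is larger than the whole cache. The class `GDSFCache` and its method names are made up for illustration and are not Squid's implementation; a real cache would use a heap rather than a linear scan to find the smallest key.

```python
class GDSFCache:
    def __init__(self, capacity):
        self.capacity = capacity   # total size units available
        self.used = 0
        self.age = 0.0             # C: replacementKey of the item evicted last (or zero)
        self.items = {}            # name -> [size, hits, replacement_key]

    def _key(self, size, hits):
        return hits / size + self.age          # replacementKey = (N/R) + C

    def access(self, name, size):
        """Record an access, inserting on a miss. Returns the names evicted to make room."""
        evicted = []
        if name in self.items:
            entry = self.items[name]
            entry[1] += 1                       # one more hit
            entry[2] = self._key(entry[0], entry[1])
            return evicted
        while self.items and self.used + size > self.capacity:
            victim = min(self.items, key=lambda k: self.items[k][2])
            v_size, _, v_key = self.items.pop(victim)
            self.used -= v_size
            self.age = v_key                    # dynamic aging
            evicted.append(victim)
        self.items[name] = [size, 1, self._key(size, 1)]
        self.used += size
        return evicted
```

Replacing `hits / size` with `hits` gives the LFUDA key, and storing the negated time since last access instead gives LRU.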
If none of these meet your needs, you can calculate optimal values for T, N and R (and compare different formulas for combining them) by minimizing regret, the difference in performance between your formula and the optimal algorithm, using, for example, Linear regression.
This is a completely subjective issue -- as you yourself point out. And a distinct possibility is that if your test cases consist of pairs (A,B) where you prefer A to B, then you might find that you prefer A to B and B to C, but also C to A -- i.e. it's not an ordering.
If you are not careful, your function might not exist!
If you can define a scalar function of your input variables, with various parameters for coefficients and exponents, you might be able to estimate said parameters by using regression, but you will need an awful lot of data if you have many parameters.
This is the classical statistician's approach of first reviewing the data to IDENTIFY a model, and then using that model to ESTIMATE a particular realisation of the model. There are large books on this subject.

What is the best way to analyse a large dataset with similar records?

Currently I am looking for a way to develop an algorithm to analyse a large dataset (about 600M records). The records have the parameters "calling party", "called party" and "call duration", and I would like to create a graph of weighted connections among phone users.
The whole dataset consists of similar records - people mostly talk to their friends and don't dial random numbers, but occasionally a person calls "random" numbers as well. For analysing the records I was thinking about the following logic:
create an array of numbers to indicate which records (row numbers) have already been scanned.
start scanning from the first line and for the first line combination "calling party", "called party" check for the same combinations in the database
sum the call durations and divide the result by the sum of all call durations
add the numbers of summed lines into the array created at the beginning
check the array if the next record number has already been summed
if it has already been summed then skip the record, else perform step 2
I would appreciate if anyone of you suggested any improvement of the logic described above.
p.s. the edges are directed therefore the (calling party, called party) is not equal to (called party, calling party)
Although this is not programming related, I would like to emphasize that, for legal reasons and out of respect for user privacy, all information that could possibly reveal a user's identity was hashed before the analysis.
As always with large datasets the more information you have about the distribution of values in them the better you can tailor an algorithm. For example, if you knew that there were only, say, 1000 different telephone numbers to consider you could create a 1000x1000 array into which to write your statistics.
Your first step should be to analyse the distribution(s) of data in your dataset.
In the absence of any further information about your data I'm inclined to suggest that you create a hash table. Read each record in your 600M dataset and calculate a hash address from the concatenation of calling and called numbers. Into the table at that address write the calling and called numbers (you'll need them later, and bear in mind that the hash is probably irreversible), add 1 to the number of calls and add the duration to the total duration. Repeat 600M times.
Now you have a hash table which contains the data you want.
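In Python, for instance, the hash-table pass might be sketched as below, assuming the records arrive as (calling, called, duration) tuples. The name `aggregate_calls` is illustrative; a dict keeps the key alongside the value, which covers the need to store the calling and called numbers.

```python
from collections import defaultdict


def aggregate_calls(records):
    """Map each (calling, called) pair to [number of calls, total duration]."""
    stats = defaultdict(lambda: [0, 0])
    for calling, called, duration in records:
        entry = stats[(calling, called)]   # hashed on the pair; direction matters
        entry[0] += 1                      # add 1 to the number of calls
        entry[1] += duration               # add the duration to the total
    return dict(stats)
```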
Since there are 600M records, the dataset seems large enough to justify a database (and not so large as to require a distributed database). So, you could simply load this into a DB (MySQL, SQL Server, Oracle, etc.) and run the following query:
select calling_party, called_party, sum(call_duration), avg(call_duration), min(call_duration), max(call_duration), count(*) from call_log group by calling_party, called_party order by 7 desc;
That would be a start.
Next, you would want to run some Association analysis (possibly using Weka), or perhaps you would want to analyze this information as cubes (possibly using Mondrian/OLAP). If you tell us more, we can help you more.
Algorithmically, what the DB is doing internally is similar to what you would do yourself programmatically:
Scan each record
Find the record for each (calling_party, called_party) combination, and update its stats.
A good way to store and find records for (calling_party, called_party) would be to use a hash function and to find the matching record in the bucket.
Although it may be tempting to create a two-dimensional array for (calling_party, called_party), that will be a very sparse array (very wasteful).
How often will you need to perform this analysis? If this is a large, unique dataset and thus only once or twice - don't worry too much about the performance, just get it done, e.g. as Amrinder Arora says by using simple, existing tooling you happen to know.
You really want more information about the distribution, as High Performance Mark says. For starters, it'd be nice to know the count of unique phone numbers, the count of unique phone number pairs, and the mean, variance and maximum of the count of calling/called phone numbers per unique phone number.
You really want more information about the analysis you want to perform on the result. For instance, are you more interested in holistic statistics or in identifying individual clusters? Do you care more about following the links forward (determining who X frequently called) or following the links backward (determining who X was frequently called by)? Do you want to project overviews of this graph into low-dimensional spaces, i.e. 2D? Should it be easy to identify indirect links - e.g. X is near {A, B, C}, all of whom are near Y, so X is sorta near Y?
If you want fast and frequently adapted results, then be aware that a dense representation with good memory & temporal locality can easily make a huge difference in performance. In particular, that can easily outweigh a factor ln N in big-O notation; you may benefit from a dense, sorted representation over a hashtable. And databases? Those are really slow. Don't touch those if you can avoid it at all; they are likely to be a factor 10000 slower - or more, the more complex the queries are you want to perform on the result.
Just sort records by "calling party" and then by "called party". That way each unique pair will have all its occurrences in consecutive positions. Hence, you can calculate the weight of each pair (calling party, called party) in one pass with little extra memory.
For sorting, you can sort small chunks separately, and then do an N-way merge. That's memory efficient and can be easily parallelized.
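A small in-memory sketch of the sort-then-scan idea, assuming a non-empty list of (calling, called, duration) tuples and the normalization from the question (pair total divided by the sum of all call durations). `edge_weights_sorted` is a made-up name, and for 600M records the `sorted` call would be replaced by the chunked external sort described above.

```python
from itertools import groupby
from operator import itemgetter


def edge_weights_sorted(records):
    """Return [((calling, called), weight), ...] with identical pairs grouped after sorting."""
    pair = itemgetter(0, 1)
    ordered = sorted(records, key=pair)          # by calling party, then by called party
    total = sum(r[2] for r in ordered)           # sum of all call durations
    weights = []
    for key, group in groupby(ordered, key=pair):
        # All occurrences of a pair are consecutive, so one pass is enough.
        weights.append((key, sum(r[2] for r in group) / total))
    return weights
```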

Resources