How to merge two presorted RDDs in Spark?

I have two large CSV files presorted by one of the columns. Is there a way to use the fact that they are already sorted to get a new sorted RDD faster, without doing a full sort again?

The short answer: No, there is no way to leverage the fact that two input RDDs are already sorted when using the sort facilities offered by Apache Spark.
The long answer: Under certain conditions, there might be a better way than using sortBy or sortByKey.
The most obvious case is when the input RDDs are already sorted and cover disjoint ranges. In this case, simply using rdd1.union(rdd2) is the fastest (virtually zero cost) way of combining the input RDDs, assuming that all elements in rdd1 come before all elements in rdd2 (according to the chosen ordering).
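For illustration, a minimal sketch in Scala, assuming two small key-value RDDs whose key ranges are disjoint (the data and app name are made up):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("presorted-union"))

// two presorted inputs with disjoint key ranges: every key in rdd1 < every key in rdd2
val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val rdd2 = sc.parallelize(Seq((4, "d"), (5, "e"), (6, "f")))

// union() merely concatenates the partition lists of its inputs -- no shuffle,
// no per-element comparisons -- so the result is already globally sorted
val merged = rdd1.union(rdd2)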
When the ranges of the input RDDs overlap, things get trickier. Assuming that the target RDD is to have only a single partition, it might be efficient to use toLocalIterator on both RDDs and then do the merge manually. If the result has to be an RDD, one could do this inside the compute method of a custom RDD type, processing the input RDDs and generating the output.
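A sketch of such a manual merge in Scala, assuming both inputs are sorted by key (mergeSorted is a made-up helper, not part of the Spark API):

// lazily merge two sorted iterators of (key, value) pairs
def mergeSorted[K, V](it1: BufferedIterator[(K, V)], it2: BufferedIterator[(K, V)])
                     (implicit ord: Ordering[K]): Iterator[(K, V)] =
  new Iterator[(K, V)] {
    def hasNext: Boolean = it1.hasNext || it2.hasNext
    def next(): (K, V) =
      if (!it2.hasNext) it1.next()
      else if (!it1.hasNext) it2.next()
      else if (ord.lteq(it1.head._1, it2.head._1)) it1.next()
      else it2.next()
  }

// toLocalIterator streams one partition at a time to the driver
val mergedIter = mergeSorted(rdd1.toLocalIterator.buffered, rdd2.toLocalIterator.buffered)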
When the inputs are large and thus consist of many partitions, things get even trickier. In this case, you probably want multiple partitions in the output RDD as well. You could use the custom RDD mentioned earlier, but create multiple partitions (using a RangePartitioner). Each partition would cover a distinct range of elements (in the optimal case, these ranges would cover roughly equally sized parts of the output).
The tricky part with this is to avoid processing the complete input RDDs multiple times inside compute. This can be done efficiently using filterByRange from OrderedRDDFunctions when the input RDDs use a RangePartitioner. When they do not use a RangePartitioner, but you know that partitions are ordered internally and also have a global order, you would first need to find out the effective ranges covered by these partitions by actually probing into the data.
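For example, assuming both inputs are keyed by Int and already use a RangePartitioner, each output partition can restrict its reads to cheap slices (the range bounds are illustrative):

// filterByRange can prune whole partitions when the RDD has a RangePartitioner,
// so each slice touches only the partitions that may contain keys in [100, 199]
val slice1 = rdd1.filterByRange(100, 199)
val slice2 = rdd2.filterByRange(100, 199)
// a custom RDD's compute() for this key range would then merge slice1 and slice2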
As the multiple-partition case is rather complex, I would check whether the custom-made sort is really faster than simply using sortBy or sortByKey. The logic behind sortBy and sortByKey is highly optimized with regard to the shuffling process (transferring data between nodes). For this reason, these methods may well be faster in many cases than custom-made logic, even though the custom-made logic could be O(n) while sortBy / sortByKey is at best O(n log n).
If you are interested in learning more about the shuffling logic used by Apache Spark, there is an article explaining the basic concept.

Related

Too many values for a particular key in Hadoop

I have written K-Means clustering code in MapReduce on Hadoop. If I have a small number of clusters, say 2, and the data is very large, the whole data set will be divided into two sets, and each Reducer will receive too many values for a particular key, i.e. the cluster centroid. How can I solve this?
Note: I use the iterative approach to calculate the new centers.
Algorithmically, there is not much you can do, as this behavior is inherent in the algorithm you describe. The only option in this respect is to use more clusters and divide your data among more reducers, but that yields a different result.
So the only thing you can do, in my opinion, is compression. And I do not mean only using one of Hadoop's compression codecs.
For instance, you could find a compact representation of your data: give each element an integer id and pass only this id to the reducers. This saves network traffic (store elements as VIntWritables, or define your own VIntArrayWritable extending ArrayWritable) as well as memory on each reducer.
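A minimal Scala sketch of such a VIntArrayWritable (the Int-array convenience constructor is an assumption about how you would feed it):

import org.apache.hadoop.io.{ArrayWritable, VIntWritable, Writable}

// fixing ArrayWritable's element class to VIntWritable makes arrays of integer
// ids serialize as variable-length ints, shrinking shuffle traffic
class VIntArrayWritable extends ArrayWritable(classOf[VIntWritable]) {
  def this(ids: Array[Int]) = {
    this()
    set(ids.map(i => new VIntWritable(i): Writable))
  }
}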
In the case of k-means, I think that a combiner is not applicable, but if it is, it would greatly reduce network traffic and the reducers' overhead.
EDIT: It seems that you CAN use a combiner, if you follow this iterative implementation. Please edit your question to describe the algorithm that you have implemented.
If you have too much shuffle data, you will run into OOM issues.
Try splitting the dataset into smaller chunks and tuning
yarn.app.mapreduce.client.job.retry-interval
and
mapreduce.reduce.shuffle.retry-delay.max.ms
so that there are more splits, but the retries of the job wait long enough that there are no OOM issues.

Apache Pig: Does ORDER BY with PARALLEL ensure consistent hashing/distribution?

If I load one dataset, order it on a specific key with a parallel clause, and then store it, I can get multiple files, part-r-00000 through part-r-00XXX, depending on what I specify in the parallel statement.
If I then load a new dataset, say another day's worth of data, with some new keys, and some of the same keys, order it, and then store it, is there any way to guarantee that part-r-00000 from yesterday's data will contain the same keyspace as part-r-00000 from today's data?
Is there even a way to guarantee that all of the records for a given key will be contained in a single part file, or is it possible that a key could get split across two files, given enough records?
I guess the question is really about how the ordering function works in Pig: does it use a consistent hash-mod algorithm to distribute data, or does it order the whole set and then divide it up?
The intent, or hope, is that if the keyspace is consistently partitioned, it would be easy enough to perform rolling merges of data per part file. If it is not, I guess the question becomes: is there some operator or way of writing Pig to enable that sort of consistent hashing?
Not sure if my question is very clear, but any help would be appreciated - having trouble figuring it out based on the docs. Thanks in advance!
Alrighty, attempting to answer my own question.
It appears that Pig does NOT have a way of ensuring said consistent distribution of results into files. This is partly based on docs, partly based on information about how hadoop works, and partly based on observation.
When Pig does a partitioned order-by (e.g., using the PARALLEL clause to get more than one reducer), it seems to force an intermediate job between whatever comes before the order-by and the ordering itself. From what I can tell, Pig looks at 1-10% of the data (based on the number of mappers in the intermediate job being 1-10% of the number in the load step) and gets a sampled distribution of the keys you are attempting to sort on.
My guess/thinking is that Pig figures out the key distribution and then uses a custom partitioner from the mappers to the reducers. The partitioner maps a range of keys to each reducer, so it becomes a simple lexical comparison: "is this record greater than my assigned end_key? Pass it down the line to the next reducer."
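In other words, something like this Scala sketch (splitPoints and partitionFor are hypothetical names, purely to illustrate the guessed mechanism):

// split points obtained from sampling; 3 split points => 4 reducers
val splitPoints = Array("f", "m", "t")

// route a key to the first range whose upper bound is not below it
def partitionFor(key: String): Int = {
  val idx = splitPoints.indexWhere(sp => key.compareTo(sp) <= 0)
  if (idx < 0) splitPoints.length else idx
}

partitionFor("apple")   // 0
partitionFor("zebra")   // 3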
Of course, there are two factors to consider that mean Pig would not be consistent across different datasets, or even across re-runs of the same dataset. First, since Pig samples data in the intermediate job, it is presumably possible to get a different sample and thus a different key distribution. Second, consider two datasets with widely varying key distributions: Pig would necessarily come up with different partitionings, so if key X was in part-r-00111 one day, it would not necessarily end up there the next day.
Hope this helps anyone else looking into this.
EDIT
I found a couple of resources from O'Reilly that seem to back up my hypothesis.
One is about MapReduce patterns. It describes the standard total-order problem as solvable by a two-pass approach: an "analyze" phase followed by a final sort phase.
The second is about pig's order by specifically. It says (in case the link ever dies):
As discussed earlier in “Group”, skew of the values in data is very common. This affects order just as it does group, causing some reducers to take significantly longer than others. To address this, Pig balances the output across reducers. It does this by first sampling the input of the order statement to get an estimate of the key distribution. Based on this sample, it then builds a partitioner that produces a balanced total order...
An important side effect of the way Pig distributes records to minimize skew is that it breaks the MapReduce convention that all instances of a given key are sent to the same partition. If you have other processing that depends on this convention, do not use Pig’s order statement to sort data for it...
Also, Pig adds an additional MapReduce job to your pipeline to do the sampling. Because this sampling is very lightweight (it reads only the first record of every block), it generally takes less than 5% of the total job time.

Sort large amount of data in the cloud?

Given a cloud storage folder with, say, 1 PB of data in it, what would be the quickest way to sort all of that data? It's easy to sort small chunks of it, but merging them into one large sorted output will take longer, since at some point a single process has to merge the whole thing. I would like to avoid this and have a fully distributed solution. Is there a way? If so, is there any implementation that would be suitable for sorting data in S3?
Since the amount of data you need to sort exceeds RAM (by a lot), the only reasonable way (to my knowledge) is to sort chunks first and then merge them together.
Merge Sort is the best way to accomplish this task. You can sort separate chunks of data at the same time with parallel processes, which should speed up your sort.
The thing is, after you are done sorting the chunks, you don't have to have a single process doing all of the merging; you can have several processes merging different chunks at the same time:
This algorithm uses a parallel merge algorithm to not only parallelize the recursive division of the array, but also the merge operation. It performs well in practice when combined with a fast stable sequential sort, such as insertion sort, and a fast sequential merge as a base case for merging small arrays.
Here is a link that gives a bit more info about Merge Algorithm (just in case).
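As a toy illustration of the idea in Scala (in-memory; a cloud-scale version would replace the in-memory chunks with sorted files in object storage and each merge round with a distributed job):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// classic two-way merge of sorted vectors
def mergeTwo(a: Vector[Int], b: Vector[Int]): Vector[Int] = {
  val out = Vector.newBuilder[Int]
  var i = 0; var j = 0
  while (i < a.length && j < b.length) {
    if (a(i) <= b(j)) { out += a(i); i += 1 } else { out += b(j); j += 1 }
  }
  out ++= a.drop(i); out ++= b.drop(j)
  out.result()
}

// merge adjacent chunk pairs concurrently, halving the chunk count each round
def parallelMerge(chunks: Seq[Vector[Int]]): Vector[Int] = chunks match {
  case Seq()     => Vector.empty
  case Seq(only) => only
  case _ =>
    val round = Future.traverse(chunks.grouped(2).toSeq) {
      case Seq(a, b) => Future(mergeTwo(a, b))
      case Seq(a)    => Future.successful(a)
    }
    parallelMerge(Await.result(round, Duration.Inf))
}

// sort each chunk (this step could itself run in parallel), then merge
val chunks = Seq(Vector(8, 3), Vector(9, 1), Vector(7, 2)).map(_.sorted)
parallelMerge(chunks)   // Vector(1, 2, 3, 7, 8, 9)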
Bad news: you cannot avoid a k-way merge of multiple sorted files.
The good news is that you can do some of the operations in parallel.

Motivation behind an implicit sort after the mapper phase in map-reduce

I am trying to understand why MapReduce does an implicit sort during the shuffle-and-sort phase, on both the map side and the reduce side, manifested as a mixture of in-memory and on-disk sorting (which can be really expensive for large data sets).
My concern is that when running MapReduce jobs, performance is a significant consideration, and an implicit sort by key before handing the mapper output to the reducer can have a great impact on performance when dealing with large data sets.
I understand that sorting can prove to be a boon in certain cases where it is explicitly required, but this is not always the case. So why does implicit sorting exist in Hadoop MapReduce?
For a reference to what I mean by the shuffle and sort phase, see the post Map-Reduce: Shuffle and Sort on my blog: Hadoop-Some Salient Understandings.
One possible explanation, which came to my mind much later after posting this question:
The sorting is done to aggregate all the records corresponding to a particular key together, so that all records for a single key can be sent to a single reducer (the default partitioning logic in Hadoop MapReduce). So it may be said that sorting all the records by key after the Mapper phase simply brings the records for each key together; the sorted order of the keys themselves is only needed in certain use cases, such as sorting large data sets.
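A tiny Scala illustration of the point: once map output is sorted by key, the reduce side can stream each key's values as one contiguous run, without holding a table of all keys in memory (the data is made up):

val mapOutput = Seq(("b", 2), ("a", 1), ("b", 3), ("a", 4))

// the implicit sort turns equal keys into contiguous runs
val sortedOutput = mapOutput.sortBy(_._1)

// the reducer can then walk each key's run in a single pass
var i = 0
while (i < sortedOutput.length) {
  val key = sortedOutput(i)._1
  val run = sortedOutput.drop(i).takeWhile(_._1 == key)
  println(s"$key -> ${run.map(_._2)}")   // prints: a -> List(1, 4), then b -> List(2, 3)
  i += run.length
}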
If others who think the same can verify the above, that would be great. Thanks.

Efficiently sorting many strings in parallel for presentation

I am confronted with a problem where I have a massive list of information (287,843 items) that must be sorted for display. Which is more efficient: using a self-balancing red-black tree to keep the items sorted, or building an array and then sorting it? My keys are strings, if that helps. The algorithm should make use of multiple processor cores.
Thank you!
This really depends on the particulars of your setup. If you have a multicore machine, you can probably sort the strings extremely quickly by using a parallel version of quicksort, in which each recursive call is executed in parallel with each other call. With many cores, this can take the already fast quicksort and make it substantially faster. Other sorting algorithms like merge sort can also be parallelized, though parallel quicksort has the advantage of requiring less extra memory. Since you know that you're sorting strings, you may also want to look into parallel radix sort, which could potentially be extremely fast.
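As a concrete sketch (Scala, using the JDK's Arrays.parallelSort; that is a parallel merge sort rather than a quicksort, but it spreads the work across cores in the same spirit):

import java.util.Arrays

// sorts a large array of string keys in place on the common fork/join pool
val keys: Array[String] = Array("pear", "apple", "mango", "banana")
Arrays.parallelSort(keys)
// keys is now Array("apple", "banana", "mango", "pear")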
Most binary search trees cannot easily be multithreaded, because rebalance operations often require changing multiple parts of the tree at once, so a balanced red/black tree may not be the best approach here. However, you may want to look into a concurrent skiplist, which is a data structure that can be made to work efficiently in parallel. There are some newer binary search trees designed for parallelism that sometimes outperform the skiplist (here is one such data structure), though I expect that there will be fewer existing implementations and discussion of these newer structures.
If the elements are not changing frequently or you only need sorted order once, then just sorting once with parallel quicksort is probably the best bet. If the elements are changing frequently, then a concurrent data structure like the parallel skiplist will probably be a better bet.
Hope this helps!
Assuming that you're reading that list from a file or some other data source, it seems quite right to read all that into an array, and then sort it. If you have a GUI of some sort, it seems even more feasible to do both reading and sorting in a thread, while having the GUI in a "waiting to complete" state. Keeping a tree of the values sounds feasible only if you're going to do a lot of deletions/insertions, which would make an array less usable in this case.
When it comes to multi-core sorting, I believe merge sort is the easiest to parallelize. But I'm no expert on this, so don't take my word as a definitive answer.
