Too many values for a particular key in Hadoop

I have written a K-Means Clustering code in MapReduce on Hadoop. If I have a small number of clusters, say 2, and the data is very large, the whole dataset is divided into two sets and each Reducer receives too many values for a particular key, i.e. the cluster centroid. How can I solve this?
Note: I use the iterative approach to calculate new centers.

Algorithmically, there is not much you can do, as this is the nature of the algorithm you describe. The only option, in this respect, is to use more clusters and divide your data among more reducers, but that yields a different result.
So the only thing you can do, in my opinion, is compress. And I do not only mean using one of Hadoop's compression codecs.
For instance, you could find a compact representation of your data. E.g., give an integer id to each element and only pass this id to the reducers. This saves network traffic (store elements as VIntWritables, or define your own VIntArrayWritable extending ArrayWritable) and memory on each reducer.
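A minimal sketch of such a VIntArrayWritable (just the standard ArrayWritable subclassing pattern from org.apache.hadoop.io; the class name is illustrative):

```java
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.VIntWritable;

// An ArrayWritable fixed to VIntWritable elements, so integer ids are
// serialized with variable-length encoding on the wire.
public class VIntArrayWritable extends ArrayWritable {
    public VIntArrayWritable() {
        super(VIntWritable.class);
    }

    public VIntArrayWritable(VIntWritable[] values) {
        super(VIntWritable.class, values);
    }
}
```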
In the case of k-means, I think that a combiner is not applicable, but if it is, it would greatly reduce the network traffic and the reducers' overhead.
EDIT: It seems that you CAN use a combiner if you follow this iterative implementation. Please edit your question to describe the algorithm that you have implemented.
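Assuming the usual trick for iterative k-means (each map task emits, per point, the id of the nearest centroid together with a partial sum and a count of 1), a combiner can fold those partials locally so each reducer key carries only a handful of records instead of every point. A rough sketch, assuming one-dimensional points and an IntWritable centroid id; SumCountWritable and KMeansCombiner are illustrative names, not from the question:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

// Partial aggregate for one centroid: sum of point coordinates and the number
// of points covered (one-dimensional points, to keep the sketch short).
public class SumCountWritable implements Writable {
    public double sum;
    public long count;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeDouble(sum);
        out.writeLong(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        sum = in.readDouble();
        count = in.readLong();
    }
}

// Combiner: folds many (centroidId, partial) pairs into one per centroid per
// map task, so the reducer only sees a few records per key.
class KMeansCombiner
        extends Reducer<IntWritable, SumCountWritable, IntWritable, SumCountWritable> {
    @Override
    protected void reduce(IntWritable centroidId, Iterable<SumCountWritable> partials,
                          Context ctx) throws IOException, InterruptedException {
        SumCountWritable merged = new SumCountWritable();
        for (SumCountWritable p : partials) {
            merged.sum += p.sum;
            merged.count += p.count;
        }
        ctx.write(centroidId, merged);
    }
    // The reducer does the same fold and then emits merged.sum / merged.count
    // as the new centroid for the next iteration.
}
```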

If too much data is shuffled, you will run into OOM issues.
Try splitting the dataset into smaller chunks and tuning
yarn.app.mapreduce.client.job.retry-interval
and
mapreduce.reduce.shuffle.retry-delay.max.ms
so that there are more splits, but the job's retries are spaced out enough that you avoid OOM issues.
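As a rough illustration, both properties can be overridden on the job configuration before submitting; the numeric values below are placeholders to be tuned for your cluster:

```java
import org.apache.hadoop.conf.Configuration;

// Sketch of overriding the two shuffle/retry properties mentioned above.
// The values are placeholders, not recommendations.
public class ShuffleRetryConfig {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // time (ms) the MapReduce client waits between job-status retries
        conf.setLong("yarn.app.mapreduce.client.job.retry-interval", 30000L);
        // maximum delay (ms) between retries of a failed shuffle fetch
        conf.setLong("mapreduce.reduce.shuffle.retry-delay.max.ms", 120000L);
        return conf;
    }
}
```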

Related

How to merge two presorted RDDs in Spark?

I have two large csv files presorted by one of the columns. Is there a way to use the fact that they are already sorted to get a new sorted RDD faster, without full sorting again?
The short answer: No, there is no way to leverage the fact that two input RDDs are already sorted when using the sort facilities offered by Apache Spark.
The long answer: Under certain conditions, there might be a better way than using sortBy or sortByKey.
The most obvious case is when the input RDDs are already sorted and represent distinct ranges. In this case, simply using rdd1.union(rdd2) is the fastest (virtually zero cost) way for combining the input RDDs, assuming that all elements in rdd1 come before all elements in rdd2 (according to the chosen ordering).
When the ranges of the input RDDs overlap, things get more tricky. Assuming that the target RDD shall only have a single partition, it might be efficient to use toLocalIterator on both RDDs and then do a merge manually. If the result has to be an RDD, one could do this inside the compute method of a custom RDD type, processing the input RDDs and generating the outputs.
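The merge itself is just a classic two-way merge over the two sorted iterators, e.g. the iterators obtained via toLocalIterator on the Java API. A minimal sketch in plain Java, assuming integer records and that the merged result fits on the driver; the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Two-way merge of two already-sorted iterators into one sorted list.
public class SortedMerge {
    public static List<Integer> merge(Iterator<Integer> a, Iterator<Integer> b) {
        List<Integer> out = new ArrayList<>();
        Integer x = a.hasNext() ? a.next() : null;
        Integer y = b.hasNext() ? b.next() : null;
        while (x != null || y != null) {
            // Take from a while it has the smaller (or equal) head element.
            if (y == null || (x != null && x <= y)) {
                out.add(x);
                x = a.hasNext() ? a.next() : null;
            } else {
                out.add(y);
                y = b.hasNext() ? b.next() : null;
            }
        }
        return out;
    }
}
```

The same merge logic can be lifted into the compute method of a custom RDD when the result needs to stay distributed, as described above.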
When the inputs are large and thus consist of many partitions, things get even trickier. In this case, you probably want multiple partitions in the output RDD as well. You could use the custom RDD mentioned earlier, but create multiple partitions (using a RangePartitioner). Each partition would cover a distinct range of elements (in the optimal case, these ranges would cover roughly equally sized parts of the output).
The tricky part with this is avoiding processing the complete input RDDs multiple times inside compute. This can be avoided efficiently using filterByRange from OrderedRDDFunctions when the input RDDs are using a RangePartitioner. When they are not using a RangePartitioner, but you know that partitions are ordered internally and also have a global order, you would first need to find out the effective ranges covered by these partitions by actually probing into the data.
As the multiple partition case is rather complex, I would check whether the custom-made sort is really faster than simply using sortBy or sortByKey. The logic for sortBy and sortByKey is highly optimized regarding the shuffling process (transferring data between nodes). For this reason, it might well be that for many cases these methods are faster than the custom-made logic, even though the custom-made logic could be O(n) while sortBy / sortByKey can be O(n log(n)) at best.
If you are interested in learning more about the shuffling logic used by Apache Spark, there is an article explaining the basic concept.

The design of Clustering using MapReduce

I have a similarity matrix like this: ItemA, ItemB, Similarity.
I want to cluster the dataset using an algorithm such as k-means with MapReduce, but I don't know how many MapReduce jobs I should use or how to design them.
You cannot use k-means with a similarity matrix. End of story: k-means needs the similarity to the means, not between instances. There are alternative algorithms, but PAM, for example, scales so badly that it does not pay off to run it on a cluster either.
Other than that, just experiment. Choose as many reducers as you have cores, for example, and as many mappers as your cluster can sustain (unless your data is too tiny - there should be several MB per mapper to make the startup cost pay off).
But I don't think you are ready for that question yet. First figure out what you want to do, then worry about parameters that may or may not matter at all.

Apache Pig: Does ORDER BY with PARALLEL ensure consistent hashing/distribution?

If I load one dataset, order it on a specific key with a parallel clause, and then store it, I can get multiple files, part-r-00000 through part-r-00XXX, depending on what I specify in the parallel statement.
If I then load a new dataset, say another day's worth of data, with some new keys, and some of the same keys, order it, and then store it, is there any way to guarantee that part-r-00000 from yesterday's data will contain the same keyspace as part-r-00000 from today's data?
Is there even a way to guarantee that all of the records for a given key will be contained in a single part file, or is it possible that a key could get split across 2 files, given enough records?
I guess the question is really about how the ordering function works in Pig - does it use a consistent hash-mod algorithm to distribute data, or does it order the whole set and then divide it up?
The intent or hope would be that if the keyspace is consistently partitioned, it would be easy enough to perform rolling merges of data per part file. If it is not, I guess the question becomes, is there some operator or way of writing pig to enable that sort of consistent hashing?
Not sure if my question is very clear, but any help would be appreciated - having trouble figuring it out based on the docs. Thanks in advance!
Alrighty, attempting to answer my own question.
It appears that Pig does NOT have a way of ensuring said consistent distribution of results into files. This is partly based on the docs, partly on how Hadoop works, and partly on observation.
When Pig does a partitioned order-by (e.g., using the PARALLEL clause to get more than one reducer), it seems to force an intermediate job between whatever comes before the order-by and the ordering itself. From what I can tell, Pig looks at 1-10% of the data (based on the number of mappers in the intermediate job being 1-10% of the number in the load step) and gets a sampled distribution of the keys you are attempting to sort on.
My guess/thinking is that Pig figures out the key distribution and then uses a custom partitioner from the mappers to the reducers. The partitioner maps a range of keys to each reducer, so it becomes a simple lexical comparison: "is this record greater than my assigned end_key? Pass it down the line to the next reducer."
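Conceptually, such a range partitioner might look like the sketch below. This is illustrative only, not Pig's actual code; the boundary keys (hard-coded here) would come from the sampling step, and the class name is hypothetical:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Each reducer owns a contiguous key range determined up front from a sample.
public class KeyRangePartitioner extends Partitioner<Text, Text> {
    // Upper boundary key for each reducer except the last, in sorted order.
    // Hypothetical placeholders standing in for sampled split points.
    private static final String[] BOUNDARIES = {"g", "p"};

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        for (int i = 0; i < BOUNDARIES.length && i < numPartitions - 1; i++) {
            if (key.toString().compareTo(BOUNDARIES[i]) <= 0) {
                return i; // key falls into this reducer's range
            }
        }
        return Math.min(BOUNDARIES.length, numPartitions - 1);
    }
}
```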
Of course, there are two factors to consider that basically mean Pig will not be consistent across different datasets, or even across re-runs of the same dataset. For one, since Pig samples the data in the intermediate job, I imagine it is possible to get a different sample and thus a different key distribution. Also, consider two different datasets with widely varying key distributions: Pig would necessarily come up with a different distribution, and so if key X was in part-r-00111 one day, it would not necessarily end up there the next day.
Hope this helps anyone else looking into this.
EDIT
I found a couple of resources from O'Reilly that seem to back up my hypothesis.
One is about map reduce patterns. It basically describes the standard total-order problem as being solvable by a two-pass approach, one "analyze" phase and a final sort phase.
The second is about pig's order by specifically. It says (in case the link ever dies):
As discussed earlier in “Group”, skew of the values in data is very common. This affects order just as it does group, causing some reducers to take significantly longer than others. To address this, Pig balances the output across reducers. It does this by first sampling the input of the order statement to get an estimate of the key distribution. Based on this sample, it then builds a partitioner that produces a balanced total order...
An important side effect of the way Pig distributes records to minimize skew is that it breaks the MapReduce convention that all instances of a given key are sent to the same partition. If you have other processing that depends on this convention, do not use Pig’s order statement to sort data for it...
Also, Pig adds an additional MapReduce job to your pipeline to do the sampling. Because this sampling is very lightweight (it reads only the first record of every block), it generally takes less than 5% of the total job time.

Is there a better solution than Hadoop for simple O(n) complexity queries?

I need to create a system that takes terabytes of numerical data and answers three questions: 1. min, 2. max, 3. total count.
A friend suggested that Hadoop uses MapReduce, where the reduce step always sorts the data. This results in a complexity of O(n log n) even for O(n) queries such as min, max, and total count.
I have been searching on the internet; however, I have not been able to find an answer. Can someone please help? I am new to this field, so please bear with my lack of knowledge.
Thanks!
Hadoop does not change the asymptotic complexity of anything. It is merely about reducing the constant factor which big-O ignores.
There's always some overhead in putting together the results of a distributed computation. However in the case of your three problems, using a combiner will reduce the final sort to O(1). I don't know what the complexity is of the local sorting that happens on each map host for grouping for the combiner when there is only one key. It might be better than O(n lg n) in that case.
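For min, max, and count specifically, a single reducer class can double as the combiner, since all three aggregations are associative. A rough sketch, assuming one number per input line; the class and tag names are illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit each value under three tags so one combiner/reducer class can
// fold min, max, and count in a single pass.
public class StatsMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final Text minTag = new Text("min");
    private final Text maxTag = new Text("max");
    private final Text countTag = new Text("count");
    private final LongWritable val = new LongWritable();
    private final LongWritable one = new LongWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        long n = Long.parseLong(value.toString().trim());
        val.set(n);
        ctx.write(minTag, val);
        ctx.write(maxTag, val);
        ctx.write(countTag, one);
    }
}

// Used as both combiner and reducer: min/max fold with min/max, count folds
// with sum, so combining partial results is still correct.
class StatsReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text tag, Iterable<LongWritable> values, Context ctx)
            throws IOException, InterruptedException {
        boolean isCount = tag.toString().equals("count");
        boolean isMin = tag.toString().equals("min");
        long acc = isCount ? 0 : (isMin ? Long.MAX_VALUE : Long.MIN_VALUE);
        for (LongWritable v : values) {
            if (isCount) acc += v.get();
            else if (isMin) acc = Math.min(acc, v.get());
            else acc = Math.max(acc, v.get());
        }
        ctx.write(tag, new LongWritable(acc));
    }
}
```

In the driver you would register the same class twice, via job.setCombinerClass(StatsReducer.class) and job.setReducerClass(StatsReducer.class), so each map task ships at most three tiny records to the reducers.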
I haven't tried this in practice, but I believe you can effectively disable the sort by defining a custom sorting and grouping comparator for your job. You want to use a sorting comparator that says that all keys are equal for sorting purposes. I believe this will make all the sorts at least do as little work as possible -- one pass. You want to keep the default partitioner and grouping comparator though, so work is still distributed the same way, and the same values go with the same keys.
I don't know if this makes it O(n), since there is plenty of other stuff going on internally, like a merge.
And, big-O is a very crude measure of speed. Things like efficient writables and combiners will make a bigger difference than these issues.
Of course, I would probably not advise that you build a custom MapReduce job for this kind of work. It's the kind of thing that Hive can answer for you, although Hive will just delegate to MapReduce jobs and will be slower than the simple MapReduce job you contemplated at the beginning.
There are real-time-ish tools like Impala to answer these types of queries much, much faster. They don't use MapReduce, but do run on Hadoop. If you really want to do this, I'd strongly suggest looking in that direction.

Motivation behind an implicit sort after the mapper phase in map-reduce

I am trying to understand why MapReduce does an implicit sort during the shuffle-and-sort phase, on both the map side and the reduce side, manifested as a mixture of in-memory and on-disk sorting (which can be really expensive for large data sets).
My concern is that performance is a significant consideration when running MapReduce jobs, and an implicit sort on the keys before the mapper output is handed to the reducers has a large impact on performance for big data sets.
I understand that sorting can be a boon in cases where it is explicitly required, but that is not always true. So why does implicit sorting exist in Hadoop MapReduce?
For a reference to what I mean by the shuffle and sort phase, see the post Map-Reduce: Shuffle and Sort on my blog, Hadoop-Some Salient Understandings.
One possible explanation for the above, which came to my mind much later after posting this question, is:
The sorting is done to aggregate all the records corresponding to a particular key together, so that they can all be sent to a single reducer (the default partitioning logic in Hadoop MapReduce). In other words, sorting the records by key after the mapper phase simply brings all records for a single key together; the sorted order of the keys themselves is only exploited in certain use cases, such as sorting large data sets.
If people who think the same can verify the above, that would be great. Thanks.
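To illustrate the point above: because the framework sorts and groups the map output by key, a reducer receives each key exactly once, together with an Iterable over all of its values, as in the standard word-count reducer:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The map output reaching this reducer is sorted and grouped by key, so the
// framework can hand over all values for one key as a single Iterable without
// buffering the whole partition in memory.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        ctx.write(key, new IntWritable(sum));
    }
}
```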
