I am fairly new to both parallel programming and the Erlang language and I'm struggling a bit.
I'm having a hard time implementing a mapreduce skeleton. I spawn M mappers (their task is to map the power function over a list of floats) and R reducers (they sum the elements of the input list sent by the mapper).
What I then want to do is send the intermediate results of each mapper to a random reducer. How do I go about linking a mapper to a reducer?
I have looked around the internet for examples. The closest thing I could find to what I want is this word counter example. The author seems to have found a clever way to link a mapper to a reducer, and the logic makes sense, but I have not been able to tweak it to fit my particular needs. Maybe the key-value implementation is not suitable for finding the sum of a list of powers?
Any help, please?
Just to give an update: apparently there were problems with the algorithm linked in the OP. There seems to be something wrong with the synchronization protocol, which is hinted at by the call to the sleep() function (i.e. it's not supposed to be there).
For a good working implementation of the map/reduce framework, please refer to Joe Armstrong's version in the Programming Erlang book (2nd ed).
Armstrong's version uses only one reducer, but it can easily be modified to use more reducers in order to eliminate the bottleneck.
I have also added a function to split the input list into chunks. Each mapper will get a chunk of data.
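To make the overall shape concrete, here is a minimal sketch of the same structure in Python rather than Erlang. It is not Armstrong's code: the function names (split_into_chunks, mapper, reducer, map_reduce_sum_of_powers) are made up for illustration, and thread pools stand in for Erlang processes and message passing. Each mapper gets one chunk, each mapper's output is handed to a randomly chosen reducer, and the partial sums are combined at the end.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous chunks of nearly equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def mapper(chunk, exponent):
    """Map the power function over one chunk."""
    return [x ** exponent for x in chunk]


def reducer(inbox):
    """Sum every element of every list this reducer received."""
    return sum(sum(part) for part in inbox)


def map_reduce_sum_of_powers(values, exponent, n_mappers, n_reducers, seed=None):
    rng = random.Random(seed)
    chunks = split_into_chunks(values, n_mappers)

    # Map phase: one task per chunk.
    with ThreadPoolExecutor(max_workers=n_mappers) as pool:
        mapped = list(pool.map(lambda c: mapper(c, exponent), chunks))

    # "Link" each mapper to a random reducer by dropping its output
    # into that reducer's inbox.
    inboxes = [[] for _ in range(n_reducers)]
    for output in mapped:
        inboxes[rng.randrange(n_reducers)].append(output)

    # Reduce phase: each reducer sums its inbox; partial sums are then added.
    with ThreadPoolExecutor(max_workers=n_reducers) as pool:
        partial_sums = list(pool.map(reducer, inboxes))
    return sum(partial_sums)
```

Because addition does not care which reducer sees which chunk, the random assignment does not change the result. A reducer that happens to receive nothing simply contributes 0.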
I am a beginner in Hadoop. I am trying to understand why MapReduce is named like that.
From what I understand, it's basically transforming or filtering the data first and then aggregating it to produce some output.
Why is that filtering or transforming called mapping? How can that operation be considered mapping?
Why is that aggregate operation called reducing? Here at least I can imagine that aggregating will reduce the input data set to a limited number of values.
I am trying to understand the meaning of MapReduce from a semantic perspective.
To find the reasoning behind the terms in MapReduce, we must go back to the roots of the elements that make up this programming paradigm. This means we need to talk (as precisely and as painlessly as possible) about functional programming.
In short, functional programming, according to Wikipedia, is:
a declarative programming paradigm in which function definitions are trees of expressions that map values to other values, rather than a sequence of imperative statements which update the running state of the program.
This basically means that the emphasis of this model is on the application of functions and not on the imperative programming that is focused on the changes being made to a state. So by using functional code, a function in execution doesn't really rely on or manipulate data outside of its scope (as brilliantly said here).
"Ok, and what does that have to do with MapReduce, anyhow?"
Well, MapReduce is directly inspired by functional programming, because the Map and Reduce functions are the basic functions used in functional programming. Of course, MapReduce has many other added stages for an execution like Combine, Shuffle, Sort, etc., but the core idea of the model stems from that idea of functional programming described above.
About mapping: in the functional sense, Map is a function that receives two arguments, a function and a list of values. It applies the given function to each value of the list and returns a list of results. You can indeed call this a type of "filtering", but data can be manipulated in many more ways than just being filtered. The main goal of a Map function is to change the input data into the form needed for the calculations performed next in the Reduce function.
Reduce follows a similar approach. Two arguments are given here as well: a function and the list of values to which the function is applied. Since that list is the transformed collection produced by the Map function, all that is left to do is to work through it and reach the desired result. You have the right idea when you describe the Reduce function as aggregating the input data. The one thing "missing" from that picture, though, is how, and based on what, the input data will be aggregated. Defining that is the main job of the Map function, as described above.
With all this, we can see that the MapReduce model is named after the two basic functions of functional programming that it abstractly implements, so the model essentially follows the semantic contracts of the latter.
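As a small illustration of the two functions in their plain functional form (not Hadoop code), here is how they look with Python's built-in map and functools.reduce. The names values, squared, and total are just for this example.

```python
from functools import reduce

values = [1, 2, 3, 4]

# Map: apply a function to every element, producing a new list.
squared = list(map(lambda x: x * x, values))

# Reduce: fold the mapped list into a single value with a two-argument function.
total = reduce(lambda acc, x: acc + x, squared, 0)
```

Neither call modifies values. Each one takes a function plus a collection and returns something new, which is the property that lets a framework run the pieces independently.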
You can go on a quest yourself about all of this and a lot more by starting from here, here, here, and here.
I have heard and bought the argument that mutation and state are bad for concurrency. But I struggle to understand what the correct alternatives actually are.
For example, when looking at the simplest of all tasks: counting, e.g. word counting in a large corpus of documents. Accessing and parsing the document takes a while so we want to do it in parallel using k threads or actors or whatever the abstraction for parallelism is.
What would be the correct but also practical pure functional way, using immutable data structures to do this?
The general approach to analyzing data sets in a functional way is to partition the data set in some way that makes sense. For a document, you might cut it up into sections based on size; e.g. four threads means the doc is sectioned into four pieces.
Each thread or process then executes its algorithm on its section of the data set and generates an output. All the outputs are gathered together and then merged. For word counts, for example, each list of word counts is sorted by word, and the lists are then stepped through together, looking for the same words. If a word occurs in more than one list, its counts are summed. In the end, a new list with the sums for all the words is output.
This approach is commonly referred to as map/reduce. The step of converting a document into word counts is a "map" and the aggregation of the outputs is a "reduce".
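Here is a minimal sketch of that partition, count, and merge flow in Python. The function names are invented for the example, and the merge uses hash-based Counter addition instead of the sort-and-step-through merge described above. The outcome is the same: counts for the same word are summed.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_words(section):
    """Map step: turn one section of text into its own word counts."""
    return Counter(section.lower().split())


def merge_counts(left, right):
    """Reduce step: Counter addition returns a new Counter and leaves
    both inputs unchanged."""
    return left + right


def word_count(sections, workers=4):
    # Each section is counted independently, so no state is shared.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, sections))
    return reduce(merge_counts, partial_counts, Counter())
```

No worker ever writes to a structure another worker can see. The only point where results meet is the merge, and that produces new values rather than updating old ones.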
In addition to the advantage of eliminating the overhead to prevent data conflicts, a functional approach enables the compiler to optimize to a faster approach. Not all languages and compilers do this, but because a compiler knows its variables are not going to be modified by an outside agent it can apply transforms to the code to increase its performance.
In addition, functional programming lets systems like Spark dynamically create threads because the boundaries of change are clearly defined. That's why you can write a single function chain in Spark, and then just throw servers at it without having to change the code. Pure functional languages can do this in a general way, making every application intrinsically multi-threaded.
One of the reasons functional programming is "hot" is because of this ability to enable multiprocessing transparently and safely.
Mutation and state are bad for concurrency only if mutable state is shared between multiple threads for communication, because it's very hard to reason about impure functions and methods that silently trash some shared memory in parallel.
One possible alternative is using message passing for communication between threads/actors (as is done in Akka), and building ("reasonably pure") functional data analysis frameworks like Apache Spark on top of it. Apache Spark is known to be rather suitable for counting words in a large corpus of documents.
I have written K-Means clustering code in MapReduce on Hadoop. If I have a small number of clusters, say 2, and the data is very large, the whole data set would be divided into two sets and each Reducer would receive too many values for a particular key, i.e. the cluster centroid. How can I solve this?
Note: I use the iterative approach to calculate new centers.
Algorithmically, there is not much you can do, as the nature of this algorithm is the one that you describe. The only option, in this respect, is to use more clusters and divide your data across more reducers, but this yields a different result.
So, the only thing that you can do, in my opinion, is compressing. And I do not only mean, using a compression codec of Hadoop.
For instance, you could find a compact representation of your data. E.g., give an integer id to each element and only pass this id to the reducers. This will save network traffic (store elements as VIntWritables, or define your own VIntArrayWritable extending ArrayWritable) and memory of each reducer.
In this case of k-means, I think that a combiner is not applicable, but if it is, it would greatly reduce the network and the reducer's overhead.
EDIT: It seems that you CAN use a combiner, if you follow this iterative implementation. Please, edit your question to describe the algorithm that you have implemented.
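To show why a combiner can help here, this is a rough sketch of the idea in plain Python, not Hadoop code and not necessarily the implementation referred to above. It assumes the usual k-means update (new centroid = mean of assigned points), and all function names are hypothetical. Each map-side split sends one (sum vector, count) pair per centroid instead of every point, so a reducer receives one small record per split instead of one per data point.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_and_combine(points, centroids):
    """Map + combine for one input split: accumulate a running sum vector
    and a count per centroid instead of emitting every point."""
    partial = {}
    for point in points:
        cid = nearest(point, centroids)
        sums, count = partial.get(cid, ([0.0] * len(point), 0))
        partial[cid] = ([s + p for s, p in zip(sums, point)], count + 1)
    return partial


def reduce_centroids(partials):
    """Reduce: add up the per-split sums and counts, then divide."""
    totals = {}
    for partial in partials:
        for cid, (sums, count) in partial.items():
            tsums, tcount = totals.get(cid, ([0.0] * len(sums), 0))
            totals[cid] = ([a + b for a, b in zip(tsums, sums)], tcount + count)
    return {cid: [s / count for s in sums] for cid, (sums, count) in totals.items()}
```

This works because the mean can be rebuilt from partial sums and counts. A combiner that averaged locally and then averaged the averages would give a wrong result when splits have different sizes.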
If you have too much shuffle, you will run into OOM issues.
Try splitting the dataset into smaller chunks and increasing
yarn.app.mapreduce.client.job.retry-interval
and
mapreduce.reduce.shuffle.retry-delay.max.ms
so that there are more splits but the job's retries are spaced far enough apart that there are no OOM issues.
If I load one dataset, order it on a specific key with a parallel clause, and then store it, I can get multiple files, part-r-00000 through part-r-00XXX, depending on what I specify in the parallel statement.
If I then load a new dataset, say another day's worth of data, with some new keys, and some of the same keys, order it, and then store it, is there any way to guarantee that part-r-00000 from yesterday's data will contain the same keyspace as part-r-00000 from today's data?
Is there even a way to guarantee that all of the records will be contained in a single part file, or is it possible that a key could get split across 2 files, given enough records?
I guess the question is really about how the ordering function works in pig - does it use a consistent hash-mod algorithm to distribute data, or does it order the whole set, and then divide it up?
The intent or hope would be that if the keyspace is consistently partitioned, it would be easy enough to perform rolling merges of data per part file. If it is not, I guess the question becomes, is there some operator or way of writing pig to enable that sort of consistent hashing?
Not sure if my question is very clear, but any help would be appreciated - having trouble figuring it out based on the docs. Thanks in advance!
Alrighty, attempting to answer my own question.
It appears that Pig does NOT have a way of ensuring said consistent distribution of results into files. This is partly based on docs, partly based on information about how hadoop works, and partly based on observation.
When pig does a partitioned order-by (eg, using the PARALLEL clause to get more than one reducer), it seems to force an intermediate job between whatever comes before the order-by, and the ordering itself. From what I can tell, pig looks at 1-10% of the data (based on the number of mappers in the intermediate job being 1-10% of the number in the load step) and gets a sampled distribution of the keys you are attempting to sort on.
My guess/thinking is that pig figures out the key distribution, and then uses a custom partitioner from the mappers to the reducers. The partitioner maps a range of keys to each reducer, so it becomes a simple lexical comparison - "is this record greater than my assigned end_key? pass it down the line to the next reducer."
Of course, there are two factors to consider that basically mean that Pig would not be consistent on different datasets, or even on a re-run of the same dataset. For one, since pig is sampling data in the intermediate job, I imagine it's possible to get a different sample and thus a different key distribution. Also, consider an example of two different datasets with widely varying key distributions. Pig would necessarily come up with a different distribution, and thus if key X was in part-r-00111 one day, it would not necessarily end up there the next day.
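To illustrate the hypothesis, here is a small Python sketch of a sample-based range partitioner. This is not Pig's actual code; the function names and the quantile rule are assumptions made for the example, and unlike Pig (per the book passage quoted further down) it never splits a single key across partitions. It only shows why boundaries derived from a sample depend on the data they were sampled from.

```python
import bisect
import random


def build_boundaries(keys, n_partitions, sample_size, seed=None):
    """Sample the keys and pick n_partitions - 1 split points at evenly
    spaced positions in the sorted sample. keys must be a list."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    return [sample[(i * len(sample)) // n_partitions] for i in range(1, n_partitions)]


def partition_for(key, boundaries):
    """Index of the partition (the NNN in part-r-00NNN) a key falls into:
    the number of boundaries that are less than or equal to the key."""
    return bisect.bisect_right(boundaries, key)
```

Two days of data with different key distributions produce different boundaries, so the same key can land in a different part file each day. With a sample smaller than the input, even a re-run on the same data can shift the boundaries.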
Hope this helps anyone else looking into this.
EDIT
I found a couple of resources from O'Reilly that seem to back up my hypothesis.
One is about map reduce patterns. It basically describes the standard total-order problem as being solvable by a two-pass approach, one "analyze" phase and a final sort phase.
The second is about pig's order by specifically. It says (in case the link ever dies):
As discussed earlier in “Group”, skew of the values in data is very common. This affects order just as it does group, causing some reducers to take significantly longer than others. To address this, Pig balances the output across reducers. It does this by first sampling the input of the order statement to get an estimate of the key distribution. Based on this sample, it then builds a partitioner that produces a balanced total order...
An important side effect of the way Pig distributes records to minimize skew is that it breaks the MapReduce convention that all instances of a given key are sent to the same partition. If you have other processing that depends on this convention, do not use Pig’s order statement to sort data for it...
Also, Pig adds an additional MapReduce job to your pipeline to do the sampling. Because this sampling is very lightweight (it reads only the first record of every block), it generally takes less than 5% of the total job time.
I hope I'm asking this in the right way. I'm learning my way around Elastic MapReduce and I've seen numerous references to the "Aggregate" reducer that can be used with "Streaming" job flows.
In Amazon's "Introduction to Amazon Elastic MapReduce" PDF it states "Amazon Elastic MapReduce has a default reducer called aggregrate"
What I would like to know is: are there other default reducers available?
I understand that I can write my own reducer, but I don't want to end up writing something that already exists and "reinvent the wheel" because I'm sure my wheel won't be as good as the original.
The reducer they refer to is documented here:
http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html
That's a reducer that is built into the streaming utility. It provides a simple way of doing common calculations by writing a mapper that outputs keys that are formatted in a special way.
For example, if your mapper outputs:
LongValueSum:id1\t12
LongValueSum:id1\t13
LongValueSum:id2\t1
UniqValueCount:id3\tval1
UniqValueCount:id3\tval2
The reducer will calculate the sum of each LongValueSum, and count the distinct values for UniqValueCount. The reducer output will therefore be:
id1\t25
id2\t1
id3\t2
The reducers and combiners in this package are very fast compared to running streaming combiners and reducers, so using the aggregate package is both convenient and fast.
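For anyone who wants to check their mapper output locally before submitting a job, here is a rough Python sketch. word_mapper is a hypothetical streaming mapper that emits one LongValueSum record per word. emulate_aggregate is only a local stand-in that mimics the two aggregator types mentioned above; it is not the Hadoop implementation, and it assumes an id is used with only one aggregator type.

```python
from collections import defaultdict


def word_mapper(line):
    """Hypothetical streaming mapper: one 'LongValueSum:<word>\\t1' record
    per word, in the key format the aggregate reducer expects."""
    return [f"LongValueSum:{word}\t1" for word in line.split()]


def emulate_aggregate(records):
    """Local stand-in for the aggregate reducer, covering only
    LongValueSum and UniqValueCount."""
    sums = defaultdict(int)
    uniques = defaultdict(set)
    for record in records:
        key, value = record.rstrip("\n").split("\t", 1)
        kind, ident = key.split(":", 1)
        if kind == "LongValueSum":
            sums[ident] += int(value)
        elif kind == "UniqValueCount":
            uniques[ident].add(value)
        else:
            raise ValueError(f"aggregator type not covered by this sketch: {kind}")
    results = dict(sums)
    results.update((ident, len(vals)) for ident, vals in uniques.items())
    return [f"{ident}\t{value}" for ident, value in sorted(results.items())]
```

In a real streaming job the mapper would print these records to stdout and the job would be run with the aggregate reducer; the emulation is only there to sanity-check the record format.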
I'm in a similar situation. I infer from Google results etc that the answer right now is "No, there are no other default reducers in Hadoop", which kind of sucks, because it would be obviously useful to have default reducers like, say, "average" or "median" so you don't have to write your own.
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html shows a number of useful aggregator uses but I cannot find documentation for how to access other functionality than the very basic key/value sum described in the documentation and in Erik Forsberg's answer. Perhaps this functionality is only exposed in the Java API, which I don't want to use.
Incidentally, I'm afraid Erik Forsberg's answer is not a good answer to this particular question. Another question for which it could be a useful answer can be constructed, but it is not what the OP is asking.