I am a beginner in Hadoop. I am trying to understand why MapReduce is named like that.
From what I understand, it's basically transforming or filtering the data first and then aggregating it to produce some output.
Why is that filtering or transforming called mapping? How can that operation be considered mapping?
Why is that aggregate operation called reducing? Here at least I can imagine that aggregation will reduce the input data set to a limited number of values.
I am trying to understand the meaning of MapReduce from a semantic perspective.
In order to find the reasoning behind the terms of MapReduce, we must go back to the roots of the elements that make up this particular programming paradigm. This means we need to talk (as precisely and as painlessly as possible) about functional programming.
In short, according to Wikipedia, functional programming is:
a declarative programming paradigm in which function definitions are trees of expressions that map values to other values, rather than a sequence of imperative statements which update the running state of the program.
This basically means that the emphasis of this model is on the application of functions, not on imperative programming, which focuses on the changes being made to program state. So in functional code, a function in execution doesn't rely on or manipulate data outside of its own scope (as brilliantly said here).
"Ok, and what does that have to do with MapReduce, anyhow?"
Well, MapReduce is directly inspired by functional programming, because Map and Reduce are basic functions used in functional programming. Of course, MapReduce has many other stages added to an execution, like Combine, Shuffle, Sort, etc., but the core of the model stems from the idea of functional programming described above.
About mapping: in a functional sense, it is described as a function that receives two arguments, a function and a list of values. The Map function essentially applies the given function to each and every value of the list and returns an output list of results. You can indeed call this a type of "filtering"; however, data can be manipulated in many more ways than just being "filtered" out. The main goal of a Map function is to transform the input data into the form needed for the calculations made next in the Reduce function.
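For illustration, here is a minimal sketch of that in Python (the function and values are just examples):

```python
# "Map" in the functional sense: apply a function to every value of a list.
def square(x):          # the function passed as the first argument
    return x * x

values = [1, 2, 3, 4]   # the list passed as the second argument
print(list(map(square, values)))  # [1, 4, 9, 16]
```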
Talking about Reduce now, it follows a similar approach. Two arguments are given here as well: a function and a list of values to which the function will be applied. Since the list of values here is the transformed collection of data output by the Map function, all that is left to do is work on it and arrive at the desired results. With your knowledge of the abstract sense of that step of a MapReduce job, you have the right idea when you describe the Reduce function as aggregating the input data. The one thing that is "missing" from that description, though, is how, and based on what, those input data will be aggregated. And that is the main essence of the Map function, as described above.
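A matching sketch of the Reduce step, folding the mapped values into a single result:

```python
from functools import reduce

# "Reduce" in the functional sense: fold a list into one result
# by repeatedly applying a two-argument combining function.
def add(total, x):
    return total + x

mapped = [1, 4, 9, 16]      # e.g. the output of the Map step above
print(reduce(add, mapped))  # 30
```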
With all this, we can understand that the MapReduce model is named after the two basic functions of functional programming that it abstractly implements, and that the model essentially follows the semantic contracts of those functions.
You can go on a quest yourself about all of this and a lot more by starting from here, here, here, and here.
Related
I am fairly new to both parallel programming and the Erlang language and I'm struggling a bit.
I'm having a hard time implementing a mapreduce skeleton. I spawn M mappers (their task is to map the power function over a list of floats) and R reducers (they sum the elements of the input list sent by the mapper).
What I then want to do is to send the intermediate results of each mapper to a random reducer, how do I go about linking one mapper to a reducer?
I have looked around the internet for examples. The closest thing to what I want to do that I could find is this word counter example; the author seems to have found a clever way to link a mapper to a reducer, and the logic makes sense, but I have not been able to tweak it to fit my particular needs. Maybe the key-value implementation is not suitable for finding the sum of a list of powers?
Any help, please?
Just to give an update: apparently there were problems with the algorithm linked in the OP. It looks like there is something wrong with the synchronization protocol, which is hinted at by the presence of the call to the sleep() function (i.e. it's not supposed to be there).
For a good working implementation of the map/reduce framework, please refer to Joe Armstrong's version in the Programming Erlang book (2nd ed).
Armstrong's version only uses one reducer, but it can easily be modified to use more reducers in order to eliminate the bottleneck.
I have also added a function to split the input list into chunks. Each mapper will get a chunk of data.
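The chunking helper itself isn't reproduced here, but the idea is simple; a rough equivalent sketch (in Python for brevity, since the book's code is Erlang):

```python
def chunk(xs, n_mappers):
    """Split xs into n_mappers contiguous chunks of near-equal size."""
    size, extra = divmod(len(xs), n_mappers)
    chunks, start = [], 0
    for i in range(n_mappers):
        # The first `extra` chunks absorb one leftover element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(xs[start:end])
        start = end
    return chunks

print(chunk([1.0, 2.0, 3.0, 4.0, 5.0], 2))  # [[1.0, 2.0, 3.0], [4.0, 5.0]]
```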
I have heard and bought the argument that mutation and state are bad for concurrency. But I struggle to understand what the correct alternatives actually are.
For example, when looking at the simplest of all tasks: counting, e.g. word counting in a large corpus of documents. Accessing and parsing the document takes a while so we want to do it in parallel using k threads or actors or whatever the abstraction for parallelism is.
What would be the correct but also practical pure functional way, using immutable data structures to do this?
The general approach to analyzing data sets in a functional way is to partition the data set in some way that makes sense; for a document, you might cut it up into sections based on size, i.e. four threads means the document is sectioned into four pieces.
The thread or process then executes its algorithm on its section of the data set and generates an output. All the outputs are gathered together and then merged. For word counts, for example, a collection of word counts is sorted by word, and then each list is stepped through, looking for the same words. If a word occurs in more than one list, the counts are summed. In the end, a new list with the sums for all the words is output.
This approach is commonly referred to as map/reduce. The step of converting a document into word counts is a "map" and the aggregation of the outputs is a "reduce".
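A rough sketch of that partition/map/merge flow in Python (names are illustrative; each worker builds its own counts and no state is shared):

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_words(lines):
    # "Map": turn one section of the document into its own word counts.
    return Counter(word for line in lines for word in line.split())

def word_count(lines, n_workers=4):
    # Partition the document into one section per worker.
    size = max(1, len(lines) // n_workers)
    sections = [lines[i:i + size] for i in range(0, len(lines), size)]
    # Each worker sees only its own section; no shared mutable state.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(count_words, sections)
    # "Reduce": merge the per-section counts into one result.
    return sum(partials, Counter())
```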
In addition to eliminating the overhead of preventing data conflicts, a functional approach enables the compiler to optimize the code more aggressively. Not all languages and compilers do this, but because the compiler knows its variables are not going to be modified by an outside agent, it can apply transforms to the code to increase its performance.
In addition, functional programming lets systems like Spark dynamically create threads, because the boundaries of change are clearly defined. That's why you can write a single function chain in Spark and then just throw servers at it without having to change the code. Pure functional languages can do this in a general way, making every application intrinsically multi-threaded.
One of the reasons functional programming is "hot" is because of this ability to enable multiprocessing transparently and safely.
Mutation and state are bad for concurrency only if mutable state is shared between multiple threads for communication, because it's very hard to reason about impure functions and methods that silently trash some shared memory in parallel.
One possible alternative is using message passing for communication between threads/actors (as is done in Akka), and building ("reasonably pure") functional data analysis frameworks like Apache Spark on top of it. Apache Spark is known to be rather suitable for counting words in a large corpus of documents.
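For a flavor of that, here is a minimal word-count sketch using the PySpark API (the input path is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (spark.sparkContext
          .textFile("corpus/*.txt")               # illustrative input path
          .flatMap(lambda line: line.split())     # map each line to words
          .map(lambda word: (word, 1))            # emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))       # sum counts per word

print(counts.take(10))
```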
When evaluating a recommender system, one could split the data into three pieces: training, validation, and testing sets. In that case, the training set would be used to learn the recommendation model from the data, and the validation set would be used to choose the best model or parameters. Then, using the chosen model, one could evaluate the performance of the algorithm on the testing set.
I have found a documentation page for scikit-learn cross validation (http://scikit-learn.org/stable/modules/cross_validation.html) where it says that it is not necessary to split the data into three pieces when using k-fold cross validation, but only into two: training and testing.
A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles).
I am wondering if this would be a good approach. And if so, someone could show me a reference to an article/book backing this theory up?
Cross-validation does not avoid the validation set; it simply uses many of them. In other words, instead of one split into three parts, you have one split into two, and what you now call "training" is actually what previously was training plus validation. CV is simply about repeated splits (done in a slightly smarter manner than purely randomly) into train and test, with the results then averaged. The theory backing this up is widely available in pretty much any good ML book; the crucial bit is "should I use it", and the answer is surprisingly simple: only if you do not have enough data to do one split. CV is used when you do not have enough data for each of the splits to be representative of the distribution you are interested in; doing repeated splits then simply reduces the variance. Furthermore, for really small datasets one does nested CV: an outer one for the [train+val][test] split and an inner one for [train][val], so the variance of both model selection and final evaluation is reduced.
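A minimal sketch of that nested CV in scikit-learn (the model and parameter grid are chosen just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner CV: model/parameter selection (plays the role of the validation set).
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)

# Outer CV: final evaluation (plays the role of the test set).
scores = cross_val_score(inner, X, y, cv=5)
print(scores.mean())
```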
Pragmatically, what are the main advantages of using promises? Can you show me some examples of real-life useful usage of promises?
In Scheme, a promise is just a value whose computation is not necessarily done yet; if you never use the value, it will never be calculated. In short, it is a way to do lazy evaluation in the otherwise eager Scheme. A typical use is to do computations on streams instead of lists.
With lists you can use higher-order functions: you can take a list, filter it for values you are interested in, then transform those values, and perhaps at some point you have enough to produce the value you needed. This is nice since you can abstract each step, writing logic that does only one thing and composing the steps into the whole program. In this scenario, however, the first step needs to finish in full before the next step can handle its result; if you are searching for the first prime number between 0 and 1000, iterating over all the numbers in each step might not be very efficient. This is where streams come in.
With streams the code looks the same, but the intermediate results are made by need. A stream is a pair whose parts are promises, so the code that would otherwise build a pair is delayed until the values are used. Every step produces just enough data for the next step; thus, if it is enough for the first step to iterate over just 20% of the elements for the last step to compute the final result, the remaining 80% will never be processed in any of the steps. With such a structure, the initial stream can also be infinite, like all the numbers from 0 increasing by 1.
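The idea can be mimicked outside Scheme; here is a rough sketch in Python, with zero-argument lambdas standing in for promises:

```python
# A stream is a pair (head, thunk); the thunk is a "promise" that
# produces the rest of the stream only when forced (called).
def integers_from(n):
    return (n, lambda: integers_from(n + 1))

def stream_filter(pred, stream):
    head, tail = stream
    while not pred(head):
        head, tail = tail()  # force promises until a match is found
    return (head, lambda: stream_filter(pred, tail()))

def stream_take(n, stream):
    out = []
    for _ in range(n):
        head, tail = stream
        out.append(head)
        stream = tail()      # force only as much as is needed
    return out

evens = stream_filter(lambda x: x % 2 == 0, integers_from(0))
print(stream_take(5, evens))  # [0, 2, 4, 6, 8]
```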
There are penalties involved in using streams. Imagine an algorithm that would visit all the elements anyway. A stream version of that algorithm would be slower, since the promises that are created and then forced give the program overhead compared with doing the computation without laziness.
You might be interested in seeing Hal Abelson explaining streams and their pros and cons.
There are other alternatives to streams and lazy evaluation. One is generators. Here you can also make composable procedures that take a generator and produce a generator. The iteration happens by need, as with streams.
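For comparison, the same demand-driven pipeline as a Python generator sketch:

```python
import itertools

def integers_from(n):
    # An infinite source, produced one element at a time, on demand.
    while True:
        yield n
        n += 1

def evens(numbers):
    # A composable step: takes a generator and produces a generator.
    return (x for x in numbers if x % 2 == 0)

print(list(itertools.islice(evens(integers_from(0)), 5)))  # [0, 2, 4, 6, 8]
```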
Another alternative is transducers. These are also composable and iterate by need like streams and generators, but unlike streams and generators the initial data cannot be an infinite sequence unless the underlying structure supports it.
The advantages of using promises, or of any other technique in this answer, are not Scheme specific. They apply to all eager programming languages!
I am working through a particular type of code testing that is rather nettlesome and could be automated, yet I'm not sure of the best practices. Before describing the problem, I want to make clear that I'm looking for the appropriate terminology and concepts, so that I can read more about how to implement it. Suggestions on best practices are welcome, certainly, but my goal is specific: what is this kind of approach called?
In the simplest case, I have two programs that take in a bunch of data, produce a variety of intermediate objects, and then return a final result. When tested end-to-end, the final results differ, hence the need to find out where the differences occur. Unfortunately, even intermediate results may differ, but not always in a significant way (i.e. some discrepancies are tolerable). The final wrinkle is that intermediate objects may not necessarily have the same names between the two programs, and the two sets of intermediate objects may not fully overlap (e.g. one program may have more intermediate objects than the other). Thus, I can't assume there is a one-to-one relationship between the objects created in the two programs.
The approach that I'm thinking of taking to automate this comparison of objects is as follows (it's roughly inspired by frequency counts in text corpora):
1. For each program, A and B, create a list of the objects created throughout execution, which may be indexed in a very simple manner, such as a001, a002, a003, a004, ... for A, and similarly for B (b001, ...).
2. Let Na = the number of unique object names encountered in A; similarly for Nb and the objects in B.
3. Create two tables, TableA and TableB, with Na and Nb columns, respectively. Entries record a value for each object at each trigger (i.e. at each row, defined next).
4. For each assignment in A, the simplest approach is to capture the hash value of all Na items; of course, one can use LOCF (last observation carried forward) for the items that don't change, and any as-yet unobserved objects are simply given a NULL entry. Repeat this for B.
5. Match entries in TableA and TableB via their hash values. Ideally, objects will arrive into the "vocabulary" in approximately the same order, so that order and hash value together allow one to identify the sequences of values.
6. Find discrepancies between the objects in A and B based on when the sequences of hash values diverge, for any objects with divergent sequences.
Now, this is a simple approach and could work wonderfully if the data were simple, atomic, and not susceptible to numerical precision issues. However, I believe that numerical precision may cause hash values to diverge, though the impact is insignificant if the discrepancies are approximately at the machine tolerance level.
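One hypothetical way to keep the hashes in step 4 stable under machine-precision noise is to quantize floating-point values before hashing; a rough sketch (the digit count and helper name are illustrative):

```python
import hashlib

def tolerant_hash(obj, sig_digits=10):
    """Hash a normalized representation of obj, rounding floats so that
    values differing only at machine-precision level hash identically."""
    def normalize(x):
        if isinstance(x, float):
            return f"{x:.{sig_digits}e}"  # keep only sig_digits+1 digits
        if isinstance(x, (list, tuple)):
            return "[" + ",".join(normalize(v) for v in x) + "]"
        return repr(x)
    return hashlib.sha256(normalize(obj).encode()).hexdigest()

# Values equal up to ~1e-16 noise get the same hash:
assert tolerant_hash(0.1 + 0.2) == tolerant_hash(0.3)
```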
First: What is the name for this type of testing method and concept? An answer need not describe exactly the method above, but should reflect the class of methods for comparing objects from two (or more) different programs.
Second: What standard methods exist for what I describe in steps 3 and 4? For instance, the "value" need not only be a hash: one might also store the sizes of the objects; after all, two objects cannot be the same if they are massively different in size.
In practice, I tend to compare a small number of items, but I suspect that when automated this need not involve a lot of input from the user.
Edit 1: This paper is related in terms of comparing execution traces; it mentions "code comparison", which is related to my interest, though I'm more concerned with the data (i.e. objects) than with the actual code that produces them. I've just skimmed it, but will review it more carefully for methodology. More importantly, this suggests that comparing code traces may be extended to comparing data traces. This paper analyzes some comparisons of code traces, albeit in a wholly unrelated area, security testing.
Perhaps data-tracing and stack-trace methods are related. Checkpointing is slightly related, but its typical use (i.e. saving all of the state) is overkill.
Edit 2: Other related concepts include differential program analysis and monitoring of remote systems (e.g. space probes) where one attempts to reproduce the calculations using a local implementation, usually a clone (think of a HAL-9000 compared to its earth-bound clones). I've looked down the routes of unit testing, reverse engineering, various kinds of forensics, and whatnot. In the development phase, one could ensure agreement with unit tests, but this doesn't seem to be useful for instrumented analyses. For reverse engineering, the goal can be code & data agreement, but methods for assessing fidelity of re-engineered code don't seem particularly easy to find. Forensics on a per-program basis are very easily found, but comparisons between programs don't seem to be that common.
(Making this answer community wiki, because dataflow programming and reactive programming are not my areas of expertise.)
The area of data flow programming appears to be related, and thus debugging of data flow programs may be helpful. This paper from 1981 gives several useful high level ideas. Although it's hard to translate these to immediately applicable code, it does suggest a method I'd overlooked: when approaching a program as a dataflow, one can either statically or dynamically identify where changes in input values cause changes in other values in the intermediate processing or in the output (not just changes in execution, if one were to examine control flow).
Although dataflow programming is often related to parallel or distributed computing, it seems to dovetail with Reactive Programming, which is how the monitoring of objects (e.g. the hashing) can be implemented.
This answer is far from adequate, hence the CW tag, as it doesn't really name the debugging method that I described. Perhaps this is a form of debugging for the reactive programming paradigm.
[Also note: although this answer is CW, if anyone has a far better answer in relation to dataflow or reactive programming, please feel free to post a separate answer and I will remove this one.]
Note 1: Henrik Nilsson and Peter Fritzson have a number of papers on debugging for lazy functional languages, which are somewhat related: the debugging goal is to assess values, not the execution of code. This paper seems to have several good ideas, and their work partially inspired this paper on a debugger for a reactive programming language called Lustre. These references don't answer the original question, but may be of interest to anyone facing this same challenge, albeit in a different programming context.