I want to fill a Cassandra database with a list of strings that I then process using Hadoop. What I want to do is run through all the strings in order using a Hadoop cluster and record how much overlap there is between them in order to find the Longest Common Substring.
My question is, will the InputFormat object allow me to read out the data in sorted order, or will my strings be read out "randomly" (according to how Cassandra decides to distribute them) across every machine in the cluster? Is the MapReduce process designed to process each row on its own, without any way of looking at two consecutive rows like I'm asking for?
First of all, the Mappers will read the data in whatever order they get it from the InputFormat. I'm not a Cassandra expert, but I don't expect that will be in sorted order.
If you want sorted order, you should use an identity mapper (one that does nothing) whose output key is the string itself. Then the strings will be sorted before being passed to the reduce step. But it gets a little more complicated, since you can have more than one reducer. With only one reducer, everything is globally sorted. With more than one, each reducer's input is sorted, but the input across reducers might not be sorted; that is, adjacent strings might not go to the same reducer. You would need a custom partitioner to handle that.
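As a minimal sketch of that identity-style mapper (assuming the strings arrive as the map input values; the class name and the choice of NullWritable are illustrative, and nothing here is specific to Cassandra's InputFormat):

```java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits each input string as the map output key, so the shuffle sorts the
// strings before they reach the reducer(s).
public class SortByStringMapper extends Mapper<Object, Text, Text, NullWritable> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // The value is assumed to hold the string itself.
        context.write(value, NullWritable.get());
    }
}
```

With a single reducer, reduce() is then called in globally sorted key order, so the reducer can compare each string with the previous one it saw; with several reducers you would still need the custom partitioner mentioned above.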
Lastly, you mentioned that you're doing longest common substring: are you looking for the longest substring among each pair of strings? Among consecutive pairs of strings? Among all strings? Each of these possibilities will affect how you need to structure your MapReduce job.
I am trying to design a mapper and reducer for Hadoop. I am new to Hadoop, and I'm a bit confused about how the mapper and reducer are supposed to work for my specific application.
The input to my mapper is the connectivity of a large directed graph. It is a two-column input where each row is an individual edge: the first column is the start node id and the second column is the end node id. I'm trying to output the number of neighbors for each start node id into a two-column text file, where the first column is sorted in order of increasing start node id.
My questions are:
(1) The input is already set up such that each line is a key-value pair, where the key is the start node id, and the value is the end node id. Would the mapper simply just read in each line and write it out? That seems redundant.
(2) Does the sorting take place in between the mapper and reducer or could the sorting actually be done with the reducer itself?
If my understanding is correct, you want to count how many distinct values a key will have.
Simply emitting the input key-value pairs in the mapper, and then counting the distinct values per key (e.g., by adding them to a set and emitting the set size as the value of the reducer) in the reducer is one way of doing it, but a bit redundant, as you say.
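As a rough illustration of that naive version (class names are made up for the example, and the input is assumed to arrive at the reducer as (startNode, endNode) Text pairs):

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Naive approach: the mapper forwards (startNode, endNode) pairs unchanged,
// and the reducer collects the distinct end nodes into a set and emits its size.
public class DistinctNeighborCountReducer extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    protected void reduce(Text startNode, Iterable<Text> endNodes, Context ctx)
            throws IOException, InterruptedException {
        Set<String> distinct = new HashSet<>();
        for (Text endNode : endNodes) {
            distinct.add(endNode.toString());
        }
        ctx.write(startNode, new IntWritable(distinct.size()));
    }
}
```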
In general, you want to reduce the network traffic, so you may want to do some more computations before the shuffling (yes, this is done by Hadoop).
Two easy ways to improve the efficiency are:
1) Use a combiner, which will output sets of values instead of single values. This way, you will send fewer key-value pairs to the reducers, and also, some values may be skipped, since they are already in the local value set of the same key.
2) Use map-side aggregation. Instead of emitting the input key-value pairs right away, store them locally in the mapper (in memory) in a data structure (e.g., a hashmap or multimap). The key can be the map input key and the value can be the set of values seen so far for this key. Each time you meet a new value for this key, you add it to this structure. At the end of each mapper, you emit this structure (or convert the values to an array) from the close() method (cleanup() in the newer API); a sketch of this is shown below.
You can lookup both methods using the keywords "combiner" and "map-side aggregation".
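Here is a rough sketch of option (2), assuming the two-column, whitespace-separated input described in the question; class and field names are illustrative, and the in-memory map only works if each mapper's key set fits in RAM:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Buffers the distinct end-node ids per start-node id in memory and emits
// them once, from cleanup() (the new-API equivalent of close()).
public class NeighborSetMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, Set<String>> neighbors = new HashMap<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        String[] cols = line.toString().trim().split("\\s+");   // startId endId
        neighbors.computeIfAbsent(cols[0], k -> new HashSet<>()).add(cols[1]);
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        for (Map.Entry<String, Set<String>> e : neighbors.entrySet()) {
            // Emit the whole local set as one comma-joined value;
            // the reducer merges the sets and outputs the final count.
            context.write(new Text(e.getKey()),
                          new Text(String.join(",", e.getValue())));
        }
    }
}
```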
A global sort on the key is a bit trickier. Again, there are two basic options, though neither is really good:
1) Use a single reducer, but then you don't gain anything from parallelism.
2) Use a total order partitioner, which needs some extra coding (sketched below).
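For reference, a rough driver-side outline of option (2) using Hadoop's TotalOrderPartitioner and InputSampler; the paths, sampling parameters, and reducer count are placeholders, and the job's input/output formats, mapper/reducer, and key types are assumed to be configured elsewhere:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

// Driver outline for globally sorted output: sample the input to build a
// partition file, then let TotalOrderPartitioner route whole key ranges to
// reducers so the concatenated reducer outputs are sorted end to end.
public class GlobalSortDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // ... input/output paths, formats, mapper/reducer and key types
        //     are assumed to be configured here ...
        job.setNumReduceTasks(4);                          // example value
        job.setPartitionerClass(TotalOrderPartitioner.class);
        TotalOrderPartitioner.setPartitionFile(
                job.getConfiguration(), new Path("/tmp/partitions"));
        // Sample ~1% of the input (up to 1000 records) to pick split points.
        InputSampler.writePartitionFile(
                job, new InputSampler.RandomSampler<Text, Text>(0.01, 1000));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```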
Other than that, you may want to move to Spark for a more intuitive and efficient solution.
In Google's paper it states:
We guarantee that within a given partition, the intermediate key/value pairs are processed in increasing key order. This ordering guarantee makes it easy to generate a sorted output file per partition, which is useful when the output file format needs to support efficient random access lookups by key, or users of the output find it convenient to have the data sorted.
I also tried a simple "wordcount" example with "mrjob" (Python), and it works as expected: in the output, all keys (words) are in alphabetical (ascending) order.
However, I don't understand why this is possible. MapReduce is meant for parallel processing, which means all sub-processes are independent. For example, a reducer calculates the frequency of a word, writes it to the output, and then finishes. But to produce ordered output, a sub-process apparently has to wait for the others in order to compare keys before writing its output. That means it cannot leverage the power of parallel processing.
So how do they solve it?
Thank you.
Hadoop was naturally created to work with Big Data. But what happens if the output from your Mappers is also big, too big to fit into the Reducers' memory?
Let's say we're considering some large amount of data that we want to cluster. We use some partitioning algorithm that will find a specified number of "groups" of elements (clusters), such that elements in one cluster are similar, but elements that belong to different clusters are dissimilar. The number of clusters often needs to be specified.
If I try to implement K-means, the best-known clustering algorithm, one iteration would look like this:
Map phase - assign objects to closest centroids
Reduce phase - calculate new centroids based on all objects in a cluster
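(For concreteness, here is one common way such an iteration could be sketched in the Hadoop Java API, assuming points are comma-separated vectors in text files; the hard-coded centroids stand in for values that would normally be loaded in setup(), e.g. from the distributed cache.)

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansIteration {

    // Parses a comma-separated point such as "1.5,2.0" into a double[].
    static double[] parse(String line) {
        String[] cols = line.split(",");
        double[] p = new double[cols.length];
        for (int i = 0; i < cols.length; i++) {
            p[i] = Double.parseDouble(cols[i].trim());
        }
        return p;
    }

    // Map phase: assign each point to its nearest centroid.
    public static class AssignMapper extends Mapper<Object, Text, IntWritable, Text> {
        private final double[][] centroids = { {0.0, 0.0}, {10.0, 10.0} };

        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            double[] p = parse(value.toString());
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double d = 0;
                for (int i = 0; i < p.length; i++) {
                    double diff = p[i] - centroids[c][i];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            ctx.write(new IntWritable(best), value);   // cluster id -> point
        }
    }

    // Reduce phase: stream over all points of one cluster, keeping only running
    // sums and a count, so memory use stays constant however large the cluster is.
    public static class RecomputeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable cluster, Iterable<Text> points, Context ctx)
                throws IOException, InterruptedException {
            double[] sum = null;
            long n = 0;
            for (Text t : points) {
                double[] p = parse(t.toString());
                if (sum == null) sum = new double[p.length];
                for (int i = 0; i < p.length; i++) sum[i] += p[i];
                n++;
            }
            StringBuilder centroid = new StringBuilder();
            for (int i = 0; i < sum.length; i++) {
                if (i > 0) centroid.append(',');
                centroid.append(sum[i] / n);
            }
            ctx.write(cluster, new Text(centroid.toString()));  // new centroid
        }
    }
}
```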
But what happens if we have only two clusters?
In that case, the large dataset will be divided into two parts, there will be only two keys, and the values for each key will contain half of the large dataset.
What I don't understand is: what if the Reducer gets many values for one key? How can it fit them in its RAM? Isn't this one of the problems Hadoop was created to solve?
I gave just an example of an algorithm, but this is a general question.
This is precisely the reason why, in the Reducer, you never get a List of the values for a particular key. You only get an Iterator over the values. If there are too many values for a particular key, they are not all stored in memory; instead, values are read off the local disk.
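For example, a reducer can walk the Iterable and never hold more than one value at a time (a generic sketch; the key/value types are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The framework hands the reducer an Iterable, not a materialized List:
// values are streamed from sorted spill files on local disk, so you can
// process arbitrarily many of them with constant memory.
public class StreamingCountReducer extends Reducer<Text, Text, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        long count = 0;
        for (Text ignored : values) {   // each step may read from disk
            count++;
        }
        ctx.write(key, new LongWritable(count));
    }
}
```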
Links: Reducer
Also see Secondary Sort, which is a very useful design pattern for scenarios where a key has too many values.
I have only a high-level understanding of MapReduce but a specific question about what is allowed in implementations.
I want to know whether it's easy (or possible) for a Mapper to uniformly distribute the given key-value pairs among the reducers. It might be something like
(k,v) -> (proc_id, (k,v))
where proc_id is the unique identifier for a processor (assume that every key k is unique).
The central question is: if the number of reducers is not fixed (it is determined dynamically depending on the size of the input; is this even how it's done in practice?), then how can a mapper produce sensible ids? One way could be for the mapper to know the total number of key-value pairs. Does MapReduce allow mappers to have this information? Another way would be to perform some small number of extra rounds of computation.
What is the appropriate way to do this?
The distribution of keys to reducers is done by a Partitioner. If you don't specify otherwise, the default partitioner uses a simple hashCode-based partitioning algorithm, which tends to distribute the keys very uniformly when every key is unique.
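The default behavior is roughly equivalent to the following (a stand-in class that mirrors what Hadoop's HashPartitioner does):

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Mask off the sign bit of hashCode() and take it modulo the number of
// reducers; with unique keys this spreads them close to uniformly.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```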
I'm assuming that what you actually want is to process random groups of records in parallel, and that the keys k have nothing to do with how the records should be grouped. That suggests that you should focus on doing the work on the map side instead. Hadoop is pretty good at cleanly splitting up the input into parallel chunks for processing by the mappers, so unless you are doing some kind of arbitrary aggregation I see no reason to reduce at all.
Often the procId technique you mention is used to take otherwise heavily-skewed groups and un-skew them (for example, when performing a join operation). In your case the key is all but meaningless.
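A rough sketch of that salting idea (everything here is illustrative: the key-extraction helper, the tab-separated record assumption, and the number of salt buckets):

```java
import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// "Salting" mapper: a random suffix spreads one hot key over several
// reducers; a follow-up job (or the reducer itself) strips the suffix
// and merges the partial results.
public class SaltingMapper extends Mapper<Object, Text, Text, Text> {
    private static final int SALT_BUCKETS = 8;   // sub-partitions per key
    private final Random random = new Random();

    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String original = extractKey(value);
        String salted = original + "#" + random.nextInt(SALT_BUCKETS);
        ctx.write(new Text(salted), value);
    }

    private String extractKey(Text value) {
        // Assume tab-separated records with the (skewed) key in the first column.
        return value.toString().split("\t", 2)[0];
    }
}
```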
I'm looking at solutions to a problem that involves reading keyed data from more than one file. In a single map step I need all the values for a particular key in the same place at the same time. I've seen the discussion of "the shuffle" in White's book, and I'm wondering: when you come out of the merge and the input to a reducer is sorted by key, is all the data for a key there? Can you count on that?
The bigger picture is that I want to do a poor-man's triple-store federation, and the triples I want to load into an in-memory store don't all come from the same file. It's a vertical (?) partition where the values for a particular key are in different files. Said another way, the columns for a complete record each come from different files. Does Hadoop re-assemble that, at least for a single key at a time?
In short: yes. In a Hadoop job, the partitioner chooses which reducer receives which (key, value) pairs. Quote from the Yahoo tutorial section on partitioning: "It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same". This is also necessary for many of the types of algorithms typically solved with map reduce (such as distributed sorting, which is what you're describing).
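As a sketch of how that plays out when the input is split across files, Hadoop's MultipleInputs lets each file go through its own mapper while the shuffle still brings every value for a key to a single reduce() call. The paths and the trivial mapper/reducer below are illustrative; in your case each mapper would emit the key plus whichever "columns" its file contributes:

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReassembleByKey {

    // Both inputs go through mappers that emit the same key type; here they
    // simply split tab-separated lines into (key, remainder-of-line).
    public static class ColumnMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t", 2);
            ctx.write(new Text(parts[0]), new Text(parts.length > 1 ? parts[1] : ""));
        }
    }

    // All values for one key, regardless of source file, arrive in one call.
    public static class AssembleReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder record = new StringBuilder();
            for (Text v : values) record.append(v).append('\t');
            ctx.write(key, new Text(record.toString().trim()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(ReassembleByKey.class);
        // Hypothetical paths: each file carries different columns of the record.
        MultipleInputs.addInputPath(job, new Path("/data/fileA"),
                TextInputFormat.class, ColumnMapper.class);
        MultipleInputs.addInputPath(job, new Path("/data/fileB"),
                TextInputFormat.class, ColumnMapper.class);
        job.setReducerClass(AssembleReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/assembled"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```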