Parallel reducing with Hadoop MapReduce

I'm using Hadoop MapReduce. I have a file as input to the map function; the map function does something that isn't relevant to the question. I'd like my reducer to take the map's output and write it to two different files.
The way I see it (I want an efficient solution), there are two options:
A single reducer that identifies the two different cases and writes to two different contexts.
Two parallel reducers, each of which identifies its relevant input, ignores the other's, and writes to its own file (so each reducer writes to a different file).
I'd prefer the first solution, since it means I'll go over the map's output only once instead of twice in parallel, but if the first isn't supported in some way, I'll be glad to hear a solution for the second suggestion.
*Note: these two final files are supposed to stay separate; there is no need to join them at this point.

The Hadoop API has a feature for creating multiple outputs called MultipleOutputs which makes your preferred solution possible.
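For example, a minimal sketch of such a reducer using MultipleOutputs (new mapreduce API; the named outputs "first"/"second" and the test that distinguishes the two cases are just placeholders) might look like this:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Driver side (once, when setting up the job):
    //   MultipleOutputs.addNamedOutput(job, "first", TextOutputFormat.class, Text.class, Text.class);
    //   MultipleOutputs.addNamedOutput(job, "second", TextOutputFormat.class, Text.class, Text.class);
    public class SplittingReducer extends Reducer<Text, Text, Text, Text> {
        private MultipleOutputs<Text, Text> out;

        @Override
        protected void setup(Context context) {
            out = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                // Placeholder test for "which of the two cases is this record?"
                if (value.toString().startsWith("A")) {
                    out.write("first", key, value);   // lands in files named first-r-*
                } else {
                    out.write("second", key, value);  // lands in files named second-r-*
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            out.close();
        }
    }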

If you know at the map stage which file each record must go to, you can tag your map output with a special key specifying the target file. For example, if record R1 must go to file 1, you would output <1, R1> (1 is the key, a symbolic representation of file 1, and R1 is the value). If record R2 must go to file 2, your map output would be <2, R2>.
Then, if you configure the MapReduce job to use only 2 reducers, it is guaranteed that all records tagged with <1, _> go to one reducer and all records tagged with <2, _> go to the other.
This would be better than your preferred solution, since you still go through your map output only once, and at the same time the work is done in parallel.
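A rough sketch of the tagging mapper (the rule deciding which file a record belongs to is a placeholder; the driver just calls job.setNumReduceTasks(2), and the default HashPartitioner then sends tag 1 to one reducer and tag 2 to the other):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // With exactly two reducers, keys 1 and 2 hash to different partitions
    // (1 % 2 = 1, 2 % 2 = 0), so each reducer's part-r-* file is one of the
    // two desired outputs.
    public class TaggingMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        private final IntWritable tag = new IntWritable();

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            tag.set(goesToFileOne(record) ? 1 : 2);
            context.write(tag, record);
        }

        // Placeholder for whatever rule actually decides which file a record belongs to.
        private boolean goesToFileOne(Text record) {
            return record.toString().startsWith("A");
        }
    }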

Related

All mappers handle the same file?

Usually Hadoop splits the file and sends each split to a different machine, but I want each machine to handle the same file (not just a split of it), send its result to reduce, and have the reduce step sum all the results. How can I do this? Can anyone help me?
OK, this may not be the exact solution, but a dirty way to achieve it is:
Set FileInputFormat.setMaxInputSplitSize(job, size), where the value of the size parameter must be greater than the input file size in bytes (which can be calculated using the length() method of Java's File class). This ensures that there is only one mapper per file and that your file doesn't get split.
Now use MultipleInputs.addInputPath(job, input_path, InputFormat.class) for each of your machines, which will run a single mapper on each machine. As per your requirement, the reduce function doesn't need any changes. The dirty part is that MultipleInputs.addInputPath requires a unique path, so you may have to copy the same file as many times as the number of mappers you want, give the copies unique names, and pass each one to MultipleInputs.addInputPath. If you provide the same path twice, it will be ignored.
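A driver sketch of this dirty approach might look like the following (paths, the number of copies, and the mapper/reducer classes are placeholders):

    import java.io.File;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SameFileEverywhereDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "same-file-on-every-mapper");
            job.setJarByClass(SameFileEverywhereDriver.class);

            // A max split size larger than the file guarantees one mapper per file copy.
            long fileSize = new File("/local/copy/of/input.txt").length(); // hypothetical local path
            FileInputFormat.setMaxInputSplitSize(job, fileSize + 1);

            // One distinct HDFS copy of the same file per mapper you want to run
            // (MultipleInputs ignores duplicate paths, hence the copies).
            MultipleInputs.addInputPath(job, new Path("/input/copy1.txt"), TextInputFormat.class);
            MultipleInputs.addInputPath(job, new Path("/input/copy2.txt"), TextInputFormat.class);
            MultipleInputs.addInputPath(job, new Path("/input/copy3.txt"), TextInputFormat.class);

            // Mapper/reducer classes and output types are omitted; the reducer just sums the results.
            FileOutputFormat.setOutputPath(job, new Path("/output/sums"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }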
Your problem is that you have more than one problem. What (I think) you want to do:
Produce some set of random samples
Sum each sample
I'd just break these down into two separate, simple maps / reduces. A mapreduce to produce the random samples. A second to sum each sample separately.
Now there's probably a clever way to do this all in one pass, but unless you've got some unusual constraints, I'd be surprised if it was worth the extra complexity.
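Chaining the two simple jobs is just a matter of running them back to back in the driver; something along these lines (class names and paths are made up):

    // Inside the driver: run the sampling job first, then feed its output
    // directory to the summing job.
    Configuration conf = new Configuration();

    Job sampleJob = Job.getInstance(conf, "produce-random-samples");
    sampleJob.setJarByClass(Driver.class);
    sampleJob.setMapperClass(SampleMapper.class);
    sampleJob.setReducerClass(SampleReducer.class);
    FileInputFormat.addInputPath(sampleJob, new Path("/input"));
    FileOutputFormat.setOutputPath(sampleJob, new Path("/tmp/samples"));
    if (!sampleJob.waitForCompletion(true)) {
        System.exit(1);
    }

    Job sumJob = Job.getInstance(conf, "sum-each-sample");
    sumJob.setJarByClass(Driver.class);
    sumJob.setMapperClass(SumMapper.class);
    sumJob.setReducerClass(SumReducer.class);
    FileInputFormat.addInputPath(sumJob, new Path("/tmp/samples"));
    FileOutputFormat.setOutputPath(sumJob, new Path("/output/sums"));
    System.exit(sumJob.waitForCompletion(true) ? 0 : 1);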

In Hadoop, how do I parse log files to obtain multiple metrics, not just a single one as in wordcount?

I am wondering how Hadoop handles log file parsing when we need to calculate not just one simple metric (e.g. the most popular word) but many metrics (e.g. ALL of the following: average height broken down by gender, top 10 sites broken down by phone type, top word broken down by adults/kids).
Without using Hadoop, a typical distributed solution I can think of is: split the logs across different machines using a hash, etc.;
each machine parses its own log files and calculates the different metrics for them. The results can be stored as SQL, XML, or some other format in files. Then a master machine parses these intermediate files, aggregates the metrics, and stores the final results in another file.
Using Hadoop, how do I obtain the final results? All the examples I have seen are very simple cases, like counting words.
I just cannot figure out how Hadoop's mappers and reducers will cooperate to aggregate the intermediate files intelligently into a final result. I thought maybe my mapper should save intermediate files somewhere and my reducer should parse those intermediate files to get the final results. I must be wrong, since I do not see any benefit if my mapper and reducer are implemented this way.
It is said the format of map and reduce should be:
map: (K1, V1) → list(K2, V2)
combine: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
In summary, how should I design my mapper and reducer code (suppose I'm using Python; other languages are also fine)? Can anybody answer my question or provide a link for me to read?
Start thinking about how to solve challenges in an MR way. Here (1, 2) are some resources; they cover a number of MR algorithms that can be implemented in any language.
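To make one common pattern concrete, here is a hedged sketch of a single mapper that emits several metrics at once by prefixing each output key with a metric name; the log layout and metric names are invented for illustration, and the reducer would branch on the key prefix:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MultiMetricMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text outKey = new Text();
        private final Text outValue = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed tab-separated log layout: id, gender, height, phoneType, site, ...
            String[] fields = line.toString().split("\t");
            String gender = fields[1], height = fields[2], phoneType = fields[3], site = fields[4];

            // Metric 1: average height per gender -> key "height:<gender>", value = height
            outKey.set("height:" + gender);
            outValue.set(height);
            context.write(outKey, outValue);

            // Metric 2: site hits per phone type -> key "site:<phoneType>:<site>", value = 1
            outKey.set("site:" + phoneType + ":" + site);
            outValue.set("1");
            context.write(outKey, outValue);
        }
    }
    // The reducer switches on the key prefix ("height:", "site:", ...) and applies
    // the right aggregation (average, count, top-10) to the grouped values.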

How do I process a 2-D array, one per file, using Hadoop MapReduce?

I need to read and process a file as a single unit, not line by line, and it's not clear how you'd do this in a Hadoop MapReduce application. What I need to do is to read the first line of the file as a header, which I can use as my key, and the following lines as data to build a 2-D data array, which I can use as my value. I'll then do some analysis on the entire 2-D array of data (i.e. the value).
Below is how I'm planning to tackle this problem, and I would very much appreciate comments if this doesn't look reasonable or if there's a better way to go about this (this is my first serious MapReduce application so I'm probably making rookie mistakes):
My text file inputs contain one line with station information (name, lat/lon, ID, etc.) and then one or more lines containing a year value (i.e. 1956) plus 12 monthly values (i.e. 0.3 2.8 4.7 ...) separated by spaces. I have to do my processing over the entire array of monthly values [number_of_years][12] so each individual line is meaningless in isolation.
Create a custom key class, making it implement WritableComparable. This will hold the header information from the initial line of the input text files.
Create a custom input format class in which a) the isSplitable() method returns false, and b) the getRecordReader() method returns a custom record reader that knows how to read a file split and turn it into my custom key and value classes.
Create a mapper class which does the analysis on the input value (the 2-D array of monthly values) and outputs the original key (the station header info) and an output value (a 2-D array of analysis values). There'll only be a wrapper reducer class since there's no real reduction to be done.
It's not clear that this is a good/correct application of the MapReduce approach, a) since I'm doing analysis on a single value (the data array) mapped to a single key, and b) since there is never more than a single value (data array) per key, no real reduction will ever need to be performed. Another issue is that the files I'm processing are relatively small, much less than the default 64 MB split size. Given that, perhaps the first task should instead be to consolidate the input files into a sequence file, as shown in the SmallFilesToSequenceFileConverter example in the Hadoop: The Definitive Guide O'Reilly book (p. 194 in the 2nd Edition)?
Thanks in advance for your comments and/or suggestions!
It looks like your plan regarding coding is spot on; I would do the same thing.
You will benefit from Hadoop if you have a lot of input files provided as input to the job, since each file will have its own InputSplit, and in Hadoop the number of mappers executed equals the number of input splits.
Too many small files will cause too much memory use on the HDFS NameNode. To consolidate the files you can use SequenceFiles or Hadoop Archives (the Hadoop equivalent of tar); see the docs. With har files (Hadoop Archives), each small file will still have its own mapper.
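For reference, a minimal sketch of the custom input format from step 2 of the plan (written against the new mapreduce API; StationKey, MonthlyValuesWritable, and StationFileRecordReader are the custom classes from steps 1 and 3 and are not shown here):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class WholeStationFileInputFormat
            extends FileInputFormat<StationKey, MonthlyValuesWritable> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // one mapper gets the whole file: header plus all data lines
        }

        @Override
        public RecordReader<StationKey, MonthlyValuesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            // The record reader opens the split's file, parses the first line into a
            // StationKey and the remaining lines into the [years][12] value array,
            // and returns exactly one key/value pair.
            return new StationFileRecordReader();
        }
    }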

Specify Hadoop mapreduce input keys directly (not from a file)

I'd like to generate some data using a mapreduce. I'd like to invoke the job with one parameter N, and get Map called with each integer from 1 to N, once.
Obviously I want a Mapper<IntWritable, NullWritable, <my output types>>...that's easy. But I can't figure out how to generate the input data! Is there an InputFormat I'm not seeing somewhere that lets me just pull keys + values from a collection directly?
Do you want each mapper to process all integers from 1 to N? Or do you want to distribute the processing of integers 1 to N across the concurrently running mappers?
If the former, I believe you'll need to create a custom InputFormat. If the latter, the easiest way might be to generate a text file with the integers 1 to N, one integer per line, and use NLineInputFormat.
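A sketch of the second approach, assuming NLineInputFormat and made-up paths and batch sizes:

    import java.io.OutputStreamWriter;
    import java.io.PrintWriter;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class OneToNDriver {
        public static void main(String[] args) throws Exception {
            int n = Integer.parseInt(args[0]);
            Configuration conf = new Configuration();

            // Write a seed file with the integers 1..N, one per line.
            Path seed = new Path("/tmp/seed.txt");
            FileSystem fs = FileSystem.get(conf);
            try (PrintWriter out = new PrintWriter(new OutputStreamWriter(fs.create(seed, true)))) {
                for (int i = 1; i <= n; i++) {
                    out.println(i);
                }
            }

            // NLineInputFormat hands each mapper a batch of lines, so the integers
            // (and the generation work) are spread across the mappers.
            Job job = Job.getInstance(conf, "generate-from-1-to-N");
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.addInputPath(job, seed);
            NLineInputFormat.setNumLinesPerSplit(job, 1000); // ~1000 integers per mapper

            // Mapper/reducer classes are omitted; the mapper receives each line as
            // Text and parses it back into an int before generating its data.
            FileOutputFormat.setOutputPath(job, new Path("/output/generated"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }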

Sorting the values before they are sent to the reducer

I'm thinking about building a small testing application in hadoop to get the hang of the system.
The application I have in mind will be in the realm of doing statistics.
I want my reducer function to produce "the 10 worst values for each key" (where I must assume the possibility of a huge number of values for some keys).
What I have planned is that the values that go into my reducer will basically be the combination of "The actual value" and "The quality/relevance of the actual value".
Based on the relevance I "simply" want to take the 10 worst/best values and output them from the reducer.
How do I go about doing that (assuming a huge number of values for a specific key)?
Is there a way that I can sort all values BEFORE they are sent into the reducer (and simply stop reading the input when I have read the first 10) or must this be done differently?
Can someone here point me to a piece of example code I can have a look at?
Update: I found two interesting Jira issues HADOOP-485 and HADOOP-686.
Anyone has a code fragment on how to use this in the Hadoop 0.20 API?
This sounds very much like a secondary sort problem. Take a look at "Hadoop: The Definitive Guide" (O'Reilly) if you like; you can also access it online. It describes a pretty good implementation.
I implemented it myself too. Basically it works this way:
The partitioner ensures that all key-value pairs with the same key go to one single reducer. Nothing special here.
But there is also the grouping comparator, which forms the groups. One group is passed as an iterator to one reduce() call, so a partition can contain multiple groups, although the number of partitions should equal the number of reducers. The grouping also allows you to do some sorting, since it implements a compareTo method.
With this method you can ensure that the 10 best/worst/highest/lowest keys reach the reducer first, so after you have read those 10 you can leave the reduce method without any further iterations.
Hope that was helpful :-)
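To make that concrete, here is a condensed sketch of the wiring; KeyWithScore, the comparators, and the partitioner are hypothetical names for the composite-key machinery described above:

    // Driver wiring (inside the job setup):
    //   job.setMapOutputKeyClass(KeyWithScore.class);                        // composite key: natural key + score
    //   job.setPartitionerClass(NaturalKeyPartitioner.class);                // partition on the natural key only
    //   job.setSortComparatorClass(KeyWithScoreSortComparator.class);        // sort by (natural key, score desc)
    //   job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);  // group on the natural key only

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TopTenReducer extends Reducer<KeyWithScore, Text, Text, Text> {
        @Override
        protected void reduce(KeyWithScore key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int emitted = 0;
            // Values arrive already ordered by score, so the first 10 are the answer.
            for (Text value : values) {
                context.write(new Text(key.getNaturalKey()), value);
                if (++emitted == 10) {
                    return; // stop iterating; the rest of this key's values are never read
                }
            }
        }
    }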
It sounds like you want to use a Combiner, which defines what to do with the values you create on the map side before they are sent to the reducer, but after they are grouped by key.
The combiner is often set to just be the reducer class (so you reduce on the map side, and then again on the reduce side).
Take a look at how the wordCount example uses the combiner to pre-compute partial counts:
http://wiki.apache.org/hadoop/WordCount
Update
Here's what I have in mind for your problem; it's possible I misunderstood what you are trying to do, though.
Every mapper emits <key, {score, data}> pairs.
The combiner gets a partial set of these pairs, <key, [set of {score, data}]>, does a local sort (still on the mapper nodes), and outputs <key, [sorted set of top 10 local {score, data}]> pairs.
The reducer will get <key, [set of top-10-sets]> -- all it has to do is perform the merge step of sort-merge (no sorting needed) for each of the members of the value sets, and stop merging when the first 10 values are pulled.
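A sketch of such a combiner, assuming a hypothetical ScoredRecord writable whose score is already known on the map side (as update 2 below explains, that assumption turned out not to hold in this case):

    import java.io.IOException;
    import java.util.Comparator;
    import java.util.PriorityQueue;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Registered with job.setCombinerClass(TopTenCombiner.class); its input/output
    // types must match the map output types.
    public class TopTenCombiner extends Reducer<Text, ScoredRecord, Text, ScoredRecord> {
        @Override
        protected void reduce(Text key, Iterable<ScoredRecord> values, Context context)
                throws IOException, InterruptedException {
            // Min-heap of size 10 ordered by score: the root is the weakest of the current best 10.
            PriorityQueue<ScoredRecord> best =
                    new PriorityQueue<>(Comparator.comparingDouble(ScoredRecord::getScore));
            for (ScoredRecord value : values) {
                best.add(new ScoredRecord(value)); // copy: Hadoop reuses the value object
                if (best.size() > 10) {
                    best.poll();
                }
            }
            // Emit the local top 10; order doesn't matter, the reducer merges them anyway.
            for (ScoredRecord record : best) {
                context.write(key, record);
            }
        }
    }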
Update 2
So, now that we know the rank is cumulative and, as a result, you can't filter the data early using combiners, the only thing to do is what you suggested: get a secondary sort going. You've found the right tickets; there is an example of how to do this in Hadoop 0.20 in src/examples/org/apache/hadoop/examples/SecondarySort.java (or, if you don't want to download the whole source tree, you can look at the example patch in https://issues.apache.org/jira/browse/HADOOP-4545 )
If I understand the question properly, you'll need to use a TotalOrderPartitioner.
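If a total order across all reducer outputs is really what's needed, the usual wiring looks roughly like this (sampler settings and paths are illustrative, and Text map output keys are assumed):

    // Inside the driver, after input/output and map output types are configured:
    job.setNumReduceTasks(4);
    job.setPartitionerClass(TotalOrderPartitioner.class);
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), new Path("/tmp/partitions"));

    // Sample the input to pick the key-range boundaries: 10% sampling rate,
    // at most 10000 samples, drawn from at most 10 splits.
    InputSampler.Sampler<Text, Text> sampler = new InputSampler.RandomSampler<>(0.1, 10000, 10);
    InputSampler.writePartitionFile(job, sampler);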
