Does this model fit into Hadoop correctly?

I need to know if my use case is correctly structured for Hadoop. Assume I want to run the word count jar on a Hadoop cluster, but I want my output sorted so that each output file only contains the words that start with the same letter.
I believe I can use the Partitioner class to route words to different reducers based on the first letter of the word, and I think that having 26 reducers, one for each letter, should produce the output the way I want. But I need to know if this is possible and/or the correct way to approach this type of problem in Hadoop.

Yes, this would be the simplest way of doing it - one reducer per starting letter. As you say, you'll need a simple custom partitioner to route the map phase output correctly.
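For illustration, here is a minimal sketch of such a partitioner (the class name and the assumption of lowercase a-z words are mine, not from the question):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: route each word to a reducer based on its first letter.
// Assumes lowercase a-z words; anything else falls into partition 0.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        if (first < 'a' || first > 'z') {
            return 0;
        }
        // With 26 reducers this maps 'a' -> 0, 'b' -> 1, ..., 'z' -> 25.
        return (first - 'a') % numPartitions;
    }
}
```

In the driver you would then set job.setPartitionerClass(FirstLetterPartitioner.class) and job.setNumReduceTasks(26).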

Related

Filtering output in hadoop

I'm new to Hadoop and playing around with the WordCount example.
I ran into an issue that is confusing me. If I take a word count from a text file and I want to, for example, filter it in such a way that only words longer than 5 letters are in the output, do I have to run 2 jobs to do this?
The first job to do the word count and a second job to filter out the words shorter than 5 letters?
Or can I just write logic into the reducer that does not write the word into the result file if there are fewer than 5 occurrences? Would this result in invalid output if there are multiple instances of the reducer running?
The simple answer is that you don't need two jobs.
You can achieve this with a single job. The logic you described in the last part of your question is absolutely correct.
In the MapReduce framework, all the data (values) related to a single key is always passed to the same reducer. So even if multiple reducers are running for your job, the output will not be affected.
PS:
"only words longer than 5 letters are in the output"
This is from the second paragraph of your question. I am assuming you mean 5 occurrences of a word, not the length of the word.
But if you want only words longer than 5 letters, you can filter them in the mapper itself. Then there will be less data for the sort/shuffle phase (data sorting and transfer over the network) and less data for the reducer to process.
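Here is a minimal sketch of the reducer-side count filter the question describes (the class name and the threshold of 5 occurrences are illustrative; it assumes the standard WordCount mapper emitting (word, 1) pairs):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: sum the counts for each word and only emit words that
// occur at least 5 times.
public class FilteringWordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private static final int MIN_OCCURRENCES = 5;
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        // All counts for a given word arrive at the same reducer, so this
        // check is safe even with many reducer instances running.
        if (sum >= MIN_OCCURRENCES) {
            result.set(sum);
            context.write(word, result);
        }
    }
}
```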
One MapReduce job should be enough.
Best practice says that you should filter and project data in the mapper whenever possible.
In your case, the filter condition depends only on the input data (the characters of the input word), so you can filter the input on the mapper side and only send words with more than 5 letters to the reducer, improving the performance of your job. It doesn't make sense to send data to the reducer just to drop it there, although that would work too.
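A minimal sketch of that mapper-side length filter (the class name and whitespace tokenization are my assumptions):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: only words longer than 5 letters are emitted, so shorter words
// never reach the shuffle or the reducer.
public class LongWordMapper extends Mapper<Object, Text, Text, IntWritable> {

    private static final int MIN_LENGTH = 6; // "longer than 5 letters"
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (token.length() >= MIN_LENGTH) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```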

Is (key,value) pair in Hadoop always ('text',1)?

I am new to Hadoop.
Can you please tell me about the (key,value) pair? Is the value always one? Is the output of the reduce step always a (key,value) pair? If yes, how is that (key,value) data used further?
Please help me.
I guess you are asking about the 'one' value in the (key,value) pair because of the wordcount example in the Hadoop tutorials. So, the answer is no, it is not always 'one'.
Hadoop's implementation of MapReduce works by passing (key,value) pairs through the entire workflow, from the input to the output:
Map step: Generally speaking (there are particular cases, depending on the input format), the mappers process the data within the splits they are assigned to line by line; each line is passed to the map method as a (key,value) pair, where the key is the offset of the line within the split and the value is the line itself. The mapper then emits another (key,value) pair, whose meaning depends on the map function you are implementing; sometimes it will be a variable key and a fixed value (e.g. in wordcount, the key is the word and the value is always 'one'); other times the value will be the length of the line, or the sum of all the words starting with a prefix... whatever you may imagine; the key may be a word, a fixed custom key...
Reduce step: Typically a reducer receives lists of (key,value) pairs produced by the mappers that share the same key (this also depends on the combiner class you are using, but that is how it works generally speaking). It then produces another (key,value) pair at its output; again, this depends on the logic of your application. Typically, the reducer is used to aggregate all the values for the same key.
This is a very rough, quick and undetailed explanation; I encourage you to read the official documentation about it, or specialized literature such as this.
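For a concrete example of a non-wordcount pair, here is a minimal sketch (class name mine) of a mapper whose output value is not 'one': it passes each line's byte offset through as the key and emits the line's length as the value:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: the output (key,value) pair is (line offset, line length),
// not (word, 1).
public class LineLengthMapper
        extends Mapper<LongWritable, Text, LongWritable, IntWritable> {

    private final IntWritable length = new IntWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Input key: byte offset of the line within the split.
        // Input value: the line itself.
        length.set(line.getLength());
        context.write(offset, length);
    }
}
```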
Hope you have started learning MapReduce with the WordCount example.
A key/value pair is the record entity that MapReduce accepts for execution. The InputFormat classes that read records from the source and the OutputFormat classes that commit results operate only on records in key/value format.
The key/value format is the best-suited representation of records to pass through the different stages of the map-partition-sort-combine-shuffle-merge-sort-reduce lifecycle of MapReduce. Please refer to,
http://www.thecloudavenue.com/2012/09/why-does-hadoop-uses-kv-keyvalue-pairs.html
The key/value data types can be anything. The Text/IntWritable key/value pair you used is the best fit for wordcount, but it can actually be anything, depending on your requirements.
Kindly spend some time reading the Hadoop Definitive Guide or the Yahoo tutorials to get more understanding. Happy learning...

Can all mappers handle the same file?

Usually Hadoop splits the file and sends each split to a different machine, but I want each machine to handle the same file (not a split of the file), then send the results to the reducer, which sums all the results in the reduce step. How can I do this? Can anyone help me?
Ok.. this may not be the exact solution, but a dirty way to achieve this is:
Set FileInputFormat.setMinInputSplitSize(job, size), where the value of the size parameter must be greater than the input file size in bytes (which can be calculated using the length() method of Java's File class). Since a split is never smaller than the minimum split size, this ensures that there will be only one mapper per file and your file doesn't get split.
Now use MultipleInputs.addInputPath(job, input_path, InputFormat.class) for each of your machines, which will run a single mapper on each of them. As per your requirement, the reduce function doesn't require any changes. The dirty part here is that MultipleInputs.addInputPath requires a unique path, so you may have to copy the same file as many times as the number of mappers you want, give each copy a unique name, and pass it to MultipleInputs.addInputPath. If you provide the same path twice, it will be ignored.
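A rough sketch of that input-side setup (the helper class name and the paths are placeholders; copy1.txt through copy3.txt are identical copies of the input file with unique names):

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Sketch: configure the job so that several copies of the same file each
// go, unsplit, to their own mapper.
public class SameFileInputConfig {

    public static void configure(Job job) throws IOException {
        // Keep each copy in a single split so it goes to exactly one mapper:
        // the split size is never smaller than the minimum split size.
        FileSystem fs = FileSystem.get(job.getConfiguration());
        long fileSize = fs.getFileStatus(new Path("/input/copy1.txt")).getLen();
        FileInputFormat.setMinInputSplitSize(job, fileSize + 1);

        // One input path (one copy of the file) per mapper you want to run.
        MultipleInputs.addInputPath(job, new Path("/input/copy1.txt"), TextInputFormat.class);
        MultipleInputs.addInputPath(job, new Path("/input/copy2.txt"), TextInputFormat.class);
        MultipleInputs.addInputPath(job, new Path("/input/copy3.txt"), TextInputFormat.class);
    }
}
```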
Your problem is that you have more than one problem. What (I think) you want to do:
Produce some set of random samples
Sum sample
I'd just break these down into two separate, simple map/reduce jobs: one MapReduce job to produce the random samples, and a second to sum each sample separately.
Now there's probably a clever way to do this all in one pass, but unless you've got some unusual constraints, I'd be surprised if it was worth the extra complexity.

Generating multiple equally sized output files in Hadoop

What are some methods for finding X data ranges in Hadoop so that one can use these ranges as partitions in the reducer step?
Looks like you need something like TotalOrderPartitioner, which allows a total order by reading split points from an externally generated source. You might find this link useful:
http://chasebradford.wordpress.com/2010/12/12/reusable-total-order-sorting-in-hadoop/.
I don't know if this is exactly what you need; apologies if I have got it wrong.
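As a sketch of how this is typically wired up (the class name, partition file path, sampler settings, and reducer count below are my assumptions, not from the question), you sample the input to generate split points and point TotalOrderPartitioner at the resulting partition file:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

// Sketch: produce X roughly equal key ranges by sampling the input and
// letting TotalOrderPartitioner assign each range to one reducer.
public class TotalOrderSetup {

    public static void configure(Job job) throws Exception {
        job.setNumReduceTasks(10); // => 10 roughly equal output files

        // Where the sampled split points will be stored.
        Path partitionFile = new Path("/tmp/_partitions");
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
        job.setPartitionerClass(TotalOrderPartitioner.class);

        // Random sampler: 1% of records, up to 10000 samples, from up to 10 splits.
        InputSampler.Sampler<Text, Text> sampler =
                new InputSampler.RandomSampler<>(0.01, 10000, 10);
        InputSampler.writePartitionFile(job, sampler);
    }
}
```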

Parallel reducing with Hadoop mapreduce

I'm using Hadoop's MapReduce. I have a file as input to the map function; the map function does something (not relevant to the question). I'd like my reducer to take the map's output and write it to two different files.
The way I see it (I want an efficient solution), there are two ways in my mind:
1 reducer which will know how to identify the different cases and write to 2 different contexts.
2 parallel reducers, each of which will know how to identify its relevant input, ignore the other's, and this way each one will write to a file (each reducer will write to a different file).
I'd prefer the first solution, since it means I'll go over the map's output only once instead of twice in parallel - but if the first isn't supported in some way, I'll be glad to hear a solution for the second suggestion.
*Note: These two final files are supposed to be separate; there is no need to join them at this point.
The Hadoop API has a feature for creating multiple outputs, called MultipleOutputs, which makes your preferred solution possible.
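A minimal sketch of a single reducer using MultipleOutputs (the named outputs "typeA"/"typeB", the routing condition, and the summing logic are placeholders):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Sketch: one reducer writing to two named outputs. In the driver you would
// also register each named output, e.g.:
//   MultipleOutputs.addNamedOutput(job, "typeA", TextOutputFormat.class, Text.class, IntWritable.class);
//   MultipleOutputs.addNamedOutput(job, "typeB", TextOutputFormat.class, Text.class, IntWritable.class);
public class TwoFileReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Placeholder condition deciding which file a record belongs to.
        if (key.toString().startsWith("A")) {
            mos.write("typeA", key, new IntWritable(sum));
        } else {
            mos.write("typeB", key, new IntWritable(sum));
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
```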
If you know at the map stage which file a record must go to, you can tag your map output with a special key specifying which file it should go to. For example, if a record R1 must go to file 1, you would output <1, R1> (1 is the key, a symbolic representation of file 1, and R1 is the value). If a record R2 must go to file 2, your map output would be <2, R2>.
Then, if you configure the MapReduce job to use only 2 reducers, it is guaranteed that all records tagged with <1, _> will be sent to one reducer and all records tagged with <2, _> to the other.
This would be better than your preferred solution, since you are still going through your map output only once, and at the same time the work happens in parallel.
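A minimal sketch of that tagging mapper (the routing condition is a placeholder; pair it with job.setNumReduceTasks(2) and a reducer that simply writes its values out):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: each record is keyed with 1 or 2 depending on which output file
// it should end up in; with 2 reducers, each key lands on its own reducer.
public class TaggingMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private static final IntWritable FILE_1 = new IntWritable(1);
    private static final IntWritable FILE_2 = new IntWritable(2);

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // Placeholder condition deciding the destination file.
        if (record.getLength() % 2 == 0) {
            context.write(FILE_1, record);
        } else {
            context.write(FILE_2, record);
        }
    }
}
```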
