What are some methods for finding X data ranges in Hadoop so that one can use these ranges as partitions in the reducer step?
Looks like you need something like TotalOrderPartitioner, which allows a total order by reading split points from an externally generated source. You might find this link useful:
http://chasebradford.wordpress.com/2010/12/12/reusable-total-order-sorting-in-hadoop/.
I don't know if this is exactly what you need; apologies if I have got it wrong.
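A minimal sketch of wiring that up with the newer org.apache.hadoop.mapreduce API, assuming a SequenceFile input with Text keys and values, the default identity map/reduce, and placeholder reducer count and sampler parameters:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSortExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "total order sort");
    job.setJarByClass(TotalSortExample.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(4);                       // one output file per key range

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Sample the input to pick split points, write them to a partition file,
    // and let TotalOrderPartitioner route each key to the matching reducer.
    job.setPartitionerClass(TotalOrderPartitioner.class);
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
        new Path(args[1] + "_partitions"));
    InputSampler.writePartitionFile(job,
        new InputSampler.RandomSampler<Text, Text>(0.1, 1000, 10));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

InputSampler writes the sampled split points to the partition file, and TotalOrderPartitioner then routes each key to the reducer that owns its range, so the concatenated reducer outputs are globally sorted.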
I need to know if my use case is correctly structured for Hadoop. Assume that I want to run the word count jar on a Hadoop cluster, but I want my output sorted such that each output file only has the words that share the same starting letter.
I believe that I can use the partitioner class to route words to different reducers based on the first letter of the word, and I think that having 26 reducers, one for each letter, should get the output the way I want. But I need to know if this is possible and/or the correct way to approach this type of problem with Hadoop.
Yes, this would be the simplest way of doing it - one reducer per starting letter. As you say, you'll need a simple custom partitioner to route the map phase output correctly.
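A minimal sketch of such a partitioner, assuming standard word-count map output of (Text word, IntWritable count) and lowercase a-z words (anything outside that range would need extra handling):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text word, IntWritable count, int numPartitions) {
    // 'a' -> partition 0, 'b' -> partition 1, ..., 'z' -> partition 25
    char first = Character.toLowerCase(word.toString().charAt(0));
    return (first - 'a') % numPartitions;
  }
}
```

In the driver, job.setPartitionerClass(FirstLetterPartitioner.class) together with job.setNumReduceTasks(26) gives one output file per starting letter.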
Usually Hadoop splits the file and sends each split to a different machine, but I want each machine to handle the same file (not a split of it) and then send its result to reduce, where the reduce step sums all the results. How can I do this? Can anyone help me?
OK, this may not be the exact solution, but a dirty way to achieve it is:
Call FileInputFormat.setMaxInputSplitSize(job, size), where the value of the size parameter must be greater than the input file size in bytes, which can be calculated using the length() method of the Java File class. This ensures there will be only one mapper per file and your file doesn't get split.
Now use MultipleInputs.addInputPath(job, input_path, InputFormat.class) for each of your machines, which will run a single mapper on each machine. As per your requirement, the reduce function doesn't need any changes. The dirty part here is that MultipleInputs.addInputPath requires a unique path, so you may have to copy the same file as many times as the number of mappers you want, give the copies unique names, and pass them to MultipleInputs.addInputPath. If you provide the same path twice, it will be ignored.
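A rough sketch of that setup (the paths and number of copies are placeholders, and TextInputFormat is assumed):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WholeFilePerMapper {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "whole file per mapper");
    job.setJarByClass(WholeFilePerMapper.class);

    // Size of a local copy of the input, via java.io.File#length().
    long fileSizeBytes = new java.io.File("/local/copy/of/input.txt").length();
    FileInputFormat.setMaxInputSplitSize(job, fileSizeBytes + 1);
    // For files larger than one HDFS block, raising the minimum split size
    // as well is what actually prevents splitting.
    FileInputFormat.setMinInputSplitSize(job, fileSizeBytes + 1);

    // One uniquely named HDFS copy of the same file per mapper you want.
    MultipleInputs.addInputPath(job, new Path("/input/copy_1.txt"), TextInputFormat.class);
    MultipleInputs.addInputPath(job, new Path("/input/copy_2.txt"), TextInputFormat.class);
    MultipleInputs.addInputPath(job, new Path("/input/copy_3.txt"), TextInputFormat.class);

    FileOutputFormat.setOutputPath(job, new Path("/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```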
Your problem is that you have more than one problem. What (I think) you want to do:
Produce some set of random samples
Sum each sample
I'd just break these down into two separate, simple map/reduce jobs: one to produce the random samples, and a second to sum each sample separately.
Now there's probably a clever way to do this all in one pass, but unless you've got some unusual constraints, I'd be surprised if it was worth the extra complexity.
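A minimal sketch of chaining the two passes in one driver (job names, paths, and the mapper/reducer classes are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoPassDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Pass 1: produce the random samples.
    Job sampleJob = Job.getInstance(conf, "produce samples");
    sampleJob.setJarByClass(TwoPassDriver.class);
    // sampleJob.setMapperClass(...); sampleJob.setReducerClass(...);
    FileInputFormat.addInputPath(sampleJob, new Path("/input"));
    FileOutputFormat.setOutputPath(sampleJob, new Path("/samples"));
    if (!sampleJob.waitForCompletion(true)) System.exit(1);

    // Pass 2: sum each sample, reading the first job's output.
    Job sumJob = Job.getInstance(conf, "sum samples");
    sumJob.setJarByClass(TwoPassDriver.class);
    // sumJob.setMapperClass(...); sumJob.setReducerClass(...);
    FileInputFormat.addInputPath(sumJob, new Path("/samples"));
    FileOutputFormat.setOutputPath(sumJob, new Path("/sums"));
    System.exit(sumJob.waitForCompletion(true) ? 0 : 1);
  }
}
```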
I am wondering how Hadoop handles log file parsing if we need to calculate not just one simple metric (e.g. the most popular word) but many metrics (e.g. ALL of the following: average height broken down by gender, top 10 sites broken down by phone type, top word broken down by adults/kids).
Without using Hadoop, a typical distributed solution I can think of is: split the logs across different machines using hashing, etc.;
each machine parses its own log files and calculates the different metrics for them. The results can be stored as SQL, XML, or some other format in files. Then a master machine parses these intermediate files, aggregates the metrics, and stores the final results in another file.
Using Hadoop, how do I obtain the final results? All the examples I have seen are very simple cases, like counting words.
I just cannot figure out how Hadoop's mappers and reducers will cooperate to aggregate the intermediate files intelligently into a final result. I thought maybe my mapper should save intermediate files somewhere and my reducer should parse those intermediate files to get the final results. I must be wrong, since I do not see any benefit if my mapper and reducer are implemented that way.
It is said the format of map and reduce should be:
map: (K1, V1) → list(K2, V2)
combine: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
In summary, how should I design my mapper and reducer code (suppose I am using Python; other languages are also fine)? Can anybody answer my question or provide a link for me to read?
Start thinking about how to solve challenges in an MR way. Here (1, 2) are some resources; they cover some of the MR algorithms, which can be implemented in any language.
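As one hedged illustration (in Java for concreteness, though the question allows any language; the log layout, field positions, and metric names here are all made up), a single mapper can emit one record per metric with a composite "metricName:group" key, so one reduce phase aggregates several metrics at once:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Assumed tab-separated log line: gender, height, phoneType, site
public class MultiMetricMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] f = line.toString().split("\t");
    String gender = f[0], height = f[1], phone = f[2], site = f[3];
    // One output record per metric, keyed by "metricName:group".
    ctx.write(new Text("avgHeight:" + gender), new Text(height));
    ctx.write(new Text("siteHits:" + phone + ":" + site), new Text("1"));
  }
}

class MultiMetricReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    // Dispatch on the metric prefix and aggregate accordingly.
    if (key.toString().startsWith("avgHeight:")) {
      double sum = 0; long n = 0;
      for (Text v : values) { sum += Double.parseDouble(v.toString()); n++; }
      ctx.write(key, new Text(String.valueOf(sum / n)));
    } else {                       // siteHits: plain count per phone/site group
      long count = 0;
      for (Text v : values) count += Long.parseLong(v.toString());
      ctx.write(key, new Text(String.valueOf(count)));
    }
  }
}
```

Rankings such as "top 10 sites per phone type" usually take a small second job (or a local sort) over these per-group counts.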
I'm using Hadoop's MapReduce. I have a file as input to the map function; the map function does something (not relevant to the question). I'd like my reducer to take the map's output and write to two different files.
The way I see it (I want an efficient solution), there are two options in my mind:
1 reducer, which will know how to identify the different cases and write to 2 different contexts.
2 parallel reducers, each of which will know how to identify its relevant input, ignore the other's, and in this way each one will write to a file (each reducer will write to a different file).
I'd prefer the first solution, because it means I'll go over the map's output only once instead of twice in parallel - but if the first isn't supported in some way, I'll be glad to hear a solution for the second suggestion.
*Note: these two final files are supposed to be separate; there is no need to join them at this point.
The Hadoop API has a feature for creating multiple outputs, called MultipleOutputs, which makes your preferred solution possible.
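A minimal sketch of a single reducer writing to two named outputs (the output names "typeA"/"typeB" and the routing test are placeholders; Text keys and values are assumed):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplitReducer extends Reducer<Text, Text, Text, Text> {
  private MultipleOutputs<Text, Text> out;

  @Override
  protected void setup(Context ctx) {
    out = new MultipleOutputs<Text, Text>(ctx);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    for (Text value : values) {
      // Route each record to one of the two named outputs.
      if (value.toString().startsWith("A")) {
        out.write("typeA", key, value);
      } else {
        out.write("typeB", key, value);
      }
    }
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    out.close();   // flush both output files
  }
}
```

The driver has to register each named output once, e.g. MultipleOutputs.addNamedOutput(job, "typeA", TextOutputFormat.class, Text.class, Text.class).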
If you know at the map stage which file the record must go to, you can tag your map output with a special key specifying which file it should go to. For example, if a record R1 must go to file 1, you would output <1, R1> (1 is the key, a symbolic representation of file 1, and R1 is the value). If a record R2 must go to file 2, your map output would be <2, R2>.
Then, if you configure the MapReduce job to use only 2 reducers, it is guaranteed that all records tagged with <1, _> will be sent to one reducer and all records tagged with <2, _> to the other.
This would be better than your preferred solution, since you are still going through your map output only once, and at the same time it happens in parallel.
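A rough sketch of that tagging mapper (the classification test is a placeholder); the driver only needs job.setNumReduceTasks(2), and with the default hash partitioner the keys 1 and 2 land on different reducers:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TaggingMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
  private static final IntWritable FILE1 = new IntWritable(1);
  private static final IntWritable FILE2 = new IntWritable(2);

  @Override
  protected void map(LongWritable offset, Text record, Context ctx)
      throws IOException, InterruptedException {
    // Placeholder test for which output file the record belongs to.
    if (record.toString().startsWith("A")) {
      ctx.write(FILE1, record);   // every <1, R> reaches the same reducer
    } else {
      ctx.write(FILE2, record);
    }
  }
}
```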
I am working on the parallelization of an algorithm, which roughly does the following:
Read several text documents with a total of 10k words.
Create an object for every word in the text corpus.
Create a pair between all word-objects (yes, O(n²)). And return the most frequent pairs.
I would like to parallelize the 3rd step by creating the pairs between the first 1000 word-objects and the rest on the first machine, between the second 1000 word-objects and the rest on the next machine, etc.
My question is how to pass the objects created in the 2nd step to the Mapper. As far as I am aware, I would require input files for this and hence would need to serialize the objects (though I haven't worked with this before). Is there a direct way to pass the objects to the Mapper?
Thanks in advance for the help
Evgeni
UPDATE
Thank you for reading my question. Serialization seems to be the best way to solve this (see java.io.Serializable). Furthermore, I have found this tutorial useful for reading data from serialized objects into Hadoop: http://www.cs.brown.edu/~pavlo/hadoop/
How about parallelizing all the steps? Use the text documents from step 1 as input to your Mapper and create the object for every word in the Mapper. In the Mapper, your key-value pair will be the word-object pair (or object-word, depending on what you are doing). The Reducer can then count the unique pairs.
Hadoop will take care of bringing all the same keys together into the same Reducer.
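A hedged sketch of that approach, assuming each input line is one document and plain whitespace tokenization: the mapper emits every co-occurring word pair with a count of 1, and the reducer sums the counts per pair (picking the most frequent pairs can then be a small follow-up job or a local sort of the output):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordPairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable offset, Text document, Context ctx)
      throws IOException, InterruptedException {
    String[] words = document.toString().toLowerCase().split("\\s+");
    for (int i = 0; i < words.length; i++) {
      for (int j = i + 1; j < words.length; j++) {
        // Order the pair so ("a","b") and ("b","a") count as the same key.
        String pair = words[i].compareTo(words[j]) <= 0
            ? words[i] + "," + words[j]
            : words[j] + "," + words[i];
        ctx.write(new Text(pair), ONE);
      }
    }
  }
}

class WordPairReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text pair, Iterable<IntWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) sum += c.get();
    ctx.write(pair, new IntWritable(sum));   // frequency of this word pair
  }
}
```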
Use Twitter's protobufs (elephant-bird). Convert each word into a protobuf object and process it however you want. Also, protobufs are much faster and lighter compared to default Java serialization. Refer to Kevin Weil's presentation on this: http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter