Usually Hadoop splits the input file and sends each split to a different machine, but I want every machine to process the same whole file (not a split of it) and then send its result to the reducer, where the reduce step sums all the results. How can I do this? Can anyone help me?
OK, this may not be the exact solution, but a dirty way to achieve it is:
Set FileInputFormat.setMaxInputSplitSize(job, size), where the value of the size parameter must be greater than the input file's size in bytes (which can be calculated using the length() method of Java's File class). This ensures there will be only one mapper per file and that your file doesn't get split.
Now call MultipleInputs.addInputPath(job, input_path, InputFormat.class) once for each of your machines, which will run a single mapper on each machine. As per your requirement, the reduce function doesn't require any changes. The dirty part is that MultipleInputs.addInputPath requires a unique path, so you may have to copy the same file as many times as the number of mappers you want, give the copies unique names, and pass those paths to MultipleInputs.addInputPath. If you provide the same path twice, it will be ignored.
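For concreteness, here is a minimal sketch of the driver this describes (newer mapreduce API). The paths, the number of copies, and the identity Mapper/Reducer are placeholders on my part; you would substitute your own mapper and a summing reducer:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SameFileEverywhereDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "same-file-on-every-mapper");
            job.setJarByClass(SameFileEverywhereDriver.class);

            // Max split size larger than the file itself => exactly one split (one mapper) per file.
            long fileSize = new java.io.File("/local/copy/of/input.txt").length();
            FileInputFormat.setMaxInputSplitSize(job, fileSize + 1);

            // One uniquely named HDFS copy of the same file per mapper you want.
            int copies = 4; // i.e. the number of mappers/machines
            for (int i = 0; i < copies; i++) {
                MultipleInputs.addInputPath(job, new Path("/data/input_copy_" + i), TextInputFormat.class);
            }

            // Identity classes keep this sketch compilable; replace them (and the
            // output key/value types) with your own mapper and summing reducer.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileOutputFormat.setOutputPath(job, new Path("/data/output"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }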
Your problem is that you have more than one problem. What (I think) you want to do:
Produce some set of random samples
Sum each sample
I'd just break these down into two separate, simple MapReduce jobs: one to produce the random samples, and a second to sum each sample separately.
Now there's probably a clever way to do this all in one pass, but unless you've got some unusual constraints, I'd be surprised if it was worth the extra complexity.
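If it helps, a bare-bones sketch of chaining the two passes looks something like this (paths, job names, and the mapper/reducer classes you would plug in are all assumptions on my part):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwoPassDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path("/data/input");
            Path samples = new Path("/data/samples"); // intermediate output of pass 1
            Path sums = new Path("/data/sums");

            // Pass 1: produce the random samples (set your sampling mapper/reducer here).
            Job sampleJob = Job.getInstance(conf, "produce-samples");
            FileInputFormat.addInputPath(sampleJob, input);
            FileOutputFormat.setOutputPath(sampleJob, samples);
            if (!sampleJob.waitForCompletion(true)) System.exit(1);

            // Pass 2: sum each sample separately (set your summing mapper/reducer here).
            Job sumJob = Job.getInstance(conf, "sum-samples");
            FileInputFormat.addInputPath(sumJob, samples);
            FileOutputFormat.setOutputPath(sumJob, sums);
            System.exit(sumJob.waitForCompletion(true) ? 0 : 1);
        }
    }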
Related
I'm reading tens of thousands of files into an RDD via something like sc.textFile("/data/*/*/*"). One problem is that most of these files are tiny, whereas others are huge. That leads to imbalanced tasks, which causes all sorts of well-known problems.
Can I break up the largest partitions by instead reading in my data via sc.textFile("/data/*/*/*", minPartitions=n_files*5), where n_files is the number of input files?
As covered elsewhere on Stack Overflow, minPartitions gets passed way down the Hadoop rabbit hole and is used in org.apache.hadoop.mapred.TextInputFormat.getSplits. My question is whether this is implemented such that the largest files are split first. In other words, is the splitting strategy one that tries to produce evenly sized partitions?
I would prefer an answer that points to wherever the splitting strategy is actually implemented in a recent version of spark/hadoop.
Nobody's posted an answer so I dug into this myself and will post an answer to my own question:
It appears that, if your input file(s) are splittable, then textFile will indeed try to balance partition size if you use the minPartitions option.
The partitioning strategy is implemented here, i.e., in the getSplits method of org.apache.hadoop.mapred.TextInputFormat. This partitioning strategy is complex: it operates by first setting goalSize, which is simply the total size of the input divided by numSplits (minPartitions is passed down to set the value of numSplits). It then splits up the files in such a way that each partition's size (in terms of its input's byte size) is as close as possible to goalSize.
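Paraphrasing that logic from memory (this is a simplification, not the literal source), the per-file split size is derived from goalSize roughly like this:

    import java.util.ArrayList;
    import java.util.List;

    // Simplified paraphrase of the split-size logic in
    // org.apache.hadoop.mapred.FileInputFormat.getSplits (which TextInputFormat inherits).
    public class SplitSizeSketch {

        /** Byte lengths of the splits a single file of fileLength bytes would get. */
        static List<Long> splitsForFile(long fileLength, long totalSize, int numSplits,
                                        long minSize, long blockSize) {
            long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits); // numSplits <- minPartitions
            long splitSize = Math.max(minSize, Math.min(goalSize, blockSize));

            List<Long> splits = new ArrayList<>();
            long bytesRemaining = fileLength;
            while (((double) bytesRemaining) / splitSize > 1.1) { // 1.1 is SPLIT_SLOP in the source
                splits.add(splitSize);
                bytesRemaining -= splitSize;
            }
            if (bytesRemaining != 0) {
                splits.add(bytesRemaining); // the tail becomes one final, smaller split
            }
            return splits;
        }

        public static void main(String[] args) {
            // With 10 GB of total input and numSplits = 160, goalSize is ~64 MB, so a
            // 1 GB file becomes ~16 roughly even splits while a 1 MB file stays whole.
            System.out.println(splitsForFile(1L << 30, 10L << 30, 160, 1, 128L << 20));
            System.out.println(splitsForFile(1L << 20, 10L << 30, 160, 1, 128L << 20));
        }
    }

The upshot is that large files are cut down toward goalSize (capped by the block size), which is why the resulting partitions come out roughly even when the files are splittable.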
If your input file(s) are not splittable, then this splitting will not take place: see the source code here.
What are some methods for finding X data ranges in Hadoop so that one can use these ranges as partitions in the reducer step?
Looks like you need something like TotalOrderPartitioner, which allows a total order by reading split points from an externally generated source. You might find this link useful:
http://chasebradford.wordpress.com/2010/12/12/reusable-total-order-sorting-in-hadoop/.
I don't know if this is exactly what you need? Apologies if I have got it wrong.
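In case it's useful, here is a rough sketch of the usual wiring. I'm assuming the input is a SequenceFile keyed by Text so the sampler sees the same key type the partitioner will use; the paths and the number of reducers are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    public class TotalOrderDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "total-order-example");
            job.setJarByClass(TotalOrderDriver.class);

            job.setInputFormatClass(SequenceFileInputFormat.class); // sampler reads the input keys
            FileInputFormat.addInputPath(job, new Path("/data/keyed-input"));
            FileOutputFormat.setOutputPath(job, new Path("/data/sorted-output"));

            job.setMapOutputKeyClass(Text.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(4); // gives 4 contiguous key ranges, one per reducer

            // Sample the input to pick split points, then partition by those key ranges.
            job.setPartitionerClass(TotalOrderPartitioner.class);
            TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                    new Path("/tmp/partitions.lst"));
            InputSampler.Sampler<Text, Text> sampler =
                    new InputSampler.RandomSampler<>(0.01, 1000, 10);
            InputSampler.writePartitionFile(job, sampler);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

InputSampler.writePartitionFile samples the input keys and writes the chosen split points to the partition file; TotalOrderPartitioner then sends each key range to its own reducer, so reducer i's output covers the i-th contiguous range of keys.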
I need to read and process a file as a single unit, not line by line, and it's not clear how you'd do this in a Hadoop MapReduce application. What I need to do is to read the first line of the file as a header, which I can use as my key, and the following lines as data to build a 2-D data array, which I can use as my value. I'll then do some analysis on the entire 2-D array of data (i.e. the value).
Below is how I'm planning to tackle this problem, and I would very much appreciate comments if this doesn't look reasonable or if there's a better way to go about this (this is my first serious MapReduce application so I'm probably making rookie mistakes):
My text file inputs contain one line with station information (name, lat/lon, ID, etc.) and then one or more lines containing a year value (i.e. 1956) plus 12 monthly values (i.e. 0.3 2.8 4.7 ...) separated by spaces. I have to do my processing over the entire array of monthly values [number_of_years][12] so each individual line is meaningless in isolation.
Create a custom key class, making it implement WritableComparable. This will hold the header information from the initial line of the input text files.
Create a custom input format class in which a) the isSplitable() method returns false, and b) the getRecordReader() method returns a custom record reader that knows how to read a file split and turn it into my custom key and value classes.
Create a mapper class which does the analysis on the input value (the 2-D array of monthly values) and outputs the original key (the station header info) and an output value (a 2-D array of analysis values). There'll only be a wrapper reducer class since there's no real reduction to be done.
It's not clear to me that this is a good/correct application of the MapReduce approach, a) since I'm doing analysis on a single value (the data array) mapped to a single key, and b) since there is never more than a single value (data array) per key, so no real reduction will ever need to be performed. Another issue is that the files I'm processing are relatively small, much less than the default 64 MB split size. That being the case, perhaps the first task should instead be to consolidate the input files into a SequenceFile, as shown in the SmallFilesToSequenceFileConverter example in the O'Reilly Hadoop: The Definitive Guide book (p. 194 in the 2nd Edition)?
Thanks in advance for your comments and/or suggestions!
It looks like your plan regarding coding is spot on, I would do the same thing.
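For what it's worth, a minimal sketch of a non-splittable input format along those lines might look like the following. I'm keying by NullWritable and handing the whole file to the mapper as a BytesWritable; in your case you'd substitute your custom key/value classes and put the header/array parsing in the record reader. This uses the newer mapreduce API, so the method is createRecordReader rather than getRecordReader, but the idea is the same:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // An input format that never splits files, so each mapper sees one whole file.
    public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // hand each file to a single mapper intact
        }

        @Override
        public RecordReader<NullWritable, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new WholeFileRecordReader();
        }

        // Emits exactly one record per file: the file's entire contents as the value.
        public static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
            private FileSplit fileSplit;
            private Configuration conf;
            private final BytesWritable value = new BytesWritable();
            private boolean processed = false;

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context) {
                this.fileSplit = (FileSplit) split;
                this.conf = context.getConfiguration();
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (processed) {
                    return false;
                }
                byte[] contents = new byte[(int) fileSplit.getLength()];
                Path file = fileSplit.getPath();
                FileSystem fs = file.getFileSystem(conf);
                try (FSDataInputStream in = fs.open(file)) {
                    IOUtils.readFully(in, contents, 0, contents.length);
                }
                value.set(contents, 0, contents.length);
                processed = true;
                return true;
            }

            @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
            @Override public BytesWritable getCurrentValue() { return value; }
            @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
            @Override public void close() { }
        }
    }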
You will benefit from Hadoop if you have a lot of input files provided as input to the job, since each file will get its own InputSplit, and in Hadoop the number of mappers executed equals the number of input splits.
Too many small files will cause too much memory use on the HDFS NameNode. To consolidate the files you can use SequenceFiles or Hadoop Archives (the Hadoop equivalent of tar); see the docs. With HAR files (Hadoop Archives), each small file will still have its own mapper.
I'm using Hadoop's MapReduce. I have a file as an input to the map function; the map function does something (not relevant to the question). I'd like my reducer to take the map's output and write it to two different files.
The way I see it (and I want an efficient solution), there are two approaches in my mind:
One reducer which knows how to identify the different cases and writes to two different contexts.
Two parallel reducers, each of which knows how to identify its relevant input, ignores the other's, and thereby each one writes to its own file.
I'd prefer the first solution, because it means I'll go over the map's output only once instead of twice in parallel, but if the first isn't supported in some way, I'll be glad to hear a solution for the second suggestion.
*Note: these two final files are supposed to stay separate; there is no need to join them at this point.
The Hadoop API has a feature for creating multiple outputs called MultipleOutputs which makes your preferred solution possible.
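A rough sketch of how a single reducer might use it (the key/value types, the output names "typeA"/"typeB", and the routing test are all made up for illustration):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // A single reducer that routes each record to one of two named outputs.
    public class TwoFileReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        private MultipleOutputs<Text, LongWritable> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            // Decide which file this key belongs to and write to that named output.
            if (key.toString().startsWith("A")) {
                mos.write("typeA", key, new LongWritable(sum));
            } else {
                mos.write("typeB", key, new LongWritable(sum));
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close(); // flush the extra outputs
        }
    }

In the driver you would also register each named output, e.g. MultipleOutputs.addNamedOutput(job, "typeA", TextOutputFormat.class, Text.class, LongWritable.class), and likewise for "typeB".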
If you know at the map stage which file each record must go to, you can tag your map output with a special key specifying which file it should go to. For example, if a record R1 must go to file 1, you would output <1, R1> (1 is the key, a symbolic representation of file 1, and R1 is the value). If a record R2 must go to file 2, your map output would be <2, R2>.
Then, if you configure the MapReduce job to use only 2 reducers, all records tagged with <1, _> will be sent to one reducer and all records tagged with <2, _> to the other (records with the same key always land in the same reducer; a trivial custom partitioner, sketched below, makes the tag-to-reducer mapping explicit).
This would be better than your preferred solution since you are still going through your map output only once, and at the same time the two outputs are written in parallel.
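If you don't want to rely on the default hash partitioning to keep the two tags apart, a trivial custom partitioner (hypothetical names, assuming an IntWritable tag and Text value) makes the routing explicit:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Sends tag 1 to reducer 0 and tag 2 to reducer 1 (with job.setNumReduceTasks(2)),
    // so each output file corresponds to exactly one tag.
    public class TagPartitioner extends Partitioner<IntWritable, Text> {
        @Override
        public int getPartition(IntWritable tag, Text value, int numPartitions) {
            return (tag.get() - 1) % numPartitions; // tag 1 -> partition 0, tag 2 -> partition 1
        }
    }

In the driver: job.setNumReduceTasks(2); job.setPartitionerClass(TagPartitioner.class);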
I am working on parallelizing an algorithm, which roughly does the following:
Read several text documents with a total of 10k words.
Create an object for every word in the text corpus.
Create a pair between all word-objects (yes, O(n²)), and return the most frequent pairs.
I would like to parallelize the third step by creating the pairs between the first 1000 word-objects and the rest on the first machine, the pairs between the second 1000 word-objects and the rest on the next machine, and so on.
My question is how to pass the objects created in the second step to the Mapper. As far as I am aware, I would require input files for this and hence would need to serialize the objects (though I haven't worked with this before). Is there a direct way to pass the objects to the Mapper?
Thanks in advance for the help
Evgeni
UPDATE
Thank you for reading my question. Serialization seems to be the best way to solve this (see java.io.Serializable). Furthermore, I have found this tutorial useful for reading data from serialized objects into Hadoop: http://www.cs.brown.edu/~pavlo/hadoop/.
How about parallelizing all the steps? Use your text documents from step 1 as input to your Mapper. Create the object for every word in the Mapper. In the Mapper, your key-value pair will be the word-object pair (or object-word, depending on what you are doing). The Reducer can then count the unique pairs; a rough sketch of this pattern is below.
Hadoop will take care of bringing all the same keys together into the same Reducer.
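A bare-bones sketch of that pattern (I'm treating each map input value as one document and using a tab-joined string as the pair key; with plain TextInputFormat you'd actually see one line at a time, so you'd either group lines per document or use a whole-file input format like the one sketched earlier):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emit every word pair that co-occurs in one document with a count of 1.
    public class PairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text document, Context context)
                throws IOException, InterruptedException {
            List<String> words = new ArrayList<>();
            for (String w : document.toString().split("\\s+")) {
                if (!w.isEmpty()) words.add(w);
            }
            // All unordered pairs within this document (this is the O(n^2) step).
            for (int i = 0; i < words.size(); i++) {
                for (int j = i + 1; j < words.size(); j++) {
                    context.write(new Text(words.get(i) + "\t" + words.get(j)), ONE);
                }
            }
        }
    }

    // Reducer: Hadoop groups identical pair keys, so summing gives each pair's frequency.
    class PairReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(pair, new IntWritable(sum));
        }
    }

Picking the most frequent pairs could then be a small follow-up step over that output.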
Use Twitter's protobufs (elephant-bird). Convert each word into a protobuf object and process it however you want. Protobufs are also much faster and lighter than default Java serialization. Refer to Kevin Weil's presentation on this: http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter