Specify Hadoop mapreduce input keys directly (not from a file) - hadoop

I'd like to generate some data using a mapreduce. I'd like to invoke the job with one parameter N, and get Map called with each integer from 1 to N, once.
Obviously I want a Mapper<IntWritable, NullWritable, <my output types>>...that's easy. But I can't figure out how to generate the input data! Is there an InputFormat I'm not seeing somewhere that lets me just pull keys + values from a collection directly?

Do you want each mapper to process all integers from 1 to N? Or do you want to distribute the processing of integers 1 to N across the concurrently running mappers?
If the former, I believe you'll need to create a custom InputFormat. If the latter, the easiest way might be to generate a text file with the integers 1 to N, one integer per line, and use NLineInputFormat (or plain TextInputFormat) so the lines are spread across the mappers.
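For illustration (not part of the original answer), here is a minimal driver-side sketch of the second approach. The class name, the staging path, and the lines-per-split value are all assumptions; the idea is just to write 1..N to a file and let NLineInputFormat hand each mapper a batch of lines:

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class IntegerInputDriver {

    /** Writes the integers 1..n, one per line, to the given HDFS path. */
    static void writeIntegers(Configuration conf, Path path, int n) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        try (PrintWriter out = new PrintWriter(new OutputStreamWriter(fs.create(path, true)))) {
            for (int i = 1; i <= n; i++) {
                out.println(i);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int n = Integer.parseInt(args[0]);
        Configuration conf = new Configuration();
        Path input = new Path("generated-input.txt"); // hypothetical staging path

        writeIntegers(conf, input, n);

        Job job = Job.getInstance(conf, "generate-data");
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, input);
        NLineInputFormat.setNumLinesPerSplit(job, 1000); // ~1000 integers per mapper
        // ... set the mapper class, output types and output path, then:
        // job.waitForCompletion(true);
    }
}

Note that with this input format the mapper receives (LongWritable offset, Text line) pairs rather than IntWritable keys, so it has to parse the integer back out of each line.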

Related

MapReduce: Given a file of numbers, output the amount of distinct / unique numbers

If the input file is 1,1,2,2,3,4,4,4,5,5,5,5,6,6,6, then the output of the MapReduce job should be 6 (i.e. the size of the set of unique integers {1,2,3,4,5,6}).
I need help with implementing the above. I know that we can filter out duplicates by emitting each number with a null value in map(), and then similarly outputting the key with a null value in reduce() to a resultant file / console.
But if I directly need to get the number of distinct numbers, how would I go about this?
My current implementation is to build a Set, pass it as the output of the Mapper, and, in the Reducer, combine all the Sets passed to it and return the count of that resultant Set. Do note that this is more of a design question than a library-specific (say, Hadoop) implementation question.
Use a mapper to build a HashSet; make the output types IntWritable and NullWritable.
Add all the input values to the set.
Write out the size of the HashSet (a sketch of this map-only variant is shown below).
Set the number of reduce tasks to 0, since none are needed.
If you must use a reducer, output (null, value) from the mapper so everything reaches a single reducer,
then do the same thing there.
Alternative (simpler) methods exist if you can use Hive, Pig, or Spark.
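For illustration (not the original answer's wording), a minimal sketch of the map-only variant, assuming the input is a text file of comma- or whitespace-separated integers; note that with more than one mapper this emits one partial count per mapper, which you would still have to combine:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DistinctCountMapper
        extends Mapper<LongWritable, Text, IntWritable, NullWritable> {

    private final Set<Integer> seen = new HashSet<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        // Collect every number in this mapper's split into the set.
        for (String token : line.toString().split("[,\\s]+")) {
            if (!token.isEmpty()) {
                seen.add(Integer.parseInt(token));
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // With 0 reduce tasks this size is written straight to the part-m-* file.
        context.write(new IntWritable(seen.size()), NullWritable.get());
    }
}

In the driver you would call job.setNumReduceTasks(0) so the count is written directly by the map task.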

How does map reduce parallel processing really work in hadoop with respect to the word count example?

I am learning Hadoop MapReduce using the word count example; please see the diagram attached.
My questions are about how the parallel processing actually happens. My understanding/questions are below; please correct me if I am wrong:
Split step: This assigns the number of mappers. Here the two data sets go to two different processors [p1, p2], so two mappers? This splitting is done by the first processor P.
Mapping step: Each of these processors [p1, p2] now divides the data into key-value pairs by applying the required function f() on keys, which produces value v, giving [k1,v1], [k2,v2].
Merge step 1: Within each processor, values are grouped by key, giving [k1,[v1,v2,v3]].
Merge step 2: Now p1 and p2 return their output to P, which merges both sets of resultant key-value pairs. This happens in P.
Sorting step: Now P will sort all the results.
Reduce step: Here P will apply f() on each individual key [k1,[v1,v2,v3]] to give [k1,V].
Let me know if this understanding is right; I have a feeling I am completely off in many respects.
Let me explain each step in a little more detail so that the whole process is clearer. I have tried to keep the explanations as brief as possible, but I would recommend you go through the official docs (https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html) to get a good feel for the whole process.
Split step: If you have written a job by now, you must have noticed that we sometimes set the number of reducers but we never set the number of mappers. That is because the number of mappers depends on the number of input splits; in simple words, the number of mappers in any job is proportional to the number of input splits. So how does the splitting take place? It depends on several factors, such as mapred.max.split.size, which sets the maximum size of an input split; there are other ways too, but the point is that we can control the size of an input split.
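For example (a hedged sketch using the newer mapreduce API rather than the old mapred.max.split.size property), the split size can be capped from the driver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        // Cap each input split at 64 MB; a 1 GB input file then yields
        // roughly 16 splits, and therefore roughly 16 map tasks.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}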
Mapping step: If by 2 processors you mean 2 JVMs (2 containers), 2 different nodes, or 2 mappers, then your intuition is wrong. Containers and nodes have nothing to do with splitting the input file; it is the job of HDFS to divide and distribute the file blocks across the nodes, and it is then the responsibility of the resource manager to launch the map task on a node that holds the input split, if possible. Once a map task starts, you create key-value pairs according to your logic in the mapper. One thing to remember here is that one mapper can only work on one input split.
You have mixed up steps 3, 4, and 5 a little. I have tried to explain those steps by describing the actual classes which handle them.
Partitioner class: This class divides the output of the map tasks according to the number of reducers. It is useful if you have more than 1 reducer; otherwise it does not affect your output. It contains a method called getPartition, which decides which reducer a given map output record will go to (if you have more than one reducer); this method is called for each key in the map output. You can override this class, and consequently this method, to customize it to your requirements. In your example, since there is one reducer, the output of both mappers is merged into a single file; if there were more reducers, the same number of output files would be created.
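For illustration (the class name and the routing rule are assumptions, not from the question), a minimal sketch of overriding getPartition for Text keys and IntWritable values; it would be registered with job.setPartitionerClass(ColourPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ColourPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 1) {
            return 0; // everything goes to the single reducer
        }
        // Assumed routing rule: colours starting with 'B' go to reducer 0,
        // the rest are spread over the remaining reducers by hash.
        if (key.toString().startsWith("B")) {
            return 0;
        }
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1);
    }
}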
WritableComparator class: Sorting of the map output is done by this class, on the basis of the key. Like the partitioner, you can override it (a hedged sketch of a custom comparator appears after the grouping example below). In your example, if the key is the colour name, the keys will be sorted like this (assuming you do not override this class, in which case the default ordering for Text, alphabetical order, is used):
Black,1
Black,1
Black,1
Blue,1
Blue,1
.
.
and so on
Now this same class is also used for grouping your values by key, so that in the reducer you can iterate over them; in the case of your example ->
Black -> {1,1,1}
Blue -> {1,1,1,1,1,1}
Green -> {1,1,1,1,1}
.
.
and so on
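For illustration (not part of the original answer), a minimal sketch of a custom sort comparator that inverts the default alphabetical ordering of Text keys; it would be registered on the job with job.setSortComparatorClass(ReverseTextComparator.class):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class ReverseTextComparator extends WritableComparator {
    protected ReverseTextComparator() {
        super(Text.class, true); // true -> instantiate Text keys so compare() gets real objects
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Invert the natural (alphabetical) ordering of the keys.
        return -((Text) a).compareTo((Text) b);
    }
}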
Reducer: This step simply reduces your map output according to the logic defined in your reducer class. Your intuition is appropriate for this class.
There are some other pieces that affect the intermediate steps between mapper and reducer, and before the mapper as well, but they are not that relevant to what you want to know.
I hope this resolves your query.
Your diagram is not exactly showing basic word counting in MapReduce. Specifically, the part after 'Merging-step 1' is misleading in terms of understanding how MapReduce parallelizes the reduce phase. A better diagram, imo, can be found at https://dzone.com/articles/word-count-hello-word-program-in-mapreduce
In the latter diagram it is easy to see that as soon as the mappers' output is sorted by output key and then shuffled on this key across the nodes running reducers, the reducers can easily run in parallel.

Hadoop map/reduce sort

I have a map-reduce job and I am using just the mapper, because the output of each mapper will definitely have a unique key. My question is: when this job runs and I get the output files, which are like part-m-00000, part-m-00001, ..., will they be sorted in order of key?
Or do I need to implement a reducer which does nothing but write them to files like part-r-00000, part-r-00001? And does that guarantee that the output is sorted in the order of the key?
If you want to sort the keys within each file and make sure that the keys in file i are less than the keys in file j when i is less than j, you not only need to use a reducer but also a partitioner. You might want to consider using something like Pig to do this, as it would be trivial. If you want to do it with MR, use the sorted field as your key and write a partitioner to make sure that your keys end up in the correct reducer.
When your map function outputs the keys, they are partitioned and then sorted before being handed to the reducers. Therefore, by default the keys arriving at each reducer are in sorted order, and you can use the identity reducer.
If you want to guarantee sorted order, you can simply use a single IdentityReducer.
If you want it to be more parallelizable, you can specify more reducers, but then the output will by default only be sorted within files, not across files. That is, each file will be sorted, but part-r-00000 will not necessarily come before part-r-00001. If you DO want it to be sorted across files, you can use a custom partitioner that partitions based on the sorting order, i.e. reducer 0 gets all of the lowest keys, then reducer 1, and so on until reducer N gets all of the highest keys.
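For illustration only (the key type and the key-space bound are assumptions), a minimal sketch of such a range partitioner; Hadoop also ships a TotalOrderPartitioner, usually fed by InputSampler, for exactly this purpose:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class RangePartitioner extends Partitioner<IntWritable, Text> {
    // Hypothetical upper bound on the key space; in practice you would
    // sample the data or pass the bound in via the Configuration.
    private static final int MAX_KEY = 1_000_000;

    @Override
    public int getPartition(IntWritable key, Text value, int numReduceTasks) {
        // Lowest keys -> reducer 0, highest keys -> last reducer, so the
        // concatenation of part-r-* files is globally sorted.
        int bucket = (int) ((long) key.get() * numReduceTasks / MAX_KEY);
        // Clamp in case a key falls outside the assumed range.
        return Math.min(Math.max(bucket, 0), numReduceTasks - 1);
    }
}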

How do I process a 2-D array, one per file, using Hadoop MapReduce?

I need to read and process a file as a single unit, not line by line, and it's not clear how you'd do this in a Hadoop MapReduce application. What I need to do is to read the first line of the file as a header, which I can use as my key, and the following lines as data to build a 2-D data array, which I can use as my value. I'll then do some analysis on the entire 2-D array of data (i.e. the value).
Below is how I'm planning to tackle this problem, and I would very much appreciate comments if this doesn't look reasonable or if there's a better way to go about this (this is my first serious MapReduce application so I'm probably making rookie mistakes):
My text file inputs contain one line with station information (name, lat/lon, ID, etc.) and then one or more lines containing a year value (i.e. 1956) plus 12 monthly values (i.e. 0.3 2.8 4.7 ...) separated by spaces. I have to do my processing over the entire array of monthly values [number_of_years][12] so each individual line is meaningless in isolation.
Create a custom key class, making it implement WritableComparable. This will hold the header information from the initial line of the input text files.
Create a custom input format class in which a) the isSplitable() method returns false, and b) the getRecordReader() method returns a custom record reader that knows how to read a file split and turn it into my custom key and value classes.
Create a mapper class which does the analysis on the input value (the 2-D array of monthly values) and outputs the original key (the station header info) and an output value (a 2-D array of analysis values). There'll only be a wrapper reducer class since there's no real reduction to be done.
It's not clear that this is a good/correct application of the map reduce approach a) since I'm doing analysis on a single value (the data array) mapped to a single key, and b) since there is never more than a single value (data array) per key then no real reduction will ever need to be performed. Another issue is that the files I'm processing are relatively small, much less than the default 64MB split size. With this being the case perhaps the first task is instead to consolidate the input files into a sequence file, as shown in the SmallFilesToSequenceFileConverter example in the Definitive Hadoop O'Reilly book (p. 194 in the 2nd Edition)?
Thanks in advance for your comments and/or suggestions!
It looks like your plan regarding coding is spot on; I would do the same thing.
You will benefit from Hadoop if you have a lot of input files provided as input to the job, as each file will have its own InputSplit, and in Hadoop the number of map tasks executed equals the number of input splits.
Too many small files will cause too much memory use on the HDFS NameNode. To consolidate the files you can use SequenceFiles or Hadoop Archives (the Hadoop equivalent of tar); see the docs. With HAR files (Hadoop Archives), each small file will still have its own mapper.
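For illustration, a minimal sketch of such a whole-file input format, written against the newer mapreduce API (so it overrides createRecordReader rather than getRecordReader) and using plain Text for the key and value instead of the custom Writables described above; the class names are assumptions:

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat extends FileInputFormat<Text, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // one mapper per file, never split
    }

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split,
                                                       TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    static class WholeFileRecordReader extends RecordReader<Text, Text> {
        private FileSplit split;
        private TaskAttemptContext context;
        private final Text key = new Text();
        private final Text value = new Text();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.context = context;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            Path path = split.getPath();
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            byte[] contents = new byte[(int) split.getLength()];
            FSDataInputStream in = fs.open(path);
            try {
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            // First line = station header (the key); the rest = the data block
            // from which the mapper can build its 2-D array (the value).
            String text = new String(contents, StandardCharsets.UTF_8);
            int firstNewline = text.indexOf('\n');
            key.set(firstNewline < 0 ? text : text.substring(0, firstNewline));
            value.set(firstNewline < 0 ? "" : text.substring(firstNewline + 1));
            processed = true;
            return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}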

Parallel reducing with Hadoop mapreduce

I'm using Hadoop's MapReduce. I have a file as input to the map function; the map function does something (not relevant for the question). I'd like my reducer to take the map's output and write it to two different files.
The way I see it (I want an efficient solution), there are two ways in my mind:
1 reducer which will know how to identify the different cases and write to 2 different contexts.
2 parallel reducers, each of which will know how to identify its relevant input, ignore the other's, and this way each one will write to a file (each reducer will write to a different file).
I'd prefer the first solution, because it means I'll go over the map's output only once instead of twice in parallel - but if the first isn't supported in some way, I'll be glad to hear a solution for the second suggestion.
*Note: these two final files are supposed to be separate; there is no need to join them at this point.
The Hadoop API has a feature for creating multiple outputs, called MultipleOutputs, which makes your preferred solution possible.
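For illustration (the output names, key/value types, and routing rule are assumptions), a minimal sketch of a reducer that uses MultipleOutputs to write to two files:

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplittingReducer extends Reducer<Text, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> outputs;

    @Override
    protected void setup(Context context) {
        outputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Hypothetical routing rule: records starting with "A" go to the
            // first file, everything else to the second.
            String name = value.toString().startsWith("A") ? "first" : "second";
            outputs.write(name, NullWritable.get(), value);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        outputs.close(); // flush both named outputs
    }
}

On the driver side you would register each named output, e.g. MultipleOutputs.addNamedOutput(job, "first", TextOutputFormat.class, NullWritable.class, Text.class).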
If you know at the map stage which file the record must go to, you can tag your map output with a special key specifying which file it should go to. For example, if a record R1 must go to file 1, you would output <1, R1> (1 is the key, a symbolic representation for file 1, and R1 is the value); if a record R2 must go to file 2, your map output would be <2, R2>.
Then if you configure the MapReduce job to use only 2 reducers, it will guarantee that all records tagged with <1, _> are sent to one reducer and all records tagged with <2, _> are sent to the other.
This would be better than your preferred solution since you still go through your map output only once, and at the same time the writing happens in parallel. A sketch of such a tagging mapper is below.
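For illustration (the tag values and the routing rule are assumptions), a minimal sketch of such a tagging mapper; the job would be configured with job.setNumReduceTasks(2):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TaggingMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final IntWritable tag = new IntWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Hypothetical routing rule: lines containing "ERROR" belong in
        // file 1, everything else in file 2.
        tag.set(line.toString().contains("ERROR") ? 1 : 2);
        context.write(tag, line);
    }
}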
