Hadoop - set reducer number to 0 but write to same file? - hadoop

My job is computational intensive so I am actually only using the distribution function of Hadoop, and I want all my output to be in 1 single file so I have set the number of reducer to 1. My reducer is actually doing nothing...
By explicitly setting the number of reducer to 0, may I know how can I control in the mapper to force all the outputs are written into the same 1 output file? Thanks.

You can't do that in Hadoop. Your mappers each have to write to independent files. This makes them efficient (no contention or network transfer). If you want to combine all those files, you need a single reducer. Alternatively, you can let them be separate files, and combine the files when you download them (e.g., using HDFS's command-line cat or getmerge options).
EDIT: From your comment, I see that what you want is to get away with the hassle of writing a reducer. This is definitely possible. To do this, you can use the IdentityReducer. You can check its API here and an explanation of 0 reducers vs. using the IdentityReducer is available here.
Finally, when I say that having multiple mappers generate a single output is not possible, I mean it is not possible with plain files in HDFS. You could do this with other types of output, like having all mappers write to a single database. This is OK if your mappers are not generating much output. Details on how this would work are available here.

cabad is correct for the most part. However, if you want to process the file with a single Mapper to a single output file you could use a FileInputFormat that marks the file as not splittable. Do this as well as set the number of Reducers to 0. This reduces the performance of using multiple data nodes but skips Shuffle and Sort.

Related

What is the benefit of the reducers in Hadoop?

I don't see a value for the reducers in Hadoop in the following scenario:
The Map Tasks generate unique keys (Because we can merge both the Map/Reduce functionality together)
The output size of the Map Tasks is too big (This will exhaust the memory if we wait for the reducers to begin the work)
If we have any functionality that doesn't need grouping and sorting of the keys
Please correct me if I am wrong.
And if someone could give me a real example of the benefits of the reducers and when it should be used, I will appreciate it.
Reducer is beneficial (or required) when you need to do operations like aggregation/grouping etc..
FYI : Reducer is meant for grouping different value for a key which comes from different mapper. So for a use case which do not require grouping/aggregation then there is no point of using reducer(you can set it to Zero , meaning Map-Only jobs).
One quick use-case i can think of is - you want to randomly split a big file to multiple part file. In this case you will supply big file (lets say 100G) to Map-Only jobs. All maps will read a chunk of file and write as a part of file.

Large numbers of Hadoop output files

Is there a sensible way, in Hadoop, to write very large numbers of output files? I've been using MultipleOutputs. However, MultipleOutputs allocates a large (~1MB) buffer to each file, so I ran into memory problems.
The ordering of my data is such that in any given reducer, I can write to a target file, close it, then move on to the next one. Unfortunately, MultipleOutputs doesn't expose a method to close a given file. I've written a modified MultipleOutputs which exposes such a method, and deals with the problem, but this doesn't seem ideal.
The alternative would be a final step to split my output into the required files, but I'm not sure of a good way to do this.
Each reducer will generate an output file, more the no. of reducer more the no. of o/p files and lesser the size.
probably you can restrict your no. of reducers.
But make sure limited reducers is optimized.
e.g. if you set reducers=1 then only 1 process has to process all your mapper data hence increases the processing time.

processing very small file with hadoop

I have a question about using hadoop to process a small file. My file only has about a 1,000 or so records but i want the records to roughly be evenly distributed among the nodes. Is there a way to do this? I'm new to hadoop and so far it seems that all the execution is happening on one node instead a multiple simultaneously. Let me know if my question makes sense or if I need to clarify anything. Like I said, i'm very new to Hadoop but am hoping to get some clarification. Thanks.
Use the NLineInputFormat and specify the number of records to be processed by each mapper. This way the records in a single block will be processed by multiple mappers.
The other option is to split your one input file into multiple input files (in the one input path directory).
Each of those input files will then be able to be spread across the hdfs and the map
operations will occur on the worker machines that own those input splits.

How to increase number of mappers in Mahout MatrixMultiplicationJob?

I am using Mahout 0.7's MatrixMultiplicationJob for multiplying a large matrix. But it always uses 1 map task which makes it slow. its probably due to the InputSplit which forces the number of mappers to be 1.
Is there a way I can efficiently multiply matrices in Hadoop / Mahout or change the number of mappers?
Ultimately, it is Hadoop that decides how many mappers to use. Generally it will use one mapper per HDFS block (typically 64 or 128MB). If your data is smaller than that, it's too small to bother with more than 1 mapper.
You can encourage it to use more anyway by setting mapred.max.split.size to something smaller than 64MB (remember the value is set in bytes, not MB). But, are you sure you want to? It is much more common to need more reducers, not mappers, since Hadoop will never use more than 1 unless you (or your job) tells it to.
Also know that Hadoop will not be able to use more than one mapper on a single compressed file. So if your input is one huge compressed file, it will only ever use 1 mapper on that file. You can however split it up yourself into many smaller compressed files.
had you tried to specify number of mappers via command line with -Dmapred.map.tasks=N option? I hadn't tried it, but it should work. If it won't work, then try to set this parameter in the MAHOUT_OPTS environment variable...

Hadoop one Map and multiple Reduce

We have a large dataset to analyze with multiple reduce functions.
All reduce algorithm work on the same dataset generated by the same map function. Reading the large dataset costs too much to do it every time, it would be better to read only once and pass the mapped data to multiple reduce functions.
Can I do this with Hadoop? I've searched the examples and the intarweb but I could not find any solutions.
Maybe a simple solution would be to write a job that doesn't have a reduce function. So you would pass all the mapped data directly to the output of the job. You just set the number of reducers to zero for the job.
Then you would write a job for each different reduce function that works on that data. This would mean storing all the mapped data on the HDFS though.
Another alternative might be to combine all your reduce functions into a single Reducer which outputs to multiple files, using a different output for each different function. Multiple outputs are mentioned in this article for hadoop 0.19. I'm pretty sure that this feature is broken in the new mapreduce API released with 0.20.1, but you can still use it in the older mapred API.
Are you expecting every reducer to work on exactly same mapped data? But at least the "key" should be different since it decides which reducer to go.
You can write an output for multiple times in mapper, and output as key (where $i is for the i-th reducer, and $key is your original key). And you need to add a "Partitioner" to make sure these n records are distributed in reducers, based on $i. Then using "GroupingComparator" to group records by original $key.
It's possible to do that, but not in trivial way in one MR.
You may use composite keys. Let's say you need two kinds of the reducers, 'R1' and 'R2'. Add ids for these as a prefix to your o/p keys in the mapper. So, in the mapper, a key 'K' now becomes 'R1:K' or 'R2:K'.
Then, in the reducer, pass values to implementations of R1 or R2 based on the prefix.
I guess you want to run different reducers in a chain. In hadoop 'multiple reducers' means running multiple instances of the same reducer. I would propose you run one reducer at a time, providing trivial map function for all of them except the first one. To minimize time for data transfer, you can use compression.
Of course you can define multiple reducers. For the Job (Hadoop 0.20) just add:
job.setNumReduceTasks(<number>);
But. Your infrastructure has to support the multiple reducers, meaning that you have to
have more than one cpu available
adjust mapred.tasktracker.reduce.tasks.maximum in mapred-site.xml accordingly
And of course your job has to match some specifications. Without knowing what you exactly want to do, I only can give broad tips:
the keymap-output have either to be partitionable by %numreducers OR you have to define your own partitioner:
job.setPartitionerClass(...)
for example with a random-partitioner ...
the data must be reduce-able in the partitioned format ... (references needed?)
You'll get multiple output files, one for each reducer. If you want a sorted output, you have to add another job reading all files (multiple map-tasks this time ...) and writing them sorted with only one reducer ...
Have a look too at the Combiner-Class, which is the local Reducer. It means that you can aggregate (reduce) already in memory over partial data emitted by map.
Very nice example is the WordCount-Example. Map emits each word as key and its count as 1: (word, 1). The Combiner gets partial data from map, emits (, ) locally. The Reducer does exactly the same, but now some (Combined) wordcounts are already >1. Saves bandwith.
I still dont get your problem you can use following sequence:
database-->map-->reduce(use cat or None depending on requirement)
then store the data representation you have extracted.
if you are saying that it is small enough to fit in memory then storing it on disk shouldnt be an issue.
Also your use of MapReduce paradigm for the given problem is incorrect, using a single map function and multiple "different" reduce function makes no sense, it shows that you are just using map to pass out data to different machines to do different things. you dont require hadoop or any other special architecture for that.

Resources