Large numbers of Hadoop output files

Is there a sensible way, in Hadoop, to write very large numbers of output files? I've been using MultipleOutputs. However, MultipleOutputs allocates a large (~1MB) buffer to each file, so I ran into memory problems.
The ordering of my data is such that in any given reducer, I can write to a target file, close it, then move on to the next one. Unfortunately, MultipleOutputs doesn't expose a method to close a given file. I've written a modified MultipleOutputs which exposes such a method, and deals with the problem, but this doesn't seem ideal.
The alternative would be a final step to split my output into the required files, but I'm not sure of a good way to do this.
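For reference, one way around the per-file buffers is to skip MultipleOutputs altogether and open a single HDFS stream at a time from the reducer, closing it before moving on to the next target. The class below is only a minimal sketch of that idea (not the modified MultipleOutputs mentioned above); the /output directory layout and the Text types are assumptions.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: one output file per reduce key, written and closed before the next key
// arrives, so only one stream buffer is held at any time.
public class PerFileReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text targetFile, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    Path out = new Path("/output/" + targetFile.toString()); // hypothetical layout
    try (FSDataOutputStream stream = fs.create(out, true)) {
      for (Text value : values) {
        stream.writeBytes(value.toString());
        stream.writeBytes("\n");
      }
    } // stream closed here, before the next target file is processed
  }
}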

Each reducer generates one output file: the more reducers you have, the more output files you get, and the smaller each one is.
You can probably restrict the number of reducers, but make sure the reduced count still performs well.
For example, if you set the number of reducers to 1, a single process has to handle all of the mapper output, which increases processing time.
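For example, with the plain Java MapReduce API the reducer count is set on the Job before submission (a minimal fragment, with the rest of the job setup omitted; the value 4 is just an example):

// org.apache.hadoop.mapreduce.Job / org.apache.hadoop.conf.Configuration
Job job = Job.getInstance(new Configuration(), "my-job");
// Fewer reducers mean fewer, larger output files; with 1 reducer a single process
// handles all of the map output.
job.setNumReduceTasks(4);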

Related

Pig: Control number of mappers

I can control the number of reducers by using the PARALLEL clause in the statements which result in reducers.
I want to control the number of mappers. The data source is already created, and I can not reduce the number of parts in the data source. Is it possible to control the number of maps spawned by my pig statements? Can I keep a lower and upper cap on the number of maps spawned? Is it a good idea to control this?
I tried using pig.maxCombinedSplitSize, mapred.min.split.size, mapred.tasktracker.map.tasks.maximum, etc., but they did not seem to help.
Can someone please help me understand how to control the number of maps and possibly share a working example?
There is a simple rule of thumb for the number of mappers: there are as many mappers as there are file splits. A file split depends on the block size into which HDFS splits your files (64MB, 128MB, or 256MB depending on your configuration); note that FileInputFormats take the block size into account, but can define their own splitting behaviour.
Splits are important because they are tied to the physical location of the data in the cluster: Hadoop brings the code to the data, not the data to the code.
The problem arises when the size of a file is less than the block size (64MB, 128MB, 256MB). In that case there will be as many splits as there are input files, which is not efficient, since each map task has some startup overhead. Your best bet here is to use pig.maxCombinedSplitSize, which will try to read multiple small files into one mapper, in effect ignoring the per-file splits. But if you make it too large you run the risk of bringing data to the code and running into network issues: you can hit network limitations if you force too few mappers, as data will have to be streamed from other data nodes. Keep the value close to the block size, or half of it, and you should be fine.
Another solution might be to merge the small files into one large splittable file, which will automatically produce an efficient number of mappers.
You can change the property mapred.map.tasks to the number you want. This property holds the default number of map tasks per job. Instead of setting it globally, set the property for your session so the default will be restored once your job is done.
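As a concrete illustration, these properties can be set at the top of a Pig script (the values below are examples, not recommendations; pig.maxCombinedSplitSize is in bytes):

-- Combine small input files until a split reaches roughly 128 MB.
set pig.maxCombinedSplitSize 134217728;
-- Default number of map tasks per job; Hadoop treats this only as a hint.
set mapred.map.tasks 50;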

Hadoop - set reducer number to 0 but write to same file?

My job is computationally intensive, so I am really only using Hadoop for distribution, and I want all my output to end up in one single file, so I have set the number of reducers to 1. My reducer is actually doing nothing...
If I explicitly set the number of reducers to 0 instead, how can I force all the mapper outputs to be written into the same single output file? Thanks.
You can't do that in Hadoop. Your mappers each have to write to independent files. This makes them efficient (no contention or network transfer). If you want to combine all those files, you need a single reducer. Alternatively, you can let them be separate files, and combine the files when you download them (e.g., using HDFS's command-line cat or getmerge options).
EDIT: From your comment, I see that what you want is to get away with the hassle of writing a reducer. This is definitely possible. To do this, you can use the IdentityReducer. You can check its API here and an explanation of 0 reducers vs. using the IdentityReducer is available here.
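For what it's worth, a hedged illustration of the two variants mentioned there (the driver class name is a placeholder):

// Old (org.apache.hadoop.mapred) API: pass the map output straight through an
// identity reducer so everything still funnels into a single output file.
JobConf conf = new JobConf(MyJob.class); // MyJob is a placeholder
conf.setNumReduceTasks(1);
conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class);

// New (org.apache.hadoop.mapreduce) API: the base Reducer class is already an
// identity, so one reduce task with no custom reducer class has the same effect.
Job job = Job.getInstance(new Configuration());
job.setNumReduceTasks(1);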
Finally, when I say that having multiple mappers generate a single output is not possible, I mean it is not possible with plain files in HDFS. You could do this with other types of output, like having all mappers write to a single database. This is OK if your mappers are not generating much output. Details on how this would work are available here.
cabad is correct for the most part. However, if you want to process the file with a single mapper into a single output file, you could use a FileInputFormat that marks the file as not splittable, and also set the number of reducers to 0. This gives up the parallelism of multiple data nodes, but skips the shuffle and sort.
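A minimal sketch of that suggestion, assuming text input (the class name is made up): override isSplitable so the whole file goes to one mapper, and run with zero reducers so the map output is written out directly.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Whole-file-to-one-mapper input format: never split the input file.
public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}

// In the driver:
// job.setInputFormatClass(NonSplittableTextInputFormat.class);
// job.setNumReduceTasks(0); // map output goes straight to part-m-* files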

Is InputSplit size or number of map tasks affected by the number of input files?

Would it make a difference to the number of map tasks spawned by a job if I have a lot of small files (~HDFS block size) vs. a few large files?
It depends on which InputFormat you use, because that is what determines how the input splits are computed, and thus the number of map tasks.
If you use the default TextInputFormat, each file will have at least one split, so there will be at least one mapper per file, even if the files are only a few kB. Each mapper then does very little work, and this introduces a lot of overhead for the Map/Reduce framework. That said, if you have a guarantee that these "small" files will be close to the block size, that probably doesn't matter too much.
If you have no control over your files and they might get really small, I would advise using a different InputFormat called CombineFileInputFormat, which combines several input files into the same split. In this case the number of maps will depend only on the overall amount of data, regardless of the number of files. An implementation can be found here.
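If your Hadoop version ships CombineTextInputFormat (a ready-made, line-oriented subclass of CombineFileInputFormat in newer releases), a hedged driver fragment could look like this; the 128 MB cap is only an example:

import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

// Pack many small files into each split, up to roughly one 128 MB block per mapper.
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);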

How can I work with a large number of small files in Hadoop?

I am new to Hadoop and I'm working with a large number of small files in the wordcount example.
It takes a lot of map tasks and slows down my execution.
How can I reduce the number of map tasks?
If the best solution to my problem is catting the small files into a larger file, how can I cat them?
If you're using something like TextInputFormat, the problem is that each file has at least one split, so the number of maps is at least the number of files; in your case, with many very small files, you will end up with many mappers each processing very little data.
To remedy that, you should use CombineFileInputFormat, which will pack multiple files into the same split (I think up to the block size limit), so with that format the number of mappers no longer depends on the number of files; it simply depends on the amount of data.
You will have to create your own input format by extending CombineFileInputFormat; you can find an implementation here. Once you have your InputFormat defined (call it CombinedInputFormat, as in the link), you can tell your job to use it by doing:
job.setInputFormatClass(CombinedInputFormat.class);
Cloudera posted a blog on the small files problem some time back. It's an old entry, but the suggested method still applies.
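If you do go the "cat them together first" route from the question, a hedged sketch using FileUtil.copyMerge (available in Hadoop 1.x/2.x; the paths are placeholders) is below. hadoop fs -getmerge does a similar merge onto the local filesystem.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Merge every file under /small-files (placeholder) into a single HDFS file.
public class MergeSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FileUtil.copyMerge(fs, new Path("/small-files"),      // source directory (placeholder)
                       fs, new Path("/merged/input.txt"), // merged output file (placeholder)
                       false,                             // keep the source files
                       conf, null);                       // no separator between files
  }
}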

Hadoop Pipes: how to pass large data records to map/reduce tasks

I'm trying to use map/reduce to process large amounts of binary data. The application is characterized by the following: the number of records is potentially large, such that I don't really want to store each record as a separate file in HDFS (I was planning to concatenate them all into a single binary sequence file), and each record is a large coherent (i.e. non-splittable) blob, between one and several hundred MB in size. The records will be consumed and processed by a C++ executable. If it weren't for the size of the records, the Hadoop Pipes API would be fine: but this seems to be based around passing the input to map/reduce tasks as a contiguous block of bytes, which is impractical in this case.
I'm not sure of the best way to do this. Does any kind of buffered interface exist that would allow each M/R task to pull multiple blocks of data in manageable chunks? Otherwise I'm thinking of passing file offsets via the API and streaming in the raw data from HDFS on the C++ side.
I'd like to have any opinions from anyone who's tried anything similar - I'm pretty new to hadoop.
Hadoop is not designed for records around 100MB in size. You will get OutOfMemoryErrors, along with uneven splits because some records are 1MB and some are 100MB. By Amdahl's Law your parallelism will suffer greatly, reducing throughput.
I see two options. You can use Hadoop streaming to map your large files into your C++ executable as-is. Since this will send your data via stdin it will naturally be streaming and buffered. Your first map task must break up the data into smaller records for further processing. Further tasks then operate on the smaller records.
If you really can't break the records up, make your map/reduce job operate on file names instead. The first mapper gets some file names, runs them through your C++ mapper executable, and stores the results in more files. The reducer is given all the names of the output files and repeats the process with a C++ reducer executable. This will not run out of memory, but it will be slow: besides the parallelism issue, the reduce tasks won't get scheduled onto nodes that already have the data, resulting in non-local HDFS reads.
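To make the file-name approach concrete, here is a rough sketch of the mapper side only, under these assumptions: the job's input is a text file listing one HDFS path per line, and process_record is a C++ executable shipped with the job (for example via the distributed cache) that reads one record on stdin.

import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each input value is one HDFS path; the mapper streams that file into a local
// C++ executable and publishes the result back to HDFS for a later stage.
public class FileNameMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text pathLine, Context context)
      throws IOException, InterruptedException {
    Path input = new Path(pathLine.toString());
    FileSystem fs = input.getFileSystem(context.getConfiguration());

    // "process_record" is an assumed executable name; its stdout and stderr go to
    // local files so the pipes cannot fill up and deadlock.
    Process proc = new ProcessBuilder("./process_record")
        .redirectOutput(new File("record.out"))
        .redirectError(new File("record.err"))
        .start();
    try (OutputStream toChild = proc.getOutputStream();
         FSDataInputStream in = fs.open(input)) {
      IOUtils.copyBytes(in, toChild, 64 * 1024, false); // stream the record into stdin
    }
    if (proc.waitFor() != 0) {
      throw new IOException("process_record failed for " + input);
    }

    // Copy the local result back to HDFS (the ".processed" suffix is an assumption)
    // and emit its path so the reduce stage can pick it up.
    Path result = new Path(input.toString() + ".processed");
    fs.copyFromLocalFile(new Path("record.out"), result);
    context.write(pathLine, new Text(result.toString()));
  }
}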
