Processing a very small file with Hadoop

I have a question about using Hadoop to process a small file. My file only has about 1,000 or so records, but I want the records to be roughly evenly distributed among the nodes. Is there a way to do this? I'm new to Hadoop and so far it seems that all the execution is happening on one node instead of on multiple nodes simultaneously. Let me know if my question makes sense or if I need to clarify anything. Like I said, I'm very new to Hadoop but am hoping to get some clarification. Thanks.

Use NLineInputFormat and specify the number of records to be processed by each mapper. This way, the records in a single block will be processed by multiple mappers.
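
A minimal driver sketch, assuming the new mapreduce API (the class name, omitted paths, and the lines-per-split value are illustrative, not from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class SmallFileDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "spread-small-file");
        // Mapper class, output types and input/output paths omitted for brevity.
        job.setInputFormatClass(NLineInputFormat.class);
        // With ~1,000 records, 50 lines per split yields roughly 20 map tasks
        // that the scheduler can spread across the cluster.
        NLineInputFormat.setNumLinesPerSplit(job, 50);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}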

The other option is to split your one input file into multiple input files (in the one input path directory). Each of those input files will then be spread across HDFS, and the map operations will occur on the worker machines that own those input splits.

Related

Does Hadoop Distcp copy at block level?

Distcp between/within clusters is implemented as a MapReduce job. My assumption was that it copies files at the input-split level, helping copy performance since a file would be copied by multiple mappers working on multiple "pieces" in parallel.
However, when I was going through the documentation of Hadoop Distcp, it seems Distcp only works at the file level.
Please refer to here: hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html
According to the distcp doc, distcp will only split the list of files, instead of the files themselves, and hand the partitions of that list to the mappers.
Can anyone tell how exactly this will work?
Additional question: if a file is assigned to only one mapper, how does that mapper find all the input splits of the file from the single node it is running on?
For a single file of ~50G size, 1 map task will be triggered to copy the data since files are the finest level of granularity in Distcp.
Quoting from the documentation:
Why does DistCp not run faster when more maps are specified?
At present, the smallest unit of work for DistCp is a file. i.e., a file is processed by only one map. Increasing the number of maps to a value exceeding the number of files would yield no performance benefit. The number of maps launched would equal the number of files.
UPDATE
The block locations of the file are obtained from the namenode during MapReduce. In Distcp, each mapper will be initiated, if possible, on the node where the first block of the file is present. In cases where the file is composed of multiple blocks, the remaining blocks will be fetched from neighbouring nodes if they are not available on the same node.

Pig: Control number of mappers

I can control the number of reducers by using PARALLEL clause in the statements which result in reducers.
I want to control the number of mappers. The data source is already created, and I can not reduce the number of parts in the data source. Is it possible to control the number of maps spawned by my pig statements? Can I keep a lower and upper cap on the number of maps spawned? Is it a good idea to control this?
I tried using pig.maxCombinedSplitSize, mapred.min.split.size, mapred.tasktracker.map.tasks.maximum etc, but they seem to not help.
Can someone please help me understand how to control the number of maps and possibly share a working example?
There is a simple rule of thumb for the number of mappers: there are as many mappers as there are file splits. A file split depends on the block size into which HDFS splits the files (64MB, 128MB, 256MB depending on your configuration); note that FileInputFormats take the block size into account but can define their own split behaviour.
Splits are important because they are tied to the physical location of the data in the cluster: Hadoop brings the code to the data, not the data to the code.
The problem arises when the size of a file is less than the block size (64MB, 128MB, 256MB): there will be as many splits as there are input files, which is not efficient, since each map task incurs startup overhead. In this case your best bet is pig.maxCombinedSplitSize, which will try to read multiple small files into one mapper, effectively ignoring the per-file splits. But if you make it too large you risk bringing data to the code and will run into network issues: with too few mappers, data will have to be streamed from other data nodes. Keep the value close to the block size, or half of it, and you should be fine.
Another solution is to merge the small files into one large splittable file, which will automatically generate an efficient number of mappers.
You can change the property mapred.map.tasks to the number you want. This property holds the default number of map tasks per job. Instead of setting it globally, set the property for your session so the default will be restored once your job is done.
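
In a Pig session you would typically set this property for the session (for example with Pig's set command or with -D on the command line). For comparison, a rough sketch of the per-job override in a plain MapReduce driver might look like the code below; note that mapred.map.tasks is only a hint to the framework, and the actual count still depends on the input splits.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapTaskHintExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Per-job override rather than a cluster-wide change; this is only a hint,
        // the real number of map tasks is still driven by the number of input splits.
        conf.setInt("mapred.map.tasks", 10);
        Job job = Job.getInstance(conf, "map-task-hint");
        // Mapper, input format and paths would be configured here as usual.
    }
}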

Hadoop - set reducer number to 0 but write to same file?

My job is computationally intensive, so I am really only using Hadoop for its distribution function, and I want all my output to be in one single file, so I have set the number of reducers to 1. My reducer is actually doing nothing...
If I explicitly set the number of reducers to 0, how can I control the mappers so that all the outputs are written into the same single output file? Thanks.
You can't do that in Hadoop. Your mappers each have to write to independent files. This makes them efficient (no contention or network transfer). If you want to combine all those files, you need a single reducer. Alternatively, you can let them be separate files, and combine the files when you download them (e.g., using HDFS's command-line cat or getmerge options).
EDIT: From your comment, I see that what you want is to avoid the hassle of writing a reducer. This is definitely possible. To do this, you can use the IdentityReducer. You can check its API here; an explanation of 0 reducers vs. using the IdentityReducer is available here.
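
A minimal driver sketch of that approach, assuming the new mapreduce API, in which the base Reducer class is the pass-through equivalent of the old IdentityReducer (class name and omitted mapper are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class SingleOutputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "single-output");
        // Mapper class, output types and input/output paths omitted for brevity.
        // One reducer funnels all map output into a single part file (part-r-00000).
        job.setNumReduceTasks(1);
        // The default Reducer simply passes records through unchanged,
        // so no reducer logic needs to be written.
        job.setReducerClass(Reducer.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}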
Finally, when I say that having multiple mappers generate a single output is not possible, I mean it is not possible with plain files in HDFS. You could do this with other types of output, like having all mappers write to a single database. This is OK if your mappers are not generating much output. Details on how this would work are available here.
cabad is correct for the most part. However, if you want to process the file with a single mapper and produce a single output file, you could use a FileInputFormat that marks the file as not splittable, and also set the number of reducers to 0. This gives up the benefit of processing on multiple data nodes but skips the shuffle and sort phase.
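
A sketch of such an input format (the subclass name is hypothetical; assuming line-oriented text input with the new mapreduce API):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Treats every input file as a single split, so one mapper reads the whole file
// and, with zero reducers, writes one output file per input file.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

The driver would then call job.setInputFormatClass(NonSplittableTextInputFormat.class) and job.setNumReduceTasks(0).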

How can I work with a large number of small files in Hadoop?

I am new to Hadoop and I'm working with a large number of small files in the wordcount example.
It takes a lot of map tasks, which slows down my execution.
How can I reduce the number of map tasks?
If the best solution to my problem is concatenating the small files into a larger file, how can I do that?
If you're using something like TextInputFormat, the problem is that each file has at least one split, so the number of maps is at least the number of files; in your case, with many very small files, you end up with many mappers that each process very little data.
To remedy that, you should use CombineFileInputFormat, which packs multiple files into the same split (I think up to the block size limit). With that format, the number of mappers is independent of the number of files; it simply depends on the amount of data.
You will have to create your own input format by extending CombineFileInputFormat; you can find an implementation here. Once you have your InputFormat defined, let's call it CombinedInputFormat as in the link, you can tell your job to use it by doing:
job.setInputFormatClass(CombinedInputFormat.class);
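For reference, a minimal sketch of what such an input format might look like, assuming the new mapreduce API, line-oriented text input, and Hadoop 2's CombineFileRecordReaderWrapper (the split-size value is illustrative, and this is not the exact code from the linked example):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    public CombinedInputFormat() {
        // Cap each combined split at 128MB so the number of mappers tracks the
        // amount of data rather than the number of files (value is illustrative).
        setMaxSplitSize(128 * 1024 * 1024);
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException {
        // Delegates each file in the combined split to a plain text record reader.
        return new CombineFileRecordReader<LongWritable, Text>(
                (CombineFileSplit) split, context, TextRecordReaderWrapper.class);
    }

    // Reads one underlying file of the combined split with a standard TextInputFormat reader.
    public static class TextRecordReaderWrapper extends CombineFileRecordReaderWrapper<LongWritable, Text> {
        public TextRecordReaderWrapper(CombineFileSplit split, TaskAttemptContext context, Integer idx)
                throws IOException, InterruptedException {
            super(new TextInputFormat(), split, context, idx);
        }
    }
}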
Cloudera posted a blog entry on the small files problem some time back. It's an old entry, but the suggested method still applies.

Writing to the same file from two mappers

In Hadoop MR (basically HDFS), is it possible to write to the same file from two mappers belonging to a single job in synchronous/serialized fashion?
What about writing to a single file from two mappers running in different jobs, in a serialized fashion?
There are semaphores in other filesystems. What is the mechanism in HDFS?
There is no communication between the map tasks in Hadoop, so some sort of synchronization between them is not possible.
Files in HDFS may be written by a single writer at a time, while many readers can read them.
I think MapR allows multiple writers to the same file.
FYI, a file can only be appended to at the end; modifications at arbitrary offsets are not possible.
Just curious, what is the use case for multiple map tasks writing to a single file?
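To illustrate the single-writer semantics, a small sketch (the path is hypothetical, the file must already exist, and append support must be enabled on the cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/shared-output.txt"); // hypothetical existing file
        // append() acquires the lease on the file; it fails if another client
        // already holds it, which is why two mappers cannot write to it concurrently.
        try (FSDataOutputStream out = fs.append(path)) {
            out.writeBytes("one more record\n");
        }
    }
}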
Set the number of reducers to 1 (mapred.reduce.tasks=1).
