Can HDFS block size be changed during job run? Custom Split and Variant Size - hadoop

I am using Hadoop 1.0.3. Can the input split or block size be changed (increased or decreased) at run time based on some constraints? Is there a class to override to accomplish this, such as FileSplit or TextInputFormat? Can we have variable-size blocks in HDFS depending on a logical constraint within one job?

You're not limited to TextInputFormat. That's entirely configurable based on the data source you are reading. Most examples use line-delimited plaintext, but that obviously doesn't work for XML, for example.
No, block boundaries can't change during runtime: your data should already be on disk and ready to read.
The InputSplit, however, depends on the InputFormat for the given job, which should remain consistent throughout that job. The Configuration object in the code is basically a HashMap, though, so that can be changed while running, sure.

If you want to change the block size only for a particular run or application, you can override it by passing "-D dfs.block.size=134217728" on the command line. This changes the block size used for files your application writes, instead of changing the overall block size in hdfs-site.xml.
-D dfs.block.size=134217728

Related

Setting Mappers of desired numbers

I have gone through a lot of posts on Stack Overflow and also the Apache wiki to learn how the number of mappers is set in Hadoop. I also went through this post: "hadoop - how total mappers are determined".
Some say it's based on the InputFormat, and some posts say it's based on the number of blocks the input file is split into.
Somehow I am confused by the default setting.
When I run a wordcount example I see as few as 2 mappers. What is really happening in this setting? Also, in this example program (http://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/QuasiMonteCarlo.java) the number of mappers is set based on user input. How can one set this manually?
I would really appreciate some help understanding how mappers work.
Thanks in advance
Use the Hadoop configuration properties mapred.min.split.size and mapred.max.split.size (passed with -D) to guide Hadoop toward the split size you want. This won't always work, particularly when your data is in a compression format that is not splittable (e.g. gzip; bzip2, by contrast, is splittable).
So if you want more mappers, use a smaller split size. Simple!
(Updated as requested) This won't work well for many small files; in particular, you'll end up with more mappers than you want. For that situation use CombineFileInputFormat. For Scalding, this SO question explains how: "Create Scalding Source like TextLine that combines multiple files into single mappers".
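To make the split-size advice above concrete, here is a minimal sketch of the arithmetic, with no Hadoop dependency. It assumes the rule used by the newer-API FileInputFormat, which clamps the block size between the minimum and maximum split size; the older mapred API in Hadoop 1.x uses a "goal size" derived from the requested number of map tasks in place of the maximum. The class and method names are hypothetical, and the split count is only an estimate for a single non-empty, splittable file.

```java
// Sketch only: mirrors the split-size arithmetic, not Hadoop's actual classes.
class SplitSizeSketch {

    // Clamp the HDFS block size between the configured min and max split size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Rough number of splits (and therefore map tasks) for one splittable file.
    // Ignores the small slop Hadoop allows on the last split.
    static long estimateSplits(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;   // assumed block size
        long fileLength = 200 * mb; // assumed input file size

        // No limits set: the split size equals the block size.
        long defaultSplit = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        // Max split size lowered to 16 MB: smaller splits, more mappers.
        long smallSplit = computeSplitSize(blockSize, 1L, 16 * mb);

        System.out.println(estimateSplits(fileLength, defaultSplit)); // 4
        System.out.println(estimateSplits(fileLength, smallSplit));   // 13
    }
}
```

Raising the minimum split size above the block size has the opposite effect: fewer, larger splits and therefore fewer mappers.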

how output files(part-m-0001/part-r-0001) are created in map reduce

I understand that MapReduce output is stored in files named like part-r-* for reducers and part-m-* for mappers.
When I run a MapReduce job, sometimes I get the whole output in a single file (around 150 MB), and sometimes, for almost the same data size, I get two output files (one 100 MB and the other 50 MB). This seems very random to me, and I can't find any reason for it.
I want to know how it is decided whether to put the data in a single output file or in multiple files, and whether there is any way to control it.
Thanks
Contrary to the answer by Jijo here, the number of files depends on the number of Reducers/Mappers.
It has nothing to do with the number of physical nodes in the cluster.
The rule is: one part-r-* file for one Reducer. The number of Reducers is set by job.setNumReduceTasks();
If there are no Reducers in your job, then there is one part-m-* file per Mapper. There is one Mapper per InputSplit (and usually, unless you use a custom InputFormat implementation, one InputSplit per HDFS block of your input data).
The number of output files part-m-* and part-r-* is set according to the number of map tasks and the number of reduce tasks respectively.
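The rule above can be sketched as a small helper that lists the part files a job would be expected to produce. This is an illustration only, with hypothetical names and no Hadoop dependency; it assumes the five-digit, zero-based numbering used by the newer-API file output formats (part-r-00000, part-r-00001, and so on).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only: predicts output file names from task counts.
class PartFileNames {

    // With reducers: one part-r-* file per reduce task.
    // Map-only job (zero reducers): one part-m-* file per map task.
    static List<String> expectedOutputFiles(int numMapTasks, int numReduceTasks) {
        boolean mapOnly = numReduceTasks == 0;
        int count = mapOnly ? numMapTasks : numReduceTasks;
        String kind = mapOnly ? "m" : "r";

        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            names.add(String.format(Locale.ROOT, "part-%s-%05d", kind, i));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(expectedOutputFiles(3, 2)); // [part-r-00000, part-r-00001]
        System.out.println(expectedOutputFiles(3, 0)); // [part-m-00000, part-m-00001, part-m-00002]
    }
}
```

So a job that sometimes yields one file and sometimes two is most likely running with a different number of reduce tasks (or, for a map-only job, a different number of input splits) between runs.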

Reading contents of blocks directly in a datanode

In HDFS, the blocks are distributed among the active nodes/slaves. The content of the blocks is simple text, so is there any way to view, read, or access the blocks present on each data node?
As an entire file or to read a single block (say block number 3) out of sequence?
You can read the file via various mechanisms including the Java API but you cannot start reading in the middle of the file (for example at the start of block 3).
Hadoop reads a block of data and feeds each line to the mapper for further processing. Also, the Hadoop client gets the blocks belonging to a file from different DataNodes before concatenating them. So it should be possible to get the data from a particular block.
The Hadoop client code might be a good place to start looking. But HDFS provides a file system abstraction, so I'm not sure what the requirement would be for reading the data from a particular block.
Assuming you have ssh access (and appropriate permissions) to the datanodes, you can cd to the path where the blocks are stored and read the blocks stored on that node (e.g., do a cat BLOCK_XXXX). The configuration parameter that tells you where the blocks are stored is dfs.datanode.data.dir, which defaults to file://${hadoop.tmp.dir}/dfs/data. More details here.
Caveat: the block names are coded by HDFS depending on their internal block ID. Just by looking at their names, you cannot know to which file a block belongs.
Finally, I assume you want to do this for debugging purposes or just to satisfy your curiosity. Normally, there is no reason to do this and you should just use the HDFS web-UI or command-line tools to look at the contents of your files.
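For the "block number 3" case mentioned above, it can help to know which bytes of the file a given block covers. The following is a minimal sketch of that arithmetic only, with hypothetical names and no Hadoop dependency. It assumes every block except the last is exactly one block size long, which is the usual case for a file written once with a fixed block size.

```java
// Sketch only: maps a zero-based block index to a byte range within a file.
class BlockRange {

    // Returns {startOffset, length} for the given block of the file.
    static long[] rangeOf(long fileLength, long blockSize, int blockIndex) {
        long start = (long) blockIndex * blockSize;
        if (blockIndex < 0 || start >= fileLength) {
            throw new IllegalArgumentException("no such block: " + blockIndex);
        }
        // The last block may be shorter than the block size.
        long length = Math.min(blockSize, fileLength - start);
        return new long[] {start, length};
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Third block (index 2) of a 150 MB file stored in 64 MB blocks.
        long[] range = rangeOf(150 * mb, 64 * mb, 2);
        System.out.println(range[0] + " " + range[1]); // 134217728 23068672
    }
}
```

Note that a text record can straddle a block boundary, so the bytes of a single block do not necessarily begin or end on a line break.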

hadoop - How can i use data in memory as input format?

I'm writing a MapReduce job, and the input that I want to pass to the mappers is in memory.
The usual way to pass input to the mappers is via HDFS, using SequenceFileInputFormat or TextInputFormat. These input formats need files in HDFS, which are loaded and split among the mappers.
I can't find a simple way to pass, let's say, a List of elements to the mappers.
I find myself having to write these elements to disk and then use FileInputFormat.
Is there any solution?
I'm writing the code in Java, of course.
thanks.
An input format does not have to load data from the disk or a file system.
There are also input formats that read data from other systems, such as HBase (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapred/TableInputFormat.html), where the data is not assumed to sit on disk. It is only assumed to be available via some API on all nodes of the cluster.
So you need to implement an input format that splits the data using your own logic (since there are no files, that is your own task) and chops the data into records.
Please note that your in-memory data source should be distributed and run on all nodes of the cluster. You will also need some efficient IPC mechanism to pass data from your process to the Mapper process.
I would also be glad to know what use case leads to this unusual requirement.
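The "split the data using your own logic" step might look roughly like the sketch below. It shows only the partitioning of an in-memory List into near-equal chunks, one per intended mapper; it is not a Hadoop InputFormat, and the class and method names are hypothetical. A real implementation would still have to wrap each chunk in an InputSplit and make its contents reachable from the node that runs the mapper, as noted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: the partitioning logic a custom input format would need.
class ListSplitter {

    // Divides items into at most numSplits contiguous chunks whose sizes
    // differ by at most one element.
    static <T> List<List<T>> split(List<T> items, int numSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("numSplits must be positive");
        }
        List<List<T>> splits = new ArrayList<>();
        int base = items.size() / numSplits;  // minimum chunk size
        int extra = items.size() % numSplits; // first 'extra' chunks get one more
        int from = 0;
        for (int i = 0; i < numSplits; i++) {
            int size = base + (i < extra ? 1 : 0);
            if (size == 0) {
                break; // fewer items than requested splits
            }
            splits.add(new ArrayList<>(items.subList(from, from + size)));
            from += size;
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        System.out.println(split(data, 3)); // [[1, 2, 3], [4, 5], [6, 7]]
    }
}
```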

Writing to single file from mappers

I am working on a MapReduce job that generates a CSV file out of data read from HBase. Is there a way to write to a single file from the mappers without a reduce phase (or to merge the multiple files generated by the mappers at the end of the job)? I know that I can set the output format to write to a file at the Job level; is it possible to do a similar thing for mappers?
Thanks
It is possible (and not uncommon) to have a Map/Reduce-Job without a reduce phase (example). For that you just use job.setNumReduceTasks(0).
However, I am not sure how job output is handled in this case. Usually you get one result file per reducer. Without reducers I could imagine that you either get one file per mapper or that you cannot produce job output. You will have to try or research that.
If the above does not work for you, you could still use the default Reducer implementation, that just forwards the mapper output (identity function).
Seriously, this is not how MapReduce works.
Why do you even need a Job for that? Write a simple Java application that does the same for you. There are also command-line utilities that do the same for you.
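As one possible reading of the "simple Java application" suggestion, the sketch below concatenates part files into a single file after the job has finished. It is an assumption-laden illustration: it works on a local directory (so the part-* files are assumed to have been copied out of HDFS first), uses only java.nio, and the class and method names are hypothetical. On the command line, hadoop fs -getmerge does a similar merge directly from HDFS.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch only: merges local copies of part-* files into one output file.
class PartFileMerger {

    // Appends every part-* file in dir to target, in name order.
    // The target should not itself be a part-* file inside dir.
    // Returns the number of part files merged.
    static int merge(Path dir, Path target) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "part-*")) {
            for (Path part : stream) {
                parts.add(part);
            }
        }
        Collections.sort(parts); // part-m-00000, part-m-00001, ...

        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // stream the file's bytes to the target
            }
        }
        return parts.size();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PartFileMerger <directory-with-part-files> <merged-output-file>
        int merged = merge(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Merged " + merged + " part files");
    }
}
```

Because each mapper writes its own CSV rows, any header line would have to be added once by the merging step rather than by every mapper.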

Resources