I want to set the task JVM size (for both Map and Reduce tasks). It can be done using the property mapred.child.java.opts, but my concern is where I need to set it. Can I set it using the -D option while submitting the job, or do I need to set this property in each node's mapred-site.xml?
Thanks,
Priyaranjan
You can use
-Dmapred.child.java.opts='-Xmx1024m'
on the command line to set each task's maximum heap size to 1024 MiB.
Similarly in Java code of the job, you can set it as a configuration parameter:
conf.set("mapred.child.java.opts", "-Xmx1024m");
Like this:
hadoop jar your.jar package.MainClass -Dmapred.child.java.opts=blar some more args
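Note that the -D option is only picked up automatically if your main class runs through ToolRunner (which uses GenericOptionsParser internally). A minimal driver sketch, with hypothetical class and job names, showing where the option lands and where you could set it in code instead:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MainClass extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains anything passed with -D on the command line
        Configuration conf = getConf();
        // Alternatively, set (or override) the task JVM options in code:
        // conf.set("mapred.child.java.opts", "-Xmx1024m");
        Job job = new Job(conf, "example job");
        job.setJarByClass(MainClass.class);
        // ... mapper, reducer, input/output paths go here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MainClass(), args));
    }
}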
What are the priorities of the following three options for setting the number of reducers? In other words, if all three are set, which one will be taken into account?
Option 1:
setNumReduceTasks(2) within the application code
Option 2:
-D mapreduce.job.reduces=2 as a command-line argument
Option 3:
through the $HADOOP_CONF_DIR/mapred-site.xml file
<property>
<name>mapreduce.job.reduces</name>
<value>2</value>
</property>
According to Hadoop: The Definitive Guide:
The -D option is used to set the configuration property with key color to the value
yellow. Options specified with -D take priority over properties from the configuration
files. This is very useful because you can put defaults into configuration files and then
override them with the -D option as needed. A common example of this is setting the
number of reducers for a MapReduce job via -D mapred.reduce.tasks=n. This will
override the number of reducers set on the cluster or set in any client-side configuration
files.
You have them listed in priority order: Option 1 will override Option 2, and Option 2 will override Option 3. In other words, Option 1 will be the one used by your job in this scenario.
First Priority: Passing configuration parameters through command line (while submitting MR Application)
Second Priority: Setting configuration parameters in application code
Third Priority: Otherwise, defaults are read from the cluster's configuration files such as core-site.xml, hdfs-site.xml and mapred-site.xml (along with environment and logging settings from hadoop-env.sh and log4j.properties)
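To see how the ordering actually plays out with a ToolRunner-based driver, here is a fragment sketch (class and job names hypothetical, like the driver sketch earlier): GenericOptionsParser applies the -D values to the Configuration before run() is invoked, so a setNumReduceTasks() call inside run() executes afterwards.

@Override
public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    // value coming from -D or mapred-site.xml, if any
    System.out.println("mapreduce.job.reduces = " + conf.get("mapreduce.job.reduces"));
    Job job = new Job(conf, "reducer-priority demo");
    job.setNumReduceTasks(2);  // applied last, so this is the value the job runs with
    // ... rest of the job setup ...
    return job.waitForCompletion(true) ? 0 : 1;
}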
Does anyone know of a tool that can "crunch" the output files of Apache Hadoop into fewer files or a single file? Currently I am downloading all the files to a local machine and then concatenating them into one file. So does anyone know of an API or a tool that does the same?
Thanks in advance.
Limiting the number of output files means limiting the number of reducers. You could do that with the help of the mapred.reduce.tasks property from the Hive shell. Example:
hive> set mapred.reduce.tasks = 5;
But it might affect the performance of your query. Alternatively, you could use the getmerge command from the HDFS shell once you are done with your query. This command takes a source directory and a destination file as input and concatenates the files in the source directory into the destination local file.
Usage :
bin/hadoop fs -getmerge <src> <localdst>
HTH
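If you would rather do it from code than from the shell, FileUtil.copyMerge is roughly the programmatic equivalent of getmerge (it is available in Hadoop 1.x/2.x but was removed in Hadoop 3, where you would loop over listStatus() and copy the streams yourself). A sketch with hypothetical paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeOutputs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);
        // Concatenate every file under the job's output directory into one local file.
        FileUtil.copyMerge(hdfs, new Path("/user/me/job-output"),
                           local, new Path("/tmp/merged-output.txt"),
                           false,  // do not delete the source files
                           conf,
                           null);  // no separator string between files
    }
}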
See https://community.cloudera.com/t5/Support-Questions/Hive-Multiple-Small-Files/td-p/204038
set hive.merge.mapfiles=true; -- Merge small files at the end of a map-only job.
set hive.merge.mapredfiles=true; -- Merge small files at the end of a map-reduce job.
set hive.merge.size.per.task=???; -- Size (bytes) of merged files at the end of the job.
set hive.merge.smallfiles.avgsize=???; -- File size (bytes) threshold
-- When the average output file size of a job is less than this number,
-- Hive will start an additional map-reduce job to merge the output files
-- into bigger files. This is only done for map-only jobs if hive.merge.mapfiles
-- is true, and for map-reduce jobs if hive.merge.mapredfiles is true.
I need some help with my map-reduce code.
The code runs perfectly in Eclipse and in standalone mode, but when I package the code and try running it locally in pseudo-distributed mode, the output is not what I expect.
Map input records = 11
Map output records = 11
Reduce input records = 11
Reduce output records = 0
These are the values I get, whereas when I run the same code in Eclipse or in standalone mode with the same config and input file I get:
Map input records = 11
Map output records = 11
Reduce input records = 11
Reduce output records = 4
Can anyone tell me what's wrong?
I tried both ways of building the .jar file: from Eclipse -> Export -> Runnable JAR, and from the terminal as well (javac -classpath hadoop-core-1.0.4.jar -d classes mapredcode.java && jar -cvf mapredcode.jar -C classes/ .).
And how do I debug this?
Are you using a combiner?
And if yes, is the output of the combiner in the same format as that of the mapper?
In Hadoop the combiner is run at the framework's discretion, and it may not be running at all in pseudo-distributed mode in your case.
The combiner is in itself nothing but a reducer that is used to lower the network traffic.
The code should be written so that even if the combiner is not run, the reducer still gets the expected format from the mapper.
Hope it helps.
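For reference, the combiner is usually wired in like the fragment below (MyMapper/MyReducer are hypothetical class names). Because Hadoop may run it zero, one, or many times, its input and output key/value types must both match the mapper's output types:

// hypothetical job setup; MyReducer doubles as the combiner
// (Text and IntWritable are from org.apache.hadoop.io)
job.setMapperClass(MyMapper.class);      // emits <Text, IntWritable>
job.setCombinerClass(MyReducer.class);   // must also emit <Text, IntWritable>
job.setReducerClass(MyReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);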
I want to pass multiple configuration parameters to my Hadoop job through GenericOptionsParser.
With "-D abc=xyz" I can pass one argument and able to retrieve the same from the configuration object but I am not able to pass the multiple argument.
Is it possible to pass multiple argument?If yes how?
I passed the parameters as -D color=yellow -D number=10
and had the following code in the run() method:
String color = getConf().get("color");
System.out.println("color = " + color);
String number = getConf().get("number");
System.out.println("number = " + number);
The following was the output in the console:
color = yellow
number = 10
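If you are not using ToolRunner, you can get the same behaviour by invoking GenericOptionsParser yourself. A minimal sketch (class name hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

public class MultiParamExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // consumes the generic options (-D, -files, -libjars, ...) and returns the rest
        String[] remainingArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        System.out.println("color = " + conf.get("color"));
        System.out.println("number = " + conf.get("number"));
        System.out.println("application args = " + remainingArgs.length);
    }
}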
I recently ran into this issue after upgrading from Hadoop 1.2.1 to Hadoop 2.4.1. The problem is that Hadoop's dependency on commons-cli 1.2 was being omitted due to a conflict with commons-cli 1.1 that was pulled in from Cassandra 2.0.5.
After a quick look through the source it looks like commons-cli options that have an uninitialized number of values (what Hadoop's GenericOptionsParser does) default to a limit of 1 in version 1.1 and no limit in 1.2.
I hope that helps!
I tested passing multiple parameters and I used the -D flag multiple times.
$HADOOP_HOME/bin/hadoop jar /path/to/my.jar -D mapred.heartbeats.in.second=80 -D mapred.map.max.attempts=2 ...
Doing this changed the values to what I specified in the Job's configuration.
My map is currently inefficient when parsing one particular set of files (a total of 2 TB). I'd like to change the block size of those files in the Hadoop DFS (from 64 MB to 128 MB), but I can't find anything in the documentation about doing this for only one set of files rather than the entire cluster.
Which command changes the block size when I upload (for example, when copying from local to the DFS)?
For me, I had to slightly change Bkkbrad's answer to get it to work with my setup, in case anyone else finds this question later on. I've got Hadoop 0.20 running on Ubuntu 10.10:
hadoop fs -D dfs.block.size=134217728 -put local_name remote_location
The setting for me is not fs.local.block.size but rather dfs.block.size.
I've changed my answer! You just need to set the fs.local.block.size configuration setting appropriately when you use the command line.
hadoop fs -D fs.local.block.size=134217728 -put local_name remote_location
Original Answer
You can programmatically specify the block size when you create a file with the Hadoop API. Unfortunately, you can't do this on the command line with the hadoop fs -put command. To do what you want, you'll have to write your own code to copy the local file to a remote location; it's not hard: just open a FileInputStream for the local file, create the remote OutputStream with FileSystem.create, and then use something like IOUtils.copy from Apache Commons IO to copy between the two streams.
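A minimal sketch of that approach, using Hadoop's own IOUtils.copyBytes instead of Commons IO (the paths, buffer size and block size are just illustrative):

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class PutWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 128L * 1024 * 1024;  // 128 MB
        int bufferSize = 4096;
        short replication = fs.getDefaultReplication();

        InputStream in = new FileInputStream("local_name");
        FSDataOutputStream out = fs.create(new Path("remote_location"),
                                           true,        // overwrite
                                           bufferSize,
                                           replication,
                                           blockSize);
        // copyBytes closes both streams when the last argument is true
        IOUtils.copyBytes(in, out, bufferSize, true);
    }
}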
In the conf/ folder we can change the value of dfs.block.size in the configuration file hdfs-site.xml.
In Hadoop version 1.0 the default size is 64 MB and in version 2.0 the default size is 128 MB.
<property>
<name>dfs.block.size</name>
<value>134217728</value>
<description>Block size</description>
</property>
You can also modify the block size in your programs like this:
Configuration conf = new Configuration();
conf.setLong("dfs.block.size", 128 * 1024 * 1024);  // 134217728 bytes = 128 MB
We can change the block size using the property named dfs.block.size in the hdfs-site.xml file.
Note:
The size should be specified in bytes.
For example:
134217728 bytes = 128 MB.
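If you want to confirm the block size a file was actually written with, you can read it back through the FileStatus API (the path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/me/remote_location"));
        System.out.println("block size in bytes: " + status.getBlockSize());
    }
}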