How can I have a 66MB job config in a job tracker while jobconf.limit is set to 5MB? - hadoop

How can I have a 66MB job config in a job tracker while mapred.user.jobconf.limit is set to 5MB?
$ ls -lh /mapred/jt/jobTracker/job_201309061800_0037.xml
-rwxr-xr-x 1 mapred mapred 66M Sep 6 22:21 /mapred/jt/jobTracker/job_201309061800_0037.xml
$ cat /mapred/jt/jobTracker/job_201309061800_0037.xml | grep mapred.user.jobconf.limit
<property><name>mapred.user.jobconf.limit</name><value>5242880</value><source>mapred-default.xml</source></property>

You only showed the configuration sent from the client (job_201309061800_0037.xml). That configuration applies only to the current job and has no effect on the JobTracker. You need to check mapred-default.xml (and any override in mapred-site.xml) on your JobTracker.
The JobTracker reads mapred.user.jobconf.limit when it initializes. After that, the value held in memory (MAX_JOBCONF_SIZE in JobTracker) does not change. You can check the code here: http://www.grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/0.20.2-cdh3u1/org/apache/hadoop/mapred/JobTracker.java#158
Admittedly, Hadoop does not provide a mechanism to indicate which configuration options can be set per job and which cannot. My current approach is to search for the configuration key in the Hadoop source code and find out how Hadoop uses it.
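For example, to see which limit the JobTracker itself was started with, you could grep its configuration directory on the JobTracker host; a minimal sketch, assuming the common /etc/hadoop/conf location (adjust to your installation):
grep -A 1 'mapred.user.jobconf.limit' /etc/hadoop/conf/mapred-site.xml
If the property is not set there, the JobTracker falls back to the 5242880-byte (5MB) default from the bundled mapred-default.xml, and since the value is read only at initialization, the JobTracker must be restarted after changing it.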

Related

Hadoop Nodemanager failing with error Can't get group information

I have a Kerberos-configured Apache Hadoop (2.8.5) installation. The NameNode, DataNode and ResourceManager are running fine, but the NodeManager fails to start with the error:
Can't get group information for hadoop#configured value of yarn.nodemanager.linux-container-executor.group - Success.
file permissions:
container-executor.cfg: -rw------- 1 root hadoop
container-executor: ---Sr-s--- 1 root hadoop
container-executor.cfg
yarn.nodemanager.local-dirs=/hadoop/data/yarn/local
yarn.nodemanager.linux-container-executor.group=hadoop#configured value
of yarn.nodemanager.linux-container-executor.group
banned.users=hdfs,yarn,mapred,bin,root#comma separated list of users who can not run applications
min.user.id=1000#Prevent other super-users
Simply remove the trailing comment:
#configured value of yarn.nodemanager.linux-container-executor.group
from the configuration line:
yarn.nodemanager.linux-container-executor.group
in the container-executor.cfg file.
It should look like this:
yarn.nodemanager.local-dirs=/hadoop/data/yarn/local
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred,bin,root
min.user.id=1000
This configuration file has had historical problems with spaces, comments, and similar formatting issues.
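If the error persists after removing the comment, it may also help to confirm that the configured group actually exists on the NodeManager host and that the container-executor binary keeps the root:hadoop ownership and permissions shown in the question; a quick sketch (the binary path is an assumption and varies by distribution):
getent group hadoop
ls -l $HADOOP_HOME/bin/container-executor   # expected: ---Sr-s--- 1 root hadoop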

Namenode and Jobtracker information on Hadoop cluster

How can I get the following information on the Hadoop cluster?
1. NameNode and JobTracker name
2. a list of all nodes with their roles on the cluster
To get the NameNode info:
hdfs getconf -confKey fs.defaultFS
For the JobTracker (on YARN, the ResourceManager):
hdfs getconf -confKey yarn.resourcemanager.address.rm2
I am using a Cloudera-based cluster and also working on EMR.
In both clusters I can find the information in the configuration directory.
To get the NameNode information, open the core-site.xml file and look for fs.defaultFS, as #daemon12 said.
Here is the straight way to get it.
For the NameNode information, use the command below:
cat /etc/hadoop/conf/core-site.xml | grep '8020'
Here is the result:
<value>hdfs://10.872.22.1:8020</value>
The value inside the value tag is the NameNode information.
Similarly, to get the JobTracker information, run:
cat /etc/hadoop/conf/yarn-site.xml | grep '8032'
Here is the result:
<value>10.872.12.32:8032</value>
Again, the JobTracker value is inside the value tag.
Generally, the NN and JT information is used to run Oozie jobs, and this method will help you for that purpose.
DISCLAIMER: I am grepping the output of cat based on the NameNode and JobTracker port numbers, which are 8020 and 8032 respectively. These are the widely known default ports for the NN and JT in Hadoop. If your organization uses different ports, use those to get a more accurate result.
Along with the command-line way of getting information, you can get similar information in the browser as well:
http://<namenode>:50070 (for general Hadoop/HDFS information)
http://<jobtracker>:50030 (for JobTracker-related information)
These are the default ports. You can check here for more information.
With the correct authorization (e.g. running as the HDFS superuser via sudo -u hdfs), you may try:
hdfs dfsadmin -report
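For the second part of the question (listing all nodes with their roles), a minimal sketch, assuming a YARN cluster and sufficient permissions:
hdfs dfsadmin -report   # lists the DataNodes known to the NameNode
yarn node -list -all    # lists the NodeManagers known to the ResourceManager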

Get a yarn configuration from commandline

In EMR, is there a way to get a specific value of the configuration given the configuration key using the yarn command?
For example, I would like to do something like this:
yarn get-config yarn.scheduler.maximum-allocation-mb
It's a bit non-intuitive, but it turns out the hdfs getconf command is capable of checking configuration properties for YARN and MapReduce, not only HDFS.
> hdfs getconf -confKey fs.defaultFS
hdfs://localhost:19000
> hdfs getconf -confKey dfs.namenode.name.dir
file:///Users/chris/hadoop-deploy-trunk/data/dfs/name
> hdfs getconf -confKey yarn.resourcemanager.address
0.0.0.0:8032
> hdfs getconf -confKey mapreduce.framework.name
yarn
A benefit of using this is that you'll see the actual, final values of the configuration properties as Hadoop uses them. This accounts for some of the more advanced configuration patterns, such as the use of XInclude in the XML files or property substitution, like this:
<property>
<description>The address of the applications manager interface in the RM.</description>
<name>yarn.resourcemanager.address</name>
<value>${yarn.resourcemanager.hostname}:8032</value>
</property>
Any scripting approach that tries to parse the XML files directly is unlikely to accurately match the implementation as it's done inside Hadoop, so it's better to ask Hadoop itself.
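So, for the value asked about in the question, a sketch of the scripted approach (assuming the property is defined on the node where you run this):
max_alloc_mb=$(hdfs getconf -confKey yarn.scheduler.maximum-allocation-mb)
echo "yarn.scheduler.maximum-allocation-mb=${max_alloc_mb}"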
You might be wondering why an hdfs command can get configuration properties for YARN and MapReduce. Great question! It's somewhat of a coincidence of the implementation: it needs to inject an instance of MapReduce's JobConf into some objects created via reflection. The relevant code is visible here:
https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ReflectionUtils.java#L82-L114
This code is executed as part of running the hdfs getconf command. By triggering a reference to JobConf, it forces class loading and static initialization of the relevant MapReduce and YARN classes that add yarn-default.xml, yarn-site.xml, mapred-default.xml and mapred-site.xml to the set of configuration files in effect.
Since it's a coincidence of the implementation, it's possible that some of this behavior will change in future versions, but it would be a backwards-incompatible change, so we definitely wouldn't change that behavior inside the current Hadoop 2.x line. The Apache Hadoop Compatibility policy commits to backwards-compatibility within a major version line, so you can trust that this will continue working at least within the 2.x version line.

I executed a Hadoop MapReduce program successfully. Can someone tell me how to see the output through a browser, like <http://localhost:port/hdfsLocation/>?

I executed a Hadoop MapReduce program successfully in CDH4, but where can I see my output? Can someone tell me how to see the output through a browser? It would be helpful to me.
On the terminal:
hadoop dfs -ls /inputfile
It will give a result like:
Found 2 items
-rw-r--r-- 3 user17 supergroup 0 2014-11-27 16:47 /inputfile/_SUCCESS
-rw-r--r-- 3 user17 supergroup 24441 2014-11-27 16:47 /inputfile/part-00000
hadoop dfs -cat /inputfile/part-00000
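If the output file is large, you can also copy it to the local filesystem instead of printing it to the terminal; a sketch, using the same output path as above:
hadoop dfs -get /inputfile/part-00000 ./part-00000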
The NameNode and DataNode each run an internal web server in order to display basic information about the current status of the cluster. With the default configuration, the NameNode front page is at http://namenode-name:50070/. It lists the DataNodes in the cluster and basic statistics of the cluster. The web interface can also be used to browse the file system (using the "Browse the file system" link on the NameNode front page).
If you want to see the output on the web, please see http://gethue.com/#

Hadoop fsck shows missing replicas

I am running a Hadoop 2.2.0 cluster with two DataNodes and one NameNode. When I check the filesystem with the hadoop fsck command on the NameNode or on either DataNode, I get the following:
Target Replicas is 3 but found 2 replica(s).
I tried changing the configuration in hdfs-site.xml (dfs.replication to 2) and restarted the cluster services. On running hadoop fsck / it still shows the same status:
Target Replicas is 3 but found 2 replica(s).
Please clarify: is this a caching issue or a bug?
Setting dfs.replication does not lower the replication factor of existing files. This property is consulted only when a file is created without an explicitly specified replication factor. To change the replication of existing files, the following Hadoop utility can be used:
hadoop fs -setrep [-R] [-w] <rep> <path/file>
or
hdfs dfs -setrep [-R] [-w] <rep> <path/file>
Here, / can also be given as the path to change the replication factor of the complete filesystem.
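For instance, to bring every existing file down to the replication factor of 2 from the question (the -w flag waits until the target replication is reached, which can take a while on a large cluster):
hdfs dfs -setrep -w 2 /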
