Hadoop jobs fail when submitted by users other than yarn (MRv2) or mapred (MRv1) - hadoop

I am running a test cluster with MRv1 (CDH5) paired with LocalFileSystem, and the only user I can run jobs as is mapred (mapred is the user that starts the jobtracker/tasktracker daemons). When jobs are submitted as any other user, they fail because the jobtracker/tasktracker is unable to find job.jar under the .staging directory.
I have the exact same issue with YARN (MRv2) when paired with LocalFileSystem, i.e. when submitting jobs by a user other than 'yarn', the application master is unable to locate the job.jar under the .staging directory.
Upon inspecting the .staging directory of the user submitting the job, I found that job.jar exists in the job's subdirectory under .staging, but the permissions on that subdirectory and on .staging itself are set to 700 (drwx------), so the application master / tasktracker cannot access job.jar and the supporting files.
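The symptom can be reproduced outside Hadoop. The following is a minimal sketch, assuming a scratch directory that merely stands in for a user's .staging directory; the path is illustrative and is not one that Hadoop creates.

```shell
#!/usr/bin/env bash
# Stand-in for a user's .staging directory (hypothetical path).
scratch=$(mktemp -d)
staging="$scratch/.staging"

# Create it with mode 700, the mode observed on the real directory.
mkdir -m 700 "$staging"

# The first ten characters of `ls -ld` are the file type and permission bits.
perms=$(ls -ld "$staging" | cut -c1-10)
echo "$perms"

# With these bits only the owning account can list or read the directory,
# so a daemon running as a different non-root user is locked out.
rm -rf "$scratch"
```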
We are running the test cluster with LocalFileSystem because we use only the MapReduce part of the Hadoop project, paired with OCFS, in our production setup.
Any assistance in this regard would be immensely helpful.

You need to set up a staging directory for each user in the cluster. This is not as complicated as it sounds.
Check the following properties:
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-${user.name}</value>
<source>core-default.xml</source>
</property>
This sets up a tmp directory for each user.
Tie this to your staging directory:
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>${hadoop.tmp.dir}/mapred/staging</value>
<source>mapred-default.xml</source>
</property>
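Let me know if this works or if it is already set up this way.
The values above are the defaults. To override them, hadoop.tmp.dir belongs in core-site.xml and mapreduce.jobtracker.staging.root.dir in mapred-site.xml, if I remember correctly.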
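To make the substitution concrete, here is a small sketch of how the two values combine for a given user name. It only builds the string and does not read any Hadoop configuration; the function name and the user names are made up for illustration.

```shell
#!/usr/bin/env bash
# Mimic the ${user.name} and ${hadoop.tmp.dir} substitution from the two
# properties above. staging_root_for is a hypothetical helper, not a Hadoop tool.
staging_root_for() {
  local user=$1
  local hadoop_tmp_dir="/tmp/hadoop-${user}"    # hadoop.tmp.dir
  echo "${hadoop_tmp_dir}/mapred/staging"       # mapreduce.jobtracker.staging.root.dir
}

staging_root_for alice
staging_root_for mapred
```

Each user name yields a different staging root, which is why the expanded value depends on the account of the process that reads the configuration.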

This worked for me, I just set this property in MR v1:
<property>
<name>hadoop.security.authorization</name>
<value>simple</value>
</property>
Please go through this:
Access Control Lists
${HADOOP_CONF_DIR}/hadoop-policy.xml defines an access control list for each Hadoop service. Every access control list has a simple format:
The list of users and the list of groups are both comma-separated lists of names. The two lists are separated by a space.
Example: user1,user2 group1,group2.
Add a blank at the beginning of the line if only a list of groups is to be provided; equivalently, a comma-separated list of users followed by a space or nothing implies only a set of given users.
A special value of * implies that all users are allowed to access the service.
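The format rules above can be sketched as a small parser. This is only an illustration of the rules as described here, not the code Hadoop itself uses; parse_acl is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Split an ACL value into its user list and group list.
parse_acl() {
  local acl=$1 user_list group_list
  # A lone * means every user is allowed.
  if [ "$acl" = "*" ]; then
    echo "users=* groups=*"
    return
  fi
  # Everything before the first space is the user list (empty if the
  # value starts with a blank, i.e. groups only).
  user_list=${acl%% *}
  # Everything after the first space is the group list, if there is a space.
  case "$acl" in
    *" "*) group_list=${acl#* } ;;
    *)     group_list= ;;
  esac
  echo "users=${user_list} groups=${group_list}"
}

parse_acl 'alice,bob mapreduce'
parse_acl ' datanodes'
parse_acl 'alice,bob'
```

The second call shows the leading blank at work: the user list comes back empty and only the group is set.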
Refreshing Service Level Authorization Configuration
The service-level authorization configuration for the NameNode and JobTracker can be changed without restarting either of the Hadoop master daemons. The cluster administrator can change ${HADOOP_CONF_DIR}/hadoop-policy.xml on the master nodes and instruct the NameNode and JobTracker to reload their respective configurations via the -refreshServiceAcl switch to dfsadmin and mradmin commands respectively.
Refresh the service-level authorization configuration for the NameNode:
$ bin/hadoop dfsadmin -refreshServiceAcl
Refresh the service-level authorization configuration for the JobTracker:
$ bin/hadoop mradmin -refreshServiceAcl
Of course, one can use the security.refresh.policy.protocol.acl property in ${HADOOP_CONF_DIR}/hadoop-policy.xml to restrict access to the ability to refresh the service-level authorization configuration to certain users/groups.
Examples
Allow only users alice, bob and users in the mapreduce group to submit jobs to the MapReduce cluster:
<property>
<name>security.job.submission.protocol.acl</name>
<value>alice,bob mapreduce</value>
</property>
Allow only DataNodes running as the users who belong to the group datanodes to communicate with the NameNode:
<property>
<name>security.datanode.protocol.acl</name>
<value> datanodes</value>
</property>
Allow any user to talk to the HDFS cluster as a DFSClient:
<property>
<name>security.client.protocol.acl</name>
<value>*</value>
</property>

Related

Duration of yarn application log in hadoop

I am using the output of the yarn application command in Hadoop to look up details of MapReduce jobs by job name. My cluster uses the HDP distribution. Does anyone know how long the job status remains available? Does it keep track of jobs from the previous few days?
It depends on your cluster configuration. In production setups there is usually a history/archive server that holds the logs of previous runs. In a default YARN configuration, log retention is set to 1 day, so by default one day of logs is preserved.
If the history server is running, its default port is 19888. Check mapred-site.xml for the entry below
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>{job-history-hostname}:19888</value>
</property>
and yarn-site.xml
<property>
<name>yarn.log.server.url</name>
<value>http://{job-history-hostname}:19888/jobhistory/logs</value>
</property>

How to submit MR job to YARN cluster with ResourceManager HA wrt Hortonworks' HDP?

I am trying to understand how to submit a MR job to Hadoop cluster, YARN based.
Case1:
For the case in which there is only one ResourceManager (that is, no HA), we can submit the job like this (which I actually used and believe is correct).
hadoop jar word-count.jar com.example.driver.MainDriver -fs hdfs://master.hadoop.cluster:54310 -jt master.hadoop.cluster:8032 /first/dir/IP_from_hdfs.txt /result/dir
As can be seen, the RM is running on port 8032 and the NN on 54310, and I am specifying the hostname because there is only ONE master.
case2:
Now, for the case when there is HA for both NN and RM, how do I submit the job? I am not able to understand this, because now we have two RM and NN (active / standby), and I understand that there is zookeeper to keep track of failures. So, from client perspective trying to submit a job, do I need to know the exact NN and RM for submitting the job or is there some logical naming which we have to use for submitting the job?
Can anyone please help me understand this?
With or without HA, the command to submit the job remains the same.
hadoop jar <jar> <mainClass> <inputpath> <outputpath> [args]
The -fs and -jt options are optional; they are needed only if you want to specify a NameNode and JobTracker different from the ones in the configuration.
If the fs.defaultFS property in core-site.xml and the properties defining the nameservice (dfs.nameservices) and its namenodes are configured properly in the client's hdfs-site.xml, the active NameNode is chosen automatically whenever a client operation is performed.
By default, the DFS client uses the following Java class to determine which NameNode is currently active.
<property>
<name>dfs.client.failover.proxy.provider.[nameserviceID]</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
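To check which filesystem URI a client will use, one can look at fs.defaultFS in the client's core-site.xml. The sketch below extracts it from a sample file with awk. The file contents, the nameservice name mycluster, and the helper get_conf_value are all assumptions for illustration, and the helper only handles the one-tag-per-line layout used in this thread. On a real cluster, `hdfs getconf -confKey fs.defaultFS` reports the effective value.

```shell
#!/usr/bin/env bash
# Print the <value> that follows a given <name> in a Hadoop-style XML file.
# Assumes <name> and <value> sit on separate lines, as in the snippets above.
get_conf_value() {
  local key=$1 file=$2
  awk -v key="$key" '
    index($0, "<name>" key "</name>") { found = 1; next }
    found && match($0, /<value>.*<\/value>/) {
      # Skip the 7-character <value> tag and drop the 8-character </value> tag.
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$file"
}

# Sample client configuration; "mycluster" is a made-up logical nameservice.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
</configuration>
EOF

default_fs=$(get_conf_value fs.defaultFS "$conf")
echo "$default_fs"
rm -f "$conf"
```

In an HA setup the value is a logical name rather than a host and port, which is why the submit command does not need to name a specific NameNode.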

Hadoop : HDFS Cluster running out of space even though space is available

We have a 4-datanode HDFS cluster. Each data node has a large amount of space available, about 98 GB, but the datanode information shows that it is using only about 10 GB and running out of space.
How can we make it use all 98 GB and not run out of space, as indicated in the image?
this is the disk space configuration
this is the hdfs-site.xml on name node
<property>
<name>dfs.name.dir</name>
<value>/test/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
this is the hdfs-site.xml under data node
<property>
<name>dfs.data.dir</name>
<value>/test/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
Even though /test has 98 GB and HDFS is configured to use it, it is not being used.
Am I missing anything while doing the configuration changes? And how can we make sure 98GB is used?
According to this Hortonworks Community Portal link, the steps to amend your Data Node directory are as follows:
1. Stop the cluster.
2. Go to the Ambari HDFS configuration and edit the datanode directory configuration: remove /hadoop/hdfs/data and /hadoop/hdfs/data1, and add [new directory location].
3. Log in to each datanode (via SSH) and copy the contents of /data and /data1 into the new directory.
4. Change the ownership of the new directory and everything under it to "hdfs".
5. Start the cluster.
I'm assuming that you're technically already up to Step 2, since you've shown your correctly configured hdfs-site.xml files in the original question. Make sure you've done the other steps and that all Hadoop services have been stopped. From there, change the ownership to the user running Hadoop (typically hdfs, but I've worked in a place where root was running the Hadoop processes) and you should be good to go :)
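The copy and ownership steps could be scripted roughly as follows. This is a sketch under stated assumptions: migrate_datanode_dir is a hypothetical helper, the demo uses throwaway directories instead of real DataNode paths, and the chown is skipped unless the script runs as root and the target account exists. Run something like this only while the cluster is stopped.

```shell
#!/usr/bin/env bash
# Copy an old DataNode directory into a new one and hand it to the Hadoop user.
migrate_datanode_dir() {
  local old_dir=$1 new_dir=$2 owner=$3
  mkdir -p "$new_dir"
  # -a preserves modes and timestamps; the trailing /. copies the directory
  # contents (including dotfiles) rather than the directory itself.
  cp -a "$old_dir/." "$new_dir/"
  # chown needs root and an existing account, so skip it otherwise.
  if [ "$(id -u)" -eq 0 ] && id "$owner" >/dev/null 2>&1; then
    chown -R "$owner" "$new_dir"
  fi
}

# Demo on scratch directories; blk_demo is a placeholder file name.
demo=$(mktemp -d)
mkdir -p "$demo/old/current"
echo "block" > "$demo/old/current/blk_demo"
migrate_datanode_dir "$demo/old" "$demo/new" hdfs
ls "$demo/new/current"
```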

Hadoop 2.x -- how to configure secondary namenode?

I have an old Hadoop install that I'm looking to update to Hadoop 2. In the
old setup, I have a $HADOOP_HOME/conf/masters file that specifies the
secondary namenode.
Looking through the Hadoop 2 documentation I can't find any mention of a
"masters" file, or how to set up a secondary namenode.
Any help in the right direction would be appreciated.
The slaves and masters files in the conf folder are only used by some scripts in the bin folder like start-mapred.sh, start-dfs.sh and start-all.sh scripts.
These scripts are a mere convenience so that you can run them from a single node to ssh into each master / slave node and start the desired hadoop service daemons.
You only need these files on the name node machine if you intend to launch your cluster from this single node (using password-less ssh).
Alternatively, you can start a Hadoop daemon manually on a machine via
bin/hadoop-daemon.sh start [namenode | secondarynamenode | datanode | jobtracker | tasktracker]
To run the secondary name node, use the above script on the designated machine, passing 'secondarynamenode' as the value.
See @pwnz0r's second comment on the answer to "How separate hadoop secondary namenode from primary namenode?"
To reiterate here:
In hdfs-site.xml:
<property>
<name>dfs.secondary.http.address</name>
<value>$secondarynamenode.full.hostname:50090</value>
<description>SecondaryNameNodeHostname</description>
</property>
I am using Hadoop 2.6 and had to use
<property>
<name>dfs.secondary.http.address</name>
<value>secondarynamenode.hostname:50090</value>
</property>
For further details, refer to https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
Update the hdfs-site.xml file by adding the following property
cd $HADOOP_HOME/etc/hadoop
sudo vi hdfs-site.xml
Then paste these lines inside the configuration tag
<property>
<name>dfs.secondary.http.address</name>
<value>hostname:50090</value>
</property>

Why do we need to format HDFS after every time we restart machine?

I have installed Hadoop in pseudo distributed mode on my laptop, OS is Ubuntu.
I have changed paths where hadoop will store its data (by default hadoop stores data in /tmp folder)
hdfs-site.xml file looks as below :
<property>
<name>dfs.data.dir</name>
<value>/HADOOP_CLUSTER_DATA/data</value>
</property>
Now whenever I restart the machine and try to start the Hadoop cluster using the start-all.sh script, the data node never starts. I confirmed that the data node did not start by checking the logs and by using the jps command.
Then I:
1. Stopped the cluster using the stop-all.sh script.
2. Formatted HDFS using the hadoop namenode -format command.
3. Started the cluster using the start-all.sh script.
Now everything works fine even if I stop and start cluster again. Problem occurs only when I restart machine and try to start the cluster.
Has anyone encountered similar problem?
Why is this happening, and how can we solve this problem?
By changing dfs.datanode.data.dir away from /tmp you indeed made the data (the blocks) survive across a reboot. However there is more to HDFS than just blocks. You need to make sure all the relevant dirs point away from /tmp, most notably dfs.namenode.name.dir (I can't tell what other dirs you have to change, it depends on your config, but the namenode dir is mandatory, could be also sufficient).
I would also recommend using a more recent Hadoop distribution. BTW, the 1.1 namenode dir setting is dfs.name.dir.
For those who use Hadoop 2.0 or above, the config file names may be different.
As this answer points out, go to the /etc/hadoop directory of your Hadoop installation.
Open the file hdfs-site.xml. This user configuration overrides the default Hadoop configuration, which the Java classloader loads first.
Add the dfs.namenode.name.dir property and set a new namenode dir (the default is file://${hadoop.tmp.dir}/dfs/name).
Do the same for the dfs.datanode.data.dir property (the default is file://${hadoop.tmp.dir}/dfs/data).
For example:
<property>
<name>dfs.namenode.name.dir</name>
<value>/Users/samuel/Documents/hadoop_data/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/Users/samuel/Documents/hadoop_data/data</value>
</property>
Another property where a tmp dir appears is dfs.namenode.checkpoint.dir. Its default value is file://${hadoop.tmp.dir}/dfs/namesecondary.
If you want, you can also add this property:
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>/Users/samuel/Documents/hadoop_data/namesecondary</value>
</property>
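A quick way to spot directories that still point at /tmp is to test each configured value. The sketch below uses a hypothetical helper, under_tmp, and hard-coded sample values; it does not read hdfs-site.xml and only recognises plain paths and the file: URI forms shown.

```shell
#!/usr/bin/env bash
# Succeed if the given directory setting resolves to /tmp or something below it.
under_tmp() {
  case "$1" in
    /tmp|/tmp/*|file:/tmp|file:/tmp/*|file:///tmp|file:///tmp/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Sample values: one default-style location and one relocated directory.
for dir in "file:///tmp/hadoop-samuel/dfs/name" \
           "/Users/samuel/Documents/hadoop_data/data"; do
  if under_tmp "$dir"; then
    echo "under /tmp, may be lost on reboot: $dir"
  else
    echo "outside /tmp: $dir"
  fi
done
```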
