Yarn - Why does the application attempt twice? - hadoop

I made a Spark application that throws an error on purpose. When I run this application on Hadoop YARN, it always tries twice.
I want the application to run just once, not twice.

The number of application attempts is controlled by the property yarn.resourcemanager.am.max-attempts, which is 2 by default.
Modify this in yarn-site.xml:
<property>
<name>yarn.resourcemanager.am.max-attempts</name>
<value>1</value>
</property>
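If you only want to limit attempts for a single Spark job rather than cluster-wide, Spark on YARN also reads spark.yarn.maxAppAttempts (it cannot exceed the YARN maximum). A minimal sketch, where the class and jar names are placeholders:
$ spark-submit --master yarn \
    --conf spark.yarn.maxAppAttempts=1 \
    --class com.example.MyApp my-app.jar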

Related

Unable to install hadoop on macosx

I am unable to run Hadoop on my Mac OS X machine.
When I run hadoop version -> nothing happens
When I run sudo hadoop version -> it shows my hadoop version. I read somewhere that I shouldn't be using sudo to run hadoop, but anyway this tells me that my hadoop is installed?
Because hadoop version returns nothing, I am unable to open any nodes. Every time I try to start a node with start-dfs.sh, nothing happens either. Does anyone know what's happening here? I've looked through all the configuration files multiple times to ensure that I have set them right. Not sure where the problem lies here.
What I did:
Followed the instructions from here
In my config files I edited:
etc/hadoop/core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
etc/hadoop/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
etc/hadoop/hadoop-env.sh
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home
In fact, I managed to run everything before.
I am not sure what I did after that, but I am unable to start the nodes again. I think it might be because I failed to stop my nodes before shutting down my computer?
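A quick sanity-check sequence for this kind of setup, assuming a standard single-node install with the Hadoop scripts on the PATH (note that formatting the NameNode erases existing HDFS metadata):
$ echo $JAVA_HOME       # should print the JDK path set in hadoop-env.sh
$ hadoop version        # should print the version without sudo
$ hdfs namenode -format # only for a fresh setup; this wipes HDFS metadata
$ start-dfs.sh
$ jps                   # should list NameNode, DataNode and SecondaryNameNode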

Duration of yarn application log in hadoop

I am using the output of the yarn application command in Hadoop to find out the details of MapReduce jobs that were run, using the job name. My cluster uses the HDP distribution. Does anyone know how long the job status remains available? Does it keep track of jobs from the previous few days?
It depends on your cluster configuration. In a production setting there is usually a history/archive server available to hold the logs of previous runs. In a default YARN configuration, log retention is set to 1 day, hence by default one day of logs is preserved.
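Assuming log aggregation is enabled (yarn.log-aggregation-enable set to true), the retention period for aggregated logs can be adjusted with yarn.log-aggregation.retain-seconds in yarn-site.xml; for example, to keep roughly seven days of logs:
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>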
If the history server is running, its default port is 19888. Check mapred-site.xml for the entry below:
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>{job-history-hostname}:19888</value>
</property>
and yarn-site.xml
<property>
<name>yarn.log.server.url</name>
<value>http://{job-history-hostname}:19888/jobhistory/logs</value>
</property>

Hadoop jobs fail when submitted by users other than yarn (MRv2) or mapred (MRv1)

I am running a test cluster running MRv1 (CDH5) paired with LocalFileSystem, and the only user I am able to run jobs as is mapred (as mapred is the user starting the jobtracker/tasktracker daemons). When submitting jobs as any other user, the jobs fail because the jobtracker/tasktracker is unable to find the job.jar under the .staging directory.
I have the exact same issue with YARN (MRv2) when paired with LocalFileSystem, i.e. when submitting jobs by a user other than 'yarn', the application master is unable to locate the job.jar under the .staging directory.
Upon inspecting the .staging directory of the user submitting the job, I found that job.jar exists under the .staging/<job id>/ directory, but the permissions on the <job id> and .staging directories are set to 700 (drwx------), and hence the application master / tasktracker is not able to access the job.jar and supporting files.
We are running the test cluster with LocalFileSystem since we use only MapReduce part of the Hadoop project paired with OCFS in our production setup.
Any assistance in this regard would be immensely helpful.
You need to set up a staging directory for each user in the cluster. This is not as complicated as it sounds.
Check the following properties:
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-${user.name}</value>
<source>core-default.xml</source>
</property>
This basically sets up a tmp directory for each user.
Tie this to your staging directory:
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>${hadoop.tmp.dir}/mapred/staging</value>
<source>mapred-default.xml</source>
</property>
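With those two defaults, each user gets a separate staging root. For a hypothetical user alice, ${hadoop.tmp.dir} expands to /tmp/hadoop-alice, so the staging root resolves to /tmp/hadoop-alice/mapred/staging, which you can verify with:
$ ls -ld /tmp/hadoop-alice/mapred/staging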
Let me know if this works or if it is already set up this way.
These properties should be in yarn-site.xml, if I remember correctly.
This worked for me, I just set this property in MR v1:
<property>
<name>hadoop.security.authorization</name>
<value>simple</value>
</property>
Please go through this:
Access Control Lists
${HADOOP_CONF_DIR}/hadoop-policy.xml defines an access control list for each Hadoop service. Every access control list has a simple format:
The lists of users and groups are both comma-separated lists of names. The two lists are separated by a space.
Example: user1,user2 group1,group2.
Add a blank at the beginning of the line if only a list of groups is to be provided; equivalently, a comma-separated list of users followed by a space or nothing implies only a set of given users.
A special value of * implies that all users are allowed to access the service.
Refreshing Service Level Authorization Configuration
The service-level authorization configuration for the NameNode and JobTracker can be changed without restarting either of the Hadoop master daemons. The cluster administrator can change ${HADOOP_CONF_DIR}/hadoop-policy.xml on the master nodes and instruct the NameNode and JobTracker to reload their respective configurations via the -refreshServiceAcl switch to dfsadmin and mradmin commands respectively.
Refresh the service-level authorization configuration for the NameNode:
$ bin/hadoop dfsadmin -refreshServiceAcl
Refresh the service-level authorization configuration for the JobTracker:
$ bin/hadoop mradmin -refreshServiceAcl
Of course, one can use the security.refresh.policy.protocol.acl property in ${HADOOP_CONF_DIR}/hadoop-policy.xml to restrict access to the ability to refresh the service-level authorization configuration to certain users/groups.
Examples
Allow only users alice, bob and users in the mapreduce group to submit jobs to the MapReduce cluster:
<property>
<name>security.job.submission.protocol.acl</name>
<value>alice,bob mapreduce</value>
</property>
Allow only DataNodes running as the users who belong to the group datanodes to communicate with the NameNode:
<property>
<name>security.datanode.protocol.acl</name>
<value>datanodes</value>
</property>
Allow any user to talk to the HDFS cluster as a DFSClient:
<property>
<name>security.client.protocol.acl</name>
<value>*</value>
</property>

Hadoop/MR temporary directory

I've been struggling with getting Hadoop and Map/Reduce to start using a separate temporary directory instead of the /tmp on my root directory.
I've added the following to my core-site.xml config file:
<property>
<name>hadoop.tmp.dir</name>
<value>/data/tmp</value>
</property>
I've added the following to my mapreduce-site.xml config file:
<property>
<name>mapreduce.cluster.local.dir</name>
<value>${hadoop.tmp.dir}/mapred/local</value>
</property>
<property>
<name>mapreduce.jobtracker.system.dir</name>
<value>${hadoop.tmp.dir}/mapred/system</value>
</property>
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>${hadoop.tmp.dir}/mapred/staging</value>
</property>
<property>
<name>mapreduce.cluster.temp.dir</name>
<value>${hadoop.tmp.dir}/mapred/temp</value>
</property>
No matter what job I run, though, it still does all of the intermediate work in the /tmp directory. I've been watching it via df -h, and when I go in there I see all of the temporary files it creates.
Am I missing something from the config?
This is on a 10-node Linux CentOS cluster running Hadoop/YARN MapReduce 2.1.0.2.0.6.0.
EDIT:
After some further research, the settings seem to be working on my management and namenode/secondary namenode boxes. It is only on the data nodes that this is not working, and it is only the MapReduce temporary output files that are still going to /tmp on my root drive rather than the data mount I set in the configuration files.
If you are running Hadoop 2.0, then the proper name of the config file you need to change is mapred-site.xml, not mapreduce-site.xml.
An example can be found on the Apache site: http://hadoop.apache.org/docs/r2.3.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
and it uses the mapreduce.cluster.local.dir property name, with a default value of ${hadoop.tmp.dir}/mapred/local
Try renaming your mapreduce-site.xml file to mapred-site.xml in your /etc/hadoop/conf/ directories and see if that fixes it.
If you are using Ambari, you should be able to just use the "Add Property" button in the MapReduce2 / Custom mapred-site.xml section, enter 'mapreduce.cluster.local.dir' for the property name, and a comma-separated list of the directories you want to use.
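For reference, the entry that ends up in the generated mapred-site.xml would look something like this (the directory value is only an example, mirroring the /data/tmp mount from the question):
<property>
<name>mapreduce.cluster.local.dir</name>
<value>/data/tmp/mapred/local</value>
</property>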
I think you need to specify this property in hdfs-site.xml rather than core-site.xml. Try setting it in hdfs-site.xml. I hope this will solve your problem.
The mapreduce properties should be in mapred-site.xml.
I was facing a similar issue where some nodes would not honor the hadoop.tmp.dir set in the config.
A reboot of the misbehaving nodes fixed it for me.

Why do we need to format HDFS after every time we restart machine?

I have installed Hadoop in pseudo-distributed mode on my laptop; the OS is Ubuntu.
I have changed the paths where Hadoop stores its data (by default Hadoop stores data in the /tmp folder).
My hdfs-site.xml file looks as below:
<property>
<name>dfs.data.dir</name>
<value>/HADOOP_CLUSTER_DATA/data</value>
</property>
Now whenever I restart the machine and try to start the Hadoop cluster using the start-all.sh script, the data node never starts. I confirmed that the data node does not start by checking the logs and by using the jps command.
Then I
Stopped cluster using stop-all.sh script.
Formatted HDFS using hadoop namenode -format command.
Started cluster using start-all.sh script.
Now everything works fine even if I stop and start the cluster again. The problem occurs only when I restart the machine and try to start the cluster.
Has anyone encountered a similar problem? Why is this happening, and how can we solve it?
By changing dfs.datanode.data.dir away from /tmp you indeed made the data (the blocks) survive across a reboot. However, there is more to HDFS than just blocks. You need to make sure all the relevant dirs point away from /tmp, most notably dfs.namenode.name.dir (I can't tell what other dirs you have to change, it depends on your config, but the namenode dir is mandatory and could also be sufficient).
I would also recommend using a more recent Hadoop distribution. BTW, the 1.1 namenode dir setting is dfs.name.dir.
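On a 1.x cluster the equivalent hdfs-site.xml entry would look something like this (the directory is only an example, chosen alongside the question's data dir):
<property>
<name>dfs.name.dir</name>
<value>/HADOOP_CLUSTER_DATA/name</value>
</property>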
For those who use Hadoop 2.0 or above, the config file names may be different.
As this answer points out, go to the /etc/hadoop directory of your hadoop installation.
Open the file hdfs-site.xml. This user configuration will override the default Hadoop configuration, which is loaded by the Java classloader first.
Add dfs.namenode.name.dir property and set a new namenode dir (default is file://${hadoop.tmp.dir}/dfs/name).
Do the same for dfs.datanode.data.dir property (default is file://${hadoop.tmp.dir}/dfs/data).
For example:
<property>
<name>dfs.namenode.name.dir</name>
<value>/Users/samuel/Documents/hadoop_data/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/Users/samuel/Documents/hadoop_data/data</value>
</property>
Another property where a tmp dir appears is dfs.namenode.checkpoint.dir. Its default value is file://${hadoop.tmp.dir}/dfs/namesecondary.
If you want, you can easily add this property as well:
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>/Users/samuel/Documents/hadoop_data/namesecondary</value>
</property>
