Running submitted jobs sequentially in Google Cloud Dataproc

I created a Google Dataproc cluster with 2 workers, using n1-standard-4 VMs for both the master and the workers.
I want to submit jobs to a given cluster and have all jobs run sequentially (as on AWS EMR), i.e., if the first job is in the running state, any subsequent job goes to the pending state, and only after the first job completes does the second job start running.
I tried submitting jobs to the cluster, but it ran them all in parallel - no job went to the pending state.
Is there any configuration I can set on a Dataproc cluster so that all jobs run sequentially?
I updated the following files:
/etc/hadoop/conf/yarn-site.xml
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
  <name>yarn.scheduler.fair.user-as-default-queue</name>
  <value>false</value>
</property>
<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <value>/etc/hadoop/conf/fair-scheduler.xml</value>
</property>
/etc/hadoop/conf/fair-scheduler.xml
<?xml version="1.0" encoding="UTF-8"?>
<allocations>
  <queueMaxAppsDefault>1</queueMaxAppsDefault>
</allocations>
After making the above changes on the master node, I restarted the service with systemctl restart hadoop-yarn-resourcemanager. But jobs still run in parallel.

Dataproc tries to execute submitted jobs in parallel if resources are available.
To achieve sequential execution you may want to use an orchestration solution, either Dataproc Workflows or Cloud Composer.
Alternatively, you may want to configure the YARN Fair Scheduler on Dataproc and set the queueMaxAppsDefault property to 1.
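For example, a minimal sketch of applying that scheduler configuration at cluster creation time through Dataproc cluster properties (the cluster name and region are placeholders, and the fair-scheduler.xml allocation file would still need to be placed on the master node, e.g. via an initialization action):
gcloud dataproc clusters create my-cluster --region us-central1 \
    --properties 'yarn:yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler,yarn:yarn.scheduler.fair.user-as-default-queue=false,yarn:yarn.scheduler.fair.allocation.file=/etc/hadoop/conf/fair-scheduler.xml'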

Related

Duration of yarn application log in hadoop

I am using the output of the yarn application command in Hadoop to look up the details of MapReduce jobs by job name. My cluster uses the HDP distribution. Does anyone know how long the job status remains available? Does it keep track of jobs from the previous few days?
It depends on your cluster configuration. In a production-level setup there is usually a history/archive server available to hold the logs of previous runs. In the default YARN configuration, log retention is set to 1 day, so by default one day of logs is preserved.
If the history server is running, its default port is 19888. Check mapred-site.xml for the entry below:
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>{job-history-hostname}:19888</value>
</property>
and yarn-site.xml
<property>
  <name>yarn.log.server.url</name>
  <value>http://{job-history-hostname}:19888/jobhistory/logs</value>
</property>
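If you need a longer retention window for aggregated logs, the usual knob is yarn.log-aggregation.retain-seconds in yarn-site.xml; a minimal sketch, assuming log aggregation is enabled (the 7-day value is only an example):
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <!-- 604800 seconds = 7 days; adjust to your retention policy -->
  <value>604800</value>
</property>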

Hadoop (EMR) Cluster Fair Scheduler is completing FIFO instead of in Parallel

This is my first time attempting to configure the YARN scheduler, and it is not working as I had hoped. The cluster originally worked as FIFO, and I am attempting to get jobs to run in parallel. I added the following to the top of yarn-site.xml:
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <value>/etc/hadoop/conf.empty/fair-scheduler.xml</value>
</property>
And then added the file /etc/hadoop/conf.empty/fair-scheduler.xml:
<allocations>
  <queue name="root">
    <weight>1.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
    <aclSubmitApps> </aclSubmitApps>
    <aclAdministerApps>*</aclAdministerApps>
  </queue>
  <defaultQueueSchedulingPolicy>fair</defaultQueueSchedulingPolicy>
  <queuePlacementPolicy>
    <rule name="specified" create="true"/>
    <rule name="user" create="true"/>
  </queuePlacementPolicy>
</allocations>
After this I stopped and started the YARN ResourceManager, and I see Fair Scheduler in the YARN application console! But when I attempt to run multiple jobs on the cluster, the AWS EMR console shows just one job running and the other two pending. Furthermore, the YARN console shows only one job running, in the queue root.hadoop, and I don't see the other jobs (which run after that one completes).
So how can I get the jobs to run in parallel?
Setting the scheduler via yarn-site.xml does in fact work. If you pull up the YARN ResourceManager, the scheduler will show the change, but the issue is with submitting an AWS EMR step. EMR steps inherently run sequentially, meaning AWS will not submit the next job to YARN until the previous step completes. So one had to submit jobs directly to YARN to see the benefits; however, EMR steps seem to have recently changed. AWS EMR now supports parallel step execution when using EMR version 5.28.0 or later: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-emr-now-allows-you-to-run-multiple-steps-in-parallel-cancel-running-steps-and-integrate-with-aws-step-functions/.
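For instance, on EMR 5.28.0 or later the step concurrency can be raised on a running cluster with the AWS CLI; a sketch assuming the modify-cluster support described in that announcement (the cluster ID is a placeholder, and 10 is just an example concurrency):
aws emr modify-cluster --cluster-id j-XXXXXXXXXXXXX --step-concurrency-level 10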

How to delete a queue in YARN?

I see a queue in the yarn configuration file that I want to delete:
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>a,b,c</value>
  <description>The queues at this level (root is the root queue).</description>
</property>
<property>
  <name>yarn.scheduler.capacity.root.a.queues</name>
  <value>a1,a2</value>
  <description>The queues at this level (root is the root queue).</description>
</property>
<property>
  <name>yarn.scheduler.capacity.root.b.queues</name>
  <value>b1,b2,b3</value>
  <description>The queues at this level (root is the root queue).</description>
</property>
Say I want to remove queue c. I remove c from the list under the line <name>yarn.scheduler.capacity.root.queues</name> so that the line looks like this:
<value>a,b</value>
Then I go to the command line and run yarn rmadmin -refreshQueues.
But I get the following error message:
Caused by: java.io.IOException: c cannot be found during refresh!
I'm trying to delete the queue c. How do I delete it?
I noticed here it says that
Note: Queues cannot be deleted, only addition of new queues is
supported - the updated queue configuration should be a valid one i.e.
queue-capacity at each level should be equal to 100%.
...so, how do I delete a queue if I don't need it anymore?
thanks.
From that page:
Administrators can add additional queues at runtime, but queues cannot be deleted at runtime.
It sounds like you cannot delete a queue at runtime, BUT you can stop YARN, update the config file (removing c), and start YARN with the new configuration. So if you can afford to stop/start YARN, here is the process:
Note: before stopping YARN, read up on the effects this has on the system, which depend on your YARN version.
Stop the MapReduce JobHistory service, ResourceManager service, and NodeManager on all nodes where they are running, as follows:
sudo service hadoop-mapreduce-historyserver stop
sudo service hadoop-yarn-resourcemanager stop
sudo service hadoop-yarn-nodemanager stop
Then edit the config file and remove c.
Start the MapReduce JobHistory server, ResourceManager, and NodeManager on all nodes where they were previously running, as follows:
sudo service hadoop-mapreduce-historyserver start
sudo service hadoop-yarn-resourcemanager start
sudo service hadoop-yarn-nodemanager start
Commands are from here
You have to stop YARN if you want to delete a queue. Only adding, updating, and stopping queues is supported while YARN is running.
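For reference, after removing c the remaining queue capacities at that level still have to add up to 100%; a minimal sketch of the relevant capacity-scheduler.xml entries (the 50/50 split is only an example):
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>a,b</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.a.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.b.capacity</name>
  <value>50</value>
</property>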

How to separately specify a set of nodes for HDFS and others for MapReduce jobs?

While deploying Hadoop, I want some nodes to run the HDFS server but not run any MapReduce tasks.
For example, there are two nodes, A and B, that run HDFS.
I want to exclude node A from running any map/reduce task.
How can I achieve this? Thanks
If you do not want to run any MapReduce jobs on a particular node or set of nodes, stopping the NodeManager daemon is the simplest option if it is already running.
Run this command on the nodes where MR tasks should not be attempted:
yarn-daemon.sh stop nodemanager
Or exclude the hosts using the property yarn.resourcemanager.nodes.exclude-path in yarn-site.xml:
<property>
  <name>yarn.resourcemanager.nodes.exclude-path</name>
  <value>/path/to/excludes.txt</value>
  <description>Path of the file containing the hosts to exclude. Should be readable by the YARN user.</description>
</property>
After adding this property, refresh the ResourceManager:
yarn rmadmin -refreshNodes
The nodes specified in the file will then be excluded from attempting MapReduce tasks.
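For illustration, the excludes file is just a plain list of hostnames, one per line; a sketch assuming node A's hostname is nodeA.example.com:
nodeA.example.com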
To answer my own question:
If you use YARN for resource management, see franklinsijo's answer above.
If you use standalone mode, make a list of the nodes that should run MR tasks and specify its path with the 'mapred.hosts' property (documented in mapred-default: https://hadoop.apache.org/docs/r1.2.1/mapred-default.html).
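A minimal sketch of that property in mapred-site.xml (the include-file path is hypothetical and would list only the nodes, such as B, that should accept MR tasks):
<property>
  <name>mapred.hosts</name>
  <value>/path/to/mr-includes.txt</value>
</property>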

Hadoop jobs fail when submitted by users other than yarn (MRv2) or mapred (MRv1)

I am running a test cluster running MRv1 (CDH5) paired with LocalFileSystem, and the only user I am able to run jobs as is mapred (as mapred is the user starting the jobtracker/tasktracker daemons). When submitting jobs as any other user, the jobs fail because the jobtracker/tasktracker is unable to find the job.jar under the .staging directory.
I have the exact same issue with YARN (MRv2) when paired with LocalFileSystem, i.e. when submitting jobs by a user other than 'yarn', the application master is unable to locate the job.jar under the .staging directory.
Upon inspecting the .staging directory of the user submitting the job, I found that job.jar exists under the .staging/<job id>/ directory, but the permissions on the <job id> and .staging directories are set to 700 (drwx------), and hence the application master / tasktracker is not able to access the job.jar and supporting files.
We are running the test cluster with LocalFileSystem since we use only MapReduce part of the Hadoop project paired with OCFS in our production setup.
Any assistance in this regard would be immensely helpful.
You need to set up a staging directory for each user in the cluster. This is not as complicated as it sounds.
Check the following properties:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <source>core-default.xml</source>
</property>
This basically sets up a tmp directory for each user.
Tie this to your staging directory:
<property>
  <name>mapreduce.jobtracker.staging.root.dir</name>
  <value>${hadoop.tmp.dir}/mapred/staging</value>
  <source>mapred-default.xml</source>
</property>
Let me know if this works or if it is already set up this way.
These properties should be in yarn-site.xml - if I remember correctly.
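For illustration, with those defaults the paths resolve per user; a rough sketch for a hypothetical user alice, assuming the directory needs to be pre-created and owned by that user:
# create the per-user staging root that ${hadoop.tmp.dir}/mapred/staging resolves to
sudo mkdir -p /tmp/hadoop-alice/mapred/staging
sudo chown -R alice /tmp/hadoop-alice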
This worked for me, I just set this property in MR v1:
<property>
  <name>hadoop.security.authorization</name>
  <value>simple</value>
</property>
Please go through this:
Access Control Lists
${HADOOP_CONF_DIR}/hadoop-policy.xml defines an access control list for each Hadoop service. Every access control list has a simple format:
The lists of users and groups are both comma-separated lists of names. The two lists are separated by a space.
Example: user1,user2 group1,group2.
Add a blank at the beginning of the line if only a list of groups is to be provided; equivalently, a comma-separated list of users followed by a space or nothing implies only a set of the given users.
A special value of * implies that all users are allowed to access the service.
Refreshing Service Level Authorization Configuration
The service-level authorization configuration for the NameNode and JobTracker can be changed without restarting either of the Hadoop master daemons. The cluster administrator can change ${HADOOP_CONF_DIR}/hadoop-policy.xml on the master nodes and instruct the NameNode and JobTracker to reload their respective configurations via the -refreshServiceAcl switch to dfsadmin and mradmin commands respectively.
Refresh the service-level authorization configuration for the NameNode:
$ bin/hadoop dfsadmin -refreshServiceAcl
Refresh the service-level authorization configuration for the JobTracker:
$ bin/hadoop mradmin -refreshServiceAcl
Of course, one can use the security.refresh.policy.protocol.acl property in ${HADOOP_CONF_DIR}/hadoop-policy.xml to restrict access to the ability to refresh the service-level authorization configuration to certain users/groups.
Examples
Allow only users alice, bob and users in the mapreduce group to submit jobs to the MapReduce cluster:
<property>
  <name>security.job.submission.protocol.acl</name>
  <value>alice,bob mapreduce</value>
</property>
Allow only DataNodes running as the users who belong to the group datanodes to communicate with the NameNode:
<property>
  <name>security.datanode.protocol.acl</name>
  <value>datanodes</value>
</property>
Allow any user to talk to the HDFS cluster as a DFSClient:
<property>
  <name>security.client.protocol.acl</name>
  <value>*</value>
</property>
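Following the same pattern, the ability to refresh the service-level authorization configuration (mentioned above) can be limited; a sketch assuming a hypothetical admin group named hadoopadmins (note the leading blank, which marks a group-only list):
<property>
  <name>security.refresh.policy.protocol.acl</name>
  <value> hadoopadmins</value>
</property>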
