How can I find out the IPs of the slave nodes where a map reduce task is currently running or about to run for a given job? - hadoop

I want to find out the IPs of the slave nodes where the map reduce tasks for a given job are currently running or are about to run.
Is there any method to do this?
Thanks in advance.

For any job, you can view the list of running tasks through the JobTracker web UI; this will detail the nodes on which each task is running.
As for where tasks are about to run: this is not necessarily decided in advance. As slots become available on a node, the job scheduler (there are a number of schedulers, which behave differently depending on your needs) identifies a job task to run on that node (based upon a number of criteria, hopefully honoring data locality where it can) and instructs the task tracker on that node to run that specific task.
Programmatically, look at the javadocs for the JobClient class; it should be able to acquire information about the running tasks and their node names (you'll probably need to do a DNS lookup to get the actual IPs, I imagine).
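Something along these lines might work with the old mapred API (Hadoop 1.x). This is only a rough, untested sketch: it assumes the cluster configuration is on the classpath, and task completion events only cover attempts that have already finished; for a cluster-wide view of active trackers you could look at getClusterStatus instead.

    import java.net.InetAddress;
    import java.net.URI;

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;
    import org.apache.hadoop.mapred.TaskCompletionEvent;

    public class TaskNodeLister {
        public static void main(String[] args) throws Exception {
            // Assumes the cluster settings (mapred.job.tracker etc.) are on the classpath.
            JobClient client = new JobClient(new JobConf());

            // args[0] is a job id, e.g. job_201301011200_0001 (example value).
            RunningJob job = client.getJob(JobID.forName(args[0]));
            if (job == null) {
                System.err.println("No such job: " + args[0]);
                return;
            }

            // Each completion event carries the HTTP address of the task tracker
            // that ran the task attempt; resolve its hostname to an IP.
            for (TaskCompletionEvent event : job.getTaskCompletionEvents(0)) {
                String host = URI.create(event.getTaskTrackerHttp()).getHost();
                String ip = InetAddress.getByName(host).getHostAddress();
                System.out.println(event.getTaskAttemptId() + " -> " + host + " (" + ip + ")");
            }

            // For all currently active task trackers in the cluster (not per job):
            // client.getClusterStatus(true).getActiveTrackerNames()
        }
    }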

Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
http://localhost:50030/ – web UI for MapReduce job tracker(s)
http://localhost:50060/ – web UI for task tracker(s)
http://localhost:50070/ – web UI for HDFS name node(s)
Thanks to @Chris:
"Programmatically, look at the javadocs for the JobClient class; it should be able to acquire information about the running tasks and their node names."

Related

Difference between job, application, task, task attempt logs in Hadoop, Oozie

I'm running an Oozie job with multiple actions, and there's a part I could not make work. In the process of troubleshooting, I'm overwhelmed with lots of logs.
In the YARN UI (yarn.resourcemanager.webapp.address in yarn-site.xml, normally on port 8088), there are the application_<app_id> logs.
In the Job History Server (yarn.log.server.url in yarn-site.xml, ours on port 19888), there are the job_<job_id> logs. (These job logs should also show up in Hue's Job Browser, right?)
In Hue's Oozie workflow editor, there are the task and task_attempt logs (not sure if they're the same; everything's a mixed-up soup to me already), which redirect to the Job Browser if you click here and there.
Can someone explain the difference between these things from a Hadoop/Oozie architectural standpoint?
P.S.
I've seen container_<container_id> in the logs as well. Might as well include this in your explanation in relation to the things above.
In terms of YARN, the programs that are being run on a cluster are called applications. In terms of MapReduce, they are called jobs. So, if you are running MapReduce on YARN, a job and an application are the same thing (if you take a close look, the job IDs and application IDs match).
A MapReduce job consists of several tasks (either map or reduce tasks). Each execution of a task is a task attempt; if a task fails, it is launched again on another node as a new attempt.
A container is a YARN term: it is the unit of resource allocation. For example, a MapReduce task attempt runs in a single container.
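To make the naming concrete, here is how the identifiers of a single MapReduce-on-YARN run line up (the numeric values below are made up for illustration):

    application_1411111111111_0002           - the YARN application
    job_1411111111111_0002                   - the same unit of work, seen as a MapReduce job
    task_1411111111111_0002_m_000000         - the first map task of that job
    attempt_1411111111111_0002_m_000000_0    - the first attempt at running that task
    container_1411111111111_0002_01_000002   - a container that attempt ran in (application attempt 01)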

Two simultaneous jobs on same node

In Microsoft HPC Cluster Manager, is it possible to run two jobs (MPI jobs) simultaneously on the same node? If so, how should a job be configured?
I experimented with HPC Cluster Manager and found a solution along these lines:
First, the job scheduler configuration must be set to Balanced.
Second, the job resource type must be set to Core or Socket, not Node.
With these two settings, if the minimum requested resources are available for both jobs, they start running simultaneously on the same node.

Map on data node is run by whom?

I faced this tricky question in one of my interviews.
The question was: who runs the map on a data node?
The answer is neither the job tracker nor the task tracker.
Could anybody help me, please?
Data nodes do not run any tasks; they are part of HDFS and take care of storing data. So "map on data node" makes no sense at all.
If Hadoop 1.x is installed on the system and a task tracker is running on the same machine as the data node, then the task tracker daemon is the one that runs the map task, after getting instructions from the job tracker.
If no task tracker is running on that machine, then no map task can run on that node; the data node takes care of the storage part and has nothing to do with map processing.
In Hadoop 2.x, the application master is the entity that does this, by coordinating with the node manager and the resource manager.

Capacity scheduler in Amazon Elastic MapReduce

I am totally new to Amazon Elastic MapReduce. I want to use my custom scheduler, which is implemented based on the Hadoop capacity scheduler, to schedule my jobs in Amazon Elastic MapReduce.
According to my current understanding, to achieve this I can define only one stage in the job flow and submit my custom jar file via an SSH connection to the master node. However, I cannot figure out how to edit the XML configuration files, like capacity-scheduler.xml, on the master node. Does anyone know how to do that?
Moreover, if I want to add dynamic sizing on top of it, can I dynamically tune the number of task nodes in the cluster while a job is running? Or does the cluster size have to remain the same within each stage? Thank you so much.
You should use a bootstrap action to change Hadoop configuration.
The following AWS doc can be referenced for the Hadoop configuration bootstrap action.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html#PredefinedbootstrapActions_ConfigureHadoop
This blog article that I bookmarked also has some info.
http://sujee.net/tech/articles/hadoop/amazon-emr-beyond-basics/
For changing the cluster size dynamically, one option is to use the AWS SDK.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/calling-emr-with-java-sdk.html
Using the following interface you can modify the instance count of the instance group.
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/AmazonElasticMapReduce.html
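For example, with the AWS SDK for Java the resize call could look roughly like this. This is only a sketch: the credentials and the instance group id are placeholders you would replace with your own values.

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
    import com.amazonaws.services.elasticmapreduce.model.InstanceGroupModifyConfig;
    import com.amazonaws.services.elasticmapreduce.model.ModifyInstanceGroupsRequest;

    public class ResizeTaskGroup {
        public static void main(String[] args) {
            // Placeholder credentials -- substitute your own (or use a credentials provider).
            AmazonElasticMapReduceClient emr =
                    new AmazonElasticMapReduceClient(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

            // Ask EMR to grow (or shrink) the given instance group to 10 nodes
            // while the cluster keeps running. "ig-XXXXXXXXXXXX" is a placeholder id.
            emr.modifyInstanceGroups(new ModifyInstanceGroupsRequest()
                    .withInstanceGroups(new InstanceGroupModifyConfig()
                            .withInstanceGroupId("ig-XXXXXXXXXXXX")
                            .withInstanceCount(10)));
        }
    }

As far as I know, the task instance group is the one you would normally resize up and down this way; the core group hosts HDFS data and is generally only grown.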

Schedule a trigger for a job that is executed on every node in a cluster

I'm wondering if there is a simple workaround/hack in Quartz for triggering a job that is executed on every node in a cluster.
My situation:
My application caches some things and runs in a cluster with no distributed cache. Now I have situations where I want to refresh the caches on all nodes, triggered by a job.
As you have found out, Quartz always picks a random instance to execute a scheduled job, and this cannot be easily changed unless you want to hack its internals.
Probably the easiest way to achieve what you describe would be to implement some sort of coordinator (or master) job that is aware of all Quartz instances in the cluster and "manually" triggers execution of the cache-sync job on every single node. The master job can easily do this via the RMI or JMX APIs exposed by Quartz.
You may want to check this somewhat similar question.
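A rough sketch of the coordinator idea using Quartz 2.x RMI proxies is shown below. The node list, scheduler name, job name and group are made-up examples, and each remote node would have to export its scheduler over RMI (org.quartz.scheduler.rmi.export=true) for this to work.

    import java.util.Properties;

    import org.quartz.JobKey;
    import org.quartz.Scheduler;
    import org.quartz.impl.SchedulerRepository;
    import org.quartz.impl.StdSchedulerFactory;

    public class CacheRefreshCoordinator {

        // Hypothetical hostnames of the cluster nodes, each running a Quartz
        // scheduler exported over RMI.
        private static final String[] NODES = {"node1.example.com", "node2.example.com"};

        // Must match the scheduler name the remote nodes were started with.
        private static final String SCHEDULER_NAME = "NodeLocalScheduler";

        public static void main(String[] args) throws Exception {
            for (String host : NODES) {
                Properties props = new Properties();
                props.put("org.quartz.scheduler.instanceName", SCHEDULER_NAME);
                props.put("org.quartz.scheduler.rmi.proxy", "true");
                props.put("org.quartz.scheduler.rmi.registryHost", host);
                props.put("org.quartz.scheduler.rmi.registryPort", "1099");

                // The factory returns a client-side proxy; triggerJob() fires on the remote node.
                Scheduler remote = new StdSchedulerFactory(props).getScheduler();
                remote.triggerJob(JobKey.jobKey("cacheSyncJob", "maintenance"));

                // Drop the proxy from the local SchedulerRepository so the next loop
                // iteration builds a fresh proxy instead of reusing this cached one.
                SchedulerRepository.getInstance().remove(SCHEDULER_NAME);
            }
        }
    }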
