Task tracker running but no job is submitted to it - hadoop

I am running a 4 core node EMR cluster. I noticed on the aws console that the number of task trackers have reduced from 4 to 3. I checked on all the individual nodes of the cluster and task trackers are running on all of them. It seems one of the tasktracker is not visible to the jobtracker. What could be causing this? Is there a way to rectify this?
Thanks,

Related

Number of yarn applications getting launched as soon as hadoop services gets up. Cluster is 4 nodes ie. Hadoop HA cluster

Hadoop-HA cluster - 4 nodes
As soon as I start hadoop services unnecessary yarn applications gets launched and no application logs gets generated. Not able to debug problem without logs. Can anyone help me to resolve this issue.
https://i.stack.imgur.com/RjvkB.png
Never come across such issue. But it seems that there is some script or may be some oozie job triggering these apps. Try Yarn-Clean if this is of any help.
Yarn-Clean

YARN and Hadoop

I had a couple of questions regarding job submission to HDFS and the YARN architecture in Hadoop:
So in the Hadoop ecosystem you have one NameNode for each cluster which can contain any number of data nodes that store your data. When you submit a job to Hadoop, the job tracker on the NameNode will pick each job and assign it to the task tracker on which the file is present on the data node.
So my question is how do the components of YARN work together in HDFS:?
So YARN consists of the NodeManager and the Resource Manager. Out of these two components: Is the NodeManager run on every DataNode and the ResourceManager runs on each NameNode for each cluster? So when the task tracker (in each DataNode) gets assigned a task from the job tracker (in the NameNode), the NodeManager in a specific data node will create an container which will request resources from the ResourceManager in the NameNode. So this resource manager and node manager only come into play when a task tracker in a data node gets a job from the job tracker in the NameNode, in which the NodeManager will ask the ResourceManager for resources for the job to be executed. Is this correct?
You are partially correct. YARN was brought into picture to avoid the burden of Jobtracker which does both scheduling and monitoring. So with YARN you dont have any Job tracker or task tracker. The job done by Job tracker is now done by Resource Manager which has two main components Scheduler(allocating resources to applications) and ApplicationsManager(accepting job submissions and restarts the ApplicationMaster in case of any failure). Now each application has a ApplicationMaster which negotiates containers(where the job would be run) from the scheduler for running application.
Nodemanager runs on every slave node/data node. Resource Manager may/maynot be installed where the namenode is present. For a large cluster we usually need to separate the masters, so that the load doesn't go to a single machine.

YARN Queue Can't Run More Than One Spark Job

I can run several jobs (MapReduce, Hive) in one queue. But if I run a Spark/Spark Streaming job, every job added after that will be in ACCEPTED state but not RUNNING. Only after I kill the Spark job the other job will be RUNNING.
I tried to create a different queue for Spark and non Spark jobs, they work as expected but this is not what I want.
My questions:
1. Is this YARN or Spark config issue?
2. What is the right config to solve that issue?
Any helps will be appreciated, thanks.

Hadoop on Mesos uses only one node?

I have successfully set up Mesos 0.22.1 cluster on 5 nodes. I can run Marathon and Chronos tasks on all slave nodes. Now I’m trying to run Hadoop jobs using Mesos Scheduler. I have followed very good tutorial and I could run wordcount test job. But when I try to run some larger job (loading data from Kafka to HDFS using Camus) job is running without the errors, but uses only one node with one task tracker, though it has in total 30 map jobs, and my nodes configured to run 2 map jobs in parallel.
What am I missing? Shouldn’t Jobtracker split task to run in parallel on all available nodes using 2 Map slots on eash node?
And what is strange - on Jobtracker webpage cluster summary reports only 1 available node. Is it correct behavior?
Any ideas are greatly appreciated!

Map on data node is run by whom?

I faced this tricky question in one of my interviews.
Question was
Who run map on data node ?
Answer is neither Job tracker nor task tracker.
Could anybody help me please
Datanodes do not run any task, they are part of HDFS and take care of storing data.
So "map on data node" makes no sense at all.
if hadoop 1.x is installed on the system then If task tracker is running on the same data node then task tracker daemon is the one who would run the map task after getting instruction from job tracker .
if no task tracker is running on the data node then no map task can run on that node , data node takes care of storage part it has nothing to do with map processing .
if hadoop 2.x then application master is the entity which does so by coordinating with node manager and resource manager.

Resources