Get list of executed job on Hadoop cluster after cluster reboot - hadoop

I have a hadoop cluster 2.7.4 version. Due to some reason, I have to restart my cluster. I need job IDs of those jobs that were executed on cluster before cluster reboot. Command mapred -list provide currently running of waiting jobs details only

You can see a list of all jobs on the Yarn Resource Manager Web UI.
In your browser go to http://ResourceManagerIPAdress:8088/
This is how the history looks on the Yarn cluster I am currently testing on (and I restarted the services several times):
See more info here

Related

Hadoop job submission using Apache Ignite Hadoop Accelerators

Disclaimer: I am new to both Hadoop and Apache Ignite. sorry for the lengthy background info.
Setup:
I have installed and configured Apache Ignite Hadoop Accelerator. Start-All.sh brings up the below services. I can submit Hadoop jobs. They complete and I can see results as expected. The start all uses traditional core-site, hdfs-site, mapred-site, and yarn-site configuration files.
28336 NodeManager
28035 ResourceManager
27780 SecondaryNameNode
27429 NameNode
28552 Jps
27547 DataNode
I also have installed Apache Ignite 2.6.0. I am able to start ignite nodes, connect to it using web console. I was able to load the cache from MySQL and run SQL queries and java programs against this cache.
For running Hadoop jobs using ignited Hadoop, I created a separate ignite-config directory, in which I have customized core-site and mapred-site configurations as per the instructions in the Apache ignite web site.
Issue:
When I run a Hadoop job using the command:
hadoop --config ~/ignite-conf jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar wordcount input output1
I get the below error (Note, the same job ran successfully against the Hadoop/without ignite):
java.io.IOException: Failed to get new job ID.
...
...
Caused by: class org.apache.ignite.internal.client.GridClientDisconnectedException: Latest topology update failed.
...
...
Caused by: class org.apache.ignite.internal.client.GridServerUnreachableException: Failed to connect to any of the servers in list: [/:13500]
...
...
It looks like, there was attempt made to lookup the jobtracker (13500) and it was not able to find. From the service list above, it's obvious that job tracker is not running. However, the job ran just fine on non-ignited hadoop over YARN.
Can you help please?
This is resolved in my case.
The job tracker here meant the Apache Ignite memory cache services listening on port 11211.
After making this change in mapred-site.xml, the job ran!

Do we need to put namenode in safe mode before restarting the job tracker?

I have a Hadoop cluster running Cloudera's CDH3, Apache Hadoop's 0.20.2 equivalent. I want to restart the job-tracker as there are some jobs which are not getting killed. I tried killing them from the command line, the command executes successfully, but the jobs are still in Job Cleanup: Pending status. Anyways I want to restart the job-tracker and see if that cleanup the jobs. I know the command to restart the job-tracker, but I am not sure if I need to put the name-node in safe-mode before I restart the job-tracker.
You can try to kill the unwanted jobs using hadoop job -kill <Job-ID> and check for command status echo "$?". If that doesn't work, Restart is the only option.
Hadoop Jobtracker and namenodes are independent components, No need to execute namenode safenode before Jobtracker restart. You can restart Jobtracker process alone.(tasktracker if required)

gcloud console indicating job is running, while hadoop application manager says it is finished

The job that I've submitted to spark cluster is not finishing. I see it is pending forever, however logs say that even spark jetty connector is shut down:
17/05/23 11:53:39 INFO org.spark_project.jetty.server.ServerConnector: Stopped ServerConnector#4f67e3df{HTTP/1.1}{0.0.0.0:4041}
I run latest cloud dataproc v1.1 (spark 2.0.2) on yarn. I submit spark job via gcloud api:
gcloud dataproc jobs submit spark --project stage --cluster datasys-stg \
--async --jar hdfs:///apps/jdbc-job/jdbc-job.jar --labels name=jdbc-job -- --dbType=test
The same spark pi stuff is finished correctly:
gcloud dataproc jobs submit spark --project stage --cluster datasys-stg --async \
--class org.apache.spark.examples.SparkPi --jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 100
While visiting hadoop application manager interface I see it is finished with Successful result:
Google cloud console and job list is showing it is still running until killed (see job run for 20 hours before killed, while hadoop says it ran for 19 seconds):
Is there something I can monitor to see what is preventing gcloud to finish the job?
I couldn't find anything that I can monitor my application is not finishing, but I've found the actual problem and fixed it. Turns out I had abandoned threads in my application - I had connection to RabbitMQ and that seemed to create some threads that prevented application from being finally stoped by gcloud.

DataNode automatically getting restarted in the CDH5 cluster

We have setup a cluster with 6 slave nodes. I am trying to see how replication happens when one of the DataNode dies.
I logged into one of the slave and killed the DataNode using the kill -9 command. After sometime the DataNode is restarted automatically and HDFS gets back into healthy status. I am verify this because the PID of the DataNode has changed.
I don't see any documentation on the above behavior of DataNode. Is this the Apache Hadoop or Cloudera CDH feature? Any reference to the documentation is appreciated.
As the pid of datanode has been changed, I don't think it is a behavior of datanode. If you are managing your cluster using Cloudera Manager, there is an option for restarting datanode daemon if it fails(Automatically Restart Process). This option will be set by default. When the datanode process gets failed or killed, As Automatic restart option is set Cloudera Scm agent will start the the datanode daemon.
For Automatic restart option : Choose HDFS services -> go to Configuration section -> Search for automatic restart.
This feature is available in CM 4.X release as well.

Spark on yarn concept understanding

I am trying to understand how spark runs on YARN cluster/client. I have the following question in my mind.
Is it necessary that spark is installed on all the nodes in yarn cluster? I think it should because worker nodes in cluster execute a task and should be able to decode the code(spark APIs) in spark application sent to cluster by the driver?
It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster". Why does client node have to install Hadoop when it is sending the job to cluster?
Adding to other answers.
Is it necessary that spark is installed on all the nodes in the yarn
cluster?
No, If the spark job is scheduling in YARN(either client or cluster mode). Spark installation is needed in many nodes only for standalone mode.
These are the visualizations of spark app deployment modes.
Spark Standalone Cluster
In cluster mode driver will be sitting in one of the Spark Worker node whereas in client mode it will be within the machine which launched the job.
YARN cluster mode
YARN client mode
This table offers a concise list of differences between these modes:
pics source
It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side)
configuration files for the Hadoop cluster". Why does the client node have
to install Hadoop when it is sending the job to cluster?
Hadoop installation is not mandatory but configurations(not all) are!. We can call them Gateway nodes. It's for two main reasons.
The configuration contained in HADOOP_CONF_DIR directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration.
In YARN mode the ResourceManager’s address is picked up from the
Hadoop configuration(yarn-default.xml). Thus, the --master parameter is yarn.
Update: (2017-01-04)
Spark 2.0+ no longer requires a fat assembly jar for production
deployment. source
We are running spark jobs on YARN (we use HDP 2.2).
We don't have spark installed on the cluster. We only added the Spark assembly jar to the HDFS.
For example to run the Pi example:
./bin/spark-submit \
--verbose \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar \
--num-executors 2 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 4 \
hdfs://master:8020/spark/spark-examples-1.3.1-hadoop2.6.0.jar 100
--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar - This config tell the yarn from were to take the spark assembly. If you don't use it, it will upload the jar from were you run spark-submit.
About your second question: The client node doesn't not need Hadoop installed. It only needs the configuration files. You can copy the directory from your cluster to your client.
1 - Spark if following s slave/master architecture. So on your cluster, you have to install a spark master and N spark slaves. You can run spark in a standalone mode. But using Yarn architecture will give you some benefits.
There is a very good explanation of it here : http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
2- It is necessary if you want to use Yarn or HDFS for example, but as i said before you can run it in standalone mode.
Let me try to cut glues and make it short for impatient.
6 components: 1. client, 2. driver, 3. executors, 4. application master, 5. workers, and 6. resource manager; 2 deploy modes; and 2 resource (cluster) management.
Here's the relation:
Client
Nothing special, is the one submitting spark app.
Worker, executors
Nothing special, one worker holds one or more executors.
Master, & resource (cluster) manager
(no matter client or cluster mode)
in yarn, resource manager and master sit in two different nodes;
in standalone, resource manager == master, same process in the same node.
Driver
in client mode, sits with client
in yarn - cluster mode, sits with master (in this case, client process exits after submission of app)
in standalone - cluster mode, sits with one worker
VoilĂ !

Resources