I am learning Hadoop Administration but I don't know how to start debugging if my job is taking more time than its average, or where to start the debugging if my job has failed.
I generally starts with logs in Resource Manager UI but I want to know if there is any other process to debug as a Hadoop Admin. I am looking for a generic approach to follow for debugging Hadoop jobs using Hortonworks Ambari Web UI.
Logs help in case you end up with failed jobs. I am assuming your jobs are successful but slow.
Best place to start debugging for slow running jobs would be to look at the Job Counters (Once you get into the MR App Master page for the Job, you can find the counters link on left panel). Look at Hadoop Definitive Guide's Chapter 8 for details on what each counter means.
Related
I am running a MapReduce job in Hadoop 2.7.3 in a single node cluster. How do I calculate the time taken by the map and reduce tasks of this job?
SOLVED
In case it helps anyone who views this question or faces a similar problem.
Thanks to #Shubham's answer and a little research I did:
Job tracker has been removed in hadoop 2. It has been split into resource manager and application master.
To access the Resource manager, type in the URL in your browser "http://localhost:8088"
To access the Job History Server (to view statistics about the applications and jobs that have been completed) type in the URL in your browser "http://localhost:19888"
You could encounter an error when trying to access the Job History Server. It may show that there is no history for the application. In that case follow these steps:
Change the bashrc file
Steps:
i. In your terminal, type "nano ~/.bashrc"
ii. Now in this file, where the other hadoop variables are written add the line
export HADOOP_CONFIG_DIR=/usr/local/hadoop/etc/hadoop
iii. Exit out of nano and save the file.
iv. Run the command "source ~/.bashrc"
1. To start the job history server
Steps:
i. Run the command in your terminal
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONFIG_DIR start historyserver
ii. Then run the command
jps
You should be able to see the "JobHistoryServer" in the list
iii. Now run the command
netstat -ntlp | grep 19888
Hit resource manager's web UI(http://rm_http_address_host:port/). Typically the web port is 8088. You can hit http://resourcemanager_host:8088/ for this.
There you will find the link for all the applications that are in various states like STARTED, RUNNING, FAILED, SUCCEEDED etc
Clicking on each application's link will give you all the statistics(like number of containers(mappers/reducers in case of mapreduce), memory/Vcores used, running time and a lot more stats) about that yarn job.
And a lot many stats are exposed by ResourceManager REST API’s. Find them here https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html
You can go to the jobtracker (Runs on port 50030 by default) and check the job details. It shows the counters for Map time and reduce time. Moreover if you are interested in individual tasks you can follow link "Analyse This Job" that shows best and worst performing tasks.
I need to track what is happening when I run a job or upload a file to HDFS. I do this using sql profiler in sql server. However, I miss such a tool for hadoop and so I am assuming that I can get some information from logs. I thing all logs are stored at /var/logs/hadoop/ but I am confused with what file I need to look at and how to set that file to capture detailed level information.
I am using HDP2.2.
Thanks,
Sree
'Hadoop' represents an entire ecosystem of different products. Each one has its own logging.
HDFS consists of NameNode and DataNode services. Each has its own log. Location of logs is distribution dependent. See File Locations for Hortonworks or Apache Hadoop Log Files: Where to find them in CDH, and what info they contain for Cloudera.
In Hadoop 2.2, MapReduce ('jobs') is a specific application in YARN, so you are talking about ResourceManager and NodeManager services (the YARN components), each with its own log, and then there is the MRApplication (the M/R component), which is a YARN applicaiton yet with its own log.
Jobs consists of taks, and tasks themselves have their own logs.
In Hadoop 2 there is a dedicated Job History service tasked with collecting and storing the logs from the jobs executed.
Higher level components (eg. Hive, Pig, Kafka) have their own logs, asside from the logs resulted from the jobs they submit (which are logging as any job does).
The good news is that vendor specific distribution (Cloudera, Hortonworks etc) will provide some specific UI to expose the most common logs for ease access. Usually they expose the JobHistory service collected logs from the UI that shows job status and job history.
I cannot point you to anything SQL Profiler equivalent, because the problem space is orders of magnitude more complex, with many different products, versions and vendor specific distributions being involved. I recommend to start by reading about and learning how the Job History server runs and how it can be accessed.
When I run a hadoop job with the hadoop application it prints a lot of stuff. Among them, It show the relative progress of the job ("map: 30%, reduce: 0%" and stuff like that). But, when running a job without the application it does not print anything, not even errors. Is there a way to get that level of logging without the application? That is, without running [hadoop_folder]/bin/hadoop jar <my_jar> <indexer> <args>....
You can get this information from Application Master (assuming you use YARN and not MR1 where you would get it from Job Tracker). There is usually web UI where you can find this information. Details will depend on your Hadoop installation / distribution.
In case of Hadoop v1 check Job tracker web URL and in case of Hadoop v2 check Application Master web UI
I'm running an Oozie job with multiple actions and there's a part I could not make it work. In the process of troubleshooting I'm overwhelmed with lots of logs.
In YARN UI (yarn.resourcemanager.webapp.address in yarn-site.xml, normally on port 8088), there's the application_<app_id> logs.
In Job History Server (yarn.log.server.url in yarn-site.xml, ours on port 19888), there's the job_<job_id> logs. (These job logs should also show up on Hue's Job Browser, right?)
In Hue's Oozie workflow editor, there's the task and task_attempt (not sure if they're the same, everything's a mixed-up soup to me already), which redirects to the Job Browser if you clicked here and there.
Can someone explain what's the difference between these things from Hadoop/Oozie architectural standpoint?
P.S.
I've seen in logs container_<container_id> as well. Might as well include this in your explanation in relation to the things above.
In terms of YARN, the programs that are being run on a cluster are called applications. In terms of MapReduce they are called jobs. So, if you are running MapReduce on YARN, job and application are the same thing (if you take a close look, job ids and application ids are the same).
MapReduce job consists of several tasks (they could be either map or reduce tasks). If a task fails, it is launched again on another node. Those are task attempts.
Container is a YARN term. This is a unit of resource allocation. For example, MapReduce task would be run in a single container.
I am carrying out several Hadoop tests using TestDFSIO and TeraSort benchmark tools. I am basically testing with different amount of datanodes in order to assess the linearity of the processing capacity and datanode scalability.
During the above mentioned process, I have obviously had to restart several times all Hadoop environment. Every time I restarted Hadoop, all MapReduce jobs are removed and the job counter starts again from "job_2013*_0001". For comparison reasons, it is very important for me to keep all the MapReduce jobs up that I have previously launched. So, my question is:
¿How can I avoid Hadoop removes all MapReduce-job history after it is restarted?
¿Is there some property to control job removing after Hadoop environment restarting?
Thanks!
the MR job history logs are not deleted right way after you restart hadoop, the new job will be counted from *_0001 and only new jobs which are started after hadoop restart will be displayed on resource manager web portal though. In fact, there are 2 log related settings from yarn default:
# this is where you can find the MR job history logs
yarn.nodemanager.log-dirs = ${yarn.log.dir}/userlogs
# this is how long the history logs will be retained
yarn.nodemanager.log.retain-seconds = 10800
and the default ${yarn.log.dir} is defined in $HADOOP_HONE/etc/hadoop/yarn-env.sh.
YARN_LOG_DIR="$HADOOP_YARN_HOME/logs"
BTW, similar settings could also be found in mapred-env.sh if you are use Hadoop 1.X