Calculate time taken by reducers in Hadoop

I am running a MapReduce job in Hadoop 2.7.3 in a single node cluster. How do I calculate the time taken by the map and reduce tasks of this job?
SOLVED
In case it helps anyone who views this question or faces a similar problem, here is what I found, thanks to Shubham's answer and a little research of my own:
The JobTracker has been removed in Hadoop 2; its responsibilities are split between the ResourceManager and the per-application ApplicationMaster.
To access the ResourceManager, open "http://localhost:8088" in your browser.
To access the Job History Server (to view statistics about completed applications and jobs), open "http://localhost:19888" in your browser.
You may encounter an error when trying to access the Job History Server: it can report that there is no history for the application. In that case, follow these steps:
1. Change the bashrc file
Steps:
i. In your terminal, type "nano ~/.bashrc"
ii. In this file, where the other Hadoop variables are written, add the line
export HADOOP_CONFIG_DIR=/usr/local/hadoop/etc/hadoop
iii. Save the file and exit nano.
iv. Run the command "source ~/.bashrc"
2. To start the Job History Server
Steps:
i. Run the command in your terminal
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONFIG_DIR start historyserver
ii. Then run the command
jps
You should be able to see the "JobHistoryServer" in the list
iii. Now run the command
netstat -ntlp | grep 19888
You should see the history server listening on port 19888.
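Once the history server is up, the per-job timing statistics (including average map and reduce task times) can also be pulled straight from its REST API instead of the web UI. A minimal Python sketch, assuming the history server on its default port 19888; the job id and the numbers in the sample are made up for the demo:

```python
import json
from urllib.request import urlopen

def fetch_job(history_host, job_id):
    """Fetch one job's summary object from the JobHistory server REST API."""
    url = "http://%s/ws/v1/history/mapreduce/jobs/%s" % (history_host, job_id)
    return json.load(urlopen(url))["job"]

def task_times(job):
    """Pull the timing fields (all in milliseconds) out of a job summary."""
    return {
        "avg_map_ms": job["avgMapTime"],
        "avg_reduce_ms": job["avgReduceTime"],
        "total_ms": job["finishTime"] - job["startTime"],
    }

# Live usage would be: task_times(fetch_job("localhost:19888", "job_..._0001"))
# Demo with a response-shaped sample instead (values are invented):
sample = {"avgMapTime": 4200, "avgReduceTime": 9800,
          "startTime": 1490000000000, "finishTime": 1490000060000}
print(task_times(sample))  # {'avg_map_ms': 4200, 'avg_reduce_ms': 9800, 'total_ms': 60000}
```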

Hit the ResourceManager's web UI at http://resourcemanager_host:8088/ (the web port is 8088 by default).
There you will find links to all the applications in their various states (STARTED, RUNNING, FAILED, SUCCEEDED, etc.).
Clicking on an application's link gives you all the statistics about that YARN job: the number of containers (mappers/reducers in the case of MapReduce), memory and vCores used, running time, and much more.
Many more statistics are exposed by the ResourceManager REST APIs. Find them here: https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html
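As a sketch of how those REST APIs can be scripted, the following assumes the ResourceManager on its default port 8088; the application id and timestamps in the sample are hypothetical:

```python
import json
from urllib.request import urlopen

def fetch_apps(rm_host):
    """List all applications known to the ResourceManager."""
    data = json.load(urlopen("http://%s/ws/v1/cluster/apps" % rm_host))
    return data["apps"]["app"]

def finished_runtimes(apps):
    """Map application id -> elapsed wall-clock seconds, for finished apps."""
    return {a["id"]: (a["finishedTime"] - a["startedTime"]) / 1000.0
            for a in apps if a.get("state") == "FINISHED"}

# Live usage would be: finished_runtimes(fetch_apps("localhost:8088"))
# Demo with a response-shaped sample instead (values are invented):
sample = [{"id": "application_1490000000000_0001", "state": "FINISHED",
           "startedTime": 1490000000000, "finishedTime": 1490000042000}]
print(finished_runtimes(sample))  # {'application_1490000000000_0001': 42.0}
```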

This applies to Hadoop 1: you can go to the JobTracker web UI (it runs on port 50030 by default) and check the job details. It shows counters for map time and reduce time. Moreover, if you are interested in individual tasks, you can follow the "Analyse This Job" link, which shows the best- and worst-performing tasks.

Related

Get status when running job without hadoop

When I run a Hadoop job with the hadoop application it prints a lot of output. Among other things, it shows the relative progress of the job ("map: 30%, reduce: 0%" and the like). But when I run a job without the application it does not print anything, not even errors. Is there a way to get that level of logging without the application? That is, without running [hadoop_folder]/bin/hadoop jar <my_jar> <indexer> <args>....
You can get this information from the ApplicationMaster (assuming you use YARN, and not MR1, where you would get it from the JobTracker). There is usually a web UI where you can find this information; the details will depend on your Hadoop installation/distribution.
In the case of Hadoop v1, check the JobTracker web UI; in the case of Hadoop v2, check the ApplicationMaster web UI.
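If you would rather not scrape a web UI, the ResourceManager REST API also exposes the same progress figure the CLI prints, so you can poll it from a script. A small sketch, assuming YARN with the ResourceManager on its default port 8088; the application id is a placeholder:

```python
import json
from urllib.request import urlopen

def app_progress(rm_host, app_id):
    """Return an application's progress (0-100) from the ResourceManager REST API."""
    url = "http://%s/ws/v1/cluster/apps/%s" % (rm_host, app_id)
    return json.load(urlopen(url))["app"]["progress"]

def format_progress(progress):
    """Render a progress value roughly the way the CLI does."""
    return "progress: %d%%" % round(progress)

# Live usage: format_progress(app_progress("localhost:8088", "application_..._0001"))
print(format_progress(30.0))  # progress: 30%
```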

Debugging procedure for Hadoop Failed/Hung job

I am learning Hadoop administration, but I don't know how to start debugging if my job is taking longer than its average, or where to start if my job has failed.
I generally start with the logs in the ResourceManager UI, but I want to know if there is any other process to debug as a Hadoop admin. I am looking for a generic approach for debugging Hadoop jobs using the Hortonworks Ambari Web UI.
Logs help when you end up with failed jobs; I am assuming your jobs are successful but slow.
The best place to start debugging slow-running jobs is the job counters (once you get into the MR ApplicationMaster page for the job, you can find the Counters link in the left panel). See Chapter 8 of Hadoop: The Definitive Guide for details on what each counter means.
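Beyond the UI, the same counters are exposed as JSON by the JobHistory server's REST API (under /ws/v1/history/mapreduce/jobs/<jobid>/counters), which is handy for scripting slow-job triage. A minimal lookup sketch; the sample payload below is shaped like the API response, but its values are invented:

```python
def find_counter(job_counters, group, name):
    """Look up one counter's total value in a JobHistory /counters payload."""
    for g in job_counters["counterGroup"]:
        if g["counterGroupName"] == group:
            for c in g["counter"]:
                if c["name"] == name:
                    return c["totalCounterValue"]
    return None

# Sample shaped like the "jobCounters" object of the /counters response
# (the value is made up for the demo):
sample = {"counterGroup": [
    {"counterGroupName": "org.apache.hadoop.mapreduce.TaskCounter",
     "counter": [{"name": "CPU_MILLISECONDS", "totalCounterValue": 58000}]}]}
print(find_counter(sample,
                   "org.apache.hadoop.mapreduce.TaskCounter",
                   "CPU_MILLISECONDS"))  # 58000
```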

where to find the M/R configuration file and update it

Our Hadoop cluster's JobTracker process gradually eats up memory, to the point that we have to restart the cluster every week. I searched around for a possible solution; one post suggested decreasing 'mapred.jobtracker.completeuserjobs.maximum' to 5. So I checked mapred-site.xml under the /hadoop-install/conf directory on the name node and found two entries for that parameter: one sets it to 30, the other to 5. When I go to any of the data nodes and check mapred-site.xml, I don't find the parameter at all. However, when I checked a running job on the M/R administration page and looked at its job file, it showed the parameter set to 100. I'm really confused about where this parameter is set, and if I update it, do I need to restart the cluster? We are running Apache Hadoop 1.2.1 on Google Cloud.
Hadoop does not automatically copy the configuration files from your driver machine to all of the cluster machines. You need to do that via scp and/or rsync, or preferably with an automated deployment tool like Chef, Ansible, or Puppet.
As for individual job parameters: you can actually set them on a per-job basis by using -D (this assumes the job's driver parses generic options, e.g. via ToolRunner):
hadoop jar <path to jar>/myHadoopJobJar.jar -Dmapred.jobtracker.completeuserjobs.maximum=5

hadoop web interface failed to show job history

I can access most functionality of the Hadoop admin site. But when I try to visit the history of each application, I have no luck any more.
Does anybody know what is wrong with my environment? Where should I check?
BTW, when I run "netstat -a" on my VM, I find no records for port 8088 or 19888, which seems very strange to me, because port 8088 serves the Hadoop main page and works well.
In this web interface you can see your jobs in real time while they are running, as well as the history:
Once a M/R job finishes, the ResourceManager no longer cares about it. That is the job of the history server.
Your history server (an optional part of Hadoop YARN) does not seem to be running.
It is this service that listens on port 19888.
On installations that package it as a service, you can launch it with the command: /etc/init.d/hadoop-mapreduce-historyserver start

Hadoop removes MapReduce history when it is restarted

I am carrying out several Hadoop tests using the TestDFSIO and TeraSort benchmark tools. I am basically testing with different numbers of datanodes in order to assess the linearity of the processing capacity and datanode scalability.
During this process, I have obviously had to restart the whole Hadoop environment several times. Every time I restart Hadoop, all MapReduce jobs are removed and the job counter starts again from "job_2013*_0001". For comparison purposes, it is very important for me to keep all the MapReduce jobs that I have previously launched. So, my questions are:
How can I prevent Hadoop from removing all MapReduce job history after it is restarted?
Is there some property to control job removal after the Hadoop environment is restarted?
Thanks!
The MR job history logs are not deleted right away after you restart Hadoop; however, new jobs will be counted from *_0001, and only jobs started after the restart will be displayed on the ResourceManager web portal. In fact, there are two log-related settings in the YARN defaults:
# this is where you can find the MR job history logs
yarn.nodemanager.log-dirs = ${yarn.log.dir}/userlogs
# this is how long the history logs will be retained
yarn.nodemanager.log.retain-seconds = 10800
and the default ${yarn.log.dir} is defined in $HADOOP_HOME/etc/hadoop/yarn-env.sh:
YARN_LOG_DIR="$HADOOP_YARN_HOME/logs"
BTW, similar settings can also be found in mapred-env.sh if you are using Hadoop 1.x.
