How do you get task-level progress from a Hadoop MapReduce job?

I am trying to monitor the progress of tasks within a job in Hadoop 1.2.1 MapReduce. I am able to start up my job with the following command:
RunningJob runningJob = JobClient.submitJob(conf);
I see that I can get the job status of all of the map and reduce tasks like this:
JobStatus jobStatus = runningJob.getJobStatus();
float mapProgress = jobStatus.mapProgress();
float reduceProgress = jobStatus.reduceProgress();
But I want to be able to get the progress of each task running within the job. Is there an API call to do this?

Related

When multiple services using go-cron start at the same time, are the cron tasks executed multiple times?

I use go-cron for scheduled work. When several services are started at the same time, the cron tasks are executed several times at once. What can be done to ensure that a cron task is executed only once?
My cron expression is [0 2 */1 * *].
You can configure SingletonMode for your task to prevent a new run from starting if the prior run has not yet completed:
s := gocron.NewScheduler(time.UTC)
_, _ = s.Every(1).Second().SingletonMode().Do(task)
https://pkg.go.dev/github.com/go-co-op/gocron#Scheduler.SingletonMode
Or enable SingletonModeAll on the scheduler, so that for every job a new run does not start if the prior instance of that particular job has not yet completed:
s := gocron.NewScheduler(time.UTC)
s.SingletonModeAll()
_, _ = s.Every(1).Second().Do(task)
https://pkg.go.dev/github.com/go-co-op/gocron#Scheduler.SingletonModeAll

How to get the MapReduce job number from Hive server

If I use the Hive CLI, the log is:
Total MapReduce jobs = 1
Stage-1 is selected by condition resolver.
Launching Job 1 out of 1
But in Hive server or Beeline, the log is:
INFO : Stage-1 is selected by condition resolver.
INFO : Number of reduce tasks not specified. Estimated from input data size: 1
How can I get the job number?
I need to calculate the job progress and print it.

Hive takes a long time to launch a Hadoop job

I am a newbie to Hadoop and Hive. I am using Hive integration with Hadoop to execute queries. When I submit any query, the following log messages appear on the console:
Hive history file=/tmp/root/hive_job_log_root_28058#hadoop2_201203062232_1076893031.txt
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_201203062223_0004, Tracking URL = http://:50030/jobdetails.jsp?jobid=job_201203062223_0004
Kill Command = //opt/hadoop_installation/hadoop-0.20.2/bin/../bin/hadoop job -kill job_201203062223_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2012-03-06 22:32:26,707 Stage-1 map = 0%, reduce = 0%
2012-03-06 22:32:29,716 Stage-1 map = 100%, reduce = 0%
2012-03-06 22:32:38,748 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201203062223_0004
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 HDFS Read: 8107686 HDFS Write: 4 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
The lines beginning with Starting Job and Kill Command (bold in the original post) start a Hadoop job, as far as I can tell. It takes a long time to start the job. Once these lines have been executed, the MapReduce operations run swiftly. My questions are:
Is there any way to make the launch of the Hadoop job faster? Is it possible to skip this phase?
Where does the value of 'Kill Command' come from (in the bold text)?
Please let me know if any further input is required.
1) Starting Job = job_201203062223_0004, Tracking URL = http://:50030/jobdetails.jsp?jobid=job_201203062223_0004
Answer: Your HQL query is translated into a Hadoop job. Hadoop then does some background work (planning resources, data locality, working out the stages needed to process the query, launch configuration, generating job and task IDs, etc.), launches the mappers, sorts and shuffles, reduces (aggregation), and writes the result to HDFS.
This flow is part of the Hadoop job life cycle, so none of it can be skipped.
You can monitor the job status by its job ID (job_201203062223_0004) at http://namenode:port/jobtracker.jsp.
2) Kill Command = HADOOP_HOME/bin/hadoop job -kill job_201203062223_0004
Answer: These lines are shown before your mappers are launched because Hadoop works on big data, and a job may take more or less time depending on the size of your dataset. The line is a hint in case you want to kill the job at any point. It is shown for every Hadoop job, and printing an info line like this does not take much time.
Some additions in response to your comments:
Hive is not meant for low-latency jobs; immediate results are not possible.
(Please check the stated purpose of Hive in the Apache Hive documentation.)
The launch overhead (see question 1: Hadoop does some background work) is present in Hive and cannot be avoided.
Even for small datasets, this launch overhead is present in Hadoop.
PS: If you really need quick results, please look at Shark.
First, Hive is a tool that replaces your MapReduce work with HQL. In the background, it has lots of predefined functions and MapReduce programs. When you run an HQL query, the Hadoop cluster does a lot of things: finding the data blocks, allocating tasks, and so on.
Second, you can kill a job with the hadoop shell command.
If your job ID is AAAAA,
you can execute the command below to kill it:
$HADOOP_HOME/bin/hadoop job -kill AAAAA
The launch of a Hadoop job can be delayed when resources are unavailable. If you use YARN, you may see that jobs are in the ACCEPTED state but not yet running. This means some other ongoing job has consumed all of the available resources and the new query is waiting to run.
You can kill the older job with the hadoop job -kill <job_id> command, or wait for it to finish.
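The job ID used in the kill command is the one Hive prints in the "Starting Job" line, and the "Stage-N map = X%, reduce = Y%" lines carry the progress. As a rough sketch, assuming log lines shaped like the ones quoted in the question, both can be pulled out with regular expressions. The class name HiveLogParser and its methods are hypothetical helpers written for this illustration, not part of Hive or Hadoop.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: extract the job ID and stage progress from Hive CLI log lines. */
final class HiveLogParser {
    // Matches e.g. "Starting Job = job_201203062223_0004, Tracking URL = ..."
    private static final Pattern JOB_ID =
            Pattern.compile("Starting Job = (job_\\d+_\\d+)");

    // Matches e.g. "Stage-1 map = 100%, reduce = 0%" (whitespace may vary)
    private static final Pattern PROGRESS = Pattern.compile(
            "Stage-(\\d+)\\s+map\\s*=\\s*(\\d+)%,\\s*reduce\\s*=\\s*(\\d+)%");

    /** Returns the job ID if the line is a "Starting Job" line. */
    static Optional<String> jobId(String line) {
        Matcher m = JOB_ID.matcher(line);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Returns {stage, mapPercent, reducePercent}, or an empty array if absent. */
    static int[] progress(String line) {
        Matcher m = PROGRESS.matcher(line);
        if (!m.find()) {
            return new int[0];
        }
        return new int[] {
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3))
        };
    }

    /** Builds the kill command in the form shown in the answers above. */
    static String killCommand(String jobId) {
        return "$HADOOP_HOME/bin/hadoop job -kill " + jobId;
    }

    public static void main(String[] args) {
        String start = "Starting Job = job_201203062223_0004, Tracking URL = ...";
        String stage = "2012-03-06 22:32:29,716 Stage-1 map = 100%, reduce = 0%";
        jobId(start).ifPresent(id -> System.out.println(killCommand(id)));
        int[] p = progress(stage);
        System.out.println("stage " + p[0] + ": map " + p[1] + "%, reduce " + p[2] + "%");
    }
}
```

This only works where those lines are actually printed, as in the Hive CLI output quoted above.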

Specify Hadoop process split

I want to run Hadoop MapReduce on a small part of my text file.
One of my tasks is failing. In the log I can read:
Processing split: hdfs://localhost:8020/user/martin/history/history.xml:3556769792+67108864
Can I run MapReduce again on this file, only from offset 3556769792 to 3623878656 (3556769792+67108864)?
One way to do it is to copy the part of the file that starts at the given offset and add it back into HDFS. From this point, simply run the MapReduce job on this block only.
1) Copy 67108864 bytes of the file, starting at offset 3556769792:
dd if=history.xml bs=1 skip=3556769792 count=67108864 > history_offset.xml
2) Import it into HDFS:
hadoop fs -copyFromLocal history_offset.xml offset/history_offset.xml
3) Run MapReduce again:
hadoop jar myJar.jar 'offset' 'offset_output'
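The offsets for the copy step come straight from the "Processing split" log line, which has the form path:start+length. A minimal sketch of that arithmetic follows, assuming the log line format quoted in the question. The class SplitRange and its methods are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: parse "Processing split: <path>:<start>+<length>" from a task log. */
final class SplitRange {
    // The greedy path group backtracks to the last ':' before "<digits>+<digits>".
    private static final Pattern SPLIT =
            Pattern.compile("Processing split: (\\S+):(\\d+)\\+(\\d+)\\s*$");

    final String path;
    final long start;
    final long length;

    private SplitRange(String path, long start, long length) {
        this.path = path;
        this.start = start;
        this.length = length;
    }

    /** Exclusive end offset of the split in bytes. */
    long end() {
        return start + length;
    }

    /** True when the start offset is a whole multiple of the split length. */
    boolean isBlockAligned() {
        return length > 0 && start % length == 0;
    }

    static SplitRange parse(String logLine) {
        Matcher m = SPLIT.matcher(logLine);
        if (!m.find()) {
            throw new IllegalArgumentException("not a split line: " + logLine);
        }
        return new SplitRange(
                m.group(1), Long.parseLong(m.group(2)), Long.parseLong(m.group(3)));
    }

    public static void main(String[] args) {
        SplitRange r = parse("Processing split: "
                + "hdfs://localhost:8020/user/martin/history/history.xml:3556769792+67108864");
        System.out.println(r.path);
        System.out.println(r.start + " to " + r.end());
        System.out.println("block aligned: " + r.isBlockAligned());
    }
}
```

For the split in the question, the start offset is exactly 53 times the split length, so dd with bs=67108864 skip=53 count=1 should read the same byte range without copying one byte at a time. Keep in mind that a raw byte range may begin or end in the middle of a record, so the re-run may not see exactly the records the failed task saw.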

MapReduce program not producing the required output in distributed mode

I need some help with my MapReduce code.
The code runs perfectly in Eclipse and in standalone mode, but when I package it and run it locally in pseudo-distributed mode, the output is not what I expect.
Map input records = 11
Map input records = 11
Reduce input records = 11
Reduce output records = 0
These are the values I get,
whereas when I run the same code in Eclipse or in standalone mode, with the same config and input file, I get:
Map input records = 11
Map output records = 11
Reduce input records = 11
Reduce output records = 4
Can anyone tell me what's wrong?
I tried both ways of building the .jar file: from Eclipse (Export -> Runnable JAR) and from the terminal (javac -classpath hadoop-core-1.0.4 -d classes mapredcode.java && jar -cvf mapredcode.jar -C classes/ .).
And how do I debug this?
Are you using a combiner?
If yes, is the output of the combiner in the same format (key and value types) as the output of the mapper?
In Hadoop, the combiner is run at the discretion of the framework itself, and it may not be running in pseudo-distributed mode in your case.
The combiner itself is nothing but a reducer that is used to lower network traffic.
The code should be written so that, even if the combiner does not run, the reducer still gets the expected format from the mapper.
Hope it helps.
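To illustrate the point about combiners, here is a small sketch in plain Java with no Hadoop dependency. It simulates a reducer that sums values and a combiner that applies the same sum to chunks of map output first. The class CombinerCheck, its methods, and the chunking scheme are assumptions made for this illustration; the real framework decides on its own when and how often to run a combiner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: a reducer must give the same result whether or not a combiner ran. */
final class CombinerCheck {
    /** Reducer logic: sum all values seen for one key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Combiner logic: same operation, and its output type matches the map output type. */
    static int combine(List<Integer> values) {
        return reduce(values);
    }

    /** Simulates the combiner running on chunks of map output before the reducer. */
    static int reduceWithCombiner(List<Integer> mapOutput, int chunkSize) {
        List<Integer> combined = new ArrayList<>();
        for (int i = 0; i < mapOutput.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, mapOutput.size());
            combined.add(combine(mapOutput.subList(i, end)));
        }
        return reduce(combined);
    }

    public static void main(String[] args) {
        List<Integer> mapOutput = Arrays.asList(1, 1, 1, 1, 1);
        System.out.println("without combiner: " + reduce(mapOutput));
        System.out.println("with combiner:    " + reduceWithCombiner(mapOutput, 2));
    }
}
```

Both paths give the same total because summing is associative and the combiner emits the same value type that the mapper does. If the combiner emitted a different format, a reducer written for the combiner's output would misbehave whenever the combiner was skipped.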
