How to access MR job counters after job completion? - hadoop

I have an MR job which runs fine in the cluster.
After the job completes I am able to get the YARN logs, but I couldn't find the MR job counters such as the number of input records and output records.
Is it possible to get that information after job completion?

Related

MR job transactional or non-transactional

When we use an MR job with HDFS input/output, the job behaves transactionally: either it executes successfully, or on failure the data written to HDFS is rolled back, so we never get a partial result such as 3 out of 10 lines present in the output.
But when we run the same MR job over HBase it behaves non-transactionally: if I have to put 10 objects into an HTable and I called context.write(...) 3 times before failing on the 4th iteration, I can see 3 puts in HBase even though the MR job has failed.
Is there any way to make an MR job over HBase transactional, i.e. either the entire output is written to HBase or no output is written at all?
Thanks in advance.

Load and process data in parallel inside Hadoop

I am using Hadoop to process big data: I first load data into HDFS and then execute jobs, but this is sequential. Is it possible to do it in parallel? For example,
running 3 jobs and 2 data-load processes for other jobs at the same time on my cluster.
Cheers
It is possible to run all the jobs in parallel in Hadoop if your cluster and jobs satisfy the criteria below:
1) The Hadoop cluster should be able to run a reasonable number of map/reduce tasks (depending on the jobs) in parallel, i.e. it should have enough map/reduce slots.
2) If a job that is currently being run depends on data that is loaded through another process, we cannot run the data load and that job in parallel.
If your processes satisfy the above conditions, you can run all the jobs in parallel.
Using Oozie you can schedule all the processes to run in parallel; the fork and join nodes in an Oozie workflow let you run tasks in parallel.
If your cluster has enough resources to run the jobs in parallel, then yes. But make sure the work of each job doesn't interfere with the others; for example, loading data at the same time that another running job should be using it won't work as you expect.
If there are not enough resources, Hadoop will queue the jobs until resources become available, depending on the scheduler configured.
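If you drive everything from a script, the simplest way to get that parallelism outside of Oozie is to launch the jobs as separate client processes and wait for all of them. A minimal sketch in Python, assuming the jobs are independent; the jar, class names and HDFS paths are hypothetical placeholders:

#!/usr/bin/env python
# Sketch: launch several independent Hadoop jobs in parallel from one driver.
# The jar, main classes and HDFS paths below are hypothetical placeholders.
import subprocess

jobs = [
    ["hadoop", "jar", "analytics.jar", "com.example.JobA", "/data/in/a", "/data/out/a"],
    ["hadoop", "jar", "analytics.jar", "com.example.JobB", "/data/in/b", "/data/out/b"],
    ["hadoop", "jar", "analytics.jar", "com.example.JobC", "/data/in/c", "/data/out/c"],
]

# Start every job without waiting for the previous one to finish.
procs = [subprocess.Popen(cmd) for cmd in jobs]

# Block until all of them have completed and report their exit codes.
for cmd, proc in zip(jobs, procs):
    print("{} exited with {}".format(cmd[3], proc.wait()))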

How to see all Hadoop counters when running Pig

I run my Pig script via the command line, and I want to see all Hadoop counters after the run is finished.
I have written a UDF that writes to a Hadoop counter, based on this blog, but I want to test it - when the Pig job starts I can see logs from the constructor, but later I see no log.
Currently all I see are the basic statistics - see below
Counters:
Total records written : 3487
Total bytes written : 38078
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 101
Total records proactively spilled: 12464701
A Pig job is actually a MapReduce job, so you can see the status of the job and its complete list of counters on the JobTracker page (if using MR1) or the Application Master page (if using YARN).
A single Pig script may create multiple jobs depending on its complexity. You can query all the counters for each job from the command line by running
mapred job -status <job-id>
If you know which counter you are interested in, you can retrieve it individually with
mapred job -counter <job-id> <group-name> <counter-name>
Of course, you need to know the job-id(s) - those should be available in the original pig output following the line 'Job DAG:'
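If you want to collect these counters from a script rather than by hand, you can shell out to the same command for each job id. A rough sketch in Python; the job ids are placeholders and the counter group/names are the standard TaskCounter ones, so verify them against your Hadoop version:

#!/usr/bin/env python
# Sketch: pull selected counters for finished jobs via the mapred CLI.
# Job ids are placeholders; group/counter names should match your Hadoop version.
import subprocess

job_ids = ["job_1400000000000_0001", "job_1400000000000_0002"]
group = "org.apache.hadoop.mapreduce.TaskCounter"
names = ["MAP_INPUT_RECORDS", "MAP_OUTPUT_RECORDS", "REDUCE_OUTPUT_RECORDS"]

for job_id in job_ids:
    for name in names:
        # 'mapred job -counter <job-id> <group-name> <counter-name>' prints the value.
        value = subprocess.check_output(
            ["mapred", "job", "-counter", job_id, group, name]).decode().strip()
        print("{} {} = {}".format(job_id, name, value))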

Changing the name of Hadoop job?

Hi, I would like to change the name of a running Hadoop job to a meaningful name.
Is there any command to change the name of a running job, similar to
hadoop job -set-priority <JOB_ID> 'HIGH', which changes the priority of the job?
The job id is assigned by the JobTracker on submit, by calling JobTracker.getNewJobId(). It cannot be pre-set. To change the job priority, you must retrieve the id after submission. Read the comments on PIG-948 for why it is not always possible to know the MR job id from Pig:
The reason for that is that JobControlCompiler compiles a set of
inter-dependent MR jobs and generates a job-control object which is
then submitted asynchronously to Hadoop for execution. Since we don't
block on those threads, it's possible that job-ids are not yet assigned
when we ask for them.
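In other words, the id only exists once the job has been submitted, so the practical pattern is to capture it from the client output at submission time and then act on it. A rough Python sketch; the jar and class are placeholders, and the 'Running job: job_...' line is the usual MR client log output, but check the exact format on your version:

#!/usr/bin/env python
# Sketch: capture the job id from the client log at submission time, then
# use it with 'hadoop job -set-priority'. The jar/class are placeholders and
# the 'Running job' log pattern may differ between Hadoop versions.
import re
import subprocess

cmd = ["hadoop", "jar", "analytics.jar", "com.example.MyJob", "/in", "/out"]
proc = subprocess.Popen(cmd, stderr=subprocess.PIPE, universal_newlines=True)

job_id = None
for line in proc.stderr:
    print(line, end="")                     # keep the normal client output visible
    m = re.search(r"Running job:\s*(job_\w+)", line)
    if m and job_id is None:
        job_id = m.group(1)
        # The id exists only after submission, so this is the first point at
        # which the job's priority can be changed.
        subprocess.call(["hadoop", "job", "-set-priority", job_id, "HIGH"])

proc.wait()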

Run hive queries, and collect job information

I would like to run a list of generated Hive queries.
For each one, I would like to retrieve the MR job_id (or ids, in the case of multiple stages).
And then, with this job_id, collect statistics from the JobTracker (cumulative CPU, bytes read...).
How can I send Hive queries from a bash or python script, and retrieve the job_id(s)?
For the 2nd part (collecting stats for the job), we're using an MRv1 Hadoop cluster, so I don't have the ApplicationMaster REST API. I'm about to collect data from the JobTracker web UI. Any better idea?
You can get the list of executed jobs by running this command,
hadoop job -list all
then, for each job-id, you can retrieve the stats using the command,
hadoop job -status job-id
And to associate the jobs with a query, you can get the job_name and match it against the query,
something like this:
How to get names of the currently running hadoop jobs?
Hope this helps.
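For the first part (driving the queries and grabbing the job ids), one approach is to run each query with hive -e and scrape the ids from the client output: Hive's MapReduce execution typically logs lines like 'Starting Job = job_..., Tracking URL = ...'. A rough Python sketch under that assumption; the query is a placeholder and the log pattern may need adjusting for your Hive version:

#!/usr/bin/env python
# Sketch: run a Hive query, harvest the MR job id(s) from the client output,
# then ask for their status with 'hadoop job -status'. The query is a
# placeholder and the 'Starting Job' log pattern may vary by Hive version.
import re
import subprocess

query = "SELECT count(*) FROM my_table"

proc = subprocess.Popen(["hive", "-e", query],
                        stderr=subprocess.PIPE, universal_newlines=True)
job_ids = []
for line in proc.stderr:
    m = re.search(r"Starting Job = (job_\w+)", line)
    if m:
        job_ids.append(m.group(1))
proc.wait()

for job_id in job_ids:
    # Prints completion info and the job's counters on MRv1.
    print(subprocess.check_output(["hadoop", "job", "-status", job_id]).decode())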
