Changing the name of a Hadoop job? - hadoop

Hi, I would like to change the name of a running Hadoop job to something meaningful.
Is there a command to change the name of a running job, similar to the following one, which changes the priority of a job?
hadoop job -set-priority <JOB_ID> 'HIGH'

The job ID is assigned by the JobTracker on submit, by calling JobTracker.getNewJobId(). It cannot be pre-set. To change the job priority, you must retrieve the ID after submission. Read the comments on PIG-948 for why it is not always possible to know the MR job ID from Pig:
The reason for that is that JobControlCompiler compiles a set of
inter-dependent MR jobs and generates a job-control object which is
then submitted asynchronously to Hadoop for execution. Since we don't
block on those threads, it's possible that job IDs are not yet assigned
when we ask for them.

Related

How to access MR job counters after job completion?

I have an MR job which runs fine in the cluster.
After the job completes I'm able to get the YARN logs, but I can't find the MR job counters, such as the number of input records and output records.
Is it possible to get that information after job completion?

Hadoop: schedule jobs to run sequentially (one job after another)?

Let's say I am resource constrained in my Hadoop environment and I don't want to schedule really long-running jobs (i.e. jobs that take days to complete). I am analyzing a vast amount of past time series data. I want to schedule MapReduce jobs that each take a day's worth of data at a time (which takes an hour to crunch).
So how do I schedule them such that a new job is submitted as soon as the previous job has completed?
If you want a quick and simple approach you could just write a shell script that calls hadoop jar in sequence for each job you want to run.
If you want a more robust approach you could use Apache Oozie to define a workflow of jobs that will run your jobs in sequence. If you are new to Hadoop you may find it easiest to define and run your Oozie workflow using the Hue GUI.
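The quick approach can also be sketched in Python rather than shell. This is a minimal sketch, not a definitive implementation: the jar name `timeseries.jar`, the driver class `DailyCrunch`, and the `/data/in` and `/data/out` paths are hypothetical placeholders, and it assumes the driver makes `hadoop jar` exit with a non-zero status when the job fails.

```python
import subprocess
from datetime import date, timedelta


def daily_commands(start, days):
    """Build one `hadoop jar` command per day of input data.

    The jar, driver class, and paths are hypothetical placeholders.
    """
    commands = []
    for offset in range(days):
        day = (start + timedelta(days=offset)).isoformat()
        commands.append([
            "hadoop", "jar", "timeseries.jar", "DailyCrunch",
            "/data/in/" + day, "/data/out/" + day,
        ])
    return commands


def run_sequentially(commands, run=subprocess.call):
    """Run each command only after the previous one has finished.

    `run` must block until the command exits and return its exit code.
    Stops at the first non-zero exit code and returns how many commands
    completed successfully.
    """
    completed = 0
    for command in commands:
        if run(command) != 0:
            break
        completed += 1
    return completed
```

Calling `run_sequentially(daily_commands(date(2014, 3, 1), 7))` would submit seven one-day jobs back to back, because `subprocess.call` blocks until each `hadoop jar` process exits. Stopping at the first failure is a design choice here; a workflow engine such as Oozie gives you retries and error handling instead.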

Hive query in the JobTracker

Hi, we are running Hive queries in a CDH 4 environment to which we recently upgraded. One thing I notice is that earlier, in CDH 3, we were able to track our queries in the JobTracker.
A link similar to "hostname:50030/jobconf.jsp?jobid=job_12345" would have a parameter "hive.query.string" or "mapred.jdbc.input.bounding.query" containing the actual query for which the MR job was executed.
But in CDH 4 I do not see where I can get the query. Many queries run in parallel, so we need a way to tell which query a given job belongs to.
You can still view the Hive queries in the JobTracker.
Look up the job by its job ID at the following URL: hostname:50030/jobtracker.jsp
You will find details like the following at the top of the page.
Hadoop Job 4651 on History Viewer
User: xxxx
JobName: test.jar
JobConf: hdfs://domain:port/user/xxxx/.staging/job_201403111534_4651/job.xml
Job-ACLs: All users are allowed
Submitted At: 14-Mar-2014 03:15:19
Launched At: 14-Mar-2014 03:15:19 (0sec)
Finished At: 14-Mar-2014 03:18:04 (2mins, 44sec)
Status: FAILED
Analyse This Job
Now click the URL next to JobConf and you will find your submitted Hive query.
I see that the query parameters for each job can be found in the .staging folder in HDFS itself, and can be parsed to get the query associated with each job ID.
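As a rough sketch of that parsing step: a job.xml is a Hadoop configuration file, that is, a `<configuration>` element holding `<property>` entries with `<name>` and `<value>` children. The code below assumes the file content has already been fetched from HDFS and that Hive stored the query under `hive.query.string`, as described in the question. The sample XML and the function name `read_property` are made up for illustration.

```python
import xml.etree.ElementTree as ET


def read_property(job_xml_text, wanted="hive.query.string"):
    """Return the value of one property from a job.xml document, or None."""
    root = ET.fromstring(job_xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == wanted:
            return prop.findtext("value")
    return None


# Made-up sample in the standard Hadoop configuration layout.
SAMPLE = """<configuration>
  <property><name>mapred.job.name</name><value>test.jar</value></property>
  <property><name>hive.query.string</name><value>SELECT COUNT(*) FROM logs</value></property>
</configuration>"""

print(read_property(SAMPLE))
```

For a file on local disk, `ET.parse(path).getroot()` could replace the `ET.fromstring` call.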

How to find the node where the pig job is running

I ran a Pig workflow using Oozie. The job completed successfully, but now I want to know on which slave or master the job ran. My input file is a 1.4GB file which is distributed across the nodes (1 master and 2 slaves).
I also want to figure out how much time Pig spent executing on each node.
Thank you in advance.
Point your web browser to "JobTracker_Machine:50030" and it will present you with the MapReduce web UI. Here you'll find all the jobs you have run (Running, Completed and Retired). Click on the job you want to analyze and it will give you all the information you need, including the node where a particular task ran and the time taken to finish the task.
HTH
Go to the Oozie web console and click on the workflow (which contains the Pig node). Clicking on the workflow job will open a dialog box (for your workflow) containing details of all the action nodes in the workflow. Select the Pig node you want to analyse and a detailed dialog box will appear containing the JobTracker URL of that Pig job.
There you will find all the details you are looking for.

Run hive queries, and collect job information

I would like to run a list of generated Hive queries.
For each, I would like to retrieve the MR job_id (or ids, in case of multiple stages).
And then, with this job_id, collect statistics from the JobTracker (cumulative CPU, bytes read...).
How can I send Hive queries from a bash or Python script and retrieve the job_id(s)?
For the second part (collecting stats for the job), we're using an MRv1 Hadoop cluster, so I don't have the AppMaster REST API. I'm about to collect data from the JobTracker web UI. Any better ideas?
You can get the list of executed jobs by running this command:
hadoop job -list all
Then, for each job ID, you can retrieve the stats using this command:
hadoop job -status job-id
To associate the jobs with a query, you can get the job name and match it against the query.
See, for example: How to get names of the currently running hadoop jobs?
Hope this helps.
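For the first part of the question, one option is to capture what the Hive CLI prints while a query runs and pull the job IDs out of that text. The sketch below relies only on the `job_<timestamp>_<sequence>` ID format seen in MRv1 job IDs such as `job_201403111534_4651`. The sample log lines are invented for illustration, and the exact wording of Hive's console output is an assumption that should be checked against your version.

```python
import re

# MRv1 job IDs look like job_<jobtracker start time>_<sequence number>.
JOB_ID = re.compile(r"\bjob_\d+_\d+\b")


def extract_job_ids(text):
    """Return the distinct job IDs found in text, in order of first appearance."""
    seen = []
    for job_id in JOB_ID.findall(text):
        if job_id not in seen:
            seen.append(job_id)
    return seen


# Invented sample of console output from a two-stage query.
SAMPLE_LOG = """
Starting Job = job_201403111534_4651, Tracking URL = http://hostname:50030/jobdetails.jsp?jobid=job_201403111534_4651
Ended Job = job_201403111534_4651
Starting Job = job_201403111534_4652, Tracking URL = http://hostname:50030/jobdetails.jsp?jobid=job_201403111534_4652
"""

print(extract_job_ids(SAMPLE_LOG))
```

In a real script the text would be the captured stdout and stderr of `hive -e "<query>"`, and each extracted ID could then be passed to `hadoop job -status`.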
