hive query in Job tracker - hadoop

Hi, we are running Hive queries in a CDH 4 environment, to which we recently upgraded. One thing I noticed is that earlier, in CDH 3, we were able to track our queries in the JobTracker.
A link similar to "hostname:50030/jobconf.jsp?jobid=job_12345" would show a parameter "hive.query.string" or "mapred.jdbc.input.bounding.query" containing the actual query for which the MR job was executed.
But in CDH 4 I do not see where I can get the query. Many queries run in parallel, so we need a way to tell which job belongs to the query we are concerned with.

You can still view the Hive queries in the JobTracker.
Look up the job by its job ID, starting from the URL hostname:50030/jobtracker.jsp
At the top of the job's page you will find details like the following:
Hadoop Job 4651 on History Viewer
User: xxxx
JobName: test.jar
JobConf: hdfs://domain:port/user/xxxx/.staging/job_201403111534_4651/job.xml
Job-ACLs: All users are allowed
Submitted At: 14-Mar-2014 03:15:19
Launched At: 14-Mar-2014 03:15:19 (0sec)
Finished At: 14-Mar-2014 03:18:04 (2mins, 44sec)
Status: FAILED
Analyse This Job
Now click the URL next to JobConf; there you will find your submitted Hive query.

I see that the query parameters for each job can be found in the .staging folder in HDFS itself (in the job's job.xml), and that file can be parsed to get the query associated with each job ID.
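As a rough illustration of that parsing step, here is a minimal Python sketch. It assumes job.xml uses the usual Hadoop configuration layout (a list of property elements, each with a name and a value) and that the query is stored under one of the two parameter names mentioned in the question. The function name extract_query and the sample XML are made up for the example.

```python
import xml.etree.ElementTree as ET

# Parameter names taken from the question; either one may hold the query.
QUERY_KEYS = ("hive.query.string", "mapred.jdbc.input.bounding.query")


def extract_query(job_xml):
    """Return the query stored in a job.xml document, or None if absent.

    job_xml may be str or bytes. When reading a real file, pass bytes
    (open it in binary mode) so an XML encoding declaration is honored.
    """
    root = ET.fromstring(job_xml)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name] = prop.findtext("value")
    for key in QUERY_KEYS:
        if props.get(key):
            return props[key]
    return None


# Hypothetical, heavily shortened job.xml used only to exercise the function.
SAMPLE_JOB_XML = """<configuration>
  <property><name>mapred.job.name</name><value>test.jar</value></property>
  <property><name>hive.query.string</name><value>SELECT COUNT(*) FROM logs</value></property>
</configuration>"""

if __name__ == "__main__":
    print(extract_query(SAMPLE_JOB_XML))
```

The file contents could be fetched with something like hadoop fs -cat on the JobConf path shown on the job page and then passed to extract_query.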

Related

Athena query submit via shell script

I am executing a shell script in an EMR cluster with script-runner to run an Athena query for all data dates in a month, using aws athena with --query-string.
Once I submit a step, the query is executed for only 10 days; the corresponding query ID is displayed in the sysout log and the step moves to completed status.
Question:
How can I make the step execute the query for all days and only then move to completed status?
Thanks in advance

Nifi Job to execute a spark submit command not giving correct results

I have Spark code that appends data from a Hive table to parquet files partitioned on dates. The code runs correctly when executed from the spark shell, and the parquet files show exactly the same number of rows as are present in the Hive table for the corresponding date.
However, when the same code is packaged in a jar file that is invoked by a spark submit command, and the spark submit command is scheduled to execute daily at 9 AM via Nifi, the number of rows in the parquet partition files comes out lower. We are on the P_NO_SLA queue, and below are some of the facts and observations we have:
• Data on the source Hive table gets updated by 4 AM approx.
• Initially our Nifi job was scheduled to start running at 4:45 AM, but the number of records did not match. On doing a manual update from the spark shell after 6 AM, the data was an exact match.
• Hence, we scheduled the job to run at 7 AM. On doing this, when the number of records was low (approx. 20000 on weekends) compared to weekdays (in the range of 150000 to >200000 records), the data got updated correctly via the Nifi job. Again a manual run was done to backfill the missing data.
• Again, we postponed the job to 9 AM. After doing this, there were 2 days when the number of records matched (between 160000 and 200000); however, since Jul-31, the data hasn't matched at all, irrespective of the number of records on any of the days, and we are having to do a manual backfill every day.
We are unable to figure out any specific reason that may be causing the code to run correctly from the spark shell at any time but give incorrect results from Nifi, when Nifi is just scheduled to execute the spark submit command that runs the jar file containing the same Spark code.
Please help me with understanding why this would be happening and how I can fix this.
P.S.: I have checked the Nifi log files, and could not find any of the scheduled jobs giving an error.

Changing the name of Hadoop job?

Hi, I would like to change the name of a running Hadoop job to something meaningful.
Is there any command to change the name of a running job, similar to this one, which changes the priority of the job?
hadoop job -set-priority <JOB_ID> 'HIGH';
The job ID is assigned by the JobTracker on submit, by calling JobTracker.getNewJobId(). It cannot be pre-set. To change the job priority, you must retrieve the ID from the submission. Read the comments on PIG-948 for why it is not always possible to know the MR job ID from Pig:
Reason for that is JobControlCompiler compiles a set of
inter-dependent MR jobs and generates a job-control object which is
then submitted asynchronously to hadoop for execution. Since we dont
block on those thread, its possible that job-ids are not yet assigned
when we ask for them

How to find the node where the pig job is running

I ran a Pig workflow using Oozie. The job completed successfully, but now I want to know on which slave or master the job ran. My input file is a 1.4GB file which is distributed across the nodes (1 master and 2 slaves).
I also want to figure out how much time the Pig job spent executing on each node.
Thank you in advance
Point your web browser to "JobTracker_Machine:50030" and it will present you with the MapReduce web UI. There you'll find all the jobs you have run (Running, Completed and Retired). Click on the job you want to analyze and it will give you all the information you need, including the node where a particular task ran and the time taken to finish the task.
HTH
Go to the Oozie web console and click on the workflow (which contains the pig node). Clicking on the workflow job will open a dialog box (for your workflow) containing details of all the action nodes in the workflow. Select the pig node you want to analyse and a detailed dialog box will appear containing the JobTracker URL of that Pig job.
There you will find all the details you are looking for.

Run hive queries, and collect job information

I would like to run a list of generated HIVE queries.
For each, I would like to retrieve the MR job_id (or ids, in case of multiple stages).
And then, with this job_id, collect statistics from job tracker (cumulative CPU, read bytes...)
How can I send HIVE queries from a bash or python script, and retrieve the job_id(s)?
For the 2nd part (collecting stats for the job), we're using an MRv1 Hadoop cluster, so I don't have the AppMaster REST API. I'm about to collect data from the jobtracker web UI. Any better ideas?
You can get the list of executed jobs by running this command:
hadoop job -list all
Then, for each job ID, you can retrieve the stats using this command:
hadoop job -status job-id
To associate the jobs with a query, you can get the job name and match it against the query.
For that, see the question "How to get names of the currently running hadoop jobs?"
hope this helps.
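For the first part of the question (getting the job IDs from a script), one possible approach is to capture the Hive CLI output and pull the job IDs out of it. The Python sketch below is only an illustration: it assumes the Hive CLI writes lines containing IDs of the form job_<timestamp>_<number> to its console output (for example in its "Starting Job = ..." messages), and the sample log text is made up rather than copied from a real run. The helper names extract_job_ids and run_hive_query are hypothetical.

```python
import re
import subprocess

# MRv1 job IDs look like job_201403111534_4651.
JOB_ID_RE = re.compile(r"\bjob_\d+_\d+\b")


def extract_job_ids(text):
    """Return the distinct job IDs found in text, in order of first appearance."""
    return list(dict.fromkeys(JOB_ID_RE.findall(text)))


def run_hive_query(query):
    """Run one query with `hive -e` and return the job IDs seen in its output.

    Assumes the hive executable is on PATH. Both stdout and stderr are
    scanned, since progress messages may go to either stream.
    """
    result = subprocess.run(
        ["hive", "-e", query], capture_output=True, text=True
    )
    return extract_job_ids(result.stdout + "\n" + result.stderr)


# Illustrative console output (not a real log) with two stages.
SAMPLE_HIVE_LOG = """Total MapReduce jobs = 2
Starting Job = job_201403111534_4651, Tracking URL = http://hostname:50030/jobdetails.jsp?jobid=job_201403111534_4651
Starting Job = job_201403111534_4652, Tracking URL = http://hostname:50030/jobdetails.jsp?jobid=job_201403111534_4652
"""

if __name__ == "__main__":
    print(extract_job_ids(SAMPLE_HIVE_LOG))
```

Each ID returned this way could then be passed to hadoop job -status to collect the statistics for that stage.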
