I would like to run a list of generated HIVE queries.
For each, I would like to retrieve the MR job_id (or ids, in case of multiple stages).
And then, with this job_id, collect statistics from job tracker (cumulative CPU, read bytes...)
How can I send HIVE queries from a bash or python script, and retrieve the job_id(s) ?
For the 2nd part (collecting stats for the job), we're using a MRv1 Hadoop cluster, so I don't have the AppMaster REST API. I'm about to collect data from the jobtracker web UI. Any better idea ?
you can get the list of jobs executed by running this command,
hadoop job -list all
then for each job-id, you can retrieve the stats, using the command,
hadoop job -status job-id
And for associating the jobs with a query, you can get the job_name and match it with the query.
something like this,
How to get names of the currently running hadoop jobs?
hope this helps.
Related
I have a MR job which runs fine in the cluster.
After the job completion I'm able to get YARN logs but I couldn't find the MR job counters like no of input records, output records.
Is it possible to get that information after job completion?
Is there any Hive internal process that connects to reduce or map tasks?
Adding to that!
How does Hive work in relation with MapReduce?
How is the job getting scheduled?
How does the query result return to the hive driver?
For HIVE there is no process to communicate Map/Reduce tasks directly. It's communicates (flow 6.3) with Jobtracker(Application Master in YARN) only for job processing related things once it got scheduled.
This image will give clear understanding about,
How HIVE uses MapReduce as execution engine?
How is the job getting scheduled?
How does the result return to the driver?
Edit: suggested by dennis-jaheruddin
Hive is typically controlled by means of HQL (Hive Query Language)
which is often conveniently abbreviated to Hive.
source
i am using hadoop to process bigdata, i first load data to hdfs and then execute jobs, but it is sequential. Is it possible to do it in parallel. For example,
running 3 jobs and 2 process of load data from others jobs at same time on my cluster.
Cheers
It is possible to run the all job's in parallel in hadoop if your cluster and jobs satisfies the below criteria:
1) Hadoop Cluster should have capability to run reasonable number of map/reduce task(depends on jobs) in parallel(i.e. should have enough map/reduce slots).
2) If jobs that is currently being run , depends on the data which is loaded through another process, we cannot run data load and job in parallel.
If you process satisfies the above condition, you can all the jobs in parallel.
Using Oozie you can schedule all the process to run in parallel. Fork and Join properties in Oozie allows you to accomplish the task to run in parallel.
If your cluster has enough resources to run the jobs in parallel, then yes. But be sure that the work of each job, doesn't interfere with the others. Like load the data at the same time that another job in execution should be using it, that won't work as you expected.
If there is not enough resources, then hadoop will enqueue the jobs until the resources are available, depending on the Scheduler configured.
I'm been using hive at my work, when i run a select like this
"Select * from TABLENAME"
hive executes a mapreduce job and when I run
"Select * from TABLENAME LIMIT X" independently of x.
hive doesn't execute mapreduce jobs.
I use hive 1.2.1, HDP 2.3.0, hue 2.6.1 and hadoop 2.7.1
Any ideas about this fact?
Thanks!
Select * from table;
Requires no map nor reduce. There is no filter(where statement) or aggregation function here. This query simply reads from HDFS.
This is hive's essential task. It is just an abstraction to map-reduce jobs. The former facebook engineers had to write 100s of map-reduce jobs for ad-hoc analysis and map-reduce jobs are somewhat a pain in the ass. So they abstracted it by an sql-language that will be translated in map-reduce jobs.
This is the same with Pig (yahoo).
P.S Some queries are so easy that they aren't translated to map-reduce jobs but are executed locally on one node as far as i know
In the first map reduce job I am processing an HBase table and outputting a smaller list of the rowkeys. I need to use this list of strings in order to process another map reduce job which is pulling from a different HBase table and outputting to another Hbase table. What is the proper way to store and access the ouput of the first map reduce job?
Hadoop doesn't support streaming the output of one MR job to another. So, the output of the first MR job has to be stored in HDFS (or some other persistent storage) and then read in the second MR job. Create a DAG of jobs using Oozie or Azkaban. For a simple work flow use Hadoop's JobControl API.
Apache Tez which is still in the incubator phase allows streaming of data across MR tasks. As mentioned, Tez is still in the Incubator stage, so use it with a bit of caution.