Alarm notification in Hive/Hadoop

I have been using Apache Hive for a while, and I want to know whether there is a way to set up alarms in Hive, i.e., whether I can run a shell script or send an email when a job fails. My Hive jobs generally take a couple of hours, and I want an immediate notification on failure so that I can take action right away. Or, at the very least, can I set up similar alarms in Hadoop?

When you run a Hive script from a Unix/Linux box, you invoke the hive command (for example hive -e for inline SQL or hive -f for a script file) from a shell script, and the same applies to plain Hadoop jobs. A wrapper shell script that checks the exit status and calls mail or mailx is enough to alert you.
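For example, a minimal wrapper sketch (the script path, job name, and recipient address below are placeholders):

#!/bin/bash
# Run the Hive script and capture its exit status.
hive -f /path/to/my_job.hql
status=$?

# On a non-zero exit code, send an alert email via mailx.
if [ $status -ne 0 ]; then
    echo "Hive job my_job.hql failed with exit code $status at $(date)" \
        | mailx -s "ALERT: Hive job failed" you@example.com
fi

exit $status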

Related

How to get resources used for FINISHED hadoop jobs from YARN logs using job names?

I have a Unix shell script which runs multiple Hive scripts, and I have given a job name to every Hive query inside those scripts.
What I need is that, at the end of the shell script, I can retrieve the resources used (in terms of memory and containers) for those Hive queries, based on the job names, from the YARN logs/applications whose status is 'FINISHED'.
How do I do this?
Any help would be appreciated.
You can pull this information from the YARN history server via its REST APIs.
https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/HistoryServerRest.html
Scroll through this documentation and you will see examples of how to get cluster-level information on executed jobs, and then how to get information on individual jobs.
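As a rough sketch (the ResourceManager host and the job-name filter below are placeholders; this uses the ResourceManager applications API rather than the JobHistory server, which exposes similar per-job endpoints), you can list FINISHED applications and their aggregate resource usage with curl and jq:

# Query the YARN ResourceManager REST API for FINISHED applications
# and print name, memory-seconds and vcore-seconds for each one.
RM_HOST="resourcemanager.example.com"
curl -s "http://${RM_HOST}:8088/ws/v1/cluster/apps?states=FINISHED" \
  | jq -r '.apps.app[]
           | select(.name | test("my_hive_job"))
           | "\(.name)\t\(.memorySeconds) MB-s\t\(.vcoreSeconds) vcore-s"'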

Clarification regarding oozie launcher jobs

I needed some clarifications regarding the oozie launcher job.
1) Is the launcher job launched per workflow application (with several actions) or per action within a workflow application?
2) Use Case: I have workflows that contain multiple shell actions (which internally execute Spark, Hive, Pig actions, etc.). The reason for using shell is that additional parameters, like the partition date, can be computed using custom logic and passed to Hive via .q files.
Example Shell File:
hive -hiveconf DATABASE_NAME=$1 -hiveconf MASTER_TABLE_NAME=$2 -hiveconf SOURCE_TABLE_NAME=$3 -f $4
Example .q File:
use ${hiveconf:DATABASE_NAME};
insert overwrite table ${hiveconf:MASTER_TABLE_NAME} select * from ${hiveconf:SOURCE_TABLE_NAME};
I set the oozie.launcher.mapreduce.job.queuename and mapreduce.job.queuename to different queues to avoid starvation of task slots in a single queue. I also omitted the <capture-output></capture-output> in the corresponding shell action. However, I still see the launcher job occupying a lot of memory from the launcher queue.
Is this because the launcher job caches the log output that comes from Hive?
Is it necessary to give the launcher job enough memory when executing a shell action the way I am?
What would happen if I explicitly limited the launcher job memory?
I would highly appreciate it if someone could outline the responsibilities of the oozie launcher job.
Thanks!
Is the launcher job launched per workflow application (with several actions) or per action within a workflow application?
The launcher job is launched per action in the workflow.
I would highly recommend using the respective Oozie actions (Hive, Pig, etc.) instead, because that allows Oozie to handle your workflow and actions in a better manner.

How to run hive on google cloud dataproc from within the machine?

I've just created a Google Cloud Dataproc cluster. A few basic things are not working for me:
I'm trying to run the Hive console from the master node, but it fails to load with any user other than root (it looks like there's a lock; the console just hangs).
But even when using root, I see some odd behaviour:
"show tables;" shows a table named "input"
querying that table raises an exception saying the table is not found.
It is not clear which user creates the tables through the web UI. I create a job and execute it, but then don't see the results through the console.
Couldn't find any good documentation on that - does anybody have an idea on this?
Running the hive command at present is somewhat broken due to the default metastore configuration.
I recommend you use the beeline client instead, which talks to the same Hive Server 2 as Dataproc Hive Jobs. You can use it via ssh by running beeline -u jdbc:hive2://localhost:10000 on the master.
YARN applications are submitted by Hive Server 2 as the user "nobody"; you can specify a different user by passing the -n flag to beeline, but it shouldn't matter with the default permissions.
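For example, a minimal sketch (the cluster name, zone, and user below are placeholders):

# SSH to the Dataproc master node (by default named <cluster-name>-m).
gcloud compute ssh my-cluster-m --zone=us-central1-a

# On the master, connect to Hive Server 2 with beeline and run a query.
beeline -u jdbc:hive2://localhost:10000 -n myuser -e "SHOW TABLES;"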
This thread is a bit old, but when someone searches for Google Cloud Platform and Hive this result comes up, so I'm adding some info which may be useful.
Currently, in order to submit a job to Google Dataproc there are, as with most other products, three options:
from the UI
from the console, using a command line like the one below (see the full example after this list):
gcloud dataproc jobs submit hive --cluster=CLUSTER (--execute=QUERY, -e QUERY | --file=FILE, -f FILE) [--async] [--bucket=BUCKET] [--continue-on-failure] [--jars=[JAR,…]] [--labels=[KEY=VALUE,…]] [--params=[PARAM=VALUE,…]] [--properties=[PROPERTY=VALUE,…]] [GLOBAL-FLAG …]
REST API call like: https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit
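For instance, a minimal command-line sketch (the cluster name, region, and query are placeholders; depending on your gcloud version the --region flag may be required):

# Submit an inline Hive query to an existing Dataproc cluster.
gcloud dataproc jobs submit hive \
    --cluster=my-cluster \
    --region=us-central1 \
    -e "SHOW TABLES;"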
Hope this will be useful to someone.

How to access Hive log information

I am trying to analyze the performance of Hive queries. I was able to run Hive queries from Java, but I still need to access the log information generated after each query. Instead of using a hack that reads the latest log on disk and extracts the numbers with regexes, I am looking for a graceful method, if one is already available.
Any pointers will be helpful. Thanks in advance.
Query execution details like Status, Finished at, and Finished in are displayed in the JobTracker, and you can access the JobTracker programmatically. Related info at this link:
How could I programmatically get all the job tracker and tasktracker information that is displayed by Hadoop in the web interface?
Once Hive starts running, a corresponding MapReduce job starts. The logs of this Hadoop job can be found on the TaskTracker on which each of its tasks runs.
Use the JobClient API to retrieve these logs programmatically.
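As a rough command-line alternative to the JobClient API (the job ID below is a placeholder; on newer releases the same subcommands also live under mapred job), the status and counters can be pulled from the shell:

# List all jobs known to the JobTracker, then dump the status and
# counters for one of them.
hadoop job -list all
hadoop job -status job_201701010000_0001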

Hadoop Job Scheduling query

I am a beginner to Hadoop.
As per my understanding, the Hadoop framework runs jobs in FIFO order (the default scheduling).
Is there any way to tell the framework to run a job at a particular time?
i.e., is there any way to configure a job to run daily at, say, 3 PM?
Any inputs on this greatly appreciated.
Thanks, R
What about calling the job from an external Java scheduling framework, like Quartz? Then you can run the job whenever you want.
You might consider using Oozie (http://yahoo.github.com/oozie/). It allows (besides other things):
Frequency execution: the Oozie workflow specification supports both data and time triggers. Users can specify an execution frequency and can wait for data arrival to trigger an action in the workflow.
It is independent of any other Hadoop schedulers and should work alongside any of them, so probably nothing in your Hadoop configuration will change.
How about having a script to execute your Hadoop job and then using the at command to run it at a specified time? If you want the job to run regularly, you could set up a cron job to execute your script.
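For example, a minimal sketch (the wrapper script path is a placeholder):

# Schedule a one-off run at 3 PM today with the at command:
echo "/home/hadoop/run_job.sh" | at 15:00

# Or, for a daily run at 3 PM, add this line via crontab -e:
# 0 15 * * * /home/hadoop/run_job.sh >> /home/hadoop/run_job.log 2>&1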
I'd use a commercial scheduling app and/or a custom workflow solution if cron does not cut it. We use a solution called JAMS, but keep in mind it's .NET-oriented.
