Job name for Tez job in beeline and view it in YARN - hadoop

I'm using Beeline and like to set a specific name for a TEZ job, like I use mapreduce.job.name for a MapReduce job. I tried hive.query.name, but it doesn't make any difference in yarn application -list.
Some say we can view the name only in TEZ UI, but I only have access to YARN. Please help me.
I have a load script in Beeline with TEZ as execution engine running now,
when I'm trying to see the active applications in YARN with yarn application -list command, I get something like HIVE-<UUID> as the job name.
I would like to change it to more readable.
I can do the same if the execution engine is MR with SET mapreduce.job.name = myJobName command.
I want similar command for TEZ engine, as I already said SET hive.query.name=myJobName is not seems to be working.

Try also to set session id:
set hive.session.id=myJobName;
Or start hive with hiveconf parameter:
hive --hiveconf hive.session.id=myJobName -f "myscript.hql"

Related

How to get the job id of a specific running hadoop jobs

I need to get the id of a specific hadoop job.
In my case, I lunch a sqoop commande remotely and I went to verify the job status with this commande :
hadoop job -status job_id | grep -w 'state'
I can get this information from the GUI but i went to do something
can any one help me !!!
You can use the Yarn REST apis, via your browser or curl from the command line. It will list all the currently running and previously running jobs, including sqoop and the mapreduce jobs that sqoop generates and executes. Use the UI first, if you have it up and running just point your browser to http:<host>:8088/cluster (not sure if the port is the same on all hadoop distributions. I believe 8088 is the default on apache). Alternatively you can use yarn commands directly, e.g, yarn application -list.

Spark setAppName doesn't appear in Hadoop running applications UI

I am running a spark streaming job and when I set the app name (a better readable string) for my spark streaming job, It doesn't appear in the Hadoop running applications UI. I always see the class name as the name in Hadoop UI
val sparkConf = new SparkConf().setAppName("BetterName")
How to set the job name in Spark, so it appears in this Hadoop UI ?
Hadoop URL for running applications is - http://localhost:8088/cluster/apps/RUNNING
[update]
Looks like this is the issue only with Spark Streaming jobs, couldn't find solution on how to fix it though.
When submitting a job via spark-submit, the SparkContext created can't set the name of the app, as the YARN is already configured for job before Spark. For the app name to appear in the Hadoop running jobs UI, you have to set it in the command line for spark-submit "--name BetterName". I kick off my job with a shell script that calls spark-submit, so added the name to the command in my shell script.

Running Spark Jobs via Oozie

Is it possible to run Spark Jobs e.g. Spark-sql jobs via Oozie?
In the past we have used Oozie with Hadoop. Since we are now using Spark-Sql on top of YARN, looking for a way to use Oozie to schedule jobs.
Thanks.
Yup its possible ... The procedure is also same, that you have to provide Oozia a directory structure having coordinator.xml, workflow.xml and a lib directory containing your Jar files.
But remember Oozie starts the job with java -cp command, not with spark-submit, so if you have to run it with Oozie, Here is a trick.
Run your jar with spark-submit in background.
Look for that process in process list. It will be running under java -cp command but with some additional Jars, that are added by spark-submit. Add those Jars in CLASS_PATH. and that's it. Now you can run your Spark applications through Oozie.
1. nohup spark-submit --class package.to.MainClass /path/to/App.jar &
2. ps aux | grep '/path/to/App.jar'
EDITED: You can also use latest Oozie, which has Spark Action also.
To run Spark SQL by Oozie you need to use Oozie Spark Action.
You can locate oozie.gz on your distribution. Usually in cloudera you can find this oozie examples directory at below path.
]$ locate oozie.gz
/usr/share/doc/oozie-4.1.0+cdh5.7.0+267/oozie-examples.tar.gz
Spark SQL need hive-site.xml file for execution which you need to provide in workflow.xml
< spark-opts>--file /hive-site.xml < /spark-opts>

How to kill a mapred job started by hive?

I'm working by CDH 5.1 now. It starts normal Hadoop job by YARN but hive still works with mapred. Sometimes a big query will hang for a long time and I want to kill it.
I can find this big job by JobTracker web console while it didn't provide a button to kill it.
Another way is killing by command line. However, I couldn't find any job running by command line.
I have tried 2 commands:
yarn application -list
mapred job -list
How to kill big query like this?
You can get the Job ID from Hive CLI when you run a job or from the Web UI. You can also list the job IDs using the application ID from resource manager. Ideally, you should get everything from
mapred job -list
or
hadoop job -list
Using the Job ID you can kill it by using the below command.
hadoop job -kill <job_id>
Another alternative would be to kill the application using
yarn application -kill <application_id>

configure hive with hadoop

I have configured hadoop 2.2.0 as single node cluster ( was able to run example jar)
Now I need to make hive perform queries using this hadoop
should I set
mapred.job.tracker
to
yarn.resourcemanager.resource-tracker.address
property?
tried so, but can't see the data loaded into hive tables in hdfs
I don't have enough reputation points to add a comment, so trying to help via an answer.
What are the daemons currently running for Hadoop? Use ps -eaf |
grep "java" to check.
Do you see the JobTracker running or the ResourceManager?
Also, can you elaborate on the steps you performed to install Hive?
I have screen cast, Installing Apache Hive that walks you through installing Hive. Next, you can follow my blog post Apache Hive - Getting Started. Hope this helps.

Resources