How to change Tez job name when running query in HIVE - hadoop

When I submit a Hive SQL using Tez like below:
hive (default)> select count(*) from simple_data;
In the Resource Manager UI the job name shows up as something like HIVE-9d1906a2-25dd-4a7c-9ea3-bf651036c7eb. Is there a way to change the job name to my_job_name?
If I am not using Tez and running the job in MR, I can set the job name using set mapred.job.name.
Are there any Tez parameters I need to set, to change the job name?
Any input is appreciated.

You can use "set hiveconf hive.query.name=myjobname"But you will be able to see the name only in TEZ view. Not in Yarn.
See the link below: https://community.hortonworks.com/questions/5309/how-to-set-tez-job-name.html
I'm looking into this issue as well; if I find a solution I will update the question.

Got this figured out. Using the property hive.session.id, the name can be changed (the session id is what appears after the HIVE- prefix in the application name). Below is an example.
hive --hiveconf hive.session.id=test_$(date '+%Y%m%d_%H%M%S') \
-e "select month, max(sale) from simple_data group by month;"

Good question. There is a JIRA for Hive on Spark for something very similar to what you're asking: HIVE-12811 - there you could use spark.app.name; it is landing in Hive 2.1.
I can't find anything specific for Hive on Tez; perhaps somebody needs to submit a Hive JIRA/patch similar to HIVE-12811 but for Tez.

set hive.query.name="test_query";
This will work in Hive with Tez.
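For example, a minimal session sketch (the simple_data table is reused from the question, and the name value is illustrative):
set hive.query.name=my_job_name;
select count(*) from simple_data;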

set mapred.job.name = more helpful name

Related

Example of how to set a Hive property from within a Hive query

I need a quick example of how to change a property in Hive from within a query. For instance, I would like to change the property 'mapred.reduce.tasks'; how do I perform this change within a query?
I'm training myself for the HDPCD exam and one of the goals in the exam is 'Set a Hadoop or Hive configuration property from within a Hive query'. So I suppose that it's not the same as running something like the following in the Hive console:
set mapred.reduce.tasks=2;
To change a Hadoop or Hive configuration variable you use set in the Hive query/session.
The change will apply only to that session.
set -v prints all Hadoop and Hive configuration variables.
SET mapred.reduce.tasks=XX     -- in Hadoop 1.x
SET mapreduce.job.reduces=XX   -- in Hadoop 2.x (YARN)
reset restores the configuration to the default values.
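A minimal session sketch putting these together (the simple_data table is borrowed from the first question above purely for illustration; the values are not prescriptive):
-- show the current value
SET mapreduce.job.reduces;
-- override it for this session only
SET mapreduce.job.reduces=2;
SELECT month, COUNT(*) FROM simple_data GROUP BY month;
-- restore all overridden properties to their defaults
RESET;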

storing a Dataframe to a hive partition table in spark

I'm trying to store a stream of data coming in from a Kafka topic into a Hive partitioned table. I was able to convert the DStream to a DataFrame and created a HiveContext. My code looks like this:
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
newdf.registerTempTable("temp") //newdf is my dataframe
newdf.write.mode(SaveMode.Append).format("orc").partitionBy("date").saveAsTable("mytablename")
But when I deploy the app on the cluster, it says:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file:/tmp/spark-3f00838b-c5d9-4a9a-9818-11fbb0007076/scratch_hive_2016-10-18_23-18-33_118_769650074381029645-1, expected: hdfs://
When I try to save it as a normal table and comment out the Hive configurations it works. But with the partitioned table it gives me this error.
I also tried registering the DataFrame as a temp table and then writing that table to the partitioned table. That also gave me the same error.
Can someone please tell me how I can solve this?
Thanks.
You need to use HDFS (not the local file system) if you are deploying the app on the cluster.
With saveAsTable the default location that Spark saves to is controlled by the HiveMetastore (based on the docs). Another option would be to use saveAsParquetFile and specify the path, and then later register that path with your Hive metastore, OR use the new DataFrameWriter interface and specify the path option: write.format(source).mode(mode).options(options).saveAsTable(tableName).
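A hedged sketch of that last option, assuming the newdf DataFrame and table name from the question (the HDFS path is illustrative, not taken from the question):
newdf.write
  .format("orc")  // any supported source; "parquet" would also work
  .mode(SaveMode.Append)
  .option("path", "hdfs:///user/hive/warehouse/mytablename")  // explicit HDFS location instead of the metastore default
  .partitionBy("date")
  .saveAsTable("mytablename")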
I figured it out.
In the code for the Spark app, I set the Hive scratch directory to an HDFS location as below and it worked (the Wrong FS error was caused by the scratch dir pointing at the local file system).
sqlContext.sql("SET hive.exec.scratchdir=<hdfs location>")
sqlContext.sql("SET hive.exec.scratchdir=location")

how to set spark RDD StorageLevel in hive on spark?

In my Hive on Spark job, I get this error:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
Thanks to this answer (Why do Spark jobs fail with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 in speculation mode?), I know my Hive on Spark job may have the same problem.
Since Hive translates the SQL into a Hive on Spark job, I don't know how to set this in Hive so that the job changes from StorageLevel.MEMORY_ONLY to StorageLevel.MEMORY_AND_DISK.
Thanks for your help.
You can use CACHE [LAZY] TABLE <table_name> and UNCACHE TABLE <table_name> to manage caching; see the Spark SQL documentation for more details.
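For example, a minimal Spark SQL sketch (the table name is reused from the first question purely for illustration):
CACHE LAZY TABLE simple_data;
-- run the queries that reuse the table ...
UNCACHE TABLE simple_data;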
If you are using DataFrames, you can use persist(...) to specify the StorageLevel; look at the DataFrame API.
In addition to setting the storage level, you can optimize other things as well. Spark SQL uses a different caching mechanism called columnar storage, which is a more efficient way of caching data (as Spark SQL is schema aware). There is a set of config properties that can be tuned to manage caching, described in detail in the Spark SQL documentation (refer to the documentation for the version you are using), for example:
spark.sql.inMemoryColumnarStorage.compressed
spark.sql.inMemoryColumnarStorage.batchSize
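A hedged Scala sketch of both knobs, assuming you are driving Spark SQL directly through an existing sqlContext (the table name and property values are illustrative):
import org.apache.spark.storage.StorageLevel

// spill cached partitions to disk instead of losing them when memory runs out
val df = sqlContext.table("simple_data")
df.persist(StorageLevel.MEMORY_AND_DISK)

// tune Spark SQL's columnar in-memory cache
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")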

Errors from avro.serde.schema - "CannotDetermineSchemaSentinel"

When running jobs on Hadoop (CDH4.6 and Hive 0.10), these errors showed up:
avro.serde.schema =
{"type":"record","name":"CannotDetermineSchemaSentinel","namespace":"org.apache.hadoop.hive","fields":[
  {"name":"ERROR_ERROR_ERROR_ERROR_ERROR_ERROR_ERROR","type":"string"},
  {"name":"Cannot_determine_schema","type":"string"},
  {"name":"check","type":"string"},
  {"name":"schema","type":"string"},
  {"name":"url","type":"string"},
  {"name":"and","type":"string"},
  {"name":"literal","type":"string"}]}
What's the root cause, and how do I resolve them?
Thanks!
This happens when Hive is unable to read or parse the avro schema you have given it. Check the avro.schema.url or avro.schema.literal property in your table; it is likely it is set incorrectly.
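A hedged sketch of how you might check and fix it (the table name and schema location are illustrative, not taken from the question):
-- inspect the table's current properties, including avro.schema.url / avro.schema.literal
DESCRIBE FORMATTED my_avro_table;
-- point the table at a valid Avro schema file on HDFS
ALTER TABLE my_avro_table SET TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/my_avro_table.avsc');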

hive query in Job tracker

Hi, we are running Hive queries in a CDH 4 environment to which we recently upgraded. One thing I notice is that earlier, in CDH 3, we were able to track our queries in the JobTracker.
A link similar to "hostname:50030/jobconf.jsp?jobid=job_12345" would have a parameter "hive.query.string" or "mapred.jdbc.input.bounding.query" which contains the actual query for which the MR job is executed.
But in CDH 4 I do not see where I can get the query. Many queries run in parallel, so we need a way to tell which job belongs to which query.
You can still view the Hive queries in the JobTracker.
Get the job information based on the job id from the URL hostname:50030/jobtracker.jsp.
You will find details like the following at the top of the page:
Hadoop Job 4651 on History Viewer
User: xxxx
JobName: test.jar
JobConf: hdfs://domain:port/user/xxxx/.staging/job_201403111534_4651/job.xml
Job-ACLs: All users are allowed
Submitted At: 14-Mar-2014 03:15:19
Launched At: 14-Mar-2014 03:15:19 (0sec)
Finished At: 14-Mar-2014 03:18:04 (2mins, 44sec)
Status: FAILED
Analyse This Job
Now click the URL next to JobConf and you will find your submitted Hive query (the hive.query.string parameter).
I see that the query parameters for each job can also be found in the .staging folder in HDFS itself, and the job.xml there can be parsed to get the query associated with a job id.
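A hedged sketch of pulling the query straight out of the job configuration on HDFS by grepping for hive.query.string (the staging path is copied from the JobConf line above and is illustrative):
hdfs dfs -cat /user/xxxx/.staging/job_201403111534_4651/job.xml | grep -A 1 'hive.query.string'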
