Example of how to set a Hive property from within a Hive query - hadoop

I need a quick example of how to change a property in Hive using a query. For instance, I would like to change the property 'mapred.reduce.tasks'; how do I perform this change from within a query?
I'm training myself for the HDPCD exam, and one of its goals is 'Set a Hadoop or Hive configuration property from within a Hive query'. So I suppose it is not the same as running something like the following in the Hive console:
set mapred.reduce.tasks=2;

To change a Hadoop or Hive configuration variable, you use SET in the Hive query.
The change applies only to that session.
set -v prints all Hadoop and Hive configuration variables.
SET mapred.reduce.tasks=XX      -- in Hadoop 1.x
SET mapreduce.job.reduces=XX    -- in Hadoop 2.x (YARN)
RESET in a query restores the configuration to its default values.
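A minimal sketch of doing this within one session: set the property, run the query, then restore the defaults (simple_data is just an illustrative table name):
SET mapreduce.job.reduces=2;
SELECT month, count(*) FROM simple_data GROUP BY month;
RESET;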

Related

How do you get the list of settings for a particular hive job?

By settings I mean things like hive.cbo.enable=true and other similar properties. I will be running these queries in an environment that has multiple concurrent jobs, and I was wondering how to do this for individual jobs using jobid or name.
You could use:
hive> SET;
set prints all the variables in the namespaces hivevar, hiveconf, system, and env.
Example output looks like:
hive.stats.retries.wait=3000
env:TERM=xterm
system:user.timezone=America/New_York
You can also use hive> set -v;
With the -v option, it also prints all the properties defined by Hadoop, such as properties controlling HDFS and MapReduce.
If you want to get or display a specific value, you need to specify it as below (set namespace:variable-name):
hive> set hiveconf:hive.cbo.enable;
hiveconf:hive.cbo.enable=true
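Since SET only affects the current session, each job can carry its own overrides at submission time and you can print them back to verify. A minimal sketch, assuming the job is launched from the Hive CLI (the property and query are illustrative):
hive --hiveconf hive.cbo.enable=false \
     -e "set hiveconf:hive.cbo.enable; select count(*) from simple_data;"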

Create hive table through spark job

I am trying to create Hive tables as outputs of my Spark (version 1.5.1) job on a Hadoop cluster (BigInsights 4.1 distribution) and am facing permission issues. My guess is that Spark uses a default user (in this case 'yarn', not the job submitter's username) to create the tables and therefore fails to do so.
I tried to customize the hive-site.xml file to set an authenticated user that has permission to create Hive tables, but that didn't work.
I also tried to set the Hadoop user variable to an authenticated user, but it didn't work either.
I want to avoid saving text files and then creating Hive tables from them; I want to optimize performance and reduce the size of the outputs through ORC compression.
My questions are:
Is there any way to call the write function of the Spark DataFrame API
with a specified user?
Is it possible to choose a username using Oozie's workflow file?
Does anyone have an alternative idea, or has anyone ever faced this problem?
Thanks.
Hatak!
Assuming df holds your data, you can write:
In Java:
df.write().saveAsTable("tableName");
You can use different SaveMode values, such as Overwrite or Append:
df.write().mode(SaveMode.Append).saveAsTable("tableName");
In Scala:
df.write.mode(SaveMode.Append).saveAsTable(tableName)
Many other options can be specified depending on the format you would like to save: text, ORC (with buckets), JSON, etc.
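For instance, a minimal Scala sketch of saving the output as an ORC-backed Hive table (the table name my_output_table is illustrative, and this assumes the DataFrame comes from a HiveContext as in Spark 1.5):
import org.apache.spark.sql.SaveMode
// write the DataFrame as an ORC table, appending to it if it already exists
df.write.format("orc").mode(SaveMode.Append).saveAsTable("my_output_table")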

Documentation of manually passing parameters ${parameter} inside query

Hive documents setting variables through hiveconf here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution
I know there is also a way of passing parameters using ${parameter} (not hiveconf), e.g.:
select * from table_one where variable = ${parameter}
The Hive editor then prompts you to enter the value for parameter when you submit the query.
I can't find where Apache Hadoop documents this way of passing parameters. Is this way of passing parameters inherent to Hive or to Oozie? If it is Oozie, why can it be used in the Hive editor?
This is a feature of Hue. There is a reference to this feature in Cloudera documentation, at least for older versions. For example, the Hive Query Editor User Guide describes it.
PARAMETERIZATION Indicate that a dialog box should display to enter parameter values when a query containing the string $parametername is executed. Enabled by default.
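By contrast, a minimal sketch of Hive's own variable substitution from the wiki page linked above (table_one and variable come from the question; the value 42 is illustrative):
SET hivevar:parameter=42;
SELECT * FROM table_one WHERE variable = ${hivevar:parameter};
The same variable can also be supplied on the command line, e.g. hive --hivevar parameter=42 -f query.hql, with no prompting involved.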

How to change Tez job name when running query in HIVE

When I submit a Hive SQL using Tez like below:
hive (default)> select count(*) from simple_data;
In the Resource Manager UI the job name shows up as something like HIVE-9d1906a2-25dd-4a7c-9ea3-bf651036c7eb. Is there a way to change the job name to my_job_name?
If I am not using Tez and running the job in MR, I can set the job name using set mapred.job.name.
Are there any Tez parameters I need to set, to change the job name?
Any input is appreciated.
You can use set hive.query.name=myjobname; but you will be able to see the name only in the Tez view, not in YARN.
See the link below: https://community.hortonworks.com/questions/5309/how-to-set-tez-job-name.html I'm looking into this issue as well; if I find a solution I will update this answer.
Got this figured out. The name can be changed using the property hive.session.id. Below is an example.
hive --hiveconf hive.session.id=test_$(date '+%Y%m%d_%H%M%S') \
-e "select month, max(sale) from simple_data group by month;"
Good question. There is a JIRA for Hive on Spark that covers something very similar to what you're asking: HIVE-12811 - you can use spark.app.name there; it landed in Hive 2.1.
I can't find anything specific for Hive on Tez; perhaps somebody needs to submit a Hive JIRA/patch similar to HIVE-12811, but for Tez.
set hive.query.name="test_query";
will work in Hive when running on Tez.
set mapred.job.name=a_more_helpful_name works when running on MapReduce.
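Putting the two together, a minimal sketch that labels the job whichever engine is used (my_job_name is illustrative; simple_data is the table from the question):
-- hive.query.name shows up in the Tez view; mapred.job.name in the Resource Manager UI when running on MapReduce
SET hive.query.name=my_job_name;
SET mapred.job.name=my_job_name;
SELECT count(*) FROM simple_data;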

Mahout Hive Integration

I want to combine Hadoop-based Mahout recommenders with Apache Hive, so that my generated recommendations are stored directly in my Hive tables. Does anyone know of tutorials for this?
Hadoop-based Mahout recommenders can store their results directly in HDFS.
Hive also allows you to create a table schema on top of any existing data using CREATE EXTERNAL TABLE recommend_table, which also specifies the location of the data (LOCATION '/home/admin/userdata').
This way, whenever new data is written to that location (/home/admin/userdata), it is immediately available to Hive and can be queried through the existing table schema recommend_table.
I blogged about this some time back: external-tables-in-hive-are-handy. This approach works for any kind of MapReduce program output that needs to be available immediately for ad-hoc Hive queries.
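A minimal sketch of such an external table, assuming the recommender writes tab-separated rows to that directory (the column names and types are illustrative):
CREATE EXTERNAL TABLE recommend_table (
  user_id BIGINT,
  item_id BIGINT,
  score   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/home/admin/userdata';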
