Dynamically calculating an Oozie parameter (number of reducers for an MR action)

In my Oozie workflow I dynamically create a Hive table, say T1. This Hive action is then followed by a map-reduce action. I want to set the number-of-reducers property (mapred.reduce.tasks) equal to the number of distinct values of a field, say T1.group. Any ideas how to set the value of an Oozie parameter dynamically, and how to get the value from the Hive distinct-count query into that Oozie parameter?

I hope this can help:
1. Create the Hive table as you are doing already.
2. Execute another Hive query that calculates the distinct values for the column and writes the result to a file in HDFS.
3. Create a Shell action that reads the file and echoes the value in the form key=value. Enable capture-output for the Shell action.
4. In your MR action, access the Shell action's output using the Oozie EL function ${wf:actionData('ShellAction')['key']}, and pass this value to mapred.reduce.tasks in the configuration tag of the MR action.
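The shell-plus-EL approach above can be sketched as a workflow fragment. This is a hedged sketch, not a complete workflow: the action names, the HDFS path, the script name read_count.sh, and the num_reducers key are all hypothetical, and the mapper/reducer configuration is omitted.

```xml
<!-- Hypothetical Hive step writing the distinct count to HDFS, e.g.:
     INSERT OVERWRITE DIRECTORY '/tmp/t1_groups'
     SELECT COUNT(DISTINCT `group`) FROM T1; -->

<action name="ShellAction">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>read_count.sh</exec>
        <file>${appPath}/read_count.sh#read_count.sh</file>
        <!-- read_count.sh would do something like:
             echo "num_reducers=$(hdfs dfs -cat /tmp/t1_groups/*)" -->
        <capture-output/>
    </shell>
    <ok to="MRAction"/>
    <error to="fail"/>
</action>

<action name="MRAction">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- Mapper/reducer class properties omitted for brevity -->
            <property>
                <name>mapred.reduce.tasks</name>
                <value>${wf:actionData('ShellAction')['num_reducers']}</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>
```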

Related

oozie variables across action actions

I am new to Oozie and have a use case where we need to set a variable in one Oozie action and read the same variable in a different Oozie action. This job runs every week; in the first action we calculate a few values, and we would like to use the same values in the other actions as well. Does Oozie have provisions to do this?

Example of how to set a Hive property from within a Hive query

I need a quick example of how to change a property in Hive using a query. For instance, I would like to change the property 'mapred.reduce.tasks'; how do I perform this change within a query?
I'm training myself for the HDPCD exam, and one of the goals in the exam is 'Set a Hadoop or Hive configuration property from within a Hive query'. So I suppose it's not the same as running something like the following in the Hive console:
set mapred.reduce.tasks=2;
To change a Hadoop or Hive configuration variable you use SET in the Hive query.
The change applies only to that query session.
set -v prints all Hadoop and Hive configuration variables.
SET mapred.reduce.tasks=XX; -- Hadoop 1.x
SET mapreduce.job.reduces=XX; -- Hadoop 2.x (YARN)
reset restores the configuration to the default values.
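Put together in a single Hive session, this might look as follows (the table and column names are made up for illustration):

```sql
-- Applies only to this session, not to the cluster defaults
SET mapreduce.job.reduces=2;

SELECT dept, COUNT(*) FROM employees GROUP BY dept;

-- Print all Hadoop and Hive configuration variables
SET -v;

-- Restore every property changed in this session to its default
RESET;
```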

Documentation of manually passing parameters ${parameter} inside query

Hive documented about setting variables in hiveconf
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution
I know there is also a way of passing parameters using ${parameter} (not hiveconf), e.g.
select * from table_one where variable = ${parameter}
And then the Hive editor would prompt you to enter the value for parameter when you submit the query.
I can't find where Apache Hadoop documents this way of passing parameters. Is this way of passing parameters inherent to Hive or Oozie? If it is Oozie, why can it be used in the Hive editor?
This is a feature of Hue. There is a reference to this feature in Cloudera documentation, at least for older versions. For example, the Hive Query Editor User Guide describes it.
PARAMETERIZATION Indicate that a dialog box should display to enter parameter values when a query containing the string $parametername is executed. Enabled by default.
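For comparison, here is a sketch of the two styles side by side (the variable names and table are illustrative): the hiveconf form is the one documented in the Hive variable-substitution page linked above, while the bare ${parameter} form is the Hue editor feature that prompts for a value at submit time.

```sql
-- Hive variable substitution: value supplied outside the query, e.g.
--   hive --hiveconf run_date=2015-06-01 -f query.hql
SELECT * FROM table_one WHERE load_date = '${hiveconf:run_date}';

-- Hue parameterization: the editor displays a dialog asking for
-- "parameter" when this query is submitted
SELECT * FROM table_one WHERE variable = ${parameter};
```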

sqoop oozie write query result to a file

I have a current Oozie job that queries an Oracle table and overwrites the result into a Hive table.
Now I need to prevent overwriting the Hive table and preserve the existing data in it.
For this I wanted to plan such steps:
1st step: Get record count running a "select count(*) from..." query and write it on a file.
2nd step: Check the count written in file.
3rd step: decision step whether or not 4th step will be applied.
4th step: Run the main query and overwrite the hive table.
My problem is that I couldn't find anything in the documentation or examples about writing the query result to a file (I know import and export are the aim of Sqoop).
Does anyone know how to write the query result to a file?
In theory:
build a Pig job that runs the COUNT(*) and dumps the result to stdout as if it were a Java property, e.g. my.count=12345
in Oozie, define a Pig action with the <capture-output/> flag to run that job
then define a decision node based on the value of key my.count, using the appropriate EL function
In practice, well, have fun!
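The decision part of that plan could be sketched in workflow XML as below. The action name count-action and the routing are assumptions; the my.count key follows the answer's example, and wf:actionData returns strings, hence the quoted comparison.

```xml
<decision name="check-count">
    <switch>
        <!-- Skip the overwrite when the Hive table already holds rows -->
        <case to="end">${wf:actionData('count-action')['my.count'] != '0'}</case>
        <!-- Table is empty: run the main Sqoop query and overwrite -->
        <default to="main-query"/>
    </switch>
</decision>
```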

Oozie workflow for hive action

I am using Oozie to execute a few Hive queries one after another, and if a query fails it sends an error email saying that a particular Hive query failed.
Now I have to implement another email trigger based on the result of each Hive query. How can we do that? If a query returns any result, the results should be emailed and the remaining Hive queries should continue executing. The Oozie workflow should not stop regardless of whether a query returns a value or not.
In short: if a query returns a value, send the email and continue; if it doesn't return a value, it should also continue executing.
Thank you in advance.
If you want to make decisions based on a previous step, it is better to use Shell actions (with hive -e to execute the query) along with the capture-output tag in Oozie. Or, better, use Java actions with a Hive JDBC connection to execute the Hive queries, where you can use Java for all the looping and decision making.
As Oozie doesn't support cycles/loops of execution, you might need to repeat the email action in the workflow based on the decision making and flow.
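A minimal sketch of such a Shell action script, assuming a hypothetical table and a row_count output key; Oozie's capture-output then exposes row_count to a downstream decision node that routes to the email action when it is non-zero.

```sh
#!/bin/sh
# Run the query non-interactively and capture its single-number result
result=$(hive -e "SELECT COUNT(*) FROM my_db.my_table WHERE dt = '${1}'")

# Emit key=value on stdout so Oozie's <capture-output/> records it;
# downstream: ${wf:actionData('shell-action')['row_count']}
echo "row_count=${result}"
```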
