I am using Oozie to execute a few Hive queries one after another, and if a query fails it sends an error email saying that the particular Hive query failed.
Now I have to implement another email trigger based on the result of each Hive query. How can we do that? The idea is: if a query returns any result, send the results by email and continue executing the remaining Hive queries. The Oozie workflow should never stop, regardless of whether a query returns a value or not.
In short, if a query returns a value, send an email and continue; if it doesn't return a value, it should still continue executing.
Thank you in advance.
If you want to make decisions based on a previous step, it is better to use a shell action (hive -e to execute the query) together with the <capture-output/> tag in Oozie. Better yet, use a Java action with a Hive JDBC connection to execute the Hive queries, where you can use Java for all the looping and decision making.
Since Oozie doesn't support cycles/loops of execution, you may need to repeat the email action in the workflow wherever the decision making and flow require it.
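A minimal sketch of the Java-action route, assuming a HiveServer2 endpoint at jdbc:hive2://hive-host:10000/default and made-up query and property names; the action runs the query, always exits successfully so the workflow continues, and exposes whether it returned rows (and the rows themselves) to the workflow through <capture-output/>:

```java
// Hypothetical Oozie <java> action: run one Hive query over JDBC, never fail the
// workflow on "no results", and publish the outcome for downstream actions.
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class HiveQueryCheck {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // HiveServer2 JDBC driver
        String query = args.length > 0 ? args[0] : "SELECT col FROM some_table LIMIT 10";  // assumed query

        StringBuilder rows = new StringBuilder();
        boolean hasResults = false;
        try (Connection con = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            while (rs.next()) {
                hasResults = true;
                rows.append(rs.getString(1)).append('\n');
            }
        }

        // With <capture-output/>, Oozie reads key=value pairs from this file.
        // Keep the captured data small (Oozie limits action output size).
        Properties props = new Properties();
        props.setProperty("has_results", Boolean.toString(hasResults));
        props.setProperty("result_rows", rows.toString());
        String outFile = System.getProperty("oozie.action.output.properties");
        try (OutputStream os = new FileOutputStream(outFile)) {
            props.store(os, "hive query check");
        }
    }
}
```

A decision node after this action can then branch on the has_results value (via the wf:actionData EL function) to an email action, with both branches leading on to the next Hive query, so the workflow never stops.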
I have an existing Oozie job that queries an Oracle table and writes (overwrites) the result into a Hive table.
Now I need to prevent overwriting the Hive table and preserve the existing data in it.
For this I planned the following steps:
1st step: Get the record count by running a "select count(*) from ..." query and write it to a file.
2nd step: Check the count written in the file.
3rd step: Decide whether or not the 4th step will be applied.
4th step: Run the main query and overwrite the Hive table.
My problem is that I couldn't find anything in the documentation or in examples about writing the result to a file (I know that import and export are what Sqoop is for).
Does anyone know how to write the query result to a file?
In theory:
build a Pig job to run the count(*) and dump the result to stdout as if it were a Java property, e.g. my.count=12345
in Oozie, define a Pig action with the <capture-output/> flag to run that job
then define a decision node based on the value of the key my.count, using the appropriate EL function
In practice, well, have fun!
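If Java is an option, here is a rough sketch of the same idea done with an Oozie java action instead of Pig: run the count over HiveServer2 JDBC and hand it back to the workflow as a property a decision node can test. The host, table name and property key are illustrative assumptions, not values from the question.

```java
// Hypothetical Oozie <java> action: count rows in the target table and expose
// the count to the workflow via <capture-output/>.
import java.io.FileOutputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class CountCheck {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        long count = 0;
        try (Connection con = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM target_table")) {
            if (rs.next()) {
                count = rs.getLong(1);
            }
        }
        // Oozie reads key=value pairs from this file when <capture-output/> is set.
        Properties p = new Properties();
        p.setProperty("my.count", Long.toString(count));
        try (FileOutputStream out = new FileOutputStream(System.getProperty("oozie.action.output.properties"))) {
            p.store(out, "row count");
        }
    }
}
```

A decision node can then read wf:actionData('count-check')['my.count'] (the action name here is hypothetical) and decide whether to run the overwrite step.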
Is it possible to get the application IDs of all MapReduce jobs launched by a Hive query? I can look into the history or timeline server and get the Hive query string for every application ID, but I was wondering whether I can get the user ID and all the application IDs from Hive's post-execution hook?
Scenario: you are writing an MR job that uses mappers to process data and then uses reducers to insert the resulting data directly into an external RDBMS. What must you be sure to do, and why?
Prerequisites:
1. Ensure that the database driver is present on the client machine that submits the job.
2. Disable speculative execution for the data-insert job.
1) If you forget to disable speculative execution, multiple instances of a given reducer could run, which would insert more data than expected into the RDBMS.
2) The database driver is only needed on the client machine if you plan to connect to the RDBMS from that client; otherwise it is not needed there.
So disabling speculative execution (point 1 above) is the thing you must be sure to do.
I came up with this solution. Can anybody improve this answer or correct me if there are any issues? Thank you.
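For illustration, a hypothetical driver for this scenario could look like the following; the JDBC driver, connection string, table and mapper/reducer classes are placeholders, the key line being the one that turns off reduce-side speculative execution:

```java
// Driver sketch: reducers write to an external RDBMS via DBOutputFormat, so
// speculative execution of reducers is disabled to avoid duplicate inserts.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

public class RdbmsExportDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // JDBC driver class and connection string are assumptions; the driver jar
        // must be on the task classpath (e.g. shipped with -libjars), since the
        // reducers, not the client, open the connections.
        DBConfiguration.configureDB(conf,
                "com.mysql.jdbc.Driver",
                "jdbc:mysql://db-host:3306/warehouse",
                "etl_user", "secret");

        Job job = Job.getInstance(conf, "export-to-rdbms");
        job.setJarByClass(RdbmsExportDriver.class);

        // The critical setting: a speculative duplicate reducer would insert the same rows twice.
        job.setReduceSpeculativeExecution(false);

        job.setOutputFormatClass(DBOutputFormat.class);
        // Target table and columns are placeholders.
        DBOutputFormat.setOutput(job, "export_table", "id", "value");

        // job.setMapperClass(...); job.setReducerClass(...);  // application-specific classes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```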
I need to pass some parameters to my map program. The values for these parameters need to be fetched from a database, and they are dynamic. I know how to pass the parameters using the Configuration API. If I write JDBC code in the driver/client to retrieve these values from the database and then set them via the Configuration API, how many times will this code be executed? Will the driver code be distributed to and executed on each data node where the Hadoop framework decides to run the MR program?
What is the best way to do this?
No, the driver code is not executed on each machine; it runs once, on the client that submits the job.
I suggest fetching the data outside the map-reduce program and then passing it in as a parameter.
Say you have a script that launches the job: you can fetch the value from the database into a variable and then pass that variable to the Hadoop job.
I think this will do what you want.
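A minimal sketch of that approach, assuming a made-up MySQL lookup table and property key; the JDBC call happens once in the driver, and only the resulting value travels to the tasks through the job configuration:

```java
// Driver fetches a dynamic parameter via JDBC and ships it to the tasks in the job conf.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ParamDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Runs exactly once, on the machine that submits the job.
        try (Connection con = DriverManager.getConnection("jdbc:mysql://db-host:3306/config_db", "user", "pw");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT threshold FROM job_params WHERE job_name = 'my_job'")) {
            if (rs.next()) {
                conf.set("my.job.threshold", rs.getString(1));  // value shipped to every task via the job conf
            }
        }

        Job job = Job.getInstance(conf, "param-demo");
        job.setJarByClass(ParamDriver.class);
        job.setMapperClass(ParamMapper.class);
        // ... input/output paths and formats as usual
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static class ParamMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String threshold;

        @Override
        protected void setup(Context context) {
            // Each task reads the value from its own copy of the job configuration.
            threshold = context.getConfiguration().get("my.job.threshold", "0");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            // use 'threshold' here
        }
    }
}
```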
If the data you need is big (more than a few kilobytes), Configuration may not be suitable. A better alternative is to use Sqoop to fetch that data from the database into HDFS, and then use the Hadoop distributed cache so that in your map or reduce code you can read the data without passing any parameters.
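A rough sketch of the distributed-cache part, assuming the data has already been landed at an HDFS path of your choosing (the path and symlink name below are made up):

```java
// Cache an HDFS file on every node and read it locally in the mapper's setup().
import java.io.BufferedReader;
import java.io.FileReader;
import java.net.URI;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {
    public static void configure(Job job) throws Exception {
        // '#lookup' gives the cached file a fixed symlink name in each task's working directory.
        job.addCacheFile(new URI("/user/etl/lookup/params.txt#lookup"));
    }

    public static class CacheMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws java.io.IOException {
            try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // parse and keep the lookup data in memory for use in map()
                }
            }
        }
    }
}
```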
You can retrieve the values from the DB in the driver code. The driver code will execute only once per job.
Guys, I am a newbie to Hive and have some doubts about it.
Normally we write a custom UDF in Hive for a particular number of columns (consider a UDF written in Java), meaning it performs some operation on those particular columns.
What I am wondering is: can we write a UDF that takes a particular column as input, builds a query from it, and returns that query so that it is executed on the Hive CLI with that column as its input?
Can we do this? If yes, please suggest how.
Thanks, and sorry for my bad English.
This is not possible out of the box, because by the time the Hive query is running, a plan has already been built and is executing. What you suggest would mean dynamically changing that plan while it is running, which is hard not only because the plan is already built, but also because the Hadoop MapReduce jobs are already running.
What you can do is have your initial Hive query write the new Hive queries out to a file, then have some sort of bash/perl/python script that goes through that file, formulates the new Hive queries, and passes them to the CLI.
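If you'd rather stay in Java than shell/perl/python, roughly the same second step can be done over Hive JDBC; the file path, connection URL and the one-statement-per-line assumption below are all illustrative:

```java
// Read generated Hive statements from a file and submit them one by one over JDBC.
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;

public class GeneratedQueryRunner {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        List<String> queries = Files.readAllLines(Paths.get("/tmp/generated_queries.hql"), StandardCharsets.UTF_8);

        try (Connection con = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {
            for (String q : queries) {
                q = q.trim();
                if (q.isEmpty()) {
                    continue;  // skip blank lines
                }
                // Hive JDBC takes statements without a trailing ';'
                stmt.execute(q.endsWith(";") ? q.substring(0, q.length() - 1) : q);
            }
        }
    }
}
```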