sqoop oozie write query result to a file - hadoop

I have an existing Oozie job that queries an Oracle table and overwrites a Hive table with the result.
Now I need to prevent the overwrite under certain conditions and keep the existing data in that Hive table.
To do this, I planned the following steps:
1st step: get the record count by running a "select count(*) from ..." query and write it to a file.
2nd step: check the count written in the file.
3rd step: a decision step that determines whether the 4th step runs.
4th step: run the main query and overwrite the Hive table.
My problem is that I couldn't find anything in the documentation or examples about writing a query result to a file (I know that import and export are the purpose of Sqoop).
Does anyone know how to write the query result to a file?

In theory:
build a Pig job to run the "count(*)" and dump the result to stdout as if it were a Java property, e.g. my.count=12345
in Oozie, define a Pig action with the <capture-output/> flag to run that job
then define a decision based on the value of the key my.count, using the appropriate EL function
In practice, well, have fun!
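The decision part of such a workflow could be sketched like this (node and action names are assumptions, and the count is assumed to have been captured under the key my.count by an action named count-step):

```xml
<decision name="check-count">
    <switch>
        <!-- run the main query only when the captured count is non-zero -->
        <case to="main-query">
            ${wf:actionData('count-step')['my.count'] gt 0}
        </case>
        <default to="end"/>
    </switch>
</decision>
```

The wf:actionData EL function reads the key/value pairs that a previous action emitted through capture-output.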

Related

Sqoop incremental job importing more records than the source

I have created a Sqoop job to import data from Netezza. It imports data daily by comparing a timestamp column (the check column) in the source. I am observing that the job imports more records each day than the source table in Netezza contains.
There seems to be no problem or error with the job. The 'incremental.last.value' is also updated properly on each run.
How can I find out what is wrong with the job? I am using Sqoop version 1.4.5.2.2.6.0-2800.
Can you please show the Sqoop job statement used. Have you used a --split-by column in the Sqoop job? If yes, try using a different --split-by column.
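For reference, an incremental job of this general shape (the connection string, credentials, table, and column names below are placeholders, not taken from the question) is defined with something like:

```shell
# Hypothetical Sqoop incremental job definition; all names/values are assumptions
sqoop job --create netezza_daily \
  -- import \
  --connect jdbc:netezza://nz-host:5480/proddb \
  --username etl_user \
  --password-file /user/etl/.pw \
  --table SALES \
  --split-by SALE_ID \
  --incremental lastmodified \
  --check-column UPDATED_TS \
  --last-value '2015-01-01 00:00:00' \
  --target-dir /data/sales
```

Sqoop stores the job and advances incremental.last.value after each `sqoop job --exec netezza_daily` run.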
Further investigation showed the job is working correctly; the problem was with my verification method. I was trying to compare the number of rows for a given date between Netezza and Hive, but the date value of the check column gets updated in Netezza, and those updates are not reflected in Hive in any way. Hence the number of records for a given day does not stay constant on the Netezza side.
This was a good lesson to first check all the conditions of the scenario under consideration. There may be many factors involved in an outcome other than just the correctness of the code.

Replace a name with another name in a file

I am very new to Hadoop. I have a requirement to scrub a file containing account number, name, and address details: I need to replace the name and address details with other names and addresses that exist in another file.
I am comfortable with either MapReduce or Hive.
Need help on this.
Thank you.
You can write a simple mapper-only job (with the number of reducers set to zero), update the information, and store the output in another location. Verify the output of your job; if it is what you expected, remove the old files. Remember, HDFS does not support in-place editing or overwriting of files.
See the Hadoop MapReduce tutorial.
You can also use Hive to accomplish this task:
1. Write a Hive UDF based on your scrubbing logic.
2. Apply the UDF to each column of the Hive table you want to scrub and store the data in a new Hive table.
3. Remove the old Hive table.
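If you go the Hive route, the three steps above might look like the following session (the JAR path, UDF class name, and table/column names are assumptions):

```shell
# Hypothetical scrubbing session; all names are assumptions
hive -e "
ADD JAR /tmp/scrub-udf.jar;
CREATE TEMPORARY FUNCTION scrub AS 'com.example.ScrubUDF';

-- step 2: scrub the sensitive columns into a new table
CREATE TABLE accounts_scrubbed AS
SELECT account_no, scrub(name) AS name, scrub(address) AS address
FROM accounts;

-- step 3: remove the old table once the output is verified
DROP TABLE accounts;
"
```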

Writing autosys job information to Oracle DB

Here's my situation: we have no access to the autosys server other than using the autorep command. We need to keep detailed statistics on each of our jobs. I have written some Oracle database tables that will store start/end times, exit codes, JIL, etc.
What I need to know is: what is the easiest way to get the data we require (all of which is available in the AutoSys tables that we do not have access to) into an Oracle database?
Here are the technical details of our system:
autosys version - I cannot figure out how to get this information
Oracle version - 11g
We have two separate environments - one for UAT/QA/IT and several PROD servers
Do something like the following:
Create a table with the columns you want to populate. Include a key column that is auto-generated. The JIL column should be able to hold large values. Also add a column for sysdate.
Create a shell script. Inside it, do the following:
Run "autorep -j <job_name> -l0" to get all the jobs you want and put them in a file. The -l0 option is there to avoid duplicates: if a box contains a job, then without -l0 you will get that job twice.
Create a loop and read the job names one by one.
In the loop, set variables for jobname/starttime/endtime/status (all of which you can get from "autorep -j <job_name>"). Then use a variable to hold the JIL from "autorep -q -j <job_name>".
Append all these variable values to a flat file.
End the loop. After exiting the loop you will end up with a file containing all the job details.
Then use SQL*Loader to put the data into your Oracle table. You can hardcode a control file and use it for every run; only the content of the data file will change from run to run.
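A rough sketch of that loop (the job pattern, report column positions, file paths, and Oracle credentials/control file are all assumptions and will need adjusting to your autorep output format):

```shell
#!/bin/sh
# Sketch only: collect autorep details per job and build a '|'-delimited
# data file for SQL*Loader. All names and field positions are assumptions.
DATAFILE=/tmp/autosys_jobs.dat
: > "$DATAFILE"

# Get the job list once; -l0 avoids listing a job twice when it sits in a box
autorep -j 'MYAPP%' -l0 | awk 'NR > 2 { print $1 }' > /tmp/job_names.txt

while IFS= read -r JOB; do
  # Pull last start / last end / status from the job's report line
  DETAILS=$(autorep -j "$JOB" | awk 'NR == 3 { print $2" "$3"|"$4" "$5"|"$6 }')
  # Full JIL for the job (can be large; in practice its newlines need
  # escaping before loading into a single column)
  JIL=$(autorep -q -j "$JOB")
  printf '%s|%s|%s\n' "$JOB" "$DETAILS" "$JIL" >> "$DATAFILE"
done < /tmp/job_names.txt

# Load the flat file into Oracle; the control file maps the '|' fields
sqlldr userid=scott/tiger control=autosys_jobs.ctl data="$DATAFILE"
```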
Let me know if any part is not clear.

Oozie workflow for hive action

I am using Oozie to execute a few Hive queries one after another, and if a query fails it sends an error email saying that particular Hive query failed.
Now I have to implement another email trigger based on the result of each Hive query. How can I do that? If a query returns any result, the results should be emailed and the remaining Hive queries should keep executing. The Oozie workflow should never stop, whether or not a query returns a value.
In short: if a query returns a value, send an email and continue; if it doesn't return a value, continue executing anyway.
Thank you in advance.
If you want to make decisions based on a previous step, it is better to use a shell action (with hive -e to execute the query) along with the capture-output tag in Oozie. Better still, use a Java action with a Hive JDBC connection to execute the Hive queries; there you can use Java for all the looping and decision making.
Since Oozie doesn't support cycles/loops of execution, you may need to repeat the email action in the workflow according to the decision making and flow.
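The body of such a shell action could look like this sketch. The Hive call is mocked as a function here so the logic is self-contained; in real use, replace run_query with an actual `hive -e "SELECT ..."` invocation (property and action names are assumptions):

```shell
#!/bin/sh
# Sketch of an Oozie shell action body: run a query, then emit a property
# for <capture-output/> saying whether it returned any rows.
run_query() {
  # Stand-in for: hive -e "SELECT ..."
  echo "some_row"
}

RESULT=$(run_query)
if [ -n "$RESULT" ]; then
  # Oozie picks this line up via <capture-output/>; a decision node can then
  # route to the email action when has_result is true
  echo "has_result=true"
else
  echo "has_result=false"
fi
```

A decision node downstream can read the property with an EL expression such as ${wf:actionData('check-query')['has_result'] eq "true"} and route to the email action or straight to the next query.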

Can we run queries from the Custom UDF in Hive?

Guys, I am a newbie to Hive and have some doubts about it.
Normally we write a custom Hive UDF (say, in Java) for a particular set of columns, meaning it performs some operation on those columns.
What I am wondering is: can we write a UDF that takes a particular column as input to some query, and have the UDF return that query so that it executes on the Hive CLI with the column as its input?
Can we do this? If yes, please advise.
Thanks, and sorry for my bad English.
This is not possible out of the box: by the time a Hive query is running, a plan has already been built and is executing. What you suggest would dynamically change that plan while it runs, which is hard not only because the plan is already built, but also because the Hadoop MapReduce jobs are already running.
What you can do is have your initial Hive query write new Hive queries to a file, then have some sort of bash/perl/python script go through that file, formulate the new Hive queries, and pass them to the CLI.
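That two-pass approach can be sketched as follows. The Hive CLI is mocked as a function so the flow is self-contained; in real use, replace run_hive with `hive -e "$1"`, and have the first pass be an actual Hive query whose output is the generated statements (table and file names are assumptions):

```shell
#!/bin/sh
# Sketch of the two-pass approach: pass 1 produces a file of generated
# queries, pass 2 replays them through the CLI one by one.
run_hive() {
  # Stand-in for: hive -e "$1"
  echo "result for: $1"
}

# Pass 1: pretend the initial query emitted one generated query per line
cat > /tmp/generated.hql <<'EOF'
SELECT COUNT(*) FROM sales WHERE region='EU';
SELECT COUNT(*) FROM sales WHERE region='US';
EOF

# Pass 2: feed each generated query back to the CLI
while IFS= read -r q; do
  run_hive "$q"
done < /tmp/generated.hql
```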
