Passing date parameter to sqoop import into Hive table - oracle

I am importing a set of tables from an Oracle database into Hive using sqoop import statement as follows:
sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" --connect CONNECTIONSTRING --table TABLENAME --username USERNAME --password PASSWORD --hive-import --hive-drop-import-delims --hive-overwrite --hive-table HIVE_TABLE_NAME1 --null-string '\N' --null-non-string '\N' -m 1
and I am using the following check-column arguments in this sqoop statement for incremental loads:
--check-column COLUMN_NAME --incremental lastmodified --last-value HARDCODED_DATE
I tested this and it works well, but I want to make it dynamic so that I don't have to hard-code the date in the statement. Instead, I want to pass the date as a parameter, so that the import checks the specified column and fetches all data after that date. I understand that the date has to be passed from a separate file, but I am not sure what the structure of that file should be or how it would reference this sqoop statement. Any help or guidance would be greatly appreciated. Thank you in advance!

You can use a saved sqoop job for this.
Create the job with --last-value 0. The first run imports the data, and after each run Sqoop stores the new last value in the job, so you only have to run sqoop job --exec <<job_name>> every time. It will update the data without any hard-coded value.
sqoop job --create <<job_name>> -- import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" --connect <<db_url>> --table <<table_name>> --username <<username>> --password <<password>> --hive-import --hive-drop-import-delims --hive-overwrite --hive-table <<hive_table>> --null-string '\N' --null-non-string '\N' -m 1 --incremental lastmodified --check-column timedate --last-value 0
sqoop job --exec <<job_name>>
For more details visit https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_literal_sqoop_job_literal
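If you would rather keep the date in a file of your own, as the question describes, a small wrapper script can read it and pass it to --last-value. The following is only a minimal sketch under assumptions: STATE_FILE, DRY_RUN, the fallback date, and the one-line file layout are hypothetical choices, and the Hive-related flags from the question are left out for brevity.

```shell
#!/bin/sh
# Sketch only: STATE_FILE, DRY_RUN and the fallback date are hypothetical names/values.
STATE_FILE="${STATE_FILE:-./sqoop_last_value.txt}"

# Print the timestamp stored by the previous run, or a fallback for the first run.
read_last_value() {
  if [ -s "$STATE_FILE" ]; then
    head -n 1 "$STATE_FILE"
  else
    echo "1970-01-01 00:00:00"
  fi
}

# Build the import command; print it when DRY_RUN=1 (the default), otherwise run it.
run_import() {
  last_value="$(read_last_value)"
  # Note the start time before importing, so rows changed during the run are not missed.
  started_at="$(date '+%Y-%m-%d %H:%M:%S')"
  set -- sqoop import --connect CONNECTIONSTRING --table TABLENAME \
    --username USERNAME --password PASSWORD \
    --check-column COLUMN_NAME --incremental lastmodified \
    --last-value "$last_value"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # The printed line is for inspection only; it does not preserve shell quoting.
    printf '%s\n' "$*"
  else
    # Only advance the stored value when the import succeeded.
    "$@" && echo "$started_at" > "$STATE_FILE"
  fi
}

run_import
```

Running the script with DRY_RUN=0 executes the import and then overwrites the state file, which holds a single line with the timestamp. The clock of the machine running the script and the database clock are assumed to agree.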

Related

Is it possible to import a table with sqoop and add an extra timestamp column?

Is it possible to use the sqoop "import table" command to import a table from an Oracle database to a Hadoop cluster and add an extra column with the current timestamp (for troubleshooting purposes)? So far, I have the following command:
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect jdbc:oracle:thin:@//MY_ORACLE_SERVER --username USERNAME --password PASSWORD --target-dir /MyDIR --fields-terminated-by '\b' --table SOURCE_TABLE --hive-table DESTINATION_TABLE --hive-import --hive-overwrite --hive-delims-replacement '<newline>'
I would like to add a timestamp column to the table so that I know when the data was loaded. Is it possible?
Thanks in advance
You can use a free-form query import instead of a table import and select the current timestamp in the query. Sqoop requires the literal token $CONDITIONS in the WHERE clause of a free-form query, and either --split-by or -m 1:
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect jdbc:oracle:thin:@//MY_ORACLE_SERVER --username USERNAME --password PASSWORD --target-dir /MyDIR --fields-terminated-by '\b' --query 'SELECT a.*, systimestamp FROM SOURCE_TABLE a WHERE $CONDITIONS' -m 1 --hive-table DESTINATION_TABLE --hive-import --hive-overwrite --hive-delims-replacement '<newline>'
You could also use sysdate instead of systimestamp (a smaller data type, but with less precision).
Alternatively, import into a temporary Hive table with sqoop, then create a new Hive table from it with the extra columns you need.
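The staging-table suggestion above could look roughly like the sketch below. It only prints a HiveQL statement; the table names and the load_ts column are hypothetical, and it assumes a Hive version that provides current_timestamp().

```shell
#!/bin/sh
# Hypothetical table names; adjust them to your schema.
STAGING_TABLE="${STAGING_TABLE:-destination_table_staging}"
FINAL_TABLE="${FINAL_TABLE:-destination_table}"

# Print a HiveQL statement that copies the staging table and adds a load timestamp.
build_ctas() {
  printf 'CREATE TABLE %s AS SELECT s.*, current_timestamp() AS load_ts FROM %s s;\n' \
    "$FINAL_TABLE" "$STAGING_TABLE"
}

# Dry run: show the statement. To execute it, pass it to the Hive CLI,
# for example: hive -e "$(build_ctas)"
build_ctas
```

The sqoop import would then target the staging table, and this statement would run after each load. Every row of one load gets the same timestamp, which is usually what you want for troubleshooting.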

Incremental update in Hive table using sqoop

I have a table in oracle with only 4 columns...
Memberid --- bigint
uuid --- String
insertdate --- date
updatedate --- date
I want to import this data into a Hive table using sqoop. I created the corresponding Hive table with
create EXTERNAL TABLE memberimport(memberid BIGINT,uuid varchar(36),insertdate timestamp,updatedate timestamp)LOCATION '/user/import/memberimport';
and sqoop command
sqoop import --connect jdbc:oracle:thin:@dbURL:1521/dbName --username ** --password *** --hive-import --table MEMBER --columns 'MEMBERID,UUID,INSERTDATE,UPDATEDATE' --map-column-hive MEMBERID=BIGINT,UUID=STRING,INSERTDATE=TIMESTAMP,UPDATEDATE=TIMESTAMP --hive-table memberimport -m 1
It works properly and imports the data into the Hive table.
Now I want to update this table incrementally based on updatedate (with today's date as the last value), so that I get the day-to-day updates from that OLTP table into my Hive table using sqoop.
For the incremental import I am using the following sqoop command
sqoop import --hive-import --connect jdbc:oracle:thin:@dbURL:1521/dbName --username *** --password *** --table MEMBER --check-column UPDATEDATE --incremental append --columns 'MEMBERID,UUID,INSERTDATE,UPDATEDATE' --map-column-hive MEMBERID=BIGINT,UUID=STRING,INSERTDATE=TIMESTAMP,UPDATEDATE=TIMESTAMP --hive-table memberimport -m 1
But I am getting exception
"Append mode for hive imports is not yet supported. Please remove the parameter --append-mode"
When I remove --hive-import it runs properly, but I do not find in the Hive table the new updates that exist in the OLTP table.
Am I doing anything wrong?
Please suggest how I can run an incremental update from Oracle to Hive using sqoop.
Any help will be appreciated.
Thanks in advance.
I don't have the resources to replicate your scenario exactly, but you might want to try building a sqoop job and testing your use case with it.
sqoop job --create sqoop_job \
-- import \
--connect "jdbc:oracle://server:port/dbname" \
--username=(XXXX) \
--password=(YYYY) \
--table (TableName) \
--target-dir (Hive Directory corresponding to the table) \
--append \
--fields-terminated-by '(character)' \
--lines-terminated-by '\n' \
--check-column "(Column To Monitor Change)" \
--incremental append \
--last-value (last value of column being monitored) \
--outdir (log directory)
When you create a sqoop job, it takes care of --last-value for subsequent runs. Also, I have used the Hive table's data directory as the target for the incremental update.
Hope this provides a helpful direction to proceed.
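To check which value a saved job will use on its next run, you can look at the job's stored options. The sketch below filters the "key = value" lines that sqoop job --show prints; the sample input is hand-written for illustration, not real output, and the helper name is hypothetical.

```shell
#!/bin/sh
# Read "key = value" lines (the layout printed by `sqoop job --show`) from stdin
# and print the value stored under incremental.last.value.
extract_last_value() {
  sed -n 's/^[[:space:]]*incremental\.last\.value[[:space:]]*=[[:space:]]*//p'
}

# Example with a hand-written sample; against a real job you would run:
#   sqoop job --show sqoop_job | extract_last_value
printf '%s\n' 'db.table = MEMBER' 'incremental.last.value = 2015-02-15 19:19:37.0' \
  | extract_last_value
```

Comparing this value before and after sqoop job --exec sqoop_job is a quick way to confirm that the job is advancing.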
There is no direct way to achieve this in Sqoop. However, you can use a four-step strategy.

import data from vertica to hive

I am trying to load data from Vertica into Hive using Sqoop.
I can see that it creates a file and a table in Hive, but when I try to select the data from the Hive table or from the file, I cannot see the data. The select shows an error (there is no delimiter between the columns in the file).
This is my code:
sqoop import -m 1 --driver com.vertica.jdbc.Driver --connect "jdbc:vertica://serverName:5443/DBName" --username "user" --password "pass" --query 'select id, name from contacts limit 10' --target-dir "folder/contacts" --hive-import --create-hive-table --hive-table db.contacts
Use these arguments and choose delimiters for your data:
--fields-terminated-by
--lines-terminated-by
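As an illustration, the command from the question with explicit delimiters might look like the sketch below. The tab and newline choices are assumptions, so pick characters that never occur in your data. The query also carries the $CONDITIONS token that Sqoop expects in free-form queries. The script only prints the command and does not run it.

```shell
#!/bin/sh
# Sketch: the tab/newline choice is an assumption, not a requirement.
FIELD_DELIM='\t'
LINE_DELIM='\n'

# Print the import command with explicit delimiters (dry run; nothing is executed).
print_import_command() {
  set -- sqoop import -m 1 --driver com.vertica.jdbc.Driver \
    --connect "jdbc:vertica://serverName:5443/DBName" \
    --username user --password pass \
    --query 'select id, name from contacts where $CONDITIONS limit 10' \
    --target-dir folder/contacts \
    --fields-terminated-by "$FIELD_DELIM" --lines-terminated-by "$LINE_DELIM" \
    --hive-import --create-hive-table --hive-table db.contacts
  # printf is used instead of echo so the backslash sequences are printed literally.
  # The printed line is for inspection only; it does not preserve shell quoting.
  printf '%s\n' "$*"
}

print_import_command
```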

saved sqoop job not using time zone of the server

The following saved sqoop job uses a time zone different from that of the server on which the job is saved.
sqoop job --create myjob9 -- import --connect jdbc:oracle:thin:@xyz:1234/abc --check-column LAST_UPDATE_DATETIME --incremental lastmodified --last-value "2015-02-15 19.19.37.000000000" --hive-import --table SIM_UNAUDITED_SALES_TMP --append
The last value when the job is executed is 1 hour ahead of the system time. How do I sync the time zone?
You can use the following generic argument to set the server time zone:
-D mapreduce.map.java.opts=" -Duser.timezone=$your_timezone"
Be careful to place this generic argument before the job arguments. For example:
sqoop job -D mapreduce.map.java.opts=" -Duser.timezone=$your_timezone" --create myjob9 -- import --connect jdbc:oracle:thin:@xyz:1234/abc --check-column LAST_UPDATE_DATETIME --incremental lastmodified --last-value "2015-02-15 19.19.37.000000000" --hive-import --table SIM_UNAUDITED_SALES_TMP --append
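A small helper can keep the time zone in one place so the same value is used whenever the job is created. This is only a sketch: SERVER_TZ and its default are hypothetical, and the zone is assumed to be an ID that Java recognises, such as Europe/Berlin.

```shell
#!/bin/sh
# Print the generic argument value that sets the JVM time zone of the map tasks.
# The leading space inside the value follows the answer above.
timezone_opt() {
  printf 'mapreduce.map.java.opts= -Duser.timezone=%s\n' "$1"
}

# Hypothetical zone name; use the zone ID of the server that saves the job.
SERVER_TZ="${SERVER_TZ:-Europe/Berlin}"
timezone_opt "$SERVER_TZ"

# Usage when creating the job:
#   sqoop job -D "$(timezone_opt "$SERVER_TZ")" --create myjob9 -- import ...
```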

Sqoop Incremental import and update

I am trying to import data from an Oracle database into a Hive database. The goal is to propagate the changes in the Oracle database to Hive using sqoop import. The sqoop command is as follows:
sqoop import -D mapred.child.java.opts='\-Djava.security.egd=file:/dev/../dev/urandom' \
--connect 'jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(LOAD_BALANCE=ON)(FAILOVER=ON)(ADDRESS=(PROTOCOL=TCP)(HOST=)(PORT=1545))(ADDRESS=(PROTOCOL=TCP)(HOST=)(PORT=1545)))(CONNECT_DATA=(SERVICE_NAME=)(SERVER=DEDICATED)))' \
--username abcde \
--password 1234rgtds \
--table Customer_Acc \
--columns Name,ID,Address,Date_booking,Last_update_date \
-m 1 \
--target-dir /final/table \
--hive-import \
--hive-table tesupd \
--map-column-hive Name,ID,Address,Date_booking \
--null-string '\\N' \
--null-non-string '\\N' \
--hive-delims-replacement ' ' \
--incremental lastmodified \
--check-column Last_update_date \
--last-value "2009-12-31 12:14:28"
The final output should be the data greater than the last value, but in the above case it is appending the data instead of incrementally updating it.
I want the data to be updated rather than appended.
Use the --merge-key option in your sqoop-import command. This replaces the older records with the latest ones.
Alternatively, you can use the sqoop-merge command, but that takes two steps: first run sqoop-import without --merge-key, then run sqoop-merge.
Try using --incremental append rather than --incremental lastmodified.
When the import runs as a saved job with --incremental append, the last value of the check column is stored in the sqoop metastore as 'incremental.last.value' and changes each time the job is executed. You do not have to update --last-value yourself; it is updated automatically.
This way the value is always up to date (in the sqoop metastore) and no redundant data is imported.
Neither sqoop nor Hive can directly update the data in Hive using sqoop imports. Please follow the steps in the below link for row level updates.
http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
So your data is mutable and you'd like to update in HDFS the records that have changed in your DB.
For this, you need the --incremental lastmodified mode together with --merge-key. You also need to create a Sqoop job, because that captures the most recent --last-value and serializes it back to the saved job.
You should create a Sqoop job that looks something like this:
sqoop job \
--create jobName \
-- \
import \
--connect jdbc:oracle:thin:@hostname:port:sid \
--username user \
--password-file fileOnHDFS.password \
--table tableName \
--incremental lastmodified \
--check-column UPDATE_DATUM \
--last-value 1985.01.01 \
--merge-key ID
More details can be found in the Sqoop User Guide (v1.4.6)

Resources