Share sqoop incremental last value between two jobs - sqoop

I have a sqoop job that records the incremental last value so it can do incremental appends throughout the day. My problem is that my target directory changes each day so that we can create partitions based on log_date.
I need to record --last-value throughout the day and then pass that value into a newly created job for the next day. Is it possible to call a method to get the last value?
My current sqoop job, written in a shell script, looks like this:
sqoop job --create test_last_index \
-- import --connect jdbc:xxxx \
--password xxx \
--table test_$(date -d yesterday +%Y_%m_%d) \
--target-dir /dir/where/located \
--incremental append \
--check-column id \
--last-value 1

You do not need to call a method for this. All you need to do is create a sqoop job and save it, adding the parameters --check-column, --incremental and --last-value to the job. The --last-value will be picked up on each consecutive run and retained in the saved job. You can then use sqoop job --exec to run the job periodically, and sqoop merge to merge the modified/appended data with the historical data.
Hope this helps.
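For illustration, here is a minimal sketch of that pattern; the job name, connection string, credentials and paths are placeholders, not values from the question.
# Create the saved job once; the sqoop metastore then tracks incremental.last.value for it.
sqoop job --create daily_append \
-- import \
--connect jdbc:mysql://dbhost:3306/mydb \
--username myuser \
--password-file /user/me/db.password \
--table events \
--target-dir /data/events \
--incremental append \
--check-column id \
--last-value 0

# Run it periodically (e.g. from cron); --last-value is picked up and updated automatically.
sqoop job --exec daily_append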

I have developed a sqoop script for incremental import as follows.
sqoop import \
--driver com.sap.db.jdbc.Driver \
--fetch-size 3000 \
--connect connectionURL \
--username test \
--password test \
--table DATA \
--where YEAR=2002 \
--check-column TIMESTAMP \
--incremental append \
--last-value "2016-06-22 12:31:37.0" \
--target-dir "/incremental_data_2002/year_partition=2002" \
--fields-terminated-by "," \
--lines-terminated-by "\n" \
--split-by YEAR \
-m 4
The above script executes successfully.
In it I have hardcoded --last-value as "2016-06-22 12:31:37.0". When new data arrives in the source table in the RDBMS, I check the latest value in the table and manually update the sqoop script with it. Instead of that, I want --last-value to be determined dynamically, without hardcoding it in the script file.

Sadly, a plain sqoop import does not retrieve the last value automatically; you should use a saved job instead. From the Sqoop documentation:
At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information.
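If you do need the stored value in a script (for example to build the next day's job, as asked above), one option is to read it back from the saved job's definition. A rough sketch, assuming a saved job named test_last_index and that your Sqoop version prints the property as incremental.last.value = <value>; adjust the parsing to your output:
LAST_VAL=$(sqoop job --show test_last_index 2>/dev/null \
| grep 'incremental.last.value' \
| awk -F'=' '{print $2}' | xargs)
echo "last value recorded by the job: ${LAST_VAL}"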

Related

last-value in sqoop (incremental import)

sqoop import --connect \
jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --check-column rank --incremental append --last-value
We don't know the last value from the previous import. How can I write the query?
You can try two approaches to solve this:
Query the table and get the maximum value of the check column, then use it as --last-value (a sketch follows below).
Create a sqoop job and set the column as the incremental one; moving forward, your job will run on an incremental basis.
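For the first approach, a rough sketch: yloc_hive is an assumed Hive table over the already-imported data, and everything else follows the question's command.
# Get the maximum value already imported, then feed it in as --last-value.
LAST_VAL=$(hive -S -e 'SELECT MAX(`rank`) FROM yloc_hive')
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P \
--check-column rank --incremental append --last-value "${LAST_VAL}"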
Go to your home directory, cd into .sqoop, open the file metastore.db.script with vi or your favourite editor, and search for incremental.last.value. It should contain something like:
INSERT INTO SQOOP_SESSIONS VALUES('incimpjob','incremental.last.value','2018-09-11 19:20:52.0','SqoopOptions')
Note: I am assuming that you have created a Sqoop Job. The 'incimpjob' is the name of my sqoop job.
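A quick way to pull that line out without opening the editor, assuming the default ~/.sqoop metastore location and the job name above:
grep 'incremental.last.value' ~/.sqoop/metastore.db.script | grep "'incimpjob'"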

How to use sqoop validation?

Can you please help me with the points below?
I have an Oracle database with a huge number of records today, say 5 TB of data, so we can use the sqoop validator framework: it will validate the data and import it into HDFS.
Suppose tomorrow I receive new records on top of that data. How can I import only those new records into the existing directory, and validate them using the sqoop validator framework?
In short, I need to know how to use the sqoop validator when new records arrive and are imported into HDFS.
Please help me, team. Thanks.
Thank You,
Sipra
My understanding is that you need to check the Oracle database for new records before you start your delta process. I don't think you can validate based on the size of the data, but if you have an offset or a timestamp column, that will be helpful.
How do I know whether there are new records in Oracle since the last run/job/check?
You can do this with two sqoop import approaches; examples and explanations for both follow.
sqoop incremental import
Following is an example of a sqoop incremental import:
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --check-column rDate --incremental lastmodified --last-value 2014-01-25 --target-dir yloc/loc
This link explains it: https://www.tutorialspoint.com/sqoop/sqoop_import.html
sqoop import using query option
Here you basically use a where condition in the query and pull the data that is greater than the last received date or offset value.
Here is the syntax for it:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba --password cloudera \
--query 'select * from sample_data where $CONDITIONS AND salary > 1000' \
--split-by salary \
--target-dir hdfs://quickstart.cloudera/user/cloudera/sqoop_new
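To make that concrete, here is a hedged sketch of driving the free-form query from the previous run's cutoff; the last_run.date bookkeeping file and the last_update_date column are assumptions for illustration, the rest follows the example above.
# Read the cutoff recorded by the previous run (hypothetical bookkeeping file).
LAST_DATE=$(cat /home/cloudera/last_run.date)

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba --password cloudera \
--query "select * from sample_data where \$CONDITIONS AND last_update_date > '${LAST_DATE}'" \
--split-by salary \
--append \
--target-dir hdfs://quickstart.cloudera/user/cloudera/sqoop_new

# Record the new cutoff for the next run.
date '+%Y-%m-%d %H:%M:%S' > /home/cloudera/last_run.date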
Isolate the validation and import job
If you want to run the validation and import jobs independently, sqoop offers another utility, sqoop eval. With it you can run a query on the RDBMS and direct the output to a file or a variable in your code, and use that for validation however you want.
Syntax:
$ sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
--query "SELECT * FROM employee LIMIT 3"
Explained here: https://www.tutorialspoint.com/sqoop/sqoop_eval.htm
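As a sketch of the "output to a variable" idea described above (the employee_hive table name and the parsing of eval's tabular output are assumptions; adjust to your environment):
# Row count on the RDBMS side, parsed out of sqoop eval's tabular output.
SRC_COUNT=$(sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
--query "SELECT COUNT(*) FROM employee" 2>/dev/null \
| grep -Eo '[0-9]+' | tail -1)

# Row count on the Hive side (employee_hive is a hypothetical target table).
TGT_COUNT=$(hive -S -e "SELECT COUNT(*) FROM employee_hive")

if [ "${SRC_COUNT}" != "${TGT_COUNT}" ]; then
echo "Validation failed: source=${SRC_COUNT} target=${TGT_COUNT}" >&2
fi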
validation parameter in sqoop
You can use this parameter to validate the row counts between what is imported/exported and what is in the RDBMS:
--validate
More on that : https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#validation
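For example, a table-based import with row-count validation might look like the following; the connection details, table and paths are placeholders, and note that --validate only supports single-table imports/exports.
sqoop import \
--connect jdbc:oracle:thin:@dbhost:1521/ORCL \
--username scott --password-file /user/me/oracle.password \
--table CUSTOMERS \
--target-dir /data/customers \
--validate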

Sqoop-Imported data is not shown in the target directory

I have imported data from MySQL to HDFS with Sqoop, but I am not able to see the imported data in the desired path.
My sqoop command is:
sqoop job --create EveryDayImport --import --connect jdbc:mysql://localhost:3306/books --username=root --table=authors -m 1 --target-dir /home/training/viresh/Sqoop/authors1234 --incremental append --check-column id --last-value 0;
sqoop job --create EveryDayImport -- import --connect jdbc:mysql://localhost:3306/books --username=root --table=authors -m 1 --target-dir /home/training/viresh/Sqoop/authors1234 --incremental append --check-column id --last-value 0
There is a mistake in your Sqoop statement: you missed the space between "--" and import, as mentioned in the comment by dev.
Your sqoop statement only creates a sqoop job. To execute the job (the actual sqoop import) you have to submit it with the statement below:
$ sqoop job --exec EveryDayImport
I believe this is the reason there is no data in your target directory.
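Putting the two steps together, the full sequence (same connection details as in the question) would be roughly:
# 1. Create the saved job (note the space after "--").
sqoop job --create EveryDayImport \
-- import --connect jdbc:mysql://localhost:3306/books \
--username root --table authors -m 1 \
--target-dir /home/training/viresh/Sqoop/authors1234 \
--incremental append --check-column id --last-value 0

# 2. Confirm it exists, then execute it; only the exec step actually writes to the target dir.
sqoop job --list
sqoop job --exec EveryDayImport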

Incremental update in Hive table using sqoop

I have a table in Oracle with only 4 columns:
Memberid --- bigint
uuid --- String
insertdate --- date
updatedate --- date
I want to import that data into a Hive table using sqoop. I created the corresponding Hive table with:
create EXTERNAL TABLE memberimport(memberid BIGINT,uuid varchar(36),insertdate timestamp,updatedate timestamp)LOCATION '/user/import/memberimport';
and the sqoop command:
sqoop import --connect jdbc:oracle:thin:@dbURL:1521/dbName --username ** --password *** --hive-import --table MEMBER --columns 'MEMBERID,UUID,INSERTDATE,UPDATEDATE' --map-column-hive MEMBERID=BIGINT,UUID=STRING,INSERTDATE=TIMESTAMP,UPDATEDATE=TIMESTAMP --hive-table memberimport -m 1
It works properly and imports the data into the Hive table.
Now I want to update this table incrementally based on updatedate (last value being today's date), so that I get the day-to-day updates from that OLTP table into my Hive table using sqoop.
For the incremental import I am using the following sqoop command:
sqoop import --hive-import --connect jdbc:oracle:thin:@dbURL:1521/dbName --username *** --password *** --table MEMBER --check-column UPDATEDATE --incremental append --columns 'MEMBERID,UUID,INSERTDATE,UPDATEDATE' --map-column-hive MEMBERID=BIGINT,UUID=STRING,INSERTDATE=TIMESTAMP,UPDATEDATE=TIMESTAMP --hive-table memberimport -m 1
But I am getting the exception:
"Append mode for hive imports is not yet supported. Please remove the parameter --append-mode"
When I remove --hive-import it runs properly, but then I do not find the new updates in the Hive table that I have in the OLTP table.
Am I doing anything wrong?
Please suggest how I can run an incremental update from Oracle to Hive using sqoop.
Any help will be appreciated.
Thanks in Advance ...
Although I don't have the resources to replicate your scenario exactly, you might want to try building a sqoop job and testing your use case:
sqoop job --create sqoop_job \
-- import \
--connect "jdbc:oracle://server:port/dbname" \
--username=(XXXX) \
--password=(YYYY) \
--table (TableName) \
--target-dir (Hive Directory corresponding to the table) \
--append \
--fields-terminated-by '(character)' \
--lines-terminated-by '\n' \
--check-column "(Column To Monitor Change)" \
--incremental append \
--last-value (last value of column being monitored) \
--outdir (log directory)
When you create a sqoop job, it takes care of --last-value for subsequent runs. Also, here I have used the Hive table's data location as the target for the incremental update.
Hope this provides a helpful direction to proceed.
There is no direct way to achieve this in Sqoop. However, you can use the four-step strategy for incremental updates in Hive.

Sqoop Incremental import and update

I am trying to import data from an Oracle database into Hive. The goal is to bring the changes in the Oracle database into Hive using sqoop import. The sqoop command is as follows:
sqoop import -D mapred.child.java.opts='\-Djava.security.egd=file:/dev/../dev/urandom' \
--connect 'jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(LOAD_BALANCE=ON)(FAILOVER=ON)(ADDRESS=(PROTOCOL=TCP)(HOST=)(PORT=1545))(ADDRESS=(PROTOCOL=TCP)(HOST=)(PORT=1545)))(CONNECT_DATA=(SERVICE_NAME=)(SERVER=DEDICATED)))' \
--username abcde \
--password 1234rgtds \
--table Customer_Acc \
--columns Name,ID,Address,Date_booking,Last_update_date \
-m 1 \
--target-dir /final/table \
--hive-import \
--hive-table tesupd \
--map-column-hive Name,ID,Address,Date_booking \
--null-string '\\N' \
--null-non-string '\\N' \
--hive-delims-replacement ' ' \
--incremental lastmodified \
--check-column Last_update_date \
--last-value "2009-12-31 12:14:28"
The final output should be the data greater than the last value, but in the above case it is appending the data instead of incrementally updating it.
I want the data to be updated rather than appended.
Use the --merge-key option in your sqoop import command. This will replace the older records with the latest records.
Alternatively, you can use the sqoop merge command, but that is done in two steps: first a sqoop import without --merge-key, and then a sqoop merge (a sketch follows below).
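A rough sketch of that two-step variant; the directories, connection details and password file are placeholders, and the jar/class come from sqoop's codegen for the table (or an explicit sqoop codegen run).
# Step 1: incremental import of the changed rows into a separate delta directory.
sqoop import --connect jdbc:oracle:thin:@dbhost:1545/service \
--username abcde --password-file /user/me/ora.password \
--table Customer_Acc --incremental lastmodified --check-column Last_update_date \
--last-value "2009-12-31 12:14:28" --target-dir /final/table_delta -m 1

# Step 2: merge the delta onto the existing data, keeping the newest row per ID.
sqoop merge \
--new-data /final/table_delta \
--onto /final/table \
--target-dir /final/table_merged \
--jar-file Customer_Acc.jar \
--class-name Customer_Acc \
--merge-key ID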
Try using --incremental append rather than --incremental lastmodified.
With --incremental append, the last value of the specified field is stored in the sqoop metastore as 'incremental.last.value' and is updated whenever the job is executed. You do not have to update the last value in your command yourself; it is updated automatically.
This way the value is always kept up to date (in the sqoop metastore) and there is no redundant data.
Sqoop imports cannot directly update existing data in Hive. Please follow the steps in the link below for row-level updates.
http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
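In outline, that strategy imports the delta into its own location and then reconciles it with the base data in Hive. A rough sketch of the reconcile and compaction steps, using assumed table names (customer_base over the existing data, customer_incremental over the newly imported delta) and the ID / Last_update_date columns from the question:
hive -e "
-- keep, for each ID, only the row with the latest Last_update_date
CREATE VIEW IF NOT EXISTS reconcile_view AS
SELECT t1.*
FROM (SELECT * FROM customer_base UNION ALL SELECT * FROM customer_incremental) t1
JOIN (SELECT id, MAX(last_update_date) AS max_date
      FROM (SELECT * FROM customer_base UNION ALL SELECT * FROM customer_incremental) t2
      GROUP BY id) s
ON t1.id = s.id AND t1.last_update_date = s.max_date;

-- materialize the reconciled data as the new reporting table (compaction step)
DROP TABLE IF EXISTS customer_reporting;
CREATE TABLE customer_reporting AS SELECT * FROM reconcile_view;
"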
So your data is mutable and you'd like to modify, in HDFS, the records which have been changed in your DB.
For this, you need the --incremental lastmodified mode together with --merge-key. You also need to create a Sqoop job, because the job captures the most recent --last-value and serializes it back into the saved job.
You should create a Sqoop job which looks something like this:
sqoop job \
--create jobName \
-- \
import \
--connect jdbc:oracle:thin:@hostname:port:sid \
--username user \
--password-file fileOnHDFS.password \
--table tableName \
--incremental lastmodified \
--check-column UPDATE_DATUM \
--last-value 1985.01.01 \
--merge-key ID
More details can be found in the Sqoop User Guide (v1.4.6)
