last-value in sqoop (incremental import) - hadoop

sqoop import --connect jdbc:mysql://localhost:3306/ydb \
  --table yloc --username root -P \
  --check-column rank --incremental append --last-value
We don't know the last value of the previous table. How can I write the query?

You can try 2 approaches to solve this.
1) Query the table that already holds the imported data and get the maximum value of the check column, then use it as --last-value (see the sketch below).
2) Create a job in Sqoop and set the column as the incremental one; moving forward, your job will run on an incremental basis.
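For approach 1, here is a minimal sketch, assuming the previously imported data is already queryable as a Hive table (the Hive table name yloc_hive and the target directory are assumptions):

# Read the highest value of the check column already present on the Hadoop side
LAST_VAL=$(hive -S -e 'SELECT MAX(`rank`) FROM yloc_hive')

# Feed it to the incremental import so only newer rows are pulled
sqoop import \
  --connect jdbc:mysql://localhost:3306/ydb \
  --table yloc --username root -P \
  --check-column rank --incremental append \
  --last-value "$LAST_VAL" \
  --target-dir /user/data/yloc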

Go to your home directory:
cd .sqoop
Open the file metastore.db.script using vi or your favourite editor and search for incremental.last.value.
It should be something like:
INSERT INTO SQOOP_SESSIONS VALUES('incimpjob','incremental.last.value','2018-09-11 19:20:52.0','SqoopOptions')
Note: I am assuming that you have created a Sqoop job. 'incimpjob' is the name of my Sqoop job.
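A quick way to pull that value out from the shell (assuming the default metastore location under your home directory and the job name incimpjob used above):

grep 'incremental.last.value' ~/.sqoop/metastore.db.script | grep 'incimpjob'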

Related

How to use sqoop validation?

Can you please help me with the below points?
I have an Oracle database with a huge number of records today - suppose 5 TB of data - so we can use the Sqoop validator framework: it will validate and import into HDFS.
Then, suppose tomorrow I receive new records on top of the above data. How can I import only those new records into the existing directory, and validate them using the Sqoop validator framework?
In short: how do I use the Sqoop validator when new records arrive and need to be imported into HDFS?
Thank you,
Sipra
My understanding is that you need to validate the Oracle database for new records before you start your delta process. I don't think you can validate based on the size of the records, but if you have an offset or a TS (timestamp) column, that will be helpful for validation.
How do I know if there are new records in Oracle since the last run/job/check?
You can do this with two Sqoop import approaches; below are examples and explanations for both.
sqoop incremental
Following is an example of a Sqoop incremental import:
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --check-column rDate --incremental lastmodified --last-value 2014-01-25 --target-dir yloc/loc
This link explains it: https://www.tutorialspoint.com/sqoop/sqoop_import.html
sqoop import using query option
Here you basically use a WHERE condition in the query and pull only the data whose date or offset column is greater than the last received value.
Here is the syntax for it:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba --password cloudera \
--query 'select * from sample_data where $CONDITIONS AND salary > 1000' \
--split-by salary \
--target-dir hdfs://quickstart.cloudera/user/cloudera/sqoop_new
Isolate the validation and import job
If you want to run the validation and import jobs independently, you have another utility in Sqoop: sqoop eval. With this you can run a query on the RDBMS and point the output to a file or to a variable in your code, and use that for validation purposes as you want.
Syntax:
$ sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
--query "SELECT * FROM employee LIMIT 3"
Explained here: https://www.tutorialspoint.com/sqoop/sqoop_eval.htm
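For example, to capture the eval output for a validation step, a sketch (the table name sample_data and the offset value 1000 are assumptions):

# sqoop eval prints the query result as a small ASCII table on stdout;
# redirect it to a file and inspect it before kicking off the delta import
sqoop eval \
  --connect jdbc:mysql://localhost/db \
  --username root -P \
  --query "SELECT COUNT(*) FROM sample_data WHERE id > 1000" \
  > /tmp/new_record_count.txt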
validation parameter in sqoop
You can use this parameter to validate that the row counts match between what is imported/exported and the source RDBMS table:
--validate
More on that: https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#validation
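For example, a basic count validation on a plain table import could look like this (a sketch; note that --validate has documented limitations, for instance it does not cover incremental imports, so this is shown without the incremental flags):

sqoop import \
  --connect jdbc:mysql://localhost:3306/ydb \
  --table yloc --username root -P \
  --target-dir /user/data/yloc_full \
  --validate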

How to take updated records along with incremental import from RDBMS table to a Hive table?

I'm working with Sqoop incremental import, taking the data every day into my Hive table. I have the following scenario:
I have an RDBMS table empdata with columns:
id name city
1 Sid Amsterdam
2 Bob Delhi
3 Sun Dubai
4 Rob London
I am importing the data into Hive using Sqoop incremental import, through a cron job that runs a shell script to do the work.
#!/bin/bash
DATE=$(date +"%d-%m-%y")
while IFS=":" read -r server dbname tablename; do
  sqoop import --connect jdbc:mysql://$server/$dbname --table $tablename \
    --username root --password cloudera \
    --hive-import --hive-table dynpart \
    --hive-partition-key 'thisday' --hive-partition-value $DATE \
    --target-dir '/user/hive/newimp5' \
    --incremental append --check-column id \
    --last-value $(hive -e "select max(id) from $tablename");
done < /home/cloudera/Desktop/MyScripts/tables.txt
The above script works fine for the incremental load. But now I have another requirement: to check whether there are any updates to previous records. For example, if the record
4 Rob London
is updated to
4 Rob NewYork
I need to take that updated record along with the incremental import, but only the updated value should be present in the Hive table, so that I don't end up with duplicates. Could anyone tell me how I can accomplish this?
In Sqoop you cannot use 2 columns in --check-column, and even if you could (you can combine 2 fields in --check-column, see e.g. Sqoop Incremental Import multiple columns in check-column), you still cannot be sure whether the city will have a higher or lower value next time, so you cannot really use the city field as a check column. You now have the following options:
1) In your RDBMS, create a new table that has an additional field of type timestamp which is set automatically, so that every insert or update carries the current timestamp. Then, after your incremental append, import this table again using --incremental lastmodified --check-column ts_field --last-value ..., and also use --merge-key id in the sqoop import so that it can merge the updates on the basis of id (see the first sketch after this list).
2) a) First run your sqoop import with --check-column id --incremental append --last-value ...
b) Then run the sqoop import again without --incremental, with the target dir set to some temporary folder.
c) Then use sqoop merge to merge the two datasets (the target dirs from steps a and b), with the new data being the target dir of step a merged onto the target dir of step b, and --merge-key id (see the second sketch after this list).
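A sketch of option 1 (the column name ts_field and the example last value are assumptions; the connection details reuse the variables from your script):

sqoop import \
  --connect jdbc:mysql://$server/$dbname \
  --table $tablename --username root --password cloudera \
  --target-dir /user/hive/newimp5 \
  --incremental lastmodified \
  --check-column ts_field \
  --last-value "2017-01-01 00:00:00" \
  --merge-key id

And a sketch of step c in option 2. sqoop merge needs the record class generated during the import (sqoop codegen can recreate it); the directory names and the jar/class names here are placeholders:

sqoop merge \
  --new-data /user/hive/step_a_dir \
  --onto /user/hive/step_b_dir \
  --target-dir /user/hive/merged_dir \
  --jar-file /tmp/sqoop-gen/$tablename.jar \
  --class-name $tablename \
  --merge-key id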
Please let me know if you have any further questions.

What is the relevance of -m 1?

I am executing the below sqoop command:
sqoop import --connect 'jdbc:sqlserver://10.xxx.xxx.xx:1435;database=RRAM_Temp' --username DRRM_DATALOADER --password ****** --table T_VND --hive-import --hive-table amitesh_db.amit_hive_test --as-textfile --target-dir amitesh_test_hive -m 1
I have two queries:
1) What is the relevance of -m 1? As far as I know, it is the number of mappers that I am assigning to the Sqoop job. If that is true, then the moment I assign -m 2, the execution starts throwing the error below:
ERROR tool.ImportTool: Error during import: No primary key could be found for table xxx. Please specify one with --split-by or perform a sequential import with '-m 1'
Now I am forced to revise my understanding: I see it has something to do with the database primary key. Can somebody explain the logic behind this?
2) I have asked the above sqoop command to save the file in text file format. But when I go to the location suggested by the execution, I find tbl_name.jar. Why? If --as-textfile is the wrong syntax, then what is the right one? Or is there another location where I can find the file?
1) To have -m or --num-mappers set to a value greater than 1, the table must either have a PRIMARY KEY or the sqoop command must be given a --split-by column (see the sketch after point 2). Controlling Parallelism explains the logic behind this.
2) The file format of the data imported into the Hive table amit_hive_test will be plain text (--as-textfile). As this is a --hive-import, the data is first imported into --target-dir and then loaded (LOAD DATA INPATH) into the Hive table. The resultant data will be inside the table's LOCATION and not in --target-dir.
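For point 1, a sketch of running with more mappers by supplying a split column (VND_ID is a hypothetical column of T_VND; pick any reasonably well-distributed column):

sqoop import \
  --connect 'jdbc:sqlserver://10.xxx.xxx.xx:1435;database=RRAM_Temp' \
  --username DRRM_DATALOADER --password ****** \
  --table T_VND \
  --hive-import --hive-table amitesh_db.amit_hive_test \
  --as-textfile --target-dir amitesh_test_hive \
  --split-by VND_ID \
  -m 2

For point 2, you can confirm where the data actually landed by checking the Hive table's LOCATION:

hive -e "DESCRIBE FORMATTED amitesh_db.amit_hive_test;"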

Share sqoop incremental last value between two jobs

I have a sqoop job that records the incremental last value so it can do incremental appends throughout the day. My problem is that my directory changes each day so that we can create partitions based on log_date.
I need to record --last-value throughout the day and then pass that value into a newly created job for the next day. Is it possible to call a method to get the last value?
My current sqoop job, written in a shell script, looks like this:
sqoop job --create test_last_index \
-- import --connect jdbc:xxxx \
--password xxx \
--table test_$(date -d yesterday +%Y_%m_%d) \
--target-dir /dir/where/located \
--incremental append \
--check-column id \
--last-value 1
You need not call a method for the sqooping you are doing. All you need to do is create a Sqoop job and save it. Add the parameters --check-column, --incremental and --last-value to the Sqoop job that you create. The --last-value will be picked up with each consecutive run and retained in the job. Then you can use --exec to run the job periodically, and sqoop merge to merge the modified/appended data with the historical data. A sketch follows.
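A minimal sketch of that flow, reusing the job name from the question (the static table name test_table is an assumption; the metastore updates incremental.last.value after every run, so --last-value is only the starting point):

# Create the saved job once
sqoop job --create test_last_index \
  -- import --connect jdbc:xxxx \
  --password xxx \
  --table test_table \
  --target-dir /dir/where/located \
  --incremental append \
  --check-column id \
  --last-value 1

# Run it from cron whenever needed; no manual last-value handling required
sqoop job --exec test_last_index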
Hope this helps.
I have developed a Sqoop script for incremental import as follows:
sqoop import \
--driver com.sap.db.jdbc.Driver \
--fetch-size 3000 \
--connect connectionURL \
--username test \
--password test \
--table DATA \
--where "YEAR=2002" \
--check-column TIMESTAMP \
--incremental append \
--last-value "2016-06-22 12:31:37.0" \
--target-dir "/incremental_data_2002/year_partition=2002" \
--fields-terminated-by "," \
--lines-terminated-by "\n" \
--split-by YEAR \
-m 4
Now, the above script executes successfully.
In the above script I have hardcoded --last-value as "2016-06-22 12:31:37.0". When new data arrives in the source table in the RDBMS, I check the last value in the table and manually update the Sqoop script with it. Instead, what I want is to have --last-value set dynamically, without hardcoding it in the Sqoop script file.
Sadly, Sqoop does not retrieve the last value automatically for a plain import. From the Sqoop documentation:
At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information.
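So if you save the import above as a Sqoop job, the metastore tracks the value for you, and you can also read it back on the command line when you need it elsewhere (the job name inc_data_job is hypothetical):

# Run the saved job; it uses and then updates the stored last value
sqoop job --exec inc_data_job

# Inspect the value Sqoop will use on the next run
sqoop job --show inc_data_job | grep incremental.last.value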

Sqoop creating insert statements containing multiple records

We are trying to load data into Netezza using Sqoop export, and we are facing the following issue:
java.io.IOException: org.netezza.error.NzSQLException: ERROR:
Example Input dataset is as shown below:
1,2,3
1,3,4
The sqoop command is as shown below:
sqoop export --table <tablename> --export-dir <path> \
--input-fields-terminated-by '\t' --input-lines-terminated-by '\n' \
--connect 'jdbc:netezza://<host>/<db>' --driver org.netezza.Driver \
--username <username> --password <passwrd>
Sqoop is generating the insert statement in the following way:
insert into (c1,c2,c3) values (1,2,3),(1,3,4).
We are able to load one record, but when we try to load multiple records we get the error above.
Your help is highly appreciated.
Setting sqoop.export.records.per.statement=1 will definitely help, but this will make the export process extremely slow if your export record count is very large, say 5 million.
To solve this you need to add the following things:
1.) A properties file sqoop.properties; it must contain the property jdbc.transaction.isolation=TRANSACTION_READ_UNCOMMITTED (it avoids deadlocks during exports). In the export command you then need to specify:
--connection-param-file /path/to/sqoop.properties
2.) Also set sqoop.export.records.per.statement=100; this will increase the speed of the export.
3.) Third, add --batch to use batch mode for the underlying statement execution.
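So /path/to/sqoop.properties is just a plain Java properties file containing, for example:

jdbc.transaction.isolation=TRANSACTION_READ_UNCOMMITTED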
So your final export will look like this:
sqoop export -D sqoop.export.records.per.statement=100 --table <tablename> --export-dir <path> \
--input-fields-terminated-by '\t' --input-lines-terminated-by '\n' \
--connect 'jdbc:netezza://<host>/<db>' --driver org.netezza.Driver \
--username <username> --password <passwrd> \
--connection-param-file /path/to/sqoop.properties \
--batch
Hope this will help.
You can customise the number of rows that will be used in one insert statement with the property sqoop.export.records.per.statement. For example, for Netezza you will need to set it to 1:
sqoop export -Dsqoop.export.records.per.statement=1 --connect ...
I would also recommend taking a look at the Apache Sqoop Cookbook, where this and many other tips are described.
