Sqoop incremental lastmodified - sqoop

I am having a accounts table in mysql db.
it has around 19654 records.I used sqoop to import the table data in HDFS.It created four files in HDFS with data evenly distributed
then i executed below sql statement on DB
update accounts set modified = now() where acct_num in (1,2,3,4) ;
Then i executed below sqoop tool
sqoop import --table accounts --connect jdbc:mysql://localhost/loudacre
--username training --password training
--incremental lastmodified
--check-column modified --last-value '2014-03-18 13:29:47.0'
--merge-key acct_num --target-dir /accounts/
After the above completed it created only one file with around 10 entries only.Does not even include new timestamp value.
I was just trying to update the rows which have new timestamp. Can anyone help?

Related

incremental load using sqoop from mysql to hive

I am new to sqoop and hive . Please help me with understanding
The count of mysql and hive table are different
mysql is 51 rows (table has primary key and no duplicates ) ad hive is 38rows - first run itself
sqoop job --create mmod -- import --connect "jdbc:mysql://cxln2.c.thelab-240901.internal:3306/retail_db" --username sqoopuser --password-file
/tmp/.mysql-pass.txt --table mod --compression-codec org.apache.hadoop.io.compress.BZip2Codec --hive-import --hive-database encry --hive-table mod2 --h
ive-overwrite --check-column last_update_date --incremental lastmodified --merge-key id --last-value 0 --target-dir /user/user_name/append1sqo
pp
It is not creating target dir in given location , instead it creating in warehouse location
I am trying to schedule a sqoop incremental job , somehow I am doing mistake some where
command : above command
2.1 new rows are added with same date
2.2 delete and update on few rows
Output :
No new updates on given table .
It is not updating lastvalue in sqoop job
How to choose merge-key column in sqoop
Where condition in sqoop
--query "select * from reason where id>20 AND $CONDITIONS"
What is the use of $CONDITIONS and do we need to pass the variable in Linux
Is that possible to track rejected rows in sqoop job

How to use sqoop validation?

Can you please help me with the below points.
I have a oracle data base with huge no.of records today - suppose 5TB data, so we can use the vaildator sqoop framework- It will validate and import in the HDFS.
Then, Suppose tomorrow- i will receive the new records on top of the above TB data, so how can i import those new records (only new records to the existing directory) and validation by using the validator sqoop framework.
I have a requirement, how to use sqoop validator if new records arrives.
I need sqoop validatior framework used in new records arrives to be imported in HDFS.
Please help me team.Thanks.
Thank You,
Sipra
My understanding is that you need to validate the oracle database for new records before you start your delta process. I don’t think you can validate based on the size of the records. But if you have a offset or a TS column that will be helpful for validation.
how do I know if there is new records in oracle since last run/job/check ??
You can do this in two sqoop import approaches, following is the examples and explanation for both.
sqoop incremental
Following is an example for the sqoop incremental import
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --check-column rDate --incremental lastmodified --last-value 2014-01-25 --target-dir yloc/loc
This link explained it : https://www.tutorialspoint.com/sqoop/sqoop_import.html
sqoop import using query option
Here you basically use the where condition in the query and pull the data which is greater than the last received date or offset column.
Here is the syntax for it sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba --password cloudera \
--query 'select * from sample_data where $CONDITIONS AND salary > 1000' \
--split-by salary \
--target-dir hdfs://quickstart.cloudera/user/cloudera/sqoop_new
Isolate the validation and import job
If you want to run the validation and import job independently you have an other utility in sqoop which is sqoop eval, with this you can run the query on the rdbms and point the out put to the file or to a variable In your code and use that for validation purpose as you want.
Syntax :$ sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
--query “SELECT * FROM employee LIMIT 3”
Explained here : https://www.tutorialspoint.com/sqoop/sqoop_eval.htm
validation parameter in sqoop
You can use this parameter to validate the counts between what’s imported/exported between RDBMS and HDFS
—validate
More on that : https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#validation

Incrimental update in HIVE table using sqoop

I have a table in oracle with only 4 columns...
Memberid --- bigint
uuid --- String
insertdate --- date
updatedate --- date
I want to import those data in HIVE table using sqoop. I create corresponding HIVE table with
create EXTERNAL TABLE memberimport(memberid BIGINT,uuid varchar(36),insertdate timestamp,updatedate timestamp)LOCATION '/user/import/memberimport';
and sqoop command
sqoop import --connect jdbc:oracle:thin:#dbURL:1521/dbName --username ** --password *** --hive-import --table MEMBER --columns 'MEMBERID,UUID,INSERTDATE,UPDATEDATE' --map-column-hive MEMBERID=BIGINT,UUID=STRING,INSERTDATE=TIMESTAMP,UPDATEDATE=TIMESTAMP --hive-table memberimport -m 1
Its working properly and able to import data in HIVE table.
Now I want to update this table with incremental update with updatedate (last value today's date) so that I can get day to day update for that OLTP table into my HIVE table using sqoop.
For Incremental import I am using following sqoop command
sqoop import --hive-import --connect jdbc:oracle:thin:#dbURL:1521/dbName --username *** --password *** --table MEMBER --check-column UPDATEDATE --incremental append --columns 'MEMBERID,UUID,INSERTDATE,UPDATEDATE' --map-column-hive MEMBERID=BIGINT,UUID=STRING,INSERTDATE=TIMESTAMP,UPDATEDATE=TIMESTAMP --hive-table memberimport -m 1
But I am getting exception
"Append mode for hive imports is not yet supported. Please remove the parameter --append-mode"
When I remove the --hive-import it run properly but I did not found those new update in HIVE table that I have in OLTP table.
Am I doing anything wrong ?
Please suggest me how can I run incremental update with Oracle - Hive using sqoop.
Any help will be appropriated..
Thanks in Advance ...
Although i don't have resources to replicate your scenario exactly.
You might want to try building a sqoop job and test your use case.
sqoop job --create sqoop_job \
-- import \
--connect "jdbc:oracle://server:port/dbname" \
--username=(XXXX) \
--password=(YYYY) \
--table (TableName)\
--target-dir (Hive Directory corresponding to the table) \
--append \
--fields-terminated-by '(character)' \
--lines-terminated-by '\n' \
--check-column "(Column To Monitor Change)" \
--incremental append \
--last-value (last value of column being monitored) \
--outdir (log directory)
when you create a sqoop job, it takes care of --last-value for subsequent runs. Also here i have used the Hive table's data file as target for incremental update.
Hope this provides a helpful direction to proceed.
There is no direct way to achieve this in Sqoop. However you can use 4 Step Strategy.

Share sqoop incremental last value between two jobs

I have a sqoop job that records incremental last value to do incremental appends through out the day. My problem is that my directory changes each day so we can create partitions based on log_date.
I need to record --last-value through out the day. Then I need to pass that value into a newly created job for the next day. Is it possible to call a method to get last-value?
My current sqoop job looks like this written in a shell script.
sqoop job --create test_last_index \
-- import --connect jdbc:xxxx \
--password xxx \
--table test_$(date -d yesterday +%Y_%m_%d) \
--target-dir /dir/where/located \
--incremental append \
--check-column id
--last-value 1
You need not call a method for the sqooping that you are doing. All you need to do is create a sqoop job and save it. Add the paramenters --check-column , --incremental and --last-value in the sqoop job that you create. The --last-value will be picked up with each consecutive run and will be retained in the job. Then you can use a --exec command to run the job periodically and also sqoop merge to merge the modified/appended data with the historical data.
Hope this helps.
I have developed sqoop script for Incremental Import as follows.
sqoop import
--driver com.sap.db.jdbc.Driver
--fetch-size 3000
--connect connectionURL
--username test
--password test
--table DATA
--where YEAR=2002
--check-column TIMESTAMP
--incremental append
--last-value "2016-06-22 12:31:37.0"
--target-dir "/incremental_data_2002/year_partition=2002"
--fields-terminated-by ","
--lines-terminated-by "\n"
--split-by YEAR
--m 4
Now, the above script is getting executed successfully.
In the above script i have hardcoded the --last-value as "2016-06-22 12:31:37.0". when new data comes to source table in RDBMS again i am checking the last-value in the table and modifying the sqoop script manually with the value. Instead of that what i wanted here is that i need to have --last-value dynamically without hardcoding in sqoop script file.
Sadly, Sqoop is not incorporating an automatic last value retrieving.
In the sqoop documentation
You should use:
At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information.

Configuring Sqoop with Mysql?

I have successfully installed SQOOP now the problem is that how to implement it with RDBMS and how to load data from RDBMS to HDFS using SQOOP.
By Using Sqoop You can Load Data directly to Hive Tables or Store the data in Some target Directory in HDFS
If you Need to copy data from RDBMS into Some directory
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password {In case no password Do not Specify it}
--table tableName
--col column_name(s) {In case you need to call specific columns}
--target-dir '/tmp/myfolder'
--boundary-query 'Select min,max from table name'
--m 5 {set number of mappers to 5}
--fields-terminated-by ',' {how do you want your data to look in target file}
Boundary Query : This is something you can specify. If you do not specify this , then by default this is run in as an inner query which adds up to a complex query.
If you specify this explicitly then this runs as a normal query and hence the performance is increased.
Also you may want to restrict the number of observation ,say based on column ID, and suppose you need data from ID 1 to 1000. Then using Boundary condition and split-by you will be able to restrict your import data.
--boundary-query "select 0,1000 from employee'
--split-by ID
Split-By : You use Split by on a Sqoop import to specify the column on basis of which split is required. By default,if you do not specify this, sqoop pics up table's primary key as the Split_by column.
Split By picks up data from tables and stores them in different folders based on number of mappers. By Default Number of Mappers are 4.
This may seem unwanted but in case you have a composite primary key or no primary key at all, then sqoop fails to pick up data and may error out.
Note: You may not face any issue if you set the number of mappers to 1. In this case, no split by condition is used since there is only one mapper. So query runs fine. This can be done using
--m 1
If you Need to copy data from RDBMS into Hive Table
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password {In case no password Do not Specify it}
--table tableName
--boundary-query 'Select min,max from table name'
--m 5 {set number of mappers to 5}
--hive-import
--hive-table serviceorderdb.productinfo
--m 1
Running a query instead of calling entire table itself
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password
--query 'select name from employees where name like '%s' and $CONDITIONS'
--m 5 {set number of mappers to 5}
--target-dir '/tmp/myfolder'
--fields-terminated-by ',' {how do you want your data to look in target file}
You may see $conditions as extra parameter $CONDITIONS. This is because this time you specified no table and specified a query explicity. When Sqoop runs, it searches for a boundary conditions, which it does not find. Then It Searches for a table and a primary key for applying boundary query which again it will not find. Hence we use $CONDITIONS to explicitly specify that we are not using a query and use default boundry condition from query result.
Checking if your connection is set up properly : For this you can just call list databases and if the you see your data populated then your connection is fine.
$ sqoop list-databases
--connect jdbc:mysql://localhost/
--username root
--password pwd
Connection String for Different Databases :
MYSQL: jdbc:mysql://<hostname>:<port>/<dbname>
jdbc:mysql://127.0.0.1:3306/test_database
Oracle :#//host_name:port_number/service_name
jdbc:oracle:thin:scott/tiger#//myhost:1521/myservicename
You may learn more about sqoop imports from : https://sqoop.apache.org/docs/1.4.1-incubating/SqoopUserGuide.html
By using sqoop import command you can import data from RDBMS to HDFS, Hive and HBase
sqoop import --connect jdbc:mysql://localhost:portnumber/DBName --username root --table emp --password root -m1
By using this command data will be stored in HDFS.
Sample commands to run sqoop import (load data from RDBMS to HDFS):
Postgres
sqoop import --connect jdbc:postgresql://postgresHost/databaseName
--username username --password 123 --table tableName
MySQL
sqoop import --connect jdbc:mysql://mysqlHost/databaseName --username username --password 123 --table tableName
Oracle*
sqoop import --connect jdbc:oracle:thin:#oracleHost:1521/databaseName --username USERNAME --password 123 --table TABLENAME
SQL Server
sqoop import --connect 'jdbc:sqlserver://sqlserverhost:1433;database=dbname;username=<username>;password=<password>' --table tableName
*Sqoop won't find any columns from a table if you don't specify both the username and the table in correct case. Usually, specifying both in uppercase will resolve the issue.
Read the Sqoop User's Guide: https://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html
I also recommend the Apache Sqoop Cookbook. You will learn how to use import and export tools, do incremental import jobs, save jobs, solve problems with jdbc drivers and much more. http://shop.oreilly.com/product/0636920029519.do

Resources