incremental load using sqoop from mysql to hive - hadoop

I am new to Sqoop and Hive. Please help me understand the following issues.
The row counts of the MySQL and Hive tables are different:
MySQL has 51 rows (the table has a primary key and no duplicates) and Hive has 38 rows, on the very first run.
sqoop job --create mmod -- import \
--connect "jdbc:mysql://cxln2.c.thelab-240901.internal:3306/retail_db" \
--username sqoopuser --password-file /tmp/.mysql-pass.txt \
--table mod --compression-codec org.apache.hadoop.io.compress.BZip2Codec \
--hive-import --hive-database encry --hive-table mod2 --hive-overwrite \
--check-column last_update_date --incremental lastmodified --merge-key id \
--last-value 0 --target-dir /user/user_name/append1sqopp
It is not creating the target dir in the given location; instead it is creating it in the warehouse location.
I am trying to schedule a Sqoop incremental job, but I am making a mistake somewhere.
Command: the command above.
2.1 New rows are added with the same date.
2.2 A few rows are deleted and updated.
Output:
No new updates on the given table.
It is not updating the last value in the Sqoop job (see the sketch after this list).
How do I choose the merge-key column in Sqoop?
WHERE condition in Sqoop:
--query "select * from reason where id>20 AND $CONDITIONS"
What is the use of $CONDITIONS, and do we need to pass it as a variable in Linux?
Is it possible to track rejected rows in a Sqoop job?
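For reference, a minimal sketch of how a saved job such as mmod is usually run and inspected (job name taken from the command above); the stored last value only advances after a successful run through sqoop job --exec:
sqoop job --exec mmod
sqoop job --show mmod
The --show output prints the saved job definition, including the stored incremental last value, which is a quick way to check whether it is actually moving between runs.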

Related

Sqoop Import Command Directories

Env: CDH
Tool: Sqoop
Version: Sqoop 1.4.6-cdh5.8.0
Objective: Import a table from a MySQL database
1. Create a Hive table with a subset of the source data (e.g. order_status = 'CLOSED')
2. Reimport more data into the same directory using order_status not in ('CLOSED')
Results:
1. Objective 1 complete using the command
sqoop import --connect jdbc:mysql://xxx:000/xxxx_db
--username=xxxx_dba --P
--warehouse-dir=/user/hive/warehouse/hex.db/
-m 1
--table orders --compression-codec=snappy
--hive-import --as-textfile --create-hive-table
--hive-table closed_orders
--hive-overwrite
--where "order_status='CLOSED'"
--compress
--columns "order_id, order_customer_id, order_status"
This creates the directory /user/hive/warehouse/hex.db/closed_orders with a data file, and a Hive table containing the "CLOSED" orders.
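As a quick sanity check (hive CLI assumed to be on the path; add a database prefix if the table was created outside the default database), the new table can be queried with something like:
hive -e "select order_id, order_status from closed_orders limit 5"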
I am trying to re-import more data, this time with order_status not in ('CLOSED'),
this time not creating a Hive table and just importing the order_status != 'CLOSED' rows into a different directory (open_orders).
Issue: it creates the directory /user/hive/warehouse/hex.db/open_orders/orders/.
2.a How can I import the file into the directory /user/hive/warehouse/hex.db/open_orders?
2.b How can we import the subset of data with order_status != 'CLOSED', i.e. open orders, into the same directory created in step 1, i.e. /user/hive/warehouse/hex.db/closed_orders?
Command used for step 2:
sqoop import --connect jdbc:mysql://xxxx:0000/retail_db
--username=xxxx_dba --P --warehouse-dir=/user/hive/warehouse/hex.db/ -m 1
--table orders --compression-codec=snappy --hive-import
--as-textfile --hive-table open_orders
--where "order_status not in ('CLOSED')"
--compress --columns "order_id, order_customer_id, order_status"
2.3 Error with --append when I try to import the open orders into the directory created in step 1, /user/hive/warehouse/hex.db/closed_orders:
17/04/15 14:24:22 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
17/04/15 14:24:22 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
Append mode for hive imports is not yet supported. Please remove the parameter --append-mode
When you are doing #2, i.e. reimporting data, use
--target-dir /user/hive/warehouse/hex.db/open_orders
instead of
--warehouse-dir
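A sketch of the step 2 command with that change applied (connection details and placeholders exactly as in the question):
sqoop import --connect jdbc:mysql://xxxx:0000/retail_db
--username=xxxx_dba --P -m 1
--table orders --compression-codec=snappy --hive-import
--as-textfile --hive-table open_orders
--where "order_status not in ('CLOSED')"
--compress --columns "order_id, order_customer_id, order_status"
--target-dir /user/hive/warehouse/hex.db/open_orders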

incremental "lastmodified" not working in sqoop

I'm trying to use Sqoop to perform an incremental import from a Teradata DB to Hive. Below is the command:
sqoop import --connect jdbc:teradata://xxx.xxx.x.xx/DATABASE=DBN --driver com.teradata.jdbc.TeraDriver --username userN --password pass --query "SELECT alias.colA, alias.call_date, alias.colB, alias.colC FROM tableName alias where \$CONDITIONS" --target-dir /apps/hive/warehouse/staging.db/tableName -m 26 --check-column call_date --incremental append --split-by alias.colA --last-value '2016-02-01'
The column call_date is of DATE type, values in the format 'YYYY-MM-DD'.
When I use 'append' for --incremental, everything works fine. But when I put 'lastmodified', the following error is thrown:
ERROR util.SqlTypeMap: It seems like you are looking up a column that does not
ERROR util.SqlTypeMap: exist in the table. Please ensure that you've specified
ERROR util.SqlTypeMap: correct column names in Sqoop options.
ERROR tool.ImportTool: Imported Failed: column not found: call_date
I'm using Sqoop 1.4.4.2.1 on HDP 2.1, while the Teradata DB is 14.10.
Any pointers will be helpful.
I think, in the case of a query, you can perform the last-value check in the query itself, something like this:
"SELECT alias.colA, alias.call_date, alias.colB, alias.colC FROM tableName alias where call_date > '2016-02-01' and \$CONDITIONS"
Reference (see the section "Incrementally Updating Data in Hive" > "1. Ingest the data"):
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/incrementally-updating-hive-table-with-sqoop-and-ext-table.html
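Put together with the command from the question, the incremental pull would then look roughly like this (connection details and placeholders as in the question; the date bound inside the query replaces --last-value, so it has to be updated between runs):
sqoop import --connect jdbc:teradata://xxx.xxx.x.xx/DATABASE=DBN --driver com.teradata.jdbc.TeraDriver --username userN --password pass --query "SELECT alias.colA, alias.call_date, alias.colB, alias.colC FROM tableName alias where call_date > '2016-02-01' and \$CONDITIONS" --target-dir /apps/hive/warehouse/staging.db/tableName -m 26 --split-by alias.colA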

Sqoop incremental lastmodified

I have an accounts table in a MySQL DB.
It has around 19654 records. I used Sqoop to import the table data into HDFS. It created four files in HDFS with the data evenly distributed.
Then I executed the SQL statement below on the DB:
update accounts set modified = now() where acct_num in (1,2,3,4) ;
Then I executed the Sqoop command below:
sqoop import --table accounts --connect jdbc:mysql://localhost/loudacre
--username training --password training
--incremental lastmodified
--check-column modified --last-value '2014-03-18 13:29:47.0'
--merge-key acct_num --target-dir /accounts/
After the above completed, it created only one file with only around 10 entries. It does not even include the new timestamp value.
I was just trying to update the rows which have the new timestamp. Can anyone help?
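For what it is worth, a quick way to see exactly what the merge job wrote into the target directory (path as in the question):
hdfs dfs -ls /accounts/
hdfs dfs -cat /accounts/part-* | wc -l
The first command lists the files produced; the second counts the rows that were actually written.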

Incremental update in HIVE table using sqoop

I have a table in oracle with only 4 columns...
Memberid --- bigint
uuid --- String
insertdate --- date
updatedate --- date
I want to import that data into a Hive table using Sqoop. I created the corresponding Hive table with
create EXTERNAL TABLE memberimport(memberid BIGINT, uuid varchar(36), insertdate timestamp, updatedate timestamp) LOCATION '/user/import/memberimport';
and the Sqoop command
sqoop import --connect jdbc:oracle:thin:@dbURL:1521/dbName --username ** --password *** --hive-import --table MEMBER --columns 'MEMBERID,UUID,INSERTDATE,UPDATEDATE' --map-column-hive MEMBERID=BIGINT,UUID=STRING,INSERTDATE=TIMESTAMP,UPDATEDATE=TIMESTAMP --hive-table memberimport -m 1
It works properly and is able to import data into the Hive table.
Now I want to update this table incrementally using updatedate (last value = today's date) so that I can get the day-to-day updates from that OLTP table into my Hive table using Sqoop.
For the incremental import I am using the following Sqoop command
sqoop import --hive-import --connect jdbc:oracle:thin:@dbURL:1521/dbName --username *** --password *** --table MEMBER --check-column UPDATEDATE --incremental append --columns 'MEMBERID,UUID,INSERTDATE,UPDATEDATE' --map-column-hive MEMBERID=BIGINT,UUID=STRING,INSERTDATE=TIMESTAMP,UPDATEDATE=TIMESTAMP --hive-table memberimport -m 1
But I am getting the exception
"Append mode for hive imports is not yet supported. Please remove the parameter --append-mode"
When I remove --hive-import it runs properly, but I do not find those new updates in the Hive table that I have in the OLTP table.
Am I doing anything wrong?
Please suggest how I can run an incremental update from Oracle to Hive using Sqoop.
Any help will be appreciated.
Thanks in advance.
Although I don't have the resources to replicate your scenario exactly, you might want to try building a Sqoop job and testing your use case:
sqoop job --create sqoop_job \
-- import \
--connect "jdbc:oracle://server:port/dbname" \
--username=(XXXX) \
--password=(YYYY) \
--table (TableName) \
--target-dir (Hive Directory corresponding to the table) \
--append \
--fields-terminated-by '(character)' \
--lines-terminated-by '\n' \
--check-column "(Column To Monitor Change)" \
--incremental append \
--last-value (last value of column being monitored) \
--outdir (log directory)
When you create a Sqoop job, it takes care of --last-value for subsequent runs. Also, here I have used the Hive table's data directory as the target for the incremental update.
Hope this provides a helpful direction to proceed.
There is no direct way to achieve this in Sqoop. However, you can use a four-step strategy (sketched below).
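For context, this usually refers to the ingest / reconcile / compact / purge pattern described in the Hortonworks guide linked in an earlier answer. A rough sketch of just the reconcile step, assuming hypothetical tables base_table (the full history) and incremental_table (an external table over the Sqoop target directory holding the newly appended rows), with the column names from the question:
hive -e "
CREATE VIEW reconcile_view AS
SELECT memberid, uuid, insertdate, updatedate FROM (
  SELECT memberid, uuid, insertdate, updatedate,
         ROW_NUMBER() OVER (PARTITION BY memberid ORDER BY updatedate DESC) AS rn
  FROM (SELECT * FROM base_table
        UNION ALL
        SELECT * FROM incremental_table) merged
) ranked
WHERE rn = 1;
"
The compact step then materializes this view into a new table that replaces base_table, and the purge step empties incremental_table before the next Sqoop run.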

Sqoop job incremental import using free form query

I am trying to do a Sqoop job incremental import using a free-form query. Here's the command being used:
sqoop job --create importjobinl -- import --connect jdbc:mysql://localhost/test --username training --password training --query 'select id,name,unix_timestamp(time_updated) from intest where $CONDITIONS' --target-dir /user/new/lll/`date +%d%T|sed 's/://g'` -m 1 --check-column time_updated --incremental append --last-value '1441526438'
The job is not getting created. It shows:
Incremental imports require a table.
Try --help for usage instructions.
It works when I use --table intest instead of --query, but I want to use --query to convert the date to epoch time using unix_timestamp, since the value in the MySQL table intest is in yyyy-mm-dd format.
Version used: Sqoop 1.2.0-cdh3u0
Sqoop incremental imports for free-form queries were added in Sqoop 1.4.2.
JIRA link: Sqoop Incremental import Support for free form queries
Since you are using Sqoop 1.2.0, this feature might not be available for you to use.
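A quick way to confirm which Sqoop build is actually on the path before relying on that feature:
sqoop version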
Do an initial pull using Sqoop.
Make sure the date format of your column is YYYY-MM-DD HH:MM:SS if you are using the last-modified column as a date.
Run the statement below for an incremental load into your Hive table, which includes a free-form query:
sqoop import --connect jdbc:mysql://localhost/test --username training --password training --query "select * from intest where \$CONDITIONS" --hive-import --hive-table db_name_x.table_name_x --incremental lastmodified --check-column date_x --target-dir /user/xyz -m 1
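If the installed Sqoop does support incremental free-form imports (1.4.2+ per the other answer), the same import can also be wrapped in a saved job, following the pattern in the question, so that the metastore keeps track of the last value between runs. A rough sketch with a hypothetical job name intest_incr:
sqoop job --create intest_incr -- import --connect jdbc:mysql://localhost/test --username training --password training --query "select * from intest where \$CONDITIONS" --hive-import --hive-table db_name_x.table_name_x --incremental lastmodified --check-column date_x --target-dir /user/xyz -m 1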
