Sqoop Import Command Directories - sqoop

Env: CDH
Tool: Sqoop
Version: Sqoop 1.4.6-cdh5.8.0
Objective: Import a table from a MySQL database
Create a hive table with a subset of the source data (e.g., order_status = 'CLOSED')
Re-import more data into the same directory using order_status not in ('CLOSED')
Results:
1. Objective 1 completed using the command:
sqoop import --connect jdbc:mysql://xxx:000/xxxx_db
--username=xxxx_dba --P
--warehouse-dir=/user/hive/warehouse/hex.db/
-m 1
--table orders --compression-codec=snappy
--hive-import --as-textfile --create-hive-table
--hive-table closed_orders
--hive-overwrite
--where "order_status='CLOSED'"
--compress
--columns "order_id, order_customer_id, order_status"
This creates the directory /user/hive/warehouse/hex.db/closed_orders with a data file and a hive table with "CLOSED" orders.
I am trying to re-import more data - this time order_status not in ('CLOSED').
-- This time not creating a hive table, just importing the order_status != 'CLOSED' rows into a different directory (open_orders).
Issue: It creates the directory /user/hive/warehouse/hex.db/open_orders/orders/.
2.a How can I import the file into the directory /user/hive/warehouse/hex.db/open_orders?
2.b How can we import the subset of data with order_status != 'CLOSED', i.e. open orders, into the same directory created in step 1, i.e. /user/hive/warehouse/hex.db/closed_orders?
Command used for step 2:
sqoop import --connect jdbc:mysql://xxxx:0000/retail_db
--username=xxxx_dba --P --warehouse-dir=/user/hive/warehouse/hex.db/ -m 1
--table orders --compression-codec=snappy --hive-import
--as-textfile --hive-table open_orders
--where "order_status not in ('CLOSED')"
--compress --columns "order_id, order_customer_id, order_status"
2.c Error with the --append option, where I am trying to import the open orders into the directory created in step 1, /user/hive/warehouse/hex.db/closed_orders:
17/04/15 14:24:22 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
17/04/15 14:24:22 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
Append mode for hive imports is not yet supported. Please remove the parameter --append-mode

When you do #2, i.e. re-importing data, use
--target-dir /user/hive/warehouse/hex.db/open_orders
instead of
--warehouse-dir
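For example, a minimal sketch of the step-2 command with --target-dir, reusing the connection details from above and dropping --hive-import so that the files land directly in the chosen directory (no Hive table is created):
sqoop import --connect jdbc:mysql://xxxx:0000/retail_db
--username=xxxx_dba --P -m 1
--table orders --compression-codec=snappy --compress --as-textfile
--columns "order_id, order_customer_id, order_status"
--where "order_status not in ('CLOSED')"
--target-dir /user/hive/warehouse/hex.db/open_orders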

Related

incremental load using sqoop from mysql to hive

I am new to sqoop and hive. Please help me with understanding:
The counts of the mysql and hive tables are different.
mysql has 51 rows (the table has a primary key and no duplicates) and hive has 38 rows - on the first run itself.
sqoop job --create mmod -- import --connect "jdbc:mysql://cxln2.c.thelab-240901.internal:3306/retail_db" --username sqoopuser --password-file /tmp/.mysql-pass.txt
--table mod --compression-codec org.apache.hadoop.io.compress.BZip2Codec --hive-import --hive-database encry --hive-table mod2 --hive-overwrite
--check-column last_update_date --incremental lastmodified --merge-key id --last-value 0 --target-dir /user/user_name/append1sqopp
It is not creating the target dir in the given location; instead it is creating it in the warehouse location.
I am trying to schedule a sqoop incremental job; somehow I am making a mistake somewhere.
Command: the command above
2.1 new rows are added with the same date
2.2 delete and update on a few rows
Output:
No new updates on the given table.
It is not updating the last value in the sqoop job.
How to choose merge-key column in sqoop
Where condition in sqoop
--query "select * from reason where id>20 AND $CONDITIONS"
What is the use of $CONDITIONS, and do we need to pass the variable in Linux?
Is it possible to track rejected rows in a sqoop job?

error while performing sqoop - merge

I was trying to sqoop-merge two data sets by importing the data from the Netezza server.
Below are the data sets, with the number as id and the letter as name.
Both of the tables below were imported from Netezza using commands like:
sqoop import --connect netezza_url --username uname --password pwd --table sqoop_merge_1 --hive-import --warehouse-dir hdfs_pth --create-hive-table sqoop_merge_1 -m 1
sqoop_merge_1:
1,a
2,b
3,c
4,d
5,e
sqoop_merge_2:
4,z
5,y
and the commands are:
sqoop merge --new-data hdfs_path/sqoop_merge_2 --onto hdfs_path/sqoop_merge_1 --target-dir hdfs_path/sqoop_merge_output --jar-file jar_file_path/sqoop_merge_class_name.jar --class-name sqoop_merge_class_name --merge-key id
I created the jar file by using the codegen command:
sqoop codegen --connect netezza_url --username uname --password pwd --table sqoop_merge_1
But I am getting the following error:
java.io.IOException: Cannot join values on null key. Did you specify a key column that exists?
I have tried all the ways I knew, but I am still getting the error.
Please help.
As you are sure about the id column's existence, it could be an issue due to case sensitivity.
Check whether you specified ID in Netezza.
If yes, try with --merge-key ID.
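For example, assuming the column is actually named ID on the Netezza side, the same merge command with only the key changed would look like:
sqoop merge --new-data hdfs_path/sqoop_merge_2 --onto hdfs_path/sqoop_merge_1 --target-dir hdfs_path/sqoop_merge_output --jar-file jar_file_path/sqoop_merge_class_name.jar --class-name sqoop_merge_class_name --merge-key ID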

Sqoop-Imported data is not shown in the target directory

I have imported the data from MySQL to HDFS with Sqoop but am not able to see the imported data in the desired path.
The Sqoop query looks like:
sqoop job --create EveryDayImport --import --connect jdbc:mysql://localhost:3306/books --username=root --table=authors -m 1 --target-dir /home/training/viresh/Sqoop/authors1234 --incremental append --check-column id --last-value 0;
sqoop job --create EveryDayImport -- import --connect jdbc:mysql://localhost:3306/books --username=root --table=authors -m 1 --target-dir /home/training/viresh/Sqoop/authors1234 --incremental append --check-column id --last-value 0
There is a mistake in your Sqoop statement: you missed giving a space between "--" and import, as mentioned in the comment by dev.
Your sqoop statement only creates a sqoop job. To execute the job (the sqoop import), you have to submit it with the statement below.
$ sqoop job --exec EveryDayImport
I feel this is the reason no data is present in your target dir.
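As a quick check (assuming the job metadata lives on the machine where you created the job), you can list the saved jobs before executing one:
sqoop job --list
sqoop job --exec EveryDayImport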

Incremental update in HIVE table using sqoop

I have a table in oracle with only 4 columns...
Memberid --- bigint
uuid --- String
insertdate --- date
updatedate --- date
I want to import that data into a HIVE table using sqoop. I created the corresponding HIVE table with
create EXTERNAL TABLE memberimport(memberid BIGINT, uuid varchar(36), insertdate timestamp, updatedate timestamp) LOCATION '/user/import/memberimport';
and the sqoop command
sqoop import --connect jdbc:oracle:thin:@dbURL:1521/dbName --username ** --password *** --hive-import --table MEMBER --columns 'MEMBERID,UUID,INSERTDATE,UPDATEDATE' --map-column-hive MEMBERID=BIGINT,UUID=STRING,INSERTDATE=TIMESTAMP,UPDATEDATE=TIMESTAMP --hive-table memberimport -m 1
It works properly and is able to import data into the HIVE table.
Now I want to update this table incrementally using updatedate (last value = today's date) so that I can get the day-to-day updates of that OLTP table into my HIVE table using sqoop.
For the incremental import I am using the following sqoop command:
sqoop import --hive-import --connect jdbc:oracle:thin:@dbURL:1521/dbName --username *** --password *** --table MEMBER --check-column UPDATEDATE --incremental append --columns 'MEMBERID,UUID,INSERTDATE,UPDATEDATE' --map-column-hive MEMBERID=BIGINT,UUID=STRING,INSERTDATE=TIMESTAMP,UPDATEDATE=TIMESTAMP --hive-table memberimport -m 1
But I am getting the exception:
"Append mode for hive imports is not yet supported. Please remove the parameter --append-mode"
When I remove --hive-import it runs properly, but then I do not find those new updates in the HIVE table that I have in the OLTP table.
Am I doing anything wrong?
Please suggest how I can run incremental updates from Oracle to Hive using sqoop.
Any help will be appreciated.
Thanks in advance...
Although I don't have the resources to replicate your scenario exactly,
you might want to try building a sqoop job and testing your use case.
sqoop job --create sqoop_job \
-- import \
--connect "jdbc:oracle://server:port/dbname" \
--username=(XXXX) \
--password=(YYYY) \
--table (TableName)\
--target-dir (Hive Directory corresponding to the table) \
--append \
--fields-terminated-by '(character)' \
--lines-terminated-by '\n' \
--check-column "(Column To Monitor Change)" \
--incremental append \
--last-value (last value of column being monitored) \
--outdir (log directory)
When you create a sqoop job, it takes care of --last-value for subsequent runs. Also, here I have used the Hive table's data file location as the target for the incremental update.
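For example, a sketch of the subsequent runs; sqoop job --show prints the stored parameters, so you can verify that --last-value was updated after each execution:
sqoop job --exec sqoop_job
sqoop job --show sqoop_job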
Hope this provides a helpful direction to proceed.
There is no direct way to achieve this in Sqoop. However, you can use the 4-step strategy.

sqoop import completes but hive show tables can't see the table

After installing hadoop and hive (CDH version), I execute
./sqoop import -connect jdbc:mysql://10.164.11.204/server -username root -password password -table user -hive-import --hive-home /opt/hive/
All goes fine, but when I enter the hive command line and execute show tables, there is nothing.
When I use ./hadoop fs -ls, I can see /user/(username)/user exists.
Any help is appreciated.
---EDIT-----------
/sqoop import -connect jdbc:mysql://10.164.11.204/server -username root -password password -table user -hive-import --target-dir /user/hive/warehouse
The import fails due to:
11/07/02 00:40:00 INFO hive.HiveImport: FAILED: Error in semantic analysis: line 2:17 Invalid Path 'hdfs://hadoop1:9000/user/ubuntu/user': No files matching path hdfs://hadoop1:9000/user/ubuntu/user
11/07/02 00:40:00 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Hive exited with status 10
at com.cloudera.sqoop.hive.HiveImport.executeExternalHiveScript(HiveImport.java:326)
at com.cloudera.sqoop.hive.HiveImport.executeScript(HiveImport.java:276)
at com.cloudera.sqoop.hive.HiveImport.importTable(HiveImport.java:218)
at com.cloudera.sqoop.tool.ImportTool.importTable(ImportTool.java:362)
at com.cloudera.sqoop.tool.ImportTool.run(ImportTool.java:423)
at com.cloudera.sqoop.Sqoop.run(Sqoop.java:144)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at com.cloudera.sqoop.Sqoop.runSqoop(Sqoop.java:180)
at com.cloudera.sqoop.Sqoop.runTool(Sqoop.java:218)
at com.cloudera.sqoop.Sqoop.main(Sqoop.java:228)
Check your hive-site.xml for the value of the property javax.jdo.option.ConnectionURL. If you do not define this explicitly, the default value will use a relative path for creation of the hive metastore (jdbc:derby:;databaseName=metastore_db;create=true), which will be different depending upon where you launch the process from. This would explain why you cannot see the table via show tables.
Define this property value in your hive-site.xml using an absolute path.
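A minimal sketch of that property in hive-site.xml, assuming an embedded Derby metastore and a hypothetical absolute path /home/hadoop/metastore_db:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/home/hadoop/metastore_db;create=true</value>
</property>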
There is no need to create the table in hive; refer to the query below:
sqoop import --connect jdbc:mysql://xxxx.com/Database name --username root --password admin --table tablename (mysql table) --direct -m 1 --hive-import --create-hive-table --hive-table table name --target-dir '/user/hive/warehouse/Tablename(which u want create in hive)' --fields-terminated-by '\t'
In my case Hive stores data in the /user/hive/warehouse directory in HDFS. This is where Sqoop should put it.
So I guess you have to add:
--target-dir /user/hive/warehouse
which is the default location for Hive tables (might be different in your case).
You might also want to create this table in Hive:
sqoop create-hive-table --connect jdbc:mysql://host/database --table tableName --username user --password password
In my case it creates the table in the hive default database; you can give it a try.
sqoop import --connect jdbc:mysql://xxxx.com/Database name --username root --password admin --table NAME --hive-import --warehouse-dir DIR --create-hive-table --hive-table NAME -m 1
Hive tables will be created by the Sqoop import process. Please make sure /user/hive/warehouse is created in your HDFS. You can browse the HDFS at http://localhost:50070/dfshealth.jsp - Browse the File System option.
Also include the full HDFS URI in --target-dir, i.e. hdfs://:9000/user/hive/warehouse, in the sqoop import command.
First of all, create the table definition in Hive with exactly the same field names and types as in mysql.
Then perform the import operation.
For Hive Import
sqoop import --verbose --fields-terminated-by ',' --connect jdbc:mysql://localhost/test --table tablename --hive-import --warehouse-dir /user/hive/warehouse --fields-terminated-by ',' --split-by id --hive-table tablename
'id' can be the primary key of your existing table
'localhost' can be your local IP
'test' is the database
the 'warehouse' directory is in HDFS
I think all you need is to specify the hive table where the data should go.
Add "--hive-table database.tablename" to the sqoop command and remove --hive-home /opt/hive/. I think that should resolve the problem.
