Using Sqoop to create an ORC table - hadoop

I have a DB2 database running with a table t1, and a hadoop cluster. I want to create an orc table in hadoop, with the same table definition as t1.
For this task I want to use Sqoop.
I tried using the sqoop create-hive-table command, but this command isn't compatible with HCatalog - and from what I've found, the HCatalog integration is the only way to create ORC tables with Sqoop.
Instead I do this:
sqoop import \
--driver com.ibm.db2.jcc.DB2Driver \
--connect jdbc:db2://XXXXXXX \
--username user \
--password-file file:///pass.txt \
--query "select * from D1.t1 where \$CONDITIONS and reptime < '1864-11-16 13:23:54.749' fetch first 1 rows only" \
--split-by 1 \
--hcatalog-database default \
--hcatalog-table t1 \
--create-hcatalog-table \
--hcatalog-storage-stanza "stored as orcfile"
This queries the database for something that does not exist and creates an ORC table. Of course, this isn't optimal. Any ideas on how to do this with sqoop create-hive-table, or at least without having to do a useless database query that returns nothing?
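One way to avoid the dummy query entirely (a sketch; the column list below is hypothetical and must be copied from the DB2 definition of D1.t1 by hand) is to create the ORC table directly in Hive first, then let the real import write into it:

```shell
# Create the ORC table up front (column names/types here are hypothetical --
# mirror the DB2 definition of D1.t1 manually).
hive -e "CREATE TABLE default.t1 (id BIGINT, reptime TIMESTAMP, payload STRING) STORED AS ORC"

# Then run the actual import without --create-hcatalog-table:
sqoop import \
--driver com.ibm.db2.jcc.DB2Driver \
--connect jdbc:db2://XXXXXXX \
--username user \
--password-file file:///pass.txt \
--table D1.t1 \
--hcatalog-database default \
--hcatalog-table t1
```

This trades the dummy query for maintaining the DDL by hand, but it imports real rows on the first run instead of an empty result set.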

Related

Sqoop export from Hcatalog to MySQL with different col names assign

My Hive table has the columns: id, name
and the MySQL table has: number, id, name.
I want to map id (from Hive) to number (from MySQL) and name (from Hive) to id (from MySQL).
I use the command :
sqoop export --hcatalog-database <my_db> --hcatalog-table <my_table> --columns "number,id" \
--connect jdbc:mysql://db...:3306/test \
--username <my_user> --password <my_passwd> --table <my_mysql_table>
However, it didn't work.
The same scenario works fine in a similar case [1]. The requirement can be fulfilled by locating the Hive table's files on HDFS and using the following command:
sqoop export --export-dir /[hdfs_path] --columns "number,id" \
--connect jdbc:mysql://db...:3306/test \
--username <my_user> --password <my_passwd> --table <my_mysql_table>
Is there any solution that can fulfill my scenario via HCatalog?
Reference:
[1] Sqoop export from hive to oracle with different col names, number of columns and order of columns
I haven't used the HCatalog part of Sqoop, but as written in the manual, the following script should do the work:
sqoop export --hcatalog-database <my_db> --hcatalog-table <my_table> --map-column-hive "number,id" \
--connect jdbc:mysql://db...:3306/test \
--username <my_user> --password <my_passwd> --table <my_mysql_table>
When the --map-column-hive option is used along with --hcatalog, it applies to HCatalog instead of Hive.
Hope that this works for you.

Does Sqoop support extracting data from partitioned oracle table

I have a very large Oracle table which is partitioned. I would like to ask whether and how Sqoop supports splitting based on Oracle partitions, e.g. one mapper importing from one Oracle partition.
Sqoop supports import from Oracle partitioned tables. Here is the documentation.
The syntax is something like this:
sqoop import \
-Doraoop.disabled=false \
-Doraoop.import.partitions='"PARTITION-NAME","PARTITION-NAME1","PARTITION-NAME2",' \
--connect jdbc:oracle:thin:@XXX.XXX.XXX.XXX:15XX:SCHEMA_NAME \
--username user \
--password password \
--table SCHEMA.TABLE_NAME \
--target-dir /HDFS/PATH/ \
-m 1
A single mapper will be assigned to each partition; the mappers write data to HDFS simultaneously.
Make sure the dynamic-partition properties are enabled and that the maximum-partitions property values are higher than the number of partitions existing in Oracle when you create the Hive table.
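The properties referred to above can be set in the Hive session before creating and loading the table; a minimal sketch with the usual values (raise the limits above the number of Oracle partitions):

```shell
# Run these in the same Hive session as the CREATE TABLE / load statements.
hive -e "
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=1000;
SET hive.exec.max.dynamic.partitions.pernode=1000;
"
```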

sqoop import as parquet file to target dir, but can't find the file

I have been using Sqoop to import data from MySQL to Hive; the command I used is below:
sqoop import --connect jdbc:mysql://localhost:3306/datasync \
--username root --password 654321 \
--query 'SELECT id,name FROM test WHERE $CONDITIONS' --split-by id \
--hive-import --hive-database default --hive-table a \
--target-dir /tmp/yfr --as-parquetfile
The Hive table is created and the data is inserted; however, I cannot find the Parquet file.
Does anyone know where it went?
Best regards,
Feiran
Sqoop import into Hive works in two steps:
1. Fetch data from the RDBMS into HDFS.
2. Create the Hive table if it does not exist and load the data into it.
In your case, the data is first stored at --target-dir, i.e. /tmp/yfr.
Then it is loaded into the Hive table a using a
LOAD DATA INPATH ... INTO TABLE ...
command.
As mentioned in the comments, the data is moved into the Hive warehouse directory; that is why there is no data left in --target-dir.
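A quick way to see this on the cluster (assuming the default warehouse location /user/hive/warehouse):

```shell
# After the import, --target-dir has been emptied by the move...
hdfs dfs -ls /tmp/yfr
# ...and the Parquet files sit in the table's warehouse directory:
hdfs dfs -ls /user/hive/warehouse/a
```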

Incremental update in Hive table using Sqoop

I have a table in oracle with only 4 columns...
Memberid --- bigint
uuid --- String
insertdate --- date
updatedate --- date
I want to import that data into a Hive table using Sqoop. I created the corresponding Hive table with
create EXTERNAL TABLE memberimport(memberid BIGINT,uuid varchar(36),insertdate timestamp,updatedate timestamp)LOCATION '/user/import/memberimport';
and sqoop command
sqoop import --connect jdbc:oracle:thin:@dbURL:1521/dbName --username ** --password *** --hive-import --table MEMBER --columns 'MEMBERID,UUID,INSERTDATE,UPDATEDATE' --map-column-hive MEMBERID=BIGINT,UUID=STRING,INSERTDATE=TIMESTAMP,UPDATEDATE=TIMESTAMP --hive-table memberimport -m 1
It works properly and imports the data into the Hive table.
Now I want to update this table incrementally on updatedate (last value = today's date) so that I get the day-to-day updates for that OLTP table into my Hive table using Sqoop.
For Incremental import I am using following sqoop command
sqoop import --hive-import --connect jdbc:oracle:thin:@dbURL:1521/dbName --username *** --password *** --table MEMBER --check-column UPDATEDATE --incremental append --columns 'MEMBERID,UUID,INSERTDATE,UPDATEDATE' --map-column-hive MEMBERID=BIGINT,UUID=STRING,INSERTDATE=TIMESTAMP,UPDATEDATE=TIMESTAMP --hive-table memberimport -m 1
But I am getting exception
"Append mode for hive imports is not yet supported. Please remove the parameter --append-mode"
When I remove --hive-import it runs properly, but I do not find those new updates in the Hive table that I have in the OLTP table.
Am I doing anything wrong ?
Please suggest me how can I run incremental update with Oracle - Hive using sqoop.
Any help will be appreciated.
Thanks in advance.
Although I don't have the resources to replicate your scenario exactly, you might want to try building a Sqoop job and testing your use case:
sqoop job --create sqoop_job \
-- import \
--connect "jdbc:oracle:thin:@//server:port/dbname" \
--username=(XXXX) \
--password=(YYYY) \
--table (TableName)\
--target-dir (Hive Directory corresponding to the table) \
--append \
--fields-terminated-by '(character)' \
--lines-terminated-by '\n' \
--check-column "(Column To Monitor Change)" \
--incremental append \
--last-value (last value of column being monitored) \
--outdir (log directory)
When you create a Sqoop job, it takes care of --last-value for subsequent runs. Also, here I have used the Hive table's data directory as the target for the incremental update.
Hope this provides a helpful direction to proceed.
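For completeness, the saved job is then managed with the sqoop job subcommands; the Sqoop metastore remembers the incremented --last-value between runs:

```shell
sqoop job --list             # show all saved jobs
sqoop job --show sqoop_job   # inspect the definition, including the stored last value
sqoop job --exec sqoop_job   # run it; --last-value is updated automatically afterwards
```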
There is no direct way to achieve this in Sqoop. However, you can use a four-step strategy.
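A sketch of that four-step strategy (all staging table and directory names below are hypothetical; the columns follow the question's MEMBER table): keep the existing full import as the base, pull only changed rows with --incremental lastmodified into a staging directory, reconcile newest-row-wins in Hive, then compact:

```shell
# Step 2: pull changed rows into a staging dir (step 1 is the existing full import).
sqoop import --connect jdbc:oracle:thin:@dbURL:1521/dbName \
--username *** --password *** --table MEMBER \
--check-column UPDATEDATE --incremental lastmodified \
--last-value "2015-01-01 00:00:00" \
--target-dir /staging/member_delta -m 1

# Steps 3+4: reconcile (newest updatedate per memberid wins) and compact in Hive.
hive -e "
CREATE EXTERNAL TABLE member_delta (memberid BIGINT, uuid STRING, insertdate TIMESTAMP, updatedate TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/staging/member_delta';
CREATE VIEW member_reconciled AS
SELECT u.* FROM (SELECT * FROM memberimport UNION ALL SELECT * FROM member_delta) u
JOIN (SELECT memberid, MAX(updatedate) AS max_upd
      FROM (SELECT * FROM memberimport UNION ALL SELECT * FROM member_delta) a
      GROUP BY memberid) m
ON u.memberid = m.memberid AND u.updatedate = m.max_upd;
CREATE TABLE member_compacted AS SELECT * FROM member_reconciled;
"
```

The compacted table then replaces (or is swapped in for) the base table before the next delta import.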

Importing chunks into the same table using sqoop

I'm very new to hive and sqoop, as my company has just adopted them. As such, I am trying to import data from a sql database into hdfs/hive. However, we still only have a few clusters so I am worried about importing all the data at once (19 million records in total). I have searched furiously for a solution but the only thing close to what I am looking for that I have found is using incremental import. However, this is not a solution as it imports everything newer than the first import, and I have historical data for 2 years.
Therefore, is there a way to append to a table that I am missing (so I can import a month at a time into the same table, for example)?
Here is the initial command I am using to insert the first chunk of data into the table.
sqoop import --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--connect jdbc:sqlserver://******omitted******* \
--username **** \
--password ******* \
--hive-table <tablename> \
--m 1 \
--delete-target-dir \
--target-dir /apps/hive/warehouse/<dir to table> \
--hive-drop-import-delims \
--hive-import --query "select * from <old sql table> where record_id
<='000000001433106' and \$CONDITIONS"
Thanks for any help.
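One possible approach for the follow-up chunks (a sketch; the upper record_id boundary is a placeholder you would choose per chunk): since the Hive table already exists after the first run, drop --delete-target-dir and --hive-import, and use --append to add files directly to the table's directory while shifting the record_id range each run:

```shell
# Follow-up chunk: the Hive table already exists, so append the new files
# straight into its directory (upper record_id bound is a per-chunk placeholder).
sqoop import --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--connect jdbc:sqlserver://******omitted******* \
--username **** \
--password ******* \
--m 1 \
--append \
--target-dir /apps/hive/warehouse/<dir to table> \
--hive-drop-import-delims \
--query "select * from <old sql table> where record_id > '000000001433106' and record_id <= '<next boundary>' and \$CONDITIONS"
```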
