Importing chunks into the same table using sqoop - hadoop

I'm very new to Hive and Sqoop, as my company has just adopted them. I am trying to import data from a SQL Server database into HDFS/Hive, but we still only have a few clusters, so I am worried about importing all the data at once (19 million records in total). I have searched furiously for a solution, but the closest thing I have found is incremental import. That does not solve my problem, because it only imports records newer than the first import, and I have two years of historical data.
Therefore, is there a way to append to a table that I am missing (so I can, for example, import a month at a time into the same table)?
Here is the initial command I am using to insert the first chunk of data into the table.
sqoop import --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--connect jdbc:sqlserver://******omitted******* \
--username **** \
--password ******* \
--hive-table <tablename> \
--m 1 \
--delete-target-dir \
--target-dir /apps/hive/warehouse/<dir to table> \
--hive-drop-import-delims \
--hive-import \
--query "select * from <old sql table> where record_id <= '000000001433106' and \$CONDITIONS"
Thanks for any help.
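For what it's worth, here is a sketch of the chunked approach being asked about, under the assumption that a Hive import without --hive-overwrite appends to the existing table. Each run would cover one record_id (or date) range; the upper bound, staging directory, and table names below are placeholders.
# Hypothetical second chunk: same Hive table, next record_id range.
# --delete-target-dir is dropped so earlier data is not wiped, and
# --target-dir points at a temporary staging path instead of the warehouse.
sqoop import --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--connect jdbc:sqlserver://******omitted******* \
--username **** \
--password ******* \
--hive-table <tablename> \
--m 1 \
--target-dir /tmp/sqoop_staging/<tablename>_chunk_02 \
--hive-drop-import-delims \
--hive-import \
--query "select * from <old sql table> where record_id > '000000001433106' and record_id <= '<next upper bound>' and \$CONDITIONS"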

Related

Sqoop import from Teradata - No more room in database

I am new to big data. When I use Sqoop commands to import data from Teradata into my Hadoop cluster, I encounter a "No more room in database" error.
I am doing the following:
1. The data I am trying to pull into my Hadoop cluster comes from a view.
2. I have used the following sqoop command:
sqoop import --connect "jdbc:teradata://xxx.xxx.xxx.xxx/DATABASE=XY" \
--username user1 \
--password xyc \
--query "SELECT * FROM TABLE1 WHERE .... AND \$CONDITIONS" \
--split-by ITEM_1 \
--delete-target-dir \
--target-dir /user/home/folder1 \
--as-avrodatafile
I know that the default number of mappers is 4. Since I do not have a primary key on the view, I am using --split-by.
Using --num-mappers 1 works, but it takes a long time to port over roughly 36 GB of data, so I wanted to increase the number of mappers to 4 or more. However, I then get the "no more room" error. Does anyone know what's happening?
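One thing that may be worth checking (an assumption, not a verified fix): with more than one mapper, Sqoop first issues a MIN/MAX query on the --split-by column to compute split boundaries, and that extra query plus the per-mapper range scans can exhaust Teradata spool space. Supplying the boundaries yourself with --boundary-query avoids the MIN/MAX scan; the bounds below are placeholders.
sqoop import --connect "jdbc:teradata://xxx.xxx.xxx.xxx/DATABASE=XY" \
--username user1 \
--password xyc \
--query "SELECT * FROM TABLE1 WHERE .... AND \$CONDITIONS" \
--split-by ITEM_1 \
--boundary-query "SELECT 1, 1000000" \
--num-mappers 4 \
--delete-target-dir \
--target-dir /user/home/folder1 \
--as-avrodatafile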

Incremental update in HIVE table using sqoop

I have a table in Oracle with only 4 columns...
Memberid --- bigint
uuid --- String
insertdate --- date
updatedate --- date
I want to import that data into a Hive table using Sqoop. I created the corresponding Hive table with
create EXTERNAL TABLE memberimport(memberid BIGINT, uuid varchar(36), insertdate timestamp, updatedate timestamp) LOCATION '/user/import/memberimport';
and sqoop command
sqoop import --connect jdbc:oracle:thin:@dbURL:1521/dbName --username ** --password *** --hive-import --table MEMBER --columns 'MEMBERID,UUID,INSERTDATE,UPDATEDATE' --map-column-hive MEMBERID=BIGINT,UUID=STRING,INSERTDATE=TIMESTAMP,UPDATEDATE=TIMESTAMP --hive-table memberimport -m 1
It works properly and imports the data into the Hive table.
Now I want to update this table incrementally using updatedate (last value = today's date), so that I get the day-to-day changes from that OLTP table into my Hive table using Sqoop.
For Incremental import I am using following sqoop command
sqoop import --hive-import --connect jdbc:oracle:thin:@dbURL:1521/dbName --username *** --password *** --table MEMBER --check-column UPDATEDATE --incremental append --columns 'MEMBERID,UUID,INSERTDATE,UPDATEDATE' --map-column-hive MEMBERID=BIGINT,UUID=STRING,INSERTDATE=TIMESTAMP,UPDATEDATE=TIMESTAMP --hive-table memberimport -m 1
But I am getting exception
"Append mode for hive imports is not yet supported. Please remove the parameter --append-mode"
When I remove --hive-import it runs properly, but the new updates from the OLTP table do not show up in the Hive table.
Am I doing anything wrong ?
Please suggest me how can I run incremental update with Oracle - Hive using sqoop.
Any help will be appreciated.
Thanks in Advance ...
Although I don't have the resources to replicate your scenario exactly, you might want to try building a Sqoop job and testing your use case.
sqoop job --create sqoop_job \
-- import \
--connect "jdbc:oracle:thin:@server:port/dbname" \
--username (XXXX) \
--password (YYYY) \
--table (TableName) \
--target-dir (Hive directory corresponding to the table) \
--append \
--fields-terminated-by '(character)' \
--lines-terminated-by '\n' \
--check-column "(Column to monitor for changes)" \
--incremental append \
--last-value (last value of the column being monitored) \
--outdir (log directory)
When you create a Sqoop job, it takes care of --last-value for subsequent runs. Also, here I have used the Hive table's data directory as the target for the incremental update.
Hope this provides a helpful direction to proceed.
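For completeness, the saved job would then be listed, inspected, and run roughly like this (assuming the job name from the example above); sqoop job --show prints the stored options, including incremental.last.value, between runs.
sqoop job --list                # confirm the job exists
sqoop job --show sqoop_job      # inspect saved options, incl. incremental.last.value
sqoop job --exec sqoop_job      # run the import; the last value is updated afterwards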
There is no direct way to achieve this in Sqoop. However, you can use a four-step strategy.

sqoop-hive Import adding an extra column

I have imported data from Sqoop to Hive successfully. I then added a column in Oracle and imported that particular column to Hive again using sqoop import. However, the new data is appended to the first column's data, the remaining columns are null, and no new column appears in Hive. Can anyone resolve the issue?
Without looking at your import statements, I assume that in your second import you are trying to append to the existing import while importing only the new column, using the --columns and --append arguments. It will not work this way, because the data is appended at the end of the file, not at the end of each line.
You will need to overwrite the existing data in HDFS using --hive-overwrite and alter the Hive table to add the additional column, OR just drop the Hive table and use --create-hive-table in the sqoop command.
So your import command should look like this:
sqoop import \
--connect $CONNECTION_STR \
--username $USER \
--password $PASS \
--table $ORACLE_TABLE \
--hive-import \
--hive-overwrite \
--hive-home $HIVE_HOME \
--hive-table $HIVE_TABLE
Change the variables to the actual values for your environment.
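If you go with the alter-table route instead of dropping and recreating, the Hive side of it would look roughly like this (table name, column name, and type are placeholders for whatever was added in Oracle), followed by the overwrite import shown above:
# Hypothetical: add the new column to the existing Hive table first,
# then re-run the import with --hive-overwrite so every row carries it.
hive -e "ALTER TABLE my_hive_table ADD COLUMNS (new_col STRING);"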

Using sqoop to create an orc table

I have a DB2 database running with a table t1, and a Hadoop cluster. I want to create an ORC table in Hadoop with the same table definition as t1.
For this task I want to use Sqoop.
I tried using the sqoop create-hive-table command, but it isn't compatible with HCatalog, and from what I've found, HCatalog is the only route that lets me create ORC tables.
Instead I do this:
sqoop import \
--driver com.ibm.db2.jcc.DB2Driver \
--connect jdbc:db2://XXXXXXX \
--username user \
--password-file file:///pass.txt \
--query "select * from D1.t1 where \$CONDITIONS and reptime < '1864-11-16 13:23:54.749' fetch first 1 rows only" \
--split-by 1 \
--hcatalog-database default \
--hcatalog-table t1 \
--create-hcatalog-table \
--hcatalog-storage-stanza "stored as orcfile"
This queries the database for something that does not exist and creates the ORC table. Of course, this isn't optimal. Any ideas on how to do this with sqoop create-hive-table, or at least without having to run a useless database query that returns nothing?
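One possible workaround (a sketch, assuming it is acceptable to copy the column list of t1 by hand) is to skip --create-hcatalog-table entirely: create the ORC table in Hive first and then point the real import at it, so no dummy query is needed.
# 1) Create the ORC table up front (columns below are placeholders,
#    copied manually from the definition of D1.t1).
hive -e "CREATE TABLE default.t1 (id INT, payload VARCHAR(100), reptime TIMESTAMP) STORED AS ORC;"

# 2) Run the actual import straight into it.
sqoop import \
--driver com.ibm.db2.jcc.DB2Driver \
--connect jdbc:db2://XXXXXXX \
--username user \
--password-file file:///pass.txt \
--query "select * from D1.t1 where \$CONDITIONS" \
-m 1 \
--hcatalog-database default \
--hcatalog-table t1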

Sqoop Incremental import and update

I am trying to import data from SQL into a Hive database. The goal is to propagate changes in the Oracle database to Hive using sqoop import. The sqoop command is as follows:
sqoop import -D mapred.child.java.opts='\-Djava.security.egd=file:/dev/../dev/urandom' \
--connect 'jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(LOAD_BALANCE=ON)(FAILOVER=ON)(ADDRESS=(PROTOCOL=TCP)(HOST=)(PORT=1545))(ADDRESS=(PROTOCOL=TCP)(HOST=)(PORT=1545)))(CONNECT_DATA=(SERVICE_NAME=)(SERVER=DEDICATED)))' \
--username abcde \
--password 1234rgtds \
--table Customer_Acc \
--columns Name,ID,Address,Date_booking,Last_update_date \
-m 1 \
--target-dir /final/table \
--hive-import \
--hive-table tesupd \
--map-column-hive Name,ID,Address,Date_booking \
--null-string '\\N' \
--null-non-string '\\N' \
--hive-delims-replacement ' ' \
--incremental lastmodified \
--check-column Last_update_date \
--last-value "2009-12-31 12:14:28"
The final output should be only the data with Last_update_date greater than the last value, but in the case above the data is being appended instead of incrementally updated.
I want the data to be updated rather than appended.
Use the --merge-key option in your sqoop import command. This will replace the older records with the latest ones.
Alternatively, you can use the sqoop merge command, but that takes two steps: first a sqoop import without --merge-key, then a sqoop merge (sketched below).
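A rough sketch of that two-step variant for the table in this question (the connect string, directories, and the jar/class generated by Sqoop's codegen are placeholders):
# Step 1: import only the changed rows into a fresh directory.
sqoop import \
--connect jdbc:oracle:thin:@dbhost:1545/service \
--username abcde \
--password 1234rgtds \
--table Customer_Acc \
--incremental lastmodified \
--check-column Last_update_date \
--last-value "2009-12-31 12:14:28" \
--target-dir /final/table_increment \
-m 1

# Step 2: merge the new rows onto the old ones, keeping the latest row per ID.
# Customer_Acc.jar / Customer_Acc are the record class and jar produced by
# the import above (or by a separate sqoop codegen run).
sqoop merge \
--new-data /final/table_increment \
--onto /final/table \
--target-dir /final/table_merged \
--jar-file Customer_Acc.jar \
--class-name Customer_Acc \
--merge-key ID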
Try using --incremental append rather than --incremental lastmodified.
With --incremental append, the last value of the field you mention is stored in the Sqoop metastore as 'incremental.last.value' and changes every time the job is executed. So you do not have to update the last value in your query yourself; it is updated automatically.
This way the value is always kept current (in the Sqoop metastore) and there is no redundant data.
Neither Sqoop nor Hive can directly update data in Hive through sqoop imports. Please follow the steps in the link below for row-level updates.
http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
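The core of that four-step strategy is a reconciliation view that keeps only the newest version of each row. A rough HiveQL sketch for the table in this question, run through the Hive CLI (base_customer_acc and incr_customer_acc are hypothetical names for the full and incremental loads):
hive -e "
CREATE VIEW reconcile_view AS
SELECT t.* FROM
  (SELECT * FROM base_customer_acc
   UNION ALL
   SELECT * FROM incr_customer_acc) t
JOIN
  (SELECT ID, MAX(Last_update_date) AS max_date FROM
     (SELECT * FROM base_customer_acc
      UNION ALL
      SELECT * FROM incr_customer_acc) x
   GROUP BY ID) latest
ON t.ID = latest.ID AND t.Last_update_date = latest.max_date;
"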
So your data is mutable and you'd like to modify, in HDFS, the records which have been changed in your DB.
For this you need --incremental lastmodified together with --merge-key. You also need to create a Sqoop job, because the job captures the most recent --last-value and serializes it back into the saved job.
You should create a Sqoop job which looks something like this.
sqoop job \
--create jobName \
-- \
import \
--connect jdbc:oracle:thin:@hostname:port:sid \
--username user \
--password-file fileOnHDFS.password \
--table tableName \
--incremental lastmodified \
--check-column UPDATE_DATUM \
--last-value 1985-01-01 \
--merge-key ID
More details can be found in the Sqoop User Guide (v1.4.6)
