I am new to Big Data. When I use Sqoop commands to import data from Teradata into my Hadoop cluster, I encounter a "No more room in database" error.
I am doing the following:
1. The data I am trying to pull into my Hadoop cluster is a view, not a base table.
2. I have used the following sqoop command:
sqoop import --connect "jdbc:teradata://xxx.xxx.xxx.xxx/DATABASE=XY" \
--username user1 \
--password xyc \
--query "SELECT * FROM TABLE1 WHERE .... AND \$CONDITIONS" \
--split-by ITEM_1 \
--delete-target-dir \
--target-dir /user/home/folder1 \
--as-avrodatafile
I know that the default number of mappers is 4; since my view has no primary key, I am using --split-by.
Using --num-mappers 1 works, but it takes a long time to port over roughly 36 GB of data, so I wanted to increase the number of mappers to 4 or more. However, I then get the "no more room" error. Does anyone know what's happening?
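Concretely (same placeholders as in the command above), the failing variant is just the original command with an explicit mapper count:

sqoop import --connect "jdbc:teradata://xxx.xxx.xxx.xxx/DATABASE=XY" \
--username user1 \
--password xyc \
--query "SELECT * FROM TABLE1 WHERE .... AND \$CONDITIONS" \
--split-by ITEM_1 \
--num-mappers 4 \
--delete-target-dir \
--target-dir /user/home/folder1 \
--as-avrodatafile

With --num-mappers 4, each mapper issues its own bounded copy of the query, so Teradata runs four concurrent queries instead of one.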
Can you please help me with the points below.
1. I have an Oracle database with a huge number of records, say 5 TB of data as of today. We can use the Sqoop validator framework: it will validate the data and import it into HDFS.
2. Suppose tomorrow I receive new records on top of that data. How can I import only those new records into the existing directory, and validate them with the Sqoop validator framework as they arrive?
Please help me, team. Thanks.
Thank you,
Sipra
My understanding is that you need to check the Oracle database for new records before you start your delta process. I don't think you can validate based on the size of the records, but if you have an offset or a timestamp (TS) column, that will be helpful for validation.
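For example, a quick pre-check could be run before the delta import. This is a sketch: the table MY_TABLE, the column LAST_UPDATED, and the connection details are all illustrative placeholders, and the checkpoint timestamp would come from your previous run:

sqoop eval \
--connect jdbc:oracle:thin:@dbhost:1521/dbname \
--username user --password pass \
--query "SELECT COUNT(*) FROM MY_TABLE WHERE LAST_UPDATED > TIMESTAMP '2024-01-01 00:00:00'"

If the count is greater than zero, new records have arrived since the stored checkpoint and the delta import is worth running.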
How do I know whether there are new records in Oracle since the last run/job/check?
You can do this with two Sqoop import approaches; an example and explanation of each follows.
sqoop incremental import
Following is an example of a Sqoop incremental import:
sqoop import \
--connect jdbc:mysql://localhost:3306/ydb \
--table yloc \
--username root -P \
--check-column rDate \
--incremental lastmodified \
--last-value 2014-01-25 \
--target-dir yloc/loc
This link explains it: https://www.tutorialspoint.com/sqoop/sqoop_import.html
sqoop import using query option
Here you basically use a where condition in the query and pull only the data that is greater than the last received date or offset column value.
Here is the syntax for it:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba --password cloudera \
--query 'select * from sample_data where $CONDITIONS AND salary > 1000' \
--split-by salary \
--target-dir hdfs://quickstart.cloudera/user/cloudera/sqoop_new
Isolate the validation and import job
If you want to run the validation and import jobs independently, Sqoop has another utility, sqoop eval. With it you can run a query against the RDBMS and direct the output to a file or to a variable in your code, then use that for validation as you wish.
Syntax:
$ sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
--query "SELECT * FROM employee LIMIT 3"
Explained here: https://www.tutorialspoint.com/sqoop/sqoop_eval.htm
validation parameter in sqoop
You can use this parameter to validate that the row counts match between what is imported/exported between the RDBMS and HDFS:
--validate
More on that: https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#validation
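As a sketch (the connection details and table name are placeholders), --validate is simply appended to a plain table import; note that validation works with --table imports rather than free-form --query imports:

sqoop import \
--connect jdbc:mysql://localhost:3306/db \
--username root --password pass \
--table employee \
--target-dir /user/hadoop/employee \
--validate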
I have a doubt about the following sqoop import command:
sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username user_name \
--table user_table \
--m 1 \
--target-dir /sample
Why do we use -m in the above command? Please clarify.
-m specifies the number of mappers; -m 1 means only one mapper runs to import the table. This flag controls parallelism. To achieve parallelism, Sqoop uses the primary key/unique key to split the rows of the source table.
The default number of mappers in Sqoop is 4, so when there is no primary key you need to state which column to split on using --split-by column_name; by giving -m 1 you don't need splitting at all.
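For example (the split column user_id here is hypothetical), the same import with four mappers could look like this; Sqoop divides the range between MIN(user_id) and MAX(user_id) into four slices, one per mapper:

sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username user_name \
--table user_table \
--split-by user_id \
-m 4 \
--target-dir /sample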
I have a table in Oracle with only 4 columns:
Memberid --- bigint
uuid --- String
insertdate --- date
updatedate --- date
I want to import that data into a Hive table using Sqoop. I created the corresponding Hive table with
create EXTERNAL TABLE memberimport (memberid BIGINT, uuid varchar(36), insertdate timestamp, updatedate timestamp) LOCATION '/user/import/memberimport';
and this sqoop command:
sqoop import --connect jdbc:oracle:thin:@dbURL:1521/dbName \
--username ** --password *** \
--hive-import \
--table MEMBER \
--columns 'MEMBERID,UUID,INSERTDATE,UPDATEDATE' \
--map-column-hive MEMBERID=BIGINT,UUID=STRING,INSERTDATE=TIMESTAMP,UPDATEDATE=TIMESTAMP \
--hive-table memberimport -m 1
It works properly and imports the data into the Hive table.
Now I want to update this table incrementally on updatedate (last value = today's date) so that I get the day-to-day updates from that OLTP table into my Hive table using Sqoop.
For the incremental import I am using the following sqoop command:
sqoop import --hive-import \
--connect jdbc:oracle:thin:@dbURL:1521/dbName \
--username *** --password *** \
--table MEMBER \
--check-column UPDATEDATE \
--incremental append \
--columns 'MEMBERID,UUID,INSERTDATE,UPDATEDATE' \
--map-column-hive MEMBERID=BIGINT,UUID=STRING,INSERTDATE=TIMESTAMP,UPDATEDATE=TIMESTAMP \
--hive-table memberimport -m 1
But I am getting the exception:
"Append mode for hive imports is not yet supported. Please remove the parameter --append-mode"
When I remove --hive-import it runs properly, but I do not find in the Hive table the new updates that I have in the OLTP table.
Am I doing anything wrong?
Please suggest how I can run an incremental Oracle-to-Hive update using Sqoop.
Any help will be appreciated.
Thanks in advance.
Although I don't have the resources to replicate your scenario exactly, you might want to try building a sqoop job and testing your use case:
sqoop job --create sqoop_job \
-- import \
--connect "jdbc:oracle://server:port/dbname" \
--username=(XXXX) \
--password=(YYYY) \
--table (TableName) \
--target-dir (Hive Directory corresponding to the table) \
--append \
--fields-terminated-by '(character)' \
--lines-terminated-by '\n' \
--check-column "(Column To Monitor Change)" \
--incremental append \
--last-value (last value of column being monitored) \
--outdir (log directory)
When you create a sqoop job, it takes care of --last-value for subsequent runs. Also, here I have used the Hive table's data directory as the target for the incremental update.
Hope this provides a helpful direction to proceed.
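Once the job exists, it can be inspected and run from the command line; the Sqoop metastore keeps the advancing --last-value between executions:

sqoop job --list                 # show all saved jobs
sqoop job --show sqoop_job       # print the stored parameters, including the last value
sqoop job --exec sqoop_job       # run the incremental import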
There is no direct way to achieve this in Sqoop. However, you can use the commonly described four-step strategy: take a full initial import, import the incremental changes, reconcile the base data and the changes through a view, and finally compact the reconciled view back into the base table.
I'm very new to Hive and Sqoop, as my company has just adopted them. I am trying to import data from a SQL database into HDFS/Hive. However, we still only have a few clusters, so I am worried about importing all the data at once (19 million records in total). I have searched furiously for a solution, but the only thing close to what I am looking for is incremental import. That is not a solution, though, because it imports everything newer than the first import, and I have 2 years of historical data.
Therefore, is there a way to append to a table that I am missing (so I can, for example, import a month at a time into the same table)?
Here is the initial command I am using to insert the first chunk of data into the table.
sqoop import --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--connect jdbc:sqlserver://******omitted******* \
--username **** \
--password ******* \
--hive-table <tablename> \
--m 1 \
--delete-target-dir \
--target-dir /apps/hive/warehouse/<dir to table> \
--hive-drop-import-delims \
--hive-import \
--query "select * from <old sql table> where record_id <= '000000001433106' and \$CONDITIONS"
Thanks for any help.
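One possible direction (a sketch, not a tested recipe): keep the same --query shape, advance the record_id window on each run, import into a fresh staging --target-dir, and rely on --hive-import without --hive-overwrite appending the new files to the existing Hive table. The upper boundary below is a made-up placeholder:

sqoop import --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--connect jdbc:sqlserver://******omitted******* \
--username **** \
--password ******* \
--hive-table <tablename> \
--m 1 \
--target-dir /tmp/sqoop_staging_chunk2 \
--hive-drop-import-delims \
--hive-import \
--query "select * from <old sql table> where record_id > '000000001433106' and record_id <= '<next boundary>' and \$CONDITIONS"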
My scenario: I get 100 records daily into HDFS through Sqoop at a particular time. But yesterday I got only 50 records for that particular time, so today I need to get 50+100 records into HDFS through Sqoop for that particular time. Please help me. Thanks in advance.
To handle such a scenario, you need to add a where condition on time, no matter what the record count is.
You can use something like this in the sqoop import command, using the --query parameter:
sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username sqoop \
--password sqoop \
--query "SELECT * FROM records WHERE recordTime BETWEEN '<datetime>' AND NOW() AND \$CONDITIONS" \
-m 1 \
--target-dir /user/hadoop/records
You need to modify the where condition as per your table schema.
Please refer to the Sqoop documentation for more details.
sqoop import --connect jdbc:mysql://localhost:3306/your_mysql_databasename --username root -P --query "SELECT * FROM records WHERE recordTime BETWEEN '' AND NOW() AND \$CONDITIONS" -m 1 --target-dir /path_where_you_want_to_store_data
and when sqoop asks for the password, enter your MySQL password (e.g. my pwd is root).