I have a question about the following sqoop import command:
sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username user_name \
--table user_table \
--m 1 \
--target-dir /sample
Why do we use -m in the above command? Please clarify.
-m specifies the number of mappers; by giving -m 1 you tell Sqoop to run only one mapper to import the table. This option is used for controlling parallelism. To achieve that parallelism, Sqoop uses the primary key/unique key to split the rows of the source table.
By default, the number of mappers in Sqoop is 4, so to achieve that parallelism you need to state which column to split on using --split-by column_name; by giving -m 1 you don't need splitting at all.
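For illustration, here is a hedged sketch of both cases, reusing the connection details from the question (the split column user_id and the target directories are hypothetical):

# 4 mappers (the default): Sqoop splits on the table's primary key
sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username user_name \
--table user_table \
--target-dir /sample_parallel

# No primary key (or a different split column wanted): choose the column yourself
sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username user_name \
--table user_table \
--split-by user_id \
-m 4 \
--target-dir /sample_split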
I am new to Big Data. When I use Sqoop commands to import data from Teradata into my Hadoop cluster, I encounter a "No more room in database" error.
I am doing the following:
1. The data I am trying to pull into my Hadoop cluster is a view.
2. I have used the following sqoop command:
sqoop import --connect "jdbc:teradata://xxx.xxx.xxx.xxx/DATABASE=XY" \
-- username user1 \
-- password xyc
-- query "
SELECT * FROM TABLE1 WHERE .... AND \$CONDITIONS \
" \
--split-by ITEM_1 \
--delete-target-dir \
--target-dir /user/home/folder1 \
--as-avrodatafile;
I know that the default number of mappers is 4. Since I do not have a primary key for my view, I am using --split-by.
Using --num-mappers 1 works, but it takes a long time to port over roughly 36 GB of data, hence I wanted to increase the number of mappers to 4 or more; however, I am then getting the "no more room" error. Does anyone know what's happening?
I have executed a similar kind of sqoop command, shown below. I wanted to keep the free-form query in a file and run the sqoop command from there, since my real queries are quite complex and large.
I wanted to know: is there a way to keep the query in a file and have the sqoop command refer to the free-form query inside that file and execute it,
like we do in the --password-file case? Thanks in advance.
sqoop import --connect "jdbc:mysql://<localhost>:port" --username "admin" --password-file "<passwordfile>" --query "select * from employee" --split-by employee_id --target-dir "<target directory>" --incremental append --check-column employee_id --last-value 0 --fields-terminated-by "|"
Command-line options that are not convenient to put directly on the command line can be read from a file using Sqoop's --options-file argument, so you can read the query from an options file. Using an options file, the Sqoop command should be similar to this:
sqoop import --connect $connect_string --username $username --password $pwd --options-file /home/user/sqoop_poc/query.txt --target-dir $target_dir --m 1
The entry in the options file should look like this:
--query
select * from TEST_OPTION where ID <= 10 AND $CONDITIONS
More details on options files are available in the Sqoop User Guide.
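As a further, hedged illustration (the file path and connection values below are hypothetical), other options such as the connection string can also be kept in the options file, with each option and its value on its own line and lines beginning with # treated as comments:

# /home/user/sqoop_poc/import_options.txt (hypothetical options file)
--connect
jdbc:mysql://localhost:3306/testdb
--username
admin
--query
select * from TEST_OPTION where ID <= 10 AND $CONDITIONS

The import would then be invoked with something like:

sqoop import --options-file /home/user/sqoop_poc/import_options.txt --password-file <passwordfile> --target-dir <target directory> --m 1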
Can you please help me with the below points.
I have an Oracle database with a huge number of records today, say 5 TB of data, so we can use the Sqoop validator framework; it will validate and import into HDFS.
Then suppose tomorrow I receive new records on top of the above data. How can I import those new records (only the new records, into the existing directory) and validate them using the Sqoop validator framework?
I have a requirement for how to use the Sqoop validator when new records arrive.
I need the Sqoop validator framework to be used when new records arrive and are imported into HDFS.
Please help me, team. Thanks.
Thank You,
Sipra
My understanding is that you need to check the Oracle database for new records before you start your delta process. I don't think you can validate based on the size of the records, but if you have an offset or a timestamp (TS) column, that will be helpful for validation.
How do I know if there are new records in Oracle since the last run/job/check?
You can do this with two sqoop import approaches; following are examples and explanations for both.
sqoop incremental
Following is an example for the sqoop incremental import
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --check-column rDate --incremental lastmodified --last-value 2014-01-25 --target-dir yloc/loc
This link explains it: https://www.tutorialspoint.com/sqoop/sqoop_import.html
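In the same vein, a hedged sketch of the append variant of the incremental import, which only pulls rows whose check column is greater than the stored last value (the check column id and the last value below are hypothetical):

sqoop import \
--connect jdbc:mysql://localhost:3306/ydb \
--username root -P \
--table yloc \
--check-column id \
--incremental append \
--last-value 100 \
--target-dir yloc/loc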
sqoop import using query option
Here you basically use a WHERE condition in the query and pull the data that is greater than the last received date or offset column value.
Here is the syntax for it:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba --password cloudera \
--query 'select * from sample_data where $CONDITIONS AND salary > 1000' \
--split-by salary \
--target-dir hdfs://quickstart.cloudera/user/cloudera/sqoop_new
Isolate the validation and import job
If you want to run the validation and import jobs independently, you have another utility in sqoop, which is sqoop eval. With this you can run a query on the RDBMS and point the output to a file or to a variable in your code, and use that for validation purposes as you want.
Syntax:
$ sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
--query "SELECT * FROM employee LIMIT 3"
Explained here: https://www.tutorialspoint.com/sqoop/sqoop_eval.htm
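For instance, a hedged shell sketch of that idea (the file paths are hypothetical, and it assumes the eval result table is written to standard output so it can be captured; the HDFS path is reused from the query-option example above):

# Run a validation query on the RDBMS and keep the result for later comparison
sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
--query "SELECT COUNT(*) FROM employee" > /tmp/source_count.txt

# Count the records that actually landed in HDFS (assuming one record per line in text files)
hdfs dfs -cat /user/cloudera/sqoop_new/part-* | wc -l > /tmp/hdfs_count.txt

# Compare the two numbers in your own script to decide whether the import is valid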
validation parameter in sqoop
You can use this parameter to validate that the row counts match between what is in the RDBMS and what was imported/exported to or from HDFS:
--validate
More on that: https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#validation
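As a hedged sketch of where the flag goes (connection and table reused from the earlier incremental example; the target directory is hypothetical; note that this built-in validation compares row counts for single-table imports, not free-form --query imports):

sqoop import \
--connect jdbc:mysql://localhost:3306/ydb \
--username root -P \
--table yloc \
--target-dir yloc/loc_validated \
--m 1 \
--validate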
I have imported data from Sqoop to Hive successfully. I then added a column in Oracle and imported that particular column to Hive again using sqoop import. But it is appending to the first column's data, the remaining columns are null, and no new column appeared in Hive. Can anyone resolve the issue?
Without looking at your import statements, I am assuming that in your second import you are trying to append to the existing import but only importing the new column, using the --columns and --append arguments. It will not work this way, as it will append to the file at the end of the file, not at the end of each line.
You will need to overwrite the existing data in HDFS using --hive-overwrite and alter the Hive table to add the additional column, OR just drop the Hive table and use --create-hive-table in the sqoop command.
So your import command should look like this:
sqoop import \
--connect $CONNECTION_STR \
--username $USER \
--password $PASS \
--table $ORACLE_TABLE \
--hive-import \
--hive-overwrite \
--hive-home $HIVE_HOME \
--hive-table $HIVE_TABLE
Change the values to the actual values for your environment.
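If you prefer the ALTER TABLE route mentioned above instead of dropping the table, a hedged sketch of the Hive side (table name, column name, and type are hypothetical) would be:

# Add the new column to the existing Hive table before re-running the overwrite import
hive -e "ALTER TABLE your_hive_table ADD COLUMNS (new_col STRING);"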
My scenario: I get 100 records daily in HDFS through Sqoop at a particular time. But yesterday I got only 50 records for that particular time, so today I need to get 50+100 records in HDFS through Sqoop for that particular time. Please help me. Thanks in advance.
To handle such a scenario, you need to add a WHERE condition on time, no matter what the record count is.
You can use something like this in the sqoop import command using the --query parameter:
sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username sqoop \
--password sqoop \
--query "SELECT * FROM records WHERE recordTime BETWEEN '<datetime>' AND NOW() AND \$CONDITIONS" \
--m 1 \
--target-dir /user/hadoop/records
You need to modify the where condition as per your table schema.
Please refer to the Sqoop documentation for more details.
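As a hedged sketch of how the <datetime> placeholder could be filled in practice (the watermark file path is hypothetical, and --append is used so repeated runs can land in the same target directory), you can keep the timestamp of the last successful run in a small file and substitute it into the query:

# Read the time of the last successful run from a (hypothetical) local watermark file
LAST_RUN=$(cat /tmp/sqoop_last_run.txt 2>/dev/null || echo '1970-01-01 00:00:00')

sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username sqoop \
--password sqoop \
--query "SELECT * FROM records WHERE recordTime BETWEEN '$LAST_RUN' AND NOW() AND \$CONDITIONS" \
--m 1 \
--append \
--target-dir /user/hadoop/records

# On success, record the new watermark for the next run
date '+%Y-%m-%d %H:%M:%S' > /tmp/sqoop_last_run.txt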
sqoop import --connect jdbc:mysql://localhost:3306/your_mysql_databasename --username root -P --query "SELECT * FROM records WHERE recordTime BETWEEN '<datetime>' AND NOW() AND \$CONDITIONS" --m 1 --target-dir <where you want to store the data>
And when sqoop asks for the password, enter your MySQL password, e.g. mine is root.