What is the relevance of -m 1 - hadoop

I am executing the below sqoop command:
sqoop import --connect 'jdbc:sqlserver://10.xxx.xxx.xx:1435;database=RRAM_Temp' --username DRRM_DATALOADER --password ****** --table T_VND --hive-import --hive-table amitesh_db.amit_hive_test --as-textfile --target-dir amitesh_test_hive -m 1
I have two queries:
1) What is the relevance of -m 1? As far as I know, it is the number of mappers that I am assigning to the sqoop job. If that is true, then the moment I assign -m 2, the execution starts throwing the error below:
ERROR tool.ImportTool: Error during import: No primary key could be found for table xxx. Please specify one with --split-by or perform a sequential import with '-m 1'
Now I am forced to revise my understanding: I see it has something to do with the database primary key. Can somebody explain the logic behind this?
2) I have instructed the above sqoop command to save the file in text file format. But when I go to the location suggested by the execution, I find tbl_name.jar. Why? If --as-textfile is the wrong syntax, then what is the right one? Or is there another location where I can find the file?

1) For -m or --num-mappers to be set to a value greater than 1, the table must either have a PRIMARY KEY or the sqoop command must be given a --split-by column. The "Controlling Parallelism" section of the Sqoop documentation explains the logic behind this.
2) The file format of the data imported into the Hive table amit_hive_test will be plain text (--as-textfile). Because this is a --hive-import, the data is first imported into the --target-dir and then loaded (LOAD DATA INPATH) into the Hive table. The resulting files will therefore be under the table's LOCATION, not in --target-dir. Sketches for both points follow below.
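For point 1, a minimal sketch of the same import running with two mappers, assuming T_VND has a reasonably evenly distributed numeric column to split on (VND_ID here is purely hypothetical):
sqoop import --connect 'jdbc:sqlserver://10.xxx.xxx.xx:1435;database=RRAM_Temp' \
--username DRRM_DATALOADER --password ****** \
--table T_VND \
--split-by VND_ID \
-m 2 \
--hive-import --hive-table amitesh_db.amit_hive_test \
--as-textfile --target-dir amitesh_test_hive
For point 2, you can confirm where the text files actually landed by asking Hive for the table's LOCATION and then listing it; the warehouse path below is only the usual default and may differ in your setup:
hive -e "DESCRIBE FORMATTED amitesh_db.amit_hive_test;"
hdfs dfs -ls /user/hive/warehouse/amitesh_db.db/amit_hive_test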

Related

last-value in sqoop (incremental import)

sqoop import --connect \
jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --check-column rank --incremental append --last-value
We don't know the last value of the previous table. How can I write the query?
You can try 2 approaches to solve this.
Query the table and get the maximum value of the last-value column.
Create a job in sqoop and set the column as the incremental one; moving forward, your job will run on an incremental basis (a sketch follows below).
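A minimal sketch of the second approach, reusing the connection details from the question (the job name yloc_incremental and the seed value 0 are assumptions; Sqoop stores and updates incremental.last.value in its metastore after every successful run):
sqoop job --create yloc_incremental \
-- import --connect jdbc:mysql://localhost:3306/ydb \
--table yloc --username root -P \
--check-column rank \
--incremental append \
--last-value 0
sqoop job --exec yloc_incremental
Each --exec run then resumes from the stored last value, so you never have to look it up yourself.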
Go to your .sqoop directory (under your home directory):
cd ~/.sqoop
Open the file metastore.db.script using vi or your favourite editor and search for incremental.last.value. It should be something like
INSERT INTO SQOOP_SESSIONS VALUES('incimpjob','incremental.last.value','2018-09-11 19:20:52.0','SqoopOptions')
Note: I am assuming that you have created a Sqoop job; 'incimpjob' is the name of my sqoop job.
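If you would rather not open the metastore file by hand, the saved-job tooling should surface the same information (using the 'incimpjob' name from the note above; the exact output format may vary between Sqoop versions):
sqoop job --list
sqoop job --show incimpjob
The --show output includes the stored job definition, which is where the incremental state is kept.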

Import all the tables from an RDBMS using sqoop

I am trying to import data from a test MySQL database into Hadoop using sqoop, but some tables have a primary key and some do not.
$sqoop import-all-tables --connect jdbc:mysql://192.168.0.101/mysql -username test -P --warehouse-dir /home/user_all_tables
17/08/01 22:46:54 ERROR tool.ImportAllTablesTool: Error during import:
No primary key could be found for table general_log. Please specify
one with --split-by or perform a sequential import with '-m 1'.
Kindly suggest how to use --split-by on the sqoop command line.
For the import-all-tables tool to be useful, the following conditions must be met:
Each table must have a single-column primary key.
You must intend to import all columns of each table.
You must not intend to use non-default splitting column, nor impose any conditions via a WHERE clause.
The default options don't fit a table without a primary key, which is why the import fails. Here I would suggest using the -m 1 option to restrict the import to one mapper only.
Sqoop command:
sqoop import-all-tables --connect jdbc:mysql://192.168.0.101/mysql -username test \
-P --warehouse-dir /home/user_all_tables -m 1
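If you do need parallelism for a particular table that has no primary key, one option is to import that table on its own with --split-by; thread_id below is only an assumed split column, so substitute a reasonably evenly distributed column from your table. Depending on your Sqoop version, --autoreset-to-one-mapper may also let import-all-tables fall back to a single mapper only for the keyless tables.
sqoop import --connect jdbc:mysql://192.168.0.101/mysql -username test -P \
--table general_log \
--split-by thread_id \
-m 4 \
--warehouse-dir /home/user_all_tables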

Share sqoop incremental last value between two jobs

I have a sqoop job that records the incremental last value to do incremental appends throughout the day. My problem is that my directory changes each day so that we can create partitions based on log_date.
I need to record --last-value throughout the day and then pass that value into a newly created job for the next day. Is it possible to call a method to get the last-value?
My current sqoop job, written in a shell script, looks like this:
sqoop job --create test_last_index \
-- import --connect jdbc:xxxx \
--password xxx \
--table test_$(date -d yesterday +%Y_%m_%d) \
--target-dir /dir/where/located \
--incremental append \
--check-column id \
--last-value 1
You need not call a method for the sqooping that you are doing. All you need to do is create a sqoop job and save it. Add the parameters --check-column, --incremental and --last-value to the sqoop job that you create. The --last-value will be picked up on each consecutive run and will be retained in the job. Then you can use sqoop job --exec to run the job periodically, and sqoop merge to merge the modified/appended data with the historical data.
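As a rough sketch of that flow, reusing the connection and paths from the question (the job name daily_append and the table name test_table are placeholders; note that a saved job assumes a stable table name, so with a table name that changes daily you would have to recreate the job each day):
sqoop job --create daily_append \
-- import --connect jdbc:xxxx \
--password xxx \
--table test_table \
--target-dir /dir/where/located \
--incremental append \
--check-column id \
--last-value 1
sqoop job --exec daily_append
Each --exec run starts from the last value stored in the Sqoop metastore and updates it afterwards.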
Hope this helps.
I have developed a sqoop script for incremental import as follows.
sqoop import \
--driver com.sap.db.jdbc.Driver \
--fetch-size 3000 \
--connect connectionURL \
--username test \
--password test \
--table DATA \
--where YEAR=2002 \
--check-column TIMESTAMP \
--incremental append \
--last-value "2016-06-22 12:31:37.0" \
--target-dir "/incremental_data_2002/year_partition=2002" \
--fields-terminated-by "," \
--lines-terminated-by "\n" \
--split-by YEAR \
--m 4
Now, the above script executes successfully.
In the above script I have hardcoded --last-value as "2016-06-22 12:31:37.0". When new data arrives in the source table in the RDBMS, I check the last value in the table again and modify the sqoop script manually. Instead of that, I want --last-value to be picked up dynamically, without hardcoding it in the sqoop script file.
Sadly, a plain sqoop import does not retrieve the last value automatically. As the Sqoop documentation explains, you should use a saved job:
At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information.
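Following that guidance, here is a sketch of the same import wrapped in a saved job (the job name data_2002_incr is made up; --last-value only needs to be seeded once, and Sqoop updates it in its metastore after every successful run; by default a saved job will prompt for the password at execution time unless the metastore is configured to store it):
sqoop job --create data_2002_incr \
-- import \
--driver com.sap.db.jdbc.Driver \
--connect connectionURL \
--username test -P \
--table DATA \
--where YEAR=2002 \
--check-column TIMESTAMP \
--incremental append \
--last-value "2016-06-22 12:31:37.0" \
--target-dir "/incremental_data_2002/year_partition=2002" \
--split-by YEAR \
-m 4
sqoop job --exec data_2002_incr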

Configuring Sqoop with MySQL?

I have successfully installed Sqoop. Now the problem is how to connect it to an RDBMS and how to load data from the RDBMS into HDFS using Sqoop.
Using Sqoop you can load data directly into Hive tables or store the data in some target directory in HDFS.
If you need to copy data from an RDBMS into some directory:
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password {in case there is no password, do not specify it}
--table tableName
--columns column_name(s) {in case you need only specific columns}
--target-dir '/tmp/myfolder'
--boundary-query 'Select min(col), max(col) from tableName'
--m 5 {set number of mappers to 5}
--fields-terminated-by ',' {how you want your data delimited in the target file}
Boundary query: this is something you can specify. If you do not specify it, then by default it is run as an inner query, which adds up to a more complex query.
If you specify it explicitly, it runs as a normal query and hence performance is improved.
You may also want to restrict the rows you import, say based on a column ID, and suppose you need data for IDs 1 to 1000. Then, using a boundary condition together with --split-by, you can restrict the imported data:
--boundary-query "select 0,1000 from employee"
--split-by ID
Split-by: you use --split-by on a Sqoop import to specify the column on which the splits are made. By default, if you do not specify it, sqoop picks the table's primary key as the split-by column.
Split-by divides the rows among the mappers, which write them to separate files (one part file per mapper) under the target directory. By default the number of mappers is 4.
This may seem unimportant, but if you have a composite primary key or no primary key at all, sqoop cannot pick a split column and the import errors out.
Note: You will not face any issue if you set the number of mappers to 1. In this case, no split-by condition is used since there is only one mapper, so the query runs fine. This can be done using
--m 1
If you need to copy data from an RDBMS into a Hive table:
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password {in case there is no password, do not specify it}
--table tableName
--boundary-query 'Select min(col), max(col) from tableName'
--m 5 {set number of mappers to 5}
--hive-import
--hive-table serviceorderdb.productinfo
Running a query instead of importing the entire table:
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password
--query "select name from employees where name like '%s' and \$CONDITIONS"
--m 5 {set number of mappers to 5}
--target-dir '/tmp/myfolder'
--fields-terminated-by ',' {how you want your data delimited in the target file}
You may notice the extra parameter $CONDITIONS. It is needed because this time you specified no table, only an explicit query. When Sqoop runs, it looks for a boundary condition, which it does not find; it then looks for a table and a primary key on which to base the boundary query, which again it does not find. So we use $CONDITIONS as a placeholder that lets Sqoop inject its own boundary condition into the query. Note also that when you use --query with more than one mapper, you must supply --split-by as well.
Checking if your connection is set up properly: for this you can just call list-databases, and if you see your databases listed then your connection is fine.
$ sqoop list-databases
--connect jdbc:mysql://localhost/
--username root
--password pwd
Connection strings for different databases:
MySQL: jdbc:mysql://<hostname>:<port>/<dbname>
jdbc:mysql://127.0.0.1:3306/test_database
Oracle: jdbc:oracle:thin:@//<host_name>:<port_number>/<service_name>
jdbc:oracle:thin:scott/tiger@//myhost:1521/myservicename
You may learn more about sqoop imports from: https://sqoop.apache.org/docs/1.4.1-incubating/SqoopUserGuide.html
By using the sqoop import command you can import data from an RDBMS into HDFS, Hive and HBase.
sqoop import --connect jdbc:mysql://localhost:portnumber/DBName --username root --table emp --password root -m 1
With this command the data will be stored in HDFS.
Sample commands to run sqoop import (load data from RDBMS to HDFS):
Postgres
sqoop import --connect jdbc:postgresql://postgresHost/databaseName \
--username username --password 123 --table tableName
MySQL
sqoop import --connect jdbc:mysql://mysqlHost/databaseName --username username --password 123 --table tableName
Oracle*
sqoop import --connect jdbc:oracle:thin:@oracleHost:1521/databaseName --username USERNAME --password 123 --table TABLENAME
SQL Server
sqoop import --connect 'jdbc:sqlserver://sqlserverhost:1433;database=dbname;username=<username>;password=<password>' --table tableName
*Sqoop won't find any columns in a table if you don't specify both the username and the table name in the correct case. Usually, specifying both in uppercase resolves the issue.
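For example, a minimal sketch against the classic Oracle sample schema (SCOTT and EMP are purely illustrative; the point is that both names are given in uppercase):
sqoop import --connect jdbc:oracle:thin:@oracleHost:1521/databaseName \
--username SCOTT --password tiger --table EMP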
Read the Sqoop User's Guide: https://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html
I also recommend the Apache Sqoop Cookbook. You will learn how to use import and export tools, do incremental import jobs, save jobs, solve problems with jdbc drivers and much more. http://shop.oreilly.com/product/0636920029519.do

Sqoop creating insert statements containing multiple records

We are trying to load data into Netezza using Sqoop, and we are facing the following issue.
java.io.IOException: org.netezza.error.NzSQLException: ERROR:
Example Input dataset is as shown below:
1,2,3
1,3,4
The sqoop command is as shown below:
sqoop export --table <tablename> --export-dir <path> \
--input-fields-terminated-by '\t' --input-lines-terminated-by '\n' --connect \
'jdbc:netezza://<host>/<db>' --driver org.netezza.Driver \
--username <username> --password <passwrd>
Sqoop is generating an insert statement in the following way:
insert into <tablename> (c1,c2,c3) values (1,2,3),(1,3,4)
We are able to load a single record, but when we try to load multiple records, we get the error shown above.
Your help is highly appreciated.
Setting sqoop.export.records.per.statement=1 will definitely help, but it will make the export process extremely slow if your export record count is very large, say 5 million.
To solve this you need to add the following things:
1.) A properties file, sqoop.properties; it must contain the property jdbc.transaction.isolation=TRANSACTION_READ_UNCOMMITTED (it avoids deadlocks during exports).
You also need to point the export command at it:
--connection-param-file /path/to/sqoop.properties
2.) Set sqoop.export.records.per.statement=100; this will increase the speed of the export.
3.) Add --batch to use batch mode for the underlying statement execution.
So your final export will look like this:
sqoop export -D sqoop.export.records.per.statement=100 --table <tablename> --export-dir <path> \
--input-fields-terminated-by '\t' --input-lines-terminated-by '\n' --connect \
'jdbc:netezza://<host>/<db>' --driver org.netezza.Driver \
--username <username> --password <passwrd> \
--connection-param-file /path/to/sqoop.properties \
--batch
Hope this will help.
You can customise the number of rows that will be used in one insert statement with the property sqoop.export.records.per.statement. For example, for Netezza you will need to set it to 1:
sqoop export -Dsqoop.export.records.per.statement=1 --connect ...
I would recommend you to also take a look on Apache Sqoop Cookbook where this and many other tips are described.
