How to protect password and username in Sqoop? - hadoop

I want to hide the password that I use to import data from my RDBMS into the Hadoop cluster. I am using --options-file to keep my password and username in a text file, but it's not protected.
Can I apply some kind of encryption to that file for better protection?

A secure way of supplying the password to the database:
You should save the password in a file in the user's home directory with 400 permissions and specify the path to that file using the --password-file argument; this is the preferred method of entering credentials. Sqoop will then read the password from the file and pass it to the MapReduce cluster using secure means, without exposing the password in the job configuration. The file containing the password can be either on the local FS or on HDFS. For example:
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
--username venkatesh --password-file ${user.home}/.password
Check the Sqoop docs for more details.
Also, you can use the -P option to read the password from the console.
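For example, a minimal sketch of preparing such a file (the password value and the HDFS path below are placeholders):
$ echo -n "MySecretPassword" > ${HOME}/.password
$ chmod 400 ${HOME}/.password
# optionally keep the file on HDFS instead of the local FS (illustrative path)
$ hdfs dfs -put ${HOME}/.password /user/venkatesh/.password
$ hdfs dfs -chmod 400 /user/venkatesh/.password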

It seems that this question has been addressed previously here,
and it is also described on this hortonworks page; it basically consists of creating an .enc file. You also need to configure several parameters, such as the passphrase used to decrypt the file.
sqoop import \
-Dorg.apache.sqoop.credentials.loader.class=org.apache.sqoop.util.password.CryptoFileLoader \
-Dorg.apache.sqoop.credentials.loader.crypto.passphrase=sqoop2 \
--connect jdbc:mysql://example.com/sqoop \
--username sqoop \
--password-file file:///tmp/pass.enc \
--table tbl
Here are the multiple parameters that can be configured (again, following the reference):
org.apache.sqoop.credentials.loader.class – the credentials loader class.
org.apache.sqoop.credentials.loader.crypto.alg – the algorithm used to decrypt the file (default is AES/ECB/PKCS5Padding).
org.apache.sqoop.credentials.loader.crypto.salt – the salt used to derive a key with the passphrase (default is SALT).
org.apache.sqoop.credentials.loader.crypto.iterations – the number of PBKDF2 iterations (default is 10000).
org.apache.sqoop.credentials.loader.crypto.salt.key.len – the derived key length (default is 128).
org.apache.sqoop.credentials.loader.crypto.passphrase – the passphrase used to derive the key.
Alternatively, you can also follow the Sqoop documentation page and create a password alias that gets retrieved by an implementation of the CredentialProviderPasswordLoader class. You can see the whole class here.
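For illustration, a hedged sketch of the alias approach (the provider path and alias name are made up; hadoop credential create prompts for the password interactively):
$ hadoop credential create mydb.password.alias \
-provider jceks://hdfs/user/venkatesh/password.jceks
$ sqoop import \
-Dhadoop.security.credential.provider.path=jceks://hdfs/user/venkatesh/password.jceks \
--connect jdbc:mysql://example.com/sqoop \
--username sqoop \
--password-alias mydb.password.alias \
--table tbl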

Related

Sqoop fails with password-file argument

I have a sqoop script which ingests data from SAP HANA into Hive. The sqoop script runs fine when I give the password as an argument ("--password Password$$"), but to secure the password I put it in a file called sap.password and used the argument "--password-file /dev/configs/sap.password". Now the sqoop script returns an exception.
Below are my sqoop script and the exception that occurred:
sqoop import \
--connect jdbc:sap://hostname?currentschema=SCHEMA_REF \
--driver com.sap.db.jdbc.Driver \
--username SERVICE_ACCOUNT \
--password-file /dev/configs/sap.password \
--table TABLE1 \
--hive-import \
--hive-overwrite \
--hive-database cdc_stg \
--hive-table HIVE_TABLE1 \
--as-parquetfile \
--m 1
The exception that I get is (I'm sure the credentials are correct):
19/11/14 05:47:08 ERROR manager.SqlManager: Error executing statement:
com.sap.db.jdbc.exceptions.jdbc40.SQLInvalidAuthorizationSpecException: [10]: authentication failed
com.sap.db.jdbc.exceptions.jdbc40.SQLInvalidAuthorizationSpecException: [10]: authentication failed
at com.sap.db.jdbc.exceptions.jdbc40.SQLInvalidAuthorizationSpecException.createException(SQLInvalidAuthorizationSpecException.java:40)
at com.sap.db.jdbc.exceptions.SQLExceptionSapDB.createException(SQLExceptionSapDB.java:290)
at com.sap.db.jdbc.exceptions.SQLExceptionSapDB.generateDatabaseException(SQLExceptionSapDB.java:174)
at com.sap.db.jdbc.packet.ReplyPacket.buildExceptionChain(ReplyPacket.java:100)
at com.sap.db.jdbc.ConnectionSapDB.execute(ConnectionSapDB.java:1141)
at com.sap.db.jdbc.ConnectionSapDB.execute(ConnectionSapDB.java:888)
at com.sap.db.util.security.AbstractAuthenticationManager.connect(AbstractAuthenticationManager.java:43)
at com.sap.db.jdbc.ConnectionSapDB.openSession(ConnectionSapDB.java:586)
at com.sap.db.jdbc.ConnectionSapDB.doConnect(ConnectionSapDB.java:436)
at com.sap.db.jdbc.ConnectionSapDB.<init>(ConnectionSapDB.java:195)
at com.sap.db.jdbc.ConnectionSapDBFinalize.<init>(ConnectionSapDBFinalize.java:13)
at com.sap.db.jdbc.Driver.connect(Driver.java:255)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:247)
at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:903)
at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:59)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:762)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:785)
at org.apache.sqoop.manager.SqlManager.getColumnInfoForRawQuery(SqlManager.java:288)
at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:259)
at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:245)
at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:333)
at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1879)
at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1672)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:106)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:515)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:633)
at org.apache.sqoop.Sqoop.run(Sqoop.java:146)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:182)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:233)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:242)
at org.apache.sqoop.Sqoop.main(Sqoop.java:251)
19/11/14 05:47:08 ERROR tool.ImportTool: Import failed: java.io.IOException: No columns to generate for ClassWriter
at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1678)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:106)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:515)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:633)
at org.apache.sqoop.Sqoop.run(Sqoop.java:146)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:182)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:233)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:242)
at org.apache.sqoop.Sqoop.main(Sqoop.java:251)
I suspect the password file might have been created with a trailing newline character, since --password works fine and the only change made is the switch to a password file.
Please re-create the password file following the Sqoop docs warning quoted below.
Reference: SqoopUserGuide
Sqoop will read the entire content of the password file and use it as a password. This will include any trailing white space characters such as newline characters that are added by default by most of the text editors. You need to make sure that your password file contains only characters that belong to your password. On the command line, you can use command echo with switch -n to store password without any trailing white space characters.
For example, to store the password secret:
echo -n "secret" > password.file
Also, instead of sqoop import, try list-databases, list-tables or eval to test the connection with the password file.
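For example, a quick connectivity check along those lines, reusing the connection settings from the question, might look like:
sqoop list-tables \
--connect jdbc:sap://hostname?currentschema=SCHEMA_REF \
--driver com.sap.db.jdbc.Driver \
--username SERVICE_ACCOUNT \
--password-file /dev/configs/sap.password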
Please also check the password file permissions. From the Sqoop docs:
You should save the password in a file on the user's home directory with 400 permissions
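For example (assuming the file lives on the local filesystem; use hdfs dfs -chmod instead if it is on HDFS):
chmod 400 /dev/configs/sap.password
ls -l /dev/configs/sap.password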

Import all the tables from RDBMS using sqoop

I am trying to import data from a test MySQL database into Hadoop using sqoop. Some of the tables have a primary key and some of them do not.
$sqoop import-all-tables --connect jdbc:mysql://192.168.0.101/mysql -username test -P --warehouse-dir /home/user_all_tables
17/08/01 22:46:54 ERROR tool.ImportAllTablesTool: Error during import:
No primary key could be found for table general_log. Please specify
one with --split-by or perform a sequential import with '-m 1'.
Kindly suggest how to use --split-by on the sqoop command line.
For the import-all-tables tool to be useful, the following conditions must be met:
Each table must have a single-column primary key.
You must intend to import all columns of each table.
You must not intend to use non-default splitting column, nor impose any conditions via a WHERE clause.
The default behaviour does not work for tables without a primary key, which is why the import fails. Here I would suggest using the -m 1 option to restrict the import to a single mapper.
Sqoop command:
import-all-tables --connect jdbc:mysql://192.168.0.101/mysql -username test \
-P --warehouse-dir /home/user_all_tables -m 1
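Depending on your Sqoop version, two alternative sketches are to fall back to a single mapper only for tables without a primary key, or to skip the offending table entirely (general_log is taken from the error message):
sqoop import-all-tables --connect jdbc:mysql://192.168.0.101/mysql -username test \
-P --warehouse-dir /home/user_all_tables --autoreset-to-one-mapper
sqoop import-all-tables --connect jdbc:mysql://192.168.0.101/mysql -username test \
-P --warehouse-dir /home/user_all_tables --exclude-tables general_log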

What is the relevance of -m 1?

I am executing the sqoop command below:
sqoop import --connect 'jdbc:sqlserver://10.xxx.xxx.xx:1435;database=RRAM_Temp' --username DRRM_DATALOADER --password ****** --table T_VND --hive-import --hive-table amitesh_db.amit_hive_test --as-textfile --target-dir amitesh_test_hive -m 1
I have two queries:
1) What is the relevance of -m 1? As far as I know, it's the number of mappers that I am assigning to the sqoop job. If that is true, then the moment I assign -m 2, the execution starts throwing the error below:
ERROR tool.ImportTool: Error during import: No primary key could be found for table xxx. Please specify one with --split-by or perform a sequential import with '-m 1'
Now I am forced to rethink; I see it has something to do with the database primary key. Can somebody explain the logic behind this?
2) I asked the above sqoop command to save the output in text file format. But when I go to the location suggested by the execution, I find tbl_name.jar. Why? If --as-textfile is the wrong syntax, then what is the right one? Or is there another location where I can find the file?
1) To set -m or --num-mappers to a value greater than 1, the table must either have a PRIMARY KEY or the sqoop command must be given a --split-by column. Controlling Parallelism explains the logic behind this.
2) The file format of the data imported into the Hive table amit_hive_test will be plain text (--as-textfile). As this is a --hive-import, the data is first imported into the --target-dir and then loaded (LOAD DATA INPATH) into the Hive table. The resulting data will be inside the table's LOCATION and not in --target-dir.
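For instance, a hedged sketch of the same import run with two mappers by naming a split column (SOME_NUMERIC_COLUMN is a placeholder for a suitable column of T_VND):
sqoop import --connect 'jdbc:sqlserver://10.xxx.xxx.xx:1435;database=RRAM_Temp' \
--username DRRM_DATALOADER --password ****** --table T_VND \
--split-by SOME_NUMERIC_COLUMN -m 2 \
--hive-import --hive-table amitesh_db.amit_hive_test --as-textfile --target-dir amitesh_test_hive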

Sqoop import converting TINYINT to BOOLEAN

I am attempting to import a MySQL table of NFL play results into HDFS using Sqoop. I issued the following command to achieve this:
sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/nfl \
--username <username> -P \
--table play
Unfortunately, there are columns of type TINYINT, which are being converted to booleans upon import. For instance, there is a 'quarter' column indicating which quarter of the game the play occurred in. The value in this column is converted to 'true' if the play occurred in the first quarter and 'false' otherwise.
In fact, I did a sqoop import-all-tables, importing the entire NFL database I have, and it behaves like this uniformly.
Is there a way around this, or perhaps some argument for import or import-all-tables that prevents this from happening?
Add tinyInt1isBit=false to your JDBC connection URL. Something like:
jdbc:mysql://127.0.0.1:3306/nfl?tinyInt1isBit=false
Another solution would be to explicitly override the column mapping for the TINYINT(1) column. For example, if the column name is foo, then pass the following option to Sqoop during import: --map-column-hive foo=tinyint. For non-Hive imports to HDFS, use --map-column-java foo=Integer.
Source
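For example, the import from the question with only the connection string changed (a sketch; everything else stays the same):
sqoop import \
--connect 'jdbc:mysql://127.0.0.1:3306/nfl?tinyInt1isBit=false' \
--username <username> -P \
--table play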

Sqoop Import is completed successfully. How to view these tables in Hive

I am trying out Hadoop and related tools. For this, I have configured Hadoop, HBase, Hive and Sqoop on an Ubuntu machine.
raghu#system4:~/sqoop$ bin/sqoop-import --connect jdbc:mysql://localhost:3306/mysql --username root --password password --table user --hive-import -m 1
Everything goes fine, but when I enter the hive command line and execute show tables, there is nothing. I am able to see that these tables are created in HDFS.
I have seen some options in Sqoop import - it can import to Hive/HDFS/HBase.
When importing into Hive, it is indeed importing directly into HDFS. Then why Hive?
Where can I execute HiveQL to check the data?
From Cloudera support, I understood that I can use Hue to check it. But I think Hue is just a user interface to Hive.
Could someone help me here?
Thanks in advance,
Raghu
I was having the same issue. I was able to work around it by importing the data directly into HDFS and then creating an external Hive table pointing at that specific location in HDFS. Here is an example that works for me.
create external table test (
sequencenumber int,
recordkey int,
linenumber int,
type string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
location '/user/hdfs/testdata';
You will need to change your location to where you saved the data in HDFS.
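Once the external table exists, a quick sanity check might look like this (paths taken from the example above):
hdfs dfs -ls /user/hdfs/testdata
hive -e 'select * from test limit 5'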
Can you post the output from sqoop? Try using the --verbose option.
Here's an example of the command I use, and it does import directly to a Hive table.
sqoop import --hive-overwrite --hive-drop-import-delims --warehouse-dir "/warehouse" --hive-table hive_users --connect jdbc:mysql://$MYSQL_HOST/$DATABASE_NAME --table users --username $MYSQL_USER --password $MYSQL_PASS --hive-import
When we do not specify any database in the sqoop import command, the table is created in Hive's default database with the same name as the RDBMS table.
You can specify the Hive database into which you want to import the RDBMS table with --hive-database.
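For example, a minimal sketch reusing the connection details from the earlier command, assuming your Sqoop version supports --hive-database as described above (my_hive_db is a placeholder for an existing Hive database):
sqoop import --connect jdbc:mysql://localhost:3306/mysql --username root -P \
--table user --hive-import --hive-database my_hive_db --hive-table user -m 1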
Instead of creating the Hive table every time, you can import the table structure into Hive using Sqoop's create-hive-table command. It will import the table as a managed table; you can then convert it to an external table by changing the table properties and add partitions. This reduces the effort of finding the right data types. Please note that there may be precision changes.
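A hedged sketch of that approach, again reusing the connection details from the question (my_hive_db.user is an illustrative target):
sqoop create-hive-table --connect jdbc:mysql://localhost:3306/mysql \
--username root -P --table user --hive-table my_hive_db.user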
Whenever you use Sqoop with the Hive import option, sqoop connects directly to the source database and gets the corresponding table's metadata (the table's schema), so there is no need to create a table structure in Hive beforehand. This schema is then provided to Hive when the --hive-import option is used.
Without that option, the output of a sqoop import is by default stored on HDFS under the default directory, i.e. /user/sqoop/tablename/part-m files.
With the hive import option, the tables are loaded directly into the default warehouse directory, i.e.
/user/hive/warehouse/tablename
Command: sudo -u hdfs hadoop fs -ls -R /user/
This recursively lists all the files under /user/.
Now go to Hive and type show databases. If there is only the default database,
then type show tables.
Remember that OK is Hive's normal status output and is not part of the command output.
hive> show databases;
OK
default
Time taken: 0.172 seconds
hive> show tables;
OK
genre
log_apache
movie
moviegenre
movierating
occupation
user
Time taken: 0.111 seconds
Try a sqoop command like this; it works for me and directly creates the Hive table, so you need not create an external table every time:
sqoop import --connect DB_HOST --username ***** --password ***** \
--query "select * from SCHEMA.TABLE where \$CONDITIONS" \
--num-mappers 5 --split-by PRIMARY_KEY --hive-import --hive-table HIVE_DB.HIVE_TABLE_NAME --target-dir SOME_DIR_NAME
The command you are using imports data into the $HIVE_HOME directory. If the HIVE_HOME environment variable is not set or points to a wrong directory, you will not be able to see imported tables.
The best way to find the hive home directory is to use the Hive QL SET command:
hive -S -e 'SET' | grep warehouse.dir
Once you have retrieved the hive home directory, append the --hive-home <hive-home-dir> option to your command.
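For example, assuming Hive is installed under /usr/local/hive (a placeholder for whatever your setup actually uses), the command from the question would become:
bin/sqoop-import --connect jdbc:mysql://localhost:3306/mysql --username root --password password \
--table user --hive-import --hive-home /usr/local/hive -m 1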
Another possible reason is that in some setups the table metadata is cached and you cannot see the changes immediately. In that case you need to flush the metadata cache with the INVALIDATE METADATA; command (this applies when you are querying through Impala rather than the Hive CLI).
