SQOOP is not able to create a hive table in PARQUET format.
sqoop import --connect jdbc:mysql://localhost:3306/sqoop_practice --username root --password hortonworks1 -m 1 --delete-target-dir --target-dir /user/hduser/sqoop/import_customer --driver com.mysql.jdbc.Driver --fetch-size 1000 --table customer --fields-terminated-by '~' --hive-import --hive-database hivepractice --hive-table customer_parquet --as-parquetfile;
Error: Caused by: MetaException(message:Table hivepractice.customer_parquet failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.)
I am running below command to import data from Oracle Db to HIVE
sqoop import --connect jdbc:oracle:thin:#//localhost:1521/newDB --username <USERNAME> --P --table <ORACLE_TABLE_NAME> --hive-table <HIVE_TABLE_NAME> --hive-import -m 1
I am getting below Error when i am running this Query
17/11/21 05:05:46 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
17/11/21 05:05:46 INFO manager.SqlManager: Using default fetchSize of 1000
17/11/21 05:05:46 INFO tool.CodeGenTool: Beginning code generation
17/11/21 05:05:47 INFO manager.OracleManager: Time zone has been set to GMT
17/11/21 05:05:47 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "<TABLE_NAME>" t WHERE 1=0
17/11/21 05:05:47 ERROR tool.ImportTool: Imported Failed: There is no column found in the target table <TABLE_NAME>. Please ensure that your table name is correct.
Check the owner of table and try using table_owner.table_name for oracle source.
I find you are not used --create-hive-table and few other parameter in your query.
Below is the sqoop import query i use in my project:
oracle_connection.txt will have the connection info.
sqoop --options-file oracle_connection.txt \
--table $DATABASE.$TABLENAME \
-m $NUMMAPPERS \
--where "$CONDITION" \
--hive-import \
--map-column-hive "$COLLIST" \
--create-hive-table \
--hive-drop-import-delims \
--split-by $SPLITBYCOLUMN \
--hive-table $HIVEDATABASE.$TABLENAME \
--bindir sqoop_hive_rxhome/bindir/ \
--outdir sqoop_hive_rxhome/outdir
I am using below given sqoop command:
sqoop import
--libjars /usr/hdp/2.4.0.0-169/sqoop/lib,/usr/hdp/2.4.0.0-169/hive/lib
--connect jdbc:teradata://x/DATABASE=x
--connection-manager org.apache.sqoop.teradata.TeradataConnManager
--username ec
--password dc
--query "select * from hb where yr_nbr=2017"
--hive-table schema.table
--num-mappers 1
--hive-import
--target-dir /user/hive/warehouse/GG
I'm getting this error:
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
17/04/06 11:15:41 INFO mapreduce.Job: map 100% reduce 0%
17/04/06 11:15:41 INFO mapreduce.Job: Task Id : attempt_1491466460468_0029_m_000000_1, Status : FAILED
Error: org.apache.hadoop.fs.FileAlreadyExistsException: /user/root/temp_111508/part-m-00000 for client 192.168.211.133 already exists
From the error, I can guess that the output file is already in your target directory, may be from your previous sqoop import. There is an option in sqoop import named --delete-target-dir which will delete your target output directory and re-create them in your next sqoop import. Hope that helps.
I have stored my database passwords in Hadoop CredentialProvider.
Sqoop import from terminal is working fine, successfully fetching the password from CredentialProvider.
sqoop import
-Dhadoop.security.credential.provider.path=jceks://hdfs/user/vijay/myPassword.jceks
--table myTable -m 1 --target-dir /user/vijay/output --delete-target-dir --username vijay --password-alias db2-dev-password
But when I try to setup as a Sqoop job, it is unable to recognize the -Dhadoop.security.credential.provider.path argument.
sqoop job --create my-sqoop-job -- import --table myTable -m 1 --target-dir /user/vijay/output --delete-target-dir --username vijay -Dhadoop.security.credential.provider.path=jceks://hdfs/user/vijay/myPassword.jceks --password-alias
Following is the error message:
14/04/05 13:57:53 ERROR tool.BaseSqoopTool: Error parsing arguments for import:
14/04/05 13:57:53 ERROR tool.BaseSqoopTool: Unrecognized argument: -Dhadoop.security.credential.provider.path=jceks://hdfs/user/vijay/myPassword.jceks
14/04/05 13:57:53 ERROR tool.BaseSqoopTool: Unrecognized argument: --password-alias
14/04/05 13:57:53 ERROR tool.BaseSqoopTool: Unrecognized argument: db2-dev-password
I couldn't find any special instructions in Sqoop User Guide for configuring Hadoop credential API with Sqoop Job.
How to resolve this issue?
Re positioning the Sqoop parameters solve the problem.
sqoop job -Dhadoop.security.credential.provider.path=jceks://hdfs/user/vijay/myPassword.jceks --create my-sqoop-job -- import --table myTable -m 1 --target-dir /user/vijay/output --delete-target-dir --username vijay --password-alias myPasswordAlias
Place the Hadoop credential before the Sqoop job keyword.
Your Sqoop job command is not proper, i.e. --password-alias is incomplete.
Please execute below command in your Hadoop server
hadoop credential list -provider jceks://hdfs/user/vijay/myPassword.jceks
Add the output in below Sqoop job command
sqoop job --create my-sqoop-job -- import --table myTable -m 1 --target-dir /user/vijay/output --delete-target-dir --username vijay -Dhadoop.security.credential.provider.path=jceks://hdfs/user/vijay/myPassword.jceks --password-alias <<output of above command>>
I'm trying to import data directly from mysql to parquet but it doesn't seem to work correctly...
I'm using CDH5.3 which includes Sqoop 1.4.5.
Here is my command line :
sqoop import --connect jdbc:mysql://xx.xx.xx.xx/database --username username --password mypass --query 'SELECT page_id,user_id FROM pages_users WHERE $CONDITIONS' --split-by page_id --hive-import --hive-table default.pages_users3 --target-dir hive_pages_users --as-parquetfile
Then I get this error :
Warning: /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/01/09 14:31:49 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.0
15/01/09 14:31:49 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/01/09 14:31:49 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
15/01/09 14:31:49 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
15/01/09 14:31:49 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
15/01/09 14:31:49 INFO tool.CodeGenTool: Beginning code generation
15/01/09 14:31:50 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:50 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:50 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:50 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
Note: /tmp/sqoop-root/compile/b90e7b492f5b66554f2cca3f88ef7a61/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/01/09 14:31:51 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/b90e7b492f5b66554f2cca3f88ef7a61/QueryResult.jar
15/01/09 14:31:51 INFO mapreduce.ImportJobBase: Beginning query import.
15/01/09 14:31:51 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/01/09 14:31:51 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:51 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:51 WARN spi.Registration: Not loading URI patterns in org.kitesdk.data.spi.hive.Loader
15/01/09 14:31:51 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI: hive?dataset=default.pages_users3
org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI: hive?dataset=default.pages_users3
at org.kitesdk.data.spi.Registration.lookupDatasetUri(Registration.java:109)
at org.kitesdk.data.Datasets.create(Datasets.java:189)
at org.kitesdk.data.Datasets.create(Datasets.java:240)
at org.apache.sqoop.mapreduce.ParquetJob.createDataset(ParquetJob.java:81)
at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:70)
at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:112)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:262)
at org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:721)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:499)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
I have no problem importing data to hive file format but parquet is a problem... Do you have any idea why this occurs ?
Thank you :)
Please do not use <db>.<table> with --hive-table. This doesn't work well with Parquet import. Sqoop uses Kite SDK to write Parquet files and it doesn't like this <db>.<table> format.
Instead, please use --hive-database --hive-table . for your command, it should be:
sqoop import --connect jdbc:mysql://xx.xx.xx.xx/database \
--username username --password mypass \
--query 'SELECT page_id,user_id FROM pages_users WHERE $CONDITIONS' --split-by page_id \
--hive-import --hive-database default --hive-table pages_users3 \
--target-dir hive_pages_users --as-parquetfile
Here's my pipeline in CDH 5.5 to import from a jdbc into Hive parquet files.
JDBC data source is for Oracle, but explanation below fits MySQL too.
1) Sqoop:
$ sqoop import --connect "jdbc:oracle:thin:#(complete TNS descriptor)" \
--username MRT_OWNER -P \
--compress --compression-codec snappy \
--as-parquetfile \
--table TIME_DIM \
--warehouse-dir /user/hive/warehouse \
--num-mappers 1
I chose --num-mappers as 1 because TIME_DIM table had just around ~20k rows, and it's not advised to split parquet table into multiple files for such a small dataset. Each mapper creates a separate output (parquet) file.
(ps. for Oracle users: I had to connect as owner of the source table, otherwise had to specify "MRT_OWNER.TIME_DIM", and was getting error org.kitesdk.data.ValidationException: Namespace MRT_OWNER.TIME_DIM is not alphanumeric (plus '_'), seems a sqoop bug).
(ps2. Table name had to be all-uppercase.. not sure if this is Oracle specific (shouldn't be) and if this is another sqoop bug).
(ps3. --compress --compression-codec snappy parameters were recognized but did not seem made any effect)
2) Above command creates a directory named
/user/hive/warehouse/TIME_DIM
It's a wise idea to move it to a specific Hive database directory, e.g.:
$ hadoop fs -mv /hivewarehouse/TIME_DIM /hivewarehouse/dwh.db/time_dim
Assuming name of Hive database/schema is "dwh".
3) Create Hive table, by taking schema directly from parquet file:
$ hadoop fs -ls /user/hive/warehouse/dwh.db/time_dim | grep parquet
-rwxrwx--x+ 3 hive hive 1216 2016-02-04 23:56 /user/hive/warehouse/dwh.db/time_dim/62679a1c-b848-426a-bb8e-9372328ddad7.parquet
If above command returns more than parquet file (it means you had more than one mapper, the --num-mappers parameter), you can pick any parquet file into the below command.
This command should run in Impala and not in Hive. Hive currently can't infer schema from parquet files, but Impala can:
[impala-shell] > CREATE TABLE dwh.time_dim
LIKE PARQUET '/user/hive/warehouse/dwh.db/time_dim/62679a1c-b848-426a-bb8e-9372328ddad7.parquet'
COMMENT 'sqooped from MRT_OWNER.TIME_DIM'
STORED AS PARQUET
LOCATION 'hdfs:///user/hive/warehouse/dwh.db/time_dim'
;
ps. It's also possible to infer schema from parquet using Spark, e.g.
spark.read.schema('hdfs:///user/hive/warehouse/dwh.db/time_dim')
4) Since table wasn't created in Hive (which collects stats automatically), it's a good idea to collect stats:
[impala-shell] > compute stats dwh.time_dim;
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_literal_sqoop_import_literal
--as-parquetfile
was added in Sqoop 1.4.6 (CDH 5.5).
(ps. for Oracle users: I had to connect as owner of the source table, otherwise had to specify "MRT_OWNER.TIME_DIM", and was getting error org.kitesdk.data.ValidationException: Namespace MRT_OWNER.TIME_DIM is not alphanumeric (plus '_'), seems a sqoop bug).
This can be fixed if database name and table name is written as db_name/table_name instead of db_name.table_name.
Seems like database support is missing in your distribution. It looks like it was added rather recently. Try setting --hive-table to --hive-table pages_users3 and removing --target-dir.
If the above doesn't work, try:
This blog post.
The docs.
Check with the user#sqoop.apache.org mailing list.
I found a solution, I droppped all the hive parts and use the target dir to store the data... Seems to work :
sqoop import --connect jdbc:mysql://xx.xx.xx.xx/database --username username --password mypass --query 'SELECT page_id,user_id FROM pages_users WHERE $CONDITIONS' --split-by page_id --target-dir /home/cloudera/user/hive/warehouse/soprism.db/pages_users3 --as-parquetfile -m 1
I then link to the directory making an external table from Impala...