Sqoop import --as-parquetfile with CDH5 - hadoop

I'm trying to import data directly from mysql to parquet but it doesn't seem to work correctly...
I'm using CDH5.3 which includes Sqoop 1.4.5.
Here is my command line :
sqoop import --connect jdbc:mysql://xx.xx.xx.xx/database --username username --password mypass --query 'SELECT page_id,user_id FROM pages_users WHERE $CONDITIONS' --split-by page_id --hive-import --hive-table default.pages_users3 --target-dir hive_pages_users --as-parquetfile
Then I get this error :
Warning: /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/01/09 14:31:49 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.0
15/01/09 14:31:49 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/01/09 14:31:49 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
15/01/09 14:31:49 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
15/01/09 14:31:49 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
15/01/09 14:31:49 INFO tool.CodeGenTool: Beginning code generation
15/01/09 14:31:50 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:50 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:50 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:50 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
Note: /tmp/sqoop-root/compile/b90e7b492f5b66554f2cca3f88ef7a61/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/01/09 14:31:51 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/b90e7b492f5b66554f2cca3f88ef7a61/QueryResult.jar
15/01/09 14:31:51 INFO mapreduce.ImportJobBase: Beginning query import.
15/01/09 14:31:51 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/01/09 14:31:51 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:51 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:51 WARN spi.Registration: Not loading URI patterns in org.kitesdk.data.spi.hive.Loader
15/01/09 14:31:51 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI: hive?dataset=default.pages_users3
org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI: hive?dataset=default.pages_users3
at org.kitesdk.data.spi.Registration.lookupDatasetUri(Registration.java:109)
at org.kitesdk.data.Datasets.create(Datasets.java:189)
at org.kitesdk.data.Datasets.create(Datasets.java:240)
at org.apache.sqoop.mapreduce.ParquetJob.createDataset(ParquetJob.java:81)
at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:70)
at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:112)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:262)
at org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:721)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:499)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
I have no problem importing data to hive file format but parquet is a problem... Do you have any idea why this occurs ?
Thank you :)

Please do not use <db>.<table> with --hive-table. This doesn't work well with Parquet import. Sqoop uses Kite SDK to write Parquet files and it doesn't like this <db>.<table> format.
Instead, please use --hive-database --hive-table . for your command, it should be:
sqoop import --connect jdbc:mysql://xx.xx.xx.xx/database \
--username username --password mypass \
--query 'SELECT page_id,user_id FROM pages_users WHERE $CONDITIONS' --split-by page_id \
--hive-import --hive-database default --hive-table pages_users3 \
--target-dir hive_pages_users --as-parquetfile

Here's my pipeline in CDH 5.5 to import from a jdbc into Hive parquet files.
JDBC data source is for Oracle, but explanation below fits MySQL too.
1) Sqoop:
$ sqoop import --connect "jdbc:oracle:thin:#(complete TNS descriptor)" \
--username MRT_OWNER -P \
--compress --compression-codec snappy \
--as-parquetfile \
--table TIME_DIM \
--warehouse-dir /user/hive/warehouse \
--num-mappers 1
I chose --num-mappers as 1 because TIME_DIM table had just around ~20k rows, and it's not advised to split parquet table into multiple files for such a small dataset. Each mapper creates a separate output (parquet) file.
(ps. for Oracle users: I had to connect as owner of the source table, otherwise had to specify "MRT_OWNER.TIME_DIM", and was getting error org.kitesdk.data.ValidationException: Namespace MRT_OWNER.TIME_DIM is not alphanumeric (plus '_'), seems a sqoop bug).
(ps2. Table name had to be all-uppercase.. not sure if this is Oracle specific (shouldn't be) and if this is another sqoop bug).
(ps3. --compress --compression-codec snappy parameters were recognized but did not seem made any effect)
2) Above command creates a directory named
/user/hive/warehouse/TIME_DIM
It's a wise idea to move it to a specific Hive database directory, e.g.:
$ hadoop fs -mv /hivewarehouse/TIME_DIM /hivewarehouse/dwh.db/time_dim
Assuming name of Hive database/schema is "dwh".
3) Create Hive table, by taking schema directly from parquet file:
$ hadoop fs -ls /user/hive/warehouse/dwh.db/time_dim | grep parquet
-rwxrwx--x+ 3 hive hive 1216 2016-02-04 23:56 /user/hive/warehouse/dwh.db/time_dim/62679a1c-b848-426a-bb8e-9372328ddad7.parquet
If above command returns more than parquet file (it means you had more than one mapper, the --num-mappers parameter), you can pick any parquet file into the below command.
This command should run in Impala and not in Hive. Hive currently can't infer schema from parquet files, but Impala can:
[impala-shell] > CREATE TABLE dwh.time_dim
LIKE PARQUET '/user/hive/warehouse/dwh.db/time_dim/62679a1c-b848-426a-bb8e-9372328ddad7.parquet'
COMMENT 'sqooped from MRT_OWNER.TIME_DIM'
STORED AS PARQUET
LOCATION 'hdfs:///user/hive/warehouse/dwh.db/time_dim'
;
ps. It's also possible to infer schema from parquet using Spark, e.g.
spark.read.schema('hdfs:///user/hive/warehouse/dwh.db/time_dim')
4) Since table wasn't created in Hive (which collects stats automatically), it's a good idea to collect stats:
[impala-shell] > compute stats dwh.time_dim;
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_literal_sqoop_import_literal

--as-parquetfile
was added in Sqoop 1.4.6 (CDH 5.5).

(ps. for Oracle users: I had to connect as owner of the source table, otherwise had to specify "MRT_OWNER.TIME_DIM", and was getting error org.kitesdk.data.ValidationException: Namespace MRT_OWNER.TIME_DIM is not alphanumeric (plus '_'), seems a sqoop bug).
This can be fixed if database name and table name is written as db_name/table_name instead of db_name.table_name.

Seems like database support is missing in your distribution. It looks like it was added rather recently. Try setting --hive-table to --hive-table pages_users3 and removing --target-dir.
If the above doesn't work, try:
This blog post.
The docs.
Check with the user#sqoop.apache.org mailing list.

I found a solution, I droppped all the hive parts and use the target dir to store the data... Seems to work :
sqoop import --connect jdbc:mysql://xx.xx.xx.xx/database --username username --password mypass --query 'SELECT page_id,user_id FROM pages_users WHERE $CONDITIONS' --split-by page_id --target-dir /home/cloudera/user/hive/warehouse/soprism.db/pages_users3 --as-parquetfile -m 1
I then link to the directory making an external table from Impala...

Related

SemanticException 10072: database does not exist (Sqoop)

I created a hive internal table using sqoop command.
sqoop import -Dmapreduce.map.memory.mb=4096
--driver com.mysql.jdbc.Driver
--connect 'jdbc:mysql://{mysql_url}'
--username 'xxxx'
--password 'xxxx'
--input-fields-terminated-by '\t'
--split-by id
--target-dir {hdfs_path}
--verbose -m 1
--hive-drop-import-delims
--fields-terminated-by '\t'
--hive-import
--hive-table '{table_name}'
--query "select id from temp WHERE \$CONDITIONS LIMIT 10"
I created a table into it and it was working find.
19/01/06 19:33:44 DEBUG hive.TableDefWriter: Load statement: LOAD DATA INPATH 'hdfs://hadoop/{hdfs_path}' INTO TABLE `tmp.temp`
19/01/06 19:33:44 INFO hive.HiveImport: Loading uploaded data into Hive
19/01/06 19:33:44 DEBUG hive.HiveImport: Using in-process Hive instance.
19/01/06 19:33:44 DEBUG util.SubprocessSecurityManager: Installing subprocess security manager
Logging initialized using configuration in jar:file:${HADOOP_HOME}/hive-1.1.0-cdh5.14.2/lib/hive-common-1.1.0-cdh5.14.2.jar!/hive-log4j.properties
It created in hdfs warehouse location.
$ hadoop dfs -ls {hdfs_path}
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
19/01/06 19:43:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
0 2019-01-06 19:33 {hdfs_path}/_SUCCESS
65 2019-01-06 19:33 {hdfs_path}/part-m-00000.gz
BUT it was an error:
FAILED: SemanticException [Error 10072]: Database does not exist: tmp
I already hive-site.xml into sqoop conf directory.
cp ${HIVE_HOME}/conf/hive-site.xml ${SQOOP_HOME}/conf/hive-site.xml
"hive.metastore.uris" was set local and remote thrift.
How do I do? Help me. Thanks
Please used this sqoop command to import data from sqoop to hive with existing table and change clause according to your requirement.i have just modified some clause and added some clause as per requirement in your command
ubuntu#localhost:/usr/local/hive$ sqoop import -Dmapreduce.map.memory.mb=4096 --connect 'jdbc:mysql://localhost/test' --username 'root' -P --input-fields-terminated-by ',' --split-by id --target-dir /user/hive/warehouse/test_hive --hive-drop-import-delims --fields-terminated-by ',' --hive-import --hive-database default --hive-table test_hive --query "select id from test WHERE \$CONDITIONS LIMIT 10" --driver com.mysql.jdbc.Driver --delete-target-dir
Happy Hadooppppppppp

Getting Error When trying to Import Data From Oracle to HIVE

I am running below command to import data from Oracle Db to HIVE
sqoop import --connect jdbc:oracle:thin:#//localhost:1521/newDB --username <USERNAME> --P --table <ORACLE_TABLE_NAME> --hive-table <HIVE_TABLE_NAME> --hive-import -m 1
I am getting below Error when i am running this Query
17/11/21 05:05:46 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
17/11/21 05:05:46 INFO manager.SqlManager: Using default fetchSize of 1000
17/11/21 05:05:46 INFO tool.CodeGenTool: Beginning code generation
17/11/21 05:05:47 INFO manager.OracleManager: Time zone has been set to GMT
17/11/21 05:05:47 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "<TABLE_NAME>" t WHERE 1=0
17/11/21 05:05:47 ERROR tool.ImportTool: Imported Failed: There is no column found in the target table <TABLE_NAME>. Please ensure that your table name is correct.
Check the owner of table and try using table_owner.table_name for oracle source.
I find you are not used --create-hive-table and few other parameter in your query.
Below is the sqoop import query i use in my project:
oracle_connection.txt will have the connection info.
sqoop --options-file oracle_connection.txt \
--table $DATABASE.$TABLENAME \
-m $NUMMAPPERS \
--where "$CONDITION" \
--hive-import \
--map-column-hive "$COLLIST" \
--create-hive-table \
--hive-drop-import-delims \
--split-by $SPLITBYCOLUMN \
--hive-table $HIVEDATABASE.$TABLENAME \
--bindir sqoop_hive_rxhome/bindir/ \
--outdir sqoop_hive_rxhome/outdir

Sqoop export to sql server error

I want to export hdfs file to sql server. I'm using sqoop for that
sqoop export --bindir . --connect "jdbc:sqlserver://server;database=db" --username sa --password pwd --table sqoop_test -m 1 --export-dir /user/sqooptest
but i get the following error.
Warning: /usr/local/sqoop/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /usr/local/sqoop/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
sqoop export --bindir . --connect "jdbc:sqlserver://server;database=db" --username sa --password pwd --table sqoop_test -m 1 --export-dir /user/sqooptest
Warning: /usr/local/sqoop/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
16/07/30 03:59:06 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
16/07/30 03:59:07 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/07/30 03:59:07 INFO manager.SqlManager: Using default fetchSize of 1000
16/07/30 03:59:07 INFO tool.CodeGenTool: Beginning code generation
16/07/30 03:59:07 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM [sqoop_test] AS t WHERE 1=0
16/07/30 03:59:07 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/local/hadoop/hadoop-2.6.0
Note: ./sqoop_test.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/07/30 03:59:10 INFO orm.CompilationManager: Writing jar file: ./sqoop_test.jar
16/07/30 03:59:10 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.NullPointerException
java.lang.NullPointerException
at java.util.Objects.requireNonNull(Objects.java:203)
at java.util.Arrays$ArrayList.<init>(Arrays.java:3813)
at java.util.Arrays.asList(Arrays.java:3800)
at org.apache.sqoop.util.FileListing.getFileListingNoSort(FileListing.java:76)
at org.apache.sqoop.util.FileListing.getFileListingNoSort(FileListing.java:82)
at org.apache.sqoop.util.FileListing.getFileListingNoSort(FileListing.java:82)
at org.apache.sqoop.util.FileListing.getFileListing(FileListing.java:67)
at com.cloudera.sqoop.util.FileListing.getFileListing(FileListing.java:39)
at org.apache.sqoop.orm.CompilationManager.addClassFilesFromDir(CompilationManager.java:284)
at org.apache.sqoop.orm.CompilationManager.jar(CompilationManager.java:346)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:109)
at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:64)
at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:100)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
the file has only 3 rows with three columns each. it has no null values. I tried using --input-null-string as well.
my sql table :
create table sqoop_test
(id int,
name nvarchar(200),
title nvarchar(200))
and the file content in hdfs is,
5,X,analyst
6,Y,architect
7,Z,lead

Sqoop incremental import (db schema incorrect)

Im trying to do an incremental import using the Sanbox 2.1 and a Microsoft SQL Server (AdventureWorks database). For the incremental import i’m using the following command:
sqoop import --connect "jdbc:sqlserver://192.168.40.133:1434;database=AdventureWorksLT2012;username=test;password=test" --table ProductModel --hive-import --check-column ProductModelID --incremental append --last-value 128 -- --schema SalesLT
As you can see in the error message below, the select statement “SELECT MAX([SalesLT].[ProductModelID]) FROM ProductModel” is not constructed the right way.
The schema name is added to the column without the table name and the table name in the FROM clause is missing the schema name…
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
14/06/24 07:05:33 INFO sqoop.Sqoop: Running Sqoop version: 1.4.4.2.1.1.0-385
14/06/24 07:05:33 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
14/06/24 07:05:33 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
14/06/24 07:05:33 INFO manager.SqlManager: Using default fetchSize of 1000
14/06/24 07:05:33 INFO manager.SQLServerManager: We will use schema SalesLT
14/06/24 07:05:33 INFO tool.CodeGenTool: Beginning code generation
14/06/24 07:05:34 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM [SalesLT].[ProductModel] AS t WHERE 1=0
14/06/24 07:05:34 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-root/compile/bcb9143989664ced51458e8a0dbd52b9/ProductModel.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
14/06/24 07:05:37 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/bcb9143989664ced51458e8a0dbd52b9/ProductModel.jar
14/06/24 07:05:37 INFO tool.ImportTool: Maximal id query for free form incremental import: SELECT MAX([SalesLT].[ProductModelID]) FROM ProductModel
14/06/24 07:05:37 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: com.microsoft.sqlserver.jdbc.SQLServerException: Invalid object name 'ProductModel'.
Any help is appreciated.
Thank you!
PS.
Importing a full table is working fine.
sqoop import --connect "jdbc:sqlserver://192.168.40.133:1434;database=AdventureWorksLT2012;username=test;password=test" --table ProductModel --hive-import -- --schema SalesLT
Try this :
sqoop import --connect "jdbc:sqlserver://192.168.40.133:1434;database=AdventureWorksLT2012;username=test;password=test" --table ProductModel --hive-import -- --schema SalesLT --incremental append --check-column ProductModelID --last-value "128"
I think following command will work..
sqoop import --connect "jdbc:sqlserver://192.168.40.133:1434;database=AdventureWorksLT2012;username=test;password=test" --table ProductModel --hive-import --incremental append --check-column ProductModelID --last-value "128" -- --schema SalesLT

Hive to Vectorwise using Sqoop

I have Vectorwise 2.0.2 and Sqoop 1.4.1 installed.
When I'm trying to use sqoop-export:
sudo -u hdfs ./sqoop-export --driver com.ingres.jdbc.IngresDriver --connect jdbc:ingres://172.16.63.157:VW7/amit --username ingres -P -m 1 --table book_call_log --export-dir /user/hive/warehouse/book_call_log --input-fields-terminated-by '\001' --verbose
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Enter password:
12/06/08 19:00:52 INFO manager.SqlManager: Using default fetchSize of 1000
12/06/08 19:00:52 INFO tool.CodeGenTool: Beginning code generation
12/06/08 19:00:53 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM book_call_log AS t WHERE 1=0
The operation gets stuck here. No error is indicated and the prompt also does not appear.
Any help related to this is appreciated.

Resources