FileNotFound error in Sqoop Merge command - hadoop

I am trying to execute a Sqoop merge command, and for that I have run a Sqoop codegen to get the table's class and jar into HDFS.
Sqoop CodeGen Command:
sqoop codegen --connect jdbc:mysql://127.0.0.1/mydb --table mergetab --username root --password cloudera --outdir /user/cloudera/codegenclasses --fields-terminated-by '\t'
I have the following files in the outdir: /user/cloudera/codegenclasses
-rw-r--r-- 1 cloudera cloudera 9572 2017-04-20 16:26 codegenclasses/mergetab.class
-rw-r--r-- 1 cloudera cloudera 3902 2017-04-20 16:26 codegenclasses/mergetab.jar
-rw-r--r-- 1 cloudera cloudera 12330 2017-04-20 16:26 codegenclasses/mergetab.java
I am running the Sqoop merge command below to update the rows in my Hive table:
sqoop merge --merge-key id --new-data /user/cloudera/incrdata/incrementaldata --onto /user/cloudera/hivetables/fulltabledata --target-dir /user/cloudera/updateddatam --class-name /user/cloudera/codegenclasses/mergetab.class --jar-file /user/cloudera/codegenclasses/mergetab.jar
But I'm getting the error:
Encountered IOException running import job: java.io.FileNotFoundException: File /user/cloudera/codegenclasses/44059c9b2bd47b95f03866d8d93eff7f/mergetab.jar does not exist
All the files are present in the folder and I have given the proper directories, but I'm unable to identify the mistake I'm making here.
Could anyone help me fix this?

In the --outdir <dir> argument, the <dir> path refers to the local filesystem, not HDFS, and --outdir only stores the generated source code, i.e. tablename.java. Use --bindir <dir> instead; that is where the compiled .class and .jar files are written.
sqoop codegen --connect jdbc:mysql://127.0.0.1/mydb --table mergetab --username root --password cloudera --bindir /path/to/store/jarfile --fields-terminated-by '\t'
Then run the merge. By default, the generated class name is the table name, so pass just mergetab to --class-name:
sqoop merge --merge-key id --new-data /user/cloudera/incrdata/incrementaldata --onto /user/cloudera/hivetables/fulltabledata --target-dir /user/cloudera/updateddatam --class-name mergetab --jar-file /path/to/store/jarfile/mergetab.jar
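If the merge still can't find the jar, it is worth confirming that the codegen artifacts really landed on the local filesystem rather than in HDFS. A quick check, assuming the --bindir path used above:
ls -l /path/to/store/jarfile
# should list mergetab.jar (and mergetab.class) on the local disk, not in HDFS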

Related

SemanticException 10072: database does not exist (Sqoop)

I created a Hive internal table using a Sqoop command.
sqoop import -Dmapreduce.map.memory.mb=4096 \
--driver com.mysql.jdbc.Driver \
--connect 'jdbc:mysql://{mysql_url}' \
--username 'xxxx' \
--password 'xxxx' \
--input-fields-terminated-by '\t' \
--split-by id \
--target-dir {hdfs_path} \
--verbose -m 1 \
--hive-drop-import-delims \
--fields-terminated-by '\t' \
--hive-import \
--hive-table '{table_name}' \
--query "select id from temp WHERE \$CONDITIONS LIMIT 10"
The table was created and the import seemed to be working fine:
19/01/06 19:33:44 DEBUG hive.TableDefWriter: Load statement: LOAD DATA INPATH 'hdfs://hadoop/{hdfs_path}' INTO TABLE `tmp.temp`
19/01/06 19:33:44 INFO hive.HiveImport: Loading uploaded data into Hive
19/01/06 19:33:44 DEBUG hive.HiveImport: Using in-process Hive instance.
19/01/06 19:33:44 DEBUG util.SubprocessSecurityManager: Installing subprocess security manager
Logging initialized using configuration in jar:file:${HADOOP_HOME}/hive-1.1.0-cdh5.14.2/lib/hive-common-1.1.0-cdh5.14.2.jar!/hive-log4j.properties
The data files were created in the HDFS warehouse location:
$ hadoop dfs -ls {hdfs_path}
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
19/01/06 19:43:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
0 2019-01-06 19:33 {hdfs_path}/_SUCCESS
65 2019-01-06 19:33 {hdfs_path}/part-m-00000.gz
But then it failed with an error:
FAILED: SemanticException [Error 10072]: Database does not exist: tmp
I have already copied hive-site.xml into the Sqoop conf directory:
cp ${HIVE_HOME}/conf/hive-site.xml ${SQOOP_HOME}/conf/hive-site.xml
"hive.metastore.uris" was set local and remote thrift.
How do I do? Help me. Thanks
Please use this Sqoop command to import data into an existing Hive table, and change the clauses according to your requirements. I have modified some clauses and added others to your command:
ubuntu@localhost:/usr/local/hive$ sqoop import -Dmapreduce.map.memory.mb=4096 --connect 'jdbc:mysql://localhost/test' --username 'root' -P --input-fields-terminated-by ',' --split-by id --target-dir /user/hive/warehouse/test_hive --hive-drop-import-delims --fields-terminated-by ',' --hive-import --hive-database default --hive-table test_hive --query "select id from test WHERE \$CONDITIONS LIMIT 10" --driver com.mysql.jdbc.Driver --delete-target-dir
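Alternatively, if you want to keep loading into tmp.temp as in your original command, the error only says that the Hive database tmp does not exist. A minimal sketch (assuming the hive-site.xml you copied points Sqoop at the same metastore Hive uses) is to create the database first and name it explicitly:
hive -e "CREATE DATABASE IF NOT EXISTS tmp;"
# then keep your original import and add: --hive-database tmp --hive-table temp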
Happy Hadooping!

sqoop import error from teradata to hive

I am using the Sqoop command given below:
sqoop import \
--libjars /usr/hdp/2.4.0.0-169/sqoop/lib,/usr/hdp/2.4.0.0-169/hive/lib \
--connect jdbc:teradata://x/DATABASE=x \
--connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--username ec \
--password dc \
--query "select * from hb where yr_nbr=2017" \
--hive-table schema.table \
--num-mappers 1 \
--hive-import \
--target-dir /user/hive/warehouse/GG
I'm getting this error:
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
17/04/06 11:15:41 INFO mapreduce.Job: map 100% reduce 0%
17/04/06 11:15:41 INFO mapreduce.Job: Task Id : attempt_1491466460468_0029_m_000000_1, Status : FAILED
Error: org.apache.hadoop.fs.FileAlreadyExistsException: /user/root/temp_111508/part-m-00000 for client 192.168.211.133 already exists
From the error, I can guess that the output file already exists in your target directory, possibly from a previous Sqoop import. There is an option in sqoop import named --delete-target-dir which deletes the target output directory and re-creates it on the next import. Hope that helps.
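A sketch of the adjusted command, reusing the connection details from the question with that flag appended:
sqoop import \
--libjars /usr/hdp/2.4.0.0-169/sqoop/lib,/usr/hdp/2.4.0.0-169/hive/lib \
--connect jdbc:teradata://x/DATABASE=x \
--connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--username ec --password dc \
--query "select * from hb where yr_nbr=2017" \
--hive-import --hive-table schema.table \
--num-mappers 1 \
--target-dir /user/hive/warehouse/GG \
--delete-target-dir
Note that the path in the exception (/user/root/temp_111508/part-m-00000) is a temporary staging directory rather than the --target-dir, so if the flag alone doesn't help, removing that leftover directory with hdfs dfs -rm -r /user/root/temp_111508 may also be needed.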

Sqoop job unable to work with Hadoop Credential API

I have stored my database passwords in Hadoop CredentialProvider.
Sqoop import from the terminal works fine, successfully fetching the password from the CredentialProvider.
sqoop import \
-Dhadoop.security.credential.provider.path=jceks://hdfs/user/vijay/myPassword.jceks \
--table myTable -m 1 --target-dir /user/vijay/output --delete-target-dir --username vijay --password-alias db2-dev-password
But when I try to set it up as a Sqoop job, it does not recognize the -Dhadoop.security.credential.provider.path argument.
sqoop job --create my-sqoop-job -- import --table myTable -m 1 --target-dir /user/vijay/output --delete-target-dir --username vijay -Dhadoop.security.credential.provider.path=jceks://hdfs/user/vijay/myPassword.jceks --password-alias
Following is the error message:
14/04/05 13:57:53 ERROR tool.BaseSqoopTool: Error parsing arguments for import:
14/04/05 13:57:53 ERROR tool.BaseSqoopTool: Unrecognized argument: -Dhadoop.security.credential.provider.path=jceks://hdfs/user/vijay/myPassword.jceks
14/04/05 13:57:53 ERROR tool.BaseSqoopTool: Unrecognized argument: --password-alias
14/04/05 13:57:53 ERROR tool.BaseSqoopTool: Unrecognized argument: db2-dev-password
I couldn't find any special instructions in the Sqoop User Guide for configuring the Hadoop credential API with a Sqoop job.
How do I resolve this issue?
Repositioning the Sqoop parameters solves the problem.
sqoop job -Dhadoop.security.credential.provider.path=jceks://hdfs/user/vijay/myPassword.jceks --create my-sqoop-job -- import --table myTable -m 1 --target-dir /user/vijay/output --delete-target-dir --username vijay --password-alias myPasswordAlias
Place the -Dhadoop.security.credential.provider.path argument right after sqoop job, before --create and the other tool arguments.
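Once the job is created this way, the standard sqoop job subcommands can be used to inspect and run it, for example:
sqoop job --list
sqoop job --show my-sqoop-job
sqoop job --exec my-sqoop-job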
Your Sqoop job command is not correct: --password-alias is missing its value.
Please execute the command below on your Hadoop server:
hadoop credential list -provider jceks://hdfs/user/vijay/myPassword.jceks
Add its output to the Sqoop job command below:
sqoop job --create my-sqoop-job -- import --table myTable -m 1 --target-dir /user/vijay/output --delete-target-dir --username vijay -Dhadoop.security.credential.provider.path=jceks://hdfs/user/vijay/myPassword.jceks --password-alias <<output of above command>>

sqoop import unable to locate sqoop-1.4.6.jar

I'm using Sqoop to import data from a MySQL table into Hadoop.
While importing, it shows an error.
Hadoop Version: 2.5.0
Sqoop Version: 1.4.6
Command used for the import:
sqoop import --connect jdbc:mysql://localhost/<dbname> --username root --password pass#123 --table <tablename> -m 1
Error shown:
15/05/27 23:13:59 ERROR tool.ImportTool: Encountered IOException running import job: java.io.FileNotFoundException: File does not exist: hdfs://localhost:9000/usr/lib/sqoop/sqoop-1.4.6.jar
Any help?
Try this:
1. Create a directory in HDFS:
hdfs dfs -mkdir /usr/lib/sqoop
2. Copy sqoop jar into HDFS:
hdfs dfs -put /usr/lib/sqoop/sqoop-1.4.6.jar /usr/lib/sqoop/
3. Check whether the file exists in HDFS:
hdfs dfs -ls /usr/lib/sqoop
4. Import using sqoop:
sqoop import --connect jdbc:mysql://localhost/<dbname> --username root --password pass#123 --table <tablename> -m 1

Sqoop import --as-parquetfile with CDH5

I'm trying to import data directly from MySQL to Parquet, but it doesn't seem to work correctly...
I'm using CDH 5.3, which includes Sqoop 1.4.5.
Here is my command line:
sqoop import --connect jdbc:mysql://xx.xx.xx.xx/database --username username --password mypass --query 'SELECT page_id,user_id FROM pages_users WHERE $CONDITIONS' --split-by page_id --hive-import --hive-table default.pages_users3 --target-dir hive_pages_users --as-parquetfile
Then I get this error:
Warning: /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/01/09 14:31:49 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.0
15/01/09 14:31:49 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/01/09 14:31:49 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
15/01/09 14:31:49 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
15/01/09 14:31:49 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
15/01/09 14:31:49 INFO tool.CodeGenTool: Beginning code generation
15/01/09 14:31:50 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:50 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:50 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:50 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
Note: /tmp/sqoop-root/compile/b90e7b492f5b66554f2cca3f88ef7a61/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/01/09 14:31:51 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/b90e7b492f5b66554f2cca3f88ef7a61/QueryResult.jar
15/01/09 14:31:51 INFO mapreduce.ImportJobBase: Beginning query import.
15/01/09 14:31:51 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/01/09 14:31:51 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:51 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE (1 = 0)
15/01/09 14:31:51 WARN spi.Registration: Not loading URI patterns in org.kitesdk.data.spi.hive.Loader
15/01/09 14:31:51 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI: hive?dataset=default.pages_users3
org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI: hive?dataset=default.pages_users3
at org.kitesdk.data.spi.Registration.lookupDatasetUri(Registration.java:109)
at org.kitesdk.data.Datasets.create(Datasets.java:189)
at org.kitesdk.data.Datasets.create(Datasets.java:240)
at org.apache.sqoop.mapreduce.ParquetJob.createDataset(ParquetJob.java:81)
at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:70)
at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:112)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:262)
at org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:721)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:499)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
I have no problem importing data into Hive with the default file format, but Parquet is a problem... Do you have any idea why this occurs?
Thank you :)
Please do not use <db>.<table> with --hive-table. This doesn't work well with Parquet imports: Sqoop uses the Kite SDK to write Parquet files, and Kite doesn't like the <db>.<table> format.
Instead, please use --hive-database <db> and --hive-table <table> separately. For your command, it should be:
sqoop import --connect jdbc:mysql://xx.xx.xx.xx/database \
--username username --password mypass \
--query 'SELECT page_id,user_id FROM pages_users WHERE $CONDITIONS' --split-by page_id \
--hive-import --hive-database default --hive-table pages_users3 \
--target-dir hive_pages_users --as-parquetfile
Here's my pipeline in CDH 5.5 to import from a JDBC source into Hive Parquet files.
The JDBC data source is Oracle, but the explanation below fits MySQL too.
1) Sqoop:
$ sqoop import --connect "jdbc:oracle:thin:@(complete TNS descriptor)" \
--username MRT_OWNER -P \
--compress --compression-codec snappy \
--as-parquetfile \
--table TIME_DIM \
--warehouse-dir /user/hive/warehouse \
--num-mappers 1
I chose --num-mappers 1 because the TIME_DIM table had only around 20k rows, and it's not advisable to split a Parquet table into multiple files for such a small dataset. Each mapper creates a separate output (Parquet) file.
(PS for Oracle users: I had to connect as the owner of the source table; otherwise I had to specify "MRT_OWNER.TIME_DIM" and got the error org.kitesdk.data.ValidationException: Namespace MRT_OWNER.TIME_DIM is not alphanumeric (plus '_'), which seems to be a Sqoop bug.)
(PS2: The table name had to be all uppercase. Not sure whether this is Oracle-specific (it shouldn't be) or another Sqoop bug.)
(PS3: The --compress --compression-codec snappy parameters were recognized but did not seem to have any effect.)
2) The above command creates a directory named
/user/hive/warehouse/TIME_DIM
It's wise to move it into a specific Hive database directory, e.g.:
$ hadoop fs -mv /user/hive/warehouse/TIME_DIM /user/hive/warehouse/dwh.db/time_dim
Assuming the Hive database/schema is named "dwh".
3) Create the Hive table by taking the schema directly from a Parquet file:
$ hadoop fs -ls /user/hive/warehouse/dwh.db/time_dim | grep parquet
-rwxrwx--x+ 3 hive hive 1216 2016-02-04 23:56 /user/hive/warehouse/dwh.db/time_dim/62679a1c-b848-426a-bb8e-9372328ddad7.parquet
If the above command returns more than one Parquet file (meaning you had more than one mapper via the --num-mappers parameter), you can pick any of them for the command below.
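If you prefer to grab one of those file paths programmatically, here is a small shell helper of my own (not part of the original flow):
PARQUET_FILE=$(hadoop fs -ls /user/hive/warehouse/dwh.db/time_dim | grep 'parquet$' | head -n 1 | awk '{print $NF}')
echo "$PARQUET_FILE"   # use this path in the CREATE TABLE ... LIKE PARQUET statement below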
The following command should be run in Impala, not in Hive: Hive currently can't infer a schema from Parquet files, but Impala can:
[impala-shell] > CREATE TABLE dwh.time_dim
LIKE PARQUET '/user/hive/warehouse/dwh.db/time_dim/62679a1c-b848-426a-bb8e-9372328ddad7.parquet'
COMMENT 'sqooped from MRT_OWNER.TIME_DIM'
STORED AS PARQUET
LOCATION 'hdfs:///user/hive/warehouse/dwh.db/time_dim'
;
PS: It's also possible to infer the schema from Parquet using Spark, e.g.:
spark.read.parquet('hdfs:///user/hive/warehouse/dwh.db/time_dim').schema
4) Since the table wasn't created in Hive (which collects stats automatically), it's a good idea to collect stats:
[impala-shell] > compute stats dwh.time_dim;
--as-parquetfile was added in Sqoop 1.4.6 (CDH 5.5); see https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_literal_sqoop_import_literal
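To check which Sqoop build your distribution actually ships (and therefore whether --as-parquetfile is available to you), you can run:
sqoop version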
Regarding the PS above about org.kitesdk.data.ValidationException: Namespace MRT_OWNER.TIME_DIM is not alphanumeric (plus '_'): this can be fixed if the database name and table name are written as db_name/table_name instead of db_name.table_name.
It seems like database support is missing in your distribution; it looks like it was added fairly recently. Try changing --hive-table to just pages_users3 and removing --target-dir.
If the above doesn't work, try:
This blog post.
The docs.
Check with the user@sqoop.apache.org mailing list.
I found a solution: I dropped all the Hive parts and used the target dir to store the data... It seems to work:
sqoop import --connect jdbc:mysql://xx.xx.xx.xx/database --username username --password mypass --query 'SELECT page_id,user_id FROM pages_users WHERE $CONDITIONS' --split-by page_id --target-dir /home/cloudera/user/hive/warehouse/soprism.db/pages_users3 --as-parquetfile -m 1
I then link to the directory by creating an external table from Impala...
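For completeness, a hedged sketch of that last Impala step, reusing the LIKE PARQUET pattern from the earlier answer (the database and table names come from the target dir above and assume a soprism database already exists in the metastore; the .parquet file name is a placeholder to replace with a real file from that directory):
impala-shell -q "CREATE EXTERNAL TABLE soprism.pages_users3
LIKE PARQUET '/home/cloudera/user/hive/warehouse/soprism.db/pages_users3/<any .parquet file>'
STORED AS PARQUET
LOCATION '/home/cloudera/user/hive/warehouse/soprism.db/pages_users3';"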
