I was trying to use the TDCH connector to load data from a Hive table to a Teradata table. However, I want to load the data into the target table (in Teradata) through a VIEW instead of accessing the table directly.
So, is there a way to load the data into the target table through the VIEW?
There is an option called "tdch.output.teradata.data.dictionary.use.xview", but setting it to true didn't help either.
Below is a sample TDCH job I was using:
hadoop jar $TDCH_JAR
com.teradata.connector.common.tool.ConnectorExportTool
-Dmapred.job.queue.name=
-libjars $HIVE_LIB_JARS
-classname com.teradata.jdbc.TeraDriver
-url jdbc:teradata:///
-username xxxxx
-password xxxxx
-jobtype hive
-fileformat textfile
-nummappers 10
-method internal.fastload
-separator "\u0009"
-sourcedatabase
-sourcetable
-sourcefieldnames ""
-targettable
-targetfieldnames ""
-stagedatabase
-forcestage true
Can you try using '-method batch.insert'? Based on this Developer Exchange post, FastLoad sessions don't support loading into views, but regular SQL sessions should work fine (TDCH's internal.fastload method uses FastLoad sessions, while batch.insert uses SQL sessions).
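For illustration, a sketch of the same job with only the method changed; the angle-bracket values are placeholders for your own settings, and any other options you were using (field names, staging, queue) stay as they were:
hadoop jar $TDCH_JAR \
  com.teradata.connector.common.tool.ConnectorExportTool \
  -libjars $HIVE_LIB_JARS \
  -classname com.teradata.jdbc.TeraDriver \
  -url jdbc:teradata://<td-host>/DATABASE=<td-database> \
  -username <user> -password <password> \
  -jobtype hive \
  -fileformat textfile \
  -nummappers 10 \
  -method batch.insert \
  -separator "\u0009" \
  -sourcedatabase <hive_db> \
  -sourcetable <hive_table> \
  -targettable <teradata_view>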
Related
We are using Sqoop's Netezza direct mode to import data from Netezza to Hadoop as part of a POC.
I have a couple of questions, some Netezza-specific and some about the Netezza-Sqoop integration.
Q1. Does Sqoop direct mode always require the CREATE EXTERNAL TABLE and DROP privileges to perform a direct transfer?
Q2. Does an external table get created in Netezza? If yes, in which database? I see Sqoop using the query below:
CREATE EXTERNAL TABLE '/yarn/local/usercache/someuser/appcache/application_1483624176418_42787/work/task_1483624176418_42787_m_000000/nzexttable-0.txt'
USING (REMOTESOURCE 'JDBC'
       BOOLSTYLE 'T_F'
       CRINSTRING FALSE
       DELIMITER 44
       ENCODING 'internal'
       FORMAT 'Text'
       INCLUDEZEROSECONDS TRUE
       NULLVALUE 'null'
       MAXERRORS 1)
AS SELECT * FROM SOME_TBL WHERE (DATASLICEID % 3)
Does it get created in the database selected in the JDBC URL? jdbc:netezza://somehostname:5480/SOME_DB_1
Q3. If Netezza needs to create external tables, can it create the external table in a different database than the one containing the actual table whose data needs to be pulled into Hadoop? What configuration change is needed for that?
Q4. Does Sqoop run DROP TABLE on the external tables created by the individual mappers?
Sqoop command used:
export HADOOP_CLASSPATH=/opt/nz/lib/nzjdbc3.jar
sqoop import -D mapreduce.job.queuename=some_queue \
  -D yarn.nodemanager.local-dirs=/tmp -D mapreduce.map.log.level=DEBUG \
  --direct --connect jdbc:netezza://somehost:5480/SOME_DB --table SOME_TBL_1 \
  --username SOMEUSER --password xxxxxxx --target-dir /tmp/netezza/some_tbl_file \
  --num-mappers 2 --verbose
This is the reply I got from the Sqoop user community (thanks, Szabolcs Vasas):
In the case of Netezza direct imports, Sqoop executes a CREATE EXTERNAL TABLE command (so you will need the CREATE EXTERNAL TABLE privilege) to back up the content of the table to a temporary file, and it then copies the content of this file to the final output on HDFS.
The SQL command you pasted is indeed the one executed by Sqoop, but as far as I understand from the Netezza documentation (http://www.ibm.com/support/knowledgecenter/SSULQD_7.2.1/com.ibm.nz.load.doc/c_load_create_external_tbl_expls.html, 6th example), it does not really create a new external table in any schema; it just backs up the content of the table, and because of that no DROP TABLE statement is executed.
Q1. Yes, Sqoop needs the CREATE EXTERNAL TABLE privilege, but not DROP.
Q2. Sqoop does not really create a new external table in any schema; it just backs up the content of the table (http://www.ibm.com/support/knowledgecenter/SSULQD_7.2.1/com.ibm.nz.load.doc/c_load_create_external_tbl_expls.html, 6th example).
Q3. It is not possible to create the external table in a specific schema.
Q4. No, Sqoop does not run a DROP command.
Moreover, the table created by the Sqoop direct process is a Netezza TET (transient external table). The external REMOTESOURCE 'JDBC' table is dropped once the mapper receives the data through a named FIFO, so no tables remain stored in Netezza after the transfer.
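If the import account lacks the CREATE EXTERNAL TABLE privilege from Q1, a Netezza administrator can grant it. A minimal sketch (the user name is a placeholder, and the exact GRANT syntax may vary by Netezza version):
GRANT CREATE EXTERNAL TABLE TO SOMEUSER;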
Is there anyone here who has worked with Sqoop and HP Vertica?
I am trying to export data from Sqoop to Vertica and I find that the performance is extremely poor.
I can switch to the HP Vertica connector, but I still want to know why Sqoop is so slow when exporting data to Vertica.
I also found that, when inserting data, Sqoop does not support upserts against Vertica. Will this be fixed anytime soon?
sqoop export -Dsqoop.export.records.per.statement=1 \
  --driver com.vertica.jdbc.Driver --mysql-delimiters \
  --username **** --password **** \
  --connect jdbc:vertica://hostname/schema?ConnectionLoadBalance=1 \
  --export-dir <hdfs-data-dir> --table <table_name>
One of the issues is that Sqoop is forcing us to set sqoop.export.records.per.statement to 1 for Vertica; otherwise it throws an error.
I've never used Sqoop, but the command-line data import in Vertica uses the COPY function; basically it makes a temp file and then runs a file import in the background. It wouldn't be a graceful solution, but you could try dumping your data to a gzip file and then running COPY directly. I find that the gzip step is always the bottleneck for files over a certain threshold (~50 MB+), never the COPY. It could be a backdoor to a faster import.
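If you want to experiment with that idea, here is a rough sketch; the paths, schema, and table name are placeholders, and it assumes a comma-delimited export and the vsql client:
# pull the exported data out of HDFS and gzip it
hdfs dfs -cat /user/me/export_dir/part-* | gzip > /tmp/export.csv.gz
# load it with Vertica's COPY; GZIP tells COPY to decompress the local file
vsql -U dbadmin -c "COPY my_schema.my_table FROM LOCAL '/tmp/export.csv.gz' GZIP DELIMITER ',' DIRECT;"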
I use Sqoop with a Vertica database to export data from Vertica to Hive/HDFS and it works great; you just need to add the Vertica JDBC jar to the Sqoop lib folder.
When I want to query data that sits in HDFS/Hive from Vertica, I use Vertica's HCatalog connector. In version 8.1.* it comes with the Vertica database, so you don't need any additional connectors.
Using Sqoop I can create a managed table but not an external table.
Please let me know the best practices for unloading data from a data warehouse and loading it into Hive external tables.
1. The tables in the warehouse are partitioned; some are partitioned by date and some by state.
Please share the approaches or practices you use in production environments.
Sqoop does not support creating Hive external tables. Instead you might (a sketch of these steps follows the list):
1. Use the Sqoop codegen command to generate the SQL for creating the Hive internal table that matches your remote RDBMS table (see http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_codegen_literal).
2. Modify the generated SQL to create a Hive external table.
3. Execute the modified SQL in Hive.
4. Run your Sqoop import command, loading into the pre-created Hive external table.
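A rough sketch of steps 2-4, assuming a MySQL source; all connection details, table, column, and path names are placeholders, and the external-table DDL is written by hand here rather than taken from codegen output:
# steps 2-3: create the external table in Hive with a schema matching the RDBMS table
hive -e "CREATE EXTERNAL TABLE customers_ext (id INT, name STRING)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         LOCATION '/user/me/customers_ext'"
# step 4: import into the table's HDFS location (--append lets Sqoop write into the
# directory Hive may already have created for the LOCATION)
sqoop import --connect jdbc:mysql://dbhost/salesdb --username me --password secret \
  --table customers --fields-terminated-by ',' \
  --target-dir /user/me/customers_ext --append -m 1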
Step 1: Import data from MySQL into a Hive table.
sqoop import
--connect jdbc:mysql://localhost/
--username training --password training
--table --hive-import --hive-table -m 1
--fields-terminated-by ','
Step 2: In Hive, change the table type from managed to external.
ALTER TABLE <Table-name> SET TBLPROPERTIES('EXTERNAL'='TRUE');
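You can verify that the change took effect (the table name is a placeholder):
hive> DESCRIBE FORMATTED <Table-name>;
Table Type in the output should now read EXTERNAL_TABLE.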
Note: you can import directly into the Hive table, or import into the HDFS location backing Hive.
My best suggestion is to Sqoop your data to HDFS and create an EXTERNAL table over it for raw operations and transformations.
Finally, load the mashed-up data into an internal table. I believe this is one of the best practices to get things done properly; a sketch follows.
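A minimal sketch of that flow; all database, table, column, and path names are placeholders:
# 1. land the raw data on HDFS
sqoop import --connect jdbc:mysql://dbhost/salesdb --username me --password secret \
  --table orders --fields-terminated-by ',' --target-dir /data/raw/orders -m 1
# 2. expose it through an external table, then mash it up into an internal table
hive -e "CREATE EXTERNAL TABLE orders_raw (id INT, amount DOUBLE)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/data/raw/orders';
         CREATE TABLE orders_clean AS SELECT id, amount FROM orders_raw WHERE amount IS NOT NULL;"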
Hope this helps!!!
Refer to these links:
https://mapr.com/blog/what-kind-hive-table-best-your-data/
In the above link, if you want to skip directly to the point, go to section 2.2.1, "External or Internal".
https://hadoopsters.net/2016/07/15/hive-tables-internal-and-external-explained/
After reading the first link, the second will clarify most of your questions.
Cheers!!
What is the difference between Apache Sqoop and Hive? I know that Sqoop is used to import/export data between an RDBMS and HDFS, and that Hive is a SQL abstraction layer on top of Hadoop. Can I use Sqoop to import data into HDFS and then use Hive for querying?
Yes, you can. In fact, many people use Sqoop and Hive for exactly what you describe.
In my project, I had to load historical data from my RDBMS, which was Oracle, and move it to HDFS. I had Hive external tables defined over that path, which allowed me to run Hive queries for transformations. We also wrote MapReduce programs on top of this data to produce various analyses.
Sqoop transfers data between HDFS and relational databases. You can use Sqoop to transfer data from a relational database management system (RDBMS) such as MySQL or Oracle into HDFS and use MapReduce on the transferred data. Sqoop can export this transformed data back into an RDBMS as well. More info http://sqoop.apache.org/docs/1.4.3/index.html
Hive is a data warehouse software that facilitates querying and managing large datasets residing in HDFS. Hive provides schema on read (as opposed to schema on write for RDBMS) onto the data and the ability to query the data using a SQL-like language called HiveQL. More info https://hive.apache.org/
Yes you can. As a matter of fact, that's exactly how it is meant to be used.
Sqoop:
1. We can integrate HDFS with any external data source, i.e. SQL databases, NoSQL stores, and data warehouses, and the tool is bi-directional, so we can export as well as import.
2. We can also use Sqoop to move data from a relational database into HBase (a sketch follows).
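For example, a sketch of a Sqoop-to-HBase import; all names are placeholders, and --hbase-create-table asks Sqoop to create the HBase table and column family if they don't exist:
sqoop import --connect jdbc:mysql://dbhost/salesdb --username me --password secret \
  --table customers --hbase-table customers --column-family cf \
  --hbase-row-key id --hbase-create-table -m 1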
Hive:
1. As per my understanding, we can import data from SQL databases into Hive, rather than from NoSQL databases.
2. Hive by itself cannot export data from HDFS into SQL databases.
We can use both together using the two options below:
sqoop create-hive-table --connect jdbc:mysql://<hostname>/<dbname> --table <table name> --fields-terminated-by ','
The above command generates the Hive table, with the same name and schema as the table in the external database.
Then load the data:
hive> LOAD DATA INPATH '<hdfs-file-path>' INTO TABLE <table name>;
This can be shortened to one step if you know that you want to import straight from a database directly into Hive:
sqoop import --connect jdbc:mysql://<hostname>/<dbname> --table <table name> -m 1 --hive-import
I am facing an issue with the Teradata connector for Sqoop when I try to import a table from a Teradata view. I only have access to views.
However, when the Sqoop job starts, it tries to create a table in the Teradata database I am accessing, where I don't have the right to create any table in that DB/schema.
I am getting the error below:
13/05/31 03:40:12 ERROR tool.ImportTool: Encountered IOException running import job: com.teradata.hadoop.exception.TeradataHadoopSQLException: com.teradata.jdbc.jdbc_4.util.JDBCException: [Teradata Database] [TeraJDBC 14.00.00.01] [Error 3524] [SQLState 42000] The user does not have CREATE TABLE access to database EDWABSVIEWS.
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDatabaseSQLException(ErrorFactory.java:307)
at com.teradata.jdbc.jdbc_4.statemachine.ReceiveInitSubState.action(ReceiveInitSubState.java:102)
at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.subStateMachine(StatementReceiveState.java:298)
at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.action(StatementReceiveState.java:179)
at com.teradata.jdbc.jdbc_4.statemachine.StatementController.runBody(StatementController.java:120)
at com.teradata.jdbc.jdbc_4.statemachine.StatementController.run(StatementController.java:111)
at com.teradata.jdbc.jdbc_4.TDStatement.executeStatement(TDStatement.java:372)
at com.teradata.jdbc.jdbc_4.TDStatement.executeStatement(TDStatement.java:314)
at com.teradata.jdbc.jdbc_4.TDStatement.doNonPrepExecute(TDStatement.java:277)
at com.teradata.jdbc.jdbc_4.TDStatement.execute(TDStatement.java:1087)
at com.teradata.hadoop.TeradataConnection.executeDDL(TeradataConnection.java:379)
at com.teradata.hadoop.TeradataConnection.createTable(TeradataConnection.java:1655)
at com.teradata.hadoop.TeradataPartitionStageInputProcessor.createStageTable(TeradataPartitionStageInputProcessor.java:233)
at com.teradata.hadoop.TeradataPartitionStageInputProcessor.setup(TeradataPartitionStageInputProcessor.java:87)
at com.teradata.hadoop.TeradataImportJob.run(TeradataImportJob.java:36)
at org.apache.sqoop.teradata.TeradataImportJob.doSubmitJob(TeradataImportJob.java:173)
at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:141)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:208)
at org.apache.sqoop.teradata.TeradataConnManager.importTable(TeradataConnManager.java:64)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:403)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:476)
at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
at com.teradata.hadoop.TeradataPartitionStageInputProcessor.createStageTable(TeradataPartitionStageInputProcessor.java:243)
at com.teradata.hadoop.TeradataPartitionStageInputProcessor.setup(TeradataPartitionStageInputProcessor.java:87)
at com.teradata.hadoop.TeradataImportJob.run(TeradataImportJob.java:36)
at org.apache.sqoop.teradata.TeradataImportJob.doSubmitJob(TeradataImportJob.java:173)
at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:141)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:208)
at org.apache.sqoop.teradata.TeradataConnManager.importTable(TeradataConnManager.java:64)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:403)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:476)
at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
Please assist.
Cloudera Connector for Teradata 1.1.1 does not support imports from views, as documented in the limitations section of the user guide.
The connector will try to create temporary tables in order to provide all-or-nothing semantics, which I expect is the reason for the exception. If you do not have such privileges on the main database, you can instruct the connector to create the staging tables in another database where you have sufficient privileges. Please check the user guide for further instructions.
If you have specified the split.by.partition method and your table is not partitioned, or if you enabled --staging-force, the Teradata connector will by default create a staging table in the database specified in the connection URL.
From Cloudera documentation:
If your input table is not partitioned, the connector creates a partitioned staging table and executes an INSERT into SELECT query to move data from the source table into the staging table. Subsequently, each mapper transfers data from one partition, resulting in a single AMP operation. With a single AMP, you can use a large number of mappers to obtain optimal performance. The amount of available permanent space must be as large as your source table and the amount of spool space required to execute the SELECT queries.
If your table is already partitioned, no extra staging table is created. However, you can force the connector to re-partition your data using the --staging-force parameter to achieve better performance. Without forcing repartition of the data, this method opens all-AMP operation, so you should use between 20 and 30 mappers. If your source table is a PI table, and your split by column is the table’s primary key, the connector creates a single AMP operation, and you can use high number of mappers.
To solve this, you have to avoid staging: either switch to split.by.value or split.by.hash, or specify a different database for the staging table (--staging-database) in which you have the CREATE TABLE permission; see the sketch below.
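For example (host, database, and table names are placeholders; connector-specific arguments such as --staging-database typically go after a bare -- separator, so check your connector version's user guide for the exact form):
sqoop import \
  --connect jdbc:teradata://tdhost/Database=PRODDB \
  --username me --password secret \
  --table SOME_TABLE \
  --target-dir /data/some_table \
  --num-mappers 4 \
  -- --staging-database STAGEDB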
Furthermore, consider that the connector will try to create a view whenever you use a free-form query (--query) with multiple mappers. To work around the CREATE VIEW permission error, either specify --table <TABLE_NAME> and load the whole table (you can still restrict the columns with --columns and apply a WHERE filter with --where), or get rid of parallelism with --num-mappers 1; both options are sketched below.
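A sketch of both workarounds (connection details, columns, and names are placeholders):
# option 1: whole-table import, filtered with --columns and --where, no view needed
sqoop import --connect jdbc:teradata://tdhost/Database=PRODDB --username me --password secret \
  --table SOME_TABLE --columns "COL1,COL2" --where "LOAD_DT > DATE '2017-01-01'" \
  --target-dir /data/some_table
# option 2: keep the free-form query but drop parallelism so no view is created
sqoop import --connect jdbc:teradata://tdhost/Database=PRODDB --username me --password secret \
  --query "SELECT COL1, COL2 FROM SOME_TABLE WHERE \$CONDITIONS" \
  --num-mappers 1 --target-dir /data/some_table_q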
I personally think there should be a property to enable/disable the creation of views. If you are performing a plain import workflow, you should not be required to create views as well as staging tables.
I had that same error when I tried to use the --split-by argument along with a free-form --query argument. I then replaced --split-by with --num-mappers 1 to force Sqoop to run in a single process, and the error went away.
I am assuming that --split-by tries to create temporary files/views, whereas a single-mapper process does not.
Here is a template that I am using:
sqoop import \
--connect jdbc:teradata://<host>/Database=<dbname> \
--username <username> \
--password <password> \
--hive-import \
--hive-table <hivetablename> \
--hive-drop-import-delims \
--num-mappers 1 \
--target-dir <hdfspath> \
--query "<sqlquery> and \$CONDITIONS"