Connect to Teradata Using Airflow JDBC Connection

I'm trying to execute a SqlSensor task in Airflow using a connection to a Teradata database. The connection is configured as follows:
In particular, I have provided 2 driver paths separated by ", " but I am not sure whether that is the proper way to do it:
/home/airflow/java_sample/tdgssconfig.jar
/home/airflow/java_sample/terajdbc4.jar
When the DAG executes, it triggers this error message:
[2017-08-02 02:32:45,162] {models.py:1342} INFO - Executing <Task(SqlSensor): check_running_batch> on 2017-08-02 02:32:12
[2017-08-02 02:32:45,179] {base_hook.py:67} INFO - Using connection to: jdbc:teradata://myservername.mycompanyname.org/database=MYDBNAME,TMODE=ANSI,CHARSET=UTF8
[2017-08-02 02:32:45,313] {sensors.py:109} INFO - Poking: SELECT BATCH_KEY FROM MYDBNAME.AUDIT_BATCH WHERE BATCH_OWNER='ARO_TEST' AND AUDIT_STATUS_KEY=1;
[2017-08-02 02:32:45,316] {base_hook.py:67} INFO - Using connection to: jdbc:teradata://myservername.mycompanyname.org/database=MYDBNAME,TMODE=ANSI,CHARSET=UTF8
[2017-08-02 02:32:45,497] {models.py:1417} ERROR - java.lang.RuntimeException: Class com.teradata.jdbc.TeraDriver not found
What am I doing wrong?

The appropriate way to enter multiple jars on the connections page is to separate the fully qualified paths with a comma, which you did above.
I can confirm this is the approach I took, and it worked (Airflow 10.1.1 and 10.1.2).
See: https://github.com/apache/airflow/blob/master/airflow/hooks/jdbc_hook.py#L51
Bonus: If you use Ad Hoc Query in Data Profiling to test it out, you'll get an error when you send a SELECT statement, because Airflow wraps it in a LIMIT clause, which Teradata doesn't support.
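To narrow down a "Class com.teradata.jdbc.TeraDriver not found" error, it can help to check which of the configured jars actually contains the driver class. The sketch below is not part of Airflow; `jars_containing` is a hypothetical helper that splits the driver path field on commas, trims whitespace, and looks inside each jar. It trims because, as an assumption worth ruling out, a hook that splits on a bare comma would leave a leading space on the second path when the separator is ", ".

```python
import zipfile
from pathlib import Path


def jars_containing(driver_path, class_name):
    """Map each jar in a comma-separated driver path to whether it holds class_name."""
    # A Java class a.b.C is stored inside a jar as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    result = {}
    for raw in driver_path.split(","):
        jar = raw.strip()  # tolerate "a.jar, b.jar" as well as "a.jar,b.jar"
        if not jar:
            continue
        path = Path(jar)
        if not path.is_file() or not zipfile.is_zipfile(path):
            result[jar] = False  # missing file or not a valid jar/zip archive
            continue
        with zipfile.ZipFile(path) as archive:
            result[jar] = entry in archive.namelist()
    return result


if __name__ == "__main__":
    # Paths taken from the question; adjust to your own installation.
    paths = "/home/airflow/java_sample/tdgssconfig.jar, /home/airflow/java_sample/terajdbc4.jar"
    for jar, found in jars_containing(paths, "com.teradata.jdbc.TeraDriver").items():
        print(jar, "->", "has driver class" if found else "no driver class or not readable")
```

If no jar reports the class, the problem is with the files themselves rather than with how the paths are separated.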

The solution provided by my team member was to merge the two jars into a single jar file. After doing that and pointing the driver path at the new jar file, it worked as expected.
Here is the link to the JAR file: https://github.com/alexisrolland/linux-setup/blob/master/teradataDriverJdbc.jar
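How the linked jar was built is not shown, so the following is only one possible way to merge two jars, sketched with Python's standard `zipfile` module. `merge_jars` is a hypothetical helper; it keeps the first copy of any duplicated entry (such as META-INF/MANIFEST.MF). Signed jars may not survive this kind of merge, so treat it as an illustration, not a guaranteed recipe.

```python
import zipfile


def merge_jars(sources, target):
    """Copy every entry of the source jars into one target jar; the first copy wins."""
    seen = set()
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as out:
        for src in sources:
            with zipfile.ZipFile(src) as jar:
                for info in jar.infolist():
                    if info.filename in seen:
                        continue  # e.g. a second META-INF/MANIFEST.MF
                    seen.add(info.filename)
                    out.writestr(info, jar.read(info.filename))
    return sorted(seen)


if __name__ == "__main__":
    import os

    # Paths from the question; the output file name is an arbitrary choice.
    jars = [
        "/home/airflow/java_sample/tdgssconfig.jar",
        "/home/airflow/java_sample/terajdbc4.jar",
    ]
    if all(os.path.isfile(j) for j in jars):
        merge_jars(jars, "/home/airflow/java_sample/teradataDriverJdbc.jar")
```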
Here is an example code snippet that uses the connection in a SqlSensor task:
CheckRunningBatch = SqlSensor(
    task_id='check_running_batch',
    conn_id='ed_data_quality_edw_dev',
    sql="SELECT CASE WHEN MAX(BATCH_KEY) IS NOT NULL THEN 0 ELSE 1 END FROM DATABASE.AUDIT_BATCH WHERE STATUS_KEY=1;",
    poke_interval=300,
    dag=dag)
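The CASE expression in that query turns "is a batch running?" into a 0/1 value for the sensor. The sketch below illustrates the idea with an in-memory SQLite table standing in for Teradata. `poke_passes` is a hypothetical function that mirrors what I understand the sensor's success rule to be (pass when the first cell of the first row is neither 0 nor empty); that rule is an assumption here, so check the sensor source for your Airflow version.

```python
import sqlite3


def poke_passes(rows):
    """Assumed SqlSensor rule: pass when the first cell is not empty or zero."""
    if not rows:
        return False
    return str(rows[0][0]) not in ("0", "")


SQL = (
    "SELECT CASE WHEN MAX(BATCH_KEY) IS NOT NULL THEN 0 ELSE 1 END "
    "FROM AUDIT_BATCH WHERE STATUS_KEY=1"
)

conn = sqlite3.connect(":memory:")  # stand-in for Teradata, for illustration only
conn.execute("CREATE TABLE AUDIT_BATCH (BATCH_KEY INTEGER, STATUS_KEY INTEGER)")

# No batch with STATUS_KEY=1: the query yields 1, so the sensor would pass.
idle = poke_passes(conn.execute(SQL).fetchall())

# A batch with STATUS_KEY=1 exists: the query yields 0, so the sensor keeps poking.
conn.execute("INSERT INTO AUDIT_BATCH VALUES (42, 1)")
busy = poke_passes(conn.execute(SQL).fetchall())

print(idle, busy)  # prints: True False
```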

Related

Cannot create Hive external table using jdbcStorageHandler

I am running a small cluster in Amazon EMR in order to play with Apache Hive 2.3.5. It is my understanding that Apache Hive can import data from a remote database and have the cluster run queries against it. I was following an example from the Apache Hive web documentation (https://cwiki.apache.org/confluence/display/Hive/JdbcStorageHandler) and wrote the following code:
CREATE EXTERNAL TABLE hive_table
(
    col1 int,
    col2 string,
    col3 date
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
    'hive.sql.database.type'='POSTGRES',
    'hive.sql.jdbc.driver'='org.postgresql.Driver',
    'hive.sql.jdbc.url'='jdbc:postgresql://<url>/<dbname>',
    'hive.sql.dbcp.username'='<username>',
    'hive.sql.dbcp.password'='<password>',
    'hive.sql.table'='<dbtable>',
    'hive.sql.dbcp.maxActive'='1'
);
But I get the following error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException java.lang.IllegalArgumentException: Property hive.sql.query is required.)
According to the documentation, I need to specify either “hive.sql.table” or “hive.sql.query” to tell Hive how to get data from the JDBC database. But if I replace hive.sql.table with hive.sql.query I get the following error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException java.lang.IllegalArgumentException: No enum constant org.apache.hive.storage.jdbc.conf.DatabaseType.POSTGRES)
I tried searching the web for a solution, and it doesn't look like anyone has experienced the same issues that I am having. Do I need to modify a config file, or am I missing something critical in my code?

I think you are using a version of the jar that doesn't support POSTGRES.
Download the latest jar from this link:
http://repo1.maven.org/maven2/org/apache/hive/hive-jdbc-handler/3.1.2/hive-jdbc-handler-3.1.2.jar
Put the downloaded jar in an HDFS location.
Run Hive normally.
Run the command: add jar ${HDFS_PATH_TO_DOWNLOADED_JAR}
Run your create table command.

Listing MS SQL Server table in OOZIE via SQOOP Action

I am able to execute the following Sqoop command in the CLI perfectly.
sqoop list-tables \
  --connect 'jdbc:sqlserver://xx.xx.xx.xx\MSSQLSERVER2012:1433;username=usr;password=xxx;database=db' \
  --connection-manager org.apache.sqoop.manager.SQLServerManager \
  --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
  -- --schema schma
But I get errors when trying the same thing in Oozie (Hue):
2055 [main] ERROR org.apache.sqoop.manager.CatalogQueryManager -
Failed to list tables java.sql.SQLException: No suitable driver found
for 'jdbc:sqlserver://xx.xx.xx.xx\MSSQLSERVER2012:1433;username=usr;password=xxx;database=db'
-
2057 [main] ERROR org.apache.sqoop.Sqoop - Got exception running
Sqoop: java.lang.RuntimeException: java.sql.SQLException: No suitable
driver found for 'jdbc:sqlserver://xx.xx.xx.xx\MSSQLSERVER2012:1433;username=usr;password=xxx;database=db'
How can we get it to work in oozie?
(Working on Cloudera Hadoop Distribution)

This worked for me using CDH 5.11 and the Hue Workflow Editor to create an Oozie > Sqoop1 workflow, but it REQUIRES you to hard-code the UserName and Password arguments. Screenshots are included below.
Here is the step-by-step:
1. Open the Hue > Workflow Editor.
2. Create a new workflow.
3. Drag the Sqoop 1 action into the "drop your action here" grey box.
4. Ignore the default Sqoop command box and instead hit the + to the right of ARGUMENTS below the Sqoop command box to add a new argument.
5. Add "import" without the double quote marks as the very first argument.
6. Delete the entire content of the Sqoop command box; it needs to be empty.
7. Add a new argument with the value of "--connect" without the double quotes.
8. Add a new argument with the value of "jdbc:sqlserver://YourServerNameHere;database=YourDatabaseNameHere"
9. Add a new argument with the value of "--username"
10. Add a new argument with the value of "YourSQLServerNamedUserNameHere"
11. Add a new argument with the value of "--password"
12. Add a new argument with the value of "--query"
13. Add a new argument with the value of "Select * from OptionalDBNameHere.SchemaNameHere.TableNameHere Where $CONDITIONS"
14. Add a new argument with the value of "--delete-target-dir"
15. Add a new argument with the value of "--target-dir"
16. Add a new argument with the value of "hdfs://FDQServerName:PortNumber8020IsDefault/User/full/path/to/where/you/want/the/csv/file/placed/in/hdfs/NewFolderForThisTableHere" -- The last folder will be deleted and re-created each time you run the sqoop job.
17. Add a new argument with the value of "--num-mappers"
18. Add a new argument with the value of "1"
Important:
A. The "Where $CONDITIONS" is critical to have at the end of the SQL Select statement in item 13. It will not run without it.
B. This uses a SQL Server Named User account with access to the DBServer Database and Table you want to Sqoop.
C. Entering arguments like this is required if your Named User does not have the default schema set to "dbo" or if the schema of your table is not the default schema for the database and user.
D. The SQL Server JDBC driver must be placed correctly in your installation. For my particular version of Cloudera the location is "/opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/sqoop/lib/sqljdbc41.jar", but you may also try putting it in either "/var/lib/oozie" or "/var/lib/sqoop". I am not sure whether either of those works on its own.
E. I have not been successful at replacing the UserName and Password I hard-coded as arguments with values from a job.properties file. I believe it is possible, but I have been unable to find anyone who can clearly show how to do it, and days of brute-force trial and error have been unsuccessful.
Here are screenshots showing what this looks like when done.
[Screenshot: SqoopCommandAsArguments]
[Screenshot: SqoopCommandAsArgumentsSuccess]
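The key point of the steps above is that every flag and every value is its own argument, with nothing left in the Sqoop command box. A minimal sketch of that token list follows, for illustration only. `sqoop_import_args` is a hypothetical helper, the server, user, and path values are placeholders, and the password value is an assumption on my part since `--password` takes a value.

```python
import shlex


def sqoop_import_args(jdbc_url, username, password, query, target_dir, mappers=1):
    """Return the Sqoop arguments as one token per list element."""
    if "$CONDITIONS" not in query:
        # A free-form --query import needs the $CONDITIONS placeholder.
        raise ValueError("query must contain $CONDITIONS")
    return [
        "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password", password,
        "--query", query,
        "--delete-target-dir",
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]


if __name__ == "__main__":
    args = sqoop_import_args(
        "jdbc:sqlserver://YourServerNameHere;database=YourDatabaseNameHere",
        "YourSQLServerNamedUserNameHere",
        "YourPasswordHere",  # placeholder value, assumed
        "Select * from SchemaNameHere.TableNameHere Where $CONDITIONS",
        "hdfs://FDQServerName:8020/path/to/NewFolderForThisTableHere",  # placeholder path
    )
    # Each list element corresponds to one argument row in the Hue editor.
    for token in args:
        print(token)
    # Roughly equivalent command line, quoted for a shell:
    print("sqoop " + shlex.join(args))
```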

Hive issue using yarn

I am running Hive SQL on YARN.
It throws an error with a join condition. I am able to create external as well as internal tables, but table creation fails when I use the command
create table as AS SELECT name from student.
When I run the same query through the Hive CLI it works fine, but with a Spring job it throws this error:
2016-03-28 04:26:50,692 [Thread-17] WARN
org.apache.hadoop.hive.shims.HadoopShimsSecure - Can't fetch tasklog:
TaskLogServlet is not supported in MR2 mode.
Task with the most failures(4):
-----
Task ID:
task_1458863269455_90083_m_000638
-----
Diagnostic Messages for this Task:
AttemptID:attempt_1458863269455_90083_m_000638_3 Timed out after 1 secs
2016-03-28 04:26:50,842 [main] INFO
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Killed application
application_1458863269455_90083
2016-03-28 04:26:50,849 [main] ERROR com.mapr.fs.MapRFileSystem - Failed to
delete path maprfs:/home/pro/amit/warehouse/scratdir/hive_2016-03-28_04-
24-32_038_8553676376881087939-1/_task_tmp.-mr-10003, error: No such file or
directory (2)
2016-03-28 04:26:50,852 [main] ERROR org.apache.hadoop.hive.ql.Driver -
FAILED: Execution Error, return code 2 from
From what I have found, I think there is some issue with scratdir.
Please advise if anyone has faced the same issue.

This issue occurs if the nested directory hierarchy does not exist. Hive doesn't automatically create directories recursively.
Please check that the directories exist at every level, from the root down to the child/table level.
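The level-by-level check can be sketched as follows. `first_missing_dir` is a hypothetical helper, and it assumes the warehouse or scratch path is reachable as an ordinary filesystem path (for example through an NFS mount); if it is not, the same walk has to be done with `hadoop fs -ls` instead.

```python
from pathlib import Path


def first_missing_dir(path):
    """Return the topmost ancestor of path (or path itself) that is not a directory, else None."""
    target = Path(path)
    # Path.parents runs from the immediate parent up to the root, so reverse it
    # to walk from the root down to the table-level directory.
    for candidate in [*reversed(target.parents), target]:
        if not candidate.is_dir():
            return candidate
    return None


if __name__ == "__main__":
    # Example path only; substitute your own scratch or warehouse directory.
    print(first_missing_dir("/home/pro/amit/warehouse/scratdir"))
```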

I faced a similar issue while running the below Hive query
select * from <db_name>.<internal_tbl_name> where <field_name_of_double_type> in (<list_of_double_values>) order by <list_of_order_fields> limit 10;
I performed an explain on the above statement and below was the result.
fs.FileUtil: Failed to delete file or dir [/hdfs/Hadoop_Misc_Logs/Edge01/local_scratch/<hive_username>/41289638-cd53-4d4b-88c9-3359e9ec99e2/hive_2017-05-08_04-26-36_658_6626096693992380903-1/.nfs0000000057b93e2d00001590]: it still exists.
2017-05-08 04:26:37,969 WARN [41289638-cd53-4d4b-88c9-3359e9ec99e2 main] fs.FileUtil: Failed to delete file or dir [/hdfs/Hadoop_Misc_Logs/Edge01/local_scratch/<hive_username>/41289638-cd53-4d4b-88c9-3359e9ec99e2/hive_2017-05-08_04-26-36_658_6626096693992380903-1/.nfs0000000057b93e2700001591]: it still exists.
Time taken: 0.886 seconds, Fetched: 24 row(s)
I then checked the logs with:
yarn logs -applicationId application_1458863269455_90083
The error appeared after a MapR upgrade by the admin team. It is probably due to an upgrade or installation issue and the Tez configuration (as suggested by line 873 in the log below). Alternatively, the Hive query may not be syntactically compatible with the Tez optimization. I say so because another Hive query on an external table runs fine in my case. I have to check a bit deeper, though.
I am not sure, but the log line that looks most relevant is the following:
2017-05-08 00:01:47,873 [ERROR] [main] |web.WebUIService|: Tez UI History URL is not set
Solution:
It is probably happening because some open files or applications are still using some resources. Please check https://unix.stackexchange.com/questions/11238/how-to-get-over-device-or-resource-busy
1. Run explain <your_Hive_statement>.
2. In the resulting execution plan, you will come across the file names/dirs that the Hive execution engine fails to delete, e.g.
2017-05-08 04:26:37,969 WARN [41289638-cd53-4d4b-88c9-3359e9ec99e2 main] fs.FileUtil: Failed to delete file or dir [/hdfs/Hadoop_Misc_Logs/Edge01/local_scratch/<hive_username>/41289638-cd53-4d4b-88c9-3359e9ec99e2/hive_2017-05-08_04-26-36_658_6626096693992380903-1/.nfs0000000057b93e2d00001590]: it still exists.
3. Go to the path given in step 2, e.g. /hdfs/Hadoop_Misc_Logs/Edge01/local_scratch/<hive_username>/41289638-cd53-4d4b-88c9-3359e9ec99e2/hive_2017-05-08_04-26-36_658_6626096693992380903-1/
4. In the path from step 3, running ls -a shows the leftover .nfs files, and lsof +D /path shows the process IDs that hold them open and block the delete.
5. If you run ps -ef | grep <pid>, you get
hive_username <pid> 19463 1 05:19 pts/8 00:00:35 /opt/mapr/tools/jdk1.7.0_51/jre/bin/java -Xmx256m -Dhiveserver2.auth=PAM -Dhiveserver2.authentication.pam.services=login -Dmapr_sec_enabled=true -Dhadoop.login=maprsasl -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/mapr/hadoop/hadoop-2.7.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/mapr/hadoop/hadoop-2.7.0 -Dhadoop.id.str=hive_username -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/mapr/hadoop/hadoop-2.7.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx512m -Dlog4j.configurationFile=hive-log4j2.properties -Dlog4j.configurationFile=hive-log4j2.properties -Djava.util.logging.config.file=/opt/mapr/hive/hive-2.1/bin/../conf/parquet-logging.properties -Dhadoop.security.logger=INFO,NullAppender -Djava.security.auth.login.config=/opt/mapr/conf/mapr.login.conf -Dzookeeper.saslprovider=com.mapr.security.maprsasl.MaprSaslProvider -Djavax.net.ssl.trustStore=/opt/mapr/conf/ssl_truststore org.apache.hadoop.util.RunJar /opt/mapr/hive/hive-2.1//lib/hive-cli-2.1.1-mapr-1703.jar org.apache.hadoop.hive.cli.CliDriver
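When the output is long, the paths Hive could not delete can be pulled out of the log text with a small script before inspecting them with lsof. This is a sketch only; `undeletable_paths` is a hypothetical helper and the regular expression assumes the exact "Failed to delete file or dir [...]" wording shown in the warnings above.

```python
import re

# Matches the bracketed path in lines such as:
# fs.FileUtil: Failed to delete file or dir [/some/path/.nfs0000]: it still exists.
PATTERN = re.compile(r"Failed to delete file or dir \[(?P<path>[^\]]+)\]")


def undeletable_paths(log_text):
    """Return the distinct paths named in the warnings, in order of first appearance."""
    seen = []
    for match in PATTERN.finditer(log_text):
        path = match.group("path")
        if path not in seen:
            seen.append(path)
    return seen


if __name__ == "__main__":
    import sys

    # Usage: hive -e "explain <statement>" 2>&1 | python this_script.py
    for path in undeletable_paths(sys.stdin.read()):
        print(path)
```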
CONCLUSION:
The CliDriver process clearly shows that running queries on "Hive on Spark" (or managed) tables through the Hive CLI is no longer supported from Hive 2.0 onwards and is going to be deprecated going forward. You have to use HiveContext in Spark for running Hive queries. But you can still run queries on Hive external tables through the Hive CLI.

hive failed execution error return code 2 from org.apache.hadoop.hive.ql.exec.mapredtask

I have one query. It executes fine on the Hive CLI and returns the result. But when I execute it through Hive JDBC, I get the error below:
java.sql.SQLException: Query returned non-zero code: 9, cause: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
at org.apache.hadoop.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:192)
What is the problem? Also, I was starting the Hive Thrift Server through a shell script (I wrote a script that contains the command to start the Hive Thrift Server). Later I decided to start the Hive Thrift Server manually by typing the command:
hadoop@ubuntu:~/hive-0.7.1$ bin/hive --service hiveserver
Starting Hive Thrift Server
org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address 0.0.0.0/0.0.0.0:10000.
at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:99)
at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:80)
at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:73)
at org.apache.hadoop.hive.service.HiveServer.main(HiveServer.java:384)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
hadoop@ubuntu:~/hive-0.7.1$
Please help me out with this.
Thanks

For this error :
java.sql.SQLException: Query returned non-zero code: 9, cause: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
at org.apache.hadoop.hive.jdbc.HiveStatement.executeQuer
Go to this link :
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_Hive.html
and add
hadoop-0.20-core.jar
hive/lib/hive-exec-0.7.1.jar
hive/lib/hive-jdbc-0.7.1.jar
hive/lib/hive-metastore-0.7.1.jar
hive/lib/hive-service-0.7.1.jar
hive/lib/libfb303.jar
lib/commons-logging-1.0.4.jar
slf4j-api-1.6.1.jar
slf4j-log4j12-1.6.1.jar
to the class path of your project. Add these jars from the lib folders of Hadoop and Hive, then try the code. Also add the lib folder paths of Hadoop, Hive, and HBase (if you are using it) to the project class path, just as you added the jars.
For the second error you got, type
netstat -nl | grep 10000
If it shows something, the Hive server is already running. The second error occurs only when the port you specify is already held by another process. The default server port is 10000, so verify with the netstat command above.
Note: if you are connected through the bin/hive prompt, exit from it before connecting from code. Otherwise the code will not connect, because I think (not sure) only one client can connect to the Hive server at a time.
Following the steps above should hopefully solve your problem.
NOTE: exit from the CLI before you execute the code, and don't start the CLI while the code is executing.
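The same port check can be done from code before starting the server or connecting. This is a minimal sketch using only the standard library; `port_in_use` is a hypothetical helper, and 10000 is the default port mentioned above.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use(10000):
        print("Port 10000 is taken; a Hive server (or another process) is already listening.")
    else:
        print("Port 10000 is free.")
```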

It might be a permission issue. Try a query like "SELECT * FROM ", which won't start MR jobs.

Try setting these properties before running your query.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
set hive.auto.convert.join = false;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.dynamic.partitions.pernode=10000;

Adding data source for jidea 11.0.2

I'm trying to connect to an Oracle 10g database from inside jidea, using ojdbc6-11.2.0.1.0.jar as the JDBC driver. Below is the error message I get when
I try to connect. Can anyone help me solve this issue?
Connection to oracle - albi1dv1 failed
java.sql.SQLException: ORA-00604: error occurred at recursive SQL level 1
ORA-01882: timezone region not found
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:439)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:388)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:381)
at oracle.jdbc.driver.T4CTTIfun.processError(T4CTTIfun.java:564)
at oracle.jdbc.driver.T4CTTIoauthenticate.processError(T4CTTIoauthenticate.java:431)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:436)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:186)
at oracle.jdbc.driver.T4CTTIoauthenticate.doOAUTH(T4CTTIoauthenticate.java:366)
at oracle.jdbc.driver.T4CTTIoauthenticate.doOAUTH(T4CTTIoauthenticate.java:752)
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:359)
at oracle.jdbc.driver.PhysicalConnection.<init>(PhysicalConnection.java:531)
at oracle.jdbc.driver.T4CConnection.<init>(T4CConnection.java:221)
at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:32)
at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:503)
Regards,
Rangana

I used the JDBC driver named ojdbc14_noneXe.jar, and it solved my problem. :)
FYI - I'm connecting to a remote Oracle development database, not a local installation on my machine!

You can set the time zone for IDEA, which prevents this error.
Add the following line to the idea.vmoptions file:
-Duser.timezone=your_database_timezone
Here's some explanation on how to get your database timezone

In Eclipse, go to Run -> Run Configurations.
There, go to the JRE tab in the right-side panel.
In the VM Arguments section, paste this:
-Duser.timezone=GMT
Then click Apply -> Run.
