pyspark JDBC connection to Teradata - Limit of 3 concurrent requests

I am able to successfully connect to Teradata using pyspark with JDBC.
However, when trying to write to a CSV file I get the following error:
Py4JJavaError: An error occurred while calling o100.load.
: java.sql.SQLException: [Teradata Database] [TeraJDBC 14.10.00.42] [Error 3153] [SQLState HY000] TDWM Throttle violation for Concurrent Sessions: For Rule Name 'Non-Batch Users session limit', Limit of 3 concurrent requests
Is there a way to limit the number of concurrent sessions being used to 3?
Code below:
df=spark.read \
.format("jdbc") \
.option("driver", driver_td) \
.option("url", jdbc_url_td) \
.option("dbtable", 'tablename') \
.option("user", user_td) \
.option("password", pwd_td) \
.load()
potentially_impacted_accounts5.repartition(1).write.csv(output + userid + 'potentially_impacted_accounts' + file_date + '.csv', sep=',', header=True, mode='overwrite')
I get the error when running the write command
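One Spark-side knob worth trying: the JDBC source's numPartitions option caps the maximum number of concurrent JDBC connections Spark will open for a table read (an unpartitioned read like the one above already uses a single connection). The following is a minimal sketch, assuming the throttle is being tripped by Spark's own sessions rather than by other queries running under the same account:

df = spark.read \
    .format("jdbc") \
    .option("driver", driver_td) \
    .option("url", jdbc_url_td) \
    .option("dbtable", 'tablename') \
    .option("user", user_td) \
    .option("password", pwd_td) \
    .option("numPartitions", 3) \
    .option("fetchsize", 10000) \
    .load()

# coalesce(1) merges to a single partition without the full shuffle that repartition(1) forces
df.coalesce(1).write.csv(output + userid + 'potentially_impacted_accounts' + file_date + '.csv',
                         sep=',', header=True, mode='overwrite')

If other sessions under the same user are counted against the workload rule, this will not help and the limit would have to be raised on the Teradata side.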

Related

Snowflake Kafka Connect: How to avoid the "Error 5016. SinkRecord.value and SinkRecord.valueSchema cannot be null"?

We are just starting to utilize Kafka Connect to pipe our data into Snowflake.
We are hitting this error:
"
com.snowflake.kafka.connector.internal.SnowflakeKafkaConnectorException:
[SF_KAFKA_CONNECTOR] Exception: Invalid SinkRecord received
[SF_KAFKA_CONNECTOR] Error Code: 5016
[SF_KAFKA_CONNECTOR] Detail: SinkRecord.value and SinkRecord.valueSchema cannot be null
"
What can we do / look at to avoid this error?

SQOOP 1 failed to load Sybase driver - Could not load db driver class: com.sybase.jdbc3.jdbc.SybDriver

I am trying to import data from Sybase IQ using Sqoop 1. The jtds-1.3.1.jar is placed in the /sqoop/sqoop-1.4.6/lib folder.
When I run this command:
sqoop import --connect 'jdbc:jtds:sybase:tds://10.***.*.***#5500:*****' --driver 'com.sybase.jdbc3.jdbc.SybDriver' --username "username" --password -p --query "select * from dw.DM_ADDRESS where rownum <= 1000 and \$CONDITIONS" --target-dir "/user/*****/WT_Address_Itc" --split-by 1 --verbose
I get this error:
17/03/02 11:15:19 DEBUG manager.SqlManager: Execute getColumnInfoRawQuery : select * from dw.DM_ADDRESS_ITC where rownum <= 1000 and (1 = 0)
17/03/02 11:15:19 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.sybase.jdbc3.jdbc.SybDriver
java.lang.RuntimeException: Could not load db driver class: com.sybase.jdbc3.jdbc.SybDriver
at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:856)
at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:744)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:767)
at org.apache.sqoop.manager.SqlManager.getColumnInfoForRawQuery(SqlManager.java:270)
at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:241)
at org.apache.sqoop.manager.SqlManager.getColumnTypesForQuery(SqlManager.java:234)
at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:304)
at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1833)
at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1645)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:107)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:478)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:606)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
What am I missing here? Am I using the right driver, the one available on the http://jtds.sourceforge.net/ site?
I would really appreciate your help with this!
The driver class com.sybase.jdbc3.jdbc.SybDriver belongs to jconn3.jar (Sybase jConnect), not to the jTDS jar.
Download jconn3.jar and place it in the $SQOOP_HOME/lib directory.
The connection string must then be:
jdbc:sybase:Tds:dbhost:dbport/DATABASE=dbname

Is S3NativeFileSystem call killing my Pyspark Application on AWS EMR 4.6.0

My Spark application is failing when it has to access numerous CSV files (~1000 files, ~63 MB each) from S3 and pipe them into a Spark RDD. The actual process of splitting up the CSVs seems to work, but an extra function call to S3NativeFileSystem seems to be causing an error and crashing the job.
To begin, the following is my PySpark Application:
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
import time
startTime = float(time.time())
dataPath = 's3://PATHTODIRECTORY/'
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "MYKEY")
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "MYSECRETKEY")
def buildSchemaDF(tableName, columnList):
    currentRDD = sc.textFile(dataPath + tableName).map(lambda line: line.split("|"))
    currentDF = currentRDD.toDF(columnList)
    return currentDF
loadStartTime = float(time.time())
lineitemDF = buildSchemaDF('lineitem*', ['l_orderkey','l_partkey','l_suppkey','l_linenumber','l_quantity','l_extendedprice','l_discount','l_tax','l_returnflag','l_linestatus','l_shipdate','l_commitdate','l_receiptdate','l_shipinstruct','l_shipmode','l_comment'])
lineitemDF.registerTempTable("lineitem")
loadTimeElapsed = float(time.time()) - loadStartTime
queryStartTime = float(time.time())
qstr = """
SELECT
lineitem.l_returnflag,
lineitem.l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_discount) as sum_disc,
sum(l_tax) as sum_tax,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(l_orderkey) as count_order
FROM
lineitem
WHERE
l_shipdate <= '19981001'
GROUP BY
l_returnflag,
l_linestatus
ORDER BY
l_returnflag,
l_linestatus
"""
tpch1DF = sqlContext.sql(qstr)
queryTimeElapsed = float(time.time()) - queryStartTime
totalTimeElapsed = float(time.time()) - startTime
tpch1DF.show()
queryResults = [qstr, loadTimeElapsed, queryTimeElapsed, totalTimeElapsed]
distData = sc.parallelize(queryResults)
distData.saveAsTextFile(dataPath + 'queryResults.csv')
print 'Load Time: ' + str(loadTimeElapsed)
print 'Query Time: ' + str(queryTimeElapsed)
print 'Total Time: ' + str(totalTimeElapsed)
To take it step by step I start off by spinning up a Spark EMR Cluster with the following AWS CLI command (carriage returns added for readability):
aws emr create-cluster --name "Big TPCH Spark cluster2" --release-label emr-4.6.0
--applications Name=Spark --ec2-attributes KeyName=blazing-test-aws
--log-uri s3://aws-logs-132950491118-us-west-2/elasticmapreduce/j-1WZ39GFS3IX49/
--instance-type m3.2xlarge --instance-count 6 --use-default-roles
After the EMR cluster finishes provisioning I then copy over my Pyspark application onto the master node at '/home/hadoop/pysparkApp.py'. With it copied over I'm able to add the Step for spark-submit.
aws emr add-steps --cluster-id j-1DQJ8BDL1394N --steps
Type=spark,Name=SparkTPCHTests,Args=[--deploy-mode,cluster,--conf,spark.yarn.submit.waitAppCompletion=true,
--num-executors,5,--executor-cores,5,--executor-memory,20g,/home/hadoop/tpchSpark.py],
ActionOnFailure=CONTINUE
Now if I run this step over only a few of the aforementioned CSV files, the final results are generated, but the step still claims to have failed.
I think it's associated with an extra call to S3NativeFileSystem, but I'm not certain. These are the YARN log messages that lead me to that conclusion. The first call appears to work just fine:
16/05/15 23:18:00 INFO HadoopRDD: Input split: s3://data-set-builder/splitLineItem2/lineitemad:0+64901757
16/05/15 23:18:00 INFO latency: StatusCode=[200], ServiceName=[Amazon S3], AWSRequestID=[ED8011CE4E1F6F18], ServiceEndpoint=[https://data-set-builder.s3-us-west-2.amazonaws.com], HttpClientPoolLeasedCount=0, RetryCapacityConsumed=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=2, ClientExecuteTime=[77.956], HttpRequestTime=[77.183], HttpClientReceiveResponseTime=[20.028], RequestSigningTime=[0.229], CredentialsRequestTime=[0.003], ResponseProcessingTime=[0.128], HttpClientSendRequestTime=[0.35],
While the second one does not seem to execute properly, resulting in "Partial Results" (206 Error):
16/05/15 23:18:00 INFO S3NativeFileSystem: Opening 's3://data-set-builder/splitLineItem2/lineitemad' for reading
16/05/15 23:18:00 INFO latency: StatusCode=[206], ServiceName=[Amazon S3], AWSRequestID=[10BDDE61AE13AFBE], ServiceEndpoint=[https://data-set-builder.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RetryCapacityConsumed=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=2, Client Execute Time=[296.86], HttpRequestTime=[295.801], HttpClientReceiveResponseTime=[293.667], RequestSigningTime=[0.204], CredentialsRequestTime=[0.002], ResponseProcessingTime=[0.34], HttpClientSendRequestTime=[0.337],
16/05/15 23:18:02 INFO ApplicationMaster: Waiting for spark context initialization ...
I'm lost as to why it's even making the second call to S3NativeFileSystem when the first one appears to have responded effectively and even split the file. Is this something that is a product of my EMR configuration? I know S3Native has file limit issues and that a straight S3 call is optimal, which is what I've tried to do, but this call seems to be there no matter what I do. Please help!
Also, here are a few other error messages from my YARN log, in case they are relevant.
1)
16/05/15 23:19:22 ERROR ApplicationMaster: SparkContext did not initialize after waiting for 100000 ms. Please check earlier log output for errors. Failing the application.
16/05/15 23:19:22 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Timed out waiting for SparkContext.)
2)
16/05/15 23:19:22 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /mnt/yarn/usercache/hadoop/appcache/application_1463354019776_0001/blockmgr-f847744b-c87a-442c-9135-57cae3d1f6f0/2b/temp_shuffle_3fe2e09e-f8e4-4e5d-ac96-1538bdc3b401
java.io.FileNotFoundException: /mnt/yarn/usercache/hadoop/appcache/application_1463354019776_0001/blockmgr-f847744b-c87a-442c-9135-57cae3d1f6f0/2b/temp_shuffle_3fe2e09e-f8e4-4e5d-ac96-1538bdc3b401 (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:162)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:226)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/05/15 23:19:22 ERROR BypassMergeSortShuffleWriter: Error while deleting file /mnt/yarn/usercache/hadoop/appcache/application_1463354019776_0001/blockmgr-f847744b-c87a-442c-9135-57cae3d1f6f0/2b/temp_shuffle_3fe2e09e-f8e4-4e5d-ac96-1538bdc3b401
16/05/15 23:19:22 WARN TaskMemoryManager: leak 32.3 MB memory from org.apache.spark.unsafe.map.BytesToBytesMap#762be8fe
16/05/15 23:19:22 ERROR Executor: Managed memory leak detected; size = 33816576 bytes, TID = 14
16/05/15 23:19:22 ERROR Executor: Exception in task 13.0 in stage 1.0 (TID 14)
java.io.FileNotFoundException: /mnt/yarn/usercache/hadoop/appcache/application_1463354019776_0001/blockmgr-f847744b-c87a-442c-9135-57cae3d1f6f0/3a/temp_shuffle_b9001fca-bba9-400d-9bc4-c23c002e0aa9 (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
The order of precedence for Spark configuration is:
SparkContext (code/application) > spark-submit > spark-defaults.conf
So, a couple of things to point out here:
Use YARN cluster as the deploy mode and master in your spark-submit command:
spark-submit --deploy-mode cluster --master yarn ...
OR
spark-submit --master yarn-cluster ...
Remove "local" string from line sc = SparkContext("local", "Simple App") in your code. Use conf = SparkConf().setAppName(appName)
sc = SparkContext(conf=conf) to initialize Spark context.
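Put together, the suggested initialization looks roughly like this (a minimal sketch; the app name is carried over from the question):

from pyspark import SparkConf, SparkContext

# No hard-coded master here; let spark-submit / YARN decide where the job runs
conf = SparkConf().setAppName("Simple App")
sc = SparkContext(conf=conf)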
Ref - http://spark.apache.org/docs/latest/programming-guide.html

SQOOP - Code too large > MAX table definition?

I'm trying to import data into HDFS from a Teradata table with 2000 columns (the table definition is about 90K characters)... When I execute my script I get:
/tmp/sqoop-hadoopi/compile/636c527afc3baa6fdf33464f02430602/table.java:21971: code too large
My sqoop script:
sqoop import \
-libjars $LIB_JARS \
--connect jdbc:teradata://PRD/Database=database \
--connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--table table \
--username login \
--password pass \
My output log:
13/11/07 14:54:50 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
13/11/07 14:54:50 INFO manager.SqlManager: Using default fetchSize of 1000
13/11/07 14:54:50 INFO tool.CodeGenTool: Beginning code generation
13/11/07 14:55:31 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM table AS t WHERE 1=0
13/11/07 14:55:46 INFO orm.CompilationManager: HADOOP_HOME is /usr/lib/hadoop/libexec/..
13/11/07 14:55:46 INFO orm.CompilationManager: Found hadoop core jar at: /usr/lib/hadoop/libexec/../hadoop-core.jar
/tmp/sqoop-hadoopi/compile/636c527afc3baa6fdf33464f02430602/table.java:21971: code too large
public boolean equals(Object o) {
^
/tmp/sqoop-hadoopi/compile/636c527afc3baa6fdf33464f02430602/table.java:37949: code too large
public void write(DataOutput __dataOut) throws IOException {
^
/tmp/sqoop-hadoopi/compile/636c527afc3baa6fdf33464f02430602/table.java:49925: code too large
public String toString(DelimiterSet delimiters, boolean useRecordDelim) {
^
/tmp/sqoop-hadoopi/compile/636c527afc3baa6fdf33464f02430602/table.java:53970: code too large
private void __loadFromFields(List<String> fields) {
^
Note: /tmp/sqoop-hadoopi/compile/636c527afc3baa6fdf33464f02430602/table.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
4 errors
13/11/07 14:55:51 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Error returned by javac
at org.apache.sqoop.orm.CompilationManager.compile(CompilationManager.java:205)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:83)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:390)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:476)
at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
Maybe someone has already imported a table this big...
Thanks a lot!
Each method in Java is limited to 64KB of bytecode. I'm afraid the current version of Sqoop does not have facilities to break the long methods being generated in your case into multiple sub-methods, so I would suggest opening a new feature request in the Sqoop JIRA.
I don't know if you tried this already, but there's Teradata Connector for Hadoop:
http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available

Intermittent JDBC error two-task conversion routine

I am intermittently getting the following JDBC error with my Oracle DB (on a remote Linux machine):
java.sql.SQLException: ORA-03120: two-task conversion routine: integer overflow
The query that breaks most often (not the only one!) looks like this; note that I am using Groovy Sql.
String insertScript =
"""
INSERT INTO data_table (invoke_id , trans_type , trans_from ,
trans_to , serv_type ,msg_type , man_id , recip_id ,
trx_id , part_id , msg_orig_time , create_time ,
action_code , num_ranges , use_trx_id, due_date,owner,route,lsa,
cl_id,hl_id,requ_type,lnp_type,use_invoke_id)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
"""
def params = [invokeId,transType,transFrom,transTo,servType,msgType,man,recip,portId,owner,
now,now,"W",1,trxId,dueDate,owner," ", "LSA","CCH"," ","Basic","lspp",invokeId]
sql.execute insertScript, params
What I do not understand is why my queries work sometimes and not other times. Generally it happens during the first run and only rarely appears in subsequent runs.
[UPDATE] I am also sometimes getting this error on application startup, when Spring tries to load its context. I am using Spring and DBCP for connection pooling.
Caused by: org.apache.commons.dbcp.SQLNestedException: Error preloading the connection pool
at org.apache.commons.dbcp.BasicDataSource.createDataSource(BasicDataSource.java:1238)
at org.apache.commons.dbcp.BasicDataSource.getConnection(BasicDataSource.java:880)
at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:113)
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:79)
... 46 more
Caused by: java.sql.SQLException: ORA-03120: two-task conversion routine: integer overflow
at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:70)
at oracle.jdbc.driver.DatabaseError.newSQLException(DatabaseError.java:131)
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:204)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:455)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:406)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:399)
at oracle.jdbc.driver.T4CTTIoauthenticate.receiveOauth(T4CTTIoauthenticate.java:799)
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:368)
at oracle.jdbc.driver.PhysicalConnection.<init>(PhysicalConnection.java:508)
at oracle.jdbc.driver.T4CConnection.<init>(T4CConnection.java:203)
at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:33)
at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:510)
at org.apache.commons.dbcp.DriverConnectionFactory.createConnection(DriverConnectionFactory.java:38)
at org.apache.commons.dbcp.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:294)
at org.apache.commons.pool.impl.GenericObjectPool.addObject(GenericObjectPool.java:1059)
at org.apache.commons.dbcp.BasicDataSource.createDataSource(BasicDataSource.java:1235)
... 49 more
