When I use JDBC to query data from an Oracle database, my Java code looks like this:
ParsedSql parsedSql = NamedParameterUtils.parseSqlStatement(apiSql);
MapSqlParameterSource paramSource = new MapSqlParameterSource(param);
String sqlToUse = NamedParameterUtils.substituteNamedParameters(parsedSql, paramSource);
List<SqlParameter> declaredParameters = NamedParameterUtils.buildSqlParameterList(parsedSql, paramSource);
PreparedStatementCreatorFactory creatorFactory = new PreparedStatementCreatorFactory(sqlToUse, declaredParameters);
Object[] params = NamedParameterUtils.buildValueArray(parsedSql, paramSource, null);
PreparedStatementCreator creator = creatorFactory.newPreparedStatementCreator(params);
PreparedStatement preparedStatement = creator.createPreparedStatement(conn);
if(batchCount > maxBatchCount || batchCount == 0){
batchCount = (int)maxBatchCount;
}
preparedStatement.setFetchSize(batchCount);
ResultSet resultSet = preparedStatement.executeQuery();
I set the fetch size here. When the fetch size is 10, the query executes normally. When the fetch size is 100,000, an error occurs. Here is the error message:
java.sql.SQLException: Error
at com.alibaba.druid.pool.DruidDataSource.handleConnectionException(DruidDataSource.java:1770)
at com.alibaba.druid.pool.DruidPooledConnection.handleException(DruidPooledConnection.java:133)
at com.alibaba.druid.pool.DruidPooledStatement.checkException(DruidPooledStatement.java:82)
at com.alibaba.druid.pool.DruidPooledPreparedStatement.executeQuery(DruidPooledPreparedStatement.java:240)
at com.eternalinfo.alioth.openapi.service.DsApiInfoService.getApiData(DsApiInfoService.java:444)
at com.eternalinfo.alioth.openapi.service.DsApiInfoService.queryForList(DsApiInfoService.java:398)
at com.eternalinfo.alioth.openapi.service.DsApiInfoService.queryForList(DsApiInfoService.java:519)
at com.eternalinfo.alioth.openapi.service.DsApiInfoService$$FastClassBySpringCGLIB$$7b418e4c.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:769)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:747)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:95)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
...
Caused by: java.lang.ArrayIndexOutOfBoundsException: 802200000
at oracle.jdbc.driver.T4CNumberAccessor.unmarshalOneRow(T4CNumberAccessor.java:201)
at oracle.jdbc.driver.T4CTTIrxd.unmarshal(T4CTTIrxd.java:945)
at oracle.jdbc.driver.T4CTTIrxd.unmarshal(T4CTTIrxd.java:865)
at oracle.jdbc.driver.T4C8Oall.readRXD(T4C8Oall.java:790)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:403)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:227)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:208)
at oracle.jdbc.driver.T4CPreparedStatement.executeForRows(T4CPreparedStatement.java:1046)
at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1207)
at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1296)
at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3613)
at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3657)
at oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:1495)
at com.alibaba.druid.pool.DruidPooledPreparedStatement.executeQuery(DruidPooledPreparedStatement.java:227)
... 117 more
Environment:
JDK 8
ojdbc 11.2.0.4
Oracle 10, 11, and 12
I execute a simple full-table query with 89 fields. I know that it's possible to get an ArrayIndexOutOfBoundsException when carrying too many arguments in bulk inserts, but I have never seen it in queries.
I haven't found anything on this so far. Does anyone know what's going on?
When you set the fetch size to 100,000 you're telling the driver to fetch that many rows in one single roundtrip. That means the driver has to allocate enough space in heap to store all these rows before it can start processing them. And in the end it looks like the buffers are being corrupted anyway, hence this error.
This fetch size is much higher than any number typically used. The default is 10, which is often too small. It's hard to say what a reasonable value would be in this case without knowing the shape of the rows, but the rule of thumb is "don't go any higher than 1,000". Past that you'll be in the red zone for sure.
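As a minimal sketch (the batchCount and maxBatchCount variables come from the question's code; the 1,000 cap is an assumption, not a measured optimum), the fetch size could be clamped before the query runs:

// Clamp the requested fetch size to a conservative upper bound.
// MAX_FETCH_SIZE is a hypothetical constant; tune it for your row width and heap size.
private static final int MAX_FETCH_SIZE = 1000;

private static int clampFetchSize(int requested) {
    if (requested <= 0 || requested > MAX_FETCH_SIZE) {
        return MAX_FETCH_SIZE;
    }
    return requested;
}

// Usage, in place of the batchCount logic from the question:
// preparedStatement.setFetchSize(clampFetchSize(batchCount));
// ResultSet resultSet = preparedStatement.executeQuery();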
Related
I am attempting to read a large table from an Oracle database into a Spark DataFrame using Spark's native read.jdbc in Scala. I have tested this with small and medium-sized tables (up to 11M rows) and it works just fine. However, when attempting to bring in a larger table (~70M rows), I keep getting errors.
Sample code to show how I am reading this in:
val df = sparkSession.read.jdbc(
url = jdbcUrl,
table = "( SELECT * FROM keyspace.table WHERE EXTRACT(year FROM date_column) BETWEEN 2012 AND 2016)"
columnName = "id_column", // numeric column, 40% NULL
lowerBound = 1L,
upperBound = 100000L,
numPartitions = 60, // same as number of cores
connectionProperties = connectionProperties) // this contains login & password
I am attempting to parallelise the operation, as I am using a cluster with 60 cores and 6 x 32GB RAM dedicated to this app. However, I still keep getting errors relating to timeouts and out of memory issues, such as:
17/08/16 14:01:18 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
....
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
...
17/08/16 14:17:14 ERROR RetryingBlockFetcher: Failed to fetch block rdd_2_89, and will not retry (0 retries)
org.apache.spark.network.client.ChunkFetchFailureException: Failure while fetching StreamChunkId{streamId=398908024000, chunkIndex=0}: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$4.apply(DiskStore.scala:125)
...
17/08/16 14:17:14 WARN BlockManager: Failed to fetch block after 1 fetch failures. Most recent failure cause:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
There should be more than enough RAM across the cluster for a table of this size (I've read in local tables 10x bigger), so I have a feeling that for some reason the data read may not be happening in parallel. Looking at the timeline in the Spark UI, I can see that one executor hangs and is 'computing' for very long periods of time. Now, the partitioning column has a lot of NULL values in it (about 40%), but it is the only numeric column (the others are dates and strings) - could this make a difference? Is there another way to parallelise a JDBC read?
the partitioning column has a lot of NULL values in it (about 40%), but it is the only numeric column (the others are dates and strings) - could this make a difference?
It makes a huge difference. All rows with NULL in the partitioning column will go to the last partition:
val whereClause =
if (uBound == null) {
lBound
} else if (lBound == null) {
s"$uBound or $column is null"
} else {
s"$lBound AND $uBound"
}
Is there another way to parallelise a JDBC read?
You can use predicates on columns other than numeric ones. You could, for example, use the ROWID pseudocolumn in the table and build a series of predicates based on its prefix.
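A minimal sketch of a predicate-based read using Spark's Java API, since most of the code on this page is Java; instead of literal ROWID prefixes it swaps in MOD(ORA_HASH(ROWID), n) so the buckets are disjoint, cover the whole table, and include rows where id_column is NULL (the table subquery comes from the question, everything else is an assumption):

import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PredicateJdbcRead {
    public static Dataset<Row> readWithPredicates(SparkSession spark,
                                                  String jdbcUrl,
                                                  Properties connectionProperties) {
        int numPartitions = 60;                         // one predicate per partition/core
        String[] predicates = new String[numPartitions];
        for (int i = 0; i < numPartitions; i++) {
            // Each predicate selects a disjoint bucket of rows; together the buckets
            // cover every row, including those where id_column is NULL.
            predicates[i] = "MOD(ORA_HASH(ROWID), " + numPartitions + ") = " + i;
        }
        String table = "(SELECT * FROM keyspace.table "
                     + "WHERE EXTRACT(year FROM date_column) BETWEEN 2012 AND 2016)";
        return spark.read().jdbc(jdbcUrl, table, predicates, connectionProperties);
    }
}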
I am referring to this documentation: http://www-01.ibm.com/support/docview.wss?uid=swg21981328. As per the article, if we use the executeBatch method then inserts will be faster (the Netezza JDBC driver may detect a batch insert and, under the covers, convert it to an external table load, and the external table load will be faster). I have to execute millions of insert statements, and I am getting a speed of only 500 records per minute per connection at most. Is there any better way to load data faster into Netezza via a JDBC connection? I am using Spark and a JDBC connection to insert the records. Why is the external table load not happening even when I am executing in batches? Given below is the Spark code I am using:
// insertQueryDataSet is a Dataset<String>; each element is a complete INSERT statement
insertQueryDataSet.foreachPartition(partition -> {
Connection conn = NetezzaConnector.getSingletonConnection(url, userName, pwd);
conn.setAutoCommit(false);
int commitBatchCount = 0;
int insertBatchCount = 0;
Statement statement = conn.createStatement();
//PreparedStatement preparedStmt = null;
while(partition.hasNext()){
insertBatchCount++;
//preparedStmt = conn.prepareStatement(partition.next());
statement.addBatch(partition.next());
//statement.addBatch(partition.next());
commitBatchCount++;
if(insertBatchCount % 10000 == 0){
LOGGER.info("Before executeBatch.");
int[] execCount = statement.executeBatch();
LOGGER.info("After execCount." + execCount.length);
LOGGER.info("Before commit.");
conn.commit();
LOGGER.info("After commit.");
}
}
//execute remaining statements
int[] execCount = statement.executeBatch();
LOGGER.info("After execCount." + execCount.length);
conn.commit();
conn.close();
});
I tried this approach (batch insert) but found it very slow, so I put all the data into CSV files and do an external table load for each CSV.
InsertReq="Insert into "+ tablename + " select * from external '"+ filepath + "' using (maxerrors 0, delimiter ',' unase 2000 encoding 'internal' remotesource 'jdbc' escapechar '\' )";
Jdbctemplate.execute(InsertReq);
Since I was using Java, the remote source is JDBC; note that the CSV file path is in single quotes.
Hope this helps.
If you find a better approach than this, don't forget to post it. :)
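A minimal sketch of the CSV-plus-external-table approach using plain JDBC instead of JdbcTemplate; the table name, CSV formatting, and the reduced set of USING options are assumptions, not a drop-in replacement for the statement above:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Iterator;

public class ExternalTableLoader {

    // Writes one partition's rows (already formatted as CSV lines) to a local file,
    // then asks Netezza to load that file as an external table over the JDBC link.
    public static void loadPartition(Connection conn,
                                     String tableName,
                                     Iterator<String> csvLines) throws IOException, SQLException {
        Path csvFile = Files.createTempFile("netezza-load-", ".csv");
        try (BufferedWriter writer = Files.newBufferedWriter(csvFile)) {
            while (csvLines.hasNext()) {
                writer.write(csvLines.next());
                writer.newLine();
            }
        }
        String insertSql = "INSERT INTO " + tableName
                + " SELECT * FROM EXTERNAL '" + csvFile.toAbsolutePath()
                + "' USING (DELIMITER ',' REMOTESOURCE 'JDBC')";
        try (Statement stmt = conn.createStatement()) {
            stmt.execute(insertSql);
        }
        Files.delete(csvFile);
    }
}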
Whether I run a scan command or a count, this error pops up, and the error message doesn't make sense to me.
What does it say, and how do I solve it?
org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException:
Expected nextCallSeq: 1 But the nextCallSeq got from client: 0;
request=scanner_id: 788 number_of_rows: 100 close_scanner: false
next_call_seq: 0
Commands:
count 'table', 5000
scan 'table', {COLUMN => ['cf:cq'], FILTER => "ValueFilter( =, 'binaryprefix:somevalue')"}
EDIT:
I have added the following settings in hbase-site.xml
<property>
<name>hbase.rpc.timeout</name>
<value>1200000</value>
</property>
<property>
<name>hbase.client.scanner.caching</name>
<value>100</value>
</property>
NO IMPACT
EDIT2: Added sleep
Result[] results = scanner.next(100);
for (int i = 0; i < results.length; i++) {
result = results[i];
try {
...
count++;
...
Thread.sleep(10); // ADDED SLEEP
} catch (Throwable exception) {
System.out.println(exception.getMessage());
System.out.println("sleeping");
}
}
New Error after Edit2:
org.apache.hadoop.hbase.client.ScannerTimeoutException: 101761ms passed since the last invocation, timeout is currently set to 60000
...
Caused by: org.apache.hadoop.hbase.UnknownScannerException: org.apache.hadoop.hbase.UnknownScannerException: Name: 31, already closed?
...
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.UnknownScannerException): org.apache.hadoop.hbase.UnknownScannerException: Name: 31, already closed?
...
FINALLY BLOCK: 9900
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.hbase.client.ScannerTimeoutException: 101766ms passed since the last invocation, timeout is currently set to 60000
...
Caused by: org.apache.hadoop.hbase.client.ScannerTimeoutException: 101766ms passed since the last invocation, timeout is currently set to 60000
...
Caused by: org.apache.hadoop.hbase.UnknownScannerException: org.apache.hadoop.hbase.UnknownScannerException: Name: 31, already closed?
...
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.UnknownScannerException): org.apache.hadoop.hbase.UnknownScannerException: Name: 31, already closed?
...
EDIT: By using the same client version that ships with the downloaded HBase (not the 0.99 client from Maven), I was able to solve this issue.
The server version is 0.98.6.1.
The client jars are inside the ./lib folder of the HBase distribution.
Don't forget to attach the ZooKeeper library.
OLD:
Right now I have done two things. First, I changed to the new table/connection API (0.99):
Configuration conf = HBaseConfiguration.create();
TableName name = TableName.valueOf("TABLENAME");
Connection conn = ConnectionFactory.createConnection(conf);
Table table = conn.getTable(name);
Then, when the error pops up, I try to recreate the connection:
scanner.close();
conn.close();
conf.clear();
conf = HBaseConfiguration.create();
conn = ConnectionFactory.createConnection(conf);
table = conn.getTable(name);
scanner = table.getScanner(scan);
This works, but it gets very slow after the first error it receives; scanning through all the rows takes a very long time.
This sometimes occurs after huge deletes; you need to merge the empty regions and try to rebalance your regions.
This can be caused by a broken disk as well. In my case the disk was not broken enough for Ambari, HDFS, or our monitoring services to notice, but it was broken enough that it couldn't serve one region.
After stopping the region server using that disk, the scan worked.
I found the regionserver by running hbase shell in debug mode:
hbase shell -d
Then some regionservers appeared in the output and one of them stood out.
Then I ran dmesg on the host to find the failing disk.
I'm running some sample code I wrote to test the HBase lockRow() and unlockRow() methods. The sample code is below:
HTable table = new HTable(config, "test");
RowLock rowLock = table.lockRow(Bytes.toBytes(row));
System.out.println("Obtained rowlock on " + row + "\nRowLock: " + rowLock);
Put p = new Put(Bytes.toBytes(row));
p.add(Bytes.toBytes("colFamily"), Bytes.toBytes(colFamily), Bytes.toBytes(value));
table.put(p);
System.out.println("put row");
table.unlockRow(rowLock);
System.out.println("Unlocked row!");
When I execute my code, I get an UnknownRowLockException. The documentation says that this error is thrown when an unknown row lock is passed to the region servers. I'm not sure how this is happening or how to resolve it.
The stack trace is below:
Obtained rowlock on row2
RowLock: org.apache.hadoop.hbase.client.RowLock@15af33d6
put row
Exception in thread "main" org.apache.hadoop.hbase.UnknownRowLockException: org.apache.hadoop.hbase.UnknownRowLockException: 5763272717012243790
at org.apache.hadoop.hbase.regionserver.HRegionServer.unlockRow(HRegionServer.java:2099)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:604)
at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1055)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:96)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.translateException(HConnectionManager.java:1268)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1014)
at org.apache.hadoop.hbase.client.HTable.unlockRow(HTable.java:870)
at HelloWorld.Hello.HelloWorld.main(HelloWorld.java:41)
EDIT:
I just realized that I should be printing rowLock.getLockId() instead of rowLock. I did this and compared it to the lock ID in the stack trace, and they are the same, so I'm not sure why the UnknownRowLockException occurs.
Please change the file descriptor limit on the underlying system.
On Linux you can do this with ulimit.
Note that HBase prints the ulimit it is seeing as the first line of its logs.
I was able to resolve this error in this way:
The rowLock being obtained needs to be passed as a parameter to the Put constructor.
HTable table = new HTable(config, "test");
RowLock rowLock = table.lockRow(Bytes.toBytes(row));
System.out.println("Obtained rowlock on " + row + "\nRowLock: " + rowLock);
Put p = new Put(Bytes.toBytes(row), rowLock);
p.add(Bytes.toBytes("colFamily"), Bytes.toBytes(colFamily), Bytes.toBytes(value));
table.put(p);
System.out.println("put row");
table.unlockRow(rowLock);
System.out.println("Unlocked row!");
In my earlier approach, a rowLock was being obtained on a row of the table. However, since the rowLock was not actually used (it was not passed to the Put constructor), when I called the unlockRow method, it waited for 60 seconds (the lock timeout) to check whether the lock had been used. After 60 seconds the lock expired, and I ended up with the UnknownRowLockException.
I have a stored procedure in an Oracle 10g database; in my Java code, I call it with:
CallableStatement cs = bdr.prepareCall("Begin ADMBAS01.pck_basilea_reportes.cargar_reporte(?,?,?,?,?); END;", ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY);
cs.setInt(1, this.reportNumber);
cs.registerOutParameter(2, OracleTypes.CURSOR);
cs.registerOutParameter(3, OracleTypes.INTEGER);
cs.registerOutParameter(4, OracleTypes.VARCHAR);
cs.setDate(5, new java.sql.Date(this.fecha1.getTime()));
cs.execute();
ResultSet rs = (ResultSet)cs.getObject(2);
I do obtain a ResultSet with the correct records in it, but when I try a scroll-insensitive operation (like absolute(1)), I keep getting an SQLException stating that it doesn't work on a FORWARD_ONLY ResultSet.
So how can I obtain this ResultSet with scroll-insensitive capabilities?
Thanks in advance.
The result set type is merely a suggestion to the driver, which the driver can ignore or downgrade to FORWARD_ONLY if it can't comply. In particular, a ResultSet obtained from an Oracle REF CURSOR out parameter comes back forward-only regardless of the type requested on the CallableStatement. See here for details.
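One possible workaround, assuming Java 7 or later and that the cursor's rows fit comfortably in memory, is to copy the forward-only ResultSet into a CachedRowSet, which is scrollable; this is a general JDBC technique rather than anything specific to the Oracle driver:

import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.rowset.CachedRowSet;
import javax.sql.rowset.RowSetProvider;

public class ScrollableCopy {
    // Buffers the remaining rows of a forward-only ResultSet in memory
    // and returns a scrollable copy.
    public static CachedRowSet toScrollable(ResultSet forwardOnly) throws SQLException {
        CachedRowSet cached = RowSetProvider.newFactory().createCachedRowSet();
        cached.populate(forwardOnly);
        return cached;
    }
}

// Usage with the CallableStatement from the question:
// ResultSet rs = (ResultSet) cs.getObject(2);
// CachedRowSet scrollable = ScrollableCopy.toScrollable(rs);
// scrollable.absolute(1);   // works, because CachedRowSet is scroll-insensitive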