Timeout and out of memory errors reading large table using jdbc drivers - oracle

I am attempting to read a large table into a Spark DataFrame from an Oracle database using Spark's native read.jdbc in Scala. I have tested this with small and medium-sized tables (up to 11M rows) and it works just fine. However, when attempting to bring in a larger table (~70M rows), I keep getting errors.
Sample code showing how I am reading it in:
val df = sparkSession.read.jdbc(
  url = jdbcUrl,
  table = "( SELECT * FROM keyspace.table WHERE EXTRACT(year FROM date_column) BETWEEN 2012 AND 2016)",
  columnName = "id_column", // numeric column, 40% NULL
  lowerBound = 1L,
  upperBound = 100000L,
  numPartitions = 60, // same as number of cores
  connectionProperties = connectionProperties) // this contains login & password
I am attempting to parallelise the operation, as I am using a cluster with 60 cores and 6 x 32GB RAM dedicated to this app. However, I still keep getting errors relating to timeouts and out of memory issues, such as:
17/08/16 14:01:18 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
....
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
...
17/08/16 14:17:14 ERROR RetryingBlockFetcher: Failed to fetch block rdd_2_89, and will not retry (0 retries)
org.apache.spark.network.client.ChunkFetchFailureException: Failure while fetching StreamChunkId{streamId=398908024000, chunkIndex=0}: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$4.apply(DiskStore.scala:125)
...
17/08/16 14:17:14 WARN BlockManager: Failed to fetch block after 1 fetch failures. Most recent failure cause:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
There should be more than enough RAM across the cluster for a table of this size (I've read in local tables 10x bigger), so I have a feeling that for some reason the data read may not be happening in parallel. Looking at the timeline in the Spark UI, I can see that one executor hangs and is 'computing' for very long periods of time. Now, the partitioning column has a lot of NULL values in it (about 40%), but it is the only numeric column (the others are dates and strings). Could this make a difference? Is there another way to parallelise a JDBC read?

the partitioning column has a lot of NULL values in it (about 40%), but it is the only numeric column (the others are dates and strings) - could this make a difference?
It makes a huge difference. All rows where the column is NULL go to a single partition (the one without a lower bound, i.e. the first one), per the clause-building code in Spark:
val whereClause =
  if (uBound == null) {
    lBound
  } else if (lBound == null) {
    s"$uBound or $column is null"
  } else {
    s"$lBound AND $uBound"
  }
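To see which generated clause picks up the NULL rows, the clause construction can be replayed outside Spark. The Java sketch below is a simplified model of that logic, not Spark's actual source: the class and method names are hypothetical, the integer stride arithmetic is an assumption based on Spark 2.x, and the bounds are the ones from the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a simplified model of how per-partition WHERE clauses are derived.
class StrideClauses {
    // Assumes numPartitions >= 2 and lower < upper.
    static List<String> build(String column, long lower, long upper, int numPartitions) {
        // Assumption: integer stride, as in Spark 2.x.
        long stride = upper / numPartitions - lower / numPartitions;
        long current = lower;
        List<String> clauses = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            // The first partition has no lower bound, the last one has no upper bound.
            String lBound = (i != 0) ? column + " >= " + current : null;
            current += stride;
            String uBound = (i != numPartitions - 1) ? column + " < " + current : null;
            String where;
            if (uBound == null) {
                where = lBound;
            } else if (lBound == null) {
                where = uBound + " or " + column + " is null";
            } else {
                where = lBound + " AND " + uBound;
            }
            clauses.add(where);
        }
        return clauses;
    }

    public static void main(String[] args) {
        List<String> clauses = build("id_column", 1L, 100000L, 60);
        System.out.println(clauses.get(0));   // clause of the first partition
        System.out.println(clauses.get(59));  // clause of the last partition
    }
}
```

With the question's bounds, the first clause comes out as id_column < 1667 or id_column is null, so every row with a NULL id is read by one task. The last clause, id_column >= 98295, has no upper limit, so any ids above the configured upperBound also pile up in a single task.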
Is there another way to parallelise a jdbc read?
You can use predicates on columns other than numeric ones. For example, you could use the ROWID pseudocolumn of the table and a series of predicates based on its prefix.
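As a rough illustration of the predicate approach, the Java sketch below builds one predicate per month. It deliberately swaps the ROWID prefix idea for the date_column from the question, because the date range there (2012 to 2016) splits into exactly 60 months, matching the 60 cores. The class name and method are hypothetical; the resulting strings are meant to be passed to the jdbc(url, table, predicates, connectionProperties) overload of Spark's DataFrameReader, where each predicate becomes one partition.

```java
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: one non-overlapping predicate per calendar month.
class MonthlyPredicates {
    static List<String> build(String column, YearMonth first, YearMonth last) {
        List<String> predicates = new ArrayList<>();
        for (YearMonth m = first; !m.isAfter(last); m = m.plusMonths(1)) {
            String from = m.atDay(1).toString();              // ISO date, e.g. 2012-01-01
            String to = m.plusMonths(1).atDay(1).toString();  // first day of the next month
            // Half-open range [from, to) written with Oracle DATE literals.
            predicates.add(column + " >= DATE '" + from + "' AND "
                    + column + " < DATE '" + to + "'");
        }
        return predicates;
    }

    public static void main(String[] args) {
        List<String> predicates =
                build("date_column", YearMonth.of(2012, 1), YearMonth.of(2016, 12));
        System.out.println(predicates.size());
        System.out.println(predicates.get(0));
    }
}
```

Because the ranges are half-open and contiguous, no row is read twice. Rows with a NULL date_column match none of the predicates, which is consistent with the question's query since its year filter already excludes them. How evenly the partitions are sized depends on how the data is spread across months.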


[Snowflake-jdbc] It hangs when getting info from the ResultSet object of connection.getMetaData().getColumns(...)

I am trying to test the Snowflake JDBC connection with the code below:
Connection conn = .......
.......
ResultSet rs = conn.getMetaData().getColumns(null, "PUBLIC", "TAB1", null); // 1. set parameters to get metadata of table TAB1
while (rs.next()) { // 2. It hangs here if the first parameter is null in the line above; otherwise (with the correct db name set), it works fine
    System.out.println("precision:" + rs.getInt(7)
        + ",col type name:" + rs.getString(6)
        + ",col type:" + rs.getInt(5)
        + ",col name:" + rs.getString(4)
        + ",CHAR_OCTET_LENGTH:" + rs.getInt(16)
        + ",buf LENGTH:" + rs.getString(8)
        + ",SCALE:" + rs.getInt(9));
}
.......
I debugged the code above in IntelliJ IDEA and found that the debugger can't show the details of the object; it always shows "Evaluating..."
The JDBC driver I used is snowflake-jdbc-3.12.5.jar
Is it a bug?
When the catalog (database) argument is null, the JDBC code effectively runs the following SQL, which you can verify in your Snowflake account's Query History UIs/Views:
show columns in account;
This is an expensive metadata query to run due to no filters and the wide requested breadth (columns across the entire account).
Depending on how many databases exist in your organization's account, it may take several minutes or up to an hour to return results, which explains the apparent "hang". In a simple test with 50k+ tables spread across 100+ databases and schemas, this took at least 15 minutes to return results.
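One way to avoid the account-wide scan is to never pass a null catalog. The Java sketch below is a minimal illustration using only the standard java.sql API: it takes the connection's current catalog (for Snowflake, presumably the session's current database) and uses it to scope the lookup. The class and method names are hypothetical.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper: always scope getColumns() to a concrete catalog.
class ScopedColumns {
    static ResultSet columns(Connection conn, String schema, String table) throws SQLException {
        // Standard JDBC call: the connection's current catalog name, or null if none is set.
        String catalog = conn.getCatalog();
        if (catalog == null) {
            // Fail fast instead of silently issuing an unfiltered metadata query.
            throw new IllegalStateException("No current catalog; set a database on the connection first");
        }
        return conn.getMetaData().getColumns(catalog, schema, table, null);
    }
}
```

A call such as ScopedColumns.columns(conn, "PUBLIC", "TAB1") would then replace the getColumns(null, ...) call from the question, assuming the connection was opened with a database set.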
I debugged the code above in IntelliJ IDEA and found that the debugger can't show the details of the object; it always shows "Evaluating..."
This may be a quirk of your IDE, but in a pinch you can use the Dump Threads option in IDEA (Ctrl + Escape, or Ctrl + Break) to capture a single thread dump. This should help show that the JDBC client thread isn't hanging (as in, it's not locked or starved); it is only waiting for the server to send back results.
There is no issue with the 3.12.5 jar. I just tested the same version in Eclipse and I can inspect all the objects. It could be an issue with your IDE.
ResultSet columns = metaData.getColumns(null, null, "TESTTABLE123", null);
while (columns.next()) {
    System.out.print("Column name and size: " + columns.getString("COLUMN_NAME"));
    System.out.print("(" + columns.getInt("COLUMN_SIZE") + ")");
    System.out.println(" ");
    System.out.println("COLUMN_DEF : " + columns.getString("COLUMN_DEF"));
    System.out.println("Ordinal position: " + columns.getInt("ORDINAL_POSITION"));
    System.out.println("Catalog: " + columns.getString("TABLE_CAT"));
    System.out.println("Data type (integer value): " + columns.getInt("DATA_TYPE"));
    System.out.println("Data type name: " + columns.getString("TYPE_NAME"));
    System.out.println(" ");
}

C3P0 connection pool reported ORA-01000: maximum open cursors exceeded error

I am working with Hibernate 3.6 and c3p0-0.9.2-pre1. I am getting the ORA-01000: maximum open cursors exceeded error while using the C3P0 configuration in hibernate.cfg.xml.
I have an Oracle DatabaseSchema.conf file containing around 200 CREATE TABLE statements. While validating that Oracle database schema, I encountered a weird scenario:
If I do not use connection pooling, there is no problem: once the Connection is closed, the database releases the open cursors, and the open_cursors count does not increase significantly; it stays under the default open_cursors limit of 300.
But if I use connection pooling, the open_cursors count increases significantly, crosses the default limit of 300, and the ORA-01000: maximum open cursors exceeded error is reported.
I have already tried the following approaches :
Set the following properties to '0' in hibernate.cfg.xml in order to disable C3P0's statement cache:
<property name="hibernate.c3p0.max_statements">0</property>
<property name="hibernate.c3p0.maxStatementsPerConnection">0</property>
After closing the SESSION, CONNECTION and RESULTSET, assigned null to each of them.
SESSION:
if (session != null && session.isOpen()) {
    session.close();
    session = null; // assign null
}
RESULTSET:
if (rs != null) {
    rs.close();
    rs = null; // assign null
}
CONNECTION:
if (conn != null) {
    conn.close();
    conn = null; // assign null
}
Tried various permutations and combinations of the C3P0 configuration defined in hibernate.cfg.xml.
For example, I successively increased numHelperThreads while observing the OPEN_CURSOR count, and changed the pool's min and max size. I changed the other settings in a similar way.
Used the Connection Provider class in the C3P0 configuration and imported the hibernate-c3p0 jar compatible with Hibernate 3.6.
Upgraded Hibernate from 3.6 to 5.2.
Used the DBCP connection pooling mechanism instead of C3P0.
Approaches 1-6 result in high OPEN_CURSOR counts, and only approach 7 gives the desired outcome.
What else should I try to make this work with C3P0?

Spark Streaming: Micro batches Parallel Execution

We are receiving data in Spark Streaming from Kafka. Once execution has started in Spark Streaming, it executes only one batch at a time and the remaining batches start queuing up in Kafka.
Our data is independent and can be processed in parallel.
We tried multiple configurations with multiple executors, cores, back pressure and other settings, but nothing has worked so far. A lot of messages are queued, only one micro-batch is processed at a time, and the rest remain in the queue.
We want to achieve maximum parallelism so that no micro-batch is queued, as we have enough resources available. How can we reduce the processing time by making maximum use of the resources?
// Start reading messages from Kafka and get DStream
final JavaInputDStream<ConsumerRecord<String, byte[]>> consumerStream = KafkaUtils.createDirectStream(
        getJavaStreamingContext(), LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, byte[]>Subscribe("TOPIC_NAME",
                sparkServiceConf.getKafkaConsumeParams()));
ThreadContext.put(Constants.CommonLiterals.LOGGER_UID_VAR, CommonUtils.loggerUniqueId());
JavaDStream<byte[]> messagesStream = consumerStream.map(new Function<ConsumerRecord<String, byte[]>, byte[]>() {
    private static final long serialVersionUID = 1L;
    @Override
    public byte[] call(ConsumerRecord<String, byte[]> kafkaRecord) throws Exception {
        return kafkaRecord.value();
    }
});
// Decode each binary message and generate JSON array
JavaDStream<String> decodedStream = messagesStream.map(new Function<byte[], String>() {
    private static final long serialVersionUID = 1L;
    @Override
    public String call(byte[] asn1Data) throws Exception {
        if (asn1Data.length > 0) {
            try (InputStream inputStream = new ByteArrayInputStream(asn1Data);
                    Writer writer = new StringWriter(); ) {
                ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(asn1Data);
                GZIPInputStream gzipInputStream = new GZIPInputStream(byteArrayInputStream);
                byte[] buffer = new byte[1024];
                ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
                int len;
                while ((len = gzipInputStream.read(buffer)) != -1) {
                    byteArrayOutputStream.write(buffer, 0, len);
                }
                return new String(byteArrayOutputStream.toByteArray());
            } catch (Exception e) {
                //
                producer.flush();
                throw e;
            }
        }
        return null;
    }
});
// publish generated json gzip to kafka
cache.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    private static final long serialVersionUID = 1L;
    @Override
    public void call(JavaRDD<String> jsonRdd4DF) throws Exception {
        //Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
        if (!jsonRdd4DF.isEmpty()) {
            //JavaRDD<String> jsonRddDF = getJavaSparkContext().parallelize(jsonRdd4DF.collect());
            Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
            SparkAIRMainJsonProcessor airMainJsonProcessor = new SparkAIRMainJsonProcessor();
            airMainJsonProcessor.processAIRData(json, sparkSession);
        }
    }
});
getJavaStreamingContext().start();
getJavaStreamingContext().awaitTermination();
getJavaStreamingContext().stop();
Technology that we are using:
HDFS 2.7.1.2.5
YARN + MapReduce2 2.7.1.2.5
ZooKeeper 3.4.6.2.5
Ambari Infra 0.1.0
Ambari Metrics 0.1.0
Kafka 0.10.0.2.5
Knox 0.9.0.2.5
Ranger 0.6.0.2.5
Ranger KMS 0.6.0.2.5
SmartSense 1.3.0.0-1
Spark2 2.0.x.2.5
Statistics that we got from different experiments:
Experiment 1
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 48 Minutes
Experiment 2
spark.default.parallelism=12
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 8 Minutes
Experiment 3
spark.default.parallelism=12
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 7 Minutes
Experiment 4
spark.default.parallelism=16
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 10 Minutes
Please advise how we can maximize throughput so that nothing is queued.
I was facing the same issue. I tried a few things to resolve it and came to the following findings:
First of all, intuition says that one batch should be processed per executor. On the contrary, only one batch is processed at a time, although its jobs and tasks are processed in parallel.
Multiple batches can be processed concurrently by setting spark.streaming.concurrentJobs, but it's not documented and still needs a few fixes. One of the problems is with saving Kafka offsets. Suppose we set this parameter to 4 and 4 batches are processed in parallel: if the 4th batch finishes before the 3rd one, which Kafka offsets should be committed? This parameter is quite useful if batches are independent.
Because of its name, spark.default.parallelism is sometimes assumed to make things parallel, but its true benefit is in distributed shuffle operations. Try different values and find the optimum for your job. You will see a considerable difference in processing time, depending on the shuffle operations in your jobs. Setting it too high decreases performance, which is apparent from your experiment results too.
Another option is to use foreachPartitionAsync in place of foreach on the RDD. But I think foreachPartition is better, as foreachPartitionAsync queues up the jobs: batches appear to be processed, but their jobs are still queued or running. Maybe I didn't get its usage right, but it behaved the same way in my 3 services.
spark.scheduler.mode=FAIR should be used for jobs with lots of tasks, as round-robin assignment of tasks to jobs gives smaller tasks the opportunity to start receiving resources while bigger tasks are processing.
Try to tune your batch duration and input size so that the processing time always stays below the batch duration; otherwise you're going to see a long backlog of batches.
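The offset question raised above for concurrent batches can be made concrete with a small bookkeeping sketch. The Java class below is a hypothetical illustration, not part of Spark or Kafka: it assumes batches carry consecutive numeric ids (an assumption made only for this example) and reports the highest batch whose offsets are safe to commit, i.e. the end of the contiguous run of finished batches.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: only commit up to the last batch of an unbroken finished sequence.
class InOrderCommitTracker {
    private final Set<Long> finished = new HashSet<>();
    private long nextToCommit;

    InOrderCommitTracker(long firstBatchId) {
        this.nextToCommit = firstBatchId;
    }

    // Marks a batch as finished and returns the highest batch id that is now
    // safe to commit, or -1 if an earlier batch is still running.
    synchronized long markFinished(long batchId) {
        finished.add(batchId);
        long safe = -1;
        // Advance over every consecutive batch that has already finished.
        while (finished.remove(nextToCommit)) {
            safe = nextToCommit;
            nextToCommit++;
        }
        return safe;
    }
}
```

With a tracker started at batch 1, finishing batches 1 and 2 returns 1 and then 2; finishing batch 4 early returns -1, and the later completion of batch 3 returns 4, at which point the offsets of both batches can be committed together.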
These are my findings and suggestions. However, there are so many configurations and methods for streaming that a setup which works for one set of operations often doesn't work for another. Spark Streaming is all about learning, putting your experience and anticipation together to arrive at an optimum set of configurations.
Hope it helps. It would be a great relief if someone could explain specifically how we can legitimately process batches in parallel.
We want to achieve parallelism at maximum, so that not any micro batch is queued
That's the thing about stream processing: you process the data in the order it was received. If you process your data at a rate slower than it arrives, it will be queued. Also, don't expect that the processing of one record will suddenly be parallelized across multiple nodes.
From your screenshot, it seems your batch time is 10 seconds and your producer published 100 records over 90 seconds.
It took 36s to process 2 records and 70s to process 17 records. Clearly, there is some per-batch overhead. If this dependency is linear, it would take only 4:18 to process all 100 records in a single mini-batch thus beating your record holder.
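The 4:18 figure follows from fitting a straight line through the two observations above (2 records in 36 s, 17 records in 70 s) and extrapolating to 100 records. A minimal Java sketch of that arithmetic, with hypothetical class and method names:

```java
// Hypothetical helper: linear extrapolation of batch time from two observations.
class BatchTimeEstimate {
    static double estimateSeconds(double n1, double t1, double n2, double t2, double n) {
        double perRecord = (t2 - t1) / (n2 - n1);   // (70 - 36) / (17 - 2), about 2.27 s per record
        double overhead = t1 - perRecord * n1;      // fixed per-batch cost, about 31.5 s
        return overhead + perRecord * n;
    }

    public static void main(String[] args) {
        long total = Math.round(estimateSeconds(2, 36, 17, 70, 100));
        // Prints 4:18 (258 seconds)
        System.out.println(total / 60 + ":" + String.format("%02d", total % 60));
    }
}
```

The roughly 31 seconds of fixed overhead per batch is what makes many small batches so much slower than one large one, assuming the relationship really is linear.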
Since your code is not complete, it's hard to tell what exactly takes so much time. Transformations in the code look fine but probably the action (or subsequent transformations) are the real bottlenecks. Also, what's with producer.flush() which wasn't mentioned anywhere in your code?
I was facing the same issue and I solved it using Scala Futures.
Here are some links that show how to use them:
https://alvinalexander.com/scala/how-use-multiple-scala-futures-in-for-comprehension-loop
https://www.beyondthelines.net/computing/scala-future-and-execution-context/
Also, this is a piece of my code where I used Scala Futures:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

messages.foreachRDD { rdd =>
  // Run this batch's work in a Future so that foreachRDD returns immediately
  val f = Future {
    val newRDD = rdd.map(message => message.value())
    println("Request messages: " + newRDD.count())
    val resultrows = newRDD.collect() // brings the whole batch to the driver
    processMessage(resultrows, mlFeatures: MLFeatures, conf)
    println("Inside scala future")
    1
  }
  f.onComplete {
    case Success(_) => println("yay!")
    case Failure(exception) => println("Oh no!")
  }
}
It's hard to tell without having all the details, but the general advice for tackling issues like this is to start with a very simple application of the "Hello world" kind. Just read from the input stream and print the data to a log file. Once this works, you have shown that the problem was in the application, and you can gradually add your functionality back until you find the culprit. If even the simplest app doesn't work, you know the problem is in the configuration or the Spark cluster itself. Hope this helps.

C# NEST Bulk api failing with System.IO.IOException [duplicate]

This question already has an answer here:
Elasticsearch bulk insert with NEST returns es_rejected_execution_exception
(1 answer)
Closed 5 years ago.
I am trying to bulk insert data from SQL into an Elasticsearch index. Below is the code I am using; the total number of records is around 1.5 million. I think it has something to do with the connection settings, but I am not able to figure it out. Can someone please help with this code or suggest a better way to do it?
public void InsertReceipts()
{
    IEnumerable<Receipts> receipts = GetFromDB(); // get receipts from SQL DB
    const string index = "receipts";
    var config = ConfigurationManager.AppSettings["ElasticSearchUri"];
    var node = new Uri(config);
    var settings = new ConnectionSettings(node).RequestTimeout(TimeSpan.FromMinutes(30));
    var client = new ElasticClient(settings);
    var bulkIndexer = new BulkDescriptor();
    foreach (var receiptBatch in receipts.Batch(20000)) // using MoreLinq for Batch
    {
        Parallel.ForEach(receiptBatch, (receipt) =>
        {
            bulkIndexer.Index<OfficeReceipt>(i => i
                .Document(receipt)
                .Id(receipt.TransactionGuid)
                .Index(index));
        });
        var response = client.Bulk(bulkIndexer);
        if (!response.IsValid)
        {
            _logger.LogError(response.ServerError.ToString());
        }
        bulkIndexer = new BulkDescriptor();
    }
}
Code works fine but takes around 10 mins to complete. When I try to increase batch size, it fails with below error:
Invalid NEST response built from a unsuccessful low level call on
POST: /_bulk
Invalid Bulk items: OriginalException: System.Net.WebException: The
underlying connection was closed: An unexpected error occurred on a
send. ---> System.IO.IOException: Unable to write data to the
transport connection: An existing connection was forcibly closed by
the remote host. ---> System.Net.Sockets.SocketException: An existing
connection was forcibly closed by the remote host
A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.
It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size.
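To keep both the document count and the physical size of each bulk request in check, the batches can be cut by whichever limit is hit first. The Java sketch below illustrates the idea on already-serialized documents; the class name is hypothetical, and the limits in the usage note (5,000 documents, 10 MB) are assumptions picked from within the ranges mentioned above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split serialized documents into batches bounded by count and bytes.
class SizeBoundedBatcher {
    static List<List<String>> batch(List<String> docs, int maxDocs, long maxBytes) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (String doc : docs) {
            long size = doc.getBytes(StandardCharsets.UTF_8).length;
            boolean full = current.size() >= maxDocs || currentBytes + size > maxBytes;
            // Close the current batch before it would exceed either limit.
            if (full && !current.isEmpty()) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            // A single document larger than maxBytes still gets a batch of its own.
            current.add(doc);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

A call such as SizeBoundedBatcher.batch(jsonDocs, 5000, 10L * 1024 * 1024) would yield batches of at most 5,000 documents and roughly 10 MB each, which can then be sent as one bulk request per batch.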
I had a similar problem. It was solved by adding the following code before the ElasticClient connection is established:
System.Net.ServicePointManager.Expect100Continue = false;

SOLR 4.1 Out Of Memory error After commit of a few thousand Solr Docs

We are testing Solr 4.1 running inside Tomcat 7 and Java 7 with the following options:
JAVA_OPTS="-Xms256m -Xmx2048m -XX:MaxPermSize=1024m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+ParallelRefProcEnabled -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/ubuntu/OOM_HeapDump"
Our source code looks like the following:
/**** START *****/
int noOfSolrDocumentsInBatch = 0;
for (int i = 0; i < 5000; i++) {
    SolrInputDocument solrInputDocument = getNextSolrInputDocument();
    server.add(solrInputDocument);
    noOfSolrDocumentsInBatch += 1;
    if (noOfSolrDocumentsInBatch == 10) {
        server.commit();
        noOfSolrDocumentsInBatch = 0;
    }
}
/**** END *****/
The method "getNextSolrInputDocument()" generates a Solr document with 100 fields on average. Around 50 of the fields are of "text_general" type.
Some of the "text_general" fields consist of approximately 1000 words; the rest consist of a few words. Out of the total fields, around 35-40 are multivalued (not of type "text_general").
We are indexing all the fields but storing only 8 of them. Of these 8 fields, two are of string type, five are long and one is boolean. So our index size is only 394 MB, but the RAM occupied at the time of the OOM is around 2.5 GB. Why is the memory usage so high even though the index is small?
What is being stored in memory? Our understanding is that after every commit the documents are flushed to disk, so nothing should remain in RAM after a commit.
We are using the following settings:
server.commit() set waitForSearcher=true and waitForFlush=true
solrConfig.xml has following properties set:
directoryFactory = solr.MMapDirectoryFactory
maxWarmingSearchers = 1
text_general data type is being used as supplied in the schema.xml with the solr setup.
maxIndexingThreads = 8(default)
<autoCommit>
<maxTime>15000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
We get a Java heap Out Of Memory error after committing around 3990 Solr documents. Some snapshots of the memory dump from the profiler are uploaded at the following links.
http://s9.postimage.org/w7589t9e7/memorydump1.png
http://s7.postimage.org/p3abs6nuj/memorydump2.png
Can somebody please suggest what we should do to minimize/optimize the memory consumption in our case, and explain why? Please also suggest optimal values, with reasons, for the following parameters of solrConfig.xml:
- useColdSearcher - true/false?
- maxwarmingsearchers- number
- spellcheck-on/off?
- omitNorms=true/false?
- omitTermFreqAndPositions?
- mergefactor? we are using default value 10
- java garbage collection tuning parameters ?
