Loading huge data to Sybase using Spring JdbcTemplate

I'm new to Spring. I'm working on a batch job that loads data from a CSV file into a Sybase table. The input file is more than 5 GB and has a few million records.
I'm doing a batch update using Spring JdbcTemplate. Below is my code snippet.
public int[] batchUpdate(final CsvReader products) throws IOException, SQLException {
    int[] updateCounts = getJdbcTemplate().batchUpdate(sqlStatement,
            new AbstractInterruptibleBatchPreparedStatementSetter() {

                @Override
                public boolean setValuesIfAvailable(PreparedStatement pstmt, int i) throws SQLException {
                    // logic to set the values on the prepared statement etc...
                }

                @Override
                public int getBatchSize() {
                    return batchSize;
                }
            });
    return updateCounts;
}
I'm using an Apache DBCP DataSource. I set the batch size to 2000 and did not change any of the other defaults (auto-commit etc.).
Now, when I run the job, it takes 4.5 minutes on average to insert 2000 records, and the job has been running for 2 days (it still hasn't completed).
Can anyone suggest how this can be optimized? Thanks in advance.
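For reference, here is a minimal sketch of the setup described above (an Apache DBCP BasicDataSource wrapped by a JdbcTemplate, with auto-commit left at its default); the driver class, URL, and credentials are placeholders, not values from the question:

import javax.sql.DataSource;
import org.apache.commons.dbcp.BasicDataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class SybaseLoaderConfig {

    public DataSource dataSource() {
        BasicDataSource ds = new BasicDataSource();
        ds.setDriverClassName("com.sybase.jdbc4.jdbc.SybDriver"); // placeholder driver class
        ds.setUrl("jdbc:sybase:Tds:host:port/db");                // placeholder URL
        ds.setUsername("user");                                   // placeholder credentials
        ds.setPassword("secret");
        // defaultAutoCommit stays true, matching "I did not change anything else in the defaults"
        return ds;
    }

    public JdbcTemplate jdbcTemplate() {
        return new JdbcTemplate(dataSource());
    }
}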

Related

Is there any way to free JVM memory in @AfterChunk in Spring Batch?

Is there any way to free JVM memory in @AfterChunk? We are getting an OutOfMemoryError after processing a couple of records.
Is there any way to free memory after Spring Batch job completion?
public class ABC implements ChunkListener {

    private static final Logger log = LoggerFactory.getLogger(ABC.class);
    private MessageFormat fmt = new MessageFormat("{0} items processed");
    private int loggingInterval = 100;

    @Override
    public void beforeChunk(ChunkContext context) {
        // Nothing to do here
    }

    @Override
    public void afterChunk(ChunkContext context) {
        int count = context.getStepContext().getStepExecution().getReadCount();
        // If the number of records processed so far is a multiple of the logging interval, output a log message.
        if (count > 0 && count % loggingInterval == 0) {
            log.info(fmt.format(new Object[] { Integer.valueOf(count) }));
        }
        //String name = context.getStepContext().getStepName();
        //context.getStepContext().registerDestructionCallback(name, callback);
    }
}
How do I call registerDestructionCallback to clean up? What are name and callback? Any references?
There is no way to force a GC in Java (System.gc() is just a hint to the JVM), and you should leave this to the garbage collector.
Items in a chunk-oriented step should be garbage collected after each chunk is processed; see Does Spring Batch release the heap memory after processing each batch?. If you get an OOM, make sure:
- your items are not held by a processor, mapper, etc. (see the sketch below)
- a chunk can fit in memory: sometimes, with the driving query pattern, the processor fetches more details about each item and you can run out of memory very quickly within the first few items
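A minimal sketch of the first point, with a hypothetical MyItem type (not from the original post); the first processor keeps every item reachable and defeats the per-chunk garbage collection described above, while the second one does not:

import java.util.ArrayList;
import java.util.List;

import org.springframework.batch.item.ItemProcessor;

class MyItem { } // hypothetical item type, for illustration only

// Anti-pattern: the processor accumulates every item it sees, so no chunk can ever be collected.
class LeakyProcessor implements ItemProcessor<MyItem, MyItem> {

    private final List<MyItem> seen = new ArrayList<>(); // grows for the whole step

    @Override
    public MyItem process(MyItem item) {
        seen.add(item); // keeps a reference to every item until the step ends
        return item;
    }
}

// Preferred: a stateless processor lets each chunk's items be garbage collected after the chunk commits.
class StatelessProcessor implements ItemProcessor<MyItem, MyItem> {

    @Override
    public MyItem process(MyItem item) {
        // transform and return; keep no references
        return item;
    }
}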

Spring Batch memory leak - XML to database using ItemWriter

I had a problem with a Spring Batch job that reads a large XML file (a few million records) and saves its records to a database. The job uses a chunk size of 100, a MultiResourceItemReader for reading the XML, an ItemProcessor for processing records, and an ItemWriter for writing records to the database using JPA and an EntityManager. The problem is that when the persist operation is called, the job ends up with an OutOfMemoryError (when I comment out the writer phase, the problem does not occur).
public class MyClassWriter implements ItemWriter<MyObject> {

    @Autowired
    private MyDelegate delegate;

    @Override
    public void write(List<? extends MyObject> items) throws Exception {
        ...
        List<MyObject> foos2 = (List<MyObject>) (List<?>) items;
        delegate.setInsert(foos2);
        ...
    }
}
and
public void setInsert(List<MyObject> list) {
    for (MyObject el : list) {
        em.persist(el);
    }
    em.flush();
    em.clear(); // I tried calling the clear operation too, but it did not solve the problem
}
Any suggestions for me?
It seems the OutOfMemoryError is caused by trying to save too many items at once; try saving the items in batches:
int c = 0;
for (MyObject mo : list) {
    em.persist(mo);
    if (++c % 1000 == 0) {
        em.flush();
    }
}
// save any remaining items
em.flush();
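A common variant (my addition, not part of the original answer) also clears the persistence context at each batch boundary so that already-flushed entities can be garbage collected; note that clear() detaches all managed entities, so this is only safe if nothing later in the step still relies on them being managed:

int c = 0;
for (MyObject mo : list) {
    em.persist(mo);
    if (++c % 1000 == 0) {
        em.flush(); // push the pending inserts to the database
        em.clear(); // detach the flushed entities so the persistence context stops growing
    }
}
// flush and detach any remaining items
em.flush();
em.clear();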

Spring batch - pass values between reader and processor

I have a requirement where I need to read values from an xls (which has a column called netCreditAmount) and save the values in a database. The requirement is to add up the netCreditAmount values from all the rows and store this sum in the database only for the first row of the xls, while the remaining rows are inserted with their own netCreditAmount values.
How should I go ahead with the implementation in Spring Batch? The normal reader, processor and writer are working fine, but where exactly should I insert this logic?
Thanks!
You can solve this by adding an additional tasklet step.
The job flow can look like below:
@Bean
public Job myJob(JobBuilderFactory jobs) throws Exception {
    return jobs.get("myJob")
        .start(step1LoadAllData())          // this step loads all the data into the database except the sum for the first row of the xls
        .next(updateNetCreditAmountStep())  // this step is a tasklet that updates the total sum in the first row; you can compute the sum with a database SQL query
        .build();
}
The tasklet will be something like below:
@Component
public class UpdateNetCreditAmountTasklet implements Tasklet {

    @Override
    public RepeatStatus execute(StepContribution stepContribution, ChunkContext chunkContext) throws Exception {
        Double sum = jdbcTemplate.queryForObject("select sum(netCreditAmount) from XYZ", Double.class);
        // now update this sum in the database for the first row
        return RepeatStatus.FINISHED;
    }
}
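For completeness, a fleshed-out sketch of the tasklet above; the injected JdbcTemplate and the rowNumber column used to identify the first row are assumptions on my part, since the original answer leaves that part as a comment:

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Component;

@Component
public class UpdateNetCreditAmountTasklet implements Tasklet {

    @Autowired
    private JdbcTemplate jdbcTemplate;

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // Sum the column across all rows loaded by the previous step.
        Double sum = jdbcTemplate.queryForObject(
                "select sum(netCreditAmount) from XYZ", Double.class);

        // Write the total back to the first row only; "rowNumber = 1" is an assumed way to
        // identify that row, since the original answer does not say how the table is keyed.
        jdbcTemplate.update("update XYZ set netCreditAmount = ? where rowNumber = 1", sum);

        return RepeatStatus.FINISHED;
    }
}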
So what is the problem?
You need to set up your batch job step to use reader-processor-writer.
The reader has this interface:
public interface ItemReader<T> {
    T read();
}
The processor:
public interface ItemProcessor<I, O> {
    O process(I item);
}
So the type produced by the reader (T) must match the type the processor takes as input (I):
stepBuilderFactory.get("myCoolStep")
        .<I, O>chunk(1)
        .reader(myReader)
        .processor(myProcessor)
        .writer(myWriter)
        .build();
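A concrete sketch of how the generics line up, using hypothetical RowInput and RowOutput types (none of these names come from the original post):

import java.util.Arrays;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.support.ListItemReader;

public class TypedStepExample {

    static class RowInput  { final String raw;        RowInput(String raw)         { this.raw = raw; } }
    static class RowOutput { final String normalized; RowOutput(String normalized) { this.normalized = normalized; } }

    Step buildStep(StepBuilderFactory stepBuilderFactory) {
        // The reader produces RowInput, so the processor must take RowInput as input;
        // the processor produces RowOutput, so the writer must consume RowOutput.
        ListItemReader<RowInput> myReader =
                new ListItemReader<>(Arrays.asList(new RowInput("a"), new RowInput("b")));

        ItemProcessor<RowInput, RowOutput> myProcessor =
                item -> new RowOutput(item.raw.toUpperCase());

        ItemWriter<RowOutput> myWriter =
                items -> items.forEach(out -> System.out.println(out.normalized));

        return stepBuilderFactory.get("myCoolStep")
                .<RowInput, RowOutput>chunk(100) // chunk type parameters match the reader/processor chain
                .reader(myReader)
                .processor(myProcessor)
                .writer(myWriter)
                .build();
    }
}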

Integrating Spark SQL and Apache Drill through JDBC

I would like to create a Spark SQL DataFrame from the results of a query performed over CSV data (on HDFS) with Apache Drill. I successfully configured Spark SQL to connect to Drill via JDBC:
Map<String, String> connectionOptions = new HashMap<String, String>();
connectionOptions.put("url", args[0]);
connectionOptions.put("dbtable", args[1]);
connectionOptions.put("driver", "org.apache.drill.jdbc.Driver");
DataFrame logs = sqlc.read().format("jdbc").options(connectionOptions).load();
Spark SQL performs two queries: the first one to get the schema, and the second one to retrieve the actual data:
SELECT * FROM (SELECT * FROM dfs.output.`my_view`) WHERE 1=0
SELECT "field1","field2","field3" FROM (SELECT * FROM dfs.output.`my_view`)
The first one succeeds, but in the second one Spark encloses the field names in double quotes, which Drill doesn't support, so the query fails.
Did anyone manage to get this integration working?
Thank you!
You can add a JDBC dialect for this and register the dialect before using the JDBC connector:
case object DrillDialect extends JdbcDialect {
  def canHandle(url: String): Boolean = url.startsWith("jdbc:drill:")

  override def quoteIdentifier(colName: java.lang.String): java.lang.String = {
    return colName
  }

  def instance = this
}
JdbcDialects.registerDialect(DrillDialect)
This is how the accepted answer code looks in Java:
import org.apache.spark.sql.jdbc.JdbcDialect;

public class DrillDialect extends JdbcDialect {

    @Override
    public String quoteIdentifier(String colName) {
        return colName;
    }

    public boolean canHandle(String url) {
        return url.startsWith("jdbc:drill:");
    }
}
Before creating the Spark session, register the dialect:
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.jdbc.JdbcDialects;

public static void main(String[] args) {
    JdbcDialects.registerDialect(new DrillDialect());

    SparkSession spark = SparkSession
        .builder()
        .appName("Drill Dialect")
        .getOrCreate();

    // More Spark code here...

    spark.stop();
}
Tried and tested with Spark 2.3.2 and Drill 1.16.0. Hope it helps you too!

set batch size in spring JDBC batch update

How do I set the batch size in a Spring JDBC batch update to improve performance?
Listed below is my code snippet.
public void insertListOfPojos(final List<Student> myPojoList) {
    String sql = "INSERT INTO Student (age,name) VALUES (?,?)";
    try {
        jdbcTemplateObject.batchUpdate(sql,
                new BatchPreparedStatementSetter() {

                    @Override
                    public void setValues(PreparedStatement ps, int i) throws SQLException {
                        Student myPojo = myPojoList.get(i);
                        ps.setString(2, myPojo.getName());
                        ps.setInt(1, myPojo.getAge());
                    }

                    @Override
                    public int getBatchSize() {
                        return myPojoList.size();
                    }
                });
    } catch (Exception e) {
        System.out.println("Exception");
    }
}
I read that with Hibernate you can provide the batch size in the configuration XML, for example:
<property name="hibernate.jdbc.batch_size" value="100"/>
Is there something similar in Spring's JDBC?
There is no such option for plain JDBC as there is in Hibernate; I think you have to look at the specific RDBMS vendor's driver options when preparing the connection string.
As for your code, you have to use
BatchPreparedStatementSetter.getBatchSize()
or
JdbcTemplate.batchUpdate(String sql, final Collection<T> batchArgs, final int batchSize, final ParameterizedPreparedStatementSetter<T> pss)
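A sketch of the second option applied to the Student example from the question; the batch size of 100 and the surrounding class are illustrative only:

import java.sql.PreparedStatement;
import java.util.List;

import org.springframework.jdbc.core.JdbcTemplate;

public class StudentBatchInsert {

    private final JdbcTemplate jdbcTemplateObject;

    public StudentBatchInsert(JdbcTemplate jdbcTemplateObject) {
        this.jdbcTemplateObject = jdbcTemplateObject;
    }

    public void insertListOfPojos(List<Student> myPojoList) {
        // The third argument caps how many statements go into each JDBC batch;
        // JdbcTemplate splits the collection accordingly and returns one update-count array per batch.
        int[][] updateCounts = jdbcTemplateObject.batchUpdate(
                "INSERT INTO Student (age, name) VALUES (?, ?)",
                myPojoList,
                100, // batch size: an illustrative value, tune it for your database
                (PreparedStatement ps, Student student) -> {
                    ps.setInt(1, student.getAge());
                    ps.setString(2, student.getName());
                });
    }
}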
If you use JDBC directly, you decide yourself how many statements are used in one commit; when using one of the provided JDBC writers, you set the batch* size via the configured commit rate.
* As far as I know, the current Spring version uses the prepared statement batch methods under the hood, see https://github.com/SpringSource/spring-framework/blob/master/spring-jdbc/src/main/java/org/springframework/jdbc/core/JdbcTemplate.java#L549
