Spring Batch Performance Improvement

I am writing a Spring Batch job which needs to read from a database table, process the data it reads (while reading more database tables), and finally write to a database. The performance of the job needs to improve so that 10 files are written every second.
I followed this post and managed to gain some performance by using multi-threaded steps.
But the desired performance goal still cannot be met. Can anyone guide me on how to get more throughput from Spring Batch?

Your performance depends on a lot of factors.
For example:
What does your query look like? Are there any joins/subqueries that could slow down your whole job?
What does your processor do?
Did you use indexed tables (with a dedicated index tablespace on a faster drive)?
Parallel processing, multi-threading, and partitioning are only a small part of your performance gain.

Related

Cassandra integration with Hadoop for read performance

I am using Apache Cassandra to store around 100 million records. There is a single node with the following specifications:
RAM: 32 GB, HDD: 2 TB, Intel quad-core processor.
With Cassandra there is a read performance problem: some queries take around 40 minutes to produce output. After searching for how to improve read performance, I came to know about the following factors:
compaction strategy, compression techniques, key cache, increasing the heap space, turning off swap for Cassandra.
After making these optimizations the performance remains the same. Searching further, I came across integrating Hadoop with Cassandra. Is that the correct way to run these queries in Cassandra, or are there other factors I am missing here?
Thanks.
It looks like your data model could be improved. 40 minutes is impossibly slow. I download all the data from 6 million records (around 10 GB) within a few minutes, and that includes converting the data while downloading and storing it. Trivial selects should take milliseconds.
Did you build the data model around the queries you need to run?
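To illustrate query-driven modelling, here is a minimal sketch using the DataStax Java driver (assuming driver 2.x and a reasonably recent Cassandra; the keyspace, table, and column names are all made-up placeholders):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class QueryDrivenModel {
    public static void main(String[] args) {
        // Single-node cluster; the address is a placeholder.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 1}");

        // Model the table around the query: the column you filter on is the
        // partition key, so a lookup touches one partition instead of
        // scanning 100 million rows.
        session.execute("CREATE TABLE IF NOT EXISTS demo.events_by_user ("
                + "user_id text, event_time timestamp, payload text, "
                + "PRIMARY KEY (user_id, event_time))");

        // Answered from a single partition: milliseconds, not minutes.
        ResultSet rs = session.execute(
                "SELECT payload FROM demo.events_by_user WHERE user_id = ?",
                "user-42");
        for (Row row : rs) {
            System.out.println(row.getString("payload"));
        }
        cluster.close();
    }
}
```

If a query that worked like this still took minutes, the problem would be elsewhere; queries that filter on non-key columns are the usual cause of multi-minute reads.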

What is the best way to extract big data to file?

I am using Oracle as the DBMS and Tuxedo for the application server.
The customer needs to export data from Oracle to a SAMFILE for interfacing purposes.
Unfortunately, the number of records is huge (over 10 million), so
I was wondering what the best practice is for extracting large amounts of data to a file on the database server.
I am used to creating a cursor, fetching a record, and then writing it to a file.
Is there a better, i.e. faster, way to handle this? It is a recurring task.
I suggest you read Adrian Billington's article on tuning UTL_FILE. It covers all the bases. Find it here.
The important thing is buffering records to reduce the number of file I/O calls. You will need to benchmark the different implementations to see which works best in your situation.
Pay attention to his advice on query performance. Optimising file I/O is pointless if most of the time is spent on data acquisition.
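The same buffering principle, sketched client-side in plain JDBC rather than in PL/SQL with UTL_FILE (the connection string, table, and columns are placeholders): a large writer buffer batches many records per OS write call, and a generous fetch size reduces round trips on the query side.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BulkExtract {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "pass");
             Statement stmt = con.createStatement();
             // A 1 MB buffer batches many records per OS write call.
             BufferedWriter out = new BufferedWriter(
                 new FileWriter("/tmp/extract.dat"), 1 << 20)) {

            // Fetch rows from the server in large blocks instead of
            // one network round trip per row.
            stmt.setFetchSize(500);
            try (ResultSet rs = stmt.executeQuery(
                     "SELECT id, name, amount FROM big_table")) {
                while (rs.next()) {
                    out.write(rs.getLong(1) + "|" + rs.getString(2)
                              + "|" + rs.getBigDecimal(3));
                    out.newLine();
                }
            }
        }
    }
}
```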

Spring Batch: Which ItemReader implementation to use for high volume & low latency

Use case: Read 10 million rows [10 columns] from a database and write them to a file (CSV format).
Which ItemReader implementation among JdbcCursorItemReader & JdbcPagingItemReader would be suggested? What would be the reason?
Which would be better performing (fast) in the above use case?
Would the selection be different in case of a single-process vs multi-process approach?
In case of a multi-threaded approach using TaskExecutor, which one would be better & simple?
To process that kind of data, you're probably going to want to parallelize it if that is possible (the only thing preventing it would be if the output file needed to retain an order from the input). Assuming you are going to parallelize your processing, you are then left with two main options for this type of use case (from what you have provided):
Multithreaded step - This will process a chunk per thread until complete. It allows for parallelization in a very easy way (simply add a TaskExecutor to your step definition; see the sketch after this list). With this, you do lose restartability out of the box, because you will need to turn off state persistence on either of the ItemReaders you have mentioned (there are ways around this, such as flagging records in the database as having been processed).
Partitioning - This breaks up your input data into partitions that are processed by step instances in parallel (master/slave configuration). The partitions can be executed locally via threads (using a TaskExecutor) or remotely via remote partitioning. In either case, you gain restartability (each step processes its own data, so there is no stepping on state from partition to partition) along with parallelization.
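A rough sketch of the multithreaded-step option in Spring Batch Java config (the reader and writer here are throwaway placeholders; a real multi-threaded step needs a thread-safe reader with state saving turned off, per the restartability caveat above):

```java
import java.util.Arrays;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.support.IteratorItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
@EnableBatchProcessing
public class MultithreadedStepConfig {

    // Placeholder reader; in the real job this would be a JdbcCursorItemReader
    // or JdbcPagingItemReader with saveState(false), and it must be thread-safe.
    @Bean
    public ItemReader<String> reader() {
        return new IteratorItemReader<>(Arrays.asList("a", "b", "c"));
    }

    @Bean
    public ItemWriter<String> writer() {
        // Print the worker thread so the parallelism is visible.
        return items -> System.out.println(
                Thread.currentThread().getName() + " wrote " + items);
    }

    @Bean
    public Step multithreadedStep(StepBuilderFactory steps) {
        return steps.get("multithreadedStep")
                .<String, String>chunk(100)                  // items per chunk
                .reader(reader())
                .writer(writer())
                .taskExecutor(new SimpleAsyncTaskExecutor("batch-"))
                .throttleLimit(4)                            // max concurrent chunks
                .build();
    }
}
```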
I did a talk on processing data in parallel with Spring Batch. Specifically, the example I present is a remote partitioned job. You can view it here: https://www.youtube.com/watch?v=CYTj5YT7CZU
To your specific questions:
Which ItemReader implementation among JdbcCursorItemReader & JdbcPagingItemReader would be suggested? What would be the reason? - Either of these two options can be tuned to meet many performance needs. It really depends on the database you're using, the driver options available, and the processing models you can support. Another consideration: do you need restartability?
Which would be better performing (fast) in the above use case? - Again, it depends on the processing model you choose.
Would the selection be different in case of a single-process vs multi-process approach? - This goes to how you manage jobs more so than what Spring Batch can handle. The question is, do you want to manage partitioning external to the job (passing in the data description to the job as parameters) or do you want the job to manage it (via partitioning).
In case of a multi-threaded approach using TaskExecutor, which one would be better & simple? - I won't deny that remote partitioning adds a level of complexity that local partitioning and multithreaded steps don't have.
I'd start with the basic step definition. Then try a multithreaded step. If that doesn't meet your needs, then move to local partitioning, and finally remote partitioning if needed. Keep in mind that Spring Batch was designed to make that progression as painless as possible. You can go from a regular step to a multithreaded step with only configuration updates. To go to partitioning, you need to add a single new class (a Partitioner implementation) and some configuration updates.
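For a feel of what that one new class looks like, here is a sketch of a Partitioner, assuming an id-range scheme (the column, range bounds, and the 'outputFile' key are illustrative choices, not requirements):

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Splits a numeric key space into gridSize ranges; each partition's
// step instance reads only rows where id BETWEEN minId AND maxId.
public class IdRangePartitioner implements Partitioner {

    private final long minId;   // lowest id in the table (look it up first)
    private final long maxId;   // highest id in the table

    public IdRangePartitioner(long minId, long maxId) {
        this.minId = minId;
        this.maxId = maxId;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        long rangeSize = (maxId - minId) / gridSize + 1;
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext ctx = new ExecutionContext();
            ctx.putLong("minId", minId + i * rangeSize);
            ctx.putLong("maxId", Math.min(minId + (i + 1) * rangeSize - 1, maxId));
            ctx.putString("outputFile", "out/part-" + i + ".csv"); // one file per partition
            partitions.put("partition" + i, ctx);
        }
        return partitions;
    }
}
```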
One final note. Most of this has talked about parallelizing the processing of this data. Spring Batch's FlatFileItemWriter is not thread safe. Your best bet would be to write to multiple files in parallel, then aggregate them afterwards if speed is your number one concern.
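One way to get a file per partition is a step-scoped writer bound to a partition-specific resource; this sketch assumes the Partitioner put an 'outputFile' entry into each partition's ExecutionContext, as in the example above:

```java
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.transform.PassThroughLineAggregator;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
public class PartitionedWriterConfig {

    // One writer instance per partition, each bound to its own file,
    // so no two threads ever share a FlatFileItemWriter.
    @Bean
    @StepScope
    public FlatFileItemWriter<String> partitionWriter(
            @Value("#{stepExecutionContext['outputFile']}") String outputFile) {
        FlatFileItemWriter<String> writer = new FlatFileItemWriter<>();
        writer.setResource(new FileSystemResource(outputFile));
        writer.setLineAggregator(new PassThroughLineAggregator<>());
        return writer;
    }
}
```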
You should profile this in order to make a choice. In plain JDBC I would start with something that:
prepare statements with ResultSet.TYPE_FORWARD_ONLY and ResultSet.CONCUR_READ_ONLY. Several JDBC drivers "simulate" cursors on the client side unless you use those two options, and for large result sets you don't want that, as it will probably lead to an OutOfMemoryError: the JDBC driver ends up buffering the entire data set in memory. Using those options increases the chance that you get server-side cursors, with the results "streamed" to you bit by bit, which is what you want for large result sets. Note that some JDBC drivers always "simulate" cursors on the client side, so this tip might be useless for your particular DBMS.
set a reasonable fetch size to minimize the impact of network round trips. 50-100 is often a good starting value for profiling. As the fetch size is only a hint, it might also be useless for your particular DBMS. Both settings appear in the sketch after this list.
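Here is the sketch referenced above: a bare-bones version of both settings in plain JDBC (the URL and query are placeholders, and as noted, the actual effect is driver-dependent):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class StreamingRead {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:yourdb://dbhost/db", "user", "pass")) {
            // Ask for a forward-only, read-only (ideally server-side) cursor.
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT id, name FROM big_table",
                    ResultSet.TYPE_FORWARD_ONLY,
                    ResultSet.CONCUR_READ_ONLY)) {
                ps.setFetchSize(100); // hint: ~100 rows per network round trip
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // Process one row at a time; the full result set
                        // never has to fit in memory.
                    }
                }
            }
        }
    }
}
```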
JdbcCursorItemReader seems to cover both things, but as said before, they are not guaranteed to give you the best performance on every DBMS, so I would start with it and then, if performance is inadequate, try JdbcPagingItemReader.
I don't think simple processing with JdbcCursorItemReader will be slow for your data set size, unless you have very strict performance requirements. If you really need to parallelize, using JdbcPagingItemReader might be easier. But the interfaces of the two are very similar, so I would not count on it.
Anyway, profile.
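For reference, the same knobs surface on Spring Batch's cursor reader roughly like this (the SQL and row mapping are placeholders; as far as I can tell, JdbcCursorItemReader already opens its cursor forward-only/read-only internally):

```java
import javax.sql.DataSource;

import org.springframework.batch.item.database.JdbcCursorItemReader;

public class CursorReaderFactory {

    // Same tuning as the plain JDBC sketch, via Spring Batch's cursor reader.
    public static JdbcCursorItemReader<String> reader(DataSource dataSource) {
        JdbcCursorItemReader<String> reader = new JdbcCursorItemReader<>();
        reader.setDataSource(dataSource);
        reader.setSql("SELECT name FROM big_table");            // placeholder SQL
        reader.setRowMapper((rs, rowNum) -> rs.getString("name"));
        reader.setFetchSize(100);   // the same round-trip hint
        reader.setSaveState(false); // needed if the step becomes multi-threaded
        return reader;
    }
}
```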

Insert performance with and without Index

I was doing a couple of tests.
Based on some great suggestions by Wes and others, I have tuned some of the Neo4j properties (with no cache) to do large-scale inserts in a multithreaded environment, and the performance is not bad.
However, when I introduce an index (on the nodes), the performance degrades a lot. The difference is easily 5-fold. Are there configuration settings to make this better?
Thanks in advance,
Sachin
Neo4j version - 1.8.1; JVM - 1.6
Inserting nodes (or relationships) into a Lucene index is costly. Lucene is a powerful but complex tool, designed for fulltext/keyword search. Compared with the bare database, it is rather slow.
This is why most bulk insert tools do the indexing asynchronously, like Michael's batch inserter:
http://jexp.de/blog/2012/10/parallel-batch-inserter-with-neo4j/
Some even circumvent transactions, or write the store files directly:
http://blog.xebia.com/2012/11/13/combining-neo4j-and-hadoop-part-i/
To improve performance, using an SSD could help. But as Neo4j is a fully ACID transactional database, and the Lucene index is tightly coupled with the transactions (which is a good thing), there's not much else you can do besides optimizing your infrastructure for the best write performance.
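If the inserts have to stay transactional, one lever that usually helps is committing in large batches so the per-commit Lucene overhead is amortized. A rough sketch against the 1.8-era embedded API (the path, counts, and batch size are made up):

```java
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public class BatchedIndexedInsert {
    public static void main(String[] args) {
        GraphDatabaseService db = new EmbeddedGraphDatabase("/tmp/graph.db");
        int batchSize = 10_000; // thousands of inserts per transaction

        Transaction tx = db.beginTx();
        try {
            Index<Node> index = db.index().forNodes("nodes");
            for (int i = 0; i < 1_000_000; i++) {
                Node node = db.createNode();
                node.setProperty("key", i);
                index.add(node, "key", i);       // the expensive Lucene part
                if (i > 0 && i % batchSize == 0) {
                    tx.success();                // commit in large batches so
                    tx.finish();                 // Lucene commits are amortized
                    tx = db.beginTx();
                }
            }
            tx.success();
        } finally {
            tx.finish();
            db.shutdown();
        }
    }
}
```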
Just in case this additional answer is still of use for anyone running Neo4j on an ext4 filesystem under Linux:
By trading away some transaction safety (negligible on UPS/battery-backed systems or laptops), write performance can be increased by a factor of 10-15!
Read more in this recent blog post: http://structr.org/blog/neo4j-performance-on-ext4

What do I need to consider when performance testing DB2 read and write?

Calling all database guys...
The situation is this:
I have a DB2 database that is being written to and read from. I need to do some performance testing on programmatically executed read/writes.
I know how to write a program to read/write to this database, but I am not sure as to what factors I should consider in my performance test.
Do I need to worry about the difference between one session reading/writing vs multiple sessions?
What is the best way to interact with DB2 itself to get the amount of time these executions take?
The process I am testing is basically a continuous batch process, constantly taking messages and persisting them. There will probably only be one or two sessions max on the DB at any given time.
Is time it takes to read/write really the best metric?
I am sure there are plenty of tools for this sort of testing. Any advice is appreciated.
Further info:
One thing I am considering is to run X reads/writes with my (homebrew) database API and try to time how long they take. Unfortunately, DB2 will buffer these messages. Is there any way to get DB2 to do a callback when it is done with a read/write? Or some way to externally measure the time these operations take (a tool, etc.)?
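For example, a small harness along these lines (a rough sketch; the connection URL and table are placeholders) measures durable write latency by committing after every statement, so DB2's buffering cannot hide the cost:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class WriteTimer {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:db2://dbhost:50000/MYDB", "user", "pass")) {
            con.setAutoCommit(false);
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO messages (id, body) VALUES (?, ?)")) {
                int n = 10_000;
                long start = System.nanoTime();
                for (int i = 0; i < n; i++) {
                    ps.setInt(1, i);
                    ps.setString(2, "payload-" + i);
                    ps.executeUpdate();
                    con.commit(); // force the write through on every message
                }
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("%d inserts in %d ms (%.3f ms/insert)%n",
                        n, elapsedMs, (double) elapsedMs / n);
            }
        }
    }
}
```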
What is the goal of your performance testing? Is it to test performance for concurrent users, or to test the load of a batch process? Based on this, there are tools available; you may want to look at JMeter from Apache.
In that case, you may want to trigger a couple of concurrent processes to CRUD the data simultaneously and monitor the activity using Performance Expert or something similar. While you do that, use larger data sets so that you can spot any bottlenecks at scale. Search for performance tuning on the IBM Redbooks site and you will find some case studies for this.
One huge factor in DB2 performance is how Buffer Pools are configured, e.g. http://www.ibm.com/developerworks/data/library/techarticle/0212wieser/0212wieser.html
