Spring Batch - Performance Tuning

We developed a Spring Batch application with two flows: 1. Forward, 2. Backward. We only use file read/write; no database is involved.
Forward scenario: the input file has records with 22 fields. The 22 fields are converted into 32 fields by operations such as sequence number generation and the addition of a few filler fields. Based on country codes, the output is split into at most 3 files, and each chunk has 250K records (if a country has millions of records, multiple files are generated for that country).
For 8 million records it takes 36 minutes.
The 8 million records come in a single file.
We are using Spring Batch with 1000 threads.
Backward flow: the input file has 82 fields per record. These 82 fields are converted into 86 fields. Two fields are inserted in between, taken from the forward flow's input file; the other fields are simply copied over. Error records also have to be written to an error file; an error record is just the original input record that entered the forward flow. To track them, we persist the sequence number and the original record to a file during the forward flow itself. In the backward flow we read that persisted file and compare sequence numbers; if anything is missing, we write it to the error records as a key/value pair. This comparison runs after the backward flow completes.
The maximum size of an input file is 250K records.
For 8 million records it takes 1 hour 8 minutes, which is far too slow.
There are 32 input files (each 250K records) in this flow.
No threads are used in the backward flow. I am not sure how to apply threading there; when I tried, the process hung.
Server Configurations:
12 CPU & 64 GB Linux Server.
Can you help us improve the performance, given that we have 12 CPUs and 64 GB of RAM?

You are already using 1000 threads, and that is a very high number. I have fine-tuned Spring Batch jobs, and this is what I have done:
1. Reduce network traffic - Try to reduce the number of calls to the database or file system in each process. Can you get all the information you need in one shot and keep it in memory for the life of the thread? I have used org.apache.commons.collections.map.MultiKeyMap for storing and retrieving such data.
For example, in your case you need a sequence number comparison, so load all sequence numbers into one map before you start the process. You can store the ids (if there are not too many) in the step execution context; see the sketch after this list.
2. Write less frequently - Buffer the information you need to write for a while and write it out in batches at the end.
3. Set unused objects to null at the end of processing to help the GC reclaim them sooner.
4. Check your GC frequency with VisualVM or JConsole. You should see frequent GC while the job is running, which means objects are being created and garbage collected. If the memory graph keeps growing, something is wrong.
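For illustration, here is a minimal sketch of the "load everything once" idea for the sequence-number comparison, assuming the forward flow writes a tracking file with one "sequenceNumber|originalRecord" entry per line (the file path, delimiter, and context key are assumptions, not taken from the question):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;

/**
 * Loads all persisted sequence numbers once, before the backward step runs,
 * so each record is checked against an in-memory set instead of re-reading
 * the tracking file for every record.
 */
public class SequenceNumberPreloadListener implements StepExecutionListener {

    // Hypothetical path of the tracking file written by the forward flow.
    private static final String TRACKING_FILE = "/data/forward/sequence-tracking.txt";

    @Override
    public void beforeStep(StepExecution stepExecution) {
        try (Stream<String> lines = Files.lines(Paths.get(TRACKING_FILE))) {
            Set<Long> sequenceNumbers = lines
                    .map(line -> line.split("\\|", 2)[0])   // keep only the sequence number
                    .map(Long::valueOf)
                    .collect(Collectors.toCollection(HashSet::new));

            // Kept for the life of the step; the processor/writer can look it up here.
            stepExecution.getExecutionContext().put("forwardSequenceNumbers", sequenceNumbers);
        } catch (IOException e) {
            throw new IllegalStateException("Could not preload sequence numbers", e);
        }
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        return stepExecution.getExitStatus();
    }
}
```

Note that if your job repository is database-backed, a large set stored in the execution context gets serialized along with it; in that case keep the full set in a shared bean or a MultiKeyMap, as suggested above, and put only a small summary in the context.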

Related

NiFi SplitRecord hung

I'm testing the NiFi SplitRecord processor with a small file of only 11 records.
However, SplitRecord hangs for a long time, and I have no clue what it is doing.
(Screenshots: the hung processor and the SplitRecord processor properties.)
Does Records Per Split control the maximum, the minimum, or the exact number of records per split?
If the total number of records is less than Records Per Split, what is the behavior of SplitRecord? Does it wait for a timeout and then put all on-hold records into a single split?
After about 10 minutes, or some random number of start/stop/terminate/restart cycles, the processor may be triggered to split the data sooner.
Records Per Split controls the maximum; see "SplitRecord.java" for the code. If there are fewer records than the RECORDS_PER_SPLIT value, it immediately pushes them all out.
However, it does look like it creates a new FlowFile even if the total record count is less than the RECORDS_PER_SPLIT value, meaning it writes to disk regardless of whether a split really occurred.
I would probably investigate two things:
Host memory - how much memory does the host have? How much is configured as NiFi max heap? How much total system memory is in use/free? NiFi performs best when plenty of system memory is left for file cache.
Host's disks, specifically the disk that holds the Content Repository. Capacity? IO? Is it shared with other services? FlowFile content is written to the Content Repository; if that disk is shared with the OS or other busy services (or other NiFi repos), it can really slow content modification down.
Note: your NiFi version is over 3 years old; please consider upgrading.

Spring Batch to process huge data

I have around 10 million files stored as blobs in my database which I need to convert and save in PDF format. Each file is between 0.5 and 10 MB, and the combined size is around 20 TB. I'm trying to implement this functionality using Spring Batch. My question is: when I run the batch, can the server memory hold that much data? I'm planning to use chunk-based processing and a thread pool task executor. Please suggest whether this is the best approach for processing that much data in as little time as possible.
Since each file is 0.5 to 10 MB, the chunk-based approach you mentioned is a good fit. You can get more control, and monitor the processing, with the following:
Create partitions from the file table based on the thread pool count (sized to your system resources).
Each partition's reader selects only one file at a time.
You can calculate the required memory from the number of parallel steps and pass it as a JVM argument.
Configure the commit chunk based on the memory calculation across all parallel steps.
Please refer to the question linked below for example code.
Spring Batch multiple process for heavy load with multiple thread under every process
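As a rough sketch of the partitioned setup described above (the bean names, pool size, chunk size, and the BlobFile/PdfFile types are placeholders, not code from the linked answer):

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

/**
 * The master step fans out to one worker step per partition; the thread pool
 * and grid size are matched so each thread owns one partition, and a small
 * commit chunk keeps only a handful of 0.5-10 MB files in memory per thread.
 */
public class PdfConversionStepConfig {

    // Placeholder item types for the blob read from the table and the converted PDF.
    public static class BlobFile { }
    public static class PdfFile { }

    public Step masterStep(StepBuilderFactory steps, Step workerStep, Partitioner filePartitioner) {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(8);   // size to your CPUs / IO capacity
        executor.setMaxPoolSize(8);
        executor.initialize();

        return steps.get("masterStep")
                .partitioner("workerStep", filePartitioner)
                .step(workerStep)
                .gridSize(8)           // one partition per worker thread
                .taskExecutor(executor)
                .build();
    }

    public Step workerStep(StepBuilderFactory steps,
                           ItemReader<BlobFile> reader,
                           ItemProcessor<BlobFile, PdfFile> processor,
                           ItemWriter<PdfFile> writer) {
        return steps.get("workerStep")
                .<BlobFile, PdfFile>chunk(5)   // commit chunk sized from the memory calculation
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }
}
```

With these placeholder numbers, the worst case held in memory at once is roughly 8 threads x 5 files x 10 MB, about 400 MB of blob data, which is the kind of calculation you can use to choose the -Xmx JVM argument mentioned above.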

Why are queries getting progressively slower when using Postgres with Spring Batch?

I'm running a job using Spring Batch 4.2.0 with Postgres (11.2) as the backend, all wrapped in a Spring Boot app. I have 5 steps, and each uses a simple partitioning strategy to divide the data by id ranges; the partitions are processed by separate threads. There are about 18M rows in the table, and each step reads all 18M rows, changes a few fields, and writes them back. The issue I'm facing is that the queries that pull data into each thread scan by id range, like:
select field_1, field_2, field_66 from table where id >= 1 and id < 10000.
In this case each thread processes 10,000 rows at a time. When there is no other traffic, the query takes less than a second to read all 10,000 rows, but when the job runs there are about 70 threads reading that data and it gets progressively slower, up to almost a minute and a half. Any ideas where to start troubleshooting this?
I do see autovacuum running in the background for almost the whole duration of the job. The app definitely has enough memory to hold that data (about a 6 GB max heap). Postgres has shared_buffers of 2 GB and max_wal_size of 2 GB, but I am not sure whether that in itself is sufficient. Another thing I see when checking pg_stat_activity is loads of COMMIT statements hanging around, usually as many as the number of partitions. So instead of 70 connections being used by 70 partitions, there are 140 connections in use, with 70 of them running COMMIT. As time progresses these COMMITs get progressively slower too.
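(For reference, an id-range partitioning strategy of the kind described here usually looks something like the sketch below; this is illustrative, not the actual job code from the question.)

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

/**
 * Splits [minId, maxId] into fixed-size id ranges, one per partition.
 * Each worker thread then runs the range query shown above with its own
 * minId/maxId bounds. gridSize is ignored because the range size is fixed.
 */
public class IdRangePartitioner implements Partitioner {

    private final long minId;
    private final long maxId;
    private final long rangeSize;   // e.g. 10_000 rows per partition

    public IdRangePartitioner(long minId, long maxId, long rangeSize) {
        this.minId = minId;
        this.maxId = maxId;
        this.rangeSize = rangeSize;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int index = 0;
        for (long start = minId; start <= maxId; start += rangeSize) {
            ExecutionContext context = new ExecutionContext();
            context.putLong("minId", start);
            context.putLong("maxId", Math.min(start + rangeSize - 1, maxId));
            partitions.put("partition" + index++, context);
        }
        return partitions;
    }
}
```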
You are probably hitting https://github.com/spring-projects/spring-batch/issues/3634.
This issue has been fixed and will be part of version 4.2.3 planned to be released this week.

Parse: replacing a large number (several thousands) of records

I've got a class in Parse with 1-4k records per user. These need to be replaced from time to time (they are records representing multiple timetables).
The problem I'm facing is that deleting and inserting these records takes a ton of requests. Is there a method to delete and insert a bunch of records that counts as one request? Maybe it's possible from Cloud Code?
I tried compacting all this data into one record, but then I hit the size limit for records (128 KB). Using any sub-format (like a DB or file inside a record) would be really tedious, because the app targets nearly all platforms supported by Parse.
EDIT
For clarification, the problem isn't the limit on saveAll/destroyAll. My problem is the req/s limit (or rather, as the docs state, the req/min limit).
Also, I just checked that requests from Cloud Code also seem to count towards that limit.
A possible solution would also be to redesign my datasets and use Array columns or something, but I'd rather avoid that if possible.
I think you could try Parse.Object.saveAll, which batch-processes the save() calls.
Docs: https://www.parse.com/docs/js/api/symbols/Parse.Object.html#.saveAll
Guide: https://parse.com/questions/parseobjectsaveall-performances
I would use saveAll/destroyAll (or deleteAll?) and any of the other -All methods that Parse provides in its SDK.
You'd still hit the 1000-object query limit, but to counter that you can loop using the .skip property of the query.
Set a limit of 1000 and a skip of 0, run the query, then increase the skip by the previous limit, and so on; you'd end up with 2 or 3 requests of 1000 objects each. Stop the loop when the result count is smaller than the limit; otherwise, query again with the skip set to the limit times the loop count.
Since you say you're facing size issues, maybe you can reduce that query limit to, say, 400; your loop would just run longer, until the number of results is smaller than your limit (and then you can stop querying/limiting/skipping/looping or anything in -ing).
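To keep the examples in one language, here is that loop sketched with the Parse Android SDK in Java (the JS SDK equivalent uses Parse.Query with limit/skip plus Parse.Object.destroyAll); the "Entries" class name and "owner" field are assumptions:

```java
import java.util.ArrayList;
import java.util.List;

import com.parse.ParseException;
import com.parse.ParseObject;
import com.parse.ParseQuery;

public class TimetableReplacer {

    /**
     * Collects all existing "Entries" objects in pages of up to 1000 (the query
     * limit), then deletes them with one batched call instead of one request
     * per object. Paging stops when a page comes back smaller than the limit.
     */
    public static void deleteAllEntries(ParseObject owner) throws ParseException {
        int limit = 1000;
        int skip = 0;
        List<ParseObject> toDelete = new ArrayList<>();

        while (true) {
            ParseQuery<ParseObject> query = ParseQuery.getQuery("Entries");
            query.whereEqualTo("owner", owner);   // illustrative field name
            query.setLimit(limit);
            query.setSkip(skip);

            List<ParseObject> page = query.find();   // one request per page
            toDelete.addAll(page);

            if (page.size() < limit) {
                break;                               // last page reached
            }
            skip += limit;
        }

        ParseObject.deleteAll(toDelete);             // batched delete
    }

    /** Saves the replacement records with one batched call. */
    public static void insertEntries(List<ParseObject> newEntries) throws ParseException {
        ParseObject.saveAll(newEntries);
    }
}
```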
Okay, so this isn't an answer to my question, but it's a solution to my problem, so I'm posting it.
My problem was storing and then replacing a large number of small records which add up to a significant size (up to 500 KB of JSON [~1.5 MB of XML] in my current plans).
So I've chosen a middle path - I implemented a sort of vertical partitioning.
What I have is a master User record which holds an array of pointers to another class (called Entries). Entries have only 2 fields: the ID of the school record and data, which is of type Array.
I decided to split the "partitions" every 1000 records, which comes to about 60-70 KB per record, though by my calculations it may go up to ~100 KB.
I also made the JSON field names 1 letter long, because every extra letter across 1000 records costs 1 or 2 KB, depending on encoding.
That approach made the PHP code about twice as fast, with much less load on the network and the remote database (basically 1000 times fewer inserts/destroys).
So that is my solution. If anybody has other ideas, please post them as an answer here, because I'm probably not the only one with this problem, and this certainly isn't the only solution.

HBase concurrency making it slow

I have 1 master server and 5 region servers, each with 200 GB of disk space and 16 GB of RAM. I created a table in HBase which has 10 million records. I am using HBase 0.96 on Hadoop 2.
Table Name - sh_self_profiles
column family - profile
In this table, we have 30 columns in each row.
When I get a single column value from HBase, it takes around 10 ms. My problem is that when I issue 100 or more concurrent requests, the time slowly accumulates and increases to more than 400 ms instead of staying at 10 ms. When the 100 requests are issued sequentially, each one takes only 10 ms.
One thing that you should check is how well distributed your table is.
You can do this by going to the HBase master web console at http://<master-host>:60010, where you will be able to see how many regions you have for your table. If you have not done anything special at table creation, you could easily have only one or two regions, which means that all the requests are being directed to a single region server.
If this is the case, you can recreate your table with pre-split regions (I would suggest a multiple of 5, such as 15 or 20), and make sure that the concurrent gets that you are doing are equally spread over the row-key space.
Also, please check how much RAM you have allocated to the region server; you might need to increase it from the default. If you are not running anything other than the HBase region server on those machines, you could probably increase it to 8 GB of RAM.
Other than that, you could also adjust the default for hbase.regionserver.handler.count.
I hope this helps.
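For reference, a minimal sketch of recreating the table with pre-split regions via the HBase 0.96 Java API (the start/end keys below are placeholders and must bracket your real row-key space):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Creates sh_self_profiles pre-split into 15 regions (a multiple of the
 * 5 region servers) so concurrent gets are spread across the cluster
 * instead of all landing on a single region.
 */
public class PreSplitTable {

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (HBaseAdmin admin = new HBaseAdmin(conf)) {
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("sh_self_profiles"));
            table.addFamily(new HColumnDescriptor("profile"));

            // Placeholder boundaries: replace "0" and "z" with the real first/last row keys.
            admin.createTable(table, Bytes.toBytes("0"), Bytes.toBytes("z"), 15);
        }
    }
}
```

An existing table can also be split online with the `split` command in the HBase shell, but pre-splitting at creation time with keys that match your row-key distribution is more predictable.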
Which client are you using? Are you using the standard Java client, the Thrift client, the HTTP REST client, or something else? If your use case is a high amount of random reads of single column values, I highly recommend you try asynchbase as it is much faster than the standard synchronous Java client.
