I have around 10 million files stored as BLOBs in my database, which I need to convert and save in PDF format. Each file is between 0.5 and 10 MB, and the combined size is around 20 TB. I'm trying to implement this functionality using Spring Batch. However, my question is: when I run the batch, can the server memory hold that much data? I'm planning to use chunk-based processing and a ThreadPoolTaskExecutor. Please suggest whether this is the best approach to process that much data in as little time as possible.
Each file being 0.5 to 10 MB, the chunk-based approach you mentioned is a good fit. You can get more control, and monitor the processing, with the following:
Create partitions based on the thread pool count (sized to your system resources) from the file table.
Have each partition's reader step select only one file at a time.
Calculate memory based on the number of parallel steps and pass it as a VM argument.
Configure the commit chunk size based on the memory calculation across all parallel steps.
Please refer to the example code below.
Spring Batch multiple process for heavy load with multiple thread under every process
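As a rough illustration of the partitioning idea above (not the code from the linked question), here is a minimal Spring Batch 4.x style sketch; the ID range, the bean names, and the worker step are assumptions:

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class BlobToPdfPartitionConfig {

    // Splits the ID range of the file table into one ID window per partition.
    // Each worker step reads only the files whose IDs fall inside its window.
    public static class IdRangePartitioner implements Partitioner {
        private final long minId;
        private final long maxId;

        public IdRangePartitioner(long minId, long maxId) {
            this.minId = minId;
            this.maxId = maxId;
        }

        @Override
        public Map<String, ExecutionContext> partition(int gridSize) {
            long range = (maxId - minId) / gridSize + 1;
            Map<String, ExecutionContext> partitions = new HashMap<>();
            for (int i = 0; i < gridSize; i++) {
                ExecutionContext ctx = new ExecutionContext();
                ctx.putLong("fromId", minId + i * range);
                ctx.putLong("toId", Math.min(maxId, minId + (i + 1) * range - 1));
                partitions.put("partition" + i, ctx);
            }
            return partitions;
        }
    }

    // Master step: runs one copy of the worker step per partition on the pool.
    // Tying gridSize to the pool size keeps memory usage predictable.
    @Bean
    public Step masterStep(StepBuilderFactory steps,
                           ThreadPoolTaskExecutor taskExecutor,
                           Step workerStep) {
        return steps.get("masterStep")
                .partitioner("workerStep", new IdRangePartitioner(1L, 10_000_000L))
                .step(workerStep)
                .gridSize(taskExecutor.getMaxPoolSize())
                .taskExecutor(taskExecutor)
                .build();
    }
}
```

The worker step itself would be an ordinary chunk-oriented step whose reader uses the fromId/toId values from the step execution context, with the commit interval sized to your memory calculation.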
I'm testing the NiFi SplitRecord processor with a small file of only 11 records.
However, SplitRecord hangs for a long time and I don't get a clue what it is doing.
[Screenshots: the hung SplitRecord processor and its property settings]
Is Records Per Split controlling the maximum, the minimum, or the exact number of records per split?
If the total number of records is less than Records Per Split, what is the behavior of SplitRecord? Does it wait until a timeout and then put all on-hold records into a single split?
After about 10 minutes, or some random number of start/stop/terminate/restart cycles, the processor may be triggered to split the data sooner.
Records Per Split controls the maximum; see "SplitRecord.java" for the code. If there are fewer records than the RECORDS_PER_SPLIT value, it will immediately push them all out.
However, it does look like it creates a new FlowFile even when the total record count is less than the RECORDS_PER_SPLIT value, meaning it writes to disk regardless of whether a split really occurred.
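For intuition, a simplified plain-Java illustration of that behaviour (this is not the actual NiFi source, just the grouping rule it describes):

```java
import java.util.ArrayList;
import java.util.List;

// Records are grouped into chunks of at most recordsPerSplit; a smaller
// remainder is still emitted immediately rather than being held back.
public class SplitSketch {

    static <T> List<List<T>> split(List<T> records, int recordsPerSplit) {
        List<List<T>> splits = new ArrayList<>();
        for (int i = 0; i < records.size(); i += recordsPerSplit) {
            splits.add(records.subList(i, Math.min(records.size(), i + recordsPerSplit)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 1; i <= 11; i++) {
            records.add(i);
        }
        // 11 records with Records Per Split = 20 -> one split containing all 11.
        System.out.println(split(records, 20).size()); // prints 1
    }
}
```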
I would probably investigate two things:
Host memory - how much memory does the host have? How much is configured as NiFi max heap? How much total system memory is in use/free? NiFi performs best when plenty of system memory is left for file cache.
Host's disks, specifically the disk that has the Content Repository on it. Capacity? IO? Is it shared with other services? FlowFile content is written to the Content Repository, if the disk is shared with the OS, or other busy services (or other NiFi repos) it can really slow content modification down.
Note: your NiFi version is over 3 years old; please consider upgrading.
I've developed a NiFi flow prototype for data ingestion into HDFS. Now I would like to improve the overall performance, but it seems I cannot really move forward.
The flow takes CSV files as input (each row has 80 fields), splits them at row level, applies some transformations to the fields (using 4 custom processors executed sequentially), buffers the new rows into CSV files, and outputs them to HDFS. I've developed the processors in such a way that the content of the flow file is accessed only once, when each individual record is read and its fields are moved to flowfile attributes. Tests have been performed on an Amazon EC2 m4.4xlarge instance (16 CPU cores, 64 GB RAM).
This is what I tried so far:
Moved the flowfile repository and the content repository on different SSD drives
Moved the provenance repository in memory (NiFi could not keep up with the events rate)
Configured the system according to the configuration best practices
Tried assigning multiple threads to each of the processors in order to reach different numbers of total threads
Tried increasing nifi.queue.swap.threshold and setting backpressure so the swap limit is never reached
Tried different JVM memory settings from 8 up to 32 GB (in combination with G1GC)
Tried increasing the instance specifications, but nothing changed
From the monitoring I've performed, it looks like the disks are not the bottleneck (they are basically idle a great part of the time, showing that the computation is actually being performed in memory) and the average CPU load is below 60%.
The most I can get is 215k rows/minute, which is 3.5k rows/second. In terms of volume, that's just 4.7 MB/s. I am aiming for something definitely greater than this.
Just as a comparison, I created a flow that reads a file, splits it into rows, merges them together in blocks, and outputs to disk. Here I get 12k rows/second, or 17 MB/s. That doesn't look surprisingly fast either, and it makes me think I am probably doing something wrong.
Does anyone have suggestions on how to improve the performance? How much would I benefit from running NiFi on a cluster instead of growing the instance specs? Thank you all.
It turned out the poor performance was a combination of both the custom processors I developed and the built-in MergeContent processor. The same question mirrored on the Hortonworks community forum got interesting feedback.
Regarding the first issue, a suggestion is to add the SupportsBatching annotation to the processors. This allows a processor to batch several commits together, and lets the NiFi user favor latency or throughput for that processor from the configuration menu. Additional info can be found in the documentation here.
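For reference, a minimal sketch of what that looks like on a custom processor (the processor name and body are placeholders):

```java
import org.apache.nifi.annotation.behavior.SupportsBatching;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.exception.ProcessException;

// @SupportsBatching lets the framework batch several session commits together
// and exposes the latency-vs-throughput (Run Duration) slider in the
// processor's scheduling settings.
@SupportsBatching
public class MyFieldTransformProcessor extends AbstractProcessor {

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        // ... existing per-record transformation logic ...
    }
}
```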
The other finding was that the built-in MergeContent processor doesn't seem to have optimal performance itself, so if possible one should consider modifying the flow to avoid the merging phase.
I'm trying to figure out how to determine the IOPS my application is driving so I can properly size our cloud infrastructure components. I understand what IOPS are between a database and the storage layer, but I'd like to understand how to calculate what my application drives. Here are some of my application's characteristics:
1) 90% write and 10% read
2) We have a Java-based application that ultimately inserts into an HBase database
3) Process about 50 msg/sec where each message results in probably 2 HBase inserts
Here is what I'm not sure about:
1) Is the only way to calculate the IOPS to run iostat or something similar on the actual server under load?
2) Is there a general way I can calculate what's needed from the data volume/size coming in, rather than measuring on the actual storage unit?
3) Is there any relationship between the # of transactions and the # of bytes in each transaction? (I read somewhere an IO is usually 3K; most inserts don't contain that much info, so it doesn't matter.)
Any help would be greatly appreciated.
I'm not very familiar with HBase, but from the documentation it uses a log-structured design, which means writes will be sequential. It also has compactions, which cause both sequential reads and writes of multiple MB. Read queries cause random reads at the storage layer.
So here are the answers to your questions:
As far as I know, yes. The only way to get IOPS is to run iostat. You can probably get some compaction stats at the application level, but it is hard to extract IOPS-level details from them.
Compaction will use more storage than the raw data size, and if your application is write-heavy (compaction might not keep up with the insert rate), the actual data volume on disk will be much larger. Given the 50 msg/sec in your question, this should not be the case. I would provision disks at double the size of the expected data volume per instance.
As mentioned above, HBase is log-structured. Writes are accumulated in memory and flushed to disk together, so the size of each individual transaction doesn't matter much.
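To connect this back to the numbers in the question, a back-of-envelope sketch; the per-insert size and the write-amplification factor are assumptions, not measured values:

```java
// Rough estimate of logical write load; physical IOPS will be far lower
// because HBase turns these into sequential WAL appends and large
// flush/compaction writes. iostat under real load remains the ground truth.
public class IopsEstimate {
    public static void main(String[] args) {
        double msgsPerSec = 50;          // from the question
        double insertsPerMsg = 2;        // from the question
        double bytesPerInsert = 1024;    // ASSUMPTION: ~1 KB per insert
        double writeAmplification = 3;   // ASSUMPTION: WAL + flushes + compaction rewrites

        double logicalWritesPerSec = msgsPerSec * insertsPerMsg;              // 100 logical writes/s
        double logicalBytesPerSec  = logicalWritesPerSec * bytesPerInsert;    // ~100 KB/s
        double physicalBytesPerSec = logicalBytesPerSec * writeAmplification; // ~300 KB/s to disk

        System.out.printf("logical writes/s: %.0f%n", logicalWritesPerSec);
        System.out.printf("approx. bytes/s written to disk: %.0f%n", physicalBytesPerSec);
    }
}
```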
I am copying data from an Excel sheet to SQL tables.
At this time it is around 2000 rows distributed across 18 tables.
The problem with my job is that it takes too much time: around 2.5 minutes for this volume.
The other issue I am facing is memory. I tried to copy around 250,000 rows and couldn't run the job with the basic settings; I had to increase the Xms and Xmx allocation.
How do I solve these issues?
You should start your job with a tMSSQLConnection component (I think that's the DBMS you're using) and then finish it with a tMSSQLCommit component and see if that helps at all, as it could be that Talend is opening a large number of connections to the database rather than pooling them.
Increasing the commit size will also help speed up bulk loads, but obviously if anything in a commit fails, that entire commit is lost.
As well as this, as long as you have no race conditions and don't care in what order tables are inserted into or updated, you could parallelise the whole job with either a tParallelize component or by enabling multi-thread execution in the Extra tab under the Job window.
Sometimes the memory usage in the job can be improved by splitting the process down into separate jobs and linking them as child jobs in one large wrapper parent job with tRunJob components. This will also make the job more manageable.
Finally, there are a couple of options in the advanced settings of each database output component that allow you to increase the batch size (although this will increase memory usage) and to enable parallel connections, which can greatly improve performance by utilising more database server cores.
Your memory issues are unlikely to be resolved short of re-engineering your job to only deal with smaller chunks of data at a time, committing each part and then grabbing the next lot.
This could be done by using a tFilterRow component and only selecting the first x records (by some filter condition; if the data set has none, you could always add one by first preprocessing everything to give every row a Numeric.Sequence), processing them and writing them to your table, then picking the next x records, and so on.
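Outside the Talend components themselves, the underlying pattern is just batched inserts committed in chunks. A rough plain-JDBC sketch, with placeholder connection details and table/column names:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

// Rows are added to a JDBC batch and committed every COMMIT_SIZE rows,
// so memory stays bounded and each chunk is persisted before the next one.
public class ChunkedInsert {

    private static final int COMMIT_SIZE = 1000;

    public static void insert(List<String[]> rows) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:sqlserver://host:1433;databaseName=mydb", "user", "password")) {
            con.setAutoCommit(false);
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO my_table (col1, col2) VALUES (?, ?)")) {
                int count = 0;
                for (String[] row : rows) {
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                    ps.addBatch();
                    if (++count % COMMIT_SIZE == 0) {
                        ps.executeBatch();   // send this batch to the server
                        con.commit();        // commit the chunk, then carry on
                    }
                }
                ps.executeBatch();           // flush and commit the remainder
                con.commit();
            }
        }
    }
}
```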
Use "Singe Insert Query" in the MSSQL output. Make sure you're using the correct batch size.
Batch should should be LESS OR EQUAL to: 2000 / column count. This could speed up the load speed.
However I'm not sure about the memory errors. I think talend tries to read the excel inputs into memory as a whole, thus for bigger the excel files you need more memory.
We developed a Spring Batch application which has two flows: 1. Forward, 2. Backward. We are only using file read/write; no DB is involved.
Forward scenario: the input file will have records with 22 fields. The 22 fields are converted into 32 fields by doing some operations, such as sequence number generation and adding a few filler fields. Based on country codes, the output is split into at most 3 files; each chunk will have 250K records (if a country has millions of records, multiple files will be generated for that country).
For 8 million records it takes 36 minutes.
The 8 million records are in a single file.
We are using Spring Batch with 1000 threads.
Backward flow: the input file has 82 fields per record. These 82 fields are converted into 86 fields. Two fields are added in between, taken from the Forward flow input file; the other fields are simply copied over. Error records also have to be written to an error file. The error records are nothing but the actual input records that came in for the Forward flow. To track them, we persist the sequence number and the actual records to a file; this is done in the forward flow itself. In the backward flow we take the persisted file and compare sequence numbers; if anything is missing, we write it to the error records as a key/value pair. This is done after the backward flow completes.
The maximum size of an input file is 250K records.
For 8 million records it takes 1 hour 8 minutes, which is too slow.
There are 32 input files (each 250K records) for this flow.
No threads are used in the backward flow, and I don't know how threading would behave there. I tried it, but the process hung.
Server Configurations:
12 CPU & 64 GB Linux Server.
Can you guys help improve the performance, given that we have 12 CPUs / 64 GB RAM?
You are already using 1000 threads, and that is a very high number. I have fine-tuned Spring Batch jobs, and this is what I have done:
1. Reduce network traffic: try to reduce the number of calls to the database or file system in each process. Can you get all the info you need in one shot and keep it in memory for the life of the thread? I have used org.apache.commons.collections.map.MultiKeyMap for storage and retrieval of data.
For example, in your case you need sequence number comparison, so load all sequence numbers into one map before you start the process (a rough sketch follows this list). You can store the ids (if there are not too many) in the step execution context.
2. Write less frequently: keep accumulating the info you need to write for some time and then write it out at the end.
3. Set unused objects to null at the end of the process to expedite GC.
4. Check your GC frequency through VisualVM or JConsole. You should see frequent GC happening while your process is running, which means objects are being created and garbage collected. If your memory graph keeps increasing, something is wrong.
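A rough sketch of point 1, preloading the persisted sequence data once per step with a StepExecutionListener; the file path, the pipe-delimited layout, and the class name are assumptions:

```java
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.collections.map.MultiKeyMap;
import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;

// Loads the persisted (country code, sequence number) -> original record data
// once before the step starts, so the backward flow never re-reads the
// tracking file per record.
public class SequenceCacheListener implements StepExecutionListener {

    private final MultiKeyMap cache = new MultiKeyMap();   // (countryCode, seqNo) -> record

    @Override
    public void beforeStep(StepExecution stepExecution) {
        try {
            Files.lines(Paths.get("/data/forward/sequence-track.dat")).forEach(line -> {
                String[] parts = line.split("\\|", 3);     // countryCode|seqNo|originalRecord
                cache.put(parts[0], parts[1], parts[2]);
            });
        } catch (Exception e) {
            throw new IllegalStateException("Could not preload sequence cache", e);
        }
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        cache.clear();          // release the cache once the step is done
        return stepExecution.getExitStatus();
    }

    // Called from the processor/writer: a missing sequence number means an error record.
    public String lookup(String countryCode, String seqNo) {
        return (String) cache.get(countryCode, seqNo);
    }
}
```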