Performance issue fetching millions of records from a database and processing them in an ERP with Mule ESB - mule-component

We fetch millions of records per day from a database and process them in an ERP system, and we are running into performance issues. Does the community have a solution for this?
What is the best way to process the records in Mule? Should we use batch, or is there an alternative? And whichever approach we choose, how should we set it up to avoid performance problems?

Since we don't have details on your specific situation, here are some general ideas. With large data sets you will definitely need to run performance tests to confirm that your flow design holds up.
Just to clarify, the options below use streaming, which is slightly slower but lets you process datasets too large to hold in memory. If the dataset fits in memory and you want faster processing, turn streaming off.
Test your database queries outside of Mule to make sure they perform well and that the tables are properly indexed.
Use a streaming database connection, and tune the chunk size during performance testing. (Combining this with the batch scope works well.)
If you run an on-premise runtime, tune its performance.
Use the batch scope (Enterprise Edition).
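To make the streaming-with-chunks idea concrete, here is a minimal sketch in Python rather than Mule configuration. It assumes a generic DB-API connection (sqlite3 stands in for the real database), and the `orders` table, the function names and the chunk size of 1000 are all made up for illustration. The point is only that at most one chunk of rows is held in memory at a time.

```python
import sqlite3


def stream_rows(conn, query, chunk_size=1000):
    """Yield the result set chunk by chunk instead of loading it all at once."""
    cursor = conn.cursor()
    cursor.execute(query)
    while True:
        chunk = cursor.fetchmany(chunk_size)  # at most chunk_size rows in memory
        if not chunk:
            break
        yield chunk


def process_all(conn, query, chunk_size=1000):
    """Walk every chunk and return (rows_processed, chunks_seen)."""
    rows = 0
    chunks = 0
    for chunk in stream_rows(conn, query, chunk_size):
        rows += len(chunk)  # placeholder for the real per-record work
        chunks += 1
    return rows, chunks


def demo():
    """Hypothetical data set: 2500 rows read in chunks of 1000."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany(
        "INSERT INTO orders (amount) VALUES (?)",
        [(float(i),) for i in range(2500)],
    )
    return process_all(conn, "SELECT id, amount FROM orders ORDER BY id", 1000)
```

The chunk size is the knob to vary during performance testing: larger chunks mean fewer round trips but more memory per step.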

Batch sounds like what you want. For each batch step, Mule creates a batch job instance, and each instance contains a persistent queue holding the batched records. Beware, however, that it makes a deep copy of the MuleEvent, including the flow variables, flow construct, message, processing time, session and exchange pattern, so keep a light footprint before entering the batch job. If you had to store a payload of millions of records in flow variables to manipulate it, remove those variables before the batch starts executing. Mule loads the batch steps into memory and executes them concurrently, so the memory you need is roughly the size of the batch job instance (in particular the MuleEvent) multiplied by the number of batch steps.

Related

Spring Batch: process large file

I have 10 large files in production. We need to read each line from the files, convert the comma-separated values into a value object, send it to a JMS queue, and also insert it into 3 different tables in the database.
Across the 10 files there are 33 million lines. We use Spring Batch (MultiResourceItemReader) to read each line, and a writer to write it to the database and send it to JMS. The whole run takes roughly 25 hours.
Even though we have 10 systems in production, we currently use only one of them to run this job (I am new to Spring Batch and don't know how Spring supports load balancing).
Since we use only one system, we configured the data source for the database connection with a maximum of 25 connections.
To improve performance we tried Spring's multithreading support, starting with 5 threads. Performance improved, and everything completed in 10 hours.
I have the following questions:
1) If I process with 5 threads, we will publish a huge amount of data to the JMS queue. Will the queue cope with that much data? Note that we have 10 systems in production reading JMS messages from the queue.
2) Is using 5 threads on one production system a good approach? Alternatively, instead of having Spring Batch insert the data into the database, I could create a REST service: Spring Batch would call the REST API, which would insert the data into the database and publish to the JMS queue. (In that setup, Spring Batch would read 4 or 5 lines per second and call the REST API for each; note that we have 10 production systems.) Would the REST approach hold up (REST can handle a large number of requests behind a load balancer, and JMS can handle a very large number of messages), or is multithreading in the Spring Batch application on one production system the better approach?
Different JMS providers have different limits, but in general messaging can easily handle millions of rows in a short period of time.
Messaging is going to be faster than inserting directly into the database because a message carries very little data to manage (other than JMS properties) and avoids the overhead of a complete RDBMS, NoSQL database or similar; messaging outperforms them all.
Assuming the individual lines can be processed in any order, sending all the data to the same queue and having n consumers work the back end is a sound solution.
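As a rough illustration of the one-queue, n-consumers shape, here is a small Python sketch. It is not JMS: an in-process `queue.Queue` stands in for the broker and threads stand in for the consuming systems, and every name in it is made up for the example. It only shows that ordering is given up in exchange for parallel consumption.

```python
import queue
import threading

SENTINEL = None  # tells a consumer that no more lines are coming


def consume(work_queue, results, lock):
    """Pull lines off the shared queue until the sentinel arrives."""
    while True:
        line = work_queue.get()
        if line is SENTINEL:
            break
        fields = line.split(",")  # stand-in for building the value object
        with lock:
            results.append(fields)


def run(lines, n_consumers=4):
    """Publish every line to one queue and let n consumers drain it."""
    work_queue = queue.Queue(maxsize=1000)  # bounded, so the producer cannot run away
    results = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=consume, args=(work_queue, results, lock))
        for _ in range(n_consumers)
    ]
    for worker in workers:
        worker.start()
    for line in lines:
        work_queue.put(line)
    for _ in workers:
        work_queue.put(SENTINEL)  # one sentinel per consumer
    for worker in workers:
        worker.join()
    return results  # arrival order is not guaranteed
```

The bounded queue gives simple back-pressure: if the consumers fall behind, the producer blocks instead of piling up messages.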
Your big bottleneck, however, is getting the data into the database. If the destination tables have many keys or indices, there will be serious contention because each insert, update or delete has to update the indices. Even with n different consumers writing to the database, they will trample on each other as their transactions complete.
One solution I've seen is to disable all database constraints before you start and re-enable them at the end. If things worked, the data is consistent and usable; the risk, of course, is that bad data slipped through and you now need to clean it up or reattempt the load.
A better solution might be to transform the files into a single file that can be bulk loaded into the database with a platform-specific tool. These tools often disable indexes, constraint checking, and anything else that slows things down, often bypassing SQL itself, to get performance.
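The "drop the index, load, rebuild once" pattern that such tools apply can be sketched in a few lines of Python. This is only an illustration under assumptions: sqlite3 stands in for the production database, the `lines` table and `idx_lines_code` index are invented, and a real bulk loader does considerably more than this.

```python
import sqlite3


def bulk_load(conn, rows):
    """Load rows with the secondary index dropped, then rebuild it once."""
    conn.execute("DROP INDEX IF EXISTS idx_lines_code")
    with conn:  # a single transaction for the whole load
        conn.executemany("INSERT INTO lines (code, value) VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX idx_lines_code ON lines (code)")
    return conn.execute("SELECT COUNT(*) FROM lines").fetchone()[0]


def demo():
    """Hypothetical table with one secondary index and 10000 rows to load."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lines (code TEXT, value TEXT)")
    conn.execute("CREATE INDEX idx_lines_code ON lines (code)")
    rows = [(f"C{i}", f"v{i}") for i in range(10000)]
    return bulk_load(conn, rows)
```

Rebuilding the index once at the end replaces one index update per inserted row, which is where the contention described above comes from.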

Lambda Architecture - Why batch layer

I am going through the lambda architecture and trying to understand how it can be used to build fault-tolerant big data systems.
I am wondering how the batch layer is useful when everything could be stored in the realtime view and the results generated from it. Is it because realtime storage can't hold all of the data, since it would then no longer be realtime: the time taken to retrieve the data depends on how much space the stored data takes up?
Why batch layer
To save time and money!
It basically has two functions:
To manage the master dataset (assumed to be immutable)
To pre-compute the batch views for ad-hoc querying
"Everything can be stored in realtime view and generate the results out of it" - NOT TRUE
It is certainly possible, but not feasible: the data could run to hundreds or thousands of petabytes, and generating results could take a lot of time.
The key is to get low-latency queries over a large dataset. The batch layer creates batch views (which serve queries with low latency), and the realtime layer covers recent or updated data, which is usually small. Any ad-hoc query can then be answered by merging results from the batch views and the realtime views instead of computing over the whole master dataset.
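The merge step is easier to see with a toy example. The sketch below assumes a page-view counting query; the event shape (`{"page": ...}`) and the function names are invented for illustration, and in-memory counters stand in for the real batch and realtime stores.

```python
from collections import Counter


def build_batch_view(master_dataset):
    """Batch layer: precompute counts over the whole (immutable) master dataset."""
    return Counter(event["page"] for event in master_dataset)


def update_realtime_view(realtime_view, event):
    """Speed layer: fold in one recent event that the last batch run has not seen."""
    realtime_view[event["page"]] += 1


def query(page, batch_view, realtime_view):
    """Serving: merge the precomputed result with the small recent delta."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)
```

The query never touches the master dataset; it only adds two precomputed numbers, which is why it stays fast as the dataset grows.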
Also, think of a query (the same query?) running again and again over a huge dataset: a loss of time and money!
Further to the answer provided by @karthik manchala, data processing can be handled in three ways: batch, interactive, and real-time/streaming.
I believe your reference to real-time is closer to interactive response than to streaming, as not all use cases are streaming-related.
Interactive responses are those where the response can be expected anywhere from sub-second to a few seconds to minutes, depending on the use case. The key point is that processing is done on data at rest, i.e. already stored on a storage medium. The user interacts with the system during processing and therefore waits for the response. All the efforts behind Hive on Tez, Impala, Spark core etc. aim to address this and make responses as fast as possible.
Streaming, on the other hand, is where data streams into the system in real time, for example Twitter feeds or click streams, and processing needs to happen as soon as the data is generated. Frameworks like Storm and Spark Streaming address this space.
The case for batch processing is scenarios where heavy lifting needs to be done on a huge dataset beforehand, so that the responses the user sees appear to be real-time. For example, indexing a huge collection of documents into Apache Solr is a batch job that may run for minutes or possibly hours depending on the dataset. A user who queries the Solr index, however, gets a response with sub-second latency. As you can see, the indexing cannot be done in real time because there may be huge amounts of data. The same is true of Google search, where indexing is done in batch mode and the results are presented in interactive mode.
All three modes of data processing are likely to be involved in any organisation grappling with data challenges. Lambda Architecture addresses this effectively by using the same data sources for multiple data processing requirements.
You can check out the Kappa Architecture, where there is no separate batch layer.
Everything is analyzed in the stream layer. With the right configuration you can use Kafka as the master dataset storage and save the computed data in a database as your view.
If you want to recompute, you can start a new stream-processing job, recompute your view from Kafka into your database, and replace your old view.
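A toy version of that recompute-and-replace cycle, in Python. No Kafka client is involved: a plain list stands in for the retained log, a dict for the database view, and all names (`replay`, `ViewStore`, the event fields) are assumptions made for the sketch.

```python
def replay(log, apply_event):
    """Rebuild a view from scratch by replaying the retained log from the start."""
    view = {}
    for event in log:
        apply_event(view, event)
    return view


def count_per_user(view, event):
    """First version of the stream job: count events per user."""
    view[event["user"]] = view.get(event["user"], 0) + 1


def total_per_user(view, event):
    """Changed logic: sum amounts per user instead of counting."""
    view[event["user"]] = view.get(event["user"], 0) + event["amount"]


class ViewStore:
    """Holds the view that queries read; swap() replaces it in one step."""

    def __init__(self):
        self.current = {}

    def swap(self, new_view):
        self.current = new_view
```

Queries keep reading the old view while the new job replays the log, and the swap happens only once the new view is complete.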
It is possible to use only the realtime view as the main storage for ad-hoc queries but, as other answers already mention, with a lot of data it is faster to keep batch processing and stream processing separate than to run batch jobs as stream jobs. It depends on the size of your data.
It is also cheaper to use storage like HDFS instead of a database for batch computing.
Finally, in many cases you have different algorithms for batch and stream processing, so you need to keep them separate. But basically it is possible to use only the "realtime view" as both your batch and stream layer, even without Kafka as the master dataset. It depends on your use case.

Transfer large amount of data from DB2 to Oracle?

Every day I need to transfer a large amount of data (several million records) from DB2 to an Oracle database. Could you suggest the best-performing method to do that?
DB2 will allow you to select Oracle as a replication target. This is probably the easiest and most efficient way to do it every day, and it also removes the "intermediate container" objection that you have.
See this introduction (and more from the documentation online) for more.
If you're only talking speed then do this.
Time how long it takes to dump the DB2 data to a flatfile.
Time how long it takes to suck that flatfile into Oracle.
There's your baseline, and it's free. If you can beat that with an ETL tool, then decide whether the cost of the tool is worth it.
For a simple ETL like this, there's little that I've found that can beat this on time.
The downside of this is just general file manipulation BS...
how do you know when to read from the file
how do you know that you got all the rows
how do you resume when something breaks
All those little "niceties" usually get paid for with speed. Of course, I'm joking a bit. They aren't always a little nicety. They are often essential for a smooth running process.
Dump the data out to a delimited file. Load it into Oracle via a direct-path sqlldr (SQL*Loader) job. Not sexy, but fast. If you can be on the same subnet, that would be best (pushing data across the network is not what you want). Set this up in cron and add email alerts on errors.
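Two of the file-handling questions raised above (when is the file safe to read, and did all the rows arrive) can be handled with very little code. The sketch below is an assumption-laden illustration in Python: sqlite3 stands in for both DB2 and Oracle, the `accounts` table and the `.part` suffix are invented, and a real job would hand the file to sqlldr instead of inserting row by row.

```python
import csv
import os
import sqlite3
import tempfile


def dump_to_file(source, query, path):
    """Write the query result to a pipe-delimited file; return the row count."""
    tmp_path = path + ".part"
    count = 0
    with open(tmp_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="|")
        for row in source.execute(query):
            writer.writerow(row)
            count += 1
    os.replace(tmp_path, path)  # the final name appears only once the file is complete
    return count


def load_from_file(target, path, expected_rows):
    """Load the file in one transaction; roll back if the row count is off."""
    with open(path, newline="") as handle, target:
        reader = csv.reader(handle, delimiter="|")
        cursor = target.executemany(
            "INSERT INTO accounts (id, name) VALUES (?, ?)", reader
        )
        if cursor.rowcount != expected_rows:
            raise ValueError(f"expected {expected_rows} rows, loaded {cursor.rowcount}")
    return cursor.rowcount


def demo():
    """Hypothetical 500-row transfer between two in-memory databases."""
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    for db in (source, target):
        db.execute("CREATE TABLE accounts (id INTEGER, name TEXT)")
    source.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [(i, f"name{i}") for i in range(500)]
    )
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "accounts.dat")
        written = dump_to_file(source, "SELECT id, name FROM accounts", path)
        loaded = load_from_file(target, path, written)
    total = target.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]
    return written, loaded, total
```

Because the load runs in a single transaction, resuming after a failure is simply a rerun of the load; nothing is left half-written in the target.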

Need: In memory object database, transactional safety, indices, LINQ, no persistence

Does anyone have an idea?
The issue is: I am writing a high-performance application. It has a SQL database which I use for persistence. In-memory objects get updated, then the changes are queued for a disk write (which is pretty much always an insert into a versioned table). The small window of risk is accepted: in case of a crash, program code will resync local state with external systems.
Now, quite often I need to run lookups on certain values, and it would be nice to have a standard interface. Basically a bag of objects, but with the ability to run queries efficiently against an in-memory index. For example, I have a table of "instruments" which all have a unique code, and I need to look up this code about 30,000 times per second as I get updates for every instrument.
Does anyone have an idea for a decent high-performance library for this?
You should be able to use an in-memory SQLite database (:memory:) with System.Data.SQLite.
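To show what the in-memory approach looks like in practice, here is a short sketch using Python's built-in sqlite3 module instead of System.Data.SQLite, so the API calls differ from the .NET ones. The `instruments` schema and the function names are assumptions. Declaring the code as PRIMARY KEY gives the unique index that the lookups rely on.

```python
import sqlite3


def build_store(instruments):
    """Create an in-memory database; nothing is ever written to disk."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE instruments (code TEXT PRIMARY KEY, name TEXT, last_price REAL)"
    )
    conn.executemany("INSERT INTO instruments VALUES (?, ?, ?)", instruments)
    conn.commit()
    return conn


def update_price(conn, code, price):
    """Transactional update by unique code; returns False if the code is unknown."""
    with conn:
        cursor = conn.execute(
            "UPDATE instruments SET last_price = ? WHERE code = ?", (price, code)
        )
    return cursor.rowcount == 1


def lookup(conn, code):
    """Indexed lookup by unique code; returns a row tuple or None."""
    return conn.execute(
        "SELECT code, name, last_price FROM instruments WHERE code = ?", (code,)
    ).fetchone()
```

Whether this sustains the required lookup rate would have to be measured; a plain dictionary keyed by code is the obvious baseline to compare against.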

What are the required functionalities of ETL frameworks?

I am writing an ETL (in Python with a MongoDB backend) and was wondering: what kind of standard functions and tools should an ETL have to be called an ETL?
This ETL will be as general purpose as possible, with a scriptable and modular approach. Mostly it will be used to keep different databases in sync, and to import/export datasets in different formats (XML and CSV). I don't need any multidimensional tools, but they may be needed later.
Let's think of the ETL use cases for a moment.
Extract.
Read databases through a generic DB-API adapter.
Read flat files through a similar adapter.
Read spreadsheets through a similar adapter.
Cleanse.
Arbitrary rules
Filter and reject
Replace
Add columns of data
Profile Data.
Statistical frequency tables.
Transform (see cleanse, they're two use cases with the same implementation)
Do dimensional conformance lookups.
Replace values, or add values.
Aggregate.
At any point in the pipeline
Load.
Or prepare a flat-file and run the DB product's loader.
Further, there are some additional requirements that aren't single use cases.
Each individual operation has to be a separate process that can be connected in a Unix pipeline, with individual records flowing from process to process. This uses all the CPU resources.
You need some kind of time-based scheduler for places that have trouble reasoning out their ETL preconditions.
You need an event-based schedule for places that can figure out the preconditions for ETL processing steps.
Note. Since ETL is I/O bound, multiple threads do you little good. Since each process runs for a long time, especially if you have thousands of rows of data to process, the overhead of "heavyweight" processes doesn't hurt.
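The extract, cleanse, transform and load stages can be sketched as a chain in which individual records flow from one step to the next. Note the substitution: this sketch uses Python generators inside a single process, not the separate OS processes in a Unix pipeline described above, so it shows the record-at-a-time flow but not the use of multiple CPUs. The column names and the region lookup are invented for the example.

```python
import csv
import io


def extract(text):
    """Extract: read delimited text one record at a time."""
    yield from csv.DictReader(io.StringIO(text))


def cleanse(records, rejects):
    """Cleanse: reject records without an amount, normalise the country code."""
    for record in records:
        if not record["amount"].strip():
            rejects.append(record)
            continue
        record["country"] = record["country"].strip().upper()
        yield record


def transform(records, regions):
    """Transform: add a column through a dimensional lookup."""
    for record in records:
        record["region"] = regions.get(record["country"], "UNKNOWN")
        yield record


def load(records, target):
    """Load: append to the target (a list standing in for a table)."""
    count = 0
    for record in records:
        target.append(record)
        count += 1
    return count


def run_pipeline(text, regions):
    """Wire the stages together; records stream through one at a time."""
    rejects, target = [], []
    loaded = load(transform(cleanse(extract(text), rejects), regions), target)
    return loaded, target, rejects
```

Each stage only knows about the records it receives, so a stage can be added, removed or moved to its own process without touching the others.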
Here's a random list, in no particular order:
Connect to a wide range of sources, including all the major relational databases.
Handle non-relational data sources like text files, Excel, XML, etc.
Allow multiple sources to be mapped into a single target.
Provide a tool to help map from source to target fields.
Offer a framework for injecting transformations at will.
Programmable API for writing complex transformations.
Optimize load process for speed.
Automatic / heuristic mapping of column names, e.g. simple string mappings:
DB1: customerId
DB2: customer_id
I find that a lot of the work I have done in DTS / SSIS could have been generated automatically.
Not necessarily "required functionality", but it would keep a lot of your users very happy indeed.
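A heuristic of that kind can be as small as the Python sketch below. It assumes that stripping case and punctuation is enough to match names such as customerId and customer_id; the function names are made up, and anything the heuristic cannot match is returned for manual mapping.

```python
import re


def normalize(name):
    """Reduce a column name to lowercase letters and digits only."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def suggest_mapping(source_columns, target_columns):
    """Return (mapping, unmatched): source -> target pairs and leftover sources."""
    by_key = {normalize(target): target for target in target_columns}
    mapping = {}
    unmatched = []
    for source in source_columns:
        target = by_key.get(normalize(source))
        if target is None:
            unmatched.append(source)  # left for a human to map
        else:
            mapping[source] = target
    return mapping, unmatched
```

For example, `suggest_mapping(["customerId"], ["customer_id"])` pairs the two names because both normalise to `customerid`.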
