I have an application that performs a delayed operation. A user generates 1 million messages that are stored in a JMS queue, and MDBs (message-driven beans) then consume these messages, perform some action, and store data in the database. Since the JMS queue delivers too fast, the server tries to create 1 million MDB instances, which in turn try to open 1 million database connections. No surprise that some of them time out, since the JDBC connection pool cannot serve 1 million connection requests.
What is the best way to control the number of MDB instances created? Ideally, the 1 million messages would be processed by a fixed number of MDBs that does not exceed the number of connections allowed in the JDBC pool.
You can limit the number of instances of your MDB by using the max-beans-in-free-pool element within the descriptor for your bean in weblogic-ejb-jar.xml:
<message-driven-descriptor>
  <pool>
    <max-beans-in-free-pool>100</max-beans-in-free-pool>
    <initial-beans-in-free-pool>50</initial-beans-in-free-pool>
  </pool>
  ...
</message-driven-descriptor>
If left unspecified, the number of instances created is bounded only by the number of available threads. Either way, it is good practice to keep the maximum number of instances at or below the size of your database connection pool.
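The effect of such a cap can be sketched outside the container with plain Java: a fixed-size executor plays the role of the MDB free pool, so a large backlog of messages never produces more concurrent consumers (and hence concurrent connection requests) than the pool size. The pool size of 20 and the simulated backlog of 1,000 messages are assumptions for illustration:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedConsumersDemo {

    /** Runs `messages` tasks on a pool of `poolSize` threads; returns the peak concurrency seen. */
    static int run(int poolSize, int messages) throws InterruptedException {
        ExecutorService consumers = Executors.newFixedThreadPool(poolSize); // stand-in for the MDB free pool
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(messages);
        for (int i = 0; i < messages; i++) {           // stand-in for the queued messages
            consumers.submit(() -> {
                int now = inFlight.incrementAndGet();
                peak.accumulateAndGet(now, Math::max); // record the highest concurrency observed
                inFlight.decrementAndGet();            // "release the connection"
                done.countDown();
            });
        }
        done.await();
        consumers.shutdown();
        return peak.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // 1,000 queued "messages", but never more than 20 concurrent consumers
        System.out.println("peak concurrency = " + run(20, 1_000));
    }
}
```

However large the backlog, the peak never exceeds the pool size, which is exactly the property you want relative to the JDBC pool.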
Requirement:
Hold the data during DB downtime and process it at a 5-minute interval by keeping it in a dead letter queue.
I have tried the approaches below:
A Kafka retry topic, but there are limitations: I have no control over the listener to configure the interval, and the @KafkaListener picks up the message as soon as we push it.
Picking the message up in the Kafka listener and storing it in a HashSet, with a scheduler that scans the set at a 5-minute delay and wipes it out (this approach is not practical since the set is in memory).
I'm using Spring Boot 1.5.4.RELEASE and Spring Kafka 1.3.8.RELEASE.
My Kafka consumer does batch processing in chunks of 100. The topic I'm trying to consume has 10 partitions, and I have 10 instances of the Kafka consumer.
Is there a way I can enforce getting a fixed batch of 100 records (as far as possible), apart from the last chunk in a given partition?
Kafka has no fetch.min.records property.
The best you can do is simulate it with:
fetch.min.bytes: The minimum amount of data the server should return for a fetch request. If insufficient data is available the request will wait for that much data to accumulate before answering the request. The default setting of 1 byte means that fetch requests are answered as soon as a single byte of data is available or the fetch request times out waiting for data to arrive. Setting this to something greater than 1 will cause the server to wait for larger amounts of data to accumulate which can improve server throughput a bit at the cost of some additional latency.
and
fetch.max.wait.ms: The maximum amount of time the server will block before answering the fetch request if there isn't sufficient data to immediately satisfy the requirement given by fetch.min.bytes.
This will work well only if your records have similar sizes.
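As a sketch, these two properties (plus max.poll.records, which caps each poll at your batch size so a chunk never exceeds 100) can be set on the consumer. The broker address and the assumption that records average about 1 KB are placeholders; tune fetch.min.bytes to roughly 100 × your typical record size:

```java
import java.util.Properties;

public class FetchTuning {

    static Properties consumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "batch-consumer");          // hypothetical group id
        // Assumption: records average ~1 KB, so ~100 KB approximates 100 records.
        // The broker holds the fetch until this much data is available...
        props.put("fetch.min.bytes", "102400");
        // ...or until this many milliseconds have passed, whichever comes first.
        props.put("fetch.max.wait.ms", "500");
        // Hard cap: a single poll never returns more than 100 records.
        props.put("max.poll.records", "100");
        return props;
    }

    public static void main(String[] args) {
        consumerProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Note the asymmetry: max.poll.records is a hard upper bound, while fetch.min.bytes/fetch.max.wait.ms only encourage the broker to accumulate enough data to reach 100, which is why similar record sizes matter.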
By the way, Spring Boot 1.5.x has reached end of life and is no longer supported. The current Boot version is 2.2.3.
I have 10 large files in production. We need to read each line from a file, convert the comma-separated values into a value object, send it to a JMS queue, and also insert it into 3 different tables in the database.
Across the 10 files there are 33 million lines. We are using Spring Batch (MultiResourceItemReader) to read each line and a writer to write it to the DB and send it to JMS. It takes roughly 25 hours to complete everything.
Even though we have 10 systems in production, we presently use only one system to run this job (I am new to Spring Batch and not aware of how Spring supports load balancing).
Since we have only one system, we configured the data source to connect to the DB with a maximum of 25 connections.
To improve performance we decided to use Spring's multi-threading support and started with 5 threads. We could see the performance improvement: everything completed in 10 hours.
Here I have the questions below:
1) If I process using 5 threads, we will publish a huge amount of data to the JMS queue. Will the queue support that volume? Note that we have 10 systems in production reading JMS messages from the queue.
2) Is using 5 threads on 1 production system a good approach? Or, instead of Spring Batch inserting the data into the DB directly, should I create a REST service, have Spring Batch call the REST API to insert the data into the DB, and let the API also put the data on the JMS queue? (Again, if Spring Batch processes the file and uses REST to insert data into the DB, it will read 4 or 5 lines per second and call the REST API; note that we have 10 production systems.) If I use the REST API approach, will my system cope (REST can handle huge request volumes behind a load balancer, and JMS can handle huge numbers of messages), or is using threads in the Spring Batch app on 1 production system the better approach?
Different JMS providers have different limits, but in general messaging can easily handle millions of messages in a short period of time.
Messaging is going to be faster than inserting directly into the database, because a message carries very little data to manage (other than JMS properties). Without the overhead of a complete RDBMS or NoSQL database or whatever, messaging outperforms them all.
Assuming the individual lines can be processed in any order, then sending all the data to the same queue and having n consumers working the back end is a sound solution.
Your big bottleneck, however, is getting the data into the database. If the destination table(s) have (m)any keys/indexes on them, there is going to be serious contention, because each insert/update/delete needs to update the indexes; so even though you have n different consumers trying to update the database, they are going to trample on each other as the transactions complete.
One solution I've seen is disabling all database constraints before you start and re-enabling them at the end; hopefully, if things worked, the data is consistent and usable. Of course, the risk is that there was bad data you didn't catch, and now you need to clean up or reattempt the load.
A better solution might be to transform the files into a single file that can be bulk loaded into the database using a platform-specific tool. These tools often disable indexes, constraint checking, and anything else that would slow things down, oftentimes bypassing SQL itself, to get performance.
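Whichever path the rows take into the database, inserting them one statement at a time is the slow way; grouping lines into fixed-size chunks and issuing one JDBC executeBatch() per chunk cuts round trips considerably. A minimal chunking sketch follows; the batch size of 100 is an assumption, and the JDBC calls are indicated only in comments since they need a live connection:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchChunker {

    /** Splits `lines` into chunks of at most `batchSize`, one chunk per executeBatch() call. */
    static List<List<String>> chunk(List<String> lines, int batchSize) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += batchSize) {
            // subList is a view; copy it if the source list will be modified later
            chunks.add(lines.subList(i, Math.min(i + batchSize, lines.size())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 1_005; i++) lines.add("row-" + i);
        // With JDBC, each chunk would map to repeated addBatch() calls on a
        // PreparedStatement followed by a single executeBatch().
        System.out.println(chunk(lines, 100).size() + " batches"); // 11 batches
    }
}
```

The same shape applies whether the consumer writes via JDBC directly or hands a chunk to a REST endpoint: fewer, larger units of work beat many small ones.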
I am building an application that allows multiple concurrent clients to submit queries against data that is scattered over multiple Data Nodes.
There are three tiers to the architecture: clients (e.g. browser-based clients, command-line clients, Java clients) submit requests to a middle-tier query engine. The middle-tier query engine does query parsing and query planning and is responsible for query execution. Query execution involves retrieving data from the data tier (a set of data nodes running in a cluster).
I use Google Protocol Buffers to serialize the query requests and the result sets between the middle tier and the data tier. I use Netty NIO to send the GPB over TCP sockets between Netty clients running on the middle tier and Netty servers running on the data tier (one Netty server per data node).
Each data node has a Netty server to receive the request from the middle tier and respond with results.
Each query running in the middle tier talks to each data node in parallel. I could have N simultaneous queries executing, each query talking to M data nodes.
I am trying to understand how expensive it is to set up and tear down Netty clients. This will help me decide between a couple of different architectural options I am considering for organizing Netty clients in the middle tier.
Option 1: Each query would have its own set of Netty clients to talk to the data nodes. In this option, when I set up the query execution for a given query, I would instantiate M Netty clients (each client talking to the Netty server running on one of the data nodes on behalf of that query). This implies that I have M×N Netty client instances being set up and torn down as queries are submitted and completed. Although this is the simplest approach conceptually, if instantiating Netty clients is reasonably expensive, it would not be feasible. I am worried about generating too much garbage, and a bit concerned about the latency added to each query by setting up M Netty clients.
Option 2: Have one Netty client per data node at the middle tier and share that client between queries. This means M Netty clients would be created when the middle tier starts. As queries are submitted they would share that pool of Netty clients, and each Netty client would need to multiplex the requests and responses of the different queries. This is a more complicated design for the Netty client (it must keep track of which responses correspond to which query) but would generate less garbage and impose little additional latency on the queries.
Does anyone have a sense on how expensive option 1 might be?
As long as you share the NioEventLoopGroup, it is very cheap to set up and tear down clients. That is the main thing to remember.
To create a Netty channel and make a connection attempt, you need the following steps:
1. Create an event loop.
2. Create a bootstrap and configure it.
3. Make the actual connection attempt.
4. Close the channel.
5. Shut down the event loop.
The most expensive steps are 1 and 5. The other steps are very cheap, pretty much as cheap as using the raw NIO API.
The good news is that steps 1 and 5 don't need to be done for every connection attempt. You can create an event loop once and reuse it until your application terminates, performing step 5 just once, when the application shuts down.
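A sketch of that arrangement, assuming Netty 4.x on the classpath (class names differ in Netty 3.x): one shared NioEventLoopGroup is created at startup (step 1, once), each connection gets its own cheap Bootstrap (steps 2–3), and the group is shut down only at application exit (step 5, once). Host, port, and the empty handler body are placeholders:

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioSocketChannel;

public class SharedLoopClients {

    // Step 1, done once: a single event loop group shared by every client connection.
    private static final EventLoopGroup GROUP = new NioEventLoopGroup();

    /** Steps 2-3, per connection: cheap, since the event loop is reused. */
    static ChannelFuture connect(String host, int port) {
        Bootstrap b = new Bootstrap();
        b.group(GROUP)
         .channel(NioSocketChannel.class)
         .handler(new ChannelInitializer<SocketChannel>() {
             @Override
             protected void initChannel(SocketChannel ch) {
                 // placeholder: add protobuf codecs / query handlers here
             }
         });
        return b.connect(host, port);
    }

    /** Step 5, done once, at application shutdown. */
    static void shutdown() {
        GROUP.shutdownGracefully();
    }
}
```

With this shape, option 1 from the question (fresh clients per query) stays affordable, because only the cheap per-connection steps are repeated.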
In WAS, when I create a data source, I can edit the connection pool properties (number of active connections, maximum number of active connections). Now if I say max = 20, and 1000 user requests come in to the WAS, and each request runs in its own thread and each thread wants a connection, in essence I am reduced to 20 parallel threads.
Is this right? Because a connection object cannot be shared between threads.
I ask because most of the time I see this parameter set to a max of 20-30, when clearly the peak number of simultaneous requests to the server is well over a thousand. It seems we can service only 20 requests at a time?
Not really. Connection pooling eliminates the overhead of creating and closing connections by reusing them across database accesses.
If you have a thousand requests and a pool with maxSize 20, then 20 database accesses will be performed in parallel, and as each request releases its connection, the same connection is reused to serve another request. This assumes that each database access lasts a limited period of time (short operations) and that, once data is fetched / inserted / updated, the connection is released for another request.
On the other hand, a request that cannot obtain a database connection because the pool is fully in use (say, the 21st request) will wait until some connection is released, so the condition is transparent to the client. All of this presumes the code requests and releases connections from the pool efficiently. The timeout for waiting requests is also a configurable property of the pool.
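The waiting behaviour can be sketched with a plain Java Semaphore standing in for the pool: 20 permits play the role of maxSize 20, and tryAcquire with a timeout plays the role of the pool's configurable connection timeout. The numbers are assumptions for illustration:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class PoolWaitDemo {

    static final int MAX_CONNECTIONS = 20;                       // hypothetical pool maxSize
    static final Semaphore pool = new Semaphore(MAX_CONNECTIONS, true); // fair: waiters served in order

    /** Borrow a "connection": waits up to timeoutMs, like the pool's connection timeout. */
    static boolean borrow(long timeoutMs) throws InterruptedException {
        return pool.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
    }

    /** Return the "connection" so a waiting request can proceed. */
    static void release() {
        pool.release();
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < MAX_CONNECTIONS; i++) borrow(0);      // 20 requests get connections
        System.out.println("21st request gets one? " + borrow(50)); // false: it waited, then timed out
        release();                                                // one request finishes...
        System.out.println("after a release: " + borrow(0));      // ...and the next request proceeds
    }
}
```

This is why a max of 20-30 can serve thousands of requests: each request holds a connection only briefly, and the rest simply queue for a short, bounded time instead of failing.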
You can tune these values to get the most out of the pool, and you can also consider alternatives to avoid exhausting it with repetitive queries (e.g. the same SELECT for every request), such as database caching (e.g. Open Terracotta).
Hope this helps!