We recently had a production outage where a third-party server was non-responsive. This occurred in a Spring Boot microservice running on EC2 and backed by a MySQL DB.
While waiting for these REST calls to complete, our service started throwing "SpringBootJPAHikariCP - Connection is not available, request timed out after 30000ms." errors.
The thing is, we are pretty careful to avoid keeping a DB transaction open while making a potentially high-latency REST call. Perhaps not careful enough?
We have DataDog monitoring enabled on this microservice, and I am not seeing data that would help us diagnose the problem. Specifically, I would like some visibility into our MySQL DB transactions to see where they are being left open longer than expected, so we can refactor accordingly.
If I look at DD APM for the service, I can see the HTTP request latency stats and the JVM metrics, which show the thread count creeping up, but no specific stats about transactions or connections from the pool.
If I look at DD APM for the DB, I can see the latency for queries, updates, etc., but it all looks ordinary and there is nothing about the transactions themselves.
Is it possible to configure Spring and/or Datadog to visualize transaction time?
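One option that may give this visibility (a sketch, not a confirmed fix): Spring Boot should publish HikariCP pool metrics (hikaricp.connections.*) through Micrometer when Actuator and a MeterRegistry are on the classpath, and Datadog can pick those up via the Micrometer Datadog registry or the Datadog agent. On top of that, Hikari's leak-detection threshold logs a stack trace for any connection held longer than a configured time, which is usually exactly the "transaction left open across a slow REST call" case. A minimal sketch, with the threshold value and bean wiring as assumptions:

```java
// Sketch only: expose the pool as an explicit bean and turn on Hikari's leak
// detection to flag connections (and thus transactions) held too long.
import com.zaxxer.hikari.HikariDataSource;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.boot.jdbc.DataSourceBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class PoolDiagnosticsConfig {

    @Bean
    @ConfigurationProperties("spring.datasource.hikari")
    public HikariDataSource dataSource() {
        HikariDataSource ds = DataSourceBuilder.create()
                .type(HikariDataSource.class)
                .build();
        // Log a warning with a stack trace for any connection checked out longer
        // than 60 seconds (value is an assumption; tune to your REST timeouts).
        ds.setLeakDetectionThreshold(60_000);
        return ds;
    }
}
```

The leak-detection warnings land in the application log, so they can show up in Datadog Logs and be correlated with the APM traces that contain the slow outbound REST spans.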
Related
We have been evaluating Spring WebSocket with STOMP and a broker relay for a full-duplex messaging application that will run on AWS. We had hoped to use Amazon MQ. We are pushing messages to individual users and also broadcasting, so functionally the stack looked good. We have about 40,000 - 80,000 users. We quickly found, with load testing, that neither the Spring stack nor Amazon MQ scales very well. Issues:
- A Spring Cloud Gateway instance cannot handle more than about 3,000 WebSockets before dying.
- A Spring WebSocket server instance can also only handle about 4,000 WebSockets on a t3.medium, when we bypass the Gateway.
- AWS limits ActiveMQ connections to 100 for a small server, and then only 1,000 on a massive server. There is no in-between, which is just weird.
Yes, we have increased the file handles etc. on the machines, so TCP connections are not the limit; there is no way Spring could ever get close to that limit here. We are sending an 18 KB message for load, the maximum we expect. In our results, message size has little impact; it's just the connection overhead on the Spring stack.
The StompBrokerRelayMessageHandler opens a connection to the broker for each STOMP CONNECT, and there is no way to pool the connections. This makes the feature completely useless for any 'real' web application. To support our users, the cost of massive AWS servers for MQ makes this solution ridiculously expensive, requiring 40 of the biggest servers. In load testing, the Amazon MQ machine is doing nothing; with the 1,000 users it is not loaded. In reality, a couple of medium-sized machines are all we need for all our brokers.
Has anyone ever built a real-world solution like the above using the Spring stack? It appears no one has done this, and no one has scaled it up.
Has anyone written a pooling StompBrokerRelayMessageHandler? I assume there must be a reason this can't work, as it should be the default approach. What is the issue here?
It seems this issue makes the whole Spring WebSocket + STOMP + broker approach pretty useless, and we are now forced to use a different approach for message reliability and for messaging across servers where users are not connected (the main reason we were using a broker). We have gone back to using a simple broker and wrote a registry to manage client-to-server locations. So we have now eliminated the broker, and the figures above are with that model. We may then add in AWS SQS for message reliability.
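For reference, a minimal sketch of the simple-broker configuration described above (the endpoint path and destination prefixes are illustrative, not from the original post):

```java
// Sketch only: in-memory simple broker instead of an external STOMP broker relay.
import org.springframework.context.annotation.Configuration;
import org.springframework.messaging.simp.config.MessageBrokerRegistry;
import org.springframework.web.socket.config.annotation.EnableWebSocketMessageBroker;
import org.springframework.web.socket.config.annotation.StompEndpointRegistry;
import org.springframework.web.socket.config.annotation.WebSocketMessageBrokerConfigurer;

@Configuration
@EnableWebSocketMessageBroker
public class WebSocketConfig implements WebSocketMessageBrokerConfigurer {

    @Override
    public void registerStompEndpoints(StompEndpointRegistry registry) {
        registry.addEndpoint("/ws"); // client handshake endpoint
    }

    @Override
    public void configureMessageBroker(MessageBrokerRegistry registry) {
        // enableSimpleBroker(...) in place of enableStompBrokerRelay(...), which is
        // the call that opens one TCP connection to the external broker per client.
        registry.enableSimpleBroker("/topic", "/queue");
        registry.setApplicationDestinationPrefixes("/app");
    }
}
```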
What's left? We were going to use Spring Cloud Gateway to load balance across multiple small WebSocket servers, but it seems this approach will not work, as the WebSocket load a Gateway instance can handle is just way too small; the Gateway just cannot handle it. We are now removing Spring Cloud Gateway and using an AWS load balancer instead, so we can get significantly more connections load balanced. Why does Spring Cloud Gateway not load balance?
What's left? The WebSocket server instances are t3.mediums; they have no business logic and just pass a message between two clients, so they really do not need a bigger server. We would expect considerably more than 4,000 connections, but this is close to usable.
We are now drilling into the issues to get more detail on where the performance bottlenecks are, but the lack of any tuning guides or scaling information does not suggest good things about Spring. Compare this to Node solutions, which scale very well and handle larger numbers of connections on small machines.
The next approach is to look at WebFlux + WebSocket, but then we lose STOMP. Maybe we'll check raw WebSockets?
This is just an early attempt to see if anyone has actually used Spring WebSockets in anger and can share a real, working production architecture, as only toy examples are available. Any help on the above issues would be appreciated.
I am quite new to WebSocket technology. I am building an application that requires real-time communication between users. The messages (chat data) are going to be persisted into Cassandra (a NoSQL DB). The application is going to have at least 800 users online at any given moment. The communication between users needs to be super fast, while persisting data into a DB takes some time (nano, micro seconds, or whatever).
I was thinking that maybe it would be best if I pushed all the messages to be persisted onto a message queue, to be consumed by a RabbitMQ consumer whose main job is to save messages into the DB.
I am still a junior developer. Will be extremely grateful for any suggestion. Thanks.
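For what it's worth, a minimal sketch of that idea, assuming Spring AMQP; the queue name, payload format, and persistence helper are made-up placeholders:

```java
// Sketch only: take chat messages off a queue and persist them, so the real-time
// delivery path never waits on Cassandra.
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class ChatPersistenceConsumer {

    // The WebSocket layer delivers the message to the recipient immediately and,
    // in parallel, publishes a copy to this queue for persistence.
    @RabbitListener(queues = "chat.persist")
    public void persist(String messageJson) {
        saveToCassandra(messageJson);
    }

    private void saveToCassandra(String messageJson) {
        // Hypothetical: map the JSON to an entity and save it via a
        // Spring Data Cassandra repository or the DataStax driver.
    }
}
```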
We have a problem at hand with a long service request that retrieves a huge amount of data and takes around 5 minutes to complete. We are using EJB and native JDBC for the requests. Is there a way to extend the transaction timeout for this one particular request (that is, overriding the timeout configured for the domain's JTA), or do we have to increase the domain's JTA transaction timeout to 5 minutes? The latter seems unfavorable, since it might provoke database deadlocks. Are there any other solutions you might suggest that are more robust and safe? Could we perhaps set the transaction timeout at a different level apart from the domain level? Looking forward to your reply. Thanks.
The JTA timeout can be set at the EJB level. Read this documentation for details.
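The documentation referred to covers the descriptor-based, per-EJB timeout. Purely as an illustration of overriding the timeout for just this one call, a bean-managed transaction can also set it programmatically; the bean and method names below are made up:

```java
// Sketch only: a bean-managed transaction lets this one request run with a 5-minute
// timeout without touching the domain-wide JTA setting.
import javax.annotation.Resource;
import javax.ejb.Stateless;
import javax.ejb.TransactionManagement;
import javax.ejb.TransactionManagementType;
import javax.transaction.UserTransaction;

@Stateless
@TransactionManagement(TransactionManagementType.BEAN)
public class HugeExportBean {

    @Resource
    private UserTransaction utx;

    public void exportHugeDataSet() throws Exception {
        // Applies to transactions begun after this call, in this bean only.
        utx.setTransactionTimeout(300); // seconds
        utx.begin();
        try {
            // ... long-running native JDBC work goes here ...
            utx.commit();
        } catch (Exception e) {
            utx.rollback();
            throw e;
        }
    }
}
```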
Let's say I have several microservices (REST APIs). The problem is: if one service is not accessible (let's call it service "A"), the data that was being sent to service "A" will be saved in a temporary database, and after the service is back up, the data will be sent again.
Question:
1. Should I create a service that pings service "A" every 10 seconds to know whether it is up or not? Or is it possible to do this with a task queue? Any suggestions?
Polling is a waste of bandwidth. You want to use a transactional queue.
Throw all your outbound messages in the queue, and have some other process to handle the messages.
How this will work: after your process reads from the queue and tries to send to the REST service:
If it works, commit the transaction (for the queue)
If it doesn't work, don't commit. Start a delay (minutes, seconds - you know best) until you read from the queue again.
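A minimal sketch of that loop with plain JMS and a transacted session (assuming a JMS 2.0 provider; the connection factory, queue name, delay, and REST call are placeholders):

```java
// Sketch only: nothing leaves the queue until the REST call to service "A" succeeds.
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import javax.jms.TextMessage;

public class OutboundForwarder {

    private final ConnectionFactory connectionFactory; // injected or looked up via JNDI

    public OutboundForwarder(ConnectionFactory connectionFactory) {
        this.connectionFactory = connectionFactory;
    }

    public void run() throws Exception {
        try (Connection connection = connectionFactory.createConnection()) { // JMS 2.0 AutoCloseable
            connection.start();
            // Transacted session: messages are only removed on commit.
            Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
            MessageConsumer consumer = session.createConsumer(session.createQueue("serviceA.outbound"));

            while (true) {
                Message message = consumer.receive(5_000);
                if (message == null) {
                    continue;
                }
                String body = ((TextMessage) message).getText(); // assuming text messages
                if (sendToServiceA(body)) {
                    session.commit();     // success: the message is gone for good
                } else {
                    session.rollback();   // failure: the message goes back on the queue
                    Thread.sleep(30_000); // back off before reading again
                }
            }
        }
    }

    private boolean sendToServiceA(String body) {
        // Real REST call to service "A" goes here.
        return true;
    }
}
```

The rollback simply puts the message back on the queue, so nothing is lost while service "A" is down.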
You can use the Circuit Breaker pattern, e.g. the Hystrix circuit breaker from Netflix.
It is possible to open the circuit breaker based on a timeout, or when the service call fails or the service is inaccessible.
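An illustrative Hystrix command wrapping the call to service "A" (class names and fallback behaviour are assumptions, not from the answer):

```java
// Sketch only: failures and timeouts count toward opening the circuit; while the
// circuit is open, the fallback runs instead of the remote call.
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class CallServiceA extends HystrixCommand<String> {

    private final String payload;

    public CallServiceA(String payload) {
        super(HystrixCommandGroupKey.Factory.asKey("ServiceA"));
        this.payload = payload;
    }

    @Override
    protected String run() {
        // Real REST call to service "A"; an exception or timeout is a failure.
        return restCallToServiceA(payload);
    }

    @Override
    protected String getFallback() {
        // Circuit open or call failed: park the payload for a later resend.
        saveToTemporaryStore(payload);
        return null;
    }

    private String restCallToServiceA(String payload) {
        // Placeholder for the actual HTTP client call.
        return "OK";
    }

    private void saveToTemporaryStore(String payload) {
        // Placeholder for writing to the temporary database mentioned in the question.
    }
}
```

A caller would run it with new CallServiceA(payload).execute(); Hystrix opens the circuit once the configured failure threshold is exceeded.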
There are multiple dimensions to your question. First, you want to consider using an infrastructure that provides resilience and self-healing, meaning you deploy a cluster of containers, all running your service A. You then put a load balancer or API gateway in front of your service to distribute calls/load. It will also periodically check the health of your service; when it detects that a container does not respond correctly, it can kill the container and start another one. This can be provided by a container infrastructure such as Kubernetes, Docker Swarm, etc.
Now, this does not protect you from losing any requests. In the event that a container malfunctions, there will still be a short window between the failure and the next health check where requests may not be served. In many applications this is acceptable, and the client side will just re-request and hit another (healthy) container. If your application cannot afford to lose requests, you will have to cache the request in, for example, an API gateway and make sure it is kept until a service has completed it (also called Circuit Breaker). An example technology would be Netflix Zuul with Hystrix. Using such a gatekeeper with built-in fault tolerance can increase the resiliency even further. As a side note, using an API gateway can also solve issues with central authentication/authorization, routing, and monitoring.
Another approach to add resilience and decoupling is to use a fast streaming/message queue, such as Apache Kafka, to record all incoming messages and have a message processor process them whenever it is ready. The trick then is to only mark a message as processed when the request has been served fully, as sketched below. This can also help in scenarios where faults occur because a large number of requests cannot be handled in real time by the service (asynchronous decoupling with a cache).
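A hedged sketch of that "commit only after the request was fully served" pattern with the plain Kafka consumer API (topic, group, and the forwarding call are made-up placeholders):

```java
// Sketch only: consume buffered requests and commit offsets one record at a time,
// strictly after service "A" has accepted the request.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ServiceAForwarder {

    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "service-a-forwarder");
        props.put("enable.auto.commit", "false"); // we decide when a message counts as processed
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("service-a-requests"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    TopicPartition partition = new TopicPartition(record.topic(), record.partition());
                    if (forwardToServiceA(record.value())) {
                        // Mark only this record as processed.
                        consumer.commitSync(Collections.singletonMap(
                                partition, new OffsetAndMetadata(record.offset() + 1)));
                    } else {
                        // Service "A" is down: rewind to the failed record, back off, retry later.
                        consumer.seek(partition, record.offset());
                        Thread.sleep(10_000);
                        break;
                    }
                }
            }
        }
    }

    private static boolean forwardToServiceA(String body) {
        // Real REST call to service "A" goes here.
        return true;
    }
}
```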
Service "A" should fire a "ready" event when it becomes available. Just listen to that and resend your request.
I do not have much experience with Oracle Service Bus, and I am trying to design a logging solution that feeds a BigData store.
From what I have read, the default Log and Report activities in OSB put the data into the domain's server log file or into the database that was set up with the server domain. If I want to put all the logs into a separate BigData database, I will need one of these approaches:
Java callout: use JMS or some other technology to send the data to the BigData server.
Web service callout: create a separate web service to handle the logging.
Create a custom report provider to replace the default one in OSB Reporting.
Something else.
Please give me some ideas about which method I should use, and please provide your reasons if you can. Thank you so much.
Isn't the logging framework in WebLogic based on Log4j? That means you can use a JMSAppender (probably prudent to wrap it in an async Log4j appender if you can) and handle it however you want.
Or, if you're talking about the OSB Reporting framework, there are a few options:
Configure the default JMS reporting provider (which uses the underlying SOAINFRA database, which hopefully is set up to be something better than the default Derby instance), then write an MDB that pulls reports off the queue and inserts them into SAS BigData.
Turn the JMS provider off and use a custom provider, which can do anything you want. If you want, you can still do a two-step process, where the reporting provider itself puts reports on a JMS queue so it returns quickly, and a different MDB pulls messages off and persists them at its own pace (an MDB along these lines is sketched below).
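An illustrative MDB for either option above; the queue JNDI name and the persistence helper are assumptions, and the destination binding is container-specific:

```java
// Sketch only: drain report messages from a JMS queue and persist them to the
// BigData store at the consumer's own pace, off the OSB request threads.
import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.TextMessage;

@MessageDriven(activationConfig = {
        @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue"),
        @ActivationConfigProperty(propertyName = "destinationLookup", propertyValue = "jms/osbReportQueue")
})
public class ReportSinkMDB implements MessageListener {

    @Override
    public void onMessage(Message message) {
        try {
            if (message instanceof TextMessage) {
                insertIntoBigData(((TextMessage) message).getText());
            }
        } catch (Exception e) {
            // Rethrow so the container can redeliver or dead-letter per its retry policy.
            throw new RuntimeException(e);
        }
    }

    private void insertIntoBigData(String report) {
        // Hypothetical call into the BigData store's client API.
    }
}
```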
I do not recommend a web service or database callout without an async step in the middle, because you need logging and reporting to be very quick and to use as few resources for as short a period as possible.
You don't want logging to hog threads while you're experiencing load. I have seen entire buses brought down because of one hiccup, because the logging database suffered a performance blip, which caused a bunch of open threads trying to log to it, which caused thread starvation or timeouts, which caused more error logging...
If you have a buffer like a JMS queue, then you can handle peaks by planning ahead. You can say, "Actually, I want a JMS queue of 10,000 messages, and if that overflows for whatever reason, I want to (push the overflow to a separate queue on this other box) or (filter out all the non-essential messages) or (throw new messages away) or (action of your choice). Oh yeah, and if the logging database fails, I will try 3 times to commit and, if not, move it to this other queue." Or whatever you want.
There are multiple ways to achieve this. You could use the Report activity to push to JMS, or use the Log activity.
You can also write a small routine such as this (either on OSB or outside it) that reads anything you are logging (via the Log activity, but also the additional metadata that is logged when you turn on monitoring of OSB components) and does whatever is needed with it (such as pushing it to a database or BigData store).
The key is to avoid writing an explicit service call in each pipeline/flow; the above approach(es) use the standard OSB/ODL* loggers.
*Oracle Diagnostic Logging