How to debug a Spring Boot application that is not starting

Spring lists SO as the only place to ask questions on their community page, which is why I ask this rather generic question here. It may not be the best fit for SO, but, according to Spring's community overview page, there's no other adequate place to ask such questions.
I have a Spring Boot application built on Spring Cloud Gateway (version 2) which also uses an embedded Hazelcast cluster. It runs as multiple instances, which communicate via Hazelcast. Everything works fine, except under heavy load: if one instance fails, restarting it is no longer possible.
When the instance is restarted while the cluster of instances is under heavy load, it starts creating and wiring beans up to some point, after which it does nothing Spring-related anymore. Hazelcast-generated messages are still visible in the log past that point (with root log level DEBUG), but nothing generated by Spring or the application itself.
In order to restart that one instance that failed, I need to stop the load generation, wait some 10-15 minutes, then restart the failed instance. Then the new/restarted instance starts up rather quickly, with no problems at all.
The load consists of http requests which get proxied to another application, and is of such nature that it generates a lot of read accesses to hazelcast's distributed storage, but very few writes.
My problem: I have no idea how to debug this. Since the HTTP endpoint never becomes available, there is no way I can query metrics or other actuator information.
So my question is: what tools or mechanisms can I employ to debug this problem? That is, how can I find out exactly how the boot sequence differs when the other instances of the Hazelcast cluster are under heavy load versus when there is no load at all in the cluster? Once I have this information, the problem is narrowed down enough for me to investigate it further on my own.

I didn't find a way to debug the problem, but I had an idea of what might be causing it, tried it, and it turned out to be the fix.
My application was running as a Kubernetes deployment. A few beans inside the application relied on a usable Hazelcast CP subsystem during their initialization. Spring's bean initialization process is by necessity sequential and blocking, to account for inter-bean dependencies.
I hypothesized that under heavy load, for whatever reason, the initialization of those beans was blocking forever. As a first experiment, I made that initialization code asynchronous, so that Spring could finish bean wiring even though the instance could not perform useful work until the async part had also finished, just to see whether that was the problem.
To my surprise, that fully fixed the problem. Spring finished bean wiring, the Hazelcast-dependent initialization also finished rather quickly when executed asynchronously, even under high load, and the instance became usable soon after being started.
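Purely as an illustration of that workaround (not the poster's actual code), here is a minimal sketch, assuming an injected HazelcastInstance and Hazelcast 4.x-style CP APIs; the bean name, counter name, and readiness flag are made up:

import java.util.concurrent.CompletableFuture;

import javax.annotation.PostConstruct;

import org.springframework.stereotype.Component;

import com.hazelcast.core.HazelcastInstance;

@Component
class CpDependentInitializer {

    private final HazelcastInstance hazelcast;
    private volatile boolean ready;

    CpDependentInitializer(HazelcastInstance hazelcast) {
        this.hazelcast = hazelcast;
    }

    @PostConstruct
    void init() {
        // Previously this work ran synchronously inside bean initialization and
        // could block forever under load; running it async lets Spring finish
        // wiring the rest of the context.
        CompletableFuture.runAsync(() -> {
            // Illustrative CP-subsystem call that blocks until the CP group responds.
            hazelcast.getCPSubsystem().getAtomicLong("startup-counter").incrementAndGet();
            ready = true;
        });
    }

    // Callers (e.g. a health indicator) can check this before serving real work.
    boolean isReady() {
        return ready;
    }
}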
I didn't have the time to dig deeper to find out what the precise failure mechanism was. What I believe might have been the problem is the interaction between Hazelcast and Kubernetes. Kubernetes-based discovery works through a Kubernetes service, and a pod/instance isn't added to the service until it becomes healthy. If a bean inside the application blocks startup, the instance never becomes healthy and is never added to the service. As such, discovery never finds the new/restarted instance. I don't know what effect this might have on the inner workings of the Hazelcast cluster.

Related

Reusing expensive beans in Spring Boot Tests

I am trying to improve performance of medium tests in Spring Boot.
I am using the Spring Boot - testcontainers library.
For an individual test this works really well: with a few annotations I can get access to kafka, zookeeper, and schema-registry. These are full services, so it takes a few seconds to start everything up; all together, setup takes about 40 seconds. The test accurately recreates a realistic deployment, and it's beautifully simple.
This would be fine if it just happened once, but it happens every time a Spring context is created. That means every test that uses @MockBean incurs that 40-second cost.
I've tried refactoring into a single TestConfiguration class and referencing that. I've looked into using ContextHierarchy but I think that means I'll lose all of the Spring Boot niceties and I'll need to recreate the context (which means it won't look exactly like the context created by the production app).
Is there a better way to do this?
The Spring Framework already takes care of this scenario.
There is a concept of caching the application context for a test class or classes.
See the documentation.
A few lines from the documentation:
The Spring TestContext framework stores application contexts in a static cache. This means that the context is literally stored in a static variable. In other words, if tests run in separate processes, the static cache is cleared between each test execution, which effectively disables the caching mechanism.
So essentially you need to structure your code or context configuration in such a way that the cached context is reused by your desired test cases.
But use this capability wisely: if not thought through properly, it can lead to undesired side effects.
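As a rough sketch of how that caching can be exploited (assuming plain Testcontainers with JUnit 5 rather than the exact library from the question; the class, image, and property names are illustrative), a shared abstract base class keeps the context configuration identical across test classes, so the TestContext framework can reuse one cached context:

import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.DynamicPropertyRegistry;
import org.springframework.test.context.DynamicPropertySource;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.utility.DockerImageName;

@SpringBootTest
public abstract class AbstractIntegrationTest {

    // Started once per JVM and shared by every test class that extends this base.
    static final KafkaContainer KAFKA =
            new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.4.0"));

    static {
        KAFKA.start();
    }

    @DynamicPropertySource
    static void containerProperties(DynamicPropertyRegistry registry) {
        registry.add("spring.kafka.bootstrap-servers", KAFKA::getBootstrapServers);
    }
}

Note that anything that changes the merged context configuration, including different @MockBean declarations, still produces a separate context, so test classes that declare different mocks will keep paying the startup cost.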

Which JerseyClientBuilder to use?

So I'm developing an application on top of Dropwizard, and in one component I have to call an external REST service. I want to do it using JerseyClient.
Now there are two implementations available, one from Dropwizard and one from Jersey. Using the builder, I have to choose between
io.dropwizard.client.JerseyClientBuilder.JerseyClientBuilder
and
org.glassfish.jersey.client.JerseyClientBuilder
Now the former requires an Environment as well as a JerseyClientConfiguration to be passed to it.
Is there a good reason for using the dropwizard implementation over the vanilla one? What's the difference (except the timeout)?
Thank you
After some more digging and asking around, I got an answer (below). It seems that it is better to use the JerseyClientBuilder that comes with Dropwizard, as it is better integrated:
One can use JerseyClientConfiguration in order to configure the JerseyClientBuilder via the application configuration (service.yml file).
Passing the Environment information enables use of the managed thread pool that is integrated within Dropwizard's lifecycle, so when Dropwizard gets shut down, so does the client.
Dropwizard's Metrics get integrated into each client so you can see the latency and rate of calls for each one, as well as metrics around the thread pool sizes.
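A hedged sketch of what the Dropwizard-integrated variant looks like (the configuration class, YAML key, and client name below are illustrative, and import paths may differ slightly between Dropwizard versions):

// ExampleConfiguration.java - exposes the client settings from service.yml.
import com.fasterxml.jackson.annotation.JsonProperty;
import io.dropwizard.Configuration;
import io.dropwizard.client.JerseyClientConfiguration;

public class ExampleConfiguration extends Configuration {

    @JsonProperty("jerseyClient")
    private JerseyClientConfiguration jerseyClient = new JerseyClientConfiguration();

    public JerseyClientConfiguration getJerseyClientConfiguration() {
        return jerseyClient;
    }
}

// ExampleApplication.java - builds a client that is managed and instrumented.
import io.dropwizard.Application;
import io.dropwizard.client.JerseyClientBuilder;
import io.dropwizard.setup.Environment;
import javax.ws.rs.client.Client;

public class ExampleApplication extends Application<ExampleConfiguration> {

    @Override
    public void run(ExampleConfiguration config, Environment environment) {
        // Timeouts, thread pool sizes, etc. come from service.yml; the client is
        // tied to Dropwizard's lifecycle and metrics registry via the Environment.
        final Client client = new JerseyClientBuilder(environment)
                .using(config.getJerseyClientConfiguration())
                .build("external-rest-client");
        // register resources that use the client ...
    }
}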

Need to Improve Startup Speed and Resource Usage on a Spring-WS Web Service

I have a Spring-WS web service that has three issues:
Slow startup time
Slow generation of the dynamic WSDL
Heavy usage of PermGen (app has to be 1.6 compatible)
Currently, the spring-ws-servlet.xml file has several <context:component-scan> elements for autowired dependencies. Two of these scan nearly everything in two external libraries containing Hibernate DAO and Entity classes. Similarly, the Hibernate session factory bean scans a large number of entities from these two libraries.
So, my questions:
Obviously, we would see at least some performance improvement by limiting the scope of the <context:component-scan> elements. But really, would it be that much?
Similarly, would I see improvements by limiting the scope of what Entities are scanned by the session factory?
Making these changes will NOT be a quick process (alter code, test, etc). Therefore, if anyone can add their wisdom, I would greatly appreciate it.
Actually, I am developing a Spring-WS application on Google Cloud and I also had the same problem with slow startup time. The biggest difference I noticed came from moving to AspectJ compile-time weaving using the aspectj-maven-plugin. If you haven't done this yet, try it. The results may vary depending on your code and deployment environment. On the cloud, every file operation is much slower, so this may be the reason it worked so well for me.
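As a footnote to the original question: the kind of narrowing it asks about might look roughly like this in Java config (purely an assumption for illustration; the poster's actual setup is XML, and the package names and Hibernate support class here are made up; the same idea applies to the <context:component-scan> base-package and the session factory's packagesToScan):

import javax.sql.DataSource;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.orm.hibernate5.LocalSessionFactoryBean;

@Configuration
// Scan only the packages that actually contain endpoints and DAOs, instead of
// everything in the two external libraries.
@ComponentScan(basePackages = {
        "com.example.ws.endpoint",
        "com.example.dao.orders"
})
public class AppConfig {

    @Bean
    public LocalSessionFactoryBean sessionFactory(DataSource dataSource) {
        LocalSessionFactoryBean factory = new LocalSessionFactoryBean();
        factory.setDataSource(dataSource);
        // Map only the entities this service really needs, rather than every
        // entity in the shared libraries.
        factory.setPackagesToScan("com.example.domain.orders");
        return factory;
    }
}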

EJB3.1 Remote invocation - is it distributed automatically? is it expensive?

I'm building a JEE6 application with performance and scalability in the forefront of my mind.
Business logic and the JPA2 facade are held in stateless session beans (EJB 3.1). As of right now, the SLSBs implement only @Remote interfaces. When a bean needs to access another bean, it does so via RMI.
My reasoning behind this is the assumption that, once the application runs on a bunch of clustered application servers, the RMI-part allows the execution to be distributed across the whole cluster automagically.
Is that a correct assumption?
I'm fine with dealing with the downsides of that (objects lose their entityManager session, pass-by-value), at least I think so. But I am wondering whether constant remote invocation isn't adding more load than necessary.
The EJB specification doesn't specify how clustering should be achieved, so this will depend on the particular implementation used. Actually, the EJB specifications are purposely written so as not to make assumptions about the deployment: they don't mandate any support for clustering, but are written in a way that makes it possible (and a lot of restrictions in the EJB model stem from potential clustering issues, e.g. access to the file system). The implementer is then free to support clustering or not, and still comply with the spec.
In Glassfish, the reference to the remote EJB does the distribution itself. See my answer here for more information. Each request could potentially be dispatched to a different node. That's probably the way most implementations work. So I would say your assumption is correct.
I do hope, however, that they optimize the case where one EJB calls another EJB and try to dispatch the invocation to the same node whenever possible. That will depend on whether the deployment is homogeneous or not (all nodes have the same beans, or not). Again, the spec is a bit vague regarding such points. But I guess that most deployments are homogeneous in practice: the same EAR is deployed on all nodes.
Regarding the performance overhead of remote vs. local calls, I did some measurements once (on Glassfish); see my answer here. Inter-EJB calls within the same .ear through a remote interface were about 3x slower than local calls. That sounds big, but we are speaking of milliseconds, so the relative overhead depends on what the methods really do. I don't know the performance of other app servers.
Hope it helps.
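To make the local-vs-remote trade-off concrete, here is a hedged sketch (not from the thread; all names are made up) of a stateless bean that exposes both a remote business interface for clustered clients and a local one for co-located callers, which avoids the pass-by-value cost of remote calls within the same node:

import javax.ejb.Local;
import javax.ejb.Remote;
import javax.ejb.Stateless;

// In a real project each type would live in its own file.
@Remote
interface OrderServiceRemote {
    void placeOrder(String orderId);
}

@Local
interface OrderServiceLocal {
    void placeOrder(String orderId);
}

@Stateless
class OrderServiceBean implements OrderServiceRemote, OrderServiceLocal {

    @Override
    public void placeOrder(String orderId) {
        // Business logic. Co-located beans inject OrderServiceLocal (pass-by-
        // reference, no serialization); remote clients look up the @Remote view.
    }
}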

How to know the line of a bug in a spring generated bean?

I've got a website built with Spring and JPA (via Hibernate). I've got a bug and I don't know how to identify the line where it occurs.
I can't debug it in my IDE because it's the live version (everything runs fine locally).
I've got a log which says:
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
at org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(Cglib2AopProxy.java:625)
at com.mycompany.server.rpc.UserService$$EnhancerByCGLIB$$64ed2d4f.createAccount(<generated>)
at com.mycompany.server.rpc.ServiceRPCImpl.createAccount(ServiceRPCImpl.java:309)
My problem is the third line. As the UserService object is handled by Spring, it becomes a proxy and I can't know the line of the bug.
Do you know how to solve the problem ?
Thanks
Is it possible for you to change from CGLIB to JDK proxies? (Spring AOP proxy reference)
Basically: if you access your beans through interfaces, you can use JDK proxies (Spring's default mechanism), thereby leaving the underlying object intact and gaining access to line numbers in stack traces.
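A minimal sketch of that approach, assuming the service can be put behind an interface (names are illustrative; in the XML setups of that Spring era the equivalent switch is the proxy-target-class flag):

import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.EnableAspectJAutoProxy;
import org.springframework.stereotype.Service;

// With an interface in place, Spring can use a JDK dynamic proxy instead of a
// CGLIB subclass, so stack traces point at UserServiceImpl.java line numbers.
interface UserService {
    void createAccount(String email);
}

@Service
class UserServiceImpl implements UserService {

    @Override
    public void createAccount(String email) {
        // real implementation; frames in a stack trace reference this file
    }
}

@Configuration
@EnableAspectJAutoProxy(proxyTargetClass = false) // prefer JDK proxies over CGLIB
class AopConfig {
}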
I would say that not being able to reproduce this locally is a significant constraint. I would try to set up your local environment or a test server to reproduce the problem, using JMeter or other load-testing software to simulate concurrent user access. Once this is done, your tweak/compile/test cycle becomes a lot shorter, and you can make experimental changes without fear of disrupting service on your production server. It may seem like a lot of effort, but the work will pay dividends not just for this bug, but for bugs you may encounter in the future.
It sounds like it could be a threading bug, especially since Spring uses singleton scope by default. With that in mind, look into creating multithreaded integration tests for the service that is failing. Once you have reproduced the bug through load testing, you can verify that it's a threading bug by making your main service method synchronized, preventing concurrent use. If the bug disappears, it is most likely a concurrency bug.
As to finding the line of the bug - there is no line to look for since the code is generated. The best you can do is to add defensive checks in all beans that are being used in the advice around the UserService. (E.g. check for null values due to missing injections.) The init-method attribute on beans is useful for performing checks that the bean has been fully constructed and all required collaborators have been set.
If you cannot reproduce the issue in your local environment, then maybe it is an environment- or network-related issue. I would first recreate the issue in a test environment (one closer to the production environment, not just your own local machine) to debug it.
You may also use Fiddler to debug network-related issues in the live version.

Resources