Hi. I'm having a pretty nasty issue with the app I'm responsible for. The app was running on OSGi/Karaf + Spring + Apache Camel 2.14.1. I removed OSGi/Karaf, upgraded Spring, and moved everything to Spring Boot. The Camel version was upgraded to 2.24.1. Since then we have started to see occasional performance issues in prod. I wasn't able to see the logs until yesterday, but what I saw there confused me a lot: at some point the app stopped processing files for one of the routes, while the second route kept working fine...
And it lasted for almost 2 hours. Yes, we have a pretty high load on it, but it was always like that. I just can't figure out what exactly could possibly cause that.
A few small details though... The files which this failing route is monitoring are put into the folder through a symbolic link. So if the monitored folder is /test, there is a link to it in $HOME/test (for a different user), and another process connects through SFTP and puts files into that folder (it's not a problem with that process; I'm 100% sure the files were on the server, but this thing just didn't see them).
To be honest, I have no idea where to dig. The server is pretty old, the filesystem is pretty fragmented, and disk usage is at 100% (but our admins don't think it's related, plus it was like that before and there were no issues). Java, by the way, changed from 1.6 to 1.8. I also checked the memory, and it doesn't seem like there is an issue: one full GC during 12 hours, and minor collections are not that frequent. I would really appreciate any thoughts... Thank you very much!
I have a few weeks before the next project and I'm looking to streamline our development process to give the UX and dev guys the shortest lead time from change to validation (e.g. 10 seconds for a Java change, 1 second for UX/JS changes).
Basically, I want what John Lindquist shows in this video (real-time feedback with WebStorm and the Angular todo-list example in 3 minutes), but with Tomcat & Spring.
I've been researching/playing with this for the last few days with our stack (Tomcat 8, IntelliJ 13, Spring 4, Angular) and I am just not 'getting it', so I think it is my lack of knowledge in this area and I'm missing something (hence the SO question).
What I have achieved to date for UX Guys
Grunt (using node) serves up the 'static resources' (JS/SCSS/templates) and livereload refreshes Chrome. This is very good and very close to what I want (real-time feedback from SCSS/JS/HTML changes), but the big problem is that node is serving the static resources and not TC, so we hit cross-origin policies (solved via this and this). Rebuilds in IntelliJ become messy with Grunt involved. I looked at SCSS compiles with file watchers, but it is not gelling. In short, I did not get Grunt serving the static content and TC serving the REST API working in harmony. Another option was this guy, who updates the TC resources with Grunt when a file changes, but I just don't want to go there.
This leads me back to file watchers, JetBrains Live Edit (what the WebStorm video shows) and IntelliJ. Again, I'm close when it comes to static content, as IntelliJ can update the resources on TC on frame deactivation. But (and it's a big but) this is NOT real time, and when you change the resource structure you need to refresh the page. We are working on a SPA which loses context on refresh, which slows the guys down, as they have to replay sequences to get back to where the change happened. Also, when using IntelliJ they have to 'frame-deactivate' to get the changes pushed to TC (they are on dual monitors, so tabbing off IntelliJ is the same as pushing a button to deploy the changes).
The best to date is Grunt, accepting the same-origin issues for development. But am I missing something for the UX guys?
What I have achieved to date for Dev Guys
Before we start: we can't afford JRebel, and I haven't got Spring Loaded working with IntelliJ and Tomcat (yet).
At this stage we simply have TC refreshed by IntelliJ when classes change, and restart when bean definitions/method structure change. Bad, I know, but 'it is what we are used to'.
Looking at Spring Boot - promising, but ideally I would not like to give the configuration freedom away; it does give live updates on the server, I believe.
Grails is out at the moment, so we can't benefit there.
I know Play allows some real-time updates but, again, I haven't looked at this in detail and it's a big shift from the current stack.
Summary
On the development side we will likely stick to Live Edit and accept the refresh/deactivation issue, so we can't 'achieve' what John Lindquist shows in WebStorm, i.e. real-time updates when resources change when using Tomcat/IntelliJ/Chrome - or at least 'I don't know' how to achieve this?
Server side, I'm still working on this. I'm going to continue looking at Spring Loaded and IntelliJ integration, then look at JRebel and see what budget, if any, we can get. But in the meantime, are there any alternatives? I see the node/Ruby/Grails guys getting it all, so I believe it must be me, and I'm missing the best setup to get super-fast feedback from our code changes when using Tomcat & Spring.
In short: yes, it is possible, and I have achieved what I set out to achieve - all developmental changes in a Java EE platform (including JS/SCSS changes and Spring/Java changes) happening in 'real time' (5-10 seconds server side, 2 seconds UX). I have recorded a little video showing it all in action (please excuse the lack of dramatics).
Stack:
AngularJS
Grunt - serving up static pages with an HTTP proxy for /service context calls. The proxy is needed for two reasons: 1) to get around the cross-origin issues and 2) so that real-time static resource changes (HTML/JS/SCSS) are shown in Chrome. You can't do this with Tomcat, as the resources are copied to the web-app folder in TC and are not served directly from source (IntelliJ can redeploy on frame deactivation, but that doesn't work well and it doesn't allow for instant changes to be reflected in Chrome).
Grunt monitors SCSS changes (I believe you could use file watchers in IntelliJ instead, but I have Grunt serving the static content).
Live Edit updates Chrome on the fly.
JRebel for Spring/server-side changes to Tomcat (licence required for commercial use).
The subtle but important thing is what Grunt is doing..
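To illustrate what Grunt is doing here, a minimal Gruntfile sketch of this kind of setup might look like the following. Treat it as an assumption-laden sketch, not my exact config: the ports, paths, task names, and the use of the grunt-connect-proxy plugin are all mine, not from the project described above.

```javascript
// Gruntfile.js - hypothetical sketch: serve static sources directly
// and proxy /service REST calls through to Tomcat to avoid
// cross-origin issues. Ports and paths are assumptions.
module.exports = function (grunt) {
  var proxySnippet = require('grunt-connect-proxy/lib/utils').proxyRequest;

  grunt.initConfig({
    connect: {
      server: {
        options: {
          port: 9000,
          base: 'src/web',  // serve HTML/JS/CSS straight from source
          middleware: function (connect, options, middlewares) {
            middlewares.unshift(proxySnippet);  // route /service to Tomcat
            return middlewares;
          }
        },
        proxies: [
          { context: '/service', host: 'localhost', port: 8080 }
        ]
      }
    },
    watch: {
      livereload: {
        files: ['src/web/**/*.{html,js,scss}'],
        options: { livereload: true }  // push changes to Chrome on save
      }
    }
  });

  grunt.loadNpmTasks('grunt-contrib-connect');
  grunt.loadNpmTasks('grunt-contrib-watch');
  grunt.loadNpmTasks('grunt-connect-proxy');
  grunt.registerTask('serve', ['configureProxies:server', 'connect:server', 'watch']);
};
```

The point of the proxy is that the browser only ever talks to one origin (the Grunt server), so the REST calls to Tomcat don't trip the same-origin policy, while static files are still served from source for instant feedback.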
I think this is a simpler alternative to Ian's solution:
Structure the application in three parts:
REST services (src/main/java) and
Web source files (src/web, don't include them in the final war)
Static files produced by grunt (src/web/dist, and include them in final war at /)
Configure CORS filter (which you need anyway for REST services) [1]
Now the kicker:
Start tomcat (as usual)
Start your angularjs website from src/web (using IntelliJ it's just Debug index.html)
Done -- all your edits to the source html/js files are reflected on the next browser refresh (or use a grunt plugin to make it even more "live")
[1] https://github.com/swagger-api/swagger-core/wiki/CORS
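As a sketch of the CORS filter mentioned in step 2, one option in a Tomcat setup like this is Tomcat's built-in org.apache.catalina.filters.CorsFilter (available since Tomcat 7.0.41). The allowed origin below is an assumption for a local dev server, not a value from the original answer:

```xml
<!-- Hypothetical web.xml fragment: allow a local dev server
     (origin is an assumed example) to call the REST services. -->
<filter>
  <filter-name>CorsFilter</filter-name>
  <filter-class>org.apache.catalina.filters.CorsFilter</filter-class>
  <init-param>
    <param-name>cors.allowed.origins</param-name>
    <param-value>http://localhost:9000</param-value>
  </init-param>
  <init-param>
    <param-name>cors.allowed.methods</param-name>
    <param-value>GET,POST,PUT,DELETE,OPTIONS</param-value>
  </init-param>
</filter>
<filter-mapping>
  <filter-name>CorsFilter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>
```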
I implemented a web application; when the Tomcat service starts it works very quickly, but after a few hours, as more users log in, it gets slow (at around 15 users approx.).
Checking usage statistics: RAM at 20%, CPU at 25%.
Server Features:
RAM 8GB
Processor i7
Windows Server 2008 64bit
Tomcat 7
MySql 5.0
Struts2
-Xms1024m
-Xmx1024m
-XX:PermSize=1024m
-XX:MaxPermSize=1024m
We do not use a separate web server; we publish directly on Tomcat.
Even past midnight the slowness persists (with only 1 user online).
The only solution I have is to restart the Tomcat service, and then the response time is excellent again.
Is there anyone who has experienced this issue? Any clue would be appreciated.
Not enough details provided. Need more information :(
Use htop or top to find memory and CPU usage per process & per thread.
CPU
A constant 25% CPU usage on a 4-core system can indicate that a single-threaded application/thread is running at 100% CPU on the only core it is able to use.
Which application is eating the CPU?
Memory
20% memory is ~1.6GB. That is a bit more than I would expect for an idle server running only Tomcat + MySQL, but -Xms1024m tells the JVM to preallocate 1GB of memory, so that explains it.
Change the Tomcat settings to -Xms512m and -Xmx2048m. Watch Tomcat's memory usage while you throw some users at it. If it keeps growing until it reaches 2GB... then freezes, that can indicate a memory leak.
Disk
Use df -h to check disk usage. A full partition can cause the issues you are experiencing.
Filesystem Size Used Avail Usage% Mounted on
/cygdrive/c 149G 149G 414M 100% /
(If you just noticed from this example that my laptop is running out of space, you're doing it right :D)
Logs
Logs are awesome. Yet they have a bad habit of filling up the disk. Check the logs' disk usage. Are logs being written/erased/rotated properly when new users connect? Does erasing the logs fix the issue? (Copy them somewhere for future analysis before you erase them.)
If not, logs are STILL awesome. They have the good habit of helping you track bugs. Check the Tomcat logs. You may want to set the logging level to debug. What happens last, just before the website dies? Any useful error messages? Are user connections still received and accepted by Tomcat?
Application
I suppose that the 25% CPU goes to Tomcat (and not MySQL). Tomcat doesn't fail by itself; the application running on it must be failing. Try removing the application from Tomcat (you can put a hello-world application there instead). Can Tomcat keep working overnight without your application? It probably can, in which case the fault is in the application.
Enable full debug logging in your application and try to track the issue. Run it straight from eclipse in debug mode and throw users at it. Does it fail consistently in the same way ?
If yes, hit "pause" in the Eclipse debugger and check what the application is doing. Look at the piece of code each thread is currently running, plus its call stack. Repeat that a few times. If there is a deadlock, an infinite loop, or something similar, you can find it this way.
You will have found the issue by now if you are lucky. If not, you're unfortunate, and it's a tricky bug that might be deep inside the application. That can get tricky to trace, but determination will lead to success. Good luck =)
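If you can't attach a debugger on the server, the same "pause and look at every thread" inspection can be done programmatically (or with jstack). This is a minimal self-contained sketch of that idea, not anything from the original answer; the class name is mine:

```java
import java.util.Map;

public class ThreadDumpDemo {

    // Print a stack dump of every live thread, similar to what you see
    // when pausing in the debugger: thread name, state, and call stack.
    public static void dumpAllThreads() {
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            System.out.printf("Thread \"%s\" state=%s%n", t.getName(), t.getState());
            for (StackTraceElement frame : entry.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }

    public static void main(String[] args) {
        dumpAllThreads();
    }
}
```

Calling this from a servlet (or triggering it on a signal) a few times while the app is stuck gives you the same "which code is every thread sitting in" picture the debugger-pause trick does.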
For performance-related issues, we can follow these guidelines:
Set -Xms and -Xmx to the same size for effectiveness.
-Xms2048m
-Xmx2048m
You can also enable PermGen to be garbage collected:
-XX:+UseConcMarkSweepGC -XX:+CMSPermGenSweepingEnabled -XX:+CMSClassUnloadingEnabled
If the page changes too frequently to make this option logical, try temporarily caching the dynamic content so that it doesn't need to be regenerated over and over again. Any technique you can use to cache work that's already been done instead of doing it again should be used; this is the key to achieving the best Tomcat performance.
If there is any database-related issue, then follow SQL query performance tuning.
You can rotate the catalina.out log file without restarting Tomcat.
In detail, there are two ways.
The first, which is more direct, is that you can rotate catalina.out by adding a simple pipe to the log rotation tool of your choice in Catalina's startup shell script. This will look something like:
"$CATALINA_BASE"/logs/catalina.out WeaponOfChoice 2>&1 &
Simply replace "WeaponOfChoice" with your favorite log rotation tool.
The second way is less direct, but ultimately better. The best way to handle the rotation of catalina.out is to make sure it never needs to rotate: simply set the "swallowOutput" property to true for all Contexts in server.xml.
This will route System.err and System.out to whatever logging implementation you have configured, or to JULI if you haven't configured one.
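As a sketch, the Context entry in server.xml would look something like this (the host and application names are placeholders, not from the original answer):

```xml
<!-- server.xml fragment: swallowOutput="true" routes System.out and
     System.err through the configured logging implementation instead
     of letting them accumulate in catalina.out. -->
<Host name="localhost" appBase="webapps">
  <Context path="/myapp" docBase="myapp" swallowOutput="true"/>
</Host>
```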
See more at: Tomcat Catalina Out
I experienced a very slow stock Tomcat dashboard on a clean CentOS 7 install and found the following cause and solution:
Slow start-up times for Tomcat are often related to Java's SecureRandom implementation. By default, it uses /dev/random as an entropy source. This can be slow, as it uses system events to gather entropy (e.g. disk reads, key presses, etc.). As the urandom manpage states:
When the entropy pool is empty, reads from /dev/random will block until additional environmental noise is gathered.
Source: https://www.digitalocean.com/community/questions/tomcat-8-5-9-restart-is-really-slow-on-my-centos-7-2-droplet
Fix it by adding the following configuration option to your tomcat.conf or (preferred) a custom file into /tomcat/conf/conf.d/:
JAVA_OPTS="-Djava.security.egd=file:/dev/./urandom"
We encountered a similar problem; the cause was catalina.out, the standard destination log file for System.out and System.err. Its size kept on increasing, slowing things down, until ultimately Tomcat crashed. The problem was solved by rotating catalina.out. We were using Red Hat, so we made a shell script to rotate catalina.out.
Here are some links:
Mulesoft article on catalina (also contains two methods of rotating):
Tomcat Catalina Introduction
If "catalina.out" is not the problem then try this instead:-
Mulesoft article on optimizing tomcat:
Tuning Tomcat Performance For Optimum Speed
We had a problem which looks similar to yours. Tomcat was slow to respond, but the access log showed just milliseconds per answer. The problem was streaming responses: one of our services returned real-time data that users could subscribe to. The epoll queues were becoming bloated, and network requests couldn't get through to Tomcat. What's more interesting, the CPU was mostly idle (since no one could ask the server to do anything) and the acceptor/poller threads were sitting in WAIT, not RUNNING or IN_NATIVE.
At the time we just limited the number of such requests, and everything became normal.
I am using a modified version of the TaskCloud example to try and read/write my own data.
While testing a deployed version, I've noticed that the round-trip response time is slow.
From my Android device, I have a 100ms ping response to appspot.com.
I have changed the AppEngine application to do nothing (the Google dashboard shows insignificant average latency).
The problem is that the time it takes for HttpClient's client.execute(post) is about 3 seconds.
(This is the time when an instance is already loaded)
Any suggestions would be greatly appreciated.
EDIT: I've watched the Google I/O video showing the CloudTasks Android-AppEngine app, and you can see that refreshing the list (a single call to AppEngine) takes about 3 seconds as well. The presenter says something about performance which I didn't fully get (debuggers running at both ends?).
The video: http://www.youtube.com/watch?v=M7SxNNC429U&feature=related
Time location: 0:46:45
I'll keep investigating...
Thanks for your help so far.
EDIT 2: Back to this issue...
I've used the Shark packet sniffer to find out what is happening. Some of the time is spent negotiating an SSL connection for each server call. Using HTTP (and ACSID) is faster than HTTPS (and SACSID).
new DefaultHttpClient() and new HttpPost() are used for each server call.
EDIT 3:
Looking at the sniffer logs again, there is an almost 2-second delay before the actual POST.
I have also found out that the issue exists on Android 2.2 (all versions) but is resolved on Android 2.3.
EDIT 4: It's been resolved. Please see my answer below.
It's difficult to answer your question since no detail about your app is provided. Anyway, you can try the Appstats tool provided by Google to analyze the bottleneck.
After using the Shark sniffer, I was able to understand the exact issue, and I found the answer in this question.
I used Liudvikas Bukys's comment and solved the problem with the suggested line:
post.getParams().setBooleanParameter(CoreProtocolPNames.USE_EXPECT_CONTINUE, false);
Often the first call to your GAE app will take longer than subsequent calls. You should familiarize yourself with loading and warm-up requests and how GAE handles instances of your app: http://code.google.com/intl/de-DE/appengine/docs/adminconsole/instances.html
Some things you could also try:
make your app handle more than one request per instance (make sure your app is thread-safe!) http://code.google.com/intl/de-DE/appengine/docs/java/config/appconfig.html#Using_Concurrent_Requests
enable the Always On feature in the app admin console (this will cost you)
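For the concurrent-requests suggestion above, the flag lives in appengine-web.xml. A minimal sketch (the application id and version are placeholders):

```xml
<!-- appengine-web.xml fragment: opt in to concurrent requests so one
     instance can serve several requests at once. Only enable this if
     the app really is thread-safe. -->
<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
  <application>your-app-id</application>
  <version>1</version>
  <threadsafe>true</threadsafe>
</appengine-web-app>
```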
This problem appeared today and I have no idea what is going on. Please share your ideas.
I have 1 EC2 DB server (MySQL + NFS file sharing + Memcached).
And I have 3 EC2 web servers (lighttpd) which mount the NFS folders from the DB server.
Everything had been going smoothly for months, but suddenly there is an interesting phenomenon.
Every 8 to 10 minutes, PHP files become unreachable. This lasts about 1 minute and then things go back to normal. Normal files like .html files are unaffected. All servers have the same problem at exactly the same time.
I have spent one whole day analyzing the reason. Finally, I found out that when the problem appears, the number of file descriptors held by lighttpd suddenly increases a lot.
I used ls /proc/1234/fd | wc -l to check the number of fds.
The number of fds is around 250 in normal times. However, when the problem appears, it rises to 1500 and then drops back to normal.
It sounds funny, right? Do you have any idea what's going on?
========================
The CPU graph of one of the web servers:
http://pencake.images.s3.amazonaws.com/4be1055884133.jpg
Thoughts:
Have a look at dmesg output.
The number of file descriptors jumping up sounds to me like something is blocking, including the processing of connections to lighttpd/PHP, which builds up until the blocking condition ends.
When you say the PHP file is unreachable, do you mean the file is missing? Or does the PHP script stall during execution? What do the lighttpd log files say is happening on the calls to this PHP script? Are there any other hints in the lighttpd logs?
What is the maximum number of file descriptors for the process/user?
I and others have had bizarre networking behavior on EC2 instances from time to time. Give us more details on it. Maybe set up some additional monitoring of the connectivity between your instances. Consider moving the problem instance to another instance in the hope of the problem magically disappearing. (Shot in the dark.)
And finally...
A DoS attack? I doubt it - it would be offline or not. It is way too early in the debugging process for you to infer malice on someone else's part.
This might also belong on Server Fault. It's kind of a combo between server config and code (I think).
Here's my setup:
Rails 2.3.5 app running on jruby 1.3.1
Service Oriented backend over JMS with activeMQ 5.3 and mule 2.2.1
Tomcat 5.5 with opts: "-Xmx1536m -Xms256m -XX:MaxPermSize=256m -XX:+CMSClassUnloadingEnabled"
Java jdk 1.5.0_19
Debian Etch 4.0
Running top, every time I click a link on my site I see my Java process's CPU usage spike. If it's a small page, it's sometimes just 10% usage, but on a more complicated page my CPU sometimes goes up to 44% (never above; not sure why). In that case a request can take upwards of minutes, while my server's load average steadily climbs to 8 or greater. This is just from clicking one link that loads a few requests from some services, nothing too complicated. The Java process's memory hovers around 20% most of the time.
If I leave it for a bit, the load average goes back down to nothing. Clicking a few more links, it climbs back up.
I'm running a small Amazon instance for the Rails frontend and a large instance for all the services.
Now, this is obviously unacceptable. A single user can spike the load average to 8, and with two people using it, it maintains that load average for the duration of our using the site. I'm wondering what I can do to inspect what's going on. I'm at a complete loss as to how to debug this. (It doesn't happen locally when I run the Rails app through JRuby, outside the Tomcat container.)
Can someone enlighten me as to how I might inspect my JRuby app to find out how it could possibly be using such huge resources?
Note: I noticed this a little before, seemingly at random, but now, after upgrading from Rails 2.2.2 to 2.3.5, I'm seeing it ALL THE TIME, and it makes the site completely unusable.
Any tips on where to look are greatly appreciated. I don't even know where to start.
Make sure that there is no unexpected communication between Tomcat and something else. I would check first whether:
the ActiveMQ broker communicates with other brokers in your network. By default, the AMQ broker starts in OpenWire auto-discovery mode.
JGroups/multicasts in general communicate with something in your network.
This unnecessary load may result from processing messages coming from another application.