WebSphere - dump generation: system signal vs. server script

I am looking for an explanation of the differences between the methods for generating thread and heap dumps.
What I know so far:
a system signal, e.g. kill -3, immediately triggers the creation of both (thread and heap dump)
the script shipped with Liberty runs a Java agent that does some magic and generates customizable output: a thread dump alone, or together with a heap dump, a core dump, or both
server javadump myserver --include=thread,heap,system
https://www.ibm.com/support/knowledgecenter/SSEQTP_liberty/com.ibm.websphere.wlp.doc/ae/rwlp_command_server.html
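To offer either method to developers in an automated way, a small wrapper can expose both. This is a minimal Python sketch, assuming a POSIX host where `kill -3` sends SIGQUIT and the Liberty `server` script is on the PATH; the function names and the `ALLOWED_INCLUDES` set are hypothetical, with the values taken from the command above.

```python
import os
import signal
from typing import List

# Values for --include taken from the command above; treat this set as an
# assumption and check the Liberty documentation for your version.
ALLOWED_INCLUDES = {"thread", "heap", "system"}


def signal_dump(pid: int) -> None:
    """Send SIGQUIT (the signal `kill -3` sends) to the JVM with this PID."""
    os.kill(pid, signal.SIGQUIT)


def javadump_command(server: str, include: List[str]) -> List[str]:
    """Build the argv for `server javadump` (hypothetical helper, not executed here)."""
    unknown = set(include) - ALLOWED_INCLUDES
    if unknown:
        raise ValueError(f"unsupported --include values: {sorted(unknown)}")
    return ["server", "javadump", server, "--include=" + ",".join(include)]
```

Usage: `subprocess.run(javadump_command("myserver", ["thread", "heap"]), check=True)`. One practical difference for automation: `os.kill` returns as soon as the signal is delivered, without waiting for any dump file to be written.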
So my questions are:
What's better, and why?
Is there any difference in the generated dumps?
Which one would you use to provide an exposed, automated way of creating dumps (e.g. for developers)?
Does anyone have experience with the previous point? I would highly appreciate your ProTips.
...and anything else you think is worth mentioning here.
PS
What I've noticed: if I send the system signal multiple times in a row, nothing hangs, and the number of generated dumps equals the number of attempts. The same happens if I do it with the script-based solution (though of course it takes longer).
...but if I run kill -3 <PID> ; server javadump myserver --include=thread,heap, the server hangs and no dumps are generated; this state is unrecoverable without a restart. (I haven't spent much time on this behaviour, so it could be a failure unrelated to the commands performed.)
Thank you and best regards!
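If dumps are exposed to developers through a single service, one defensive option is to let only one dump request run at a time, so a signal and a script call can never overlap. This is a hedged Python sketch: it is not established that overlapping requests caused the hang described above, the helper name is hypothetical, and the trigger passed in must block until its dump is finished (a bare `os.kill` does not).

```python
import threading
from typing import Callable

# One process-wide lock: only dump requests routed through this helper are
# serialized; a manual `kill -3` from a shell bypasses it.
_dump_lock = threading.Lock()


def run_dump_serialized(trigger: Callable[[], None], timeout_s: float = 60.0) -> bool:
    """Run `trigger` while holding the dump lock.

    Returns False without running it if another dump still holds the lock
    after `timeout_s` seconds.
    """
    if not _dump_lock.acquire(timeout=timeout_s):
        return False
    try:
        trigger()
        return True
    finally:
        _dump_lock.release()
```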

Related

How can I troubleshoot a hung thread in WebSphere Application Server?

I see hung threads in WebSphere Application Server. How can I troubleshoot this problem? What documentation should I send to the application developer?
Thanks.
The most important thing is the thread stack - that should show up with the message indicating the hung thread, and it'll tell you what that thread was doing.
That, on its own, might not be enough, particularly if that thread is waiting on some other thread. In that case, you might need a thread dump. That can be triggered with "kill -3" against the process ID on non-Windows systems (I'd have to do more research to tell you the equivalent process on Windows, although there are tools that can simulate "kill -3"), and the server also can be configured to do that when it detects a hung thread, using the JVM system property com.ibm.websphere.threadmonitor.dump.java (set to either "true" or an integer value representing the maximum number of thread dumps you want).
The thread dump will go to a file called "javacore...txt" (the "..." will be a long string representing stuff like the timestamp), except on Solaris, where it will go to the server's native_stdout.log. The javacore has a lot more than just thread stacks, so you can search for "Thread Details" to find that section quickly. You'll need to search on the thread name/stack from the server log to figure out which thread is the hung one and go from there.
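Finding the newest javacore and pulling out the thread section can be scripted. A minimal Python sketch, assuming the javacore file naming described above and that each section of the file starts with a line beginning `0SECTION`; both function names are hypothetical.

```python
import glob
import os
from typing import Optional


def latest_javacore(directory: str) -> Optional[str]:
    """Return the most recently modified javacore*.txt in `directory`, or None."""
    paths = glob.glob(os.path.join(directory, "javacore*.txt"))
    return max(paths, key=os.path.getmtime) if paths else None


def thread_details(javacore_text: str) -> str:
    """Return the lines from the 'Thread Details' header up to the next section.

    Assumption: each javacore section starts with a line beginning '0SECTION'.
    """
    lines = javacore_text.splitlines()
    start = next((i for i, line in enumerate(lines) if "Thread Details" in line), None)
    if start is None:
        return ""
    end = start + 1
    while end < len(lines) and not lines[end].startswith("0SECTION"):
        end += 1
    return "\n".join(lines[start:end])
```

You would still search the extracted text for the thread name or stack reported in the server log, as described above.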
If you are experiencing performance, hang, or high CPU issues with WebSphere Application Server, there is a procedure documented by the IBM support team for collecting the data needed to diagnose and resolve these kinds of issues. The procedure essentially consists of:
enabling verboseGC on the application server
running a script, at the time of the problem, which collects 3 javacores for the problematic JVM
At the end of this procedure, you need to collect:
the *.tar.gz file generated by the script
the javacores generated by the script
the server logs (SystemOut.log, native_stderr.log, ...)
and send the results to IBM Support.
To get the script and more information about this procedure, take a look at the following articles:
WebSphere MustGather procedure on Linux
WebSphere MustGather procedure on Windows
A similar document also exists for the AIX platform.
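The core of the data-collection step (several javacores taken some time apart from the same JVM) can be sketched as below. This is not IBM's MustGather script, which also gathers OS-level data; it is a minimal Python illustration, assuming a non-Windows host where SIGQUIT produces a javacore. The 120-second default interval and the helper name are hypothetical; use whatever the MustGather document specifies.

```python
import os
import signal
import time
from typing import Callable, List


def collect_javacores(pid: int, count: int = 3, interval_s: float = 120.0,
                      send: Callable[[int, int], None] = os.kill,
                      sleep: Callable[[float], None] = time.sleep) -> List[float]:
    """Send SIGQUIT to `pid` `count` times, `interval_s` apart.

    Returns the wall-clock time of each request so the resulting javacores can
    be matched up later. `send` and `sleep` are injectable for testing.
    """
    sent_at = []
    for i in range(count):
        send(pid, signal.SIGQUIT)
        sent_at.append(time.time())
        if i < count - 1:
            sleep(interval_s)
    return sent_at
```

Several javacores taken apart let you see whether the same threads stay in the same frames, which a single snapshot cannot show.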

Twisted process is huge

A Twisted app I have was constantly getting killed due to memory problems. The program grew in size, consuming all of the system's memory before being shut down by the OS. Restart and repeat.
This is on a virtual server, so I doubled the memory, and the issue went away: the daemon stabilized at around 1.25GB of memory.
Does anyone have advice on how best to profile this, to tell what is using all that memory and where?
If info on the app helps: I'm using the Twisted reactor and internet.timer.TimerService to poll a database for items to update through three 'services'. The items to process are pushed into a twisted.internet.defer.DeferredList, and their processing occurs in a deferToThread block. In the deferred process there are a handful of blocking operations (fetching web pages, etc.) and a lot of HTML parsing (Beautiful Soup and other libraries). I've suggested a reactor thread pool size of 10, and each 'service' defers to a thread using a SemaphoreService that has 10 tokens. I really expected this daemon to max out at around 400MB of memory, not 3x that.
This is more of a general description of how I debug memory leak/usage problems in my Twisted applications.
Twisted has SSH server support, which I add to almost all of my projects during development.
The SSH session gives me an interactive Python interpreter inside the running process, with the Python garbage collector available plus a number of helper functions that let me a) count the instances of a given class, b) start and stop tracking how that count changes over time, and c) get all references to instances of that class. The nice thing about the interactive interpreter is that it allows ad-hoc introspection of offending instances, their relations to other objects, and the state of the process they are in. So far this has always proven a valuable tool for pinpointing exactly where I forgot to release references, or failed to foresee reference-release problems, in my projects.
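The three helpers described above can be approximated with the standard `gc` module. A minimal sketch, assuming it is run from the interactive interpreter inside the process (e.g. over the SSH session); the names are hypothetical and not part of Twisted.

```python
import gc
from collections import Counter
from typing import List, Tuple


def count_instances(cls: type) -> int:
    """a) Count live, GC-tracked instances of `cls`."""
    return sum(1 for obj in gc.get_objects() if isinstance(obj, cls))


class CountTracker:
    """b) Report how per-type instance counts changed since the previous call."""

    def __init__(self) -> None:
        self._last: Counter = Counter()

    def delta(self, top: int = 10) -> List[Tuple[str, int]]:
        gc.collect()
        current = Counter(type(obj).__name__ for obj in gc.get_objects())
        diff = current.copy()
        diff.subtract(self._last)
        self._last = current
        return [(name, n) for name, n in diff.most_common(top) if n > 0]


def referrers_of(cls: type) -> list:
    """c) Return objects holding references to instances of `cls`."""
    targets = [obj for obj in gc.get_objects() if isinstance(obj, cls)]
    # Drop our own temporary list, which obviously refers to every target.
    return [ref for ref in gc.get_referrers(*targets) if ref is not targets]
```

Calling `tracker.delta()` between two polling cycles shows which types keep growing; `referrers_of` on such a type shows what keeps them alive.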

Web application very slow in Tomcat 7

I implemented a web application. Right after the Tomcat service starts it runs very quickly, but after a few hours, and as more users log in (up to approx. 15 users), it gets slow.
Resource usage: RAM 20%, CPU 25%.
Server Features:
RAM 8GB
Processor i7
Windows Server 2008 64bit
Tomcat 7
MySql 5.0
Struts2
-Xms1024m
-Xmx1024m
PermGen = 1024
MaxPermGen = 1024
I do not use a web server in front; we deploy directly on Tomcat.
Even around midnight, with only 1 user online, the slowness persists.
My only workaround is to restart the Tomcat service, after which response times are excellent again.
Is there anyone who has experienced this issue? Any clue would be appreciated.
Not enough details provided. Need more information :(
Use htop or top to find memory and CPU usage per process & per thread.
CPU
A constant 25% CPU usage on a 4-core system can indicate that a single-threaded application (or a single thread) is running at 100% on the only core it is able to use.
Which application is eating the CPU?
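The arithmetic behind that reading, as a tiny Python check. It assumes the 4 logical CPUs the answer uses. Many i7 models expose 8 logical CPUs with Hyper-Threading, in which case one pegged thread would show up as 12.5% instead, so compare against `os.cpu_count()` on the actual server. The function name is hypothetical.

```python
import os


def one_busy_thread_percent(logical_cpus: int) -> float:
    """Overall CPU usage shown when exactly one thread saturates one logical CPU."""
    return 100.0 / logical_cpus


print(one_busy_thread_percent(4))                    # 25.0, the 4-CPU case in the answer
print(one_busy_thread_percent(os.cpu_count() or 1))  # the same figure for this machine
```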
Memory
20% memory is ~1.6GB. That is a bit more than I would expect for an idle server running only Tomcat + MySQL. -Xms1024m tells the JVM running Tomcat to preallocate a 1GB heap, so that explains it.
Change the Tomcat settings to -Xms512m and -Xmx2048m (keep the m suffix: without it the values are read as bytes). Watch Tomcat's memory usage while you throw some users at it. If it keeps growing until it reaches 2GB and then freezes, that can indicate a memory leak.
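One way to make "watch it while you throw users at it" repeatable: record the heap used after each full GC (from GC logs or a monitoring tool) and flag a series that never goes down and ends close to the -Xmx ceiling. This is a hypothetical Python heuristic; the 90% threshold is an arbitrary assumption, and a sawtooth that returns to the same baseline is normal GC behaviour, not a leak.

```python
from typing import Sequence


def looks_like_leak(post_gc_used_mb: Sequence[float], xmx_mb: float,
                    near_ceiling: float = 0.9) -> bool:
    """True if post-GC heap usage never decreases and ends near the -Xmx limit."""
    if len(post_gc_used_mb) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(post_gc_used_mb, post_gc_used_mb[1:]))
    return never_drops and post_gc_used_mb[-1] >= near_ceiling * xmx_mb
```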
Disk
Use df -h to check disk usage. A full partition can cause the issues you are experiencing.
Filesystem   Size  Used  Avail  Usage%  Mounted on
/cygdrive/c  149G  149G  414M   100%    /
(If you just noticed in this example that my laptop is running out of space, you're doing it right :D)
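The same check can run from a script, for example a scheduled job that warns before the partition fills up. A minimal Python sketch using `shutil.disk_usage`; the 90% threshold and the function names are assumptions. GNU df computes Use% from used and available space (excluding reserved blocks), and the sketch does the same, so its figure should be close to df's.

```python
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Percentage of the filesystem holding `path` that is in use, computed like df's Use%."""
    usage = shutil.disk_usage(path)
    denominator = usage.used + usage.free  # free = space available to unprivileged users
    return 100.0 * usage.used / denominator if denominator else 0.0


def disk_nearly_full(path: str = "/", threshold: float = 90.0) -> bool:
    return disk_used_percent(path) >= threshold
```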
Logs
Logs are awesome. Yet they have a bad habit of filling up the disk. Check how much disk space the logs use. Are logs being written/erased/rotated properly when new users connect? Does erasing the logs fix the issue? (Copy them somewhere for future analysis before you erase them.)
If not, logs are STILL awesome. They have the good habit of helping you track down bugs. Check the Tomcat logs. You may want to set the logging level to debug. What happens last when the website dies? Any useful error message? Are user connections still received and accepted by Tomcat?
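To answer "are logs filling the disk?" quickly, a short script can list the biggest files under the log directory. A Python sketch; the directory you point it at (for example Tomcat's logs folder) and the helper name are assumptions.

```python
import os
from typing import List, Tuple


def largest_files(directory: str, top: int = 5) -> List[Tuple[str, int]]:
    """Return (path, size_in_bytes) for the biggest files under `directory`."""
    sizes = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                sizes.append((path, os.path.getsize(path)))
            except OSError:
                pass  # file rotated or deleted while we were walking
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]
```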
Application
I suppose the 25% CPU goes to Tomcat (and not MySQL). Tomcat doesn't fail by itself; the application running on it must be failing. Try removing the application from Tomcat (you can put a hello-world app in its place). Can Tomcat keep working overnight without your application? It probably can, in which case the fault is in the application.
Enable full debug logging in your application and try to track the issue down. Run it straight from Eclipse in debug mode and throw users at it. Does it fail consistently in the same way?
If yes, hit "pause" in the Eclipse debugger and check what the application is doing. Look at the piece of code each thread is currently running, plus its call stack. Repeat that a few times. If there is a deadlock, an infinite loop, or similar, you can find it this way.
You will have found the issue by now if you are lucky. If not, you're unfortunate and it's a tricky bug that might be deep inside the application. That can get tricky to trace. Determination will lead to success. Good luck =)
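"Pause and look at every thread, several times" is what a thread dump gives you without a debugger; for a JVM that means `jstack <pid>` or `kill -3`, as discussed earlier on this page. The same idea in Python, shown only to illustrate the technique (it does not apply to Tomcat itself). The helper name is hypothetical, and `sys._current_frames()` is CPython-specific.

```python
import sys
import threading
import traceback


def dump_all_stacks() -> str:
    """Return the current call stack of every live thread, like a mini thread dump."""
    names = {t.ident: t.name for t in threading.enumerate()}
    parts = []
    for ident, frame in sys._current_frames().items():
        parts.append(f'Thread "{names.get(ident, "unknown")}" (ident {ident}):')
        parts.append("".join(traceback.format_stack(frame)))
    return "\n".join(parts)
```

Taking two or three of these a few seconds apart and comparing them is the scripted equivalent of pausing the debugger repeatedly.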
For performance-related issues, follow these rules:
You can set -Xms and -Xmx to the same, larger value, so the heap does not have to be resized at runtime:
-Xms2048m
-Xmx2048m
You can also allow PermGen to be garbage collected (PermGen exists only up to Java 7):
-XX:+UseConcMarkSweepGC -XX:+CMSPermGenSweepingEnabled -XX:+CMSClassUnloadingEnabled
If a page changes too frequently for static caching to make sense, try caching the dynamic content temporarily, so that it doesn't need to be regenerated over and over again. Any technique that lets you reuse work that has already been done, instead of doing it again, should be used; this is the key to achieving the best Tomcat performance.
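A minimal sketch of "cache work that's already been done": keep each generated result for a fixed time-to-live and reuse it until it expires. Python is used for illustration; in a Struts2/Tomcat application the same idea could live in a servlet filter or a caching library. The class name and TTL are hypothetical, and this version is not thread-safe.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Reuse a computed value for `ttl_s` seconds instead of regenerating it."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_s = ttl_s
        self.clock = clock
        self._store: Dict[str, Tuple[float, object]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], object]) -> object:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl_s:
            return hit[1]          # still fresh: skip the expensive work
        value = compute()          # missing or expired: regenerate once
        self._store[key] = (now, value)
        return value
```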
If there are any database-related issues, apply SQL query performance tuning.
Rotate the catalina.out log file without restarting Tomcat.
There are two ways to do this.
The first, which is more direct, is to rotate catalina.out by adding a simple pipe to the log rotation tool of your choice in Catalina's startup shell script. This will look something like:
2>&1 | WeaponOfChoice "$CATALINA_BASE"/logs/catalina.out &
Simply replace "WeaponOfChoice" with your favorite log rotation tool.
The second way is less direct, but ultimately better. The best way to handle the rotation of catalina.out is to make sure it never needs to rotate. Simply set the "swallowOutput" property to true for all Contexts in "server.xml".
This will route System.err and System.out to whatever logging implementation you have configured, or to JULI if you haven't configured one.
See more at: Tomcat Catalina Out
I experienced a very slow stock Tomcat dashboard on a clean CentOS 7 install and found the following cause and solution:
Slow start-up times for Tomcat are often related to Java's SecureRandom implementation. By default, it uses /dev/random as an entropy source. This can be slow, as it uses system events to gather entropy (e.g. disk reads, key presses, etc.). As the urandom manpage states:
When the entropy pool is empty, reads from /dev/random will block until additional environmental noise is gathered.
Source: https://www.digitalocean.com/community/questions/tomcat-8-5-9-restart-is-really-slow-on-my-centos-7-2-droplet
Fix it by adding the following configuration option to your tomcat.conf or (preferred) a custom file into /tomcat/conf/conf.d/:
JAVA_OPTS="-Djava.security.egd=file:/dev/./urandom"
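To confirm that entropy starvation is the cause before changing JAVA_OPTS, you can look at the kernel's entropy estimate while Tomcat is starting. A Linux-only Python sketch reading /proc/sys/kernel/random/entropy_avail; the helper name is hypothetical. On recent kernels /dev/random no longer blocks once the pool has been initialized, so a low value matters mainly on older systems such as CentOS 7.

```python
from typing import Optional

ENTROPY_AVAIL = "/proc/sys/kernel/random/entropy_avail"  # Linux-specific path


def available_entropy(path: str = ENTROPY_AVAIL) -> Optional[int]:
    """Return the kernel's entropy estimate in bits, or None if it can't be read."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None
```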
We encountered a similar problem; the cause was "catalina.out". It is the standard destination log file for "System.out" and "System.err". Its size kept increasing, which slowed things down until Tomcat ultimately crashed. We solved the problem by rotating "catalina.out". We were using Red Hat, so we wrote a shell script to rotate "catalina.out".
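The shell script itself isn't shown. A common way to rotate a file that a running process keeps open is the "copy, then truncate" approach that logrotate calls copytruncate; here is a minimal Python sketch of it, with a hypothetical helper name and timestamp suffix. It assumes catalina.out is written in append mode (catalina.sh redirects with >>), so Tomcat's next write lands at the start of the emptied file. Lines written between the copy and the truncate can be lost.

```python
import shutil
import time


def copytruncate(path: str) -> str:
    """Copy the live log aside, then empty it in place without closing Tomcat's handle."""
    rotated = f"{path}.{time.strftime('%Y%m%d-%H%M%S')}"
    shutil.copy2(path, rotated)      # keep the old contents
    with open(path, "r+b") as fh:    # open without truncating on open...
        fh.truncate(0)               # ...then truncate explicitly
    return rotated
```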
Here are some links:
MuleSoft article on catalina.out (also describes two methods of rotating it):
Tomcat Catalina Introduction
If "catalina.out" is not the problem, then try this instead:
MuleSoft article on optimizing Tomcat:
Tuning Tomcat Performance For Optimum Speed
We had a problem that looks similar to yours. Tomcat was slow to respond, but the access log showed response times of just milliseconds. The problem was streaming responses: one of our services returned real-time data that users could subscribe to. The epoll structures were becoming bloated, and network requests couldn't get through to Tomcat. What's more interesting, the CPU was mostly idle (since no one could ask the server to do anything), and the acceptor/poller threads were sitting in WAIT, not RUNNING or IN_NATIVE.
At the time, we simply limited the number of such requests, and everything went back to normal.
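Limiting concurrent streaming requests can be as simple as a counting semaphore that rejects new streams when all slots are taken, rather than letting them pile up. A hedged Python sketch; the class name and the limit are hypothetical, and in Tomcat the equivalent check could sit in a filter or in the streaming endpoint itself.

```python
import threading


class StreamLimiter:
    """Admit at most `max_streams` long-lived streaming responses at a time."""

    def __init__(self, max_streams: int) -> None:
        self._slots = threading.BoundedSemaphore(max_streams)

    def try_open(self) -> bool:
        """Take a slot without waiting; False means reject (e.g. HTTP 503)."""
        return self._slots.acquire(blocking=False)

    def close(self) -> None:
        """Return the slot when the stream ends, including on errors."""
        self._slots.release()
```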

I can't run more than 100 processes

I have a massive number of shell commands being executed with root/admin privileges through Authorization Services' "AuthorizationExecuteWithPrivileges" call. The issue is that after a while (10-15 seconds, maybe 100 shell commands) the program stops responding with this error in the debugger:
couldn't fork: errno 35
And then while the app is running, I cannot launch any more applications. I researched this issue and apparently it means that there are no more threads available for the system to use. However, I checked using Activity Monitor and my app is only using 4-5 threads.
To fix this problem, I think what I need to do is separate the shell commands into a separate thread (away from the main thread). I have never used threading before, and I'm unsure where to start (I couldn't find any comprehensive examples).
Thanks
As Louis Gerbarg already pointed out, your question has nothing to do with threads. I've edited your title and tags accordingly.
I have a massive number of shell commands being executed with root/admin privileges through Authorization Services' "AuthorizationExecuteWithPrivileges" call.
Don't do that. That function only exists so you can restore the root:admin ownership and the setuid mode bit to the tool that you want to run as root.
The idea is that you should factor out the code that should run as root into a completely separate program from the part that does not need to run as root, so that the part that needs root can have it (through the setuid bit) and the part that doesn't need root can go without it (through not having setuid).
A code example is in the Authorization Services Programming Guide.
The issue is that after a while (10-15 seconds, maybe 100 shell commands) the program stops responding with this error in the debugger:
couldn't fork: errno 35
Yeah. You can only run a couple hundred processes at a time. This is an OS-enforced limit.
It's a soft limit, which means you can raise it, but only up to the hard limit, which you cannot raise without root privileges. See the output of limit and limit -h (in zsh; I don't know about other shells).
You need to wait for processes to finish before running more processes.
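From a program, the same soft and hard limits that `limit` and `limit -h` print can be read with `getrlimit`. A minimal Python sketch for macOS or Linux using RLIMIT_NPROC (the per-user process limit); the helper name is hypothetical.

```python
import resource
from typing import Tuple


def process_limits() -> Tuple[int, int]:
    """Return the (soft, hard) per-user limit on the number of processes."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


soft, hard = process_limits()
unlimited = resource.RLIM_INFINITY
print("soft:", "unlimited" if soft == unlimited else soft)
print("hard:", "unlimited" if hard == unlimited else hard)
```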
And then while the app is running, I cannot launch any more applications.
Because you are already running as many processes as you're allowed to. That x-hundred-process limit is per-user, not per-process.
I researched this issue and apparently it means that there are no more threads available for the system to use.
No, it does not.
The errno error codes are used for many things. EAGAIN (35, “resource temporarily unavailable”) may mean no more threads when set by a system call that starts a thread, but it does not mean that when set by another system call or function.
The error message you quoted explicitly says that it was set by fork, which is the system call to start a new process, not a new thread. In that context, EAGAIN means “you are already running as many processes as you can”. See the fork manpage.
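Raw errno numbers are platform-specific, so it helps to decode them on the machine where they occurred. A small Python sketch with a hypothetical helper name: on macOS, 35 should decode to EAGAIN, while on Linux EAGAIN is 11 and 35 means something else.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return '<number> = <symbolic name>: <message>' for this platform."""
    name = errno.errorcode.get(code, "unknown")
    return f"{code} = {name}: {os.strerror(code)}"


print(describe_errno(35))            # the value from "couldn't fork: errno 35"
print(describe_errno(errno.EAGAIN))  # this platform's numeric value for EAGAIN
```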
However, I checked using Activity Monitor and my app is only using 4-5 threads.
See?
To fix this problem, I think what I need to do is separate the shell commands into a separate thread (away from the main thread).
Starting one process per thread will only help you run out of processes much faster.
I have never used threading before …
It sounds like you still haven't, since the function you're referring to starts a process, not a thread.
This is not about threads (at least not threads in your application). This is about system resources. Each of those forked processes is consuming at least 1 kernel thread (maybe more), some vnodes, and a number of other things. Eventually the system will not allow you to spawn more processes.
The first limits you hit are administrative limits. The system can support more, but that may cause degraded performance and other issues. You can usually raise these through various mechanisms, like sysctls. In general, doing that is a bad idea unless you have a particular (special) workload that you know will benefit from specific tweaks.
Chances are raising those limits will not fix your issues. While adjusting those limits may make you run a little longer, in order to actually fix it you need to figure out why the resources are not being returned to the system. Based on what you described above I would guess that your forked processes are never exiting.
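Whatever launches the commands, the underlying fix is to cap how many children exist at once and to wait for (reap) each one so it gives its process slot back. A minimal Python sketch of that pattern using ordinary unprivileged subprocesses; it does not touch Authorization Services, and the helper name and the default cap of 8 are hypothetical. The worker threads here only wait on children; they do not add processes.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def run_bounded(commands: Sequence[Sequence[str]], max_parallel: int = 8) -> List[int]:
    """Run each argv in `commands` with at most `max_parallel` children alive at once.

    subprocess.run() waits for its child to exit, so every process is reaped
    before its worker picks up the next command; no zombies pile up.
    """
    def run_one(argv: Sequence[str]) -> int:
        return subprocess.run(list(argv)).returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, commands))
```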

Looking for pattern/approach/suggestions for handling long-running operation tied to web app

I'm working on a consumer web app that needs to do a long running background process that is tied to each customer request. By long running, I mean anywhere between 1 and 3 minutes.
Here is an example flow. The object/widget doesn't really matter.
Customer comes to the site and specifies object/widget they are looking for.
We search/clean/filter for widgets matching some initial criteria. <-- long running process
Customer further configures more detail about the widget they are looking for.
When the long running process is complete the customer is able to complete the last few steps before conversion.
Steps 3 and 4 aren't really important. I just mention them because we can buy some time while we are doing the long running process.
The environment we are working in is a LAMP stack, currently using PHP. It doesn't seem like a good design to have the long-running process take up an Apache thread in mod_php (or a FastCGI process). The Apache layer of our app should be focused on serving content, not on data processing, IMO.
A few questions:
Is our thinking right in that we should separate this "long running" part out of the apache/web app layer?
Is there a standard/typical way to break this out under Linux/Apache/MySQL/PHP (we're open to using a different language for the processing if appropriate)?
Any suggestions on how to go about breaking it out? E.g. do we create a daemon that churns through a FIFO queue?
Edit: Just to clarify, only about 1/4 of the long running process is database centric. We're working on optimizing that part. There is some work that we could potentially do, but we are limited in the amount we can do right now.
Thanks!
Consider providing the search results via AJAX from a web service instead of your application. Presumably you could offload this to another server and let your web application deal with the content as you desire.
Just curious: 1-3 minutes seems like a long time for a lookup query. Have you looked at indexes on the columns you are querying to improve the speed? Or do you need to do some algorithmic process -- perhaps you could perform some of this offline and prepopulate some common searches with hints?
As Jonnii suggested, you can start a child process to carry out background processing. However, this needs to be done with some care:
Make sure that any parameters passed through are escaped correctly
Ensure that more than one copy of the process does not run at once
If several copies can run at once, nothing stops a user (not even a malicious one, just an impatient one) from repeatedly hitting reload on the page that kicks it off, eventually starting so many copies that the machine runs out of RAM and grinds to a halt.
So you can use a subprocess, but do it carefully, in a controlled manner, and test it properly.
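Both precautions can be sketched briefly. The example below is Python (PHP's own equivalents are escapeshellarg() for quoting and flock() for locking); it quotes each argument for the shell and uses a non-blocking exclusive lock file, so a second copy exits immediately instead of piling up. The script name, lock path, and helper names are hypothetical.

```python
import fcntl
import shlex
from typing import IO, Optional


def background_command(script: str, arg: str) -> str:
    """Build a shell command line with every argument safely quoted."""
    return f"/usr/bin/php {shlex.quote(script)} {shlex.quote(arg)} > /dev/null &"


def acquire_single_instance_lock(lock_path: str) -> Optional[IO[str]]:
    """Return the open, locked file, or None if another copy already holds the lock.

    Keep the returned file open for the whole job: closing it, or the process
    exiting, releases the lock.
    """
    fh = open(lock_path, "w")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        fh.close()
        return None
    return fh
```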
Another option is to have a permanently running daemon that waits for requests, processes them, and then records the results somewhere (perhaps in a database).
This is the poor man's solution:
exec ("/usr/bin/php long_running_process.php > /dev/null &");
Alternatively you could:
Insert a row into your database with details of the background request, which a daemon can then read and process.
Write a message to a message queue, which a daemon then reads and processes.
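The database-as-queue option can be sketched as follows. SQLite stands in for MySQL so the example is self-contained; the table and column names, and the pending/running/done statuses, are assumptions rather than anything prescribed above. The web request only calls enqueue() and returns; a separate long-running daemon loops over work_one(), and the page can poll the job's status (for example via AJAX).

```python
import sqlite3
from typing import Callable, Optional


def init_queue(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " result TEXT)"
    )
    conn.commit()


def enqueue(conn: sqlite3.Connection, payload: str) -> int:
    """Called by the web request: record the job and return its id immediately."""
    cur = conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    conn.commit()
    return cur.lastrowid


def work_one(conn: sqlite3.Connection, process: Callable[[str], str]) -> Optional[int]:
    """Called in the daemon's loop: claim the oldest pending job and run it."""
    row = conn.execute(
        "SELECT id, payload FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    job_id, payload = row
    # Only claim the job if no other worker has claimed it in the meantime.
    claimed = conn.execute(
        "UPDATE jobs SET status = 'running' WHERE id = ? AND status = 'pending'",
        (job_id,),
    ).rowcount
    conn.commit()
    if claimed == 0:
        return None
    result = process(payload)
    conn.execute("UPDATE jobs SET status = 'done', result = ? WHERE id = ?", (result, job_id))
    conn.commit()
    return job_id
```

Returning None when another worker wins the race just means the daemon tries again on its next pass.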
Here's some discussion on the Java version of this problem.
See java: what are the best techniques for communicating with a batch server
Two important things you might do:
Switch to Java and use JMS.
Read up on JMS but use another queue manager. Unix named pipes, for instance, might be an acceptable implementation.
Java servlets can do background processing. You could do something similar in any web technology with threading support. I don't know about PHP, though.
Not a complete answer, but I would use AJAX and hand the second step off to something faster than PHP (C, C++, C#), then have a PHP function pick the results up from some shared store, most likely just a database.

Resources