Process under FreeBSD 9.0 hangs in uninterruptible sleep with apparently no syscall (empty wchan)

I have a custom logging process that is reading from STDIN and sending the data out via TCP to a scribed logging server.
In my case, STDIN is an access log, attached to Apache httpd 2.2 like this in httpd.conf:
CustomLog "|/usr/local/bin/serelog" default
My serelog process sometimes goes into uninterruptible sleep under FreeBSD 9.0 and does not return from it. It works reliably under other operating systems, though, including FreeBSD 8, Linux 2.6 and Linux 3.1.
How can I find out what could be the reason for the uninterruptible sleep?
The overall structure is like this:
httpd --[PIPE]--> serelog --[TCP-CONNECTION]--> scribed
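For orientation, here is a minimal sketch of what such a stdin-to-TCP forwarder boils down to. This is hypothetical code, not the real serelog; the scribed address and port are assumed, and reconnect/framing logic is omitted:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void) {
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(1463);                /* assumed scribed port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect"); return 1;
    }

    char buf[4096];
    ssize_t n;
    while ((n = read(STDIN_FILENO, buf, sizeof(buf))) > 0) {
        /* Blocking write: if the TCP peer stalls, this loop stalls too */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(sock, buf + off, n - off);
            if (w < 0) { perror("write"); return 1; }
            off += w;
        }
    }
    close(sock);
    return 0;
}

The relevant property is the blocking read/write loop: when the process stops making progress, the Apache pipe fills up behind it, which matches the lsof observation below.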
Until now I did the following analysis:
Using ps: stat is "D" and wchan is "-". So there is apparently no syscall, which doesn't make much sense to me, as a process in uninterruptible sleep should be in kernel land.
As expected for state "D", the process does not react to kill -9.
Attaching truss to serelog externally from a shell: as long as truss is attached, serelog runs smoothly.
Shortly (seconds) after detaching truss from serelog, serelog goes into "D" state.
When attaching truss to serelog AFTER it has entered "D" state, truss prints nothing.
In "D" state, lsof shows that the incoming PIPE is full. This is expected, as in "D" state the process "sleeps" and cannot read any longer. The outgoing TCP-CONNECTION is empty.
If I kill the "surrounding" Apache httpd server, the serelog process eventually terminates, but only after a long delay (e.g. 40 minutes).
Checking what others report in forums about uninterruptible sleep was not successful: in my setup there is no NFS,
and as it is a server, there is also no user interaction with CD drives or pluggable hardware.
So I am now stuck with a process that is uninterruptible, is apparently not in a syscall,
and works reliably when traced. The only good thing is that I am able to reproduce the behavior within a few
seconds or minutes by sending a lot of HTTP requests via a JMeter load test (5 threads in JMeter).
Any tips on debugging or kernel parameter tuning are appreciated.
Greetings

The issue has proven to be an actual FreeBSD kernel bug, and is now fixed in the kernel.
Link to the PR: http://www.freebsd.org/cgi/query-pr.cgi?pr=166340
Proposed Patch: http://lists.freebsd.org/pipermail/freebsd-bugs/2012-May/048610.html

Related

WebSphere - dump generation: system signal vs. server script

I am looking for an explanation of the differences between methods of generating thread and heap dumps.
What I know so far:
a system signal, e.g. kill -3, triggers instant creation of both (thread and heap dump)
the script shipped with Liberty runs a Java agent which does its magic and generates customizable output: a thread dump alone, or together with a heap dump or core dump (or even with both)
server javadump myserver --include=thread,heap,system
https://www.ibm.com/support/knowledgecenter/SSEQTP_liberty/com.ibm.websphere.wlp.doc/ae/rwlp_command_server.html
...so my questions are:
what's better and why?
is there any difference in the generated dumps?
which one would you use to provide an exposed, automated way of creating dumps (e.g. for developers)?
does anyone have experience with my previous point? I would highly appreciate your pro tips
...and also anything else you might consider worth mentioning here.
PS
What I've noticed: if I send the system signal multiple times in a row, nothing hangs and the number of generated dumps is equal to the number of attempts made. The same happens if I use the script-based solution (of course it takes longer).
...but if I do kill -3 <PID> ; server javadump myserver --include=thread,heap then the server hangs and the dumps are not generated - this state is unrecoverable without a restart. (I've not spent much time on this behaviour, so it could be just a failure unrelated to the commands performed.)
Thank you and best regards!

Tuxedo tmshutdown stops server but process still exists

I've got a problem with the Tuxedo tmshutdown command. One of the processes keeps running (with huge CPU usage) even though tmshutdown reports stopping it successfully. There is also one open IPC shared-memory segment, which I can remove once I kill the surviving process. There are other servers, but only this one is problematic. Is it possible that the problem is in the code (tpsvrdone is exiting without errors)?
tmshutdown normally sends a SIGTERM signal to Tuxedo servers unless you use -k KILL (which sends a SIGKILL).
If the source code of the Tuxedo server implements a handler for the signal, you could get the behavior you described.
http://www.thegeekstuff.com/2012/03/catch-signals-sample-c-code/
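As a hedged illustration of how a handler can produce exactly this symptom (a server that survives SIGTERM and spins at high CPU), consider a sketch like the following - hypothetical code, not taken from the actual server:

#include <signal.h>
#include <string.h>

static volatile sig_atomic_t got_term = 0;

/* Handler sets a flag but the main loop never checks it */
static void on_term(int sig) {
    (void)sig;
    got_term = 1;
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_term;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGTERM, &sa, NULL);

    for (;;) {
        /* busy work that never tests got_term: SIGTERM is effectively
         * swallowed, and the empty loop spins with high CPU usage */
    }
    return 0;
}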
Also, if it is not possible to shut down a server or remove a service advertisement, a diagnostic is written to the ULOG.

In Windows 7, how to send a Ctrl-C or Ctrl-Break to a separate process

Our group has long running processes which run daily. The processes are typically started at 9pm on any given day and run until 7pm the next day. Thus they typically run 22hrs/day. They are started by scheduled tasks on servers under a particular generic user ID, and they start and run regardless of whether or not that user ID is logged on. Thus, they are windowless console executables.
The tasks orchestrate computations running on a large server farm. Generally these controlling tasks run uninterrupted for the full 22hrs/day. However, we often have a need to stop and restart these processes. Because they control a multitude of tasks running on our server farm, it is important that they be shut down cleanly, so that they can stop and shut down all the server farm processes. Which brings me to our problem.
The controlling process has been programmed to respond to ctrl-C and ctrl-break signals. This works fine when the process is manually started in a console where we have access to the console and can "type" ctrl-c or ctrl-break in the console window. However, as mentioned, the processes typically run as windowless scheduled tasks. Hence we cannot "type" anything into a non-existent console window. Because they are console processes that execute without a logon session, they also must be able to execute in a completely windowless environment. So, how do we set up the process to listen for a shut-down signal?
While the process does indeed listen for a ctrl-C and ctrl-break signal, I can see no way to send that signal to a process. This seems to be a fundamental problem in Windows, or am I wrong? I am aware of SendSignal.exe, but so far have been unable to get it to work. It fails as follows:
>SendSignal 26320
Sending signal to process 26320...
CreateRemoteThread failed with 0x00000005.
StartRemoteThread failed with 0x00000005.
0x00000005 == Access is denied.
Trying "taskkill" without -F results in:
>taskkill /PID 24840
ERROR: The process with PID 24840 could not be terminated.
Reason: This process can only be terminated forcefully (with /F option).
All other "kill" functions kill the process immediately rather than sending a signal.
One possibility would be a file-watch based solution: create a watch for some modification of a specific file. But this is a hack, and we would prefer to do it with appropriate signaling. Has anyone solved this issue? It seems to be so very basic a functionality, and it is certainly trivial to do it in a Unix environment. Surely Microsoft has provided SOME mechanism to allow clean shut down of a windowless executable?
I am aware of the thread below, whose question is virtually identical (save for the specification of why the answer is necessary, i.e. why one needs to be able to do this for a windowless, console-less process), but there is no answer there except for "use SendSignal", which, as I said, does not work for us:
Can I send a ctrl-C (SIGINT) to an application on Windows?
There are other similar questions, but no answers as yet.
Any help appreciated.
[Upgrading #Anon's comment to an answer for visibility]
windows-kill worked perfectly and managed to resolve access denial issues faced with SendSignal. A privileged user would have to run it as well of course.
windows-kill also supports both ctrl-c and ctrl-break signals.
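For reference, tools in this space are typically built on the AttachConsole/GenerateConsoleCtrlEvent pattern. A minimal sketch, assuming the target is a console process and, for CTRL_BREAK, that it leads its own process group (e.g. was started with CREATE_NEW_PROCESS_GROUP); run it with sufficient privileges:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: ctrlbreak <pid>\n"); return 1; }
    DWORD pid = (DWORD)strtoul(argv[1], NULL, 10);

    FreeConsole();                      /* detach from our own console  */
    if (!AttachConsole(pid)) {          /* attach to the target's console */
        fprintf(stderr, "AttachConsole failed: %lu\n", GetLastError());
        return 1;
    }
    SetConsoleCtrlHandler(NULL, TRUE);  /* ignore Ctrl+C in this process */

    /* CTRL_BREAK can be targeted at a process group; CTRL_C (group 0)
     * would hit every process attached to that console, including us. */
    if (!GenerateConsoleCtrlEvent(CTRL_BREAK_EVENT, pid)) {
        fprintf(stderr, "GenerateConsoleCtrlEvent failed: %lu\n",
                GetLastError());
        return 1;
    }
    return 0;
}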

How do I code a watchdog timer to restart a Windows service?

I'm very interested in the answer to another question regarding watchdog timers for Windows services (see here). That answer stated:
"I have also used an internal watchdog system running in another thread. That thread looks at the main thread for activity like log output or a toggling event. If the activity is not seen then the service is considered hung and I shutdown the service.
In this case you can configure windows to auto-restart a stopped service and that might clear the problem (as long as it's not an internal logic bug).
Also services I work with have text logs that are written to a log. In addition for services that are about to "sleep for a bit", I log the time for the next wake up. I use MTAIL to watch a log for output."
Could anyone give some sample code showing how to use an internal watchdog running in another thread? I currently have a task to develop a Windows service which will be able to restart itself in case it fails, hangs, etc.
I really appreciate your help.
I'm not a big fan of running a watchdog as a thread in the process you're watching. That means if the whole process hangs for some reason, the watchdog won't work.
Watchdogs are an idea lifted from the hardware world, and they had it right. Use an external circuit as simple as possible (so it can be provably correct). Typical watchdogs simply ran a timer and, if the process hadn't done something before the timer expired (like access a memory location the watchdog was watching), the whole thing was reset. When the watchdog was "kicked", it would restart the timer.
The act of the process kicking the watchdog protected that process from summary termination.
My advice would be to write a very simple stand-alone program which just monitored an event (such as file update time being modified). If that event didn't occur within the required time, kill the process being watched (and let Windows restart it).
Then have your watched program periodically rewrite that file.
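A minimal sketch of that stand-alone watchdog, assuming a hypothetical heartbeat file path, PID, and timeout; the watched service rewrites the file periodically, and Windows service recovery restarts it after TerminateProcess:

#include <windows.h>

#define HEARTBEAT_FILE "C:\\myservice\\heartbeat.txt"  /* assumed path */
#define TIMEOUT_MS     (60 * 1000)                     /* 1 minute    */

static int heartbeat_is_stale(DWORD timeout_ms) {
    WIN32_FILE_ATTRIBUTE_DATA fad;
    FILETIME now;
    ULARGE_INTEGER last, cur;
    if (!GetFileAttributesExA(HEARTBEAT_FILE, GetFileExInfoStandard, &fad))
        return 1;                       /* missing file counts as stale */
    GetSystemTimeAsFileTime(&now);
    last.LowPart  = fad.ftLastWriteTime.dwLowDateTime;
    last.HighPart = fad.ftLastWriteTime.dwHighDateTime;
    cur.LowPart   = now.dwLowDateTime;
    cur.HighPart  = now.dwHighDateTime;
    /* FILETIME ticks are 100 ns; convert the age to milliseconds */
    return (cur.QuadPart - last.QuadPart) / 10000ULL > timeout_ms;
}

int main(void) {
    DWORD watched_pid = 1234;           /* placeholder: PID of the service */
    for (;;) {
        Sleep(5000);
        if (heartbeat_is_stale(TIMEOUT_MS)) {
            HANDLE h = OpenProcess(PROCESS_TERMINATE, FALSE, watched_pid);
            if (h) {
                TerminateProcess(h, 1); /* SCM recovery restarts it */
                CloseHandle(h);
            }
        }
    }
    return 0;
}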
Other approaches you might want to consider besides regularly modifying the lastwritetime of a file would be to create a proper performance counter or even a WMI object. We do the latter in our build infrastructure; the 'trick' is to find a meaningful work unit in the service being monitored and pulse your 'heartbeat' each time a unit is finished.
The advantage of WMI or perf counters over the file approach is that you then become visible to a whole bunch of professional MIS / management tools. This can add a lot of value.
You can configure the service, from its properties, to restart itself in case of failure:
Services -> right-click your service -> Properties -> First failure: Restart the Service -> Second failure: Restart the Service -> Subsequent failures: Restart the Service
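The same recovery settings can also be applied programmatically with the Win32 ChangeServiceConfig2 API; "MyService" below is a placeholder name:

#include <windows.h>
#include <stdio.h>

int main(void) {
    SC_HANDLE scm = OpenSCManagerA(NULL, NULL, SC_MANAGER_ALL_ACCESS);
    if (!scm) { fprintf(stderr, "OpenSCManager: %lu\n", GetLastError()); return 1; }

    SC_HANDLE svc = OpenServiceA(scm, "MyService", SERVICE_ALL_ACCESS);
    if (!svc) {
        fprintf(stderr, "OpenService: %lu\n", GetLastError());
        CloseServiceHandle(scm);
        return 1;
    }

    /* restart 60 s after each of the first three failures */
    SC_ACTION actions[3] = {
        { SC_ACTION_RESTART, 60000 },
        { SC_ACTION_RESTART, 60000 },
        { SC_ACTION_RESTART, 60000 },
    };
    SERVICE_FAILURE_ACTIONSA sfa;
    ZeroMemory(&sfa, sizeof(sfa));
    sfa.dwResetPeriod = 86400;          /* reset failure count after a day */
    sfa.cActions = 3;
    sfa.lpsaActions = actions;

    if (!ChangeServiceConfig2A(svc, SERVICE_CONFIG_FAILURE_ACTIONS, &sfa))
        fprintf(stderr, "ChangeServiceConfig2: %lu\n", GetLastError());

    CloseServiceHandle(svc);
    CloseServiceHandle(scm);
    return 0;
}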

I can't run more than 100 processes

I have a massive number of shell commands being executed with root/admin privileges through Authorization Services' "AuthorizationExecuteWithPrivileges" call. The issue is that after a while (10-15 seconds, maybe 100 shell commands) the program stops responding with this error in the debugger:
couldn't fork: errno 35
And then while the app is running, I cannot launch any more applications. I researched this issue and apparently it means that there are no more threads available for the system to use. However, I checked using Activity Monitor and my app is only using 4-5 threads.
To fix this problem, I think what I need to do is separate the shell commands into a separate thread (away from the main thread). I have never used threading before, and I'm unsure where to start (I could not find comprehensive examples).
Thanks
As Louis Gerbarg already pointed out, your question has nothing to do with threads. I've edited your title and tags accordingly.
I have a massive number of shell commands being executed with root/admin privileges through Authorization Services' "AuthorizationExecuteWithPrivileges" call.
Don't do that. That function only exists so you can restore the root:admin ownership and the setuid mode bit to the tool that you want to run as root.
The idea is that you should factor out the code that should run as root into a completely separate program from the part that does not need to run as root, so that the part that needs root can have it (through the setuid bit) and the part that doesn't need root can go without it (through not having setuid).
A code example is in the Authorization Services Programming Guide.
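In outline, the privileged side becomes a tiny setuid helper along these lines (a purely hypothetical sketch; the guide linked above has the full pattern). It would be installed root-owned with the setuid bit set (chown root:admin, chmod u+s):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    if (geteuid() != 0) {
        fprintf(stderr, "helper: not running as root; "
                        "was the setuid bit stripped?\n");
        return 1;
    }
    /* ... perform the single privileged operation here ... */
    return 0;
}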
The issue is that after a while (10-15 seconds, maybe 100 shell commands) the program stops responding with this error in the debugger:
couldn't fork: errno 35
Yeah. You can only run a couple hundred processes at a time. This is an OS-enforced limit.
It's a soft limit, which means you can raise it—but only up to the hard limit, which you cannot raise. See the output of limit and limit -h (in zsh; I don't know about other shells).
You need to wait for processes to finish before running more processes.
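A sketch of what "wait for processes to finish" looks like in practice: cap the number of concurrent children and reap finished ones with wait before forking more. MAX_CHILDREN and run_command are placeholders:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define MAX_CHILDREN 16   /* stay far below the per-user process limit */

static void run_command(int i) {
    /* placeholder for the real work, e.g. an execlp(...) call */
    printf("child %d (pid %d)\n", i, (int)getpid());
    _exit(0);
}

int main(void) {
    int running = 0;
    for (int i = 0; i < 1000; i++) {
        if (running >= MAX_CHILDREN) {
            wait(NULL);               /* block until one child exits */
            running--;
        }
        pid_t pid = fork();
        if (pid < 0) { perror("fork"); break; }
        if (pid == 0) run_command(i);
        running++;
    }
    while (running-- > 0)
        wait(NULL);                   /* reap the stragglers */
    return 0;
}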
And then while the app is running, I cannot launch any more applications.
Because you are already running as many processes as you're allowed to. That x-hundred-process limit is per-user, not per-process.
I researched this issue and apparently it means that there are no more threads available for the system to use.
No, it does not.
The errno error codes are used for many things. EAGAIN (35, “resource temporarily unavailable”) may mean no more threads when set by a system call that starts a thread, but it does not mean that when set by another system call or function.
The error message you quoted explicitly says that it was set by fork, which is the system call to start a new process, not a new thread. In that context, EAGAIN means “you are already running as many processes as you can”. See the fork manpage.
However, I checked using Activity Monitor and my app is only using 4-5 threads.
See?
To fix this problem, I think what I need to do is separate the shell commands into a separate thread (away from the main thread).
Starting one process per thread will only help you run out of processes much faster.
I have never used threading before …
It sounds like you still haven't, since the function you're referring to starts a process, not a thread.
This is not about threads (at least not threads in your application). This is about system resources. Each of those forked processes is consuming at least 1 kernel thread (maybe more), some vnodes, and a number of other things. Eventually the system will not allow you to spawn more processes.
The first limits you hit are administrative limits. The system can support more, but it may cause degraded performance and other issues. You can usually raise these through various mechanisms, like sysctls. In general, doing that is a bad idea unless you have a particular (special) workload that you know will benefit from specific tweaks.
Chances are raising those limits will not fix your issues. While adjusting those limits may make you run a little longer, in order to actually fix it you need to figure out why the resources are not being returned to the system. Based on what you described above I would guess that your forked processes are never exiting.
