I have a server running in an eventmachine reactor which listens to heartbeats from users to tell if they are online. It marks users as online or offline appropriately when it starts or stops receiving their heartbeats.
I want to wrap it all in an ensure block to mark all currently online users offline when it exits, but I'm unsure how reliable that would be.
Under what conditions could a process exit without running the ensure blocks wrapping the current execution context?
Quite a few, for example:
being killed with kill -9 (SIGKILL cannot be trapped)
segmentation faults (e.g. bugs in Ruby itself or in native extensions)
power failures
the system as a whole crashing (e.g. kernel/driver bugs, hardware failures)
A network failure wouldn't stop your ensure block from running but might mean that it can't update whatever datastore stores these statuses.
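The failure modes above can't be handled from inside the process, but for everything else the usual pattern is to pair the ensure block with signal handlers so that catchable signals also run the cleanup. Here is a rough sketch of that shape, in Python purely for illustration (mark_all_offline and the signal list are placeholders, not anything from the original setup):

# Sketch of a graceful-shutdown hook. Nothing here runs on kill -9,
# segfaults, or power loss; those cases still need a server-side timeout.
import atexit
import signal
import sys

def mark_all_offline():
    print("marking all currently online users as offline")  # placeholder

def handle_signal(signum, frame):
    # Turn catchable termination signals into a normal exit so the
    # atexit handler (the ensure-block equivalent) still runs.
    sys.exit(0)

for sig in (signal.SIGTERM, signal.SIGINT):
    signal.signal(sig, handle_signal)

atexit.register(mark_all_offline)

# ... run the reactor / main loop here ...

Because of the uncatchable cases, it is also worth letting the datastore expire "online" flags that haven't been refreshed by a recent heartbeat, so a crashed server doesn't leave users stuck online.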
I'm familiar with Heroku's policy on dyno restarts (once every 24 hours, if underlying hardware fails, etc).
I've also looked elsewhere on Stack Overflow and found questions like
Controlling Heroku's random dyno restarts.
Our apps are good from a web perspective: sessions are handled via an external database and multiple dynos balance the load perfectly. Restarts aren't an issue there. The issue is workers. For example, our worker receives a message and begins processing a job. It's 99% done, awaiting some final asynchronous request to return, and it receives SIGTERM. Before it can even clean up the job, the process is killed. The code can handle local cleanup for a job that needs to be restarted, but external services can't really be part of the transaction.
For example, if a report was built and an email was sent, but some third asynchronous operation didn't complete before the SIGTERM, I can't really roll back that transaction. For hardware failures or other rare events, it's understandable that a multi-step transaction could get truncated, but with Heroku's policy it seems I need to assume this will happen at least once a day. Can anyone help me get a better grip on this problem?
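One common mitigation, sketched here only as an illustration (the step names and the bookkeeping store are made up, and this is Python rather than whatever the worker is written in): trap SIGTERM, stop starting new steps, and checkpoint each external effect so a restarted worker can skip what already happened instead of repeating it.

# Hypothetical worker pattern: treat SIGTERM as "checkpoint and exit cleanly".
# done_steps stands in for idempotency bookkeeping in a real datastore, so a
# restarted worker can skip external effects (report built, email sent) that
# already happened before the SIGTERM arrived.
import signal
import time

shutting_down = False

def on_sigterm(signum, frame):
    global shutting_down
    shutting_down = True            # stop starting new steps; finish the current one

signal.signal(signal.SIGTERM, on_sigterm)

done_steps = set()                  # would live in a database in practice

def run_job(job_id, steps):
    for name, action in steps:
        if (job_id, name) in done_steps:    # already ran before a restart: skip
            continue
        action()
        done_steps.add((job_id, name))      # checkpoint right after each external effect
        if shutting_down:
            return False                    # leave the job queued; resume after restart
    return True

if __name__ == "__main__":
    steps = [("build_report", lambda: time.sleep(1)),
             ("send_email", lambda: time.sleep(1)),
             ("third_party_call", lambda: time.sleep(1))]
    finished = run_job("job-42", steps)
    print("finished" if finished else "interrupted; will resume on the next run")

This doesn't make the multi-step work transactional; it just narrows the window so that a restart landing mid-job repeats at most the step that was in flight.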
I have a .NET application which spawns multiple child 'worker processes'. I am using the Windows Job Object API and the JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE setting to ensure the child processes always get killed if the parent process is terminated.
However, I have observed a number of orphaned processes still running on the machine after the parent has been closed. Using Process Explorer, I can see they are correctly still assigned to the Job, and that the Job has the correct 'Kill on Job Close' setting configured.
The documentation for JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE states:
"Causes all processes associated with the job to terminate when the last handle to the job is closed."
This would seem to imply that a handle to the Job was still open somewhere... I did a search for handles to my Job object, and found instances of WmiPrvSE.exe in the results. If I kill the relevant WmiPrvSE.exe process, the outstanding handle to Job is apparently closed, and all the orphaned application processes get terminated as expected.
How come WmiPrvSE.exe has a handle to my Job?
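(For reference, the kill-on-close setup described above boils down to a handful of Win32 calls. The sketch below uses Python and ctypes purely as an illustration of those calls; the question's real code goes through the .NET Job Object API, and notepad.exe is just a stand-in child process.)

# Illustrative only: create a job, set JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE,
# and assign a child process to it. When the last handle to the job closes,
# the child is terminated.
import ctypes
from ctypes import wintypes
import subprocess

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
kernel32.CreateJobObjectW.restype = wintypes.HANDLE

JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE = 0x00002000
JobObjectExtendedLimitInformation = 9

class IO_COUNTERS(ctypes.Structure):
    _fields_ = [(name, ctypes.c_ulonglong) for name in (
        "ReadOperationCount", "WriteOperationCount", "OtherOperationCount",
        "ReadTransferCount", "WriteTransferCount", "OtherTransferCount")]

class JOBOBJECT_BASIC_LIMIT_INFORMATION(ctypes.Structure):
    _fields_ = [("PerProcessUserTimeLimit", wintypes.LARGE_INTEGER),
                ("PerJobUserTimeLimit", wintypes.LARGE_INTEGER),
                ("LimitFlags", wintypes.DWORD),
                ("MinimumWorkingSetSize", ctypes.c_size_t),
                ("MaximumWorkingSetSize", ctypes.c_size_t),
                ("ActiveProcessLimit", wintypes.DWORD),
                ("Affinity", ctypes.c_size_t),
                ("PriorityClass", wintypes.DWORD),
                ("SchedulingClass", wintypes.DWORD)]

class JOBOBJECT_EXTENDED_LIMIT_INFORMATION(ctypes.Structure):
    _fields_ = [("BasicLimitInformation", JOBOBJECT_BASIC_LIMIT_INFORMATION),
                ("IoInfo", IO_COUNTERS),
                ("ProcessMemoryLimit", ctypes.c_size_t),
                ("JobMemoryLimit", ctypes.c_size_t),
                ("PeakProcessMemoryUsed", ctypes.c_size_t),
                ("PeakJobMemoryUsed", ctypes.c_size_t)]

job = kernel32.CreateJobObjectW(None, None)
info = JOBOBJECT_EXTENDED_LIMIT_INFORMATION()
info.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE
kernel32.SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                                 ctypes.byref(info), ctypes.sizeof(info))

child = subprocess.Popen(["notepad.exe"])            # stand-in worker process
kernel32.AssignProcessToJobObject(job, wintypes.HANDLE(int(child._handle)))

As the question notes, the catch is the "last handle" clause: any other process that opens or duplicates a handle to the job keeps the kill-on-close behaviour from firing.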
You may find this blog helpful in sorting out what WmiPrvSE is doing.
WmiPrvSE is the WMI Provider host. That means it hosts WMI providers, which are DLLs. So it's almost surely the case that WmiPrvSE doesn't have a handle to your job, but one of the providers it hosts does. In order to figure out which provider is the culprit, one way is to follow the process here and then see which of the separate processes holds the handle.
Once you have determined which provider is holding the handle you can either try to deduce, based on what system components the provider manages, what kind of query would have a handle to your Job. Or you can just disable the provider, if you don't care about losing access to the management of the components the provider provides.
If you can determine what kind of query would be holding a handle, you may be able to deduce what program is issuing the query. Or maybe the eventlog can tell you that (first link above).
To get more help, please provide additional details in the OP, such as which providers are running in WmiPrvSE, any relevant event log entries, and any other diagnostic info you obtain.
EDIT 1/27/16
One way to find out how WmiPrvSE came to hold your job's handle is to use WinDbg's !htrace extension. You need to run !htrace -enable after you load your .EXE but before you execute it in WinDbg. Then you can break in later and execute !htrace <handle> to see stack traces from when the handle was manipulated. You may want to start with this article on handle implementation.
Our group has long running processes which run daily. The processes are typically started at 9pm on any given day and run until 7pm the next day. Thus they typically run 22hrs/day. They are started by scheduled tasks on servers under a particular generic user ID, and they start and run regardless of whether or not that user ID is logged on. Thus, they are windowless console executables.
The tasks orchestrate computations running on a large server farm. Generally these controlling tasks run uninterrupted for the full 22hrs/day. However, we often have a need to stop and restart these processes. Because they control a multitude of tasks running on our server farm, it is important that they be shut down cleanly, so that they can stop and shut down all the server farm processes. Which brings me to our problem.
The controlling process has been programmed to respond to ctrl-C and ctrl-break signals. This works fine when the process is started manually in a console window, where we can "type" ctrl-C or ctrl-break. However, as mentioned, the processes typically run as windowless scheduled tasks, so we cannot "type" anything into a non-existent console window. Because they are console processes that execute without a logon session, they also must be able to run in a completely windowless environment. So, how do we set up the process to listen for a shut-down signal?
While the process does indeed listen for a ctrl-C and ctrl-break signal, I can see no way to send that signal to a process. This seems to be a fundamental problem in Windows, or am I wrong? I am aware of SendSignal.exe, but so far have been unable to get it to work. It fails as follows:
>SendSignal 26320
Sending signal to process 26320...
CreateRemoteThread failed with 0x00000005.
StartRemoteThread failed with 0x00000005.
0x00000005 == Access is denied.
Trying "taskkill" without -F results in:
>taskkill /PID 24840
ERROR: The process with PID 24840 could not be terminated.
Reason: This process can only be terminated forcefully (with /F option).
All other "kill" functions kill the process immediately rather than sending a signal.
One possibility would be a file-watch-based approach: create a watch for some modification of a specific file. But this is a hack and we would prefer to do it with appropriate signaling. Has anyone solved this issue? It seems to be such basic functionality, and it is certainly trivial to do in a Unix environment. Surely Microsoft has provided SOME mechanism to allow clean shut down of a windowless executable?
I am aware of the thread below, whose question is virtually identical (save for the specification of why the answer is necessary, i.e. why one needs to be able to do this for a windowless, console-less process), but there is no answer there except for "use SendSignal", which, as I said, does not work for us:
Can I send a ctrl-C (SIGINT) to an application on Windows?
There are other similar questions, but no answers as yet.
Any help appreciated.
[Upgrading @Anon's comment to an answer for visibility]
windows-kill worked perfectly and resolved the access-denied errors we hit with SendSignal. Of course, it still has to be run by a sufficiently privileged user.
windows-kill also supports both ctrl-c and ctrl-break signals.
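For completeness, the underlying trick such tools rely on can also be scripted directly when the caller is able to attach to the target's (possibly hidden) console. This is a hedged sketch in Python/ctypes, not a drop-in replacement for windows-kill: it only works from the same session, with sufficient rights, and it signals every process sharing that console.

# Attach to the target's console and raise a Ctrl-C event there.
import ctypes
import sys

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
CTRL_C_EVENT = 0

def send_ctrl_c(pid: int) -> None:
    kernel32.FreeConsole()                        # detach from our own console first
    if not kernel32.AttachConsole(pid):           # attach to the target's console
        raise ctypes.WinError(ctypes.get_last_error())
    kernel32.SetConsoleCtrlHandler(None, True)    # don't let the event terminate this script
    kernel32.GenerateConsoleCtrlEvent(CTRL_C_EVENT, 0)   # group 0 = all processes on this console

if __name__ == "__main__":
    send_ctrl_c(int(sys.argv[1]))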
I have a massive number of shell commands being executed with root/admin privileges through Authorization Services' "AuthorizationExecuteWithPrivileges" call. The issue is that after a while (10-15 seconds, maybe 100 shell commands) the program stops responding with this error in the debugger:
couldn't fork: errno 35
And then while the app is running, I cannot launch any more applications. I researched this issue and apparently it means that there are no more threads available for the system to use. However, I checked using Activity Monitor and my app is only using 4-5 threads.
To fix this problem, I think what I need to do is separate the shell commands into a separate thread (away from the main thread). I have never used threading before, and I'm unsure where to start (no comprehensive examples I could find)
Thanks
As Louis Gerbarg already pointed out, your question has nothing to do with threads. I've edited your title and tags accordingly.
I have a massive number of shell commands being executed with root/admin privileges through Authorization Services' "AuthorizationExecuteWithPrivileges" call.
Don't do that. That function only exists so you can restore the root:admin ownership and the setuid mode bit to the tool that you want to run as root.
The idea is that you should factor out the code that should run as root into a completely separate program from the part that does not need to run as root, so that the part that needs root can have it (through the setuid bit) and the part that doesn't need root can go without it (through not having setuid).
A code example is in the Authorization Services Programming Guide.
The issue is that after a while (10-15 seconds, maybe 100 shell commands) the program stops responding with this error in the debugger:
couldn't fork: errno 35
Yeah. You can only run a couple hundred processes at a time. This is an OS-enforced limit.
It's a soft limit, which means you can raise it—but only up to the hard limit, which you cannot raise. See the output of limit and limit -h (in zsh; I don't know about other shells).
You need to wait for processes to finish before running more processes.
And then while the app is running, I cannot launch any more applications.
Because you are already running as many processes as you're allowed to. That x-hundred-process limit is per-user, not per-process.
I researched this issue and apparently it means that there are no more threads available for the system to use.
No, it does not.
The errno error codes are used for many things. EAGAIN (35, “resource temporarily unavailable”) may mean no more threads when set by a system call that starts a thread, but it does not mean that when set by another system call or function.
The error message you quoted explicitly says that it was set by fork, which is the system call to start a new process, not a new thread. In that context, EAGAIN means “you are already running as many processes as you can”. See the fork manpage.
However, I checked using Activity Monitor and my app is only using 4-5 threads.
See?
To fix this problem, I think what I need to do is separate the shell commands into a separate thread (away from the main thread).
Starting one process per thread will only help you run out of processes much faster.
I have never used threading before …
It sounds like you still haven't, since the function you're referring to starts a process, not a thread.
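To make the "wait for processes to finish" advice concrete, here is a small illustrative sketch (Python rather than the asker's code, with a stand-in workload): check the per-user process limit fork() is running into, and run commands in bounded batches so each child is reaped before more are spawned.

# Inspect the limit behind fork()'s EAGAIN (errno 35) and keep the number of
# live children bounded. RLIMIT_NPROC is available on macOS and Linux.
import resource
import subprocess

soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"per-user process limit: soft={soft} hard={hard}")

commands = [["/bin/echo", str(i)] for i in range(500)]   # stand-in workload
BATCH = 50                                               # stay well under the soft limit
for start in range(0, len(commands), BATCH):
    procs = [subprocess.Popen(cmd) for cmd in commands[start:start + BATCH]]
    for p in procs:
        p.wait()                # reap each child so its slot is returned to the system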
This is not about threads (at least not threads in your application). This is about system resources. Each of those forked processes is consuming at least 1 kernel thread (maybe more), some vnodes, and a number of other things. Eventually the system will not allow you to spawn more processes.
The first limits you hit are administrative limits. The system can support more, but it may cause degraded performance and other issues. You can usually raise these through various mechanisms, like sysctls. In general doing that is a bad idea unless you have a particular (special) workload that you know will benefit from specific tweaks.
Chances are raising those limits will not fix your issues. While adjusting those limits may make you run a little longer, in order to actually fix it you need to figure out why the resources are not being returned to the system. Based on what you described above I would guess that your forked processes are never exiting.
I'm working on a consumer web app that needs to do a long running background process that is tied to each customer request. By long running, I mean anywhere between 1 and 3 minutes.
Here is an example flow. The object/widget doesn't really matter.
Customer comes to the site and specifies object/widget they are looking for.
We search/clean/filter for widgets matching some initial criteria. <-- long running process
Customer further configures more detail about the widget they are looking for.
When the long running process is complete the customer is able to complete the last few steps before conversion.
Steps 3 and 4 aren't really important. I just mention them because we can buy some time while we are doing the long running process.
The environment we are working in is a LAMP stack, currently using PHP. It doesn't seem like a good design to have the long running process take up an Apache thread in mod_php (or a FastCGI process). The Apache layer of our app should be focused on serving up content, not data processing, IMO.
A few questions:
Is our thinking right in that we should separate this "long running" part out of the apache/web app layer?
Is there a standard/typical way to break this out under Linux/Apache/MySQL/PHP (we're open to using a different language for the processing if appropriate)?
Any suggestions on how to go about breaking it out? E.g. do we create a daemon that churns through a FIFO queue?
Edit: Just to clarify, only about 1/4 of the long running process is database centric. We're working on optimizing that part. There is some work that we could potentially do, but we are limited in the amount we can do right now.
Thanks!
Consider providing the search results via AJAX from a web service instead of your application. Presumably you could offload this to another server and let your web application deal with the content as you desire.
Just curious: 1-3 minutes seems like a long time for a lookup query. Have you looked at indexes on the columns you are querying to improve the speed? Or do you need to do some algorithmic process -- perhaps you could perform some of this offline and prepopulate some common searches with hints?
As Jonnii suggested, you can start a child process to carry out background processing. However, this needs to be done with some care:
Make sure that any parameters passed through are escaped correctly
Ensure that more than one copy of the process does not run at once
If you don't enforce the latter, there's nothing stopping a (not even malicious, just impatient) user from hitting reload on the page that kicks it off, eventually starting so many copies that the machine runs out of RAM and grinds to a halt.
So you can use a subprocess, but do it carefully, in a controlled manner, and test it properly.
Another option is to have a daemon running permanently, waiting for requests, which processes them and then records the results somewhere (perhaps in a database).
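To illustrate the "carefully, in a controlled manner" part, here is a sketch of the worker guarding itself with a non-blocking lock (Python for illustration; in PHP the equivalents would be escapeshellarg() for the arguments and flock() for the lock, and the file names here are made up):

# worker.py (hypothetical): the long-running job takes an exclusive lock at
# startup, so extra copies started by an impatient user hitting reload just
# exit instead of piling up until the machine runs out of RAM.
import fcntl
import sys
import time

LOCK_PATH = "/tmp/widget_worker.lock"      # hypothetical lock file

def main(search_term: str) -> None:
    lock = open(LOCK_PATH, "w")
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)   # held until this process exits
    except BlockingIOError:
        print("another worker is already running; exiting")
        return

    # ... the actual long-running search/clean/filter work goes here ...
    time.sleep(5)                           # placeholder for the 1-3 minutes of processing

if __name__ == "__main__":
    main(sys.argv[1])

The web layer then starts this in the background, passing the user's input as a separate argument rather than interpolating it into a shell string, which covers the escaping concern.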
This is the poor man's solution:
exec ("/usr/bin/php long_running_process.php > /dev/null &");
Alternatively you could:
Insert a row into your database with details of the background request, which a daemon can then read and process.
Write a message to a message queue, which a daemon then reads and processes.
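A minimal sketch of the database-as-queue variant (Python and SQLite purely for illustration; the jobs table and its columns are made up): the web request inserts a pending row, and a long-running daemon polls for pending rows, processes them, and writes the result back for the page to pick up later, e.g. via AJAX.

# Minimal "database as job queue" daemon, for illustration only.
import sqlite3
import time

db = sqlite3.connect("jobs.db", isolation_level=None)   # autocommit
db.execute("""CREATE TABLE IF NOT EXISTS jobs (
                  id INTEGER PRIMARY KEY,
                  payload TEXT,
                  status TEXT DEFAULT 'pending',
                  result TEXT)""")

def process(payload):
    return payload.upper()          # stand-in for the 1-3 minute widget search

while True:
    row = db.execute("SELECT id, payload FROM jobs WHERE status = 'pending' "
                     "ORDER BY id LIMIT 1").fetchone()
    if row is None:
        time.sleep(1)               # nothing to do; poll again shortly
        continue
    job_id, payload = row
    db.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (job_id,))
    db.execute("UPDATE jobs SET status = 'done', result = ? WHERE id = ?",
               (process(payload), job_id))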
Here's some discussion on the Java version of this problem.
See java: what are the best techniques for communicating with a batch server
Two important things you might do:
Switch to Java and use JMS.
Read up on JMS but use another queue manager. Unix named pipes, for instance, might be an acceptable implementation.
Java servlets can do background processing. You could do something similar in any web technology with threading support. I don't know about PHP though.
Not a complete answer, but I would think about using AJAX and handing the second step off to something faster than PHP (C, C++, C#), then having a PHP function pick the results off of some stack, most likely just a database.