I have a daemon that launchd runs at system boot (OS X). I need to delay startup of my daemon by 3-5 seconds. The following code executes instantly at boot, but delays properly when run well after boot:
#include <unistd.h>
...
printf("Before delay\n");
unsigned int delay = 3000000;
while ((delay = usleep(delay)) > 0)
{
    ;
}
printf("After delay\n");
If I run it by hand after the system has started, it delays correctly. If I let launchd start it at boot, the console log shows that there is no delay between "Before delay" and "After delay" - they are executed in the same second.
If I could get launchd to start my daemon after a delay at boot, that would be fine as well, but my reading suggests this isn't possible (perhaps I'm wrong?).
Otherwise, I need to understand why usleep isn't working, and what I can do to fix it, or what delay I might be able to use instead that works that early in the boot process.
First things first. Put in some extra code to also print out the current time, rather than relying on launchd to do it.
It's possible that the different flushing behaviour for standard output may be coming into play.
If standard output can be determined to be an interactive device (such as when you run it from the command line), it is line buffered: you'll get the "Before delay" line flushed before the delay.
Otherwise, it's fully buffered, so the flush may not happen until the program exits (or until you reach the buffer size of, for example, 4K). That means launchd may see both lines come out together, after the delay.
Getting the C code to timestamp the lines will tell you if this is the problem, something like:
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    printf("%ld: Before delay\n", (long)time(0));
    unsigned int delay = 3000000;
    while ((delay = usleep(delay)) > 0)
        ;
    printf("%ld: After delay\n", (long)time(0));
    return 0;
}
To see why the buffering may be a problem, consider running that program above as follows:
pax> ./testprog | while read; do echo $(date): $REPLY; done
Tue Jan 31 12:59:24 WAST 2012: 1327985961: Before delay
Tue Jan 31 12:59:24 WAST 2012: 1327985964: After delay
You can see that, because the buffering causes both lines to appear to the while loop when the program exits, they get the same timestamp of 12:59:24 despite the fact they were generated three seconds apart within the program.
In fact, if you change it as follows:
pax> ./testprog | while read; do echo $(date) $REPLY; sleep 10 ; done
Tue Jan 31 13:03:17 WAST 2012 1327986194: Before delay
Tue Jan 31 13:03:27 WAST 2012 1327986197: After delay
you can see the time seen by the "surrounding" program (the while loop or, in your case, launchd) is totally disconnected from the timing inside the program itself.
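If buffering does turn out to be the explanation, one way to sidestep it is to force line buffering (or flush after each printf). A minimal sketch, assuming the rest of your daemon's output also goes through stdout:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Force line buffering regardless of what stdout is connected to.
       This must be done before the first output.  Alternatively, call
       fflush(stdout) after each printf. */
    setvbuf(stdout, NULL, _IOLBF, BUFSIZ);

    printf("Before delay\n");   /* now leaves the process immediately */
    sleep(3);
    printf("After delay\n");
    return 0;
}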
Secondly, usleep is a function that can fail! And it can fail by returning -1, which is very much not greater than zero.
That means, if it fails, your delay will be effectively nothing.
The Single UNIX Specification states, for usleep:
On successful completion, usleep() returns 0. Otherwise, it returns -1 and sets errno to indicate the error.
The usleep() function may fail if: [EINVAL]: The time interval specified 1,000,000 or more microseconds.
That's certainly the case with your code, although it would be hard to explain why it works after boot and not before.
Interestingly, the Mac OS X docs don't list EINVAL, but they do allow for EINTR if the sleep is interrupted externally. So again, something you should check.
You can check those possibilities with something like:
#include <stdio.h>
#include <time.h>
#include <errno.h>
#include <unistd.h>

int main(void)
{
    printf("%ld: Before delay\n", (long)time(0));
    unsigned int delay = 3000000;
    while ((delay = usleep(delay)) > 0)
        ;
    printf("%ld: After delay\n", (long)time(0));
    printf("Delay became %u, errno is %d\n", delay, errno);
    return 0;
}
One other thing I've just noticed: from your code you seem to be assuming that usleep returns the number of microseconds unslept (remaining) and that you can loop until it's all done, but that behaviour is not borne out by the man pages.
I know that nanosleep does this (by updating the passed structure to contain the remaining time rather than returning it) but usleep only returns 0 or -1.
The sleep function acts in that manner, returning the number of seconds yet to go. Perhaps you might look into using that function instead, if possible.
In any case, I would still run that (last) code segment above just so you can ascertain what the actual problem is.
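And if whole seconds are good enough for your 3-5 second delay, the loop you wrote does work with sleep, because sleep genuinely returns the unslept remainder. A minimal sketch:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    unsigned int left = 3;              /* seconds still to wait            */

    /* sleep() returns the unslept remainder, so this loop keeps going
       if a signal cuts the sleep short.                                    */
    while ((left = sleep(left)) > 0)
        ;

    printf("slept the full 3 seconds\n");
    return 0;
}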
According to the old POSIX.1 standard, and as documented in the OS X manual page, usleep returns 0 on success and -1 on error.
If you get an error, it's most likely EINTR (the only error documented in the OS X manual page), meaning it was interrupted by a signal. You had better check errno to be certain, though. As a side note, the Linux manual page states that you can get EINVAL too in some cases:
usec is not smaller than 1000000. (On systems where that is considered an error.)
As another side note, usleep has been obsoleted in the latest POSIX.1 standard, in favor of nanosleep.
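A minimal nanosleep sketch that keeps sleeping across signal interruptions, using its "remaining time" output parameter:

#include <stdio.h>
#include <errno.h>
#include <time.h>

int main(void)
{
    struct timespec request;
    struct timespec remaining;

    request.tv_sec = 3;                 /* 3.5 second delay in total */
    request.tv_nsec = 500000000L;

    /* On interruption, nanosleep stores the unslept part in 'remaining',
       so we simply resume from there. */
    while (nanosleep(&request, &remaining) == -1 && errno == EINTR)
        request = remaining;

    printf("delay finished\n");
    return 0;
}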
Related
I just found this code in the wild:
def _scan_for_self(self):
    win32api.Sleep(2000) # sleep to give time for process to be seen in system table.
    basename = self.cmdline.split()[0]
    pids = win32process.EnumProcesses()
    if not pids:
        UserLog.warn("WindowsProcess", "no pids", pids)
    for pid in pids:
        try:
            handle = win32api.OpenProcess(
                win32con.PROCESS_QUERY_INFORMATION | win32con.PROCESS_VM_READ,
                pywintypes.FALSE, pid)
        except pywintypes.error, err:
            UserLog.warn("WindowsProcess", str(err))
            continue
        try:
            modlist = win32process.EnumProcessModules(handle)
        except pywintypes.error, err:
            UserLog.warn("WindowsProcess", str(err))
            continue
This line caught my eye:
win32api.Sleep(2000) # sleep to give time for process to be seen in system table.
It suggests that if you call EnumProcesses() too fast after starting, you won't see yourself. Is there any truth to this?
There is a race, but it's not the race the code tried to protect against.
A successful call to CreateProcess returns only after the kernel object representing the process has been created and enqueued into the kernel's process list. A subsequent call to EnumProcesses accesses the same list, and will immediately observe the newly created process object.
That is, unless the process object has since been destroyed. This isn't entirely unusual since processes in Windows are initialized in-process. The documentation even makes note of that:
Note that the function returns before the process has finished initialization. If a required DLL cannot be located or fails to initialize, the process is terminated.
What this means is that if a call to EnumProcesses immediately following a successful call to CreateProcess doesn't observe the newly created process, it does so because it was late rather than early. If you are late already then adding a delay will only make you more late.
Which swiftly leads to the actual race here: Process IDs uniquely identify processes only for a finite time interval. Once a process object is gone, its ID is up for grabs, and the system will reuse it at some point. The only reliable way to identify a process is by holding a handle to it.
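For illustration, a minimal sketch of holding onto the handle that CreateProcess hands back, instead of trying to re-find the child by PID later ("child.exe" is just a placeholder command line):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    STARTUPINFOA si = { sizeof(si) };
    PROCESS_INFORMATION pi = { 0 };
    char cmd[] = "child.exe";            /* placeholder command line */

    if (!CreateProcessA(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi))
    {
        printf("CreateProcess failed: %lu\n", GetLastError());
        return 1;
    }

    /* The handle, not the PID, is the stable way to refer to the child:
       even if the child exits and its PID is reused, hProcess still
       identifies that particular process object. */
    WaitForSingleObject(pi.hProcess, INFINITE);

    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return 0;
}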
Now it's anyone's guess what the author of _scan_for_self was trying to accomplish. As written, the code takes more time to do something that's probably altogether wrong [1] anyway.
[1] Turns out my gut feeling was correct. This is just your average POSIX developer who, in the process of learning that POSIX is insufficient, would rather call out Microsoft than actually use an all-around superior API.
The documentation for EnumProcesses (Win32 API, EnumProcesses function) does not mention anything about a delay needed to see the current process in the list it returns.
Microsoft's example of how to use EnumProcesses to enumerate all running processes (Enumerating All Processes) also does not contain any delay before calling EnumProcesses.
A small test application I created in C++ (see below) always reports that the current process is in the list (tested on Windows 10):
#include <Windows.h>
#include <Psapi.h>
#include <iostream>
#include <vector>

const DWORD MAX_NUM_PROCESSES = 4096;
DWORD aProcesses[MAX_NUM_PROCESSES];

int main(void)
{
    // Get the list of running process Ids:
    DWORD cbNeeded;
    if (!EnumProcesses(aProcesses, MAX_NUM_PROCESSES * sizeof(DWORD), &cbNeeded))
    {
        return 1;
    }

    // Check if current process is in the list:
    DWORD curProcId = GetCurrentProcessId();
    bool bFoundCurProcId{ false };
    DWORD numProcesses = cbNeeded / sizeof(DWORD);
    for (DWORD i = 0; i < numProcesses; ++i)
    {
        if (aProcesses[i] == curProcId)
        {
            bFoundCurProcId = true;
        }
    }
    std::cout << "bFoundCurProcId: " << bFoundCurProcId << std::endl;
    return 0;
}
Note: I am aware that the fact that the program reported the expected result does not mean that there is no race. Maybe I just couldn't catch it manifest. But trying to run code like that can give you a hint sometimes (especially if the result would have been that there is a race).
The fact that I never had a problem running this test (did it many times), together with the lack of any mention of the need for a delay in Microsoft's documentation make me believe that it is not required.
My conclusion is that either there is a unique issue when using it from Python (I doubt it), or the code you found is doing something unnecessary.
There is no race.
EnumProcesses calls an NT API function that switches to kernel mode to walk the linked list of processes. Your own process has been added to the list before it starts running.
We've been using CygWin (/usr/bin/x86_64-w64-mingw32-gcc) to generate Windows 64-bit executable files and it had been working fine through yesterday. Today it stopped working in a bizarre way: it "caches" standard output until the program ends. I wrote a six-line example that does the same thing. Since we use the code in batch, I wouldn't worry, except that when I run a test case with the now-strangely-caching executable, it opens the output files, ends early, and does not fill them with data. (The same code on Linux works fine, but these guys are using Windows.) I know it's not gorgeous code, but it demonstrates my problem: it prints the numbers "1 2 3 4 5 6 7 8 9 10" only after I press Enter.
#include <stdio.h>

main()
{
    char q[256];
    int i;

    for (i = 1; i <= 10; i++)
        printf("%d ", i);
    gets(q);
    printf("\n");
}
Does anybody know enough CygWin to help me out here? What do I try? (I don't know how to get version numbers--I did try to get them.) I found a 64-bit cygwin1.dll in /cygdrive/c/cygwin64/bin and that didn't help a bit. The 32-bit gcc compilation works fine, but I need 64-bit to work. Any suggestions will be appreciated.
Edit: we found and corrected an unexpected error in the original code that caused the program not to populate the output files. At this point, the remaining problem is that cygwin won't show the output of the program.
For months, the 64-bit executable has properly generated the expected output, just as the 32-bit version did. Just today, it has started exhibiting the "caching" behavior described above. The program sends many hundreds of lines with many newline characters through stdout. Now, when the 64-bit executable is created as above, none of these lines are shown until the program completes and the entire output is printed at once. Can anybody provide insight into this problem?
This is quite normal. printf outputs to stdout, which is a FILE* and is normally line buffered when connected to a terminal. This means you will not see any output until you write a newline, or until the internal buffer of the stdout FILE* is full (a common buffer size is 4096 bytes).
If you write to a file or pipe, output might be fully buffered, in which case output is flushed when the internal buffer is full and not when you write a newline.
In all cases, the buffers of a FILE* are flushed when you call fflush(), when you call fclose(), or when the program ends normally.
Your program will behave the same on windows/cygwin as on linux.
You can add a call to fflush(stdout) to see the output immediately.
for (i = 1; i <= 10; i++) {
    printf("%d ", i);
    fflush(stdout);
}
Also, do not use the gets() function.
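A bounded fgets replacement for the gets(q) line, reusing the same 256-byte buffer, could look like this:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char q[256];

    if (fgets(q, sizeof q, stdin) != NULL)
        q[strcspn(q, "\n")] = '\0';   /* drop the trailing newline, if any */

    printf("read: %s\n", q);
    return 0;
}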
If your real program "ends early" and does not write the data to the text files it's supposed to, it may be that it crashes due to a bug of yours before it finishes, in which case the buffered output will not be flushed. Or, less likely, you call the _exit() function, which terminates the program without flushing the FILE* buffers (in contrast to the exit() function).
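To see that last point for yourself, a tiny sketch: with _exit() the buffered text never appears, whereas exit(0) (or simply returning from main) would flush it:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("this text is sitting in the stdio buffer");  /* no newline */

    /* exit(0) (or returning from main) would flush the buffer and the
       text above would appear.  _exit() skips that cleanup, so nothing
       is written to stdout at all. */
    _exit(0);
}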
Can anyone tell me how to use wake_up() in G-WAN?
// tell G-WAN when to run a script again (for the same request)
// type: WK_MS | WK_FD
#define WK_MS 1 // milliseconds
#define WK_FD 2 // file descriptor
void wake_up(char *argv[], int delay_or_fd, int type);
Is it used to replace sleep()?
Look at the examples using these functions - be careful though, the last time I tested them, they didn't work (this has probably been fixed already or might have been a usage error on my part, but nevertheless if you're going to use them, try the examples first and see if they work).
In a nutshell:
with WK_MS, this behaves much like the sleep function, with the difference that your function is called again after the time has elapsed (as opposed to continuing where you called it), and execution continues after the wake_up call. So it's more like "execute me again after X ms".
with WK_FD, your script should be called again as soon as there is new data on the provided file descriptor (useful for, e.g., tailing a self-built log mechanism, or theoretically for realtime communication like WebSockets, but I never got CLIENT_SOCKET working with this, so check carefully beforehand that whatever you pass really is a file descriptor)
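For what it's worth, here is a rough, unverified sketch of the WK_MS case, based only on the prototype and macros quoted in the question and the behaviour described above. The header name gwan.h, the HTTP-status return value, and the call counter are assumptions on my part, not something I have tested:

// Unverified sketch: assumes a G-WAN C servlet whose main() returns the
// HTTP status code, and that "gwan.h" declares wake_up(), WK_MS and WK_FD
// as quoted in the question above.
#include "gwan.h"

int main(int argc, char *argv[])
{
    static int calls = 0;        // hypothetical counter kept across re-entries

    if (++calls < 5)
    {
        // Ask G-WAN to run this script again in 1000 ms for the same request;
        // execution continues past this call and main() returns normally.
        wake_up(argv, 1000, WK_MS);
    }

    return 200;                  // HTTP status code sent to the client
}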
Consider the following source code, which is fully POSIX compliant:
#include <stdio.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    pthread_cond_t c;
    pthread_mutex_t m;
    char printTime[UCHAR_MAX];

    pthread_mutex_init(&m, NULL);
    pthread_cond_init(&c, NULL);

    for (;;) {
        struct tm *tm;
        struct timeval tv;
        struct timespec ts;

        gettimeofday(&tv, NULL);
        printf("sleep (%ld)\n", (long)tv.tv_sec);
        sleep(3);

        tm = gmtime(&tv.tv_sec);
        strftime(printTime, UCHAR_MAX, "%Y-%m-%d %H:%M:%S", tm);
        printf("%s (%ld)\n", printTime, (long)tv.tv_sec);

        ts.tv_sec = tv.tv_sec + 5;
        ts.tv_nsec = tv.tv_usec * 1000;

        pthread_mutex_lock(&m);
        pthread_cond_timedwait(&c, &m, &ts);
        pthread_mutex_unlock(&m);
    }
    return 0;
}
It prints the current system date every 5 seconds; however, it sleeps 3 seconds between getting the current system time (gettimeofday) and the condition wait (pthread_cond_timedwait).
Right after it prints "sleep (...)", try setting the system clock two days into the past. What happens? Well, instead of waiting 2 more seconds on the condition as it usually does, pthread_cond_timedwait now waits for two days and 2 seconds.
How do I fix that?
How can I write POSIX compliant code, that does not break when the user manipulates the system clock?
Please keep in mind that the system clock might change even without user interaction (e.g. an NTP client might update the clock automatically once a day). Setting the clock into the future is no problem: it only causes the wait to wake up early, which is usually harmless and easy to detect and handle. Setting the clock into the past (e.g. because it was running ahead and NTP corrected it), however, can cause a big problem.
PS:
Neither pthread_condattr_setclock() nor CLOCK_MONOTONIC exists on my system. They are mandatory in the POSIX 2008 specification (part of "Base"), but most systems still only follow the POSIX 2004 specification as of today, and in the POSIX 2004 specification these two were optional (Advanced Realtime Extension).
Interesting, I've not encountered that behaviour before but, then again, I'm not in the habit of mucking about with my system time that much :-)
Assuming you're doing that for a valid reason, one possible (though kludgy) solution is to have another thread whose sole purpose is to periodically kick the condition variable to wake up any threads so affected.
In other words, something like:
while (1) {
    sleep(10);
    pthread_cond_signal(&condVar);
}
Your code that's waiting for the condition variable to be kicked should be checking its predicate anyway (to take care of spurious wakeups) so this shouldn't have any real detrimental effect on the functionality.
It's a slight performance hit but once every ten seconds shouldn't be too much of a problem. It's only really meant to take care of the situations where (for whatever reason) your timed wait will be waiting a long time.
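For reference, the waiting side is the usual predicate loop (work_ready, m and the deadline are placeholder names here); a spurious or "kicked" wakeup simply re-enters the wait:

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  condVar = PTHREAD_COND_INITIALIZER;
static bool work_ready = false;          /* the predicate (placeholder) */

void wait_for_work(const struct timespec *deadline)
{
    pthread_mutex_lock(&m);
    while (!work_ready)                  /* re-check after every wakeup */
    {
        if (pthread_cond_timedwait(&condVar, &m, deadline) == ETIMEDOUT)
            break;                       /* deadline reached: give up   */
    }
    pthread_mutex_unlock(&m);
}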
Another possibility is to re-engineer your application so that you don't need timed waits at all.
In situations where threads need to be woken for some reason, it's invariably by another thread which is perfectly capable of kicking a condition variable to wake one (or broadcasting to wake the lot of them).
This is very similar to the kicking thread I mentioned above but more as an integral part of your architecture than a bolt-on.
You can defend your code against this problem. One easy way is to have one thread whose sole purpose is to watch the system clock. You keep a global linked list of condition variables, and if the clock watcher thread sees a system clock jump, it broadcasts every condition variable on the list. Then, you simply wrap pthread_cond_init and pthread_cond_destroy with code that adds/removes the condition variable to/from the global linked list. Protect the linked list with a mutex.
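A compressed sketch of that idea, using a single global condition variable instead of the linked list for brevity (the mutex-protected list plus wrapped init/destroy described above is the more general form):

#include <pthread.h>
#include <sys/time.h>
#include <unistd.h>

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  g_cond = PTHREAD_COND_INITIALIZER;

/* Watcher thread: samples the wall clock once a second and broadcasts the
   condition variable whenever the clock is seen to jump backwards, so that
   any pthread_cond_timedwait() callers wake up and recompute their deadline. */
static void *clock_watcher(void *arg)
{
    struct timeval last, now;
    (void)arg;

    gettimeofday(&last, NULL);
    for (;;) {
        sleep(1);
        gettimeofday(&now, NULL);
        if (now.tv_sec < last.tv_sec) {          /* clock went backwards */
            pthread_mutex_lock(&g_lock);
            pthread_cond_broadcast(&g_cond);
            pthread_mutex_unlock(&g_lock);
        }
        last = now;
    }
    return NULL;
}

The waiters then use g_cond/g_lock for their timed waits and re-check both their predicate and the current time after every wakeup.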
On my Windows 7 box, this simple program causes the memory use of the application to creep up continuously, with no upper bound. I've stripped out everything non-essential, and it seems clear that the culprit is the Microsoft Iphlpapi function "GetIpAddrTable()". On each call, it leaks some memory. In a loop (e.g. checking for changes to the network interface list), it is unsustainable. There seems to be no async notification API which could do this job, so now I'm faced with possibly having to isolate this logic into a separate process and recycle the process periodically -- an ugly solution.
Any ideas?
// IphlpLeak.cpp - demonstrates that GetIpAddrTable leaks memory internally: run this and watch
// the memory use of the app climb up continuously with no upper bound.
#include <stdio.h>
#include <windows.h>
#include <assert.h>
#include <Iphlpapi.h>

#pragma comment(lib, "Iphlpapi.lib")

void testLeak() {
    static unsigned char buf[16384];
    DWORD dwSize(sizeof(buf));
    if (GetIpAddrTable((PMIB_IPADDRTABLE)buf, &dwSize, false) == ERROR_INSUFFICIENT_BUFFER)
    {
        assert(0); // we never hit this branch.
        return;
    }
}

int main(int argc, char* argv[]) {
    for (int i = 0; true; i++) {
        testLeak();
        printf("i=%d\n", i);
        Sleep(1000);
    }
    return 0;
}
#Stabledog:
I've run your example, unmodified, for 24 hours but did not observe that the program's Commit Size increased indefinitely. It always stayed below 1024 kilobytes. This was on Windows 7 (32-bit, and without Service Pack 1).
Just for the sake of completeness, what happens to memory usage if you comment out the entire if block and the sleep? If there's no leak there, then I would suggest you're correct as to what's causing it.
Worst case, report it to MS and see if they can fix it - you have a nice simple test case to work from which is more than what I see in most bug reports.
Another thing you may want to try is to check the error code against NO_ERROR rather than a specific error condition. If you get back a different error from ERROR_INSUFFICIENT_BUFFER, there may be a leak in that path:
DWORD dwRetVal = GetIpAddrTable((PMIB_IPADDRTABLE)buf, &dwSize, false);
if (dwRetVal != NO_ERROR) {
    printf("ERROR: %lu\n", dwRetVal);
}
I've been all over this issue now: it appears that there is no acknowledgment from Microsoft on the matter, but even a trivial application grows without bounds on Windows 7 (not XP, though) when calling any of the APIs which retrieve the local IP addresses.
So the way I solved it -- for now -- was to launch a separate instance of my app with a special command-line switch that tells it "retrieve the IP addresses and print them to stdout". I scrape stdout in the parent app, the child exits and the leak problem is resolved.
But it wins "dang ugly solution to an annoying problem", at best.
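For what it's worth, a stripped-down sketch of that workaround using _popen; the --print-ips switch and the one-address-per-line output format are placeholders for whatever your app actually does:

/* Hedged sketch of the child-process workaround described above.  The
   switch name "--print-ips" is a placeholder, and the child is assumed
   to print one IP address per line to stdout and then exit. */
#include <stdio.h>
#include <string.h>

static void readAddressesFromChild(const char *exePath)
{
    char cmd[512];
    char line[256];
    FILE *child;

    _snprintf(cmd, sizeof(cmd), "\"%s\" --print-ips", exePath);
    cmd[sizeof(cmd) - 1] = '\0';

    child = _popen(cmd, "rt");          /* run the child, read its stdout */
    if (!child)
        return;

    while (fgets(line, sizeof(line), child)) {
        line[strcspn(line, "\r\n")] = '\0';
        printf("address: %s\n", line);  /* or store it wherever needed */
    }
    _pclose(child);                     /* child exits; its leak dies with it */
}

int main(int argc, char *argv[])
{
    readAddressesFromChild(argv[0]);
    return 0;
}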