Bash scripting: reader writer lock - bash

Imagine a network of several *nix machines. A dedicated node stores files and periodically schedules Task A, which modifies these files. Each of the other nodes schedules Task B, which syncs (rsync) those files to local storage.
Task A can take a considerable amount of time, and the file collection needs to be in a consistent state on all nodes. Thus Task B shouldn't run while Task A is running.
A possible solution for this is to use a reader-writer lock. Task A and Task B would put a write and a read lock on the resource respectively.
I wonder how we can implement such a locking mechanism with Unix shell scripting.

The usual way of doing this is with the flock utility, which is part of the util-linux package. FreeBSD and NetBSD packages are also available, as I understand it, and probably others. (For Mac OS X, see this question.)
The flock command can do both read ("shared") locks and write ("exclusive") locks. It is based on the flock(2) system call, and is consequently co-operative locking (aka advisory locking), but in most applications that will work fine (but see below for the case where the file is remote).
There are usage examples in the linked man page above. The simplest usage case is
flock /tmp/lockfile /usr/local/bin/do_the_update
flock /tmp/lockfile -s /usr/local/bin/do_the_rsync
both of which obtain a lock on /tmp/lockfile and then execute the specified command (presumably a shell script). The first command obtains an exclusive lock; I could have made that explicit with the -x option. The second command obtains a shared lock.
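As a rough sketch of how this maps onto the scenario in the question (the script names, the lockfile path and the -w timeout are placeholders, not anything prescribed):

# Task A (writer): exclusive lock, wait up to 600 seconds for readers to finish
flock -x -w 600 /var/lock/files.lock /usr/local/bin/do_the_update

# Task B (reader): shared lock, several readers may hold it at the same time
flock -s -w 600 /var/lock/files.lock /usr/local/bin/do_the_rsync

flock exits non-zero (1 by default) if the timeout expires, so a cron job can simply skip that cycle and try again on the next run.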
Since the question actually involves the need for a network lock, it is necessary to point out that flock() may not be reliable on a networked filesystem. Normally, the target file should always be local.
Even in a non-distributed application, you need to consider the possibilities of failure. Suppose you were rsync'ing locally to create a copy, for example. If the host crashes while the rsync is in progress, you will end up with an incomplete or corrupt copy. rsync can recover from that, but there is no certainty that when the host restarts, the rsync will initiate before the files are modified. That shouldn't be a problem, but you definitely need to take it into account.
In a distributed application, the situation is more complex because the entire system rarely fails. You can have independent failure of the different servers or of the network itself.
Advisory locking is not persistent. If the lockfile's host crashes with the lock held and restarts, the lock will not be held after the restart. On the other hand, if one of the remote servers which holds the lock crashes and restarts, it may not be aware that it is holding the lock, in which case the lock will never be released.
If both servers were 100% aware of each other's state, this wouldn't be a problem, but it is very difficult to distinguish network failure from host failure.
You will need to evaluate the risks. As with the local case, if the fileserver crashes while an rsync is in progress, it may restart and immediately start modifying the files. If the remote rsyncs did not fail while the fileserver was down, they will continue to attempt to synchronize, and the resulting copy will be corrupt. With rsync, this should resolve itself on the next sync cycle, but in the interim you have a problem. You will need to decide how serious this is.
You can prevent the fileserver from starting the mutator on startup by using persistent locks. Each rsync server creates its own lockfile on the host before starting the rsync (and does not start the rsync until it is known that the file exists) and deletes the file before releasing the read lock. If an rsync server restarts and its indicator file exists, it knows that there was a crash during the rsync, so it must delete the indicator file and restart the rsync.
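A sketch of that protocol on one rsync client, assuming the indicator files live in a directory on the fileserver and leaving the read-lock acquisition itself out for brevity (all names and paths are placeholders):

#!/bin/bash
# Persistent "sync in progress" indicator for this client, stored on the
# fileserver so its startup scripts can see it and hold off the mutator.
ME=$(hostname)
INDICATOR="/var/lock/rsync-clients/$ME"

# If our indicator survived a previous run, we crashed mid-sync: remove it
# and fall through to redo the sync from scratch.
if ssh fileserver "test -e '$INDICATOR'"; then
    ssh fileserver "rm -f '$INDICATOR'"
fi

ssh fileserver "touch '$INDICATOR'"       # announce: sync starting
rsync -a fileserver:/data/ /local/data/
ssh fileserver "rm -f '$INDICATOR'"       # announce: sync finished cleanly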
This will work fine most of the time, but it can fail if an rsync server crashes during the rsync and never restarts, or restarts only after a long time. (Or, equivalently, if network failure isolates the rsync server for a long time.) In these cases, it is likely that manual intervention will be necessary. It would be useful to have a watchdog process running on the fileserver which alerts an operator if the read lock has been held for too long, for some definition of "too long".
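A watchdog along those lines could be a cron job on the fileserver; this sketch assumes the indicator-file scheme above and treats the age of the oldest indicator as a proxy for how long the read lock has been held (the threshold and alert address are placeholders):

#!/bin/bash
# Alert the operator if any client's indicator file is older than the threshold.
MAX_AGE_MINUTES=120
stale=$(find /var/lock/rsync-clients -type f -mmin +"$MAX_AGE_MINUTES")
if [ -n "$stale" ]; then
    printf 'Read lock apparently held too long by:\n%s\n' "$stale" \
        | mail -s 'rsync read-lock watchdog' operator@example.com
fi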

Related

Is there a way to know if the current process holds a specific file lock?

We have a series of applications running on Windows that use file locking to protect concurrent access to their data (shared files on a network drive).
Sometimes, it seems one of these processes fails to release one of these locks and everything freezes until the process is killed. Finding out who holds that lock is not always trivial (it needs an admin to go on the file server and check open network files, go to the workstation, find the process and kill it).
We have a message queue system between the applications that is serviced by a background thread, so, in theory, it would be possible to send out a message to every process asking whether it holds a lock on a specific file and, if it does, maybe take an action (like killing the process if the lock has been held longer than a few seconds).
So, the question is: is there a way for a thread to know if a different thread of the same process holds a lock (LockFile) against a given file?
I'm not sure if there is an API to query this, but a process can query itself with the LockFileEx function:
A shared lock can overlap an exclusive lock if both locks were created using the same file handle. When a shared lock overlaps an exclusive lock, the only possible access is a read by the owner of the locks.
The other thread could query and see if it can get shared access.
If you control the file format you could write the computer name and process id to the start of the file every time you take the lock. File memory mappings can view the file contents even while it is locked.

When will ruby's ensure not run?

I have a server running in an eventmachine reactor which listens to heartbeats from users to tell if they are online. It marks the users as online and offline appropriately, when it starts/stops receiving the heartbeat.
I want to wrap it all in an ensure block to mark all currently online users offline when it exits, but I'm unsure how reliable that would be.
Under what conditions could a process exit without running the ensure blocks wrapping the current execution context?
Quite a few, for example:
being killed with kill -9
segmentation faults etc (eg bugs in ruby itself or in native extensions)
power failures
the system as a whole crashing (eg kernel/driver bugs, hardware failures etc)
A network failure wouldn't stop your ensure block from running but might mean that it can't update whatever datastore stores these statuses.

Reducing SSH connections

Okay, so I have a shell script for transferring some files to a remote host using rsync over ssh.
However, in addition to the transfer I need to do some housekeeping beforehand and afterwards, which involves an additional ssh with a command, and a call to scp to transfer some extra data not included in the rsync transfer (I generate this data while the transfer is taking place).
As you can imagine this currently results in a lot of ssh sessions starting and stopping, especially since the housekeeping operations are actually very quick (usually). I've verified on the host that this shows up as lots of SSH connects and disconnects which, although minor compared to the actual transfer, seems pretty wasteful.
So what I'm wondering is: is there a way that I can just open an ssh connection and then leave it connected until I'm done with it? i.e., my initial ssh housekeeping operation would leave its connection open so that when rsync (and afterwards scp) run, they can just do their thing using that previously opened connection.
Is such a thing even possible in a shell script? If so, any pointers on how to handle errors (i.e., ensuring the connection is closed once it isn't needed) would be appreciated as well!
It's possible. The easiest way doesn't even require any programming. See https://unix.stackexchange.com/questions/48632/cant-share-an-ssh-connection-with-rsync - the general idea is to use SSH connection reuse to get your multiple SSHs to share one session, then get rsync to join in as well.
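A sketch of the connection-sharing approach, assuming OpenSSH with ControlMaster support (the host name, socket path and remote commands are placeholders):

# Open a master connection and keep it alive for later commands to reuse.
ssh -o ControlMaster=auto \
    -o ControlPath=~/.ssh/ctl-%r@%h:%p \
    -o ControlPersist=10m \
    remotehost 'do-housekeeping-before'

# rsync and scp piggyback on the same TCP connection via the control socket.
rsync -a -e 'ssh -o ControlPath=~/.ssh/ctl-%r@%h:%p' ./data/ remotehost:/data/
scp -o ControlPath=~/.ssh/ctl-%r@%h:%p extra.tar remotehost:/data/

# Final housekeeping, then shut the master connection down explicitly.
ssh -o ControlPath=~/.ssh/ctl-%r@%h:%p remotehost 'do-housekeeping-after'
ssh -O exit -o ControlPath=~/.ssh/ctl-%r@%h:%p remotehost

The same options can instead go in a Host block in ~/.ssh/config, so plain ssh, scp and rsync pick them up without any extra flags.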
The hard way is to use libssh2 and program everything yourself. This will be a lot more work, and it seems in your case has nothing to recommend it. For more complex scenarios, it's useful.

How to guarantee file integrity without mandatory file lock on OS X?

AFAIK, OS X is a BSD derivation, which doesn't have actual mandatory file locking. If so, it seems that I have no way to prevent write access by other programs even while I am writing a file.
How can I guarantee file integrity in such an environment? I don't care about integrity after my program has exited, because that's the user's responsibility. But at least, I think I need some kind of guarantee while my program is running.
How do other programs guarantee file content integrity without mandatory locking? Especially database programs. If there's common technique or recommended practice, please let me know.
Update
I am looking for this for the data layer of a GUI application for non-engineer users. Currently, my program has these constraints.
The data is too big to fit in RAM, and even too big to copy temporarily. So it cannot be read/written atomically, and has to be used directly from disk while the program is running.
It is a long-running professional GUI content editor used by humans who are not engineers. Though the users are not engineers, they can still access the file simultaneously with Finder or other programs, so they can accidentally delete or overwrite the file currently in use. The problem is that users don't understand what is actually happening, and they expect the program to handle file integrity at least while it is running.
I think the only way to guarantee the file's integrity in this situation is:
Open the file with a system-wide exclusive mandatory lock. Now the file is the program's responsibility.
Check for integrity.
Use the file like external memory while the program is running.
Write all the modifications.
Unlock. Now the file is the user's responsibility.
Because OS X lacks a system-wide mandatory lock, I don't know what to do for this. But I still believe there's a way to achieve this kind of file integrity, which I just don't know. And I want to know how everybody else handles this.
This question is not about my programming errors; that's another problem. The current problem is protecting the data from other programs which don't respect advisory file locks. Also, the users are usually root and the program runs as the same user, so trivial Unix file permissions are not useful.
You have to look at the problem that you are trying to actually solve with mandatory locking.
File content integrity is not guaranteed by mandatory locking: unless you keep your file locked 24/7, integrity will still depend on all processes observing file format/access conventions (and can still fail due to hard drive errors etc.).
What mandatory locking protects you against is programming errors that (by accident, not out of malice) fail to respect the proper locking protocols. At the same time, that protection is only partial, since failure to acquire a lock (mandatory or not) can still lead to file corruption. Mandatory locking can also reduce possible concurrency more than needed. In short, mandatory locking provides more protection than advisory locking against software defects, but the protection is not complete.
One solution to the problem of accidental corruption is to use a library that is aggressively tested for preserving data integrity. One such library (there are others) is SQLite (see also here and here for more information). On OS X, Core Data provides an abstraction layer over SQLite as a data store. Obviously, such an approach should be complemented by replication/backup so that you have protection against other causes of data corruption where the storage layer cannot help you (media failure, accidental deletion).
Additional protection can be gained by restricting file access to a database and allowing access only through a gateway (such as a socket or messaging library). Then you will just have a single process running that merely acquires a lock (and never releases it). This setup is fairly easy to test; the lock is merely to prevent having more than one instance of the gateway process running.
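A sketch of that single-instance guard, reusing the flock utility from the first answer (the lockfile path and gateway command are made up for illustration, and flock would need to be installed separately on OS X, as noted above):

#!/bin/bash
# Run at most one instance of the gateway; the lock lives as long as the
# process and is released automatically by the kernel if it crashes.
exec 200>/var/run/data-gateway.lock
if ! flock -n -x 200; then
    echo 'gateway already running' >&2
    exit 1
fi
exec /usr/local/bin/data-gateway     # inherits fd 200, so the lock stays held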
One simple solution would be to hide the file from the user until your program is done using it.
There are various ways to hide files. It depends on whether you're modifying an existing file that was previously visible to the user or creating a new file. Even if modifying an existing file, it might be best to create a hidden working copy and then atomically exchange its contents with the file that's visible to the user.
One approach to hiding a file is to create it in a location which is not normally visible to users. (That is, it's not necessary that the file be totally impossible for the user to reach, just out of the way so that they won't stumble on it.) You can obtain such a location using -[NSFileManager URLForDirectory:inDomain:appropriateForURL:create:error:], passing NSItemReplacementDirectory and NSUserDomainMask for the first two parameters. See the -replaceItemAtURL:withItemAtURL:backupItemName:options:resultingItemURL:error: method for how to atomically move the file into its final place.
You can set a file to be hidden using various APIs. You can use -[NSURL setResourceValue:forKey:error:] with the key NSURLIsHiddenKey. You can use the chflags() system call to set UF_HIDDEN. The old Unix standby is to use a filename starting with a period ('.').
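From the shell, the same flag can be toggled with the chflags utility (the path is just an example):

chflags hidden /path/to/working-copy     # sets UF_HIDDEN; Finder stops showing it
chflags nohidden /path/to/working-copy   # make it visible again when finished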
Here are some details about this topic:
https://developer.apple.com/library/ios/documentation/FileManagement/Conceptual/FileSystemProgrammingGuide/FileCoordinators/FileCoordinators.html
Now I think the basic policy on OSX is something like this.
Always allow access by any process.
Always be prepared for shared data file mutation.
Be notified when another process mutates the file content, and respond appropriately. For example, you can display an error to end users if another process is trying to access the file. Users will then learn that's bad and will not do it again.

I can't run more than 100 processes

I have a massive number of shell commands being executed with root/admin privileges through Authorization Services' "AuthorizationExecuteWithPrivileges" call. The issue is that after a while (10-15 seconds, maybe 100 shell commands) the program stops responding with this error in the debugger:
couldn't fork: errno 35
And then while the app is running, I cannot launch any more applications. I researched this issue and apparently it means that there are no more threads available for the system to use. However, I checked using Activity Monitor and my app is only using 4-5 threads.
To fix this problem, I think what I need to do is separate the shell commands into a separate thread (away from the main thread). I have never used threading before, and I'm unsure where to start (no comprehensive examples I could find)
Thanks
As Louis Gerbarg already pointed out, your question has nothing to do with threads. I've edited your title and tags accordingly.
I have a massive number of shell commands being executed with root/admin privileges through Authorization Services' "AuthorizationExecuteWithPrivileges" call.
Don't do that. That function only exists so you can restore the root:admin ownership and the setuid mode bit to the tool that you want to run as root.
The idea is that you should factor out the code that should run as root into a completely separate program from the part that does not need to run as root, so that the part that needs root can have it (through the setuid bit) and the part that doesn't need root can go without it (through not having setuid).
A code example is in the Authorization Services Programming Guide.
The issue is that after a while (10-15 seconds, maybe 100 shell commands) the program stops responding with this error in the debugger:
couldn't fork: errno 35
Yeah. You can only run a couple hundred processes at a time. This is an OS-enforced limit.
It's a soft limit, which means you can raise it—but only up to the hard limit, which you cannot raise. See the output of limit and limit -h (in zsh; I don't know about other shells).
You need to wait for processes to finish before running more processes.
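As a rough shell-level illustration of that principle (the commands file and batch size are invented for the example):

#!/bin/bash
# bash's equivalent of zsh's "limit": show the per-user process ceiling.
ulimit -u

# Run commands in bounded batches so the process limit is never exhausted.
BATCH=50
count=0
while read -r cmd; do
    eval "$cmd" &
    count=$((count + 1))
    if [ "$count" -ge "$BATCH" ]; then
        wait             # block until the whole batch has exited
        count=0
    fi
done < commands.txt
wait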
And then while the app is running, I cannot launch any more applications.
Because you are already running as many processes as you're allowed to. That x-hundred-process limit is per-user, not per-process.
I researched this issue and apparently it means that there are no more threads available for the system to use.
No, it does not.
The errno error codes are used for many things. EAGAIN (35, “resource temporarily unavailable”) may mean no more threads when set by a system call that starts a thread, but it does not mean that when set by another system call or function.
The error message you quoted explicitly says that it was set by fork, which is the system call to start a new process, not a new thread. In that context, EAGAIN means “you are already running as many processes as you can”. See the fork manpage.
However, I checked using Activity Monitor and my app is only using 4-5 threads.
See?
To fix this problem, I think what I need to do is separate the shell commands into a separate thread (away from the main thread).
Starting one process per thread will only help you run out of processes much faster.
I have never used threading before …
It sounds like you still haven't, since the function you're referring to starts a process, not a thread.
This is not about threads (at least not threads in your application). This is about system resources. Each of those forked processes is consuming at least 1 kernel thread (maybe more), some vnodes, and a number of other things. Eventually the system will not allow you to spawn more processes.
The first limits you hit are administrative limits. The system can support more, but it may cause degraded performance and other issues. You can usually raise these through various mechanisms, like sysctls. In general doing that is a bad idea unless you have a particular (special) workload that you know will benefit from specific tweaks.
Chances are raising those limits will not fix your issues. While adjusting those limits may make you run a little longer, in order to actually fix it you need to figure out why the resources are not being returned to the system. Based on what you described above I would guess that your forked processes are never exiting.
