Can I guarantee an atomic append in Ruby?

Based on Is file append atomic in UNIX? and other sources, it looks like on modern Linux, I can open a file in append mode and write small chunks (< PIPE_BUF) to it from multiple processes without worrying about tearing.
Do those guarantees extend to Ruby's syswrite? Specifically, for this code:
f = File.new('...', 'a')
f.syswrite("short string\n")
Can I expect the write not to interleave with writes from other processes to the same file? Or is there some buffering / potential splitting I'm not aware of yet?
Assuming ruby >= 2.3

I recently researched this very topic in order to implement the File appender in Rackstash.
You can find tests for this in the specs, which I originally adapted from a blog post on this topic. Unfortunately, that post's code is inconclusive, since the author does not write to the file directly but through a pipe. Please read the comments there.
With (1) modern operating systems and (2) their usual local filesystems, the OS guarantees that concurrent appends from multiple processes do not interleave data.
The main points here are:
You need a reasonably modern operating system. Very old (or exotic) systems had lesser guarantees. Here, you might have to lock the file explicitly with e.g. File#flock.
You need to use a compatible filesystem. Most local filesystems like HFS, APFS, NTFS, and the usual Linux filesystems like ext3, ext4, btrfs, ... should be safe. SMB is one of the few network filesystems that also guarantees this. NFS and most FUSE filesystems are not safe in this regard.
Note that this mechanism does not guarantee that concurrent readers always read full writes. While the writes themselves are never interleaved, readers might read the partial results of in-progress writes.
As far as my understanding (and my tests) go, the atomically writable size isn't even restricted to PIPE_BUF. That limit only applies when writing to a pipe (a socket, or e.g. STDOUT) rather than to a real file.
Unfortunately, authoritative information on this topic is rather sparse. Most articles (and SO answers) conflate strict appending with random writes. When not strictly appending (i.e. when the file is not opened in append-only mode), the guarantees are null and void.
Thus, to answer your specific question: yes, the code from your question should be safe when writing to a local filesystem on a modern operating system. I think syswrite already bypasses Ruby's internal buffering. To be sure, you can also set f.sync = true before writing to disable any buffering completely.
Note that you should still use a Mutex (or similar) if you plan to write to the single opened file from multiple threads in your process, since the append guarantees of the OS are only valid for concurrent writes through different file descriptors; the OS cannot distinguish overlapping writes made by the same process through the same file descriptor.
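
Putting this together, a minimal sketch of the safe pattern under the assumptions above (the path, the append_line helper, and the mutex name are illustrative, not from the question):

LOG_MUTEX = Mutex.new

# Append-only mode: the OS atomically positions every write at EOF.
log = File.new('/tmp/shared.log', 'a')
log.sync = true # additionally disable any Ruby-side buffering

# The OS guarantee covers writes from different file descriptors (i.e. other
# processes); the mutex covers threads sharing this one descriptor.
def append_line(log, line)
  LOG_MUTEX.synchronize { log.syswrite("#{line}\n") }
end

append_line(log, 'short string')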

I would not assume that. syswrite calls the POSIX write function, which makes no claims about atomicity when dealing with files.
See: Are POSIX' read() and write() system calls atomic?
And Understanding concurrent file writes from multiple processes
TL;DR: you should implement some concurrency control in your app to synchronize this access.

Related

Golang simultaneous read/write to the file without explicit file lock

I have a situation where I need to concurrently read from and write to a file, but the scope of operations is limited:
append only, no random offset writes
read from a random position, where I know for sure the content has been written before (via append; internal access serialization via a golang channel ensures a random read only happens after the content has been appended)
there is only one process running
This is a highly loaded application, and I would like to avoid locking the file for each read/write I do
I was going to open 2 files - one for read, another for append only
would doing so create some potential issues/bugs?
what is the recommended practice if I would like to avoid file locking for each read/write I do?
p.s. golang, linux, ext4
I'll assume by "random read" you actually mean "arbitrary read".
If I understand your use case correctly, you don't need to seek or lock or do anything manual. UNIX has this covered via O_APPEND. Here is what you can do:
Open the file with os.O_APPEND. This way every write, regardless of any preceding operations, will go to the end of the file
When reading use File.ReadAt. This lets you specify arbitrary offsets for your reads
Using this scheme you can avoid any sort of locking: the OS will do it for you. Because of the buffer cache this scheme is not even inefficient: appends and reads are pretty much independent.
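
Since the rest of this page is Ruby-centric, here is the same pattern sketched in Ruby rather than Go (the file name is an assumption, and IO#pread requires Ruby >= 2.5; on POSIX it maps to pread(2), just like Go's File.ReadAt):

# One handle opened append-only: every write lands at EOF, no seeking needed.
writer = File.open('data.log', File::WRONLY | File::APPEND | File::CREAT)
writer.syswrite("record\n")

# A second, independent handle for positioned reads at arbitrary offsets.
reader = File.open('data.log', File::RDONLY)
chunk = reader.pread(7, 0) # read 7 bytes starting at offset 0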

two programs accessing one file

New to this forum - looks great!
I have some Processing code that periodically reads data wirelessly from remote devices and writes that data as bytes to a file, e.g. data.dat. I want to write an Objective C program on my Mac Mini using Xcode to read this file, parse the data, and act on the data if the values indicate a problem. My question is: can my two different programs access the same file asynchronously without a problem? If this is a problem, can you suggest a technique that will allow these operations?
Thanks,
Kevin H.
Multiple processes can read from the same file at a time without any problem. A process can also read from a file while another writes to it without problems, although you'll have to take care to ensure that you read in any new data that was written. Multiple processes should not write to the same file at the same time, though. The OS will let you do it, but the ordering of data will be undefined, and you'll likely overwrite data; in general, you're gonna have a bad time if you do that. So you should take care to ensure that only one process writes to a file at a time.
The simplest way to protect a file so that only one process can write to it at a time is with the C function flock(), although that function is admittedly a bit rudimentary and may or may not suit your use case.
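
In Ruby terms (the main language of this page), the same call is exposed as File#flock; a minimal sketch, with an assumed file name and payload:

File.open('data.dat', 'a') do |f|
  f.flock(File::LOCK_EX)  # block until we are the only writer
  f.write(payload)        # payload is a hypothetical chunk of bytes
  f.flock(File::LOCK_UN)  # release explicitly; closing the file also releases it
end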

Ruby file handle management (too many open files)

I am performing very rapid file access in ruby (2.0.0 p39474), and keep getting the exception Too many open files
Having looked at this thread, here, and various other sources, I'm well aware of the OS limits (set to 1024 on my system).
The part of my code that performs this file access is mutexed, and takes the form:
File.open( filename, 'w'){|f| Marshal.dump(value, f) }
where filename is subject to rapid change, depending on the thread calling the section. It's my understanding that this form relinquishes its file handle after the block.
I can verify the number of File objects that are open using ObjectSpace.each_object(File). This reports that there are up to 100 resident in memory, but only one is ever open, as expected.
Further, the exception itself is thrown at a time when there are only 10-40 File objects reported by ObjectSpace. Moreover, manually garbage collecting fails to improve any of these counts, as does slowing down my script by inserting sleep calls.
My question is, therefore:
Am I fundamentally misunderstanding the nature of the OS limit? Does it cover the whole lifetime of a process?
If so, how do web servers avoid crashing out after accessing over ulimit -n files?
Is ruby retaining its file handles outside of its object system, or is the kernel simply very slow at counting 'concurrent' access?
Edit 20130417:
strace indicates that ruby doesn't write all of its data to the file, returning and releasing the mutex before doing so. As such, the file handles stack up until the OS limit.
In an attempt to fix this, I have used syswrite/sysread, synchronous mode, and called flush before close. None of these methods worked.
My question is thus revised to:
Why is ruby failing to close its file handles, and how can I force it to do so?
Use dtrace or strace or whatever equivalent is on your system, and find out exactly what files are being opened.
Note that these could be sockets.
I agree that the code you have pasted does not seem to be capable of causing this problem, at least, not without a rather strange concurrency bug as well.
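
One quick, Linux-specific way to see the real descriptor count (as opposed to Ruby's File objects) is to ask the kernel directly; a small diagnostic sketch:

# Every entry under /proc/self/fd is a descriptor this process currently
# holds, including sockets and pipes that ObjectSpace.each_object(File)
# will never report.
open_fds = Dir.glob('/proc/self/fd/*').size
puts "open descriptors: #{open_fds}"

Comparing that number against the ObjectSpace count should tell you whether the leaked handles are files at all.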

Move or copy and truncate a file that is in use

I want to be able to (programmatically) move (or copy and truncate) a file that is constantly in use and being written to, so that the file being written to never grows too large.
Is this possible? Either Windows or Linux is fine.
To be specific what I'm trying to do is log video with FFMPEG and create hour long videos.
It is possible in both Windows and Linux, but it would take cooperation between the applications involved. If the application that is writing the new data to the file is not aware of what the other application is doing, it probably would not work (well ... there is some possibility ... back to that in a moment).
In general, to get this to work, you would have to open the file shared. For example, if using the Windows API CreateFile, both applications would likely need to specify FILE_SHARE_READ and FILE_SHARE_WRITE. This would allow both (multiple) applications to read and write the file "concurrently".
Beyond sharing the file, though, it would also be necessary to coordinate the operations between the applications. You would need to use some kind of locking mechanism (either by locking some part of the file or some shared mutex/semaphore). Note that if you use file locking, you could lock some known offset in the file to act as a "semaphore" (it can even be a byte value beyond the physical end of the file). If one application were appending to the file at the same exact time that the other application were truncating it, then it would lead to unpredictable results.
Back to the comment about both applications needing to be aware of each other ... It is possible that if both applications opened the file exclusively and kept retrying the operations until they succeeded, then perform the operation, then close the file, it would essentially allow them to work without "knowledge" of each other. However, that would probably not work very well and not be very efficient.
Having said all that, you might want to consider alternatives for efficiency reasons. For example, if it were possible to have the writing application write to new files periodically, it might be more efficient than having to "move" the data constantly out of one file to another. Also, if you needed to maintain some portion of the file (e.g., move out the first 100 MB to another file and then move the second 100 MB to the beginning) that could be a fairly expensive operation as well.
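
To make the coordination concrete, here is a hedged Ruby sketch of the lock, copy, truncate idea (the file names are assumptions, and it only works if the writing application takes the same exclusive lock around each of its appends):

require 'fileutils'

File.open('capture.dat', 'r+') do |f|
  f.flock(File::LOCK_EX) # keep the cooperating writer out while we rotate
  FileUtils.cp('capture.dat', "capture-#{Time.now.to_i}.dat")
  f.truncate(0)          # restart the live file at zero bytes
end                      # lock released when the block closes the file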
logrotate would be a good option on Linux; it comes stock on just about any distro. I'm sure there's a similar Windows service out there somewhere.

Custom Prefetch

Are there any programmatic techniques, portable or specific to NT and Linux, that would make a number of large files load faster? I am after an 'ahead of time', a-priori (whatever you prefer to call it) mechanism that I can control in code, for the two OSes in question.
Each file has to be processed in full, i.e. read completely and sequentially. The aim is to speed up some batch file processing.
I don't know about NT, but one option on Linux would be to mmap the next file and call madvise with the MADV_WILLNEED flag shortly before you actually need it, so the kernel starts reading it in early.
Alternately, a more portable option would be to simply manually do readahead in a separate thread from your buffer-processing thread - that is, read data in to fill an X MB buffer in thread A, process it as fast as you can in thread B.
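
As a hedged sketch of that portable approach in Ruby (this page's main language; file_list and process are placeholder names):

queue = SizedQueue.new(2) # bounded, so at most two files sit in RAM at once

# Thread A: the readahead thread fills the queue ahead of the consumer.
reader = Thread.new do
  file_list.each { |path| queue << File.binread(path) }
  queue << nil # sentinel: no more files
end

# Thread B: process each buffer while the next file is being read in.
while (data = queue.pop)
  process(data) # hypothetical per-file processing step
end
reader.join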
I am not aware of a Win32 (NT) API similar to madvise().
However, I would suggest an approach.
First, pass the Win32 flag FILE_FLAG_SEQUENTIAL_SCAN to CreateFile(). This will allow the Windows operating system to perform better buffering of the file once you have opened it.
With FILE_FLAG_SEQUENTIAL_SCAN, your file parser may operate more quickly once the file is in memory. Unlike madvise() on Linux, the file will not begin loading into memory any earlier due to the use of the Win32 flag.
Next, we need to trigger the file to begin loading. Asynchronously read the first page of the file by calling ReadFileEx() with an OVERLAPPED structure and a FileIOCompletionRoutine function.
Your FileIOCompletionRoutine can simply return, or you can set the event in the overlapped structure -- read the MSDN details of ReadFileEx for details.
Since it would not be a critical failure if the pre-fetch hasn't completed when you actually read from the file, the easiest implementation would be to "fire and forget" -- execute the overlapped file read and then never check the result of it. Be sure that you read the data into valid buffers, though!
If you perform this operation for a file while reading the previous file, the result should be that the next file will commence paging in.
Be aware that this may slow your performance. As the next file begins to page in, the disk I/O to access that file will compete with disk I/O for the file you are currently parsing. If the two files are physically distant from each other on the same disk, the result of pre-fetching might be additional delay as the drive head seeks. Although modern drives have huge buffers which mitigate this, queuing the first page of a new file is likely to cause a head seek.
bdonlan's suggestion of a 'pre-fetch' thread which loads the files asynchronously from the processing would be a workable solution for Win32, also.
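
Sketched in Ruby like the rest of this page, the "fire and forget" idea boils down to touching the start of the next file from a background thread while you parse the current one (the chunk size and the next_path variable are assumptions):

def prefetch_start(path, bytes = 64 * 1024)
  Thread.new do
    # Reading even one chunk nudges the OS into paging the file in; we
    # deliberately ignore the result and any errors (fire and forget).
    File.open(path, 'rb') { |f| f.read(bytes) } rescue nil
  end
end

prefetch_start(next_path)
# ... keep parsing the current file while that read proceeds ...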
