Ruby file handle management (too many open files) - ruby

I am performing very rapid file access in ruby (2.0.0 p39474), and keep getting the exception Too many open files
Having looked at this thread, here, and various other sources, I'm well aware of the OS limits (set to 1024 on my system).
The part of my code that performs this file access is mutexed, and takes the form:
File.open( filename, 'w'){|f| Marshal.dump(value, f) }
where filename is subject to rapid change, depending on the thread calling the section. It's my understanding that this form relinquishes its file handle after the block.
I can verify the number of File objects that are open using ObjectSpace.each_object(File). This reports that there are up to 100 resident in memory, but only one is ever open, as expected.
Further, the exception itself is thrown at a time when there are only 10-40 File objects reported by ObjectSpace. Further, manually garbage collecting fails to improve any of these counts, as does slowing down my script by inserting sleep calls.
My question is, therefore:
Am I fundamentally misunderstanding the nature of the OS limit---does it cover the whole lifetime of a process?
If so, how do web servers avoid crashing out after accessing over ulimit -n files?
Is ruby retaining its file handles outside of its object system, or is the kernel simply very slow at counting 'concurrent' access?
Edit 20130417:
strace indicates that ruby doesn't write all of its data to the file, returning and releasing the mutex before doing so. As such, the file handles stack up until the OS limit.
In an attempt to fix this, I have used syswrite/sysread, synchronous mode, and called flush before close. None of these methods worked.
My question is thus revised to:
Why is ruby failing to close its file handles, and how can I force it to do so?

Use dtrace or strace or whatever equivalent is on your system, and find out exactly what files are being opened.
Note that these could be sockets.
I agree that the code you have pasted does not seem to be capable of causing this problem, at least, not without a rather strange concurrency bug as well.

Related

Read and write file atomically

I'd like to read and write a file atomically in Ruby between multiple independent Ruby processes (not threads).
I found atomic_write from ActiveSupport. This writes to a temp file, then moves it over the original and sets all permissions. However, this does not prevent the file from being read while it is being written.
I have not found any atomic_read. (Are file reads already atomic?)
Do I need to implement my own separate 'lock' file that I check for before reads and writes? Or is there a better mechanism already present in the file system for flagging a file as 'busy' that I could check before any read/write?
The motivation is dumb, but included here because you're going to ask about it.
I have a web application using Sinatra and served by Thin which (for its own reasons) uses a JSON file as a 'database'. Each request to the server reads the latest version of the file, makes any necessary changes, and writes out changes to the file.
This would be fine if I only had a single instance of the server running. However, I was thinking about having multiple copies of Thin running behind an Apache reverse proxy. These are discrete Ruby processes, and thus running truly in parallel.
Upon further reflection I realize that I really want to make the act of read-process-write atomic. At which point I realize that this basically forces me to process only one request at a time, and thus there's no reason to have multiple instances running. But the curiosity about atomic reads, and preventing reads during write, remains. Hence the question.
You want to use File#flock in exclusive mode. Here's a little demo. Run this in two different terminal windows.
filename = 'test.txt'
File.open(filename, File::RDWR) do |file|
file.flock(File::LOCK_EX)
puts "content: #{file.read}"
puts 'doing some heavy-lifting now'
sleep(10)
end
Take a look at transaction and open_and_lock_file methods in "pstore.rb" (Ruby stdlib).
YAML::Store works fine for me. So when I need to read/write atomically I (ab)use it to store data as a Hash.

Wait for a file to be writable

I am working on a tool which writes data to files.
At some point, a file might be "locked" and is not writable until other handles have been closed.
I could use the CreateFile API in a loop until the file is available for writing access.
But I have 2 concerns using CreateFile in a loop:
The Harddrive (cache) is always running...?!
I need to call CreateFile again to obtain a valid writing handle with different flags...?!
So my question is:
What is the best solution to wait for a file to be writable and instantly get a valid handle?
Are there any event solutions or anything, which allows to "queue/reserve" for a handle once, so that there is no "uncontrolled" race condition with others?
A file can be "locked" for two reasons:
An actual file lock which prevents writing to, and possibly reading from the file.
The file being opened without sharing access (accidentially or voluntarily) which even prevents you from opening a handle. If you already see CreateFile failing, that's likely the case rather than a real lock.
There are conceptually[1] at least two ways of knowing that no other process has locked a file without busy waiting:
By finding out who holds locks and waiting on the process or thread to exit (or, by outright killing them...)
By locking the file yourself
Who holds locks?
Finding out about lock owners is rather nasty, you can do it via the totally undocumented SystemLocksInformation class used with the undocumented NtQuerySystemInformation function (the latter is "only undocumented", but the former is so much undocumented that it's really hard to find any information at all). The returned structure is explained here, and it contains an owning thread id.
Luckily, holding a lock presumes holding a handle. Closing the file handle will unlock all file ranges. Which means: No lock without handle.
In other words, the problem can also be expressed as "who is holding an open handle to the file?". Of course not all processes that hold a handle to a file will have the file locked, but no process having a handle guarantees that no process has the file locked.
Code for finding out which processes have a file open is much easier (using restart manager) and is readily available at Raymond Chen's site.
Now that you know which processes and threads are holding file handles and locks, make a list of all thread/process handles and use WaitForMultipleObjects on the list of process handles. When a process exits, all handles are closed.
This also transparently deals with the possibility of a "lock" because a process does not share access.
Locking the file yourself
You can use LockFileEx, which operates asynchronously. Note that LockFileEx needs a valid handle that has been opened with either read or write permissions (getting write permission may not be possible, but read should work almost always -- even if you are prevented from actually reading by an exclusive lock, it's still possible to create a handle that could read if there was no lock).
You can then wait on the asynchronous locking to complete either via the event in the OVERLAPPED structure, or on a completion port, and can even do other useful stuff in the mean time, too. Once you have locked the file, you know that nobody else has it locked.
[1] The wording "conceptually" suggests that I am pretty sure either method will work, but I have not tested them.
Apart from a busy loop, repeatedly trying to open the file with write access (which doesn't smell right - what if the file is locked by a process that is stuck and requires a reboot or manual termination, you'll never be able to write to it.
You could write to a temporary file and rename it afterwards (you can tell the OS a file rename operation is required and it will do it at next boot). If you need to append instead of write, then you'll have to write a process to append your temporary file to the correct one, possibly at startup (write the instructions of which file to append to where to a file that your process reads).
If you need to modify a locked file, then you'll just have to take a lock on it as soon as you can, and refuse to start the program if you don't have write access - warn the user right at the start.
There is a possibility that you can wait in a better way: if a file is locked for writing, you can assume that someone is going to write to it, and so use FindFirstChangeNotification to receive events for the FILE_NOTIFY_CHANGE_LAST_WRITE or FILE_NOTIFY_CHANGE_ATTRIBUTES events. Its not perfect in that someone could request exclusive access for reading too.
I suppose you could try to get the handle to the file that is locked and wait on that, so when it is released your WaitForSingleObject will return. However, there's a good chance you will not be allowed to get the handle owned by a different process (by the security subsystem)

Move or copy and truncate a file that is in use

I want to be able to (programmatically) move (or copy and truncate) a file that is constantly in use and being written to. This would cause the file being written to would never be too big.
Is this possible? Either Windows or Linux is fine.
To be specific what I'm trying to do is log video with FFMPEG and create hour long videos.
It is possible in both Windows and Linux, but it would take cooperation between the applications involved. If the application that is writing the new data to the file is not aware of what the other application is doing, it probably would not work (well ... there is some possibility ... back to that in a moment).
In general, to get this to work, you would have to open the file shared. For example, if using the Windows API CreateFile, both applications would likely need to specify FILE_SHARE_READ and FILE_SHARE_WRITE. This would allow both (multiple) applications to read and write the file "concurrently".
Beyond sharing the file, though, it would also be necessary to coordinate the operations between the applications. You would need to use some kind of locking mechanism (either by locking some part of the file or some shared mutex/semaphore). Note that if you use file locking, you could lock some known offset in the file to act as a "semaphore" (it can even be a byte value beyond the physical end of the file). If one application were appending to the file at the same exact time that the other application were truncating it, then it would lead to unpredictable results.
Back to the comment about both applications needing to be aware of each other ... It is possible that if both applications opened the file exclusively and kept retrying the operations until they succeeded, then perform the operation, then close the file, it would essentially allow them to work without "knowledge" of each other. However, that would probably not work very well and not be very efficient.
Having said all that, you might want to consider alternatives for efficiency reasons. For example, if it were possible to have the writing application write to new files periodically, it might be more efficient than having to "move" the data constantly out of one file to another. Also, if you needed to maintain some portion of the file (e.g., move out the first 100 MB to another file and then move the second 100 MB to the beginning) that could be a fairly expensive operation as well.
logrotate would be a good option is linux, comes stock on just about any distro. I'm sure there's a similar windows service out there somewhere

File Unlocking and Deleting as single operation

Please note this is not duplicate of File r/w locking and unlink. (The difference - platform. Operations of files like locking and deletion have totally different semantics, thus the sultion would be different).
I have following problem. I want to create a file system based session storage where each session data is stored in simple file named with session ids.
I want following API: write(sid,data,timeout), read(sid,data,timeout), remove(sid)
where sid==file name, Also I want to have some kind of GC that may remove all timed-out sessions.
Quite simple task if you work with single process but absolutly not trivial when working with multiple processes or even over shared folders.
The simplest solution I thought about was:
write/read:
hanlde=CreateFile
LockFile(handle)
read/write data
UnlockFile(handle)
CloseHanlde(handle)
GC (for each file in directory)
hanlde=CreateFile
LockFile(handle)
check if timeout occured
DeleteFile
UnlockFile(handle)
CloseHanlde(handle)
But AFIAK I can't call DeleteFile on opended locked file (unlike in Unix where file locking is
not mandatory and unlink is allowed for opened files.
But if I put DeleteFile outside of Locking loop bad scenario may happen
GC - CreateFile/LockFile/Unlock/CloseHandle,
write - oCreateFile/LockFile/WriteUpdatedData/Unlock/CloseHandle
GC - DeleteFile
Does anybody have an idea how such issue may be solved? Are there any tricks that allow
combine file locking and file removal or make operation on file atomic (Win32)?
Notes:
I don't want to use Database,
I look for a solution for Win32 API for NT 5.01 and above
Thanks.
I don't really understand how this is supposed to work. However, deleting a file that's opened by another process is possible. The process that creates the file has to use the FILE_SHARE_DELETE flag for the dwShareMode argument of CreateFile(). A subsequent DeleteFile() call will succeed. The file doesn't actually get removed from the file system until the last handle on it is closed.
You currently have data in the record that allows the GC to determine if the record is timed out. How about extending that housekeeping info with a "TooLateWeAlreadyTimedItOut" flag.
GC sets TooLateWeAlreadyTimedItOut = true
Release lock
<== writer comes in here, sees the "TooLate" flag and so does not write
GC deletes
In other words we're using a kind of optimistic locking approach. This does require some additional complexity in the Writer, but now you're not dependent upon any OS-specifc wrinkles.
I'm not clear what happens in the case:
GC checks timeout
GC deletes
Writer attempts write, and finds no file ...
Whatever you have planned for this case can also be used in the "TooLate" case
Edited to add:
You have said that it's valid for this sequence to occur:
GC Deletes
(Very slightly later) Writer attempts a write, sees no file, creates a new one
The writer can treat "tooLate" flag as a identical to this case. It just creates a new file, with a different name, use a version number as a trailing part of it's name. Opening a session file the first time requires a directory search, but then you can stash the latest name in the session.
This does assume that there can only be one Writer thread for a given session, or that we can mediate between two Writer threads creating the file, but that must be true for your simple GC/Writer case to work.
For Windows, you can use the FILE_FLAG_DELETE_ON_CLOSE option to CreateFile - that will cause the file to be deleted when you close the handle. But I'm not sure that this satisfies your semantics (because I don't believe you can clear the delete-on-close attribute.
Here's another thought. What about renaming the file before you delete it? You simply can't close the window where the write comes in after you decided to delete the file but what if you rename the file before deleting it? Then when the write comes in it'll see that the session file doesn't exist and recreate it.
The key thing to keep in mind is that you simply can't close the window in question. IMHO there are two solutions:
Adding a flag like djna mentioned or
Require that a per-session named mutex be acquired which has the unfortunate side effect of serializing writes on the session.
What is the downside of having a TooLate flag? In other words, what goes wrong if you delete the file prematurely? After all your system has to deal with the file not being present...

Custom Prefetch

Any programmatic techniques, portable or specific to NT and Linux that get the result of number of large files loading faster? I am after a 'ahead of time', a prior, whatever you prefer to call it mechanisms that I can control in code for two OS in a question.
Each file has to be processed in full, i.e. completely in size and sequentially for its contents. The aim is to speed up some batch file processing.
I don't know about NT, but one option on Linux would be to use madvise with the MADV_WILLNEED flag shortly before you actually need the next file to start reading it in early.
Alternately, a more portable option would be to simply manually do readahead in a separate thread from your buffer-processing thread - that is, read data in to fill an X MB buffer in thread A, process it as fast as you can in thread B.
I am not aware of a Win32 (NT) API similar to madvise().
However, I would suggest an approach.
First, pass the Win32 flag FILE_FLAG_SEQUENTIAL_SCAN to CreateFile(). This will allow the Windows operating system to perform better buffering of the file once you have opened it.
With FILE_FLAG_SEQUENTIAL_SCAN, your file parser may operate more quickly once the file is in memory. Unlike madvise() on Linux, the file will not begin loading into memory any earlier due to the use of the Win32 flag.
Next, we need to trigger the file to begin loading. Asynchronously read the first page of the file by calling ReadFileEx() with an OVERLAPPED structure and a FileIOCompletionRoutine function.
Your FileIOCompletionRoutine can simply return, or you can set the event in the overlapped structure -- read the MSDN details of ReadFileEx for details.
Since it would not be a critical failure if the pre-fetch hasn't completed when you actually read from the file, the easiest implementation would be to "fire and forget" -- execute the overlapped file read and then never check the result of it. Be sure that you read the data into valid buffers, though!
If you perform this operation for a file while reading the previous file, the result should be that the next file will commence paging in.
Be aware that this may slow your performance. As the next file begins to page in, the disk I/O to access that file will compete with disk I/O for the file you are currently parsing. If the two files are physically distant from each other on the same disk, the result of pre-fetching might be additional delay as the drive head seeks. Although modern drives have huge buffers which mitigate this, queuing the first page of a new file is likely to cause a head seek.
bdonlan's suggestion of a 'pre-fetch' thread which loads the files asynchronously from the processing would be a workable solution for Win32, also.

Resources