How do Ruby Directories work? - ruby

Dir-s seem awkward as compared to File-s. Many of the methods are similar to IO methods, but a Dir doesn't inherit from IO. For example, tell in the IO docs reads:
Returns the current offset (in bytes) of ios.
When read-ing and tell-ing through a normal Dir, I get large numbers like 346723732 and 422823816. I was originally expecting these integers to be more "array-like" and just be a simple range.
Are these the bytes of the files contained in the Dir?
If not, is there any meaning to the numbers returned like IO#tell?
Also why do Dir-s have an open and close function if they are not Streams?
Is it still just as important to close a Dir as a normal IO?
Any general explanation of how a Ruby Dir works would be appreciated.
update Another confusing part: if Dirs are not IOs, why does close raise an IOerror?
Closes the directory stream. Any further attempts to access dir will raise an IOError.
Also notice that in the documentation it considers it a "directory stream". So this brings up the question again of are they streams or not and if not, why the naming convention?

The docs for Dir#tell say:
Returns the current position in dir.
without specifying what the position means. What the returned value signifying is likely to vary based on the OS that you're using and possibly the type of the file system that contains the directory. That value should be treated as opaque, don't try to interpret it in any way. The only purpose it serves is for being able to send that value back to the OS such as by calling Dir#seek.
Directories are not just a giant file. More typically they just map from a file name to information about where the data for the file is contained.
You should not (and as far as I'm aware cannot) write to directories yourself.

So after some IRC chat here's the conclusion I've come to:
The Dir object is NOT an IO
Dir Does not inherit from the IO class and is only readable. Still not sure why an IOError is raised on #close.
An opened Dir IS a stream however
Objects of class Dir are directory streams representing directories in the underlying file system.
Also if you check the source for Dir#close You will see that it calls the C function dirclose. man dirclose prints:
The closedir() function closes the directory stream associated with
dirp. A successful call to closedir() also closes the underlying file
descriptor associated with dirp. The directory stream descriptor dirp
is not available after this call.
...with dirp being a param.
So yes, instantiated Dirs will open a stream and yes, Dirs will use a file descriptor and need to be closed if you do not want to rely on garbage collection.
Big thanks to injekt and others on #ruby-lang irc!

Related

Find next file (but not FindNextFile)

If a user opens a file in a program (for example using GetOpenFileNameW, DragQueryFileW, command line argument, or whatever else to get the path, and a subsequent CreateFileW call), is there a way to find the next file in the parent directory of the opened file?
The obvious solution is to cycle through the results from FindNextFileW or NtQueryDirectoryFileEx until the opened file is encountered, and just open the next file.
However, this seems undesireable.
First, because these functions use paths (instead of for example a handle), the original file is decoupled from the search algorithm, so the original file might not even get encountered in that search. This is not much of an issue (as failing in this case is the expected outcome), and it probably could be resolved with (temporarly) changing the sharing mode, using LockFile or similar (though I would like to avoid that).
Second, this cycling search would have to be done every time, because the contents of the directory might have changed (retaining hFindFile does not work, because only FindFirstFileW calls NtQueryDirectoryFileEx and enumerates the contents of the directory). Which seems like unnecessary work and might even affect performance (for example if the directory contains a lot of files).
In theory any file system has some way of enumerating the files in a directory. Meaning there is some ordered data structure of the files' metadata. And getting the next file should only involve going back from the existing file handle to that file's entry, and then getting the next entry from that data structure. So there does not seem to be a fundamental reason why this cannot be done more sanely.
I thought maybe there exist a better way to do this somewhere in WinAPI...
Same question for finding the previous file.

What does IO.sysopen return?

Could somebody explain I/O to me? From everything I'm gathering, it can be summed up, abstractly, as the way computers interact with humans and vice versa. The I/O channel, or the "how", can run the gamut depending on external devices and/or internal OS management.
So what does the IO class in Ruby do? And how is it different from that of Java or C?
And take this code for instance:
x = IO.sysopen("file_name")
p x
The return is a Fixnum based on the file descriptor. In this case, the "file_name" is a pdf file and return a 7. What does the return object mean?
First of all, sysopen is a very low-level way of interacting with the system. For normal input and output in Ruby, you should use File.open instead.
The number returned by sysopen is called a "file descriptor". It's essentially an index into an array, but not a Ruby array; it lives inside the part of a process's memory which is maintained by the operating system. The first file descriptor, number 0, is called "standard input". Input calls will read from this input stream by default. The second, 1, is called "standard output"; output calls send their output there by default. And the third, 2, is called "standard error", which is where error messages go. All three of those are opened by the operating system before Ruby even starts. Normally they're all tied to the terminal, but you can change that with shell redirection.
As a general rule, when you open an extra file, the first one you open will get file descriptor 3, the next 4, and so on. So if you get a 7 back, that just means that Ruby has opened 4 other files by the time it gets to your code. And that's all it means. You can't tell anything else about an open file just based on the number. You have to hand that number off to a system call which can go look at the file descriptor array to see what's up.
But in Ruby, you usually have no reason to know or care about file descriptor numbers. You deal with instances of the IO class (and its subclasses like File for specific types of I/O). You call methods on the IO objects, and they handle the details of the system calls for you. The object referred to by the predefined constant STDIN (which is also the initial value of the global variable $stdin) knows that its file descriptor is 0, so you don't have to know that.

True file descriptor clone

Why is there no true file descriptor clone mechanism when possible, like it is for disk files.
POSIX:
After a successful return from one of these system calls, the old and
new file descriptors may be used interchangeably. They refer to the
same open file description (see open(2)) and thus share file offset
and file status flags; for example, if the file offset is modified by
using lseek(2) on one of the descriptors, the offset is also changed
for the other.
Windows:
The duplicate handle refers to the same object as the original handle. Therefore, any changes to the object are reflected through both handles. For example, if you duplicate a file handle, the current file position is always the same for both handles. For file handles to have different file positions, use the CreateFile function to create file handles that share access to the same file.
Reasons for having a clone primitive:
When manipulating a file archive, I want each file in the archive has to be accessible independently. The file archive should behave somewhat like a virtual filesystem.
File type checking. Being able to clone file offsets makes it possible to read a small portion of the file without affecting the original position.
You should consider the following: file descriptor is merely an offset into the array of "file" (literally, that's what they are called) object pointers on the kernel side. So when you duplicate the file descriptor, the kernel will simply copy the value of the file pointer from one location in the array to another and increment the reference count on the pointed to object.
Thus, your issue is not with file descriptor duplication, but with management of the file offsets. The easy answer for this: do it yourself. That is, associate the current file offset with each file descriptor on the application side explicitly.
Of course, the most basic file access system calls read() and write() make use of kernel maintained file offset variable, if it's available (and it's only available if you are dealing with "normal" random access files). But more advanced file access system calls will expect the desired file offset to be supplied by the application on each invocation. Those include pread()/pwrite(), preadv()/pwritev() and aio_read()/aio_write (the later is probably the best approach for writing parallel access applications like the one you described).
On Windows, ReadFile()/WriteFile(), ReadFileScatter()/WriteFileGather() and ReadFileEx()/WriteFileEx() analogously expect to be passed the file offset on every invocation (via the lpOverlapped argument).

Is appending to a file atomic with Windows/NTFS?

If I'm writing a simple text log file from multiple processes, can they overwrite/corrupt each other's entries?
(Basically, this question Is file append atomic in UNIX? but for Windows/NTFS.)
You can get atomic append on local files. Open the file with FILE_APPEND_DATA access (Documented in WDK). When you omit FILE_WRITE_DATA access then all writes will ignore the the current file pointer and be done at the end-of file. Or you may use FILE_WRITE_DATA access and for append writes specify it in overlapped structure (Offset = FILE_WRITE_TO_END_OF_FILE and OffsetHigh = -1 Documented in WDK).
The append behavior is properly synchronized between writes via different handles. I use that regularly for logging by multiple processes. I do write BOM at every open to offset 0 and all other writes are appended. The timestamps are not a problem, they can be sorted when needed.
Even if append is atomic (which I don't believe it is), it may not give you the results you want. For example, assuming a log includes a timestamp, it seems reasonable to expect more recent logs to be appended after older logs. With concurrency, this guarantee doesn't hold - if multiple processes are waiting to write to the same file, any one of them might get the write lock - not just the oldest one waiting. Thus, logs can be written out of sequence.
If this is not desirable behaviour, you can avoid it by publishing logs entries from all processes to a shared queue, such as a named pipe. You then have a single process that writes from this queue to the log file. This avoids the conccurrency issues, ensures that logs are written in order, and works when file appends are not atomic, since the file is only written to directly by one process.
From this MSDN page on creating and opening Files:
An application also uses CreateFile to specify whether it wants to share the file for reading, writing, both, or neither. This is known as the sharing mode. An open file that is not shared (dwShareMode set to zero) cannot be opened again, either by the application that opened it or by another application, until its handle has been closed. This is also referred to as exclusive access.
and:
If you specify an access or sharing mode that conflicts with the modes specified in the previous call, CreateFile fails.
So if you use CreateFile rather than say File.Open which doesn't have the same level of control over the file access, you should be able to open a file in such a way that it can't get corrupted by other processes.
You'll obviously have to add code to your processes to cope with the case where they can't get exclusive access to the log file.
No it isn't. If you need this there is Transactional NTFS in Windows Vista/7.

File Unlocking and Deleting as single operation

Please note this is not duplicate of File r/w locking and unlink. (The difference - platform. Operations of files like locking and deletion have totally different semantics, thus the sultion would be different).
I have following problem. I want to create a file system based session storage where each session data is stored in simple file named with session ids.
I want following API: write(sid,data,timeout), read(sid,data,timeout), remove(sid)
where sid==file name, Also I want to have some kind of GC that may remove all timed-out sessions.
Quite simple task if you work with single process but absolutly not trivial when working with multiple processes or even over shared folders.
The simplest solution I thought about was:
write/read:
hanlde=CreateFile
LockFile(handle)
read/write data
UnlockFile(handle)
CloseHanlde(handle)
GC (for each file in directory)
hanlde=CreateFile
LockFile(handle)
check if timeout occured
DeleteFile
UnlockFile(handle)
CloseHanlde(handle)
But AFIAK I can't call DeleteFile on opended locked file (unlike in Unix where file locking is
not mandatory and unlink is allowed for opened files.
But if I put DeleteFile outside of Locking loop bad scenario may happen
GC - CreateFile/LockFile/Unlock/CloseHandle,
write - oCreateFile/LockFile/WriteUpdatedData/Unlock/CloseHandle
GC - DeleteFile
Does anybody have an idea how such issue may be solved? Are there any tricks that allow
combine file locking and file removal or make operation on file atomic (Win32)?
Notes:
I don't want to use Database,
I look for a solution for Win32 API for NT 5.01 and above
Thanks.
I don't really understand how this is supposed to work. However, deleting a file that's opened by another process is possible. The process that creates the file has to use the FILE_SHARE_DELETE flag for the dwShareMode argument of CreateFile(). A subsequent DeleteFile() call will succeed. The file doesn't actually get removed from the file system until the last handle on it is closed.
You currently have data in the record that allows the GC to determine if the record is timed out. How about extending that housekeeping info with a "TooLateWeAlreadyTimedItOut" flag.
GC sets TooLateWeAlreadyTimedItOut = true
Release lock
<== writer comes in here, sees the "TooLate" flag and so does not write
GC deletes
In other words we're using a kind of optimistic locking approach. This does require some additional complexity in the Writer, but now you're not dependent upon any OS-specifc wrinkles.
I'm not clear what happens in the case:
GC checks timeout
GC deletes
Writer attempts write, and finds no file ...
Whatever you have planned for this case can also be used in the "TooLate" case
Edited to add:
You have said that it's valid for this sequence to occur:
GC Deletes
(Very slightly later) Writer attempts a write, sees no file, creates a new one
The writer can treat "tooLate" flag as a identical to this case. It just creates a new file, with a different name, use a version number as a trailing part of it's name. Opening a session file the first time requires a directory search, but then you can stash the latest name in the session.
This does assume that there can only be one Writer thread for a given session, or that we can mediate between two Writer threads creating the file, but that must be true for your simple GC/Writer case to work.
For Windows, you can use the FILE_FLAG_DELETE_ON_CLOSE option to CreateFile - that will cause the file to be deleted when you close the handle. But I'm not sure that this satisfies your semantics (because I don't believe you can clear the delete-on-close attribute.
Here's another thought. What about renaming the file before you delete it? You simply can't close the window where the write comes in after you decided to delete the file but what if you rename the file before deleting it? Then when the write comes in it'll see that the session file doesn't exist and recreate it.
The key thing to keep in mind is that you simply can't close the window in question. IMHO there are two solutions:
Adding a flag like djna mentioned or
Require that a per-session named mutex be acquired which has the unfortunate side effect of serializing writes on the session.
What is the downside of having a TooLate flag? In other words, what goes wrong if you delete the file prematurely? After all your system has to deal with the file not being present...

Resources