Could somebody explain I/O to me? From everything I'm gathering, it can be summed up, abstractly, as the way computers interact with humans and vice versa. The I/O channel, or the "how", can run the gamut depending on external devices and/or internal OS management.
So what does the IO class in Ruby do? And how is it different from that of Java or C?
And take this code for instance:
x = IO.sysopen("file_name")
p x
The return value is a Fixnum: the file descriptor. In this case, "file_name" is a PDF file and the call returns 7. What does the returned number mean?
First of all, sysopen is a very low-level way of interacting with the system. For normal input and output in Ruby, you should use File.open instead.
The number returned by sysopen is called a "file descriptor". It's essentially an index into an array, but not a Ruby array; it lives inside the part of a process's memory which is maintained by the operating system. The first file descriptor, number 0, is called "standard input". Input calls will read from this input stream by default. The second, 1, is called "standard output"; output calls send their output there by default. And the third, 2, is called "standard error", which is where error messages go. All three of those are already open before Ruby even starts; the process inherits them from whatever launched it (typically the shell). Normally they're all tied to the terminal, but you can change that with shell redirection.
As a general rule, a newly opened file gets the lowest descriptor number that is not currently in use, so the first extra file you open will get file descriptor 3, the next 4, and so on. So if you get a 7 back, that just means that four other descriptors (3 through 6) were already open by the time Ruby got to your code. And that's all it means. You can't tell anything else about an open file just based on the number. You have to hand that number off to a system call which can go look at the file descriptor array to see what's up.
But in Ruby, you usually have no reason to know or care about file descriptor numbers. You deal with instances of the IO class (and its subclasses like File for specific types of I/O). You call methods on the IO objects, and they handle the details of the system calls for you. The object referred to by the predefined constant STDIN (which is also the initial value of the global variable $stdin) knows that its file descriptor is 0, so you don't have to know that.
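To make the numbering concrete, here is a minimal sketch. It is written in Python rather than Ruby only to keep it short; Ruby's IO#fileno and IO.for_fd play the same roles as fileno() and os.fdopen() here. The helper name open_two_descriptors and the scratch file are made up for illustration.

```python
import os
import tempfile


def open_two_descriptors():
    """Open a scratch file twice; return both descriptors and the wrapped one."""
    # Hypothetical stand-in for "file_name": an empty temporary file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        path = tmp.name

    # Low-level opens, comparable to IO.sysopen: each returns a bare integer.
    first = os.open(path, os.O_RDONLY)
    second = os.open(path, os.O_RDONLY)
    try:
        # Wrap the integer in a normal file object; the object remembers
        # its descriptor so calling code never has to.
        with os.fdopen(first, "rb", closefd=False) as stream:
            wrapped = stream.fileno()
        return first, second, wrapped
    finally:
        os.close(first)
        os.close(second)
        os.unlink(path)


if __name__ == "__main__":
    first, second, wrapped = open_two_descriptors()
    print("descriptors:", first, second, "wrapped object reports:", wrapped)
```

The exact numbers depend on what the interpreter already has open, which is why the sketch prints them instead of promising particular values. The only things it relies on are that the second descriptor is higher than the first and that the wrapping object reports the number it was given.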
While doing some tests in scm (a Scheme interpreter), I intentionally closed the current-input-port (the equivalent of the standard input file descriptor). Because the program was running in the REPL, things went crazy: it kept printing an error message over and over. My question is: how can I regain control of the process, that is, how can I re-establish the input file descriptor of such a process?
Searching for "changing file descriptor of a running process" or something similar, I couldn't find a helpful article.
Thanks in advance
System information: Debian 10.
You almost certainly can't, although this does slightly depend on how the language-level ports are mapped to the underlying OS-level I/O system.
If what you do is close the OS-level standard input then all is lost:
the REPL tries to read from standard input, gets an error as it's closed;
it tries to raise some error which will involve prompting the user for input ...
... from standard input, which is closed, so it gets an error;
game over.
The only way to survive this is for one of two things to be true:
either you've wrapped an error handler around the code which is already prepared to deal with this;
or the implementation is smart enough to recognise that it's getting closed-port errors in its closed-port error handler and gives up in some smart way.
Basically, once the OS-level standard input is gone, anything that needs to get input from it is doomed: you can't put it back without OS-level surgery on the process.
However it's possible that the implementation maps a single OS-level I/O stream to multiple language-level streams, and closing only one of these streams would leave the system with some other stream-of-last-resort to which it can still talk, and which still refers to the OS-level standard input. Common Lisp is an example of a system which can (depending on configuration) do this. It has, for instance, *standard-input*, *error-output*, *query-io*, *terminal-io* and other streams, and it's very possible to be in a situation where, for instance, *standard-input* has been closed causing read errors, but *query-io* still points somewhere with a human on the end of it.
I don't know if scm does that.
On Unix-like operating systems, the file descriptors (file handles) used for things like pipes and sockets are small integers. It's common for a parent process to communicate with its child process by opening a pipe, socket or file at a particular file descriptor number. The number will either be agreed upon by the programmers of the parent and child program, or passed to the child program via the command line or environment variables. The child can then access the inherited file descriptor using the same number as the parent.
Is there a convention for passing handle numbers in a similar way on Windows? As far as I can tell, the relevant MSDN articles only talk about such conventions for the "standard" file handles (standard input, standard output, standard error). What would be a robust and idiomatic way to do something like "pass a pipe on file descriptor 3" as we sometimes do on Unix?
The MSDN article on Inheritance says (emphasis mine):
An inherited handle refers to the same object in the child process as it does in the parent process. It also has the same value and access privileges.
Does this mean that I can just cast the HANDLE value to an integer, convert that integer into a string and pass that string as a command line argument for the child process to parse and get the same handle back, and is this reliable or customary? Is there some other IPC mechanism that Windows programmers would normally use instead?
The main motivation for me would be to avoid temporary files and use pipes instead. This is conventional when all you have is stdin/stdout but less so when you have multiple pipes.
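For reference, the Unix convention described at the top of this question can be sketched as follows. This is POSIX-only: it relies on the pass_fds argument of subprocess.Popen, which is not supported on Windows, so it shows what I would like to reproduce rather than a solution. CHILD_CODE and run_child_with_pipe are names made up for the sketch.

```python
import os
import subprocess
import sys

# Child program: learns the descriptor number from its command line.
CHILD_CODE = """
import os, sys
fd = int(sys.argv[1])              # number passed by the parent
os.write(fd, b"hello from child")  # write into the inherited pipe
os.close(fd)
"""


def run_child_with_pipe():
    """Spawn a child that writes to an inherited pipe; return what it wrote."""
    read_fd, write_fd = os.pipe()
    try:
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE, str(write_fd)],
            pass_fds=(write_fd,),  # keep this descriptor open, same number, in the child
        )
    finally:
        # The parent no longer needs its copy of the write end; closing it
        # lets the reader see end-of-file once the child is done.
        os.close(write_fd)

    with os.fdopen(read_fd, "rb") as reader:
        data = reader.read()
    proc.wait()
    return data


if __name__ == "__main__":
    print(run_child_with_pipe())
```

The number travels as an ordinary command-line argument; an environment variable would work the same way.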
I have a Python program which performs a simple operation on a file:
with open(self.cache_filename_url, "a", encoding="utf8") as f:
w = csv.writer(f, delimiter=',', quotechar='"', lineterminator='\n')
w.writerow([cache_url, rpd_products])
As you can see it just opens the file and appends a CSV line to it. It does this a lot, in a loop.
I accidentally ran two copies of this program simultaneously, so I think they would have been appending to the file simultaneously. I am trying to determine the worst-case-scenario for file corruption.
Do you think the writes would at least be atomic operations in this case? For example this wouldn't be a problem for me:
old line
old line
new line written by instance 1
new line written by instance 2
new line written by instance 1
This would be a problem for me:
old line
old line
[half of new line written by instance 1] [half of new line by instance 2]
etc
To put it another way, is it possible for the two append operations to "interfere" with each other?
EDIT: I am using Windows 7
Opening the same file multiple times in shared write mode can definitely be problematic. And, if they don't open in shared mode, you'll get one of them throwing exceptions that it cannot open the file.
If SHARED mode:
Both instances will have their own internal pointer. In most cases, they will probably write independently. You could get:
Process A opens file, sets pointer to end (byte 1024)
Process B opens file, sets pointer to end (byte 1024)
Process B writes at byte 1024 and closes file
Process A writes at byte 1024 and closes file.
Both processes will have written to the file at the same location, so Process A has overwritten the record from Process B. If the two lines have different lengths and Process B's was the longer one, the tail of Process B's line can survive after Process A's (depending on whether the close truncates the file), leaving you with a partial line.
If it is in EXCLUSIVE mode, one process will fail to open the file, and whatever exception handling you have will kick in.
Which mode you are in can be system dependent, as Python doesn't seem to provide any mechanisms for controlling the share mode.
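One way to sidestep the question of share modes is to serialize the appends yourself. Below is a minimal sketch using a lock file created with O_CREAT | O_EXCL, which fails if the file already exists. The function name append_row_locked, the ".lock" suffix, and the timeout values are assumptions for illustration, not anything from your program. A process that crashes while holding the lock leaves a stale lock file behind, and on Windows the open can occasionally fail with a different error while another process is deleting the lock file; the sketch handles neither case.

```python
import csv
import os
import time


def append_row_locked(path, row, timeout=10.0, poll=0.05):
    """Append one CSV row to `path` while holding a simple lock file."""
    lock_path = path + ".lock"  # hypothetical naming convention
    deadline = time.monotonic() + timeout

    # Acquire: exclusive creation succeeds for exactly one process at a time.
    while True:
        try:
            lock_fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError("could not acquire " + lock_path)
            time.sleep(poll)

    try:
        # newline="" stops Python from translating "\n" on Windows,
        # as the csv module documentation recommends.
        with open(path, "a", encoding="utf8", newline="") as f:
            writer = csv.writer(f, delimiter=",", quotechar='"', lineterminator="\n")
            writer.writerow(row)
    finally:
        # Release: close and remove the lock so the next writer can proceed.
        os.close(lock_fd)
        os.remove(lock_path)
```

Each call holds the lock only for the duration of one open/append/close, so two instances running the loop take turns per line instead of racing on the file position.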
Update: I ran a check on my file, and I did indeed have corrupted partial lines (the case under "This would be a problem for me" in my question)
It's unfortunate, especially since it implies you could have problems even when you intend to share a file between two processes.
I am still interested in any pointers on how to avoid this outcome. I will hold off on marking an answer as accepted for now. (The other answer is good, but doesn't provide enough details on these modes or how to determine which will be used.)
I am working on making a program that will act in a similar way as a shell, but supports only foreground processes and pipes. I have multiple processes writing to the same pipe and some other properties that differ from the normal usage of pipes. Anyhow, my question is,
Is there any easy (automatic) way to close all file descriptors of a process except the three basic ones?
I am asking this question since I have a lot of difficulties keeping track of all file descriptors for every process. And sometimes they act in some unpredictable ways to me. It could be also because of the fact that I don't have a very thorough understanding of them.
Is there any easy way(automatic) to close all file descriptors of a process except the three basic ones?
The normal way to do this is to simply iterate over all of them and close them:
for (i = getdtablesize(); i > 3;) close(--i);
That's already a one-liner. It doesn't get any more "automatic" than that.
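For comparison, the same loop can be sketched in Python with os.closerange, which closes a range of descriptors and ignores errors. The sketch runs the closing step in a throwaway child process so it cannot close descriptors the caller still needs. CHILD_CODE, run_demo and the optional limit argument are made up for illustration, and os.sysconf("SC_OPEN_MAX") (standing in for getdtablesize()) is POSIX-only.

```python
import os
import subprocess
import sys

# Runs in a separate interpreter: leak one descriptor, close everything
# from 3 upward, then check whether the leaked descriptor is still open.
CHILD_CODE = """
import os, sys
leaked = os.open(os.devnull, os.O_RDONLY)   # stand-in for a forgotten descriptor
if len(sys.argv) > 1:
    limit = int(sys.argv[1])                # explicit upper bound (exclusive)
else:
    limit = os.sysconf("SC_OPEN_MAX")       # roughly what getdtablesize() reports
os.closerange(3, limit)                     # close 3 .. limit-1, ignoring errors
try:
    os.fstat(leaked)
    print("still open")
except OSError:
    print("closed")
"""


def run_demo(limit=None):
    """Run the child and return its one-line report."""
    args = [sys.executable, "-c", CHILD_CODE]
    if limit is not None:
        args.append(str(limit))
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    print(run_demo())
```

Passing an explicit limit is only there to keep the range small on systems where the descriptor table limit is very large.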
I am asking this question since I have a lot of difficulty keeping track of all file descriptors for every process.
It will be worth your time to think about the life cycle of each file descriptor you open, when it gets duplicated (e.g. dup2() and fork()), how it gets used, and make sure you account for how each one is going to get closed when it is no longer needed. Papering over a problem of leaked file descriptors by indiscriminately closing them all is not going to be sustainable.
I have multiple processes writing to the same pipe
If you do this, then you need to be aware that the order in which data arrive at the other end of the pipe is going to be unpredictable. It will be difficult to avoid corrupting the data stream.
Use the closefrom(3) C library function.
From the manpage:
The closefrom() system call deletes all open file descriptors greater
than or equal to lowfd from the per-process object reference table.
Any errors encountered while closing file descriptors are ignored.
Example usage:
#include <stdio.h>   /* printf */
#include <unistd.h>  /* closefrom; with libbsd on Linux, include <bsd/unistd.h> instead */

int main(void) {
    // Close everything except stdin, stdout and stderr
    closefrom(3); // where 3 is the lowest file descriptor you wish to close
    printf("Clear of all, but the three basic file descriptors!\n");
    return 0;
}
This works on most Unices, but on Linux it may require the libbsd support library (glibc 2.34 and later provide closefrom() themselves).
Dir-s seem awkward as compared to File-s. Many of the methods are similar to IO methods, but a Dir doesn't inherit from IO. For example, tell in the IO docs reads:
Returns the current offset (in bytes) of ios.
When read-ing and tell-ing through a normal Dir, I get large numbers like 346723732 and 422823816. I was originally expecting these integers to be more "array-like" and just be a simple range.
Are these the bytes of the files contained in the Dir?
If not, is there any meaning to the numbers returned like IO#tell?
Also why do Dir-s have an open and close function if they are not Streams?
Is it still just as important to close a Dir as a normal IO?
Any general explanation of how a Ruby Dir works would be appreciated.
Update: another confusing part: if Dirs are not IOs, why does close raise an IOError?
Closes the directory stream. Any further attempts to access dir will raise an IOError.
Also notice that in the documentation it considers it a "directory stream". So this brings up the question again of are they streams or not and if not, why the naming convention?
The docs for Dir#tell say:
Returns the current position in dir.
without specifying what the position means. What the returned value signifies is likely to vary based on the OS that you're using and possibly the type of the file system that contains the directory. That value should be treated as opaque; don't try to interpret it in any way. Its only purpose is to be handed back to the OS later, such as by calling Dir#seek.
Directories are not just a giant file. More typically they just map from a file name to information about where the data for the file is contained.
You should not (and as far as I'm aware cannot) write to directories yourself.
So after some IRC chat here's the conclusion I've come to:
The Dir object is NOT an IO
Dir does not inherit from the IO class and is only readable. Still not sure why an IOError is raised on #close.
An opened Dir IS a stream however
Objects of class Dir are directory streams representing directories in the underlying file system.
Also, if you check the source for Dir#close, you will see that it calls the C function closedir. man closedir prints:
The closedir() function closes the directory stream associated with
dirp. A successful call to closedir() also closes the underlying file
descriptor associated with dirp. The directory stream descriptor dirp
is not available after this call.
...with dirp being a param.
So yes, instantiated Dirs will open a stream and yes, Dirs will use a file descriptor and need to be closed if you do not want to rely on garbage collection.
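The same open/read/close lifecycle shows up outside Ruby too. As a rough illustration (in Python, not Ruby), os.scandir returns an iterator backed by an OS directory stream, and using it as a context manager closes that stream promptly instead of leaving it to garbage collection. The helper name list_directory is made up for the sketch, and Python exposes no equivalent of Dir#tell or Dir#seek here.

```python
import os


def list_directory(path):
    """Return the sorted entry names in `path`, closing the directory stream promptly."""
    # The with-block calls close() on the iterator when it exits, which
    # releases the underlying directory stream and its file descriptor.
    with os.scandir(path) as entries:
        return sorted(entry.name for entry in entries)


if __name__ == "__main__":
    print(list_directory("."))
```

In Ruby the block form Dir.open(path) { |dir| ... } gives the same guarantee: the directory is closed when the block exits.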
Big thanks to injekt and others on #ruby-lang irc!