Does using the Ruby "File" class without closing leak memory? - ruby

So I was doing some research on the File class in Ruby. As I was digging I learned that File was a subclass of IO. To my understanding when you create an IO object (or File object), a buffer is opened to that file that allows you to read and write to that file. I don't completely understand what a buffer is, but apparently it stays open until you call the #close method on the object. To my understanding this buffer is opened whether you call File.new or File.open (please correct me if I'm wrong on any of this).
So say you like to use the File class for paths and stuff like this:
f = File.new('spec/tmp/testfile.md')
File.basename(f)
But you never call f.close. Does leaving this buffer open leak memory? If I called this several hundred times for a tree in a filesystem would I be in deep trouble?
Thanks for your replies!
PS I know you can just use File.basename('spec/tmp/testfile.md') instead, I'm just using this as an example

Yes
Except for the sys* family of operations, Ruby's IO ops ultimately allocate both file descriptors and buffers.
If you don't close the IO object then you are correct ... you most likely leak both the fd and the buffer.
Now, if you allocate it in such a way as to overwrite or otherwise end the lifetime of the old reference, then Ruby can g/c the entire object. This will definitely free the buffer, and it will eventually free the FD as well.
In all languages, however, it's considered quite bad practice to rely upon a g/c-triggered finalizer as it's unpredictable how long it will take and how many outstanding OS-level resources will exist at one time. You may exceed some local limit before the g/c machinery even starts up.
The general rule is to allocate and free OS resources synchronously.
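In Ruby, that general rule usually means reaching for the block form of File.open, which closes the file for you when the block exits, even if it raises. A small sketch (the temporary path and contents are just for the demo):

```ruby
require 'tmpdir'

path = File.join(Dir.tmpdir, "testfile-#{Process.pid}.md")
File.write(path, '# notes')

# Block form: Ruby closes the file when the block returns,
# even if the block raises, so nothing is left to leak.
name = File.open(path) { |f| File.basename(f) }

# Explicit equivalent: `ensure` guarantees the close runs.
f = File.new(path)
begin
  name = File.basename(f)
ensure
  f.close
end

File.delete(path)
```

If you only need the basename, of course, passing the path string directly avoids opening a file descriptor at all.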
And as long as I'm beating the subject to death, there is an exception. If you are allocating a fixed number of descriptors or something, and they all must exist at once anyway, and the program is going to exit after finishing its work, then it's OK to just leave them. The OS cleans up everything. For example, it's best not to free memory right before exit. The processing needed to manage the heap is completely wasted if the program is about to exit. The OS is just going to put every single page of the program on its free list. And there is an exception to the exception. If it's homework, I would free everything.


What is the ideal way to emulate process replacement on Windows?

So, in a feature request I filed against Node.js, I was looking for a way to replace the current Node process with another. In Linux and friends (really, any POSIX-compliant system), this is easy: use execve and call it a day. But obviously, that won't work on Windows, since it only has CreateProcess (which the exec* emulation in the C runtime delegates to, complete with async behavior). And it's not like people haven't wanted to do similar, leading to numerous duplicate questions on this site. (This isn't a duplicate because it's explicitly seeking a workaround given certain constraints, not just asking for direct replacement.)
Process replacement has several facets that have to be addressed:
All console I/O streams have to be forwarded to the new process.
All signals need to be transparently forwarded to the new process.
The data from the old process have to be destroyed, with as many resources reclaimed as possible.
All pre-existing threads and child processes should be destroyed.
All pre-existing handles should be destroyed apart from open file descriptors and named pipes/etc.
Optimally, the old process's memory should be kept to a minimum after the process is created.
For my particular use case, retaining the process ID is not important.
And for my particular case, there are a few constraints:
I can control the initial process's startup as well as the location of my "process replacement" function.
I could load arbitrary native code via add-ons at potentially any stack offset.
Implication: I can't even dream of tracking malloc calls, handles, thread manipulation, or process manipulation to track and free them all, since DLL rewriting isn't exactly practical.
I have no control over when my "process replacement" is called. It could be called through an add-on, which could've been called through either interpreted code via FFI or even another add-on recursively. It could even be called during add-on initialization.
Implication: I would have no ability to know what's in the stack, even if I perfectly instrumented my side. And rewriting all their calls and pushes is far from practical, and would just be all-around slow for obvious reasons.
So, here's the gist of what I was thinking: use something similar to a pseudo-trampoline.
Statically allocate the following:
A single pointer for the stack pointer.
MAX_PATH + 1 chars for the application path + '\0'.
MAX_PATH + 1 chars for the current working directory path + '\0'.
32768 chars for the arguments + '\0'.
32768 chars for the environment + '\0'.
On entry, set the global stack pointer reference to the stack pointer.
On "replacement":
Do relevant process cleanup and lock/release everything you can.
Set the stack pointer to the stored original global one.
Terminate each child thread.
Kill each child process.
Free each open handle.
If possible (i.e. not in a UWP program), for each heap, destroy it if it's not the default heap or the temporary heap (if it exists).
If possible, close each open handle.
If possible, walk the default heap and free each segment associated with it.
Create a new process with the statically allocated file/arguments/environment/etc. with no new window created.
Proxy all future received signals, exceptions, etc. without modification to this process somehow. The standard signals are easy, but not so much with the exceptions.
Wait for the process to end.
Return with the process's exit code.
The idea here is to use a process-based trampoline and drop the current process size to an absolute minimum while the newly created one is started.
But since I'm not very familiar with Windows, I probably made quite a few mistakes here. Also, the above seems extremely inefficient, and to an extent it just feels horribly wrong for something where a kernel could just release a few memory pages, deallocate a bunch of handles, and move some memory around for the next process.
So, to summarize, what's the ideal way to emulate process replacement on Windows with the fewest limitations?
Given that I don't understand what is actually being requested and I certainly look at things like 'execve' with a "who the hell would ever call that anyway, nothing but madness can ever result" sentiment, I nonetheless look at this problem by asking myself:
if process-a was killed and replaced by a near-identical process-b, who or what would notice?
Anything that held the process ID, or a handle to the process, would certainly notice. This can be handled by writing a wrapper app which loads the first node process and, when prodded, kills it and loads the next. External observers see the wrapping process's handles and ID unchanged.
Obviously this would cut off the stdin and stdout streams being fed into the node applications. But again, the wrapper process could get around this by passing the same set of inheritable handles to each node process launched by filling in the STARTUPINFO structure passed to CreateProcess properly.
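As a hedged, cross-platform sketch of that wrapper idea (in Go for brevity; the runOnce helper and the demo commands are illustrative assumptions): each child is wired to the wrapper's own standard streams, which on Windows Go's os/exec arranges by passing inheritable handles through STARTUPINFO to CreateProcess.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// runOnce starts argv as a child process wired to the wrapper's own
// stdin/stdout/stderr and returns the child's exit code. External
// observers keep seeing the wrapper's unchanged process ID and handles.
func runOnce(argv []string) (int, error) {
	cmd := exec.Command(argv[0], argv[1:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	err := cmd.Run()
	if ee, ok := err.(*exec.ExitError); ok {
		return ee.ExitCode(), nil // child ran, exited nonzero
	}
	if err != nil {
		return -1, err // child could not be started at all
	}
	return 0, nil
}

func main() {
	// Demo "replacement" loop: each iteration stands in for killing
	// the old node process and launching its successor.
	for _, step := range [][]string{
		{"sh", "-c", "echo first generation"},
		{"sh", "-c", "echo second generation"},
	} {
		if code, err := runOnce(step); err != nil || code != 0 {
			fmt.Fprintln(os.Stderr, "child failed:", code, err)
			os.Exit(1)
		}
	}
}
```

A real wrapper would also forward termination requests to the current child and pick the next command from wherever the "replacement" request carries it.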
Windows doesn't support signals, and the ones that the MS C runtime fakes all deal with internal errors, except one, which deals with an interactive console window being closed via Ctrl-C. The active Node.js app is sure to get that anyway, or it can be passed on from the wrapper, since the node apps would not actually be running on the interactive console with this approach.
Other than that, everything else seems to be an internal detail of the Node.js application, so it shouldn't affect any 3rd-party app communicating with what it thinks is a single node app via its stdin/stdout streams.

Go destructors?

I know there are no destructors in Go since technically there are no classes. As such, I use initClass to perform the same functions as a constructor. However, is there any way to create something to mimic a destructor in the event of a termination, for the use of, say, closing files? Right now I just call defer deinitClass, but this is rather hackish and I think a poor design. What would be the proper way?
In the Go ecosystem, there exists a ubiquitous idiom for dealing with objects which wrap precious (and/or external) resources: a special method designated for freeing that resource, called explicitly — typically via the defer mechanism.
This special method is typically named Close(), and the user of the object has to call it explicitly when they're done with the resource the object represents. The io standard package even has a special interface, io.Closer, declaring that single method. Objects implementing I/O on various resources such as TCP sockets, UDP endpoints and files all satisfy io.Closer, and are expected to be explicitly Closed after use.
Calling such a cleanup method is typically done via the defer mechanism, which guarantees the method will run whether or not the code executed after resource acquisition panics.
You might also notice that not having implicit "destructors" quite balances not having implicit "constructors" in Go. This actually has nothing to do with not having "classes" in Go: the language designers just avoid magic as much as practically possible.
Note that Go's approach to this problem might appear to be somewhat low-tech, but in fact it's the only workable solution for a runtime featuring garbage collection. In a language with objects but without GC, say C++, destroying an object is a well-defined operation: an object is destroyed either when it goes out of scope or when delete is called on its memory block. In a runtime with GC, the object will be destroyed at some mostly indeterminate point in the future by the GC scan, and may not be destroyed at all. So if the object wraps some precious resource, that resource might get reclaimed way past the moment the last live reference to the enclosing object was lost, and it might not get reclaimed at all, as has been well explained by #twotwotwo in their respective answer.
Another interesting aspect to consider is that Go's GC is fully concurrent (with the regular program execution). This means a GC thread which is about to collect a dead object will usually not be the thread(s) which executed that object's code when it was alive. In turn, this means that if Go types could have destructors, the programmer would need to make sure whatever code the destructor executes is properly synchronized with the rest of the program, whenever the object's state affects some data structures external to it. This could force the programmer to add such synchronization even if the object does not need it for its normal operation (and most objects fall into that category). And think about what happens if those external data structures happened to be destroyed before the object's destructor was called (the GC collects dead objects in a non-deterministic order). In other words, it's much easier to control, and to reason about, object destruction when it is explicitly coded into the program's flow: both for specifying when the object has to be destroyed, and for guaranteeing proper ordering of its destruction with regard to destroying the data structures external to it.
If you're familiar with .NET, it deals with resource cleanup in a way which resembles that of Go quite closely: your objects which wrap some precious resource have to implement the IDisposable interface, and a method, Dispose(), exported by that interface, must be called explicitly when you're done with such an object. C# provides some syntactic sugar for this use case via the using statement which makes the compiler arrange for calling Dispose() on the object when it goes out of the scope declared by the said statement. In Go, you'll typically defer calls to cleanup methods.
One more note of caution. Go wants you to treat errors very seriously (unlike most mainstream programming languages with their "just throw an exception and don't give a fsck about what happens due to it elsewhere and what state the program will be in" attitude), and so you might consider checking the error returns of at least some calls to cleanup methods.
A good example is instances of the os.File type representing files on a filesystem. The fun stuff is that calling Close() on an open file might fail due to legitimate reasons, and if you were writing to that file this might indicate that not all the data you wrote to that file had actually landed in it on the file system. For an explanation, please read the "Notes" section in the close(2) manual.
In other words, just doing something like
fd, err := os.Open("foo.txt")
if err != nil {
    return err // or otherwise handle the error
}
defer fd.Close()
is okay for read-only files in 99.9% of cases, but for files opened for writing you might want to implement more involved error checking and some strategy for dealing with failures (mere reporting, wait-then-retry, ask-then-maybe-retry or whatever).
runtime.SetFinalizer(ptr, finalizerFunc) sets a finalizer--not a destructor but another mechanism to maybe eventually free up resources. Read the documentation there for details, including downsides. They might not run until long after the object is actually unreachable, and they might not run at all if the program exits first. They also postpone freeing memory for another GC cycle.
If you're acquiring some limited resource that doesn't already have a finalizer, and the program would eventually be unable to continue if it kept leaking, you should consider setting a finalizer. It can mitigate leaks. Unreachable files and network connections are already cleaned up by finalizers in the stdlib, so it's only other sorts of resources where custom ones can be useful. The most obvious class is system resources you acquire through syscall or cgo, but I can imagine others.
Finalizers can help get a resource freed eventually even if the code using it omits a Close() or similar cleanup, but they're too unpredictable to be the main way to free resources. They don't run until GC does. Because the program could exit before next GC, you can't rely on them for things that must be done, like flushing buffered output to the filesystem. If GC does happen, it might not happen soon enough: if a finalizer is responsible for closing network connections, maybe a remote host hits its limit on open connections to you before GC, or your process hits its file-descriptor limit, or you run out of ephemeral ports, or something else. So it's much better to defer and do cleanup right when it's necessary than to use a finalizer and hope it's done soon enough.
You don't see many SetFinalizer calls in everyday Go programming, partly because the most important ones are in the standard library and mostly because of their limited range of applicability in general.
In short, finalizers can help by freeing forgotten resources in long-running programs, but because not much about their behavior is guaranteed, they aren't fit to be your main resource-management mechanism.
There are finalizers in Go; I wrote a little blog post about it. They are even used for closing files in the standard library's os package.
However, I think using defer is more preferable because it's more readable and less magical.

Can a read() by one process see a partial write() by another?

If one process does a write() of size (and alignment) S (e.g. 8KB), then is it possible for another process to do a read (also of size and alignment S and the same file) that sees a mix of old and new data?
The writing process adds a checksum to each data block, and I'd like to know whether I can use a reading process to verify the checksums in the background. If the reader can see a partial write, then it will falsely indicate corruption.
What standards or documents apply here? Is there a portable way to avoid problems here, preferably without introducing lots of locking?
When a function is guaranteed to complete without there being any chance of any other process/thread/anything seeing things in a half finished state, it's said to be atomic. It either has or hasn't happened, there is no part way. While I can't speak to Windows, there are very few file operations in POSIX (which is what Linux/BSD/etc attempt to stick to) that are guaranteed to be atomic. Reading and writing are not guaranteed to be atomic.
While it would be pretty unlikely for you to write 2 bytes to a file and have another process see only one of those bytes, if by dumb luck your write straddled two different pages in memory and the VM system had to do something to prepare the second page, it's possible a second process would see one byte without the other. Usually if things are page-aligned in your file they will be in memory too, but again, you can't rely on that.
Here's a list someone made of what is atomic in POSIX, which is pretty short, and I can't vouch for its accuracy. (I can't think of why unlink isn't listed, for example.)
I'd also caution you against testing what appears to work and running with it, the moment you start accessing files over a network file system (NFS on Unix, or SMB mounts in Windows) a lot of things that seemed to be atomic before no longer are.
If you want to have a second process calculating checksums while a first process is writing the file, you may want to open a pipe between the two and have the first process write a copy of everything down the pipe to the checksumming process. That may be faster than dealing with locking.

What happens to a process handle once the process was ended?

If I have a handle to some Windows process which has stopped (killed or just ended):
Will the handle (or better the memory behind it) be re-used for another process?
Or will GetExitCodeProcess() for example get the correct result forever from now on?
If 1. is true: How "long" would GetExitCodeProcess() work?
If 2. is true: Wouldn't that mean that I can bring down the OS with starting/killing new processes, since I create more and more handles (and the OS reserves memory for them)?
I'm a bit confused about the concept of handles.
Thank you in advance!
The handle indirectly points to a kernel object. As long as there are open handles, the object will be kept alive.
Will the handle (or better the memory behind it) be re-used for another process?
The numeric value of the handle (or however it is implemented) might get reused, but that doesn't mean it'll always point to the same thing. Just like process IDs.
Or will GetExitCodeProcess() for example get the correct result forever from now on?
No. When all handles to the process are closed, the process object is freed (along with its exit code). Note that a running process holds an implicit handle to itself. You can hold an open handle, though, for as long as you need it.
If 2. is true: Wouldn't that mean that I can bring down the OS with starting/killing new processes, since I create more and more handles (and the OS reserves memory for them)?
There are many ways to starve the system. It will either start heavily swapping or just fail to spawn a new process at some point.
Short answer:
GetExitCodeProcess works until you call CloseHandle, after which the process object will be released and may be reused.
Long answer:
See Cat Plus Plus's answer.

Question about g++ generated code

Dear g++ hackers, I have the following question.
When some data of an object is overwritten by a faulty program, why does the program eventually fail on destruction of that object with a double free error? How does it know if the data is corrupted or not? And why does it cause double free?
It's usually not that the object's memory is overwritten, but some part of the memory outside of the object. If this hits malloc's control structures, free will freak out once it accesses them and tries to do weird things based on the corrupted structure.
If you'd really only overwrite object memory with silly stuff, there's no way malloc/free would know. Your program might crash, but for other reasons.
Take a look at valgrind. It's a tool that emulates the CPU and watches every memory access for anomalies (like trying to overwrite malloc's control structures). It's really easy to use; most of the time you just start your program inside valgrind by prepending valgrind to your command on the shell, and it saves you a lot of pain.
Regarding C++: always make sure that you use new in conjunction with delete and, respectively, new[] in conjunction with delete[]. Never mix them up. Bad things will happen, often similar to what you are describing (but valgrind would warn you).
