Ruby: getting disk usage information on Linux from /proc (or some other way that is non-blocking and doesn't spawn a new process) - ruby

Context:
I'm on Linux. I am writing a disk usage extension for Sensu. Extensions must be non-blocking, because they are included in the agent's main code loop. They also must be very lightweight, because they may be triggered as often as once every 10 seconds, or even down to once per second.
So I cannot spawn a new executable to gather disk usage information. From within Ruby I can only do things like File.open on files under /proc and /sys, read the contents, parse them, close the file, then print the result. Repeat.
I've found the sys-filesystem gem, which appears to have everything I need. But I'd rather not force extensions to depend on gems, if it can be avoided. I'll use the gem if it turns out to be the best way, but is there a good alternative? Something that doesn't require a ton of coding?

The information can be accessed via the system call statfs
http://man7.org/linux/man-pages/man2/statfs.2.html
I can see there is a ruby interface to this here:
http://ruby-doc.org/core-trunk/File/Statfs.html
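If you want to avoid both gems and subprocesses, another option is to call statfs(2) yourself through Fiddle (Ruby stdlib). A minimal sketch, assuming Linux x86_64 with glibc, where the relevant struct statfs fields are 64-bit; verify the offsets against your platform's <sys/statfs.h> before trusting them:
require 'fiddle'

libc   = Fiddle.dlopen('libc.so.6')                       # glibc on Linux
statfs = Fiddle::Function.new(libc['statfs'],
                              [Fiddle::TYPE_VOIDP, Fiddle::TYPE_VOIDP],
                              Fiddle::TYPE_INT)

buf = Fiddle::Pointer.malloc(120)                         # sizeof(struct statfs) on x86_64
raise 'statfs failed' unless statfs.call('/', buf).zero?

# Fields at offsets 8..39: f_bsize, f_blocks, f_bfree, f_bavail (all 64-bit here)
f_bsize, f_blocks, f_bfree, f_bavail = buf[8, 32].unpack('q4')

puts "total:     #{f_blocks * f_bsize} bytes"
puts "available: #{f_bavail * f_bsize} bytes (for unprivileged users)"
No subprocess is spawned and nothing blocks beyond the system call itself.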

Related

Can I guarantee an atomic append in Ruby?

Based on Is file append atomic in UNIX? and other sources, it looks like on modern Linux, I can open a file in append mode and write small chunks (< PIPE_BUF) to it from multiple processes without worrying about tearing.
Are those limits extended to Ruby with syswrite? Specifically for this code:
f = File.new('...', 'a')
f.syswrite("short string\n")
Can I expect the write to not interleave with other process writing the same way? Or is there some buffering / potential splitting I'm not aware of yet?
Assuming ruby >= 2.3
I recently researched this very topic in order to implement the File appender in Rackstash.
You can find tests for this in the specs, which I originally adapted from a blog post on this topic; unfortunately that post's code is inconclusive, since the author does not write to the file directly but through a pipe. Please read the comments there.
With (1) modern operating systems and (2) their usual local filesystems, the OS guarantees that concurrent appends from multiple processes do not interleave data.
The main points here are:
You need a reasonably modern operating system. Very old (or exotic) systems had lesser guarantees. Here, you might have to lock the file explicitly with e.g. File#flock.
You need to use a compatible filesystem. Most local filesystems like HFS, APFS, NTFS, and the usual Linux filesystems like ext3, ext4, btrfs, ... should be safe. SMB is one of the few network filesystems that also guarantees this. NFS and most FUSE filesystems are not safe in this regard.
Note that this mechanism does not guarantee that concurrent readers always read full writes. While the writes themselves are never interleaved, readers might read the partial results of in-progress writes.
As far as my understanding (and my tests) go, the atomically writable size isn't even restricted to PIPE_BUF. That limit only applies when writing to a pipe-like object (e.g. a socket or STDOUT), not to a regular file.
Unfortunately, authoritative information on this topic is rather sparse. Most articles (and SO answers) on this topic conflate strict appending with random writes. When not strictly appending (by opening the file in append-only mode), the guarantees are null and void.
Thus, to answer your specific question: yes, the code in your question should be safe when writing to a local filesystem on a modern operating system. syswrite already bypasses Ruby's internal write buffer; to be sure, you can also set f.sync = true before writing to completely disable any buffering.
Note that you should still use a Mutex (or similar) if you plan to write to the single opened file from multiple threads in your process, since the append guarantees of the OS only hold for concurrent writes through different file descriptors; they cannot discern overlapping writes by the same process to the same file descriptor.
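If you want to convince yourself on a particular machine and filesystem, a quick sanity check (just a sketch; the file name and counts are arbitrary) is to fork a few writers that append fixed-size lines via syswrite and verify that every line comes back intact:
filename = 'append_test.log'
File.delete(filename) if File.exist?(filename)

pids = 4.times.map do |i|
  fork do
    File.open(filename, 'a') do |f|     # append-only mode is the crucial part
      f.sync = true
      1000.times { f.syswrite("writer-#{i} #{'x' * 100}\n") }
    end
  end
end
pids.each { |pid| Process.wait(pid) }

# Every line should have the expected prefix and length if nothing interleaved.
torn = File.readlines(filename).reject { |l| l =~ /\Awriter-\d x{100}\n\z/ }
puts torn.empty? ? 'no interleaved writes detected' : "torn lines: #{torn.size}"
On a filesystem without the append guarantee (NFS, many FUSE filesystems) you should expect this check to fail sooner or later.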
I would not assume that. syswrite calls the POSIX write function, which makes no claims about atomicity when dealing with files.
See: Are POSIX' read() and write() system calls atomic?
And Understanding concurrent file writes from multiple processes
TL;DR: you should implement some concurrency control in your app to synchronize this access.
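If you would rather not rely on append atomicity at all, an explicit advisory lock around each write is one portable (if heavier) sketch of such concurrency control; it only works when every writer cooperates, and 'shared.log' is just a placeholder:
File.open('shared.log', 'a') do |f|
  f.flock(File::LOCK_EX)                # blocks until other cooperating writers release the lock
  f.syswrite("short string\n")
  f.flock(File::LOCK_UN)
end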

Calling os.Lstat only if the file has changed since the last time I called os.Lstat

I'm trying to write a program, calcsize, that calculates the size of all subdirectories. I want to cache the results and only re-walk a directory if it has changed since the last time I ran the program.
Something like:
./calcsize
//outputs
/absolute/file/path1/ 1000 Bytes
/absolute/file/path2/ 2000 Bytes
I'm already walking the directories with my own walk implementation, because the built-in Go filepath.Walk already calls Lstat on every file.
Is there any way to know if a directory or set of files has changed without calling Lstat on every file? Maybe a system call I'm not aware of?
In general, no. However, you might want to look at: https://github.com/mattn/go-zglob/blob/master/fastwalk/fastwalk_unix.go
Using the data it gathers during the directory walk, you can skip some of the stat calls if you only care about files.
Whether, and how, this is possible depends heavily on your operating system. But you might take a look at github.com/howeyc/fsnotify which claims to offer this (I've never used it--I just now found it via Google).
In general, look at any Go program that provides a 'watch' feature. GoConvey and GopherJS's serve mode come to mind as examples, but there are others (possibly even in the standard library).

Read and write file atomically

I'd like to read and write a file atomically in Ruby between multiple independent Ruby processes (not threads).
I found atomic_write from ActiveSupport. This writes to a temp file, then moves it over the original and sets all permissions. However, this does not prevent the file from being read while it is being written.
I have not found any atomic_read. (Are file reads already atomic?)
Do I need to implement my own separate 'lock' file that I check for before reads and writes? Or is there a better mechanism already present in the file system for flagging a file as 'busy' that I could check before any read/write?
The motivation is dumb, but included here because you're going to ask about it.
I have a web application using Sinatra and served by Thin which (for its own reasons) uses a JSON file as a 'database'. Each request to the server reads the latest version of the file, makes any necessary changes, and writes out changes to the file.
This would be fine if I only had a single instance of the server running. However, I was thinking about having multiple copies of Thin running behind an Apache reverse proxy. These are discrete Ruby processes, and thus running truly in parallel.
Upon further reflection I realize that I really want to make the act of read-process-write atomic. At which point I realize that this basically forces me to process only one request at a time, and thus there's no reason to have multiple instances running. But the curiosity about atomic reads, and preventing reads during write, remains. Hence the question.
You want to use File#flock in exclusive mode. Here's a little demo. Run this in two different terminal windows.
filename = 'test.txt'   # the file must already exist (no File::CREAT here)
File.open(filename, File::RDWR) do |file|
  file.flock(File::LOCK_EX)   # blocks until any other process releases its exclusive lock
  puts "content: #{file.read}"
  puts 'doing some heavy-lifting now'
  sleep(10)
end
Take a look at transaction and open_and_lock_file methods in "pstore.rb" (Ruby stdlib).
YAML::Store works fine for me. So when I need to read/write atomically I (ab)use it to store data as a Hash.
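For reference, a minimal sketch of that YAML::Store approach (stdlib, built on PStore, which takes a file lock for each transaction); the file name and key are just examples:
require 'yaml/store'

store = YAML::Store.new('db.yml')

# The read-modify-write happens while PStore holds an exclusive lock on the file,
# so concurrent processes serialize on the store.
store.transaction do
  store['counter'] = (store['counter'] || 0) + 1
end

# Read-only transactions take a shared lock instead.
store.transaction(true) { puts store['counter'] }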

Ruby file handle management (too many open files)

I am performing very rapid file access in ruby (2.0.0 p39474), and keep getting the exception Too many open files
Having looked at this thread, here, and various other sources, I'm well aware of the OS limits (set to 1024 on my system).
The part of my code that performs this file access is mutexed, and takes the form:
File.open(filename, 'w') { |f| Marshal.dump(value, f) }
where filename is subject to rapid change, depending on the thread calling the section. It's my understanding that this form relinquishes its file handle after the block.
I can verify the number of File objects that are open using ObjectSpace.each_object(File). This reports that there are up to 100 resident in memory, but only one is ever open, as expected.
Further, the exception itself is thrown at a time when ObjectSpace reports only 10-40 File objects. Moreover, manually garbage collecting fails to improve any of these counts, as does slowing the script down by inserting sleep calls.
My question is, therefore:
Am I fundamentally misunderstanding the nature of the OS limit? Does it cover the whole lifetime of a process?
If so, how do web servers avoid crashing out after accessing more than ulimit -n files?
Is Ruby retaining its file handles outside of its object system, or is the kernel simply very slow at accounting for 'concurrent' access?
Edit 20130417:
strace indicates that Ruby doesn't write all of its data to the file before returning and releasing the mutex. As a result, the file handles stack up until the OS limit is reached.
In an attempt to fix this, I have used syswrite/sysread, synchronous mode, and called flush before close. None of these methods worked.
My question is thus revised to:
Why is ruby failing to close its file handles, and how can I force it to do so?
Use dtrace or strace or whatever equivalent is on your system, and find out exactly what files are being opened.
Note that these could be sockets.
I agree that the code you have pasted does not seem to be capable of causing this problem, at least, not without a rather strange concurrency bug as well.
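On Linux you can also list the process's open descriptors straight from /proc, independently of what ObjectSpace reports; a small sketch (the path is Linux-specific):
fds = Dir.entries('/proc/self/fd') - %w[. ..]
puts "open descriptors: #{fds.size}"
fds.each do |fd|
  begin
    puts "  #{fd} -> #{File.readlink("/proc/self/fd/#{fd}")}"
  rescue Errno::ENOENT
    # the descriptor used to read the directory itself may already be gone
  end
end
If the count keeps climbing while ObjectSpace stays flat, something outside Ruby's File objects (sockets, pipes, handles held by C extensions) is leaking.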

Scaling a ruby script by launching multiple processes instead of using threads

I want to increase the throughput of a script that does network I/O (a scraper). Instead of making it multithreaded in Ruby (I use the default 1.9.1 interpreter), I want to launch multiple processes. Is there a mechanism for doing this where I can track when one process finishes and re-launch it, so that I have X number running at any time? Also, some will run with different command-line arguments. I was thinking of writing a bash script, but that sounds like a potentially bad idea if there is already an established way of doing something like this on Linux.
I would recommend not forking, but instead using EventMachine (and the excellent em-http-request if you're doing HTTP). Managing multiple processes can be a bit of a handful, even more so than handling multiple threads, but going down the evented path is, in comparison, much simpler. Since you mostly want to do network I/O, which consists mostly of waiting, I think an evented approach would scale as well as, or better than, forking or threading. And most importantly: it will require much less code, and it will be more readable.
Even if you decide on running separate processes for each task, EventMachine can help you write the code that manages the subprocesses using, for example, EventMachine.popen.
And finally, if you want to do it without EventMachine, read the docs for IO.popen, Open3.popen3 and Open4.popen4. All do more or less the same thing, but give you access to the stdin, stdout, stderr (Open3, Open4), and pid (Open4) of the subprocess.
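For completeness, a small sketch of driving a single subprocess with Open3 from the stdlib; the scraper command is only a placeholder:
require 'open3'

stdin, stdout, stderr, wait_thr = Open3.popen3('ruby', 'scraper.rb', '--site=a')
stdin.close                      # nothing to send to the child
output = stdout.read
errors = stderr.read
stdout.close
stderr.close
status = wait_thr.value          # Process::Status, available once the child exits
puts "exit #{status.exitstatus}, #{output.bytesize} bytes of output"
warn errors unless errors.empty?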
You can try fork: http://ruby-doc.org/core/classes/Process.html#M003148
It returns the child's PID, which you can use to check whether that process has finished and needs to be launched again.
If you want to manage I/O concurrency, I suggest you use EventMachine.
You can either
implement a ThreadPool (a ProcessPool, in your case) or find an equivalent gem (a sketch of this pattern follows after this list), or
prepare an array of all, say, 1000 tasks to be processed, split it into, say, 10 chunks of 100 tasks (10 being the number of parallel processes you want to launch), and launch 10 processes, each of which immediately receives 100 tasks to process. That way you don't need to launch 1000 processes and control that no more than 10 of them work at the same time.
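A minimal sketch of the process-pool idea using only Process.spawn and Process.wait; the job list and script name are placeholders for your actual scraper invocations:
MAX_WORKERS = 4
jobs = (1..20).map { |n| ['ruby', 'scraper.rb', "--chunk=#{n}"] }

running = {}
until jobs.empty? && running.empty?
  # Top the pool up to MAX_WORKERS processes.
  while running.size < MAX_WORKERS && !jobs.empty?
    cmd = jobs.shift
    running[Process.spawn(*cmd)] = cmd
  end
  # Block until any child exits, then free its slot for the next job.
  pid = Process.wait
  finished = running.delete(pid)
  warn "finished: #{finished.join(' ')} (status #{$?.exitstatus})"
end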

Resources