How can I pause execution of a Ruby script and serialize the process to disk? - ruby

One script takes a lot of time to run. I would like to pause it (e.g. by pressing p) and save it to the HDD (e.g. by pressing s) so I can resume it later from the HDD. Libraries like Thread or gems like Celluloid can pause some part of the code, but as far as I have seen they cannot save the current process to disk.
Ideally, I would like to put a few lines of code at the beginning of the script, or something similarly easy.

TL;DR
You are solving the wrong problem. If your script takes too long to run, speed the script up rather than trying to serialize an OS process.
Alternative Approaches
If you insist on being able to freeze processes and save state to disk, you may want to consider running your processes inside a virtual machine like VirtualBox or VMware. Both of these products support the ability to pause a virtual machine and save the VM's current state to disk.
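For instance, with VirtualBox you could drive this from the VBoxManage command-line tool; a rough sketch, where "my-vm" is a placeholder for your VM's name:
VBoxManage controlvm "my-vm" pause       # freeze the VM without stopping it
VBoxManage controlvm "my-vm" savestate   # write the VM's state to disk and stop it
VBoxManage startvm "my-vm"               # later: resume from the saved state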
I'm unaware of any way to store running OS processes on disk other than inside some sort of virtualization layer. If you really need this functionality, that's the way I'd recommend. However, you'll probably get more bang for your buck by improving the efficiency of your code (profile or benchmark for bottlenecks), scaling up your system, or scaling out your program's tasks in a distributed way.
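As a starting point for finding those bottlenecks, here is a minimal sketch using Ruby's standard Benchmark module; the two method names are placeholders for whatever steps you suspect are slow:
require 'benchmark'

Benchmark.bm(15) do |bm|
  bm.report('fetch data')   { fetch_data }    # hypothetical slow step
  bm.report('process data') { process_data }  # hypothetical slow step
end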

Related

How to check Matplotlib's speed in Xcode and increase performance?

I'm running into some considerable speed bottlenecks with a Python-Matplotlib-Xcode combination. I know some immediate responses will probably ask "Why are you doing python stuff in Xcode, just man up and use vim" --> I like the organizing ability and the built-in version control; it makes elements of my work easier to deal with.
Getting Python to run in Xcode in the first place was a bit trickier than I had hoped, but it's possible. Now I have the following scenario:
A master file, 'main.py' does all the import stuff for me and sets up some universal formatting to make all the figures (for eventual inclusion in my PhD thesis) nice and uniform. Afterwards it runs a series of execfile commands to generate whichever graphics I need. Two things I can think of right off the bat:
1) At the very beginning of main.py, after I import all the normal Python stuff you tend to need, I call a system script which checks whether a certain filesystem is mounted. I keep all my climate model data on there since my local hard drive is too small to deal with all of it at once. Python pauses itself and waits for the system to do its thing, but once the filesystem has been found, it keeps going. Usually this only needs to happen once in the morning when I get to work, or if the VPN server kicked me off for whatever reason. (Side question: it'd be cool to know if there's a trick to automate a VPN login so it reconnects as soon as it notices it's not connected.)
2) I'm not sure how many resources Xcode is using on its own. Running the same program from the terminal is (somewhat) faster. I've tried to be memory conscious and turn off stuff I don't need while running the Python/Xcode combination.
Also, Python launches a little window whenever I call plt.show(), which in itself takes time. I've considered just saving the figures as quick PNG files and opening them with some other viewer, although I guess that would also take some time to open. Given how often these graphics change as I add model runs or think of nicer ways of displaying the data, it'd be nice to not waste something on the order of 15 to 30 minutes (possibly more) out of the entire day twiddling my thumbs and waiting for a window to pop up.
Benchmark it!
import datetime
start = datetime.datetime.now()
# your plotting code
td = datetime.datetime.now() - start
print td.total_seconds() # requires python version >= 2.7
Run it in xcode and from the command line, see what the difference is.
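If the plt.show() window itself turns out to be a big part of the cost, here is a minimal sketch of writing figures straight to PNG instead (the filename and data are placeholders):
import matplotlib
matplotlib.use('Agg')           # non-interactive backend, so no window is opened
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 5, 6])   # your plotting code goes here
fig.savefig('figure_01.png', dpi=150)  # placeholder filename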

Playing Sound in Perl script

I'm trying to add sound to a Perl script to alert the user that the transaction was OK (user may not be looking at the screen all the time while working). I'd like to stay as portable as possible, as the script runs on Windows and Linux stations.
I can
use Win32::Sound;
Win32::Sound::Play('SystemDefault',SND_ASYNC);
for Windows. But I'm not sure how to call a generic sound on Linux (Gnome). So far, I've come up with
system('paplay /usr/share/sounds/gnome/default/alert/sonar.ogg');
But I'm not sure if I can count on that path being available.
So, three questions:
Is there a better way to call a default sound in Gnome?
Is that path pretty universal (at least among Debian/Ubuntu flavors)?
paplay takes a while to exit after playing a sound; is there a better way to call it?
I'd rather stay away from beeping the system speaker, it sounds awful (this is going to get played a lot) and Ubuntu blacklists the PC Speaker anyway.
Thanks!
A more portable way to get the path to paplay (assuming it's there) might be to use File::Which. Then you could get the path like:
use File::Which;
my $paplay_path = which 'paplay';
And to play the sound asynchronously, you can fork a subprocess:
my $pid = fork;
if ( !$pid ) {
    # in the child process
    system $paplay_path, '/usr/share/sounds/gnome/default/alert/sonar.ogg';
    exit;    # stop here so the child doesn't fall through into the parent's code
}
# parent proc continues here
Notice also that I've used the multi-argument form of system; doing so avoids the shell and runs the requested program directly. This avoids dangerous bugs (and is more efficient).
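Putting the two platforms together, here is a rough sketch that dispatches on $^O; the .ogg path is the one from your question and is only an assumption about what's installed:
use strict;
use warnings;
use File::Which;

sub play_alert {
    if ( $^O eq 'MSWin32' ) {
        require Win32::Sound;
        Win32::Sound::Play('SystemDefault');   # add SND_ASYNC, as in your snippet, for async playback
    }
    else {
        my $paplay = which('paplay') or return;   # quietly skip if paplay isn't installed
        my $pid = fork;
        if ( defined $pid && $pid == 0 ) {
            exec $paplay, '/usr/share/sounds/gnome/default/alert/sonar.ogg';
        }
    }
}

play_alert();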

bash protect HD from excessive use

How do I avoid breaking the HD? I have a bash script running on an ubuntu machine, with this meta code:
bash1.sh
while true; do
    ./bash2.sh
    sleep 60
done
bash2.sh:
[ -z "$(ls -A /path/to/dir)" ] && exit   # directory is empty
# process file
# delete file
The directory is network-shared, and the computer is not doing anything else. Once per day a new file arrives and is processed. (I do know that bash1.sh can be replaced by watch.) My concern is that bash1.sh is reading bash2.sh every time - that can presumably be avoided by only having one script!? - and that bash2.sh is reading the same directory every time. Is the directory really read from the HD, or is Ubuntu somehow caching the dir in RAM so it is only read when something changes? Is it a problem that it is the same place on the HD that is read every time, or does it not matter because the HD is already spinning? If the HD never sleeps, does it matter if I set the loop time down to only one second?
Maybe the directory could be a pure RAM dir - how do I do that? Or is there some simple way to check if something has arrived over the network without reading the directory?
Reading a file or directory once every sixty seconds is not excessive use.
Seriously, don't worry about it.
If it's really worrying you, you can rethink your strategy for detecting the file.
For example, do you really need to know, within sixty seconds, that the file has arrived? Can it arrive any time during the day? Can some parts of the day be considered unlikely?
Using information like that, you can adjust the timing of checks to suit. If the file is supposed to be delivered after 4pm, don't check for it at all before then.
Check for it every sixty seconds between 4pm and 5pm, then every ten minutes after that.
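A rough sketch of that kind of schedule, reusing bash2.sh from the question (the times are placeholders for whatever your delivery window really is):
while true; do
    hour=$(date +%H)
    if [ "$hour" -lt 16 ]; then
        sleep 600          # before 4pm: don't check at all, just wait
    elif [ "$hour" -lt 17 ]; then
        ./bash2.sh
        sleep 60           # 4pm-5pm: check every minute
    else
        ./bash2.sh
        sleep 600          # after 5pm: check every ten minutes
    fi
done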
These are all business-related decisions that can be made but I would still suggest that it's unnecessary. Provided you regularly back up your disks (and have standby hardware if you need to be back up in a hurry), you shouldn't lose anything.
In fact, if you were really paranoid, you could dedicate an entire machine for this, whose sole purpose is to receive the file via FTP and, when it arrives, send it across to your real processing box.
Put nothing else on that machine and have a warm standby (exactly same software, IP address and so on but powered down) so that, if it fails, the standby can be activated in minutes.
The real processing machine is then only written to once a day - that's unlikely to affect the disk lifetime.
That's probably too paranoid for my liking but it shows that there are ways to mitigate almost any problem.

I/O performance of multiple JVM (Windows 7 affected, Linux works)

I have a program that creates a file of about 50MB size. During the process the program frequently rewrites sections of the file and forces the changes to disk (in the order of 100 times). It uses a FileChannel and direct ByteBuffers via fc.read(...), fc.write(...) and fc.force(...).
New text:
I have a better view on the problem now.
The problem appears to be that I use three different JVMs to modify a file (one creates it, two others (launched from the first) write to it). Every JVM closes the file properly before the next JVM is started.
The problem is that the cost of fc.write() to that file occasionally goes through the roof for the third JVM (in the order of 100 times the normal cost). That is, all write operations are equally slow; it is not just one that hangs for a very long time.
Interestingly, one way to help this is to insert delays (2 seconds) between the launching of the JVMs. Without a delay, writing is always slow; with a delay, the writing is slow about every second time or so.
I also found this Stackoverflow: How to unmap a file from memory mapped using FileChannel in java? which describes a problem for mapped files, which I'm not using.
What I suspect might be going on:
Java does not completely release the file handle when I call close(). When the next JVM is started, Java (or Windows) recognizes concurrent access to that file and installs some expensive concurrency handler for that file, which makes writing expensive.
Would that make sense?
The problem occurs on Windows 7 (Java 6 and 7, tested on two machines), but not under Linux (SuSE 11.3 64).
Old text:
The problem:
Starting the program as a JUnit test harness from Eclipse or from the console works fine; it takes around 3 seconds.
Starting the program through an ant task (or through JUnit by kicking off a separate JVM using a ProcessBuilder) slows the program down to 70-80 seconds for the same task (factor 20-30).
Using -Xprof reveals that the usage of 'force0' and 'pwrite' goes through the roof, from 34.1% (76+20 ticks) to 97.3% (3587+2913+751 ticks):
Fast run:
27.0% 0 + 76 sun.nio.ch.FileChannelImpl.force0
7.1% 0 + 20 sun.nio.ch.FileDispatcher.pwrite0
[..]
Slow run:
Interpreted + native Method
48.1% 0 + 3587 sun.nio.ch.FileDispatcher.pwrite0
39.1% 0 + 2913 sun.nio.ch.FileChannelImpl.force0
[..]
Stub + native Method
10.1% 0 + 751 sun.nio.ch.FileDispatcher.pwrite0
[..]
GC and compilation are negligible.
More facts:
No other methods show a significant change in the -Xprof output.
It's either fast or very slow, never something in-between.
Memory is not a problem, all test machines have at least 8GB, the process uses <200MB
rebooting the machine does not help
switching off virus scanners and similar stuff has no effect
When the process is slow, there is virtually no CPU usage
It is never slow when running it from a normal JVM
It is pretty consistently slow when running it in a JVM that was started from the first JVM (via ProcessBuilder or as ant-task)
All JVMs are exactly the same. I output System.getProperty("java.home") and the JVM options via RuntimeMXBean RuntimemxBean = ManagementFactory.getRuntimeMXBean(); List arguments = RuntimemxBean.getInputArguments();
I tested it on two machines with Windows 7 64-bit, Java 7u2, Java 6u26 and JRockit; the hardware of the machines differs, but the results are very similar.
I tested it also from outside Eclipse (command-line ant) but no difference there.
The whole program is written by myself; all it does is read from and write to this file, and no other libraries are used, especially no native libraries.
And some scary facts that I just refuse to believe to make any sense:
Removing all class files and rebuilding the project sometimes (rarely) helps. The program (nested version) runs fast one or two times before becoming extremely slow again.
Installing a new JVM always helps (every single time!) such that the (nested) program runs fast at least once! Installing a JDK counts as two because both the JDK-jre and the JRE-jre work fine at least once. Overinstalling a JVM does not help. Neither does rebooting. I haven't tried deleting/rebooting/reinstalling yet ...
These are the only two ways I ever managed to get fast program runtimes for the nested program.
Questions:
What may cause this performance drop for nested JVMs?
What exactly do these methods do (pwrite0/force0)?
Are you using local disks for all testing (as opposed to any network share)?
Can you set up Windows with a RAM drive to store the data? When a JVM terminates, by default its file handles will have been closed, but what you might be seeing is the flushing of the data to the disk. When you overwrite lots of data, the previous version of the data is discarded and may not cause disk I/O. The act of closing the file might make the Windows kernel implicitly flush data to disk. So using a RAM drive would let you confirm that, since disk I/O time would be removed from your stats.
Find a tool for Windows that allows you to force the kernel to flush all buffers to disk, use this in between JVM runs, and see how long that takes at the time.
But I would guess you are hitting some interaction between the demands of the process and the demands of the kernel in attempting to manage the disk block buffer cache. On Linux there is a tool like "/sbin/blockdev --flushbufs" that can do this.
FWIW
"pwrite" is a Linux/Unix API for allowing concurrent writing to a file descriptor (which would be the best kernel syscall API to use for the JVM, I think Win32 API already has provision for the same kinds of usage to share a file handle between threads in a process, but since Sun have Unix heritige things get named after the Unix way). Google "pwrite(2)" for more info on this API.
"force" I would guess that is a file system sync, meaning the process is requesting the kernel to flush unwritten data (that is currently in disk block buffer cache) into the file on the disk (such as would be needed before you turned your computer off). This action will happen automatically over time, but transactional systems require to know when the data previously written (with pwrite) has actually hit the physical disk and is stored. Because some other disk IO is dependant on knowing that, such as with transactional checkpointing.
One thing that could help is making sure you explicitly set the FileChannel to null. Then call System.runFinalization() and maybe System.gc() at the end of the program. You may need more than 1 call.
System.runFinalizersOnExit(true) may also help, but it's deprecated so you will have to deal with the compiler warnings.
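A minimal sketch of that cleanup sequence (the file name is a placeholder; whether this actually releases the handle any earlier on Windows is exactly what you would be testing):
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

public class ChannelCleanup {
    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile("data.bin", "rw"); // placeholder file
        FileChannel fc = raf.getChannel();

        // ... reads, writes and force() calls as in the question ...

        fc.force(true);   // flush data and metadata to disk
        fc.close();
        raf.close();
        fc = null;        // drop the reference so the channel can be finalized

        System.gc();               // suggest a collection ...
        System.runFinalization();  // ... and run pending finalizers (may need more than one call)
    }
}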

Scaling a ruby script by launching multiple processes instead of using threads

I want to increase the throughput of a script which does net I/O (a scraper). Instead of making it multithreaded in Ruby (I use the default 1.9.1 interpreter), I want to launch multiple processes. So, is there a system for doing this where I can track when one process finishes and re-launch it, so that I have X number running at any time? Also, some will run with different command args. I was thinking of writing a bash script, but that sounds like a potentially bad idea if there already exists a method for doing something like this on Linux.
I would recommend not forking, but instead using EventMachine (and the excellent em-http-request if you're doing HTTP). Managing multiple processes can be a bit of a handful, even more so than handling multiple threads, but going down the evented path is, in comparison, much simpler. Since you want to do mostly network I/O, which consists mostly of waiting, I think an evented approach would scale as well as, or better than, forking or threading. And most importantly: it will require much less code, and it will be more readable.
Even if you decide on running separate processes for each task, EventMachine can help you write the code that manages the subprocesses using, for example, EventMachine.popen.
And finally, if you want to do it without EventMachine, read the docs for IO.popen, Open3.popen3 and Open4.popen4. All do more or less the same thing but give you access to the stdin, stdout, stderr (Open3, Open4), and pid (Open4) of the subprocess.
You can try fork http://ruby-doc.org/core/classes/Process.html#M003148
You get the child's PID in return, so you can check whether that process is still running and re-launch it if it isn't.
If you want to manage IO concurrency, I suggest you use EventMachine.
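As a rough sketch of keeping a fixed number of forked workers alive this way (the worker count and the scraper command are placeholders):
MAX_WORKERS = 4   # placeholder

start_worker = -> { fork { exec('ruby', 'scraper.rb') } }   # hypothetical scraper script

pids = Array.new(MAX_WORKERS) { start_worker.call }

loop do                              # stop condition omitted for brevity
  finished = Process.wait            # blocks until any child exits
  pids.delete(finished)
  pids << start_worker.call          # replace the finished worker
end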
You can either
implement (or find an equivalent gem) a ThreadPool (ProcessPool, in your case), or
prepare an array of, let's say, 1000 tasks to be processed, split it into, say, 10 chunks of 100 tasks (10 being the number of parallel processes you want to launch), and launch 10 processes, each of which immediately receives 100 tasks to process (see the sketch after this list). That way you don't need to launch 1000 processes and manage keeping no more than 10 of them running at the same time.
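A rough sketch of the second option using plain fork and Process.waitall (the task list, chunk size and per-task method are placeholders):
tasks   = (1..1000).to_a   # placeholder list of tasks
workers = 10               # number of parallel processes

tasks.each_slice(tasks.size / workers) do |chunk|
  fork do
    chunk.each { |task| process_task(task) }   # hypothetical per-task work
  end
end

Process.waitall   # wait for every child process to finish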
