Ruby poor performance in thread - ruby

I have a function that does IO/computation. I made a demo function which copies ~300MB from here to there. If I run it in a thread which I immediately join, it is much slower than if I run it without a thread. I checked with:
require 'fileutils'
def cp
  start = Time.now
  FileUtils.cp_r("C:/tmp", "C:/tmp1")
  fin = Time.now - start
  p fin
end
Comparing these:
cp
Thread.new{cp}.join
the first cp call is always two to four times faster than the threaded call. The same happens if I do
cp
Thread.new{cp}
sleep 200
I have heard about the GIL, etc., but here only one thread runs at a time, so there should be no contention for running time. Any ideas on how I can make it faster, or why this is happening?

Threading is no guarantee that code will run faster than, or even as fast as, non-threaded code, at least currently with MRI. JRuby might do better. Inside a thread, your cp isn't getting the full attention of the CPU, which is why running it without a thread, and letting it block until done, is faster.
Consider using fork instead.
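A minimal sketch of the fork suggestion, assuming a platform where Kernel#fork is available (MRI does not implement it on Windows, where Process.spawn or system would be needed instead). The helper name timed_in_child and the temporary directories are made up for illustration, and nothing here promises that the child process will be faster than the thread:

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: run the block in a forked child process,
# wait for it, and return the elapsed wall-clock seconds.
def timed_in_child(&block)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  pid = fork(&block)   # the child runs the block, then exits
  Process.wait(pid)    # the parent blocks until the child is done
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

Dir.mktmpdir do |dir|
  src = File.join(dir, 'src')
  dst = File.join(dir, 'dst')
  FileUtils.mkdir_p(src)
  File.write(File.join(src, 'sample.txt'), 'x' * 1024)

  elapsed = timed_in_child { FileUtils.cp_r(src, dst) }
  puts "copied in child process: #{elapsed.round(4)}s"
end
```

The child gets its own interpreter, so the copy does not share a GIL with any thread in the parent.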
"A dozen (or so) ways to start sub-processes in Ruby: Part 1" looks useful. Also "How do you spawn a child process in Ruby?".

Related

Find the Run Time of Select Ruby Code

Problem
Howdy guys, so I want to find the run time of a block of code in Ruby, but I am not entirely sure as to how I could do it. I want to run some code, and then output how long it took to run that code because I have a super huge program and the run time changes a lot. I want to make sure it always has a consistent run time (I could do it by sleeping it for a fraction of a second) but that isn't my problem. I want to find out how long the run time actually is so the program can know if it needs to slow things down or speed things up.
My Thoughts
So, I have an idea as to how it could work. I have never used Time in ruby but I have an idea as to how I could use that. I could have a variable equal to the time (in milliseconds) and then another variable that I make at the end of the code block that does it again, and then I just subtract them, but I have (1) never used Time and (2) I don't actually know if that is the best way.
Thanks in advance!
Ruby has the Benchmark module for timing how long things take. I've never used this outside of seeing if a method is taking too long to run, etc. in development, not sure if this is 'recommended' for production code or for keeping things above a minimum runtime (as it sounds like you might be doing), but take a look and see how it feels for your use case.
It also sounds like you might be interested in the Timeout module as well (for making sure things don't take longer than a set amount of time).
If you really have a use case for making sure something takes a minimum amount of time, timing the code (using a Benchmark method, plain Time, or another solution) and then sleeping for the difference is the only thing that comes to mind.
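The "time it, then sleep the difference" idea could look roughly like this. The helper name with_minimum_duration is hypothetical, and the sketch uses the monotonic clock rather than Time.now so that system clock adjustments don't skew the measurement:

```ruby
# Hypothetical helper: run the block, then sleep for whatever is left
# of min_seconds so the whole call takes at least that long.
def with_minimum_duration(min_seconds)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  remaining = min_seconds - elapsed
  sleep(remaining) if remaining > 0
  result
end

value = with_minimum_duration(0.05) { 21 * 2 }
puts value
```

If the block already takes longer than min_seconds, the helper returns right away without sleeping; it cannot speed anything up.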
It is simple. Look at your watch (Time.now) and remember the time, run the code, look at your watch again, subtract.
t0 = Time.now
# your block of code
puts Time.now - t0
http://ruby-doc.org/core-1.9.3/Time.html
You want to use the Time object. (Time Docs)
For example,
start = Time.now
# code to time
finish = Time.now
diff = finish - start
diff would be in seconds, as a floating point number.
EDIT: end is reserved.
or you can use
require 'benchmark'
def foo
  time = Benchmark.measure {
    # code to test
  }
  puts time # prints all four values; use time.real for just the elapsed seconds, or save it to logs
end
Sample output:
2.2.3 :001 > foo
5.230000 0.020000 5.250000 ( 5.274806)
Values are user CPU time, system CPU time, their total, and (in parentheses) real elapsed time.
http://ruby-doc.org/stdlib-2.0.0/libdoc/benchmark/rdoc/Benchmark.html#method-c-bm
Source: Ruby docs.

Multi-threading in Ruby (MRI)

According to how the GIL is implemented in Ruby (MRI), the code below should fail by printing the message more than once. But it doesn't; it always prints it exactly once:
class Sheep
  def initialize
    @shorn = false
  end
  def shorn?
    @shorn
  end
  def shorn!
    puts "shearing..."
    @shorn = true
  end
end
s = Sheep.new
55.times.map do
  Thread.new { s.shorn! unless s.shorn? }
end.each(&:join)
How come?
$ ruby --version
ruby 2.1.2p95 (2014-05-08 revision 45877) [x86_64-darwin13.0]
It depends on which exact Ruby version you use, since versions differ in how they schedule threads. On my system it also depends a bit on the overall system load and how fast the terminal is, but on Ruby 2.0.0p481 I get between 1 and 55 lines of output, while on Ruby 1.8.7 I consistently get only one line.
It should be noted that Ruby 1.9 and higher use actual OS threads (albeit still with a GIL), while Ruby 1.8 uses internal green threads with its own scheduling. It may well be that older Ruby versions schedule threads in a more granular way.
In any case, you should not rely on any incidental thread scheduling behavior. It is not part of any documented behavior, and it will change across systems and as Ruby matures. You should always ensure that you use shared data structures safely when using threads.
I use ruby 2.1.5p273, and I suppose your slightly different Ruby version should yield similar results.
I get different results every time I run the program.
I tried with one core enabled and with four cores enabled, and I don't see a difference. The code is not thread safe, as you expected.
Otherwise, the only answer I can come up with is that your program is too fast/lightweight, so the interpreter does not switch threads very often.
I have only one suggestion in this case: a trick you can use to give the interpreter a hint that it could switch threads. You could use the sleep function.
In your example I would put it just before the race condition:
def shorn!
  sleep 0.0001
  puts "shearing..."
  @shorn = true
end
If you'd like to have more info about the GIL, I can recommend Jesse Storimer's "Nobody understands the GIL".
If you'd like to read more about Ruby and concurrency, I can recommend Dotan Nahum's "Pragmatic Concurrency with Ruby".
The trick I suggested was mentioned in this answer.
As others have mentioned, the GIL's behavior is not documented and is totally implementation-dependent. You shouldn't rely on any expectations about its scheduling behavior.
A more detailed (and also more general) answer, however, is that the scheduler switches execution between threads to make sure that no single thread blocks the process. This switch is called a context switch or more specifically a thread switch.
When the context switch occurs, the current thread's execution is paused and another thread's execution is resumed. If it's a brand new thread that's being "resumed," then it means that the new thread's execution starts from the beginning.
In the case of your program, each new thread begins with
s.shorn?
as it evaluates unless s.shorn?. At this point, @shorn == false and s.shorn? evaluates to false. So then the thread runs:
s.shorn!
The first command in #shorn! that gets run is:
puts "shearing..."
What happens next depends on the thread scheduler:
If the scheduler decides to let the current thread continue executing, then the next command that gets executed is @shorn = true. Then the thread ends and the scheduler starts the next thread; this time s.shorn? evaluates to true, so the unless guard skips shorn! and the thread stops. This behavior repeats in a loop until there are no more threads left.
If the scheduler decides to switch to another thread, then it will pause execution right before @shorn = true and start running the same code as before from the beginning. That means that @shorn == false when the new thread starts, and so puts "shearing..." will execute again.
As you can see, it all depends on when the scheduler decides to perform a context switch.
But what about the GIL?
The GIL is a horribly misunderstood part of MRI Ruby. There are plenty of resources out there to explain how the GIL works, but in this case the most important thing that you should know is that the GIL doesn't guarantee that each thread will run sequentially.
Instead, the GIL merely guarantees that most core Ruby methods that are implemented in C (for example, Array#<<) won't be interrupted by a context switch until they are finished. In the case of puts "shearing...", I haven't looked at the code for puts, but probably the GIL guarantees that no other thread will run until the currently running thread finishes executing puts.
As for why your code displayed shearing... only once when you ran it under MRI 1.8.7, that doesn't necessarily have anything to do with green vs. native threads. The better answer is that it was a coincidence. The more precise answer is that, in your case, the scheduler for some reason decided to interrupt the first thread only after it had run @shorn = true. This behavior may have been due to green threads, in the sense that your native scheduler may interrupt more frequently than Ruby's scheduler (hence the "more granular" suggestion in one of the answers below), but that's not necessarily true. It could also have been a fluke.
Multithreading in Ruby is really easy to mess up. Hence why Matz recommends sticking to forking processes, which is memory-inefficient but removes the burden of managing threads. Another approach for larger projects would be to use a library like Celluloid, which abstracts away Ruby's thread safety mechanisms. For a small example like this, however, a simple mutex would do:
semaphore = Mutex.new
s = Sheep.new
55.times.map {
  Thread.new {
    semaphore.synchronize do
      s.shorn! unless s.shorn?
    end
  }
}.each(&:join)
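To see the difference the mutex makes, here is a self-contained sketch that counts how often the sheep gets shorn. CountingSheep and shear_concurrently are hypothetical names, and the short sleep inside shorn! is only there to widen the window between the check and the write (the trick from the earlier answer); how many times the unsynchronized run shears depends on the scheduler:

```ruby
# Variant of the Sheep class from the question that counts shearings.
class CountingSheep
  attr_reader :shear_count

  def initialize
    @shorn = false
    @shear_count = 0
  end

  def shorn?
    @shorn
  end

  def shorn!
    sleep 0.001          # widen the gap between the check and the write
    @shear_count += 1
    @shorn = true
  end
end

# Hypothetical helper: shear one sheep from several threads,
# optionally doing the check-then-act step under a lock.
def shear_concurrently(thread_count, lock: nil)
  sheep = CountingSheep.new
  thread_count.times.map do
    Thread.new do
      if lock
        lock.synchronize { sheep.shorn! unless sheep.shorn? }
      else
        sheep.shorn! unless sheep.shorn?
      end
    end
  end.each(&:join)
  sheep.shear_count
end

puts "without mutex: #{shear_concurrently(20)}"   # scheduler-dependent, often more than 1
puts "with mutex:    #{shear_concurrently(20, lock: Mutex.new)}"
```

With the lock, the check and the write happen as one unit, so only the first thread ever shears.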

how to stop a running script in Matlab [duplicate]

I wrote a long-running script in MATLAB, e.g.
tic;
d = rand(5000);
[a,b,c] = svd(d);
toc;
It seems to run forever. Because I pressed F5 in the editor window, I cannot press Ctrl+Break to stop it in the MATLAB console.
I just want to know how to stop the script. I am currently using Task Manager to kill MATLAB, which is really silly.
Thanks.
The MATLAB help says this:
For M-files that run a long time, or that call built-ins or MEX-files that run a long time, Ctrl+C does not always effectively stop execution. Typically, this happens on Microsoft Windows platforms rather than UNIX platforms. If you experience this problem, you can help MATLAB break execution by including a drawnow, pause, or getframe function in your M-file, for example, within a large loop. Note that Ctrl+C might be less responsive if you started MATLAB with the -nodesktop option.
So I don't think any such option exists. This happens with many MATLAB functions that are complex. Either we have to wait or not use them!
If Ctrl+C doesn't respond right away because your script is too long/complex, hold it down.
The break command doesn't run when matlab is executing some of its deeper scripts, and either it won't log a ctrl sequence in the buffer, or it clears the buffer just before or just after it completes those pieces of code. In either case, when matlab returns to execute more of your script, it will recognize that you are holding ctrl+c and terminate.
For longer running programs, I usually try to find a good place to provide a status update and I always accompany that with some measure of time using tic and toc. Depending on what I am doing, I might use run time, segment time, some kind of average, etc...
For really long running programs, I found this to be exceptionally useful
http://www.mathworks.com/matlabcentral/fileexchange/16649-send-text-message-to-cell-phone/content/send_text_message.m
but it looks like they have some newer functions for this too.
MATLAB doesn't respond to Ctrl-C while executing a MEX-implemented function such as svd. It also doesn't respond while it is allocating a big chunk of memory. A good practice is to always run your functions on a small amount of data first, and to run them at actual scale only once all tests pass. When time is an issue, you will want to analyze how long each segment of code runs, as well as its rough time complexity.
Consider having multiple matlab sessions. Keep the main session window (the pretty one with all the colours, file manager, command history, workspace, editor etc.) for running stuff that you know will terminate.
Stuff that you are experimenting with, say you are messing with ode suite and you get lots of warnings: matrix singular, because you altered some parameter and didn't predict what would happen, run in a separate session:
dos('matlab -automation -r &')
You can kill that without having to restart the whole of Matlab.
One solution I adopted (for use with Java code, but the concept is the same with mexFunctions, just messier) is to return a FutureValue and then loop until FutureValue.finished(), or whatever it is called, returns true. The actual code executes in another thread/process. Wrapping a try/catch around that loop, with a FutureValue.cancel() in the catch block, works for me.
In the case of mex functions, you will need to return some sort of pointer (as an int) that points to a struct/object holding all the data you need (native thread handle, a bool for completion, etc.). In the case of a built-in mexFunction, your mexFunction will most likely need to call that mexFunction in the separate thread. Mex functions are just DLLs/shared objects, after all.
PseudoCode
FV = mexLongProcessInAnotherThread();
try
    while ~mexIsDone(FV)
        java.lang.Thread.sleep(100); % pause has a memory leak
        drawnow; % allow stdout/err from mex to display in the command window
    end
catch
    mexCancel(FV);
end
Since you mentioned Task Manager, I'll guess you're using Windows. Assuming you're running your script within the editor, if you aren't opposed to quitting the editor at the same time as quitting the running program, the keyboard shortcut to end a process is:
Alt + F4
(By which I mean press the 'Alt' and 'F4' keys on your keyboard simultaneously.)
Alternatively, as mentioned in other answers,
Ctrl + C
should also work, but will not quit the editor.
If you are running MATLAB on Linux, you can terminate it with a command in the Linux console.
First, find the PID of the MATLAB process with this command:
top
Then use this command to kill MATLAB:
kill PID
Example:
kill 58056
To add on:
You can insert a time check inside a loop that is computationally intensive or might deadlock, i.e.
:
section_toc_conditionalBreakOff;
:
where within this section
if (toc > timeRequiredToBreakOff) % time conditional break off
    return;
    % other options may be:
    % 1. display intermediate values with pause;
    % 2. exit; % in some cases, extreme: kill/quit MATLAB
end

ruby: How do i get the number of subprocess(fork) running

I want to limit the subprocess count to 3. Once it hits 3, I wait until one of the processes stops and then start a new one. I'm using Kernel.fork to start the processes.
How do I get the number of running subprocesses? Or is there a better way to do this?
A good question, but I don't think there's such a method in Ruby, at least not in the standard library. There's lots of gems out there....
This problem though sounds like a job for the Mutex class. Look up the section Condition Variables here on how to use Ruby's mutexes.
I usually have a Queue of tasks to be done, and then have a couple of threads consuming tasks until they receive an item indicating the end of work. There's an example in "Programming Ruby" under the Thread library. (I'm not sure if I should copy and paste the example to Stack Overflow - sorry)
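This is not the "Programming Ruby" example, just a small sketch of the same pattern with the core Queue class. The helper name run_with_workers and the :done end-of-work marker are made up for illustration:

```ruby
require 'thread' # only needed on old Rubies; Queue is built in on current ones

# Hypothetical helper: run callables on a fixed number of worker threads.
def run_with_workers(tasks, worker_count = 3)
  queue = Queue.new
  results = Queue.new
  tasks.each { |task| queue << task }
  worker_count.times { queue << :done }   # one end-of-work marker per worker

  workers = worker_count.times.map do
    Thread.new do
      # Each worker pops tasks until it receives the marker.
      while (task = queue.pop) != :done
        results << task.call
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }
end

squares = run_with_workers((1..5).map { |n| -> { n * n } })
p squares.sort
```

The results arrive in completion order, which is why the example sorts them before printing.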
My solution was to use trap("CLD"), to trap SIGCLD whenever a child process ended and decrease the counter (a global variable) of processes running.
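An alternative to counting with a signal handler, sketched here as a different technique rather than the trap("CLD") approach above: keep the child PIDs in the parent and block on Process.wait whenever the limit is reached. The helper name run_in_children is hypothetical, and the sketch assumes Kernel#fork is available and that the process has no other children (Process.wait reaps any child):

```ruby
# Hypothetical helper: run each task in a forked child, never more than
# max_children at once. Returns the peak number of live children seen.
def run_in_children(tasks, max_children = 3)
  live = {}   # pid => true for children that have not been reaped yet
  peak = 0
  tasks.each do |task|
    if live.size >= max_children
      finished = Process.wait   # blocks until any one child exits
      live.delete(finished)
    end
    pid = fork { task.call }
    live[pid] = true
    peak = [peak, live.size].max
  end
  Process.waitall               # reap the remaining children
  peak
end

tasks = Array.new(7) { |i| -> { sleep 0.05 * (i % 3) } }
puts "peak children: #{run_in_children(tasks)}"
```

The number of running subprocesses at any moment is simply live.size, so no global counter is needed.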

Scaling a ruby script by launching multiple processes instead of using threads

I want to increase the throughput of a script which does net I/O (a scraper). Instead of making it multithreaded in Ruby (I use the default 1.9.1 interpreter), I want to launch multiple processes. Is there a system for doing this that lets me track when one finishes and re-launch it, so that I have X processes running at any time? Also, some will run with different command-line arguments. I was thinking of writing a bash script, but that sounds like a potentially bad idea if a method for doing this already exists on Linux.
I would recommend not forking but instead that you use EventMachine (and the excellent em-http-request if you're doing HTTP). Managing multiple processes can be a bit of a handful, even more so than handling multiple threads, but going down the evented path is, in comparison, much simpler. Since you want to do mostly network IO, which consist mostly of waiting, I think that an evented approach would scale as well, or better than forking or threading. And most importantly: it will require much less code, and it will be more readable.
Even if you decide on running separate processes for each task, EventMachine can help you write the code that manages the subprocesses using, for example, EventMachine.popen.
And finally, if you want to do it without EventMachine, read the docs for IO.popen, Open3.popen3 and Open4.popen4. All do more or less the same thing, but give you access to the stdin, stdout, stderr (Open3, Open4), and pid (Open4) of the subprocess.
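A rough sketch of the two standard-library calls mentioned above (Open4 is a separate gem and is left out). The helper names capture_stdout and capture_all are hypothetical, and the running Ruby interpreter is used as a stand-in for whatever command you would actually launch:

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper: run a command and return what it wrote to stdout.
def capture_stdout(*command)
  IO.popen(command) { |io| io.read }
end

# Hypothetical helper: feed input to a command and collect stdout,
# stderr and the exit status separately.
def capture_all(input, *command)
  Open3.popen3(*command) do |stdin, stdout, stderr, wait_thr|
    stdin.write(input)
    stdin.close
    [stdout.read, stderr.read, wait_thr.value]
  end
end

ruby = RbConfig.ruby   # path to the running interpreter
puts capture_stdout(ruby, '-e', 'puts 6 * 7')

out, err, status = capture_all('hello', ruby, '-e', 'warn "oops"; puts $stdin.read.upcase')
puts out, err, status.success?
```

Reading stdout and then stderr one after the other is fine for small outputs like these; a chatty subprocess could fill one pipe while the parent is blocked on the other, so larger outputs need both streams read concurrently.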
You can try fork: http://ruby-doc.org/core/classes/Process.html#M003148
It returns the PID, which you can use to check whether the process is still running.
If you want to manage IO concurrency, I suggest you use EventMachine.
You can either
implement (or find an equivalent gem) a ThreadPool (ProcessPool, in your case), or
prepare an array of all, let's say 1000 tasks to be processed, split it into, say 10 chunks of 100 tasks (10 being the number of parallel processes you want to launch), and launch 10 processes, of which each process right away receives 100 tasks to process. That way you don't need to launch 1000 processes and control that not more than 10 of them work at the same time.
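The second option (hand each process its share of the work up front) might be sketched as follows. The helper name process_in_chunks is hypothetical, and the sketch assumes Kernel#fork is available; note that children cannot hand return values back to the parent, so real tasks would write their results to a file, pipe, or database:

```ruby
# Hypothetical helper: split items into at most process_count chunks and
# give each chunk to its own forked child, which works through it in order.
def process_in_chunks(items, process_count, &work)
  return [] if items.empty?
  chunk_size = (items.size / process_count.to_f).ceil
  pids = items.each_slice(chunk_size).map do |chunk|
    fork { chunk.each(&work) }
  end
  pids.each { |pid| Process.wait(pid) }   # wait for every child to finish
  pids
end

pids = process_in_chunks((1..10).to_a, 3) { |n| n * n }
puts "children started: #{pids.size}"
```

The trade-off against a pool is that a chunk with slow tasks keeps its process busy while the others sit idle after finishing early.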
