Ruby Parallel/Multithread Programming to read huge database - ruby

I have a ruby script reading a huge table (~20m rows), doing some processing and feeding it over to Solr for indexing purposes. This has been a big bottleneck in our process. I am planning to speed things in here and I'd like to achieve some kind of parallelism. I am confused about Ruby's multithreading nature. Our servers have
ruby 1.8.7 (2009-06-12 patchlevel 174) [x86_64-linux]. From this blog post and this question at StackOverflow it is visible that Ruby does not have a "real" multi threading approach. Our servers have multiple cores, so using parallel gem seems another approach to me.
What approach should I go with? Also, any inputs on parallel-database-read-feeding systems would be highly appreciated.

You can parallelize this at the OS level. Change the script so that it can take a range of lines from your input file
$ reader_script --lines=10000:20000 mytable.txt
Then execute multiple instances of the script.
$ reader_script --lines=0:10000 mytable.txt&
$ reader_script --lines=10000:20000 mytable.txt&
$ reader_script --lines=20000:30000 mytable.txt&
Unix will distribute them to different cores automatically.

Any chance of upgrading to Ruby 1.9? It's usually faster than 1.8.7.
It's true that Ruby suffers from having a GIL but if multithreading would solve your problem then you can take a look at JRuby since it supports true threading.
Also you better make sure it's the CPU that's the bottleneck because if it's I/O multithreading might not buy you much.

Related

Partition a loop to be processed on different cores parallel

How can I accomplish a loop that will que it's iterations on the available CPU cores? So each iteration will be run in parallel and therefore finish faster? I'm trying to understand the gem Celluloid but if there is another gem that I could use don't hesitate to tell me!
If it weren't for the jRuby tag I'd be inclined to mark as a duplicate too ;)
You don't state exactly what problem you're trying to solve but since your Ruby now has access to the Java standard lib, in Java 7+ you could use the Fork/Join framework. Charles Nutter has written a nice JRuby wrapper.

What's meant by a thread-safe Ruby interpreter?

In a 2000 interview (that is, pre-YARV), Matz said
Matz: I'd like to make it faster and more stable. I'm planning a full
rewrite of the interpreter for Ruby 2.0, code-named "Rite". It will be
smaller, easier to embed, thread-safe, and faster. It will use a
bytecode engine. It will probably take me years to implement, since
I'm pretty busy just maintaining the current version.
What was meant by "thread-safe" in this context? An interpreter that allowed you to use green threads? An interpreter that allowed you to use native threads? An interpreter that didn't have a global interpreter lock (GVL in YARV Ruby terminology)?
At the moment ruby's threading is less than ideal. Ruby can use threading and the threading works fine, but because of its current threading mechanism, the long-and-short of it is that one interpreter can only use one CPU core at a time; there are also other potential issues.
If you want all the gory details, This article covers it pretty well.

Is there any way to find the number of CPU cores in MRI Ruby (perhaps using a C extension)?

In JRuby, you can just use java.lang.Runtime.get_runtime.available_processors. Is there anything available for MRI, perhaps using a gem implemented in C?
In a future release of Ruby, it would be nice to see this information available as a automatically defined top-level constant, like RUBY_PLATFORM and RUBY_VERSION.
So far there isn't. But you can either use parallel gem or take a look how it figures out the number of cores.

Fast Thread-Safe Ruby Hash with strong read bias

I need some help in understanding Hash in Ruby 1.8.7.
I have a multi-threaded Ruby application, and about 95% of time multiple threads of the application are trying to access a global Hash.
I am not sure if the default Ruby Hash is thread safe. What would be the best way to have a fast Hash but also one that is thread safe given my situation?
The default Ruby Hash is not thread-safe. On MRI and YARV it is "somewhat accidentally thread-safe", because MRI and YARV have a broken threading implementation that is incapable of running two threads simultaneously anyway. On JRuby, IronRuby and Rubinius however, this is not the case.
I would suggest a wrapper which protects the Hash with a read-write lock. I couldn't find a pre-built Ruby read-write lock implementation (of course JRuby users can use java.util.concurrent.ReentrantReadWriteLock), so I built one. You can see it at:
https://github.com/alexdowad/showcase/blob/master/ruby-threads/read_write_lock.rb
Me and two other people have tested it on MRI 1.9.2, MRI 1.9.3, and JRuby. It seems to be working correctly (though I still want to do more thorough testing). It has a built-in test script; if you have a multi-core machine, please download, try running it, and let me know the results! As far as performance goes, it trounces Mutex in situations with a read bias. Even in situations with 80-90% writes, it still seems a bit faster than using a Mutex.
I am also planning to do a Ruby port of Java's ConcurrentHashMap.

What makes Ruby slow? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
Ruby is slow at certain things. But what parts of it are the most problematic?
How much does the garbage collector affect performance? I know I've had times when running the garbage collector alone took several seconds, especially when working with OpenGL libraries.
I've used matrix math libraries with Ruby that were particularly slow. Is there an issue with how ruby implements basic math?
Are there any dynamic features in Ruby that simply cannot be implemented efficiently? If so, how do other languages like Lua and Python solve these problems?
Has there been recent work that has significantly improved performance?
Ruby is slow. But what parts of it are the most problematic?
It does "late lookup" for methods, to allow for flexibility. This slows it down quite a bit. It also has to remember variable names per context to allow for eval, so its frames and method calls are slower. Also it lacks a good JIT compiler currently, though MRI 1.9 has a bytecode compiler (which is better), and jruby compiles it down to java bytecode, which then (can) compile via the HotSpot JVM's JIT compiler, but it ends up being about the same speed as 1.9.
How much does the garbage collector effect performance? I know I've had times when running the garbage collector alone took several seconds, especially when working with OpenGL libraries.
from some of the graphs at http://www.igvita.com/2009/06/13/profiling-ruby-with-googles-perftools/ I'd say it takes about 10% which is quite a bit--you can decrease that hit by increasing the malloc_limit in gc.c and recompiling.
I've used matrix math libraries with Ruby that were particularly slow. Is there an issue with how ruby implements basic math?
Ruby 1.8 "didn't" implement basic math it implemented Numeric classes and you'd call things like Fixnum#+ Fixnum#/ once per call--which was slow. Ruby 1.9 cheats a bit by inlining some of the basic math ops.
Are there any dynamic features in Ruby that simply cannot be implemented efficiently? If so, how do other languages like Lua and Python solve these problems?
Things like eval are hard to implement efficiently, though much work can be done, I'm sure. The kicker for Ruby is that it has to accomodate for somebody in another thread changing the definition of a class spontaneously, so it has to be very conservative.
Has there been recent work that has significantly improved performance?
1.9 is like a 2x speedup. It's also more space efficient. JRuby is constantly trying to improve speed-wise [and probably spends less time in the GC than KRI]. Besides that I'm not aware of much except little hobby things I've been working on. Note also that 1.9's strings are at times slower because of encoding friendliness.
Ruby is very good for delivering solutions quickly. Less so for delivering quick solutions. It depends what kind of problem you're trying to solve. I'm reminded of the discussions on the old CompuServe MSBASIC forum in the early 90s: when asked which was faster for Windows development, VB or C, the usual answer was "VB, by about 6 months".
In its MRI 1.8 form, Ruby is - relatively - slow to perform some types of computationally-intensive tasks. Pretty much any interpreted language suffers in that way in comparison to most mainstream compiled languages.
The reasons are several: some fairly easily addressable (the primitive garbage collection in 1.8, for example), some less so.
1.9 addresses some of the issues, although it's probably going to be some time before it becomes generally available. Some of the other implementation that target pre-existing runtimes, JRuby, IronRuby, MagLev for example, have the potential to be significantly quicker.
Regarding mathematical performance, I wouldn't be surprised to see fairly slow throughput: it's part of the price you pay for arbitrary precision. Again, pick your problem. I've solved 70+ of the Project Euler problems in Ruby with almost no solution taking more than a mintue to run. How fast do you need it to run and how soon do you need it?
The most problematic part is "everyone".
Bonus points if that "everyone" didn't really use the language, ever.
Seriously, 1.9 is much faster and now is on par with python, and jruby is faster than jython.
Garbage collectors are everywhere; for example, Java has one, and it's faster than C++ on dynamic memory handling. Ruby isn't suited well for number crunching; but few languages are, so if you have computational-intensive parts in your program in any language, you better rewrite them in C (Java is fast with math due to its primitive types, but it paid dearly for them, they're clearly #1 in ugliest parts of the language).
As for dynamic features: they aren't fast, but code without them in static languages can be even slower; for example, java would use a XML config instead of Ruby using a DSL; and it would likely be SLOWER since XML parsing is costly.
Hmm - I worked on a project a few years ago where I scraped the barrel with Ruby performance, and I'm not sure much has changed since. Right now it's caveat emptor - you have to know not to do certain things, and frankly games / realtime applications would be one of them (since you mention OpenGL).
The culprit for killing interactive performance is the garbage collector - others here mention that Java and other environments have garbage collection too, but Ruby's has to stop the world to run. That is to say, it has to stop running your program, scan through every register and memory pointer from scratch, mark the memory that's still in use, and free the rest. The process can't be interrupted while this happens, and as you might have noticed, it can take hundreds of milliseconds.
Its frequency and length of execution is proportional to the number of objects you create and destroy, but unless you disable it altogether, you have no control. My experience was there were several unsatisfactory strategies to smooth out my Ruby animation loop:
GC.disable / GC.enable around critical animation loops and maybe an opportunistic GC.start to force it to go when it can't do any harm. (because my target platform at the time was a 64MB Windows NT machine, this caused the system to run out of memory occasionally. But fundamentally it's a bad idea - unless you can pre-calculate how much memory you might need before doing this, you're risking memory exhaustion)
Reduce the number of objects you create so the GC has less work to do (reduces the frequency / length of its execution)
Rewrite your animation loop in C (a cop-out, but the one I went with!)
These days I would probably also see if JRuby would work as an alternative runtime, as I believe it relies on Java's more sophisticated garbage collector.
The other major performance issue I've found is basic I/O when trying to write a TFTP server in Ruby a while back (yeah I pick all the best languages for my performance-critical projects this was was just an experiment). The absolute simplest tightest loop to simply respond to one UDP packet with another, contaning the next piece of a file, must have been about 20x slower than the stock C version. I suspect there might have been some improvements to make there based around using low-level IO (sysread etc.) but the slowness might just be in the fact there is no low-level byte data type - every little read is copied out into a String. This is just speculation though, I didn't take this project much further but it warned me off relying on snappy I/O.
The main speed recent increase that has gone on, though I'm not fully up-to-date here, is that the virtual machine implementation was redone for 1.9, resulting in faster code execution. However I don't think the GC has changed, and I'm pretty sure there's nothing new on the I/O front. But I'm not fully up-to-date on bleeding-edge Ruby so someone else might want to chip in here.
I assume that you're asking, "what particular techniques in Ruby tend to be slow."
One is object instantiation. If you are doing large amounts of it, you want to look at (reasonable) ways of reducing that, such as using the flyweight pattern, even if memory usage is not a problem. In one library where I reworked it not to be creating a lot of very similar objects over and over again, I doubled the overall speed of the library.
Steve Dekorte: "Writing a Mandelbrot set calculator in a high level language is like trying to run the Indy 500 in a bus."
http://www.dekorte.com/blog/blog.cgi?do=item&id=4047
I recommend to learn various tools in order to use the right tool for the job. Doing matrix transformations could be done efficiently using high-level API which wraps around tight loops with arithmetic-intensive computations. See RubyInline gem for an example of embedding C or C++ code into Ruby script.
There is also Io language which is much slower than Ruby, but it efficiently renders movies in Pixar and outperforms raw C on vector arithmetics by using SIMD acceleration.
http://iolanguage.com
https://renderman.pixar.com/products/tools/it.html
http://iolanguage.com/scm/git/checkout/Io/docs/IoGuide.html#Primitives-Vector
Ruby 1.9.1 is about twice as fast as PHP, and a little bit faster than Perl, according to some benchmarks.
(Update: My source is this (screenshot). I don't know what his source is, though.)
Ruby is not slow. The old 1.8 is, but the current Ruby isn't.
Ruby is slow because it was designed to optimize the programmers experience, not the program's execution time. Slowness is just a symptom of that design decision. If you would prefer performance to pleasure, you should probably use a different language. Ruby's not for everything.
IMO, dynamic languages are all slow in general. They do something in runtime that static languages do in compiling time.
Syntax Check, Interpreting and Like type checking, converting. this is inevitable, therefore ruby is slower than c/c++/java, correct me if I am wrong.

Resources