Fast Thread-Safe Ruby Hash with strong read bias - ruby

I need some help in understanding Hash in Ruby 1.8.7.
I have a multi-threaded Ruby application, and about 95% of time multiple threads of the application are trying to access a global Hash.
I am not sure if the default Ruby Hash is thread safe. What would be the best way to have a fast Hash but also one that is thread safe given my situation?

The default Ruby Hash is not thread-safe. On MRI and YARV it is "somewhat accidentally thread-safe", because neither implementation can run two Ruby threads simultaneously anyway: MRI 1.8 uses green threads and YARV has a global interpreter lock. On JRuby, IronRuby and Rubinius, however, this is not the case.

I would suggest a wrapper which protects the Hash with a read-write lock. I couldn't find a pre-built Ruby read-write lock implementation (of course JRuby users can use java.util.concurrent.ReentrantReadWriteLock), so I built one. You can see it at:
https://github.com/alexdowad/showcase/blob/master/ruby-threads/read_write_lock.rb
Two other people and I have tested it on MRI 1.9.2, MRI 1.9.3, and JRuby. It seems to be working correctly (though I still want to do more thorough testing). It has a built-in test script; if you have a multi-core machine, please download it, try running it, and let me know the results! As far as performance goes, it trounces Mutex in situations with a read bias. Even in situations with 80-90% writes, it still seems a bit faster than using a Mutex.
I am also planning to do a Ruby port of Java's ConcurrentHashMap.
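To make the idea of a lock-protected wrapper concrete, here is a minimal sketch. It is not the linked read_write_lock.rb implementation; the class names SimpleReadWriteLock and LockedHash are hypothetical, and the lock is a deliberately simple one built from Mutex and ConditionVariable. It assumes no reentrancy and makes no attempt to prevent writer starvation under heavy read load.

```ruby
require 'thread'

# Hypothetical, simplified read-write lock: many readers OR one writer.
# Not reentrant, and writers can starve if readers never stop arriving.
class SimpleReadWriteLock
  def initialize
    @mutex   = Mutex.new
    @cond    = ConditionVariable.new
    @readers = 0      # number of threads currently reading
    @writing = false  # true while a writer holds the lock
  end

  def with_read_lock
    @mutex.synchronize do
      @cond.wait(@mutex) while @writing
      @readers += 1
    end
    begin
      yield
    ensure
      @mutex.synchronize do
        @readers -= 1
        @cond.broadcast if @readers.zero?
      end
    end
  end

  def with_write_lock
    @mutex.synchronize do
      @cond.wait(@mutex) while @writing || @readers > 0
      @writing = true
    end
    begin
      yield
    ensure
      @mutex.synchronize do
        @writing = false
        @cond.broadcast
      end
    end
  end
end

# Hypothetical wrapper exposing only the Hash operations it protects.
class LockedHash
  def initialize
    @hash = {}
    @lock = SimpleReadWriteLock.new
  end

  def [](key)
    @lock.with_read_lock { @hash[key] }
  end

  def []=(key, value)
    @lock.with_write_lock { @hash[key] = value }
  end
end
```

Note that compound operations (for example "read, then write if missing") would still need their own method that takes the write lock for the whole sequence.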

Related

Garbage collect in ruby?

I can see related questions asked before
In some languages, there's a specific way to force garbage collection. For example in R, we can call gc() and it will free up memory that was previously used to store objects that have since been removed.
Is there any way to do this in ruby?
In case it's relevant, I'm running a very long loop and I think it's slowly accumulating a little memory use each iteration, and I would like to force garbage collection every 100th iteration or so just to be sure e.g. (pseudo code) if index % 100 == 0 then gc(). Also note I intend to use this in a rails app, although I don't think that's relevant (since garbage collection would be entirely a ruby feature, nothing to do with rails)
No, there is no way to do it in Ruby.
There is a method called GC::start, and the documentation even says:
Initiates garbage collection, even if manually disabled.
But that is not true. GC::start is simply an advisory from your code to the runtime that it would be safe for your application to run a garbage collection now. But that is only a suggestion. The runtime is free to ignore this suggestion.
The majority of programming languages with automatic memory management do not give the programmer control over the garbage collector.
If Ruby had a method to force a garbage collection, then it would be impossible to implement Ruby on the JVM and neither JRuby nor TruffleRuby could exist, it would be impossible to implement Ruby on .NET and IronRuby couldn't exist, it would be impossible to implement Ruby on ECMAScript and Opal couldn't exist, it would be impossible to implement Ruby using existing high-performance garbage collectors and RubyOMR couldn't exist.
Since it is generally desirable to give implementors freedom to implement optimizations and make the language faster, languages are very cautious about specifying features that so drastically restrict what an implementor can do.
I am quite surprised that R has such a feature, especially since that means it is impossible to implement high-performance implementations like FastR in a way that is compliant with the language specification. FastR can be more than 35× faster than GNU R, so it is obvious why it is desirable for something like FastR to exist. But one of the ways FastR is faster is that it uses a third-party high-performance garbage-collected runtime (either the GraalVM or the JVM) that does not allow control over garbage collection, and thus FastR can never be a compliant R implementation.
Interestingly, the documentation of gc() has this to say:
[T]he primary purpose of calling gc is for the report on memory usage.
This is done via GC.start.
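For the "every 100th iteration" idea from the question, a minimal sketch could look like the following. The method name process_with_periodic_gc and its gc_every parameter are made up for illustration. Whether the call actually reclaims memory at that point depends on the Ruby implementation, as discussed above, so treat it as a request rather than a guarantee.

```ruby
# Hypothetical helper: yields each item and asks the runtime to collect
# garbage after every `gc_every` items. Returns how many times it asked.
def process_with_periodic_gc(items, gc_every = 100)
  gc_requests = 0
  items.each_with_index do |item, index|
    yield item
    if (index + 1) % gc_every == 0
      GC.start          # a request; the runtime decides what really happens
      gc_requests += 1
    end
  end
  gc_requests
end

# Usage: 250 items with the default interval triggers two requests.
process_with_periodic_gc((1..250).to_a) { |n| n.to_s * 10 }
```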

What's meant by a thread-safe Ruby interpreter?

In a 2000 interview (that is, pre-YARV), Matz said
Matz: I'd like to make it faster and more stable. I'm planning a full
rewrite of the interpreter for Ruby 2.0, code-named "Rite". It will be
smaller, easier to embed, thread-safe, and faster. It will use a
bytecode engine. It will probably take me years to implement, since
I'm pretty busy just maintaining the current version.
What was meant by "thread-safe" in this context? An interpreter that allowed you to use green threads? An interpreter that allowed you to use native threads? An interpreter that didn't have a global interpreter lock (GVL in YARV Ruby terminology)?
At the moment, Ruby's threading is less than ideal. Ruby can use threads and they work fine, but because of its current threading mechanism, the long and short of it is that one interpreter can only use one CPU core at a time; there are also other potential issues.
If you want all the gory details, this article covers it pretty well.

Ruby Parallel/Multithread Programming to read huge database

I have a Ruby script reading a huge table (~20m rows), doing some processing and feeding it over to Solr for indexing purposes. This has been a big bottleneck in our process. I am planning to speed things up here and I'd like to achieve some kind of parallelism. I am confused about Ruby's multithreading nature. Our servers have ruby 1.8.7 (2009-06-12 patchlevel 174) [x86_64-linux]. From this blog post and this question at StackOverflow it appears that Ruby does not have a "real" multithreading approach. Our servers have multiple cores, so using the parallel gem seems like another approach to me.
What approach should I go with? Also, any inputs on parallel-database-read-feeding systems would be highly appreciated.
You can parallelize this at the OS level. Change the script so that it can take a range of lines from your input file:
$ reader_script --lines=10000:20000 mytable.txt
Then execute multiple instances of the script.
$ reader_script --lines=0:10000 mytable.txt&
$ reader_script --lines=10000:20000 mytable.txt&
$ reader_script --lines=20000:30000 mytable.txt&
Unix will distribute them to different cores automatically.
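Inside the script, handling such a range could be sketched roughly as follows. The helper names parse_line_range and each_line_in_range are hypothetical, and the sketch assumes the range is half-open (START inclusive, END exclusive) so that adjacent ranges such as 0:10000 and 10000:20000 do not overlap. The answer above does not specify this, so adjust as needed.

```ruby
# Hypothetical helper: turns "10000:20000" into the half-open range
# 10000...20000. Raises ArgumentError on malformed input.
def parse_line_range(spec)
  first, last = spec.split(':', 2).map { |part| Integer(part) }
  unless last && first <= last
    raise ArgumentError, "expected START:END, got #{spec.inspect}"
  end
  (first...last)
end

# Hypothetical helper: yields only the lines whose zero-based index
# falls inside the range, and stops reading once past the end.
def each_line_in_range(io, range)
  io.each_line.with_index do |line, index|
    break if index >= range.end
    yield line if index >= range.begin
  end
end

# Usage inside the script, assuming the spec and path were already
# extracted from the command line:
#   File.open(path) do |file|
#     each_line_in_range(file, parse_line_range(spec)) { |line| process(line) }
#   end
```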
Any chance of upgrading to Ruby 1.9? It's usually faster than 1.8.7.
It's true that Ruby suffers from having a GIL, but if multithreading would solve your problem, you can take a look at JRuby, since it supports true threading.
Also, you'd better make sure it's the CPU that's the bottleneck, because if it's I/O, multithreading might not buy you much.

NoSQL DB written in Ruby?

Was curious, but are any NoSQL DBMS written in Ruby?
And if not, would it be unwise to create one in Ruby?
Was curious, but are any NoSQL DBMS written in Ruby?
In 2007, Anthony Eden played around with RDDB, a CouchDB-inspired document-oriented database. He still keeps a copy of the code in his GitHub account.
I vaguely remember that at or around the same time, someone else was also playing around with a database in Ruby. I think it was either inspired by or a reaction to RDDB.
Last but not least, there is the PStore library in the stdlib, which – depending on your definition – may or may not count as a database.
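To give a feel for what PStore offers, here is a small sketch of transactional reads and writes against a file-backed store. The helper names record_visit and visit_count, the :visits key, and the file name in the usage comment are invented for the example; only PStore.new, transaction, fetch and the []= accessor come from the library.

```ruby
require 'pstore'

# Hypothetical helper: increments a per-name counter inside a read-write
# transaction. The whole store is written back when the block finishes.
def record_visit(store, name)
  store.transaction do
    visits = store.fetch(:visits, {})
    visits[name] = visits.fetch(name, 0) + 1
    store[:visits] = visits
  end
end

# Hypothetical helper: reads a counter inside a read-only transaction.
def visit_count(store, name)
  store.transaction(true) do
    store.fetch(:visits, {}).fetch(name, 0)
  end
end

# Usage:
#   store = PStore.new("visits.pstore")
#   record_visit(store, "alice")
#   visit_count(store, "alice")
```

Whether this counts as a database is a matter of definition: it gives you persistence and transactions over Marshal-able Ruby objects, but no querying beyond looking up root keys.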
And if not, would it be unwise to create one in Ruby?
The biggest problem I see in Ruby is its concurrency primitives. Threads and locks are so 1960s. If you want to support multiple concurrent users, then you obviously need concurrency, although if you want to build an embedded in-process database, then this is much less of a concern.
Other than that, there are some not-so-stellar implementations of Ruby, but that is not a limitation of Ruby but of those particular implementations, and it applies to pretty much every other programming language as well. Rubinius (especially the current development trunk, which adds Ruby 1.9 compatibility and removes the Global Interpreter Lock) and JRuby would both be fine choices.
As an added bonus, Rubinius comes with a built-in actors library for concurrency and JRuby gives you access to e.g. Clojure's concurrency libraries or the Akka actors library.
Performance isn't really much of a concern, I think. Rubinius's Hash class, which is written in 100% pure Ruby, performs comparably to YARV's Hash class, which is written in 100% hand-optimized C. This shows you that Ruby code, at least when it is carefully written, can be just as fast as C, especially since databases tend to be long-running and thus Rubinius's or JRuby's (and in the latter case specifically also the JVM's) dynamic optimizers (which C compilers typically do not have) can really get to work.
Ruby is just too slow for any type of DBMS.
C/C++/Erlang are generally the best choice.
You generally shouldn't care what programming language a DBMS was implemented in, as long as it has all the features you need and is available for use from your application programming language of choice.
So the real question here is whether you need one written in Ruby or one available for use in Ruby.
In the first case, I doubt you'll find a DBMS natively written in Ruby (any correction of this statement will be appreciated).
In the second case, you should be able to find Ruby bindings/wrappers for any decent DBMS, relational or not.

In which versions of ruby are external iterator speeds improved?

According to this rubyquiz, external iterators used to be slow, but are now faster. Is this an improvement only available in YARV (the C-based implementation of ruby 1.9), or is this also available in the C-based implementation of ruby 1.8.7?
Also, does enum_for rely on external iterators?
Ruby 1.9 uses fibers to implement Enumerator#next, which might be better than Ruby 1.8, but still makes it an expensive call to make.
enum_for returns an Enumerator but does not rely on external iterators. A fiber/continuation will be created only if needed, i.e. if you call next but not if you call each or any other method inherited from Enumerable.
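The difference between the two styles can be seen in a short sketch. The helper name take_externally is made up for illustration; enum_for, next, map and StopIteration are standard Ruby.

```ruby
# Hypothetical helper: pulls up to `n` values out of an Enumerator using
# external iteration (Enumerator#next), stopping early if it runs dry.
def take_externally(enum, n)
  result = []
  n.times { result << enum.next }
  result
rescue StopIteration
  result
end

enum = [10, 20, 30].enum_for(:each)

# Internal iteration: uses the underlying each directly, so no
# fiber/continuation is needed.
doubled = enum.map { |x| x * 2 }

# External iteration: the first call to next is what triggers the more
# expensive machinery described above.
first_two = take_externally(enum, 2)
```

Here doubled is [20, 40, 60] and first_two is [10, 20].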
Rubinius and JRuby are optimizing next for the builtin types because the general case is very difficult to implement, in particular on the JVM. Fun bedtime reading: this thread on ruby-core
Rubinius also has some major performance enhancements, but it is a Ruby 1.8 implementation, not 1.9.
