Is there a fast and reliable way of serializing objects across different versions of Ruby? - ruby

I have two applications talking to each other over a queue. As of now they run exactly the same version of Ruby (1.8.7), so I'm just marshaling objects back and forth; these are only objects from the standard library, mostly hashes, strings, and time and date objects.
Right now I'm moving to Ruby 1.9.1, one app at a time, which means I'll be running one app on 1.8.7 and the other on 1.9.1 for a while. From running my tests I know Marshal will not be reliable across versions. I could use YAML, but it is much slower; JSON seems to be faster, but it does not deal directly with the date/time objects.
Is there a reliable and fast way to serialize ruby objects across different versions?

I haven't tried it on Ruby, but you could look at protocol buffers. Designed as a fast but portable binary format, it has a Ruby port here. You would probably have to treat the generated types as a separate DTO layer, though (i.e. you map your existing data into the new types, rather than serialize your existing objects). Note that there is no built-in date-time support, but you could just use ticks since an epoch, etc.

The key here is finding a common data type that you know will be represented the same across Ruby versions. The obvious choices here are storing data in an external database (the DB interface libraries will handle all the conversions) or writing the data out in a structured text format. If there's not a ton of data to work with (and the data is mostly standard types), I usually just store it as text; the export/import takes longer to run, but the code is usually faster to write.
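A minimal sketch of the structured-text approach, assuming JSON as the text format and a made-up convention (the `__time__` key and the helper names are hypothetical, not part of any library) in which a Time is written as whole epoch seconds. It is written against a current Ruby; on 1.8.7 the json library would have to be installed as a gem, and sub-second precision is dropped by this convention.

```ruby
require 'json'

TIME_KEY = '__time__' # hypothetical marker key for encoded Time values

# Replace every Time with a small tagged hash that JSON can carry.
def encode_value(value)
  case value
  when Time
    { TIME_KEY => value.to_i }
  when Hash
    out = {}
    value.each { |k, v| out[k.to_s] = encode_value(v) }
    out
  when Array
    value.map { |v| encode_value(v) }
  else
    value
  end
end

# Reverse the mapping: tagged hashes become Time objects again (in UTC).
def decode_value(value)
  case value
  when Hash
    return Time.at(value[TIME_KEY]).utc if value.keys == [TIME_KEY]
    out = {}
    value.each { |k, v| out[k] = decode_value(v) }
    out
  when Array
    value.map { |v| decode_value(v) }
  else
    value
  end
end

def dump_message(obj)
  JSON.generate(encode_value(obj))
end

def load_message(text)
  decode_value(JSON.parse(text))
end

message  = { 'id' => 42, 'tags' => %w[a b], 'sent_at' => Time.at(1_262_304_000).utc }
wire     = dump_message(message)   # plain text, safe to put on the queue
restored = load_message(wire)
```

Both sides only need to agree on the convention, not on the Ruby version. Date objects could be handled the same way with a second marker key.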

Protobufs are good, but require you to pre-define your data structures, if I recall. Thrift is similar to protobufs, but has some decent code generation features.
Apple's binary property list format sounds close to what you need. It's similar to JSON in behavior, but is more compact and supports a few extra types, including datetime and unencoded binary. There are a couple ruby implementations on github.
Your best bet may be BERT. BERT is based on Erlang's binary term serialization format. It's compact, includes datetime serialization and is implemented in a dozen or so languages, including Ruby.

Related

How can I pass Selenium WebDriver objects between separate Ruby processes?

I want to pass an instance of an object between two Ruby processes. Specifically, I want to pass an instance of a Selenium WebDriver from one process to another process. The reason I want to do this is because it takes a lot of time for Ruby to create this object, but I want it to be used by the other process.
I've found some related questions here and here that seem to point towards using DRb, but I've been unable to find any useful examples or sample code.
Is there a tool other than DRb that I should be using? Does anyone have an example similar to this that I could copy from?
It looks like you're going to have to use DRb, although the documentation for it seems to be lacking. There is however an interesting article here. You might also want to consider purchasing The dRuby Book by Masatoshi Seki to get a better idea of how to do this effectively.
Another option to investigate, if you do not need simultaneous access and just want to send the object from one process to another, is to serialize the object (that is, encode it in a way that Ruby can read) with YAML (for a human-readable file) or Marshal (for a binary-encoded file) and send it through a pipe. This was mentioned in another answer that has since been deleted.
Note that either of these solutions requires modifying the Selenium code heavily, since the objects you want to manipulate natively support neither copying nor simultaneous access.
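As a rough sketch of the Marshal-over-a-pipe idea, the following sends a plain hash from a child process to its parent. The payload is an invented example; a live WebDriver instance is a different matter and may not be marshalable at all. It assumes a platform where `fork` is available (not Windows).

```ruby
# Parent and child share one pipe; the child builds the object and sends it back.
reader, writer = IO.pipe
reader.binmode
writer.binmode

pid = fork do
  reader.close
  payload = { 'name' => 'job-1', 'created_at' => Time.at(0), 'steps' => [1, 2, 3] }
  Marshal.dump(payload, writer)   # writes the binary form straight into the pipe
  writer.close
end

writer.close                      # the parent only reads
received = Marshal.load(reader)   # blocks until the child has written the object
reader.close
Process.wait(pid)
```

The parent ends up with its own copy of the data, so later changes in one process are not seen by the other.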
TL;DR
Most queue or distributed processes are going to require some sort of serialization to work properly. If you want to pass objects rather than messages, then this will be a limiting factor in how you approach the problem.
DRb
I don't know if you can marshal a WebDriver object. If you can't, then DRb may be a good choice for your distributed Ruby programs because it supports DRbObject references for things that can't be marshaled. There are some examples provided in the DRb documentation.
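A small sketch of the DRb approach, using a hypothetical `SlowDriver` class as a stand-in for an object that is expensive to build and cannot be marshaled. For brevity the server and client run in the same process here; in practice the client would be a separate process that has been given the URI. On recent Ruby versions `drb` may need to be installed as a gem.

```ruby
require 'drb/drb'

# Hypothetical stand-in for an expensive, non-marshalable object.
class SlowDriver
  include DRb::DRbUndumped   # hand out references to this object, never copies

  def initialize
    @visited = []
  end

  def visit(url)
    @visited << url
    @visited.size
  end
end

# Server side: expose one instance as the front object.
# Port 0 asks the OS for a free port; DRb.uri reports the address actually used.
DRb.start_service('druby://localhost:0', SlowDriver.new)
uri = DRb.uri

# Client side: a proxy whose method calls are forwarded to the front object.
remote = DRbObject.new_with_uri(uri)
count  = remote.visit('http://example.com/')

DRb.stop_service
```

The object never leaves the process that created it; only method calls and their arguments and return values travel between the two sides.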
Selenium Wire Protocol
Depending on what you're really trying to do, it may be worth taking a closer look at using the remote bindings for the Remote WebDriver client/server, or Selenium's JSON Wire Protocol as an alternative to passing objects between processes.
Other Alternatives: Fixtures, Factories, Stubs, and Mocks
Whether or not these work in your specific case will depend a lot on why you want to pass objects instead of simply driving the remote server. If it's largely an issue of how long it takes to build your object, then the serialization/de-serialization cycle may not necessarily be faster in all cases.
You might want to revisit why your object is so slow to create. If gathering and processing the data for it is what's taking too long, you can use some sort of test fixture or factory to trim that time, either by using a smaller set of fixed data, or using a pre-serialized object that's optimized for speed.
You might also consider whether you actually need real data or objects for your test at all. In many cases, you can speed up your tests a lot by stubbing methods or creating mock objects that will return the values you need for your integration tests without needing to perform expensive calculations or long-running operations.
There are certainly cases where you need to drive the full stack and perform acceptance tests on real data. Even then, you may be able to devise a set of fixture data that will take less time or memory to process. It's certainly worth at least thinking about.

Is there any embeddable key-value store for Ruby?

I need a fast and reliable key-value store for Ruby. Is there anything like that already?
The requirement is for it to run wholly inside the Ruby process, not needing any outside processes.
It might be in-memory with explicit disk flushes.
It needs to have minimal value-for-key retrieval times; write times may be less good.
The amount of data stored won't be huge: a few hundred thousand keys, each with a ~1 KB text value.
It turns out that the best option for me was to use plain Hash along with Marshal to serialize it to disk.
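A minimal sketch of that Hash-plus-Marshal setup, with hypothetical helper names (`save_store`, `load_store`) and a temporary directory standing in for the real data location. The snapshot is written to a temporary file and then renamed, which is one way to avoid leaving a half-written file behind.

```ruby
require 'tmpdir'

# Write the whole hash to disk; this is the explicit flush.
def save_store(store, path)
  tmp = "#{path}.tmp"
  File.open(tmp, 'wb') { |f| Marshal.dump(store, f) }
  File.rename(tmp, path)   # swap the new snapshot in place of the old one
end

# Load the snapshot, or start with an empty hash if none exists yet.
def load_store(path)
  return {} unless File.exist?(path)
  File.open(path, 'rb') { |f| Marshal.load(f) }
end

path  = File.join(Dir.mktmpdir, 'store.bin')
store = load_store(path)               # empty on the first run
store['user:1'] = 'some text value'    # reads and writes are plain Hash operations
save_store(store, path)
reloaded = load_store(path)
```

Lookups never touch the disk, which matches the requirement for minimal retrieval times; the cost is that every flush rewrites the whole file.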
YAML is definitely too slow for that number of objects.
Thanks to @ian-armit for reinforcing my trust in the core Ruby libraries.
You could also try Moneta which allows you to build your own key/value store embedded in a ruby process.
Like DBM? http://www.ruby-doc.org/stdlib-1.9.3/libdoc/dbm/rdoc/DBM.html
The DBM class provides a wrapper to a Unix-style dbm or Database Manager library.
Dbm databases do not have tables or columns; they are simple key-value data stores, like a Ruby Hash except not resident in RAM. Keys and values must be strings.
You could try Oria: https://github.com/intridea/oria
Oria (oh-rye-uh) is an in-memory, Ruby-based, zero-configuration Key-Value Store. It's designed to handle moderate amounts of data quickly and easily without causing deployment issues or server headaches. It uses EventMachine to provide a networked interface to a semi-persistent store and asynchronously writes the in-memory data to YAML files.
Check out PStore. Not sure if it's fast enough though.
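For reference, a short sketch of basic PStore usage; the file location and keys are made up for illustration. Whether it is fast enough for a few hundred thousand keys would have to be measured, since PStore loads and rewrites the whole file around transactions.

```ruby
require 'pstore'
require 'tmpdir'

# A temporary directory stands in for the real data location.
path = File.join(Dir.mktmpdir, 'data.pstore')
db = PStore.new(path)

# Writes happen inside a transaction and are committed when the block ends.
db.transaction do
  db['user:1'] = 'some text value'
  db['user:2'] = 'another value'
end

# Passing true opens a read-only transaction; writes inside it raise an error.
value = db.transaction(true) { db['user:1'] }
keys  = db.transaction(true) { db.roots }
```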
Daybreak is a nice new option. Data is stored in a table in memory, so Ruby niceties are available (each, filter, map, reduce, etc.), and it appears to be faster than PStore or DBM.
See this blog post for more info.
There's also LevelDB; here are the Ruby bindings.

Technology for database access system

I am currently designing a system which should allow access to a database. The assumptions are as follows:
The database should have an access layer. The access layer should provide objects that represent database tables. (This would be done using some ORM framework.)
A client which wants to get data from the database should first get an object from the access layer, and then get data using those objects.
Clients could use Python, Java or C++.
Access layer is based on Java.
There won't be too many clients, but they will be operating on large amounts of data.
The question which is hard for me is what technology should be used for passing objects between the access layer and clients. I am considering ZeroC ICE, Apache Thrift or Google Protocol Buffers.
Does anyone have an opinion on which one is worth using?
This is my research for Protocol Buffers:
Advantages:
simple to use and easy to start
well documented
highly optimized
object data structures are defined in a Java-like language
setters, getters and build methods are generated automatically for Python, Java and C++
open-source bindings for other languages
objects can be extended without affecting old versions of an application
there are many open-source RpcChannel and RpcController implementations (not tested)
Disadvantages:
need to implement object transfer
object structure has to be defined before use, so we can't add fields on the fly (Updated: there are possibilities to do that, see the comments)
if we need to read a single field of an object, we have to parse the whole file (in contrast, in XML we could ignore chosen tags)
if we want to use RPC to invoke object methods, we need to define services and supply RpcChannel and RpcController implementations
This is my research for Apache Thrift:
Advantages:
provides a compiler that generates source code for supported languages (classes and everything else that is needed)
allows defining optional fields in structures (when a field is not set, less data is transferred)
allows marking some methods as "one way" (they return nothing, and after invocation the client does not wait for the server to report that processing has completed)
supports serialization and deserialization of collections (maps, lists, sets), objects and primitives, as well as constants, enumerations and exceptions
most problems and errors are already solved and explained
provides different serialization methods (TBinaryProtocol...) and different ways of exchanging data (TBufferedTransport, TZlibTransport...)
the compiler produces classes (structures) that we can extend by adding new methods
fields can be added to the protocol (on the server as well as the client) and others removed; old and new code can still interact properly (subject to some update rules)
enable asynchronous calls
easy to use
Disadvantages:
documentation: it contains some errors, so it is sometimes really hard to find the source of a problem
problems are not always well tagged (when we look for a solution on the Internet)
does not support overloading of service methods
tutorials cover only simple examples of thrift usage
hard to start
ICE ZeroC:
It is better than Protocol Buffers, because I wouldn't need to implement object passing myself via e.g. sockets. ICE also provides ServantLocators, which can manage connections.
The question is: is ICE much slower and less efficient than PB?

gson vs protocol buffer

What are the pros and cons of protocol buffers (protobuf) compared with GSON?
In what situations is protobuf more appropriate than GSON?
I am sorry for a very generic question.
Both JSON (via the gson library) and protobuf are portable between platforms, but:
protobuf is smaller (bandwidth) and cheaper (CPU) to read/write
json is human readable / editable (protobuf is binary; hard to parse without library support)
protobuf is trivial to merge fragments - just concatenate
json is easily passed to web page clients
the main java version of protobuf needs contract-definition (.proto) and code-generation; gson seems to allow arbitrary pojo usage (there are protobuf implementations that work on such objects, but not for java afaik)
If performance is key: protobuf
For use with a web page (JavaScript), or human readable: json (perhaps via gson)
If you want efficiency and cross-platform you should send raw messages between applications containing the information that is necessary and nothing more or less.
Serialising classes via Java's own mechanisms, gson, protobufs or whatever creates data that contains not only the information you wish to send, but also information about the logical structures and hierarchies of the data structures that have been used to represent the data inside your application.
This makes those classes and data mappings dual purpose: one, to represent the data internally in the application, and two, to be transmitted to another application. Those two roles can conflict, and there is an onus on the developer to remember that the classes, collections and data layout they are working with at any time will also be serialised.

what is a data serialization system?

According to the Apache Avro project, "Avro is a serialization system". By saying "data serialization system", does it mean that Avro is a product or an API?
Also, I am not quite sure what a data serialization system is. For now, my understanding is that it is a protocol that defines how a data object is passed over the network. Can anyone explain it in an intuitive way that is easier to understand for people with a limited distributed computing background?
Thanks in advance!
When Hadoop was being written, Doug Cutting decided that the standard Java method of serializing Java objects, Java Object Serialization (Java Serialization), didn't meet his requirements for Hadoop. Namely, these requirements were:
Serialize the data into a compact binary format.
Be fast, both in performance and in how quickly it allows data to be transferred.
Interoperable so that other languages plug into Hadoop more easily.
As he described Java Serialization:
It looked big and hairy and I thought we needed something lean and mean
Instead of using Java Serialization they wrote their own serialization framework. The main perceived problem with Java Serialization was that it writes the classname of each object being serialized to the stream, with each subsequent instance of that class containing a 5-byte reference to the first instead of the classname.
As well as reducing the effective bandwidth of the stream this causes problems with random access as well as sorting of records in a serialized stream. Thus Hadoop serialization doesn't write the classname or the required references, and makes the assumption that the client knows the expected type.
Java Serialization also creates a new object for each one that is deserialized. Hadoop Writables, which implement Hadoop serialization, can be reused. This helps improve the performance of MapReduce, a workload that serializes and deserializes billions of records.
Avro fits into Hadoop in that it approaches serialization in a different manner. The client and server exchange a schema which describes the datastream. This helps make it fast and compact and, importantly, makes it easier to mix languages together.
So Avro defines a serialization format, a protocol for clients and servers to communicate these serial streams and a way to compactly persist data in files.
I hope this helps. I thought a bit of Hadoop history would help explain why Avro is a subproject of Hadoop and what it's meant to help with.
If you have to store information such as a hierarchy or the implementation details of a data structure in a file, or pass that information over a network, you use data serialization. It is close to how XML or JSON work. The benefit is that information translated into a serialization format can be deserialized to regenerate whatever classes, objects or data structures were serialized.
actual implementation --> serialization --> .xml or .json or .avro --> deserialization --> implementation in original form
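The round trip in that diagram can be sketched in a few lines. This example uses Ruby with JSON as the format rather than Avro, and `Point` is an invented type; it is only meant to show the shape of the process.

```ruby
require 'json'

# An in-memory object: the "actual implementation".
Point = Struct.new(:x, :y)
original = Point.new(3, 4)

# Serialization: turn the object's data into a string of bytes
# that can be written to a file or sent over a network.
text = JSON.generate('x' => original.x, 'y' => original.y)

# Deserialization: read the bytes back and rebuild an equivalent object.
fields = JSON.parse(text)
copy   = Point.new(fields['x'], fields['y'])
```

Here `text` is the serialized form, and `copy` is a new object that carries the same data as `original`. A serialization system such as Avro defines what those bytes look like and supplies libraries for producing and reading them.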
Here is the link to the list of serialization formats. Comment if you want further information! :)

Resources