What is a data serialization system? - hadoop

According to the Apache Avro project, "Avro is a serialization system". Does calling it a data serialization system mean that Avro is a product or an API?
Also, I am not quite sure what a data serialization system is. My current understanding is that it is a protocol that defines how a data object is passed over the network. Can anyone explain it in an intuitive way, so that it is easier for people with a limited distributed computing background to understand?
Thanks in advance!

When Doug Cutting was writing Hadoop, he decided that the standard Java way of serializing objects, Java Object Serialization (Java Serialization), didn't meet his requirements for Hadoop. Namely, these requirements were:
Serialize the data into a compact binary format.
Be fast, both in performance and in how quickly it allows data to be transferred.
Interoperable so that other languages plug into Hadoop more easily.
As he described Java Serialization:
It looked big and hairy and I thought we needed something lean and mean
Instead of using Java Serialization they wrote their own serialization framework. The main perceived problem with Java Serialization was that it writes the class name of each object being serialized to the stream, with each subsequent instance of that class containing a 5-byte reference to the first instead of the class name.
As well as reducing the effective bandwidth of the stream, this causes problems with random access and with sorting of records in a serialized stream. Hadoop serialization therefore doesn't write the class name or the required references, and assumes that the client knows the expected type.
Java Serialization also creates a new object for each one that is deserialized. Hadoop Writables, which implement Hadoop serialization, can be reused. This helps improve the performance of MapReduce, which essentially serializes and deserializes billions of records.
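To make the size difference concrete, here is a minimal sketch using only java.io (no Hadoop dependency). The Point class and the method names are made up for this illustration. It compares what ObjectOutputStream writes for one small object with what you get when you write just the two fields and assume the reader already knows the type, which is the assumption Hadoop serialization makes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SizeComparison {
    // Hypothetical record type, used only for this illustration.
    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Standard Java serialization: stream header, class descriptor, then the fields.
    static int javaSerializedSize(Point p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(p);
        }
        return bytes.size();
    }

    // Type-less format: only the two int fields; the reader must know it is a Point.
    static int rawSize(Point p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(p.x);
            out.writeInt(p.y);
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        Point p = new Point(3, 4);
        System.out.println("Java serialization: " + javaSerializedSize(p) + " bytes");
        System.out.println("Fields only:        " + rawSize(p) + " bytes");
    }
}
```

The fields-only version is always 8 bytes (two ints). The Java serialization figure is several times larger because of the class descriptor.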
Avro fits into Hadoop in that it approaches serialization in a different manner. The client and server exchange a schema which describes the data stream. This helps make it fast and compact, and, importantly, makes it easier to mix languages together.
So Avro defines a serialization format, a protocol for clients and servers to communicate these serial streams, and a way to compactly persist data in files.
I hope this helps. I thought a bit of Hadoop history would help explain why Avro is a subproject of Hadoop and what it's meant to help with.

If you need to store information such as a hierarchy or the implementation details of a data structure in a file, or pass that information over a network, you use data serialization. It is close in spirit to the XML or JSON formats. The benefit is that information translated into a serialization format can be deserialized to regenerate whatever classes, objects, or data structures were serialized.
actual implementation --> serialization --> .xml or .json or .avro --> deserialization --> implementation in original form
Here is the link to the list of serialization formats. Comment if you want further information! :)

Related

Hadoop own data types

I have been using Hadoop for quite some time now, but I'm not sure why Hadoop uses its own data types and not Java data types. I have been searching the internet for an answer, but nothing helped. Please help.
The short answer is: because of the serialization and deserialization performance that they provide.
Long version:
The primary benefit of using Writables (Hadoop's data types) is in their efficiency. Compared to Java serialization, which would have been an obvious alternative choice, they have a more compact representation. Writables don't store their type in the serialized representation, since at the point of deserialization it is known which type is expected.
Here is a more detailed excerpt from Hadoop: The Definitive Guide:
Java serialization is not compact, classes that implement java.io.Serializable or java.io.Externalizable write their classname and the object representation to the stream. Subsequent instances of the same class write a reference handle to the first occurrence, which occupies only 5 bytes. However, reference handles don't work well with random access, because the referent class may occur at any point in the preceding stream - that is, there is state stored in the stream. Even worse, reference handles play havoc with sorting records in a serialized stream, since the first record of a particular class is distinguished and must be treated as a special case. All these problems can be avoided by not writing the classname to the stream at all, which is the approach Writable takes. The result is that the format is considerably more compact than Java serialization, and random access and sorting work as expected because each record is independent of the others (so there is no stream state).
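The "stream state" the excerpt describes is easy to observe with plain java.io. In this sketch (the Rec class and method name are made up for the illustration), two instances of the same class are written to one ObjectOutputStream and the bytes produced by each are measured separately. The second record is smaller because it carries a back-reference to the class descriptor written with the first one, which also means it cannot be interpreted without the earlier part of the stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class StreamStateDemo {
    // Hypothetical record type, used only for this illustration.
    static class Rec implements Serializable {
        private static final long serialVersionUID = 1L;
        final int id;
        Rec(int id) { this.id = id; }
    }

    // Returns { bytes written for the first record, bytes written for the second }.
    static int[] recordSizes() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.flush();                      // push the stream header out first
            int header = bytes.size();
            out.writeObject(new Rec(1));      // writes the full class descriptor
            out.flush();
            int afterFirst = bytes.size();
            out.writeObject(new Rec(2));      // writes a reference to that descriptor
            out.flush();
            int afterSecond = bytes.size();
            return new int[] { afterFirst - header, afterSecond - afterFirst };
        }
    }

    public static void main(String[] args) throws IOException {
        int[] sizes = recordSizes();
        System.out.println("first record:  " + sizes[0] + " bytes");
        System.out.println("second record: " + sizes[1] + " bytes");
    }
}
```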

JMS: Deliver String or Object as payload, which is relatively faster?

I have a simple Person object that contains some basic information about a person. If I want to send it by JMS:
I can convert this object into JSON, then deliver it as a String object.
I can use Person object as the payload directly.
I'm using ActiveMQ as the JMS provider. Which way is faster?
And what if I need to send a Map or List as the payload?
It's all about the performance of serialization, not much about JMS/ActiveMQ. An ObjectMessage is a binary blob on the wire that uses Java serialization; for a string message, you can choose whatever serializer you want.
This article with runnable benchmarks shows that JSON serialization can be as fast as Java object serialization. Although the article is obviously biased, you can see that Jackson/JSON serialization and Java serialization are pretty close in terms of performance.
I guess you can measure it yourself with your kind of data. Either way, it's likely a micro-optimization. If serialization speed truly matters that much, see if you can reduce the size or the number of objects sent.
As a final note, if you deal with very large payloads, the size and therefore the transport time will contribute to performance. In that case, you may want to make sure your JSON is not indented, and possibly compress it as well.
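If you want a rough feel for how much indentation and compression matter, here is a small sketch that needs no JSON library. It builds the same made-up array of person records (the field names are assumptions for the illustration) once indented and once compact, then gzips the compact form with java.util.zip. Measure with your own data before drawing conclusions.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class JsonPayloadSize {
    // Builds a JSON array of `count` hypothetical person records by hand.
    static String buildJson(int count, boolean indented) {
        String nl = indented ? "\n" : "";
        String pad = indented ? "    " : "";
        String sep = indented ? ": " : ":";
        StringBuilder sb = new StringBuilder("[").append(nl);
        for (int i = 0; i < count; i++) {
            sb.append(pad).append("{").append(nl)
              .append(pad).append(pad).append("\"id\"").append(sep).append(i).append(",").append(nl)
              .append(pad).append(pad).append("\"name\"").append(sep)
              .append("\"person").append(i).append("\"").append(nl)
              .append(pad).append("}");
            if (i < count - 1) sb.append(",");
            sb.append(nl);
        }
        return sb.append("]").toString();
    }

    // Compresses a byte array with gzip and returns the compressed bytes.
    static byte[] gzip(byte[] input) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(input);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] indented = buildJson(1000, true).getBytes(StandardCharsets.UTF_8);
        byte[] compact = buildJson(1000, false).getBytes(StandardCharsets.UTF_8);
        System.out.println("indented:       " + indented.length + " bytes");
        System.out.println("compact:        " + compact.length + " bytes");
        System.out.println("compact + gzip: " + gzip(compact).length + " bytes");
    }
}
```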

Hadoop Text class

I want to ask why the Hadoop Framework, which implements the MapReduce distributed programming paradigm, uses a Text class to store a String when Java already has Strings implemented for us to use? It seems unnecessarily redundant (lol).
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/Text.html
They have implemented their own classes: Text for String, LongWritable for Long, and IntWritable for Integer.
The purpose of adding these classes is to define their own basic types for optimized network serialization. They are found in the org.apache.hadoop.io package.
These types produce a compact serialized object to make the best use of network bandwidth. Hadoop is meant to process big data, so network bandwidth is the most precious resource and they want to use it very effectively. In addition, these classes reduce the overhead of serializing and deserializing objects compared to Java's native types.
Redundant???
Let me shed some light. In distributed systems, efficient serialization/deserialization plays a vital role. It appears in two quite distinct areas of distributed data processing:
IPC
Persistent Storage
To be specific to Hadoop, IPC between nodes is implemented using RPCs. The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message. So, it is very important to have a solid Serialization/Deserialization framework in order to store and process huge amounts of data efficiently. In general, it is desirable that an RPC serialization format is:
Compact
Fast
Extensible
Interoperable
Hadoop uses its own types because developers wanted the storage format to be compact (to make efficient use of storage space), fast (so the overhead in reading or writing terabytes of data is minimal), extensible (so we can transparently read data written in an older format), and interoperable (so we can read or write persistent data using different languages).
A few points to remember before concluding that dedicated MapReduce types are redundant:
Hadoop’s Writable-based serialization framework provides a more efficient and customized serialization and representation of the data for MapReduce programs than Java’s general-purpose native serialization framework.
As opposed to Java’s serialization, Hadoop’s Writable framework does not write the type name with each object; it expects all clients of the serialized data to be aware of the types used in it. Omitting the type names makes the serialization process faster and results in compact, randomly accessible serialized data formats that can be easily interpreted by non-Java clients.
Hadoop’s Writable-based serialization can also reduce object-creation overhead by reusing the Writable objects, which is not possible with Java’s native serialization framework.
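The reuse point can be sketched without pulling in Hadoop. SimpleWritable below is a stand-in I wrote for this example; it has the same two methods as Hadoop's Writable interface (write and readFields) but is not the real class, and IntBox is likewise hypothetical. The thing to notice is that the read loop allocates a single object and refills it for every record.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

class ReuseDemo {
    // Stand-in with the same shape as Hadoop's Writable (assumption: illustration only).
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical mutable int holder, similar in spirit to IntWritable.
    static class IntBox implements SimpleWritable {
        int value;
        @Override public void write(DataOutput out) throws IOException { out.writeInt(value); }
        @Override public void readFields(DataInput in) throws IOException { value = in.readInt(); }
    }

    // Serializes all values back to back: 4 bytes each, no type information.
    static byte[] writeAll(int[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        IntBox box = new IntBox();
        for (int v : values) {
            box.value = v;
            box.write(out);
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Reads `count` records into ONE reused IntBox instead of allocating per record.
    static long sumAll(byte[] data, int count) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        IntBox box = new IntBox();
        long sum = 0;
        for (int i = 0; i < count; i++) {
            box.readFields(in);   // overwrites the previous record's value
            sum += box.value;
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = writeAll(new int[] { 1, 2, 3 });
        System.out.println(data.length + " bytes, sum = " + sumAll(data, 3));
    }
}
```

With Java's native serialization, readObject hands back a fresh object for each record, so this kind of reuse is not available.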
HTH
Why can't I use the basic String or Integer classes?
Integer and String implement Java's standard Serializable interface. The problem is that MapReduce serializes and deserializes values not through this standard interface but through its own interface, called Writable.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Here is the link to MapReduce Tutorial
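As a rough illustration of what the framework asks of a key type, here is a sketch that does not depend on Hadoop. SimpleWritableComparable is a stand-in with the same shape as Hadoop's WritableComparable (write, readFields, compareTo), and WordKey is a made-up key class; it uses writeUTF for brevity, which is not the format Hadoop's Text uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class KeySortDemo {
    // Stand-in with the same shape as Hadoop's WritableComparable (illustration only).
    interface SimpleWritableComparable<T> extends Comparable<T> {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical key type wrapping a String.
    static class WordKey implements SimpleWritableComparable<WordKey> {
        String word = "";
        WordKey() { }
        WordKey(String word) { this.word = word; }
        @Override public void write(DataOutput out) throws IOException { out.writeUTF(word); }
        @Override public void readFields(DataInput in) throws IOException { word = in.readUTF(); }
        @Override public int compareTo(WordKey other) { return word.compareTo(other.word); }
    }

    // Serializes a key and reads it back into a fresh instance.
    static WordKey roundTrip(WordKey key) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        key.write(new DataOutputStream(bytes));
        WordKey copy = new WordKey();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        return copy;
    }

    // Round-trips each word through the key type, then sorts via compareTo.
    static List<String> sortedWords(String... words) throws IOException {
        List<WordKey> keys = new ArrayList<>();
        for (String w : words) {
            keys.add(roundTrip(new WordKey(w)));
        }
        Collections.sort(keys);
        List<String> result = new ArrayList<>();
        for (WordKey k : keys) {
            result.add(k.word);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(sortedWords("pear", "apple", "fig"));
    }
}
```

A plain String or Integer has compareTo but no write/readFields pair, which is why the framework cannot use them directly as keys or values.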

Hadoop and Stata

Does anyone have any experience using Stata and Hadoop? Stata 13 now has a Java Plugin API, so I think it should be straightforward to get them to play nice.
I am particularly interested in being able to parse weblog data to get it into a form suitable for statistical analysis.
This question came up on Statalist recently, but there was no response, so I thought I would try it here where the audience is more likely to have experience with this technology.
Dimitry,
I think it would be easier to do something like this using the ELK Stack (http://www.elastic.co). Logstash (the middle layer) has several parsers/tokenizers/analyzers built on the Apache Lucene engine for cleaning and formatting log data and can push the resulting data into Elasticsearch, which exposes an HTTP API that you can curl fairly easily to get results (e.g., use insheetjson and pass the HTTP GET request as the URL and it should be imported into Stata without much problem).
I've been trying to cobble together a program to use the Jackson JSON library to build out more robust JSON I/O capabilities from within Stata and would definitely not mind trying to work with others to get it done.
Hope this helps,
Billy
I'll take an (un?)educated stab at this. From the looks of the Java API, the caller seems to treat Stata as essentially a datastore. If that's the case, then I would imagine Stata would fit into the Hadoop world as a database and would be accessed by its own InputFormat and OutputFormat. In your specific case I'd imagine you'd write a StataOutputFormat which your reducer would use to write the parsed data. The only drawback seems to be your referenced comments that Stata apps tend to be I/O bound, so I don't know that using Hadoop is really going to help you, since
you'll have to write all that data anyway, and
that write will be I/O bound, whether you use hadoop or not.

gson vs protocol buffer

What are the pros and cons of protocol buffers (protobuf) compared with GSON?
In what situations is protobuf more appropriate than GSON?
I am sorry for such a generic question.
Both JSON (via the gson library) and protobuf are portable between platforms; but
protobuf is smaller (bandwidth) and cheaper (CPU) to read/write
json is human readable / editable (protobuf is binary; hard to parse without library support)
protobuf is trivial to merge fragments - just concatenate
json is easily passed to web page clients
the main Java version of protobuf needs a contract definition (.proto) and code generation; gson seems to allow arbitrary POJO usage (there are protobuf implementations that work on such objects, but not for Java afaik)
If performance is key: protobuf
For use with a web page (JavaScript), or human readable: json (perhaps via gson)
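To give a feel for the size difference, here is a hand-rolled sketch of how protobuf's wire format encodes a single integer field, next to the equivalent JSON text. This is not the protobuf library; the class and method names are made up, and it only covers varint fields (wire type 0). In real code you would use the generated classes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class ProtoSizeDemo {
    // Writes `value` as a base-128 varint: 7 bits per byte, high bit set on all but the last.
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Encodes one varint field: key = (fieldNumber << 3) | wireType 0, then the value.
    static byte[] encodeVarintField(int fieldNumber, long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(out, ((long) fieldNumber << 3) | 0);
        writeVarint(out, value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] proto = encodeVarintField(1, 150);
        byte[] json = "{\"id\":150}".getBytes(StandardCharsets.UTF_8);
        System.out.println("protobuf-style: " + proto.length + " bytes");
        System.out.println("JSON:           " + json.length + " bytes");
    }
}
```

The binary form carries only a field number, so the reader needs the .proto contract to know that field 1 means "id". The JSON form is larger but self-describing and readable.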
If you want efficiency and cross-platform you should send raw messages between applications containing the information that is necessary and nothing more or less.
Serialising classes via Java's own mechanisms, gson, protobufs or whatever creates data that contains not only the information you wish to send, but also information about the logical structures and hierarchies of the data structures that have been used to represent the data inside your application.
This gives those classes and data mappings a dual purpose: one, to represent the data internally in the application, and two, to be transmitted to another application. Those two roles can conflict, and there is an onus on the developer to remember that the classes, collections and data layout he is working with at any time will also be serialised.
