Hadoop own data types - hadoop

I have been using hadoop for quite a time now but I'm not sure why Hadoop uses its own data types and not Java data types ? I have been searching for same thing over internet but nothing helped. please help.

Short answer is because of the serialization & deserialization performance that they provide.
Long version:
The primary benefit of using Writables (Hadoop's data types) is in their efficiency. Compared to Java serialization, which would have been an obvious alternative choice, they have a more compact representation. Writables don't store their type in the serialized representation, since at the point of deserialization it is known which type is expected.
Here is a more detailed excerpt from Hadoop Definitive Guide:
Java serialization is not compact, classes that implement java.io.Serializable or java.io.Externalizable write their classname and the object representation to the stream. Subsequent instances of the same class write a reference handle to the first occurrence, which occupies only 5 bytes. However, reference handles don't work well with random access, because the referent class may occur at any point in the preceding stream - that is, there is state stored in the stream. Even worse, reference handles play havoc with sorting records in a serialized stream, since the first record of a particular class is distinguished and must be treated as a special case. All these problems can be avoided by not writing the classname to the stream at all, which is the approach Writable takes. The result is that the format is considerably more compact than Java serialization, and random access and sorting work as expected because each record is independent of the others (so there is no stream state).

Related

JMS: Deliver String or Object as payload, which is relatively faster?

I have a simple Person object that contains some basic information about a person. If I want to send it by JMS:
I can convert this object into JSON, then deliver it as a String object.
I can use Person object as the payload directly.
I'm using ActiveMQ as JSM provider. Which way is faster?
And what if I need to send a Map or List as the payload?
It's all about the performance of serialization, not much about jms/activemq. So an ObjectMessage is a binary blob at transport that uses java serialization and for the string message, you can choose whatever serialization processor you want.
This article with runnable benchmarks shows that json serialization can be as fast as java object serialization. Although the article is obviously biased, you can note that also jackson/JSON serialization and java serialization is pretty close in terms of performance.
I guess you can measure yourself, with your kind of data. Either way, it's likely a micro optimization. If serialization speed truly matters that much, see if you can optimize in terms of size/quantity in terms objects sent.
As a final note, if you deal with very large payloads, the size and therefore the transport time will contribute to performance. In that case, you may want to make sure your json is not indented and possibly also compressed.

Hadoop Text class

I want to ask why the Hadoop Framework, which implements the MapReduce distributed programming paradigm, uses a Text class to store a String when Java already has Strings implemented for us to use? It seems unnecessarily redundant (lol).
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/Text.html
They have implemented their own class Text for String, LongWritable for Long, IntWritable for Integers.
Purpose behind adding these class is to define their own basic types for optimized network serialization. These are found in the org.apache.hadoop.io package.
This types produces a compact serialized object to makes best use of network bandwidth. And Hadoop is meant to process big data so network bandwidth is the most precious resource they want to use in very effective way. Plus for this class they have reduced the overhead of serialization and deserialization of these object as compared to Java's native types.
Redundant???
Let me shed some light. When we talk about distributed systems efficient Serialization/Deserialization plays a vital role. It appears in two quite distinct areas of distributed data processing :
IPC
Persistent Storage
To be specific to Hadoop, IPC between nodes is implemented using RPCs. The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message. So, it is very important to have a solid Serialization/Deserialization framework in order to store and process huge amounts of data efficiently. In general, it is desirable that an RPC serialization format is:
Compact
Fast
Extensible
Interoperable
Hadoop uses its own types because developers wanted the storage format to be compact (to make efficient use of storage space), fast (so the overhead in reading or writing terabytes of data is minimal), extensible (so we can transparently read data written in an older format), and interoperable (so we can read or write persistent data using different languages).
Few points to remember before thinking that having dedicated MapReduce types is redundant :
Hadoop’s Writable-based serialization framework provides a more efficient and customized serialization and representation of the data for MapReduce programs than using the general-purpose Java’s native serialization framework.
As opposed to Java’s serialization, Hadoop’s Writable framework does not write the type name with each object expecting all the clients of the serialized data to be aware of the types used in the serialized data. Omitting the type names makes the serialization process faster and results in compact, random accessible serialized data formats that can be easily interpreted by non-Java clients.
Hadoop’s Writable-based serialization also has the ability to reduce the object-creation overhead by reusing the Writable objects, which is not possible with the Java’s native serialization framework.
HTH
Why can't I use the basic String or Integer classes?
Integer and String implement the standard Serializable-interface of Java . The problem is that MapReduce serializes/deserializes values not utilizing this standard interface but rather an own interface, which is called Writable.
The key and value classes have to be serializable by the framework and hence need to implement
the Writable interface. Additionally, the key classes have to implement the WritableComparable
interface to facilitate sorting by the framework.
Here is the link to MapReduce Tutorial

what is a data serialization system?

according to Apache AVRO project, "Avro is a serialization system". By saying data serialization system, does it mean that avro is a product or api?
also, I am not quit sure about what a data serialization system is? for now, my understanding is that it is a protocol that defines how data object is passed over the network. Can anyone help explain it in an intuitive way that it is easier for people with limited distributed computing background to understand?
Thanks in advance!
So when Hadoop was being written by Doug Cutting he decided that the standard Java method of serializing Java object using Java Object Serialization (Java Serialization) didn't meet his requirements for Hadoop. Namely, these requirements were:
Serialize the data into a compact binary format.
Be fast, both in performance and how quickly it allowed data to be transfered.
Interoperable so that other languages plug into Hadoop more easily.
As he described Java Serialization:
It looked big and hairy and I though we needed something lean and mean
Instead of using Java Serialization they wrote their own serialization framework. The main perceived problems with Java Serialization was that it writes the classname of each object being serialized to the stream, with each subsequent instance of that class containing a 5 byte reference to the first, instead of the classname.
As well as reducing the effective bandwidth of the stream this causes problems with random access as well as sorting of records in a serialized stream. Thus Hadoop serialization doesn't write the classname or the required references, and makes the assumption that the client knows the expected type.
Java Serialization also creates a new object for each one that is deserialized. Hadoop Writables, which implement Hadoop Serialization, can be reused. Thus, helping to improve the performance of MapReduce which accentually serializes and deserializes billions of records.
Avro fits into Hadoop in that it approaches serialization in a different manner. The client and server exchange a scheme which describes the datastream. This helps make it fast, compact and importantly makes it easier to mix languanges together.
So Avro defines a serialization format, a protocol for clients and servers to communicate these serial streams and a way to compactly persist data in files.
I hope this helps. I thought a bit of Hadoop history would help understand why Avro is a subproject of Hadoop and what its meant to help with.
If you have to store in a limited file the information like the hierarchy or data structure implementation details and pass that information over a network, you use data serialization. It is close to understanding xml or json format. The benefit is that the information which is translated into any serialization format can be deserialized to regenerate the classes, objects, data structures whatever that was serialized.
actual implementation-->serialization-->.xml or .json or .avro --->deserialization--->imlementation in original form
Here is the link to the list of serialization formats. Comment if you want further information! :)

Is there a fast and reliable way of serializing objects across different versions of Ruby?

I have two applications talking to each other using a queue, as of now they run exactly the same version of ruby (1.8.7), so I'm just marshaling objects back and forth; only objects from the standard lib mostly hashes, strings, time and date objects.
Right now I'm moving to Ruby 1.9.1, one app at the time, which means I'll be running one app with 1.8.7 and the other with 1.9.1 for a while. By running my tests I know Marshal will not be reliable across versions, I could use YAML, but it is much slower, JSON seems to be faster but it does not deal directly with the date/time objects.
Is there a reliable and fast way to serialize ruby objects across different versions?
I haven't tried it on ruby, but you could look at protocol buffers? Designed as a fast but portable binary format, it has a ruby port here. You would probably have to treat the generated types as a separate DTO layer, though (i.e. you map your existing data into the new types, rather than serialize your existing objects). Note that there is no inbuilt date-time support, but you could just use ticks in an epoch etc.
The key here is finding a common data type that you know will be represented the same across Ruby versions. The obvious choices here are storing data in an external database (the DB interface libraries will handle all the conversions) or writing the data out in a structured text format. If there's not a ton of data to work with (and the data is mostly standard types), I usually just store it as text; it takes longer to export/import but it's usually faster to write.
Protobufs are good, but require you to pre-define your data structures, if I recall. Thrift is similar to protobufs, but has some decent code generation features.
Apple's binary property list format sounds close to what you need. It's similar to JSON in behavior, but is more compact and supports a few extra types, including datetime and unencoded binary. There are a couple ruby implementations on github.
Your best bet may be BERT. BERT is based on Erlang's binary term serialization format. It's compact, includes datatime serialization and is implemented in a dozen or so languages, including ruby.

Does soCaseInsensitive greatly impact performance for a TdxMemIndex on a TdxMemDataset?

I am adding some indexes to my DevExpress TdxMemDataset to improve performance. The TdxMemIndex has SortOptions which include the option for soCaseInsensitive. My data is usually a GUID string, so it is not case sensitive. I am wondering if I am better off just forcing all the data to the same case or if the soCaseInsensitive flag and using the loCaseInsensitive flag with the call to Locate has only a minor performance penalty (roughly equal to converting the case of my string every time I need to use the index).
At this point I am leaving the CaseInsentive off and just converting case.
IMHO, The best is to assure the data quality at Post time. Reasonings:
You (usually) know the nature of the data. So, eg. you can use UpperCase (knowing that GUIDs are all in ASCII range) instead of much slower AnsiUpperCase which a general component like TdxMemDataSet is forced to use.
You enter the data only once. Searching/Sorting/Filtering which all implies the internal upercassing engine of TdxMemDataSet it's a repeated action. Also, there are other chained actions which will trigger this engine whithout realizing. (Eg. a TcxGrid which is Sorted by default having GridMode:=True (I assume that you use the DevEx. components) and having a class acting like a broker passing the sort message to the underlying dataset.
Usually the data entry is done in steps, one or few records in a batch. The only notable exception is data aquisition applications. But in both cases above the user's usability culture allows way greater response times for you to play with. (IOW how much would add an UpperCase call to a record post which lasts 0.005 ms?) OTOH, users are very demanding with the speed of data retreival operations (searching, sorting, filtering etc.). Keep the data retreival as fast as you can.
Having the data in the database ready to expose reduces the risk of processing errors when you'll write (if you'll write) other modules (you need to remember to AnsiUpperCase the data in any module in any language you'll write). Also here a classical example is when you'll use other external tools to access the data (for ex. db managers to execute an SQL SELCT over the data).
hth.
Maybe the DevExpress forums (or ever a support email, if you have access to it) would be a better place to seek an authoritative answer on that performance question.
Anyway, is better to guarantee that data is on the format you want - for the reasons plainth already explained - the moment you save it. So, in that specific, make sure the GUID is written in upper(or lower, its a matter of taste)case. If it is SQL Server or another database server that have an guid datatype, make sure the SELECT make the work - if applicable and possible, even the sort.

Resources