AFAIK, Hadoop Streaming only supports text input, which means the data is organized by lines. But the mapper code will become messy if we want backward compatibility, i.e. supporting different versions of log lines in the same mapper program written in C++.
I have considered Avro and Protocol Buffers, but it seems they are not supported in streaming mode. Is that true?
And is there any other solution?
Other input/output formats can also be used along with Hadoop Streaming.
Avro support has been added for Hadoop Streaming. See AVRO-808 and AVRO-830. This thread might also be useful.
I could not find InputFormat and OutputFormat classes for Protocol Buffers, so they would need to be custom-built.
Just for information, Hadoop Streaming supports binary input/output.
Look for the -io rawbytes option.
I created a prototype which was able to consume SequenceFiles (I think - it was long ago).
I abandoned the idea because I had to deserialize Java Hadoop *Writables from the stream, and C#'s BinaryReader
uses little-endian encoding, while Java uses big-endian. So the mapper became more complicated than it should be.
Anyway, it is possible.
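To illustrate the record layout, here is a minimal Python sketch of an identity mapper for -io rawbytes. It assumes the documented format where each key and each value arrives as a 4-byte big-endian length followed by that many raw bytes; the decoding step in the middle is a placeholder for your own (e.g. version-aware) parsing.

```python
#!/usr/bin/env python3
# Sketch of an identity mapper for "-io rawbytes" (assumed record layout:
# 4-byte big-endian length, then that many raw bytes, for each key and value).
import struct
import sys

def read_blob(stream):
    header = stream.read(4)
    if len(header) < 4:
        return None                             # end of input
    (length,) = struct.unpack(">i", header)     # big-endian, matching Java's DataOutputStream
    return stream.read(length)

def write_blob(stream, payload):
    stream.write(struct.pack(">i", len(payload)))
    stream.write(payload)

stdin, stdout = sys.stdin.buffer, sys.stdout.buffer
while True:
    key = read_blob(stdin)
    if key is None:
        break
    value = read_blob(stdin)
    # ... decode `value` with your own, possibly version-aware, logic here ...
    write_blob(stdout, key)
    write_blob(stdout, value)
stdout.flush()
```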
I know that Apache Arrow's Parquet implementation can read spec-compliant Delta-encoded files, but cannot write them out. I am wondering whether there is any commonly used open-source C++/Python library that can write out Parquet spec-compliant delta encoding.
There's a Rust library with Python bindings called delta-rs that has a file writer which can take an Apache Arrow Table or RecordBatch and write it to Delta format. Note that it doesn't support transactions or checkpoints yet.
It seems like a pretty active project though, with recent contributions around Delta optimizations, so that's cool.
Note: the Delta writer feature of delta-rs is labeled Experimental, so it might not be completely stable.
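For reference, a minimal sketch of what that looks like through the Python bindings, assuming the deltalake package exposes write_deltalake as in its current releases (the exact API may still change while the writer is experimental; the output path is made up):

```python
import pyarrow as pa
from deltalake import write_deltalake  # Python bindings for delta-rs

# Build an Arrow table (a RecordBatch works too) and write it out as a Delta table.
table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})
write_deltalake("/tmp/my_delta_table", table)
```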
I have several machines with TBs of log data in a custom format which can be read with a C++ library. I want to upload all the data to a Hadoop cluster (HDFS) while converting it to Parquet files.
This is an ongoing process (meaning every day I will get more data), not a one-time effort.
What is the best alternative performance-wise (i.e. doing it efficiently)?
Is the Parquet C++ library as good as the Java one (updates, bugs, etc.)?
The solution should handle tens of TBs per day or even more in the future.
Log data arrives continuously and should be available immediately on the HDFS cluster.
Performance-wise, your best approach will be to gather the data in batches and then write out a new Parquet file per batch. If your data is received in single lines and you want to persist them immediately on HDFS, then you could also write them out to a row-based format (that supports single-line appends), e.g. Avro, and regularly run a job that compacts them into a single Parquet file.
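As a rough example of the batching approach using pyarrow (which sits on top of parquet-cpp): the reader stub, batch size, and file names below are placeholders, and a reasonably recent pyarrow release is assumed.

```python
import pyarrow as pa
import pyarrow.parquet as pq

BATCH_SIZE = 1_000_000          # tune to taste; bigger batches give bigger row groups

def read_custom_logs():
    # Stub standing in for the real C++-backed reader of the custom log format.
    for i in range(10):
        yield {"ts": i, "host": "example", "message": f"line {i}"}

def write_batch(records, path):
    # `records` is a list of dicts already decoded from the custom log format.
    table = pa.Table.from_pylist(records)
    pq.write_table(table, path, compression="snappy")

batch, batch_no = [], 0
for record in read_custom_logs():
    batch.append(record)
    if len(batch) >= BATCH_SIZE:
        write_batch(batch, f"logs_batch_{batch_no:05d}.parquet")
        batch, batch_no = [], batch_no + 1
if batch:                        # flush the final partial batch
    write_batch(batch, f"logs_batch_{batch_no:05d}.parquet")
```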
Library-wise, parquet-cpp is in much more active development at the moment than parquet-mr (the Java library). This is mainly because active parquet-cpp development (re-)started about 1.5 years ago (winter/spring 2016). So updates to the C++ library will happen very quickly at the moment, while the Java library is very mature, as it has had a huge user base for quite a few years. There are some features like predicate pushdown that are not yet implemented in parquet-cpp, but these are all on the read path, so they don't matter for writes.
We are now at a point with parquet-cpp where it already runs very stably in different production environments, so in the end your choice of the C++ or Java library should mainly depend on your system environment. If all your code currently runs in the JVM, then use parquet-mr; otherwise, if you're a C++/Python/Ruby user, use parquet-cpp.
Disclaimer: I'm one of the parquet-cpp developers.
I want to ask why the Hadoop framework, which implements the MapReduce distributed programming paradigm, uses a Text class to store a String when Java already has String implemented for us to use. It seems unnecessarily redundant (lol).
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/Text.html
They have implemented their own classes: Text for String, LongWritable for Long, IntWritable for Integer.
The purpose behind adding these classes is to define their own basic types optimized for network serialization. They are found in the org.apache.hadoop.io package.
These types produce compact serialized objects to make the best use of network bandwidth. Hadoop is meant to process big data, so network bandwidth is the most precious resource and has to be used very effectively. In addition, these classes reduce the overhead of serializing and deserializing these objects compared to Java's native types.
Redundant???
Let me shed some light. When we talk about distributed systems, efficient serialization/deserialization plays a vital role. It appears in two quite distinct areas of distributed data processing:
IPC
Persistent Storage
To be specific to Hadoop, IPC between nodes is implemented using RPCs. The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message. So, it is very important to have a solid Serialization/Deserialization framework in order to store and process huge amounts of data efficiently. In general, it is desirable that an RPC serialization format is:
Compact
Fast
Extensible
Interoperable
Hadoop uses its own types because developers wanted the storage format to be compact (to make efficient use of storage space), fast (so the overhead in reading or writing terabytes of data is minimal), extensible (so we can transparently read data written in an older format), and interoperable (so we can read or write persistent data using different languages).
A few points to remember before concluding that having dedicated MapReduce types is redundant:
Hadoop's Writable-based serialization framework provides a more efficient and customized serialization and representation of the data for MapReduce programs than Java's general-purpose native serialization framework.
As opposed to Java's serialization, Hadoop's Writable framework does not write the type name with each object; it expects all clients of the serialized data to be aware of the types used in it. Omitting the type names makes serialization faster and results in a compact, randomly accessible serialized data format that can easily be interpreted by non-Java clients (see the rough sketch below).
Hadoop's Writable-based serialization also reduces object-creation overhead by reusing the Writable objects, which is not possible with Java's native serialization framework.
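A rough, language-neutral illustration of the compactness point (Python is used only for the demo; this is not Hadoop's Writable API): an IntWritable's wire format is just the four big-endian bytes of the value, while a framework that records type metadata also pays for the class name.

```python
import pickle
import struct

# Mimics the on-the-wire form of an IntWritable: 4 big-endian bytes, no type name.
wire = struct.pack(">i", 42)
print(len(wire), wire)                      # 4 b'\x00\x00\x00*'

# A general-purpose framework that embeds type information has to ship the
# class name along with the data, so the same logical value costs far more bytes.
class Counter:
    def __init__(self, count):
        self.count = count

print(len(pickle.dumps(Counter(42))))       # dozens of bytes, including "Counter"
```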
HTH
Why can't I use the basic String or Integer classes?
Integer and String implement Java's standard Serializable interface. The problem is that MapReduce does not serialize/deserialize values using this standard interface, but rather through its own interface, which is called Writable.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Here is the link to the MapReduce Tutorial.
According to the Apache Avro project, "Avro is a serialization system". By saying data serialization system, does it mean that Avro is a product or an API?
Also, I am not quite sure what a data serialization system is. For now, my understanding is that it is a protocol that defines how a data object is passed over the network. Can anyone explain it in an intuitive way that is easier to understand for people with a limited distributed computing background?
Thanks in advance!
So when Hadoop was being written by Doug Cutting, he decided that the standard Java method of serializing objects using Java Object Serialization (Java Serialization) didn't meet his requirements for Hadoop. Namely, these requirements were:
Serialize the data into a compact binary format.
Be fast, both in raw performance and in how quickly it allowed data to be transferred.
Be interoperable, so that other languages can plug into Hadoop more easily.
As he described Java Serialization:
It looked big and hairy and I thought we needed something lean and mean
Instead of using Java Serialization they wrote their own serialization framework. The main perceived problems with Java Serialization were that it writes the classname of each object being serialized to the stream, with each subsequent instance of that class containing a 5-byte reference to the first, instead of the classname.
As well as reducing the effective bandwidth of the stream, this causes problems with random access and with sorting of records in a serialized stream. Thus Hadoop's serialization doesn't write the classname or the required references, and assumes that the client knows the expected type.
Java Serialization also creates a new object for each one that is deserialized. Hadoop Writables, which implement Hadoop's serialization, can be reused, which helps improve the performance of MapReduce, which essentially serializes and deserializes billions of records.
Avro fits into Hadoop in that it approaches serialization in a different manner. The client and server exchange a schema which describes the data stream. This helps make it fast and compact and, importantly, makes it easier to mix languages together.
So Avro defines a serialization format, a protocol for clients and servers to communicate these serialized streams, and a way to compactly persist data in files.
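To make that concrete, here is a minimal sketch using the avro Python package (the file name and schema are invented for the example). The schema is stored in the data file's header, which is what lets other languages and later readers interpret the records:

```python
import json
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# The schema describes the records and is embedded in the file's header.
schema = avro.schema.parse(json.dumps({
    "namespace": "example.avro",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
    ],
}))

writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.close()

reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()
```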
I hope this helps. I thought a bit of Hadoop history would help you understand why Avro is a subproject of Hadoop and what it's meant to help with.
If you have to store information such as a hierarchy or data structure implementation details in a file of limited size, and pass that information over a network, you use data serialization. It is close in spirit to the XML or JSON formats. The benefit is that information translated into any serialization format can be deserialized to regenerate the classes, objects, and data structures, whatever was serialized.
actual implementation --> serialization --> .xml or .json or .avro --> deserialization --> implementation in original form
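In Python, for instance, the same round trip with JSON looks like this (a toy example, just to make the pipeline concrete):

```python
import json

original = {"name": "server-01", "disks": ["sda", "sdb"], "online": True}

serialized = json.dumps(original)   # actual implementation --> .json text
restored = json.loads(serialized)   # .json text --> implementation in original form

assert restored == original
```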
Here is the link to the list of serialization formats. Comment if you want further information! :)
I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop expects to start a job with an input file of fixed size, rather than being able to hand off new data to consumers as it arrives. Is this actually the case, or am I missing something? Is there a different MapReduce tool that works with data being read in from an open socket? Scalability is an issue here, so I'd prefer to let the MapReducer handle the messy parallelization stuff.
I've played around with Cascading and was able to run a job on a static file accessed via HTTP, but this doesn't actually solve my problem. I could use curl as an intermediate step to dump the data somewhere on a Hadoop filesystem and write a watchdog to fire off a new job every time a new chunk of data is ready, but that's a dirty hack; there has to be some more elegant way to do this. Any ideas?
The hack you describe is more or less the standard way to do things -- Hadoop is fundamentally a batch-oriented system (for one thing, if there is no end to the data, Reducers can't ever start, as they must start after the map phase is finished).
Rotate your logs; as you rotate them out, dump them into HDFS. Have a watchdog process (possibly a distributed one, coordinated using ZooKeeper) monitor the dumping grounds and start up new processing jobs. You will want to make sure the jobs run on inputs large enough to warrant the overhead.
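A very rough sketch of such a watchdog in Python (the paths, the job jar, and the size threshold are all placeholders; error handling, moving processed inputs aside, and ZooKeeper coordination are left out):

```python
import glob
import os
import subprocess
import time

LOCAL_ROTATED = "/var/log/myapp/rotated"   # where rotated logs land locally (placeholder)
HDFS_INBOX = "/data/logs/incoming"         # HDFS dumping ground (placeholder)
MIN_BATCH_BYTES = 10 * 1024**3             # only start a job once >= 10 GB has accumulated

def upload_rotated_logs():
    for path in glob.glob(os.path.join(LOCAL_ROTATED, "*.log")):
        subprocess.check_call(["hdfs", "dfs", "-put", path, HDFS_INBOX])
        os.remove(path)

def pending_bytes():
    # "hdfs dfs -du -s" prints the total size of the directory's contents first.
    out = subprocess.check_output(["hdfs", "dfs", "-du", "-s", HDFS_INBOX])
    return int(out.split()[0])

def launch_job():
    # Placeholder: submit whatever batch job processes the accumulated input.
    subprocess.check_call(["hadoop", "jar", "my-job.jar", HDFS_INBOX, "/data/logs/output"])

while True:
    upload_rotated_logs()
    if pending_bytes() >= MIN_BATCH_BYTES:
        launch_job()
    time.sleep(60)
```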
HBase is a BigTable clone in the Hadoop ecosystem that may be interesting to you, as it allows a continuous stream of inserts; you will still need to run analytical queries in batch mode, however.
What about http://s4.io/? It's made for processing streaming data.
Update
A new product is rising: Storm - Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC, and more
I think you should take a look at Esper CEP (http://esper.codehaus.org/).
Yahoo S4: http://s4.io/
It provides real-time stream computing, like MapReduce.
Twitter's Storm is what you need; give it a try!
There are multiple options here.
I suggest the combination of Kafka and Storm plus Hadoop or a NoSQL store as the solution.
We have already built our big data platform using those open-source tools, and it works very well.
Your use case sounds similar to the issue of writing a web crawler using Hadoop - the data streams back (slowly) from sockets opened to fetch remote pages via HTTP.
If so, then see Why fetching web pages doesn't map well to map-reduce. And you might want to check out the FetcherBuffer class in Bixo, which implements a threaded approach in a reducer (via Cascading) to solve this type of problem.
As you know, the main issues with using Hadoop for stream mining are, first, that it uses HDFS, and disk operations bring latency that results in missing data in the stream; and second, that the pipeline is not parallel. MapReduce generally operates on batches of data and not on single instances, as is the case with stream data.
I recently read an article about M3, which apparently tackles the first issue by bypassing HDFS and performing in-memory computations in an object database, and tackles the second by using incremental learners that no longer run in batch mode. It is worth checking out: M3: Stream Processing on Main-Memory MapReduce. I could not find the source code or API of M3 anywhere; if somebody finds it, please share the link here.
Hadoop Online is another prototype that attempts to solve the same issues as M3: Hadoop Online
Apache Storm is the key solution to the issue; however, it is not enough on its own. You need some equivalent of MapReduce, which is why you need a library called SAMOA, which has great algorithms for online learning that Mahout somewhat lacks.
Several mature stream processing frameworks and products are available on the market. Open-source frameworks include Apache Storm and Apache Spark (both of which can run on top of Hadoop). You can also use products such as IBM InfoSphere Streams or TIBCO StreamBase.
Take a look at this InfoQ article, which explains stream processing and all these frameworks and products in detail: Real Time Stream Processing / Streaming Analytics in Combination with Hadoop. The article also explains how this is complementary to Hadoop.
By the way: Many software vendors such as Oracle or TIBCO call this stream processing / streaming analytics approach "fast data" instead of "big data" as you have to act in real time instead of batch processing.
You should try Apache Spark Streaming.
It should work well for your purposes.
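For example, here is a minimal PySpark sketch that consumes lines from a socket and counts words in 10-second micro-batches; the host, port, and word-count logic are placeholders for your own stream source and processing:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamWordCount")
ssc = StreamingContext(sc, batchDuration=10)       # 10-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)    # plug in your own receiver here
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```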