I am trying to understand Avro and learned that it is one of the data serialization frameworks that Hadoop uses.
While learning Hadoop, I read that Hadoop uses its own serialization framework rather than Java's serialization, which is why I see Writable and WritableComparable in Hadoop.
Now, after reading about Avro, I see that Avro is also described as a serialization framework.
I am a bit confused by this. When we say Hadoop's own serialization framework, are we referring to Avro or to something else that is built into Hadoop itself?
Can anyone help me understand this?
Hadoop Writables are not Avro; they are the "something else".
Avro is a separate project, and its schema model allows for nested structures and schema evolution. Hadoop serialization has no concept of schema evolution, as far as I know.
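To make the difference concrete, here is a minimal sketch in plain Python. It does not use the real Hadoop or Avro libraries; the function names (`write_user_v1`, `read_user_v1`, `resolve`) and the simplified schema lists are made up for illustration. The first half mimics the Writable style, where fields are written positionally and the reader must know the exact layout. The second half mimics the idea behind Avro's schema resolution, where a newer reader schema fills in a default for a field the writer did not know about.

```python
import struct

# Writable-style: fields are written in a fixed order, with no schema in the data.
# The reader has to hard-code the same order and types.
def write_user_v1(name: str, age: int) -> bytes:
    encoded = name.encode("utf-8")
    return struct.pack(">H", len(encoded)) + encoded + struct.pack(">i", age)

def read_user_v1(data: bytes) -> dict:
    (length,) = struct.unpack_from(">H", data, 0)
    name = data[2:2 + length].decode("utf-8")
    (age,) = struct.unpack_from(">i", data, 2 + length)
    return {"name": name, "age": age}

# Avro-style idea (heavily simplified, hypothetical schema representation):
# data written with an old schema is read with a newer one, and fields
# missing from the writer's schema take the reader's default.
WRITER_SCHEMA_V1 = [{"name": "name"}, {"name": "age"}]
READER_SCHEMA_V2 = [
    {"name": "name"},
    {"name": "age"},
    {"name": "email", "default": None},  # field added in version 2
]

def resolve(record: dict, writer_schema: list, reader_schema: list) -> dict:
    writer_fields = {field["name"] for field in writer_schema}
    resolved = {}
    for field in reader_schema:
        if field["name"] in writer_fields:
            resolved[field["name"]] = record[field["name"]]
        elif "default" in field:
            resolved[field["name"]] = field["default"]
        else:
            raise ValueError("no value or default for field " + field["name"])
    return resolved

old_record = read_user_v1(write_user_v1("ann", 30))
new_view = resolve(old_record, WRITER_SCHEMA_V1, READER_SCHEMA_V2)
```

With the positional format, adding an `email` field means changing both `write_user_v1` and `read_user_v1` and handling old data by hand; in the schema-based sketch, only the reader schema changes.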
Thrift is another row-oriented serialization format commonly found in Hadoop projects.
Other (columnar) data storage formats include Parquet and ORC.
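As a rough illustration of the row-oriented versus columnar distinction, the following plain-Python sketch stores the same three records both ways. The sample data and the `to_columns` helper are invented for the example; real Parquet and ORC files add encodings, compression, and metadata on top of this basic idea.

```python
# Row-oriented layout: each record is stored together (the Avro/Thrift style).
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 5.5},
    {"id": 3, "country": "DE", "amount": 4.5},
]

def to_columns(records: list) -> dict:
    """Pivot a list of records into one list of values per field."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns

# Column-oriented layout: all values of one field are stored together.
columns = to_columns(rows)

# An aggregate over one field touches every record in the row layout,
# but only a single list in the columnar layout.
total_from_rows = sum(record["amount"] for record in rows)
total_from_columns = sum(columns["amount"])
```

This is why columnar formats tend to suit analytical queries that read a few fields from many records, while row formats suit workloads that read or write whole records.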
I am clear about the SerDe available in Hive to support Avro schemas for data formats, and I am comfortable using Avro with Hive.
AvroSerDe
For example, I have found this issue filed against Presto:
https://github.com/prestodb/presto/issues/5009
I need to choose components for a fast execution cycle. Presto and Impala provide a much shorter execution cycle.
So, could anyone please clarify which would be better for different data formats?
Primarily, I am looking for Avro support in Presto right now.
However, let's consider the following data formats stored on HDFS:
Avro format
Parquet format
ORC format
Which is the best to use for high performance across these data formats?
Please suggest.
Impala can read Avro data but cannot write it. Please refer to this documentation page describing the file formats supported by Impala.
Hive supports both reading and writing Avro files.
Presto's Hive Connector supports Avro as well. Thanks to David Phillips for pointing out this documentation page.
There are different benchmarks on the internet about performance, but I would not like to link to a specific one as results heavily depend on the exact use case benchmarked.
How can I create a Scalding Source that handles conversions between Avro and Parquet?
The solution should:
1. Read from Parquet format and convert to an Avro in-memory representation
2. Write Avro objects into a Parquet file
Note: I noticed Cascading has a module for leveraging Thrift and Parquet. It occurs to me that this would be a good place to start looking. I also opened a thread on google-groups/scalding-dev.
Try our latest changes in this fork:
https://github.com/epishkin/scalding/tree/parquet_avro/scalding-parquet
As I am new to Big Data and the related technologies, my question is, as the title implies:
When would you use Hadoop, and when would you use some kind of NoSQL database, to store and analyse massive amounts of data?
I know that Hadoop is a framework and that Hadoop and NoSQL differ.
But you can store lots of data with Hadoop on HDFS and also with NoSQL databases like MongoDB, Neo4j...
So maybe the choice between Hadoop and a NoSQL database depends on whether you mainly want to analyse data or mainly want to store it?
Or is it just that HDFS can hold, let's say, raw data, while a NoSQL database is more structured (more structured than raw data and less structured than an RDBMS)?
Hadoop is an entire framework, of which one component can be a NoSQL database.
Hadoop generally refers to a cluster of systems working together to analyze data. You can take data from a NoSQL store and process it in parallel using Hadoop.
HBase is a NoSQL database that is part of the Hadoop ecosystem. You can use other NoSQL databases too.
Your question is misleading: you are comparing Hadoop, which is a framework, to a database...
Hadoop contains a lot of features (including a NoSQL database named HBase) in order to provide you with a big data environment. If you have a massive quantity of data you will probably use Hadoop (for the MapReduce functionality or the data warehouse capabilities), but that is not certain; it depends on what you are processing and how you want to process it. If you are just storing a lot of data and don't need other features (batch data processing, data transformations, ...), a simple NoSQL database is enough.
I'm currently using Hadoop MapReduce jobs with SequenceFiles of Writables.
The same Writable types are also used for serialization in the non-Hadoop parts of the system.
This approach is hard to maintain, mainly because of the lack of a schema and the need to handle version changes manually.
It appears that Apache Avro handles these issues.
The problem is that during the migration I will have data in both formats.
Is there a simple way to handle the migration?
I haven't tried it myself, but maybe using AvroSequenceFile format would help. It's just a wrapper around SequenceFile so in theory you should be able to write data in both your old SequenceFile format as well as your new Avro format which should make the migration easier.
Here is more information about this format.
Generally, there is nothing stopping you from using Avro data and SequenceFiles interchangeably. Use whatever InputFormat is necessary for the type of data you need, and for output it of course makes sense to use Avro formats whenever practical. If your input comes in different formats, take a look at MultipleInputs. Essentially, you will still have to implement separate Mappers, but that's to be expected considering the Map input key/value types are different.
Moving to Avro is a wise move. If you have the capacity in time and hardware, it might even be worthwhile to explicitly convert your data from SequenceFile to Avro right away. You can use any language supported by Avro that also happens to support SequenceFiles to do this. Java certainly does (clearly), but Pig is also pretty handy for doing this.
The user-contributed PiggyBank project has functionality for reading a SequenceFile, and then it is simply a matter of using AvroStorage from the same PiggyBank project with the appropriate Avro schema to get your Avro file.
If only Pig supported loading Avro schemas from a file! If you use Pig you will unfortunately have to write scripts that explicitly contain the Avro schema, which can be a bit annoying.
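The "separate Mappers per input format" idea can be sketched in plain Python as follows. This is a toy simulation rather than Hadoop code: `map_legacy`, `map_avro`, and `run_job` are hypothetical names, tuples stand in for legacy Writable records, and dicts stand in for Avro records. What it shows is that each input gets its own mapper, but both emit the same intermediate key/value shape, so a single reducer can serve both.

```python
from collections import defaultdict

def map_legacy(record: tuple):
    # Legacy records are positional: (user_id, clicks).
    user_id, clicks = record
    yield user_id, clicks

def map_avro(record: dict):
    # Avro-style records carry named fields.
    yield record["user_id"], record["clicks"]

def run_job(inputs: list) -> dict:
    """inputs is a list of (records, mapper) pairs, one pair per input format."""
    grouped = defaultdict(list)
    for records, mapper in inputs:
        for record in records:
            for key, value in mapper(record):
                grouped[key].append(value)
    # One shared reduce step: sum the values for each key.
    return {key: sum(values) for key, values in grouped.items()}

legacy_records = [("u1", 2), ("u2", 1)]
avro_records = [{"user_id": "u1", "clicks": 3}]
totals = run_job([(legacy_records, map_legacy), (avro_records, map_avro)])
```

During a migration, the legacy mapper can be dropped once all old files have been converted, without touching the reducer.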
I am very new to Hadoop. I have learned a bit about its map/reduce functionality and understand the word count demo, but I do not see the actual use of Hadoop map/reduce for database-specific computations. That is, I do not understand how map/reduce can help me with computations or database-specific processing. Can anyone provide a link or a guide that would help me understand the best uses, and suggest a scenario I can implement to better understand the map/reduce part of Hadoop?
Hadoop provides a couple of input and output formats. The base InputFormat and OutputFormat classes can be extended for customized input/output formats.
DBInputFormat/DBOutputFormat come with Hadoop. Here is the documentation from Cloudera on using MapReduce with a database.
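As a scenario to try, a typical database-style computation is a grouped aggregate such as `SELECT dept, SUM(salary) FROM employees GROUP BY dept`. The plain-Python sketch below simulates how that maps onto map/reduce; it is not Hadoop code, and the table, the column names, and the functions `map_row`, `reduce_dept`, and `run` are made up for illustration. In a real job the rows could come from something like DBInputFormat.

```python
from itertools import groupby
from operator import itemgetter

def map_row(row: dict):
    # Map: emit the GROUP BY column as the key and the aggregated column as the value.
    yield row["dept"], row["salary"]

def reduce_dept(dept: str, salaries: list):
    # Reduce: apply the aggregate function to all values that share a key.
    return dept, sum(salaries)

def run(rows: list) -> dict:
    pairs = [pair for row in rows for pair in map_row(row)]
    # Shuffle/sort: bring equal keys together, as the framework does between phases.
    pairs.sort(key=itemgetter(0))
    results = {}
    for dept, group in groupby(pairs, key=itemgetter(0)):
        key, total = reduce_dept(dept, [salary for _, salary in group])
        results[key] = total
    return results

employees = [
    {"name": "a", "dept": "eng", "salary": 100},
    {"name": "b", "dept": "ops", "salary": 70},
    {"name": "c", "dept": "eng", "salary": 120},
]
salary_by_dept = run(employees)
```

The word count demo has the same structure: the word plays the role of `dept` and the constant 1 plays the role of `salary`.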