Is there a simple way to migrate from SequenceFiles to Avro? - hadoop

I'm currently using Hadoop MapReduce jobs with SequenceFiles of Writables.
The same Writable types are also used for serialization in the non-Hadoop-related parts of the system.
This approach is hard to maintain, mainly because of the lack of a schema and the need to handle version changes manually.
It appears that Apache Avro handles these issues.
The problem is that during the migration I will have data in both formats.
Is there a simple way to handle the migration?

I haven't tried it myself, but maybe using AvroSequenceFile format would help. It's just a wrapper around SequenceFile so in theory you should be able to write data in both your old SequenceFile format as well as your new Avro format which should make the migration easier.
Here is more information about this format.
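For what it's worth, here is a small sketch of writing one of these files with org.apache.avro.hadoop.io.AvroSequenceFile from avro-mapred. The string/int schemas and the output path are made up for illustration, and the Writer.Options builder methods should be double-checked against your Avro version:

    import org.apache.avro.Schema;
    import org.apache.avro.hadoop.io.AvroSequenceFile;
    import org.apache.avro.mapred.AvroKey;
    import org.apache.avro.mapred.AvroValue;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;

    public class AvroSequenceFileWriteSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Example key/value schemas and output path; replace with your own.
        SequenceFile.Writer writer = AvroSequenceFile.createWriter(
            new AvroSequenceFile.Writer.Options()
                .withFileSystem(fs)
                .withConfiguration(conf)
                .withOutputPath(new Path("/tmp/events.avro.seq"))
                .withKeySchema(Schema.create(Schema.Type.STRING))
                .withValueSchema(Schema.create(Schema.Type.INT)));

        // Avro data is appended wrapped in AvroKey/AvroValue.
        writer.append(new AvroKey<CharSequence>("username"), new AvroValue<Integer>(42));
        writer.close();
      }
    }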

Generally, there is nothing stopping you from using Avro data and SequenceFiles interchangeably. Use whatever InputFormat is necessary for the type of data you have, and for output it of course makes sense to use Avro formats whenever practical. If your input comes in different formats, take a look at MultipleInputs (see the sketch below). Essentially, you will still have to implement separate Mappers, but that's to be expected considering the Map input key/value types are different.
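For example, a rough sketch of such a job setup with the new MapReduce API; the input paths, the Text/Text intermediate types, and the Avro field names "id" and "payload" are placeholders for your own types, not anything from the question:

    import java.io.IOException;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.mapred.AvroKey;
    import org.apache.avro.mapreduce.AvroKeyInputFormat;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

    public class MixedInputsJob {
      // One mapper per physical format; both emit the same intermediate (Text, Text) pairs.
      public static class SeqFileMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(key, value);  // legacy Writable records pass straight through
        }
      }

      public static class AvroRecordMapper
          extends Mapper<AvroKey<GenericRecord>, NullWritable, Text, Text> {
        @Override
        protected void map(AvroKey<GenericRecord> key, NullWritable value, Context ctx)
            throws IOException, InterruptedException {
          GenericRecord r = key.datum();
          ctx.write(new Text(r.get("id").toString()),       // "id"/"payload" are assumed field names
                    new Text(r.get("payload").toString()));
        }
      }

      public static void configure(Job job) {
        MultipleInputs.addInputPath(job, new Path("/data/legacy-seqfiles"),
            SequenceFileInputFormat.class, SeqFileMapper.class);
        MultipleInputs.addInputPath(job, new Path("/data/new-avro"),
            AvroKeyInputFormat.class, AvroRecordMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
      }
    }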
Moving to Avro is a wise move. If you have the capacity in time and hardware, it might even be worthwhile to explicitly convert your data from SequenceFiles to Avro right away. You can use any language supported by Avro that also happens to support SequenceFiles to do this. Java certainly does (clearly), but Pig is also pretty handy for it.
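If you go the Java route, a minimal map-only conversion job might look roughly like the sketch below. It assumes avro-mapred's new-API classes (AvroJob, AvroKeyOutputFormat); the LongWritable/Text input types and the two-field schema are stand-ins for your own Writable and its fields:

    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.mapred.AvroKey;
    import org.apache.avro.mapreduce.AvroJob;
    import org.apache.avro.mapreduce.AvroKeyOutputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SeqToAvro {
      // Hypothetical schema; replace with one matching your Writable's fields.
      static final Schema SCHEMA = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Record\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},{\"name\":\"payload\",\"type\":\"string\"}]}");

      public static class ConvertMapper
          extends Mapper<LongWritable, Text, AvroKey<GenericRecord>, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          GenericRecord r = new GenericData.Record(SCHEMA);
          r.put("id", key.get());
          r.put("payload", value.toString());
          ctx.write(new AvroKey<GenericRecord>(r), NullWritable.get());
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "seqfile-to-avro");
        job.setJarByClass(SeqToAvro.class);
        job.setMapperClass(ConvertMapper.class);
        job.setNumReduceTasks(0);                       // map-only conversion
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(AvroKeyOutputFormat.class);
        AvroJob.setOutputKeySchema(job, SCHEMA);        // embeds the schema in the output files
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }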
The user-contributed PiggyBank project has functionality for reading SequenceFiles, and then it is simply a matter of using AvroStorage from the same PiggyBank project with the appropriate Avro schema to get your Avro file.
If only Pig supported loading Avro schemas from a file! If you use Pig, you will unfortunately have to write scripts that explicitly contain the Avro schema, which can be a bit annoying.

Related

Hadoop's own Serialization and its relationship with AVRO serialization?

I am trying to understand Avro and came to know that it is one of the data serialization frameworks which Hadoop uses.
While learning Hadoop, I came to know that Hadoop uses its own serialization framework rather than Java's serialization, so I can see Writable and WritableComparable in Hadoop.
Now, after going through Avro, it says that Avro is used as a serialization framework.
I am a bit confused because of this. So, when we say Hadoop's own serialization framework, are we referring to Avro or something else (which is built into Hadoop itself)?
Can anyone help me understand this?
Hadoop Writables are not Avro; they are "something else".
Avro is a separate project, and its schema model allows for nested structures and evolution. Hadoop serialization has no concept of schema evolution, as far as I know.
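To make the schema-evolution point concrete, here is a small self-contained sketch (the record and field names are made up): data written with a v1 schema is read back with a v2 reader schema that adds a defaulted field, something Writables give you no help with:

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    public class EvolutionDemo {
      public static void main(String[] args) throws Exception {
        Schema v1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"}]}");
        Schema v2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\",\"default\":-1}]}");  // new field with a default

        // Write a record with the old (v1) schema.
        GenericRecord rec = new GenericData.Record(v1);
        rec.put("name", "alice");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(v1).write(rec, enc);
        enc.flush();

        // Read it back with the new (v2) schema; the missing field takes its default.
        GenericRecord upgraded = new GenericDatumReader<GenericRecord>(v1, v2).read(
            null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
        System.out.println(upgraded);  // {"name": "alice", "age": -1}
      }
    }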
Thrift is another row-oriented serialization format commonly found in Hadoop projects.
Other (columnar) data storage formats include Parquet and ORC.

Adding parquet-avro support to scalding

How can I create a Scalding Source that will handle conversions between Avro and Parquet?
The solution should:
1. Read from parquet format and convert to avro memory representation
2. Write avro objects into a parquet file
Note: I noticed Cascading has a module for working with Thrift and Parquet; it occurs to me that this would be a good place to start looking. I also opened a thread on google-groups/scalding-dev.
Try our latest changes in this fork -
https://github.com/epishkin/scalding/tree/parquet_avro/scalding-parquet
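Not the Scalding Source itself, but for reference, a hedged Java sketch of the parquet-avro round trip such a Source would wrap. The package was parquet.avro in early parquet-mr releases (later org.apache.parquet.avro), the constructors below are the early, since-deprecated ones, and the schema and path are made up:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import parquet.avro.AvroParquetReader;
    import parquet.avro.AvroParquetWriter;

    public class ParquetAvroRoundTrip {
      public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Click\",\"fields\":["
            + "{\"name\":\"url\",\"type\":\"string\"}]}");
        Path path = new Path("/tmp/clicks.parquet");

        // 2. Write Avro objects into a Parquet file.
        AvroParquetWriter<GenericRecord> writer =
            new AvroParquetWriter<GenericRecord>(path, schema);
        GenericRecord click = new GenericData.Record(schema);
        click.put("url", "http://example.com");
        writer.write(click);
        writer.close();

        // 1. Read from Parquet back into the Avro in-memory representation.
        AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(path);
        GenericRecord back;
        while ((back = reader.read()) != null) {
          System.out.println(back.get("url"));
        }
        reader.close();
      }
    }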

Modeling Data in Hadoop

Currently I am bringing around 10 tables into Hadoop from an EDW (Enterprise Data Warehouse); these tables closely follow a star schema model. I'm using Sqoop to bring all these tables across, resulting in 10 directories containing CSV files.
I'm looking at better ways to store these files before kicking off MR jobs. Should I follow some kind of model or build an aggregate before working on MR jobs? I'm basically looking for ways of storing related data together.
Most things I have found by searching involve storing trivial CSV files and reading them with opencsv. I'm looking for something a bit more involved, and not just for CSV files. If moving to another format works better than CSV, that is no problem.
It boils down to this: how best to store a bunch of related data in HDFS so as to have a good experience with MR?
I suggest spending some time with Apache Avro.
With Sqoop v1.3 and beyond you can import data from your relational data sources as Avro files using a schema of your own design. What's nice about Avro is that it provides a lot of features in addition to being a serialization format...
It gives you data plus schema in the same file while remaining compact and efficient for fast serialization. It gives you versioning facilities, which are useful when bringing in updated data with a different schema. Hive supports both reading and writing it, and MapReduce can use it seamlessly.
It can be used as a generic interchange format between applications (not just for Hadoop) making it an interesting option for a standard, cross-platform format for data exchange in your broader architecture.
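As a small illustration of the data-plus-schema point, reading an Avro container file needs nothing but the file itself; the file name below is just a placeholder for something a Sqoop import might produce:

    import java.io.File;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class InspectAvroFile {
      public static void main(String[] args) throws Exception {
        DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
            new File("part-m-00000.avro"), new GenericDatumReader<GenericRecord>());
        System.out.println(reader.getSchema().toString(true));  // schema embedded in the file
        for (GenericRecord record : reader) {
          System.out.println(record);                           // records decoded with that schema
        }
        reader.close();
      }
    }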
Storing these files as CSV is fine, since you will be able to process them with plain text input/output formats and also read them through Hive using a specific delimiter. You can change the delimiter from a comma to a pipe ("|") if you prefer; that's what I do most of the time. You generally want large files in Hadoop, and if the data is large enough that each partition comes to a few hundred gigabytes, it is a good idea to split the files into separate directories based on your partition column.
It is also usually better to have most of the columns in a single table than in many small normalized tables, though that depends on your data size. Make sure that whenever you copy, move, or create data you do all the constraint checks in your applications, because it is difficult to make small changes to a table later on: you would need to rewrite the complete file for even a small change.
Hive partitioning and bucketing can be used effectively to put similar data together (not on particular nodes, but in files and folders) based on a particular column. Here are some nice tutorials for partitioning and bucketing.

Hadoop MultipleOutputFormats to HFileOutputFormat and TextOutputFormat

I am running an ETL job with Hadoop where I need to output the valid, transformed data to HBase and an external index for that data into MySQL. My initial thought is that I could use MultipleOutputFormats to write the transformed data with HFileOutputFormat (key is Text and value is ProtobufWritable) and the index with TextOutputFormat (key is Text and value is Text).
The number of input records for an average-sized job (I'll need the ability to run many at once) is about 700 million.
I'm wondering if A) this seems to be a reasonable approach in terms of efficiency and complexity, and B) how to accomplish this with the CDH3 distribution's API, if possible.
If you're using the old MapReduce API then you can use MultipleOutputs and write to multiple output formats.
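Roughly, the old-API wiring looks like the sketch below. TextOutputFormat stands in for the MySQL index output, and SequenceFileOutputFormat is only a placeholder for the HBase-bound output, since I'm not sure an old-API HFileOutputFormat exists in your HBase version; treat the class choices as assumptions:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.hadoop.mapred.lib.MultipleOutputs;

    public class EtlOutputs {
      public static void configureJob(JobConf conf) {
        // Each named output gets its own OutputFormat and key/value classes.
        MultipleOutputs.addNamedOutput(conf, "index", TextOutputFormat.class,
            Text.class, Text.class);
        MultipleOutputs.addNamedOutput(conf, "hfiles", SequenceFileOutputFormat.class,
            Text.class, Text.class);
      }

      public static class EtlReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        private MultipleOutputs mos;

        @Override
        public void configure(JobConf conf) { mos = new MultipleOutputs(conf); }

        @Override
        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
          while (values.hasNext()) {
            Text value = values.next();
            mos.getCollector("index", reporter).collect(key, value);   // external index record
            mos.getCollector("hfiles", reporter).collect(key, value);  // HBase-bound record
          }
        }

        @Override
        public void close() throws IOException { mos.close(); }
      }
    }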
However, if you're using the new MapReduce API, I'm not sure that there is a way to do what you're trying to do. You might have to pay the price of running another MapReduce job on the same inputs, but I'll have to do more research before saying so for sure. There might be a way to hack the old and new APIs together to allow you to use MultipleOutputs with the new API.
EDIT: Have a look at this post. You can probably implement your own OutputFormat and wrap the appropriate RecordWriters in the OutputFormat and use that to write to multiple output formats.

Using Apache Hive as a MapReduce Input Format and/or Scraping Hive Metadata

Our environment is heavy into storing data in Hive. I find myself currently working on something that is outside that scope, though. I have a MapReduce job written, but it requires a lot of direct user input for information that could easily be scraped from Hive. That said, when I query Hive for extended table data, all of the extended information is dumped into 1 or 2 columns as a giant blob of almost-JSON. Is there either a convenient way to parse this information or, better yet, a way to get it in a more direct manner?
Alternatively, if I could get pointed to documentation on manually using the CombinedHiveInputFormat, that would simplify my code a lot. But it seems like that InputFormat is used solely inside Hive, with its custom structs.
Ultimately, what I want is to know table names, columns (not including partitions), and partition locations for the split a mapper is working on. If there is yet another way to accomplish this, I am eager to know.
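One possibly more direct route is the metastore Thrift client; here is a hedged sketch using HiveMetaStoreClient rather than parsing DESCRIBE EXTENDED output. The database and table names are placeholders, and hive-site.xml is assumed to be on the classpath:

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.FieldSchema;
    import org.apache.hadoop.hive.metastore.api.Partition;
    import org.apache.hadoop.hive.metastore.api.Table;

    public class HiveMetadataLookup {
      public static void main(String[] args) throws Exception {
        HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());

        Table table = client.getTable("default", "my_table");
        for (FieldSchema col : table.getSd().getCols()) {  // columns, partition keys excluded
          System.out.println(col.getName() + "\t" + col.getType());
        }
        for (Partition p : client.listPartitions("default", "my_table", (short) -1)) {
          System.out.println(p.getValues() + " -> " + p.getSd().getLocation());  // partition location
        }
        client.close();
      }
    }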
