How can I process Avro Container data with different versions of schema? - hadoop

I have months' worth of data from a single domain stored in HDFS in Avro container files. Each file carries the schema for all the data in that file, of course. How do I process all the data using Hive or Pig? It seems both Hive and Pig need an avsc file, or some other form of table structure definition, up front; i.e. even if I use Avro tools to extract the avsc from each file, I would have to load each dataset using a different avsc file and could not process all of them with one job or one DDL + query.
Isn't it possible for Hive and Pig to pull the avsc at runtime based on the Avro container file being processed? Is it already implemented and I'm just not finding it, or is it too difficult to implement?
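For reference, a hedged HiveQL sketch of the kind of setup this usually involves, assuming the newest avsc has been extracted with Avro tools and uploaded to HDFS; the table name, location, and schema path here are made up:
CREATE EXTERNAL TABLE events
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/events'
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/events_latest.avsc');
The AvroSerDe treats the schema embedded in each container file as the writer schema and the table's avsc as the reader schema, so Avro schema resolution should let one table (and one DDL + query) cover the older files, provided the versions evolved compatibly.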

Related

How to write incremental data to Hive using Flink

I use Flink 1.6. I know I can use a custom sink with Hive JDBC to write to Hive, or use JDBCAppendTableSink, but that still goes through JDBC. The problem is that Hive JDBC does not support the batchExecute method, so I think it will be very slow.
I then looked for another way: I write a DataSet to HDFS with the writeAsText method and create a Hive table from HDFS. But there is still a problem: how to append incremental data.
The API of WriteMode is:
Enum FileSystem.WriteMode
NO_OVERWRITE: Creates the target file only if no file exists at that path already.
OVERWRITE: Creates a new target file regardless of any existing files or directories.
For example, in the first batch I write September's data to Hive; then I get October's data and want to append it.
But if I use OVERWRITE on the same HDFS file, the September data will no longer exist; if I use NO_OVERWRITE, I must write to a new HDFS file and therefore a new Hive table, yet we need them in the same Hive table, and I do not know how to combine two HDFS files into one Hive table.
So how do I write incremental data to Hive using Flink?
As you already wrote, there is no Hive sink. I guess the default pattern is to write (text, Avro, Parquet) files to HDFS and define an external Hive table on that directory. There it doesn't matter whether you have a single file or multiple files, but you most likely have to repair the table on a regular basis (msck repair table <db_name>.<table_name>;). This updates the metadata so the new files become available.
For bigger amounts of data I would recommend partitioning the table and adding the partitions on demand (this blog post might give you a hint: https://resources.zaloni.com/blog/partitioning-in-hive).
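A minimal HiveQL sketch of that pattern, assuming the Flink job writes each batch into its own month subdirectory with NO_OVERWRITE; the table name, columns, delimiter, and paths are made up for illustration:
CREATE EXTERNAL TABLE monthly_data (id BIGINT, payload STRING)
PARTITIONED BY (batch_month STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/monthly_data';
-- after Flink writes /data/monthly_data/batch_month=2018-10/ :
MSCK REPAIR TABLE monthly_data;
-- or register the new partition explicitly:
ALTER TABLE monthly_data ADD IF NOT EXISTS PARTITION (batch_month='2018-10');
Each new month then shows up as another partition of the same table, which sidesteps the OVERWRITE/NO_OVERWRITE dilemma above.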

How do I use Sqoop to save data in a parquet-avro file format?

I need to move my data from a relational database to HDFS, but I would like to save the data in the parquet-avro file format. Looking at the Sqoop documentation, it seems like my options are --as-parquetfile or --as-avrodatafile, but not a mix of both. From my understanding of this blog/picture below, the way parquet-avro works is that it is a Parquet file with the Avro schema embedded, plus a converter to convert and save an Avro object to a Parquet file and vice versa.
My initial assumption is that if I use the Sqoop option --as-parquetfile, then the data saved to the Parquet file will be missing the Avro schema and the converter won't work. However, looking at the Sqoop code that saves the data in the Parquet file format, it does seem to be using a util related to Avro, but I'm not sure what's going on. Could someone clarify? If I cannot do this with Sqoop, what other options do I have?
parquet-avro is mainly a convenience layer so that you can read and write data stored in Apache Parquet as Avro objects. When you read the Parquet data again with parquet-avro, the Avro schema is inferred from the Parquet schema (alternatively, you should be able to specify an explicit Avro schema). Thus you should be fine with --as-parquetfile.
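If you want to sanity-check the import downstream, one option is to put an external Hive table over the Sqoop target directory; a rough sketch, where the directory and columns are assumptions about your data:
CREATE EXTERNAL TABLE imported_rows (id BIGINT, name STRING)
STORED AS PARQUET
LOCATION '/user/sqoop/target-dir';
SELECT COUNT(*) FROM imported_rows;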

Save and access a table-like data structure in Hadoop

I want to save and access a table-like data structure in HDFS with MapReduce programming. Part of this DS is shown in the following picture. This DS has tens of thousands of columns and hundreds of rows, and all nodes should have access to it.
My question is: how can I save this DS in HDFS and access it with MapReduce programming? Should I use arrays? (Or Hive tables? Or HBase?)
Thank you.
HDFS is a distributed file system which stores your big files across distributed servers.
You can copy your files from local system to HDFS using command
hadoop fs -copyFromLocal /source/local/path destination/hdfs/path
Once the copy has completed, an external Hive table can be created on destination/hdfs/path.
This table can be queried using the Hive shell.
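For example, a hedged sketch of such an external table and a query over it; the table name, columns, delimiter, and absolute location are assumptions about your data and where it was copied:
CREATE EXTERNAL TABLE my_ds (row_id STRING, col1 DOUBLE, col2 DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/destination/hdfs/path';
SELECT row_id, AVG(col1) FROM my_ds GROUP BY row_id;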
Do consider Hive for this scenario. If you want to do table-style processing as you would with a SAS dataset, an R dataframe/data.table, or Python pandas, an equivalent operation is almost always possible in SQL. Hive provides a powerful SQL abstraction on top of the MapReduce and Tez engines. If you want to graduate to Spark at some point, you can read Hive tables into DataFrames. As #sumit pointed out, you just need to transfer your data from local to HDFS (using the HDFS copyFromLocal or put command) and define an external Hive table on top of it.
If you want to write some custom MapReduce on this data, access the backing Hive table data (most likely under /user/hive/warehouse). After reading the data from stdin, parse it in the mapper (the separator can be found using describe extended <hive_table>) and emit it in key-value pairs.

Is there a way to access Avro data stored in HBase using Hive to do analysis?

My HBase table has rows that contain both serialized Avro (put there using havrobase) and string data. I know that a Hive table can be mapped to Avro data stored in HDFS to do data analysis, but I was wondering if anyone has tried to map Hive to HBase table(s) that contain Avro data. Basically I need to be able to query both the Avro and non-Avro data stored in HBase, do some analysis, and store the result in a different HBase table. I need the capability to do this as a batch job as well. I don't want to write a Java MapReduce job to do this because we have constantly changing configurations and we need a scripted approach. Any suggestions? Thanks in advance!
You can write an HBase coprocessor to expose the Avro record as regular HBase qualifiers. You can see an implementation of that in Intel's panthera-dot.

Is there a common place to store data schemas in Hadoop?

I've been doing some investigation lately around using Hadoop, Hive, and Pig to do some data transformation. As part of that I've noticed that the schema of data files doesn't seem to be attached to the files at all. The data files are just flat files (unless you use something like a SequenceFile). Each application that wants to work with those files has its own way of representing their schema.
For example, I load a file into HDFS and want to transform it with Pig. In order to work effectively with it I need to specify the schema of the file when I load the data:
EMP = LOAD 'myfile' using PigStorage() as (first_name: chararray, last_name: chararray, deptno: int);
Now, I know that when storing a file using PigStorage, the schema can optionally be written out alongside it, but in order to get a file into Pig in the first place it seems like you need to specify a schema.
If I want to work with the same file in Hive, I need to create a table and specify the schema with that too:
CREATE EXTERNAL TABLE EMP ( first_name string
, last_name string
, empno int)
LOCATION 'myfile';
It seems to me like this is extremely fragile. If the file format changes even slightly then the schema must be manually updated in each application. I'm sure I'm being naive but wouldn't it make sense to store the schema with the data file? That way the data is portable between applications and the barrier to using another tool would be lower since you wouldn't need to re-code the schema for each application.
So the question is: Is there a way to specify the schema of a data file in Hadoop/HDFS or do I need to specify the schema for the data file in each application?
It sounds like you are looking for Apache Avro. With Avro, the schema is embedded in your data, so you can read it without having to worry about schema issues, and it makes schema evolution really easy.
The great thing about Avro is that it is completely integrated into Hadoop and you can use it with a lot of Hadoop sub-projects like Pig and Hive.
For example with Pig you could do:
EMP = LOAD 'myfile.avro' using AvroStorage();
I would advise looking at the documentation for AvroStorage for more details.
You can also work with Avro in Hive as described here. I have not used that personally, but it should work much the same way.
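As a rough illustration of the Hive side (the table location is hypothetical; with Hive 0.14+ the STORED AS AVRO shorthand works, while older versions need the explicit AvroSerDe ROW FORMAT syntax):
CREATE EXTERNAL TABLE EMP_AVRO (first_name STRING, last_name STRING, deptno INT)
STORED AS AVRO
LOCATION '/user/me/avro_data';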
What you need is HCatalog, which is described as "a table and storage management service for data created using Apache Hadoop. This includes: providing a shared schema and data type mechanism; providing a table abstraction so that users need not be concerned with where or how their data is stored; and providing interoperability across data processing tools such as Pig, MapReduce, and Hive."
You can take a look at the "data flow example" in the docs to see exactly the scenario you are talking about.
Apache Zebra seems to be a tool that could provide a common schema definition across MapReduce, Pig, and Hive. It has its own schema store. A MapReduce job can use its built-in TableStore to write to HDFS.
