Avro Dynamic schema change on Hive - hadoop

I have some data arriving in Avro format (v1) that is stored in HDFS under a partition dt=yyyymmdd.
Now the data is maintained in two versions, v1 and v2, under the same partition.
Is it feasible to maintain a single Hive table for the two different versions?

Avro defines a schema evolution protocol.
If v2 has simply added a field with a default value, for example, then once you update the table with that schema it can read all of the old data; it will simply return the default value wherever the field is missing.
If you've broken compatibility, you must make a separate table and then union the two to get a consistent result set.
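To make the "default values where they are missing" behaviour concrete, here is a minimal sketch that imitates the idea with plain Python dicts. It is not the Avro library and it skips most of the real resolution rules (type promotion, aliases, unions); the field names, the MISSING sentinel and the resolve helper are all hypothetical.

```python
# Simplified illustration of Avro-style schema resolution using plain dicts.
# This is NOT the Avro library; all names here are made up for the example.

MISSING = object()  # sentinel: distinguishes "no default" from a default of None

# Reader (v2) schema: field name -> default value, or MISSING if it has none.
V2_READER_FIELDS = {
    "id": MISSING,
    "name": MISSING,
    "country": "unknown",  # field added in v2, with a default
}


def resolve(record, reader_fields):
    """Project a decoded record onto the reader schema.

    Fields the reader does not know are dropped; fields absent from the
    record are filled from the reader's default, and a missing field with
    no default is treated as a compatibility break.
    """
    resolved = {}
    for field, default in reader_fields.items():
        if field in record:
            resolved[field] = record[field]
        elif default is not MISSING:
            resolved[field] = default
        else:
            raise ValueError(f"incompatible: no value or default for {field!r}")
    return resolved


v1_row = {"id": 1, "name": "a"}                   # written with the old schema
v2_row = {"id": 2, "name": "b", "country": "DE"}  # written with the new schema
rows = [resolve(row, V2_READER_FIELDS) for row in (v1_row, v2_row)]
```

Both rows come back with the same three fields, which is what lets a single table definition sit on top of mixed v1/v2 files. A record lacking a field that has no default raises an error here, which corresponds to the "broken compatibility" case where separate tables are needed.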

Related

Adding dynamic records in parquet format

I'm working on building a data lake and stuck on a very trivial thing. I'll be using Hadoop/HDFS as our data lake infrastructure and storing records in parquet format. The data will come from a Kafka queue which sends a json record every time. The keys in the json record could vary message to message. For example in the first message keys could be 'a', 'b' and in the second message keys could be 'c', 'd'.
I was using pyarrow to store files in Parquet format, but as I understand it we have to predefine the schema. So when I try to write the second message, it will throw an error saying that keys 'c' and 'd' are not defined in the schema.
Could someone guide me on how to proceed with this? Any library other than pyarrow also works, as long as it has this functionality.
Parquet supports Map types for cases where fields are unknown ahead of time. If some of the fields are known, you can define more concrete types for those, possibly making them nullable; however, you cannot mix named fields with a map on the same level of the record structure.
I've not used pyarrow, but I'd suggest using Spark Structured Streaming and defining a schema there, especially when consuming from Kafka. Spark's default output writer to HDFS uses Parquet.
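The map-type suggestion can be sketched as a small normalisation step that runs before any Parquet writer sees the record. This is only an illustration with the standard library: the known column names (event_id, ts), the attributes column and the to_row helper are assumptions, and how the map column is declared in a particular writer is left out.

```python
import json

# Hypothetical columns that are known ahead of time (nullable when absent).
KNOWN_FIELDS = ("event_id", "ts")


def to_row(message, known_fields=KNOWN_FIELDS):
    """Turn one JSON message into a row with a fixed set of top-level keys."""
    payload = json.loads(message)
    # Known fields become their own columns; a missing one is stored as None.
    row = {name: payload.pop(name, None) for name in known_fields}
    # All remaining keys go into a single map-like column (string -> string),
    # so the varying keys never appear as top-level fields.
    row["attributes"] = {key: json.dumps(value) for key, value in payload.items()}
    return row


first = to_row('{"event_id": 1, "a": 10, "b": "x"}')
second = to_row('{"event_id": 2, "c": [1, 2], "d": null}')
```

Every row now has the same top-level shape regardless of which keys the message carried, so one predefined schema covers both messages. Values in the map are kept as JSON strings here so that the map has a single value type; that is a design choice for the sketch, not a requirement stated in the answer.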

Can I keep data of different file formats in same hive table?

I am receiving data in formats like CSV, XML and JSON, and I want to keep all the files in the same Hive table. Is that achievable?
Hive expects all the files for one table to use the same delimiter, the same compression, and so on. So you cannot put a single Hive table on top of files in multiple formats.
One solution you may want to use is:
Create a separate table for each of the file formats (JSON, XML, CSV).
Create a view over the UNION of the three tables created above.
This way the consumer of the data has to query only one view/object, if that's what you are looking for.
Yes, you can achieve this through a combination of different external tables.
Because each file type needs a different SerDe, with its own specification of how to read columns from the files, you will need to create one external table per file type. The data from these external tables can then be combined into a view with UNION, as suggested by Ramesh. The view can then be used for reading from them, and you could, for example, insert the data into a managed table.

Schema verification/validation before loading data into HDFS/Hive

I am new to the Hadoop ecosystem and need some suggestions from big data experts on performing schema verification/validation before loading a huge dataset into HDFS.
The scenario is:
I have a huge dataset with a given schema (around 200 column
headers). This dataset is going to be stored in Hive
tables/HDFS. Before loading the data into the Hive table/HDFS, I want to
perform schema-level verification/validation on the supplied data to
avoid any unwanted errors/exceptions while loading it into HDFS.
For example, if somebody passes a data file with fewer or more
columns than expected, the load should fail at the first level of
verification.
What could be the best possible approach for achieving the same?
Regards,
Bhupesh
Since you have files, you can add them to HDFS and run MapReduce on top of them. There you have access to each row, so you can verify the number of columns, their types, and perform any other validation.
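The per-row check described above is simple enough to sketch outside of MapReduce. The following is a plain single-process illustration using Python's csv module, not a MapReduce job; the function name, the three-column sample and the expected column count are assumptions for the example.

```python
import csv
import io


def find_bad_rows(lines, expected_columns, delimiter=","):
    """Return (row_number, column_count) for every row with the wrong width."""
    bad = []
    reader = csv.reader(lines, delimiter=delimiter)
    for row_number, row in enumerate(reader, start=1):
        if len(row) != expected_columns:
            bad.append((row_number, len(row)))
    return bad


# Tiny in-memory stand-in for a staged file: row 3 is short, row 4 is long.
sample = io.StringIO("id,name,city\n1,alice,paris\n2,bob\n3,carol,rome,extra\n")
problems = find_bad_rows(sample, expected_columns=3)

# Fail the load early if any row does not match the agreed schema width.
load_allowed = not problems
```

In a real pipeline the same test would run inside each mapper (or any other distributed job) over its share of the rows, and type checks per column could be added alongside the width check.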
When I referred to JSON/XML: there is a slight overhead in making MapReduce identify the records in those formats. With respect to validation, however, there is schema validation that you can enforce, and a schema also lets you restrict a field to specific values. So once the schema is ready, do your parsing (XML to Java) and then store the results at another, final HDFS location for further use (like HBase). When you are sure that the data is validated, you can create Hive tables on top of it.
Use the utility below to create temporary tables each time, based on the schema you receive in CSV file format in the staging directory, and then apply some conditions to identify whether the columns are valid. Finally, load the data into the original table.
https://github.com/enahwe/Csv2Hive

Questions about migration, data model and performance of CDH/Impala

I have some questions about migration, data model and performance of Hadoop/Impala.
1. How to migrate an Oracle application to Cloudera Hadoop/Impala
1.1 How to replace Oracle stored procedures with Impala, M/R, or a Java/Python app?
For example, the original SP includes several parameters and SQL statements.
1.2 How to replace unsupported or complex SQL, such as OVER (PARTITION BY ...) clauses, when moving from Oracle to Impala?
Are there any existing examples or Impala UDFs?
1.3 How to handle update operations, since part of the data has to be updated?
For example: use a data timestamp? Use a storage model that supports updates, like HBase? Or delete all the data/partition/directory and insert it again (INSERT OVERWRITE)?
2. Data storage model, partition design and query performance
2.1 How to choose between an Impala internal table and an external table (CSV, Parquet, HBase)?
For example, if there are several kinds of data, such as existing large data imported from Oracle into Hadoop, new business data arriving in Hadoop, data computed in Hadoop, and frequently updated data in Hadoop, how should we choose the data model? Is special attention needed if the different kinds of data have to be joined?
We have XX TB of data from Oracle; do you have any suggestions about the file format, such as CSV or Parquet? After calculation, do we need to import the results into an Impala internal table or into HDFS? If those kinds of data can be updated, how should we account for that?
2.2 How to partition the table/external table when joining?
For example, there is a huge amount of sensor data, and each record includes measurement data, an acquisition timestamp and region information.
We need to:
Calculate measurement data by region.
Query a series of measurement data during a certain time interval for a specific sensor or region.
Query a specific sensor's data from the huge amount of data across all time.
Query data for all sensors on a specific date.
Could you please give us some suggestions on how to set up the partitions for internal tables and the directory structure for external tables (CSV)?
In addition, for the directory structure, which is better: date=20090101/area=BEIJING or year=2009/month=01/day=01/area=BEIJING? Is there any guide on that?

Is it possible to retrieve schema from avro data and use them in MapReduce?

I've used avro-tools to convert my Avro schema into a Java class, which I pass into Avro-Map-Input-Key-Schema for data processing. This is all working fine.
But recently I had to add a new column to the Avro schema and recompile the Java class.
This is where I encountered a problem: my previously generated data was serialized with the old schema, so my MapReduce jobs now fail after the schema change, even though my MapReduce logic isn't using the new column.
Therefore, I was wondering whether I could stop passing in the Java schema class and instead retrieve the schema from the data and process the data dynamically. Is this possible?
I assume it isn't!
Yea, there's not. But you can read it as a GenericRecord and then map the fields to your updated type object. I go through this at a high level here.
It is possible to read existing data with an updated schema. Avro will always read a file using the schema from its header, but if you also supply an expected schema (or "read schema") then Avro will create records that conform to that requested schema. That ends up skipping fields that aren't requested or filling in defaults for fields that are missing from the file.
In this case, you want to set the read schema and data model for your MapReduce job like this:
AvroJob.setInputSchema(job, MyRecord.getClassSchema());
AvroJob.setDataModelClass(job, SpecificData.class);
