Is it possible to retrieve schema from avro data and use them in MapReduce? - hadoop

I've used avro-tools to convert my Avro schema into a Java class, which I then pass in as the Avro map input key schema for data processing. This all works fine.
But recently I had to add a new column to the Avro schema and recompile the Java class.
This is where I ran into a problem: my previously generated data was serialized with the old schema, so my MapReduce jobs now fail after modifying the schema, even though my MapReduce logic doesn't use the new column.
Therefore, I was wondering whether I could stop passing in the generated Java schema class, retrieve the schema from the data itself, and process the data dynamically. Is this possible?
I assume it isn't!

Yeah, there isn't. But you can read the data as a GenericRecord and then map the fields onto your updated type object. I go through this at a high level here.
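A rough sketch of that GenericRecord-to-specific mapping inside a mapper, assuming the newer mapreduce API; the class, field names, and MyRecord setters are placeholders, not your actual code:

import java.io.IOException;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: the input is read generically, then copied field by field into
// the regenerated specific class (MyRecord and its fields are placeholders).
public class GenericToSpecificMapper
        extends Mapper<AvroKey<GenericRecord>, NullWritable, AvroKey<MyRecord>, NullWritable> {

    @Override
    protected void map(AvroKey<GenericRecord> key, NullWritable value, Context context)
            throws IOException, InterruptedException {
        GenericRecord in = key.datum();

        MyRecord out = new MyRecord();
        out.setId((Long) in.get("id"));  // field present in both the old and new schema
        out.setName(in.get("name") == null ? null : in.get("name").toString());
        // Old files won't contain the new column, so check the record's own schema first.
        Object extra = in.getSchema().getField("newColumn") != null ? in.get("newColumn") : null;
        out.setNewColumn(extra == null ? null : extra.toString());

        context.write(new AvroKey<>(out), NullWritable.get());
    }
}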

It is possible to read existing data with an updated schema. Avro will always read a file using the schema from its header, but if you also supply an expected schema (or "read schema") then Avro will create records that conform to that requested schema. That ends up skipping fields that aren't requested or filling in defaults for fields that are missing from the file.
In this case, you want to set the read schema and data model for your MapReduce job like this:
AvroJob.setInputSchema(job, MyRecord.getClassSchema()); // expected (reader) schema from the regenerated class
AvroJob.setDataModelClass(job, SpecificData.class);     // produce specific records rather than generic ones
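The same resolution also works outside MapReduce with a plain file reader: the file header supplies the writer schema and the regenerated class supplies the reader schema (the added column needs a default value in the new schema for old files to resolve). A minimal sketch, with MyRecord and the input path as placeholders:

import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.specific.SpecificDatumReader;

public class ReadWithNewSchema {
    public static void main(String[] args) throws IOException {
        // Writer schema comes from the file header; reader schema from the regenerated class.
        SpecificDatumReader<MyRecord> datumReader = new SpecificDatumReader<>(MyRecord.class);
        try (DataFileReader<MyRecord> fileReader =
                 new DataFileReader<>(new File(args[0]), datumReader)) {
            for (MyRecord record : fileReader) {
                System.out.println(record);  // old files show the default for the new column
            }
        }
    }
}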

Related

Adding dynamic records in parquet format

I'm working on building a data lake and am stuck on a very trivial thing. I'll be using Hadoop/HDFS as our data lake infrastructure and storing records in Parquet format. The data will come from a Kafka queue, which sends a JSON record each time. The keys in the JSON record can vary from message to message. For example, in the first message the keys could be 'a' and 'b', and in the second message they could be 'c' and 'd'.
I was using pyarrow to store files in Parquet format, but as I understand it, the schema has to be predefined. So when I try to write the second message, it throws an error saying that keys 'c' and 'd' are not defined in the schema.
Could someone guide me on how to proceed with this? Libraries other than pyarrow are also fine, as long as they offer this functionality.
Parquet supports map types for cases where fields are unknown ahead of time. Or, if some of the fields are known, define more concrete types for those, possibly making them nullable; however, you cannot mix named fields with a map at the same level of the record structure.
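The question uses pyarrow, but to make the map idea concrete in the same language as the rest of this page, here is roughly what such a record could look like declared with Avro's SchemaBuilder (parquet-avro writes the map field as a Parquet MAP); the record and field names are invented:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class EventSchema {
    // One known, nullable field plus a string-to-string map that can absorb whatever
    // keys ('a', 'b', 'c', 'd', ...) a given message happens to carry.
    public static final Schema EVENT = SchemaBuilder.record("Event")
        .fields()
        .optionalString("eventId")
        .name("attributes").type().map().values().stringType().noDefault()
        .endRecord();
}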
I've not used pyarrow, but I'd suggest using Spark Structured Streaming and defining a schema there, especially when consuming from Kafka. Spark's default output writer to HDFS uses Parquet.
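A rough Java sketch of that approach, consuming JSON from Kafka with an explicit schema and writing Parquet to HDFS; the broker, topic, paths, and column list are all placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

public class KafkaJsonToParquet {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("KafkaJsonToParquet").getOrCreate();

        // Declare every key you may ever see; keys absent from a message come out as nulls.
        StructType schema = new StructType()
            .add("a", DataTypes.StringType)
            .add("b", DataTypes.StringType)
            .add("c", DataTypes.StringType)
            .add("d", DataTypes.StringType);

        Dataset<Row> parsed = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")  // placeholder broker
            .option("subscribe", "events")                      // placeholder topic
            .load()
            .selectExpr("CAST(value AS STRING) AS json")
            .select(from_json(col("json"), schema).alias("data"))
            .select("data.*");

        StreamingQuery query = parsed.writeStream()
            .format("parquet")
            .option("path", "hdfs:///datalake/events")               // placeholder path
            .option("checkpointLocation", "hdfs:///checkpoints/events")
            .start();
        query.awaitTermination();
    }
}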

How to do data transformation using Apache NiFi standard processors?

I have to do data transformation using Apache NiFi standard processors for the input data mentioned below. I have to add two new fields, class and year, and drop the extra price fields.
Below are my input data and the expected output (samples not reproduced here).
Disclaimer: I am assuming that your input headers are not dynamic, which means that you can maintain a predictable input schema. If that is true, you can do this with the standard processors as of 1.12.0, but it will require a little work.
Here's a blog post of mine about how to use ScriptedTransformRecord to take input from one schema, build a new data structure and mix it with another schema. It's a bit involved.
I've used that methodology recently to convert a much larger set of data into summary records, so I know it works. The summary of what's involved is this:
Create two schemas, one that matches input and one for output.
Set up ScriptedTransformRecord to use a writer that explicitly sets which schema to use, since ScriptedTransformRecord can't change the schema configuration internally.
Create a fat jar with Maven or Gradle that compiles your Avro schema into an object that can be used with the NiFi API to expose a static RecordSchema (NiFi API) to your script.
Write a Groovy script that generates a new MapRecord (sketched below).
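For that last step, the script essentially builds a MapRecord against the output schema. A rough sketch of the record-building part, written as Java for consistency with the rest of this page (the NiFi record API is the same one you would call from Groovy); the schema and field values are invented:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.nifi.serialization.SimpleRecordSchema;
import org.apache.nifi.serialization.record.MapRecord;
import org.apache.nifi.serialization.record.Record;
import org.apache.nifi.serialization.record.RecordField;
import org.apache.nifi.serialization.record.RecordFieldType;
import org.apache.nifi.serialization.record.RecordSchema;

public class TransformSketch {
    // Output schema with the two added fields; in practice this would come from the
    // compiled Avro object mentioned in the steps above.
    static final RecordSchema OUTPUT_SCHEMA = new SimpleRecordSchema(Arrays.asList(
        new RecordField("name", RecordFieldType.STRING.getDataType()),
        new RecordField("class", RecordFieldType.STRING.getDataType()),
        new RecordField("year", RecordFieldType.INT.getDataType())));

    static Record transform(Record input) {
        Map<String, Object> values = new HashMap<>();
        values.put("name", input.getValue("name"));  // carried over from the input record
        values.put("class", "A");                    // new field
        values.put("year", 2021);                    // new field
        // The price fields are simply not copied, which drops them from the output.
        return new MapRecord(OUTPUT_SCHEMA, values);
    }
}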

avro data validation with the schema

I'm new to data validation and related concepts, so please excuse me if this is a simple question. Please help me with the steps to achieve this.
Use case: Validating AVRO file (Structure and Data)
Inputs:
We are going to receive AVRO files
We will have a schema file in a notepad file (e.g. field name, data type, size, etc.)
Validation:
Need to validate the AVRO file against the structure (schema, i.e. field, data type, size, etc.)
Need to validate number and decimal formats when viewing from Hive
So far, all I've been able to achieve is getting the schema from the Avro file using the Avro jar.
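Once the schema is out of the file, one way to do the structural check is to compare it against an expected schema kept as a .avsc file (rather than plain notepad text). A rough sketch, assuming Avro's SchemaCompatibility check is acceptable for the comparison; data-level checks such as number and decimal formats would still need a pass over the records themselves or a query from Hive:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

public class AvroStructureCheck {
    public static void main(String[] args) throws Exception {
        // args[0] = expected schema (.avsc), args[1] = Avro data file to check
        Schema expected = new Schema.Parser().parse(new File(args[0]));

        DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try (DataFileReader<GenericRecord> reader = new DataFileReader<>(new File(args[1]), datumReader)) {
            Schema actual = reader.getSchema();  // schema embedded in the Avro file header

            SchemaCompatibility.SchemaPairCompatibility result =
                SchemaCompatibility.checkReaderWriterCompatibility(expected, actual);
            System.out.println("Exact match          : " + expected.equals(actual));
            System.out.println("Readable as expected : " + result.getType());
        }
    }
}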

Schema verification/validation before loading data into HDFS/Hive

I am a newbie to the Hadoop ecosystem and I need some suggestions from Big Data experts on achieving schema verification/validation before loading huge datasets into HDFS.
The scenario is:
I have a huge dataset with a given schema (around 200 column headers). This dataset is going to be stored in Hive tables / HDFS. Before loading the data into the Hive table / HDFS, I want to perform schema-level verification/validation on the supplied data to avoid any unwanted errors/exceptions while loading it into HDFS. For example, if somebody tries to pass a data file with fewer or more columns than expected, the load should fail at this first level of verification.
What could be the best possible approach for achieving the same?
Regards,
Bhupesh
Since you have files, you can add them to HDFS and run MapReduce on top of them. There you get a handle on each row, so you can verify the number of columns, their types, and any other validations.
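A rough sketch of that first-level check: a mapper counts delimited columns per row and tracks rejects with a counter; the delimiter, column count, and class names are assumptions, and a driver could fail the load if BAD_ROWS is non-zero:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ColumnCountMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private static final int EXPECTED_COLUMNS = 200;  // from the question's schema
    private static final String DELIMITER = ",";      // placeholder delimiter

    enum Quality { GOOD_ROWS, BAD_ROWS }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] columns = value.toString().split(DELIMITER, -1);
        if (columns.length == EXPECTED_COLUMNS) {
            context.getCounter(Quality.GOOD_ROWS).increment(1);
            context.write(NullWritable.get(), value);           // pass valid rows through
        } else {
            context.getCounter(Quality.BAD_ROWS).increment(1);  // divert or reject bad rows
        }
    }
}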
Regarding JSON/XML, there is a slight overhead in making MapReduce identify records in those formats. However, with respect to validation, there is schema validation you can enforce, and you can also restrict a field to specific values using the schema. Once the schema is ready, you parse the records (e.g. XML to Java), validate them, and then store them at another, final HDFS location for further use (like HBase). When you are sure the data is validated, you can create Hive tables on top of it.
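For XML specifically, the schema validation mentioned here would usually be XSD validation, which in Java can be as small as the following sketch (the file paths are placeholders):

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class XmlSchemaCheck {
    public static void main(String[] args) throws Exception {
        // args[0] = XSD schema file, args[1] = XML record file to validate
        Validator validator = SchemaFactory
            .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new File(args[0]))
            .newValidator();
        validator.validate(new StreamSource(new File(args[1])));  // throws SAXException on violation
        System.out.println("valid");
    }
}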
Use the utility below to create temp tables each time, based on the schema of the CSV files you receive in the staging directory, and then apply some conditions to identify whether you have valid columns or not. Finally, load into the original table.
https://github.com/enahwe/Csv2Hive

MultipleInputs with DBInputFormat in Hadoop

In my database I have multiple tables, where each table is a different entity type. I have an Avro schema that I use in Hadoop which is a union of all the fields of these different entity types, plus an entity type field.
What I would like to do is something along the lines of setting up a DBInputFormat with a DBWritable for each entity type that maps the entity type to the combined Avro type. Then give each DBInputFormat to something like MultipleInputs so that I can create a composite input format. The composite input format could then be given to my map reduce job so that all of the data from all the tables could be processed at once by the same mapper class.
Data is constantly added to these database tables so I need to be able to configure the DBInputFormat for each entity type/dbtable to only grab the new data and to do the splits properly.
Basically I need the functionality of DBInputFormat or DataDrivenDBInputFormat but also the ability to make a composite of them similar to what you can do with paths and MultipleInputs.
Create a view over the N input tables and set the view in DBInputFormat#setInput. According to the Cloudera article (quoted below), I guess data should not be updated in the tables until the job completes:
Hadoop may need to execute the same query multiple times. It will need to return the same results each time. So any concurrent updates to your database, etc, should not affect the query being run by your MapReduce job. This can be accomplished by disallowing writes to the table while your MapReduce job runs, restricting your MapReduce’s query via a clause such as “insert_date < yesterday,” or dumping the data to a temporary table in the database before launching your MapReduce process.
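A rough sketch of the view approach; all_entities is a hypothetical view that UNIONs the N tables into the combined layout (one column per field plus an entity_type column), and CombinedEntityWritable is a hypothetical DBWritable that maps rows onto the combined Avro type:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;

public class CombinedEntityDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DBConfiguration.configureDB(conf,
            "com.mysql.jdbc.Driver",                          // placeholder JDBC driver
            "jdbc:mysql://dbhost/mydb", "user", "password");  // placeholder connection details

        Job job = Job.getInstance(conf, "combined-entities");
        job.setJarByClass(CombinedEntityDriver.class);
        job.setInputFormatClass(DBInputFormat.class);

        // Restricting the query (e.g. by insert date) keeps the input stable while the job runs.
        DBInputFormat.setInput(job, CombinedEntityWritable.class,
            "all_entities",                                // the view over all tables
            "insert_date < CURRENT_DATE",                  // example predicate for "new data only"
            "id",                                          // orderBy column, used for splits
            "entity_type", "id", "name", "insert_date");   // example field names
        // ... set mapper/output classes, then job.waitForCompletion(true)
    }
}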
Evaluate frameworks that support real-time processing, like Storm, HStreaming, S4, and StreamBase. Some of these sit on top of Hadoop and some don't; some are FOSS and some are commercial.
