Avro data validation against a schema

I'm new to data validation and related concepts, so please excuse me if this is a simple question; please help me with the steps to achieve it.
Use case: validating an Avro file (structure and data)
Inputs:
We will receive Avro files.
We will have a schema file in a plain-text document (e.g. field name, data type, size, etc.).
Validation:
Validate the Avro file against the structure (the schema, i.e. fields, data types, sizes, etc.).
Validate number and decimal formats when viewing the data from Hive.
So far, all I have been able to do is extract the schema from the Avro file using the Avro jar.
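Not an authoritative answer, but a minimal sketch in Java of the kind of structural check described above, assuming the expected schema from the notepad spec has been written out as an Avro .avsc file; the file names data.avro and expected.avsc are placeholders, and the SchemaCompatibility helper assumes a reasonably recent Avro version:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroStructureCheck {
    public static void main(String[] args) throws Exception {
        // Schema embedded in the Avro data file we received.
        DataFileReader<GenericRecord> reader =
                new DataFileReader<>(new File("data.avro"), new GenericDatumReader<GenericRecord>());
        Schema actual = reader.getSchema();

        // Expected schema, written out from the notepad spec as an .avsc file.
        Schema expected = new Schema.Parser().parse(new File("expected.avsc"));

        // Field-by-field structural comparison: field names and types.
        for (Schema.Field field : expected.getFields()) {
            Schema.Field found = actual.getField(field.name());
            if (found == null) {
                System.out.println("Missing field: " + field.name());
            } else if (!found.schema().equals(field.schema())) {
                System.out.println("Type mismatch for " + field.name()
                        + ": expected " + field.schema() + ", got " + found.schema());
            }
        }

        // Avro's own check: can the expected schema read data written with the actual one?
        SchemaCompatibility.SchemaPairCompatibility result =
                SchemaCompatibility.checkReaderWriterCompatibility(expected, actual);
        System.out.println("Compatibility: " + result.getType());

        reader.close();
    }
}

Note that Avro schemas do not carry size constraints (apart from the fixed type), so checks like maximum string length would still need a pass over the records themselves, e.g. after reading them back through Hive.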

Related

Adding dynamic records in parquet format

I'm working on building a data lake and am stuck on a very trivial thing. I'll be using Hadoop/HDFS as the data lake infrastructure and storing records in Parquet format. The data will come from a Kafka queue, which sends a JSON record each time. The keys in the JSON record can vary from message to message. For example, in the first message the keys could be 'a' and 'b', and in the second message the keys could be 'c' and 'd'.
I was using pyarrow to store files in Parquet format, but as I understand it, we have to predefine the schema. So when I try to write the second message, it throws an error saying that keys 'c' and 'd' are not defined in the schema.
Could someone guide me on how to proceed with this? Any library other than pyarrow also works, as long as it provides this functionality.
Parquet supports map types for cases where the fields are unknown ahead of time. Or, if some of the fields are known, define more concrete types for those, possibly making them nullable; however, you cannot mix named fields with a map at the same level of the record structure.
I've not used pyarrow, but I'd suggest using Spark Structured Streaming and defining a schema there, especially when consuming from Kafka. Spark's default output writer to HDFS uses Parquet.
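To make that suggestion concrete, here is a rough sketch (not the asker's setup) of a Spark Structured Streaming job in Java that reads the Kafka topic and parses each JSON payload into a string-to-string map, so messages with different keys all fit one schema. The broker address, topic name, and HDFS paths are placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

public class KafkaJsonToParquet {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-json-to-parquet")
                .getOrCreate();

        // Raw Kafka records; the payload arrives in the binary "value" column.
        Dataset<Row> raw = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
                .option("subscribe", "events")                     // placeholder topic
                .load();

        // Parse each JSON payload into a string-to-string map, so messages with
        // different keys ('a','b' vs 'c','d') all fit the same column schema.
        Dataset<Row> parsed = raw
                .selectExpr("CAST(value AS STRING) AS json")
                .select(from_json(col("json"),
                        DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType))
                        .alias("fields"));

        // The file sink writes Parquet; the path and checkpoint location are placeholders.
        StreamingQuery query = parsed.writeStream()
                .format("parquet")
                .option("path", "hdfs:///datalake/events")
                .option("checkpointLocation", "hdfs:///checkpoints/events")
                .start();

        query.awaitTermination();
    }
}

If some keys are in fact known up front, those could instead be declared as concrete nullable columns in a StructType passed to from_json, keeping only the truly dynamic remainder in a map, per the first answer above.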

Azure Data Factory Converting Source Data Type to a Different Format

I am using Azure Data Factory to copy data from an Oracle database to an ADLS Gen2 container.
In the Copy activity, I added the Oracle DB as the source and ADLS as the sink.
I want to create a Parquet file in the sink.
When I click on Mapping, I can see that a data type which is NUMBER in the source is getting converted to Double by ADF.
Also, the Date type in the source is converted to DateTime by ADF.
Because of this I am not able to load the data correctly.
I even tried typecasting in the source query to convert it into the same format as the source, but ADF still converts it to Double.
For example, the ID column is NUMBER in the Oracle DB, but ADF treats it as Double and appends .0 to the data, which is not what I need.
Even after typecasting it to NUMBER, it does not show the correct type.
What could be the root cause of this issue, and why is the source data type not shown in the correct format?
Because of this, the Parquet file I am creating is not correct, and my Synapse table (the end destination) is not able to load the data, since in Synapse I have defined the ID column as Int.
Ideally, ADF should show the same data type as the source.
Please let me know if you have any solution or suggestions for me to try.
Thanks!
I am not an Oracle user, but as I understand it the NUMBER data type is generic and can be either integer- or decimal-based. Parquet does not have this concept, so when ADF converts the data it basically has to use a decimal type (such as Double) to prevent loss of data. If you really want the data to be an integer, then you'll need to use a Data Flow (instead of the Copy activity) to cast the incoming values to an integer column.

NiFi avro schema using regex to validate a string

I have an Avro schema in NiFi which validates the columns of a CSV file, and all is working well. However, I'd ideally like an extra level of validation on certain string columns to test that they adhere to specific patterns, for example ABC1234-X, or whatever. Here's the wrinkle, though: the Avro schema is generated for specific expected files, so the NiFi processors need to be generic.
Is there a way to have the Avro schema do this?

Is it possible to retrieve the schema from Avro data and use it in MapReduce?

I've used avro-tools to convert my Avro schema into a Java class, which I pass in as the Avro map input key schema for data processing. This is all working fine.
But recently I had to add a new column to the Avro schema and recompile the Java class.
This is where I encountered a problem: my previously generated data was serialized with the old schema, so my MapReduce jobs are now failing after the schema change, even though my MapReduce logic doesn't use the new column.
Therefore, I was wondering whether I could stop passing in the Java schema class and instead retrieve the schema from the data and process it dynamically. Is this possible?
I assume it isn't!
Yeah, there's no way to do that directly, but you can read the data as a GenericRecord and then map the fields onto your updated type object. I go through this at a high level here.
It is possible to read existing data with an updated schema. Avro will always read a file using the schema from its header, but if you also supply an expected schema (or "read schema") then Avro will create records that conform to that requested schema. That ends up skipping fields that aren't requested or filling in defaults for fields that are missing from the file.
In this case, you want to set the read schema and data model for your MapReduce job like this:
AvroJob.setInputSchema(job, MyRecord.getClassSchema());
AvroJob.setDataModelClass(job, SpecificData.class);
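Outside of MapReduce, the same resolution can be seen with Avro's generic API. A small sketch, assuming the updated schema lives in a hypothetical my_record_v2.avsc file (any newly added column needs a default value for old files to resolve against it):

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadWithUpdatedSchema {
    public static void main(String[] args) throws Exception {
        // The updated ("read") schema; the newly added column should declare a default.
        Schema readSchema = new Schema.Parser().parse(new File("my_record_v2.avsc"));

        // The writer schema is taken from the data file's header automatically;
        // the records handed back conform to readSchema, with defaults filled in.
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>(readSchema);
        try (DataFileReader<GenericRecord> fileReader =
                     new DataFileReader<>(new File("old_data.avro"), datumReader)) {
            for (GenericRecord record : fileReader) {
                System.out.println(record.get("some_field"));  // hypothetical field name
            }
        }
    }
}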

How does Hive store data, and what is a SerDe?

"When querying a table, a SerDe will deserialize a row of data from the bytes in the file to objects used internally by Hive to operate on that row of data. When performing an INSERT or CTAS (see “Importing Data” on page 441), the table's SerDe will serialize Hive's internal representation of a row of data into the bytes that are written to the output file."
Is a SerDe a library?
How does Hive store data, i.e. does it store it in files or in tables?
Can anyone please explain the quoted sentences clearly?
I'm new to Hive!
Answers
Yes, SerDe is a library interface built into Hive (the org.apache.hadoop.hive.serde2 package).
Hive uses file systems like HDFS or other storage (e.g. FTP) to store data; the data itself is presented in the form of tables (which have rows and columns).
SerDe (Serializer/Deserializer) instructs Hive on how to process a record (row). Hive can also process semi-structured records (XML, email, etc.) or unstructured records (audio, video, etc.). For example, if you have 1000 GB worth of RSS feeds (RSS XMLs), you can ingest them into a location in HDFS. You would then need to write a custom SerDe based on your XML structure so that Hive knows how to load the XML files into Hive tables, or the other way around.
For more information on how to write a SerDe, read this post.
In this respect we can see Hive as a kind of database engine. This engine works on tables, which are built from records.
When we let Hive (like any other database) work with its own internal formats, we do not need to care about any of this.
When we want Hive to process our own files as tables (external tables), we have to let it know how to translate the data in the files into records. This is exactly the role of the SerDe. You can see it as a plug-in which enables Hive to read and write your data.
For example, say you want to work with CSV. Here is an example CSV SerDe:
https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java
The serialize method takes a record Hive has been working on and formats it as a CSV line to be written out.
The deserialize method reads a line of data and chops it into fields, assuming it is CSV.
Hive can analyse semi-structured and unstructured data as well by using:
(1) complex data types (struct, array, union)
(2) a SerDe
The SerDe interface allows us to instruct Hive how a record should be processed. The Serializer takes a Java object that Hive has been working on and converts it into something that Hive can store, and the Deserializer takes the binary representation of a record and translates it into a Java object that Hive can manipulate.
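To make the interface concrete, here is a toy sketch of a custom SerDe (not the CSVSerde linked above): every column is treated as a string and rows are stored as pipe-delimited text. The class name is made up, and it assumes the two-argument AbstractSerDe.initialize signature found in common Hive releases:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

/** Toy SerDe: every column is a string, rows are pipe-delimited text. */
public class PipeDelimitedSerDe extends AbstractSerDe {

    private List<String> columnNames;
    private StructObjectInspector rowInspector;
    private final Text serialized = new Text();

    @Override
    public void initialize(Configuration conf, Properties tableProps) throws SerDeException {
        // Hive passes the table's column list via the "columns" table property.
        columnNames = Arrays.asList(tableProps.getProperty("columns").split(","));
        List<ObjectInspector> columnInspectors = new ArrayList<>();
        for (int i = 0; i < columnNames.size(); i++) {
            columnInspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        }
        rowInspector = ObjectInspectorFactory
                .getStandardStructObjectInspector(columnNames, columnInspectors);
    }

    @Override
    public Object deserialize(Writable blob) throws SerDeException {
        // Read path: turn the stored bytes into the row object Hive operates on.
        return Arrays.asList(blob.toString().split("\\|", -1));
    }

    @Override
    public Writable serialize(Object row, ObjectInspector inspector) throws SerDeException {
        // Write path: turn Hive's in-memory row back into bytes for the output file.
        StructObjectInspector soi = (StructObjectInspector) inspector;
        List<? extends StructField> fields = soi.getAllStructFieldRefs();
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                line.append('|');
            }
            Object value = soi.getStructFieldData(row, fields.get(i));
            line.append(value == null ? "" : value.toString());
        }
        serialized.set(line.toString());
        return serialized;
    }

    @Override
    public ObjectInspector getObjectInspector() {
        return rowInspector;
    }

    @Override
    public Class<? extends Writable> getSerializedClass() {
        return Text.class;
    }

    @Override
    public SerDeStats getSerDeStats() {
        return null;
    }
}

A table would point at this class with ROW FORMAT SERDE; deserialize then runs on the read path and serialize on the write path, exactly as described above.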
To keep the directions straight: serialisation happens on write, when the structured data is serialised into a bit/byte stream for storage; on read, the data is deserialised from the bit/byte storage format into the structure required by the reader. Hive, for example, needs structures that look like rows and columns, but HDFS stores the data in bit/byte blocks, so: serialise on write, deserialise on read.
