Adding dynamic records in parquet format - hadoop

I'm working on building a data lake and am stuck on what seems like a trivial problem. I'll be using Hadoop/HDFS as our data lake infrastructure and storing records in Parquet format. The data will come from a Kafka queue, which sends one JSON record per message. The keys in the JSON record can vary from message to message. For example, the first message might have keys 'a' and 'b', and the second might have keys 'c' and 'd'.
I was using pyarrow to write the Parquet files, but as I understand it, the schema has to be defined up front. So when I try to write the second message, it throws an error saying that keys 'c' and 'd' are not defined in the schema.
Could someone advise on how to proceed? A library other than pyarrow would also work, as long as it supports this.

Parquet supports Map types for cases where the fields are not known ahead of time. If some of the fields are known, you can define more concrete types for those, possibly making them nullable. However, you cannot mix named fields with a map at the same level of the record structure.
I've not used pyarrow, but I'd suggest using Spark Structured Streaming and defining a schema there, especially when consuming from Kafka. Spark's default output writer to HDFS uses Parquet.
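To make the Map idea concrete, here is a minimal sketch in plain Python of how each JSON message could be reshaped before it is written. The field names `event_id` and `timestamp` and the column name `attributes` are assumptions made up for illustration; they are not from the question. Known keys become nullable named columns, and every other key goes into one map column of string to string, which sits under its own name rather than alongside the named fields.

```python
import json

# Hypothetical fields assumed to be known ahead of time.
KNOWN_FIELDS = ("event_id", "timestamp")


def to_fixed_shape(message: str) -> dict:
    """Reshape one JSON message into a row with a fixed set of columns."""
    record = json.loads(message)
    # Known fields become named columns; a missing key becomes None (nullable).
    row = {name: record.get(name) for name in KNOWN_FIELDS}
    # All remaining keys go into a single map column. Map values must share
    # one type, so each value is re-serialized as JSON text.
    row["attributes"] = {
        key: json.dumps(value)
        for key, value in record.items()
        if key not in KNOWN_FIELDS
    }
    return row
```

Every row produced this way has the same three columns regardless of which keys the message carried, so a single predefined schema (two nullable fields plus one map of string to string) should be enough for the writer.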

Related

NIFI: Proper way to consume kafka and store data into hive

I have a task to create a Kafka consumer that extracts messages from Kafka, transforms them, and stores them in a Hive table.
The Kafka topic contains a lot of messages, each a JSON object.
I would like to add some fields and insert the results into Hive.
I created a flow with the following NiFi processors:
ConsumeKafka_2_0
JoltTransformJSON - to transform the JSON
ConvertRecord - to convert the JSON into INSERT queries for Hive
PutHiveQL
The topic will be fairly heavily loaded, handling about 5 GB of data per day.
So, are there any ways to optimize my flow? (I think it's a bad idea to send a huge number of INSERT queries to Hive.) Maybe it would be better to use an external table and the PutHDFS processor. In that case, how should I handle partitions and merge the incoming JSON into one file?
As you suspect, using PutHiveQL to perform a large number of individual INSERTs is not very performant. Using your external table approach will likely be much better. If the table is in ORC format, you could use ConvertAvroToORC (for Hive 1.2) or PutORC (for Hive 3) which both generate Hive DDL to help create the external table.
There are also Hive streaming processors, but if you are using Hive 1.2 PutHiveStreaming is not very performant either (but should still be better than PutHiveQL with INSERTs). For Hive 3, PutHive3Streaming should be much more performant and is my recommended solution.

Avro Dynamic schema change on Hive

I have some data coming in avro format v1 and getting stored in HDFS under a partition dt=yyyymmdd.
Now the data is maintained with two versions, v1 and v2 under the same partition.
Is it feasible to maintain a single hive table for two different versions?
Avro defines a schema evolution protocol.
If v2 has simply added a field with a default value, for example, then once you update the table with that schema, it can read all of the old data; it simply returns the default value wherever the field is missing.
If you've broken compatibility, you must make a separate table, then union the two to get a consistent result set.
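The default-filling behaviour described above can be illustrated with a small plain-Python sketch. This is not the Avro library; it only mimics the resolution rule for flat records, and the schema (`id`, `name`, `country`) is a made-up example.

```python
# Hypothetical v2 reader schema: the v1 fields plus "country" with a default.
V2_FIELDS = [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "country", "type": "string", "default": "unknown"},
]


def resolve(record: dict, reader_fields: list) -> dict:
    """Project a record written with any schema onto the reader schema."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            # Field exists in the written data: keep its value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field was added later: fall back to the declared default.
            resolved[name] = field["default"]
        else:
            # No value and no default: the schemas are incompatible.
            raise ValueError(f"field {name!r} is missing and has no default")
    # Fields present in the record but absent from the reader schema are dropped.
    return resolved
```

A v1 record such as `{"id": 7, "name": "a"}` resolves to a v2 row with `country` set to the default, while a reader field without a default is the kind of compatibility break that calls for the separate-table approach.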

Data storage format for unstructured data rows on HDFS

We are consuming a very large volume of data that needs to be written as fast as we receive it. We are already using HDFS, so we would prefer to keep using it. The data is almost unstructured, and we will rarely run basic queries on it. The data is flat with some fields, each row representing a separate data item.
key1=str key2=30.3 key3=longtexthere
Another data row:
key1=3 key5=abc
SequenceFile seemed the most natural choice, but I could not find out how to store multiple rows in a single SequenceFile.
Currently, as a temporary solution, we have multiple writers that write to multiple text files, so when a query is needed, we read them in parallel. However, the current text files contain thousands of rows, and I don't think creating a single SequenceFile for each row would be feasible; it would incur a lot of overhead for storing metadata and for reading far too many files at once when querying.
I think the problem could be solved by using HBase or Cassandra, a columnar database, but we are almost required to use HDFS. Am I missing something about SequenceFiles, or should we really use a columnar database?
The sequence file format looks like this:
<key, value>
<key, value>
<key, value>
...
where the key is a WritableComparable and the value is a Writable.
Now what a lot of people are doing - and you could do the same - is:
Only use the key OR the value 'column'
Implement a custom Writable which wraps a set of other Writables (call it record, row, ...)
That way you can model everything you want. That record Writable could have a fixed schema, for example 'IntWritable, Text, IntWritable, IntWritable' (depending on your fields). Or, if you don't need to support different types, you could use the existing ArrayWritable as your 'record'.
Knowing the schema of each file (e.g. by putting it into the metadata of the sequence file) will allow you to read files with different or evolved schemas.
So it's a lot of handcrafting, but you can build a very efficient and flexible structure. I've never used it, but take a look at http://pangool.net/userguide/schemas.html; I think they have already modeled such a flexible record/tuple schema on top of sequence files.
Bottom line, I think you can achieve what you want with sequence files.
However, I would also recommend having a look at columnar file formats like Parquet or ORC. Those come with their own tradeoffs, but you will get a higher compression rate and selective reads (column projection, filter pushdown). Also, you don't have to invent the schema/tuple structure.
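The "one record value per row, many rows per file" idea can be illustrated without Hadoop. The sketch below is plain Python and does not reproduce the real SequenceFile binary layout or the Writable interface; the helper names `append_row` and `read_rows` are hypothetical. Each row, whatever fields it has, is serialized into one self-describing value and appended to the same stream with a length prefix.

```python
import io
import json
import struct


def append_row(stream, row: dict) -> None:
    """Append one row as a length-prefixed value (the key is left unused)."""
    payload = json.dumps(row, sort_keys=True).encode("utf-8")
    stream.write(struct.pack(">I", len(payload)))  # 4-byte big-endian length
    stream.write(payload)


def read_rows(stream):
    """Yield rows back in the order they were appended."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        (length,) = struct.unpack(">I", header)
        yield json.loads(stream.read(length).decode("utf-8"))


# Usage: two rows with different fields share one container.
buffer = io.BytesIO()
append_row(buffer, {"key1": "str", "key2": 30.3, "key3": "longtexthere"})
append_row(buffer, {"key1": 3, "key5": "abc"})
buffer.seek(0)
rows = list(read_rows(buffer))
```

In a real sequence file the framing is handled by the format itself, so only the record encoding (the custom Writable) would need to be written by hand.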

Schema verification/validation before loading data into HDFS/Hive

I am a newbie to the Hadoop ecosystem, and I need some suggestions from big data experts on achieving schema verification/validation before loading a huge dataset into HDFS.
The scenario is:
I have a huge dataset with a given schema (around 200 column headers). This dataset is going to be stored in Hive tables/HDFS. Before loading the data into the Hive table/HDFS, I want to perform schema-level verification/validation on the supplied data to avoid any unwanted errors/exceptions while loading it into HDFS. For example, if somebody tries to pass a data file with fewer or more columns, the load should fail at this first level of verification.
What could be the best possible approach for achieving the same?
Regards,
Bhupesh
Since you have files, you can add them to HDFS and run MapReduce on top of them. There you get hold of each row, so you can verify the number of columns, their types, and apply any other validations.
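The per-row check itself is small. Below is a minimal sketch in plain Python (not a MapReduce job) of the column-count part, assuming comma-separated input; the function name `validate_rows` is made up for illustration. In a MapReduce job the same test would presumably live in the mapper.

```python
import csv
import io


def validate_rows(lines, expected_columns: int):
    """Yield (row_number, reason) for every row with the wrong column count."""
    for row_number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != expected_columns:
            yield row_number, f"expected {expected_columns} columns, found {len(row)}"


# Usage: the second row is one column short, so the load should be rejected.
sample = io.StringIO("a,b,c\n1,2\n4,5,6\n")
problems = list(validate_rows(sample, expected_columns=3))
```

An empty `problems` list would mean the file passes this first level of verification; type checks could be added in the same loop.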
As for JSON/XML, there is a slight overhead in making MapReduce identify the records in that format. However, for validation you can enforce a schema, and the schema can also restrict a field to specific values. So once the schema is ready, do your parsing (XML to Java) and then store the records at another, final HDFS location for further use (e.g. HBase). When you are sure that the data is validated, you can create Hive tables on top of it.
Use the utility below to create temp tables each time, based on the schema you receive in CSV format in the staging directory, and then apply some conditions to identify whether you have valid columns or not. Finally, load into the original table.
https://github.com/enahwe/Csv2Hive

Is it possible to retrieve schema from avro data and use them in MapReduce?

I've used avro-tools to convert my Avro schema into a Java class, which I pass into Avro-Map-Input-Key-Schema for data processing. This is all working fine.
But recently I had to add a new column to the Avro schema and recompile the Java class.
This is where I encountered a problem: my previously generated data was serialized with the old schema, so my MapReduce jobs are now failing after the schema change, even though my MapReduce logic isn't using the new column.
Therefore, I was wondering whether I could stop passing in the Java schema class and instead retrieve the schema from the data and process the data dynamically. Is this possible?
I assume it isn't!
Yeah, there isn't. But you can read it as a GenericRecord and then map the fields to your updated type object. I go through this at a high level here.
It is possible to read existing data with an updated schema. Avro will always read a file using the schema from its header, but if you also supply an expected schema (or "read schema") then Avro will create records that conform to that requested schema. That ends up skipping fields that aren't requested or filling in defaults for fields that are missing from the file.
In this case, you want to set the read schema and data model for your MapReduce job like this:
AvroJob.setInputSchema(job, MyRecord.getClassSchema());
AvroJob.setDataModelClass(job, SpecificData.class);
