Writing Avro data to Hadoop HDFS - hadoop

I have a Java Kafka consumer that consumes Avro data from Kafka (say, topic x). It is supposed to push this data to HDFS as-is, without code generation. The Avro documentation uses something like the following:
GenericRecord e1 = new GenericData.Record(schema);
e1.put("key", "value");
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
dataFileWriter.create(schema, new File("<HDFS file path>"));
dataFileWriter.append(e1);
dataFileWriter.close();
The problem with this is that I already have the Avro data. To use this sequence of steps, I would have to deserialize the Avro packet, extract each key-value pair, and then put it into a GenericRecord object, which I don't think makes any sense. I haven't found any example of what I'm trying to achieve. Any hint or link to relevant documentation is very much appreciated.

If I understood your question correctly, I suggest trying the com.twitter.bijection.Injection and com.twitter.bijection.avro.GenericAvroCodecs classes, for example.
Take a look here http://aseigneurin.github.io/2016/03/04/kafka-spark-avro-producing-and-consuming-avro-messages.html.
There, in the Kafka producer, the GenericRecord is converted to a byte[], which is put on the Kafka topic; in the consumer, those bytes are inverted back into a GenericRecord according to your schema. You don't need to put values into every field of the record. After that, you can write the record to a file.
Also, you will probably need to access the file in HDFS some other way, since you cannot create a java.io.File instance for an HDFS path.
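Putting those two points together, here is a minimal sketch (assuming avro, hadoop-client, and bijection-avro are on the classpath; the schema, class name, and output path are made up for illustration). It inverts the raw payload bytes back into a GenericRecord without per-field extraction, and writes through the Hadoop FileSystem API instead of a java.io.File:

```java
import java.io.OutputStream;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import com.twitter.bijection.Injection;
import com.twitter.bijection.avro.GenericAvroCodecs;

public class AvroBytesToHdfs {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema; use the schema your producer serialized with.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":"
            + "[{\"name\":\"key\",\"type\":\"string\"}]}");

        Injection<GenericRecord, byte[]> injection = GenericAvroCodecs.toBinary(schema);

        // Simulate the byte[] payload you already have from your Kafka consumer.
        GenericRecord original = new GenericData.Record(schema);
        original.put("key", "value");
        byte[] payload = injection.apply(original);

        // Invert the bytes back into a GenericRecord -- no per-field extraction.
        GenericRecord record = injection.invert(payload).get();

        // Open a stream through the Hadoop FileSystem API instead of java.io.File.
        Configuration conf = new Configuration(); // point fs.defaultFS at your cluster
        FileSystem fs = FileSystem.get(conf);
        OutputStream out = fs.create(new Path("/tmp/events.avro"));

        DataFileWriter<GenericRecord> writer = new DataFileWriter<>(
            new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, out); // DataFileWriter also accepts an OutputStream
        writer.append(record);
        writer.close();
    }
}
```

Without an `fs.defaultFS` pointing at a cluster, `FileSystem.get(conf)` falls back to the local filesystem, which makes this easy to try out before wiring it to HDFS.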

Related

Adding dynamic records in parquet format

I'm working on building a data lake and am stuck on a very trivial thing. I'll be using Hadoop/HDFS as our data lake infrastructure and storing records in Parquet format. The data will come from a Kafka queue, which sends a JSON record every time. The keys in the JSON record can vary from message to message. For example, in the first message the keys could be 'a' and 'b', and in the second message the keys could be 'c' and 'd'.
I was using pyarrow to store files in Parquet format, but as per my understanding we have to predefine the schema. So when I try to write the second message, it throws an error saying that keys 'c' and 'd' are not defined in the schema.
Could someone offer guidance on how to proceed? Any library other than pyarrow also works, as long as it has this functionality.
Parquet supports Map types for instances where the fields are not known ahead of time. If some of the fields are known, define more concrete types for those, possibly making them nullable; however, you cannot mix named fields with a map at the same level of the record structure.
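As a sketch, a Parquet schema that keeps the varying keys in a map could look like the following (the field names are made up; here `id` stands in for a known field alongside the open-ended `attributes` map):

```
message event {
  required binary id (UTF8);
  required group attributes (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      required binary value (UTF8);
    }
  }
}
```

With this shape, a message with keys 'a' and 'b' and a message with keys 'c' and 'd' both fit the same schema as map entries.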
I've not used pyarrow, but I'd suggest using Spark Structured Streaming and defining a schema there, especially when consuming from Kafka. Spark's default output writer to HDFS uses Parquet.

How to achieve parallelism with kafka connect source?

I'm fairly new to Kafka Connect. I'm planning to use a Kafka Connect source to read data from my MySQL database tables into a Kafka topic. Now, since my source table is a transactional data store, I might get a new record inserted into it, or a record might be updated. I'm trying to understand how I can achieve parallelism when reading the data from this table, and my question is:
Can I use tasks.max to achieve parallelism (have more than one thread) to read the data and push it onto the Kafka topic? If yes, please explain.
Thanks
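For reference, the relevant setting on the JDBC source connector is `tasks.max`, and, as far as I understand it, parallelism happens across tables: each table (or query) is assigned to at most one task, so raising `tasks.max` beyond the number of tables won't split reads of a single table across threads. A hypothetical configuration sketch (connector name, database, and column names are made up):

```json
{
  "name": "mysql-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/mydb",
    "table.whitelist": "orders,customers",
    "mode": "timestamp+incrementing",
    "timestamp.column.name": "updated_at",
    "incrementing.column.name": "id",
    "topic.prefix": "mysql-",
    "tasks.max": "2"
  }
}
```

With two whitelisted tables and `tasks.max` of 2, each task handles one table; `timestamp+incrementing` mode is what picks up both new inserts and updates.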

Appending to existing avro file in HDFS with NiFi

I have a NiFi flow that grabs JSON events from an MQTT broker, groups them according to some criteria, transforms them to Avro rows, and should output them as files in a Hadoop cluster.
I chose Avro as the storage format since it's able to append new data to an existing file.
These events are grouped by source, and ideally I should have one separate Avro file in HDFS for each event source, so that NiFi accumulates new events in each file as they appear (with proper write batching, of course, since issuing a write per new event wouldn't be very good; I've already worked this out with a MergeContent processor).
I have the flow worked out, but I found that the last step, a PutHDFS processor, is file-format agnostic; that is, it doesn't understand how to append to an existing Avro file.
I found a pull request that implements exactly that, but it was never merged into NiFi due to various concerns.
Is there a way to do this with existing NiFi processors, or do I have to roll my own custom PutHDFS processor that understands how to append to existing Avro files?

Kafka Connect- Modifying records before writing into sink

I have installed Kafka Connect using Confluent 4.0.0.
Using the HDFS connector, I am able to save Avro records received from a Kafka topic to Hive.
I would like to know if there is any way to modify the records before writing them to the HDFS sink.
My requirement is to make small modifications to the values of the record, for example performing arithmetic operations on integers or manipulating strings.
Please suggest if there is any way to achieve this.
You have several options.
Single Message Transforms, which you can see in action here. Great for light-weight changes as messages pass through Connect. Configuration-file based, and extensible using the provided API if there's not an existing transform that does what you want.
See the discussion here on when SMT are suitable for a given requirement.
KSQL is a streaming SQL engine for Kafka. You can use it to modify your streams of data before sending them to HDFS. See this example here.
KSQL is built on the Kafka Streams API, which is a Java library that gives you the power to transform your data as much as you'd like. Here's an example.
Take a look at Kafka Connect transforms [1] and [2]. You can build a custom transform library and use it in a connector.
[1] http://kafka.apache.org/documentation.html#connect_transforms
[2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect
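As a sketch of the Single Message Transform route, a built-in transform such as Cast can be wired into the sink connector's configuration like this (the transform alias `castPrice` and the field name are made up). Note that the built-in transforms cover casts, renames, field insertion, and the like; arbitrary arithmetic on values would need a custom Transformation built against the API in [1]:

```json
"transforms": "castPrice",
"transforms.castPrice.type": "org.apache.kafka.connect.transforms.Cast$Value",
"transforms.castPrice.spec": "price:int64"
```

These properties go alongside the rest of the HDFS sink connector configuration; Connect applies the transform to each record as it passes through, before it reaches the sink.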

Is it possible to retrieve schema from avro data and use them in MapReduce?

I've used avro-tools to convert my Avro schema into a Java class, which I pass in as the Avro map input key schema for data processing. This is all working fine.
But recently I had to add a new column to the Avro schema and recompile the Java class.
This is where I encountered a problem: my previously generated data were serialized with the old schema, so my MapReduce jobs are now failing after the schema change, even though my MapReduce logic doesn't use the new column.
Therefore, I was wondering whether I could stop passing in the Java schema class and instead retrieve the schema from the data and process it dynamically. Is this possible?
I assume it isn't!
Yeah, there isn't. But you can read the data as a GenericRecord and then map the fields onto your updated type object. I go through this at a high level here.
It is possible to read existing data with an updated schema. Avro will always read a file using the schema from its header, but if you also supply an expected schema (or "read schema") then Avro will create records that conform to that requested schema. That ends up skipping fields that aren't requested or filling in defaults for fields that are missing from the file.
In this case, you want to set the read schema and data model for your MapReduce job like this:
AvroJob.setInputSchema(job, MyRecord.getClassSchema());
AvroJob.setDataModelClass(job, SpecificData.class);
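The schema-resolution behaviour described above is easy to verify with plain Avro, outside of MapReduce. A minimal sketch with made-up schemas: write a file with the old schema, then read it back with a newer read schema that adds a defaulted field.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class SchemaEvolutionDemo {
    public static void main(String[] args) throws Exception {
        Schema oldSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":"
            + "[{\"name\":\"id\",\"type\":\"int\"}]}");
        // The new schema adds a column with a default, so old files still resolve.
        Schema newSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":"
            + "[{\"name\":\"id\",\"type\":\"int\"},"
            + "{\"name\":\"tag\",\"type\":\"string\",\"default\":\"none\"}]}");

        // Write one record with the OLD schema.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataFileWriter<GenericRecord> writer = new DataFileWriter<>(
            new GenericDatumWriter<GenericRecord>(oldSchema));
        writer.create(oldSchema, baos);
        GenericRecord r = new GenericData.Record(oldSchema);
        r.put("id", 1);
        writer.append(r);
        writer.close();

        // Read it back with the NEW schema as the expected ("read") schema.
        // The writer schema comes from the file header; Avro fills in defaults.
        DataFileStream<GenericRecord> reader = new DataFileStream<>(
            new ByteArrayInputStream(baos.toByteArray()),
            new GenericDatumReader<GenericRecord>(null, newSchema));
        GenericRecord resolved = reader.next();
        System.out.println(resolved.get("id") + " " + resolved.get("tag"));
        reader.close();
    }
}
```

The record read back carries the `tag` field filled with its default, which is exactly what the reader-schema mechanism behind AvroJob.setInputSchema does for a MapReduce job.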
