How to convert Avro to an SQL batch update? - apache-nifi

I'm trying to convert an input Avro flowfile (an array of Avro records) into a batch of upsert statements. Is there a processor that can do this?
[Read External DB (RDBMS)]->[Avro to Upsert batch]->[Local DB]
What I found is that the records can be formatted with sql.args.N.type and sql.args.N.value attributes before PutSQL. With this approach, is there a processor or a trick that can keep the flow clean?
[Read External DB (RDBMS)]->[Split into 1]->[Convert Avro to sql.args.N.type/value format]->[SetAttribute: sql.statement=SQL]->[Local DB]
In the second case I'm stuck at [Convert Avro to sql.args.N.type/value format], and I'm trying to resist the urge to use ExecuteScript... What is the simplest way forward?

If you need to generate SQL (to do an upsert vs. an insert, for example), you could use ConvertJSONToSQL (assuming your content is JSON), which generates all the sql.args.N attributes for you. If you use ExecuteSQLRecord or QueryDatabaseTableRecord you can get your source DB data as JSON (by using a JsonRecordSetWriter), whereas the non-record-based versions only output Avro. Otherwise you'd need a ConvertAvroToJSON before ConvertJSONToSQL.
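For reference, PutSQL expects the FlowFile content to be a parameterized statement and the sql.args.N.type / sql.args.N.value attributes to supply the ordered bind values (the type is a java.sql.Types constant). Below is a minimal JDBC sketch of what that binding amounts to; the table, columns, and JDBC URL are made up, and the ON CONFLICT upsert syntax shown is PostgreSQL-specific.

```java
// Illustrative only: roughly how the sql.args.N.* convention maps onto JDBC.
import java.sql.*;

public class SqlArgsUpsertSketch {
    public static void main(String[] args) throws SQLException {
        // FlowFile content: a parameterized statement. ConvertJSONToSQL emits INSERT/UPDATE
        // statements in this form; an upsert like this one you would typically set yourself
        // (e.g. via ReplaceText) while reusing the generated sql.args.N.* attributes.
        String statement = "INSERT INTO local_table (id, name) VALUES (?, ?) "
                         + "ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name";

        // FlowFile attributes: sql.args.1.type=4 (INTEGER), sql.args.1.value=42,
        //                      sql.args.2.type=12 (VARCHAR), sql.args.2.value=Alice
        int[] types = { Types.INTEGER, Types.VARCHAR };
        String[] values = { "42", "Alice" };

        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/localdb");
             PreparedStatement ps = conn.prepareStatement(statement)) {
            for (int i = 0; i < values.length; i++) {
                // bind sql.args.(i+1).value, converted to its declared JDBC type
                ps.setObject(i + 1, values[i], types[i]);
            }
            ps.executeUpdate();
        }
    }
}
```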

Related

How to convert an Avro schema into line protocol in order to insert data into InfluxDB with Apache NiFi

I am creating a data pipeline with Apache NiFi to copy data from a remote MySQL database into InfluxDB.
I use the QueryDatabaseTable processor to extract the data from the MySQL database, then I use UpdateRecord to do some data transformations, and I would like to use PutInfluxDB to insert the time series into my local Influx instance on Linux.
The data coming from the QueryDatabaseTable processor uses an Avro schema, and I need to convert it into line protocol, configuring which fields are tags and which are measurement values.
However, I cannot find any processor that does this conversion.
Any hints?
Thanks,
Bernardo
There is no built-in processor for InfluxDB Line Protocol conversions. You could write a ScriptedRecordWriter if you wanted to do it yourself; however, there is a project by InfluxData that already implements a Line Protocol reader for NiFi here, and it seems to be active and up-to-date.
See the documentation for adding it into NiFi here.
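If you do end up rolling your own writer, the target format itself is simple: measurement,tag1=v1,tag2=v2 field1=1.0 timestamp. Here is a rough Java sketch of that formatting, with invented measurement/tag/field names and no escaping of special characters (a real ScriptedRecordWriter or the InfluxData bundle handles that for you).

```java
// Illustrative only: formats one record as an InfluxDB line protocol line.
import java.util.*;

public class LineProtocolSketch {
    static String toLine(String measurement, Map<String, String> tags,
                         Map<String, Object> fields, long timestampNanos) {
        StringBuilder sb = new StringBuilder(measurement);
        tags.forEach((k, v) -> sb.append(',').append(k).append('=').append(v));   // tags section
        StringJoiner fj = new StringJoiner(",");
        fields.forEach((k, v) -> fj.add(k + "=" +
                (v instanceof String ? "\"" + v + "\"" : v)));                    // string fields are quoted
        return sb.append(' ').append(fj).append(' ').append(timestampNanos).toString();
    }

    public static void main(String[] args) {
        Map<String, String> tags = Map.of("host", "server01");        // invented tag
        Map<String, Object> fields = Map.of("usage_idle", 97.5);      // invented field
        System.out.println(toLine("cpu", tags, fields, 1_700_000_000_000_000_000L));
        // prints: cpu,host=server01 usage_idle=97.5 1700000000000000000
    }
}
```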

NiFi: Proper way to consume Kafka and store data into Hive

I have the task of creating a Kafka consumer that should extract messages from Kafka, transform them, and store them in a Hive table.
So, in the Kafka topic there are a lot of messages as JSON objects.
I would like to add some fields and insert them into Hive.
I created a flow with the following NiFi processors:
ConsumeKafka_2_0
JoltTransformJSON - to transform the JSON
ConvertRecord - to transform the JSON into an insert query for Hive
PutHiveQL
The topic will be heavily loaded, handling about 5 GB of data per day.
So, are there any ways to optimize my flow (I think it's a bad idea to send a huge number of insert queries to Hive)? Maybe it would be better to use an external table and the PutHDFS processor (in that case, how do I handle partitioning and merge the input JSON into one file?)
As you suspect, using PutHiveQL to perform a large number of individual INSERTs is not very performant. Using your external table approach will likely be much better. If the table is in ORC format, you could use ConvertAvroToORC (for Hive 1.2) or PutORC (for Hive 3) which both generate Hive DDL to help create the external table.
There are also Hive streaming processors, but if you are using Hive 1.2, PutHiveStreaming is not very performant either (though it should still be better than PutHiveQL with INSERTs). For Hive 3, PutHive3Streaming should be much more performant and is my recommended solution.
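For the external-table route, the hive.ddl attribute written by ConvertAvroToORC (and PutORC) is essentially a CREATE EXTERNAL TABLE ... STORED AS ORC statement to which you append a LOCATION clause pointing at the HDFS directory, then run it with PutHiveQL/PutHive3QL or by hand. Below is a hedged sketch of executing such DDL over the Hive JDBC driver, with an invented table name, columns, URL, and path.

```java
// Illustrative only: registering an external ORC table over Hive JDBC.
import java.sql.*;

public class CreateExternalOrcTable {
    public static void main(String[] args) throws SQLException {
        String ddl = "CREATE EXTERNAL TABLE IF NOT EXISTS kafka_events ("
                   + " event_id BIGINT, payload STRING, added_field STRING)"
                   + " PARTITIONED BY (dt STRING)"
                   + " STORED AS ORC"
                   + " LOCATION '/data/kafka_events'";

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-server:10000/default", "nifi", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute(ddl);
            // After PutHDFS/PutORC drops files under /data/kafka_events/dt=...,
            // pick up the new partitions:
            stmt.execute("MSCK REPAIR TABLE kafka_events");
        }
    }
}
```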

Read data from multiple tables at a time and combine the data based on a where clause using NiFi

I have a scenario where I need to extract data from multiple database tables (including the schema), combine the data, and then write it to an Excel file. How can I do this?
In NiFi, the general strategy is to read from something like a fact table with ExecuteSQL or another SQL processor, then use LookupRecord to enrich the data with a lookup table. The catch is that you can only do one table at a time, so you'd need one LookupRecord for each enrichment table. You could then write to a CSV file that you could open in Excel. There might be extensions elsewhere that can write directly to Excel, but I'm not aware of any in the standard NiFi distro.
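Conceptually, each LookupRecord step does something like the sketch below: for every fact record, look up a key in the enrichment table and add the returned fields before writing a CSV row. The table names, keys, and columns are invented, and the in-memory map stands in for the database-backed lookup service a real flow would use.

```java
// Illustrative only: the enrichment LookupRecord performs, mirrored in plain Java.
import java.util.*;

public class LookupEnrichSketch {
    record Customer(String name, String region) {}   // stand-in for the lookup table's columns

    public static void main(String[] args) {
        // Stand-in for the lookup table (in NiFi this would be a lookup service).
        Map<Integer, Customer> customers = Map.of(
                1, new Customer("Alice", "EMEA"),
                2, new Customer("Bob", "APAC"));

        // Stand-in for fact rows returned by ExecuteSQL.
        List<Map<String, Object>> orders = List.of(
                Map.of("order_id", 100, "customer_id", 1, "amount", 25.0),
                Map.of("order_id", 101, "customer_id", 2, "amount", 40.0));

        // Enrich each fact record and emit a CSV row (openable in Excel).
        System.out.println("order_id,customer_name,region,amount");
        for (Map<String, Object> o : orders) {
            Customer c = customers.get(o.get("customer_id"));
            System.out.printf("%s,%s,%s,%s%n",
                    o.get("order_id"), c.name(), c.region(), o.get("amount"));
        }
    }
}
```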

Is there a way to process data in a SQL table column before ingesting it to HBase using Sqoop

Data needs to be ingested from a SQL table to HBase using Sqoop. I have XML data in one column. Instead of ingesting the complete XML for each row, I want to extract the required details from the XML and then ingest them with the rest of the columns. Is there a way, like writing a UDF, where the XML column is passed in and its output is ingested along with the other SQL columns?
No, but you can extend the Java class PutTransformer (https://sqoop.apache.org/docs/1.4.4/SqoopDevGuide.html), add your XML transformation logic there, and pass the custom JAR file to the sqoop command.
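A rough sketch of what that extension might look like, assuming the getPutCommand(Map&lt;String, Object&gt;) hook and the column-family/row-key accessors described in the Sqoop dev guide; the XML column name, the &lt;detail&gt; element, and the transformer property name are assumptions to verify against your Sqoop and HBase versions (newer Sqoop releases may require additional overrides, and pre-1.0 HBase uses put.add instead of put.addColumn).

```java
// Rough sketch only -- verify class and method names against your Sqoop version.
// Replaces the raw XML blob with just the piece of it we care about before the
// row is written to HBase.
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.*;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.sqoop.hbase.PutTransformer;
import org.w3c.dom.Document;

public class XmlAwarePutTransformer extends PutTransformer {

    @Override
    public List<Put> getPutCommand(Map<String, Object> fields) throws IOException {
        byte[] family = Bytes.toBytes(getColumnFamily());
        Object rowKey = fields.get(getRowKeyColumn());
        Put put = new Put(Bytes.toBytes(rowKey.toString()));

        for (Map.Entry<String, Object> e : fields.entrySet()) {
            if (e.getKey().equals(getRowKeyColumn()) || e.getValue() == null) {
                continue;   // the row key column is not stored as a cell
            }
            String value = e.getValue().toString();
            if (e.getKey().equalsIgnoreCase("xml_col")) {   // hypothetical XML column name
                value = extractDetail(value);               // keep only the required detail
            }
            put.addColumn(family, Bytes.toBytes(e.getKey()), Bytes.toBytes(value));
        }
        return Collections.singletonList(put);
    }

    /** Pulls the text of the first <detail> element; adjust to your XML layout. */
    private String extractDetail(String xml) throws IOException {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("detail").item(0).getTextContent();
        } catch (Exception ex) {
            throw new IOException("Could not parse XML column", ex);
        }
    }
}
```

Package it into a JAR and point Sqoop at it, for example with -libjars my-transformer.jar -D sqoop.hbase.insert.put.transformer.class=XmlAwarePutTransformer on the sqoop import command (that property name is the one read by Sqoop's HBasePutProcessor; double-check it for your release).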

ExecuteSQL processor in NiFi returns data in Avro format

I just started working with Apache NiFi. I am trying to fetch data from Oracle and place it in HDFS, then build an external Hive table on top of it. The problem is that the ExecuteSQL processor returns data in Avro format. Is there any way I can get this data in a readable format?
Apache NiFi also has a ConvertAvroToJSON processor. That might help you get it into a readable format. We also really need to just knock out the ability for our content viewer to nicely render Avro data, which would help as well.
Thanks,
Joe
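If you just want to inspect the Avro that ExecuteSQL produced outside NiFi, a small standalone reader is enough to dump it as JSON; the file name below is a placeholder, and only the Avro library is assumed to be on the classpath.

```java
// Illustrative only: print the records of an Avro data file as JSON.
import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class DumpAvroAsJson {
    public static void main(String[] args) throws Exception {
        File avroFile = new File("executesql-output.avro");   // placeholder path
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(avroFile, new GenericDatumReader<>())) {
            for (GenericRecord record : reader) {
                System.out.println(record);   // GenericRecord.toString() renders JSON
            }
        }
    }
}
```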
