Apache NiFi - How to add/pass attributes to a Processor, not a flow file - apache-nifi

My Purpose
Execute a SQL query and write the result (a flow file) directly to a file, using my own schema.
Please see the explanation below.
Solution 1 (use 4 processors)
ExecuteSQL: the records have an auto-generated (embedded) Avro schema.
ConvertRecord: the Record Reader uses the embedded Avro schema and the Record Writer uses my own schema from HortonworksSchemaRegistry, so I pass the 'schema.name' and 'schema.version' attributes using UpdateAttribute.
It works.
Solution 2 (use ExecuteSQLRecord)
It might look like this:
ExecuteSQLRecord has a Record Writer.
The Record Writer gets the Avro schema from HortonworksSchemaRegistry using the 'schema.name' and 'schema.version' attributes.
But ExecuteSQLRecord does not support user-defined attributes.
So:
Is this the right way to use the ExecuteSQLRecord processor?
How do I add attributes to a processor?

As of now, users cannot add new properties to the ExecuteSQL* processors.
Here are two approaches you can try.
Use a GenerateFlowFile processor
Add a schema.name attribute with some value.
Flow:
1.GenerateFlowFile //add schema.name attribute with value.
2.ExecuteSQLRecord
3.PutFile
(or)
Hardcode the schema.name value in the RecordWriter controller service. In this case you don't need the GenerateFlowFile processor.
Flow:
1.ExecuteSQLRecord //hardcode schema.name property value
2.PutFile
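The two flows differ only in where the schema name comes from. The following Python sketch models that choice; it is not NiFi code. The REGISTRY dictionary, the "customer" schema and the resolve_schema function are hypothetical stand-ins for the schema registry and the Record Writer lookup.

```python
# Hypothetical in-memory stand-in for a schema registry, keyed by (name, version).
REGISTRY = {
    ("customer", 1): {
        "type": "record",
        "name": "customer",
        "fields": [{"name": "id", "type": "int"}],
    },
}


def resolve_schema(attributes, hardcoded_name=None):
    """Pick the schema name from the writer config if hardcoded,
    otherwise from the flow file's schema.name attribute."""
    name = hardcoded_name or attributes.get("schema.name")
    if not name:
        raise KeyError("schema.name is not set on the flow file or the writer")
    # schema.version is optional in this sketch and defaults to 1.
    version = int(attributes.get("schema.version", 1))
    return REGISTRY[(name, version)]


# Flow 1: GenerateFlowFile supplies the attributes.
from_attributes = resolve_schema({"schema.name": "customer", "schema.version": "1"})

# Flow 2: the name is hardcoded on the writer, so no attributes are needed.
from_hardcoded = resolve_schema({}, hardcoded_name="customer")
```

Both calls return the same schema; the first flow keeps the writer reusable across schemas, while the second removes the extra processor.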

Related

How can I filter flow files upon the result of an SQL query?

Would it be possible to route flow files according to the result of an SQL query which returns a single row result? For example, if the result is '1' the flow file will be processed; otherwise, it will be ignored.
Solution
The following approach worked best for me.
Use the ExecuteSQL processor to run the filtering SQL query. The query was written to produce either a single record (match) or an empty record set (no match), in the way suggested by Shu.
Connect ExecuteSQL to a RouteOnAttribute processor to filter out unmatched flow files, using the following value for the routing property: ${executesql.row.count:replaceNull(0):gt(0)}
Note that the original content of a flow file is lost after applying ExecuteSQL. That is not an issue in my case, because I filter before processing the flow file content and my SQL query is based entirely on the flow file attributes, not on its content. In a more general scenario, when the flow file content is modified by the incoming part of the flow, one should save the content somewhere (e.g. the file system) and restore it after the filtering part has been applied.
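As a rough illustration of what the routing expression evaluates, here is a Python sketch of the same logic. It is not NiFi code; should_route is a hypothetical helper and the attributes are modelled as a plain dictionary of strings.

```python
def should_route(attributes):
    """Mimic ${executesql.row.count:replaceNull(0):gt(0)} on a dict of attributes."""
    raw = attributes.get("executesql.row.count")
    count = int(raw) if raw is not None else 0  # replaceNull(0)
    return count > 0                            # gt(0)


matched = should_route({"executesql.row.count": "1"})
unmatched = should_route({"executesql.row.count": "0"})
no_attribute = should_route({})
```

A flow file is kept only when the attribute exists and holds a count greater than zero; a missing attribute is treated the same as zero rows.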
You can add a where clause to your SQL query, where <field_name> = 1, so that the query returns a row only when the result value is 1.
(or)
Checking the data in NiFi:
The result of the SQL query is in Avro format, so you can use one of the following options.
Option 1: ConvertAvroToJSON processor:
Convert the Avro data into JSON format, then extract the value from the JSON content into an attribute using the EvaluateJsonPath processor.
Then, in a RouteOnAttribute processor, add a new property that uses the NiFi Expression Language equals function to compare the value, and route the flowfile to the matched relationship.
Refer to this link for more details regarding the EvaluateJsonPath and RouteOnAttribute processor configs.
Option 2: QueryRecord processor:
With the QueryRecord processor we can run SQL queries on the content of the flowfile.
Add a new property to the processor as
select * from FLOWFILE where <field_name> =1
Feed the relationship named after that property to the next processor.
Refer to this link for more details regarding QueryRecord processor usage.
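To see the effect of the QueryRecord filter outside NiFi, the sketch below runs the same kind of query against an in-memory SQLite table named FLOWFILE. This is only an approximation: QueryRecord uses its own SQL engine, and the id and status columns and the query_records helper are assumptions made for the example.

```python
import sqlite3


def query_records(records, sql):
    """Load records into a table called FLOWFILE and run the given SQL over it."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    # Hypothetical two-column layout standing in for the flow file's record schema.
    conn.execute("CREATE TABLE FLOWFILE (id INTEGER, status INTEGER)")
    conn.executemany(
        "INSERT INTO FLOWFILE (id, status) VALUES (:id, :status)", records
    )
    rows = [dict(row) for row in conn.execute(sql)]
    conn.close()
    return rows


records = [{"id": 1, "status": 1}, {"id": 2, "status": 0}]
kept = query_records(records, "SELECT * FROM FLOWFILE WHERE status = 1")
```

Only the records that satisfy the where clause survive, which is what gets sent to the relationship created by the dynamic property.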

How Can ExtractGrok use multiple regular expressions?

I have a Kafka topic which includes different types of messages sent from different sources.
I would like to use the ExtractGrok processor to extract the message based on a regular expression/grok pattern.
How do I configure or run the processor with multiple regular expressions?
For example, the Kafka topic contains INFO, WARNING and ERROR log entries from different applications.
I would like to separate the messages by log level and place them into HDFS.
Instead of using the ExtractGrok processor, use the PartitionRecord processor in NiFi. This processor
evaluates one or more RecordPaths against each record in the incoming FlowFile.
Each record is then grouped with other "like records".
Configure/enable the controller services:
Record Reader as GrokReader
Record Writer as your desired format
Then use the PutHDFS processor to store the flowfile based on the loglevel attribute.
Flow:
1.ConsumeKafka processor
2.PartitionRecord processor
3.PutHDFS processor
Refer to this link, which describes all the steps to configure the PartitionRecord processor.
Refer to this link, which describes how to store partitions dynamically in HDFS directories using the PutHDFS processor.
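The grouping that this flow performs can be sketched in Python as follows. This is an illustration only, not NiFi or Grok code: the log line layout, the LOG_LINE regular expression and the /logs base directory are assumptions chosen for the example.

```python
import re
from collections import defaultdict

# Assumed log layout: "<date> <time> <LEVEL> <message>".
LOG_LINE = re.compile(r"^(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def partition_by_level(lines):
    """Group log lines by their level, the way like records are grouped together."""
    groups = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line)
        level = match.group("level") if match else "UNMATCHED"
        groups[level].append(line)
    return dict(groups)


def hdfs_directory(base, level):
    """Build a per-level target directory from the partition value."""
    return f"{base}/{level.lower()}"


lines = [
    "2024-01-01 10:00:00 INFO service started",
    "2024-01-01 10:00:05 ERROR connection refused",
    "2024-01-01 10:00:09 INFO request handled",
]
partitions = partition_by_level(lines)
targets = {level: hdfs_directory("/logs", level) for level in partitions}
```

Each partition corresponds to one outgoing flowfile, and the level that defines the partition is what drives the directory chosen at the PutHDFS step.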

How to keep hive table in NiFi DistributedMapCache

I want to keep my hive/MySQL table in NiFi DistributedMapCache. Can someone please help me with the example?
Or please correct me if we can not cache hive table anyhow in NiFi cache.
Thanks
You can use the SelectHiveQL processor to pull data from the Hive table, with the output format set to CSV and the header excluded.
Use the SplitText processor to split each line into an individual flowfile.
Note
If your flowfile is big, use a series of SplitText processors to split the flowfile down to individual lines.
Use the ExtractText processor to extract the key attribute from the flowfile content.
PutDistributedMapCache processor
Configure/enable the DistributedMapCacheClientService and DistributedMapCacheServer controller services.
Set the Cache Entry Identifier property to the attribute extracted by the ExtractText processor.
You need to change the Max cache entry size depending on the flowfile size.
To fetch the cached data, use the FetchDistributedMapCache processor with exactly the same identifier value that was used in PutDistributedMapCache.
In the same way, if you want to load data from external sources that arrives in Avro format, use the ConvertRecord processor to convert Avro --> CSV and then load the data into the distributed cache.
However, loading all the data into the DistributedMapCache is not a best practice for huge datasets; you can also use the LookupRecord processor.
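The split, extract-key, put and fetch steps can be pictured with a small Python sketch. A plain dictionary stands in for the distributed cache here, and the build_cache helper, the sample rows and the choice of the first CSV column as the key are all assumptions for illustration.

```python
import csv


def build_cache(csv_text, key_column=0):
    """Split headerless CSV into lines and cache each line under its key field."""
    cache = {}
    for line in csv_text.splitlines():      # one line per flowfile, as after SplitText
        if not line.strip():
            continue
        fields = next(csv.reader([line]))   # parse the line to find the key field
        cache[fields[key_column]] = line    # identifier -> cached content
    return cache


table_dump = "1,alice\n2,bob\n"
cache = build_cache(table_dump)

# Fetching must use exactly the same identifier that was used when caching.
hit = cache.get("2")
miss = cache.get("3")
```

The lookup only succeeds when the identifier matches the cached key exactly, which is why the same extracted attribute has to be used on both the put and fetch sides.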

Best approach to determine Oracle INSERT or UPDATE using NiFi

I have a JSON flow-file and I need to determine whether I should be doing an INSERT or an UPDATE. The trick is to only update the columns that match the JSON attributes. I have an ExecuteSQL working and it returns executesql.row.count; however, I lose the original JSON flow-file, which I was planning to use with RouteOnAttribute. I'm trying to get MergeContent to join the ExecuteSQL output (dropping the Avro output; I only need the executesql.row.count attribute) with the JSON flow. I've set the following before the ExecuteSQL:
fragment.count=2
fragment.identifier=${UUID()}
fragment.index=${nextInt()}
Alternatively I could create a MERGE, if there is a way to loop through the list of JSON attributes that match the Oracle table?
How large is your JSON? If it's small, you might consider using ExtractText (matching the whole document) to get the JSON into an attribute. Then you can run ExecuteSQL, then ReplaceText to put the JSON back into the content (overwriting the Avro results). If your JSON is large, you could set up a DistributedMapCacheServer and (in a separate flow) run ExecuteSQL and store the value of executesql.row.count into the cache. Then in the JSON flow you can use FetchDistributedMapCache with the "Put Cache Value In Attribute" property set.
If you only need the JSON to use RouteOnAttribute, perhaps you could use EvaluateJsonPath before ExecuteSQL, so your conditions are already in attributes and you can replace the flow file contents.
If you want to use MergeContent, you can set fragment.count to 2, but rather than using the UUID() function, you could set "parent.identifier" to "${uuid}" using UpdateAttribute, then DuplicateFlowFile to create 2 copies, then UpdateAttribute to set "fragment.identifier" to "${parent.identifier}" and "fragment.index" to "${nextInt():mod(2)}". This gives a mergeable set of two flow files. You can route on fragment.index being 0 or 1, sending one to ExecuteSQL and one through the other flow, joining back up at MergeContent.
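The fork-and-rejoin idea can be modelled in Python as below. This is a simplified sketch, not NiFi behaviour: flow files are plain dictionaries, fork and merge are hypothetical helpers, and the merge here simply keeps the union of attributes and concatenates content in index order.

```python
import copy


def fork(flow_file):
    """Create two copies that share a fragment.identifier and carry index 0 and 1."""
    parent_id = flow_file["attributes"]["uuid"]
    copies = []
    for index in range(2):
        clone = copy.deepcopy(flow_file)
        clone["attributes"].update({
            "fragment.identifier": parent_id,
            "fragment.count": "2",
            "fragment.index": str(index),
        })
        copies.append(clone)
    return copies


def merge(fragments):
    """Rejoin a complete set of fragments that share one identifier."""
    identifiers = {f["attributes"]["fragment.identifier"] for f in fragments}
    expected = int(fragments[0]["attributes"]["fragment.count"])
    if len(identifiers) != 1 or len(fragments) != expected:
        raise ValueError("incomplete or mixed fragment set")
    ordered = sorted(fragments, key=lambda f: int(f["attributes"]["fragment.index"]))
    attributes = {}
    for fragment in ordered:
        attributes.update(fragment["attributes"])
    return {"attributes": attributes,
            "content": b"".join(f["content"] for f in ordered)}


original = {"attributes": {"uuid": "abc"}, "content": b'{"id": 1}'}
json_branch, sql_branch = fork(original)

# Simulate the ExecuteSQL branch: content discarded, row count attribute added.
sql_branch["content"] = b""
sql_branch["attributes"]["executesql.row.count"] = "1"

merged = merge([sql_branch, json_branch])
```

After the merge, the result carries the original JSON content together with the row count attribute, which is what the routing step needs.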
Another alternative is to use ConvertJSONToSQL set to "UPDATE", and if it fails, route those flow files to another ConvertJSONToSQL processor set to "INSERT".
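The update-first, insert-on-failure pattern can be sketched in Python against SQLite. This is an analogy rather than the NiFi flow itself: it assumes that "failure" means the UPDATE matched no rows, it uses SQLite instead of Oracle, and the upsert helper, the table t and its columns are hypothetical. Table and column names are interpolated directly, so they must come from a trusted source.

```python
import json
import sqlite3


def upsert(conn, table, key_column, record):
    """Try an UPDATE of only the columns present in the record; INSERT if no row matched."""
    columns = [c for c in record if c != key_column]
    if not columns:
        raise ValueError("record has no columns to update")
    assignments = ", ".join(f"{c} = ?" for c in columns)
    cursor = conn.execute(
        f"UPDATE {table} SET {assignments} WHERE {key_column} = ?",
        [record[c] for c in columns] + [record[key_column]],
    )
    if cursor.rowcount > 0:
        return "UPDATE"
    names = ", ".join(record)
    placeholders = ", ".join("?" for _ in record)
    conn.execute(
        f"INSERT INTO {table} ({names}) VALUES ({placeholders})",
        list(record.values()),
    )
    return "INSERT"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

first = upsert(conn, "t", "id", json.loads('{"id": 1, "name": "a"}'))
second = upsert(conn, "t", "id", json.loads('{"id": 1, "city": "x"}'))
row = conn.execute("SELECT id, name, city FROM t").fetchone()
```

Because the UPDATE statement is built from the keys of the JSON document, only the columns that appear in the JSON are touched, and the other columns keep their existing values.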

How to pass values dynamically from one processor to another processor using Apache NiFi

I want to pass the result of one processor as input to another processor using Apache NiFi.
I am getting values from MySQL using the ExecuteSQL processor. I want to pass this result dynamically to the SelectHiveQL processor in Apache NiFi.
ExecuteSQL outputs a result set as Avro. If you would like to process each row individually, you can use SplitAvro then ConvertAvroToJson, or ConvertAvroToJson then SplitJson. At that point you can use EvaluateJsonPath to extract values into attributes (for use with NiFi Expression Language), and at some point you will likely want ReplaceText where you set the content of the flow file to a HiveQL statement (for use by SelectHiveQL).
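The last step, turning each row into a HiveQL statement, amounts to filling a template with values taken from the row. The Python sketch below illustrates that under stated assumptions: the rows are already converted to JSON, and the rows_to_hiveql helper, the orders table and the customer_id column are hypothetical. Plain string substitution like this is only safe when the values are trusted.

```python
import json


def rows_to_hiveql(json_rows, template):
    """Produce one HiveQL statement per row by filling the template with row values."""
    statements = []
    for row in json.loads(json_rows):        # one row per split flowfile
        statements.append(template.format(**row))
    return statements


rows = '[{"customer_id": 7}, {"customer_id": 9}]'
template = "SELECT * FROM orders WHERE customer_id = {customer_id}"
statements = rows_to_hiveql(rows, template)
```

Each generated statement plays the role of the flow file content that the downstream query processor would execute.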

Resources