How can I prepare attribute before executing the NiFi flow? - apache-nifi

In NiFi, how can we prepare attribute BEFORE (or during) execution of the flow? (likes select a value from the db or get the execution time).
Ex.
Prepare an execution ID >>>> QueryDatabaseTableRecord >>> PutFile
Thank you in advance :-)
More Details:
What I would like to have is to have a runtime variable that won't changed in the flowfile (likes an EXECUTION TIME for example). If I used the attribute with an expression language
${now():toNumber()}
it will change for each flowfiles that's not the expected result I would like to have.

Related

The ExecuteSQL processor doesn't work after connecting with other processor

When I didn't connect any processors as an incoming one, the ExecuteSQL works perfectly fine as the screenshot
Screenshot#1
But when I've connected with another processor, there's no flowfiles coming out of the ExecuteSQL processor.
Screenshot#2
Anyone know how could I make it works? Thank you in advance :-)
check the NiFi docs and you'll find this dscription
Executes provided SQL select query. Query result will be converted to Avro format. Streaming is used so arbitrarily large result sets are supported. This processor can be scheduled to run on a timer, or cron expression, using the standard scheduling methods, or it can be triggered by an incoming FlowFile. If it is triggered by an incoming FlowFile, then attributes of that FlowFile will be available when evaluating the select query, and the query may use the ? to escape parameters. In this case, the parameters to use must exist as FlowFile attributes with the naming convention sql.args.N.type and sql.args.N.value, where N is a positive integer. The sql.args.N.type is expected to be a number indicating the JDBC Type. The content of the FlowFile is expected to be in UTF-8 format. FlowFile attribute 'executesql.row.count' indicates how many rows were selected.
it tells you that you have to use some special atrributes by triggering via flowfile.
something like sql.args.1.type and sql.args.1.value

NiFi - Persist timestamp value used in ExecuteSQLRecord processor query

My use-case is simple but I did not find the right solution so-far.
I write the query which tag the data with the current timestamp in of the column at the time ExecuteSQLRecord processor hit and get the data from database now want I wanted is that created flowfile has to have the same timestamp in his name as well but i did not know how to capture the attribute which is ${now():format("yyyyMMddHHmmss")} so I can use alter for renaming the flowfile
Basically, I wanted to store the timestamp "at the time I hit the database", I can not use the update processor just before the executeSQL processor to get the timestamp needed (why => because if prior execution is still in process with executeSQL and all the flow files will pass updateattribute processor with the timestamp value and will sit in the queue until executeSQL processor process current thread).
Note - I am running NiFi in standalone mode so I can not run executeSQL in multiple threads.
Any help is highly appreciated. thanks in advance
ExecuteSQLRecord writes an attribute called executesql.query.duration which contains the duration of the query + fetch in milliseconds.
So, we can put an UpdateAttribute processor AFTER the ExecuteSQLRecord that uses ${now():toNumber():minus(${executesql.query.duration})} to get the current time as Epoch Millis, then minus the total query duration, to get the time at which the Query started.
You can then use :format('yyyyMMddHHmmss') to bring it back to the timestamp format you want.
It might be a few milliseconds off of the exact time (time taken to get to the UpdateAttribute processor).
See docs for ExecuteSQLRecord

How to find all the files created by GenerateTableFetch has been processed

We have a flow where GenerateTableFetch takes inpute from splitJson which gives TableName, ColumnName as argument. At once multiple tables are passed as input to GenerateTableFetch and next ExecuteSql executes the query.
Now i want to trigger a new process when all the files for a table has been processed by the below processor (At the end there is PutFile).
How to find that all the files created for a Table has been processed?
You may need NIFI-5601 to accomplish this, there is a patch currently under review at the time of this writing, I hope to get it into NiFi 1.9.0.
EDIT: Adding potential workarounds in the meantime
If you can use ListDatabaseTables instead of getting your table names from a JSON file, then you can set Include Count to true. Then you will get attributes for the table name and the count of its rows. Then you can divide the count by the value of the Partition Size in GTF and that will give you the number of fetches (let's call it X). Then add an attribute via UpdateAttribute called "parent" or something, and set it to ${UUID()}. Keep these attributes in the flow files going into GTF and ExecuteScript, then you can use Wait/Notify to wait until X flow files are received (setting Target Signal Count to ${X}) and using ${parent} as the Release Signal Identifier.
If you can't use ListDatabaseTables, then you may be able to have ExecuteSQLRecord after your SplitJSON, you can execute something like SELECT COUNT(*) FROM ${table.name}. If using ExecuteSQL, you may need a ConvertAvroToJSON, if using ExecuteSQLRecord use a JSONRecordSetWriter. Then you can extract the count from the flow file contents using EvaluateJsonPath.
Once you have the table name and the row count in attributes, you can continue with the flow I outlined above (i.e. determine the number of flow files that GTF will generate, etc.).

How can I filter flow files upon the result of an SQL query?

Would it be possible to route flow files according to the result of an SQL query which returns a single row result? For example, if the result is '1' the flow file will be processed; otherwise, it will be ignored.
Solution
The following approach worked best for me.
Use ExecuteSQL processor in order to run filtering SQL query. The query was written to produce either a single record (match) or an empty record set (no match) in a way suggested by Shu.
Connect ExecuteSQL to RouteOnAttribute processor in order to filter out unmatched flow files using the following value of routing property value ${executesql.row.count:replaceNull(0):gt(0)}
Notice, that the original content of a flow file will be lost after applying ExecuteSQL. It's not an issue in my case, because I do filtering before processing flow file content and my SQL query is based entirely on the flow file attributes and not on its content. Though in a more general scenario, when the flow file content is modified by the incoming part of the flow, one should save file content somewhere (e.g. file system) and restore it after the filtering part has applied.
You can add where clause in your sql query where <field_name> = 1 then we are only going to have output a flowfile when the result value =1.
(or)
Checking the data in NiFi:
We are going to have AVRO format data as the result of SQL query so you can use
option1:ConvertAvroToJson Processor:
Convert the AVRO data into JSON format then extract the value from the json content as attribute using EvaluateJsonPath processor.
Then use RouteOnAttribute processor add new property using NiFi expression language equals function compare the value and route the flowfile to matched relation.
Refer to this link more details regards to EvaluateJsonpath and RouteOnAttribute processor configs.
option2: Using QueryRecord processor:
By using QueryRecord processor we can run SQL queries on the content of the flowfile
Add new property to the processor as
select * from FLOWFILE where <filed_name> =1
Feed the property relation to the other processor
Refer to this link for more details regarding QueryRecord processor usage.

Get Hbase processor filter row by timestamp

I'm trying to use HBase get processor in NIFI, and i want to do this command in the hbase processor is it possible ?
scan 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}
The GetHBase processor is made to do incremental extraction from an HBase table based on the timestamp. The Initial Time Range property determines whether the min time starts at 0 or at the current time, after that the processor is keeping track of the max time seen in the previous execution and using that as the min time in the next execution. So you can't provide your own timerange since the processor is managing that for you.
The GetHBase processor always looks for incremental updates based on the timestamp. Basically it recognizes the new/updated data automatically.
But if you still want to read row specifically for timestamp(s), you have to use regular expression in the following format in the tab "Filter Expression":
TimeStampsFilter(timestamp1,timestamp2....timestampn)
You can find a list of these filters in: https://www.cloudera.com/documentation/enterprise/5-3-x/topics/admin_hbase_filtering.html

Resources