I have 3 flowfiles coming from the same processor.
FF1 -> {a:1,b:2,c:'name'}
FF2 -> {a:1,b:5,c:'fruit'}
FF3 -> {a:2,b:3,c:'abc'}
Using the MergeContent processor I'm able to merge all the flowfiles, but my requirement is to merge flowfiles on a key.
Expected output if I join with Key 'a':
FF1 -> [{a:1,b:2,c:'name'},{a:1,b:5,c:'fruit'}]
FF2 -> [{a:2,b:3,c:'abc'}]
MergeContent has a property called "Correlation Attribute" which is the name of a flow file attribute that will be used to group together flow files with the same value for the attribute (the key in your example).
You will need to extract the value of field "a" into a flowfile attribute using something like EvaluateJsonPath, ExtractText, or a custom scripted processor. Once you have it in an attribute like "my.key", set "my.key" as the Correlation Attribute property.
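To illustrate, here is roughly what that extract-then-correlate sequence does, sketched in plain Python (the attribute name "my.key" is just an example, and this is a simulation of the behavior, not NiFi code):

```python
import json
from collections import defaultdict

# The three flowfile contents from the question
flowfiles = [
    {"a": 1, "b": 2, "c": "name"},
    {"a": 1, "b": 5, "c": "fruit"},
    {"a": 2, "b": 3, "c": "abc"},
]

# Step 1: extract the key into an attribute (what EvaluateJsonPath does
# with a JsonPath of $.a written to an attribute such as "my.key")
attributed = [({"my.key": str(ff["a"])}, ff) for ff in flowfiles]

# Step 2: group by that attribute (what MergeContent does when
# "Correlation Attribute Name" is set to "my.key")
bins = defaultdict(list)
for attrs, content in attributed:
    bins[attrs["my.key"]].append(content)

# One merged flowfile per distinct key
merged = [json.dumps(group) for group in bins.values()]
```

Each entry in `merged` corresponds to one output flowfile, matching the expected output in the question.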
I'm new to Apache NiFi and currently using it to route message data to various locations. I'm looking to add some fields based on a set of conditionals.
Currently I have a GetFile Processor that reads log files ---> ExtractGrok that applies a grok pattern to parse ---> ConvertRecord to convert from Grok to Json. The next part is where I'm stumped/not sure what to do next.
In my json I have a field refresh_time I need to create 2 new fields based on some conditions about the field refresh_time
something along the lines of if refresh_time < 10 then cache = 1; else if refresh_time > 10 then reprocess = 1
The end goal here is numeric fields cache and refresh_time that can be used down the road in aggregations.
What would be the best way to add 2 numerical fields based on a condition. Is there a processor for adding additional fields or updating the record to include new fields?
Thanks.
There are a couple of ways you could achieve this.
One option (More readable)
A QueryRecord would let you write a SQL statement across your Records and let you split them by the result. E.g.
Add a dynamic property called cache with a value SELECT * FROM FLOWFILE WHERE refresh_time < 10.
Add a dynamic property called refresh with a value SELECT * FROM FLOWFILE WHERE refresh_time > 10.
The QueryRecord will now have the relationships failure, original, cache and refresh.
Branching off from cache and refresh will be one UpdateRecord each, with Replacement Value Strategy set to Literal Value.
For the cache relationship, you can add a new dynamic property called cache with a value 1. For the refresh relationship, you can add a new dynamic property called refresh with a value 1.
Similar option (Possibly more performant)
If you want to avoid the additional UpdateRecord, you can add fields in the QueryRecord with something like this:
Two dynamic properties set as:
cache = SELECT *, 1 AS cache FROM FLOWFILE WHERE refresh_time < 10
reprocess = SELECT *, 1 AS reprocess FROM FLOWFILE WHERE refresh_time > 10
This option may be more performant due to fewer disk reads.
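The effect of those two SQL statements can be sketched in plain Python (the sample records are invented; QueryRecord actually runs SQL over the record set, this just mirrors the filter-and-add-field logic):

```python
# Hypothetical records after ConvertRecord
records = [
    {"refresh_time": 5, "msg": "a"},
    {"refresh_time": 15, "msg": "b"},
]

# Equivalent of: SELECT *, 1 AS cache FROM FLOWFILE WHERE refresh_time < 10
cache_out = [{**r, "cache": 1} for r in records if r["refresh_time"] < 10]

# Equivalent of: SELECT *, 1 AS reprocess FROM FLOWFILE WHERE refresh_time > 10
reprocess_out = [{**r, "reprocess": 1} for r in records if r["refresh_time"] > 10]
```

Each list corresponds to the records routed to the matching dynamic relationship, already carrying the new numeric field.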
This gist is an example of the second option, you can import it to NiFi to try it out.
Also, FYI there is a GrokReader that you could use in ConvertRecord to parse with Grok straight to JSON, potentially skipping the ExtractGrok.
I want to get a particular part of an Excel file in NiFi. My NiFi template looks like this:
GetFileProcessor
ConvertExcelToCSVProcessor
PutDatabaseRecordProcessor
I need to parse the data between steps 2 and 3.
Is there a solution for getting specific rows and columns?
Note: if there is an option for cutting rows/columns within ConvertExcelToCSVProcessor, that will work for me.
You can use Record processors between ConvertExcelToCSV and PutDatabaseRecord.
To remove or override a column, use UpdateRecord. This processor can read your data via a CSVReader and prepare output for PutDatabaseRecord or QueryRecord. Check View usage -> Additional Details...
To filter by column, use QueryRecord.
Here is an example. It receives data through a CSVReader and makes some aggregations; you can do some filtering as well, according to the doc.
This post also helped me to understand Records in NiFi.
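The record-level projection and filtering those processors perform can be sketched in plain Python (the CSV content, column names, and the stock > 0 condition are made-up examples, not anything from the question):

```python
import csv
import io

# Hypothetical CSV as produced by ConvertExcelToCSV
raw = "id,name,price,stock\n1,apple,3,10\n2,pear,5,0\n3,plum,2,7\n"

reader = csv.DictReader(io.StringIO(raw))

# Keep only some columns (UpdateRecord-style projection) and
# only rows matching a condition (QueryRecord-style filter)
wanted = ["id", "name"]
rows = [{k: r[k] for k in wanted} for r in reader if int(r["stock"]) > 0]
```

In NiFi the same shaping happens declaratively via the reader/writer schemas and the SQL in QueryRecord, rather than in code.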
I am creating test data using NiFi and publishing it to Kafka. The data format is JSON (which I specified in the custom text of Generate Flow File):
{
"id":1,
"name": "John",
"inst_code":"HBA"
}
Requirement is to have data of a specific range, i.e. the id should be from 0 to 1,000,000 and inst_code should be either O or K.
I achieved this partially by using the following flow (except the upper range limit part):
GenerateFlowFile-> UpdateAttribute -> LookupAttribute -> ReplaceText
Where the UpdateAttribute has the below settings using the Store State Locally option:
Stateful Variables Initial Value->0
lookupnum -> ${random():mod(2):plus(1)}
seq-> ${getStateValue("seq"):plus(1)}
I used a SimpleKeyValueLookupService to look up a random value of O or K based on the value of lookupnum, then used ReplaceText to replace the value of id (using the value of seq) and inst_code (using the value from the lookup service).
This is working, but the one thing I can't work out is how to set an upper limit of 1,000,000 on the seq attribute which I am using for the id field. How can that be achieved?
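For reference, the counter-and-lookup logic described above can be sketched in Python (this is only an illustration of the flow, not NiFi code; the mod-based wraparound at the end is one possible way to cap the sequence, an assumption rather than the asker's configuration):

```python
import random

state = {"seq": 0}  # stand-in for UpdateAttribute's "Store State Locally"
lookup = {"1": "O", "2": "K"}  # stand-in for the SimpleKeyValueLookupService

def next_record():
    # ${random():mod(2):plus(1)} -> "1" or "2"
    lookupnum = str(random.randrange(2) + 1)
    # ${getStateValue("seq"):plus(1)}; mod is one way to cap at 1,000,000
    state["seq"] = (state["seq"] + 1) % 1_000_000
    return {"id": state["seq"], "name": "John", "inst_code": lookup[lookupnum]}
```

In expression-language terms the same wraparound idea would be `${getStateValue("seq"):plus(1):mod(1000000)}`.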
PS: I am unable to add screenshots as it is blocked
With ExecuteSQL/ExecuteSQLRecord processors you can specify "Output Batch Size" which will result in multiple Flow Files. Each Flow File contains executesql.row.count.
Now what is the simplest way to calculate a sum(executesql.row.count) for single table?
It's possible to do AttributesToJSON (drop content) => MergeContent (defragment) => QueryRecord (counting) => EvaluateJSONPath (back to attribute). Still, it's too complex IMHO. Ideally, I would like to add attributes somehow, on/after MergeContent. The issue is that MergeContent drops attributes with the same key but different values. Also, NiFi doesn't have any processor for adding a dynamic number of attributes.
I tried the below and was able to get sum(executesql.row.count):
First UpdateAttribute: initialize the counter to 0.
Second UpdateAttribute: Counter = Counter + executesql.row.count.
In the ADVANCED tab of the second UpdateAttribute, reset the counter after it exceeds a threshold (if required).
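The two UpdateAttribute steps amount to a running sum over the fragment flowfiles (a minimal sketch; the row counts are invented for illustration):

```python
# Hypothetical executesql.row.count values from three output flowfiles
row_counts = [120, 80, 300]

counter = 0  # first UpdateAttribute initializes Counter to 0
for n in row_counts:
    # second UpdateAttribute: Counter = Counter + executesql.row.count
    counter += n

# counter now holds sum(executesql.row.count) for the table
```

The Advanced-tab rule would simply reset `counter` to 0 once it exceeds the chosen threshold.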
I am currently getting files from FTP in Nifi, but I have to check some conditions before I fetch the file. The scenario goes some thing like this.
List FTP -> Check Condition -> Fetch FTP
In the Check Condition part, I have to fetch some values from the DB and compare them with the file name. So can I use UpdateAttribute to fetch some records from the DB and make the flow like this?
List FTP -> Update Attribute (from DB) -> Route on Attribute -> Fetch FTP
I think your flow looks something like below.
Flow:
1. ListFTP //to list the files
2. ExecuteSQL //to execute a query in the db (sample query: select max(timestamp) db_time from table)
3. ConvertAvroToJson //convert the result of ExecuteSQL to JSON format
4. EvaluateJsonPath //set Destination to flowfile-attribute and add a new property db_time as $.db_time
5. RouteOnAttribute //check filename timestamp vs extracted timestamp using NiFi expression language
6. FetchFile //if the condition is true, fetch the file
RouteOnAttribute Configs:
I have assumed the filename is something like fn_2017-08-2012:09:10 and ExecuteSQL has returned 2017-08-2012:08:10.
Expression:
${filename:substringAfter('_'):toDate("yyyy-MM-ddHH:mm:ss"):toNumber()
:gt(${db_time:toDate("yyyy-MM-ddHH:mm:ss"):toNumber()})}
In the expression above, the filename value is the one set by ListFTP and the db_time attribute was added by the EvaluateJsonPath processor; we convert both timestamps to numbers and then compare them.
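The comparison the expression performs can be checked with a quick Python equivalent, using the same example timestamps (`%Y-%m-%d%H:%M:%S` mirrors NiFi's "yyyy-MM-ddHH:mm:ss" format, with no space between date and time):

```python
from datetime import datetime

FMT = "%Y-%m-%d%H:%M:%S"  # matches NiFi's "yyyy-MM-ddHH:mm:ss"

filename = "fn_2017-08-2012:09:10"  # example filename from the answer
db_time = "2017-08-2012:08:10"      # value extracted by EvaluateJsonPath

# ${filename:substringAfter('_')} -> the timestamp part of the filename
file_ts = datetime.strptime(filename.split("_", 1)[1], FMT)
db_ts = datetime.strptime(db_time, FMT)

newer = file_ts > db_ts  # mirrors the :gt() comparison
```

Here the file timestamp (12:09:10) is later than the DB timestamp (12:08:10), so the flowfile would route to the matching relationship.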
Refer to this link for more details on NiFi expression language.
So if I understand your use case correctly, you are using the external DB only for tracking purposes, so only the latest processed timestamp is needed. In that case, I would suggest you use the DistributedMapCache processors and controller services offered by NiFi instead of relying on an external DB.
With this method, your flow would be like:
ListFile --> FetchDistributedMapCache --(success)--> RouteOnAttribute -> FetchFile
Configure FetchDistributedMapCache
Cache Entry Identifier - This is the key for your Cache. Set it to something like lastProcessedTime
Put Cache Value In Attribute - Whatever name you give here will be added as a FlowFile attribute with its value being the Cache value. Provide a name, like latestTimestamp or lastProcessedTime
Configure RouteOnAttribute
Create a new dynamic relationship by clicking the (+) button in the Properties tab. Give it a name, like success or matches. Let's assume your filenames are of the format somefile_1534824139, i.e. a name, an _, and the epoch timestamp appended.
In that case, you can leverage NiFi Expression Language and the functions it offers. For the new dynamic relationship, you can have an expression like:
success - ${filename:substringAfter('_'):gt(${lastProcessedTimestamp})}
This is with the assumption that, in FetchDistributedMapCache, you have configured the property Put Cache Value In Attribute with the value lastProcessedTimestamp.
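Assuming the somefile_1534824139 naming from above, the routing condition boils down to a numeric comparison; a quick Python equivalent (the cached value here is an invented example):

```python
filename = "somefile_1534824139"  # filename format assumed in the answer
last_processed = "1534820000"     # hypothetical value FetchDistributedMapCache
                                  # put into the lastProcessedTimestamp attribute

# Mirrors ${filename:substringAfter('_'):gt(${lastProcessedTimestamp})}
matches = int(filename.split("_", 1)[1]) > int(last_processed)
```

When `matches` is true, the flowfile routes to the success relationship and on to FetchFile.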
Useful Links
https://community.hortonworks.com/questions/83118/how-to-put-data-in-putdistributedmapcache.html
https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#dates