merging of flow files in the specified order - apache-nifi

I am new to NiFi (using version 1.8.0). I have a requirement to consume Kafka messages, each of which contains a vehicle position as a lat,lon pair. Since each message arrives as a separate flow file, I need to merge all of these flow files into a single JSON file containing the complete path followed by the vehicle. I am using a ConsumeKafka processor to subscribe to the messages, an UpdateAttribute processor (properties added are filename:${getStateValue("seq")} and seq:${getStateValue("seq"):plus(1)}) to add a sequence number as the filename (e.g. 1, 2, 3, etc.), and a PutFile processor to write these files to a specified directory. I have configured the FIFO prioritizer on all the success relationships between the processors mentioned above. Once I have received all the messages, I want to merge all the flow files. For this I understand I should use GetFile, EnforceOrder, MergeContent (Merge Strategy: Bin-Packing Algorithm, Merge Format: Binary Concatenation) and PutFile, respectively. Is my approach correct? How do I ensure that the files are merged in the order of their names, given that each filename is a sequence number? What should I put in the Order Attribute of the EnforceOrder processor? What should I put in Group Identifier? Are there any other fields to configure in the EnforceOrder processor?

From the EnforceOrder processor documentation:
1. Group Identifier
This property is evaluated against each flow file. For your case, use an UpdateAttribute processor to add a group_name attribute, and reference that same attribute as ${group_name} in the Group Identifier property value.
2. Order Attribute
Expression Language is not supported here. You can use filename, or create a new attribute in the UpdateAttribute processor and put that attribute's name in the Order Attribute property value.
For a reference/usage example of the EnforceOrder processor, use this template and upload it to your NiFi instance.
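As a rough sketch of that setup (the attribute names group_name and seq are just placeholders, not required names), the two processors could be configured like this:

UpdateAttribute (dynamic properties):
    filename   = ${getStateValue("seq")}
    seq        = ${getStateValue("seq"):plus(1)}
    group_name = vehicle_path

EnforceOrder:
    Group Identifier = ${group_name}
    Order Attribute  = seq

The Order Attribute value must resolve to an integer for each flow file, which the sequence number already is. Downstream, MergeContent should then keep that order when binning, as long as the connection from EnforceOrder also uses the FirstInFirstOutPrioritizer.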

Related

Can I filter based on a value's suffix rather than a prefix in my GCP pub/sub subscription?

This is the attribute and value that my Pub/Sub subscription is expected to filter on:
'objectId': 'event-notifications-test/_MANIFEST'
However, I am not concerned about the prefix of the value (event-notifications-test/) changing - I only want to filter the message from the topic if it contains '_MANIFEST'. If I was interested in the prefix, I expect I would need to use something like this:
hasPrefix(attributes.name, "co")
How can I filter the message based on the suffix of the value i.e. '_MANIFEST'?
It is not possible to filter directly on a suffix, no. Your two options are:
1. When publishing, write the suffix out as a separate attribute.
2. Filter at the application level when you receive a message, by checking the suffix and acking the message without processing it.
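A minimal sketch of the second option with the Java Pub/Sub client; the project and subscription names here are made up:

import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class ManifestOnlySubscriber {

    public static void main(String[] args) {
        // hypothetical project and subscription names
        ProjectSubscriptionName subscription =
                ProjectSubscriptionName.of("my-project", "my-subscription");

        MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
            String objectId = message.getAttributesMap().getOrDefault("objectId", "");
            if (objectId.endsWith("_MANIFEST")) {
                // only handle messages whose objectId ends with _MANIFEST
                System.out.println("Processing " + objectId);
            }
            // ack everything so unwanted messages are not redelivered
            consumer.ack();
        };

        Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
        subscriber.startAsync().awaitRunning();
        subscriber.awaitTerminated();
    }
}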

Is there any way to make mathematical operations for some values in files with apache nifi?

I am getting some numerical data from a URL via an API, and I am looking for a way to do some mathematical operations on it in Apache NiFi before writing the data to a file directory. Thanks in advance.
By the way, I am using an InvokeHTTP processor to get the data and a PutFile processor to write it out. I searched some related websites but could not find a working approach.
Try using the QueryRecord processor and define Record Reader/Writer controller services to read/write the flowfile.
Add a new dynamic property to the QueryRecord processor containing an Apache Calcite SQL query with your mathematical operations on the flowfile.
The results of the SQL query will be written to the outgoing flowfile in your desired output format.
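For instance (the column names here are invented), a dynamic property named converted on QueryRecord could hold a Calcite query such as:

converted = SELECT id, reading_f, (reading_f - 32) * 5.0 / 9.0 AS reading_c FROM FLOWFILE

Flow files routed to the converted relationship then contain the computed column, serialized by the configured Record Writer.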
Ultimately the answer depends on whether the data you're working with is in the content of the FlowFile or in the attributes. If the data is small enough and it's only a couple operations, the suggested approach would be to work with the data as attributes and use NiFi's expression language to do the transformations.
There is a section on mathematical operations[1] in the Apache documentation[2]. The operations range from simple operators like plus/minus to exposing the java.lang.Math static methods.
[1] https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#numbers
[2] https://nifi.apache.org/docs.html
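For example, assuming the numeric values have already been pulled into flowfile attributes (the attribute names below are made up), UpdateAttribute properties can compute derived values with Expression Language:

reading_c = ${reading_f:toDecimal():minus(32):divide(1.8)}
root      = ${value:toDecimal():math("sqrt")}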
You can try ExecuteStreamCommand if you want to take in the whole file and then run operations on it. Alternatively, you can work with the attributes of the flowfile, depending on how large your operation is.
For example, if you have some initial values, you can include them in the filename, extract them into flowfile attributes, run the operations on those attributes, and then append the results to the original file content.

Apache NiFi to split data based on condition

Our requirement is to split the flow data based on a condition.
We thought to use the ExecuteStreamCommand processor for that (internally it would use a Java class), but it produces only a single output flow file. We would like to have two flow files: one for data matching the criteria and another for data that does not match.
I looked at the RouteText processor, but it has no feature to use a Java class as part of it.
Let me know if anyone has any suggestions.
I think you could use GetMongo to read those definition values and store them in a map accessed by DistributedMapCacheClientService, then use RouteOnContent to route the incoming flowfiles based on the absence/presence of the retrieved values.
If that doesn't work, you could instead route the query result from GetMongo to PutFile and then use ScanContent, which reads from a dictionary file on the file system and routes flowfiles based on the absence/presence of those keywords in the content.
Finally, if all else fails, you can use ExecuteScript to combine those steps into a single processor and route to matched/unmatched relationships. It processes Groovy code easily, so you can directly invoke your existing Java class if necessary.
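As a rough illustration of the RouteOnContent step (the keyword and relationship name below are invented):

RouteOnContent:
    Match Requirement = content must contain match
    matched           = VEHICLE_ALERT        (dynamic property: relationship name -> regex to search for)

Flow files whose content matches the regex go to the matched relationship; everything else goes to the built-in unmatched relationship.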

InferAvroSchema Avro Record Name based on flow attribute

I have a common process group that infers an Avro schema based on the file I supply. I want to set the Avro Record Name to a name corresponding to the filename I am supplying, so I used ${filename}. But InferAvroSchema fails with an error saying the record name is empty. Note that before this step I already set the filename attribute on the flowfile and it has a value; I verified this by using ReplaceText to check that ${filename} resolves.
Unfortunately this looks like a bug in InferAvroSchema. Many of the properties support expression language, but then the processor doesn't evaluate them against the incoming flow file. So it ends up only being able to use a value typed directly into the property (non-EL), or a value from system or environment properties which doesn't really make sense for a lot of these properties.
I created this JIRA for the issue:
https://issues.apache.org/jira/browse/NIFI-2465
The fix is that all of the calls to evaluateAttributeExpressions() should be passing in a flow file like:
context.getProperty(CSV_HEADER_DEFINITION).evaluateAttributeExpressions(inputFlowFile).getValue()
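In other words, the difference is just the flow file argument (CSV_HEADER_DEFINITION taken from the snippet above):

// current behaviour: EL is evaluated without the incoming flow file, so ${filename} is empty
context.getProperty(CSV_HEADER_DEFINITION).evaluateAttributeExpressions().getValue()

// proposed fix: EL is evaluated against the incoming flow file, so ${filename} resolves
context.getProperty(CSV_HEADER_DEFINITION).evaluateAttributeExpressions(inputFlowFile).getValue()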

Flume - Can an entire file be considered an event in Flume?

I have a use case where I need to ingest files from a directory into HDFS. As a POC, I used simple directory spooling in Flume, where I specified the source, sink and channel, and it works fine. The disadvantage is that I would have to maintain multiple directories for multiple file types that go into distinct folders in order to get greater control over file sizes and other parameters, which makes the configuration repetitive, though easy.
As an alternative, I was advised to use regex interceptors, where multiple files would reside in a single directory and, based on a string in each file, would be routed to the appropriate directory in HDFS. The kind of files I am expecting are CSV files where the first line is the header and the subsequent lines are comma-separated values.
With this in mind, I have a few questions.
How do interceptors handle files?
Given that the header line of one CSV would be something like ID, Name followed by IDs and names in the next lines, and another file in the same directory would have Name, Address followed by names and addresses, what would the interceptor and channel configuration look like to route them to different HDFS directories?
How does an interceptor handle the subsequent lines that clearly do not match the regex?
Would an entire file even constitute one event or is it possible that one file can actually be multiple events?
Please let me know. Thanks!
For starters, Flume doesn't work on files as such, but on a thing called events. Events are Avro structures which can contain anything; usually that is a single line, but in your case it might be an entire file.
An interceptor gives you the ability to extract information from your event and add it to that event's headers. The latter can be used to configure a target directory structure.
In your specific case, you would want to code a parser that analyses the content of your event and sets a header value, for instance subpath:
if (line.contains("Address")) {
    event.getHeaders().put("subpath", "address");   // address files go under .../address
} else if (line.contains("ID")) {
    event.getHeaders().put("subpath", "id");        // id files go under .../id
}
You can then reference that header in your hdfs-sink configuration as follows:
hdfs-a1.sinks.hdfs-sink.hdfs.path = hdfs://cluster/path/%{subpath}
As to your question whether an entire file can constitute one event: yes, that's possible, but not with the spooling directory source. You would have to implement a client class which speaks to a configured Avro source, pipe each file into a single event, and send that off. You could then also set the headers there instead of using an interceptor.
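A rough sketch of that client approach, using the Flume SDK to send one file as a single event to an Avro source (the host, port, file path, and header value are assumptions):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class WholeFileAvroClient {

    public static void main(String[] args) throws Exception {
        // hypothetical host/port of an Avro source configured in the agent
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            Path file = Paths.get("/data/incoming/addresses.csv"); // hypothetical input file
            Event event = EventBuilder.withBody(Files.readAllBytes(file)); // whole file as one event
            event.getHeaders().put("subpath", "address"); // header used by the hdfs.path expression
            client.append(event);
        } catch (EventDeliveryException e) {
            e.printStackTrace();
        } finally {
            client.close();
        }
    }
}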
