NiFi Flow for Record Enrichment - etl

I am using NiFi 1.11.4 to build a data pipeline where an IoT device sends data in JSON format. Each time the device reports, I receive two JSON payloads:
JSON_INITIAL
{
  "devId": "abc",
  "devValue": "TWOINITIALCHARS23"
}
and
JSON_FINAL
{
  "devId": "abc",
  "devValue": "TWOINITIALCHARS45"
}
There is a time difference of a few milliseconds between these two flowfiles. In my use case, I need to merge the two JSONs so that the resulting JSON looks like the one below (note the removal of TWOINITIALCHARS in both cases):
JSON_RESULT_AFTER_MERGE
{
  "devId": "abc",
  "devValue": "2345"
}
Is this something NiFi should be dealing with? If yes, I would really appreciate an approach to designing a relevant flow for this use case.

Assuming the devId is static for a device and not used for the correlation (i.e. abc for all messages coming from this device, not abc for the first two and then def for the next two, etc.), you have a few options:
Use MergeContent to concatenate the flowfile contents (the two JSON blocks) and ReplaceText to modify the combined content to match the desired output. This will require tuning the MergeContent binning properties to limit the merge window to 1-2 seconds (difficult/insufficient if you're receiving multiple messages per second, for example) and using regular expressions to remove the duplicate content.
Use a custom script to interact with the device JSON output (Groovy, for example, will make the JSON interaction pretty simple).
If you do this within the context of NiFi (via ExecuteScript or InvokeScriptedProcessor), you will have access to the NiFi framework, so you can evaluate flowfile attributes and content, making this easier (there will be attributes for initial timestamp, etc.); a rough sketch follows after these options.
If you do this outside the context of NiFi (via ExecuteProcess or ExecuteStreamCommand), you won't have access to the NiFi framework (attributes, etc.) but you may have better interaction with the device directly.
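For the scripted route, here is a minimal ExecuteScript (Jython) sketch, assuming MergeContent has already concatenated the two JSON payloads with a newline demarcator, that the initial message arrives first, and that the prefix to strip is literally TWOINITIALCHARS:

import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

class MergeDevValues(StreamCallback):
    def process(self, inputStream, outputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        # One JSON object per line (MergeContent newline demarcator assumed)
        parts = [json.loads(line) for line in text.splitlines() if line.strip()]
        merged = {
            "devId": parts[0]["devId"],
            # Strip the shared prefix and concatenate the remaining digits in arrival order
            "devValue": "".join(p["devValue"].replace("TWOINITIALCHARS", "") for p in parts)
        }
        outputStream.write(bytearray(json.dumps(merged).encode("utf-8")))

flowFile = session.get()
if flowFile is not None:
    flowFile = session.write(flowFile, MergeDevValues())
    session.transfer(flowFile, REL_SUCCESS)

Once the approach settles, the same logic can be promoted to InvokeScriptedProcessor (or a custom processor) so it carries its own configurable properties.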

Related

How to count metrics from executions of AWS lambdas?

I have all sorts of metrics I would like to count and later query. For example I have a lambda that processes stuff from a queue, and for each batch I would like to save a count like this:
{
  "processes_count": 6,
  "timestamp": 1695422215,
  "count_by_type": {
    "type_a": 4,
    "type_b": 2
  }
}
I would like to dump these pieces somewhere and later have the ability to query how many were processed within a time range.
So these are the options I considered:
write the JSON to the logs, and later have a component (Beats?) that processes these logs and sends them to a time-series DB.
at the end of each execution, send the data directly to a time-series DB (like Elasticsearch).
What is better in terms of cost / scalability? Are there more options I should consider?
I think CloudWatch Embedded Metric Format (EMF) would be a good fit here. There are client libraries for Node.js, Python, Java, and C#.
CloudWatch EMF allows you to push metrics out of Lambda into CloudWatch in a managed, asynchronous way, so it's a cost-effective and low-effort way of producing metrics.
The client library writes a particular JSON format to stdout; when CloudWatch sees a message in this format, it automatically creates the metrics for you from it.
You can also include key-value pairs in the EMF payload, which allows you to go back and query the data by these keys later.
High-level clients are available with Lambda Powertools in Python and Java.
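For illustration, a minimal sketch using the Python Lambda Powertools metrics client; the namespace, service, metric names, and the per-type naming for count_by_type are all placeholder choices:

from aws_lambda_powertools import Metrics
from aws_lambda_powertools.metrics import MetricUnit

# Placeholder namespace/service names - pick ones matching your own conventions
metrics = Metrics(namespace="QueueProcessing", service="batch-worker")

@metrics.log_metrics  # serializes the collected metrics as EMF JSON to stdout when the handler returns
def handler(event, context):
    records = event.get("Records", [])
    metrics.add_metric(name="ProcessedCount", unit=MetricUnit.Count, value=len(records))

    # One metric per message type, mirroring the count_by_type breakdown above
    # (the "type" message attribute is an assumption about how your messages are tagged)
    counts_by_type = {}
    for record in records:
        msg_type = record.get("messageAttributes", {}).get("type", {}).get("stringValue", "unknown")
        counts_by_type[msg_type] = counts_by_type.get(msg_type, 0) + 1
    for msg_type, count in counts_by_type.items():
        metrics.add_metric(name="ProcessedCount_" + msg_type, unit=MetricUnit.Count, value=count)

CloudWatch then lets you sum these metrics over any time range without running or paying for an ingestion pipeline of your own.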

Apache Flink relating/caching data options

This is a very broad question; I'm new to Flink and looking into the possibility of using it as a replacement for a current analytics engine.
The scenario is: data is collected from various equipment and received as a JSON-encoded string with the format {"location.attribute": value, "TimeStamp": value}.
For example, a unitary traceability code is received for a location, after which various process parameters are received in a real-time stream. The analysis is to be run over the process parameters, but the output needs to include a relation to the traceability code, for example {"location.alarm": value, "location.traceability": value, "TimeStamp": value}.
What method does Flink use for caching values, in this case the current traceability code whilst running analysis over other parameters received at a later time?
I'm mainly just looking for the area to research, as so far I've been unable to find any examples of this kind of scenario. Perhaps it's not the kind of process that Flink can handle.
A natural way to do this sort of thing with Flink would be to key the stream by the location, and then use keyed state in a ProcessFunction (or RichFlatMapFunction) to store the partial results until ready to emit the output.
With a keyed stream, you are guaranteed that every event with the same key will be processed by the same instance. You can then use keyed state, which is effectively a sharded key/value store, to store per-key information.
The Apache Flink training includes some explanatory material on keyed streams and working with keyed state, as well as an exercise or two that explore how to use these mechanisms to do roughly what you need.
Alternatively, you could do this with the Table or SQL API, and implement this as a join of the stream with itself.
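As a rough illustration of the keyed-state pattern, here is a PyFlink sketch (the same structure applies in Java or Scala); the field names and the ingestion/parsing step are assumptions based on the JSON format above:

from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment, KeyedProcessFunction
from pyflink.datastream.state import ValueStateDescriptor

class EnrichWithTraceability(KeyedProcessFunction):
    # Remembers the latest traceability code per location and attaches it
    # to every later parameter event for that same location.
    def open(self, runtime_context):
        self.traceability = runtime_context.get_state(
            ValueStateDescriptor("traceability", Types.STRING()))

    def process_element(self, event, ctx):
        # event is assumed to be a dict like {"location": ..., "attribute": ..., "value": ..., "TimeStamp": ...}
        if event["attribute"] == "traceability":
            self.traceability.update(event["value"])
        else:
            event["traceability"] = self.traceability.value()
            yield event

env = StreamExecutionEnvironment.get_execution_environment()
# events = <parse the incoming JSON strings into dicts, however they are ingested>
# events.key_by(lambda e: e["location"]).process(EnrichWithTraceability())

Keying by location guarantees that the traceability update and the later parameter events for that location are handled by the same parallel instance, so the stored code is available when the parameters arrive.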

In NiFi, does usage of the EvaluateJsonPath processor have a performance impact because of attribute creation?

I'm trying to integrate the NiFi REST APIs with my application. By mapping inputs and outputs from my application, I call the NiFi REST API to create flows. In my use case, most of the time I will extract JSON values and apply Expression Language to them.
So, to simplify all the use cases, I use an EvaluateJsonPath processor to fetch all the attributes via JSONPath and then apply Expression Language functions on them in an extract processor. Below is the flow diagram for that.
Is this the right approach? For JSON-to-JSON manipulation with about 30 keys it is the simplest way, and because I am integrating the NiFi REST APIs with my application I cannot generate JOLT transformation logic dynamically based on the user mapping.
So, in this case, does using the EvaluateJsonPath processor create performance issues across roughly 50 use cases with different transformation logic? The documentation suggests that heavy attribute usage can cause performance (memory) issues.
Your concern about having too many attributes in memory should not be an issue here; having 30 attributes per flowfile is higher than usual, but if these are all strings of roughly 200 characters or fewer, there should be minimal impact. If you start trying to extract KBs worth of data from the flowfile content into attributes on each flowfile, you will see increased heap usage, but the framework should still be able to handle this until you reach very high throughput (thousands of flowfiles per second on commodity hardware like a modern laptop).
You may want to investigate ReplaceTextWithMapping, as that processor can load from a definition file and handle many replace operations using a single processor.
It is usually a flow design "smell" to have multiple copies of the same flow process with different configuration values (with the occasional exception of database interaction). Rather, see if there is a way you can genericize the process and populate the relevant values for each flowfile using variable population (from the incoming flowfile attributes, the variable registry, environment variables, etc.).

NiFi processor to route flows based on a changeable list of regexes

I am trying to use NiFi to act as a router for syslog, based on a list of regexes matching syslog.body (NB: as this is just a proof of concept, I can change any part if needed).
The thought process is that via a separate system (for now, vi and a text file 😃) an admin can define a list of criteria (regex format for each seems sensible) which, if matched, would result in syslog messages being sent to a specific separate system (for example, all critical audit data matched by the regex list is sent to the audit system, and all other data goes to the standard log store).
I know that this can be done with the route-by-content processors, but their properties are configured before the processor starts, and an admin would have to stop the processor every time they need to make an edit.
I would like to load the list of regexes periodically (and automatically) and have the processor properties updated.
I don't mind whether this is done natively in NiFi (which is preferable, for elegance and to save writing an external app) or via REST API calls driven by a Python script or something (or can NiFi send REST calls to itself?!).
I appreciate that a processor property cannot be updated while the processor is running, so it would have to be stopped to be updated, but that's fine as the queue will buffer for the brief period. Maybe a check to see whether the file has changed could avoid stopping it for no reason, rather than updating periodically regardless; I can solve that problem later.
Thanks
Chris
I think the easiest solution would be to use ScanContent, a processor which specifies a dictionary file on disk which contains a list of search terms and monitors the file for changes, reloading in that event. The processor then applies the search terms to the content of incoming flowfiles and allows you to route them based on matches. While this processor doesn't support regular expressions as dictionary terms, you could make a slight modification to the code or use this as a baseline for a custom processor with those changes.
If that doesn't work for you, there are a number of LookupService implementations which show how CSV, XML, property files, etc. can be monitored and read by the controller framework to provide an updated mapping of key/value pairs. These can also serve as a foundation for building a more complicated scan/match flow using the loaded terms/patterns.
Finally, if you have to rely on direct processor property updating, you can script this with the NiFi API calls to stop, update, and restart the processors so it can be done in near-real-time. To determine these APIs, visit the API documentation or execute the desired tasks via the UI in your browser and use the Developer Tools to capture the HTTP requests being made.
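As a loose sketch of that last option, assuming an unsecured local NiFi instance, a routing processor whose id you already know, and a simple name=regex file (the placeholder names below are illustrative only):

import requests

NIFI = "http://localhost:8080/nifi-api"        # assumption: unsecured local instance
PROCESSOR_ID = "your-routing-processor-id"     # e.g. a RouteText/RouteOnContent processor

def get_processor():
    return requests.get("%s/processors/%s" % (NIFI, PROCESSOR_ID)).json()

def set_run_status(state):
    entity = get_processor()
    requests.put("%s/processors/%s/run-status" % (NIFI, PROCESSOR_ID),
                 json={"revision": entity["revision"], "state": state}).raise_for_status()

def update_properties(new_properties):
    set_run_status("STOPPED")
    entity = get_processor()                   # re-fetch to pick up the new revision
    body = {"revision": entity["revision"],
            "component": {"id": PROCESSOR_ID, "config": {"properties": new_properties}}}
    requests.put("%s/processors/%s" % (NIFI, PROCESSOR_ID), json=body).raise_for_status()
    set_run_status("RUNNING")

# Example: read "name=regex" lines from a file and push them as dynamic
# (per-rule) properties of the routing processor.
with open("routes.properties") as f:
    patterns = dict(line.strip().split("=", 1) for line in f if "=" in line)
update_properties(patterns)

In a secured cluster you would also need to handle authentication (e.g. a bearer token or client certificate); the Developer Tools approach above is the quickest way to confirm the exact calls and payloads your NiFi version expects.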

Which processor should I use to combine different JSON inputs into one object?

I'd like to set up a dataflow that takes in multiple JSON inputs, combines them into a single JSON object with multiple properties (I'm currently using a few GenerateFlowFile processors to generate the inputs), and sends the data every 10 seconds via the PublishMQTT processor.
The inputs come in at different intervals (1-5 seconds), and examples are:
{"temperature": 60}
{"pressure": 30}
I would like to compile the incoming data into one object i.e. {"temperature": 60,"pressure": 30} before sending it to the PublishMQTT processor.
Also, if fresh data with the same attribute comes in before the message is sent, it should update the attribute in the same object instead of being queued, i.e. if new data {"pressure": 150} arrived, the output object should be updated to {"temperature": 60,"pressure": 150} before it is sent out via MQTT.
I'm guessing that I will need a processor (see the blue circle in the attached image), but I'm not sure which processor(s) would do what I've described.
There isn't really a provided processor that can do this because it requires some knowledge of your data. You would have to implement a custom processor, or use ExecuteScript.
You can use the Wait/Notify pattern to have one flowfile wait for another, as in this example:
https://pierrevillard.com/2018/06/27/nifi-workflow-monitoring-wait-notify-pattern-with-split-and-merge/comment-page-1/
Note that the link is only an example; your use case is different, and the flow must be adapted to your requirements.
Then you can merge the information into one flowfile and use Expression Language to generate the new JSON value.
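For instance, a rough ExecuteScript (Jython) sketch along those lines, assuming each incoming flowfile contains a single-key JSON object like the examples above; later values overwrite earlier ones for the same key, which approximates the "latest value wins" behaviour when this runs ahead of PublishMQTT:

import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import OutputStreamCallback

# Pull up to 10 queued flowfiles and fold their JSON bodies into one object,
# with later (newer) values overwriting earlier ones for the same key.
flowFiles = session.get(10)
if flowFiles:
    combined = {}
    for ff in flowFiles:
        stream = session.read(ff)
        try:
            combined.update(json.loads(IOUtils.toString(stream, StandardCharsets.UTF_8)))
        finally:
            stream.close()

    class WriteCombined(OutputStreamCallback):
        def process(self, outputStream):
            outputStream.write(bytearray(json.dumps(combined).encode("utf-8")))

    result = session.create()
    result = session.write(result, WriteCombined())
    session.transfer(result, REL_SUCCESS)
    for ff in flowFiles:
        session.remove(ff)

Scheduling this processor on a roughly 10-second timer in front of PublishMQTT gives approximately the batching behaviour described in the question.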
