Record Oriented InvokeHTTP Processor - apache-nifi

I have a CSV file:
longtitude,lagtitude
34.094933,-118.30674
34.095028,-118.306625
(more to go)
I use the UpdateRecord processor (which supports record processing) with a CSVRecordSetWriter, using RecordPath (https://nifi.apache.org/docs/nifi-docs/html/record-path-guide.html) to prepare a gis field:
longtitude,lagtitude,gis
34.094933,-118.30674,"34.094933,-118.30674"
34.095028,-118.306625,"34.095028,-118.306625"
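For reference, the UpdateRecord configuration that builds that field can be a single user-defined property whose name is the new field and whose value is a RecordPath expression. A rough sketch (the reader/writer names are whatever controller services you already use; concat is the RecordPath function described in the guide linked above):
UpdateRecord
  Record Reader = CSVReader
  Record Writer = CSVRecordSetWriter
  Replacement Value Strategy = Record Path Value
  /gis = concat(/longtitude, ',', /lagtitude)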
My next step is to pass gis as an input parameter to an HTTP API, which returns info (poi) that I would like to store:
longtitude,lagtitude,gis,poi
34.094933,-118.30674,"34.094933,-118.30674","Restaurant A"
34.095028,-118.306625,"34.095028,-118.306625","Cinema X"
It seems like the InvokeHTTP processor does not process data in a record-oriented way. Is there any possible solution to produce the above without splitting the flow file further?

Enriching each record like this is typically handled in NiFi by using the LookupRecord processor with a LookupService. It basically says: for each record in the incoming flow file, pass some fields of the record to the lookup service, take the results of the lookup, and store them back in the record.
For your example it sounds like you would want a RestLookupService:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-lookup-services-nar/1.9.1/org.apache.nifi.lookup.RestLookupService/index.html
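A rough sketch of how the two pieces fit together (the /poi result path, the coordinate key name, and the example URL are placeholders, and exactly how the coordinate is fed into the URL's Expression Language should be verified against the RestLookupService docs):
LookupRecord
  Record Reader = CSVReader
  Record Writer = CSVRecordSetWriter
  Lookup Service = RestLookupService
  Result RecordPath = /poi
  gis = /gis (user-defined property: coordinate key -> RecordPath)
RestLookupService
  URL = http://your-poi-service/lookup?gis=${gis} (hypothetical endpoint)
  Record Reader = JsonTreeReader (parses the API response)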

Related

The ExecuteSQL processor doesn't work after connecting it to another processor

When I don't connect any processor as an incoming one, ExecuteSQL works perfectly fine, as in the screenshot:
Screenshot#1
But when I connect it to another processor, there are no flow files coming out of the ExecuteSQL processor:
Screenshot#2
Does anyone know how I could make it work? Thank you in advance :-)
Check the NiFi docs and you'll find this description:
Executes provided SQL select query. Query result will be converted to Avro format. Streaming is used so arbitrarily large result sets are supported. This processor can be scheduled to run on a timer, or cron expression, using the standard scheduling methods, or it can be triggered by an incoming FlowFile. If it is triggered by an incoming FlowFile, then attributes of that FlowFile will be available when evaluating the select query, and the query may use the ? to escape parameters. In this case, the parameters to use must exist as FlowFile attributes with the naming convention sql.args.N.type and sql.args.N.value, where N is a positive integer. The sql.args.N.type is expected to be a number indicating the JDBC Type. The content of the FlowFile is expected to be in UTF-8 format. FlowFile attribute 'executesql.row.count' indicates how many rows were selected.
It tells you that you have to set some special attributes when triggering via a flow file,
something like sql.args.1.type and sql.args.1.value.
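For example, a parameterized query plus an UpdateAttribute upstream of ExecuteSQL might look like this (the table and attribute names are only illustrative; JDBC type 4 is INTEGER, 12 is VARCHAR):
ExecuteSQL
  SQL select query = SELECT * FROM orders WHERE customer_id = ?
UpdateAttribute (upstream of ExecuteSQL)
  sql.args.1.type = 4
  sql.args.1.value = ${customer.id}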

NiFi: Auditing data using provenance data

Hi, I am new to NiFi and I have followed the tutorial here to understand the provenance repository content and how to move it out for auditing. But I have a couple of questions.
The main use of provenance data is to understand what exactly happened to a piece of data. But here the data is in a flow file. How are we supposed to understand what happened to a particular piece of data using the flow file?
Is the best practice to always send provenance data from one NiFi instance to another? Why not use the SiteToSiteProvenanceReportingTask to send it to a port on the same NiFi instance and extract it from there?
What would be the best tools for sending this data out for auditing?
Hopefully this answers your questions:
You can export the provenance data in many ways. To extract the content of the flow file from the provenance event, I believe you have to get at the "content claims" for the flow file; I'm not sure how that works. Because content claims are reclaimed when no flow file in the current system is using them, I don't think you can query a provenance event's content once that content no longer exists in the content repository. Some components will add an attribute for any errors/status they encounter.
You can certainly use a SiteToSiteProvenanceReportingTask to send provenance data from a cluster back to itself; you probably just want to filter out the Input Port and Process Group that handle the processing of provenance data.
Data provenance is sometimes a graph problem, but the events are often useful on their own (without needing to know the flow, for example), so analysis can be done on the events themselves. I've sent the events to a Hive table and was then able to do some things with HiveQL, like calculating predicted backpressure on connections (before we added it to NiFi proper).

NiFi record counts

I am getting files from a remote server using NiFi; my files are as follows:
timestamp (ms), nodeID,value
12345,x,12.4
12346,x,12.7
12348,x,13.4
12356,x,13.6
12355,y,12.0
Right now I just get, fetch, and split the lines and send them to Kafka. But beforehand, I need to apply a checksum approach on my records and aggregate them based on timestamp. What I need to do is add an additional column to my content and count the records based on aggregated timestamps, for example aggregating per 10 milliseconds and per nodeID:
timestamp (ms), nodeID,value, counts
12345,x,12.4,3
12346,x,12.7,3
12348,x,13.4,3
12356,x,13.6,1
12355,y,12.0,1
How do I do the above in NiFi? I am totally new to NiFi but need to add the above functionality to my NiFi flow. I am currently using the NiFi flow below.
This may not answer your question directly, but you should consider refactoring your flow to use the "record" processors. It would greatly simplify things and would probably get you closer to being able to do the aggregation.
The idea is to not split up the records, and instead process them in place. Given your current flow, the 4 processors after FetchSFTP would likely change to a single ConvertRecord processor that converts CSV to JSON. You would first need to define a simple Avro schema for your data.
Once you have the record processing set up, you might be able to use PartitionRecord to partition the records by the node ID; from there, the missing piece would be how to count by the timestamps.
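For example, a minimal Avro schema for the data above might look like this (the record name is arbitrary, and you may prefer to keep the timestamp as a string):
{
  "type": "record",
  "name": "node_reading",
  "fields": [
    { "name": "timestamp", "type": "long" },
    { "name": "nodeID", "type": "string" },
    { "name": "value", "type": "double" }
  ]
}
PartitionRecord would then get a user-defined property such as node = /nodeID, so that each outgoing flow file contains records for a single node and carries a "node" attribute you can group on.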
Some additional resources...
https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
https://www.slideshare.net/BryanBende/apache-nifi-record-processing

Best approach to determine Oracle INSERT or UPDATE using NiFi

I have a JSON flow file and I need to determine whether I should be doing an INSERT or an UPDATE. The trick is to only update the columns that match the JSON attributes. I have an ExecuteSQL working and it returns executesql.row.count; however, I lose the original JSON flow file, which I was planning to use with RouteOnAttribute. I'm trying to get MergeContent to join the ExecuteSQL output (dumping the Avro output, since I only need the executesql.row.count attribute) with the JSON flow. I've set the following before I do the ExecuteSQL:
fragment.count=2
fragment.identifier=${UUID()}
fragment.index=${nextInt()}
Alternatively, I could create a MERGE, if there is a way to loop through the list of JSON attributes that match the Oracle table.
How large is your JSON? If it's small, you might consider using ExtractText (matching the whole document) to get the JSON into an attribute. Then you can run ExecuteSQL, then ReplaceText to put the JSON back into the content (overwriting the Avro results). If your JSON is large, you could set up a DistributedMapCacheServer and (in a separate flow) run ExecuteSQL and store the value of executesql.row.count into the cache. Then in the JSON flow you can use FetchDistributedMapCache with the "Put Cache Value In Attribute" property set.
If you only need the JSON to use RouteOnAttribute, perhaps you could use EvaluateJsonPath before ExecuteSQL, so your conditions are already in attributes and you can replace the flow file contents.
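For instance, an EvaluateJsonPath configured roughly like this (the attribute name and JSON path are placeholders) pulls the routing condition into an attribute before the content is replaced:
EvaluateJsonPath
  Destination = flowfile-attribute
  Return Type = auto-detect
  customer.id = $.customerId (user-defined property: attribute name -> JSON path)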
If you want to use MergeContent, you can set fragment.count to 2, but rather than using the UUID() function, you could set "parent.identifier" to "${uuid}" using UpdateAttribute, then DuplicateFlowFile to create 2 copies, then UpdateAttribute to set "fragment.identifier" to "${parent.identifier}" and "fragment.index" to "${nextInt():mod(2)}". This gives a mergeable set of two flow files: you can route on fragment.index being 0 or 1, sending one to ExecuteSQL and one through the other flow, then joining back up at MergeContent.
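Sketched out, that sequence might look like the following (property names as in recent NiFi versions; verify against yours):
UpdateAttribute (first)
  parent.identifier = ${uuid}
DuplicateFlowFile
  Number of Copies = 1 (original + 1 copy = 2 flow files)
UpdateAttribute (second)
  fragment.identifier = ${parent.identifier}
  fragment.index = ${nextInt():mod(2)}
  fragment.count = 2
MergeContent
  Merge Strategy = Defragment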
Another alternative is to use ConvertJSONToSQL set to "UPDATE", and if it fails, route those flow files to another ConvertJSONToSQL processor set to "INSERT".

Lookup using spring-xd

I am looking for a way to perform a lookup operation using Spring XD.
My problem statement goes like this:
I have a stream of JSON events coming in, and I want the values of the events looked up against threshold values in a file in HDFS or directly from an RDBMS.
Please suggest a way to do this.
Thank you in advance.
If I understand this correctly, you have different thresholds for different values in your messages.
Something like
value 'A' -> 100
value 'B' -> 200
...
This information is stored in a file or in a relational database. Now you want to filter the events based on their values and the corresponding thresholds.
I guess you would have to write a custom processor that holds a connection to the database where these values are stored, and queries them. If the mapping is small enough you should consider caching it, or at least cache the most frequently used values, so that this does not slow down your stream.
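A minimal sketch of that idea in plain Java, not tied to Spring XD's module API (the JDBC URL handling, table name, and column names are assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ThresholdLookup {
    private final String jdbcUrl;
    private final Map<String, Double> cache = new ConcurrentHashMap<>();

    public ThresholdLookup(String jdbcUrl) {
        this.jdbcUrl = jdbcUrl;
    }

    // Returns the threshold for a key, hitting the database only on a cache miss.
    public double thresholdFor(String key) throws SQLException {
        Double cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT threshold FROM thresholds WHERE value_key = ?")) {
            ps.setString(1, key);
            try (ResultSet rs = ps.executeQuery()) {
                double threshold = rs.next() ? rs.getDouble(1) : Double.MAX_VALUE;
                cache.put(key, threshold);
                return threshold;
            }
        }
    }

    // Filter decision for the stream: keep an event only if its value exceeds the threshold.
    public boolean exceedsThreshold(String key, double value) throws SQLException {
        return value > thresholdFor(key);
    }
}

In a real module you would pull connections from a pooled DataSource configured via module options rather than opening one through DriverManager on every miss.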
If I understand your question, you can write a Groovy processor to receive the payload, filter it, and then pass it to wherever you want, like HDFS.
stream create --name <stream-name> --definition "jdbc | groovyprocessor | hdfs" --deploy
In the case of batch, you will need to write a custom module.
Moha
