Lookup using spring-xd - spring-xd

I am looking for way to perform lookup operation using spring-xd.
My problem statement goes like this,
I have a stream of JSON events coming in, I want to have the values of events looked-up against the threshold values in my file in HDFS or directly from RDBMS.
Please suggest a way to perform this.
Thanking you in advance.

If I understand this correctly, you have different thresholds for different values in your messages.
Something like
value 'A' -> 100
value 'B' -> 200
...
This information is stored in a file or in a relational database. Now you want to filter the events based on their values and the corresponding thresholds.
I guess you would have to write a custom processor that holds a connection to the database where these values are stored, and queries them. If the mapping is small enough you should consider cashing it, or at least cache the most frequently used values, such that this does not slow down your stream.

If I understand your question, you can write a groovy processor to receive the payload, filter it, and then pass it to where ever you want like hdfs.
stream --name --def "jdbc | groovyprocessor | hdfs" --deploy
In the case of batch, you will need to write a custom module.
Moha

Related

Apache NiFi how to check output from each processor

I am new to using Apache NiFi, and I trying to create a template that takes a JSON file and turns it into a set of SQL insert statements.
So far I have created a template that takes the JSON file and I have got it to the point of PutSQL. There is no database to connect to at the moment, but what I have not been able to check is the output. Can this be done? What I need to check is whether the array of JSON has been turned into a INSERT per element in the array
As far as inpecting the output, what does your flow look like? If you have something like ConvertJSONToSQL -> PutSQL, you can leave PutSQL stopped and run ConvertJSONToSQL, then you will see FlowFile(s) in the connection between the two processors. Then you can right-click on the connection and choose List Queue, then click the "eye" icon on the right for the FlowFile you wish to inspect. That will show you the contents of the FlowFile right before it goes into PutSQL.
Having said all that, if your JSON file contains fields that correspond to columns in your database, consider PutDatabaseRecord instead of ConvertJSONToSQL -> PutSQL. That can use a JsonTreeReader to parse each record, and it will generate and execute the necessary SQL as a prepared statement using the values in all records of the FlowFile. That way you don't need to generate the SQL yourself or worry about fragmented transactions or any of that.

CSV Blob Sink - Skip Writing File when 0 Rows Present

This is a relatively simple problem with (I'm hoping) a similarly-simple solution.
In my ADF ETLs, any time there's a known and expected yet unrecoverable row-based error, I don't want my full ETL to fail. Instead, I'd rather pipe those rows off to a log, which I can then pick up at the end of the ETL for manual inspection. To do this, I use conditional splits.
Most of the time, there shouldn't be any rows like this. When this is the case, I don't want my blob sink to write a file. However, the current behavior writes a file no matter what -- it's just that the file only contains the table header.
Is there a way to skip writing anything to a blob sink when there are no input rows?
Edit: Somehow I forgot to specify -- I'm specifically referring to a Mapping Data Flow with a blob sink.
You can use Lookup activity(don't check first row only) to get all your table data firstly. Then use If condition to check the count of Lookup activity's output. If its count > 0, execute next activity(or data flow).

How to wait for GenerateTableFetch queries to finish

My use case is like this. I have some X tables to be pulled from MySQL. I am splitting them using SplitText to put each table in a individual flow file and pull using GenerateTableFetch and ExecuteSQL.
And I want to be notified or put some other action when import is done for all the tables. At SplitText text processor I have routed original relationship to Wait on ${filename} with target count ${fragment.count}. This will track how many tables are done.
But now I am not able to figure out how to know when a particular table is done. GenerateTableFetch forks flow file into multiple based on Partition Size. But it does not write attributes like fragment.count which I can use to wait on for each table.
Is there a way I can achieve this? Or maybe is there a way to know at the end of the entire flow if all flow files in the flow have been processed and nothing is in queue or being processed?
If you have a standalone instance of NiFi (or are not distributing the flow files among a cluster to ExecuteSQL nodes), then you could use QueryDatabaseTable instead, it (by default) will only issue all flow files when the entire result set is processed. If you have all the rows go into a single flow file, then the fact that the flow file has been transferred downstream is an indication that the fetch is complete.
I have written NIFI-5601 to cover the improvement of adding fragment.* attributes to flow files generated by GTF.
Till NiFi add's support for this, I managed to make it work using MergeContent. Use table_name as Correlation attribute name and then use merged relation to Wait processor using ${merge.count} as target. Refer screenshots if someone is looking to do the same.

Nifi record counts

I am getting files from remote server using Nifi: my files are as follow:
timestamp (ms), nodeID,value
12345,x,12.4
12346,x,12.7
12348,x,13.4
12356,x,13,6
12355,y,12.0
I am now just get and fetch and split lines and send them to Kafka, but before hand, I need to apply a checksum approach on my records and aggregate them based on time stamp, what I need to do to add an additional column to my content and count the records based on aggregated time stamps, for example aggregation based on each 10 milliseconds and nodeID..
timestamp (ms), nodeID,value, counts
12345,x,12.4,3
12346,x,12.7,3
12348,x,13.4,3
12356,x,13,6,1
12355,y,12.0,1
How to do above process in NiFi. I am totally new to Nifi but need to add above functinality to my Nifi process. I am currently using below nifi process
This may not answer your question directly, but you should consider refactoring your flow to use the "record" processors. It would greatly simplify things and would probably get you closer to being able to do the aggregation.
The idea is to not split up the records, and instead process them in place. Given your current flow, the 4 processors after FetchSFTP would like change to a single ConvertRecord processor that converts CSV to JSON. You would first need to defined a simple Avro schema for your data.
Once you have the record processing setup, you might be able to use PartitionRecord to partition the records by the node id, and then from there the missing piece would be how to count by the timestamps.
Some additional resources...
https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
https://www.slideshare.net/BryanBende/apache-nifi-record-processing

Storage of parsed log data in hadoop and exporting it into relational DB

I have a requirement of parsing both Apache access logs and tomcat logs one after another using map reduce. Few fields are being extracted from tomcat log and rest from Apache log.I need to merge /map extracted fields based on the timestamp and export these mapped fields into a traditional relational db ( ex. MySQL ).
I can parse and extract information using regular expression or pig. The challenge i am facing is on how to map extracted information from both logs into a single aggregate format or file and how to export this data to MYSQL.
Few approaches I am thinking of
1) Write output of map reduce from both parsed Apache access logs and tomcat logs into separate files and merge those into a single file ( again based on timestamp ). Export this data to MySQL.
2) Use Hbase or Hive to store data in table format in hadoop and export that to MySQL
3) Directly write the output of map reduce to MySQL using JDBC.
Which approach would be most viable and also please suggest any other alternative solutions you know.
It's almost always preferable to have smaller, simpler MR jobs and chain them together than to have large, complex jobs. I think your best option is to go with something like #1. In other words:
Process Apache httpd logs into a unified format.
Process Tomcat logs into a unified format.
Join the output of 1 and 2 using whatever logic makes sense, writing the result into the same format.
Export the resulting dataset to your database.
You can probably perform the join and transform (1 and 2) in the same step. Use the map to transform and do a reduce side join.
It doesn't sound like you need / want the overhead of random access so I wouldn't look at HBase. This isn't its strong point (although you could do it in the random access sense by looking up each record in HBase by timestamp, seeing if it exists, merging the record in, or simply inserting if it doesn't exist, but this is very slow, comparatively). Hive could be conveinnient to store the "unified" result of the two formats, but you'd still have to transform the records into that format.
You absolutely do not want to have the reducer write to MySQL directly. This effectively creates a DDOS attack on the database. Consider a cluster of 10 nodes, each running 5 reducers, you'll have 50 concurrent writers to the same table. As you grow the cluster you'll exceed max connections very quickly and choke the RDBMS.
All of that said, ask yourself if it makes sense to put this much data into the database, if you're considering the full log records. This amount of data is precisely the type of case Hadoop itself is meant to store and process long term. If you're computing aggregates of this data, by all means, toss it into MySQL.
Hope this helps.

Resources