GetHBase processor: filter rows by timestamp - apache-nifi

I'm trying to use the HBase Get processor in NiFi, and I want to run the following command through that processor. Is it possible?
scan 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}

The GetHBase processor is made to do incremental extraction from an HBase table based on timestamp. The Initial Time Range property determines whether the min time starts at 0 or at the current time; after that, the processor keeps track of the max time seen in the previous execution and uses it as the min time in the next execution. So you can't provide your own time range, since the processor manages that for you.

The GetHBase processor always looks for incremental updates based on the timestamp; basically, it recognizes new/updated data automatically.
But if you still want to read rows for specific timestamp(s), you can use a filter expression in the following format in the "Filter Expression" property:
TimeStampsFilter(timestamp1, timestamp2, ..., timestampn)
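For example, to return only the cells written at two specific versions (the epoch-millisecond values below are purely illustrative), the Filter Expression property could be set to:

    TimeStampsFilter(1303668804000, 1303668904000)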
You can find a list of these filters in: https://www.cloudera.com/documentation/enterprise/5-3-x/topics/admin_hbase_filtering.html

Related

NiFi - Persist timestamp value used in ExecuteSQLRecord processor query

My use case is simple, but I have not found the right solution so far.
My query tags the data with the current timestamp in one of the columns at the moment the ExecuteSQLRecord processor hits the database and fetches the data. I want the resulting flow file to carry that same timestamp in its name as well, but I don't know how to capture the value of ${now():format("yyyyMMddHHmmss")} as an attribute so I can use it later to rename the flow file.
Basically, I want to store the timestamp "at the time I hit the database". I cannot use an UpdateAttribute processor just before ExecuteSQLRecord to capture it, because if a prior execution is still running in ExecuteSQLRecord, flow files will pass through UpdateAttribute, pick up a timestamp value, and then sit in the queue until ExecuteSQLRecord finishes processing the current thread.
Note: I am running NiFi in standalone mode, so I cannot run ExecuteSQLRecord with multiple threads.
Any help is highly appreciated. Thanks in advance.
ExecuteSQLRecord writes an attribute called executesql.query.duration which contains the duration of the query + fetch in milliseconds.
So, we can put an UpdateAttribute processor AFTER the ExecuteSQLRecord that uses ${now():toNumber():minus(${executesql.query.duration})} to take the current time as epoch millis and subtract the total query duration, giving the time at which the query started.
You can then use :format('yyyyMMddHHmmss') to bring it back to the timestamp format you want.
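Putting it together, a single UpdateAttribute property could look like this (the attribute name is just a placeholder):

    query.start.time = ${now():toNumber():minus(${executesql.query.duration}):format('yyyyMMddHHmmss')}

That attribute can then be used later in the flow (e.g. when setting the filename attribute) to rename the flow file.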
It might be a few milliseconds off of the exact time (time taken to get to the UpdateAttribute processor).
See docs for ExecuteSQLRecord

How to find that all the files created by GenerateTableFetch have been processed

We have a flow where GenerateTableFetch takes input from SplitJson, which supplies TableName and ColumnName as arguments. Multiple tables are passed to GenerateTableFetch at once, and ExecuteSQL then runs the generated queries.
Now I want to trigger a new process once all the files for a table have been processed by the downstream processors (the flow ends with a PutFile).
How can I find out that all the files created for a table have been processed?
You may need NIFI-5601 to accomplish this; there is a patch currently under review at the time of this writing, and I hope to get it into NiFi 1.9.0.
EDIT: Adding potential workarounds in the meantime
If you can use ListDatabaseTables instead of getting your table names from a JSON file, then you can set Include Count to true. You will then get attributes for the table name and the count of its rows. Divide the count by the value of the Partition Size in GTF, and that gives you the number of fetches (let's call it X). Then add an attribute via UpdateAttribute called "parent" or something, and set it to ${UUID()}. Keep these attributes on the flow files going into GTF and ExecuteSQL; then you can use Wait/Notify to wait until X flow files are received (setting Target Signal Count to ${X}) and using ${parent} as the Release Signal Identifier.
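For example, a rough Expression Language sketch for computing X in that UpdateAttribute (the attribute names here are placeholders, and GTF's Partition Size is assumed to be the default 10000), using the add-then-divide trick to round up:

    X      = ${row.count:plus(9999):divide(10000)}
    parent = ${UUID()}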
If you can't use ListDatabaseTables, then you may be able to put ExecuteSQL or ExecuteSQLRecord after your SplitJson and execute something like SELECT COUNT(*) FROM ${table.name}. If using ExecuteSQL, you may need a ConvertAvroToJSON; if using ExecuteSQLRecord, use a JsonRecordSetWriter. Then you can extract the count from the flow file contents using EvaluateJsonPath.
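As a sketch (the names are hypothetical, and the exact JSON field depends on your database and writer settings), aliasing the count keeps the JsonPath predictable:

    SQL query:                  SELECT COUNT(*) AS row_count FROM ${table.name}
    EvaluateJsonPath property:  row.count = $.row_count

This assumes the single-record output is a plain JSON object; if it comes back wrapped in an array, use $[0].row_count instead.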
Once you have the table name and the row count in attributes, you can continue with the flow I outlined above (i.e. determine the number of flow files that GTF will generate, etc.).

I need to get the latest data in the data ingest template (NiFi)

Hi sir,
In the data ingest template I need the following behavior.
For example, I have data with a date field:
date        data
12-07-2018  a
13-07-2018  b
14-07-2018  c
15-07-2018  d
From that, I would like to take the latest one, i.e., 15-07-2018.
If the date field gets new data,
16-07-2018  e
then I have to pick up 16-07-2018 by checking against the last updated date 15-07-2018, rather than scanning from the first record 12-07-2018.
Likewise, if I get 17-08-2018 f, then I have to pick up 17-08-2018 by checking against the last new date 16-07-2018.
How do I achieve this? In which processor do I need to make modifications or add new properties?
When the feed runs again, how does it pick up the latest watermark and work from there?
Two possible approaches come to mind:
Write your own Spark app, run via ExecuteSparkJob, to read through the file that is being ingested. In this case, you keep track of the max date and, when you are done with the ingestion, persist it somewhere. If you're in the HDP world, the easy thing would be to insert the max date into a Hive (transactional) table. You can also leverage a ZooKeeper znode for persistence, or even the PutDistributedMapCache processor that NiFi offers. (A rough sketch of this is shown after the second option below.)
Write a custom NiFi processor which would basically do the same thing as the above, except that you have to handle data of different formats (CSV, JSON) yourself; Spark, in this regard, comes with many things built in.
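As a rough illustration of the first approach (not a drop-in solution; the input path, column name, date format, and watermark table are all assumptions), a minimal Spark job in Java could look like this:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.max;
    import static org.apache.spark.sql.functions.to_date;

    public class MaxDateTracker {
        public static void main(String[] args) {
            // args[0]: path of the file currently being ingested (passed in by ExecuteSparkJob)
            SparkSession spark = SparkSession.builder().appName("MaxDateTracker").getOrCreate();

            Dataset<Row> df = spark.read()
                    .option("header", "true")
                    .csv(args[0]);

            // Parse the dd-MM-yyyy "date" column and keep the maximum value seen in this file
            Row result = df.select(max(to_date(col("date"), "dd-MM-yyyy")).alias("max_date")).first();
            Object maxDate = result.get(0);

            // Persist the watermark so the next run can filter on it, e.g. into a transactional Hive table:
            // spark.sql("INSERT INTO watermark_table VALUES ('" + maxDate + "')");
            System.out.println("Latest watermark: " + maxDate);

            spark.stop();
        }
    }

On the next run, the job (or the feed) would read the persisted watermark back and only keep records with a later date.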

NiFi record counts

I am getting files from a remote server using NiFi; my files are as follows:
timestamp (ms), nodeID,value
12345,x,12.4
12346,x,12.7
12348,x,13.4
12356,x,13,6
12355,y,12.0
Right now I just get, fetch, and split the lines and send them to Kafka. But beforehand, I need to apply a checksum approach to my records and aggregate them based on timestamp. What I need is to add an additional column to my content and count the records per aggregated timestamp, for example aggregating over each 10 milliseconds and nodeID:
timestamp (ms), nodeID,value, counts
12345,x,12.4,3
12346,x,12.7,3
12348,x,13.4,3
12356,x,13,6,1
12355,y,12.0,1
How can I do the above in NiFi? I am totally new to NiFi but need to add this functionality to my NiFi process. I am currently using the NiFi flow below.
This may not answer your question directly, but you should consider refactoring your flow to use the "record" processors. It would greatly simplify things and would probably get you closer to being able to do the aggregation.
The idea is to not split up the records, and instead process them in place. Given your current flow, the 4 processors after FetchSFTP would likely change to a single ConvertRecord processor that converts CSV to JSON. You would first need to define a simple Avro schema for your data.
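A minimal schema for the sample data above might look like this (record and field names are just placeholders):

    {
      "type": "record",
      "name": "SensorReading",
      "fields": [
        { "name": "timestamp", "type": "long" },
        { "name": "nodeID", "type": "string" },
        { "name": "value", "type": "double" }
      ]
    }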
Once you have the record processing set up, you might be able to use PartitionRecord to partition the records by node id; from there, the missing piece would be how to count by the timestamps.
Some additional resources...
https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
https://www.slideshare.net/BryanBende/apache-nifi-record-processing

Bulk Update of Particular Fields in HBase

I have a scenario I ran into while working with HBase. Initially I had to bulk-load a CSV file into an HBase table, which I could do successfully using HBase bulk loading.
Now I want to update a particular field in the HBase table by comparing it against a newly provided CSV, and if a value is updated, maintain a flag that says the row key was updated. Any hint on how I can do this easily?
Any help is really appreciated.
Thanks
HBase maintains versions for each cell. As long as you have the row key, you have a handle on the row and can just use a Put to write the updated column. Internally HBase maintains the versions, so you also have access to the history of the updated values.
However, as I can see, you need the comparison too. So after bulk loading, the fastest way to do this is a MapReduce job with HBase as both source and sink. Look at section 7.2.2 here.
The idea is to have MapReduce perform the scan, do the comparison in the map phase, and write the new updated Put as output. It's like a basic fetch, modify, and update sequence, but we use MapReduce's parallelism since we are dealing with a large amount of data.
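A minimal sketch of such a mapper, assuming the HBase table is both source and sink and the new CSV values are made available to the mapper somehow (the column family, qualifiers, and the lookup helper below are all assumptions):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CompareAndUpdateMapper extends TableMapper<ImmutableBytesWritable, Put> {

        @Override
        protected void map(ImmutableBytesWritable rowKey, Result existing, Context context)
                throws IOException, InterruptedException {
            // Current value of the field in HBase
            byte[] current = existing.getValue(Bytes.toBytes("cf"), Bytes.toBytes("field"));
            // Value for the same row key from the new CSV (assumed helper, e.g. distributed cache lookup)
            byte[] incoming = lookupNewValue(rowKey.get());

            // Only emit a Put when the value actually changed; also set an "updated" flag column
            if (incoming != null && !Bytes.equals(current, incoming)) {
                Put put = new Put(rowKey.get());
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("field"), incoming);
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("updated_flag"), Bytes.toBytes("true"));
                context.write(rowKey, put);
            }
        }

        // Placeholder for however the new CSV values are loaded and looked up (assumption)
        private byte[] lookupNewValue(byte[] rowKey) {
            return null;
        }
    }

The job itself would be wired up with TableMapReduceUtil.initTableMapperJob for the scan side and a table output (e.g. IdentityTableReducer or TableOutputFormat) for the write side.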

Resources