NiFi QueryDatabaseTable processor: when will it reset the value? - apache-nifi

According to the documentation linked below, it seems that if I restart the processor, it will reset the maximum value of the column I have provided and will start fetching data from the beginning.
Document Link: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.QueryDatabaseTable/index.html
A comma-separated list of column names. The processor will keep track of the maximum value for each column that has been returned since the processor started running.
However, I tested this behavior, and even after restarting the processor I still get an incremental load. Is there a mistake in the document, or have I missed something?
What would happen if I re-deploy the job, i.e., delete it and re-create it from the template?
In the code, it is mentioned that the value is stored with Scope.CLUSTER. Would someone please explain what that is, and under which conditions the state will be cleared?
@Stateful(scopes = Scope.CLUSTER, description = "After performing a query on the specified table, the maximum values for " + "the specified column(s) will be retained for use in future executions of the query. This allows the Processor " + "to fetch only those records that have max values greater than the retained values. This can be used for " + "incremental fetching, fetching of newly added rows, etc. To clear the maximum values, clear the state of the processor " + "per the State Management documentation")

Once the processor has started for the first time, it will never reset its value unless you go into the "View State" menu of the processor and click "Clear State".
It would not make sense to clear the state on every start and stop, because then any time NiFi restarted for maintenance, or after a crash, the state would reset, which is not desired.
Where the state is stored depends on whether you are running a single node or a cluster. On a single node it is stored in a local write-ahead log; in a cluster it is stored in ZooKeeper so all nodes can access it if necessary. In either case it is stored by the UUID of the processor.
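As an aside, the state can also be cleared programmatically through NiFi's REST API (the processor must be stopped first, just as with "Clear State" in the UI). The host, port, and processor UUID below are placeholders for your own environment:

```
# Clear the stored maximum-value state of a (stopped) QueryDatabaseTable processor
curl -X POST http://localhost:8080/nifi-api/processors/<processor-uuid>/state/clear-requests
```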

Related

How does flink-sql deal with a scenario like 'count(distinct )'?

I need to calculate "Daily Active Users" in real time using flink-sql, which is essentially a 'count(distinct )' operation on daily data.
My question is: if userA logged in this morning at 1am, flink adds 1 to the DAU as expected. Now, if userA logs in again at 10pm, how does flink-sql know that userA was already processed this morning? Does it need to repeatedly do count(distinct ) on the whole day's login log? If not, how does flink handle this scenario?
Distinct is a very expensive operation in streaming. If you don't use time-based windows (TUMBLE, SLIDE, SESSION), the runtime must store all values in state forever, because it needs to assume that another record could arrive at any point in the future.
However, you can set the option table.exec.state.ttl (see the configuration documentation) to control how long those records are kept in state. This might be one of the most important options when designing a SQL pipeline with long-running queries where the value space of an operator input is not constant.
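For instance, in recent Flink versions the option can be set in the SQL client like this (the key is real; the 24-hour value is just an example for a daily DAU query):

```sql
-- Keep per-key state (e.g. the set of distinct users) for at most 24 hours;
-- after that, Flink is allowed to clean it up.
SET 'table.exec.state.ttl' = '24 h';
```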
By real-time, I assume you mean in a Continuous Query?
See https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/tuning/streaming_aggregation_optimization.html
By default, the unbounded aggregation operator processes input records one by one, i.e., (1) read accumulator from state, (2) accumulate/retract record to accumulator, (3) write accumulator back to state, (4) the next record will do the process again from (1).
The accumulator keeps not only the end result, but also enough data to produce the next result without reading all previous records again.
I guess in the case of count(distinct), it means keeping all unique users per day in the accumulator.
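A minimal sketch of that idea (illustrative only, not Flink's actual accumulator class): the accumulator for count(distinct) holds the set of values seen so far, so a repeated login does not increase the count, and the whole day's log never has to be re-scanned.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model of a COUNT(DISTINCT) accumulator: per incoming record it is
// read from state, updated, and written back; it retains all distinct values.
class DistinctCountAccumulator {
    private final Set<String> seen = new HashSet<>();

    // Returns true if the value was new, i.e. the count increased.
    boolean accumulate(String value) {
        return seen.add(value);
    }

    long count() {
        return seen.size();
    }
}
```

With this, userA's 10pm login is a no-op because "userA" is already in the set, which is exactly why such state grows without bound unless a window or a state TTL limits it.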

NiFi Create Indexes after Inserting Records into table

I've got a first Process Group that drops the indexes on a table.
That then routes to another Process Group that does inserts into the table.
After successfully inserting the half million rows, I want to create the indexes on the table and analyze it. This is typical Data Warehouse methodology. Can anyone please give advice on how to do this?
I've tried setting counters, but counters cannot be referenced in Expression Language. I've tried RouteOnAttribute but am getting nowhere. Now I'm digging into the Wait & Notify processors - maybe there's a solution there?
I have gotten Counters to count the FlowFile SQL insert statements, but cannot reference the Counter values via Expression Language. I.e., this always returns null: "${InsertCounter}", even though InsertCounter appears to be set properly via the UpdateCounter processor in my flow.
So maybe this code can be used?
In the wait processor set the Target Signal Count to ${fragment.count}.
Set the Release Signal Identifier in both the notify and wait processor to ${fragment.identifier}
So far, nothing works.
You can use Wait/Notify processors to do that.
I assume you're using ExecuteSQL and SplitAvro? If so, the flow will look like:
Split approach
At the 2nd ProcessGroup
ExecuteSQL: e.g. 1 output FlowFile containing 5,000 records
SplitAvro: creates 5,000 FlowFiles; this processor adds the fragment.identifier and fragment.count (= 5,000) attributes.
split:
XXXX: Do some conversion per record
PutSQL: Insert records individually
Notify: Increase count for the fragment.identifier (Release Signal Identifier) by 1. Executed 5,000 times.
original - to the next ProcessGroup
At the 3rd ProcessGroup
Wait: waits for the count of fragment.identifier (Release Signal Identifier) to reach fragment.count (Target Signal Count). This route processes the original FlowFile, so it is executed only once.
PutSQL: Execute a query to create indices and analyze tables
Alternatively, if possible, using Record aware processors would make the flow simpler and more efficient.
Record approach
ExecuteSQL: e.g. 1 output FlowFile containing 5,000 records
Perform record level conversion: With UpdateRecord or LookupRecord, you can do data processing without splitting records into multiple FlowFiles.
PutSQL: Execute a query to create indices and analyze tables. Since a single FlowFile contains all the records, no Wait/Notify is required, and the output FlowFile can be connected to the downstream flow.
I think my suggestion to this question will fit your scenario as well:
How to execute a processor only when another processor is not executing?
Check it out

In NiFi, how to store an attribute retrieved from a DB which doesn't change very frequently?

I have a scheduled ExecuteSQL processor which retrieves a speed limit from a DB. This speed limit doesn't change frequently, so I set a run interval of 24 hours. But I noticed that the next processor, e.g. RouteOnAttribute, doesn't store this speed limit value. With every FlowFile coming from Kafka I want to check whether the speedlimit value in the FlowFile exceeds the speedlimit value retrieved from the DB. But the value from the DB is processed as a FlowFile only once in 24 hours, and it's not available for comparison.
I have following flow:
1) ExecuteSQL -> ConvertAvroToJson -> EvaluateJsonPath -> from here I pass the speed limit value on to the RouteOnAttribute processor in the following flow.
2) ConsumeKafka -> EvaluateJsonPath -> RouteOnAttribute (RouteOnAttribute gets the speed limit from the above flow, but it only gets this value once in 24 hours. How can I keep this value in memory permanently?)
Based on your description I think this how-to HCC post is very relevant:
https://community.hortonworks.com/questions/140060/nifi-how-to-load-a-value-in-memory-one-time-from-c.html
In summary, it leverages the fact that UpdateAttribute has a state feature, and makes sure the attribute only gets updated when data is pulled in from the reference table.
There is also an alternative solution, if it is OK for you to restart NiFi after pulling in an updated reference value: this is called the variable registry, and it simplifies things a bit:
https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.1.1/bk_administration/content/custom_properties.html

Stop HBase update operation if it has the same value

I have a table in HBase named 'xyz'. When I do an update operation on this table, it updates the table even though it is the same record.
How can I prevent the second record from being added?
Eg:
create 'ns:xyz',{NAME=>'cf1',VERSIONS => 5}
put 'ns:xyz','1','cf1:name','NewYork'
put 'ns:xyz','1','cf1:name','NewYork'
The put statements above produce 2 records with different timestamps if I check all versions. I expect that it should not add the 2nd record, because it has the same value.
HBase isn't going to look through the entire row and work out if it's the same as the data you're adding. That would be an expensive operation, and HBase prides itself on its fast insert speeds.
If you're really eager to do this (and I'd ask whether you really want to), you should perform a GET first to see if the data is already present in the table.
You could also write a Coprocessor to do this every time you PUT data, but again the performance would be undesirable.
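To illustrate the GET-before-PUT guard suggested above, here is a minimal, self-contained Java sketch. A plain map stands in for one column of an HBase table; in real code the get/put would go through the HBase client API (or the atomic checkAndPut/checkAndMutate), and the class and method names here are purely illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Simulation of "read the current value first, only write when it differs".
// The map models rowKey -> value for a single column of an HBase table.
class DedupWriter {
    private final Map<String, String> table = new HashMap<>();

    // Returns true if a write actually happened.
    boolean putIfChanged(String rowKey, String value) {
        String current = table.get(rowKey); // the extra GET that costs performance
        if (value.equals(current)) {
            return false;                   // same value: skip the PUT, no new version
        }
        table.put(rowKey, value);
        return true;
    }
}
```

Note that with a plain GET-then-PUT there is a race window between the read and the write; HBase's check-and-mutate operations exist precisely to make that comparison atomic, at a performance cost.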
As mentioned by @Ben Watson, HBase is best known for its write performance, since it doesn't need to check for the existence of a value; multiple versions are maintained by default.
One workaround is to use custom versioning, i.e. supplying explicit timestamps. If a row key already has two versions and you insert the same record with the same timestamp, HBase will simply overwrite that cell with the value instead of adding a new version.
NOTE: It is left to your application to supply the same timestamp for a particular value.
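In the HBase shell, the custom-timestamp trick looks like this (the timestamp 1000 is an arbitrary example value that the application would have to supply consistently):

```
# Both puts use the same explicit timestamp, so the second one overwrites
# the first cell in place instead of creating a second version:
put 'ns:xyz', '1', 'cf1:name', 'NewYork', 1000
put 'ns:xyz', '1', 'cf1:name', 'NewYork', 1000
```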

couchbaseTemplate.save(List<Employee>): wanting to save multiple objects in one go

I want to save multiple Employee objects as Couchbase documents, but I am worried about the following use case: I have a List of size 5.
Suppose it has saved 3 objects as 3 documents, and the Couchbase server goes down while saving the remaining 2 documents. What will happen in that case?
1) Do all my saved documents get rolled back?
2) Will it still persist the other 2 documents?
3) If neither, what is the recommended option for this use case?
From the reference documentation:
Couchbase Server does not support multi-document transactions or rollback.
So neither 1) nor 2) will happen.
If you need such transaction guarantees, you have to either use a database product that supports them or implement them on your own.
The typical approach when working with non-transactional stores is not to rely on consistency, for example by working with idempotent actions, i.e. actions you can safely redo in the case of a failure.
In this specific example, you might first store the 5 documents combined as a single document, and then split it up in a separate process. The first write is atomic because it is a single document, and the second process can be repeated until it succeeds.
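A minimal sketch of that pattern, using plain Java maps in place of the Couchbase SDK (all class and method names here are illustrative): step 1 is one single-document write, which the server performs atomically; step 2 is an idempotent splitter that can be re-run after a crash until it succeeds.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: store a batch atomically as one document, then split it into
// individual documents in a retryable step. Maps stand in for Couchbase buckets.
class BatchSplitter {
    final Map<String, List<String>> batchBucket = new HashMap<>();
    final Map<String, String> docBucket = new HashMap<>();

    // Step 1: a single-document write holds the whole batch.
    void saveBatch(String batchId, List<String> employees) {
        batchBucket.put(batchId, employees);
    }

    // Step 2: idempotent split; overwrites the same keys, so it is safe to re-run.
    void split(String batchId) {
        List<String> employees = batchBucket.get(batchId);
        for (int i = 0; i < employees.size(); i++) {
            docBucket.put(batchId + "::" + i, employees.get(i));
        }
    }
}
```

Because the split step writes deterministic keys, running it again after a partial failure simply overwrites the documents that were already created, which is what makes the retry safe.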
Adding to @jens-schauder's answer:
If you have at least 3 nodes in your Couchbase cluster, this issue should not occur.
If a node goes down, the cluster will automatically fail over the data that node was the master of (1/3rd of the data) to the 2 other nodes, so writes will continue to work seamlessly.