NiFi - Process the files based on count or time elapsed? - apache-nifi

I have the following flow:
ListFile ---> FetchFile ---> ? ExecuteScript (maybe) ---> Notify
Basically, I want to go to Notify if:
the total number of flowfiles (from FetchFile) is, say, 200; OR
the time elapsed (since the last signal) is, say, 3 hours.
I think the 1st condition is easy to achieve. I can have a Groovy script that reads the number of flowfiles and, if it is 200, routes to SUCCESS, or otherwise rolls back the session.
But how do I also check whether the time elapsed for the n flowfiles in the queue (where n can be less than 200) is more than 3 hours or so?
Update
Here is the problem: we currently have a batch process (~200 files, and the number can increase based on business needs in the future). We have a NiFi pipeline, i.e. List, Fetch, basic validation on checksum, etc., and processing (calling the SQL), which is working fine.
As per the business, throughout the day we can have corrections to the data, so we may get all or some of the files to "re-process". That is also fine and working.
Now, as per new requirements, we need to build a process that runs after this "batch" is completed. In the best case, I can have a MergeContent processor with a maximum bin size of n and give the signal or notify my new processor.
However, as explained above, throughout the day a few or all of the files can be processed again, so my "n" may not match the new number of files re-processed. Hence, even in this case, if say 3 hours have elapsed, then irrespective of "n" not equalling the new number of files reprocessed, I should notify the new process to run again.
Hence, I am looking for an "n files OR m hours elapsed" check.

I think this may be an example of an XY problem -- you're trying to solve a problem and believe that counting the number of files fetched or time elapsed will help, but this pattern is usually discouraged in Apache NiFi and there are other solutions to the original problem. I would encourage you to describe more fully the higher level problem you are trying to solve to see if there is a better solution.
I will answer the question though (none of these are ideal solutions).
You can use a MergeContent processor with a minimum entry count (Minimum Number of Entries) of 200
You can use an ExecuteScript processor as you noted
You can write a value (the current timestamp) to a DistributedMapCacheServer when the Notify processor executes, fetch that value with a FetchDistributedMapCache processor, and use a simple Expression Language statement to compare it against the current timestamp
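For example, a minimal sketch of such a comparison, assuming FetchDistributedMapCache has been configured to put the cached timestamp into a flowfile attribute named last.notify.time (the attribute name is just a placeholder):
${now():toNumber():minus(${last.notify.time:toNumber()}):gt(10800000)}
Here 10800000 is 3 hours in milliseconds, so the expression evaluates to true once more than 3 hours have passed since the stored timestamp.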
I think you may also want to read some examples of Wait/Notify logic, because creating thresholds like "200 incoming flowfiles || 3 hours elapsed time" is what the Wait processor does.
"How to wait for all fragments to be processed, then do something?" by Koji Kawamura
"NiFi workflow monitoring – Wait/Notify pattern with split and merge" by Pierre Villard
"Simple NiFi Wait/Notify Example" answer by Abdelkrim Hadjidj

Related

Introduce time delay before moving flow files to next processor in NiFi

In NiFi, there is a data flow that consumes from MQTT (ConsumeMQTT) and publishes into an HDFS path (PutHDFS). I have a requirement to introduce a 60 min delay before pushing the consumed data into the HDFS path. I found the ControlRate and MergeContent processors to be possible solutions, but I am not sure.
What is the ideal solution to introduce a time delay?
Example: A flow file consumed at 9:00 AM should be published into HDFS at 10:00 AM
You can use an ExecuteScript processor to run a sleep(60*60*1000) loop, but this would unnecessarily use system resources.
I would instead introduce a RouteOnAttribute processor which has an output relationship of one_hour_elapsed going to PutHDFS, and unmatched looped back to itself. The RouteOnAttribute processor should have Routing Strategy set to Route to Property Name and a dynamic property (click the + button on the top right of the Properties tab) named one_hour_elapsed. The Expression Language value should be ${now():toNumber():gt(${entryDate:toNumber():plus(3600000)})}.
This expression:
Gets the current time and converts it to milliseconds since the epoch (now():toNumber())
Gets the entryDate attribute of the flowfile (when it entered NiFi) and converts it to milliseconds and adds one hour (entryDate:toNumber():plus(3600000) [3600000 == 60*60*1000])
Compares the two numbers (a:gt(${b}))
If this is not actually the start of your flow, you can use an UpdateAttribute processor to insert an arbitrary timestamp at any point of your flow and calculate from there.
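For example (a minimal sketch; the attribute name is arbitrary), you could add a dynamic property in UpdateAttribute such as:
delay.start.time = ${now():toNumber()}
and then use ${now():toNumber():gt(${delay.start.time:toNumber():plus(3600000)})} in the RouteOnAttribute expression instead of entryDate.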
I would also recommend setting the Yield Duration and Run Schedule of the RouteOnAttribute processor substantially higher than usual, as you do not want this processor to run constantly when it will do no work. I'd suggest setting this to 1 or 5 minutes to start, since you are introducing a one hour delay already.
Starting from NiFi 1.10, this can be done even more easily with the RetryFlowFile processor.
Use the penalty duration to set the delay time.

Apache NiFi GetMongo Processor

I am new to NiFi. I am using GetMongo to extract documents from MongoDB, but the same result keeps coming again and again, even though the result of the query is only 2 documents. The query is {"qty":{$gt:10}}.
There is a similar question regarding this. Let me quote what I had said there:
"GetMongo will continue to pull data from MongoDB based on the provided properties such as Query, Projection, Limit. It has no way of tracking the execution process, at least for now. What you can do, however, is changing the Run Schedule and/or Scheduling Strategy. You can find them by right clicking on the processor and clicking Configure. By default, Run Schedule will be 0 sec which means running continuously. Changing it to, say, 60 min will make the processor run every one hour. This will still read the same documents from MongoDB again every one hour but since you have mentioned that you just want to run it only once, I'm suggesting this approach."
The question can be found here.

Oozie Behavior with misaligned start

I noticed that if I start an Oozie coordinator with a start time many "iterations" (in terms of the frequency) before the current time, the coordinator will sequentially run the workflow several times, ignoring the assigned frequency. However, for me it is more important that the workflow/action runs at the assigned frequency than that it has run the correct number of times at a given point.
Is there any way I can avoid this behavior? One way would obviously be to ensure the start time is within one iteration of the current time (is there a way to have it automatically take the current time as the start time?). Another would be to configure it to avoid this behavior altogether and basically run at the next time it would have run given the start time and the frequency.
The obvious way to avoid side effects from "past" start dates is... to set the actual start date at submission time as "now".
That's the way we do it in my team:
on the local filesystem, write down a "coord-template.xml" with a placeholder such as start="%Now%"
just before submitting, generate the actual "coordinator.xml" with
sed "s/%Now%/$(date --utc '+%FT%TZ')/" coord-template.xml > coordinator.xml
upload the coordinator definition to HDFS then submit it via Oozie CLI
~~~~~~~~~~~~
Alternative: if you are using a "basic" frequency (not CRON-like scheduling) you may want to try these <controls> to have Oozie create executions for all "past" time slots but discard them immediately:
<throttle>1</throttle>
and/or
<execution>LAST_ONLY</execution>
cf. Oozie 4.x reference
The rules would also apply in case the Coordinator is suspended then resumed, or in case the Oozie service gets stopped then restarted, or in case YARN has to queue new jobs for a really long time (because the cluster is 100% busy).
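For orientation, a minimal sketch of where these elements sit in a coordinator definition (the app name, frequency, dates, and workflow path are placeholders):
<coordinator-app name="my-coord" frequency="60" start="2015-01-01T00:00Z" end="2016-01-01T00:00Z" timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <controls>
        <execution>LAST_ONLY</execution>
        <throttle>1</throttle>
    </controls>
    <action>
        <workflow>
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>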
Oozie has improved of late, so there's an easier solution available than the currently accepted answer. As of Oozie 4.1, there is a "NONE" execution available. This skips iterations which occur in the past, more or less. Here's the doc snippet:
NONE: Similar to LAST_ONLY except all older materializations are skipped. When NONE is set, an action that is WAITING or READY will be SKIPPED when the current time is more than a certain configured number of minutes (tolerance) past the action's nominal time. By default, the threshold is 1 minute. For example, suppose action 1 and 2 are both WAITING, the current time is 5:20pm, and both actions' nominal times are before 5:19pm. Both actions will become SKIPPED, assuming they don't transition to SUBMITTED (or a terminal state) before then. Another way of thinking about this is to view it as similar to setting the timeout equal to 1 minute which is the smallest time unit, except that the SKIPPED status doesn't cause the coordinator job to eventually become DONEWITHERROR and can actually become SUCCEEDED (i.e. it's a "good" version of TIMEDOUT).
Oozie 4.1 doc
I have tested this, and it does work with CRON frequencies. It is superior to the LAST_ONLY execution in your case because LAST_ONLY will still run the most recent iteration in the past (with the misaligned time), in addition to current/future iterations.
<execution>NONE</execution>

Correct method to calculate "total wait time" of a session in Oracle

I need to find out the total time a session has spent waiting while it is active.
For this I used a query like the one below:
SELECT (SUM (wait_time + time_waited) / 1000000)
FROM v$active_session_history
WHERE session_id = 614
But I feel I'm not getting what I wanted using this query.
The first time I ran this query I got 145.980962, the second time 145.953926, and the third time 127.706429.
Ideally, the time should stay the same or increase. But, as you see, the value returned is decreasing every time.
Please correct me where I'm going wrong.
It does not contain the whole history; v$active_session_history "forgets" older rows. Think of it as a ring of buffers: once all buffers are written, it restarts from the first buffer.
To get the wait events of a session, look at v$session_event. To get the current (active) event of an active session, use v$session_wait (in recent Oracle versions, you can find this info in v$session as well).
NOTE: the v$session_event view will not show you CPU time (which is not a wait event but can be seen in v$active_session_history). You can add it, for example, from v$sesstat if needed.
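For example, a sketch of a per-event breakdown from v$session_event for the session in the question (TIME_WAITED is in centiseconds, TIME_WAITED_MICRO in microseconds):
SELECT event, total_waits, time_waited, time_waited_micro
FROM v$session_event
WHERE sid = 614
ORDER BY time_waited_micro DESC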
Your bloomer is that you have not understood the nature of v$active_session_history: it is a sample not a log. That is, each record in ASH is a point in time, and doesn't refer back to previous records.
Don't worry, it's a common mistake.
This is a particular problem with WAIT_TIME. This is the total time waited for that specific occurrence of that event. So if the wait event stretches across two samples, in the first record WAIT_TIME will be 1 (one second) and in the next sample it will be 2 (two seconds). However, a SUM(WAIT_TIME) would produce a total of 3, which is too much. Of course, this is an arithmetic progression, so if the wait event stretches to ten samples (ten seconds) a SUM(WAIT_TIME) would produce a total of 55.
Basically, WAIT_TIME is a flag - if it is 0 the session is ON CPU and if it's greater than zero it is WAITING.
TIME_WAITED is only populated when the event has stopped waiting. So a SUM(TIME_WAITED) wouldn't give an inflated value. In fact just the opposite: it will only be populated for wait events which were ongoing at the sample time. So there can be lots of waits which fall between the interstices of the samples which won't show up in that SUM.
This is why ASH is good for highlighting big performance issues and bad for identifying background niggles.
So why doesn't the total time increase each time you run your query? Because ASH is a circular buffer. Older records get aged out to make way for new samples. AWR stores a percentage of the ASH records on disk; they are accessible through DBA_HIST_ACTIVE_SESS_HISTORY (the default is one record in ten). So probably ASH purged some samples with high wait times between the second and third times you ran your query. You could check that by including MIN(SAMPLE_TIME) in the select list.
Finally, bear in mind that SIDs get reused. The primary key for identifying a session is (SID, Serial#). Your query only filters by SID, so it may use data from several different sessions.
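Putting those two points together, a sketch of the original query with MIN(SAMPLE_TIME) added and the session pinned to a single incarnation (the serial# value is a placeholder):
SELECT MIN(sample_time), SUM(wait_time + time_waited) / 1000000
FROM v$active_session_history
WHERE session_id = 614
AND session_serial# = 1234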
There is a useful presentation by Graham Wood, one of the Oracle gurus who worked on ASH, called "Sifting through the ASHes". Although it would be better to hear Graham speaking, the slide deck on its own still provides some useful insights. Find it here.
tl;dr
ASH is a sample not a log. Use it for COUNTs not SUMs.
"Anything wrong in the way query these tables? "
As I said above, but perhaps didn't make clear enough, DBA_HIST_ACTIVE_SESS_HISTORY only holds a fraction of the records from ASH. So it is even less meaningful to run SUM() on its columns than on the live ASH.
Whereas V$SESSION_EVENT is an actual log of events. Its wait times are reliable and accurate. That's why you pay the overhead of enabling timed statistics. Having said which, V$SESSION_EVENT only gives us aggregated values per session, so it's not particularly useful in diagnosis.

Why is Cacti showing an empty graph, even though the rrd file is created?

I have developed my own SNMP service, and I want to plot a graph of a provided OID.
So, I have created a graph in Cacti.
-) It is showing the device as up.
-) It is creating the rrd file (RRDTool says OK).
-) It is showing the graph, but it's empty.
But when I check it with, say,
rrdtool fetch <rrd file> AVERAGE
it shows me nan for all the values. The monitored OID has the value 47, and I have set min=0 and max=100.
I am using the Cacti appliance by rPath:
http://www.rpath.org/ui/#/appliances?id=http://www.rpath.org/api/products/cacti-appliance
Still, I can't show the value on the graph.
Where is the problem? Can anyone please tell me?
First of all, use Cacti's "Rebuild Poller Cache" function under the Utilities menu.
If that didn't work, check whether the RRD file is actually updating with new data.
To do this use the command:
rrdtool last [filename.rrd]
This will output the last time (as a Unix timestamp) that a new value was inserted into the RRD file, which you can compare to the current time that date +%s will output.
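For example (the file path is just a placeholder for wherever your Cacti RRD files live):
rrdtool last /var/www/cacti/rra/myhost_traffic_in_42.rrd
date +%s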
If it's not updating with data, then you should change the Cacti log level to DEBUG via the settings page in Cacti's web UI and look for relevant messages.
If the poller couldn't get the data, then it's usually an issue relating to connectivity/SNMP.
You can further check issues as such by manually polling the specific OID on that host:
snmpwalk -c[SNMP COMMUNITY] -v2c [HOSTNAME OR IP ADDRESS] 1.3.6.1.2.1
You can use the above command and OID (1.3.6.1.2.1) just to see if you're getting a reply.
If that worked then you should change the command from snmpwalk to snmpget and the OID to the actual OID you're trying to poll and retry.
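For example, a sketch using the same placeholders, substituting your actual OID:
snmpget -c[SNMP COMMUNITY] -v2c [HOSTNAME OR IP ADDRESS] [YOUR OID]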
If the RRD is updating with new data but you're still getting NaN in your graphs then I suggest looking into the heartbeat and step values of the data source (via the data template) in relation to your polling interval and poller cronjob interval.
These values determine how many times the RRD file will miss data before inserting a NaN.
The cronjob calls the Cacti poller to start performing its polling cycle.
The poller interval is the actual time that the poller will wait between two polling cycles if it was indeed invoked in time by the cronjob.
So for 1 minute polling (on the poller and the cronjob) you will have to use a step of 60 (seconds) and a heartbeat of 120.
For 5 minutes polling, the step will be 300 and the heartbeat will be 600.
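You can check the step and heartbeat currently stored in an existing RRD file with rrdtool info (the path is a placeholder):
rrdtool info /var/www/cacti/rra/myhost_traffic_in_42.rrd | grep -E 'step|minimal_heartbeat'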
Such a mismatch is mainly caused by someone changing the poller interval on the settings page.
Gandalf from the Cacti forums wrote a nice guide that you can use, and further help can be found on the Cacti forums.
Good luck! :)
Maybe Cacti doesn't have the needed permissions to access the rrd file, and your test was done with a user who has the required permissions, for example root?
Are you sure you have collected enough data?
If your RRD has a step of 1 minute, and your first RRA has a consolidated count of 1 (1cdp=1pdp), then you should collect data for at least (step x ( count + 1 )) seconds before you expect to see any data in the graph. Make sure you are collecting data at least as often as the step size.
If you collect data for 10 min and nothing shows up, then make sure you are actually collecting the data, make sure the values you get are within range, and that they are being used. Check the last modification time on the RRD file. Print out the values before you update to verify they are what you think they are.
You should double-check the range Cacti is plotting. I moved the values in the graph filter and spotted a little chunk of data in the graphs; then you just have to adjust it.
