Generate exactly 1 Flowfile - apache-nifi

I'm using the GenerateFlowFile processor in Apache Nifi - When I activate it, I want the processor to create exactly 1 Flowfile.
Right now I use the REST API via Python to change the state to RUNNING, wait 0.5 seconds and change the state to STOPPED. This results in 1 FlowFile being added to the queue to the next processor.
I tested a bit and waiting for 1.5 seconds gives me 2 FlowFiles, 2.5 seconds gives me 3 FlowFiles - I'm guessing the processor generates one Flowfile each second it is running.
How can I ensure that exactly 1 Flowfile is being generated? The above method obviously is dependent on the network connection and roundtrip times. Worst case: the connection drops while I wait and I cannot stop the processor anymore and x Flowfiles are being generated.
My current configs are:
Settings:
Yield duration: 1 sec
Penalty Duration: 30sec
Bulletin Level: WARN
Scheduling:
Scheduling Strategy: CRON driven
Concurrent Tasks: 1
Run Schedule: * * * * * ?
Execution: All nodes
Run duration: 0ms
Properties:
File Size: 0B
Batch Size: 1
Data Format: Text
Unique FlowFiles: false
Custom Text: No value set
Character Set: UTF-8
Mime Type: No value set

You'll want to flag the GenerateFlowFile as Primary node only (assuming you have more than 1 node) to ensure each node is not generating its own FlowFile.
Set the Scheduling to Timer and whack the run schedule up to something like 604800 (1 week) - this means that it even if you leave the processor running, it's only going to run once a week - that should give you plenty time to fix a connectivity issue if your script can't connect to tell the processor to stop.
Keep concurrency at 1.

Related

Nifi Group Content by Given Attributes

I am trying to run a script or a custom processor to group data by given attributes every hour. Queue size is up to 30-40k on a single run and it might go up to 200k depending on the case.
MergeContent does not fit since there is no limit on min-max counts.
RouteOnAttribute does not fit since there are too many combinations.
Solution 1: Consume all flow files and group by attributes and create the new flow file and push the new one. Not ideal but gave it a try.
While running this when I had 33k flow files on queue waiting.
session.getQueueSize().getObjectCount()
This number is returning 10k all the time even though I increased the queue threshold numbers on output flows.
Solution 2: Better approach is consume one flow file and and filter flow files matching the provided attributes
final List<FlowFile> flowFiles = session.get(file -> {
if (correlationId.equals(Arrays.stream(keys).map(file::getAttribute).collect(Collectors.joining(":"))))
return FlowFileFilter.FlowFileFilterResult.ACCEPT_AND_CONTINUE;
return FlowFileFilter.FlowFileFilterResult.REJECT_AND_CONTINUE;
});
Again with 33k waiting in the queue I was expecting around 200 new grouped flow files but 320 is created. It looks like a similar issue above and does not scan all waiting flow files on filter query.
Problems-Question:
Is there a parameter to change so this getObjectCount can take up to 300k?
Is there a way to filter all waiting flow files again by changing a parameter or by changing the processor?
I tried making default queue threshold 300k on nifi.properties but it didn't help
in nifi.properties there is a parameter that affects batching behavior
nifi.queue.swap.threshold=20000
here is my test flow:
1. GenerateFlowFile with "batch size = 50K"
2. ExecuteGroovyScript with script below
3. LogAttrribute (disabled) - just to have queue after groovy
groovy script:
def ffList = session.get(100000) // get batch with maximum 100K files from incoming queue
if(!ffList)return
def ff = session.create() // create new empty file
ff.batch_size = ffList.size() // set attribute to real batch size
session.remove(ffList) // drop all incoming batch files
REL_SUCCESS << ff // transfer new file to success
with parameters above there are 4 files generated in output:
1. batch_size = 20000
2. batch_size = 10000
3. batch_size = 10000
4. batch_size = 10000
according to documentation:
There is also the notion of "swapping" FlowFiles. This occurs when the number of FlowFiles in a connection queue exceeds the value set in the nifi.queue.swap.threshold property. The FlowFiles with the lowest priority in the connection queue are serialized and written to disk in a "swap file" in batches of 10,000.
This explains that from 50K incoming files - 20K it keeps inmemory and others in swap batched by 10K.
i don't know how increasing of nifi.queue.swap.threshold property will affect your system performance and memory consumption, but i set it to 100K on my local nifi 1.16.3 and it looks good with multiple small files, and first batch increased to 100K by this.

Circuit Breaker for an orchestrator

I have a facade which calls 3 different services for some type of requests and finally orchestrates the responses before sending the response back to the client. Here, it is mandatory that all 3 services are up and serving as expected. The client request can not be served even one of them is down.
I am looking for a circuit breaker to solve this problem. The circuit breaker should respond with error code even one of the service is down. I was checking the resilence4j circuit breaker and it doesnt fit for my problem.
https://resilience4j.readme.io/docs/circuitbreaker
Is there any other open source available?
Why doesn't it fit to you problem?
You can protect every service with a CircuitBreaker. As soon one of the CircuitBreakers is open, you can short circuit and directly return an error response to your client.
CircuitBreaker Works on protected function as below –
Thread <—> CircuitBreaker <—> Protected_Function
So a Protected_Function can call 1 or more microservices, Mostly we use 1 Protected_Function for 1 external micro service call because we have can tune resilience based on the profile or behavior of that particular micro-service. But as your requirement is different so we can have 3 calls under 1 Protected_Function.
So as per your explanation above, you Façade is calling 3 micro-services (assume in series). What you can do is to call you Façade or all 3 services through or inside a Protected Function –
#CircuitBreaker(name = "OVERALL_PROTECTION")
public Your_Response Protected_Function (Your_Request) {
Call_To_Service_1;
Call_To_Service_2;
Call_To_Service_3;
return Orchestrate_Your_Response;
}
Further you can add resilience for OVERALL_PROTECTION in your YAML property file as below (I have used Count based Sliding Window) –
resilience4j.circuitbreaker:
backends:
OVERALL_PROTECTION:
registerHealthIndicator: true
slidingWindowSize: 100 # start rate calc after 100 calls
minimumNumberOfCalls: 100 # minimum calls before the CircuitBreaker can calculate the error rate.
permittedNumberOfCallsInHalfOpenState: 10 # number of permitted calls when the CircuitBreaker is half open
waitDurationInOpenState: 10s # time that the CircuitBreaker should wait before transitioning from open to half-open
failureRateThreshold: 50 # failure rate threshold in percentage
slowCallRateThreshold: 100 # consider all transactions under interceptor for slow call rate
slowCallDurationThreshold: 2s # if a call is taking more than 2s then increase the error rate
recordExceptions: # increment error rate if following exception occurs
- org.springframework.web.client.HttpServerErrorException
- java.io.IOException
- org.springframework.web.client.ResourceAccessException
You can also use time based slidingWindow instead of count based if you wish, Rest I have mentioned #Comment for self explanation in front of each parameter in configuration.
resilience4j.retry:
instances:
OVERALL_PROTECTION:
maxRetryAttempts: 5
waitDuration: 100
retryExceptions:
- org.springframework.web.client.HttpServerErrorException
- java.io.IOException
- org.springframework.web.client.ResourceAccessException
Above configuration will perform a retry for 5 times if Exceptions under retryExceptions occurs.
resilience4j.ratelimiter:
instances:
OVERALL_PROTECTION:
timeoutDuration: 100ms #The default wait time a thread waits for a permission
limitRefreshPeriod: 1000 #The period of a limit refresh. After each period the rate limiter sets its permissions count back to the limitForPeriod value
limitForPeriod: 25 #The number of permissions available during one limit refresh period
Above configuration will allow maximum up to 25 transactions in 1 second.

How can we schedule nifi data flow ? I'm using HDP 2.5

I want to schedule data flow in daily base through nifi.
For Example,
I need to run schedule on 9.00 AM in every morning.
Can anyone tell me what is procedure to make schedule data flow
There are three scheduling strategy available, see below details
Timer driven: This is the default mode. The Processor will be scheduled to run on a regular interval. The interval at which the Processor is run is defined by the ‘Run schedule’ option (see below).
Event driven: When this mode is selected, the Processor will be triggered to run by an event, and that event occurs when FlowFiles enter Connections feeding this Processor. This mode is currently considered experimental and is not supported by all Processors. When this mode is selected, the ‘Run schedule’ option is not configurable, as the Processor is not triggered to run periodically but as the result of an event. Additionally, this is the only mode for which the ‘Concurrent tasks’ option can be set to 0. In this case, the number of threads is limited only by the size of the Event-Driven Thread Pool that the administrator has configured.
CRON driven: When using the CRON driven scheduling mode, the Processor is scheduled to run periodically, similar to the Timer driven scheduling mode. However, the CRON driven mode provides significantly more flexibility at the expense of increasing the complexity of the configuration. The CRON driven scheduling value is a string of six required fields and one optional field, each separated by a space.
You typically specify values one of the following ways:
Number: Specify one or more valid value. You can enter more than one value using a comma-separated list.
Range: Specify a range using the - syntax.
Increment: Specify an increment using / syntax. For example, in the Minutes field, 0/15 indicates the minutes 0, 15, 30, and 45.
You should also be aware of several valid special characters:
* — Indicates that all values are valid for that field.
?  — Indicates that no specific value is specified. This special character is valid in the Days of Month and Days of Week field.
L  — You can append L to one of the Day of Week values, to specify the last occurrence of this day in the month. For example, 1L indicates the last Sunday of the month.
For example:
The string 0 0 13 * * ? indicates that you want to schedule the processor to run at 1:00 PM every day.
The string 0 20 14 ? * MON-FRI indicates that you want to schedule the processor to run at 2:20 PM every Monday through Friday.
The string 0 15 10 ? * 6L 2011-2017 indicates that you want to schedule the processor to run at 10:15 AM, on the last Friday of every month, between 2011 and 2017.
For your schedule time should be like below;
0109*?** - this meaning every 1 seconds 09 minutes is morning 9 AM other * fields run every day and month.
I hope it helps you!!!
Scheduling Strategy: CRON driven
Run Schedule: 0 0 9 * * ? *

How to view all information relative all slave machines in jmeter in real time

I configured my test in this way (in windows 7):
1 Virtual machine is master, that run all vm slaves with the command for a distribuited testing (from command line) and show in jmeter GUI some graphs (for example jp#gc Active thread over time , hits/sec, response time, etc..).
3 Virtual machine are slave, to execute the testing;
When master run the "start" to 3 slave, the test works (each slave run 6 thread), and in the GUI on master, there are only 6 thread in the graph (jp#gc - Active Threads Over Time), but in reality are 18 (6 thread for slaves, with 3 slaves).
So my question is: how can I see the total data for all slaves?
jp#gc - Active Threads Over Time = to see 18 thread (thread slave1 +thread slave2+thread slave3)
jp#gc - Hits per Second = Hits slave 1 +Hits slave 2+ Hits slave 3
and so on...
You need to add __machineName or __machineIP function so the listeners could distinguish results coming from different nodes.
Also be aware of mode property which is configured to send results from slave machines each 100 results or each minute (whatever comes the first) so you might want to amend it, i.e. add mode=Standard line to user.properties file on each slave node.
# Remote batching support
# Since JMeter 2.9, default is MODE_STRIPPED_BATCH, which returns samples in
# batch mode (every 100 samples or every minute by default)
# Note also that MODE_STRIPPED_BATCH strips response data from SampleResult, so if you need it change to
# another mode
# Hold retains samples until end of test (may need lots of memory)
# Batch returns samples in batches
# Statistical returns sample summary statistics
# hold_samples was originally defined as a separate property,
# but can now also be defined using mode=Hold
# mode can also be the class name of an implementation of org.apache.jmeter.samplers.SampleSender
#mode=Standard
#mode=Batch
#mode=Hold
#mode=Statistical
See Apache JMeter Properties Customization Guide for more information on working with JMeter properties.
Be aware that sending results in case of severe load may cause network IO overhead so it might be a good idea to consider Backend Listener instead
Add the Machine Info function to the thread group name area as shown below:

Difference between expected_nodes and recover_after_nodes parameters

I can't see the difference between two parameters for the recovery phase of the gateway module.
In the documentation :
The gateway.recover_after_nodes setting (which accepts a number) controls after how many (...) eligible nodes (...) recovery will start.
The gateway.expected_nodes allows to set how many (...) eligible nodes are expected to be in the cluster, and once met, (...) recovery starts
From what I understand, these two settings trigger the recovery phase once the number of node is equal to the value set.
Why using one over the other?
And what is the point of using both of them?
For example :
gateway:
recover_after_nodes: 3
expected_nodes: 5
In this case, what is the purpose of expected_nodes? recovery will be triggered as soon as there will be 3 nodes. There must be another reason to use it.
I hope my question is clear enough.
Thanks in advance!
When using recovery_after_nodes, recover_after_data_nodes or recovery_after_master_nodes, once all set conditions are met the cluster will then start waiting recover_after_time before starting recovery:
The gateway.recover_after_time setting (which accepts a time value)
sets the time to wait till recovery happens once all
gateway.recover_after...nodes conditions are met.
When using expected_nodes, expected_data_nodes or expected_master nodes, recovery will start once all conditions are met - the cluster will not wait. In addition, it will also default recovery_after_time to 5 min.
In your test case:
gateway:
recover_after_nodes: 3
expected_nodes: 5
Once you hit 3 nodes a countdown clock starts and the cluster will then recovery in either 5 minutes (the default) or if you hit 5 nodes. Basically it allows you to set a minimum threshold (recovery_after_nodes), with a timeout (recovery_after_time) to wait for a desired state (expected_nodes). You will either recovery recovery_after_time after recovery_after_nodes is hit, or when expected_nodes is hit (no additional waiting) - whichever comes first.
from the public document, there are misunderstandings in this threads.
http://www.elastic.co/guide/en/elasticsearch/reference/1.x/modules-gateway.html
gateway:
recover_after_time: 5m
expected_nodes: 2
In an expected 2 nodes cluster will cause recovery to start 5 minutes
after the first node is up, but once there are 2 nodes in the cluster,
recovery will begin immediately (without waiting).
so, the timer defined by recover_after_time will start already after a first node is up. not start after finding nodes defined in recover_after_nodes

Resources