Kafka Streams: branching and merging one of the branches back into the stream before branching? - apache-kafka-streams

Is it possible to have a topology that does the following (simplified example):
                                  (success)
                                 /---> branch1 ---> output sink
source ---> stream ---> split --<
   ^                             \  (failure)
   |                              \---> branch2 ---> retry processor ---\
   |                                                                    |
   \---<--- merge back if successful after x attempts <----------------/
Basically, the predicate that decides whether events go to the output sink is an external call which might fail. If it fails, we send the event to a default branch (branch2), where a retry processor retries it a couple of times; once the call succeeds, we forward the event and merge the resulting stream back into the initial stream so that it gets another chance to reach the output topic.
Is there a way to make this work? I've managed to make the event go all the way to the retry processor, which does its job and forwards the event once it is successful, but I haven't been able to make the event go back to the initial stream.
I apply a merge operation between the KStream produced by the retry processor and the original stream.
Example code:
KStream<String, String> stream = builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()));

var branches = stream.split(Named.as("type"))
        .branch((key, value) -> {
            try {
                return someOperationThatMightFail();
            } catch (Exception e) {
                return false;
            }
        }, Branched.withFunction(kstream -> kstream, "branch1"))
        .defaultBranch(Branched.withFunction(kstream -> kstream, "branch2"));

// Note: split(Named.as("type")) prefixes the branch names, so the map keys are "typebranch1"/"typebranch2".
var successBranch = branches.get("typebranch1");
var failedBranch = branches.get("typebranch2");

successBranch.to("output-topic");
failedBranch.transform(retrySupplier, Named.as("retry")).merge(stream);
The other option would be to merge it with the successful branch1 after the retry processor, but I also can't seem to make that work.
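For what it's worth, here is a minimal sketch of that second option: merging the retried records with the success branch and writing both to the sink, instead of merging back into the original stream (which would re-run the failing predicate). It reuses the split, topic names and retrySupplier from the question; callSucceeds() is just a hypothetical stand-in for the try/catch predicate, and the retry processor is assumed to forward a record only once the external call has succeeded.
KStream<String, String> stream = builder.stream("input-topic",
        Consumed.with(Serdes.String(), Serdes.String()));

var branches = stream.split(Named.as("type"))
        .branch((key, value) -> callSucceeds(key, value), Branched.as("branch1"))
        .defaultBranch(Branched.as("branch2"));

// split(Named.as("type")) prefixes the branch names in the resulting map.
KStream<String, String> successBranch = branches.get("typebranch1");
KStream<String, String> failedBranch = branches.get("typebranch2");

// retrySupplier is assumed to forward a record only after the external call succeeds.
KStream<String, String> retried = failedBranch.transform(retrySupplier, Named.as("retry"));

// Merge the retried records with the success branch and send both to the sink,
// rather than merging back into 'stream' and re-evaluating the predicate.
successBranch.merge(retried).to("output-topic");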

Related

Is there an operator that will not cause an error? bufferTimeout: "Could not emit buffer due to lack of requests"

API: Exception - reactor-core
The example works as follows:
It subscribes to the queue at startup. The incoming data comes from Rabbit and is processed; this can take a relatively long time, and the result is sent to another Rabbit queue.
Because of bulk processing, I buffer 10 elements at a time. If 10 elements don't arrive in time, I use a timeout (on the buffer) to release them for processing.
Problem: if the processing or the Rabbit publisher is slow, bufferTimeout does not receive a "request"; when the timeout runs out, bufferTimeout wants to emit and I get the following exception: "Could not emit buffer due to lack of requests".
Since I need all the data, I rule out onBackpressureDrop and onBackpressureLatest. Plain onBackpressureBuffer is not good either, because it does not forward the downstream request count upstream (it uses request(unbounded) rather than request(n)).
Example Kotlin code:
@Configuration
class SpringCloudStreamRabbitProcessor {

    @Bean
    fun rabbitFunc() = Function<Flux<Int>, Flux<Int>> {
        it.bufferTimeout(10, Duration.ofMinutes(1))
            .concatMap { intList ->
                // process
                Mono.just(intList)
            }
            .flatMapIterable { intList ->
                intList
            }
    }
}
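For illustration only, here is a minimal reactor-core sketch (in Java, not taken from the question) of one possible mitigation: placing onBackpressureBuffer() after bufferTimeout so the timer-triggered emission always has outstanding demand. The trade-off is that upstream backpressure is replaced by in-memory buffering; the processBatch helper, the range source, and all numbers are made-up stand-ins for the Rabbit publishing step.
import java.time.Duration;
import java.util.List;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

public class BufferTimeoutSketch {
    public static void main(String[] args) throws InterruptedException {
        Flux.range(0, 100)
                .bufferTimeout(10, Duration.ofSeconds(1))
                // Absorb timer-triggered emissions while the slow consumer has not requested yet;
                // the unbounded request now hits this buffer instead of reaching bufferTimeout unmet.
                .onBackpressureBuffer()
                .concatMap(BufferTimeoutSketch::processBatch) // slow step, one batch at a time
                .flatMapIterable(list -> list)
                .subscribe(i -> System.out.println("out: " + i));

        Thread.sleep(10_000);
    }

    // Hypothetical slow processing standing in for the Rabbit publisher.
    private static Mono<List<Integer>> processBatch(List<Integer> batch) {
        return Mono.just(batch).delayElement(Duration.ofMillis(500), Schedulers.boundedElastic());
    }
}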

Backpressure with Reactor's ParallelFlux + Timeouts

I'm currently working on using parallelism in a Flux, and right now I'm having problems with backpressure. In our case we have a fast-producing service we want to consume, but we are much slower.
With a normal Flux this works so far, but we want parallelism. What I see when I use the approach with
.parallel(2)
.runOn(Schedulers.parallel())
is that there is a big request at the beginning, which takes quite a long time to process. A different problem also occurs here: if we take too long to process, we somehow seem to trigger a cancel event in the producer service (we consume it via a WebFlux REST call), but no cancel event is seen on the consumer side.
But back to problem 1: how can I bring this back in sync with the producer? I know of the prefetch parameter on the .parallel() method, but it does not work as I expect.
A minimal example would be something like this:
fun main() {
    val atomicInteger = AtomicInteger(0)
    val receivedCount = AtomicInteger(0)
    val processedCount = AtomicInteger(0)

    Flux.generate<Int> {
        it.next(atomicInteger.getAndIncrement())
        println("Emitted ${atomicInteger.get()}")
    }.doOnEach { it.get()?.let { receivedCount.addAndGet(1) } }
        .parallel(2, 1)
        .runOn(Schedulers.parallel())
        .flatMap {
            Thread.sleep(200)
            log("Work on $it")
            processedCount.addAndGet(1)
            Mono.just(it * 2)
        }.subscribe {
            log("Received ${receivedCount.get()} and processed ${processedCount.get()}")
        }

    Thread.sleep(25000)
}
where I can observe logs like this
...
Emitted 509
Emitted 510
Emitted 511
Emitted 512
Emitted 513
2022-02-02T14:12:58.164465Z - Thread[parallel-1,5,main] Work on 0
2022-02-02T14:12:58.168469Z - Thread[parallel-2,5,main] Work on 1
2022-02-02T14:12:58.241966Z - Thread[parallel-1,5,main] Received 513 and processed 2
2022-02-02T14:12:58.241980Z - Thread[parallel-2,5,main] Received 513 and processed 2
2022-02-02T14:12:58.442218Z - Thread[parallel-2,5,main] Work on 3
2022-02-02T14:12:58.442215Z - Thread[parallel-1,5,main] Work on 2
2022-02-02T14:12:58.442315Z - Thread[parallel-2,5,main] Received 513 and processed 3
2022-02-02T14:12:58.442338Z - Thread[parallel-1,5,main] Received 513 and processed 4
So how can I adjust this so that I can use the parallelism but stay in backpressure/sync with my producer? The only way I got it to work was with a semaphore acquired before the ParallelFlux and released after the work, but that is not really a nice solution.
OK, for this scenario it turned out to be crucial that the prefetch of both parallel and runOn be set very low, here to 1.
With the default of 256, we requested too much from our producer, so a cancel event occurred because of the long gap between the first block of requests (filling the prefetch) and the next one, when the Flux decided to fill the buffer again.
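For reference, a minimal Java sketch of that low-prefetch variant; the parallelism of 2, the prefetch of 1, and the simulated 200 ms of work mirror the Kotlin example above, while the range source and log lines are just stand-ins.
import reactor.core.publisher.Flux;
import reactor.core.scheduler.Schedulers;

public class LowPrefetchParallelSketch {
    public static void main(String[] args) throws InterruptedException {
        Flux.range(0, 1000)
                .doOnRequest(n -> System.out.println("upstream request: " + n))
                // prefetch = 1 on parallel(): each rail pulls one element at a time from the source.
                .parallel(2, 1)
                // prefetch = 1 on runOn(): no 256-element batch is prefetched onto the worker threads.
                .runOn(Schedulers.parallel(), 1)
                .map(i -> {
                    sleep(200); // simulate the slow work
                    return i * 2;
                })
                .sequential()
                .subscribe(i -> System.out.println("processed: " + i));

        Thread.sleep(10_000);
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}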

Is there any function in reactor.util.retry.Retry that I can use when a retry is successful?

I need to perform some action immediately after a retry succeeds, for a server-sent events implementation, but I was unable to find any such method in reactor.util.retry.Retry. Is there any alternative to something like doOnRetrySuccess(func)?
I'm not aware of a built-in operator to do this, but you could utilize an AtomicBoolean to detect if the onNext/onComplete signal occurs immediately after a retry:
final AtomicBoolean retrying = new AtomicBoolean();

monoOrFlux
    .retryWhen(Retry
        // Configure whatever retry behavior you want here.
        // For simplicity, this example uses .indefinitely()
        .indefinitely()
        // Set the retrying flag to indicate that a retry is being attempted.
        .doBeforeRetry(signal -> retrying.set(true)))
    // Check and reset the retrying flag in doOnEach.
    // This example uses doOnEach, which works for both Mono and Flux.
    // If the stream is a Mono, then you could simplify this to use doOnSuccess instead.
    .doOnEach(signal -> {
        if ((signal.isOnNext() || signal.isOnComplete())
                && retrying.compareAndSet(true, false)) {
            // This is the first onNext signal emitted after a retry,
            // or the onComplete signal (if no element was emitted) after a retry.
        }
    });
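A self-contained usage sketch of the same pattern, for illustration: the failing source (errors twice, then succeeds) and the log message are made up, everything else follows the snippet above.
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import reactor.core.publisher.Flux;
import reactor.util.retry.Retry;

public class RetrySuccessSketch {
    public static void main(String[] args) {
        AtomicBoolean retrying = new AtomicBoolean();
        AtomicInteger attempts = new AtomicInteger();

        Flux.defer(() -> attempts.incrementAndGet() < 3
                        ? Flux.<String>error(new RuntimeException("connection dropped"))
                        : Flux.just("event-1", "event-2"))
                .retryWhen(Retry.indefinitely()
                        .doBeforeRetry(signal -> retrying.set(true)))
                .doOnEach(signal -> {
                    if ((signal.isOnNext() || signal.isOnComplete())
                            && retrying.compareAndSet(true, false)) {
                        // First signal after a successful retry: put the reconnect-style hook here.
                        System.out.println("retry succeeded, resuming stream");
                    }
                })
                .subscribe(System.out::println);
    }
}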

Flowfiles stuck in queue (Apache NiFi)

I have the following flow:
ListFTP -> RouteOnAttribute -> FetchFTP -> UnpackContent -> ExecuteScript.
Some files are stuck in the queue between UnpackContent and ExecuteScript.
ExecuteScript ate some flowfiles and they just disappeared: the failure and success relationships are empty, and the processor only showed some activity in the Tasks/Time field. The rest are stuck in the queue before ExecuteScript. I tried to empty the queue, but not all of the flowfiles were deleted; about a third of them are still stuck. I tried disabling all processors and emptying the queue again, but it returns: 0 FlowFiles (0 bytes) were removed from the queue.
When I try to change the Connection destination, it returns:
Cannot change destination of Connection because FlowFiles from this Connection are currently held by ExecuteScript[id=d33c9b73-0177-1000-5151-83b7b938de39]
The ExecuteScript script is from this answer (it uses Python).
So I can't empty the queue, because it always returns a message that there are no flowfiles, and I can't remove the connection. This has been going on for several hours.
Connection configuration:
Scheduling is set to 0 sec, no penalties for flowfiles, etc.
Is it a script problem?
UPDATE
Changed script to:
flowFile = session.get()
if (flowFile != None):
    # All processing code starts at this indent
    if errorOccurred:
        session.transfer(flowFile, REL_FAILURE)
    else:
        session.transfer(flowFile, REL_SUCCESS)
# implicit return at the end
Same result.
UPDATE v2
I set concurrent tasks to 50, ran ExecuteScript again, and then terminated it. I got this error:
UPDATE v3
I created an additional ExecuteScript processor with the same script and it worked fine. But after I stopped this new processor and created new flowfiles, it now has the same problem: it's just stuck.
Hilarious. Is ExecuteScript single-use?
You need to modify your code in NiFi 1.13.2, because NIFI-8080 caused these bugs; or you can just use NiFi 1.12.1.
JythonScriptEngineConfigurator:
@Override
public Object init(ScriptEngine engine, String scriptBody, String[] modulePaths) throws ScriptException {
    // Always compile when first run
    if (engine != null) {
        // Add prefix for import sys and all jython modules
        prefix = "import sys\n"
                + Arrays.stream(modulePaths)
                        .map((modulePath) -> "sys.path.append(" + PyString.encode_UnicodeEscape(modulePath, true) + ")")
                        .collect(Collectors.joining("\n"));
    }
    return null;
}

@Override
public Object eval(ScriptEngine engine, String scriptBody, String[] modulePaths) throws ScriptException {
    Object returnValue = null;
    if (engine != null) {
        returnValue = ((Compilable) engine).compile(prefix + scriptBody).eval();
    }
    return returnValue;
}

Spring Integration DSL how to route split messages to different concurrent flows?

I probably hate writing noob questions as much as other people hate answering them, but here goes.
I need to split a message retrieved from a JdbcPollingChannelAdapter into multiple messages based on the operation requested in each row of the resultset in the payload.
The split operation is simple enough. What is proving to be a challenge is conditionally routing the message to one flow or the other.
After much trial and error, I believe that this flow represents my intention
                                                    /- insertUpdateAdapter -\
Poll Table -> decorate headers -> split -> router -<                        >- aggregator -> cleanup
                                                    \---- deleteAdapter ----/
To that end I have constructed this Java DSL:
final JdbcOutboundGateway inboundAdapter = createInboundAdapter();
final JdbcOutboundGateway deleteAdapter = createDeleteAdapter();
final JdbcOutboundGateway insertUpdateAdapter = createInsertUpdateAdapter();

return IntegrationFlows
        .from(setupAdapter,
                c -> c.poller(Pollers.fixedRate(1000L, TimeUnit.MILLISECONDS).maxMessagesPerPoll(1)))
        .enrichHeaders(h -> h.headerExpression("start", "payload[0].get(\"start\")")
                .headerExpression("end", "payload[0].get(\"end\")"))
        .handle(inboundAdapter)
        .split(insertDeleteSplitter)
        .enrichHeaders(h -> h.headerExpression("operation", "payload[0].get(\"operation\")"))
        .channel(c -> c.executor("stepTaskExecutor"))
        .routeToRecipients(r -> r
                .recipientFlow("'I' == headers.operation or 'U' == headers.operation",
                        f -> f.handle(insertUpdateAdapter))
                // This element is complaining "Syntax error on token ")", ElidedSemicolonAndRightBrace expected"
                // Attempted to follow patterns from https://github.com/spring-projects/spring-integration-java-dsl/wiki/Spring-Integration-Java-DSL-Reference#routers
                .recipientFlow("'D' == headers.operation",
                        f -> f.handle(deleteAdapter))
                .defaultOutputToParentFlow())
        )
        .aggregate()
        .handle(cleanupAdapter)
        .get();
Assumptions I have made, based on prior work include:
The necessary channels are auto-created as Direct Channels
Route To Recipients is the appropriate tool for this function (I have also considered expression router, but the examples of how to add sub-flows were less clear than the Route To Recipients)
Insert an ExecutorChannel somewhere between the splitter and router if you want to run the splits in parallel. You can limit the pool size of the executor to control the concurrency.
There is an extra parenthesis after .defaultOutputToParentFlow())
The corrected code is:
return IntegrationFlows
        .from(setupAdapter,
                c -> c.poller(Pollers.fixedRate(1000L, TimeUnit.MILLISECONDS).maxMessagesPerPoll(1)))
        .enrichHeaders(h -> h.headerExpression("ALC_startTime", "payload[0].get(\"ALC_startTime\")")
                .headerExpression("ALC_endTime", "payload[0].get(\"ALC_endTime\")"))
        .handle(inboundAdapter)
        .split(insertDeleteSplitter)
        .enrichHeaders(h -> h.headerExpression("ALC_operation", "payload[0].get(\"ALC_operation\")"))
        .channel(c -> c.executor(stepTaskExecutor))
        .routeToRecipients(r -> r
                .recipientFlow("'I' == headers.ALC_operation or 'U' == headers.ALC_operation",
                        f -> f.handle(insertUpdateAdapter))
                .recipientFlow("'D' == headers.ALC_operation",
                        f -> f.handle(deleteAdapter))
                .defaultOutputToParentFlow())
        .aggregate()
        .handle(cleanupAdapter)
        .get();
