I'm currently working on using parallelism in a Flux, and right now I'm having problems with backpressure. In our case we have a fast-producing service we want to consume, but we are much slower.
With a normal Flux this works so far, but we want to have parallelism. What I see when I'm using the approach with
.parallel(2)
.runOn(Schedulers.parallel())
is that there is a big request at the beginning, which takes quite a long time to process. A different problem also occurs here: if we take too long to process, we somehow seem to generate a cancel event in the producer service (we consume it via a WebFlux REST call), but no cancel event is seen on the consumer side.
But back to problem 1: how is it possible to bring this back in sync? I know of the prefetch parameter on the .parallel() method, but it does not work as I expect.
A minimal example would be something like this:
fun main() {
    val atomicInteger = AtomicInteger(0)
    val receivedCount = AtomicInteger(0)
    val processedCount = AtomicInteger(0)

    Flux.generate<Int> {
        it.next(atomicInteger.getAndIncrement())
        println("Emitted ${atomicInteger.get()}")
    }.doOnEach { it.get()?.let { receivedCount.addAndGet(1) } }
        .parallel(2, 1)
        .runOn(Schedulers.parallel())
        .flatMap {
            Thread.sleep(200)
            log("Work on $it")
            processedCount.addAndGet(1)
            Mono.just(it * 2)
        }.subscribe {
            log("Received ${receivedCount.get()} and processed ${processedCount.get()}")
        }

    Thread.sleep(25000)
}
where I can observe logs like this
...
Emitted 509
Emitted 510
Emitted 511
Emitted 512
Emitted 513
2022-02-02T14:12:58.164465Z - Thread[parallel-1,5,main] Work on 0
2022-02-02T14:12:58.168469Z - Thread[parallel-2,5,main] Work on 1
2022-02-02T14:12:58.241966Z - Thread[parallel-1,5,main] Received 513 and processed 2
2022-02-02T14:12:58.241980Z - Thread[parallel-2,5,main] Received 513 and processed 2
2022-02-02T14:12:58.442218Z - Thread[parallel-2,5,main] Work on 3
2022-02-02T14:12:58.442215Z - Thread[parallel-1,5,main] Work on 2
2022-02-02T14:12:58.442315Z - Thread[parallel-2,5,main] Received 513 and processed 3
2022-02-02T14:12:58.442338Z - Thread[parallel-1,5,main] Received 513 and processed 4
So how could I adjust this so that I can use the parallelism but stay in backpressure/sync with my producer? The only way I got it to work is with a semaphore acquired before the ParallelFlux and released after the work, but this is not really a nice solution.
OK, for this scenario it turned out to be crucial that the prefetch of both parallel() and runOn() is set very low, here to 1.
With the default of 256 we requested too much from our producer, so a cancel event already occurred because of the long gap between the first block of requests (filling the prefetch) and the next one, when the Flux decided to fill the buffer again.
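For illustration, here is a minimal sketch of that configuration (written in Java; the same overloads exist from Kotlin). The Thread.sleep(200) only stands in for our slow processing:

AtomicInteger counter = new AtomicInteger();
Flux.<Integer>generate(sink -> sink.next(counter.getAndIncrement()))
        .parallel(2, 1)                        // parallelism = 2, prefetch = 1
        .runOn(Schedulers.parallel(), 1)       // per-rail prefetch = 1
        .flatMap(i -> Mono.fromCallable(() -> {
            Thread.sleep(200);                 // placeholder for the slow consumer work
            return i * 2;
        }))
        .subscribe(result -> System.out.println("Processed " + result));

With both prefetch values at 1, each rail only requests one element ahead of what it is currently processing, so the upstream request pattern stays in step with the actual processing speed.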
The example works as follows:
It subscribes to the incoming data from Rabbit and processes it. This can take a relatively long time, and the result is sent to another Rabbit queue.
Because of bulk processing, I buffer 10 elements. If I don't receive enough elements to fill the buffer of 10, I use a timeout (on the buffer) to release them for processing.
Problem: if processing or the Rabbit publisher is slow, bufferTimeout does not receive a request, but when the timeout runs out bufferTimeout wants to emit. Then I get the following exception: "Could not emit buffer due to lack of requests".
Since I need all the data, I rule out onBackpressureDrop and onBackpressureLatest. Using plain onBackpressureBuffer won't be good either, because it does not forward the received request count upstream (onBackpressureBuffer uses request(unbounded), not request(n)).
Example Kotlin code:
@Configuration
class SpringCloudStreamRabbitProcessor {

    @Bean
    fun rabbitFunc() = Function<Flux<Int>, Flux<Int>> {
        it.bufferTimeout(10, Duration.ofMinutes(1))
            .concatMap { intList ->
                // process
                Mono.just(intList)
            }
            .flatMapIterable { intList ->
                intList
            }
    }
}
I get an OutOfMemoryError while reading messages from a queue with 2 million messages, and I am trying to find a way to read the messages in chunks of 1000, for example.
Here is my code:
List<TextMessage> messages = jmsTemplate.browse(JndiQueues.BACKOUT, (session, browser) -> {
    Enumeration<?> browserEnumeration = browser.getEnumeration();
    List<TextMessage> messageList = new ArrayList<TextMessage>();
    while (browserEnumeration.hasMoreElements()) {
        messageList.add((TextMessage) browserEnumeration.nextElement());
    }
    return messageList;
});
thanks
Perform the browse on a different thread and pass subsets of the results to the main thread via a BlockingQueue<List<TextMessage>>, e.g. a LinkedBlockingQueue with a small capacity.
The browsing thread will block when the queue is full. When the main thread removes an entry from the queue, the browser can add a new one.
It probably makes sense to have a capacity of at least 2 so that the next list is available immediately.
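A rough sketch of that idea (reusing jmsTemplate and JndiQueues.BACKOUT from your code; the chunk size of 1000 and the empty list used as an end-of-browse marker are just illustrative choices):

BlockingQueue<List<TextMessage>> chunks = new LinkedBlockingQueue<>(2); // small capacity

new Thread(() -> jmsTemplate.browse(JndiQueues.BACKOUT, (session, browser) -> {
    try {
        Enumeration<?> browserEnumeration = browser.getEnumeration();
        List<TextMessage> batch = new ArrayList<>();
        while (browserEnumeration.hasMoreElements()) {
            batch.add((TextMessage) browserEnumeration.nextElement());
            if (batch.size() == 1000) {
                chunks.put(new ArrayList<>(batch)); // blocks while the main thread is busy
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            chunks.put(batch);                      // last, possibly partial, chunk
        }
        chunks.put(Collections.emptyList());        // empty list marks the end of the browse
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
    return null;
})).start();

// Main thread: handle one chunk of up to 1000 messages at a time.
// take() throws InterruptedException, so the surrounding method has to deal with it.
List<TextMessage> chunk;
while (!(chunk = chunks.take()).isEmpty()) {
    // process the chunk here
}

This way only a couple of chunks are held in memory at any time, instead of all 2 million messages.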
We are receiving data in Spark Streaming from Kafka. Once execution has started in Spark Streaming, it executes only one batch and the remaining batches start queuing up in Kafka.
Our data is independent and can be processed in parallel.
We tried multiple configurations with different numbers of executors, cores, backpressure and other settings, but nothing has worked so far. A lot of messages are queued, only one micro-batch is processed at a time, and the rest remain in the queue.
We want to achieve maximum parallelism, so that no micro-batch is queued, as we have enough resources available. So how can we reduce the processing time by making maximum use of the resources?
// Start reading messages from Kafka and get DStream
final JavaInputDStream<ConsumerRecord<String, byte[]>> consumerStream = KafkaUtils.createDirectStream(
        getJavaStreamingContext(), LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, byte[]>Subscribe("TOPIC_NAME",
                sparkServiceConf.getKafkaConsumeParams()));

ThreadContext.put(Constants.CommonLiterals.LOGGER_UID_VAR, CommonUtils.loggerUniqueId());

JavaDStream<byte[]> messagesStream = consumerStream.map(new Function<ConsumerRecord<String, byte[]>, byte[]>() {
    private static final long serialVersionUID = 1L;

    @Override
    public byte[] call(ConsumerRecord<String, byte[]> kafkaRecord) throws Exception {
        return kafkaRecord.value();
    }
});

// Decode each binary message and generate JSON array
JavaDStream<String> decodedStream = messagesStream.map(new Function<byte[], String>() {
    private static final long serialVersionUID = 1L;

    @Override
    public String call(byte[] asn1Data) throws Exception {
        if (asn1Data.length > 0) {
            try (InputStream inputStream = new ByteArrayInputStream(asn1Data);
                 Writer writer = new StringWriter();) {
                ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(asn1Data);
                GZIPInputStream gzipInputStream = new GZIPInputStream(byteArrayInputStream);
                byte[] buffer = new byte[1024];
                ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
                int len;
                while ((len = gzipInputStream.read(buffer)) != -1) {
                    byteArrayOutputStream.write(buffer, 0, len);
                }
                return new String(byteArrayOutputStream.toByteArray());
            } catch (Exception e) {
                //
                producer.flush();
                throw e;
            }
        }
        return null;
    }
});

// publish generated json gzip to kafka
cache.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    private static final long serialVersionUID = 1L;

    @Override
    public void call(JavaRDD<String> jsonRdd4DF) throws Exception {
        //Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
        if (!jsonRdd4DF.isEmpty()) {
            //JavaRDD<String> jsonRddDF = getJavaSparkContext().parallelize(jsonRdd4DF.collect());
            Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
            SparkAIRMainJsonProcessor airMainJsonProcessor = new SparkAIRMainJsonProcessor();
            airMainJsonProcessor.processAIRData(json, sparkSession);
        }
    }
});

getJavaStreamingContext().start();
getJavaStreamingContext().awaitTermination();
getJavaStreamingContext().stop();
Technologies we are using:
HDFS 2.7.1.2.5
YARN + MapReduce2 2.7.1.2.5
ZooKeeper 3.4.6.2.5
Ambari Infra 0.1.0
Ambari Metrics 0.1.0
Kafka 0.10.0.2.5
Knox 0.9.0.2.5
Ranger 0.6.0.2.5
Ranger KMS 0.6.0.2.5
SmartSense 1.3.0.0-1
Spark2 2.0.x.2.5
Statistics we got from different experiments:
Experiment 1
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 48 Minutes
Experiment 2
spark.default.parallelism=12
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 8 Minutes
Experiment 3
spark.default.parallelism=12
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 7 Minutes
Experiment 4
spark.default.parallelism=16
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 10 Minutes
Please advise how we can process as much as possible in parallel so that nothing is queued.
I was facing the same issue, tried a few things to resolve it, and came to the following findings:
First of all, intuition says that one batch should be processed per executor, but in fact only one batch is processed at a time, while its jobs and tasks are processed in parallel.
Multiple batch processing can be achieved by using spark.streaming.concurrentJobs, but it's not documented and still needs a few fixes. One of the problems is with saving Kafka offsets: suppose we set this parameter to 4 and 4 batches are processed in parallel; if the 3rd batch finishes before the 4th one, which Kafka offsets should be committed? This parameter is quite useful if batches are independent.
Because of its name, spark.default.parallelism is sometimes expected to make everything parallel, but its real benefit is in distributed shuffle operations. Try different numbers and find an optimum for your jobs; you will see a considerable difference in processing time, depending on the shuffle operations in your jobs. Setting it too high decreases performance, which is apparent from your experiment results too.
Another option is to use foreachPartitionAsync in place of foreach on the RDD. But I think foreachPartition is better, as foreachPartitionAsync would queue up the jobs: batches would appear to be processed while their jobs are still queued or in progress. Maybe I didn't get its usage right, but it behaved the same in my three services.
The FAIR spark.scheduler.mode should be used for jobs with lots of tasks, as the round-robin assignment of resources to jobs gives smaller tasks a chance to start while bigger tasks are still running.
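For illustration, a minimal sketch of how these settings could be wired into the driver configuration (the values here are placeholders, not recommendations; tune them for your own job):

SparkConf conf = new SparkConf()
        .setAppName("streaming-app")                    // hypothetical app name
        .set("spark.streaming.concurrentJobs", "4")     // undocumented: run jobs of several batches concurrently
        .set("spark.default.parallelism", "12")         // mainly helps shuffle-heavy stages
        .set("spark.scheduler.mode", "FAIR");           // round-robin resource assignment across jobs
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));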
Try to tune your batch duration and input size, and always keep the processing time below the batch duration; otherwise you're going to see a long backlog of batches.
These are my findings and suggestions. However, there are so many configurations and methods for doing streaming that one set of operations often doesn't work for others. Spark Streaming is all about learning, putting your experience and anticipation together to arrive at an optimal set of configurations.
Hope it helps. It would be a great relief if someone could tell specifically how we can legitimately process batches in parallel.
We want to achieve maximum parallelism, so that no micro-batch is queued
That's the thing about stream processing: you process the data in the order it was received. If you process your data at a rate slower than it arrives, it will be queued. Also, don't expect that processing of one record will suddenly be parallelized across multiple nodes.
From your screenshot, it seems your batch time is 10 seconds and your producer published 100 records over 90 seconds.
It took 36 s to process 2 records and 70 s to process 17 records, so clearly there is some per-batch overhead. If this dependency is linear, it would take only about 4:18 to process all 100 records in a single mini-batch, thus beating your record holder.
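To spell out that estimate (it assumes the per-record cost really is linear): the per-record time is (70 - 36) / (17 - 2) ≈ 2.27 s, the fixed per-batch overhead is roughly 36 - 2 × 2.27 ≈ 31.5 s, so 100 records in one batch would take about 31.5 + 100 × 2.27 ≈ 258 s ≈ 4 min 18 s.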
Since your code is not complete, it's hard to tell what exactly takes so much time. The transformations in the code look fine, but probably the action (or subsequent transformations) is the real bottleneck. Also, what's with producer.flush(), which isn't mentioned anywhere else in your code?
I was facing the same issue and I solved it using Scala Futures.
Here are some links that show how to use them:
https://alvinalexander.com/scala/how-use-multiple-scala-futures-in-for-comprehension-loop
https://www.beyondthelines.net/computing/scala-future-and-execution-context/
Also, this is a piece of my code where I used Scala Futures:
messages.foreachRDD { rdd =>
    val f = Future {
        // sleep(100)
        val newRDD = rdd.map { message =>
            val req_message = message.value()
            (message.value())
        }
        println("Request messages: " + newRDD.count())
        var resultrows = newRDD.collect() //.collectAsList()
        processMessage(resultrows, mlFeatures: MLFeatures, conf)
        println("Inside scala future")
        1
    }
    f.onComplete {
        case Success(messages) => println("yay!")
        case Failure(exception) => println("On no!")
    }
}
It's hard to tell without having all the details, but general advice for tackling issues like this is to start with a very simple application, of the "Hello world" kind. Just read from the input stream and print the data into a log file. Once this works you have proved that the problem was in the application, and you gradually add your functionality back until you find the culprit. If even the simplest app doesn't work, you know the problem is in the configuration or the Spark cluster itself. Hope this helps.
I have a single ActorA that reads from an input stream and sends messages to a group of ActorB's. When ActorA reaches the end of the input stream it cleans up its resources, broadcasts a Done message to the ActorB's, and shuts itself down.
I have approx 12 ActorB's that send messages to a group of ActorC's. When an ActorB receives a Done message from ActorA then it cleans up its resources and shuts itself down, with the exception of the last surviving ActorB which broadcasts a Done message to the ActorC's before it shuts itself down.
I have approx 24 ActorC's that send messages to a single ActorD. Similar to the ActorB's, when each ActorC gets a Done message it cleans up its resources and shuts itself down, with the exception of the last surviving ActorC which sends a Done message to ActorD.
When ActorD gets a Done message it cleans up its resources and shuts itself down.
Initially I had the ActorB's and ActorC's immediately propagate the Done message when they received it, but this might cause the ActorC's to shut down before all of the ActorB's have finished processing their queues; likewise the ActorD might shut down before the ActorC's have finished processing their queues.
My solution is to use an AtomicInteger that is shared among the ActorB's
class ActorB(private val actorCRouter: ActorRef,
             private val actorCount: AtomicInteger) extends Actor {

    private val init = {
        actorCount.incrementAndGet()
        ()
    }

    def receive = {
        case Done => {
            if (actorCount.decrementAndGet() == 0) {
                actorCRouter ! Broadcast(Done)
            }
            // clean up resources
            context.stop(self)
        }
    }
}
ActorC uses similar code, with each ActorC sharing an AtomicInteger.
At present all actors are initialized in a web service method, with the downstream ActorRef's passed in the upstream actors' constructors.
Is there a preferred way to do this, e.g. using calls to Akka methods instead of an AtomicInteger?
Edit: I'm considering the following as a possible alternative: when an actor receives a Done message it sets the receive timeout to 5 seconds (the program will take over an hour to run, so delaying cleanup/shutdown by a few seconds won't impact the performance); when the actor gets a ReceiveTimeout it broadcasts Done to the downstream actors, cleans up, and shuts down. (The routers for ActorB and ActorC are using a SmallestMailboxRouter)
class ActorB(private val actorCRouter: ActorRef) extends Actor {
    def receive = {
        case Done => {
            context.setReceiveTimeout(Duration.create(5, SECONDS))
        }
        case ReceiveTimeout => {
            actorCRouter ! Broadcast(Done)
            // clean up resources
            context.stop(self)
        }
    }
}
Sharing actorCount among related actors is not a good thing to do. An actor should only use its own state to handle messages.
How about having an ActorBCompletionHandler actor for the actors of type ActorB? All ActorB's will have a reference to the ActorBCompletionHandler actor. Every time an ActorB receives a Done message it can do the necessary cleanup and simply pass the Done message on to ActorBCompletionHandler. ActorBCompletionHandler maintains a counter as its state; every time it receives a Done message it simply updates that counter. As this is solely this actor's own state, there is no need for it to be atomic and therefore no need for any explicit locking. ActorBCompletionHandler sends a Done message to the ActorC's once it receives the last Done message.
This way the count is not shared among actors but managed only by ActorBCompletionHandler. The same thing can be repeated for the other types:
A -> B's -> BCompletionHandler -> C's -> CCompletionHandler -> D
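A rough sketch of such a completion handler, using Akka's Java API for illustration (your code is Scala, but the idea is the same; the plain string "Done" stands in for your Done message and expectedDones for the number of ActorB instances):

public class ActorBCompletionHandler extends AbstractActor {
    private final ActorRef actorCRouter;
    private final int expectedDones;
    private int received = 0;   // plain actor state, no AtomicInteger needed

    public ActorBCompletionHandler(ActorRef actorCRouter, int expectedDones) {
        this.actorCRouter = actorCRouter;
        this.expectedDones = expectedDones;
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
            .matchEquals("Done", msg -> {
                received++;
                if (received == expectedDones) {
                    actorCRouter.tell(new Broadcast("Done"), getSelf());
                    getContext().stop(getSelf());
                }
            })
            .build();
    }
}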
Another approach could be to have one monitoring actor for every related group of actors. Using the watch API and the Terminated event on the monitor, you can decide what to do once the last watched actor has terminated.
val child = context.actorOf(Props[ChildActor])
context.watch(child)
case Terminated(child) => {
log.info(child + " Child actor terminated")
}
I have the following problem that is begging for a ZMQ solution. I have time-series data:
A,B,C,D,E,...
I need to perform an operation, Func, on each point.
It makes good sense to parallelize the task using multiple workers via ZMQ. However, what is tripping me up is how to synchronize the results, i.e., the results should be time-ordered exactly the way the input data came in. So the end result should look like:
Func(A), Func(B), Func(C), Func(D),...
I should also point out that the time to complete, say, Func(A) will be slightly different from that of Func(B). This may require me to block for a while.
Any suggestions would be greatly appreciated.
You will always need to block for a while in order to synchronize things. You can send requests to a pool of workers, and when a response is received, buffer it if it is not the next one in sequence. One simple workflow could be described in pseudo-language as follows:
socket receiver;  # zmq.PULL
socket workers;   # zmq.DEALER; the worker thread sockets are started as zmq.DEALER too

poller = poller(receiver, workers)

next_id_req = incr()
out_queue = queue
out_queue.last_id = next_id_req
buffer = sorted_queue

sock = poller.poll()
if sock is receiver:
    packet_N = receiver.recv()
    # send N for processing
    workers.send(packet_N, ++next_id_req)
else if sock is workers:
    # get a processed response Func(N)
    func_N_response, id = workers.recv()
    if out_queue.last_id != id - 1:
        # not the subsequent id, buffer it
        buffer.push(id, func_N_response)
    else:
        # in order, push to the out queue
        out_queue.push(id, func_N_response)
        # also consume all buffered subsequent items
        while out_queue.last_id == buffer.min_id() - 1:
            id, buffered_N_resp = buffer.pop()
            out_queue.push(id, buffered_N_resp)
But here comes the problem: what happens if a packet is lost in the processing thread (the worker pool)? You can skip it after a certain timeout (flush the buffer into the out queue) and continue filling the out queue, then reorder when the packet comes later, if it ever comes.
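To make the buffering step concrete, here is a minimal Java sketch of just the reordering logic, independent of ZMQ (the class and method names are made up for illustration):

// Holds out-of-order results keyed by their sequence id and releases them
// in order as soon as the next expected id is available.
class ReorderBuffer<T> {
    private final TreeMap<Long, T> pending = new TreeMap<>();
    private long nextId = 0;

    // Accept one (id, result) pair and return everything that can now be emitted in order.
    List<T> accept(long id, T result) {
        pending.put(id, result);
        List<T> ready = new ArrayList<>();
        while (!pending.isEmpty() && pending.firstKey() == nextId) {
            ready.add(pending.pollFirstEntry().getValue());
            nextId++;
        }
        return ready;
    }
}

The out queue would call accept(id, Func(N)) for every worker response and emit whatever list comes back; a timeout-based flush for lost packets can then simply advance nextId past the missing id.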