Here is my use case.
A legacy system updates a database queue table QUEUE.
I want a scheduled recurring job that
- checks the contents of QUEUE
- if there are rows in the table it locks the row and does some work
- deletes the row in QUEUE
If the previous job is still running, then a new thread will be created to do the work. I want to configure the maximum number of concurrent threads.
I am using Spring 3 and my current solution is to do the following (using a fixedRate of 1 millisecond to get the threads to run basically continuously)
#Scheduled(fixedRate = 1)
#Async
public void doSchedule() throws InterruptedException {
log.debug("Start schedule");
publishWorker.start();
log.debug("End schedule");
}
<task:executor id="workerExecutor" pool-size="4" />
This created 4 threads straight off and the threads correctly shared the workload from the queue. However I seem to be getting a memory leak when the threads take a long time to complete.
java.util.concurrent.ThreadPoolExecutor # 0xe097b8f0 | 80 | 373,410,496 | 89.74%
|- java.util.concurrent.LinkedBlockingQueue # 0xe097b940 | 48 | 373,410,136 | 89.74%
| |- java.util.concurrent.LinkedBlockingQueue$Node # 0xe25c9d68
So
1: Should I be using #Async and #Scheduled together?
2: If not then how else can I use spring to achieve my requirements?
3: How can I create the new threads only when the other threads are busy?
Thanks all!
EDIT: I think the queue of jobs was getting infinitely long... Now using
<task:executor id="workerExecutor"
pool-size="1-4"
queue-capacity="10" rejection-policy="DISCARD" />
Will report back with results
You can try
Run a scheduler with one second delay, which will lock & fetch all
QUEUE records that weren't locked so far.
For each record, call an Async method, which will process that record & delete it.
The executor's rejection policy should be ABORT, so that the scheduler can unlock the QUEUEs that aren't given out for processing yet. That way the scheduler can try processing those QUEUEs again in the next run.
Of course, you'll have to handle the scenario, where the scheduler has locked a QUEUE, but the handler didn't finish processing it for whatever reason.
Pseudo code:
public class QueueScheduler {
#AutoWired
private QueueHandler queueHandler;
#Scheduled(fixedDelay = 1000)
public void doSchedule() throws InterruptedException {
log.debug("Start schedule");
List<Long> queueIds = lockAndFetchAllUnlockedQueues();
for (long id : queueIds)
queueHandler.process(id);
log.debug("End schedule");
}
}
public class QueueHandler {
#Async
public void process(long queueId) {
// process the QUEUE & delete it from DB
}
}
<task:executor id="workerExecutor" pool-size="1-4" queue-capcity="10"
rejection-policy="ABORT"/>
//using a fixedRate of 1 millisecond to get the threads to run basically continuously
#Scheduled(fixedRate = 1)
When you use #Scheduled a new thread will be created and will invoke method doSchedule at the specified fixedRate at 1 milliseconds. When you run your app you can already see 4 threads competing for the QUEUE table and possibly a dead lock.
Investigate if there is a deadlock by taking thread dump.
http://helpx.adobe.com/cq/kb/TakeThreadDump.html
#Async annotation will not be of any use here.
Better way to implement this is to create you class as a thread by implementing runnable and passing your class to TaskExecutor with required number of threads.
Using Spring threading and TaskExecutor, how do I know when a thread is finished?
Also check your design it doesn't seem to be handling the synchronization properly. If a previous job is running and holding a lock on the row, the next job you create will still see that row and will wait for acquiring lock on that particular row.
Related
We are receiving data in spark streaming from Kafka. Once execution has been started in Spark Streaming, it executes only one batch and the remaining batches starting queuing up in Kafka.
Our data is independent and can be processes in Parallel.
We tried multiple configurations with multiple executor, cores, back pressure and other configurations but nothing worked so far. There are a lot messages queued and only one micro batch has been processed at a time and rest are remained in queue.
We want to achieve parallelism at maximum, so that not any micro batch is queued, as we have enough resources available. So how we can reduce time by maximum utilization of resources.
// Start reading messages from Kafka and get DStream
final JavaInputDStream<ConsumerRecord<String, byte[]>> consumerStream = KafkaUtils.createDirectStream(
getJavaStreamingContext(), LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, byte[]>Subscribe("TOPIC_NAME",
sparkServiceConf.getKafkaConsumeParams()));
ThreadContext.put(Constants.CommonLiterals.LOGGER_UID_VAR, CommonUtils.loggerUniqueId());
JavaDStream<byte[]> messagesStream = consumerStream.map(new Function<ConsumerRecord<String, byte[]>, byte[]>() {
private static final long serialVersionUID = 1L;
#Override
public byte[] call(ConsumerRecord<String, byte[]> kafkaRecord) throws Exception {
return kafkaRecord.value();
}
});
// Decode each binary message and generate JSON array
JavaDStream<String> decodedStream = messagesStream.map(new Function<byte[], String>() {
private static final long serialVersionUID = 1L;
#Override
public String call(byte[] asn1Data) throws Exception {
if(asn1Data.length > 0) {
try (InputStream inputStream = new ByteArrayInputStream(asn1Data);
Writer writer = new StringWriter(); ) {
ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(asn1Data);
GZIPInputStream gzipInputStream = new GZIPInputStream(byteArrayInputStream);
byte[] buffer = new byte[1024];
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
int len;
while((len = gzipInputStream.read(buffer)) != -1) {
byteArrayOutputStream.write(buffer, 0, len);
}
return new String(byteArrayOutputStream.toByteArray());
} catch (Exception e) {
//
producer.flush();
throw e;
}
}
return null;
}
});
// publish generated json gzip to kafka
cache.foreachRDD(new VoidFunction<JavaRDD<String>>() {
private static final long serialVersionUID = 1L;
#Override
public void call(JavaRDD<String> jsonRdd4DF) throws Exception {
//Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
if(!jsonRdd4DF.isEmpty()) {
//JavaRDD<String> jsonRddDF = getJavaSparkContext().parallelize(jsonRdd4DF.collect());
Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
SparkAIRMainJsonProcessor airMainJsonProcessor = new SparkAIRMainJsonProcessor();
airMainJsonProcessor.processAIRData(json, sparkSession);
}
}
});
getJavaStreamingContext().start();
getJavaStreamingContext().awaitTermination();
getJavaStreamingContext().stop();
Technology that we are using:
HDFS 2.7.1.2.5
YARN + MapReduce2 2.7.1.2.5
ZooKeeper 3.4.6.2.5
Ambari Infra 0.1.0
Ambari Metrics 0.1.0
Kafka 0.10.0.2.5
Knox 0.9.0.2.5
Ranger 0.6.0.2.5
Ranger KMS 0.6.0.2.5
SmartSense 1.3.0.0-1
Spark2 2.0.x.2.5
Statistics that we got from difference experimentations:
Experiment 1
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 48 Minutes
Experiment 2
spark.default.parallelism=12
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 8 Minutes
Experiment 3
spark.default.parallelism=12
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 7 Minutes
Experiment 4
spark.default.parallelism=16
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 10 Minutes
Please advise, how we can process maximum so no queued.
I was facing same issue and I tried a few things in trying to resolve the issue and came to following findings:
First of all. Intuition says that one batch must be processed per executor but on the contrary, only one batch is processed at a time but jobs and tasks are processed in parallel.
Multiple batch processing can be achieved by using spark.streaming.concurrentjobs, but it's not documented and still needs a few fixes. One of problems is with saving Kafka offsets. Suppose we set this parameter to 4 and 4 batches are processed in parallel, what if 3rd batch finishes before 4th one, which Kafka offsets would be committed. This parameter is quite useful if batches are independent.
spark.default.parallelism because of its name is sometimes considered to make things parallel. But its true benefit is in distributed shuffle operations. Try different numbers and find an optimum number for this. You will get a considerable difference in processing time. It depends upon shuffle operations in your jobs. Setting it too high would decrease the performance. It's apparent from you experiments results too.
Another option is to use foreachPartitionAsync in place of foreach on RDD. But I think foreachPartition is better as foreachPartitionAsync would queue up the jobs whereas batches would appear to be processed but their jobs would still be in the queue or in processing. May be I didn't get its usage right. But it behaved same in my 3 services.
FAIR spark.scheduler.mode must be used for jobs with lots of tasks as round-robin assignment of tasks to jobs, gives opportunity to smaller tasks to start receiving resources while bigger tasks are processing.
Try to tune your batch duration+input size and always keep it below processing duration otherwise you're gonna see a long backlog of batches.
These are my findings and suggestions, however, there are so many configurations and methods to do streaming and often one set of operation doesn't work for others. Spark Streaming is all about learning, putting your experience and anticipation together to get to a set of optimum configuration.
Hope it helps. It would be a great relief if someone could tell specifically how we can legitimately process batches in parallel.
We want to achieve parallelism at maximum, so that not any micro batch is queued
That's the thing about stream processing: you process the data in the order it was received. If you process your data at the rate slower than it arrives it will be queued. Also, don't expect that processing of one record will suddenly be parallelized across multiple nodes.
From your screenshot, it seems your batch time is 10 seconds and your producer published 100 records over 90 seconds.
It took 36s to process 2 records and 70s to process 17 records. Clearly, there is some per-batch overhead. If this dependency is linear, it would take only 4:18 to process all 100 records in a single mini-batch thus beating your record holder.
Since your code is not complete, it's hard to tell what exactly takes so much time. Transformations in the code look fine but probably the action (or subsequent transformations) are the real bottlenecks. Also, what's with producer.flush() which wasn't mentioned anywhere in your code?
I was facing the same issue and I solved it using Scala Futures.
Here are some link that show how to use it:
https://alvinalexander.com/scala/how-use-multiple-scala-futures-in-for-comprehension-loop
https://www.beyondthelines.net/computing/scala-future-and-execution-context/
Also, this is piece of my code when I used Scala Futures:
messages.foreachRDD{ rdd =>
val f = Future {
// sleep(100)
val newRDD = rdd.map{message =>
val req_message = message.value()
(message.value())
}
println("Request messages: " + newRDD.count())
var resultrows = newRDD.collect()//.collectAsList()
processMessage(resultrows, mlFeatures: MLFeatures, conf)
println("Inside scala future")
1
}
f.onComplete {
case Success(messages) => println("yay!")
case Failure(exception) => println("On no!")
}
}
It's hard to tell without having all the details, but general advice to tackle issues like that -- start with very simple application, "Hello world" kind. Just read from input stream and print data into log file. Once this works you prove that problem was in application and you gradually add your functionality back until you find what was culprit. If even simplest app doesn't work - you know that problem in configuration or Spark cluster itself. Hope this helps.
I have a spring boot (1.4.1) application which uses following Task Executor Configuration:
#Bean ("taskExecutor")
TaskExecutor taskExecutor (Environment env) {
ThreadPoolTaskExecutor t = new ThreadPoolTaskExecutor();
t.setCorePoolSize(Integer.parseInt( env.getProperty("ed.corePoolSize"))); //2
t.setMaxPoolSize(Integer.parseInt( env.getProperty("ed.maxPoolSize"))); //10
t.setQueueCapacity(Integer.parseInt( env.getProperty("ed.queueCapacity"))); //50
t.setThreadNamePrefix(env.getProperty("ed.threadNamePrefix"));
t.setAllowCoreThreadTimeOut(true);
t.setKeepAliveSeconds(Integer.parseInt( env.getProperty("ed.keepAliveSeconds"))); //120
return t;
}
The implementation of this-
TaskExecutor taskExecutor = (TaskExecutor) beanFactory.getBean("taskExecutor");
taskExecutor.execute(taskNameUpdate);
where taskNameUpdate is a Runnable task.
When the user sent 17 requests it processed 2 requests in two threads (Say T1 and T2). When T1 finishes execution of the first task it picks task17 from the queue where it should pick task3. The Bigger problem is when T2 finishes the current task it also start processing task 17. Both of threads fishes task 17 and starts processing task 17 again.
As a temporary fix, I have set corePoolSize=10 and requested not to send more than 10 batches at the same time.
Please help me to figure out the problem.
I have a single ActorA that reads from an input stream and sends messages to a group of ActorB's. When ActorA reaches the end of the input stream it cleans up its resources, broadcasts a Done message to the ActorB's, and shuts itself down.
I have approx 12 ActorB's that send messages to a group of ActorC's. When an ActorB receives a Done message from ActorA then it cleans up its resources and shuts itself down, with the exception of the last surviving ActorB which broadcasts a Done message to the ActorC's before it shuts itself down.
I have approx 24 ActorC's that send messages to a single ActorD. Similar to the ActorB's, when each ActorC gets a Done message it cleans up its resources and shuts itself down, with the exception of the last surviving ActorC which sends a Done message to ActorD.
When ActorD gets a Done message it cleans up its resources and shuts itself down.
Initially I had the ActorB's and ActorC's immediately propagate the Done message when they received it, but this might cause the ActorC's to shut down before all of the ActorB's have finished processing their queues; likewise the ActorD might shut down before the ActorC's have finished processing their queues.
My solution is to use an AtomicInteger that is shared among the ActorB's
class ActorB(private val actorCRouter: ActorRef,
private val actorCount: AtomicInteger) extends Actor {
private val init = {
actorCount.incrementAndGet()
()
}
def receive = {
case Done => {
if(actorCount.decrementAndGet() == 0) {
actorCRouter ! Broadcast(Done)
}
// clean up resources
context.stop(self)
}
}
}
ActorC uses similar code, with each ActorC sharing an AtomicInteger.
At present all actors are initialized in a web service method, with the downstream ActorRef's passed in the upstream actors' constructors.
Is there a preferred way to do this, e.g. using calls to Akka methods instead of an AtomicInteger?
Edit: I'm considering the following as a possible alternative: when an actor receives a Done message it sets the receive timeout to 5 seconds (the program will take over an hour to run, so delaying cleanup/shutdown by a few seconds won't impact the performance); when the actor gets a ReceiveTimeout it broadcasts Done to the downstream actors, cleans up, and shuts down. (The routers for ActorB and ActorC are using a SmallestMailboxRouter)
class ActorB(private val actorCRouter: ActorRef) extends Actor {
def receive = {
case Done => {
context.setReceiveTimeout(Duration.create(5, SECONDS))
}
case ReceiveTimeout => {
actorCRouter ! Broadcast(Done)
// clean up resources
context.stop(self)
}
}
}
Sharing actorCount among related actors is not good thing to do. Actor should only be using its own state to handle messages.
How about having ActorBCompletionHanlder actor for actor of type ActorB. All ActorB will have reference to ActorBCompletionHanlder actor. Every time ActorB receives Done message it can do necessay cleanup and simply pass done message to ActorBCompletionHanlder. ActorBCompletionHanlder will maintain state variale for maintaining counts. Everytime it receives done message it can simply update counter. As this is solely state variable for this actor no need to have it atomic and that way no need for any explicit locking. ActorBCompletionHanlder will send done message to ActorC once it receives last done message.
This way sharing of activeCount is not among actors but only managed by ActorBCompletionHanlder. Same thing can be repeated for other types.
A-> B's -> BCompletionHanlder -> C's -> CCompletionHandler -> D
Other approach could be to have one monitoring actor for evey related group of actors. And using watch api and child terminated event on monitor you can chose to decide what to do once you receive last done message.
val child = context.actorOf(Props[ChildActor])
context.watch(child)
case Terminated(child) => {
log.info(child + " Child actor terminated")
}
I have created EJB 3 Timer and it is deployed into weblogic cluster. The createTimer parameters are createTimer(1000, 20000, null);
This should create recurring timer for every 20 seconds. But the timer is always created every 30 seconds. I even changed the intervalDuration to 40 and 50 seconds but the timeout method is always triggered every 30 seconds.
Below are the entry from WEBLOGIC_TIMERS table
1##OSBNode2_1355845459844 (BLOB) 1355846770914 1000 TimerearTimerTest.jarTimerTest OSBDomain OSBCluster
Below are the entry from ACTIVE table
timer.1##OSBNode2_1355843156331 -96726833478167425/OSBNode2 OSBDomain OSBCluster 18-DEC-12
service.TimerMaster 8866906753834651127/OSBNode1 OSBDomain OSBCluster 18-DEC-12
service.SINGLETON_MASTER 8866906753834651127/OSBNode1 OSBDomain OSBCluster 18-DEC-12
Can anyone help me to investigate why the timer always triggers every 30 second instead of my intervalDuration value?
Below is the EJB ----->>
package com.timertest;
import java.util.*;
import javax.annotation.Resource;
import javax.ejb.*;
#Stateless(mappedName = "TimerTest")
public class TimerTest implements TimerTestRemote
{
#Resource
private SessionContext ctx;
#Override
public void createMyTimer()
throws EJBException
{
ctx.getTimerService().createTimer(1000, 20000, null);
}
#Timeout
public void timeout(Timer timer)
{
System.out.println("-> Timed Out ..."+ new Date());
}
}
Below is the and weblogic descripor-------->>
<?xml version="1.0" encoding="UTF-8"?>
<wls:weblogic-ejb-jar
xmlns:wls="http://xmlns.oracle.com/weblogic/weblogic-ejb-jar"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/ejb-jar_3_0.xsd http://xmlns.oracle.com/weblogic/weblogic-ejb-jar http://xmlns.oracle.com/weblogic/weblogic-ejb-jar/1.2/weblogic-ejb-jar.xsd">
<wls:weblogic-enterprise-bean>
<wls:ejb-name>TimerTest</wls:ejb-name>
<wls:stateless-session-descriptor>
<wls:stateless-clustering>
<wls:home-is-clusterable>true</wls:home-is-clusterable>
<wls:home-load-algorithm>round-robin</wls:home-load-algorithm>
<wls:stateless-bean-is-clusterable>true</wls:stateless-bean-is-clusterable>
<wls:stateless-bean-load-algorithm>round-robin
</wls:stateless-bean-load-algorithm>
</wls:stateless-clustering>
<wls:business-interface-jndi-name-map>
<wls:business-remote>TimerTestRemote</wls:business-remote>
<wls:jndi-name>TimerTest</wls:jndi-name>
</wls:business-interface-jndi-name-map>
</wls:stateless-session-descriptor>
<wls:enable-call-by-reference>false</wls:enable-call-by-reference>
<wls:jndi-name>TimerTest</wls:jndi-name>
</wls:weblogic-enterprise-bean>
<wls:timer-implementation>Clustered</wls:timer-implementation>
</wls:weblogic-ejb-jar>
Thanks in advance
This might not be the direct answer, but might help to avoid such condition.
Timers are persistent by default, so you have to cancel the previous timers before starting new ones.
Also, you can provide info while creating timer object instead of passing null, may be string identifier. It will later help to identify which timer had timed out or to cancel it in case of multiple timers in the system.
Creating timers & restarting the application multiple times, there are several such timers having same info, but are having different hashCode. So they don't overlap & are created new each time.
You can get timers by ctx.getTimerService().getTimers() & cancel by iterating over them.
Edit : I have replicated the scenario & have faced similar issue. I have debugged & it happens when there are multiple previous timers active for the same interval, which aren't being cancelled.
You try below code, I have tried & it resolved the issue.
for(Timer t : timerService.getTimers())
t.cancel(); //-- Cancel timers before creatimng new one
ctx.getTimerService().createTimer(1000, 20000, null);
Excerpt from documentation :
intervalDuration - If expiration is delayed (e.g. due to the interleaving of other method calls on the bean), two or more expiration notifications may occur in close succession to "catch up".
You can use a different style (annotations). See below:
#Schedule(hour = "*", minute = "*",second = "*/40")
#TransactionAttribute(TransactionAttributeType.REQUIRES_NEW)
#Lock(LockType.WRITE)
#AccessTimeout(1000 * 60 * 60)//1 Hour...//Concurrent Access is not permitted...
We have a situation where we set up a component to run batch jobs using spring batch remotely. We send a JMS message with the job xml path, name, parameters, etc. and we wait on the calling batch client for a response from the server.
The server reads the queue and calls the appropriate method to run the job and return the result, which our messaging framework does by:
this.jmsTemplate.send(queueName, messageCreator);
this.LOGGER.debug("Message sent to '" + queueName + "'");
try {
final Destination replyTo = messageCreator.getReplyTo();
final String correlationId = messageCreator.getMessageId();
this.LOGGER.debug("Waiting for the response '" + correlationId + "' back on '" + replyTo + "' ...");
final BytesMessage message = (BytesMessage) this.jmsTemplate.receiveSelected(replyTo, "JMSCorrelationID='"
+ correlationId + "'");
this.LOGGER.debug("Response received");
Ideally, we want to be able to call out runJobSync method twice, and have two jobs simultaneously operate. We have a unit test that does something similar, without jobs. I realize this code isn't very great, but, here it is:
final List result = Collections.synchronizedList(new ArrayList());
Thread thread1 = new Thread(new Runnable(){
#Override
public void run() {
client.pingWithDelaySync(1000);
result.add(Thread.currentThread().getName());
}
}, "thread1");
Thread thread2 = new Thread(new Runnable(){
#Override
public void run() {
client.pingWithDelaySync(500);
result.add(Thread.currentThread().getName());
}
}, "thread2");
thread1.start();
Thread.sleep(250);
thread2.start();
thread1.join();
thread2.join();
Assert.assertEquals("both thread finished", 2, result.size());
Assert.assertEquals("thread2 finished first", "thread2", result.get(0));
Assert.assertEquals("thread1 finished second", "thread1", result.get(1));
When we run that test, thread 2 completes first since it just has a 500 millisencond wait, while thread 1 does a 1 second wait:
Thread.sleep(delayInMs);
return result;
That works great.
When we run two remote jobs in the wild, one which takes about 50 seconds to complete and one which is designed to fail immediately and return, this does not happen.
Start the 50 second job, then immediately start the instant fail job. The client prints that we sent a message requesting that the job run, the server prints that it received the 50 second request, but waits until that 50 second job is completed before handling the second message at all, even though we use the ThreadPoolExecutor.
We are running transactional with Auto acknowledge.
Doing some remote debugging, the Consumer from AbstractPollingMessageListenerContainer shows no unhandled messages (so consumer.receive() obviously just returns null over and over). The webgui for the amq broker shows 2 enqueues, 1 deque, 1 dispatched, and 1 in the dispatched queue. This suggests to me that something is preventing AMQ from letting the consumer "have" the second message. (prefetch is 1000 btw)
This shows as the only consumer for the particular queue.
Myself and a few other developers have poked around for the last few days and are pretty much getting nowhere. Any suggestions on either, what we have misconfigured if this is expected behavior, or, what would be broken here.
Does the method that is being remotely called matter at all? Currently the job handler method uses an executor to run the job in a different thread and does a future.get() (the extra thread is for reasons related to logging).
Any help is greatly appreciated
not sure I follow completely, but off the top, you should try the following...
set the concurrentConsumers/maxConcurrentConsumers greater than the default (1) on the MessageListenerContainer
set the prefetch to 0 to better promote balancing messages between consumers, etc.