Is a sleep statement allowed in Spark Streaming? - spark-streaming

I have a requirement to add a sleep statement when I am unable to consume a message, so that I can retry after 5s. Do I need to set any configuration properties to do this?
rdd.foreachPartition(new VoidFunction<Iterator<ConsumerRecord<String, Object>>>() {
    @Override
    public void call(Iterator<ConsumerRecord<String, Object>> record) throws Exception {
        while (record.hasNext()) {
            ConsumerRecord<String, Object> consumerRecord = record.next();
            boolean flag = false;
            // retry the current record until processmessage succeeds
            while (!flag) {
                flag = processmessage(consumerRecord.value());
                if (!flag) {
                    Thread.sleep(1000);
                }
            }
        }
    }
});
Currently, I am unable to run my job.

You can use sleep in a Spark Streaming application.
But wait:
Spark Streaming runs micro-batches at a fixed stream interval, which is usually a few seconds (1s, 2s, etc.). If you use sleep in your Spark Streaming code, each micro-batch will take additional time to finish. This might hurt performance if data is arriving very frequently.
Whether the sleep causes a performance problem or delay depends entirely on the application's requirements; if data arrives only after long intervals, it may have no impact at all.
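For illustration, here is a minimal sketch of a bounded retry with sleep inside foreachPartition, so that one bad record cannot stall the micro-batch indefinitely. processmessage comes from the question; the retry limit, the 5s delay, and the skip/dead-letter handling are assumptions:
rdd.foreachPartition(records -> {
    final int maxRetries = 3;         // assumed limit so a single record cannot block the batch forever
    final long retryDelayMs = 5000L;  // the 5s delay mentioned in the question
    while (records.hasNext()) {
        ConsumerRecord<String, Object> consumerRecord = records.next();
        boolean processed = false;
        for (int attempt = 0; attempt < maxRetries && !processed; attempt++) {
            processed = processmessage(consumerRecord.value());
            if (!processed) {
                Thread.sleep(retryDelayMs); // this delays completion of the whole micro-batch
            }
        }
        if (!processed) {
            // assumed handling: log and skip (or send to a dead-letter topic) rather than retry forever
        }
    }
});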
Hope this helps.

Related

Improve Spring Batch job performance

I am in the process of implementing a Spring Batch job for our file upload process. My requirement is to read a flat file, apply business logic, store the records in the DB, and then post a Kafka message.
I have a single chunk-based step that uses a custom reader, processor, and writer. The process works fine but takes a lot of time to process a big file.
It takes 15 mins to process a file with 60K records. I need to reduce that to less than 5 mins, as we will be consuming much bigger files than this.
As per https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html, I understand that making the step multithreaded would give a performance boost, at the cost of restartability. However, I am using FlatFileItemReader, ItemProcessor, and ItemWriter, and none of them is thread-safe.
Any suggestions as to how to improve performance here?
Here is the writer code:
public void write(List<? extends Message> items) {
    items.forEach(this::process);
}

private void process(Message message) {
    if (message == null)
        return;
    try {
        // message is a DTO that has info about success or failure
        if (success) {
            // post Kafka message using Spring Cloud Stream
            // insert record in DB using Spring Data JpaRepository
        } else {
            // insert record in DB using Spring Data JpaRepository
        }
    } catch (Exception e) {
        // throw exception
    }
}
Best regards,
Preeti
Please refer to the SO threads below and the linked GitHub source code for parallel processing:
Spring Batch multiple process for heavy load with multiple thread under every process
Spring batch to process huge data
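As a starting point, here is a minimal sketch of a multi-threaded chunk step using Spring Batch 4 style builders; the step name, chunk size, and pool size are assumptions, and the reader is wrapped in a SynchronizedItemStreamReader because FlatFileItemReader is not thread-safe:
@Bean
public Step fileStep(StepBuilderFactory stepBuilderFactory,
                     FlatFileItemReader<Message> reader,
                     ItemProcessor<Message, Message> processor,
                     ItemWriter<Message> writer) {
    // FlatFileItemReader is not thread-safe, so serialize access to it
    SynchronizedItemStreamReader<Message> syncReader = new SynchronizedItemStreamReader<>();
    syncReader.setDelegate(reader);

    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(8);  // assumed pool size; tune against DB and Kafka capacity
    executor.setMaxPoolSize(8);
    executor.afterPropertiesSet();

    return stepBuilderFactory.get("fileStep")
            .<Message, Message>chunk(500)  // assumed chunk size
            .reader(syncReader)
            .processor(processor)
            .writer(writer)
            .taskExecutor(executor)
            .throttleLimit(8)
            .build();
}
Note that with a multi-threaded step the reader's restart state is no longer reliable, which is the trade-off already mentioned in the question.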

How to balance multiple message queues

I have a task that is potentially long running (hours). The task is performed by multiple workers (AWS ECS instances in my case) that read from a message queue (AWS SQS in my case). I have multiple users adding messages to the queue. The problem is that if Bob adds 5000 messages to the queue, enough to keep the workers busy for 3 days, and then Alice comes along wanting to process 5 tasks, Alice will need to wait 3 days before any of her tasks even start.
I would like to feed messages to the workers from Alice and Bob at an equal rate as soon as Alice submits tasks.
I have solved this problem in another context by creating multiple queues (subqueues) for each user (or even each batch a user submits) and alternating between all subqueues when a consumer asks for the next message.
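(For illustration, a minimal in-memory sketch of that alternating-subqueue idea; FairDispatcher and its method names are made up for this example:)
class FairDispatcher {
    private final Map<String, Deque<String>> subQueues = new LinkedHashMap<>();

    synchronized void submit(String userId, String task) {
        subQueues.computeIfAbsent(userId, id -> new ArrayDeque<>()).add(task);
    }

    // Hands out one task per user in turn, so one user's large batch cannot starve the others.
    synchronized Optional<String> nextTask() {
        for (String userId : new ArrayList<>(subQueues.keySet())) {
            Deque<String> queue = subQueues.get(userId);
            if (queue.isEmpty()) {
                subQueues.remove(userId);  // drop drained subqueues
                continue;
            }
            String task = queue.poll();
            // rotate this user to the back so the next call serves someone else first
            subQueues.remove(userId);
            subQueues.put(userId, queue);
            return Optional.of(task);
        }
        return Optional.empty();
    }
}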
This seems, at least in my world, to be a common problem, and I'm wondering if anyone knows of an established way of solving it.
I don't see any solution with ActiveMQ. I've looked a little at Kafka with its ability to round-robin partitions in a topic, and that may work. Right now, I'm implementing something using Redis.
I would recommend Cadence Workflow instead of queues, as it supports long-running operations and state management out of the box.
In your case I would create a workflow instance per user. Every new task would be sent to the user's workflow via the signal API. The workflow instance would then queue up the received tasks and execute them one by one.
Here is an outline of the implementation:
public interface SerializedExecutionWorkflow {

    @WorkflowMethod
    void execute();

    @SignalMethod
    void addTask(Task t);
}

public interface TaskProcessorActivity {

    @ActivityMethod
    void process(Task poll);
}

public class SerializedExecutionWorkflowImpl implements SerializedExecutionWorkflow {

    private final Queue<Task> taskQueue = new ArrayDeque<>();
    private final TaskProcessorActivity processor = Workflow.newActivityStub(TaskProcessorActivity.class);

    @Override
    public void execute() {
        while (!taskQueue.isEmpty()) {
            processor.process(taskQueue.poll());
        }
    }

    @Override
    public void addTask(Task t) {
        taskQueue.add(t);
    }
}
And here is the code that enqueues a task to the workflow through its signal method:
private void addTask(WorkflowClient cadenceClient, Task task) {
    // Set workflowId to userId
    WorkflowOptions options = new WorkflowOptions.Builder().setWorkflowId(task.getUserId()).build();
    // Use the workflow interface stub to start or signal the workflow instance
    SerializedExecutionWorkflow workflow = cadenceClient.newWorkflowStub(SerializedExecutionWorkflow.class, options);
    BatchRequest request = cadenceClient.newSignalWithStartRequest();
    request.add(workflow::execute);
    request.add(workflow::addTask, task);
    cadenceClient.signalWithStart(request);
}
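For completeness, here is a rough sketch of hosting these implementations in a worker process, assuming a recent cadence-java-client; the domain name, task list name, and TaskProcessorActivityImpl class are made up for this sketch:
// Rough sketch only: "samples" and "user-tasks" are made-up names,
// and TaskProcessorActivityImpl is an assumed implementation of TaskProcessorActivity.
WorkflowClient cadenceClient = WorkflowClient.newInstance(
        new WorkflowServiceTChannel(ClientOptions.defaultInstance()),
        WorkflowClientOptions.newBuilder().setDomain("samples").build());

WorkerFactory factory = WorkerFactory.newInstance(cadenceClient);
Worker worker = factory.newWorker("user-tasks");
worker.registerWorkflowImplementationTypes(SerializedExecutionWorkflowImpl.class);
worker.registerActivitiesImplementations(new TaskProcessorActivityImpl());
factory.start();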
Cadence offers a lot of other advantages over using queues for task processing:
Built-in exponential retries with an unlimited expiration interval
Failure handling. For example, it allows executing a task that notifies another service if both updates couldn't succeed within a configured interval.
Support for long-running, heartbeating operations
Ability to implement complex task dependencies. For example, chaining of calls or compensation logic in case of unrecoverable failures (SAGA)
Complete visibility into the current state of the update. For example, when using queues all you know is whether there are some messages in a queue, and you need an additional DB to track the overall progress. With Cadence every event is recorded.
Ability to cancel an update in flight.
See the presentation that goes over the Cadence programming model.

Apache Storm SleepSpoutWaitStrategy Behaviour

I am running an Apache Storm benchmark on a local machine.
However, I am seeing some weird behavior. One of the benchmarks, the SOL (Speed of Light) test, uses a RandomMessageSpout to generate random tuples as the source. Here is the nextTuple() code for that spout:
public void nextTuple() {
    final String message = messages[rand.nextInt(messages.length)];
    if (ackEnabled) {
        collector.emit(new Values(message), messageCount);
        messageCount++;
    } else {
        collector.emit(new Values(message));
    }
}
When I run this benchmark and profile it with a Java profiler (YourKit in my case), the spout thread shows sleep intervals in accordance with SleepSpoutWaitStrategy.emptyEmit(). As I understand it, this function is called when nextTuple() has no tuples to emit, and the spout thread then goes to sleep for a configurable amount of time, as shown in the screenshot.
I do not understand why this function would be called given this particular nextTuple() implementation, which always emits a tuple. What am I misunderstanding here?
emptyEmit is also called in the following situations:
If the number of unacked messages reaches the max spout pending.
If the executor send queue as well as the overflow buffer of the spout is full.
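For reference, a minimal sketch of the two settings involved; the values shown are arbitrary:
Config conf = new Config();
// cap on unacked tuples per spout task; once it is reached the spout is throttled
// and the configured wait strategy (SleepSpoutWaitStrategy by default) kicks in
conf.setMaxSpoutPending(5000);
// sleep time used by SleepSpoutWaitStrategy between calls, in milliseconds
conf.put(Config.TOPOLOGY_SLEEP_SPOUT_WAIT_STRATEGY_TIME_MS, 1);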

Cancel running job scheduled with Hangfire.io

I schedule a job using the hangfire.io library and I can observe it being processed in the built-in dashboard. However, my system has a requirement that the job can be cancelled from the dashboard.
There is an option to delete a running job, but this only changes the state of the job in the database and does not stop the running job.
I see in the documentation there is an option to pass an IJobCancellationToken, however as I understand it, it is used to correctly stop the job when the whole server is shutting down.
Is there a way to achieve programmatic cancellation of an already running task?
Should I write my own component that would periodically poll the database and check whether the current server instance is running a job that has been cancelled? For instance, maintain a dictionary jobId -> CancellationTokenSource and then signal cancellation using the appropriate token source.
The documentation is a bit incomplete. The IJobCancellationToken.ThrowIfCancellationRequested method throws an exception after any of the following conditions is met:
Hangfire Server shutdown initiated. This event is triggered when someone calls the Stop or Dispose methods of BackgroundJobServer.
Background job is not in the Processing state.
Background job is being performed by another worker.
The latter two cases are checked by querying the job storage for the current background job state. So the cancellation token will also throw if you delete or re-queue the job from the dashboard.
This works if you delete the job from the dashboard
static public void DoWork(IJobCancellationToken cancellationToken)
{
    Debug.WriteLine("Starting Work...");

    for (int i = 0; i < 10; i++)
    {
        Debug.WriteLine("Ping");
        try
        {
            cancellationToken.ThrowIfCancellationRequested();
        }
        catch (Exception ex)
        {
            Debug.WriteLine("ThrowIfCancellationRequested");
            break;
        }

        //Debug.WriteProgressBar(i);
        Thread.Sleep(5000);
    }

    Debug.WriteLine("Finished.");
}

How to update the task tracker that my mapper is still running fine, as opposed to generating a timeout?

I forgot what API/method to call, but my problem is that:
My mapper will run for more than 10 minutes, and I don't want to increase the default timeout.
Rather, I want my mapper to send an update ping to the task tracker when it is in the particular code path that takes more than 10 minutes.
Please let me know what API/method to call.
You can simply increment a counter and call progress. This ensures that the task sends a heartbeat back to the tasktracker so it knows the task is still alive.
In the new API this is managed through the context, see here: http://hadoop.apache.org/common/docs/r1.0.0/api/index.html
e.g.:
@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // increment counter
    context.getCounter(SOME_ENUM).increment(1);
    context.progress();
}
In the old API there is the Reporter class:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html
You typically use the Reporter to let the framework know you're still alive.
Quote from the javadoc:
Mapper and Reducer can use the Reporter provided to report progress or
just indicate that they are alive. In scenarios where the application
takes an insignificant amount of time to process individual key/value
pairs, this is crucial since the framework might assume that the task
has timed-out and kill that task.
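For the old API, a comparable sketch; SOME_ENUM is the same assumed counter enum as above, and the output types are placeholders:
public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    // ... long-running work for this key/value pair ...
    reporter.incrCounter(SOME_ENUM, 1); // increment a counter
    reporter.progress();                // heartbeat so the framework does not kill the task as timed out
}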
