Is kafka streaming driver waiting for executors to finish? - spark-streaming

I have a spark streaming application that monitors some messages from kafka. For a specific message I may need to go to a long loop to continue pinging some ip until its reconnected.
What I can see that when some executor goes to this loop, it stops processing the stream.
Is it correct ?
How can I make this loop without interrupting continuing processing the stream ?

Yes, the executors will not progress if they are blocked on a long-running task.
I would advise against running such long process on an executor. Assuming that those reconnect cases are sparse in time, I would use an asynchronous process in the driver. (Futures or barebone threads from a pool).

Related

ActiveMQ- Can I Replace The Scheduler Plugin With A Delayed Message Queue?

I worked a little with the ActiveMQ scheduler plugin. This simplifies scheduling messages for delivery with a delay at low volume, but as I get into the 100ks of messages the system breaks down in two key ways.
It's very slow (compared to queues) to enqueue messages in the scheduler.
Attempting to view the schedules in the dashboard crashes the ActiveMQ instance.
The existing scheduler feels a little bolted on and does not perform as expected. So, rethinking the problem I would like to have a jobs and jobs-scheduled queue. Messages sent to the jobs-scheduled queue will have a ttl header with the unix timestamp for when it should be delivered. A process will run on a cron job which will take messages from the jobs-scheduled queue and send it to the jobs queue using a selector to just pick out the messages with an elapsed ttl convert_string_expressions:ttl < %(now)s.
My two questions are:
Will this strategy work for delaying messages at scale or will I find scaling pains around the selector? These messages will be persisted if that makes a difference.
Is there an existing feature in ActiveMQ that will allow me to send messages from one queue to another with a selector query?
ActiveMQ is a message broker not a job scheduler so what you are trying to do is really outside the scope of the what the broker is intended to do. Yes ActiveMQ does have a scheduled message feature but this is not intended for large scale job queue type work, it is a simple feature to provide some minimal delayed delivery.
What you are looking for sounds more like Quartz or some other batch job scheduling library. You could develop your own Job scheduler implementation for ActiveMQ or do something in a plugin but you are really trying to run against the grain of what a broker is meant to do which is deliver messages as quickly as possible in a decoupled manner.
Side note-- potentially off-topic.
I've had to solve a similar situation in the past where it made a lot of sense to load up the queues with messages ahead of time to cut down on the total transfer time.
I solved it by using Camel routes and a side-channel activation. Camel allows you to programmatically start and stop routes, so you can load up a queue with no consumers for the data for a given time period. Then using a dedicated queue for control you send the 'start' message. The control route receives the 'start' message, and then activates the main data processing route. You then need to configure some sort of 'stop' message semantic to be ready for the next time periods run.
Effectively, you get the delayed behavior pattern with much more control over scheduling and cut down on the data-to-queue loading time problem. You can also solve the scaling problem by loading the data across more than one queue.

How to handle pods shutdown gracefully in microservices architechture

I am exploring different strategies for handling shutdown gracefully in case of deployment/crash. I am using the Spring Boot framework and Kubernetes. In a few of the services, we have tasks that can take around 10-20 minutes(data processing, large report generation). How to handle pod termination in these cases when the task is taking more time. For queuing I am using Kafka.
we have tasks that can take around 10-20 minutes(data processing, large report generation)
First, this is more of a Job/Task rather than a microservice. But similar "rules" applies, the node where this job is executing might terminate for upgrade or other reason, so your Job/Task must be idempotent and be able to be re-run if it crashes or is terminated.
How to handle pod termination in these cases when the task is taking more time. For queuing I am using Kafka.
Kafka is a good technology for this, because it is able to let the client Jon/Task to be idempotent. The job receives the data to process, and after processing it can "commit" that it has processed the data. If the Task/Job is terminated before it has processed the data, a new Task/Job will spawn and continue processing on the "offset" that is not yet committed.

NiFi from hadoop to kafak with exactly once guarantee

Is it possible for NiFi to read from hdfs (or hive) and publish data-rows to kafka with exactly once delivery guarantee?
Publishing to Kafka from NiFi is at-least-once guarantee because a failure could occur after Kafka has already received the message, but before NiFi receives the response, which could be due to a network issue, or maybe nifi crashed and restarted at that exact moment.
In any of those cases, the flow file would be put back in the original queue before the publish kafka processor (i.e. the session was never committed), and so it would be tried again.
Due to the threading model where different threads may execute the processor, it can't be guaranteed that the same thread that originally did the publishing will be the same thread that does the retry, and therefore can't make use of the "idempotent producer" concept.

Storm Pacemaker with upgraded KafkaSpout

I had a question regarding the usage of Pacemaker. We have a currently running Storm cluster on 1.0.2 and are in the process of migrating it to 1.2.2. We also use KafkaSpout to consume data from the KAfka topics.
Now, since this release in for Kafka 0.10 +, most of the load from ZK would be taken off since the offsets won't be stored in ZK.
Considering this, does it make sense for us to also start looking at Pacemaker to reduce load further on ZK?
Our cluster has 70+ supervisor and around 70 workers with a few unused slots. Also, we have around 9100+ executors/tasks running.
Another question I have is regarding the heartbeats and who all send it to whom? From what I have read, workers and supervisors send their heartbeats to ZK, which is what Pacemaker alleviates. How about the tasks? Do they also send heartbeats? If yes, then is it to ZK or where else? There's this config called task.heartbeat.frequency.secs which has led me to some more confusion.
The reason I ask this is that if the task level heartbeats aren't being sent to ZK, then its pretty evident that Pacemaker won't be needed. This is because with no offsets being committed to ZK, the load would be reduced dramatically. Is my assesment correct or would Pacemaker be still a feasible option? Any leads would be appreciated.
Pacemaker is an optional Storm daemon designed to process heartbeats from workers, which is implemented as a in-memory storage. You could use it if ZK become a bottleneck because the storm cluster scaled up
supervisor report heartbeat to nimbusthat it is alive, used for failure tolerance, and the frequency is set via supervisor.heartbeat.frequency.secs, stored in ZK.
And worker should heartbeat to the supervisor, the frequency is set via worker.heartbeat.frequency.secs. These heartbeats are stored in local file system.
task.heartbeat.frequency.secs: How often a task(executor) should heartbeat its status to the master(Nimbus), it never take effect in storm, and has been deprecated for Storm v2.0 RPC heartbeat reporting
This heartbeat stats what executors are assigned to which worker, stored in ZK.

Is polling on a ZMQ router socket threadsafe?

I have multiple threads interacting with the same ZeroMQ router socket (bad idea, I know). I manage thread safety with locks on all sends and receives.
Do I also need to lock polling or is this relatively benign operation threadsafe?
I think using polling alleviates any need for multithreading. You can pick up events as they come in using the poll loop on one thread and then distribute those events if you wish to other threads for processing. This way you don't need to share the socket.

Resources