Pause/Resume Flink job during migrations - elasticsearch

I'm using Apache Flink to propagate updates from a given set of Kafka topics into an Elasticsearch cluster.
The problem I'm facing is that sometimes the Elasticsearch cluster evolves and I have to (1) modify the mappings, (2) copy the data over to a new index... and by the time I point the Flink jobs to the new alias/index, plenty of updates have already made it to the old index.
So I wonder what's the best way to approach this. I can afford downtime, but I would like to avoid it if possible. I was trying to make the Flink jobs slow down or pause the (Kafka) input sources until the migration finishes, but I didn't find any endpoint for this.
The Flink jobs run in application mode.
If anyone can shed some light on how to pause/resume the jobs via an API or something similar, I will really appreciate the input. The only constraint I have is around stopping the applications (as in stopping/killing pods): it's possible, but too troublesome due to access constraints on the Kubernetes clusters.

I'd probably look into stopping the job with a savepoint using the Flink REST API: https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/rest_api/#jobs-jobid-stop
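For example, with plain HTTP you could trigger stop-with-savepoint like this. The job-manager URL, job id and savepoint directory below are placeholders; the response contains a trigger/request id you can poll for completion:

```python
import json
import urllib.request

def stop_request(base_url, job_id, target_directory, drain=False):
    """Build the URL and JSON body for Flink's stop-with-savepoint endpoint."""
    url = f"{base_url}/jobs/{job_id}/stop"
    body = {"targetDirectory": target_directory, "drain": drain}
    return url, json.dumps(body).encode("utf-8")

def stop_job(base_url, job_id, target_directory):
    """POST the stop request and return the JobManager's JSON response."""
    url, body = stop_request(base_url, job_id, target_directory)
    req = urllib.request.Request(
        url, data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # contains an id you can poll for savepoint status

if __name__ == "__main__":
    # Placeholder host, job id and savepoint path
    print(stop_job("http://flink-jobmanager:8081", "<job-id>", "s3://bucket/savepoints"))
```

Once the migration is done, you resubmit the job from the savepoint and no Kafka offsets are lost.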
If that Flink app is pretty big and has a lot of state, you can also try to simply stop sending data to the input Kafka topic instead of stopping the job (assuming that, after you've made the required ES cluster changes, it can properly write with the new mappings and indices without any change in the Flink job). It adds a bit of overhead, but you could use different topics for your producers and your Flink sources, and have another simple Flink job mirror data from one topic (where the producers write) to the other (where Flink consumes). When you want to stop writing to ES, just stop that mirror job using the REST API. To avoid writing a new Flink job you could use MirrorMaker or similar, but to stop it you may have to kill its pod.
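The mirror job itself doesn't even need to be Flink; conceptually it's just a consume-and-produce loop. A rough sketch using the third-party kafka-python package (topic names, group id and bootstrap servers are made up):

```python
def mirror(consumer, producer, destination_topic):
    """Copy every record from `consumer` to `destination_topic` unchanged."""
    for record in consumer:
        producer.send(destination_topic, key=record.key, value=record.value)

if __name__ == "__main__":
    from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python
    # Producers write to "events-source"; the Flink job reads "events-for-flink".
    consumer = KafkaConsumer("events-source",
                             bootstrap_servers="kafka:9092",
                             group_id="mirror")
    producer = KafkaProducer(bootstrap_servers="kafka:9092")
    mirror(consumer, producer, "events-for-flink")
```

Stopping this process stops the flow into ES while the producers keep writing; restarting it resumes from the committed consumer offsets.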
Or another option is architecting the Elasticsearch indices so they can support your cluster evolution without having to stop the Flink app. It is hard to know exactly what you'd need to change, but by writing into aliases and playing with the write-index flag you may be able to achieve what you want. I've done this in the past, but it is true that if your mappings change a lot it may be hard to do.
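To make the alias idea concrete: Elasticsearch applies all actions in a single `_aliases` call atomically, so the Flink sink can keep writing to the alias while you flip it from the old index to the new one. A sketch of building that request body (index and alias names are made up; sending it to `POST /_aliases` with your client of choice is omitted):

```python
def alias_swap_actions(alias, old_index, new_index):
    """Body for POST /_aliases that atomically moves a write alias
    from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias,
                     "is_write_index": True}},
        ]
    }
```

After this call, writes through the alias land in the new index with the new mappings, with no gap during which writes are rejected.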

Related

ruby-kafka: is it possible to publish to two kafka instances at the same time

The current flow of the project I'm working on involves pushing to a local Kafka using the ruby-kafka gem.
Now the need has arisen to add a producer for the remote Kafka, and also duplicate the messages there.
I'm looking for a better way than calling Kafka.new(...) twice...
Could you please help me? Do you happen to have any ideas?
Another approach to consider would be writing the data once from your application, and then asynchronously replicating the messages from one Kafka cluster to another. There are multiple ways of doing this, including Apache Kafka's MirrorMaker, Confluent's Replicator, Uber's uReplicator, etc.
Disclaimer: I work for Confluent.
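For instance, with MirrorMaker 2 the whole setup is a small properties file passed to the bundled `connect-mirror-maker.sh` script. Cluster names, bootstrap servers and the topic pattern below are illustrative:

```properties
# mm2.properties - replicate matching topics from "local" to "remote"
clusters = local, remote
local.bootstrap.servers = localhost:9092
remote.bootstrap.servers = remote-kafka:9092

local->remote.enabled = true
local->remote.topics = my_topic.*
```

Your application then keeps a single producer writing to the local cluster, and replication to the remote cluster happens out of band.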

Scheduling task at some specific time in Java

I have some code that will schedule many jobs at different date-times, so overall I will have a lot of jobs to run at specific date-times. I know that there is the Spring scheduler, which will execute a job at some time period, but it does not schedule jobs dynamically. I could use ActiveMQ with timed delivery, or Quartz, for my purpose, but I'm looking for a little suggestion: shall I use Quartz, ActiveMQ timed/delayed delivery, or something else?
There is another alternative in an ExecutorService with timed execution, but if the application restarts then the scheduled jobs will be gone, I believe. Any help will be appreciated.
While you can schedule message delivery in ActiveMQ, it wasn't designed to be used as a job scheduler, whereas that's exactly what Quartz was designed for.
In one of your comments you talked about wanting a "scalable solution" and ActiveMQ won't scale well with a huge number of scheduled jobs because the more messages which accumulate in the queues the worse it will perform since it will ultimately have to page those messages to disk rather than keeping them in memory. ActiveMQ, like most message brokers, was meant to hold messages for a relatively short amount of time before they are consumed. It's much different than a database which is better suited for this use-case. Quartz should scale better than ActiveMQ for a large number of jobs for this reason.
Also, the complexity of the jobs you can configure in Quartz is greater. If you go with ActiveMQ and you eventually need more functionality than it supports, then that complexity will be pushed down into your application code. However, there's a fair chance you could simply do what you want with Quartz, since it was designed as a job scheduler.
Lastly, a database is more straightforward to maintain than a message broker, in my opinion, and a database is also easy to provision with most cloud providers. I'd recommend you go with Quartz.
You can start by using a cron expression in order to cover the case when your application restarts. The cron expression can be stored in a properties file. Also, once your application is running, you can restart or reschedule your job programmatically by creating a new job instance with another cron expression, for example.
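To make the restart concern concrete: the reason Quartz with a JDBC JobStore survives restarts is that pending triggers live in a database rather than in memory. A toy illustration of that idea, using Python and sqlite purely for illustration (the table layout and job names are made up; Quartz does all of this for you):

```python
import sqlite3
import time

class PersistentScheduler:
    """Stores pending jobs in a database so they survive a process restart."""

    def __init__(self, db_path):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS jobs (name TEXT PRIMARY KEY, run_at REAL)")

    def schedule(self, name, run_at_epoch_seconds):
        # Upsert so rescheduling a job replaces its previous fire time
        self.db.execute("INSERT OR REPLACE INTO jobs VALUES (?, ?)",
                        (name, run_at_epoch_seconds))
        self.db.commit()

    def due_jobs(self, now=None):
        """Jobs whose fire time has passed; a worker loop would poll this."""
        now = time.time() if now is None else now
        rows = self.db.execute(
            "SELECT name FROM jobs WHERE run_at <= ?", (now,)).fetchall()
        return sorted(name for (name,) in rows)
```

An in-memory ExecutorService loses this table on restart; a database-backed store (file-based sqlite here, a real RDBMS for Quartz) does not.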

Concurrent batch jobs writing logs to database

My production system has nearly 170 Control-M jobs (a scheduler, essentially like cron) running every day. These jobs are woven together (by creating dependencies) to perform ETL operations. E.g., a Control-M job almost always starts with a shell script, which then executes a bunch of Python, Hive scripts or map-reduce jobs in a specific order.
I am trying to implement logging in each of these processes to be able to better monitor the tasks and the pipelines as a whole. The logs would be used to build a monitoring dashboard.
Currently I have implemented logging using a central wrapper which is called by each of the processes to log information. This wrapper in turn opens up a Teradata connection EACH time and calls a Teradata stored procedure to write into a Teradata table.
This works fine for now. But in my case, multiple concurrent processes (spawning even more parallel child processes) run at the same time, and I have started experiencing dropped connections while doing some load testing. Below are the approaches I have been thinking about:
Make processes write to some kind of message queue (e.g. AWS SQS). A listener would pick data from these queues asynchronously and then batch-write to Teradata.
Use files or some intermediate structure to perform batch writes to the Teradata DB.
I would definitely like to hear your thoughts on these or any other better approaches. Eventually the logging endpoint will be shifted to Redshift, hence I'm thinking along the lines of AWS SQS queues.
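Whichever transport you pick, the core of the first approach is a buffer that decouples the producing processes from the database write, so only one connection is held instead of one per process. A rough sketch (the flush target below stands in for your Teradata stored-procedure call; batch size and interval are made up):

```python
import queue
import threading

class BatchLogger:
    """Processes enqueue records; a background thread flushes them in batches."""

    def __init__(self, flush_fn, batch_size=100, flush_interval=1.0):
        self.q = queue.Queue()
        self.flush_fn = flush_fn          # e.g. one stored-procedure call per batch
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def log(self, record):
        self.q.put(record)                # cheap, non-blocking for the caller

    def _run(self):
        # Keep draining until asked to stop AND the queue is empty
        while not self._stop.is_set() or not self.q.empty():
            batch = []
            try:
                batch.append(self.q.get(timeout=self.flush_interval))
                while len(batch) < self.batch_size:
                    batch.append(self.q.get_nowait())
            except queue.Empty:
                pass
            if batch:
                self.flush_fn(batch)      # one write per batch, one connection

    def close(self):
        self._stop.set()
        self._thread.join()
```

The same shape works whether the buffer is an in-process queue, SQS, or a file: the point is that the database sees a few large writes instead of many tiny concurrent ones.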
Thanks in advance.
I think Kinesis Firehose is the perfect solution for this. Setting up the Firehose stream is incredibly quick and easy to configure, it's very inexpensive, and it will stream your data to an S3 bucket of your choice and optionally load your logs directly into Redshift.
If Redshift is your end goal (or even just S3), Kinesis Firehose couldn't make it easier.
https://aws.amazon.com/kinesis/firehose/
Amazon Kinesis Firehose is the easiest way to load streaming data into AWS. It can capture and automatically load streaming data into Amazon S3 and Amazon Redshift, enabling near real-time analytics with existing business intelligence tools and dashboards you’re already using today. It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. It can also batch, compress, and encrypt the data before loading it, minimizing the amount of storage used at the destination and increasing security. You can easily create a Firehose delivery stream from the AWS Management Console, configure it with a few clicks, and start sending data to the stream from hundreds of thousands of data sources to be loaded continuously to AWS – all in just a few minutes.
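For what it's worth, pushing a record from the logging wrapper into a Firehose delivery stream is only a couple of lines with boto3 (third-party). The stream name, region and record fields below are placeholders; Firehose then delivers batches to S3 and optionally Redshift as configured on the stream:

```python
import json

def encode_record(event):
    """Firehose records are opaque bytes; newline-delimited JSON is a
    convenient shape for downstream S3/Redshift delivery."""
    return (json.dumps(event) + "\n").encode("utf-8")

def send_log(client, stream_name, event):
    """Put a single record onto the delivery stream."""
    return client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": encode_record(event)},
    )

if __name__ == "__main__":
    import boto3  # pip install boto3
    firehose = boto3.client("firehose", region_name="us-east-1")
    send_log(firehose, "etl-logs", {"job": "daily_load", "status": "ok"})
```

Batching, retries and delivery to the destination are handled by the service, so the wrapper stays as thin as the current one.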

How to send, store and analyze sensor data using Hadoop

My Raspberry Pi 2 is doing well with Windows 10, and I'm able to control an LED from the internet using .NET MF. Now I want to send my LED ON-OFF signal (I'm going to use a temperature sensor instead of the LED) to a big data platform for storage, analysis and retrieval.
I checked on the net but was not able to find a simple and easy way to do that. Could anyone please suggest a tutorial for "How can I send real-time data to Hadoop"? I want to understand the whole architecture before proceeding.
Which technologies/things should I concentrate on to build such a POC?
Note: I think I need some combination like an MQTT broker, Spark or Storm, etc., but I'm not sure how to put all the pieces together to make it practically possible. Please correct me if I'm wrong, and help.
You could send the signals as a stream of events to Hadoop in real time, using one of several components which make up the Hadoop "ecosystem". Systems such as Spark or Storm which are for processing the data in real time are only necessary if you want to apply logic to the stream in real-time. If you just want to batch up the events and store them in HDFS for later retrieval by a batch process, you could use:
Apache Flume. A Flume agent runs on one or more of the Hadoop nodes and listens on a port. Your Raspberry Pi sends each event one by one to that port. Flume buffers the events and then writes them to HDFS: https://flume.apache.org/FlumeUserGuide.html
Kafka. Your Raspberry Pi sends the events one by one to a Kafka instance which stores them as a message queue. A further distributed batch process runs periodically on Hadoop in order to move the events from Kafka to HDFS. This solution is more robust but has more moving parts.
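For instance, if the Flume agent is configured with a line-based source (e.g. the netcat source) listening on a port, the Pi side can be as small as this sketch. The host, port and JSON field names are assumptions; a Kafka producer client would replace `send_reading` in the second option:

```python
import json
import socket
import time

def encode_reading(sensor_id, value, ts=None):
    """One JSON object per line, so a line-based Flume source splits events."""
    ts = time.time() if ts is None else ts
    payload = {"sensor": sensor_id, "value": value, "ts": ts}
    return (json.dumps(payload) + "\n").encode("utf-8")

def send_reading(host, port, payload):
    """Open a TCP connection to the Flume agent and send one event."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)

if __name__ == "__main__":
    # Placeholder edge-node host and Flume source port
    send_reading("hadoop-edge-node", 44444, encode_reading("temp-1", 21.5))
```

Flume then batches these lines and rolls them into HDFS files according to the sink configuration, and a later batch job (Hive, Spark, etc.) analyzes them.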

Can Spark be run alongside Kafka on the same node to optimize by keeping the Spark Streaming ETL processes closer to real time data?

I have gotten the advice, and read in a few places, that running Spark on the data nodes greatly improves the performance of batch processing. I have also gotten the advice to keep the Kafka service isolated on dedicated nodes.
If most of the consumers of the Kafka data are Spark Streaming ETL processes that land a transformed version of the data either back in Kafka or in some other storage mechanism, does it not then make sense to run those processes on the same nodes, i.e. run the Spark service alongside the Kafka service on the dedicated Kafka cluster?
Thanks
