Kafka Connect behaviour in distributed mode

I'm running Kafka Connect in distributed mode with two different connectors, each with one task. Each connector runs on a different instance, which is exactly what I want.
Is this behaviour guaranteed, i.e. will the Kafka Connect cluster always share the load properly?

Connectors in Kafka Connect run with one or more tasks. The number of tasks depends on how you have configured the connector, and on whether the connector itself can run multiple tasks. An example is the JDBC Source connector which, if ingesting more than one table from a database, will run (if configured to do so) one task per table.
When you run Kafka Connect in distributed mode, tasks from all the connectors are executed across the available workers. Each task only executes on one worker at any one time.
If a worker fails (or is shut down) then Kafka Connect will rebalance the tasks across the remaining worker(s).
Therefore, you may see one connector running across different workers (instances), but only if it has more than one task.
If you think you are seeing the same connector's task executing more than once then it suggests a misconfiguration of the Kafka Connect cluster, and I would suggest reviewing https://rmoff.net/2019/11/22/common-mistakes-made-when-configuring-multiple-kafka-connect-workers/.
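As a concrete illustration, a JDBC Source connector config might look like the sketch below (connection details, table names, and the connector name are hypothetical). With three tables and `tasks.max` set to 3, Connect can run one task per table, and in distributed mode those three tasks can be spread across the workers:

```json
{
  "name": "jdbc-source-orders",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db:5432/shop",
    "table.whitelist": "orders,customers,products",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "tasks.max": "3"
  }
}
```

With `tasks.max` set to 1 instead, the whole connector would always run as a single task on a single worker, which matches the behaviour described in the question.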

Related

Wildfly 11 - High Availability - Single deploy on slave

I have two servers in HA mode. I'd like to know whether it is possible to deploy an application on the slave server only. If yes, how do I configure this in JGroups? I need to run a specific program that accesses the master database, but I would not like to run it on the master server, to avoid overhead there.
JGroups itself does not know much about WildFly and the deployments; it only creates a communication channel between nodes. I don't know where you get the notion of master/slave, but JGroups always has a single* node marked as coordinator. You can check the membership through Channel.getView().
However, you still need to deploy the app on both nodes and just make it inactive if this is not its target node.
*) If there's no split-brain partition, or similar rare/temporal issues
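The coordinator check can be sketched as below. In JGroups the coordinator is the first member of the view, so a node can test whether it is the coordinator by comparing its own address against the head of the member list. This sketch uses plain strings in place of JGroups Address objects; in a real deployment the list would come from channel.getView().getMembers() and the local address from channel.getAddress():

```java
import java.util.List;

public class CoordinatorCheck {
    // JGroups places the coordinator first in the view's member list,
    // so a node is the coordinator iff its own address heads that list.
    static boolean isCoordinator(List<String> viewMembers, String localAddress) {
        return !viewMembers.isEmpty() && viewMembers.get(0).equals(localAddress);
    }

    public static void main(String[] args) {
        List<String> members = List.of("node-a", "node-b");
        System.out.println(isCoordinator(members, "node-a")); // true
        System.out.println(isCoordinator(members, "node-b")); // false
    }
}
```

An app deployed on both nodes could use a check like this to stay passive everywhere except its target node.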

Kafka Connect separated logging

Currently we are using a couple of custom connector plugins for our Confluent Kafka Connect distributed worker cluster. One thing that has bothered me for a long time is that Kafka Connect writes all logs from all deployed connectors to one file/stream. This makes debugging an absolute nightmare. Is there a way to make Kafka Connect log each connector to a different file/stream?
Via connect-log4j.properties I am able to make a specific class log to a different file/stream, but this means that with every additional connector I have to adjust connect-log4j.properties.
Thanks
Kafka Connect does not currently support this. I agree that it is not ideal.
One option would be to split out your connectors and have a dedicated worker cluster for each, and thus separate log files.
Kafka Connect is part of Apache Kafka so you could raise a JIRA to discuss this further and maybe contribute it back via a PR?
Edit April 12, 2019: See https://cwiki.apache.org/confluence/display/KAFKA/KIP-449%3A+Add+connector+contexts+to+Connect+worker+logs
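In the meantime, the per-class routing mentioned in the question looks roughly like this in connect-log4j.properties (the connector class name and file path are hypothetical), which is why every new connector means another stanza:

```properties
# Route one connector's classes to a dedicated appender (class name is hypothetical)
log4j.logger.com.example.connect.MyCustomSourceConnector=INFO, myConnectorFile
log4j.additivity.com.example.connect.MyCustomSourceConnector=false

log4j.appender.myConnectorFile=org.apache.log4j.FileAppender
log4j.appender.myConnectorFile.File=/var/log/kafka/my-connector.log
log4j.appender.myConnectorFile.layout=org.apache.log4j.PatternLayout
log4j.appender.myConnectorFile.layout.ConversionPattern=[%d] %p %m (%c)%n
```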

Spring Batch in clustered environment, high-availability

Right now I use an H2 in-memory database as the JobRepository for my single-node Spring Batch/Boot application.
Now I would like to run the Spring Batch application on two nodes in order to increase performance (distributing jobs between these two instances) and make the application more fault-tolerant.
Instead of H2 I'm going to use PostgreSQL and configure both applications to use this shared database. Is that enough for Spring Batch to start working properly in the cluster and distributing jobs between cluster nodes, or do I need to perform some additional actions?
Depending on how you will distribute your jobs across the nodes, you might need to set up a communication middleware (such as a JMS or AMQP provider) in addition to the shared job repository.
For example, if you use remote partitioning, your job will be partitioned and each worker can be run on one node. In this case, the job repository must be shared in order for:
the workers to report their progress to the job repository
the master to poll the job repository for the workers' statuses.
If your jobs are completely independent and you don't need features like restart, you can continue using an in-memory database for each job and launch multiple instances of the same job on different nodes. But even in this case, I would recommend using a production-grade job repository instead of an in-memory database. Things can go wrong very quickly in a clustered environment, and having a job repository to store the execution status, synchronize executions, restart failed executions, etc. is crucial in such an environment.
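Pointing both nodes at the shared PostgreSQL job repository can be as simple as the Spring Boot properties below (host, database name, and credentials are placeholders, and the schema-initialization property name varies slightly across Boot versions):

```properties
# application.properties on both nodes (hypothetical host and credentials)
spring.datasource.url=jdbc:postgresql://db-host:5432/batch
spring.datasource.username=batch
spring.datasource.password=secret
# Let Spring Batch create its BATCH_* metadata tables on first start
spring.batch.initialize-schema=always
```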

Flink: How to set JobManager restart strategy in HA?

I am running a Flink standalone cluster with HA. When I kill one JobManager, the standby JobManager takes over. But when I check the TaskManager log to see what exactly happens during this switchover, I observe that Flink makes 6 attempts to reconnect to the old (failed) JobManager, with increasing timeouts (500, 1000, 2000, 4000, 8000, 16000 ms).
Can I reduce this number of attempts via the Flink configuration file so that the TaskManagers connect to the new JobManager as fast as possible? Which settings do I have to change?

Running spark cluster on standalone mode vs Yarn/Mesos

Currently I am running my Spark cluster in standalone mode. I read data from flat files or Cassandra (depending on the job) and write the processed data back to Cassandra.
I was wondering: if I switch to Hadoop and start using a resource manager like YARN or Mesos, does it give me an additional performance advantage, such as shorter execution time and better resource management?
Sometimes when I am processing a huge chunk of data there is a possibility of stage failure during shuffling. If I migrate to YARN, can the resource manager address this issue?
The Spark standalone cluster manager also gives you cluster-mode capabilities.
A Spark standalone cluster provides almost all the same features as the other cluster managers if you are only running Spark.
When you submit your application in cluster mode, all job-related files are copied to one of the machines in the cluster, which then submits the job on your behalf. If you submit the application in client mode, the machine from which the job is submitted takes care of the driver-related activities. This means that in client mode the submitting machine cannot go offline, whereas in cluster mode it can.
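The two submission modes differ only in the --deploy-mode flag; a sketch (master URL and jar name are placeholders):

```shell
# Client mode: the driver runs on the submitting machine, which must stay online
spark-submit --master spark://master:7077 --deploy-mode client my-app.jar

# Cluster mode: the driver is launched on a worker inside the cluster,
# so the submitting machine can go offline after submission
spark-submit --master spark://master:7077 --deploy-mode cluster my-app.jar
```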
Having a Cassandra cluster would also not change any of these behaviours, except that it can save you network traffic if the Spark executors can use the nearest contact point (similar to data locality).
Failed stages get rescheduled whichever of these cluster managers you use.
I was wondering: if I switch to Hadoop and start using a resource manager like YARN or Mesos, does it give me an additional performance advantage, such as shorter execution time and better resource management?
In the standalone cluster model, each application by default uses all the available nodes in the cluster.
From the spark-standalone documentation page:
The standalone cluster mode currently only supports a simple FIFO scheduler across applications. However, to allow multiple concurrent users, you can control the maximum number of resources each application will use. By default, it will acquire all cores in the cluster, which only makes sense if you just run one application at a time.
In other cases (when you are running multiple applications on the cluster), you may prefer YARN.
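Per the quoted documentation, that per-application resource cap can be set in spark-defaults.conf (the values below are illustrative):

```properties
# Cap each application's share of the standalone cluster
# so several applications can run concurrently
spark.cores.max=8
spark.executor.memory=4g
```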
Sometimes when I am processing a huge chunk of data there is a possibility of stage failure during shuffling. If I migrate to YARN, can the resource manager address this issue?
Hard to say, since your application logic is not known, but you can give YARN a try.
Have a look at related SE question for benefits of YARN over Standalone and Mesos:
Which cluster type should I choose for Spark?
