Spring Batch remote partitioning with Kafka, multi-tenant - Spring

I have implemented Spring Batch remote partitioning with Kafka, where 3 workers are running on separate JVMs. On each worker, before initialization of the job repository, I need to set the tenant.
I am passing a tenant code value in the executionContext during partitioning from the manager JVM. In the worker JVM I am able to read this parameter from the stepExecution, but I need to read this value during jobRepository initialization.
I am using the inbound flow below:
@Bean // request coming from manager
public IntegrationFlow inboundFlow(ConsumerFactory consumerFactory) {
    return IntegrationFlows
            .from(Kafka.inboundChannelAdapter(consumerFactory, new ConsumerProperties("requestForWorkers")))
            .channel(requestForWorkers())
            .get();
}
My understanding is that the worker can read this value at the point where its consumer reads the message from Kafka. Is that correct?

I was able to solve this multi-tenant setup on remote partitioning with Kafka by dynamically creating the step, JobRepository, inbound flow, and outbound flow beans for each tenant.
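For reference, a rough sketch of that approach (not the exact implementation; TenantWorkerRegistrar, the per-tenant topic and channel naming, and the tenant DataSource lookup are all assumptions made for illustration):

// Rough sketch only: registers a tenant-specific JobRepository and inbound flow per tenant.
// All names (TenantWorkerRegistrar, topic/channel suffixes) are illustrative assumptions.
@Component
public class TenantWorkerRegistrar {

    private final IntegrationFlowContext flowContext;
    private final ConsumerFactory<String, String> consumerFactory;

    public TenantWorkerRegistrar(IntegrationFlowContext flowContext,
                                 ConsumerFactory<String, String> consumerFactory) {
        this.flowContext = flowContext;
        this.consumerFactory = consumerFactory;
    }

    public void registerTenant(String tenantCode, DataSource tenantDataSource,
                               PlatformTransactionManager txManager) throws Exception {
        // One JobRepository per tenant, pointing at that tenant's schema/datasource
        JobRepositoryFactoryBean repositoryFactory = new JobRepositoryFactoryBean();
        repositoryFactory.setDataSource(tenantDataSource);
        repositoryFactory.setTransactionManager(txManager);
        repositoryFactory.afterPropertiesSet();
        JobRepository tenantJobRepository = repositoryFactory.getObject();

        // One inbound flow per tenant, consuming a tenant-specific request topic
        IntegrationFlow inboundFlow = IntegrationFlows
                .from(Kafka.inboundChannelAdapter(consumerFactory,
                        new ConsumerProperties("requestForWorkers-" + tenantCode)))
                .channel("requestsForWorkers-" + tenantCode)
                .get();
        flowContext.registration(inboundFlow).id("inboundFlow-" + tenantCode).register();

        // The worker step and outbound flow for this tenant would be wired against
        // tenantJobRepository in the same way (omitted here).
    }
}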

Related

StreamsException: Unable to initialize state, this can happen if multiple instances of Kafka Streams are running in the same state directory

This is about upgrading an existing code base in production which uses windowing, from kafka-clients, kafka-streams, and spring-kafka 2.4.0 to 2.6.x, and also upgrading spring-boot-starter-parent from 2.2.2.RELEASE to 2.3.x, since 2.2 is incompatible with kafka-streams 2.6.
The existing code had the beans mentioned below with the old versions (2.4.0, Spring release 2.2):
#Bean("DataCompressionCustomTopology")
public Topology customTopology(#Qualifier("CustomFactoryBean") StreamsBuilder streamsBuilder) {
//Your topology code
return streamsBuilder.build();
}
#Bean("GenericKafkaStreams")
public KafkaStreams kStream() {
//Your kafka streams code
return kafkaStreams;
}
Now, after upgrading kafka-streams and kafka-clients to 2.6.2 and spring-kafka to 2.6.x, the following exception was observed:
2021-05-13 12:33:51.954 [Persistence-Realtime-Transformation] [main] WARN o.s.b.w.s.c.AnnotationConfigServletWebServerApplicationContext - Exception encountered during context initialization - cancelling refresh attempt: org.springframework.context.ApplicationContextException: Failed to start bean 'CustomFactoryBean'; nested exception is org.springframework.kafka.KafkaException: Could not start stream: ; nested exception is org.apache.kafka.streams.errors.StreamsException: Unable to initialize state, this can happen if multiple instances of Kafka Streams are running in the same state directory
A similar error can happen when you are running multiple instances of the same application (name/id) on the same machine.
Please see state.dir to get the idea.
You can add it to the Kafka configuration and make it unique per instance.
In case you are using Spring Cloud Stream (you can't have the same port on the same machine):
spring.cloud.stream.kafka.streams.binder.configuration.state.dir: ${spring.application.name}${server.port}
UPDATE:
In the case of Spring Kafka Streams:
@Bean(name = KafkaStreamsDefaultConfiguration.DEFAULT_STREAMS_CONFIG_BEAN_NAME)
KafkaStreamsConfiguration kStreamsConfig() {
    Map<String, Object> props = new HashMap<>();
    props.put(APPLICATION_ID_CONFIG, springApplicationName);
    props.put(BOOTSTRAP_SERVERS_CONFIG, bootstrapServer);
    props.put(StreamsConfig.STATE_DIR_CONFIG, String.format("%s%s", springApplicationName, serverPort));
    return new KafkaStreamsConfiguration(props);
}
or:
spring.kafka:
  bootstrap-servers: ....
  streams:
    properties:
      application.server: localhost:${server.port}
      state.dir: ${spring.application.name}${server.port}
The problem here is that newer versions of spring-kafka automatically initialize one more Kafka Streams instance based on the topology bean, while another instance is initialized from the GenericKafkaStreams bean in the existing code base. This results in multiple threads trying to lock the state directory, and thus the error.
Even disabling KafkaAutoConfiguration at the Spring Boot level does not disable this behavior. This was such a pain to identify and cost a lot of time.
The fix is to get rid of the topology bean and have our own custom Kafka Streams bean, as in the code below:
// streamsBuilder and kstreamsconfigs are assumed to be injected fields in this class
protected Topology customTopology() {
    // topology code
    return streamsBuilder.build();
}

/**
 * This starts the Kafka Streams application and sets the state listener and state
 * store listener.
 *
 * @return KafkaStreams
 */
@Bean("GenericKafkaStreams")
public KafkaStreams kStream() {
    KafkaStreams kafkaStreams = new KafkaStreams(customTopology(), kstreamsconfigs);
    return kafkaStreams;
}
If you have a sophisticated Kafka Streams topology in your Spring Cloud Stream Kafka Streams Binder 3.0 style application, you might need to specify different application ids for different functions, like the following:
spring.cloud.stream.function.definition: myFirstStream;mySecondStream
...
spring.cloud.stream.kafka.streams:
  binder:
    functions:
      myFirstStream:
        applicationId: app-id-1
      mySecondStream:
        applicationId: app-id-2
I've handled the problem on these versions:
org.springframework.boot version 2.5.3
org.springframework.kafka:spring-kafka:2.7.5
org.apache.kafka:kafka-clients:2.8.0
org.apache.kafka:kafka-streams:2.8.0
Check this: State directory.
By default it is created in the temp folder with the Kafka Streams app id, like:
/var/folders/xw/xgslnvzj1zj6wp86wpd8hqjr0000gn/T/kafka-streams/${spring.kafka.streams.application-id}/.lock
If two or more Kafka Streams apps use the same spring.kafka.streams.application-id, then you get this exception.
So just change your Kafka Streams apps' ids.
Or set the directory option manually via StreamsConfig.STATE_DIR_CONFIG in the streams config.
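For example, if you build the streams Properties yourself, a minimal sketch (method and parameter names are illustrative) of a per-instance state directory could look like this:

// Minimal sketch: give each instance its own state directory so two instances
// on the same machine do not compete for the same .lock file.
Properties streamsProps(String applicationId, String bootstrapServers) throws IOException {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, applicationId);
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(StreamsConfig.STATE_DIR_CONFIG,
            Files.createTempDirectory("kafka-streams-state").toAbsolutePath().toString());
    return props;
}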
The answers above about setting the state dir work perfectly for me. Thanks.
Adding one observation that might be helpful for someone working with Spring Boot: when working on the same machine and trying to bring up multiple Kafka Streams application instances, if you have enabled the property spring.devtools.restart.enabled (which is mostly the case in the dev profile), you might want to disable it (spring.devtools.restart.enabled=false), because when the same application instance restarts automatically it might not get the store lock. This is what I was facing, and I was able to resolve it by disabling the restart behavior.
In my case, what worked perfectly was specifying a separate @TestConfiguration class in which I keep a counter that changes the application id for each Spring Boot test context.
@TestConfiguration
public class TestKafkaStreamsConfig {

    private static final AtomicInteger COUNTER = new AtomicInteger();

    @Bean(name = KafkaStreamsDefaultConfiguration.DEFAULT_STREAMS_CONFIG_BEAN_NAME)
    KafkaStreamsConfiguration kStreamsConfig() {
        final var props = new HashMap<String, Object>();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "test-application-id-" + COUNTER.getAndIncrement());
        // rest of configuration
        return new KafkaStreamsConfiguration(props);
    }
}
Of course, I had to enable Spring bean overriding to replace the primary configuration.
Edit: I'm using Spring Boot v2.5.10, so in my case, to make use of @TestConfiguration I have to pass it to the @SpringBootTest(classes = ...) annotation.
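For reference, a minimal sketch of such a test (MyApplication and MyKafkaStreamsTest are hypothetical names used for illustration):

// MyApplication and MyKafkaStreamsTest are hypothetical names.
@SpringBootTest(
        classes = { MyApplication.class, TestKafkaStreamsConfig.class },
        properties = "spring.main.allow-bean-definition-overriding=true")
class MyKafkaStreamsTest {
    // test methods using the overridden KafkaStreamsConfiguration
}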
I was facing the same problem: a single topology in Spring Boot, and I was trying to access the state store for interactive queries. In order to do so I needed a KafkaStreams object, as shown below.
GlobalKTable<String, String> configTable = builder.globalTable("config",
        Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("config-store")
                .withKeySerde(Serdes.String())
                .withValueSerde(Serdes.String()));
KafkaStreams streams = new KafkaStreams(builder.build(), kconfig.asProperties());
streams.start();
ReadOnlyKeyValueStore<String, String> configView =
        streams.store(StoreQueryParameters.fromNameAndType("config-store", QueryableStoreTypes.keyValueStore()));
The problem is that the Spring Kafka factory bean starts a topology, so calling streams.start() causes the lock on the state store because a second start is issued.
This can be fixed by setting the auto-startup property to false:
spring.kafka.streams.auto-startup=false
That's all you need.
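Putting those fragments together, a minimal sketch (assuming spring.kafka.streams.auto-startup=false is set, builder is the Spring-provided StreamsBuilder, and kconfig is the KafkaStreamsConfiguration bean; the bean name here is arbitrary):

@Bean(destroyMethod = "close")
public KafkaStreams interactiveQueryStreams(StreamsBuilder builder, KafkaStreamsConfiguration kconfig) {
    builder.globalTable("config",
            Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("config-store")
                    .withKeySerde(Serdes.String())
                    .withValueSerde(Serdes.String()));
    KafkaStreams streams = new KafkaStreams(builder.build(), kconfig.asProperties());
    // the only start() call, since Spring's factory bean no longer auto-starts
    streams.start();
    return streams;
}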

Need some guidance with Spring Integration Flow

I am new to Spring Integration and have read quite a lot of documentation and other topics here on StackOverflow, but I am still a bit overwhelmed about how to apply the newly acquired knowledge in a Spring Boot application.
This is what should happen:
receive a message from a Kafka topic, e.g. from "request-topic" (the payload is a custom Job POJO). InboundChannelAdapter?
do some preparation (checkout from a git repo)
process files using a batch job
commit & push to git, update the Job object with the commit-id
publish a message to Kafka with the updated Job object, e.g. to "reply-topic". OutboundChannelAdapter?
Using the DSL or plain Java configuration does not matter. My problem after trying several variants is that I could not achieve the desired result. For example, handlers would be called too early, or not at all, and thus the reply in step 5 would not be updated.
Also, there should only be one flow running at any given time, so I guess a queue should be involved at some point, probably at step 1(?).
Where and when should I use QueueChannels, DirectChannel (or any other), and do I need GatewayHandlers, e.g. to reply with a commit-id?
Any hints are appreciated.
Something like this:
@Bean
IntegrationFlow flow() {
    return IntegrationFlows.from(Kafka.inboundGateway(...))
            .handle(...)      // prep
            .transform(...)   // to JobLaunchRequest
            .handle(...)      // JobLaunchingGateway
            .handle(...)      // clean up and return result
            .get();
}
It will only process one request at a time (with default concurrency).
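To make the skeleton a bit more concrete, here is a rough, hedged sketch of one way to wire it with channel adapters (it follows the numbered steps from the question; ImportRequest stands in for the custom Job POJO to avoid clashing with Spring Batch's Job type, and the git-related calls are hypothetical placeholders):

// All type and bean names (ImportRequest, importJob, the git steps) are illustrative.
@Bean
IntegrationFlow importFlow(ConsumerFactory<String, ImportRequest> consumerFactory,
                           KafkaTemplate<String, ImportRequest> kafkaTemplate,
                           JobLauncher jobLauncher,
                           Job importJob) {
    return IntegrationFlows
            // 1. receive the custom payload from "request-topic"
            .from(Kafka.messageDrivenChannelAdapter(consumerFactory, "request-topic"))
            // 2. preparation: e.g. clone/checkout the git repo (hypothetical)
            .handle(ImportRequest.class, (request, headers) -> {
                // gitService.checkout(request); (hypothetical)
                return request;
            })
            // keep the original request around so the last step can update it with the commit id
            .enrichHeaders(h -> h.headerFunction("importRequest", m -> m.getPayload()))
            // 3. turn the payload into a JobLaunchRequest and run the batch job
            .transform(ImportRequest.class, request -> new JobLaunchRequest(importJob,
                    new JobParametersBuilder()
                            .addLong("time", System.currentTimeMillis()) // unique parameters
                            .toJobParameters()))
            .handle(new JobLaunchingGateway(jobLauncher))
            // 4. commit & push, then update the request with the resulting commit id
            .handle(JobExecution.class, (execution, headers) -> {
                ImportRequest request = (ImportRequest) headers.get("importRequest");
                // request.setCommitId(gitService.commitAndPush(...)); (hypothetical)
                return request;
            })
            // 5. publish the updated payload to "reply-topic"
            .handle(Kafka.outboundChannelAdapter(kafkaTemplate).topic("reply-topic"))
            .get();
}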

Is it possible to automatically create queues without RabbitAutoConfiguration.class? AMQP

I'm using version 2.1.0.RELEASE of Spring Boot with AMQP. Unfortunately, I need to connect to several different RabbitMQ servers. I had to exclude RabbitAutoConfiguration.class because, due to changes in the above version of Spring, it's impossible to start without one of the ConnectionFactory beans being marked as primary; but even if I set one of them as @Primary, it obviously doesn't work, because how would AMQP/Spring Boot know which queue to create on which server?
So, is it possible to automatically create queues on different servers with auto-configuration disabled?
Yes, you need a RabbitAdmin for each connection factory.
By default all components will be declared on all brokers, but you can add conditions. See Conditional Declaration.
By default, all queues, exchanges, and bindings are declared by all RabbitAdmin instances (assuming they have auto-startup="true") in the application context.
@Bean
public Queue queue1() {
    Queue queue = new Queue("foo");
    queue.setAdminsThatShouldDeclare(admin1());
    return queue;
}
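For completeness, a rough sketch of the surrounding beans for two brokers (host names and bean names are illustrative, not a definitive setup):

@Bean
public ConnectionFactory connectionFactory1() {
    return new CachingConnectionFactory("rabbit-host-1"); // illustrative host
}

@Bean
public ConnectionFactory connectionFactory2() {
    return new CachingConnectionFactory("rabbit-host-2");
}

@Bean
public RabbitAdmin admin1() {
    return new RabbitAdmin(connectionFactory1()); // declares elements on broker 1
}

@Bean
public RabbitAdmin admin2() {
    return new RabbitAdmin(connectionFactory2()); // declares elements on broker 2
}

@Bean
public Queue queue2() {
    Queue queue = new Queue("bar");
    queue.setAdminsThatShouldDeclare(admin2()); // only declared on broker 2
    return queue;
}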

Kafka, Spring Kafka and redelivering old messages

I use Kafka and Spring Boot with Spring Kafka. After an abnormal application termination and a subsequent restart, my application started receiving old, already processed messages from the Kafka topic.
What may be the reason for that, and how can I find and resolve the issue?
My Kafka properties:
spring.kafka.bootstrap-servers=${kafka.host}:${kafka.port}
spring.kafka.consumer.auto-offset-reset=earliest
spring.kafka.consumer.group-id=postfenix
spring.kafka.consumer.enable-auto-commit=false
My Spring Kafka factory and listener:
@Bean
public ConcurrentKafkaListenerContainerFactory<String, Post> postKafkaListenerContainerFactory(KafkaProperties kafkaProperties) {
    ConcurrentKafkaListenerContainerFactory<String, Post> factory = new ConcurrentKafkaListenerContainerFactory<>();
    factory.getContainerProperties().setAckMode(AckMode.MANUAL);
    factory.setConsumerFactory(postConsumerFactory(kafkaProperties));
    return factory;
}

@KafkaListener(topics = "${kafka.topic.post.send}", containerFactory = "postKafkaListenerContainerFactory")
public void sendPost(ConsumerRecord<String, Post> consumerRecord, Acknowledgment ack) {
    Post post = consumerRecord.value();
    // do some logic here
    ack.acknowledge();
}
When using Kafka, the clients need to commit offsets themselves. This is in contrast to other message brokers, such as AMQP brokers, where the broker keeps track of the messages a client has already received.
In your case, you do not commit offsets automatically, and therefore Kafka expects you to commit them manually (because of this setting: spring.kafka.consumer.enable-auto-commit=false). If you do not commit offsets manually in your program, the behaviour you describe is pretty much the expected one: Kafka simply does not know which messages your program has processed successfully. Each time you restart your program, Kafka sees that your program has not committed any offsets yet and applies the strategy you provide in spring.kafka.consumer.auto-offset-reset=earliest, which means starting from the first message in the topic.
If this is all new to you, I suggest reading up on this documentation on Kafka and this Spring documentation, because Kafka is quite different from other message brokers.

Spring Boot and long running tasks

In my Spring Boot application I have to implement an import service. Users can submit a bunch of JSON files and the application will try to import the data from these files. Depending on the amount of data in the JSON files, a single import process can take 1 or 2 hours.
I do not want to block the users during the import process, so I plan to accept the task for importing and notify the user that the data is scheduled for processing. I'll put the data into a queue, and a free queue consumer on the other end will start the import process. Also, I need the ability to monitor the jobs in the queue and terminate them if needed.
Right now I'm thinking of using embedded Apache ActiveMQ to introduce the message producer and consumer logic, but before that I'd like to ask, from an architecture point of view: is it a good choice for the described task, or can it be implemented with more appropriate tools, like for example plain Spring @Async and so on?
It is possible to process files concurrently with Camel like this:
from("file://incoming?maxMessagesPerPoll=1&idempotent=true&moveFailed=failed&move=processed&readLock=none").threads(5).process()
Take a look at http://camel.apache.org/file2.html
But I think that it is better for your requirements to use a standalone ActiveMQ, a standalone service to move files to ActiveMQ, and a standalone consumer, so that you are able to kill or restart each one independently.
It is better to use ActiveMQ, as you said, and you can easily create a service to move messages to a queue with Camel like this:
CamelContext context = new DefaultCamelContext();
ConnectionFactory connectionFactory = new ActiveMQConnectionFactory("vm://localhost?broker.persistent=true");
context.addComponent("test-jms", JmsComponent.jmsComponentAutoAcknowledge(connectionFactory));
context.addRoutes(new RouteBuilder() {
    public void configure() {
        // convertBodyTo(String.class) to use TextMessage, or maybe send them as files to the queue
        from("file://testFolderPath").convertBodyTo(String.class).to("test-jms:queue:test.queue");
    }
});
context.start();
Here are some examples:
http://www.programcreek.com/java-api-examples/index.php?api=org.apache.camel.component.jms.JmsComponent
https://skills421.wordpress.com/2014/02/08/sending-local-files-to-a-jms-queue/
https://github.com/apache/camel/blob/master/examples/camel-example-jms-file/src/main/java/org/apache/camel/example/jmstofile/CamelJmsToFileExample.java
https://github.com/apache/camel/tree/master/examples
To monitor and manage, you can use JMX with VisualVM or Hawtio: http://hawt.io/getstarted/index.html
http://camel.apache.org/camel-jmx.html
To consume, you can use a DefaultMessageListenerContainer with concurrent consumers on the queue; for this you need to change the prefetchPolicy on the ConnectionFactory used by the DefaultMessageListenerContainer. See Multithreaded JMS client ActiveMQ.
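A rough sketch of such a consumer (broker URL, queue name, and listener body are illustrative): the prefetch is lowered so long-running messages are spread across the concurrent consumers instead of being buffered by one of them.

// Lower the prefetch so long-running messages are distributed across consumers
ActiveMQConnectionFactory connectionFactory =
        new ActiveMQConnectionFactory("tcp://localhost:61616");
connectionFactory.getPrefetchPolicy().setQueuePrefetch(1);

DefaultMessageListenerContainer container = new DefaultMessageListenerContainer();
container.setConnectionFactory(connectionFactory);
container.setDestinationName("import.jobs");   // illustrative queue name
container.setConcurrentConsumers(5);
container.setMessageListener((MessageListener) message -> {
    // run the import for this message
});
container.afterPropertiesSet();
container.start();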
