KTable returns no data in Spring Boot application, however it can be queried

I have a Spring Boot application working with Kafka Streams. I have a KTable with some financial currency quotes which is created like this:
@Bean(name = "indicativeQuotes")
public KTable<String, Quote> quoteKTable(StreamsBuilder streamsBuilder) {
    return streamsBuilder.table(quoteTopicName,
            Materialized.<String, Quote, KeyValueStore<Bytes, byte[]>>as("quoteTable")
                    .withKeySerde(Serdes.String())
                    .withValueSerde(new JsonSerde<>(Quote.class)));
}
I @Autowire this bean in another component, and test it with the following code:
@Autowired
private KTable<String, Quote> indicativeQuotes;

@PostConstruct
private void postConstruct() {
    doPrint();
}

public void doPrint() {
    ReadOnlyKeyValueStore<String, Quote> store = streamsBuilderFactoryBean.getKafkaStreams()
            .store("quoteTable", QueryableStoreTypes.keyValueStore());
    store.all().forEachRemaining(keyValue -> log.info("Key: " + keyValue.key + " Value: " + keyValue.value));
    indicativeQuotes.foreach((k, v) -> log.info(k));
}
The code logs correct values when querying through the store, but foreach() outputs nothing, as if the table were empty. I have also tried print() and other options; all output nothing, without any exceptions.
I'm starting to think that I can't inject KTable beans like that, but the Spring documentation on Kafka Streams is pretty scarce and I can't find good examples. Any help will be appreciated.
Update.
My use case is that I have a scheduled Quartz job which is supposed to write current state of KTable to a Kafka topic when triggered, like following:
@Override
protected void executeInternal(JobExecutionContext jobExecutionContext) throws JobExecutionException {
    TriggerKey triggerKey = jobExecutionContext.getTrigger().getKey();
    log.info("Job was triggered by: {}", triggerKey.getName());
    indicativeQuotes.filter((key, value) -> key.equals(triggerKey.getName()))
            .mapValues(quoteToCourseFixedMapper)
            .toStream()
            .peek((instrument, course) -> log.info("Sending courses for instrument: {}, {}", instrument, course))
            .to(quoteEventTopicName);
}
But I think this code does not work because it is not part of the processing topology and I cannot just take data from the KTable on demand. I'm a bit puzzled here: of course I can query the data through the store when the event is triggered, but maybe there is a better pattern for such a use case? Basically, I'm interested in whether it's possible to incorporate these triggered job events as part of the processing pipeline.

If you just want to publish the updates to another topic, turn the KTable into a KStream and use the to() function.
KTable<String, Quote> ktable = ...;
KStream<String, Quote> kstream = ktable.toStream();
kstream.to("topic", Produced.with(keySerde, valueSerde));
The topic will contain the change log of that table.
BUT
Apparently, because of some lifecycle-related concepts, you can't just inject (@Autowire) a KStream/KTable. You should keep your Kafka Streams related code as inline as possible.
So in your specific case, where you want to do something with the current state of the table at some "random" time, you have to query the store (table). Search for Kafka Streams interactive queries. Remember that you need to fetch data from all of your instances of the application (if you have more than one), or you can use a global store. It's a day or two of searching.
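For example, a rough sketch of that interactive-queries approach inside the Quartz job could look like the following. It reuses the streamsBuilderFactoryBean, store name, mapper and topic name from the question, and assumes a KafkaTemplate is also injected into the job (that part is an assumption, not something shown above), so treat it as a sketch rather than a drop-in implementation:
@Override
protected void executeInternal(JobExecutionContext jobExecutionContext) throws JobExecutionException {
    String instrument = jobExecutionContext.getTrigger().getKey().getName();
    // query the materialized KTable store on demand instead of building a new topology per trigger
    ReadOnlyKeyValueStore<String, Quote> store = streamsBuilderFactoryBean.getKafkaStreams()
            .store("quoteTable", QueryableStoreTypes.keyValueStore());
    Quote quote = store.get(instrument);
    if (quote != null) {
        // publish the mapped value to the output topic via a plain producer (assumed KafkaTemplate bean)
        kafkaTemplate.send(quoteEventTopicName, instrument, quoteToCourseFixedMapper.apply(quote));
    } else {
        log.warn("No quote found in store for instrument {}", instrument);
    }
}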

Related

Using Spring Cloud Stream Kafka Streams with Avro input/output with nativeEncoding/decoding=false

We're testing the use of Kafka Streams via Spring Cloud Stream function support with Avro input/output records, but setting nativeEncoding=false and nativeDecoding=false in order to use a custom MessageConverter where we do the Avro conversion.
The default serdes are StringSerde for keys and ByteArraySerde for values.
Everything is ok when we only use a KStream to KStream function, for example:
@Bean
public Function<KStream<String, DataRecordAvro>, KStream<String, DataRecordAvro>> wordsCount() {
    return input -> input
            .flatMapValues(value -> Arrays.asList(value.getName().toString().toLowerCase().split("\\W+")))
            .map((key, value) -> new KeyValue<>(value, value))
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            .windowedBy(TimeWindows.of(Duration.ofSeconds(5)).grace(Duration.ofMillis(0)))
            .count()
            .toStream()
            .map((key, value) -> new KeyValue<>(key.key(), new DataRecordAvro(key.key(), value)));
}
but when we try a slightly more complex example involving an input KTable, like this:
@Bean
public BiFunction<KStream<String, DataRecordAvro>, KTable<String, DataRecordAvro>, KStream<String, DataRecordAvro>> userClicksRegionKTableAvro() {
    return (userClicksStream, usersRegionKTable) -> userClicksStream
            .leftJoin(usersRegionKTable,
                    (clicks, region) -> new RegionWithClicks(region == null ? "UNKNOWN" : region.getName().toString(), clicks.getCount()))
            .map((user, regionWithClicks) -> new KeyValue<>(regionWithClicks.getRegion(), regionWithClicks.getClicks()))
            .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
            .reduce(Long::sum)
            .mapValues((key, value) -> new DataRecordAvro(key, value))
            .toStream();
}
(The DataRecordAvro class only has two members: CharSequence name; Long count;)
When the first record is received, this exception is thrown:
ClassCastException invoking Processor. Do the Processor's input types match the deserialized types? Check the Serde setup and change the default Serdes in StreamConfig or provide correct Serdes via method parameters. Make sure the Processor can accept the deserialized input of type key: java.lang.String, and value: com.xxxx.kstreams.fixtures.avro.DataRecordAvro.
Note that although incorrect Serdes are a common cause of error, the cast exception might have another cause (in user code, for example). For example, if a processor wires in a store, but casts the generics incorrectly, a class cast exception could be raised during processing, but the cause would not be wrong Serdes.
The processor where the exception is thrown seems to be:
KSTREAM-LEFTJOIN-0000000011:
states: [user-regions-avro-STATE-STORE-0000000008]
We have no idea why it doesn't work in this case. Maybe the leftJoin operation persists information to an internal topic where useNativeEncoding/Decoding=false is not taken into account? But then why does the KStream-to-KStream example above work? We thought the Avro conversion was only done at the start and end of the topology, so why this casting exception when using leftJoin?
Here is another example that works ok (without input Avro records, leaving consumer useNativeDecoding as default true):
@Bean
public BiFunction<KStream<String, Long>, KTable<String, String>, KStream<String, DataRecordAvro>> userClicksRegionKTable() {
    return (userClicksStream, usersRegionKTable) -> userClicksStream
            .leftJoin(usersRegionKTable,
                    (clicks, region) -> new RegionWithClicks(region == null ? "UNKNOWN" : region, clicks))
            .map((user, regionWithClicks) -> new KeyValue<>(regionWithClicks.getRegion(), regionWithClicks.getClicks()))
            .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
            .reduce(Long::sum)
            .mapValues((key, value) -> new DataRecordAvro(key, value))
            .toStream();
}
Please help!
For Kafka Streams binder in Spring Cloud Stream, we recommend using native decoding/encoding with Serdes unless you have strong reasoning for relying on the message conversion approach. What is the use case that forces you to go with message converters here? In practice, using message converters for serialization purposes in Kafka Streams applications in Spring Cloud Stream adds an extra layer in your topology and makes it deeper, thus the recommendation to use native decoding/encoding.
As you noted, for KTable, the binder always uses native decoding - at the moment, it is not possible to use message converters there. When you turn off useNativeDecoding on the KTable binding, the binder ignores it and simply uses the default byte serde. I suggest going with the default on the KTable binding and then adding the following bean in your application configuration.
@Bean
public Serde<DataRecordAvro> dataRecordAvroSerde() {
    // return Serde
}
This way the binder will detect this bean, see that the Serde type matches the type from the function signature, and use it for those inputs.
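For instance, if you are using Confluent's Avro serdes (an assumption here, not something stated in the question), the bean could be filled in roughly like this; the schema registry URL is a placeholder:
@Bean
public Serde<DataRecordAvro> dataRecordAvroSerde() {
    // SpecificAvroSerde comes from io.confluent.kafka.streams.serdes.avro
    SpecificAvroSerde<DataRecordAvro> serde = new SpecificAvroSerde<>();
    // second argument false = configure as a value serde; the URL is a placeholder
    serde.configure(Collections.singletonMap("schema.registry.url", "http://localhost:8081"), false);
    return serde;
}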
If you have further issues with this app, feel free to share an MCRE; we can take a further look then.

Send data to Spring Batch Item Reader (or Tasklet)

I have the following requirement:
An endpoint http://localhost:8080/myapp/jobExecution/myJobName/execute which receives a CSV and uses univocity to apply some validations and generate a List of some POJO.
Send that list to a Spring Batch Job for some processing.
Multiple users could do this.
I want to know if I can achieve this with Spring Batch.
I was thinking of using a queue: put the data in it and execute a Job that pulls objects from that queue. But how can I be sure that, if another person calls the endpoint while another Job is executing, Spring Batch knows which item belongs to which execution?
You can use a queue, or you can take the list of values that was generated by the validation step and store it in the job execution context.
Below is a snippet that stores the list in the job execution context and reads it back using an ItemReader.
This snippet implements StepExecutionListener in a Tasklet step to put the constructed List into the context:
@Override
public ExitStatus afterStep(StepExecution stepExecution) {
    // tenantNames is a List<String> which was constructed as an output of an evaluation logic
    stepExecution.getJobExecution().getExecutionContext().put("listOfTenants", tenantNames);
    return ExitStatus.COMPLETED;
}
Now "listOfTenants" are read as part of a Step which has Reader (To allow one thread read at a time), Processor and Writer. You can also store it as a part of Queue and fetch it in a Reader. Snippet for reference,
public class ReaderStep implements ItemReader<String>, StepExecutionListener {

    private List<String> tenantNames;

    @Override
    public void beforeStep(StepExecution stepExecution) {
        try {
            tenantNames = (List<String>) stepExecution.getJobExecution().getExecutionContext()
                    .get("listOfTenants");
            logger.debug("Successfully fetched the tenant list from the context");
        } catch (Exception e) {
            // Exception block
        }
    }

    @Override
    public synchronized String read() throws Exception {
        String tenantName = null;
        if (tenantNames.size() > 0) {
            tenantName = tenantNames.get(0);
            tenantNames.remove(0);
            return tenantName;
        }
        logger.info("Completed reading all tenant names");
        return null;
    }

    // Rest of the overridden methods of this class..
}
Yes. Spring Boot would execute these jobs in different threads, so Spring knows which items belong to which execution.
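For example, a minimal sketch (the launcher wiring and bean names are assumptions, not from the question) of launching one distinct JobExecution per request by passing unique JobParameters:
// Hypothetical launch method; jobLauncher and csvJob are assumed to be injected Spring beans.
public JobExecution launchForRequest(JobLauncher jobLauncher, Job csvJob) throws Exception {
    JobParameters params = new JobParametersBuilder()
            .addString("requestId", UUID.randomUUID().toString()) // unique per request
            .addLong("timestamp", System.currentTimeMillis())
            .toJobParameters();
    // each call creates its own JobExecution with its own ExecutionContext,
    // so the list stored by one request never leaks into another execution
    return jobLauncher.run(csvJob, params);
}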
Note: you can use something like a logging correlation ID. This will help you filter the logs for a particular request. https://dzone.com/articles/correlation-id-for-logging-in-microservices

Updating global store from data within transform

I currently have a simple topology:
KStream<String, Event> eventsStream = builder.stream(sourceTopic);
eventsStream.transformValues(processorSupplier, "nameCache")
.to(destinationTopic);
My events sometimes have a key/value pair and other times have just the key. I want to be able to add the value to those events that are missing the value. I have this working fine with a local state store, but when I add more tasks, sometimes the key/value events and the key-only events end up on different threads, so they aren't updated correctly.
I'd like to use a global state store for this but I'm having difficulty figuring out how to update the global store when new key/value pairs come in. I've created a global state store with the following code:
builder.addGlobalStore(stateStore, "global_store", Consumed.with(Serdes.String(), Serdes.String()), new ProcessorSupplier<String, String>() {
    @Override
    public Processor<String, String> get() {
        return new Processor<String, String>() {
            private ProcessorContext context;

            @Override
            public void init(final ProcessorContext processorContext) {
                this.context = processorContext;
            }

            @Override
            public void process(final String key, final String value) {
                context.forward(key, value);
            }

            @Override
            public void close() {
            }
        };
    }
});
As far as I can tell, it is working but since there is no data in the topic, I'm not sure.
So my question is: how do I update the global store from inside transformValues? store.put() fails with an error saying the global store is read-only.
I found "Write to GlobalStateStore on Kafka Streams", but the accepted answer just says to update the underlying topic, and I don't see how I can do that since the topic isn't in my stream.
---Edited---
I updated the code per #1 in the accepted answer. I see the new key/value pairs show up in global_store. But the globalStore doesn't seem to see the new keys. If I restart the application, it fills the cache with the data in the topic but new keys aren't visible until after I stop/start the application.
I added logging to the process(String, String) in the global store processor and it shows new keys being processed. Any ideas?
You can only get read-only access to a global state store inside transformValues. If you want to update a global state store, yes, you have to send the update to the underlying input topic of the global state store, and your store will update the value when this update message is consumed. The reason is that global state stores are populated on all application instances and use this input topic for fault tolerance. You can do this by branching your topology:
KStream<String, Event> eventsStream = builder.stream(sourceTopic);

// process messages as normal
eventsStream.transformValues(processorSupplier, "nameCache")
        .to(destinationTopic);

// this transform sends the update messages to the global store's input topic
eventsStream.transform(updateGlobalStateProcessorSupplier, "nameCache")
        .to("global_store");
Alternatively, use the low-level Processor API to construct your Topology manually; then you can forward to both the destinationTopic and the global_store topic from a single processor, using ProcessorContext.forward with the name of the corresponding sink processor node.
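A rough sketch of that low-level alternative might look like the following (the node names, the eventSerde and the extractName helper are made up for illustration; adapt them to your actual types):
Topology topology = new Topology();
topology.addSource("events-source", Serdes.String().deserializer(), eventSerde.deserializer(), sourceTopic);
topology.addProcessor("enrich-processor", () -> new Processor<String, Event>() {
    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public void process(String key, Event value) {
        // forward the (possibly enriched) event to the destination sink ...
        context.forward(key, value, To.child("destination-sink"));
        // ... and forward the key/value update to the global store's input topic
        context.forward(key, extractName(value), To.child("global-store-sink"));
    }

    @Override
    public void close() {
    }
}, "events-source");
topology.addSink("destination-sink", destinationTopic, Serdes.String().serializer(), eventSerde.serializer(), "enrich-processor");
topology.addSink("global-store-sink", "global_store", Serdes.String().serializer(), Serdes.String().serializer(), "enrich-processor");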

Springboot cloud Stream with Kafka

I'm trying to set up a project with Spring Boot and Spring Cloud Stream with Kafka. I managed to build a simple example where a listener gets messages from a topic and, after processing them, sends the output to another topic.
My listener and channels are configured like this:
@Component
public class FileEventListener {

    private FileEventProcessorService fileEventProcessorService;

    @Autowired
    public FileEventListener(FileEventProcessorService fileEventProcessorService) {
        this.fileEventProcessorService = fileEventProcessorService;
    }

    @StreamListener(target = FileEventStreams.INPUT)
    public void handleLine(@Payload(required = false) String jsonData) {
        this.fileEventProcessorService.process(jsonData);
    }
}

public interface FileEventStreams {

    String INPUT = "file_events";
    String OUTPUT = "raw_lines";

    @Input(INPUT)
    SubscribableChannel inboundFileEventChannel();

    @Output(OUTPUT)
    MessageChannel outboundRawLinesChannel();
}
The problem with this example is that when the service starts, it doesn't check for messages that already exist in the topic; it only processes messages that are sent after it starts. I'm very new to Spring Cloud Stream and Kafka, but from what I've read, this behavior may correspond to the fact that I'm using a SubscribableChannel. I tried to use a QueueChannel, for example, to see how it works, but I got the following exception:
Error creating bean with name ... nested exception is java.lang.IllegalStateException: No factory found for binding target type: org.springframework.integration.channel.QueueChannel among registered factories: channelFactory,messageSourceFactory
So, my questions are:
If I want to process all messages that exist in the topic once the application starts (and also have messages processed by only one consumer), am I on the right path?
Even if QueueChannel is not the right choice for achieving the behavior explained in 1), what do I have to add to my project to be able to use this type of channel?
Thanks!
Add spring.cloud.stream.bindings.file_events.group=foo
Anonymous groups consume from the end of the topic only; bindings with a group consume from the beginning, by default.
You cannot use a PollableChannel for a binding, it must be a SubscribableChannel.

Spring AMQP and Elasticsearch - aggregate messages

We have a consumer on some RabbitMQ queues which reads messages and then indexes this data in Elasticsearch. The implementation is done using spring-amqp. In order to increase our performance, we plan to aggregate the messages at the consumer level and do a bulk insert in Elasticsearch (this would really increase performance).
Do you have any suggestions on how to implement this? Also, another sensitive problem is how to handle responses. Each message has a "reply_to" header and we use an inbound gateway with a reply-channel, so for each message a response is supposed to be delivered.
I am thinking of using the aggregator from Spring Integration, with a release strategy based on a batch size and a period of time after which the MessageGroupStore will expire (and, of course, the reaper). The inbound gateway has a task executor of, let's say, 20 threads and a prefetch count of 20 as well. Whenever a request comes in, the message will be added to the group store, and when the canRelease() condition is met, the reaper or one of the threads that came with a request will do the bulk operation. But what do I do with the other threads, which will have to wait for a response that will never come? Also, I don't know how to break up the response for the big, aggregated message so that each small request gets a response.
Another problem: how do I ack the messages? From what I've read, transactions will decrease performance on the RabbitMQ side, so I'm not that keen on using the "tx-size" attribute. Also, this attribute might produce a wrong count if the timeout is too small.
The answer to the question regarding the consumer and aggregator:
Here is the config to consume messages from AMQP and aggregate them. The aggregation strategy is based on the transaction commit:
<amqp:inbound-channel-adapter queue-names="myQueue"
transaction-manager="transactionManager"
channel-transacted="true"
channel="aggregateChannel"
advice-chain="aggregatorReaperAdvice"
concurrent-consumers="4"
tx-size="100"/>
<aggregator input-channel="aggregateChannel" output-channel="storeChannel"
expire-groups-upon-completion="true"
correlation-strategy-expression="T(Thread).currentThread().id"
release-strategy-expression="^[payload.equals(#AGGREGATOR_RELEASE_MARK)] != null"
expression="?[!payload.equals(#AGGREGATOR_RELEASE_MARK)].![payload]"/>
The ReaperAdvice (Groovy code):
@Service
class AggregatorReaperAdvice implements MethodBeforeAdvice, InitializingBean {

    private static final TRANSACTION_RESOURCE_MARK = 'TRANSACTION_RESOURCE_MARK'

    public static final AGGREGATOR_RELEASE_MARK = 'AGGREGATOR_RELEASE_MARK'

    MessagingTemplate messagingTemplate

    @Autowired
    MessageChannel aggregateChannel

    @Override
    void afterPropertiesSet() throws Exception {
        Assert.notNull aggregateChannel, "aggregateChannel must not be null"
        messagingTemplate = new MessagingTemplate(aggregateChannel)
    }

    @Override
    void before(Method method, Object[] args, Object target) {
        if (!TransactionSynchronizationManager.hasResource(AggregatorReaperAdvice)) {
            TransactionSynchronizationManager.bindResource(AggregatorReaperAdvice, TRANSACTION_RESOURCE_MARK)
            TransactionSynchronizationManager.registerSynchronization(new TransactionSynchronizationAdapter() {

                @Override
                void beforeCommit(boolean readOnly) {
                    messagingTemplate.send(MessageBuilder.withPayload(AGGREGATOR_RELEASE_MARK).build())
                }

                @Override
                void afterCompletion(int status) {
                    TransactionSynchronizationManager.unbindResource(AggregatorReaperAdvice)
                }
            })
        }
    }
}
Let me know if it isn't clear.
All other questions will be addressed soon.
For manual ack you can use channel.basicAck(deliveryTag, true); this acks the last deliveryTag and all previous messages.
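A very simplified sketch of that manual bulk ack (assuming the listener container is set to MANUAL acknowledge mode; the class, batch handling and the Elasticsearch call are placeholders, and with concurrent consumers each consumer has its own channel, so batch per consumer):
// ChannelAwareMessageListener is from org.springframework.amqp.rabbit.listener.api; Channel is com.rabbitmq.client.Channel
public class BulkIndexingListener implements ChannelAwareMessageListener {

    private static final int BATCH_SIZE = 20; // matches the prefetch count from the question

    private final List<Message> batch = new ArrayList<>();

    @Override
    public void onMessage(Message message, Channel channel) throws Exception {
        batch.add(message);
        if (batch.size() >= BATCH_SIZE) {
            // ... do the Elasticsearch bulk insert for `batch` here ...
            long lastTag = message.getMessageProperties().getDeliveryTag();
            channel.basicAck(lastTag, true); // multiple = true: acks this and all previous unacked deliveries
            batch.clear();
        }
    }
}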
For the headers["reply_to"] case... I think you should provide custom AbstractAggregatingMessageGroupProcessor for the aggregator and kill two birds: the cumulative result of aggregator and iteration over MessageGroup.getMessages() to send each of them for the reply process to the provided MessageChannel. It is a quick solution for your case.
A similar but more loosely coupled solution may be based on the result from the aggregator and its MessageGroupStore, where you extract the correlationKey to retrieve the group and its messages and do the desired reply logic. In this case you shouldn't let the aggregator remove the group from the store, but remove it manually after that group retrieval.
