Updating global store from data within transform - apache-kafka-streams

I currently have a simple topology:
KStream<String, Event> eventsStream = builder.stream(sourceTopic);
eventsStream.transformValues(processorSupplier, "nameCache")
.to(destinationTopic);
My events sometimes have a key/value pair and other times have just the key. I want to be able to add the value to those events that are missing it. I have this working fine with a local state store, but when I add more tasks, the key/value events and the key-only events sometimes end up on different threads, so the key-only events aren't updated correctly.
I'd like to use a global state store for this but I'm having difficulty figuring out how to update the global store when new key/value pairs come in. I've created a global state store with the following code:
builder.addGlobalStore(stateStore, "global_store", Consumed.with(Serdes.String(), Serdes.String()), new ProcessorSupplier<String, String>() {
    @Override
    public Processor<String, String> get() {
        return new Processor<String, String>() {
            private ProcessorContext context;

            @Override
            public void init(final ProcessorContext processorContext) {
                this.context = processorContext;
            }

            @Override
            public void process(final String key, final String value) {
                context.forward(key, value);
            }

            @Override
            public void close() {
            }
        };
    }
});
As far as I can tell, it is working but since there is no data in the topic, I'm not sure.
So my question is: how do I update the global store from inside transformValues? store.put() fails with an error saying the global store is read-only.
I found Write to GlobalStateStore on Kafka Streams, but the accepted answer just says to update the underlying topic, and I don't see how I can do that since that topic isn't part of my stream.
---Edited---
I updated the code per #1 in the accepted answer. I see the new key/value pairs show up in the global_store topic, but the global store doesn't seem to see the new keys. If I restart the application, it fills the cache with the data in the topic, but new keys aren't visible until after I stop/start the application.
I added logging to process(String, String) in the global store processor and it shows new keys being processed. Any ideas?

Inside transformValues you only get read-only access to a global state store. If you want to update a global state store then yes, you have to send the update to the underlying input topic of the global state store, and the store will update its value when this update message is consumed. The reason behind this is that global state stores are populated on all application instances and use this input topic for fault tolerance. You can do this by branching your topology:
KStream<String, Event> eventsStream = builder.stream(sourceTopic);

// process messages as normal
eventsStream.transformValues(processorSupplier, "nameCache")
        .to(destinationTopic);

// this transform produces the update messages for the global store's input topic
eventsStream.transform(updateGlobalStateProcessorSupplier, "nameCache")
        .to("global_store");
Alternatively, use the low-level Processor API to construct your Topology manually. Then you can forward each record to both the destinationTopic sink and the global_store sink from a single processor, using ProcessorContext.forward with the name of the sink processor node.
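For reference, here is a minimal sketch of what that low-level wiring could look like (the node names and EnrichProcessor are illustrative placeholders, not from your code):
Topology topology = new Topology();
topology.addSource("events-source", sourceTopic);
// hypothetical processor that enriches values and forwards to both sinks
topology.addProcessor("enrich-processor", EnrichProcessor::new, "events-source");
topology.addSink("destination-sink", destinationTopic, "enrich-processor");
topology.addSink("global-store-sink", "global_store", "enrich-processor");
// inside EnrichProcessor.process(key, value):
//   context.forward(key, enrichedValue, To.child("destination-sink"));
//   context.forward(key, value, To.child("global-store-sink"));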

Related

Send data to Spring Batch Item Reader (or Tasklet)

I have the following requirement:
An endpoint http://localhost:8080/myapp/jobExecution/myJobName/execute which receives a CSV and uses univocity to apply some validations and generate a List of some POJO.
Send that list to a Spring Batch Job for some processing.
Multiple users could do this.
I want to know if I can achieve this with Spring Batch.
I was thinking of using a queue: put the data on it and execute a job that pulls objects from that queue. But how can I be sure that, if another person hits the endpoint while another job is executing, Spring Batch knows which item belongs to which execution?
You can use a queue, or you can take the list of values generated by the validation step and store it in the job execution context.
Below is a snippet that stores the list in the job context and reads it back using an ItemReader.
The first snippet implements StepExecutionListener in a Tasklet step to put the list that was constructed:
@Override
public ExitStatus afterStep(StepExecution stepExecution) {
    // tenantNames is a List<String> which was constructed as an output of an evaluation logic
    stepExecution.getJobExecution().getExecutionContext().put("listOfTenants", tenantNames);
    return ExitStatus.COMPLETED;
}
Now "listOfTenants" are read as part of a Step which has Reader (To allow one thread read at a time), Processor and Writer. You can also store it as a part of Queue and fetch it in a Reader. Snippet for reference,
public class ReaderStep implements ItemReader<String>, StepExecutionListener {

    private List<String> tenantNames;

    @Override
    public void beforeStep(StepExecution stepExecution) {
        try {
            tenantNames = (List<String>) stepExecution.getJobExecution().getExecutionContext()
                    .get("listOfTenants");
            logger.debug("Successfully fetched the tenant list from the context");
        } catch (Exception e) {
            // Exception block
        }
    }

    @Override
    public synchronized String read() throws Exception {
        String tenantName = null;
        if (tenantNames.size() > 0) {
            tenantName = tenantNames.get(0);
            tenantNames.remove(0);
            return tenantName;
        }
        logger.info("Completed reading all tenant names");
        return null;
    }

    // Rest of the overridden methods of this class..
}
Yes. Spring Batch would run these jobs as separate job executions (each potentially on its own thread), so Spring knows which items belong to which execution.
Note: you can also use something like a logging correlation id. This will help you filter the logs for a particular request. https://dzone.com/articles/correlation-id-for-logging-in-microservices
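As a hedged sketch (jobLauncher, csvImportJob and the parameter names are assumed, not taken from the question), launching the job per request with unique JobParameters gives every call to the endpoint its own JobExecution and ExecutionContext:
JobParameters params = new JobParametersBuilder()
        .addString("requestId", UUID.randomUUID().toString())
        .addLong("startedAt", System.currentTimeMillis())
        .toJobParameters();
// each distinct parameter set produces a separate JobExecution with its own context
// (run() declares several checked exceptions; handle or declare them as needed)
JobExecution execution = jobLauncher.run(csvImportJob, params);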

Kafka Streams Punctuator check time sent of an entry in StateStore

I've implemented my own Transformer, where I accumulate events in my StateStore (backed by the internal changelog Kafka topic), and once it accumulates X events, I emit one accumulated event down the stream.
Everything is perfect, but I don't want accumulated events to be stuck in the pipe if no new events are posted for a while.
That's why I decided to use Punctuator:
@Override
public void init(ProcessorContext context) {
    this.context = context;
    this.kvStore = (KeyValueStore<Patient, Long>) this.context.getStateStore(stateStore);
    this.context.schedule(Duration.ofMinutes(180), PunctuationType.WALL_CLOCK_TIME, (timestamp) -> {
        KeyValueIterator<Patient, Long> iter = kvStore.all();
        while (iter.hasNext()) {
            KeyValue<Patient, Long> entry = iter.next();
            ...
        }
        iter.close();
        context.commit();
    });
}
The problem is that the entries in the StateStore do not contain time sent information. Is there a way to access that? Do I need to implement my own custom StateStore in order to make it work?
Thank you!
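One common workaround (a hedged sketch, not an answer from this thread) is to store the record timestamp next to the accumulated value, so the punctuator can age entries out; TimestampedValue and its serde are hypothetical names:
// hypothetical wrapper stored as the value in the state store
public class TimestampedValue<V> {
    public final V value;
    public final long timestampMs; // e.g. context.timestamp() at put() time

    public TimestampedValue(V value, long timestampMs) {
        this.value = value;
        this.timestampMs = timestampMs;
    }
}
// in transform():    kvStore.put(key, new TimestampedValue<>(accumulated, context.timestamp()));
// in the punctuator: if (timestamp - entry.value.timestampMs > maxAgeMs) { forward and remove the entry }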

KTable returns no data in Spring Boot application, however it can be queried

I have a Spring Boot application working with Kafka Streams. I have a KTable with some financial currency quotes which is created like this:
@Bean(name = "indicativeQuotes")
public KTable<String, Quote> quoteKTable(StreamsBuilder streamsBuilder) {
    return streamsBuilder.table(quoteTopicName,
            Materialized.<String, Quote, KeyValueStore<Bytes, byte[]>>as("quoteTable")
                    .withKeySerde(Serdes.String())
                    .withValueSerde(new JsonSerde<>(Quote.class)));
}
I @Autowire this bean in another component and test it with the following code:
@Autowired
private KTable<String, Quote> indicativeQuotes;

@PostConstruct
private void postConstruct() {
    doPrint();
}

public void doPrint() {
    ReadOnlyKeyValueStore<String, Quote> store = streamsBuilderFactoryBean.getKafkaStreams().store("quoteTable", QueryableStoreTypes.keyValueStore());
    store.all().forEachRemaining(keyValue -> log.info("Key: " + keyValue.key + " Value: " + keyValue.value));
    indicativeQuotes.foreach((k, v) -> log.info(k));
}
The code logs correct values when querying through the store, but it outputs nothing in foreach(), as if the table were empty. I have also tried print() and other options - all output nothing, without any exceptions.
I'm starting to think that I can't inject KTable beans like that, but the Spring documentation on Kafka Streams is pretty scarce and I can't find good examples. Any help will be appreciated.
Update.
My use case is that I have a scheduled Quartz job which is supposed to write the current state of the KTable to a Kafka topic when triggered, like the following:
@Override
protected void executeInternal(JobExecutionContext jobExecutionContext) throws JobExecutionException {
    TriggerKey triggerKey = jobExecutionContext.getTrigger().getKey();
    log.info("Job was triggered by: {}", triggerKey.getName());
    indicativeQuotes.filter((key, value) -> key.equals(triggerKey.getName()))
            .mapValues(quoteToCourseFixedMapper)
            .toStream()
            .peek((instrument, course) -> log.info("Sending courses for instrument: {}, {}", instrument, course))
            .to(quoteEventTopicName);
}
But I think this code does not work because it is not part of the processing topology and I cannot just take data from the KTable on demand. I'm a bit puzzled here; of course I can query the data through the store when the event is triggered, but maybe there is a better pattern for such a use case? Basically I'm interested in whether it's possible to incorporate these triggered job events as part of the processing pipeline.
If you just want to publish the updates to another topic, turn the KTable into a KStream and use the to() function.
KTable ktable = ...;
KStream kstream = ktable.toStream();
kstream.to("topic", Produced.with(keySerde, valueSerde));
The topic will contain the change log of that table.
BUT
Apparently, because of some lifecycle-related concerns, you can't just inject (@Autowire) a KStream/KTable. You should keep your Kafka Streams related code more or less inline, as part of the topology definition.
So in your specific case, where you want to do something with the current state of the table at some "random" time, you have to query the store (table). Search for Kafka Streams interactive queries. Remember that you need to fetch data from all instances of your application (if you have more than one), or you can use a global store. It's a day or two of reading.
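As a hedged sketch of the interactive-query route inside your Quartz job (kafkaTemplate is assumed to exist, and quoteToCourseFixedMapper is assumed to be a plain ValueMapper as in your snippet):
ReadOnlyKeyValueStore<String, Quote> store = streamsBuilderFactoryBean.getKafkaStreams()
        .store("quoteTable", QueryableStoreTypes.keyValueStore());
Quote quote = store.get(triggerKey.getName());
if (quote != null) {
    // publish with a plain producer instead of building a throwaway topology
    kafkaTemplate.send(quoteEventTopicName, triggerKey.getName(), quoteToCourseFixedMapper.apply(quote));
}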

Does Google Guava Cache do deduplication when refreshing value of the same key

I implemented a non-blocking cache using Google Guava. There's only one key in the cache, and the value for that key is only refreshed asynchronously (by overriding reload()).
My question is: does the Guava cache handle de-duplication if the first reload() task hasn't finished and a new get() request comes in?
// Cache is defined like below
this.cache = CacheBuilder
        .newBuilder()
        .maximumSize(1)
        .refreshAfterWrite(10, TimeUnit.MINUTES)
        .recordStats()
        .build(loader);

// reload() is overridden to refresh asynchronously
@Override
public ListenableFuture<Map<String, CertificateInfo>> reload(final String key, Map<String, CertificateInfo> prevMap) throws IOException {
    LOGGER.info("Refreshing certificate cache.");
    ListenableFutureTask<Map<String, CertificateInfo>> task = ListenableFutureTask.create(new Callable<Map<String, CertificateInfo>>() {
        @Override
        public Map<String, CertificateInfo> call() throws Exception {
            return actuallyLoad();
        }
    });
    executor.execute(task);
    return task;
}
Yes, see the documentation for LoadingCache.get(K) (and its sibling, Cache.get(K, Callable)):
If another call to get(K) or getUnchecked(K) is currently loading the value for key, simply waits for that thread to finish and returns its loaded value.
So if a cache entry is currently being loaded for the first time, other threads that try to retrieve that entry will simply wait for that load to finish. During an asynchronous refresh (as with refreshAfterWrite and your overridden reload()), other threads keep getting the old value until the refresh completes. In neither case do they kick off their own redundant load or refresh.
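A small hedged illustration of that behaviour with your cache (the key and the executor are just for the example, not from your code):
ExecutorService pool = Executors.newFixedThreadPool(2);
Callable<Map<String, CertificateInfo>> lookup = () -> cache.get("certificates");
// both tasks ask for the same key; actuallyLoad() runs only once, and the second get()
// either waits for the initial load or, during a refresh, returns the previous value
Future<Map<String, CertificateInfo>> first = pool.submit(lookup);
Future<Map<String, CertificateInfo>> second = pool.submit(lookup);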

How to provide values to storm for calculation

I have a hard time understanding how to provide values to Storm since I am a newbie to Storm.
I started with the starter kit. I went through TestWordSpout, and in it the following code provides new values:
public void nextTuple() {
    Utils.sleep(100);
    final String[] words = new String[] {"nathan", "mike", "jackson", "golda", "bertels"};
    final Random rand = new Random();
    final String word = words[rand.nextInt(words.length)];
    _collector.emit(new Values(word));
}
So I see it's emitting one word at a time with _collector.emit(new Values(word)).
How can I provide a collection of words directly? Is this possible?
TestWordSpout.java
What I mean is: when nextTuple is called, a new word is selected at random from the list and emitted. The random output may look like this over a certain time interval:
#100ms: nathan
#200ms: golda
#300ms: golda
#400ms: jackson
#500ms: mike
#600ms: nathan
#700ms: bertels
What if I already have this list as a collection and just want to feed it to Storm?
Storm is designed and built to process continuous streams of data. Please see the Rationale for Storm. It's very unlikely that input data is fed into the Storm cluster directly. Generally, the input data to Storm comes from JMS queues, Apache Kafka, Twitter feeds, etc. I would think you want to pass in a few configuration values; in that case, the following would apply.
Considering Storm's design purpose, only limited configuration details are normally passed to Storm, such as RDBMS connection details (Oracle/DB2/MySQL etc.), JMS provider details (IBM MQ/RabbitMQ etc.) or Apache Kafka/HBase details.
For your particular question, or for providing the configuration details for the above products, there are three ways that I can think of:
1. Set the configuration details on the instance of the Spout or Bolt.
For example, declare instance variables and assign the values in the Spout/Bolt constructor, as below:
public class TestWordSpout extends BaseRichSpout {

    List<String> listOfValues;

    public TestWordSpout(List<String> listOfValues) {
        this.listOfValues = listOfValues;
    }
}
In the topology submission class, create an instance of the Spout with the list of values:
List<String> listOfValues=new ArrayList<String>();
listOfValues.add("nathan");
listOfValues.add("golda");
listOfValues.add("mike");
builder.setSpout("word", new TestWordSpout(listOfValues), 3);
These values are available as instance variables in the nextTuple() method
Please look at the Storm integrations in Storm contrib to see how configurations are set for RDBMS/Kafka etc. as above.
2. Set the configurations in getComponentConfiguration(). This method is normally used to override topology configurations, but you can pass in a few details as below:
@Override
public Map<String, Object> getComponentConfiguration() {
    Map<String, Object> ret = new HashMap<String, Object>();
    if (!_isDistributed) {
        ret.put(Config.TOPOLOGY_MAX_TASK_PARALLELISM, 1);
        return ret;
    } else {
        List<String> listOfValues = new ArrayList<String>();
        listOfValues.add("nathan");
        listOfValues.add("golda");
        listOfValues.add("mike");
        ret.put("listOfValues", listOfValues);
    }
    return ret;
}
and the configuration details are available in the open() or prepare() method of Spout/Bolt respectively.
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
    _collector = collector;
    this.listOfValues = (List<String>) conf.get("listOfValues");
}
3. Declare the configurations in a property file and package it inside the jar that will be submitted to the Storm cluster. The Nimbus node copies the jar file to the worker nodes and makes it available to the executor threads. The open()/prepare() method can then read the property file and assign the values to instance variables.
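A hedged sketch of option 3 (the file name and property key are illustrative, not from the starter kit):
@Override
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
    _collector = collector;
    Properties props = new Properties();
    // the file is packaged inside the topology jar on the classpath
    try (InputStream in = getClass().getResourceAsStream("/spout-config.properties")) {
        props.load(in);
    } catch (IOException e) {
        throw new RuntimeException("Could not load spout-config.properties", e);
    }
    // e.g. words=nathan,golda,mike
    this.listOfValues = Arrays.asList(props.getProperty("words").split(","));
}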
"Values" type accept any kind of objects and any number.
So you can simply send a List for instance from the execute method of a Bolt or from the nextTuple method of a Spout:
List<String> words = new ArrayList<>();
words.add("one word");
words.add("another word");
_collector.emit(new Values(words));
You can add a new Field too; just be sure to declare it in the declareOutputFields method.
_collector.emit(new Values(words, "a new field value!"));
And in your declareOutputFields method
@Override
public void declareOutputFields(final OutputFieldsDeclarer outputFieldsDeclarer) {
    outputFieldsDeclarer.declare(new Fields("collection", "newField"));
}
You can get the fields in the next Bolt in the topology from the tuple object given by the execute method:
List<String> collection = (List<String>) tuple.getValueByField("collection");
String newFieldValue = tuple.getStringByField("newField");
