We have the following problem to solve with Kafka Streams:
1. Get a message. Each message is stamped with an eventId (the event the message updates) and a correlationId (unique for each message).
2. Aggregate some state from that message (based on eventId) and append it to the already existing state in a local store.
3. Enrich that message with the full aggregated state for that event and send it through to the output topic.
The point is that we can't lose a single message, and every message must be enriched with the latest aggregated state (which we actually evaluate during message processing).
From what I've seen so far, we can't just use a simple aggregation (something like this):
stateMessageStream
  .map((k, v) => new KeyValue[String, StateMessage](k, v))
  .mapValues[StateMessageWithMarkets](sm => StateMessageWithMarkets(Some(sm), extract(sm)))
  .groupBy((k, _) => k, stringSerde, marketAggregatorSerde)
  .aggregate[StateMessageWithMarkets](() => StateMessageWithMarkets(), (_, v, aggregatedState) => aggregatedState.updateModelMarketsWith(v), marketAggregatorSerde, kafkaStoreName)
  .to(stringSerde, marketAggregatorSerde, kafkaOutTopic)
because the aggregation emits new records only at intervals, which would mean that for two incoming messages we might generate only a single aggregated output message (so we would lose one message).
My second attempt was basically two streams: one for the aggregation and one for the plain messages. At the end we join the two streams back together using a join operation with correlationId as the key, so that we can match the right state with the right message:
val aggregatedStream: KStream[String, MarketAggregator] = stateMessageStream
  .map((k, v) => new KeyValue[String, StateMessage](k, v))
  .mapValues[StateMessage](v => {
    log.debug("Received State Message, gameId: " + v.metadata().gtpId() + ", correlationId: " + v.correlationId)
    v
  })
  .mapValues[MarketAggregator](sm => MarketAggregator(sm.correlationId, extract(sm)))
  .groupBy((k, v) => k, stringSerde, marketAggregatorSerde)
  .aggregate[MarketAggregator](() => MarketAggregator(), (_, v, aggregatedState) => aggregatedState.updateModelMarketsWith(v), marketAggregatorSerde, kafkaStoreName)
  .toStream((k, v) => v.correlationId)

stateMessageStream
  .selectKey[String]((k, v) => v.correlationId)
  .leftJoin[MarketAggregator, StateMessageWithMarkets](
    aggregatedStream,
    (stateMessage: StateMessage, aggregatedState: MarketAggregator) =>
      StateMessageWithMarkets(Some(stateMessage), aggregatedState.modelMarkets, stateMessage.correlationId),
    JoinWindows.of(10000),
    stringSerde, stateMessageSerde, marketAggregatorSerde)
  .mapValues[StateMessageWithMarkets](v => {
    log.debug("Producing aggregated State Message, gameId: " + v.stateMessage.map(_.metadata().gtpId()).getOrElse("unknown") +
      ", correlationId: " + v.stateMessage.map(_.correlationId).getOrElse("unknown"))
    v
  })
  .to(stringSerde, stateMessageWithMarketsSerde, kafkaOutTopic)
However, that does not seem to work either: for two incoming messages I still get just a single message with the latest aggregated state on the output topic.
Can someone please explain why, and what the right solution would be?
You can use approach one and get an output message for each input message by disabling the cache. In your StreamsConfig you just set the value of StreamsConfig#CACHE_MAX_BYTES_BUFFERING_CONFIG to zero.
For more details, see http://docs.confluent.io/current/streams/developer-guide.html#memory-management
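For example, a minimal sketch in Java (the property name is the same regardless of which language you build the topology in):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// ...your existing application.id, bootstrap.servers, default serdes, etc.
// A cache size of zero disables record caching, so the aggregation forwards
// a new output record downstream for every single input record.
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);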
Related
I have a subscriber which collects messages until it reaches a specified limit and then passes the collected messages to a processor to perform some operations. The code works fine; the problem is that the subscriber waits until it has collected the specified number of messages. If we have fewer messages, program control never passes to the processor.
For example, let's say my chunk size is 100. If I have 100 messages, or a multiple of 100, then the program works fine. But if I have fewer than 100 messages, or say 150, some of the messages are read by the subscriber but never passed to the processor. Is there a way I can figure out whether the queue is empty using RabbitTemplate, so that I can check that condition and break the loop?
@RabbitListener(id = "messageListener", queues = "#{rabbitMqConfig.getSubscriberQueueName()}", containerFactory = "queueListenerContainer")
public void receiveMessage(Message message, Channel channel, @Header("id") String messageId,
        @Header("amqp_deliveryTag") Long deliveryTag) {
    LOGGER.info(" Message:" + message.toString());
    if (messageList.size() < appConfig.getSubscriberChunkSize()) {
        messageList.add(message);
        deliveryTagList.add(deliveryTag);
        if (messageList.size() == appConfig.getSubscriberChunkSize()) {
            LOGGER.info("------------- Calling Message processor --------------");
            Message[] messageArry = new Message[messageList.size()];
            messageArry = messageList.toArray(messageArry);
            LOGGER.info("message Array Length: " + messageArry.length);
            messageProcessor.process(messageArry);
            messageList = new ArrayList<Message>(Arrays.asList(messageArry));
            LOGGER.info("message Array to List conversion Size: " + messageList.size());
            LOGGER.info("-------------- Completed Message processor -----------");
            eppQ2Publisher.sendMessages(messageList, channel, deliveryTagList);
            messageList.clear();
            deliveryTagList.clear();
        }
    } else {
        // do nothing..
    }
}
There are two ways to achieve this.
Add an @EventListener to listen for ListenerContainerIdleEvents, which are published when no messages have been received for some time; set the container's idleEventInterval property. The source of the event is the listener container; it contains the @RabbitListener's id. See Detecting Idle Consumers.
Use RabbitAdmin.getQueueProperties().
You can use RabbitAdmin.getQueueInfo("queue name").getMessageCount(), which will be 0 for an empty queue.
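For the second approach, a minimal sketch (assuming Spring AMQP 2.x, where getQueueInfo is available; rabbitAdmin stands for an injected RabbitAdmin bean):

import org.springframework.amqp.core.QueueInformation;
import org.springframework.amqp.rabbit.core.RabbitAdmin;

// Returns true when the queue has no ready messages left.
private boolean queueIsEmpty(RabbitAdmin rabbitAdmin, String queueName) {
    QueueInformation info = rabbitAdmin.getQueueInfo(queueName);
    // getQueueInfo returns null if the queue does not exist
    return info == null || info.getMessageCount() == 0;
}

Inside receiveMessage() you could call this after adding each message; if it returns true while messageList is not empty, flush the partial chunk to messageProcessor the same way you do when the chunk size is reached.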
I want to ensure that a given API call is throttled so that, for a given time interval, only a single request is fired, and that the other, throttled calls wait for and receive the result of the request that was actually fired.
Example
const generateReport = (args) => client.get(...)

const generateReportA = (argsForA) =>
  generateReport(argsForA).then(saveReportSomewhere)

const generateReportB = (argsForB) =>
  generateReport(argsForB).then(saveReportSomewhere)

const generateReportC = (argsForC) =>
  generateReport(argsForC).then(saveReportSomewhere)
If we then run the statements below
generateReportA(...).then(console.log) // should log result of C
generateReportB(...).then(console.log) // should log result of C
generateReportC(...).then(console.log) // should log result
right after each other, I only want to fire the request associated with generateReportC, and I'd like both generateReportA and generateReportB to receive and handle the result of generateReportC.
In the end, generateReport should have been called once and saveReportSomewhere should have been called three times, each time with the result from generateReportC.
Is this possible?
This will fire the C request, get the result and save it, then trigger A and B at the same time with the result from C, and save each of their results immediately as they are emitted (assuming generateReportC is an Observable and generateReportA/B return Observables):
generateReportC.pipe(
  tap(cResult => saveReportSomewhere(cResult)),
  mergeMap(cResult => merge(generateReportA(cResult), generateReportB(cResult))),
  tap(result => saveReportSomewhere(result))
).subscribe();
I've been struggling to come up with an Rx strategy for a particular situation. I'm hoping someone could point me in the right direction.
Basically, I have a socket feed that I would like to skip based on a boolean value. While the stream is skipping the socket, I need to keep a running buffer of the latest value sent from the socket.
Once I am no longer skipping the socket events, I want to push the last value that was emitted while skipping down the stream, but only under another (boolean) condition, and then re-start listening to socket events.
So basically:
Listen to the socket feed
takeWhile(bool)
When listening starts again, apply the last value emitted while skipping the socket, if reapply == true
Didn't get far, but this is what I have:
Rx.Observable.interval(1000)
  .skipWhile(() => isSkipping)
  .bufferWhileSkipping??
  .applySkippedValuesAfterSkipping(ifisReapply)??
  .subscribe(val => console.log(val));
Perhaps skipWhile is not the right approach, but it was the only one that kind of made sense…
You could do it similarly to the following (I'm assuming your isSkipping can be an Observable):
const isSkipping = new BehaviorSubject(false);
Observable.interval(100)
.take(20)
.window(isSkipping)
.withLatestFrom(isSkipping)
.switchMap(([observable, skipping]) => skipping
? observable.takeLast(1).map(val => 'last:' + val)
: observable)
.subscribe(console.log);
setTimeout(() => isSkipping.next(true), 500);
setTimeout(() => isSkipping.next(false), 1050);
setTimeout(() => isSkipping.next(true), 1500);
setTimeout(() => isSkipping.next(false), 1850);
Every time isSkipping emits a value, the window operator creates a new inner Observable that either just re-emits everything or, when skipping is set to true, is chained with the .takeLast(1) operator.
The example above prints the following output to the console:
0
1
2
3
last:9
10
11
12
13
last:16
17
18
19
I create Kafka streams with the following code:
val streams = (1 to 5) map { i =>
  KafkaUtils.createStream[....](
    streamingContext,
    Map( .... ),
    Map(topic -> numOfPartitions),
    StorageLevel.MEMORY_AND_DISK_SER
  ).filter(...)
   .mapPartitions(...)
   .reduceByKey(....)
}

val unifiedStream = streamingContext.union(streams)
unifiedStream.foreachRDD(...)
streamingContext.start()
I give each stream a different group id. When I run the application, only part of the Kafka messages are received and the executor is pending at the foreachRDD call. If I only create one stream, everything works well. There aren't any exceptions in the logging info.
I don't know why the application is stuck there. Does it mean there are not enough resources?
You may want to try setting the parameter
new SparkConf().set("spark.streaming.concurrentJobs", "5")
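In context, that looks like the following (a sketch using the Java API; the app name and batch interval are illustrative, and the property works the same from Scala):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// "spark.streaming.concurrentJobs" defaults to 1; raising it lets the
// streaming scheduler run jobs from several of the union'd streams concurrently.
SparkConf conf = new SparkConf()
        .setAppName("kafka-union-app") // illustrative name
        .set("spark.streaming.concurrentJobs", "5");
JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));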
I've written a continuous JMS message receiver:
Here, I'm using CLIENT_ACKNOWLEDGE because I don't want this thread to acknowledge the messages.
(...)
connection.start();
// transacted must be false so that CLIENT_ACKNOWLEDGE takes effect
// (a transacted session ignores the acknowledge mode)
session = connection.createQueueSession(false, Session.CLIENT_ACKNOWLEDGE);
queue = session.createQueue(QueueId);
receiver = session.createReceiver(queue);
while (true) {
    message = receiver.receive(1000);
    if (message != null) {
        // NB: I can only pass Strings to the other thread
        sendMessageToOtherThread(message.getText(), message.getJMSMessageID());
    }
    // TODO Implement criteria to exit the loop here
}
In another thread, running on a distinct JMS connection at the same time, I'll do something as follows (after successful processing):
public void AcknowledgeMessage(String messageId) {
    if (this.first) {
        this.connection.start();
        this.session = this.connection.createQueueSession(false, Session.AUTO_ACKNOWLEDGE);
        this.queue = this.session.createQueue(this.QueueId);
    }
    QueueReceiver receiver = this.session.createReceiver(this.queue, "JMSMessageID='" + messageId + "'");
    Message AckMessage = receiver.receive(2000);
    receiver.close();
}
It appears that the message is not found (AckMessage is null after the timeout) whereas it does exist in the queue.
I suspect the message is being blocked by the continuous input thread; indeed, when firing AcknowledgeMessage() alone, it works fine.
Is there a cleaner way to retrieve one message, based on its queueId and messageId?
Also, I feel like there could be a risk of a memory leak in the continuous reader if it has to memorize the messages or IDs for a long time. Is that concern justified?
If I use a QueueBrowser to avoid impacting the acknowledge thread, it looks like I cannot have this continuous input feed, right?
More context: I'm using ActiveMQ, and the two threads are two custom "Steps" of a Pentaho Kettle transformation.
NB: Code samples are simplified to focus on the issue.
Well, you can't read that message twice, since you have already read it in the first thread.
ActiveMQ will not delete the message, as you have not acknowledged it, but it won't be visible until you drop the JMS connection (I'm not sure whether there is a long timeout here as well in ActiveMQ).
So you will have to use the original message and call message.acknowledge().
Note, however, that sessions are not thread safe, so be careful if you do this in two different threads.
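One way to reconcile that with the "Strings only" constraint between your Kettle steps is to park each Message object in a shared map keyed by its JMS message ID, so only the ID string crosses the thread boundary. A rough sketch (PendingMessages is a made-up helper name, not an ActiveMQ API):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.jms.JMSException;
import javax.jms.Message;

// Hypothetical helper shared between the receiver and the processing thread.
public class PendingMessages {
    private final Map<String, Message> pending = new ConcurrentHashMap<>();

    // Receiver thread: park the message and pass only its ID (a String) along.
    public void park(Message message) throws JMSException {
        pending.put(message.getJMSMessageID(), message);
    }

    // Processing thread: call after successful processing. In CLIENT_ACKNOWLEDGE
    // mode acknowledge() acknowledges all messages consumed so far on the same
    // session, and sessions are not thread safe, so route these calls through a
    // single thread or synchronize on the session.
    public void acknowledge(String messageId) throws JMSException {
        Message message = pending.remove(messageId);
        if (message != null) {
            message.acknowledge();
        }
    }
}

Because remove() drops each entry once it is acknowledged, the map only grows with the number of in-flight messages, which also bounds the memory-leak risk you mention.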