Why would a Kafka Streams app crash with suppress() enabled? - apache-kafka-streams

I'm writing a Kafka Streams 2.3.0 application to count the number of events in a session window, and I want to print out only the final record when a session times out.
Serde<String> stringSerde = Serdes.serdeFrom(new StringSerializer(), new StringDeserializer());
Serde<MuseObject> museObjectSerde = Serdes.serdeFrom(new MuseObjectSerializer(), new MuseObjectDeserializer());
StreamsBuilder builder = new StreamsBuilder();
builder
    .stream(INPUT_TOPIC, Consumed.with(stringSerde, museObjectSerde))
    .map((key, value) -> {
        return KeyValue.pair(value.getSourceValue("vid"), value.toString());
    })
    .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
    .windowedBy(SessionWindows.with(Duration.ofSeconds(INACTIVITY_GAP)).grace(Duration.ZERO))
    .count(Materialized.with(Serdes.String(), Serdes.Long()))
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream()
    .print(Printed.toSysOut());
However, the application crashes when a session times out:
12:35:03.859 [kafka-producer-network-thread | kafka-streams-test-kgu-4c3f2398-8f67-429d-82ce-6062c86af466-StreamThread-1-producer] ERROR o.a.k.s.p.i.RecordCollectorImpl - task [1_0] Error sending record to topic kafka-streams-test-kgu-KTABLE-SUPPRESS-STATE-STORE-0000000008-changelog due to The server experienced an unexpected error when processing the request.; No more records will be sent and no more offsets will be recorded for this task. Enable TRACE logging to view failed record key and value.
org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request.
12:35:03.862 [kafka-streams-test-kgu-4c3f2398-8f67-429d-82ce-6062c86af466-StreamThread-1] ERROR o.a.k.s.p.i.AssignedStreamsTasks - stream-thread [kafka-streams-test-kgu-4c3f2398-8f67-429d-82ce-6062c86af466-StreamThread-1] Failed to commit stream task 1_0 due to the following error:
org.apache.kafka.streams.errors.StreamsException: task [1_0] Abort sending since an error caught with a previous record (key user01\x00\x00\x01m!\xCE\x99u\x00\x00\x01m!\xCE\x80\xD1 value null timestamp null) to topic kafka-streams-test-kgu-KTABLE-SUPPRESS-STATE-STORE-0000000008-changelog due to org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request.
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:138)
I've tried commenting out the ".suppress(...)" line. The application works fine without suppress() and prints out something like this:
[KSTREAM-FILTER-0000000011]: [user01#1568230244561/1568230250869], MuseSession{vid='user01', es='txnSuccess', count=6, start=2019-06-26 17:11:02.937, end=2019-06-26 18:07:10.685, sessionType='open'}".
What did I miss in using suppress()? Is there another way to filter out only the session records that have timed out?
Any help is appreciated. Thanks in advance.

suppress() requires at least broker version 0.11.0 and message format 0.11.
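If you are unsure what your cluster is running, one way to check is to read the broker configuration through the AdminClient. This is only a sketch, assuming a broker with id "0" reachable at localhost:9092 (adjust both to your environment):
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class BrokerMessageFormatCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // assumed bootstrap address; replace with your own
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // broker id "0" is an assumption; use one of your actual broker ids
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "0");
            Config config = admin.describeConfigs(Collections.singleton(broker)).all().get().get(broker);
            System.out.println("log.message.format.version = "
                    + config.get("log.message.format.version").value());
        }
    }
}
If the reported format is older than 0.11, the broker cannot handle what the suppress() changelog writes (as far as I understand, it relies on record headers introduced with format 0.11), which would be consistent with the UnknownServerException above.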

Related

KTable & LogAndContinueExceptionHandler

I have a very simple consumer from which I create a materialized view. I have enabled validation on my value object (throwing ConstraintViolationException for invalid JSON data). When I receive a value for which validation fails, I expect the value to be logged and the consumer to read the next offset, since I have LogAndContinueExceptionHandler enabled.
However, LogAndContinueExceptionHandler is never invoked and the application's state transitions from PENDING_ERROR to ERROR.
Code
@Bean
public Consumer<KTable<String, Pojo>> consume() {
    return values ->
        values
            .filter((key, value) -> Objects.nonNull(key))
            .mapValues(value -> value, Materialized.<String, Pojo>as(Stores.inMemoryKeyValueStore("POJO_STORE_NAME"))
                .withKeySerde(Serdes.String())
                .withValueSerde(SerdeUtil.pojoSerde())
                .withLoggingDisabled())
            .toStream()
            .peek((key, value) -> log.debug("Receiving Pojo from topic with key: {}, and UUID: {}", key, value == null ? 0 : value.getUuid()));
}
Why is LogAndContinueExceptionHandler not invoked in the case of a KTable?
Note: if the code is changed to use KStream, I do see the logging and the records being skipped, but not with KTable.
To handle exceptions not handled by Kafka Streams, use the KafkaStreams.setUncaughtExceptionHandler method with a StreamsUncaughtExceptionHandler implementation; it needs to return one of three available enum values:
REPLACE_THREAD
SHUTDOWN_CLIENT
SHUTDOWN_APPLICATION
In your case REPLACE_THREAD is the best option, as you can see in KIP-671:
REPLACE_THREAD:
The current thread is shut down and transitions to state DEAD.
A new thread is started if the Kafka Streams client is in state RUNNING or REBALANCING.
For the global thread, this option will log an error and revert to shutting down the client until the option has been added for the global thread.
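Outside Spring, the same handler (available since Kafka Streams 2.8, per KIP-671) can be registered directly on the KafkaStreams instance before it is started. A minimal sketch, assuming topology and props are your already-built Topology and streams configuration:
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler;

// topology and props are assumed to exist already
KafkaStreams streams = new KafkaStreams(topology, props);
// the handler must be registered before start()
streams.setUncaughtExceptionHandler(exception ->
        StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.REPLACE_THREAD);
streams.start();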
In Spring Kafka you can replace the default StreamsUncaughtExceptionHandler via the StreamsBuilderFactoryBean:
@Autowired
void setMyStreamsUncaughtExceptionHandler(StreamsBuilderFactoryBean streamsBuilderFactoryBean) {
    streamsBuilderFactoryBean.setStreamsUncaughtExceptionHandler(
            exception -> StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.REPLACE_THREAD);
}
I was able to solve the problem after looking at the logs carefully: the valueSerde for the Pojo was resolving to useNativeDecoding (instead of the default JsonSerde), so the DeserializationExceptionHandler was never invoked and the thread terminated.
The problem went away once I fixed the valueSerde in application.properties.

How to fix error 'Component: [x] subscribes from non-existent component [y]' in Apache Storm Trident topology

I just implemented a Trident DRPC function to process incoming messages, and I am trying to persist the count of processed tuples as Trident state at the final stage of the topology. Here is my topology:
topology.newDRPCStream("portfolio")
.map(parseMapFunction,
new Fields("portfolioUrn", "portfolioSourceSystem", "portfolioRegion",
"portfolioTimestamp", "portfolioPayload"))
.filter(new FilterNull())
.flatMap(splitMapFunction,
new Fields("portfolioUrn", "portfolioSourceSystem", "portfolioRegion",
"portfolioTimestamp", "strategyCode"))
.parallelismHint(1)
.shuffle()
.each(new Fields("strategyCode"), findMongoTradesFunction,
new Fields("uitid", "id", "sourceSystem", "sourceTransactionTime", "publicationTime",
"tradeVersion", "urn", "riskViewFrom", "riskViewTo", "authorized"))
.parallelismHint(10)
.shuffle()
.filter(tradeFilterFunction)
.parallelismHint(150)
.groupBy(new Fields("uitid"))
.aggregate(
new Fields("portfolioUrn", "portfolioTimestamp", "strategyCode", "id", "sourceSystem",
"sourceTransactionTime", "publicationTime", "tradeVersion", "urn", "riskViewFrom",
"riskViewTo", "uitid"), reduceAggregateFunction,
new Fields("portfolioUrn", "portfolioTimestamp", "strategyCode", "id", "sourceSystem",
"sourceTransactionTime", "publicationTime", "tradeVersion", "urn", "riskViewFrom",
"riskViewTo"))
.parallelismHint(200)
.groupBy(new Fields("portfolioUrn"))
.persistentAggregate(stateFactory, new Count(), new Fields("count"));
When I try to submit this topology to Storm, I get this error:
Exception in thread "main" java.lang.RuntimeException: InvalidTopologyException(msg:Component: [b-4] subscribes from non-existent component [$mastercoord-bg0])
at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:273)
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:387)
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:159)
at com.citi.tm.portfolio.tps.PortfolioLauncher.main(PortfolioLauncher.java:34)
Caused by: InvalidTopologyException(msg:Component: [b-4] subscribes from non-existent component [$mastercoord-bg0])
at org.apache.storm.generated.Nimbus$submitTopology_result$submitTopology_resultStandardScheme.read(Nimbus.java:8070)
at org.apache.storm.generated.Nimbus$submitTopology_result$submitTopology_resultStandardScheme.read(Nimbus.java:8047)
at org.apache.storm.generated.Nimbus$submitTopology_result.read(Nimbus.java:7981)
at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
at org.apache.storm.generated.Nimbus$Client.recv_submitTopology(Nimbus.java:306)
at org.apache.storm.generated.Nimbus$Client.submitTopology(Nimbus.java:290)
at org.apache.storm.StormSubmitter.submitTopologyInDistributeMode(StormSubmitter.java:326)
at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:260)
... 3 more
I can successfully submit the topology if I remove the last two calls, which are:
.groupBy(new Fields("portfolioUrn"))
.persistentAggregate(stateFactory, new Count(), new Fields("count"));
After my aggregation function (aggregate()) runs, I'd like to group tuples by the 'portfolioUrn' field and persist the count into MongoDB. I don't understand why the final groupBy().persistentAggregate() section causes this error. Could you please help me find the reason?
After some research, I found this page, which seems to be a similar case to mine. Nathan Marz states that DRPC topologies do not support partition persist (as of 2013), and I believe the same applies to my case. I think (though I have not fully verified it) that Storm 1.2.1 DRPC topologies might not support state persistence at all.
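For what it's worth, the same groupBy/persistentAggregate chain is supported on a regular (non-DRPC) Trident stream. The following is only a rough sketch under that assumption: portfolioSpout is a hypothetical spout emitting a "portfolioUrn" field, and stateFactory stands in for the MongoDB state factory from the question.
import org.apache.storm.topology.IRichSpout;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.state.StateFactory;
import org.apache.storm.tuple.Fields;

public class PortfolioCountTopology {
    // portfolioSpout is a placeholder spout, not part of the original question
    public static TridentTopology build(IRichSpout portfolioSpout, StateFactory stateFactory) {
        TridentTopology topology = new TridentTopology();
        topology.newStream("portfolio-spout", portfolioSpout)
                .groupBy(new Fields("portfolioUrn"))
                .persistentAggregate(stateFactory, new Count(), new Fields("count"));
        return topology;
    }
}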

Regex for given log file

Hi, I need a regex to parse the following log file entry:
2015-07-06 11:07:29 +0950 [ERROR] index=heal-legacy host=imb.asp.com
com.iml.Keplas.collector.CollectorException: Could not process additional data, connection lost to data collector service
The expected result is:
Time 2015-07-06 11:07:29
Type ERROR
Index heal-legacy
Host imb.asp.com
Message Could not process additional data, connection lost to data collector service
Please help me extract these fields: Time, Type ([ERROR]), Index, Host, Message.
You can try this regular expression:
(?<date>\d{4}-\d{2}-\d{2})\s+(?<time>\d{2}:\d{2}:\d{2})\s+\+\d{4}\s+(?<error>\S+)\s+index=(?<index>\S+)\s+host=(?<host>\S+)\s+(?<message>.*)
See demo
s = "2015-07-06 11:07:29 +0950 [ERROR] index=heal-legacy host=imb.asp.com
com.iml.Keplas.collector.CollectorException: Could not process additional data, connection lost to data collector service"
a = s.split(" ", 7)
# => ["2015-07-06", "11:07:29", "+0950", "[ERROR]", "index=heal-legacy", "host=imb.asp.com", "com.iml.Keplas.collector.CollectorException: Could not process additional data, connection lost to data collector service"]
[a.first(3).join(" "), a[3], a[4][/(?<==).+/], a[5][/(?<==).+/], a[6]]
# => ["2015-07-06 11:07:29 +0950", "[ERROR]", "heal-legacy", "imb.asp.com", "\ncom.iml.Keplas.collector.CollectorException: Could not process additional data, connection lost to data collector service"]

Spark Streaming with multiple Kafka streams

I create the Kafka streams with the following code:
val streams = (1 to 5) map { i =>
  KafkaUtils.createStream[....](
    streamingContext,
    Map( .... ),
    Map(topic -> numOfPartitions),
    StorageLevel.MEMORY_AND_DISK_SER
  ).filter(...)
   .mapPartitions(...)
   .reduceByKey(....)
}

val unifiedStream = streamingContext.union(streams)
unifiedStream.foreachRDD(...)
streamingContext.start()
I give each stream a different group id. When I run the application, only part of the Kafka messages are received and the executor hangs at the foreachRDD call. If I create only one stream, everything works well. There aren't any exceptions in the logs.
I don't know why the application is stuck there. Does it mean there aren't enough resources?
You may want to try setting the parameter
new SparkConf().set("spark.streaming.concurrentJobs", "5")
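For context, here is a sketch of where this setting would go when building the streaming context, shown with the Java API; the app name and batch interval are placeholders, not from the question:
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// spark.streaming.concurrentJobs (default 1) lets the scheduler run more than one
// streaming output job at a time; the master is expected to come from spark-submit.
SparkConf conf = new SparkConf()
        .setAppName("multi-stream-app")               // placeholder app name
        .set("spark.streaming.concurrentJobs", "5");
JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10)); // placeholder batch interval
// create the Kafka streams, union them, and register foreachRDD before calling ssc.start()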

How to continuously read JMS Messages in a thread and acknowledge them based on their JMSMessageID in another thread?

I've written a continuous JMS message receiver:
Here, I'm using CLIENT_ACKNOWLEDGE because I don't want this thread to acknowledge the messages.
(...)
connection.start();
session = connection.createQueueSession(true, Session.CLIENT_ACKNOWLEDGE);
queue = session.createQueue(QueueId);
receiver = session.createReceiver(queue);
while (true) {
    message = receiver.receive(1000);
    if (message != null) {
        // NB: I can only pass Strings to the other thread
        sendMessageToOtherThread(message.getText(), message.getJMSMessageID());
    }
    // TODO Implement criteria to exit the loop here
}
In another thread, I'll do something like the following (after successful processing):
This is in a distinct JMS Connection executed simultaneously.
public void AcknowledgeMessage(String messageId) {
    if (this.first) {
        this.connection.start();
        this.session = this.connection.createQueueSession(false, Session.AUTO_ACKNOWLEDGE);
        this.queue = this.session.createQueue(this.QueueId);
    }
    QueueReceiver receiver = this.session.createReceiver(this.queue, "JMSMessageID='" + messageId + "'");
    Message AckMessage = receiver.receive(2000);
    receiver.close();
}
It appears that the message is not found (AckMessage is null after the timeout), whereas it does exist in the queue.
I suspect the message is being held by the continuous input thread; indeed, when firing AcknowledgeMessage() alone, it works fine.
Is there a cleaner way to retrieve a single message based on its QueueId and messageId?
Also, I feel there could be a risk of a memory leak in the continuous reader if it has to keep track of the messages or IDs for a long time. Is that concern justified?
If I use a QueueBrowser to avoid impacting the acknowledge thread, it looks like I cannot have this continuous input feed. Is that right?
More context: I'm using ActiveMQ, and the two threads are two custom "Steps" of a Pentaho Kettle transformation.
NB: code samples are simplified to focus on the issue.
Well, you can't read that message twice, since you have already read it in the first thread.
ActiveMQ will not delete the message since you have not acknowledged it, but it won't be visible until you drop the JMS connection (I'm not sure whether there is also a long timeout for this in ActiveMQ).
So you will have to use the original message and call message.acknowledge() on it.
Note, however, that sessions are not thread safe, so be careful if you do this in two different threads.
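One way to follow that advice while still processing in a second thread is to keep the Message objects on the receiving side and only hand IDs back for acknowledgement, so that acknowledge() is always called from the thread that owns the session. This is only a rough sketch; the class and method names are made up for illustration:
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import javax.jms.JMSException;
import javax.jms.Message;

// Sketch: the receiving thread registers messages and periodically acknowledges the
// ones the processing thread has reported as done.
public class PendingAcks {
    private final Map<String, Message> pending = new ConcurrentHashMap<>();
    private final BlockingQueue<String> processedIds = new LinkedBlockingQueue<>();

    // called from the receiving thread right after receive()
    public void register(Message message) throws JMSException {
        pending.put(message.getJMSMessageID(), message);
    }

    // called from the processing thread once a message has been handled
    public void markProcessed(String messageId) {
        processedIds.add(messageId);
    }

    // called from the receiving thread, e.g. once per poll loop iteration
    public void acknowledgeProcessed() throws JMSException {
        String id;
        while ((id = processedIds.poll()) != null) {
            Message m = pending.remove(id);
            if (m != null) {
                m.acknowledge(); // in CLIENT_ACKNOWLEDGE mode this acknowledges the message
            }
        }
    }
}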
