Kafka stream/avro application fails with ClassCastException (deserialized instance has a different classloader) - apache-kafka-streams

My Kafka Streams application uses a SpecificAvroSerde to deserialize the events coming in from the source topic.
Pretty much using the SpecificAvroSerde config found here:
https://docs.confluent.io/platform/current/streams/developer-guide/datatypes.html#avro
I create the Avro IDL definitions in a separate module and import it into my Kafka Streams app as a dependency, so I can use the generated entities at compile time.
I also have a custom TimestampExtractor for a windowed aggregation, which receives the deserialized object so I can extract the timestamp from the event. Because the extractor's method signature doesn't deal with generics (it receives a ConsumerRecord<Object, Object>), I cast the runtime instance to the class I expect it to be, since they should be the same type.
However, the instance I receive was loaded by a different classloader than the class I cast it to (even though it's the same class type), so I get a ClassCastException.
I'm a bit out of ideas on how to solve this.
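For reference, the extractor looks roughly like this (a minimal sketch; MyEvent and getEventTimestamp() are placeholder names for the generated Avro class and its timestamp field):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class EventTimestampExtractor implements TimestampExtractor {

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        // The cast below is where the ClassCastException surfaces: the deserialized
        // instance was loaded by a different classloader than the MyEvent class
        // visible to this extractor, even though both represent the same Avro type.
        MyEvent event = (MyEvent) record.value();
        return event.getEventTimestamp();
    }
}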

Related

Processing incoming payloads as batch not working as expected in spring-cloud-streams

I say 'not working as expected', but really it's more like 'I don't know whether I'm doing the right thing here'; I feel like I'm mixing pieces from different approaches that don't really fit together.
Right now I'm using Spring Cloud Stream to process String-type messages from a Pub/Sub subscription, and so far so good: message in, message out, without much hassle.
What I'm trying to achieve now is to gather, say, 1000 messages, process them, and send them all together to another Pub/Sub topic. I'm still unsure whether to send them as a List or individually like now, but all at the same time (that shouldn't be relevant to this question, though).
Now I just discovered the following property:
spring.cloud.stream.bindings.input.consumer.batch-mode=true
Together with the following ones, which are more specific to the GCP side:
spring.cloud.gcp.pubsub.publisher.batching.enabled=true
spring.cloud.gcp.pubsub.publisher.batching.delay-threshold-seconds=300
spring.cloud.gcp.pubsub.publisher.batching.element-count-threshold=100
So the first question is: are these linked in any way? Must I set the first one together with the other three?
What happened after I added the previous properties to my application.properties file is actually no change at all. Messages keep arriving and leaving the application without any issue and with no batching whatsoever.
I'm currently using the functional style, as follows:
@Bean
public Function<Message<String>, String> sampleFunction() {
    return message -> {
        ... // Stream processing in here
        return processedString;
    };
}
I was expecting this to crash with some error, since the function only receives a single String, not a list of Strings. Since it didn't crash, I modified the function above to receive a List of Strings (maybe Spring does some magic behind the scenes to still receive the messages as String but collect them into a list for the method to process afterwards?):
@Bean
public Function<Message<List<String>>, String> sampleFunction() {
    return message -> {
        ... // Stream processing in here
        return processedString;
    };
}
But this just crashes, since it tries to parse a single String message as a List of Strings.
How can I prepare the code to batch all those String messages into a List? Is there any example of this?
...batch-mode only works with binders that support it (e.g. Kafka, RabbitMQ). It doesn't look like the GCP binder supports it (I see no references to the property).
https://github.com/spring-cloud/spring-cloud-gcp/blob/master/spring-cloud-gcp-pubsub-stream-binder/src/main/java/org/springframework/cloud/gcp/stream/binder/pubsub/PubSubMessageChannelBinder.java
https://docs.spring.io/spring-cloud-stream/docs/3.1.0/reference/html/spring-cloud-stream.html#_batch_consumers
Publisher batching is not related to consumer batching.
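For what it's worth, with a binder that does support batch mode (e.g. the Kafka binder), the functional consumer receives a whole List per invocation. A minimal sketch, with illustrative names and the functional binding-name convention (batchConsumer-in-0):

import java.util.List;
import java.util.function.Consumer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class BatchConsumerConfig {

    // With spring.cloud.stream.bindings.batchConsumer-in-0.consumer.batch-mode=true
    // on a supporting binder, all records returned by a single poll arrive as one list.
    @Bean
    public Consumer<List<String>> batchConsumer() {
        return batch -> System.out.println("received batch of " + batch.size() + " messages");
    }
}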

Kafka Streams - override default addSink implementation / custom producer

This is my first post here and I am not sure if this has been covered before, but here goes: I have a Kafka Streams application, using the Processor API, with the following topology:
1. Consume data from an input topic (processor.addSource())
2. Insert the data into a DB (processor.addProcessor())
3. Produce its processing status to an output topic (processor.addSink())
The app works great; however, for traceability purposes, I need to log the moment Kafka Streams produces a message to the output topic, along with its RecordMetadata (topic, partition, offset).
Example below:
KEY="MY_KEY" OUTPUT_TOPIC="MY-OUTPUT-TOPIC" PARTITION="1" OFFSET="1000" STATUS="SUCCESS"
I am not sure if there is a way to override the default Kafka Streams producer to add this logging, or maybe to create my own producer and plug it into the addSink step. I partially achieved it by implementing my own exception handler (default.production.exception.handler), but it only covers exceptions.
Thanks in advance,
Guilherme
If you configure the Streams application to use a ProducerInterceptor, you should be able to get the information you need. Specifically, implementing onAcknowledgement() will give you access to everything you listed above.
To configure interceptors in a streams application:
Properties props = new Properties();
// add this configuration in addition to your other streams configs
props.put(StreamsConfig.producerPrefix(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG), Collections.singletonList(MyProducerInterceptor.class));
You can provide more than one interceptor if desired; just add the class names and change the list implementation from a singleton to a regular List. The interceptors execute in the order of the classes in the list.
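A minimal sketch of such an interceptor (the generic types and log format here are just illustrative):

import java.util.Map;
import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class MyProducerInterceptor implements ProducerInterceptor<String, String> {

    @Override
    public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
        return record; // pass the record through unchanged
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        if (exception == null && metadata != null) {
            // topic, partition and offset of the record the broker just acknowledged
            System.out.printf("OUTPUT_TOPIC=\"%s\" PARTITION=\"%d\" OFFSET=\"%d\" STATUS=\"SUCCESS\"%n",
                    metadata.topic(), metadata.partition(), metadata.offset());
        }
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}

Note that onAcknowledgement() only exposes the RecordMetadata and a possible exception; if you also need the record key in the log line, you can capture it in onSend(), which sees the full ProducerRecord.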
EDIT: Just to be clear, you can override the provided producer in Kafka Streams via the KafkaClientSupplier interface (you pass your KafkaClientSupplier to an overloaded KafkaStreams constructor), but IMHO using an interceptor is the cleaner approach. Which direction to go is up to you.

kafka connect avro enums parsed into string

I am using the Kafka Connect framework from Confluent to produce messages from my application servers into a Kafka cluster (ZooKeeper + brokers + Schema Registry for Avro support).
The data I am sending through Connect is defined by an Avro schema. My schema represents a structured object containing enums; indeed, Apache Avro supports enumeration types.
I don't have to commit my schema to the registry because the Kafka Connect API does that automatically.
My problem is that Kafka Connect seems to parse enums into Strings. When I try to consume, I see that the schema committed by Connect is not correct, since it has converted all my enums into Strings. Thereby I cannot consume my data without implementing conversion logic from String back to enum.
I want to keep my logical information as an enum and still use Kafka Connect. I dug into the kafka-connect code and it seems to handle only basic types, not enumeration types.
My current alternative is to build my own producing framework which preserves enums by imitating the Connect framework, but this is time consuming, and I cannot avoid using enums.
Have you managed to produce and consume records containing enums with Kafka Connect?
Any help or experience feedback is welcome,
Thanks!
In more recent versions of Connect (around maybe 4.2+), there are the properties below. (I personally haven't seen them in the documentation, but I was able to find them in the source code, because I ran into the same thing as you.)
As you can see, the default is false, and I've been told that in newer releases it'll be set to true:
public static final String ENHANCED_AVRO_SCHEMA_SUPPORT_CONFIG = "enhanced.avro.schema.support";
public static final boolean ENHANCED_AVRO_SCHEMA_SUPPORT_DEFAULT = false;
public static final String ENHANCED_AVRO_SCHEMA_SUPPORT_DOC =
"Enable enhanced avro schema support in AvroConverter: Enum symbol preservation and Package"
+ " Name awareness";
For now, you'll need to set these at the worker / connector level to have the enums preserved, assuming you're running a version of Connect that has them.
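For example, a hedged sketch of a connector-level override (Kafka Connect passes properties prefixed with key.converter. / value.converter. through to the converter; the registry URL is illustrative):

value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
value.converter.enhanced.avro.schema.support=true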

Is there a way to get offset for each message consumed in kafka streams?

In order to avoid re-reading messages that were processed but not committed when a Kafka Streams application is killed, I want to get the offset of each message, along with the key and value, so that I can store it somewhere and use it to avoid reprocessing already-processed messages.
Yes, this is possible. See the FAQ entry at http://docs.confluent.io/current/streams/faq.html#accessing-record-metadata-such-as-topic-partition-and-offset-information.
I'll copy-paste the key information below:
Accessing record metadata such as topic, partition, and offset information?
Record metadata is accessible through the Processor API.
It is also accessible indirectly through the DSL thanks to its
Processor API integration.
With the Processor API, you can access record metadata through a
ProcessorContext. You can store a reference to the context in an
instance field of your processor during Processor#init(), and then
query the processor context within Processor#process(), for example
(same for Transformer). The context is updated automatically to match
the record that is currently being processed, which means that methods
such as ProcessorContext#partition() always return the current
record’s metadata. Some caveats apply when calling the processor
context within punctuate(), see the Javadocs for details.
If you use the DSL combined with a custom Transformer, for example,
you could transform an input record’s value to also include partition
and offset metadata, and subsequent DSL operations such as map or
filter could then leverage this information.
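As a minimal sketch of that pattern with a custom Transformer (class name and tagging format are just for illustration):

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

public class OffsetTaggingTransformer implements Transformer<String, String, KeyValue<String, String>> {

    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        // keep a reference; the context always reflects the record currently being processed
        this.context = context;
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        // append topic/partition/offset so a downstream step can persist it and
        // skip records that were already processed before a crash
        String tagged = value + "|" + context.topic() + "-" + context.partition() + "@" + context.offset();
        return KeyValue.pair(key, tagged);
    }

    @Override
    public void close() { }
}

From the DSL you would plug it in with something like stream.transform(OffsetTaggingTransformer::new).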

Using Spring Integration to split a large XML file into individual smaller messages and process each individually

I am using Spring Integration and have a large XML file containing a collection of child items. I want to split the file into a set of messages, where the payload of each message will be one of the child XML fragments.
Using a splitter is the obvious choice, but that requires returning a collection of messages, and this will exhaust the memory; I need to split the file into individual messages but process them one at a time (or, more likely, with a multi-threaded task executor).
Is there a standard way to do this without writing a custom component that writes the sub-messages to a channel programmatically?
I have been looking for a similar solution and have not found any standard way of doing this either.
Here is a rather dirty fix, if anyone needs this behaviour implemented:
Split the files manually using a Service Activator or a Splitter with a custom bean:
<int:splitter input-channel="rawChannel" output-channel="splitChannel" id="splitter" >
<bean class="com.a.b.c.MYSplitter" />
</int:splitter>
Your custom bean should implement ApplicationContextAware so the application context can be injected by Spring.
Then manually retrieve the output channel and send each sub-message:
MessageChannel splitChannel = (MessageChannel) applicationContext.getBean("splitChannel");
Message<String> message = new GenericMessage<String>(payload);
splitChannel.send(message);
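Putting those pieces together, the custom bean could look roughly like this (shown here as the service-activator variant; extractFragments() is a placeholder for whatever streaming XML parsing you use):

import org.springframework.beans.BeansException;
import org.springframework.context.ApplicationContext;
import org.springframework.context.ApplicationContextAware;
import org.springframework.messaging.Message;
import org.springframework.messaging.MessageChannel;
import org.springframework.messaging.support.GenericMessage;

public class MYSplitter implements ApplicationContextAware {

    private ApplicationContext applicationContext;

    @Override
    public void setApplicationContext(ApplicationContext applicationContext) throws BeansException {
        this.applicationContext = applicationContext;
    }

    public void split(Message<String> message) {
        MessageChannel splitChannel = (MessageChannel) applicationContext.getBean("splitChannel");
        // send each fragment as its own message instead of returning one big collection
        for (String fragment : extractFragments(message.getPayload())) {
            splitChannel.send(new GenericMessage<String>(fragment));
        }
    }

    private Iterable<String> extractFragments(String payload) {
        // placeholder: parse the XML here, ideally lazily so the whole file
        // is never held in memory at once
        throw new UnsupportedOperationException("XML fragment extraction not shown");
    }
}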
For people coming across this very old question: splitters can now handle results of type Iterable, Iterator, Stream, and Flux (Project Reactor). If any of these types are returned, messages are emitted one at a time.
Iterator/Iterable are supported since 4.0.4; Stream/Flux since 5.0.0.
There is also now a FileSplitter, which emits file contents a line at a time via an Iterator (since 4.1.2).
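For example, a POJO splitter that returns an Iterator streams the elements one at a time (here splitting a file line by line just to show the shape; the channel names match the config above, and you would swap in an iterator over XML child fragments):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Iterator;
import org.springframework.integration.annotation.Splitter;

public class LazySplitter {

    @Splitter(inputChannel = "rawChannel", outputChannel = "splitChannel")
    public Iterator<String> split(File file) throws IOException {
        // Files.lines() is lazy, so the splitter emits one message per line
        // without ever materializing the whole file in memory
        return Files.lines(file.toPath()).iterator();
    }
}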
