Creating a state store with Avro inside a KStream Consumer Processor - apache-kafka-streams

I have a consumer defined as below. It reads an Avro message from a topic and builds a state store of aggregated data, which is also an Avro type.
@Bean
public Consumer<KStream<String, InputEvent>> avroTest() {
    Serde<OutputEvent> serdeOutEvent = new SpecificAvroSerde<>(schemaRegistryClient);
    return st -> st.groupByKey()
            .aggregate(OutputEvent::new, (key, currentEvent, outputEvent) -> {
                // aggregate here
                return outputEvent;
            }, Materialized.with(new Serdes.StringSerde(), serdeOutEvent))
            .toStream();
}
The function reads messages from the topic and produces the first aggregated result, but when it tries to store that result in the state store it receives a 404 because the schema is not present.
Exception in thread "odoAvroTest-e4ef8e3e-ea1e-458c-b309-b2afefbeacec-StreamThread-1" org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=0_0, processor=KSTREAM-SOURCE-0000000000, topic=odometer, partition=0, offset=0, stacktrace=org.apache.kafka.common.errors.SerializationException: Error retrieving Avro schema: {"type":"record","name": "" .... }
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Subject not found.; error code: 40401
at io.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:226)
at io.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:252)
at io.confluent.kafka.schemaregistry.client.rest.RestService.lookUpSubjectVersion(RestService.java:319)
at io.confluent.kafka.schemaregistry.client.rest.RestService.lookUpSubjectVersion(RestService.java:307)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getIdFromRegistry(CachedSchemaRegistryClient.java:165)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getId(CachedSchemaRegistryClient.java:297)
at io.confluent.kafka.serializers.AbstractKafkaAvroSerializer.serializeImpl(AbstractKafkaAvroSerializer.java:73)
at io.confluent.kafka.serializers.KafkaAvroSerializer.serialize(KafkaAvroSerializer.java:53)
at io.confluent.kafka.streams.serdes.avro.SpecificAvroSerializer.serialize(SpecificAvroSerializer.java:65)
at io.confluent.kafka.streams.serdes.avro.SpecificAvroSerializer.serialize(SpecificAvroSerializer.java:38)
at org.apache.kafka.streams.state.internals.ValueAndTimestampSerializer.serialize(ValueAndTimestampSerializer.java:59)
at org.apache.kafka.streams.state.internals.ValueAndTimestampSerializer.serialize(ValueAndTimestampSerializer.java:50)
at org.apache.kafka.streams.state.internals.ValueAndTimestampSerializer.serialize(ValueAndTimestampSerializer.java:27)
at org.apache.kafka.streams.state.StateSerdes.rawValue(StateSerdes.java:192)
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.put(MeteredKeyValueStore.java:166)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl$KeyValueStoreReadWriteDecorator.put(ProcessorContextImpl.java:486)
at org.apache.kafka.streams.kstream.internals.KStreamAggregate$KStreamAggregateProcessor.process(KStreamAggregate.java:103)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:117)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:201)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:180)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:133)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:87)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:363)
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:199)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:425)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:912)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:819)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:788)
Do let me know if there are additional config tweaks necessary to make this work. When I change the input to a HashMap and/or a simple POJO and use a JsonSerde, the code works and creates the aggregation.

The issue here is the Schema Registry needed by the Avro serde. When you set the value serde in Materialized.with(), you have to pass the schema registry configuration to that serde.
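For example, a minimal sketch (the registry URL is illustrative, not from the question) of configuring the value serde before handing it to Materialized.with():
// Configure the Avro serde with the schema registry URL so the serializer used
// for the state store / changelog can look up (or register) the OutputEvent schema.
Serde<OutputEvent> serdeOutEvent = new SpecificAvroSerde<>(schemaRegistryClient);
Map<String, Object> serdeConfig = new HashMap<>();
serdeConfig.put("schema.registry.url", "http://localhost:8081"); // illustrative URL
serdeOutEvent.configure(serdeConfig, false); // false = this is a value serde, not a key serde
Then pass serdeOutEvent to Materialized.with(new Serdes.StringSerde(), serdeOutEvent) as before. Depending on your subject naming strategy, you may also need auto.register.schemas enabled so the changelog subject can be created.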

Related

How do I trigger the doOnError() in a Reactive Kafka consumer?

private Flux<Record> consumeRecord() {
    return reactiveKafkaConsumerTemplate
            .receive()
            .doOnNext(consumerRecord -> {
                Record record = consumerRecord.value();
                recordWorkflowService.handleRecord(record);
            })
            .map(ConsumerRecord::value)
            .doOnError(throwable -> {
                log.error("something bad happened while consuming : {}", throwable.getMessage());
            });
}
Currently this is the code I have in my consumer. When a record comes in, I see that recordWorkflowService.handleRecord() is called and the record is processed successfully; however, I cannot get the error case to trigger.
I have a use case where I consume records from a Kafka topic and do some processing on them. If any part of that processing fails, I do not want the Kafka record's offset to be committed, so that the record can be reprocessed. So if any error occurs in the recordWorkflowService, I want .doOnError() to be triggered and the offset not to be committed (so the record can be reprocessed).
Am I on the right path here? I have tried manually throwing an exception within handleRecord(), but .doOnError() never seems to get triggered.
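One way to approach this (a sketch under the assumption that reactor-kafka's ReceiverRecord with manual acknowledgment is in use, not a confirmed fix) is to run the processing inside the reactive chain, so that an exception thrown by handleRecord() propagates as an error signal and the offset is only acknowledged after the record has been processed successfully:
private Flux<Record> consumeRecord() {
    return reactiveKafkaConsumerTemplate
            .receive()
            .concatMap(receiverRecord ->
                    // run the processing inside the chain; an exception here becomes an onError signal
                    Mono.fromRunnable(() -> recordWorkflowService.handleRecord(receiverRecord.value()))
                            // acknowledge the offset only after handleRecord() succeeds
                            .then(Mono.fromRunnable(() -> receiverRecord.receiverOffset().acknowledge()))
                            .thenReturn(receiverRecord.value()))
            .doOnError(throwable ->
                    log.error("something bad happened while consuming : {}", throwable.getMessage()));
}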

KStreams turn off the internal topics for Aggregate

I am working on an application where Kafka has auto topic creation disabled by default, so any topics required by the application must be created beforehand.
I have implemented a window store with aggregation:
KGroupedStream<String, String> groupedStream = builder
        .stream("topic", Consumed.with(Serdes.String(), Serdes.String()))
        .map(this::processInbound)
        .groupByKey();
WindowBytesStoreSupplier windowStoreSupplier =
        Stores.persistentWindowStore(storename, STORE_RETENTION, WINDOW_SIZE, false);
final Materialized<String, StatAggregator, WindowStore<Bytes, byte[]>> materialized =
        Materialized.<String, StatAggregator>as(windowStoreSupplier)
                .withKeySerde(Serdes.String())
                .withValueSerde(statAggregatorSerde); // a serde for StatAggregator values
groupedStream
        .windowedBy(TimeWindows.of(Duration.ofDays(1)).advanceBy(Duration.ofMinutes(15)))
        .aggregate(StatAggregator::new, (key, value, aggregator) -> {
            aggregator.add(value);
            return aggregator;
        }, materialized);
I need fault tolerance to be in place, so I create the changelog and repartition topics beforehand, i.e. the topics xxxxx-store-changelog and xxxxx-store-repartition already exist before the app launches.
But because I use aggregation, I see that there are other internal topics created by Kafka Streams, such as xxxxx-KSTREAM-AGGREGATE-STATE-STORE-0000000031-changelog and xxxxx-KSTREAM-AGGREGATE-STATE-STORE-0000000031-repartition. As these topic names are derived by Kafka Streams itself, I am unable to create them beforehand.
I would like to either turn off auto topic creation for the aggregation or be able to define the topic names up front (via config?) so that Kafka Streams can use them for the aggregation.
Please advise.
Thanks,
Jinu
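A hedged sketch (the operator name "stats-agg" is illustrative, not from the question): Kafka Streams derives the repartition topic name from the name given to the grouping and the changelog name from the store name, so naming both makes the internal topic names deterministic (<application.id>-stats-agg-repartition and <application.id>-<store-name>-changelog) and lets you create them before launch:
KGroupedStream<String, String> groupedStream = builder
        .stream("topic", Consumed.with(Serdes.String(), Serdes.String()))
        .map(this::processInbound)
        // naming the grouping fixes the repartition topic name
        .groupByKey(Grouped.with("stats-agg", Serdes.String(), Serdes.String()));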

Remove in-memory Kafka records received from a fetch in a Spring Boot project

I have a requirement: I want to be able to remove the in-memory Kafka records that were fetched, since I have max-poll-records: 10. The scenario is: while processing the records one by one, if my program encounters an error, I don't want to process any of the leftover records still held in memory.
Example: I fetch 10 records at once because max-poll-records is 10. I process 5 records successfully (committing manually), but while processing the 6th record I encounter an error, and now I have to remove all 5 leftover records from memory. Below is my listener code:
@KafkaListener(topics = "#{'${kafka.consumer.allTopicList}'.split(',')}", groupId = Constant.GROUP_ID)
public void consumeAllTopics(@Header(KafkaHeaders.RECEIVED_TOPIC) String topic, String message, Acknowledgment acknowledgment) {
    switch (topic) {
        case Constant.toipc1:
            if (!StringUtils.isEmpty(message)) {
                try {
                    // processing logic
                    acknowledgment.acknowledge();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
            break;
    }
}
I want to remove these records through code. Please help me understand whether this is possible and, if so, how I can achieve it.
It's not clear what you mean by "remove". If you mean "ignore" or "skip", you would need to throw an exception and configure a custom error handler.
See the documentation.
If an ErrorHandler implements RemainingRecordsErrorHandler, the error handler is provided with the failed record and any unprocessed records retrieved by the previous poll(). Those records are not passed to the listener after the handler exits.
There is no standard error handler to "skip" the remaining records; it's an unusual requirement.
Most people would use a SeekToCurrentErrorHandler (which is now the default in the upcoming 2.5 release).
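For reference, a minimal sketch (bean name and generic types are illustrative) of wiring a SeekToCurrentErrorHandler into the listener container factory, so that the failed record and the records remaining from the poll are re-seeked and redelivered instead of being handed to the listener:
@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
        ConsumerFactory<String, String> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    // re-seek the failed record and the rest of the batch so they are fetched again
    factory.setErrorHandler(new SeekToCurrentErrorHandler());
    return factory;
}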

Kafka Streams does not use the serde given in Consumed.with(), but uses the default serde

I have created a serde for consuming from Kafka as follows:
import org.apache.kafka.connect.json.JsonDeserializer;
import org.apache.kafka.connect.json.JsonSerializer;
final Deserializer<JsonNode> jsonDeserializer = new JsonDeserializer();
final Serializer<JsonNode> jsonSerializer = new JsonSerializer();
final Serde<JsonNode> jsonNodeSerde = Serdes.serdeFrom(jsonSerializer, jsonDeserializer);
final StreamsBuilder builder = new StreamsBuilder();
final KStream<String, JsonNode> eventStream = builder
        .stream("my-test-1",
                Consumed.with(Serdes.String(), jsonNodeSerde));
but I still receive a serialization error:
Caused by: org.apache.kafka.streams.errors.StreamsException: A serializer (key: org.apache.kafka.common.serialization.StringSerializer / value: org.apache.kafka.common.serialization.ByteArraySerializer) is not compatible to the actual key or value type (key type: java.lang.String / value type: com.fasterxml.jackson.databind.node.ObjectNode). Change the default Serdes in StreamConfig or provide correct Serdes via method parameters.
Since Consumed.with() is already provided, why is the default serde still used? According to the answer written here, this should work, shouldn't it?
https://stackoverflow.com/a/48832957/3952994
Yes, the problem is that your data doesn't match the serdes.
A serializer (key: org.apache.kafka.common.serialization.StringSerializer /
value: org.apache.kafka.common.serialization.ByteArraySerializer)
is not compatible to the actual key or value type
(key type: java.lang.String /
value type: com.fasterxml.jackson.databind.node.ObjectNode).
However, the error message says the problem occurs when data is serialized, i.e. when Kafka Streams attempts to write data somewhere.
Your code snippet with Consumed, however, is about deserializing, and thus reading, data. Therefore the problem is not caused by the code snippet you shared in your question, but by code that presumably appears further down in your Java file and is not shown. (Btw, it would have helped if you had provided the full stack trace of the error.)
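For example (a sketch with an assumed output topic name, not code from the question), the serde also has to be supplied wherever the stream is written; otherwise the default serdes from StreamsConfig (here a ByteArraySerializer) are used:
// provide the JsonNode serde on the write path as well
eventStream.to("my-output-topic", Produced.with(Serdes.String(), jsonNodeSerde));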

Save Spark Dataframe into Elasticsearch - Can’t handle type exception

I have designed a simple job to read data from MySQL and save it in Elasticsearch with Spark.
Here is the code:
JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("MySQLtoEs")
                .set("es.index.auto.create", "true")
                .set("es.nodes", "127.0.0.1:9200")
                .set("es.mapping.id", "id")
                .set("spark.serializer", KryoSerializer.class.getName()));
SQLContext sqlContext = new SQLContext(sc);
// Data source options
Map<String, String> options = new HashMap<>();
options.put("driver", MYSQL_DRIVER);
options.put("url", MYSQL_CONNECTION_URL);
options.put("dbtable", "OFFERS");
options.put("partitionColumn", "id");
options.put("lowerBound", "10001");
options.put("upperBound", "499999");
options.put("numPartitions", "10");
// Load MySQL query result as DataFrame
LOGGER.info("Loading DataFrame");
DataFrame jdbcDF = sqlContext.load("jdbc", options);
DataFrame df = jdbcDF.select("id", "title", "description",
"merchantId", "price", "keywords", "brandId", "categoryId");
df.show();
LOGGER.info("df.count : " + df.count());
EsSparkSQL.saveToEs(df, "offers/product");
You can see the code is very straightforward. It reads the data into a DataFrame, selects some columns and then performs a count as a basic action on the Dataframe. Everything works fine up to this point.
Then it tries to save the data into Elasticsearch, but it fails because it cannot handle some type. You can see the error log here.
I'm not sure about why it can't handle that type. Does anyone know why this is occurring?
I'm using Apache Spark 1.5.0, Elasticsearch 1.4.4 and elasticsearch-hadoop 2.1.1.
EDIT:
I have updated the gist link with a sample dataset along with the source code.
I have also tried to use the elasticsearch-hadoop dev builds as mentioned by @costin on the mailing list.
The answer to this one was tricky, but thanks to samklr, I managed to figure out what the problem was.
The solution isn't straightforward, though, and requires some seemingly “unnecessary” transformations.
First let's talk about Serialization.
There are two aspects of serialization to consider in Spark: serialization of data and serialization of functions. In this case, it's about data serialization and thus deserialization.
From Spark’s perspective, the only thing required is setting up serialization - Spark relies by default on Java serialization which is convenient but fairly inefficient. This is the reason why Hadoop itself introduced its own serialization mechanism and its own types - namely Writables. As such, InputFormat and OutputFormats are required to return Writables which, out of the box, Spark does not understand.
With the elasticsearch-spark connector one must enable a different serialization (Kryo) which handles the conversion automatically and also does this quite efficiently.
conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
Better yet, Kryo does not require that a class implement a particular interface to be serialized, which means POJOs can be used in RDDs without any further work beyond enabling Kryo serialization.
That said, @samklr pointed out to me that Kryo needs classes to be registered before they are used.
This is because Kryo writes a reference to the class of the object being serialized (one reference is written for every object written), which is just an integer identifier if the class has been registered but is the full classname otherwise. Spark registers Scala classes and many other framework classes (like Avro Generic or Thrift classes) on your behalf.
Registering classes with Kryo is straightforward: implement the KryoRegistrator interface and override the registerClasses() method:
public class MyKryoRegistrator implements KryoRegistrator, Serializable {
    @Override
    public void registerClasses(Kryo kryo) {
        // Product POJO associated to a product Row from the DataFrame
        kryo.register(Product.class);
    }
}
Finally, in your driver program, set the spark.kryo.registrator property to the fully qualified classname of your KryoRegistrator implementation:
conf.set("spark.kryo.registrator", "MyKryoRegistrator")
Secondly, even though the Kryo serializer is set and the class registered, with the changes made in Spark 1.5, for some reason Elasticsearch couldn't deserialize the DataFrame, because the connector couldn't infer the DataFrame's SchemaType.
So I had to convert the DataFrame to a JavaRDD:
JavaRDD<Product> products = df.javaRDD().map(new Function<Row, Product>() {
    public Product call(Row row) throws Exception {
        long id = row.getLong(0);
        String title = row.getString(1);
        String description = row.getString(2);
        int merchantId = row.getInt(3);
        double price = row.getDecimal(4).doubleValue();
        String keywords = row.getString(5);
        long brandId = row.getLong(6);
        int categoryId = row.getInt(7);
        return new Product(id, title, description, merchantId, price, keywords, brandId, categoryId);
    }
});
Now the data is ready to be written into Elasticsearch:
JavaEsSpark.saveToEs(products, "test/test");
References:
Elasticsearch's Apache Spark support documentation.
Hadoop: The Definitive Guide, 4th edition, Chapter 19 (Spark) – Tom White.
User samklr.
