KStreams turn off the internal topics for Aggregate - apache-kafka-streams

I am working on an application where the Kafka cluster has 'auto topic' creation disabled, so any topics required by the application must be created beforehand.
I have implemented a window store with aggregation:
KGroupedStream<String, String> groupedStream = builder
        .stream("topic", Consumed.with(Serdes.String(), Serdes.String()))
        .map(this::processInbound)
        .groupByKey();

WindowBytesStoreSupplier windowStoreSupplier =
        Stores.persistentWindowStore(storename, STORE_RETENTION, WINDOW_SIZE, false);

final Materialized<String, StatAggregator, WindowStore<Bytes, byte[]>> materialized =
        Materialized.<String, StatAggregator>as(windowStoreSupplier)
                .withKeySerde(Serdes.String())
                .withValueSerde(statAggregatorSerde); // application-specific Serde<StatAggregator>

groupedStream.windowedBy(TimeWindows.of(Duration.ofDays(1)).advanceBy(Duration.ofMinutes(15)))
        .aggregate(StatAggregator::new, (key, value, aggregator) -> {
            aggregator.add(value);
            return aggregator;
        }, materialized);
I need fault tolerance to be in place, so I create the changelog and repartition topics beforehand, i.e. the topics xxxxx-store-changelog and xxxxx-store-repartition already exist before the app launches.
But because I use aggregation, I see that there are other internal topics created by Kafka Streams, such as xxxxx-KSTREAM-AGGREGATE-STATE-STORE-0000000031-changelog and xxxxx-KSTREAM-AGGREGATE-STATE-STORE-0000000031-repartition. Since these topic names are derived by Kafka Streams itself, I am unable to create them in advance.
I would like to either turn off auto topic creation for the aggregation, or be able to define the names of these topics up front (via config?) so that Kafka Streams can use them for the aggregation.
Please advise.
Thanks,
Jinu
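One possible direction, sketched under the assumption that Grouped.with() (Kafka 2.1+) is available: name the grouping explicitly so the repartition topic name is derived from that name rather than generated, and keep the changelog name tied to the store name passed to Stores.persistentWindowStore(); both topics can then be created ahead of time. The name "stats-agg" below is illustrative.
// Sketch: an explicitly named grouping makes the repartition topic
// <application.id>-stats-agg-repartition instead of a generated KSTREAM-AGGREGATE-... name.
KGroupedStream<String, String> groupedStream = builder
        .stream("topic", Consumed.with(Serdes.String(), Serdes.String()))
        .map(this::processInbound)
        .groupByKey(Grouped.with("stats-agg", Serdes.String(), Serdes.String()));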

Related

Use micrometer timer with manual counter increment

I have a KafkaListener which receives messages containing a list of objects.
@KafkaListener(
        id = "dataConsumer",
        topics = "data.topic",
        groupId = "${spring.kafka.consumer.group-id}",
        containerFactory = "dataKafkaListenerContainerFactory")
public void consumeData(DataContainer message) {
    List<Data> data = message.getList();
    ...
}
The list of objects can vary in size, so per-message metrics may not be very useful.
I can get timer metrics for this method at /actuator/metrics/spring.kafka.listener?tag=name:dataConsumer-0, but the count is per message, not per element in the message's list. How can I change this metric, or create a similar one, to capture the time and count of the data elements in each message?
You can register your own Meter with the MeterRegistry; refer to the Micrometer documentation.
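A minimal sketch of that idea, assuming a MeterRegistry bean is available; the meter names (data.consumer.batch, data.consumer.elements) and tags are illustrative, not Spring Kafka defaults:
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class DataElementMetrics {

    private final Timer batchTimer;       // time spent processing one message's list
    private final Counter elementCounter; // total number of elements processed

    public DataElementMetrics(MeterRegistry registry) {
        this.batchTimer = Timer.builder("data.consumer.batch")
                .tag("listener", "dataConsumer")
                .register(registry);
        this.elementCounter = Counter.builder("data.consumer.elements")
                .tag("listener", "dataConsumer")
                .register(registry);
    }

    public void record(int elementCount, Runnable processing) {
        elementCounter.increment(elementCount); // count elements, not messages
        batchTimer.record(processing);          // record the wall-clock time of the whole batch
    }
}
In the listener you would wrap the per-message work, e.g. metrics.record(data.size(), () -> process(data)), where process() stands in for whatever handles the list.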

Creating a state store with Avro inside Kstream Consumer Processor

I have a consumer defined as below. It reads an Avro message from a topic and builds a state store of aggregated data, which is also an Avro type.
@Bean
public Consumer<KStream<String, InputEvent>> avroTest() {
    Serde<OutputEvent> serdeOutEvent = new SpecificAvroSerde<>(schemaRegistryClient);
    return st -> st.groupByKey()
            .aggregate(OutputEvent::new, (key, currentEvent, outputEvent) -> {
                // aggregate here
                return outputEvent;
            }, Materialized.with(new Serdes.StringSerde(), serdeOutEvent))
            .toStream();
}
The function is able to read messages from the topic and produce the first aggregated result, but when it tries to store it in the state store, it receives a 404 because the schema is not present:
Exception in thread "odoAvroTest-e4ef8e3e-ea1e-458c-b309-b2afefbeacec-StreamThread-1" org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=0_0, processor=KSTREAM-SOURCE-0000000000, topic=odometer, partition=0, offset=0, stacktrace=org.apache.kafka.common.errors.SerializationException: Error retrieving Avro schema: {"type":"record","name": "" .... }
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Subject not found.; error code: 40401
at io.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:226)
at io.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:252)
at io.confluent.kafka.schemaregistry.client.rest.RestService.lookUpSubjectVersion(RestService.java:319)
at io.confluent.kafka.schemaregistry.client.rest.RestService.lookUpSubjectVersion(RestService.java:307)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getIdFromRegistry(CachedSchemaRegistryClient.java:165)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getId(CachedSchemaRegistryClient.java:297)
at io.confluent.kafka.serializers.AbstractKafkaAvroSerializer.serializeImpl(AbstractKafkaAvroSerializer.java:73)
at io.confluent.kafka.serializers.KafkaAvroSerializer.serialize(KafkaAvroSerializer.java:53)
at io.confluent.kafka.streams.serdes.avro.SpecificAvroSerializer.serialize(SpecificAvroSerializer.java:65)
at io.confluent.kafka.streams.serdes.avro.SpecificAvroSerializer.serialize(SpecificAvroSerializer.java:38)
at org.apache.kafka.streams.state.internals.ValueAndTimestampSerializer.serialize(ValueAndTimestampSerializer.java:59)
at org.apache.kafka.streams.state.internals.ValueAndTimestampSerializer.serialize(ValueAndTimestampSerializer.java:50)
at org.apache.kafka.streams.state.internals.ValueAndTimestampSerializer.serialize(ValueAndTimestampSerializer.java:27)
at org.apache.kafka.streams.state.StateSerdes.rawValue(StateSerdes.java:192)
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.put(MeteredKeyValueStore.java:166)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl$KeyValueStoreReadWriteDecorator.put(ProcessorContextImpl.java:486)
at org.apache.kafka.streams.kstream.internals.KStreamAggregate$KStreamAggregateProcessor.process(KStreamAggregate.java:103)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:117)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:201)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:180)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:133)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:87)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:363)
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:199)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:425)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:912)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:819)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:788)
Do let me know if there are additional config tweaks necessary to make this work. When I change the input to a HashMap and/or a simple POJO and use a JSON Serde, the code works and creates the aggregation.
The issue here is the Schema Registry needed by the Avro Serde. When you set the value Serde in Materialized.with(), you have to pass the schema registry configuration to that Serde.
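A minimal sketch of that configuration, building on the serde from the question; the registry URL below is a placeholder for your environment:
Serde<OutputEvent> serdeOutEvent = new SpecificAvroSerde<>(schemaRegistryClient);
Map<String, Object> serdeConfig =
        Collections.singletonMap("schema.registry.url", "http://localhost:8081"); // placeholder URL
serdeOutEvent.configure(serdeConfig, false); // false = this serde is used for record values, not keys
Then pass serdeOutEvent to Materialized.with() as before.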

Kafka Streams - The state store may have migrated to another instance

I'm writing a basic application to test the Interactive Queries feature of Kafka Streams. Here is the code:
public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();

    KeyValueBytesStoreSupplier waypointsStoreSupplier = Stores.persistentKeyValueStore("test-store");
    StoreBuilder waypointsStoreBuilder = Stores.keyValueStoreBuilder(waypointsStoreSupplier, Serdes.ByteArray(), Serdes.Integer());

    final KStream<byte[], byte[]> waypointsStream = builder.stream("sample1");
    final KStream<byte[], TruckDriverWaypoint> waypointsDeserialized = waypointsStream
            .mapValues(CustomSerdes::deserializeTruckDriverWaypoint)
            .filter((k, v) -> v.isPresent())
            .mapValues(Optional::get);

    waypointsDeserialized.groupByKey().aggregate(
            () -> 1,
            (aggKey, newWaypoint, aggValue) -> {
                aggValue = aggValue + 1;
                return aggValue;
            },
            Materialized.<byte[], Integer, KeyValueStore<Bytes, byte[]>>as("test-store")
                    .withKeySerde(Serdes.ByteArray())
                    .withValueSerde(Serdes.Integer())
    );

    final KafkaStreams streams = new KafkaStreams(builder.build(), new StreamsConfig(createStreamsProperties()));
    streams.cleanUp();
    streams.start();

    ReadOnlyKeyValueStore<byte[], Integer> keyValueStore = streams.store("test-store", QueryableStoreTypes.keyValueStore());
    KeyValueIterator<byte[], Integer> range = keyValueStore.all();
    while (range.hasNext()) {
        KeyValue<byte[], Integer> next = range.next();
        System.out.println(next.value);
    }

    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
}

protected static Properties createStreamsProperties() {
    final Properties streamsConfiguration = new Properties();
    streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "random167");
    streamsConfiguration.put(StreamsConfig.CLIENT_ID_CONFIG, "client-id");
    streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    streamsConfiguration.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, Serdes.String().getClass().getName());
    streamsConfiguration.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, Serdes.Integer().getClass().getName());
    //streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10000);
    return streamsConfiguration;
}
So my problem is, every time I run this I get this same error:
Exception in thread "main" org.apache.kafka.streams.errors.InvalidStateStoreException: the state store, test-store, may have migrated to another instance.
I'm running only 1 instance of the application, and the topic I'm consuming from has only 1 partition.
Any idea what I'm doing wrong?
Looks like you have a race condition. From the Kafka Streams javadoc for KafkaStreams::start():
Start the KafkaStreams instance by starting all its threads. This function is expected to be called only once during the life cycle of the client.
Because threads are started in the background, this method does not block.
https://kafka.apache.org/10/javadoc/index.html?org/apache/kafka/streams/KafkaStreams.html
You're calling streams.store() immediately after streams.start(), but I'd wager that you're in a state where it hasn't initialized fully yet.
Since this code appears to be just for testing, add a Thread.sleep(5000) or something in there and give it a go. (This is not a solution for production.) Depending on your input rate into the topic, that should give the store a bit of time to start filling up with events so that your KeyValueIterator actually has something to process/print.
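A quick sketch of that stopgap, placed between start() and the store lookup in the question's main():
streams.start();
Thread.sleep(5000); // test-only: give the stream threads time to reach RUNNING (main must declare throws InterruptedException)
ReadOnlyKeyValueStore<byte[], Integer> keyValueStore =
        streams.store("test-store", QueryableStoreTypes.keyValueStore());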
Probably not applicable to OP but might help others:
When trying to retrieve a KTable's store, make sure the KTable's topic exists first or you'll get this exception.
I had failed to call the StoreBuilder before consuming the store.
Typically this happens for two reasons:
1. The local KafkaStreams instance is not yet ready (i.e., not yet in runtime state RUNNING, see Run-time Status Information) and thus its local state stores cannot be queried yet.
2. The local KafkaStreams instance is ready (e.g. in runtime state RUNNING), but the particular state store was just migrated to another instance behind the scenes. This may notably happen during the startup phase of a distributed application or when you are adding/removing application instances.
https://docs.confluent.io/platform/current/streams/faq.html#handling-invalidstatestoreexception-the-state-store-may-have-migrated-to-another-instance
The simplest approach is to guard against InvalidStateStoreException when calling KafkaStreams#store():
// Example: Wait until the store of type T is queryable. When it is, return a reference to the store.
public static <T> T waitUntilStoreIsQueryable(final String storeName,
                                              final QueryableStoreType<T> queryableStoreType,
                                              final KafkaStreams streams) throws InterruptedException {
    while (true) {
        try {
            return streams.store(storeName, queryableStoreType);
        } catch (InvalidStateStoreException ignored) {
            // store not yet ready for querying
            Thread.sleep(100);
        }
    }
}
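For the code in the question, the direct streams.store() call would then become:
ReadOnlyKeyValueStore<byte[], Integer> keyValueStore =
        waitUntilStoreIsQueryable("test-store", QueryableStoreTypes.keyValueStore(), streams); // caller handles InterruptedException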

Is it possible to expire data of specific key from gemfire region like redis

Hi, I am new to GemFire and I want to expire data from a GemFire region for a specific key after an idle time that I set.
I did this in Redis with the code below:
jedis.set(key, value);
config.setMaxIdle(50);
jedis.expire(key, config.getMaxIdle());
But how do I do it in GemFire?
Any help is appreciated.
Thanks.
You can control the expiration of individual keys if you configure the region to use a custom expiration. You provide an implementation of the CustomExpiry interface that can look at each entry and decide when it should expire. For example:
RegionFactory regionFactory = ...
regionFactory.setCustomEntryIdleTimeout(new CustomExpiry() {
    public ExpirationAttributes getExpiry(Entry entry) {
        if (entry.getKey().equals("XXX")) {
            return new ExpirationAttributes(50, ExpirationAction.INVALIDATE);
        }
        return null; // null means the region's default expiration applies to this entry
    }
});
If you want data in a specific region to expire, then try the following code:
Region<String, String> region = cache
        .<String, String>createClientRegionFactory(ClientRegionShortcut.CACHING_PROXY)
        .setEntryTimeToLive(new ExpirationAttributes(50))
        .create(PropertiesCache.getInstance().getProperty("region"));
It works in my situation.

Save Spark Dataframe into Elasticsearch - Can’t handle type exception

I have designed a simple job to read data from MySQL and save it in Elasticsearch with Spark.
Here is the code:
JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("MySQLtoEs")
                .set("es.index.auto.create", "true")
                .set("es.nodes", "127.0.0.1:9200")
                .set("es.mapping.id", "id")
                .set("spark.serializer", KryoSerializer.class.getName()));
SQLContext sqlContext = new SQLContext(sc);

// Data source options
Map<String, String> options = new HashMap<>();
options.put("driver", MYSQL_DRIVER);
options.put("url", MYSQL_CONNECTION_URL);
options.put("dbtable", "OFFERS");
options.put("partitionColumn", "id");
options.put("lowerBound", "10001");
options.put("upperBound", "499999");
options.put("numPartitions", "10");

// Load MySQL query result as DataFrame
LOGGER.info("Loading DataFrame");
DataFrame jdbcDF = sqlContext.load("jdbc", options);
DataFrame df = jdbcDF.select("id", "title", "description",
        "merchantId", "price", "keywords", "brandId", "categoryId");
df.show();
LOGGER.info("df.count : " + df.count());
EsSparkSQL.saveToEs(df, "offers/product");
You can see the code is very straightforward. It reads the data into a DataFrame, selects some columns and then performs a count as a basic action on the Dataframe. Everything works fine up to this point.
Then it tries to save the data into Elasticsearch, but it fails because it cannot handle some type. You can see the error log here.
I'm not sure about why it can't handle that type. Does anyone know why this is occurring?
I'm using Apache Spark 1.5.0, Elasticsearch 1.4.4 and elasticsearch-hadoop 2.1.1.
EDIT:
I have updated the gist link with a sample dataset along with the source code.
I have also tried to use the elasticsearch-hadoop dev builds as mentioned by @costin on the mailing list.
The answer to this one was tricky, but thanks to samklr I managed to figure out what the problem was.
Nevertheless, the solution isn't straightforward and involves some "unnecessary" transformations.
First, let's talk about serialization.
There are two aspects of serialization to consider in Spark: serialization of data and serialization of functions. In this case, it's about data serialization and thus deserialization.
From Spark's perspective, the only thing required is setting up serialization: Spark relies by default on Java serialization, which is convenient but fairly inefficient. This is the reason why Hadoop itself introduced its own serialization mechanism and its own types, namely Writables. As such, InputFormats and OutputFormats are required to return Writables which, out of the box, Spark does not understand.
With the elasticsearch-spark connector one must enable a different serialization (Kryo) which handles the conversion automatically and also does this quite efficiently.
conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
Note that Kryo does not require that a class implement a particular interface to be serialized, which means POJOs can be used in RDDs without any further work beyond enabling Kryo serialization.
That said, @samklr pointed out to me that Kryo needs to register classes before using them.
This is because Kryo writes a reference to the class of the object being serialized (one reference is written for every object written), which is just an integer identifier if the class has been registered but is the full classname otherwise. Spark registers Scala classes and many other framework classes (like Avro Generic or Thrift classes) on your behalf.
Registering classes with Kryo is straightforward. Create a class that implements KryoRegistrator and override the registerClasses() method:
public class MyKryoRegistrator implements KryoRegistrator, Serializable {
    @Override
    public void registerClasses(Kryo kryo) {
        // Product POJO associated to a product Row from the DataFrame
        kryo.register(Product.class);
    }
}
Finally, in your driver program, set the spark.kryo.registrator property to the fully qualified classname of your KryoRegistrator implementation:
conf.set("spark.kryo.registrator", "MyKryoRegistrator")
Secondly, even with the Kryo serializer set and the class registered, due to changes made in Spark 1.5, Elasticsearch could not deserialize the DataFrame because the connector could not infer the DataFrame's SchemaType.
So I had to convert the DataFrame to a JavaRDD:
JavaRDD<Product> products = df.javaRDD().map(new Function<Row, Product>() {
    public Product call(Row row) throws Exception {
        long id = row.getLong(0);
        String title = row.getString(1);
        String description = row.getString(2);
        int merchantId = row.getInt(3);
        double price = row.getDecimal(4).doubleValue();
        String keywords = row.getString(5);
        long brandId = row.getLong(6);
        int categoryId = row.getInt(7);
        return new Product(id, title, description, merchantId, price, keywords, brandId, categoryId);
    }
});
Now the data is ready to be written into Elasticsearch:
JavaEsSpark.saveToEs(products, "test/test");
References:
Elasticsearch's Apache Spark support documentation.
Hadoop: The Definitive Guide, 4th ed., Chapter 19 (Spark), Tom White.
User samklr.
