leftjoin on two GlobalKTables - apache-kafka-streams

I am trying to join a stream to two different GlobalKTables, treating them as lookup tables; more specifically, devices (user agent) and geocoding (IP address).
The issue is with serialization, but I don't get why. It fails on DEFAULT_VALUE_SERDE_CLASS_CONFIG, even though the topic I want to write to is serialized correctly.
//
// Set up serialization / de-serialization
private static Serde<String> stringSerde = Serdes.String();
private static Serde<PodcastData> podcastSerde = StreamsSerdes.PodCastSerde();
private static Serde<GeoCodedData> geocodedSerde = StreamsSerdes.GeoIPSerde();
private static Serde<DeviceData> deviceSerde = StreamsSerdes.DeviceSerde();
private static Serde<JoinedPodcastGeoDeviceData> podcastGeoDeviceSerde = StreamsSerdes.PodcastGeoDeviceSerde();
private static Serde<JoinedPodCastDeviceData> podcastDeviceSerde = StreamsSerdes.PodcastDeviceDataSerde();
...
GlobalKTable<String, DeviceData> deviceIDTable = builder.globalTable(kafkaProperties.getProperty("deviceid-topic"));
GlobalKTable<String, GeoCodedData> geoIPTable = builder.globalTable(kafkaProperties.getProperty("geoip-topic"));
//
// Stream from source topic
KStream<String, PodcastData> podcastStream = builder.stream(
kafkaProperties.getProperty("source-topic"),
Consumed.with(stringSerde, podcastSerde));
//
podcastStream
// left join the podcast stream to the device table, looking up the device
.leftJoin(deviceIDTable,
// get a DeviceData object from the user agent
(podcastID, podcastData) -> podcastData.getUser_agent(),
// join podcast and device and return a JoinedPodCastDeviceData object
(podcastData, deviceData) -> {
JoinedPodCastDeviceData data =
JoinedPodCastDeviceData.builder().build();
data.setPodcastObject(podcastData);
data.setDeviceData(deviceData);
return data;
})
// left join the podcast stream to the geo table, looking up the geo data
.leftJoin(geoIPTable,
// get a Geo object from the ip address
(podcastID, podcastDeviceData) -> podcastDeviceData.getPodcastObject().getIp_address(),
// join podcast and geo
(podcastDeviceData, geoCodedData) -> {
JoinedPodcastGeoDeviceData data=
JoinedPodcastGeoDeviceData.builder().build();
data.setGeoData(geoCodedData);
data.setDeviceData(podcastDeviceData.getDeviceData());
data.setPodcastData(podcastDeviceData.getPodcastObject());
return data;
})
//
.to(kafkaProperties.getProperty("sink-topic"),
Produced.with(stringSerde, podcastGeoDeviceSerde));
...
...
streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, stringSerde.getClass().getName());
streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, stringSerde.getClass().getName());
The error
ERROR java.lang.String cannot be cast to DeviceData

streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, stringSerde.getClass().getName());
Because of the value above, the application will use the String serde as the default value serde unless you specify one explicitly when creating a KStream/KTable/GlobalKTable.
Since the expected value type for deviceIDTable is DeviceData, you need to define the value serde on the GlobalKTable, as shown below:
GlobalKTable<String, DeviceData> deviceIDTable = builder.globalTable(kafkaProperties.getProperty("deviceid-topic"), Materialized.<String, DeviceData, KeyValueStore<Bytes, byte[]>>as(DEVICE_STORE)
.withKeySerde(stringSerde)
.withValueSerde(deviceSerde));
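The geoIPTable has the same problem, since its value type GeoCodedData also differs from the String default. A sketch along the same lines, reusing the serdes declared in the question (GEO_STORE is a hypothetical store name):
GlobalKTable<String, GeoCodedData> geoIPTable = builder.globalTable(
    kafkaProperties.getProperty("geoip-topic"),
    // explicitly declare key/value serdes so the String default is not used
    Materialized.<String, GeoCodedData, KeyValueStore<Bytes, byte[]>>as(GEO_STORE)
        .withKeySerde(stringSerde)
        .withValueSerde(geocodedSerde));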

Related

Iterate through Flux items and add them into a Mono object

I am working on an API which takes IDs. For each given ID, I want to download the related data from S3 and put it in a new object, let's call it Data.
class Data {
private List<S3Object> s3Objects;
//getter-setter
}
public Mono<ResponseEntity<Data>> getData(@RequestParam List<String> tagIds){
Data data = new Data();
Flux<S3Object> s3ObjectFlux = Flux.fromStream(tagIds.stream())
.parallel()
.runOn(Schedulers.boundedElastic())
.flatMap(id -> fetchResources(id))
.flatMap(idS3Object -> Mono.just(idS3Object))
.ordered((u1, u2) -> u2.hashCode() - u1.hashCode());
// how do I add it to the Data object to convert to Mono<Data>?
}
You need to collect it into a list and then map it to create a Data object as follows:
public Mono<ResponseEntity<Data>> getData(@RequestParam List<String> tagIds){
Flux<S3Object> s3ObjectFlux = Flux.fromStream(tagIds.stream())
.parallel()
.runOn(Schedulers.boundedElastic())
.flatMap(id -> fetchResources(id))
.flatMap(idS3Object -> Mono.just(idS3Object))
.ordered((u1, u2) -> u2.hashCode() - u1.hashCode());
Mono<Data> data = s3ObjectFlux.collectList()
.map(s3Objects -> new Data(s3Objects));
return data.map(ResponseEntity::ok);
}
Creating a constructor that accepts the S3 objects list is helpful:
class Data {
private List<S3Object> s3Objects;
public Data(List<S3Object> s3Objects) {
this.s3Objects = s3Objects;
}
//getter-setter
}

Query KTable in the same Application where it is created

I have a Kafka Streams application in which I read from a topic, do an aggregation and materialize it in a KTable. I then create a stream and run some logic on the stream. Now in the stream processing, I want to use some data from the aforementioned KTable. Once I start the streams app, how do I get access to the KTable again? I don't want to push the KTable to a new topic.
KStream<String, MyClass> source = builder.stream("my-topic");
KTable<Windowed<String>, Long> kTable =
source.groupBy((key, value) -> value.getKey(),
Grouped.<String, MyClass>as("repartition-1")
.withKeySerde(Serdes.String())
.withValueSerde(new MyClassSerDes()))
.windowedBy(TimeWindows.of(Duration.ofSeconds(5)))
.count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("test-store")
.withKeySerde(Serdes.String())
.withValueSerde(Serdes.Long()));
Here I want to use data from the kTable.
inputstream.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
.count(Materialized.<myKey, Long, WindowStore<Bytes, byte[]>>as("str")
.withRetention(Duration.ofMinutes(30)))
.toStream()
.filter((k, v) -> {
// Here get the count for the previous Window.
// Use that count for some computation here.
}
You can add the KTable store to a processor/transformer. For your case, you can replace the filter with flatTransform (or any sibling like transform, etc., depending on whether you need access to the key) and connect the store to the operator:
inputstream.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
.count(Materialized.<myKey, Long, WindowStore<Bytes, byte[]>>as("str")
.withRetention(Duration.ofMinutes(30))
)
.toStream()
// requires v2.2; otherwise use `transform()`
// if you don't need access to the key, consider to use `flatTransformValues` (v2.3)
.flatTransform(
() -> new Transformer<Windowed<myKey>,
Long,
List<KeyValue<Windowed<myKey>, Long>>>() {
private ReadOnlyWindowStore<myKey, Long> store;
public void init(final ProcessorContext context) {
// get a handle on the store by its name
// as specified via `Materialized` above;
// should be read-only
store = (ReadOnlyWindowStore<myKey, Long>)context.getStateStore("str");
}
public List<KeyValue<Windowed<myKey>, Long>> transform(Windowed<myKey> key,
Long value) {
// access `store` as you wish to make a filtering decision
if ( ... ) {
// record passes
return Collections.singletonList(KeyValue.pair(key, value));
} else {
// drop record
return Collections.emptyList();
}
}
public void close() {} // nothing to do
},
"str" // connect the KTable store to the transformer using its name
// as specified via `Materialized` above
);
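To illustrate the lookup the question asks about (the count of the previous window for the same key), the transform() body above might look roughly as follows. This is only a sketch: the one-minute window size is taken from the code above, the comparison against the previous count is just an example condition, and it requires java.time.Duration and java.util.Collections.
public List<KeyValue<Windowed<myKey>, Long>> transform(Windowed<myKey> key, Long value) {
    final long windowSizeMs = Duration.ofMinutes(1).toMillis();
    // fetch the count of the previous window for the same key from the connected store
    final Long previousCount = store.fetch(key.key(), key.window().start() - windowSizeMs);
    if (previousCount == null || value > previousCount) {   // example condition only
        return Collections.singletonList(KeyValue.pair(key, value)); // record passes
    }
    return Collections.emptyList(); // drop record
}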

How to handle duplicate messages using Kafka streaming DSL functions

My requirement is to skip or avoid duplicate messages (having the same key) received from the INPUT topic using the Kafka Streams DSL API.
There is a possibility of the source system sending duplicate messages to the INPUT topic in case of any failures.
FLOW -
Source System --> INPUT Topic --> Kafka Streaming --> OUTPUT Topic
Currently I am using flatMap to generate multiple keys out of the payload, but flatMap is stateless, so I am not able to avoid duplicate message processing upon receiving from the INPUT topic.
I am looking for a DSL API which can skip duplicate records received from the INPUT topic and also generate multiple key/values before sending to the OUTPUT topic.
I thought the Exactly Once configuration would be useful here to deduplicate messages received from the INPUT topic based on keys, but it looks like it's not working; probably I did not understand the usage of Exactly Once.
Could you please shed some light on it?
My requirement is to skip or avoid duplicate messages (having the same key) received from the INPUT topic using the Kafka Streams DSL API.
Take a look at the EventDeduplication example at https://github.com/confluentinc/kafka-streams-examples, which does that. You can then adapt the example with the required flatMap functionality that is specific to your use case.
Here's the gist of the example:
final KStream<byte[], String> input = builder.stream(inputTopic);
final KStream<byte[], String> deduplicated = input.transform(
// In this example, we assume that the record value as-is represents a unique event ID by
// which we can perform de-duplication. If your records are different, adapt the extractor
// function as needed.
() -> new DeduplicationTransformer<>(windowSize.toMillis(), (key, value) -> value),
storeName);
deduplicated.to(outputTopic);
and
/**
* @param maintainDurationPerEventInMs how long to "remember" a known event (or rather, an event
* ID), during the time of which any incoming duplicates of
* the event will be dropped, thereby de-duplicating the
* input.
* @param idExtractor extracts a unique identifier from a record by which we de-duplicate input
* records; if it returns null, the record will not be considered for
* de-duping but forwarded as-is.
*/
DeduplicationTransformer(final long maintainDurationPerEventInMs, final KeyValueMapper<K, V, E> idExtractor) {
if (maintainDurationPerEventInMs < 1) {
throw new IllegalArgumentException("maintain duration per event must be >= 1");
}
leftDurationMs = maintainDurationPerEventInMs / 2;
rightDurationMs = maintainDurationPerEventInMs - leftDurationMs;
this.idExtractor = idExtractor;
}
@Override
@SuppressWarnings("unchecked")
public void init(final ProcessorContext context) {
this.context = context;
eventIdStore = (WindowStore<E, Long>) context.getStateStore(storeName);
}
public KeyValue<K, V> transform(final K key, final V value) {
final E eventId = idExtractor.apply(key, value);
if (eventId == null) {
return KeyValue.pair(key, value);
} else {
final KeyValue<K, V> output;
if (isDuplicate(eventId)) {
output = null;
updateTimestampOfExistingEventToPreventExpiry(eventId, context.timestamp());
} else {
output = KeyValue.pair(key, value);
rememberNewEvent(eventId, context.timestamp());
}
return output;
}
}
private boolean isDuplicate(final E eventId) {
final long eventTime = context.timestamp();
final WindowStoreIterator<Long> timeIterator = eventIdStore.fetch(
eventId,
eventTime - leftDurationMs,
eventTime + rightDurationMs);
final boolean isDuplicate = timeIterator.hasNext();
timeIterator.close();
return isDuplicate;
}
private void updateTimestampOfExistingEventToPreventExpiry(final E eventId, final long newTimestamp) {
eventIdStore.put(eventId, newTimestamp, newTimestamp);
}
private void rememberNewEvent(final E eventId, final long timestamp) {
eventIdStore.put(eventId, timestamp, timestamp);
}
@Override
public void close() {
// Note: The store should NOT be closed manually here via `eventIdStore.close()`!
// The Kafka Streams API will automatically close stores when necessary.
}
}
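For the transform() call above to work, the window store referenced by storeName has to be registered on the StreamsBuilder. A sketch based on the structure of the Confluent example; the retention period and window size used here are placeholders, not the example's exact values:
// register the de-duplication window store used by the transformer above
final StoreBuilder<WindowStore<String, Long>> dedupStoreBuilder = Stores.windowStoreBuilder(
    Stores.persistentWindowStore(storeName,
        Duration.ofMinutes(10),  // retention period (placeholder)
        Duration.ofMinutes(10),  // window size (placeholder)
        false),                  // do not retain duplicates
    Serdes.String(),
    Serdes.Long());
builder.addStateStore(dedupStoreBuilder);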
I am looking for a DSL API which can skip duplicate records received from the INPUT topic and also generate multiple key/values before sending to the OUTPUT topic.
The DSL doesn't include such functionality out of the box, but the example above shows how you can easily build your own de-duplication logic by combining the DSL with the Processor API of Kafka Streams, with the use of Transformers.
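To also cover the "generate multiple key/values" part of the question, the de-duplicated stream can simply be followed by your flatMap. A rough sketch that reuses the names from the example above (splitRecord() is a hypothetical helper that produces the multiple key/value pairs):
// reuses `input`, `windowSize`, `storeName`, and `outputTopic` from the example above
final KStream<byte[], String> deduplicated = input.transform(
    () -> new DeduplicationTransformer<>(windowSize.toMillis(), (key, value) -> value),
    storeName);

deduplicated
    .flatMap((key, value) -> splitRecord(key, value))  // generate multiple key/value pairs
    .to(outputTopic);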
I thought the Exactly Once configuration would be useful here to deduplicate messages received from the INPUT topic based on keys, but it looks like it's not working; probably I did not understand the usage of Exactly Once.
As Matthias J. Sax mentioned in his answer, from Kafka's perspective these "duplicates" are not duplicates from the point of view of its exactly-once processing semantics. Kafka ensures that it will not introduce any such duplicates itself, but it cannot make such decisions out of the box for upstream data sources, which are a black box for Kafka.
Exactly-once can be used to ensure that consuming and processing an input topic does not result in duplicates in the output topic. However, from an exactly-once point of view, the duplicates in the input topic that you describe are not really duplicates but two regular input messages.
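For reference, exactly-once processing is enabled through the processing.guarantee setting; a minimal sketch (EXACTLY_ONCE_V2 requires Kafka Streams 3.0+ and brokers 2.5+; older versions use StreamsConfig.EXACTLY_ONCE):
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

final Properties streamsConfiguration = new Properties();
// enable exactly-once processing for the Streams application
streamsConfiguration.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);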
To remove input topic duplicates, you can use a transform() step with an attached state store (there is no built-in operator in the DSL that does what you want). For each input record, you first check if you find the corresponding key in the store. If not, you add it to the store and forward the message. If you find it in the store, you drop the input as a duplicate. Note that this will only work with a 100% correctness guarantee if you enable exactly-once processing in your Kafka Streams application. Otherwise, even if you try to deduplicate, Kafka Streams could re-introduce duplicates in case of a failure.
Additionally, you need to decide how long you want to keep entries in the store. You could use a punctuation to remove old data from the store once you are sure that no further duplicates can arrive in the input topic. One way to do this would be to store the record timestamp (or maybe the offset) in the store, too. This way, you can compare the current time with the stored record time within punctuate() and delete old records (i.e., you would iterate over all entries in the store via store#all()).
After the transform() you apply your flatMap() (or you could also merge your flatMap() code into transform() directly); a sketch of this approach follows.
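A minimal sketch of the approach described above (not the author's exact code): a Transformer with an attached key-value store that drops records whose key was already seen, plus a wall-clock punctuation that purges entries older than a retention period. The store name "dedup-store" and the retention/punctuation intervals are assumptions.
import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// register the store first, e.g.:
// builder.addStateStore(Stores.keyValueStoreBuilder(
//     Stores.persistentKeyValueStore("dedup-store"), keySerde, Serdes.Long()));
// then: stream.transform(DedupTransformer::new, "dedup-store").flatMap(...)
public class DedupTransformer<K, V> implements Transformer<K, V, KeyValue<K, V>> {
    private static final long RETENTION_MS = Duration.ofHours(1).toMillis(); // assumption
    private ProcessorContext context;
    private KeyValueStore<K, Long> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        this.context = context;
        store = (KeyValueStore<K, Long>) context.getStateStore("dedup-store");
        // punctuation: periodically delete entries older than the retention period
        context.schedule(Duration.ofMinutes(5), PunctuationType.WALL_CLOCK_TIME, now -> {
            try (KeyValueIterator<K, Long> it = store.all()) {
                while (it.hasNext()) {
                    final KeyValue<K, Long> entry = it.next();
                    if (now - entry.value > RETENTION_MS) {
                        store.delete(entry.key);
                    }
                }
            }
        });
    }

    @Override
    public KeyValue<K, V> transform(final K key, final V value) {
        if (store.get(key) != null) {
            return null; // key seen within the retention period -> drop as duplicate
        }
        store.put(key, context.timestamp()); // remember the key with its record timestamp
        return KeyValue.pair(key, value);    // forward the first occurrence
    }

    @Override
    public void close() {}
}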
It's achievable with the DSL only as well, using a SessionWindows changelog without caching:
Wrap the value with a duplicate flag
Turn the flag to true in reduce() within the time window
Filter out values with a true flag
Unwrap the original key and value
Topology:
Serde<K> keySerde = ...;
Serde<V> valueSerde = ...;
Duration dedupWindowSize = ...;
Duration gracePeriod = ...;
DedupValueSerde<V> dedupValueSerde = new DedupValueSerde<>(valueSerde);
new StreamsBuilder()
.stream("input-topic", Consumed.with(keySerde, valueSerde))
.mapValues(v -> new DedupValue<>(v, false))
.groupByKey()
.windowedBy(SessionWindows.ofInactivityGapAndGrace(dedupWindowSize, gracePeriod))
.reduce(
(value1, value2) -> new DedupValue<>(value1.value(), true),
Materialized
.<K, DedupValue<V>, SessionStore<Bytes, byte[]>>with(keySerde, dedupValueSerde)
.withCachingDisabled()
)
.toStream()
.filterNot((wk, dv) -> dv == null || dv.duplicate())
.selectKey((wk, dv) -> wk.key())
.mapValues(DedupValue::value)
.to("output-topic", Produced.with(keySerde, valueSerde));
Value wrapper:
record DedupValue<V>(V value, boolean duplicate) { }
Value wrapper SerDe (example):
public class DedupValueSerde<V> extends WrapperSerde<DedupValue<V>> {
public DedupValueSerde(Serde<V> vSerde) {
super(new DvSerializer<>(vSerde.serializer()), new DvDeserializer<>(vSerde.deserializer()));
}
private record DvSerializer<V>(Serializer<V> vSerializer) implements Serializer<DedupValue<V>> {
@Override
public byte[] serialize(String topic, DedupValue<V> data) {
byte[] vBytes = vSerializer.serialize(topic, data.value());
return ByteBuffer
.allocate(vBytes.length + 1)
.put(data.duplicate() ? (byte) 1 : (byte) 0)
.put(vBytes)
.array();
}
}
private record DvDeserializer<V>(Deserializer<V> vDeserializer) implements Deserializer<DedupValue<V>> {
@Override
public DedupValue<V> deserialize(String topic, byte[] data) {
ByteBuffer buffer = ByteBuffer.wrap(data);
boolean duplicate = buffer.get() == (byte) 1;
int remainingSize = buffer.remaining();
byte[] vBytes = new byte[remainingSize];
buffer.get(vBytes);
V value = vDeserializer.deserialize(topic, vBytes);
return new DedupValue<>(value, duplicate);
}
}
}

Custom Hive SerDe unable to SELECT column but works when I do SELECT *

I'm writing a custom SerDe and will only be using it to deserialize.
The underlying data is a Thrift binary; each row is an event log. Each event has a schema which I have access to, but we wrap the event in another schema, let's call it Message, before storing.
The reason I'm writing a SerDe instead of using the ThriftDeserializer is that, as mentioned, the underlying event is wrapped as a Message. So we first need to deserialize using the schema of Message and then deserialize the data for that event.
The SerDe works (only) when I do a SELECT * and I can deserialize the data as expected, but whenever I select a column from the table instead of doing a SELECT *, the rows are all NULL. The object inspector returned is a ThriftStructObjectInspector and the Object returned by deserialize() is a TBase.
What could cause Hive to return NULL when I select a column, but return the column data when I do a SELECT *?
Here's the SerDe class (changed some classnames):
public class MyThriftSerde extends AbstractSerDe {
private static final Log LOG = LogFactory.getLog(MyThriftSerde.class);
/* Abstracting away the deserialization of the underlying event which is wrapped in a message */
private static final MessageDeserializer myMessageDeserializer =
MessageDeserializer.getInstance();
/* Underlying event class which is wrapped in a Message */
private String schemaClassName;
private Class<?> schemaClass;
/* Used to read the input row */
public static List<String> inputFieldNames;
public static List<ObjectInspector> inputFieldOIs;
public static List<Integer> notSkipIDs;
public static ObjectInspector inputRowObjectInspector;
/* Output Object Inspector */
public static ObjectInspector thriftStructObjectInspector;
@Override
public void initialize(Configuration conf, Properties tbl) throws SerDeException {
try {
logHeading("INITIALIZE MyThriftSerde");
schemaClassName = tbl.getProperty(SERIALIZATION_CLASS);
schemaClass = conf.getClassByName(schemaClassName);
LOG.info(String.format("Building DDL for event: %s", schemaClass.getName()));
inputFieldNames = new ArrayList<>();
inputFieldOIs = new ArrayList<>();
notSkipIDs = new ArrayList<>();
/* Initialize the Input fields */
// The underlying data is stored in RCFile format, and only has 1 column, event_binary
// So we create a ColumnarStructBase for each row we deserialize.
// This ColumnarStruct only has 1 column: event_binary
inputFieldNames.add("event_binary");
notSkipIDs.add(0);
inputFieldOIs.add(LazyPrimitiveObjectInspectorFactory.LAZY_BINARY_OBJECT_INSPECTOR);
inputRowObjectInspector =
ObjectInspectorFactory.getColumnarStructObjectInspector(inputFieldNames, inputFieldOIs);
/* Output Object Inspector*/
// This is what the SerDe will return, it is a ThriftStructObjectInspector
thriftStructObjectInspector =
ObjectInspectorFactory.getReflectionObjectInspector(
schemaClass, ObjectInspectorFactory.ObjectInspectorOptions.THRIFT);
// Only for debugging
logHeading("THRIFT OBJECT INSPECTOR");
LOG.info("Output OI Class Name: " + thriftStructObjectInspector.getClass().getName());
LOG.info(
"OI Details: "
+ ObjectInspectorUtils.getObjectInspectorName(thriftStructObjectInspector));
} catch (Exception e) {
LOG.info("Exception while initializing SerDe", e);
}
}
@Override
public Object deserialize(Writable rowWritable) throws SerDeException {
logHeading("START DESERIALIZATION");
ColumnarStructBase inputLazyStruct =
new ColumnarStruct(inputRowObjectInspector, notSkipIDs, null);
LazyBinary eventBinary;
Message rowAsMessage;
TBase deserializedRow = null;
try {
inputLazyStruct.init((BytesRefArrayWritable) rowWritable);
eventBinary = (LazyBinary) inputLazyStruct.getField(0);
rowAsMessage =
myMessageDeserializer.fromBytes(eventBinary.getWritableObject().copyBytes(), null);
deserializedRow = rowAsMessage.getEvent();
LOG.info("deserializedRow.getClass(): " + deserializedRow.getClass());
LOG.info("deserializedRow.toString(): " + deserializedRow.toString());
} catch (Exception e) {
e.printStackTrace();
}
logHeading("END DESERIALIZATION");
return deserializedRow;
}
private void logHeading(String s) {
LOG.info(String.format("------------------- %s -------------------", s));
}
@Override
public ObjectInspector getObjectInspector() {
return thriftStructObjectInspector;
}
}
Context on the code:
In the underlying data, each row contains only 1 column (called event_binary), stored as a binary. The binary is a Message which contains 2 fields, "schema" + "event_data". i.e. each row is a Message which contains the underlying event's schema + data. We use the schema from Message to deserialize the data.
The SerDe first deserializes the row as a Message, extracts the event data and then deserializes the event.
I create an EXTERNAL table which points to the Thrift data using:
ADD JAR hdfs://my-jar.jar;
CREATE EXTERNAL TABLE dev_db.thrift_event_data_deserialized
ROW FORMAT SERDE 'com.test.only.MyThriftSerde'
WITH SERDEPROPERTIES (
"serialization.class"="com.test.only.TestEvent"
) STORED AS RCFILE
LOCATION 'location/of/thrift/data';
MSCK REPAIR TABLE thrift_event_data_deserialized;
Then SELECT * FROM dev_db.thrift_event_data_deserialized LIMIT 10; works as expected
But, SELECT column1_name, column2_name FROM dev_db.thrift_event_data_deserialized LIMIT 10; does not work.
Any idea what I'm missing here? Would love any help on this!

Convert ImmutableListMultimap to Map using Collectors.toMap

I would like to convert an ImmutableListMultimap<String, Character> to a Map<String, List<Character>>.
I used to do it in the non-stream way as follows:
void convertMultiMaptoList(ImmutableListMultimap<String, Character> reverseImmutableMultiMap) {
Map<String, List<Character>> result = new TreeMap<>();
for( Map.Entry<String, Character> entry: reverseImmutableMultiMap.entries()) {
String key = entry.getKey();
Character t = entry.getValue();
result.computeIfAbsent(key, x-> new ArrayList<>()).add(t);
}
//reverseImmutableMultiMap.entries().stream().collect(Collectors.toMap)
}
I was wondering how to write the same logic above in the Java 8 stream way (Collectors.toMap).
Please share your thoughts.
Well, there is already an asMap that you can use to make this easier:
Builder<String, Character> builder = ImmutableListMultimap.builder();
builder.put("12", 'c');
builder.put("12", 'c');
ImmutableListMultimap<String, Character> map = builder.build();
Map<String, List<Character>> map2 = map.asMap()
.entrySet()
.stream()
.collect(Collectors.toMap(Entry::getKey, e -> new ArrayList<>(e.getValue())));
If, on the other hand, you are OK with the return type of asMap, then it's a simple method call:
ImmutableMap<String, Collection<Character>> asMap = map.asMap();
Alternatively, you can collect the entries with groupingBy and mapping:
Map<String, List<Character>> result = reverseImmutableMultiMap.entries().stream()
.collect(groupingBy(Entry::getKey, TreeMap::new, mapping(Entry::getValue, toList())));
The important detail is mapping: it adapts the downstream collector (toList) so that it collects List<Character> instead of List<Entry<String, Character>>, according to the mapping function Entry::getValue.
groupingBy will group all entries by the String key
toList will collect all values with the same key into a list
Also, passing TreeMap::new as an argument to groupingBy will make sure you get this specific type of Map instead of the default HashMap.
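As a side note, if a view (rather than a copy) is acceptable, Guava's Multimaps.asMap on a ListMultimap already gives the typed Map<String, List<Character>> directly, so no stream is needed; a small sketch:
import com.google.common.collect.Multimaps;
import java.util.List;
import java.util.Map;

// Multimaps.asMap on a ListMultimap preserves the List value type,
// so no conversion from Collection<Character> is required
Map<String, List<Character>> view = Multimaps.asMap(reverseImmutableMultiMap);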
