KStreams Grouping by multiple fields to get count - apache-kafka-streams

So I have a bunch of records in a topic like the one below. I can create the GroupBy in KSQLDB with no problem as it is more SQL than anything else. But I have been tasked to move it over to Java KStreams and am failing miserably.
Can someone guide me on the topology for grouping first by user_id, then by object_id, then by day? I don't ask this lightly: I have tried over and over with state stores and many examples, but I am just chasing my tail. Basically, I would like to know how many times a user looked at a specific object on a given day.
Anything on how to accomplish this would be greatly appreciated.
{
"entrytimestamp": "2020-05-04T15:21:01.897",
"user_id": "080db36a-f205-4e32-a324-cc375b75d167",
"object_id": "fdb084f7-5367-4776-a5ae-a10d6e898d22"
}

You can create a composite key and then group by that key, like:
KStream<String, Message> stream = builder.stream(MESSAGES, Consumed.with(Serdes.String(), jsonSerde));
KStream<String, Message> newKeyStream = stream.selectKey((key, message) ->
String.format("%s-%s-%s",
message.userId(),
message.objectId(),
LocalDate.ofInstant(Instant.ofEpochMilli(message.timestamp()), ZoneId.systemDefault())));
KGroupedStream<String, Message> groupedBy = newKeyStream.groupByKey();
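To make the composite-key idea concrete, here is a minimal plain-Java sketch (no Kafka dependency) of how such a key could be derived from the sample record and what a per-key count amounts to. The class name CompositeKeyCount, the helper names, and the "|" separator are hypothetical, and the sketch assumes entrytimestamp is an ISO-8601 local date-time string as in the sample. In the actual topology the same key function would go inside selectKey, and calling count() on the KGroupedStream would maintain these counts as a KTable.

```java
import java.time.LocalDateTime;
import java.util.LinkedHashMap;
import java.util.Map;

class CompositeKeyCount {

    // Builds "userId|objectId|yyyy-MM-dd" from the fields of one record.
    // Assumption: entryTimestamp looks like "2020-05-04T15:21:01.897".
    public static String compositeKey(String userId, String objectId, String entryTimestamp) {
        String day = LocalDateTime.parse(entryTimestamp).toLocalDate().toString();
        return userId + "|" + objectId + "|" + day;
    }

    // Counts records per composite key; each row is {userId, objectId, entryTimestamp}.
    // This mirrors what groupByKey().count() would compute per key.
    public static Map<String, Long> countViews(String[][] rows) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String[] row : rows) {
            counts.merge(compositeKey(row[0], row[1], row[2]), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: two views on one day, one on the next day.
        String[][] rows = {
            {"user-1", "object-1", "2020-05-04T15:21:01.897"},
            {"user-1", "object-1", "2020-05-04T18:02:11.120"},
            {"user-1", "object-1", "2020-05-05T09:00:00.000"},
        };
        countViews(rows).forEach((key, count) -> System.out.println(key + " -> " + count));
    }
}
```

Using a separator that cannot occur in the IDs avoids two different field combinations collapsing into the same key.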

Related

MongoTemplate, Criteria and Hashmap

Good Morning.
I'm starting to learn some mongo right now.
I'm facing this problem right now, and I'm starting to wonder whether this is the best approach to this "task", or whether it is better to turn around and solve the "problem" another way.
My goal is to iterate over a simple map whose keys are single values and whose values are vectors/arrays.
My test map will be received by a REST layer.
{
"1":["1","2","3"]
}
Now, after some logic, I need to use the DAO to look into the DB.
The key will be the "realm"; the values inside the vector are the "castles".
Every realm has some castles and every castle has some "rules".
I need to find every rule for each available combination of realm and castle.
AccessLevel is a POJO annotated with @Document and it has various params, such as castle and realm (both simple int).
So the idea is to iterate over the map and write one long query covering every combination of key and value.
public AccessLevel searchAccessLevel(Map<String,Integer[]> request){
Query q = new Query();
Criteria c = new Criteria();
request.forEach((k,v)-> {
for (int i: Arrays.asList(v)
) {
q.addCriteria(c.andOperator(
Criteria.where("realm").is(k),
Criteria.where("castle").is(v))
);
}
});
List<AccessLevel> response=db.find(q,AccessLevel.class);
for (AccessLevel x: response
) {
System.out.println(x.toString());
}
As you can see, I'm facing an error concerning $and.
Due to limitations of the org.bson.Document, you can't add a second '$and' expression specified as [...]
It seems Mongo can't handle multiple $and expressions, something I'm pretty used to abusing in SQL:
select * from a where id =1 and id=2 and id=3 and id=4
(not the best, since I could use IN(), but SQL allows it)
So, the point is: can Mongo actually work this way, so that I just need to dig more into the problem, or do I need another approach, like using criterion.in() and making N queries via MongoTemplate, one for every key in my Map?

Time semantics between KStream and KTable

I am trying to build the following topology:
Using Debezium Connectors, I am pulling 2 tables (let's call them tables A and DA). As per DBZ, the topics where the table rows are stored have the structure { before: "...", after: "..." }.
First steps in my topology are to create "clean" KStreams off these two "table" topics. The sub-topology there looks roughly like this:
private static KStream<String, TABLE_A.Value> getTableARowByIdStream(
StreamsBuilder builder, Properties streamsConfig) {
return builder
.stream("TABLE_A", Consumed.withTimestampExtractor(Application::getRowDate))
.filter((key, envelope) -> [ some filtering condition ] )
.map((key, envelope) -> [ maps to TABLE_A.Value ] )
.through(tableRowByIdTopicName);
}
Notice that I am assigning the record time explicitly because the table rows will be CDC'ed "years" after they were originally published. What the function is doing at the moment is faking the time starting at 2010-01-01 and, using an AtomicInteger, adding 1 millisecond for each consumed entity. It does this for table A but not for DA (I will explain why later).
Phase 2 of the topology is to build 1 KTable based on the "cleaned" topic for table A, like this:
private static KTable<String, EntityInfoList> getEntityInfoListById(
KStream<String, TABLE_A.Value> tableAByIdStream) {
return tableAByIdStream
.map((key, value) -> [ some mapping ] )
.groupByKey()
.aggregate(() -> [ builds up a EntityInfoList object ] ));
}
Finally, with the KTable ready, I'm joining it with the KStream over DA like so:
private static KStream<String, OutputTopicEntity> getOutputTopicEntityStream(
KStream<String, Table_DA.Value> tableDAStream,
KTable<String, EntityInfoList> tableA_KTable) {
KStream<String, Table_DA>[] branches = tableDAStream.branch(
(key, value) -> [ some logic ],
(key, value) -> true);
KStream<String, OutputTopicEntity> internalAccountRefStream = branches[0]
.join(
tableA_KTable,
(streamValue, tableValue) -> [ some logic to build a list of OutputTopicEntity ])
.flatMap((key, listValue) -> [ some logic to flatten it ]));
[ similar logic with branch[1] ]
}
My problem is that, despite the fact that I am "faking" the time for records coming from the Table_A topic (I've verified that they are referencing 2010/01/01 using kafkacat) and entries in Table_DA (the stream side of the join) have timestamps around today (2019/08/14), it doesn't seem like Kafka Streams is holding back from reading the entries of the Table_DA KStream until it has ingested all records from Table_A into the KTable.
As a result of that, I don't have all the "join hits" that I was expecting and it is also nondeterministic. My understanding based on this sentence from What are the differences between KTable vs GlobalKTable and leftJoin() vs outerJoin()? was the opposite:
For stream-table join, Kafka Stream align record processing ordered based on record timestamps. Thus, the update to the table are aligned with the records of you stream.
My experience so far is this is not happening. I can also easily see how my application continues churning through the Table_A topic way after it has consumed all entries in Table_DA stream (it happens to be 10 times smaller).
Am I doing something wrong?
Timestamp synchronization is best effort before the 2.1.0 release (cf. https://issues.apache.org/jira/browse/KAFKA-3514).
As of 2.1.0, timestamps are synchronized strictly. However, if one input does not have any data, Kafka Streams will "enforce" processing as described in KIP-353 to avoid blocking forever. If you have bursty inputs and want to "block" processing for some time when one input has no data, you can increase the configuration parameter max.task.idle.ms (default is 0), which was introduced in 2.1.0 via KIP-353.
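As a rough illustration of the setting mentioned above, the following sketch builds a Properties object with max.task.idle.ms raised from its default. The class name, the application id, the broker address, and the 5-second value are all hypothetical placeholders; the key is written as a plain string so the snippet compiles without the Kafka library (with Kafka Streams on the classpath, StreamsConfig.MAX_TASK_IDLE_MS_CONFIG names the same key).

```java
import java.util.Properties;

class IdleConfigSketch {

    // Returns Streams properties that let a task wait up to idleMs
    // for data on an input that is currently empty before it proceeds.
    public static Properties withMaxTaskIdle(long idleMs) {
        Properties props = new Properties();
        props.put("application.id", "table-join-app");      // hypothetical name
        props.put("bootstrap.servers", "localhost:9092");   // hypothetical broker
        props.put("max.task.idle.ms", Long.toString(idleMs));
        return props;
    }

    public static void main(String[] args) {
        // Assumed value: wait up to 5 seconds; tune to how bursty the inputs are.
        Properties props = withMaxTaskIdle(5_000L);
        System.out.println(props.getProperty("max.task.idle.ms"));
    }
}
```

The resulting Properties would then be passed to the KafkaStreams constructor along with the topology.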

Spring Kafka Stream doesn't get written

I'm writing a Spring Boot (2.1.4) app trying to use Spring Cloud Streams for Kafka.
What I'm trying to do is maintain a list of sensors on one topic ("sensors"). OTOH, I have incoming data on the other topic ("data"). What I'm trying to achieve is that when I get data for a sensor I don't already have, I want to add it to the sensor list.
To do that, I create a KTable<String, Sensor> from the sensors topic, map the temperature topic to the pure sensor's data (in this case, its name) and do an outer join with a ValueJoiner that retains the sensor if present, otherwise use the reading's sensor. Then, I write the result back to the sensors topic.
KTable<String, Sensor> sensorTable = ...;
KStream<String, SensorData> sensorDataStream = ...;
// get sensors providing measurements
KTable<String, Sensor> sensorsFromData =
sensorDataStream.groupByKey()
.aggregate(
Sensor::new,
(k, v, s) -> {
s.setName(k);
return s;
},
Materialized.with(Serdes.String(), SensorSerde.SERDE));
// join both sensor tables, preferring the existing ones
KTable<String, Sensor> joinedSensorTable =
sensorTable.outerJoin(
sensorsFromData,
// only use sensors from measurements if sensor not already present
(ex, ft) -> (ex != null) ? ex : ft,
Materialized.<String, Sensor, KeyValueStore<Bytes, byte[]>>as(SENSORS_TABLE)
.withKeySerde(Serdes.String()).withValueSerde(SensorSerde.SERDE));
// write to new topic for downstream services
joinedSensorTable.toStream();
This works fine if I create this using a StreamsBuilder, i.e. if the sensorTable and sensorDataStream are coming from something like builder.table("sensors", Consumed.with(Serdes.String(), SensorSerde.SERDE)).
However, I'm trying to use Spring Stream Binding for this, i.e. the above code is wrapped in
@Configuration
@EnableBinding(SensorTableBinding.class)
class StreamConfiguration {
static final String SENSORS_TABLE = "sensors-table";
@StreamListener
@SendTo("sensorsOut")
private KStream<String, Sensor> getDataFromData
(@Input("sensors") KTable<String, Sensor> sensorTable,
@Input("data") KStream<String, SensorData> sensorDataStream) {
// ...
return joinedSensorTable.toStream();
}
}
with a
interface SensorTableBinding {
@Input("sensors")
KTable<String, Sensor> sensorStream();
@Output("sensorsOut")
KStream<String, Sensor> sensorOutput();
@Input("data")
KStream<String, SensorData> sensorDataStream();
}
Here is the spring stream section of the application.properties:
spring.cloud.stream.kafka.streams.binder.configuration.default.key.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
spring.cloud.stream.kafka.streams.binder.configuration.default.value.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
spring.cloud.stream.kafka.binder.brokers: ${spring.kafka.bootstrap-servers}
spring.cloud.stream.kafka.binder.configuration.auto.offset.reset: latest
spring.cloud.stream.kafka.binder.bindings.sensors.group: sensor-service
spring.cloud.stream.kafka.binder.bindings.sensors.destination: sensors
spring.cloud.stream.kafka.binder.bindings.sensorsOut.destination: sensors
spring.cloud.stream.kafka.binder.data.group: sensor-service
spring.cloud.stream.kafka.binder.data.destination: data
The stream gets initialized fine, and the join is performed (the key-value-store is filled properly), however, the resulting stream is never written to the "sensors" topic.
Why? Am I missing something?
Also: I'm sure there's a better way to de/serialize my objects from/to JSON using an existing Serde, rather than having to declare classes of my own to add to the processing (SensorSerde/SensorDataSerde are thin delegation wrapper to an ObjectMapper)?
Turns out the data was written after all, but to the wrong topic, namely sensorsOut.
The reason was the configuration. Instead of
spring.cloud.stream.kafka.binder.bindings.sensors.destination: sensors
spring.cloud.stream.kafka.binder.bindings.sensorsOut.destination: sensors
the topics are configured with this:
spring.cloud.stream.bindings.sensors.destination: sensors
spring.cloud.stream.bindings.sensorsOut.destination: sensors
For the sensors and data topics, that didn't matter, because the binding's name was the same as the topic; but since Spring couldn't find a proper destination for the output, it used the binding's name sensorsOut and wrote the data there.
As a note, the whole configuration setup around these is very confusing. The individual items are documented, but it's hard to tell for each to which configuration prefix they belong. Looking into the source code doesn't help either, because at that level what's passed around are Maps with the key stripped of the prefix at runtime, so it's really hard to tell where the data is coming from and what it will contain.
IMO it would really help to have actual @ConfigurationProperties-like data classes passed around, which would make it so much easier to understand.

Kafka: Efficiently join windowed aggregates to events

I'm prototyping a fraud application. We'll frequently have metrics like "total amount of cash transactions in the last 5 days" that we need to compare against some threshold to determine if we raise an alert.
We're looking to use Kafka Streams to create and maintain the aggregates and then create an enhanced version of the incoming transaction that has the original transaction fields plus the aggregates. This enhanced record gets processed by a downstream rules system.
I'm wondering the best way to approach this. I've prototyped creating the aggregates with code like this:
TimeWindows twoDayHopping = TimeWindows.of(TimeUnit.DAYS.toMillis(2))
.advanceBy(TimeUnit.DAYS.toMillis(1));
KStream<String, AdditiveStatistics> aggrStream = transactions
.filter((key,value)->{
return value.getAccountTypeDesc().equals("P") &&
value.getPrimaryMediumDesc().equals("CASH");
})
.groupByKey()
.aggregate(AdditiveStatistics::new,
(key,value,accumulator)-> {
return AdditiveStatsUtil
.advance(value.getCurrencyAmount(),accumulator);
},
twoDayHopping,
metricsSerde,
"sas10005_store")
.toStream()
.map((key,value)-> {
value.setTransDate(key.window().start());
return new KeyValue<String, AdditiveStatistics>(key.key(),value);
})
.through(Serdes.String(),metricsSerde,datedAggrTopic);
This creates a store-backed stream that has one record per key per window. I then join the original transactions stream to this windowed stream to produce the final output to a topic:
JoinWindows joinWindow = JoinWindows.of(TimeUnit.DAYS.toMillis(1))
.before(TimeUnit.DAYS.toMillis(1))
.after(-1)
.until(TimeUnit.DAYS.toMillis(2)+1);
KStream<String,Transactions10KEnhanced> enhancedTrans = transactions.join(aggrStream,
(left,right)->{
Transactions10KEnhanced out = new Transactions10KEnhanced();
out.setAccountNumber(left.getAccountNumber());
out.setAccountTypeDesc(left.getAccountTypeDesc());
out.setPartyNumber(left.getPartyNumber());
out.setPrimaryMediumDesc(left.getPrimaryMediumDesc());
out.setSecondaryMediumDesc(left.getSecondaryMediumDesc());
out.setTransactionKey(left.getTransactionKey());
out.setCurrencyAmount(left.getCurrencyAmount());
out.setTransDate(left.getTransDate());
if(right != null) {
out.setSum2d(right.getSum());
}
return out;
},
joinWindow);
This produces the correct results, but it seems to run for quite a while, even with a low number of records. I'm wondering if there's a more efficient way to achieve the same result.
It's a config issue; cf. http://docs.confluent.io/current/streams/developer-guide.html#memory-management
Disabling caching by setting the cache size to zero (parameter cache.max.bytes.buffering in StreamsConfig) will resolve the "delayed" delivery to the output topic.
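A minimal sketch of that setting, under the same kind of assumptions as any standalone config example: the class name, application id, and broker address below are hypothetical, and the key is given as a plain string so the snippet compiles without the Kafka library (StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG names the same key when Kafka Streams is on the classpath).

```java
import java.util.Properties;

class NoCacheConfigSketch {

    // Returns Streams properties with the record cache disabled, so each
    // aggregate update is forwarded downstream instead of being buffered.
    public static Properties withCachingDisabled() {
        Properties props = new Properties();
        props.put("application.id", "fraud-aggregates");    // hypothetical name
        props.put("bootstrap.servers", "localhost:9092");   // hypothetical broker
        props.put("cache.max.bytes.buffering", "0");        // 0 bytes = no caching
        return props;
    }

    public static void main(String[] args) {
        System.out.println(withCachingDisabled().getProperty("cache.max.bytes.buffering"));
    }
}
```

The trade-off is more output records: without the cache, every intermediate update of a window aggregate is emitted rather than only the latest one per flush.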
You might also read this blog post for some background information about Streams design: https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/

Hbase - Hadoop : TableInputFormat extension

I am using an HBase table as my input, whose keys I have pre-processed so that each consists of a number concatenated with the respective row ID. I want to be sure that all rows whose keys start with the same number will be processed by the same mapper in an M/R job. I am aware that this could be achieved by extending TableInputFormat, and I have seen one or two posts about extending this class, but I am looking for the most efficient way to do this in particular.
If anyone has any ideas, please let me know.
You can use a PrefixFilter in your scan.
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PrefixFilter.html
And parallelize the launch of your different mappers using a Future:
final Future<Boolean> newJobFuture = executor.submit(new Callable<Boolean>() {
@Override
public Boolean call() throws Exception {
Job mapReduceJob = MyJobBuilder.createJob(args, thePrefix,
...);
return mapReduceJob.waitForCompletion(true);
}
});
But I believe what you are looking for is more of a reducer-side approach.
