using kafka-streams to create a new KStream containing multiple aggregations - apache-kafka-streams

I am sending JSON messages containing details about a web service request and response to a Kafka topic. I want to process each message as it arrives in Kafka using Kafka Streams and send the results as a continuously updated summary (JSON message) to a WebSocket to which a client is connected.
The client will then parse the JSON and display the various counts/summaries on a web page.
Sample input messages are shown below:
{
  "reqrespid":"048df165-71c2-429c-9466-365ad057eacd",
  "reqDate":"30-Aug-2017",
  "dId":"B198693",
  "resp_UID":"N",
  "resp_errorcode":"T0001",
  "resp_errormsg":"Unable to retrieve id details. DB Procedure error",
  "timeTaken":11,
  "timeTakenStr":"[0 minutes], [0 seconds], [11 milli-seconds]",
  "invocation_result":"T"
}
{
  "reqrespid":"f449af2d-1f8e-46bd-bfda-1fe0feea7140",
  "reqDate":"30-Aug-2017",
  "dId":"G335887",
  "resp_UID":"Y",
  "resp_errorcode":"N/A",
  "resp_errormsg":"N/A",
  "timeTaken":23,
  "timeTakenStr":"[0 minutes], [0 seconds], [23 milli-seconds]",
  "invocation_result":"S"
}
{
  "reqrespid":"e71b802d-e78b-4dcd-b100-fb5f542ea2e2",
  "reqDate":"30-Aug-2017",
  "dId":"X205014",
  "resp_UID":"Y",
  "resp_errorcode":"N/A",
  "resp_errormsg":"N/A",
  "timeTaken":18,
  "timeTakenStr":"[0 minutes], [0 seconds], [18 milli-seconds]",
  "invocation_result":"S"
}
As the stream of messages comes into Kafka, I want to compute the following on the fly:
* total number of requests, i.e. a count of all messages
* total number of requests with invocation_result equal to 'S'
* total number of requests with invocation_result not equal to 'S'
* total number of requests with invocation_result equal to 'S' and UID equal to 'Y'
* total number of requests with invocation_result equal to 'S' and UID equal to 'N'
* minimum time taken, i.e. min(timeTaken)
* maximum time taken, i.e. max(timeTaken)
* average time taken, i.e. avg(timeTaken)
and write them out to a KStream whose new key is the reqDate value and whose new value is a JSON message containing the computed values, as shown below for the 3 sample messages above:
{
"total_cnt":3, "num_succ":2, "num_fail":1, "num_succ_data":2,
"num_succ_nodata":0, "num_fail_biz":0, "num_fail_tech":1,
"min_timeTaken":11, "max_timeTaken":23, "avg_timeTaken":17.3
}
I am new to Kafka Streams. How do I compute the multiple counts over differing columns, all in one step or as a chain of steps? Would Apache Flink or Calcite be more appropriate, since my understanding of a KTable suggests that you can only have a key (e.g. 30-AUG-2017) and a single column value (e.g. a count, say 3)? I need a resulting table structure with one key and multiple count values.
All help is very much appreciated.

You can just do a complex aggregation step that computes all those at once. I am just sketching the idea:
class AggResult {
    long total_cnt = 0;
    long num_succ = 0;
    // and many more
}

stream.groupBy(...).aggregate(
    new Initializer<AggResult>() {
        @Override
        public AggResult apply() {
            return new AggResult();
        }
    },
    new Aggregator<KeyType, JSON, AggResult>() {
        @Override
        public AggResult apply(KeyType key, JSON value, AggResult aggregate) {
            ++aggregate.total_cnt;
            if (value.get("invocation_result").equals("S")) {
                ++aggregate.num_succ;
            }
            // add more conditions to get all the other aggregate results
            return aggregate;
        }
    },
    // other parameters (serdes, state store) omitted for brevity
)
.toStream()
.to("result-topic");
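To cover the min/max/average part of the question as well, here is a rough extension of that sketch; the field and method names below are illustrative, not from the original answer. The aggregator would simply call update() with the two fields it reads from each incoming JSON message:

class AggResult {
    long total_cnt = 0;
    long num_succ = 0;
    long num_fail = 0;
    long min_timeTaken = Long.MAX_VALUE;
    long max_timeTaken = Long.MIN_VALUE;
    long sum_timeTaken = 0;   // kept internally so the average can be derived

    double avg_timeTaken() {
        return total_cnt == 0 ? 0.0 : (double) sum_timeTaken / total_cnt;
    }

    AggResult update(int timeTaken, String invocationResult) {
        ++total_cnt;
        if ("S".equals(invocationResult)) {
            ++num_succ;
        } else {
            ++num_fail;
        }
        min_timeTaken = Math.min(min_timeTaken, timeTaken);
        max_timeTaken = Math.max(max_timeTaken, timeTaken);
        sum_timeTaken += timeTaken;
        return this;
    }
}

Since an average cannot be aggregated incrementally on its own, the sketch keeps a running sum and derives the average on demand; a final mapValues step could then serialize the AggResult into the summary JSON before writing it to the result topic.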

Related

Spring Batch remote partitioning: pushing huge data into Kafka during partitioning

I have implemented Spring Batch remote partitioning. Now I have to partition 10 billion IDs and push them into partitions. The IDs will be fetched from Elasticsearch and put into partitions, which in turn will be pushed into Kafka.
@Override
public Map<String, ExecutionContext> partition(int gridSize) {
    Map<String, ExecutionContext> map = new HashMap<>(gridSize);
    AtomicInteger partitionNumber = new AtomicInteger(1);
    try {
        for (int i = 0; i < n; i++) {
            List<Integer> ids = //fetch id from elastic
            ExecutionContext context = new ExecutionContext();
            context.put("ids", ids); // the fetched IDs go into the worker's context
            map.put("partition" + partitionNumber.getAndIncrement(), context);
        }
        System.out.println("Partitions Created");
    } catch (IOException e) {
        e.printStackTrace();
    }
    return map;
}
I cannot fetch all IDs and push them into the map at once, otherwise I will run out of memory. I want one batch of IDs to be pushed to the queue before the next batch is fetched.
Can this be done through spring batch?
If you want to use partitioning, you have to find a way to partition the input dataset with a given key. Without a partition key, you can't really use partitioning (with or without Spring Batch).
If your IDs are defined by a sequence that can be divided into partitions, you don't have to fetch 10 billion IDs, partition them and put each partition (i.e. all IDs of each partition) in the execution context of workers. What you can do is find the max ID, create ranges of IDs and assign them to distinct workers. For example:
Partition 1: 0 - 10000
Partition 2: 10001 - 20000
etc
If your IDs are not defined by a sequence and cannot be partitioned by range, then you need to find another key (or a composite key) that allows you to partition data based on another criteria. Otherwise, (remote) partitioning is not an option for you.
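As a rough illustration of that range-based approach (the class, constructor argument and context key names here are made up, not from the original post), a partitioner could put only the range boundaries into each worker's execution context and let each worker query Elasticsearch for its own range:

import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Illustrative range partitioner: splits the ID space [0, maxId] into gridSize ranges.
public class IdRangePartitioner implements Partitioner {

    private final long maxId;   // e.g. looked up once from Elasticsearch

    public IdRangePartitioner(long maxId) {
        this.maxId = maxId;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>(gridSize);
        long rangeSize = (maxId / gridSize) + 1;
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            context.putLong("minId", i * rangeSize);
            context.putLong("maxId", Math.min((i + 1) * rangeSize - 1, maxId));
            partitions.put("partition" + (i + 1), context);
        }
        return partitions;
    }
}

Only two numbers per partition travel to the workers (over Kafka in the remote-partitioning case), so the partitioning step's memory footprint stays constant regardless of how many IDs there are.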

Kafka Streams: Add sequence to each message within a group of messages

Setup:
* Kafka 2.5
* Kafka Streams 2.4
* Deployment to OpenShift (containerized)
Objective:
Group a set of messages from a topic using a set of value attributes and assign a unique group identifier. This can be achieved by using selectKey and groupByKey:
originalStreamFromTopic
    .selectKey((k, v) -> String.join("|", v.attribute1, v.attribute2))
    .groupByKey();

groupedStream.mapValues((k, v) -> {
    v.setGroupKey(k);
    return v;
});
For each message within a specific group, create a new message with an itemCount number as one of the attributes, e.g. a group with key "keypart1|keyPart2" can have 10 messages and each of those messages should have an incremental id from 1 through 10.
Options considered:
* aggregate?
* count plus some additional StateStore-based implementation.
One of the options listed above could make use of a couple of state stores:
* state store 1 -> mapping of each groupId to an individual item (KTable)
* state store 2 -> count per groupId (KTable)
A join of these 2 tables would stamp a sequence on the messages as they get published to the final topic.
Other statistics:
The average number of messages per group would be in the low thousands, except for outlier cases where it can go up to 500k.
In general, the candidates for a group should be available on the source within a span of 15 minutes at most.
The following points are of concern from an optimal-solution perspective:
* I am still not clear how I would be able to stamp a sequence number on the messages unless some kind of state store is used to keep track of the messages published within a group.
* The use of KTables and state stores (either explicitly, or implicitly through the use of a KTable) would add considerably to the state store size.
* Given that the problem involves some kind of stateful processing, the state store can't be avoided, but any possible optimizations would be useful.
Any thoughts or references to similar patterns would be helpful.
You can use one state store with which you maintain the ID for each composite key. When you get a message you select a new composite key and then you lookup the next ID for the composite key in the state store. You stamp the message with the new ID that you just looked up. Finally, you increase the ID and write it back to the state store.
Code-wise, it would be something like:
// create the state store that maintains the next ID per composite key
StoreBuilder<KeyValueStore<String, Long>> keyValueStoreBuilder = Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore("idMaintainer"),
    Serdes.String(),
    Serdes.Long()
);
// add the store to the topology
builder.addStateStore(keyValueStoreBuilder);

originalStreamFromTopic
    .selectKey((k, v) -> String.join("|", v.attribute1, v.attribute2))
    .repartition()
    .transformValues(() -> new ValueTransformer<V, NewValueType>() {
        private KeyValueStore<String, Long> state;

        @Override
        public void init(ProcessorContext context) {
            state = (KeyValueStore<String, Long>) context.getStateStore("idMaintainer");
        }

        @Override
        public NewValueType transform(V value) {
            // your logic to:
            // - get the ID for the new composite key,
            // - stamp the record
            // - increase the ID
            // - write the ID back to the state store
            // - return the stamped record
        }

        @Override
        public void close() {
        }
    }, "idMaintainer")
    .to("output-topic");
You do not need to worry about concurrent access to the state store, because in Kafka Streams the same keys are processed by a single task and tasks do not share state stores. That means records with the same composite key will be processed by one single task that exclusively maintains the IDs for those composite keys in its state store.
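To make the placeholder concrete, the transform() body could look roughly like the following sketch; setItemCount() is a hypothetical setter on your value type, and the sketch assumes the stamped value type is the same as the input type:

@Override
public NewValueType transform(V value) {
    // Reconstruct the composite key from the value attributes (as done in selectKey above)
    String compositeKey = String.join("|", value.attribute1, value.attribute2);

    // Look up the next ID for this composite key; start at 1 if the key is unseen
    Long nextId = state.get(compositeKey);
    if (nextId == null) {
        nextId = 1L;
    }

    // Stamp the record (setItemCount is a hypothetical setter on the value type)
    value.setItemCount(nextId);

    // Increase the ID and write it back to the state store
    state.put(compositeKey, nextId + 1);

    // Return the stamped record
    return value;
}

Because the put back to the store happens in the same task that read the record, the counter for each composite key increases by exactly one per message, which yields the 1..n sequence per group.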

Could using changelogs cause a bottleneck for the app itself?

I have a Spring Cloud Kafka Streams application that rekeys incoming data to be able to join two topics, then selects keys, maps values and aggregates data. Over time the consumer lag seems to increase, and scaling by adding multiple instances of the app doesn't help at all. With every instance the consumer lag seems to keep increasing.
I scaled the instances up and down from 1 to 18, but no big difference is noticeable. The number of messages it lags behind keeps increasing every 5 seconds, independent of the number of instances.
KStream<String, MappedOriginalSensorData> flattenedOriginalData = originalData
        .flatMap(flattenOriginalData())
        .through("atl-mapped-original-sensor-data-repartition", Produced.with(Serdes.String(), new MappedOriginalSensorDataSerde()));

//#2. Save modelid and algorithm parts of the key of the errorscore topic and reduce the key
//    to installationId:assetId:tagName
//    Repartition ahead of time, avoiding multiple repartition topics and thereby duplicating data
KStream<String, MappedErrorScoreData> enrichedErrorData = errorScoreData
        .map(enrichWithModelAndAlgorithmAndReduceKey())
        .through("atl-mapped-error-score-data-repartition", Produced.with(Serdes.String(), new MappedErrorScoreDataSerde()));

return enrichedErrorData
        //#3. Join
        .join(flattenedOriginalData, join(),
                JoinWindows.of(
                        // allow messages within one second to be joined together based on their timestamp
                        Duration.ofMillis(1000).toMillis())
                        // configure the retention period of the local state store involved in this join
                        .until(Long.parseLong(retention)),
                Joined.with(
                        Serdes.String(),
                        new MappedErrorScoreDataSerde(),
                        new MappedOriginalSensorDataSerde()))
        //#4. Set installation:assetid:modelinstance:algorithm::tag key back
        .selectKey((k, v) -> v.getOriginalKey())
        //#5. Map to ErrorScore (basically removing the originalKey field)
        .mapValues(removeOriginalKeyField())
        .through("atl-joined-data-repartition");
then the aggregation part:
Materialized<String, ErrorScore, WindowStore<Bytes, byte[]>> materialized = Materialized
        .as(localStore.getStoreName());
// Set retention of changelog topic
materialized.withLoggingEnabled(topicConfig);
// Configure how windows look and how long data will be retained in local stores
TimeWindows configuredTimeWindows = getConfiguredTimeWindows(
        localStore.getTimeUnit(), Long.parseLong(topicConfig.get(RETENTION_MS)));
// Processing description:
// 2. With groupByKey we group the data on the new key
// 3. With windowedBy we split up the data into time intervals depending on the provided LocalStore enum
// 4. With reduce we determine the maximum value in the time window
// 5. Materialized stores the result in a table
stream.groupByKey()
        .windowedBy(configuredTimeWindows)
        .reduce((aggValue, newValue) -> getMaxErrorScore(aggValue, newValue), materialized);
}

private TimeWindows getConfiguredTimeWindows(long windowSizeMs, long retentionMs) {
    TimeWindows timeWindows = TimeWindows.of(windowSizeMs);
    timeWindows.until(retentionMs);
    return timeWindows;
}
I would expect that increasing the number of instances would decrease the consumer lag tremendously.
So in this setup there are multiple topics involved such as:
* original-sensor-data
* error-score
* kstream-joinother
* kstream-jointhis
* atl-mapped-original-sensor-data-repartition
* atl-mapped-error-score-data-repartition
* atl-joined-data-repartition
The idea is to join original-sensor-data with error-score. The rekeying requires the atl-mapped-* topics, then the join uses the kstream-* topics, and in the end, as a result of the join, atl-joined-data-repartition is filled. After that the aggregation also creates topics, but I leave that out of scope for now.
original-sensor-data --> atl-mapped-original-sensor-data-repartition --> kstream-jointhis  --+
                                                                                             +--> atl-joined-data-repartition
error-score          --> atl-mapped-error-score-data-repartition     --> kstream-joinother --+
Since increasing the number of instances doesn't seem to have much of an effect anymore now that I've introduced the join and the atl-mapped topics, I'm wondering whether this topology could become its own bottleneck. From the consumer lag it seems that the original-sensor-data and error-score topics have a much smaller consumer lag compared to, for instance, the atl-mapped-* topics. Is there a way to cope with this by removing these changelogs, or does this mean the application cannot scale?
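Not part of the original question, but for reference: if you did want to drop a changelog, this is roughly what disabling changelog logging on the aggregation's windowed store would look like with the Materialized API. Whether that is advisable depends on whether you can afford to rebuild the store state from the source topics after a failure or rebalance:

Materialized<String, ErrorScore, WindowStore<Bytes, byte[]>> materialized = Materialized
        .<String, ErrorScore, WindowStore<Bytes, byte[]>>as(localStore.getStoreName())
        // No changelog topic is created or written for this store; its contents must be
        // re-derived from the input topics after a failure or task migration.
        .withLoggingDisabled();

Note that kstream-jointhis/kstream-joinother are the changelogs of the join's window stores, while the atl-mapped-* and atl-joined-data-repartition topics are explicit through() topics acting as repartition topics, so disabling store logging would not remove those.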

How can I create a histogram of time stamp deltas?

We are storing small documents in ES that represent a sequence of events for an object. Each event has a date/time stamp. We need to analyze the time between events for all objects over a period of time.
For example, imagine these event json documents:
{ "object":"one", "event":"start", "datetime":"2016-02-09 11:23:01" }
{ "object":"one", "event":"stop", "datetime":"2016-02-09 11:25:01" }
{ "object":"two", "event":"start", "datetime":"2016-01-02 11:23:01" }
{ "object":"two", "event":"stop", "datetime":"2016-01-02 11:24:01" }
What we would want to get out of this is a histogram plotting the two resulting time stamp deltas (from start to stop): 2 minutes / 120 seconds for object one and 1 minute / 60 seconds for object two.
Ultimately we want to monitor the time between start and stop events but it requires that we calculate the time between those events then aggregate them or provide them to the Kibana UI to be aggregated / plotted. Ideally we would like to feed the results directly to Kibana so we can avoid creating any custom UI.
Thanks in advance for any ideas or suggestions.
Since you're open to using Logstash, there's a way to do it using the aggregate filter.
Note that this is a community plugin that needs to be installed first (i.e. it doesn't ship with Logstash by default).
The main idea of the aggregate filter is to merge two "related" log lines. You can configure the plugin so it knows what "related" means. In your case, "related" means that both events must share the same object name (i.e. one or two) and then that the first event has its event field with the start value and the second event has its event field with the stop value.
When the filter encounters the start event, it stores the datetime field of that event in an internal map. When it encounters the stop event, it computes the time difference between the two datetimes and stores the duration in seconds in the new duration field.
input {
  ...
}
filter {
  ...other filters

  if [event] == "start" {
    aggregate {
      task_id => "%{object}"
      code => "map['start'] = event['datetime']"
      map_action => "create"
    }
  } else if [event] == "stop" {
    aggregate {
      task_id => "%{object}"
      code => "map['duration'] = event['datetime'] - map['start']"
      end_of_task => true
      timeout => 120
    }
  }
}
output {
  elasticsearch {
    ...
  }
}
Note that you can adjust the timeout value (here 120 seconds) to better suit your needs. When the timeout has elapsed and no stop event has happened yet, the existing start event will be ditched.

What is the difference between TYPE_STEP_COUNT_DELTA and AGGREGATE_STEP_COUNT_DELTA data type in Google Fit Android Api?

The Google Fit API describes these two data types as part of the Sensors API. However, both seem to return the same data. Can anyone explain the difference?
TYPE_STEP_COUNT_DELTA:
In the com.google.step_count.delta data type, each data point represents the number of steps taken since the last reading.
AGGREGATE_STEP_COUNT_DELTA:
Aggregate number of steps during a time interval.
You can see more here:
https://developers.google.com/android/reference/com/google/android/gms/fitness/data/DataType
// Setting a start and end date using a range of 1 week before this moment.
Calendar cal = Calendar.getInstance();
Date now = new Date();
cal.setTime(now);
long endTime = cal.getTimeInMillis();
cal.add(Calendar.WEEK_OF_YEAR, -1);
long startTime = cal.getTimeInMillis();

java.text.DateFormat dateFormat = getDateInstance();
Log.i(TAG, "Range Start: " + dateFormat.format(startTime));
Log.i(TAG, "Range End: " + dateFormat.format(endTime));

DataReadRequest readRequest = new DataReadRequest.Builder()
        // The data request can specify multiple data types to return, effectively
        // combining multiple data queries into one call.
        // In this example, it's very unlikely that the request is for several hundred
        // datapoints each consisting of a few steps and a timestamp. The more likely
        // scenario is wanting to see how many steps were walked per day, for 7 days.
        .aggregate(DataType.TYPE_STEP_COUNT_DELTA, DataType.AGGREGATE_STEP_COUNT_DELTA)
        // Analogous to a "Group By" in SQL, defines how data should be aggregated.
        // bucketByTime allows for a time span, whereas bucketBySession would allow
        // bucketing by "sessions", which would need to be defined in code.
        .bucketByTime(1, TimeUnit.DAYS)
        .setTimeRange(startTime, endTime, TimeUnit.MILLISECONDS)
        .build();
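For what it's worth, here is a sketch of how the aggregated response is typically consumed with the GoogleApiClient-era History API; mClient, readRequest, dateFormat and TAG are assumed to come from the surrounding sample. The request is built with TYPE_STEP_COUNT_DELTA as the input type and AGGREGATE_STEP_COUNT_DELTA as the output type, and the result comes back as one bucket per day whose data points carry the summed step deltas:

// Blocking call; run off the main thread. mClient is assumed to be a connected GoogleApiClient.
DataReadResult dataReadResult =
        Fitness.HistoryApi.readData(mClient, readRequest).await(1, TimeUnit.MINUTES);

// With a bucketed, aggregated request the data arrives as buckets (one per day here),
// each holding data sets of the aggregate type.
for (Bucket bucket : dataReadResult.getBuckets()) {
    for (DataSet dataSet : bucket.getDataSets()) {
        for (DataPoint dp : dataSet.getDataPoints()) {
            int steps = dp.getValue(Field.FIELD_STEPS).asInt();
            Log.i(TAG, "Steps for bucket starting "
                    + dateFormat.format(dp.getStartTime(TimeUnit.MILLISECONDS))
                    + ": " + steps);
        }
    }
}

In short, TYPE_STEP_COUNT_DELTA is the raw data type you read or subscribe to, while AGGREGATE_STEP_COUNT_DELTA is the type that bucketed, aggregated results are reported as, which is why the two can look identical when little data has been recorded.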
