Kafka Streams access latest value before window - apache-kafka-streams

I have a windowed stream which needs to calculate a value based on all values which occurred within the window PLUS the latest value which occurred before the window.
inputStream.groupByKey
  .windowedBy(timeWindow)
  .aggregate(Aggregation()) {
    case (_, value, aggregation) =>
      // ...
  }
  .suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded()))
  .toStream
  .map((windowed, aggregation) => {
    // here I need access to the last value which
    // occurred before the window
  })
  .to("output")
How can I get access to the previous value in this stage of my stream?

You will need to use the Processor API and write a custom Processor (or, if you want to embed it into the DSL, you can use transform()) with an attached windowed store. Note that a windowed store is also just a key-value store, with the key being composed of the record key and the start timestamp of the window.
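For example, one way to wire this idea into the DSL is a transformValues() with an attached key-value store (rather than a windowed one) that remembers the latest value per key and enriches each record with it before the windowed aggregation, so the aggregate (and the downstream map) can see the last value that arrived before the window. The sketch below is written in Scala against the Java Streams API and assumes String keys and values; PreviousValueEnricher, ValueWithPrevious, the topic name, and the store name are illustrative placeholders, and serde configuration for the enriched value is omitted.

import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.{ValueTransformerWithKey, ValueTransformerWithKeySupplier}
import org.apache.kafka.streams.processor.ProcessorContext
import org.apache.kafka.streams.state.{KeyValueStore, Stores}

// Record value enriched with the latest value seen for the same key before it.
final case class ValueWithPrevious(current: String, previous: Option[String])

class PreviousValueEnricher(storeName: String)
    extends ValueTransformerWithKey[String, String, ValueWithPrevious] {

  private var store: KeyValueStore[String, String] = _

  override def init(context: ProcessorContext): Unit =
    store = context.getStateStore(storeName).asInstanceOf[KeyValueStore[String, String]]

  override def transform(key: String, value: String): ValueWithPrevious = {
    val previous = Option(store.get(key)) // latest value seen for this key so far
    store.put(key, value)                 // remember the current value for later records
    ValueWithPrevious(value, previous)
  }

  override def close(): Unit = ()
}

object Topology {
  val builder = new StreamsBuilder()

  // Register the store and connect it to the transformer by name.
  builder.addStateStore(
    Stores.keyValueStoreBuilder(
      Stores.persistentKeyValueStore("prev-value-store"),
      Serdes.String(),
      Serdes.String()))

  val enriched = builder
    .stream[String, String]("input")
    .transformValues(
      new ValueTransformerWithKeySupplier[String, String, ValueWithPrevious] {
        override def get() = new PreviousValueEnricher("prev-value-store")
      },
      "prev-value-store")

  // enriched.groupByKey().windowedBy(...).aggregate(...) as before; the first record of
  // each window now carries the latest value that arrived before the window opened.
}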

Related

X11/Xlib: Grab key events and then pass them to sub-windows

I'm tinkering with X11 using Go and Xlib (using cgo) and I'm trying to program basic window management functionalities, but I'm having problems with input management (keyboard in this case).
So far I've created a basic reparenting window manager where client windows (i.e. windows managed by this program) are wrapped within container windows created by this program. Container windows are direct children of root window and client windows are direct children of their container windows.
I'm trying to grab the entire keyboard and check if a key event is directed towards the window manager in order to process certain things. In case this key event is not something related to the window manager I would like to pass it along to client windows. I know that there is the option to select only specific keys or modifiers (using XGrabKey) for this matter (I was able to do it), but I would like to be able to grab the entire keyboard in this case.
So far I have the following code which doesn't work and keys are not passed to client windows.
...
C.XGrabKeyboard(
    display,
    rootWindow,
    0,
    C.GrabModeAsync,
    C.GrabModeAsync,
    C.CurrentTime,
)
...
for {
    var event C.XEvent
    C.XNextEvent(display, &event)
    eventType := (*C.int)(unsafe.Pointer(&event[0]))
    switch *eventType {
    ...
    case C.KeyPress:
        eventPayload := (*C.XKeyEvent)(unsafe.Pointer(&event[0]))
        // Value of eventPayload.root here equals rootWindow.
        // Value of eventPayload.window here equals rootWindow.
        // Value of eventPayload.subwindow here equals a ContainerWindow.
        if SOME_CONDITIONS_ARE_MET {
            // Key event is directed towards the window manager. Process this key event.
            ...
            continue
        }
        // Window manager has nothing to do with this key event. Pass it along.
        C.XAllowEvents(display, C.ReplayKeyboard, C.CurrentTime)
        ...
    }
}
It is worth mentioning that I've used C.XSynchronize(display, 1) in this case, so calling XSync is no longer required. Calling XFlush after XAllowEvents did not solve the problem either.
By the way, I originally saw the XAllowEvents approach in another Stack Overflow question and on another website.

What is the best practice for querying an AWS Amplify GraphQL/DynamoDB database for changes since your last query?

I have an iOS AWS Amplify Project using GraphQL API and DynamoDB to store data. This is a chat app.
Initially, when a user signs in, he will get all messages for a conversation. If he then comes back to the app a few days later, I want to fetch any new messages in each conversation.
At the moment I am using the date of the last message as a predicate for future queries. However, I was wondering if there is a better or more standard way of querying a table for changes since the last time you queried.
I am not using DataStore. I am managing my own local storage.
let keys = Message.keys
let predicate = keys.conversationID == conversationID && keys.createdAt > lastQueryDate
Amplify.API.query(request: .paginatedList(Message.self, where: predicate, limit: 100)) { [weak self] event in
    DispatchQueue.main.async {
        switch event {
        case .success(let result):
            switch result {
            case .success(let messages):
                // handle data
                break
            case .failure(let error):
                print("Got failed result with \(error.errorDescription)")
            }
        case .failure(let error):
            print("Got failed event with error \(error)")
        }
    }
}
Querying based on the date of the last message is perfectly acceptable and standard.
DynamoDB itself doesn't provide a way to get "new" items (messages in this case) the way you could with SQS, for example; there's no way to "acknowledge" an item.
For pagination you could use markers such as ExclusiveStartKey, but since you already have a date field to filter by, there is no need to do so.
You're following best practices already by modeling your data correctly.

How to create a MutableSharedFlow in Kotlin Coroutines similar to PublishSubject from RxJava?

Is there an equivalent of PublishSubject from RxJava in the Kotlin Coroutines library?
Channels cannot be a replacement for PublishSubject since they do not publish values to multiple collectors (each value can be collected by a single collector only). Even MutableSharedFlow, which supports multiple collectors, still does not allow emitting values without waiting for collectors to finish processing previous values. How can we create a flow with functionality similar to PublishSubject?
The following code will create a Flow equivalent to the PublishSubject:
fun <T> publishFlow(): MutableSharedFlow<T> {
    return MutableSharedFlow(
        replay = 0,
        extraBufferCapacity = Int.MAX_VALUE
    )
}
The main attributes of the PublishSubject are that it does not replay old values to new observers and still allows publishing new values/events without waiting for the observers to handle them. This functionality can be achieved with MutableSharedFlow by specifying replay = 0 to prevent new collectors from collecting old values, and extraBufferCapacity = Int.MAX_VALUE to allow publishing new values without waiting for busy collectors to finish collecting previous values.
One can add the following forceEmit function to be called instead of tryEmit, to ensure that the value is actually emitted:
fun <T> MutableSharedFlow<T>.forceEmit(value: T) {
    val emitted = tryEmit(value)
    check(emitted) { "Failed to emit into shared flow." }
}
Since we have a buffer with Int.MAX_VALUE capacity, this forceEmit function should never fail when used with our publishFlow. If the flow is ever replaced with a different flow that does not support emitting without suspending, we will get an exception and will know to handle the case where the buffer is full and one cannot emit without suspending.
Notice that having a buffer with Int.MAX_VALUE capacity may lead to high memory consumption if the collectors take a long time to process values, so it is more suitable for cases where the collectors perform a short synchronous operation (similarly to RxJava observers).

Access Record Offset in Kafka Streams Aggregator

I have a simple windowing topology:
builder.stream("input-topic", Consumed.with(...))
    .groupByKey()
    .windowedBy(TimeWindows.of(windowSize).advanceBy(windowAdvance).grace(windowGrace))
    .aggregate(Frame::new,
        this::windowAggregator,
        ...
    )
    .transformValues(FrameTransformer::new)
    .toStream()
    .selectKey((key, value) -> value...)
    .to("output-topic", Produced.with(...));
I'd like to put the actual record offset of the beginning of the window into the Frame aggregating object.
How can I get access to the record offset from the windowAggregator (aggregate() handler) function?
I know that I can get access to the record offset in the FrameTransformer, but that doesn't help me create accurate Frame objects describing my windows in terms of start and end offset.
I've heard there's a way of doing that by inserting another .transform() call before the groupByKey(), where I can access the offsets, but then I'd need to modify the schema of my event records to store the offset information in them.
Is there a (simpler) way of achieving my intention?
Update
In fact, I was able to get the accurate window start and end offsets in the Frame objects in the following way:
builder.stream("input-topic", Consumed.with(...))
    .transformValues(EventTransformer::new)
    .groupByKey()
    .windowedBy(TimeWindows.of(windowSize).advanceBy(windowAdvance).grace(windowGrace))
    .aggregate(Frame::new,
        this::windowAggregator,
        ...
    )
    .toStream()
    .selectKey((key, value) -> value...)
    .to("output-topic", Produced.with(...));
But as mentioned above, this comes at the expense of editing the schema of the Event object.
How can I get access to the record offset from the windowAggregator (aggregate() handler) function?
You can't. Your approach of using transformValues() before the aggregation (and enriching the Event object) is the right one.
There was a proposal to extend the API to allow accessing record metadata within aggregate() and other DSL operators, but it was never pushed over the finish line (cf https://cwiki.apache.org/confluence/display/KAFKA/KIP-159%3A+Introducing+Rich+functions+to+Streams).
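For reference, here is a rough sketch of such an enrichment transformer, written in Scala against the Streams API; the Event case class and its withOffset helper are placeholders for whatever the real record schema provides.

import org.apache.kafka.streams.kstream.ValueTransformer
import org.apache.kafka.streams.processor.ProcessorContext

// Placeholder event type; the real schema just needs room for an offset field.
final case class Event(payload: String, offset: Long = -1L) {
  def withOffset(o: Long): Event = copy(offset = o)
}

class EventTransformer extends ValueTransformer[Event, Event] {

  private var context: ProcessorContext = _

  // The ProcessorContext exposes metadata of the record currently being
  // processed, including its partition and offset.
  override def init(context: ProcessorContext): Unit =
    this.context = context

  // Copy the source offset onto the event before it reaches the windowed aggregation.
  override def transform(event: Event): Event =
    event.withOffset(context.offset())

  override def close(): Unit = ()
}

With the offset carried on each event, windowAggregator can simply record the smallest and largest offsets it sees for a window as that window's start and end offsets.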

How does Spark Streaming's countByValueAndWindow work?

I have a Spark Streaming application that is processing a stream of website click events. Each event has a property containing a GUID that identifies the user session that the event belongs to.
My application is counting up the number of events that occurred for each session, using windowing:
def countEvents(kafkaStream: DStream[(String, Event)]): DStream[(String, Session)] = {
  // Get a list of the session GUIDs from the events
  val sessionGuids = kafkaStream
    .map(_._2)
    .map(_.getSessionGuid)

  // Count up the GUIDs over our sliding window
  val sessionGuidCountsInWindow = sessionGuids.countByValueAndWindow(Seconds(60), Seconds(1))

  // Create new session objects with the event count
  sessionGuidCountsInWindow
    .map({
      case (guidS, eventCount) =>
        guidS -> new Session().setGuid(guidS).setEventCount(eventCount)
    })
}
My understanding was that the countByValueAndWindow function is only counting the values in the DStream on which the function is called. In other words, in the code above, the call to countByValueAndWindow should return the event counts only for the session GUIDs in the sessionGuids DStream on which we're calling that function.
But I'm observing something different; the call to countByValueAndWindow is returning counts for session GUIDs that are not in sessionGUIDs. It appears to be returning counts for session GUIDs that were processed in previous batches. Am I just misunderstanding how this function works? I haven't been able to find anything in the way of useful documentation online.
A colleague of mine who is much better versed in Spark than I am helped me with this. Apparently I was misunderstanding the way the countByValueAndWindow function works. I thought it would only return counts for values in the DStream on which you're calling the function. But, in fact, it returns counts for all values across the entire window. To address my issue, I simply perform a join between my input DStream and the DStream resulting from the countByValueAndWindow operation. That way I only end up with results for values in my input DStream.
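For the record, here is a rough sketch of that join; names and return types are illustrative, and the 60-second window with 1-second slide matches the code above.

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

def countEventsInCurrentBatch(sessionGuids: DStream[String]): DStream[(String, Long)] = {
  // Counts per GUID over the full sliding window (all batches in the window).
  val countsInWindow: DStream[(String, Long)] =
    sessionGuids.countByValueAndWindow(Seconds(60), Seconds(1))

  // Key the GUIDs that appear in the current batch, de-duplicated so the join
  // does not multiply the counts.
  val guidsInBatch: DStream[(String, Unit)] =
    sessionGuids.transform(_.distinct()).map(guid => (guid, ()))

  // The inner join drops every GUID that is not present in the current batch.
  guidsInBatch
    .join(countsInWindow)
    .map { case (guid, (_, count)) => (guid, count) }
}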
