I'm exploring Kafka Streams for a sessionization use case. Is there a way to end a session window before the inactivity gap elapses if an end signal arrives first?
Appreciate any help.
No, this is not possible out of the box. The session window implementation in KStreams uses inactivity as the sole parameter to determine whether a window (session) should or should not be closed.
If you need different behavior, you can use the Processor API of Kafka Streams. I have seen developers who implement custom 'sessions' based on finite state machines. For example, reconstructing a TCP/IP session from raw network data can be done this way.
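To make the state-machine idea concrete, here is a minimal sketch in plain Python rather than the actual Processor API. The class name `EndSignalSessionizer`, the `"END"` marker, and the dict standing in for a state store are all assumptions for illustration. It also assumes records for a key arrive roughly in timestamp order, and it only notices an expired gap when the next record for that key shows up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Session:
    key: str
    start: int
    end: int
    events: List[str] = field(default_factory=list)


class EndSignalSessionizer:
    """Closes a session on an explicit end signal or after gap_ms of inactivity."""

    def __init__(self, gap_ms: int, end_marker: str = "END"):
        self.gap_ms = gap_ms
        self.end_marker = end_marker          # hypothetical end-of-session value
        self.open: Dict[str, Session] = {}    # stand-in for a state store

    def process(self, key: str, value: str, ts: int) -> List[Session]:
        """Handle one record and return the sessions it closed (possibly none)."""
        closed: List[Session] = []
        current = self.open.get(key)

        # Inactivity gap exceeded: the previous session is over.
        if current is not None and ts - current.end > self.gap_ms:
            closed.append(self.open.pop(key))
            current = None

        if current is None:
            current = Session(key, ts, ts)
            self.open[key] = current

        current.end = max(current.end, ts)
        current.events.append(value)

        # Explicit end signal: close immediately, without waiting for the gap.
        if value == self.end_marker:
            closed.append(self.open.pop(key))
        return closed
```

In a real processor you would probably also need some timer-driven cleanup for keys that never receive another record; this sketch leaves that out.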
Related
I need to implement logic similar to session windows using the Processor API in order to have full control over the state store. Since the Processor API doesn't provide a windowing abstraction, this needs to be done manually. However, I can't find the source code for the KStreams session window logic to get some initial ideas (specifically regarding session timeouts).
I was expecting to use the punctuate method, but it's a per-processor timer rather than a per-key timer. Additionally, SessionStore<K, AGG> doesn't provide an API to traverse the database for all keys.
[UPDATE]
As an example, assume a processor instance is processing K1 and stream time advances, which causes the session for K2 to time out. K2 may or may not exist at all. How do you know that a specific key (like K2) exists when stream time advances while processing a different key? In other words, when stream time advances, how do you figure out which windows have expired, given that you don't know which keys exist?
This is the DSL code: https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamSessionWindowAggregate.java -- hope it helps.
It's unclear what your question is though -- it's mostly statements. So let me try to give some general answer.
In the DSL, sessions are closed based on "stream time" progress. Relying only on the input data makes the operation deterministic, whereas using wall-clock time would introduce non-determinism. Hence, a punctuation is not necessary in the DSL implementation.
Additionally SessionStore<K, AGG> doesn't provide an API to traverse the database for all keys.
Sessions in the DSL are based on keys and thus it's sufficient to scan the store on a per-key basis over a time range (as done via findSessions(...)).
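As a rough illustration of that per-key scan, here is a plain-Python sketch, not the Kafka code. `InMemorySessionStore` and `add_record` are made-up names; the `find_sessions(key, earliest_end, latest_start)` lookup is meant to mirror the shape of findSessions(...), and counting records is just an assumed aggregate.

```python
from typing import Dict, List, Tuple

SessionRow = Tuple[int, int, int]  # (start, end, record count)


class InMemorySessionStore:
    """Toy stand-in for a session store: sessions are only looked up per key."""

    def __init__(self) -> None:
        self.rows: Dict[str, List[SessionRow]] = {}

    def find_sessions(self, key: str, earliest_end: int, latest_start: int) -> List[SessionRow]:
        # Sessions of this key that overlap [earliest_end, latest_start].
        return [r for r in self.rows.get(key, [])
                if r[1] >= earliest_end and r[0] <= latest_start]

    def remove(self, key: str, row: SessionRow) -> None:
        self.rows[key].remove(row)

    def put(self, key: str, row: SessionRow) -> None:
        self.rows.setdefault(key, []).append(row)


def add_record(store: InMemorySessionStore, key: str, ts: int, gap_ms: int) -> SessionRow:
    """Merge a new record with every session of the same key within the gap."""
    start, end, count = ts, ts, 1
    for row in store.find_sessions(key, ts - gap_ms, ts + gap_ms):
        start = min(start, row[0])
        end = max(end, row[1])
        count += row[2]
        store.remove(key, row)   # the merged session replaces the old ones
    merged = (start, end, count)
    store.put(key, merged)
    return merged
```

Note that nothing here ever iterates over all keys: a record only touches the sessions of its own key, which is why no full-store traversal is needed for the merge.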
Update:
In the DSL, each time a session window is updated, a corresponding update event is sent downstream immediately. Hence, the DSL implementation does not wait for "stream time" to advance any further but publishes the current (potentially intermediate) result right away.
To obey the grace period, the record timestamp is compared to "stream time", and if the corresponding session window is already closed, the record is skipped (cf. https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamSessionWindowAggregate.java#L146). I.e., closing a window is just a logical step (not an actual operation); the session will still be stored, and when a window is closed no additional event needs to be sent downstream because the final result was already sent downstream in the last update to the window.
Retention time itself does not need to be handled by the Processor implementation because it's a built-in feature of the SessionStore: internally, the session store maintains so-called "segments" that store sessions for a certain time period. Each time a put() is done, the store checks whether old segments can be dropped (based on the timestamp provided by put()). I.e., old sessions are deleted lazily and as bulk deletes (all sessions of a whole segment are deleted at once), as this is more efficient than individual deletes.
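A toy sketch of the segment idea in plain Python may help. This is not how the real store is implemented in detail; `SegmentedStore`, `retention_ms`, and `segment_ms` are names I made up, and bucketing sessions by their end timestamp is an assumption.

```python
from typing import Dict, Tuple


class SegmentedStore:
    """Toy store that buckets sessions by end time and drops whole buckets lazily."""

    def __init__(self, retention_ms: int, segment_ms: int):
        self.retention_ms = retention_ms
        self.segment_ms = segment_ms
        # segment id -> {(key, start, end): value}
        self.segments: Dict[int, Dict[Tuple[str, int, int], object]] = {}
        self.observed_time = -1

    def put(self, key: str, start: int, end: int, value: object) -> None:
        # Time only advances with the timestamps seen by put().
        self.observed_time = max(self.observed_time, end)
        min_live = (self.observed_time - self.retention_ms) // self.segment_ms

        # Lazy bulk delete: whole expired segments go away in one step.
        for seg_id in [s for s in self.segments if s < min_live]:
            del self.segments[seg_id]

        seg_id = end // self.segment_ms
        if seg_id >= min_live:   # writes into already expired segments are ignored
            self.segments.setdefault(seg_id, {})[(key, start, end)] = value
```

Nothing is deleted until a later put() moves the observed time forward, which is the "lazy" part described above.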
Question:
If an event arrives after the window has closed, how do we redirect it to another topic so the correction can be handled?
Context:
We use tumbling windows.
We use the event's source creation time (event time) to define windows.
thanks
Currently, there is no API to do that. Late events are dropped and you cannot easily get hold of them.
What you could do is add an upstream operator (like a transform()) before the window that compares the record timestamp to the current "stream time" (you would need to track "stream time" manually within the operator). This should let you detect whether the downstream window will drop the record as late and react accordingly, for example by using a branch() after transform() and before groupByKey().windowedBy().
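The check such an operator would perform can be sketched in plain Python (this is not Kafka Streams code). `LateRecordDetector` and `split_late` are hypothetical names, and the rule "a tumbling window is closed once stream time reaches window end plus grace" is my understanding of the DSL behavior, so treat it as an assumption and verify it against your version.

```python
from typing import Iterable, List, Tuple


class LateRecordDetector:
    """Tracks "stream time" by hand and flags records whose window is already closed."""

    def __init__(self, window_size_ms: int, grace_ms: int):
        self.window_size_ms = window_size_ms
        self.grace_ms = grace_ms
        self.stream_time = -1   # highest record timestamp seen so far

    def is_late(self, ts: int) -> bool:
        self.stream_time = max(self.stream_time, ts)
        window_start = ts - (ts % self.window_size_ms)
        window_close = window_start + self.window_size_ms + self.grace_ms
        # Assumed rule: closed once stream time has reached window end + grace.
        return self.stream_time >= window_close


def split_late(timestamps: Iterable[int], detector: LateRecordDetector) -> Tuple[List[int], List[int]]:
    """Plays the role of the branch step: (on-time records, late records)."""
    on_time: List[int] = []
    late: List[int] = []
    for ts in timestamps:
        (late if detector.is_late(ts) else on_time).append(ts)
    return on_time, late
```

The late list is what you would send to the correction topic, while the on-time list continues into the windowed aggregation.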
This is more of a theoretical question.
Imagine that I have two programs that run simultaneously. The main one only does something when it receives a flag set to true from a secondary program. So this main program has a function that keeps asking the secondary one for the value of the flag, and when it gets true, it does something.
What I learned at college is that polling is the simplest way of doing that. But when I started working as a developer, coworkers told me that this method generates some overhead, or that it's a waste of computation, because it asks for a value at regular intervals.
I tried to come up with ideas for doing this in a different way and searched the internet for something like this, but didn't find a useful way to do it.
I read about interrupts and passive approaches, where the main program gets the data only when it is informed by the secondary program. But how does this happen? The main program will need a function to check for interrupts, right? So won't it end up the same way as before?
What could I do differently?
There is no magic...
No program can guess when there is new information to read. What you can do is choose between two approaches:
A -> asks -> B
A <- is informed <- B
When to use each? It depends on many other factors, such as:
1- How fast must the data be delivered once it is generated? As fast as possible, or can it be held for a while and accumulated?
2- How fast is the data generated?
3- How many simultaneous clients request data from the same server?
4- What type of data are you dealing with? Persistent? Fast-changing?
If you are building something like a stock analyzer that needs to ask for the price of stocks every second (and the price also changes every second), the approach you mentioned may be the best.
If you are writing a chat app like WhatsApp, where you need to check whether there is a new message for the client and most of the time there won't be, publish/subscribe may be the best.
But all of this is a very superficial look at a high-impact architecture decision; it is not possible to find the best option by looking at just one factor.
What I want to show is that
coworkers told me that this method generate some overhead or it's
waste of computation
is not a correct statement in general. It may be true in some particular scenarios, but overhead will always exist in distributed systems.
The typical way to prevent polling is by using the Publish/Subscribe pattern.
Your client program will subscribe to the server program and when an event occurs, the server program will publish to all its subscribers for them to handle however they need to.
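As a minimal in-process sketch of that pattern (plain Python, no networking; the class name `FlagPublisher` is made up for illustration), the secondary program would own the publisher and the main program would register a callback instead of asking repeatedly:

```python
from typing import Callable, List


class FlagPublisher:
    """Keeps a list of subscriber callbacks and pushes every new value to them."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[bool], None]] = []

    def subscribe(self, callback: Callable[[bool], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, value: bool) -> None:
        # Iterate over a copy so a callback may subscribe others safely.
        for callback in list(self._subscribers):
            callback(value)


def make_main_program(log: List[str]) -> Callable[[bool], None]:
    """Builds the main program's handler; it runs only when it is informed."""
    def on_flag(value: bool) -> None:
        if value:
            log.append("flag is true, doing the work")
    return on_flag
```

Usage would look like `publisher.subscribe(make_main_program(log))` followed by `publisher.publish(True)` from the secondary side. Across two real processes the callback would be replaced by some transport (a socket, a message broker, and so on), but the shape stays the same.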
If you flip the order of the requests you end up with something more similar to a standard web API. Your main program (left in your example) would be a server listening for requests. The secondary program would be a client hitting an endpoint on the server to trigger an event.
There are many ways to accomplish this in every language, and it doesn't have to be tied to TCP/IP requests.
I'll add a few links for you shortly.
Well, in most languages you won't implement this at such a low level. But theoretically speaking, there are different waiting strategies, and what you describe is active waiting. Doing this, you can easily eat up all your CPU time.
Most languages provide libraries that let you start a process as a service that waits passively and is triggered when a request comes in.
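To show what passive waiting can look like, here is a small sketch using Python's standard `threading.Event`. Two threads stand in for the two programs, which is an assumption for the sake of a self-contained example. The waiting thread is blocked inside `wait()` instead of looping over a check.

```python
import threading
from typing import List


def run_demo() -> List[str]:
    flag = threading.Event()
    result: List[str] = []

    def main_program() -> None:
        # Blocks without spinning until the flag is set (or 5 s pass).
        signalled = flag.wait(timeout=5)
        result.append("did the work" if signalled else "timed out")

    worker = threading.Thread(target=main_program)
    worker.start()

    # The "secondary program" signals the flag when it is ready.
    flag.set()
    worker.join()
    return result
```

The waiting is done by the operating system's synchronization primitives, so the main program does not burn cycles asking for the value.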
I'm using EventMachine with Ruby. I'm currently making a game where, at the end of a turn, it checks whether the other user is still there. When sending data to the user with ws.send(), how can I check that the user actually got the data, or is there an alternative solution?
As the library doesn't provide you with access to the underlying protocol elements, you need to add elements to your application protocol to do this. A typical approach is to add an identifier to each message and respond to messages with acknowledgement messages that contain those identifiers.
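The question is about Ruby, but the idea is language-independent, so here is a hedged sketch in Python. The JSON message shape (`id`, `payload`, `type: "ack"`) and the `AckTracker` class are assumptions for illustration, not part of any WebSocket library; `send` is whatever function actually writes to the socket.

```python
import itertools
import json
from typing import Callable, Dict, List


class AckTracker:
    """Tags outgoing messages with an id and remembers them until they are acked."""

    def __init__(self, send: Callable[[str], None]):
        self._send = send                  # e.g. a wrapper around the socket's send
        self._ids = itertools.count(1)
        self.pending: Dict[int, str] = {}

    def send(self, payload: str) -> int:
        msg_id = next(self._ids)
        self.pending[msg_id] = payload
        self._send(json.dumps({"id": msg_id, "payload": payload}))
        return msg_id

    def on_message(self, raw: str) -> bool:
        """Returns True if raw was an ack for a message that was still pending."""
        msg = json.loads(raw)
        if msg.get("type") == "ack" and msg.get("id") in self.pending:
            del self.pending[msg["id"]]
            return True
        return False

    def unacked(self) -> List[int]:
        return sorted(self.pending)
```

Anything still listed by `unacked()` after some timeout is a candidate for resending, which is where idempotent handling on the receiving side becomes important.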
Note that such an approach will only give you a better idea of what has been received by a client. There is no assurance of any particular state in the case of errors. An example would be losing the connection after the client has sent an ACK, but before the service has received it.
As a result of the complexity I just mentioned, it is often easier to try to make most operations idempotent - that is able to be replayed without detriment to the system, and to replay readily during/after error conditions. You may additionally find a way to periodically synchronize the relevant state entirely, to avoid the long term continuation of minor errors introduced by loss of data/a connection.
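One simple way to get replayable operations is to deduplicate by message id on the receiving side. The sketch below is an assumed example (the `IdempotentReceiver` class and the score update are invented for illustration): applying the same message twice leaves the state unchanged.

```python
from typing import Set


class IdempotentReceiver:
    """Applies each message id at most once, so replays are harmless."""

    def __init__(self) -> None:
        self.seen: Set[int] = set()
        self.score = 0

    def apply(self, msg_id: int, points: int) -> bool:
        if msg_id in self.seen:
            return False        # duplicate delivery: ignore it
        self.seen.add(msg_id)
        self.score += points
        return True
```

With this in place, the sender can resend anything it is unsure about without corrupting the game state.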
I am implementing a Reactor design pattern, using a single thread, for asynchronous operations using Windows Events Mechanism.
I faced a problem while trying to combine my reactor to support Windows Notifications (WM_CLOSE, WM_CREATE, WM_DEVICECHANGE...) along with the existing Windows Events.
Thus, my question is:
Is it possible to signal an event when a particular window receives a particular notification?
Thanks in advance.
No, you cannot make Windows signal an event object when particular window messages are received. You would have to catch the messages in your message loop first and then signal the event object yourself as needed.
Otherwise, re-write your message loop to use MsgWaitForMultipleObjects() so it can check for event signals and pending window messages at the same time, and then you can act according to whichever one satisfies the wait on each loop iteration. Just be aware of this gotcha:
MsgWaitForMultipleObjects is a very tricky API
if you specify bWaitAll as true, you may find that your application doesn’t wake up when you expected it to
In this situation, you would set bWaitAll to false and all is well.