Spark streaming mapWithState timeout without remove - spark-streaming

Imagine a use case where events are streaming in per user but only the first week of events are of interest. Within that time frame stateful logic is taking place using mapWithState. After that period the user incoming events should be disregarded.
As the user's state takes memory, it makes sense to change it after the user's week period to a simple already-seen-marker.
If any event comes in for that user a week or later after his first event, it is easy to change the state to that already-seen-marker.
But, if no events come after that week, the state never changes to that already-seen-marker, and the state will continue to occupy memory forever.
As far as I understand, adding a timeout ( to user's state ) will not help, as you are not allowed to change state for a timeout state ( makes sense, as it is going to be removed ).
Is there a simple way to achieve this use case?

From what I understand, Spark's 2.2 mapGroupsWithState has richer timeouts that can be used not only to remove a state, but also to change it (check here).

Related

CQRS Where to Query for business logic/Internal Processes

I'm currently looking at implementing CQRS driven by events (not yet event sourcing) in for a service at work; the reasoning being:
I need aggregate data to support a RestAPI coming out of this service (which will be used to populate views)- however the aggregated data will not be used by the application logic/processing (ie the data originating outside this service, the bits that of the aggregate originating within it will be used)
I need to stream events to other systems so that they can react to the data (will produce to a Kafka topic, so the 'read'/'projection' side of this system will consume the same events as the external systems, from these Kafka topics
I will be consuming events from internal systems to help populate the aggregate for the views in first point (ie it's data from this service and other's)
The reason for not going event sourced currently is that a) we're in a bit of a time crunch, and b) due to still learning about it. Having said which, it is something that we are looking to do in the future- though currently, we have a static DB in the 'Command' side of the system, which will just store current state
I'm pretty confident with the concept of using the aggregate data to provide the Rest API; however my confusion is coming from when I want to change a resource from within the system (for example via a cron job triggered 5 times a day) Example:
If I have resource of class x, which (given some data), wants a piece of state changing
I need to select instances of the class x which meet the requirements (from one of the DB's). Think select * from {class x} where last_changed_ date > 5 days ago;
Then create a command to change the state of these instances of x (in my case, the static command DB would be updated, as well as an event made to update the read DB)
The middle bullet point is what is confusing me. If I pull the data out of the Read DB, and check some information on it, then decide to change a property; I then have to convert the object from the 'Read Object' to the 'Command Object', so that I can then persist it and create an event? With my current architecture- I could query the command DB no problem, to find all the instances of {class x} that match the criteria, however I don't know if a) this is the right thing to do, and b) how this would work if I was using an event store as a DB? I'd have to query a table with millions of rows to find the most recent bit of state about the objects, to then see if they match?
Lots of what I read online has been very conceptual- so I think when it comes to implementations it maybe seems more difficult than it is? Anyhow, if anyone has any advice it would be hugely appreciated!
TIA :)
CQRS can be interpreted in a "permissive" way: rather than saying "thou shalt not query the command/write side", it says "it's OK to have a query/read side that's separate from the command/write side". Because you have this permission to do such separation, it follows that one can optimize the command/write side for a more write-heavy workload (in practice, there are always some reads in the command/write side: since command validation is typically done against some state, that requires some means of getting the state!). From this, it's extremely likely that there will be some queries which can be performed efficiently against the command/write side and some that can't be (without deoptimizing the command/write side). From this perspective, it's OK to perform the first kind of query against the command/write side: you can get the benefit of strong consistency by doing that, though be sure to make sure that you're not affecting the command/write side's primary raison d'etre of taking writes.
Event sourcing is in many ways the maximally optimized persistence model for a command/write side, especially if you have some means of keeping the absolute latest state cached and ensuring concurrency control. This is because you can then have many times more writes than reads. The tradeoff in event sourcing is that nearly all reads become rather more expensive than in an update-in-place model: it's thus generally the case that CQRS doesn't force event sourcing but event sourcing tends to force CQRS (and in turn, event sourcing can simplify ensuring that a CQRS system is eventually consistent, which can be difficult to ensure with update-in-place).
In an event-sourced system, you would tend to have a read-side which subscribes to the event stream and tracks the mapping of X ID to last updated and which periodically queries and issues commands. Alternatively, you can have a scheduler service that lets you say "issue this command at this time, unless canceled or rescheduled before then" and a read-side which subscribes to updates and schedules a command for the given ID 5 days from now after canceling the command from the previous update.

KStreams: implementing session window with pocessor API

I need to implement a logic similar to session windows using processor API in order to have a full control over state store. Since processor API doesn't provide windowing abstraction, this needs to be done manually. However, I fail to find the source code for KStreams session window logic, to get some initial ideas (specifically regarding session timeouts).
I was expecting to use punctuate method, but it's a per processor timer rather than per key timer. Additionally SessionStore<K, AGG> doesn't provide an API to traverse the database for all keys.
[UPDATE]
As an example, assume processor instance is processing K1 and stream time is incremented which causes the session for K2 to timeout. K2 may or may not exist at all. How do you know that there exists a specific key (like K2 when stream time is incremented (while processing a different key)? In other words when stream time is incremented, how do you figure out which windows are expired (because you don't know those keys exists)?
This is the DSL code: https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamSessionWindowAggregate.java -- hope it helps.
It's unclear what your question is though -- it's mostly statements. So let me try to give some general answer.
In the DSL, sessions are close based on "stream time" progress. Only relying on the input data makes the operation deterministic. Using wall-clock time would introduce non-determinism. Hence, using a Punctuation is not necessary in the DSL implementation.
Additionally SessionStore<K, AGG> doesn't provide an API to traverse the database for all keys.
Sessions in the DSL are based on keys and thus it's sufficient to scan the store on a per-key basis over a time range (as done via findSessions(...)).
Update:
In the DSL, each time a session window is updated, as corresponding update event is sent downstream immediately. Hence, the DSL implementation does not wait for "stream time" to advance any further but publishes the current (potentially intermediate) result right away.
To obey the grace period, the record timestamp is compared to "stream time" and if the corresponding session window is already closed, the record is skipped (cf. https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamSessionWindowAggregate.java#L146). I.e., closing a window is just a logical step (not an actually operation); the session will still be stored and if a window is closed no additional event needs to be sent downstream because the final result was sent downstream in the last update to the window already.
Retention time itself must not be handled by the Processor implementation because it's a built-in feature of the SessionStore: internally, the session store maintains so-called "segments" that store sessions for a certain time period. Each time a put() is done, the store checks if old segments can be dropped (based on the timestamp provided by put()). I.e., old sessions are deleted lazily and as bulk deletes (i.e., all session of the whole segment will be deleted at once) as it's more efficient than individual deletes.

Is it possible to pause and resume Kafka Stream conditionally?

I have a requirement as stated # https://kafka.apache.org/21/documentation/streams/developer-guide/dsl-api.html#window-final-results for waiting until window is closed in order to handle late out of order event by buffering it for duration of window.
Per my understanding of this feature is once windowing is created, the window works like wall clock processing, e.g. Creating for 1 hour window, The window starts ticking once first event comes. This 1hr window is closed exactly one hour later and all the events buffered so far will be forwarded to down stream. However, i need to be able to hold this window even longer say conditionally for as long as required e.g. based on state / information in external system such as database.
To be precise my requirement for event forwarding is (windows of 1 hour if external state record says it is good) or (hold for as long as required until external record says it's good and resume tracking of the event until the event make it fully 1hr, disregarding the time when external system is not good)
To elaborate this 2nd condition, e.g. if my window duration 1 1hr , my event starts at 00:00, if on 00:30 it is down and back normal on 00:45, the window should extend until 01:15.
Is it possible to pause and resume the forwarding of events conditionally based on my requirement above ?
Do I have to use transformation / processor and use value store manually to track the first processing time of my event and conditionally forwarding buffered events in punctuator ?
I appreciate all kind of work around and suggestion for this requirement.
the window works like wall clock processing
No. Kafka Streams work on event-time, hence, the timestamps as returned from the TimestampExtractor (by default the embedded record timestamp) are use to advance time.
To be precise my requirement for event forwarding is (windows of 1 hour if external state record says it is good)
This would need a custom solution IMHO.
or (hold for as long as required until external record says it's good and resume tracking of the event until the event make it fully 1hr, disregarding the time when external system is not good)
Not 100% if I understand this part.
Is it possible to pause and resume the forwarding of events conditionally based on my requirement above ?
No.
Do I have to use transformation / processor and use value store manually to track the first processing time of my event and conditionally forwarding buffered events in punctuator ?
I think this might be required.
Check out this blog post, that explains how suppress() work in details, and when it emits based on observed event-time: https://www.confluent.io/blog/kafka-streams-take-on-watermarks-and-triggers

Coalescing GCD file system events

I have a class that implements a file-monitoring service to detect when a file I am interested in has been changed by something other than my application. I use the standard technique of opening the file (with the O_EVTONLY flag) and binding the file descriptor to a Grand Central Dispatch source of type DISPATCH_SOURCE_TYPE_VNODE. When I get an event, I notify my main thread with NSNotificationCenter's postNotificationName:object:userInfo: which calls an observer in my app delegate. So far so good. It works great. But, in general, if the triggering event is an attributes change (i.e. the DISPATCH_VNODE_ATTRIB flag is set on return from dispatch_source_get_data()) then I usually get two closely-spaced events. The behaviour is easily exhibited if I touch(1) the object I am monitoring. I hypothesise this is due to the file's mtime and atime being set non-atomically although I can't verify this. This can lead to spurious notifications being sent to my observer and this raises the possibility of race conditions etc.
What is the best way of dealing with this? I thought of storing a timestamp for the last event received and only sending a notification if the current event is later than this timestamp by some amount (a few tens of milliseconds?) Does this sound like a reasonable solution?
You can't ever escape the "race condition" in this situation, because the notification of your GCD event source in your process is not synchronous with the other process's modification of the underlying file. So, no matter what, you must always be tolerant of the possibility that the change you're being notified for could already be "gone."
As for coalescing, do whatever makes sense for your app. There are two obvious strategies. You can act immediately on a received event, and then drop subsequent events received in some time window on the floor, or you can delay every event for some time period during which you will drop other events for the same file on the floor. It really just depends on what's more important, acting quickly, or having a higher likelihood of a quiescent state (knowing that you can never be sure things are quiescent.)
The only thing I would add is to suggest that you do all your coalescence before dispatching anything to the main thread. The main thread has things like tracking loops, etc that will make it harder to get time-based coalescing right in certain cases.

Distributed time synchronization and web applications

I'm currently trying to build an application that inherently needs good time synchronization across the server and every client. There are alternative designs for my application that can do away with this need for synchronization, but my application quickly begins to suck when it's not present.
In case I am missing something, my basic problem is this: firing an event in multiple locations at exactly the same moment. As best I can tell, the only way of doing this requires some kind of time synchronization, but I may be wrong. I've tried modeling the problem differently, but it all comes back to either a) a sucky app, or b) requiring time synchronization.
Let's assume I Really Really Do Need synchronized time.
My application is built on Google AppEngine. While AppEngine makes no guarantees about the state of time synchronization across its servers, usually it is quite good, on the order of a few seconds (i.e. better than NTP), however sometimes it sucks badly, say, on the order of 10 seconds out of sync. My application can handle 2-3 seconds out of sync, but 10 seconds is out of the question with regards to user experience. So basically, my chosen server platform does not provide a very reliable concept of time.
The client part of my application is written in JavaScript. Again we have a situation where the client has no reliable concept of time either. I have done no measurements, but I fully expect some of my eventual users to have computer clocks that are set to 1901, 1970, 2024, and so on. So basically, my client platform does not provide a reliable concept of time.
This issue is starting to drive me a little mad. So far the best thing I can think to do is implement something like NTP on top of HTTP (this is not as crazy as it may sound). This would work by commissioning 2 or 3 servers in different parts of the Internet, and using traditional means (PTP, NTP) to try to ensure their sync is at least on the order of hundreds of milliseconds.
I'd then create a JavaScript class that implemented the NTP intersection algorithm using these HTTP time sources (and the associated roundtrip information that is available from XMLHTTPRequest).
As you can tell, this solution also sucks big time. Not only is it horribly complex, but only solves one half the problem, namely giving the clients a good notion of the current time. I then have to compromise on the server, either by allowing the clients to tell the server the current time according to them when they make a request (big security no-no, but I can mitigate some of the more obvious abuses of this), or having the server make a single request to one of my magic HTTP-over-NTP servers, and hoping that request completes speedily enough.
These solutions all suck, and I'm lost.
Reminder: I want a bunch of web browsers, hopefully as many as 100 or more, to be able to fire an event at exactly the same time.
Let me summarize, to make sure I understand the question.
You have an app that has a client and server component. There are multiple servers that can each be servicing many (hundreds) of clients. The servers are more or less synced with each other; the clients are not. You want a large number of clients to execute the same event at approximately the same time, regardless of which server happens to be the one they connected to initially.
Assuming that I described the situation more or less accurately:
Could you have the servers keep certain state for each client (such as initial time of connection -- server time), and when the time of the event that will need to happen is known, notify the client with a message containing the number of milliseconds after the beginning value that need to elapse before firing the event?
To illustrate:
client A connects to server S at time t0 = 0
client B connects to server S at time t1 = 120
server S decides an event needs to happen at time t3 = 500
server S sends a message to A:
S->A : {eventName, 500}
server S sends a message to B:
S->B : {eventName, 380}
This does not rely on the client time at all; just on the client's ability to keep track of time for some reasonably short period (a single session).
It seems to me like you're needing to listen to a broadcast event from a server in many different places. Since you can accept 2-3 seconds variation you could just put all your clients into long-lived comet-style requests and just get the response from the server? Sounds to me like the clients wouldn't need to deal with time at all this way ?
You could use ajax to do this, so yoǘ'd be avoiding any client-side lockups while waiting for new data.
I may be missing something totally here.
If you can assume that the clocks are reasonable stable - that is they are set wrong, but ticking at more-or-less the right rate.
Have the servers get their offset from a single defined source (e.g. one of your servers, or a database server or something).
Then have each client calculate it's offset from it's server (possible round-trip complications if you want lots of accuracy).
Store that, then you the combined offset on each client to trigger the event at the right time.
(client-time-to-trigger-event) = (scheduled-time) + (client-to-server-difference) + (server-to-reference-difference)
Time synchronization is very hard to get right and in my opinion the wrong way to go about it. You need an event system which can notify registered observers every time an event is dispatched (observer pattern). All observers will be notified simultaneously (or as close as possible to that), removing the need for time synchronization.
To accommodate latency, the browser should be sent the timestamp of the event dispatch, and it should wait a little longer than what you expect the maximum latency to be. This way all events will be fired up at the same time on all browsers.
Google found the way to define time as being absolute. It sounds heretic for a physicist and with respect to General Relativity: time is flowing at different pace depending on your position in space and time, on Earth, in the Universe ...
You may want to have a look at Google Spanner database: http://en.wikipedia.org/wiki/Spanner_(database)
I guess it is used now by Google and will be available through Google Cloud Platform.

Resources