How does Spark Streaming's countByValueAndWindow work? - spark-streaming

I have a Spark Streaming application that is processing a stream of website click events. Each event has a property containing a GUID that identifies the user session that the event belongs to.
My application is counting up the number of events that occurred for each session, using windowing:
def countEvents(kafkaStream: DStream[(String, Event)]): DStream[(String, Session)] = {
  // Get a list of the session GUIDs from the events
  val sessionGuids = kafkaStream
    .map(_._2)
    .map(_.getSessionGuid)

  // Count up the GUIDs over our sliding window
  val sessionGuidCountsInWindow = sessionGuids.countByValueAndWindow(Seconds(60), Seconds(1))

  // Create new session objects with the event count
  sessionGuidCountsInWindow
    .map({
      case (guidS, eventCount) =>
        guidS -> new Session().setGuid(guidS).setEventCount(eventCount)
    })
}
My understanding was that the countByValueAndWindow function is only counting the values in the DStream on which the function is called. In other words, in the code above, the call to countByValueAndWindow should return the event counts only for the session GUIDs in the sessionGuids DStream on which we're calling that function.
But I'm observing something different: the call to countByValueAndWindow is returning counts for session GUIDs that are not in sessionGuids. It appears to be returning counts for session GUIDs that were processed in previous batches. Am I just misunderstanding how this function works? I haven't been able to find any useful documentation on it online.

A colleague of mine who is much more versed in the ways of Spark than I am has helped me with this. Apparently I was misunderstanding the way the countByValueAndWindow function works. I thought that it would only return counts for values in the DStream on which you're calling the function. But, in fact, it returns counts for all values across the entire window. To address my issue, I simply perform a join between my input DStream and the DStream resulting from the countByValueAndWindow operation. Thus I only end up with results for values in my input DStream.
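For illustration, here is a minimal sketch of that join, assuming the sessionGuids and sessionGuidCountsInWindow DStreams from the question (and the usual pair-DStream implicits in scope); GUIDs are deduplicated per batch so the join does not multiply counts:

// Keep only the GUIDs seen in the current batch, deduplicated per batch
val guidsInBatch = sessionGuids
  .transform(_.distinct())
  .map(guid => guid -> ())

// Inner join: retain window counts only for GUIDs present in this batch
val countsForCurrentBatch = guidsInBatch
  .join(sessionGuidCountsInWindow)
  .map { case (guid, (_, eventCount)) => guid -> eventCount }

The result can then be mapped into Session objects exactly as before.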

Related

Access Record Offset in Kafka Streams Aggregator

I have a simple windowing topology:
builder.stream("input-topic", Consumed.with(...))
    .groupByKey()
    .windowedBy(TimeWindows.of(windowSize).advanceBy(windowAdvance).grace(windowGrace))
    .aggregate(Frame::new,
        this::windowAggregator,
        ...
    )
    .transformValues(FrameTransformer::new)
    .toStream()
    .selectKey((key, value) -> value...)
    .to("output-topic", Produced.with(...));
I'd like to put the actual record offset of the beginning of the window into the Frame aggregating object.
How can I get access to the record offset from the windowAggregator (aggregate() handler) function?
I know that I can get access to the record offset in the FrameTransformer, but that doesn't help me create accurate Frame objects that describe my windows in terms of start and end offsets.
I've heard there is a way of doing this by inserting another .transform() call before the groupByKey(), where I can access the offsets, but then I'd need to modify the schema of my event records to store the offset information in them.
Is there a (simpler) way of achieving my intention?
Update
In fact, I was able to get the accurate window start and end offsets in the Frame objects in the following way
builder.stream("input-topic", Consumed.with(...))
    .transformValues(EventTransformer::new)
    .groupByKey()
    .windowedBy(TimeWindows.of(windowSize).advanceBy(windowAdvance).grace(windowGrace))
    .aggregate(Frame::new,
        this::windowAggregator,
        ...
    )
    .toStream()
    .selectKey((key, value) -> value...)
    .to("output-topic", Produced.with(...));
But, as mentioned above, this comes at the expense of editing the schema of the Event object.
How can I get access to the record offset from the windowAggregator (aggregate() handler) function?
You can't. Your approach of using transformValues() before the aggregation (and enriching the Event object) is the right approach.
There was a proposal to extend the API to allow accessing record metadata within aggregate() and other DSL operators, but it was never pushed over the finish line (cf https://cwiki.apache.org/confluence/display/KAFKA/KIP-159%3A+Introducing+Rich+functions+to+Streams).
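For illustration, here is a rough sketch of that approach, written in Scala against the Kafka Streams API; the Event case class and its withOffset helper are hypothetical stand-ins for the enriched event schema mentioned in the update:

import org.apache.kafka.streams.kstream.ValueTransformer
import org.apache.kafka.streams.processor.ProcessorContext

// Hypothetical event type carrying an extra offset field
case class Event(payload: String, offset: Long = -1L) {
  def withOffset(o: Long): Event = copy(offset = o)
}

// Copies the record offset into the event before it reaches the aggregation
class EventTransformer extends ValueTransformer[Event, Event] {
  private var context: ProcessorContext = _

  override def init(context: ProcessorContext): Unit =
    this.context = context

  override def transform(value: Event): Event =
    value.withOffset(context.offset())

  override def close(): Unit = ()
}

// Wired in before groupByKey(), the windowAggregator can then take, for example,
// the minimum offset seen in the window as the window's start offset:
//   stream.transformValues(() => new EventTransformer)
//         .groupByKey()
//         ...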

Are hot, non-completing database observables an Rx use case? Side-effect writing issue

This is more of an opinion question: should this pattern, which many people use, be an Rx use case?
In apps there is usually an SQL database, which the UI queries as an observable: it emits once the query is loaded and again any time the data changes (Room, SqlDelight, etc.).
Reads sound okay; however, is it possible to have "pure" writes to the database?
Writing to the database might look like this
fun sync() = Completable.fromCallable {
    // do something
    database.writeSomethingSynchronously()
}

SomeUi {
    init {
        database.someQueryObservable()
            .subscribe { show list }
    }
}
Imagine you want to display progressbar while this Completable is in flight.
What is effectively happening here is side-effecting to the database. That means the open database observable will re-emit when the data is written, but still before sync() returns (assuming a single thread for simplicity).
So there is a point in time where the new data is already in the UI while the progress bar is still shown (and the timing gets worse with multithreading). That is an invalid state.
In the imperative world, sync would provide a completion callback, in which one would reload the query manually and show/hide the progress bar synchronously (and somehow block the database change listener for the duration of the sync writes?).
Is there a way around this at all?

How to insert a batch of records into Redis

In a Twitter-like application, one of the things they do is that when someone posts a tweet, they iterate over all followers and create a copy of the tweet in each follower's timeline. I need something similar. What is the best way to insert a tweet ID into the lists of, say, 10/100/1000 followers, assuming I have a list of follower IDs?
I am doing this within Azure WebJobs using Azure Redis. A webjob is automatically created for every tweet received in the queue, so I may have around 16 simultaneous jobs running at the same time, each of which goes through the followers and inserts tweets. I'm thinking that if 99% of the inserts succeed, they should not stop because one or a few have failed; I need to continue, but log the failures.
Question: Should I use CreateBatch like below? If I need to retrieve the latest tweets first, in reverse chronological order, is the code below correct and performant?
var tasks = new List<Task>();
var batch = _cache.CreateBatch();
//loop start
tasks.Add(batch.ListRightPushAsync("follower_id", "tweet_id"));
//loop end
batch.Execute();
await Task.WhenAll(tasks.ToArray());
a) But how do I catch it if something fails? A try/catch?
b) How do I check, within a batch, the total number of items in each list and pop one out if it reaches a certain size? I want to do a LeftPop if the list length is > 800, but I'm not sure how to do all of that inside the batch.
Please point me to a sample or let me have a snippet here. Struggling to find a good way. Thank you so much.
UPDATE
Does this look right based on Marc's comments?
var tasks = new List<Task>();
followers.ForEach(f =>
{
    var key = f.FollowerId;
    var task = _cache.ListRightPushAsync(key, value);
    task.ContinueWith(t =>
    {
        if (t.Result > 800) _cache.ListLeftPopAsync(key).Wait();
    });
    tasks.Add(task);
});
Task.WaitAll(tasks.ToArray());
CreateBatch probably doesn't do what you think it does. What it does is defer a set of operations and ensure they get sent contiguously relative to a single connection - there are some occasions where this is useful, but it's not all that common - I'd probably just send them individually if it were me. There is also CreateTransaction (MULTI/EXEC), but I don't think that would be a good choice here.
That depends on whether you care about the data you're popping. If not: I'd send a LTRIM, [L|R]PUSH pair - to trim the list to (max-1) before adding. Another option would be Lua, but it seems overkill. If you care about the old data, you'll need to do a range query too.
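As an illustration of that trim-then-push idea (not the asker's StackExchange.Redis code; this sketch uses the Jedis Java client from Scala, with the 800-item cap taken from the question):

import redis.clients.jedis.Jedis

// Cap a follower's timeline at `max` entries while pushing a new tweet id.
// LTRIM keeps only the newest (max - 1) items, so the RPUSH brings it back to max.
def pushWithCap(jedis: Jedis, followerId: String, tweetId: String, max: Long = 800L): Unit = {
  jedis.ltrim(followerId, -(max - 1), -1)
  jedis.rpush(followerId, tweetId)
}

If the trimmed-off entries matter, read them with an LRANGE over the tail before trimming, as the answer notes.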

Relation between command handlers, aggregates, the repository and the event store in CQRS

I'd like to understand some details of the relations between command handlers, aggregates, the repository and the event store in CQRS-based systems.
What I've understood so far:
Command handlers receive commands from the bus. They are responsible for loading the appropriate aggregate from the repository and call the domain logic on the aggregate. Once finished, they remove the command from the bus.
An aggregate provides behavior and an internal state. State is never public. The only way to change state is by using the behavior. The methods that model this behavior create events from the command's properties and apply these events to the aggregate, which in turn calls an event handler that sets the internal state accordingly.
The repository simply allows loading aggregates by a given ID and adding new aggregates. Basically, the repository connects the domain to the event store.
The event store, last but not least, is responsible for storing events to a database (or whatever storage is used), and reloading these events as a so-called event stream.
So far, so good.
Now there are some issues that I did not yet get:
If a command handler is to call behavior on an already existing aggregate, everything is quite easy. The command handler gets a reference to the repository, calls its loadById method, and the aggregate is returned. But what does the command handler do when there is no aggregate yet, but one should be created? From my understanding, the aggregate should later on be rebuilt using the events. This means that creation of the aggregate is done in reply to a fooCreated event. But to be able to store any event (including the fooCreated one), I need an aggregate. So this looks to me like a chicken-and-egg problem: I cannot create the aggregate without the event, but the only component that should create events is the aggregate. So basically it comes down to: how do I create new aggregates, and who does what?
When an aggregate triggers an event, an internal event handler responds to it (typically by being called via an apply method) and changes the aggregate's state. How is this event handed over to the repository? Who initiates the "please send the new events to the repository / event store" action? The aggregate itself? The repository, by watching the aggregate? Someone else who is subscribed to the internal events? ...?
Last but not least I have a problem understanding the concept of an event stream correctly: In my imagination, it's simply something like an ordered list of events. What's of importance is that it's "ordered". Is this right?
The following is based on my own experience and my experiments with various frameworks like Lokad.CQRS, NCQRS, etc. I'm sure there are multiple ways to handle this. I'll post what makes most sense to me.
1. Aggregate Creation:
Every time a command handler needs an aggregate, it uses a repository. The repository retrieves the respective list of events from the event store and calls an overloaded constructor, injecting the events
var stream = eventStore.LoadStream(id)
var user = new User(stream.Events)
If the aggregate didn't exist before, the stream will be empty and the newly created object will be in its original state. You might want to make sure that in this state only a few commands are allowed to bring the aggregate to life, e.g. User.Create().
2. Storage of new Events
Command handling happens inside a Unit of Work. During command execution every resulting event will be added to a list inside the aggregate (User.Changes). Once execution is finished, the changes will be appended to the event store. In the example below this happens in the following line:
store.AppendToStream(cmd.UserId, stream.Version, user.Changes)
3. Order of Events
Just imagine what would happen if two subsequent CustomerMoved events are replayed in the wrong order.
An Example
I'll try to illustrate this with a piece of pseudo-code (I deliberately left repository concerns inside the command handler to show what would happen behind the scenes):
Application Service:

UserCommandHandler
    Handle(CreateUser cmd)
        stream = store.LoadStream(cmd.UserId)
        user = new User(stream.Events)
        user.Create(cmd.UserName, ...)
        store.AppendToStream(cmd.UserId, stream.Version, user.Changes)

    Handle(BlockUser cmd)
        stream = store.LoadStream(cmd.UserId)
        user = new User(stream.Events)
        user.Block(cmd.Reason)
        store.AppendToStream(cmd.UserId, stream.Version, user.Changes)

Aggregate:

User
    created = false
    blocked = false
    Changes = new List<Event>

    ctor(eventStream)
        isNewEvent = false
        foreach (event in eventStream)
            this.Apply(event, isNewEvent)

    Create(userName, ...)
        if (this.created) throw "User already exists"
        isNewEvent = true
        this.Apply(new UserCreated(...), isNewEvent)

    Block(reason)
        if (!this.created) throw "No such user"
        if (this.blocked) throw "User is already blocked"
        isNewEvent = true
        this.Apply(new UserBlocked(...), isNewEvent)

    Apply(userCreatedEvent, isNewEvent)
        this.created = true
        if (isNewEvent) this.Changes.Add(userCreatedEvent)

    Apply(userBlockedEvent, isNewEvent)
        this.blocked = true
        if (isNewEvent) this.Changes.Add(userBlockedEvent)
Update:
As a side note: Yves' answer reminded me of an interesting article by Udi Dahan from a couple of years ago:
Don’t Create Aggregate Roots
A small variation on Dennis' excellent answer:
When dealing with "creational" use cases (i.e. that should spin off new aggregates), try to find another aggregate or factory you can move that responsibility to. This does not conflict with having a ctor that takes events to hydrate (or any other mechanism to rehydrate for that matter). Sometimes the factory is just a static method (good for "context"/"intent" capturing), sometimes it's an instance method of another aggregate (good place for "data" inheritance), sometimes it's an explicit factory object (good place for "complex" creation logic).
I like to provide an explicit GetChanges() method on my aggregate that returns the internal list as an array. If my aggregate is to stay in memory beyond one execution, I also add an AcceptChanges() method to indicate the internal list should be cleared (typically called after things were flushed to the event store). You can use either a pull (GetChanges/Changes) or push (think .net event or IObservable) based model here. Much depends on the transactional semantics, tech, needs, etc ...
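A minimal sketch of that pull-based variant (in Scala, with hypothetical names, not tied to any particular framework):

// Hypothetical marker type for domain events
sealed trait Event

trait AggregateRoot {
  private var changes: Vector[Event] = Vector.empty

  // State transition only; also used when rehydrating from history
  protected def applyEvent(event: Event): Unit

  // Record a new event: apply it to the state and track it as uncommitted
  protected def applyNew(event: Event): Unit = {
    applyEvent(event)
    changes :+= event
  }

  // Pull model: the repository reads the uncommitted events...
  def getChanges: Seq[Event] = changes

  // ...and clears them once they have been flushed to the event store
  def acceptChanges(): Unit = { changes = Vector.empty }
}

Whether the repository pulls changes like this or the aggregate pushes them out is, as the answer says, largely a matter of transactional semantics, tech, and needs.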
Your event stream is a linked list: each revision (event/changeset) points to the previous one (a.k.a. the parent). Your event stream is the sequence of events/changes that happened to a specific aggregate. The order only needs to be guaranteed within the aggregate boundary.
I almost agree with yves-reynhout and dennis-traub, but I want to show you how I do this. I want to strip my aggregates of the responsibility to apply the events to themselves or to rehydrate themselves; otherwise there is a lot of code duplication: every aggregate constructor will look the same:
UserAggregate:
    ctor(eventStream)
        foreach (event in eventStream)
            this.Apply(event)

OrderAggregate:
    ctor(eventStream)
        foreach (event in eventStream)
            this.Apply(event)

ProfileAggregate:
    ctor(eventStream)
        foreach (event in eventStream)
            this.Apply(event)
Those responsibilities could be left to the command dispatcher. The command is handled directly by the aggregate.
Command dispatcher class

    dispatchCommand(command) method:
        newEvents = ConcurentProofFunctionCaller.executeFunctionUntilSucceeds(tryToDispatchCommand)
        EventDispatcher.dispatchEvents(newEvents)

    tryToDispatchCommand(command) method:
        aggregateClass = CommandSubscriber.getAggregateClassForCommand(command)
        aggregate = AggregateRepository.loadAggregate(aggregateClass, command.getAggregateId())
        newEvents = CommandApplier.applyCommandOnAggregate(aggregate, command)
        AggregateRepository.saveAggregate(command.getAggregateId(), aggregate, newEvents)

ConcurentProofFunctionCaller class

    executeFunctionUntilSucceeds(pureFunction) method:
        do this n times
            try
                result = pureFunction()
                return result
            catch (ConcurentWriteException)
                continue
        throw TooManyRetries

AggregateRepository class

    loadAggregate(aggregateClass, aggregateId) method:
        aggregate = new aggregateClass
        priorEvents = EventStore.loadEvents(aggregateId)
        this.applyEventsOnAggregate(aggregate, priorEvents)

    saveAggregate(aggregateId, aggregate, newEvents) method:
        this.applyEventsOnAggregate(aggregate, newEvents)
        EventStore.saveEventsForAggregate(aggregateId, newEvents, priorEvents.version)

SomeAggregate class

    handleCommand1(command1) method:
        return new SomeEvent or throw someException BUT don't change state!

    applySomeEvent(SomeEvent) method:
        changeStateSomehow() but don't throw any exception and don't return anything!
Keep in mind that this is pseudocode projected from a PHP application; the real code should have things injected and other responsibilities refactored out into other classes. The idea is to keep aggregates as clean as possible and avoid code duplication.
Some important aspects about aggregates:
- command handlers should not change state; they yield events or throw exceptions
- event apply methods should not throw any exception and should not return anything; they only change internal state
An open-source PHP implementation of this could be found here.

What approach should I take for creating a "lobby" in Node.js?

I have users connecting to a Node.js server, and when they join, I add them into a Lobby (essentially a queue). Any time there are 2 users in the lobby, I want them to pair off and be removed from the lobby. So essentially, it's just a simple queue.
I started off by trying to implement this with a Lobby.run method, which has an infinite loop (started within a process.nextTick call), and any time there are more than two entries in the queue, I remove them from the queue. However, I found that this was eating all my memory, and infinite loops like this are generally ill-advised.
I'm now assuming that emitting events via EventEmitter is the way to go. However, my concern is with synchronization. Let's assume my Lobby is pretty simple:
Lobby = {
    users: []
  , join: function (user) {
        this.users.push(user);
        emitter.emit('lobby.join', user);
    }
  , leave: function (user) {
        var index = this.users.indexOf(user);
        this.users.splice(index, 1);
        emitter.emit('lobby.leave', user);
    }
};
Now essentially I assume I want to watch for users joining the lobby and pair them up, maybe something like this:
Lobby = {
    ...
  , run: function () {
        emitter.on('lobby.join', function (user) {
            // TODO: determine if this.users contains other users,
            // pair them off, and remove them from the array
        });
    }
}
As I mentioned, this does not account for synchronization. Multiple users can join the lobby at the same time, and so the event listener might pair up a single user with multiple other users instead of just one.
Can someone with more Node.js experience tell me if I am right to be concerned with this event-based approach? Any insight for improvement on this approach would be much appreciated.
You are wrong to be concerned about this, because Node.js is single-threaded: there is no concurrency at all! Whenever a block of code is running, no other code (including event handlers) can run until that block finishes what it does. In particular, if you define this empty loop in your app:
while(true) { }
then your server is effectively dead: no other code will ever fire and no other request will ever be handled. So be careful with blocks of code and make sure that each block eventually ends.
Back to the question... So in your case it is impossible for multiple users to be paired with the same user. And let me say it one more time: this is simply because there is no concurrency in Node.js!
On the other hand, this only applies to a single instance of Node.js. If you want to scale to many machines, then obviously you will have to implement some locking mechanism (which ensures that no other process can work with the data at the same time).
