What happens when using multiple fields in fieldsGrouping? - apache-storm

Assuming we have an event stream with events with the following two attributes:
{"first_name", "last_name"}
and we partition on both attributes using fieldsGrouping:
.fieldsGrouping("spout", new Fields("first_name", "last_name"))
The processing bolt is parallelized by two tasks and the following events enter the stream in specified order:
1: {"foo", "foo"}
2: {"bar", "bar"}
3: {"foo", "bar"}
Suppose events 1 and 2 go to tasks one and two respectively. What happens to event 3? Whichever task it goes to, it breaks the grouping on one of the attributes.
How does Storm handle this? Or am I not understanding fieldsgrouping correctly?
Edit:
Thinking about this a bit more, I probably misunderstood the behaviour of fieldsGrouping. If both fields are treated as one combined key, events 1, 2, and 3 each belong to a distinct partition, which removes the problem.
However, this is not immediately clear from the only official documentation I can find on fieldsGrouping.
If anybody could point me to more detailed documentation, I would appreciate it.

You are grouping by first name and last name together, meaning that not all tuples with the same first name will end up at the same destination, but tuples with the same first name and the same last name will.
Storm Applied (Sec 3.5.3) has a good example of this based on grouping street checkins by time-interval and city instead of using only time-interval. Basically, the latter was creating a bottleneck of all street checkins in the same interval ending up in the same bolt, no matter the city. By adding city to fields grouping, they kept the requirement of having all street checkins in the same bolt and at the same time they removed the bottleneck.
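To make the "combined key" idea concrete, here is a minimal sketch in plain Java, with no Storm dependency. The class name FieldsGroupingSketch and the helper taskFor are hypothetical, and hashing the list of grouping values modulo the task count is an assumption used for illustration, not necessarily the exact function Storm applies.

```java
import java.util.Arrays;
import java.util.List;

class FieldsGroupingSketch {
    // Hypothetical helper: derives a task index from ALL grouping values together.
    // Assumption: a list hash modulo the task count stands in for Storm's real hashing.
    static int taskFor(List<String> groupingValues, int numTasks) {
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 2;
        List<List<String>> events = Arrays.asList(
                Arrays.asList("foo", "foo"),
                Arrays.asList("bar", "bar"),
                Arrays.asList("foo", "bar"));
        // Each (first_name, last_name) pair is one key; equal pairs always map to the same task.
        for (List<String> event : events) {
            System.out.println(event + " -> task " + taskFor(event, numTasks));
        }
    }
}
```

Event 3 is simply a third key. It may share a task with event 1 or event 2, but no guarantee is broken, because the guarantee only covers tuples whose first name and last name both match.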

Related

Adding latest quantity value as part of event source event

According to my understanding I have to "construct" the latest state of the system by iterating over all events for the given aggregator. If I need to find the latest quantity for a product I have to iterate over each events that added or deducted quantity from the particular item. Can I however append the "latest" quantity of that product as part of the given event? So instead of having events like {added: 3}, {deducted: 1} = 2 available, I can have {added: 3, available: 3}, {deducted: 1, available: 2} so that I can just grab the last event from the store instead of iterating over all of them OR keeping a snapshot. Is this against the "rules" and what are the possible implications?
A general rule is that an event should not contain computed values, but at the same time it's a trade-off between complexity and ease of use.
An alternative is to rely on snapshots, if performance is an issue. However, in most cases the read side can answer those aggregated questions for you. You should also not be afraid of having to load a set of events to get to the current state of a given aggregate.
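Loading the events and folding them into the current state is usually a small loop. The sketch below assumes a hypothetical QuantityChanged event with a signed delta (+3 for "added: 3", -1 for "deducted: 1"); the names are illustrative and not taken from any particular event store.

```java
import java.util.Arrays;
import java.util.List;

class StockReplaySketch {
    // Hypothetical event shape: a signed change in quantity.
    static final class QuantityChanged {
        final int delta;

        QuantityChanged(int delta) {
            this.delta = delta;
        }
    }

    // Rebuilds the available quantity by replaying every event in order.
    static int available(List<QuantityChanged> events) {
        int available = 0;
        for (QuantityChanged event : events) {
            available += event.delta;
        }
        return available;
    }

    public static void main(String[] args) {
        List<QuantityChanged> history =
                Arrays.asList(new QuantityChanged(3), new QuantityChanged(-1));
        System.out.println(available(history)); // prints 2
    }
}
```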

EventStore Competing Consumer Ordering

I am in the process of scaling out an application horizontally, and have realised read model updates (external projection via event handler) will need to be handled on a competing consumer basis.
I initially assumed that I would need to ensure ordering, but this requirement is message dependent. In the case of shopping cart checkouts where I want to know totals, I can add totals regardless of the order: get the message, update the SQL database, and ACK the message.
I am now racking my brains to think of a scenario or message type where ordering does matter. I know such cases exist, so some extra clarity and examples would be immensely useful.
The questions I need help with are:
For what type of messages would ordering be important, and how would this be resolved using the messages as-is?
How would we know which event to resubscribe from when processes join or leave? I can see possible timing issues that could cause a subscription to be requested on a message that had just been processed by another process.
I see there is a Pinned consumer strategy for best efforts affinity of stream to subscriber, however this is not guaranteed. I could solve this making a specific stream single threaded processing only those messages in order - is it possible for a process to have multiple subscriptions to different streams?
To use your example of a shopping cart, ordering would be potentially important for the following events:
Add item
Update item count
Remove item
You might have sequences like A: 'Add item, remove item' or B: 'Add item, Update item count (to 2), Update item count (to 3)'. For A, if you process the remove before the add, obviously you're in trouble. For B, if you process two update item counts out of order, you'll end up with the wrong final count.
This is normally scaled out by using some kind of sharding scheme, where a subset of all aggregates is allocated to each shard. For Event Store, I believe this can be done by creating a user-defined projection using partitionBy to partition the stream into multiple streams (aka 'shards'). Then you need to allocate partitions/shards to processing nodes in some way. Some technologies are built around this approach to horizontal scaling (Kafka and Kinesis spring to mind).
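As a rough illustration of the sharding idea, independent of Event Store, Kafka or Kinesis, the sketch below routes every event of a cart to the same shard so that one consumer sees that cart's events in their original order. CartEvent, shardFor and partition are hypothetical names, and hashing the cart id is just one possible allocation scheme.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CartShardingSketch {
    // Hypothetical event: the cart (aggregate) it belongs to and what happened.
    static final class CartEvent {
        final String cartId;
        final String type;

        CartEvent(String cartId, String type) {
            this.cartId = cartId;
            this.type = type;
        }

        @Override
        public String toString() {
            return cartId + ":" + type;
        }
    }

    // All events of one cart map to the same shard.
    static int shardFor(String cartId, int shardCount) {
        return Math.floorMod(cartId.hashCode(), shardCount);
    }

    // Splits one ordered stream into per-shard lists, keeping arrival order inside each shard.
    static Map<Integer, List<CartEvent>> partition(List<CartEvent> stream, int shardCount) {
        Map<Integer, List<CartEvent>> shards = new TreeMap<>();
        for (CartEvent event : stream) {
            int shard = shardFor(event.cartId, shardCount);
            shards.computeIfAbsent(shard, k -> new ArrayList<>()).add(event);
        }
        return shards;
    }

    public static void main(String[] args) {
        List<CartEvent> stream = Arrays.asList(
                new CartEvent("cart-1", "ItemAdded"),
                new CartEvent("cart-2", "ItemAdded"),
                new CartEvent("cart-1", "ItemRemoved"));
        System.out.println(partition(stream, 4));
    }
}
```

Each shard can then be handled by a single-threaded consumer, which gives per-cart ordering while different carts are still processed in parallel.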

CQRS / Event Sourcing - transient events

During development of my application, I found that I need to emit some events that actually don't modify the state of the aggregate, but they are needed in order to update read models (transient events?). e.g. if in my code (domain model) I hold state of hierarchy of numbers in layers like:
1 4 7
5 8
3 9
and the read model is doing projection of events like (top number from left to right):
1
5
3
then, when I trigger event in aggregate root RemovedNumber(1), and if this is the only event I trigger (since it is enough to update aggregate state), read model will not know that it needs to replace number 1 with 4.
? <--- SHOULD BE 4 SINCE 4 IS UNDER 1
5
3
So here basically, I need to trigger additionally: NowShowNumber(4 instead of 1), and then read model will know to project:
4
5
3
Event RemovedNumber(1) should be kept in event store, since it affects internal state of aggregate. Event NowShowNumber(4 instead of 1) should also be stored in event store since it is affecting read model (and should be replayed on re-projecting it), but it should probably not be used during reconstruction of aggregate root from event stream.
Is this standard practice in CQRS/Event Sourcing systems? Is there some alternative solution?
Why doesn't the Read model know to show number 4?
Didn't the Aggregate emit an AddNumber(4) prior to AddNumber(1)?
Then the Read model already has the necessary state replicated on its side, basically a stack of numbers, so it can pop the removed number and show the one beneath it.
In CQRS, to help the Read models, the Aggregate can also include bits of the previous or resulting state in the Event it emits when its state changes.
In your case, the emitted Event could have the following signature RemovedNumber( theRemovedNumber, theNewCurrentNumber), and in particular RemovedNumber(1, 4).
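A minimal sketch of how a read model might consume such an event follows. The RemovedNumber class and the projection are hypothetical; the only thing taken from the discussion above is the idea that the event carries both the removed number and the number that replaces it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TopNumbersReadModelSketch {
    // Hypothetical event: the number that was removed and the number now on top.
    static final class RemovedNumber {
        final int removed;
        final int newCurrent;

        RemovedNumber(int removed, int newCurrent) {
            this.removed = removed;
            this.newCurrent = newCurrent;
        }
    }

    private final List<Integer> shown;

    TopNumbersReadModelSketch(List<Integer> initial) {
        this.shown = new ArrayList<>(initial);
    }

    // Replaces the removed number in place; no knowledge of the hidden layers is needed.
    void apply(RemovedNumber event) {
        int index = shown.indexOf(event.removed);
        if (index >= 0) {
            shown.set(index, event.newCurrent);
        }
    }

    List<Integer> shown() {
        return shown;
    }

    public static void main(String[] args) {
        TopNumbersReadModelSketch model =
                new TopNumbersReadModelSketch(Arrays.asList(1, 5, 3));
        model.apply(new RemovedNumber(1, 4));
        System.out.println(model.shown()); // prints [4, 5, 3]
    }
}
```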
I call these events out of band events and save them to a different stream than I hydrate aggregates with.
Haven't heard anyone else doing it but haven't heard any good arguments to not do it - especially if you have a legitimate case for posting events that have no effect at all on the aggregate.
In your case if I understand your problem well enough I would just have the domain write a TopLevelNumberChanged event which the read model would see and process.
And obviously it would not read that event when hydrating.
I cannot see that it is at all an issue having events that don't effect changes in your projections. Depending on the projection it may be that the projection ignores many events.
That being said, if you are saying that these two events go hand-in-hand you may need to have another look at the design / intention. How do you know to call the second command? Would a single command not perhaps do the trick? The event could return the full change:
NumberReplacedEvent ReplaceNumber(1, 4);
The event would contain all the state:
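public class NumberReplacedEvent
{
    // The number that was removed from the top of its layer.
    public int ReplacedNumber { get; set; }
    // The number that is now shown in its place.
    public int WithNumber { get; set; }
}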
From my understanding, there's no single correct answer. CQRS / Event Sourcing is just a tool for helping you to model your data flow. But it's still your data, your business rules, your use case. In other words: some other company could use the exact same data model but have a different event structure, because it fits their use case better.
An example:
Let's imagine we have an online shop. And every time a customer buys a product, we decrease the inStock value for that product. If the customer sends the product back, we increase the value.
The command is pretty simple: BuyProduct(id: "123", amount: 4)
For the resulting event we have (at least) 2 options:
ProductBuyed(id: "123", amount: 4) (delta value)
ProductBuyed(id: "123", newInStockValue: 996) (new total value)
(you could also publish 4 times a simple ProductBuyed(id: "123") event)
Or you can have multiple resulting events at the same time:
ProductBuyed(id: "123", amount: 4)
InStockValueForProductChanged(id: "123", newValue: 996)
An online shop will possibly have multiple read models that are interested in these events. The Product Page wants to display "only 996 items left!", and the Shop Statistics Page wants to display "sold 4 items today". So both options (total and delta) can be useful.
But also both Pages could work if there's only one of both events. Then the read side must do the calculation: oldTotal - newTotal = delta or oldTotal - delta = newTotal.
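The two conversions are plain arithmetic, sketched below with hypothetical helper names. The old total of 1000 is not stated above; it is inferred from the example values (996 left after buying 4).

```java
class StockMathSketch {
    // The read side only received the new total and needs the delta.
    static int deltaFromTotals(int oldTotal, int newTotal) {
        return oldTotal - newTotal;
    }

    // The read side only received the delta and needs the new total.
    static int newTotalFromDelta(int oldTotal, int delta) {
        return oldTotal - delta;
    }

    public static void main(String[] args) {
        int oldTotal = 1000; // inferred: 996 in stock after 4 were bought
        System.out.println(deltaFromTotals(oldTotal, 996)); // prints 4
        System.out.println(newTotalFromDelta(oldTotal, 4)); // prints 996
    }
}
```

Either way the read model has to remember the previous total, which is the extra state it takes on when the event carries only one of the two values.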
There are even more possible solutions. For example:
Checkout Service publishes ProductBuyed(id: "123", amount: 4) event
Stock Service receives this event, decreases the stock and then publishes the InStockValueForProductChanged(id: "123", newValue: 996) event
It really depends on the needs of your business.
My suggestions:
I prefer when the write model is only responsible for managing the business rules. Get Command, validate it, publish event(s) which look pretty similar to the command contents.
And the read model should be as simple as possible, too. Get Event, update model.
If calculations have to be done, there are a few options:
The calculation is part of a business rule? Then your write side has to compute the result anyway. In this case you already have written the algorithm, the CPU has done its work, and you have the resulting value for free. (Just include the result with the published event)
The calculation is really complex and/or there are multiple event consumers that need the result. Then it might be better to compute it once and include the result in an event, instead of computing it n times for every involved event consumer. Complex could mean:
Takes a lot of time
Very CPU / memory intensive
Needs special / huge external libs (imagine you had to include some Image Processing library with every read service)
The calculation is the result of a combination of a lot of different events (i.e. it's getting complex): Build an external service, which is responsible for the calculation. This way you can easily scale out by providing multiple instances of this service.
If the calculation is not part of a business rule and it's simple and only a single service needs the result or if it's only relevant for the read model: Place it in the read side.
In the end it's a tradeoff:
Duplicate algorithms? You could have multiple event consumers written with different programming languages. Do you want to implement the algorithm multiple times?
More network traffic / bigger event store? If you include the calculation result with the event, there's more data to store and transfer between the services. Can your infrastructure handle that?
Can your write / read service take the additional load?

Storm fields grouping

I'm having the following situation:
There is a number of bolts that calculate different values
These values are sent to a visualization bolt
Visualization bolt opens a web socket and sends values to be visualized somehow
The thing is, visualization bolt is always the same, but it sends a message with a different header for each type of bolt that can be its input. For example:
BoltSum calculates sum
BoltDif calculates difference
BoltMul calculates the product
All these bolts use VisualizationBolt for visualization
There are 3 instances of VisualizationBolt in this case
My question is, should I create 3 independent instances, where each instance will have one thread, e.g.
builder.setBolt("forSum", new VisualizationBolt(),1).globalGrouping("bolt-sum");
builder.setBolt("forDif", new VisualizationBolt(),1).globalGrouping("bolt-dif");
builder.setBolt("forMul", new VisualizationBolt(),1).globalGrouping("bolt-mul");
Or should I do the following
builder.setBolt("forAll", new VisualizationBolt(), 3)
.fieldsGrouping("bolt-sum", new Fields("type"))
.fieldsGrouping("bolt-dif", new Fields("type"))
.fieldsGrouping("bolt-mul", new Fields("type"));
And emit type from each of the previous bolts, so they can be grouped on based on it?
What are the advantages?
Also, should I expect that each and every time bolt-sum will go to first visualization bolt, bolt-dif will go to second visualization bolt and bolt-mul will go to third visualization bolt? They won't be mixed?
I think that that should be the case, but it currently isn't in my implementation, so I'm not sure if it's a bug or I'm missing something?
The first approach, using three separate bolts, is the correct one. fieldsGrouping ensures neither that "sum" values go to a particular "Sum-Visualization-Bolt" instance nor that the sum/dif/mul values end up in distinct bolt instances.
The semantics of fieldsGrouping are more relaxed: it only guarantees that all tuples of the same type are processed by a single bolt instance, i.e., no two bolt instances ever receive the same type. A single instance may still receive several types.
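The following plain-Java sketch shows why three types and three tasks do not imply one type per task. The helper taskFor is hypothetical, and the hash (a one-element list hash modulo the task count) is an assumption chosen for illustration; the actual assignment in a running topology may differ, but the same kind of collision is possible.

```java
import java.util.Arrays;
import java.util.Collections;

class TypeGroupingSketch {
    // Hypothetical helper: maps the value of the "type" field to one of numTasks tasks.
    // Assumption: a list hash modulo the task count stands in for Storm's real hashing.
    static int taskFor(String type, int numTasks) {
        return Math.floorMod(Collections.singletonList(type).hashCode(), numTasks);
    }

    public static void main(String[] args) {
        for (String type : Arrays.asList("sum", "dif", "mul")) {
            System.out.println(type + " -> task " + taskFor(type, 3));
        }
    }
}
```

With this illustrative hash, "dif" and "mul" land on the same task while one task stays idle. Every type still goes to exactly one task, so the fields grouping guarantee holds, yet the types are mixed within an instance.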
I guess you can use Partial Key grouping (partialKeyGrouping). The Storm documentation on stream groupings says:
Partial Key grouping: The stream is partitioned by the fields
specified in the grouping, like the Fields grouping, but are load
balanced between two downstream bolts, which provides better
utilization of resources when the incoming data is skewed. This paper
provides a good explanation of how it works and the advantages it
provides.
I implemented a simple topology using this grouping, and the charts on the Graphite server show a better load balance compared to fieldsGrouping. The full source code is here.
topologyBuilder.setBolt(MqttSensors.BOLT_SENSOR_TYPE.getValue(), new SensorAggregateValuesWindowBolt().withTumblingWindow(Duration.seconds(5)), 2)
// .fieldsGrouping(MqttSensors.SPOUT_STATION_01.getValue(), new Fields(MqttSensors.FIELD_SENSOR_TYPE.getValue()))
// .fieldsGrouping(MqttSensors.SPOUT_STATION_02.getValue(), new Fields(MqttSensors.FIELD_SENSOR_TYPE.getValue()))
.partialKeyGrouping(MqttSensors.SPOUT_STATION_01.getValue(), new Fields(MqttSensors.FIELD_SENSOR_TYPE.getValue()))
.partialKeyGrouping(MqttSensors.SPOUT_STATION_02.getValue(), new Fields(MqttSensors.FIELD_SENSOR_TYPE.getValue()))
.setNumTasks(4) // This will create 4 Bolt instances
.addConfiguration(TagSite.SITE.getValue(), TagSite.EDGE.getValue())
;

Duplicates when link-walking Riak using Ripple

I'm working on a project where I use Riak with Ripple, and I've stumbled on a problem.
For some reason I get duplicates when link-walking a structure of links. When I link walk using curl I don't get the duplicates as far as I can see.
The difference between my curl based link-walk
curl -v http://127.0.0.1:8098/riak/users/2306403e5177b4716da9df93b67300824aa2fd0e/_,projects,0/_,tasks,1
and my ruby ripple/riak-client based link walk
result = Riak::MapReduce.new(self.robject.bucket.client).
add(self.robject.bucket,self.key).
link(Riak::WalkSpec.new({:key => 'projects'})).
link(Riak::WalkSpec.new({:key => 'tasks', :bucket=>'tasks'})).
map("function(v){ if(!JSON.parse(v.values[0].data).completed) {return [v];} else { return [];} }", {:keep => true}).run
is as far as I can tell the map at the end.
However, the result of the map/reduce contains several duplicates, and I can't wrap my head around why. For now I've settled for removing the duplicates based on the key, but I wish the Riak result didn't contain them in the first place, since removing duplicates at the end seems like a waste.
I've tried the following:
Making sure there are no duplicates in the links sets of my ripple objects
Loading the data without the map reduce, but the link walk contains duplicate keys.
Any help is appreciated.
What you're running into here is an interesting side-effect/challenge of Map/Reduce queries.
M/R queries don't have any notion of read quorum values, and they necessarily have to hit every object (within the limitations of input filtering, of course) on every node.
Which means, when N > 1, the queries have to hit every copy of every object.
For example, let's say N=3, as per default. That means, for each written object, there are 3 copies, one each on 3 different nodes.
When you issue a read for an object (let's say with the default quorum value of R=2), the coordinating node (which received the read request from your client) contacts all 3 nodes (and potentially receives 3 different values, 3 different copies of the object).
It then checks to make sure that at least 2 of those copies have the same values (to satisfy the R=2 requirement), returns that agreed-upon value to the requesting client, and discards the other copies.
So, in regular operations (reads/writes, but also link walking), the coordinating node filters out the duplicates for you.
Map/Reduce queries don't have that luxury. They don't really have quorum values associated with them -- they are made to iterate over every (relevant) key and object on all the nodes. And because the M/R code runs on each individual node (close to the data) instead of just on the coordinating node, they can't really filter out any duplicates intrinsically. One of the things they're designed for, for example, is to update (or delete) all of the copies of the objects on all the nodes. So, each Map phase (in your case above) runs on every node, returns the matched 'completed' values for each copy, and ships the results back to the coordinating node to return to the client. And since it's very likely that your N>1, there's going to be duplicates in the result set.
Now, you can probably filter out duplicates explicitly, by writing code in the Reduce phase, to check if there's already a key present and reject duplicates if it is, etc.
But honestly, if I was in your situation, I would just filter out the duplicates in ruby on the client side, rather than mess with the reduce code.
Anyways, I hope that sheds some light on this mystery.
