I have an incoming stream of events, each of which already has an associated sessionId from another process.
All I wish to do is combine these events into a single session object using a custom CombineFn.
During development, I'm using a bounded dataset that reads from a file and the following code seems to work:
input.apply(ParDo.named("ParseEvent").of(new ParseEventFn()))
.setCoder(KvCoder.of(StringUtf8Coder.of(), AvroCoder.of(Event.class)))
.apply(GroupByKey.<String, Event>create())
.apply(Combine.groupedValues(new SessionAccumulator()))
The above code (with input/output handling) will output a series of sessions with multiple events in each.
{sessionId: 1, events: [event1,event2,event3]}
{sessionId: 2, events: [event4,event5]}
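(SessionAccumulator itself isn't shown in the question; for context, a minimal CombineFn that folds grouped events into a session object, written against the Beam 2.x API with a hypothetical Session type, could look like the sketch below.)
// Sketch only: "Session" and its Session(List<Event>) constructor are assumptions, not code from the question.
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.Combine;

class SessionAccumulator extends Combine.CombineFn<Event, List<Event>, Session> {
  @Override
  public List<Event> createAccumulator() {
    return new ArrayList<>();
  }

  @Override
  public List<Event> addInput(List<Event> accumulator, Event input) {
    accumulator.add(input);
    return accumulator;
  }

  @Override
  public List<Event> mergeAccumulators(Iterable<List<Event>> accumulators) {
    List<Event> merged = new ArrayList<>();
    for (List<Event> accumulator : accumulators) {
      merged.addAll(accumulator);
    }
    return merged;
  }

  @Override
  public Session extractOutput(List<Event> accumulator) {
    return new Session(accumulator); // hypothetical Session(List<Event>) constructor
  }
}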
But in order for this to work on an unbounded dataset, I need to apply a Windowing function, which in this case is a SessionWindow.
input.apply(ParDo.named("ParseEvent").of(new ParseEventFn()))
.setCoder(KvCoder.of(StringUtf8Coder.of(), AvroCoder.of(Event.class)))
.apply(Window.<KV<String, Event>>into(Sessions.withGapDuration(Duration.standardMinutes(30))))
.apply(GroupByKey.<String, Event>create())
.apply(Combine.groupedValues(new SessionAccumulator()))
In that case the only new code is the windowing function, and rather than rolling up the events, I get each event in its own session, like this:
{sessionId: 1, events: [event1]}
{sessionId: 1, events: [event2]}
{sessionId: 1, events: [event3]}
{sessionId: 2, events: [event4]}
{sessionId: 2, events: [event5]}
Any idea why this is happening?
EDIT: I should add that ParseEventFn is applying a timestamp to each element using context.outputWithTimestamp(), and that timestamp seems to be correct.
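(ParseEventFn isn't shown either; a minimal sketch of a DoFn that keys by session ID and sets the event-time timestamp via outputWithTimestamp(), written against the Beam 2.x API with assumed Event accessors, could look like this:)
// Sketch only, not the question's actual ParseEventFn. Event.parse(...),
// getTimestampMillis(), and getSessionId() are assumed names.
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Instant;

class ParseEventFn extends DoFn<String, KV<String, Event>> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    Event event = Event.parse(c.element());                       // hypothetical parsing helper
    Instant eventTime = new Instant(event.getTimestampMillis());  // hypothetical accessor
    // Key by session ID and set the element's event-time timestamp;
    // the session windows are computed from this timestamp.
    c.outputWithTimestamp(KV.of(event.getSessionId(), event), eventTime);
  }
}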
Digging into it further, in my case the issue was that my core assumption was wrong: the timestamps I was applying before the windowing were incorrect.
The windowing was doing exactly what it should, but I had set my timestamps too far apart, and it created a separate session for each event.
Oops
In your case, you could possibly write your own WindowFn. If you set the keys to be the session IDs then a large gap duration also works, but it doesn't reflect the nature of your data and computation quite as well.
The ingredients to your WindowFn would be (a rough sketch follows this list):
your own subclass of BoundedWindow; in this case you would make a window type that contains the session ID in a field
assignWindows, where you would assign each element to a window identified by the session ID. The length of the window still matters, as it controls when the window expires and is garbage collected.
mergeWindows, where you would merge all windows that have the same session ID. They wouldn't have to fall within any particular gap duration.
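A rough sketch of these ingredients, written against the Beam 2.x WindowFn API (the Event type is from the question; all other names are illustrative assumptions, not a drop-in implementation):
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CustomCoder;
import org.apache.beam.sdk.coders.InstantCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.transforms.windowing.WindowFn;
import org.apache.beam.sdk.transforms.windowing.WindowMappingFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;
import org.joda.time.Instant;

// A window identified by a session ID; it expires gapDuration after the latest element seen so far.
class SessionIdWindow extends BoundedWindow {
  private final String sessionId;
  private final Instant maxTimestamp;

  SessionIdWindow(String sessionId, Instant maxTimestamp) {
    this.sessionId = sessionId;
    this.maxTimestamp = maxTimestamp;
  }

  String sessionId() { return sessionId; }

  @Override
  public Instant maxTimestamp() { return maxTimestamp; }

  @Override
  public boolean equals(Object o) {
    return o instanceof SessionIdWindow
        && sessionId.equals(((SessionIdWindow) o).sessionId)
        && maxTimestamp.equals(((SessionIdWindow) o).maxTimestamp);
  }

  @Override
  public int hashCode() { return 31 * sessionId.hashCode() + maxTimestamp.hashCode(); }
}

class SessionIdWindowFn extends WindowFn<KV<String, Event>, SessionIdWindow> {
  private final Duration gapDuration;

  SessionIdWindowFn(Duration gapDuration) { this.gapDuration = gapDuration; }

  @Override
  public Collection<SessionIdWindow> assignWindows(AssignContext c) {
    // The element's key is the session ID; the window end controls expiry and garbage collection.
    return Collections.singletonList(
        new SessionIdWindow(c.element().getKey(), c.timestamp().plus(gapDuration)));
  }

  @Override
  public void mergeWindows(MergeContext c) throws Exception {
    // Merge every window with the same session ID, regardless of any gap between them.
    Map<String, List<SessionIdWindow>> byId = new HashMap<>();
    for (SessionIdWindow w : c.windows()) {
      byId.computeIfAbsent(w.sessionId(), id -> new ArrayList<>()).add(w);
    }
    for (List<SessionIdWindow> group : byId.values()) {
      if (group.size() <= 1) { continue; }
      Instant latest = BoundedWindow.TIMESTAMP_MIN_VALUE;
      for (SessionIdWindow w : group) {
        if (w.maxTimestamp().isAfter(latest)) { latest = w.maxTimestamp(); }
      }
      c.merge(group, new SessionIdWindow(group.get(0).sessionId(), latest));
    }
  }

  @Override
  public boolean isCompatible(WindowFn<?, ?> other) { return other instanceof SessionIdWindowFn; }

  @Override
  public Coder<SessionIdWindow> windowCoder() {
    return new CustomCoder<SessionIdWindow>() {
      @Override
      public void encode(SessionIdWindow w, OutputStream out) throws IOException {
        StringUtf8Coder.of().encode(w.sessionId(), out);
        InstantCoder.of().encode(w.maxTimestamp(), out);
      }
      @Override
      public SessionIdWindow decode(InputStream in) throws IOException {
        return new SessionIdWindow(StringUtf8Coder.of().decode(in), InstantCoder.of().decode(in));
      }
    };
  }

  @Override
  public WindowMappingFn<SessionIdWindow> getDefaultWindowMappingFn() {
    throw new UnsupportedOperationException("This merging WindowFn cannot be used for side inputs.");
  }
}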
Another thing you'll need to be careful of is that the watermark that governs the garbage collection of these windows is determined by the source of your unbounded stream of events. So setting the timestamps in your ParDo.of(new ParseEventFn()) will be too late to influence the watermark. You may have data dropped that you'd like to keep.
I deployed an Apache Beam pipeline to GCP Dataflow in a DEV environment and everything worked well. Then I deployed it to production in a Europe environment (to be specific: job region europe-west1, worker location europe-west1-d) where we get high data velocity, and things started to get complicated.
I am using a session window to group events into sessions. The session key is the tenantId/visitorId and its gap is 30 minutes. I am also using a trigger to emit events every 30 seconds, to release events sooner than the end of the session (writing them to BigQuery).
The problem appears to happen in the EventToSession/GroupPairsByKey step. In this step there are thousands of events under the droppedDueToLateness counter, and the dataFreshness keeps increasing (it has been increasing since I deployed). All steps before this one operate well, and all steps after it are affected by it but don't seem to have any other problems.
I looked into some metrics and see that the EventToSession/GroupPairsByKey step is processing between 100K and 200K keys per second (depending on the time of day), which seems like quite a lot to me. CPU utilization doesn't go over 70% and I am using Streaming Engine. The number of workers is 2 most of the time. The max worker memory capacity is 32GB, while the max worker memory usage currently stands at 23GB. I am using the e2-standard-8 machine type.
I don't have any hot keys, since each session contains at most a few dozen events.
My biggest suspicion is the huge number of keys being processed in the EventToSession/GroupPairsByKey step. But on the other hand, a session is usually related to a single customer, so Google should expect to handle this number of keys per second, no?
I would like to get suggestions on how to solve the dataFreshness and droppedDueToLateness issues.
Adding the piece of code that generates the sessions:
input = input.apply("SetEventTimestamp", WithTimestamps.of(event -> Instant.parse(getEventTimestamp(event))
.withAllowedTimestampSkew(new Duration(Long.MAX_VALUE)))
.apply("SetKeyForRow", WithKeys.of(event -> getSessionKey(event))).setCoder(KvCoder.of(StringUtf8Coder.of(), input.getCoder()))
.apply("CreatingWindow", Window.<KV<String, TableRow>>into(Sessions.withGapDuration(Duration.standardMinutes(30)))
.triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))))
.discardingFiredPanes()
.withAllowedLateness(Duration.standardDays(30)))
.apply("GroupPairsByKey", GroupByKey.create())
.apply("CreateCollectionOfValuesOnly", Values.create())
.apply("FlattenTheValues", Flatten.iterables());
After doing some research I found the following:
regarding the constantly increasing data freshness: as long as late data is allowed to arrive at a session window, that specific window will persist in memory. This means that allowing data to be 30 days late keeps every session in memory for at least 30 days, which obviously can overload the system. Moreover, I found we had some everlasting sessions created by bots visiting and taking actions on the websites we monitor. These bots can hold sessions open forever, which can also overload the system. The solution was decreasing the allowed lateness to 2 days and using bounded sessions (search for "bounded sessions").
regarding events dropped due to lateness: these are events that, at the time of arrival, belong to an expired window, i.e. a window whose end the watermark has already passed (see the documentation for the droppedDueToLateness counter). These events are dropped in the first GroupByKey after the session window function and can't be processed later. We didn't want to drop any late data, so the solution was to check each event's timestamp before it enters the sessions part and to stream into the session part only events that won't be dropped, i.e. events that satisfy: event_timestamp >= event_arrival_time - (gap_duration + allowed_lateness). The rest are written to BigQuery without the session data. (Apparently Apache Beam drops an event whose timestamp is before event_arrival_time - (gap_duration + allowed_lateness) even if there is a live session the event belongs to.) A sketch of this pre-filter follows.
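Illustrative sketch of the pre-filter (not the original code): it assumes the same 30-minute gap, the reduced 2-day allowed lateness, and the question's getEventTimestamp(...) helper; using Partition and Instant.now() as a stand-in for arrival time are choices of this sketch.
// Partition and PCollectionList come from org.apache.beam.sdk.transforms / org.apache.beam.sdk.values.
Duration gapDuration = Duration.standardMinutes(30);
Duration allowedLateness = Duration.standardDays(2);

PCollectionList<TableRow> split = input.apply("SplitDroppableEvents",
    Partition.of(2, new Partition.PartitionFn<TableRow>() {
      @Override
      public int partitionFor(TableRow event, int numPartitions) {
        Instant eventTime = Instant.parse(getEventTimestamp(event));
        Instant arrivalTime = Instant.now(); // approximation of the arrival/processing time
        // Events older than gap + allowed lateness would be droppedDueToLateness downstream.
        return eventTime.isBefore(arrivalTime.minus(gapDuration.plus(allowedLateness))) ? 1 : 0;
      }
    }));

PCollection<TableRow> sessionable = split.get(0); // feed this into the session windowing above
PCollection<TableRow> tooLate = split.get(1);     // write these to BigQuery without session data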
p.s - in the bounded sessions part, where the author demonstrates how to implement a time-bounded session, I believe there is a bug that allows a session to grow beyond the provided max size. Once a session has exceeded the max size, one can send late data that intersects the session and precedes it, which moves the session's start time earlier and thereby expands the session. Furthermore, once a session has exceeded the max size, events that belong to it but don't extend it can no longer be added.
To fix that, I switched the order of the current window span and the if-statement, and edited the if-statement (the one checking the session max size) in the mergeWindows function, so a session can't pass the max size and can only receive data that doesn't extend it beyond the max size. This is my implementation:
public void mergeWindows(MergeContext c) throws Exception {
    // Sort the candidate windows so that overlapping windows are adjacent.
    List<IntervalWindow> sortedWindows = new ArrayList<>();
    for (IntervalWindow window : c.windows()) {
        sortedWindows.add(window);
    }
    Collections.sort(sortedWindows);
    List<MergeCandidate> merges = new ArrayList<>();
    MergeCandidate current = new MergeCandidate();
    for (IntervalWindow window : sortedWindows) {
        MergeCandidate next = new MergeCandidate(window);
        if (current.intersects(window)) {
            // Only absorb the window if the merged span stays within maxSize + gapDuration.
            if ((current.union == null || new Duration(current.union.start(), window.end()).getMillis() <= maxSize.plus(gapDuration).getMillis())) {
                current.add(window);
                continue;
            }
        }
        merges.add(current);
        current = next;
    }
    merges.add(current);
    for (MergeCandidate merge : merges) {
        merge.apply(c);
    }
}
I have a scheduled script execution that needs to persist a value between runs. It is updated with each run. Using gs.setProperty seemed like the natural place until I came across this:
Care should be taken when setting system properties (sys_properties)
using this method as it causes a system-wide cache flush. Each flush
can cause system degradation while the caches rebuild. If a value must
be updated often, it should not be stored as a system property. In
general, you should only place values in the sys_properties table that
do not frequently change.
Creating a separate table to store a single scalar value seems like overkill. Is there a better place to store it?
You could set a preference if you need it in the instance. Another place could be the events table: log an event with the data in parm1 or parm2, and on the next run query the most recent event.
I'd avoid making a table, as that has cost implications for some clients, and I agree with the points about sys_properties. If the value is sensitive, you could encrypt it with GlideEncrypter before storing it on the event:
var encrypter = new GlideEncrypter();
var encrypted = encrypter.encrypt('Super Secret Phrase');
gs.info('encrypted: ' + encrypted);
var decrypted = encrypter.decrypt(encrypted);
gs.info('decrypted: ' + decrypted);
/**
*** Script: encrypted: g/bXLJHa7xNRMKZEo5q/YtLMEdse36ED
*** Script: decrypted: Super Secret Phrase
*/
This way only administrators could really read this data. Also if I recall correctly, the sysevent table is cleared after 7 days. You could have the job remove the event as soon as it has it in memory.
How do I implement a sliding window aggregation (or transformation) with a fixed-size count-based window?
For example, if I have stream data like the following
input stream = 1,2,3,4,5,6,7,8...
Assume that time is not relevant here. Say my aggregate function is AVERAGE and the window size is fixed at 3 records (not 3 ms, 3 seconds, 3 hours, etc.); I would like my output stream to be
output stream = avg(1,2,3), avg(2,3,4), avg(3,4,5), avg(4,5,6), avg(5,6,7)... = 2,3,4,5,6...
The windows documented in Kafka Streams are "time-based". Even the constructor of the base class Window has the following signature:
Window(long startMs, long endMs)
So I was not sure whether it's the right tool for non-time-based windowed aggregation.
Apache Flink supports count-based sliding and tumbling windows. That's exactly what I need, but I'm looking for a similar feature in Kafka Streams.
If time-ordering is no concern for you, you can implement a custom Transformer with attached state.
StreamsBuilder builder = new StreamsBuilder();
builder.addStateStore(...); // register a KeyValueStore here
KStream result = builder.stream("topic").transform(...); // pass in the name of your KeyValueStore, too
In your custom Transformer you can maintain a List per key, with the list being your window: as long as the list is smaller than your window size, you append the new record to the list; if it's exactly the size, you trigger the computation; if it exceeds the size, you trim it and trigger the computation afterwards.
See the docs for more details: https://kafka.apache.org/10/documentation/streams/developer-guide/processor-api.html (Note, that a Processor and a Transformer are basically the same thing.)
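A rough sketch of such a Transformer (count-based window of size 3, computing an average); the store name, the serdes, and String/Integer key/value types are assumptions:
// Sketch only: a count-based sliding-average Transformer with an attached KeyValueStore.
// Assumes a store registered as "count-window-store" with a serde able to handle an integer list.
import java.util.LinkedList;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class CountWindowAverage implements Transformer<String, Integer, KeyValue<String, Double>> {

  private static final int WINDOW_SIZE = 3; // count-based window size
  private KeyValueStore<String, List<Integer>> store;

  @SuppressWarnings("unchecked")
  @Override
  public void init(final ProcessorContext context) {
    store = (KeyValueStore<String, List<Integer>>) context.getStateStore("count-window-store");
  }

  @Override
  public KeyValue<String, Double> transform(final String key, final Integer value) {
    List<Integer> window = store.get(key);
    if (window == null) {
      window = new LinkedList<>();
    }
    window.add(value);
    if (window.size() > WINDOW_SIZE) {
      window.remove(0); // slide the window: drop the oldest record
    }
    store.put(key, window);
    if (window.size() < WINDOW_SIZE) {
      return null; // window not full yet, emit nothing
    }
    double sum = 0;
    for (final int v : window) {
      sum += v;
    }
    return KeyValue.pair(key, sum / WINDOW_SIZE); // emit the sliding average
  }

  // Required by the Transformer interface in Kafka 1.x only (deprecated there, removed later).
  public KeyValue<String, Double> punctuate(final long timestamp) {
    return null;
  }

  @Override
  public void close() {}
}
You would register the store via builder.addStateStore(...) and pass its name as the second argument to transform(), e.g. stream.transform(CountWindowAverage::new, "count-window-store").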
If you are open to using Apache Storm, which is also a streaming engine, Kafka can be connected to it as a data source. Newer versions of Storm provide a concept called Tumbling Window, which delivers an exact number of tuples to your topology. This can easily be used to solve your problem.
For more, have a look at https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_storm-component-guide/content/storm-windowing-concepts.html
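For reference, Storm's windowed bolts support both count-based tumbling and sliding windows; a rough sketch of a count-based sliding-average bolt against the Storm 1.x API (field and stream names are assumptions) might look like this:
// Sketch only: emits the average of the tuples currently in the window.
import java.util.List;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseWindowedBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.windowing.TupleWindow;

public class SlidingAverageBolt extends BaseWindowedBolt {
  private OutputCollector collector;

  @Override
  public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
  }

  @Override
  public void execute(TupleWindow window) {
    List<Tuple> tuples = window.get();      // all tuples currently in the window
    double sum = 0;
    for (Tuple t : tuples) {
      sum += t.getIntegerByField("value");  // "value" field name is an assumption
    }
    collector.emit(new Values(sum / tuples.size()));
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("avg"));
  }
}
Wiring it with a 3-tuple window sliding by 1 tuple would be something like new SlidingAverageBolt().withWindow(new BaseWindowedBolt.Count(3), new BaseWindowedBolt.Count(1)), attached after a Kafka spout.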
I am trying to update a projection from the event store. The following line will load all events:
$events = $this->eventStore->load(new StreamName('mystream'));
Currently I am trying to load only unhandled events by passing the fromNumber parameter:
$events = $this->eventStore->load(new StreamName('mystream'), 10);
This will load all events, e.g. from 15 to 40. But I found no way to figure out the current/highest number of the results, which I need in order to load only from that entry onward next time.
If the database is truncated (with restarted sequences) this is not a real problem, because I know the events will start at 1. But if the primary key starts with a number higher than 1, I cannot figure out which event has which number in the event store.
When you are using pdo-event-store, there is a _position key in the event metadata after loading, so your read model can track which position you last worked on. Other than that, if you are working with prooph's event-store projections, you don't need to take care of that at all. The projector tracks the current event position for all needed streams internally; you just need to provide callbacks for each event where you need to do something.
We are using Kaltura to notify our CMS about changes in the videos. In the KMC under Settings->Integrations Settings we have checked all the checkboxes under "Sent by Server".
Sometimes these checkmarks disappear. It happens maybe once a week or once a month. How can we find the reason for these boxes being deactivated?
Those notifications are stored on the partner object in the partner table. The actual data is kept in the custom_data field, which holds a large amount of PHP-serialized data.
I suspect there are cases where, due to updates of other fields in the custom_data object, the notifications section gets erased.
Your best shot would be to first check the value of that field when the config gets erased. If it was actually erased in the database, try to find log messages like the following in api_v3.log (they can lead you to the actual API request that modified the field):
[2124167851][propel] */ UPDATE partner SET
`UPDATED_AT`='2017-10-04 14:11:36',
`NOTIFY`='1',
`CUSTOM_DATA`='a:79:{s:9:"firstName";s:5:"Roman";s:12:"isFirstLogin";b:0;
... tons of PHP serialized data ...
i:1;s:19:"notificationsConfig";s:42:"*=0;1=1;2=1;3=1;4=0;21=0;6=0;7=0;26=0;5=0;";
... tons of PHP serialized data ...
}' WHERE partner.ID='101' AND MD5(cast(partner.CUSTOM_DATA as char character set latin1)) = '7eb7781cc04c7f98077efc2e3c1e9426'
The key that stores the notifications config is notificationsConfig (each number represents the notification type, followed by 0 / 1 for off / on).
As a side note, which CE version are you using? There might be a more reliable way to integrate with your CMS.