Kafka Streams: windowed stream behavior on application restart

I am working on a simple stream of data (say inputStream), for example:
|-----|--------|-------|
| Key | TS(ms) | Value |
|-----|--------|-------|
| A | 1000 | 0 |
| B | 1000 | 0 |
| A | 61000 | 0 |
| B | 61000 | 0 |
| A | 121000 | 0 |
| B | 121000 | 0 |
| A | 1000 | 1 |
| B | 1000 | 1 |
| A | 61000 | 1 |
| B | 61000 | 1 |
| A | 121000 | 1 |
| B | 121000 | 1 |
Notice that the value is updated for the same key and timestamp. Here is the code:
KStream<Windowed<String>, Long> aggregatedStream = inputStream
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)).grace(Duration.ofMinutes(1)))
    .count(Materialized.as("count-metric"))
    .toStream();
aggregatedStream.print(Printed.toSysOut());
The output of print is:
[KTABLE-TOSTREAM-0000000014]: [A#0/60000], 1
[KTABLE-TOSTREAM-0000000014]: [B#0/60000], 1
[KTABLE-TOSTREAM-0000000014]: [A#60000/120000], 1
[KTABLE-TOSTREAM-0000000014]: [B#60000/120000], 1
[KTABLE-TOSTREAM-0000000014]: [A#120000/180000], 1
[KTABLE-TOSTREAM-0000000014]: [B#120000/180000], 1
[KTABLE-TOSTREAM-0000000014]: [A#60000/120000], 2
[KTABLE-TOSTREAM-0000000014]: [B#60000/120000], 2
[KTABLE-TOSTREAM-0000000014]: [A#120000/180000], 2
[KTABLE-TOSTREAM-0000000014]: [B#120000/180000], 2
Since the grace period is set to 1 minute, the count for windows [A#0/60000] and [B#0/60000] is not incremented when the value is updated to 1 in the input stream for the same key and timestamp. This output is as expected.
But when I restart my Streams application and ingest the same input stream again, I am seeing the following output:
[KTABLE-TOSTREAM-0000000014]: [A#0/60000], 2
[KTABLE-TOSTREAM-0000000014]: [B#0/60000], 2
[KTABLE-TOSTREAM-0000000014]: [A#60000/120000], 2
[KTABLE-TOSTREAM-0000000014]: [B#60000/120000], 2
[KTABLE-TOSTREAM-0000000014]: [A#120000/180000], 2
[KTABLE-TOSTREAM-0000000014]: [B#120000/180000], 2
[KTABLE-TOSTREAM-0000000014]: [A#60000/120000], 3
[KTABLE-TOSTREAM-0000000014]: [B#60000/120000], 3
[KTABLE-TOSTREAM-0000000014]: [A#120000/180000], 3
[KTABLE-TOSTREAM-0000000014]: [B#120000/180000], 3
Why do the windows [A#0/60000] and [B#0/60000] get updated to 2 after the application restarts?
Before the restart, the stream time was 121000, and the windows [A#0/60000] and [B#0/60000] had already exceeded the grace period and closed.
Why are these windows considered again after the restart?

This is a known issue: https://issues.apache.org/jira/browse/KAFKA-9368
During a rebalance or restart, the aggregation operator "forgets" the current stream time, and when restarted it only "re-learns" the time based on the new records it sees. This affects how the grace period is applied.
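To see why the re-learned stream time matters, here is a minimal sketch of the grace-period check (made-up names, not the actual Kafka Streams source), for the window [0, 60000) with a 1-minute grace period:
// Minimal sketch (made-up names, not the actual Kafka Streams source)
// of how the grace period gates late records for the window [0, 60000).
class GraceSketch {
    static final long WINDOW_END_MS = 60_000L;
    static final long GRACE_MS = 60_000L;

    // Largest record timestamp seen so far. Per KAFKA-9368 this is
    // forgotten on a restart/rebalance and re-learned from new records.
    static long observedStreamTime = Long.MIN_VALUE;

    static boolean isDroppedAsExpired(long recordTs) {
        observedStreamTime = Math.max(observedStreamTime, recordTs);
        // A late record is dropped only once stream time has advanced
        // past windowEnd + grace, the point where the window closes.
        return recordTs < WINDOW_END_MS
            && observedStreamTime >= WINDOW_END_MS + GRACE_MS;
    }
}
Before the restart, observedStreamTime is 121000 >= 120000, so the late update with timestamp 1000 is dropped. After the restart, the first re-ingested record brings observedStreamTime up to only 1000, the check fails, and the supposedly closed window [0/60000) is updated again (the persisted count of 1 is incremented to 2).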


Laravel leftJoin returns null from 2nd table

I have two tables. duty_sheets:
centerId | centerName | p1 | p2 | p3 | p4 | ... p22 | examId
1 | xyz | 1 | 5 | 8 | 7 | 1 | 1
2 | abc | 9 | 1 | 6 | 6 | 1 | 1
and feedback:
id | centerId | inspectorId | A | B | C | examId
1 | 1 | 1 | 1 | 5 | 8 | 1
2 | 2 | 9 | 9 | 1 | 6 | 1
Here is my code:
$center = DutySheet::select('duty_sheets.centerId', 'duty_sheets.centerName', 'feedback.id')
    ->leftJoin('feedback', function ($leftJoin) {
        $leftJoin->on('duty_sheets.examId', 'feedback.examId')
            ->where("duty_sheets.centerId", 'feedback.centerId')
            ->where("feedback.inspectorId", 1);
    })
    ->where("duty_sheets.examId", 1)
    ->where("p20", 1)
    ->get();
dd($center);
I want to retrieve all rows from duty_sheets where p20 = 1 and duty_sheets.examId = 1, plus the relevant rows from feedback matched on centerId, inspectorId and examId.
The problem is that the query returns feedback.id as null even though matching records exist in the feedback table.
Laravel version: 9
The problem is in the left join:
->where("duty_sheets.centerId", 'feedback.centerId')
This builds a where clause comparing the column against the literal string 'feedback.centerId':
duty_sheets.centerId = 'feedback.centerId'
Since no row contains that literal value, the join never matches. You need to use
->on("duty_sheets.centerId", '=', 'feedback.centerId')
or
->whereColumn("duty_sheets.centerId", 'feedback.centerId')

Subtract value row by row in matlab

I have a single-column matrix with the following values:
*-------*
| 6 |
| 4 |
| 3 |
| 1 |
| 1 |
*-------*
With this code, starting from the first value, I subtract the value in the following row and place a 0 at the end:
Delta = Ctv_ds_universal(1:(end-1),1) - Ctv_ds_universal(2:end,1);
Delta(end+1) = 0;
This is the result:
*-----------*
| 2 (6-4) |
| 1 (4-3) |
| 2 (3-1) |
| 0 (1-1) |
| 0 |
*-----------*
Now, I would like to reverse the order and subtract from bottom to top, placing the 0 at the beginning. How can I modify the code?
*------------*
| 0 |
| -2 (4-6) |
| -1 (3-4) |
| -2 (1-3) |
| 0 (1-1) |
*------------*
% Prepend the 0, then subtract each value from the one below it
% (the negation of the original forward difference):
Delta = [0; Ctv_ds_universal(2:end,1) - Ctv_ds_universal(1:end-1,1)];

Expand rectangles as much as possible to cover another rectangle, minimizing overlap

Given a tiled, x- and y-aligned rectangle and (potentially) a starting set of other rectangles which may overlap, I'd like to find a set of rectangles so that:
if no starting rectangle exists, one might be created; otherwise do not create additional rectangles
each of the rectangles in the starting set are expanded as much as possible
the overlap is minimal
the whole tiled rectangle's area is covered.
This smells a lot like a set cover problem, but it still is... different.
The key is that each starting rectangle's area has to be maximized while still minimizing the overall overlap. A good solution balances necessary overlaps against large starting-rectangle sizes.
I'd propose a rating function such as the one used in the examples below; higher is better.
Examples (assumes a rectangle tiled into a 4x4 grid; numbers in tiles denote starting rectangle "ID"):
easiest case: no starting rectangles provided, can just create one and expand it fully:
.---------------. .---------------.
| | | | | | 1 | 1 | 1 | 1 |
|---|---|---|---| |---|---|---|---|
| | | | | | 1 | 1 | 1 | 1 |
|---|---|---|---| => |---|---|---|---|
| | | | | | 1 | 1 | 1 | 1 |
|---|---|---|---| |---|---|---|---|
| | | | | | 1 | 1 | 1 | 1 |
·---------------· ·---------------·
rating: 16 * 1 - 0 = 16
more sophisticated:
.---------------. .---------------. .---------------.
| 1 | 1 | | | | 1 | 1 | 1 | 1 | | 1 | 1 | 2 | 2 |
|---|---|---|---| |---|---|---|---| |---|---|---|---|
| 1 | 1 | | | | 1 | 1 | 1 | 1 | | 1 | 1 | 2 | 2 |
|---|---|---|---| => |---|---|---|---| or |---|---|---|---|
| | | 2 | 2 | | 2 | 2 | 2 | 2 | | 1 | 1 | 2 | 2 |
|---|---|---|---| |---|---|---|---| |---|---|---|---|
| | | 2 | 2 | | 2 | 2 | 2 | 2 | | 1 | 1 | 2 | 2 |
·---------------· ·---------------· ·---------------·
ratings: (4 + 4) * 2 - 0 = 16 (4 + 4) * 2 - 0 = 16
pretty bad situation, with initial overlap:
.-----------------. .-----------------------.
| 1 | | | | | 1 | 1 | 1 | 1 |
|-----|---|---|---| |-----|-----|-----|-----|
| 1,2 | 2 | | | | 1,2 | 1,2 | 1,2 | 1,2 |
|-----|---|---|---| => |-----|-----|-----|-----|
| | | | | | 2 | 2 | 2 | 2 |
|-----|---|---|---| |-----|-----|-----|-----|
| | | | | | 2 | 2 | 2 | 2 |
·-----------------· ·-----------------------·
rating: (8 + 12) * 2 - (2 + 2 + 2 + 2) = 40 - 8 = 32
covering with 1 only:
.-----------------------.
| 1 | 1 | 1 | 1 |
|-----|-----|-----|-----|
| 1,2 | 1,2 | 1 | 1 |
=> |-----|-----|-----|-----|
| 1 | 1 | 1 | 1 |
|-----|-----|-----|-----|
| 1 | 1 | 1 | 1 |
·-----------------------·
rating: (16 + 2) * 1 - (2 + 2) = 18 - 4 = 14
more starting rectangles, also overlap:
.-----------------. .---------------------.
| 1 | 1,2 | 2 | | | 1 | 1,2 | 1,2 | 1,2 |
|---|-----|---|---| |---|-----|-----|-----|
| 1 | 1 | | | | 1 | 1 | 1 | 1 |
|---|-----|---|---| => |---|-----|-----|-----|
| 3 | | | | | 3 | 3 | 3 | 3 |
|---|-----|---|---| |---|-----|-----|-----|
| | | | | | 3 | 3 | 3 | 3 |
·-----------------· ·---------------------·
rating: (8 + 3 + 8) * 3 - (2 + 2 + 2) = 57 - 6 = 51
The starting rectangles may be located anywhere in the tiled rectangle and have any size (minimum 1 tile).
The grid might be as big as 33x33 currently, though potentially bigger in the future.
I haven't been able to reduce this problem to a well-known one, but that may only be my own inability.
My current approach to solving this efficiently would go like this:
if the list of starting rects is empty:
    create a starting rect in tile (0,0)
for each starting rect:
    calculate the distances in x and y direction to the next object (or wall)
sort distances in ascending order
while free space remains:
    pick the rect with the lowest distance
    expand it in the lowest-distance direction
I'm unsure whether this gives the optimal solution or really is the most efficient approach, and naturally whether there are edge cases it would fail on.
Proposed attack. Your mileage may vary. Shipping costs higher outside the EU.
Make a list of open tiles
Make a list of rectangles (dimension & corners)
We're going to try making +1 growth steps: expand some rectangle one unit in a chosen direction. In each iteration, find the +1 with the highest score. Iterate until the entire room (large rectangle) is covered.
Scoring suggestions:
Count the squares added by the extension: open squares are +1; occupied squares are -1 for each other rectangle overlapped.
For instance, in this starting position:
- - 3 3
1 1 12 -
- - 2 -
...if we try to extend rectangle 3 down one row, we get +1 for the empty square on the right, but -2 for overlapping both 1 and 2.
Divide this score by the current rectangle area. In the example above, we would have (+1 - 2) / (1*2), or -1/2 as the score for that move ... not a good idea, probably.
The entire first iteration would consider the moves below; directions are Up-Down-Left-Right
rect dir score
1 U 0.33 = (2-1)/3
1 D 0.33 = (2-1)/3
1 R 0.33 = (1-0)/3
2 U -0.50 = (0-1)/2
2 L 0.00 = (1-1)/2
2 R 0.50 = (2-1)/2
3 D -0.50 = (1-2)/2
3 L 0.50 = (1-0)/2
We have a tie for best score: 2 R and 3 L. I'll add a minor criterion of taking the greater expansion, 2 tiles over 1. This gives:
- - 3 3
1 1 12 2
- - 2 2
For the second iteration:
rect dir score
1 U 0.33 = (2-1)/3
1 D 0.33 = (2-1)/3
1 R -0.33 = (0-1)/3
2 U -0.50 = (0-2)/4
2 L 0.00 = (1-1)/4
3 D -1.00 = (0-2)/2
3 L 0.50 = (1-0)/2
Naturally, the tie from last time is now the sole top choice, since the two did not conflict:
- 3 3 3
1 1 12 2
- - 2 2
Possible optimization: If a +1 has no overlap, extend it as far as you can (avoiding overlap) before computing scores.
In the final two iterations, we will similarly get 3 L and 1 D as our choices, finishing with
3 3 3 3
1 1 12 2
1 1 2 2
Note that this algorithm will not get the same answer for your "pretty bad example": this one will cover the entire room with 2, reducing to only 2 overlap squares. If you'd rather have 1 expand in that case, we'll need a factor for the proportion of another rectangle that you're covering, instead of my constant value of 1.
Does that look like a tractable starting point for you?
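If it helps, here is a rough Java sketch of that greedy loop (all names are made up; the tie-breaking criterion and the "extend as far as you can" optimization are left out), seeded with the small 4x3 example position above:
import java.util.*;

// Rough sketch of the greedy "+1 growth step" search. Rectangles are
// inclusive tile ranges {x1, y1, x2, y2} on a W x H grid.
class GreedyCover {
    static final int W = 4, H = 3;
    static List<int[]> rects = new ArrayList<>(List.of(
        new int[]{0, 1, 2, 1},    // rect 1: the "1 1 12" row
        new int[]{2, 1, 2, 2},    // rect 2: the "12" and "2" cells
        new int[]{2, 0, 3, 0}));  // rect 3: the "3 3" pair

    // number of rectangles other than 'skip' covering tile (x, y)
    static int coverage(int x, int y, int skip) {
        int c = 0;
        for (int i = 0; i < rects.size(); i++) {
            int[] r = rects.get(i);
            if (i != skip && x >= r[0] && x <= r[2] && y >= r[1] && y <= r[3]) c++;
        }
        return c;
    }

    // score of growing rect i one unit in direction d (0=L,1=R,2=U,3=D):
    // +1 per open tile, -1 per other rectangle overlapped, divided by area
    static double score(int i, int d) {
        int[] r = rects.get(i);
        int x1 = r[0] - (d == 0 ? 1 : 0), x2 = r[2] + (d == 1 ? 1 : 0);
        int y1 = r[1] - (d == 2 ? 1 : 0), y2 = r[3] + (d == 3 ? 1 : 0);
        if (x1 < 0 || y1 < 0 || x2 >= W || y2 >= H)
            return Double.NEGATIVE_INFINITY;     // would leave the room
        int raw = 0;
        for (int x = x1; x <= x2; x++)
            for (int y = y1; y <= y2; y++) {
                boolean isNew = x < r[0] || x > r[2] || y < r[1] || y > r[3];
                if (!isNew) continue;            // only the added strip counts
                int c = coverage(x, y, i);
                raw += (c == 0) ? 1 : -c;
            }
        int area = (r[2] - r[0] + 1) * (r[3] - r[1] + 1);
        return (double) raw / area;
    }

    static boolean fullyCovered() {
        for (int x = 0; x < W; x++)
            for (int y = 0; y < H; y++)
                if (coverage(x, y, -1) == 0) return false;
        return true;
    }

    public static void main(String[] args) {
        while (!fullyCovered()) {
            int bi = 0, bd = 0;
            double best = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < rects.size(); i++)
                for (int d = 0; d < 4; d++) {
                    double s = score(i, d);
                    if (s > best) { best = s; bi = i; bd = d; }
                }
            int[] r = rects.get(bi);             // apply the best +1 step
            if (bd == 0) r[0]--; else if (bd == 1) r[2]++;
            else if (bd == 2) r[1]--; else r[3]++;
        }
        for (int[] r : rects) System.out.println(Arrays.toString(r));
    }
}
Each iteration scores every legal +1 step and applies the best one until the room is covered.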

Fast algorithm for simple data group

There are several billion rows like this:
id | type | groupId
---+------+--------
1 | a |
1 | b |
2 | a |
2 | c |
1 | a |
2 | d |
2 | a |
1 | e |
5 | a |
1 | f |
4 | a |
1 | b |
4 | a |
7 | t |
8 | g |
3 | c |
6 | a |
I need to add a groupId to these rows: if two rows share the same id or the same type, they belong to the same group. The result should look like this:
id | type | groupId
---+------+--------
1 | a | 1
1 | b | 1
2 | a | 1
2 | c | 1
1 | a | 1
2 | d | 1
2 | a | 1
1 | e | 1
5 | a | 1
1 | f | 1
4 | a | 1
1 | b | 1
4 | a | 1
7 | t | 2
8 | g | 3
3 | c | 1
6 | a | 1
I tried using a loop to do this, but it is very inefficient; it would need several weeks to finish.
This is a classic example where you can use a Quick-Union (union-find) algorithm.
Computational limits
Time complexity for grouping N rows: O(N log* N), where log* N is the number of times you need to take lg of a number until it reaches 1, e.g. log* 10^100 ≈ 5.
Space complexity: O(N)
Read more on this algorithm:
https://www.youtube.com/watch?v=MaNCMWhYIHo ,
https://www.cs.princeton.edu/~rs/AlgsDS07/01UnionFind.pdf
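For illustration, here is a minimal weighted quick-union with path compression in Java, applied to the sample rows above (all names are made up; billions of rows would need a more memory-conscious layout than a HashMap):
import java.util.*;

// Groups rows so that rows sharing an id or a type get the same groupId,
// by union-ing each row's id-node with its type-node.
class RowGrouper {
    private final Map<String, String> parent = new HashMap<>();
    private final Map<String, Integer> size = new HashMap<>();

    private String find(String x) {
        parent.putIfAbsent(x, x);
        size.putIfAbsent(x, 1);
        String root = x;
        while (!parent.get(root).equals(root)) root = parent.get(root);
        while (!parent.get(x).equals(root)) {       // path compression
            String next = parent.get(x);
            parent.put(x, root);
            x = next;
        }
        return root;
    }

    void union(String a, String b) {
        String ra = find(a), rb = find(b);
        if (ra.equals(rb)) return;
        if (size.get(ra) < size.get(rb)) { String t = ra; ra = rb; rb = t; }
        parent.put(rb, ra);                         // weighted: smaller tree under larger
        size.put(ra, size.get(ra) + size.get(rb));
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"1","a"},{"1","b"},{"2","a"},{"2","c"},{"1","a"},{"2","d"},
            {"2","a"},{"1","e"},{"5","a"},{"1","f"},{"4","a"},{"1","b"},
            {"4","a"},{"7","t"},{"8","g"},{"3","c"},{"6","a"}};
        RowGrouper g = new RowGrouper();
        // "id:" / "type:" prefixes keep the two namespaces distinct.
        for (String[] r : rows) g.union("id:" + r[0], "type:" + r[1]);

        // Assign dense group numbers in row order and print the result.
        Map<String, Integer> groupOf = new LinkedHashMap<>();
        for (String[] r : rows) {
            String root = g.find("id:" + r[0]);
            groupOf.putIfAbsent(root, groupOf.size() + 1);
            System.out.println(r[0] + " | " + r[1] + " | " + groupOf.get(root));
        }
    }
}
This prints groupId 1 for every row connected through a shared id or type, 2 for the isolated (7, t) row, and 3 for (8, g), matching the expected output above.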

Direct-mapped instruction cache vs fully associative instruction cache using LRU replacement

For caches of small size, a direct-mapped instruction cache can sometimes outperform a fully associative instruction cache using LRU replacement.
Could anyone explain how this would be possible with an example access pattern?
This can happen for caches of small size. Let's compare caches of size 2.
In my example, the direct-mapped "DM" cache will use row A for odd addresses and row B for even addresses.
The LRU cache will use the least recently used row to store values on a miss.
The access pattern I suggest is 13243142 (repeated as many times as one wants).
Here's a breakdown of how both caching algorithms will behave:
H - hits
M - misses
----- time ------>>>>>
Accessed: 1 | 3 | 2 | 4 | 3 | 1 | 4 | 2
\ \ \ \ \ \ \ \
LRU A ? | ? | 3 | 3 | 4 | 4 | 1 | 1 | 2 |
B ? | 1 | 1 | 2 | 2 | 3 | 3 | 4 | 4 |
M M M M M M M M
DM A ? | 1 | 3 | 3 | 3 | 3 | 1 | 1 | 1 |
B ? | ? | ? | 2 | 4 | 4 | 4 | 4 | 2 |
M M M M H M H M
That gives 8 misses for the LRU cache and 6 for the direct-mapped one. Let's see what happens if this pattern gets repeated forever:
----- time ------>>>>>
Accessed: 1 | 3 | 2 | 4 | 3 | 1 | 4 | 2
\ \ \ \ \ \ \ \
LRU A | 2 | 3 | 3 | 4 | 4 | 1 | 1 | 2 |
B | 1 | 1 | 2 | 2 | 3 | 3 | 4 | 4 |
M M M M M M M M
DM A | 1 | 3 | 3 | 3 | 3 | 1 | 1 | 1 |
B | 2 | 2 | 2 | 4 | 4 | 4 | 4 | 2 |
H M H M H M H M
So the direct-mapped cache has a 50% hit rate, which outperforms the 0% hit rate of the LRU cache.
This works because:
Any address repeated in this pattern has not been accessed for the previous two accesses (and those two accesses were to different addresses), so the LRU cache always misses.
The DM cache sometimes hits, because the pattern is designed to reuse what was stored the last time the corresponding row was used.
Therefore one can build similar patterns for larger cache sizes, but the larger the cache, the longer such a pattern needs to be. This matches the intuition that larger caches are harder to exploit this way.
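To double-check the arithmetic, here is a small self-contained Java simulation of the repeated pattern (not from the original answer; the LRU cache is modeled with an access-ordered LinkedHashMap):
import java.util.*;

// Simulates the access pattern 1 3 2 4 3 1 4 2 (repeated) on a 2-entry
// direct-mapped cache (slot 1 = row A for odd addresses, slot 0 = row B
// for even addresses) and a 2-entry fully associative LRU cache.
class CacheDemo {
    public static void main(String[] args) {
        int[] pattern = {1, 3, 2, 4, 3, 1, 4, 2};
        int rounds = 1000;

        Integer[] dm = new Integer[2];               // direct-mapped slots
        // Access-ordered map evicting the eldest entry = LRU of capacity 2.
        LinkedHashMap<Integer, Boolean> lru =
            new LinkedHashMap<Integer, Boolean>(4, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> e) {
                    return size() > 2;
                }
            };

        int dmHits = 0, lruHits = 0, total = rounds * pattern.length;
        for (int round = 0; round < rounds; round++) {
            for (int addr : pattern) {
                int slot = addr % 2;                 // odd -> slot 1, even -> slot 0
                if (dm[slot] != null && dm[slot] == addr) dmHits++;
                else dm[slot] = addr;
                if (lru.get(addr) != null) lruHits++;
                else lru.put(addr, Boolean.TRUE);
            }
        }
        System.out.printf("DM:  %d/%d hits%n", dmHits, total);   // approaches 50%
        System.out.printf("LRU: %d/%d hits%n", lruHits, total);  // 0%
    }
}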
