RxScala buffer elements - rx-scala

I have an application where I need to parse strings, and I use RxScala for it. My code:
import java.util.concurrent.TimeUnit
import rx.lang.scala.Subject
import rx.lang.scala.schedulers.NewThreadScheduler
import rx.lang.scala.subjects.{SerializedSubject, PublishSubject}
import scala.concurrent.duration.Duration
object RxScala extends App {
  val subject: Subject[String] = SerializedSubject(PublishSubject())
  val processLines = (lines: Seq[String]) => {
    // long action
  }
  subject
    .subscribeOn(NewThreadScheduler())
    .tumblingBuffer(Duration(2, TimeUnit.SECONDS), 100)
    .subscribe(processLines)
  for (i <- 1 to 100000) {
    subject.onNext("Line " + i)
  }
}
I have a problem because I add lines faster than I can process them.
I want to create a buffer of, e.g., 200 lines and, while the buffer is full, ignore new records until there is room again, e.g.:
Add 100 records (A)
Add 100 records (B)
Program starts processLines (A) // buffer holds (B)
Add 100 records (C) // buffer holds (B, C) and is full
Add 100 records (D) // elements are ignored
processLines is finished
Program starts processLines (B) // buffer holds (C)
Add 100 records (E) // buffer holds (C, E)
Does RxScala have a method to do this?

subject.
tumblingBuffer(Duration(2, TimeUnit.SECONDS), 100).
onBackpressureDrop.
observeOn(NewThreadScheduler()).
subscribe(processLines)
http://reactivex.io/documentation/operators/backpressure.html
onBackpressureDrop drops emissions from the source Observable unless there is a pending request from a downstream Subscriber, in which case it will emit enough items to fulfill the request.
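The drop-when-full rule from the question can also be modelled outside of Rx, which may help when checking what the operator chain should do. The following is a minimal TypeScript sketch, not RxScala code: createDropBuffer and its methods are hypothetical names introduced only for illustration, and each batch of 100 lines is abbreviated to a single label. It replays the A to E scenario with a capacity of two batches.

```typescript
// Hypothetical helper (not part of RxScala): a FIFO of batches that rejects
// new batches while it already holds `maxBatches` of them.
function createDropBuffer<T>(maxBatches: number) {
  const queued: T[][] = [];
  return {
    // Returns true if the batch was queued, false if it was dropped.
    offer(batch: T[]): boolean {
      if (queued.length >= maxBatches) {
        return false;
      }
      queued.push(batch);
      return true;
    },
    // Removes and returns the oldest queued batch (undefined when empty).
    poll(): T[] | undefined {
      return queued.shift();
    },
    // Snapshot of the batches that are currently waiting.
    pending(): T[][] {
      return queued.slice();
    },
  };
}

// Replay of the question's scenario; the batch being processed is no longer
// counted as part of the buffer.
const dropBuffer = createDropBuffer<string>(2);
const accepted: Record<string, boolean> = {};
accepted["A"] = dropBuffer.offer(["A"]);
accepted["B"] = dropBuffer.offer(["B"]);
const firstBatch = dropBuffer.poll();    // processLines(A) starts, B waits
accepted["C"] = dropBuffer.offer(["C"]); // buffer holds B, C and is full
accepted["D"] = dropBuffer.offer(["D"]); // dropped
const secondBatch = dropBuffer.poll();   // processLines(B) starts, C waits
accepted["E"] = dropBuffer.offer(["E"]); // buffer holds C, E
```

Only D is rejected, which is the behaviour the question asks for; whether a given Rx operator chain reproduces exactly these buffer sizes depends on the internal buffering of the operators involved.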

Related

How to get similar behavior to bufferCount whilst emitting if there are less items than the buffer count

I'm trying to achieve something very similar to a buffer count. As values come through the pipe, bufferCount of course buffers them and sends them down in batches. I'd like something similar to this that will emit all remaining items if there are currently fewer than the buffer size in the stream.
It's a little confusing to word, so I'll provide an example with what I'm trying to achieve.
I have something adding items individually to a subject. Sometimes it'll add 1 item a minute, sometimes it'll add 1000 items in 1 second. I wish to run a long-running process (~2 seconds) on batches of these items so as not to overload the server.
So for example, consider the timeline where P is processing
---A-----------B----------C---D--EFGHI------------------
|_( P(A) ) |_(P(B)) |_( P(C) ) |_(P([D, E, F, G, H, I]))
This way I can process the events in small or large batches depending on how many events are coming through, but I ensure the batches remain smaller than X.
I basically need to map all the individual emits into emits that contain chunks of 5 or fewer. As I pipe the events into a concatMap, events will start to stack up. I want to pick these stacked up events off in batches. How can I achieve this?
Here's a stackblitz with what I've got so far: https://stackblitz.com/edit/rxjs-iqwcbh?file=index.ts
Note how items 4 and 5 aren't processed until more come in and fill the buffer. Ideally, after 1,2,3 are processed, it'll pick 4,5 off the queue. Then, when 6,7,8 come in, it'll process those.
EDIT: today I learned that bufferTime has a maxBufferSize parameter, that will emit when the buffer reaches that size. Therefore, the original answer below isn't necessary, we can simply do this:
const stream$ = subject$.pipe(
  bufferTime(2000, null, 3), // <-- buffer emits every 2000ms OR when 3 items are collected
  filter(arr => !!arr.length)
);
StackBlitz
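To see what "emit after the time window OR when the buffer is full" means for the sample input, here is a small dependency-free TypeScript model. It is a simplified sketch, not RxJS's implementation: chunkByCountOrTime and TimedItem are hypothetical names, the input is assumed to be sorted by time, an item landing exactly on a window boundary is assumed to go into the next window, and a full buffer is assumed to start a fresh window.

```typescript
interface TimedItem<T> {
  at: number; // arrival time in ms
  value: T;
}

// Simplified model: a buffer is flushed when it reaches `maxSize` items or when
// `windowMs` has elapsed since it was opened; empty buffers are skipped.
function chunkByCountOrTime<T>(items: TimedItem<T>[], windowMs: number, maxSize: number): T[][] {
  const chunks: T[][] = [];
  let current: T[] = [];
  let openedAt = 0;

  const flush = (): void => {
    if (current.length > 0) {
      chunks.push(current);
      current = [];
    }
  };

  for (const item of items) {
    // Close every window that ended before (or exactly when) this item arrived.
    while (item.at >= openedAt + windowMs) {
      flush();
      openedAt += windowMs;
    }
    current.push(item.value);
    if (current.length === maxSize) {
      flush();
      openedAt = item.at; // assumption: a full buffer starts a fresh window
    }
  }
  flush(); // whatever is left when the input ends
  return chunks;
}

// 1..5 arrive together, 6..8 arrive ten seconds later (as in the question's demo).
const arrivals: TimedItem<number>[] = [
  ...[1, 2, 3, 4, 5].map(value => ({ at: 0, value })),
  ...[6, 7, 8].map(value => ({ at: 10000, value })),
];
const chunks = chunkByCountOrTime(arrivals, 2000, 3); // [[1, 2, 3], [4, 5], [6, 7, 8]]
```

The partial chunk [4, 5] is released when its window closes instead of waiting for a third item, which is the behaviour the question is after.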
ORIGINAL:
It sounds like you want a combination of bufferCount and bufferTime. In other words: "release the buffer when it reaches size X or after Y time has passed".
We can use the race operator, along with those other two to create an observable that emits when the buffer reaches the desired size OR after the duration has passed. We'll also need a little help from take and repeat:
const chunk$ = subject$.pipe(bufferCount(3));
const partial$ = subject$.pipe(
bufferTime(2000),
filter(arr => !!arr.length) // don't emit empty array
);
const stream$ = race([chunk$, partial$]).pipe(
take(1),
repeat()
);
Here we define stream$ to be the first to emit between chunk$ and partial$. However, race will only use the first source that emits, so we use take(1) and repeat to sort of "reset the race".
Then you can do your work with concatMap like this:
stream$.pipe(
concatMap(chunk => this.doWorkWithChunk(chunk))
);
Here's a working StackBlitz demo.
You may want to roll it into a custom operator, so you can simply do something like this:
const stream$ = subject$.pipe(
bufferCountTime(5, 2000)
);
The definition of bufferCountTime() could look like this:
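import { Observable, race } from 'rxjs';
import { bufferCount, bufferTime, filter, take, repeat } from 'rxjs/operators';

function bufferCountTime<T>(count: number, time: number) {
  return (source$: Observable<T>) => {
    // Emits as soon as `count` items have been collected
    const chunk$ = source$.pipe(bufferCount(count));
    // Emits whatever has been collected after `time` ms (skipping empty arrays)
    const partial$ = source$.pipe(
      bufferTime(time),
      filter((arr: T[]) => !!arr.length)
    );
    // Take whichever emits first, then restart the race
    return race([chunk$, partial$]).pipe(
      take(1),
      repeat()
    );
  }
}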
Another StackBlitz sample.
Since I noticed the use of forkJoin in your sample code, I can see you are sending a request to the server for each emission (I was originally under the impression that you were making only 1 call per batch with combined data).
In the case of sending one request per item the solution is much simpler!
There is no need to batch the emissions, you can simply use mergeMap and specify its concurrency parameter. This will limit the number of currently executing requests:
const stream$ = subject$.pipe(
mergeMap(val => doWork(val), 3), // 3 max concurrent requests
);
Here is a visual of what the output would look like when the subject rapidly emits:
Notice the work only starts for the first 3 items initially. Emissions after that are queued up and processed as the prior in flight items complete.
Here's a StackBlitz example of this behavior.
TLDR;
A StackBlitz app with the solution can be found here.
Explanation
Here would be an approach:
const bufferLen = 3;
const count$ = subject.pipe(filter((_, idx) => (idx + 1) % bufferLen === 0));
const timeout$ = subject.pipe(
filter((_, idx) => idx === 0),
switchMapTo(timer(0))
);
subject
.pipe(
buffer(
merge(count$, timeout$).pipe(
take(1),
repeat()
)
),
concatMap(buffer => forkJoin(buffer.map(doWork)))
)
.subscribe(/* console.warn */);
/* Output:
Processing 1
Processing 2
Processing 3
Processed 1
Processed 2
Processed 3
Processing 4
Processing 5
Processed 4
Processed 5
Processing 6 <- after the `setTimeout`'s timer expires
Processing 7
Processing 8
Processed 6
Processed 7
Processed 8
*/
The idea was to still use the bufferCount's behavior when items come in synchronously, but, at the same time, detect when fewer items than the chosen bufferLen are in the buffer. I thought that this detection could be done using a timer(0), because it internally schedules a macrotask, so it is ensured that items emitted synchronously will be considered first.
However, there is no operator that exactly combines the logic delineated above. But it's important to keep in mind that we certainly want a behavior similar to the one the buffer operator provides. As in, we will for sure have something like subject.pipe(buffer(...)).
Let's see how we can achieve something similar to what bufferCount does, but without using bufferCount:
const bufferLen = 3;
const count$ = subject.pipe(filter((_, idx) => (idx + 1) % bufferLen === 0));
Given the above snippet, buffer(count$) and bufferCount(3) should give the same behavior.
Let's move now onto the detection part:
const timeout$ = subject.pipe(
filter((_, idx) => idx === 0),
switchMapTo(timer(0))
);
What it essentially does is to start a timer after the subject has emitted its first item. This will make more sense when we have more context:
subject
.pipe(
buffer(
merge(count$, timeout$).pipe(
take(1),
repeat()
)
),
concatMap(buffer => forkJoin(buffer.map(doWork)))
)
.subscribe(/* console.warn */);
By using merge(count$, timeout$), this is what we'd be saying: when the subject emits, start adding items to the buffer and, at the same time, start the timer. The timer is started too because it is used to determine if fewer items will be in the buffer.
Let's walk through the example provided in the StackBlitz app:
from([1, 2, 3, 4, 5])
.pipe(tap(i => subject.next(i)))
.subscribe();
// Then mimic some more items coming through a while later
setTimeout(() => {
subject.next(6);
subject.next(7);
subject.next(8);
}, 10000);
When 1 is emitted, it will be added to the buffer and the timer will start. Then 2 and 3 arrive immediately, so the accumulated values will be emitted.
Because we're also using take(1) and repeat(), the process will restart. Now, when 4 is emitted, it will be added to the buffer and the timer will start again. 5 arrives immediately, but the number of the collected items until now is less than the given buffer length, meaning that until the 3rd value arrives, the timer will have time to finish. When the timer finishes, the [4,5] chunk will be emitted. What happens with [6, 7, 8] is the same as what happened with [1, 2, 3].

Guarantee `n` seconds between emit without waiting initially

Given an event stream like (each - is 10ms)
--A-B--C-D
With debounceTime(20) we get
-----------D
With throttleTime(20) we get
--A----C--
With throttleTime(20, undefined, {leading: true, trailing: true}) we get
--A----CD
How can I instead guarantee that I have that much time between each emit, so for example with 20ms
--A-----C--D
In general the throttleTime with the trailing: true gets closest, but it can sometimes cause the trailing output to be too close to the leading output.
Sample code can be found on rxviz.com
1. Concat a delay
Concatenate an empty delay to each item, that doesn't emit anything and only completes after a given time.
const { EMPTY, of, concat } = Rx;
const { concatMap, delay } = RxOperators;
event$.pipe(
concatMap(item => concat(of(item), EMPTY.pipe(delay(20))))
);
2. ConcatMap to a timer
Map every item to a timer that starts with the given item and completes after a given amount of time. The next item will be emitted when the timer completes. Values emitted by the timer itself are ignored.
const { timer } = Rx;
const { concatMap, ignoreElements, startWith } = RxOperators;
event$.pipe(
concatMap(item => timer(20).pipe(ignoreElements(), startWith(item)))
);
3. Zip with an interval (not optimal)
If your event stream emits items faster than the desired delay you could use zip to emit events when an interval emits.
const { interval, zip } = Rx;
const { map } = RxOperators;
zip(event$, interval(20)).pipe(map(([item, i]) => item));
This method won't guarantee n seconds between every emitted item in all circumstances, e.g. when there is a gap larger than the desired delay followed by a small gap in the event stream.
E.g., zip works in your example with emits at 20, 30, 50, 60 and a minimum delay of 20.
zip won't work perfectly with emits at 20, 30, 65, 70 and a minimum delay of 20.
When the interval emits faster than events are coming in, those interval items will just pile up inside zip. If this is the case zip will immediately zip any new event with an already present interval item from its stack causing events to be emitted without the intended delay.
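The two cases above can be checked with a little arithmetic. The following TypeScript sketch does not use RxJS; it only models when zip(event$, interval(20)) would emit, assuming the subscription starts at time 0 and the k-th event is paired with the k-th interval tick. The function names zipWithIntervalTimes and gaps are hypothetical.

```typescript
// Emission times of zip(event$, interval(periodMs)): the k-th event is paired
// with the k-th interval tick, so the pair goes out once both have happened.
function zipWithIntervalTimes(eventTimes: number[], periodMs: number): number[] {
  return eventTimes.map((eventAt, k) => Math.max(eventAt, (k + 1) * periodMs));
}

// Gaps between consecutive emissions.
function gaps(times: number[]): number[] {
  const result: number[] = [];
  let previous: number | undefined;
  for (const t of times) {
    if (previous !== undefined) {
      result.push(t - previous);
    }
    previous = t;
  }
  return result;
}

const steady = zipWithIntervalTimes([20, 30, 50, 60], 20); // [20, 40, 60, 80]
const bursty = zipWithIntervalTimes([20, 30, 65, 70], 20); // [20, 40, 65, 80]
const steadyGaps = gaps(steady); // [20, 20, 20]
const burstyGaps = gaps(bursty); // [20, 25, 15]
```

In the second case the tick at 60 is already waiting when the event at 65 arrives, so that event goes out immediately and the following gap shrinks to 15.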
Not sure if there's a ready-made operator available to achieve this (there might be!), but you can do it by timestamping each value and adding necessary delay in between:
Timestamp each value
Scan over the sequence and calculate relative delay based on previous value's effective timestamp
delay each value by appropriate amount
concat the resulting sequence
Here's an rxviz illustrating it. Code looks like this:
const minTimeBetween = 800
events.pipe(
timestamp(),
scan((a, x) => ({
...x,
delayBy: a === null
? 0
: Math.max(0, minTimeBetween - (x.timestamp - (a.timestamp + a.delayBy)))
}), null),
concatMap(x => of(x.value).pipe(
delay(x.delayBy)
))
);
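For comparison, the timing that a "guaranteed minimum gap" is aiming for can be written down as a simple recurrence: each value goes out at its arrival time, or at the previous emission plus the minimum gap, whichever is later. The TypeScript sketch below models only that target schedule and makes no claim about how RxJS schedules the delays internally; scheduleWithMinGap is a hypothetical name.

```typescript
// Target schedule: emit[k] = max(arrival[k], emit[k - 1] + minGapMs).
function scheduleWithMinGap(arrivalTimes: number[], minGapMs: number): number[] {
  const emitTimes: number[] = [];
  let previousEmit: number | undefined;
  for (const arrivedAt of arrivalTimes) {
    const emitAt =
      previousEmit === undefined ? arrivedAt : Math.max(arrivedAt, previousEmit + minGapMs);
    emitTimes.push(emitAt);
    previousEmit = emitAt;
  }
  return emitTimes;
}

// The input that trips up zip above keeps a gap of at least 20 here.
const spaced = scheduleWithMinGap([20, 30, 65, 70], 20); // [20, 40, 65, 85]
```

A quick way to validate any of the operator-based solutions is to record their emission times and compare them against this schedule.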

RxJs hot range observable

I'm trying to create a hot range observable. This means that when an observer subscribes to the observable after a certain timeout, it should not receive the values that have already been published. I have created the following program:
import Rx from "rxjs/Rx";
var x = Rx.Observable.range(1,10).share()
x.subscribe(x => {
print('1: ' + x);
});
setTimeout(() => {
x.subscribe(x => {
print('2: ' + x);
});
}, 1000);
function print(x) {
const element = document.createElement('div');
element.innerText = x;
document.body.appendChild(element)
}
I expect this program to print 1 to 10 for the first observer, and the second observer to print nothing, since the values 1 to 10 are produced within the first second. The expected output is shown below.
1: 1
1: 2
..
1: 10
However, I see that the second observer also prints all the values, even though I have put the share() operator after it. The output is shown below.
1: 1
..
1: 10
2: 1
..
2: 10
Can somebody explain this to me?
share returns an observable that's reference counted for subscriptions. When the reference count goes from zero to one, the shared observable subscribes to the source - in your case, to the range observable. And when the reference count drops back to zero, it unsubscribes from the source.
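The key point in your snippet is that range emits its values synchronously and then completes. The completion effects an unsubscription from the shared observable, which sees the reference count drop back to zero, which in turn sees the shared observable unsubscribe from its source. When the second observer subscribes a second later, the reference count goes from zero to one again, so the shared observable resubscribes to range and the values are emitted from the start.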
If you replace share with publish you should see the behaviour you expected:
var x = Rx.Observable.range(1,10).publish();
x.subscribe(x => print('1: ' + x));
x.connect();
publish returns a ConnectableObservable which is not reference counted and provides a connect method that can be called to explicitly connect - i.e. subscribe - to the source.

Algorithm to time-sort N data streams

So I've got N asynchronous, timestamped data streams. Each stream has a fixed-ish rate. I want to process all of the data, but the catch is that I must process the data in order as close to the time that the data arrived as possible (it is a real-time streaming application).
So far, my implementation has been to create a fixed window of K messages which I sort by timestamp using a priority queue. I then process the entirety of this queue in order before moving on to the next window. This is okay, but it's less than ideal because it creates lag proportional to the size of the buffer, and it will also sometimes lead to dropped messages if a message arrives just after the end of the buffer has been processed. It looks something like this:
// Priority queue keeping track of the data in timestamp order.
ThreadSafePriorityQueue<Data> q;
// Fixed buffer size
int K = 10;
// The last successfully processed data timestamp
time_t lastTimestamp = -1;
// Called for each of the N data streams asynchronously
void receiveAsyncData(const Data& dat) {
    q.push(dat.timestamp, dat);
    if (q.size() > K) {
        processQueue();
    }
}
// Process all the data in the queue.
void processQueue() {
    while (!q.empty()) {
        const auto& data = q.top();
        // If the data is too old, drop it.
        if (data.timestamp < lastTimestamp) {
            LOG("Dropping message. Too old.");
            q.pop();
            continue;
        }
        // Otherwise, process it.
        processData(data);
        lastTimestamp = data.timestamp;
        q.pop();
    }
}
Information about the data: they're guaranteed to be sorted within their own stream. Their rates are between 5 and 30 Hz. They consist of images and other bits of data.
Some examples of why this is harder than it appears. Suppose I have two streams, A and B both running at 1 Hz and I get the data in the following order:
(stream, time)
(A, 2)
(B, 1.5)
(A, 3)
(B, 2.5)
(A, 4)
(B, 3.5)
(A, 5)
See how, if I processed the data in the order in which I received them, B would always get dropped? That's what I want to avoid. Now, in my algorithm, B would get dropped every 10th frame, and I would process the data with a lag of 10 frames into the past.
I would suggest a producer/consumer structure. Have each stream put data into the queue, and a separate thread reading the queue. That is:
// your asynchronous update:
void receiveAsyncData(const Data& dat) {
q.push(dat.timestamp, dat);
}
// separate thread that processes the queue
void processQueue()
{
while (!stopRequested)
{
data = q.pop();
if (data.timestamp >= lastTimestamp)
{
processData(data);
lastTimestamp = data.timestamp;
}
}
}
This prevents the "lag" that you see in your current implementation when you're processing a batch.
The processQueue function is running in a separate, persistent thread. stopRequested is a flag that the program sets when it wants to shut down--forcing the thread to exit. Some people would use a volatile flag for this. I prefer to use something like a manual reset event.
To make this work, you'll need a priority queue implementation that allows concurrent updates, or you'll need to wrap your queue with a synchronization lock. In particular, you want to make sure that q.pop() waits for the next item when the queue is empty. Or that you never call q.pop() when the queue is empty. I don't know the specifics of your ThreadSafePriorityQueue, so I can't really say exactly how you'd write that.
The timestamp check is still necessary because it's possible for a later item to be processed before an earlier item. For example:
1. An event is received from data stream 1, but the thread is swapped out before it can add the event to the queue.
2. An event is received from data stream 2 and is added to the queue.
3. The event from data stream 2 is removed from the queue by the processQueue function.
4. The thread from step 1 gets another time slice and its item is added to the queue.
This isn't unusual, just infrequent. And the time difference will typically be on the order of microseconds.
If you regularly get updates out of order, then you can introduce an artificial delay. For example, in your updated question you show messages coming in out of order by 500 milliseconds. Let's assume that 500 milliseconds is the maximum tolerance you want to support. That is, if a message comes in more than 500 ms late, then it will get dropped.
What you do is add 500 ms to the timestamp when you add the thing to the priority queue. That is:
q.push(AddMs(dat.timestamp, 500), dat);
And in the loop that processes things, you don't dequeue something before its timestamp. Something like:
while (true)
{
if (q.peek().timestamp <= currentTime)
{
data = q.pop();
if (data.timestamp >= lastTimestamp)
{
processData(data);
lastTimestamp = data.timestamp;
}
}
}
This introduces a 500 ms delay in the processing of all items, but it prevents dropping "late" updates that fall within the 500 ms threshold. You have to balance your desire for "real time" updates with your desire to prevent dropping updates.
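The effect of the hold-back can be replayed offline. The following TypeScript sketch is a simplified model, not the C++ consumer loop itself: it assumes the consumer polls continuously, so a message is handled at the later of its arrival time and its timestamp plus the tolerance. The names replayWithHoldBack and Arrival, and the arrival times in the sample data, are made up for illustration.

```typescript
interface Arrival {
  arrivedAt: number; // when the message reached us (ms)
  timestamp: number; // when the message was produced (ms)
}

function replayWithHoldBack(arrivals: Arrival[], toleranceMs: number) {
  // A message becomes eligible at timestamp + tolerance, but never before it arrives.
  const ready = arrivals
    .map(a => ({ ...a, readyAt: Math.max(a.arrivedAt, a.timestamp + toleranceMs) }))
    .sort((x, y) => x.readyAt - y.readyAt || x.timestamp - y.timestamp);

  const processed: number[] = [];
  const dropped: number[] = [];
  let lastTimestamp = -Infinity;
  for (const msg of ready) {
    if (msg.timestamp >= lastTimestamp) {
      processed.push(msg.timestamp);
      lastTimestamp = msg.timestamp;
    } else {
      dropped.push(msg.timestamp); // older than something already processed
    }
  }
  return { processed, dropped };
}

// Stream A arrives on time, stream B arrives roughly 500 ms late (hypothetical data).
const sampleArrivals: Arrival[] = [
  { arrivedAt: 2000, timestamp: 2000 }, // A
  { arrivedAt: 2010, timestamp: 1500 }, // B
  { arrivedAt: 3000, timestamp: 3000 }, // A
  { arrivedAt: 3010, timestamp: 2500 }, // B
];
const withoutHoldBack = replayWithHoldBack(sampleArrivals, 0);
// processed: [2000, 3000], dropped: [1500, 2500]
const withHoldBack = replayWithHoldBack(sampleArrivals, 500);
// processed: [1500, 2000, 2500, 3000], dropped: []
```

With no hold-back every B message is dropped; with a tolerance that covers B's lateness everything is processed in timestamp order, at the cost of the added delay.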
There will always be a lag, and that lag will be determined by how long you are willing to wait for your slowest "fixed-ish rate" stream.
Suggestion:
1. Keep the buffer.
2. Keep an array of bool flags with the meaning: "if position ix is true, the buffer holds at least one sample originating from stream ix".
3. Sort/process as soon as all flags are true.
Not foolproof (each buffer will be sorted, but from one buffer to the next you may have timestamp inversions), but perhaps good enough?
Playing around with the count of "satisfied" flags to trigger the processing (at step 3) may be used to make the lag smaller, but with the risk of more inter-buffer timestamp inversions. In extreme, accepting the processing with only one satisfied flag means "push a frame as soon as you receive it, timestamp sorting be damned".
I mention this to support my feeling that the balance between lag and timestamp inversions is inherent to your problem: except for absolutely equal frame rates, there will be no perfect solution in which neither side is sacrificed.
Since a "solution" will be an act of balancing, any solution will require gathering/using extra information to help decisions (e.g. that "array of flags"). If what I suggested sounds silly for your case (may well be, the details you chose to share aren't too many), start thinking what metrics will be relevant for your targeted level of "quality of experience" and use additional data structures to help gathering/processing/using those metrics.
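The flag-based suggestion above could look roughly like this. It is a minimal TypeScript sketch under the assumption that every stream has a fixed index; createFlagBuffer and Sample are hypothetical names, and the trigger is the strict "all flags true" variant.

```typescript
interface Sample {
  stream: number; // index of the originating stream
  timestamp: number;
}

function createFlagBuffer(streamCount: number) {
  let pending: Sample[] = [];
  let seen: boolean[] = new Array<boolean>(streamCount).fill(false);
  return {
    // Returns a timestamp-sorted batch once every stream has contributed, else null.
    push(sample: Sample): Sample[] | null {
      pending.push(sample);
      seen[sample.stream] = true;
      if (!seen.every(flag => flag)) {
        return null;
      }
      const batch = pending.sort((a, b) => a.timestamp - b.timestamp);
      pending = [];
      seen = new Array<boolean>(streamCount).fill(false);
      return batch;
    },
  };
}

// Two streams, fed in the arrival order from the question.
const flagBuffer = createFlagBuffer(2);
const afterA2 = flagBuffer.push({ stream: 0, timestamp: 2 });    // null: stream 1 still missing
const afterB15 = flagBuffer.push({ stream: 1, timestamp: 1.5 }); // batch with timestamps 1.5, 2
const afterA3 = flagBuffer.push({ stream: 0, timestamp: 3 });    // null: a new buffer has started
```

Lowering the number of flags required before flushing trades less lag for more inversions between buffers, as described above.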

alternative to zip that produces value whenever any of the observable emit a value

At the moment, zip will only produce a value when all of the zipped observables have produced a value. E.g., from the docs:
Merges the specified observable sequences or Promises into one
observable sequence by using the selector function whenever all of the
observable sequences have produced an element
I'm looking for an operator that zips observables in a similar way but emits an array even if not all of them have produced a value.
E.g., let's say I have tick$, observ1, and observ2. tick$ always produces a value every x seconds, while observ1 and observ2 only produce values from time to time.
I'm expecting my stream to look like
[tick, undefined, observ2Res],
[tick, undefined, undefined],
[tick, observ1Res, observ2Res]
...
...
It's not combineLatest, given that combineLatest takes the latest value of each observable.
I believe buffer (or maybe sample) might get you on the right track. The buffer method accepts an Observable that's used to define our buffer boundaries. The resulting stream emits any items that were emitted in that window (example stolen from RXJS docs for buffer):
var source = Rx.Observable.timer(0, 50)
.buffer(function () { return Rx.Observable.timer(125); })
.take(3);
var subscription = source.subscribe(x => console.log('Next: ', x));
// => Next: 0,1,2
// => Next: 3,4,5
// => Next: 6,7
So we now have a way to get all of a stream's emitted events in a certain time window. In your case, we can use tick$ to describe our sampling period and observ1 and observ2 are our underlying streams that we want to buffer:
const buffered1 = observ1.buffer(tick$);
const buffered2 = observ2.buffer(tick$);
Each of these streams will emit once every tick$ period, and will emit a list of all emitted items from the underlying stream (during that period). The buffered stream will emit data like this:
|--[]--[]--[1, 2, 3]--[]-->
To get the output you desire, we can choose to only look at the latest emitted item of each buffered result, and if there's no emitted data, we can pass null:
const buffered1 = observ1.buffer(tick$).map(latest);
const buffered2 = observ2.buffer(tick$).map(latest);
function latest(x) {
return x.length === 0 ? null : x[x.length - 1];
}
The previous sample stream I illustrated will now look like this:
|--null--null--3--null-->
And finally, we can zip these two streams to get "latest" emitted data during our tick$ interval:
const sampled$ = buffered1.zip(buffered2);
This sampled$ stream will emit the latest data from our observ1 and observ2 streams over the tick$ window. Here's a sample result:
|--[null, null]--[null, 1]--[1, 2]-->
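To make the sampling rule concrete without running RxJS, here is a small TypeScript model of buffer(tick$).map(latest) followed by the zip. It is only a sketch: latestPerTick and Emission are hypothetical names, the emissions are assumed to be sorted by time, and a value emitted exactly on a tick is assumed to belong to that tick's window.

```typescript
interface Emission<T> {
  at: number; // emission time in ms
  value: T;
}

// For each tick: the last value emitted since the previous tick, or null if none.
function latestPerTick<T>(tickTimes: number[], emissions: Emission<T>[]): (T | null)[] {
  let previousTick = -Infinity;
  return tickTimes.map(tick => {
    const inWindow = emissions.filter(e => e.at > previousTick && e.at <= tick);
    previousTick = tick;
    const last = inWindow.pop();
    return last === undefined ? null : last.value;
  });
}

const tickTimes = [100, 200, 300];
const column1 = latestPerTick(tickTimes, [{ at: 250, value: "a" }]);
const column2 = latestPerTick(tickTimes, [
  { at: 150, value: "x" },
  { at: 260, value: "y" },
  { at: 280, value: "z" },
]);
// Pair the per-tick results, as zip would do for the two buffered streams.
const sampled = column1.map((value, i) => [value, column2[i]]);
// [[null, null], [null, "x"], ["a", "z"]]
```

Because both columns produce exactly one entry per tick, pairing them by position never stalls, unlike zipping the raw streams.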

Resources