Controlling observable buffering by observable itself - rx-scala

I'm trying to slice an observable stream by itself, e.g.:
val source = Observable.from(1 to 10).share
val boundaries = source.filter(_ % 3 == 0)
val result = source.tumblingBuffer(boundaries)
result.subscribe((buf) => println(buf.toString))
The output is:
Buffer()
Buffer()
Buffer()
Buffer()
source is probably consumed on the boundaries line, before it ever reaches result, so only the boundaries and the resulting (empty) buffers get created and there is nothing left to fill them with.
My approach to this is using publish/connect:
val source2 = Observable.from(1 to 10).publish
val boundaries2 = source2.filter(_ % 3 == 0)
val result2 = source2.tumblingBuffer(boundaries2)
result2.subscribe((buf) => println(buf.toString))
source2.connect
This produces the expected output:
Buffer(1, 2)
Buffer(3, 4, 5)
Buffer(6, 7, 8)
Buffer(9, 10)
Now I just need to hide connect from the outside world and call it when result gets subscribed (I am doing this inside a class and I don't want to expose it). Something like:
val source3 = Observable.from(1 to 10).publish
val boundaries3 = source3.filter(_ % 3 == 0)
val result3 = source3
  .tumblingBuffer(boundaries3)
  .doOnSubscribe(() => source3.connect)
result3.subscribe((buf) => println(buf.toString))
But now the doOnSubscribe action never gets called, so the published source never gets connected...
What's wrong?

You were on the right track with your publish solution. There is, however, an overload of publish that takes a lambda of type Observable[T] => Observable[R] as its argument (see the documentation). The argument of this lambda is the original stream, to which you can safely subscribe multiple times. Within the lambda you transform the original stream to your liking; in your case you filter the stream and buffer it on that filter.
Observable.from(1 to 10)
  .publish(src => src.tumblingBuffer(src.filter(_ % 3 == 0)))
  .subscribe(buf => println(buf.toString()))
The best thing about this overload is that you don't need to call anything like connect afterwards.

Related

RXJS subscribe only if previous value is not that great and I really need a better one

I have a costly server ajax request that takes one input (full: boolean). If full is false, the server may return either a partial or a full response (response.isFull == true); if full is true, the server will always return a full response. Normally the partial response is good enough, but certain conditions require a full response. I need to avoid requesting a full response explicitly as much as possible, so I thought I'd start with a BehaviorSubject that I can eventually feed with true, and combine it with distinctUntilChanged in case I ever need the full response. This gives me an observable that starts with false and emits true if I feed that into it:
const fullSubject = new BehaviorSubject<boolean>(false);
Then I've got a function that takes a boolean parameter and returns an observable with the server request (retried, transformed, etc.). As said, the response can be partial or full, and at the server's discretion it can be full even if the input parameter was false. For example:
interface IdentityData {
  ...
  isFull: boolean;
}

private getSimpleIdentity(full: boolean): Observable<IdentityData> {
  return Axios.get(`/api/identity${full ? "?full=true" : ""}`)
    .pipe( ... retry logic ...,
           ... transformation logic ...,
           shareReplay(1) );
}
I need to know how I can combine these so that the following holds:
The server needs to be queried at most twice.
If the first answer is a full answer, no further queries must be performed to the server.
If the first answer is a partial answer, and true is fed into fullSubject, a full answer must be requested.
The expected output from all this is an observable that emits either one full response, or a partial response and, when asked, a full response.
Environment: Vue 2.6.11, RxJS 6.5.5, Axios 0.19.2, TypeScript 3.7.5.
Thanks in advance
Here would be my approach:
const fullSubject = new BehaviorSubject(false);

const src$ = fullSubject.pipe(
  switchMap(isFull => Axios.get('...')),
  take(2), // Server required at most twice
  takeWhile(response => !response.isFull, true), // When `isFull`, it will complete & unsubscribe -> no more requests to the server
  shareReplay(1),
);

src$.subscribe(() => { /* ... */ });

function getFullAnswer() {
  fullSubject.next(true);
}
takeWhile takes a second argument, inclusive. When it is set to true and the predicate evaluates to false (e.g. isFull is true), that final value is emitted as well.
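For illustration, a minimal standalone sketch of that inclusive flag (plain RxJS 6, no server call; the literal objects are made up for the demo):

import { of } from 'rxjs';
import { takeWhile } from 'rxjs/operators';

// With inclusive = true, the value that fails the predicate is still emitted,
// and then the stream completes.
of({ isFull: false }, { isFull: true }, { isFull: false })
  .pipe(takeWhile(response => !response.isFull, true))
  .subscribe(response => console.log(response));
// logs { isFull: false }, then { isFull: true }, then completes;
// the third value is never emitted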
If I've understood the requirement correctly:
private getSimpleIdentity(): Observable<IdentityData> {
  return fullSubject.pipe(
    switchMap(full => Axios.get(`/api/identity${full ? "?full=true" : ""}`)),
    shareReplay(1),
  );
}
This example uses the retryWhen() operator:
const source = of("").pipe(map(() => Math.floor(Math.random() * 10 + 1)));

const example = source
  .pipe(
    tap((val) => console.log("tap", val)),
    map((val) => {
      // error will be picked up by retryWhen
      if (val !== 5) throw val;
      return val;
    }),
    retryWhen((errors) =>
      errors.pipe(
        tap(() => console.log("--Wait 1 seconds then repeat")),
        delay(1000)
      )
    )
  )
  .subscribe((val) => console.log("subscription", val));
/*
output:
tap 3
--Wait 1 seconds then repeat
tap 8
--Wait 1 seconds then repeat
tap 1
--Wait 1 seconds then repeat
tap 4
--Wait 1 seconds then repeat
tap 7
--Wait 1 seconds then repeat
tap 5
subscription 5
*/

Rxjs - Calculate time spent inside/outside a div

I am learning RxJS and wanted to try out a few examples on my own,
but I can't seem to get my head around thinking reactively.
I am trying to calculate the time a user's mouse pointer spends inside and outside a div.
see fiddle - https://jsfiddle.net/ishansoni22/44af3n3k/
<div class = "space">
<div>
let $space = $(".space")

let in$ = Rx.Observable.fromEvent($space, "mouseenter")
  .map((event) => "in")

let out$ = Rx.Observable.fromEvent($space, "mouseleave")
  .map((event) => "out")

let inOut$ = Rx.Observable.merge(in$, out$)

let time$ = Rx.Observable.interval(1000)
  .buffer(inOut$)
  .map((list) => list.length)

time$.subscribe((value) => console.log(value));
I am able to calculate the time, but how do I relate it to the respective in/out streams? I want the output to look something like:
inside, in - 20, out - 30
outside, in - 20, out - 35
inside, in - 100, out - 35
Also, can someone point me to some examples I could do so that I can start thinking in the reactive paradigm?
There are some examples in the official documentation (http://reactivex.io/rxjs) but they are a little bit scarce indeed.
I think I would solve your sample with something like this:
let $space = $(".space")

let in$ = Rx.Observable.fromEvent($space, "mouseenter")
let out$ = Rx.Observable.fromEvent($space, "mouseleave")

let durations$ = in$
  .map(_ => Date.now())
  .switchMap(inTime => out$
    .take(1)
    .map(_ => Date.now())
    .map(outTime => outTime - inTime)
  )

durations$
  .scan((sum, next) => sum + next, 0)
  .subscribe(total => console.log(total))
This starts by listening to in$; upon a mouseenter event it starts listening to mouseleave events, takes one of those and calculates the duration.
I have written multiple maps below each other for clarity, but of course you can compose that into a single function.
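For reference, a sketch of such a composed form, reusing the in$ and out$ streams from above (the durations2$ and enteredAt names are just for the demo):

// Capture the entry time when mouseenter fires, then measure the gap
// to the next mouseleave, all inside a single switchMap.
let durations2$ = in$.switchMap(() => {
  let enteredAt = Date.now()
  return out$.take(1).map(() => Date.now() - enteredAt)
})

durations2$
  .scan((sum, next) => sum + next, 0)   // running total of milliseconds spent inside
  .subscribe(total => console.log(total))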
One of the things I found most challenging when starting out with Rx was using streams of streams, and becoming comfortable with flatMap and switchMap. The problem you describe is most easily solved using exactly this approach. With your streams defined as follows (I prefer const over let to make it clear no mutation is occurring):
const in$ = Rx.Observable.fromEvent($space, 'mouseenter');
const out$ = Rx.Observable.fromEvent($space, 'mouseleave');
you can describe entering and then leaving as follows:
const inThenOut$ = in$.switchMap(() => out$);
To understand exactly what this is doing I urge you to learn about flatMap, become comfortable with streams of streams, and then learn how switchMap works by only maintaining a subscription to the most recent inner stream. For this I found the official rxjs documentation the best source. The included marble diagrams often tell complex stories with just a few dots and lines.
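If it helps, here is a tiny sketch of that difference, using made-up interval streams rather than mouse events (same RxJS 5 style as above):

// flatMap (alias of mergeMap) keeps every inner subscription alive;
// switchMap unsubscribes from the previous inner stream as soon as a new outer value arrives.
const outer$ = Rx.Observable.interval(1000).take(3);
const inner = i => Rx.Observable.interval(400).take(3).map(j => `outer ${i} / inner ${j}`);

outer$.flatMap(inner).subscribe(x => console.log('flatMap:  ', x));   // all nine combinations arrive
outer$.switchMap(inner).subscribe(x => console.log('switchMap:', x)); // earlier inner values are cut off when the next outer value arrives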
From here it's a relatively small step to get the time spent inside. First, we map our original streams into timestamp values:
const timestamp = () => + new Date();
const in$ = Rx.Observable.fromEvent($space, 'mouseenter').map(() => timestamp());
const out$ = Rx.Observable.fromEvent($space, 'mouseleave').map(() => timestamp());
(note: there is a timestamp method in rxjs you could use instead of doing this manually, but I feel this better illustrates how you can map your stream elements into anything you please).
From there, we can adjust our switchMap usage to access both the in and out values, and return the difference between them:
const inThenOut$ = in$.switchMap(() => out$, (x, y) => y - x);
Here's the whole thing working:
https://jsbin.com/qoruyoluho/edit?js,console,output
You could use the RxJS timestamp operator to attach a timestamp to each item emitted by an Observable, indicating when it was emitted.
const { fromEvent } = Rx;
const { map, switchMap, timestamp, take, tap } = RxOperators;

const in$ = fromEvent($space, 'mouseenter').pipe(
  timestamp(),
  tap(x => console.log(`In: ${x.timestamp}`))
)

const out$ = fromEvent($space, 'mouseleave').pipe(
  timestamp(),
  tap(x => console.log(`Out: ${x.timestamp}`))
)

const duration$ = in$.pipe(
  switchMap(start => out$.pipe(
    take(1),
    map(finish => finish.timestamp - start.timestamp),
    tap(value => console.log(`Duration ms: ${value}`))
  ))
)
/* output example
In: 1552295324302
Out: 1552295325158
Duration ms: 856
*/
Try it here: https://rxviz.com/v/rOW5g9x8

Python: using Pool to create multiple processes, but the tasks never produce results

All the functions are placed in a class, including the function that creates the processes and the function that each process runs; I call this class's functions from another file.
from multiprocessing import Pool

def initData(self, type):
    # create six process to deal with the data
    if type == 'train':
        data = pd.read_csv('./data/train_merged_8.csv')
    elif type == 'test':
        data = pd.read_csv('./data/test_merged_2.csv')
    modelvec = allWord2Vec('no').getModel()
    modelvec_all = allWord2Vec('all').getModel()
    modelvec_stop = allWord2Vec('stop').getModel()
    p = Pool(6)
    count = 0
    for i in data.index:
        count += 1
        p.apply_async(self.valueCal, args=(i, data, modelvec, modelvec_all, modelvec_stop))
        if count % 1000 == 0:
            print(str(count // 100) + 'h rows of data has been dealed')
    p.close()
    p.join

def valueCal(self, i, data, modelvec, modelvec_all, modelvec_stop):
    # the function run in process
    list_con = []
    q1 = str(data.get_value(i, 'question1')).split()
    q2 = str(data.get_value(i, 'question2')).split()
    f1 = self.getF1_union(q1, q2)
    f2 = self.getF2_inter(q1, q2)
    f3 = self.getF3_sum(q1, q2)
    f4_q1 = len(q1)
    f4_q2 = len(q2)
    f4_rate = f4_q1 / f4_q2
    q1 = [','.join(str(ve)) for ve in q1]
    q2 = [','.join(str(ve)) for ve in q2]
    list_con.append('|'.join(q1))
    list_con.append('|'.join(q2))
    list_con.append(f1)
    list_con.append(f2)
    list_con.append(f3)
    list_con.append(f4_q1)
    list_con.append(f4_q2)
    list_con.append(f4_rate)
    f = open('./data/test.txt', 'a')
    f.write('\t'.join(list_con) + '\n')
    f.close()
The output below appears almost immediately, but I never see the file being created. When I check the task manager, six processes have indeed been created and are consuming a lot of CPU. Yet when the program finishes, the file still has not been created.
How can I solve this problem?
10h rows of data have been dealed
20h rows of data have been dealed
30h rows of data have been dealed
40h rows of data have been dealed

RxJs - parse file, group lines by topics, but I miss the end

I am trying RxJS.
My use case is to parse a log file and group lines by topic (i.e. the beginning of a group is a filename, followed by lines with user, date/time and so on).
I can analyse the lines using a regExp and determine the beginning of a group.
I use .scan to group the lines together; when I reach the beginning of a new group, I create an observable from the lines I've accumulated ... fine.
The issue is the end of the file. I've started a new group and am accumulating lines, but I cannot emit the last group because I never get the information that the stream has ended. I would have expected to get that information in the complete callback (but I don't).
Here is an example using numbers. The beginning of a group is a multiple of 3 or 5. (Remark: I work in TypeScript.)
import * as Rx from "rx";

let r = Rx.Observable
  .range(0, 8)
  .scan(function (acc: number[], value: number): number[] {
    if ((value % 3 === 0) || (value % 5 === 0)) {
      acc.push(value);
      let info = acc.join(".");
      Rx.Observable
        .fromArray(acc)
        .subscribe((value) => {
          console.log(info, "=>", value);
        });
      acc = [];
    } else {
      acc.push(value);
    }
    return acc;
  }, [])
  .subscribe(function (x) {
    // console.log(x);
  });
This emits:
0 => 0
1.2.3 => 1
1.2.3 => 2
1.2.3 => 3
4.5 => 4
4.5 => 5
6 => 6
I am looking for a way to emit:
0 => 0
1.2.3 => 1
1.2.3 => 2
1.2.3 => 3
4.5 => 4
4.5 => 5
6 => 6
7.8 => 7  // the last items are missing, as I do not know how to detect the end
7.8 => 8
Can you help me with grouping the items?
Any good idea, even one not using scan, is welcome.
Thanks in advance
You can use the materialize operator. See the documentation here and the marbles here, and an example of use from SO.
In your case, I would try something like the following (untested, but hopefully you can complete it yourself; note that I don't know a thing about TypeScript, so there might be some syntax errors):
import * as Rx from "rx";

let r = Rx.Observable
  .range(0, 8)
  .materialize()
  .scan(function (acc: number[], materializedNumber: Rx.Notification<number>): number[] {
    let rangeValue: number = materializedNumber.value;
    if ((rangeValue % 3 === 0) || (rangeValue % 5 === 0)) {
      acc.push(rangeValue);
      generateNewObserverOnGroupOf(acc);
      acc = [];
    } else if (materializedNumber.kind === "C") {
      generateNewObserverOnGroupOf(acc);
      acc = [];
    } else {
      acc.push(rangeValue);
    }
    return acc;
  }, [])
  // .dematerialize()
  .subscribe(function (x) {
    // console.log(x);
  });

function generateNewObserverOnGroupOf(acc: number[]) {
  let info = acc.join(".");
  Rx.Observable
    .fromArray(acc)
    .subscribe((value) => {
      console.log(info, "=>", value);
    });
}
The idea is that materialize and dematerialize work with notifications, which encode whether the message being passed down the stream is a next, error, or completed message (respectively the 'N', 'E', 'C' values of the kind property). If you have a next notification, the value passed is in the value field of the notification object. Note that you need to dematerialize to return to the normal behaviour of the stream, so it can complete and free resources when finished.
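To see concretely what the scan above receives, here is a minimal sketch (same rx 4 / TypeScript setup as the question) that only logs the notification kinds:

import * as Rx from "rx";

// Each message becomes a Notification whose kind is "N" (next), "E" (error) or "C" (completed).
Rx.Observable
  .range(0, 3)
  .materialize()
  .subscribe((n: Rx.Notification<number>) => {
    console.log(n.kind, n.kind === "N" ? n.value : "");
  });
// logs: N 0, N 1, N 2, C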

Slow performance in spark streaming

I am using spark streaming 1.1.0 locally (not in a cluster).
I created a simple app that parses the data (about 10,000 entries), stores it in a stream and then performs some transformations on it. Here is the code:
def main(args: Array[String]) {
  val master = "local[8]"
  val conf = new SparkConf().setAppName("Tester").setMaster(master)
  val sc = new StreamingContext(conf, Milliseconds(110000))

  val stream = sc.receiverStream(new MyReceiver("localhost", 9999))
  val parsedStream = parse(stream)

  parsedStream.foreachRDD(rdd =>
    println(rdd.first() + "\nRULE STARTS " + System.currentTimeMillis()))

  val result1 = parsedStream
    .filter(entry => entry.symbol.contains("walking")
      && entry.symbol.contains("true") && entry.symbol.contains("id0"))
    .map(_.time)

  val result2 = parsedStream
    .filter(entry =>
      entry.symbol == "disappear" && entry.symbol.contains("id0"))
    .map(_.time)

  val result3 = result1
    .transformWith(result2, (rdd1, rdd2: RDD[Int]) => rdd1.subtract(rdd2))

  result3.foreachRDD(rdd =>
    println(rdd.first() + "\nRULE ENDS " + System.currentTimeMillis()))

  sc.start()
  sc.awaitTermination()
}

def parse(stream: DStream[String]) = {
  stream.flatMap { line =>
    val entries = line.split("assert").filter(entry => !entry.isEmpty)
    entries.map { tuple =>
      val pattern = """\s*[(](.+)[,]\s*([0-9]+)+\s*[)]\s*[)]\s*[,|\.]\s*""".r
      tuple match {
        case pattern(symbol, time) =>
          new Data(symbol, time.toInt)
      }
    }
  }
}

case class Data(symbol: String, time: Int)
I have a batch duration of 110,000 milliseconds in order to receive all the data in one batch. I believed that, even locally, Spark would be very fast, but in this case it takes about 3.5 s to execute the rule (between "RULE STARTS" and "RULE ENDS"). Am I doing something wrong, or is this the expected time? Any advice?
I was using case matching in a lot of my jobs and it killed performance, more than when I introduced a JSON parser. Also try tweaking the batch time on the StreamingContext; it made quite a bit of difference for me. And how many local workers do you have?
