Confusion with Spark Streaming multiple input Kafka DStreams - spark-streaming

I'm new to Spark Streaming. I don't know the difference between the codes below:
A:
val kafkaDStreams = (1 to 3).map { i =>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topicsMap, StorageLevel.MEMORY_AND_DISK_SER)
    .map(_._2)
}
ssc.union(kafkaDStreams).foreachRDD(......)
B:
val kafkaDStreams = (1 to 3).map { i =>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topicsMap, StorageLevel.MEMORY_AND_DISK_SER)
    .map(_._2)
    .foreachRDD(......)
}
What are the differences between these two code samples when they are executed in a Spark Streaming app? Any help? Thanks!

I'll do this backwards, because it's probably easier to explain that way :-)
In the second example, you create three DStreams via (1 to 3).map { [...] createStream [...] } and then call foreachRDD on each of them, so you have three separate sets of processing running in parallel. Your foreachRDD function is therefore called three times for every batch interval you set on your Spark streaming context - i.e. in the first interval you'll get one call to foreachRDD for stream 1, one for stream 2 and one for stream 3.
In the first example, you create the same three DStreams, but then call union on them to produce a single DStream containing the elements of all three. This means you get just one call to your foreachRDD function per batch interval, but the RDD it receives now contains the elements from stream 1, stream 2 and stream 3 combined.
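To make the difference concrete, here is a minimal sketch of both variants (a sketch only - it assumes ssc, kafkaParams and topicsMap are defined as in the question, and you would register only one of the two variants in a real application):

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaDStreams = (1 to 3).map { _ =>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topicsMap, StorageLevel.MEMORY_AND_DISK_SER)
    .map(_._2)
}

// A: union first - foreachRDD runs once per batch interval, and the RDD
// it receives holds the records from all three receivers
ssc.union(kafkaDStreams).foreachRDD { rdd =>
  println(s"A: one call per batch, ${rdd.count()} records in total")
}

// B: no union - foreachRDD is registered on each stream separately, so it
// runs three times per batch interval, each call seeing one receiver's records
kafkaDStreams.foreach { stream =>
  stream.foreachRDD { rdd =>
    println(s"B: per-stream call, ${rdd.count()} records from this receiver")
  }
}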

Related

How to get similar behavior to bufferCount whilst emitting if there are less items than the buffer count

I'm trying to achieve something very similar to a buffer count. As values come through the pipe, bufferCount of course buffers them and sends them down in batches. I'd like something similar to this that will emit all remaining items if there are currently fewer than the buffer size in the stream.
It's a little confusing to word, so I'll provide an example with what I'm trying to achieve.
I have something adding items individually to a subject. Sometimes it'll add 1 item a minute, sometimes it'll add 1000 items in 1 second. I want to run a long-running process (~2 seconds) on batches of these items so as not to overload the server.
So for example, consider the timeline where P is processing
---A-----------B----------C---D--EFGHI------------------
|_( P(A) ) |_(P(B)) |_( P(C) ) |_(P([D, E, F, G, H, I]))
This way I can process the events in small or large batches depending on how many events are coming through, while ensuring the batches never grow larger than X.
I basically need to map all the individual emits into emits that contain chunks of 5 or fewer. As I pipe the events into a concatMap, events will start to stack up. I want to pick these stacked up events off in batches. How can I achieve this?
Here's a stackblitz with what I've got so far: https://stackblitz.com/edit/rxjs-iqwcbh?file=index.ts
Note how items 4 and 5 don't get processed until more items come in and fill the buffer. Ideally, after 1, 2, 3 are processed, it would pick 4, 5 off the queue. Then when 6, 7, 8 come in, it would process those.
EDIT: today I learned that bufferTime has a maxBufferSize parameter, which makes it emit as soon as the buffer reaches that size. Therefore, the original answer below isn't necessary; we can simply do this:
const stream$ = subject$.pipe(
  bufferTime(2000, null, 3), // <-- buffer emits @ 2000ms OR when 3 items collected
  filter(arr => !!arr.length)
);
StackBlitz
ORIGINAL:
It sounds like you want a combination of bufferCount and bufferTime. In other words: "release the buffer when it reaches size X or after Y time has passed".
We can use the race operator, along with those other two to create an observable that emits when the buffer reaches the desired size OR after the duration has passed. We'll also need a little help from take and repeat:
const chunk$ = subject$.pipe(bufferCount(3));

const partial$ = subject$.pipe(
  bufferTime(2000),
  filter(arr => !!arr.length) // don't emit empty array
);

const stream$ = race([chunk$, partial$]).pipe(
  take(1),
  repeat()
);
Here we define stream$ to be the first to emit between chunk$ and partial$. However, race will only use the first source that emits, so we use take(1) and repeat to sort of "reset the race".
Then you can do your work with concatMap like this:
stream$.pipe(
  concatMap(chunk => this.doWorkWithChunk(chunk))
);
Here's a working StackBlitz demo.
You may want to roll it into a custom operator, so you can simply do something like this:
const stream$ = subject$.pipe(
  bufferCountTime(5, 2000)
);
The definition of bufferCountTime() could look like this:
function bufferCountTime<T>(count: number, time: number) {
  return (source$: Observable<T>) => {
    const chunk$ = source$.pipe(bufferCount(count));
    const partial$ = source$.pipe(
      bufferTime(time),
      filter((arr: T[]) => !!arr.length)
    );
    return race([chunk$, partial$]).pipe(
      take(1),
      repeat()
    );
  };
}
Another StackBlitz sample.
Since I noticed the use of forkJoin in your sample code, I can see you are sending a request to the server for each emission (I was originally under the impression that you were making only 1 call per batch with combined data).
In the case of sending one request per item the solution is much simpler!
There is no need to batch the emissions; you can simply use mergeMap and specify its concurrency parameter, which limits the number of concurrently executing requests:
const stream$ = subject$.pipe(
  mergeMap(val => doWork(val), 3), // 3 max concurrent requests
);
When the subject rapidly emits, notice that work only starts for the first 3 items; emissions after that are queued up and processed as the prior in-flight items complete.
Here's a StackBlitz example of this behavior.
TLDR;
A StackBlitz app with the solution can be found here.
Explanation
Here would be an approach:
const bufferLen = 3;

const count$ = subject.pipe(filter((_, idx) => (idx + 1) % bufferLen === 0));

const timeout$ = subject.pipe(
  filter((_, idx) => idx === 0),
  switchMapTo(timer(0))
);

subject
  .pipe(
    buffer(
      merge(count$, timeout$).pipe(
        take(1),
        repeat()
      )
    ),
    concatMap(buffer => forkJoin(buffer.map(doWork)))
  )
  .subscribe(/* console.warn */);
/* Output:
Processing 1
Processing 2
Processing 3
Processed 1
Processed 2
Processed 3
Processing 4
Processing 5
Processed 4
Processed 5
Processing 6 <- after the `setTimeout`'s timer expires
Processing 7
Processing 8
Processed 6
Processed 7
Processed 8
*/
The idea was to keep bufferCount's behavior when items come in synchronously, but at the same time detect when fewer items than the chosen bufferLen are in the buffer. I thought this detection could be done using timer(0), because it internally schedules a macrotask, which ensures that items emitted synchronously are considered first.
However, there is no single operator that combines the logic outlined above. It's important to keep in mind that we want behavior similar to what the buffer operator provides, so we will for sure end up with something like subject.pipe(buffer(...)).
Let's see how we can achieve something similar to what bufferTime does, but without using bufferTime:
const bufferLen = 3;
const count$ = subject.pipe(filter((_, idx) => (idx + 1) % bufferLen === 0));
Given the above snippet, buffer(count$) and bufferCount(3) should give us the same behavior.
Let's move now onto the detection part:
const timeout$ = subject.pipe(
  filter((_, idx) => idx === 0),
  switchMapTo(timer(0))
);
What it essentially does is to start a timer after the subject has emitted its first item. This will make more sense when we have more context:
subject
  .pipe(
    buffer(
      merge(count$, timeout$).pipe(
        take(1),
        repeat()
      )
    ),
    concatMap(buffer => forkJoin(buffer.map(doWork)))
  )
  .subscribe(/* console.warn */);
By using merge(count$, timeout$), this is what we'd be saying: when the subject emits, start adding items to the buffer and, at the same time, start the timer. The timer is started as well because it is what detects that fewer than bufferLen items have arrived in the buffer.
Let's walk through the example provided in the StackBlitz app:
from([1, 2, 3, 4, 5])
  .pipe(tap(i => subject.next(i)))
  .subscribe();

// Then mimic some more items coming through a while later
setTimeout(() => {
  subject.next(6);
  subject.next(7);
  subject.next(8);
}, 10000);
When 1 is emitted, it will be added to the buffer and the timer will start. Then 2 and 3 arrive immediately, so the accumulated values will be emitted.
Because we're also using take(1) and repeat(), the process will restart. Now, when 4 is emitted, it will be added to the buffer and the timer will start again. 5 arrives immediately, but the number of items collected so far is still less than the given buffer length, so no 3rd value arrives and the timer has time to finish. When the timer fires, the [4, 5] chunk will be emitted. What happens with [6, 7, 8] is the same as what happened with [1, 2, 3].

Spark DataFrame suddenly becomes very slow when I reuse the old cached data iteratively too many times

The problem happens when I try to keep my cached results in a List and calculate a new DataFrame from all the data in that list in each iteration. However, even though I use an empty DataFrame and get an empty result each time, the function suddenly gets very slow after about 8~12 rounds.
Here is my code
testLoop(Nil)
def testLoop(lastDfList: List[DataFrame]) {
  // do some dummy transformation like union and cache the result
  val resultDf = lastDfList.foldLeft(Seq[Data]().toDF) { (df, lastDf) => df.union(lastDf) }.cache
  // always get 0, of course
  println(resultDf.count)
  // benchmark action
  benchmark(resultDf.count)
  testLoop(resultDf :: lastDfList)
}
the benchmark result
1~6 round : < 200ms
7 round : 367ms
8 round : 918ms
9 round : 2476ms
10 round : 7833ms
11 round : 24231ms
I don't think GC or block eviction is the problem in my case since I already use an empty DataFrame, but I don't know what the cause is. Do I misunderstand the meaning of cache or something?
Thanks!
After reading ImDarrenG's solution, I changed my code to be the following:
spark.sparkContext.setCheckpointDir("/tmp")
testLoop(Nil)
def testLoop(lastDfList: List[DataFrame]) {
  // do some dummy transformation like union and cache the result
  val resultDf = lastDfList.foldLeft(Seq[Data]().toDF) { (df, lastDf) => df.union(lastDf) }.cache
  resultDf.checkpoint()
  // always get 0, of course
  println(resultDf.count)
  // benchmark action
  benchmark(resultDf.count)
  testLoop(resultDf :: lastDfList)
}
But it still becomes very slow after a few iterations.
Here you create a list of DataFrames by adding resultDf to the beginning of lastDfList and pass that to the next iteration of testLoop:
testLoop(resultDf::lastDfList)
So lastDfList gets longer each pass.
This line creates a new DataFrame by unioning each member of lastDfList:
val resultDf = lastDfList.foldLeft(Seq[Data]().toDF) { (df, lastDf) => df.union(lastDf) }.cache
Each member of lastDfList is a union of its predecessors; therefore, Spark is maintaining a lineage that grows exponentially larger with each pass of testLoop.
I expect that the increase in time is caused by the housekeeping of the DAG. Caching the DataFrames removes the need to repeat the transformations, but the lineage must still be maintained by Spark.
Cached data or not, it looks like you are building a really complex DAG by unioning each DataFrame with all of its predecessors on each pass of testLoop.
You could use checkpoint to trim the lineage, and introduce some check to prevent infinite recursion.
According to the API and the code, checkpoint returns a new Dataset instead of modifying the original Dataset.
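A minimal sketch of how that might look (reusing the Data, benchmark and spark names from the question, and keeping the Dataset returned by checkpoint instead of the original):

import org.apache.spark.sql.DataFrame
import spark.implicits._

spark.sparkContext.setCheckpointDir("/tmp")

def testLoop(lastDfList: List[DataFrame]): Unit = {
  val unioned = lastDfList.foldLeft(Seq[Data]().toDF) { (df, lastDf) => df.union(lastDf) }
  // checkpoint() is eager by default: it materializes the data and returns
  // a NEW Dataset with a truncated lineage - keep working with that one
  val resultDf = unioned.checkpoint()
  println(resultDf.count)
  benchmark(resultDf.count)
  testLoop(resultDf :: lastDfList) // a real version still needs a termination check
}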

Alternative to zip that produces a value whenever any of the observables emits a value

At the moment zip will only produce a value when all of the zipped observables have produced a value. E.g. from the docs:
Merges the specified observable sequences or Promises into one
observable sequence by using the selector function whenever all of the
observable sequences have produced an element
I'm looking for an observable that can sort of zip observables, but produces an array of the zipped observables' values where it doesn't matter whether all of them have produced a value.
E.g. let's say I have tick$, observ1 and observ2. tick$ always produces a value every x seconds, while observ1 and observ2 only produce values from time to time.
I'm expecting my stream to look like
[tick, undefined, observ2Res],
[tick, undefined, undefined],
[tick, observ1Res, observ2Res]
...
...
It's not combineLatest, given that combineLatest takes the latest value of a given observable.
I believe buffer (or maybe sample) might get you on the right track. The buffer method accepts an Observable that's used to define our buffer boundaries. The resulting stream emits any items that were emitted in that window (example stolen from RXJS docs for buffer):
var source = Rx.Observable.timer(0, 50)
  .buffer(function () { return Rx.Observable.timer(125); })
  .take(3);
var subscription = source.subscribe(x => console.log('Next: ', x));
// => Next: 0,1,2
// => Next: 3,4,5
// => Next: 6,7
So we now have a way to get all of a stream's emitted events in a certain time window. In your case, we can use tick$ to describe our sampling period and observ1 and observ2 are our underlying streams that we want to buffer:
const buffered1 = observ1.buffer(tick$);
const buffered2 = observ2.buffer(tick$);
Each of these streams will emit once every tick$ period, and will emit a list of all emitted items from the underlying stream (during that period). The buffered stream will emit data like this:
|--[]--[]--[1, 2, 3]--[]-->
To get the output you desire, we can choose to only look at the latest emitted item of each buffered result, and if there's no emitted data, we can pass null:
const buffered1 = observ1.buffer(tick$).map(latest);
const buffered2 = observ2.buffer(tick$).map(latest);

function latest(x) {
  return x.length === 0 ? null : x[x.length - 1];
}
The previous sample stream I illustrated will now look like this:
|--null--null--3--null-->
And finally, we can zip these two streams to get "latest" emitted data during our tick$ interval:
const sampled$ = buffered1.zip(buffered2);
This sampled$ stream will emit the latest data from our observ1 and observ2 streams over the tick$ window. Here's a sample result:
|--[null, null]--[null, 1]--[1, 2]-->

Does a shuffle in Spark "always" move data around, even in trivial cases?

In the simple case of a Spark dataset that contains partitions where each key is only present in a single partition, like in the case of the following two partitions:
[ ("a", 1), ("a", 2) ]
[ ("b", 1) ],
will a shuffle operation (like groupByKey) generally shuffle the data across partitions, even though there is no need for it?
I am asking because shuffling is expensive, so this matters for large datasets. My use case is exactly this: a large dataset where each key almost always sits in a single partition.
Well, it depends. By default groupByKey uses a HashPartitioner. Let's assume you have only two partitions. That means pairs with key "a" will go to partition number 1
scala> "a".hashCode % 2
res17: Int = 1
and pairs with key "b" to partition 2
scala> "b".hashCode % 2
res18: Int = 0
If you create RDD like this:
val rdd = sc.parallelize(("a", 1) :: ("a", 2) :: ("b", 1) :: Nil, 2).cache
there are multiple possible scenarios for how the data is distributed. First we'll need a small helper:
import org.apache.spark.TaskContext

def addPartId[T](iter: Iterator[T]) = {
  Iterator((TaskContext.get.partitionId, iter.toList))
}
Scenario 1
rdd.mapPartitions(addPartId).collect
Array((0,List((b,1))), (1,List((a,1), (a,2))))
No data movement required since all pairs are already on the right partition
Scenario 2
Array((0,List((a,1), (a,2))), (1,List((b,1))))
Although the matching pairs are already on the same partition, all pairs have to be moved since the partition IDs don't match the keys
Scenario 3
A mixed distribution where only part of the data has to be moved:
Array((0,List((a,1))), (1,List((a,2), (b,1))))
If the data is partitioned using a HashPartitioner before groupByKey, there is no need for shuffling whatsoever.
val rddPart = rdd.partitionBy(new HashPartitioner(2)).cache
rddPart.mapPartitions(addPartId).collect
Array((0,List((b,1))), (1,List((a,1), (a,2))))
rddPart.groupByKey
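As a quick sanity check (a sketch reusing the rdd and addPartId helper from above), the pre-partitioned RDD carries a known partitioner, which is exactly why the subsequent groupByKey can keep the existing layout instead of adding a shuffle stage:

import org.apache.spark.HashPartitioner

val rddPart = rdd.partitionBy(new HashPartitioner(2)).cache

// The partitioner is now known to Spark ...
rddPart.partitioner // => Some(HashPartitioner)

// ... so groupByKey reuses it and the grouped data stays on its current partitions
rddPart.groupByKey.mapPartitions(addPartId).collect
// e.g. Array((0,List((b,CompactBuffer(1)))), (1,List((a,CompactBuffer(1, 2)))))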

Mux & Demux w/ LINQ

I'm playing around with using LINQ to Objects for multiplexing and demultiplexing but it seems to me that this is a pretty tricky problem.
See this demuxer signature:
public static IEnumerable<IEnumerable<TSource>> Demux<TSource>(this IEnumerable<TSource> source, int multiplexity)
On an abstract level this is easy, but ideally one would want to:
remain lazy for the source stream
remain lazy for each multiplexed stream
not reiterate over the same elements
How would you do this?
I'm a bit tired so it could be my concentration failing me here...
Assuming that you want (0, 1, 2, 3) to end up as (0, 2) and (1, 3) when demuxing to two streams, you basically can't do it without buffering. You could buffer only when necessary, but that would be difficult. Basically you need to be able to cope with two contradictory ways of using the call...
Getting both iterators, and reading one item from each of them:
// Ignoring disposing of iterators etc
var query = source.Demux(2);
var demuxIterator = query.GetEnumerator();
demuxIterator.MoveNext();
var first = demuxIterator.Current;
demuxIterator.MoveNext();
var second = demuxIterator.Current;
first.MoveNext();
Console.WriteLine(first.Current); // Prints 0
second.MoveNext();
Console.WriteLine(second.Current); // Prints 1
Or getting one iterator, then reading both items:
// Ignoring disposing of iterators etc
var query = source.Demux(2);
var demuxIterator = query.GetEnumerator();
demuxIterator.MoveNext();
var first = demuxIterator.Current;
first.MoveNext();
Console.WriteLine(first.Current); // Prints 0
first.MoveNext();
Console.WriteLine(first.Current); // Prints 2
In the second case, it has to either remember the 1, or be able to reread it.
Any chance you could deal with IList<T> instead of IEnumerable<T>? That would "break" the rest of LINQ to Objects, admittedly - lazy projections etc would be a thing of the past.
Note that this is quite similar to the problem that operations like GroupBy have - they're deferred, but not lazy: as soon as you start reading from a GroupBy result, it reads the whole of the input data.
