Return individual RDDs from a DStream window - spark-streaming

How can I get the individual RDDs out of a DStream window?
my_dstream_window : somedstream.window(3mins,1min)
Suppose my_dstream_window above contains {rdd1,rdd2,rdd3}. I want to perform an operation such as
rdd1.Operation(rdd2).Operation(rdd3)
Intention: my_dstream_window has duplicates. I could use reduceByKey to remove them within this window, but the next slide of my_dstream_window will contain key-value pairs that overlap with the previous one.
So my task is to save only distinct key-value pairs, removing any pair that overlaps with the previous window.
Please suggest an approach.

There's an undocumented method on DStream that lets you get the RDDs belonging to a slice of time:
def slice(fromTime: Time, toTime: Time): Seq[RDD[T]]
This is used internally by the window functions, but it is also exposed as public API. To use it, we need to keep track of time, as it requires a time interval as its parameters. It returns a sequence of the RDDs belonging to that interval. (These must have been previously "remembered", either explicitly or through calling window functions.)
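As a rough illustration of how slice could serve the deduplication goal from the question, here is a pure-Python model rather than Spark code. The names slice_batches and new_pairs are hypothetical, batches are plain lists of key-value tuples keyed by an integer batch time, and the interval is assumed to be inclusive at both ends.

```python
def slice_batches(batches, from_time, to_time):
    """Return the batches whose time lies in [from_time, to_time], oldest first.

    `batches` maps a batch time to a list of (key, value) pairs and stands in
    for the RDDs a DStream has remembered (hypothetical model, not Spark API).
    """
    return [batches[t] for t in sorted(batches) if from_time <= t <= to_time]


def new_pairs(batches, current_time, window, step):
    """Pairs of the newest batch that did not occur in the earlier window batches."""
    earlier = slice_batches(batches, current_time - window + step, current_time - step)
    seen = {pair for batch in earlier for pair in batch}
    result = []
    for pair in batches[current_time]:
        if pair not in seen:
            seen.add(pair)  # also drops duplicates inside the newest batch
            result.append(pair)
    return result


# Three one-minute batches forming a three-minute window.
batches = {
    1: [("a", 1), ("b", 2)],
    2: [("b", 2), ("c", 3)],
    3: [("c", 3), ("d", 4), ("d", 4)],
}
fresh = new_pairs(batches, current_time=3, window=3, step=1)
```

In this model only ("d", 4) survives, because every other pair of the newest batch already appeared in an older batch of the same window. In Spark the same idea would operate on the Seq[RDD[T]] returned by slice.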

Related

How to go about parallelizing my processing using tbb::parallel_for and tbb::dataflow?

I have a source of files that I need to process.
From each file, my code generates a variable number of data objects, let's call it N.
I have K number of processing objects that can be used to process the N data objects.
I'm thinking of doing the following using the TBB flow graph (tbb::flow):
Create a function_node with concurrency K and put my K processing objects into a concurrent_queue.
Use input_node to read file, generate the N data objects, and try_put each into the function_node.
The function_node body dequeues a processing object, uses it to process a data object, then returns the processing object back to the concurrent_queue when done.
Another way I can think of is possibly like so:
Create a function_node with serial concurrency.
Use input_node to read file, generate the N data objects, put the data objects into a collection and send over to the function_node.
At the function_node, partition the N objects into K ranges and use each of the K processing objects to process each range concurrently - not sure if it is possible to customize parallel_for for this purpose.
The advantage of the first method is probably lower latency because I can start sending data objects through the dataflow the moment they are generated rather than have to wait for all N data objects to be generated.
What do you think is the best way to go about parallelizing this processing?
Yes, you are right that the first method has the advantage of not having to wait for all of the data objects to be generated before processing starts. It has a further advantage, though: it does not have to wait for parallel_for to finish processing every data object passed to it. This becomes especially visible if processing speed varies between data objects and/or between processing objects.
Also, it seems sufficient to use a buffer_node followed by a (perhaps reserving) join_node instead of a concurrent_queue to hold the processing objects for reuse. In this case, the function_node returns the processing object to the buffer_node once it finishes processing the data object. So the graph would look like the following:
input_node -> input_port<0>(join_node);
buffer_node -> input_port<1>(join_node);
join_node -> function_node;
function_node -> buffer_node;
In this case, the concurrency of the function_node can be left unlimited, as it is automatically bounded by the number of processing objects (available tokens) in the graph.
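The token-passing idea behind this graph can be modeled outside TBB. The following is a minimal Python sketch, not TBB code: a queue.Queue plays the role of the buffer_node holding the K processing objects, taking a token before processing plays the role of the join_node, and putting it back mirrors the function_node -> buffer_node edge. The Processor class and run_with_tokens are hypothetical names.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class Processor:
    """Hypothetical stand-in for one of the K processing objects."""

    def __init__(self, name):
        self.name = name
        self.handled = 0

    def process(self, item):
        self.handled += 1  # safe: a processor is held by one worker at a time
        return item * 2


def run_with_tokens(items, k):
    tokens = queue.Queue()
    processors = [Processor(f"p{i}") for i in range(k)]
    for proc in processors:
        tokens.put(proc)

    def work(item):
        proc = tokens.get()  # blocks until a processing object is free
        try:
            return proc.process(item)
        finally:
            tokens.put(proc)  # hand the token back for reuse

    # More worker threads than tokens: effective concurrency is still at most k.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(work, items))
    return results, processors


results, processors = run_with_tokens(range(20), k=3)
```

Even though the pool has more threads than processing objects, at most k items are processed at once, which is the same effect as leaving the function_node concurrency unlimited in the graph above.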
Also, note that generating data objects from different files can be done in parallel as well. If you see a benefit from that, consider using function_node instead of input_node, as the latter is always serial. However, in this case, use a join_node with the queueing policy, since function_node is not reservable.
Also, please consider using tbb::parallel_pipeline instead, as it seems you have a classic pipelining scheme of processing. In particular, this and that link might be useful.

Network shuffle in streaming

keyBy or groupBy causes a network shuffle that repartitions the stream. This is said to be quite expensive, since it involves network communication as well as serialization and deserialization.
For example, if I run the following operators:
map(Mapper1).keyBy(0).map(Mapper2)
with a parallelism of 2, I would get something like this:
Mapper1(1) -\-/- Mapper2(1)
             X
Mapper1(2) -/-\- Mapper2(2)
In the end, all records from Mapper1 with the same key are assigned to the same partition of Mapper2.
My question is:
I want to know what happens during keyBy or groupBy in streaming. Is every processed element serialized and deserialized by every subtask? How can I compare the cost of keyBy or groupBy with that of another operation?
Also, I am familiar with the concept of a partitioner in batch systems, but I get a bit confused when I try to apply it to streaming.
Thank you !
Apache Flink buffers the outgoing records of a task and then sends them to the next task for processing. The buffer timeout is a job-level parameter that can be configured with setBufferTimeout on the StreamExecutionEnvironment; its default value is 100 ms. After this time, the buffers are sent automatically even if they are not full.
The following links are also really helpful for understanding the details:
https://flink.apache.org/2019/06/05/flink-network-stack.html
https://flink.apache.org/2019/07/23/flink-network-stack-2.html
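To make the shuffle in the question more concrete, here is a small Python model of a keyed exchange. It is a sketch under simplifying assumptions, not Flink's implementation: target_subtask and key_by are hypothetical names, the partition is a plain stable hash modulo the parallelism (Flink's real assignment goes through key groups), and pickle stands in for Flink's serializers. It illustrates that each record is serialized once by the sending subtask and deserialized once by the single receiving subtask that owns its key.

```python
import pickle
import zlib


def target_subtask(key, parallelism):
    """Pick the downstream subtask for a key (simplified: stable hash modulo)."""
    return zlib.crc32(repr(key).encode("utf-8")) % parallelism


def key_by(records, key_index, parallelism):
    """Route every record to the subtask that owns its key."""
    channels = [[] for _ in range(parallelism)]
    for record in records:
        payload = pickle.dumps(record)  # sender side: serialize once
        channels[target_subtask(record[key_index], parallelism)].append(payload)
    # Receiver side: each subtask deserializes only what arrived on its channel.
    return [[pickle.loads(p) for p in channel] for channel in channels]


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
partitions = key_by(records, key_index=0, parallelism=2)
```

Every key ends up in exactly one of the two partitions, so the cost per record is roughly one serialization, one transfer, and one deserialization, rather than a round trip through every subtask.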

Adding a global store for a transformer to consume

Is there a way to add a global store for a Transformer to use? In the docs for transformer it says:
"Transform each record of the input stream into zero or more records in the output stream (both key and value type can be altered arbitrarily). A Transformer (provided by the given TransformerSupplier) is applied to each input record and computes zero or more output records. In order to assign a state, the state must be created and registered beforehand via stores added via addStateStore or addGlobalStore before they can be connected to the Transformer"
Yet the API for addGlobalStore only takes a ProcessorSupplier?
addGlobalStore(storeBuilder: StoreBuilder[_ <: StateStore],
topic: String,
consumed: Consumed[_, _],
stateUpdateSupplier: ProcessorSupplier[_, _])
My end goal is to use the Kafka Streams DSL with a transformer, since I need a flatMap that transforms both keys and values before writing to my output topic. I do not have a processor in my topology, though.
I would expect something like this:
addGlobalStore(storeBuilder: StoreBuilder[_ <: StateStore], topic: String, consumed: Consumed[_, _], stateUpdateSupplier: TransformerSupplier[_, _])
The Processor that is passed into addGlobalStore() is used to maintain (ie, write) the store. Note that, at the moment, this Processor is expected to copy the data as-is into the store (cf https://issues.apache.org/jira/browse/KAFKA-7663).
After you have added a global store, you can also add a Transformer, and the Transformer can access the store. Note that it is not required to connect a global store to make it available (only "regular" stores need to be connected). Also note that a Transformer only gets read access to global stores.
Use a Processor instead of a Transformer for all the transformations you want to perform on the input topic whenever the use case involves looking up data in a GlobalStateStore. Use context.forward(key,value,childName) to send data to the downstream nodes. context.forward(key,value,childName) may be called multiple times in process() and punctuate(), so as to send multiple records to the downstream node. If the GlobalStateStore needs to be updated, do this only in the Processor passed to addGlobalStore(..), because a GlobalStreamThread is associated with the GlobalStateStore and keeps the state of the store consistent across all running Kafka Streams instances.
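The division of labor described in these answers can be modeled in a few lines of Python. This is only a conceptual sketch, not Kafka Streams code: GlobalStore, update, read_only_view, and transform are hypothetical names. The store is written solely by its own update path, which copies records as-is, while the transformer receives a read-only view and emits zero or more output records per input.

```python
from types import MappingProxyType


class GlobalStore:
    """Hypothetical model of a global store with a single writer."""

    def __init__(self):
        self._data = {}

    def update(self, key, value):
        # Plays the role of the processor given to addGlobalStore:
        # it copies each record into the store unchanged.
        self._data[key] = value

    def read_only_view(self):
        # What a transformer would see: lookups only, no writes.
        return MappingProxyType(self._data)


def transform(record, store_view):
    """flatMap-style step: zero or more (key, value) outputs per input record."""
    key, value = record
    prefix = store_view.get(key)
    if prefix is None:
        return []  # no match in the global store: emit nothing
    return [(f"{prefix}:{word}", word.upper()) for word in value.split()]


store = GlobalStore()
store.update("u1", "alice")
view = store.read_only_view()
out = transform(("u1", "hello world"), view)
```

Here one input record yields two output records with both key and value changed, and any attempt to assign through view raises a TypeError, which mirrors the read-only access a Transformer has to a global store.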

Storm fields grouping

I'm having the following situation:
There are a number of bolts that calculate different values
These values are sent to a visualization bolt
The visualization bolt opens a web socket and sends the values to be visualized somehow
The thing is, visualization bolt is always the same, but it sends a message with a different header for each type of bolt that can be its input. For example:
BoltSum calculates sum
BoltDif calculates difference
BoltMul calculates the product
All these bolts use VisualizationBolt for visualization
There are 3 instances of VisualizationBolt in this case
My question is, should I create 3 independent instances, where each instance will have one thread, e.g.
builder.setBolt("forSum", new VisualizationBolt(),1).globalGrouping("bolt-sum");
builder.setBolt("forDif", new VisualizationBolt(),1).globalGrouping("bolt-dif");
builder.setBolt("forMul", new VisualizationBolt(),1).globalGrouping("bolt-mul");
Or should I do the following
builder.setBolt("forAll", new VisualizationBolt(),3)
.fieldsGrouping("bolt-sum", new Fields("type"))
.fieldsGrouping("bolt-dif", new Fields("type"))
.fieldsGrouping("bolt-mul", new Fields("type"));
And emit type from each of the previous bolts, so they can be grouped based on it?
What are the advantages?
Also, should I expect that output from bolt-sum will always go to the first visualization bolt, output from bolt-dif to the second, and output from bolt-mul to the third? They won't be mixed?
I think that should be the case, but it currently isn't in my implementation, so I'm not sure whether it's a bug or I'm missing something.
The first approach, using three instances, is the correct one. Using fieldsGrouping ensures neither that "sum" values go to the "Sum-Visualization-Bolt" nor that sum/diff/mul values are kept apart (ie, in different bolt instances).
The semantics of fieldsGrouping are more relaxed: it only guarantees that all tuples of the same type will be processed by a single bolt instance, ie, it will never be the case that two different bolt instances get the same type.
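A small Python model shows why this guarantee does not keep the three types apart. It is a sketch, not Storm's code: toy_hash is a deliberately simple, hypothetical hash (Storm hashes the grouping fields differently), and fields_grouping_task just takes it modulo the number of tasks.

```python
def toy_hash(value):
    """Deterministic toy hash (sum of character codes); not Storm's hash."""
    return sum(ord(ch) for ch in value)


def fields_grouping_task(type_field, num_tasks):
    """Same field value always maps to the same task index."""
    return toy_hash(type_field) % num_tasks


# Three types, three VisualizationBolt tasks, as in the question.
assignment = {t: fields_grouping_task(t, 3) for t in ("sum", "dif", "mul")}
```

With this toy hash, "sum" maps to task 2 while "dif" and "mul" both map to task 1, leaving task 0 idle. Each type still goes to exactly one task, which is all fieldsGrouping promises; a one-to-one mapping between types and tasks only happens if the hash values happen not to collide.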
I guess you can use Partial Key grouping (partialKeyGrouping). The Storm documentation on stream groupings says:
Partial Key grouping: The stream is partitioned by the fields
specified in the grouping, like the Fields grouping, but are load
balanced between two downstream bolts, which provides better
utilization of resources when the incoming data is skewed. This paper
provides a good explanation of how it works and the advantages it
provides.
I implemented a simple topology using this grouping, and the chart on the Graphite server shows a better load balance compared to fieldsGrouping. The full source code is here.
topologyBuilder.setBolt(MqttSensors.BOLT_SENSOR_TYPE.getValue(), new SensorAggregateValuesWindowBolt().withTumblingWindow(Duration.seconds(5)), 2)
// .fieldsGrouping(MqttSensors.SPOUT_STATION_01.getValue(), new Fields(MqttSensors.FIELD_SENSOR_TYPE.getValue()))
// .fieldsGrouping(MqttSensors.SPOUT_STATION_02.getValue(), new Fields(MqttSensors.FIELD_SENSOR_TYPE.getValue()))
.partialKeyGrouping(MqttSensors.SPOUT_STATION_01.getValue(), new Fields(MqttSensors.FIELD_SENSOR_TYPE.getValue()))
.partialKeyGrouping(MqttSensors.SPOUT_STATION_02.getValue(), new Fields(MqttSensors.FIELD_SENSOR_TYPE.getValue()))
.setNumTasks(4) // This will create 4 Bolt instances
.addConfiguration(TagSite.SITE.getValue(), TagSite.EDGE.getValue())
;
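The idea behind the partial key grouping quoted above can be sketched in Python. This is an illustration under assumptions, not Storm's implementation: candidates and partial_key_route are hypothetical names, and crc32 and adler32 are arbitrary choices for the two hash functions. Each key gets two candidate tasks, and every tuple goes to whichever candidate has received fewer tuples so far.

```python
import zlib


def candidates(key, num_tasks):
    """Two candidate tasks per key, derived from two different hashes."""
    data = key.encode("utf-8")
    return zlib.crc32(data) % num_tasks, zlib.adler32(data) % num_tasks


def partial_key_route(keys, num_tasks):
    """Send each tuple to the less loaded of its key's two candidate tasks."""
    load = [0] * num_tasks
    routed = []
    for key in keys:
        first, second = candidates(key, num_tasks)
        task = first if load[first] <= load[second] else second
        load[task] += 1
        routed.append((key, task))
    return routed, load


# Skewed input: one sensor type dominates the stream.
keys = ["temperature"] * 8 + ["humidity"] * 2
routed, load = partial_key_route(keys, num_tasks=4)
```

Unlike fieldsGrouping, a key may now be seen by up to two tasks, so any per-key aggregation has to be merged downstream. That is the trade-off for the better balance on skewed data.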

Duplicates when link-walking Riak using Ripple

I'm working on a project where I use Riak with Ripple, and I've stumbled on a problem.
For some reason I get duplicates when link-walking a structure of links. When I link walk using curl I don't get the duplicates as far as I can see.
The difference between my curl based link-walk
curl -v http://127.0.0.1:8098/riak/users/2306403e5177b4716da9df93b67300824aa2fd0e/_,projects,0/_,tasks,1
and my ruby ripple/riak-client based link walk
result = Riak::MapReduce.new(self.robject.bucket.client).
add(self.robject.bucket,self.key).
link(Riak::WalkSpec.new({:key => 'projects'})).
link(Riak::WalkSpec.new({:key => 'tasks', :bucket=>'tasks'})).
map("function(v){ if(!JSON.parse(v.values[0].data).completed) {return [v];} else { return [];} }", {:keep => true}).run
is, as far as I can tell, the map at the end.
However, the result of the map/reduce contains several duplicates. I can't wrap my head around why. For now I've settled for removing the duplicates based on the key, but I wish the Riak result didn't contain duplicates, since it seems wasteful to remove them at the end.
I've tried the following:
Making sure there are no duplicates in the link sets of my ripple objects
Loading the data without the map/reduce, but the link walk still contains duplicate keys.
Any help is appreciated.
What you're running into here is an interesting side-effect/challenge of Map/Reduce queries.
M/R queries don't have any notion of read quorum values, and they necessarily have to hit every object (within the limitations of input filtering, of course) on every node.
Which means, when N > 1, the queries have to hit every copy of every object.
For example, let's say N=3, as per default. That means, for each written object, there are 3 copies, one each on 3 different nodes.
When you issue a read for an object (let's say with the default quorum value of R=2), the coordinating node (which received the read request from your client) contacts all 3 nodes (and potentially receives 3 different values, 3 different copies of the object).
It then checks to make sure that at least 2 of those copies have the same values (to satisfy the R=2 requirement), returns that agreed-upon value to the requesting client, and discards the other copies.
So, in regular operations (reads/writes, but also link walking), the coordinating node filters out the duplicates for you.
Map/Reduce queries don't have that luxury. They don't really have quorum values associated with them -- they are made to iterate over every (relevant) key and object on all the nodes. And because the M/R code runs on each individual node (close to the data) instead of just on the coordinating node, they can't really filter out any duplicates intrinsically. One of the things they're designed for, for example, is to update (or delete) all of the copies of the objects on all the nodes. So, each Map phase (in your case above) runs on every node, returns the matched 'completed' values for each copy, and ships the results back to the coordinating node to return to the client. And since it's very likely that your N>1, there are going to be duplicates in the result set.
Now, you can probably filter out duplicates explicitly by writing code in the Reduce phase that checks whether a key is already present and rejects duplicates if it is, etc.
But honestly, if I were in your situation, I would just filter out the duplicates in Ruby on the client side, rather than mess with the reduce code.
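The client-side filtering suggested here amounts to keeping the first object seen for each bucket/key pair. A minimal sketch of that logic, written in Python for illustration (the same few lines translate directly to Ruby); dedupe_by_key is a hypothetical helper, and the dictionaries are an assumed shape, not the actual result format of riak-client:

```python
def dedupe_by_key(results):
    """Keep the first object for each (bucket, key) pair, preserving order."""
    seen = set()
    unique = []
    for obj in results:
        ident = (obj["bucket"], obj["key"])
        if ident not in seen:
            seen.add(ident)
            unique.append(obj)
    return unique


# Assumed shape of map/reduce results containing one duplicate.
raw = [
    {"bucket": "tasks", "key": "t1", "completed": False},
    {"bucket": "tasks", "key": "t2", "completed": False},
    {"bucket": "tasks", "key": "t1", "completed": False},
]
tasks = dedupe_by_key(raw)
```

The same check could be placed in a Reduce phase instead, but doing it on the client keeps the query simple.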
Anyways, I hope that sheds some light on this mystery.

Resources