I am trying to convert a for loop that takes the list held by each object and collects those lists into another list:
/*List<Rbs> cellList = new ArrayList<Rbs>();
for (CommonPrep commonTestInfo : commonTest.values()) {
    cellList.addAll(commonTestInfo.getRbsList());
}*/
List<Rbs> cellList =
commonTest.values()
.stream()
.flatMap(rbsList -> rbsList.getRbsList().stream())
.collect(Collectors.toList());
Is there a better way of doing it than using two streams (as I'm currently doing)?
A slightly different variant.
List<Rbs> cellList =
commonTest.values()
.stream()
.map(CommonPrep::getRbsList)
.flatMap(Collection::stream)
.collect(Collectors.toList());
Either way, you end up calling the stream method once for .values().stream() and N more times inside flatMap (because the function passed to flatMap returns a stream), where N is the number of elements produced by the map operation.
In fact, each intermediate operation also returns a Stream<T>, so there are more stream objects than what I've mentioned above.
There's no way to avoid it. Besides, creating a stream is generally a cheap operation, so you shouldn't be concerned about it.
Thus, unless you're experiencing performance issues, it's better not to think about how many Stream objects are created; focus on writing your query to achieve what you want and let the library handle the rest.
Even if you were experiencing performance issues, trying to avoid creating new streams on each function or method call would not work: every intermediate operation returns a new stream on each invocation, and some operations accepting behavioral parameters also involve new streams, as with the function passed to flatMap.
I very much doubt you're currently experiencing performance issues due to the creation of stream objects. Anyhow, you can always consider going parallel if you do run into performance problems.
It's important to understand several factors before even attempting to go parallel. You can read the answers here for things to consider.
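If you do end up going parallel, the change to the query itself is minimal. Here is a sketch of the parallel variant of the same pipeline; whether it actually helps depends on the size of commonTest, the cost of the downstream work, and your hardware, so measure before committing to it:
List<Rbs> cellList =
    commonTest.values()
              .parallelStream()                    // same pipeline, parallel execution
              .map(CommonPrep::getRbsList)
              .flatMap(Collection::stream)
              .collect(Collectors.toList());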
I am new to KStream and would like to know best practices or guidance on how to optimally process batches of N records using KStream. I have working code, shown below, but it only handles single messages at a time.
KStream<String, String> sourceStream = builder.stream("upstream-kafka-topic",
        Consumed.with(Serdes.String(), Serdes.String()));

// transform sourceStream using an implementation of ValueTransformer<String, String>
sourceStream.transformValues(() -> new MyValueTransformer())
            .to("downstream-kafka-topic",
                Produced.with(Serdes.String(), Serdes.String()));
The code above works with single records because MyValueTransformer, which implements ValueTransformer, transforms a single String value. How do I make the above code work for a Collection of String values?
You would need to "buffer / aggregate" the messages somehow. For example, you could add a state store to your transformer and store N messages inside the store. As long as the store contains fewer than N messages, you don't do any processing and don't emit any output (you might want to use flatTransformValues, which allows you to emit zero results). A sketch of this idea follows.
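For illustration, here is a minimal sketch of that idea, assuming a batch size of N = 100, a hypothetical store named "batch-store", and a hypothetical processBatch method containing your batch logic; it ignores fault-tolerance details (e.g. the in-memory counter resets on restart) and is not a drop-in implementation:
// Register the state store the transformer will use to buffer incoming values.
builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("batch-store"),
        Serdes.Integer(), Serdes.String()));

sourceStream
    .flatTransformValues(() -> new ValueTransformer<String, Iterable<String>>() {
        private final int batchSize = 100;                     // N, chosen for illustration
        private KeyValueStore<Integer, String> store;
        private int count = 0;

        @Override
        public void init(ProcessorContext context) {
            store = (KeyValueStore<Integer, String>) context.getStateStore("batch-store");
        }

        @Override
        public Iterable<String> transform(String value) {
            store.put(count++, value);                         // buffer the record
            if (count < batchSize) {
                return Collections.emptyList();                // emit nothing yet
            }
            // Batch complete: drain the store and process the whole collection at once.
            List<String> batch = new ArrayList<>(batchSize);
            for (int i = 0; i < batchSize; i++) {
                batch.add(store.get(i));
                store.delete(i);
            }
            count = 0;
            return processBatch(batch);                        // hypothetical batch logic
        }

        @Override
        public void close() { }
    }, "batch-store")
    .to("downstream-kafka-topic", Produced.with(Serdes.String(), Serdes.String()));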
I'm not sure what you're trying to achieve. Kafka Streams is, by design, meant to process one record at a time. If you want to process a collection or batch of messages, you have a few options.
You might not actually need Kafka Streams, since the example you mention doesn't do much with the message. In that case you can use a normal Consumer, which lets you process in batches; check the Spring Kafka implementation of this here -> https://docs.spring.io/spring-kafka/docs/current/reference/html/#receiving-messages (Kafka fetches batches on the network layer, and although you would normally process one record at a time, it is possible to process batches with a standard client, as in the sketch below). Alternatively, you might model your value object to contain an array of messages, so that each record you receive carries an embedded collection, which you could then process with Kafka Streams; check the array type for Avro -> https://avro.apache.org/docs/current/spec.html#Arrays
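As a sketch of the first option, here is roughly what batch processing with the plain Kafka consumer could look like; the topic name, group id, batch size, and the processBatch method are placeholders:
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "batch-processor");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);        // upper bound on the batch size

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("upstream-kafka-topic"));
    while (true) {
        // poll() already hands you a whole batch of records
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        List<String> batch = new ArrayList<>();
        records.forEach(r -> batch.add(r.value()));
        if (!batch.isEmpty()) {
            processBatch(batch);                               // your batch logic (hypothetical)
        }
    }
}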
Check this part of the documentation to better understand the Kafka Streams concepts -> https://kafka.apache.org/31/documentation/streams/core-concepts
Let's say I have an input Flux containing many strings (billions of them) like this:
apple
application
bible
book
There are billions of such strings; they won't fit into memory, which is why I want to use the reactive approach.
The stream is sorted. Now what I want is to create a flux of ordered groups of strings by the first 3 characters:
app: apple, application
bib: bible
boo: book
This Flux ends up on an HTTP response which means that all "app" items must be outputted before "bib" items begin.
Without using Flux, I could rely on the ordering and collect the items into a prepared bucket as they come (the number of strings per bucket will fit into memory); whenever the prefix changes, I would flush the bucket and start collecting the new prefix. The big advantage of the stream being ordered is that once I encounter a new prefix, I know the old one will never come again.
But with Flux I don't know how to do this. .groupBy() returns a Flux of Flux, but I don't think that will work when trying to serialize this to the HTTP response output stream.
This is pretty much a textbook use case for windowUntilChanged().
In your case, the "key" you want to extract is the first 3 letters of the string, so you can do something like flux.windowUntilChanged(str -> str.substring(0,3)), which will give you a Flux<Flux<String>>, where the inner fluxes start and finish on change of the first 3 letters in your string. You may want to add some additional logic to deal with words less than 3 characters long of course, but I'll leave that as an exercise for the reader :-)
(I know you've mentioned it in the question, but just for clarification and the sake of anyone else finding this answer - this will only work if the incoming elements on the stream are already sorted alphabetically.)
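For illustration, here is a minimal sketch of that approach; the way each group is formatted into a single output line is just one possible choice, and short words are handled naively:
Flux<String> words = Flux.just("apple", "application", "bible", "book"); // already sorted

Flux<String> groups = words
        // start a new inner Flux whenever the first 3 letters change
        .windowUntilChanged(w -> w.substring(0, Math.min(3, w.length())))
        // concatMap preserves ordering, which matters for the HTTP response;
        // only one bucket is held in memory at a time
        .concatMap(window -> window.collectList()
                .map(bucket -> bucket.get(0).substring(0, Math.min(3, bucket.get(0).length()))
                        + ": " + String.join(", ", bucket)));

groups.subscribe(System.out::println);
// app: apple, application
// bib: bible
// boo: book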
I have a source of files that I need to process.
From each file, my code generates a variable number of data objects, let's call it N.
I have K number of processing objects that can be used to process the N data objects.
I'm thinking of doing the following using TBB's flow graph (tbb::flow):
Create a function_node with concurrency K and put my K processing objects into a concurrent_queue.
Use input_node to read file, generate the N data objects, and try_put each into the function_node.
The function_node body dequeues a processing object, uses it to process a data object, then returns the processing object back to the concurrent_queue when done.
Another way I can think of is possibly like so:
Create a function_node with serial concurrency.
Use input_node to read file, generate the N data objects, put the data objects into a collection and send over to the function_node.
At the function_node, partition the N objects into K ranges and use each of the K processing objects to process each range concurrently - not sure if it is possible to customize parallel_for for this purpose.
The advantage of the first method is probably lower latency because I can start sending data objects through the dataflow the moment they are generated rather than have to wait for all N data objects to be generated.
What do you think is the best way to go about parallelizing this processing?
Yes, you are right that the first method has the advantage of not waiting for all of the data objects to be generated before starting their processing. However, it also has the advantage of not waiting for the processing of all the data objects passed to parallel_for to complete. This becomes especially visible if the speed of processing varies per data object and/or per processing object.
Also, instead of a concurrent_queue, it seems enough to have a buffer_node followed by a (perhaps reserving) join_node to hold the processing objects for reuse. In this case, the function_node would return the processing object back to the buffer_node once it finishes processing the data object. So the graph will look like the following:
input_node -> input_port<0>(join_node);
buffer_node -> input_port<1>(join_node);
join_node -> function_node;
function_node -> buffer_node;
In this case, the concurrency of the function_node can be left unlimited, as it will automatically be limited by the number of processing objects (available tokens) that exist in the graph.
Also, note that generating data objects from different files can be done in parallel as well. If you see a benefit from that, consider using a function_node instead of the input_node, as the latter is always serial. However, in this case, use a join_node with the queueing policy, since function_node is not reservable.
Also, please consider using tbb::parallel_pipeline instead as it seems you have a classic pipelining scheme of processing. In particular, this and that link might be useful.
I'm reading data from Elasticsearch into Spark every 5 minutes, so there will be an RDD every 5 minutes.
I want to construct a DStream based on these RDDs, so that I can get reports for data within the last day, the last hour, the last 5 minutes, and so on.
To construct the DStream, I was thinking about creating my own receiver, but the official Spark documentation only shows how to do that in Scala or Java, and I use Python.
So, is there any way to do this? I know it should be possible; after all, a DStream is just a series of RDDs, so there ought to be a way to create a DStream from a sequence of incoming RDDs. I just don't know how. Please give some advice.
Writing your own receiver would be one way, as you mentioned, but it seems like a lot of overhead. What you can do instead is use a queue stream, which creates a QueueInputDStream, like in this example. It's Scala, but you should be able to do a similar thing in Python:
import scala.collection.mutable.Queue // the mutable queue that backs the stream
val rddQueue = new Queue[RDD[Map[String, Any]]]()
val inputStream = ssc.queueStream(rddQueue)
Afterwards you simply query your ES instance every X sec/min/h/day/whatever and you put the results into that queue.
With Python I guess it would be something like this:
rddQueue = []
rddQueue.append(es_rdd())  # es_rdd() is a method that returns an RDD from ES
inputStream = ssc.queueStream(rddQueue)
# some kind of loop that appends new RDDs to rddQueue
Apparently you need to have something in the queue before you use it inside queueStream (or at least I'm getting exceptions in pyspark if it's empty).
It's not necessary to use receivers. You can directly extend the InputDStream class to implement your Elasticsearch data-pulling logic. Not relying on receivers is a better approach when your data already comes from replicated, replayable storage.
See : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.InputDStream
Though I'm not sure you can easily create InputDStream subclasses directly from Python.
I've been looking into Akka lately, and it looks like a great framework for building scalable servers on the JVM. However, most of the libraries on the JVM are blocking (e.g. JDBC), so don't you lose out on the performance benefits of the event-based model because your threads will always be blocked? Does Akka do something to get around this? Or is it just something you have to live with until we get more non-blocking libraries on the JVM?
Have a look at CQRS; it greatly improves scalability by separating reads from writes, which means you can scale your reads independently of your writes.
For the kinds of blocking I/O issues you mentioned, Scala provides a language-embedded solution that matches perfectly: Futures. For example:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global // Futures need an ExecutionContext
def expensiveDBQuery(key : Key) = Future {
  //...query the database
}
val dbResult : Future[Result] =
expensiveDBQuery(...) //non-blocking call
The call to expensiveDBQuery returns immediately; the Result will become available "in the future". The cool part about a Future is that you can think of it like any old collection, except that you can never call .size on it. Other than that, all the collection-ish functions (e.g. map, filter, foreach, ...) are fair game. Simply think of dbResult as a list of Results. What would you do with such a list:
dbResult.map(_.getValues)
.filter(values => someTestOnValues(values))
...
That sequence of calls sets up a computation pipeline that is invoked once the Result actually comes back from the database. You can specify a sequence of computation steps before the data has even arrived, all asynchronously.
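If you're working from Java rather than Scala, the same pattern can be sketched with CompletableFuture: run the blocking call (e.g. JDBC) on a dedicated thread pool so it never blocks your actor/event threads. The query method, Result type, and pool size below are placeholders mirroring the Scala example above:
ExecutorService jdbcPool = Executors.newFixedThreadPool(16);    // pool reserved for blocking work

CompletableFuture<Result> dbResult =
        CompletableFuture.supplyAsync(() -> expensiveDBQuery(key), jdbcPool); // returns immediately

dbResult.thenApply(Result::getValues)          // transform once the Result arrives
        .thenAccept(System.out::println);      // consume asynchronously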