How does control flow in Java 8 collectors? - java-8

I am learning how to use Java 8 streams. While debugging this piece of code:
Collector<Person, StringJoiner, String> collector =
    Collector.of(
        () -> new StringJoiner(" | "),
        (j, p) -> j.add(p.name.toLowerCase()),
        StringJoiner::merge,
        StringJoiner::toString);
System.out.println(persons.stream().collect(collector));
execution never reaches StringJoiner::merge or StringJoiner::toString. Yet if I replace the combiner (StringJoiner::merge) with null, the code throws a NullPointerException. I am unable to follow what is going on.
Additional (related) question:
How can I add logging to debug the lambdas? I tried adding braces for multi-line code blocks, but this does not compile:
Collector<Person, StringJoiner, String> collector =
    Collector.of(
        () -> {
            System.out.println("Supplier");
            new StringJoiner(" | ")},
        (j, p) -> j.add(p.name.toLowerCase()),
        StringJoiner::merge,
        StringJoiner::toString);

Here's your code with debug statements added (I replaced Person with String, but it doesn't change anything):
List<String> persons = Arrays.asList("John", "Mary", "Jack", "Jen");
Collector<String, StringJoiner, String> collector =
    Collector.of(
        () -> {
            System.out.println("Supplier");
            return new StringJoiner(" | ");
        },
        (j, p) -> {
            System.out.println("Accumulator");
            j.add(p.toLowerCase());
        },
        (stringJoiner, other) -> {
            System.out.println("Combiner");
            return stringJoiner.merge(other);
        },
        (stringJoiner) -> {
            System.out.println("Finisher");
            return stringJoiner.toString();
        });
System.out.println(persons.stream().collect(collector));
Run it, and you'll see that the finisher is definitely called:
- a StringJoiner is created by the supplier
- all persons are added to the joiner
- the finisher transforms the joiner into a String
The combiner, however, although required by the method of() (which rejects null arguments), is only relevant when the collector is used on a parallel stream and the stream actually decides to split the work across multiple threads, thereby using multiple joiners and combining them together.
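That null check happens when the collector is built: Collector.of() validates all of its arguments eagerly, so passing a null combiner throws a NullPointerException immediately, even though the combiner itself is never invoked on a sequential stream. A minimal illustration (using String elements, matching the debug example above):
Collector<String, StringJoiner, String> broken =
    Collector.of(
        () -> new StringJoiner(" | "),
        (j, s) -> j.add(s.toLowerCase()),
        null,                    // NullPointerException is thrown right here, before any element is processed
        StringJoiner::toString);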
To see the combiner actually being invoked, you'll need a large number of persons in the collection, and a parallel stream instead of a sequential one:
List<String> persons = new ArrayList<>();
for (int i = 0; i < 1_000_000; i++) {
    persons.add("p_" + i);
}
Collector<String, StringJoiner, String> collector =
    Collector.of(
        () -> {
            System.out.println("Supplier");
            return new StringJoiner(" | ");
        },
        (j, p) -> {
            System.out.println("Accumulator");
            j.add(p.toLowerCase());
        },
        (stringJoiner, other) -> {
            System.out.println("Combiner");
            return stringJoiner.merge(other);
        },
        (stringJoiner) -> {
            System.out.println("Finisher");
            return stringJoiner.toString();
        });
System.out.println(persons.parallelStream().collect(collector));
The number of threads used is decided by the stream, and it can even split the task of one thread into two further threads mid-way if it thinks that's a good idea. Let's just assume it chooses to use two:
- two StringJoiners are created by the supplier, and a thread is allocated to each joiner
- each thread adds half of the persons to its joiner
- the two joiners are merged together by the combiner
- the finisher transforms the merged joiner into a String
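Conceptually, that parallel run boils down to the following sketch (assuming exactly two chunks, here called firstHalf and secondHalf; in reality the framework splits the source via its spliterator and may use more containers):
// Simplified illustration of the parallel reduction with two chunks; supplier,
// accumulator, combiner and finisher are the four functions passed to Collector.of.
StringJoiner left = supplier.get();                    // container for chunk 1
firstHalf.forEach(p -> accumulator.accept(left, p));
StringJoiner right = supplier.get();                   // container for chunk 2
secondHalf.forEach(p -> accumulator.accept(right, p));
StringJoiner merged = combiner.apply(left, right);     // only needed because there were two containers
String result = finisher.apply(merged);                // produces the final String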

Related

Collect Uni.combine failures

Is there a way to collect Uni.combine().all().unis(...) failures, as Uni.join().all(...).andCollectFailures() does?
I need to call different services concurrently (with heterogeneous results) and fail all if one of them fails.
Moreover, what's the difference between Uni.combine().all().unis(...) and Uni.join(...)?
Uni Combine Exception
The code should look like this:
return Uni.combine().all().unis(getObject1(), getObject2()).collectFailures().asTuple()
    .flatMap(tuples -> {
        return Uni.createFrom().item(Response.ok().build());
    })
    .onFailure().recoverWithUni(failures -> {
        System.out.println(failures instanceof CompositeException);
        CompositeException exception = (CompositeException) failures;
        for (Throwable error : exception.getCauses()) {
            System.out.println(error.toString());
        }
        // failures.printStackTrace();
        return Uni.createFrom().item(Response.status(500).build());
    });
Difference:
Quarkus provides parallel processing through these two features:
Uni.join() (UniJoin) - iterate over a list of objects and perform a certain operation on each of them in parallel. You can either:
- iterate over a list of orders and perform activities on each order one by one (Multi), OR
- iterate over a list of orders and add a method wrapper for each object to the UniJoin builder; when the builder executes, the method wrappers are called in parallel and their responses are collected into a list, as in the example below.
List<RequestDTO> reqDataList = request.getRequestData(); // Your input data
UniJoin.Builder<ResponseDTO> builder = Uni.join().builder();
for (RequestDTO requestDTO : reqDataList) {
    builder.add(process(requestDTO));
}
return builder.joinAll().andFailFast().flatMap(responseList -> {
    List<ResponseDTO> nonNullList = new ArrayList<>();
    nonNullList.addAll(responseList.stream()
        .filter(respDTO -> respDTO != null)
        .collect(Collectors.toList()));
    return Uni.createFrom().item(nonNullList);
});
You can see each object in the list being wrapped by the method 'process', and those wrappers are then executed in parallel when 'andFailFast' is called.
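The 'process' wrapper itself is not shown in the answer; a hypothetical sketch of what it might look like is below (callRemoteService and the DTO types are assumptions, not part of the original code):
// Hypothetical wrapper: defers one request so the UniJoin builder can subscribe to
// all the wrappers concurrently, each on a worker thread.
Uni<ResponseDTO> process(RequestDTO requestDTO) {
    return Uni.createFrom()
        .item(() -> callRemoteService(requestDTO))                    // assumed blocking client call
        .runSubscriptionOn(Infrastructure.getDefaultWorkerPool());    // move it off the caller's thread
}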
'Uni.combine' - call separate methods that return different response types, in parallel:
List<OrderDTO> orders = new ArrayList<>();
return Uni.combine().all()
    .unis(getCountryMasters(), getCurrencyMasters(updateDto))
    .asTuple()
    .flatMap(tuple -> {
        List<CountryDto> countries = tuple.getItem1();
        List<CurrencyDto> currencies = tuple.getItem2();
        // Get the country code and currency code from each order and
        // convert them to the corresponding technical ids.
        return convert(orders, countries, currencies);
    });
As you can see above, the two methods passed to 'combine' return different result types, yet they are executed in parallel.
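For illustration, the two combined calls might have signatures like these (assumptions only; the original answer does not show them), each returning its own Uni so that Uni.combine can subscribe to both concurrently:
// Hypothetical shapes of the two calls combined above.
Uni<List<CountryDto>> getCountryMasters() {
    return countryRepository.listAll();            // assumed reactive repository call
}
Uni<List<CurrencyDto>> getCurrencyMasters(UpdateDto updateDto) {
    return currencyRepository.listFor(updateDto);  // assumed reactive repository call
}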

Filtering data bolt Storm

I have a simple Storm topology which reads the data from Kafka, then parses and extracts message fields. I would like to filter the stream of tuples by one of the field values and perform a counting aggregation on another one. How can I do this in Storm?
I haven't found corresponding methods for tuples (filter, aggregate), so should I perform these functions directly on the field values?
Here is a topology:
topologyBuilder.setSpout("kafka_spout", new KafkaSpout(spoutConfig), 1)
topologyBuilder.setBolt("parser_bolt", new ParserBolt()).shuffleGrouping("kafka_spout")
topologyBuilder.setBolt("transformer_bolt", new KafkaTwitterBolt()).shuffleGrouping("parser_bolt")
val config = new Config()
cluster.submitTopology("kafkaTest", config, topologyBuilder.createTopology())
I have set up KafkaTwitterBolt for counting and filtering on the parsed fields. So far I've only managed to filter the whole list of values, not by a specific field:
class KafkaTwitterBolt() extends BaseBasicBolt {

  override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
    val tweetValues = input.getValues.asScala.toList
    val filterTweets = tweetValues
      .map(_.toString)
      .filter(_ contains "big data")
    val resultAllValues = new Values(filterTweets)
    collector.emit(resultAllValues)
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = {
    declarer.declare(new Fields("created_at", "id", "text", "source", "timestamp_ms",
      "user.id", "user.name", "user.location", "user.url", "user.description", "user.followers_count",
      "user.friends_count", "user.lang", "user.favorite_count", "entities.hashtags"))
  }
}
It turns out the Storm core API does not allow that; in order to filter on an arbitrary field, Trident should be used (it has a built-in filter function).
The code would look like this:
val tridentTopology = new TridentTopology()
val stream = tridentTopology.newStream("kafka_spout", new KafkaTridentSpoutOpaque(spoutConfig))
  .map(new ParserMapFunction, new Fields("created_at", "id", "text", "source", "timestamp_ms",
    "user.id", "user.name", "user.location", "user.url", "user.description", "user.followers_count",
    "user.friends_count", "user.favorite_count", "user.lang", "entities.hashtags"))
  .filter(new LanguageFilter)
The filtering function itself:
class LanguageFilter extends BaseFilter {
  override def isKeep(tuple: TridentTuple): Boolean = {
    val language = tuple.getStringByField("user.lang")
    println(s"TWEET: $language")
    language.contains("en")
  }
}
Your answer at https://stackoverflow.com/a/59805582/8845188 is a little wrong. The Storm core API does allow filtering and aggregation; you just have to write the logic yourself.
A filtering bolt is just a bolt that discards some tuples and passes others on. For instance, the following bolt will filter out tuples based on a string field:
class FilteringBolt() extends BaseBasicBolt {

  override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
    val values = input.getValues // java.util.List[AnyRef]
    if ("Pass me".equals(values.get(0))) {
      collector.emit(values)
    }
    // Emitting nothing means discarding the tuple
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = {
    declarer.declare(new Fields("some-field"))
  }
}
An aggregating bolt is just a bolt that collects multiple tuples, and then emits a new aggregate tuple anchored in the original tuples:
class AggregatingBolt extends BaseRichBolt {
  // `collector` is the OutputCollector saved in prepare(); omitted here for brevity
  val tuplesToAggregate: java.util.List[Tuple] = new java.util.ArrayList[Tuple]()
  var counter = 0

  override def execute(input: Tuple): Unit = {
    tuplesToAggregate.add(input)
    counter += 1
    if (counter == 10) {
      val aggregateTuple: Values = ... // create a new set of values based on tuplesToAggregate
      // This anchors the new aggregate tuple to all the original tuples, so if the aggregate fails,
      // the original tuples are replayed.
      collector.emit(tuplesToAggregate, aggregateTuple)
      // Note that you MUST emit before you ack, or the at-least-once guarantee will be broken.
      tuplesToAggregate.forEach(t => collector.ack(t)) // Ack the original tuples now that this bolt is done with them
      tuplesToAggregate.clear()
      counter = 0
    }
    // Note that we don't ack the input tuples until the aggregate gets emitted.
    // This lets us replay all the aggregated tuples in case the aggregate fails.
  }
}
Note that for aggregation, you will need to extend BaseRichBolt and do acking manually, since you want to delay acking a tuple until it has been included in an aggregate tuple.

Save Multiple Records using Web Flux and Mongo DB

I'm working on a project which uses Spring WebFlux and MongoDB, and I'm very new to reactive programming and WebFlux.
I have a scenario of saving into 3 collections using one service. For each collection I'm generating an id using a sequence and then saving the document. I have a FieldMaster which has a List<FieldInfo>, and every FieldInfo has a List<FieldOption>. I need to save FieldMaster, FieldInfo and FieldOption. Below is the code I'm using. The code works only when I'm running in debug mode; otherwise it gets blocked on the line below:
Integer field_seq_id = Integer.parseInt(sequencesCollection.getNextSequence(FIELDINFO).block().getSeqValue());
Here is the full code:
public Mono<FieldMaster> createMasterData(Mono<FieldMaster> fieldmaster) {
    return fieldmaster.flatMap(fm -> {
        return sequencesCollection.getNextSequence(FIELDMASTER).flatMap(seqVal -> {
            LOGGER.info("Generated Sequence value :" + seqVal.getSeqValue());
            fm.setId(Integer.parseInt(seqVal.getSeqValue()));

            List<FieldInfo> fieldInfo = fm.getFieldInfo();
            fieldInfo.forEach(field -> {
                // saving Field goes here
                Integer field_seq_id = Integer.parseInt(sequencesCollection.getNextSequence(FIELDINFO).block().getSeqValue()); // stops execution at this line
                LOGGER.info("Generated Sequence value Field Sequence:" + field_seq_id);
                field.setId(field_seq_id);
                field.setMasterFieldRefId(fm.getId());
                mongoTemplate.save(field).block();
                LOGGER.info("Field Details Saved");

                List<FieldOption> fieldOption = field.getFieldOptions();
                fieldOption.forEach(option -> {
                    // saving Field Option goes here
                    Integer opt_seq_id = Integer.parseInt(sequencesCollection.getNextSequence(FIELDOPTION).block().getSeqValue());
                    LOGGER.info("Generated Sequence value Options Sequence:" + opt_seq_id);
                    option.setId(opt_seq_id);
                    option.setFieldRefId(field_seq_id);
                    mongoTemplate.save(option).log().block();
                    LOGGER.info("Field Option Details Saved");
                });
            });
            return mongoTemplate.save(fm).log();
        });
    });
}
First of all, in reactive programming it is not good to use .block(), because you turn non-blocking code into blocking code. If you want to take items from one stream and save them into 3 collections, you can do it like this.
There are many different ways to do this for performance purposes, but it depends on the amount of data.
Here is a sample using simple data and the concat operator, but there are also zip and merge; it depends on your needs.
public void run(String... args) throws Exception {
    Flux<Integer> dbData = Flux.range(0, 10);
    dbData.flatMap(integer -> Flux.concat(
            saveAllInFirstCollection(integer),
            saveAllInSecondCollection(integer),
            saveAllInThirdCollection(integer)))
        .subscribe();
}

Flux<Integer> saveAllInFirstCollection(Integer integer) {
    System.out.println(integer);
    // process and save in collection
    return Flux.just(integer);
}

Flux<Integer> saveAllInSecondCollection(Integer integer) {
    System.out.println(integer);
    // process and save in collection
    return Flux.just(integer);
}

Flux<Integer> saveAllInThirdCollection(Integer integer) {
    System.out.println(integer);
    // process and save in collection
    return Flux.just(integer);
}
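Applied to the code in the question, the same idea means replacing each inner block() with a flatMap. Here is a hedged sketch of just the FieldInfo part, assuming the question's sequencesCollection.getNextSequence(...) and a reactive mongoTemplate.save(...) that returns a Mono (this is only the pattern, not a full rewrite):
// Pattern only: fetch the sequence value and save reactively, without block().
private Mono<FieldInfo> saveFieldInfo(FieldInfo field, Integer masterId) {
    return sequencesCollection.getNextSequence(FIELDINFO)
        .map(seq -> Integer.parseInt(seq.getSeqValue()))
        .flatMap(fieldSeqId -> {
            field.setId(fieldSeqId);
            field.setMasterFieldRefId(masterId);
            return mongoTemplate.save(field); // still non-blocking
        });
}
The forEach loops would then become something like Flux.fromIterable(fm.getFieldInfo()).flatMap(field -> saveFieldInfo(field, fm.getId())), and similarly for the options, so the whole chain stays non-blocking.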

Iterate over Collected list in Java 8 GroupingBy

I have a List of objects, say List<Type1>, that I have grouped by type (using groupingBy).
Now I want to convert that Map<Integer, List<Type1>> into a list of Type2, which holds both the list and the id of that group.
class Type1 {
    int id;
    int type;
    String name;
}

class Type2 {
    int type;
    List<Type1> type1List;
}
This is what I have written to achieve this:
myCustomList
    .stream()
    .collect(groupingBy(Type1::getType))
    .entrySet()
    .stream()
    .map(type1Item -> new Type2() {
        {
            setType(type1Item.getKey());
            setType1List(type1Item.getValue());
        }
    })
    .collect(Collectors.toList());
This works perfectly, but I am trying to make the code even cleaner. Is there a way to avoid streaming all over again and to use some kind of flatMap to achieve this?
You can pass a finisher function to collectingAndThen to do the remaining work after the initial map has been built.
List<Type2> result = myCustomList.stream()
    .collect(Collectors.collectingAndThen(Collectors.groupingBy(Type1::getType),
        m -> m.entrySet().stream()
            .map(e -> new Type2(e.getKey(), e.getValue()))
            .collect(Collectors.toList())));
You should give Type2 a constructor of the form
Type2(int type, List<Type1> type1List) {
    this.type = type;
    this.type1List = type1List;
}
Then, you can write .map(type1Item -> new Type2(type1Item.getKey(), type1Item.getValue())) instead of
.map(type1Item -> new Type2() {
    {
        setType(type1Item.getKey());
        setType1List(type1Item.getValue());
    }
})
See also: What is Double Brace initialization in Java?
In short, this creates a memory leak, as it creates a subclass of Type2 that captures type1Item for its entire lifetime.
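To make that capture visible, here is a small standalone sketch (reusing the question's Type1/Type2 and assuming Type2 has the usual setters); the anonymous subclass carries a synthetic field that keeps the captured entry reachable for as long as the Type2 instance lives:
import java.util.AbstractMap;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class DoubleBraceCaptureDemo {
    public static void main(String[] args) {
        Map.Entry<Integer, List<Type1>> type1Item =
                new AbstractMap.SimpleEntry<>(1, Collections.<Type1>emptyList());
        Type2 t = new Type2() {{
            setType(type1Item.getKey());
            setType1List(type1Item.getValue());
        }};
        // The instance is an anonymous subclass, not Type2 itself...
        System.out.println(t.getClass());                             // e.g. DoubleBraceCaptureDemo$1
        // ...and it holds a synthetic field referencing the captured entry.
        System.out.println(t.getClass().getDeclaredFields().length);  // > 0
    }
}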
But you can perform the conversion as part of the downstream collector of the groupingBy. This implies that you have to make the toList explicit, to combine it via collectingAndThen with the subsequent mapping:
Collection<Type2> collect = myCustomList
    .stream()
    .collect(groupingBy(Type1::getType,
        collectingAndThen(toList(), l -> new Type2(l.get(0).getType(), l))))
    .values();
If you really need a List, you can use
List<Type2> collect = myCustomList
    .stream()
    .collect(collectingAndThen(groupingBy(Type1::getType,
            collectingAndThen(toList(), l -> new Type2(l.get(0).getType(), l))),
        m -> new ArrayList<>(m.values())));
You can do it as mentioned below (here type1 is the entry-set stream of the grouped map, and Type2 is assumed to have the (type, list) constructor shown above):
type1.map(type1Item -> new Type2(type1Item.getKey(), type1Item.getValue()))
     .collect(Collectors.toList());

Is there a Dataflow TransformBlock that receives two input arguments?

I have a delegate that takes two numbers and creates a System.Windows.Point from them:
(x, y) => new Point(x,y);
I want to learn how I can use TPL Dataflow, specifically TransformBlock, to do that.
I would have something like this:
ISourceBlock<double> Xsource;
ISourceBlock<double> Ysource;
ITargetBlock<Point> PointTarget;
// is there such a thing?
TransformBlock<double, double, Point> PointCreatorBlock;
// and also, how should I wire them together?
UPDATE:
Also, how can I assemble a network that joins more than two arguments? For example, let's say I have a method that receives eight arguments, each one coming from a different buffer. How can I create a block that knows when one instance of every argument is available, so that the object can be created?
I think what you're looking for is the JoinBlock. Currently there are two-input and three-input variants, each of which outputs a tuple. These can be combined to build up an eight-parameter result. Another approach would be to create a class that holds the parameters and use various blocks to process and construct that parameter class.
For the simple example of combining two ints for a point:
class MyClass {
    BufferBlock<int> Xsource;
    BufferBlock<int> Ysource;
    JoinBlock<int, int> pointValueSource;
    TransformBlock<Tuple<int, int>, Point> pointProducer;

    public MyClass() {
        CreatePipeline();
        LinkPipeline();
    }

    private void CreatePipeline() {
        Xsource = new BufferBlock<int>();
        Ysource = new BufferBlock<int>();
        pointValueSource = new JoinBlock<int, int>(new GroupingDataflowBlockOptions() {
            Greedy = false
        });
        pointProducer = new TransformBlock<Tuple<int, int>, Point>(
            (Func<Tuple<int, int>, Point>)ProducePoint,
            new ExecutionDataflowBlockOptions() { MaxDegreeOfParallelism = Environment.ProcessorCount });
    }

    private void LinkPipeline() {
        Xsource.LinkTo(pointValueSource.Target1, new DataflowLinkOptions() {
            PropagateCompletion = true
        });
        Ysource.LinkTo(pointValueSource.Target2, new DataflowLinkOptions() {
            PropagateCompletion = true
        });
        pointValueSource.LinkTo(pointProducer, new DataflowLinkOptions() {
            PropagateCompletion = true
        });
        //pointProducer.LinkTo(Next Step In processing)
    }

    private Point ProducePoint(Tuple<int, int> XandY) {
        return new Point(XandY.Item1, XandY.Item2);
    }
}
The JoinBlock will wait until it has data available on both of its input buffers before producing an output. Also note that if the X and Y values arrive out of order at the input buffers, care needs to be taken to re-sync them: the JoinBlock simply pairs the first X with the first Y it receives, and so on.
