kafka-streams split into branches and merge them back throws TopologyException: Invalid topology: Processor is already added - apache-kafka-streams

Trying to use the "new" (at least for me) split method/API to branch a stream into a few parts (instead of the deprecated branch method).
I have 3 conditions to split my stream. On every branch, I need to perform some transformation and finally merge all 3 back into one stream.
var split = mystream.split(Named.as("branch-"));
split
    .branch(
        (k, wrapper) -> condition1,
        Branched.withFunction(ks -> ks, "name1")
    ).branch(
        (k, wrapper) -> condition2,
        Branched.withFunction(ks -> ks, "name2")
    ).branch(
        (k, v) -> condition3,
        Branched.withFunction(ks -> ks, "name3")
    );
var branches = split.noDefaultBranch();
var merged = branches.get("branch-name1").mapValues(p -> convertToAnotherObj(p))
    .merge(branches.get("branch-name2").mapValues(p -> convertToAnotherObj(p)))
    .merge(branches.get("branch-name3").mapValues(p -> convertToAnotherObj(p)));
merged.to(kafkaProps.getOutputTopic(), Produced.with(Serdes.String(), expSerde));
So, as I mentioned, I get an exception:
TopologyException: Invalid topology: Processor "branch-" is already added.
I am looking into the JavaDoc of org.apache.kafka.streams.kstream.BranchedKStream and the example from there:
Map<String, KStream<String, String>> branches = source.split(Named.as("split-"))
    .branch((key, value) -> value == null, Branched.withFunction(s -> s.mapValues(v -> "NULL"), "null"))
    .defaultBranch(Branched.as("non-null"));
KStream<String, String> merged = branches.get("split-non-null").merge(branches.get("split-null"));
The question is: can I split a stream, perform some transformation on each branch, and then merge them all back together?
I would appreciate it if anyone could point out where my mistake is.

Related

Get the maximum value using stream for Map

I have a class called Test. This class has a method called getNumber which returns an int value.
public class Test {
    // ...
    public int getNumber() {
        return number;
    }
}
I also have a HashMap whose key is a Long and whose value is a Test object.
Map<Long, Test> map = new HashMap<Long, Test>();
I want to print the key along with the getNumber value of the entry whose getNumber is the maximum, using a single stream expression.
I can print the maximum number via the lines below:
final Comparator<Test> comp = (p1, p2) -> Integer.compare(p1.getNumber(), p2.getNumber());
map.entrySet().stream().map(m -> m.getValue())
.max(comp).ifPresent(d -> System.out.println(d.getNumber()));
However, my question is: how can I return the key of the maximum entry? Can I do it in one pass using a stream?
If I understood you correctly:
Entry<Long, Test> entry = map.entrySet()
.stream()
.max(Map.Entry.comparingByValue(Comparator.comparingInt(Test::getNumber)))
.get();
If you want to find the key-value pair corresponding to the maximum 'number' value in the Test instances, you can use Collections.max() combined with a comparator that compares the entries by this criterion.
import static java.util.Comparator.comparingInt;
...
Map.Entry<Long, Test> maxEntry =
    Collections.max(map.entrySet(), comparingInt(e -> e.getValue().getNumber()));
If you want to use the stream way, then remove the mapping (because it loses the key associated with the value), and provide the same comparator:
map.entrySet()
.stream()
.max(comparingInt(e -> e.getValue().getNumber()))
.ifPresent(System.out::println);
Note that there is a small difference between the two snippets: the first one will throw a NoSuchElementException if the provided map is empty.
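Putting this together, a minimal self-contained sketch (the `Test` class body and the sample data are invented here for illustration) might look like:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import static java.util.Comparator.comparingInt;

public class MaxEntryDemo {
    // minimal stand-in for the Test class from the question
    static class Test {
        private final int number;
        Test(int number) { this.number = number; }
        public int getNumber() { return number; }
    }

    // returns the entry whose value has the largest getNumber()
    static Map.Entry<Long, Test> maxByNumber(Map<Long, Test> map) {
        return Collections.max(map.entrySet(),
                comparingInt(e -> e.getValue().getNumber()));
    }

    public static void main(String[] args) {
        Map<Long, Test> map = new HashMap<>();
        map.put(1L, new Test(10));
        map.put(2L, new Test(30));
        map.put(3L, new Test(20));

        Map.Entry<Long, Test> max = maxByNumber(map);
        System.out.println(max.getKey() + " -> " + max.getValue().getNumber()); // 2 -> 30
    }
}
```

As noted above, the Collections.max variant throws NoSuchElementException on an empty map; the stream variant with ifPresent simply does nothing.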

Improving the Java 8 way of finding the most common words in "War and Peace"

I read this problem in Richard Bird's book: Find the top five most common words in War and Peace (or any other text for that matter).
Here's my current attempt:
public class WarAndPeace {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> wc =
            Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
                 .map(line -> line.replaceAll("\\p{Punct}", ""))
                 .flatMap(line -> Arrays.stream(line.split("\\s+")))
                 .filter(word -> word.matches("\\w+"))
                 .map(s -> s.toLowerCase())
                 .filter(s -> s.length() >= 2)
                 .collect(Collectors.toConcurrentMap(
                     w -> w, w -> 1, Integer::sum));

        wc.entrySet()
          .stream()
          .sorted((e1, e2) -> Integer.compare(e2.getValue(), e1.getValue()))
          .limit(5)
          .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
    }
}
This definitely looks interesting and runs reasonably fast. On my laptop it prints the following:
$> time java -server -Xmx10g -cp target/classes tmp.WarAndPeace
the: 34566
and: 22152
to: 16716
of: 14987
a: 10521
java -server -Xmx10g -cp target/classes tmp.WarAndPeace 1.86s user 0.13s system 274% cpu 0.724 total
It usually runs in under 2 seconds. Can you suggest further improvements to this from an expressiveness and a performance standpoint?
PS: If you are interested in the rich history of this problem, see here.
You're recompiling all the regexps on every line and on every word. Instead of .flatMap(line -> Arrays.stream(line.split("\\s+"))) write .flatMap(Pattern.compile("\\s+")::splitAsStream). The same for .filter(word -> word.matches("\\w+")): use .filter(Pattern.compile("^\\w+$").asPredicate()). The same goes for the map step that calls replaceAll.
Probably it's better to swap .map(s -> s.toLowerCase()) and .filter(s -> s.length() >= 2) in order not to call toLowerCase() for one-letter words.
You should not use Collectors.toConcurrentMap(w -> w, w -> 1, Integer::sum). First, your stream is not parallel, so you may easily replace toConcurrentMap with toMap. Second, it would probably be more efficient (though testing is necessary) to use Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)) as this would reduce boxing (but add a finisher step which will box all the values at once).
Instead of (e1, e2) -> Integer.compare(e2.getValue(), e1.getValue()) you may use a ready-made comparator: Map.Entry.comparingByValue(Comparator.reverseOrder()) (though this is probably a matter of taste).
To summarize:
Map<String, Integer> wc =
    Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
         .map(Pattern.compile("\\p{Punct}")::matcher)
         .map(matcher -> matcher.replaceAll(""))
         .flatMap(Pattern.compile("\\s+")::splitAsStream)
         .filter(Pattern.compile("^\\w+$").asPredicate())
         .filter(s -> s.length() >= 2)
         .map(s -> s.toLowerCase())
         .collect(Collectors.groupingBy(w -> w,
             Collectors.summingInt(w -> 1)));

wc.entrySet()
  .stream()
  .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
  .limit(5)
  .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
If you don't like method references (some people don't), you may store precompiled regexps in the variables instead.
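For example, a variant of the same pipeline with the patterns precompiled into variables could look like this (a sketch; the `countWords` helper and the sample lines are invented for illustration):

```java
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordCountPatterns {
    // compile each regex exactly once, up front
    private static final Pattern PUNCT  = Pattern.compile("\\p{Punct}");
    private static final Pattern SPACES = Pattern.compile("\\s+");
    private static final Pattern WORD   = Pattern.compile("^\\w+$");

    static Map<String, Integer> countWords(Stream<String> lines) {
        return lines
            .map(line -> PUNCT.matcher(line).replaceAll(""))
            .flatMap(SPACES::splitAsStream)
            .filter(WORD.asPredicate())
            .filter(s -> s.length() >= 2)
            .map(String::toLowerCase)
            .collect(Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)));
    }

    public static void main(String[] args) {
        Map<String, Integer> wc = countWords(Stream.of("the cat, the hat", "a cat"));
        System.out.println(wc); // counts: the=2, cat=2, hat=1 (map order unspecified)
    }
}
```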
You are performing several redundant and unnecessary operations.
You first replace all punctuation characters with empty strings, creating new strings, then you perform a split operation using space characters as the boundary. This even risks merging words which are separated by punctuation without spacing. You could fix that by replacing punctuation with spaces, but in the end you don't need that replacement at all, as you can change the split pattern to "punctuation or space". But there is more:
You are then filtering the split results, accepting only strings consisting solely of word characters. Since you have already removed all punctuation and spacing characters, this sorts out strings containing characters that are neither word, space, nor punctuation characters, and I'm not sure whether that is the intended logic. After all, if you are interested in words only, why not search for words only in the first place? Since Java 8 does not support streams of matches, we can instead direct it to split using non-word characters as the boundary.
Then you are doing a .map(s -> s.toLowerCase()).filter(s -> s.length() >= 2). Since for English texts the string length won't change when converting to lowercase, the filtering condition is not affected, so we can filter first, skipping the toLowerCase conversion for strings that are not accepted by the predicate: .filter(s -> s.length() >= 2).map(s -> s.toLowerCase()). The net benefit might be small, but it doesn't hurt.
Choosing the right Collector: Tagir already explained it. In principle, there's Collectors.counting(), which fits better than Collectors.summingInt(w -> 1), but unfortunately Oracle's current implementation is poor, as it is based on reduce, unboxing and reboxing Longs for every element.
Putting it all together, you’ll get:
Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
     .flatMap(Pattern.compile("\\W+")::splitAsStream)
     .filter(s -> s.length() >= 2)
     .map(String::toLowerCase)
     .collect(Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)))
     .entrySet()
     .stream()
     .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
     .limit(5)
     .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
As explained, don’t be surprised if the word counts are slightly higher than in your approach.

Java8 style for comparing arrays? (Streams and Math3)

I'm just beginning to learn Java 8 streams and Apache Commons Math3 at the same time, and I'm looking for missed opportunities to simplify my solution for comparing instances for equality. Consider this Math3 RealVector:
RealVector testArrayRealVector =
new ArrayRealVector(new double [] {1d, 2d, 3d});
and consider this member variable containing boxed doubles, plus this copy of it as an array list collection:
private final Double [] m_ADoubleArray = {13d, 14d, 15d};
private final Collection<Double> m_CollectionArrayList =
new ArrayList<>(Arrays.asList(m_ADoubleArray));
Here is my best shot at comparing these in a functional style in a JUnit class (full gist here), using protonpack from codepoetix because I couldn't find zip in the Streams library. This looks really baroque to my eyes, and I wonder whether I've missed ways to make it shorter, faster, simpler, or better, since I'm just beginning to learn this stuff and don't know much.
// Make a stream out of the RealVector:
DoubleStream testArrayRealVectorStream =
Arrays.stream(testArrayRealVector.toArray());
// Check the type of that Stream
assertTrue("java.util.stream.DoublePipeline$Head" ==
testArrayRealVectorStream.getClass().getTypeName());
// Use up the stream:
assertEquals(3, testArrayRealVectorStream.count());
// Old one is used up; make another:
testArrayRealVectorStream = Arrays.stream(testArrayRealVector.toArray());
// Make a new stream from the member-var arrayList;
// do arithmetic on the copy, leaving the original unmodified:
Stream<Double> collectionStream = getFreshMemberVarStream();
// Use up the stream:
assertEquals(3, collectionStream.count());
// Stream is now used up; make new one:
collectionStream = getFreshMemberVarStream();
// Doesn't seem to be any way to use zip on the real array vector
// without boxing it.
Stream<Double> arrayRealVectorStreamBoxed =
testArrayRealVectorStream.boxed();
assertTrue(zip(
collectionStream,
arrayRealVectorStreamBoxed,
(l, r) -> Math.abs(l - r) < DELTA)
.reduce(true, (a, b) -> a && b));
where
private Stream<Double> getFreshMemberVarStream() {
    return m_CollectionArrayList
        .stream()
        .map(x -> x - 12.0);
}
Again, here is a gist of my entire JUnit test class.
It seems you are trying to use Streams at all costs.
If I understand you correctly, you have
double[] array1=testArrayRealVector.toArray();
Double[] m_ADoubleArray = {13d, 14d, 15d};
as starting point. Then, the first thing you can do is to verify the lengths of these arrays:
assertTrue(array1.length==m_ADoubleArray.length);
assertEquals(3, array1.length);
There is no point in wrapping the arrays into a stream and calling count() and, of course, even less in wrapping an array into a collection to call stream().count() on it. Note that if your starting point is a Collection, calling size() will do as well.
Given that you already verified the length, you can simply do
IntStream.range(0, 3).forEach(ix->assertEquals(m_ADoubleArray[ix]-12, array1[ix], DELTA));
to compare the elements of the arrays.
or when you want to apply arithmetic as a function:
// keep the size check as above as the length won’t change
IntToDoubleFunction f=ix -> m_ADoubleArray[ix]-12;
IntStream.range(0, 3).forEach(ix -> assertEquals(f.applyAsDouble(ix), array1[ix], DELTA));
Note that you can also just create a new array using
double[] array2=Arrays.stream(m_ADoubleArray).mapToDouble(d -> d-12).toArray();
and compare the arrays similar to above:
IntStream.range(0, 3).forEach(ix -> assertEquals(array1[ix], array2[ix], DELTA));
or just using
assertArrayEquals(array1, array2, DELTA);
as now both arrays have the same type.
Don’t think about that temporary three element array holding the intermediate result. All other attempts consume far more memory…
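A stripped-down, plain-Java sketch of the same approach (the array contents and the `closeEnough` helper are invented for illustration; inside JUnit, assertArrayEquals would replace the helper):

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ArrayCompareDemo {
    static final double DELTA = 1e-9;

    // element-wise comparison of two same-type arrays within a tolerance
    static boolean closeEnough(double[] a, double[] b, double delta) {
        if (a.length != b.length) return false;
        return IntStream.range(0, a.length)
                        .allMatch(ix -> Math.abs(a[ix] - b[ix]) < delta);
    }

    public static void main(String[] args) {
        double[] array1 = {1d, 2d, 3d};            // stands in for testArrayRealVector.toArray()
        Double[] m_ADoubleArray = {13d, 14d, 15d};

        // apply the arithmetic once, producing a plain double[] of the same type
        double[] array2 = Arrays.stream(m_ADoubleArray)
                                .mapToDouble(d -> d - 12)
                                .toArray();

        System.out.println(closeEnough(array1, array2, DELTA)); // true
    }
}
```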

How to retainAll of List of Lists using stream reduce

I faced the following problem: I have a list of lists which I simply want to retainAll. I'm trying to do it with streams:
private List<List<Long>> ids = new ArrayList<List<Long>>();
// some ids.add(otherLists);
List<Long> reduce = ids.stream().reduce(ids.get(0), (a, b) -> a.addAll(b));
unfortunately I got the error
Error:(72, 67) java: incompatible types: bad return type in lambda expression
boolean cannot be converted to java.util.List<java.lang.Long>
If you want to reduce (I think you mean flatten by that) the list of lists, you should do it like this:
import static java.util.stream.Collectors.toList;
...
List<Long> reduce = ids.stream().flatMap(List::stream).collect(toList());
Using reduce, the first value should be the identity value, which is not the case in your implementation; your solution will also produce unexpected results when running the stream in parallel (because addAll modifies the list in place, and in that case the identity value would be the same list shared across partial results).
You'd need to copy the content of the partial result list and add the other list to it to make it work when the pipeline is run in parallel:
List<Long> reduce = ids.parallelStream().reduce(new ArrayList<>(), (a, b) -> {
    List<Long> list = new ArrayList<Long>(a);
    list.addAll(b);
    return list;
});
addAll returns a boolean, not the union of the two lists. You want
List<Long> reduce = ids.stream().reduce(ids.get(0), (a, b) -> {
    a.addAll(b);
    return a;
});

java 8 stream interference versus non-interference

I understand why the following code is ok. Because the collection is being modified before calling the terminal operation.
List<String> wordList = ...;
Stream<String> words = wordList.stream();
wordList.add("END"); // Ok
long n = words.distinct().count();
But why is this code is not ok?
Stream<String> words = wordList.stream();
words.forEach(s -> { if (s.length() < 12) wordList.remove(s); }); // Error: interference
Stream.forEach() is a terminal operation, and the underlying wordList collection is modified after the terminal operation has started.
Joachim's answer is correct, +1.
You didn't ask specifically, but for the benefit of other readers, here are a couple of techniques for rewriting the program a different way, avoiding stream interference problems.
If you want to mutate the list in-place, you can do so with a new default method on List instead of using streams:
wordList.removeIf(s -> s.length() < 12);
If you want to leave the original list intact but create a modified copy, you can use a stream and a collector to do that:
List<String> newList = wordList.stream()
    .filter(s -> s.length() >= 12)
    .collect(Collectors.toList());
Note that I had to invert the sense of the condition, since filter takes a predicate that keeps values in the stream if the condition is true.
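A small self-contained sketch showing both variants side by side (the sample data and the `longWordsCopy` helper are invented for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class RemoveShortWordsDemo {
    // copy-based variant: the original list stays intact
    static List<String> longWordsCopy(List<String> words) {
        return words.stream()
                    .filter(s -> s.length() >= 12)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> wordList = new ArrayList<>(Arrays.asList("short", "averylongwordhere"));

        System.out.println(longWordsCopy(wordList)); // [averylongwordhere]
        System.out.println(wordList);                // unchanged: [short, averylongwordhere]

        // in-place variant: mutates wordList directly, no stream involved
        wordList.removeIf(s -> s.length() < 12);
        System.out.println(wordList);                // [averylongwordhere]
    }
}
```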
