Caching Java 8 stream

Suppose I have a list which I perform multiple stream operations on.
bobs = myList.stream()
        .filter(person -> person.getName().equals("Bob"))
        .collect(Collectors.toList());
...
and
tonies = myList.stream()
        .filter(person -> person.getName().equals("tony"))
        .collect(Collectors.toList());
Can I not just do:
Stream<Person> stream = myList.stream();
which then means I can do:
bobs = stream.filter(person -> person.getName().equals("Bob"))
        .collect(Collectors.toList());
tonies = stream.filter(person -> person.getName().equals("tony"))
        .collect(Collectors.toList());

No, you can't. A stream can only be used once; it will throw the error below when you try to reuse it:
java.lang.IllegalStateException: stream has already been operated upon or closed
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:229)
As per Java Docs:
A stream should be operated on (invoking an intermediate or terminal stream operation) only once.
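A minimal sketch that reproduces the error; the second use of the same stream is what throws:
Stream<Person> stream = myList.stream();
bobs = stream.filter(person -> person.getName().equals("Bob"))
        .collect(Collectors.toList()); // fine: first and only use of the pipeline
tonies = stream.filter(person -> person.getName().equals("tony"))
        .collect(Collectors.toList()); // throws IllegalStateException: stream has already been operated upon or closed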
But a neat solution to your problem is to use a stream Supplier. It looks like this:
Supplier<Stream<Person>> streamSupplier = myList::stream;
bobs = streamSupplier.get().filter(person -> person.getName().equals("Bob"))
        .collect(Collectors.toList());
tonies = streamSupplier.get().filter(person -> person.getName().equals("tony"))
        .collect(Collectors.toList());
But again, every get() call will return a new stream.

No, you can't. The docs say:
A stream should be operated on (invoking an intermediate or terminal
stream operation) only once.
But you can use a single stream by filtering once for all the elements you want and then grouping them the way you need:
Set<String> names = ...; // construct a set containing "bob", "tony", etc.
Map<String, List<Person>> r = myList.stream()
        .filter(p -> names.contains(p.getName()))
        .collect(Collectors.groupingBy(Person::getName));
List<Person> tonies = r.get("tony");
List<Person> bobs = r.get("bob");

Well, what you can do in your case is generate dynamic stream pipelines, assuming that the only variable in your pipeline is the name of the person you filter by.
We can represent this as a Function<String, Stream<Person>>, as in the following:
final Function<String, Stream<Person>> pipelineGenerator =
        name -> persons.stream().filter(person -> Objects.equals(person.getName(), name));
final List<Person> bobs = pipelineGenerator.apply("bob").collect(Collectors.toList());
final List<Person> tonies = pipelineGenerator.apply("tony").collect(Collectors.toList());

As already mentioned, a given stream should be operated upon only once.
I can understand the "idea" of caching a reference to an object if you're going to refer to it more than once, or simply to avoid creating more objects than necessary.
However, you should not be concerned about invoking myList.stream() every time you need to run another query: creating a stream is, in general, a cheap operation.
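So, unless profiling says otherwise, simply calling stream() again for each query is the idiomatic pattern:
List<Person> bobs = myList.stream()
        .filter(person -> person.getName().equals("Bob"))
        .collect(Collectors.toList());
List<Person> tonies = myList.stream()
        .filter(person -> person.getName().equals("tony"))
        .collect(Collectors.toList());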

Related

Should TimeWindows be the same when joining two KTables derived from TimeWindows

I use two different retention times for two different KTables, and it works with RocksDB state stores and changelog Kafka topics.
Each KTable is generated from a KStream with groupByKey and then windowedBy.
I had assumed that, when joining windowed streams, the TimeWindows have to be the same. I'm wondering: is there any benefit or drawback if the TimeWindows parameters are different when joining two KTables windowed by TimeWindows?
code snippet:
final KStream<Integer, String> eventStream = builder.stream("events",
        Consumed.with(Serdes.Integer(), Serdes.String())
                .withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST));
final KTable<Windowed<Integer>, String> eventWindowTable = eventStream.groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofSeconds(60)).until(Duration.ofSeconds(100).toMillis()))
        .reduce((oldValue, newValue) -> newValue);
final KStream<Integer, String> clickStream = builder.stream("clicks",
        Consumed.with(Serdes.Integer(), Serdes.String())
                .withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST));
final KTable<Windowed<Integer>, String> clickWindowTable = clickStream.groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofSeconds(30)).until(Duration.ofSeconds(70).toMillis()))
        .reduce((oldValue, newValue) -> newValue);
final KTable<Windowed<Integer>, String> join = eventWindowTable.leftJoin(clickWindowTable,
        (event, click) -> event + " ; " + click + " ; " + Instant.now());
Initially I thought joining two KTables with different TimeWindows parameters would not work, because the join relies on the windowed key, i.e., a key for the time slot. But after testing, it works anyway.
The join is executed because the type of both keys is the same: Windowed<Integer>. The join will of course only produce a result if the keys are the same. Assume you have the following windows (note that only the window start timestamp is stored for TimeWindows):
eventWindowTable: <A,0> <A,60>
clickWindowTable: <A,0> <A,30> <A,60> <A,90>
For this case, only <A,0> and <A,60> would join. Hence, having different windows does impact your result: the window start timestamp is part of the key, so some windows will never join (e.g., <A,30> and <A,90> in our example).
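If you need every window to participate in the join, one option (a sketch based on the snippet above, not a tested topology) is to share a single TimeWindows spec between both tables, so the window start timestamps, and therefore the keys, line up:
final TimeWindows sharedWindows = TimeWindows.of(Duration.ofSeconds(60))
        .until(Duration.ofSeconds(100).toMillis());
final KTable<Windowed<Integer>, String> eventWindowTable = eventStream.groupByKey()
        .windowedBy(sharedWindows) // same window spec for both tables
        .reduce((oldValue, newValue) -> newValue);
final KTable<Windowed<Integer>, String> clickWindowTable = clickStream.groupByKey()
        .windowedBy(sharedWindows)
        .reduce((oldValue, newValue) -> newValue);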

Is it possible for a Kafka Streams application to write multiple outputs from a single input?

I'm unsure if kafka-streams is the correct solution for a problem I'm trying to solve. I'd like to be able to use it because of the parallelism and fault tolerance it provides, but I'm struggling to come up with a way to achieve a desired processing pipeline.
The pipeline is something like this:
A record of some type arrives on an input topic
Information in this record is used to perform a database query, which returns many results
I'd like to be able to write out each result as an individual record, with its own key, rather than as a collection of results in a single record.
Ignoring the single output record per result requirement for a moment, I have code that looks like this:
Serde<String> stringSerde = Serdes.String();
JsonSerde<MyInput> inputSerde = new JsonSerde<>();
JsonSerde<List<MyOutput>> outputSerde = new JsonSerde<>();
Consumed<String, MyInput> consumer = Consumed.with(stringSerde, inputSerde);
KStream<String, MyInput> receiver = builder.stream("input-topic", consumer);
KStream<String, List<MyOutput>> outputs = receiver.mapValues(this::mapInputToManyOutputs);
outputs.to("output-topic", Produced.with(stringSerde, outputSerde));
This is simple enough: one message in, one message (albeit a collection) out.
What I'd like to be able to do is something like:
Serde<String> stringSerde = Serdes.String();
JsonSerde<MyInput> inputSerde = new JsonSerde<>();
JsonSerde<MyOutput> outputSerde = new JsonSerde<>();
Consumed<String, MyInput> consumer = Consumed.with(stringSerde, inputSerde);
KStream<String, MyInput> receiver = builder.stream("input-topic", consumer);
KStream<String, List<MyOutput>> outputs = receiver.mapValues(this::mapInputToManyOutputs);
KStream<String, MyOutput> sink = outputs.???
sink.to("output-topic", Produced.with(stringSerde, outputSerde));
I cannot come up with anything sensible for an operation or operations to perform on the outputs stream.
Any suggestions? Or is kafka-streams maybe not the right solution to a problem like this?
Yes, it's possible; for that you need to use the KStream flatMap transformation. flatMap transforms each record of the input stream into zero or more records in the output stream (both key and value types can be altered arbitrarily):
kStream = kStream.flatMap(
        (key, value) -> {
            List<KeyValue<String, MyOutput>> result = new ArrayList<>();
            // do your logic here
            return result;
        });
kStream.to("output-topic", Produced.with(stringSerde, outputSerde));
Thanks, Vasiliy, flatMap was indeed what I needed. I looked at it earlier, thought it was the right operation but then got confused and mistakenly discarded it.
Combining what I had before with your suggestion, the following works, assuming MyOutput implements a method called getKey():
Serde<String> stringSerde = Serdes.String();
JsonSerde<MyInput> inputSerde = new JsonSerde<>();
JsonSerde<MyOutput> outputSerde = new JsonSerde<>();
Consumed<String, MyInput> consumer = Consumed.with(stringSerde, inputSerde);
KStream<String, MyInput> receiver = builder.stream("input-topic", consumer);
KStream<String, List<MyOutput>> outputs = receiver.mapValues(this::mapInputToManyOutputs);
KStream<String, MyOutput> sink = outputs.flatMap((key, value) ->
        value.stream().map(o -> new KeyValue<>(o.getKey(), o)).collect(Collectors.toList()));
sink.to("output-topic", Produced.with(stringSerde, outputSerde));
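As a side note, the intermediate KStream<String, List<MyOutput>> can be avoided by folding the mapValues step into the flatMap itself. A sketch under the same assumptions (mapInputToManyOutputs returns a List<MyOutput> and MyOutput.getKey() returns a String):
KStream<String, MyOutput> sink = receiver.flatMap((key, input) ->
        mapInputToManyOutputs(input).stream()            // expand one input into many outputs
                .map(o -> new KeyValue<>(o.getKey(), o)) // re-key each output individually
                .collect(Collectors.toList()));
sink.to("output-topic", Produced.with(stringSerde, outputSerde));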

Java 8: not seeing the desired output when using map with filter

I see a different result when using map with filter than when using forEach with filter:
public class test1
{
    public static void main(String[] args) throws java.lang.Exception
    {
        Map<String, String> map = new HashMap<>();
        map.put("a", "a");
        map.put("c", "a");
        Set<String> vs = new HashSet<>();
        vs.add("b");
        vs.add("c");
        List<String> list = new ArrayList<>();
        vs.stream()
          .filter(a -> map.containsKey(a))
          .map(c -> list.add(c));
        System.out.println("here " + list.size());
        vs.stream()
          .filter(a -> map.containsKey(a))
          .forEach(c -> list.add(c));
        System.out.println("here " + list.size());
    }
}
here is the output:
here 0
here 1
Can somebody explain?
Terminal operations produce a non-stream result, such as a primitive value, a collection, or no value at all (e.g. forEach()). They are typically preceded by intermediate operations.
Intermediate operations return another Stream, which lets you chain multiple operations in the form of a query (e.g. map()). Intermediate operations do not execute until a terminal operation is invoked, since there is a possibility they can be processed together when a terminal operation executes.
In the following code, you never invoke a terminal operation such as forEach() or collect(). That's why c -> list.add(c) isn't executed, and neither is .filter(a -> map.containsKey(a)):
vs.stream()
  .filter(a -> map.containsKey(a))
  .map(c -> list.add(c));
Examine the result after using the following snippet instead of the one above:
vs.stream()
  .filter(a -> map.containsKey(a)) // intermediate
  .map(t -> list.add(t))           // intermediate
  .collect(Collectors.toList());   // terminal
Your first stream needs a terminal operation; without one, the map stage is never executed. The second stream has a terminal operation (forEach), so the whole pipeline is executed. Add a terminal operation such as count() to the first stream and you will see:
here 1
here 2
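For example, a minimal change to the first pipeline:
vs.stream()
  .filter(a -> map.containsKey(a))
  .map(c -> list.add(c))
  .count(); // terminal operation: now the map stage runs and the first print shows "here 1"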

Fastest way to convert key-value pairs to a map of objects grouped by key using Java 8 streams

Model:
public class AgencyMapping {
    private Integer agencyId;
    private String scoreKey;
}
public class AgencyInfo {
    private Integer agencyId;
    private Set<String> scoreKeys;
}
My code:
List<AgencyMapping> agencyMappings;
Map<Integer, AgencyInfo> agencyInfoByAgencyId = agencyMappings.stream()
        .collect(groupingBy(AgencyMapping::getAgencyId,
                collectingAndThen(toSet(), e -> e.stream().map(AgencyMapping::getScoreKey).collect(toSet()))))
        .entrySet().stream()
        .map(e -> new AgencyInfo(e.getKey(), e.getValue()))
        .collect(Collectors.toMap(AgencyInfo::getAgencyId, identity()));
Is there a way to get the same result with simpler and faster code?
You can replace the call to collectingAndThen(toSet(), e -> e.stream().map(AgencyMapping::getScoreKey).collect(toSet())) with mapping(AgencyMapping::getScoreKey, toSet()):
Map<Integer, AgencyInfo> resultSet = agencyMappings.stream()
.collect(groupingBy(AgencyMapping::getAgencyId,
mapping(AgencyMapping::getScoreKey, toSet())))
.entrySet()
.stream()
.map(e -> new AgencyInfo(e.getKey(), e.getValue()))
.collect(toMap(AgencyInfo::getAgencyId, identity()));
A different way to see it using a toMap collector:
Map<Integer, AgencyInfo> resultSet = agencyMappings.stream()
.collect(toMap(AgencyMapping::getAgencyId, // key extractor
e -> new HashSet<>(singleton(e.getScoreKey())), // value extractor
(left, right) -> { // a merge function, used to resolve collisions between values associated with the same key
left.addAll(right);
return left;
}))
.entrySet()
.stream()
.map(e -> new AgencyInfo(e.getKey(), e.getValue()))
.collect(toMap(AgencyInfo::getAgencyId, identity()));
The latter example is arguably more complicated than the former. Nevertheless, your approach is pretty much the way to go, apart from using mapping as opposed to collectingAndThen, as mentioned above.
Apart from that, I don't see anything else you can simplify in the code shown.
As for faster code: if your current approach is too slow for you, you may want to read the answers here that discuss when you should consider going parallel.
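For reference, going parallel is a one-word change at the source of the pipeline, though it typically only pays off for large inputs and expensive per-element work. A sketch reusing the grouping step from above (the variable name is illustrative):
Map<Integer, Set<String>> scoreKeysByAgencyId = agencyMappings.parallelStream() // parallelStream() instead of stream()
        .collect(groupingBy(AgencyMapping::getAgencyId,
                mapping(AgencyMapping::getScoreKey, toSet())));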
You are collecting to an intermediate map, then streaming the entries of this map to create AgencyInfo instances, which are finally collected to another map.
Instead of all this, you could use Collectors.toMap to collect directly to a map, mapping each AgencyMapping object to the desired AgencyInfo and merging the scoreKeys as needed:
Map<Integer, AgencyInfo> agencyInfoByAgencyId = agencyMappings.stream()
.collect(Collectors.toMap(
AgencyMapping::getAgencyId,
mapping -> new AgencyInfo(
mapping.getAgencyId(),
new HashSet<>(Set.of(mapping.getScoreKey()))),
(left, right) -> {
left.getScoreKeys().addAll(right.getScoreKeys());
return left;
}));
This works by grouping the AgencyMapping elements of the stream by AgencyMapping::getAgencyId, but storing AgencyInfo objects in the map instead. We get these AgencyInfo instances by manually mapping each original AgencyMapping object. Finally, AgencyInfo instances already in the map are merged by a merge function that folds the scoreKeys of one AgencyInfo into the other.
I'm using Java 9's Set.of to create a singleton set. If you don't have Java 9, you can replace it with Collections.singleton.
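On Java 8, that means only the value extractor in the snippet above changes:
mapping -> new AgencyInfo(
        mapping.getAgencyId(),
        new HashSet<>(Collections.singleton(mapping.getScoreKey()))), // mutable set seeded with one key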

Replacing Inner for loop with Java Stream

I am learning Java streams and want to replace the code below with Java 8 features.
I was able to use stream.filter() and stream.map(), but I could not replace the code below with Java 8 features.
List<Subject> subjects = null;
Set<SubjectData> subjectData = new HashSet<>();
for (String name : studentNames)
{
    //subjects = student.getSubjects(name);
    // assume the line above returns a collection of Subject
    for (Subject subject : subjects)
    {
        subjectData.add(new SubjectData(subject.syllabus(), subject.code()));
    }
}
Any pointers would be appreciated.
I imagine something like this is what you intend:
Set<SubjectData> subjectData = studentNames.stream()
.flatMap(name -> student.getSubjects(name).stream())
.map(subject -> new SubjectData(subject.syllabus(), subject.code()))
.collect(Collectors.toSet());
This streams the student names, maps them to their subjects while concatenating those streams, and then creates SubjectData objects for each. Lastly, those objects are collected into a set.
