There is no way to create a reference to stream & it’s not possible to reuse the same stream multiple times - java-8

While reading an article about Java 8 streams, I found:
Java Streams are consumable, so there is no way to create a reference
to stream for future usage. Since the data is on-demand, it’s not
possible to reuse the same stream multiple times.
Yet at the same time, the same article shows:
//sequential stream
Stream<Integer> sequentialStream = myList.stream();
//parallel stream
Stream<Integer> parallelStream = myList.parallelStream();
What does "there is no way to create a reference to a stream for future usage" mean? Aren't sequentialStream and parallelStream references to streams?
And what does "it's not possible to reuse the same stream multiple times" mean?

What it means is that every time you need to operate on a stream, you must make a new one.
So you cannot, for example, have something like:
class Person {
    private Stream<String> phoneNumbers;

    Stream<String> getPhoneNumbers() {
        return phoneNumbers;
    }
}
and just reuse that one stream whenever you like. Instead, you must have something like
class Person {
    private List<String> phoneNumbers;

    Stream<String> getPhoneNumbers() {
        return phoneNumbers.stream(); // make a NEW stream over the same data
    }
}
The code snippet you included does just that: it makes two different streams over the same data.
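For completeness, here is a minimal sketch of what happens if you do try to reuse one stream instance (the list contents are made up for illustration):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class ReuseDemo {
    public static void main(String[] args) {
        List<Integer> myList = Arrays.asList(1, 2, 3);
        Stream<Integer> stream = myList.stream();

        System.out.println(stream.count()); // first terminal operation: fine, prints 3

        // Reusing the SAME stream instance throws
        // java.lang.IllegalStateException: stream has already been operated upon or closed
        System.out.println(stream.count());
    }
}

Calling myList.stream() again instead of reusing stream would work, because it builds a new stream over the same list.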

Related

How do I combine a Kafka Processor with a Kafka Streams application?

I am trying to construct a Kafka Streams operator that takes in a KStream<timestamp, data> and outputs another KStream where the timestamps are sorted in ascending order; the purpose is to deal with streams that have "out of order" entries due to delays in the supplier.
At first, I thought about doing this with time-windowed aggregation, but then I happened upon a solution using a Kafka Processor. I figured I could then say something like:
class SortProcessor implements Processor<timestamp,data> ...
class SortProcessorSupplier ...supplies suitably initialized SortProcessor
KStream<timestamp,data> input_stream = ...sourced from "input_topic"
KStream<timestamp,data> output_stream =
    input_stream.process( new SortProcessorSupplier(...parameters...) );
However, this doesn't work because KStream.process returns void.
So, my question is: How do I "wrap" the Processor so that I can use it as follows:
KStream<timestamp,data> input_stream = ...sourced from "input_topic"
KStream<timestamp,data> output_stream =
    new WrappedSortProcessor( input_stream, ...parameters... )
Instead of a Processor you can use a Transformer, which is very similar to a Processor but is better suited to forwarding results on to the stream. You can then invoke it from the stream using the KStream.transform() method instead of process().
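A minimal sketch of what that might look like, assuming the Transformer API as it exists in classic Kafka Streams versions; the key/value types (Long/String), the "sort-store" state store, and the buffering logic itself are placeholders, not part of the original question:

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

// Hypothetical transformer that buffers records and re-emits them in timestamp order.
class SortTransformer implements Transformer<Long, String, KeyValue<Long, String>> {
    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        // e.g. fetch a state store here and schedule a punctuation that
        // periodically flushes buffered records via context.forward(key, value)
    }

    @Override
    public KeyValue<Long, String> transform(Long timestamp, String value) {
        // buffer the record in the state store; returning null emits nothing now
        return null;
    }

    @Override
    public void close() { }
}

// In the topology-building code: unlike process(), transform() returns
// a KStream that you can keep composing downstream.
// KStream<Long, String> outputStream =
//         inputStream.transform(SortTransformer::new, "sort-store");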

Java 8 JPA Repository Stream produce two (or more) results?

I have a Java 8 stream being returned by a Spring Data JPA repository. I don't think my use case is all that unusual: there are two (actually three, in my case) collections I would like to gather from the resulting stream.
Set<Long> ids = // initialized
try (Stream<SomeDatabaseEntity> someDatabaseEntityStream =
         someDatabaseEntityRepository.findSomeDatabaseEntitiesStream(ids)) {
    Set<Long> theAlphaComponentIds = someDatabaseEntityStream
            .map(v -> v.getAlphaComponentId())
            .collect(Collectors.toSet());
    // operations on 'theAlphaComponentIds' here
}
I need to pull out the 'Beta' objects and do some work on those too. So I think I have to repeat the code, which seems completely wrong:
try (Stream<SomeDatabaseEntity> someDatabaseEntityStream =
         someDatabaseEntityRepository.findSomeDatabaseEntitiesStream(ids)) {
    Set<BetaComponent> theBetaComponents = someDatabaseEntityStream
            .map(v -> v.getBetaComponent())
            .collect(Collectors.toSet());
    // operations on 'theBetaComponents' here
}
These two code blocks occur serially in the processing. Is there a clean way to get both Sets from processing the Stream only once? Note: I do not want some kludgy solution that makes up a wrapper class for the Alphas and Betas, as they don't really belong together.
You can always refactor code by putting the common parts into a method and turning the uncommon parts into parameters. E.g.
public <T> Set<T> getAll(Set<Long> ids, Function<SomeDatabaseEntity, T> f)
{
    try(Stream<SomeDatabaseEntity> someDatabaseEntityStream =
            someDatabaseEntityRepository.findSomeDatabaseEntitiesStream(ids)) {
        return someDatabaseEntityStream.map(f).collect(Collectors.toSet());
    }
}
usable via
Set<Long> theAlphaComponentIds = getAll(ids, v -> v.getAlphaComponentId());
// operations on 'theAlphaComponentIds' here
and
Set<BetaComponent> theBetaComponents = getAll(ids, v -> v.getBetaComponent());
// operations on 'theBetaComponents' here
Note that this pulls the “operations on … here” parts out of the try block, which is a good thing, as it implies that the associated resources are released earlier. This requires that BetaComponent can be processed independently of the Stream’s underlying resources (otherwise, you shouldn’t collect it into a Set anyway). For the Longs, we know for sure that they can be processed independently.
Of course, you could process the result outside the try block even without moving the common code into a method. Whether the original code bears a duplication that requires this refactoring is debatable. Actually, the operation consists of a single statement within a try block that looks big only due to the verbose identifiers. Ask yourself whether you would still deem the refactoring necessary if the code looked like
Set<Long> alphaIDs, ids = // initialized
try(Stream<SomeDatabaseEntity> s = repo.findSomeDatabaseEntitiesStream(ids)) {
    alphaIDs = s.map(v -> v.getAlphaComponentId()).collect(Collectors.toSet());
}
// operations on 'alphaIDs' here
Well, different developers may come to different conclusions…
If you want to reduce the number of repository queries, you can simply store the result of the query:
List<SomeDatabaseEntity> entities;
try(Stream<SomeDatabaseEntity> someDatabaseEntityStream =
        someDatabaseEntityRepository.findSomeDatabaseEntitiesStream(ids)) {
    entities = someDatabaseEntityStream.collect(Collectors.toList());
}
Set<Long> theAlphaComponentIds = entities.stream()
        .map(v -> v.getAlphaComponentId()).collect(Collectors.toSet());
// operations on 'theAlphaComponentIds' here
Set<BetaComponent> theBetaComponents = entities.stream()
        .map(v -> v.getBetaComponent()).collect(Collectors.toSet());
// operations on 'theBetaComponents' here
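And if you want both sets without materializing the intermediate List at all, a plain single pass over the stream can fill both at once; a minimal sketch (no wrapper class needed):

Set<Long> theAlphaComponentIds = new HashSet<>();
Set<BetaComponent> theBetaComponents = new HashSet<>();
try(Stream<SomeDatabaseEntity> someDatabaseEntityStream =
        someDatabaseEntityRepository.findSomeDatabaseEntitiesStream(ids)) {
    // one traversal, two accumulations
    someDatabaseEntityStream.forEach(v -> {
        theAlphaComponentIds.add(v.getAlphaComponentId());
        theBetaComponents.add(v.getBetaComponent());
    });
}
// operations on 'theAlphaComponentIds' and 'theBetaComponents' here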

Is it possible to know the size of a stream without using a terminal operation?

I have three interfaces:
public interface IGhOrg {
    int getId();
    String getLogin();
    String getName();
    String getLocation();
    Stream<IGhRepo> getRepos();
}

public interface IGhRepo {
    int getId();
    int getSize();
    int getWatchersCount();
    String getLanguage();
    Stream<IGhUser> getContributors();
}

public interface IGhUser {
    int getId();
    String getLogin();
    String getName();
    String getCompany();
    Stream<IGhOrg> getOrgs();
}
and I need to implement Optional<IGhRepo> highestContributors(Stream<IGhOrg> organizations). This method returns the IGhRepo with the most contributors (per getContributors()).
I tried this
Optional<IGhRepo> highestContributors(Stream<IGhOrg> organizations){
    return organizations
            .flatMap(IGhOrg::getRepos)
            .max((repo1, repo2) -> (int) repo1.getContributors().count()
                    - (int) repo2.getContributors().count());
}
but it gives me the
java.lang.IllegalStateException: stream has already been operated upon or closed
I understand that count() is a terminal operation in Stream but I can't solve this problem, please help!
thanks
Is it possible to know the size of a stream without using a terminal operation?
No, it's not, because streams can be infinite or can generate output on demand; they aren't necessarily backed by collections.
but it gives me the
java.lang.IllegalStateException: stream has already been operated upon or closed
That's because you are returning the same stream instance on each method invocation. You should return a new Stream instead.
I understand that count() is a terminal operation in Stream but I can't solve this problem, please help!
IMHO you are misusing streams here. Performance- and simplicity-wise, it's much better to return some Collection<XXX> instead of Stream<XXX>.
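To illustrate why the size cannot be known up front, here is a small example of an unbounded stream (nothing project-specific assumed):

// An infinite stream: asking for its size would never return.
Stream<Integer> naturals = Stream.iterate(0, n -> n + 1);

// Only a short-circuiting operation can safely consume it:
naturals.limit(5).forEach(System.out::println); // prints 0 through 4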
No, it is not possible to know the size of a stream in Java.
As mentioned in the Java 8 Stream docs:
No storage. A stream is not a data structure that stores elements;
instead, it conveys elements from a source such as a data structure,
an array, a generator function, or an I/O channel, through a pipeline
of computational operations.
You don't specify this, but it looks like some or possibly all of the interface methods that return Stream<...> values don't return a fresh stream each time they are called.
This seems problematic to me from an API point of view, as it means each of these streams, and thus a fair chunk of the object's functionality, can be used at most once.
You may be able to solve the particular problem you are having by ensuring that the stream from each object is used only once in the method, something like this:
Optional<IGhRepo> highestContributors(Stream<IGhOrg> organizations) {
    return organizations
            .flatMap(IGhOrg::getRepos)
            .distinct()
            .map(repo -> new AbstractMap.SimpleEntry<>(repo, repo.getContributors().count()))
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey);
}
Unfortunately it looks like you will now be stuck if you want to (for example) print a list of the contributors, as the stream returned from getContributors() for the returned IGhRepo has already been consumed.
You might want to consider having your implementation objects return a fresh stream each time a stream returning method is called.
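A minimal sketch of that last suggestion; the GhRepo class and its fields are hypothetical, and the point is only that getContributors() builds a new stream on every call:

class GhRepo implements IGhRepo {
    private final int id;
    private final int size;
    private final int watchersCount;
    private final String language;
    private final List<IGhUser> contributors; // the backing data lives here, not in a stream

    GhRepo(int id, int size, int watchersCount, String language, List<IGhUser> contributors) {
        this.id = id;
        this.size = size;
        this.watchersCount = watchersCount;
        this.language = language;
        this.contributors = contributors;
    }

    @Override public int getId() { return id; }
    @Override public int getSize() { return size; }
    @Override public int getWatchersCount() { return watchersCount; }
    @Override public String getLanguage() { return language; }

    @Override
    public Stream<IGhUser> getContributors() {
        return contributors.stream(); // a NEW stream over the same data on each call
    }
}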
You could keep a counter that is incremented per "iteration" using peek. In the example below, the counter is incremented before every item is processed with doSomeLogic:
final AtomicInteger counter = new AtomicInteger(); // java.util.concurrent.atomic
getStream().peek(item -> counter.incrementAndGet()).forEach(this::doSomeLogic);

Java Stream BufferedReader file stream

I am using Java 8 streams to create a stream from a CSV file. I am using BufferedReader.lines(); I read the docs for BufferedReader.lines():
After execution of the terminal stream operation there are no guarantees that the reader will be at a specific position from which to read the next character or line.
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.Reader;

public class Streamy {
    public static void main(String args[]) {
        Reader reader = null;
        BufferedReader breader = null;
        try {
            reader = new FileReader("refined.csv");
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        breader = new BufferedReader(reader);

        long l1 = breader.lines().count();
        System.out.println("Line Count " + l1); // this works correctly

        long l2 = breader.lines().count();
        System.out.println("Line Count " + l2); // this gives 0
    }
}
It looks like after reading the file for the first time, the reader does not go back to the beginning of the file. What is the way around this problem?
It looks like after reading the file for the first time, the reader does not go back to the beginning of the file.
No - and I don't know why you would expect it to given the documentation you quoted. Basically, the lines() method doesn't "rewind" the reader before starting, and may not even be able to. (Imagine the BufferedReader wraps an InputStreamReader which wraps a network connection's InputStream - once you've read the data, it's gone.)
What is the way around this problem?
Two options:
Reopen the file and read it from scratch
Save the result of lines() to a List<String>, so that you're then not reading from the file at all the second time. For example:
List<String> lines = breader.lines().collect(Collectors.toList());
As an aside, I'd strongly recommend using Files.newBufferedReader instead of FileReader - the latter always uses the platform default encoding, which isn't generally a good idea.
And for that matter, to load all the lines into a list, you can just use Files.readAllLines... or Files.lines if you want the lines as a stream rather than a list. (Note the caveats in the comments, however.)
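A minimal sketch of those suggestions combined (java.nio.file APIs; UTF-8 is assumed as the file's encoding):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Stream;

public class LinesDemo {
    public static void main(String[] args) throws Exception {
        Path csv = Paths.get("refined.csv");

        // Load all lines once; the list can be streamed as often as needed.
        List<String> lines = Files.readAllLines(csv, StandardCharsets.UTF_8);
        System.out.println("Line Count " + lines.stream().count());
        System.out.println("Line Count " + lines.stream().count()); // same result

        // Or stream lazily; open a fresh stream for each full pass.
        try (Stream<String> stream = Files.lines(csv, StandardCharsets.UTF_8)) {
            System.out.println("Line Count " + stream.count());
        }
    }
}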
Probably the cited fragment from the JavaDoc needs clarification. Usually you would expect that after reading the whole file, the reader will point to the end of the file. But when using streams, it depends on whether a short-circuiting terminal operation is used and whether the stream is parallel. For example, if you use
String magicLine = breader.lines()
        .filter(str -> str.startsWith("magic"))
        .findAny()
        .orElse(null);
Your reader will likely stop after the first matching line (because there is no need to read further) or read the whole input file if no such line is found. If you perform the same operation on a parallel stream, the resulting position will be unpredictable, because the input is split into implementation-dependent chunks in which the search is performed. That's why the documentation is written this way.
As for workarounds, please read @JonSkeet's answer. And consider closing your streams via the try-with-resources construct.
If there are no guarantees that the reader will be at a specific line, why wouldn't you create two readers?
reader1 = new FileReader("refined.csv");
reader2 = new FileReader("refined.csv");

How to choose a field value from a specific stream in Storm

public void execute(Tuple input) {
    Object value = input.getValueByField(FIELD_NAME);
    ...
}
When calling getValueByField, how do I specify a particular stream name emitted by previous Bolt/Spout so that particular FIELD_NAME is coming from that stream?
I need to know this because I'm facing the following exception:
InvalidTopologyException(msg:Component: [bolt2-name] subscribes from non-existent stream: [default] of component [bolt1-name])
So, I want to specify a particular stream while calling getValueBy... methods.
I don't remember a way of doing it on the tuple itself, but you can find out who sent you the tuple:
String sourceComponent = tuple.getSourceComponent();
String streamId = tuple.getSourceStreamId();
Then you can use a classic switch/case in Java to call a specific method that knows which fields are available.
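For example, a sketch of that dispatch; the stream names "stream1"/"stream2" and the handler methods are hypothetical:

public void execute(Tuple input) {
    switch (input.getSourceStreamId()) {
        case "stream1": // hypothetical stream name
            handleStreamOne(input.getValueByField(FIELD_NAME));
            break;
        case "stream2": // hypothetical stream name
            handleStreamTwo(input.getValueByField(OTHER_FIELD_NAME));
            break;
        default:
            throw new IllegalStateException(
                    "Unexpected stream: " + input.getSourceStreamId());
    }
}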
You can also iterate through the fields included in your tuple to check whether a field is available, but I find this approach dirty.
for (String field : tuple.getFields()) {
    // Check something on field...
}
Just found out that binding to a specific stream can be done while building the topology.
The spout can declare fields for a stream (in the declareOutputFields method):
declarer.declareStream(streamName, new Fields(field1, field2));
...and emit values to the stream:
collector.emit(streamName, new Values(value1, value2...), msgID);
When the bolt is added to the topology, it can subscribe to a specific stream from a preceding spout or bolt like the following:
topologyBuilder.setBolt(boltId, new BoltClass(), parallelismLevel)
               .localOrShuffleGrouping(spoutORBoltID, streamID);
The overloaded version of localOrShuffleGrouping provides an option to specify the streamID as the last argument.
