Note: I am not necessarily looking for solutions to the concrete example problems described below. I am genuinely interested in why this isn't possible out of the box in Java 8.
Java streams are lazy. At the very end they have a single terminal operation.
My interpretation is that this terminal operation will pull all the values through the stream. None of the intermediate operations can do that. Why are there no intermediate operations that pull an arbitrary number of elements through the stream? Something like this:
stream
.mapMultiple(this::consumeMultipleElements) // or groupAndMap or combine or intermediateCollect or reverseFlatMap
.collect(Collectors.toList());
When a downstream operation tries to advance the stream once, the intermediate operation might try to advance the upstream multiple times (or not at all).
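To make the shape of such an operation concrete, a signature for it might look roughly like this (purely hypothetical; nothing like it exists in the JDK):
// Hypothetical intermediate operation on Stream<E>: the mapper decides how many
// upstream elements to pull (via the Iterator) for each downstream element it produces.
<R> Stream<R> mapMultiple(Function<? super Iterator<E>, ? extends R> mapper);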
I would see a couple of use cases:
(These are just examples. So you can see that it is certainly possible to handle these use cases but it is "not the streaming way" and these solutions lack the desirable laziness property that Streams have.)
Combine multiple elements into a single new element to be passed down the rest of the stream. (E.g., making pairs (1,2,3,4,5,6) ➔ ((1,2),(3,4),(5,6)))
// Something like this,
// without needing to consume the entire stream upfront,
// and also more generic. (The combiner should decide for itself how many
// elements to consume/combine per resulting element. Maybe the combiner is
// a Consumer<Iterator<E>> or a Consumer<Supplier<E>>.)
public <E, R> Stream<R> combine(Stream<E> stream, BiFunction<E, E, R> combiner) {
    List<E> completeList = stream.collect(toList());
    return IntStream.range(0, completeList.size() / 2)
            .mapToObj(i -> combiner.apply(
                    completeList.get(2 * i),
                    completeList.get(2 * i + 1)));
}
Determine if the Stream is empty (mapping the Stream to an Optional non-empty Stream)
// Something like this, without needing to consume the entire stream
public <E> Optional<Stream<E>> toNonEmptyStream(Stream<E> stream) {
    List<E> elements = stream.collect(toList());
    return elements.isEmpty()
            ? Optional.empty()
            : Optional.of(elements.stream());
}
Having a lazy Iterator that doesn't terminate the stream (allowing elements to be skipped based on more complicated logic than just skip(long n)).
Iterator<E> iterator = stream.iterator();
// Allow this without throwing a "java.lang.IllegalStateException: stream has already been operated upon or closed"
stream.collect(toList());
When they designed Streams and everything around them, did they forget about these use cases or did they explicitly leave this out?
I understand that these might give unexpected results when dealing with parallel streams but in my opinion this is a risk that can be documented.
Well, all of the operations that you want are actually achievable with the Stream API, but not out of the box.
Combining multiple elements into pairs of elements - you need a custom Spliterator for that. Here is Tagir Valeev doing that. He has a absolute beast of the library called StreamEx that does many other useful things that are not supported out of the box.
I did not understand your second example, but I bet it's doable also.
Skipping based on more complicated logic is covered in Java 9 via dropWhile and takeWhile, which take a Predicate as input.
Just notice that your statement that none of the intermediate operations can do that is not accurate: sorted and distinct do exactly that; they can't work otherwise. There's also flatMap, which acts like that, but that behavior is treated more like a bug.
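You can see that statefulness with a tiny experiment (just a sketch; the point is that every peek line prints even though only one element is requested downstream):
Stream.of(3, 1, 2)
      .peek(x -> System.out.println("pulled " + x))
      .sorted()      // stateful: must buffer every upstream element before emitting any
      .findFirst();  // still prints "pulled" for 3, 1 and 2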
One more thing is that intermediate operations for parallel streams have no defined order, so such a stateful intermediate operation would see its entries in an unspecified order for a parallel stream. On the other hand, you always have the option to abuse things like:
List<Something> list = Collections.synchronizedList(new ArrayList<>());
stream.map(x -> {
    list.add(x);  // side effect: capture the element
    return x;     // your mapping
})
I would not do that if I were you; really think about whether you actually need it. But just in case...
Not every terminal operation will “pull all the values through the stream”. The terminal operations iterator() and spliterator() do not immediately fetch all values and allow lazy processing, including constructing a new Stream again. For the latter, it’s strongly recommended to use spliterator() as this allows more meta information to be passed to the new stream and also implies less wrapping of objects.
E.g. your second example could be implemented as
public static <T> Stream<T> replaceWhenEmpty(Stream<T> s, Supplier<Stream<T>> fallBack) {
    boolean parallel = s.isParallel();
    Spliterator<T> sp = s.spliterator();
    Stream.Builder<T> firstElement;
    if (sp.getExactSizeIfKnown() == 0 || !sp.tryAdvance(firstElement = Stream.builder())) {
        s.close();
        return fallBack.get();
    }
    return Stream.concat(firstElement.build(), StreamSupport.stream(sp, parallel))
                 .onClose(s::close);
}
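For illustration, a caller could use it like this (just a sketch; the fallback supplier is only invoked when the source turns out to be empty):
Stream<String> result = replaceWhenEmpty(Stream.<String>empty(),      // possibly empty source
                                         () -> Stream.of("default")); // fallback stream
result.forEach(System.out::println); // prints "default"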
For your general question, I don’t see what a general abstraction of these examples should look like, except like the spliterator() method that already exists. As the documentation puts it:
However, if the provided stream operations do not offer the desired functionality, the BaseStream.iterator() and BaseStream.spliterator() operations can be used to perform a controlled traversal.
I want to get the last element of a lazy but finite Seq in Raku, e.g.:
my $s = lazy gather for ^10 { take $_ };
The following don't work:
say $s[* - 1];
say $s.tail;
These ones work but don't seem too idiomatic:
say (for $s<> { $_ }).tail;
say (for $s<> { $_ })[* - 1];
What is the most idiomatic way of doing this while keeping the original Seq lazy?
What you're asking about ("get[ing] the last element of a lazy but finite Seq … while keeping the original Seq lazy") isn't possible. I don't mean that it's not possible with Raku – I mean that, in principle, it's not possible for any language that defines "laziness" the way Raku does with, for example, the is-lazy method.
In particular, when a Seq is lazy in Raku, that "means that [the Seq's] values are computed on demand and stored for later use." Additionally, one of the defining features of a lazy iterable is that it cannot know its own length while remaining lazy – that's why calling .elems on a lazy iterable throws an error:
my $s = lazy gather for ^10 { take $_ };
say $s.is-lazy; # OUTPUT: «True»
$s.elems; # THROWS: «Cannot .elems a lazy list onto a Seq»
Now, at this point, you might reasonably be thinking "well, maybe Raku doesn't know how long $s is, but I can tell that it has exactly 10 elements in it." And you're not wrong – with that code, $s is indeed guaranteed to have 10 elements. This means that, if you want to get the tenth (last) element of $s, you can do so with $s[9]. And accessing $s's tenth element like that won't change the fact that $s.is-lazy.
But, importantly, you can only do so because you know something "extra" about $s, and that extra info undoes a good chunk of the reason you might want a list to be lazy in practice.
To see what I mean, consider a very similar Seq
my $s2 = lazy gather for ^10 { last if rand > .95; take $_ };
say $s2.is-lazy; # OUTPUT: «True»
Now, $s2 probably has 10 elements, but it might not – the only way to know is to iterate through it and find out. In turn, this means $s2[9] does not jump to the tenth element the way $s[9] did; it iterates through $s2 just like you'd need to. And, as a result, if you run $s2[9], then $s2 will no longer be lazy (i.e., $s2.is-lazy will return False).
And this is, in effect, what you did in the code in your question:
my $s = lazy gather for ^10 { take $_ };
say $s.is-lazy; # OUTPUT: «True»
say (for $s<> { $_ }).tail; # OUTPUT: «9»
say $s.is-lazy; # OUTPUT: «False»
Because Raku cannot ever know that it has reached the tail of a lazy Seq, the only way it could tell you the .tail is to fully iterate $s. And that necessarily means that $s is no longer lazy.
Two complications
It's worth mentioning two adjacent topics that aren't actually related but that are close enough that they trip some people up.
First, nothing I've said about lazy iterables not knowing their length precludes some non-lazy iterables from knowing their length. Indeed, a decent number of Raku types do both the Iterator role and the PredictiveIterator role – and the main point of a PredictiveIterator is that it does know how many elements it can produce without needing to produce/iterate them. But PredictiveIterators cannot be lazy.
The second potentially confusing topic is closely related to the first: while no PredictiveIterator can be lazy (that is, none will ever have an .is-lazy method that returns True), some PredictiveIterators have behavior that is very similar to laziness – and, in fact, may even be colloquially referred to as "lazy".
I can't do a great job explaining this distinction because, quite honestly, I don't fully understand it myself. But I can give you an example: the .lines method on an IO::Handle. It's certainly the case that reading the lines of a huge file behaves a lot like dealing with a lazy iterable. Most obviously, you can process each line without ever having the whole file in memory. And the docs even say that "lines are read lazily" with the .lines method.
On the other hand:
my $l = 'some-file-with-100_000-lines.txt'.IO.lines;
say $l.is-lazy; # OUTPUT: «False»
say $l.iterator ~~ PredictiveIterator; # OUTPUT: «True»
say $l.elems; # OUTPUT: «100000»
So I'm not quite sure whether it's fair to say that $l "is a lazy iterable", but if it is, it's "lazy" in a different way than $s was.
I realize that was a lot, but I hope it is helpful. If you have a more specific use case in mind for laziness (I bet it wasn't gathering the numbers from zero to nine!), I'd be happy to address that more specifically. And if anyone else can fill in some of the details with .lines and other lazy-not-lazy PredictiveIterators, I'd really appreciate it!
Drop the lazy
Lazy sequences in Raku are designed to work well as is. You don't need to emphasize they're lazy by adding an explicit lazy.
If you add an explicit lazy, Raku interprets that as a request to block operations such as .tail because they will almost certainly immediately render laziness moot, and, if called on an infinite sequence, or even just a sufficiently large one, hang or OOM the program.
So, either drop the lazy, or don't invoke operations like .tail that will be blocked if you do.
Expanded version of my original answer
As noted by @ugexe, the idiomatic solution is to drop the lazy.
Quoting my answer to the SO About Laziness:
if a gather is asked if it's lazy, it returns False.
Aiui, something like the following applies:
Some lazy sequence producers may be actually or effectively infinite. If so, calling .tail etc on them will hang the calling program. Conversely, other lazy sequences perform fine when all their values are consumed in one go. How should Raku distinguish between these two scenarios?
A decision was made in 2015 to let value producing datatypes emphasize or deemphasize their laziness via their response to an .is-lazy call.
Returning True signals that a sequence is not only lazy but wants to be known to be lazy by consuming code that calls .is-lazy. (Not so much end-user code but instead built-in consuming features such as @ sigilled variables handling an assignment, trying to determine whether or not to assign eagerly.) Built-in consuming features take a True as a signal that they ought to block calls like .tail. If a dev knows this is overly conservative, they can add an eager (or remove an unneeded lazy).
Conversely, a datatype, or even a particular object instance, may return False to signal that it does not want to be considered lazy. This may be because the actual behaviour of a particular datatype or instance is eager, but it might instead be that it is lazy technically, but doesn't want a consumer to block operations such as .tail because it knows they will not be harmful, or at least prefers to have that be the default presumption. If a dev knows better (because, say, it hangs the program), or at least wants potentially problematic operations to be blocked, they can add a lazy (or remove an unneeded eager).
I think this approach works well, but the doc and error messages mentioning "lazy" may not have caught up with the shift made in 2015. So:
If you've been confused by some doc about laziness, please search for doc issues with "lazy" in them, or "laziness", and add comments to existing issues, or file a new doc issue (perhaps linking to this SO answer).
If you've been confused by a Rakudo error message mentioning laziness, please search for Rakudo issues with "lazy" in them, and tagged [LTA] (which means "Less Than Awesome"), and add comments, or file a new Rakudo issue (with an [LTA] tag, and perhaps a link to this SO answer).
Further discussion
the docs ... say “If you want to force lazy evaluation use the lazy subroutine or method. Binding to a scalar or sigilless container will also force laziness.”
Yes. Aiui this is correct.
[which] sounds like it implies “my $x := lazy gather { ... } is the same as my $x := gather { ... }”.
No.
An explicit lazy statement prefix or method adds emphasis to laziness, and Raku interprets that to mean it ought to block operations like .tail in case they hang the program.
In contrast, binding to a variable alters neither emphasis nor deemphasis of laziness, merely relaying onward whatever the bound producer datatype/instance has chosen to convey via .is-lazy.
not only in connection with gather but elsewhere as well
Yes. It's about the result of .is-lazy:
my $x = (1, { .say; $_ + 1 } ... 1000);
my $y = lazy (1, { .say; $_ + 1 } ... 1000);
both act lazily ... but $x.tail is possible while $y.tail is not.
Yes.
An explicit lazy statement prefix or method forces the answer to .is-lazy to be True. This signals to a consumer that cares about the dangers of laziness that it should become cautious (eg rejecting .tail etc.).
(Conversely, an eager statement prefix or method can be used to force the answer to .is-lazy to be False, making timid consumers accept .tail etc calls.)
I take from this that there are two kinds of laziness in Raku, and one has to be careful to see which one is being used where.
It's two kinds of what I'll call consumption guidance:
Don't-tail-me If an object returns True from an .is-lazy call then it is treated as if it might be infinite. Thus operations like .tail are blocked.
You-can-tail-me If an object returns False from an .is-lazy call then operations like .tail are accepted.
It's not so much that there's a need to be careful about which of these two kinds is in play, but if one wants to call operations like tail, then one may need to enable that by inserting an eager or removing a lazy, and one must take responsibility for the consequences:
If the program hangs due to use of .tail, well, DIHWIDT.
If you suddenly consume all of a lazy sequence and haven't cached it, well, maybe you should cache it.
Etc.
What I would say is that the error messages and/or doc may well need to be improved.
I use a pipeline operator ponyfill, which is just a utility function applyPipe such that applyPipe(x, a, b) is equivalent to b(a(x)) or x |> a |> b (in this example there are 2 functions, but actually it can be any number of functions). In fp-ts this function is called pipe.
In my case the function is implemented as
export const applyPipe = (
source,
...project
) => {
for (const el of project) {
source = el(source);
}
return source;
};
(you could also implement it with .reduce).
This function can be used to compose observable operators, so applyPipe(timer(500), delay(500)) is equivalent to timer(500).pipe(delay(500)). The question is, is there a performance penalty to using such function in place of the .pipe method?
Theoretically, I see no major issue or performance downgrade by doing so (other than adding an extra step by using your function and copying the observable/object reference to process it in the function). You will just copy the observable reference (not the emissions of the observable), hence it shouldn't be a big deal for performance reasons.
Also, ponyfills/polyfills are generally expected to be either equal to or worse than the actual implementation in terms of performance. Just keep in mind that the spread operator will copy only the properties of the object (not any nested properties).
I would leave a comment in that ponyfill function to make it easier to understand for every other developer who works with your codebase.
I am refactoring some business rule functions to provide a more generic version of the function.
The functions I am refactoring are:
DetermineWindowWidth
DetermineWindowHeight
DetermineWindowPositionX
DetermineWindowPositionY
All of them do string parsing, as it is a string parsing business rules engine.
My question is what would be a good name for the newly refactored function?
Obviously I want to shy away from a function name like:
DetermineWindowWidthHeightPositionXPositionY
I mean that would work, but it seems unnecessarily long when it could be something like:
DetermineWindowMoniker or something to that effect.
Function objective: Parse an input string like 1280x1024 or 200,100 and return either the first or second number. The use case is for data-driving test automation of a web browser window, but this should be irrelevant to the answer.
Question objective: I have the code to do this, so my question is not about code, but just the function name. Any ideas?
There are too few details; you should have specified at least the parameters and return values of the functions.
Have I understood correctly that you use strings of the format NxN for sizes and N,N for positions?
And that this generic function will have to parse both (and nothing else), and will return either the first or second part depending on a parameter of the function?
And that you'll then keep the various DetermineWindow* functions but make them all call this generic function?
If so:
Without knowing what parameters the generic function has, it's even harder to help, but it's most likely impossible to give it a simple name.
Not all batches of code can be described by a simple name.
You'll most likely need to use a different construction if you want to have clear names. Here's an idea, in pseudo code:
ParseSize(string, outWidth, outHeight) {
    ParsePair(string, "x", outWidth, outHeight)
}

ParsePosition(string, outX, outY) {
    ParsePair(string, ",", outX, outY)
}

ParsePair(string, separator, outFirstItem, outSecondItem) {
    ...
}
And the various DetermineWindow would call ParseSize or ParsePosition.
You could also just use ParsePair directly, but I think it's cleaner to have the two other functions in the middle.
Objects
Note that you'd probably get cleaner code by using objects rather than strings (a Size and a Position one, and probably a Pair one too).
The ParsePair code (adapted appropriately) would be included in a constructor or factory method that gives you a Pair out of a string.
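If you go that route, a rough Java sketch of the Pair idea could look like the following (the names and the record form are just illustrative, not a prescription):
// Illustrative value type produced by parsing strings like "1280x1024" or "200,100".
record Pair(int first, int second) {
    static Pair parse(String input, String separator) {
        String[] parts = input.split(java.util.regex.Pattern.quote(separator), 2);
        return new Pair(Integer.parseInt(parts[0].trim()),
                        Integer.parseInt(parts[1].trim()));
    }
}

Pair size = Pair.parse("1280x1024", "x");   // first() == 1280, second() == 1024
Pair position = Pair.parse("200,100", ","); // first() == 200,  second() == 100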
---
Of course you can give other names to the various functions, objects and parameters, here I used the first that came to my mind.
It seems this question-answer provides a good starting point to answer this question:
Appropriate name for container of position, size, angle
A search on www.thesaurus.com for "Property" gives some interesting possible answers that provide enough meaningful context to the usage:
Aspect
Character
Characteristic
Trait
Virtue
Property
Quality
Attribute
Differentia
Frame
Constituent
I think ConstituentProperty is probably the most apt.
I am aggregating a bunch of enum values (different from the ordinal values) in a foreach loop.
int output = 0;
for (TestEnum testEnum : setOfEnums) {
    output |= testEnum.getValue();
}
Is there a way to do this in streams API?
If I use a lambda like this in a Stream<TestEnum>:
setOfEnums.stream().forEach(testEnum -> output |= testEnum.getValue());
I get a compile time error that says, 'variable used in lambda should be effectively final'.
Predicate represents a boolean-valued function; you need to use the reduce method of the stream to aggregate a bunch of enum values.
If we consider that you have a HashSet named setOfEnums:
// int initialValue = 0; // this is effectively final for the next stream pipeline as long as you don't modify it in that stream
final int initialValue = 0; // final
int output = setOfEnums.stream()
        .map(TestEnum::getValue)
        .reduce(initialValue, (e1, e2) -> e1 | e2);
You need to reduce the stream of enums like this:
int output = Arrays.stream(TestEnum.values())
        .mapToInt(TestEnum::getValue)
        .reduce(0, (acc, value) -> acc | value);
I like the recommendations to use reduction, but perhaps a more complete answer would illustrate why it is a good idea.
In a lambda expression, you can reference variables like output that are in scope where the lambda expression is defined, but you cannot modify the values. The reason for that is that, internally, the compiler must be able to implement your lambda, if it chooses to do so, by creating a new function with your lambda as its body. The compiler may choose to add parameters as needed so that all of the values used in this generated function are available in the parameter list. In your case, such a function would definitely have the lambda's explicit parameter, testEnum, but because you also reference the local variable output in the lambda body, it could add that as a second parameter to the generated function. Effectively, the compiler might generate this function from your lambda:
private void generatedFunction1(TestEnum testEnum, int output) {
    output |= testEnum.getValue();
}
As you can see, the output parameter is a copy of the output variable used by the caller, and the OR operation would only be applied to the copy. Since the original output variable wouldn't be modified, the language designers decided to prohibit modification of values passed implicitly to lambdas.
To get around the problem in the most direct way, setting aside for the moment that the use of reduction is a far better approach, you could wrap the output variable in a wrapper (e.g. an int[] array of size 1 or an AtomicInteger). The wrapper's reference would be passed by value to the generated function, and since you would now update the contents of output, not the value of output, output remains effectively final, so the compiler won't complain. For example:
AtomicInteger output = new AtomicInteger();
setOfEnums.stream().forEach(testEnum -> output.set(output.get() | testEnum.getValue()));
or, since we're using AtomicInteger, we may as well make it thread-safe in case you later choose to use a parallel Stream,
AtomicInteger output = new AtomicInteger();
setOfEnums.stream().forEach(testEnum -> output.getAndUpdate(prev -> prev | testEnum.getValue()));
Now that we've gone over an answer that most resembles what you asked about, we can talk about the superior solution of using reduction, that other answers have already recommended.
There are two kinds of reduction offered by Stream: stateless reduction (reduce()) and stateful reduction (collect()). To visualize the difference, consider a conveyor belt delivering hamburgers, and your goal is to collect all of the hamburger patties into one big hamburger. With stateful reduction, you start with a new hamburger bun, collect the patty out of each hamburger as it arrives, and add it to the stack of patties in the bun you set up to collect them. In stateless reduction, you start out with an empty hamburger bun (called the "identity", since that empty bun is what you end up with if the conveyor belt is empty), and as each hamburger arrives on the belt, you make a copy of the previous accumulated burger and add the patty from the one that just arrived, discarding the previous accumulated burger.
The stateless reduction may seem like a huge waste, but there are cases when copying the accumulated value is very cheap. One such case is when accumulating primitive types -- primitive types are very cheap to copy, so stateless reduction is ideal when crunching primitives in applications such as summing, ORing, etc.
So, using stateless reduction, your example might become:
int output = setOfEnums.stream()
        .mapToInt(TestEnum::getValue) // or .mapToInt(testEnum -> testEnum.getValue())
        .reduce(0, (resultSoFar, value) -> resultSoFar | value);
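For completeness, the stateful flavour of the same OR-accumulation would go through collect with a mutable container; a rough sketch using a one-element int array as that container:
int output = setOfEnums.stream()
        .collect(() -> new int[1],                         // supplier: the mutable "bun"
                 (acc, e) -> acc[0] |= e.getValue(),       // accumulator: OR each element in
                 (left, right) -> left[0] |= right[0])[0]; // combiner: merge partial results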
Some points to ponder:
Your original for loop is probably faster than using streams, except perhaps if your set is very large and you use parallel streams. Don't use streams for the sake of using streams. Use them if they make sense.
In my first example, I showed the use of Stream.forEach(). If you ever find yourself creating a Stream and just calling forEach(), it is more efficient just to call forEach() on the collection directly.
You didn't mention what kind of Set you are using, but I hope you are using EnumSet<TestEnum>. Because it is implemented as a bit field, it performs much better (O(1)) than any other kind of Set for all operations, even copying. EnumSet.noneOf(TestEnum.class) creates an empty Set, EnumSet.allOf(TestEnum.class) gives you a set of all enum values, etc.
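A small sketch of that, assuming the TestEnum and getValue() from the question:
EnumSet<TestEnum> none = EnumSet.noneOf(TestEnum.class); // empty set
EnumSet<TestEnum> all  = EnumSet.allOf(TestEnum.class);  // every constant
EnumSet<TestEnum> copy = EnumSet.copyOf(all);            // cheap copy of the underlying bit field

int output = 0;
for (TestEnum testEnum : all) {
    output |= testEnum.getValue();
}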
Suppose I have a List<T> with 1000 items in it.
I'm then passing this to a method that filters this List.
As it drops through the various cases (for example, there could be 50), the List<T> may have up to 50 various LINQ Where() operations performed on it.
I'm interested in this running as quickly as possible. Therefore, I don't want this List<T> filtered each time a Where() is performed on it.
Essentially I need it to defer the actual manipulation of the List<T> until all the filters have been applied.
Is this done natively by the compiler? Or just when I call .ToList() on the IEnumerable that a List<T>.Where() returns, or should I perform the Where() operations on X (where X = List.AsQueryable())?
Hope this makes sense.
Yes, deferred execution is supported natively. Every time you apply a query or lambda expression to your List, the query stores all the expressions; they are executed only when you call .ToList() on the query.
Each call to Where will create a new object which knows about your filter and the sequence it's being called on.
When this new object is asked for a value (and I'm being deliberately fuzzy between an iterator and an iterable here) it will ask the original sequence for the next value, check the filter, and either return the value or iterate back, asking the original sequence for the next value etc.
So if you call Where 50 times (as in list.Where(...).Where(...).Where(...)), you end up with something which needs to go up and down the call stack at least 50 times for each item returned. How much performance impact will that have? I don't know: you should measure it.
One possible alternative is to build an expression tree and then compile it down into a delegate at the end, and then call Where. This would certainly be a bit more effort, but it could end up being more efficient. Effectively, it would let you change this:
list.Where(x => x.SomeValue == 1)
.Where(x => x.SomethingElse != null)
.Where(x => x.FinalCondition)
.ToList()
into
list.Where(x => x.SomeValue == 1 && x.SomethingElse != null && x.FinalCondition)
.ToList()
If you know that you're just going to be combining a lot of "where" filters together, this may end up being more efficient than going via IQueryable<T>. As ever, check performance of the simplest possible solution before doing something more complicated though.
There is so much failure in the question and the comments. The answers are good but don't hit hard enough to break through the failure.
Suppose you have a list and a query.
List<T> source = new List<T>(){ /*10 items*/ };
IEnumerable<T> query = source.Where(filter1);
query = query.Where(filter2);
query = query.Where(filter3);
...
query = query.Where(filter10);
Is [lazy evaluation] done natively by the compiler?
No. Lazy evaluation is due to the implementation of Enumerable.Where
This method is implemented by using deferred execution. The immediate return value is an object that stores all the information that is required to perform the action. The query represented by this method is not executed until the object is enumerated either by calling its GetEnumerator method directly or by using foreach in Visual C# or For Each in Visual Basic.
speed penalty there is on calling List.AsQueryable().ToList()
Don't call AsQueryable, you only need to use Enumerable.Where.
thus won't prevent a 50 calls deep call stack
Depth of call stack is much, much less important than having a highly effective filter first. If you can reduce the number of elements early, you reduce the number of method calls later.