Do Java 8 streams produce slower code than plain imperative loops? - performance

There is too much hype about functional programming, and particularly about the new Java 8 streams API. It is advertised as a good replacement for good old loops and the imperative paradigm.
Indeed, sometimes it looks nice and does the job well. But what about performance?
E.g. here is a good article about it: Java 8: No more loops
Using a loop, you can do the whole job in one iteration. But with the new stream API you chain multiple loops, which makes it much slower (is that right?).
Look at their first sample. In most cases the loop will not even walk through the whole array. However, to do the filtering with the new stream API, you have to cycle through the whole array to filter out all candidates, and only then can you get the first one.
The article mentions laziness:
We first use the filter operation to find all articles that have the Java tag, then used the findFirst() operation to get the first occurrence. Since streams are lazy and filter returns a stream, this approach only processes elements until it finds the first match.
What does the author mean by laziness?
I ran a simple test, and it shows that the good old loop solution is about 10x faster than the stream approach.
public void test() {
    List<String> list = Arrays.asList(
        "First string",
        "Second string",
        "Third string",
        "Good string",
        "Another",
        "Best",
        "Super string",
        "Light",
        "Better",
        "For string",
        "Not string",
        "Great",
        "Super change",
        "Very nice",
        "Super cool",
        "Nice",
        "Very good",
        "Not yet string",
        "Let's do the string",
        "First string",
        "Low string",
        "Big bunny",
        "Superstar",
        "Last");
    long start = System.currentTimeMillis();
    for (int i = 0; i < 100000000; i++) {
        getFirstByLoop(list);
    }
    long end = System.currentTimeMillis();
    System.out.println("Loop: " + (end - start));
    start = System.currentTimeMillis();
    for (int i = 0; i < 100000000; i++) {
        getFirstByStream(list);
    }
    end = System.currentTimeMillis();
    System.out.println("Stream: " + (end - start));
}
public String getFirstByLoop(List<String> list) {
    for (String s : list) {
        if (s.endsWith("string")) {
            return s;
        }
    }
    return null;
}
public Optional<String> getFirstByStream(List<String> list) {
    return list.stream().filter(s -> s.endsWith("string")).findFirst();
}
The results (in milliseconds) were:
Loop: 517
Stream: 5790
BTW, if I use String[] instead of List, the difference is even bigger! Almost 100x!
QUESTION: Should I use the old imperative loop approach if I'm looking for the best code performance? Is the FP paradigm just about making code "more concise and readable", and not about performance?
OR
Is there something I missed, and could the new stream API be at least as efficient as the imperative loop approach?

QUESTION: Should I use the old imperative loop approach if I'm looking for the best code performance?
Right now, probably yes. Various benchmarks seem to suggest that streams are slower than loops for most tests, though not catastrophically slower.
Counterexamples:
In some cases, parallel streams can give a useful speedup.
Lazy streams can provide performance benefits for some problems; see http://java.amitph.com/2014/01/java-8-streams-api-laziness.html
(It is possible to do equivalent things with loops, but not with loops alone.)
But the bottom line is that performance is complicated and streams are not (yet) a magic bullet for speeding up your code.
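To illustrate the parallel-stream point, here is a minimal sketch (the class and method names are made up for this example). It only shows that the same pipeline can be switched between sequential and parallel execution with one call and gives the same result; whether the parallel version is actually faster depends on the data size, the work per element and the machine, so measure before relying on it.

```java
import java.util.stream.LongStream;

// Hypothetical example class, not taken from the question or the article.
final class ParallelSketch {

    // Sums the squares of 1..n, optionally on a parallel stream.
    static long sumOfSquares(long n, boolean parallel) {
        LongStream numbers = LongStream.rangeClosed(1, n);
        if (parallel) {
            // The only change needed to run the same pipeline in parallel.
            numbers = numbers.parallel();
        }
        return numbers.map(x -> x * x).sum();
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1000, false)); // prints 333833500
        System.out.println(sumOfSquares(1000, true));  // prints 333833500
    }
}
```

Doing the same with a plain loop would mean splitting the range and managing threads by hand.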
Is the FP paradigm just about making code "more concise and readable", and not about performance?
Not exactly. It is certainly true that the FP paradigm is more concise and (to someone who is familiar with it) more readable.
However, by expressing the computation using the FP paradigm, you are also expressing it in a way that potentially could be optimized in ways that are much harder to achieve with code expressed using loops and assignment. FP code is also more amenable to formal methods; i.e. formal proof of correctness.
(In the context of this discussion of streams, "could be optimized" means in some future Java release.)

Laziness is about how elements are taken from the source of the stream: on demand. If more elements are needed, they are taken; otherwise they are not. Here is an example:
Arrays.asList(1, 2, 3, 4, 5)
.stream()
.peek(x -> System.out.println("before filter : " + x))
.filter(x -> x > 2)
.peek(System.out::println)
.anyMatch(x -> x > 3);
Notice how each element goes through the entire pipeline of stages; that is, filter is applied to one element at a time, not to all of them at once, which is why filter returns a Stream<Integer> rather than a finished collection. This allows the stream to be short-circuiting: anyMatch does not even process 5, since there is no need to.
Just notice that not all intermediate operations are lazy. For example, sorted and distinct are not; these are called stateful intermediate operations. Think about it this way: to actually sort elements, you do need to traverse the entire source. One more example that is not intuitive is flatMap, which is not always lazy; this is not guaranteed behavior, though, and is seen more as a bug.
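To make the difference visible, here is a small sketch (class and method names are hypothetical) that counts how many source elements get pulled through the pipeline. It uses the same endsWith("string") predicate as the question, once with only a lazy filter and once with a stateful sorted() in front of it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class for illustration only.
final class LazinessSketch {

    // filter + findFirst: elements are pulled one by one until the first match.
    static int inspectedByFindFirst(List<String> list) {
        AtomicInteger inspected = new AtomicInteger();
        list.stream()
            .peek(s -> inspected.incrementAndGet())
            .filter(s -> s.endsWith("string"))
            .findFirst();
        return inspected.get();
    }

    // sorted() is stateful: it has to consume the whole source before
    // it can hand the first element to the next stage.
    static int inspectedWithSorted(List<String> list) {
        AtomicInteger inspected = new AtomicInteger();
        list.stream()
            .peek(s -> inspected.incrementAndGet())
            .sorted()
            .filter(s -> s.endsWith("string"))
            .findFirst();
        return inspected.get();
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("Another", "Best", "Good string", "Last");
        System.out.println(inspectedByFindFirst(list)); // prints 3
        System.out.println(inspectedWithSorted(list));  // prints 4
    }
}
```

So the stream in the question does not filter the whole list first; it stops at the first match, just like the loop does.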
The speed is about how you measure. Writing micro-benchmarks in Java is not easy, and the de facto tool for that is JMH; you can try that out. There are numerous posts here on SO that show that streams are indeed slower (which is normal, since they come with some infrastructure), but the difference is not big enough to actually care about.

Related

Concise (one line?) binary search in Raku

Many common operations aren't built in to Raku because they can be concisely expressed with a combination of (meta) operators and/or functions. It feels like binary search of a sorted array ought to be expressible in that way (maybe with .rotor? or …?) but I haven't found a particularly good way to do so.
For example, the best I've come up with for searching a sorted array of Pairs is:
sub binary-search(@a, $target) {
    when +@a ≤ 1 { @a[0].key == $target ?? @a[0] !! Empty }
    &?BLOCK(@a[0..^*/2, */2..*][@a[*/2].key ≤ $target], $target)
}
That's not awful, but I can't shake the feeling that it could be an awful lot better (both in terms of concision and readability). Can anyone see what elegant combo of operations I might be missing?
Here's one approach that technically meets my requirements (in that the function body fits on a single normal-length line). [But see the edit below for an improved version.]
sub binary-search(@a, \i is copy = my $=0, :target($t)) {
    for +@a/2, */2 … *≤1 {@a[i] cmp $t ?? |() !! return @a[i] with i -= $_ × (@a[i] cmp $t)}
}
# example usage (now slightly different, because it returns the index)
my @a = ((^20 .pick(*)) Z=> 'a'..*).sort;
say @a[binary-search(@a».key, :target(17))];
say @a[binary-search(@a».key, :target(1))];
I'm still not super happy with this code, because it loses a bit of readability – I still feel like there could/should be a concise way to do a binary search that also clearly expresses the underlying logic. Using a 3-way comparison feels like it's on that track, but still isn't quite there.
[edit: After a bit more thought, I came up with a more readable version of the above using reduce.
sub binary-search(@a, :target(:$t)) {
    (@a/2, */2 … *≤.5).reduce({ $^i - $^pt×(@a[$^i] cmp $t || return @a[$^i]) }) && Nil
}
In English, that reads as: for a sequence starting at the midpoint of the array and dropping by 1/2, move your index $^i by the value of the next item in the sequence – with the direction of the move determined by whether the item at that index is greater or lesser than the target. Continue until you find the target (in which case, return it) or you finish the sequence (which means the target wasn't present; return Nil).]
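For readers who don't speak Raku, the "halving step" idea can be sketched in Java. This is a related but not identical variant: it only ever jumps forward, whereas the Raku version moves in either direction. The class and method names are made up for the example, and it returns the index of the match or -1.

```java
// Hypothetical example class; a sketch of binary search by halving jumps.
final class JumpSearchSketch {

    // Returns an index of target in the sorted array, or -1 if it is absent.
    static int search(int[] sorted, int target) {
        int k = 0;
        // The step starts at half the array length and is halved each round.
        for (int step = sorted.length / 2; step >= 1; step /= 2) {
            // Jump forward while the landing spot is still <= target.
            while (k + step < sorted.length && sorted[k + step] <= target) {
                k += step;
            }
        }
        return (sorted.length > 0 && sorted[k] == target) ? k : -1;
    }

    public static void main(String[] args) {
        int[] data = {1, 3, 5, 7, 9, 11};
        System.out.println(search(data, 7)); // prints 3
        System.out.println(search(data, 4)); // prints -1
    }
}
```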

performance issue variable assignment

I am wondering which scenario would be best.
(Please bear with my examples; these are just small examples of the situation in question. I know you could have the exact same function without a result variable.)
A)
public String doSomthing(){
    String result;
    if(condition){ result = "Option A";}
    else{ result = "Option B";}
    return result;
}
B)
public String doSomthing(){
    String result = "Option B";
    if(condition){ result = "Option A";}
    return result;
}
Because in scenario B, if the condition is met, you assign result a value twice.
Yet in code, I keep seeing scenario A.
Actually, the overhead here is minimal, if any, considering the compiler optimisations. You would not care about it in a professional coding environment, unless you are writing a compiler yourself.
What is more important, considering (modern) programming paradigms, is the code style and readability.
Example A is far more readable, as it has a well-presented reason-outcome hierarchy. This is important especially for big methods, as it saves the programmer lots of analysis time.

performance of many if statements/switch cases

If I had literally 1000s of simple if statements or switch statements
ex:
if 'a':
    return 1
if 'b':
    return 2
if 'c':
    return 3
...
...
Would the performance of creating trivial if statements be faster when compared to searching a list for something? I imagined that because every if statement must be tested until the desired output is found (worst case O(n)) it would have the same performance if I were to search through a list. This is just an assumption. I have no evidence to prove this. I am curious to know this.
You could potentially put these things in to delegates that are then in a map, the key of which is the input you've specified.
C# Example:
// declare a map. The input(key) is a char, and we have a function that will return an
// integer based on that char. The function may do something more complicated.
var map = new Dictionary<char, Func<char, int>>();
// Add some:
map['a'] = (c) => { return 1; };
map['b'] = (c) => { return 2; };
map['c'] = (c) => { return 3; };
// etc... ad infinitum.
Now that we have this map, we can quite cleanly return something based on the input:
public int Test(char c)
{
    Func<char, int> func;
    if(map.TryGetValue(c, out func))
        return func(c);
    return 0;
}
In the above code, we can call Test and it will find the appropriate function to call (if present). This approach is better (imho) than a list as you'd have to potentially search the entire list to find the desired input.
This depends on the language and the compiler/interpreter you use. In many interpreted languages, the performance will be the same; in other languages, a switch statement gives the compiler crucial additional information that it can use to optimize the code.
In C, for instance, I expect a long switch statement like the one you present to use a lookup table under the hood, avoiding explicit comparison with all the different values. With that, your switch decision takes the same time, no matter how many cases you have. A compiler might also hardcode a binary search for the matching case. These optimizations are typically not performed when evaluating a long else if() ladder.
In any case, I repeat, it depends on the interpreter/compiler: if your compiler optimizes else if() ladders but not switch statements, what it could do with a switch statement is quite irrelevant. However, for mainstream languages, you should be able to expect all constructs to be optimized.
Apart from that, I advise using a switch statement wherever applicable; it carries a lot more semantic information to the reader than an equivalent else if() ladder.
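As a rough Java illustration of the two approaches discussed in this thread (the class and method names are invented for the example): javac compiles a switch over constants to a tableswitch or lookupswitch bytecode instruction rather than a chain of comparisons, and a HashMap lookup is likewise independent of the number of entries on average.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example class comparing a switch with a map lookup.
final class DispatchSketch {

    // The compiler turns this into a jump table or a sorted lookup,
    // not into one comparison per case.
    static int bySwitch(char c) {
        switch (c) {
            case 'a': return 1;
            case 'b': return 2;
            case 'c': return 3;
            default:  return 0;
        }
    }

    private static final Map<Character, Integer> TABLE = new HashMap<>();
    static {
        TABLE.put('a', 1);
        TABLE.put('b', 2);
        TABLE.put('c', 3);
    }

    // Same mapping as data: handy when the cases are only known at runtime.
    static int byMap(char c) {
        return TABLE.getOrDefault(c, 0);
    }

    public static void main(String[] args) {
        System.out.println(bySwitch('b') + " " + byMap('b')); // prints 2 2
        System.out.println(bySwitch('z') + " " + byMap('z')); // prints 0 0
    }
}
```

Which one is faster in practice is something to measure; the map version also pays for boxing the char key.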

OOP much slower than structural programming. Why, and how can it be fixed?

As mentioned in the subject of this post, I found out the hard way that OOP is slower than structural programming (spaghetti code).
I wrote a simulated annealing program with OOP, then removed one class and rewrote it structurally in the main form. Suddenly it got much faster. In the OOP program I was calling the removed class in every iteration.
I also checked it with Tabu Search. Same result.
Can anyone tell me why this is happening and how I can fix it in other OOP programs?
Are there any tricks? For example, caching my classes or something like that?
(The programs were written in C#.)
If you have a high-frequency loop, and inside that loop you create new objects and don't call other functions very much, then, yes, you will see that if you can avoid those new calls, say by re-using one copy of the object, you can save a large fraction of total time.
Between new, constructors, destructors, and garbage collection, very little code can waste a whole lot of time.
Use them sparingly.
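A minimal sketch of what "re-using one copy of the object" can look like, written in Java rather than C# and with invented names (Vec2, ReuseSketch). Both methods compute the same total; the second one allocates once instead of once per iteration. Note that a JIT compiler may remove some short-lived allocations on its own, so the actual gain has to be measured.

```java
// Hypothetical example class; Vec2 stands in for any small helper object.
final class ReuseSketch {

    static final class Vec2 {
        double x, y;

        void set(double x, double y) {
            this.x = x;
            this.y = y;
        }

        double norm() {
            return Math.sqrt(x * x + y * y);
        }
    }

    // Creates a fresh object in every iteration of the hot loop.
    static double allocateEachIteration(int n) {
        double total = 0;
        for (int i = 0; i < n; i++) {
            Vec2 v = new Vec2();
            v.set(i, i + 1);
            total += v.norm();
        }
        return total;
    }

    // Allocates once and overwrites the same object's fields each time.
    static double reuseOneInstance(int n) {
        double total = 0;
        Vec2 v = new Vec2();
        for (int i = 0; i < n; i++) {
            v.set(i, i + 1);
            total += v.norm();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(allocateEachIteration(1000) == reuseOneInstance(1000)); // prints true
    }
}
```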
Memory access is often overlooked. The way o.o. tends to lay out data in memory is not conducive to efficient memory access in practice in loops. Consider the following pseudocode:
adult_clients = 0
for client in list_of_all_clients:
if client.age >= AGE_OF_MAJORITY:
adult_clients++
It so happens that the way this is accessed from memory is quite inefficient on modern architectures, because they like accessing large contiguous rows of memory, but we only care about client.age, and the ages of all our clients will not be laid out in contiguous memory.
Focusing on objects that have fields results in data being laid out in memory in such a way that fields that hold the same type of information will not be laid out in consecutive memory. Performance-heavy code tends to involve loops that often look at data with the same conceptual meaning. It is conducive to performance that such data be laid out in contiguous memory.
Consider these two examples in Rust:
// struct that contains an id, and an optional value of whether the id is divisible by three
struct Foo {
    id : u32,
    divbythree : Option<bool>,
}
fn main () {
    // create a pretty big vector of these structs with increasing ids, and divbythree initialized as None
    let mut vec_of_foos : Vec<Foo> = (0..100000000).map(|i| Foo{ id : i, divbythree : None }).collect();
    // loop over all these structs, determine if the id is divisible by three
    // and set divbythree accordingly
    let mut divbythrees = 0;
    for foo in vec_of_foos.iter_mut() {
        if foo.id % 3 == 0 {
            foo.divbythree = Some(true);
            divbythrees += 1;
        } else {
            foo.divbythree = Some(false);
        }
    }
    // print the number of times it was divisible by three
    println!("{}", divbythrees);
}
On my system, the real time with rustc -O is 0m0.436s; now let us consider this example:
fn main () {
    // this time we create two vectors rather than a vector of structs
    let vec_of_ids : Vec<u32> = (0..100000000).collect();
    let mut vec_of_divbythrees : Vec<Option<bool>> = vec![None; vec_of_ids.len()];
    // but we basically do the same thing
    let mut divbythrees = 0;
    for i in 0..vec_of_ids.len(){
        if vec_of_ids[i] % 3 == 0 {
            vec_of_divbythrees[i] = Some(true);
            divbythrees += 1;
        } else {
            vec_of_divbythrees[i] = Some(false);
        }
    }
    println!("{}", divbythrees);
}
This runs in 0m0.254s at the same optimization level, close to half the time needed.
Despite having to allocate two vectors instead of one, storing similar values in contiguous memory has almost halved the execution time. Though obviously the o.o. approach provides for much nicer and more maintainable code.
P.s.: it occurs to me that I should probably explain why this matters so much, given that the code itself in both cases still indexes memory one field at a time, rather than, say, putting a large swath on the stack. The reason is c.p.u. caches: when the program asks for the memory at a certain address, it actually obtains, and caches, a significant chunk of memory around that address, and if memory next to it is requested again shortly afterwards, it can be served from the cache rather than from actual physical working memory. Of course, compilers will also vectorize the second version more efficiently as a consequence.
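The adult_clients pseudocode from earlier in this answer can be written against both layouts. Here is a sketch in Java, where the effect tends to be stronger because a List<Client> holds references to objects scattered over the heap; the Client class and all names are assumptions for the example, and no timings are claimed for it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class contrasting the two memory layouts.
final class LayoutSketch {

    static final int AGE_OF_MAJORITY = 18;

    static final class Client {
        final String name;
        final int age;

        Client(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // Object layout: every age sits inside a separate Client object.
    static int countAdults(List<Client> clients) {
        int adults = 0;
        for (Client client : clients) {
            if (client.age >= AGE_OF_MAJORITY) {
                adults++;
            }
        }
        return adults;
    }

    // Column layout: all ages are stored back to back in one int[].
    static int countAdults(int[] ages) {
        int adults = 0;
        for (int age : ages) {
            if (age >= AGE_OF_MAJORITY) {
                adults++;
            }
        }
        return adults;
    }

    public static void main(String[] args) {
        int[] ages = {12, 18, 40, 17, 65};
        List<Client> clients = new ArrayList<>();
        for (int i = 0; i < ages.length; i++) {
            clients.add(new Client("client" + i, ages[i]));
        }
        System.out.println(countAdults(clients)); // prints 3
        System.out.println(countAdults(ages));    // prints 3
    }
}
```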

An efficient technique to replace an occurrence in a sequence with mutable or immutable state

I am searching for an efficient technique to find a sequence of Op occurrences in a Seq[Op]. Once an occurrence is found, I want to replace it with a defined replacement and run the same search again until the list stops changing.
Scenario:
I have three types of Op case classes. Pop() extends Op, Push() extends Op and Nop() extends Op. I want to replace the occurrence of Push(), Pop() with Nop(). Basically the code could look like seq.replace(Push() ~ Pop() ~> Nop()).
Problem:
Now that I call seq.replace(...) I will have to search in the sequence for an occurrence of Push(), Pop(). So far so good. I find the occurrence. But now I will have to splice the occurrence from the list and insert the replacement.
Now there are two options. My list could be mutable or immutable. If I use an immutable list I am worried about performance, because those sequences are usually 500+ elements in size. If I replace a lot of occurrences like A ~ B ~ C ~> D ~ E, I will create a lot of new objects, if I am not mistaken. However, I could also use a mutable sequence like ListBuffer[Op].
Basically, coming from a linked-list background, I would just do some pointer-bending, and after a total of four operations I would be done with the replacement without creating new objects. That is why I am now concerned about performance, especially since this is a performance-critical operation for me.
Question:
How would you implement the replace() method in a Scala fashion and what kind of data structure would you use keeping in mind that this is a performance-critical operation?
I am happy with answers that point me in the right direction or pseudo code. No need to write a full replace method.
Thank you.
Ok, some considerations to be made. First, recall that, on lists, tail does not create objects, and prepending (::) only creates one object for each prepended element. That's pretty much as good as you can get, generally speaking.
One way of doing this would be this:
def myReplace(input: List[Op], pattern: List[Op], replacement: List[Op]) = {
  // This function should be part of a KMP algorithm instead, for performance
  def compare(pattern: List[Op], list: List[Op]): Boolean = (pattern, list) match {
    case (x :: xs, y :: ys) if x == y => compare(xs, ys)
    case (Nil, _) => true // whole pattern consumed: list starts with the pattern
    case _ => false
  }
  var processed: List[Op] = Nil
  var unprocessed: List[Op] = input
  val patternLength = pattern.length
  val reversedReplacement = replacement.reverse
  // Do this until we finish processing the whole sequence
  while (unprocessed.nonEmpty) {
    // This inside algorithm would be better if replaced by KMP
    // Quickly process non-matching sequences
    while (unprocessed.nonEmpty && unprocessed.head != pattern.head) {
      processed ::= unprocessed.head
      unprocessed = unprocessed.tail
    }
    if (unprocessed.nonEmpty) {
      if (compare(pattern, unprocessed)) {
        processed :::= reversedReplacement
        unprocessed = unprocessed drop patternLength
      } else {
        processed ::= unprocessed.head
        unprocessed = unprocessed.tail
      }
    }
  }
  processed.reverse
}
You may gain speed by using KMP, particularly if the pattern searched for is long.
Now, what is the problem with this algorithm? The problem is that it won't test if the replaced pattern causes a match before that position. For instance, if I replace ACB with C, and I have an input AACBB, then the result of this algorithm will be ACB instead of C.
To avoid this problem, you should add backtracking. First, you check at which position in your pattern the replacement may occur:
val positionOfReplacement = pattern.indexOfSlice(replacement)
Then, you modify the replacement part of the algorithm like this:
if (compare(pattern, unprocessed)) {
  if (positionOfReplacement > 0) {
    // Drop the match, then push the replacement and the last few processed
    // elements (restored to input order) back onto the unprocessed list
    unprocessed = replacement ::: (unprocessed drop patternLength)
    unprocessed :::= (processed take positionOfReplacement).reverse
    processed = processed drop positionOfReplacement
  } else {
    processed :::= reversedReplacement
    unprocessed = unprocessed drop patternLength
  }
} else {
This will backtrack enough to solve the problem.
This algorithm won't deal efficiently, however, with multiple patterns at the same time, which I guess is where you are going. For that, you'll probably need some adaptation of KMP to do it efficiently, or, otherwise, use a DFA to control possible matchings. It gets even worse if you want to match both AB and ABC.
In practice, the full-blown problem is equivalent to regex match & replace, where the replace is a function of the match. Which means, of course, you may want to start looking into regex algorithms.
EDIT
I was forgetting to complete my reasoning. If that technique doesn't work for some reason, then my advice is to go with an immutable tree-based vector. Tree-based vectors enable replacement of partial sequences with a low amount of copying.
And if that doesn't do, then the solution is doubly linked lists. And pick one from a library with slice replacement -- otherwise you may end up spending way too much time debugging a known but tricky algorithm.
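For comparison, here is a mutable, stack-style sketch of "replace until nothing changes" in Java. This is a different technique from the list-based algorithm above, not a translation of it, and every name in it is made up. Each element is appended to an output buffer; whenever the buffer ends with the pattern, that tail is removed and the replacement is pushed back onto the front of the pending input, so matches created by a replacement are found too. It assumes a single non-empty pattern and a replacement that cannot recreate the pattern forever (otherwise it will not terminate), and for overlapping matches it may pick a different rewrite order than repeated left-to-right passes would.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical example class; a sketch, not a tuned implementation.
final class RewriteSketch {

    static <T> List<T> replaceAll(List<T> input, List<T> pattern, List<T> replacement) {
        if (pattern.isEmpty()) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        Deque<T> pending = new ArrayDeque<>(input); // elements still to be read
        List<T> out = new ArrayList<>();            // processed prefix, used as a stack
        int n = pattern.size();
        while (!pending.isEmpty()) {
            out.add(pending.pollFirst());
            int from = out.size() - n;
            if (from >= 0 && out.subList(from, out.size()).equals(pattern)) {
                // Remove the matched tail and re-read the replacement as input.
                out.subList(from, out.size()).clear();
                for (int i = replacement.size() - 1; i >= 0; i--) {
                    pending.addFirst(replacement.get(i));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("A", "A", "C", "B", "B");
        List<String> result = replaceAll(input, Arrays.asList("A", "C", "B"), Arrays.asList("C"));
        System.out.println(result); // prints [C]
    }
}
```

On the AACBB example from above this yields C, because the second B is compared against the output tail A, C left behind by the first replacement.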

Resources