I am new to Java 8 and a bit confused about the scope of lambda expressions. I read some articles claiming that lambda expressions reduce execution time, so to check this I wrote the following two programs:
1) Without using a lambda expression
import java.util.*;

public class testing_without_lambda
{
    public static void main(String args[])
    {
        long startTime = System.currentTimeMillis();
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6);
        for (int number : numbers)
        {
            System.out.println(number);
        }
        long stopTime = System.currentTimeMillis();
        System.out.print("without lambda:");
        System.out.println(stopTime - startTime);
    } //end main
}
output:
2) Using a lambda expression
import java.util.*;

public class testing_with_lambda
{
    public static void main(String args[])
    {
        long startTime = System.currentTimeMillis();
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6);
        numbers.forEach((Integer value) -> System.out.println(value));
        long stopTime = System.currentTimeMillis();
        System.out.print("with lambda:");
        System.out.print(stopTime - startTime);
    } //end main
}
output:
Does this mean that lambda expressions require more time to execute?
No general statement about “execution time” is possible, as the term “execution time” doesn't always mean the same thing. And of course, there is no reason why merely using a lambda expression should reduce execution time in general.
Your code measures the initialization time of the code together with its execution time, which is fair when you consider the total execution time of that tiny program, but it has no relevance for real-life applications, as they usually run significantly longer than their initialization time.
What makes the drastic difference in initialization time is the fact that the JRE uses the Collection API itself internally, so its classes are loaded, initialized, and possibly even optimized to some degree before your application even starts (so you don't measure their cost). In contrast, it doesn't use lambda expressions, so your first use of a lambda expression will load and initialize an entire framework behind the scenes.
Since you are usually interested in how code would perform in a real application, where the initialization has already happened, you would have to execute the code multiple times within the same JVM to get a better picture. However, allowing the JVM's optimizer to process the code bears the possibility that it gets over-optimized due to its simpler nature (compared to a real-life scenario) and then shows overly optimistic numbers. That's why it's recommended to use sophisticated benchmark tools developed by experts instead of creating your own. Even with these tools, you have to study their documentation to understand and avoid the pitfalls. See also How do I write a correct micro-benchmark in Java?
When you compare the for loop, also known as “external iteration”, with the equivalent forEach call, also known as “internal iteration”, the latter does indeed have the potential to be more efficient if properly implemented by the particular Collection, but the outcome is hard to predict, as the JVM's optimizer is good at removing the drawbacks of the other solution. Also, your list is far too small to ever exhibit this difference, if it exists.
It also must be emphasized that this principle is not tied to lambda expressions. You could also implement the Consumer via an anonymous inner class, and in your case, where the example suffers from the first-time initialization cost, the anonymous inner class would be faster than the lambda expression.
In addition to Holger's answer, I want to show how you could have benchmarked your code.
What you're really measuring is the initialization of classes and the system's I/O (i.e. class loading and System.out::println).
To get rid of these you should use a benchmarking framework like JMH. In addition, you should measure with multiple list or array sizes.
Then your code may look like this:
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@Measurement(iterations = 10, timeUnit = TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Threads(1)
@Warmup(iterations = 5, timeUnit = TimeUnit.NANOSECONDS)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class MyBenchmark {

    @Param({ "10", "1000", "10000" })
    private int length;

    private List<Integer> numbers;

    @Setup
    public void setup() {
        Random random = new Random(length);
        numbers = random.ints(length).boxed().collect(Collectors.toList());
    }

    @Benchmark
    public void externalIteration(Blackhole bh) {
        for (Integer number : numbers) {
            bh.consume(number);
        }
    }

    @Benchmark
    public void internalIteration(Blackhole bh) {
        numbers.forEach(bh::consume);
    }
}
And the results:
Benchmark                      (length)  Mode  Cnt      Score     Error  Units
MyBenchmark.externalIteration        10  avgt   30     41,002 ±   0,263  ns/op
MyBenchmark.externalIteration      1000  avgt   30   4026,842 ±  71,318  ns/op
MyBenchmark.externalIteration     10000  avgt   30  40423,629 ± 572,055  ns/op
MyBenchmark.internalIteration        10  avgt   30     40,783 ±   0,815  ns/op
MyBenchmark.internalIteration      1000  avgt   30   3888,240 ±  28,790  ns/op
MyBenchmark.internalIteration     10000  avgt   30  41961,320 ± 991,047  ns/op
As you can see there is little to no difference.
I don't think lambda-expression code will be faster in execution all the time; it really depends on the conditions. Can you maybe point me to the article where you read that lambda expressions are faster in execution time?
(They are certainly considered faster to write, due to the functional style of the code.)
I ran your test again locally and found something strange:
The first time I ran the code without the lambda, it took almost the same time as with the lambda, actually slightly more (49 milliseconds).
From the second run onward, the code without the lambda ran very much faster, i.e. in 1 millisecond.
The code with the lambda expression runs in roughly the same amount of time every time; I tried around 3-4 times in total.
General Results:
1
2
3
4
5
6
with lambda:47
1
2
3
4
5
6
without lambda:1
I think it would take a very large sample of numbers to really test this, and also multiple calls to the same code to remove any initialization burden on the JVM. This is a pretty small test sample.
Related
I have a function like this:
fun randomWalk(numSteps: Int): Int {
    var n = 0
    repeat(numSteps) { n += (-1 + 2 * Random.nextInt(2)) }
    return n.absoluteValue
}
This works fine, except that it uses a mutable variable, and I would like to make everything immutable when possible, for better safety and readability. So I came up with an equivalent version that doesn't use any mutable variables:
fun randomWalk_seq(numSteps: Int): Int =
    generateSequence(0) { it + (-1 + 2 * Random.nextInt(2)) }
        .elementAt(numSteps)
        .absoluteValue
This also works fine and produces the same results, but it takes 3 times longer.
I used the following way to measure it:
@OptIn(ExperimentalTime::class)
fun main() {
    val numSamples = 100000
    val numSteps = 15708
    repeat(5) {
        val randomWalkSamples: IntArray
        val duration = measureTime {
            randomWalkSamples = IntArray(numSamples) { randomWalk(numSteps) }
        }
        println(duration)
    }
}
I know it's a bit hacky (I could have used JMH but this is just a quick test - at least I know that measureTime uses a monotonic clock). The results for the iterative (mutable) version:
2.965358406s
2.560777033s
2.554363661s
2.564279403s
2.608323586s
As expected, the first line shows it took a bit longer on the first run due to the warming up of the JIT, but the next 4 lines have fairly small variation.
After replacing randomWalk with randomWalk_seq:
6.636866719s
6.980840906s
6.993998111s
6.994038706s
7.018054467s
Somewhat surprisingly, I don't see any warm-up time: the first line always shows a shorter duration than the following 4 lines, every time I run this. Also, every time I run it, the duration keeps increasing, with line 5 always being the longest.
Can someone explain the findings, and also is there any way of making this function not use any mutable variables but still have performance that is close to the mutable version?
Your solution is slower for two main reasons: boxing and the complexity of the iterator used by generateSequence()'s Sequence implementation.
Boxing happens because a Sequence uses its types generically, so it cannot use primitive 32-bit Ints directly, but must wrap them in classes and unwrap them when retrieving the items.
You can see the complexity of the iterator by Ctrl+clicking the generateSequence function to view the source code.
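To get a feel for why this matters, here is a heavily simplified sketch of the kind of iterator such a sequence has to provide. This is illustrative only, not the actual kotlin.sequences source, but it shows the two costs involved: the current value lives in a nullable generic field (so every Int gets boxed), and every element goes through an extra function call.

// Illustrative only; the real generateSequence iterator differs in its details.
class GeneratorLikeIterator<T : Any>(seed: T, private val nextFn: (T) -> T?) : Iterator<T> {
    private var nextItem: T? = seed          // generic and nullable, so Ints end up boxed
    override fun hasNext() = nextItem != null
    override fun next(): T {
        val result = nextItem ?: throw NoSuchElementException()
        nextItem = nextFn(result)            // an extra (virtual) call for every element
        return result
    }
}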
@Михаил Нафталь's suggestion is faster because it avoids the complex iterator of the sequence, but it still has boxing.
I tried writing an overload of sumOf that uses IntProgression directly instead of Iterable<T>, so it won't use boxing, and that resulted in performance equivalent to your imperative code with the var. As you can see below, it's inline, and when put together with the { -1 + 2 * Random.nextInt(2) } lambda suggested by @Михаил Нафталь, the resulting compiled code will be equivalent to your imperative code.
inline fun IntProgression.sumOf(selector: (Int) -> Int): Int {
    var sum: Int = 0.toInt()
    for (element in this) {
        sum += selector(element)
    }
    return sum
}
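For reference, this is roughly how the overload would be used (the function name randomWalkNoVar below is just illustrative). Because IntProgression is a more specific receiver than Iterable<Int>, the call resolves to the non-boxing overload above rather than the stdlib one.

import kotlin.math.absoluteValue
import kotlin.random.Random

// Illustrative usage of the IntProgression.sumOf overload defined above.
fun randomWalkNoVar(numSteps: Int): Int =
    (1..numSteps).sumOf { -1 + 2 * Random.nextInt(2) }.absoluteValue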
Ultimately, I don't think you're buying yourself much in the way of code clarity by removing a single var in such a small function. I would say the sequence code is arguably harder to read. vars may add to code complexity in complex algorithms, but I don't think they do in such simple algorithms, especially when there's only one of them and it's local to the function.
Equivalent immutable one-liner is:
fun randomWalk2(numSteps: Int) =
(1..numSteps).sumOf { -1 + 2 * Random.nextInt(2) }.absoluteValue
Probably even more performant would be to sum plain Random.nextInt(2) values and apply the -1 and the 2 once at the end (since the sum of (-1 + 2*x_i) over n steps equals -n + 2*sum(x_i)), so that you'll have one multiplication and n additions instead of n multiplications and (2*n-1) additions:
fun randomWalk3(numSteps: Int) =
(-numSteps + 2 * (1..numSteps).sumOf { Random.nextInt(2) }).absoluteValue
Update
As @Tenfour04 noted, there is no specific stdlib implementation of IntProgression.sumOf, so the call resolves to Iterable<T>.sumOf, which adds unnecessary overhead from Int boxing.
So, it's better to use IntArray here instead of IntProgression:
fun randomWalk4(numSteps: Int) =
(-numSteps + 2 * IntArray(numSteps).sumOf { Random.nextInt(2) }).absoluteValue
I still encourage you to check all of this with JMH.
I think:"Removing mutability without losing speed" is wrong title .because
mutability thing comes to deal with the flow that program want to achieve .
you are using var inside function.... and 100% this var will not ever change from outside this function and that is mutability concept.
if we git rid off from var everywhere why we need it in programming ?
Kotlin for competitive programming suggests the following code for reading console input.
readLine()!!.split(" ").map { str -> str.toInt() } // read space-separated integers from the console
Until now for every competitive problem I've used the same approach and to be honest, it has never disappointed.
But for certain problems where the count of input integers is very large (close to 2 * 10^6) this method is just too slow and results in TLE (Time Limit Exceeded).
Is there even a faster way to read input from console?
If you suspect that the .split() call is the bottleneck, you could explore some of the alternatives in this thread.
If you suspect that the toInt() call is the bottleneck, perhaps you could try parallelizing the streams using the Java 8 stream API. For example:
readLine()!!.split(" ").parallelStream().map { str -> str.toInt() }...
For best performance, you could probably combine the two methods.
I believe the toInt() conversions are more expensive than the split(" ").
Are you sure you need to convert all of the input strings to Int at the very beginning?
It depends on the task, but sometimes part of these conversions can be avoided.
For instance, if the task is "check that there are no negative numbers in the input", you can convert the strings to Int one by one, and once you meet a negative one there is no need to convert the others, as in the sketch below.
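A minimal sketch of that idea (the function name and parameter are made up for illustration): the lazy pipeline converts one token at a time and short-circuits at the first negative value, so the remaining tokens are never converted.

// Illustrative sketch: parse lazily and stop at the first negative number.
fun allNonNegative(tokens: List<String>): Boolean =
    tokens.asSequence()
        .map { it.toInt() }
        .none { it < 0 }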
I think JMH could be useful here. You can run a benchmark similar to the one below and try to identify your bottlenecks.
Note that this is in Mode.SingleShotTime, and so emulates the scenario where the JIT has little opportunity to do its thing.
import org.openjdk.jmh.annotations.*
import java.util.concurrent.TimeUnit
import kotlin.random.Random

//@BenchmarkMode(Mode.AverageTime)
@BenchmarkMode(Mode.SingleShotTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
open class SplitToInt {

    val count = 2 * 1_000_000

    lateinit var stringToParse: String
    lateinit var tokensToParse: List<String>

    @Setup(Level.Invocation)
    fun setup() {
        stringToParse = (0..count).map { Random.nextInt(0, 100) }.joinToString(separator = " ")
        tokensToParse = (0..count).map { Random.nextInt(0, 100) }.map { it.toString() }
    }

    @Benchmark
    open fun split() =
        stringToParse.split(" ")

    @Benchmark
    open fun map_toInt() =
        tokensToParse.map { it.toInt() }

    @Benchmark
    open fun split_map_toInt() =
        stringToParse.split(" ").map { it.toInt() }
}
The stats on my machine are:
Benchmark                   Mode  Cnt    Score   Error  Units
SplitToInt.map_toInt          ss         48.666         ms/op
SplitToInt.split              ss        124.899         ms/op
SplitToInt.split_map_toInt    ss        186.981         ms/op
So splitting the string and mapping to list of Ints takes ~ 187 ms. Allowing JIT to warm up (Mode.AverageTime) gives me:
Benchmark                   Mode  Cnt    Score    Error  Units
SplitToInt.map_toInt        avgt    5   30.670 ±  6.979  ms/op
SplitToInt.split            avgt    5  108.989 ± 23.569  ms/op
SplitToInt.split_map_toInt  avgt    5  120.628 ± 27.052  ms/op
Whether this is fast or slow depends on the circumstances, but are you sure that the input transformation here is the reason you get TLE?
Edit: If you do think that split(" ").map { str -> str.toInt() } is too slow, you could replace creating the two lists (one from split and one from map) with a single list by splitting and transforming in one go. I wrote a quick hack based on kotlin.text.Regex.split that does that, and it is about 20% faster; a rough sketch of the single-pass idea follows.
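A hand-rolled single pass could look roughly like this sketch (this is not the Regex-based hack mentioned above, and the function name is made up; it just illustrates splitting and parsing in one go):

// Illustrative sketch: split on spaces and parse in a single pass,
// without materializing an intermediate List<String>.
fun parseInts(line: String): IntArray {
    val result = ArrayList<Int>()
    var start = 0
    while (start <= line.length) {
        var end = line.indexOf(' ', start)
        if (end == -1) end = line.length
        if (end > start) result.add(line.substring(start, end).toInt())
        start = end + 1
    }
    return result.toIntArray()
}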
If in your use case you need to examine only part of the input, splitToSequence is probably a better option.
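For example, with splitToSequence both the splitting and the conversion happen lazily, so only the tokens you actually consume are processed (the take(10) below is just an illustration):

// Illustrative sketch: only the tokens that are actually consumed get split and parsed.
val firstTen: List<Int> = readLine()!!
    .splitToSequence(" ")
    .map { it.toInt() }
    .take(10)
    .toList()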
It is common in C/C++ programming to use function pointers to optimize branching in the main data path. So I wrote a test program to find out whether similar performance savings can be obtained in Scala using functional programming techniques. The use case is a function which is invoked millions of times and has a branching statement based on a global flag. The code using an if() statement:
val b = true
def test() = {
  if (b) { /* do something */ }
  else { /* do something else */ }
}

for (i <- 0 to 100000) test()
And, trying to get rid of the if(), I did this:
def onTrue() = { /* do something */ }
def onFalse() = { /* do something else */ }
lazy val callback: () => Unit = if (b) onTrue else onFalse
def test() = callback()
for (i <- 0 to 100000) test()
I did a comparison of both these programs by running them for large counters (in the for loop) and running them many times and using the System.nanoTime() differential to measure the time taken.
And my tests seem to suggest that the callback method is actually SLOWER than using if() in the loop. The reason for this could be that a function call requires the params and returns to be pushed onto the stack, a new stack frame to be created, etc. Given this finding, I wanted to know:
1) Is there a functional way one could code this in Scala that would beat the performance of the if() in the loop?
2) @inline works at compile time. Is there a runtime equivalent that avoids the stack activity (similar to tail-call optimization)?
3) Could my test or results be inaccurate or erroneous in some way?
3) It's very easy to get your methodology wrong when testing this way. Use something like JMH if you want quasi-trustable microbenchmarks!
2) The JVM does inlining at runtime.
1) You aren't measuring a difference in whether something is "functional". You're measuring the difference between using a lazy val and not. If you don't have the lazy val in there, the JVM will probably be able to optimize your code (depending on what "do something" is).
If you remove the lazy val, the second one optimizes to the same speed in my hands. (It has an extra mandatory check for every access that it isn't being initialized in a multi-threaded context.)
I have a large vector of vectors of strings:
There are around 50,000 vectors of strings,
each of which contains 2-15 strings of length 1-20 characters.
MyScoringOperation is a function which operates on a vector of strings (the datum) and returns an array of 10100 scores (as Float64s). It takes about 0.01 seconds to run MyScoringOperation (depending on the length of the datum)
function MyScoringOperation(state::State, datum::Vector{String})
    ...
    score::Vector{Float64} # Size of score = 10000
I have what amounts to a nested loop.
The outer loop typically runs for 500 iterations:
data::Vector{Vector{String}} = loaddata()
for ii in 1:500
    score_total = zeros(10100)
    for datum in data
        score_total += MyScoringOperation(datum)
    end
end
On one computer, on a small test case of 3000 (rather than 50,000) this takes 100-300 seconds per outer loop.
I have 3 powerful servers with Julia 0.3.9 installed (and can easily get 3 more, and then hundreds more at the next scale).
I have basic experience with @parallel; however, it seems like it is spending a lot of time copying the constant (it more or less hangs on the smaller testing case).
That looks like:
data::Vector{Vector{String}} = loaddata()
state = init_state()
for ii in 1:500
    score_total = @parallel(+) for datum in data
        MyScoringOperation(state, datum)
    end
    state = update(state, score_total)
end
My understanding of the way this implementation works with @parallel is that, for each ii, it:
1. partitions data into a chunk for each worker
2. sends that chunk to each worker
3. has all workers process their chunks
4. has the main process sum the results as they arrive.
I would like to remove step 2, so that instead of sending a chunk of data to each worker, I just send a range of indexes to each worker and they look it up from their own copy of data; or, even better, give each worker only its own chunk and have it reuse that chunk each time (saving a lot of RAM).
Profiling backs up my belief about how @parallel functions.
For a similarly scoped problem (with even smaller data), the non-parallel version runs in 0.09 seconds, and the parallel version runs in 185 seconds. The profiler shows almost 100% of this time is spent interacting with network IO.
This should get you started:
function get_chunks(data::Vector, nchunks::Int)
    base_len, remainder = divrem(length(data), nchunks)
    chunk_len = fill(base_len, nchunks)
    chunk_len[1:remainder] += 1 # remainder will always be less than nchunks
    function _it()
        for ii in 1:nchunks
            chunk_start = sum(chunk_len[1:ii-1]) + 1
            chunk_end = chunk_start + chunk_len[ii] - 1
            chunk = data[chunk_start:chunk_end]
            produce(chunk)
        end
    end
    Task(_it)
end

function r_chunk_data(data::Vector)
    all_chunks = get_chunks(data, nworkers()) |> collect;
    remote_chunks = [put!(RemoteRef(pid)::RemoteRef, all_chunks[ii]) for (ii, pid) in enumerate(workers())]
    # Have to add the type annotation as otherwise it thinks that RemoteRef(pid) might return a RemoteValue
end

function fetch_reduce(red_acc::Function, rem_results::Vector{RemoteRef})
    total = nothing
    # TODO: consider strongly wrapping total in a lock, when in 0.4, so that it is guaranteed safe
    @sync for rr in rem_results
        function gather(rr)
            res = fetch(rr)
            if total === nothing
                total = res
            else
                total = red_acc(total, res)
            end
        end
        @async gather(rr)
    end
    total
end

function prechunked_mapreduce(r_chunks::Vector{RemoteRef}, map_fun::Function, red_acc::Function)
    rem_results = map(r_chunks) do rchunk
        function do_mapred()
            @assert rchunk.where == myid()
            @pipe rchunk |> fetch |> map(map_fun, _) |> reduce(red_acc, _)
        end
        remotecall(rchunk.where, do_mapred)
    end
    @pipe rem_results |> convert(Vector{RemoteRef}, _) |> fetch_reduce(red_acc, _)
end
r_chunk_data breaks the data into chunks (as defined by the get_chunks method) and sends each chunk to a different worker, where it is stored in a RemoteRef.
The RemoteRefs are references to memory on your other processes (and potentially computers).
prechunked_mapreduce does a variation on a kind of map-reduce, having each worker first run map_fun on each of its chunk's elements, then reduce over all the elements in its chunk using red_acc (a reduction accumulator function). Finally each worker returns its result, which is then combined by reducing them all together using red_acc, this time via fetch_reduce, so that we can add the first ones completed first.
fetch_reduce is a nonblocking fetch-and-reduce operation. I believe it has no race conditions, though this may be because of an implementation detail in @async and @sync. When Julia 0.4 comes out, it is easy enough to put a lock in to make it obviously free of race conditions.
This code isn't really battle-hardened. I don't believe the
You also might want to look at making the chunk size tunable, so that you can send more data to faster workers (if some have better network or faster CPUs).
You need to reexpress your code as a map-reduce problem, which doesn't look too hard.
Testing that with:
data = [float([eye(100), eye(100)])[:] for _ in 1:3000] # 480Mb
chunk_data(:data, data)
@time prechunked_mapreduce(:data, mean, (+))
Took ~0.03 seconds, when distributed across 8 workers (none of them on the same machine as the launcher)
vs running just locally:
@time reduce(+, map(mean, data))
took ~0.06 seconds.
I'm using the VS2010 built-in profiler.
My application contains three threads.
One of the threads is really simple:
while (true)
{
    if (something) {
        // blah blah, very fast and rarely occurring thing
    }
    Thread.Sleep(1000);
}
Visual Studio reports that Thread.Sleep takes 36% of the program time.
The question is: why not ~100% of the time? Why does the Main method take 40% of the time? I was definitely inside this method during the application's execution from start to end.
Does the profiler divide the result by the number of threads?
On another of my threads I've observed that a method takes 34% of the time.
What does that mean? Does it mean that it runs only 34% of the time, or that it runs almost all the time?
In my opinion, if I have three threads that run in parallel and I sum the method times, I should get 300% (if the application runs for 10 seconds, for example, this means each thread runs for 10 seconds, and with 3 threads that would be 30 seconds in total).
The question is what you are measuring and how you do it. From your question I'm actually unable to reproduce your experience...
A Thread.Sleep() call takes very little time itself. Its task is to call a native WinAPI function that tells the scheduler (responsible for dividing processor time between threads) that the user thread it was called from should not be scheduled at all for the next second. After that, the thread doesn't receive processor time until that second is over.
But the thread does not take up any processor time in that state. I'm not sure how this situation is reported by the profiler.
Here is the code I was experimenting with:
using System;
using System.Threading;

internal class Program
{
    private static int x = 0;

    private static void A()
    {
        // Just to have something in the profiler here
        Console.WriteLine("A");
    }

    private static void Main(string[] args)
    {
        var t = new Thread(() => { while (x == 0) Thread.MemoryBarrier(); });
        t.Start();

        while (true)
        {
            if (DateTime.Now.Millisecond % 3 == 0)
                A();

            Thread.Sleep(1000);
        }
    }
}