Fast I/O in Kotlin - performance

Kotlin for competitive programming suggests the following code for reading console input.
readLine()!!.split(" ").map { str -> str.toInt() } // read space-separated integers from the console
Until now for every competitive problem I've used the same approach and to be honest, it has never disappointed.
But for certain problems where the count of input integers is very large (close to 2 * 10^6) this method is just too slow and results in TLE (Time Limit Exceeded).
Is there even a faster way to read input from console?

If you suspect that the .split() call is the bottleneck, you could explore some of the alternatives in this thread.
If you suspect that the toInt() call is the bottleneck, perhaps you could try parallelizing the streams using the Java 8 stream API. For example:
readLine()!!.split(" ").parallelStream().map { str -> str.toInt() }...
For best performance, you could probably combine the two methods.
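One concrete alternative, as a hedged sketch: java.io.StreamTokenizer reads and parses numbers in a single pass over a buffered stream, avoiding both the intermediate list from split() and the per-token toInt() calls. The function name and the explicit count parameter below are mine, purely for illustration:

```kotlin
import java.io.BufferedReader
import java.io.StreamTokenizer
import java.io.StringReader

// Reads `count` whitespace-separated integers in one pass: no split(), no toInt() per token.
fun readInts(reader: BufferedReader, count: Int): IntArray {
    val st = StreamTokenizer(reader)
    return IntArray(count) {
        st.nextToken()      // advances to the next number token
        st.nval.toInt()     // StreamTokenizer parses numbers into a Double (nval)
    }
}

fun main() {
    // In a contest you would pass System.`in`.bufferedReader() instead of this test input.
    val input = BufferedReader(StringReader("12 34 -5 600"))
    println(readInts(input, 4).joinToString(" "))
}
```

Whether this actually beats split().map { it.toInt() } for your input sizes is something to verify with a benchmark, not assume.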

I believe toInt() conversions are more expensive than split(" ").
Are you sure you need to convert all the strings of the input to Int at the very beginning?
It depends on the task, but sometimes part of these conversions can be avoided.
For instance, if the task is "check that there are no negative numbers in the input", you can convert the strings to Int one by one, and once you meet a negative one there is no need to convert the rest.
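To make that idea concrete, here is a minimal sketch (the function name is mine) that parses lazily with splitToSequence and stops at the first negative number, so the remaining tokens are never converted:

```kotlin
// Returns true as soon as a negative number is seen; because the sequence is lazy,
// tokens after the first match are never split or parsed.
fun containsNegative(line: String): Boolean =
    line.splitToSequence(' ').any { it.toInt() < 0 }

fun main() {
    println(containsNegative("3 1 4 -1 5"))  // true
    println(containsNegative("3 1 4 1 5"))   // false
}
```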

I think that JMH could be useful here. You can run a benchmark similar to the one below and try to identify your bottlenecks.
Note that this is in Mode.SingleShotTime, and so emulates the scenario where the JIT has little opportunity to do its thing.
import org.openjdk.jmh.annotations.*
import java.util.concurrent.TimeUnit
import kotlin.random.Random

//@BenchmarkMode(Mode.AverageTime)
@BenchmarkMode(Mode.SingleShotTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
open class SplitToInt {
    val count = 2 * 1_000_000

    lateinit var stringToParse: String
    lateinit var tokensToParse: List<String>

    @Setup(Level.Invocation)
    fun setup() {
        stringToParse = (0..count).map { Random.nextInt(0, 100) }.joinToString(separator = " ")
        tokensToParse = (0..count).map { Random.nextInt(0, 100) }.map { it.toString() }
    }

    @Benchmark
    open fun split() =
        stringToParse.split(" ")

    @Benchmark
    open fun map_toInt() =
        tokensToParse.map { it.toInt() }

    @Benchmark
    open fun split_map_toInt() =
        stringToParse.split(" ").map { it.toInt() }
}
The stats on my machine are:
Benchmark                   Mode  Cnt    Score  Error  Units
SplitToInt.map_toInt          ss        48.666         ms/op
SplitToInt.split              ss       124.899         ms/op
SplitToInt.split_map_toInt    ss       186.981         ms/op
So splitting the string and mapping it to a list of Ints takes ~187 ms. Allowing the JIT to warm up (Mode.AverageTime) gives me:
Benchmark                   Mode  Cnt    Score    Error  Units
SplitToInt.map_toInt        avgt    5   30.670 ±  6.979  ms/op
SplitToInt.split            avgt    5  108.989 ± 23.569  ms/op
SplitToInt.split_map_toInt  avgt    5  120.628 ± 27.052  ms/op
Whether this is fast or slow depends on the circumstances, but are you sure that the input transformation here is the reason you get TLE?
Edit: If you do think that split(" ").map { str -> str.toInt() } is too slow, you could replace creating the two lists (one from split and one from map) with a single list by splitting and transforming in one go. I wrote a quick hack based on kotlin.text.Regex.split that does that, and it is about 20% faster.
If in your use case you need to examine only part of the input, splitToSequence is probably a better option.
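For illustration, here is a single-pass parser along those lines. This is my own sketch, not the Regex.split hack mentioned above, and it assumes well-formed input of non-negative numbers separated by single spaces:

```kotlin
// Parses space-separated non-negative integers in one pass,
// accumulating digits directly instead of materializing substrings.
fun splitToIntsInOneGo(s: String): IntArray {
    val out = ArrayList<Int>()
    var current = 0
    var inNumber = false
    for (c in s) {
        if (c == ' ') {
            if (inNumber) out.add(current)
            current = 0
            inNumber = false
        } else {
            current = current * 10 + (c - '0')  // shift in the next digit
            inNumber = true
        }
    }
    if (inNumber) out.add(current)  // flush the trailing number
    return out.toIntArray()
}
```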

Related

Removing mutability without losing speed

I have a function like this:
fun randomWalk(numSteps: Int): Int {
    var n = 0
    repeat(numSteps) { n += (-1 + 2 * Random.nextInt(2)) }
    return n.absoluteValue
}
This works fine, except that it uses a mutable variable, and I would like to make everything immutable when possible, for better safety and readability. So I came up with an equivalent version that doesn't use any mutable variables:
fun randomWalk_seq(numSteps: Int): Int =
    generateSequence(0) { it + (-1 + 2 * Random.nextInt(2)) }
        .elementAt(numSteps)
        .absoluteValue
This also works fine and produces the same results, but it takes 3 times longer.
I used the following way to measure it:
@OptIn(ExperimentalTime::class)
fun main() {
    val numSamples = 100000
    val numSteps = 15708
    repeat(5) {
        val randomWalkSamples: IntArray
        val duration = measureTime {
            randomWalkSamples = IntArray(numSamples) { randomWalk(numSteps) }
        }
        println(duration)
    }
}
I know it's a bit hacky (I could have used JMH but this is just a quick test - at least I know that measureTime uses a monotonic clock). The results for the iterative (mutable) version:
2.965358406s
2.560777033s
2.554363661s
2.564279403s
2.608323586s
As expected, the first line shows it took a bit longer on the first run due to the warming up of the JIT, but the next 4 lines have fairly small variation.
After replacing randomWalk with randomWalk_seq:
6.636866719s
6.980840906s
6.993998111s
6.994038706s
7.018054467s
Somewhat surprisingly, I don't see any warmup time: the first line always shows a shorter duration than the following four, every time I run this. Also, within each run the durations keep increasing, with line 5 always being the longest.
Can someone explain the findings, and also is there any way of making this function not use any mutable variables but still have performance that is close to the mutable version?
Your solution is slower for two main reasons: boxing and the complexity of the iterator used by generateSequence()'s Sequence implementation.
Boxing happens because a Sequence uses its types generically, so it cannot use primitive 32-bit Ints directly, but must wrap them in classes and unwrap them when retrieving the items.
You can see the complexity of the iterator by Ctrl+clicking the generateSequence function to view the source code.
@Михаил Нафталь's suggestion is faster because it avoids the sequence's complex iterator, but it still has boxing.
I tried writing an overload of sumOf that uses IntProgression directly instead of Iterable<T>, so it won't use boxing, and that resulted in performance equivalent to your imperative code with the var. Since it's inline, when put together with the { -1 + 2 * Random.nextInt(2) } lambda suggested by @Михаил Нафталь, the resulting compiled code will be equivalent to your imperative code.
inline fun IntProgression.sumOf(selector: (Int) -> Int): Int {
    var sum: Int = 0.toInt()
    for (element in this) {
        sum += selector(element)
    }
    return sum
}
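A usage sketch of the idea above. I have renamed the overload to sumOfInt here (a hypothetical name of mine) to sidestep any overload-resolution questions against the stdlib Iterable<T>.sumOf:

```kotlin
import kotlin.math.absoluteValue
import kotlin.random.Random

// Boxing-free sumOf variant specialized for IntProgression (hypothetical name).
inline fun IntProgression.sumOfInt(selector: (Int) -> Int): Int {
    var sum = 0
    for (element in this) sum += selector(element)
    return sum
}

// Random walk without a visible var, built on the specialized sumOfInt.
fun randomWalkNoBoxing(numSteps: Int): Int =
    (1..numSteps).sumOfInt { -1 + 2 * Random.nextInt(2) }.absoluteValue
```

Because the walk is random, only invariants can be asserted: zero steps give zero, and an odd number of ±1 steps always yields an odd distance.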
Ultimately, I don't think you're buying yourself much in the way of code clarity by removing a single var in such a small function. I would say the sequence code is arguably harder to read. vars may add to code complexity in complex algorithms, but I don't think they do in such simple algorithms, especially when there's only one of them and it's local to the function.
Equivalent immutable one-liner is:
fun randomWalk2(numSteps: Int) =
    (1..numSteps).sumOf { -1 + 2 * Random.nextInt(2) }.absoluteValue
Probably even more performant would be to replace
(1..numSteps).sumOf { -1 + 2 * Random.nextInt(2) }
with
-numSteps + 2 * (1..numSteps).sumOf { Random.nextInt(2) }
so that you'll have one multiplication and n additions instead of n multiplications and (2*n - 1) additions:
fun randomWalk3(numSteps: Int) =
    (-numSteps + 2 * (1..numSteps).sumOf { Random.nextInt(2) }).absoluteValue
Update
As @Tenfour04 noted, there is no specific stdlib implementation of IntProgression.sumOf, so it resolves to Iterable<T>.sumOf, which adds unnecessary overhead for Int boxing.
So, it's better to use IntArray here instead of IntProgression:
fun randomWalk4(numSteps: Int) =
    (-numSteps + 2 * IntArray(numSteps).sumOf { Random.nextInt(2) }).absoluteValue
I still encourage you to check all of this with JMH.
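The algebraic rewrite above (pulling the constants out of the sum) can be sanity-checked deterministically. In the sketch below, xs stands in for the sequence of Random.nextInt(2) draws, so both forms can be compared on fixed data:

```kotlin
// Original form: sum of (-1 + 2*x) over the draws x.
fun walkDirect(xs: List<Int>): Int = xs.sumOf { -1 + 2 * it }

// Rewritten form: -n + 2 * sum(x), with a single multiplication.
fun walkRewritten(xs: List<Int>): Int = -xs.size + 2 * xs.sum()
```

For any fixed list of 0/1 draws the two functions agree, which is exactly the identity randomWalk3 relies on.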
I think "Removing mutability without losing speed" is the wrong title, because mutability exists to support the flow the program wants to achieve.
You are using a var inside a function, and this var can never be changed from outside the function; that is local, contained mutability.
If we got rid of var everywhere, why would we need it in programming at all?

Do Java 8 streams produce slower code than plain imperative loops?

There is too much hype about functional programming and particularly the new Java 8 streams API. It is advertised as a good replacement for good old loops and the imperative paradigm.
Indeed, sometimes it can look nice and do the job well. But what about performance?
E.g. here is a good article about that: Java 8: No more loops
Using a loop you can do all the work in a single iteration. But with the new stream API you chain multiple operations, which makes it much slower (is that right?).
Look at their first sample. The loop will not even walk through the whole array in most cases. However, to do the filtering with the new stream API, it seems you have to cycle through the whole array to filter out all candidates before you can get the first one.
The article mentions some laziness:
We first use the filter operation to find all articles that have the Java tag, then used the findFirst() operation to get the first occurrence. Since streams are lazy and filter returns a stream, this approach only processes elements until it finds the first match.
What does the author mean by that laziness?
I did a simple test, and it shows that the good old loop solution works 10x faster than the stream approach.
public void test() {
    List<String> list = Arrays.asList(
            "First string", "Second string", "Third string", "Good string",
            "Another", "Best", "Super string", "Light", "Better",
            "For string", "Not string", "Great", "Super change", "Very nice",
            "Super cool", "Nice", "Very good", "Not yet string",
            "Let's do the string", "First string", "Low string", "Big bunny",
            "Superstar", "Last");

    long start = System.currentTimeMillis();
    for (int i = 0; i < 100000000; i++) {
        getFirstByLoop(list);
    }
    long end = System.currentTimeMillis();
    System.out.println("Loop: " + (end - start));

    start = System.currentTimeMillis();
    for (int i = 0; i < 100000000; i++) {
        getFirstByStream(list);
    }
    end = System.currentTimeMillis();
    System.out.println("Stream: " + (end - start));
}
public String getFirstByLoop(List<String> list) {
    for (String s : list) {
        if (s.endsWith("string")) {
            return s;
        }
    }
    return null;
}

public Optional<String> getFirstByStream(List<String> list) {
    return list.stream().filter(s -> s.endsWith("string")).findFirst();
}
The results were:
Loop: 517
Stream: 5790
BTW, if I use String[] instead of List, the difference is even bigger, almost 100x!
QUESTION: Should I use the old imperative loop approach if I'm looking for the best code performance? Is the FP paradigm just about making code "more concise and readable", and not about performance?
OR
is there something I missed, and the new stream API can be at least as efficient as the imperative loop approach?
QUESTION: Should I use the old imperative loop approach if I'm looking for the best code performance?
Right now, probably yes. Various benchmarks seem to suggest that streams are slower than loops for most tests. Though not catastrophically slower.
Counter examples:
In some cases, parallel streams can give a useful speed up.
Lazy streams can provide performance benefits for some problems; see http://java.amitph.com/2014/01/java-8-streams-api-laziness.html
It is possible to do equivalent things with loops, but not with just loops alone; you would need additional machinery.
But the bottom line is that performance is complicated and streams are not (yet) a magic bullet for speeding up your code.
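As a small hedged sketch of the first counter-example (Kotlin calling the Java stream API; the input size is illustrative): a sequential and a parallel stream compute the same reduction, and on large inputs the parallel one may finish sooner on multi-core hardware. The result is always identical, only the scheduling differs:

```kotlin
// Same reduction, sequential vs. parallel; both sum 1..n as Long to avoid overflow.
fun sumsAgree(n: Int): Pair<Long, Long> {
    val nums = (1..n).toList()
    val seqSum = nums.stream().mapToLong { it.toLong() }.sum()
    val parSum = nums.parallelStream().mapToLong { it.toLong() }.sum()
    return seqSum to parSum
}

fun main() {
    val (s, p) = sumsAgree(1_000_000)
    println(s == p)
}
```

Whether the parallel version is actually faster here depends on core count and per-element cost; for a cheap operation like addition, fork/join overhead can dominate.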
Is the FP paradigm just about making code "more concise and readable", and not about performance?
Not exactly. It is certainly true that the FP paradigm is more concise and (to someone who is familiar with it) more readable.
However, by expressing the computation using the FP paradigm, you are also expressing it in a way that could potentially be optimized in ways that are much harder to achieve with code expressed using loops and assignment. FP code is also more amenable to formal methods, i.e. formal proof of correctness.
(In the context of this discussion of streams, "could be optimized" means in some future Java release.)
Laziness is about how elements are taken from the source of the stream: on demand. If more elements are needed, they are taken; otherwise they are not. Here is an example:
Arrays.asList(1, 2, 3, 4, 5)
        .stream()
        .peek(x -> System.out.println("before filter : " + x))
        .filter(x -> x > 2)
        .peek(System.out::println)
        .anyMatch(x -> x > 3);
Notice how each element goes through the entire pipeline of stages; that is, filter is applied to one element at a time, not to all of them at once, which is why filter can return a Stream<Integer>. This allows the stream to short-circuit: anyMatch does not even process 5, since there is no need to.
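The short-circuiting claim can be checked directly. Counting how many elements reach the first peek shows that 5 is never pulled from the source (sketched in Kotlin over the same pipeline):

```kotlin
import java.util.concurrent.atomic.AtomicInteger

// Runs the pipeline from the answer and reports (anyMatch result, elements pulled).
fun shortCircuitDemo(): Pair<Boolean, Int> {
    val pulled = AtomicInteger(0)
    val matched = listOf(1, 2, 3, 4, 5).stream()
        .peek { pulled.incrementAndGet() }  // counts elements drawn from the source
        .filter { it > 2 }
        .anyMatch { it > 3 }
    // 1 and 2 fail the filter; 3 passes but fails anyMatch; 4 passes and matches.
    return matched to pulled.get()
}

fun main() {
    println(shortCircuitDemo())
}
```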
Just notice that not all intermediate operations are lazy. For example, sorted and distinct are not; these are called stateful intermediate operations. Think about it this way: to actually sort elements you do need to traverse the entire source. One more example that is not intuitive is flatMap, but this is not guaranteed and is seen more as a bug; more to read here.
Speed depends on how you measure. Measuring micro-benchmarks in Java is not easy, and the de facto tool for that is JMH; you can try that out. There are numerous posts here on SO showing that streams are indeed slower (which is normal, they carry infrastructure), but the difference is not big enough to actually care about in most cases.

Reliably measure JVM allocations

I have two implementations of the same algorithm. I would like to verify that neither of them uses more memory than necessary, or, in other words, that they allocate exactly the same number of objects.
My current solution is to measure the number of allocated bytes before and after the procedures via threadMXBean.getThreadAllocatedBytes(threadId) and use that as an approximation of the memory footprint.
The problem is that this method is unstable, i.e. sometimes it returns a much greater number than it should. This especially shows on algorithms that don't allocate objects. One problematic example is a method that sums an int[].
Actual code (Kotlin):
class MemAllocationTest {
    private val threadMXBean = (ManagementFactory.getThreadMXBean() as? com.sun.management.ThreadMXBean)
        ?: throw RuntimeException("Runtime does not support com.sun.management.ThreadMXBean")

    /**
     * May run [block] several times
     */
    private inline fun measureAllocatedBytes(block: () -> Unit): Long {
        val threadId = Thread.currentThread().id
        val before = threadMXBean.getThreadAllocatedBytes(threadId)
        block()
        val after = threadMXBean.getThreadAllocatedBytes(threadId)
        return after - before
    }
    ....
Is there a better solution?
(I don't know how to do that with JMH, but IMHO this is a very close topic)
JMH has the -prof gc profiler, which is supposed to be accurate for allocation profiling. Although it uses the same ThreadMXBean under the covers, it can filter out warmup effects and average the hiccups out over multiple @Benchmark invocations. The typical errors are within 0.001 byte/op there.
My current solution is to collect statistics with several runs:
private inline fun stabiliseMeasureAllocatedBytes(block: () -> Unit): Long {
    val runs = List(7) { measureAllocatedBytes(block) }
    val results = runs.drop(2) // skip warm-up
    val counts = results.groupingBy { it }.eachCount()
    val (commonResult, commonCount) = counts.entries.maxBy { (result, count) -> count }!!
    if (commonCount >= results.size / 2)
        return commonResult
    else
        throw RuntimeException("Allocation measurements vary too much: $runs")
}

Performance Difference Using Update Operation on a Mutable Map in Scala with Large Data

I would like to know whether an update operation on a mutable map performs better than reassignment.
Let's assume I have the following Map:
val m = Map(1 -> Set("apple", "banana"),
            2 -> Set("banana", "cabbage"),
            3 -> Set("cabbage", "dumplings"))
which I would like to reverse into this map:
Map("apple" -> Set(1),
    "banana" -> Set(1, 2),
    "cabbage" -> Set(2, 3),
    "dumplings" -> Set(3))
The code to do so is:
def reverse(m: Map[Int, Set[String]]) = {
  var rm = Map[String, Set[Int]]()
  m.keySet foreach { k =>
    m(k) foreach { e =>
      rm = rm + (e -> (rm.getOrElse(e, Set()) + k))
    }
  }
  rm
}
Would it be more efficient to use the update operator on a map if it is very large in size?
The code using update on the map is as follows:
def reverse(m: Map[Int, Set[String]]) = {
  var rm = scala.collection.mutable.Map[String, Set[Int]]()
  m.keySet foreach { k =>
    m(k) foreach { e =>
      rm.update(e, (rm.getOrElse(e, Set()) + k))
    }
  }
  rm
}
I ran some tests using Rex Kerr's Thyme utility.
First I created some test data.
val rndm = new util.Random
val dna = Seq('A', 'C', 'G', 'T')
val m = (1 to 4000).map(_ -> Set(rndm.shuffle(dna).mkString,
                                 rndm.shuffle(dna).mkString)).toMap
Then I timed some runs with both the immutable.Map and mutable.Map versions. Here's an example result:
Time: 2.417 ms  95% CI 2.337 ms - 2.498 ms  (n=19)  // immutable
Time: 1.618 ms  95% CI 1.579 ms - 1.657 ms  (n=19)  // mutable
Time: 2.278 ms  95% CI 2.238 ms - 2.319 ms  (n=19)  // functional version
As you can see, using a mutable Map with update() has a significant performance advantage.
Just for fun I also compared these results with a more functional version of a Map reverse (or what I call a Map inverter). No var or any mutable type involved.
m.flatMap{ case (k, vs) => vs.map((_, k)) }
 .groupBy(_._1)
 .mapValues(_.map(_._2).toSet)
This version consistently beat your immutable version but still doesn't come close to the mutable timings.
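For readers coming from the Kotlin sections above, the same grouping-based inversion can be sketched in Kotlin (the function name is mine; this mirrors the Scala flatMap/groupBy approach, it is not code from this thread):

```kotlin
// Inverts Map<Int, Set<String>> to Map<String, Set<Int>> without mutable state.
fun invert(m: Map<Int, Set<String>>): Map<String, Set<Int>> =
    m.entries
        .flatMap { (k, vs) -> vs.map { v -> v to k } }  // flatten to (value, key) pairs
        .groupBy({ it.first }, { it.second })           // group keys by value
        .mapValues { (_, ks) -> ks.toSet() }
```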
The trade-off between mutable and immutable collections usually boils down to this:
immutable collections are safer to share and allow structural sharing
mutable collections have better performance
Some time ago I compared the performance of mutable and immutable Maps in Scala, and the difference was about 2 to 3 times in favor of mutable ones.
So, when performance is not critical, I usually go with immutable collections for safety and readability.
For example, in your case functional "scala way" of performing this transformation would be something like this:
m.view
 .flatMap(x => x._2.map(_ -> x._1)) // flatten map to lazy view of String->Int pairs
 .groupBy(_._1)                     // group pairs by String part
 .mapValues(_.map(_._2).toSet)      // extract all Int parts into Set
Although I used a lazy view to avoid creating intermediate collections, groupBy still internally creates a mutable map (you may want to check its sources; the logic is pretty similar to what you wrote), which in turn gets converted to an immutable Map, which is then discarded by mapValues.
Now, if you want to squeeze out every bit of performance, you want to use mutable collections and do as few updates of immutable collections as possible.
For your case this means having a Map of mutable Sets as your intermediate buffer:
def transform(m: Map[Int, Set[String]]): Map[String, Set[Int]] = {
  val accum: Map[String, mutable.Set[Int]] =
    m.valuesIterator.flatten.map(_ -> mutable.Set[Int]()).toMap
  for ((k, vals) <- m; v <- vals) {
    accum(v) += k
  }
  accum.mapValues(_.toSet)
}
Note, I'm not updating accum once it's created: I'm doing exactly one map lookup and one set update for each value, while in both your examples there was additional map update.
I believe this code is reasonably optimal performance-wise. I didn't perform any tests myself, but I highly encourage you to do that on your real data and post the results here.
Also, if you want to go even further, you might want to try a mutable BitSet instead of Set[Int]. If the ints in your data are fairly small, it might yield a minor performance increase.
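A hedged sketch of that BitSet idea, written in Kotlin with java.util.BitSet (the function name is mine). It is only attractive when the Int keys are small and dense, since a BitSet's size grows with the largest bit set:

```kotlin
import java.util.BitSet

// Inverts the map using one BitSet per String: a set bit i means key i maps to that String.
fun invertWithBitSet(m: Map<Int, Set<String>>): Map<String, Set<Int>> {
    val accum = HashMap<String, BitSet>()
    for ((k, vs) in m) {
        for (v in vs) accum.getOrPut(v) { BitSet() }.set(k)
    }
    // Convert each BitSet back to a Set<Int> for the final result.
    return accum.mapValues { (_, bits) -> bits.stream().toArray().toSet() }
}
```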
Just using @Aivean's method in a functional way:
def transform(mp: Map[Int, Set[String]]) = {
  val accum = mp.values.flatten
    .toSet.map((_ -> scala.collection.mutable.Set[Int]())).toMap
  mp.map { case (k, vals) => vals.map(v => accum(v) += k) }
  accum.mapValues(_.toSet)
}

Do Lambda Expressions in Java 8 reduce execution time?

I am new to Java 8 and getting a bit confused about lambda expressions. I read some articles claiming that lambda expressions reduce execution time, so to verify that I wrote the following two programs:
1) Without using a lambda expression
import java.util.*;

public class testing_without_lambda
{
    public static void main(String args[])
    {
        long startTime = System.currentTimeMillis();
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6);
        for (int number : numbers)
        {
            System.out.println(number);
        }
        long stopTime = System.currentTimeMillis();
        System.out.print("without lambda:");
        System.out.println(stopTime - startTime);
    } // end main
}
output:
2) Using a lambda expression
import java.util.*;

public class testing_with_lambda
{
    public static void main(String args[])
    {
        long startTime = System.currentTimeMillis();
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6);
        numbers.forEach((Integer value) -> System.out.println(value));
        long stopTime = System.currentTimeMillis();
        System.out.print("with lambda:");
        System.out.print(stopTime - startTime);
    } // end main
}
output:
Does this mean a lambda expression requires more time to execute?
There is no general statement about “execution time” possible, as even the term “execution time” doesn’t always mean the same thing. Of course, there is no reason why just using a lambda expression should reduce the execution time in general.
Your code is measuring the initialization time of the code plus its execution time, which is fair when you consider the total execution time of that tiny program, but for real-life applications it has no relevance, as they usually run significantly longer than their initialization time.
What makes the drastic difference in initialization time is the fact that the JRE uses the Collection API itself internally, so its classes are loaded, initialized, and possibly even optimized to some degree before your application even starts (so you don’t measure its costs). In contrast, it doesn’t use lambda expressions, so your first use of a lambda expression will load and initialize an entire framework behind the scenes.
Since you are usually interested in how code would perform in a real application, where the initialization already happened, you would have to execute the code multiple times within the same JVM to get a better picture. However, allowing the JVM’s optimizer to process the code bears the possibility that it gets over-optimized due to its simpler nature (compared to a real life scenario) and shows too optimistic numbers then. That’s why it’s recommended to use sophisticated benchmark tools, developed by experts, instead of creating your own. Even with these tools, you have to study their documentation to understand and avoid the pitfalls. See also How do I write a correct micro-benchmark in Java?
When you compare the for loop, also known as “external iteration” with the equivalent forEach call, also known as “internal iteration”, the latter does indeed bear the potential of being more efficient, if properly implemented by the particular Collection, but its outcome is hard to predict, as the JVM’s optimizer is good at removing the drawbacks of the other solution. Also, your list is far too small to ever exhibit this difference, if it exists.
It also must be emphasized that this principle is not tied to lambda expressions. You could also implement the Consumer via an anonymous inner class, and in your case, where the example suffers from the first-time initialization cost, the anonymous inner class would be faster than the lambda expression.
In addition to Holger's answer, I want to show how you could have benchmarked your code.
What you're really measuring is the initialization of classes and the system's IO (i.e. class loading and System.out::println).
In order to get rid of these you should use a benchmark framework like JMH. In addition you should measure with multiple list or array sizes.
Then your code may look like this:
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@Measurement(iterations = 10, timeUnit = TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Threads(1)
@Warmup(iterations = 5, timeUnit = TimeUnit.NANOSECONDS)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class MyBenchmark {

    @Param({ "10", "1000", "10000" })
    private int length;

    private List<Integer> numbers;

    @Setup
    public void setup() {
        Random random = new Random(length);
        numbers = random.ints(length).boxed().collect(Collectors.toList());
    }

    @Benchmark
    public void externalIteration(Blackhole bh) {
        for (Integer number : numbers) {
            bh.consume(number);
        }
    }

    @Benchmark
    public void internalIteration(Blackhole bh) {
        numbers.forEach(bh::consume);
    }
}
And the results:
Benchmark                      (length)  Mode  Cnt      Score     Error  Units
MyBenchmark.externalIteration        10  avgt   30     41,002 ±   0,263  ns/op
MyBenchmark.externalIteration      1000  avgt   30   4026,842 ±  71,318  ns/op
MyBenchmark.externalIteration     10000  avgt   30  40423,629 ± 572,055  ns/op
MyBenchmark.internalIteration        10  avgt   30     40,783 ±   0,815  ns/op
MyBenchmark.internalIteration      1000  avgt   30   3888,240 ±  28,790  ns/op
MyBenchmark.internalIteration     10000  avgt   30  41961,320 ± 991,047  ns/op
As you can see there is little to no difference.
I don't think lambda expression code will always be faster in execution; it really depends on the conditions. Can you point me to the article where you read that lambda expressions are faster in execution time?
(They are arguably faster while writing the code, due to the functional style.)
I ran your test again locally and found something strange:
The first time I ran the code without the lambda, it took almost the same time, rather a bit more than with the lambda (49 milliseconds).
From the second time onward, the code without the lambda ran very much faster, i.e. in 1 millisecond.
The code with the lambda expression runs in the same amount of time every time; I tried around 3-4 runs in total.
General Results:
1
2
3
4
5
6
with lambda:47
1
2
3
4
5
6
without lambda:1
I think it would really take a very large sample of numbers to test this, and also multiple calls to the same code to remove any initialization burden on the JVM. This is a pretty small test sample.
