Functional approach ~100 times slower than sequential in Kotlin?

So I picked up Project Euler again to play with Kotlin a bit. For those who do not know, Project Euler is a site with programming exercises ranging from banal to pretty hard.
Anyway the very first problem is this:
If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.
Find the sum of all the multiples of 3 or 5 below 1000.
I solved this both sequentially and functionally since I wanted to see how they compared:
class ProjectEuler1 {
    companion object {
        fun sequential(): Int {
            var sum = 0
            for (i in 1..999)
                if (i % 3 == 0 || i % 5 == 0)
                    sum += i
            return sum
        }

        fun functional(): Int = (1..999).filter { it % 3 == 0 || it % 5 == 0 }.sum()

        fun benchmark() {
            var solSequential = -1
            val tSequential = measureExactTimeMillis { solSequential = sequential() }
            var solFunctional = -1
            val tFunctional = measureExactTimeMillis { solFunctional = functional() }
            println("""
                +---
                |${this::class.java.canonicalName.replace(".Companion", "")}
                +---
                |Sequential solution: $solSequential
                |Sequential running time: ${tSequential.round(6)} ms
                +---
                |Functional solution: $solFunctional
                |Functional running time: ${tFunctional.round(6)} ms
                +---
            """.trimIndent())
        }
    }
}
measureExactTimeMillis is just measureNanoTime divided by 1_000_000, returning a Double.
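For reference, minimal sketches of those two helpers might look like this (these are assumptions based on the description above, not the original code):

import kotlin.system.measureNanoTime

// Sketch of the timing helper described above: measureNanoTime scaled to
// milliseconds as a Double. The round(decimals) extension used in the
// output formatting is sketched too; both are assumptions.
inline fun measureExactTimeMillis(block: () -> Unit): Double =
    measureNanoTime(block) / 1_000_000.0

fun Double.round(decimals: Int): Double {
    var factor = 1.0
    repeat(decimals) { factor *= 10 }
    return Math.round(this * factor) / factor
}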
Anyway, after running this several times the output was always more or less this:
+---
|ProjectEuler1
+---
|Sequential solution: 233168
|Sequential running time: 0.6097 ms
+---
|Functional solution: 233168
|Functional running time: 33.3555 ms
+---
The sequential, and sadly more verbose, version is 50-100x faster than the simple functional implementation.
What is the reason for this? Shouldn't the performance be at least comparable (as opposed to orders of magnitude apart)? Did I miss something that should have been done to optimize the functional implementation?

Related

Getting an integer by adding the fewest number of elements from a list of integers

Given a list of integers [2, 3], I want to find the best combination of those numbers that adds up to 8. The result should be [3, 3, 2]. The below code works correctly.
fun getBestCombination(targetSum: Int, numbers: Array<Int>): MutableList<Int>? {
    if (targetSum == 0) return mutableListOf()
    if (targetSum < 0) return null
    var bestCombination: MutableList<Int>? = null
    for (number in numbers) {
        val newTarget = targetSum - number
        val result = getBestCombination(newTarget, numbers)
        result?.let {
            it.add(number)
            if (it.size < bestCombination?.size ?: it.size + 1) {
                bestCombination = it
            }
        }
    }
    return bestCombination
}
This code produces the result [3, 3, 2], which is correct.
But the time complexity of the above code is exponential. When I try to cache the results from repeated recursive nodes, it doesn't work: the below code produces [3, 3, 2, 2, 3], and I can't figure out why.
fun getBestCombinationOptimized(
    targetSum: Int,
    numbers: Array<Int>,
    memory: HashMap<Int, MutableList<Int>?> = hashMapOf()
): MutableList<Int>? {
    // Looking in the stored results
    if (memory.containsKey(targetSum)) return memory[targetSum]
    if (targetSum == 0) return mutableListOf()
    if (targetSum < 0) return null
    var bestCombination: MutableList<Int>? = null
    for (number in numbers) {
        val newTarget = targetSum - number
        val result = getBestCombinationOptimized(newTarget, numbers, memory)
        result?.let {
            it.add(number)
            if (it.size < bestCombination?.size ?: it.size + 1) {
                bestCombination = it
            }
        }
    }
    // Caching the result
    memory[targetSum] = bestCombination
    return bestCombination
}
Your problem is known as the Subset Sum with Repetitions Problem, which is NP-complete. As such, it is highly unlikely you will find a worst-case polynomial time algorithm for it.
Here is pseudocode for a working solution to your specific case:
n = 8
dist = (INF, INF, 0, 0, ..., 0)   /* size n + 1 */
last = (0, 0, ..., 0)             /* size n + 1 */

// dynamic programming step: filling the arrays
for i = 4, ..., n:
    if dist[i - 2] < dist[i - 3]:
        dist[i] = 1 + dist[i - 2]
        last[i] = i - 2
    else:
        dist[i] = 1 + dist[i - 3]
        last[i] = i - 3

// walking back through the solution
while n != 2 and n != 3:
    if n - last[n] == 2:
        print(2)
        n = n - 2
    else:
        print(3)
        n = n - 3
print(n)

OUTPUT: 3 3 2
The idea is to fill the arrays for all the numbers from 2 to n (in your case, n = 8), storing the "distance" (number of summands) in dist and the previous step in last, which is then used to recover the path that reaches n.
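For completeness, a hedged Kotlin translation of that pseudocode, generalized to an arbitrary set of numbers (the function name and signature are illustrative):

// dist[i] = fewest numbers from `numbers` summing to i (INF if impossible);
// last[i] = the number used in the final step, so the path can be replayed.
fun fewestNumbers(target: Int, numbers: IntArray): List<Int>? {
    val INF = Int.MAX_VALUE / 2
    val dist = IntArray(target + 1) { INF }.also { it[0] = 0 }
    val last = IntArray(target + 1)
    // dynamic programming step: fill the arrays
    for (i in 1..target) {
        for (num in numbers) {
            if (num <= i && dist[i - num] + 1 < dist[i]) {
                dist[i] = dist[i - num] + 1
                last[i] = num
            }
        }
    }
    if (dist[target] >= INF) return null // target not reachable
    // walk back through the solution
    val result = mutableListOf<Int>()
    var n = target
    while (n > 0) {
        result.add(last[n])
        n -= last[n]
    }
    return result
}

For the question's case, fewestNumbers(8, intArrayOf(2, 3)) returns the same three-summand combination, [2, 3, 3].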
I finally found the problem in my code. It was in the part where I wrote bestCombination = it. Here I am assigning the same list object reference (it) over and over, every time a new fewest-element combination comes up. As a result I keep adding elements to the exact same list. What I really needed to do was to copy the elements and assign the copy to bestCombination, so that I stop mutating the same list object.
result?.let {
    it.add(number)
    if (it.size < bestCombination?.size ?: it.size + 1) {
        //--------- CULPRIT ----------//
        bestCombination = it
        //----------------------------//
    }
}
A correct way would be:
bestCombination = it.toMutableList()
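For completeness, a sketch of the full corrected function; it goes one step further than the one-line fix and copies result before any mutation, so the lists stored in memory are never modified by their callers (the extra copy and the Fixed name are illustrative assumptions, not part of the one-line fix):

fun getBestCombinationFixed(
    targetSum: Int,
    numbers: Array<Int>,
    memory: HashMap<Int, MutableList<Int>?> = hashMapOf()
): MutableList<Int>? {
    if (memory.containsKey(targetSum)) return memory[targetSum]
    if (targetSum == 0) return mutableListOf()
    if (targetSum < 0) return null
    var bestCombination: MutableList<Int>? = null
    for (number in numbers) {
        val result = getBestCombinationFixed(targetSum - number, numbers, memory)
        if (result != null) {
            val candidate = result.toMutableList() // copy: never mutate a shared list
            candidate.add(number)
            if (candidate.size < (bestCombination?.size ?: Int.MAX_VALUE)) {
                bestCombination = candidate
            }
        }
    }
    memory[targetSum] = bestCombination
    return bestCombination
}

With this version getBestCombinationFixed(8, arrayOf(2, 3)) returns [3, 3, 2] and the memoized lists stay intact.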
I was such an idiot. Thank you EVERYONE for spending your time on this silly mistake.

ArrayIndexOutOfBoundsException in Knapsack Scala

I am trying to solve the knapsack problem in Scala using dynamic programming. As part of the requirement I also need to show which items are picked to fill the knapsack, but I am getting an ArrayIndexOutOfBoundsException.
In my code, availableMoney is equivalent to the knapsack capacity, product.channels is equivalent to value[] in knapsack, and product.price is equivalent to weight[] in knapsack. So far what I have looks like this:
import scala.math.max

def knapSack(availableMoney: Int, products: List[Product]): Int = {
  var wt = List[Int](products.length)
  var value = List[Int](products.length)
  for (product <- products) {
    value ::= product.channels.length
    wt ::= product.price
  }
  val matrix = Array.fill(2, 2)(0)
  val picks = Array.fill(2, 2)(0)
  for (i <- 1 to products.length) {
    for (j <- 0 to availableMoney) {
      if (wt(i - 1) <= j) {
        matrix(i)(j) = max(matrix(i - 1)(j), value(i - 1) + matrix(i - 1)(j - wt(i - 1)))
        if (value(i - 1) + matrix(i - 1)(j - wt(i - 1)) > matrix(i - 1)(j))
          picks(i)(j) = 1
        else
          picks(i)(j) = -1
      } else {
        picks(i)(j) = -1
        matrix(i)(j) = matrix(i - 1)(j)
      }
    }
  }
  matrix(products.length)(availableMoney)
}
There are a couple of issues, I think:
- j runs from 0 to availableMoney and is then used as an index into picks and matrix, which have been initialised to specific sizes, so if availableMoney exceeds those dimensions, it will fail.
- i runs from 1 to products.length but is also used as an index into picks and matrix, so it will miss 0, and if there are more products than the second dimension size, it will fail.
Use some println debugging to check more closely what is going on. Looks like an interesting algorithm. Post us a solution once you get it working :)
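In the meantime, a hedged sketch of the same table-filling DP with the tables sized as the two points above suggest; it is written in Kotlin with hypothetical plain-array inputs (value[i] standing in for product.channels.length and wt[i] for product.price):

// Sketch: both tables get (n + 1) x (capacity + 1) entries, so every index
// i in 0..n and j in 0..capacity used by the loops below is in range.
fun knapsack(capacity: Int, wt: IntArray, value: IntArray): Int {
    val n = wt.size
    val matrix = Array(n + 1) { IntArray(capacity + 1) } // best value per (items, budget)
    val picks = Array(n + 1) { IntArray(capacity + 1) }  // 1 = item taken, -1 = skipped
    for (i in 1..n) {
        for (j in 0..capacity) {
            if (wt[i - 1] <= j) {
                val take = value[i - 1] + matrix[i - 1][j - wt[i - 1]]
                if (take > matrix[i - 1][j]) {
                    matrix[i][j] = take
                    picks[i][j] = 1
                } else {
                    matrix[i][j] = matrix[i - 1][j]
                    picks[i][j] = -1
                }
            } else {
                matrix[i][j] = matrix[i - 1][j]
                picks[i][j] = -1
            }
        }
    }
    return matrix[n][capacity]
}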

Scheduling algorithm in linear complexity

Here's the problem:
We have n tasks to complete in n days, and we can complete a single task per day. Each task has a start date and an end date: we can't start a task before its start date, and it has to be completed no later than its end date. So, given a vector s of start dates and a vector e of end dates, produce a vector d, if one exists, where d[i] is the day on which you do task i. For example:
s = {1, 0, 1, 2, 0}
e = {2, 4, 2, 3, 1}
+--------+------------+----------+
| Task # | Start Date | End Date |
+--------+------------+----------+
| 0 | 1 | 2 |
| 1 | 0 | 4 |
| 2 | 1 | 2 |
| 3 | 2 | 3 |
| 4 | 0 | 1 |
+--------+------------+----------+
We have as a possible solution:
d = {1, 4, 2, 3, 0}
+--------+----------------+
| Task # | Date Completed |
+--------+----------------+
| 0 | 1 |
| 1 | 4 |
| 2 | 2 |
| 3 | 3 |
| 4 | 0 |
+--------+----------------+
It is trivial to create an algorithm that runs in O(n^2). It is not too bad either to create one that runs in O(n log n). There is supposedly an algorithm that gives a solution in O(n). What would that be?
When you can't use time, use space! You can represent the tasks open on any day using a bit vector, and in O(n) create a "starting today" array. You can also represent the tasks ending soonest using another bit vector, also computable in O(n). Then, finally, in O(n) again, scan each day, adding in any tasks starting that day and picking the lowest-numbered task open that day, giving priority to the ones ending soonest.
using System.IO;
using System;

class Program
{
    static void Main()
    {
        int n = 5;
        var s = new int[] { 1, 0, 1, 2, 0 };
        var e = new int[] { 2, 4, 2, 3, 1 };
        var sd = new int[n];
        var ed = new int[n];
        for (int task = 0; task < n; task++)
        {
            sd[s[task]] += (1 << task); // bit set on the day this task starts
            ed[e[task]] += (1 << task); // bit set on the day this task ends
        }
        int bits = 0;
        // Track the earliest-ending tasks
        var ends = new int[n];
        for (int day = n - 1; day >= 0; day--)
        {
            if (ed[day] != 0) // task(s) ending today
            {
                // replace any tasks that have later end dates
                bits = ed[day];
            }
            ends[day] = bits;
            bits = bits ^ sd[day]; // remove any starting
        }
        var d = new int[n];
        bits = 0;
        for (int day = 0; day < n; day++)
        {
            bits |= sd[day]; // add any starting
            int lowestBit;
            if ((ends[day] != 0) && ((bits & ends[day]) != 0))
            {
                // Have early-ending tasks to deal with
                // and have not dealt with them yet
                int tb = bits & ends[day];
                lowestBit = tb & (-tb);
                if (lowestBit == 0) throw new Exception("Fail");
            }
            else
            {
                lowestBit = bits & (-bits);
            }
            int task = (int)Math.Log(lowestBit, 2);
            d[task] = day;
            bits = bits - lowestBit; // remove task
        }
        Console.WriteLine(string.Join(", ", d));
    }
}
Result in this case is: 1, 4, 2, 3, 0 as expected.
Correct me if I'm wrong, but I believe that the name for such problems would be Interval Scheduling.
From your post, I assume that you are not looking for the Optimal Schedule and that you're simply looking to find any solution within O(n).
The problem here is that sorting will take O(n log n) and computing will also take O(n log n). You can try to do it in one step:
Define V - a vector of 'time taken' for each task.
Define P - a vector of tasks which finish before a given task,
i.e. if Task 2 finishes before Task 3, then P[3] = 2.
Of course, as you can see, the main computation is involved in finding P[j]. You always take the latest-ending non-overlapping task.
Thus, to find P[i], find a task which ends before the i-th task. If more than one exists, pick the one which ends last.
Define M - a vector containing the resultant slots.
The algorithm:
WIS(n):
    M[0] <- 0
    for j <- 1 to n:
        M[j] = max(V[j] + M[P[j]], M[j - 1])
    return M[n]
Sort by end date, then by start date. Then process the tasks in sorted order, on each day executing only a task whose start date has already arrived.
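A hedged Kotlin sketch of one way to realize that idea (the min-heap on end dates is an illustrative choice, and this runs in O(n log n) rather than O(n)):

import java.util.PriorityQueue

// Each day, release every task whose start date has arrived, then run the
// released task with the earliest end date (earliest-deadline-first).
fun schedule(s: IntArray, e: IntArray): IntArray? {
    val n = s.size
    val byStart = (0 until n).sortedBy { s[it] }             // task ids by start date
    val ready = PriorityQueue<Int>(compareBy<Int> { e[it] }) // earliest end date first
    val d = IntArray(n)
    var next = 0
    for (day in 0 until n) {
        while (next < n && s[byStart[next]] <= day) ready.add(byStart[next++])
        val task = ready.poll() ?: return null // no task can run today
        if (e[task] < day) return null         // its end date has already passed
        d[task] = day
    }
    return d
}

For the s and e from the example above this yields a valid schedule; ties between tasks with equal end dates may be broken either way, so the exact vector can differ from the sample d.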

Using Grand Central Dispatch in Swift to parallelize and speed up “for" loops?

I am trying to wrap my head around how to use GCD to parallelize and speed up Monte Carlo simulations. Most/all simple examples are presented for Objective-C, and I really need a simple example for Swift, since Swift is my first "real" programming language.
The minimal working version of a Monte Carlo simulation in Swift would be something like this:
import Foundation
import Cocoa

var winner = 0
var j = 0
var i = 0
var chance = 0
var points = 0
for j = 1; j < 1000001; ++j {
    var ability = 500
    var player1points = 0
    for i = 1; i < 1000; ++i {
        chance = Int(arc4random_uniform(1001))
        if chance < (ability - points) { ++points }
        else { points = points - 1 }
    }
    if points > 0 { ++winner }
}
println(winner)
The code works when pasted directly into a command-line program project in Xcode 6.1.
The innermost loop cannot be parallelized, because the new value of the variable points is used in the next iteration. But the outermost loop just runs the innermost simulation 1000000 times and tallies up the results, so it should be an ideal candidate for parallelization.
So my question is how to use GCD to parallelize the outermost for loop?
A "multi-threaded iteration" can be done with dispatch_apply():
let outerCount = 100    // # of concurrent block iterations
let innerCount = 10000  // # of iterations within each block
let the_queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0)
dispatch_apply(UInt(outerCount), the_queue) { outerIdx -> Void in
    for innerIdx in 1 ... innerCount {
        // ...
    }
}
(You have to figure out the best relation between outer and inner counts.)
There are two things to notice:
- arc4random() uses an internal mutex, which makes it extremely slow when called from several threads in parallel; see Performance of concurrent code using dispatch_group_async is MUCH slower than single-threaded version. From the answers given there, rand_r() (with separate seeds for each thread) seems to be a faster alternative.
- The result variable winner must not be modified from multiple threads simultaneously. You can use an array instead, where each thread updates its own element, and the results are added afterwards. A thread-safe method is described in https://stackoverflow.com/a/26790019/1187415.
Then it would roughly look like this:
let outerCount = 100    // # of concurrent block iterations
let innerCount = 10000  // # of iterations within each block
let the_queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0)
var winners = [Int](count: outerCount, repeatedValue: 0)
winners.withUnsafeMutableBufferPointer { winnersPtr -> Void in
    dispatch_apply(UInt(outerCount), the_queue) { outerIdx -> Void in
        var seed = arc4random() // seed for rand_r() in this "thread"
        for innerIdx in 1 ... innerCount {
            var points = 0
            var ability = 500
            for i in 1 ... 1000 {
                let chance = Int(rand_r(&seed) % 1001)
                if chance < (ability - points) { ++points }
                else { points = points - 1 }
            }
            if points > 0 {
                winnersPtr[Int(outerIdx)] += 1
            }
        }
    }
}
// Add results:
let winner = reduce(winners, 0, +)
println(winner)
Just to update this for contemporary syntax, we now use concurrentPerform (which replaces dispatch_apply).
So we can replace
for j in 0 ..< 1_000_000 {
    for i in 0 ..< 1000 {
        ...
    }
}
With
DispatchQueue.concurrentPerform(iterations: 1_000_000) { j in
    for i in 0 ..< 1000 {
        ...
    }
}
Note, parallelizing introduces a little overhead, both in the basic GCD dispatch mechanism and in the synchronization of the results. If you had 32 iterations in your parallel loop this would be inconsequential, but with a million iterations it will start to add up.
We generally solve this by “striding”: Rather than parallelizing 1 million iterations, you might only do 100 parallel iterations, doing 10,000 iterations each. E.g. something like:
let totalIterations = 1_000_000
let stride = 10_000
let (quotient, remainder) = totalIterations.quotientAndRemainder(dividingBy: stride)
let iterations = quotient + (remainder == 0 ? 0 : 1)
DispatchQueue.concurrentPerform(iterations: iterations) { iteration in
    for j in iteration * stride ..< min(totalIterations, (iteration + 1) * stride) {
        for i in 0 ..< 1000 {
            ...
        }
    }
}

Why is my Scala tail-recursion faster than the while loop?

Here are two solutions to exercise 4.9 in Cay Horstmann's Scala for the Impatient: "Write a function lteqgt(values: Array[Int], v: Int) that returns a triple containing the counts of values less than v, equal to v, and greater than v." One uses tail recursion, the other uses a while loop. I thought that both would compile to similar bytecode but the while loop is slower than the tail recursion by a factor of almost 2. This suggests to me that my while method is badly written.
import scala.annotation.tailrec
import scala.util.Random

object PerformanceTest {

  def main(args: Array[String]): Unit = {
    val bigArray: Array[Int] = fillArray(new Array[Int](100000000))
    println(time(lteqgt(bigArray, 25)))
    println(time(lteqgt2(bigArray, 25)))
  }

  def time[T](block: => T): T = {
    val start = System.nanoTime: Double
    val result = block
    val end = System.nanoTime: Double
    println("Time = " + (end - start) / 1000000.0 + " millis")
    result
  }

  @tailrec def fillArray(a: Array[Int], pos: Int = 0): Array[Int] = {
    if (pos == a.length)
      a
    else {
      a(pos) = Random.nextInt(50)
      fillArray(a, pos + 1)
    }
  }

  @tailrec def lteqgt(values: Array[Int], v: Int, lt: Int = 0, eq: Int = 0, gt: Int = 0, pos: Int = 0): (Int, Int, Int) = {
    if (pos == values.length)
      (lt, eq, gt)
    else
      lteqgt(values, v,
        lt + (if (values(pos) < v) 1 else 0),
        eq + (if (values(pos) == v) 1 else 0),
        gt + (if (values(pos) > v) 1 else 0),
        pos + 1)
  }

  def lteqgt2(values: Array[Int], v: Int): (Int, Int, Int) = {
    var lt = 0
    var eq = 0
    var gt = 0
    var pos = 0
    val limit = values.length
    while (pos < limit) {
      if (values(pos) > v)
        gt += 1
      else if (values(pos) < v)
        lt += 1
      else
        eq += 1
      pos += 1
    }
    (lt, eq, gt)
  }
}
Adjust the size of bigArray according to your heap size. Here is some sample output:
Time = 245.110899 millis
(50004367,2003090,47992543)
Time = 465.836894 millis
(50004367,2003090,47992543)
Why is the while method so much slower than the tailrec one? Naively, the tailrec version looks to be at a slight disadvantage, as it must always perform three "if" checks on every iteration, whereas the while version will often perform only one or two tests thanks to the else construct. (NB: reversing the order in which I run the two methods does not affect the outcome.)
Test results (after reducing array size to 20000000)
Under Java 1.6.22 I get 151 and 122 ms for tail-recursion and while-loop respectively.
Under Java 1.7.0 I get 55 and 101 ms
So under Java 6 your while-loop is actually faster; both have improved in performance under Java 7, but the tail-recursive version has overtaken the loop.
Explanation
The performance difference is due to the fact that in your loop, you conditionally add 1 to the totals, while for recursion you always add either 1 or 0. So they are not equivalent. The equivalent while-loop to your recursive method is:
def lteqgt2(values: Array[Int], v: Int): (Int, Int, Int) = {
  var lt = 0
  var eq = 0
  var gt = 0
  var pos = 0
  val limit = values.length
  while (pos < limit) {
    gt += (if (values(pos) > v) 1 else 0)
    lt += (if (values(pos) < v) 1 else 0)
    eq += (if (values(pos) == v) 1 else 0)
    pos += 1
  }
  (lt, eq, gt)
}
and this gives exactly the same execution time as the recursive method (regardless of Java version).
Discussion
I'm not an expert on why the Java 7 VM (HotSpot) can optimize this better than your first version, but I'd guess it's because it's taking the same path through the code each time (rather than branching along the if / else if paths), so the bytecode can be inlined more efficiently.
But remember that this is not the case in Java 6. Why one while-loop outperforms the other is a question of JVM internals. Happily for the Scala programmer, the version produced from idiomatic tail-recursion is the faster one in the latest version of the JVM.
The difference could also be occurring at the processor level. See this question, which explains how code slows down if it contains unpredictable branching.
The two constructs are not identical. In particular, in the first case you don't need any jumps (on x86, you can use cmp and setle and add, instead of having to use cmp and jb and, if you don't jump, add). Not jumping is faster than jumping on pretty much every modern architecture.
So, if you have code that looks like
if (a < b) x += 1
where you may add or you may jump instead, vs.
x += (a < b)
(which only makes sense in C/C++ where 1 = true and 0 = false), the latter tends to be faster as it can be turned into more compact assembly code. In Scala/Java, you can't do this, but you can do
x += if (a < b) 1 else 0
which a smart JVM should recognize is the same as x += (a < b), which has a jump-free machine code translation, which is usually faster than jumping. An even smarter JVM would recognize that
if (a < b) x += 1
is the same yet again (because adding zero doesn't do anything).
C/C++ compilers routinely perform optimizations like this. Being unable to apply any of these optimizations was not a mark in the JIT compiler's favor; apparently it can as of 1.7, but only partially (i.e. it doesn't recognize that adding zero is the same as a conditional adding one, but it does at least convert x += if (a<b) 1 else 0 into fast machine code).
Now, none of this has anything to do with tail recursion or while loops per se. With tail recursion it's more natural to write the if (a < b) 1 else 0 form, but you can do either; and with while loops you can also do either. It just so happened that you picked one form for tail recursion and the other for the while loop, making it look like recursion vs. looping was the change instead of the two different ways to do the conditionals.
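To make the contrast concrete, here is a small Kotlin sketch of the two conditional-add styles discussed above (illustrative only; what machine code each form becomes is up to the JIT):

// Counts array elements below a threshold in the two styles from the
// discussion above. Semantically identical; historically the second,
// branch-free form has been easier for the JIT to compile without jumps.
fun countBranchy(values: IntArray, v: Int): Int {
    var count = 0
    for (x in values) if (x < v) count += 1        // conditional add: may branch
    return count
}

fun countBranchless(values: IntArray, v: Int): Int {
    var count = 0
    for (x in values) count += if (x < v) 1 else 0 // same path every iteration
    return count
}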
