Performance of F# Array.reduce

I noticed while doing some F# experiments that if I write my own reduce function for Array, it performs much better than the built-in reduce. For example:
type Array with
    static member inline fastReduce f (values : 'T[]) =
        let mutable result = Unchecked.defaultof<'T>
        for i in 0 .. values.Length-1 do
            result <- f result values.[i]
        result
This seems to behave identically to the built-in Array.reduce but is ~2x faster for a simple f.
Is the built-in one more flexible in some way?

By looking at the generated IL code, it's easier to understand what's happening.
Using the built-in Array.reduce:
let reducer (vs : int []) : int = Array.reduce (+) vs
This gives the following equivalent C# (reverse-engineered from the IL code using ILSpy):
public static int reducer(int[] vs)
{
    return ArrayModule.Reduce<int>(new Program.BuiltIn.reducer#31(), vs);
}
Array.reduce looks like this:
public static T Reduce<T>(FSharpFunc<T, FSharpFunc<T, T>> reduction, T[] array)
{
    if (array == null)
    {
        throw new ArgumentNullException("array");
    }
    int num = array.Length;
    if (num == 0)
    {
        throw new ArgumentException(LanguagePrimitives.ErrorStrings.InputArrayEmptyString, "array");
    }
    OptimizedClosures.FSharpFunc<T, T, T> fSharpFunc = OptimizedClosures.FSharpFunc<T, T, T>.Adapt(reduction);
    T t = array[0];
    int num2 = 1;
    int num3 = num - 1;
    if (num3 >= num2)
    {
        do
        {
            t = fSharpFunc.Invoke(t, array[num2]);
            num2++;
        }
        while (num2 != num3 + 1);
    }
    return t;
}
Notice that invoking the reducer function f is a virtual call, which the JIT compiler typically struggles to inline.
Compare to your fastReduce function:
let reducer (vs : int []) : int = Array.fastReduce (+) vs
The reverse-engineered C# code:
public static int reducer(int[] vs)
{
    int num = 0;
    for (int i = 0; i < vs.Length; i++)
    {
        num += vs[i];
    }
    return num;
}
This is a lot more efficient, as the virtual call is gone. It seems that in this case F# inlines both the code for fastReduce and the code for (+).
There's some kind of cut-off in F#: more complex reducer functions won't be inlined. I am unsure of the exact details.
Hope this helps
A side note: Unchecked.defaultof returns null for class types in .NET, such as string. I prefer LanguagePrimitives.GenericZero.
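For illustration, a quick sketch of the difference (my addition, not from the original answer):
// Sketch: defaultof yields null for reference types, while GenericZero
// produces a proper numeric zero (requires a static Zero member on the type).
let z1 : string = Unchecked.defaultof<string>           // null
let z2 : int    = LanguagePrimitives.GenericZero<int>   // 0
let z3 : float  = LanguagePrimitives.GenericZero<float> // 0.0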
PS. A common trick for the truly performance-hungry is to loop down towards 0. In F# that doesn't work for for-expressions because of a slight performance bug in how for-expressions are generated. In those cases you can try to implement the loop using tail recursion.
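A minimal sketch of such a tail-recursive countdown loop (my illustration, not part of the original answer):
// Summing an array by counting down to 0; F# compiles the tail call into a loop.
let sumDownward (values : int[]) =
    let rec loop acc i =
        if i < 0 then acc
        else loop (acc + values.[i]) (i - 1)
    loop 0 (values.Length - 1)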

Related

Paper cut algorithm

I want to create a function to determine the maximum number of pieces of paper that can be cut from a parent sheet.
The formula above is still not optimal: using it produces at most 32 cuts per sheet.
I want it like below.
This seems to be a very difficult problem to solve optimally. See http://lagrange.ime.usp.br/~lobato/packing/ for a discussion of a 2008 paper claiming that the problem is believed (but not proven) to be NP-hard. The researchers found some approximation algorithms and implemented them on that website.
The following solution uses Top-Down Dynamic Programming to find optimal solutions to this problem. I am providing this solution in C#, which shouldn't be too hard to convert into the language of your choice (or whatever style of pseudocode you prefer). I have tested this solution on your specific example and it completes in less than a second (I'm not sure how much less than a second).
It should be noted that this solution assumes that only guillotine cuts are allowed. This is a common restriction for real-world 2D Stock-Cutting applications and it greatly simplifies the solution complexity. However, CS, Math and other programming problems often allow all types of cutting, so in that case this solution would not necessarily find the optimal solution (but it would still provide a better heuristic answer than your current formula).
First, we need a value structure to represent the size of the starting stock, the desired rectangle(s) and the pieces cut from the stock (this needs to be a value type because it will be used as the key to our memoization cache and other collections, and we need to compare the actual values rather than an object reference address):
public struct Vector2D
{
    public int X;
    public int Y;

    public Vector2D(int x, int y)
    {
        X = x;
        Y = y;
    }
}
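As a possible refinement (my addition, not something the original answer does): implementing IEquatable<Vector2D> lets Dictionary compare keys without boxing and makes the value-based key semantics explicit. A sketch:
public struct Vector2D : IEquatable<Vector2D>
{
    public int X;
    public int Y;

    public Vector2D(int x, int y) { X = x; Y = y; }

    // value-based equality and hashing for use as a dictionary key
    public bool Equals(Vector2D other) => X == other.X && Y == other.Y;
    public override bool Equals(object obj) => obj is Vector2D v && Equals(v);
    public override int GetHashCode() => (X * 397) ^ Y;
}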
Here is the main method to be called. Note that all values need to be integers; for the specific case above this just means multiplying everything by 100. These methods require integers, but are otherwise scale-invariant, so multiplying by 100 or 1000 or whatever won't affect performance (just make sure that the values don't overflow an int).
public int SolveMaxCount1R(Vector2D Parent, Vector2D Item)
{
    // make a list to hold both the item size and its rotation
    List<Vector2D> itemSizes = new List<Vector2D>();
    itemSizes.Add(Item);
    if (Item.X != Item.Y)
    {
        itemSizes.Add(new Vector2D(Item.Y, Item.X));
    }
    int solution = SolveGeneralMaxCount(Parent, itemSizes.ToArray());
    return solution;
}
Here is an example of how you would call this method with your parameter values. In this case I have assumed that all of the solution methods are part of a class called SolverClass:
SolverClass solver = new SolverClass();
int count = solver.SolveMaxCount1R(new Vector2D(2500, 3800), new Vector2D(425, 550));
// (all units are in tenths of a millimeter to make everything integers)
The main method calls a general solver method for this type of problem (that is not restricted to just one size rectangle and its rotation):
public int SolveGeneralMaxCount(Vector2D Parent, Vector2D[] ItemSizes)
{
    // determine the maximum x and y scaling factors using GCDs (Greatest
    // Common Divisor)
    List<int> xValues = new List<int>();
    List<int> yValues = new List<int>();
    foreach (Vector2D size in ItemSizes)
    {
        xValues.Add(size.X);
        yValues.Add(size.Y);
    }
    xValues.Add(Parent.X);
    yValues.Add(Parent.Y);
    int xScale = NaturalNumbers.GCD(xValues);
    int yScale = NaturalNumbers.GCD(yValues);

    // rescale our parameters
    Vector2D parent = new Vector2D(Parent.X / xScale, Parent.Y / yScale);
    var baseShapes = new Dictionary<Vector2D, Vector2D>();
    foreach (var size in ItemSizes)
    {
        var reducedSize = new Vector2D(size.X / xScale, size.Y / yScale);
        baseShapes.Add(reducedSize, reducedSize);
    }

    // determine the minimum values that an allowed item shape can fit into
    _xMin = int.MaxValue;
    _yMin = int.MaxValue;
    foreach (var size in baseShapes.Keys)
    {
        if (size.X < _xMin) _xMin = size.X;
        if (size.Y < _yMin) _yMin = size.Y;
    }

    // create the memoization cache for shapes
    Dictionary<Vector2D, SizeCount> shapesCache = new Dictionary<Vector2D, SizeCount>();

    // find the solution pattern with the most finished items
    int best = solveGMC(shapesCache, baseShapes, parent);
    return best;
}

private int _xMin;
private int _yMin;
The general solution method calls a recursive worker method that does most of the actual work.
private int solveGMC(
    Dictionary<Vector2D, SizeCount> shapeCache,
    Dictionary<Vector2D, Vector2D> baseShapes,
    Vector2D sheet)
{
    // have we already solved this size?
    if (shapeCache.ContainsKey(sheet)) return shapeCache[sheet].ItemCount;

    SizeCount item = new SizeCount(sheet, 0);
    if ((sheet.X < _xMin) || (sheet.Y < _yMin))
    {
        // if it's too small in either dimension then this is a scrap piece
        item.ItemCount = 0;
    }
    else // try every way of cutting this sheet (guillotine cuts only)
    {
        int child0;
        int child1;

        // try every size of horizontal guillotine cut
        for (int c = sheet.X / 2; c > 0; c--)
        {
            child0 = solveGMC(shapeCache, baseShapes, new Vector2D(c, sheet.Y));
            child1 = solveGMC(shapeCache, baseShapes, new Vector2D(sheet.X - c, sheet.Y));
            if (child0 + child1 > item.ItemCount)
            {
                item.ItemCount = child0 + child1;
            }
        }

        // try every size of vertical guillotine cut
        for (int c = sheet.Y / 2; c > 0; c--)
        {
            child0 = solveGMC(shapeCache, baseShapes, new Vector2D(sheet.X, c));
            child1 = solveGMC(shapeCache, baseShapes, new Vector2D(sheet.X, sheet.Y - c));
            if (child0 + child1 > item.ItemCount)
            {
                item.ItemCount = child0 + child1;
            }
        }

        // if no children returned finished items, then the sheet is
        // either scrap or a finished item itself
        if (item.ItemCount == 0)
        {
            if (baseShapes.ContainsKey(item.Size))
            {
                item.ItemCount = 1;
            }
            else
            {
                item.ItemCount = 0;
            }
        }
    }

    // add the item to the cache before we return it
    shapeCache.Add(item.Size, item);
    return item.ItemCount;
}
Finally, the general solution method uses a GCD function to rescale the dimensions to achieve scale-invariance. This is implemented in a static class called NaturalNumbers. I have included the relevant parts of this class below:
static class NaturalNumbers
{
    /// <summary>
    /// Returns the Greatest Common Divisor of two natural numbers.
    /// Returns Zero if either number is Zero,
    /// Returns One if either number is One and both numbers are >Zero
    /// </summary>
    public static int GCD(int a, int b)
    {
        if ((a == 0) || (b == 0)) return 0;
        if (a >= b)
            return gcd_(a, b);
        else
            return gcd_(b, a);
    }

    /// <summary>
    /// Returns the Greatest Common Divisor of a list of natural numbers.
    /// (Note: will run fastest if the list is in ascending order)
    /// </summary>
    public static int GCD(IEnumerable<int> numbers)
    {
        // parameter checks
        if (numbers == null || numbers.Count() == 0) return 0;

        int first = numbers.First();
        if (first <= 1) return 0;

        int g = (int)first;
        if (g <= 1) return g;

        int i = 0;
        foreach (int n in numbers)
        {
            if (i == 0)
                g = n;
            else
                g = GCD(n, g);
            if (g <= 1) return g;
            i++;
        }
        return g;
    }

    // Euclidean method with Euclidean Division,
    // From: https://en.wikipedia.org/wiki/Euclidean_algorithm
    private static int gcd_(int a, int b)
    {
        while (b != 0)
        {
            int t = b;
            b = (a % b);
            a = t;
        }
        return a;
    }
}
Please let me know of any problems or questions you might have with this solution.
Oops, forgot that I was also using this class:
public class SizeCount
{
    public Vector2D Size;
    public int ItemCount;

    public SizeCount(Vector2D itemSize, int itemCount)
    {
        Size = itemSize;
        ItemCount = itemCount;
    }
}
As I mentioned in the comments, it would actually be pretty easy to factor this class out of the code, but it's still in there right now.

Function taking std::initializer_list

I came across a function a colleague had written that accepted an initializer list of std::vectors. I have simplified the code for demonstration:
int sum(const std::initializer_list<std::vector<int>> list)
{
    int tot = 0;
    for (auto &v : list)
    {
        tot += v.size();
    }
    return tot;
}
Such a function would allow you to call it like this, with curly braces for the initializer list:
std::vector<int> v1(50, 1);
std::vector<int> v2(75, 2);
int sum1 = sum({ v1, v2 });
That looks neat, but doesn't this involve copying the vectors to create the initializer list? Wouldn't it be more efficient to have a function that takes a vector of vectors? That would involve less copying, since you can move the vectors. Something like this:
int sum(const std::vector<std::vector<int>> &list)
{
    int tot = 0;
    for (auto &v : list)
    {
        tot += v.size();
    }
    return tot;
}

std::vector<std::vector<int>> vlist;
vlist.reserve(2);
vlist.push_back(std::move(v1));
vlist.push_back(std::move(v2));
int tot = sum(vlist);
Passing by initializer list could be useful for scalar types like int and float, but I think it should be avoided for types like std::vector, to avoid unnecessary copying. Is it best to use std::initializer_list for constructors, as it was intended?
That looks neat but doesn't this involve copying the vectors to create the initializer list?
Yes, that is correct.
Wouldn't it be more efficient to have a function that takes a vector of vectors?
If you are willing to move the contents of v1 and v2 to a std::vector<std::vector<int>>, you could do the same thing when using std::initializer_list too.
std::vector<int> v1(50, 1);
std::vector<int> v2(75, 2);
int sum1 = sum({ std::move(v1), std::move(v2) });
In other words, you can use either approach to get the same effect.
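To verify that claim, here is a small self-contained sketch (my addition, not from the original thread) that counts element copies; with the std::move calls it prints 0, because only the vectors' buffers are moved into the initializer_list's backing array:
#include <cstddef>
#include <initializer_list>
#include <iostream>
#include <utility>
#include <vector>

struct Tracer {
    static inline int copies = 0;          // C++17 inline variable
    Tracer() = default;
    Tracer(const Tracer&) { ++copies; }    // count element copies
    Tracer(Tracer&&) noexcept {}
};

static std::size_t count(std::initializer_list<std::vector<Tracer>> list)
{
    std::size_t tot = 0;
    for (const auto& v : list) tot += v.size();
    return tot;
}

int main()
{
    std::vector<Tracer> v1(50), v2(75);
    count({ std::move(v1), std::move(v2) });  // the vectors are move-constructed
    std::cout << "element copies: " << Tracer::copies << '\n';  // prints 0
}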

Why is iterating over a vector of integers slower in Rust than in Python, C# and C++?

I'm learning Rust right now and I'm using this simple Sieve of Eratosthenes implementation:
fn get_primes(known_primes: &Vec<i64>, start: i64, stop: i64) -> Vec<i64> {
    let mut new_primes = Vec::new();
    for number in start..stop {
        let mut is_prime = true;
        let limit = (number as f64).sqrt() as i64;
        for prime in known_primes {
            if number % prime == 0 {
                is_prime = false;
                break;
            }
            if *prime > limit {
                break;
            }
        }
        if is_prime {
            new_primes.push(number);
        }
    }
    return new_primes;
}
I'm comparing it to virtually the same code (modulo syntax) in Python (with numba), C#, and C++ (gcc/clang). All of them are about 3x faster than this implementation on my machine.
I am compiling in release mode. To be exact, I've added this to my Cargo.toml, which seems to have the same effect:
[profile.dev]
opt-level = 3
I've also checked the toolchain; there is a slight (15% or so) difference between MSVC and GNU, but nothing that would explain this gap.
Am I getting something wrong here? Am I making a copy somewhere?
Is this code equivalent to the following C++ code?
vector<int> getPrimes(vector<int> &knownPrimes, int start, int stop) {
    vector<int> newPrimes;
    for (int number = start; number < stop; number += 1) {
        bool isPrime = true;
        int limit = (int)sqrt(number);
        for (auto& prime : knownPrimes) {
            if (number % prime == 0) {
                isPrime = false;
                break;
            }
            if (prime > limit)
                break;
        }
        if (isPrime) {
            newPrimes.push_back(number);
        }
    }
    return newPrimes;
}
The size of a C++ int depends on the target architecture, compiler options, etc. In the Rust code, you explicitly state a 64-bit integer. You may be comparing code that uses different underlying type sizes.
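As a quick check (my sketch, not part of the original answer, assuming int is 32 bits on your target), the same function with i32 makes the comparison apples-to-apples; I've also taken a slice instead of &Vec, which is the more idiomatic parameter type:
fn get_primes_32(known_primes: &[i32], start: i32, stop: i32) -> Vec<i32> {
    let mut new_primes = Vec::new();
    for number in start..stop {
        let mut is_prime = true;
        let limit = (number as f64).sqrt() as i32;
        for &prime in known_primes {
            if number % prime == 0 {
                is_prime = false;
                break;
            }
            if prime > limit {
                break;
            }
        }
        if is_prime {
            new_primes.push(number);
        }
    }
    new_primes
}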

Scala performance on primes algorithm

I'm quite new to Scala, so in order to start writing some code I've implemented this simple program:
package org.primes.sim

object Primes {
  def is_prime(a: Int): Boolean = {
    val l = Stream.range(3, a, 2) filter { e => a % e == 0 }
    l.size == 0
  }

  def gen_primes(m: Int) =
    2 #:: Stream.from(3, 2) filter { e => is_prime(e) } take m

  def primes(m: Int) = {
    gen_primes(m) foreach println
  }

  def main(args: Array[String]) {
    if (args.size == 0)
      primes(10)
    else
      primes(args(0).toInt)
  }
}
It generates n primes starting from 2. Then I implemented the same algorithm in C++11 using Eric Niebler's range-v3 library. This is the code:
#include <iostream>
#include <vector>
#include <string>
#include <range/v3/all.hpp>

using namespace std;
using namespace ranges;

inline bool is_even(unsigned int n) { return n % 2 == 0; }

inline bool is_prime(unsigned int n)
{
    if (n == 2)
        return true;
    else if (n == 1 || is_even(n))
        return false;
    else
        return ranges::any_of(
            view::iota(3, n) | view::remove_if(is_even),
            [n](unsigned int e) { return n % e == 0; }
        ) == false;
}

void primes(unsigned int n)
{
    auto rng = view::ints(2) | view::filter(is_prime);
    ranges::for_each(view::take(rng, n), [](unsigned int e) { cout << e << '\n'; });
}

int main(int argc, char* argv[])
{
    if (argc == 1)
        primes(100);
    else if (argc > 1)
    {
        primes(std::stoi(argv[1]));
    }
}
As you can see, the code looks very similar, but the performance is very different:
For n = 5000, C++ completes in 0.265s whereas Scala completes in 24.314s!
So, from this test, Scala seems 100x slower than C++11.
What is the problem with the Scala code? Could you give me some hints on better usage of scalac?
Note: I've compiled the C++ code using gcc 4.9.2 with -O3 optimization.
Thanks
The main speed problem lies with your is_prime implementation.
First of all, you filter a Stream to find all divisors, and then check if there were none (l.size == 0). But it's faster to return false as soon as the first divisor is found:
def is_prime(a: Int): Boolean =
  Stream.range(3, a, 2).find(a % _ == 0).isEmpty
This decreased runtime from 22 seconds to 5 seconds for primes(5000) on my machine.
The second problem is Stream itself. Scala Streams are slow, and using them for simple number calculations is huge overkill. Replacing Stream with Range decreased the runtime further, to 1.2 seconds:
def is_prime(a: Int): Boolean =
  3.until(a, 2).find(a % _ == 0).isEmpty
That's decent: 5x slower than C++. Usually I'd stop here, but it is possible to decrease the running time a bit more by removing the higher-order function find.
While nice-looking and functional, find also induces some overhead. A loop implementation (basically replacing find with foreach) further decreased the runtime to 0.45 seconds, which is less than 2x slower than C++ (and that's already on the order of JVM overhead):
def is_prime(a: Int): Boolean = {
  for (e <- 3.until(a, 2)) if (a % e == 0) return false
  true
}
There's another Stream in gen_primes, so doing something with it may improve the runtime further, but in my opinion that's not necessary. At that point of performance improvement, I think it would be better to switch to some other algorithm for generating primes: e.g., using only primes, instead of all odd numbers, to look for divisors, or using the Sieve of Eratosthenes (see the sketch below).
All in all, functional abstractions in Scala are implemented with actual objects on the heap, which have some overhead that the JIT compiler can't always remove. The selling point of C++, by contrast, is zero-cost abstractions: everything that can be is expanded during compilation through templates and constexpr, and then aggressively optimized by the compiler.
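For completeness, a sketch of the Sieve of Eratosthenes suggested above (my illustration, not part of the original answer; limit is an assumed upper bound on the primes you need, rather than a count):
def sieve(limit: Int): Vector[Int] = {
  val composite = new Array[Boolean](limit + 1)
  var i = 2
  while (i * i <= limit) {
    if (!composite(i)) {
      var j = i * i
      while (j <= limit) { composite(j) = true; j += i } // mark multiples of i
    }
    i += 1
  }
  (2 to limit).filterNot(n => composite(n)).toVector
}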

Search tree in Scala

I'm trying to take my first steps in Scala, and to practice I took a look at the Google Code Jam "Store Credit" exercise. I tried it in Java first, which went well enough, and now I'm trying to port it to Scala. With the Java collections framework I could attempt a straight syntax conversion, but I'd end up writing Java in Scala, and that kind of defeats the purpose. In my Java implementation, I have a PriorityQueue that I empty into a Deque, and pop the ends off until we have bingo. This all uses mutable collections, which feels very 'un-Scala'. What I think would be a more functional approach is to construct a data structure that can be traversed both from highest to lowest and from lowest to highest. Am I on the right path? Are there any suitable data structures supplied in the Scala libraries, or should I roll my own here?
EDIT: full code of the much simpler version in Java. It should run in O(max(credit,inputchars)) and has become:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;

public class StoreCredit {
    private static BufferedReader in;

    public static void main(String[] args) {
        in = new BufferedReader(new InputStreamReader(System.in));
        try {
            int numCases = Integer.parseInt(in.readLine());
            for (int i = 0; i < numCases; i++) {
                solveCase(i);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void solveCase(int casenum) throws NumberFormatException,
            IOException {
        int credit = Integer.parseInt(in.readLine());
        int numItems = Integer.parseInt(in.readLine());
        int itemnumber = 0;
        int[] item_numbers_by_price = new int[credit];
        // makes this O(max(credit, items)) instead of O(items)
        Arrays.fill(item_numbers_by_price, -1);
        int[] read_prices = readItems();
        while (itemnumber < numItems) {
            int next_price = read_prices[itemnumber];
            if (next_price <= credit) {
                if (item_numbers_by_price[credit - next_price] >= 0) {
                    // Bingo! DinoDNA!
                    printResult(new int[] {
                            item_numbers_by_price[credit - next_price],
                            itemnumber }, casenum);
                    break;
                }
                item_numbers_by_price[next_price] = itemnumber;
            }
            itemnumber++;
        }
    }

    private static int[] readItems() throws IOException {
        String line = in.readLine();
        String[] items = line.split(" "); // uh-oh, now it's O(max(credit, inputchars))
        int[] result = new int[items.length];
        for (int i = 0; i < items.length; i++) {
            result[i] = Integer.parseInt(items[i]);
        }
        return result;
    }

    private static void printResult(int[] result, int casenum) {
        int one;
        int two;
        if (result[0] > result[1]) {
            one = result[1];
            two = result[0];
        } else {
            one = result[0];
            two = result[1];
        }
        one++;
        two++;
        System.out.println(String.format("Case #%d: %d %d", casenum + 1, one, two));
    }
}
I'm wondering what you are trying to accomplish using sophisticated data structures such as PriorityQueue and Deque for a problem such as this. It can be solved with a pair of nested loops:
for {
  i <- 2 to I
  j <- 1 until i
  if i != j && P(i - 1) + P(j - 1) == C
} println("Case #%d: %d %d" format (n, j, i))
Worse than linear, better than quadratic. Since the items are not sorted, and sorting them would require O(n log n), you can't do much better than this -- as far as I can see.
Actually, having said all that, I have now figured out a way to do it in linear time. The trick is that, for every number p you find, you know what its complement is: C - p. I expect there are a few ways to exploit that -- I have so far thought of two.
One way is to build a map with O(1) lookup characteristics, such as a bitmap or a hash map. For each element, make it point to its index. One then only has to find an element whose complement also has an entry in the map. Trivially, this could be as easy as this:
val PM = P.zipWithIndex.toMap
// find returns an Option; this sketch assumes a matching pair exists
val Some((p, i)) = PM find { case (p, _) => PM isDefinedAt C - p }
val j = PM(C - p)
However, that won't work if the number is equal to its complement -- in other words, if there are two p such that p + p == C. There are quite a few such cases in the examples. One could test for that condition, and then just use indexOf and lastIndexOf -- except that it is possible that there is only one p such that p + p == C, in which case that wouldn't be the answer either.
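A sketch of that indexOf/lastIndexOf test (my illustration, not from the original answer; P, C and n are as in the snippets above):
if (C % 2 == 0) {
  val half = C / 2
  val i = P.indexOf(half)
  val j = P.lastIndexOf(half)
  if (i >= 0 && i != j)  // two distinct items priced C / 2
    println("Case #%d: %d %d" format (n, i + 1, j + 1))
}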
So I ended up with something more complex, which tests for the existence of the complement at the same time the map is being built. Here's the full solution:
import scala.io.Source

object StoreCredit3 extends App {
  val source = if (args.size > 0) Source fromFile args(0) else Source.stdin
  val input = source getLines ()
  val N = input.next.toInt
  1 to N foreach { n =>
    val C = input.next.toInt
    val I = input.next.toInt
    val Ps = input.next split ' ' map (_.toInt)
    val (_, Some((p1, p2))) = Ps.zipWithIndex.foldLeft((Map[Int, Int](), None: Option[(Int, Int)])) {
      case ((map, None), (p, i)) =>
        if (map isDefinedAt C - p) map -> Some(map(C - p) -> (i + 1))
        else (map updated (p, i + 1), None)
      case (answer, _) => answer
    }
    println("Case #%d: %d %d" format (n, p1, p2))
  }
}
