How to benchmark parallel code? - parallel-processing

I have some code that I parallelized using Rayon, hoping to improve its performance, but the results, as measured by the Bencher, were... most unimpressive. I suspected that this might be caused by the way I am running the benchmarks (maybe they are run in parallel?), so I tested a simpler case.
Consider the following parallelized code, based on the quick_sort crate:
#![feature(test)]
extern crate rayon;
extern crate test;
use test::Bencher;
use std::cmp::Ordering;
pub fn sort<T>(arr: &mut [T])
where T: Send + std::cmp::PartialEq + Ord
{
qsort(arr, find_pivot, &|a, b| a.cmp(b))
}
pub fn sort_by<T, F>(arr: &mut [T], compare: &F)
where T: Send + std::cmp::PartialOrd,
F: Sync + Fn(&T, &T) -> Ordering
{
qsort(arr, find_pivot, compare);
}
fn qsort<T, F>(arr: &mut [T], pivot: fn(&[T], &F) -> usize, compare: &F)
where T: Send + std::cmp::PartialOrd,
F: Sync + Fn(&T, &T) -> Ordering
{
let len = arr.len();
if len <= 1 {
return;
}
let p = pivot(arr, compare);
let p = partition(arr, p, compare);
let (l, r) = arr.split_at_mut(p + 1);
if p > len / 2 {
rayon::join(
|| qsort(r, pivot, compare),
|| qsort(l, pivot, compare)
);
} else {
rayon::join(
|| qsort(l, pivot, compare),
|| qsort(r, pivot, compare)
);
}
}
fn find_pivot<T, F>(arr: &[T], compare: &F) -> usize
where T: Send + std::cmp::PartialOrd,
F: Sync + Fn(&T, &T) -> Ordering
{
let (l, r) = (0, arr.len() - 1);
let m = l + ((r - 1) / 2);
let (left, middle, right) = (&arr[l], &arr[m], &arr[r]);
if (compare(middle, left) != Ordering::Less) && (compare(middle, right) != Ordering::Greater) {
m
} else if (compare(left, middle) != Ordering::Less) &&
(compare(left, right) != Ordering::Greater) {
l
} else {
r
}
}
fn partition<T, F>(arr: &mut [T], p: usize, compare: &F) -> usize
where T: std::cmp::PartialOrd,
F: Sync + Fn(&T, &T) -> Ordering
{
if arr.len() <= 1 {
return p;
}
let last = arr.len() - 1;
let mut next_pivot = 0;
arr.swap(last, p);
for i in 0..last {
if compare(&arr[i], &arr[last]) == Ordering::Less {
arr.swap(i, next_pivot);
next_pivot += 1;
}
}
arr.swap(next_pivot, last);
next_pivot
}
#[bench]
fn bench_qsort(b: &mut Bencher) {
let mut vec = vec![ 3, 97, 50, 56, 58, 80, 91, 71, 83, 65,
92, 35, 11, 26, 69, 44, 42, 75, 40, 43,
63, 5, 62, 56, 35, 3, 51, 97, 100, 73,
42, 41, 79, 86, 93, 58, 65, 96, 66, 36,
17, 97, 6, 16, 52, 30, 38, 14, 39, 7,
48, 83, 37, 97, 21, 58, 41, 59, 97, 37,
97, 9, 24, 78, 77, 7, 78, 80, 11, 79,
42, 30, 39, 27, 71, 61, 12, 8, 49, 62,
69, 48, 8, 56, 89, 27, 1, 80, 31, 62,
7, 15, 30, 90, 75, 78, 22, 99, 97, 89];
b.iter(|| { sort(&mut vec); } );
}
Results of cargo bench:
running 1 test
test bench_qsort ... bench: 10,374 ns/iter (+/- 296) // WHAT
While the results for the sequential code (extern crate quick_sort) are:
running 1 test
test bench_qsort ... bench: 1,070 ns/iter (+/- 65)
I also tried benchmarking with longer vectors, but the results were consistent. In addition, I performed some tests using GNU time and it looks like the parallel code is faster with bigger vectors, as expected.
What am I doing wrong? Can I use Bencher to benchmark parallel code?

The array in your benchmark is so small that the parallel code really is slower in that case.
Launching tasks in parallel has some overhead, and memory access gets slower when different threads touch memory on the same cache line.
For parallel iterators, with_min_len avoids this overhead on tiny arrays, but for join you probably need to implement the parallel/non-parallel decision yourself.
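The parallel/non-parallel decision mentioned above could look roughly like the following sketch. It is not the answerer's code: the cutoff constant is hypothetical (1000 echoes the "rayon if len > 1000" measurement), the pivot is simply the last element, and the `join` helper is a stand-in built on `std::thread::scope` so that the example compiles without external crates. With Rayon you would call `rayon::join` at that point instead.

```rust
use std::thread;

/// Hypothetical cutoff: shorter slices are sorted on the current thread.
const SEQUENTIAL_CUTOFF: usize = 1000;

/// Stand-in for `rayon::join` (an assumption made for this sketch):
/// runs `a` on a scoped thread and `b` on the current thread.
fn join<A, B>(a: A, b: B)
where
    A: FnOnce() + Send,
    B: FnOnce(),
{
    thread::scope(|s| {
        s.spawn(a);
        b();
    });
}

/// Lomuto partition around the last element; returns the pivot's final index.
fn partition<T: Ord>(arr: &mut [T]) -> usize {
    let last = arr.len() - 1;
    let mut next = 0;
    for i in 0..last {
        if arr[i] < arr[last] {
            arr.swap(i, next);
            next += 1;
        }
    }
    arr.swap(next, last);
    next
}

/// Quicksort that only goes parallel when the slice is large enough.
fn qsort<T: Ord + Send>(arr: &mut [T]) {
    if arr.len() <= 1 {
        return;
    }
    let p = partition(arr);
    let (l, r) = arr.split_at_mut(p);
    let r = &mut r[1..]; // skip the pivot, which is already in place
    if l.len() + r.len() < SEQUENTIAL_CUTOFF {
        // Small input: plain recursion avoids the task-launch overhead.
        qsort(l);
        qsort(r);
    } else {
        join(|| qsort(l), || qsort(r));
    }
}
```

Spawning an OS thread per split is far more expensive than Rayon's work stealing, so this only illustrates where the length check goes; the right cutoff has to be measured.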
With an array 100 times larger:
with rayon: 3,468,891 ns/iter (+/- 95,859)
without rayon: 4,227,220 ns/iter (+/- 635,260)
rayon if len > 1000: 3,166,570 ns/iter (+/- 66,329)
The relatively small speed-up is expected for this task, because it's memory-bound (there's no complex computation to parallelize).


Better random shuffler in Rust? [closed]

I just made a program that creates a deck with 26 cards, splits it into 5 hands of 5 cards (discarding 1) and checks the poker combinations that are in those hands, if any.
Now, I also made another program that loops over this until a Royal Flush is found (which is very rare and should happen, on average, once every 600k or so decks).
And the strange thing is, once I added a counter to show how many decks it went through, it only said 150-4000! I think it's the random shuffler's fault. I had made a similar program in Python, and that was checking approximately the correct number of decks.
I used this, which should shuffle the deck in place:
fn shuffle_deck(deck: &mut Vec<Card>) -> () {
deck.shuffle(&mut rand::thread_rng())
}
Apparently, it's not doing a very good job at being random. Can anybody help me in finding a better solution?
Edit: also, for anyone wondering, this is the Card struct:
pub struct Card {
value: i32,
suit: String
}
Your assumption is incorrect.
From 26 cards you can draw 26 choose 5 possible five-card hands, which is 65780 combinations.
As you have 4 royal flushes in your deck, the probability of being dealt a royal flush is 4/65780 = 0.006080875646093037 %, or one out of every 16445.
If you look at five hands at a time, this number roughly divides by 5 (not exactly, because those five hands aren't independent), so you should get something in the ballpark of one every ~3300.
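The counting argument can be checked with a few lines of arithmetic. This is a minimal sketch; `choose` is a helper name introduced here for illustration, not part of the `rand` crate.

```rust
/// Binomial coefficient "n choose k" (illustrative helper, fine for small inputs).
fn choose(n: u64, k: u64) -> u64 {
    if k > n {
        return 0;
    }
    let k = k.min(n - k);
    // After step i the accumulator equals C(n, i + 1), so each division is exact.
    (0..k).fold(1, |acc, i| acc * (n - i) / (i + 1))
}

/// Average number of single hands per royal flush, given `royal_flushes`
/// favourable hands among all five-card hands from a deck of `deck_size`.
fn hands_per_royal_flush(deck_size: u64, royal_flushes: u64) -> u64 {
    choose(deck_size, 5) / royal_flushes
}
```

`hands_per_royal_flush(26, 4)` gives 16445 and `hands_per_royal_flush(52, 4)` gives 649740. Dividing 16445 by the five hands that each 26-card deck is split into gives roughly 3289 decks per royal flush, ignoring the dependence between hands.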
So if you measure it experimentally, you get:
use rand::seq::SliceRandom;
fn is_royal_flush(deck: &[u8]) -> bool {
if deck.len() != 5 {
false
} else {
let mut buckets = [0u8; 4];
for el in deck {
if let Some(bucket) = buckets.get_mut((*el / 5) as usize) {
*bucket += 1;
}
}
buckets.iter().any(|bucket| *bucket == 5)
}
}
fn has_royal_flush(deck: &[u8; 26]) -> bool {
is_royal_flush(&deck[0..5])
|| is_royal_flush(&deck[5..10])
|| is_royal_flush(&deck[10..15])
|| is_royal_flush(&deck[15..20])
|| is_royal_flush(&deck[20..25])
}
fn main() {
let mut rng = rand::thread_rng();
let mut deck = [0; 26];
deck.iter_mut()
.enumerate()
.for_each(|(pos, val)| *val = pos as u8);
println!("Deck, initially: {:?}", deck);
deck.shuffle(&mut rng);
println!("Deck, shuffled: {:?}", deck);
println!();
let mut total: usize = 0;
let mut royal_flushes: usize = 0;
loop {
deck.shuffle(&mut rng);
total += 1;
if has_royal_flush(&deck) {
royal_flushes += 1;
}
if total % 10000000 == 0 {
println!(
"Now {} out of {}. (Probability: 1/{})",
royal_flushes,
total,
total / royal_flushes
);
}
}
}
Deck, initially: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
Deck, shuffled: [25, 16, 3, 12, 18, 1, 14, 8, 17, 0, 5, 15, 6, 11, 4, 21, 2, 13, 24, 20, 22, 7, 10, 9, 19, 23]
Now 2976 out of 10000000. (Probability: 1/3360)
Now 6004 out of 20000000. (Probability: 1/3331)
Now 8973 out of 30000000. (Probability: 1/3343)
Now 11984 out of 40000000. (Probability: 1/3337)
Which is roughly in the expected ballpark. So your assumption that you should get only one royal flush every 600k decks is off, and most likely something is wrong with your Python code.
That said, if you compute the probability of getting a single royal flush from a 52-card deck, you should get (52 choose 5) / 4 = one every 649740 draws. That is probably what you were referring to, and if you program it, it matches expectations:
use rand::seq::SliceRandom;
fn is_royal_flush(deck: &[u8]) -> bool {
if deck.len() != 5 {
false
} else {
let mut buckets = [0u8; 4];
for el in deck {
if let Some(bucket) = buckets.get_mut((*el / 5) as usize) {
*bucket += 1;
}
}
buckets.iter().any(|bucket| *bucket == 5)
}
}
fn has_royal_flush(deck: &[u8; 52]) -> bool {
is_royal_flush(&deck[0..5])
}
fn main() {
let mut rng = rand::thread_rng();
let mut deck = [0; 52];
deck.iter_mut()
.enumerate()
.for_each(|(pos, val)| *val = pos as u8);
println!("Deck, initially: {:?}", deck);
deck.shuffle(&mut rng);
println!("Deck, shuffled: {:?}", deck);
println!();
let mut total: usize = 0;
let mut royal_flushes: usize = 0;
loop {
deck.shuffle(&mut rng);
total += 1;
if has_royal_flush(&deck) {
royal_flushes += 1;
}
if total % 10000000 == 0 {
println!(
"Now {} out of {}. (Probability: 1/{})",
royal_flushes,
total,
total / royal_flushes
);
}
}
}
Deck, initially: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51]
Deck, shuffled: [2, 15, 44, 39, 26, 47, 28, 11, 1, 19, 31, 6, 43, 42, 29, 48, 35, 30, 3, 49, 50, 37, 9, 10, 18, 45, 33, 22, 36, 5, 38, 46, 51, 32, 7, 17, 23, 27, 41, 14, 21, 13, 25, 4, 8, 16, 20, 24, 12, 34, 0, 40]
Now 6 out of 10000000. (Probability: 1/1666666)
Now 22 out of 20000000. (Probability: 1/909090)
Now 36 out of 30000000. (Probability: 1/833333)
Now 54 out of 40000000. (Probability: 1/740740)
Now 70 out of 50000000. (Probability: 1/714285)
Now 78 out of 60000000. (Probability: 1/769230)
Now 92 out of 70000000. (Probability: 1/760869)
Now 102 out of 80000000. (Probability: 1/784313)
Now 125 out of 90000000. (Probability: 1/720000)
Now 139 out of 100000000. (Probability: 1/719424)
Now 164 out of 110000000. (Probability: 1/670731)
Now 190 out of 120000000. (Probability: 1/631578)
Now 202 out of 130000000. (Probability: 1/643564)
Now 217 out of 140000000. (Probability: 1/645161)
Now 232 out of 150000000. (Probability: 1/646551)
Now 248 out of 160000000. (Probability: 1/645161)

What is the correct way to convert RGB555 to RGB888?

Some people suggest that to convert RGB555 to RGB888, one should propagate the highest bits down. However, even though this method preserves the full range (as opposed to a left shift by 3, which does not), it introduces noise from the highest bits.
Myself, I use the formula x * 255 / 31 which preserves full range and does not introduce noise from the highest bits.
This small test shows the difference between the two approaches:
using Microsoft.VisualStudio.TestTools.UnitTesting;
[TestClass]
public class ColorTest
{
public TestContext TestContext { get; set; }
[TestMethod]
public void Test1()
{
for (var i = 0; i < 32; i++)
{
var j = i * 255 / 31;
var k = (i << 3) | ((i >> 2) & 0b111);
TestContext.WriteLine($"{j,3} -> {k,3}, difference: {k - j}");
}
}
}
Result:
TestContext Messages:
0 -> 0, difference: 0
8 -> 8, difference: 0
16 -> 16, difference: 0
24 -> 24, difference: 0
32 -> 33, difference: 1
41 -> 41, difference: 0
49 -> 49, difference: 0
57 -> 57, difference: 0
65 -> 66, difference: 1
74 -> 74, difference: 0
82 -> 82, difference: 0
90 -> 90, difference: 0
98 -> 99, difference: 1
106 -> 107, difference: 1
115 -> 115, difference: 0
123 -> 123, difference: 0
131 -> 132, difference: 1
139 -> 140, difference: 1
148 -> 148, difference: 0
156 -> 156, difference: 0
164 -> 165, difference: 1
172 -> 173, difference: 1
180 -> 181, difference: 1
189 -> 189, difference: 0
197 -> 198, difference: 1
205 -> 206, difference: 1
213 -> 214, difference: 1
222 -> 222, difference: 0
230 -> 231, difference: 1
238 -> 239, difference: 1
246 -> 247, difference: 1
255 -> 255, difference: 0
Question:
Which approach is definitively correct?
I crafted a small test, and the results are quite surprising, to say the least!
It turns out that propagating the higher bits most often matches the value calculated in floating point, rounded, and cast back to an integer:
Legend:
i: 5-bit index
j: N * 255 / 31
k: (N << 3) | ((N >> 2) & 0b111)
l: (N * 539087) >> 16
m: N * 255.0d / 31.0d
n: (int)Math.Round(N * 255.0d / 31.0d)
Results:
i: 0, j: 0, k: 0, l: 0, m: 0.00, n: 0.00
i: 1, j: 8, k: 8, l: 8, m: 8.23, n: 8.00
i: 2, j: 16, k: 16, l: 16, m: 16.45, n: 16.00
i: 3, j: 24, k: 24, l: 24, m: 24.68, n: 25.00
i: 4, j: 32, k: 33, l: 32, m: 32.90, n: 33.00
i: 5, j: 41, k: 41, l: 41, m: 41.13, n: 41.00
i: 6, j: 49, k: 49, l: 49, m: 49.35, n: 49.00
i: 7, j: 57, k: 57, l: 57, m: 57.58, n: 58.00
i: 8, j: 65, k: 66, l: 65, m: 65.81, n: 66.00
i: 9, j: 74, k: 74, l: 74, m: 74.03, n: 74.00
i: 10, j: 82, k: 82, l: 82, m: 82.26, n: 82.00
i: 11, j: 90, k: 90, l: 90, m: 90.48, n: 90.00
i: 12, j: 98, k: 99, l: 98, m: 98.71, n: 99.00
i: 13, j: 106, k: 107, l: 106, m: 106.94, n: 107.00
i: 14, j: 115, k: 115, l: 115, m: 115.16, n: 115.00
i: 15, j: 123, k: 123, l: 123, m: 123.39, n: 123.00
i: 16, j: 131, k: 132, l: 131, m: 131.61, n: 132.00
i: 17, j: 139, k: 140, l: 139, m: 139.84, n: 140.00
i: 18, j: 148, k: 148, l: 148, m: 148.06, n: 148.00
i: 19, j: 156, k: 156, l: 156, m: 156.29, n: 156.00
i: 20, j: 164, k: 165, l: 164, m: 164.52, n: 165.00
i: 21, j: 172, k: 173, l: 172, m: 172.74, n: 173.00
i: 22, j: 180, k: 181, l: 180, m: 180.97, n: 181.00
i: 23, j: 189, k: 189, l: 189, m: 189.19, n: 189.00
i: 24, j: 197, k: 198, l: 197, m: 197.42, n: 197.00
i: 25, j: 205, k: 206, l: 205, m: 205.65, n: 206.00
i: 26, j: 213, k: 214, l: 213, m: 213.87, n: 214.00
i: 27, j: 222, k: 222, l: 222, m: 222.10, n: 222.00
i: 28, j: 230, k: 231, l: 230, m: 230.32, n: 230.00
i: 29, j: 238, k: 239, l: 238, m: 238.55, n: 239.00
i: 30, j: 246, k: 247, l: 246, m: 246.77, n: 247.00
i: 31, j: 255, k: 255, l: 255, m: 255.00, n: 255.00
Total hits -> j: 17 (53.12%), k: 28 (87.50%), l: 17 (53.12%)
Code:
using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
[TestClass]
public class ColorTest
{
public TestContext TestContext { get; set; }
[TestMethod]
public void Test1()
{
var (item1, item2, item3) = (0, 0, 0);
TestContext.WriteLine("i: 5-bit index");
TestContext.WriteLine("j: N * 255 / 31");
TestContext.WriteLine("k: (N << 3) | ((N >> 2) & 0b111)");
TestContext.WriteLine("l: (N * 539087) >> 16");
TestContext.WriteLine("m: N * 255.0d / 31.0d");
TestContext.WriteLine("n: (int)Math.Round(N * 255.0d / 31.0d)");
TestContext.WriteLine(string.Empty);
for (var i = 0; i < 32; i++)
{
var j = i * 255 / 31;
var k = (i << 3) | ((i >> 2) & 0b111);
var l = (i * 539087) >> 16;
var m = i * 255.0d / 31.0d;
var n = (int)Math.Round(m);
TestContext.WriteLine(
$"{nameof(i)}: {i,3}, " +
$"{nameof(j)}: {j,3}, " +
$"{nameof(k)}: {k,3}, " +
$"{nameof(l)}: {l,3}, " +
$"{nameof(m)}: {m,6:F}, " +
$"{nameof(n)}: {n,6:F}");
if (j == n)
{
item1++;
}
if (k == n)
{
item2++;
}
if (l == n)
{
item3++;
}
}
TestContext.WriteLine($"\r\nTotal hits -> j: {item1} ({item1 / 32f:P}), k: {item2} ({item2 / 32f:P}), l: {item3} ({item3 / 32f:P})");
}
}
Yes, even though not perfect, propagating higher bits turns out to be closer to what one expects :)
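For comparison, the rounded floating-point value can also be reproduced in pure integer arithmetic by adding half the divisor before dividing. The sketch below is an addition, not part of the test above: the function names are made up, and the packed layout (red in bits 14-10, green in bits 9-5, blue in bits 4-0) is an assumption.

```rust
/// Bit replication: copy the top three bits of the 5-bit value into the low bits.
fn expand_replicate(v: u8) -> u8 {
    let v = v & 0x1f;
    (v << 3) | (v >> 2)
}

/// Rounded rescale: round(v * 255 / 31) using only integers.
/// Adding 15 (half of 31, rounded down) before dividing rounds to nearest;
/// ties cannot occur because 31 is odd.
fn expand_rounded(v: u8) -> u8 {
    (((v & 0x1f) as u16 * 255 + 15) / 31) as u8
}

/// Converts a packed RGB555 pixel to (r, g, b) bytes.
/// Assumed layout: 0RRRRRGGGGGBBBBB.
fn rgb555_to_rgb888(px: u16) -> (u8, u8, u8) {
    let r = ((px >> 10) & 0x1f) as u8;
    let g = ((px >> 5) & 0x1f) as u8;
    let b = (px & 0x1f) as u8;
    (expand_rounded(r), expand_rounded(g), expand_rounded(b))
}

/// Counts the 5-bit levels on which both expansions agree.
fn matching_levels() -> usize {
    (0u8..32)
        .filter(|&v| expand_replicate(v) == expand_rounded(v))
        .count()
}
```

`matching_levels()` returns 28, in line with the 87.50% hit rate reported for column k; the two methods differ only at levels 3, 7, 24 and 28.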

sql server 2008 checksum algorithm? [duplicate]

We perform checksums of some data in sql server as follows:
declare @cs int;
select
@cs = CHECKSUM_AGG(CHECKSUM(someid, position))
from
SomeTable
where
userid = @userId
group by
userid;
This data is then shared with clients. We'd like to be able to repeat the checksum at the client end... however there doesn't seem to be any info about how the checksums in the functions above are calculated. Can anyone enlighten me?
On SQL Server Forum, at this page, it's stated:
The built-in CHECKSUM function in SQL Server is built on a series of 4 bit left rotational xor operations. See this post for more explanation.
The CHECKSUM function doesn't provide a very good quality checksum and IMO is pretty useless for most purposes. As far as I know the algorithm isn't published. If you want a check that you can reproduce yourself then use the HashBytes function and one of the standard, published algorithms such as MD5 or SHA.
// Quick hash sum, mirrored in SQL and C# (Ukraine)
private Int64 HASH_ZKCRC64(byte[] Data)
{
Int64 Result = 0x5555555555555555;
if (Data == null || Data.Length <= 0) return 0;
int SizeGlobalBufer = 8000;
int Ost = Data.Length % SizeGlobalBufer;
int LeftLimit = (Data.Length / SizeGlobalBufer) * SizeGlobalBufer;
for (int i = 0; i < LeftLimit; i += 64)
{
Result = Result
^ BitConverter.ToInt64(Data, i)
^ BitConverter.ToInt64(Data, i + 8)
^ BitConverter.ToInt64(Data, i + 16)
^ BitConverter.ToInt64(Data, i + 24)
^ BitConverter.ToInt64(Data, i + 32)
^ BitConverter.ToInt64(Data, i + 40)
^ BitConverter.ToInt64(Data, i + 48)
^ BitConverter.ToInt64(Data, i + 56);
if ((Result & 0x0000000000000080) != 0)
Result = Result ^ BitConverter.ToInt64(Data, i + 28);
}
if (Ost > 0)
{
byte[] Bufer = new byte[SizeGlobalBufer];
Array.Copy(Data, LeftLimit, Bufer, 0, Ost);
for (int i = 0; i < SizeGlobalBufer; i += 64)
{
Result = Result
^ BitConverter.ToInt64(Bufer, i)
^ BitConverter.ToInt64(Bufer, i + 8)
^ BitConverter.ToInt64(Bufer, i + 16)
^ BitConverter.ToInt64(Bufer, i + 24)
^ BitConverter.ToInt64(Bufer, i + 32)
^ BitConverter.ToInt64(Bufer, i + 40)
^ BitConverter.ToInt64(Bufer, i + 48)
^ BitConverter.ToInt64(Bufer, i + 56);
if ((Result & 0x0000000000000080)!=0)
Result = Result ^ BitConverter.ToInt64(Bufer, i + 28);
}
}
byte[] MiniBufer = BitConverter.GetBytes(Result);
Array.Reverse(MiniBufer);
return BitConverter.ToInt64(MiniBufer, 0);
#region SQL_FUNCTION
/* CREATE FUNCTION [dbo].[HASH_ZKCRC64] (@data as varbinary(MAX)) Returns bigint
AS
BEGIN
Declare @I64 as bigint Set @I64=0x5555555555555555
Declare @Bufer as binary(8000)
Declare @i as int Set @i=1
Declare @j as int
Declare @Len as int Set @Len=Len(@data)
if ((@data is null) Or (@Len<=0)) Return 0
While @i<=@Len
Begin
Set @Bufer=Substring(@data,@i,8000)
Set @j=1
While @j<=8000
Begin
Set @I64=@I64
^ CAST(Substring(@Bufer,@j, 8) as bigint)
^ CAST(Substring(@Bufer,@j+8, 8) as bigint)
^ CAST(Substring(@Bufer,@j+16,8) as bigint)
^ CAST(Substring(@Bufer,@j+24,8) as bigint)
^ CAST(Substring(@Bufer,@j+32,8) as bigint)
^ CAST(Substring(@Bufer,@j+40,8) as bigint)
^ CAST(Substring(@Bufer,@j+48,8) as bigint)
^ CAST(Substring(@Bufer,@j+56,8) as bigint)
if @I64<0 Set @I64=@I64 ^ CAST(Substring(@Bufer,@j+28,8) as bigint)
Set @j=@j+64
End;
Set @i=@i+8000
End
Return @I64
END
*/
#endregion
}
I figured out the CHECKSUM algorithm, at least for ASCII characters. I created a proof of it in JavaScript (see https://stackoverflow.com/a/59014293/9642).
In a nutshell: rotate 4 bits left and xor by a code for each character. The trick was figuring out the "XOR codes". Here's the table of those:
var xorcodes = [
0, 1, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23,
24, 25, 26, 27, 28, 29, 30, 31,
0, 33, 34, 35, 36, 37, 38, 39, // !"#$%&'
40, 41, 42, 43, 44, 45, 46, 47, // ()*+,-./
132, 133, 134, 135, 136, 137, 138, 139, // 01234567
140, 141, 48, 49, 50, 51, 52, 53, 54, // 89:;<=>?@
142, 143, 144, 145, 146, 147, 148, 149, // ABCDEFGH
150, 151, 152, 153, 154, 155, 156, 157, // IJKLMNOP
158, 159, 160, 161, 162, 163, 164, 165, // QRSTUVWX
166, 167, 55, 56, 57, 58, 59, 60, // YZ[\]^_`
142, 143, 144, 145, 146, 147, 148, 149, // abcdefgh
150, 151, 152, 153, 154, 155, 156, 157, // ijklmnop
158, 159, 160, 161, 162, 163, 164, 165, // qrstuvwx
166, 167, 61, 62, 63, 64, 65, 66, // yz{|}~
];
The main thing to note is the bias towards alphanumerics (their codes are similar and ascending). English letters use the same code regardless of case.
I haven't tested high codes (128+) nor Unicode.
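The "rotate 4 bits left and xor by a code" step can be sketched as a simple fold. This is an illustration of the described structure only: the 32-bit width, the zero starting value and the name `rotate_xor_checksum` are assumptions, and the per-character code lookup is passed in as a closure rather than hard-coded, so nothing here claims to reproduce SQL Server's output.

```rust
/// Folds bytes into a 32-bit value: rotate the accumulator left by 4 bits,
/// then XOR in the code for the next character.
/// `code_for` is a caller-supplied lookup (for example, backed by a table
/// like the one above); the width and start value are assumptions.
fn rotate_xor_checksum(input: &[u8], code_for: impl Fn(u8) -> u32) -> u32 {
    input
        .iter()
        .fold(0u32, |acc, &byte| acc.rotate_left(4) ^ code_for(byte))
}
```

A case-insensitive lookup, as described for English letters, could be modelled by a closure that maps `byte.to_ascii_uppercase()` through the table before returning the code.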

Two dimensional array sum in C#/Linq

I have a two dimensional array of integers. I would like to write an optimized and fast code to sum all the columns of the two dimensional array.
Any thoughts on how I might do this using LINQ/PLINQ/Task parallelization?
Ex:
private int[,] m_indexes = new int[6,4] { {367, 40, 74, 15},
{535, 226, 74, 15},
{368, 313, 74, 15},
{197, 316, 74, 15},
{27, 226, 74, 15},
{194, 41, 74, 15} };
The simplest parallel implementation:
int[,] m_indexes = new int[6, 4] { {367, 40, 74, 15},
{535, 226, 74, 15},
{368, 313, 74, 15},
{197, 316, 74, 15},
{27, 226, 74, 15},
{194, 41, 74, 15} };
var columns = Enumerable.Range(0, 4);
int[] sums = new int[4];
Parallel.ForEach(columns, column => {
int sum = 0;
for (int i = 0; i < 6; i++) {
sum += m_indexes[i, column];
}
sums[column] = sum;
});
This code can obviously be "generalized" (use m_indexes.GetLength(0) and m_indexes.GetLength(1)).
LINQ:
var sums = columns.Select(
column => {
int sum = 0;
for (int i = 0; i < 6; i++) {
sum += m_indexes[i, column];
} return sum;
}
).ToArray();
Be sure to profile on real-world data if you truly need to optimize for performance.
Also, if you truly care about performance, try to lay out your array so that you are summing across rows. You'll get better cache locality that way.
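The locality point can be illustrated independently of C#: walk the data row by row, in memory order, and add each element to its column's running total. The sketch below is written in Rust purely as a language-neutral illustration; `column_sums` is a made-up name.

```rust
/// Sums each column of a row-major matrix while reading memory sequentially.
/// The outer loop walks rows; the inner loop adds each element to the
/// accumulator of the column it belongs to.
fn column_sums<const R: usize, const C: usize>(m: &[[i32; C]; R]) -> [i32; C] {
    let mut sums = [0; C];
    for row in m {
        for (sum, value) in sums.iter_mut().zip(row) {
            *sum += *value;
        }
    }
    sums
}
```

For the 6x4 example matrix this yields the column sums 1688, 1162, 444 and 90. The same loop order applies to a C# `int[,]`, which is also stored row by row.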
Or maybe without for loops (note that this computes a single grand total, not per-column sums):
List<List<int>> m_indexes = new List<List<int>>() { new List<int>(){367, 40, 74, 15},
new List<int>(){535, 226, 74, 15},
new List<int>(){368, 313, 74, 15},
new List<int>(){197, 316, 74, 15},
new List<int>(){27, 226, 74, 15},
new List<int>(){194, 41, 74, 15} };
var res = m_indexes.Select(x => x.Sum()).Sum();
Straightforward LINQ way:
var columnSums = m_indexes.OfType<int>().Select((x,i) => new { x, col = i % m_indexes.GetLength(1) } )
.GroupBy(x => x.col)
.Select(x => new { Column = x.Key, Sum = x.Sum(g => g.x) });
It might not be worth it to parallelize. If you need to access the array by index, you spend some cycles on bounds checking, so, as always with performance, do measure it.

At which n does binary search become faster than linear search on a modern CPU?

Due to the wonders of branch prediction, a binary search can be slower than a linear search through an array of integers. On a typical desktop processor, how big does that array have to get before it would be better to use a binary search? Assume the structure will be used for many lookups.
I've tried a little C++ benchmarking and I'm surprised - linear search seems to prevail up to several dozen items, and I haven't found a case where binary search is better for those sizes. Maybe gcc's STL is not well tuned? But then -- what would you use to implement either kind of search?-) So here's my code, so everybody can see if I've done something silly that would distort timing grossly...:
#include <vector>
#include <algorithm>
#include <iostream>
#include <stdlib.h>
int data[] = {98, 50, 54, 43, 39, 91, 17, 85, 42, 84, 23, 7, 70, 72, 74, 65, 66, 47, 20, 27, 61, 62, 22, 75, 24, 6, 2, 68, 45, 77, 82, 29, 59, 97, 95, 94, 40, 80, 86, 9, 78, 69, 15, 51, 14, 36, 76, 18, 48, 73, 79, 25, 11, 38, 71, 1, 57, 3, 26, 37, 19, 67, 35, 87, 60, 34, 5, 88, 52, 96, 31, 30, 81, 4, 92, 21, 33, 44, 63, 83, 56, 0, 12, 8, 93, 49, 41, 58, 89, 10, 28, 55, 46, 13, 64, 53, 32, 16, 90
};
int tosearch[] = {53, 5, 40, 71, 37, 14, 52, 28, 25, 11, 23, 13, 70, 81, 77, 10, 17, 26, 56, 15, 94, 42, 18, 39, 50, 78, 93, 19, 87, 43, 63, 67, 79, 4, 64, 6, 38, 45, 91, 86, 20, 30, 58, 68, 33, 12, 97, 95, 9, 89, 32, 72, 74, 1, 2, 34, 62, 57, 29, 21, 49, 69, 0, 31, 3, 27, 60, 59, 24, 41, 80, 7, 51, 8, 47, 54, 90, 36, 76, 22, 44, 84, 48, 73, 65, 96, 83, 66, 61, 16, 88, 92, 98, 85, 75, 82, 55, 35, 46
};
bool binsearch(int i, std::vector<int>::const_iterator begin,
std::vector<int>::const_iterator end) {
return std::binary_search(begin, end, i);
}
bool linsearch(int i, std::vector<int>::const_iterator begin,
std::vector<int>::const_iterator end) {
return std::find(begin, end, i) != end;
}
int main(int argc, char *argv[])
{
int n = 6;
if (argc < 2) {
std::cerr << "need at least 1 arg (l or b!)" << std::endl;
return 1;
}
char algo = argv[1][0];
if (algo != 'b' && algo != 'l') {
std::cerr << "algo must be l or b, not '" << algo << "'" << std::endl;
return 1;
}
if (argc > 2) {
n = atoi(argv[2]);
}
std::vector<int> vv;
for (int i=0; i<n; ++i) {
if(data[i]==-1) break;
vv.push_back(data[i]);
}
if (algo=='b') {
std::sort(vv.begin(), vv.end());
}
bool (*search)(int i, std::vector<int>::const_iterator begin,
std::vector<int>::const_iterator end);
if (algo=='b') search = binsearch;
else search = linsearch;
int nf = 0;
int ns = 0;
for(int k=0; k<10000; ++k) {
for (int j=0; tosearch[j] >= 0; ++j) {
++ns;
if (search(tosearch[j], vv.begin(), vv.end()))
++nf;
}
}
std::cout << nf <<'/'<< ns << std::endl;
return 0;
}
and a couple of my timings on a Core Duo:
AmAir:stko aleax$ time ./a.out b 93
1910000/2030000
real 0m0.230s
user 0m0.224s
sys 0m0.005s
AmAir:stko aleax$ time ./a.out l 93
1910000/2030000
real 0m0.169s
user 0m0.164s
sys 0m0.005s
They're pretty repeatable, anyway...
OP says: Alex, I edited your program to just fill the array with 1..n, not run std::sort, and do about 10 million (mod integer division) searches. Binary search starts to pull away from linear search at n=150 on a Pentium 4. Sorry about the chart colors.
binary vs linear search http://spreadsheets.google.com/pub?key=tzWXX9Qmmu3_COpTYkTqsOA&oid=1&output=image
I don't think branch prediction should matter, because a linear search also has branches. And to my knowledge there are no SIMD instructions that can do a linear search for you.
Having said that, a useful model would be to assume that each step of the binary search has a multiplier cost C.
C log2 n = n
So to reason about this without actually benchmarking, you would make a guess for C and round n up to the next integer. For example, if you guess C=3, then binary search would become faster at n=10, since 3 log2 10 ≈ 9.97 is already below 10.
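The break-even point of this cost model can be found by brute force. The sketch below assumes the model exactly as stated (binary search costs C * log2(n), linear search costs n); `crossover_n` is a name introduced here, and real hardware will not follow the model closely.

```rust
/// Smallest n >= 2 at which the modelled binary-search cost c * log2(n)
/// drops below the modelled linear-search cost n.
/// Returns None if no such n fits in a u32 (only for absurdly large c).
fn crossover_n(c: f64) -> Option<u32> {
    // Start at 2: for n = 1 the logarithm is 0, which would match trivially.
    (2u32..=u32::MAX).find(|&n| c * f64::from(n).log2() < f64::from(n))
}
```

Under this model `crossover_n(3.0)` returns `Some(10)`, because 3 * log2(9) is about 9.51, still above 9, while 3 * log2(10) is about 9.97, just below 10.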
Not many - but hard to say exactly without benchmarking it.
Personally I'd tend to prefer the binary search, because in two years' time, when someone else has quadrupled the size of your little array, you won't have lost much performance. Unless I knew very specifically that it's a bottleneck right now and I needed it to be as fast as possible, of course.
Having said that, remember that there are hash tables too; you could ask a similar question about them vs. binary search.
