Using Basic Neural Network Subroutines (BNNS) in Accelerate - macOS

I am trying to perform a 1x1 convolution using the Apple BNNS (Basic Neural Network Subroutines) library in Accelerate.
When I run it on a 9x1 column vector, I get unexpected results.
Sample code is posted at: https://gist.github.com/cancan101/5887cb93cc91a2d10e2bfd23284bb438 (a modification of the BNNS sample code).
Expected Results:
Print numbers 0-8.
Actual Results:
o0: 0.000000
o1: 0.000000
o2: 0.000000
o3: 3.000000
o4: 0.000000
o5: 5.000000
o6: 0.000000
o7: 7.000000
o8: 0.000000
I suspect I am doing this correctly, but I am open to feedback on the linked code.

If you transpose the input and output to row vectors, you see the expected output. Change the input descriptor from this:
i_desc.width = 1;
i_desc.height = 9;
i_desc.row_stride = 1;
to this:
i_desc.width = 9;
i_desc.height = 1;
i_desc.row_stride = 9;
and do the same for the output:
o_desc.width = 9;
o_desc.height = 1;
o_desc.row_stride = 9;
Result:
Input image stack: 9 x 1 x 1
Output image stack: 9 x 1 x 1
Convolution kernel: 1 x 1
o0: 0.000000
o1: 1.000000
o2: 2.000000
o3: 3.000000
o4: 4.000000
o5: 5.000000
o6: 6.000000
o7: 7.000000
o8: 8.000000
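For reference, here is a minimal sketch of the row-vector descriptor setup in C, assuming float32 data and a single channel (field names are from the macOS 10.12 BNNS headers; check them against your SDK):

#include <Accelerate/Accelerate.h>

// Minimal sketch: describe the 9 elements as a 9x1 row vector
// instead of a 1x9 column vector.
static void setup_descriptors(BNNSImageStackDescriptor *i_desc,
                              BNNSImageStackDescriptor *o_desc)
{
    *i_desc = (BNNSImageStackDescriptor){
        .width        = 9,                  // 9 elements per row
        .height       = 1,                  // a single row
        .channels     = 1,
        .row_stride   = 9,                  // floats between successive rows
        .image_stride = 9,                  // floats between successive channels
        .data_type    = BNNSDataTypeFloat32,
    };
    *o_desc = *i_desc;  // a 1x1 convolution preserves the shape
}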


Storage problem in R: alternative to a nested loop for creating an array of matrices and then multiple plots

With the following pieces of information, I can easily create an array of matrices
b0 = data.frame(b0_1 = c(11.41, 11.36), b0_2 = c(8.767, 6.950))
b1 = data.frame(b1_1 = c(0.8539, 0.9565), b1_2 = c(-0.03179, 0.06752))
b2 = data.frame(b2_1 = c(-0.013020, -0.016540), b2_2 = c(-0.0002822, -0.0026720))
T.val = data.frame(T1 = c(1, 1), T2 = c(1, 2), T3 = c(2, 1))
dt_data = cbind(b0, b1, b2, T.val)
fu.time = seq(0, 50, by = 0.8)
pat = ncol(T.val)  # number of T's
nit = 2            # number of rows
pt.array1 = array(NA, dim = c(nit, length(fu.time), pat))
for (it.er in 1:nit) {
  for (ti in 1:length(fu.time)) {
    for (pt in 1:pat) {
      pt.array1[it.er, ti, pt] = b0[it.er, T.val[it.er, pt]] +
        b1[it.er, T.val[it.er, pt]] * fu.time[ti] +
        b2[it.er, T.val[it.er, pt]] * fu.time[ti]^2
    }
  }
}
pt.array_mean = apply(pt.array1, c(3, 2), mean)
pt.array_LCL  = apply(pt.array1, c(3, 2), quantile, prob = 0.25)
pt.array_UCL  = apply(pt.array1, c(3, 2), quantile, prob = 0.975)
Now with these additional data, I can create three plots as follows
mydata
   pt.ID      time IPSS
1      1  0.000000   10
2      1  1.117808    8
3      1  4.504110    5
4      1  6.410959   14
5      1 13.808220   10
6      1 19.890410    4
7      1 28.865750   15
8      1 35.112330    7
9      2  0.000000    6
10     2  1.117808    7
11     2  4.109589    8
12     2 10.093151    7
13     2 16.273973   11
14     2 18.345205   18
15     2 21.567120   14
16     2 25.808220   12
17     2 56.087670    5
18     3  0.000000    8
19     3  1.413699    3
20     3  4.405479    3
21     3 10.389041    8
pdf("plots.pdf")
par(mfrow=c(3,2))
for( pt.no in 1:pat){
plot(IPSS[ID==pt.no]~time[ID==pt.no],xlim=c(0,57),ylim=c(0,35),type="l",col="black",
xlab="f/u time", ylab= "",main = paste("patient", pt.no),data=mydata)
points(IPSS[ID==pt.no]~time[ID==pt.no],data=mydata)
lines(pt.array_mean[pt.no,]~fu.time, col="blue")
lines(pt.array_LCL[pt.no,]~fu.time, col="green")
lines(pt.array_UCL[pt.no,]~fu.time, col="green")
}
dev.off()
The problem arises when the number of rows in each matrix is much bigger, say 10000: it takes too much computation time to create pt.array1 for a large number of rows in b0, b1 and b2.
Is there any alternative way I can do this quickly using a built-in function?
Can I avoid the storage allocation for pt.array1, since I am not using it further? I just need pt.array_mean, pt.array_UCL and pt.array_LCL for my plots.
Any help is appreciated.
There are a couple of other approaches you can employ.
First, you largely have a model of b0 + b1*fu + b2*fu^2, so you can build a matrix of coefficients and apply fu.time after the fact:
ind <- expand.grid(nits = seq_len(nit), pats = seq_len(pat))
mat_ind <- cbind(ind[, 'nits'], T.val[as.matrix(ind)])
b_mat <- matrix(c(b0[mat_ind], b1[mat_ind], b2[mat_ind]), ncol = 3)
b_mat
       [,1]     [,2]       [,3]
[1,] 11.410  0.85390 -0.0130200
[2,] 11.360  0.95650 -0.0165400
[3,] 11.410  0.85390 -0.0130200
[4,]  6.950  0.06752 -0.0026720
[5,]  8.767 -0.03179 -0.0002822
[6,] 11.360  0.95650 -0.0165400
Now if we apply the model to each row, we get all of your raw results. The only problem is that the shape doesn't match your original output - each column slice of your array is the equivalent of a row slice of my matrix output.
pt_array <- apply(b_mat, 1, function(x) x[1] + x[2] * fu.time + x[3] * fu.time^2)
pt_array[1,]
[1] 11.410 11.360 11.410 6.950 8.767 11.360
pt.array1[, 1, ]
      [,1]  [,2]   [,3]
[1,] 11.41 11.41  8.767
[2,] 11.36  6.95 11.360
That's OK, because we can fix the shape as we compute the summary statistics - we just need the colMeans2 and colQuantiles of each row reshaped into a 2 x 3 (nit x pat) matrix:
library(matrixStats)
pt_summary = array(t(apply(pt_array,
                           1,
                           function(row) {
                             M <- matrix(row, ncol = pat)
                             c(colMeans2(M),
                               colQuantiles(M, probs = c(0.25, 0.975)))
                           })),
                   dim = c(length(fu.time), pat, 3),
                   dimnames = list(NULL,
                                   paste0('pat', seq_len(pat)),
                                   c('mean', 'LCL', 'UCL')))
pt_summary[1, ,] #slice at time = 1
        mean      LCL      UCL
pat1 11.3850 11.37250 11.40875
pat2  9.1800  8.06500 11.29850
pat3 10.0635  9.41525 11.29518
# rm(pt.array1)
Then, for the final graphing, I simplified it - the data argument can be subset(mydata, pt.ID == pt.no). Additionally, since the summary statistics are now in an array format, matlines allows all three lines to be drawn at once:
par(mfrow = c(3, 2))
for (pt.no in 1:pat) {
  plot(IPSS ~ time, data = subset(mydata, pt.ID == pt.no),
       xlim = c(0, 57), ylim = c(0, 35),
       type = "l", col = "black", xlab = "f/u time", ylab = "",
       main = paste("patient", pt.no))
  points(IPSS ~ time, data = subset(mydata, pt.ID == pt.no))
  matlines(y = pt_summary[, pt.no, ], x = fu.time, col = c("blue", "green", "green"))
}
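As an aside, the pt_array construction itself can be collapsed into a single matrix multiplication, since every row of b_mat is just a quadratic in fu.time. A minimal sketch using only the objects defined above:

X <- cbind(1, fu.time, fu.time^2)  # length(fu.time) x 3 design matrix
pt_array2 <- X %*% t(b_mat)        # length(fu.time) x (nit * pat)
all.equal(pt_array, pt_array2)     # should be TRUE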

What exactly does math.Exp do?

Sorry, but I am not able to understand what exactly math.Exp is doing in the following code block:
package main

import (
    "fmt"
    "math"
)

func main() {
    for x := 0; x < 8; x++ {
        fmt.Printf("x = %f ex = %8.3f\n", float64(x), math.Exp(float64(x)))
    }
}
The output of the above program is:
x = 0.000000 ex = 1.000
x = 1.000000 ex = 2.718
x = 2.000000 ex = 7.389
x = 3.000000 ex = 20.086
x = 4.000000 ex = 54.598
x = 5.000000 ex = 148.413
x = 6.000000 ex = 403.429
x = 7.000000 ex = 1096.633
I am not able to understand what exactly the math.Exp function is doing internally to convert float64(x) into the respective values in the output. I have read Go's official documentation, which says:
Exp returns e**x, the base-e exponential of x.
Reading that, I am still not clear on the purpose and mechanism of the math.Exp function.
I am actually interested in what binary/mathematical operations are going on under the hood.
It returns the value of e^x (also expressed as e**x or simply exp(x)).
That function is based on the number e = 2.71828..., which is defined (among other definitions) as:
e = lim (1 + 1/n)^n as n tends to infinity
(for example, (1 + 1/1000)^1000 ≈ 2.7169, already close to e).
In particular, the function e^x has many properties that make it special, but the most important is that the function is equal to its own derivative:
Let f(x) = e^x; then f'(x) = e^x.
This means that the slope of the function at any point is equal to the value of the function at that point.
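As for what happens under the hood: math.Exp uses argument reduction plus a polynomial approximation, but the underlying mathematics can be illustrated with the Taylor series e**x = 1 + x + x^2/2! + x^3/3! + ... Here is a rough sketch (expTaylor is an illustrative helper, not how the standard library actually computes it):

package main

import "fmt"

// expTaylor approximates e**x with the first n terms of the Taylor
// series. It is only an illustration; math.Exp uses argument
// reduction and a polynomial approximation for speed and accuracy.
func expTaylor(x float64, n int) float64 {
    sum, term := 1.0, 1.0
    for k := 1; k < n; k++ {
        term *= x / float64(k) // term is now x^k / k!
        sum += term
    }
    return sum
}

func main() {
    for x := 0; x < 8; x++ {
        fmt.Printf("x = %d  approx = %8.3f\n", x, expTaylor(float64(x), 30))
    }
}

With 30 terms this reproduces the outputs quoted above to the displayed precision.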

Why does fsync() take much more time on Linux kernel 3.1.* than on kernel 3.0?

I have a test program. It takes about 37 seconds on Linux kernel 3.1.*, but only about 1 second on kernel 3.0.18 (I just replaced the kernel on the same machine). Please give me a clue on how to improve it on kernel 3.1. Thanks!
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int my_fsync(int fd)
{
    // return fdatasync(fd);
    return fsync(fd);
}

int main(int argc, char **argv)
{
    int rc = 0;
    int count;
    int i;
    char oldpath[1024];
    char newpath[1024];
    char *writebuffer = calloc(1024, 1);

    snprintf(oldpath, sizeof(oldpath), "./%s", "foo");
    snprintf(newpath, sizeof(newpath), "./%s", "foo.new");

    /* 1000 iterations of the write/fsync/rename atomic-replace pattern */
    for (count = 0; count < 1000; ++count) {
        int fd = open(newpath, O_CREAT | O_TRUNC | O_WRONLY, S_IRWXU);
        if (fd == -1) {
            fprintf(stderr, "open error! path: %s\n", newpath);
            exit(1);
        }
        for (i = 0; i < 10; i++) {
            rc = write(fd, writebuffer, 1024);
            if (rc != 1024) {
                fprintf(stderr, "underwrite!\n");
                exit(1);
            }
        }
        if (my_fsync(fd)) {
            perror("fsync failed!\n");
            exit(1);
        }
        if (close(fd)) {
            perror("close failed!\n");
            exit(1);
        }
        if (rename(newpath, oldpath)) {
            perror("rename failed!\n");
            exit(1);
        }
    }
    return 0;
}
# strace -c ./testfsync
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.58    0.068004          68      1000           fsync
  0.84    0.000577           0     10001           write
  0.40    0.000275           0      1000           rename
  0.19    0.000129           0      1003           open
  0.00    0.000000           0         1           read
  0.00    0.000000           0      1003           close
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         2           setitimer
  0.00    0.000000           0        68           sigreturn
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         1           mprotect
  0.00    0.000000           0         2           writev
  0.00    0.000000           0         2           rt_sigaction
  0.00    0.000000           0         6           mmap2
  0.00    0.000000           0         2           fstat64
  0.00    0.000000           0         1           set_thread_area
------ ----------- ----------- --------- --------- ----------------
100.00    0.068985                 14099         1 total
Kernel 3.1.* is actually doing the sync; 3.0.18 is faking it. Your code does 1,000 synchronized writes, and since you truncate the file, each write also enlarges it, so you actually have 2,000 write operations. Typical hard drive write latency is about 20 milliseconds per I/O, and 2,000 * 20 ms = 40,000 ms, or 40 seconds. So the timing seems about right, assuming you're writing to a typical hard drive.
Basically, by syncing after each write, you give the kernel no ability to efficiently cache or overlap the writes and force worst-case behavior on every operation. Also, the hard drive winds up having to seek back and forth between where the data is written and where the metadata is written once for each write.
Found the reason: file system barriers are enabled by default in ext3 as of Linux kernel 3.1 (http://kernelnewbies.org/Linux_3.1). After disabling barriers, it becomes much faster.
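If you want to reproduce the comparison, barriers can be toggled with an ext3 mount option; note that barrier=0 trades crash safety for speed (the mount point below is just an example):

# remount an ext3 filesystem with write barriers disabled
mount -o remount,barrier=0 /mnt/data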

Source of Ruby benchmark irregularities

Running this code:
require 'benchmark'
Benchmark.bm do |x|
  x.report("1+1") {15_000_000.times {1+1}}
  x.report("1+1") {15_000_000.times {1+1}}
  x.report("1+1") {15_000_000.times {1+1}}
  x.report("1+1") {15_000_000.times {1+1}}
  x.report("1+1") {15_000_000.times {1+1}}
end
Outputs these results:
       user     system      total        real
1+1  2.188000   0.000000   2.188000 (  2.250000)
1+1  2.250000   0.000000   2.250000 (  2.265625)
1+1  2.234000   0.000000   2.234000 (  2.250000)
1+1  2.203000   0.000000   2.203000 (  2.250000)
1+1  2.266000   0.000000   2.266000 (  2.281250)
I'm guessing the variation is a result of the system environment, but I wanted to confirm that this is the case.
"Guessing the variation is a result of the system environment", you are right.
Benchmarks can't be precise all time. You don't have a perfect regular machine to run something always in the same time. Take two numbers from benchmark as the same if they were too near, as in this case.
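One standard way to reduce that jitter is Benchmark.bmbm, which runs a rehearsal pass first so the VM and garbage collector are warmed up before the measured pass; a minimal sketch:

require 'benchmark'

Benchmark.bmbm do |x|
  # Each block runs twice: once as rehearsal, once for the reported numbers.
  5.times { x.report("1+1") { 15_000_000.times { 1 + 1 } } }
end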
I tried using eval to partially unroll the loop, and although it made it faster, it made the execution time less consistent!
$VERBOSE &&= false # You do not want 15 thousand "warning: useless use of + in void context" warnings
# large_number = 15_000_000 # Too large! Caused eval to take too long, so I gave up
somewhat_large_number = 15_000
unrolled = "def do_addition\n" + ("1+1\n" * somewhat_large_number) + "end\n" ; nil
eval(unrolled)
require 'benchmark'
Benchmark.bm do |x|
  x.report("1+1 partially unrolled") { i = 0; while i < 1000; do_addition; i += 1; end }
  x.report("1+1 partially unrolled") { i = 0; while i < 1000; do_addition; i += 1; end }
  x.report("1+1 partially unrolled") { i = 0; while i < 1000; do_addition; i += 1; end }
  x.report("1+1 partially unrolled") { i = 0; while i < 1000; do_addition; i += 1; end }
  x.report("1+1 partially unrolled") { i = 0; while i < 1000; do_addition; i += 1; end }
  x.report("1+1 partially unrolled") { i = 0; while i < 1000; do_addition; i += 1; end }
  x.report("1+1 partially unrolled") { i = 0; while i < 1000; do_addition; i += 1; end }
  x.report("1+1 partially unrolled") { i = 0; while i < 1000; do_addition; i += 1; end }
  x.report("1+1 partially unrolled") { i = 0; while i < 1000; do_addition; i += 1; end }
  x.report("1+1 partially unrolled") { i = 0; while i < 1000; do_addition; i += 1; end }
end
gave me
                             user     system      total        real
1+1 partially unrolled   0.750000   0.000000   0.750000 (  0.765586)
1+1 partially unrolled   0.765000   0.000000   0.765000 (  0.765586)
1+1 partially unrolled   0.688000   0.000000   0.688000 (  0.703089)
1+1 partially unrolled   0.797000   0.000000   0.797000 (  0.796834)
1+1 partially unrolled   0.750000   0.000000   0.750000 (  0.749962)
1+1 partially unrolled   0.781000   0.000000   0.781000 (  0.781210)
1+1 partially unrolled   0.719000   0.000000   0.719000 (  0.718713)
1+1 partially unrolled   0.750000   0.000000   0.750000 (  0.749962)
1+1 partially unrolled   0.765000   0.000000   0.765000 (  0.765585)
1+1 partially unrolled   0.781000   0.000000   0.781000 (  0.781210)
For the purpose of comparison, your benchmark on my computer gave
        user     system      total        real
1+1  2.406000   0.000000   2.406000 (  2.406497)
1+1  2.407000   0.000000   2.407000 (  2.484629)
1+1  2.500000   0.000000   2.500000 (  2.734655)
1+1  2.515000   0.000000   2.515000 (  2.765908)
1+1  2.703000   0.000000   2.703000 (  4.391075)
(real time varied in the last line, but not user or total)

How much slower are strings containing numbers compared to numbers?

Say I want to take a number and return its digits as an array in Ruby.
For this specific purpose or for string functions and number functions in general, which is faster?
These are the algorithms I assume would be most commonly used:
Using Strings: n.to_s.split(//).map {|x| x.to_i}
Using Numbers:
array = []
until n == 0
  m = n % 10
  array.unshift(m)
  n /= 10
end
The difference seems to be less than one order of magnitude, with the integer-based approach faster for Fixnums. For Bignums, the relative performance starts out more or less even, with the string approach winning out significantly as the number of digits grows.
As strings
Program
#!/usr/bin/env ruby
require 'profile'
$n = 1234567890
10000.times do
  $n.to_s.split(//).map { |x| x.to_i }
end
Output
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
55.64      0.74      0.74    10000     0.07     0.10  Array#map
21.05      1.02      0.28   100000     0.00     0.00  String#to_i
10.53      1.16      0.14        1   140.00  1330.00  Integer#times
 7.52      1.26      0.10    10000     0.01     0.01  String#split
 5.26      1.33      0.07    10000     0.01     0.01  Fixnum#to_s
 0.00      1.33      0.00        1     0.00  1330.00  #toplevel
As integers
Program
#!/usr/bin/env ruby
require 'profile'
$n = 1234567890
10000.times do
  array = []
  n = $n
  until n == 0
    m = n % 10
    array.unshift(m)
    n /= 10
  end
  array
end
Output
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
70.64      0.77      0.77        1   770.00  1090.00  Integer#times
29.36      1.09      0.32   100000     0.00     0.00  Array#unshift
 0.00      1.09      0.00        1     0.00  1090.00  #toplevel
Addendum
The pattern seems to hold for smaller numbers also. With $n = 12345, it was around 800ms for the string-based approach and 550ms for the integer-based approach.
When I crossed the boundary into Bignums, say, with $n = 12345678901234567890, I got 2375ms for both approaches. It would appear that the difference evens out, which I would have taken to mean that the internal logic powering Bignum is string-like. However, the documentation seems to suggest otherwise.
For academic purposes, I once again doubled the number of digits to $n = 1234567890123456789012345678901234567890. I got around 4450ms for the string approach and 9850ms for the integer approach, a stark reversal that rules out my previous postulate.
Summary
Number of digits | String program | Integer program | Difference
---------------------------------------------------------------------------
5 | 800ms | 550ms | Integer wins by 250ms
10 | 1330ms | 1090ms | Integer wins by 240ms
20 | 2375ms | 2375ms | Tie
40 | 4450ms | 9850ms | String wins by 4400ms
Steven's response is impressive, but I looked at it for a couple of minutes and couldn't distill it into a simple answer, so here is mine.
For Fixnums
It is fastest to use the digits method I provide below. It's also pretty quick (and much easier) to use num.to_s.each_char.map(&:to_i).
For Bignums
It is fastest to use num.to_s.each_char.map(&:to_i).
The Solution
If speed is honestly the determining factor for what code you use (meaning don't be evil), then this code is the best choice for the job.
class Integer
  def digits
    working_int, digits = self, Array.new
    until working_int.zero?
      digits.unshift working_int % 10
      working_int /= 10
    end
    digits
  end
end

class Bignum
  def digits
    to_s.each_char.map(&:to_i)
  end
end
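Usage of the monkey patch above (digits come back most significant first):

1234.digits        #=> [1, 2, 3, 4]
1234567890.digits  #=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]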
Here are the approaches I considered to arrive at this conclusion.
I put together a benchmark with the 'benchmark' library, using Steven Xu's code examples plus a String#each_byte version.
require 'benchmark'

MAX = 10_000

# Solution based on http://stackoverflow.com/questions/6445496/how-much-slower-are-strings-containing-numbers-compared-to-numbers/6447254#6447254
class Integer
  def digits
    working_int, digits = self, Array.new
    until working_int.zero?
      digits.unshift working_int % 10
      working_int /= 10
    end
    digits
  end
end

class Bignum
  def digits
    to_s.each_char.map(&:to_i)
  end
end

[
  12345,
  1234567890,
  12345678901234567890,
  1234567890123456789012345678901234567890,
].each do |num|
  puts "========="
  puts "Benchmark #{num}"
  Benchmark.bm do |b|
    b.report("Integer%        ") do
      MAX.times do
        array = []
        n = num
        until n == 0
          m = n % 10
          array.unshift(m)
          n /= 10
        end
        array
      end
    end
    b.report("Integer% <<     ") do
      MAX.times do
        array = []
        n = num
        until n == 0
          m = n % 10
          array << m
          n /= 10
        end
        array.reverse
      end
    end
    b.report("Integer#divmod  ") do
      MAX.times do
        array = []
        n = num
        until n == 0
          n, x = *n.divmod(10)
          array.unshift(x)
        end
        array
      end
    end
    b.report("Integer#divmod<<") do
      MAX.times do
        array = []
        n = num
        until n == 0
          n, x = *n.divmod(10)
          array << x
        end
        array.reverse
      end
    end
    b.report("String+split//  ") do
      MAX.times { num.to_s.split(//).map { |x| x.to_i } }
    end
    b.report("String#each_byte") do
      MAX.times { num.to_s.each_byte.map { |x| x.chr } }
    end
    b.report("String#each_char") do
      MAX.times { num.to_s.each_char.map { |x| x.to_i } }
    end
    # http://stackoverflow.com/questions/6445496/how-much-slower-are-strings-containing-numbers-compared-to-numbers/6447254#6447254
    b.report("Num#digit       ") do
      MAX.times { num.to_s.each_char.map { |x| x.to_i } }
    end
  end
end
My results:
Benchmark 12345
                       user     system      total        real
Integer%           0.015000   0.000000   0.015000 (  0.015625)
Integer% <<        0.016000   0.000000   0.016000 (  0.015625)
Integer#divmod     0.047000   0.000000   0.047000 (  0.046875)
Integer#divmod<<   0.031000   0.000000   0.031000 (  0.031250)
String+split//     0.109000   0.000000   0.109000 (  0.109375)
String#each_byte   0.047000   0.000000   0.047000 (  0.046875)
String#each_char   0.047000   0.000000   0.047000 (  0.046875)
Num#digit          0.047000   0.000000   0.047000 (  0.046875)
=========
Benchmark 1234567890
                       user     system      total        real
Integer%           0.047000   0.000000   0.047000 (  0.046875)
Integer% <<        0.046000   0.000000   0.046000 (  0.046875)
Integer#divmod     0.063000   0.000000   0.063000 (  0.062500)
Integer#divmod<<   0.062000   0.000000   0.062000 (  0.062500)
String+split//     0.188000   0.000000   0.188000 (  0.187500)
String#each_byte   0.063000   0.000000   0.063000 (  0.062500)
String#each_char   0.093000   0.000000   0.093000 (  0.093750)
Num#digit          0.079000   0.000000   0.079000 (  0.078125)
=========
Benchmark 12345678901234567890
                       user     system      total        real
Integer%           0.234000   0.000000   0.234000 (  0.234375)
Integer% <<        0.234000   0.000000   0.234000 (  0.234375)
Integer#divmod     0.203000   0.000000   0.203000 (  0.203125)
Integer#divmod<<   0.172000   0.000000   0.172000 (  0.171875)
String+split//     0.266000   0.000000   0.266000 (  0.265625)
String#each_byte   0.125000   0.000000   0.125000 (  0.125000)
String#each_char   0.141000   0.000000   0.141000 (  0.140625)
Num#digit          0.141000   0.000000   0.141000 (  0.140625)
=========
Benchmark 1234567890123456789012345678901234567890
                       user     system      total        real
Integer%           0.718000   0.000000   0.718000 (  0.718750)
Integer% <<        0.657000   0.000000   0.657000 (  0.656250)
Integer#divmod     0.562000   0.000000   0.562000 (  0.562500)
Integer#divmod<<   0.485000   0.000000   0.485000 (  0.484375)
String+split//     0.500000   0.000000   0.500000 (  0.500000)
String#each_byte   0.218000   0.000000   0.218000 (  0.218750)
String#each_char   0.282000   0.000000   0.282000 (  0.281250)
Num#digit          0.265000   0.000000   0.265000 (  0.265625)
String#each_byte/each_char is faster than split; for smaller numbers, the integer version is faster.
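A note for readers on current Ruby: since Ruby 2.4, Integer#digits is built in (and Fixnum/Bignum are unified into Integer); it returns the digits least significant first:

1234567890.digits          #=> [0, 9, 8, 7, 6, 5, 4, 3, 2, 1]
1234567890.digits.reverse  #=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]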
