I am reading data from a Riff wav fmt file, I have an array of the data chunk DataBuffer of the wav file, how can I convert the number of bytes read of the data to the number of seconds read from the wav file.
int size_buffer = (Subchunk2Size / (NumOfChan * bitsPerSample / 8));
FILE* WavResult = fopen(FileNom, "rb");
u8* DataBuffer = new u8[size_buffer];
size_t nRead = fread(DataBuffer, sizeof DataBuffer[0], size_buffer, WavResult);
I think you mixed up things a bit. Assuming the field names in the WAV header as described in http://soundfile.sapp.org/doc/WaveFormat :
ChunkID - "RIFF"
ChunkSize
Format - "WAVE"
Subchunk1ID - "fmt "
Subchunk1Size
AudioFormat
NumChannels
SampleRate
ByteRate
BlockAlign
BitsPerSample
Subchunk2ID - "data"
Subchunk2Size
data
This line of yours:
int size_buffer = (Subchunk2Size / (NumOfChan * bitsPerSample / 8));
calculates a number of samples in a single channel. Or a number of blocks, where the block is a structure that contains one sample for each channel. If you use that for allocating memory for bytes from data chunk, then it will be enough only in case of 8-bit mono audio.
If allocating memory for bytes is really what you want, then simply use Subchunk2Size as the size.
If you want to allocate memory for samples, then it will differ depending if the audio is 8-bit or 16-bit (I'm ignoring other possibilities). For 8-bit:
const uint32_t num_of_samples = Subchunk2Size / (BitsPerSample / 8);
uint8_t *samples = new uint8_t[num_of_samples];
and for 16-bit:
const uint32_t num_of_samples = Subchunk2Size / (BitsPerSample / 8);
int16_t *samples = new int16_t[num_of_samples];
Personally, I'd rather use std::vector instead of c-arrays:
const uint32_t num_of_samples = Subchunk2Size / (BitsPerSample / 8);
std::vector<int16_t> samples;
samples.resize(num_of_samples); // could be done in the constructor, but I am afraid of vector constructors ;-)
I also assume here that the audio is in the most popular encoding (I think), i.e., unsigned for 8-bit and signed for 16-bit. I'm also ignoring the issue of endianness.
But back to the number of seconds. We can calculate that using the total number of blocks and SampleRate. SampleRate tells us how many samples (in a single channel) there are per second. Or in another words, how many blocks there are per second. So the number of seconds is:
const double num_of_seconds = 1.0 * num_of_blocks / SampleRate;
You can calculate the number of blocks using the formula from your first line:
const uint32_t num_of_blocks = Subchunk2Size / (NumChannels * BitsPerSample / 8);
or, as we already have num_of_samples, which is the total number of samples from all channels, we can just divide that by NumChannels:
const uint32_t num_of_blocks = num_of_samples / NumChannels;
And lastly, in case all you wanted was really just to get the number of seconds from the number of bytes, then there are 2 options. You can calculate the block size:
const int block_size = NumChannels * BitsPerSample / 8;
which should be essentially the same as BlockAlign, and then divide Subchunk2Size by it, to get the number of blocks, and again by SampleRate to get the number of seconds:
const double num_of_seconds = 1.0 * Subchunk2Size / block_size / SampleRate;
// or
const double num_of_seconds = 1.0 * Subchunk2Size / BlockAlign / SampleRate;
Or you can use ByteRate, which is the number of bytes per second:
const double num_of_seconds = 1.0 * Subchunk2Size / ByteRate;
Related
I noticed that on Windows every time I issue an unbuffered fread() request with an odd length, it's split into 2 requests (as observed through procmon):
a) fread for my requested length-1
b) 2-byte fread for the last byte
This has an obvious performance overhead like 2 kernel requests instead of one etc.
Sample code ran on Windows 10:
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char* argv[]) {
FILE* pFile;
char* buffer;
pFile = fopen(argv[0], "rb");
setbuf(pFile, nullptr);
size_t len = 3;
buffer = (char*)malloc(sizeof(char)*len);
if (len != fread(buffer, 1, len, pFile)) { fputs("Reading error", stderr); exit(3); }
free(buffer);
fclose(pFile);
return 0;
}
This results in the following procmon reported calls:
ReadFile c:\work\cpptry\Debug\cpptry.exe SUCCESS Offset: 0, Length: 2, Priority: Normal
ReadFile c:\work\cpptry\Debug\cpptry.exe SUCCESS Offset: 2, Length: 2
It seems as if Windows is incapable of issuing odd-sized requests to the file system.
What's up with that?
This is implementation artifact.
MS CRT keeps all FILEs buffered even if you tell it to don't do this. Instead file buffer is set to internal buffer with space for two bytes. This allows to keep one code path instead of two and simplifies implementation of fast path in fgetc and fputc.
#define fgetc(_stream) (--(_stream)->_cnt >= 0 ? 0xff & *(_stream)->_ptr++ : _filbuf(_stream))
Some of you are probably bothered by size of the buffer (2 bytes when quasi unbuffered), but in _fread_nolock_s function we can find optimization
witch tries to read multiplies of buffer size directly to the destination bypassing file buffer.
See fread.c in CRT sources:
/* calc chars to read -- (count/streambufsize) * streambufsize */
nbytes = (unsigned)(count - count % streambufsize);
...
nread = _read_nolock(_fileno(stream), data, nbytes);
Because the file buffer's size is equal 2, even number of bytes is read directly to the destination and eventual one byte goes through the file buffer. Sometimes there could be some bytes in the buffer that need to be transfered to destination before optimized read can take place.
Bonus: buffer size is always forced to multiple of 2.
See setvbuf.c:
/*
* force size to be even by masking down to the nearest multiple
* of 2
*/
size &= (size_t)~1;
...
/*
* CASE 1: No Buffering.
*/
if (type & _IONBF) {
stream->_flag |= _IONBF;
buffer = (char *)&(stream->_charbuf);
size = 2;
}
Code snippets above are from VC 2013 CRT.
For comparison snippets from Universal CRT 10.0.17134
read.cpp
unsigned const bytes_to_read = stream_buffer_size != 0
? static_cast<unsigned>(maximum_bytes_to_read - maximum_bytes_to_read % stream_buffer_size)
: maximum_bytes_to_read;
...
int const bytes_read = _read_nolock(_fileno(stream.public_stream()), data, bytes_to_read);
setvbuf.cpp
// Force the buffer size to be even by masking the low order bit:
size_t const usable_buffer_size = buffer_size_in_bytes & ~static_cast<size_t>(1);
...
// Case 1: No buffering:
if (type & _IONBF)
{
return set_buffer(stream, reinterpret_cast<char*>(&stream->_charbuf), 2, _IOBUFFER_NONE);
}
And snippets from VC 6.0 (1998)
read.c
/* calc chars to read -- (count/bufsize) * bufsize */
nbytes = ( bufsize ? (count - count % bufsize) : count );
nread = _read(_fileno(stream), data, nbytes);
setvbuf.c
/*
* force size to be even by masking down to the nearest multiple
* of 2
*/
size &= (size_t)~1;
...
/*
* CASE 1: No Buffering.
*/
if (type & _IONBF) {
stream->_flag |= _IONBF;
buffer = (char *)&(stream->_charbuf);
size = 2;
}
I want to define a Point message in Protocol Buffers which represents an RGB Colored Point in 3-dimensional space.
message Point {
float x = 1;
float y = 2;
float z = 3;
uint8_t r = 4;
uint8_t g = 5;
uint8_t b = 6;
}
Here, x, y, z variables defines the position of Point and r, g, b defines the color in RGB space.
Since uint8_t is not defined in Protocol Buffers, I am looking for a workaround to define it. At present, I am using uint32 in place of uint8_t.
There isn't anything in protobuf that represents a single byte - it simply isn't a thing that the wire-format worries about. The options are:
varint (up to 64 bits input, up to 10 bytes on the wire depending on the highest set bit)
fixed 32 bit
fixed 64 bit
length-prefixed (strings, sub-objects, packed arrays)
(group tokens; a rare implementation detail)
A single byte isn't a good fit for any of those. Frankly, I'd use a single fixed32 for all 3, and combine/decompose the 3 bytes manually (via shifting etc). The advantage here is that it would only have one field header for the 3 bytes, and wouldn't be artificially stretched via having high bits (I'm not sure that a composed RGB value is a good candidate for varint). You'd also have a spare byte if you want to add something else at a later date (alpha, maybe).
So:
message Point {
float x = 1;
float y = 2;
float z = 3;
fixed32 rgb = 4;
}
IMHO this is the correct approach. You should use the nearest data type capable of holding all values to be sent between the system. The source & destination systems should validate the data if it is in the correct range. For uint8_t this is int32 indeed.
Some protocol buffers implementations actually allow this. In particular, nanopb allows to either have .options file alongside the .proto file or use its extension directly in .proto file to fine tune interpretation of individual fields.
Specifying int_size = IS_8 will convert uint32 from message to uint8_t in generated structure.
import "nanopb.proto";
message Point {
float x = 1;
float y = 2;
float z = 3;
uint32 r = 4 [(nanopb).int_size = IS_8];
uint32 g = 5 [(nanopb).int_size = IS_8];
uint32 b = 6 [(nanopb).int_size = IS_8];
}
I am building a C library on big integer number. Basically, I'm seeking a fast algorythm to convert any integer in it binary representation to a decimal one
I saw JDK's Biginteger.toString() implementation, but it looks quite heavy to me, as it was made to convert the number to any radix (it uses a division for each digits, which should be pretty slow while dealing with thousands of digits).
So if you have any documentations / knowledge to share about it, I would be glad to read it.
EDIT: more precisions about my question:
Let P a memory address
Let N be the number of bytes allocated (and set) at P
How to convert the integer represented by the N bytes at address P (let's say in little endian to make things simpler), to a C string
Example:
N = 1
P = some random memory address storing '00101010'
out string = "42"
Thank for your answer still
The reason for the BigInteger.toString method looking heavy is doing the conversion in chunks.
A trivial algorithm would take the last digits and then divide the whole big integer by the radix until there is nothing left.
One problem with this is that a big integer division is quite expensive, so the number is subdivided into chunks that can be processed with regular integer division (opposed to BigInt division):
static String toDecimal(BigInteger bigInt) {
BigInteger chunker = new BigInteger(1000000000);
StringBuilder sb = new StringBuilder();
do {
int current = bigInt.mod(chunker).getInt(0);
bigInt = bigInt.div(chunker);
for (int i = 0; i < 9; i ++) {
sb.append((char) ('0' + remainder % 10));
current /= 10;
if (currnet == 0 && bigInt.signum() == 0) {
break;
}
}
} while (bigInt.signum() != 0);
return sb.reverse().toString();
}
That said, for a fixed radix, you are probably even better off with porting the "double dabble" algorithm to your needs, as suggested in the comments: https://en.wikipedia.org/wiki/Double_dabble
I recently got the challenge to print a big mersenne prime: 2**82589933-1. On my CPU that takes ~40 minutes with apcalc and ~120 minutes with python 2.7. It's a number with 24 million digits and a bit.
Here is my own little C code for the conversion:
// print 2**82589933-1
#include <stdio.h>
#include <math.h>
#include <stdint.h>
#include <inttypes.h>
#include <string.h>
const uint32_t exponent = 82589933;
//const uint32_t exponent = 100;
//outputs 1267650600228229401496703205375
const uint32_t blocks = (exponent + 31) / 32;
const uint32_t digits = (int)(exponent * log(2.0) / log(10.0)) + 10;
uint32_t num[2][blocks];
char out[digits + 1];
// blocks : number of uint32_t in num1 and num2
// num1 : number to convert
// num2 : free space
// out : end of output buffer
void conv(uint32_t blocks, uint32_t *num1, uint32_t *num2, char *out) {
if (blocks == 0) return;
const uint32_t div = 1000000000;
uint64_t t = 0;
for (uint32_t i = 0; i < blocks; ++i) {
t = (t << 32) + num1[i];
num2[i] = t / div;
t = t % div;
}
for (int i = 0; i < 9; ++i) {
*out-- = '0' + (t % 10);
t /= 10;
}
if (num2[0] == 0) {
--blocks;
num2++;
}
conv(blocks, num2, num1, out);
}
int main() {
// prepare number
uint32_t t = exponent % 32;
num[0][0] = (1LLU << t) - 1;
memset(&num[0][1], 0xFF, (blocks - 1) * 4);
// prepare output
memset(out, '0', digits);
out[digits] = 0;
// convert to decimal
conv(blocks, num[0], num[1], &out[digits - 1]);
// output number
char *res = out;
while(*res == '0') ++res;
printf("%s\n", res);
return 0;
}
The conversion is destructive and tail recursive. In each step it divides num1 by 1_000_000_000 and stores the result in num2. The remainder is added to out. Then it calls itself with num1 and num2 switched and often shortened by one (blocks is decremented). out is filled from back to front. You have to allocate it large enough and then strip leading zeroes.
Python seems to be using a similar mechanism for converting big integers to decimal.
Want to do better?
For large number like in my case each division by 1_000_000_000 takes rather long. At a certain size a divide&conquer algorithm does better. In my case the first division would be by dividing by 10 ^ 16777216 to split the number into divident and remainder. Then convert each part separately. Now each part is still big so split again at 10 ^ 8388608. Recursively keep splitting till the numbers are small enough. Say maybe 1024 digits each. Those convert with the simple algorithm above. The right definition of "small enough" would have to be tested, 1024 is just a guess.
While the long division of two big integer numbers is expensive, much more so than a division by 1_000_000_000, the time spend there is then saved because each separate chunk requires far fewer divisions by 1_000_000_000 to convert to decimal.
And if you have split the problem into separate and independent chunks it's only a tiny step away from spreading the chunks out among multiple cores. That would really speed up the conversion another step. It looks like apcalc uses divide&conquer but not multi-threading.
On Darwin, the POSIX standard clock_gettime(CLOCK_MONOTONIC) timer is not available. Instead, the highest resolution monotonic timer is obtained through the mach_absolute_time function from mach/mach_time.h.
The result returned may be an unadjusted tick count from the processor, in which case the time units could be a strange multiple. For example, on a CPU with a 33MHz tick count, Darwin returns 1000000000/33333335 as the exact units of the returned result (ie, multiply the mach_absolute_time by that fraction to obtain a nanosecond value).
We usually wish to convert from exact ticks to "standard" (decimal) units, but unfortunately, naively multiplying the absolute time by the fraction will overflow even in 64-bit arithmetic. This is an error that Apple's sole piece of documentation on mach_absolute_time falls into (Technical Q&A QA1398).1
How should I write a function that correctly uses mach_absolute_time?
Note that this is not a theoretical problem: the sample code in QA1398 completely fails to work on PowerPC-based Macs. On Intel Macs, mach_timebase_info always returns 1/1 as the scaling factor because the CPU's raw tick count is unreliable (dynamic speed-stepping), so the API does the scaling for you. On PowerPC Macs, mach_timebase_info returns either 1000000000/33333335 or 1000000000/25000000, so Apple's provided code definitely overflows every few minutes. Oops.
Most-precise (best) answer
Perform the arithmetic at 128-bit precision to avoid the overflow!
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
}
uint64_t scale(uint64_t i) {
return scaleHighPrecision(i - bias, tb.numer, tb.denom);
}
static uint64_t scaleHighPrecision(uint64_t i, uint32_t numer,
uint32_t denom) {
U64 high = (i >> 32) * numer;
U64 low = (i & 0xffffffffull) * numer / denom;
U64 highRem = ((high % denom) << 32) / denom;
high /= denom;
return (high << 32) + highRem + low;
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return data.scale(now);
}
A simple low-resolution answer
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process. The clock may run up to 0.1% faster or slower
// than the "exact" tick count.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
if (tb.denom > 1024) {
double frac = (double)tb.numer/tb.denom;
tb.denom = 1024;
tb.numer = tb.denom * frac + 0.5;
assert(tb.numer > 0);
}
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return (now - data.bias) * data.tb.numer / data.tb.denom;
}
A fiddly solution using low-precision arithmetic but using continued fractions to avoid loss of accuracy
// This function returns the rational number inside the given interval with
// the smallest denominator (and smallest numerator breaks ties; correctness
// proof neglects floating-point errors).
static mach_timebase_info_data_t bestFrac(double a, double b) {
if (floor(a) < floor(b))
{ mach_timebase_info_data_t rv = {(int)ceil(a), 1}; return rv; }
double m = floor(a);
mach_timebase_info_data_t next = bestFrac(1/(b-m), 1/(a-m));
mach_timebase_info_data_t rv = {(int)m*next.numer + next.denum, next.numer};
return rv;
}
// Returns monotonic time in nanos, measured from the first time the function
// is called in the process. The clock may run up to 0.1% faster or slower
// than the "exact" tick count. However, although the bound on the error is
// the same as for the pragmatic answer, the error is actually minimized over
// the given accuracy bound.
uint64_t monotonicTimeNanos() {
uint64_t now = mach_absolute_time();
static struct Data {
Data(uint64_t bias_) : bias(bias_) {
kern_return_t mtiStatus = mach_timebase_info(&tb);
assert(mtiStatus == KERN_SUCCESS);
double frac = (double)tb.numer/tb.denom;
uint64_t spanTarget = 315360000000000000llu; // 10 years
if (getExpressibleSpan(tb.numer, tb.denom) >= spanTarget)
return;
for (double errorTarget = 1/1024.0; errorTarget > 0.000001;) {
mach_timebase_info_data_t newFrac =
bestFrac((1-errorTarget)*frac, (1+errorTarget)*frac);
if (getExpressibleSpan(newFrac.numer, newFrac.denom) < spanTarget)
break;
tb = newFrac;
errorTarget = fabs((double)tb.numer/tb.denom - frac) / frac / 8;
}
assert(getExpressibleSpan(tb.numer, tb.denom) >= spanTarget);
}
mach_timebase_info_data_t tb;
uint64_t bias;
} data(now);
return (now - data.bias) * data.tb.numer / data.tb.denom;
}
The derivation
We aim to reduce the fraction returned by mach_timebase_info to one that is essentially the same, but with a small denominator. The size of the timespan that we can handle is limited only by the size of the denominator, not the numerator of the fraction we shall multiply by:
uint64_t getExpressibleSpan(uint32_t numer, uint32_t denom) {
// This is just less than the smallest thing we can multiply numer by without
// overflowing. ceilLog2(numer) = 64 - number of leading zeros of numer
uint64_t maxDiffWithoutOverflow = ((uint64_t)1 << (64 - ceilLog2(numer))) - 1;
return maxDiffWithoutOverflow * numer / denom;
}
If denom=33333335 as returned by mach_timebase_info, we can handle differences of up to 18 seconds only before the multiplication by numer overflows. As getExpressibleSpan shows, by calculating a rough lower bound for this, the size of numer doesn't matter: halving numer doubles maxDiffWithoutOverflow. The only goal therefore is to produce a fraction close to numer/denom that has a smaller denominator. The simplest method to do this is using continued fractions.
The continued fractions method is rather handy. bestFrac clearly works correctly if the provided interval contains an integer: it returns the least integer in the interval over 1. Otherwise, it calls itself recursively with a strictly larger interval and returns m+1/next. The final result is a continued fraction that can be shown by induction to have the correct property: it's optimal, the fraction inside the given interval with the least denominator.
Finally, we reduce the fraction Darwin passes us to a smaller one to use when rescaling the mach_absolute_time to nanoseconds. We may introduce an error here because we can't reduce the fraction in general without losing accuracy. We set ourselves the target of 0.1% error, and check that we've reduced the fraction enough for common timespans (up to ten years) to be handled correctly.
Arguably the method is over-complicated for what it does, but it handles correctly anything the API can throw at it, and the resulting code is still short and extremely fast (bestFrac typically recurses only three or four iterations deep before returning a denominator less than 1000 for random intervals [a,a*1.002]).
You're worrying about overflow when multiplying/dividing with values from the mach_timebase_info struct, which is used for conversion to nanoseconds. So, while it may not fit your exact needs, there are easier ways to get a count in nanoseconds or seconds.
All solutions below are using mach_absolute_time internally (and NOT the wall clock).
Use double instead of uint64_t
(supported in Objective-C and Swift)
double tbInSeconds = 0;
mach_timebase_info_data_t tb;
kern_return_t kError = mach_timebase_info(&tb);
if (kError == 0) {
tbInSeconds = 1e-9 * (double)tb.numer / (double)tb.denom;
}
(remove the 1e-9 if you want nanoseconds)
Usage:
uint64_t start = mach_absolute_time();
// do something
uint64_t stop = mach_absolute_time();
double durationInSeconds = tbInSeconds * (stop - start);
Use ProcessInfo.processInfo.systemUptime
(supported in Objective-C and Swift)
It does the job in double seconds directly:
CFTimeInterval start = NSProcessInfo.processInfo.systemUptime;
// do something
CFTimeInterval stop = NSProcessInfo.processInfo.systemUptime;
NSTimeInterval durationInSeconds = stop - start;
For reference, source code of systemUptime
just does something similar as previous solution:
struct mach_timebase_info info;
mach_timebase_info(&info);
__CFTSRRate = (1.0E9 / (double)info.numer) * (double)info.denom;
__CF1_TSRRate = 1.0 / __CFTSRRate;
uint64_t tsr = mach_absolute_time();
return (CFTimeInterval)((double)tsr * __CF1_TSRRate);
Use QuartzCore.CACurrentMediaTime()
(supported in Objective-C and Swift)
Same as systemUptime, but without being open source.
Use Dispatch.DispatchTime.now()
(supported in Swift only)
Another wrapper around mach_absolute_time(). Base precision is nanoseconds, backed with UInt64.
DispatchTime start = DispatchTime.now()
// do something
DispatchTime stop = DispatchTime.now()
TimeInterval durationInSeconds = Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / 1_000_000_000
For reference, source code of DispatchTime.now() says it basically simply returns a struct DispatchTime(rawValue: mach_absolute_time()). And the calculation for uptimeNanoseconds is:
(result, overflow) = result.multipliedReportingOverflow(by: UInt64(DispatchTime.timebaseInfo.numer))
result = overflow ? UInt64.max : result / UInt64(DispatchTime.timebaseInfo.denom)
So it just discards results if the multiplication can't be stored in an UInt64.
If mach_absolute_time() sets the uint64 back to 0 then reset the time calculations if less than the last check.
That's the problem, they don't document what happens when the uint64 reaches all ones (binary).
read it. https://developer.apple.com/documentation/kernel/1462446-mach_absolute_time
I've been doing a lot of OpenGL and shaders before, and now, I decided to give a try to OpenCL. I watched some online tutorials, and started reading books on the subject. In order to better understand, and because I believe that the best way to learn is by intelligently trying and learning from the issues that arose while doing so, I decided to start implementing a kernel for a fully-connected perceptron.
For those who don't know what that is, I'll explain the basic idea. It is a neural network in which each neuron of a layer is connected to every neurons of the next layer. Each neuron has but one action to perform: performing the sum of all the neurons from the previous layer, weighted by a different value for each neuron.
This seemed simple enough to implement, and after reading the paper "Parallel Neural Network Training with OpenCL" I implemented it in the following way
Each layer being dependent on the previous one, they're being run sequentially by the host
For computing a layer, I run my kernel with a global work size of the number of neurons within the layer (which can be quite huge, tens of thousand for instance). That makes it so that all the neurons are performing its sum independently to one another.
Each neuron (identified by its global_work_id) performs the weighted sum with all the neurons from the previous layer.
Here is my fully functional opencl kernel:
/**
* #brief Computes one layer of the perceptron given the previous one and the
* weights
* The kernel is run once for each layer.
* The work items are each tasked with computing the output of a single neuron
* of the out layer.
*
* #param out_layer_size
* Size of the output layer (number of elements in the output array that will
* contain the result for each neuron).
* #param in_layer_size
* Number of elements of the input layer
* #param in_value
* Values of the neuron in the previous layer
* #param in_weights
* Array containing the weights for each input neuron. It is organised as a
* two dimensional matrix, written by concatenating each line in the array
* [ w11, w12, w13, ...
* w21, w22, w23, ...
* ..., ..., ..., ...
* ]
* Where wij is the weight linking the neuron i of the input layer to the
* neuron j of the output layer
* #param out_values
* Computed values for the current layer
*/
void kernel perceptron(global const int* in_layer_size, global const int* out_layer_size, global const float *in_value, global const float* in_weights, global float* out_values)
{
private const int global_id = get_global_id(0);
private const int out_layer_s = *out_layer_size;
private const int in_layer_s = *in_layer_size;
private const int offset = out_layer_s * global_id;
private float sum = 0.;
for(int i=0; i < in_layer_s; i++) {
sum += in_weights[i*out_layer_s+global_id] * in_value[i];
}
//out_values[global_id] = sigma(sum);
out_values[global_id] = sum;
}
And here is how I invoke it:
queue.enqueueNDRangeKernel(kernel, cl::NullRange,cl::NDRange(number of neurons within layer),cl::NullRange);
I realize that the bottleneck of this kernel is the implementation of the weighted sum. It would be really helpful if someone could explain how I could improve upon this to make it faster.
I probably don't make proper use of the different memory regions, I'm thinking essentially of the local memory that I don't even use.
Just to give you an idea of performance (that is on an Nvidia GTX 660M), I'll show you some of the times I achieved. Each value is the number of neurons per layer:
2500, 10 000, 2500 : 0.018s ~ 60FPS. It's about 4 to 5 times faster than on my processor (Intel Core i7 running at 2.40GHz)
100 000, 100 000, 500: 140s -> which I guess isn't surpsising since each neuron in the second layer has to perform the weighted sum of 100 000 elements. Running this on my processor yields about the same results.
As you told, bottleneck is the weighted summ. That's not hard to be, as at each layer every WI (Work Item) is doing a lot of IO operations in comparison to number of arithmetic operations. I have no experience in neural networks, but for me problem looks like poor memory access pattern on GPU.
Potentially, that can be solved by organizing your WI into local WGs (Work Groups). As every WI needs to process all data from prev. layer, I guess that all WI in WG can load some amount of data into local memory, process them and than to next bunch of data. This will make your algorithm much more cache friendly. Pseudo-code of kernel looks like:
void kernel Kernel(
__global const int in_layer_size,
__global const int out_layer_size,
__global const float *in_value,
__global const float *in_weights,
__global float *out_values){
__local float buffer[SOME_SIZE];
__global const float* p_in = in_value;
__global float* p_out = out_values;
const int
global_id = get_global_id(0),
local_id = get_local_id(0),
num_buffers = in_layer_size / SOME_SIZE,
offset = out_layer_size * global_id;
float sum = 0.0f;
for(int i=0; i < num_buffers; i++){
buffer[local_id] = p_in[local_id];
barrier(CLK_LOCAL_MEM_FENCE);
//Process all data inside buffer by every WI in WG
//...
p_in += SOME_SIZE;
out_values += SOME_SIZE;
}
//...
return;
}
So, you're sliding with the window of fixed size & calculating data within & then going to next window. Al data operations are done independently, Work Items are only using same data at same time. Optimal size of local group is Device- and Kernel- dependent.
You can do it in many ways.
But the most generic way, without changing how your kernel behaves is to do it is reusing your workgroup size (whatever you selected, or default) and reuse the memory accesses from the group.
I would suggest something like this:
NOTE: I removed thouse ugly pointers for single values. OpenCL supports this, and it is much easier. There is no need to create a memory zone, just do clSetKernelArg(kernel, arg_index, sizeof(cl_float), &size); Where cl_float size = the_size;.
#define IN_LOCAL_SIZE 4096 //Because 16KB/4B (for each float)
void kernel perceptron(global const int in_layer_size, global const int out_layer_size, global const float *in_value, global const float* in_weights, global float* out_values)
{
const int global_id = get_global_id(0);
__local float in_buffer[IN_LOCAL_SIZE];
float sum = 0.0f;
event_t ev;
int j;
//For each full buffer
for(j=0; j < (in_layer_size/IN_LOCAL_SIZE)-1; i++) {
ev = async_work_group_copy(in_buffer, in_value+j*IN_LOCAL_SIZE, IN_LOCAL_SIZE, ev);
wait_group_events(1,&ev);
barrier(CLK_LOCAL_MEM_FENCE);
for(int i=0; i < IN_LOCAL_SIZE; i++) {
sum += in_weights[(i+j*IN_LOCAL_SIZE)*out_layer_size+global_id] * in_buffer[i];
}
}
//Last one
ev = async_work_group_copy(in_buffer, in_value+j*IN_LOCAL_SIZE, in_layer_size%IN_LOCAL_SIZE, ev);
wait_group_events(1,&ev);
barrier(CLK_LOCAL_MEM_FENCE);
for(int i=0; i < in_layer_size%IN_LOCAL_SIZE; i++) {
sum += in_weights[(i+j*IN_LOCAL_SIZE)*out_layer_size+global_id] * in_buffer[i];
}
out_values[global_id] = sum;
}
However, if the output size is small (100k, 250k, 500), then you will have just 500 work items, which is not optimal. In that case you should reshape the algorithm.
One possible way to do it is that each workitem works in the inner layer, performing sums, and the whole work group creates one output out of all the work items. That would be easy, since you can control the sums inside the workgroup easily.
But maybe other approaches fit better your problem.
You can make large improvements by caching in_values in local memory. The fewer times you have to read each element of in_values from global memory, the better.
I have come up with a solution that caches the maximum number of input values, and reads each element from global memory only once per work group. This is done by copying a block of in_values at a time, processing it against all out_values, and moving on to the next block. There is also a local array of floats used to reduce the work items' sums of each block.
pseudocode:
output elements assumed to be set to 0 already
for each block of input values:
cache the input block
for each target output value:
reset local sum to 0
for each element this work item is responsible for:
read the weight, multiply, and add to sum
reduce sums to a single value, ADD value to output element
I haven't had a chance to run this through a profiler or debugger yet, but I will give it a try when I am back at my home PC. (no opencl tools at my office workstation). Make sure to queue kernel with group size equal to the GROUP_SIZE constant. Also, only create a single group per compute unit on your device.
real code:
//experiment with GROUP_SIZE to discover the optimal value for your device
//this needs to be equal to local_work_size passed into clEnqueueNDRangeKernel
//use a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
//max. for most devices is 256
#define GROUP_SIZE = 64;
// IN_VALUE_CACHE_SIZE is the number of floats from in_value to copy to local memory at a time
//assuming GROUP_SIZE can be up to 256, sizeof(float)=4, and local memory size is 32kb, full saturation can be achieved with the following:
//(32768 - (256 * 4)) /4 = 7936
//try another multiple of 1024 (6144, 4096... )if there is trouble with this value
#define IN_VALUE_CACHE_SIZE = 7936;
void kernel perceptron(global const int* in_layer_size, global const int* out_layer_size, global const float *in_value, global const float* in_weights, global float* out_values)
{
private const int global_id = get_global_id(0);
private const int out_layer_s = *out_layer_size;
private const int in_layer_s = *in_layer_size;
private const int offset = out_layer_s * global_id;
private const int item_id = get_local_id(0);
private const int group_id = get_group_id(0);
private const int group_count = get_num_groups(0);
local float result_buffer[GROUP_SIZE];
local float in_value_cache[IN_VALUE_CACHE_SIZE];
int i,j,k;
//init the block to 0, in case there are fewer than IN_VALUE_CACHE_SIZE values in total
for(i=item_id; i<IN_VALUE_CACHE_SIZE; i+= GROUP_SIZE){
in_value_cache[i] = 0.0;
}
barrier(CL_LOCAL_MEM_FENCE);
private float sum = 0.0;
event_t e;
int copy_total = 0;
int copy_offset;
for(i=0; i<in_layer_s; i+=IN_VALUE_CACHE_SIZE){
//cap the number of values to copy to local memory if loop is near the end of the input data
copy_total = IN_VALUE_CACHE_SIZE;
if((copy_total + i*IN_VALUE_CACHE_SIZE) > in_layer_s){
copy_total = in_layer_s - i*IN_VALUE_CACHE_SIZE;
}
//copy the next block of values
e = async_work_group_copy(in_value_cache, in_value + i * 4, copy_total, 0);
wait_group_events(1, &e);
for(j=group_id; j<out_layer_s; j+=group_count){
sum = 0.0;
//need to reset result_buffer[item_id] as well
//this is in case there are fewer than GROUP_SIZE input values remaining ie copy_total < GROUP_SIZE
result_buffer[item_id] = 0.0;
for(k=item_id; k<copy_total; k+=GROUP_SIZE){
sum += in_value_cache[k] * in_weights[(k+i) + j * out_layer_s];
}
result_buffer[item_id] = sum;
//simple O(n) reduction can be optimized further
if(item_id == 0){
for(k=1;k<GROUP_SIZE;k++){
sum += result_buffer[k];
}
out_values[j] += sum;
}
barrier(CL_LOCAL_MEM_FENCE);
}
}
}
This will handle input of any size, so you can try it with as many elements as you have global memory for.