Detect Greatest Dynamic Range Changes In Audio Sample - algorithm

Forgive/correct any wrong terminology in the below (I hope it makes sense!):
I want to detect the biggest dynamic audio changes in a given sample (i.e. the moments when the sound wave 'grows'/'accelerates' the most).
For example, if the audio goes quiet at some points during the sample, I want to know when the music comes back in after, and order these data points by the relative dynamic range (volume?) increase (largest to smallest).
My audio sample is a buffer of float32[] and sample rate, and I would like a resulting array of objects each containing:
start frame index
start time (seconds ... frameIndex/sampleRate?)
end frame index
end time (seconds)
dynamic change value
My naive approach iterates linearly and detects points at which the value starts rising until it is no longer rising, and then calculates the rise over run for each sub-interval between those points, but this is not producing the correct result.
Any ideas or existing algorithms that do this?
Not picky on languages, but anything with syntax like C#, Java, JavaScript preferred!

I am a little unsure as to how much audio DSP background you have so apologies if treading over old territory.
Essentially this is a problem of trying to find the envelope of the signal at any given point.
Since the audio signal will be fluctuating between -1 and 1, the value of any individual sample will not yield much
information about the loudness or dynamic range.
What would be best to find is the root mean square (RMS) of the signal over some frame of audio data.
Written in JavaScript-like pseudo code, and assuming you already have your audio data, a function and a way of grabbing the RMS data could be:
function rms(frame, frameSize) {
    var sumOfSquares = 0;
    for (var i = 0; i < frameSize; i++) {
        sumOfSquares += frame[i] * frame[i]; // square the sample and sum over the frame
    }
    return Math.sqrt(sumOfSquares / frameSize);
}
// Main
// For analysis, just floor to a whole number of frames; if this is real-time,
// you will need to deal with a partial frame at the end.
var frameNum = Math.floor(numberOfAudioSamples / frameSize);
var frame = [];   // an array or buffer to temporarily store audio data
var rmsData = []; // an array or buffer to store RMS data
for (var i = 0; i < frameNum; i++) {
    for (var j = 0; j < frameSize; j++) {
        var sampleIndex = j + (i * frameSize);
        frame[j] = audioData[sampleIndex];
    }
    rmsData[i] = rms(frame, frameSize);
}
You can then compare elements of the RMS Data to find when the dynamics are changing and by how much.
For digital audio, RMS will be constrained to between 0 and 1. To get dBFS, all you need to do is compute 20 * log10(rmsData[i]) for each value (a completely silent frame gives negative infinity, so clamp it to some floor if needed).
Finding the exact sample where the dynamic range changes will be tricky. The frame index should be accurate enough with a small enough frame size.
The smaller the frame, however, the more erratic the RMS values will be. Finding a time in seconds is simply sampleIndex / samplingRate.
With a small frame size you may also want to low-pass filter the RMS data. It depends on whether this is for a real-time application or for non-real-time analysis.
To make things easy, I would prototype something in Octave or MATLAB first.
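Putting the steps above together, here is a minimal Python sketch rather than a definitive implementation. It computes the RMS envelope per frame, converts it to dBFS, and reports every run of consecutively rising frames, largest increase first. The names `rms_envelope`, `to_dbfs` and `find_rises`, the -120 dB floor for silence, and the default frame size of 1024 are all assumptions made for illustration. The indices in the result are sample indices of the first sample of the frame where a rise starts and ends.

```python
import math


def rms_envelope(samples, frame_size):
    """Return one RMS value per whole frame of `frame_size` samples."""
    frame_count = len(samples) // frame_size  # a partial last frame is ignored
    envelope = []
    for i in range(frame_count):
        frame = samples[i * frame_size:(i + 1) * frame_size]
        envelope.append(math.sqrt(sum(s * s for s in frame) / frame_size))
    return envelope


def to_dbfs(rms_value, floor_db=-120.0):
    """Convert an RMS value in [0, 1] to dBFS, clamping silence to floor_db."""
    if rms_value <= 0.0:
        return floor_db
    return max(floor_db, 20.0 * math.log10(rms_value))


def find_rises(samples, sample_rate, frame_size=1024):
    """Find runs of frames whose level keeps rising, largest increase first."""
    levels = [to_dbfs(v) for v in rms_envelope(samples, frame_size)]
    rises = []
    start = 0
    for i in range(1, len(levels) + 1):
        # A run ends when the level stops rising or the data runs out.
        if i == len(levels) or levels[i] <= levels[i - 1]:
            if i - 1 > start:
                start_sample = start * frame_size
                end_sample = (i - 1) * frame_size
                rises.append({
                    "start_sample": start_sample,
                    "start_time": start_sample / sample_rate,
                    "end_sample": end_sample,
                    "end_time": end_sample / sample_rate,
                    "change_db": levels[i - 1] - levels[start],
                })
            start = i
    rises.sort(key=lambda r: r["change_db"], reverse=True)
    return rises
```

With very small frames the levels will be erratic, so smoothing `levels` before looking for rises (as suggested above) would probably be needed on real recordings.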


OpenCL crash when calling finish()

I am writing an OpenCL app on macOS using C++, and it crashes in certain cases depending on the work size.
The program crashes due to a SIGABRT.
Is there any way to get more information about the error?
Why is SIGABRT being raised? Can I catch it?
I realize that this program is a doozie, however I will try to explain it in case anyone would like to take a stab at it.
Through debugging I discovered that the cause of the SIGABRT was one of the kernels timing out.
The program is a tile-based 3D renderer. It is an OpenCL implementation of this algorithm:
The screen is divided into 8x8 tiles. One of the kernels (the tiler) computes which polygons overlap each tile, storing the results in a data structure called tilePolys. A subsequent kernel (the rasterizer), which runs one work item per tile, iterates over the list of polys occupying the tile and rasterizes them.
The tiler writes to an integer buffer which is a list of lists of polygon indices. Each list is of a fixed size (polysPerTile + 1 for the count) where the first element is the count and the subsequent polysPerTile elements are indices of polygons in the tile. There is one such list per tile.
For some reason in certain cases the tiler writes a very large poly count (13172746) to one of the tile's lists in tilePolys. This causes the rasterizer to loop for a long time and time out.
The strange thing is that the index to which the large count is written is never accessed by the tiler.
The code for the tiler kernel is below:
// this kernel is executed once per polygon
// it computes which tiles are occupied by the polygon and adds the index of the polygon to the list for that tile
kernel void tiler(
    // number of polygons
    ulong nTris,
    // width of screen
    int width,
    // height of screen
    int height,
    // number of tiles in x direction
    int tilesX,
    // number of tiles in y direction
    int tilesY,
    // number of pixels per tile (tiles are square)
    int tileSize,
    // size of the polygon list for each tile
    int polysPerTile,
    // 4x4 matrix representing the viewport
    global const float4* viewport,
    // vertex positions
    global const float* vertices,
    // indices of vertices
    global const int* indices,
    // array of array-lists of polygons per tile
    // structure of list is an int representing the number of polygons covering that tile,
    // followed by [polysPerTile] integers representing the indices of the polygons in that tile
    // there are [tilesX*tilesY] such arraylists
    volatile global int* tilePolys)
{
    size_t faceInd = get_global_id(0);
    // compute vertex position in viewport space
    float3 vs[3];
    for(int i = 0; i < 3; i++) {
        // indices are vertex/uv/normal
        int vertInd = indices[faceInd*9+i*3];
        float4 vertHomo = (float4)(vertices[vertInd*4], vertices[vertInd*4+1], vertices[vertInd*4+2], vertices[vertInd*4+3]);
        vertHomo = vec4_mul_mat4(vertHomo, viewport);
        // perspective divide
        vs[i] = vertHomo.xyz / vertHomo.w;
    }
    float2 bboxmin = (float2)(INFINITY,INFINITY);
    float2 bboxmax = (float2)(-INFINITY,-INFINITY);
    // size of screen
    float2 clampCoords = (float2)(width-1, height-1);
    // compute bounding box of triangle in screen space
    for (int i=0; i<3; i++) {
        for (int j=0; j<2; j++) {
            bboxmin[j] = max(0.f, min(bboxmin[j], vs[i][j]));
            bboxmax[j] = min(clampCoords[j], max(bboxmax[j], vs[i][j]));
        }
    }
    // transform bounding box to tile space
    int2 tilebboxmin = (int2)(bboxmin[0] / tileSize, bboxmin[1] / tileSize);
    int2 tilebboxmax = (int2)(bboxmax[0] / tileSize, bboxmax[1] / tileSize);
    // loop over all tiles in bounding box
    for(int x = tilebboxmin[0]; x <= tilebboxmax[0]; x++) {
        for(int y = tilebboxmin[1]; y <= tilebboxmax[1]; y++) {
            // get index of tile
            int tileInd = y * tilesX + x;
            // get start index of polygon list for this tile
            int counterInd = tileInd * (polysPerTile + 1);
            // get current number of polygons in list
            int numPolys = atomic_inc(&tilePolys[counterInd]);
            // if list is full, skip tile
            if(numPolys >= polysPerTile) {
                // decrement the count because we will not add to the list
                atomic_dec(&tilePolys[counterInd]);
            } else {
                // otherwise add the poly to the list
                // the index is the offset + numPolys + 1 as tilePolys[counterInd] holds the poly count
                int ind = counterInd + numPolys + 1;
                tilePolys[ind] = (int)(faceInd);
            }
        }
    }
}
My theories are that either:
I have incorrectly implemented the atomic functions for reading and incrementing the count
I am using an incorrect number format causing garbage to be written into tilePolys
One of my other kernels is inadvertently writing into the tilePolys buffer
I do not think it is the last one though because if instead of writing faceInd to tilePolys, I write a constant value, the large poly count disappears.
tilePolys[counterInd+numPolys+1] = (int)(faceInd); // this is the problem line
tilePolys[counterInd+numPolys+1] = (int)(5); // this fixes the issue
It looks like your kernel is crashing on the GPU itself. You can't really get any extra diagnostics about that directly, at least not on macOS. You'll need to start narrowing down the problem. Some suggestions:
As the crash is currently happening in clFinish() you don't know what asynchronous command is causing the crash. Try switching all your enqueue calls to blocking mode. This should cause it to crash in the call that's actually going wrong.
Check return/error codes on all OpenCL API calls. Sometimes, ignoring an error from an earlier call can cause problems in a later call which relies on earlier results. For example, if creating a buffer fails, passing the result of that buffer creation as a kernel argument will cause problems when trying to run the kernel.
The most likely reason for the crash is that your OpenCL kernel is accessing memory out of bounds or is otherwise misusing pointers. Re-check any array index calculations.
Check if the problem occurs with smaller work batches. Scale up from one workgroup (or work item if not using groups) and see if it only occurs beyond a certain work size. This may give you a clue about buffer sizes and array indices that might be causing the crash.
Systematically comment out parts of your kernel. If the crash goes away if you comment out a specific piece of code, there's a good chance the problem is in that code.
If you've narrowed the problem down to a small area of code but can't work out where it's coming from, start recording diagnostic output to check that variables have the values you're expecting.
Without seeing any code, I can't give you any more specific advice than that.
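As an illustration of re-checking array index calculations on the host, here is a small Python sketch that replays the kind of arithmetic a tile-list kernel performs and reports any tile whose list would fall outside the grid or the buffer. It is only a sketch under stated assumptions: the function names and the example screen (20x16 pixels, 8-pixel tiles) are hypothetical, and nothing here is known to be the cause of the crash being discussed.

```python
def tile_list_indices(bbox_min, bbox_max, tile_size, tiles_x, polys_per_tile):
    """Yield (tile_x, tile_y, counter_index, last_slot_index) for every tile
    touched by a screen-space bounding box.

    Each tile owns polys_per_tile + 1 consecutive ints: a count followed by
    polys_per_tile polygon indices.
    """
    # int() truncates towards zero, like a float-to-int conversion in C.
    min_tx, min_ty = int(bbox_min[0] / tile_size), int(bbox_min[1] / tile_size)
    max_tx, max_ty = int(bbox_max[0] / tile_size), int(bbox_max[1] / tile_size)
    for tx in range(min_tx, max_tx + 1):
        for ty in range(min_ty, max_ty + 1):
            counter_index = (ty * tiles_x + tx) * (polys_per_tile + 1)
            yield tx, ty, counter_index, counter_index + polys_per_tile


def find_out_of_range(bbox_min, bbox_max, tile_size, tiles_x, tiles_y, polys_per_tile):
    """Return (tile_x, tile_y, counter_index) for tiles outside the grid or buffer."""
    buffer_len = tiles_x * tiles_y * (polys_per_tile + 1)
    bad = []
    for tx, ty, counter_index, last_slot in tile_list_indices(
            bbox_min, bbox_max, tile_size, tiles_x, polys_per_tile):
        in_grid = 0 <= tx < tiles_x and 0 <= ty < tiles_y
        if not in_grid or last_slot >= buffer_len:
            bad.append((tx, ty, counter_index))
    return bad
```

For example, on a hypothetical 20-pixel-wide screen with 8-pixel tiles, a tile count rounded down to 2 makes a full-width bounding box reach tile column 2, whose counter index then lands inside another tile's list; rounding the tile count up to 3 avoids that. Running the same kind of check with the real sizes and a few real bounding boxes is one way to confirm or rule out an indexing problem.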
Note that OpenCL is deprecated on macOS, so if you're specifically targeting that platform and don't need to support Linux, Windows, etc. I recommend learning Metal Compute instead. Apple has made it clear that this is the GPU programming platform they want to support, and the tooling for it is already much better than their OpenCL tooling ever was.
I suspect Apple will eventually stop implementing OpenCL support when they release a Mac with a new type of GPU, so even if you're targeting the Mac as well as other platforms, you will probably need to switch to Metal on the Mac somewhere down the line anyway. As of macOS 10.14, the minimum system requirements of the OS already include a Metal-capable GPU, so you only need OpenCL as a fallback if you wish to support all Mac models able to run 10.13 or an even older OS version.

Moving average digital filter implementation

I need to implement a moving average digital filter for post-processing some
recorded oscilloscope waveforms in Scilab. I have prepared a script with
the code below (the recursive implementation, with an averaging window of 256 samples):
// number of samples
N = 350000;
// vector of voltage samples
voltage = M(1:N, 2)';
// filtered values
filt_voltage = zeros(1, N);
// window length
L = 256;
// sum of the samples in the averaging window
sum = 0;
for i = 1:N
    // averaging window full?
    if i > L then
        // remove the oldest sample in the averaging window
        sum = sum - voltage(i - L);
    end
    // add the newest sample into the averaging window
    sum = sum + voltage(i);
    // average of the samples in the averaging window
    filt_voltage(i) = sum/L;
end
The script output is as follows (blue waveform: the recorded data, red waveform: the filtered data).
The problem is that I am not sure whether my moving average implementation
is correct (I have found a lot of implementations, mostly based on convolution). The output seems to be filtered somehow, but it would be helpful
for me if anybody could confirm that it is correct. Thanks in advance.
There is nothing wrong with your implementation. In fact, it is a common implementation of the recursive formulation of the moving average, which is also described on Wikipedia.
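One way to convince yourself is to compare the running-sum version against a direct, convolution-style average on a short signal. The Python sketch below does that; the function names are made up for illustration, and it mirrors the script's behaviour of dividing by the full window length even before the window has filled. The recursion it checks is y[n] = y[n-1] + (x[n] - x[n-L]) / L.

```python
def moving_average_recursive(samples, window):
    """Running-sum moving average: add the newest sample, drop the oldest."""
    out = []
    total = 0.0
    for i, x in enumerate(samples):
        if i >= window:
            total -= samples[i - window]  # remove the sample leaving the window
        total += x                        # add the sample entering the window
        out.append(total / window)
    return out


def moving_average_direct(samples, window):
    """Reference version: re-sum the last `window` samples at every step."""
    out = []
    for i in range(len(samples)):
        lo = max(0, i - window + 1)
        out.append(sum(samples[lo:i + 1]) / window)
    return out
```

The two agree up to floating-point rounding. On very long recordings the running sum can slowly accumulate rounding error, which is the usual trade-off for its constant cost per sample.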

Nearest Neighbors in CUDA Particles

Edit 2: Please take a look at this crosspost for TLDR.
Edit: Given that the particles are segmented into grid cells (say a 16^3 grid), is it a better idea to run one work-group per grid cell, with as many work-items per work-group as the maximum number of particles a grid cell can hold?
In that case I could load all particles from neighboring cells into local memory and iterate through them computing some properties. Then I could write specific value into each particle in the current grid cell.
Would this approach be beneficial over running the kernel for all particles and for each iterating over (most of the time the same) neighbors?
Also, what is the ideal ratio of number of particles/number of grid cells?
I'm trying to reimplement (and modify) CUDA Particles for OpenCL and use it to query nearest neighbors for every particle. I've created the following structures:
Buffer P holding all particles' 3D positions (float3)
Buffer Sp storing int2 pairs of particle ids and their spatial hashes. Sp is sorted according to the hash. (The hash is just a simple linear mapping from 3D to 1D – no Z-indexing yet.)
Buffer L storing int2 pairs of starting and ending positions of particular spatial hashes in buffer Sp. Example: L[12] = (int2)(0, 50).
L[12].x is the index (in Sp) of the first particle with spatial hash 12.
L[12].y is the index (in Sp) of the last particle with spatial hash 12.
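To make the layout of these buffers concrete, here is a CPU-side Python sketch of how they could be built. It is an illustration under assumptions, not the code used in the project: `build_cell_tables` is a hypothetical name, the hash is a plain linear 3D-to-1D mapping, pairs are stored as (hash, particle id), and the per-cell table is a dictionary with inclusive first and last indices instead of a dense buffer.

```python
def build_cell_tables(positions, cell_side, cells_per_axis):
    """Return (sp, cell_range).

    sp         : list of (spatial_hash, particle_id) pairs sorted by hash
    cell_range : dict mapping spatial_hash -> (first_index, last_index) into sp,
                 both ends inclusive
    """
    def spatial_hash(position):
        # Integer cell coordinates, then a linear 3D -> 1D mapping.
        cx, cy, cz = (int(c // cell_side) for c in position)
        return (cz * cells_per_axis + cy) * cells_per_axis + cx

    sp = sorted((spatial_hash(p), pid) for pid, p in enumerate(positions))

    cell_range = {}
    for index, (cell_hash, _pid) in enumerate(sp):
        first, _last = cell_range.get(cell_hash, (index, index))
        cell_range[cell_hash] = (first, index)  # extend the range to this index
    return sp, cell_range
```

Because `sp` is sorted by hash, all particles of one cell sit next to each other, which is what makes a (first, last) pair per cell sufficient.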
Now that I have all these buffers, I want to iterate through all the particles in P and for each particle iterate through its nearest neighbors. Currently I have a kernel that looks like this (pseudocode):
__kernel process_particles(float3* P, int2* Sp, int2* L, int* Out) {
    size_t gid = get_global_id(0);
    float3 curr_particle = P[gid];
    int processed_value = 0;
    for(int x=-1; x<=1; x++)
    for(int y=-1; y<=1; y++)
    for(int z=-1; z<=1; z++) {
        float3 neigh_position = curr_particle + (float3)(x,y,z)*GRID_CELL_SIDE;
        // ugly boundary checking
        if ( dot(neigh_position<0, (float3)(1)) +
             dot(neigh_position>BOUNDARY, (float3)(1)) != 0)
            continue; // skip neighbour cells that lie outside the domain
        int neigh_hash = spatial_hash( neigh_position );
        int2 particles_range = L[ neigh_hash ];
        for(int p=particles_range.x; p<particles_range.y; p++)
            processed_value += heavy_computation( P[ Sp[p].y ] );
    }
    Out[gid] = processed_value;
}
The problem with that code is that it's slow. I suspect the nonlinear GPU memory access (particularly P[Sp[p].y] in the inner-most for loop) to be causing the slowness.
What I want to do is to use a Z-order curve as the spatial hash. That way I could have only one for loop iterating through a continuous range of memory when querying neighbors. The only problem is that I don't know what the start and stop Z-index values should be.
The holy grail I want to achieve:
__kernel process_particles(float3* P, int2* Sp, int2* L, int* Out) {
    size_t gid = get_global_id(0);
    float3 curr_particle = P[gid];
    int processed_value = 0;
    // How to accomplish this??
    // `get_neighbors_range()` returns start and end Z-index values
    // representing the start and end near neighbors cells range
    int2 nearest_neighboring_cells_range = get_neighbors_range(curr_particle);
    int first_particle_id = L[ nearest_neighboring_cells_range.x ].x;
    int last_particle_id = L[ nearest_neighboring_cells_range.y ].y;
    for(int p=first_particle_id; p<=last_particle_id; p++) {
        processed_value += heavy_computation( P[ Sp[p].y ] );
    }
    Out[gid] = processed_value;
}
You should study the Morton code algorithms closely. Ericson's "Real-Time Collision Detection" explains them very well.
Ericson - Real time Collision detection
Here is another nice explanation including some tests:
Morton encoding/decoding through bit interleaving: Implementations
A Z-order curve only defines the path through the coordinates, that is, how 2D or 3D coordinates are hashed to a single integer. Although the algorithm goes one level deeper with every iteration, you have to set the limits yourself. Usually the stop index is denoted by a sentinel. Letting the sentinel stop the descent tells you at which level the particle is placed, so the maximum level you define determines the number of cells per dimension. For example, with a maximum level of 6 you have 2^6 = 64 cells per dimension, i.e. 64x64x64 cells in a 3D system. That also means that you have to use integer-based coordinates. If you use floats you have to convert them, e.g. coord.x = 64*float_x, and so on.
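As a rough illustration of the integer conversion and the bit interleaving, here is a short Python sketch. The names `cell_coord` and `morton3` are made up for this example, positions are assumed to be normalised to the range [0, 1), and the loop-based interleave is chosen for readability rather than speed (GPU code would normally use the bit-mask tricks described in the references above).

```python
def cell_coord(value, cells_per_axis=64):
    """Map a coordinate in [0, 1) to an integer cell index, clamped to the grid."""
    return min(cells_per_axis - 1, max(0, int(value * cells_per_axis)))


def morton3(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y and z into one Morton code.

    Bit i of x goes to bit 3*i, bit i of y to bit 3*i + 1,
    and bit i of z to bit 3*i + 2.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code
```

Note that cells that are adjacent in space do not necessarily get consecutive codes: cell (1, 0, 0) maps to 1, while its neighbour (2, 0, 0) maps to 8.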
If you know how many cells you have in your system you can define your limits. Are you trying to use a binary octree?
Since particles are in motion (in that CUDA example) you should try to parallelize over the number of particles instead of cells.
If you want to build lists of nearest neighbours you have to map the particles to cells. This is done through a table that is sorted afterwards by cells to particles. Still you should iterate through the particles and access its neighbours.
About your code:
The problem with that code is that it's slow. I suspect the nonlinear GPU memory access (particularly P[Sp[p].y] in the inner-most for loop) to be causing the slowness.
Remember Donald Knuth's warning about premature optimization: measure where the bottleneck is first. With CUDA you can use NVIDIA's profiler to look for the bottleneck. I am not sure what OpenCL offers as a profiler.
// ugly boundary checking
if ( dot(neigh_position<0, (float3)(1)) +
dot(neigh_position>BOUNDARY, (float3)(1)) != 0)
I think you should not branch that way. How about returning zero from heavy_computation instead? I am not sure, but you may be paying for divergent branches here. Try to remove that branch somehow.
Running in parallel over the cells is a good idea only if you have no write accesses to the particle data; otherwise you will have to use atomics. If you go over the particle range instead, you only have read accesses to the cells and neighbours, you build each sum in parallel, and you are not forced to deal with race conditions.
Also, what is the ideal ratio of number of particles/number of grid cells?
Really depends on your algorithms and the particle packing within your domain, but in your case I would define the cell size equivalent to the particle diameter and just use the number of cells you get.
So if you want to use Z-order and achieve your holy grail, try to use integer coordinates and hash them.
Also try to use a larger number of particles. You should consider about 65000 particles, as the CUDA example uses, because that way the parallelisation is most efficient: the processing units are better exploited (fewer idle threads).

Determine Framerate Based On Delay

On every frame of my application, I can call timeGetTime() to retrieve the current elapsed milliseconds, and subtract the value of timeGetTime() from the previous frame to get the time between the two frames. However, to get the frame rate of the application, I have to use this formula: fps=1000/delay(ms). So for instance if the delay was 16 milliseconds, then 1000/16=62.5 (stored in memory as 62). Then let's say the delay became 17 milliseconds, then 1000/17=58, and so on:
As you can see for consecutive instances for the delay, there are pretty big gaps in the frame rates. So how do programs like FRAPS determine the frame rate of applications that are between these values (eg 51,53,54,56,57,etc)?
Why would you do this on every frame? You'll find if you do it on every tenth frame, and then divide that value by 10, you'll easily get frame rates within the gaps you're seeing. You'll also probably find your frame rates are higher since you're doing less admin work within the loop :-)
In other words, something like (pseudo code):
chkpnt = 10
cntr = chkpnt
baseTime = now()  # in milliseconds
do lots of times:
    display next frame
    cntr = cntr - 1
    if cntr == 0:
        cntr = chkpnt
        newTime = now()
        # chkpnt frames took (newTime - baseTime) milliseconds
        display "framerate = " 1000 * chkpnt / (newTime - baseTime)
        baseTime = newTime
In addition to @Marko's suggestion to use a better timer, the key trick for a smoothly varying and better approximate evaluation of the frame rate is to use a moving average: don't consider only the very latest delay you've observed, consider the average of (say) the last five. You can compute the latter as a floating-point number, to get more possible values for the frame rate (which you can still round to the nearest integer of course).
For minimal computation, consider a "fifo queue" of the last 5 delays (pseudocode)...:
array = [16, 16, 16, 16, 16] # initial estimate
totdelay = 80
while not Done:
    newest = latestDelay()
    oldest = array.pop(0)
    array.append(newest)
    totdelay += (newest - oldest)
    # 5 frames took totdelay milliseconds, so fps = 5 * 1000 / totdelay
    estimatedFramerate = 5000 / totdelay
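The same queue-based idea can be written as a small runnable Python sketch. The class name `FrameRateEstimator`, the window of 5 and the 16 ms initial estimate are assumptions carried over from the pseudocode above for illustration; delays are taken to be in milliseconds.

```python
from collections import deque


class FrameRateEstimator:
    """Moving average of the last `window` frame delays (in milliseconds)."""

    def __init__(self, window=5, initial_delay_ms=16.0):
        # Pre-fill the queue with an initial estimate so the average is defined.
        self.delays = deque([initial_delay_ms] * window, maxlen=window)
        self.total = initial_delay_ms * window

    def add_delay(self, delay_ms):
        """Record the newest delay and return the estimated frames per second."""
        self.total += delay_ms - self.delays[0]  # swap oldest for newest in the sum
        self.delays.append(delay_ms)             # maxlen discards the oldest entry
        if self.total <= 0:
            return float("inf")                  # timer too coarse to measure
        return 1000.0 * len(self.delays) / self.total
```

Fed with a steady 16 ms delay this reports 62.5 fps, and as 17 ms delays replace the 16 ms ones the estimate moves through the intermediate values instead of jumping straight to 58.8.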
Not sure, but maybe you need a better (high-resolution) timer. Check QueryPerformanceCounter.
Instead of a moving average (as @Alex suggests) I suggest a Low-Pass Filter. It's easier to calculate, and can be tweaked to have an arbitrary amount of value smoothing with no change to performance or memory usage. In short (demonstrated in JavaScript):
var smoothing = 10; // The larger this value, the more smoothing
var fps = 30; // some likely starting value
var lastUpdate = new Date;
function onFrameUpdate(){
    var now = new Date;
    var frameTime = now - lastUpdate; // elapsed time in milliseconds
    if (frameTime <= 0) return; // timer too coarse to measure this frame
    var frameFPS = 1000/frameTime;
    // Here's the magic
    fps += (frameFPS - fps) / smoothing;
    lastUpdate = now;
}
For a pretty demo of this functionality, see my live example here:

What Are High-Pass and Low-Pass Filters?

Graphics and audio editing and processing software often contain functions called "High-Pass Filter" and "Low-Pass Filter". Exactly what do these do, and what are the algorithms for implementing them?
Here is how you implement a low-pass filter using convolution:
double[] signal = (some 1d signal);
double[] filter = { 0.25, 0.25, 0.25, 0.25 }; // box-car filter
// A full convolution has signal.Length + filter.Length - 1 output samples
double[] result = new double[signal.Length + filter.Length - 1];
// Set result to zero:
for (int i = 0; i < result.Length; i++) result[i] = 0;
// Do convolution:
for (int i = 0; i < signal.Length; i++)
    for (int j = 0; j < filter.Length; j++)
        result[i + j] = result[i + j] + signal[i] * filter[j];
Note that the example is extremely simplified. It does not do range checks and does not handle the edges properly. The filter used (box-car) is a particularly bad lowpass filter, because it will cause a lot of artifacts (ringing). Read up on filter design.
You can also implement the filters in the frequency domain. Here is how you implement a high-pass filter using FFT:
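double[] signal = (some 1d signal);
// Do FFT:
double[] real;
double[] imag;
[real, imag] = fft(signal);
// Zero the lowest quarter of the frequency bins to attenuate the low frequencies.
// Both the real and the imaginary part must be cleared, and because the spectrum
// of a real signal is mirrored, so must the matching bins at the top of the arrays.
int n = real.Length;
for (int i = 0; i < n / 4; i++) {
    real[i] = 0;
    imag[i] = 0;
    if (i > 0) { real[n - i] = 0; imag[n - i] = 0; }
}
// Do inverse FFT:
double[] highfrequencysignal = inversefft(real, imag);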
Again, this is simplified, but you get the idea. The code does not look as complicated as the math.
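For readers who want to run the frequency-domain idea, here is a self-contained Python sketch. It uses a naive O(n^2) discrete Fourier transform instead of a real FFT purely to avoid dependencies, so it is only suitable for short signals; the function names and the `cutoff_bin` parameter are assumptions made for this example.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (stand-in for an FFT routine)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n))
            for f in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * math.pi * f * k / n) for f in range(n)) / n
            for k in range(n)]


def high_pass(signal, cutoff_bin):
    """Zero every bin below cutoff_bin, including its mirrored counterpart."""
    n = len(signal)
    spectrum = dft(signal)
    for f in range(n):
        # For a real signal, bins f and n - f describe the same frequency.
        if min(f, n - f) < cutoff_bin:
            spectrum[f] = 0
    # The result of filtering a real signal this way is real up to rounding.
    return [value.real for value in idft(spectrum)]
```

Zeroing bins abruptly like this is the crudest possible frequency-domain filter; it shows the principle but is not a substitute for proper filter design.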
High-pass filter
Low-pass filter
Band-pass filter
These "high", "low", and "band" terms refer to frequencies. In high-pass, you try to remove low frequencies. In low-pass, you try to remove high. In band pass, you only allow a continuous frequency range to remain.
Choosing the cut-off frequency depends upon your application. Coding these filters can either be done by simulating RC circuits or by playing around with Fourier transforms of your time-based data. See the Wikipedia articles for code examples.
Here is a super simple example of a low pass filter in C++ that processes the signal one sample at a time:
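static float outputs[1] = {0.0f}; // filter state: the previous low-pass output
float lopass(float input, float cutoff) {
    // cutoff is a smoothing coefficient between 0 and 1, not a frequency in Hz
    outputs[0] = outputs[0] + (cutoff * (input - outputs[0]));
    return outputs[0];
}
Here is pretty much the same thing, except it's high pass:
float hipass(float input, float cutoff) {
    // update the low-pass state, then subtract it from the input
    outputs[0] = outputs[0] + (cutoff * (input - outputs[0]));
    return input - outputs[0];
}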
They are generally electrical circuits that tend to pass parts of analog signals. High pass tends to transmit more of the high frequency parts and low pass tends to pass more of the low frequency parts.
They can be simulated in software. A walking average can act as a low pass filter, for instance, and the difference between a walking average and its input can work as a high pass filter.
High-pass filter lets high-frequency (detailed/local information) pass.
Low-pass filter lets low-frequency (coarse/rough/global information) pass.
Filtering describes the act of processing data in a way that applies different levels of attenuation to different frequencies within the data.
A high pass filter will apply minimal attenuation (i.e. leave levels unchanged) to high frequencies, but applies maximum attenuation to low frequencies.
A low pass filter is the reverse: it will apply no attenuation to low frequencies but applies attenuation to high frequencies.
There are a number of different filtering algorithms that are used. The two simplest are probably the Finite Impulse Response filter (aka. FIR filter) and the Infinite Impulse Response filter (aka. IIR filter).
The FIR filter works by keeping a series of samples and multiplying each of those samples by a fixed coefficient (which is based on the position in the series). The results of these multiplications are accumulated, and the sum is the output for that sample. This is referred to as a Multiply-Accumulate, and in dedicated DSP hardware there is a specific MAC instruction for doing just this.
When the next sample is taken it's added to the start of the series, and the oldest sample in the series is removed, and the process repeated.
The behavior of the filter is fixed by the selection of the filter coefficients.
One of the simplest filters that is often provided by image processing software is the averaging filter. This can be implemented by an FIR filter by setting all of the filter coefficients to the same value.
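The FIR description above maps almost directly onto code. The following Python sketch is one minimal way to write it; the class name `FirFilter` is made up for illustration, and the history is assumed to start out as silence (all zeros).

```python
from collections import deque


class FirFilter:
    """Sample-by-sample FIR filter: a sliding history times fixed coefficients."""

    def __init__(self, coefficients):
        self.coefficients = list(coefficients)
        # Newest sample sits at index 0; the history starts as silence.
        size = len(self.coefficients)
        self.history = deque([0.0] * size, maxlen=size)

    def process(self, sample):
        """Push one input sample and return one output sample."""
        self.history.appendleft(sample)  # the oldest sample drops off the far end
        # Multiply-accumulate: each stored sample times the coefficient
        # for its position in the series.
        return sum(c * s for c, s in zip(self.coefficients, self.history))
```

With four coefficients of 0.25 this is the averaging (low-pass) filter mentioned above, while coefficients of 1 and -1 give a simple differencing filter that emphasises changes between neighbouring samples.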
