Align two audio tracks with similar parts - audio-processing

I have two audio tracks, extracted from two videos.
They sound almost the same, except for a few differences:
Different duration. E.g. the first track is 10 min long and the second track is 10.5 min long because it's stretched.
The first audio has only an English voice. The second audio contains English + a foreign-language voice, and you can hear both because they're mixed as a voiceover. In other words: audio 1 has Music, Noises, English speech; audio 2 has Music, Noises, English speech, Foreign-language speech.
The first and second tracks can also differ in their gaps between scenes.
E.g. first track could be
scene 1, gap 1sec, scene 2, gap 1sec, scene 3,
and second track could be
scene 1, gap 2sec, scene 2, gap 2sec, scene 3.
I wonder if there are any solutions that could align these two tracks.
That's what I've tried so far:
Cubase 10.5. https://www.youtube.com/watch?v=BGXkHdzjzMg It doesn't work if the tracks have different voices.
Revoice Pro. Same - it doesn't recognize tracks with different voices. It probably doesn't support long audio either.

Dynamic Time Warping (DTW) is the canonical algorithm for aligning sequences of data that might have slight differences in length/speed. The Python library librosa has a brief tutorial on using it for music synchronization.
There might also be DTW implementations in some graphical audio editors, but I am not familiar with any.
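For illustration, here is a minimal Python sketch of that approach, assuming the two tracks have been exported as WAV files (the file names are placeholders) and using librosa's DTW on chroma features, which mostly capture the shared music and are fairly robust to the extra voice:

import numpy as np
import librosa

# Load both tracks at a common sample rate; the file names are placeholders.
x1, sr = librosa.load("track_english.wav", sr=22050)
x2, _ = librosa.load("track_voiceover.wav", sr=22050)

hop = 512
c1 = librosa.feature.chroma_cqt(y=x1, sr=sr, hop_length=hop)
c2 = librosa.feature.chroma_cqt(y=x2, sr=sr, hop_length=hop)

# DTW over the two feature sequences; wp is the warping path (pairs of frame indices).
D, wp = librosa.sequence.dtw(X=c1, Y=c2)

# Convert the warping path from frame indices to seconds, earliest frames first.
path_seconds = np.asarray(wp)[::-1] * hop / sr
print(path_seconds[:10])

The warping path tells you, for each time in the first track, the corresponding time in the second one, which you can then use to cut or stretch the tracks in an editor.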

I would try to eliminate the silence gaps between the scenes in both audio tracks, so that you are left with a pair of lists of clean audio clips, one clip per scene.
Then I would recreate both audio signals. The stretched signal would have a constant-length gap between every scene. The original (non-stretched) signal would have variable gaps between scenes, equal to [length of constant gap] + [length of stretched scene - length of normal scene]. This would make every scene start at precisely the same time.
If the gaps between the scenes drop the audio signal to a perfect zero level, detecting and removing the gaps should be trivial.
Otherwise, this could be a little bit tricky (there is usually some DC shift and/or some background noise signal making it a bit difficult to detect "silence" from a time-domain wave representation).
I have successfully used an acoustic energy computation before, to precisely detect where an audio signal begins/ends. This implies sliding a Fourier Transform along the audio (be sure to use a tapered transform, with a Hann or Hamming window). Once you obtain the transform results, you can compute the energy by performing the following computation:
E = Sum(r[x]*r[x] + i[x]*i[x])
Where x goes from 0 to the [length of your Fourier transform] / 2 - 1, r represents the real part of each result bin, and i the imaginary part of each result bin.
This computation is performed repeatedly while sliding the Fourier transform along the audio, while you record the energy along the way. With some proper thresholding, you could probably successfully isolate the audio portions for each scene.
The length of the Fourier Transform can be small (probably something in the range of 64-256 would suffice, as you don't want a fine frequency resolution, just an estimate of the total energy present at a certain point in time).
Here is an example of a tapered Fourier Transform call (using the fftw3 library) computing the energy in a range of frequency bands:
double EnergyAnalyzer::GetEnergy(array<double>^ audioFrame, Int32 startIndex) {
    if( startIndex + FrameSize > audioFrame->Length ) {
        throw gcnew ArgumentException("The value of startIndex would overflow the array's boundary", "startIndex");
    }
    // Prepare input to the fourier transform. The signal is tapered using a Hann window
    for( int i = 0; i < FrameSize; i++ ) {
        _pIn[i] = audioFrame[startIndex + i] * _hann[i];
    }
    fftw_execute(_fftPlan);
    double energy = 0.0;
    for( int i = _binStart; i <= _binStop; i++ ) {
        energy += _pOut[i][0] * _pOut[i][0] + _pOut[i][1] * _pOut[i][1];
    }
    return energy;
}
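For reference, here is a minimal Python/NumPy sketch of the same sliding-window energy idea; the frame size, hop and threshold are assumptions you would tune for your material:

import numpy as np

def frame_energy(audio, frame_size=256, hop=128):
    # Short-time spectral energy using a Hann-tapered FFT per frame.
    window = np.hanning(frame_size)
    energies = []
    for start in range(0, len(audio) - frame_size, hop):
        spectrum = np.fft.rfft(audio[start:start + frame_size] * window)
        energies.append(np.sum(spectrum.real ** 2 + spectrum.imag ** 2))
    return np.array(energies)

# Frames whose energy falls below a tuned fraction of the peak are treated as "silence".
# audio = ...  (mono float samples)
# energies = frame_energy(audio)
# silent = energies < 0.01 * energies.max()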

Related

Algorithm for evenly arranging steps in 2 directions

I am currently programming the controller for a CNC machine, and therefore I need to get the amount of stepper motor steps in each direction when I get from point A to B.
For example, point A's coordinates are x=0 and y=0, and B's coordinates are x=15 and y=3. So I have to go 15 steps on the x axis and 3 on the y axis.
But how do I interleave those two values in a way that is smooth (i.e. not all x first and then all y, which results in really ugly lines)?
In my example with x=15 and y=3 I want it arranged like that:
for 3 times do:
x:4 steps y:0 steps
x:1 steps y:1 step
But how can I get these numbers from an algorithm?
I hope you get what my problem is, thanks for your time,
Luca
There are 2 major issues here:
trajectory
this can be handled by any interpolation/rasterization like:
DDA
Bresenham
The DDA is your best option as it can handle any number of dimensions easily and can be computed in both integer and floating-point arithmetic. It's also faster (that was not true back in the 386 days, but CPU architecture has changed since then).
And even if you have just a 2D machine, the interpolation itself will most likely be multidimensional, as you will probably add other quantities like holding force, tool RPM, pressures for whatever, etc., that have to be interpolated along your line in the same way.
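As a rough sketch of that idea (not tied to any particular controller; the step callback is a hypothetical placeholder for pulsing the motors), an integer DDA that interleaves the two axes could look like this:

def dda_steps(dx, dy, step):
    # Interleave dx steps on X with dy steps on Y as evenly as possible.
    # step(px, py) is a hypothetical callback; px/py are -1, 0 or +1 per call.
    sx = 1 if dx >= 0 else -1
    sy = 1 if dy >= 0 else -1
    dx, dy = abs(dx), abs(dy)
    n = max(dx, dy)
    ax = ay = 0                      # per-axis error accumulators
    for _ in range(n):
        ax += dx
        ay += dy
        px = py = 0
        if ax >= n:
            ax -= n
            px = sx
        if ay >= n:
            ay -= n
            py = sy
        step(px, py)

# Example: dx=15, dy=3 gives 15 X pulses with a Y pulse folded in on every 5th one.
moves = []
dda_steps(15, 3, lambda px, py: moves.append((px, py)))
print(moves)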
speed
This one is much much more complicated. You need to drive your motors smoothly from start position to the end concerning with these:
line start/end speeds so you can smoothly connect more lines together
top speed (dependent on the manufacturing process, usually constant for each tool)
motor/mechanics resonance
motor speed limits: start/stop and top
When writing about speed I mean the frequency [Hz] of the motor steps, or the physical speed of the tool in [m/s] or [mm/s].
Linear interpolation is not good for this; I am using cubics instead, as they can be smoothly connected and provide a good shape for the speed change. See:
How can i produce multi point linear interpolation?
The interpolation cubic (a form of Catmull-Rom) is exactly what I use for tasks like this (and I derived it for this very purpose).
The main problem is the startup of the motor. You need to drive from 0 Hz up to some frequency, but a usual stepper motor has resonances at the lower frequencies, and as those cannot be avoided on multidimensional machines, you need to spend as little time at such frequencies as possible. There are also other means of handling this: shifting the resonance of the kinematics by adding weights or changing the shape, and adding inertial dampeners on the motors themselves (rotary motors only).
So the usual speed control for a single start/stop line is a ramp up to the top speed, a constant-speed section, and a ramp down at the end.
So you should have 2 cubics, one for the start-up and one for the stopping, dividing your line into 2 joined ones. You have to do it so that the start and stop frequencies are configurable ...
Now how to merge speed and time? I am using discrete non linear time for this:
Find start point (time) of each cycle in a sine wave
It's the same process, but instead of time there is angle. The frequency of the sine wave changes linearly, so that is the part you need to replace with the cubic. Also, you do not have a sine wave here, so instead use the resulting time as the interpolation parameter for the DDA ... or compare it with the time of the next step, and if it is bigger or equal, do the step and compute the next one ...
Here another example of this technique:
how to control the speed of animation, using a Bezier curve?
This one actually does exactly what you should be doing ... interpolate the DDA with the speed controlled by a cubic curve.
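A minimal sketch of that "discrete non-linear time" idea, with a simple linear frequency ramp standing in for the cubic (purely illustrative; a real controller would use the cubic profile described above):

def step_times(n_steps, f_start, f_top, ramp_steps):
    # Absolute time of each step for a step rate that ramps from f_start to f_top
    # over the first ramp_steps steps and then holds f_top.
    t = 0.0
    times = []
    for i in range(n_steps):
        f = f_top if i >= ramp_steps else f_start + (f_top - f_start) * i / ramp_steps
        t += 1.0 / f              # time until the next step at the current rate
        times.append(t)
    return times

# Example: 200 steps, starting at 200 Hz and reaching 1000 Hz after 50 steps.
print(step_times(200, 200.0, 1000.0, 50)[:5])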
When done you need to build another layer on top of this which will configure the speeds for each line of trajectory so the result is as fast as possible and matching your machine speed limits and also matching tool speed if possible. This part is the most complicated one...
To show you what is ahead of you: when I put all this together, my CNC interpolator has ~166 KByte of pure C++ code, not counting dependencies like vector math, dynamic lists, communication, etc... The whole control code is ~2.2 MByte.
If your controller can issue commands faster than the steppers can actually turn, you probably want to use some kind of event-driven timer-based system. You need to calculate when you trigger each of the motors so that the motion is distributed evenly on both axes.
The longer motion should be programmed as fast as it can go (that is, if the motor can do 100 steps per second, pulse it every 1/100th of a second) and the other motion at longer intervals.
Edit: the paragraph above assumes that you want to move the tool as fast as possible. This is not normally the case. Usually, the tool speed is given, so you need to calculate the speed along X and Y (and maybe also Z) axes separately from that. You also should know what tool travel distance corresponds to one step of the motor. So you can calculate the number of steps you need to do per time unit, and also duration of the entire movement, and thus time intervals between successive stepper pulses along each axis.
So you program your timer to fire after the smallest of the calculated time intervals, pulse the corresponding motor, program the timer for the next pulse, and so on.
This is a simplification because motors, like all physical objects, have inertia and need time to accelerate/decelerate. So you need to take this into account if you want to produce smooth movement. There are more considerations to be taken into account. But this is more about physics than programming. The programming model stays the same. You model your machine as a physical object that reacts to known stimuli (stepper pulses) in some known way. Your program calculates timings for stepper pulses from the model, and sits in an event loop, waiting for the next time event to occur.
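A minimal sketch of that scheduling idea (simulated with a sorted list instead of a hardware timer; the names and units are illustrative):

import math

def pulse_schedule(dx_mm, dy_mm, feed_mm_s, steps_per_mm):
    # Merged, time-ordered list of (time_s, axis) pulses for a straight move.
    length = math.hypot(dx_mm, dy_mm)
    duration = length / feed_mm_s               # total move time in seconds
    steps_x = round(abs(dx_mm) * steps_per_mm)
    steps_y = round(abs(dy_mm) * steps_per_mm)
    events = []
    if steps_x:
        events += [((i + 1) * duration / steps_x, "X") for i in range(steps_x)]
    if steps_y:
        events += [((i + 1) * duration / steps_y, "Y") for i in range(steps_y)]
    events.sort()                                # fire the earliest pulse first
    return events

# Example: 15 mm in X and 3 mm in Y at 10 mm/s with 1 step/mm.
for t, axis in pulse_schedule(15, 3, 10.0, 1):
    print(round(t, 3), axis)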
Consider Bresenham's line drawing algorithm - it was invented for plotters many years ago. (The DDA too.)
In your case the X/Y displacements have a common divisor GCD = 3 > 1, so the steps change evenly, but in the general case they won't be distributed so uniformly.
You should take the ratio between the distance on each of the coordinates, and then alternate between steps along the coordinate that has the longest distance with steps that do a single unit step on both coordinates.
Here is an implementation in JavaScript -- using only the simplest of its syntax:
function steps(a, b) {
    const dx = Math.abs(b.x - a.x);
    const dy = Math.abs(b.y - a.y);
    const sx = Math.sign(b.x - a.x); // sign = -1, 0, or 1
    const sy = Math.sign(b.y - a.y);
    const longest = Math.max(dx, dy);
    const shortest = Math.min(dx, dy);
    const ratio = shortest / longest;
    const series = [];
    let longDone = 0;
    let remainder = 0;
    for (let shortStep = 0; shortStep < shortest; shortStep++) {
        const steps = Math.ceil((0.5 - remainder) / ratio);
        if (steps > 1) {
            if (dy === longest) {
                series.push( {x: 0, y: (steps-1)*sy} );
            } else {
                series.push( {x: (steps-1)*sx, y: 0} );
            }
        }
        series.push( {x: sx, y: sy} );
        longDone += steps;
        remainder += steps*ratio - 1;
    }
    if (longest > longDone) {
        // Apply the axis sign to the leftover steps as well.
        if (dy === longest) {
            series.push( {x: 0, y: (longest-longDone)*sy} );
        } else {
            series.push( {x: (longest-longDone)*sx, y: 0} );
        }
    }
    return series;
}
// Demo
console.log(steps({x: 0, y: 0}, {x: 3, y: 15}));
Note that the first segment is shorter than all the others, so that it is more symmetrical with how the sequence ends near the second point. If you don't like that, then replace the occurrence of 0.5 in the code with either 0 or 1.

XNA 2D Camera losing precision

I have created a 2D camera (code below) for a top-down game. Everything works fine when the player's position is close to 0.0x and 0.0y.
Unfortunately, as the distance increases the transform seems to have problems. At around 0.0x, 30e7y (yup, that's 30 million y) the camera starts to shudder when the player moves (the camera gets updated with the player position at the end of each update). At really big distances, a billion plus, the camera won't even track the player, as I'm guessing whatever error is in the matrix is amplified by too much.
My question is: is there a problem in the matrix, or is this standard behavior for such extreme numbers?
Camera Transform Method:
public Matrix getTransform()
{
    Matrix transform;
    transform = (Matrix.CreateTranslation(new Vector3(-position.X, -position.Y, 0)) *
                 Matrix.CreateRotationZ(rotation) *
                 Matrix.CreateScale(new Vector3(zoom, zoom, 1.0f)) *
                 Matrix.CreateTranslation(new Vector3((viewport.Width / 2.0f), (viewport.Height / 2.0f), 0)));
    return transform;
}
Camera Update Method:
This requests the object's position given its ID; it returns a basic Vector2 which is then set as the camera's position.
if (camera.CameraMode == Camera2D.Mode.Track && cameraTrackObject != Guid.Empty)
{
    camera.setFocus(quadTree.getObjectPosition(cameraTrackObject));
}
If anyone can see an error or enlighten me as to why the matrix struggles, I would be most grateful.
I have actually found the reason for this; it was something I should have thought of.
I'm using single-precision floating point, which only has about 7 significant digits of precision. That's fine for smaller numbers (up to around the 2.5 million mark, I have found). Anything over this and the multiplications in the matrix start to accumulate precision errors as the floats get truncated.
The best solution for my particular problem is to introduce some artificial scaling (I need the very large numbers as the simulation is set in space). I have limited my worlds to 5 million units squared (+/- 2.5 million units) and will come up with another way of granulating the world.
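A quick way to see the effect (a small illustrative check, not from the original post) is to print the gap between adjacent representable single-precision values at different magnitudes:

import numpy as np

for value in (1.0e3, 2.5e6, 3.0e7, 1.0e9):
    gap = np.spacing(np.float32(value))   # distance to the next representable float32
    print(value, "->", gap)

At 3.0e7 the spacing is already 2 units, so sub-unit camera movement can no longer be represented, which shows up as shuddering.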
I also found a good answer about this here:
Vertices shaking with large camera position values
And a good article that discusses floating points in more detail:
What Every Computer Scientist Should Know About Floating-Point Arithmetic
Thank you for the views and comments!!

Scaling Laplacian of Gaussian Edge Detection

I am using Laplacian of Gaussian for edge detection using a combination of what is described in http://homepages.inf.ed.ac.uk/rbf/HIPR2/log.htm and http://wwwmath.tau.ac.il/~turkel/notes/Maini.pdf
Simply put, I'm using this equation :
for (int i = -(kernelSize/2); i <= (kernelSize/2); i++)
{
    for (int j = -(kernelSize/2); j <= (kernelSize/2); j++)
    {
        double L_xy = -1/(Math.PI * Math.pow(sigma,4)) * (1 - ((Math.pow(i,2) + Math.pow(j,2)) / (2*Math.pow(sigma,2)))) * Math.exp(-((Math.pow(i,2) + Math.pow(j,2)) / (2*Math.pow(sigma,2))));
        L_xy *= 426.3;
    }
}
and using the L_xy values to build the LoG kernel.
The problem is that when the image size is larger, applying the same kernel makes the filter more sensitive to noise. The edge sharpness is also not the same.
Let me put an example here...
Suppose we've got this image:
Using a value of sigma = 0.9 and a kernel size of 5 x 5 matrix on a 480 × 264 pixel version of this image, we get the following output:
However, if we use the same values on a 1920 × 1080 pixels version of this image (same sigma value and kernel size), we get something like this:
[Both images are scaled-down versions of an even larger image. The scaling down was done with a photo editor, which means the data contained in the two images is not exactly the same. But, at least, it should be very close.]
Given that the larger image is roughly 4 times the smaller one, I also tried scaling sigma by a factor of 4 (sigma *= 4), and the output was... you guessed it, a black canvas.
Could you please help me figure out how to implement a LoG edge detector that finds the same features in an input signal even if the incoming signal is scaled up or down (the scaling factor will be given)?
Looking at your images, I suppose you are working in 24-bit RGB. When you increase your sigma, the response of your filter weakens accordingly, so what you get in the larger image with the larger kernel are values close to zero, which are either truncated or so close to zero that your display cannot distinguish them.
To make differentials across different scales comparable, you should use the scale-normalized differential operator (Lindeberg et al.):
LoG_norm(x, y; \sigma) = \sigma^{\gamma} * \nabla^2 (G_{\sigma} * L)(x, y)
Essentially, the differential operators are applied to the Gaussian kernel function G_{\sigma}, and the result (or alternatively the convolution kernel itself, since it is just a scalar multiplier anyway) is scaled by \sigma^{\gamma}. Here L is the input image and LoG is the Laplacian-of-Gaussian image.
When the order of the differential is 2, \gamma is typically set to 2.
Then you should get quite similar magnitude in both images.
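As a minimal sketch of that normalization (using SciPy's gaussian_laplace as the LoG and gamma = 2; this is an illustration, not the poster's Java code):

import numpy as np
from scipy import ndimage

def normalized_log(image, sigma, gamma=2.0):
    # Scale-normalized Laplacian of Gaussian: sigma**gamma * LoG(image, sigma).
    return (sigma ** gamma) * ndimage.gaussian_laplace(image.astype(np.float64), sigma=sigma)

# With the normalization, the response at sigma on the small image and at
# 4*sigma on the 4x larger image should have comparable magnitude, so the
# same threshold can be used for both.
# small_response = normalized_log(small_image, sigma=0.9)
# large_response = normalized_log(large_image, sigma=0.9 * 4)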
Sources:
[1] Lindeberg: "Scale-space theory in computer vision" 1993
[2] Frangi et al. "Multiscale vessel enhancement filtering" 1998

Correct usage of SetDeviceGammaRamp

I'd like to add the ability to adjust the screen gamma at application startup and reset it at exit. While it's debatable whether one should tamper with gamma at all (personally I find it useless and detrimental), some people expect to be able to do that kind of thing.
It's just one simple API call too, so it's all easy, right?
MSDN says: "The gamma ramp is specified in three arrays of 256 WORD elements each [...] values must be stored in the most significant bits of each WORD to increase DAC independence.". This means, in my understanding, something like word_value = byte_value<<8, which sounds rather weird, but it's how I read it.
The Doom3 source code contains a function that takes three arrays of char values and converts them into an array of uint16_t values that have the same byte value both in the upper and lower half. In other words something like word_value = (byte_value<<8)|byte_value. This is equally weird, but what's worse it is not the same as above.
Also, there exist a few code snippets on the internet on various hobby programmer sites (apparently one copied from the other, because they're identical to the letter) which do some obscure math: multiplying the linear index by a value, biasing by 128, and clamping to 65535. I'm not quite sure what this is about, but it looks like total nonsense to me, and again it is not the same as either of the above two.
What gives? Surely it must be well-defined -- without guessing -- what the data that you supply must look like? In the end, what one will do is read the original values and let the user tweak some sliders anyway (and optionally save that blob to disk with the user's config), but still... in order to modify these values, one needs to know what they are and what's expected.
Has anyone done (and tested!) this before and knows which one is right?
While investigating the ability to change screen brightness programmatically, I came across this article Changing the screen brightness programmingly - By using the Gama Ramp API.
Using the debugger, I took a look at the values provided by the GetDeviceGammaRamp() function. The output is a two-dimensional array defined as something like WORD GammaArray[3][256]; and is a table of 256 values used to modify the Red, Green, and Blue values of displayed pixels. The values I saw started at zero (0) at index 0, with each subsequent value 256 larger than the previous one. So the sequence is 0, 256, 512, ..., 65024, 65280 for indices 0, 1, 2, ..., 254, 255.
My understanding is that these values are used to modify the RGB value for each pixel. By modifying the table value you can modify the display brightness. However the effectiveness of this technique may vary depending on display hardware.
You may find this brief article, Gamma Controls, of interest as it describes gamma ramp levels though from a Direct3D perspective. The article has this to say about Gamma Ramp Levels.
In Direct3D, the term gamma ramp describes a set of values that map
the level of a particular color component—red, green, blue—for all
pixels in the frame buffer to new levels that are received by the DAC
for display. The remapping is performed by way of three look-up
tables, one for each color component.
Here's how it works: Direct3D takes a pixel from the frame buffer and
evaluates its individual red, green, and blue color components. Each
component is represented by a value from 0 to 65535. Direct3D takes
the original value and uses it to index a 256-element array (the
ramp), where each element contains a value that replaces the original
one. Direct3D performs this look-up and replace process for each color
component of each pixel in the frame buffer, thereby changing the
final colors for all of the on-screen pixels.
According to the online documentation for GetDeviceGammaRamp() and SetDeviceGammaRamp(), these functions are supported in the Windows API beginning with Windows 2000 Professional.
I used their source condensed down to the following example inserted into a Windows application to test the effect using values from the referenced article. My testing was done with Windows 7 and an AMD Radeon HD 7450 Graphics adapter.
With this test both of my displays, I have two displays, were affected.
// Generate the 256-colors array for the specified wBrightness value.
WORD GammaArray[3][256];
HDC hGammaDC = ::GetDC(NULL);
WORD wBrightness;

::GetDeviceGammaRamp(hGammaDC, GammaArray);

wBrightness = 64;   // reduce the brightness
for (int ik = 0; ik < 256; ik++) {
    int iArrayValue = ik * (wBrightness + 128);
    if (iArrayValue > 0xffff) iArrayValue = 0xffff;
    GammaArray[0][ik] = (WORD)iArrayValue;
    GammaArray[1][ik] = (WORD)iArrayValue;
    GammaArray[2][ik] = (WORD)iArrayValue;
}
::SetDeviceGammaRamp(hGammaDC, GammaArray);
Sleep(3000);

wBrightness = 128;  // set the brightness back to normal
for (int ik = 0; ik < 256; ik++) {
    int iArrayValue = ik * (wBrightness + 128);
    if (iArrayValue > 0xffff) iArrayValue = 0xffff;
    GammaArray[0][ik] = (WORD)iArrayValue;
    GammaArray[1][ik] = (WORD)iArrayValue;
    GammaArray[2][ik] = (WORD)iArrayValue;
}
::SetDeviceGammaRamp(hGammaDC, GammaArray);
Sleep(3000);

::ReleaseDC(NULL, hGammaDC);
As an additional note, I made a slight change to the above source so that instead of modifying each of the RGB values equally, I commented out the first two assignments so that only GammaArray[2][ik] was modified. The result was a yellowish cast to the display.
I also tried putting the above source in a loop to check how the display changed and it was quite a difference from wBrightness=0 to wBrightness=128.
for (wBrightness = 0; wBrightness <= 128; wBrightness += 16) {
    for (int ik = 0; ik < 256; ik++) {
        int iArrayValue = ik * (wBrightness + 128);
        if (iArrayValue > 0xffff) iArrayValue = 0xffff;
        GammaArray[0][ik] = (WORD)iArrayValue;
        GammaArray[1][ik] = (WORD)iArrayValue;
        GammaArray[2][ik] = (WORD)iArrayValue;
    }
    ::SetDeviceGammaRamp(hGammaDC, GammaArray);
    Sleep(3000);
}
Microsoft provides an on-line MSDN article, Using gamma correction, that is part of the Direct3D documentation which describes the basics of gamma as follows:
At the end of the graphics pipeline, just where the image leaves the
computer to make its journey along the monitor cable, there is a small
piece of hardware that can transform pixel values on the fly. This
hardware typically uses a lookup table to transform the pixels. This
hardware uses the red, green and blue values that come from the
surface to be displayed to look up gamma-corrected values in the table
and then sends the corrected values to the monitor instead of the
actual surface values. So, this lookup table is an opportunity to
replace any color with any other color. While the table has that level
of power, the typical usage is to tweak images subtly to compensate
for differences in the monitor’s response. The monitor’s response is
the function that relates the numerical value of the red, green and
blue components of a pixel with that pixel’s displayed brightness.
Additionally the software application Redshift has a page Windows gamma adjustments which has this to say about Microsoft Windows.
When porting Redshift to Windows I ran into trouble when setting a
color temperature lower than about 4500K. The problem is that Windows
sets limitations on what kinds of gamma adjustments can be made,
probably as a means of protecting the user against evil programs that
invert the colors, blank the display, or play some other annoying
trick with the gamma ramps. This kind of limitation is perhaps
understandable, but the problem is the complete lack of documentation
of this feature (SetDeviceGammaRamp on MSDN). A program that tries to
set a gamma ramp that is not allowed will simply fail with a generic
error leaving the programmer wondering what went wrong.
I haven't tested this, but if I had to guess, early graphics cards were non-standard in their implementation of SetDeviceGammaRamp() when Doom 3 was written, and sometimes used the LOBYTE and sometimes the HIBYTE of the WORD value. The consensus moved to only using the HIBYTE, hence the word_value = byte_value<<8.
Here's another datapoint, from the PsychoPy library (in python) which is just swapping LOBYTE and HIBYTE:
"""Sets the hardware look-up table, using platform-specific ctypes functions.
For use with pyglet windows only (pygame has its own routines for this).
Ramp should be provided as 3x256 or 3x1024 array in range 0:1.0
"""
if sys.platform=='win32':
newRamp= (255*newRamp).astype(numpy.uint16)
newRamp.byteswap(True)#necessary, according to pyglet post from Martin Spacek
success = windll.gdi32.SetDeviceGammaRamp(pygletWindow._dc, newRamp.ctypes)
if not success: raise AssertionError, 'SetDeviceGammaRamp failed'
It also appears that Windows doesn't allow all gamma settings, see:
http://jonls.dk/2010/09/windows-gamma-adjustments/
Update:
The first Windows APIs to offer gamma control are Windows Graphics Device Interface (GDI)’s SetDeviceGammaRamp and GetDeviceGammaRamp. These APIs work with three 256-entry arrays of WORDs, with each WORD encoding zero up to one, represented by WORD values 0 and 65535. The extra precision of a WORD typically isn’t available in actual hardware lookup tables, but these APIs were intended to be flexible. These APIs, in contrast to the others described later in this section, allow only a small deviation from an identity function. In fact, any entry in the ramp must be within 32768 of the identity value. This restriction means that no app can turn the display completely black or to some other unreadable color.
http://msdn.microsoft.com/en-us/library/windows/desktop/jj635732(v=vs.85).aspx
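For completeness, here is a minimal ctypes sketch of the brightness-scaled ramp used in the C++ snippets above (Windows only; treat it as an outline rather than production code, since error handling and 64-bit handle types are glossed over):

import ctypes

def set_brightness(brightness=128):
    # Build the same ik * (brightness + 128) ramp as the C++ example and apply it.
    ramp = ((ctypes.c_ushort * 256) * 3)()
    for i in range(256):
        value = min(i * (brightness + 128), 65535)
        ramp[0][i] = ramp[1][i] = ramp[2][i] = value
    hdc = ctypes.windll.user32.GetDC(None)
    ok = ctypes.windll.gdi32.SetDeviceGammaRamp(hdc, ctypes.byref(ramp))
    ctypes.windll.user32.ReleaseDC(None, hdc)
    return bool(ok)

# Example (Windows only): dim the display, wait, then restore it.
# set_brightness(64); time.sleep(3); set_brightness(128)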

How to Spectrum-inverse a sampled audio signal

I am looking for simple (pseudo)code that spectrum-inverts a sampled audio signal.
Ideally C++.
The code should support different sample rates (16/32/48 kHz).
Mixing the signal by Fs/2 will swap high frequencies and low frequencies - think of rotating the spectrum around the unit circle by half a turn. You can achieve this rotation by multiplying every other sample by -1.
Mixing by Fs/2 is equivalent to mixing by exp(j*pi*n). If x is the input and y the output,
y[n] = x[n] * exp(j*pi*n) = x[n] * [cos(pi*n) + j*sin(pi*n)]
This simplifies easily because sin(pi*n) is 0, and cos(pi*n) is alternating 1,-1.
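A minimal sketch of that (in Python/NumPy for brevity; the loop translates directly to C++), which works at any sample rate because the flip is always around Fs/2:

import numpy as np

def spectrum_invert(samples):
    # Mirror the spectrum around Fs/2 by negating every other sample.
    out = np.array(samples, dtype=np.float64)
    out[1::2] *= -1.0
    return out

# Example: a 440 Hz tone at 16 kHz ends up near 16000/2 - 440 = 7560 Hz.
sr = 16000
t = np.arange(sr) / sr
inverted = spectrum_invert(np.sin(2 * np.pi * 440 * t))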
In order to get something that has the same type of temporal structure as the original, you need to
Create a spectrogram (with some window size)
Pick some upper and lower frequency bounds that you'll flip
Flip the spectrogram's intensities within those bounds
Resynthesize a sound signal consistent with those frequencies
Since it's an audio signal, it doesn't much matter that the phases will be all messed up. You can't generally hear them anyway. Except for the flipping part, ARSS does the spectrogram creation and sound resynthesis.
Otherwise, you can just take an FFT, invert the amplitudes of the components, and take the inverse FFT. But that will be essentially nonsensical, as it will completely scramble the temporal structure of the sound as well as the frequency structure.
It does not make much sense to use a cosine. For a digital signal it is not necessary to run a real ring modulator here; at Nyquist a cosine is a square wave anyway.
So you would just multiply every other sample by -1 and you are done.
No latency, no aliasing, no nothing.

Resources