Why are mono sounds preferred over stereo sounds when analysing acoustic parameters like intensity, RMS amplitude, fundamental frequency, etc.? - spectrogram

In phonetic analyses with software like Praat, I have seen that people prefer mono sounds over stereo sounds. Even if the recordings were made with a stereo microphone, the audio files are often converted to mono. Is there any reason besides saving a lot of memory? Thanks.

I've gathered a bunch of sources that marginally comment on whether to use mono or stereo audio. The consensus seems to be that mono audio saves space, as you said. However, a few of the sources note cases where you would want to use stereo audio:
You want to record simultaneous electroglottograph data. Having both the audio and the electroglottograph data on one file as "stereo" audio keeps your data tidy.
You want to record audio once, but also want to choose the better of the two channels after your recording is done.
Also, note that human language is pretty much produced as a mono signal. I don't know of any human language that depends on stereo audio in order to communicate meaningful information. This link mentions that some occupations involving emergency communication use different audio in each ear to get more information across at the same time, but that's the closest resource I could find concerning stereo uses of language.
It also appears that Praat can't work with multiple files when they aren't all mono or all stereo, so that might also be a reason to use mono audio. Duplicating mono audio into both ears isn't quite the same as reducing stereo audio to mono audio.
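As a rough illustration of that difference, here is a hypothetical sketch using numpy and the soundfile package (the file names are made up, and Praat's own "Convert to mono" command does something comparable, though you should check its manual): downmixing averages the two channels into one, whereas duplicating mono just copies the same row twice.
```python
import numpy as np
import soundfile as sf  # assumed available; any WAV reader would do

# Read a stereo file: data has shape (n_samples, 2)
data, rate = sf.read("recording_stereo.wav")  # hypothetical file name

# Downmixing to mono: average the two channels into one.
mono_downmix = data.mean(axis=1)

# Duplicating mono into both ears: two identical copies of the same row,
# which is not the same thing as a genuine stereo recording.
fake_stereo = np.stack([mono_downmix, mono_downmix], axis=1)

sf.write("recording_mono.wav", mono_downmix, rate)
```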
Here are the most useful sources I found:
From https://colangpraat.wordpress.com/part-3-how-to-record-using-praat/
In the menu bar of the Objects window, click NEW and RECORD MONO SOUND. PRAAT also has the capability to record in stereo, but when it comes to gathering language data, mono files are preferred.
From https://web.stanford.edu/dept/linguistics/corpora/material/PRAAT_workshop_manual_v421.pdf
In most cases, you will record a single speech or voice sample and for that purpose you can select 'Record mono Sound...'. If you want to make stereo recordings, you obviously have to use 'Record stereo Sound...'. The latter option, for example, can be used to digitize the stereo output signal of the EG-2 PC Electroglottograph from Glottal Enterprises (http://www.glottal.com/electroglottograph.html), thus giving you access to a simultaneous recording of a speech and EGG signal.
From https://www.fon.hum.uva.nl/praat/manual/ExperimentMFC_2_2__The_stimuli.html
You can also use AIFF files, in which case stimulusFileNameTail would probably be ".aiff", or any other type of sound file that Praat supports. But all sound files must have the same number of channels (i.e. all mono or all stereo) and the same sampling frequency.
From https://www.fon.hum.uva.nl/david/LOT/sspbook.pdf
Before we go on, we repeat that a sound is represented in Praat as a matrix, which means that sounds are stored as rows of numbers. A mono sound is a matrix with only one row and many columns. A stereo sound is a sound with two channels; each channel is represented in one row of the matrix. A stereo sound is therefore a matrix with two rows, and both rows have the same number of columns. Each matrix cell contains one sample value. Whenever we want to use a formula on a sound we can think about a sound as a matrix.
From https://person2.sol.lu.se/SidneyWood/praate/monstee.html
But make sure you have something to gain from merging the channels into one mono signal. It is simpler to use the Stereo recorder in Praat and take the best channel.

Related

FFMPEG: Automatically remove audio codec latency

I am using FFMPEG to apply several audio codecs to a large number of speech files. Each codec introduces a different latency and I could not find a description of the precise value of this delay. However, for an investigation I need to correct this delay, i.e. adjust the audio after coding such that it is time aligned with the original, uncoded audio. At the moment I use cross-correlation to figure out the actual latency introduced by the codecs which is working fine, but it feels slightly unreliable. Is there some way to a) remove the delay automatically or b) to exactly know the delay of the codecs?
Thank you!
Cross-correlation is about as good as you can do to measure these delays. If you are currently using the full duration of audio data (or the same durations across all files), you can shorten the original audio by a little more than the maximum delay expected. This way the cross-correlations are fully supported (i.e., no zero padding) over the lag range of your interest.
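For what it's worth, here is a minimal sketch of that cross-correlation approach (file names and the 100 ms delay bound are my own assumptions, and both files are assumed to be mono):
```python
import numpy as np
from scipy.signal import correlate
import soundfile as sf  # assumed available; any WAV reader would do

# Hypothetical file names; both assumed to be mono speech files.
original, rate = sf.read("original.wav")
decoded, _ = sf.read("decoded.wav")

max_delay = int(0.1 * rate)  # assumed upper bound on codec delay (100 ms)

# Shorten the reference so the correlation is fully supported over the
# lag range of interest (no zero-padding effects), as suggested above.
reference = original[: len(original) - max_delay]

# Cross-correlate and take the lag with the highest correlation.
corr = correlate(decoded, reference, mode="valid")
delay_samples = int(np.argmax(corr))
print(f"Estimated delay: {delay_samples} samples "
      f"({1000.0 * delay_samples / rate:.1f} ms)")

# Time-align by trimming the leading delay from the coded signal.
aligned = decoded[delay_samples:]
```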

How do I interpret the audio stream from the microphone to detect a certain sound in WP7?

I am using the basic methods from http://msdn.microsoft.com/en-us/library/gg442302(v=vs.92).aspx to access the microphone. But I am trying to detect the occurrence of a specific sound, like a clapper. How does one interpret the stream from the microphone? What exactly do the floats in the buffer represent?
Thanks
I think this might help: http://en.wikipedia.org/wiki/Pulse-code_modulation. I think the values roughly represent the offset of the mechanical part of the microphone from its middle position, but I'm sure the theory and vocabulary go much deeper.
When it comes to recognizing sounds, it can also get arbitrarily complex, but a clapper might be a simple task: you basically want to detect a sudden increase in volume, which would manifest as a sharp, short-term increase in the moving average of absolute values in the stream. So I'd put sliding windows on the stream and keep checking against certain thresholds: one short window for the high-volume threshold, and two adjacent, longer, lower-threshold windows to make sure there was no such noise before and after the clap.
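A rough sketch of that idea, operating on a buffer of float samples; all window lengths and thresholds here are made-up starting points rather than tested values:
```python
import numpy as np

def looks_like_clap(samples, rate,
                    short_ms=10, long_ms=200,
                    high_threshold=0.5, low_threshold=0.1):
    """Very rough clap detector on a buffer of float samples (-1..1).
    One short window must be loud while the longer windows before and
    after it stay quiet. All window lengths and thresholds here are
    guesses that would need tuning."""
    short_n = int(rate * short_ms / 1000)
    long_n = int(rate * long_ms / 1000)
    level = np.abs(samples)

    for start in range(long_n, len(samples) - short_n - long_n, short_n):
        before = level[start - long_n:start].mean()
        middle = level[start:start + short_n].mean()
        after = level[start + short_n:start + short_n + long_n].mean()
        if middle > high_threshold and before < low_threshold and after < low_threshold:
            return True
    return False
```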

Determining Apple AAC vs. Lossless format using something like Ruby MP4Info gem?

I'm trying to organize music for a radio station and have an iTunes library with a huge number of music files. The files are in various formats (FLAC, MP3, AAC, etc.). I need to break all the files up by format.
I have a simple Ruby script that walks the directory tree and can pull by extension, so I can move all .mp3 files into an MP3 directory. However, I have a problem with m4a files, because some .m4a files are Apple Lossless format and some are Apple's AAC format.
The problem I have is that the MP4Info gem seems only to have "Encoder," which returns something like iTunes 9.0.2, which is not helpful in determining lossless vs lossy formatting.
So, my thought is to take the SIZE attribute of the file and divide that by the SECS attribute. It seems that I should be able to come up with a decent rubric for bytes/second in a lossless vs a lossy format, since they will be roughly an order of magnitude off. I'm not sure what order of magnitude I'm looking for (it depends on bitrate, I'd guess).
Are there better, easier ways to do this?
So, it looks like using the heuristics for bitrate (e.g. the values given at en.wikipedia.org/wiki/Bit_rate#Audio_.28MP3.29) is useful, but maybe more useful is the iTunes list of songs.
It turns out that there is a column available in iTunes for bitrate, and anything saved as AAC will be listed with a bitrate of "256 (Variable)." So one can easily sort the entire music library by bitrate and find all songs with a value of 256. You can also see which are lower, and anything above around 600 is going to be lossless.
There's an issue around 300-500 or so. Depending on the complexity of the music, you might have a song with a bitrate of 400 or 500 that is lossless. I'm not sure what to do there, but it's a pretty small number of songs in total.
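If you'd rather stay in a script than in iTunes, a rough sketch of the bytes-per-second idea might look like this (the 600/320 kbps cutoffs are my own guesses, and I'm using the Python mutagen library purely to read the duration, since the question's Ruby MP4Info approach would work the same way):
```python
import os
from mutagen.mp4 import MP4  # assumed available; only used to read the duration

def rough_format_guess(path, lossless_kbps=600, lossy_kbps=320):
    """Guess 'lossless', 'lossy', or 'unsure' for an .m4a file from its
    average bitrate (file size / duration). The cutoffs are rough guesses."""
    size_bytes = os.path.getsize(path)
    duration_s = MP4(path).info.length
    kbps = size_bytes * 8 / duration_s / 1000

    if kbps >= lossless_kbps:
        return "lossless"
    if kbps <= lossy_kbps:
        return "lossy"
    return "unsure"  # the 300-500-ish grey zone mentioned above
```
A tag-reading library may also be able to report the codec directly, which would avoid the grey zone entirely, but I haven't checked whether MP4Info exposes that.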

How do I programatically sort through media?

I have 6 servers with an aggregated storage capacity of 175 TB, all hosting different sorts of media. A lot of the media is duplicated, with copies stored in different formats, so what I need is a library or something I can use to read the tags in the media and decide whether it is the best copy available. For example, some of my media is Japanese content for which I have DVD and now Blu-ray rips. This content sometimes has "hardsubs", i.e. subtitles that are encoded into the video, and "softsubs", which are subtitles that are rendered on top of the raw video when it plays. I would like to be able to find all copies of that rip and compare them by resolution, whether or not they have softsubs, and which audio format and quality they use.
Therefore, can anyone suggest a library I can incorporate into my program to do this?
EDIT: I forgot to mention that the distribution server mounts the other servers as drives and is running Windows Server, so I will probably code the solution in C#. And all the media is for my own legal use; I have so many copies because some of the content is in other formats for other players. For example, I have some of my Blu-rays re-encoded to Xvid for my Xbox, since it can't play Blu-ray.
When this is done, I plan to open source the code since there doesn't seem to be anything like this already and I'm sure it can help someone else.
I don't know of any libraries, but as I try to think about how I'd programmatically approach it, I come up with this:
It is the keyframes that are most likely to be comparable. Keyframes occur regularly, but more importantly they occur during major scene changes. These big changes will be common across many different formats, and those frames can be compared as still images. You may find a still-image comparison library more easily.
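As a very rough sketch of that still-image comparison step, assuming the keyframes have already been dumped to image files (e.g. with ffmpeg) and that the Pillow and imagehash libraries are available:
```python
from PIL import Image
import imagehash  # perceptual hashing library; assumed available

def frames_match(frame_a_path, frame_b_path, max_distance=8):
    """Compare two keyframe images by perceptual hash. A small Hamming
    distance means 'probably the same scene', even across different
    encodes and resolutions. The distance cutoff is a guess to tune."""
    hash_a = imagehash.phash(Image.open(frame_a_path))
    hash_b = imagehash.phash(Image.open(frame_b_path))
    return (hash_a - hash_b) <= max_distance
```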
Of course, you'll still have to find something to read all the different formats, but it's a start and a fun exercise to think about, even if the coding time involved is far beyond my one-person threshold.

Which format has an audio stream before reaching the soundcard?

I want to manipulate an audio stream before it gets to the soundcard. So I want to use the sAPOs from Microsoft to manipulate the audio stream in the audio engine (Vista audio architecture).
My basic question actually is which format the audio stream is in. I don't know, but I think it is WAVE or RIFF. Can anyone help me in this case? :)
Apparently, the format is negotiated.
Typically, most sound cards work with 16-bit signed integers representing linear PCM audio: [http://en.wikipedia.org/wiki/Linear_pulse_code_modulation]. However, this is not always the case (just typically). Generally, if your audio APIs are not already converting this 'raw' audio into a floating point representation, then you will need some code to do this, unless you are particularly fond of performing math on integers.
As Larry has already pointed out, many APIs will handle the floating point conversion for you and simply pass a buffer of floats; the convention is that they are values between -1 and 1.
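For example, turning a buffer of 16-bit signed PCM samples into that floating point convention is just a scale (a generic sketch, not specific to any particular Windows audio API):
```python
import numpy as np

# A buffer of 16-bit signed linear PCM samples, as many sound cards deliver.
pcm_int16 = np.array([0, 16384, -16384, 32767, -32768], dtype=np.int16)

# Scale into the conventional floating point range of roughly -1.0 .. 1.0.
pcm_float = pcm_int16.astype(np.float32) / 32768.0

print(pcm_float)  # approximately [0.0, 0.5, -0.5, 1.0, -1.0]
```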
Your APO tells the audio engine what input and output formats it supports, and the engine will give it whatever you tell it (that's not actually 100% accurate - it's roughly correct, and you need to read the APO documentation for full information).
The actual audio data will be whatever is specified; typically it will be 32-bit floating point samples with an amplitude between -1.0 and 1.0.
