How to match times in audio to text

I have a bunch of audio files and the text that corresponds to the speech in the audio files. I need to be able to find the specific times in the audio where each word starts. I'm trying to use the C# SpeechRecognizer library, but I can't find a way to get time data for the words it recognizes. Is it possible with this library? Is there a different tool/library that can do this?

I use CMU Sphinx for this. (Java though, not sure if they provide a C# API.) It lets me access the internals of the recognition result, where I can find the start/stop times of each word.
https://cmusphinx.github.io/
https://cmusphinx.github.io/wiki/tutorialsphinx4/
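If a Python bridge is an option, the pocketsphinx bindings expose the same kind of per-word timing. Below is a minimal sketch using the classic pocketsphinx Python API (the newer 5.x API differs slightly); the model/dictionary paths and the wav filename are placeholders, and the audio has to be 16 kHz, 16-bit mono PCM:

from pocketsphinx import Decoder

# Placeholder paths: point these at your acoustic model, language model
# and pronunciation dictionary.
config = Decoder.default_config()
config.set_string('-hmm', 'model/en-us')
config.set_string('-lm', 'model/en-us.lm.bin')
config.set_string('-dict', 'model/cmudict-en-us.dict')

decoder = Decoder(config)
decoder.start_utt()
with open('speech.wav', 'rb') as f:
    f.read(44)  # skip the RIFF header; the rest is raw 16 kHz 16-bit mono PCM
    while True:
        buf = f.read(4096)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
decoder.end_utt()

# Each segment carries frame indices; at the default 100 frames per second,
# dividing by 100 gives the time in seconds.
for seg in decoder.seg():
    print(seg.word, seg.start_frame / 100.0, seg.end_frame / 100.0)

Since you already have the reference text, you could also look into forced alignment (aligning a known transcript to the audio) rather than free recognition; that is usually more accurate when the transcript is already known.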

Related

Returnn Switchboard data processing

Could anybody give me pointers on how to process the Switchboard dataset for training with RETURNN? I did see the BlissDataset class that seems to be designed for Switchboard, but it's not clear to me what I should include in the paths given in the example:
Example:
./tools/dump-dataset.py "
{'class':'BlissDataset',
'path': '/u/tuske/work/ASR/switchboard/corpus/xml/train.corpus.gz',
'bpe_file': '/u/zeyer/setups/switchboard/subwords/swb-bpe-codes',
'vocab_file': '/u/zeyer/setups/switchboard/subwords/swb-vocab'}"
The Switchboard dataset has several folders with audio, i.e. swb1_d2/data/*.sph, and transcripts under swb1_LDC97S62/swb_ms98_transcriptions/**/*.
I'm not quite sure how to proceed with this to get a dataset that can be used to train RETURNN.
At our group (RWTH Aachen University), we use the config as it was published on GitHub. As you can see, that one uses ExternSprintDataset.
That dataset uses Sprint (publicly called RWTH ASR (RASR), see here) as an external tool (run in a subprocess) to handle the data (feature extraction, etc.). Sprint gets a Bliss XML file which describes all the segments, with paths to the audio, audio offsets, and transcriptions, and it also gets further configs for the feature extraction and maybe other things. There is an open source version of RASR which should work, but it might be a bit involved to get it running.
The BlissDataset was planned to be a simpler replacement for that. However, the implementation is incomplete. Also, you would still need to generate the Bliss XML yourself in some way (we used some internal scripts to prepare that based on the official LDC data).
So, unfortunately, there is no simple way yet. Actually, I think the easiest way would be to come up with yet another custom format, which might be similar to the LibriSpeechDataset implementation, or maybe even the same, so that you could reuse LibriSpeechDataset, or at least parts of it. That dataset implementation takes the data in a zip format which contains the transcripts in txt files and the audio in ogg or wav files. It uses librosa to do MFCC feature extraction (or other feature types). I planned to implement that for Switchboard and then reproduce the results, but I have not had time yet and am not sure when I will get to it. If you want to try it on your own, I will be happy to help however I can. The starting point would be to look at LibriSpeechDataset and understand what its format looks like.
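To make that last suggestion slightly more concrete, here is a very rough sketch of packaging per-utterance wav files plus transcripts into a zip. The directory prefix, utterance ids and the .trans.txt naming are only assumptions loosely modeled on how LibriSpeech ships its data; the layout LibriSpeechDataset actually expects has to be taken from its implementation:

import zipfile

# Rough sketch only. The .sph files would first have to be converted to wav
# (e.g. with sph2pipe or sox) and cut into per-utterance segments using the
# times in the swb_ms98 transcription files.
utterances = [
    # (utterance id, path to the extracted wav segment, transcription)
    ("sw02001A-0001", "segments/sw02001A-0001.wav", "example transcription here"),
]

with zipfile.ZipFile("train.zip", "w") as zf:
    lines = []
    for utt_id, wav_path, text in utterances:
        zf.write(wav_path, arcname="train/%s.wav" % utt_id)
        lines.append("%s %s" % (utt_id, text))
    zf.writestr("train/train.trans.txt", "\n".join(lines) + "\n")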

Music data format for polyphonic music visualization with Processing

I am interested in visualizing melodic contours of polyphonic music with Processing. It is still unclear to me, though, what the most convenient format for the imported data (pitch and onset/duration) would be: tabular (e.g. Humdrum), XML (e.g. MEI, MusicXML), or JSON? Maybe another format?
Any suggestions/thoughts on this would be really helpful! Thanks.
Using MIDI files would be optimal, because of the combination of these four reasons:
1. MIDI is widely used. You can export a .midi file from practically any score editor, plus you can create your own by recording the input from a MIDI instrument.
2. You can already find .midi files of iconic polyphonic music on the web (Bach's counterpoint, Renaissance vocal music, etc.).
3. It contains just music/playback information. It doesn't contain notation information like MusicXML does. So if you just want to see pitches and note positions/durations (like in this video), then .midi will contain just what you need.
4. You can use the Java MIDI package in Processing, and it already contains everything you need to read MIDI files.
While other formats might also satisfy 1, 2, 3 or 4, only MIDI satisfies all of them.
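If you want to first see what data a MIDI file actually yields before wiring it into Processing, a quick script can help. This sketch uses the third-party mido library in Python purely as an illustration (the filename is a placeholder); inside Processing you would read the same events with the Java MIDI classes mentioned above:

import mido  # third-party: pip install mido

mid = mido.MidiFile("example.mid")

now = 0.0
for msg in mid:        # iterating a MidiFile merges all tracks
    now += msg.time    # msg.time is seconds since the previous event
    if msg.type == "note_on" and msg.velocity > 0:
        # MIDI note number (60 = middle C) and onset time in seconds
        print(msg.note, round(now, 3))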
The best answer I can give you is that you should put together a simple hello world program that tests out each format and see which one you like the best.
In the end, you're the one that has to deal with the code, so only you can really decide on the best format.

Censor Plugin or Extension for VLC Media Player

I have an idea to create a censor plugin/extension for VLC Media Player.
Problem scenario:
A one-minute adult scene in an otherwise nice movie makes it unwatchable with family.
My solution:
Create a plugin/extension which does the following:
Reads time positions from a file, similar to subtitle files
Skips these time positions (which are adult or inappropriate) during playback
Help I need:
I searched Google and the VideoLAN website, but couldn't find an exact solution.
Are there similar plugins already available?
Where should I start?
Please help if you can, guys. Thanks.
I'm looking to have/develop exactly the same solution. This might be helpful to you:
http://code.google.com/p/movie-content-editor/
A similar thing is also available on GitHub:
https://github.com/rdp/sensible-cinema
You may also want to read this discussion thread:
https://forum.videolan.org/viewtopic.php?t=89466
I found a great similar answer there.
If you chop random bytes out, the movie will likely not be playable. The player might crash or fail to resynchronize the stream, and the video might just stop. Plus, you're going to have a hard time figuring out where the "adult" bytes are, so to speak.
If you already know where the parts are that you want to cut out, I would edit the file in any of the numerous video editors. Even Windows Movie Maker or iMovie would do the job, and those are easily available on both major OSes.
This is a requested feature for VLC, and there isn't really anything user-friendly out there yet. Still, VLC lets you create playlists in a format called XSPF that can mute or skip parts of a file. You might be able to figure out the proper format for this.
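As a starting point, here is a sketch that generates such a playlist with one track per segment you do want to play, skipping everything in between. The file path and scene times are placeholders, and the vlc:option values simply mirror VLC's normal start-time / stop-time input options; I have not verified the exact XSPF namespaces and option handling against a current VLC build, so treat this as an assumption to test:

# Turn a list of (start, end) times to skip, in seconds, into an XSPF playlist.
movie = "file:///path/to/movie.mp4"
skip = [(600, 660)]  # e.g. skip the scene from 10:00 to 11:00

segments, pos = [], 0
for start, end in skip:
    segments.append((pos, start))
    pos = end
segments.append((pos, None))  # from the last cut to the end of the file

tracks = []
for start, end in segments:
    opts = "    <vlc:option>start-time=%d</vlc:option>\n" % start
    if end is not None:
        opts += "    <vlc:option>stop-time=%d</vlc:option>\n" % end
    tracks.append(
        "  <track>\n"
        "   <location>%s</location>\n"
        "   <extension application=\"http://www.videolan.org/vlc/playlist/0\">\n"
        "%s"
        "   </extension>\n"
        "  </track>\n" % (movie, opts)
    )

with open("censored.xspf", "w") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write('<playlist version="1" xmlns="http://xspf.org/ns/0/"\n'
            '          xmlns:vlc="http://www.videolan.org/vlc/playlist/ns/0/">\n')
    f.write(" <trackList>\n%s </trackList>\n</playlist>\n" % "".join(tracks))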
Also, there's movie-content-editor:
A VLC based editor built in python that allows users to create and use custom filter files to make movies more family friendly. Allows users to have the player automatically mute specific words or skip certain scenes based on the content of those scenes.
And sensible-cinema:
Clean Editing Movie Player allows you to watch edited movies by applying delete lists (EDLs), i.e. "mute out" or "cut out" scenes, to DVDs/files, with preliminary support for also applying them to arbitrary web/internet-based players like Netflix Instant, Hulu/Hulu Plus, etc.
See also these threads on The VideoLAN Forums:
auto skip unwanted parts of a video
Clearplay-like (content filter) module exists?

How to perform audio file analysis with Cocoa?

My aim is to perform analysis (like a DFT) on an audio file (mp3).
So:
my input is a file
and my output is the result of that processing
I would like to use the QTKit framework for this, but I am a bit disappointed:
QTMovie can open a file, but I don't see how to access the decompressed audio buffer
a QTSampleBuffer can be processed with QTCaptureDecompressedAudioOutput, but I can't find how to open a file (the only input seems to be QTCaptureDeviceInput)
Is there a way to do what I want with QTKit, or should I use Core Audio (or something else), which will be more difficult (and I prefer Objective-C to C or C++)?
(Actually I have no code yet; I am just trying to find the right approach, and it's the first time I've worked with sound...)
QTKit won't let you do that. You'll have to use Core Audio. You could always take a look at this code (which is written for the iPhone but most of the code works on Mac OS X) to understand everything a bit more. It detects frequency using FFT.
I also was afraid of using Core Audio, but in the end it all worked out pretty well.
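For orientation, the analysis step itself is easy to prototype before committing to Core Audio. Here is a minimal sketch in Python with numpy/scipy (reading a wav instead of an mp3, since scipy does not decode mp3; the filename is a placeholder) that finds the dominant frequency in one short window; a Cocoa version would do the same math on buffers pulled out of Core Audio:

import numpy as np
from scipy.io import wavfile

rate, samples = wavfile.read("speech.wav")
if samples.ndim > 1:        # mix stereo down to mono
    samples = samples.mean(axis=1)

n = 4096                    # one short analysis window (assumes the file has at least n samples)
window = samples[:n] * np.hanning(n)
spectrum = np.abs(np.fft.rfft(window))
freqs = np.fft.rfftfreq(n, d=1.0 / rate)

print("dominant frequency: %.1f Hz" % freqs[spectrum.argmax()])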

Use NSSpeechRecognizer or alternative with audio file instead of microphone input?

Is it possible to use NSSpeechRecognizer with a pre-recorded audio file instead of direct microphone input?
Or is there any other speech-to-text framework for Objective-C/Cocoa available?
Added:
Rather than using the voice at the machine that is running the application, external devices (e.g. an iPhone) could be used to send just a recorded audio stream to that desktop application. The desktop Cocoa app would then process it and do whatever it's supposed to do using the assigned commands.
Thanks.
I don't see any obvious way to switch the input programmatically, though the "Speech" companion guide's first paragraph in the "Recognizing Speech" section seems to imply other inputs can be used. I think this is meant to be set via System Preferences, though. I'm guessing it uses the primary audio input device selected there.
I suspect, though, that you're looking for open-ended speech recognition, which NSSpeechRecognizer is not. If you're looking to transform arbitrary pre-recorded audio into text (i.e., make a transcript of a recording), you're completely out of luck with NSSpeechRecognizer, as you must give it an array of "commands" to listen for.
Theoretically, you could feed it the whole dictionary, but I don't think that would work since you usually have to give it clear, distinct commands. Its performance would suffer, I would guess, if you gave it a bunch of stuff to analyze for (in real time).
Your best bet is to look at third-party open source solutions. There are a few generalized packages out there (none specifically for Cocoa/Objective-C), but this poses another question: what kind of recognition are you looking for? The two main forms of speech recognition are 'trained' (more accurate, but less flexible with different voices and recording environments) and 'open' (generally much less accurate).
It'd probably be best if you stated exactly what you're trying to accomplish.

Resources