Speech to Text on UWP apps using Audio File as Input

I am having trouble finding the answer to this question on the web.
The project I am developing requires that I save a recorded audio file and, after that, transcribe the audio to text in order to find predefined keywords of interest.
I am using the Windows.Media.SpeechRecognition framework, and it works fine when you are transcribing speech during the recording process. I can't find, in the same framework, a function that takes an audio file as input.
Does anybody know a good approach to this problem? Or another [free] framework for Windows apps?

For online recognition, and in particular in JS projects, you can use Microsoft Cognitive Services directly, which is the service behind online recognition in SpeechRecognition on Windows. It is free within certain limits.
In particular, there is an open-source wrapper for JavaScript on GitHub: Oxford.Speech.JS. It can handle both WAV files and the microphone. The sample code is structured as a website, but I'm pretty sure you can easily convert it into an HTML/JS-based UWP app.

Related

Implementing a TTS service for Windows 10

I'm working on a research project in which we are creating a new text-to-speech (TTS) engine that converts text to spoken audio.
As the engine is already performing well, we are trying to make it usable by a large number of applications, which is why we want the engine to show up as a TTS voice on Windows 10.
In Microsoft's developer documentation, all I found was information on how to use existing/already installed voices in my application. However, I didn't find anything on how to implement a voice so that it shows up as a Windows voice and can be used by any application that uses the Speech SDK or SAPI.
Which interface do I have to implement, or what API do I have to connect to, in order to get our new TTS engine to work with Windows Speech?
I have already crawled the documentation of the Microsoft Speech SDK as well as developer sites like https://learn.microsoft.com/en-us/dotnet/api/system.speech.synthesis.ttsengine
You should look at the TTS Engine Vendor Porting Guide. You need to implement ISpTTSEngine, which does all the work, and ISpObjectWithToken, which manages registration and creation.
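In case it helps to see the shape of it, here is a rough, untested skeleton of such an engine in ATL/C++. The class name and CLSID are placeholders, voice-token registration in the registry is omitted, and the Speak loop is only stubbed out; the exact registration steps are described in the porting guide.

// Skeleton of a SAPI TTS engine (ATL/C++). Placeholder names; registration omitted.
#include <sapi.h>
#include <sapiddk.h>    // ISpTTSEngine, ISpTTSEngineSite, SPVTEXTFRAG
#include <sphelper.h>   // SpConvertStreamFormatEnum
#include <atlbase.h>
#include <atlcom.h>

extern const CLSID CLSID_MyTTSEngine;   // your engine's CLSID (placeholder)

class ATL_NO_VTABLE CMyTTSEngine :
    public CComObjectRootEx<CComMultiThreadModel>,
    public CComCoClass<CMyTTSEngine, &CLSID_MyTTSEngine>,
    public ISpTTSEngine,
    public ISpObjectWithToken
{
public:
    BEGIN_COM_MAP(CMyTTSEngine)
        COM_INTERFACE_ENTRY(ISpTTSEngine)
        COM_INTERFACE_ENTRY(ISpObjectWithToken)
    END_COM_MAP()

    // ISpObjectWithToken: SAPI hands you the voice token created from your
    // registry entries, so you can read voice attributes and data files.
    STDMETHODIMP SetObjectToken(ISpObjectToken *pToken)
    {
        m_cpToken = pToken;
        return S_OK;
    }
    STDMETHODIMP GetObjectToken(ISpObjectToken **ppToken)
    {
        return m_cpToken.CopyTo(ppToken);
    }

    // ISpTTSEngine: SAPI calls this with the text fragments to synthesize.
    STDMETHODIMP Speak(DWORD dwSpeakFlags, REFGUID rguidFormatId,
                       const WAVEFORMATEX *pWaveFormatEx,
                       const SPVTEXTFRAG *pTextFragList,
                       ISpTTSEngineSite *pOutputSite)
    {
        for (const SPVTEXTFRAG *pFrag = pTextFragList; pFrag != nullptr; pFrag = pFrag->pNext)
        {
            // Run your synthesizer on pFrag->pTextStart / pFrag->ulTextLen,
            // write the generated audio with pOutputSite->Write(), and poll
            // pOutputSite->GetActions() for skip/abort/rate/volume requests.
        }
        return S_OK;
    }

    STDMETHODIMP GetOutputFormat(const GUID *pTargetFormatId,
                                 const WAVEFORMATEX *pTargetWaveFormatEx,
                                 GUID *pOutputFormatId,
                                 WAVEFORMATEX **ppCoMemOutputWaveFormatEx)
    {
        // Report the PCM format the engine actually produces.
        return SpConvertStreamFormatEnum(SPSF_16kHz16BitMono,
                                         pOutputFormatId, ppCoMemOutputWaveFormatEx);
    }

private:
    CComPtr<ISpObjectToken> m_cpToken;
};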

Simple effect processing in AudioKit example for MacOS files needed

I have a unique filter that I want to use while playing audio files on macOS, and I am trying to start from an AudioKit example in which I can just replace an existing effect with my own filtering code. However, I can't find an example that is close to what I want, and the effort required to do it from scratch is a bit too much. Is anyone aware of such examples that I could use? I've looked through all of the examples in AudioKit; they either just play audio files or have effects but assume microphone data. Thanks
In the Developer directory of AudioKit you'll find an example called "ExtendingAudioKit", which is probably your best template for setting up a project that uses AudioKit as the base while letting you develop custom DSP external to AudioKit.
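If it helps as a starting point, the per-sample DSP you would eventually drop into that template can be prototyped independently of AudioKit's wrapper classes. Below is a minimal, framework-agnostic one-pole low-pass filter in C++ (the struct name and buffer layout are just illustrative), roughly the kind of code you would wire into the custom-DSP slot of the ExtendingAudioKit project.

// Minimal one-pole low-pass filter, framework-agnostic C++.
#include <cmath>
#include <cstddef>

struct OnePoleLowpass {
    float a = 0.0f;   // feedback coefficient
    float z = 0.0f;   // one sample of filter state

    void setCutoff(float cutoffHz, float sampleRate) {
        a = std::exp(-2.0f * 3.14159265f * cutoffHz / sampleRate);
    }

    // In-place processing of a mono buffer:
    // y[n] = (1 - a) * x[n] + a * y[n-1]
    void process(float *buffer, std::size_t frames) {
        for (std::size_t i = 0; i < frames; ++i) {
            z = (1.0f - a) * buffer[i] + a * z;
            buffer[i] = z;
        }
    }
};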

Using Mac OSX Dictation with Speech API

In OSX Mavericks, speech dictation is now included, and it is very useful. I am trying to use the dictation capability to create my own digital life assistant, but I can't find out how to use the recognition functionality to get the recognized speech into an application rather than a text box.
I have looked into NSSpeechRecognizer, but that seems to be geared toward programming speakable commands with a pre-defined grammar rather than dictation. It doesn't matter what programming language I use, but Python or Java would be nice...
Thanks for your help!
You can use SFSpeechRecognizer (requires macOS 10.15+), which is made for speech recognition:
Perform speech recognition on live or prerecorded audio, receive transcriptions, alternative interpretations, and confidence levels of the results.
Whereas, as you have noted in the question, NSSpeechRecognizer indeed provides a "command and control" style of voice recognition system (the command phrases must be defined prior to listening, in contrast to a dictation system, where the recognized text is unconstrained).
For more details, see the WWDC session at https://developer.apple.com/videos/play/wwdc2019/256/.
Another way is to use Mac Dictation directly, but as far as I know the only option is to redirect audio feeds, which isn't very neat; see, e.g., http://www.showcasemarketing.com/ideablog/transcribe-mp3-audio-to-text-mac-os/.

Changing output audio device of other Win32 applications?

I want to write a program that allows you to select the output audio device (based on currently connected devices) used by other applications on an individual basis. (E.g. Winamp to my headphones, VLC to my speakers, etc).
The program would (probably) be written in C++ for Vista/7. Most likely I'll try to use the Windows sound APIs, but I'm not sure where to start, or whether the whole attempt is futile (seeing the answer here made me doubtful).
I'm not new to writing code, and this isn't a "please do my homework" request, but I am new to Windows code and was having trouble finding much documentation on anything like this.
Is this possible? Where would you start with this? Do you know of any projects that have already done this? Thanks in advance.

How do I read a video camera in a win32 C program

I have this garden variety USB video camera, and it came with two mini-apps, one that just lets you see what the camera sees, and one that records to an .avi file.
But what's the API if I want to grab images from the camera in my own C program? I am making the assumptions that it's (1) possible and (2) desirable to make some call and have a 2D array of pixel information filled in.
What I really want to do is tinker with image processing algorithms, and for that I'd really like to get my code around some live data.
EDIT -
Having had a healthy exposure to Linux, I can grasp how (ideally/in theory) you could open() the device, use ioctl() to configure it, and read() the data. And I'm virtually certain that that's not how Windows is going to present the API. Not knowing what function names Windows might use for a video device API, or even whether it has one, makes it difficult to look up, at least with the Win32 API search capabilities I have at my disposal.
You'll probably need the DirectShow API, provided that's how the camera operates. If the manufacturer created their own code path, you'll need their API.
Your first step, as pointed out by ChrisBD, is to check if Windows supports your device.
If that is the case, you have three possible Windows APIs for capture:
DirectShow
VFW (Video for Windows). It has more or less been replaced by DirectShow.
Media Foundation. The newest API, intended to replace DirectShow. AFAIK it is not fully implemented yet and is only available on Vista.
Of the three, DirectShow is the best choice. However, learning and using DirectShow is not a trivial task. An excellent example can be found here.
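To give a sense of what DirectShow code looks like, here is a minimal sketch (error handling omitted, names from the standard DirectShow headers) that just enumerates the video capture devices on the system; building the actual capture graph with ICaptureGraphBuilder2 and a sample grabber is the larger part of the work.

// Minimal DirectShow sketch: list the video capture devices.
#include <windows.h>
#include <dshow.h>
#include <cstdio>
#pragma comment(lib, "strmiids.lib")
#pragma comment(lib, "ole32.lib")
#pragma comment(lib, "oleaut32.lib")

int main()
{
    CoInitialize(nullptr);

    ICreateDevEnum *pDevEnum = nullptr;
    CoCreateInstance(CLSID_SystemDeviceEnum, nullptr, CLSCTX_INPROC_SERVER,
                     IID_ICreateDevEnum, reinterpret_cast<void **>(&pDevEnum));

    IEnumMoniker *pEnum = nullptr;
    if (pDevEnum->CreateClassEnumerator(CLSID_VideoInputDeviceCategory, &pEnum, 0) == S_OK)
    {
        IMoniker *pMoniker = nullptr;
        while (pEnum->Next(1, &pMoniker, nullptr) == S_OK)
        {
            IPropertyBag *pBag = nullptr;
            pMoniker->BindToStorage(nullptr, nullptr, IID_IPropertyBag,
                                    reinterpret_cast<void **>(&pBag));
            VARIANT name;
            VariantInit(&name);
            if (SUCCEEDED(pBag->Read(L"FriendlyName", &name, nullptr)))
                wprintf(L"Capture device: %s\n", name.bstrVal);
            VariantClear(&name);
            pBag->Release();
            pMoniker->Release();
        }
        pEnum->Release();
    }

    pDevEnum->Release();
    CoUninitialize();
    return 0;
}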
Another possibility is to use OpenCV. OpenCV is an image processing library, which you can also use later for processing the captured frames. It has an image capture API that provides a simpler abstraction and is easier to use than the Windows APIs.
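For comparison, a minimal capture loop with OpenCV's C++ API looks roughly like this (modern OpenCV; older releases exposed a C API such as cvCaptureFromCAM instead). Each cv::Mat frame is effectively the "2D array of pixel information" you are after, in BGR order by default.

// Minimal OpenCV capture loop.
#include <opencv2/opencv.hpp>

int main()
{
    cv::VideoCapture cap(0);          // open the first camera
    if (!cap.isOpened())
        return 1;

    cv::Mat frame;
    while (true)
    {
        cap >> frame;                 // grab the next frame
        if (frame.empty())
            break;

        // ... run your image-processing experiments on the pixel data in frame ...

        cv::imshow("camera", frame);
        if (cv::waitKey(1) == 27)     // Esc to quit
            break;
    }
    return 0;
}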
The API is the way to go.
A good indication of whether the camera requires a bespoke API is to see whether it is recognised by a PC without the manufacturer's applications installed. If Windows has the drivers built in, then you should be able to use the Windows APIs to capture the images.
Alternatively, if you know which compression codec has been used for the AVI file, you could unpack it.
Ideally you would capture the video in a native format (YUV, RGB15, or similar), as you can then work on compression as well as manipulation.
