Is it possible to detect if any sound plays on a windows xp machine? Help in any language would be useful. I basically need to write a program that runs all the time and outputs some text to a file whenever a sound plays. I don't need any specific information about the sound, just whether a sound is playing. I don't care whether the speakers are actually powered on or anything like that.
The question was easy, but the answer is difficult. You'll need to use DirectSound to achieve this. I haven't tested my solution yet, but you can try calling IDirectSoundBuffer8::GetStatus() and then checking the value returned in the pdwStatus parameter. According to MSDN, DSBSTATUS_PLAYING is set if the buffer is being heard.
Since you didn't mention which programming language you're using, I've implemented the following example in my favorite language, Delphi.
var
  dwStatus: DWORD;
  hResult: HRESULT;
begin
  // DSBuffer8 is assumed to be an IDirectSoundBuffer8 obtained elsewhere.
  // Depending on your DirectSound header translation, GetStatus may be
  // declared with an out parameter instead of a pointer.
  hResult := DSBuffer8.GetStatus(@dwStatus);
  if hResult = DS_OK then
  begin
    if (dwStatus and DSBSTATUS_PLAYING) <> 0 then
      ShowMessage('Sound card is playing sound now.');
  end;
end;
UPDATE
I just found a VB forum thread discussing how to detect silence (no output from the sound card). Download DetSilence.zip. In the DXRecord_GotWavData Sub, modify the constants SilencePercent and NonSilencePercent to the values you need.
I ended up approaching this in an unconventional manner. First I installed Virtual Audio Cable (http://www.ntonyx.com/vac.htm) and configured it as my primary sound device. I then configured the recording device to record the sound from the primary output device. This basically means I can hit "record" and it will record anything going to the sound card. Then I used a Perl module, Win32::SoundRec, to record sound to a file. I periodically check the WAV file for activity, and if there is some, I know sound was playing. I used another Perl module, Audio::Wav, to parse the WAV file and look for activity (silence vs. non-silence).
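The activity check itself is simple: scan the PCM samples and see whether anything rises above a noise threshold. Here is a rough sketch of that check in C++ (not the Perl code I actually used; it skips a fixed 44-byte header, which only holds for simple canonical WAV files, and the threshold value is arbitrary):

#include <cstdio>
#include <cstdint>
#include <cstdlib>

// Returns true if any 16-bit PCM sample in the file exceeds the threshold.
bool hasActivity(const char *path, int threshold = 500)
{
    FILE *f = fopen(path, "rb");
    if (!f) return false;
    fseek(f, 44, SEEK_SET);            // skip canonical RIFF/fmt header (assumption)

    int16_t sample;
    bool active = false;
    while (fread(&sample, sizeof(sample), 1, f) == 1)
    {
        if (abs(sample) > threshold)   // anything louder than the noise floor?
        {
            active = true;
            break;
        }
    }
    fclose(f);
    return active;
}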
Related
Going through Google search results, there appears to be no widely known way to capture audio from a specific application on Microsoft Windows, at least without resorting to workarounds such as sending the audio from one process to a separate virtual audio loopback device (which means you can no longer hear the sound, unless you either use a hardware loopback playback device or "listen" to the emulated input via the main output).
These workarounds are clunky, require configuration for each specific application, and applications will often misbehave, stop producing sound, or stop working outright if their output device is changed while they are running. Meanwhile, launching a Discord "Live Streaming" session lets you easily and reliably share a single application's sound with a VoIP group call. Sound from other applications is completely excluded. Looking at the audio devices, it appears that no virtual loopback routing is taking place, and there is absolutely zero interruption in audio playback on the client side. The functionality isn't available in the macOS or Linux versions of the software, only on Windows.

Capturing sound from a specific process is thus possible in Win32, but why isn't anyone else doing this? What would it take, say, to implement something like this in a fork of software where such functionality would be extremely useful, like OBS or Audacity?
EDIT: Not sure if this is useful at all, but I found this page: https://obsproject.com/forum/threads/audio-sources.465/
In particular, this strikes me as useful information:
It's quite similar to hooking Direct3D. You hook the IAudioRenderClient interface, and intercept GetBuffer to read the audio samples.
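From that description, my understanding of the idea is roughly the following (a completely untested sketch; MinHook is used purely as an example detour library, and none of these names are taken from Discord's actual code, only the IAudioRenderClient interface itself):

#include <windows.h>
#include <audioclient.h>
#include "MinHook.h"

// Signature of IAudioRenderClient::GetBuffer, with the implicit 'this' made explicit.
typedef HRESULT (STDMETHODCALLTYPE *GetBuffer_t)(IAudioRenderClient*, UINT32, BYTE**);
static GetBuffer_t RealGetBuffer = nullptr;

static HRESULT STDMETHODCALLTYPE HookedGetBuffer(IAudioRenderClient *self, UINT32 frames, BYTE **data)
{
    HRESULT hr = RealGetBuffer(self, frames, data);
    // *data now points at the buffer the target app is about to fill with PCM
    // samples; a real capture hook would also intercept ReleaseBuffer and copy
    // the samples out once the application has written them.
    return hr;
}

// Installs the hook, given any IAudioRenderClient created inside the target process.
static bool InstallHook(IAudioRenderClient *anyRenderClient)
{
    // Vtable layout: QueryInterface, AddRef, Release, GetBuffer, ReleaseBuffer.
    void **vtbl = *reinterpret_cast<void***>(anyRenderClient);
    if (MH_Initialize() != MH_OK) return false;
    if (MH_CreateHook(vtbl[3], reinterpret_cast<void*>(&HookedGetBuffer),
                      reinterpret_cast<void**>(&RealGetBuffer)) != MH_OK) return false;
    return MH_EnableHook(vtbl[3]) == MH_OK;
}

Getting that code to run inside another process (DLL injection) and creating a dummy render client just to locate the vtable are separate problems, but the interception step itself seems to be exactly what the quote describes.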
Beginner's reverse engineering time!
I cannot give a definitive answer, but I can point you in the right direction.
Discord has a directory called \modules\discord_hook inside its root directory. In it we find a JavaScript file named index.js, a JSON file named manifest.json, a .node file named discord_hook.node (which is compiled and which I cannot read), and a subdirectory containing .dlls and .exes; Discord also generates a log file there named hook.log.
index.js appears to just load discord_hook.node and do some other things that aren't important to us.
Googling manifest.json brings me here: https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/manifest.json
The manifest.json file is the only file that every extension using WebExtension APIs must contain.
In the .json file, we find it is referencing the .exes, .dlls, discord_hook.node, index.js, and itself.
The .node file, as previously mentioned, is for the most part not human-readable.
hook.log doesn't output anything seemingly helpful, just stuff about the graphics/video share.
This leaves us with looking inside the .exe and .dll files in that subdirectory. I have no knowledge of asm, but we can look at some of the strings left inside these binaries.
I found a section of strings referencing audio at offsets 1266B4 to 126EA6 in DiscordHook.dll (this may, and almost certainly will, change in future versions of Discord).
Here are some of the strings that seem to be worth posting here.
Audio buffer stopped, WASAPI capture stopping
Failed to get format of WASAPI audio buffer, not capturing, error code [%d]
Failed to get WASAPI audio client from render client, not capturing
Starting capture of WASAPI buffer with sample rate %d, depth %d, %d channels
Starting capture of Windows Sonic stream with downmix sample rate %d, depth %d, %d channels
ISpatialAudioObjectRenderStream::Stop
ISpatialAudioObjectRenderStream::BeginUpdatingAudioObjects
ISpatialAudioObjectRenderStream::EndUpdatingAudioObjects
ISpatialAudioObject::GetBuffer
HookWasapi failed to load audioses.dll
WaveFormatFromRenderClient failed with error code [%d]
LoadWASAPIOffsets failed with error code [%d]
WASAPI module sizes don't match (expected: %lu, actual: %lu)
WASAPI offsets invalid (stop: %lu, getBuffer: %lu, releaseBuffer: %lu, clientOffset: %lu, endpointOffset: %lu)
WASAPI offsets out of bounds (size: %lu, stop: %lu, getBuffer: %lu, releaseBuffer: %lu)
IAudioClient::Stop
IAudioRenderClient::GetBuffer
IAudioRenderClient::ReleaseBuffer
HookWasapi: MH_ApplyQueued failed 0x%x
Also, I googled "hook process audio" and found this:
https://ywjheart.wordpress.com/2017/02/26/audio-captureapihook-based-for-obs-studio/
It doesn't give any code examples or downloads, but it describes how to do this very thing in OBS, and links the references they used at the bottom.
Good luck, I hope all this information can help in some way!
OK, the first issue. I am trying to write a virtual soundboard that will output to multiple devices at once. I would prefer OpenAL for this, but if I have to switch over to MS libs (I'm writing this initially on Windows 7) I will.
Anyway, the idea is that you have a bunch of sound files loaded up and ready to play. You're on Skype, and someone fails in a major way, so you hit the play button on the Price is Right fail ditty. Both you and your friends hear this sound at the same time, and have a good laugh about it.
I've gotten OAL to the point where I can play on the default device, and selecting a device at this point seems rather trivial. However, from what I understand, each OAL device needs its context to be current in order for the buffer to populate/propagate properly. Which means, in a standard program, the sound would play on one device, then the context would be switched, and the sound buffered and played on the second device.
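Here is roughly what I have in mind, as an untested sketch (the second device name is a placeholder, and the buffer-loading calls are elided):

#include <AL/al.h>
#include <AL/alc.h>

int main()
{
    ALCdevice  *devA = alcOpenDevice(NULL);              // default output device
    ALCdevice  *devB = alcOpenDevice("Second Device");   // placeholder device name
    ALCcontext *ctxA = alcCreateContext(devA, NULL);
    ALCcontext *ctxB = alcCreateContext(devB, NULL);

    ALuint srcA, srcB, bufA, bufB;

    alcMakeContextCurrent(ctxA);          // everything below targets device A
    alGenBuffers(1, &bufA);
    alGenSources(1, &srcA);
    // ... alBufferData(bufA, ...); alSourcei(srcA, AL_BUFFER, bufA); ...
    alSourcePlay(srcA);

    alcMakeContextCurrent(ctxB);          // switch: the same calls now target device B
    alGenBuffers(1, &bufB);
    alGenSources(1, &srcB);
    // ... same sound data loaded again for this context ...
    alSourcePlay(srcB);

    // Both sources keep playing on their own; the current context only
    // determines which device subsequent AL calls address.
    return 0;
}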
Is this possible at all, with any audio library? Would threads be involved, and would those be safe?
Then, the next problem is, in order for it to integrate seamlessly with end-user setups, it would need to be able to either output to the default recording device, or intercept the recording device, mix it with the sound, and output it as another playback device. Is either of these possible, and if both are, which is more feasible? I think it would be preferable to be able to output to the recording device itself, as then the program wouldn't have to be running in order to have the microphone still work for calls.
If I understood correctly, there are mainly two questions here.
Is it possible to play a sound on two or more audio output devices simultaneously, and how to achieve this?
Is it possible to loop data back through an audio input (recording) device so that it is played on the respective monitor, i.e., in your case, sent through Skype's audio stream to your partner?
Answer to 1: This is absolutely feasible; all independent audio outputs of your system can play sounds simultaneously. For example, some professional audio interfaces (for music production) have 8, 16, or 64 independent outputs, all of which can play sound simultaneously. That means each output device maintains its own buffer that it consumes independently (apart from concurrency on eventual shared memory used to feed the buffer).
How?
Most audio frameworks / systems provide functions to get a "device handle", which require you to pass a callback for feeding the buffer with samples (Open AL does this, for example). The callback will be called independently and asynchronously by the framework / system (ultimately the audio device driver(s)).
Since this all works asynchronously, you don't necessarily need multi-threading here. All you need to do in principle is maintain two (or more) audio output device handles, each with a separate buffer-consuming callback, to feed the two (or more) separate devices.
Note: You can also play several sounds on one single device. Most devices / systems allow this kind of resource sharing. In fact, that is one purpose sound cards are made for: to mix together all the sounds produced by the various programs (and hence take that heavy burden off the CPU). When you use one (physical) device to play several sounds, the concept is the same as with multiple devices: for each sound you get a logical device handle, only those handles refer to several "channels" of one physical device.
What should you use?
Open AL seems a little like using heavy artillery for this simple task, I would say (since you don't need that much portability, and probably don't plan to implement your own codecs and effects ;) )
I would recommend using Qt here. It is highly portable (Win/Mac/Linux) and it has a very handy class that will do the job for you: http://qt-project.org/doc/qt-5.0/qtmultimedia/qaudiooutput.html
Check the example in the documentation to see how to play a WAV file with a couple of lines of code. To play several WAV files simultaneously, you simply have to open several QAudioOutput instances (basically put the code from the example in a function and call it as often as you want). Note that you have to close / stop the QAudioOutput in order for the sound to stop playing.
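A rough sketch of the idea (Qt 5, untested; the file name is a placeholder and the format values assume a plain 16-bit stereo 44.1 kHz WAV):

#include <QCoreApplication>
#include <QAudioOutput>
#include <QAudioDeviceInfo>
#include <QAudioFormat>
#include <QFile>

static QAudioOutput* playOn(const QAudioDeviceInfo &device, const QString &wavPath)
{
    QAudioFormat format;              // format of the raw samples in the WAV body
    format.setSampleRate(44100);
    format.setChannelCount(2);
    format.setSampleSize(16);
    format.setCodec("audio/pcm");
    format.setByteOrder(QAudioFormat::LittleEndian);
    format.setSampleType(QAudioFormat::SignedInt);

    QFile *file = new QFile(wavPath);
    file->open(QIODevice::ReadOnly);  // raw read: the WAV header adds a short click at most

    QAudioOutput *out = new QAudioOutput(device, format);
    out->start(file);                 // asynchronous: returns immediately, the device pulls samples
    return out;
}

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);
    // Start playback on every available output device simultaneously.
    for (const QAudioDeviceInfo &dev :
         QAudioDeviceInfo::availableDevices(QAudio::AudioOutput))
        playOn(dev, "fail_ditty.wav");
    return app.exec();
}

Each QAudioOutput pulls from its own QIODevice asynchronously, so no explicit threading is needed; calling stop() on an instance (or deleting it) ends its playback.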
Answer to 2: What you want to do is called a loopback. Only a very limited number of sound cards (i.e., audio devices) provide a so-called loopback input device, which would permit recording what is currently output by the main output mix of the sound card, for example. However, even with this kind of device available, it will not permit you to loop anything back into the microphone input device. The microphone input device only takes data from the microphone's A/D converter. This is deep in the hardware; you cannot mix anything in at your level there.
That said, it will be very, very hard (IMHO practically impossible) to have Skype send your sound to your conversation partner with a standard setup. The only thing I can think of would be having an audio device with loopback capabilities (or simply a physical cable connecting a monitor line-out to any recording line-in), and then setting Skype up to use this looped-back device as its input. However, Skype will then no longer pick up your microphone, hence you won't have a conversation ;)
Note: When saying "simultaneous" playback here, we are talking about synchronizing the playback of two sounds as far as real-time perception is concerned (within roughly 10-20 ms). We are not looking at actual synchronization at the sample level, and the related clock-jitter and phase-shift issues that come into play when sending sound to two physical devices with two independent (free-running) clocks. Thus, when the application demands in-phase signal generation on independent devices, clock-recovery mechanisms are necessary, which may be provided by the drivers or the OS.
Note: Virtual audio device software such as Virtual Audio Cable will provide virtual devices to achieve loopback functionality in Windows. Frameworks such as JACK Audio may achieve the same in a UNIX-like environment.
There is a very easy way to output audio on two devices at the same time:
For Realtek devices you can use the Audio-mixer "trick" (but this will give you a delay / echo);
For everything else (and without echo) you can use Voicemeeter (which is totally free).
I have explained BOTH solutions in this video: https://youtu.be/lpvae_2WOSQ
Best Regards
I am trying to find out which output formats are supported by a specific audio device in exclusive mode.
To do this, I am using IAudioClient->IsFormatSupported(), which according to the documentation should be usable for this.
Unfortunately, it returns AUDCLNT_E_UNSUPPORTED_FORMAT for almost every format I try to pass, except for the default 2-channel, 44.1 kHz audio.
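Roughly what the check looks like (a simplified sketch; the IAudioClient comes from IMMDevice::Activate, which is omitted here, and the format values are just one example):

WAVEFORMATEX fmt = {};
fmt.wFormatTag      = WAVE_FORMAT_PCM;
fmt.nChannels       = 2;
fmt.nSamplesPerSec  = 48000;
fmt.wBitsPerSample  = 16;
fmt.nBlockAlign     = fmt.nChannels * fmt.wBitsPerSample / 8;
fmt.nAvgBytesPerSec = fmt.nSamplesPerSec * fmt.nBlockAlign;

// In exclusive mode the closest-match out-parameter must be NULL.
HRESULT hr = audioClient->IsFormatSupported(AUDCLNT_SHAREMODE_EXCLUSIVE, &fmt, NULL);
if (hr == AUDCLNT_E_UNSUPPORTED_FORMAT)
{
    // This is what I get for nearly every format I try.
}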
However, if I actually try to initialize the audio client, some formats succeed that previously failed in IsFormatSupported().
Just trying to Initialize every format is not an option because this could result in stopping the audio from other applications.
Has anyone else seen this behavior or know if there is another way to find which formats are supported by a specific audio device?
I have seen this behavior as well. It seems like IsFormatSupported will only accept what is marked as 'supported' in the playback device settings in Windows, but Initialize seems to actually end up asking the drivers if it's indeed possible.
In my specific situation, I have a Xonar HDAV1.3 set up to use HDMI as output. Two playback devices are always available: Speakers and S/PDIF Pass-through Device. If I try, for example, to request 6 channels for the S/PDIF playback device, IsFormatSupported will reject it (in theory, S/PDIF only supports 2, and that's all I can see in the settings), but calling Initialize will succeed and work (it goes out over HDMI after all, for which 6 channels are supported). Talk about misleading device names!
I'm afraid there's no real practical way to work around this issue.
Is there an API that is suitable for reading the audio data that the system is currently playing? A possible application of this is writing a visualiser, or playing with real-time signal processing.
EDIT: The operating system in question is Windows. On Linux, a roundabout way to accomplish this is with Jack, but I'm hoping for a way to read the data in the audio buffer without having to couple apps to Jack.
EDIT: A good answer is found here.
If the sound board used for playback has a recording device/line such as "Stereo Mix", "What U Hear", etc., then it is enough to write a simple recording application that can record from a specified recording device/line and record from that "Stereo Mix" line.
The general case (for "all sound boards") will require writing a special driver. Examples of applications with such special drivers: Virtual Audio Cable (http://software.muzychenko.net/eng/vac.html); Total Recorder (http://www.totalrecorder.com).
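For the "Stereo Mix" case, a minimal sketch of such a recording application using the classic waveIn API might look like this (untested; matching the line by its display name is an assumption, and the buffer-queuing loop is only hinted at):

#include <windows.h>
#include <mmsystem.h>
#include <cstring>
#pragma comment(lib, "winmm.lib")

int main()
{
    UINT numDevs = waveInGetNumDevs();
    for (UINT i = 0; i < numDevs; ++i)
    {
        WAVEINCAPSA caps;
        if (waveInGetDevCapsA(i, &caps, sizeof(caps)) != MMSYSERR_NOERROR)
            continue;
        if (strstr(caps.szPname, "Stereo Mix") == NULL)   // name match is an assumption
            continue;

        WAVEFORMATEX fmt = {};
        fmt.wFormatTag      = WAVE_FORMAT_PCM;
        fmt.nChannels       = 2;
        fmt.nSamplesPerSec  = 44100;
        fmt.wBitsPerSample  = 16;
        fmt.nBlockAlign     = fmt.nChannels * fmt.wBitsPerSample / 8;
        fmt.nAvgBytesPerSec = fmt.nSamplesPerSec * fmt.nBlockAlign;

        HWAVEIN hIn;
        if (waveInOpen(&hIn, i, &fmt, 0, 0, CALLBACK_NULL) == MMSYSERR_NOERROR)
        {
            // From here, prepare and queue buffers with waveInPrepareHeader /
            // waveInAddBuffer and call waveInStart to receive whatever the
            // sound card is currently playing back.
            waveInClose(hIn);
        }
    }
    return 0;
}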
Is it possible to use the NSSpeechRecognizer with a pre-recorded audio file instead of direct microphone input?
Or is there any other speech-to-text framework for Objective-C/Cocoa available?
Added:
Rather than using voice input on the machine that is running the application, external devices (e.g. an iPhone) could be used to send just a recorded audio stream to that desktop application. The desktop Cocoa app would then process it and do whatever it's supposed to do using the assigned commands.
Thanks.
I don't see any obvious way to switch the input programmatically, though the "Speech" companion guide's first paragraph in the "Recognizing Speech" section seems to imply other inputs can be used. I think this is meant to be set via System Preferences, though. I'm guessing it uses the primary audio input device selected there.
I suspect, though, you're looking for open-ended speech recognition, which NSSpeechRecognizer is not. If you're looking to transform any pre-recorded audio into text (ie, make a transcript of a recording), you're completely out of luck with NSSpeechRecognizer, as you must give it an array of "commands" to listen for.
Theoretically, you could feed it the whole dictionary, but I don't think that would work, since you usually have to give it clear, distinct commands. Its performance would suffer, I would guess, if you gave it that much material to analyze in real time.
Your best bet is to look at third-party open source solutions. There are a few generalized packages out there (none specifically for Cocoa/Objective-C), but this poses another question: what kind of recognition are you looking for? There are two main forms of speech recognition: 'trained' is more accurate but less flexible across different voices and recording environments, whereas 'open' is generally much less accurate.
It'd probably be best if you stated exactly what you're trying to accomplish.