C++ / Win32 - Recording sound: DirectShow or waveInOpen?

Which should I choose, with WinXP, Vista, Win7 and later in mind:
Record audio with DirectShow / Direct...?
Go with the classic waveInOpen? (I've seen someone say somewhere that this is going to be outdated in Win7/Win8 - is that possible?)
PS: I need callback functionality so I can pass the buffer to the encoder.
Thanks!

WaveIn is easy to use, there is plenty of example code on the net, and it gives you a callback in the way you need it.
DirectSound uses a circular buffer and can be a little cumbersome to set up; most likely you'll need to manage the circular buffer yourself rather than "just filling a buffer". DirectSound can, however, give you tighter control over the audio, namely somewhat lower latency.
IMO, it's very unlikely that Microsoft will ever deprecate/remove the Wave API. They'd break thousands of applications. I actually don't think that MS has ever removed a core API from Windows.
So I'd go for the Wave API for simplicity.
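For reference, here is a minimal sketch of the waveInOpen route with a function callback (error handling is mostly omitted, and the format and buffer sizes are arbitrary placeholder choices, not recommendations):

    #include <windows.h>
    #include <mmsystem.h>
    #include <cstdio>
    #pragma comment(lib, "winmm.lib")   // MSVC; otherwise link winmm explicitly

    // Called by the system on its own thread. WIM_DATA means a buffer is full.
    // Avoid calling other waveIn* functions from inside the callback; real code
    // usually just signals a worker thread that hands the buffer to the encoder.
    static void CALLBACK waveInProc(HWAVEIN hwi, UINT msg, DWORD_PTR user,
                                    DWORD_PTR param1, DWORD_PTR param2)
    {
        if (msg == WIM_DATA) {
            WAVEHDR* hdr = reinterpret_cast<WAVEHDR*>(param1);
            // hdr->lpData / hdr->dwBytesRecorded is the PCM data for the encoder.
        }
    }

    int main()
    {
        WAVEFORMATEX wfx = {};
        wfx.wFormatTag      = WAVE_FORMAT_PCM;
        wfx.nChannels       = 2;
        wfx.nSamplesPerSec  = 44100;
        wfx.wBitsPerSample  = 16;
        wfx.nBlockAlign     = wfx.nChannels * wfx.wBitsPerSample / 8;
        wfx.nAvgBytesPerSec = wfx.nSamplesPerSec * wfx.nBlockAlign;

        HWAVEIN hwi = nullptr;
        if (waveInOpen(&hwi, WAVE_MAPPER, &wfx,
                       (DWORD_PTR)waveInProc, 0, CALLBACK_FUNCTION) != MMSYSERR_NOERROR)
            return 1;

        // Queue a couple of buffers; the driver fills them round-robin.
        static char data[2][44100 * 4];          // roughly one second each
        WAVEHDR hdr[2] = {};
        for (int i = 0; i < 2; ++i) {
            hdr[i].lpData = data[i];
            hdr[i].dwBufferLength = sizeof(data[i]);
            waveInPrepareHeader(hwi, &hdr[i], sizeof(WAVEHDR));
            waveInAddBuffer(hwi, &hdr[i], sizeof(WAVEHDR));
        }

        waveInStart(hwi);
        printf("Recording for 5 seconds...\n");
        Sleep(5000);

        waveInStop(hwi);
        waveInReset(hwi);                        // returns any pending buffers
        for (int i = 0; i < 2; ++i)
            waveInUnprepareHeader(hwi, &hdr[i], sizeof(WAVEHDR));
        waveInClose(hwi);
        return 0;
    }

In a real recorder you would keep re-adding each buffer after processing it so the capture never starves.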

Related

Pitch Shift in p5.js

I've been racking my brain on how to do a pitch shift in p5.js, and I've found documentation for a rate change (pitch and speed together), as well as a speed change without changing pitch. I was trying to experiment with having those run simultaneously, but it appears rate() is only available for p5.SoundFile and speed is only available for p5.MediaElement.
I was wondering if anyone had run across a way to extend functionality from one object to another, or if there was a way to manually extend the functionality somewhere in custom code.
Option 1. Switch to ToneJS
ToneJS is another library that wraps the browser's Web Audio API, and it has a built-in PitchShift effect. There's nothing special about p5.sound that makes it better or worse suited for use with p5.js, except perhaps that it follows some of the same conventions.
Option 2. Write a Custom Effect
p5.Sound provides a base class, p5.Effect, which could be used to implement a pitch-shift effect; however, this would be a pretty challenging project unless you have experience with digital signal processing and the underlying browser audio API. Here's a Wikipedia page on the algorithm in question.

Popup window in Turbo Pascal 7

In Turbo Pascal 7 for DOS you can use the Crt unit to define a window. If you define a second window on top of the first one, like a popup, I don’t see a way to get rid of the second one except for redrawing the first one on top again.
Is there a window close technique I’m overlooking?
I’m considering keeping an array of screens in memory to make it work, but the TP IDE does popups like I want to do, so maybe it’s easy and I’m just looking in the wrong place?
I don't think there's a window-closing technique you're missing, if you mean one provided by the CRT unit.
The library Borland used for the TP7 IDE was called Turbo Vision (see https://en.wikipedia.org/wiki/Turbo_Vision) and it was eventually released to the public domain, but well before that a number of third-party screen-handling/windowing libraries had become available, and these were much more powerful than anything that could be achieved with the CRT unit. Probably the best known was TurboPower Software's Object Professional (aka OPro).
As far as I know, these libraries (and, fairly obviously, Turbo Vision) were all based on an in-memory representation of a framed window which could be rapidly copied in and out of the PC's video memory and, as in Windows with a capital W, the windows were treated as a stack with a z-order. So the process of closing/erasing the top-level window was one of getting the window(s) it had been covering to redraw themselves. The CRT unit, on the other hand, had basically evolved from very primitive origins similar to, if not based on, the old DEC VT100 display protocol, and wasn't really up to the job of supporting independent, stackable window objects.
Although you may still be able to track down the public-domain release of Turbo Vision, it never really caught on as a library for developers. In an ideal world, a better place to start would be OPro. It was apparently on SourceForge for a while, but seems to have been taken down sometime since about 2007, and these days, even if you could get hold of a copy, there is a bit of a question mark over its licensing. However ...
There was also a very popular freeware library for TP called the "TechnoJock's Toolkit", which had a large functionality overlap (including screen handling) with OPro and is still available on GitHub - see https://github.com/lallousx86/TurboPascal/tree/master/TotLib/TOTSRC11. Unlike OPro, I never used TechnoJocks myself, but devotees swore by it. Take a look.

What is the OS X framework(s) for synthesizing music data?

I'd like to write some code that generates some pretty simple musical tones (notes) and has them output through the speaker (whatever sound device).
I suspect I'll likely need to generate as MIDI data, which I can go figure out independently, but I'm new to audio programming generally and I'm not sure what the best entry point into the system frameworks is. AudioToolbox has these MusicSequence objects. There's also Core MIDI and Core Audio. None has an obvious interface for "here's a data structure for a bunch of notes, now call this method to play them", so I'll presumably need some combination of these to cobble it together.
I'm confident that OS X supports this. If anyone has context with this kind of work, I'd appreciate a couple basic pointers on where in the docs (or other resources) to start looking for building whatever structures represent music data and where you'd turn around and trigger playback.
OS X does support this, but it's inherently more complex than it might seem at first. There are essentially three pieces:
MusicSequence is the "data structure for a bunch of notes" (along with timing information in the form of a tempo/meter map).
MusicPlayer is the object that controls playback of the MusicSequence.
AUGraph is what you'd use to create an instrument object and hook it up to your physical outputs, to turn the note data into sound.
There's a lot of potential variety in how you set up the AUGraph. For example, the default General MIDI synthesizer is the built-in DLSMusicDevice, but you could also load an FM synth, a sampler, or any number of other instrument units. From there, you could be processing the audio in various ways and routing it to various devices. All that stuff that falls in the general category of "audio processing" happens within the AUGraph.
Apple's PlaySequence sample code does mostly what you're looking for. It's a C++ project—but MusicSequence, MusicPlayer, and AUGraph are plain C APIs, so it should be a decent starting point. https://developer.apple.com/library/mac/samplecode/PlaySequence/Introduction/Intro.html
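As a rough sketch of how those pieces fit together (error checking omitted; the notes and durations are arbitrary, and this relies on MusicSequence creating its default AUGraph with the DLSMusicDevice when you don't attach one yourself):

    #include <AudioToolbox/AudioToolbox.h>   // compile with -framework AudioToolbox
    #include <unistd.h>

    int main()
    {
        // 1. The "data structure for a bunch of notes".
        MusicSequence sequence;
        NewMusicSequence(&sequence);

        MusicTrack track;
        MusicSequenceNewTrack(sequence, &track);

        // Add a short C-major arpeggio, one note per beat.
        UInt8 notes[] = { 60, 64, 67, 72 };   // MIDI note numbers
        for (int i = 0; i < 4; ++i) {
            // channel, note, velocity, release velocity, duration (in beats)
            MIDINoteMessage note = { 0, notes[i], 100, 0, 1.0f };
            MusicTrackNewMIDINoteEvent(track, (MusicTimeStamp)i, &note);
        }

        // 2. The player. With no explicit AUGraph attached, the sequence uses a
        //    default graph containing the DLSMusicDevice (General MIDI synth)
        //    wired to the default output.
        MusicPlayer player;
        NewMusicPlayer(&player);
        MusicPlayerSetSequence(player, sequence);
        MusicPlayerPreroll(player);
        MusicPlayerStart(player);

        sleep(5);                             // let it play

        MusicPlayerStop(player);
        DisposeMusicPlayer(player);
        DisposeMusicSequence(sequence);
        return 0;
    }

If you need a different instrument or custom routing, that's where you build your own AUGraph and hand it to the sequence instead of relying on the default one.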

How many layers are between my program and the hardware?

I somehow have the feeling that modern systems, with their runtime libraries, this exception handler and that built-in debugger, build up more and more layers between my (C++) programs and the CPU / rest of the hardware.
I'm thinking of something like this:
1 + 2 >> OS top layer >> Runtime library/helper/error handler >> a hell of a lot of DLL modules >> OS kernel layer >> "Do you really want to run 1 + 2?" Windows popup (don't take this seriously) >> OS kernel layer >> Hardware abstraction >> Hardware >> Go through at least 100 miles of circuits >> Eventually arrive at the CPU >> ADD 1, 2 >> Go all the way back to my program
Nearly all the technical details are simply wrong, and in some random order, but you get my point, right?
How much longer/shorter is this chain when I run a C++ program that calculates 1 + 2 at runtime on Windows?
How about when I do this in an interpreter? (Python|Ruby|PHP)
Is this chain really as dramatic in reality? Does Windows really try "not to stand in the way", e.g. is there a more or less direct connection between my binary and the hardware?
"1 + 2" in C++ gets directly translated in an add assembly instruction that is executed directly on the CPU. All the "layers" you refer to really only come into play when you start calling library functions. For example a simple printf("Hello World\n"); would go through a number of layers (using Windows as an example, different OSes would be different):
CRT - the C runtime implements things like %d replacements and creates a single string, then it calls WriteFile in kernel32
kernel32.dll implements WriteFile, notices that the handle is a console and directs the call to the console system
the string is sent to the conhost.exe process (on Windows 7, csrss.exe on earlier versions) which actually hosts the console window
conhost.exe adds the string to an internal buffer that represents the contents of the console window and invalidates the console window
The Window Manager notices that the console window is now invalid and sends it a WM_PAINT message
In response to the WM_PAINT, the console window (inside conhost.exe still) makes a series of DrawString calls inside GDI32.dll (or maybe GDI+?)
The DrawString method loops through each character in the string and:
Looks up the glyph definition in the font file to get the outline of the glyph
Checks its cache for a rendered version of that glyph at the current font size
If the glyph isn't in the cache, it rasterizes the outline at the current font size and caches the result for later
Copies the pixels from the rasterized glyph to the graphics buffer you specified, pixel-by-pixel
Once all the DrawString calls are complete, the final image for the window is sent to the DWM where it's loaded into the graphics memory of your graphics card, and replaces the old window
When the next frame is drawn, the graphics card now uses the new image to render the console window and your new string is there
Now, there are a lot of layers I've simplified here (e.g. the way the graphics card renders stuff is a whole other layer of abstractions), and I may have made some errors (I don't know the internals of how Windows is implemented, obviously), but hopefully it gives you an idea.
The important point, though, is that each step along the way adds some kind of value to the system.
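To make the first point concrete, here's a tiny sketch (assuming x86-64 and typical GCC/Clang output around -O1; the exact instructions depend on the compiler, flags, and calling convention):

    // add.cpp
    int add(int a, int b) {
        return a + b;                      // compiles to roughly: lea eax, [rdi + rsi] / ret
    }

    int main() {
        // With literal constants the compiler usually folds 1 + 2 to 3 at
        // compile time, so not even an add instruction is left at runtime.
        volatile int result = add(1, 2);   // volatile keeps the optimizer from removing it entirely
        return result;
    }

No OS layer is involved in the arithmetic itself; the layers only show up once you call into a library or the OS, as described above.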
As codeka said, there's a lot that goes on when you call a library function, but what you need to remember is that printing a string, displaying a JPEG, or whatever else is a very complicated task. Even more so when the method used must work for everyone in every situation; there are hundreds of edge cases.
What this really means is that when you're writing serious number-crunching, chess-playing, weather-predicting code, don't call library functions. Instead use only cheap operations which can and will be executed by the CPU directly. Additionally, planning where your expensive calls go can make a huge difference (print everything at the end, not each time through the loop), as sketched below.
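A minimal illustration of that last point (no real measurements implied; the idea is simply that the cheap work stays in the loop and the expensive library/OS work happens once):

    #include <cstdio>
    #include <string>

    int main() {
        // Slow pattern: one library call (and, for console output, often an
        // OS round trip) per iteration:
        //   for (int i = 0; i < 1000000; ++i) printf("%d\n", i);

        // Faster pattern: build the output in memory, pay the expensive
        // output cost once at the end.
        std::string out;
        out.reserve(8 * 1000000);
        for (int i = 0; i < 1000000; ++i) {
            out += std::to_string(i);
            out += '\n';
        }
        fwrite(out.data(), 1, out.size(), stdout);
        return 0;
    }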
It doesn't matter how many levels of abstraction there are, as long as the hard work is done in the most efficient way.
In a general sense you suffer from "emulating" your lowest level, e.g. you suffer when emulating a 68K CPU on an x86 CPU running some poorly implemented app, but it won't perform worse than the original hardware would; otherwise you would not emulate it in the first place. E.g. today most user-interface logic is implemented in high-level dynamic scripting languages, because it's more productive, while the hardcore stuff is handled by optimized low-level code.
When it comes to performance, it's always the hard work that hits the wall first. The things in between never suffer from performance issues. E.g. a key handler that processes 2-3 key presses a second can spend a fortune in badly written code without affecting the end-user experience, while the motion estimator in an MPEG encoder will fail utterly just by being implemented in software instead of dedicated hardware.

Use NSSpeechRecognizer or alternative with audio file instead of microphone input?

Is it possible to use NSSpeechRecognizer with a pre-recorded audio file instead of direct microphone input?
Or is there any other speech-to-text framework for Objective-C/Cocoa available?
Added:
Rather than using voice input at the machine that is running the application, external devices (e.g. an iPhone) could be used to send just a recorded audio stream to that desktop application. The desktop Cocoa app would then process it and do whatever it's supposed to do using the assigned commands.
Thanks.
I don't see any obvious way to switch the input programmatically, though the "Speech" companion guide's first paragraph in the "Recognizing Speech" section seems to imply other inputs can be used. I think this is meant to be set via System Preferences, though. I'm guessing it uses the primary audio input device selected there.
I suspect, though, that you're looking for open-ended speech recognition, which NSSpeechRecognizer is not. If you're looking to transform arbitrary pre-recorded audio into text (i.e., make a transcript of a recording), you're completely out of luck with NSSpeechRecognizer, as you must give it an array of "commands" to listen for.
Theoretically, you could feed it the whole dictionary, but I don't think that would work since you usually have to give it clear, distinct commands. Its performance would suffer, I would guess, if you gave it a bunch of stuff to analyze for (in real time).
Your best bet is to look at third-party open-source solutions. There are a few generalized packages out there (none specifically for Cocoa/Objective-C), but this poses another question: what kind of recognition are you looking for? There are two main forms of speech recognition: "trained" recognition is more accurate but less flexible across different voices and recording environments, whereas "open" recognition is generally much less accurate.
It'd probably be best if you stated exactly what you're trying to accomplish.
