Make DirectShow play sound from a memory buffer - winapi

I want to play sound "on-demand". A simple drum machine is what I want to program.
Is it possible to make DirectShow read from a memory buffer (an object created in C++)?
I am thinking:
Create a buffer of, let's say, 40000 positions of type double (I don't know the actual data type used for sound, so I might be wrong about double).
40000 positions can be 1 second of playback.
The DirectShow object is supposed to read this buffer position by position, over and over again, and the buffer will contain the actual sample values of the sound output. For example (a sine-looking output):
{0, 0.4, 0.7, 0.9, 0.99, 0.9, 0.7, 0.4, 0, -0.4, -0.7, -0.9, -0.99, -0.9, -0.7, -0.4, 0}
The resolution of this sound sequence is probably not that good, but it is only to display what I mean.
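In real code I imagine the buffer would hold something like 16-bit samples at 44100 Hz rather than arbitrary doubles; a rough sketch of filling one second with a sine (the sample rate, frequency and format are just my guesses):

#include <cmath>
#include <cstdint>
#include <vector>

// Sketch: one second of 16-bit mono PCM containing a 440 Hz sine.
// Sample rate, frequency and amplitude are arbitrary choices.
std::vector<int16_t> MakeSineSecond()
{
    const int sampleRate = 44100;      // samples per second
    const double frequency = 440.0;    // Hz
    const double amplitude = 0.9;      // leave a little headroom

    std::vector<int16_t> buffer(sampleRate);
    for (int n = 0; n < sampleRate; ++n)
    {
        double v = amplitude * std::sin(2.0 * 3.14159265358979 * frequency * n / sampleRate);
        buffer[n] = static_cast<int16_t>(v * 32767.0);
    }
    return buffer;
}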
Is this possible? I cannot find any examples or information about it on Google.
edit:
When working with DirectShow and streaming video (a USB camera), I used something called a Sample Grabber, which called a method for every frame from the camera. I am looking for something similar, but for audio, and something that is called before the audio is played.
Thanks

You want to stream your own data through the pipeline, and injecting data into a DirectShow pipeline is possible.
By design, the outer DirectShow interface does not provide access to the streamed data. The controlling code builds the topology, connects the filters, sets them up and controls the state of the pipeline. The data itself is streamed behind the scenes: filters pass pieces of data one to another, and this adds up to data streaming.
The Sample Grabber is a helper filter that lets you grab a copy of the data passing through a certain point in the graph. Because the payload data is otherwise not available to the controlling code, the Sample Grabber gained popularity, especially for grabbing video frames out of the otherwise "inaccessible" stream, whether live or file-backed playback.
Now, when you want to do the opposite and put your own data into the pipeline, the Sample Grabber concept does not work. Taking a copy of the data is one thing; proactively putting your own data into the stream is quite another.
To inject your own data you typically put a custom filter into the pipeline that generates the data. You want to generate PCM audio data. You choose where you take it from: generating it on the fly, reading from a file, memory or the network, looping, whatever. You fill buffers, add time stamps and deliver the audio buffers to the downstream filters. A typical starting point is the PushSource Filters Sample, which introduces the concept of a filter producing video data; in a similar way you want to produce PCM audio data.
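As a rough sketch of what that looks like (assuming the DirectShow BaseClasses from the SDK samples, i.e. CSource/CSourceStream from streams.h; the class name, the in-memory m_pcm buffer and the fixed 44.1 kHz / 16-bit mono format are my own choices, and the owning CSource filter plus graph wiring are omitted as in the PushSource sample):

#include <streams.h>   // DirectShow BaseClasses: CSource, CSourceStream, CMediaType
#include <vector>

// Output pin that loops endlessly over a PCM buffer held in memory.
class CPcmSourcePin : public CSourceStream
{
public:
    CPcmSourcePin(HRESULT *phr, CSource *pFilter, std::vector<short> pcm)
        : CSourceStream(NAME("PCM Source Pin"), phr, pFilter, L"Out"),
          m_pcm(std::move(pcm)), m_pos(0), m_rtPosition(0) {}

    // Offer a single media type: 44.1 kHz, 16-bit, mono PCM.
    HRESULT GetMediaType(CMediaType *pMediaType) override
    {
        WAVEFORMATEX *wf = reinterpret_cast<WAVEFORMATEX*>(
            pMediaType->AllocFormatBuffer(sizeof(WAVEFORMATEX)));
        if (!wf) return E_OUTOFMEMORY;
        ZeroMemory(wf, sizeof(*wf));
        wf->wFormatTag      = WAVE_FORMAT_PCM;
        wf->nChannels       = 1;
        wf->nSamplesPerSec  = 44100;
        wf->wBitsPerSample  = 16;
        wf->nBlockAlign     = wf->nChannels * wf->wBitsPerSample / 8;
        wf->nAvgBytesPerSec = wf->nSamplesPerSec * wf->nBlockAlign;
        pMediaType->SetType(&MEDIATYPE_Audio);
        pMediaType->SetSubtype(&MEDIASUBTYPE_PCM);
        pMediaType->SetFormatType(&FORMAT_WaveFormatEx);
        pMediaType->SetSampleSize(wf->nBlockAlign);
        return S_OK;
    }

    // Ask the allocator for a couple of ~100 ms buffers.
    HRESULT DecideBufferSize(IMemAllocator *pAlloc, ALLOCATOR_PROPERTIES *pRequest) override
    {
        pRequest->cBuffers = 2;
        pRequest->cbBuffer = 4410 * sizeof(short);
        ALLOCATOR_PROPERTIES actual;
        HRESULT hr = pAlloc->SetProperties(pRequest, &actual);
        if (FAILED(hr)) return hr;
        return actual.cbBuffer < pRequest->cbBuffer ? E_FAIL : S_OK;
    }

    // Called repeatedly by the base class: fill each media sample with PCM and stamp it.
    HRESULT FillBuffer(IMediaSample *pSample) override
    {
        BYTE *pData = nullptr;
        HRESULT hr = pSample->GetPointer(&pData);
        if (FAILED(hr)) return hr;

        const long samples = pSample->GetSize() / (long)sizeof(short);
        short *pOut = reinterpret_cast<short*>(pData);
        for (long i = 0; i < samples; ++i)          // wrap around to loop the buffer forever
        {
            pOut[i] = m_pcm[m_pos];
            m_pos = (m_pos + 1) % m_pcm.size();
        }
        pSample->SetActualDataLength(samples * sizeof(short));

        // Time stamps in 100 ns units: this is what paces playback.
        REFERENCE_TIME rtStart = m_rtPosition;
        REFERENCE_TIME rtStop  = rtStart + (REFERENCE_TIME)samples * 10000000 / 44100;
        m_rtPosition = rtStop;
        pSample->SetTime(&rtStart, &rtStop);
        pSample->SetSyncPoint(TRUE);
        return S_OK;                                // returning S_FALSE would end the stream
    }

private:
    std::vector<short> m_pcm;       // e.g. one second of your drum sample (must be non-empty)
    size_t             m_pos;       // current read position, wraps around
    REFERENCE_TIME     m_rtPosition;
};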
A related question:
How do I inject custom audio buffers into a DirectX filter graph using DSPACK?

Related

Change the scale of overlay video based on audio level

I'm looking to adjust the scale of my image overlay based on the input audio's loudness.
As the volume rises, I want to make it larger; as it gets quieter, I want to make it smaller.
I can't figure out how to access any relevant audio information in video filters.
I'm open to multi-step solutions, but the result needs to stay in sync.

FFMPEG API -- How much do stream parameters change frame-to-frame?

I'm trying to extract raw streams from devices and files using ffmpeg. I notice the crucial frame information (Video: width, height, pixel format, color space, Audio: sample format) is stored both in the AVCodecContext and in the AVFrame. This means I can access it prior to the stream playing and I can access it for every frame.
How much do I need to account for these values changing frame-to-frame? I found https://ffmpeg.org/doxygen/trunk/demuxing__decoding_8c_source.html#l00081 which indicates that at least width, height, and pixel format may change frame to frame.
Will the color space and sample format also change frame to frame?
Will these changes be temporary (a single frame) or lasting (a significant block of frames) and is there any way to predict for this stream which behavior will occur?
Is there a way to find the most descriptive attributes that this stream is capable of producing, so that I can scale all the lower-quality frames up, but not offer a result that is pointlessly higher quality than the source, even if this is a device or a network stream where I cannot play all the frames in advance?
The fundamental question is: how do I reconcile the flexibility of this API with the restriction that raw streams (my output) do not have any way of specifying a change of stream attributes mid-stream? I imagine I will need to either predict the most descriptive attributes to give the stream, or offer a new stream when the attributes change. Which choice to make depends on whether these values will change rapidly or stay relatively stable.
So, to add to what @szatmary says, the typical use case for stream parameter changes is adaptive streaming:
Imagine you're watching YouTube on a laptop with a varying internet connection, and the bandwidth suddenly drops. Your stream will automatically switch to a lower-bandwidth variant. FFmpeg (which is used by Chrome) needs to support this.
Alternatively, imagine a similar scenario in an RTC video chat.
The reason FFmpeg does what it does is that the API is essentially trying to accommodate the lowest common denominator. Videos shot on a phone won't ever change resolution, and neither will most videos exported from video editing software. Even videos from youtube-dl will typically not switch resolution; that is a client-side decision, and youtube-dl simply won't do it. So what should you do? I'd just use the stream information from the first frame(s) and rescale all subsequent frames to that resolution. This will work for 99.99% of the cases. Whether you want to accommodate the remaining 0.01% depends on what type of videos you expect people to upload and whether resolution changes make any sense in that context.
Does the colorspace change? It could, theoretically, in software that mixes screen recording with video fragments, but in practice it's highly unlikely. The sample format changes about as often as the video resolution: quite often in the adaptive scenario, but whether you care depends on your service and on the types of videos you expect to get.
Usually not often, or ever. However, this depends on the codec and on options chosen at encode time. I pass the decoded frames through swscale just in case.
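If you go with locking the parameters from the first decoded frame, a sketch of that normalization step might look like this (written against the public FFmpeg C API from C++; the FrameNormalizer name, the SWS_BILINEAR flag and the lack of error handling are my own shortcuts):

extern "C" {
#include <libavutil/frame.h>
#include <libavutil/pixfmt.h>
#include <libswscale/swscale.h>
}

// Sketch: take the width/height/pixel format of the first decoded frame as the
// canonical output format and rescale/convert every later frame to match it.
struct FrameNormalizer
{
    SwsContext *sws = nullptr;
    int dstW = 0, dstH = 0;
    AVPixelFormat dstFmt = AV_PIX_FMT_NONE;

    // Returns a frame in the canonical format (caller owns it via av_frame_free).
    AVFrame *normalize(const AVFrame *src)
    {
        if (dstW == 0) {                      // first frame defines the output
            dstW   = src->width;
            dstH   = src->height;
            dstFmt = static_cast<AVPixelFormat>(src->format);
        }

        AVFrame *dst = av_frame_alloc();
        dst->width  = dstW;
        dst->height = dstH;
        dst->format = dstFmt;
        av_frame_get_buffer(dst, 0);

        // sws_getCachedContext recreates the context only when the source
        // parameters actually change mid-stream.
        sws = sws_getCachedContext(sws,
                                   src->width, src->height,
                                   static_cast<AVPixelFormat>(src->format),
                                   dstW, dstH, dstFmt,
                                   SWS_BILINEAR, nullptr, nullptr, nullptr);
        sws_scale(sws, src->data, src->linesize, 0, src->height,
                  dst->data, dst->linesize);
        return dst;
    }

    ~FrameNormalizer() { sws_freeContext(sws); }
};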

Delay in video in DirectShow graph

I'm seeing a noticeable video delay which is causing the resulting audio/video sync to be off for a capture card that I'm testing. My graph topology is as follows:
Video Source -> Sample Grabber -> Null Renderer
Audio Source -> Sample Grabber -> Null Renderer
The video samples are compressed using H.264, and the audio is compressed using FAAC. This topology and application code work for capture cards that I've used in the past, but I see this delay with the current card that I'm testing. Naturally I thought it was related to the card itself, so I checked and found that there is no video/audio desync when using Open Broadcaster, VLC, or the same graph in GraphEdit to capture with this card.
This indicates to me that the problem is related to how I'm constructing the graph. I then tried adjusting the buffer sizes using IAMBufferNegotiation, as well as SetStreamSyncOffset, without success.
The sync is almost perfect if I apply a 500 ms lag to the video (e.g. videoTimeStamp = videoTimeStamp - 500). This is strange because I would expect to see more latency in the audio than in the video.
Video and audio synchronization is all about time stamps. Either the video or the audio leg might delay the processing of data, but it is the time stamps that carry the original and intended sync.
Potential causes include:
The video and audio sources time-stamp their data independently and incorrectly, delivering unsynchronized data (does not look like your case)
You neglect the time stamps and instead use the actual time at which samples arrive at your Sample Grabber, which is incorrect (see the sketch after this list)
Another filter in between, such as a decoder, incorrectly re-stamps the data when processing it
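For the second point: what matters is the sample's own time stamps, not the wall-clock time at which your callback happens to run. A rough sketch of reading them inside the classic ISampleGrabberCB::SampleCB callback (qedit.h is the old, deprecated Sample Grabber header and is often copied locally; the class name and the OnTimestampedSample hook are placeholders of mine):

#include <dshow.h>
#include <qedit.h>   // ISampleGrabberCB (deprecated header)

// Use the sample's own time stamps for muxing/synchronization rather than
// the wall-clock time at which the callback fires. Error handling is trimmed.
class CGrabberCallback : public ISampleGrabberCB
{
public:
    STDMETHODIMP SampleCB(double /*SampleTime*/, IMediaSample *pSample) override
    {
        REFERENCE_TIME rtStart = 0, rtStop = 0;   // 100 ns units, stream time
        if (SUCCEEDED(pSample->GetTime(&rtStart, &rtStop)))
        {
            // rtStart is what the upstream filters say this sample's position is.
            // Using GetTickCount()/QueryPerformanceCounter() here instead would
            // bake pipeline latency into your A/V sync.
            OnTimestampedSample(rtStart, rtStop, pSample);
        }
        return S_OK;
    }
    STDMETHODIMP BufferCB(double, BYTE*, long) override { return E_NOTIMPL; }

    // Minimal, non-reference-counted IUnknown for a stack/member-owned callback.
    STDMETHODIMP QueryInterface(REFIID riid, void **ppv) override
    {
        if (riid == IID_IUnknown || riid == __uuidof(ISampleGrabberCB)) { *ppv = this; return S_OK; }
        *ppv = nullptr; return E_NOINTERFACE;
    }
    STDMETHODIMP_(ULONG) AddRef() override  { return 2; }
    STDMETHODIMP_(ULONG) Release() override { return 1; }

private:
    void OnTimestampedSample(REFERENCE_TIME, REFERENCE_TIME, IMediaSample*) { /* mux here */ }
};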

Detect frames that have a given image/logo with FFmpeg

I'm trying to split a video by detecting the presence of a marker (an image) in the frames. I've gone over the documentation and I see removelogo but not detectlogo.
Does anyone know how this could be achieved? I know what the logo is and the region it will be on.
I'm thinking I can extract all frames to PNGs and then analyse them one by one (or n at a time), but it might be a lengthy process...
Any pointers?
ffmpeg doesn't have any such ability natively. The delogo filter simply takes a rectangular region from its parameters and interpolates that region based on its surroundings; it doesn't care what the region previously contained and will fill it in regardless.
If you need to detect the presence of a logo, that's a totally different task. You'll need to create it yourself; if you're serious about this, I'd recommend that you start familiarizing yourself with the ffmpeg filter API and get ready to get your hands dirty. If the logo has a distinctive color, that might be a good way to detect it.
Since what you're after is probably going to just be outputting information on which frames contain (or don't contain) the logo, one filter to look at as a model will be the blackframe filter (which searches for all-black frames).
You can write a detect-logo module: decode the video (YUV 4:2:0 format), feed each raw frame to this module, and do a SAD (Sum of Absolute Differences) on the region where you expect the logo; if the SAD is negligible it's a match, so record the frame number. You can then split the video at these frames.
The SAD is done only on the Y (luma) plane. To save processing you can scale the video down to a lower resolution first.
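A minimal sketch of that SAD check on the Y plane (plain C++; the per-row stride layout and the threshold are assumptions you would tune for your footage):

#include <cstdint>
#include <cstdlib>

// Sum of absolute differences between a reference logo patch and the same
// region of a decoded frame's Y (luma) plane. yPlane uses 'stride' bytes per
// row; logo is a tightly packed w*h reference patch at offset (x, y).
uint64_t LogoSAD(const uint8_t *yPlane, int stride,
                 int x, int y, const uint8_t *logo, int w, int h)
{
    uint64_t sad = 0;
    for (int row = 0; row < h; ++row)
    {
        const uint8_t *src = yPlane + (y + row) * stride + x;
        const uint8_t *ref = logo + row * w;
        for (int col = 0; col < w; ++col)
            sad += std::abs(int(src[col]) - int(ref[col]));
    }
    return sad;
}

// Usage sketch: normalize by the patch area and compare to a tuned threshold.
bool LogoPresent(uint64_t sad, int w, int h) { return sad / uint64_t(w * h) < 12; }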
I have successfully detected logos using a Raspberry Pi and a Coral AI accelerator in conjunction with ffmpeg to extract the JPEGs. Crop the image to just the logo, then feed it to your trained model. Even then you will need to sample a minute or so of video to determine the actual logo's identity.

OpenAL synchronization

I'm new to audio programming so excuse me if I'm not using the right terms...
I have two streaming buffers that I want to play simultaneously, completely synchronized. I want to control the blending ratio between the streams. I'm sure it's as simple as having two sources playing and just changing their gain, but I've read about people doing tricks like using one 2-channel buffer instead of two single-channel buffers; they then play from a single source but control the blending between the channels. The article I read wasn't about OpenAL, so my question is: is this even possible with OpenAL?
I guess I don't have to do it this way, but now I'm curious and want to learn how to set it up. Am I supposed to set up an AL filter? Creative's documentation says "Buffers containing more than one channel of data will be played without 3D spatialization." Reading this, I guess I would need a pre-pass at the buffer level and then have the source output a blended mono signal.
I guess I'll ask another question. Is OpenAL flexible enough to do tricks like this?
I decode my stream manually, so I realize how easy it would be to do the blending myself before filling the buffer, but then I wouldn't be able to change the blending factor in real time since I already have a second or so of the stream buffered.
I have two streaming buffers that I want to play simultaneously, completely synchronized.
I want to control the blending ratio between the streams. I'm sure it's as simple as having two sources playing and just changing their gain
Yes, it should be. Did you try that? What was the problem?
#include <AL/al.h>      // alSourcef, AL_GAIN
#include <algorithm>    // std::min, std::max

ALuint source1;
ALuint source2;
...

// Crossfade: drive the two sources' gains in opposite directions.
void set_ratio(float ratio) {
    ratio = std::min(std::max(ratio, 0.0f), 1.0f);  // clamp to [0, 1]
    alSourcef(source1, AL_GAIN, ratio);
    alSourcef(source2, AL_GAIN, 1.0f - ratio);
}
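To start the two streams in lock step rather than one slightly after the other, OpenAL also lets you start several sources with a single call. Continuing the snippet above (exact sample alignment is still up to the implementation):

// Queue the same number of buffers on both streams first, then start them together.
ALuint sources[2] = { source1, source2 };
alSourcePlayv(2, sources);
set_ratio(0.5f);   // equal blend to begin with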
