DirectShow, MUX to Renderer - render

Are muxers generally implemented to write only to a file?
I want to connect a renderer to the output pin of a muxer, but the connection fails.
I implemented the renderer myself, so I have the source code and can step through it.
If I connect it directly to encoded data, the renderer works.
My renderer does not do anything with the data; it just consumes it.
(Eventually it will send the data over the wire.)
I am using a WebM muxer.
The error code says the media types are not compatible. The interesting thing is that it does not even call my CheckMediaType or any of my input pin's other functions.
So whatever is happening is happening in the muxer's internals.
I know it is hard to guess what the issue is.
The same thing also happened to me with the AVI muxer that ships with Windows.
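For context, a minimal sketch of the failing step, assuming the muxer and the custom renderer have already been added to the graph; GetFirstPin() is a hypothetical helper (not part of the DirectShow SDK) that returns the first pin of a given direction:

    // Sketch only: pGraph is an IGraphBuilder*, pMuxer and pMyRenderer are IBaseFilter*.
    IPin* pMuxOut   = GetFirstPin(pMuxer, PINDIR_OUTPUT);      // hypothetical helper
    IPin* pRenderIn = GetFirstPin(pMyRenderer, PINDIR_INPUT);  // hypothetical helper

    // Direct connection with no intermediate filters. If the muxer's output pin
    // only offers a byte-stream type intended for a file-writer filter (or requires
    // an interface such as IStream on the downstream filter), the call can fail
    // with a "no acceptable types"-style error before the renderer's CheckMediaType
    // is ever invoked.
    HRESULT hr = pGraph->ConnectDirect(pMuxOut, pRenderIn, nullptr);

    pMuxOut->Release();
    pRenderIn->Release();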

Related

DirectX vs FFmpeg

I'm in the process of deciding how to decode received video frames, based on the following:
The platform is Windows.
Frames are encoded in H264 or H265.
The GPU should be used as much as possible.
We certainly prefer less coding and the simplest code; we just need to decode and show the result on screen. No recording is required, nor anything else.
I'm still a newbie, but I think one may decode a frame either directly with DirectX or through FFmpeg. Am I right?
If so, which one is preferred?
For a simple approach and simple code using only the GPU, take a look at my project using DirectX: H264Dxva2Decoder
If you are ready to code, you can use my approach.
If not, you can use Media Foundation or FFmpeg; both can do the job.
Media Foundation is C++ and COM oriented, FFmpeg is C oriented. That can make the difference for you.
EDIT
You can use my program because your frames are encoded in H264 or H265. For H265 you will have to add extra code.
Of course, you need to make modifications. And yes, you can send frames to DirectX without using a file. This project only handles the AVCC video format, but it can be modified for other cases.
You don't need the atom parser. You will need to modify the NALU parser and the buffering mechanism if your frames are in Annex-B format, for example.
I can help you if you provide frame samples encoded in H264.
As for FFmpeg, it has fewer limitations than my program with respect to the H264 specification,
but it does not provide the rendering mechanism. You will have to combine FFmpeg with my rendering mechanism, for example.
Or study a program like MPC-HC, which shows how to combine the two. I cannot help any further here.
EDIT 2
One thing to know: you can't feed encoded packets to the GPU directly. You need to parse them first. That's why there is a NALU parser (see DXVA_PicParams_H264).
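As an illustration (not code from the project), a minimal sketch of what a NALU parser starts with: scanning an Annex-B stream for start codes and reading each NAL unit's type. A real parser must also strip emulation prevention bytes and parse the SPS/PPS/slice headers to fill structures such as DXVA_PicParams_H264:

    // Minimal Annex-B scan: find 00 00 01 start codes and report NAL unit types.
    #include <cstdint>
    #include <cstdio>

    void ScanAnnexB(const uint8_t* data, size_t size)
    {
        size_t i = 0;
        while (i + 3 < size)
        {
            // A 4-byte start code (00 00 00 01) just has one extra leading zero.
            if (data[i] == 0 && data[i + 1] == 0 && data[i + 2] == 1)
            {
                uint8_t nalHeader = data[i + 3];
                uint8_t nalType   = nalHeader & 0x1F;   // H.264: low 5 bits of the header
                std::printf("NAL unit at offset %zu, type %u\n", i, nalType);
                i += 3;
            }
            else
            {
                ++i;
            }
        }
    }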
If you are not ready to code and to understand how it all works, use FFmpeg; it will be simpler, in effect. You can focus on rendering rather than decoding.
It's also important to know which one gives better results, consumes fewer resources (CPU, GPU, and RAM - both system memory and graphics card memory), supports a wider range of formats, and so on.
You are asking for real expertise...
If you code your own program, you will be able to optimize it and will certainly get better results. If you use FFmpeg and it has performance problems in your context, you could be stuck, because you are not going to modify FFmpeg yourself.
You say you will use Bosch cameras. Normally, all the encoded video will then be in the same format, so once your code is able to decode it, you don't really need all the FFmpeg features.

Resize MFT Issues: Video Composition in Windows Media Foundation

I'm trying to do composition with two separate video sources in Media Foundation. I want to encode a video with a video overlay. To do so I am using the Video Resizer on the smaller input.
I've seen several threads on this, but I thought I'd ask around in any case.
Basically the idea is to create two source readers and a sink writer. The source files are h264, so I use the reader to decode into YUY2. While processing samples, I send the appropriate sample to the Resize MFT, then down the line (I haven't made it this far) I combine the two images to create the overlay effect with MFCopyImage.
The problem is that I am getting E_INVALIDARG when I call ProcessInput on the Resizer MFT.
To initialize the MFT, I give it the appropriate type from the reader via SetInputType. After that I set all the appropriate properties via the property store, and then update the frame size on the MFT's output type. I have read the documentation and modeled my implementation on the MFT Processing Model.
None of these steps raise any red flags until I actually attempt to use ProcessInput.
Although I have limited experience in Windows Media Foundation, I have been able to use the Framerate DSP with success. I would appreciate any advice.
Thank you!
For anyone else stuck in a similar situation: I ended up not using the Resizer MFT but the Video Processor MFT, which worked with much less effort.
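For reference, a minimal sketch of what the swap to the Video Processor MFT can look like (error handling trimmed; the YUY2 subtype and the frame sizes are placeholder values for whatever your source readers actually deliver):

    // Sketch: resize YUY2 frames with the Video Processor MFT (Windows 8+).
    #include <mfapi.h>
    #include <mfidl.h>
    #include <mftransform.h>

    HRESULT CreateResizer(IMFTransform** ppMFT)
    {
        HRESULT hr = CoCreateInstance(CLSID_VideoProcessorMFT, nullptr,
                                      CLSCTX_INPROC_SERVER, IID_PPV_ARGS(ppMFT));
        if (FAILED(hr)) return hr;

        // Input type: what the source reader decodes to (placeholder 1920x1080 YUY2).
        IMFMediaType* pIn = nullptr;
        MFCreateMediaType(&pIn);
        pIn->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video);
        pIn->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_YUY2);
        MFSetAttributeSize(pIn, MF_MT_FRAME_SIZE, 1920, 1080);
        MFSetAttributeRatio(pIn, MF_MT_FRAME_RATE, 30, 1);
        hr = (*ppMFT)->SetInputType(0, pIn, 0);
        pIn->Release();
        if (FAILED(hr)) return hr;

        // Output type: same subtype, smaller frame size for the overlay.
        IMFMediaType* pOut = nullptr;
        MFCreateMediaType(&pOut);
        pOut->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video);
        pOut->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_YUY2);
        MFSetAttributeSize(pOut, MF_MT_FRAME_SIZE, 480, 270);
        MFSetAttributeRatio(pOut, MF_MT_FRAME_RATE, 30, 1);
        hr = (*ppMFT)->SetOutputType(0, pOut, 0);
        pOut->Release();
        return hr;   // then call ProcessInput/ProcessOutput per sample
    }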

Sending per frame metadata with H264 encoded frames

We're looking for a way to send per frame metadata (for example an ID) with H264 encoded frames from a server to a client.
We're currently developing a remote rendering application, where both client and server side are actively involved.
The server renders a high quality image with all effects, lighting etc.
The client also has model information and renders a diffuse image that is used when the bandwidth is too low or when the images have to be warped to avoid stuttering.
So far we're encoding the frames on the server side with ffmpeg and streaming them with live555 to the client, which receives an RTSP stream and decodes the frames again using ffmpeg.
For our application, we now need to send per frame metadata.
We want the client to tell the server where the camera is right now.
Ideally we'd be able to send the client's view matrix to the server, render the corresponding frame and send it back to the client together with its view matrix. So when the client receives a frame, we need to know exactly at what camera position the frame was rendered.
Alternatively we could also tag each view matrix with an ID, send it to the server, render the frame and tag it with the same ID and send it back. In this case we'd have to assign the right matrix to the frame again on the client side.
After several attempts to realize the above with ffmpeg, we came to the conclusion that ffmpeg does not provide the required functionality. It only offers a fixed, predefined set of metadata fields, which either cannot store a matrix or can only be set on key frames, which is not frequent enough for our purpose.
Now we're considering using live555. So far we have an on-demand server which serves a video subsession with an H264VideoStreamDiscreteFramer wrapping our own FramedSource class. In this class we take the encoded AVPacket (from ffmpeg) and send its data buffer over the network. Now we need a way to send some kind of metadata with every frame to the client.
Do you have any ideas how to solve this metadata problem with live555 or another library?
Thanks for your help!
It seems this question was answered in the comments:
pipe the output of ffmpeg through a custom tool that embedded the data
in the 264 elementary stream via an SEI
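For illustration, a minimal sketch of that idea, assuming Annex-B output from the encoder and ignoring emulation prevention: wrap the per-frame metadata in a user_data_unregistered SEI NAL unit (payload type 5) and prepend it to the frame's byte stream. The UUID below is made up; pick your own to identify your metadata.

    #include <cstdint>
    #include <vector>

    // Build an SEI NAL (type 6) with a user_data_unregistered payload.
    // Caveat: a real implementation must insert emulation prevention bytes (0x03)
    // if the payload contains 0x000000, 0x000001 or 0x000002 sequences.
    std::vector<uint8_t> BuildSeiNal(const uint8_t* payload, size_t payloadSize)
    {
        static const uint8_t kUuid[16] = {           // arbitrary example UUID
            0xde, 0xad, 0xbe, 0xef, 0x11, 0x22, 0x33, 0x44,
            0x55, 0x66, 0x77, 0x88, 0x99, 0xaa, 0xbb, 0xcc };

        std::vector<uint8_t> nal = { 0x00, 0x00, 0x00, 0x01,   // Annex-B start code
                                     0x06,                      // NAL type 6 = SEI
                                     0x05 };                    // payload type 5

        size_t size = 16 + payloadSize;              // UUID + metadata
        while (size >= 255) { nal.push_back(0xFF); size -= 255; }
        nal.push_back(static_cast<uint8_t>(size));

        nal.insert(nal.end(), kUuid, kUuid + 16);
        nal.insert(nal.end(), payload, payload + payloadSize);
        nal.push_back(0x80);                         // RBSP trailing bits
        return nal;
    }

    // Usage: prepend the SEI to an encoded access unit before handing it to the
    // FramedSource, e.g. with a 16-float view matrix as the payload.
    std::vector<uint8_t> TagFrame(const uint8_t* frame, size_t frameSize,
                                  const float viewMatrix[16])
    {
        std::vector<uint8_t> out = BuildSeiNal(
            reinterpret_cast<const uint8_t*>(viewMatrix), 16 * sizeof(float));
        out.insert(out.end(), frame, frame + frameSize);
        return out;
    }

The client then watches for SEI NAL units carrying that UUID and matches each decoded frame to its view matrix.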
Someone also gave the following answer, which was deleted a few years ago for dubious reasons (it is brief but does seem to contain sufficient information):
You can do so using MPEG-4. See details for MPEG-4 Part 14 for
details.

DirectShow - How to read a file from a source filter

I'm writing a DirectShow source filter which is registered under CLSID_VideoInputDeviceCategory, so it can be seen as a video capture device (Skype, for example, sees it as another webcam).
My source filter is based on the VCam example from here, and, for now, the filter produces exactly the same output as that example (random colored pixels with one video output pin, no audio yet), all implemented in the FillBuffer() method of the one and only output pin.
Now the real scenario will be a bit more tricky: the filter uses a file handle to a hardware device, opened using the CreateFile() API call (opening the device is out of my control and is done by a third-party library). It should then read chunks of data from this handle (usually 256-512 byte chunks).
The device is a WinUSB device, and the third-party framework just "gives" me an opened file handle to read chunks from.
The data read by the filter is a *.mp4 file, which is streamed from the device to the "handle".
This scenario is equivalent to a source filter reading from a *.mp4 file on the disk (in "chunks") and pushing its data to the DirectShow graph, but without the ability to read the file entirely from start to end, so the file size is unknown (Correct?).
I'm pretty new to DirectShow and I feel as though I'm missing some basic concepts. I'll be happy if anyone can direct me to solutions\resources\explanations for the following questions:
1) From various sources on the web and the Microsoft SDK (v7.1) samples, I understood that for an application (such as Skype) to build a correct and valid DirectShow graph (so it will render the video and audio successfully), the source filter pin (which inherits from CSourceStream) should implement the method "GetMediaType". Based on the value returned from this function, an application will be able to build the correct graph to render the data, that is, the correct order of filters. If this is correct - how would I implement it in my case so that the graph will be built to render *.mp4 input in chunks (we can assume constant chunk sizes)?
2) I've noticed that the FillBuffer() method is supposed to call SetTime() on the IMediaSample object it gets (and fills). I'm reading raw *.mp4 data from the device. Will I have to parse the data and extract the frames and time values from the stream? If yes, an example would be great.
3) Will I have to split the data received from the file handle (the "chunks") into video and audio, or can the data be pushed to the graph without the need to manipulate it in the source filter? If a split is needed, how can it be done (the data is not continuous and is split into chunks), and will this affect the desired implementation of "GetMediaType"?
Please feel free to correct me if I'm using incorrect terminology.
Thanks :-)
This is a good question. On the one hand this is doable, but there are some specifics involved.
First of all, a filter registered under the CLSID_VideoInputDeviceCategory category is expected to behave as a live video source. Registering it this way makes it discoverable by applications (such as Skype, as you mentioned), and those applications will attempt to configure the video resolution and will expect video to arrive at a real-time rate; some applications (such as Skype) do not expect compressed video such as H.264 there, or will simply reject such a device. Nor can you attach audio directly to this filter, since applications will not even look for audio there (I'm not sure whether you have audio in your scenario, but you mentioned an .MP4 file, so audio might be present).
On your questions:
1 - You will get a better picture of an application's requirements by checking which interface methods the application calls on your filter. Most of the methods are implemented by the BaseClasses and convert the calls into internal methods such as GetMediaType. Yes, you need to implement it, and by doing so you will - among other things - enable your filter to connect to downstream filter pins by offering the specific media types you support.
Again, those cannot be MP4 chunks, even if such an approach might work in other DirectShow graphs. Implementing a video capture device, you should be delivering actual video frames, preferably decompressed (they could be compressed too, but then you will immediately run into compatibility issues with applications).
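As a rough illustration of the GetMediaType part only, here is a minimal sketch for a VCam-style CSourceStream pin that offers a single uncompressed RGB24 format (the class name follows the VCam example; the 640x480 @ 30 fps values are placeholders):

    // Offer one uncompressed RGB24 video format from the output pin.
    HRESULT CVCamStream::GetMediaType(int iPosition, CMediaType *pmt)
    {
        if (iPosition < 0) return E_INVALIDARG;
        if (iPosition > 0) return VFW_S_NO_MORE_ITEMS;   // only one format offered

        VIDEOINFOHEADER *pvi = reinterpret_cast<VIDEOINFOHEADER*>(
            pmt->AllocFormatBuffer(sizeof(VIDEOINFOHEADER)));
        if (pvi == nullptr) return E_OUTOFMEMORY;
        ZeroMemory(pvi, sizeof(VIDEOINFOHEADER));

        pvi->bmiHeader.biSize        = sizeof(BITMAPINFOHEADER);
        pvi->bmiHeader.biWidth       = 640;
        pvi->bmiHeader.biHeight      = 480;
        pvi->bmiHeader.biPlanes      = 1;
        pvi->bmiHeader.biBitCount    = 24;               // RGB24
        pvi->bmiHeader.biCompression = BI_RGB;
        pvi->bmiHeader.biSizeImage   = GetBitmapSize(&pvi->bmiHeader);
        pvi->AvgTimePerFrame         = 10000000 / 30;    // 30 fps in 100-ns units

        pmt->SetType(&MEDIATYPE_Video);
        pmt->SetSubtype(&MEDIASUBTYPE_RGB24);
        pmt->SetFormatType(&FORMAT_VideoInfo);
        pmt->SetTemporalCompression(FALSE);
        pmt->SetSampleSize(pvi->bmiHeader.biSizeImage);
        return S_OK;
    }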
A solution you might be thinking of is to embed a fully featured graph internally, inject your MP4 chunks into it, have that pipeline parse and decode them and deliver the frames to your custom renderer, and then re-expose those frames from your virtual device. This might be a good design, though it assumes a certain understanding of how filters work internally.
2 - Your device is typically treated as, and expected to be, a live source, which means that you deliver video in real time and frames are not necessarily time stamped. You can still put time stamps there, and yes, you definitely need to extract the time stamps from your original media (or have that done by the internal graph mentioned in item 1 above); however, be prepared for applications to strip the time stamps, especially for preview purposes, since the source is "live".
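For the SetTime() part, a minimal sketch of stamping samples inside FillBuffer() with a fixed frame duration; m_iFrameNumber is an assumed member counter, and a real implementation would derive the times from the MP4 stream instead:

    HRESULT CVCamStream::FillBuffer(IMediaSample *pSample)
    {
        BYTE *pData = nullptr;
        HRESULT hr = pSample->GetPointer(&pData);
        if (FAILED(hr)) return hr;

        // ... produce/copy one video frame into pData here ...

        const REFERENCE_TIME rtFrameLength = 10000000 / 30;   // 100-ns units, 30 fps
        REFERENCE_TIME rtStart = m_iFrameNumber * rtFrameLength;
        REFERENCE_TIME rtStop  = rtStart + rtFrameLength;
        m_iFrameNumber++;

        pSample->SetTime(&rtStart, &rtStop);
        pSample->SetSyncPoint(TRUE);    // every uncompressed frame is a key frame
        return S_OK;
    }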
3 - Getting back to audio: you cannot implement audio on the same virtual device. Well, you can, and such a filter might even work in a custom-built graph, but it is not going to work with applications. They will look for a separate audio device, and if you implement one, they will instantiate it separately. So you are expected to implement both a virtual video source and a virtual audio source, and handle the synchronization between them behind the scenes. This is where time stamps become important: by providing them correctly you will keep lip sync in the live session matching what it originally was in the media file you are streaming from.

Encode WebCam frames with H.264 on .NET

What I want to do is the following procedure:
Get a frame from the Webcam.
Encode it with an H264 encoder.
Create a packet with that frame with my own "protocol" to send it via UDP.
Receive it and decode it...
It would be a live stream.
I just need help with the second step.
I'm retrieving camera images with the AForge framework.
I don't want to write frames to files and then decode them; I guess that would be very slow.
I would like to handle the encoded frames in memory and then create the packets to be sent.
I need to use an open-source encoder. I already tried x264, following this example:
How does one encode a series of images into H264 using the x264 C API?
but it seems to only work on Linux, or at least that's what I concluded after seeing around 50 errors when trying to compile the example with Visual C++ 2010.
I have to make clear that I already did a lot of research (a week of reading) before writing this, but couldn't find a (simple) way to do it.
I know there is the RTMP protocol, but the video stream will only ever be watched by one person at a time, and RTMP is more oriented towards streaming to many people. I also already streamed with an Adobe Flash application I made, but it was too laggy.
I would also like some advice on whether it is OK to send frames one by one, or whether it would be better to send several of them within each packet.
I hope that at least someone can point me in the right direction.
My English is not great; apologies. :P
PS: It doesn't have to be in .NET; it can be in any language as long as it works on Windows.
Many thanks in advance.
You could try your approach using Microsoft's DirectShow technology. There is an open-source x264 wrapper filter available for download from Monogram.
If you download the filter, you need to register it with the OS using regsvr32. I would suggest doing some quick testing to find out whether this approach is feasible: use the GraphEdit tool to connect your webcam to the encoder and have a look at the configuration options.
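If you later want to build the same graph programmatically instead of in GraphEdit, here is a minimal sketch of locating the registered encoder by friendly name; the "MONOGRAM" substring is an assumption, so check the exact name GraphEdit displays for the filter:

    #include <dshow.h>
    #include <cwchar>

    // Look through the "DirectShow Filters" category for a filter whose
    // friendly name contains the given substring and instantiate it.
    HRESULT FindFilterByName(const wchar_t* nameFragment, IBaseFilter** ppFilter)
    {
        *ppFilter = nullptr;
        ICreateDevEnum* pDevEnum = nullptr;
        HRESULT hr = CoCreateInstance(CLSID_SystemDeviceEnum, nullptr,
                                      CLSCTX_INPROC_SERVER, IID_PPV_ARGS(&pDevEnum));
        if (FAILED(hr)) return hr;

        IEnumMoniker* pEnum = nullptr;
        hr = pDevEnum->CreateClassEnumerator(CLSID_LegacyAmFilterCategory, &pEnum, 0);
        pDevEnum->Release();
        if (hr != S_OK) return E_FAIL;                // category empty or error

        IMoniker* pMoniker = nullptr;
        while (*ppFilter == nullptr && pEnum->Next(1, &pMoniker, nullptr) == S_OK)
        {
            IPropertyBag* pBag = nullptr;
            if (SUCCEEDED(pMoniker->BindToStorage(nullptr, nullptr, IID_PPV_ARGS(&pBag))))
            {
                VARIANT var; VariantInit(&var);
                if (SUCCEEDED(pBag->Read(L"FriendlyName", &var, nullptr)) &&
                    wcsstr(var.bstrVal, nameFragment) != nullptr)
                {
                    pMoniker->BindToObject(nullptr, nullptr, IID_PPV_ARGS(ppFilter));
                }
                VariantClear(&var);
                pBag->Release();
            }
            pMoniker->Release();
        }
        pEnum->Release();
        return *ppFilter ? S_OK : E_FAIL;
    }

The returned filter can then be added to your graph with IFilterGraph::AddFilter and connected between the capture source and whatever sends the data out.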
I would also like some advice on whether it is OK to send frames one by one, or whether it would be better to send several of them within each packet.
This really depends on the required latency: the more frames you package, the less header overhead, but the higher the latency, since you have to wait for multiple frames to be encoded before you can send them. For live streaming, latency should be kept to a minimum, and the typical protocols used are RTP/UDP. This implies that your maximum packet size is limited by the MTU of the network, which often requires IDR frames to be fragmented and sent in multiple packets.
My advice would be not to worry about sending more frames in one packet until/unless you have a reason to. This is more often necessary with audio streaming, since the header size (e.g. IP + UDP + RTP) is considered big in relation to the audio payload.
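If you do end up having to fragment large IDR NAL units for RTP as described above, here is a minimal sketch of RFC 6184 FU-A fragmentation (the 1400-byte payload limit is a placeholder MTU allowance; RTP headers, sequence numbers and marker bits are handled elsewhere):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Split one H.264 NAL unit (without start code) into FU-A payloads.
    std::vector<std::vector<uint8_t>> FragmentFUA(const uint8_t* nal, size_t size,
                                                  size_t maxPayload = 1400)
    {
        std::vector<std::vector<uint8_t>> packets;
        if (size <= maxPayload) {                     // fits in a single packet
            packets.emplace_back(nal, nal + size);
            return packets;
        }

        uint8_t nalHeader   = nal[0];
        uint8_t fuIndicator = (nalHeader & 0xE0) | 28;   // keep F/NRI bits, type 28 = FU-A
        uint8_t fuType      = nalHeader & 0x1F;

        size_t offset = 1;                            // skip the original NAL header
        while (offset < size) {
            size_t chunk = std::min(maxPayload - 2, size - offset);
            uint8_t fuHeader = fuType;
            if (offset == 1)            fuHeader |= 0x80;   // S bit: first fragment
            if (offset + chunk == size) fuHeader |= 0x40;   // E bit: last fragment

            std::vector<uint8_t> pkt = { fuIndicator, fuHeader };
            pkt.insert(pkt.end(), nal + offset, nal + offset + chunk);
            packets.push_back(std::move(pkt));
            offset += chunk;
        }
        return packets;
    }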
