Log-mel spectrogram for a 1-second audio dataset?

Can I use the code below for a 1-second audio dataset, which I need for keyword spotting on a microcontroller? The code was written for a 30-second audio dataset used for acoustic scene classification.


VMAF-like quality indicator with single video file

I am looking for a VMAF-like objective, user-perception video quality scanner that works at scale. The use case is a Twitch-like streaming service where videos are eligible to be played on demand after the live stream completes. We want to ensure some level of quality in the on-demand library without having to view every live stream. We encode the live streams into HLS playlists after the stream completes, but using VMAF to compare the post-stream mp4 to the post-encoded mp4s in HLS doesn't provide the information we need, because the original mp4 could itself be of low quality due to bandwidth issues during the live stream.
I am not sure I understand the question correctly. You want to measure the output quality of the transcoded video without using a reference video. Is that correct?
VMAF is a full-reference quality metric, which means it estimates how much subjective distortion was introduced into the transcoded video compared to the source video. It always needs a reference input video.
I think what you are looking for is a no-reference quality metric, which measures the "quality" of a video without a reference source video. There are many no-reference quality metrics, each intended to capture a different distortion artifact in the output video, for example blurring or blocking. You can then build an aggregate metric from these values, depending on what you want to measure.
So, if I were you, I would start by searching for no-reference quality metrics, and then look for tools that can measure them efficiently. Hope that answers your question.
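As a rough illustration of what a no-reference blur indicator can look like, here is a minimal sketch that scores a single grayscale frame by the variance of its Laplacian. This is only one simple indicator chosen for illustration, not a replacement for VMAF and not a method from the question; the function name and the use of NumPy are assumptions, and a useful threshold would have to be calibrated on your own content.

```python
import numpy as np

def laplacian_variance(frame: np.ndarray) -> float:
    """Variance of a 4-neighbour Laplacian over a 2-D grayscale frame.

    Lower values suggest fewer sharp edges, i.e. a blurrier frame.
    """
    f = frame.astype(np.float64)
    # Discrete Laplacian on the interior pixels (borders are skipped).
    lap = (
        f[:-2, 1:-1] + f[2:, 1:-1] + f[1:-1, :-2] + f[1:-1, 2:]
        - 4.0 * f[1:-1, 1:-1]
    )
    return float(lap.var())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sharp = rng.random((64, 64))
    # Crude 3x3 box blur built from shifted copies of the frame.
    blurred = sum(
        np.roll(np.roll(sharp, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
    ) / 9.0
    print(laplacian_variance(sharp) > laplacian_variance(blurred))
```

Per-frame scores like this could be averaged over sampled frames and combined with other indicators, such as a blockiness measure, into the kind of aggregate metric described above.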

Looking for an audio analysis library for information extraction

Hey guys, I'm a beginner in audio analysis and I'm trying to find a library that gives me insights such as amplitude and sound classification, and that can detect background noise. I have tried pAura/pyAudioAnalysis (pAura: Python AUdio Recording and Analysis), which analyzes some of this information for live recordings. Is there a good audio analysis library on GitHub?
There are many. Search GitHub for the DTLN model for audio noise removal. DTLN is a lightweight pretrained noise-removal model.
If you're not planning to use any models, try to solve this problem with audio signal processing. Use audio features such as the zero-crossing rate for noise/speech activity detection.
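The zero-crossing rate mentioned above can be computed per frame with a few lines of NumPy. This is a minimal sketch; the frame length and hop (400 and 160 samples, i.e. 25 ms and 10 ms at an assumed 16 kHz sample rate) are illustrative choices, and any threshold for separating noise from speech would need tuning on real recordings.

```python
import numpy as np

def zero_crossing_rate(signal, frame_len=400, hop=160):
    """Fraction of adjacent sample pairs that change sign, per frame."""
    signal = np.asarray(signal, dtype=np.float64)
    rates = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        signs = np.signbit(frame)
        crossings = np.count_nonzero(signs[1:] != signs[:-1])
        rates.append(crossings / (frame_len - 1))
    return np.array(rates)

if __name__ == "__main__":
    sr = 16_000  # assumed sample rate in Hz
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 100 * t)                    # low-frequency tone
    noise = np.random.default_rng(0).standard_normal(sr)  # white noise
    print(zero_crossing_rate(tone).mean(), zero_crossing_rate(noise).mean())
```

Broadband noise tends to give a much higher rate than a low-frequency tone or voiced speech, which is why the feature is useful as a cheap first-pass detector.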

Convert jitter from RTP timestamp units to milliseconds

I have a video conference app and I want to display the interarrival jitter to the user. I am getting this value from FFmpeg, and it follows RFC 3550 Appendix A.8, so it is expressed in timestamp units. I am not sure how to convert it. I am currently dividing the jitter by 90000 (the video stream timebase). Is this correct?
Similar question: Jitter units for Live555
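For reference, RFC 3550 expresses interarrival jitter in ticks of the RTP media clock, so dividing by the clock rate gives seconds and multiplying that by 1000 gives milliseconds. A minimal sketch, assuming the 90 kHz clock commonly used for video payloads (the function name is hypothetical; check the clock rate that your stream actually negotiates):

```python
def jitter_to_ms(jitter_units: float, clock_rate_hz: int = 90_000) -> float:
    """Convert RFC 3550 interarrival jitter from timestamp units to milliseconds."""
    # ticks / (ticks per second) = seconds; times 1000 = milliseconds
    return jitter_units * 1000.0 / clock_rate_hz

if __name__ == "__main__":
    print(jitter_to_ms(450))        # 450 ticks of a 90 kHz video clock
    print(jitter_to_ms(80, 8_000))  # 80 ticks of an 8 kHz audio clock
```

With a 90 kHz clock this is the same as dividing the raw value by 90 to get milliseconds; dividing by 90000 yields seconds.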

Delay in video in DirectShow graph

I'm seeing a noticeable video delay, which is causing the resulting audio/video sync to be off for a capture card that I'm testing. My graph topology is as follows.
Video Source -> Sample Grabber -> Null Renderer
Audio Source -> Sample Grabber -> Null Renderer
The video samples are compressed using H264, and the audio is compressed using FAAC. This topology and application code work for capture cards that I've used in the past, but I see this delay with the card I'm currently testing. Naturally, I thought it was related to the card itself. So I checked and found that there is no video/audio desync when using Open Broadcaster, VLC, or the same graph in GraphEdit to capture with this card.
This indicates to me that the problem is related to how I'm constructing the graph. I then tried adjusting the buffer sizes using IAMBufferNegotiation, as well as SetStreamSyncOffset without success.
The sync is almost perfect if I apply a 500 ms lag to the video (e.g. videoTimeStamp = videoTimeStamp - 500). This is strange because I would expect to see more latency in the audio than video.
Video and audio synchronization is all about time stamps. The video or audio leg might delay processing of data, but it is the time stamps that carry the original and intended sync.
Potential causes include:
The video and audio sources time-stamp data independently and deliver unsynchronized data; this does not look like your case
You ignore the time stamps and use the actual arrival time of samples at your sample grabber, which is incorrect
Another filter in between, such as a decoder, incorrectly re-stamps data while processing it
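To check which of these applies, it can help to compare the stream time stamps themselves rather than arrival times. The sketch below assumes you have logged the start time stamp of the first video and first audio sample; DirectShow time stamps (REFERENCE_TIME) are in 100-nanosecond units, and the function name is hypothetical.

```python
REFERENCE_TIME_UNITS_PER_MS = 10_000  # REFERENCE_TIME counts 100 ns units

def av_offset_ms(video_start: int, audio_start: int) -> float:
    """Offset between video and audio start time stamps, in milliseconds.

    Positive means the video time stamps are later than the audio ones.
    """
    return (video_start - audio_start) / REFERENCE_TIME_UNITS_PER_MS

if __name__ == "__main__":
    # Example values only: video stamped 0.5 s later than audio.
    print(av_offset_ms(5_000_000, 0))
```

If the logged time stamps are aligned but the encoded output is not, the offset is more likely introduced by how the application handles the samples than by the sources.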

Make DirectShow play sound from a memory buffer

I want to play sound "on-demand". A simple drum machine is what I want to program.
Is it possible to make DirectShow read from a memory buffer (an object created in C++)?
I am thinking:
Create a buffer of, let's say, 40000 positions of type double (I don't know the actual data type used for sound, so I might be wrong with double).
40000 positions can be 1 second of playback.
The DirectShow object is supposed to read this buffer position by position, over and over again, and the buffer will contain the actual output values of the sound. For example (a sine-looking output):
{0, 0.4, 0.7, 0.9, 0.99, 0.9, 0.7, 0.4, 0, -0.4, -0.7, -0.9, -0.99, -0.9, -0.7, -0.4, 0}
The resolution of this sound sequence is probably not that good, but it is only to display what I mean.
Is this possible? I cannot find any examples or information about it on Google.
When working with DirectShow and streaming video (USB camera), I used something called Sample Grabber, which called a method for every frame from the camera. I am looking for something similar, but for music, and something that is called before the music is played.
You want to stream your own data through the pipeline, and injecting data into a DirectShow pipeline is possible.
By design, the outer DirectShow interface does not provide access to streamed data. Controlling code builds the topology, connects filters, sets them up, and controls the state of the pipeline. All data is streamed behind the scenes: filters pass pieces of data from one to another, and this adds up to data streaming.
Sample Grabber is a helper filter that lets you grab a copy of the data passing through a certain point in the graph. Because payload data is otherwise not available to controlling code, Sample Grabber gained popularity, especially for grabbing video frames out of the "inaccessible" stream, whether live or file-backed playback.
Now, when you want to do the opposite and put your own data into the pipeline, the Sample Grabber concept does not work. Taking a copy of data is one thing, and proactively putting your own data into the stream is a different one.
To inject your own data, you typically put your own custom filter, one that generates the data, into the pipeline. You want to generate PCM audio data. You choose where you take it from: generation, reading from a file, memory, network, looping, whatever. You fill buffers, you add time stamps, and you deliver the audio buffers to the downstream filters. A typical starting point is the PushSource Filters Sample, which introduces the concept of a filter producing video data. In a similar way, you want to produce PCM audio data.
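The "fill buffers, add time stamps" step can be sketched outside DirectShow to show the data layout. The example below is in Python for brevity, whereas a real source filter would be written in C++ against the DirectShow base classes. It assumes 16-bit mono PCM at 44.1 kHz, which is a common choice rather than something stated in the question, and all names are hypothetical. Time stamps are in the 100-nanosecond units DirectShow uses.

```python
import numpy as np

SAMPLE_RATE = 44_100             # assumed sample rate in Hz
REFTIME_PER_SECOND = 10_000_000  # DirectShow time stamps use 100 ns units

def sine_pcm16(freq_hz: float, seconds: float, amplitude: float = 0.8) -> np.ndarray:
    """Generate a mono sine wave as signed 16-bit PCM samples."""
    n = int(round(SAMPLE_RATE * seconds))
    t = np.arange(n) / SAMPLE_RATE
    samples = amplitude * np.sin(2 * np.pi * freq_hz * t)
    # Scale the [-1, 1] floating-point signal to the int16 range.
    return np.round(samples * 32767).astype(np.int16)

def chunk_with_timestamps(pcm: np.ndarray, chunk_samples: int = 4410):
    """Yield (start_time, stop_time, raw_bytes) for consecutive audio buffers."""
    for start in range(0, len(pcm), chunk_samples):
        chunk = pcm[start:start + chunk_samples]
        t_start = start * REFTIME_PER_SECOND // SAMPLE_RATE
        t_stop = (start + len(chunk)) * REFTIME_PER_SECOND // SAMPLE_RATE
        yield t_start, t_stop, chunk.tobytes()

if __name__ == "__main__":
    pcm = sine_pcm16(440.0, 1.0)
    for t_start, t_stop, data in chunk_with_timestamps(pcm):
        print(t_start, t_stop, len(data))
```

Note that the samples are integers rather than doubles: floating-point values in the range -1 to 1, like the sine sequence in the question, are scaled to the 16-bit range before being handed downstream.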
A related question:
How do I inject custom audio buffers into a DirectX filter graph using DSPACK?
