Watson speech-to-text: Narrowband producing better results than Broadband? - ffmpeg

I'm using IBM Watson to transcribe a video library that we have. I'm currently doing initial research into its efficacy and accuracy.
The videos in question have OK to very good sound quality and based on Watson documentation I should be using the Broadband model to transcribe them.
I've tested both Narrowband and Broadband, however, and I'm finding that Narrowband is always either slightly better or, in some cases, a lot better (up to 10%).
Has anyone else done any similar testing? It's contrary to the documentation so I'm a little reluctant to just go ahead and use Narrowband for everything, but I may have to based on the results.
I'm using ffmpeg to convert the videos to audio files to send to Watson, and the audio files show 48 kHz sampling rates, which again means I should be using Broadband and getting better results with it.
Hoping someone out there has done similar research and can help.
Thanks in advance.
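For reference, a minimal sketch of the ffmpeg extraction step, assuming the documented Watson sample rates (16 kHz minimum for Broadband, 8 kHz for Narrowband); the file names are placeholders:

```python
import subprocess

def ffmpeg_extract_cmd(video_path, audio_path, sample_rate):
    """Build an ffmpeg command that extracts mono PCM audio at the given rate."""
    return [
        "ffmpeg", "-i", video_path,
        "-vn",                    # drop the video stream
        "-ac", "1",               # mix down to mono
        "-ar", str(sample_rate),  # resample to the target rate
        audio_path,
    ]

# Broadband expects >= 16 kHz audio, Narrowband 8 kHz.
broadband_cmd = ffmpeg_extract_cmd("talk.mp4", "talk_bb.wav", 16000)
narrowband_cmd = ffmpeg_extract_cmd("talk.mp4", "talk_nb.wav", 8000)
# subprocess.run(broadband_cmd, check=True)  # run when ffmpeg is on PATH
```

Resampling 48 kHz audio down to 8 kHz before sending it to Narrowband discards everything above 4 kHz, which is one reason the result in the accepted answer below matters: if the source never had content up there, nothing is lost.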

Do you know what the original sampling rate of the audio is? Maybe it was recorded at 8 kHz originally and then upsampled. If that were the case, the higher frequencies would be missing and the right model to use would be the Narrowband model. You can see this in a spectrogram, using for example Audacity (https://github.com/audacity/audacity).
Another explanation would be that the n-grams in your videos are better predicted by the language model that the Narrowband system uses. I suggest sharing your audio file with the Watson support team to get further insight (you can go to the Bluemix portal and then click on "support").
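One way to check for upsampled 8 kHz audio without opening Audacity is to measure how much spectral energy sits above 4 kHz (the Nyquist limit of an 8 kHz recording); a value near zero suggests the highs were never there. A library-free sketch using a naive DFT on a synthetic tone (a real check would FFT windows of the actual decoded audio):

```python
import math

def dft_magnitudes(samples):
    """Naive DFT magnitude spectrum (fine for short illustrative windows)."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        re = sum(samples[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(samples[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags

def high_band_ratio(samples, sample_rate, cutoff_hz=4000.0):
    """Fraction of spectral energy above cutoff_hz."""
    mags = dft_magnitudes(samples)
    bin_hz = sample_rate / len(samples)
    total = sum(m * m for m in mags) or 1.0
    high = sum(m * m for k, m in enumerate(mags) if k * bin_hz > cutoff_hz)
    return high / total

# Synthetic check: a 1 kHz tone sampled at 16 kHz has essentially no
# energy above 4 kHz, so the ratio is close to 0.
sr = 16000
tone = [math.sin(2 * math.pi * 1000 * t / sr) for t in range(512)]
print(high_band_ratio(tone, sr))
```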

Related

What I can use to process video sentiment analysis on a video stream?

I want to make an app, maybe a web app, that performs sentiment analysis on a video stream during a web interview. Do you have any recommendations?
I want to mention that I'm comfortable with Java and C#.
The solution might be different for different use cases. You mentioned a web interview: are you looking for real-time processing? How do you capture the video, and in which format? Do you need spoken-word analysis or facial-emotion analysis? And what is your budget?

Is OpenCV image similarity comparison reliable for objects? Is there any cost/benefit quality alternative to open-source APIs?

I'm trying to choose an API to match object images taken with a cell phone against a list of images in a file system. The point is, I'm afraid that I won't get reliable results and it won't be worth it to lose time on this feature.
I would really appreciate some advice regarding this topic.
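Not a verdict on OpenCV's reliability, but to illustrate the kind of comparison involved: a perceptual "average hash" reduces each image to a bit string (each bit says whether a pixel is brighter than the image mean) and compares copies by Hamming distance. A toy sketch on tiny grayscale images given as nested lists; a real pipeline would first decode and downscale the photos, e.g. with OpenCV, and feature matching (ORB/SIFT) would be more robust to viewpoint changes:

```python
def average_hash(pixels):
    """Perceptual hash of a small grayscale image (list of rows of intensities)."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming(h1, h2):
    """Number of differing bits; a small distance suggests the same image."""
    return sum(a != b for a, b in zip(h1, h2))

img = [[10, 200], [220, 15]]
brighter = [[30, 210], [230, 35]]   # same layout, globally brighter
different = [[200, 10], [15, 220]]  # inverted layout

print(hamming(average_hash(img), average_hash(brighter)))   # identical hashes
print(hamming(average_hash(img), average_hash(different)))  # all bits differ
```

Because the hash thresholds against the image's own mean, it is insensitive to global brightness changes, which is why `img` and `brighter` collide.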

Smart video thumbnail generator algorithm

Hello, I'm a Java developer and I'm part of a video-on-demand website team.
I'm currently doing research on how to implement a back-end component that we are planning to build; the component is expected to automatically generate a meaningful thumbnail representing the content of the videos like the algorithm used in YouTube to generate default thumbnails.
However, I can't seem to find any good open-source or paid implementation that can do so, and building the algorithm from scratch is very complicated and needs a lot of time that I don't think the company is willing to invest at the current stage (maybe in the future, though).
I would appreciate if someone can refer to any implementation that can help me or even vendors that sell an implementation or a product that can serve my component's objective.
Thanks!
As explained on the Google Research blog:
https://research.googleblog.com/2015/10/improving-youtube-video-thumbnails-with.html
The key component is a convolutional neural network used to predict a quality score for each sampled frame.
There are many open-source CNN implementations, such as Caffe or TensorFlow. The main effort is preparing the training data.
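As a much cruder stand-in for the CNN scorer, the same sample-and-score pipeline can be sketched with intensity variance as the score (blank or flat frames score low). Frames here are toy grayscale arrays; a real system would decode frames with something like FFmpeg and substitute a trained model for `frame_score`:

```python
def frame_score(frame):
    """Crude visual-interest score: variance of pixel intensities.
    YouTube's system uses a trained CNN here instead."""
    flat = [p for row in frame for p in row]
    mean = sum(flat) / len(flat)
    return sum((p - mean) ** 2 for p in flat) / len(flat)

def pick_thumbnail(frames):
    """Return the index of the highest-scoring sampled frame."""
    return max(range(len(frames)), key=lambda i: frame_score(frames[i]))

black = [[0, 0], [0, 0]]              # blank frame, variance 0
flat_gray = [[128, 128], [128, 128]]  # uniform frame, variance 0
busy = [[0, 255], [255, 0]]           # high-contrast frame
print(pick_thumbnail([black, flat_gray, busy]))  # index of the busy frame
```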

Rhythm detection through analyzing the audio spectrum

I'm building a rhythm-based game and facing a lot of problems with rhythm detection. I receive the current spectrum of the playing song as a float array with 512 floats: 256 each for the left and right channels. FFT output is also available. But I have no idea how to work with that data; I've made some experiments with visualizing it, but that gave me very little information. I've googled for ready-made algorithms but found nothing. Can someone please point me to references, materials, or articles on rhythm detection and working with an audio spectrum? Code would also be very helpful. Thanks.
Maybe you didn't use the right search terms. Try googling 'tempo detection' or 'beat detection', together with 'code' or 'algorithm'. There are lots of papers, references, code examples, etc.
Just a few hits:
http://www.cs.princeton.edu/~lieber/cos325/final/
http://www.clear.rice.edu/elec301/Projects01/beat_sync/beatalgo.html
You might want to check out the source and project report for the Dancing Monkeys project. Dancing Monkeys automatically generates step files for DDR, and it does so using some rather sophisticated beat detection. It's written in MATLAB.
You should have a look at the beat spectrum algorithm: http://www.rotorbrain.com/foote/papers/icme2001/icmehtml.htm.
It extracts information about rhythm and musical structure by computing the similarity of the spectrograms of small samples. It is relatively easy to implement and retrieves robust information.
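A common starting point behind those references is energy-based onset detection: split the signal into windows, flag windows whose energy jumps well above the average, and estimate tempo from the spacing between onsets. A sketch on a synthetic click track (real music needs per-band analysis and smarter peak picking than a global threshold):

```python
def onset_windows(samples, window=256, threshold=2.0):
    """Indices of windows whose energy exceeds threshold x the mean window energy."""
    energies = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        energies.append(sum(s * s for s in chunk))
    mean_e = sum(energies) / len(energies)
    return [i for i, e in enumerate(energies) if e > threshold * mean_e]

def estimate_bpm(onsets, window, sample_rate):
    """Tempo from the median gap between consecutive onset windows."""
    gaps = sorted(b - a for a, b in zip(onsets, onsets[1:]))
    median_gap = gaps[len(gaps) // 2]
    seconds_per_beat = median_gap * window / sample_rate
    return 60.0 / seconds_per_beat

# Synthetic click track: a short burst every 0.5 s at 8192 Hz -> 120 BPM.
sr, window = 8192, 256
samples = [0.0] * (sr * 4)
for beat_start in range(0, len(samples), sr // 2):
    for t in range(window):
        samples[beat_start + t] = 1.0
bpm = estimate_bpm(onset_windows(samples, window), window, sr)
print(round(bpm))  # 120
```

Since you already receive spectra rather than raw samples, you can apply the same idea per frequency band: track each band's magnitude over time and look for sudden jumps, which is essentially what multi-band beat detectors do.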

How do I programmatically sort through media?

I have six servers with an aggregated storage capacity of 175 TB, all hosting different sorts of media. A lot of the media is duplicated, with copies stored in different formats, so what I need is a library or something I can use to read the tags in the media and decide which is the best copy available. For example, some of my media is Japanese content of which I have DVD and now Blu-ray rips. This content sometimes has "hardsubs", i.e. subtitles that are encoded into the video, and "softsubs", which are subtitles rendered on top of the raw video when it plays. I would like to be able to find all copies of a given rip and compare them by resolution, whether or not they have softsubs, and which audio format and quality they use.
Therefore, can anyone suggest a library I can incorporate into my program to do this?
EDIT: I forgot to mention, the distribution server mounts the other servers as drives and is running Windows Server, so I will probably code the solution in C#. All the media is for my own legal use; I have so many copies because some of the content is in other formats for other players. For example, I have some of my Blu-rays re-encoded to Xvid for my Xbox, since it can't play Blu-ray.
When this is done, I plan to open source the code since there doesn't seem to be anything like this already and I'm sure it can help someone else.
I don't know of any libraries, but as I try to think about how I'd programmatically approach it, I come up with this:
It is the keyframes that are most likely to be comparable. Keyframes occur regularly, but more importantly they occur during large scene changes. These large changes will be common across many different formats, and those frames can be compared as still images. You may find a still-image comparison library more easily.
Of course, you'll still have to find something to read all the different formats, but it's a start and a fun exercise to think about, even if the coding time involved is far beyond my one-person threshold.
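The ranking step the question describes (resolution first, then softsubs over hardsubs, then audio) is straightforward once the tags are read, e.g. via MediaInfo or ffprobe; a sketch with hypothetical fields standing in for whatever the tag reader returns:

```python
from dataclasses import dataclass

@dataclass
class MediaCopy:
    path: str
    height: int           # vertical resolution, e.g. 480, 1080
    has_softsubs: bool    # rendered subs preferred over burned-in hardsubs
    audio_channels: int
    audio_bitrate_kbps: int

def quality_key(copy):
    """Sort key implementing the preference order from the question:
    resolution, then softsubs, then audio channels and bitrate."""
    return (copy.height, copy.has_softsubs,
            copy.audio_channels, copy.audio_bitrate_kbps)

def best_copy(copies):
    """Pick the highest-quality copy in a group of duplicates."""
    return max(copies, key=quality_key)

dvd = MediaCopy("show_dvd.mkv", 480, False, 2, 192)
bluray = MediaCopy("show_bd.mkv", 1080, True, 6, 640)
xvid = MediaCopy("show_xvid.avi", 480, True, 2, 128)
print(best_copy([dvd, bluray, xvid]).path)
```

Grouping the duplicates in the first place is the harder part; the keyframe comparison described above (or a perceptual hash of those keyframes) could supply the "these files are the same title" signal before this ranking runs.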