Music data format for polyphonic music visualization with Processing - processing

I am interested in visualizing melodic contours of polyphonic music with Processing. It is still unclear to me, though, what the most convenient format for imported data (pitch and onset/duration) would be: tabular (e.g. Humdrum), XML (e.g. MEI, musicXML), or JSON? Maybe another format?
Any suggestions/thoughts on this would be really helpful! Thanks.

Using MIDI files would be optimal, because of the combination of those 3 reasons
MIDI is widely used. You can export a .midi file from pratically any score editor plus you can create your own by recording the input from a midi instrument.
You can already find .midi files of iconic polyphonic music on the web (Bach's counterpoints, Reinaissance vocal music, etc)
It just contain music/playback information. It doesn't contain notation information like music XML. So if you just want to see pitches and note position/duration (like in this video) then .midi will contain just what you need
You can use the Java Midi Package in Processing and it already contains everything you need to read the MIDI files.
While other formats might also apply for 1, 2, 3 or 4 only MIDI applies for all of them.

The best answer I can give you is that you should put together a simple hello world program that tests out each format and see which one you like the best.
In the end, you're the one that has to deal with the code, so only you can really decide on the best format.

Related

Returnn Switchboard data processing

Could anybody give me pointers on how to process Switchboard dataset for training with RETURNN? I did see BlissDataset class that seems to be designed for switchboard, but it's not clear to me what I should include in the paths given in the example:
Example:
./tools/dump-dataset.py "
{'class':'BlissDataset',
'path': '/u/tuske/work/ASR/switchboard/corpus/xml/train.corpus.gz',
'bpe_file': '/u/zeyer/setups/switchboard/subwords/swb-bpe-codes',
'vocab_file': '/u/zeyer/setups/switchboard/subwords/swb-vocab'}"
The switchboard dataset has several folders with audios, i.e. swb1_d2/data/*.sph and transcripts swb1_LDC97S62/swb_ms98_transcriptions/**/*
I'm not quite sure how to proceed with this to get a dataset that can be used to train RETURNN.
At our group (RWTH Aachen University), we use the config as it was published on GitHub. As you see, this one uses ExternSprintDataset. That dataset uses
The implementation uses Sprint (publicly called RWTH ASR (RASR), see here) as an external tool (ran in a subprocess) to handle the data (feature extraction, etc). Sprint gets a Bliss XML file which describes all the segments with path to audio and audio offsets and transcriptions, and also it gets further configs for the feature extraction and maybe other things. There is an open source version of RASR which should work but it might be a bit involved to get this to work.
The BlissDataset was planned to be a simpler replacement for that. However, the implementation is incomplete. Also, you still would need to generate the Bliss XML by yourself in some way (we have used some own internal scripts to prepare that based on the official LDC data).
So, unfortunately, there is no simple way yet. Actually, I think the easiest way would be to come up with yet another custom format, which might be similar to the LibriSpeechDataset implementation, or maybe just the same, and then you could just reuse LibriSpeechDataset, or at least parts of that. That dataset implementation takes the data in some zip format which contains the transcripts in txt files and the audio in ogg or wav files. It uses librosa to do MFCC feature extraction (or also other feature types). I planned to implement that for Switchboard, and then reproduce the results, however I did not have time yet and not sure when I will get to that. But if you want to try that on your own, I will be happy to help you however I can. The starting point would be to look at LibriSpeechDataset and understand how the format of that looks like.

How to match times in audio to text

I have a bunch of audio files and the text that corresponds to the speech in the audio files. I need to be able find the specific times in the audio where each word starts. I trying to use the C# SpeechRecognizer library but I can't find a way to get time data for the words it recognizes. Is it possible with this library? Is there a different tool/library that can do this?
I use CMU Sphinx for this. (Java though, not sure if they provide C# API.) It allows me to access internals of the recognized text, in which I am able to find start/stop times.
https://cmusphinx.github.io/
https://cmusphinx.github.io/wiki/tutorialsphinx4/

Censor Plugin or Extension for VLC Media Player

I'm having an idea to create a Censor Plugin/Extension for VLC Player..
Problem Scenario :
An Adult-Scene for 1 minute in a nice movie makes it not watchable with Family.
My Solution :
Create a Plugin/Extension which does the following
Reads time positions from a file similar to subtitle files
Skip these time positions (which are adult or inappropriate) when playing
Help i needed :
I searched in Google and in videolan website, But can't find an exact solution
Are there already similar Plugins available?
Where should i start?
Please help me if you could guys.. thanks..
Same looking for having/developing Exact same solution. This might be helpful to you.
http://code.google.com/p/movie-content-editor/
A similar thing is also available on github:
https://github.com/rdp/sensible-cinema
You may also want to read this discussion thread:
https://forum.videolan.org/viewtopic.php?t=89466
finding great similar answer here
If you chop random bytes out the movie is likely not playable. The player might crash or fail to resynchronize the stream – the video might just stop. Plus, you're gonna have a hard time figuring out where the "adult" bytes are, so to speak.
If you already know where the parts are that you want to cut out, I would edit the file in any of the numerous video editors. Even Windows Movie Maker or iMovie would do the job, and those are easily available on both major OSes.
This is a requested feature for VLC. Not really anything user-friendly out there. Still, VLC offers the possibility to create playlists in a certain format that would mute or skip parts of a file. This is called XSPF. You might be able to figure out the proper format for this.
Also, there's movie-content-editor:
A VLC based editor built in python that allows users to create and use custom filter files to make movies more family friendly. Allows users to have the player automatically mute specific words or skip certain scenes based on the content of those scenes.
And sensible-cinema:
Clean Editing Movie Player allows you watch edited movies by applying delete lists (EDL's) (i.e. "mute out" or "cut out" scenes) to DVD's/files, with preliminary support for also applying them to arbitrary web/internet based players like netflix instant, hulu/hulu plus etc
See also these threads on The VideoLAN Forums:
auto skip unwanted parts of a video
Clearplay-like (content filter) module exists?

Use NSSpeechRecognizer or alternative with audio file instead of microphone input?

Is it possible to use the NSSpeechRecognizer with an pre-recorded audio file instead of direct microphone input?
Or is there any other speech-to-text framework for Objective-C/Cocoa available?
Added:
Rather than using voice at the machine that is running the application external devices (e.g. iPhone) could be used for sending just an recorded audio stream to that desktop application. The desktop Cocoa app then would process and do whatever it's supposed to do using the assigned commands.
Thanks.
I don't see any obvious way to switch the input programmatically, though the "Speech" companion guide's first paragraph in the "Recognizing Speech" section seems to imply other inputs can be used. I think this is meant to be set via System Preferences, though. I'm guessing it uses the primary audio input device selected there.
I suspect, though, you're looking for open-ended speech recognition, which NSSpeechRecognizer is not. If you're looking to transform any pre-recorded audio into text (ie, make a transcript of a recording), you're completely out of luck with NSSpeechRecognizer, as you must give it an array of "commands" to listen for.
Theoretically, you could feed it the whole dictionary, but I don't think that would work since you usually have to give it clear, distinct commands. Its performance would suffer, I would guess, if you gave it a bunch of stuff to analyze for (in real time).
Your best bet is to look at third-party open source solutions. There are a few generalized packages out there (none specifically for Cocoa/Objective-C), but this poses another question: What kind of recognition are you looking for? The two main forms of speech recognition ('trained' is more accurate but less flexible for different voices and the recording environment, whereas 'open' is generally much less accurate).
It'd probably be best if you stated exactly what you're trying to accomplish.

Best image format for uncompressed textures?

I am in the process of selecting an image format that will be used as the storage format for all in-house textures.
The format will be used as a source format from which compressed textures for different platforms and configurations will be generated, and so needs to cover all possible texture types (2D, cube, volymetric, varying number of mip-maps, floating point pixel formats, etc.) and be completely lossless.
In addition the format has to be able to keep a bit of metadata.
Currently a custom format is used for this, but a commonly available format will be easier to work with for the artists since its viewable in most image editors.
I have thought of using DDS, but this format does not support metadata as far as I can see.
All suggestions appreciated!
With your requirements you should stay with your selfmade format. I don't know about any image-format besides DDS that supports volumetric and cube-textures. Unfortunately DDS does not support meta-data.
The closest thing you can find is TIFF. It does not directly support cube-maps or volumetric textures, but it supports any number of sub-images. That way you could re-use the sub-images as slices or cube-sides.
TIFF also has a very good support for custom meta-data. The libtiff image reading/writing library works pretty good. It looks a bit archaic if you come from a OO side, but it gets it's job done.
Nils
When peeking inside various games' resources I found out that most of them store textures (I don't know whether they're compressed or not) in TGA
TIFF would probably be your closest bet for a format which supports arbitrary meta-data and multiple frames, but I think you are better off keeping the assets (in this case, images) separate from how they are converted and utilized in your engine.
Keep images in 32 bit PNG format, and put type- and meta information in XML. That keeps your data human viewable, readable and editable. Obscure custom formats are for engines, not people.
Stick with whatever your artists work with.
If you are a windows/mac shop and use
photoshop stick with .psd
If you are a unix shop and use gimp
stick with .xcf
These formats will store layers and all the stuff your artists need and are used to.
Since your artists will be creating loads of assets make their life as easy as possible,
even if it means to write some extra code.
Put the meta data (whatever it may be) somewhere "along" the images if the native format (psd/xcf) doesn't support it.
For stuff like cube maps, mipmaps (if not generated by the converter) stick to naming guidlines or guidlines on how to put them into one file.
Depending on what tool you use to create the volumetric stuff, just stick with that tools native format.
While writing custom formats for the target is usually a good idea,
writing custom formats for artists results in mayhem...
My experience with DDS is that it is a poorly documented and difficult format to work with and offers few advantages. It is generally simpler to just store a master file for each image class that has references to the source images that make it up ( i.e. 6 faces for a cube map, an arbitrary number of slices for a volume texture ) as well as any other useful meta-data. It's always going to be a good idea to keep the meta-data in a seperate file ( or in a database ) as you do not want to be loading large numbers of images when carryong out searches, populating browsers, etc. It also makes sense to seperate your source image format ( tiff, tga, jpeg, dds ... ) from your "meta-format" ( cube, volume ... ) since you may well find that you need to use lossy compression to support HDR formats or very large source volume data.
Have you tried PNG? http://java.sun.com/javase/6/docs/api/javax/imageio/metadata/doc-files/png_metadata.html
As an alternative solution, maybe spend some time writing a plugin for a Free Image Editor for your file format? I've never done it before, so I don't know the work involved, but there is boatloads of example code out there for you.

Resources