Could somebody point me to a voice alteration algorithm? Preferably in Java or C? Something that I could use to change a stream of recorded vocals into something that sounds like Optimus Prime. (FYI: Optimus Prime is the lead Autobot from Transformers, with a very distinctive-sounding voice... not everybody may know this.) Is there an open source solution?
You can't just change the sample rate. The human voice has formants. What you want to do is move the formants. That should be your line of research.
Read about vocoders and filter banks.
Could you provide a link as an example? I haven't seen the film, so I'm just speculating.
Audacity is an open-source wave editor which includes effect filters - since it's open source you could see what algorithms they use.
Not knowing what it sounded like, I figured it would be a vocoder, but after listening to a few samples, it's definitely not a vocoder (or if it is, it's pretty low in the mix.) It sounds like maybe there's a short, fast delay effect on it, along with some heavy EQ to make it sound kind of like a tiny AM radio, and maybe a little ring modulator. I think that a LOT of the voice actor's voice is coming through relatively intact, so a big part of the sound is just getting your own voice to sound right, and no effects will do that part for you.
All the above info is just me guessing based on having messed around with a lot of guitar pedals over the years, and done some amateur recording and sound-effects making, so I could be way off.
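For what it's worth, here is a minimal Java sketch of that kind of chain (ring modulation plus a short delay) applied to a buffer of mono samples in the range [-1, 1]. The 30 Hz carrier, 20 ms delay and mix level are just guesses to tune by ear, and the "tiny AM radio" EQ stage is left out entirely:

    /**
     * Rough sketch, not a tuned Optimus Prime preset: ring modulation plus a
     * short feedback delay. All constants are guesses to tweak by ear.
     */
    public class RobotVoiceSketch {

        public static float[] process(float[] input, float sampleRate) {
            float carrierHz = 30f;        // ring-mod carrier frequency (assumption)
            float delaySeconds = 0.02f;   // ~20 ms slap-back delay (assumption)
            float delayMix = 0.5f;        // how much delayed signal to feed back in

            int delaySamples = Math.round(delaySeconds * sampleRate);
            float[] out = new float[input.length];

            for (int i = 0; i < input.length; i++) {
                // Ring modulation: multiply the voice by a sine carrier.
                double carrier = Math.sin(2.0 * Math.PI * carrierHz * i / sampleRate);
                float modulated = (float) (input[i] * carrier);

                // Short feedback delay: blend in the output from delaySamples ago.
                float delayed = (i >= delaySamples) ? out[i - delaySamples] : 0f;
                out[i] = modulated + delayMix * delayed;
            }
            return out;   // may exceed [-1, 1]; clip or normalize before playback
        }
    }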
(Preface: This is my first audio-related question on Stack Overflow, so I'll try to word this as best as I possibly can. Edits welcome.)
I'm creating an application that'll allow users to loop music. At the moment our prototypes allow these "loop markers" (implemented as UISliders) to snap at every second, specifying the beginning and end of a loop. Obviously, when looping music, seconds are a very crude manner to handle this, so I would like to use beats instead.
I don't want to do anything other than mark beats for the UISliders to snap to:
Feed our loadMusic method an audio file.
Run it through a library to detect beats or the intervals between them (maybe).
Feed that value into the slider's setNumberOfTickMarks: method.
Profit!
Unfortunately, most of the results I've run into via Google and SO have yielded much more advanced beat detection libraries like those that remixers would use. Overkill in my case.
Is this something that CoreMedia, AVFoundation or AudioToolbox can handle? If not, are there other libraries that can handle this? My research into Apple's documentation has only yielded relevant results... for MIDI files. But Apple's own software has features like this, such as iMovie's snap-to-beats functionality.
Any guidance, code or abstracts would be immensely helpful at this point.
EDIT: After doing a bit more digging around, it seems the correct terminology for what I'm looking for is onset detection.
Onset detection algorithms come in many flavors, from looking at the raw music signal to using frequency-domain techniques.
If you want a quick and easy way to determine where beats are (a minimal sketch follows the steps):
Chop the music signal into small segments (20-50 ms chunks).
Compute the average squared value (energy) of each chunk: sum(x[n]^2) / N (where N is the number of samples in the 20-50 ms chunk).
Look for chunks whose energy jumps well above that of the preceding chunks; those spikes are likely onsets.
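For illustration, here is a minimal Java sketch of that energy-based approach; the window size and the 1.5x jump threshold are arbitrary starting points, not values from any particular library:

    import java.util.ArrayList;
    import java.util.List;

    /**
     * Rough sketch of energy-based onset detection: compute sum(x[n]^2) / N per
     * ~23 ms chunk and flag chunks whose energy jumps well above the previous one.
     */
    public class OnsetSketch {

        public static List<Integer> detectOnsets(float[] samples, int sampleRate) {
            int windowSize = sampleRate / 43;   // roughly 23 ms per chunk
            double jumpFactor = 1.5;            // energy must grow by 50% (assumption)
            List<Integer> onsetSampleIndices = new ArrayList<>();

            double previousEnergy = 0.0;
            for (int start = 0; start + windowSize <= samples.length; start += windowSize) {
                // Sum of squares over this chunk, divided by N.
                double energy = 0.0;
                for (int n = start; n < start + windowSize; n++) {
                    energy += samples[n] * samples[n];
                }
                energy /= windowSize;

                // A sudden rise relative to the previous chunk marks a likely onset.
                if (previousEnergy > 0 && energy > jumpFactor * previousEnergy) {
                    onsetSampleIndices.add(start);
                }
                previousEnergy = energy;
            }
            return onsetSampleIndices;
        }
    }

The returned sample indices can be converted to seconds (index / sampleRate) and fed to whatever the sliders snap to.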
If you want more sophisticated techniques look into:
https://adamhess.github.io/Onset_Detection_Nov302011.pdf
or for hardcore treatment of it:
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=PMHXcoAAAAAJ&citation_for_view=PMHXcoAAAAAJ:uJ-U7cs_P_0C
I need to develop a performance evaluator for piano playing. Based on a MIDI file generated from sheet music, I need to evaluate the MIDI of the actual playing (from a MIDI keyboard). I'm planning to evaluate the playing based on note pitch, duration and loudness. The evaluation is, I suppose, a comparison of the notes of the sheet music and of the playing, both in MIDI.
But I have no idea how I can visualise this evaluation process (i.e. show where the person has gone wrong), for example by showing the notation and highlighting which notes were wrong.
How can I show any of this in graphical form, or, more precisely, on a stave (a music score) itself? I have note details (pitch, duration) and score details (key and time signature) stored in a table, and I'm using Java. But I have no clue as to how I can put all this into graphical form.
Any insight is most gratefully appreciated. Thanks in advance.
What you're talking about, really, is a graphical diff tool for musical notation. I think the easiest way to show differences is with an overlay of played notes (and rests) over the "correct" score symbols. Where it gets tricky is in showing volume differences, and whether notes are played (or should be played) staccato, marcato, tenuto, etc. For example, a note with a dot over it is meant to be played staccato, but your MIDI representation of that quarter note might be interpreted as an eighth note followed by an eighth rest, etc.
You'll also have to quantize the results of the live play, which means you will have to allow some leeway for the human being to be slightly before or after the beat without notating differently. If you don't do this, the only "correct" interpretation of the notes will be very mechanical (and not pleasing to the ear).
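To make that concrete, here is a sketch in Java of the core comparison, with a timing tolerance standing in for the leeway described above. The Note class and the 100 ms tolerance are invented for illustration, not taken from any existing library:

    import java.util.ArrayList;
    import java.util.List;

    /** Hypothetical note record: MIDI pitch, onset and duration in ms, velocity. */
    class Note {
        int pitch;
        long onsetMs;
        long durationMs;
        int velocity;

        Note(int pitch, long onsetMs, long durationMs, int velocity) {
            this.pitch = pitch;
            this.onsetMs = onsetMs;
            this.durationMs = durationMs;
            this.velocity = velocity;
        }
    }

    /** Sketch of a score-vs-performance diff with timing leeway. */
    public class PerformanceDiffSketch {

        static final long TIMING_TOLERANCE_MS = 100;   // leeway around the beat (assumption)

        /** Returns score notes with no acceptable match in the performance. */
        public static List<Note> findProblemNotes(List<Note> score, List<Note> played) {
            List<Note> problems = new ArrayList<>();
            for (Note expected : score) {
                boolean matched = false;
                for (Note actual : played) {
                    boolean samePitch = actual.pitch == expected.pitch;
                    boolean closeEnough =
                            Math.abs(actual.onsetMs - expected.onsetMs) <= TIMING_TOLERANCE_MS;
                    if (samePitch && closeEnough) {
                        matched = true;
                        break;
                    }
                }
                if (!matched) {
                    problems.add(expected);   // candidates to highlight on the stave
                }
            }
            return problems;
        }
    }

The notes this returns are the ones you would highlight when you draw the stave; duration and velocity comparisons would hang off the same matching loop.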
As for drawing the notation and placing it on the correct lines or spaces on the staves, that is not hard if you understand how to draw graphics. There are musical fonts available that permit you to use alphanumeric characters to represent note bodies, stems, rests, etc. You will also have to understand key signatures, accidentals, when certain notes are enharmonic, and so on.
This is not a small undertaking you are proposing, and there is already a lot of software out there that does a lot of what you are trying to do. Maybe some exists that does exactly what you want to do, so do research it before you start coding. :) Look at various work that has already been done and see if there is anything you can use or which would put you off your project.
I made my own keyboard player/recorder for QuickTime's MIDI implementation a few years ago and had to solve a number of the problems you face. I did it for fun, and it was fun (and educational for me), but it could never compete with commercial software in the genre. And although people did enjoy it, I really did not have time to maintain it and add the features people wanted, so eventually I had to abandon it. This kind of thing is really a lot of work.
Is it possible to write a program that can extract a melody/beat/rhythm provided by a specific instrument in a wave (or other music format) file made up of multiple instruments?
Which algorithms could be used for this and what programming language would be best suited to it?
This is a fascinating area. The basic mathematical tool here is the Fourier Transform. To get an idea of how it works, and how challenging it can be, take a look at the analysis of the opening chord to A Hard Day's Night.
An instrument produces a sound signature, just the same way our voices do. There are algorithms out there that can pick a single voice out of a crowd and identify that voice from its signature in a database; this is used in forensics. In exactly the same way, the sound signature of a single instrument can be picked out of a soundscape (such as your mixed wave) and used to pick out a beat, or to make a copy of that instrument on its own track.
Obviously, if you're thinking about making copies of tracks, i.e. breaking down the mixed wave into a single track per instrument, you're going to be looking at a lot of work. My understanding is that, because of the frequency overlaps between instruments, this isn't going to be straightforward by any means... not impossible, though, as you've already been told.
There's quite an interesting blog post by Comparisonics about sound matching technologies which might be useful as a start for your quest for information: http://www.comparisonics.com/SearchingForSounds.html
To extract the beat or rhythm, you might not need perfect isolation of the instrument you're targeting. A general solution may be hard, but if you're trying to solve it for a particular piece, it may be possible. Try implementing a band-pass filter and see if you can tune it to select the instrument you're after.
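As a starting point, here is a minimal Java sketch of such a tunable band-pass filter. It is a standard biquad using the coefficient formulas from the RBJ Audio EQ Cookbook; the center frequency and Q are the knobs you would tune per instrument, and it will only attenuate, not cleanly remove, everything else:

    /**
     * Sketch of a tunable band-pass biquad (RBJ "Audio EQ Cookbook" coefficients,
     * constant 0 dB peak gain). Not an instrument isolator, just a filter to tune.
     */
    public class BandPassSketch {

        public static float[] bandPass(float[] x, float sampleRate, float centerHz, float q) {
            double omega = 2.0 * Math.PI * centerHz / sampleRate;
            double alpha = Math.sin(omega) / (2.0 * q);

            // Coefficients normalized by a0; b1 is zero for this filter shape.
            double a0 = 1.0 + alpha;
            double b0 = alpha / a0;
            double b2 = -alpha / a0;
            double a1 = -2.0 * Math.cos(omega) / a0;
            double a2 = (1.0 - alpha) / a0;

            float[] y = new float[x.length];
            double x1 = 0, x2 = 0, y1 = 0, y2 = 0;   // previous inputs/outputs
            for (int n = 0; n < x.length; n++) {
                double yn = b0 * x[n] + b2 * x2 - a1 * y1 - a2 * y2;
                x2 = x1; x1 = x[n];
                y2 = y1; y1 = yn;
                y[n] = (float) yn;
            }
            return y;
        }
    }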
Also, I just found this Mac product called PhotoSounder. They have a blog showing different ways it can be used, including isolating an individual instrument (with manual intervention).
Look into Karaoke machine algorithms. If they can remove voice from a song, I'm sure the same principles can be applied to extract a single instrument.
Most instruments make sound within certain frequency ranges.
If you write a tunable bandpass filter - a filter that only lets a certain frequency range through - it'll be about as close as you're likely to get. It will not be anywhere near perfect; you're asking for black magic. The only way to perfectly extract a single instrument from a track is to have an audio sample of the track without that instrument, and do a difference of the two waveforms.
C, C++, Java, C#, Python, Perl should all be able to do all of this with the right libraries. Which one is "best" depends on what you already know.
It's possible in principle, but very difficult - an open area of research, even. You may be interested in the project paper for Dancing Monkeys, a step generation program for StepMania. It does some fairly sophisticated beat detection and music analysis, which is detailed in the paper (linked near the bottom of that page).
What's so difficult about speech recognition that algorithm designers are having a hard time tackling it?
Is it really that complex?
I'm having a hard time grasping why this topic is so problematic. Can anyone give me an example as to why this is the case?
Auditory processing is a very complex task. Human evolution has produced a system so good that we don't realize how good it is. If three people are talking to you at the same time, you will be able to focus on one signal and discard the others, even if they are louder. Noise is discarded very well too. In fact, if you hear human voice played backwards, the first stages of the auditory system will send this signal to a different processing area than if it were a real speech signal, because the system will regard it as "no-voice". This is an example of the outstanding abilities humans have.
Speech recognition advanced quickly from the 70s onward because researchers were studying the production of voice. This is a simpler system: vocal cords excited or not, resonance of the vocal tract... it is a mechanical system that is easy to understand. The main product of this approach is cepstral analysis. This allowed automatic speech recognition (ASR) to achieve acceptable results. But this is a sub-optimal approach. Noise separation is quite bad: even though it works more or less in clean environments, it is not going to work with loud music in the background the way humans do.
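To give a flavor of what that front end computes, here is a rough Java sketch of the real cepstrum of a single analysis frame (log magnitude spectrum followed by an inverse transform). It uses a naive O(N^2) DFT for brevity; a real ASR front end would use FFTs, mel filter banks and MFCCs rather than this literal form:

    /**
     * Rough sketch of the real cepstrum of one analysis frame:
     * DFT -> log magnitude -> inverse DFT. Naive O(N^2) transforms for clarity.
     */
    public class CepstrumSketch {

        public static double[] realCepstrum(double[] frame) {
            int n = frame.length;

            // 1. Log magnitude spectrum of the frame.
            double[] logMag = new double[n];
            for (int k = 0; k < n; k++) {
                double re = 0, im = 0;
                for (int t = 0; t < n; t++) {
                    double angle = -2.0 * Math.PI * k * t / n;
                    re += frame[t] * Math.cos(angle);
                    im += frame[t] * Math.sin(angle);
                }
                logMag[k] = Math.log(Math.sqrt(re * re + im * im) + 1e-12);
            }

            // 2. Inverse DFT of the log spectrum: low "quefrencies" capture the
            //    vocal-tract envelope, while a later peak reveals the pitch period.
            double[] cepstrum = new double[n];
            for (int q = 0; q < n; q++) {
                double sum = 0;
                for (int k = 0; k < n; k++) {
                    sum += logMag[k] * Math.cos(2.0 * Math.PI * k * q / n);
                }
                cepstrum[q] = sum / n;
            }
            return cepstrum;
        }
    }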
The optimal approach depends on understanding the auditory system. Its first stages are in the cochlea and the inferior colliculus, but the brain is also involved, and we don't know very much about this. It is proving to be a difficult change of paradigm.
In a paper, Professor Hynek Hermansky compared the current state of the research with the time when humans wanted to fly. We didn't know what the secret was (the feathers? flapping wings?) until we discovered Bernoulli's principle.
Because if people find it hard to understand other people with a strong accent, why do you think computers will be any better at it?
I remember reading that Microsoft had a team working on speech recognition, and they called themselves the "Wreck a Nice Beach" team (a name given to them by their own software).
To actually turn speech into words, it's not as simple as mapping discrete sounds, there has to be an understanding of the context as well. The software would need to have a lifetime of human experience encoded in it.
This kind of problem is more general than speech recognition alone.
It also exists in vision processing, natural language processing, artificial intelligence, ...
Speech recognition is affected by the semantic gap problem:
"The semantic gap characterizes the difference between two descriptions of an object by different linguistic representations, for instance languages or symbols. In computer science, the concept is relevant whenever ordinary human activities, observations, and tasks are transferred into a computational representation."
Between an audio waveform and a textual word, the gap is big; between the word and its meaning, it is even bigger...
beecos iyfe peepl find it hard to arnerstand uvver peepl wif e strang acsent wie doo yoo fink compootrs wyll bee ani bettre ayt it?
I bet that took you half a second to work out what the hell I was typing, and all I was doing was repeating Simon's answer in a different 'accent'. The processing power just isn't there yet, but it's getting there.
The variety in language would be the predominant factor making it difficult. Dialects and accents make it more complicated still. Also, context: "The book was read" versus "The book was red". How do you determine the difference? The extra effort needed for this would make it easier to just type the thing in the first place.
Now, there would probably be more effort devoted to this if it was more necessary, but advances in other forms of data input have come along so quickly that it is not deemed that necessary.
Of course, there are areas where it would be great, even extremely useful or helpful. Situations where you have your hands full or can't look at a screen for input. Helping the disabled etc. But most of these are niche markets which have their own solutions. Maybe some of these are working more towards this, but most environments where computers are used are not good candidates for speech recognition. I prefer my working environment to be quiet. And endless chatter to computers would make crosstalk a realistic problem.
On top of this, unless you are dictating prose to the computer, any other type of input is easier and quicker using keyboard, mouse or touch. I did once try coding using voice input. The whole thing was painful from beginning to end.
Because Lernout&Hauspie went bust :)
(sorry, as a Belgian I couldn't resist)
The basic problem is that human language is ambiguous. Therefore, in order to understand speech, the computer (or human) needs to understand the context of what is being spoken. That context is actually the physical world the speaker and listener inhabit. And no AI program has yet demonstrated a deep understanding of the physical world.
Speech synthesis is very complex by itself - many parameters are combined to form the resulting speech. Breaking it apart is hard even for people - sometimes you mishear one word for another.
Most of the time we humans understand based on context, so that a particular sentence is in harmony with the whole conversation; unfortunately, computers have a big handicap in this sense. They just try to capture the words, not what's between them.
We would understand a foreigner whose English accent is very poor, perhaps guessing what he is trying to say instead of what he is actually saying.
To recognize speech well, you need to know what people mean - and computers aren't there yet at all.
You said it yourself: algorithm designers are working on it... but language and speech are not algorithmic constructs. They are the peak of the development of the highly complex human system, involving concepts, meta-concepts, syntax, exceptions, grammar, tonality, emotions, neuronal as well as hormonal activity, etc.
Language needs a highly heuristic approach and that's why progress is slow and prospects maybe not too optimistic.
I once asked a similar question of my instructor; I asked him something like, "What challenge is there in making a speech-to-text converter?" Among the answers he gave, he asked me to pronounce 'p' and 'b'. Then he said that they differ for a very small time at the beginning, and after that they sound similar. My point is that it is hard even to recognize which sound was made; recognizing a voice would be even harder. Also, note that once you record people's voices, it is just numbers that you store. Imagine trying to find metrics like accent, frequency, and other parameters useful for identifying voice from nothing but matrices of numbers. Computers are good at numerical processing, etc., but voice is not really 'numbers'. You need to encode voice in numbers and then do all computation on them.
I would expect some advances from Google in the future because of their voice data collection through 1-800-GOOG411
It's not my field, but I do believe it is advancing, just slowly.
And I believe Simon's answer is somewhat correct in a way: part of the problem is that no two people speak alike in terms of the patterns that a computer is programmed to recognize. Thus, it is difficult to analyze speech.
Computers are not even very good at natural language processing to start with. They are great at matching but when it comes to inferring, it gets hairy.
Then try to figure out the same word from hundreds of different accents/inflections, and it suddenly doesn't seem so simple.
Well I have got Google Voice Search on my G1 and it works amazingly well. The answer is, the field is advancing, but you just haven't noticed!
If speech recognition was possible with substantially less MIPS than the human brain, we really could talk to the animals.
Evolution wouldn't spend all those calories on grey matter if they weren't required to do the job.
Spoken language is context sensitive, ambiguous. Computers don't deal well with ambiguous commands.
I don't agree with the assumption in the question - I have recently been introduced to Microsoft's speech recognition and am impressed. It can learn my voice after a few minutes and usually identifies common words correctly. It also allows new words to be added. It is certainly usable for my purposes (understanding chemistry).
Differentiate between recognising the (word) tokens and understanding the meaning of them.
I don't yet know about other languages or operating systems.
The problem is that there are two types of speech recognition engines. Speaker-trained ones such as Dragon are good for dictation. They can recognize almost any spoken text with fairly good accuracy, but require (a) training by the user, and (b) a good microphone.
Speaker-independent speech rec engines are most often used in telephony. They require no "training" by the user, but must know ahead of time exactly which words are expected. The application development effort to create these grammars (and deal with errors) is huge. Telephony is limited to a 4 kHz bandwidth due to historical limits in our public phone network. This limited audio quality greatly hampers the speech rec engines' ability to "hear" what people are saying. Digits such as "six" or "seven" contain an "s" sound that is particularly hard for the engines to distinguish. This means that recognizing strings of digits, one of the most basic recognition tasks, is problematic. Add in regional accents, where "nine" is pronounced "nan" in some places, and accuracy really suffers.
The best hope is interfaces that combine graphics and speech rec. Think of an iPhone application that you can control with your voice.
I remember the days of Shadowrun that got me excited about hacking. There are CoreWar and LightBot, which are both fun (though CoreWar is a little dated). What other games are there involving coding that are fun and challenging, that can be used to get someone excited about coding, flex their chops, or even learn the basics?
How about RoboCode
You code your tank in Java and let it loose in the 'ring' with other coded tanks. People got pretty into coding strategy, targeting, etc. IBM sponsored it and came up with some nice introductory programming tutorials to get you started.
Here's a great article to get the feel for it:
Rock 'em, sock 'em Robocode!
Uplink isn't so much a coding game, but it is a great game that makes you feel like a hacker.
There's a whole bunch of "drag-and-drop" coding games, where you make a little thing (usually a robot) solve some puzzle by giving it a list of instructions. They're only vaguely similar to actual coding, but they are still pretty fun.
RoboZZle
The Codex of Alchemical Engineering
light-Bot
Not sure if it's considered a "game", but the TopCoder Competitions are fun, and come in various sizes and commitment levels. You can also work on puzzles from the archives for some good programming practice.
The Python Challenge is like those "look at the html source" riddles, but requires a bit of programming to get the answers.
When I was a kid I played "Rocky's Boots", where you had to hook up logic gates to solve puzzles. That had a big impact on my thinking.
Core Wars.
Here's something that allows you to make games and animations: Alice
If you're looking for a board game, you might want to have a look at Robo Rally. In this game, 2-8 people try to maneuver their robots over the board as quickly as possible, dodging deadly obstacles and trying to shove other people's robots into obstacles along the way.
Each game round all players have to "code" the program the robot is going to execute in the next round and then the robots just follow their program. The programs are just five instructions long, but still creating an optimal program can be quite tricky. There usually is very little luck involved, which is why I really like this game.
Similar to Uplink is HackWars. Instead of point and click hacking though, it's multiplayer and you can write your own attack scripts. There's actually an included runtime for writing 2d/3d games and there's a bunch of different places to hook in scripts (for defense, banking, in game website, etc).
Scripting language looks similar to Java.
How about Ai-Board
You play it on your phone/tablet.
Its IDE is built into the game.
It has a built-in node-based visual programming language, whose code-behind is a Python-like language.
You write the code, that drives the Ai, that moves the pawns, that plays the game, all still on your mobile device.
YouTube Video: Visual Programming Time-lapse on a mobile device
It comes with quite a few tutorials that introduce the player to programming, genetic algorithms etc, and you get a step-by-step walk-through of all these methods.
It also comes with ready-made scripts that work right out of the box, and are ready for you to copy into your free 'Dev' & 'Test' envs,...
...so that you can tweak them to your heart's content, knowing that you can always revert to the original at any time.
The in-built machine learning engine allows you to
train your AiBot to play the board-game,
play your AiBot against their in-built Ai
play against your own AiBot
breed your AiBot (Genetic Algorithm)
fine-tune your AiBot (Back-Propagation)
...debug your AiBot, and so on.
YouTube Video: Machine Learning is mobile!
It's currently in beta testing, but soon to be released, and everything described comes for free.
In addition, there are single and multi-player modes as well, but it is primarily a game about coding, coming complete "...with batteries included!"
Project Euler