Speech recognition, such as Siri - performance

Software such as Siri takes voice commands and responds to them appropriately (98%). When we write software that takes an input stream of voice signals and responds to questions,
do we need to convert the input into a human-readable language such as English?
There are many different languages, but when we speak we are basically just making different noises. That's it. We created the so-called alphabet to denote those noise variations.
So, again, my question is: when we write speech recognition algorithms, do we match those noise variation signals against our database directly, or do we first convert them into English and then look up the answer in the database?

The "noise variation signals" you are referring to are called phonemes. How a speech recognition system translates these phonemes into words depends on the type of system. Siri is not a grammar-based system, where you tell the speech recognition system what types of phrases to expect based on a set of rules. Since Siri translates speech in an open context, it probably uses some type of statistical modeling. A popular statistical model for speech recognition today is the Hidden Markov Model. While there is a database of sorts involved, it is not a simple lookup from groups of phonemes to words. There is a pretty good high-level description of the process and the issues with translation here.
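To make the Hidden Markov Model idea a bit more concrete, here is a minimal sketch of Viterbi decoding in Python. It is a toy and not how Siri actually works: the three phoneme states, the two acoustic frame labels ("burst", "voiced") and every probability below are made-up assumptions for illustration. The point is that the model picks the most likely phoneme sequence even though "k" and "t" look identical acoustically in this toy.

```python
import math

# Toy model, all values are assumptions: hidden states are phonemes of "cat".
STATES = ("k", "ae", "t")
START_P = {"k": 0.8, "ae": 0.1, "t": 0.1}
TRANS_P = {
    "k": {"k": 0.3, "ae": 0.6, "t": 0.1},
    "ae": {"k": 0.1, "ae": 0.3, "t": 0.6},
    "t": {"k": 0.2, "ae": 0.1, "t": 0.7},
}
# "k" and "t" emit the same frame labels with the same probabilities.
EMIT_P = {
    "k": {"burst": 0.9, "voiced": 0.1},
    "ae": {"burst": 0.1, "voiced": 0.9},
    "t": {"burst": 0.9, "voiced": 0.1},
}
OBSERVATIONS = ["burst", "voiced", "voiced", "burst"]


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # Each column maps state -> (log-prob of best path ending there, previous state).
    best = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Best predecessor for state s at this time step.
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)

    # Backtrack from the best final state to recover the whole path.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


decoded = viterbi(OBSERVATIONS, STATES, START_P, TRANS_P, EMIT_P)
```

With these assumed numbers, `decoded` is `['k', 'ae', 'ae', 't']`: the transition probabilities, not the acoustics, decide that the first burst is "k" and the last one is "t".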

Apple's Siri is based on natural language understanding.
I believe Nuance is behind the scenes; refer to this article.
Nuance is a leader in speech recognition system development.
The accuracy of the Nuance Dragon engine is impressive.
The client I am working for uses the Nuance NOD service for their IVR system.
I have tried the Nuance Dragon SDK for Android.
From my experience, if you use Nuance you need not worry about noise variation and so on. But for an enterprise release of your application, Nuance might be costly.
If you are planning to use voice to drive your application, the Google API is also a good choice.
APIs like Sphinx and PocketSphinx can also help with speech application development. All of the above APIs take care of noise rejection, converting speech into text, and so on.
All you need to worry about is building your system to understand the semantic meaning of the recognized string. Apple must have a very good semantic interpreter. So give the Nuance SDK a try. It is available for Android, iOS and Windows Phone, and as an HTTP client version.
I hope this helps.

Related

What does the Apple Neural Engine do?

I can't find any useful documentation on what the Apple Neural Engine does specifically beyond "accelerate machine learning tasks across the Mac for things like video analysis, voice recognition, image processing, and more", which doesn't tell me much. Is it just another TPU? What architecture does it use? Does it have accelerators for any specific linear algebra operations? Is there a CUDA-type language to actually program it?

Face identification and recognition without libraries

Is there any "step by step guide" on how to do face recognition and identification without any libraries?
The goal is to receive 60 images and identify who is in each one, so first we have to collect information about every person and then identify them image by image.
It's for academic research, and we are not supposed to use any kind of "external help" for our algorithm. Any programming language will do. We just need to keep everything simple. What should I research and use to build this kind of software?
This article provides basic techniques for image recognition. It uses neural nets to recognize handwritten digits.

Moleskine Pen+ compatible with notebooks of Neo Smart Lab

I'm very new to using smartpens and I'm not sure which forum is best for questions about them...
I have a Moleskine Smart Writing Set but would like to make certain notes on single sheets (like these) or use notebooks other than the Paper Tablet from Moleskine.
I thought that this would be no problem, since the Pen+ uses the same Ncode technology as the smartpens from NeoLab do. But this turned out to be wrong... My Moleskine Notes app doesn't recognise the paper from NeoLab.
Do you have any idea how I can use my Pen+ with single paper sheets and/or other notebooks that use the Ncode technology?
The Pen+ from Moleskine and Neo Notes work perfectly together (I have been using them for 2 years). With that combination you can use all coded papers from NeoLab, including the printed ones.
Regarding the first part of your question, printing and using your own pages depends primarily on your printer's ability to print "fine enough" for the pen to read the Ncode paper pattern. That said, I believe a laser printer matching these NeoLab-recommended criteria should work:
Use a color laser printer for printing Ncode PDF.
Install PCL/PS drivers to print exquisite Ncode. (Ask the printer manufacturer about PCL/PS driver.)
Use plain format for grayscale laser printers.
Regarding the second half of your question concerning off-the-shelf compatibility, I had a similar question and had difficulty finding the answer until I started looking at the Question/Answer sections of these pens and paper products (e.g., Pen Q/A).
Finally, my sense after researching this "too much" is that the biggest snag is realizing that the communication protocols between pen and paper are different. In other words, the app you choose (i.e., Neo Notes or the Moleskine app) is the big decision. Thus, I believe choosing how to interface with the pen is the biggest choice to make.
TOTAL DISCLOSURE: I have yet to purchase either of these products and my above statements are somewhat speculative and not COMPLETELY verified.
My experience is that, at least for the Moleskine Ellipse + smart pen, the Moleskine Notes app was required, and this unfortunately limited my selection of notebooks to Moleskine-branded ones. Neo Notes would not recognize and register the pen.
However, my experience with Neo Studio is that it recognizes both NeoLab ncode and Moleskine-branded ncode notebooks.

Why is speech recognition difficult? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 11 years ago.
Why is speech recognition so difficult? What are the specific challenges involved? I've read through a question on speech recognition, which did partially answer some of my questions, but the answers were largely anecdotal rather than technical. It also still didn't really answer why we still can't just throw more hardware at the problem.
I've seen tools that perform automated noise reduction using neural nets and ambient FFT analysis with excellent results, so I can't see a reason why we're still struggling with noise except in difficult scenarios like ludicrously loud background noise or multiple speech sources.
Beyond this, isn't it just a case of using very large, complex, well-trained neural nets to do the processing, then throwing hardware at it to make it work fast enough?
I understand that strong accents are a problem and that we all have our colloquialisms, but these recognition engines still get basic things wrong when the person is speaking in a slow and clear American or British accent.
So, what's the deal? What technical problems are there that make it still so difficult for a computer to understand me?
Some technical reasons:
You need lots of tagged training data, which can be difficult to acquire once you take into account all the different accents, sounds etc.
Neural networks and similar gradient descent algorithms don't scale that well - just making them bigger (more layers, more nodes, more connections) doesn't guarantee that they will learn to solve your problem in a reasonable time. Scaling up machine learning to solve complex tasks is still a hard, unsolved problem.
Many machine learning approaches require normalised data (e.g. a defined start point, a standard pitch, a standard speed). They don't work well once you move outside these parameters. There are techniques such as convolutional neural networks etc. to tackle these problems, but they all add complexity and require a lot of expert fine-tuning.
Data size for speech can be quite large - the size of the data makes the engineering problems and computational requirements a little more challenging.
Speech data usually needs to be interpreted in context for full understanding - the human brain is remarkably good at "filling in the blanks" based on understood context. Missing information and competing interpretations are resolved with the help of other modalities (like vision). Current algorithms don't "understand" context so they can't use this to help interpret the speech data. This is particularly problematic because many sounds / words are ambiguous unless taken in context.
Overall, speech recognition is a complex task. Not unsolvably hard, but hard enough that you shouldn't expect any sudden miracles, and it will certainly keep many researchers busy for many more years.
Humans use more than their ears when listening, they use the knowledge they
have about the speaker and the subject. Words are not arbitrarily sequenced
together, there is a grammatical structure and redundancy that humans use
to predict words not yet spoken. Furthermore, idioms and how we ’usually’
say things makes prediction even easier.
In Speech Recognition we only have the speech signal. We can of course construct a
model for the grammatical structure and use some kind of statistical model
to improve prediction, but there is still the problem of how to model world
knowledge, the knowledge of the speaker and encyclopedic knowledge. We
can, of course, not model world knowledge exhaustively, but an interesting
question is how much we actually need in the ASR to measure up to human
comprehension.
Speech is uttered in an environment of sounds, a clock ticking, a computer
humming, a radio playing somewhere down the corridor, another human
speaker in the background etc. This is usually called noise, i.e., unwanted
information in the speech signal. In Speech Recognition we have to identify and filter out
these noises from the speech signal.
Spoken language != Written language
1: Continuous speech
2: Channel variability
3: Speaker variability
4: Speaking style
5: Speed of speech
6: Ambiguity
All these points have to be considered when building a speech recognition system. That's why it is quite difficult.
-------------Referred from http://www.speech.kth.se/~rolf/gslt_papers/MarkusForsberg.pdf
I suspect you are interested in 'continuous' speech recognition, where the speaker speaks sentences (not single words) at normal speed.
The problem is not simply one of signal analysis, but there is a large natural language component as well. Most of us understand spoken language not by analyzing every single thing that we hear, as that would never work because each person speaks differently, phonemes are suppressed, pronunciations are different, etc. We just interpret a portion of what we hear and the rest is 'interpolated' by our brain once the context of what is being said is established. When you have no context, it is difficult to understand spoken language.
Lots of major problems in speech recognition are not directly related to the language itself:
different people (women, men, children, elders etc.) have different voices
sometimes the same person sounds different for example when the person has a cold
different background noises
everyday speech sometimes contains words from other languages (like the German word Kindergarten in US English)
some people learned the language as non-native speakers (they usually sound different)
some persons speak faster, others speak slower
quality of the microphone
etc.
Solving these things is always pretty hard... on top of that you have the language/pronunciation to take care of...
For reference see the Wikipedia article http://en.wikipedia.org/wiki/Speech_recognition - it has a good overview including some links and book references which are a good starting point...
From the technical POV the "audio preprocessing" is just one step in a long process... let's say the audio is "crystal clear", then several of the above mentioned aspects (like having a cold, mixing languages etc.) still need to be solved.
All this means that for good speech recognition you need to have a model of the language(s) that is thorough enough to account for slight differences (like "ate" versus "eight"), which usually involves some context analysis (i.e. semantic and fact/world knowledge, see http://en.wikipedia.org/wiki/Semantic%5Fgap) etc.
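As a rough illustration of the "ate" versus "eight" point, here is a minimal Python sketch that uses neighbouring words to choose between two spellings that sound the same. The bigram counts are invented for this example, and the scoring (add-one smoothed product of the left and right bigram counts) is just one simple assumption; a real engine would use a much larger statistical language model.

```python
# Hypothetical bigram counts, invented for this example. A real recognizer
# would estimate them from a large text corpus.
BIGRAM_COUNTS = {
    ("i", "ate"): 50,
    ("ate", "an"): 30,
    ("at", "eight"): 40,
    ("eight", "sharp"): 35,
}


def bigram_score(prev_word, word, next_word):
    """Add-one smoothed product of the left and right bigram counts."""
    left = BIGRAM_COUNTS.get((prev_word, word), 0) + 1
    right = BIGRAM_COUNTS.get((word, next_word), 0) + 1
    return left * right


def pick_homophone(prev_word, candidates, next_word):
    """Choose the candidate spelling that fits best between its neighbours."""
    return max(candidates, key=lambda w: bigram_score(prev_word, w, next_word))


SOUND_ALIKES = ("ate", "eight")
lunch = pick_homophone("i", SOUND_ALIKES, "an")         # "I ___ an apple"
meeting = pick_homophone("at", SOUND_ALIKES, "sharp")   # "at ___ sharp"
```

Here `lunch` comes out as "ate" and `meeting` as "eight", purely because of the surrounding words. The audio is identical in both cases, which is why the acoustic signal alone cannot settle it.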
Since almost all relevant languages have evolved and were not designed as mathematical models, you basically need to "reverse engineer" the available implicit and explicit knowledge about a language into a model, which is a big challenge IMHO.
Having worked myself with neural nets I can assure you that while they provide good results in some cases they are not "magical tools"... almost always a good neural net has been carefully designed and optimized for the specific requirement... in this context it needs both extensive experience/knowledge of languages and neural nets PLUS extensive training to achieve usable results...
It's been a decade since I took a language class in college, but from what I recall, language can be broken up into phonemes. Language processors do their best to identify these phonemes, but they are unique to every individual. Even once they are broken up they must then be reassembled into a meaningful construct.
Take this example: humans are quite capable of reading with no punctuation and no capital letters and no spaces. It takes a second, but we can do it quite readily. This is kind of what a computer has to look at when it gets a block of phonemes. However, computers are not nearly as good at parsing this data out. One of the reasons is that it is difficult for computers to have context. Humans can even understand babies despite the fact that their phonemes can be completely wrong.
Even if you have all the phonemes correct, then arranging them into an order that makes sense is also difficult.
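The "no spaces" analogy can be sketched in a few lines of Python. This is only an illustration on letters rather than phonemes, with a tiny made-up vocabulary (`VOCAB` is an assumption), but it shows that even with every symbol correct, one unbroken stream can be split into words in more than one valid way.

```python
from functools import lru_cache

# Tiny hypothetical vocabulary, chosen only to make the ambiguity visible.
VOCAB = frozenset({"no", "now", "here", "where", "nowhere", "is", "the", "book"})


def segmentations(text, vocab=VOCAB):
    """Return every way to split `text` into a sequence of vocabulary words."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the text yields one (empty) continuation.
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in vocab:
                # Extend each way of splitting the remainder with this word.
                results.extend([word] + rest for rest in split_from(end))
        return results

    return [" ".join(words) for words in split_from(0)]


options = segmentations("thebookisnowhere")
```

`options` contains three readings: "the book is no where", "the book is now here" and "the book is nowhere". Picking the intended one needs context, which is the part computers find hard.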

Why isn't speech recognition advancing? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
What's so difficult about the subject that algorithm designers are having a hard time tackling it?
Is it really that complex?
I'm having a hard time grasping why this topic is so problematic. Can anyone give me an example as to why this is the case?
Auditory processing is a very complex task. Human evolution has produced a system so good that we don't realize how good it is. If three people are talking to you at the same time you will be able to focus on one signal and discard the others, even if they are louder. Noise is very well discarded too. In fact, if you hear a human voice played backwards, the first stages of the auditory system will send this signal to a different processing area than if it were a real speech signal, because the system will regard it as "no-voice". This is an example of the outstanding abilities humans have.
Speech recognition advanced quickly from the 70s because researchers were studying the production of voice. This is a simpler system: vocal cords excited or not, resonance of the vocal tract... it is a mechanical system that is easy to understand. The main product of this approach is cepstral analysis. This led automatic speech recognition (ASR) to achieve acceptable results. But this is a sub-optimal approach. Noise separation is quite bad: even if it works more or less in clean environments, it is not going to work with loud music in the background, not the way humans do.
The optimal approach depends on understanding the auditory system: its first stages in the cochlea, the inferior colliculus... but the brain is involved too. And we don't know that much about this. It has been a difficult paradigm shift.
Professor Hynek Hermansky, in a paper, compared the current state of the research with the time when humans wanted to fly. We didn't know what the secret was (the feathers? the flapping of wings?) until we discovered Bernoulli's principle.
Because if people find it hard to understand other people with a strong accent why do you think computers will be any better at it?
I remember reading that Microsoft had a team working on speech recognition, and they called themselves the "Wreck a Nice Beach" team (a name given to them by their own software).
To actually turn speech into words, it's not as simple as mapping discrete sounds, there has to be an understanding of the context as well. The software would need to have a lifetime of human experience encoded in it.
This kind of problem is more general than only speech recognition.
It exists also in vision processing, natural language processing, artificial intelligence, ...
Speech recognition is affected by the semantic gap problem :
The semantic gap characterizes the
difference between two descriptions of
an object by different linguistic
representations, for instance
languages or symbols. In computer
science, the concept is relevant
whenever ordinary human activities,
observations, and tasks are
transferred into a computational
representation
Between an audio wave form and a textual word, the gap is big,
Between the word and its meaning, it is even bigger...
beecos iyfe peepl find it hard to arnerstand uvver peepl wif e strang acsent wie doo yoo fink compootrs wyll bee ani bettre ayt it?
I bet that took you half a second to work out what the hell I was typing, and all I was doing was repeating Simon's answer in a different 'accent'. The processing power just isn't there yet but it's getting there.
The variety in language would be the predominant factor making it difficult. Dialects and accents make this more complicated. Also, context. The book was read. The book was red. How do you determine the difference? The extra effort needed for this would make it easier to just type the thing in the first place.
Now, there would probably be more effort devoted to this if it was more necessary, but advances in other forms of data input have come along so quickly that it is not deemed that necessary.
Of course, there are areas where it would be great, even extremely useful or helpful. Situations where you have your hands full or can't look at a screen for input. Helping the disabled etc. But most of these are niche markets which have their own solutions. Maybe some of these are working more towards this, but most environments where computers are used are not good candidates for speech recognition. I prefer my working environment to be quiet. And endless chatter to computers would make crosstalk a realistic problem.
On top of this, unless you are dictating prose to the computer, any other type of input is easier and quicker using keyboard, mouse or touch. I did once try coding using voice input. The whole thing was painful from beginning to end.
Because Lernout&Hauspie went bust :)
(sorry, as a Belgian I couldn't resist)
The basic problem is that human language is ambiguous. Therefore, in order to understand speech, the computer (or human) needs to understand the context of what is being spoken. That context is actually the physical world the speaker and listener inhabit. And no AI program has yet demonstrated having a deep understanding of the physical world.
Speech synthesis is very complex by itself - many parameters are combined to form the resulting speech. Breaking it apart is hard even for people - sometimes you mishear one word for another.
Most of the time we humans understand based on context, so that a particular sentence is in harmony with the whole conversation. Unfortunately, computers have a big handicap in this sense: they just try to capture the words, not what lies between them.
We would understand a foreigner whose English accent is very poor, maybe by guessing what he is trying to say rather than what he is actually saying.
To recognize speech well, you need to know what people mean - and computers aren't there yet at all.
You said it yourself, algorithm designers are working on it... but language and speech are not algorithmic constructs. They are the peak of the development of a highly complex human system involving concepts, meta-concepts, syntax, exceptions, grammar, tonality, emotions, neuronal as well as hormonal activity, etc.
Language needs a highly heuristic approach and that's why progress is slow and prospects maybe not too optimistic.
I once asked my instructor a similar question; I asked him something like what challenge there is in making a speech-to-text converter. Among the answers he gave, he asked me to pronounce 'p' and 'b'. Then he said that they differ for a very short time at the beginning, and then they sound similar. My point is that it is hard even to recognize what sound is made; recognizing voice would be even harder. Also, note that once you record people's voices, it is just numbers that you store. Imagine trying to find metrics like accent, frequency, and other parameters useful for identifying voice from nothing but input such as matrices of numbers. Computers are good at numerical processing etc, but voice is not really 'numbers'. You need to encode voice in numbers and then do all computation on them.
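To illustrate the "it is just numbers" point, here is a small Python sketch. It fakes a recording by generating a pure 440 Hz tone as 16-bit integer samples (the function names, the 8000 Hz sampling rate and the tone itself are assumptions for the example), then recovers one metric, the frequency, from nothing but that list of integers by counting zero crossings. Real speech is far messier than a pure tone, so treat this as the easiest possible case.

```python
import math

SAMPLE_RATE_HZ = 8000  # assumed sampling rate for this sketch


def record_tone(freq_hz, seconds, sample_rate_hz=SAMPLE_RATE_HZ):
    """Stand-in for a microphone: 16-bit integer samples of a pure tone."""
    count = int(seconds * sample_rate_hz)
    return [
        int(round(32767 * math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)))
        for n in range(count)
    ]


def zero_crossings(samples):
    """Count sign changes between neighbours (zero counts as non-negative)."""
    return sum((a < 0) != (b < 0) for a, b in zip(samples, samples[1:]))


def estimate_frequency(samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Rough pitch estimate: a pure tone crosses zero twice per cycle."""
    seconds = len(samples) / sample_rate_hz
    return zero_crossings(samples) / (2 * seconds)


samples = record_tone(440, 1.0)          # 8000 plain integers
estimate = estimate_frequency(samples)   # close to 440
```

For a clean tone the estimate lands very near 440 Hz. With real voice, where many frequencies overlap and change every few milliseconds, such a simple metric stops being enough, which is the instructor's point.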
I would expect some advances from Google in the future because of their voice data collection through 1-800-GOOG411
It's not my field, but I do believe it is advancing, just slowly.
And I believe Simon's answer is somewhat correct in a way: part of the problem is that no two people speak alike in terms of the patterns that a computer is programmed to recognize. Thus, it is difficult to analyze speech.
Computers are not even very good at natural language processing to start with. They are great at matching but when it comes to inferring, it gets hairy.
Then try to figure out the same word from hundreds of different accents/inflections, and it suddenly doesn't seem so simple.
Well I have got Google Voice Search on my G1 and it works amazingly well. The answer is, the field is advancing, but you just haven't noticed!
If speech recognition was possible with substantially less MIPS than the human brain, we really could talk to the animals.
Evolution wouldn't spend all those calories on grey matter if they weren't required to do the job.
Spoken language is context sensitive, ambiguous. Computers don't deal well with ambiguous commands.
I don't agree with the assumption in the question - I have recently been introduced to Microsoft's speech recognition and am impressed. It can learn my voice after a few minutes and usually identifies common words correctly. It also allows new words to be added. It is certainly usable for my purposes (understanding chemistry).
Differentiate between recognising the (word) tokens and understanding the meaning of them.
I don't yet know about other languages or operating systems.
The problem is that there are two types of speech recognition engines. Speaker-trained ones such as Dragon are good for dictation. They can recognize almost any spoken text with fairly good accuracy, but require (a) training by the user, and (b) a good microphone.
Speaker-independent speech rec engines are most often used in telephony. They require no "training" by the user, but must know ahead of time exactly what words are expected. The application development effort to create these grammars (and deal with errors) is huge. Telephony is limited to a 4 kHz bandwidth due to historical limits in our public phone network. This limited audio quality greatly hampers the speech rec engines' ability to "hear" what people are saying. Digits such as "six" or "seven" contain an ssss sound that is particularly hard for the engines to distinguish. This means that recognizing strings of digits, one of the most basic recognition tasks, is problematic. Add in regional accents, where "nine" is pronounced "nan" in some places, and accuracy really suffers.
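A quick way to see why the 4 kHz limit hurts is a small Python sketch, assuming the classic 8000 samples-per-second telephone rate (which is what gives a 4 kHz ceiling). A 6000 Hz tone, standing in for the high-frequency part of an "ssss" sound, produces exactly the same samples as an inverted 2000 Hz tone once it is sampled at that rate. The tone frequencies and function name are illustrative assumptions.

```python
import math

SAMPLE_RATE_HZ = 8000  # narrowband telephony rate, so the ceiling is 4000 Hz


def sample_tone(freq_hz, n_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Sample a unit-amplitude sine wave at the given rate."""
    return [
        math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


high = sample_tone(6000, 16)  # above the 4000 Hz ceiling
low = sample_tone(2000, 16)   # below the ceiling

# Sample for sample, the 6000 Hz tone equals the negated 2000 Hz tone.
indistinguishable = all(
    math.isclose(h, -l, abs_tol=1e-9) for h, l in zip(high, low)
)
```

`indistinguishable` is `True`: after sampling, nothing in the numbers says the original tone was at 6000 Hz. Phone networks therefore filter out content above the ceiling before sampling, so whatever part of an "s" lives up there never reaches the recognizer at all.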
The best hope is interfaces that combine graphics and speech rec. Think of an iPhone application that you can control with your voice.

Resources