What does the Apple Neural Engine do?

I can't find any useful documentation on what the Apple Neural Engine does specifically beyond "accelerate machine learning tasks across the Mac for things like video analysis, voice recognition, image processing, and more", which doesn't tell me much. Is it just another TPU? What architecture does it use? Does it have accelerators for any specific linear algebra operations? Do they have a CUDA-type language to actually use it?
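For what it's worth, Apple does not publish a CUDA-style programming model for the ANE: you cannot target it directly. The documented route is Core ML, which decides at runtime whether an op runs on the CPU, GPU, or Neural Engine. A minimal sketch using the coremltools Python package (the model and shapes here are made up for illustration, not a canonical workflow):

```python
import torch
import coremltools as ct

# A toy model standing in for a real network
torch_model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()
).eval()
example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(torch_model, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=example.shape)],
    convert_to="mlprogram",
    # ALL lets Core ML schedule ops on CPU, GPU, or the Neural Engine;
    # there is no flag that forces ANE execution
    compute_units=ct.ComputeUnit.ALL,
)
mlmodel.save("tiny.mlpackage")
```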

Related

Face identification and recognition without libraries

Is there any step-by-step guide on how to do face recognition and identification without any libraries?
The goal is to receive 60 images and identify who is in every image, so first we will have to get the info of every person and then identify them image by image.
It's for academic research and we are not supposed to use any kind of "external help" for our algorithm. Any programming language will do. We just need to keep everything simple. What should I research and use to build this kind of software?
This article provides basic techniques for image recognition. It uses neural nets to recognize handwritten digits.
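Taking "no libraries" literally, a simple baseline to research before neural nets is a nearest-neighbour classifier over raw pixel vectors: keep a few labeled reference images per person and assign each of the 60 images to the closest reference. A minimal sketch in plain Python (decoding image files into pixel lists is assumed to happen elsewhere; the names and data are hypothetical):

```python
import math

def distance(a, b):
    # Euclidean distance between two equal-length pixel vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(image, references):
    # references: {"alice": [pixel, ...], "bob": [pixel, ...], ...}
    # returns the name whose reference image is closest
    return min(references, key=lambda name: distance(image, references[name]))

# Hypothetical usage: label each of the 60 images in turn.
# for img in images:
#     print(identify(img, references))
```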

How does a computer reproduce the SIFT paper method on its own in deep learning

Let me begin by saying that I am struggling to understand what is going on in deep learning. From what I gather, it is an approach that tries to have a computer engineer different layers of representations and features to enable it to learn things on its own. SIFT seems to be a common way to detect things by tagging and hunting for scale-invariant features in some representation. Again, I am completely stupid and in awe and wonder about how this magic is achieved. How does one have a computer do this by itself? I have looked at this paper https://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf and I must say at this point I think it is magic. Can somebody help me distill the main points of how this works and why a computer can do it on its own?
SIFT and CNNs are both methods to extract features from images, in different ways and with different outputs.
SIFT/SURF/ORB and similar feature extraction algorithms are "hand-made": independent of the real-world cases, they aim to extract some meaningful features. This approach has advantages and disadvantages.
Advantages :
You don't have to care about input image conditions, and you probably don't need any pre-processing step to extract those features.
You can directly take a SIFT implementation and integrate it into your application (see the sketch after these lists).
With GPU-based implementations (e.g., GPU-SIFT), you can achieve high extraction speed.
Disadvantages:
It has limitations in finding features: you will have trouble getting features on fairly plain surfaces.
SIFT/SURF/ORB cannot solve all problems that require feature classification/matching. Think of the face recognition problem: do you think that extracting and classifying SIFT features over a face would be enough to recognize people?
These are hand-made feature extraction techniques, so they cannot improve over time (unless, of course, a better technique is introduced).
Developing such a feature extraction technique requires a lot of research work.
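To make the "take an existing implementation" point concrete, here is a minimal OpenCV sketch (assumes the opencv-python package with SIFT available, i.e. OpenCV 4.4 or later; the file name is made up):

```python
import cv2

# Load an image and extract hand-crafted SIFT features from it
img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
# keypoints are scale/rotation-invariant locations; descriptors are 128-d vectors
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)  # e.g. 1532 and (1532, 128)
```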
On the other hand, in deep learning, you can start analyzing much more complex features, ones that would be impossible for a human to design by hand. CNNs are, as of today, the best approach for analyzing hierarchical filter responses and the much more complex features created by combining those filter responses (going deeper).
The main power of CNNs comes from not extracting features by hand: we only define how the computer has to look for features (see the sketch after the lists below). Of course, this method has some pros and cons too.
Advantages :
More data, better success! It all depends on data: if you have enough data to cover your case, DL outperforms hand-made feature extraction techniques.
As soon as you have extracted features from an image, you can use them for many purposes: segmenting the image, creating descriptive words, detecting objects inside the image, and recognizing them. The better part is that all of these can be obtained in one shot, rather than through complex sequential processes.
Disadvantages:
You need data. Probably a lot.
These days it is better to use supervised or reinforcement learning methods, as unsupervised learning is still not good enough.
It takes time and resources to train a good neural net. A complex hierarchy like Google's Inception took two weeks to train on an 8-GPU server rack. Of course, not all networks are that hard to train.
There is a learning curve. You don't have to know how SIFT works to use it in your application, but you do have to know how CNNs work to use them for your custom purposes.
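And the learned-features counterpart of the SIFT sketch above: reusing a pretrained CNN as a generic feature extractor. A minimal sketch assuming PyTorch/torchvision (the input tensor is a placeholder for a real preprocessed image):

```python
import torch
import torchvision.models as models

# Pretrained ResNet-18 with its classifier head removed: a feature extractor
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

image = torch.rand(1, 3, 224, 224)  # stand-in for a preprocessed image
with torch.no_grad():
    features = backbone(image)      # learned 512-d feature vector
print(features.shape)               # torch.Size([1, 512])
```

The same 512-d vectors can then feed segmentation, detection, or recognition heads, which is the "one shot, many purposes" advantage described above.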

Speech recognition, such as Siri

Software such as Siri takes voice commands and responds to those questions appropriately (~98% of the time). I wanted to know: when we write software to take an input stream of voice signal and respond to those questions,
do we need to convert the input into a human-readable language, such as English?
In nature we have many different languages, but when we speak, we basically make different noises. That's it. However, we have created the so-called alphabet to denote those noise variations.
So, again, my question is: when we write speech recognition algorithms, do we match those noise-variation signals directly against our database, or do we first convert those noise variations into English and then check what to answer from the database?
The "noise variation signals" you are referring to are called phonemes. How a speech recognition system translates these phonemes int words depends on the type of system. Siri is not a grammar based system where you tell the speech recognition system what types of phrases you are expecting based on a set of rules. Since Siri translates speech in an open context it probably uses some type of statistical modeling. A popular statistical model for speech recognition today is the Hidden Markov Model. While there is a database of sorts involved it is not a simple search of groups of phonemes into words. There is a pretty good high level description of the process and issues with translation here.
Apple's Siri is based on natural language understanding.
I believe Nuance is behind the scenes; refer to this article.
Nuance is a leader in speech recognition system development.
The accuracy of the Nuance Dragon engine is just amazing.
The client I am working for consumes the Nuance NOD service for their IVR system.
I have tried the Nuance Dragon SDK for Android.
From my experience, if you use Nuance you need not worry about noise variation and so on. But when you go for an enterprise release of your application, Nuance might be costly.
If you are planning to use the power of voice to drive your application, the Google API is also a good choice.
There are also APIs like Sphinx and PocketSphinx that can help with speech application development (see the sketch at the end of this answer). All of the above APIs take care of noise rejection, converting speech into text, and so on.
All you need to worry about is building your system to understand the semantic meaning of the given string, i.e., the recognized speech content. Apple presumably has a very good semantic-meaning interpreter. So give the Nuance SDK a try; it is available for Android, iOS, and Windows Phone, and there are HTTP client versions.
I hope this helps.
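As a starting point with the free option mentioned above, a minimal offline sketch using CMU Sphinx through the speech_recognition Python package (the WAV file name is made up; recognize_sphinx requires pocketsphinx to be installed):

```python
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("command.wav") as source:
    audio = recognizer.record(source)  # read the entire file

try:
    # PocketSphinx does the acoustic matching and returns plain text;
    # interpreting that text semantically is up to your application
    text = recognizer.recognize_sphinx(audio)
    print("Heard:", text)
except sr.UnknownValueError:
    print("Could not understand the audio")
```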

Image Processing on a micro-controller

I'm interested in starting a hobbyist project where I do some image processing by interfacing HW and SW. I am quite a newbie at this. I know how to do some basic image processing in Matlab using the existing image processing commands.
I personally enjoy working with HW and wanted to use a combination of HW/SW to do this. I've read articles on people using FPGAs, and basic FPGAs/micro-controllers, to go about doing this.
Here is my question: can someone recommend languages I should consider that will help me with the interfacing on a PC? I imagine the SW part would essentially be a GUI and a placeholder for all the processing that is done on the HW. Also, in terms of selecting the HW and realistically considering what I could do on it, could I get a few recommendations on that too?
Any recommendations will be appreciated!
EDIT: I read a few of the other posts saying that the requirements are directly related to knowing what kind of image processing one is doing. Initially, I want to do fingerprint recognition, so filtering, locating unique markers in the image, etc.
It all depends on what you are familiar with, how you plan on doing the interface between FPGA and PC, and generally the scale of what you want to do. Examples could be:
A fast system could, for instance, consist of a Xilinx SP605 board, using the PCI Express interface to quickly transfer image data between PC and FPGA. For this, you'd need to write a device driver (in C) and a user-space application (I've done this in C++/Qt).
A more realistic hobbyist system could be a Xilinx SP601 board, using Ethernet to transfer data. You'd then just have to write a simple protocol, possibly using raw sockets (no TCP/UDP) to make the FPGA-side Ethernet simpler, which can be done in basically any language offering network access (there's a Xilinx reference design for the SP605 demonstrating this; see the sketch after this list).
The simplest and cheapest solution would be an FPGA board with a serial connection. You probably wouldn't be able to do any "serious" image processing with this, but it should be enough for very simple proof-of-concept stuff, although the smaller FPGA devices used on these boards typically do not have much on-board memory available.
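For the raw-socket variant, a minimal PC-side sketch that reads raw Ethernet frames coming from the FPGA (Linux only, needs root; the interface name "eth0" and the EtherType are assumptions for illustration):

```python
import socket

ETH_P_ALL = 0x0003       # receive every EtherType
FPGA_ETHERTYPE = 0x88B5  # IEEE "local experimental" EtherType, picked arbitrarily

# Raw packet socket bound to the NIC facing the FPGA
sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
sock.bind(("eth0", 0))

while True:
    frame = sock.recv(2048)
    ethertype = int.from_bytes(frame[12:14], "big")
    if ethertype == FPGA_ETHERTYPE:
        payload = frame[14:]  # image data starts after the Ethernet header
        # ...accumulate payloads into a frame buffer here...
        print(len(payload), "bytes of image data")
```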
But again, it all depends on what you actually want to do.

Why is speech recognition difficult? [closed]

Why is speech recognition so difficult? What are the specific challenges involved? I've read through a question on speech recognition, which did partially answer some of my questions, but the answers were largely anecdotal rather than technical. It also still didn't really answer why we still can't just throw more hardware at the problem.
I've seen tools that perform automated noise reduction using neural nets and ambient FFT analysis with excellent results, so I can't see a reason why we're still struggling with noise except in difficult scenarios like ludicrously loud background noise or multiple speech sources.
Beyond this, isn't it just a case of using very large, complex, well-trained neural nets to do the processing, then throwing hardware at it to make it work fast enough?
I understand that strong accents are a problem and that we all have our colloquialisms, but these recognition engines still get basic things wrong when the person is speaking in a slow and clear American or British accent.
So, what's the deal? What technical problems are there that make it still so difficult for a computer to understand me?
Some technical reasons:
You need lots of tagged training data, which can be difficult to acquire once you take into account all the different accents, sounds etc.
Neural networks and similar gradient descent algorithms don't scale that well - just making them bigger (more layers, more nodes, more connections) doesn't guarantee that they will learn to solve your problem in a reasonable time. Scaling up machine learning to solve complex tasks is still a hard, unsolved problem.
Many machine learning approaches require normalised data (e.g. a defined start point, a standard pitch, a standard speed). They don't work well once you move outside these parameters. There are techniques such as convolutional neural networks etc. to tackle these problems, but they all add complexity and require a lot of expert fine-tuning.
Data size for speech can be quite large - the size of the data makes the engineering problems and computational requirements a little more challenging.
Speech data usually needs to be interpreted in context for full understanding; the human brain is remarkably good at "filling in the blanks" based on understood context. Missing information and ambiguous interpretations are resolved with the help of other modalities (like vision). Current algorithms don't "understand" context, so they can't use it to help interpret the speech data. This is particularly problematic because many sounds/words are ambiguous unless taken in context.
Overall, speech recognition is a complex task. Not unsolvably hard, but hard enough that you shouldn't expect any sudden miracles, and it will certainly keep many researchers busy for many more years.
Humans use more than their ears when listening; they use the knowledge they have about the speaker and the subject. Words are not arbitrarily sequenced together: there is a grammatical structure and redundancy that humans use to predict words not yet spoken. Furthermore, idioms and how we "usually" say things make prediction even easier.
In speech recognition we only have the speech signal. We can of course construct a model for the grammatical structure and use some kind of statistical model to improve prediction, but there is still the problem of how to model world knowledge, the knowledge of the speaker, and encyclopedic knowledge. We cannot, of course, model world knowledge exhaustively, but an interesting question is how much of it we actually need in the ASR to measure up to human comprehension.
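As a toy illustration of such a statistical model, even simple bigram counts over a corpus let a recognizer prefer likely word sequences over acoustically similar alternatives (the corpus below is made up):

```python
from collections import Counter, defaultdict

corpus = "i ate an apple . i ate eight apples . eight is a number .".split()

# count word pairs to estimate P(next word | previous word)
bigrams = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    bigrams[prev][cur] += 1

def score(words):
    # relative likelihood of a word sequence under the bigram counts
    s = 1.0
    for prev, cur in zip(words, words[1:]):
        total = sum(bigrams[prev].values()) or 1
        s *= bigrams[prev][cur] / total
    return s

# Acoustically identical candidates are disambiguated by context:
print(score("i ate an apple".split()))    # 0.5
print(score("i eight an apple".split()))  # 0.0: "eight" never follows "i"
```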
Speech is uttered in an environment of sounds: a clock ticking, a computer humming, a radio playing somewhere down the corridor, another human speaker in the background, etc. This is usually called noise, i.e., unwanted information in the speech signal, which speech recognition has to identify and filter out.
Spoken language != written language:
1: Continuous speech
2: Channel variability
3: Speaker variability
4: Speaking style
5: Speed of speech
6: Ambiguity
All these points have to be considered while building a speech recognition system; that's why it is quite difficult.
Referred from http://www.speech.kth.se/~rolf/gslt_papers/MarkusForsberg.pdf
I suspect you are interested in 'continuous' speech recognition, where the speaker speaks sentences (not single words) at normal speed.
The problem is not simply one of signal analysis, but there is a large natural language component as well. Most of us understand spoken language not by analyzing every single thing that we hear, as that would never work because each person speaks differently, phonemes are suppressed, pronunciations are different, etc. We just interpret a portion of what we hear and the rest is 'interpolated' by our brain once the context of what is being said is established. When you have no context, it is difficult to understand spoken language.
Lots of major problems in speech recognition are not directly related to the language itself:
different people (women, men, children, the elderly, etc.) have different voices
sometimes the same person sounds different, for example when they have a cold
different background noises
everyday speech sometimes contains words from other languages (like the German word Kindergarten used in US English)
some people who are not from the country learned the language later (they usually sound different)
some people speak faster, others speak slower
quality of the microphone
etc.
Solving all of these is always pretty hard... and on top of that you have the language/pronunciation to take care of...
For reference see the Wikipedia article http://en.wikipedia.org/wiki/Speech_recognition - it has a good overview including some links and book references which are a good starting point...
From the technical POV, the "audio preprocessing" is just one step in a long process... let's say the audio is "crystal clear"; then several of the above-mentioned aspects (like having a cold, having a mix-up of languages, etc.) still need to be solved.
All this means that for good speech recognition you need a model of the language(s) that is thorough enough to account for slight differences (like "ate" versus "eight"), which usually involves some context analysis (i.e. semantic and fact/world knowledge, see http://en.wikipedia.org/wiki/Semantic%5Fgap) etc.
Since almost all relevant languages have evolved and were not designed as mathematical models, you basically need to "reverse engineer" the available implicit and explicit knowledge about a language into a model, which is a big challenge IMHO.
Having worked with neural nets myself, I can assure you that while they provide good results in some cases, they are not "magical tools"... almost always a good neural net has been carefully designed and optimized for the specific requirement... in this context it needs both extensive experience/knowledge of languages and neural nets PLUS extensive training to achieve usable results...
It's been a decade since I took a language class in college, but from what I recall, language can be broken up into phonemes. Language processors do their best to identify these phonemes, but they are unique to every individual. Even once they are broken up, they must then be reassembled into a meaningful construct.
Take this example: humans are quite capable of reading with no punctuation, no capital letters, and no spaces. It takes a second, but we can do it quite readily. This is kind of what a computer has to look at when it gets a block of phonemes. However, computers are not nearly as good at parsing this data. One of the reasons is that it is difficult for computers to have context. Humans can even understand babies despite the fact that their phonemes can be completely wrong.
Even if you have all the phonemes correct, then arranging them into an order that makes sense is also difficult.
