How does a marker-based augmented reality algorithm (like ARToolkit's) work?

For my job I've been using a Java version of ARToolkit (NyARToolkit). So far it has proven good enough for our needs, but my boss now wants the framework ported to other platforms such as the web (Flash, etc.) and mobile devices. While I suppose I could use other ports, I'm increasingly annoyed by not knowing how the kit works and, beyond that, by some of its limitations. Later I'll also need to extend the kit's abilities to add things like interaction (virtual buttons on cards, etc.), which as far as I've seen NyARToolkit doesn't support.
So basically, I need to replace ARToolkit with a custom marker detector (and, in the case of NyARToolkit, try to get rid of JMF and use a better solution via JNI). However, I don't know how these detectors work. I know about 3D graphics and I've built a nice framework around it, but I need to know how to build the underlying tech :-).
Does anyone know of any sources on how to implement a marker-based augmented reality application from scratch? When searching on Google I only find "applications" of AR, not the underlying algorithms :-/.

'From scratch' is a relative term. Truly doing it from scratch, without using any pre-existing vision code, would be very painful and you wouldn't do a better job of it than the entire computer vision community.
However, if you want to do AR with existing vision code, this is more reasonable. The essential sub-tasks are:
1. Find the markers in your image or video.
2. Make sure they are the ones you want.
3. Figure out how they are oriented relative to the camera.
The first task is keypoint localization. Techniques for this include SIFT keypoint detection, the Harris corner detector, and others. Some of these have open source implementations. I think OpenCV offers the Harris corner detector as an option of its goodFeaturesToTrack function.
The second task is making region descriptors. Techniques for this include SIFT descriptors, HOG descriptors, and many many others. There should be an open-source implementation of one of these somewhere.
The third task is also done by keypoint localizers. Ideally you want an affine transformation, since this will tell you how the marker is sitting in 3-space. The Harris affine detector should work for this. For more details go here:
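To make the third task a little more concrete, here is a minimal sketch (not ARToolkit's or OpenCV's code) of estimating a 2D affine map from three marker corners and their detected image positions. The function names and the corner coordinates are made up for illustration. Keep in mind that an affine map only approximates how a flat marker looks under perspective; a full camera pose would normally be recovered from a homography over four corners together with the camera intrinsics.

```python
def solve3(a, b):
    """Solve the 3x3 linear system a * x = b using Cramer's rule."""
    def det(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det(a)
    if abs(d) < 1e-12:
        raise ValueError("source points are collinear; affine map is undetermined")
    solution = []
    for col in range(3):
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]          # replace one column with the right-hand side
        solution.append(det(m) / d)
    return solution


def estimate_affine(src, dst):
    """Return ((a, b, tx), (c, d, ty)) mapping each src point onto its dst point."""
    rows = [[x, y, 1.0] for x, y in src]
    row_x = solve3(rows, [p[0] for p in dst])
    row_y = solve3(rows, [p[1] for p in dst])
    return tuple(row_x), tuple(row_y)


def apply_affine(transform, point):
    (a, b, tx), (c, d, ty) = transform
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


# Hypothetical data: three corners of a unit-square marker and where
# a corner detector found them in the camera image (pixels).
marker_corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
image_corners = [(120.0, 80.0), (180.0, 95.0), (105.0, 140.0)]

T = estimate_affine(marker_corners, image_corners)
fourth_corner = apply_affine(T, (1.0, 1.0))   # predicted position of the 4th corner
```

Comparing the predicted fourth corner with the one actually detected gives a rough feel for how much perspective distortion the affine model is ignoring.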


How does a computer reproduce the SIFT paper method on its own in deep learning?

Let me begin by saying that I am struggling to understand what is going on in deep learning. From what I gather, it is an approach in which the computer itself engineers different layers of representations and features, so that it can learn things on its own. SIFT seems to be a common way to detect things by tagging and hunting for scale-invariant structures in some representation. Again, I am completely stupid and in awe and wonder about how this magic is achieved. How does one have a computer do this by itself? I have looked at this paper and I must say at this point I think it is magic. Can somebody help me distill the main points of how this works and why a computer can do it on its own?
SIFT and CNNs are both methods for extracting features from images, but they work in different ways and produce different outputs.
SIFT/SURF/ORB and similar feature extraction algorithms are "hand-made" feature extraction algorithms. That means they aim to extract some meaningful features independently of the real-world case at hand. This approach has some advantages and disadvantages.
Advantages :
You don't have to care much about input image conditions, and you probably don't need any pre-processing step to extract those features.
You can take an existing SIFT implementation and integrate it directly into your application.
With GPU-based implementations (e.g. GPU-SIFT), you can achieve high processing speed.
Disadvantages :
They are limited in where they can find features. You will have trouble getting features on fairly plain surfaces.
SIFT/SURF/ORB cannot solve every problem that requires feature classification or matching. Think of the face recognition problem: do you think that extracting and classifying SIFT features over a face will be enough to recognize people?
These are hand-made feature extraction techniques, so they cannot improve over time (unless, of course, a better technique is introduced).
Developing such a feature extraction technique requires a lot of research work.
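To make "hand-made" concrete, here is a small sketch of a fixed, human-designed filter: the classic Sobel kernel for horizontal gradients, applied to a tiny synthetic image. This is not SIFT itself (SIFT builds on many more steps); it only illustrates that the numbers in the kernel were chosen by a person and never change. The helper name `filter2d` and the toy image are made up for this example.

```python
# Sobel kernel for horizontal intensity changes; the weights are fixed by design.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]


def filter2d(image, kernel):
    """Valid-mode 2D cross-correlation on a list-of-lists grayscale image."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for r in range(h - kh + 1):
        row = []
        for c in range(w - kw + 1):
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out


# Toy image: dark left half, bright right half (a vertical edge in the middle).
image = [[0, 0, 0, 9, 9, 9] for _ in range(4)]
response = filter2d(image, SOBEL_X)
# The response is large only where the window straddles the edge.
```

On a plain region the response is zero everywhere, which is the "trouble on plain surfaces" mentioned in the list above.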
On the other hand, with deep learning you can start analyzing much more complex features that are impossible for a human to recognize. CNNs are currently an excellent approach for analyzing hierarchical filter responses and the much more complex features created by combining those filter responses (going deeper).
The main power of CNNs comes from not extracting features by hand. We only define "how" the PC has to look for features. Of course this method has some pros and cons too.
Advantages :
More data, better results! It all depends on data. If you have enough data to describe your case, DL outperforms hand-made feature extraction techniques.
Once you have extracted the features from an image, you can use them for many purposes: segmenting the image, generating descriptive words, detecting objects inside the image, recognizing them. Even better, all of these can be obtained in one shot rather than through complex sequential processes.
Disadvantages :
You need data. Probably a lot.
These days it is better to use supervised or reinforcement learning methods, as unsupervised learning is still not good enough.
It takes time and resources to train a good neural net. A complex architecture like Google Inception took 2 weeks to train on an 8-GPU server rack. Of course, not all networks are that hard to train.
There is a learning curve. You don't have to know how SIFT works to use it in your application, but you do have to know how CNNs work to use them for your own custom purposes.
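As a very reduced illustration of "we only define how the PC has to look for features", the sketch below lets gradient descent find the weights of a single 3-tap filter from example inputs and desired outputs. It is plain linear regression on one filter, not a CNN, and all names and hyperparameters (`learn_filter`, the learning rate, the step count) are assumptions chosen for the example. The idea it shares with a CNN is that the filter weights come from data instead of being designed by hand.

```python
import random


def correlate1d(signal, weights):
    """Valid-mode 1D cross-correlation."""
    k = len(weights)
    return [sum(weights[j] * signal[n + j] for j in range(k))
            for n in range(len(signal) - k + 1)]


def learn_filter(signal, target, taps=3, lr=0.1, steps=500):
    """Fit filter weights by gradient descent on the mean squared error."""
    weights = [0.0] * taps
    m = len(target)
    for _ in range(steps):
        pred = correlate1d(signal, weights)
        err = [p - t for p, t in zip(pred, target)]
        # d(MSE)/d(w_j) = 2/m * sum_n err[n] * signal[n + j]
        grad = [2.0 / m * sum(err[n] * signal[n + j] for n in range(m))
                for j in range(taps)]
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


rng = random.Random(0)
signal = [rng.uniform(-1.0, 1.0) for _ in range(200)]

# "Ground truth" the learner never sees directly: a simple edge detector.
hidden_filter = [-1.0, 0.0, 1.0]
target = correlate1d(signal, hidden_filter)

learned = learn_filter(signal, target)   # ends up close to hidden_filter
```

A real network stacks many such filters with nonlinearities in between, which is where the need for lots of data and training time comes from.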

Leap Motion point cloud

How can we access the point cloud in the Leap Motion API? One feature that led me to purchase it was the point cloud demo from their promo video, but I can't seem to locate documentation regarding it and user replies on the forums seem mixed. Am I just missing something?
I'm looking to use the Leap Motion as a sort of cheap 3D scanner.
That demo was clearly a mockup which simulated a 3-D model of the human hand, not actual point cloud data. You can tell by the fact that points were displayed which could not have possibly been read by the sensor, due to obstruction.
orion78fr points to one forum post on this, but the transcript of an interview with the founders provides more information direct from the source:
Can you please allow access to cloud points in SDK?
David: So I think sometimes people have a misperception as to really
how things work in our hardware. It’s very different from other things
like the Kinect, and in normal device operation we have very different
priorities than most other technologies. Our priority is precision,
small movements, very low latency, very low CPU usage - so in order to
do that we will often be making sacrifices that make what the device
is doing completely not applicable to what I think you’re getting at,
which is 3D scanning.
What we’re working on are sort of alternative device modes that will
let you use it for those sorts of purposes, but that’s not what it was
originally built for. You know, it’s our goal to let it be able to do
those things and with the hardware can do many things. But our
priority right now is of course human computer interaction, which we
think is really the missing component in technology, and that’s our
core passion.
Michael: We really believe in trying to squeeze every ounce of
optimization and performance out of the devices for the purpose they
were built. So in this case the Leap today is intended to be a great
human computer interface. And we have made thousands of little
optimizations along the way to make it better, that might sacrifice
things in the process that might be useful for things like 3D scanning
objects. But those are intentional decisions, but they don’t mean that
we think 3D scanning isn’t exciting and isn’t a good use case. There
will be other things we build as a company in the future, and other
devices that might be able to do both or maybe there will be two
different devices. One that is fully optimized for 3D scanning, and
one that continues to be optimized and as great as it can be at
tracking fingers and hands.
If we haven’t done a good job communicating that the device isn’t
about 3D scanning or isn’t going to be able to 3D scan, that’s
unfortunate and it’s a mistake on our part - but that’s something that
we’ve had to sacrifice. The good news is that those sacrifices have
made the main device really exceptional at tracking hands and fingers.
I have developed with the Leap Motion Controller as well as several other 3-D scanning systems, and from what I've seen I'd seriously doubt if we're ever going to get point cloud data out of the currently shipping hardware. If we do, the fidelity will be far below what we see for gross finger and hand tracking from that device.
There are some low-cost alternatives for 3-D scanning that have started to emerge. SoftKinetic has their DepthSense 325 camera for $250 (which is effectively the same as the Creative Gesture Camera that is only $150 right now). The DS 325 is a time-of-flight IR camera that gives you a 320x240 point cloud map of the 3-D space in front of it. In my tests, it worked well with opaque materials, but anything with a little gloss or shininess gave it trouble.
The PrimeSense Carmine 1.09 ($200) uses structured light to get point cloud data in front of it, as an advancement of the technology they supplied for the original Kinect. It has a lower effective spatial resolution than the SoftKinetic cameras, but it seems to provide less depth noise and to work on a wider variety of materials.
The DUO was also a promising project, but unfortunately its Kickstarter campaign failed. It was using stereoscopic imaging from an IR source to return a point cloud from a couple of PS3 Eye cameras. They may restart that project at some point in the future.
While the Leap may not do what you want, it looks like more and more devices are coming out in the consumer price range to enable 3-D scanning.
See this link
It says that yes, the Leap Motion can theoretically handle point clouds and this was temporarily part of the visualiser in beta, and no, you can't access it using the Leap Motion APIs right now.
It may appear in the future, but it's not a priority of the Leap Motion team.
As of Leap Motion SDK 2.x, you can at least access the stereo camera images! From my own experience this is a convenient solution for many tasks where people ask for the point cloud data, which is why I mention it here. It does not give you the point cloud that the driver generates internally to extract the pointer metadata, but it does let you generate a point cloud of your own, so I think it is strongly related to the question.
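For anyone wondering how a point cloud falls out of a stereo pair: once you have a disparity map from some stereo matcher, depth follows from Z = f * B / d. The sketch below assumes an already rectified and undistorted image pair and an already computed disparity map; the focal length, baseline and principal point are made-up numbers, not Leap Motion calibration values, and `disparity_to_points` is a hypothetical helper, not part of the Leap SDK.

```python
def disparity_to_points(disparity, focal_px, baseline_m, cx, cy):
    """Back-project a disparity map (in pixels) into 3D points (in metres).

    disparity : list of rows; values <= 0 mean "no match found"
    focal_px  : focal length in pixels (assumed equal for x and y)
    baseline_m: distance between the two cameras in metres
    cx, cy    : principal point in pixels
    """
    points = []
    for v, row in enumerate(disparity):
        for u, d in enumerate(row):
            if d <= 0:
                continue                      # skip pixels without a valid match
            z = focal_px * baseline_m / d     # depth from similar triangles
            x = (u - cx) * z / focal_px
            y = (v - cy) * z / focal_px
            points.append((x, y, z))
    return points


# Tiny made-up disparity map: two valid pixels, two invalid ones.
disparity = [[0.0, 8.0],
             [4.0, 0.0]]
cloud = disparity_to_points(disparity, focal_px=400.0, baseline_m=0.04,
                            cx=0.5, cy=0.5)
```

The hard part in practice is getting a clean disparity map from the raw images, which this sketch deliberately leaves out.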
Currently there is no access to the point cloud in the public API. But I think this video is no mock-up, so there should be a possibility:
Roadtovr recently reviewed the Nimble Sense Kickstarter, which uses a point cloud.
It’s the same technology that the Kinect 2 uses, and it’s supposed to have some advantages over the Leap Motion.
Because it’s a depth-sensing camera, you can point the camera top-down like the Touch+, although their product will not ship until next year.

How to extract an object from an image

I want to extract an object such as a man, a car or something like that from an image. The image is just an ordinary image, not a medical image or some other type made for a specific purpose.
I have searched for a long time and found that automatic image segmentation algorithms just segment the image into a set of regions or output the contours in the image, not a semantic object. So I turned to interactive image segmentation algorithms and found some popular ones such as interactive graph cuts and SIOX. I think these algorithms meet my needs.
Furthermore, I also downloaded two interactive image segmentation tools: the first one is the interactive segmentation tool, the second one is the interactive segmentation toolbox.
So my questions are:
1. Is an interactive image segmentation algorithm the right solution for my task, given that performance is the most important factor?
2. If I want to use an automatic image segmentation algorithm, what should I do next?
Any suggestions will be appreciated.
If you want to pick out an object from a single static image with just a few scribbles, I recommend you read
'Closed-form solution to image matting'
or 'Spectral matting',
or 'Lazy snapping',
but in my tests the last doesn't perform as well as the first two methods when dealing with fine structures like hair.
You can find their MATLAB source code very easily via Google.
However, the first two methods aren't very pleasant to use; I think you'll need to make a lot of modifications to make them easy to use. Their main problem, IMHO, is that they require very careful scribbles on the image: if you draw extra scribbles, or draw them in the wrong positions, you'll ruin your object cut-out.
Apart from these, you may try Bayesian matting, Poisson matting, etc., which all require a helper image called a trimap, and that is really hard to draw.
Extracting objects from an image, especially a photograph, is not as easy as you might think. You may want to take a look at the OpenCV project.
Other than OpenCV, I would suggest looking at ITK. It is very popular in medical image analysis projects, because there it is known that semi-automatic segmentation tools provide the best results. I think that the methods apply to natural images as well.
Try looking at tools like livewire segmentation and level-set based image segmentation. ITK has several demos that allow you to play with these tools on your own images. A demo application such as this is part of the open source distribution, but it can also be downloaded directly from the ITK servers (look around for instructions).
If this is a business case, you'd better look for companies specialized in "video content analysis". I mean it: reliable people and vehicle detection isn't a one-person project.
General-purpose segmentation tools won't do the trick because they have no notion of what a man or a car looks like. All they are designed to do is find uniform regions in an image.
It is quite late, but there is an algorithm called connected component labeling, which you may find useful.
Here is the wiki link for the algorithm
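For reference, connected component labeling on a binary mask can be sketched in a few lines. This version uses a breadth-first flood fill with 4-connectivity rather than the classic two-pass scheme, and the function name `label_components` is just for this example. Note that it only groups touching foreground pixels; it has no idea whether a blob is a man or a car.

```python
from collections import deque


def label_components(mask):
    """Label 4-connected foreground regions of a binary mask.

    Returns (labels, count) where labels has the same shape as mask,
    0 marks background and 1..count mark the individual regions.
    """
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or labels[r][c]:
                continue                      # background or already labeled
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:                      # flood-fill this region
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


mask = [[1, 1, 0, 0],
        [0, 1, 0, 1],
        [0, 0, 0, 1]]
labels, count = label_components(mask)
```

You would typically run this on the output of a thresholding or segmentation step and then keep the region you are interested in, for example the largest one.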

Extracting a specific melody/beat/rhythm from a specific instrument from a mixed wave (or other music format) file

Is it possible to write a program that can extract a melody/beat/rhythm provided by a specific instrument in a wave (or other music format) file made up of multiple instruments?
Which algorithms could be used for this and what programming language would be best suited to it?
This is a fascinating area. The basic mathematical tool here is the Fourier Transform. To get an idea of how it works, and how challenging it can be, take a look at the analysis of the opening chord to A Hard Day's Night.
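To get a feel for what the Fourier transform gives you, here is a minimal sketch using a naive O(n^2) discrete Fourier transform on a synthetic two-tone signal. The sample rate, tone frequencies and helper name are assumptions for the example; for real audio you would use an FFT library instead of this slow loop.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude of each DFT bin from 0 up to (but excluding) Nyquist."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]


sample_rate = 800          # Hz, deliberately low to keep the example fast
n = 400                    # half a second of audio, so each bin is 2 Hz wide

# Synthetic "instrument": a 110 Hz fundamental plus a quieter 220 Hz overtone.
samples = [math.sin(2 * math.pi * 110 * t / sample_rate)
           + 0.5 * math.sin(2 * math.pi * 220 * t / sample_rate)
           for t in range(n)]

mags = dft_magnitudes(samples)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
peak_hz = peak_bin * sample_rate / n     # frequency of the strongest component
```

Real instruments have many overtones that overlap with those of other instruments, which is what makes the separation problem hard.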
An instrument produces a sound signature, just the same way our voices do. There are algorithms out there that can pick a single voice out of a crowd and identify that voice from its signature in a database, which is used in forensics. In the exact same way, the sound signature of a single instrument can be picked out of a soundscape (such as your mixed wave) and be used to pick out a beat, or make a copy of that instrument on its own track.
Obviously if you're thinking about making copies of tracks, i.e. to break down the mixed wave into a single track per instrument you're going to be looking at a lot of work. My understanding is that because of the frequency overlaps of instruments, this isn't going to be straightforward by any means... not impossible though as you've already been told.
There's quite an interesting blog post by Comparisonics about sound matching technologies which might be useful as a start for your quest for information:
To extract the beat or rhythm, you might not need perfect isolation of the instrument you're targeting. A general solution may be hard, but if you're trying to solve it for a particular piece, it may be possible. Try implementing a band-pass filter and see if you can tune it to select the instrument you're after.
Also, I just found this Mac product called PhotoSounder. They have a blog showing different ways it can be used, including isolating an individual instrument (with manual intervention).
Look into Karaoke machine algorithms. If they can remove voice from a song, I'm sure the same principles can be applied to extract a single instrument.
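One simple trick often described for karaoke-style vocal removal is to subtract one stereo channel from the other, so that anything mixed identically into both channels (typically a centre-panned vocal) cancels out. The sketch below shows this on synthetic tones. It rests on a strong assumption about the mix: the part you want to drop is dead centre and the part you want to keep is not. The signals and the function name are made up for the example.

```python
import math


def remove_center(left, right):
    """Return the 'side' signal; content identical in both channels cancels."""
    return [(l - r) / 2.0 for l, r in zip(left, right)]


sample_rate = 8000
n = 800

vocal = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(n)]
guitar = [math.sin(2 * math.pi * 196 * t / sample_rate) for t in range(n)]

# Hypothetical mix: vocal panned to the centre, guitar panned hard left.
left = [v + g for v, g in zip(vocal, guitar)]
right = list(vocal)

side = remove_center(left, right)   # the vocal is gone; half-level guitar remains
```

In real recordings reverb and stereo effects keep the cancellation from being clean, so treat this as a starting point rather than a solution.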
Most instruments make sound within certain frequency ranges.
If you write a tunable bandpass filter - a filter that only lets a certain frequency range through - it'll be about as close as you're likely to get. It will not be anywhere near perfect; you're asking for black magic. The only way to perfectly extract a single instrument from a track is to have an audio sample of the track without that instrument, and do a difference of the two waveforms.
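A tunable band-pass filter can be sketched as a single biquad section. The coefficients below follow the band-pass formulas (constant 0 dB peak gain) from the widely circulated "Audio EQ Cookbook"; the centre frequency, Q and test tones are arbitrary values picked for the example, and the helper names are my own.

```python
import math


def bandpass_coeffs(center_hz, q, sample_rate):
    """Normalized biquad band-pass coefficients (peak gain of 1 at center_hz)."""
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)                        # feed-forward part
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)        # feedback part
    return b, a


def biquad(samples, b, a):
    """Run the filter: y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out


def rms(samples):
    return math.sqrt(sum(v * v for v in samples) / len(samples))


sample_rate = 8000


def tone(hz):
    """One second of a unit-amplitude sine wave."""
    return [math.sin(2 * math.pi * hz * t / sample_rate) for t in range(sample_rate)]


b, a = bandpass_coeffs(center_hz=1000.0, q=5.0, sample_rate=sample_rate)
rms_in_band = rms(biquad(tone(1000.0), b, a))      # passes almost unchanged
rms_out_of_band = rms(biquad(tone(100.0), b, a))   # strongly attenuated
```

Raising Q narrows the band, which helps reject neighbouring instruments but also cuts into the overtones of the instrument you actually want.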
C, C++, Java, C#, Python, Perl should all be able to do all of this with the right libraries. Which one is "best" depends on what you already know.
It's possible in principle, but very difficult - an open area of research, even. You may be interested in the project paper for Dancing Monkeys, a step generation program for StepMania. It does some fairly sophisticated beat detection and music analysis, which is detailed in the paper (linked near the bottom of that page).

Tools/Techniques to use our ability to think spatially

What software/UI techniques can leverage our spatial memory? I think and remember in physical space; often the location of something is as important as its content. For instance, I keep an untidy desk, but I know where to find things, and I use different parts of my (multiscreen) desktop for different windows/icons. I annotate books (with Post-its) and can remember the facing page, top/bottom, etc. In the good old days we used to file things so we could find them later; now we use search, but that doesn't really use our spatial abilities. Google Maps etc. are brilliant, but they're only really being used for the real world. What about our internal locations? How can we leverage the wetware to best advantage?
EDIT -> I've thought about a code tool that would profile the running code and then build a visualisation with classes/methods scaled to match their use, with large/small motorways/footpaths between them. Spatial layout still escapes me though - UI at the top, DB at the bottom, but how do you position a class in 3D based on its usage?
Slightly off topic since it's not code per se, but I've built my own tools to translate some of our complicated XML config files into DOT format and run them through Graphviz so that I could visualise them. We were able to strip out lots of pointless stuff from them after just looking at them.
Wetware win :o)
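A translation like that can be quite small. The sketch below assumes a made-up XML layout (`component` elements with `uses` children); the actual config files described above will certainly look different, so the element and attribute names here are purely illustrative.

```python
import xml.etree.ElementTree as ET


def xml_to_dot(xml_text):
    """Turn <component name=...><uses ref=.../></component> into a DOT digraph."""
    root = ET.fromstring(xml_text)
    lines = ["digraph config {"]
    for comp in root.iter("component"):
        name = comp.get("name")
        lines.append(f'  "{name}";')                      # one node per component
        for dep in comp.findall("uses"):
            target = dep.get("ref")
            lines.append(f'  "{name}" -> "{target}";')    # one edge per dependency
    lines.append("}")
    return "\n".join(lines)


# Hypothetical config file content.
CONFIG = """<config>
  <component name="ui"><uses ref="service"/></component>
  <component name="service"><uses ref="db"/></component>
  <component name="db"/>
</config>"""

dot = xml_to_dot(CONFIG)
```

Saving `dot` to a file and running something like `dot -Tpng config.dot -o config.png` lets Graphviz work out the spatial layout for you.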
