How to model a time-dependent matrix with an RNN?

I am working on video classification. I sub-sample each video temporally and engineer some features from each sub-sample. Let's say I can represent these features as a two-dimensional matrix.
The values of the matrix are time-dependent, so for each video I have a sequence of matrices, one per sub-sample.
I need to use an RNN to model these time-dependent matrix values and produce a representation of the video. In other words, the RNN should be able to classify the video into classes based on these time-dependent matrix values.
Is this possible with RNNs? Would it be good practice? If so, what guidelines could anyone give me? What is a good library to use? What would be good tutorials? Thanks.

Flatten each image of the video and use it as one element of the sequence.
Most likely you will have to put a convnet under the RNN to get reasonable results.
That is, you feed each image to a convnet, then flatten the resulting activation map and feed it to the RNN cell.
https://stackoverflow.com/a/36992625/447599
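The flatten-then-recur idea above can be sketched in plain Python. This is only an illustration under stated assumptions: the convnet stage is left out, the weights are random and untrained, and every name here (`flatten`, `rnn_classify`, the sizes) is hypothetical. A deep learning library would normally supply the recurrent layer and the training loop.

```python
import math
import random

def flatten(matrix):
    """Turn a 2-D feature matrix (list of rows) into a flat vector."""
    return [value for row in matrix for value in row]

def matvec(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def random_matrix(rows, cols, rng, scale=0.1):
    """Random weights; a real model would learn these by training."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def rnn_classify(matrices, w_in, w_rec, w_out):
    """Run a plain (Elman) RNN over the sequence and return class probabilities."""
    hidden = [0.0] * len(w_rec)
    for matrix in matrices:  # one step per temporal sub-sample
        x = flatten(matrix)
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]
    logits = matvec(w_out, hidden)  # classify from the final hidden state only
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
steps, height, width, hidden_size, num_classes = 5, 3, 4, 8, 2
video = [random_matrix(height, width, rng, scale=1.0) for _ in range(steps)]
w_in = random_matrix(hidden_size, height * width, rng)
w_rec = random_matrix(hidden_size, hidden_size, rng)
w_out = random_matrix(num_classes, hidden_size, rng)
probs = rnn_classify(video, w_in, w_rec, w_out)
```

Here `video` stands in for the sequence of per-sub-sample feature matrices, and `probs` holds one probability per class.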

Related

Best CNN architectures for small images (80x80)?

I'm new to the computer vision area and I hope you can help me with some fundamental questions regarding CNN architectures.
I know some of the most well-known ones are:
VGG Net
ResNet
Dense Net
Inception Net
Xception Net
They usually expect input images of around 224x224x3, and I have also seen 32x32x3.
Regarding my specific problem, my goal is to train on biomedical images of size 80x80 for a 4-class classification, so at the end I'll have a dense layer of 4 units. Also, my dataset is quite small (1000 images) and I wanted to use transfer learning.
Could you please help me with the following questions? It seems to me that there is no single correct answer to them, but I need to understand the correct way of thinking about them. I would appreciate any pointers as well.
Should I scale my images? How about the opposite and shrink to 32x32 inputs?
Should I change the input of the CNNs to 80x80? What parameters should I change mainly? Any specific ratio for the kernel and the parameters?
Also I have another problem, the input requires 3 channels (RGB) but I'm working with grayscale images. Will it change the results a lot?
Instead of scaling should I just fill the surroundings (between the 80x80 and 224x224) as background? Should the images be centered in this case?
Do you have any recommendations regarding what architecture to choose?
I've seen some adaptations of these architectures to 3D/volumes inputs instead of 2D/images. I have a similar problem to the one I described here but with 3D inputs. Is there any common reasoning when choosing a 3D CNN architecture instead of a 2D?
Thanks in advance!
I am assuming you have basic know-how in using CNNs for classification.
Answering questions 1-3
You scale your image for several purposes. The smaller the image, the faster the training and inference. However, you will lose important information in the process of shrinking the image. There is no one right answer and it all depends on your application. Is real-time processing important? If not, always stick to the original size.
You will also need to resize your images to fit the input size of predefined models if you plan to retrain them. However, since your images are grayscale, you will need to either find models trained on grayscale images or create a 3-channel image by copying the same value to the R, G and B channels. This is not efficient, but it lets you reuse high-quality models trained by others.
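As a rough illustration of the channel-copying step, here is a minimal sketch on nested lists. The function name `gray_to_rgb` is made up for this example; with an array library the same thing is usually a single stack or repeat call.

```python
def gray_to_rgb(gray):
    """Copy each grayscale value into R, G and B: rows x cols -> rows x cols x 3."""
    return [[[value, value, value] for value in row] for row in gray]

gray = [[0, 128], [255, 64]]   # a tiny 2x2 grayscale image
rgb = gray_to_rgb(gray)        # same image, now with 3 identical channels
```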
The best way I see for you to handle this problem is to train everything from scratch. 1000 images can seem like a small amount of data, but since your domain is specific and only requires 4 classes, training from scratch doesn't seem that bad.
Question 4
When the size is different, always scale. Filling the surroundings with background will cause the model to learn the empty space, and that is not what we want.
Also make sure the input size and format during inference is the same as the input size and format during training.
Question 5
If processing time is not a problem, ResNet. If processing time is important, then MobileNet.
Question 6
It depends on your input data. If you have 3D data then you can use it. More input data usually helps classification, but 2D will be enough to solve certain problems. If you can classify the images by looking at the 2D images, most probably 2D images will be enough to complete the task.
I hope this will clear some of your problems and direct you to a proper solution.

Matching photographed image with screenshot (or generated image based on data model)

First of all, I have to say I'm new to the field of computer vision and I'm currently facing a problem that I tried to solve with OpenCV (Java wrapper) without success.
Basically, I have a picture of a part of a model taken by a camera (different angles, resolutions, rotations...) and I need to find the position of that part in the model.
Example Picture:
Model Picture:
So one question is: Where should I start/which algorithm should I use?
My first try was to use KeyPoint Matching with SURF as Detector, Descriptor and BF as Matcher.
It worked for about 2 pictures out of 10. I used the default parameters and tried other detectors, without any improvement. (Maybe it's a question of the right parameters. But how do I find the right parameters combined with the right algorithm?...)
Two examples:
My second try was to use color to differentiate certain elements in the model and to compare the structure with the model itself. (In addition to the picture of the model, I also have an XML representation of the model.)
Right now I have extracted the color red from the image, adjusting the h, s, v values manually to get the best detection for about 4 pictures, but this fails for other pictures.
Two examples:
I also tried to use edge detection (Canny, grayscale, with histogram equalization) to detect geometric structures. For some results I could imagine that it will work, but using the same Canny parameters for other pictures "fails". Two examples:
As I said, I'm not familiar with computer vision and just tried out some algorithms. My problem is that I don't know which combination of algorithms and techniques is best, nor which parameters I should use. Testing it manually seems to be impossible.
Thanks in advance
gemorra
Your initial idea of using SURF features was actually very good. Just try to understand how the parameters for this algorithm work and you should be able to register your images. A good starting point would be to vary only the Hessian threshold, and to be fearless while doing so: your features are quite well defined, so try thresholds around 2000 and above (increasing in steps of 500-1000 until you get good results is totally OK).
Alternatively, you can try to detect your ellipses, calculate an affine warp that normalizes them, and run a cross-correlation to register them. This alternative implies much more work, but is quite fascinating. Some ideas on that normalization using the covariance matrix and its Cholesky decomposition here.

Comparing and matching images

I am looking to compare a new image to a database of images, and then output the one with the highest "similarity". The images I want to compare are similar, but the problem is that they are not pixel-by-pixel equal. I've already tried the BoW (Bag of Words) model, as recommended (I implemented it in Matlab, but I'm willing to learn OpenCV). I tried various implementations without success; the best correct rate I got was 30%, which is really low.
Let me show you what I am talking about: imgur gallery with 5 example images. I want to detect that the four initial images are equal, and the fifth one is different. I wouldn't mind only detecting that the ones with the same angle orientation are equal, though. (In my example 2, 3 and 4)
So, that being said, are there any better methods than BoW for that? Or perhaps BoW should be enough if I implemented in a different way?
Thanks in advance.
I would try a keypoint-based approach using randomized trees. It has the advantage that point extraction is local and adapts to many sorts of transformations (like the ones your pictures show). The advantage of being local is that the features are more robust against changes in illumination across the scene, occlusions, and so on.
Also, take a look at the SURF algorithm.

Computer Science Theory: Image Similarity

So I'm trying to run a comparison of different images and was wondering if anyone could point me in the right direction for some basic metrics I can take for the group of images.
Assuming I have two images, A and B, I pretty much want as much data as possible about each so I can later programmatically compare them. Things like "general color", "general shape", etc. would be great.
If you can help me find specific properties and algorithms to compute them that would be great!
Thanks!
EDIT: The end goal here is to be able to have a computer tell me how "similar" two pictures are. If two images are the same but in one someone blurred out a face, they should register as fairly similar. If two pictures are completely different, the computer should be able to tell.
What you are talking about is far too general and non-specific.
Image information is formalised as Entropy.
What you seem to be looking for is basically feature extraction and then comparing these features. There are tons of features that can be extracted but a lot of them could be irrelevant depending on the differences in the pictures.
There are space domain and frequency domain descriptors of the image which each can be useful here. I can probably name more than 100 descriptors but in your case, only one could be sufficient or none could be useful.
Pre-processing is also important, perhaps you could turn your images to grey-scale and then compare them.
This field is so immensely diverse, so you need to be a bit more specific.
(Update)
What you are looking for is a topic of hundreds if not thousands of scientific articles. But well, perhaps a simplistic approach can work.
So assuming that the question here is not identifying objects and there is no transform, translation, scale or rotation involved and we are only dealing with the two images which are the same but one could have more noise added upon it:
1) Image domain (space domain): Compare the pixels one by one and add up the square of the differences. Normalise this value by the width*height - just divide by the number of pixels. This could be a useful measure of similarity.
2) Frequency domain: Convert the image to a frequency-domain image (using an FFT in an image processing library such as OpenCV), which will be 2D as well. Compute the same squared difference as above, but perhaps limit the range of frequencies. Then normalise by the number of pixels. This fares better on noise and translation and on a small rotation but not on scale.
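The space-domain measure from item 1 can be sketched as follows. This is a minimal example that assumes two grayscale images of identical size stored as nested lists. The function name `mean_squared_error` is chosen for illustration.

```python
def mean_squared_error(image_a, image_b):
    """Sum of squared pixel differences, divided by the number of pixels."""
    same_rows = len(image_a) == len(image_b)
    if not same_rows or any(len(ra) != len(rb) for ra, rb in zip(image_a, image_b)):
        raise ValueError("images must have the same dimensions")
    total = 0
    count = 0
    for row_a, row_b in zip(image_a, image_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += (pixel_a - pixel_b) ** 2
            count += 1
    return total / count

original = [[10, 20], [30, 40]]
noisy = [[10, 22], [27, 40]]
score = mean_squared_error(original, noisy)   # (0 + 4 + 9 + 0) / 4 = 3.25
```

A score of 0 means the images are pixel-for-pixel identical; larger values mean larger differences.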
SURF is a good candidate algorithm for comparing images
Wikipedia Article
A practical example (in Mathematica), identifying corresponding points in two images of the moon (rotated, colorized and blurred) :
You can also calculate sum of differences between histogram bins of those two images. But it is also not a silver bullet...
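A minimal sketch of the histogram comparison, assuming grayscale images with values in 0..255 stored as nested lists. The bin count and the function names are arbitrary choices for the example.

```python
def histogram(image, bins=4, max_value=256):
    """Normalised histogram: fraction of pixels falling into each bin."""
    counts = [0] * bins
    bin_width = max_value / bins
    for row in image:
        for pixel in row:
            counts[min(int(pixel / bin_width), bins - 1)] += 1
    total = sum(counts)
    return [count / total for count in counts]

def histogram_distance(image_a, image_b, bins=4):
    """Sum of absolute differences between corresponding histogram bins."""
    hist_a = histogram(image_a, bins)
    hist_b = histogram(image_b, bins)
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))

dark = [[0, 10], [20, 30]]
bright = [[200, 210], [220, 255]]
shuffled = [[30, 20], [10, 0]]   # same pixels as `dark`, different layout

far = histogram_distance(dark, bright)      # 2.0, the maximum possible
near = histogram_distance(dark, shuffled)   # 0.0 although the layouts differ
```

The `shuffled` case shows why this is not a silver bullet: a histogram ignores where the pixels are.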
I recommend taking a look at OpenCV. The package offers most (if not all) of the techniques mentioned above.

What algorithm could be used to identify if images are the "same" or similar, regardless of size?

TinEye, the "reverse image search engine", allows you to upload/link to an image and it is able to search through the billion images it has crawled and it will return links to images it has found that are the same image.
However, it isn't a naive checksum or anything related to that. It is often able to find both images of a higher resolution and lower resolution and larger and smaller size than the original image you supply. This is a good use for the service because I often find an image and want the highest resolution version of it possible.
Not only that, but I've had it find images of the same image set, where the people in the image are in a different position but the background largely stays the same.
What type of algorithm could TinEye be using that would allow it to compare an image with others of various sizes and compression ratios and yet still accurately figure out that they are the "same" image or set?
These algorithms are usually fingerprint-based. A fingerprint is a reasonably small data structure, something like a long hash code. However, the goals of a fingerprint function are opposite to the goals of a hash function. A good hash function should generate very different codes for very similar (but not equal) objects. The fingerprint function should, on the contrary, generate the same fingerprint for similar images.
Just to give you an example, this is a (not particularly good) fingerprint function: resize the picture to a 32x32 square, normalize and quantize the colors, reducing the number of colors to something like 256. Then you have a 1024-byte fingerprint for the image. Just keep a table of fingerprint => [list of image URLs]. When you need to look up images similar to a given image, just calculate its fingerprint value and find the corresponding image list. Easy.
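The toy fingerprint described above might look roughly like this. The sketch makes several assumptions that are not in the answer: images are grayscale nested lists, resizing is nearest-neighbour, "normalize" means stretching to the image's own min/max, and quantization uses 8 levels rather than 256. The URLs and function names are placeholders.

```python
from collections import defaultdict

def resize_nearest(image, size):
    """Nearest-neighbour resize of a nested-list image to size x size."""
    rows, cols = len(image), len(image[0])
    return [[image[r * rows // size][c * cols // size] for c in range(size)]
            for r in range(size)]

def fingerprint(image, size=32, levels=8):
    """Resize, stretch to the full range, quantise: one byte per pixel."""
    small = resize_nearest(image, size)
    pixels = [p for row in small for p in row]
    low, high = min(pixels), max(pixels)
    span = (high - low) or 1          # avoid dividing by zero on flat images
    return bytes(min((p - low) * levels // span, levels - 1) for p in pixels)

index = defaultdict(list)             # fingerprint => [list of image URLs]

def add_image(url, image):
    index[fingerprint(image)].append(url)

def find_similar(image):
    return index.get(fingerprint(image), [])

# A 64x64 gradient, the same picture at 128x128, and an inverted version.
base = [[(r + c) * 2 for c in range(64)] for r in range(64)]
bigger = [[base[r // 2][c // 2] for c in range(128)] for r in range(128)]
inverted = [[252 - p for p in row] for row in base]

add_image("http://example.com/small.png", base)
matches = find_similar(bigger)        # the larger copy finds the small original
misses = find_similar(inverted)       # a different picture finds nothing
```

Because lookup is an exact table match, any change that flips a single quantized pixel produces a miss.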
What is not easy: to be useful in practice, the fingerprint function needs to be robust against crops, affine transforms, contrast changes, etc. Construction of good fingerprint functions is a separate research topic. Quite often they are hand-tuned and use a lot of heuristics (e.g. knowledge about typical photo contents, about the image format or additional data in EXIF, etc.)
Another variation is to use more than one fingerprint function, try to apply each of them and combine the results. Actually, it's similar to finding similar texts. Just instead of "bag of words" the image similarity search uses a "bag of fingerprints" and finds how many elements from one bag are the same as elements from another bag. How to make this search efficient is another topic.
Now, regarding the articles/papers. I couldn't find a good article that would give an overview of different methods. Most of the public articles I know discuss specific improvement to specific methods. I could recommend to check these:
"Content Fingerprinting Using Wavelets". This article is about audio fingerprinting using wavelets, but the same method can be adapted for image fingerprinting.
PERMUTATION GROUPING: INTELLIGENT HASH FUNCTION DESIGN FOR AUDIO & IMAGE RETRIEVAL. Info on Locality-Sensitive Hashes.
Bundling Features for Large Scale Partial-Duplicate Web Image Search. A very good article, talks about SIFT and bundling features for efficiency. It also has a nice bibliography at the end
The creator of the FotoForensics site wrote this blog post on the topic. It was very useful to me, and it shows algorithms that may be good enough for you and that require a lot less work than wavelets and feature extraction.
http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html
aHash (also called Average Hash or Mean Hash). This approach crushes the image into a grayscale 8x8 image and sets the 64 bits in the hash based on whether the pixel's value is greater than the average color for the image.
pHash (also called "Perceptive Hash"). This algorithm is similar to aHash but use a discrete cosine transform (DCT) and compares based on frequencies rather than color values.
dHash. Like aHash and pHash, dHash is pretty simple to implement and is far more accurate than it has any right to be. As an implementation, dHash is nearly identical to aHash but it performs much better. While aHash focuses on average values and pHash evaluates frequency patterns, dHash tracks gradients.
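For a feel of how little code aHash and dHash need, here is a sketch that starts from images already shrunk to grayscale grids (8x8 for aHash, 9 columns by 8 rows for dHash). The shrinking step, the bit order and the "left brighter than right" rule for dHash are assumptions of this example. Two hashes are then compared by counting differing bits (Hamming distance).

```python
def average_hash(gray_8x8):
    """aHash: one bit per pixel, set when the pixel is above the mean."""
    pixels = [p for row in gray_8x8 for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def difference_hash(gray_9x8):
    """dHash: one bit per horizontal neighbour pair, set when left > right."""
    bits = 0
    for row in gray_9x8:              # 8 rows of 9 pixels give 64 bits
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which the two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")

ramp8 = [[c * 30 for c in range(8)] for _ in range(8)]     # dark to bright
ramp8_brighter = [[p + 10 for p in row] for row in ramp8]
ramp9 = [[c * 30 for c in range(9)] for _ in range(8)]
ramp9_reversed = [row[::-1] for row in ramp9]              # bright to dark

same = hamming_distance(average_hash(ramp8), average_hash(ramp8_brighter))
opposite = hamming_distance(difference_hash(ramp9), difference_hash(ramp9_reversed))
```

In this toy case a uniform brightness shift leaves the aHash unchanged (`same` is 0), while reversing the gradient flips every dHash bit (`opposite` is 64).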
It's probably based on improvements of feature extraction algorithms, taking advantage of features which are scale invariant.
Take a look at
Feature extraction
SIFT, other site
or, if you are REALLY interested, you can shell out some 70 bucks (or at least look at the Google preview) for
Feature Extraction & Image Processing
http://tineye.com/faq#how
Based on this, Igor Krivokon's answer seems to be on the mark.
The Hough Transform is a very old feature extraction algorithm that you might find interesting. I doubt it's what TinEye uses, but it's a good, simple starting place for learning about feature extraction.
There are also slides from a neat talk by some University of Toronto folks about their work at astrometry.net. They developed an algorithm for matching telescope images of the night sky to locations in star catalogs in order to identify the features in the image. It's a more specific problem than what TinEye tries to solve, but I'd expect that a lot of the basic ideas they talk about are applicable to the more general problem.
Check out this blog post (not mine) for a very understandable description of a very understandable algorithm which seems to get good results for how simple it is. It basically partitions the respective pictures into a very coarse grid, sorts the grid by red:blue and green:blue ratios, and checks whether the sorts were the same. This naturally works for color images only.
The pros most likely get better results using vastly more advanced algorithms. As mentioned in the comments on that blog, a leading approach seems to be wavelets.
They may well be doing a Fourier Transform to characterize the complexity of the image, as well as a histogram to characterize the chromatic distribution, paired with a region categorization algorithm to assure that similarly complex and colored images don't get wrongly paired. Don't know if that's what they're using, but it seems like that would do the trick.
What about resizing the pictures to a standard small size and checking SSIM or luma-only PSNR values? That's what I would do.
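A luma-only PSNR check could be sketched like this, assuming both pictures have already been resized to the same small size and are stored as nested lists of (R, G, B) tuples. The Rec. 601 luma weights and the function names are choices made for this example. SSIM is more involved and is not shown.

```python
import math

def luma(pixel):
    """Approximate brightness of an (R, G, B) pixel using Rec. 601 weights."""
    red, green, blue = pixel
    return 0.299 * red + 0.587 * green + 0.114 * blue

def luma_psnr(image_a, image_b, peak=255.0):
    """Peak signal-to-noise ratio in dB, computed on luma only."""
    squared = [(luma(pa) - luma(pb)) ** 2
               for row_a, row_b in zip(image_a, image_b)
               for pa, pb in zip(row_a, row_b)]
    mse = sum(squared) / len(squared)
    if mse == 0:
        return math.inf               # identical luma everywhere
    return 10 * math.log10(peak ** 2 / mse)

picture = [[(100, 100, 100), (50, 50, 50)]]
shifted = [[(105, 105, 105), (55, 55, 55)]]   # every channel raised by 5

identical = luma_psnr(picture, picture)
close = luma_psnr(picture, shifted)           # roughly 34 dB
```

Higher values mean the two pictures are closer. What counts as "the same image" is a threshold you would have to pick for your own data.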
