Example of variational autoencoder for non-image data - feature-extraction

Most examples of variational autoencoders use image data as input. Can anyone give me a small example, or a link to resource material, where a variational autoencoder has been applied to non-image data? I especially struggle to understand how I can extract features from a variational autoencoder. Most of the examples set the latent dimension size equal to two. In that case, is the size of the encoded input equal to two?

Related

Best CNN architectures for small images (80x80)?

I'm new to the computer vision area and I hope you can help me with some fundamental questions regarding CNN architectures.
I know some of the most well-known ones are:
VGG Net
ResNet
Dense Net
Inception Net
Xception Net
They usually need an input of images around 224x224x3 and I also saw 32x32x3.
Regarding my specific problem, my goal is to train on biomedical images of size 80x80 for a 4-class classification, so at the end I'll have a dense layer of 4. Also, my dataset is quite small (1000 images) and I wanted to use transfer learning.
Could you please help me with the following questions? It seems to me that there is no single correct answer to them, but I need to understand the correct way of thinking about them. I would appreciate it if you could give me some pointers as well.
1) Should I scale my images? How about the opposite and shrink to 32x32 inputs?
2) Should I change the input of the CNNs to 80x80? What parameters should I change mainly? Any specific ratio for the kernel and the parameters?
3) Also I have another problem, the input requires 3 channels (RGB) but I'm working with grayscale images. Will it change the results a lot?
4) Instead of scaling should I just fill the surroundings (between the 80x80 and 224x224) as background? Should the images be centered in this case?
5) Do you have any recommendations regarding what architecture to choose?
6) I've seen some adaptations of these architectures to 3D/volumes inputs instead of 2D/images. I have a similar problem to the one I described here but with 3D inputs. Is there any common reasoning when choosing a 3D CNN architecture instead of a 2D?
Thanks in advance!
I am assuming you have basic know-how in using CNNs for classification.
Answering questions 1~3
You scale your image for several reasons. The smaller the image, the faster training and inference will be. However, you will lose important information when shrinking the image. There is no single right answer; it all depends on your application. Is real-time processing important? If not, always stick to the original size.
You will also need to resize your images to fit the input size of predefined models if you plan to retrain them. However, since your images are grayscale, you will need to either find models trained on grayscale images or create a 3-channel image by copying the same value into the R, G and B channels. This is not efficient, but it lets you reuse high-quality models trained by others.
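A minimal sketch of the channel-copying step, assuming the image is held as a 2-D NumPy array (the function name gray_to_rgb is made up for illustration):

```python
import numpy as np

def gray_to_rgb(gray):
    """Copy a (H, W) grayscale image into three identical channels: (H, W, 3)."""
    gray = np.asarray(gray)
    if gray.ndim != 2:
        raise ValueError("expected a 2-D grayscale image")
    # Add a trailing channel axis, then repeat it three times (R = G = B).
    return np.repeat(gray[:, :, np.newaxis], 3, axis=2)
```

For an 80x80 input this gives an 80x80x3 array whose three channels are identical, which is the shape an RGB-pretrained model expects once the image has also been resized.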
The best way I see for you to handle this problem is to train everything from scratch. 1000 images can seem like a small dataset, but since your domain is specific and only requires 4 classes, training from scratch doesn't seem that bad.
Question 4
When the size is different, always scale. Filling the surroundings with background will cause the model to learn the empty space, and that is not what we want.
Also make sure the input size and format during inference is the same as the input size and format during training.
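As a rough sketch of scaling rather than padding, the following resizes an image by nearest-neighbour sampling using only NumPy. The function name resize_nearest is an assumption for illustration; in practice an image library's resize with bilinear or bicubic interpolation would usually be preferred.

```python
import numpy as np

def resize_nearest(image, out_h, out_w):
    """Resize a (H, W) or (H, W, C) array by nearest-neighbour sampling."""
    h, w = image.shape[:2]
    # For every output row/column, pick the source row/column it maps to.
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return image[rows[:, None], cols]
```

Applying the same function with the same target size at training and inference time keeps the input format consistent.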
Question 5
If processing time is not a problem, use ResNet. If processing time is important, use MobileNet.
Question 6
It depends on your input data. If you have 3D data, then you can use it. More input data usually helps classification. But 2D will be enough to solve certain problems: if you can classify the images by looking at 2D images, most probably 2D images will be enough to complete the task.
I hope this will clear some of your problems and direct you to a proper solution.

How to plot the PSNR behavior with respect to the bit-rate for different quality factors and block sizes?

I am learning video processing and I have successfully implemented a video encoder based on the JPEG algorithm (for spatial redundancy) and a block matching algorithm (for temporal redundancy).
Now I am asked to discuss the PSNR behavior with respect to the bit-rate for different quality factors and block sizes (and to provide a graph).
The question confused me: I am not sure what should be included in the graph or how to display these properties on it.
Can anyone give me some ideas?
Please forgive me if my English is poor.
Thank you very much!
The PSNR as a function of the bitrate is generally a concave function.
When you encode a video with your encoder, you get one PSNR value and one bitrate value. To get a curve, you need to vary your "quality factor"; generally this is the quantization parameter (QP).
You then have multiple (bitrate, PSNR) pairs, which you can plot as a curve with bitrate on the x-axis and PSNR on the y-axis.
If you have to compare different block sizes, plot one such curve per block size.
Example below:
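One way to produce such a plot, sketched in Python under stated assumptions: toy_encode is a hypothetical stand-in for your own encoder (uniform quantization plus zlib, used only to get a size in bytes), and all function names here are made up for illustration. matplotlib is assumed to be available for the plotting helper.

```python
import zlib
import numpy as np

def psnr(original, decoded, peak=255.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    diff = original.astype(np.float64) - decoded.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical signals
    return float(10.0 * np.log10(peak ** 2 / mse))

def bitrate_kbps(num_bytes, num_frames, fps):
    """Average bitrate in kbit/s of an encoded sequence."""
    duration_s = num_frames / fps
    return num_bytes * 8 / duration_s / 1000.0

def toy_encode(frames, step):
    """Hypothetical stand-in for a real encoder (NOT JPEG or block matching).
    Returns (decoded_frames, encoded_size_in_bytes)."""
    q = np.round(frames.astype(np.float64) / step).astype(np.int16)
    size = len(zlib.compress(q.tobytes()))
    decoded = np.clip(q.astype(np.float64) * step, 0, 255).astype(np.uint8)
    return decoded, size

def rd_points(frames, steps, fps=25):
    """One (bitrate, PSNR) pair per quantization step."""
    points = []
    for step in steps:
        decoded, size = toy_encode(frames, step)
        points.append((bitrate_kbps(size, len(frames), fps), psnr(frames, decoded)))
    return points

def plot_rd_curves(curves):
    """curves: dict mapping a label (e.g. 'block 8x8') to a list of (bitrate, PSNR)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    for label, pts in curves.items():
        pts = sorted(pts)
        plt.plot([p[0] for p in pts], [p[1] for p in pts], marker="o", label=label)
    plt.xlabel("Bitrate (kbit/s)")
    plt.ylabel("PSNR (dB)")
    plt.legend()
    plt.show()
```

With a real encoder, you would call rd_points once per block size (replacing toy_encode with your own encode/decode routine) and pass the results to plot_rd_curves as, for example, {"block 8x8": ..., "block 16x16": ...}.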

How to provide a score value to an image based on pattern information in it?

I saw a few image processing and analysis related questions on this forum and thought I could try it for my question. I have, say, 30 two-dimensional arrays (to keep things simple, although my real data set is very big) which form 30 individual images. Many of these images have a similar base structure, but differ in the intensities of individual pixels. Due to this intensity variation, some images have a prominent pattern (say, a larger area of localised intense pixels, or high intensity pixels outlining an edge). Some images just contain single high intensity pixels, randomly distributed without any prominent feature (so basically noise).
I am now trying to build an algorithm which can give a score to an image based on different factors, like the area fraction of high intensity pixels, the mean and the standard deviation, so that I can find the image with the most prominent pattern (in other words, rank them). But these factors depend on a common factor, i.e. a user-defined threshold, which is different for every image. Any inputs on how I can achieve this ranking or an image score in an automated manner (without the use of a threshold)? I initially used Matlab to perform all the processing and area fraction calculations, but now I am using R to do the same thing.
Can some amount of machine learning/ random forest stuff help me here? I am not sure. Some inputs would be very valuable.
P.S. If this is not the right forum to post in, any suggestions on where I can get good advice?
First of all, let me suggest a change in terminology: what you denote as feature is usually called pattern in image processing, while what you call factor is usually called feature.
I think that the main weakness of the features you are using (mean, standard deviation) is that they are only based on the statistics of single pixels (1st order statistics) without considering correlations (neighborhood relations of pixels). If you take a highly structured image and shuffle the pixels randomly, you will still have the same 1st order statistics.
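The shuffling argument can be checked with a small sketch. The test image (a bright square on a dark background) and the function name are assumptions made up for illustration:

```python
import numpy as np

def first_order_stats(image):
    """Mean and standard deviation over all pixels, ignoring pixel positions."""
    return float(np.mean(image)), float(np.std(image))

rng = np.random.default_rng(0)  # fixed seed so the demo is repeatable

# Hypothetical structured image: a bright square on a dark background.
structured = np.zeros((32, 32))
structured[8:24, 8:24] = 1.0

# Same pixel values at random positions: the spatial structure is destroyed.
shuffled = rng.permutation(structured.ravel()).reshape(structured.shape)

print(first_order_stats(structured))
print(first_order_stats(shuffled))
```

Both calls print the same mean and standard deviation, even though only one of the two images contains a visible pattern.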
There are many ways to take these correlations into account. A simple, efficient and therefore popular method is to apply some filters on the image first (high-pass, low-pass etc.) and then get the 1st order statistics of the resulting image. Other methods are based on Fast Fourier Transform (FFT).
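A minimal sketch of the filter-then-statistics idea, assuming a 3x3 box blur as the low-pass filter. The score used here (the fraction of the image variance that survives the blur) is just one hypothetical choice of feature, not a standard measure, and it needs no intensity threshold.

```python
import numpy as np

def box_blur3(image):
    """3x3 mean filter (a simple low-pass filter) with edge replication."""
    img = np.asarray(image, dtype=np.float64)
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    acc = np.zeros((h, w))
    # Sum the nine shifted copies of the image, then divide by 9.
    for dy in range(3):
        for dx in range(3):
            acc += padded[dy:dy + h, dx:dx + w]
    return acc / 9.0

def structure_score(image):
    """Fraction of the pixel variance that remains after low-pass filtering.
    Spatially coherent patterns keep most of it; isolated noisy pixels do not."""
    img = np.asarray(image, dtype=np.float64)
    total = np.var(img)
    if total == 0:
        return 0.0  # a constant image has no pattern to score
    return float(np.var(box_blur3(img)) / total)
```

Images could then be ranked by structure_score; an image of randomly scattered bright pixels should score clearly lower than one with a compact bright region of the same total intensity.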
Of course machine learning is also an option here. You could try convolutional neural networks for example, but I would try the simple filtering stuff first.

Simple but reasonably accurate algorithm to determine if tag / keyword is related to an image

I have a hard problem to solve which is about automatic image keywording. You can assume that I have a database with 100000+ keyworded low quality jpeg images for training (low quality = low resolution about 300x300px + low compression ratio). Each image has about 40 mostly accurate keywords (data may contain slight "noise"). I can also extract some data on keyword correlations.
Given a color image and a keyword, I want to determine the probability that the keyword is related to this image.
I need a creative, understandable solution which I could implement on my own in about a month or less (I plan to use Python). What I have found so far is machine learning, neural networks and genetic algorithms. I was also thinking about generating some kind of signature for each keyword, which I could then check against not yet seen images.
Crazy/novel ideas are appreciated as well if they are practicable. I'm also open to using other Python libraries.
My current algorithm is extremely complex and computationally heavy. It suggests keywords instead of calculating probability and 50% of suggested keywords are not accurate.
Given the hard requirements of the application, only crude, brute-force solutions can be proposed.
For every image, use some segmentation method and keep, say, the four largest segments. Distinguish one or two of them as being background (those extending to the image borders), and the others as foreground, or item of interest.
Characterize the segments in terms of dominant color (using a very rough classification based on color primaries), and in terms of shape (size relative to the image, circularity, number of holes, dominant orientation and a few others).
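A rough sketch of two such segment descriptors, assuming the segmentation step has already produced a boolean mask per segment. The eight-colour palette and the function names are assumptions for illustration; shape descriptors such as circularity or orientation are not covered here.

```python
import numpy as np

# Very rough palette: the corners of the RGB cube.
PRIMARIES = {
    "black": (0, 0, 0), "red": (255, 0, 0), "green": (0, 255, 0),
    "blue": (0, 0, 255), "yellow": (255, 255, 0), "magenta": (255, 0, 255),
    "cyan": (0, 255, 255), "white": (255, 255, 255),
}

def dominant_color(pixels):
    """pixels: (n, 3) array of RGB values (0..255) belonging to one segment.
    Returns the name of the palette colour that most pixels are closest to."""
    names = list(PRIMARIES)
    palette = np.array([PRIMARIES[n] for n in names], dtype=np.float64)
    px = np.asarray(pixels, dtype=np.float64)
    # Squared Euclidean distance from every pixel to every palette colour.
    dists = ((px[:, None, :] - palette[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    counts = np.bincount(nearest, minlength=len(names))
    return names[int(counts.argmax())]

def relative_size(mask):
    """Fraction of the image area covered by a boolean segment mask."""
    return float(np.mean(mask))
```

For a segment given by mask, the pixels to pass in would be image[mask], which has shape (n, 3) for an RGB image.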
Then for every keyword you can build a classifier that decides if a given image has/hasn't this keyword. After training, the classifiers will tell you if the image has/hasn't the keyword(s). If you use a fuzzy classification, you get a "probability".
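As a sketch of one classifier per keyword, the following uses plain logistic regression as a stand-in for the "fuzzy" classifier, since it outputs a value between 0 and 1. The choice of logistic regression, the function names, and the training settings are all assumptions; features is assumed to hold one numeric descriptor vector per image.

```python
import numpy as np

def train_keyword_classifier(features, has_keyword, lr=0.5, epochs=500):
    """Fit a logistic-regression model for ONE keyword by gradient descent.
    features: (n_images, n_features) array; has_keyword: (n_images,) of 0/1."""
    X = np.asarray(features, dtype=np.float64)
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    y = np.asarray(has_keyword, dtype=np.float64)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))    # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)    # gradient of the mean log-loss
    return w

def keyword_probability(w, feature_vector):
    """Estimated probability that an image with these features has the keyword."""
    x = np.append(np.asarray(feature_vector, dtype=np.float64), 1.0)
    return float(1.0 / (1.0 + np.exp(-x @ w)))
```

Training one weight vector per keyword and storing them in a dictionary keyed by keyword would then answer the original "image plus keyword" query with a single call to keyword_probability.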

Why do image compression algorithms process the image by sub-blocks?

For instance, consider the DFT or DCT. Precisely, what would be the differences between an image transformed by sub-blocks, and an image transformed whole? Is the resulting file size smaller? Is the algorithm more efficient? Does the transformed image look different? Thanks.
They are designed so they can be implemented using parallel hardware. Each block is independent, and can be calculated on a different computing node, or shared out to as many nodes as you have.
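A minimal sketch of a block-wise DCT, assuming an orthonormal DCT-II and image dimensions that are multiples of the block size (function names are made up for illustration). Each iteration of the loop touches only its own tile, which is what makes it possible to hand tiles to separate workers.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row has a different scale factor
    return c

def blockwise_dct(image, block=8, inverse=False):
    """Apply a 2-D DCT (or its inverse) independently to each block x block tile."""
    h, w = image.shape
    if h % block or w % block:
        raise ValueError("image dimensions must be multiples of the block size")
    c = dct_matrix(block)
    out = np.empty((h, w), dtype=np.float64)
    for y in range(0, h, block):
        for x in range(0, w, block):
            tile = image[y:y + block, x:x + block].astype(np.float64)
            # Forward: C T C^T. Inverse: C^T T C (C is orthonormal).
            out[y:y + block, x:x + block] = c.T @ tile @ c if inverse else c @ tile @ c.T
    return out
```

Changing pixels inside one tile changes only that tile's coefficients, whereas with a whole-image transform every coefficient depends on every pixel.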
Also, as noted in an answer to "Why JPEG compression processes image by 8x8 blocks?", the computational complexity is high. I think it is (block_x_size × block_y_size)^2.
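Taking that quadratic estimate at face value, a small sketch shows why blocks help. It assumes a direct (non-fast) 2-D transform in which every output coefficient sums over every pixel of its block; fast algorithms reduce both counts, so the numbers are only illustrative. The function name is an assumption.

```python
def direct_transform_cost(width, height, block=None):
    """Multiply-add count for a direct 2-D transform.
    block=None means the whole image is transformed as a single block."""
    if block is None:
        return (width * height) ** 2
    n_blocks = (width // block) * (height // block)
    return n_blocks * (block * block) ** 2

whole = direct_transform_cost(512, 512)            # one 512x512 transform
tiled = direct_transform_cost(512, 512, block=8)   # 4096 independent 8x8 transforms
print(whole, tiled, whole // tiled)
```

Under this assumption, a 512x512 image costs 68,719,476,736 operations as a single block but 16,777,216 when split into 8x8 blocks, a factor of 4096.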
It's to make the image smaller. There are many ways to subdivide an image into blocks. The simplest is by complete rows. More advanced tilings follow a space-filling curve, e.g. a Hilbert curve, which preserves additional spatial locality and is also used in mapping applications. JPEG 2000, by contrast, splits the image into rectangular tiles before applying its wavelet transform.

Resources