I am new to machine learning. I am trying to create an input matrix (X) from a set of images (Stanford dog set of 120 breeds) to train a convolutional neural network. I aim to resize images and turn each image into one row by making each pixel a separate column.
If I directly resize images to a fixed size, the images lose their originality due to squishing or stretching, which is not good (first solution).
I can resize by fixing either width or height and then crop it (all resultant images will be of the same size as 100x100), but critical parts of the image can be cropped (second solution).
I am thinking of another way of doing it, but I am sure. Assume I want 10000 columns per image. Instead of resizing images to 100x100, I will resize the image so that the total pixel count will be around 10000 pixels. So, images of size 50x200, 100x100 and 250x40 will all converted into 10000 columns. For other sizes like 52x198, the first 10000 pixels out of 10296 will be considered (third solution).
The third solution I mentioned above seems to preserve the original shape of the image. However, it may be losing all of this originality while converting into a row since not all images are of the same size. I wonder about your comments on this issue. It will also be great if you can direct me to sources I can learn about the topic.
Solution 1 (simply resizing the input image) is a common approach. Unless you have a very different aspect ratio from the expected input shape (or your target classes have tight geometric constraints), you can usually still get good performance.
As you mentioned, Solution 2 (cropping your image) has the drawback of potentially excluding a critical part of your image. You can get around that by running the classification on multiple subwindows of the original image (i.e., classify multiple 100 x 100 sub-images by stepping over the input image horizontally and/or vertically at an appropriate stride). Then, you need to decide how to combine your multiple classification results.
Solution 3 will not work because the convolutional network needs to know the image dimensions (otherwise, it wouldn't know which pixels are horizontally and vertically adjacent). So you need to pass an image with explicit dimensions (e.g., 100 x 100) unless the network expects an array that was flattened from assumed dimensions. But if you simply pass an array of 10000 pixel values and the network doesn't know (or can't assume) whether the image was 100 x 100, 50 x 200, or 250 x 40, then the network can't apply the convolutional filters properly.
Solution 1 is clearly the easiest to implement but you need to balance the likely effect of changing the image aspect ratios with the level of effort required for running and combining multiple classifications for each image.
Related
I have a training dataset of 640x512 images that I would like to use with a 320x240 camera.
Is it ok to change the aspect ratio and the size of the training images to that of the camera?
Would it be better to upscale the camera frames?
It is better if you keep the aspect ratio of the images because you will be artificially modifying the composition of the objects in the image. What you can do is downscale the image by a factor of 2, so it's 320 x 256, then crop from the center so you have a 320 x 240 image. You can do this by simply removing the first 8 and last 8 columns of the image to get it to 320 x 240. Removing the first 8 and last 8 columns should be safe because it is very unlikely you will see meaningful information within an 8 pixel band on either side of the image.
If you are using a deep learning framework such as Tensorflow or PyTorch, there are pre-processing methods to automatically allow you to crop from the center as well as downscale the image by a factor of 2 for you. You just need to set up a pre-processing pipeline and have these two things in place. You don't have any code established so I can't help you with implementation details, but hopefully what I've said is enough to get you started.
Finally, do not upsample the images. There will be no benefit because you will be using existing information to interpolate to a larger space which is inaccurate. You can scale down, but never scale up. The only situation where this could be useful is if you use superresolution, but that would be for specific cases and it highly depends on what images you use. In general, I do not recommend upscaling. Take your training set and downscale to the resolution of the camera as the images from the camera would be what is used at inference and at that resolution.
While people usually tend to simply resize any image into a square while training a CNN (for example, resnet takes a 224x224 square image), that looks ugly to me, especially when the aspect ratio is not around 1.
(In fact, that might change ground truth, for example, the label that an expert might give the distorted image could be different than the original one).
So now I resize the image to, say, 224x160 , keeping the original ratio, and then I pad the image with 0s (by pasting it into a random location in a totally black 224x224 image).
My approach doesn't seem original to me, and yet I cannot find any information whatsoever about my approach versus the "usual" approach.
Funky!
So, which approach is better? Why? (if the answer is data dependent, please share your thoughts regarding when one is preferable to the other.)
According to Jeremy Howard, padding a big piece of the image (64x160 pixels) will have the following effect: The CNN will have to learn that the black part of the image is not relevant and does not help distinguishing between the classes (in a classification setting), as there is no correlation between the pixels in the black part and belonging to a given class. As you are not hard coding this, the CNN will have to learn it by gradient descent, and this might probably take some epochs. For this reason, you can do it if you have lots of images and computational power, but if you are on a budget on any of them, resizing should work better.
Sorry, this is late but this answer is for anyone facing the same issue.
First, if scaling with changing the aspect ratio will affect some important features, then you have to use zero-padding.
Zero padding doesn't make it take longer for the network to learn because of the large black area itself but because of the different possible locations that the unpadded image could be inside the padded image since you can pad an image in many ways.
For areas with zero pixels, the output of the convolution operation is zero. The same with max or average pooling. Also, you can prove that the weight is not updated after backpropagation if the input associated with that weight is zero under some activation functions (e.g. relu, sigmoid). So the large area doesn't make any updates to the weights in this sense.
However, the relative position of the unpadded image inside the padded image does indeed affect training. This is not due to the convolution nor the pooling layers but the last fully connected layer(s). For example, if the unpadded image is on the left relative inside the padded image and the output of flattening the last convolution or pooling layer was [1, 0, 0] and the output for the same unpadded image but on the right relative inside the padded image was [0, 0, 1] then the fully connected layer(s) must learn that [1, 0, 0] and [0, 0, 1] are the same thing for a classification problem.
Therefore, learning the equivariance of different possible positions of the image is what makes training take more time. If you have 1,000,000 images then after resizing you will have the same number of images; on the other hand, if you pad and want to consider different possible locations (10 randomly for each image) then you will have 10,000,000 images. That is, training will take 10 times longer.
That said, it depends on your problem and what you want to achieve. Also, testing both methods will not hurt.
I am using Perl's
Image::Imlib2
package to generate thumbnails from larger images.
I've done such tasks before with several ImageMagick interfaces (PHP, Ruby, Python) and it was relatively easy. I have no prior experience with Imlib2 and it is a long time since I wrote something in Perl, so I am sorry if this seems naive!
This is what I've tried so far. It is simple, and assumes that scaling an image will keep the aspect ratio, and the generated thumbnail will be an exact miniature copy of the original image.
use strict;
use warnings;
use Image::Imlib2;
my $dir = 'imgs/*';
my #files = glob ($dir);
foreach my $img ( #files ) {
my $image = Image::Imlib2->load($img);
my $cropped_image = $image->create_scaled_image(50, 50);
$cropped_image->save($img);
}
Original image
Generated image
My first look at the image tells me that something is wrong. It may be my ignorance on cropping, resizing and scaling, but the generated image is displaying wrongly on small screens.
I've read What's the difference between cropping and resizing?, and honestly didn't understand anything. Also this one Image scaling.
Could someone explain the differences between those three ideas, and if possible give examples (preferably with Perl) to achieve better results? Or at least describe what I should consider when I want to create thumbnails?
The code you use isn't preserving the aspect-ratio. From Image::Imlib2::create_scaled_image
If x or y are 0, then retain the aspect ratio given in the other.
So change the line
my $cropped_image = $image->create_scaled_image(50, 50);
to
my $scaled_image = $image->create_scaled_image(50, 0);
and the new image will be 50 pixels wide, and its height computed so to keep the original aspect-ratio.
Since this is not cropping I've changed the variable name as well.
As for other questions, below is a basic discussion from comments. Please search for tutorials on image processing. Also, documentation of major libraries often have short and good explanations.
This is aggregated from comments deemed helpful. Also see Borodin's short and clear answer.
Imagine that you want to draw a picture (of some nice photograph) yourself in the following way. You draw a grid of, say, 120 (horizontally) by 60 (vertically) boxes. So 120 x 60, 720 boxes. These are your "pixels," and each you may fill with only one color. If the photo you are re-drawing is "mostly" blue at some spot, you color that pixel blue. Etc. It is not easy to end up with a faithful redrawing -- the denser the pixels the beter.
Now imagine that you want to draw another copy of this, just smaller. If you make it 20x20 that will be completely different, since it's a square. The best chance of getting it to "look the same" is to pick 2-to-1 ratio (like 120x60), so say 40x20. That's "aspect-ratio." But there is still a problem, since now you have to decide all over again what color to pick for each box, so to represent what is "mostly" on the photo at that spot. There are algorithms for that ("sampling," see your second link). That's involved with "resizing." The "quality" of the obtained drawing clearly must be much worse.
So "resizing" isn't all that simple. But, for us users, we mostly need to roughly know what is involved, and to find out how to use these features in a library. So read documentation. Some uses are very simple, while sometimes you'll have to decide which "algorithm" to let it use, or some such. Again, what I do is read manuals carefully.
The basic version of "cropping" is simple -- you just cut off a part of the picture. Say, remove the first and last 20 columns and the bottom and top 10 rows, and from the initial 120x60 you get a picture of 80x40. This is normally done when outer parts of an image have just white areas (or, worse, black!). So you want to "cut out" the picture itself from the whole image. Many graphics tools can do that on their own, by analyzing the image and figuring out those areas. Or, we select and hit a button.
I'm still not certain that you understand the difference between these terms
Your original image is 752 × 500 pixels
Resizing is a vague term that just means making a picture a different size somehow
Scaling is to change the size of an image proportionally. Scaling your picture down by a factor of ten would result in an image 75 × 50 (it should be 75.2 but we can't have 0.2 of a pixel). Scaling it up would make it bigger
You have scaled your picture to 50 × 50 pixels, which is a vertical scale of 10 (500 ÷ 5) but a horizontal scale of 15 (752 ÷ 50), so it appears squashed horizontally (or stretched vertically)
Cropping is to reduce an image by removing parts of it. To crop your image to 50 × 50 you would choose a 50 × 50 rectangle out of the whole picture and remove the rest. It would be a piece about the size of your monkey's nose, but you can pick any region you wish
zdim has shown you how you can call
$image->create_scaled_image(0, 50)
so that the height, or y-dimension, is reduced to 50, while the width, or x-dimension, is scaled by the same factor. That will result in a thumbnail 75 × 50 as above
I hope that helps
As I said in my comment, there is an
Image::Magick
Perl module if you would prefer to be back on familiar ground
Resizing and scaling is the same; you just change the size of the image. You can make it smaller or bigger.
Depending on the interface, you have to give either the new dimensions or a scaling factor for the operation. A factor less than or greater than 1.0 would make the image smaller or bigger. Smaller images are created by subsampling and bigger images by interpolation.
Cropping is very simple. You select a rectangular region of an image and that's your new image. It's like using scissors.
In your code example the image is named cropped_image although it is created through scaling, or resizing.
The output image is an image of size 50 x 50 pixels. That's what you did here:
my $cropped_image = $image->create_scaled_image(50, 50);
So no matter how your image looks before, you stuff it into 50 x 50 pixels. In this case not only reducing the resolution but also changing the aspect ratio.
The image is not displayed improperly, it's displayed perfectly fine.
I am trying to find a way to determine whether an image needs to be rotated in order for the text to be horizontally aligned. And if it does need to be rotated then by how many degrees?
I am sending the images to tesseract and for tesseract to be effective, the text in the images needs to be horizontally aligned.
I'm looking for a way do this without depending on the "Orientation" metadata in the image.
I've thought of following ways to do this:
Rotate the image 90 degrees clockwise four times and send all four images to tesseract. This isn't ideal because of the need to process one image 4 times.
Use hough line transform to see if the lines are vertical or horizontal. If they are vertical then rotate the image. This way the image still might need to be rotated 180 degrees. So I'm unsure how effective this would be.
I'm wondering if there are other ways to accomplish this using OpenCV, imageMagik or any other image processing techniques.
If you have a 1000 images which say horizontal or vertical, you can resize these images to 224x224 and then fine-tune a Convolutional neural network, like AlexNet or VGG for this task. If you want to know how many right rotations to make for the image, you can set the labels as the number of clock-wise rotations, like 0,1,2,3.
http://caffe.berkeleyvision.org/gathered/examples/finetune_flickr_style.html
Aytempting ocr on all 4 orientations seems like a reasonable choice, and I doubt you will find a more reliable heuristic.
If speed is an issue, you could OCR a small part of the image first. Select a rectangular region, that has the proper amount of edge pixels and white/black ratio for text, then send that to tesseract in different orientations. With a small region, you could even try smaller steps than 90°, or combine it with another heuristic like Hough.
If you remember the most likely orientation based on previous images, and stop once an orientation is successfully processed by tesseract, you probably do not even have to try most orientations in most cases.
You can figure this out in a terminal with tesseract's psm option.
tesseract --psm 0 "infile" "outfile" will create outfile.osd which contains the info:
Page number: 0
Orientation in degrees: 90
Rotate: 270
Orientation confidence: 27.93
Script: Latin
Script confidence: 6.55
man tesseract
...
--psm N
Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
...
I have 2 images that I need to compare:
Image 1: size [512 x 512] with pixel dimension: 0.41 mm
Image 2: size [210 x 210] with pixel dimension 1 mm
I tried to use: imresize
imresize(Image_1, [210 210]) % to change size/pixel
However it reduce the resolution and image is not clear at all.
Any suggestion will be welcome!
if you meant to test if the two images are identical, instead of resizing the images, you can use filters with different bandwidths. or a higher level feature, such as sift feature, can usually take care of sizing issues because it picks the most interesting scale internally.
vlfeat is a good toolbox if you use matlab.
You always have that problem with comparing two images of different resolutions. I would do a pre-processing of images to make them comparable, maybe something more than just making them of the same size. That pre-processing really depends on your images.
Anyway, perhaps it would be better to re-size the smaller one to a larger version using one of the methods mentioned here: http://www.mathworks.com/help/images/ref/imresize.html and then compare them. For example, I would enlarge the smaller image using 'lanczos3' method.
imresize(Image_2,[512 512],'lanczos3');