I'm detecting objects from a live camera feed. The backend model is ssd_mobilenet_v2. If I capture an image and feed it to the model 10 times, I get a bounding box of the same size every time. But when I feed live video to the model (without anything changing in the frame), I get a bounding box of a different size with every frame (a variation of 4 to 5 pixels at a resolution of 640x480). The reason I suspect is that, due to tiny variations in the digital camera sensor, no two frames will be 100% the same; some pixels will almost certainly have different intensity values. In the link below, the user used GaussianBlur to average pixel intensities across a 21 x 21 region. Is this the only way to fix this, or is there a better way to correct it?
I'm using a Raspberry Pi camera to get the video feed.
https://www.pyimagesearch.com/2015/05/25/basic-motion-detection-and-tracking-with-python-and-opencv/
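For reference, the smoothing step from that post looks roughly like this in OpenCV (a minimal sketch; the frame path is hypothetical and the 21 x 21 kernel size is just the value mentioned above):

import cv2

frame = cv2.imread("frame.png")  # hypothetical captured frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
# Average pixel intensities over a 21 x 21 neighbourhood to suppress sensor noise.
blurred = cv2.GaussianBlur(gray, (21, 21), 0)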
Related
I have a training dataset of 640x512 images that I would like to use with a 320x240 camera.
Is it ok to change the aspect ratio and the size of the training images to that of the camera?
Would it be better to upscale the camera frames?
It is better if you keep the aspect ratio of the images, because otherwise you will be artificially modifying the composition of the objects in the image. What you can do is downscale the image by a factor of 2, so it's 320 x 256, then crop from the center so you have a 320 x 240 image. You can do this by simply removing the first 8 and last 8 rows of the image. Removing the first 8 and last 8 rows should be safe, because it is very unlikely you will see meaningful information within an 8-pixel band at the top and bottom of the image.
If you are using a deep learning framework such as TensorFlow or PyTorch, there are pre-processing methods that crop from the center and downscale the image by a factor of 2 for you. You just need to set up a pre-processing pipeline with these two steps in place. You haven't posted any code, so I can't help with implementation details, but hopefully this is enough to get you started.
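For example, a minimal torchvision sketch (assuming you use PyTorch; torchvision sizes are in (height, width) order):

import torchvision.transforms as T

# Downscale 640 x 512 by a factor of 2 to 320 x 256, then center-crop to 320 x 240.
preprocess = T.Compose([
    T.Resize((256, 320)),      # (h, w) after the factor-of-2 downscale
    T.CenterCrop((240, 320)),  # removes the first 8 and last 8 rows
])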
Finally, do not upsample the images. There is no benefit, because you would be interpolating existing information into a larger space, which is inaccurate. You can scale down, but never scale up. The only situation where upscaling could be useful is super-resolution, but that is for specific cases and depends heavily on the images you use. In general, I do not recommend it. Take your training set and downscale it to the camera's resolution, since camera frames at that resolution are what the model will see at inference time.
In my understanding of digital camera spec sheets, each output color pixel is made out of 4 real sensor pixels on the CCD (a Bayer pattern: one red, one blue and two green, with the two green pixels averaged). However, when reading from a camera, for example using OpenCV, one can get as many as N x M x 3 values.
From what I understand, OpenCV lets you transform RGB images to grayscale, but I couldn't find a way of getting the raw values from the CCD. Of course, it could be that there is a lower-level limitation (i.e., the transformation to color space happens in the camera electronics and not on the computer), or that there is some interpolation and hence there are in reality N x M photosites on the sensor, not N x M x 4.
Is there any way of getting RAW data from a camera with OpenCV, or is there any information stored in RAW files acquired with commercial cameras?
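OpenCV itself does not talk to the sensor, so whether you can get the mosaic depends on the driver. But if your camera's SDK can hand you the single-channel Bayer frame, OpenCV can demosaic it; a sketch (the file name, frame size, and Bayer pattern here are assumptions to adapt to your sensor):

import cv2
import numpy as np

# Hypothetical raw dump: a single-channel N x M Bayer mosaic from a vendor SDK.
bayer = np.fromfile("frame.raw", dtype=np.uint8).reshape(480, 640)

# Demosaic; pick the COLOR_Bayer* code that matches your sensor's pattern.
bgr = cv2.cvtColor(bayer, cv2.COLOR_BayerBG2BGR)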
I am new to machine learning. I am trying to create an input matrix (X) from a set of images (the Stanford Dogs dataset of 120 breeds) to train a convolutional neural network. I aim to resize the images and turn each one into a row by making each pixel a separate column.
If I directly resize the images to a fixed size, they lose their original proportions due to squishing or stretching, which is not good (first solution).
I can resize by fixing either the width or the height and then cropping, so all resulting images end up the same size (e.g., 100x100), but critical parts of the image can be cropped away (second solution).
I am thinking of another way of doing it, but I am not sure about it. Assume I want 10000 columns per image. Instead of resizing images to 100x100, I will resize each image so that its total pixel count is around 10000. So, images of size 50x200, 100x100 and 250x40 will all be converted into 10000 columns. For other sizes, like 52x198, only the first 10000 of the 10296 pixels will be considered (third solution).
The third solution seems to preserve the original shape of the image. However, it may lose that advantage anyway when converting into a row, since the images are not all the same size. I wonder about your comments on this issue. It would also be great if you could direct me to sources where I can learn more about the topic.
Solution 1 (simply resizing the input image) is a common approach. Unless you have a very different aspect ratio from the expected input shape (or your target classes have tight geometric constraints), you can usually still get good performance.
As you mentioned, Solution 2 (cropping your image) has the drawback of potentially excluding a critical part of your image. You can get around that by running the classification on multiple subwindows of the original image (i.e., classify multiple 100 x 100 sub-images by stepping over the input image horizontally and/or vertically at an appropriate stride). Then, you need to decide how to combine your multiple classification results.
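A minimal sketch of that sliding-window idea (the window size and stride here are arbitrary choices):

import cv2

def subwindows(image, size=100, stride=50):
    # Yield size x size crops, stepping over the image at the given stride.
    h, w = image.shape[:2]
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            yield image[y:y + size, x:x + size]

# Classify each crop, then combine the results (e.g., average the class scores).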
Solution 3 will not work because the convolutional network needs to know the image dimensions (otherwise, it wouldn't know which pixels are horizontally and vertically adjacent). So you need to pass an image with explicit dimensions (e.g., 100 x 100) unless the network expects an array that was flattened from assumed dimensions. But if you simply pass an array of 10000 pixel values and the network doesn't know (or can't assume) whether the image was 100 x 100, 50 x 200, or 250 x 40, then the network can't apply the convolutional filters properly.
Solution 1 is clearly the easiest to implement but you need to balance the likely effect of changing the image aspect ratios with the level of effort required for running and combining multiple classifications for each image.
I am trying to find a way to determine whether an image needs to be rotated in order for the text to be horizontally aligned, and if so, by how many degrees.
I am sending the images to tesseract and for tesseract to be effective, the text in the images needs to be horizontally aligned.
I'm looking for a way do this without depending on the "Orientation" metadata in the image.
I've thought of following ways to do this:
Rotate the image 90 degrees clockwise four times and send all four images to tesseract. This isn't ideal because of the need to process one image 4 times.
Use a Hough line transform to see if the lines are vertical or horizontal. If they are vertical, rotate the image. Even then, the image might still need to be rotated 180 degrees, so I'm unsure how effective this would be.
I'm wondering if there are other ways to accomplish this using OpenCV, ImageMagick or any other image processing techniques.
If you have around 1000 images labeled with their orientation, you can resize them to 224x224 and then fine-tune a convolutional neural network, like AlexNet or VGG, for this task. If you want to know how many right rotations to make for the image, you can set the labels to the number of clockwise rotations, i.e., 0, 1, 2 or 3.
http://caffe.berkeleyvision.org/gathered/examples/finetune_flickr_style.html
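The linked example uses Caffe; the same fine-tuning idea in PyTorch takes only a few lines (a sketch, assuming torchvision and labels 0-3 for the number of clockwise rotations):

import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pretrained VGG and replace the 1000-way head
# with a 4-way one (0, 90, 180, 270 degrees).
model = models.vgg16(pretrained=True)
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 4)
criterion = nn.CrossEntropyLoss()  # then train on 224x224 images as usual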
Attempting OCR on all 4 orientations seems like a reasonable choice, and I doubt you will find a more reliable heuristic.
If speed is an issue, you could OCR a small part of the image first. Select a rectangular region that has the proper amount of edge pixels and white/black ratio for text, then send that to tesseract in different orientations. With a small region, you could even try smaller steps than 90°, or combine it with another heuristic like Hough.
If you remember the most likely orientation based on previous images, and stop once an orientation is successfully processed by tesseract, you probably do not even have to try most orientations in most cases.
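A crude sketch of the try-all-orientations idea, using pytesseract and scoring each rotation by how many words come back (the word-count scoring is my assumption, not a tesseract feature):

from PIL import Image
import pytesseract

def best_rotation(path):
    # Return the counter-clockwise rotation (0/90/180/270) that yields the
    # most recognized words; a rough heuristic, not production code.
    img = Image.open(path)
    score = lambda k: len(pytesseract.image_to_string(img.rotate(90 * k, expand=True)).split())
    return 90 * max(range(4), key=score)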
You can figure this out in a terminal with tesseract's --psm option.
tesseract --psm 0 "infile" "outfile" will create outfile.osd which contains the info:
Page number: 0
Orientation in degrees: 90
Rotate: 270
Orientation confidence: 27.93
Script: Latin
Script confidence: 6.55
man tesseract
...
--psm N
Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
...
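The same OSD output is also available from Python via pytesseract, if you prefer not to shell out (a sketch, assuming pytesseract and tesseract are installed):

import pytesseract
from PIL import Image

# Prints the block shown above: orientation, rotation, script, confidences.
print(pytesseract.image_to_osd(Image.open("infile.png")))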
I am trying to write a Core Image filter with a custom kernel for iOS 8. My code works correctly for images smaller than 4096x4096. For bigger images I get a "tile" effect; in other words, it looks like the image is split into several sub-images and the filter runs independently on each of them.
Here is a minimal sample kernel:
kernel vec4 effectFoo(__sampler source_image) {
    return sample(source_image, samplerTransform(source_image, destCoord() + vec2(200.0, 200.0)));
}
I apply it to a big image of size 6000x4500. I expect to get the image translated by a (200, 200) offset, with border-color extrapolation on two sides. However, the result is "tiled": each tile has been translated and extrapolated independently.
Using logging in the ROI callback, I got evidence that Core Image split the image into 4 tiles of different sizes; the biggest tile was 4096x4096.
Thus there are two problems for me:
I can't get coordinates in either the full input image or the full output image.
Core Image extrapolates each tile without respect to neighbouring tiles.