CNN: how to improve prediction speed? - performance

I use a CNN (Keras/Theano) to recognize the given numbers in a sudoku puzzle. Because I display the solution in an "augmented reality" way, I want to minimize the time spent on the image processing task and on the number recognition task, so that the display stays smooth. Currently, I get a rate of 7 frames per second (you can have a look here: Augmented Reality Sudoku solver: OpenCV, Keras).
A big part of the processing time is spent in the prediction task: a prediction is made for every non-empty small square (a square with a number in it). On average, one prediction takes 3.6 ms and a puzzle has about 25 non-empty squares, so about 90 ms per image are spent just on recognizing the given numbers.
The CNN model I use for the prediction task is almost exactly the one proposed by F.Chollet in the Keras MNIST example :
img_rows, img_cols = 28, 28 # Image size of the small square
num_classes = 9 # We want to recognize the numbers from 1 to 9
input_shape = (1, img_rows, img_cols)
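model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())  # needed between the pooling output and the dense layers
model.add(Dense(128, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))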
The accuracy I get with this model is very good, so good that I wonder if I can get the same accuracy with a "smaller" model. I imagine that smaller model would also mean faster (I only focus on prediction speed improvement, training speed doesn't matter).
Do you think my assumption ("smaller implies faster") is correct, and is it worth trying to make a smaller model? Between the convolutional part and the dense part, which one(s) should I try to make smaller?
Any other suggestion on how to improve the prediction speed ?
Thank you
My CPU is an old i5 4210U, no nvidia GPU.
Prediction code:
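def predict(img):  # img: 28x28 binary image
    img = img.astype('float32')
    img /= 255.
    # Assumed reconstruction of the two elided lines: one forward pass on a batch of one
    result = model.predict(img.reshape(1, 1, 28, 28))[0]
    best = int(np.argmax(result)) + 1  # classes 0..8 map to digits 1..9
    return best, result[best-1]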
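On the question of which part to shrink, a rough multiply-accumulate (MAC) count per layer gives a first hint. This is only a sketch: it assumes 'valid' padding and stride 1 (the Keras defaults) for the layers listed above, and MAC counts only approximate real run time, so it is worth timing the result.

```python
# Rough multiply-accumulate (MAC) count per layer for one 28x28x1 input.
# Assumes 'valid' padding and stride 1 for the 3x3 convolutions.

def conv_macs(h, w, c_in, c_out, k=3):
    """Return the MACs of a k x k 'valid' convolution and its output shape."""
    h_out, w_out = h - k + 1, w - k + 1
    return h_out * w_out * c_out * (k * k * c_in), (h_out, w_out, c_out)

macs = {}
macs["conv1"], (h, w, c) = conv_macs(28, 28, 1, 32)   # output 26 x 26 x 32
macs["conv2"], (h, w, c) = conv_macs(h, w, c, 64)     # output 24 x 24 x 64

flat = (h // 2) * (w // 2) * c                        # 2x2 max pooling, then flatten
macs["dense1"] = flat * 128
macs["dense2"] = 128 * 9

total = sum(macs.values())
for name, n in macs.items():
    print(f"{name:7s} {n:>10,d} MACs  {100 * n / total:5.1f}%")
```

Under these assumptions the second convolution accounts for most of the arithmetic, so reducing its number of filters (or the number of channels feeding it) looks like the more promising place to start than the dense layers.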


Speeding up matrix multiplication like operations

Suppose I have the following two matrices:
x = torch.randint(0, 256, (100000, 48), dtype=torch.uint8)
x_new = torch.randint(0, 256, (1000, 48), dtype=torch.uint8)
I wish to do a matrix multiplication like operation where I compare the 48 dimensions and sum up all the elements that are equal. The following operation takes 7.81 seconds. Batching does not seem to help:
matrix = (x_new.unsqueeze(1) == x).sum(dim=-1)
However, doing a simple matrix multiplication (matrix = x_new @ x.T) takes 3.54 seconds. I understand this is most likely calling a deeper library that isn't slowed down by Python. However, the question is, is there a way to speed up the multiplication-like operation, by using scripting or any other way at all?
What is even stranger though is that if I do matrix = x_new.float() @ x.float().T this operation takes 214 ms. This is more than 10x faster than the uint8 multiplication.
For context, I am trying to quantize vectors so that I can find the closest vector by comparing integers rather than directly doing dot products.
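One possible way to reuse the fast float matmul for the equality count (a sketch, not benchmarked here): one-hot encode every code, so that a dot product adds exactly 1 for each matching position. NumPy is used so the sketch runs without PyTorch, the helper name one_hot_codes is made up, and levels = 16 is a hypothetical coarser quantization. With 256 levels and 100000 rows the one-hot matrix would need roughly 4.9 GB in float32, so it would have to be built in chunks.

```python
import numpy as np

def one_hot_codes(codes, levels):
    """Encode (n, d) integer codes as an (n, d * levels) float32 one-hot matrix."""
    n, d = codes.shape
    out = np.zeros((n, d * levels), dtype=np.float32)
    # Column index of the single 1 for every (row, dimension) pair.
    cols = np.arange(d) * levels + codes.astype(np.int64)
    out[np.arange(n)[:, None], cols] = 1.0
    return out

rng = np.random.default_rng(0)
levels = 16  # hypothetical: coarser than the 256 levels used in the question
x = rng.integers(0, levels, size=(2000, 48), dtype=np.uint8)
x_new = rng.integers(0, levels, size=(100, 48), dtype=np.uint8)

# One float matmul: each matching position contributes exactly 1 to the dot product.
matrix = (one_hot_codes(x_new, levels) @ one_hot_codes(x, levels).T).astype(np.int32)

# Reference result from the direct broadcasted comparison.
reference = (x_new[:, None, :] == x[None, :, :]).sum(axis=-1)
print(np.array_equal(matrix, reference))
```

The counts never exceed 48, so float32 represents them exactly and the cast back to integers is lossless.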

Is it fine to have a threshold greater than 1 in roc_curve metrics?

Predicting the probability of class assignment for each chosen sample from the Train_features:
probs = classifier.predict_proba(Train_features)
Choosing the class for which the AUC has to be determined.
preds = probs[:,1]
Calculating false positive rate, true positive rate and the possible thresholds that can clearly separate TP and TN.
fpr, tpr, threshold = metrics.roc_curve(Train_labels, preds)
roc_auc = metrics.auc(fpr, tpr)
Output : 1.97834
The previous answer did not really address your question of why the threshold is > 1, and in fact is misleading when it says the threshold does not have any interpretation.
The range of the threshold should technically be [0, 1] because it is a probability threshold. But scikit-learn prepends one extra value to the threshold array, max(y_score) + 1, which stands for "predict no sample as positive" and makes the curve start at (0, 0). (Since version 1.3 this first value is np.inf instead.) So if in your example max(threshold) = 1.97834, the very next number in the threshold array should be 0.97834.
See this sklearn github issue thread for an explanation. It's a little funny because somebody thought this is a bug, but it's just how the creators of sklearn decided to define threshold.
Finally, because it is a probability threshold, it does have a very useful interpretation. The optimal cutoff is the threshold at which sensitivity + specificity is at its maximum. In scikit-learn this can be computed like so:
fpr_p, tpr_p, thresh = roc_curve(true_labels, pred)
# maximize sensitivity + specificity, i.e. tpr + (1-fpr) or just tpr-fpr
th_optimal = thresh[np.argmax(tpr_p - fpr_p)]
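To see what that argmax selects, here is a small dependency-free sketch that computes the same quantity by hand. The labels and scores are made up, and roc_points is a hypothetical helper, not part of scikit-learn.

```python
def roc_points(labels, scores):
    """Return (threshold, tpr, fpr) for every distinct score, highest first.

    A sample is predicted positive when its score is >= threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((t, tp / pos, fp / neg))
    return points

labels = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.90, 0.60]

# Maximize tpr - fpr, which is sensitivity + specificity - 1 (Youden's J).
best_t, best_tpr, best_fpr = max(roc_points(labels, scores), key=lambda p: p[1] - p[2])
print(best_t, best_tpr, best_fpr)
```

For this toy data the cutoff lands on one of the observed scores, which illustrates why the selected threshold is a real probability inside [0, 1].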
The threshold value does not have any kind of interpretation, what really matters is the shape of the ROC curve. Your classifier performs well if there are thresholds (no matter their values) such that the generated ROC curve lies above the linear function (better than random guessing); your classifier has a perfect result (this happens rarely in practice) if for any threshold the ROC curve is only one point at (0,1); your classifier has the worst result if for any threshold the ROC curve is only one point at (1,0). A good indicator of the performance of your classifier is the integral of the ROC curve, this indicator is known as AUC and is limited between 0 and 1, 0 for the worst performance and 1 for perfect performance.
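The AUC described here is just the area under the (fpr, tpr) points. A minimal sketch using the trapezoidal rule, with three hand-made curves (the function name auc_trapezoid is made up for this illustration):

```python
def auc_trapezoid(fpr, tpr):
    """Area under a curve given as points sorted by increasing fpr."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area

random_guess = auc_trapezoid([0.0, 1.0], [0.0, 1.0])       # the diagonal
perfect = auc_trapezoid([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])  # passes through (0, 1)
worst = auc_trapezoid([0.0, 1.0, 1.0], [0.0, 0.0, 1.0])    # passes through (1, 0)
print(random_guess, perfect, worst)
```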

Color Segmentation: A better cluster-analysis to find K

I know there have been many questions such as this and some solutions to them, but I'm hoping there's another way.
GOAL: The final goal is to cluster colors given an image, then allow the user to change those colors. The user does not need to enter any k. The algorithm determines K.
METHOD: Currently, I'm using the silhouette score metric. I'm using MiniBatchKMeans to cluster the image and then calculate the silhouette_score within a range of k (4-8). The code would be:
# silhouetteCoeff determination
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score
def silhouetteCoeff(z):
    max_silhouette = 0
    max_k = 0
    for i in range(4, 17):
        clt = MiniBatchKMeans(n_clusters = i, random_state = 42)
        clt.fit(z)  # labels_ only exists after fitting
        silhouette_avg = silhouette_score(z, clt.labels_, sample_size = 250, random_state = 42)
        print("k: ", i, " silhouette avg: ", silhouette_avg)
        if (silhouette_avg == 1.0):
            # A perfect score cannot be beaten, so stop searching
            max_silhouette = silhouette_avg
            max_k = i
            break
        elif (silhouette_avg > max_silhouette):
            max_silhouette = silhouette_avg
            max_k = i
    print("Max silhouette: ", max_silhouette)
    print("Max k: ", max_k)
    return int(max_k)
Even if I color quantize the image beforehand (to 16 colors), the function still takes a good 6-8 seconds to run (assume image size 400x400).
My question is, is there any better or faster way to find k? I've tried the Elbow method too, but still gotta calculate the SSE there. From testing on some images, I've found a good average k = 8. But on a more color intensive image, the algorithm loses out on some colors.
Measure your bottleneck!
Silhouette is in O(n²) so most likely it will be the bottleneck of your approach. Also, there are much faster k-means variants than the one in sklearn... so there is a lot of potential to make things faster.
Minibatch kmeans won't even converge, but only approximate the result. It only makes sense if you can't afford to keep all data in memory as far as I can tell.
Reducing the color palette to just 16 colors probably does not help at all.
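To act on "measure your bottleneck", a small timing helper is often enough to see whether the clustering or the silhouette computation dominates. This is only a sketch: timed, fit_clusters and score_clusters are made-up names, and the two functions are stand-ins for clt.fit(z) and silhouette_score(...).

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Accumulate wall-clock seconds spent inside the with-block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

# Stand-ins for the real work; replace them with the actual calls.
def fit_clusters(k):
    return sum(i * i for i in range(20_000 * k))

def score_clusters(k):
    return sum(i for i in range(1_000 * k))

for k in range(4, 9):
    with timed("fit"):
        fit_clusters(k)
    with timed("silhouette"):
        score_clusters(k)

# Slowest stage first.
for stage, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage:12s} {seconds:.4f} s")
```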

Extracting many regions of interest (ROIs) from thousands of images

I have a large set of microscopy images and each image has several hundreds of spots (ROIs). These spots are fixed in space. I want to extract each spot from each image and save into workspace so that I can analyze them further.
I have written the code myself and it works perfectly, but it is too slow. It takes around 250 sec to completely read out all the spots from every image.
The core of my code looks as following:
for s=1:NumberImages
    for i=1:length(p_g_x)
        GreenROI(i,s)=double(sum(sum(im(round(p_g_y(i))+(-2:2), round(p_g_x(i))+(-2:2)))));
        RedROI(i,s)=double(sum(sum(im(round(p_r_y(i))+(-2:2), round(p_r_x(i))+(-2:2)))));
    end
end
As you can see from the code I am extracting 5x5 regions. Length of p_g_x is between 500-700.
Thanks for your input. I used the profile viewer to figure out which function exactly takes the most time. It is the median filter that takes most of the time (~90%).
Any suggestion to speed it up will be greatly appreciated.
Use Matlab's profiling tools!
profile on      % Starts the profiler
% Run some code now.
profile viewer  % Shows you how often each function was called, and
                % where most time was spent. Try to start with the slowest part.
profile off     % Stops the profiler; calling "profile on" again clears the old results.
Preallocate the output because you know the size and this way it is much faster. (Matlab told you this already!)
GreenROI = zeros(length(p_g_x), NumberImages); % And the same for RedROI.
Use convolution
Read about Matlab's conv2 function.
for s=1:NumberImages
    % Pre-compute the sums first. This will only be faster for large p_g_x
    roi_image = conv2(im, ones(5,5), 'same'); % 'same' keeps roi_image aligned with im, so no index offset is needed
    for i=1:length(p_g_x)
        GreenROI(i,s)=roi_image(round(p_g_y(i)), round(p_g_x(i))); % Check that results are the same as with the original loop.
        RedROI(i,s)=roi_image(round(p_r_y(i)), round(p_r_x(i)));
    end
end
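Whether the looked-up indices need an offset depends on the convolution mode. A full convolution with a 5x5 kernel of ones grows the output by 4 in each direction, so the sum of the window centred on (y, x) ends up at (y + 2, x + 2); with the 'same' option no offset is needed. A minimal pure-Python sketch (0-based indices, made-up image values, hypothetical helper names) to check this on a small example:

```python
def box_sum_full(image, k=5):
    """'Full' 2-D convolution of image with a k x k kernel of ones."""
    h, w = len(image), len(image[0])
    out = [[0] * (w + k - 1) for _ in range(h + k - 1)]
    for y in range(h):
        for x in range(w):
            # Every input pixel contributes to a k x k block of the output.
            for dy in range(k):
                for dx in range(k):
                    out[y + dy][x + dx] += image[y][x]
    return out

def direct_sum(image, cy, cx, r=2):
    """Sum of the (2r+1) x (2r+1) window centred on (cy, cx)."""
    return sum(image[y][x]
               for y in range(cy - r, cy + r + 1)
               for x in range(cx - r, cx + r + 1))

image = [[(3 * y + 7 * x) % 11 for x in range(12)] for y in range(10)]
full = box_sum_full(image)

cy, cx = 4, 6   # centre of one ROI, away from the border
offset = 2      # (k - 1) / 2 for k = 5
print(full[cy + offset][cx + offset] == direct_sum(image, cy, cx))
```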
Matlab-ize the code
Now that you've used convolution to get an image of sums over 5x5 windows (or you could've used @Shai's accumarray, same thing), you can speed things up further by not iterating through each element in p_g_x but using it as a vector straight away.
I leave that as an exercise for the reader. (convert p_g_x and p_g_y to indices using sub2ind, as a hint).
Our answers, mine included, showed how premature optimisation is a bad thing. Without knowing, I assumed that your loop would take most of the time, but when you measured it (thanks!) it turns out that is not the problem. The bottleneck is medfilt2 the median filter, which takes 90% of the time. So you should address this first. (Note, that on my computer your original code is fast enough for my taste but it is still the median filter taking up most of the time.)
Looking at what the median filter operation does might help us figure out how to make it faster. Here is an image. On the left you see the original image. In the middle the median filter and on the right there is the result.
To me the result looks awfully similar to an edge detection result. (Mathematically this is no surprise.)
I would suggest you start experimenting with various edge detections. Have a look at Canny and Sobel. Or just use conv2(image, kernel_x) where kernel_x = [1, 2, 1; 0, 0, 0; -1, -2, -1] and the same but transposed for a kernel_y. You can find various edge detection options here: edge(im, option). I tried all options from {'sobel', 'canny', 'roberts', 'prewitt'}. Except for Canny, they all take about the same time as your median filter method. Canny is 4x slower, the rest (including the original) take 7.x seconds. All of this without a GPU. imgradient was 9 seconds.
So from this I would say that you can't get any faster. If you have a GPU and it works with Matlab, you could speed it up. Load your images as gpuArrays. There is an example in the medfilt2 documentation. You can still do minor speed-ups but they can only amount to a 10% speed increase, so they are hardly worthwhile.
A few things you should do
Pre-allocate as suggested by Didac Perez.
Use profiler to see what exactly takes long in your code, is it the median filter? is it the indexing?
Assuming all images are of the same size, you can use accumarray and a fixed mask subs to quickly sum the values:
subs_g = zeros( h, w ); %// allocate mask for green
subs_r = zeros( h, w );
subs_g( sub2ind( [h w], round(p_g_y), round(p_g_x) ) ) = 1:numel(p_g_x); %//index each region
subs_g = conv2( subs_g, ones(5), 'same' );
subs_r( sub2ind( [h w], round(p_r_y), round(p_r_x) ) ) = 1:numel(p_r_x); %//index each region
subs_r = conv2( subs_r, ones(5), 'same' );
sel_g = subs_g > 0;
sel_r = subs_r > 0;
subs_g = subs_g(sel_g);
subs_r = subs_r(sel_r);
Once these masks are fixed, you can process all images:
%// pre-allocation goes here - I'll leave it to you
for s=1:NumberImages
    im=double( im1-medfilt2(im1,[15,15]) );
    GreenROI(:,s) = accumarray( subs_g, im( sel_g ) ); % summing all the green ROIs
    RedROI(:,s) = accumarray( subs_r, im( sel_r ) ); % summing all the red ROIs
end
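For readers working in Python, the grouped sum that accumarray performs has a NumPy counterpart in np.bincount with weights. A minimal sketch, assuming non-overlapping 5x5 windows; the spot centres and the random image are made up:

```python
import numpy as np

h, w = 12, 12
centres = [(3, 3), (3, 9), (8, 5)]  # hypothetical (row, col) spot centres, 0-based

# Label image: 0 = background, i = pixels inside the 5x5 window of spot i.
labels = np.zeros((h, w), dtype=np.int64)
for i, (r, c) in enumerate(centres, start=1):
    labels[r - 2:r + 3, c - 2:c + 3] = i

rng = np.random.default_rng(0)
im = rng.random((h, w))

# Grouped sum, analogous to accumarray(subs, vals): one total per label.
sums = np.bincount(labels.ravel(), weights=im.ravel(), minlength=len(centres) + 1)
roi_sums = sums[1:]  # drop the background bin
print(roi_sums)
```

The label image only has to be built once; each new image then costs a single bincount call.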
First, preallocate your GreenROI and RedROI structures since you know the final size in advance. Right now, you are resizing them again and again in each iteration.
Secondly, I recommend using "tic" and "toc" to investigate where the problem is; they will give you useful timings.
Vectorized code that operates on each image -
%// Pre-compute green and red indices to be used across all the images
r1 = round(bsxfun(@plus,permute(p_g_y,[3 2 1]),[-2:2]'));
c1 = round(bsxfun(@plus,permute(p_g_x,[3 2 1]),[-2:2]));
green_ind = reshape(bsxfun(@plus,(c1-1)*size(im,1),r1),[],numel(p_g_x));
r2 = round(bsxfun(@plus,permute(p_r_y,[3 2 1]),[-2:2]'));
c2 = round(bsxfun(@plus,permute(p_r_x,[3 2 1]),[-2:2]));
red_ind = reshape(bsxfun(@plus,(c2-1)*size(im,1),r2),[],numel(p_g_x));
for s=1:NumberImages
    GreenROI = sum(im(green_ind));
    RedROI = sum(im(red_ind));
end

Checking images for similarity with OpenCV

Does OpenCV support the comparison of two images, returning some value (maybe a percentage) that indicates how similar these images are? E.g. 100% would be returned if the same image was passed twice, 0% would be returned if the images were totally different.
I already read a lot of similar topics here on StackOverflow. I also did quite some Googling. Sadly I couldn't come up with a satisfying answer.
This is a huge topic, with answers ranging from 3 lines of code to entire research journals.
I will outline the most common such techniques and their results.
Comparing histograms
One of the simplest & fastest methods. Proposed decades ago as a means to find picture similarities. The idea is that a forest will have a lot of green, and a human face a lot of pink, or whatever. So, if you compare two pictures with forests, you'll get some similarity between histograms, because you have a lot of green in both.
Downside: it is too simplistic. A banana and a beach will look the same, as both are yellow.
OpenCV method: compareHist()
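To make the idea concrete without OpenCV, here is a dependency-free sketch of a Bhattacharyya-style (Hellinger) distance between two normalized histograms, which is one of the metrics compareHist() offers. The pixel lists are made up and the helper names are hypothetical.

```python
import math

def histogram(pixels, bins=8, max_value=256):
    """Count how many pixel values fall into each of `bins` equal-width bins."""
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // max_value] += 1
    return counts

def bhattacharyya_distance(h1, h2):
    """0.0 for identical normalized histograms, 1.0 for histograms with no overlap."""
    s1, s2 = sum(h1), sum(h2)
    coefficient = sum(math.sqrt((a / s1) * (b / s2)) for a, b in zip(h1, h2))
    return math.sqrt(max(0.0, 1.0 - coefficient))

dark = [10, 20, 30, 40, 25, 15] * 50
also_dark = [12, 18, 33, 41, 22, 14] * 50
bright = [200, 220, 240, 250, 230, 210] * 50

d_similar = bhattacharyya_distance(histogram(dark), histogram(also_dark))
d_different = bhattacharyya_distance(histogram(dark), histogram(bright))
print(d_similar, d_different)
```

Note that the distance only looks at how many pixels fall into each bin, not where they are, which is exactly the downside described above.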
Template matching
A good example is here: matchTemplate finding good match. It convolves the search image with the one being searched in. It is usually used to find smaller image parts in a bigger one.
Downsides: It only returns good results with identical images, same size & orientation.
OpenCV method: matchTemplate()
Feature matching
Considered one of the most efficient ways to do image search. A number of features are extracted from an image, in a way that guarantees the same features will be recognized again even when rotated, scaled or skewed. The features extracted this way can be matched against other image feature sets. Another image that has a high proportion of the features matching the first one is considered to be depicting the same scene.
Finding the homography between the two sets of points will allow you to also find the relative difference in shooting angle between the original pictures or the amount of overlapping.
There are a number of OpenCV tutorials/samples on this, and a nice video here. A whole OpenCV module (features2d) is dedicated to it.
Downsides: It may be slow. It is not perfect.
Over on the OpenCV Q&A site I am talking about the difference between feature descriptors, which are great when comparing whole images and texture descriptors, which are used to identify objects like human faces or cars in an image.
Since no one has posted a complete concrete example, here are two quantitative methods to determine the similarity between two images: one method for comparing images with the same dimensions, and another for scale-invariant and transformation-indifferent comparison. Both methods return a similarity score between 0 and 100, where 0 represents a completely different image and 100 represents an identical/duplicate image. For all other values in between: the lower the score, the less similar; the higher the score, the more similar.
Method #1: Structural Similarity Index (SSIM)
To compare differences and determine the exact discrepancies between two images, we can utilize the Structural Similarity Index (SSIM), which was introduced in Image Quality Assessment: From Error Visibility to Structural Similarity. SSIM is an image quality assessment approach which estimates the degradation of structural similarity based on the statistical properties of local information between a reference and a distorted image. SSIM values lie in the range [-1, 1], and the index is typically calculated using a sliding window, with the SSIM value for the whole image computed as the average across all individual window results. This method is already implemented in the scikit-image library for image processing, which can be installed with pip install scikit-image.
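As a rough illustration of the statistic itself, the sketch below evaluates the SSIM formula once over a whole patch instead of averaging over sliding windows, so its numbers will not match scikit-image exactly. The pixel values are made up, global_ssim is a hypothetical helper, and the constants 0.01 and 0.03 are the usual defaults from the paper.

```python
def global_ssim(x, y, data_range=255.0):
    """SSIM computed once over two equally sized flat pixel lists (no sliding window)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

patch = [52, 55, 61, 59, 79, 61, 76, 61, 62, 59, 55, 104, 94, 85, 59, 71]
brighter = [min(255, p + 10) for p in patch]  # same structure, shifted brightness
inverted = [255 - p for p in patch]           # opposite structure

same = global_ssim(patch, patch)
shifted = global_ssim(patch, brighter)
negative = global_ssim(patch, inverted)
print(same, shifted, negative)
```

An identical patch scores 1, a uniformly brighter copy scores slightly below 1, and an inverted copy scores below 0, which shows why the range extends to negative values.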
The skimage.metrics.structural_similarity() function returns a comparison score and a difference image, diff. The score represents the mean SSIM score between two images with higher values representing higher similarity. The diff image contains the actual image differences with darker regions having more disparity. Larger areas of disparity are highlighted in black while smaller differences are in gray. Here's an example:
Input images
Difference image -> highlighted mask differences
The SSIM score after comparing the two images shows that they are very similar.
Similarity Score: 89.462%
To visualize the exact differences between the two images, we can iterate through each contour, filter using a minimum threshold area to remove tiny noise, and highlight discrepancies with a bounding box.
Limitations: Although this method works very well, there are some important limitations. The two input images must have the same size/dimensions, and the method is sensitive to scaling, translations, rotations, and distortions. SSIM also does not perform very well on blurry or noisy images. These problems are addressed in Method #2.
from skimage.metrics import structural_similarity
import cv2
import numpy as np
first = cv2.imread('clownfish_1.jpeg')
second = cv2.imread('clownfish_2.jpeg')
# Convert images to grayscale
first_gray = cv2.cvtColor(first, cv2.COLOR_BGR2GRAY)
second_gray = cv2.cvtColor(second, cv2.COLOR_BGR2GRAY)
# Compute SSIM between two images
score, diff = structural_similarity(first_gray, second_gray, full=True)
print("Similarity Score: {:.3f}%".format(score * 100))
# The diff image contains the actual image differences between the two images
# and is represented as a floating point data type so we must convert the array
# to 8-bit unsigned integers in the range [0,255] before we can use it with OpenCV
diff = (diff * 255).astype("uint8")
# Threshold the difference image, followed by finding contours to
# obtain the regions that differ between the two images
thresh = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
contours = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
contours = contours[0] if len(contours) == 2 else contours[1]
# Highlight differences
mask = np.zeros(first.shape, dtype='uint8')
filled = second.copy()
for c in contours:
    area = cv2.contourArea(c)
    if area > 100:
        x,y,w,h = cv2.boundingRect(c)
        cv2.rectangle(first, (x, y), (x + w, y + h), (36,255,12), 2)
        cv2.rectangle(second, (x, y), (x + w, y + h), (36,255,12), 2)
        cv2.drawContours(mask, [c], 0, (0,255,0), -1)
        cv2.drawContours(filled, [c], 0, (0,255,0), -1)
cv2.imshow('first', first)
cv2.imshow('second', second)
cv2.imshow('diff', diff)
cv2.imshow('mask', mask)
cv2.imshow('filled', filled)
cv2.waitKey()  # keep the windows open until a key is pressed
Method #2: Dense Vector Representations
Typically, two images will not be exactly the same. They may have variations with slightly different backgrounds, dimensions, feature additions/subtractions, or transformations (scaled, rotated, skewed). In other words, we cannot use a direct pixel-to-pixel approach since with variations, the problem shifts from identifying pixel-similarity to object-similarity. We must switch to deep-learning feature models instead of comparing individual pixel values.
To determine identical and near-similar images, we can use the sentence-transformers library, which provides an easy way to compute dense vector representations for images, and the OpenAI Contrastive Language-Image Pre-Training (CLIP) Model, which is a neural network already trained on a variety of (image, text) pairs. The idea is to encode all images into vector space and then find high-density regions which correspond to areas where the images are fairly similar.
When two images are compared, they are given a score between 0 and 1.00. We can use a threshold parameter to identify two images as similar or different. A lower threshold will result in larger clusters whose images are less similar to each other. Conversely, a higher threshold will result in smaller clusters of more similar images. A duplicate image will have a score of 1.00, meaning the two images are exactly the same. To find near-similar images, we can set the threshold to any arbitrary value, say 0.9. For instance, if the determined score between two images is greater than 0.9, then we can conclude they are near-similar images.
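The score in question is a cosine similarity between embedding vectors. A dependency-free sketch of the scoring and thresholding step, using made-up 4-dimensional vectors in place of real CLIP embeddings (which have hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical embeddings, for illustration only.
embeddings = {
    "flower_1": [0.9, 0.1, 0.3, 0.2],
    "flower_1_copy": [0.9, 0.1, 0.3, 0.2],
    "flower_2": [0.8, 0.2, 0.4, 0.1],
    "cat": [0.1, 0.9, 0.2, 0.7],
}

threshold = 0.9
names = sorted(embeddings)
# Score every unordered pair once, best match first.
pairs = sorted(
    ((cosine_similarity(embeddings[a], embeddings[b]), a, b)
     for i, a in enumerate(names) for b in names[i + 1:]),
    reverse=True)
similar = [(a, b) for score, a, b in pairs if score >= threshold]

for score, a, b in pairs:
    print(f"{score * 100:7.3f}%  {a}  {b}")
```

With these toy vectors the exact copy scores (numerically) 100%, the two flowers stay above the 0.9 threshold, and every pair involving the cat falls well below it.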
An example:
This dataset has five images, notice how there are duplicates of flower #1 while the others are different.
Identifying duplicate images
Score: 100.000%
.\flower_1 copy.jpg
Both flower #1 and its copy are the same
Identifying near-similar images
Score: 97.141%
Score: 95.693%
Score: 57.658%
.\flower_1 copy.jpg
Score: 57.658%
Score: 57.378%
Score: 56.768%
.\flower_1 copy.jpg
Score: 56.768%
Score: 56.284%
We get more interesting results between different images. The higher the score, the more similar; the lower the score, the less similar. Using a threshold of 0.9 or 90%, we can filter out near-similar images.
Comparison between just two images
Score: 97.141%
Score: 95.693%
Score: 88.914%
Score: 94.503%
from sentence_transformers import SentenceTransformer, util
from PIL import Image
import glob
import os
# Load the OpenAI CLIP Model
print('Loading CLIP Model...')
model = SentenceTransformer('clip-ViT-B-32')
# Next we compute the embeddings
# To encode an image, you can use the following code:
# from PIL import Image
# encoded_image = model.encode(Image.open(filepath))
image_names = list(glob.glob('./*.jpg'))
print("Images:", len(image_names))
encoded_image = model.encode([Image.open(filepath) for filepath in image_names], batch_size=128, convert_to_tensor=True, show_progress_bar=True)
# Now we run the clustering algorithm. This function compares images against
# all other images and returns a list with the pairs that have the highest
# cosine similarity score
processed_images = util.paraphrase_mining_embeddings(encoded_image)
# =================
# DUPLICATES
# =================
print('Finding duplicate images...')
# Filter list for duplicates. Results are triplets (score, image_id1, image_id2), sorted in decreasing order
# A duplicate image will have a score of 1.00
# It may be 0.9999 due to lossy image compression (.jpg)
duplicates = [image for image in processed_images if image[0] >= 0.999]
# Output the top X duplicate images
NUM_SIMILAR_IMAGES = 10  # assumed value; the original definition is not shown
for score, image_id1, image_id2 in duplicates[0:NUM_SIMILAR_IMAGES]:
    print("\nScore: {:.3f}%".format(score * 100))
    print(image_names[image_id1])
    print(image_names[image_id2])
# =================
# NEAR DUPLICATES
# =================
print('Finding near duplicate images...')
# Use a threshold parameter to identify two images as similar. By setting the threshold lower,
# you will get larger clusters which have less similar images in them. Threshold 0 - 1.00
# A threshold of 1.00 means the two images are exactly the same. Since we are finding near
# duplicate images, we can set it at 0.99 or any number 0 < X < 1.00.
threshold = 0.99
near_duplicates = [image for image in processed_images if image[0] < threshold]
for score, image_id1, image_id2 in near_duplicates[0:NUM_SIMILAR_IMAGES]:
    print("\nScore: {:.3f}%".format(score * 100))
    print(image_names[image_id1])
    print(image_names[image_id2])
If for matching identical images ( same size/orientation )
// Compare two images by getting the L2 error (square-root of sum of squared error).
double getSimilarity( const Mat A, const Mat B ) {
    if ( A.rows > 0 && A.rows == B.rows && A.cols > 0 && A.cols == B.cols ) {
        // Calculate the L2 relative error between images.
        double errorL2 = norm( A, B, NORM_L2 );
        // Convert to a reasonable scale, since L2 error is summed across all pixels of the image.
        double similarity = errorL2 / (double)( A.rows * A.cols );
        return similarity;
    }
    else {
        //Images have a different size
        return 100000000.0; // Return a bad value
    }
}
Sam's solution should be sufficient. I've used a combination of both histogram difference and template matching because no single method was working for me 100% of the time. I've given less importance to the histogram method though. Here's how I've implemented it in a simple Python script.
import cv2
class CompareImage(object):
    def __init__(self, image_1_path, image_2_path):
        self.minimum_commutative_image_diff = 1
        self.image_1_path = image_1_path
        self.image_2_path = image_2_path
    def compare_image(self):
        image_1 = cv2.imread(self.image_1_path, 0)
        image_2 = cv2.imread(self.image_2_path, 0)
        commutative_image_diff = self.get_image_difference(image_1, image_2)
        if commutative_image_diff < self.minimum_commutative_image_diff:
            print("Matched")
            return commutative_image_diff
        return 10000  # random failure value
    @staticmethod
    def get_image_difference(image_1, image_2):
        first_image_hist = cv2.calcHist([image_1], [0], None, [256], [0, 256])
        second_image_hist = cv2.calcHist([image_2], [0], None, [256], [0, 256])
        img_hist_diff = cv2.compareHist(first_image_hist, second_image_hist, cv2.HISTCMP_BHATTACHARYYA)
        img_template_probability_match = cv2.matchTemplate(first_image_hist, second_image_hist, cv2.TM_CCOEFF_NORMED)[0][0]
        img_template_diff = 1 - img_template_probability_match
        # taking only 10% of histogram diff, since it's less accurate than template method
        commutative_image_diff = (img_hist_diff / 10) + img_template_diff
        return commutative_image_diff
if __name__ == '__main__':
    compare_image = CompareImage('image1/path', 'image2/path')
    image_difference = compare_image.compare_image()
    print(image_difference)
A little bit off topic but useful is the pythonic numpy approach. It's robust and fast, but it only compares pixels and not the objects or data the picture contains (and it requires images of the same size and shape):
A very simple and fast approach to do this without OpenCV or any other computer vision library is to normalize the picture arrays:
import numpy as np
picture1 = np.random.rand(100,100)
picture2 = np.random.rand(100,100)
picture1_norm = picture1/np.sqrt(np.sum(picture1**2))
picture2_norm = picture2/np.sqrt(np.sum(picture2**2))
After defining both normed pictures (or matrices) you can just sum over the multiplication of the pictures you like to compare:
1) If you compare identical pictures the sum will return 1:
In[1]: np.sum(picture1_norm**2)
Out[1]: 1.0
2) If they aren't similar, you'll get a value between 0 and 1 (a percentage if you multiply by 100):
In[2]: np.sum(picture2_norm*picture1_norm)
Out[2]: 0.75389941124629822
Please notice that if you have colored pictures you have to do this in all 3 dimensions or just compare a greyscaled version. I often have to compare huge amounts of pictures with arbitrary content and that's a really fast way to do so.
One can also use an autoencoder, or an architecture like VGG16 pre-trained on ImageNet data, to extract features for such a task, and then calculate the distance between the query and the other images in order to find the closest match.
