Fast and simple image hashing algorithm

I need a (preferably simple and fast) image hashing algorithm. The hash value is used in a lookup table, not for cryptography.
Some of the images are "computer graphics" - i.e. solid-color filled rectangles, rasterized text, etc. - whereas others are "photographic" images containing a rich color spectrum, mostly smooth, with a reasonable noise amplitude.
I'd also like the hashing algorithm to be applicable to specific image parts. That is, the image can be divided into grid cells, and the hash of each cell should depend only on the contents of that cell. This way one can quickly spot whether two images have common areas (in case they're aligned appropriately).
Note: I only need to know if two images (or their parts) are identical. That is, I don't need to match similar images; there's no need for feature recognition, correlation, or other DSP techniques.
I wonder what is the preferred hashing algorithm.
For "photographic" images just XOR-ing all the pixels within a grid cell is ok more-or-less. The probability of the same hash value for different images is pretty low, especially because the presence of the (nearly white) noise breaks all the potential symmetries. Plus the spectrum of such a hash function looks good (any value is possible with nearly the same probability).
But such a naive algorithm may not be used with "artificial" graphics. Identical pixels, repeating patterns, geometrical offset invariance are very common for such images. XOR-ing all the pixels will give 0 for any image with even number of identical pixels.
Using something like CRT-32 looks somewhat promising, but I'd like to figure-out something faster. I thought about iterative formula, each new pixel mutates the current hash value, like this:
hashValue = (hashValue * /*something*/ | newPixelValue) % /* huge prime */
Taking the result modulo a prime number should probably give good dispersion, so I'm leaning toward this option. But I'd like to know if there are better variants.
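For concreteness, here is a minimal sketch of the per-cell iterative hash I have in mind (assuming numpy and a single-channel image; the multiplier and modulus are arbitrary placeholders, and I used + rather than | so that pixel information isn't discarded):

    import numpy as np

    MULTIPLIER = 31          # placeholder for /*something*/
    MODULUS = 1_000_000_007  # placeholder huge prime

    def cell_hash(cell):
        # Iteratively mix each pixel into the hash; depends only on this cell.
        h = 0
        for pixel in cell.reshape(-1):                  # row-major order
            h = (h * MULTIPLIER + int(pixel)) % MODULUS
        return h

    def grid_hashes(image, cell_size):
        # Hash every cell_size x cell_size cell of a single-channel image.
        rows, cols = image.shape
        return {(r, c): cell_hash(image[r:r + cell_size, c:c + cell_size])
                for r in range(0, rows, cell_size)
                for c in range(0, cols, cell_size)}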
Thanks in advance.

Have a look at this tutorial on the pHash algorithm, http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html, which is used to find closely matching images.

If you want to make it very fast, consider taking a random subset of the pixels so you avoid reading the entire image, and then compute a hash function over the sequence of values at those pixels. The random subset should be selected by a deterministic pseudo-random number generator with a fixed seed, so that identical images produce identical subsets and consequently identical hash values.
This should work reasonably well even for artificial images. However, if you have images that differ from each other in only a small number of pixels, sampling will give hash collisions; more iterations give better reliability. If that is your situation - for instance, if your image set is likely to contain pairs differing by a single pixel - you must read every pixel to compute the hash value. A simple linear combination of the pixel values with pseudo-random coefficients is good enough even for artificial images.
pseudo-code of a simple algorithm

Random generator = new Random(2847)  // initialized with a fixed seed
int num_iterations = 100

int hash(Image image) {
    generator.reset()  // to ensure consistency on each evaluation
    int value = 0
    for (int i = 0; i < num_iterations; i++) {
        // pick a pseudo-random pixel index in [0, image.getSize())
        int pixelValue = image.getPixel(generator.nextInt() % image.getSize()).getValue()
        // accumulate a linear combination with pseudo-random coefficients
        value = value + pixelValue * generator.nextInt()
    }
    return value
}

Related

Data structure for pixel selections in a picture

Is there a convenient data structure for storing a pixel selection in a picture?
By pixel selection I mean a set of pixels you obtain with selection tools such as those in image editing software (rectangles, lasso, magic wand, etc.). There can be holes, and in the general case the selection is (much) smaller than the picture itself.
The objective is to be able to save/load selections, display the selected pixels only in a separate view (of bounding-box size), use selections in specific algorithms (typically algorithms requiring segmentation), etc. It should use as little memory as possible, since the objective is to store a lot of them in a DB.
Solutions I found so far:
a boolean array (size of the picture/8)
a list of (uint16,uint16) => inefficient if there are many pixels in the selection
an array of lists: a list of pixel series (runs) for each line
A boolean array will take W x H bits for the raster plus extra accounting (such as ROI limits). This is roughly proportional to the area of the bounding box.
A list of pixel coordinates will take about 32 bits (2x16 bits) per selected pixel. This is pretty large compared to the boolean array, except when the selection is very hollow.
Another useful representation is run-length encoding, which counts the contiguous selected pixels row by row. This representation takes about 16 bits per run - in other words, 16/n bits per pixel when the average run length is n pixels. It works well for large filled shapes, but poorly for isolated pixels.
Finally, you can also consider storing just the outlines of the shapes, either as a list of pixels (32 bits per pixel) or as a Freeman chain code (only 3 bits per pixel), which can be a significant saving compared to the full enumeration.
As you can see, the choice is not easy, because the efficiency of the different representations depends strongly on the shape of the selection. Another important aspect is how easily a given representation can be used for the processing you intend to apply to the selection.
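As an illustration of the run-length idea, here is a minimal sketch (assuming numpy; it stores each run as an unpacked (row, start column, length) triple, which you would pack more tightly for actual storage):

    import numpy as np

    def rle_encode(mask):
        # Run-length encode a boolean selection mask, row by row.
        # Returns a list of (row, start_col, length) runs.
        runs = []
        for r, row in enumerate(mask):
            padded = np.concatenate(([False], row, [False]))
            changes = np.flatnonzero(padded[1:] != padded[:-1])
            starts, stops = changes[0::2], changes[1::2]
            runs.extend((r, int(s), int(e - s)) for s, e in zip(starts, stops))
        return runs

    def rle_decode(runs, shape):
        # Rebuild the boolean mask from its run list.
        mask = np.zeros(shape, dtype=bool)
        for r, start, length in runs:
            mask[r, start:start + length] = True
        return mask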

Is it possible to calculate the mathematical function of a 2D image?

The question basically says it all. To elaborate: suppose I have an image, a photograph, and I wish to calculate its mathematical function, so that when I input x and y pixel coordinates it returns a vector consisting of the R,G,B values at that (x,y) point. I could then reconstruct the whole image with a for loop over just that function. I am not asking for the whole solution or algorithm here, just whether this is possible and which direction I should take to go about doing it. References to relevant papers would be really nice.
Thanks
Azmuh
Yes, it is absolutely always possible. Basically, if you choose some points, there are always (infinitely many) smooth explicit functions (that is, nice functions) whose values at those points are exactly the ones you choose.
For example, you can have a look at http://en.wikipedia.org/wiki/Lagrange_polynomial or http://en.wikipedia.org/wiki/Trigonometric_interpolation. They are two different methods to compute an explicit function that passes exactly through the data points you have. So you can apply either method to your image, seen as a set of data points, separately for R, G, and B.
In the end, you get one explicit function (a polynomial or a trigonometric series, depending on what you chose), and you can compute its value wherever you want.
However, note that I would definitely not recommend using those methods to actually retrieve the data. The functions you get are not at all optimized (they have a very high degree - for an n×m image, each color channel will have degree nm-1 - and very large coefficients), and furthermore they can take extremely large values between your original points (look up Runge's phenomenon).
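As a small illustration of the idea (and of why it misbehaves), here is a sketch that interpolates a single row of one color channel with an exact degree-(n-1) polynomial, assuming numpy; the pixel values are made up:

    import numpy as np

    # One row of one color channel of a tiny image (values made up for illustration)
    row = np.array([10.0, 200.0, 35.0, 90.0, 150.0])
    x = np.arange(len(row))

    # Fit the exact interpolating polynomial of degree n-1 (Lagrange interpolation).
    coeffs = np.polyfit(x, row, deg=len(row) - 1)

    # The polynomial reproduces the original samples at the integer coordinates ...
    print(np.polyval(coeffs, x))                      # ~ [10, 200, 35, 90, 150]

    # ... but oscillates between them and grows quickly outside the sampled range.
    print(np.polyval(coeffs, np.linspace(-1.0, 5.0, 13)))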
This is not possible in general... Imagine an image whose pixels have been generated with random values. You won't find a meaningful mathematical expression that gives you the value of a pixel from its 2D coordinates (beyond simply tabulating them).
Now, it may be possible for some images that have been generated using a function. In that case, it's not a problem specific to image processing; it's the problem of recovering a function from some of its points (in your case, you have all the points). It's exactly the same thing as fitting a curve to a set of points when you plot a graph in Excel. The more points you have, the more precise the function you find will be.
Look for information about regression analysis. I can't help you much more, but such algorithms do exist.

"Covering" the space of all possible histogram shapes

There is a very expensive computation I must make frequently.
The computation takes a small array of numbers (with about 20 entries) that sums to 1 (i.e. a histogram) and outputs something that I can store pretty easily.
I have 2 things going for me:
I can accept approximate answers
The "answers" change slowly. For example: [.1 .1 .8 0] and [.1
.1 .75 .05] will yield similar results.
Consequently, I want to build a look-up table of answers off-line. Then, when the system is running, I can look-up an approximate answer based on the "shape" of the input histogram.
To be precise, I plan to look-up the precomputed answer that corresponds to the histogram with the minimum Earth-Mover-Distance to the actual input histogram.
I can only afford to store about 80 to 100 precomputed (histogram, computation result) pairs in my look-up table.
So, how do I "spread out" my precomputed histograms so that, no matter what the input histogram is, I'll always have a precomputed result that is "close"?
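To make the lookup step concrete, here is a minimal sketch of what I have in mind (assuming scipy; for a 1-D histogram the Earth Mover's Distance is the Wasserstein distance between the bin positions weighted by the bin masses):

    import numpy as np
    from scipy.stats import wasserstein_distance

    N_BINS = 20
    bin_positions = np.arange(N_BINS)

    def nearest_precomputed(query_hist, table):
        # table: list of (histogram, precomputed_result) pairs built offline.
        # Returns the result whose histogram has minimum EMD to query_hist.
        best_hist, best_result = min(
            table,
            key=lambda entry: wasserstein_distance(
                bin_positions, bin_positions,
                u_weights=query_hist, v_weights=entry[0]))
        return best_result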
Finding N points in M-space that form an optimally spread-out set is more or less equivalent to hypersphere packing (1, 2), and in general the answers are not known for M > 10. While a fair amount of research has been done to develop faster methods for hypersphere packings or approximations of them, it is still regarded as a hard problem.
It would probably be better to apply a technique like principal component analysis or factor analysis to as large a set of histograms as you can conveniently generate. The result of either analysis is a set of M numbers such that linear combinations of the histogram elements weighted by those numbers predict some objective function. That function could be the "something that you can store pretty easily" numbers, or could be case numbers. Also consider developing and training a neural net, or using other predictive modeling techniques, to predict the objective function.
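A rough sketch of the PCA route, assuming scikit-learn and that Dirichlet draws are an acceptable stand-in for a representative set of histograms (you would substitute histograms from your own system):

    import numpy as np
    from sklearn.decomposition import PCA

    N_BINS, N_SAMPLES = 20, 50_000

    # Stand-in training set: random histograms that sum to 1.
    rng = np.random.default_rng(0)
    histograms = rng.dirichlet(np.ones(N_BINS), size=N_SAMPLES)

    # Keep the few directions that explain most of the variation in histogram shape.
    pca = PCA(n_components=3).fit(histograms)
    low_dim = pca.transform(histograms)   # each 20-bin histogram becomes 3 coordinates

    # The look-up table can then be indexed on these few coordinates
    # instead of on all 20 bins.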
Building on #jwpat7's answer, I would apply k-means clustering to a huge set of randomly generated (and hopefully representative) histograms. This would ensure that your space was spanned with whatever number of exemplars (precomputed results) you can support, with roughly equal weighting for each cluster.
The trick, of course, will be generating representative data to cluster in the first place. If you can recompute from time to time, you can recluster based on the actual data in the system so that your clusters might get better over time.
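A minimal sketch of that clustering step, assuming scikit-learn and (again hypothetically) Dirichlet samples as the stand-in for representative histograms:

    import numpy as np
    from sklearn.cluster import KMeans

    N_BINS, N_SAMPLES, N_EXEMPLARS = 20, 100_000, 100

    # Stand-in for representative histograms; each row sums to 1.
    histograms = np.random.default_rng(0).dirichlet(np.ones(N_BINS), size=N_SAMPLES)

    kmeans = KMeans(n_clusters=N_EXEMPLARS, n_init=10, random_state=0).fit(histograms)

    # Renormalize the centroids and run the expensive computation once per exemplar, offline.
    exemplars = kmeans.cluster_centers_ / kmeans.cluster_centers_.sum(axis=1, keepdims=True)
    # table = [(h, expensive_computation(h)) for h in exemplars]   # hypothetical offline step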
I second jwpat7's answer, but my very naive approach was to treat the count in each histogram bin as a y value, take the x values as just 0..1 in 20 steps, and then obtain parameters a, b, c that describe x vs y as a cubic function.
To get a "covering" of the histograms I just iterated through "possible" values for each parameter.
e.g. to get 27 histograms to cover the "shape space" of my cubic histogram model I iterated the parameters through -1 .. 1, choosing 3 values linearly spaced.
Now, you could change the histogram model to quartic if you think your data will often be represented better that way, or to whatever model you think is most descriptive, and generate as many histograms as you need for the covering. I used 27 because three values per parameter for three parameters gives 3*3*3 = 27.
For a more comprehensive covering, like 100, you would have to choose your ranges for each parameter more carefully. 100**(1/3) isn't an integer, so the simple num_covers**(1/num_params) approach wouldn't work, but for 3 parameters 4*5*5 would.
Since the actual parameter values can vary greatly and still produce the same shape, it would probably be best to store ratios of them for comparison instead, e.g. b/a and b/c for my 3 parameters.
Here is an 81 histogram "covering" using a quartic model, again with parameters chosen from linspace(-1,1,3):
Edit: Since you said your histograms are described by arrays of ~20 elements, I figured fitting the parameters would be very fast.
Edit 2: On second thought, I think using a constant term in the model is pointless; all that matters is the shape.
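Here is a minimal sketch of generating such a covering under the cubic (no constant term) model, assuming numpy; negative values are clipped and each curve is renormalized to sum to 1 so it remains a valid histogram, and degenerate all-zero combinations are skipped:

    import numpy as np
    from itertools import product

    N_BINS = 20
    x = np.linspace(0, 1, N_BINS)
    param_values = np.linspace(-1, 1, 3)    # 3 values per parameter -> up to 3**3 = 27 shapes

    covering = []
    for a, b, c in product(param_values, repeat=3):
        y = a * x**3 + b * x**2 + c * x     # cubic shape model, no constant term
        y = np.clip(y, 0, None)             # histogram bins can't be negative
        if y.sum() > 0:
            covering.append(y / y.sum())    # normalize so the bins sum to 1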

What is sparsity in image processing?

I am new to image processing and I don't know the use of the basic terms. I know the basic definition of sparsity, but can anyone please elaborate on the definition in terms of image processing?
Well Sajid, I was actually doing image processing a few months ago, and I found a website that gave me what I thought was the best definition of sparsity.
Sparsity and density are terms used to describe the percentage of cells in a database table that are not populated and populated, respectively. The sum of the sparsity and density should equal 100%. A table that is 10% dense has 10% of its cells populated with non-zero values. It is therefore 90% sparse - meaning that 90% of its cells are either not filled with data or are zeros.
I took this in the context of on/off for black and white image processing. If many pixels were off, then the pixels were sparse.
As The Obscure Question said, sparsity is when a vector or matrix is mostly zeros. To see a real world example of this, just look at the wavelet transform, which is known to be sparse for any real-world image.
(all the black values are 0)
Sparsity has powerful impacts. It can turn the multiplication of two NxN matrices, normally an O(N^3) operation, into roughly O(k) work (where k is the number of non-zero elements), because all the multiplications by zero can be skipped: for all x, x * 0 = 0.
What does sparsity mean? In the problems I've been exposed to, it means similarity in some domain. For example, natural images are largely the same color in areas (the sky is blue, the grass is green, etc). If you take the wavelet transform of that natural image, the output is sparse through the recursive nature of the wavelet (well, at least recursive in the Haar wavelet).
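As a small sketch of that observation, assuming the PyWavelets package (pywt) is available and using an arbitrary relative threshold to decide what counts as "approximately zero":

    import numpy as np
    import pywt

    def wavelet_sparsity(gray_image, levels=3, threshold=1e-2):
        # Fraction of multi-level Haar wavelet coefficients that are approximately zero.
        coeffs = pywt.wavedec2(gray_image, 'haar', level=levels)
        arrays = [coeffs[0].ravel()]                       # coarse approximation
        for detail_level in coeffs[1:]:                    # (horizontal, vertical, diagonal)
            arrays.extend(d.ravel() for d in detail_level)
        flat = np.concatenate(arrays)
        scale = np.abs(flat).max() or 1.0                  # avoid dividing by zero
        return float(np.mean(np.abs(flat) / scale < threshold))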

Random projection algorithm pseudo code

I am trying to apply the random projections method to a very sparse dataset. I found papers and tutorials about the Johnson-Lindenstrauss method, but every one of them is full of equations that don't give me a meaningful explanation. For example, this document on Johnson-Lindenstrauss.
Unfortunately, from this document I can get no idea about the implementation steps of the algorithm. It's a long shot, but is there anyone who can give me the plain-English version or very simple pseudocode of the algorithm? Or where can I start digging into these equations? Any suggestions?
For example, what I understand from the algorithm by reading this paper concerning Johnson-Lindenstrauss is that:
Assume we have an AxB matrix, where A is the number of samples and B is the number of dimensions, e.g. 100x5000, and I want to reduce its dimension to 500, which will produce a 100x500 matrix.
As far as I understand, I first need to construct a 5000x500 projection matrix and fill its entries randomly with +1 and -1 (each with 50% probability).
Edit:
Okay, I think I'm starting to get it. So we have a matrix A which is mxn. We want to reduce it to E, which is mxk.
What we need to do is construct a matrix R of dimension nxk and fill it with 0, -1 or +1 with probabilities 2/3, 1/6 and 1/6, respectively.
After constructing this R, we simply do the matrix multiplication A x R to get our reduced matrix E. But we don't need a full matrix multiplication: if an element of R is 0, we skip that calculation entirely; if it is +1, we just add the corresponding column, and if it is -1, we subtract it. So we only use summation rather than multiplication to find E, and that is what makes this method very fast.
It turned out a very neat algorithm, although I feel too stupid to get the idea.
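A minimal sketch of that procedure (assuming numpy; the sqrt(3/k) factor is the scaling used with this sparse +1/-1/0 scheme so that distances are preserved in expectation):

    import numpy as np

    def sparse_random_projection(A, k, seed=0):
        # Reduce the m x n data matrix A to m x k using a sparse random matrix
        # whose entries are +1, -1, 0 with probabilities 1/6, 1/6, 2/3.
        m, n = A.shape
        rng = np.random.default_rng(seed)
        R = rng.choice([1.0, -1.0, 0.0], size=(n, k), p=[1/6, 1/6, 2/3])
        return (A @ R) * np.sqrt(3.0 / k)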
You have the idea right. However, as I understand random projection, the rows of your matrix R should have unit length. I believe that's approximately what the normalization by 1/sqrt(k) is for: to compensate for the fact that they're not unit vectors.
It isn't a projection, but it's nearly one: R's rows aren't orthonormal, but in a much higher-dimensional space they very nearly are. In fact, the dot product of any two of the vectors you choose will be pretty close to 0. This is why it is generally a good approximation of actually finding a proper basis for the projection.
The mapping from high-dimensional data A to low-dimensional data E is given in the statement of theorem 1.1 in the latter paper - it is simply a scalar multiplication followed by a matrix multiplication. The data vectors are the rows of the matrices A and E. As the author points out in section 7.1, you don't need to use a full matrix multiplication algorithm.
If your dataset is sparse, then sparse random projections will not work well.
You have a few options here:
Option A:
Step 1: apply a structured dense random projection (the so-called fast Hadamard transform is typically used). This is a special projection that is very fast to compute but otherwise has the properties of a normal dense random projection.
Step 2: apply a sparse projection to the "densified" data (sparse random projections are useful for dense data only).
Option B:
Apply SVD to the sparse data. If the data is sparse but has some structure, SVD is better. Random projection preserves the distances between all points; SVD better preserves the distances between dense regions, which in practice is more meaningful. People also use random projections to compute the SVD of huge datasets. Random projections give you efficiency, but not necessarily the best quality of embedding into a low dimension.
If your data has no structure, then use random projections.
Option C:
For data points for which SVD has little error, use SVD; for the rest of the points, use random projection.
Option D:
Use a random projection based on the data points themselves.
It is very easy to understand what is going on; it looks something like this:
create an n by k matrix called scores (n = number of data points, k = new dimension)
for j from 0 to k-1 do:  # generate k random projection vectors
    randomized_combination = feature vector of zeros (length = number of features)
    sample_point_ids = select a random sample of point ids
    for each point_id in sample_point_ids do:
        random_sign = +1 or -1, each with probability 1/2
        randomized_combination += random_sign * feature_vector[point_id]  # vector operation
    normalize randomized_combination to unit length
    # note: a normal random projection vector would instead be [+/-1, +/-1, ...]
    # (if you want it sparse, randomly set a fraction of the entries to 0;
    # it is also good to normalize by length)
    # to project the data points onto this random feature, just do:
    for each point_id in dataset:
        scores[point_id, j] = dot_product(feature_vector[point_id], randomized_combination)
If you are still looking to solve this problem, write a message here and I can give you more pseudocode.
The way to think about it is that a random projection is just a random pattern, and the dot product between a data point and the pattern (i.e. projecting the data point) gives you the overlap between them. So if two data points overlap with many random patterns in the same way, those points are similar. Therefore, random projections preserve similarity while using less space, but they also add random fluctuations to the pairwise similarities. What the JL theorem tells you is that to keep the fluctuations around eps = 0.1 you need about 100*log(n) dimensions (roughly log(n)/eps^2).
Good Luck!
An R package to perform random projection using the Johnson-Lindenstrauss lemma: RandPro
