I am trying to compare a source image against thousands of images in a set to get a similarity score of the most likely matches (0 - 1). Each image is small (64x64 or smaller). Each image is 1 bit, meaning that each pixel is either off (fully transparent) or on (fully white). I am trying to create a very fast similarity algorithm to compare these images. I have found many similarity algorithms via Google search but they all involve comparing large, full color images, which I don't need to do.
I realize I can just compare pixels that match / don't match, but this can be potentially slow, given that the compare set can be very large. The compare set images will all be the same exact size as the lookup image.
Is it possible to create hashes or other fast lookups for these kinds of images where a hash or binary search lookup could be performed and similarity score created with the most likely matches?
To get a comparison score for binary images, I'd suggest you calculate the Hamming distance with xor operations and then count the number of ones. This can be sped up a lot using a fast popcount, either the hardware POPCNT instruction or an SSSE3-based bit-counting routine.
The Hamming distance tells you the number of bits that are different between two binary strings (so it's actually a dissimilarity value). To get a score in the range, say, [0, 1], you can divide by the size of the images (this way you get a score invariant to the image size).
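For illustration, here is a minimal sketch of the xor + popcount comparison in C++, assuming each 64x64 image has already been packed into 64 64-bit words (the packing step itself is not shown, and the names are only for the example):

#include <array>
#include <bit>      // std::popcount (C++20)
#include <cstddef>
#include <cstdint>

// One 64x64 1-bit image, packed as 64 rows of 64 bits each.
using PackedImage = std::array<std::uint64_t, 64>;

// Similarity in [0, 1]: 1.0 means identical, 0.0 means every pixel differs.
double similarity(const PackedImage& a, const PackedImage& b) {
    int mismatched = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        mismatched += std::popcount(a[i] ^ b[i]);   // per-row Hamming distance
    return 1.0 - static_cast<double>(mismatched) / (64.0 * 64.0);
}

On compilers without C++20, __builtin_popcountll (GCC/Clang) or the POPCNT intrinsic can stand in for std::popcount.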
With regard to the comparison with thousands of images, make sure it is actually a bottleneck, because if the data are not that large, it might be faster than you think. If you still need to make it faster, you can consider either or both of these ideas:
1) Parallelization: the function is probably very easy to parallelize with OpenMP or tbb, for example.
2) A hash table: use the first (or some subset of the) bits of each image to index them in a vector, and then compare only those images that belong to the same hash bin (see the sketch below). Of course, this is an approximate approach and you will not get a comparison score for every pair of images, only for those that are similar enough.
Keep in mind that if you want a score against all the images, you have to run the full comparison over the whole database, so beyond parallelization there is little room to speed it up.
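As a rough sketch of idea 2, reusing the packed representation from the previous snippet and bucketing on the first 16 bits of the first row (the 16-bit prefix width is an arbitrary choice for the example):

#include <array>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using PackedImage = std::array<std::uint64_t, 64>;  // as in the earlier sketch

// Index images by the top 16 bits of their first row; at query time, only
// images in the query's bucket are scored, so very different images are skipped.
std::unordered_map<std::uint16_t, std::vector<std::size_t>>
buildBuckets(const std::vector<PackedImage>& images) {
    std::unordered_map<std::uint16_t, std::vector<std::size_t>> buckets;
    for (std::size_t i = 0; i < images.size(); ++i)
        buckets[static_cast<std::uint16_t>(images[i][0] >> 48)].push_back(i);
    return buckets;
}

As the answer notes, this is approximate: an image whose bucket bits differ from the query's will never be scored at all.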
One way to do this would be a binary tree. Each image's pixels could be converted to a string of 1's and 0's, and that string could be used to construct a binary tree.
While checking a new string, you just follow where the path takes you: if you reach a leaf node, the string was already present; if you don't, it's new.
For example, here is a tree constructed using 3 strings of length 4:
1010
0110
0001
So, if 0001 comes again, just follow the path; if you end up in a leaf node, then the string (image) is a duplicate and has occurred before. If not, you can add it, knowing it is new and unique.
It will take O(n) time for each comparison and insertion, where n is the length of the string. In your case n == 32*32.
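A minimal sketch of such a tree in C++ (a binary trie keyed on the pixel string; the names are illustrative only):

#include <memory>
#include <string>

struct TrieNode {
    std::unique_ptr<TrieNode> child[2];   // child[0] for '0', child[1] for '1'
};

// Inserts the bit string; returns true if it was already present (duplicate image).
// All strings are assumed to have the same length, so no end-of-string marker is needed.
bool insertBits(TrieNode& root, const std::string& bits) {
    TrieNode* node = &root;
    bool existed = true;
    for (char c : bits) {
        int b = c - '0';
        if (!node->child[b]) {
            node->child[b] = std::make_unique<TrieNode>();
            existed = false;               // had to create a new branch: string is new
        }
        node = node->child[b].get();
    }
    return existed;
}

Note that this only answers "have I seen exactly this image before"; it does not by itself produce a similarity score.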
You could implement a quadtree structure https://en.wikipedia.org/wiki/Quadtree
Segment your images recursively. At each level, store the number of 1 and/or 0 pixels (one can be computed from the other)
Ex : for this image :
0 1 1 0
0 1 0 1
0 0 0 0
0 0 1 0
You compute the following tree :
(5)
(2) - (2) - (0) - (1)
(0) - (1) - (0) - (1) - - - (1) - (0) - (0) - (1) - - - (0) - (0) - (0) - (0) - - - (0) - (0) - (1) - (0)
The higher levels of the tree are coarser versions of the image :
First level :
5/16
Second level :
2/4 2/4
0/4 1/4
Then, your similarity score could be computed from how different the counts of 0s and 1s are at the various levels of recursion, with a weight for each level. And you can get an approximation of it (to quickly dismiss very different images) by not going down the whole tree.
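A rough sketch of that scoring idea, assuming square images with a power-of-two side stored as flat vectors of 0/1 values; a real implementation would precompute each image's quadtree of counts once rather than recounting pixels as done here, and the per-level weights are arbitrary placeholders:

#include <cmath>
#include <cstdlib>
#include <vector>

// Number of 1-pixels in the square block with top-left corner (x, y) and side `size`.
int countOnes(const std::vector<int>& img, int side, int x, int y, int size) {
    int c = 0;
    for (int dy = 0; dy < size; ++dy)
        for (int dx = 0; dx < size; ++dx)
            c += img[(y + dy) * side + (x + dx)];
    return c;
}

// Weighted dissimilarity over quadtree levels; stops descending after maxLevels,
// which gives the quick approximate rejection mentioned above.
double quadDiff(const std::vector<int>& a, const std::vector<int>& b,
                int side, int x, int y, int size, int level, int maxLevels) {
    double diff = std::abs(countOnes(a, side, x, y, size) -
                           countOnes(b, side, x, y, size));
    double score = diff / std::pow(2.0, level);        // coarse levels weigh more
    if (level >= maxLevels || size == 1) return score;
    int h = size / 2;
    score += quadDiff(a, b, side, x,     y,     h, level + 1, maxLevels);
    score += quadDiff(a, b, side, x + h, y,     h, level + 1, maxLevels);
    score += quadDiff(a, b, side, x,     y + h, h, level + 1, maxLevels);
    score += quadDiff(a, b, side, x + h, y + h, h, level + 1, maxLevels);
    return score;
}

Calling quadDiff(a, b, side, 0, 0, side, 0, maxLevels) gives a value where lower means more similar; raising maxLevels trades speed for accuracy.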
If you find that comparing all images completely (using e.g. ChronoTrigger's answer) still takes too much time, consider these two strategies to reduce the number of necessary comparisons.
I will assume that the images are compared line-by-line. You start by comparing the first image completely and store its score as the maximum, then move on to the next, each time updating the maximum as necessary. While comparing each image line-by-line, you do the following after each line (or after every n lines):
Check if the number of mismatched bits so far exceeds the number of mismatches in the image with the maximum score so far. If it does, this image can never reach the maximum score, and it can be discarded.
If the average score per line so far is lower than the average score per line of the image with the maximum score, leave the comparison to be continued during the next run, and skip to the next image.
Repeat this until all images have been completely checked, or discarded.
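A simplified sketch that implements just the first (discard) rule, expressed in mismatch counts rather than scores; the postpone-and-resume bookkeeping from the second rule is left out, and 32x32 images are assumed to be packed as 32 32-bit rows:

#include <array>
#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

using Img32 = std::array<std::uint32_t, 32>;   // 32x32 1-bit image, one row per word

// Returns the index of the best match, comparing row by row and bailing out as
// soon as an image has accumulated too many mismatches to beat the best so far.
std::size_t bestMatch(const Img32& query, const std::vector<Img32>& candidates) {
    std::size_t best = 0;
    int bestMismatches = 32 * 32 + 1;           // worse than any possible result
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        int mismatches = 0;
        for (int row = 0; row < 32; ++row) {
            mismatches += std::popcount(query[row] ^ candidates[i][row]);
            if (mismatches >= bestMismatches) break;   // can no longer win: discard
        }
        if (mismatches < bestMismatches) {
            bestMismatches = mismatches;
            best = i;
        }
    }
    return best;
}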
Trying this strategy on 100 random 32x32-pixel images based on an example image, each with a random number of bits altered, gave this promising result:
FIRST RUN (100 images):
images checked completely: 5 (maximum score: 1015, image #52)
postponed after 1 line: 59
discarded after 1 line: 35
discarded after 10 lines: 1
SECOND RUN (59 images):
discarded without additional checks: 31 (because of new maximum score)
discarded after 1 additional line: 12
discarded after 2 additional lines: 9
discarded after 3 additional lines: 1
discarded after 4 additional lines: 3
discarded after 5 additional lines: 1
discarded after 6 additional lines: 2
Total number of lines compared: 326 out of 3200 lines (~ 10.1875 out of 100 images)
If your image stores pixel data in a bitmap-like format, then every line is just a 32-bit integer value, and you can simply compare image lines:
for iy := 0 to ImageHeight - 1 do
  if CastToInt32(Image1.Scanline[iy]) <> CastToInt32(Image2.Scanline[iy]) then
    break; // inequality found
// 32 comparisons or less
For the case of approximate similarity, you can calculate the overall number of discrepancies by counting set bits in the xor-ed values for all lines:
NumberOf1Bits(Value1 xor Value2)
P.S. A straightforward implementation in Delphi takes 300 nanoseconds per image/image comparison (0.3 sec for 1 million images). Single thread, i5 processor, mismatch limit of 450.
The time is significantly lower for a low mismatch limit (47 ns for a limit of 45).
The main time eater is the NumberOf1Bits/popcount function.
I made an image hashing class for the Skia4Delphi library unit tests. It generates a hash that makes it possible to compare the similarity percentage between 2 images using only the hash. The focus was on accuracy and not performance, but the performance is not bad. To use it, you must have Skia4Delphi installed. Check the source: https://github.com/skia4delphi/skia4delphi/blob/main/Tests/Source/Skia.Tests.Foundation.ImageHash.pas
Given a (long) string of 0's and 1's, I need to be able to answer quickly queries of the kind: how many 1's in the string precede a given index i? One can assume that a 1 is located at index i.
I am looking for an as compact as possible a data structure that can be computed once for the given string of 0's and 1's and then used as a look-up table to answer quickly the queries as described above.
Background. In my particular case, the string of 0's and 1's encodes a grid map (such as in a video game), where 0 denotes an obstacle and 1 denotes a passable location. I store distances from all passable locations to one special location in an array. The query corresponds to this: given a passable location (i.e. an index into the string of 0's and 1's), I need to be able to determine quickly the corresponding index into the array of distances.
You're looking for "succinct indexable dictionaries", which are known by many other names, as well - you can also Google for "succinct rank select". The best solutions have ~5% overhead and constant-time lookups.
"Space-Efficient, High-Performance Rank & Select Structures on Uncompressed Bit Sequences", by Dong Zhou, David G. Andersen, Michael Kaminsky
https://github.com/efficient/rankselect
https://www.cs.cmu.edu/~dga/papers/zhou-sea2013.pdf
I am looking for an as compact as possible a data structure that can be computed once for the given string of 0's and 1's and then used as a look-up table to answer quickly the queries as described above.
This problem is about six decades old and extensively solved. What you're looking at is really just a bit vector that you could define to be 0 for every value other than 1.
If there are very few 1s compared to other values, just go with one of the many sparse vector representations that have been around with linear algebra libraries forever.
You're not giving enough info (like, is your original vector still going to be available, or is it going to be deleted as soon as you have your data structure? I'm going to assume the latter), but assuming this is an exercise in solving real-world problems on your own rather than choosing the right library to do so:
Knowing real computers are nothing like what the algorithms they teach in basic CS were optimized for, the best storage is almost always linear storage.
Because counting ones is actually much less time-intensive than loading data from RAM into CPU registers, the most effective choice here is the simplest:
Take a wordlength (for example, 64) of your original vector's values, and convert them to bits set (or not set, if the value != 1) in a word; move on to the next word and the next part of your original vector.
Now, to evaluate the number of ones, you would just use a "population count" instruction that practically all CPUs nowadays have, introduced in x86(_64) alongside SSE4.2 as POPCNT. Use SIMD instructions to generate the sum over adjacent word population counts, and accumulate them up to the point of your index/wordlength. If your problem is large enough and you have multiple cores with individual caches, you can also easily divide that algorithm into multiple parallel threads, because there's no mutual dependency; you just add up the partial sums at the end. Having implemented similar SIMD-optimized code myself, multithreading doesn't pay off if you're limited by CPU cache, because you just end up waiting on RAM with multiple cores.
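Leaving out the SIMD and threading, a minimal scalar sketch of that lookup in C++, assuming the vector has already been packed into 64-bit words as described:

#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of 1-bits strictly before bit position `idx` in a packed bit vector.
std::uint64_t rankBefore(const std::vector<std::uint64_t>& words, std::size_t idx) {
    std::uint64_t ones = 0;
    std::size_t word = idx / 64, bit = idx % 64;
    for (std::size_t w = 0; w < word; ++w)      // whole words before idx
        ones += std::popcount(words[w]);
    if (bit != 0)                               // partial word: keep only bits below idx
        ones += std::popcount(words[word] & ((std::uint64_t{1} << bit) - 1));
    return ones;
}

If the same vector is queried many times, precomputing a running sum per block (as the succinct rank/select structures mentioned in the other answer do) turns this loop into a constant-time lookup.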
Anyone telling you to use "run-length" or "linked-list" implementations to encode the distance between 1s neglects the fact that, as mentioned, the problematic part is getting data from RAM, not the actual counting. Memory controllers always fetch a whole memory "row", not just a single value, so while waiting for the first element may easily take as long as counting the 1s in a couple hundred words, subsequent accesses to words from the same row are pretty fast.
This is pretty nicely illustrated (partly with invisible graphs) by Bjarne Stroustrup, one of the evil masterminds behind C++, in this short lecture.
EDIT: I realised I'd answered the inverse of your problem; this tells you how many 1's come at or after your position. Just go forward through the bitmap instead of backwards as I suggest below.
Try creating an int array the length of your bitmap. Working backwards, sum the number of 1's you've seen so far; e.g.
[ 1 0 0 1 1 0 1 0 1 1 1 0 0 0 ]
gives
[ - - - - - - - - - - - - - 0 ]
[ - - - - - - - - - - - - 0 0 ]
[ - - - - - - - - - - - 0 0 0 ]
[ - - - - - - - - - - 1 0 0 0 ]
[ - - - - - - - - - 2 1 0 0 0 ]
[ - - - - - - - - 3 2 1 0 0 0 ]
[ - - - - - - - 3 3 2 1 0 0 0 ]
...
[ 7 6 6 6 5 4 4 3 3 2 1 0 0 0 ]
Now it's just an array lookup, with the added benefit that if you want to know the number between any two points, you can work it out by subtracting one from another.
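The forward version (which answers the original "how many 1's precede index i" question directly), plus the subtraction trick for ranges, might look like this:

#include <cstddef>
#include <cstdint>
#include <vector>

// prefix[i] = number of 1's at positions < i, so prefix has bits.size() + 1 entries.
std::vector<std::uint32_t> buildPrefix(const std::vector<std::uint8_t>& bits) {
    std::vector<std::uint32_t> prefix(bits.size() + 1, 0);
    for (std::size_t i = 0; i < bits.size(); ++i)
        prefix[i + 1] = prefix[i] + bits[i];
    return prefix;
}

// Number of 1's in the half-open range [from, to): a subtraction of two lookups.
std::uint32_t onesInRange(const std::vector<std::uint32_t>& prefix,
                          std::size_t from, std::size_t to) {
    return prefix[to] - prefix[from];
}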
I'm trying to arrange multiple chunks of PCM audio data into a specific sequence.
The fact that it's audio data is just for context; the problem itself has nothing to do with audio/DSP.
My input is a varying set of files with varying lengths, and I'm trying to arrange the data sequentially into a new file, adding padding after each segment where needed, so that each input element is aligned to a grid that divides the file into 120 equal units. In other words, I need to be able to address the beginning of each segment by choosing an offset between 0-119.
To illustrate the problem, here is a trivial example. Two input files have the following byte lengths:
200
+ 400
---
= 600
In this case, no padding is needed.
The files can be arranged back to back, as they fit into the 120-grid as is: the 200-byte file covers the range 0-40 (40 units), and the 400-byte file covers the range 40-120 (80 units).
This becomes trickier if any of the files do not fit into the grid.
199
+ 398
---
= 597
Intuitively, it's easy to see that the 199-byte file needs 1 byte of padding at the end so that its length becomes 200, and the 398-byte file needs 2 bytes to become 400 bytes. We then have a nice 1:2 ratio between the 2 files, which in the 120-grid translates to 40 and 80 units.
Now, I'm trying to find an algorithm which can do this for any number of input files from 1-120, where each file can have an arbitrary non-zero length.
Maybe there is an existing algorithm which does just that, but I'm finding it difficult to come up with descriptive keywords for the problem.
I've tried to solve this naively, but somehow I fail to grok the problem fully. Basically I need to grow the individual files so that their sizes are multiples of some common unit derived from the sum of their lengths, which to me is kind of a chicken-and-egg problem: if I grow the files so their ratios fit together, I also grow the sum of their lengths, and I don't understand how to check both against the 120-grid...
Edit: OK, I think I got it:
https://gist.github.com/jpenca/b033122fcb2300c5e9e4
I'm not sure how to prove correctness, but trying this with varying inputs seems to work fine.
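I can't speak for what the linked gist does, but one way to make the intuition concrete: pick the smallest grid unit g such that every segment, rounded up to a whole number of units, still fits within 120 units in total, then pad each segment to a multiple of g. A sketch of that idea in C++ (the names are mine, and this is only one possible formulation):

#include <cstdint>
#include <numeric>
#include <vector>

// For each input length, returns the padded length such that every padded
// segment is a whole number of grid units and all segments together fit in
// 120 units; the output file can then be padded out to 120 * g bytes in total.
std::vector<std::uint64_t> padTo120Grid(const std::vector<std::uint64_t>& lengths) {
    std::uint64_t total = std::accumulate(lengths.begin(), lengths.end(),
                                          std::uint64_t{0});
    std::uint64_t g = (total + 119) / 120;               // lower bound for the unit size
    if (g == 0) g = 1;
    for (;; ++g) {                                       // grow the unit until everything fits
        std::uint64_t units = 0;
        for (auto len : lengths) units += (len + g - 1) / g;
        if (units <= 120) break;
    }
    std::vector<std::uint64_t> padded;
    for (auto len : lengths)
        padded.push_back(((len + g - 1) / g) * g);       // round up to a unit boundary
    return padded;
}

For the 199 + 398 example this gives g = 5 and padded lengths 200 and 400, i.e. 40 and 80 grid units in a 600-byte file, matching the intuition above.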
I have two arrays of data:
I would like to align these similar graphs together (by adding an offset to either array):
Essentially what I want is the most constructive interference, as shown when two waves together produce the same wave but with larger amplitude:
This is also the same as finding the most destructive interference, but one of the arrays must be inverted as shown:
Notice that the second wave is inverted (peaks become troughs / vice-versa).
The actual data will not only consist of one major and one minor peak and trough, but of many, and there might not be any noticeable spikes. I have made the data in the diagram simpler to show how I would like the data aligned.
I was thinking about a few loops, such as:
biggest = 0
loop from -10 to 10 as offset
count = 0
loop through array1 as ar1
loop through array2 as ar2
count += array1[ar1] + array2[ar2 - offset]
replace biggest with count if count/sizeof(array1) > biggest
However, that requires looping through offset and looping through both arrays. My real array definitions are extremely large and this would take too long.
How would I go about determining the offset required to match data1 with data2?
JSFiddle (note that this is language-agnostic and I would like to understand the algorithm more so than the actual code)
Look at convolution and cross-correlation and their computation using the Fast Fourier Transform. That's how it is done in real-life applications.
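For reference, the brute-force form of that cross-correlation (before reaching for an FFT) looks roughly like this; for long arrays it usually helps to subtract each array's mean first, and you would compute the same correlation via FFT instead of this O(N * maxShift) loop. C++ is used here only for concreteness:

#include <cstddef>
#include <vector>

// Returns the shift of `b` relative to `a` (within +/- maxShift) that maximizes
// the sum of products, i.e. the offset with the most constructive interference.
int bestOffset(const std::vector<double>& a, const std::vector<double>& b,
               int maxShift) {
    int bestShift = 0;
    double bestScore = -1e300;
    for (int shift = -maxShift; shift <= maxShift; ++shift) {
        double score = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) {
            long long j = static_cast<long long>(i) + shift;
            if (j >= 0 && j < static_cast<long long>(b.size()))
                score += a[i] * b[static_cast<std::size_t>(j)];  // overlapping samples only
        }
        if (score > bestScore) { bestScore = score; bestShift = shift; }
    }
    return bestShift;
}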
If (and only if) your data has very recognizable spikes, you could do what a human being would do: match the spikes. Fiddle
The important part is the function matchData().
An improved version would search for the N biggest max and min spikes, then calculate an average offset.
I am brainstorming for a project which will store large chunks of coordinate data (latitude, longitude) in a database. Key aspects of this data will be calculated and stored, and then the bulk of the data will be compressed and stored. I am looking for a lossless compression algorithm to reduce the storage space of this data. Is there an (preferably common) algorithm which is good at compressing this type of data?
Known attributes of the data
The coordinate pairs are ordered and that order should be preserved.
All numbers will be limited to 5 decimal places (roughly 1m accuracy).
The coordinate pairs represent a path, and adjacent pairs will likely be relatively close to each other in value.
Example Data
[[0.12345, 34.56789], [0.01234, 34.56754], [-0.00012, 34.56784], …]
Note: I am not so concerned about language at this time, but I will potentially implement this in Javascript and PHP.
Thanks in advance!
To expand on the delta encoding suggested by barak manos, you should start by encoding the coordinates as binary numbers instead of strings. Use four-byte signed integers, each equal to 10^5 times your values.
Then apply delta encoding, where each latitude and longitude respectively are subtracted from the previous one. The first lat/long is left as is.
Now break the data into four planes, one for each of the four-bytes in the 32-bit integers. The higher bytes will be mostly zeros, with all of the entropy in the lower bytes. You can break the data into blocks, so that your planes don't have to span the entire data set.
Then apply zlib or lzma compression.
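A sketch of those two steps (delta encoding followed by splitting into byte planes), applied to one coordinate stream at a time, with the final zlib/lzma call left out; the values are assumed to already be the 10^5-scaled 32-bit integers described above:

#include <cstddef>
#include <cstdint>
#include <vector>

// Delta-encode the fixed-point coordinates, then split the deltas into four
// byte planes so the mostly-zero high bytes sit together and compress well.
std::vector<std::uint8_t> deltaBytePlanes(const std::vector<std::int32_t>& values) {
    std::vector<std::int32_t> deltas(values.size());
    std::int32_t prev = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        deltas[i] = values[i] - prev;              // the first value is kept as-is
        prev = values[i];
    }
    std::vector<std::uint8_t> planes;
    planes.reserve(deltas.size() * 4);
    for (int byte = 0; byte < 4; ++byte)           // plane 0 = low bytes ... 3 = high bytes
        for (std::int32_t d : deltas)
            planes.push_back(static_cast<std::uint8_t>(
                (static_cast<std::uint32_t>(d) >> (8 * byte)) & 0xFF));
    return planes;                                 // feed this buffer to zlib or lzma
}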
I would recommend that you first exploit the fact that adjacent symbols are similar, and convert your data in order to reduce the entropy. Then, apply the compression algorithm of your choice on the output.
Let IN_ARR be the original array and OUT_ARR be the converted array (input for compression):
OUT_ARR[0] = IN_ARR[0]
for i = 1 to N-1
OUT_ARR[i] = IN_ARR[i] - IN_ARR[i-1]
For simplicity, the pseudo-code above is written for 1-dimension coordinates.
But of course, you can easily implement it for 2-dimension coordinates...
And of course, you will have to apply the inverse operation after decompression:
IN_ARR[0] = OUT_ARR[0]
for i = 1 to N-1
IN_ARR[i] = OUT_ARR[i] + IN_ARR[i-1]
Here is a way to structure your data efficiently to get the most out of it:
First, divide your data into two sets, the integer parts and the decimal parts:
eg: [1.23467,2.45678] => [1,2] and [23467,45678] => [1],[2],[23467],[45678]
As your data seems random, the first thing you can do for compression is to not store it directly as strings, but to bit-pack it as follows:
The range of latitudes is -90 to +90, a total of about 180 values, so the integer part needs log2(180) bits, i.e. 8 bits.
The range of longitudes is -180 to +180, about 360 values, so log2(360) bits, i.e. 9 bits.
The decimals have 5 digits, so they need log2(10^5) = 17 bits each.
Using this packing you need 8 + 9 + 17*2 = 51 bits per record, whereas strings would need up to 2 + 3 + 5*2 = 15 bytes per record.
Compression ratio = 51/(15*8) = 42% compared with the string data size.
Compression ratio = 51/(2*32) = 80% compared with the float data size.
Group similar parts of the path into 4 groups, for example:
[[0.12345,34.56789],[0.01234,34.56754],[-0.00012,34.56784]...]
=> [0,0,-0],[34,34,34],[12345,1234,12],[56789,56754,56784]
Use delta encoding on the individual groups and then apply Huffman coding to get further compression on the total data.
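A sketch of the 51-bit packing described above, with negative values handled by offsetting latitude by +90 and longitude by +180 before splitting into integer and fractional parts (a matching unpack routine would simply reverse these steps):

#include <cmath>
#include <cstdint>

// Packs one (lat, lon) pair with 5-decimal precision into 51 bits of a uint64_t:
// 8 bits lat integer part, 9 bits lon integer part, 17 bits per fractional part.
std::uint64_t packCoord(double lat, double lon) {
    std::uint64_t latFix = static_cast<std::uint64_t>(std::llround((lat + 90.0) * 100000.0));
    std::uint64_t lonFix = static_cast<std::uint64_t>(std::llround((lon + 180.0) * 100000.0));
    std::uint64_t latInt = latFix / 100000, latFrac = latFix % 100000;   // 8 + 17 bits
    std::uint64_t lonInt = lonFix / 100000, lonFrac = lonFix % 100000;   // 9 + 17 bits
    return (latInt << 43) | (lonInt << 34) | (latFrac << 17) | lonFrac;
}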
I am a graduate student of physics and I am working on writing some code to sort several hundred gigabytes of data and return slices of that data when I ask for it. Here is the trick, I know of no good method for sorting and searching data of this kind.
My data essentially consists of a large number of sets of numbers. These sets can contain anywhere from 1 to n numbers within them (though in 99.9% of the sets, n is less than 15) and there are approximately 1.5 ~ 2 billion of these sets (unfortunately this size precludes a brute force search).
I need to be able to specify a set with k elements and have every set with k+1 elements or more that contains the specified subset returned to me.
Simple Example:
Suppose I have the following sets for my data:
(1,2,3)
(1,2,3,4,5)
(4,5,6,7)
(1,3,8,9)
(5,8,11)
If I were to give the request (1,3) I would have the sets: (1,2,3), (1,2,3,4,5), and (1,3,8,9).
The request (11) would return the set: (5,8,11).
The request (1,2,3) would return the sets: (1,2,3) and (1,2,3,4,5)
The request (50) would return no sets.
By now the pattern should be clear. The major difference between this example and my data is that the sets within my data are larger, the numbers used for each element of the sets run from 0 to 16383 (14 bits), and there are many, many more sets.
If it matters, I am writing this program in C++, though I also know Java, C, some assembly, some Fortran, and some Perl.
Does anyone have any clues as to how to pull this off?
edit:
To answer a couple questions and add a few points:
1.) The data does not change. It was all taken in one long set of runs (each broken into 2 gig files).
2.) As for storage space: the raw data takes up approximately 250 gigabytes. I estimate that after processing and stripping off a lot of extraneous metadata that I am not interested in, I could knock that down to anywhere from 36 to 48 gigabytes, depending on how much metadata I decide to keep (without indices). Additionally, if in my initial processing of the data I encounter enough sets that are the same, I might be able to compress the data further by adding counters for repeat events rather than simply repeating the events over and over again.
3.) Each number within a processed set actually contains at LEAST two numbers: 14 bits for the data itself (detected energy) and 7 bits for metadata (detector number). So I will need at LEAST three bytes per number.
4.) My "though in 99.9% of the sets, n is less than 15" comment was misleading. In a preliminary glance through some of the chunks of the data I find that I have sets that contain as many as 22 numbers but the median is 5 numbers per set and the average is 6 numbers per set.
5.) While I like the idea of building an index of pointers into files, I am a bit leery because for requests involving more than one number I am left with the semi-slow task (at least I think it is slow) of finding the pointers common to all of the lists, i.e. intersecting the lists for a given set of numbers.
6.) In terms of resources available to me, I can muster approximately 300 gigs of space after I have the raw data on the system (The remainder of my quota on that system). The system is a dual processor server with 2 quad core amd opterons and 16 gigabytes of ram.
7.) Yes, 0 can occur; it is an artifact of the data acquisition system when it does, but it can occur.
Your problem is the same as that faced by search engines. "I have a bajillion documents. I need the ones which contain this set of words." You just have (very conveniently) integers instead of words, and smallish documents. The solution is an inverted index. Introduction to Information Retrieval by Manning et al. is available free online, is very readable, and will go into a lot of detail about how to do this.
You're going to have to pay a price in disk space, but it can be parallelized, and should be more than fast enough to meet your timing requirements, once the index is constructed.
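A toy in-memory version of such an inverted index; at this data size the posting lists would live on disk, as other answers describe, but the structure and the query-time intersection are the same (the names are illustrative):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// postings[v] = sorted list of IDs of the sets that contain value v (0..16383).
using Postings = std::vector<std::vector<std::uint32_t>>;

Postings buildIndex(const std::vector<std::vector<std::uint16_t>>& sets) {
    Postings postings(16384);
    for (std::size_t id = 0; id < sets.size(); ++id)
        for (std::uint16_t v : sets[id])
            postings[v].push_back(static_cast<std::uint32_t>(id));  // IDs stay sorted
    return postings;
}

// IDs of the sets containing every value in the query: intersect the posting lists.
std::vector<std::uint32_t> query(const Postings& postings,
                                 const std::vector<std::uint16_t>& values) {
    if (values.empty()) return {};
    std::vector<std::uint32_t> result = postings[values[0]];
    for (std::size_t i = 1; i < values.size(); ++i) {
        std::vector<std::uint32_t> next;
        std::set_intersection(result.begin(), result.end(),
                              postings[values[i]].begin(), postings[values[i]].end(),
                              std::back_inserter(next));
        result = std::move(next);
    }
    return result;
}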
Assuming a random distribution of 0-16383, with a consistent 15 elements per set, and two billion sets, each element would appear in approximately 1.8M sets. Have you considered (and do you have the capacity for) building a 16384x~1.8M (30B entries, 4 bytes each) lookup table? Given such a table, you could query which sets contain (1) and (17) and (5555) and then find the intersections of those three ~1.8M-element lists.
My guess is as follows.
Assume that each set has a name or ID or address (a 4-byte number will do if there are only 2 billion of them).
Now walk through all the sets once, and create the following output files:
A file which contains the IDs of all the sets which contain '1'
A file which contains the IDs of all the sets which contain '2'
A file which contains the IDs of all the sets which contain '3'
... etc ...
If there are 16 entries per set, then on average each of these 2^14 files will contain the IDs of roughly 2^21 sets; with each ID being 4 bytes, this would require about 2^37 bytes (128 GB) of storage.
You'll do the above once, before you process requests.
When you receive requests, use these files as follows:
Look at a couple of numbers in the request
Open up a couple of the corresponding index files
Get the list of all sets which exist in both these files (there are only a couple of million IDs in each file, so this shouldn't be difficult)
See which of these few sets satisfy the remainder of the request
My guess is that if you do the above, creating the indexes will be (very) slow and handling requests will be (very) quick.
I have recently discovered methods that use space-filling curves to map the multi-dimensional data down to a single dimension. One can then index the data based on its 1D index. Range queries can be easily carried out by finding the segments of the curve that intersect the box representing the query and then retrieving those segments.
I believe that this method is far superior to making the insane indexes as suggested because after looking at it, the index would be as large as the data I wished to store, hardly a good thing. A somewhat more detailed explanation of this can be found at:
http://www.ddj.com/184410998
and
http://www.dcs.bbk.ac.uk/~jkl/publications.html
Make 16383 index files, one for each possible search value. For each value in your input set, write the file position of the start of the set into the corresponding index file. It is important that each of the index files contains the same number for the same set. Now each index file will consist of ascending indexes into the master file.
To search, start reading the index files corresponding to each search value. If you read an index that's lower than the index you read from another file, discard it and read another one. When you get the same index from all of the files, that's a match - obtain the set from the master file, and read a new index from each of the index files. Once you reach the end of any of the index files, you're done.
If your values are evenly distributed, each index file will contain 1/16383 of the input sets. If your average search set consists of 6 values, you will be doing a linear pass over 6/16383 of your original input. It's still an O(n) solution, but your n is a bit smaller now.
P.S. Is zero an impossible result value, or do you really have 16384 possibilities?
Just playing devil's advocate for an approach which includes brute force + index lookup :
Create an index with the min, max, and number of elements of each set.
Then apply brute force, excluding sets whose max is less than the max of the set being searched, and sets whose min is greater than the min of the set being searched.
In the brute force pass, also exclude sets whose element count is less than that of the set being searched.
95% of your searches would really be brute-forcing a much smaller subset. Just a thought.