Lossless Compression for Coordinate Path Data - algorithm

I am brainstorming for a project which will store large chunks of coordinate data (latitude, longitude) in a database. Key aspects of this data will be calculated and stored, and then the bulk of the data will be compressed and stored. I am looking for a lossless compression algorithm to reduce the storage space of this data. Is there a (preferably common) algorithm which is good at compressing this type of data?
Known attributes of the data
The coordinate pairs are ordered and that order should be preserved.
All numbers will be limited to 5 decimal places (roughly 1m accuracy).
The coordinate pairs represent a path, and adjacent pairs will likely be relatively close to each other in value.
Example Data
[[0.12345, 34.56789], [0.01234, 34.56754], [-0.00012, 34.56784], …]
Note: I am not so concerned about language at this time, but I will potentially implement this in JavaScript and PHP.
Thanks in advance!

To expand on the delta encoding suggested by barak manos, you should start by encoding the coordinates as binary numbers instead of strings. Use four-byte signed integers, each equal to 10^5 (100,000) times your values.
Then apply delta encoding, where each latitude and longitude is replaced by its difference from the previous latitude or longitude respectively. The first lat/long pair is left as is.
Now break the data into four planes, one for each of the four bytes in the 32-bit integers. The higher bytes will be mostly zeros, with most of the entropy concentrated in the lower bytes. You can break the data into blocks, so that your planes don't have to span the entire data set.
Then apply zlib or lzma compression.
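A minimal Python sketch of that pipeline, assuming one block per path and zlib as the final stage (the function name and the little-endian byte-plane layout are my own choices; decompression simply reverses the four steps):
import struct
import zlib

SCALE = 10**5   # five decimal places

def compress_path(coords):
    """coords: list of (lat, lon) floats with at most 5 decimal places."""
    # 1. Quantize to four-byte signed integers.
    ints = [(round(lat * SCALE), round(lon * SCALE)) for lat, lon in coords]
    # 2. Delta-encode latitudes and longitudes separately; keep the first pair as-is.
    flat = []
    prev_lat = prev_lon = 0
    for i, (lat, lon) in enumerate(ints):
        if i == 0:
            flat += [lat, lon]
        else:
            flat += [lat - prev_lat, lon - prev_lon]
        prev_lat, prev_lon = lat, lon
    # 3. Split into four byte planes (plane k = k-th little-endian byte of every value).
    raw = b"".join(struct.pack("<i", v) for v in flat)
    planes = b"".join(raw[k::4] for k in range(4))
    # 4. Let a general-purpose compressor squeeze the mostly-zero high planes.
    return zlib.compress(planes, 9)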

I would recommend that you first exploit the fact that adjacent symbols are similar, and convert your data in order to reduce the entropy. Then, apply the compression algorithm of your choice on the output.
Let IN_ARR be the original array and OUT_ARR be the converted array (input for compression):
OUT_ARR[0] = IN_ARR[0]
for i = 1 to N-1
    OUT_ARR[i] = IN_ARR[i] - IN_ARR[i-1]
For simplicity, the pseudo-code above is written for 1-dimensional coordinates.
But of course, you can easily implement it for 2-dimensional coordinates...
And of course, you will have to apply the inverse operation after decompression:
IN_ARR[0] = OUT_ARR[0]
for i = 1 to N-1
    IN_ARR[i] = OUT_ARR[i] + IN_ARR[i-1]
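As a concrete illustration, here is a 2-D version of the pseudo-code in Python, assuming the coordinates have already been scaled to integers (as in the previous answer) so the round trip is exact; the function names are mine:
def delta_encode(pairs):
    """pairs: list of [lat, lon]; each pair is replaced by its difference
    from the previous pair, the first pair is kept as-is."""
    out = [list(pairs[0])]
    for i in range(1, len(pairs)):
        out.append([pairs[i][0] - pairs[i - 1][0],
                    pairs[i][1] - pairs[i - 1][1]])
    return out

def delta_decode(deltas):
    """Inverse operation, applied after decompression."""
    pairs = [list(deltas[0])]
    for i in range(1, len(deltas)):
        pairs.append([deltas[i][0] + pairs[i - 1][0],
                      deltas[i][1] + pairs[i - 1][1]])
    return pairs

# round trip on integer-scaled coordinates (values * 10^5)
path = [[12345, 3456789], [1234, 3456754], [-12, 3456784]]
assert delta_decode(delta_encode(path)) == path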

Here is a way to structure your data efficiently so you get the most out of it:
First divide your data into two sets, the integer parts and the decimal parts:
e.g. [1.23467, 2.45678] => [1, 2] and [23467, 45678] => [1], [2], [23467], [45678]
As your data seems fairly random, the first thing you can do for compression is to avoid storing it directly as strings and instead use the following bit-level encoding (a sketch follows below):
The range of latitude integer parts is -90 to +90, i.e. 181 values, so you need ceil(log2(181)) = 8 bits for the first value.
The range of longitude integer parts is -180 to +180, i.e. 361 values, so you need ceil(log2(361)) = 9 bits.
The decimal parts have 5 digits, so each needs ceil(log2(10^5)) = 17 bits.
With this encoding you need 8 + 9 + 2*17 = 51 bits per record, whereas strings would need at most 2 + 3 + 2*5 = 15 bytes per record.
compression ratio = 51/(15*8) ≈ 42% compared with the string data size
compression ratio = 51/(2*32) ≈ 80% compared with the float (two 32-bit values) data size
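A sketch of that 51-bit packing in Python (the field order, the helper name pack_record and the +90/+180 offsets are my own choices, not part of the original suggestion):
import math

def pack_record(lat, lon):
    """Pack one coordinate pair into a 51-bit integer:
    8 bits for the latitude integer part, 9 bits for the longitude integer part,
    and 17 bits for each 5-digit fractional part.
    floor() is used instead of plain truncation so that negative values
    (the '-0' case in the grouping example below) still round-trip exactly."""
    lat_i, lon_i = math.floor(lat), math.floor(lon)
    lat_f = round((lat - lat_i) * 10**5)      # 0..99999, fits in 17 bits
    lon_f = round((lon - lon_i) * 10**5)
    word = lat_i + 90                         # 0..180 -> 8 bits
    word = (word << 9) | (lon_i + 180)        # 0..360 -> 9 bits
    word = (word << 17) | lat_f
    word = (word << 17) | lon_f
    return word                               # 8 + 9 + 17 + 17 = 51 bits
A bit writer would then concatenate these 51-bit words; decoding shifts and masks them back out.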
Group similar parts of the path into 4 groups, for example:
[[0.12345,34.56789],[0.01234,34.56754],[-0.00012,34.56784]...]
=> [0,0,-0],[34,34,34],[12345,1234,12],[56789,56754,56784]
Use delta encoding on each individual group and then apply Huffman coding to get further compression on the total data (see the sketch below).
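The grouping and delta steps might look like this in Python (only an illustration; records is assumed to hold the already-split (lat_int, lon_int, lat_frac, lon_frac) tuples from a splitter like the one sketched above):
def group_and_delta(records):
    """records: list of (lat_int, lon_int, lat_frac, lon_frac) tuples.
    Returns four delta-encoded streams, ready for a Huffman or zlib coder."""
    streams = [list(s) for s in zip(*records)]          # 4 parallel groups
    def delta(s):
        return [s[0]] + [b - a for a, b in zip(s, s[1:])]
    return [delta(s) for s in streams]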

Related

Impact of data intervals in fast Fourier transform

I have sampled sensor data for 1 minute with 5kHz sampling.
So, one sampled data file includes 5,000 x 60 = 300,000 data points.
Note that the sensor measures periodic data such as 60Hz AC current.
Now, I would like to apply FFT (using python numpy.rfft function) to the one data file.
As far as I know, the number of FFT results is half the number of input data points, i.e., 150,000 FFT results in the case of 300,000 data points.
However, the number of FFT results is too large to analyze them.
So, I would like to reduce the number of FFT results.
Regarding that, my question is: is the following method valid for one sampled data file?
Segment the one sampled data file into M segments
Apply FFT to each segment
Average the M FFT results to get one averaged FFT result
Use the averaged FFT result as the FFT result of the given sampled data file
Thank you in advance.
It depends on your purposes.
If the source signal is sampled at 5 kHz, then the frequency of the highest output element corresponds to 2.5 kHz. So for an output length of 150K the frequency resolution will be about 0.017 Hz. If you apply the transform to 3,000 data points, you'll get a frequency resolution of about 1.7 Hz.
Is this important for you? Do you need to register all possible frequency components of AC current?
AC quality (magnitude, frequency, noise) might vary during one-minute interval. Do you need to register such instability?
Perhaps high frequency resolution and short-range temporal stability are not necessary for AC control; in this case your approach is quite reasonable.
Edit: A longer interval also diminishes the finite-duration windowing effect that produces false peaks.
P.S. Note that fast Fourier transform implementations usually (not always; I don't see such a restriction in the rfft description) work with an interval length of 2^N, so here the output might contain 256K elements.
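For reference, a small numpy sketch of the segment-and-average method the question describes (a Welch-style average; averaging the magnitude spectra rather than the complex values, and the segment count, are my assumptions):
import numpy as np

def averaged_spectrum(x, fs=5000, n_segments=100):
    """Split x into n_segments equal parts, rfft each, average the magnitudes."""
    seg_len = len(x) // n_segments                 # e.g. 300000 // 100 = 3000 points
    segs = np.reshape(x[:seg_len * n_segments], (n_segments, seg_len))
    spectra = np.abs(np.fft.rfft(segs, axis=1))    # one magnitude spectrum per segment
    freqs = np.fft.rfftfreq(seg_len, d=1.0 / fs)   # resolution = fs / seg_len, about 1.67 Hz here
    return freqs, spectra.mean(axis=0)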

quantize/arrange/sequence numbers into a specific format

I'm trying to arrange multiple chunks of PCM audio data into a specific sequence.
The fact that it's audio data is just for context; the problem itself has nothing to do with audio/DSP.
My input is a varying set of files with varying lengths, and I'm trying to arrange the data sequentially into a new file, adding padding after each segment where needed, so that each input element is aligned to a grid that divides the total length into 120 units. In other words, I need to be able to address the beginning of each segment by choosing an offset between 0-119.
To illustrate the problem, here is a trivial example. Two input files have the following byte lengths:
200
+ 400
---
= 600
In this case, no padding is needed.
The files can be arranged back to back, as they fit into the 120-grid as is. In the grid, the 200-byte file covers the range 0-40 (40 units) and the 400-byte file covers 40-120 (80 units).
This becomes trickier if the files do not fit into the grid.
199
+ 398
---
= 597
Intuitively, it's easy to see that the 199-byte file needs 1 byte of padding at the end so that its length becomes 200, and the 398-byte file needs 2 bytes to become 400 bytes. We then have a nice 1:2 ratio between the two files, which in the 120-grid translates to 40 and 80 units.
Now, I'm trying to find an algorithm which can do this for any number of input files from 1-120, where each file can have an arbitrary non-zero length.
Maybe there is an existing algorithm which does just that, but I'm finding it difficult to find descriptive keywords for the problem.
I've tried to solve this naively, but somehow I fail to grok the problem fully. Basically I need to grow the individual files so that their sizes are multiples of a common grid unit derived from the sum of their lengths - which to me is kind of a chicken/egg problem: if I grow the files so their ratios fit together, I also grow the sum of their lengths, and I don't understand how to check both against the 120-grid...
Edit: OK, I think I got it:
https://gist.github.com/jpenca/b033122fcb2300c5e9e4
Not sure how to prove correctness, but trying this with varying inputs seems to work OK.
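For comparison (I can't see the gist, so this may differ from it), one way to phrase the search in Python: start from the smallest possible grid unit, sum(lengths)/120 rounded up, and grow it until every file, rounded up to a multiple of the unit, fits within 120 units; any units left over become trailing padding at the end of the output file.
import math

def align_to_grid(lengths, units=120):
    """Return (unit_size, padded_lengths): every padded length is a multiple of
    unit_size and the padded lengths occupy at most `units` grid units in total."""
    assert 0 < len(lengths) <= units
    u = max(1, math.ceil(sum(lengths) / units))     # smallest unit that could possibly work
    while sum(math.ceil(l / u) for l in lengths) > units:
        u += 1                                      # grow the unit until everything fits
    return u, [math.ceil(l / u) * u for l in lengths]

print(align_to_grid([200, 400]))   # (5, [200, 400])
print(align_to_grid([199, 398]))   # (5, [200, 400])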

Compare 2 One Bit Images for Similarity

I am trying to compare a source image against thousands of images in a set to get a similarity score of the most likely matches (0 - 1). Each image is small (64x64 or smaller). Each image is 1 bit, meaning that each pixel is either off (fully transparent) or on (fully white). I am trying to create a very fast similarity algorithm to compare these images. I have found many similarity algorithms via Google search but they all involve comparing large, full color images, which I don't need to do.
I realize I can just compare pixels that match / don't match, but this can be potentially slow, given that the compare set can be very large. The compare set images will all be the same exact size as the lookup image.
Is it possible to create hashes or other fast lookups for these kinds of images where a hash or binary search lookup could be performed and similarity score created with the most likely matches?
To get a comparison score for binary images, I'd suggest you calculate the Hamming distance with xor operations and then count the number of ones. This can be sped up a lot using the fast popcount operation of SSSE3 instructions.
The Hamming distance tells you the number of bits that are different between two binary strings (so it's actually a dissimilarity value). To get a score in the range, say, [0, 1], you can divide by the size of the images (this way you get a score invariant to the image size).
With regard to the comparison with thousands of images, make sure it's really a bottleneck, because if the data set is not that large, it might be faster than you think. If you still need to make it faster, you can consider either or both of these ideas:
1) Parallelization: the function is probably very easy to parallelize with OpenMP or tbb, for example.
2) A hash table: use the first (or some subset) bits of each image to index them in a vector. Then, compare those images that belong to the same hash bin only. Of course, this is an approximate approach and you will not get a comparison score for any pair of images, only for those that are similar enough.
Keep in mind that if you want a comparison score against all the images, you have to run the full comparison against your whole database, so there is little you can do beyond parallelization to speed it up.
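A small numpy sketch of the XOR-and-popcount score described above (np.unpackbits stands in for the SSSE3 popcount; the packed-uint8 layout and the function name are my assumptions):
import numpy as np

def similarity(a_bits, b_bits, n_pixels):
    """a_bits, b_bits: the two images packed into uint8 arrays, 1 bit per pixel.
    Returns a score in [0, 1]: 1.0 = identical, 0.0 = every pixel differs."""
    diff = np.bitwise_xor(a_bits, b_bits)
    hamming = int(np.unpackbits(diff).sum())        # popcount of the XOR
    return 1.0 - hamming / n_pixels

# e.g. two 64x64 one-bit images = 512 bytes each
img1 = np.random.randint(0, 256, 512, dtype=np.uint8)
img2 = img1.copy()
img2[0] ^= 0b1                                      # flip a single pixel
print(similarity(img1, img2, 64 * 64))              # 0.999755859375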
One way to do this would be a binary tree. Each image's pixels could be converted to a string of 1's and 0's, and that string could then be used to construct a binary tree.
While checking a new string, you just follow where the path takes you: if you reach a leaf node, then it was present; if you don't, then it's new.
For illustration, consider a tree constructed using 3 strings of length 4:
1010
0110
0001
So, if 0001 comes again, just follow the path; if you end up at a leaf, then the string (image) is a duplicate and has occurred before. If not, then you can add it, knowing it is new and unique.
It will take O(n) time for each comparison and addition, where n is the length of the string. In your case n == 32*32.
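A minimal Python sketch of that bit-trie using nested dicts (the function name is mine); as described, it flags exact duplicates:
def add_or_check(trie, bits):
    """Insert the bit string into the nested-dict trie.
    Returns True if it was already present (duplicate image), else False."""
    node, is_new = trie, False
    for b in bits:
        if b not in node:
            node[b] = {}
            is_new = True
        node = node[b]
    return not is_new

trie = {}
for s in ["1010", "0110", "0001"]:
    add_or_check(trie, s)
print(add_or_check(trie, "0001"))   # True: duplicate
print(add_or_check(trie, "1111"))   # False: new, and now stored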
You could implement a quadtree structure https://en.wikipedia.org/wiki/Quadtree
Segment your images recursively. At each level, store the number of 1 and/or 0 pixels (one can be computed from the other)
E.g., for this image:
0 1 1 0
0 1 0 1
0 0 0 0
0 0 1 0
You compute the following tree :
(5)
(2) - (2) - (0) - (1)
(0) - (1) - (0) - (1) - - - (1) - (0) - (0) - (1) - - - (0) - (0) - (0) - (0) - - - (0) - (0) - (1) - (0)
The higher levels of the tree are coarser versions of the image :
First level :
5/16
Second level :
2/4 2/4
0/4 1/4
Then, your similarity score could be computed from how much the numbers of 0s and 1s differ at the various levels of recursion, with a weight for each level. And you can get an approximation of it (to quickly dismiss very different images) by not going down the whole tree.
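A possible numpy sketch of those per-level counts and a weighted score (the block-sum trick, the depth and the example weights are my own choices):
import numpy as np

def level_counts(img, depth):
    """Number of 1-pixels in each block of the 2^d x 2^d grid, for levels d = 0..depth.
    img: square 2-D 0/1 array whose side is a power of two."""
    n = img.shape[0]
    grids = []
    for d in range(depth + 1):
        side = 2 ** d                 # blocks per side at this level
        b = n // side                 # pixels per block side
        grids.append(img.reshape(side, b, side, b).sum(axis=(1, 3)))
    return grids

def coarse_score(img1, img2, depth=2, weights=(4, 2, 1)):
    """Weighted sum of per-level count differences; lower means more similar."""
    return sum(w * np.abs(g1 - g2).sum()
               for w, g1, g2 in zip(weights, level_counts(img1, depth),
                                    level_counts(img2, depth)))

img = np.array([[0,1,1,0], [0,1,0,1], [0,0,0,0], [0,0,1,0]])
print([g.tolist() for g in level_counts(img, 1)])   # [[[5]], [[2, 2], [0, 1]]]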
If you find that comparing all images completely (using e.g. ChronoTrigger's answer) still takes too much time, consider these two strategies to reduce the number of necessary comparisons.
I will assume that the images are compared line-by-line. You start by comparing the first image completely and store its score as the maximum, then move on to the next, each time updating the maximum as necessary. While comparing each image line-by-line, you do the following after each line (or after every n lines):
Check if the number of mismatched bits so far exceeds the number of mismatches in the image with the maximum score so far. If it does, this image can never reach the maximum score, and it can be discarded.
If the average score per line so far is lower than the average score per line of the image with the maximum score, leave the comparison to be continued during the next run, and skip to the next image.
Repeat this until all images have been completely checked, or discarded.
Trying this strategy on 100 random 32x32-pixel images based on an example image, each with a random number of bits altered, gave this promising result:
FIRST RUN (100 images):
images checked completely: 5 (maximum score: 1015, image #52)
postponed after 1 line: 59
discarded after 1 line: 35
discarded after 10 lines: 1
SECOND RUN (59 images):
discarded without additional checks: 31 (because of new maximum score)
discarded after 1 additional line: 12
discarded after 2 additional lines: 9
discarded after 3 additional lines: 1
discarded after 4 additional lines: 3
discarded after 5 additional lines: 1
discarded after 6 additional lines: 2
Total number of lines compared: 326 out of 3200 lines (~ 10.1875 out of 100 images)
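A simplified Python sketch of the first pruning rule only (discard a candidate as soon as its mismatch count can no longer beat the best so far; the postponement pass from the second rule is left out, and the rows-of-32-bit-ints layout is an assumption):
def best_match(target_rows, candidates, bits_per_row=32):
    """target_rows / each candidate: list of 32-bit ints, one per image row.
    Returns (best_index, best_score), where score = number of matching bits."""
    best_idx = None
    best_mismatch = len(target_rows) * bits_per_row + 1
    for idx, rows in enumerate(candidates):
        mismatch = 0
        for r_t, r_c in zip(target_rows, rows):
            mismatch += bin(r_t ^ r_c).count("1")
            if mismatch >= best_mismatch:       # can no longer beat the best: discard
                break
        else:                                   # compared completely: new best
            best_idx, best_mismatch = idx, mismatch
    return best_idx, len(target_rows) * bits_per_row - best_mismatch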
If your image stores pixel data in a bitmap-like format, then every line is just a 32-bit integer value, and you can simply compare image lines:
for iy = 0 to ImageHeight - 1 do
    if CastToInt32(Image1.Scanline[iy]) <> CastToInt32(Image2.Scanline[iy]) then
        break due to inequality
//32 comparisons or less
For the case of approximate similarity you can calculate the overall number of discrepancies by counting the set bits in the XOR-ed values of all lines:
NumberOf1Bits(Value1 xor Value2)
P.S. A straightforward implementation in Delphi takes 300 nanoseconds per image/image comparison (0.3 s for 1 million images). Single thread, i5 processor, mismatch limit 450.
The time will be significantly less for a low mismatch limit (47 ns for a limit of 45).
The main time eater is the NumberOf1Bits/popcount function.
I made an image hashing class for the Skia4Delphi library unit tests. It generates a hash that makes it possible to compare the similarity percentage between 2 images using only the hash. The focus was on accuracy and not performance, but the performance is not bad. To use it, you must have Skia4Delphi installed. Check the source: https://github.com/skia4delphi/skia4delphi/blob/main/Tests/Source/Skia.Tests.Foundation.ImageHash.pas

Efficiently Store List of Numbers in Binary Format

I'm writing a compression algorithm (mostly for fun) in C, and I need to be able to store a list of numbers in binary. Each element of this list will be in the form of two digits, both under 10 (like (5,5), (3,6), (9,2)). I'll potentially be storing thousands of these pairs (one pair is made for each character in a string in my compression algorithm).
Obviously the simplest way to do this would be to concatenate each pair (-> 55, 36, 92) to make a 2-digit number (since they're just one digit each), then store each pair as a 7-bit number (since 99 is the highest). Unfortunately, this isn't so space-efficient (7 bits per pair).
Then I thought perhaps if I concatenate each pair, then concatenate that (553692), I'd be able to then store that as a plain number in binary form (10000111001011011100, which for three pairs is already smaller than storing each number separately), and keep a quantifier for the number of bits used for the binary number. The only problem is, this approach requires a bigint library and could be potentially slow because of that. As the number gets bigger and bigger (+2 digits per character in the string) the memory usage and slowdown would get bigger and bigger as well.
So here's my question: Is there a better storage-efficient way to store a list of numbers like I'm doing, or should I just go with the bignum or 7-bit approach?
The information-theoretic minimum for storing 100 different values is log2(100), which is about 6.644. In other words, the possible compression from 7 bits is a hair more than 5%. (log2(100) / 7 is 94.91%.)
If these pairs are simply for temporary storage during the algorithm, then it's almost certainly not worth going to a lot of effort to save 5% of storage, even if you managed to do that.
If the pairs form part of your compressed output, then your compression cannot be great (a character is only eight bits, and presumably the pairs are additional to any compressed character data). Nonetheless, the easy compression technique is to store up to 6 pairs in 40 bits (5 bytes), which can be done without a bigint package assuming a 64-bit machine. (Alternatively, store up to 3 pairs in 20 bits and then pack two 20-bit sequences into five bytes.) That gives you 99.66% of the maximum compression for the values.
All of the above assumes that the 100 possible values are equally distributed. If the distribution is not even and it is possible to predict the frequencies, then you can use Huffman encoding to improve compression. Even so, I wouldn't recommend it for temporary storage.
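To make the 6-pairs-in-40-bits idea concrete, here is a sketch (in Python rather than C, purely for illustration; the helper names are mine):
def pack6(pairs):
    """pairs: up to 6 (a, b) digit pairs, 0 <= a, b <= 9.
    Packed base-100 into 40 bits, since 100**6 < 2**40."""
    value = 0
    for a, b in pairs:
        value = value * 100 + (a * 10 + b)
    return value.to_bytes(5, "big")             # 5 bytes = 40 bits

def unpack6(data, count=6):
    value = int.from_bytes(data, "big")
    out = []
    for _ in range(count):
        value, pair = divmod(value, 100)
        out.append(divmod(pair, 10))            # back to (a, b)
    return list(reversed(out))

pairs = [(5, 5), (3, 6), (9, 2), (0, 1), (7, 7), (4, 9)]
assert unpack6(pack6(pairs)) == pairs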

Determine offset where the most constructive interference occurs

I have two arrays of data:
I would like to align these similar graphs together (by adding an offset to either array):
Essentially what I want is the most constructive interference, as shown when two waves together produce the same wave but with larger amplitude:
This is also the same as finding the most destructive interference, but one of the arrays must be inverted as shown:
Notice that the second wave is inverted (peaks become troughs / vice-versa).
The actual data will not only consist of one major and one minor peak and trough, but of many, and there might not be any noticeable spikes. I have made the data in the diagram simpler to show how I would like the data aligned.
I was thinking about a few loops, such as:
biggest = 0
loop from -10 to 10 as offset
    count = 0
    loop through array1 as ar1
        loop through array2 as ar2
            count += array1[ar1] + array2[ar2 - offset]
    replace biggest with count if count/sizeof(array1) > biggest
However, that requires looping through the offsets and looping through both arrays. My real arrays are extremely large and this would take too long.
How would I go about determining the offset required to match data1 with data2?
JSFiddle (note that this is language agnostic and I would like to understand the algorithm more-so than the actual code)
Look at convolution and cross-correlation and their computation using the fast Fourier transform. That's the way it is done in real-life applications.
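A minimal numpy sketch of the FFT-based cross-correlation peak search (the zero-padding length and the wrap-around handling of negative lags are my own choices):
import numpy as np

def best_offset(a, b):
    """Offset (in samples) by which b should be shifted to line up with a,
    found via FFT-based cross-correlation in O(n log n) instead of nested loops."""
    n = len(a) + len(b) - 1                     # zero-pad so the correlation is linear, not circular
    corr = np.fft.irfft(np.fft.rfft(a, n) * np.conj(np.fft.rfft(b, n)), n)
    shift = int(np.argmax(corr))
    return shift - n if shift > n // 2 else shift   # map the upper half to negative lags
For small inputs you can sanity-check the peak against numpy.correlate(a, b, 'full').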
If (and only if) your data has very recognizable spikes, you could do what a human being would do: match the spikes: Fiddle
The important part is the function matchData().
An improved version would search for N max and min spikes, then calculate an average offset.
