quantize/arrange/sequence numbers into a specific format - algorithm

I'm trying to arrange multiple chunks of PCM audio data into a specific sequence.
The fact that it's audio data is just context; the problem itself has nothing to do with audio/DSP.
My input is a varying set of files with varying lengths. I'm trying to arrange the data sequentially into a new file, adding padding after each segment where needed, so that every segment is aligned to a grid that divides the output into 120 equal units. In other words, I need to be able to address the beginning of each segment by choosing an offset between 0-119.
To illustrate the problem, here is a trivial example. Two input files have the following byte lengths:
200
+ 400
---
= 600
In this case, no padding is needed.
The files can be arranged back to back, as they fit the 120-grid as is: each grid unit is 5 bytes (600/120), so the 200-byte file spans units 0-40 (40 units) and the 400-byte file spans units 40-120 (80 units).
this becomes trickier if any of the files do not fit into the grid.
199
+ 398
---
= 597
Intuitively, it's easy to see that the 199-byte file needs 1 byte of padding at the end so that its length becomes 200, and the 398-byte file needs 2 bytes to become 400. We then have a nice 1:2 ratio between the two files, which on the 120-grid translates to 40 and 80 units.
Now I'm trying to find an algorithm which can do this for any number of input files from 1-120, where each file can have arbitrary non-zero length.
Maybe there is an existing algorithm which does just that, but I'm finding it difficult to come up with descriptive keywords for the problem.
I've tried to solve this naively, but somehow I fail to grok the problem fully. Basically, I need to grow the individual files so that their sizes become multiples of 1/120th of the sum of their lengths - which to me is kind of a chicken/egg problem: if I grow the files so their ratios fit together, I also grow the sum of their lengths, and I don't understand how to check both against the 120-grid...
Edit: OK, I think I got it:
https://gist.github.com/jpenca/b033122fcb2300c5e9e4
Not sure how to prove correctness, but trying this with varying inputs seems to work OK.
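For the record, the core idea can be stated as: pick the smallest grid unit u such that every file, rounded up to a whole number of units, fits into at most 120 units in total. A minimal Python sketch of that idea (not necessarily what the gist does):

import math

def arrange(lengths, grid=120):
    # smallest unit that could possibly work; a binary search would also do,
    # since the total unit count only shrinks as u grows
    u = max(1, math.ceil(sum(lengths) / grid))
    while sum(math.ceil(l / u) for l in lengths) > grid:
        u += 1
    padded = [math.ceil(l / u) * u for l in lengths]
    return u, padded   # pad the output file to grid * u bytes in total

print(arrange([199, 398]))   # (5, [200, 400]) -> 40 and 80 grid units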

Related

How do I efficiently find the fastest segments from a sequence of distances and times?

My input is a gpx-file containing a sequence of timestamped positions, like the one you'd get if you go for a run with a GPS and tell it to record your track.
The timestamped positions are not necessarily equal in distance from each other, or equal in time-delta between each other.
Given this input, I want to efficiently find the highest speed the gpx-file indicates for all different distances.
Example:
12:00:00 start
12:00:05 moved 100m
12:00:15 moved 100m
12:00:35 moved 200m
In this example the correct answer is:
20.0 m/s at 100m
13.3 m/s at 200m
11.4 m/s at 400m
What is a good algorithm for (preferably reasonably efficiently) calculating this?
Clarification: I'm not looking solely for the fastest segment, that's trivial. I'm looking for the fastest speed represented by the track for ALL distances up to the length of the track in sum total.
If someone uploaded a gpx-track of a marathon they ran, I'd want to know the fastest 100m they ran in that marathon, the fastest 200m, the fastest 300m and so on.
Let's say you have a gpx track for a 1,500 meter run, and you want to do this. So you want the fastest 100, 200, 300, 400, ... 1,500 meters. There are:
15 100-meter segments
14 200-meter segments
13 300-meter segments
12 400-meter segments
...
2 1,400-meter segments
1 1,500-meter segment
That works out to 15+14+13+12+...+2+1 = (15^2+15)/2, or 120 different segments to examine if you want to calculate the 15 different distances.
You can do this in a single pass of the array. Simply initialize an array that contains the current running total and maximum speed for each of the distances you're interested in. As you read each segment, you subtract the value for the oldest segment, add the new segment value, recompute the average speed, and update the max speed if appropriate.
The algorithm will require you to look at (n^2-n)/2 individual splits. Regardless of how you do it, you have to look at every possible split for every distance you want to compute. You have n data points and you're trying to determine n different best split times. That's O(n^2) any way you slice it.
But the amount of data you're talking about isn't huge, certainly not by today's standards. A marathon is only 42,195 meters. You'll need an array of 422 distances if you want 100-meter resolution. And your code will do on the order of 422^2 = 178,084 calculations. That's quite doable with even a low-end computer these days.
As for the data, I would recommend pre-processing the .gpx file to generate a stream of data points that are exactly 100 meters apart. You can do that separately, or you can do it as part of reading the data in while you're computing the splits. It's not difficult, and it'll make the rest of your code much easier to work with.
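To make the splits computation concrete, here is a rough Python sketch that assumes the resampling has already been done, i.e. times[i] is the cumulative time in seconds at the i-th 100-meter mark (times[0] == 0). It is the O(n^2) scan described above:

def fastest_splits(times, step=100.0):
    n = len(times) - 1                   # number of 100 m segments
    best = {}
    for k in range(1, n + 1):            # distance = k * step
        min_time = min(times[i + k] - times[i] for i in range(n - k + 1))
        best[k * step] = k * step / min_time   # fastest speed in m/s
    return best

# The question's example, with the 200 m leg interpolated into two 100 m points:
# fastest_splits([0, 5, 15, 25, 35]) -> {100: 20.0, 200: 13.33, 300: 12.0, 400: 11.43}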

Compare 2 One Bit Images for Similarity

I am trying to compare a source image against thousands of images in a set to get a similarity score of the most likely matches (0 - 1). Each image is small (64x64 or smaller). Each image is 1 bit, meaning that each pixel is either off (fully transparent) or on (fully white). I am trying to create a very fast similarity algorithm to compare these images. I have found many similarity algorithms via Google search but they all involve comparing large, full color images, which I don't need to do.
I realize I can just compare pixels that match / don't match, but this can be potentially slow, given that the compare set can be very large. The compare set images will all be the same exact size as the lookup image.
Is it possible to create hashes or other fast lookups for these kinds of images where a hash or binary search lookup could be performed and similarity score created with the most likely matches?
To get a comparison score for binary images, I'd suggest you calculate the Hamming distance with xor operations and then count the number of ones. This can be sped up a lot using the fast popcount operation of SSSE3 instructions.
The Hamming distance tells you the number of bits that are different between two binary strings (so it's actually a dissimilarity value). To get a score in the range, say, [0, 1], you can divide by the size of the images (this way you get a score invariant to the image size).
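As a minimal illustration of that scoring (leaving out the SIMD popcount, which is where the real speed comes from), assuming a and b are the packed bitmaps of two equally sized images:

def similarity(a: bytes, b: bytes) -> float:
    # e.g. a 64x64 1-bit image packs into 64*64/8 = 512 bytes
    total_bits = len(a) * 8
    diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return 1.0 - bin(diff).count("1") / total_bits   # 1.0 = identical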
With regard to the comparison with thousands of images, make sure it's a bottleneck, because if the data are not that large, it might be faster than you think. If you still need to make it faster, you can consider any or both these ideas:
1) Parallelization: the function is probably very easy to parallelize with OpenMP or tbb, for example.
2) A hash table: use the first (or some subset) bits of each image to index them in a vector. Then, compare those images that belong to the same hash bin only. Of course, this is an approximate approach and you will not get a comparison score for any pair of images, only for those that are similar enough.
Keep in mind that if you want to compare against all the images, you have to run the full comparison against your whole database, so apart from parallelization there is little room to speed it up.
One way to do this would be a binary tree (a trie over the pixels). Each image's pixels can be converted to a string of 1's and 0's, and that string can then be used to construct the tree.
While checking a new string, you just follow where the path takes you: if you reach an existing leaf node, the image was present before; if you don't, it's new.
For example, consider a tree constructed from these 3 strings of length 4:
1010
0110
0001
So, if 0001 comes again, just follow the path; if you end up at an existing leaf, the string (image) is a duplicate and has occurred before. If not, you can add it, knowing that it is new and unique.
It will take O(n) time for each comparison and addition, where n is the length of the string. In your case n == 32*32.
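A minimal Python sketch of that bit-trie, using nested dicts (note it detects exact duplicates only; it does not give a similarity score):

def add_image(root, bits):
    # bits is the image flattened to a fixed-length string like "1010"
    node = root
    for b in bits:
        node = node.setdefault(b, {})
    is_new = not node.get("leaf", False)
    node["leaf"] = True
    return is_new            # True if this exact image was not seen before

trie = {}
print(add_image(trie, "1010"), add_image(trie, "0110"),
      add_image(trie, "0001"), add_image(trie, "0001"))   # True True True False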
You could implement a quadtree structure https://en.wikipedia.org/wiki/Quadtree
Segment your images recursively. At each level, store the number of 1 and/or 0 pixels (one can be computed from the other)
Ex : for this image :
0 1 1 0
0 1 0 1
0 0 0 0
0 0 1 0
You compute the following tree :
(5)
(2) - (2) - (0) - (1)
(0) - (1) - (0) - (1) - - - (1) - (0) - (0) - (1) - - - (0) - (0) - (0) - (0) - - - (0) - (0) - (1) - (0)
The higher levels of the tree are coarser versions of the image :
First level :
5/16
Second level :
2/4 2/4
0/4 1/4
Then your similarity score could be computed from how much the counts of 0s and 1s differ, at the different levels of recursion, with a weight per level. And you can get an approximation of it (to quickly dismiss very different images) by not going down the whole tree.
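Here is a rough Python sketch of that count pyramid, under the assumption that the images are square lists of 0/1 rows with a power-of-two side:

def build(img, x=0, y=0, size=None):
    if size is None:
        size = len(img)
    if size == 1:
        return (img[y][x], None)                 # (count of 1s, children)
    h = size // 2
    kids = [build(img, x + dx, y + dy, h) for dy in (0, h) for dx in (0, h)]
    return (sum(k[0] for k in kids), kids)

def distance(a, b, weight=1.0, decay=0.5):
    # weighted difference of 1-pixel counts, refined level by level;
    # stop early (ignore children) for a coarse, fast estimate
    d = weight * abs(a[0] - b[0])
    if a[1] is not None and b[1] is not None:
        d += sum(distance(ka, kb, weight * decay) for ka, kb in zip(a[1], b[1]))
    return d

tree = build([[0,1,1,0], [0,1,0,1], [0,0,0,0], [0,0,1,0]])
print(tree[0], [k[0] for k in tree[1]])   # 5 [2, 2, 0, 1], as in the tree above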
If you find that comparing all images completely (using e.g. ChronoTrigger's answer) still takes too much time, consider these two strategies to reduce the number of necessary comparisons.
I will assume that the images are compared line-by-line. You start by comparing the first image completely, store its score as the maximum, then move on to the next, each time updating the maximum as necessary. While comparing each image line-by-line, you do the following after each line (or after each n lines):
Check if the number of mismatched bits so far exceeds the number of mismatches in the image with the maximum score so far. If it does, this image can never reach the maximum score, and it can be discarded.
If the average score per line so far is lower than the average score per line of the image with the maximum score, leave the comparison to be continued during the next run, and skip to the next image.
Repeat this until all images have been completely checked, or discarded.
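A simplified Python sketch of the early-exit part of this (it leaves out the postpone-and-revisit refinement, and it counts mismatches, so lower is better), with image rows packed as integers:

def best_match(target_rows, candidates):
    best_img, best_mismatches = None, float("inf")
    for img_rows in candidates:
        mismatches = 0
        for r_t, r_c in zip(target_rows, img_rows):
            mismatches += bin(r_t ^ r_c).count("1")
            if mismatches >= best_mismatches:
                break                      # can no longer beat the leader
        else:
            best_img, best_mismatches = img_rows, mismatches
    return best_img, best_mismatches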
Trying this strategy on 100 random 32x32-pixel images based on an example image, each with a random number of bits altered, gave this promising result:
FIRST RUN (100 images):
images checked completely: 5 (maximum score: 1015, image #52)
postponed after 1 line: 59
discarded after 1 line: 35
discarded after 10 lines: 1
SECOND RUN (59 images):
discarded without additional checks: 31 (because of new maximum score)
discarded after 1 additional line: 12
discarded after 2 additional lines: 9
discarded after 3 additional lines: 1
discarded after 4 additional lines: 3
discarded after 5 additional lines: 1
discarded after 6 additional lines: 2
Total number of lines compared: 326 out of 3200 lines (~ 10.1875 out of 100 images)
If your image stores pixel data in a bitmap-like format, then every line is just a 32-bit integer value, and you can simply compare image lines:
for iy := 0 to ImageHeight - 1 do
  if CastToInt32(Image1.Scanline[iy]) <> CastToInt32(Image2.Scanline[iy]) then
    Break; // images differ
// 32 comparisons or less
For the case of approximate similarity you can calculate the overall number of discrepancies counting set bits in xor-ed values for all lines.
NumberOf1Bits(Value1 xor Value2)
P.S. A straightforward implementation in Delphi takes 300 nanoseconds per image/image comparison (0.3 s for 1 million images). Single thread, i5 processor, mismatch limit 450.
The time is significantly lower for a low mismatch limit (47 ns for a limit of 45).
The main time eater is the NumberOf1Bits/popcount function.
I made an image hashing class for the Skia4Delphi library unit tests. It generates a hash that makes it possible to compare the similarity percentage between 2 images using only the hash. The focus was on accuracy and not performance, but the performance is not bad. To use it, you must have Skia4Delphi installed. Check the source: https://github.com/skia4delphi/skia4delphi/blob/main/Tests/Source/Skia.Tests.Foundation.ImageHash.pas

How to detect the small amount change in a big file(TB)

I just found an interesting blog talking about some interview questions. One of the question is:
Given a very large file (multiple TB), detect which 4 MB ranges have changed in the file between consecutive runs of your program.
I don't have any clues on this. Can anyone give some ideas on this?
If you have any control over the creation of the data, you can use Merkle trees.
Split the data into small fragments (let's say 10 MB each, but the exact size isn't the issue), and for each fragment compute h = hash(fragment).
Now all these hashes will be the leaves of the tree. Create a full binary tree from the leaves up: h(parent) = hash(parent.left XOR parent.right).
Now you've got yourself a tree - and if you compare two trees, h(root1) = h(root2) if and only if tree1 = tree2, with high probability (if you use a 128-bit hash, the probability of a mistake is 1/2^128, which is negligible).
The same claim is correct for any subtrees of course, and this allows you to quickly find the leaf which is different, and this leaf represents the fragment that got changed.
This idea is used by Amazon's Dynamo to compare if two data bases got changed, and quickly finding the change.
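For illustration, a rough Python sketch of the fragment-level diff (the 4 MB chunk size, SHA-256, and combining children by concatenation instead of XOR are choices made here; it also assumes both versions have the same number of fragments):

import hashlib

CHUNK = 4 * 1024 * 1024   # 4 MB fragments

def leaf_hashes(path):
    hashes = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            hashes.append(hashlib.sha256(chunk).digest())
    return hashes

def build_tree(level):
    tree = [level]
    while len(level) > 1:
        level = [hashlib.sha256(b"".join(level[i:i + 2])).digest()
                 for i in range(0, len(level), 2)]
        tree.append(level)
    return tree               # tree[0] = leaves, tree[-1][0] = root

def changed_fragments(tree_a, tree_b):
    # walk down from the root, descending only where the hashes differ
    suspects = {0}
    for depth in range(len(tree_a) - 1, 0, -1):
        kids = set()
        for i in suspects:
            if tree_a[depth][i] != tree_b[depth][i]:
                kids.update({2 * i, 2 * i + 1})
        suspects = {i for i in kids if i < len(tree_a[depth - 1])}
    return sorted(i for i in suspects if tree_a[0][i] != tree_b[0][i])

# e.g. changed_fragments(build_tree(leaf_hashes("run1.bin")),
#                        build_tree(leaf_hashes("run2.bin")))  # hypothetical paths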
You could compare the files byte by byte and find the differences. It would take a long time, but it's worth a try.
Another solution off the top of my head: split the file into halves (say 500 GB each for a 1 TB file), calculate the MD5 of each half, and compare it with the MD5 of the same half from the previous run. One half will differ; split that into 250 GB pieces, compare the MD5 values again, and keep going until you get down to 4 MB.
It is similar to the coin-weighing puzzle with a limited number of weighings.

Lossless Compression for Coordinate Path Data

I am brainstorming for a project which will store large chunks of coordinate data (latitude, longitude) in a database. Key aspects of this data will be calculated and stored, and then the bulk of the data will be compressed and stored. I am looking for a lossless compression algorithm to reduce the storage space of this data. Is there an (preferably common) algorithm which is good at compressing this type of data?
Known attributes of the data
The coordinate pairs are ordered and that order should be preserved.
All numbers will be limited to 5 decimal places (roughly 1m accuracy).
The coordinate pairs represent a path, and adjacent pairs will likely be relatively close to each other in value.
Example Data
[[0.12345, 34.56789], [0.01234, 34.56754], [-0.00012, 34.56784], …]
Note: I am not so concerned about language at this time, but I will potentially implement this in Javascript and PHP.
Thanks in advance!
To expand on the delta encoding suggested by barak manos, you should start by encoding the coordinates as binary numbers instead of strings. Use four-byte signed integers, each equal to 10^5 times your values.
Then apply delta encoding, where each latitude and longitude respectively are subtracted from the previous one. The first lat/long is left as is.
Now break the data into four planes, one for each of the four-bytes in the 32-bit integers. The higher bytes will be mostly zeros, with all of the entropy in the lower bytes. You can break the data into blocks, so that your planes don't have to span the entire data set.
Then apply zlib or lzma compression.
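A loose Python sketch of that pipeline (fixed-point conversion, per-coordinate deltas, byte-plane split, then zlib; blocking and the inverse transform for decompression are left out):

import struct, zlib

def compress(coords):
    # coords: list of (lat, lon) floats with at most 5 decimal places
    ints = [(round(lat * 10**5), round(lon * 10**5)) for lat, lon in coords]
    deltas = [ints[0]] + [(a[0] - b[0], a[1] - b[1])
                          for a, b in zip(ints[1:], ints[:-1])]
    raw = b"".join(struct.pack("<ii", la, lo) for la, lo in deltas)
    planes = b"".join(raw[i::4] for i in range(4))   # group bytes by significance
    return zlib.compress(planes, 9)                  # or lzma for a better ratio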
I would recommend that you first exploit the fact that adjacent symbols are similar, and convert your data in order to reduce the entropy. Then, apply the compression algorithm of your choice on the output.
Let IN_ARR be the original array and OUT_ARR be the converted array (input for compression):
OUT_ARR[0] = IN_ARR[0]
for i = 1 to N-1
OUT_ARR[i] = IN_ARR[i] - IN_ARR[i-1]
For simplicity, the pseudo-code above is written for 1-dimension coordinates.
But of course, you can easily implement it for 2-dimension coordinates...
And of course, you will have to apply the inverse operation after decompression:
IN_ARR[0] = OUT_ARR[0]
for i = 1 to N-1
IN_ARR[i] = OUT_ARR[i] + IN_ARR[i-1]
Here is a way to structure your data efficiently and get the most out of it:
First, divide your data into two sets, integer parts and decimal parts:
e.g. [1.23467, 2.45678] => [1, 2] and [23467, 45678] => [1], [2], [23467], [45678]
As your data looks fairly random, the first thing you can do for compression is not store it as strings directly, but pack it as follows:
the range of latitude integer parts is -90 to +90, about 180 values, so you need ceil(log2(180)) = 8 bits for the first value
the range of longitude integer parts is -180 to +180, about 360 values, hence ceil(log2(360)) = 9 bits
the decimal parts have 5 digits, hence ceil(log2(10^5)) = 17 bits each
With this packing you need 8 + 9 + 17*2 = 51 bits per record, whereas strings would need up to 2 + 3 + 5*2 = 15 bytes per record.
compression ratio = 51/(15*8) ≈ 42% compared with the string data size
compression ratio = 51/(2*32) ≈ 80% compared with the float (2 x 32-bit) data size
Then group similar parts of the path into 4 groups, for example:
[[0.12345,34.56789],[0.01234,34.56754],[-0.00012,34.56784]...]
=> [0,0,-0],[34,34,34],[12345,1234,12],[56789,56754,56784]
Use delta encoding on each individual group and then apply Huffman coding to compress the total data further.
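As an illustration of the bit budget, here is a small Python packer at roughly the same 51 bits per record; instead of splitting integer and decimal digits it packs each coordinate as one offset fixed-point value (25 bits for latitude, 26 for longitude), which sidesteps the sign of values like -0.00012:

def pack(coords):
    out, acc, nbits = bytearray(), 0, 0
    for lat, lon in coords:
        la = round((lat + 90.0) * 10**5)    # 0 .. 18,000,000  -> 25 bits
        lo = round((lon + 180.0) * 10**5)   # 0 .. 36,000,000  -> 26 bits
        for value, width in ((la, 25), (lo, 26)):
            acc = (acc << width) | value
            nbits += width
            while nbits >= 8:
                nbits -= 8
                out.append((acc >> nbits) & 0xFF)
                acc &= (1 << nbits) - 1
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)      # flush the last partial byte
    return bytes(out)

print(len(pack([(0.12345, 34.56789), (0.01234, 34.56754)])))   # 13 bytes for 2 points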

Finding sets that have specific subsets

I am a graduate student of physics and I am working on writing some code to sort several hundred gigabytes of data and return slices of that data when I ask for it. Here is the trick, I know of no good method for sorting and searching data of this kind.
My data essentially consists of a large number of sets of numbers. These sets can contain anywhere from 1 to n numbers within them (though in 99.9% of the sets, n is less than 15) and there are approximately 1.5 ~ 2 billion of these sets (unfortunately this size precludes a brute force search).
I need to be able to specify a set with k elements and have every set with k+1 elements or more that contains the specified subset returned to me.
Simple Example:
Suppose I have the following sets for my data:
(1,2,3)
(1,2,3,4,5)
(4,5,6,7)
(1,3,8,9)
(5,8,11)
If I were to give the request (1,3) I would have the sets: (1,2,3),
(1,2,3,4,5), and (1,3,8,9).
The request (11) would return the set: (5,8,11).
The request (1,2,3) would return the sets: (1,2,3) and (1,2,3,4,5)
The request (50) would return no sets.
By now the pattern should be clear. The major difference between this example and my data is that the sets within my data are larger, the numbers used for each element of the sets run from 0 to 16383 (14 bits), and there are many, many more sets.
If it matters I am writing this program in C++ though I also know java, c, some assembly, some fortran, and some perl.
Does anyone have any clues as to how to pull this off?
edit:
To answer a couple questions and add a few points:
1.) The data does not change. It was all taken in one long set of runs (each broken into 2 gig files).
2.) As for storage space: the raw data takes up approximately 250 gigabytes. I estimate that after processing and stripping off a lot of extraneous metadata that I am not interested in, I could knock that down to anywhere from 36 to 48 gigabytes, depending on how much metadata I decide to keep (without indices). Additionally, if in my initial processing of the data I encounter enough sets that are the same, I might be able to compress the data yet further by adding counters for repeat events rather than simply repeating the events over and over again.
3.) Each number within a processed set actually contains at LEAST two numbers 14 bits for the data itself (detected energy) and 7 bits for metadata (detector number). So I will need at LEAST three bytes per number.
4.) My "though in 99.9% of the sets, n is less than 15" comment was misleading. In a preliminary glance through some of the chunks of the data I find that I have sets that contain as many as 22 numbers but the median is 5 numbers per set and the average is 6 numbers per set.
5.) While I like the idea of building an index of pointers into files, I am a bit leery because for requests involving more than one number I am left with the semi-slow task (at least I think it is slow) of finding the set of all pointers common to the lists, i.e. intersecting the lists for a given request.
6.) In terms of resources available to me, I can muster approximately 300 gigs of space after I have the raw data on the system (The remainder of my quota on that system). The system is a dual processor server with 2 quad core amd opterons and 16 gigabytes of ram.
7.) Yes, 0 can occur; it is an artifact of the data acquisition system when it does, but it can occur.
Your problem is the same as that faced by search engines. "I have a bajillion documents. I need the ones which contain this set of words." You just have (very conveniently) integers instead of words, and smallish documents. The solution is an inverted index. Introduction to Information Retrieval by Manning et al. is available free online, is very readable, and goes into a lot of detail about how to do this.
You're going to have to pay a price in disk space, but it can be parallelized, and should be more than fast enough to meet your timing requirements, once the index is constructed.
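A tiny in-memory Python sketch of the inverted index applied to the example sets from the question (a real index for ~2 billion sets would live on disk and be built in chunks, but the query logic is the same):

from collections import defaultdict

def build_index(sets):
    index = defaultdict(list)               # value -> list of set ids (postings)
    for set_id, s in enumerate(sets):
        for value in s:
            index[value].append(set_id)
    return index

def query(index, sets, wanted):
    postings = sorted((index.get(v, []) for v in wanted), key=len)
    result = set(postings[0])
    for p in postings[1:]:
        result.intersection_update(p)       # intersect, smallest list first
    return [sets[i] for i in sorted(result)]

sets = [(1,2,3), (1,2,3,4,5), (4,5,6,7), (1,3,8,9), (5,8,11)]
idx = build_index(sets)
print(query(idx, sets, (1, 3)))   # [(1, 2, 3), (1, 2, 3, 4, 5), (1, 3, 8, 9)]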
Assuming a random distribution of 0-16383, with a consistent 15 elements per set, and two billion sets, each element would appear in approximately 1.8M sets. Have you considered (and do you have the capacity for) building a 16384x~1.8M (30B entries, 4 bytes each) lookup table? Given such a table, you could query which sets contain (1) and (17) and (5555) and then find the intersections of those three ~1.8M-element lists.
My guess is as follows.
Assume that each set has a name or ID or address (a 4-byte number will do if there are only 2 billion of them).
Now walk through all the sets once, and create the following output files:
A file which contains the IDs of all the sets which contain '1'
A file which contains the IDs of all the sets which contain '2'
A file which contains the IDs of all the sets which contain '3'
... etc ...
If there are 16 entries per set, then on average each of these 2^14 files will contain the IDs of about 2^21 sets; with each ID being 4 bytes, this would require about 2^37 bytes (128 GB) of storage.
You'll do the above once, before you process requests.
When you receive requests, use these files as follows:
Look at a couple of numbers in the request
Open up a couple of the corresponding index files
Get the list of all sets which exist in both these files (there are only a couple of million IDs in each file, so this shouldn't be difficult)
See which of these few sets satisfy the remainder of the request
My guess is that if you do the above, creating the indexes will be (very) slow and handling requests will be (very) quick.
I have recently discovered methods that use space-filling curves to map multi-dimensional data down to a single dimension. One can then index the data based on its 1D index. Range queries can easily be carried out by finding the segments of the curve that intersect the box representing the query, and then retrieving those segments.
I believe that this method is far superior to building the huge indexes suggested, because after looking at it, the index would be as large as the data I wished to store, hardly a good thing. A somewhat more detailed explanation of this can be found at:
http://www.ddj.com/184410998
and
http://www.dcs.bbk.ac.uk/~jkl/publications.html
Make 16383 index files, one for each possible search value. For each value in your input set, write the file position of the start of the set into the corresponding index file. It is important that each of the index files contains the same number for the same set. Now each index file will consist of ascending indexes into the master file.
To search, start reading the index files corresponding to each search value. If you read an index that's lower than the index you read from another file, discard it and read another one. When you get the same index from all of the files, that's a match - obtain the set from the master file, and read a new index from each of the index files. Once you reach the end of any of the index files, you're done.
If your values are evenly distributed, each index file will contain 1/16383 of the input sets. If your average search set consists of 6 values, you will be doing a linear pass over 6/16383 of your original input. It's still an O(n) solution, but your n is a bit smaller now.
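A short Python sketch of that merge-style intersection, assuming each postings list is already sorted ascending (as the index files above would be):

def intersect_sorted(postings):
    iters = [iter(p) for p in postings]
    current = [next(it, None) for it in iters]
    while None not in current:
        hi = max(current)
        if min(current) == hi:
            yield hi                                      # present in every list
            current = [next(it, None) for it in iters]
        else:
            current = [c if c >= hi else next(it, None)   # advance the laggards
                       for c, it in zip(current, iters)]

print(list(intersect_sorted([[1, 3, 5, 9], [2, 3, 5], [3, 5, 8]])))   # [3, 5]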
P.S. Is zero an impossible result value, or do you really have 16384 possibilities?
Just playing devil's advocate for an approach which combines brute force with an index lookup:
Create an index with the min, max and number of elements of each set.
Then apply brute force, excluding sets where max < max(set being searched) or min > min(set being searched).
In the brute force pass, also exclude sets whose element count is less than that of the set being searched.
95% of your searches would then really be brute-forcing a much smaller subset. Just a thought.
