I have an existing "compression" algorithm, which compresses an association into ranges. Something like
type Assoc = [(P,C)]
rangeCompress :: Assoc -> [(P, [(C,C)])]
Read P as "product" and C as "code". The result is a list of products, each associated with a list of code ranges. To find the product associated with a given code, one traverses the compressed data until one finds a range the given code falls into.
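In code, the lookup is roughly this (just a sketch; the concrete types and list shapes are simplified):

-- Sketch: scan the per-product range lists until a range contains the code.
lookupProduct :: Ord c => [(p, [(c, c)])] -> c -> Maybe p
lookupProduct compressed code =
  case [ p | (p, ranges) <- compressed
           , (lo, hi)    <- ranges
           , lo <= code, code <= hi ] of
    (p : _) -> Just p
    []      -> Nothing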
This mechanism works well if consecutive codes are likely to belong to the same product. However, if the codes of different products interleave, they no longer form compact ranges and I end up with lots of ranges whose upper bound equals the lower bound, i.e. practically no compression.
What I am looking for is a pre-compressor, which looks at the original association and determines a "good enough" transformation of codes, such that the association expressed in terms of transformed codes can be range-compressed into compact ranges. Something like
preCompress :: Assoc -> (C->C)
or more fine-grained (by Product)
preCompress :: Assoc -> P -> (C->C)
In that case a product lookup would have to first transform the code in question and then do the lookup as before. Therefore the transformation must be expressible by a handful of parameters, which would have to be attached to the compressed data, either once for the entire association or by product.
I checked some compression algorithms, but they all seem to focus on reconstructing the original data (which is not strictly needed here), while being totally free about the way they store the compressed data. In my case however, the compressed data must be ranges, only enriched by the parameters of the pre-compression.
Is this a known problem?
Does it have a solution?
Where to look next?
Please note:
I am not primarily interested in restoring the original data
I am primarily interested in the product lookup
The number of codes is apx 7,000,000
The number of products is apx 200
Assuming that for each code there is only one product, you can encode the data as a list (string) of products. Let's pretend we have the following list of products and codes randomly generated from 9 products (or no product) and 10 codes.
[(P6,C2),
(P1,C4),
(P2,C10),
(P3,C9),
(P3,C1),
(P4,C7),
(P6,C8),
(P5,C3),
(P1,C5)]
If we sort them by code we have
[(P3,C1),
(P6,C2),
(P5,C3),
(P1,C4),
(P1,C5),
-- C6 didn't have a product
(P4,C7),
(P6,C8),
(P3,C9),
(P2,C10)]
We can convert this into a string of products + nothing (N). The position in the string determines the code.
{- C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 -}
[ P3, P6, P5, P1, P1, N , P4, P6, P3, P2 ]
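Building that string is straightforward if the codes are dense integers 1..maxCode with at most one product per code (a sketch only; with 7,000,000 codes you would index into an array or a Map rather than do a linear lookup):

-- Sketch: lay the association out as a code-indexed list, with Nothing marking
-- codes that have no product.
toProductString :: Int -> [(p, Int)] -> [Maybe p]
toProductString maxCode assoc =
  [ lookup c codeToProduct | c <- [1 .. maxCode] ]
  where codeToProduct = [ (c, p) | (p, c) <- assoc ]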
A string of symbols in some alphabet (in this case products+nothing) puts us squarely in the realm of well-studied string compression problems.
If we run-length encode this list, we get an encoding similar to the one you originally presented. For each run of associations we store a single symbol (a product or nothing) and the run length. We only need a single small integer for the run length instead of two (possibly large) codes for the interval.
{- P3 , P6, P5, P1, P1, N , P4, P6, P3, P2 -}
[ (P3, 1), (P6, 1), (P5, 1), (P1, 2), (N, 1), (P4, 1), (P6, 1), (P3, 1), (P2, 1) ]
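The run-length step itself is a one-liner over group from Data.List (a sketch, assuming the list of products-or-nothing from above):

import Data.List (group)

-- Sketch: collapse runs of equal symbols into (symbol, run length) pairs.
runLengthEncode :: Eq a => [a] -> [(a, Int)]
runLengthEncode = map (\run -> (head run, length run)) . group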
We can serialize this to a string of bytes and use any of the existing compression libraries on bytes to perform the actual compression. Some compression libraries, such as the frequently used zlib, are found in the codec category.
Sorted the other way
We'll take the same data from before and sort it by product instead of by code
[(P1,C4),
(P1,C5),
(P2,C10),
(P3,C1),
(P3,C9),
(P4,C7),
(P5,C3),
(P6,C2),
(P6,C8)]
We'd like to allocate new codes to each product so that the codes for a product are always consecutive. We'll just allocate these codes in order.
-- v-- New code --v    v-- Old code --v
[(P1,C1),             [(C1,C4),
 (P1,C2),              (C2,C5),
 (P2,C3),              (C3,C10),
 (P3,C4),              (C4,C1),
 (P3,C5),              (C5,C9),
 (P4,C6),              (C6,C7),
 (P5,C7),              (C7,C3),
 (P6,C8),              (C8,C2),
 (P6,C9)]              (C9,C8)]
We now have two pieces of data to save. We have a (now) one-to-one mapping between products and code ranges and a new one-to-one mapping between new codes and old codes. If we follow the steps from the previous section, we can convert the mapping between new codes and old codes into a list of old codes. The position in the list determines the new code.
{- C1 C2 C3 C4 C5 C6 C7 C8 C9 -} -- new codes
[ C4, C5, C10, C1, C9, C7, C3, C2, C8 ] -- old codes
We can throw any of our existing compression algorithms at this string. Each symbol occurs at most once in this list, so traditional compression mechanisms will not compress it any further. This isn't significantly different from grouping by the product and storing the list of codes as an array with the product; the new intermediate codes are just (larger) pointers to the start of the array and the length of the array.
[(P1,[C4,C5]),
(P2,[C10]),
(P3,[C1,C9]),
(P4,[C7]),
(P5,[C3]),
(P6,[C2,C8])]
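Here is a minimal sketch of that renumbering, assuming the new codes are simply the positions 1..n in product order (the function name and the (Int, Int) range representation are made up):

import Data.Function (on)
import Data.List (groupBy, sort)

-- Sketch: renumber codes so each product's codes are consecutive. Returns the
-- per-product ranges of new codes plus the table mapping each new code
-- (by position) back to its old code.
reassignCodes :: (Ord p, Ord c) => [(p, c)] -> ([(p, (Int, Int))], [c])
reassignCodes assoc = (ranges, oldCodes)
  where
    grouped  = groupBy ((==) `on` fst) (sort assoc)
    oldCodes = map snd (concat grouped)           -- position in this list = new code
    lens     = map length grouped
    starts   = scanl (+) 1 lens                   -- first new code of each product
    ranges   = zipWith3 (\g s len -> (fst (head g), (s, s + len - 1)))
                        grouped starts lens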
A better representation for the list of old codes is probably the difference between successive old codes. If the codes already tend to come in consecutive ranges, these ranges can be run-length encoded away.
{- C4, C5, C10, C1, C9, C7, C3, C2, C8 -} -- old codes
[ +4, +1, +5, -9, +8, -2, -4, -1, +6 ] -- old code difference
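That differencing (and its inverse) is just a zipWith and a scan (a sketch, assuming numeric codes):

-- Sketch: successive differences of the old-code list; scanl1 (+) undoes it.
toDeltas :: Num c => [c] -> [c]
toDeltas xs = zipWith (-) xs (0 : xs)

fromDeltas :: Num c => [c] -> [c]
fromDeltas = scanl1 (+)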
I might be tempted to store the difference in product and the difference in codes with the products. This should increase the opportunity for common sub-strings for the final compression algorithm to compress away.
[(+1,[+4,+1]),
(+1,[+5]),
(+1,[-9,+8]),
(+1,[-2]),
(+1,[-4]),
(+1,[-1,+6])]
I have three large sets of vectors: A, B1 and B2. These sets are stored in files on disk. For each vector a from A I need to check whether it may be presented as a = b1 + b2, where b1 is from B1 and b2 is from B2. Vectors have 20 components, and all components are non-negative numbers.
How I'm solving this problem now (pseudocode):
foreach a in A
    foreach b1 in B1
        for i = 1 to 20
            bt[i] = a[i] - b1[i]
            if bt[i] < 0 then try next b1
        next i
        foreach b2 in B2
            for i = 1 to 20
                if bt[i] != b2[i] then try next b2
            next i
            num_of_expansions++
        next b2
    next b1
next a
My questions:
1. Any ideas on how to make it faster?
2. How to make it in parallel?
3. Questions 1, 2 for the case when I have B1, B2, ..., Bk, k > 2?
You can sort B1 and B2 by norm. If a = b1 + b2, then ||a|| = ||b1 + b2|| <= ||b1|| + ||b2||, so for any a and b1, you can efficiently eliminate all elements of B2 that have norm < ||a|| - ||b1||. There may also be some way to use the distribution of norms in B1 and B2 to decide whether to switch the roles of the two sets in this. (I don't see off-hand how to do it, but it seems to me that something like this should hold if the distributions of norms in B1 and B2 are significantly different.)
As for making it parallel, it seems that each loop can be turned into a parallel computation, since all computations of one inner iteration are independent of all other iterations.
EDIT
Continuing the analysis: since b2 = a - b1, we also have ||b2|| <= ||a|| + ||b1||. So for any given a and b1, you can restrict the search in B2 to those elements with norms in the range ||a|| ± ||b1||. This suggests that for B1 you should select the set with the smallest average norm.
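A sketch of that pruning (B2 pre-sorted by norm; a linear scan of the norm window stands in for the binary search you would really use, and the Vec type and function names are made up):

import Data.List (sortOn)

type Vec = [Int]  -- 20 non-negative components

norm :: Vec -> Double
norm = sqrt . fromIntegral . sum . map (^ 2)

-- Sketch: count the a = b1 + b2 decompositions, testing only those b2 whose
-- norm lies in [||a|| - ||b1||, ||a|| + ||b1||].
countExpansions :: [Vec] -> [Vec] -> [Vec] -> Int
countExpansions as b1s b2s =
  length [ () | a  <- as
              , b1 <- b1s
              , let bt = zipWith (-) a b1          -- the candidate b2
              , all (>= 0) bt
              , let (na, nb1) = (norm a, norm b1)
              , b2 <- window (na - nb1) (na + nb1)
              , bt == b2 ]
  where
    sortedB2     = sortOn fst [ (norm b2, b2) | b2 <- b2s ]
    window lo hi = map snd
                 . takeWhile ((<= hi) . fst)
                 . dropWhile ((< lo) . fst)
                 $ sortedB2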
Let's say I have the three following lists
A1
A2
A3
B1
B2
C1
C2
C3
C4
C5
I'd like to combine them into a single list, with the items from each list as evenly distributed as possible, sorta like this:
C1
A1
C2
B1
C3
A2
C4
B2
A3
C5
I'm using .NET 3.5/C# but I'm looking more for how to approach it than for specific code.
EDIT: I need to keep the order of elements from the original lists.
1. Take a copy of the list with the most members. This will be the destination list.
2. Then take the list with the next largest number of members.
3. Divide the destination list length by the smaller length to give a fractional value greater than one.
4. For each item in the second list, maintain a float counter. Add the value calculated in step 3, and mathematically round it to the nearest integer (keep the original float counter intact).
5. Insert the item at this position in the destination list and increment the counter by 1 to account for it. Repeat steps 4-5 for all members of the second list.
6. Repeat steps 2-5 for all lists.
EDIT: This has the advantage of being O(n) as well, which is always nice :)
Implementation of Andrew Rollings' answer:
public List<String> equimix(List<List<String>> input) {
    // Sort the lists from biggest to smallest.
    Collections.sort(input, new Comparator<List<String>>() {
        public int compare(List<String> a1, List<String> a2) {
            return a2.size() - a1.size();
        }
    });
    // Start with the biggest list and fold each smaller list into it.
    List<String> output = input.get(0);
    for (int i = 1; i < input.size(); i++) {
        output = equimix(output, input.get(i));
    }
    return output;
}

public List<String> equimix(List<String> listA, List<String> listB) {
    // Make sure listA is the longer list; it is used (and mutated) as the destination.
    if (listB.size() > listA.size()) {
        List<String> temp = listB;
        listB = listA;
        listA = temp;
    }
    List<String> output = listA;
    double shiftCoeff = (double) listA.size() / listB.size();
    double floatCounter = shiftCoeff;
    for (String item : listB) {
        // Round the running counter to get the insertion point, then advance it by
        // the step plus one to account for the element just inserted.
        int insertionIndex = (int) Math.round(floatCounter);
        output.add(insertionIndex, item);
        floatCounter += (1 + shiftCoeff);
    }
    return output;
}
First, this answer is more of a train of thought than a concrete solution.
OK, so you have a list of 3 items (A1, A2, A3), where you want A1 to be somewhere in the first 1/3 of the target list, A2 in the second 1/3 of the target list, and A3 in the third 1/3. Likewise you want B1 to be in the first 1/2, etc...
So you allocate your list of 10 as an array, then start with the list with the most items, in this case C. Calculate the spot where C1 should fall (1.5). Drop C1 in the closest spot (in this case, either 1 or 2), then calculate where C2 should fall (3.5) and continue the process until there are no more Cs.
Then go with the list with the second-to-most number of items. In this case, A. Calculate where A1 goes (1.66), so try 2 first. If you already put C1 there, try 1. Do the same for A2 (4.66) and A3 (7.66). Finally, we do list B. B1 should go at 2.5, so try 2 or 3. If both are taken, try 1 and 4 and keep moving radially out until you find an empty spot. Do the same for B2.
You'll end up with something like this if you pick the lower number:
C1 A1 C2 A2 C3 B1 C4 A3 C5 B2
or this if you pick the upper number:
A1 C1 B1 C2 A2 C3 A3 C4 B2 C5
This seems to work pretty well for your sample lists, but I don't know how well it will scale to many lists with many items. Try it and let me know how it goes.
Make a hash table of lists.
For each list, store its nth element in the hash table under the key n / (length of the list + 1).
Optionally, shuffle the lists under each key in the hash table, or sort them in some way.
Concatenate the lists in the hash table in order of sorted key.
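A sketch of that idea with a map keyed by the fractional position (the optional shuffling of equal-key buckets is omitted; the function name is made up):

import qualified Data.Map as Map

-- Sketch: key each element by its fractional position n / (length + 1), then
-- read the buckets back in ascending key order.
interleave :: [[a]] -> [a]
interleave lists = concat (Map.elems buckets)
  where
    buckets = Map.fromListWith (flip (++))
      [ (fromIntegral n / fromIntegral (length l + 1) :: Double, [x])
      | l      <- lists
      , (n, x) <- zip [1 :: Int ..] l ]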
I'm thinking of a divide-and-conquer approach: on each iteration you split every list with more than one element in half and recurse. When you get to a point where all the lists except one have a single element, you can randomly combine them, pop up a level, randomly combine the single-element lists removed from that frame... et cetera.
Something like the following is what I'm thinking:
- filter lists into three categories
  - lists of length 1
  - first half of the elements of lists with > 1 elements
  - second half of the elements of lists with > 1 elements
- recurse on the first and second halves of the lists if they have > 1 element
- combine results of the above computation in order
- randomly combine the list of singletons into the returned list
You could simply combine the three lists into a single list and then UNSORT that list. An unsorted list should achieve your requirement of 'evenly-distributed' without too much effort.
Here's an implementation of unsort: http://www.vanheusden.com/unsort/.
A quick suggestion, in python-ish pseudocode:
merge = []
lists = [list_a, list_b, list_c]
lists.sort(key=len, reverse=True)              # longest list first
while lists:
    l = lists.pop(0)                           # take the current longest list
    merge.append(l.pop(0))                     # move its head element to the result
    if l:                                      # if it still has elements,
        nxt = lists.pop(0) if lists else None  # set aside the next list (if any)
        lists.append(l)                        # put l back at the end
        lists.sort(key=len, reverse=True)      # re-sort the remaining lists
        if nxt is not None:
            lists.insert(0, nxt)               # but let the set-aside list go first
This should distribute elements from shorter lists more evenly than the other suggestions here.