Pre-processing data for existing compressor - algorithm

I have an existing "compression" algorithm, which compresses an association into ranges. Something like
type Assoc = [(P,C)]
rangeCompress :: Assoc -> [(P, [(C,C)])]
Read P as "product" and C as "code". The result is a list of products, each associated with a list of code ranges. To find the product associated with a given code, one traverses the compressed data until one finds a range that the given code falls into.
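For concreteness, the lookup might look like this (a minimal sketch; lookupProduct and its type variables are my names, not part of the original code):

lookupProduct :: Ord c => [(p, [(c, c)])] -> c -> Maybe p
lookupProduct compressed code =
  case [ p | (p, ranges) <- compressed
           , any (\(lo, hi) -> lo <= code && code <= hi) ranges ] of
    (p : _) -> Just p
    []      -> Nothing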
This mechanism works well if consecutive codes are likely to belong to the same product. However, if the codes of different products interleave, they no longer form compact ranges; I end up with lots of ranges whose upper bound equals the lower bound, and no compression at all.
What I am looking for is a pre-compressor, which looks at the original association and determines a "good enough" transformation of codes, such that the association expressed in terms of transformed codes can be range-compressed into compact ranges. Something like
preCompress :: Assoc -> (C->C)
or more fine-grained (by Product)
preCompress :: Assoc -> P -> (C->C)
In that case a product lookup would have to first transform the code in question and then do the lookup as before. Therefore the transformation must be expressible by a handful of parameters, which would have to be attached to the compressed data, either once for the entire association or by product.
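To illustrate what "a handful of parameters" could mean, here is one purely hypothetical shape for such a transformation: an affine transform on integer codes, stored next to the compressed ranges. Both the names and the choice of transform are illustrative assumptions, not part of the question.

-- Hypothetical: a code transformation described by three integers.
data Transform = Transform { mulT, addT, modT :: Int }

applyTransform :: Transform -> Int -> Int
applyTransform (Transform m a n) c = (m * c + a) `mod` n

-- Lookup transforms the code first, then scans ranges as before
-- (reusing lookupProduct from the sketch above).
lookupVia :: Transform -> [(p, [(Int, Int)])] -> Int -> Maybe p
lookupVia t compressed = lookupProduct compressed . applyTransform t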
I checked some compression algorithms, but they all seem to focus on reconstructing the original data (which is not strictly needed here), while being totally free about the way they store the compressed data. In my case however, the compressed data must be ranges, only enriched by the parameters of the pre-compression.
Is this a known problem?
Does it have a solution?
Where to look next?
Please note:
I am not primarily interested in restoring the original data
I am primarily interested in the product lookup
The number of codes is approx. 7,000,000
The number of products is approx. 200

Assuming that for each code there is only one product, you can encode the data as a list (string) of products. Let's pretend we have the following list of products and codes randomly generated from 9 products (or no product) and 10 codes.
[(P6,C2),
(P1,C4),
(P2,C10),
(P3,C9),
(P3,C1),
(P4,C7),
(P6,C8),
(P5,C3),
(P1,C5)]
If we sort them by code we have
[(P3,C1),
(P6,C2),
(P5,C3),
(P1,C4),
(P1,C5),
-- C6 didn't have a product
(P4,C7),
(P6,C8),
(P3,C9),
(P2,C10)]
We can convert this into a string of products + nothing (N). The position in the string determines the code.
{- C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 -}
[ P3, P6, P5, P1, P1, N , P4, P6, P3, P2 ]
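A sketch of this conversion, assuming codes are the integers 1..maxCode (the function name is mine):

-- Lay the products out by code position; Nothing plays the role of N.
toProductString :: Int -> [(p, Int)] -> [Maybe p]
toProductString maxCode assoc =
  [ lookup c codeToProduct | c <- [1 .. maxCode] ]
  where codeToProduct = [ (c, p) | (p, c) <- assoc ]

(For 7,000,000 codes you would want an array rather than a repeated linear lookup, but the shape of the data is the same.)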
A string of symbols in some alphabet (in this case products+nothing) puts us squarely in the realm of well-studied string compression problems.
If we run-length encode this list, we have an encoding similar to the encoding you originally presented. For each range of associations we store a single product+nothing and the run length. We only need a single small integer for the run length instead of two (possibly large) codes for the interval.
{- P3 , P6, P5, P1, P1, N , P4, P6, P3, P2 -}
[ (P3, 1), (P6, 1), (P5, 1), (P1, 2), (N, 1), (P4, 1), (P6, 1), (P3, 1), (P2, 1) ]
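The run-length encoding itself is a one-liner over the grouped string; a sketch:

import Data.List (group)

runLength :: Eq a => [a] -> [(a, Int)]
runLength = map (\g -> (head g, length g)) . group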
We can serialize this to a string of bytes and use any of the existing compression libraries on bytes to perform the actual compression. Some compression libraries, such as the frequently used zlib, can be found in the codec category on Hackage.
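For example, assuming products fit in a Word8 tag (there are only ~200) and run lengths fit in a Word32 (~7,000,000 codes), serialization plus compression with the binary and zlib packages is just:

import Data.Binary (encode)
import Codec.Compression.Zlib (compress)
import Data.Word (Word8, Word32)
import qualified Data.ByteString.Lazy as BL

serializeRuns :: [(Word8, Word32)] -> BL.ByteString
serializeRuns = compress . encode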
Sorted the other way
We'll take the same data from before and sort it by product instead of by code
[(P1,C4),
(P1,C5),
(P2,C10),
(P3,C1),
(P3,C9),
(P4,C7),
(P5,C3),
(P6,C2),
(P6,C8)]
We'd like to allocate new codes to each product so that the codes for a product are always consecutive. We'll just allocate these codes in order.
-- New association     -- New code -> old code
[(P1,C1),              [(C1,C4),
 (P1,C2),               (C2,C5),
 (P2,C3),               (C3,C10),
 (P3,C4),               (C4,C1),
 (P3,C5),               (C5,C9),
 (P4,C6),               (C6,C7),
 (P5,C7),               (C7,C3),
 (P6,C8),               (C8,C2),
 (P6,C9)]               (C9,C8)]
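A sketch of this renumbering: sort by product (the sort is stable, so each product's codes keep their order), hand out new codes 1, 2, ..., and return both the new association and the new-to-old code table. Names are mine.

import Data.List (sortOn)

renumber :: Ord p => [(p, c)] -> ([(p, Int)], [(Int, c)])
renumber assoc = (newAssoc, codeTable)
  where
    numbered  = zip (sortOn fst assoc) [1 ..]
    newAssoc  = [ (p, new)   | ((p, _),   new) <- numbered ]
    codeTable = [ (new, old) | ((_, old), new) <- numbered ]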
We now have two pieces of data to save. We have a (now) one-to-one mapping between products and code ranges and a new one-to-one mapping between new codes and old codes. If we follow the steps from the previous section, we can convert the mapping between new codes and old codes into a list of old codes. The position in the list determines the new code.
{- C1 C2 C3 C4 C5 C6 C7 C8 C9 -} -- new codes
[ C4, C5, C10, C1, C9, C7, C3, C2, C8 ] -- old codes
We can throw any of our existing compression algorithms at this string. However, each symbol occurs at most once in this list, so traditional compression mechanisms will not shrink it any further. This isn't significantly different from grouping by product and storing the list of codes as an array with the product; the new intermediate codes are just (larger) pointers to the start of the array and the length of the array.
[(P1,[C4,C5]),
(P2,[C10]),
(P3,[C1,C9]),
(P4,[C7]),
(P5,[C3]),
(P6,[C2,C8])]
A better representation for the list of old codes is probably the difference between successive old codes. If the codes already tend to lie in consecutive ranges, those ranges can be run-length encoded away.
{- C4, C5, C10, C1, C9, C7, C3, C2, C8 -} -- old codes
[ +4, +1, +5, -9, +8, -2, -4, -1, +6 ] -- old code difference
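A sketch of the difference encoding, treating codes as Ints and starting from 0, so that consecutive old codes become runs of +1 for the run-length encoder to collapse:

deltas :: [Int] -> [Int]
deltas xs = zipWith (-) xs (0 : xs)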
For each product I might be tempted to store the difference from the previous product together with the differences of its codes. This should increase the opportunity for common substrings for the final compression algorithm to compress away.
[(+1,[+4,+1]),
(+1,[+5]),
(+1,[-9,+8]),
(+1,[-2]),
(+1,[-4]),
(+1,[-1,+6])]

Related

Problem of assigning some values to a set from multiple options algo

I have a problem statement where I have some sets, and each set has some options; a specific number of options needs to be assigned to each set.
Some options can be common to multiple sets, but none can be assigned to more than one set. I need an algorithm to achieve this. A rough example is
Set1 has options [100,101,102,103] - 2 needs to be selected,
Set2 has options [101,102,103,104] - 2 needs to be selected,
Set3 has options [99,100,101] - 2 needs to be selected,
so one possible solution is
Set1 gets 100,102
Set2 gets 103,104
Set3 gets 99,101
Can anyone suggest an approach for getting a generic solution to this problem?
This can be modelled as an instance of the bipartite graph matching problem.
Let A be the numbers which appear in the lists (combined into one set with no duplicates), and let B be the lists themselves, each repeated according to how many elements need to be selected from them. There is an edge from a number a ∈ A to a list b ∈ B whenever the number a is in the list b.
This problem can therefore be approached using an algorithm which finds a matching that saturates every vertex of B, i.e. one in which every list-copy is matched to a distinct number. Wikipedia lists several algorithms which can be used to find matchings as large as possible, including the Ford–Fulkerson algorithm and the Hopcroft–Karp algorithm.
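A compact sketch of the augmenting-path approach (Kuhn's algorithm) in Haskell; the duplicated-slot encoding follows the construction above, but the code itself is mine, not from the answer:

import qualified Data.Map.Strict as M
import qualified Data.Set as S

-- adj maps each slot (list-copy) in B to the numbers it may take;
-- the result maps each number to the slot that received it.
maxMatching :: (Ord a, Ord b) => M.Map b [a] -> M.Map a b
maxMatching adj = foldl augment M.empty (M.keys adj)
  where
    augment m u = case try m S.empty u of
      (Just m', _) -> m'
      (Nothing, _) -> m
    -- Search for an augmenting path starting at the unmatched slot u.
    try m vis u = go m vis (M.findWithDefault [] u adj)
      where
        go m' vis' [] = (Nothing, vis')
        go m' vis' (v : vs)
          | v `S.member` vis' = go m' vis' vs
          | otherwise =
              let vis2 = S.insert v vis'
              in case M.lookup v m' of
                   Nothing -> (Just (M.insert v u m'), vis2)
                   Just u2 -> case try m' vis2 u2 of
                     (Just m2, vis3) -> (Just (M.insert v u m2), vis3)
                     (Nothing, vis3) -> go m' vis3 vs

-- The example from the question, with each set duplicated per slot:
example :: M.Map (String, Int) [Int]
example = M.fromList $
     [ (("Set1", i), [100,101,102,103]) | i <- [1, 2] ]
  ++ [ (("Set2", i), [101,102,103,104]) | i <- [1, 2] ]
  ++ [ (("Set3", i), [99,100,101])      | i <- [1, 2] ]
-- maxMatching example saturates all six slots, one valid assignment.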
Thanks #kaya3, I didn't remember the exact algorithm, and being reminded that it's a bipartite graph matching problem was really helpful.
But it wasn't giving me the exact solution when I needed n options from each set, so I followed the approach below:
A = [99,100,101,102,103,104]
B = [a1, a2, b1, b2, c1, c2]
# I repeated the instances (a1, a2, ...) because I need 2 picks for each
# set, and built the graph like this:
99 => [c1, c2]
100 => [a1, a2, c1, c2]
101 => [a1, a2, b1, b2, c1, c2]
102 => [a1, a2, b1, b2]
103 => [a1, a2, b1, b2]
104 => [b1, b2]
Now it is giving the correct solution every time; I tried it with multiple use cases.

Storing a number bigger than the integer limit in vhdl

Let me explain my problem with an example.
I have two variables
a = 74686 and b = 20930625.
I want to store
c = (a × 2^16) + b.
This exceeds the integer limit (32 bits) in VHDL.
It is okay for me to store c in two separate registers, say c1 and c2, and tell users to concatenate the bits of c1 and c2 to get the actual result, i.e. I want to store the lower 32 bits of c in c1 and the remaining bits in c2.
How can I do this?
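(In VHDL one would typically use an unsigned vector wider than 32 bits from numeric_std and slice it; the VHDL specifics aside, the arithmetic of the split is easy to sanity-check. A sketch, in Haskell with unbounded integers, of exactly what c1 and c2 should hold:)

import Data.Bits (shiftL, shiftR, (.&.))

main :: IO ()
main = do
  let a, b, c, c1, c2 :: Integer
      a  = 74686
      b  = 20930625
      c  = (a `shiftL` 16) + b          -- c = a * 2^16 + b
      c1 = c .&. 0xFFFFFFFF             -- lower 32 bits
      c2 = c `shiftR` 32                -- remaining high bits
  print (c1, c2, (c2 `shiftL` 32) + c1 == c)  -- reassembly check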

Hadoop data cross calculation

I have a huge file which contains a list of numbers, for example A0, A1, ..., An. I want to calculate values for all combinations of Ai and Aj using a complex operation (op). So I want the values
op(A1, A2), op(A1, A3), ....., op(An, An-2), op(An, An-1)
or even more: op(A1, A2, A3), op(A1, A2, A4), ....
My question is: the huge file containing all the numbers is divided into segments across nodes. How can I get numbers Ai and Aj when they are not on the same node?
Thanks.

A travelling salesman that has state

Let's say we have a directed graph. We want to visit every node exactly once by traveling on the edges of this graph. Every node is annotated with one or more tags; some nodes may share tags, and even have the exact same set of tags. As we go along our walk, we are collecting a list of every distinct tag we have encountered - our objective is to find the walk which postpones acquisition of new tags as much as possible.
To restate this as a traveler analogy, let's say that a carpet salesman is trying to decide which supplier he should acquire his carpets from. He makes a list of all the carpet factories in the city, makes an appointment with every factory, and collects samples of the kinds of carpet they make.
Let's say we have 3 factories, producing the following kinds of carpet:
F1: C1, C2, C3
F2: C1, C4
F3: C1, C4, C5
The salesman could take the following routes:
Start at F1, collect C1, C2, C3. Go to F2, collect C4 (since he already has C1). Go to F3, collect C5 (he already has C1 and C4).
Start at F1, collect C1, C2, C3. Go to F3, collect C4 and C5. Go to F2, collect nothing (since it turns out he already has all their carpets).
Start at F2, collect C1, C4. Go to F1, collect C2, C3. Go to F3 and collect C5.
Start at F2, collect C1, C4. Go to F3, collect C5. Go to F1 and collect C2 and C3.
Start at F3, collect C1, C4, C5. Go to F1, collect C2, C3. Go to F2, collect nothing.
Start at F3, collect C1, C4, C5. Go to F2, collect nothing. Go to F1, collect C2, C3.
Note how sometimes, the salesman visits a factory even though he knows he has already collected a sample for every kind of carpet they produce. The analogy breaks down here a bit, but let's say he must visit them because it would be rude to not show up for his appointment.
Now, the carpet samples are heavy, and our salesman is traveling on foot. Distance by itself isn't hugely important (assume every edge has cost 1), but he doesn't want to carry around a whole bunch of samples any more than he needs to. So, he needs to plan his trip such that he visits the factories which have a lot of rare carpets (and where he will have to pick up a lot of new samples) last.
For the example paths above, here are the numbers of samples carried on each leg of the journey, and their sum:
Route  Leg 1  Leg 2  Leg 3  Sum
  1      0      3      4     7
  2      0      3      5     8
  3      0      2      4     6
  4      0      2      3     5
  5      0      3      5     8
  6      0      3      3     6
We can see now that route 2 is very bad: first he had to carry 3 samples from F1 to F3, then he had to carry 5 samples from F3 to F2! Instead, he could have gone with route 4: he would first carry 2 samples from F2 to F3, and then 3 samples from F3 to F1.
Also, as shown in the last column, the sum of the samples carried through every edge is a good metric for how many samples he had to carry overall: The number of samples he is carrying cannot decrease, so visiting varied factories early on will necessarily inflate the sum, and a low sum is only possible by visiting similar factories with few carpets.
Is this a known problem? Is there an algorithm to solve it?
Note: I would recommend being careful about making assumptions based on my example problem. I came up with it on the spot, and deliberately kept it small for brevity. It is certain there are many edge cases that it fails to catch.
As the size of the graph is small, we can consider using a bit mask and dynamic programming to solve this problem (similar to how we solve the travelling salesman problem).
Assume that we have a total of 6 cities to visit. Then the starting state is 0 and the ending state is 111111b, or 63 in decimal.
At each step, if the state is x, we can easily calculate the number of samples the salesman is carrying, and the cost of moving from state x to state y is the number of newly added samples times the number of cities still unvisited in x. (This differs from the leg-by-leg sum above only by a constant, the total number of distinct samples, so minimizing it selects the same route.)
// Bitmask DP over subsets of visited cities (memoize on `mask` for
// efficiency; omitted here for brevity).
public int cal(int mask) {
    if (mask == (1 << numberOfCity) - 1) {    // all cities visited
        return 0;
    }
    Set<Integer> sampleSet = new HashSet<>(); // samples collected so far
    int left = 0;                             // number of unvisited cities
    for (int i = 0; i < numberOfCity; i++) {
        if (((1 << i) & mask) != 0) {         // city i was already visited
            sampleSet.addAll(citySample[i]);
        } else {
            left++;
        }
    }
    int cost = Integer.MAX_VALUE;
    for (int i = 0; i < numberOfCity; i++) {
        if (((1 << i) & mask) == 0) {         // try visiting city i next
            int dif = 0;                      // samples new to us at city i
            for (int s : citySample[i]) {
                if (!sampleSet.contains(s)) {
                    dif++;
                }
            }
            cost = Math.min(cost, dif * left + cal(mask | (1 << i)));
        }
    }
    return cost;
}
In the case where there are edges between every pair of nodes, and each carpet is only available at one location, this looks tractable. If you pick up X carpets when there are Y steps to go, then the contribution from this to the final cost is XY. So you need to minimise SUM_i XiYi where Xi is the number of carpets picked up when you have Yi steps to go. You can do this by visiting the factories in increasing order of the number of carpets to be picked up at that factory. If you provide a schedule in which you pick up more carpets at A than B, and you visit A before B, I can improve it by swapping the times at which you visit A and B, so any schedule that does not follow this rule is not optimal.
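The consequence of this exchange argument is a one-line greedy; a sketch (in Haskell, under the answer's assumptions: complete graph, each carpet at exactly one factory; names are mine):

import Data.List (sortOn)

-- Visit factories in increasing order of carpets picked up there.
greedyRoute :: [(name, Int)] -> [name]
greedyRoute = map fst . sortOn snd

-- Cost of a pickup order: x carpets picked up with y legs still to go
-- contribute x*y to the total carried.
routeCost :: [Int] -> Int
routeCost pickups = sum (zipWith (*) pickups [n - 1, n - 2 .. 0])
  where n = length pickups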

Weighting Different Outcomes when Pseudorandomly Choosing from an Arbitrarily Large Sample

So, I was sitting in my backyard thinking about Pokemon, as we're all wont to do, and it got me thinking: when you encounter a 'random' Pokemon, some specimens appear a lot more often than others, which means that they're weighted differently from the ones that appear less often.
Now, were I to approach the problem of getting the different Pokemon to appear with a certain probability, I would most likely do so by simply increasing the number of entries that certain Pokemon have in the pool of choices (like so),
Pool:
C1 C1 C1 C1
C2 C2
C3 C3 C3 C3 C3
C4
so C1 has a 1/3 chance of being pulled, C2 has a 1/6 chance, etc. But I understand that this may be a very simple and naive approach, and is unlikely to scale well with a large number of choices.
So, my question is this, S/O: given an arbitrarily large sample size, how would you go about weighting the chance of one outcome as greater than another? And, as a follow-up: what if you want the probabilities of certain options to be specified with floating-point precision rather than as whole-number ratios?
If you know the probability of each event happening you need to map these probabilities to the range 0-100 (or 0 to 1 if you want to use real numbers and probabilities.)
So in the example above there are 12 Cs. C1 is 4/12 or ~33%,
C2 is 2/12 or ~17%, C3 is 5/12 or ~42%, and C4 is 1/12 or ~8%.
Notice that these all add up to 100%. So if we choose a random number between 0 and 100 we can map C1 to 0-33, C2 to 33-50 (17 more than C1's value) , C3 to 50-92, and C4 to 92-100.
An if statement could make the choice:
r = rand(100)  # random integer in 0...100
if r < 33
  "C1"
elsif r < 50
  "C2"
elsif r < 92
  "C3"
else
  "C4"
end
If you want more accuracy than 1 in 100, just draw from 0-1000 or whatever range you want. It's probably better form to use integers and scale them rather than use floating-point numbers, as floating point can behave oddly when the spread between values gets large.
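The same idea written generically (a sketch in Haskell, matching the main question's language; names are mine): keep integer weights and walk the running total, so adding an alternative doesn't require recomputing thresholds by hand.

import System.Random (randomRIO)

weightedChoice :: [(a, Int)] -> IO a
weightedChoice items = do
  r <- randomRIO (0, sum (map snd items) - 1)
  let pick n ((x, w) : rest) = if n < w then x else pick (n - w) rest
      pick _ []              = error "weightedChoice: empty list"
  pure (pick r items)

-- weightedChoice [("C1",4),("C2",2),("C3",5),("C4",1)] draws "C1"
-- with probability 4/12, and so on.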
If you wanted to go the binning route like you show above, you could try something like this (in Ruby, though the idea is more general):
a = ["C1"]*4 + ["C2"]*2 + ["C3"]*5 + ["C4"]
# ["C1", "C1", "C1", "C1", "C2", "C2",
# "C3", "C3", "C3", "C3", "C3", "C4"]
a[rand(a.length)] # => "C1" with probability 4/12
Binning would be slower as you need to create the array, but easier to add alternatives as you wouldn't need to recalculate the probabilities each time.
You could also generate the above if code from the array representation so you'd just take the pre-processing hit once when the code was generated and then get a fast answer from the created code.
