Binary heap insertion, don't understand for loop - algorithm

In Weiss's "Data Structures and Algorithms in Java", he explains the insert algorithm for binary heaps thusly:
public void insert( AnyType x )
{
    if( currentSize == array.length - 1 )
        enlargeArray( array.length * 2 + 1 );

    // Percolate up
    int hole = ++currentSize;
    for( array[ 0 ] = x; x.compareTo( array[ hole / 2 ] ) < 0; hole /= 2 )
        array[ hole ] = array[ hole / 2 ];
    array[ hole ] = x;
}
I get the principle of moving a hole up the tree, but I don't understand how he's accomplishing it with this syntax in the for loop... What does the initializer array[0] = x; mean? It seems he's overwriting the root value? It seems like a very contrived piece of code. What's he doing here?

First off, I got a response from Mark Weiss and his email basically said the code was correct (full response at the bottom of this answer).
He also said this:
Consequently, the minimum item is in array index 1 as shown in findMin. To do an insertion, you follow the path from the bottom to the root.
Index 1? Hmmm... I then had to go back and re-read larger portions of the chapter and when I saw figure 6.3 it clicked.
The array is 0-based, but the elements that are considered part of the heap are stored from index 1 onwards. Illustration 6.3 looks like this:
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|   | A | B | C | D | E | F | G | H | I | J |   |   |   |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  0   1   2   3   4   5   6   7   8   9  10  11  12  13
The placing of the value at element 0 is a sentinel value to make the loop terminate.
Thus, with the above tree, let's see how the insert function works. H below marks the hole.
First we place x into the 0th element (outside the heap), and place the hole at the next available element in the array.
                                              H
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| x | A | B | C | D | E | F | G | H | I | J |   |   |   |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  0   1   2   3   4   5   6   7   8   9  10  11  12  13
Then we bubble up (percolate) the hole, moving the values up from "half the index" until we find the right spot to place the x.
If we look at figure 6.5 and 6.6, let's place the actual values into the array:
                          H/2                            H
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| 14 | 13 | 21 | 16 | 24 | 31 | 19 | 68 | 65 | 26 | 32 |    |    |    |
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
  0    1    2    3    4    5    6    7    8    9    10   11   12   13
Notice that we placed 14, the value to insert, into index 0, but this is outside the heap, our sentinel value to ensure the loop terminates.
Then we compare the value x with the value at hole / 2, which now is 11/2 = 5. x is less than 31, so we move the value up and move the hole:
           H/2             H <---------------------------
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| 14 | 13 | 21 | 16 | 24 | 31 | 19 | 68 | 65 | 26 | 32 | 31 |    |    |
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
  0    1    2    3    4    5    6    7    8    9    10   11   12   13
                           |                             ^
                           +--------- move 31 -----------+
We compare again, 14 is again less than 21 (5 / 2 = 2), so once more:
      H/2   H <------------
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| 14 | 13 | 21 | 16 | 24 | 21 | 19 | 68 | 65 | 26 | 32 | 31 |    |    |
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
  0    1    2    3    4    5    6    7    8    9    10   11   12   13
            |              ^
            +-- move 21 ---+
Now, however, 14 is not less than 13 (hole / 2 --> 2 / 2 = 1), so we've found the right spot for x:
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| 14 | 13 | 14 | 16 | 24 | 21 | 19 | 68 | 65 | 26 | 32 | 31 |    |    |
+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
  0    1    2    3    4    5    6    7    8    9    10   11   12   13
            ^
            x
As you can see, if you look at illustrations 6.6 and 6.7, this matches the expected behavior.
So while the code isn't wrong, there is one little snag that is perhaps outside the scope of the book.
If the type of x being inserted is a reference type, you will in the current heap have 2 references to the same object just inserted. If you then immediately delete the object from the heap, it looks (but look where looking like got us in the first place...) like the 0th element will still retain the reference, prohibiting the garbage collector from doing its job.
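To make the walkthrough above concrete, here is a minimal sketch of the same percolate-up step written as a while loop. It is not Weiss's class: SentinelInsertDemo, the static insert method and the use of Integer are my own assumptions for illustration, and it assumes the array already has a free slot. It also clears array[0] afterwards, which is one possible way around the lingering-reference snag mentioned above.

```java
import java.util.Arrays;

// Hypothetical demo class; the heap occupies array[1..size], slot 0 is the sentinel slot.
class SentinelInsertDemo {
    // Inserts x and returns the new size. Assumes array has room for one more element.
    static int insert(Integer[] array, int size, Integer x) {
        int hole = ++size;                  // the hole starts at the next free slot
        array[0] = x;                       // sentinel: acts as the "parent" of index 1
        while (x.compareTo(array[hole / 2]) < 0) {
            array[hole] = array[hole / 2];  // move the parent down into the hole
            hole /= 2;                      // the hole moves up to the parent's slot
        }
        array[hole] = x;
        array[0] = null;                    // optional: drop the extra reference to x
        return size;
    }

    public static void main(String[] args) {
        // The values from figures 6.5 and 6.6, with three spare slots at the end.
        Integer[] array = {null, 13, 21, 16, 24, 31, 19, 68, 65, 26, 32, null, null, null};
        int size = insert(array, 10, 14);
        System.out.println(size + " " + Arrays.toString(array));
    }
}
```

If x is a new minimum, the hole reaches index 1, x is compared with itself in slot 0, compareTo returns 0 and the loop stops, so no separate boundary test is needed.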
To make sure there's no hidden agenda here, here is the complete answer from Mark:
Hi Lasse,
The code is correct.
The binary heap is a complete binary tree in which on any path from a
bottom to the root, values never increase. Consequently the minimum
item is at the root. The array representation places the root at
index 1, and for any node at index i, the parent is at i/2 (rounded
down) (the left child is at 2i and the right child at 2i+1, but that
is not needed here).
Consequently, the minimum item is in array index 1 as shown in
findMin. To do an insertion, you follow the path from the bottom to
the root.
In the for loop:
hole /= 2 expresses the idea of moving the hole to the parent.
x.compareTo( array[ hole / 2 ]) < 0 expresses the idea that we stay in
the loop as long as x is smaller than the parent.
The problem is that if x is a new minimum, you never get out of the
loop safely (technically you crash trying to compare x and array[0]).
You could put in an extra test to handle the corner case.
Alternatively, the code gets around that by putting x in array[0] at
the start, and since the "parent" of node i is i/2, the "parent" of
the root which is in index 1 can be found in index 0. This guarantees
the loop terminates if x is the new minimum (and then places x, which
is the new minimum in the root at index 1).
A longer explanation is in the book... but the basic concept here is
that of using a sentinel (or dummy) value to avoid extra code for
boundary cases.
Regards,
Mark Weiss

The array initialiser looks wrong. If it were array[hole] = x;, then the whole thing makes perfect sense.
It first puts the value in the lowest rank of the tree (the entry after the current size), then it looks in the entry 'above it' by looking at (int) hole/2.
It keeps moving it up until the comparator tells it to stop. I think that this is a slight misuse of the syntax of a for loop, since it feels like it's really a while(x.compare(hole/2) < 0) type loop.

Finding tuples if it only exists in all occurrences of a constraint

Database (all entries are integers):
ID | BUDGET
1 | 20
8 | 20
10 | 20
5 | 4
9 | 4
10 | 4
1 | 11
9 | 11
Suppose my constraint is having a budget of >= 10.
I would want to return ID of 1 only in this case. How do I go about it?
I've tried taking the cross product of itself after selecting budget >= 10 and returning if id1 = id2 and budget1 <> budget2 but that does not work in the case where there's only 1 budget that is >= 10. (EG below)
ID | BUDGET
1 | 20
8 | 20
10 | 20
1 | 4
5 | 4
9 | 4
10 | 4
9 | 4
If I were to do what I did for the first example, nothing will be returned as budget1 <> budget2 will result in an empty table.
EDIT1: I can only use relational algebra to solve the problem, so SQL's EXISTS, WHERE and COUNT keywords can't be used.
Edit2: Only project, select, rename, set difference, set union, left join, right join, full inner join, natural joins, set intersection and cross product allowed
The question is not completely clear to me. If you want to return all the ID for which there is a budget greater than 10, and no budget less than 10, the expression is simply the following:
π(ID)(σ(BUDGET>=10)(R)) - π(ID)(σ(BUDGET<10)(R))
If, on the other hand, you want all the IDs which have all the budgets that are present in the relation and greater than or equal to 10, then we must use the ÷ operator:
R ÷ π(BUDGET)(σ(BUDGET>=10)(R))
From your comment, the second case is the correct one. Let’s see how to compute the division from its definition (applied to two generic relations R(A) and S(B)):
R ÷ S = π(A-B)(R) - π(A-B)((π(A-B)(R) x S) - R)
where R is the original relation, and
S = π(BUDGET)(σ(BUDGET>=10)(R)),
that is:
BUDGET
------
20
11
Starting from the inner expression:
π(A-B)(R) is equal to π(ID)(R) =
ID
--
1
5
8
9
10
then (π(A-B)(R) x S) is:
ID BUDGET
---------
1 20
1 11
5 20
5 11
8 20
8 11
9 20
9 11
10 20
10 11
then ((π(A-B)(R) x S) - R) is:
ID BUDGET
---------
5 20
5 11
8 11
9 20
10 11
then π(A-B)((π(A-B)(R) x S) - R) is:
ID
--
5
8
9
10
and, finally, subtracting this relation from π(A-B)(R) we obtain the result:
ID
--
1
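If you want to check the derivation mechanically, the following is a small Java sketch that evaluates the same expression with plain set operations. The class and method names (DivisionDemo, divide, idsWithAllBudgets) are made up for this illustration, and tuples are simply modelled as two-element lists of (ID, BUDGET).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper that mirrors the division formula step by step.
class DivisionDemo {
    // R ÷ S for R(ID, BUDGET) and S(BUDGET), using projection, cross product and difference.
    static Set<Integer> divide(Set<List<Integer>> r, Set<Integer> s) {
        Set<Integer> ids = new TreeSet<>();
        for (List<Integer> tuple : r) {
            ids.add(tuple.get(0));                          // projection of R on ID
        }
        Set<Integer> missing = new TreeSet<>();
        for (Integer id : ids) {
            for (Integer budget : s) {                      // cross product with S
                if (!r.contains(Arrays.asList(id, budget))) {
                    missing.add(id);                        // tuple is in the product but not in R
                }
            }
        }
        ids.removeAll(missing);                             // final set difference
        return ids;
    }

    // Builds S as the budgets >= minBudget that occur in R, then divides.
    static Set<Integer> idsWithAllBudgets(int[][] rows, int minBudget) {
        Set<List<Integer>> r = new HashSet<>();
        Set<Integer> s = new TreeSet<>();
        for (int[] row : rows) {
            r.add(Arrays.asList(row[0], row[1]));
            if (row[1] >= minBudget) {
                s.add(row[1]);
            }
        }
        return divide(r, s);
    }

    public static void main(String[] args) {
        int[][] rows = {{1, 20}, {8, 20}, {10, 20}, {5, 4}, {9, 4}, {10, 4}, {1, 11}, {9, 11}};
        System.out.println(idsWithAllBudgets(rows, 10));    // the first example from the question
    }
}
```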

What is the point of choosing closest node in Dijkstra algorithm?

In all the articles I have read, the neighbor that is processed first is the "closest" neighbor. But in the end all nodes have to be visited to figure out all possible paths. So the question is: why do we do this? I believe the same result can be achieved if we simply traverse the graph in BFS order and calculate the costs as we go. For example:
first step- 0, costs table:
2 - 6 |
3 - 2 |
second step- 2, costs table:
2 - 6 |
3 - 2 |
1 - 9 |
third step- 3, costs table:
2 - 6 |
3 - 2 |
1 - 9 |
4 - 12 |
fourth step- 1, costs table:
2 - 6 |
3 - 2 |
1 - 9 |
4 - 12 |
5 - 12 |
fifth step- 4, costs table:
2 - 6 |
3 - 2 |
1 - 9 |
4 - 12 |
5 - 12 |
With simple BFS traversal the cheapest path was found. What am I missing?
Suppose the path from A to B and B to C are both cost 1, and the direct route from A to C is cost 3. (In the real world, the first two are highways that bypass a mountain while the third is a tiny trail over a mountain pass.)
Dijkstra will route you A -> B -> C for a total cost of 2 while BFS will route you A -> C for a total cost of 3.
Therefore you have to process lowest cost first to get the right answer.
At each step, Dijkstra's algorithm extends the lowest-cost path known so far. Thus, when you finally encounter the goal state, you know that all other, unfinished paths have a greater cost. Therefore, the one you just found is the shortest path.
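The A/B/C example above can be tried out with the short Java sketch below. The names (ClosestFirstDemo, bfsCost, dijkstraCost) and the edge-list representation are assumptions made for this illustration. Note also that bfsCost models a plain BFS that fixes a node's cost the first time the node is discovered, which is the behaviour the example relies on.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.PriorityQueue;

// Hypothetical comparison of plain BFS and Dijkstra on a tiny directed graph.
class ClosestFirstDemo {
    // Edges as {from, to, cost}; nodes are 0 = A, 1 = B, 2 = C.
    static final int[][] EDGES = {{0, 2, 3}, {0, 1, 1}, {1, 2, 1}};

    // Plain BFS: a node's cost is fixed the first time it is discovered.
    static int bfsCost(int n, int[][] edges, int source, int target) {
        int[] cost = new int[n];
        Arrays.fill(cost, -1);                      // -1 means "not discovered yet"
        cost[source] = 0;
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(source);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            for (int[] e : edges) {
                if (e[0] == u && cost[e[1]] < 0) {
                    cost[e[1]] = cost[u] + e[2];
                    queue.add(e[1]);
                }
            }
        }
        return cost[target];
    }

    // Dijkstra: always continue from the cheapest node known so far.
    static int dijkstraCost(int n, int[][] edges, int source, int target) {
        int[] dist = new int[n];
        Arrays.fill(dist, Integer.MAX_VALUE);
        dist[source] = 0;
        PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> Integer.compare(a[1], b[1]));
        pq.add(new int[]{source, 0});
        while (!pq.isEmpty()) {
            int[] top = pq.poll();
            int u = top[0];
            if (top[1] > dist[u]) {
                continue;                           // stale queue entry, a cheaper one was handled
            }
            for (int[] e : edges) {
                if (e[0] == u && dist[u] + e[2] < dist[e[1]]) {
                    dist[e[1]] = dist[u] + e[2];
                    pq.add(new int[]{e[1], dist[e[1]]});
                }
            }
        }
        return dist[target];
    }

    public static void main(String[] args) {
        System.out.println("BFS cost A->C:      " + bfsCost(3, EDGES, 0, 2));
        System.out.println("Dijkstra cost A->C: " + dijkstraCost(3, EDGES, 0, 2));
    }
}
```

On this graph the BFS variant reports 3 for A to C, while Dijkstra reports 2 via B.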

Algorithm - converting nested data to plain data

I have the following nested data structure:
Node 1
  |--- Node 11
         |--- Node 111
  |--- Node 12
         |--- Node 121
         |--- Node 122
         |--- Node 123
  |--- Node 13
         |--- Node 131
Node 2
  |--- Node 21
         |--- Node 211
         |--- Node 212
etc.
and I'm trying to write an algorithm that converts it into a "plain" 2D matrix, like this:
| 1 | 11 | 111 |
| 1 | 12 | 121 |
| 1 | 12 | 122 |
| 1 | 12 | 123 |
| 1 | 13 | 131 |
| 2 | 21 | 211 |
| 2 | 21 | 212 |
etc.
however, I'm having a bit of trouble doing it efficiently, since I can't just traverse the tree and fill the matrix: as you can see the matrix has more cells than the tree has nodes, due to redundant data in all columns except the last.
Note that, like in the example, all leaves of the tree will have the same number of parents, i.e.: the same "nesting depth", so I don't need to account for shorter branches.
I'm sure there's already an algorithm that does this properly, but I don't know how this particular problem is called, so I couldn't find it. Can anyone help me out?
I'm not sure there is any specific name for this, maybe "tree flattening", but I suppose there are several ways in which you could flatten a tree anyway. You can do it with something like this (pseudocode since there is no language tag):
proc flatten_tree(tree : Node<Int>) : List<List<Int>>
    matrix := []
    flatten_tree_rec(tree, [], matrix)
    return matrix
endproc

proc flatten_tree_rec(tree : Node<Int>, current : List<Int>, matrix : List<List<Int>>)
    current.append(tree.value)
    if tree.is_leaf()
        matrix.append(current.copy())
    else
        for child in tree.children()
            flatten_tree_rec(child, current, matrix)
        loop
    endif
    current.remove_last()
endproc
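Since the question has no language tag, here is one possible translation of the list-based pseudocode into Java, as a rough sketch only. The Node class and the names TreeFlattenDemo, flatten and walk are invented for this example. It takes a list of roots because the sample data has several top-level nodes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical Java version of the list-based flattening pseudocode.
class TreeFlattenDemo {
    // Minimal node type assumed for the example: a value plus ordered children.
    static final class Node {
        final int value;
        final List<Node> children = new ArrayList<>();

        Node(int value, Node... kids) {
            this.value = value;
            children.addAll(Arrays.asList(kids));
        }
    }

    // Returns one row per leaf, each row holding the values from the root down to that leaf.
    static List<List<Integer>> flatten(List<Node> roots) {
        List<List<Integer>> rows = new ArrayList<>();
        Deque<Integer> path = new ArrayDeque<>();
        for (Node root : roots) {
            walk(root, path, rows);
        }
        return rows;
    }

    private static void walk(Node node, Deque<Integer> path, List<List<Integer>> rows) {
        path.addLast(node.value);
        if (node.children.isEmpty()) {
            rows.add(new ArrayList<>(path));    // copy, because path keeps changing
        } else {
            for (Node child : node.children) {
                walk(child, path, rows);
            }
        }
        path.removeLast();                      // backtrack before returning to the parent
    }

    public static void main(String[] args) {
        Node one = new Node(1,
                new Node(11, new Node(111)),
                new Node(12, new Node(121), new Node(122), new Node(123)),
                new Node(13, new Node(131)));
        Node two = new Node(2, new Node(21, new Node(211), new Node(212)));
        for (List<Integer> row : flatten(Arrays.asList(one, two))) {
            System.out.println(row);
        }
    }
}
```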
If you need to produce an actual matrix that has to be preallocated, you would need two passes: one to count the number of leaves and the depth, and another to actually fill the matrix:
proc flatten_tree(tree : Node<Int>) : Matrix<Int>
    leafs, depth := count_leafs_and_depth(tree, 0)
    matrix := Matrix<Int>(leafs, depth)
    flatten_tree_rec(tree, [], matrix, 0)
    return matrix
endproc

proc count_leafs_and_depth(tree : Node<Int>, base_depth : Int) : (Int, Int)
    if tree.is_leaf()
        return 1, base_depth + 1
    else
        leafs := 0
        depth := 0
        for child in tree.children()
            c_leafs, c_depth := count_leafs_and_depth(child, base_depth + 1)
            leafs += c_leafs
            depth = max(c_depth, depth)
        loop
        return leafs, depth
    endif
endproc

proc flatten_tree_rec(tree : Node<Int>, current : List<Int>, matrix : Matrix<Int>, index : Int) : Int
    current.append(tree.value)
    if tree.is_leaf()
        matrix[index] = current.copy()
        index += 1
    else
        for child in tree.children()
            index = flatten_tree_rec(child, current, matrix, index)
        loop
    endif
    current.remove_last()
    return index
endproc

Counting the ways to build a wall with two tile sizes [closed]

You are given a set of blocks to build a panel using 3”×1” and 4.5”×1" blocks.
For structural integrity, the spaces between the blocks must not line up in adjacent rows.
There are 2 ways in which to build a 7.5”×1” panel, 2 ways to build a 7.5”×2” panel, 4 ways to build a 12”×3” panel, and 7958 ways to build a 27”×5” panel. How many different ways are there to build a 48”×10” panel?
This is what I understand so far:
with the blocks 3 x 1 and 4.5 x 1
I've used combination formula to find all possible combinations that the 2 blocks can be arranged in a panel of this size
C = choose --> C(n, r) = n!/(r!(n-r)!), the number of combinations of n items taken r at a time
Panel: 7.5 x 1 = 2 ways -->
1 (3 x 1 block) and 1 (4.5 x 1 block) --> Only 2 blocks are used--> 2 C 1 = 2 ways
Panel: 7.5 x 2 = 2 ways
I used combination here as well
1(3 x 1 block) and 1 (4.5 x 1 block) --> 2 C 1 = 2 ways
Panel: 12 x 3 panel = 4 ways -->
2(4.5 x 1 block) and 1(3 x 1 block) --> 3 C 1 = 3 ways
0(4.5 x 1 block) and 4(3 x 1 block) --> 4 C 0 = 1 way
3 ways + 1 way = 4 ways
(This is where I get confused)
Panel 27 x 5 panel = 7958 ways
6(4.5 x 1 block) and 0(3 x 1) --> 6 C 0 = 1 way
4(4.5 x 1 block) and 3(3 x 1 block) --> 7 C 3 = 35 ways
2(4.5 x 1 block) and 6(3 x 1 block) --> 8 C 2 = 28 ways
0(4.5 x 1 block) and 9(3 x 1 block) --> 9 C 0 = 1 way
1 way + 35 ways + 28 ways + 1 way = 65 ways
As you can see here the number of ways is nowhere near 7958. What am I doing wrong here?
Also how would I find how many ways there are to construct a 48 x 10 panel?
Because it's a little difficult to do it by hand especially when trying to find 7958 ways.
How would I write a program to calculate the number of ways, for example the 7958 ways for the 27 x 5 panel?
Would it be easier to construct a program to calculate the result? Any help would be greatly appreciated.
I don't think the "choose" function is directly applicable, given your "the spaces between the blocks must not line up in adjacent rows" requirement. I also think this is where your analysis starts breaking down:
Panel: 12 x 3 panel = 4 ways -->
2(4.5 x 1 block) and 1(3 x 1 block)
--> 3 C 1 = 3 ways
0(4.5 x 1 block) and 4(3 x 1 block)
--> 4 C 0 = 1 way
3 ways + 1 way = 4 ways
...let's build some panels (1 | = 1 row, 2 -'s = 1 column):
+-----------------------+
|     |     |     |     |
|     |     |     |     |
|     |     |     |     |
+-----------------------+
+-----------------------+
|        |        |     |
|        |        |     |
|        |        |     |
+-----------------------+
+-----------------------+
|        |     |        |
|        |     |        |
|        |     |        |
+-----------------------+
+-----------------------+
|     |        |        |
|     |        |        |
|     |        |        |
+-----------------------+
Here we see that there are 4 different basic row types, but none of these are valid panels (they all violate the "blocks must not line up" rule). But we can use these row types to create several panels:
+---------------------------+
| | | | |
| | | | |
| | | |
+---------------------------+
+---------------------------+
| | | | |
| | | | |
| | | |
+---------------------------+
+---------------------------+
| | | | |
| | | | |
| | | |
+---------------------------+
+---------------------------+
| | | |
| | | |
| | | | |
+---------------------------+
+---------------------------+
| | | |
| | | |
| | | |
+---------------------------+
+---------------------------+
| | | |
| | | |
| | | |
+---------------------------+
...
But again, none of these are valid. The valid 12x3 panels are:
+-----------------------+
|     |     |     |     |
|        |     |        |
|     |     |     |     |
+-----------------------+
+-----------------------+
|        |     |        |
|     |     |     |     |
|        |     |        |
+-----------------------+
+-----------------------+
|        |        |     |
|     |        |        |
|        |        |     |
+-----------------------+
+-----------------------+
|     |        |        |
|        |        |     |
|     |        |        |
+-----------------------+
So there are in fact 4 of them, but in this case it's just a coincidence that it matches up with what you got using the "choose" function. In terms of total panel configurations, there are quite a few more than 4.
1. Find all ways to form a single row of the given width. I call this a "row type". Example 12x3: There are 4 row types of width 12: (3 3 3 3), (4.5 4.5 3), (4.5 3 4.5), (3 4.5 4.5). I would represent these as a list of the gaps. Example: (3 6 9), (4.5 9), (4.5 7.5), (3 7.5).
2. For each of these row types, find which other row types could fit on top of it.
   Example:
   a. On (3 6 9) fits (4.5 7.5).
   b. On (4.5 9) fits (3 7.5).
   c. On (4.5 7.5) fits (3 6 9).
   d. On (3 7.5) fits (4.5 9).
3. Enumerate the ways to build stacks of the given height from these rules. Dynamic programming is applicable to this, as at each level, you only need the last row type and the number of ways to get there.
Edit: I just tried this out on my coffee break, and it works. The solution for 48x10 has 15 decimal digits, by the way.
Edit: Here is more detail of the dynamic programming part:
Your rules from step 2 translate to an array of possible neighbours. Each element of the array corresponds to a row type, and holds that row type's possible neighbouring row types' indices.
0: (2)
1: (3)
2: (0)
3: (1)
In the case of 12×3, each row type has only a single possible neighbouring row type, but in general, it can be more.
The dynamic programming starts with a single row, where each row type has exactly one way of appearing:
1 1 1 1
Then, the next row is formed by adding for each row type the number of ways that possible neighbours could have formed on the previous row. In the case of a width of 12, the result is 1 1 1 1 again. At the end, just sum up the last row.
Complexity:
Finding the row types corresponds to enumerating the leaves of a tree; there are about (/ width 3) levels in this tree, so this takes a time of O(2^(w/3)) = O(2^w).
Checking whether two row types fit takes time proportional to their length, O(w/3). Building the cross table is proportional to the square of the number of row types. This makes step 2 O(w/3·2^(2w/3)) = O(2^w).
The dynamic programming takes height times the number of row types times the average number of neighbours (which I estimate to be logarithmic to the number of row types), O(h·2^(w/3)·w/3) = O(2^w).
As you see, this is all dominated by the number of row types, which grow exponentially with the width. Fortunately, the constant factors are rather low, so that 48×10 can be solved in a few seconds.
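For anyone who wants to try the three steps, the following Java sketch is one way they could be coded. It is not the program I ran on my coffee break: PanelCounter, countPanels and collectRowTypes are names invented here, widths are given in units of 1.5" (so the blocks are 2 and 3 units wide), each row type is stored as a bitmask of its gap positions, and the width is assumed to be below 63 units with a height of at least 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical implementation of the row-type / dynamic-programming idea.
class PanelCounter {
    // widthUnits is the panel width in units of 1.5" (3" block = 2, 4.5" block = 3).
    static long countPanels(int widthUnits, int height) {
        // Step 1: enumerate all row types as bitmasks of their interior gap positions.
        List<Long> found = new ArrayList<>();
        collectRowTypes(0, widthUnits, 0L, found);
        long[] rowTypes = new long[found.size()];
        for (int i = 0; i < rowTypes.length; i++) {
            rowTypes[i] = found.get(i);
        }

        // Step 3: ways[i] = number of stacks so far whose top row is row type i.
        long[] ways = new long[rowTypes.length];
        Arrays.fill(ways, 1L);                              // a single row: one way per row type
        for (int level = 1; level < height; level++) {
            long[] next = new long[rowTypes.length];
            for (int i = 0; i < rowTypes.length; i++) {
                for (int j = 0; j < rowTypes.length; j++) {
                    // Step 2: two rows fit on each other if they share no gap position.
                    if ((rowTypes[i] & rowTypes[j]) == 0) {
                        next[i] += ways[j];
                    }
                }
            }
            ways = next;
        }
        long total = 0;
        for (long w : ways) {
            total += w;
        }
        return total;
    }

    // Extends a partial row that currently ends at position pos with a 2-unit or 3-unit block.
    private static void collectRowTypes(int pos, int width, long gaps, List<Long> out) {
        if (pos == width) {
            out.add(gaps);
            return;
        }
        for (int block = 2; block <= 3; block++) {
            int end = pos + block;
            if (end > width) {
                continue;
            }
            // The right edge of the panel is not a gap, so only interior ends are recorded.
            long withGap = end < width ? gaps | (1L << end) : gaps;
            collectRowTypes(end, width, withGap, out);
        }
    }

    public static void main(String[] args) {
        System.out.println(countPanels(18, 5));             // the 27" x 5" panel from the question
        System.out.println(countPanels(32, 10));            // the 48" x 10" panel
    }
}
```

The compatibility check is done on the fly here instead of being stored in a cross table, which keeps the sketch short at the cost of repeating the comparisons on every level.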
This looks like the type of problem you could solve recursively. Here's a brief outline of an algorithm you could use, with a recursive method that accepts the previous layer and the number of remaining layers as arguments:
Start with the initial number of layers (e.g. 27x5 starts with remainingLayers = 5) and an empty previous layer
Test all possible layouts of the current layer
Try adding a 3x1 in the next available slot in the layer we are building. Check that (a) it doesn't go past the target width (e.g. doesn't go past 27 width in a 27x5) and (b) it doesn't violate the spacing condition given the previous layer
Keep trying to add 3x1s to the current layer until we have built a valid layer that is exactly (e.g.) 27 units wide
If we cannot use a 3x1 in the current slot, remove it and replace with a 4.5x1
Once we have a valid layer, decrement remainingLayers and pass it back into our recursive algorithm along with the layer we have just constructed
Once we reach remainingLayers = 0, we have constructed a valid panel, so increment our counter
The idea is that we build all possible combinations of valid layers. Once we have (in the 27x5 example) 5 valid layers on top of each other, we have constructed a complete valid panel. So the algorithm should find (and thus count) every possible valid panel exactly once.
This is a '2d bin packing' problem. Someone with decent mathematical knowledge will be able to help or you could try a book on computational algorithms. It is known as a "combinatorial NP-hard problem". I don't know what that means but the "hard" part grabs my attention :)
I have had a look at steel cutting programs and they mostly use a best guess. In this case though 2 x 4.5" stacked vertically can accommodate 3 x 3" stacked horizontally. You could possibly get away with no waste. Gets rather tricky when you have to figure out the best solution --- the one with minimal waste.
Here's a solution in Java, some of the array length checking etc is a little messy but I'm sure you can refine it pretty easily.
In any case, I hope this helps demonstrate how the algorithm works :-)
import java.util.Arrays;

public class Puzzle
{
    // Initial solve call
    public static int solve(int width, int height)
    {
        // Double the widths so we can use integers (6x1 and 9x1)
        int[] prev = {-1}; // Make sure we don't get any collisions on the first layer
        return solve(prev, new int[0], width * 2, height);
    }

    // Build the current layer recursively given the previous layer and the current layer
    private static int solve(int[] prev, int[] current, int width, int remaining)
    {
        // Check whether we have a valid frame
        if(remaining == 0)
            return 1;

        if(current.length > 0)
        {
            // Check for overflows
            if(current[current.length - 1] > width)
                return 0;

            // Check for aligned gaps
            for(int i = 0; i < prev.length; i++)
                if(prev[i] < width)
                    if(current[current.length - 1] == prev[i])
                        return 0;

            // If we have a complete valid layer
            if(current[current.length - 1] == width)
                return solve(current, new int[0], width, remaining - 1);
        }

        // Try adding a 6x1
        int total = 0;
        int[] newCurrent = Arrays.copyOf(current, current.length + 1);
        if(current.length > 0)
            newCurrent[newCurrent.length - 1] = current[current.length - 1] + 6;
        else
            newCurrent[0] = 6;
        total += solve(prev, newCurrent, width, remaining);

        // Try adding a 9x1
        if(current.length > 0)
            newCurrent[newCurrent.length - 1] = current[current.length - 1] + 9;
        else
            newCurrent[0] = 9;
        total += solve(prev, newCurrent, width, remaining);

        return total;
    }

    // Main method
    public static void main(String[] args)
    {
        // e.g. 27x5, outputs 7958
        System.out.println(Puzzle.solve(27, 5));
    }
}

How to efficiently store a matrix with highly-redundant values

I have a very large matrix (100M rows by 100M columns) that has lots of duplicate values right next to each other. For example:
8 8 8 8 8 8 8 8 8 8 8 8 8
8 4 8 8 1 1 1 1 1 8 8 8 8
8 4 8 8 1 1 1 1 1 8 8 8 8
8 4 8 8 1 1 1 1 1 8 8 8 8
8 4 8 8 1 1 1 1 1 8 8 8 8
8 4 8 8 1 1 1 1 1 8 8 8 8
8 8 8 8 8 8 8 8 8 8 8 8 8
8 8 3 3 3 3 3 3 3 3 3 3 3
I want a data structure/algorithm to store matrices like these as compactly as possible. For instance, the matrix above should only take O(1) space (even if the matrix was stretched out arbitrarily big), because there is only a constant number of rectangular regions, where each region only has one value.
The repetition happens both across rows and down columns, so the simple approach of compressing the matrix row-by-row isn't good enough. (That would require a minimum of O(num_rows) space to store any matrix.)
The representation of the matrix also needs to be accessible row-by-row, so that I can multiply the matrix by a column vector.
You could store the matrix as a quadtree with the leaves containing single values. Think of this as a two-dimensional "run" of values.
Now for my preferred method.
Ok, as I made mention in my previous answer rows with the same entries in each column in matrix A will multiply out to the same result in matrix AB. If we can maintain that relationship then we can theoretically speed up calculations significantly (a profiler is your friend).
In this method we maintain the row * column structure of the matrix.
Each row is compressed with whatever method can decompress fast enough not to affect the multiplication speed too much. RLE may be sufficient.
We now have a list of compressed rows.
We use an entropy encoding method (like Shannon-Fano, Huffman or arithmetic coding), but we don’t compress the data in the rows with this, we use it to compress the set of rows.
We use it to encode the relative frequency of the rows. I.e. we treat a row the same way standard entropy encoding would treat a character/byte.
In this example RLE compresses a row, and Huffman compresses the entire set of rows.
So, for example, given the following matrix (prefixed with row numbers, Huffman used for ease of explanation)
0 | 8 8 8 8 8 8 8 8 8 8 8 8 8 |
1 | 8 4 8 8 1 1 1 1 1 8 8 8 8 |
2 | 8 4 8 8 1 1 1 1 1 8 8 8 8 |
3 | 8 4 8 8 1 1 1 1 1 8 8 8 8 |
4 | 8 4 8 8 1 1 1 1 1 8 8 8 8 |
5 | 8 4 8 8 1 1 1 1 1 8 8 8 8 |
6 | 8 8 8 8 8 8 8 8 8 8 8 8 8 |
7 | 8 8 3 3 3 3 3 3 3 3 3 3 3 |
Run length encoded
0 | 8{13} |
1 | 8{1} 4{1} 8{2} 1{5} 8{4} |
2 | 8{1} 4{1} 8{2} 1{5} 8{4} |
3 | 8{1} 4{1} 8{2} 1{5} 8{4} |
4 | 8{1} 4{1} 8{2} 1{5} 8{4} |
5 | 8{1} 4{1} 8{2} 1{5} 8{4} |
6 | 8{13} |
7 | 8{2} 3{11} |
So, 0 and 6 appear twice and 1 – 5 appear 5 times. 7 only once.
Frequency table
A: 5 (1-5) | 8{1} 4{1} 8{2} 1{5} 8{4} |
B: 2 (0,6) | 8{13} |
C: 1 (7)   | 8{2} 3{11} |
Huffman tree
    0|1
   /   \
  A    0|1
      /   \
     B     C
So in this case it takes one bit (for each row) to encode rows 1 – 5, and 2 bits to encode rows 0, 6, and 7.
(If the runs are longer than a few bytes then do freq count on a hash that you build up as you do the RLE).
You store the Huffman tree, unique strings, and the row encoding bit stream.
The nice thing about Huffman is that it has a unique prefix property, so you always know when you are done. Thus, given the bit string 10000001011 you can rebuild the matrix A from the stored unique strings and the tree. The encoded bit stream tells you the order that the rows appear in.
You may want to look into adaptive Huffman encoding, or its arithmetic counterpart.
Seeing as rows in A with the same column entries multiply to the same result in AB over vector B you can cache the result and use it instead of calculating it again (it’s always good to avoid 100M*100M multiplications if you can).
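As a rough illustration of the caching idea, here is a Java sketch that run-length encodes each row and computes the dot product only once per distinct row. It leaves out the Huffman step entirely and, to keep it short, takes a small dense matrix as input, which a real 100M x 100M solution obviously could not do. RleRowMatrix, rle and multiply are names made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: RLE rows plus a per-distinct-row cache for matrix * vector.
class RleRowMatrix {
    // Encodes one row as alternating (value, runLength) entries.
    static List<Integer> rle(int[] row) {
        List<Integer> runs = new ArrayList<>();
        int i = 0;
        while (i < row.length) {
            int j = i;
            while (j < row.length && row[j] == row[i]) {
                j++;
            }
            runs.add(row[i]);       // the repeated value
            runs.add(j - i);        // how many times it repeats
            i = j;
        }
        return runs;
    }

    // Multiplies the matrix by a column vector, computing each distinct row only once.
    static long[] multiply(int[][] matrix, long[] vector) {
        // Prefix sums turn "value * sum of a vector slice" into a constant-time step per run.
        long[] prefix = new long[vector.length + 1];
        for (int k = 0; k < vector.length; k++) {
            prefix[k + 1] = prefix[k] + vector[k];
        }
        Map<List<Integer>, Long> cache = new HashMap<>();
        long[] result = new long[matrix.length];
        for (int r = 0; r < matrix.length; r++) {
            List<Integer> runs = rle(matrix[r]);
            Long cached = cache.get(runs);
            if (cached == null) {
                long sum = 0;
                int col = 0;
                for (int k = 0; k < runs.size(); k += 2) {
                    int value = runs.get(k);
                    int len = runs.get(k + 1);
                    sum += value * (prefix[col + len] - prefix[col]);
                    col += len;
                }
                cached = sum;
                cache.put(runs, cached);
            }
            result[r] = cached;
        }
        return result;
    }
}
```

With the 8-row example matrix, only three dot products are actually computed (one each for A, B and C); the other five rows are served from the cache.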
Links to further info:
Arithmetic Coding + Statistical Modeling = Data Compression
Priority Queues and the STL
Arithmetic coding
Huffman coding
A Comparison
Uncompressed
    0   1   2   3   4   5   6   7
  =================================
0 | 3   3   3   3   3   3   3   3 |
  |-------+               +-------|
1 | 4   4 | 3   3   3   3 | 4   4 |
  |       +-----------+---+       |
2 | 4   4 | 5   5   5 | 1 | 4   4 |
  |       |           |   |       |
3 | 4   4 | 5   5   5 | 1 | 4   4 |
  |---+---|           |   |       |
4 | 5 | 0 | 5   5   5 | 1 | 4   4 |
  |   |   +---+-------+---+-------|
5 | 5 | 0   0 | 2   2   2   2   2 |
  |   |       |                   |
6 | 5 | 0   0 | 2   2   2   2   2 |
  |   |       +-------------------|
7 | 5 | 0   0   0   0   0   0   0 |
  =================================
= 64 bytes
Quadtree
0 1 2 3 4 5 6 7
=================================
0 | 3 | 3 | | | 3 | 3 |
|---+---| 3 | 3 |---+---|
1 | 4 | 4 | | | 4 | 4 |
|-------+-------|-------+-------|
2 | | | 5 | 1 | |
| 4 | 5 |---+---| 4 |
3 | | | 5 | 1 | |
|---------------+---------------|
4 | 5 | 0 | 5 | 5 | 5 | 1 | 4 | 4 |
|---+---|---+---|---+---|---+---|
5 | 5 | 0 | 0 | 2 | 2 | 2 | 2 | 2 |
|-------+-------|-------+-------|
6 | 5 | 0 | 0 | 2 | 2 | 2 | 2 | 2 |
|---+---+---+---|---+---+---+---|
7 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
=================================
0 +- 0 +- 0 -> 3
| +- 1 -> 3
| +- 2 -> 4
| +- 3 -> 4
+- 1 -> 3
+- 2 -> 4
+- 3 -> 5
1 +- 0 -> 3
+- 1 +- 0 -> 3
| +- 1 -> 3
| +- 2 -> 4
| +- 3 -> 4
+- 2 +- 0 -> 5
| +- 1 -> 1
| +- 2 -> 5
| +- 3 -> 1
+- 3 -> 4
2 +- 0 +- 0 -> 5
| +- 1 -> 0
| +- 2 -> 5
| +- 3 -> 0
+- 1 +- 0 -> 5
| +- 1 -> 5
| +- 2 -> 0
| +- 3 -> 2
+- 2 +- 0 -> 5
| +- 1 -> 0
| +- 2 -> 5
| +- 3 -> 0
+- 3 +- 0 -> 0
+- 1 -> 2
+- 2 -> 0
+- 3 -> 0
3 +- 0 +- 0 -> 5
| +- 1 -> 1
| +- 2 -> 2
| +- 3 -> 2
+- 1 +- 0 -> 4
| +- 1 -> 4
| +- 2 -> 2
| +- 3 -> 2
+- 2 +- 0 -> 2
| +- 1 -> 2
| +- 2 -> 0
| +- 3 -> 0
+- 3 +- 0 -> 2
+- 1 -> 2
+- 2 -> 0
+- 3 -> 0
((1*4) + 3) + ((2*4) + 2) + (4 * 8) = 49 leaf nodes
49 * (2 + 1) = 147 (2 * 8 bit indexer, 1 byte data)
+ 14 inner nodes -> 2 * 14 bytes (2 * 8 bit indexers)
= 175 Bytes
Region Hash
    0   1   2   3   4   5   6   7
  =================================
0 | 3   3   3   3   3   3   3   3 |
  |-------+---------------+-------|
1 | 4   4 | 3   3   3   3 | 4   4 |
  |       +-----------+---+       |
2 | 4   4 | 5   5   5 | 1 | 4   4 |
  |       |           |   |       |
3 | 4   4 | 5   5   5 | 1 | 4   4 |
  |---+---|           |   |       |
4 | 5 | 0 | 5   5   5 | 1 | 4   4 |
  |   + - +---+-------+---+-------|
5 | 5 | 0   0 | 2   2   2   2   2 |
  |   |       |                   |
6 | 5 | 0   0 | 2   2   2   2   2 |
  |   +-------+-------------------|
7 | 5 | 0   0   0   0   0   0   0 |
  =================================
0: (4,1; 4,1), (5,1; 6,2), (7,1; 7,7) | 3
1: (2,5; 4,5) | 1
2: (5,3; 6,7) | 1
3: (0,0; 0,7), (1,2; 1,5) | 2
4: (1,0; 3,1), (1,6; 4,7) | 2
5: (2,2; 4,4), (4,0; 7,0) | 2
Regions: (3 + 1 + 1 + 2 + 2 + 2) * 5
= 55 bytes (4 bytes rectangle, 1 byte data)
(Lookup table is a sorted array, so it does not need extra storage.)
Huffman encoded RLE
0 | 3 {8} | 1
1 | 4 {2} | 3 {4} | 4 {2} | 2
2,3 | 4 {2} | 5 {3} | 1 {1} | 4 {2} | 4
4 | 5 {1} | 0 {1} | 5 {3} | 1 {1} | 4 {2} | 5
5,6 | 5 {1} | 0 {2} | 2 {5} | 3
7 | 5 {1} | 0 {7} | 2
RLE Data: (1 + 3+ 4 + 5 + 3 + 2) * 2 = 36
Bit Stream: 20 bits packed into 3 bytes = 3
Huffman Tree: 10 nodes * 3 = 30
= 69 Bytes
One Giant RLE stream
3{8};4{2};3{4};4{4};5{3};1{1};4{4};5{3};1{1};4{2};5{1};0{1};
5{3};1{1};4{2};5{1};0{2};2{5};5{1};0{2};2{5};5{1};0{7}
= 2 * 23 = 46 Bytes
One Giant RLE stream encoded with common prefix folding
3{8};
4{2};3{4};
4{4};5{3};1{1};
4{4};5{3};
1{1};4{2};5{1};0{1};5{3};
1{1};4{2};5{1};0{2};2{5};
5{1};0{2};2{5};
5{1};0{7}
0 + 0 -> 3{8};4{2};3{4};
  + 1 -> 4{4};5{3};1{1};
1 + 0 -> 4{2};5{1} + 0 -> 0{1};5{3};1{1};
  |                + 1 -> 0{2}
  |
  + 1 -> 2{5};5{1} + 0 -> 0{2};
                   + 1 -> 0{7}
3{8};4{2};3{4} | 00
4{4};5{3};1{1} | 01
4{4};5{3};1{1} | 01
4{2};5{1};0{1};5{3};1{1} | 100
4{2};5{1};0{2} | 101
2{5};5{1};0{2} | 110
2{5};5{1};0{7} | 111
Bit stream: 000101100101110111
RLE Data: 16 * 2 = 32
Tree: : 5 * 2 = 10
Bit stream: 18 bits in 3 bytes = 3
= 45 bytes
If your data is really regular, you might benefit from storing it in a structured format; e.g. your example matrix might be stored as the following list of "fill-rectangle" instructions:
(0,0)-(13,7) = 8
(4,1)-(8,5) = 1
(Then to look up the value of a particular cell, you'd iterate backwards through the list until you found a rectangle that contained that cell)
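A minimal Java sketch of that backwards lookup could look like this. FillRectangles, Rect, lookup and example are names invented here, coordinates are read as (column, row) with inclusive corners (my assumption), and only the two instructions listed above are included, so the 4s and 3s of the original matrix are not covered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical "fill-rectangle" store: later rectangles paint over earlier ones.
class FillRectangles {
    static final class Rect {
        final int x0, y0, x1, y1, value;

        Rect(int x0, int y0, int x1, int y1, int value) {
            this.x0 = x0;
            this.y0 = y0;
            this.x1 = x1;
            this.y1 = y1;
            this.value = value;
        }

        // Both corners are treated as inclusive.
        boolean contains(int x, int y) {
            return x0 <= x && x <= x1 && y0 <= y && y <= y1;
        }
    }

    // Scans from the last instruction to the first and returns the first hit.
    static int lookup(List<Rect> rects, int x, int y) {
        for (int i = rects.size() - 1; i >= 0; i--) {
            if (rects.get(i).contains(x, y)) {
                return rects.get(i).value;
            }
        }
        throw new IllegalArgumentException("cell is not covered by any rectangle");
    }

    // The two instructions from the list above.
    static List<Rect> example() {
        List<Rect> rects = new ArrayList<>();
        rects.add(new Rect(0, 0, 13, 7, 8));
        rects.add(new Rect(4, 1, 8, 5, 1));
        return rects;
    }
}
```

A linear scan is fine for a handful of rectangles; with many regions you would want a spatial index on top of the same idea.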
As Ira Baxter suggested,
you could store the matrix as a quadtree with the leaves containing single values.
The simplest way to do this is for every node of the quadtree to cover an area 2^n x 2^n,
and each non-leaf node points to its 4 children of size 2^(n-1) x 2^(n-1).
You might get slightly better compression with an adaptive quadtree that allows irregular sub-division.
Then each non-leaf node stores the cut-point (B,G) and points to its 4 children.
For example, if some non-leaf node covers an area from (A,F) in the upper-left corner to (C,H) in the lower-right corner,
then its 4 children cover areas
(A,F) to (B-1, G-1)
(A,G) to (B-1, H)
(B,F) to (C,G-1)
(B,G) to (C,H).
You would try to pick the (B,G) cut-point for each non-leaf node such that it lines up with some real division in your data.
For example, say you have a matrix with a small square in the middle filled with nines and zero elsewhere.
With the simple powers-of-two quadtree, you'll end up with at least 21 nodes: 5 non-leaf nodes, 4 leaf nodes of nines, and 12 leaf nodes of zeros.
(You'll get even more nodes if the centered small square is not precisely some power-of-two distance from the left and top edges, and not itself some precise power-of-two).
With an adaptive quadtree, if you are smart enough to pick the cut-point for the root node at the upper-left corner of that square, and then for the root's lower-right child you pick a cut-point at the lower-right corner of the square, you can represent the entire matrix in 9 nodes: 2 non-leaf nodes, 1 leaf node for the nines, and 6 leaf nodes for the zeros.
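To show how little such an adaptive quadtree needs to store, here is a small Java sketch of the nine-node example. The names (AdaptiveQuadtree, Node, lookup, countNodes, nineSquareExample) and the concrete sizes are my own assumptions: an 8 x 8 matrix whose cells with both coordinates in 2..5 hold nines. The child order follows the four areas listed above.

```java
// Hypothetical adaptive quadtree: inner nodes store a cut-point, leaves store a value.
class AdaptiveQuadtree {
    static final class Node {
        final int value;        // meaningful for leaves only
        final int cutX, cutY;   // meaningful for inner nodes only (the cut-point (B,G))
        final Node[] kids;      // null for leaves, otherwise exactly four children

        private Node(int value, int cutX, int cutY, Node[] kids) {
            this.value = value;
            this.cutX = cutX;
            this.cutY = cutY;
            this.kids = kids;
        }

        static Node leaf(int value) {
            return new Node(value, 0, 0, null);
        }

        // Children in the order: (x<B,y<G), (x<B,y>=G), (x>=B,y<G), (x>=B,y>=G).
        static Node split(int cutX, int cutY, Node a, Node b, Node c, Node d) {
            return new Node(0, cutX, cutY, new Node[]{a, b, c, d});
        }
    }

    // Walks down from the root, choosing the child on the correct side of each cut-point.
    static int lookup(Node node, int x, int y) {
        while (node.kids != null) {
            int index = (x >= node.cutX ? 2 : 0) + (y >= node.cutY ? 1 : 0);
            node = node.kids[index];
        }
        return node.value;
    }

    static int countNodes(Node node) {
        if (node.kids == null) {
            return 1;
        }
        int count = 1;
        for (Node kid : node.kids) {
            count += countNodes(kid);
        }
        return count;
    }

    // Root cut at the square's upper-left corner (2,2); its lower-right child is cut
    // just past the square's lower-right corner, at (6,6).
    static Node nineSquareExample() {
        Node lowerRight = Node.split(6, 6,
                Node.leaf(9), Node.leaf(0), Node.leaf(0), Node.leaf(0));
        return Node.split(2, 2,
                Node.leaf(0), Node.leaf(0), Node.leaf(0), lowerRight);
    }
}
```

Because the cut-points fully determine each child's area, the nodes do not need to store their own bounds.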
Do you know about interval trees?
Interval trees are a way to store intervals efficiently, and then query them. A generalization is the Range Tree, which can be adapted to any dimension.
Here you could effectively describe your rectangles and attach a value to them. Of course the rectangles can overlap, that's what will make it efficient.
0,0-n,n --> 8
4,4-7,7 --> 1
8,8-8,n --> 3
Then when querying for a value in one particular spot, you are returned a list of several rectangles and need to determine the innermost one: this is the value in this spot.
The simplest approach is to use run-length encoding on one dimension and not worry about the other dimension.
(If the dataset weren't so incredibly huge, interpreting it as an image and using a standard lossless image compression method would be very simple also--but since you'd have to work on making the algorithm work on sparse matrices, it wouldn't end up being all that simple.)
Another simple approach is to try a rectangular flood fill: start at the top-right pixel and grow it into the largest rectangle you can (breadth-first); then mark all those pixels as "done", take the top-rightmost remaining pixel, and repeat until done. (You'd probably want to store these rectangles in some sort of BSP or quad-tree.)
A highly effective technique--not optimal, but probably good enough--is to use a binary space partitioning tree where "space" is measured not spatially but by number of changes. You'd recursively cut so that you have equal numbers of changes on the left and right (or top and bottom--presumably you'd want to keep things square) and, as your sizes got smaller, so that you would cut as many changes as possible. Eventually, you'll end up cutting two rectangles apart from each other, each of which has all the same number; then stop. (Encoding by RLE in x and y will quickly tell you where the change points are.)
Your description of O(1) space for a matrix of size 100M x 100M is confusing. When you have a finite matrix, its size is a constant (unless the program that generates the matrix alters it). So the amount of space required to store it is also a constant, even if you multiply it by a scalar. And the time to read and write the matrix is definitely not going to be O(1).
Sparse matrix is what I could think of to reduce the amount of space required to store such a matrix. You can write this sparse matrix to a file and store it as a tar.gz which will further compress the data.
I do have a question: what does the M in 100M denote? Does it mean mega/million? If so, the matrix has 100 x 10^6 x 100 x 10^6 = 10^16 entries; at one byte per entry that is 10^16 bytes = 10^10 MB = 10^4 TB! What kind of machine are you using?
I'm not sure why this question was made Community Wiki, but so it goes.
I'll rely on the assumption that you have a linear algebra application, and that your matrix has a rectangular type of redundancy. If so, then you can do something much better than quadtrees, and cleaner than cutting the matrix into rectangles (which is generally the right idea).
Let M be your matrix, let v be the vector that you want to multiply by M, and let
A be the special matrix
A = [1 -1  0  0  0]
    [0  1 -1  0  0]
    [0  0  1 -1  0]
    [0  0  0  1 -1]
    [0  0  0  0  1]
You'll also need the inverse matrix to A, which I'll call B:
B = [1 1 1 1 1]
    [0 1 1 1 1]
    [0 0 1 1 1]
    [0 0 0 1 1]
    [0 0 0 0 1]
Multiplying a vector v by A is fast and easy: You just take differences of consecutive pairs of elements of v. Multiplying a vector v by B is also fast and easy: The entries of Bv are partial sums of the elements of v (each entry is the sum from that position to the end). Then you want to use the equation
Mv = B (AMA) B v
The matrix AMA is sparse: In the middle, each entry is an alternating sum of 4 entries of M that make a 2 x 2 square. You have to be at a corner of one of the rectangles in M for this alternating sum to be non-zero. Since AMA is sparse, you can store its non-zero entries in an associative array and use sparse matrix multiplication to apply it to a vector.
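To make this concrete, here is a small pure-Python sketch. The function names and the 6 x 6 example matrix are mine, and a dict keyed by (row, column) stands in for the associative array. It builds the non-zero entries of AMA from the 2 x 2 alternating sums and then applies B, AMA and B in turn:

```python
def suffix_sums(v):
    """Multiply by B: entry i becomes v[i] + v[i+1] + ... + v[-1]."""
    out = list(v)
    for i in range(len(out) - 2, -1, -1):
        out[i] += out[i + 1]
    return out


def compress(m):
    """Return the non-zero entries of A*M*A as {(i, j): value}."""
    n_rows, n_cols = len(m), len(m[0])

    def at(i, j):
        # Entries outside the matrix count as zero.
        return m[i][j] if 0 <= i < n_rows and 0 <= j < n_cols else 0

    sparse = {}
    for i in range(n_rows):
        for j in range(n_cols):
            # Alternating sum over a 2 x 2 square of M.
            d = at(i, j) - at(i + 1, j) - at(i, j - 1) + at(i + 1, j - 1)
            if d != 0:
                sparse[(i, j)] = d
    return sparse


def multiply(sparse, n_rows, v):
    """Compute M*v as B * (AMA) * (B*v)."""
    w = suffix_sums(v)               # B v
    u = [0] * n_rows
    for (i, j), d in sparse.items():
        u[i] += d * w[j]             # (AMA) (B v)
    return suffix_sums(u)            # B (AMA) (B v)


# Example: a 6 x 6 matrix of eights with a 3 x 3 block of ones.
M_EXAMPLE = [[8] * 6 for _ in range(6)]
for i in range(1, 4):
    for j in range(2, 5):
        M_EXAMPLE[i][j] = 1
```

For this example the 36 entries collapse to 5 non-zero entries of AMA: one for the background and four for the corners of the block. The `compress` step here scans the whole dense matrix, which is only workable for a demonstration; for a huge matrix you would have to generate the corner entries directly from whatever describes the rectangles.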
I do not have a specific answer for the matrix you have shown. In finite element analysis (FEA), you have matrices with redundant data. When implementing an FEA package for my undergraduate project, I used the skyline storage method.
Some links:
Intel page for sparse matrix storage
Wikipedia link
The first thing to try is always the existing libraries and solutions. It is a lot of work getting custom formats working with all the operations you're going to want in the end. Sparse matrices are an old problem, so make sure you read up on the existing stuff.
Assuming you don't find something suitable, I would recommend a row-based format. Don't try to be too fancy with super-compact representations: you will end up with lots of processing needed for every little operation, and bugs in your code. Instead, try to compress each row separately. You know you are going to have to scan through each row for the matrix-vector multiplication, so make life easy for yourself.
I would start with run-length-encoding, see how that works first. Once that is working, try adding some tricks like references to sections of the previous row. So a row might be encoded as: 126 zeros, 8 ones, 1000 entries copied directly from row above, 32 zeros. That seems like it might be very efficient with your given example.
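A rough sketch of what such a row format could look like. The op names `"run"` and `"copy"` and the tuple layout are made up for illustration. Each row is a list of operations, and a `"copy"` takes its entries from the same columns of the row above:

```python
def decode_row(ops, previous):
    """Expand one encoded row; `previous` is the decoded row above it."""
    row = []
    for op in ops:
        if op[0] == "run":            # ("run", count, value)
            _, count, value = op
            row.extend([value] * count)
        else:                         # ("copy", count)
            _, count = op
            start = len(row)
            row.extend(previous[start:start + count])
    return row


def multiply(encoded_rows, v):
    """Matrix-vector product, keeping only one decoded row in memory."""
    result, previous = [], []
    for ops in encoded_rows:
        row = decode_row(ops, previous)
        result.append(sum(a * b for a, b in zip(row, v)))
        previous = row
    return result


ENCODED = [
    [("run", 2, 0), ("run", 3, 1), ("run", 1, 0)],  # 0 0 1 1 1 0
    [("run", 1, 5), ("copy", 4), ("run", 1, 2)],    # 5 0 1 1 1 2
    [("copy", 6)],                                  # same as the row above
]
```

This version expands every row in full before multiplying. A more careful implementation would multiply each run by a precomputed partial sum of v instead, but the idea is the same.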
Many of the above solutions are fine.
If you are working with a file consider file oriented
compression tools like compress, bzip, zip, bzip2 and friends.
They work very well especially if the data contains redundant
ASCII characters. Using an external compression tool eliminates
problems and challenges inside your code and will compress
both binary and ASCII data.
In your example you are displaying one-character numbers.
The numbers 0-9 can be represented by a smaller four-bit
encoding pattern. You can use the additional bits in
a byte as a count. Four bits gives you extra codes to
escape to extras... But there is a caution which reaches
back to the old Y2K bugs where two characters were used
for a year. Byte encoding from an offset would have given
255 years and the same two bytes would span all of written
history and then some.
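One possible reading of the four-bit idea, as a sketch: the low nibble of each byte holds the digit and the high nibble holds the run length minus one, so a single byte covers up to 16 repeats. This byte layout is an assumption of mine, not a standard format, and the spare codes 10-15 are left unused here.

```python
def pack(digits):
    """Pack values 0-15 into bytes: high nibble = run length - 1, low nibble = value."""
    out = bytearray()
    i = 0
    while i < len(digits):
        d = digits[i]
        run = 1
        # Extend the run, capped at 16 so the count fits in four bits.
        while i + run < len(digits) and digits[i + run] == d and run < 16:
            run += 1
        out.append(((run - 1) << 4) | d)
        i += run
    return bytes(out)


def unpack(data):
    """Inverse of pack()."""
    out = []
    for byte in data:
        out.extend([byte & 0x0F] * ((byte >> 4) + 1))
    return out


EXAMPLE = [8] * 20 + [1] * 3 + [0]
```

Here the 24 digits of `EXAMPLE` pack into 4 bytes (runs of 16, 4, 3 and 1).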
You may want to take a look at GIF format and its compression algorithm. Just think about your matrix as a Bitmap...
Let me check my assumptions, if for no other reason than to guide my thinking about the problem:
The matrix is highly redundant, not necessarily sparse.
We want to minimize storage (on disk and RAM).
We want to be able to multiply A[m*n] by vector B[n*1] to get to AB[m*1] without first decompressing either (at least not more than required to do the calculations).
We don’t need random access to any A[i*j] entry --all operations are over the matrix.
The multiplication is done online (as needed), and so must be as efficient as possible.
The matrix is static.
One can try all kinds of clever schemes to detect rectangles or self similarity etc, but that is going to end up hurting performance when doing the multiplication. I propose 2 relatively simple solutions.
I am going to have to work backwards a bit, so please be patient with me.
If the data is predominantly biased towards horizontal repetition then the following may work well.
Think of the matrix flattened into an array (this is really the way it is stored in memory anyway). E.g.
A
| w0 w1 w2 |
| x0 x1 x2 |
| y0 y1 y2 |
| z0 z1 z2 |
becomes
A’
| w0 w1 w2 x0 x1 x2 y0 y1 y2 z0 z1 z2 |
We can use the fact that any index [i,j] maps to flat position k = i * n + j, where n is the number of columns.
So, when we do the multiplication we iterate over the “matrix” array A’ with k = [0..m*n-1] and index into the vector B using (k mod n) and into vector AB with (k div n). “div” being integer division.
So, for example, A’[10] = z1: 10 mod 3 = 1 and 10 div 3 = 3, so A’[10] = A[3,1] = z1.
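As a quick sketch of that loop (the function name and the sample numbers are mine), multiplying a flattened m x n matrix by a vector looks like this:

```python
def multiply_flat(flat, m, n, b):
    """Multiply a row-major flattened m x n matrix by the vector b."""
    ab = [0] * m
    for k in range(m * n):
        # k div n is the row (index into AB), k mod n is the column (index into B).
        ab[k // n] += flat[k] * b[k % n]
    return ab


# A 4 x 3 matrix holding 0..11 in row-major order.
FLAT = list(range(12))
B_VEC = [1, 10, 100]
```

With these values, `multiply_flat(FLAT, 4, 3, B_VEC)` gives `[210, 543, 876, 1209]`, and `divmod(10, 3)` gives `(3, 1)`, matching the z1 example.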
Now, on to the compression.
We do normal run of the mill Run Length Encoding (RLE), but against the A’, not A. With the flat array there will be longer sequences of repetition, hence better compression. Then after encoding the runs we do another process where we extract common substrings. We can either do a form of dictionary compression, or process the run data into some form of space optimized graph like a radix tree/suffix tree or a device of your own creation that merges tops and tails. The graph should have a representation of all the unique strings in the data. You can pick any number of methods to break the stream into strings: matching prefixes, length, or something else (whatever suits your graph best) but do it on a run boundary, not bytes or your decoding will be made more complicated. The graph becomes a state machine when we decompress the stream.
I’m going to use a bit stream and a Patricia trie as an example, because it is simplest, but you can use something else (more bits per state change, better merging, etc.; look for papers by Stefan Nilsson).
To compress the run data we build a hash table against the graph. The table maps a string to a bit sequence. You can do this by walking the graph and encoding each left branch as 0 and right branch as 1 (arbitrary choice).
Process the run data and build up a bit string until you get a match in the hash table, output the bits and clear the string (the bits will not be on a byte boundary, so you may have to buffer until you get a sequence long enough to write out). Rinse and repeat until you have processed the complete run data stream. You store the graph and the bit stream. The bit stream encodes strings, not bytes.
If you reverse the process, using the bit stream to walk the graph until you reach a leaf/terminal node you get back the original run data, which you can decode on the fly to produce the stream of integers that you multiply against the vector B to get AB. Each time you run out of runs you read the next bit and lookup its corresponding string. We don’t care that we don’t have random access into A, because we only need it in B (B which can be range / interval compressed but doesn’t need to be).
So even though RLE is biased towards horizontal runs we still get good vertical compression because common strings are stored only once.
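A minimal sketch of the first stage only: RLE over the flattened array, decoded on the fly during the multiplication. The dictionary/trie and bit-stream stage is left out, and the names and the sample matrix are mine.

```python
from itertools import groupby


def rle_encode(flat):
    """Run-length encode a flat sequence as (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(flat)]


def multiply_rle(runs, n, b):
    """Multiply the RLE-encoded flat matrix (n columns) by the vector b."""
    ab = []
    k = 0  # position in the flattened matrix
    for value, count in runs:
        for _ in range(count):
            if k % n == 0:
                ab.append(0)  # a new row starts here
            ab[-1] += value * b[k % n]
            k += 1
    return ab


# 3 x 4 matrix; the run of sevens crosses two row boundaries.
FLAT = [0, 0, 7, 7,
        7, 7, 7, 7,
        7, 0, 0, 0]
RUNS = rle_encode(FLAT)
```

The 12 entries become 3 runs, because runs are allowed to continue across the end of a row. That is the point of encoding A’ rather than each row of A.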
I will explain the other method in a separate answer, as this is getting too long as it is, but that method can actually speed up the calculation because repeated rows in matrix A multiply to the same result in AB.
OK, you need a compression algorithm: try RLE (run-length encoding). It works very well when the data is highly redundant.

Resources