What is the appropriate method to cluster a binary matrix?

I am a beginner in clustering, and I have a binary matrix in which each row is a student and each column marks whether that student is enrolled in a session. I want to cluster students who have the same sessions.
There are many clustering methods, and the right one depends on the dataset.
For example, k-means is not appropriate, because the data is binary and the standard "mean" operation does not make much sense for binary values.
I'm open to any suggestions.
Here's an example:
+----------+----------+----------+----------+
| session1 | session2 | session3 | session4 |
+----------+----------+----------+----------+
|    1     |    0     |    1     |    0     |
|    0     |    1     |    0     |    1     |
|    1     |    0     |    1     |    0     |
|    0     |    1     |    0     |    1     |
+----------+----------+----------+----------+
Expected result:
clusterA = [user1,user3]
clusterB = [user2,user4]

You could use the Jaccard distance for each pair of points.
In R:
# create data table
mat = data.frame(s1 = c(T,F,T,F), s2 = c(F,T,F,T),
s3 = c(T,F,T,F), s4 = c(F,T,F,T))
Result:
s1 s2 s3 s4
1 TRUE FALSE TRUE FALSE
2 FALSE TRUE FALSE TRUE
3 TRUE FALSE TRUE FALSE
4 FALSE TRUE FALSE TRUE
dist(mat, method="binary") # jaccard distance
Result:
1 2 3
2 1
3 0 1
4 1 0 1
Row 3 has a distance of 1 from row 4.
By chance, the distances are all exactly 1 and 0 here. These are actually floats. (Your toy dataset may be too simplistic here)
Cluster them:
hclust(dist(mat, method="binary"))
Result (not so informative):
Call:
hclust(d = dist(mat, method = "binary"))
Cluster method : complete
Distance : binary
Number of objects: 4
Create a dendrogram plot:
plot(hclust(dist(mat, method="binary")))
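For readers who want to see the idea outside R, here is a minimal Python sketch. It computes the same Jaccard distance, but instead of `hclust` it groups rows with a simple threshold-based single-linkage pass, which is a swapped-in simplification, not what the R code above does. The function names and the `threshold` value are assumptions made for illustration.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 - |intersection| / |union| over the positions that are set to 1."""
    union = sum(1 for x, y in zip(a, b) if x or y)
    if union == 0:
        return 0.0  # convention chosen here: two all-zero rows count as identical
    both = sum(1 for x, y in zip(a, b) if x and y)
    return 1.0 - both / union

def cluster_by_threshold(rows, threshold=0.5):
    """Group rows whose Jaccard distance is below `threshold` (single linkage)."""
    parent = list(range(len(rows)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(rows)), 2):
        if jaccard_distance(rows[i], rows[j]) < threshold:
            parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(rows)):
        clusters.setdefault(find(i), []).append(f"user{i + 1}")
    return sorted(clusters.values())

# The four students from the question.
students = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
clusters = cluster_by_threshold(students)
print(clusters)  # [['user1', 'user3'], ['user2', 'user4']]
```

On real data the distances will not all be 0 or 1, so the threshold (or the height at which a dendrogram is cut) has to be chosen by inspection.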

Related

Question about the behavior of the fortran function MATMUL()

I am a newbie in Fortran and I have to multiply matrices of different shapes with MATMUL(), but the result is not what I expected.
Here is my Fortran code:
integer, dimension(3,2) :: a
integer, dimension(2,2) :: b
integer :: i, j
a = reshape((/ 1, 1, 1, 1, 1, 1 /), shape(a))
b = MATMUL(a,TRANSPOSE(a))
do j = 1, 2
   do i = 1, 2
      print*, b(i, j)
   end do
end do
I expected this matrix as a result:
b =
| 3 3 | , a 2x2 matrix
| 3 3 |
Instead, I got this error message:
matmlt.f90(9): error #6366: The shapes of the array expressions do not conform. [B]
b = MATMUL(a,TRANSPOSE(a))
------^
To make this code work properly, I had to switch the MATMUL arguments like this:
b = MATMUL(TRANSPOSE(a), a)
This way, I obtain what I was expecting at the beginning. But this is not intuitive.
On paper,
a =
| 1 1 1 |
| 1 1 1 |
transpose(a) =
| 1 1 |
| 1 1 |
| 1 1 |
a x transpose(a) =
| 3 3 |
| 3 3 |
and
transpose(a) x a =
| 2 2 2 |
| 2 2 2 |
| 2 2 2 |
What is wrong with my code?
Thank you.
Your declaration of the variable
integer, dimension(3,2) :: a
means that a has 3 rows and 2 columns (contrary to your assumption). Consequently,
a =
| 1 1 |
| 1 1 |
| 1 1 |
and
transpose(a) =
| 1 1 1 |
| 1 1 1 |
matmul(a,transpose(a)) =
|2 2 2|
|2 2 2|
|2 2 2|
so your variable b should be declared as
integer, dimension (3,3) :: b
instead of
integer, dimension (2,2) :: b
This shape mismatch is the reason for the error:
matmlt.f90(9): error #6366: The shapes of the array expressions do not conform. [B]
b = MATMUL(a,TRANSPOSE(a))
------^
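The shape rule behind this can be checked with a short Python sketch. It uses plain nested lists rather than any Fortran binding, and the helper names `shape`, `transpose` and `matmul` are illustrative only. A 3x2 matrix times its 2x3 transpose gives a 3x3 result, while the transpose times the matrix gives 2x2.

```python
def shape(m):
    """Return (rows, columns) of a nested-list matrix."""
    return (len(m), len(m[0]))

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    """Plain matrix product; inner dimensions must agree."""
    if shape(x)[1] != shape(y)[0]:
        raise ValueError("inner dimensions do not match")
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]

# 3 rows, 2 columns, like `integer, dimension(3,2) :: a` filled with ones.
a = [[1, 1], [1, 1], [1, 1]]

print(shape(matmul(a, transpose(a))))  # (3, 3)
print(shape(matmul(transpose(a), a)))  # (2, 2)
```

So a 2x2 `b` can only receive `MATMUL(TRANSPOSE(a), a)`; the other order needs a 3x3 target.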

How to rearrange the elements of a matrix in Matlab according to the order of elements in another matrix?

I have a vector A in Matlab of dimension (m*n)x1 composed of real numbers greater than or equal to zero, e.g. m=3, n=4, A=[1;0;0;0;0.4;0.7;0.5;0.6;0.8;0;1;6], which looks like
A=[1
0
0
---
0
0.4
0.7
---
0.5
0.6
0.8
---
0
1
6]
We can see that A is composed of n subvectors of dimension m. I have a vector B of dimension gx1, with g greater than or equal to m, composed of ones and zeros such that the total number of ones is equal to m, e.g. g=9, B=[1;0;0;0;0;0;0;1;1], which looks like
B=[1
0
0
0
0
0
0
1
1]
I want to create a matrix C of dimension gxn in which the entries of each subvector of A are placed, one subvector per column of C, in the rows where B has ones, e.g.
C=[1 | 0 | 0.5 | 0
0 | 0 | 0 | 0
0 | 0 | 0 | 0
0 | 0 | 0 | 0
0 | 0 | 0 | 0
0 | 0 | 0 | 0
0 | 0 | 0 | 0
0 | 0.4| 0.6 | 1
0 | 0.7| 0.8 | 6]
Loops are fine only if very fast. Real dimensions of matrices are very large (e.g. mxn=100000, g=50000)
Approach #1 (bsxfun based linear indexing)
C = zeros(g,n) %// Pre-allocate
idx = bsxfun(@plus,find(B),[0:n-1]*g) %// Indices where A elements are to be put
C(idx)= A %// Put A elements
Approach #2 (Direct replacement)
C = zeros(g,n)
C(find(B),:) = reshape(A,m,[])
Pre-allocation: For faster pre-allocation you can do this instead in both of the above mentioned approaches -
C(g,n) = 0;
You can also try the repmat and logical indexing approach. First, reshape your data so that it's in the right matrix form, so in your case 3 x 4, then use B as a logical mask and replicate it for as many times as we have columns in your matrix, then do an assignment. You would have to allocate a matrix that is the size of your desired output before doing this assignment. Something like this:
%// Your example data
m = 3; n = 4;
A = [1;0;0;0;0.4;0.7;0.5;0.6;0.8;0;1;6];
B = logical([1 0 0 0 0 0 0 1 1]).';
%// Relevant code
Am = reshape(A, m, n);
Bm = repmat(B, 1, n);
C = zeros(numel(B), n);
C(Bm) = Am;
C is the desired result, where we get:
C =
1.0000 0 0.5000 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0.4000 0.6000 1.0000
0 0.7000 0.8000 6.0000
My gut feeling is that this will be slower than the bsxfun approach, but the above is more readable if you're not familiar with how bsxfun works.
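To check the index arithmetic outside Matlab, here is a small Python sketch of the same scatter operation with plain lists. It is a loop-based illustration, not a performance claim, and `scatter_columns` is a hypothetical helper name.

```python
def scatter_columns(A, B, m, n):
    """Place the n length-m subvectors of A into the rows of C where B is 1."""
    rows = [i for i, flag in enumerate(B) if flag]
    if len(rows) != m or len(A) != m * n:
        raise ValueError("B must contain exactly m ones and A must have m*n entries")
    C = [[0] * n for _ in B]
    for col in range(n):
        for k, row in enumerate(rows):
            # Subvector `col` of A occupies A[col*m : (col+1)*m].
            C[row][col] = A[col * m + k]
    return C

m, n = 3, 4
A = [1, 0, 0, 0, 0.4, 0.7, 0.5, 0.6, 0.8, 0, 1, 6]
B = [1, 0, 0, 0, 0, 0, 0, 1, 1]
C = scatter_columns(A, B, m, n)
for row in C:
    print(row)
```

Rows 1, 8 and 9 of the printed matrix carry the values from A, matching the expected C in the question.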

what is the Algorithm used to identify the scenario

I have a scenario where, for any given number, I need to identify the powers of 2 that add up to it.
For example, if the given number is 12:
12 is represented as 2 to the power of 3 plus 2 to the power of 2
5 is represented as 2 to the power of 2 plus 2 to the power of 0
What is the name of the algorithm for this scenario?
Its name is radix conversion. Convert your number to binary and you'll get your sum of powers of 2. For example,
12 = 1100
That means:
1 1 0 0
^ ^ ^ ^
| | | |
12 = 1 * (2^3) + 1* (2^2) + 0*(2^1) + 0*(2^0)
| | | |
V V V V
3 2 1 0
This follows directly from the definition of a radix (numeral base).
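As a short illustration, the set bits of the binary representation can be read off with shifts. This is a sketch; `powers_of_two` is a name chosen here, not a standard function.

```python
def powers_of_two(n):
    """Return the exponents k (largest first) such that n is the sum of 2**k."""
    if n < 0:
        raise ValueError("only non-negative integers are handled")
    exponents = []
    position = 0
    while n:
        if n & 1:              # lowest bit is set, so 2**position is part of the sum
            exponents.append(position)
        n >>= 1
        position += 1
    return exponents[::-1]

print(powers_of_two(12))  # [3, 2]
print(powers_of_two(5))   # [2, 0]
```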

How to efficiently store a matrix with highly-redundant values

I have a very large matrix (100M rows by 100M columns) that has lots of duplicate values right next to each other. For example:
8 8 8 8 8 8 8 8 8 8 8 8 8
8 4 8 8 1 1 1 1 1 8 8 8 8
8 4 8 8 1 1 1 1 1 8 8 8 8
8 4 8 8 1 1 1 1 1 8 8 8 8
8 4 8 8 1 1 1 1 1 8 8 8 8
8 4 8 8 1 1 1 1 1 8 8 8 8
8 8 8 8 8 8 8 8 8 8 8 8 8
8 8 3 3 3 3 3 3 3 3 3 3 3
I want a data structure/algorithm to store matrices like these as compactly as possible. For instance, the matrix above should only take O(1) space (even if the matrix was stretched out arbitrarily big), because there is only a constant number of rectangular regions, where each region only has one value.
The repetition happens both across rows and down columns, so the simple approach of compressing the matrix row-by-row isn't good enough. (That would require a minimum of O(num_rows) space to store any matrix.)
The representation of the matrix also needs to be accessible row-by-row, so that I can multiply the matrix by a column vector.
You could store the matrix as a quadtree with the leaves containing single values. Think of this as a two-dimensional "run" of values.
Now for my preferred method.
OK, as I mentioned in my previous answer, rows with the same entries in each column of matrix A will multiply out to the same result in matrix AB. If we can maintain that relationship then we can theoretically speed up calculations significantly (a profiler is your friend).
In this method we maintain the row * column structure of the matrix.
Each row is compressed with whatever method can decompress fast enough not to affect the multiplication speed too much. RLE may be sufficient.
We now have a list of compressed rows.
We use an entropy encoding method (like Shannon-Fano, Huffman or arithmetic coding), but we don’t compress the data in the rows with this, we use it to compress the set of rows.
We use it to encode the relative frequency of the rows. I.e. we treat a row the same way standard entropy encoding would treat a character/byte.
In this example RLE compresses a row, and Huffman compresses the entire set of rows.
So, for example, given the following matrix (prefixed with row numbers, Huffman used for ease of explanation)
0 | 8 8 8 8 8 8 8 8 8 8 8 8 8 |
1 | 8 4 8 8 1 1 1 1 1 8 8 8 8 |
2 | 8 4 8 8 1 1 1 1 1 8 8 8 8 |
3 | 8 4 8 8 1 1 1 1 1 8 8 8 8 |
4 | 8 4 8 8 1 1 1 1 1 8 8 8 8 |
5 | 8 4 8 8 1 1 1 1 1 8 8 8 8 |
6 | 8 8 8 8 8 8 8 8 8 8 8 8 8 |
7 | 8 8 3 3 3 3 3 3 3 3 3 3 3 |
Run length encoded
0 | 8{13} |
1 | 8{1} 4{1} 8{2} 1{5} 8{4} |
2 | 8{1} 4{1} 8{2} 1{5} 8{4} |
3 | 8{1} 4{1} 8{2} 1{5} 8{4} |
4 | 8{1} 4{1} 8{2} 1{5} 8{4} |
5 | 8{1} 4{1} 8{2} 1{5} 8{4} |
6 | 8{13} |
7 | 8{2} 3{11} |
So, the pattern of rows 0 and 6 appears twice, the pattern of rows 1 – 5 appears 5 times, and the pattern of row 7 appears only once.
Frequency table
A: 5 (1-5) | 8{1} 4{1} 8{2} 1{5} 8{4} |
B: 2 (0,6) | 8{13} |
C: 1 (7) | 8{2} 3{11} |
Huffman tree
0|1
/ \
A 0|1
/ \
B C
So in this case it takes one bit (for each row) to encode rows 1 – 5, and 2 bits to encode rows 0, 6, and 7.
(If the runs are longer than a few bytes then do freq count on a hash that you build up as you do the RLE).
You store the Huffman tree, unique strings, and the row encoding bit stream.
The nice thing about Huffman is that it has a unique prefix property, so you always know when you are done. Thus, given the bit string 10000001011 you can rebuild the matrix A from the stored unique strings and the tree. The encoded bit stream tells you the order that the rows appear in.
You may want to look into adaptive Huffman encoding, or its arithmetic counterpart.
Seeing as rows in A with the same column entries multiply to the same result in AB over vector B you can cache the result and use it instead of calculating it again (it’s always good to avoid 100M*100M multiplications if you can).
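The decoding and caching steps can be sketched in Python as follows. The prefix codes (A = 0, B = 10, C = 11) are read off the tree above, taking left branches as 0 and right branches as 1, which is an assumed convention consistent with the bit string 10000001011. All function names are illustrative.

```python
# Prefix codes taken from the Huffman tree above (left = 0, right = 1).
CODES = {"0": "A", "10": "B", "11": "C"}

# Unique run-length-encoded rows as (value, run length) pairs.
UNIQUE_ROWS = {
    "A": [(8, 1), (4, 1), (8, 2), (1, 5), (8, 4)],
    "B": [(8, 13)],
    "C": [(8, 2), (3, 11)],
}

def decode_row_labels(bits):
    """Walk the bit string and emit one row label per matched prefix code."""
    labels, current = [], ""
    for bit in bits:
        current += bit
        if current in CODES:
            labels.append(CODES[current])
            current = ""
    if current:
        raise ValueError("bit string ends in the middle of a code")
    return labels

def rle_dot(runs, vector):
    """Dot product of one run-length-encoded row with a dense vector."""
    total, start = 0, 0
    for value, length in runs:
        total += value * sum(vector[start:start + length])
        start += length
    return total

def multiply(bits, vector):
    """Matrix-vector product that computes each unique row only once."""
    cache, result = {}, []
    for label in decode_row_labels(bits):
        if label not in cache:
            cache[label] = rle_dot(UNIQUE_ROWS[label], vector)
        result.append(cache[label])
    return result

labels = decode_row_labels("10000001011")
print(labels)    # ['B', 'A', 'A', 'A', 'A', 'A', 'B', 'C']
product = multiply("10000001011", [1] * 13)
print(product)   # [104, 65, 65, 65, 65, 65, 104, 49]
```

Only three dot products are evaluated for the eight rows; the other five results come from the cache.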
Links to further info:
Arithmetic Coding + Statistical Modeling = Data Compression
Priority Queues and the STL
Arithmetic coding
Huffman coding
A Comparison
Uncompressed
0 1 2 3 4 5 6 7
=================================
0 | 3 3 3 3 3 3 3 3 |
|-------+ +-------|
1 | 4 4 | 3 3 3 3 | 4 4 |
| +-----------+---+ |
2 | 4 4 | 5 5 5 | 1 | 4 4 |
| | | | |
3 | 4 4 | 5 5 5 | 1 | 4 4 |
|---+---| | | |
4 | 5 | 0 | 5 5 5 | 1 | 4 4 |
| | +---+-------+---+-------|
5 | 5 | 0 0 | 2 2 2 2 2 |
| | | |
6 | 5 | 0 0 | 2 2 2 2 2 |
| | +-------------------|
7 | 5 | 0 0 0 0 0 0 0 |
=================================
= 64 bytes
Quadtree
0 1 2 3 4 5 6 7
=================================
0 | 3 | 3 | | | 3 | 3 |
|---+---| 3 | 3 |---+---|
1 | 4 | 4 | | | 4 | 4 |
|-------+-------|-------+-------|
2 | | | 5 | 1 | |
| 4 | 5 |---+---| 4 |
3 | | | 5 | 1 | |
|---------------+---------------|
4 | 5 | 0 | 5 | 5 | 5 | 1 | 4 | 4 |
|---+---|---+---|---+---|---+---|
5 | 5 | 0 | 0 | 2 | 2 | 2 | 2 | 2 |
|-------+-------|-------+-------|
6 | 5 | 0 | 0 | 2 | 2 | 2 | 2 | 2 |
|---+---+---+---|---+---+---+---|
7 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
=================================
0 +- 0 +- 0 -> 3
| +- 1 -> 3
| +- 2 -> 4
| +- 3 -> 4
+- 1 -> 3
+- 2 -> 4
+- 3 -> 5
1 +- 0 -> 3
+- 1 +- 0 -> 3
| +- 1 -> 3
| +- 2 -> 4
| +- 3 -> 4
+- 2 +- 0 -> 5
| +- 1 -> 1
| +- 2 -> 5
| +- 3 -> 1
+- 3 -> 4
2 +- 0 +- 0 -> 5
| +- 1 -> 0
| +- 2 -> 5
| +- 3 -> 0
+- 1 +- 0 -> 5
| +- 1 -> 5
| +- 2 -> 0
| +- 3 -> 2
+- 2 +- 0 -> 5
| +- 1 -> 0
| +- 2 -> 5
| +- 3 -> 0
+- 3 +- 0 -> 0
+- 1 -> 2
+- 2 -> 0
+- 3 -> 0
3 +- 0 +- 0 -> 5
| +- 1 -> 1
| +- 2 -> 2
| +- 3 -> 2
+- 1 +- 0 -> 4
| +- 1 -> 4
| +- 2 -> 2
| +- 3 -> 2
+- 2 +- 0 -> 2
| +- 1 -> 2
| +- 2 -> 0
| +- 3 -> 0
+- 3 +- 0 -> 2
+- 1 -> 2
+- 2 -> 0
+- 3 -> 0
((1*4) + 3) + ((2*4) + 2) + (4 * 8) = 49 leaf nodes
49 * (2 + 1) = 147 (2 * 8 bit indexer, 1 byte data)
+ 14 inner nodes -> 2 * 14 bytes (2 * 8 bit indexers)
= 175 Bytes
Region Hash
0 1 2 3 4 5 6 7
=================================
0 | 3 3 3 3 3 3 3 3 |
|-------+---------------+-------|
1 | 4 4 | 3 3 3 3 | 4 4 |
| +-----------+---+ |
2 | 4 4 | 5 5 5 | 1 | 4 4 |
| | | | |
3 | 4 4 | 5 5 5 | 1 | 4 4 |
|---+---| | | |
4 | 5 | 0 | 5 5 5 | 1 | 4 4 |
| + - +---+-------+---+-------|
5 | 5 | 0 0 | 2 2 2 2 2 |
| | | |
6 | 5 | 0 0 | 2 2 2 2 2 |
| +-------+-------------------|
7 | 5 | 0 0 0 0 0 0 0 |
=================================
0: (4,1; 4,1), (5,1; 6,2), (7,1; 7,7) | 3
1: (2,5; 4,5) | 1
2: (5,3; 6,7) | 1
3: (0,0; 0,7), (1,2; 1,5) | 2
4: (1,0; 3,1), (1,6; 4,7) | 2
5: (2,2; 4,4), (4,0; 7,0) | 2
Regions: (3 + 1 + 1 + 2 + 2 + 2) * 5
= 55 bytes {4 bytes rectangle, 1 byte data}
{Lookup table is a sorted array, so it does not need extra storage}.
Huffman encoded RLE
0 | 3 {8} | 1
1 | 4 {2} | 3 {4} | 4 {2} | 2
2,3 | 4 {2} | 5 {3} | 1 {1} | 4 {2} | 4
4 | 5 {1} | 0 {1} | 5 {3} | 1 {1} | 4 {2} | 5
5,6 | 5 {1} | 0 {2} | 2 {5} | 3
7 | 5 {1} | 0 {7} | 2
RLE Data: (1 + 3+ 4 + 5 + 3 + 2) * 2 = 36
Bit Stream: 20 bits packed into 3 bytes = 3
Huffman Tree: 10 nodes * 3 = 30
= 69 Bytes
One Giant RLE stream
3{8};4{2};3{4};4{4};5{3};1{1};4{4};5{3};1{1};4{2};5{1};0{1};
5{3};1{1};4{2};5{1};0{2};2{5};5{1};0{2};2{5};5{1};0{7}
= 2 * 23 = 46 Bytes
One Giant RLE stream encoded with common prefix folding
3{8};
4{2};3{4};
4{4};5{3};1{1};
4{4};5{3};
1{1};4{2};5{1};0{1};5{3};
1{1};4{2};5{1};0{2};2{5};
5{1};0{2};2{5};
5{1};0{7}
0 + 0 -> 3{8};4{2};3{4};
+ 1 -> 4{4};5{3};1{1};
1 + 0 -> 4{2};5{1} + 0 -> 0{1};5{3};1{1};
| + 1 -> 0{2}
|
+ 1 -> 2{5};5{1} + 0 -> 0{2};
+ 1 -> 0{7}
3{8};4{2};3{4} | 00
4{4};5{3};1{1} | 01
4{4};5{3};1{1} | 01
4{2};5{1};0{1};5{3};1{1} | 100
4{2};5{1};0{2} | 101
2{5};5{1};0{2} | 110
2{5};5{1};0{7} | 111
Bit stream: 000101100101110111
RLE Data: 16 * 2 = 32
Tree: 5 * 2 = 10
Bit stream: 18 bits in 3 bytes = 3
= 45 bytes
If your data is really regular, you might benefit from storing it in a structured format; e.g. your example matrix might be stored as the following list of "fill-rectangle" instructions:
(0,0)-(13,7) = 8
(4,1)-(8,5) = 1
(Then to look up the value of a particular cell, you'd iterate backwards through the list until you found a rectangle that contained that cell)
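A minimal Python sketch of that lookup is shown below. It assumes inclusive (x, y) corner coordinates and a 13-column, 8-row matrix, so the background rectangle ends at column 12 here; the last two rectangles (for the 4s and the 3s) are added for illustration and are not part of the list above.

```python
# Each entry: (top-left corner, bottom-right corner, value), corners inclusive.
# Later rectangles are painted over earlier ones.
RECTANGLES = [
    ((0, 0), (12, 7), 8),   # background
    ((4, 1), (8, 5), 1),    # block of ones
    ((1, 1), (1, 5), 4),    # column of fours (added for illustration)
    ((2, 7), (12, 7), 3),   # bottom row of threes (added for illustration)
]

def lookup(x, y, rectangles=RECTANGLES):
    """Return the value at column x, row y by scanning the list backwards."""
    for (x0, y0), (x1, y1), value in reversed(rectangles):
        if x0 <= x <= x1 and y0 <= y <= y1:
            return value
    raise KeyError((x, y))

print(lookup(4, 1))   # 1
print(lookup(0, 0))   # 8
print(lookup(1, 3))   # 4
print(lookup(12, 7))  # 3
```

Lookup cost grows with the number of rectangles, so this only pays off when that number is small.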
As Ira Baxter suggested,
you could store the matrix as a quadtree with the leaves containing single values.
The simplest way to do this is for every node of the quadtree to cover an area 2^n x 2^n,
and each non-leaf node points to its 4 children of size 2^(n-1) x 2^(n-1).
You might get slightly better compression with an adaptive quadtree that allows irregular sub-division.
Then each non-leaf node stores the cut-point (B,G) and points to its 4 children.
For example, if some non-leaf node covers an area from (A,F) in the upper-left corner to (C,H) in the lower-right corner,
then its 4 children cover areas
(A,F) to (B-1, G-1)
(A,G) to (B-1, H)
(B,F) to (C,G-1)
(B,G) to (C,H).
You would try to pick the (B,G) cut-point for each non-leaf node such that it lines up with some real division in your data.
For example, say you have a matrix with a small square in the middle filled with nines and zero elsewhere.
With the simple powers-of-two quadtree, you'll end up with at least 21 nodes: 5 non-leaf nodes, 4 leaf nodes of nines, and 12 leaf nodes of zeros.
(You'll get even more nodes if the centered small square is not precisely some power-of-two distance from the left and top edges, and not itself some precise power-of-two).
With an adaptive quadtree, if you are smart enough to pick the cut-point for the root node at the upper-left corner of that square, and then for the root's lower-right child pick a cut-point at the lower-right corner of the square, you can represent the entire matrix in 9 nodes: 2 non-leaf nodes, 1 leaf node for the nines, and 6 leaf nodes for the zeros.
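The 21-node count for the simple powers-of-two quadtree can be checked with a small Python sketch. It assumes the smallest case that fits the description, a 4 x 4 matrix with a 2 x 2 block of nines in the middle; `build` and `count_nodes` are illustrative names.

```python
def build(matrix, top, left, size):
    """Return a leaf value if the square is uniform, else a list of 4 children."""
    first = matrix[top][left]
    uniform = all(
        matrix[r][c] == first
        for r in range(top, top + size)
        for c in range(left, left + size)
    )
    if uniform:
        return first
    half = size // 2
    return [
        build(matrix, top, left, half),                 # upper-left
        build(matrix, top, left + half, half),          # upper-right
        build(matrix, top + half, left, half),          # lower-left
        build(matrix, top + half, left + half, half),   # lower-right
    ]

def count_nodes(node):
    """Return (non-leaf nodes, leaf nodes) of the tree."""
    if not isinstance(node, list):
        return (0, 1)
    inner, leaves = 1, 0
    for child in node:
        child_inner, child_leaves = count_nodes(child)
        inner += child_inner
        leaves += child_leaves
    return (inner, leaves)

matrix = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]
tree = build(matrix, 0, 0, 4)
print(count_nodes(tree))  # (5, 16)
```

The 16 leaves split into 4 nines and 12 zeros, giving the 21 nodes mentioned above.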
Do you know about interval trees?
Interval trees are a way to store intervals efficiently, and then query them. A generalization is the Range Tree, which can be adapted to any dimension.
Here you could effectively describe your rectangles and attach a value to them. Of course the rectangles can overlap, that's what will make it efficient.
0,0-n,n --> 8
4,4-7,7 --> 1
8,8-8,n --> 3
Then when querying for a value in one particular spot, you are returned a list of several rectangles and need to determine the innermost one: this is the value in this spot.
The simplest approach is to use run-length encoding on one dimension and not worry about the other dimension.
(If the dataset weren't so incredibly huge, interpreting it as an image and using a standard lossless image compression method would be very simple also--but since you'd have to work on making the algorithm work on sparse matrices, it wouldn't end up being all that simple.)
Another simple approach is to try a rectangular flood fill--start at the top-right pixel and increase it into the largest rectangle you can (breadth-first); then mark all those pixels as "done" and take the top-right most remaining pixel, repeat until done. (You'd probably want to store these rectangles in some sort of BSP or quad-tree.)
A highly effective technique--not optimal, but probably good enough--is to use a binary space partitioning tree where "space" is measured not spatially but by number of changes. You'd recursively cut so that you have equal numbers of changes on the left and right (or top and bottom--presumably you'd want to keep things square) and, as your sizes got smaller, so that you would cut as many changes as possible. Eventually, you'll end up cutting two rectangles apart from each other, each of which has all the same number; then stop. (Encoding by RLE in x and y will quickly tell you where the change points are.)
Your description of O(1) space for a matrix of size 100M x 100M is confusing. When you have a finite matrix, then your size is a constant (unless the program that generates the matrix doesn't alter it). So the amount of space required to store is also a constant even if you multiply it with a scalar. Definitely the time to read and write the matrix is not going to be O(1).
Sparse matrix is what I could think of to reduce the amount of space required to store such a matrix. You can write this sparse matrix to a file and store it as a tar.gz which will further compress the data.
I do have a question: what does M in 100M denote? Does it mean mega/million? If so, at one byte per entry this matrix takes 100 x 10^6 x 100 x 10^6 bytes = 10^16 bytes = 10^10 MB = 10^4 TB! What kind of machine are you using?
I'm not sure why this question was made Community Wiki, but so it goes.
I'll rely on the assumption that you have a linear algebra application, and that your matrix has a rectangular type of redundancy. If so, then you can do something much better than quadtrees, and cleaner than cutting the matrix into rectangles (which is generally the right idea).
Let M be your matrix, let v be the vector that you want to multiply by M, and let
A be the special matrix
A = [1 -1 0 0 0]
[0 1 -1 0 0]
[0 0 1 -1 0]
[0 0 0 1 -1]
[0 0 0 0 1]
You'll also need the inverse matrix to A, which I'll call B:
B = [1 1 1 1 1]
[0 1 1 1 1]
[0 0 1 1 1]
[0 0 0 1 1]
[0 0 0 0 1]
Multiplying a vector v by A is fast and easy: you just take differences of consecutive pairs of elements of v. Multiplying a vector v by B is also fast and easy: the entries of Bv are partial sums of the elements of v. Then you want to use the equation
Mv = B (AMA) B v
which holds because AB = BA = I.
The matrix AMA is sparse: In the middle, each entry is an alternating sum of 4 entries of M that make a 2 x 2 square. You have to be at a corner of one of the rectangles in M for this alternating sum to be non-zero. Since AMA is sparse, you can store its non-zero entries in an associative array and use sparse matrix multiplication to apply it to a vector.
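Here is a small Python sketch of this identity on a toy square matrix. It builds AMA densely, which is only feasible for a demo; for a real 100M x 100M matrix the non-zero entries would have to be produced directly from the rectangle corners. The function names are made up for the example.

```python
def difference_transform(M):
    """Return the non-zero entries of A M A as a dict {(i, j): value}."""
    n = len(M)
    # T = A M: each row minus the row below it (the last row is kept as is).
    T = [
        [M[i][j] - (M[i + 1][j] if i + 1 < n else 0) for j in range(n)]
        for i in range(n)
    ]
    # S = T A: each column minus the column to its left (first column kept).
    sparse = {}
    for i in range(n):
        for j in range(n):
            value = T[i][j] - (T[i][j - 1] if j > 0 else 0)
            if value != 0:
                sparse[(i, j)] = value
    return sparse

def suffix_sums(v):
    """Multiply by B: entry i is the sum of v[i], v[i+1], ..."""
    out, running = [0] * len(v), 0
    for i in range(len(v) - 1, -1, -1):
        running += v[i]
        out[i] = running
    return out

def multiply(sparse, v):
    """Compute M v as B (A M A) B v using only the sparse entries."""
    w = suffix_sums(v)
    u = [0] * len(v)
    for (i, j), value in sparse.items():
        u[i] += value * w[j]
    return suffix_sums(u)

M = [
    [8, 8, 8, 8],
    [8, 1, 1, 8],
    [8, 1, 1, 8],
    [8, 8, 8, 8],
]
sparse = difference_transform(M)
print(len(sparse))                      # 5
print(multiply(sparse, [1, 2, 3, 4]))   # [80, 45, 45, 80]
```

The 16-entry matrix reduces to 5 non-zero entries, one at the top-left boundary and four at the corners of the inner rectangle.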
I do not have a specific answer for the matrix you have shown. In finite element analysis (FEA), you have matrices with redundant data. When implementing an FEA package for my undergraduate project, I used the skyline storage method.
Some links:
Intel page for sparse matrix storage
Wikipedia link
The first thing to try is always the existing libraries and solutions. It is a lot of work getting custom formats working with all the operations you're going to want in the end. Sparse matrices are an old problem, so make sure you read up on the existing work.
Assuming you don't find something suitable, I would recommend a row-based format. Don't try to be too fancy with super-compact representations; you will end up with lots of processing for every little operation and with bugs in your code. Instead, try to compress each row separately. You know you are going to have to scan through each row for the matrix-vector multiplication, so make life easy for yourself.
I would start with run-length-encoding, see how that works first. Once that is working, try adding some tricks like references to sections of the previous row. So a row might be encoded as: 126 zeros, 8 ones, 1000 entries copied directly from row above, 32 zeros. That seems like it might be very efficient with your given example.
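A possible shape for such a row format is sketched below in Python. The two operations, `("run", value, count)` and `("copy", count)`, and the name `decode_rows` are assumptions made for illustration; a copy takes entries from the same column positions of the previous decoded row.

```python
def decode_rows(encoded_rows):
    """Expand rows made of ("run", value, count) and ("copy", count) operations."""
    rows = []
    for ops in encoded_rows:
        row = []
        for op in ops:
            if op[0] == "run":
                _, value, count = op
                row.extend([value] * count)
            elif op[0] == "copy":
                _, count = op
                if not rows:
                    raise ValueError("the first row has no row above to copy from")
                start = len(row)  # copy from the same columns of the previous row
                row.extend(rows[-1][start:start + count])
            else:
                raise ValueError(f"unknown operation {op[0]!r}")
        rows.append(row)
    return rows

# First three rows of the example matrix from the question.
encoded = [
    [("run", 8, 13)],
    [("run", 8, 1), ("run", 4, 1), ("copy", 2), ("run", 1, 5), ("copy", 4)],
    [("copy", 13)],
]
decoded = decode_rows(encoded)
for row in decoded:
    print(row)
```

A row identical to the one above it costs a single operation, which is where the vertical redundancy is recovered.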
Many of the above solutions are fine.
If you are working with a file consider file oriented
compression tools like compress, bzip, zip, bzip2 and friends.
They work very well especially if the data contains redundant
ASCII characters. Using an external compression tool eliminates
problems and challenges inside your code and will compress
both binary and ASCII data.
In your example you are displaying one character numbers.
The numbers 0-9 can be represented by a smaller four bit
encoding pattern. You can use the additional bits in
a byte as a count. Four bits gives you extra codes to
escape to extras... But there is a caution which reaches
back to the old Y2K bugs where two characters were used
for a year. Byte encoding from an offset would have given
255 years and the same two bytes would span all of written
history and then some.
You may want to take a look at GIF format and its compression algorithm. Just think about your matrix as a Bitmap...
Let me check my assumptions, if for no other reason than to guide my thinking about the problem:
The matrix is highly redundant, not necessarily sparse.
We want to minimize storage (on disk and RAM).
We want to be able to multiply A[m*n] by vector B[n*1] to get to AB[m*1] without first decompressing either (at least not more than required to do the calculations).
We don’t need random access to any A[i*j] entry --all operations are over the matrix.
The multiplication is done online (as needed), and so must be as efficient as possible.
The matrix is static.
One can try all kinds of clever schemes to detect rectangles or self similarity etc, but that is going to end up hurting performance when doing the multiplication. I propose 2 relatively simple solutions.
I am going to have to work backwards a bit, so please be patient with me.
If the data is predominantly biased towards horizontal repetition then the following may work well.
Think of the matrix flattened into an array (this is really the way it is stored in memory anyway). E.g.
A
| w0 w1 w2 |
| x0 x1 x2 |
| y0 y1 y2 |
| z0 z1 z2 |
becomes
A’
| w0 w1 w2 x0 x1 x2 y0 y1 y2 z0 z1 z2 |
We can use the fact that entry [i,j] of A sits at flat index k = i * n + j in A’, where n is the number of columns.
So, when we do the multiplication we iterate over the “matrix” array A’ with k = [0..m*n-1] and index into the vector B using (k mod n) and into vector AB with (k div n). “div” being integer division.
So, for example, A’[10] = z1: 10 mod 3 = 1 and 10 div 3 = 3, so A[3,1] = z1.
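The index arithmetic can be sketched in a few lines of Python. `flat_matvec` is a hypothetical name, and the matrix is stored uncompressed here purely to show the k mod n / k div n mapping.

```python
def flat_matvec(flat, m, n, vector):
    """Multiply an m x n matrix, stored row by row in `flat`, by `vector`."""
    result = [0] * m
    for k, value in enumerate(flat):
        row, col = divmod(k, n)   # row = k div n, col = k mod n
        result[row] += value * vector[col]
    return result

# A 4 x 3 matrix flattened row by row, like A' above.
flat = [1, 2, 3,
        4, 5, 6,
        7, 8, 9,
        10, 11, 12]

print(divmod(10, 3))                       # (3, 1)
print(flat_matvec(flat, 4, 3, [1, 0, 2]))  # [7, 16, 25, 34]
```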
Now, on to the compression.
We do normal run of the mill Run Length Encoding (RLE), but against the A’, not A. With the flat array there will be longer sequences of repetition, hence better compression. Then after encoding the runs we do another process where we extract common substrings. We can either do a form of dictionary compression, or process the run data into some form of space optimized graph like a radix tree/suffix tree or a device of your own creation that merges tops and tails. The graph should have a representation of all the unique strings in the data. You can pick any number of methods to break the stream into strings: matching prefixes, length, or something else (whatever suits your graph best) but do it on a run boundary, not bytes or your decoding will be made more complicated. The graph becomes a state machine when we decompress the stream.
I’m going to use a bit stream and Patricia trie as an example, because it is simplest, but you can use something else (more bits per state change better merging, etc. Look for papers by Stefan Nilsson).
To compress the run data we build a hash table against the graph. The table maps a string to a bit sequence. You can do this by walking the graph and encoding each left branch as 0 and right branch as 1 (arbitrary choice).
Process the run data and build up a bit string until you get a match in the hash table, output the bits and clear the string (the bits will not be on a byte boundary, so you may have to buffer until you get a sequence long enough to write out). Rinse and repeat until you have processed the complete run data stream. You store the graph and the bit stream. The bit stream encodes strings, not bytes.
If you reverse the process, using the bit stream to walk the graph until you reach a leaf/terminal node you get back the original run data, which you can decode on the fly to produce the stream of integers that you multiply against the vector B to get AB. Each time you run out of runs you read the next bit and lookup its corresponding string. We don’t care that we don’t have random access into A, because we only need it in B (B which can be range / interval compressed but doesn’t need to be).
So even though RLE is biased towards horizontal runs we still get good vertical compression because common strings are stored only once.
I will explain the other method in a separate answer, as this is getting too long as it is, but that method can actually speed up calculation because repeated rows in matrix A multiply to the same result in AB.
OK, you need a compression algorithm. Try RLE (run-length encoding); it works very well when the data is highly redundant.

Why is (a | b ) equivalent to a - (a & b) + b?

I was looking for a way to do a BITOR() with an Oracle database and came across a suggestion to just use BITAND() instead, replacing BITOR(a,b) with a + b - BITAND(a,b).
I tested it by hand a few times and verified it seems to work for all binary numbers I could think of, but I can't think of a quick mathematical proof of why this is correct.
Could somebody enlighten me?
A & B is the set of bits that are on in both A and B. A - (A & B) leaves you with all those bits that are only on in A. Add B to that, and you get all the bits that are on in A or those that are on in B.
Simple addition of A and B won't work because of carrying where both have a 1 bit. By removing the bits common to A and B first, we know that (A-(A&B)) will have no bits in common with B, so adding them together is guaranteed not to produce a carry.
Imagine you have two binary numbers: a and b. And let's say that these numbers never have 1 in the same bit at the same time, i.e. if a has 1 in some bit, then b always has 0 in the corresponding bit. And in the other direction, if b has 1 in some bit, then a always has 0 in that bit. For example
a = 00100011
b = 11000100
This would be an example of a and b satisfying the above condition. In this case it is easy to see that a | b would be exactly the same as a + b.
a | b = 11100111
a + b = 11100111
Let's now take two numbers that violate our condition, i.e. two numbers have at least one 1 in some common bit
a = 00100111
b = 11000100
Is a | b the same as a + b in this case? No
a | b = 11100111
a + b = 11101011
Why are they different? They are different because when we + the bit that has 1 in both numbers, we produce a so-called carry: the resultant bit is 0, and 1 is carried to the next bit to the left: 1 + 1 = 10. Operation | has no carry, so 1 | 1 is again just 1.
This means that the difference between a | b and a + b occurs when and only when the numbers have at least one 1 in a common bit. When we sum two numbers with 1 in common bits, these common bits get added "twice" and produce a carry, which ruins the similarity between a | b and a + b.
Now look at a & b. What does a & b calculate? a & b produces the number that has 1 in all bits where both a and b have 1. In our latest example
a = 00100111
b = 11000100
a & b = 00000100
As you saw above, these are exactly the bits that make a + b differ from a | b. The 1 in a & b indicate all positions where carry will occur.
Now, when we do a - (a & b) we effectively remove (subtract) all "offending" bits from a and only such bits
a - (a & b) = 00100011
Numbers a - (a & b) and b have no common 1 bits, which means that if we add a - (a & b) and b we won't run into a carry, and, if you think about it, we should end up with the same result as if we just did a | b
a - (a & b) + b = 11100111
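A brute-force check of the identity over all pairs of 8-bit values, as a quick Python sketch (`bitor_via_and` is a name chosen here for illustration):

```python
def bitor_via_and(a, b):
    """Compute a | b using only subtraction, addition and bitwise AND."""
    return a - (a & b) + b

# Exhaustive check over every pair of 8-bit values.
all_match = all(
    bitor_via_and(a, b) == (a | b)
    for a in range(256)
    for b in range(256)
)
print(all_match)  # True

# The worked example from above.
a, b = 0b00100111, 0b11000100
print(format(a + b, "08b"))                # 11101011
print(format(a - (a & b), "08b"))          # 00100011
print(format(bitor_via_and(a, b), "08b"))  # 11100111
```

This is a check, not a proof; the proof is the no-carry argument given above.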
A&B = C where any bits left set in C are those set in both A and in B.
Either A-C = D or B-C = E sets just these common bits to 0. There is no carrying effect because 1-1=0.
D+B or E+A is similar to A+B except that because we subtracted A&B previously there will be no carry due to having cleared all commonly set bits in D or E.
The net result is that A-A&B+B or B-A&B+A is equivalent to A|B.
Here's a truth table if it's still confusing:
A | B | OR A | B | & A | B | - A | B | +
---+---+---- ---+---+--- ---+---+--- ---+---+---
0 | 0 | 0 0 | 0 | 0 0 | 0 | 0 0 | 0 | 0
0 | 1 | 1 0 | 1 | 0 0 | 1 | 0-1 0 | 1 | 1
1 | 0 | 1 1 | 0 | 0 1 | 0 | 1 1 | 0 | 1
1 | 1 | 1 1 | 1 | 1 1 | 1 | 0 1 | 1 | 1+1
Notice the carry rows in the + and - operations. We avoid those because A-(A&B) sets the cases where both bits in A and B are 1 to 0 in A; adding B then brings them back, along with the other cases where there was a 1 in either A or B, but not where both had 0. So the OR truth table and the A-(A&B)+B truth table are identical.
Another way to eyeball it is to see that A+B is almost like A|B except for the carry in the bottom row. A&B isolates that bottom row for us, A-A&B moves those isolated cases up two rows in the + table, and (A-A&B)+B becomes equivalent to A|B.
While you could commute this to A+B-(A&B), I was afraid of a possible overflow but that was unjustified it seems:
#include <stdio.h>
int main(void) {
    unsigned int a = 0xC0000000, b = 0xA0000000;
    printf("%x %x %x %x\n", a, b, a | b, a & b);
    printf("%x %x %x %x\n", a + b, a - (a & b), a - (a & b) + b, a + b - (a & b));
    return 0;
}
c0000000 a0000000 e0000000 80000000
60000000 40000000 e0000000 e0000000
Edit: So I wrote this before there were answers, then there was some 2 hours of down time on my home connection, and I finally managed to post it, noticing only afterwards that it'd been properly answered twice. Personally I prefer referring to a truth table to work out bitwise operations, so I'll leave it in case it helps someone.
