Diagonal value in co-occurrence matrix

I am a newbie, so thank you in advance for any advice.
I want to make a co-occurrence matrix, and followed the link below:
How to use R to create a word co-occurrence matrix
But I cannot understand why the value of A-A is 10 in the matrix below.
It should be 4, shouldn't it? Because there are four As.
library(qdapTools)  # provides mtabulate()
dat <- read.table(text='film tag1 tag2 tag3
1 A A A
2 A C F
3 B D C', header=TRUE)
crossprod(as.matrix(mtabulate(as.data.frame(t(dat[, -1])))))
( ) A C F B D
A 10 1 1 0 0
C 1 2 1 1 1
F 1 1 1 0 0
B 0 1 0 1 1
D 0 1 0 1 1

The solution you used presumes that each tag appears only once per film, which jibes with the definition of a co-occurrence matrix as far as I can tell. Because tag A appears three times in the first film, each of those As is counted as co-occurring with itself and with the other two As (3 x 3 = 9); adding the single A in the second film (1 x 1 = 1) gives a total of ten co-occurrences.
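The arithmetic above can be checked with a small sketch. This is not the R code from the question but a plain Python illustration of the same idea: the cross product of a per-film count matrix sums, for each pair of tags, the product of their counts in each film. The function name `cooccurrence` and the `binary` flag are made up for this example.

```python
from collections import Counter

# Tags per film, taken from the data in the question.
films = [["A", "A", "A"], ["A", "C", "F"], ["B", "D", "C"]]

def cooccurrence(rows, binary=False):
    """Sum over films of count(a) * count(b) for every tag pair (a, b).

    With binary=True each tag is counted at most once per film,
    which is the assumption the crossprod approach relies on.
    """
    tags = sorted({tag for row in rows for tag in row})
    if binary:
        counts = [Counter(set(row)) for row in rows]
    else:
        counts = [Counter(row) for row in rows]
    return {(a, b): sum(c[a] * c[b] for c in counts) for a in tags for b in tags}

raw = cooccurrence(films)
once_per_film = cooccurrence(films, binary=True)
print(raw[("A", "A")])            # 3*3 + 1*1
print(once_per_film[("A", "A")])  # number of films that contain A
```

With repeated tags collapsed to one occurrence per film, the diagonal entry becomes the number of films containing the tag, which is the usual reading of a co-occurrence matrix diagonal.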

Related

How to find all sub rectangles using fastest algorithm?

As an example, suppose we have a 2D array such as:
A= [
[1,0,0],
[1,0,0],
[0,1,1]
]
The task is to find all sub-rectangles containing only zeros. So the output of this algorithm should be:
[[0,1,0,2] , [0,1,1,1] , [0,2,1,2] , [0,1,1,2] ,[1,1,1,2], [2,0,2,0] ,
[0,1,0,1] , [0,2,0,2] , [1,1,1,1] , [1,2,1,2]]
Here i,j in [ i , j , a , b ] are the coordinates of the rectangle's starting point and a,b are the coordinates of its ending point.
I found some algorithms, for example Link1 and Link2, but I think the first one is the simplest algorithm and we want the fastest. As for the second one, it only computes rectangles, not all sub-rectangles.
Question:
Does anyone know a better or the fastest algorithm for this problem? My idea is to use dynamic programming, but I am not sure how to apply it.
Assume an initial array of size c columns x r rows.
Every 0 is a rectangle of size 1x1.
Now perform a "horizontal dilation", i.e. replace every element by the maximum of itself and the one to its right, and drop the last element in the row. E.g.
1 0 0    1 0
1 0 0 -> 1 0
0 1 1    1 1
Every zero now corresponds to a 1x2 rectangle in the original array. You can repeat this c-1 times, until there is a single column left.
1 0 0    1 0    1
1 0 0 -> 1 0 -> 1
0 1 1    1 1    1
The zeroes in the last array correspond to 1xc rectangles in the original array (initially c columns).
For every dilated array, perform a similar "vertical dilation".
1 0 0    1 0    1
1 0 0 -> 1 0 -> 1
0 1 1    1 1    1
  |       |     |
  V       V     V
1 0 0    1 0    1
1 1 1 -> 1 1 -> 1
  |       |     |
  V       V     V
1 1 1 -> 1 1 -> 1
In these rxc arrays, the zeroes correspond to the subrectangles of all possible sizes. (Here, 5 of size 1x1, 2 of size 2x1, 2 of size 1x2 and one of size 2x2.)
The total workload to detect the zeroes and compute the dilations is of order O(c²r²). I guess that this is worst-case optimal. (In case an array contains no zeroes, there is no need to continue any dilation.)
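The dilation scheme described above can be sketched in Python as follows. This is one possible reading of the answer, not a tuned implementation; the function name `zero_rectangles` is made up for the example, and the output uses the question's [i, j, a, b] convention (row and column of the top-left cell, then row and column of the bottom-right cell).

```python
def zero_rectangles(grid):
    """List [i, j, a, b] for every sub-rectangle of grid that holds only zeros."""
    rows, cols = len(grid), len(grid[0])
    found = []
    wide = [row[:] for row in grid]  # a zero here marks a 1 x width all-zero strip
    for width in range(1, cols + 1):
        tall = [row[:] for row in wide]  # a zero here marks a height x width block
        for height in range(1, rows + 1):
            for i, row in enumerate(tall):
                for j, value in enumerate(row):
                    if value == 0:
                        found.append([i, j, i + height - 1, j + width - 1])
            # Vertical dilation: max of each row with the row below, last row dropped.
            tall = [[max(x, y) for x, y in zip(upper, lower)]
                    for upper, lower in zip(tall, tall[1:])]
        # Horizontal dilation: max of each element with its right neighbour,
        # last column dropped.
        wide = [[max(x, y) for x, y in zip(row, row[1:])] for row in wide]
    return found

A = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 1],
]
for rect in sorted(zero_rectangles(A)):
    print(rect)
```

The sketch always runs every dilation; the early exit mentioned above (stopping once a dilated array has no zeroes left) is left out to keep it short.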

Creating co-occurrence matrix in SAS

All, thanks to the amazing help and camaraderie at Stack Exchange, I can now build and do further analysis using the co-occurrence matrix R code that was discussed in my original thread: Creating Co-Occurrence Matrix.
I am now dealing with a massive data set that could only be processed on a server, and I am using SAS Studio to analyse it and thus, I will have to do the co-occurrence analysis using SAS. I would really appreciate any help from SAS experts out there, as my SAS programming techniques are limited. I am trying to do it in the SAS Studio environment.
So, essentially - I have a massive SAS .sav file of households and items, and I want to see a matrix of the number of households where items appear together. Taking the same example from my earlier thread, essentially I have a table containing the following:
HHID Items Quant
HH1 A 3
HH1 B 1
HH1 C 1
HH2 E 3
HH2 B 1
HH3 B 1
HH3 C 4
HH4 D 1
HH4 E 1
HH4 A 1
HH5 F 5
HH5 B 3
HH5 C 2
HH5 D 1, etc.
The output needed is something like this:
A B C D E F
A 0 1 1 0 1 1
B 1 0 3 1 1 0
C 1 3 0 1 0 0
D 1 1 1 0 1 1
E 1 1 0 1 0 0
F 0 1 1 1 0 0
I see that there is a macro out there that was written for market basket analysis, and although its output is not in this format, I could work with it as well. Unfortunately the website hosting it no longer exists, so any help is much appreciated.
Thank you.

Algorithm: How to find the number of solutions to SAT?

Assume that the number of variables N and the number of clauses K are equal. Find an algorithm that returns the number of different ways to satisfy the clauses.
I read that SAT is related to Independent Sets.
A function with N variables has a truth table with 2^N rows. Each row corresponds to one minterm, which either is a solution or is not.
A clause with N variables excludes exactly one of the minterms from the solutions: the minterm in which every literal of the clause is false, i.e. every variable takes the inverse of its form in the clause.
Provided the K clauses are all different (and each contains all N variables), the number of solutions is 2^N - K.
Example:
The K=3 clauses with N=3 variables:
A or B or C
!A or B or C
A or B or !C
The truth-table for three inputs:
A B C output
0 0 0 0 // excluded by A or B or C
0 0 1 0 // excluded by A or B or !C
0 1 0 1
0 1 1 1
1 0 0 0 // excluded by !A or B or C
1 0 1 1
1 1 0 1
1 1 1 1
Five of the possible eight terms remain true. Thus, the example has 2^3 - 3 = 5 solutions.
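The count in the example can be confirmed by brute force. The sketch below enumerates all 2^N assignments, so it is only meant for small N; the clause encoding (a list of (variable index, negated) pairs) and the name `count_solutions` are assumptions made for this illustration.

```python
from itertools import product

def count_solutions(num_vars, clauses):
    """Count assignments that satisfy every clause by trying all 2**num_vars."""
    count = 0
    for assignment in product([False, True], repeat=num_vars):
        # A literal (v, negated) is true when the variable's value differs
        # from the negation flag.
        if all(any(assignment[v] != negated for v, negated in clause)
               for clause in clauses):
            count += 1
    return count

# Variables A, B, C are indices 0, 1, 2.
clauses = [
    [(0, False), (1, False), (2, False)],  # A or B or C
    [(0, True), (1, False), (2, False)],   # !A or B or C
    [(0, False), (1, False), (2, True)],   # A or B or !C
]
print(count_solutions(3, clauses))  # compare with 2**3 - 3
```

Unlike the 2^N - K formula, the brute-force count does not rely on every clause containing all N variables, so it can also be used to check cases where that assumption fails.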

Unix / Shell Add a range of columns to file

So I've been trying the same problem for the last few days, and I'm at a formatting road block.
I have a program that will only run if it's working on an equal number of columns in every row. I know the total column count, and the number of columns that need to be added with a filler value of 0, but I am not sure how to do this. Is there some type of range option with awk or sed for this?
Input:
A B C D E
A B C D E 1 1 1 1
Output:
A B C D E 0 0 0 0
A B C D E 1 1 1 1
The alphabet columns are always present (with different values), but this "fill in the blank" function is eluding me. I can't use R for this due to data file size.
One way using awk:
$ awk 'NF!=n{for(i=NF+1;i<=n;i++)$i=0}1' n=9 file
A B C D E 0 0 0 0
A B C D E 1 1 1 1
Just set n to the number of columns you want to pad up to. Rows that already have n or more fields are printed unchanged.
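For readers who want to see the padding logic spelled out, here is a rough Python equivalent of the awk one-liner. The function name `pad_fields` is made up for the example; like awk's default behaviour when a field is assigned, it rejoins the fields with single spaces.

```python
def pad_fields(line, width, filler="0"):
    """Append filler fields until the line has at least `width` fields."""
    fields = line.split()
    # A negative repeat count yields an empty list, so long rows are unchanged.
    fields += [filler] * (width - len(fields))
    return " ".join(fields)

for line in ["A B C D E", "A B C D E 1 1 1 1"]:
    print(pad_fields(line, 9))
```

Applied line by line to an open file object, this processes the input as a stream, so file size should not be a constraint in the way it was with R.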

Hungarian (Kuhn Munkres) algorithm oddity

I've read every answer here, Wikipedia and WikiHow, the Indian guy's lecture, and other sources, and I'm pretty sure I understand what they're saying and have implemented it that way. But I'm confused about a statement that all of these explanations make and that seems clearly false.
They all say to cover the zeros in the matrix with a minimum number of lines, and if that is equal to N (that is, there's a zero in every row and every column), then there's a zero solution and we're done. But then I found this:
a b c d e
A 0 7 0 0 0
B 0 8 0 0 6
C 5 0 7 3 4
D 5 0 5 9 3
E 0 4 0 0 9
There's a zero in every row and column, and no way to cover the zeros with fewer than five lines, but there's clearly no zero solution. Row C has only the zero in column b, but that leaves no zero for row D.
Do I misunderstand something here? Do I need a better test for whether or not there's a zero assignment possible? Are all these sources leaving out something essential?
You can cover the zeros in the matrix in your example with only four lines: column b, row A, row B, row E.
Here is a step-by-step walkthrough of the algorithm as it is presented in the Wikipedia article as of June 25 applied to your example:
a b c d e
A 0 7 0 0 0
B 0 8 0 0 6
C 5 0 7 3 4
D 5 0 5 9 3
E 0 4 0 0 9
Step 1: The minimum in each row is zero, so the subtraction has no effect. We try to assign tasks such that every task is performed at zero cost, but this turns out to be impossible. Proceed to next step.
Step 2: The minimum in each column is also zero, so this step also has no effect. Proceed to next step.
Step 3: We locate a minimal number of lines to cover up all the zeros. We find [b,A,B,E].
  a b c d e
A --|------
B --|------
C 5 | 7 3 4
D 5 | 5 9 3
E --|------
Step 4: We locate the minimal uncovered element. This is 3, at (C,d) and (D,e). We subtract 3 from every unmarked element and add 3 to every element covered by two lines:
a b c d e
A 0 10 0 0 0
B 0 11 0 0 6
C 2 0 4 0 1
D 2 0 2 6 0
E 0 7 0 0 9
Now the minimum number of lines needed to cover all the zeros becomes 5. Note that having a zero in every row and every column is not enough to guarantee this, since the original matrix had that property too; what matters is that five zeros can be chosen with no two sharing a row or column. The algorithm asserts that an assignment like the one we were looking for in step 1 should now be possible on the new matrix.
We try to assign tasks such that every task is performed at zero cost (according to the new matrix). This is now possible. We find the solution [(A,e),(B,c),(C,d),(D,b),(E,a)].
We can now go back and verify that the solution that we found actually is optimal. We see that every assigned job has zero cost, except (C,d), which has cost 3. Since 3 is actually the lowest nonzero element in the matrix, and we have seen that there is no zero-cost solution, it is clear that this is an optimal solution.
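The optimality claim is easy to double-check for a 5x5 matrix by trying all 120 assignments. The sketch below is a brute-force check, not the Hungarian algorithm itself; the helper names `assignment_cost` and `best_assignment` are made up for the example.

```python
from itertools import permutations

# Original cost matrix from the question; rows A..E, columns a..e.
cost = [
    [0, 7, 0, 0, 0],
    [0, 8, 0, 0, 6],
    [5, 0, 7, 3, 4],
    [5, 0, 5, 9, 3],
    [0, 4, 0, 0, 9],
]

def assignment_cost(matrix, cols):
    """Total cost when row r is assigned to column cols[r]."""
    return sum(matrix[r][c] for r, c in enumerate(cols))

def best_assignment(matrix):
    """Return (minimum total cost, one optimal column choice per row)."""
    n = len(matrix)
    best = min(permutations(range(n)), key=lambda cols: assignment_cost(matrix, cols))
    return assignment_cost(matrix, best), best

total, cols = best_assignment(cost)
print(total, cols)

# The assignment from the walkthrough: (A,e),(B,c),(C,d),(D,b),(E,a).
walkthrough = (4, 2, 3, 1, 0)
print(assignment_cost(cost, walkthrough))
```

Several assignments may tie for the minimum, so the column choice printed by the brute-force search need not match the one from the walkthrough even though the totals agree.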

Resources