I have a matrix that describes correlation between items A-K, where 1=correlated and 0=uncorrelated.
Is there an easy way to extract the largest cluster from the data? In other words, the cluster with the most correlated elements. Below is some sample data:
# A B C D E F G H I J K
A 1 1 1 1 1 1 1 1 1 1 1
B 1 1 1 1 1 1 1 1 1 1 1
C 1 1 1 1 1 1 1 1 1 1 1
D 1 1 1 1 0 1 0 0 0 1 1
E 1 1 1 0 1 0 0 1 1 0 0
F 1 1 1 1 0 1 0 0 1 0 1
G 1 1 1 0 0 0 1 0 0 0 0
H 1 1 1 0 1 0 0 1 1 1 1
I 1 1 1 0 1 1 0 1 1 0 0
J 1 1 1 1 0 0 0 1 0 1 0
K 1 1 1 1 0 1 0 1 0 0 1
Swapping a few columns/rows by eye, the expected result would be the top left of the matrix, which is a cluster of size 6 that contains: {A, B, C, D, F, K}
I know awk isn't the most user-friendly for this application, but I'm keen on using awk since this will integrate into a larger awk script. That being said, I'm not completely immovable on the language.
Not sure where to start, but here's a more complex version of what I'm thinking of, in Python:
https://stats.stackexchange.com/questions/138325/clustering-a-correlation-matrix
Assumptions:
all matrices are symmetric (ie, square; equal to their transpose; matrix[x,y] = matrix[y,x])
matrix[x,x]=1 for all x
all matrix entries are 0 or 1
not interested in 1-element clusters
not interested in permutations of the same cluster (ie, A,B is the same as B,A)
since we don't have to worry about permutations, we can process elements in the order in which they appear in the matrix (eg, we process A,B,C and ignore the equivalent orderings A,C,B; B,A,C; B,C,A; C,A,B; and C,B,A); this lets us restrict processing to the top/right half of the matrix (above the diagonal), working left to right, which greatly reduces the number of candidates we need to evaluate
as demonstrated in the question, the elements that make up a cluster can be shifted up/left in the matrix so as to fill the top/left corner with 1's; this comes into play during processing, where for each new element we only need to test the new column/row that it adds to this top/left portion of the matrix
Regarding the last assumption ... assume we have cluster A,D and we now want to test A,D,F; we just need to test the new column/row entries (?):
Current Cluster             New Cluster ?

    A D                         A D F
A   1 1                     A   1 1 ?       # if the matrix is symmetric then we only need to test
D   1 1                     D   1 1 ?       # the new column *OR* the new row, not both;
                            F   ? ? 1       # bottom/right == 1 == matrix[F,F] per earlier assumption
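Expressed in awk, the incremental test boils down to checking the new column against the elements already accepted. The helper below is a hypothetical sketch to illustrate the idea only (it is not part of the script that follows); stack[1..stackseq] is assumed to hold the indexes already in the cluster and m[][] the matrix:

function extends(j,    k) {          # hypothetical helper: can candidate index j extend the current cluster?
    for (k=1; k<=stackseq; k++)
        if (m[stack[k]][j] == 0)     # any 0 in the new column means j does not extend the cluster
            return 0
    return 1                         # all entries are 1, so the cluster can grow by one element
}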
One idea using a recursive function and two GNU awk features: a) arrays of arrays (aka multi-dimensional arrays) and b) PROCINFO["sorted_in"] for custom sorting of clusters sent to stdout:
awk '
######
# load matrix into memory

FNR==1 { n=(NF-1)                               # number of elements to be processed
         for (i=2;i<=NF;i++)
             label[i-1]=$i                      # save labels
         next
       }

       { for (i=2;i<=NF;i++)
             m[FNR-1][i-1]=$i                   # populate matrix array m[row#][column#]
       }

######
# define our recursive function

function find_cluster(cluster, i, clstrcount, stackseq,    j, k, corrcount) {

    # cluster       : current working cluster (eg, "A,B,C")
    # i             : index of latest element (eg, for "A,B,C" the latest element is "C", so i = 3)
    # clstrcount    : number of elements in current cluster
    # stackseq      : sequence number of the stack[] array
    #               : stack[] contains the list of indexes for the current cluster (for "A,B,C" stack = "1,2,3")
    # j,k,corrcount : additional variables declared as "local" to this invocation of the function

    clstrcount++                                # number of elements to be processed at this call/level

    for (j=i+1;j<=n;j++) {                      # process all elements/indexes greater than i
        corrcount=1                             # reset correlation count; always start with 1 since m[j][j]=1

        # check the new column/row added to the top/left of the matrix to see
        # if it extends the current cluster (ie, all entries are "1")

        for (k in stack) {                      # loop through element/indexes in stack
            if (m[stack[k]][j])                 # check column entries
                corrcount++
            if (m[j][stack[k]])                 # check row entries; not necessary if the matrix is symmetric, but included here to show the m[][] references
                corrcount++
        }

        if (corrcount == (stackseq*2 + 1)) {    # if we have all "1"s we have a new cluster of size clstrcount
            stack[++stackseq]=j                 # "push" current element/index on stack; increment stack seq/index
            cluster=cluster "," label[j]        # add current element/label to cluster
            max= (clstrcount>max) ? clstrcount : max       # new max(cluster count) ?
            clusters[clstrcount][++clstrseq]=cluster       # add new cluster to our master list: clusters[cluster_count][seq]
            find_cluster(cluster, j, clstrcount, stackseq) # recursive call to check for the next element(s)
            delete stack[stackseq--]            # back from recursive call so "pop" current element (j) from stack
            gsub(/[,][^,]+$/,"",cluster)        # remove current element/label from cluster to make way for the next element/label to be tested
        }
    }
}

######
# start looking for clusters of size 2+

END { max=2                                     # not interested in clusters of "1"
      clstrseq=0                                # init clusters[...][seq] sequence; never reset so earlier entries are not overwritten

      for (i=1;i<n;i++) {                       # loop through list of elements
          clstrcount=1                          # init cluster count = 1
          cluster=label[i]                      # reset cluster to current element/label
          stackseq=1                            # reset stack[seq] sequence seed
          stack[stackseq]=i                     # "push" current element on stack

          find_cluster(cluster, i, clstrcount, stackseq)   # start recursive calls looking for the next element in the cluster
      }

######
# for now just display clusters of size > 2; adjust the next line to add/remove cluster sizes from stdout

      if (max>2)                                # print list of clusters with length > 2
          for (i=max;i>2;i--) {                 # print from largest to smallest and ...
              PROCINFO["sorted_in"]="@val_str_asc"          # ... in alphabetical order
              printf "####### clusters of size %s:\n", i
              for (j in clusters[i])            # loop through all entries for clusters of size "i"
                  print clusters[i][j]
          }
}
' matrix.dat
NOTE: The current version is (admittedly) a bit verbose; it is the result of jotting down a first-pass solution while working through the details, and with some further analysis it may be possible to reduce the code. Having said that, the time it takes to find all 2+ sized clusters in this 11-element matrix isn't too bad:
real 0m0.084s
user 0m0.031s
sys 0m0.046s
This generates:
####### clusters of size 6:
A,B,C,D,F,K
A,B,C,E,H,I
####### clusters of size 5:
A,B,C,D,F
A,B,C,D,J
A,B,C,D,K
A,B,C,E,H
A,B,C,E,I
A,B,C,F,I
A,B,C,F,K
A,B,C,H,I
A,B,C,H,J
A,B,C,H,K
A,B,D,F,K
A,B,E,H,I
A,C,D,F,K
A,C,E,H,I
B,C,D,F,K
B,C,E,H,I
####### clusters of size 4:
A,B,C,D
A,B,C,E
A,B,C,F
A,B,C,G
A,B,C,H
A,B,C,I
A,B,C,J
A,B,C,K
A,B,D,F
A,B,D,J
A,B,D,K
A,B,E,H
A,B,E,I
A,B,F,I
A,B,F,K
A,B,H,I
A,B,H,J
A,B,H,K
A,C,D,F
A,C,D,J
A,C,D,K
A,C,E,H
A,C,E,I
A,C,F,I
A,C,F,K
A,C,H,I
A,C,H,J
A,C,H,K
A,D,F,K
A,E,H,I
B,C,D,F
B,C,D,J
B,C,D,K
B,C,E,H
B,C,E,I
B,C,F,I
B,C,F,K
B,C,H,I
B,C,H,J
B,C,H,K
B,D,F,K
B,E,H,I
C,D,F,K
C,E,H,I
####### clusters of size 3:
A,B,C
A,B,D
A,B,E
A,B,F
A,B,G
A,B,H
A,B,I
A,B,J
A,B,K
A,C,D
A,C,E
A,C,F
A,C,G
A,C,H
A,C,I
A,C,J
A,C,K
A,D,F
A,D,J
A,D,K
A,E,H
A,E,I
A,F,I
A,F,K
A,H,I
A,H,J
A,H,K
B,C,D
B,C,E
B,C,F
B,C,G
B,C,H
B,C,I
B,C,J
B,C,K
B,D,F
B,D,J
B,D,K
B,E,H
B,E,I
B,F,I
B,F,K
B,H,I
B,H,J
B,H,K
C,D,F
C,D,J
C,D,K
C,E,H
C,E,I
C,F,I
C,F,K
C,H,I
C,H,J
C,H,K
D,F,K
E,H,I
I have a very large text file (16GB) that I want to subset as fast as possible.
Here is a sample of the data involved
0 M 4 0
0 0 Q 0 10047345 3080290,4098689 50504886,4217515 9848058,1084315 50534229,4217515 50591618,4217515 26242582,2597528 34623075,3279130 68893581,5149883 50628761,4217517 32262001,3142702 35443881,3339757
0 108 C 0 50628761
0 1080 C 0 50628761
1 M 7 0
1 0 Q 0 17143989
2 M 15 1
2 0 Q 0 17143989 4219157,1841361,853923,1720163,1912374,1755325,4454730 65548702,4975721 197782,39086 54375043,4396765 31589696,3091097 6876504,851594 3374640,455375 13274885,1354902 31585771,3091016 61234218,4723345 31583582,3091014
2 27 C 0 31589696
The first number on every line is a sessionID, and any line with an 'M' denotes the start of a session (data is grouped by session). On an 'M' line, the number following the M is a Day and the second number after it is a userID; a user can have multiple sessions.
I want to extract all lines related to a specific user; for each of their sessions that means all of the lines up until the next 'M' line is encountered (which can be any number of lines). As a second task, I also want to extract all session lines related to a specific day.
For example with the above data, to extract the records for userid '0' the output would be:
0 M 4 0
0 0 Q 0 10047345 3080290,4098689 50504886,4217515 9848058,1084315 50534229,4217515 50591618,4217515 26242582,2597528 34623075,3279130 68893581,5149883 50628761,4217517 32262001,3142702 35443881,3339757
0 108 C 0 50628761
0 1080 C 0 50628761
1 M 7 0
1 0 Q 0 17143989
To extract the records for day 7 the output would be:
1 M 7 0
1 0 Q 0 17143989
I believe there is a much more elegant and simpler solution than what I have achieved so far, and it would be great to get some feedback and suggestions. Thank you.
What I have tried
I tried to use pcregrep -M to apply this pattern directly (matching data between two 'M' lines) but struggled to get it working across the line breaks. I still suspect this may be the fastest option, so any guidance on whether this may be possible would be great.
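For reference, a pcregrep -M attempt might look something like the following; this is an untested sketch that assumes tab-separated fields, hard-codes userID 0, and relies on the negative lookahead to stop the match before the next 'M' line:

# untested sketch: match an 'M' line for user 0 plus every following non-'M' line
pcregrep -M '^\S+\tM\t\S+\t0\n(?:\S+\t(?!M\t).*\n?)*' trainSample.txt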
The next part is quite scattered and it is not necessary to read on if you already have an idea for a better solution!
Failing the above, I split the problem into two parts:
Part 1: Isolating all 'M' lines to obtain a list of sessions that belong to that user/day
grep method is fast (then need to figure out how to use this data)
time grep -c "M\t.*\t$user_id" trainSample.txt >> sessions.txt
awk method to create an array is slow
time myarr=$(awk '/M\t.*\t$user_id/ {print $1}' trainSample.txt)
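As written, $user_id sits inside single quotes so the shell never expands it, and myarr=$(...) captures a single string rather than an array. A variant that passes the ID in with -v and builds a real bash array might look like this (a sketch, assuming bash 4+ for mapfile):

# pass the user ID into awk with -v and collect the matching session IDs into a bash array
mapfile -t myarr < <(awk -v uid="$user_id" '$2=="M" && $4==uid {print $1}' trainSample.txt)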
Part 2: Extracting all lines belonging to a session on the list created in part 1
Continuing from the awk method, I ran grep once for each session ID, but this is WAY too slow (days to complete on 16GB):
for i in "${!myarr[@]}";
do
    grep "^${myarr[$i]}\t" trainSample.txt >> sessions.txt
    echo -ne "Session $i\r"
done
Instead of running grep once per session ID as above, putting them all in a single grep command is MUCH faster (I ran it with 8 sessionIDs in a [1|2|3|..|8] format and it took the same time as each did separately, i.e. 8X faster). However, I then need to figure out how to build that pattern dynamically.
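One way to build that single pattern dynamically from the collected session IDs (a sketch, assuming myarr is a bash array of session IDs and tab-separated fields as in the grep attempts above):

# join the session IDs into one alternation, e.g. "0|1|5|9", then grep once
pattern=$(IFS='|'; echo "${myarr[*]}")
grep -E "^($pattern)"$'\t' trainSample.txt >> sessions.txt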
Update
I have actually established a working solution which only takes seconds to complete, but it is somewhat messy and inflexible bash code that I have yet to extend to the second (isolating by days) case.
I want to extract all lines related to a specific user which for each session include all of the lines up until the next 'M' line is encountered (can be any number of lines).
$ awk '$2=="M"{p=$4==0}p' file
0 M 4 0
0 0 Q 0 10047345 3080290,4098689 50504886,4217515 9848058,1084315 50534229,4217515 50591618,4217515 26242582,2597528 34623075,3279130 68893581,5149883 50628761,4217517 32262001,3142702 35443881,3339757
0 108 C 0 50628761
0 1080 C 0 50628761
1 M 7 0
1 0 Q 0 17143989
As a second task I also want to extract all session lines related to a specific day.
$ awk '$2=="M"{p=$3==7}p' file
1 M 7 0
1 0 Q 0 17143989
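If the userID or day needs to vary, the same approach can be parameterized with awk -v rather than hard-coding the values (a small variant of the above, with uid and day as the assumed variable names):

$ awk -v uid=0 '$2=="M"{p=($4==uid)}p' file     # all sessions for a given user
$ awk -v day=7 '$2=="M"{p=($3==day)}p' file     # all sessions for a given day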