I have two files and would like to find out which part of file 2 contains the rows of file 1 in the same order, based on one of several columns (col4). Both files are sorted by an identifier in col1 (from 1 to n), but the identifier is not the same between the files. The column sequence from file 1 always occurs as one contiguous block in file 2.
file1:
x 1
x 2
x 3
file2:
y 5
y 1
y 2
y 3
y 6
output:
y 1
y 2
y 3
Another thing to take into consideration is that the entries in the column to be filtered on are not unique.
I already tried
awk 'FNR==NR{ a[$2]=$2;next } ($2 in a)' file1 file2 > output
but it only works if you have unique identifiers.
To clarify with real-life data: I would like to extract the rows of File2 whose column 4 values appear in the same order as in File1.
File1:
ATOM 13 O ALA A 2 37.353 35.331 -19.903 1.00 71.02 O
ATOM 18 O TRP A 3 38.607 32.133 -18.273 1.00 69.13 O
File2:
ATOM 1 N MET A 1 42.218 38.990 -18.511 1.00 64.21 N
ATOM 10 CA ALA A 2 38.451 37.475 -20.033 1.00 71.02 C
ATOM 13 O ALA A 2 37.353 35.331 -19.903 1.00 71.02 O
ATOM 18 O TRP A 3 38.607 32.133 -18.273 1.00 69.13 O
ATOM 29 CA ILE A 4 38.644 33.633 -15.907 1.00 72.47 C
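One way to cope with non-unique keys is to compare whole sequences instead of testing membership, sliding a window over file 2 until the key column matches file 1 row for row. The following is only a minimal sketch of that idea in Python, not a tested solution for the real files; the function name find_block and the 0-based key_col parameter are made up for illustration, and it assumes whitespace-separated columns.

```python
def find_block(query_keys, rows, key_col):
    """Return the first run of consecutive rows whose key column
    equals query_keys, element by element; [] if there is none."""
    keys = [row.split()[key_col] for row in rows]
    n = len(query_keys)
    for start in range(len(keys) - n + 1):
        # Compare the whole window, so repeated keys are not a problem.
        if keys[start:start + n] == query_keys:
            return rows[start:start + n]
    return []


file1 = ["x 1", "x 2", "x 3"]
file2 = ["y 5", "y 1", "y 2", "y 3", "y 6"]

query = [line.split()[1] for line in file1]  # second column is the key here
for line in find_block(query, file2, key_col=1):
    print(line)
```

For the ATOM records the key would be the fourth column, i.e. key_col=3 with 0-based indexing.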
I have an m-by-n matrix A and an m-by-k matrix B. I want to obtain an m-by-(n*k) matrix C in which each row is the flattened outer product of the corresponding rows of A and B. Is there an easy and efficient way to do this? Thanks!
Example:
A:
1 2
3 4
B:
0.5 2
-0.5 -2
C:
0.5 1 2 4
-1.5 -2 -6 -8
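To pin down the ordering implied by the example: each row of C lists the products a_i * b_j with the index into A varying fastest. A plain-Python sketch that reproduces the example is shown here; rowwise_outer is a hypothetical helper name, and this is meant to illustrate the layout rather than to be the efficient vectorized answer.

```python
def rowwise_outer(A, B):
    """For each pair of rows (a, b), list a_i * b_j with i varying fastest."""
    return [[a_i * b_j for b_j in b for a_i in a] for a, b in zip(A, B)]


A = [[1, 2], [3, 4]]
B = [[0.5, 2.0], [-0.5, -2.0]]

C = rowwise_outer(A, B)
print(C)  # [[0.5, 1.0, 2.0, 4.0], [-1.5, -2.0, -6.0, -8.0]]
```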
How do I turn a matrix:
[ 0.12 0.23 0.34 ;
0.45 0.56 0.67 ;
0.78 0.89 0.90 ]
into a 'coordinate' matrix with a bunch of rows?
[ 1 1 0.12 ;
1 2 0.23 ;
1 3 0.34 ;
2 1 0.45 ;
2 2 0.56 ;
2 3 0.67 ;
3 1 0.78 ;
3 2 0.89 ;
3 3 0.90 ]
(permutation of the rows is irrelevant, it only matters that the data is in this structure)
Right now I'm using a for loop but that takes a long time.
Here is an option using ind2sub:
mat= [ 0.12 0.23 0.34 ;
0.45 0.56 0.67 ;
0.78 0.89 0.90 ] ;
[I,J] = ind2sub(size(mat), 1:numel(mat));
r=[I', J', mat(:)]
r =
1.0000 1.0000 0.1200
2.0000 1.0000 0.4500
3.0000 1.0000 0.7800
1.0000 2.0000 0.2300
2.0000 2.0000 0.5600
3.0000 2.0000 0.8900
1.0000 3.0000 0.3400
2.0000 3.0000 0.6700
3.0000 3.0000 0.9000
Note that the rows come out in column-major order (reading down each column of the matrix), not in the row-major order of your example.
A = [ .12 .23 .34 ;
.45 .56 .67 ;
.78 .89 .90 ];
[ii jj] = meshgrid(1:size(A,1),1:size(A,2));
B = A.';
R = [ii(:) jj(:) B(:)];
If you don't mind a different order (according to your edit), you can do it more easily:
[ii jj] = ndgrid(1:size(A,1),1:size(A,2));
R = [ii(:) jj(:) A(:)];
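For readers who want to check the two orderings outside MATLAB, here is a small Python sketch of the same idea. The name to_coordinates and the row_major flag are invented for this illustration; indices are 1-based to match the MATLAB output.

```python
def to_coordinates(mat, row_major=True):
    """Return (row, col, value) triples with 1-based indices."""
    rows, cols = len(mat), len(mat[0])
    if row_major:
        # Read across each row, as in the question's example.
        order = [(i, j) for i in range(rows) for j in range(cols)]
    else:
        # Read down each column, which is MATLAB's natural order.
        order = [(i, j) for j in range(cols) for i in range(rows)]
    return [(i + 1, j + 1, mat[i][j]) for i, j in order]


mat = [[0.12, 0.23, 0.34],
       [0.45, 0.56, 0.67],
       [0.78, 0.89, 0.90]]

for triple in to_coordinates(mat):
    print(triple)
```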
In addition to generating the row/col indexes with meshgrid, you can use all three outputs of find as follows:
[II,JJ,AA]= find(A.'); %' note the transpose since you want to read across
M = [JJ II AA]
M =
1 1 0.12
1 2 0.23
1 3 0.34
2 1 0.45
2 2 0.56
2 3 0.67
3 1 0.78
3 2 0.89
3 3 0.9
This has limited application because find skips zero entries, so zeros in the matrix get lost. A nasty but correct workaround (thanks user664303):
B = A.'; v = B == 0; %' transpose to read across, otherwise work directly with A
[II, JJ, AA] = find(B + v);
M = [JJ II AA-v(:)];
Needless to say, I would recommend one of the other solutions. :) In particular, ndgrid is the most natural solution to obtaining the row,col inds.
I find ndgrid to be the most natural solution, but here's a fun way to do it manually with the odd couple of kron and repmat:
M = [kron(1:size(A,1),ones(1,size(A,2))).' ... %' row indexes
repmat((1:size(A,2))',size(A,1),1) ... %' col indexes
reshape(A.',[],1)] %' matrix values, read across
Simple adjustment to read down, as is natural in MATLAB:
M = [repmat((1:size(A,1))',size(A,2),1) ... %' row indexes (still)
kron(1:size(A,2),ones(1,size(A,1))).' ... %' column indexes
A(:)] % matrix values, read down
(Also since my first answer was obscenely hackish.)
I also find kron to be a nice tool for replicating each element at a time rather than the entire array at a time, as repmat does. For example:
>> 1:size(A,2)
ans =
1 2 3
>> kron(1:size(A,2),ones(1,size(A,1)))
ans =
1 1 1 2 2 2 3 3 3
Taking this a bit further, we can generate a new function called repel to replicate elements of an array as opposed to the whole array:
>> repel = @(x,m,n) kron(x,ones(m,n));
>> repel(1:4,1,2)
ans =
1 1 2 2 3 3 4 4
>> repel(1:3,2,2)
ans =
1 1 2 2 3 3
1 1 2 2 3 3
I previously calculated phi, psi, and omega quite easily from a .pdb file, because their definitions are rather straightforward. For instance, I know that each one requires the Cartesian coordinates of four atoms, defined as follows:
phi: C-N-CA-C
psi: N-CA-C-N
omega: CA-C-N-CA
Now I am trying to calculate side-chain angles. I know this is similar to phi, psi, and omega (in that I will need 4 atoms per angle). However, I am having difficulty reading the .pdb file and determining which atoms constitute the side chains in the first place. For instance, in the following segment (I removed hydrogens and the one carbon per residue without a subscript):
1 N -14.152 0.961 4.712
1 CA -13.296 0.028 3.924
1 O -11.358 1.432 3.941
1 CB -13.571 0.173 2.426
1 CG -15.046 -0.135 2.144
1 SD -16.174 1.270 1.982
1 CE -17.702 0.313 1.823
2 N -11.121 -0.642 4.703
2 CA -9.669 -0.447 4.998
2 O -9.036 -2.736 4.724
2 CB -9.462 -0.447 6.516
2 OG1 -10.399 0.505 7.010
2 CG2 -8.090 0.103 6.896
3 N -7.990 -1.247 3.462
3 CA -7.173 -2.314 2.811
3 O -5.487 -1.663 4.367
3 CB -6.881 -1.930 1.359
3 CG -8.162 -1.388 0.715
3 CD1 -8.594 -0.102 0.975
3 CD2 -8.903 -2.180 -0.135
3 CE1 -9.749 0.380 0.392
3 CE2 -10.057 -1.699 -0.718
3 CZ -10.490 -0.415 -0.457
3 OH -11.645 0.066 -1.038
4 N -5.204 -3.598 3.323
4 CA -3.922 -3.881 4.044
4 O -2.647 -4.537 2.142
4 CB -4.003 -5.297 4.612
4 CG -3.169 -5.399 5.890
4 CD -2.632 -6.837 6.002
4 CE -2.044 -7.084 7.401
4 NZ -2.526 -8.390 7.935
Would the first few angles be between atoms as such:
N-CA-O-CB
CA-O-CB-CG
O-CB-CG-SD
CB-CG-SD-CE
In other words, would I be including atoms like O, SD, etc? Or do I only include subscripts in the order A, B, G, D, E, Z (anything else)? So that my first few angles would be:
N-CA-CB-CG
CA-CB-CG-CE
CG-CE-N-CA
CE-N-CA-CB
As I understand it, you usually just include the numbered carbons (so the second option that you showed would be correct). The exception is when there aren't enough carbons to define an angle. For instance, in cysteine you use the sulfur instead of a gamma carbon. You can find a list of the standard side chain angles for each amino acid here.
For more information, see this page.
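Once the four atoms of a side-chain angle are chosen, the calculation is the same dihedral used for phi, psi and omega. As a reference point, the standard chi definitions for methionine (which residue 1 in the question appears to be, given its SD and CE atoms) are chi1 = N-CA-CB-CG, chi2 = CA-CB-CG-SD and chi3 = CB-CG-SD-CE, so the side-chain sulfur does take part. Below is a minimal Python sketch of the dihedral calculation; the helper names are made up for this example, and the sign convention should be checked against whatever tool you compare with.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points (x, y, z)."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Unit vector along the central bond p1 -> p2.
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Components of b0 and b2 perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# chi1 of residue 1 from the question: N-CA-CB-CG
n = (-14.152, 0.961, 4.712)
ca = (-13.296, 0.028, 3.924)
cb = (-13.571, 0.173, 2.426)
cg = (-15.046, -0.135, 2.144)
print(dihedral(n, ca, cb, cg))
```

The remaining angles follow by sliding the four-atom window one atom further along the side chain.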
I have files with the following format:
ATOM 8962 CA VAL W 8 8.647 81.467 25.656 1.00115.78 C
ATOM 8963 C VAL W 8 10.053 80.963 25.506 1.00114.60 C
ATOM 8964 O VAL W 8 10.636 80.422 26.442 1.00114.53 O
ATOM 8965 CB VAL W 8 7.643 80.389 25.325 1.00115.67 C
ATOM 8966 CG1 VAL W 8 6.476 80.508 26.249 1.00115.54 C
ATOM 8967 CG2 VAL W 8 7.174 80.526 23.886 1.00115.26 C
ATOM 4440 O TYR S 89 4.530 166.005 -14.543 1.00 95.76 O
ATOM 4441 CB TYR S 89 2.847 168.812 -13.864 1.00 96.31 C
ATOM 4442 CG TYR S 89 3.887 169.413 -14.756 1.00 98.43 C
ATOM 4443 CD1 TYR S 89 3.515 170.073 -15.932 1.00100.05 C
ATOM 4444 CD2 TYR S 89 5.251 169.308 -14.451 1.00100.50 C
ATOM 4445 CE1 TYR S 89 4.464 170.642 -16.779 1.00100.70 C
ATOM 4446 CE2 TYR S 89 6.219 169.868 -15.298 1.00101.40 C
ATOM 4447 CZ TYR S 89 5.811 170.535 -16.464 1.00100.46 C
ATOM 4448 OH TYR S 89 6.736 171.094 -17.321 1.00100.20 O
ATOM 4449 N LEU S 90 3.944 166.393 -12.414 1.00 94.95 N
ATOM 4450 CA LEU S 90 5.079 165.622 -11.914 1.00 94.44 C
ATOM 5151 N LEU W 8 -66.068 209.785 -11.037 1.00117.44 N
ATOM 5152 CA LEU W 8 -64.800 210.035 -10.384 1.00116.52 C
ATOM 5153 C LEU W 8 -64.177 208.641 -10.198 1.00116.71 C
ATOM 5154 O LEU W 8 -64.513 207.944 -9.241 1.00116.99 O
ATOM 5155 CB LEU W 8 -65.086 210.682 -9.033 1.00115.76 C
ATOM 5156 CG LEU W 8 -64.274 211.829 -8.478 1.00113.89 C
ATOM 5157 CD1 LEU W 8 -64.528 211.857 -7.006 1.00111.94 C
ATOM 5158 CD2 LEU W 8 -62.828 211.612 -8.739 1.00112.96 C
In principle, column 5 (W, in this case, which represents the chain ID) should be identical only in consecutive chunks. However, in files with too many chains, there are not enough letters of the alphabet to assign a unique ID to each chain, so duplicates may occur.
I would like to be able to check whether or not this is the case. In other words, I would like to know whether a given chain ID (A-Z, always in the 5th column) is present in non-consecutive chunks. I do not mind if it changes from W to S; I want to know whether two separate chunks share the same chain ID, that is, whether W or S reappears at some point. Strictly speaking, this is only a problem if the chunks also share the first and the 6th columns, but I do not want to complicate things too much.
I do not want to print the lines, just the name of the file in which the issue occurs and the chain ID (in this case W), in order to solve the problem. I already know how to fix it, but I need to identify the problematic files so that I can focus on those rather than repair files that are already fine.
SOLUTION (thanks to all for your help and namely to sehe):
for pdb in *.pdb ; do
hit=$(awk '$1 == "ATOM"' "$pdb" | cut -c22-23 | uniq | sort | uniq -dc)
[ "$hit" ] && echo "$pdb" = $hit
done
For this particular sample:
cut -c22-23 t | uniq | sort | uniq -dc
Will output
2 W
(the 22nd column contains 2 runs of the letter 'W')
untested
awk '
seen[$5] && $5 != current {
print "found non-consecutive chain on line " NR
exit
}
{ current = $5; seen[$5] = 1 }
' filename
Here you go, this awk script is tested and takes into account not just 'W':
{
if (ln[$5] && ln[$5] + 1 != NR) {
print "dup " $5 " at line " NR;
}
ln[$5] = NR;
}
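The same check can be written as "collapse consecutive repeats, then look for IDs that still occur more than once", which is what the uniq | sort | uniq -d pipeline does. Here is a hedged Python sketch of that logic; the function names are invented for this example, and it assumes the chain ID is the fifth whitespace-separated field, as in the sample (in strictly formatted PDB files, reading fixed column 22 would be more robust).

```python
from collections import Counter
from itertools import groupby


def chain_ids_from_pdb(lines):
    """Chain ID (5th whitespace-separated field) of every ATOM record."""
    return [line.split()[4] for line in lines if line.startswith("ATOM")]


def repeated_chains(chain_ids):
    """Chain IDs that occur in more than one run of consecutive lines."""
    runs = [cid for cid, _ in groupby(chain_ids)]  # one entry per run
    counts = Counter(runs)
    return sorted(cid for cid, n in counts.items() if n > 1)


sample = [
    "ATOM 8962 CA VAL W 8 8.647 81.467 25.656 1.00115.78 C",
    "ATOM 4440 O TYR S 89 4.530 166.005 -14.543 1.00 95.76 O",
    "ATOM 5151 N LEU W 8 -66.068 209.785 -11.037 1.00117.44 N",
]
print(repeated_chains(chain_ids_from_pdb(sample)))  # ['W']
```

Returning all offending IDs at once, rather than stopping at the first hit, makes it easy to print one summary line per file.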