Use awk to print first column, second column and one column every four after that - bash

I have a text file and I want to extract the first column, the second column, and every fourth column after that (columns 1, 2, 6, 10, ...). I also want to get rid of row #2. How can I do that using awk or similar?
An example:
I have a text file as such:
A B C D E F G H I J .. N (header 1)
A B C D E F G H I J .. N (header 2)
A B C D E F G H I J .. N (row 1)
A B C D E F G H I J .. N (row 2)
A B C D E F G H I J .. N (row n)
I'm trying to get it as
A B F J .. N (header 1)
A B F J .. N (row 1)
A B F J .. N (row 2)
A B F J .. N (row n)
Thanks.
P.S. I've tried adapting the solutions in "How to print every 4th column up to nth column and from (n+1)th column to last using awk?", but they don't work for me.

$ awk 'NR!=2{out=$1; for (i=2;i<=NF;i+=4) out = out OFS $i; print out}' file
A B F J 1)
A B F J 1)
A B F J 2)
A B F J n)
The above output is messy because of the ...s and comments in your sample input that make it untestable. Always post ACTUAL, testable input/output, not a description or other abstract representation of such. And don't repeat the same data on every line as that makes it hard to map the output fields to the input and so makes it harder to understand your requirements. This would have been a more useful example:
$ cat file
101 102 103 104 105 106 107 108 109 110 111
201 202 203 204 205 206 207 208 209 210 211
301 302 303 304 305 306 307 308 309 310 311
401 402 403 404 405 406 407 408 409 410 411
$ awk 'NR!=2{out=$1; for (i=2;i<=NF;i+=4) out = out OFS $i; print out}' file
101 102 106 110
301 302 306 310
401 402 406 410
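For comparison, here is a minimal Python sketch of the same selection. It is not part of the original answer, and the name `pick_columns` is made up for illustration. It assumes whitespace-separated fields and mirrors the awk loop: keep field 1, then fields 2, 6, 10, ... and skip the second line.

```python
def pick_columns(line):
    """Keep the first field, then every 4th field starting from the second."""
    fields = line.split()
    # fields[1::4] takes indices 1, 5, 9, ... i.e. columns 2, 6, 10, ...
    return fields[:1] + fields[1::4]


def filter_lines(lines):
    """Apply pick_columns to every line except line number 2."""
    return [
        " ".join(pick_columns(line))
        for number, line in enumerate(lines, start=1)
        if number != 2
    ]


sample = [
    "101 102 103 104 105 106 107 108 109 110 111",
    "201 202 203 204 205 206 207 208 209 210 211",
    "301 302 303 304 305 306 307 308 309 310 311",
    "401 402 403 404 405 406 407 408 409 410 411",
]
for row in filter_lines(sample):
    print(row)
```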

Related

Find a column from a file in another file based on value and order

I have two files and would like to find out which parts of file 1 occur in the same order/sequence in file 2, based on one of multiple columns (col4). The files are sorted by an identifier in col1 (from 1 to n), but the identifier is not the same between the files. The column from file 1 always occurs as one block in file 2.
file1:
x 1
x 2
x 3
file2:
y 5
y 1
y 2
y 3
y 6
output:
y 1
y 2
y 3
Another thing to take into consideration is that the entries in the column to be filtered on are not unique.
I already tried
awk 'FNR==NR{ a[$2]=$2;next } ($2 in a)' file1 file2 > output
but it only works if you have unique identifiers.
To clarify it with real life data: I would like to extract the rows where I have the same order based on column 4.
File1:
ATOM 13 O ALA A 2 37.353 35.331 -19.903 1.00 71.02 O
ATOM 18 O TRP A 3 38.607 32.133 -18.273 1.00 69.13 O
File2:
ATOM 1 N MET A 1 42.218 38.990 -18.511 1.00 64.21 N
ATOM 10 CA ALA A 2 38.451 37.475 -20.033 1.00 71.02 C
ATOM 13 O ALA A 2 37.353 35.331 -19.903 1.00 71.02 O
ATOM 18 O TRP A 3 38.607 32.133 -18.273 1.00 69.13 O
ATOM 29 CA ILE A 4 38.644 33.633 -15.907 1.00 72.47 C
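One way to approach this, sketched in Python under the stated assumption that the file 1 column appears in file 2 as a single contiguous block: compare the sequence of key values rather than testing each value on its own, so repeated values are not a problem. The function name `find_block` and the zero-based `col` argument are illustrative choices, not something from the question.

```python
def find_block(rows1, rows2, col):
    """Return the rows of rows2 whose `col` values reproduce, in order and
    as one contiguous block, the `col` values of rows1 (first match only)."""
    keys1 = [row.split()[col] for row in rows1]
    keys2 = [row.split()[col] for row in rows2]
    n = len(keys1)
    # Slide a window of length n over file 2 and compare whole sequences.
    for start in range(len(keys2) - n + 1):
        if keys2[start:start + n] == keys1:
            return rows2[start:start + n]
    return []


file1 = ["x 1", "x 2", "x 3"]
file2 = ["y 5", "y 1", "y 2", "y 3", "y 6"]
for row in find_block(file1, file2, col=1):
    print(row)
```

For the PDB-style data, the residue name is the fourth whitespace-separated field, so `col=3` would be the corresponding call. The sketch only returns the first matching block.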

find characters present in two different lines if they satisfy a positional relationship

I have a peculiar problem. A text file contains the following three lines,
>chain A
---------MGPRLSVWLLLLPAALLLHEEHSRAAA--KGGCAGSGC-GKCDCHGVKGQKGERGLPGLQGVIGFPGMQGPEGPQGPPGQKGDTGEPGLPGTKGTRGPPGASGYPGNPGLPGIPGQDGPPGPPGIPGCNGTKGERGPLGPPGLPGFAGNPGPPGLPGMKGDPGEILGHVPGMLLKGERGFPGIPGTPGPPGLPGLQGPVGPPGFTGPPGPPGPPGPPGEKGQMGLSFQGPKGDKGDQGVSGPPGVPGQA-------QVQEKGDFATKGEKGQKGEPGFQGMPGVGEKGEPGKPGPRGKPGKDGDKGEKGSPGFPGEPGYPGLIGRQGPQGEKGEAGPPGPPGIVIGTGPLGEKGERGYPGTPGPRGEPGPKGFPGLPGQPGPPGLPVPGQAGAPGFPGERGEKGDRGFPGTS-LP-GPSGRDGLPGPPGSPGPPGQPGYTNGIVECQPGPPGDQGPPGIPGQPGFIGEIGEKGQKGESCLICDIDGYRGPPGPQGPPGEIGFPGQPGAKGDRGLPGRDGVAGVPGPQGTPGLIGQPGAKGEPGEFYFDLRLKGDKGDPGFPGQPGMPGRAGSPGRDGHPGLPGPKGSPGSVGLKGERGPPGGVGFPGSRGDTGPPGPPGY---GPAGPIGDKGQAGFPGGPGSPGLPGPKGEPGKIVP--------------------LPGPPGAEGLPGSPGFPGPQGDRGFPGTPGRPGLPGEKGAVGQPGI-GFPGPPGPKGVDGLPGDMGPPGTPGRPGFNGLPGNPGVQGQKGEP---GVGLPGLKGLPGLPGIPGTPGEKGSIGVPGVPGEHGAIGPPGLQGIRGEPGPPGLPGSVGSPGVPGI-GPPGARGPPGGQGPPGLSGPPGIKGEKGFPGFPGLD-MPGPKGDKGAQGLPGITGQSGLPGLPGQQGAPGIPGFPGSKGEMGVMGTPGQPGSPGPVGAPGLPGEKGDHGFPGSSGPRGDPGLKGDKGDVGLPGKPGSMDKVDMGSMKGQKGDQGEKGQIGPIGEKGSRGDPGTPGVPGKDGQAGQPGQP-GPKGDPGISGTPGAPGLPGPKGSVGGMGLPGTPGEKGVPGIPGPQGSPGLPGDKGAKGEKGQAGPPGIGIPGLRGEKGDQGIAGFPGSPGEKGEKGSIGIPGMPGSPGLKGSPGSVGYPGSPGLPGEKGDKGLPGLDGIPGVKGEAGLPGTPGPTGPAGQKGEPGSDGIPGSAGEKGEPGLPGRGFPGFPGAKGDKGSKGEVGFP-GLAGSPGIPGSKGEQGFMGPPGPQGQPGLPGSPGHA-TEGPKGDRGPQGQPGLPGLPGPMGPPGLPGIDGVKGDKGNPGWPGAPGVPGPKGDPGFQGMPGIGGSPGITGSKGDMGPPGVPGFQGPKGLPGLQGIKGDQGDQGVPGAKGLPGPPGPPGPYDIIKGEPGLPGPEGPPGLKGLQGLPGPKGQQGVTGLVGIPGPPGIPGFDGAPGQKGEMGPAGPTGPRGFPGPPGPDGLPGSMGPPGTPSVDHGFLVTRHSQTIDDPQCPSGTKILYHGYSLLYVQGNERAHGQDLGTAGSCLRKFSTMPFLFCNINNVCNFASRNDYSYWLSTPEPMPMSMAPITGENIRPFISRCAVCEAPAMVMAVHSQTIQIPPCPSGWSSLWIGYSFVMHTSAGAEGSGQALASPGSCLEEFRSAPFIECHG-RGTCNYYANAYSFWLATIERSEMFKKPTPSTLKAGELRTHVSRCQVCMRRT
>chain B
---------MGPRLSVWLLLLPAALLLHEEHSRAAA--KGGCAGSGC-GKCDCHGVKGQKGERGLPGLQGVIGFPGMQGPEGPQGPPGQKGDTGEPGLPGTKGTRGPPGASGYPGNPGLPGIPGQDGPPGPPGIPGCNGTKGERGPLGPPGLPGFAGNPGPPGLPGMKGDPGEILGHVPGMLLKGERGFPGIPGTPGPPGLPGLQGPVGPPGFTGPPGPPGPPGPPGEKGQMGLSFQGPKGDKGDQGVSGPPGVPGQA-------QVQEKGDFATKGEKGQKGEPGFQGMPGVGEKGEPGKPGPRGKPGKDGDKGEKGSPGFPGEPGYPGLIGRQGPQGEKGEAGPPGPPGIVIGTGPLGEKGERGYPGTPGPRGEPGPKGFPGLPGQPGPPGLPVPGQAGAPGFPGERGEKGDRGFPGTS-LP-GPSGRDGLPGPPGSPGPPGQPGYTNGIVECQPGPPGDQGPPGIPGQPGFIGEIGEKGQKGESCLICDIDGYRGPPGPQGPPGEIGFPGQPGAKGDRGLPGRDGVAGVPGPQGTPGLIGQPGAKGEPGEFYFDLRLKGDKGDPGFPGQPGMPGRAGSPGRDGHPGLPGPKGSPGSVGLKGERGPPGGVGFPGSRGDTGPPGPPGY---GPAGPIGDKGQAGFPGGPGSPGLPGPKGEPGKIVP--------------------LPGPPGAEGLPGSPGFPGPQGDRGFPGTPGRPGLPGEKGAVGQPGI-GFPGPPGPKGVDGLPGDMGPPGTPGRPGFNGLPGNPGVQGQKGEP---GVGLPGLKGLPGLPGIPGTPGEKGSIGVPGVPGEHGAIGPPGLQGIRGEPGPPGLPGSVGSPGVPGI-GPPGARGPPGGQGPPGLSGPPGIKGEKGFPGFPGLD-MPGPKGDKGAQGLPGITGQSGLPGLPGQQGAPGIPGFPGSKGEMGVMGTPGQPGSPGPVGAPGLPGEKGDHGFPGSSGPRGDPGLKGDKGDVGLPGKPGSMDKVDMGSMKGQKGDQGEKGQIGPIGEKGSRGDPGTPGVPGKDGQAGQPGQP-GPKGDPGISGTPGAPGLPGPKGSVGGMGLPGTPGEKGVPGIPGPQGSPGLPGDKGAKGEKGQAGPPGIGIPGLRGEKGDQGIAGFPGSPGEKGEKGSIGIPGMPGSPGLKGSPGSVGYPGSPGLPGEKGDKGLPGLDGIPGVKGEAGLPGTPGPTGPAGQKGEPGSDGIPGSAGEKGEPGLPGRGFPGFPGAKGDKGSKGEVGFP-GLAGSPGIPGSKGEQGFMGPPGPQGQPGLPGSPGHA-TEGPKGDRGPQGQPGLPGLPGPMGPPGLPGIDGVKGDKGNPGWPGAPGVPGPKGDPGFQGMPGIGGSPGITGSKGDMGPPGVPGFQGPKGLPGLQGIKGDQGDQGVPGAKGLPGPPGPPGPYDIIKGEPGLPGPEGPPGLKGLQGLPGPKGQQGVTGLVGIPGPPGIPGFDGAPGQKGEMGPAGPTGPRGFPGPPGPDGLPGSMGPPGTPSVDHGFLVTRHSQTIDDPQCPSGTKILYHGYSLLYVQGNERAHGQDLGTAGSCLRKFSTMPFLFCNINNVCNFASRNDYSYWLSTPEPMPMSMAPITGENIRPFISRCAVCEAPAMVMAVHSQTIQIPPCPSGWSSLWIGYSFVMHTSAGAEGSGQALASPGSCLEEFRSAPFIECHG-RGTCNYYANAYSFWLATIERSEMFKKPTPSTLKAGELRTHVSRCQVCMRRT
>chain C
MGRDQRAVAGPALRRWLLLGTVTVGFLAQSVLAGVKKFDVPCGGRDCSGGCQCYPEKGGRGQPGPVGPQGYNGPPGLQGFPGLQGRKGDKGERGAPGVTGPKGDVGARGVSGFPGADGIPGHPGQGGPRGRPGYDGCNGTQGDSGPQGPPGSEGFTGPPGPQGPKGQKGEP-YALPKEERDRYRGEPGEPGLVGFQGPPGRPGHVGQMGPVGAPGRPGPPGPPGPKGQQGNRGLGFYGVKGEKGDVGQPGPNGIPSDTLHPIIAPTGVTFHPDQYKGEKGSEGEPGIRGISLKGEEGIMGFPGLRGYPGLSGEKGSPGQKGSRGLDGYQGPDGPRGPKGEAGDPGPPGLP--AYSPHPSLAKGARGDPGFPGAQGEPGSQGEPGDPGLPGPPGLSIGDGDQRRGLPGEMGPKGFIGDPGIPALYGGPPGPDGKRGPPGPPGLPGPPGPDGFL-FGLKGAKGRAGFPGLPGSPGARGPKGWKGDAGECRCTEGDEAIKGLPGLPGPKGFAGINGEPGRKGDRGDPGQHGLPGFPGLKGVPGNIGAPGPKGAKGDS-RTITTKGERGQPGVPGVPGMKGDDGSPGRDGLDGFPGLPGPPGD-GIKGPPGDPGYPGIPGTKGTPGEMGPPGLGLPGLKGQRGFPGDAGLPGPPGFLGPPGPAGTPGQIDCDTDVKRAVGGDRQEAIQPGCIGGPKGLPGLPGPPGPTGAKGLRGIPGFAGADGGPGPRGLPGDAGREGFPGPPGFIGPRGSKGAVGLPGPDGSPGPIGLPGPDGPPGERGLPGEVLGAQPGPRGDAGVPGQPGLKGLPGDRGPPGFRGSQGMPGMPGLKGQPGLPGPSGQPGLYGPPGLHGFPGAPGQEGPLGLPGIPGREGLPGDRGDPGDTGAPGPVGMKGLSGDRGDAGFTGEQGHPGSPGFKGIDGMPGTPGLKGDRGSPGMDGFQGMPGLKGRPGFPGSKGEAGFFGIPGLKGLAGEPGFKGSRGDPGPPGPP-PVILPGMKDIKGEKGDEGPMGLKGYLGAKGIQGMPGIPGLSGIPGLPGRPGHIKGVKGDIGVPGIPGLPGFPGVAGPPGITGFPGFIGSRGDKGAPGRAGLYGEIGATGDFGDIGDT-INLPGRPGLKGERGTTGIPGLKGFFGEKGTEGDIGFPGITGVTGVQGPPGLKGQTGFPGLTGPPGSQGELGRIGLPGGKGDDGWPGAPGLPGFPGLRGIRGLHGLPGTKGFPGSPGSDIHGDPGFPGPPGERGDPGEANTLPGPVGVPGQKGDQGAPGERGPPGSPGLQGFPGITPPSNISGAPGDKGAPGIFGLKGYRGPPGPPGSAALPGSKGDTGNPGAPGTPGTKGWAGDSGPQGRPGVFGLPGEKGPRGEQGFMGNTGPTGAVGDRGPKGPKGDPGFPGAPGTVGAPGIAGIPQKIAVQPGTVGPQGRRGPPGAPGEMGPQGPPGEPGFRGAPGKAGPQGRGGVSAVPGFRGDEGPIGHQGPIGQEGAPGRPGSPGLPGMPGR-SVSIGYLLVKHSQTDQEPMCPVGMNKLWSGYSLLYFEGQEKAHNQDLGLAGSCLARFSTMPFLYCNPGDVCYYASRNDKSYWLSTTAPLP--MMPVAEDEIKPYISRCSVCEAPAIAIAVHSQDVSIPHCPAGWRSLWIGYSFLMHTAAGDEGGGQSLVSPGSCLEDFRATPFIECNGGRGTCHYYANKYSFWLTTIPEQSFQGSPSADTLKAGLIRTHISRCQVCMKNL
Now, I want to find out the character position of all R and D/E in the three chains that satisfy the following relationship
Ri (chain A) - Di+2 (chain B)
Ri (chain B) - Di+2 (chain C)
Ri (chain C) - Di+5 (chain A)
Explanation: Iterate over every ith R in chain A and check if the i+2 position of chain B contains D or E. If yes, output the character positions of every such R and D/E pair. Do the same with chains B+C and chains C+A.
I tried the following:
IFS=$'\n' read -d '' -r -a lines <file.txt
echo "${lines[1]}" | awk '{for(i=1;i<=length($0);i++) {if (substr($0,i,1)=="R") {print i}}}'
echo "${lines[3]}" | awk '{for(i=1;i<=length($0);i++) {if (substr($0,i,1)=="R") {print i}}}'
echo "${lines[5]}" | awk '{for(i=1;i<=length($0);i++) {if (substr($0,i,1)=="R") {print i}}}'
But this only gives the positions of R or E in the lines, not constrained by the relationship.
This can be optimized, but I think it works. It prints the chains compared, the position of the R in the first chain, and the matched characters. It assumes the chains are the same length and doesn't check bounds. It iterates over each sequence once and compares it with the other two for a match at the offset.
Note that A and B sequences are the same, so for C-A and C-B comparisons you'll get the same results.
$ awk 'function charAt(_d, _i) {return substr(_d,_i,1)}
       NR%2 {chain[int(NR/2)+1]=$2; next}
            {d[NR/2]=$0}
       END  {nc=NR/2;
             for(i=1;i<=nc;i++)
               for(j=1;j<=length(d[i]);j++) {
                 os=j+(chain[i]=="C"?5:2);
                 if( (c1=charAt(d[i],j))=="R") {
                   if( (c2=charAt(d[k=i%nc+1],os))=="D" || c2=="E") print chain[i]"-"chain[k],j,c1,c2;
                   if( (c2=charAt(d[k=(i+1)%nc+1],os))=="D" || c2=="E") print chain[i]"-"chain[k],j,c1,c2;
                 }}}' file
A-C 187 R E
A-C 365 R D
A-B 374 R E
A-C 374 R E
A-B 409 R E
A-C 415 R D
A-C 521 R D
A-C 606 R D
A-B 618 R D
A-B 829 R E
A-B 967 R D
A-C 967 R E
A-B 1018 R D
A-B 1114 R E
A-C 1114 R E
A-C 1224 R D
A-B 1569 R D
A-C 1569 R D
A-B 1692 R E
B-C 187 R E
B-C 365 R D
B-C 374 R E
B-A 374 R E
B-A 409 R E
B-C 415 R D
B-C 521 R D
B-C 606 R D
B-A 618 R D
B-A 829 R E
B-C 967 R E
B-A 967 R D
B-A 1018 R D
B-C 1114 R E
B-A 1114 R E
B-C 1224 R D
B-C 1569 R D
B-A 1569 R D
B-A 1692 R E
C-A 335 R E
C-B 335 R E
C-A 403 R E
C-B 403 R E
C-A 475 R E
C-B 475 R E
C-A 746 R D
C-B 746 R D
C-A 1236 R E
C-B 1236 R E
C-A 1600 R E
C-B 1600 R E
NOTE: OP hasn't provided the expected output so I'll duplicate karakfa's output format.
One awk idea:
awk '
    { chain_id[++c]=$2                        # save chain id, eg, "A", "B", "C"
      getline                                 # read next line from input file
      chains[c]=$0                            # save associated chain
    }
END { i_char="R"                              # character to search for in 1st chain
      for (i=1;i<=c;i++) {                    # loop through list of chains
          j= (i==c ? 1 : i+1)                 # determine index of 2nd chain
          offset= (i==c ? 5 : 2)              # +2 for A-B, B-C; +5 for C-A
          chain_i=chains[i]                   # copy chains as we are going to cut them up as we process them
          chain_j=chains[j]
          chain_pair= chain_id[i] "-" chain_id[j]   # build output label, eg, "A-B"
          pos=0                               # reset position
          while (length(chain_i)>0) {
              n=index(chain_i,i_char)         # look for "R"
              if (n==0) break                 # if not found we are done with this chain pair so break out of loop else ...
              pos=pos+n                       # update our position in the chain and ...
              j_char=substr(chain_j,n+offset,1)   # find character from 2nd chain at location n+offset
              if (j_char ~ /D|E/) {           # if 2nd chain character is one of "D" or "E" then ...
                  print chain_pair,pos,i_char,j_char   # print our finding
              }
              chain_i=substr(chain_i,n+1)     # strip off 1st n characters
              chain_j=substr(chain_j,n+1)
          }
      }
    }
' file.txt
This generates:
A-B 374 R E
A-B 409 R E
A-B 618 R D
A-B 829 R E
A-B 967 R D
A-B 1018 R D
A-B 1114 R E
A-B 1569 R D
A-B 1692 R E
B-C 187 R E
B-C 365 R D
B-C 374 R E
B-C 415 R D
B-C 521 R D
B-C 606 R D
B-C 967 R E
B-C 1114 R E
B-C 1224 R D
B-C 1569 R D
C-A 335 R E
C-A 403 R E
C-A 475 R E
C-A 746 R D
C-A 1236 R E
C-A 1600 R E
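The relationship described in the question (an R at position i in one chain, a D or E at position i+offset in the next) can also be sketched in Python. This is a hedged illustration rather than a port of either answer. The names `find_pairs` and `check_chains` are invented here, positions are reported 1-based as in the awk output, and positions past the end of the second chain are simply skipped.

```python
def find_pairs(first, second, offset, targets="DE"):
    """Return (position, 'R', partner) for every R in `first` whose
    counterpart at position + offset in `second` is one of `targets`."""
    hits = []
    for i, ch in enumerate(first):
        j = i + offset
        if ch == "R" and j < len(second) and second[j] in targets:
            hits.append((i + 1, ch, second[j]))  # report 1-based positions
    return hits


def check_chains(chains):
    """`chains` maps chain id -> sequence; uses the offsets from the question."""
    rules = [("A", "B", 2), ("B", "C", 2), ("C", "A", 5)]
    return {
        f"{a}-{b}": find_pairs(chains[a], chains[b], offset)
        for a, b, offset in rules
    }


# Tiny made-up sequences, only to show the call shape.
demo = {"A": "ARAAR", "B": "AAADAAE", "C": "RAAAAA"}
for pair, hits in check_chains(demo).items():
    print(pair, hits)
```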

How to count number of occurrences in a sorted text file

I have a sorted text file with the following format:
Company1 Company2 Date TransactionAmount
A B 1/1/19 20000
A B 1/4/19 200000
A B 1/19/19 324
A C 2/1/19 3456
A C 2/1/19 663633
A D 1/6/19 3632
B C 1/9/19 84335
B C 1/23/19 253
B C 1/13/19 850
B D 1/1/19 234
B D 1/8/19 635
C D 1/9/19 749
C D 1/10/19 203200
Ultimately I want a Python dictionary so that each pair maps to a list containing the number of transactions and the total amount of all transactions. For instance, (A,B) would map to [3,220324].
The file has ~250,000 lines in this format and each pair may have 1 transaction up to ~10 or so transactions. There are also tens of thousands of pairs of companies.
Here's the only way I've thought of implementing it.
my_dict = {}
file = open("my_file.txt").readlines()[1:]
for i in file:
    i = i.split()
    pair = (i[0],i[1])
    amt = int(i[3])
    if pair in my_dict:
        exist = my_dict[pair]
        exist[0] += 1
        exist[1] += amt
        my_dict[pair] = exist
    else:
        my_dict[pair] = [1,amt]
I feel like there is a faster way to do this. Any ideas?
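One possible tidier variant, offered as a sketch and not benchmarked against the loop above: `collections.defaultdict` removes the explicit membership test, and passing the open file object directly avoids building the full list that `readlines()` creates. The function name `summarize` is made up here, and it assumes every data line has exactly four whitespace-separated fields.

```python
from collections import defaultdict


def summarize(lines):
    """Map (company1, company2) -> [transaction count, total amount]."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        company1, company2, _date, amount = line.split()
        entry = totals[(company1, company2)]
        entry[0] += 1
        entry[1] += int(amount)
    return dict(totals)


# With a real file, skip the header and pass the file object directly:
#     with open("my_file.txt") as f:
#         next(f)
#         my_dict = summarize(f)
rows = [
    "A B 1/1/19 20000",
    "A B 1/4/19 200000",
    "A B 1/19/19 324",
    "A C 2/1/19 3456",
]
print(summarize(rows))
```

The work is still one pass over the file. Because the file is sorted by pair, `itertools.groupby` would be another option, though it is not obviously faster.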

How do I line number a set of columns taking into account the strings values of the first column (UNIX shell)

Can someone help me? I would like to number the rows of a tab-delimited file in UNIX based on the values in its columns. However, in some rows the last column has the same letters and length but in a different order; such rows must be considered the same if all the preceding columns are also the same. In summary, the input is something like
rs758613821 574290 insertion_inframe P 285 AAAP
rs758613821 574290 insertion_inframe P 285 APAA
rs758613821 574290 insertion_inframe P 285 APLA
rs1367252071 574290 deletion_inframe CADDL 134 F
rs538 3246 frameshift_variant F 97 FGLYP
rs538 3246 frameshift_variant F 97 PYFLG
And the output should be
1 rs758613821 574290 insertion_inframe P 285 AAAP
1 rs758613821 574290 insertion_inframe P 285 APAA
2 rs758613821 574290 insertion_inframe P 285 APLA
3 rs1367252071 574290 deletion_inframe CADDL 134 F
4 rs538 3246 frameshift_variant F 97 FGLYP
4 rs538 3246 frameshift_variant F 97 PYFLG
and so on...
So far I have written the following code:
awk 'BEGIN {FS=OFS="\t"}
function intern(sym) {
    if (sym in table)
        return table[sym]
    return table[sym] = ++counter
}
{ print intern($1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6), $0 }' "input" > "output";
Nevertheless, this doesn't solve the problem with the last column: rows should get the same number if the last column has the same letters and length, even in a different order. Is it possible to do this in a UNIX environment? I think it might be possible with the substr function or something similar, but I'm not sure what the proper code would be. Thanks in advance for your support and help!
Not sure this is what you want to do but give it a try
$ awk 'function canon(f) {n=split(f,a,"");
                          asort(a); c="";
                          for(i=1;i<=n;i++) c=c a[i];
                          return c;}
       {key=canon($NF)}
       !(key in keys) {keys[key]=++ctr}
       {print keys[key],$0}' file
1 rs758613821 574290 insertion_inframe P 285 AAAP
1 rs758613821 574290 insertion_inframe P 285 APAA
2 rs758613821 574290 insertion_inframe P 285 APLA
3 rs1367252071 574290 deletion_inframe CADDL 134 F
4 rs538 3246 frameshift_variant F 97 FGLYP
4 rs538 3246 frameshift_variant F 97 PYFLG
This converts the last field into a canonical form (its letters sorted) and counts the unique instances. Note that asort() is a GNU awk extension.
To use the full record as the key, do this instead:
...
{line=$0;
$NF=canon($NF);
key=$0}
!(key in keys) {keys[key]=++ctr}
{print keys[key],line}' file
This copies the line, replaces the last field with its canonical form, uses the updated line as the key, counts unique instances, and prints the count and the original line.
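The canonical-form idea can also be sketched in Python for comparison; this is an illustration, not the answerer's code. The name `number_rows` is invented, fields are assumed to be whitespace-separated, and the key is the preceding columns plus the sorted letters of the last column, as in the full-record variant.

```python
def number_rows(rows):
    """Prefix each row with a group number; rows share a number when all
    columns but the last are equal and the last column is an anagram."""
    ids = {}
    numbered = []
    for row in rows:
        fields = row.split()
        # Sorting the letters makes "AAAP" and "APAA" compare equal.
        key = tuple(fields[:-1]) + ("".join(sorted(fields[-1])),)
        if key not in ids:
            ids[key] = len(ids) + 1
        numbered.append(f"{ids[key]} {row}")
    return numbered


rows = [
    "rs758613821 574290 insertion_inframe P 285 AAAP",
    "rs758613821 574290 insertion_inframe P 285 APAA",
    "rs758613821 574290 insertion_inframe P 285 APLA",
    "rs1367252071 574290 deletion_inframe CADDL 134 F",
    "rs538 3246 frameshift_variant F 97 FGLYP",
    "rs538 3246 frameshift_variant F 97 PYFLG",
]
for line in number_rows(rows):
    print(line)
```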

Check if string exist in non-consecutive lines in a given column

I have files with the following format:
ATOM 8962 CA VAL W 8 8.647 81.467 25.656 1.00115.78 C
ATOM 8963 C VAL W 8 10.053 80.963 25.506 1.00114.60 C
ATOM 8964 O VAL W 8 10.636 80.422 26.442 1.00114.53 O
ATOM 8965 CB VAL W 8 7.643 80.389 25.325 1.00115.67 C
ATOM 8966 CG1 VAL W 8 6.476 80.508 26.249 1.00115.54 C
ATOM 8967 CG2 VAL W 8 7.174 80.526 23.886 1.00115.26 C
ATOM 4440 O TYR S 89 4.530 166.005 -14.543 1.00 95.76 O
ATOM 4441 CB TYR S 89 2.847 168.812 -13.864 1.00 96.31 C
ATOM 4442 CG TYR S 89 3.887 169.413 -14.756 1.00 98.43 C
ATOM 4443 CD1 TYR S 89 3.515 170.073 -15.932 1.00100.05 C
ATOM 4444 CD2 TYR S 89 5.251 169.308 -14.451 1.00100.50 C
ATOM 4445 CE1 TYR S 89 4.464 170.642 -16.779 1.00100.70 C
ATOM 4446 CE2 TYR S 89 6.219 169.868 -15.298 1.00101.40 C
ATOM 4447 CZ TYR S 89 5.811 170.535 -16.464 1.00100.46 C
ATOM 4448 OH TYR S 89 6.736 171.094 -17.321 1.00100.20 O
ATOM 4449 N LEU S 90 3.944 166.393 -12.414 1.00 94.95 N
ATOM 4450 CA LEU S 90 5.079 165.622 -11.914 1.00 94.44 C
ATOM 5151 N LEU W 8 -66.068 209.785 -11.037 1.00117.44 N
ATOM 5152 CA LEU W 8 -64.800 210.035 -10.384 1.00116.52 C
ATOM 5153 C LEU W 8 -64.177 208.641 -10.198 1.00116.71 C
ATOM 5154 O LEU W 8 -64.513 207.944 -9.241 1.00116.99 O
ATOM 5155 CB LEU W 8 -65.086 210.682 -9.033 1.00115.76 C
ATOM 5156 CG LEU W 8 -64.274 211.829 -8.478 1.00113.89 C
ATOM 5157 CD1 LEU W 8 -64.528 211.857 -7.006 1.00111.94 C
ATOM 5158 CD2 LEU W 8 -62.828 211.612 -8.739 1.00112.96 C
In principle, column 5 (W, in this case, which represents the chain ID) should be identical only in consecutive chunks. However, in files with too many chains, there are not enough letters of the alphabet to assign a unique ID to each chain, and therefore duplication may occur.
I would like to be able to check whether or not this is the case. In other words I would like to know if a given chain ID (A-Z, always in the 5th column) is present in non-consecutive chunks. I do not mind if it changes from W to S, I would like to know if there are two chunks sharing the same chain ID. In this case, if W or S reappear at some point. In fact, this is only a problem if they also share the first and the 6th columns, but I do not want to complicate things too much.
I do not want to print the lines, just the name of the file in which the issue occurs and the chain ID (in this case W), so that I can fix it. In fact, I already know how to solve the problem, but I need to identify the problematic files so I can focus on those and not repair files that are already fine.
SOLUTION (thanks to all for your help and namely to sehe):
for pdb in *.pdb ; do
    hit=$(awk '$1 == "ATOM"' "$pdb" | cut -c22-23 | uniq | sort | uniq -dc)
    [ "$hit" ] && echo "$pdb" = $hit
done
For this particular sample:
cut -c22-23 t | uniq | sort | uniq -dc
Will output
2 W
(the 22nd column contains 2 runs of the letter 'W')
untested
awk '
    seen[$5] && $5 != current {
        print "found non-consecutive chain on line " NR
        exit
    }
    { current = $5; seen[$5] = 1 }
' filename
Here you go, this awk script is tested and takes into account not just 'W':
{
    if (ln[$5] && ln[$5] + 1 != NR) {
        print "dup " $5 " at line " NR;
    }
    ln[$5] = NR;
}
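The "count the runs per chain ID" idea behind the `uniq | sort | uniq -dc` pipeline can be sketched in Python as well. This is a hedged illustration: the name `repeated_chain_ids` is made up, and it takes the chain ID from the fifth whitespace-separated field, like the awk answers, which is an assumption that can fail on real PDB files where fixed-width columns run together.

```python
from collections import Counter
from itertools import groupby


def repeated_chain_ids(lines):
    """Return the chain IDs that occur in more than one separate run."""
    chain_ids = (
        line.split()[4]            # 5th field = chain ID (assumption)
        for line in lines
        if line.startswith("ATOM")
    )
    # groupby collapses consecutive equal IDs into one run each.
    runs = Counter(chain_id for chain_id, _group in groupby(chain_ids))
    return sorted(chain_id for chain_id, count in runs.items() if count > 1)


sample = [
    "ATOM 8962 CA VAL W 8 8.647 81.467 25.656 1.00115.78 C",
    "ATOM 8963 C VAL W 8 10.053 80.963 25.506 1.00114.60 C",
    "ATOM 4440 O TYR S 89 4.530 166.005 -14.543 1.00 95.76 O",
    "ATOM 5151 N LEU W 8 -66.068 209.785 -11.037 1.00117.44 N",
]
print(repeated_chain_ids(sample))
```

An empty list means every chain ID forms a single consecutive chunk in that file.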
