find characters present in two different lines if they satisfy a positional relationship - bash

I have a peculiar problem. A text file contains the following three lines,
>chain A
---------MGPRLSVWLLLLPAALLLHEEHSRAAA--KGGCAGSGC-GKCDCHGVKGQKGERGLPGLQGVIGFPGMQGPEGPQGPPGQKGDTGEPGLPGTKGTRGPPGASGYPGNPGLPGIPGQDGPPGPPGIPGCNGTKGERGPLGPPGLPGFAGNPGPPGLPGMKGDPGEILGHVPGMLLKGERGFPGIPGTPGPPGLPGLQGPVGPPGFTGPPGPPGPPGPPGEKGQMGLSFQGPKGDKGDQGVSGPPGVPGQA-------QVQEKGDFATKGEKGQKGEPGFQGMPGVGEKGEPGKPGPRGKPGKDGDKGEKGSPGFPGEPGYPGLIGRQGPQGEKGEAGPPGPPGIVIGTGPLGEKGERGYPGTPGPRGEPGPKGFPGLPGQPGPPGLPVPGQAGAPGFPGERGEKGDRGFPGTS-LP-GPSGRDGLPGPPGSPGPPGQPGYTNGIVECQPGPPGDQGPPGIPGQPGFIGEIGEKGQKGESCLICDIDGYRGPPGPQGPPGEIGFPGQPGAKGDRGLPGRDGVAGVPGPQGTPGLIGQPGAKGEPGEFYFDLRLKGDKGDPGFPGQPGMPGRAGSPGRDGHPGLPGPKGSPGSVGLKGERGPPGGVGFPGSRGDTGPPGPPGY---GPAGPIGDKGQAGFPGGPGSPGLPGPKGEPGKIVP--------------------LPGPPGAEGLPGSPGFPGPQGDRGFPGTPGRPGLPGEKGAVGQPGI-GFPGPPGPKGVDGLPGDMGPPGTPGRPGFNGLPGNPGVQGQKGEP---GVGLPGLKGLPGLPGIPGTPGEKGSIGVPGVPGEHGAIGPPGLQGIRGEPGPPGLPGSVGSPGVPGI-GPPGARGPPGGQGPPGLSGPPGIKGEKGFPGFPGLD-MPGPKGDKGAQGLPGITGQSGLPGLPGQQGAPGIPGFPGSKGEMGVMGTPGQPGSPGPVGAPGLPGEKGDHGFPGSSGPRGDPGLKGDKGDVGLPGKPGSMDKVDMGSMKGQKGDQGEKGQIGPIGEKGSRGDPGTPGVPGKDGQAGQPGQP-GPKGDPGISGTPGAPGLPGPKGSVGGMGLPGTPGEKGVPGIPGPQGSPGLPGDKGAKGEKGQAGPPGIGIPGLRGEKGDQGIAGFPGSPGEKGEKGSIGIPGMPGSPGLKGSPGSVGYPGSPGLPGEKGDKGLPGLDGIPGVKGEAGLPGTPGPTGPAGQKGEPGSDGIPGSAGEKGEPGLPGRGFPGFPGAKGDKGSKGEVGFP-GLAGSPGIPGSKGEQGFMGPPGPQGQPGLPGSPGHA-TEGPKGDRGPQGQPGLPGLPGPMGPPGLPGIDGVKGDKGNPGWPGAPGVPGPKGDPGFQGMPGIGGSPGITGSKGDMGPPGVPGFQGPKGLPGLQGIKGDQGDQGVPGAKGLPGPPGPPGPYDIIKGEPGLPGPEGPPGLKGLQGLPGPKGQQGVTGLVGIPGPPGIPGFDGAPGQKGEMGPAGPTGPRGFPGPPGPDGLPGSMGPPGTPSVDHGFLVTRHSQTIDDPQCPSGTKILYHGYSLLYVQGNERAHGQDLGTAGSCLRKFSTMPFLFCNINNVCNFASRNDYSYWLSTPEPMPMSMAPITGENIRPFISRCAVCEAPAMVMAVHSQTIQIPPCPSGWSSLWIGYSFVMHTSAGAEGSGQALASPGSCLEEFRSAPFIECHG-RGTCNYYANAYSFWLATIERSEMFKKPTPSTLKAGELRTHVSRCQVCMRRT
>chain B
---------MGPRLSVWLLLLPAALLLHEEHSRAAA--KGGCAGSGC-GKCDCHGVKGQKGERGLPGLQGVIGFPGMQGPEGPQGPPGQKGDTGEPGLPGTKGTRGPPGASGYPGNPGLPGIPGQDGPPGPPGIPGCNGTKGERGPLGPPGLPGFAGNPGPPGLPGMKGDPGEILGHVPGMLLKGERGFPGIPGTPGPPGLPGLQGPVGPPGFTGPPGPPGPPGPPGEKGQMGLSFQGPKGDKGDQGVSGPPGVPGQA-------QVQEKGDFATKGEKGQKGEPGFQGMPGVGEKGEPGKPGPRGKPGKDGDKGEKGSPGFPGEPGYPGLIGRQGPQGEKGEAGPPGPPGIVIGTGPLGEKGERGYPGTPGPRGEPGPKGFPGLPGQPGPPGLPVPGQAGAPGFPGERGEKGDRGFPGTS-LP-GPSGRDGLPGPPGSPGPPGQPGYTNGIVECQPGPPGDQGPPGIPGQPGFIGEIGEKGQKGESCLICDIDGYRGPPGPQGPPGEIGFPGQPGAKGDRGLPGRDGVAGVPGPQGTPGLIGQPGAKGEPGEFYFDLRLKGDKGDPGFPGQPGMPGRAGSPGRDGHPGLPGPKGSPGSVGLKGERGPPGGVGFPGSRGDTGPPGPPGY---GPAGPIGDKGQAGFPGGPGSPGLPGPKGEPGKIVP--------------------LPGPPGAEGLPGSPGFPGPQGDRGFPGTPGRPGLPGEKGAVGQPGI-GFPGPPGPKGVDGLPGDMGPPGTPGRPGFNGLPGNPGVQGQKGEP---GVGLPGLKGLPGLPGIPGTPGEKGSIGVPGVPGEHGAIGPPGLQGIRGEPGPPGLPGSVGSPGVPGI-GPPGARGPPGGQGPPGLSGPPGIKGEKGFPGFPGLD-MPGPKGDKGAQGLPGITGQSGLPGLPGQQGAPGIPGFPGSKGEMGVMGTPGQPGSPGPVGAPGLPGEKGDHGFPGSSGPRGDPGLKGDKGDVGLPGKPGSMDKVDMGSMKGQKGDQGEKGQIGPIGEKGSRGDPGTPGVPGKDGQAGQPGQP-GPKGDPGISGTPGAPGLPGPKGSVGGMGLPGTPGEKGVPGIPGPQGSPGLPGDKGAKGEKGQAGPPGIGIPGLRGEKGDQGIAGFPGSPGEKGEKGSIGIPGMPGSPGLKGSPGSVGYPGSPGLPGEKGDKGLPGLDGIPGVKGEAGLPGTPGPTGPAGQKGEPGSDGIPGSAGEKGEPGLPGRGFPGFPGAKGDKGSKGEVGFP-GLAGSPGIPGSKGEQGFMGPPGPQGQPGLPGSPGHA-TEGPKGDRGPQGQPGLPGLPGPMGPPGLPGIDGVKGDKGNPGWPGAPGVPGPKGDPGFQGMPGIGGSPGITGSKGDMGPPGVPGFQGPKGLPGLQGIKGDQGDQGVPGAKGLPGPPGPPGPYDIIKGEPGLPGPEGPPGLKGLQGLPGPKGQQGVTGLVGIPGPPGIPGFDGAPGQKGEMGPAGPTGPRGFPGPPGPDGLPGSMGPPGTPSVDHGFLVTRHSQTIDDPQCPSGTKILYHGYSLLYVQGNERAHGQDLGTAGSCLRKFSTMPFLFCNINNVCNFASRNDYSYWLSTPEPMPMSMAPITGENIRPFISRCAVCEAPAMVMAVHSQTIQIPPCPSGWSSLWIGYSFVMHTSAGAEGSGQALASPGSCLEEFRSAPFIECHG-RGTCNYYANAYSFWLATIERSEMFKKPTPSTLKAGELRTHVSRCQVCMRRT
>chain C
MGRDQRAVAGPALRRWLLLGTVTVGFLAQSVLAGVKKFDVPCGGRDCSGGCQCYPEKGGRGQPGPVGPQGYNGPPGLQGFPGLQGRKGDKGERGAPGVTGPKGDVGARGVSGFPGADGIPGHPGQGGPRGRPGYDGCNGTQGDSGPQGPPGSEGFTGPPGPQGPKGQKGEP-YALPKEERDRYRGEPGEPGLVGFQGPPGRPGHVGQMGPVGAPGRPGPPGPPGPKGQQGNRGLGFYGVKGEKGDVGQPGPNGIPSDTLHPIIAPTGVTFHPDQYKGEKGSEGEPGIRGISLKGEEGIMGFPGLRGYPGLSGEKGSPGQKGSRGLDGYQGPDGPRGPKGEAGDPGPPGLP--AYSPHPSLAKGARGDPGFPGAQGEPGSQGEPGDPGLPGPPGLSIGDGDQRRGLPGEMGPKGFIGDPGIPALYGGPPGPDGKRGPPGPPGLPGPPGPDGFL-FGLKGAKGRAGFPGLPGSPGARGPKGWKGDAGECRCTEGDEAIKGLPGLPGPKGFAGINGEPGRKGDRGDPGQHGLPGFPGLKGVPGNIGAPGPKGAKGDS-RTITTKGERGQPGVPGVPGMKGDDGSPGRDGLDGFPGLPGPPGD-GIKGPPGDPGYPGIPGTKGTPGEMGPPGLGLPGLKGQRGFPGDAGLPGPPGFLGPPGPAGTPGQIDCDTDVKRAVGGDRQEAIQPGCIGGPKGLPGLPGPPGPTGAKGLRGIPGFAGADGGPGPRGLPGDAGREGFPGPPGFIGPRGSKGAVGLPGPDGSPGPIGLPGPDGPPGERGLPGEVLGAQPGPRGDAGVPGQPGLKGLPGDRGPPGFRGSQGMPGMPGLKGQPGLPGPSGQPGLYGPPGLHGFPGAPGQEGPLGLPGIPGREGLPGDRGDPGDTGAPGPVGMKGLSGDRGDAGFTGEQGHPGSPGFKGIDGMPGTPGLKGDRGSPGMDGFQGMPGLKGRPGFPGSKGEAGFFGIPGLKGLAGEPGFKGSRGDPGPPGPP-PVILPGMKDIKGEKGDEGPMGLKGYLGAKGIQGMPGIPGLSGIPGLPGRPGHIKGVKGDIGVPGIPGLPGFPGVAGPPGITGFPGFIGSRGDKGAPGRAGLYGEIGATGDFGDIGDT-INLPGRPGLKGERGTTGIPGLKGFFGEKGTEGDIGFPGITGVTGVQGPPGLKGQTGFPGLTGPPGSQGELGRIGLPGGKGDDGWPGAPGLPGFPGLRGIRGLHGLPGTKGFPGSPGSDIHGDPGFPGPPGERGDPGEANTLPGPVGVPGQKGDQGAPGERGPPGSPGLQGFPGITPPSNISGAPGDKGAPGIFGLKGYRGPPGPPGSAALPGSKGDTGNPGAPGTPGTKGWAGDSGPQGRPGVFGLPGEKGPRGEQGFMGNTGPTGAVGDRGPKGPKGDPGFPGAPGTVGAPGIAGIPQKIAVQPGTVGPQGRRGPPGAPGEMGPQGPPGEPGFRGAPGKAGPQGRGGVSAVPGFRGDEGPIGHQGPIGQEGAPGRPGSPGLPGMPGR-SVSIGYLLVKHSQTDQEPMCPVGMNKLWSGYSLLYFEGQEKAHNQDLGLAGSCLARFSTMPFLYCNPGDVCYYASRNDKSYWLSTTAPLP--MMPVAEDEIKPYISRCSVCEAPAIAIAVHSQDVSIPHCPAGWRSLWIGYSFLMHTAAGDEGGGQSLVSPGSCLEDFRATPFIECNGGRGTCHYYANKYSFWLTTIPEQSFQGSPSADTLKAGLIRTHISRCQVCMKNL
Now, I want to find the character positions of all R and D/E residues in the three chains that satisfy the following relationships:
Ri (chain A) - Di+2 (chain B)
Ri (chain B) - Di+2 (chain C)
Ri (chain C) - Di+5 (chain A)
Explanation: for every R at position i in chain A, check whether position i+2 of chain B contains D or E. If it does, output the character positions of that R and D/E pair. Do the same for chains B+C (offset 2) and chains C+A (offset 5).
I tried the following:
IFS=$'\n' read -d '' -r -a lines <file.txt
echo "${lines[1]}" | awk '{for(i=1;i<=length($0);i++) {if (substr($0,i,1)=="R") {print i}}}'
echo "${lines[3]}" | awk '{for(i=1;i<=length($0);i++) {if (substr($0,i,1)=="R") {print i}}}'
echo "${lines[5]}" | awk '{for(i=1;i<=length($0);i++) {if (substr($0,i,1)=="R") {print i}}}'
but this only gives the positions of R in each line, without applying the positional relationship.

This can probably be optimized, but I think it works. It prints the pair of chains compared, the position of the R in the first chain, and the two matched characters. It assumes the chains are the same length and doesn't check bounds. It iterates over each sequence once and compares it against the other two at the given offset.
Note that the A and B sequences are identical, so the C-A and C-B comparisons give the same results.
$ awk 'function charAt(_d, _i) {return substr(_d,_i,1)}
  NR%2 {chain[int(NR/2)+1]=$2; next}
       {d[NR/2]=$0}
  END  {nc=NR/2;
        for(i=1;i<=nc;i++)
          for(j=1;j<=length(d[i]);j++) {
            os=j+(chain[i]=="C"?5:2);
            if( (c1=charAt(d[i],j))=="R") {
              if( (c2=charAt(d[k=i%nc+1],os))=="D" || c2=="E") print chain[i]"-"chain[k],j,c1,c2;
              if( (c2=charAt(d[k=(i+1)%nc+1],os))=="D" || c2=="E") print chain[i]"-"chain[k],j,c1,c2;
            }}}' file
A-C 187 R E
A-C 365 R D
A-B 374 R E
A-C 374 R E
A-B 409 R E
A-C 415 R D
A-C 521 R D
A-C 606 R D
A-B 618 R D
A-B 829 R E
A-B 967 R D
A-C 967 R E
A-B 1018 R D
A-B 1114 R E
A-C 1114 R E
A-C 1224 R D
A-B 1569 R D
A-C 1569 R D
A-B 1692 R E
B-C 187 R E
B-C 365 R D
B-C 374 R E
B-A 374 R E
B-A 409 R E
B-C 415 R D
B-C 521 R D
B-C 606 R D
B-A 618 R D
B-A 829 R E
B-C 967 R E
B-A 967 R D
B-A 1018 R D
B-C 1114 R E
B-A 1114 R E
B-C 1224 R D
B-C 1569 R D
B-A 1569 R D
B-A 1692 R E
C-A 335 R E
C-B 335 R E
C-A 403 R E
C-B 403 R E
C-A 475 R E
C-B 475 R E
C-A 746 R D
C-B 746 R D
C-A 1236 R E
C-B 1236 R E
C-A 1600 R E
C-B 1600 R E

NOTE: OP hasn't provided the expected output so I'll duplicate karakfa's output format.
One awk idea:
awk '
    { chain_id[++c]=$2                        # save chain id, eg, "A", "B", "C"
      getline                                 # read next line from input file
      chains[c]=$0                            # save associated chain
    }
END { i_char="R"                              # character to search for in 1st chain
      for (i=1;i<=c;i++) {                    # loop through list of chains
          j= (i==c ? 1 : i+1)                 # determine index of 2nd chain
          offset= (i==c ? 5 : 2)              # +2 for A-B, B-C; +5 for C-A
          chain_i=chains[i]                   # copy chains as we are going to cut them up as we process them
          chain_j=chains[j]
          chain_pair= chain_id[i] "-" chain_id[j]   # build output label, eg, "A-B"
          pos=0                               # reset position
          while (length(chain_i)>0) {
              n=index(chain_i,i_char)         # look for "R"
              if (n==0) break                 # if not found we are done with this chain pair so break out of loop else ...
              pos=pos+n                       # update our position in the chain and ...
              j_char=substr(chain_j,n+offset,1)   # find character from 2nd chain at location n+offset
              if (j_char ~ /D|E/) {           # if 2nd chain character is one of "D" or "E" then ...
                  print chain_pair,pos,i_char,j_char   # print our finding
              }
              chain_i=substr(chain_i,n+1)     # strip off 1st n characters
              chain_j=substr(chain_j,n+1)
          }
      }
    }
' file.txt
This generates:
A-B 374 R E
A-B 409 R E
A-B 618 R D
A-B 829 R E
A-B 967 R D
A-B 1018 R D
A-B 1114 R E
A-B 1569 R D
A-B 1692 R E
B-C 187 R E
B-C 365 R D
B-C 374 R E
B-C 415 R D
B-C 521 R D
B-C 606 R D
B-C 967 R E
B-C 1114 R E
B-C 1224 R D
B-C 1569 R D
C-A 335 R E
C-A 403 R E
C-A 475 R E
C-A 746 R D
C-A 1236 R E
C-A 1600 R E
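To check the pairing rule by hand before running it on the full alignment, a minimal sketch on toy strings may help. The function name pair_hits and the two short sequences are made up for illustration and are not part of the original data. It prints the position of each R in the first sequence, the position offset residues further along in the second sequence, and the D/E found there, and it stops before running past the end of the second sequence.

```shell
# pair_hits SEQ1 SEQ2 OFFSET  (hypothetical helper, for illustration only)
# For every R at position i in SEQ1, report it if SEQ2 has D or E at i+OFFSET.
pair_hits() {
    awk -v a="$1" -v b="$2" -v off="$3" 'BEGIN {
        # stop early so that i+off never points past the end of the second sequence
        for (i = 1; i <= length(a) && i + off <= length(b); i++) {
            if (substr(a, i, 1) == "R") {
                c = substr(b, i + off, 1)
                if (c == "D" || c == "E") print i, i + off, c
            }
        }
    }'
}

# Toy input: R at positions 2 and 4 of the first string,
# D at position 4 and E at position 6 of the second.
pair_hits ARGRK GGGDAE 2
# prints:
# 2 4 D
# 4 6 E
```

For the real file, the two sequence lines of a chain pair would be passed as the first two arguments, with 2 or 5 as the offset.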

Related

Processing data swapped over files BASH

First, I would like to apologize for my extremely basic knowledge of coding. I hope that I will be able to express my issue correctly. Do not hesitate to ask for further clarification or anything else...
I'm having trouble postprocessing data...
My goal is to recombine data which were swapped.
EDIT: here is a .rar archive containing my test example, which works, and the one that I am trying to get working... (do not be alarmed by the time it takes to process the data)
https://drive.google.com/file/d/1AEPUc8haT5_Z3LR3jnZZlpyfxhdDwwo6/view?usp=sharing
EDIT 2: Here is what I expect on paper (it is the TestReorder3OK folder in my rar archive): [image]
EDIT 3 : MINIMAL COMPLETE EXAMPLE
Script :
#!/bin/bash
# Define the number of replicas
NP=3
NP1=$[NP-1]
rm torder*
for repl in `seq 0 $NP1`
do
    echo $repl
    # write column 2 of log.lammps into the file rep_0, then on the second pass column 3 into rep_1, etc.
    awk -v rep=$repl '{r2=rep+2;print $r2}' < log.lammps > rep_$repl
    i=0
    j=0
    # nested loop over the values in rep_$repl
    for a in `cat rep_$repl`
    do
        i=$[i+1]
        j=$[j+3]
        head -$i screen.$repl.temp | tail -1 >> torder.$a
        head -$j ccccd2_H_${repl}_col.bak2 | tail -3 >> ccccd2_H_${a}_temp_col.bak2
    done
done
log.lammps file
1 0 1 2
2 1 0 2
3 1 2 0
Starting at column 2, this file contains the numbers associated with the inputs below. Here is an expanded explanation:
Column 2 has three values: 0, 1 and 1. The 0 is associated with the first three lines of the file ccccd2_H_0_col.bak2, the next three lines are associated with 1, and the last three with 1 again.
Column 3 also has three values: 1, 0 and 2. The 1 is associated with the first three lines of the file ccccd2_H_1_col.bak2, the next three lines with 0, and the last three with 2.
The same goes for column 4.
What I want is for every set of three lines associated with the value 0 to go into a single file, every set associated with the value 1 into another file, and every set associated with the value 2 into a third file.
Inputs :
ccccd2_H_0_col.bak2
blank line
N a b c
C d e f
N g h i
C j k l
N m n o
C p q r
ccccd2_H_1_col.bak2
blank line
N s t u
C v w x
N y z a
C b c d
N e f g
C h i j
ccccd2_H_2_col.bak2
blank line
N k l m
C n o p
N q r s
C t u v
N w x y
C z a b
Outputs: these are the desired outputs, and also what I get for simple test files
ccccd2_H_0_temp_col
blank line
N a b c
C d e f
N y z a
C b c d
N w x y
C z a b
ccccd2_H_1_temp_col
blank line
N g h i
C j k l
N m n o
C p q r
N s t u
C v w x
ccccd2_H_2_temp_col
blank line
N e f g
C h i j
N k l m
C n o p
N q r s
C t u v
This works fine on small test files (as shown here), but not on my real system. For my real system, the log.lammps file contains 14 columns and 10,001 lines, and my input files contain 121,121 lines (so 10,001 * block of 121 lines). It creates files 10 times larger, with more data than they should contain.
Can you enlighten me about my issue? I think it is linked to the difference in line count between my files containing a single column and the files containing Cartesian coordinates, but I really don't understand the link or how to solve it...
Thank you in advance...
I think I understand what you're trying to do now, and this GNU awk script (GNU awk is needed for ARGIND, ENDFILE and its built-in management of open files) will do it:
$ cat ../tst.awk
ARGIND == 1 {
    for (inFileNr=2; inFileNr<=NF; inFileNr++) {
        outFileNrs[inFileNr,NR] = $inFileNr
    }
    next
}
ENDFILE { RS = "" }
{ print ORS $0 > ("ccccd2_H_" outFileNrs[ARGIND,FNR] "_temp_col") }
Look:
INPUT:
$ ls
ccccd2_H_0_col.bak2 ccccd2_H_1_col.bak2 ccccd2_H_2_col.bak2 log.lammps
$ cat log.lammps
1 0 1 2
2 1 0 2
3 1 2 0
$ paste ccccd2_H_0_col.bak2 ccccd2_H_1_col.bak2 ccccd2_H_2_col.bak2 | sed 's/\t/\t\t/g'
N a b c N s t u N k l m
C d e f C v w x C n o p
N g h i N y z a N q r s
C j k l C b c d C t u v
N m n o N e f g N w x y
C p q r C h i j C z a b
SCRIPT EXECUTION:
$ awk -f ../tst.awk log.lammps ccccd2_H_0_col.bak2 ccccd2_H_1_col.bak2 ccccd2_H_2_col.bak2
OUTPUT:
$ ls
ccccd2_H_0_col.bak2 ccccd2_H_1_col.bak2 ccccd2_H_2_col.bak2 log.lammps
ccccd2_H_0_temp_col ccccd2_H_1_temp_col ccccd2_H_2_temp_col
$ paste ccccd2_H_0_temp_col ccccd2_H_1_temp_col ccccd2_H_2_temp_col | sed 's/\t/\t\t/g'
N a b c N g h i N e f g
C d e f C j k l C h i j
N y z a N m n o N k l m
C b c d C p q r C n o p
N w x y N s t u N q r s
C z a b C v w x C t u v
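The script above needs GNU awk and, because it sets RS = "", it treats blank-line-separated paragraphs as blocks. If the blocks instead have a fixed number of lines with no separator, a portable variant could be sketched as below. The function name reorder_blocks, the size argument and the block_N.out output names are assumptions made for illustration; the sketch also assumes the data files contain no header line and are listed in the same order as columns 2, 3, ... of the log file.

```shell
# reorder_blocks LOG SIZE FILE...  (hypothetical helper, not from the answer above)
# Routes each block of SIZE lines from the k-th data file to block_<n>.out,
# where <n> is the value in column k+1 of LOG on the row for that block.
reorder_blocks() {
    log=$1
    size=$2
    shift 2
    awk -v size="$size" '
        FNR == 1 { fileNr++ }                  # count input files; the log is file 1
        fileNr == 1 {                          # log file: remember the target of each (column, row)
            for (col = 2; col <= NF; col++) target[col, FNR] = $col
            next
        }
        {
            step = int((FNR - 1) / size) + 1   # index of the block this line belongs to
            out = "block_" target[fileNr, step] ".out"
            print > out                        # awk keeps the file open, so later blocks append
        }
    ' "$log" "$@"
}
```

With the sample data this would be called as reorder_blocks log.lammps 2 ccccd2_H_0_col.bak2 ccccd2_H_1_col.bak2 ccccd2_H_2_col.bak2, and with the real data the size would be the number of lines per block. Blocks end up in the order the input files are read, which is the order shown in the desired outputs of the question.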

Use awk to print first column, second column and one column every four after that

I have a text file and I want to extract the first column, the second column, and then every fourth column after that (skipping three columns each time). Also, I want to get rid of row #2. How can I do that using awk or similar?
An example:
I have a text file as such:
A B C D E F G H I J .. N (header 1)
A B C D E F G H I J .. N (header 2)
A B C D E F G H I J .. N (row 1)
A B C D E F G H I J .. N (row 2)
A B C D E F G H I J .. N (row n)
I'm trying to get it as
A B F J .. N (header 1)
A B F J .. N (row 1)
A B F J .. N (row 2)
A B F J .. N (row n)
Thanks.
P.s. I've tried playing with the solutions mentioned in How to print every 4th column up to nth column and from (n+1)th column to last using awk? but the solution doesn't work for me
$ awk 'NR!=2{out=$1; for (i=2;i<=NF;i+=4) out = out OFS $i; print out}' file
A B F J 1)
A B F J 1)
A B F J 2)
A B F J n)
The above output is messy because of the ...s and comments in your sample input that make it untestable. Always post ACTUAL, testable input/output, not a description or other abstract representation of such. And don't repeat the same data on every line as that makes it hard to map the output fields to the input and so makes it harder to understand your requirements. This would have been a more useful example:
$ cat file
101 102 103 104 105 106 107 108 109 110 111
201 202 203 204 205 206 207 208 209 210 211
301 302 303 304 305 306 307 308 309 310 311
401 402 403 404 405 406 407 408 409 410 411
$ awk 'NR!=2{out=$1; for (i=2;i<=NF;i+=4) out = out OFS $i; print out}' file
101 102 106 110
301 302 306 310
401 402 406 410

Getting started with ruby God

When following this tutorial on God, I run the command god -c path/to/simple.god -D and, instead of getting the output described, I get the following weird output.
0000000 G o d . w a t c h d o | w |
778334023 1668571511 1868832872 2088205344
0000020 \n w . n a m e = " s
538976266 1848538912 543518049 1931616317
0000040 i m p l e " \n w . s t a
1819307369 537535077 1998594080 1635021614
0000060 r t = " r u b y / U s e r
1025537138 1970414112 790657378 1919251285
0000100 s / k a m a l / g o d r b / s i
1634414451 795631981 1919184743 1769156450
0000120 m p l e . r b " \n w . k
1701605485 576877102 538976266 1798207264
0000140 e e p a l i v e \n e n d \n
1634755941 1702259052 1684956426 10
0000155
I have no idea why it doesn't work.
I just ran god --version and the output is
od (GNU coreutils) 8.22
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Jim Meyering.
It seems to be some GNU god. I reinstalled god (sudo gem install god) and I still get the above output when running god --version. Is there any workaround?
Just in case: I am using Mac OS X.
It seems you are running the command od instead of god.
$ od -c /etc/lsb-release -D
0000000 D I S T R I B _ I D = U b u n t
1414744388 1598179666 1430078537 1953396066
0000020 u \n D I S T R I B _ R E L E A S
1229195893 1230132307 1163026242 1396786508
0000040 E = 1 4 . 1 0 \n D I S T R I B _
875642181 170930478 1414744388 1598179666
0000060 C O D E N A M E = u t o p i c \n
1162104643 1162690894 1869903165 174287216
0000100 D I S T R I B _ D E S C R I P T
1414744388 1598179666 1129530692 1414547794
0000120 I O N = " U b u n t u 1 4 . 1
1028542281 1969378594 544568430 825111601
0000140 0 " \n
664112
0000143
Fix the typo: god
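If the command really was typed as god and the output still comes from od, another executable named god may be found first in PATH. For example, some macOS package managers install GNU coreutils with a g prefix, which makes god the GNU version of od. A minimal sketch for listing every candidate in lookup order follows; the function name list_candidates is made up for illustration.

```shell
# list_candidates NAME  (hypothetical helper)
# Print every executable file called NAME found in $PATH, in lookup order.
list_candidates() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        candidate="${dir:-.}/$name"            # an empty PATH entry means the current directory
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
        fi
    done
    IFS=$old_ifs
}

list_candidates god
```

The first path printed is the one the shell runs. If it is not the gem's executable, calling the gem's god by its full path or reordering PATH would be the workaround.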

Concatenate last columns from multiple files of one type

I am trying to cat the last 2 columns of multiple text files side by side. The files are in a directory that contains various types of files. All files have more than 2 columns, but there is no guarantee that all files have the same number of columns.
For example, if I have:
file1.txt
1 a b J H
2 b c E E
3 c d L L
4 d e L L
5 e f O O
file2.txt
1 a b M B
2 b c O E
3 c d O E
I want:
J H M B
E E O E
L L O E
L L
O O
The closest I've got is:
awk '{print $(NF-1), "\t", $NF}' *.txt
Which is almost what I want.
For the side-by-side concatenation, I was thinking of something like this:
pr -m -t one.txt two.txt
awk 'NR==FNR{a[NR]=$(NF-1)" "$NF;next}{print $(NF-1),$NF,a[FNR]}' file2.txt file1.txt
Tested:
> cat temp2
1 a b M B
2 b c O E
3 c d O E
> cat temp1
1 a b J H
2 b c E E
3 c d L L
4 d e L L
5 e f O O
> awk 'NR==FNR{a[NR]=$(NF-1)" "$NF;next}{print $(NF-1),$NF,a[FNR]}' temp2 temp1
J H M B
E E O E
L L O E
L L
O O
>
join -a1 -a2 one.txt two.txt | cut -d' ' -f4,5,8,9
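Both answers above handle exactly two files. For an arbitrary number of files, one possible sketch is to extract the last two columns of each file into a temporary file and let paste put them side by side. The function name last_two_side_by_side is made up for illustration, and the sketch assumes every input line has at least two fields. paste separates the per-file parts with a tab and leaves the part empty for files that have run out of lines.

```shell
# last_two_side_by_side FILE...  (hypothetical helper)
# Print the last two columns of every file, side by side, one tab between files.
last_two_side_by_side() {
    tmpdir=$(mktemp -d) || return 1
    n=0
    for f in "$@"; do
        n=$((n + 1))
        # zero-padded names keep the glob below in argument order
        awk '{ print $(NF-1), $NF }' "$f" > "$tmpdir/$(printf '%05d' "$n")"
    done
    paste "$tmpdir"/*
    rm -rf "$tmpdir"
}
```

For the question's layout it would be called as last_two_side_by_side *.txt, which also restricts the input to one file type.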

Check if string exist in non-consecutive lines in a given column

I have files with the following format:
ATOM 8962 CA VAL W 8 8.647 81.467 25.656 1.00115.78 C
ATOM 8963 C VAL W 8 10.053 80.963 25.506 1.00114.60 C
ATOM 8964 O VAL W 8 10.636 80.422 26.442 1.00114.53 O
ATOM 8965 CB VAL W 8 7.643 80.389 25.325 1.00115.67 C
ATOM 8966 CG1 VAL W 8 6.476 80.508 26.249 1.00115.54 C
ATOM 8967 CG2 VAL W 8 7.174 80.526 23.886 1.00115.26 C
ATOM 4440 O TYR S 89 4.530 166.005 -14.543 1.00 95.76 O
ATOM 4441 CB TYR S 89 2.847 168.812 -13.864 1.00 96.31 C
ATOM 4442 CG TYR S 89 3.887 169.413 -14.756 1.00 98.43 C
ATOM 4443 CD1 TYR S 89 3.515 170.073 -15.932 1.00100.05 C
ATOM 4444 CD2 TYR S 89 5.251 169.308 -14.451 1.00100.50 C
ATOM 4445 CE1 TYR S 89 4.464 170.642 -16.779 1.00100.70 C
ATOM 4446 CE2 TYR S 89 6.219 169.868 -15.298 1.00101.40 C
ATOM 4447 CZ TYR S 89 5.811 170.535 -16.464 1.00100.46 C
ATOM 4448 OH TYR S 89 6.736 171.094 -17.321 1.00100.20 O
ATOM 4449 N LEU S 90 3.944 166.393 -12.414 1.00 94.95 N
ATOM 4450 CA LEU S 90 5.079 165.622 -11.914 1.00 94.44 C
ATOM 5151 N LEU W 8 -66.068 209.785 -11.037 1.00117.44 N
ATOM 5152 CA LEU W 8 -64.800 210.035 -10.384 1.00116.52 C
ATOM 5153 C LEU W 8 -64.177 208.641 -10.198 1.00116.71 C
ATOM 5154 O LEU W 8 -64.513 207.944 -9.241 1.00116.99 O
ATOM 5155 CB LEU W 8 -65.086 210.682 -9.033 1.00115.76 C
ATOM 5156 CG LEU W 8 -64.274 211.829 -8.478 1.00113.89 C
ATOM 5157 CD1 LEU W 8 -64.528 211.857 -7.006 1.00111.94 C
ATOM 5158 CD2 LEU W 8 -62.828 211.612 -8.739 1.00112.96 C
In principle, column 5 (W in this case, which represents the chain ID) should only be identical within consecutive chunks. However, in files with too many chains there are not enough letters in the alphabet to assign a unique ID to each chain, so duplication may occur.
I would like to be able to check whether this is the case. In other words, I would like to know whether a given chain ID (A-Z, always in the 5th column) is present in non-consecutive chunks. I do not mind if it changes from W to S; I would like to know whether two chunks share the same chain ID, in this case whether W or S reappears at some point. In fact, this is only a problem if they also share the first and the 6th columns, but I do not want to complicate things too much.
I do not want to print the lines, just the name of the file in which the issue occurs and the chain ID (in this case W), in order to solve the problem. I already know how to solve the problem, but I need to identify the problematic files so that I can focus on those and not repair files that are already sane.
SOLUTION (thanks to all for your help and namely to sehe):
for pdb in *.pdb ; do
    hit=$(awk '$1 == "ATOM"' "$pdb" | cut -c22-23 | uniq | sort | uniq -dc)
    [ "$hit" ] && echo $pdb = $hit
done
For this particular sample:
cut -c22-23 t | uniq | sort | uniq -dc
Will output
2 W
(the 22nd column contains 2 runs of the letter 'W')
Untested:
awk '
seen[$5] && $5 != current {
print "found non-consecutive chain on line " NR
exit
}
{ current = $5; seen[$5] = 1 }
' filename
Here you go, this awk script is tested and takes into account not just 'W':
{
if (ln[$5] && ln[$5] + 1 != NR) {
print "dup " $5 " at line " NR;
}
ln[$5] = NR;
}
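The question asks for the file name and the chain ID rather than line numbers. A sketch that combines the ideas above could look like the following. The function name find_reused_chains is made up for illustration, and the sketch assumes fixed-width PDB files in which the chain ID is character 22 of each ATOM record (the same column the cut -c22-23 solution reads), rather than whitespace-separated field 5.

```shell
# find_reused_chains FILE...  (hypothetical helper)
# Print "<file> <chain ID>" for every chain ID that starts a new run of ATOM
# records after having already appeared in an earlier, separate run.
find_reused_chains() {
    for pdb in "$@"; do
        awk -v name="$pdb" '
            /^ATOM/ {
                id = substr($0, 22, 1)          # chain ID column in fixed-width PDB
                if (id != prev) {               # a new run of records starts here
                    if (id in seen) dup[id] = 1 # this ID was already used by an earlier run
                    seen[id] = 1
                    prev = id
                }
            }
            END { for (id in dup) print name, id }
        ' "$pdb"
    done
}
```

Called as find_reused_chains *.pdb, it prints nothing for files whose chain IDs only occur in one consecutive chunk.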
