Substitute word in a specific line with multiple keyword check - bash

I need to modify a lot of .pdb files for work, and I need to script this operation so I don't waste time doing it manually every time.
I have a file with this particular format (this is an extract from the file):
ATOM 5210 C4 G B 96 10.157 -47.431 -42.832 1.00 43.97 C
ATOM 5211 P G B 97 11.305 -41.644 -44.835 1.00 26.64 P
ATOM 5212 OP1 A B 97 12.654 -41.242 -44.460 1.00 26.64 O
ATOM 5213 OP2 A B 97 10.167 -41.192 -44.014 1.00 26.64 O
ATOM 5214 O5' A B 97 11.079 -41.206 -46.340 1.00 26.64 O
In particular, for each file I need to substitute the word 'OP1' in the third column with another keyword, but ONLY if the first column displays 'ATOM' and a particular number appears in the sixth column.
I tried to script it with sed but didn't get any decent result.
I hope someone can help. Thanks.

A simple way to start (this example matches OP2 on residue 410 in 1X8W.pdb; adjust the three tests to your own keyword and number):
awk '$1=="ATOM" && $6=="410" && $3=="OP2" { sub($3, "XXX") } { print }' 1X8W.pdb

Try this script
while read -r p; do
  value1=$(echo "$p" | cut -d' ' -f1)
  value2=$(echo "$p" | cut -d' ' -f3)
  value3=$(echo "$p" | cut -d' ' -f6)
  if [ "$value1" = "ATOM" ] && [ "$value3" = "97" ] && [ "$value2" = "OP1" ]; then
    echo "$p" | awk '{gsub("OP1", "newtext"); print}'
  else
    echo "$p" # pass non-matching lines through unchanged
  fi
done < 1X8W.pdb
Replace newtext with the text you want substituted for OP1. Also, change the $value3 comparison from 97 to whichever number you are checking for.
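The same check can also be done in one awk call, with the number and the replacement passed in via -v so nothing has to be edited per run. A sketch (num and newname are names invented here; note that assigning to $3 makes awk re-join the line with single spaces -- see the loop sketch at the end of this question for a layout-preserving variant):
num=97      # the sixth-column number to match
newname=XXX # hypothetical replacement keyword
awk -v num="$num" -v newname="$newname" \
  '$1 == "ATOM" && $6 == num && $3 == "OP1" { $3 = newname } { print }' 1X8W.pdb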

Sounds like this might be what you're trying to do:
awk '($1=="ATOM") && ($6==97) { sub(/^OP1$/,"other",$3); print }' file
but without more details in your question we're all just guessing and can't test it.
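Since the original goal is to process many .pdb files, here is a loop sketch under two assumptions of mine: the replacement keyword is the same width as OP1 (so the fixed-column PDB layout survives), and the substitution is done on the whole line because assigning to a field would make awk re-join the record with single spaces.
for f in ./*.pdb; do
  awk '$1 == "ATOM" && $6 == 97 && $3 == "OP1" { sub(/OP1/, "XXX") } { print }' \
    "$f" > "$f.tmp" && mv "$f.tmp" "$f" # rewrite each file via a temp copy
done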

Related

Shell script with grep and sed to extract individuals from a pair after comparing the numerical values of a variable

I want to compare a group of words (individuals) in pairs and extract from each pair the one with the lowest numeric value. My files and script look like this.
Relatedness_3rdDegree.txt (example):
Individual1 Individual2
Individual5 Individual23
Individual50 Individual65
filename.imiss
INDV N_DATA N_GENOTYPES_FILTERED N_MISS F_MISS
Individual1 375029 0 782 0.00208517
Individual2 375029 0 341 0.000909263
Individual3 375029 0 341 0.000909263
Main script:
numlines=$(wc -l Relatedness_3rdDegree.txt|awk '{print $1}')
for line in `seq 1 $numlines`
do
ind1=$(sed -n "${line}p" Relatedness_3rdDegree.txt|awk '{print $1}')
ind2=$(sed -n "${line}p" Relatedness_3rdDegree.txt|awk '{print $2}')
miss1=$(grep $ind1 filename.imiss|awk '{print $5}')
miss2=$(grep $ind2 filename.imiss|awk '{print $5}')
if echo "$miss1 > $miss2" | bc -l | grep -q 1
then
echo $ind1 >> miss.txt
else
echo $ind2 >> miss.txt
fi
echo "$line / $numlines"
done
This script will echo a series of lines like this:
1 / 208
2 / 208
3 / 208
and so on, until getting to this error:
91 / 208
(standard_in) 1: syntax error
92 / 208
(standard_in) 1: syntax error
93 / 208
If I go to my output (miss.txt), the printed individuals are not correct.
It should print the individuals, within the pairs contained in the file "Relatedness_3rdDegree.txt", that have the lowest value of F_MISS (column $5 of the "filename.imiss").
For instance, in the pair "Individual1 Individual2", it should compare their values of F_MISS and print only the individual with the lowest value, which in this example would be Individual2.
I have manually checked the values against the printed individuals, and it looks like it printed random individuals for each pair.
What is wrong in this script?
Bash version:
#!/bin/bash
declare -A imiss
while read -r ind nd ngf nm fm # we'll ignore most of these
do
imiss[$ind]=$fm
done < filename.imiss
while read -r i1 i2
do
if (( $(echo "${imiss[$i1]} > ${imiss[$i2]}" | bc -l) ))
then
echo "$i1"
else
echo "$i2"
fi
done < Relatedness_3rdDegree.txt
Run* it like:
bash-imiss
AWK version:
#!/usr/bin/awk -f
NR == FNR {imiss[$1] = $5; next}
{
if (imiss[$1] > imiss[$2]) {
print $1
} else {
print $2
}
}
Run* it like:
awk-imiss filename.imiss Relatedness_3rdDegree.txt
These two scripts do exactly the same thing in exactly the same way using associative arrays.
* This assumes that you have set the script file executable using chmod and that it's in your PATH and that the data files are in your current directory.
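A footnote on what likely went wrong in the original loop (my diagnosis, not part of the answer above): grep does substring matching, so grep Individual5 also matches Individual50 and Individual65. $miss1 then contains several numbers, bc is handed an expression of the form "0.1 0.2 > 0.3", and you get both the syntax errors and the seemingly random picks. If you want to keep the loop, the smallest fix is an exact-match lookup:
# Only take F_MISS from the line whose first field is exactly the individual.
miss1=$(awk -v id="$ind1" '$1 == id {print $5}' filename.imiss)
miss2=$(awk -v id="$ind2" '$1 == id {print $5}' filename.imiss)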

How can a "grep | sed | awk" script merging line pairs be more cleanly implemented?

I have a little script to extract specific data and clean up the output a little. It seems overly messy and I'm wondering if the script can be trimmed down a bit.
The input file consists of pairs of lines -- names, followed by numbers.
Line pairs where the numeric value is not between 80 and 199 should be discarded.
Pairs may sometimes, but will not always, be preceded or followed by blank lines, which should be ignored.
Example input file:
al12t5682-heapmemusage-latest.log
38
al12t5683-heapmemusage-latest.log
88
al12t5684-heapmemusage-latest.log
100
al12t5685-heapmemusage-latest.log
0
al12t5686-heapmemusage-latest.log
91
Example/wanted output:
al12t5683 88
al12t5684 100
al12t5686 91
Current script:
grep --no-group-separator -PxB1 '([8,9][0-9]|[1][0-9][0-9])' inputfile.txt \
| sed 's/-heapmemusage-latest.log//' \
| awk '{$1=$1;printf("%s ",$0)};NR%2==0{print ""}'
Extra input example
al14672-heapmemusage-latest.log
38
al14671-heapmemusage-latest.log
5
g4t5534-heapmemusage-latest.log
100
al1t0000-heapmemusage-latest.log
0
al1t5535-heapmemusage-latest.log
al1t4676-heapmemusage-latest.log
127
al1t4674-heapmemusage-latest.log
53
A1t5540-heapmemusage-latest.log
54
G4t9981-heapmemusage-latest.log
45
al1c4678-heapmemusage-latest.log
81
B4t8830-heapmemusage-latest.log
76
a1t0091-heapmemusage-latest.log
88
al1t4684-heapmemusage-latest.log
91
Extra Example expected output:
g4t5534 100
al1t4676 127
al1c4678 81
a1t0091 88
al1t4684 91
Another awk:
$ awk -F- 'NR%2{p=$1; next} 80<=$1 && $1<=199 {print p,$1}' file
al12t5683 88
al12t5684 100
al12t5686 91
UPDATE: for input where a blank line separates each record:
$ awk -v RS= '80<=$2 && $2<=199{sub(/-.*/,"",$1); print}' file
al12t5683 88
al12t5684 100
al12t5686 91
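For anyone new to awk, the first one-liner unpacked (note that a stray blank line between a name and its number breaks the NR%2 parity, which is what the RS= variant above sidesteps for blank-line-separated records):
awk -F- '
  NR % 2 {                # odd lines hold the filename
    p = $1                # -F- makes $1 the part before the first "-"
    next
  }
  80 <= $1 && $1 <= 199 { # even lines hold the number
    print p, $1
  }
' file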
Consider implementing this in native bash, as in the following (which can be seen running with your sample input -- including sporadically-present blank lines -- at http://ideone.com/Qtfmrr):
#!/bin/bash
name=; number=
while IFS= read -r line; do
[[ $line ]] || continue # skip blank lines
[[ -z $name ]] && { name=$line; continue; } # first non-blank line becomes name
number=$line # second one becomes number
if (( number >= 80 && number < 200 )); then
name=${name%%-*} # prune everything after first "-"
printf '%s %s\n' "$name" "$number" # emit our output
fi
name=; number= # clear the variables
done <inputfile.txt
The above uses no external commands whatsoever -- so whereas it might be slower to run over large input than a well-implemented awk or perl script, it also has far shorter startup time since no interpreter other than the already-running shell is required.
See:
BashFAQ #1 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?, describing the while read idiom.
BashFAQ #100 - How do I do string manipulations in bash?; or The Bash-Hackers' Wiki on parameter expansion, describing how name=${name%%-*} works.
The Bash-Hackers' Wiki on arithmetic expressions, describing the (( ... )) syntax used for numeric comparisons.
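A two-line illustration of those two idioms, using values from the sample input:
name="al12t5683-heapmemusage-latest.log"
echo "${name%%-*}" # prints al12t5683: strips the longest suffix starting at "-"
number=88
(( number >= 80 && number < 200 )) && echo "in range" # arithmetic test succeeds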
perl -nle's/-.*//; chomp($n=<>); print "$_ $n" if 80<=$n && $n<=199' inputfile.txt
With GNU sed:
sed -E '
  # append the number line: pattern space holds "name\nnumber"
  N
  # numbers 80-99: branch to :A and print
  /\n[8-9][0-9]$/bA
  # otherwise delete the pair unless the number is 100-199
  /\n1[0-9]{2}$/!d
  :A
  # keep the part before the first "-" plus the number
  s/([^-]*).*\n([0-9]+$)/\1 \2/
' infile

bash to identify and verify file headers

Using the tab-delimited files below I am trying to validate the header in line 1 and then store that number in a variable $header to use in a couple of if statements. If $header equals 10, the file has the expected number of fields, but if $header is less than 10, the file is missing headers and the missing header fields are printed underneath. The bash seems close, and if I use the awk by itself it seems to work perfectly, but I cannot seem to use it in the if. Thank you :).
file.txt
Index Chr Start End Ref Alt Freq Qual Score Input
1 1 1 100 C - 1 GOOD 10 .
2 2 20 200 A C .002 STRAND BIAS 2 .
3 2 270 400 - GG .036 GOOD 6 .
file2.txt
Index Chr Start End Ref Alt Freq Qual Score
1 1 1 100 C - 1 GOOD 10
2 2 20 200 A C .002 STRAND BIAS 2
3 2 270 400 - GG .036 GOOD 6
bash
for f in /home/cmccabe/Desktop/validate/*.txt; do
bname=`basename $f`
pref=${bname%%.txt}
header=$(awk -F'\t' '{print NF, "fields detected in file and they are:" ORS $0; exit}') $f >> ${pref}_output # detect header row in file and store in header and write to output
if [[ $header == "10" ]]; then # display results
echo "file has expected number of fields" # file is validated for headers
else
echo "file is missing header for:" # missing header field ...in file not-validated
echo "$header"
fi # close if.... else
done >> ${pref}_output
desired output for file.txt
file has expected number of fields
desired output for file2.txt
file is missing header for:
Input
You can use awk if you like, but bash is more than capable of handling the first-line field comparison on its own. If you maintain an array of expected field names, you can easily split the first line into fields, compare against the expected number of fields, and output the identity of the missing fields if you read fewer than the expected number from any given file.
The following is a short example that takes filenames as arguments (you need to take filenames from stdin for a large number of files, or use xargs, as required). The script simply reads the first line in each file, separates the line into fields, checks the field count, and outputs any missing fields in a short error message:
#!/bin/bash
declare -i header=10 ## header has 10 fields
## array of field names (can be read from 1st file)
fields=( "Index"
"Chr"
"Start"
"End"
"Ref"
"Alt"
"Freq"
"Qual"
"Score"
"Input" )
for i in "$@"; do ## for each file given as argument
read -r line < "$i" ## read first line from file into 'line'
oldIFS="$IFS" ## save current Internal Field Separator (IFS)
IFS=$'\t' ## set IFS to word-split on '\t'
fldarray=( $line ); ## fill 'fldarray' with fields in line
IFS="$oldIFS" ## restore original IFS
nfields=${#fldarray[@]} ## get number of fields in 'line'
if (( nfields < header )) ## test against header
then
printf "error: only '%d' fields in file '%s'\nmissing:" "$nfields" "$i"
for j in "${fields[@]}" ## for each expected field
do ## check against those in line, if not present print
[[ $line =~ $j ]] || printf " %s" "$j"
done
printf "\n\n" ## tidy up with newlines
fi
done
Example Input
$ cat dat/hdr.txt
Index Chr Start End Ref Alt Freq Qual Score Input
1 1 1 100 C - 1 GOOD 10 .
2 2 20 200 A C .002 STRAND BIAS 2 .
3 2 270 400 - GG .036 GOOD 6 .
$ cat dat/hdr2.txt
Index Chr Start End Ref Alt Freq Qual Score
1 1 1 100 C - 1 GOOD 10
2 2 20 200 A C .002 STRAND BIAS 2
3 2 270 400 - GG .036 GOOD 6
$ cat dat/hdr3.txt
Index Chr Start End Alt Freq Qual Score Input
1 1 1 100 - 1 GOOD 10 .
2 2 20 200 C .002 STRAND BIAS 2 .
3 2 270 400 GG .036 GOOD 6 .
Example Use/Output
$ bash hdrfields.sh dat/hdr.txt dat/hdr2.txt dat/hdr3.txt
error: only '9' fields in file 'dat/hdr2.txt'
missing: Input
error: only '9' fields in file 'dat/hdr3.txt'
missing: Ref
Look things over; while awk can do many things bash cannot, bash is more than capable of parsing text on its own.
Here is one in GNU awk (nextfile):
$ awk '
FNR==NR {
for(n=1;n<=NF;n++)
a[$n]
nextfile
}
NF==(n-1) {
print FILENAME " file has expected number of fields"
nextfile
}
{
for(i=1;i<=NF;i++)
b[$i]
print FILENAME " is missing header for: "
for(i in a)
if (!(i in b))
print i
nextfile
}' file1 file1 file2
file1 file has expected number of fields
file2 is missing header for:
Input
The first file processed by the script defines the headers (collected in a) that the following files should have; each subsequent file's headers are collected in b and compared against them.
This piece of code will do exactly what you are asking. Let me know if it works for you.
for f in ./*.txt; do
[[ $(head -1 "$f" | awk '{print NF}') -eq 10 ]] \
  && echo "File $f has all the fields on its header" \
  || echo "File $f is missing" $(echo "Index Chr Start End Ref Alt Freq Qual Score Input $(head -1 "$f")" | tr -s ' \t' '\n' | sort | uniq -c | awk '$1 == 1 {print $2}')
done
Output :
File ./file2.txt is missing Input
File ./file.txt has all the fields on its header
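For readers puzzling over that one-liner, here is its counting trick unpacked as a sketch ($1 == 1 is a stricter version of the original /1 / count test, and tr -s also copes with tab-separated headers):
# List the expected header names together with whatever is on the file's
# first line; after sorting, any name that appears exactly once is either
# missing from the file or unexpected.
{
  echo "Index Chr Start End Ref Alt Freq Qual Score Input"
  head -1 file2.txt
} | tr -s ' \t' '\n' | sort | uniq -c | awk '$1 == 1 {print $2}'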

Clean list of points depending on close range (±5)

How do I clean a list of points held in a variable, dropping any point that is
the same point or
a close-by point (±5)?
Example: each line is one point with two coordinates:
points="808,112\n807,113\n809,113\n155,183\n832,572"
echo "$points"
#808,112
#807,113
#809,113
#155,183
#832,572
#196,652
I would like to ignore points within a range of ±5 counts. The result should be:
echo "$points_clean"
#808,112
#155,183
#832,572
#196,652
I thought about looping through the list, but I need help with how to check whether the same or similar point coordinates already exist in the new list:
points_clean=$(for point in $(echo -e "$points"); do
x=$(echo "$point" | cut -d, -f1)
y=$(echo "$point" | cut -d, -f2)
# check if same or similar point coordinates already in $points_clean
echo "$x,$y"
done)
This seems to work with Bash 4.x (support for process substitution is needed):
#!/bin/bash
close=100
points="808,112\n807,113\n809,113\n155,183\n832,572"
echo -e "$points"
clean=()
distance()
{
echo $(( ($1 - $3) * ($1 - $3) + ($2 - $4) * ($2 - $4) ))
}
while read x1 y1
do
ok=1
for point in "${clean[@]}"
do
echo "compare $x1 $y1 with $point"
set -- $point
if [[ $(distance $x1 $y1 $1 $2) -le $close ]]
then
ok=0
break
fi
done
if [ $ok = 1 ]
then clean+=("$x1 $y1")
fi
done < <( echo -e "$points" | tr ',' ' ' | sort -u )
echo "Clean:"
printf "%s\n" "${clean[#]}" | tr ' ' ','
The sort -u is optional and may slow things down. Identical points are "too close" to each other by definition, so even without the -u the second instance of a given coordinate would be eliminated by the distance test.
Sample output:
808,112
807,113
809,113
155,183
832,572
compare 807 113 with 155 183
compare 808 112 with 155 183
compare 808 112 with 807 113
compare 809 113 with 155 183
compare 809 113 with 807 113
compare 832 572 with 155 183
compare 832 572 with 807 113
Clean:
155,183
807,113
832,572
The workaround for Bash 3.x (as found on Mac OS X 10.10.4, for example) is a tad painful; you need to send the output of the echo | tr | sort command to a file, then redirect the input of the pair of loops from that file (and clean up afterwards). Or you can put the pair of loops and the code that follows (the echo of the clean array) inside the scope of { …; } command grouping.
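For the record, a sketch of that Bash 3.x workaround using a temporary file (mktemp assumed available; the logic is unchanged from the script above):
#!/bin/bash
close=100
points="808,112\n807,113\n809,113\n155,183\n832,572"
clean=()
distance()
{
  echo $(( ($1 - $3) * ($1 - $3) + ($2 - $4) * ($2 - $4) ))
}
tmpfile=$(mktemp) || exit 1 # temp file stands in for the process substitution
echo -e "$points" | tr ',' ' ' | sort -u > "$tmpfile"
while read x1 y1
do
  ok=1
  for point in "${clean[@]}"
  do
    set -- $point
    if [ "$(distance "$x1" "$y1" "$1" "$2")" -le "$close" ]
    then
      ok=0
      break
    fi
  done
  if [ "$ok" = 1 ]
  then clean+=("$x1 $y1")
  fi
done < "$tmpfile"
rm -f "$tmpfile"
echo "Clean:"
printf "%s\n" "${clean[@]}" | tr ' ' ','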
In response to the question 'what defines close?', wittich commented:
Let's say ±5 counts, e.g. 808(±5), 112(±5). That's why the second and third points would be "cleaned".
OK. One way of looking at that would be to adjust the close value to 50 in my script (allowing a difference of 5² + 5²), but that rejects points connected by a line of length just over 7. You could revise the distance function to do ±5; it takes a bit more work and maybe an auxiliary abs function, or you could return the square of the larger delta and compare that with 25 (5², of course). You can play with what the criterion should be to your heart's content.
Note that Bash shell arithmetic is integer arithmetic (only); you need Korn shell (ksh) or Z shell (zsh) to get real arithmetic in the shell, or you need to use bc or some other calculator.
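Following the ±5 suggestion, a sketch of a test on the larger coordinate delta instead of the squared Euclidean distance; abs and chebyshev are helper names invented here:
abs()
{
  echo $(( $1 < 0 ? -$1 : $1 ))
}
chebyshev() # chebyshev x1 y1 x2 y2 -> larger of |x1-x2| and |y1-y2|
{
  local dx dy
  dx=$(abs $(( $1 - $3 )))
  dy=$(abs $(( $2 - $4 )))
  echo $(( dx > dy ? dx : dy ))
}
# In the loop above, two points would then count as close when:
#   [[ $(chebyshev $x1 $y1 $1 $2) -le 5 ]]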

calculate distance; subtract the second column of each line from the first column of the next line using awk

I have a question. I have a file with coordinates (TAB separated)
2 10
35 50
90 200
400 10000
...
I would like to subtract the second column of each line from the first column of the following line, i.e. calculate the distance. I would like a file with:
25
40
200
...
How could I do that using awk?
Thank you very much in advance.
Here is an awk one-liner that may help you:
kent$ awk 'a{print $1-a}{a=$2}' file
25
40
200
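The one-liner relies on a being unset (hence false) on the first line. One caveat: if a second column ever holds 0, the truthiness test would skip the next subtraction, so an NR-based guard is slightly safer:
# Same logic spelled out; NR > 1 replaces the truthiness test on "a".
awk 'NR > 1 { print $1 - prev } { prev = $2 }' file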
Here's a pure bash solution:
{
read _ ps
while read f s; do
echo $((f-ps))
ps=$s
done
} < input_file
This only works if you have (small) integers, as it uses bash's arithmetic. If you want to deal with arbitrary sized integers or floats, you can use bc (with only one fork):
{
read _ ps
while read f s; do
printf '%s-%s\n' "$f" "$ps"
ps=$s
done
} < input_file | bc
Now I'll leave it to the others to give an awk answer!
Alright, since nobody wants to upvote my answer, here's a really funny solution that uses bash and bc:
a=( $(<input_file) )
printf -- '-(%s)+(%s);\n' "${a[@]:1:${#a[@]}-2}" | bc
or the same with dc (shorter but doesn't work with negative numbers):
a=( $(<input_file) )
printf '%s %sr-pc' "${a[@]:1:${#a[@]}-2}" | dc
Using sed and ksh for evaluation (the sed script reads the data on standard input, rewrites it into a series of echo commands, and the resulting file is sourced and then removed):
sed -n "
# line 1 goes to the hold space; every later line is appended to it
1x
1!H
# not the last line yet: start the next cycle
$ !b
# last line: fetch all accumulated lines back into the pattern space
x
# drop the very first x value and the very last y value
s/^ *[0-9]\{1,\} \(.*\) [0-9]\{1,\} *\n* *$/\1 /
# turn each remaining pair into an echo of the difference
s/\([0-9]\{1,\}\)\(\n\)\([0-9]\{1,\}\) /echo \$((\3 - \1))\2/g
s/\n *$//
w /tmp/Evaluate.me
"
. /tmp/Evaluate.me
rm /tmp/Evaluate.me
