For loop and value changing using awk

I have the following file format
...
MODE P E
IMP:P 1 19r 0
IMP:E 1 19r 0
...
SDEF POS= 0 0 14.6 AXS= 0 0 1 EXT=d3 RAD= d4 cell=23 ERG=d1 PAR=2
SI1 L 0.020
SP1 1
SI4 0. 3.401
SI3 0.9
...
NPS 20000000
What I am trying to do is locate a specific value (in particular, the value after the sequence SI1 L) and create a series of files with different values, for instance SI1 L 0.020 ---> SI1 L 0.050. What I have in mind is to give a start value, an end value and a step, so as to generate files with different values after the sequence SI1 L. A for loop would work, but I don't know how to use one outside awk.
I am able to locate the value using
awk '$1=="SI1" {printf "%12s\n", $3}' file
I could also use the following to replace the value
awk '$1=="SI1" {gsub(/0.020/, "0.050"); printf "%12s\n", $3}' file
The thing is that the value won't always be 0.020. That's why I need a way to replace the value after the sequence SI1 L and this replacement should be done for many values.
How can this be achieved?

You can try:
awk -vval="0.05" '$1=="SI1"{$3=val}1' file
This prints the contents of file with SI1 L 0.020 replaced by SI1 L 0.05 (the input file itself is left unchanged).
Then use a bash script to call the awk program in a for loop.
For instance:
#! /bin/bash
vals=(0.02 0.03 0.04 0.05)
i=0
for val in "${vals[@]}"; do
i=$(($i+1))
awk -vval="$val" '$1=="SI1"{$3=val}1' file > "file${i}"
done

If your system has the seq command, here is an easier script.
for val in $(seq 0.02 0.01 0.05)
do
awk -vval="$val" '/SI1 L/{$3=val}1' file > "${val}"
# or, using sed:
# sed "s/SI1 L .*/SI1 L $val/" file > "${val}"
done
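The start/end/step idea from the question can be sketched in one script. This is only an illustration under some assumptions: the file names (input.txt, input_<value>.txt) and the shortened sample input are made up, and the loop counts integers and lets awk derive each value so that rounding errors do not accumulate.

```shell
#!/bin/bash
# Sketch: write one copy of the input per SI1 value between start and end.
# File names and the shortened sample input are placeholders for illustration.
workdir=$(mktemp -d)
cat > "$workdir/input.txt" <<'EOF'
MODE P E
SI1 L 0.020
SP1 1
NPS 20000000
EOF

start=0.02 end=0.05 step=0.01

# Let awk compute the number of steps so the shell only loops over integers.
n=$(awk -v s="$start" -v e="$end" -v d="$step" 'BEGIN { printf "%d", (e - s) / d + 0.5 }')

for ((i = 0; i <= n; i++)); do
    # Derive each value from the integer counter instead of adding step repeatedly.
    val=$(awk -v s="$start" -v d="$step" -v i="$i" 'BEGIN { printf "%.3f", s + i * d }')
    # Replace the third field only on the "SI1 L" line; print every line.
    awk -v val="$val" '$1 == "SI1" && $2 == "L" { $3 = val } 1' \
        "$workdir/input.txt" > "$workdir/input_${val}.txt"
done
```

With these settings the script should leave input_0.020.txt through input_0.050.txt in the temporary directory, each differing only in the SI1 line.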

Related

BASH: Performing decimal division on a column in file and printing result in another file

I have a file (in.txt) with the following columns:
# DM Sigma Time (s) Sample Downfact
78.20 7.36 134.200512 2096883 70
78.20 7.21 144.099904 2251561 70
78.20 9.99 148.872384 2326131 150
78.20 10.77 283.249664 4425776 45
I want to write a bash script to divide all values in column 'Time' by 0.5867, get a precision up to 2 decimal points and print out the resulting values in another file out.txt
I tried using bc/awk but it gives this error.
awk: cmd. line:1: fatal: division by zero attempted
awk: fatal: cannot open file `file' for reading (No such file or directory)
Could someone help me with this? Thanks.
This is the bash script that I attempted:
cat in.txt | while read DM Sigma Time Sample Downfact; do
echo "$DM $Sigma $Time $Sample $Downfact"
pperiod = 0.5867
awk -v n=$Time 'BEGIN {printf "%.2f\n", (n/$pperiod)}'
#echo "scale=2 ; $Time / $pperiod" | bc
#echo "$subint" > out.txt
done
I expected the script to divide column 'Time' with pperiod and get the result with a precision of 2 decimal places. This result should be printed to a file named out.txt
Lots of issues with current awk code:
need to pass in the value of the $pperiod variable
need to reference the Time column by its position ($3 in this case)
BEGIN{} block is applied before any input lines are processed and has nothing to do with processing of actual input lines
there is no code to perform processing on actual input lines
need to decide what to do in the case of a divide by zero scenario (in this case we'll default answer to 0.00)
NOTE: current code generates divide by zero error because $pperiod is an undefined (awk) variable which in turn defaults to 0
additionally, pperiod = 0.5867 is invalid bash syntax
One idea for fixing current issues:
pperiod=0.5867
awk -v pp="${pperiod}" 'NR>1 {printf "%.2f\n", (pp==0 ? 0 : ($3/pp))}' in.txt > out.txt
Where:
-v pp="${pperiod}" - assign awk variable pp the value of the bash variable "${pperiod}"
NR>1 - skip header line
NR>1 {printf "%.2f\n" ...} - for each input line other than the header line, print the result of dividing the Time column (aka $3) by the value of the awk variable pp (which holds the value of the bash variable "${pperiod}")
(pp==0 ? 0 : ($3/pp)) - if pp is equal to 0 we print 0, else we print the result of $3/pp (this keeps us from generating a divide by zero error)
NOTE: this also eliminates the need for the cat|while loop
This generates:
$ cat out.txt
228.74
245.61
253.75
482.78
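If the other columns should be kept next to the result, a variation along these lines might work. It is a sketch only: the column label Time/P and the temporary file locations are made up for the example, and only the first two data rows of in.txt are reproduced.

```shell
#!/bin/bash
# Sketch: keep every original column and append Time/pperiod as a new column.
workdir=$(mktemp -d)
cat > "$workdir/in.txt" <<'EOF'
# DM Sigma Time (s) Sample Downfact
78.20 7.36 134.200512 2096883 70
78.20 7.21 144.099904 2251561 70
EOF

pperiod=0.5867
awk -v pp="$pperiod" '
    NR == 1 { print $0, "Time/P"; next }                  # header: copy it and label the new column
    { printf "%s %.2f\n", $0, (pp == 0 ? 0 : $3 / pp) }   # data: original line plus the quotient
' "$workdir/in.txt" > "$workdir/out.txt"
```

The guard against pp being 0 is the same as in the answer above; only the output layout differs.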

Faster way to extract data from large file

I have a file containing about 40000 frames of Cartesian coordinates of 28 atoms. I need to extract the coordinates of atoms 21 to 27 from each frame.
I tried using a bash script with a for-loop.
for i in {0..39999}
do
cat $1 | grep -A 27 "frame $i " | tail -n 6 | awk '{print $2, $3, $4}' >> new_coors.xyz
done
The data have the following form:
28
-1373.82296 frame 0 xyz file generated by terachem
Re 1.6345663991 0.9571586961 0.3920887712
N 0.7107677071 -1.0248027788 0.5007181135
N -0.3626961076 1.1948218124 -0.4621264246
C -1.1299268126 0.0792071086 -0.5595954110
C -0.5157993503 -1.1509115191 -0.0469223696
C 1.3354467762 -2.1017253883 1.0125736017
C 0.7611763218 -3.3742177216 0.9821756556
C -1.1378354025 -2.4089069492 -0.1199253156
C -0.4944655989 -3.5108477831 0.4043826684
C -0.8597552614 2.3604180994 -0.9043060625
C -2.1340008843 2.4846545826 -1.4451933224
C -2.4023114639 0.1449111237 -1.0888703147
C -2.9292779079 1.3528434658 -1.5302429615
H 2.3226814021 -1.9233467458 1.4602019023
H 1.3128699342 -4.2076373780 1.3768411246
H -2.1105470176 -2.5059031902 -0.5582958817
H -0.9564415355 -4.4988963635 0.3544299401
H -0.1913951275 3.2219343258 -0.8231465989
H -2.4436044324 3.4620639189 -1.7693069306
H -3.0306593902 -0.7362803011 -1.1626515622
H -3.9523215784 1.4136948699 -1.9142814745
C 3.3621999538 0.4972227756 1.1031860016
O 4.3763020637 0.2022266109 1.5735343064
C 2.2906331057 2.7428149541 0.0483795630
O 2.6669163864 3.8206298898 -0.1683800650
C 1.0351398442 1.4995168190 2.1137684156
O 0.6510904387 1.8559680025 3.1601927094
Cl 2.2433490373 0.2064711824 -1.9226174036
It works, but it takes an enormous amount of time.
In the future I will be working with larger files. Is there a faster way to do that?
Your program is slow because it re-reads the input file on every iteration of the for-loop. You can instead read the file a single time with awk:
awk '/frame/{c=0;next}{c++}(c>20 && c<27){ print $2,$3,$4 }' input > output
This answer assumes the following form of data:
frame ???
??? x y z ???
??? x y z ???
...
frame ???
??? x y z ???
??? x y z ???
...
The solution checks whether a line contains the word frame. If so, it resets the atom counter c to zero and skips to the next line. From that point on, it increments the counter for every line it reads. If the counter is between 20 and 27 (exclusive), it prints the coordinates.
You can now easily expand on this. Assume you want the same atoms, but only from frame 1000 to 1500. You can do this by introducing a frame counter fc:
awk '/frame/{fc++;c=0;next}{c++}(fc>=1000 && fc <=1500) && (c>20 && c<27){ print $2,$3,$4 }' input > output
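The same single-pass idea can be parameterised. The sketch below is a variation rather than the exact command above: it takes the frame number from the third field of the header line instead of counting frames, and the variable names first, last, fmin and fmax as well as the tiny stand-in trajectory are made up for the example.

```shell
#!/bin/bash
# Sketch: select an atom range and a frame range through awk variables.
workdir=$(mktemp -d)

# Build a tiny stand-in trajectory: 3 frames of 4 atoms each (made-up coordinates).
for f in 0 1 2; do
    echo 4
    echo "-1.0 frame $f xyz file"
    for a in 1 2 3 4; do
        echo "X $f.$a 0.0 0.0"
    done
done > "$workdir/traj.xyz"

# first/last select atoms (inclusive); fmin/fmax select frames by the number in the file.
awk -v first=2 -v last=3 -v fmin=1 -v fmax=2 '
    $2 == "frame" { frame = $3; c = 0; next }   # header line: remember frame number, reset atom counter
    { c++ }                                     # every other line advances the atom counter
    frame >= fmin && frame <= fmax && c >= first && c <= last { print $2, $3, $4 }
' "$workdir/traj.xyz" > "$workdir/subset.xyz"
```

For the real data one would presumably set first and last to the wanted atom numbers and fmin and fmax to the wanted frame range; the file is still read only once.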
If the frame numbers in the file are already in sorted order, e.g. 0 - 39999 in this order, then maybe something like this could do the job (not tested, as Jepessen suggested; in the sample data the word frame is the second field of the header line, and grep -A prints a -- separator between groups, so the awk part accounts for both):
cat $1 | grep -A 27 -E "frame [0-9]+ " | \
awk '$1 == "--" {next} {if ($2 == "frame") n = 0; if (n++ > 20) print $2, $3, $4}' > new_coors.xyz
(the code above is deliberately verbose so that it is easier to understand and closer to your existing script. If you need a more compact solution, check kvantour's answer)
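You could perhaps use 2 passes of grep, rather than thousands?
Assuming you want the lines 21-27 after every frame, and you don't want to record the frame number itself, the following should get the lines you want, which you can then 'tidy' with awk. The second grep matches the -- separator that grep -A prints between groups, so the final group, which has no trailing separator, needs separate handling:
grep -A27 ' frame ' "$1" | grep -B6 -e '^--$'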
If you also wanted the frame numbers (I see no evidence), or you really want to restrict the range of frame numbers, you could do that with tee and >( grep 'frame') to generate a second file that you would then need to re-merge. If you added -n to grep then you could easily merge sort the files on line number.
Another way to restrict the frame number without doing multiple passes would be a more complex grep expression that describes the range of numbers (-E because life is too short for backslashes):
-E ' frame ([0-9]{1,4}|[0-3][0-9]{4}) '

Replace a value in a file by another one (bash/awk)

I have a file (a coordinates file for those who know what it is) like following :
1 C 1
2 C 1 1 1.60000
3 H 5 1 1.10000 2 109.4700
4 H 5 1 1.10000 2 109.4700 3 109.4700 1
and so on. My idea is to replace the value "1.60000" in the second line with other values using a for loop.
I would like the value to start at, let's say, 0 and stop at 2.0, for example, with an increment step of 0.05.
Here is what I already tried:
#! /bin/bash
a=0;
for ((i=0; i<=10 (for example); i++)); do
awk '{if ((NR==2) && ($5=="1.60000")) {($5=a)} print $0 }' file.dat > ${i}_file.dat
a=$((a+0.05))
done
But, unfortunately, it doesn't work. I tried a lot of combinations for the {$5=a} statement, but without conclusive results.
Here is what I obtained:
1 C 1
2 C 1 1
3 H 5 1 1.10000 2 109.4700
4 H 5 1 1.10000 2 109.4700 3 109.4700 1
The value 1.60000 simply disappears, or at least is replaced by a blank.
Any advice?
Thanks a lot,
Pierre-Louis
For this, perhaps sed is a better alternative:
$ v=0.00; for((i=0; i<=40; i++)) do
sed '2s/1.60/'"$v"'/' file > file_"$i";
v=$(echo "$v + 0.05" | bc | xargs printf "%.2f\n");
done
Explanation
sed '2s/1.60/'"$v"'/' file replaces the value 1.60 on the second line with the value of the variable v
Floating-point arithmetic in bash is hard; the bc and printf pipeline adds 0.05 to the value and formats it (0.05 instead of .05) so that we can use it in the sed substitution.
Exercise to you: in bash try to add 0.05 to 0.05 and format the output as 0.10 with leading zero.
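One possible take on that exercise, as a sketch rather than the only way to do it: count in hundredths so that bash only ever handles integers, and let printf put the decimal point back. The variable names are made up for the example.

```shell
#!/bin/bash
# Sketch: produce 0.00, 0.05, ..., 2.00 using integer arithmetic only.
step=5                      # 0.05 expressed in hundredths
v=0
values=()
for ((i = 0; i <= 40; i++)); do
    # Integer part and two-digit fractional part, e.g. 10 -> 0.10, 200 -> 2.00
    values+=("$(printf '%d.%02d' $((v / 100)) $((v % 100)))")
    v=$((v + step))
done
echo "${values[1]} ${values[2]} ${values[40]}"
```

Each element of values can then be used in the sed substitution in place of the bc pipeline.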
Example with awk (glenn's suggestion), here with the 0.05 step and the 0 to 2.0 range from the question:
for ((i=0; i<=40; i++)); do
awk -v i="$i" '
FNR==2 { $5 = sprintf("%.5f", i*0.05) }  # rewrite field 5 on line 2 only
{ print }                                # print every line, changed or not
' file.dat > "${i}_file.dat"
done
Advantage: awk handles the floating-point arithmetic.
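Since the coordinates file is small, awk could also write all the output files in a single run, without a shell loop. This is only a sketch under assumptions: the variables outdir, n and step are made up for the example, the sample file is shortened to three lines, and line 2 is rebuilt with single spaces between fields.

```shell
#!/bin/bash
# Sketch: one awk run that writes 0_file.dat ... 40_file.dat.
workdir=$(mktemp -d)
cat > "$workdir/file.dat" <<'EOF'
1 C 1
2 C 1 1 1.60000
3 H 5 1 1.10000 2 109.4700
EOF

awk -v outdir="$workdir" -v n=40 -v step=0.05 '
    { line[NR] = $0 }                          # keep the whole (small) file in memory
    NR == 2 { nf = split($0, f, " ") }         # remember the fields of line 2
    END {
        for (i = 0; i <= n; i++) {
            out = outdir "/" i "_file.dat"
            f[5] = sprintf("%.5f", i * step)   # new value for field 5
            row = f[1]
            for (k = 2; k <= nf; k++) row = row " " f[k]
            for (l = 1; l <= NR; l++)
                print (l == 2 ? row : line[l]) > out
            close(out)                         # do not keep 41 files open at once
        }
    }
' "$workdir/file.dat"
```

The trade-off is that the whole input is held in memory, which is fine for a coordinates file of this size but not for very large inputs.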

Bash: arithmetic addressed by line number and column

I have normally done this with Excel, but as I am trying to learn bash, I'd like to ask for advice here on how to do so. My input file resembles:
# s0 legend "1001"
# s1 legend "1002"
#target G0.S0
#type xy
2.0 -1052.7396157664
2.5 -1052.7330560932
3.0 -1052.7540013664
3.5 -1052.7780321236
4.0 -1052.7948229060
4.5 -1052.8081313831
5.0 -1052.8190310613
&
#target G0.S1
#type xy
2.0 -1052.5384564253
2.5 -1052.7040374678
3.0 -1052.7542803612
3.5 -1052.7781686744
4.0 -1052.7948927247
4.5 -1052.8081704241
5.0 -1052.8190543049
&
where the above only shows two data sets: s0 and s1. In reality I have 17 data sets and will combine them arbitrarily. By combine, I mean I would like to:
For two data sets, extract the second column of each separately.
Subtract these two columns row by row.
Multiply the difference by a constant, $C.
Note: $C multiplies very small numbers and the only way I could get it to not divide by zero was to take a massive scale.
Edit: After requests, I was apparently not entirely clear what I was going for. Take for example:
set0
2 x
3 y
4 z
set1
2 r
3 s
4 t
I also have defined a constant C.
I would like to perform the following operation:
C*(r - x)
C*(s - y)
C*(t - z)
I will be doing this for sets > 1, up to 16, for example (set 10) minus (set 0). Therefore, I need the flexibility to target a value based on its line number and column number, and preferably acting over a range of line numbers to make it efficient.
So far this works:
C=$(echo "scale=45;x=(small numbers)*(small numbers); x" | bc -l)
sed -n '5,11p' input.in | cut -c 5-20 > tmp1.in
sed -n '15,21p' input.in | cut -c 5-20 > tmp2.in
pr -m -t -s tmp1.in tmp2.in > tmp3.in
awk '{printf $2-$1 "\n"}' tmp3.in > tmp4.in
but the multiplication failed:
awk '{printf "%11.2f\n", "$C"*$1 }' tmp4.in > tmp5.in
returning:
0.00
0.00
0.00
0.00
0.00
0.00
0.00
I have a feeling the whole thing can be accomplished more elegantly with awk. I also tried this:
for (( i=0; i<=6; i++ ))
do
n=5+$i
m=10+n
awk 'NR==n{a=$2};NR==m{b=$2} {printf "%d\n", $b-$a}' input.in > temp.in
done
but all I get in temp.in is a long column of 0s.
I also tried
awk 'NR==5,NR==11{a=$2};NR==15,NR==21{b=$2} {printf "%d\n", $b-$a}' input.in > temp.in
but got the error
awk: (FILENAME=input.in FNR=20) fatal: attempt to access field -1052
Any idea how to formulate this with awk, and if that doesn't work, then why I cannot multiply with awk above? Thank you!
this does the math in one go
$ awk -v c=1 '/^&/ {s++}
s==1 {a[$1]=$2}
s==3 {print $1,a[$1],$2,c*(a[$1]-$2)}
/#type/ {s++}' file
2.0 -1052.7396157664 -1052.5384564253 -0.201159
2.5 -1052.7330560932 -1052.7040374678 -0.0290186
3.0 -1052.7540013664 -1052.7542803612 0.000278995
3.5 -1052.7780321236 -1052.7781686744 0.000136551
4.0 -1052.7948229060 -1052.7948927247 6.98187e-05
4.5 -1052.8081313831 -1052.8081704241 3.9041e-05
5.0 -1052.8190310613 -1052.8190543049 2.32436e-05
You can remove the decorations and add print formatting easily. The magic numbers 1 (= 2*g1-1) and 3 (= 2*g2-1) correspond to data groups g1=1 and g2=2 in the order they appear in the data file; they can be converted to awk variables as well.
The counter s keeps track of whether you're in a set or not: odd values correspond to sets and even values to the gaps between sets. The counter is incremented at both the start pattern (#type) and the end pattern (&). The increment statements are ordered so that the marker lines themselves are never stored or printed: the & increment comes first, then the store and print rules, and the #type increment comes last. You can change the order and observe the effects.
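Turning the magic numbers into awk variables might look like the sketch below. The names g1 and g2 follow the explanation above, while the three small data sets are made up so the arithmetic is easy to check. It assumes set g1 appears before set g2 in the file, and it prints c*(later - earlier), which is the sign convention asked for in the question.

```shell
#!/bin/bash
# Sketch: choose which two sets to combine with -v g1=... -v g2=...
workdir=$(mktemp -d)
cat > "$workdir/sets.dat" <<'EOF'
#target G0.S0
#type xy
2.0 10
2.5 20
&
#target G0.S1
#type xy
2.0 13
2.5 25
&
#target G0.S2
#type xy
2.0 11
2.5 28
&
EOF

# While reading the data lines of the k-th set, the counter s equals 2*k - 1.
result=$(awk -v c=2 -v g1=1 -v g2=3 '
    /^&/          { s++ }                          # end of a set
    s == 2*g1 - 1 { a[$1] = $2 }                   # store the earlier set, keyed by x value
    s == 2*g2 - 1 { print $1, c * ($2 - a[$1]) }   # later set minus earlier set, times c
    /#type/       { s++ }                          # start of a set
' "$workdir/sets.dat")
echo "$result"
```

Here the third set minus the first set, multiplied by c=2, gives 2 and 16 for the two rows.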
This might be what you're looking for:
$ cat tst.awk
/^[#&]/ { lineNr=0; next }
{
++lineNr
if (lineNr in prev) {
print $1, c * ($2 - prev[lineNr])
}
prev[lineNr] = $2
}
$ awk -v c=100000 -f tst.awk file
2.0 20115.9
2.5 2901.86
3.0 -27.8995
3.5 -13.6551
4.0 -6.98187
4.5 -3.9041
5.0 -2.32436
In your first try, you should replace that line:
awk '{printf "%11.2f\n", "$C"*$1 }' tmp4.in > tmp5.in
with this one:
awk -v C="$C" '{printf "%11.2f\n", C*$1 }' tmp4.in > tmp5.in
You are mixing bash notation with awk notation.
In the shell you define a variable without $ and use it with $.
Inside an awk script, variables are used without $. The $ is reserved for field references such as $1, $2, ...
You have put single quotes ' around your awk script, so shell variables are not expanded there: you wrote $C, but the shell never sees it inside single quotes. That is why you have to write awk -v C="$C", so that the value of the shell variable $C is passed to an awk variable called C.
Your other awk attempts contain the same kind of error (for example $b-$a instead of b-a, and the shell variables n and m used inside the single-quoted script). Now I think you'll make it.
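The difference can be seen in a two-line experiment. This is a minimal sketch; the value 2.5 and the input 4 are arbitrary.

```shell
#!/bin/bash
C=2.5

# Inside single quotes the shell never expands $C, so awk sees the string "$C",
# which counts as 0 when used as a number.
wrong=$(echo 4 | awk '{ printf "%.2f", "$C" * $1 }')

# -v copies the shell value into an awk variable named C.
right=$(echo 4 | awk -v C="$C" '{ printf "%.2f", C * $1 }')

echo "wrong=$wrong right=$right"
```

The first command reproduces the column of 0.00 values from the question; the second multiplies as intended.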

Using awk with Operations on Variables

I'm trying to write a Bash script that reads files with several columns of data and multiplies each value in the second column by each value in the third column, adding the results of all those multiplications together.
For example if the file looked like this:
Column 1 Column 2 Column 3 Column 4
genome 1 30 500
genome 2 27 500
genome 3 83 500
...
The script should multiply 1*30 to give 30, then 2*27 to give 54 (and add that to 30), then 3*83 to give 249 (and add that to 84) etc..
I've been trying to use awk to parse the input file but am unsure of how to get the operation to proceed line by line. Right now it stops after the first line is read and the operations on the variables are performed.
Here's what I've written so far:
for file in fileone filetwo
do
set -- $(awk '/genome/ {print $2,$3}' $file.hist)
var1=$1
var2=$2
var3=$((var1*var2))
total=$((total+var3))
echo var1 \= $var1
echo var2 \= $var2
echo var3 \= $var3
echo total \= $total
done
I tried placing a "while read" loop around everything but could not get the variables to update with each line. I think I'm going about this the wrong way!
I'm very new to Linux and Bash scripting so any help would be greatly appreciated!
That's because awk reads the entire file and runs its program on each line. So the output you get from awk '/genome/ {print $2,$3}' $file.hist will look like
1 30
2 27
3 83
and so on, which means in the bash script, the set command makes the following variable assignments:
$1 = 1
$2 = 30
$3 = 2
$4 = 27
$5 = 3
$6 = 83
etc. But you only use $1 and $2 in your script, meaning that the rest of the file's contents - everything after the first line - is discarded.
Honestly, unless you're doing this just to learn how to use bash, I'd say just do it in awk. Since awk automatically runs over every line in the file, it'll be easy to multiply columns 2 and 3 and keep a running total.
awk '{ total += $2 * $3 } ENDFILE { print total; total = 0 }' fileone filetwo
Here ENDFILE is a special pattern that means "run this next block at the end of each file, not at each line." Note that ENDFILE is a GNU awk (gawk) extension, so it is not available in every awk.
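Where gawk is not available, a similar per-file total can be sketched with FNR, which restarts at 1 for every input file. This is an alternative technique, not the command above. The sample files are made up (the first mirrors the rows from the question), the /genome/ filter is taken from the question's own awk call, and empty input files are not handled.

```shell
#!/bin/bash
# Sketch: per-file totals without ENDFILE, using FNR == 1 to detect a new file.
workdir=$(mktemp -d)
printf 'genome 1 30 500\ngenome 2 27 500\ngenome 3 83 500\n' > "$workdir/fileone.hist"
printf 'genome 2 10 500\ngenome 4 5 500\n' > "$workdir/filetwo.hist"

totals=$(awk '
    FNR == 1 && NR > 1 { print total; total = 0 }   # first line of a later file: flush the previous total
    /genome/           { total += $2 * $3 }
    END                { print total }              # flush the total of the last file
' "$workdir/fileone.hist" "$workdir/filetwo.hist")
echo "$totals"
```

For the first file this gives 30 + 54 + 249 = 333, matching the running sum described in the question.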
If you are doing this for educational purposes, let me say this: the only thing you need to know about doing arithmetic in bash is that you should never do arithmetic in bash :-P Seriously though, when you want to manipulate numbers, bash is one of the least well-adapted tools for that job. But if you really want to know, I can edit this to include some information on how you could do this task primarily in bash.
I agree that awk is in general better suited for this kind of work, but if you are curious what a pure bash implementation would look like:
for f in file1 file2; do
total=0
while read -r _ x y _; do
((total += x * y))
done < "$f"
echo "$total"
done

Resources