Replace a value in a file by another one (bash/awk) - bash

I have a file (a coordinates file for those who know what it is) like following :
1 C 1
2 C 1 1 1.60000
3 H 5 1 1.10000 2 109.4700
4 H 5 1 1.10000 2 109.4700 3 109.4700 1
and so on.. My idea is to replace the value "1.60000" in the second line, by other values using a for loop.
I would like the value to start at, lets say 0, and stop at 2.0 for example, with a increment step of 0.05
Here is what I already tried:
#! /bin/bash
a=0;
for ((i=0; i<=10 (for example); i++)); do
awk '{if ((NR==2) && ($5=="1.60000")) {($5=a)} print $0 }' file.dat > ${i}_file.dat
a=$((a+0.05))
done
But, unfortunately it doesn't work. I tried a lot of combination for the {$5=a} statement but without conclusive results.
Here is what I obtained:
1 C 1
2 C 1 1
3 H 5 1 1.10000 2 109.4700
4 H 5 1 1.10000 2 109.4700 3 109.4700 1
The value 1.6000 simply dissapear or at least replaced by a blank.
Any advice ?
Thanks a lot,
Pierre-Louis

for this perhaps sed is a better alternative
$ v=0.00; for((i=0; i<=40; i++)) do
sed '2s/1.60/'"$v"'/' file > file_"$i";
v=$(echo "$v + 0.05" | bc | xargs printf "%.2f\n");
done
Explanation
sed '2s/1.60/'"$v"'/' file change the value 1.60 on second line with the value of variable v
floating point arithmetic in bash is hard, this adds 0.05 to the value and formats it (0.05 instead of .05) so that we can use it in the substitution with sed.
Exercise to you: in bash try to add 0.05 to 0.05 and format the output as 0.10 with leading zero.

example with awk (glenn's suggestion)
for ((i=0; i<=10; i++)); do
awk -v "i=$i" '
(FNR==2){ $5=sprintf("%2.1f ",i*0.5); print $0 }
' file.dat # > $i_file.dat # uncomment for a file output
done
advantage: it's awk who manage floating-point arithmetic

Related

Sorting on multiple columns w/ an output file per key

I'm uncertain as to how I can use the until loop inside a while loop.
I have an input file of 500,000 lines that look like this:
9 1 1 0.6132E+02
9 2 1 0.6314E+02
10 3 1 0.5874E+02
10 4 1 0.5266E+02
10 5 1 0.5571E+02
1 6 1 0.5004E+02
1 7 1 0.5450E+02
2 8 1 0.5696E+02
11 9 1 0.6369E+02
.....
And what I'm hoping to achieve is to sort the numbers in the first column in numerical order such that I can pull all the similar lines (eg. lines that start with the same number) into new text files "cluster${i}.txt". From there I want to sort the fourth column of ("cluster${i}.txt") files in numerical order. After sorting I would like to write the first row of each sorted "cluster${i}.txt" file into a single output file. A sample output of "cluster1.txt" would like this:
1 6 1 0.5004E+02
1 7 1 0.5450E+02
1 11 1 0.6777E+02
....
as well as an output.txt file that would look like this:
1 6 1 0.5004E+02
2 487 1 0.3495E+02
3 34 1 0.0344E+02
....
Here is what I've written:
#!/bin/bash
input='input.txt'
i=1
sort -nk 1 $input > 'temp.txt'
while read line; do
awk -v var="$i" '$1 == var' temp.txt > "cluster${i}.txt"
until [[$i -lt 20]]; do
i=$((i+1))
done
done
for f in *.txt; do
sort -nk 4 > temp2.txt
head -1 temp2.txt
rm temp2.txt
done > output.txt
This only takes one line, if your sort -n knows how to handle exponential notation:
sort -nk 1,4 <in.txt | awk '{ of="cluster" $1 ".txt"; print $0 >>of }'
...or, to also write the first line for each index to output.txt:
sort -nk 1,4 <in.txt | awk '
{
if($1 != last) {
print $0 >"output.txt"
last=$1
}
of="cluster" $1 ".txt";
print $0 >of
}'
Consider using an awk implementation -- such as GNU awk -- which will cache file descriptors, rather than reopening each output file for every append; this will greatly improve performance.
By the way, let's look at what was wrong with the original script:
It was slow. Really, really slow.
Starting a new instance of awk 20 times for every line of input (because the whole point of while read is to iterate over individual lines, so putting an awk inside a while read is going to run awk at least once per line) is going to have a very appreciable impact on performance. Not that it was actually doing this, because...
The while read line outer loop was reading from stdin, not temp.txt or input.txt.
Thus, the script was hanging if stdin didn't have anything written on it, or wasn't executing the contents of the loop at all if stdin pointed to a source with no content like /dev/null.
The inner loop wasn't actually processing the line read by the outer loop. line was being read, but all of temp.txt was being operated on.
The awk wasn't actually inside the inner loop, but rather was inside the outer loop, just before the inner loop. Consequently, it wasn't being run 20 times with different values for i, but run only once per line read, with whichever value for i was left over from previously executed code.
Whitespace is important to how commands are parsed. [[foo]] is wrong; it needs to be [[ foo ]].
To "fix" the inner loop, to do what I imagine you meant to write, might look like this:
# this is slow and awful, but at least it'll work.
while IFS= read -r line; do
i=0
until [[ $i -ge 20 ]]; do
awk -v var="$i" '$1 == var' <<<"$line" >>"cluster${i}.txt"
i=$((i+1))
done
done <temp.txt
...or, somewhat better (but still not as good as the solution suggested at the top):
# this is a somewhat less awful.
for (( i=0; i<=20; i++ )); do
awk -v var="$i" '$1 == var' <temp.txt >"cluster${i}.txt"
head -n 1 "cluster${i}.txt"
done >output.txt
Note how the redirection to output.txt is done just once, for the whole loop -- this means we're only opening the file once.

How to sum a row of numbers from text file-- Bash Shell Scripting

I'm trying to write a bash script that calculates the average of numbers by rows and columns. An example of a text file that I'm reading in is:
1 2 3 4 5
4 6 7 8 0
There is an unknown number of rows and unknown number of columns. Currently, I'm just trying to sum each row with a while loop. The desired output is:
1 2 3 4 5 Sum = 15
4 6 7 8 0 Sum = 25
And so on and so forth with each row. Currently this is the code I have:
while read i
do
echo "num: $i"
(( sum=$sum+$i ))
echo "sum: $sum"
done < $2
To call the program it's stats -r test_file. "-r" indicates rows--I haven't started columns quite yet. My current code actually just takes the first number of each column and adds them together and then the rest of the numbers error out as a syntax error. It says the error comes from like 16, which is the (( sum=$sum+$i )) line but I honestly can't figure out what the problem is. I should tell you I'm extremely new to bash scripting and I have googled and searched high and low for the answer for this and can't find it. Any help is greatly appreciated.
You are reading the file line by line, and summing line is not an arithmetic operation. Try this:
while read i
do
sum=0
for num in $i
do
sum=$(($sum + $num))
done
echo "$i Sum: $sum"
done < $2
just split each number from every line using for loop. I hope this helps.
Another non bash way (con: OP asked for bash, pro: does not depend on bashisms, works with floats).
awk '{c=0;for(i=1;i<=NF;++i){c+=$i};print $0, "Sum:", c}'
Another way (not a pure bash):
while read line
do
sum=$(sed 's/[ ]\+/+/g' <<< "$line" | bc -q)
echo "$line Sum = $sum"
done < filename
Using the numsum -r util covers the row addition, but the output format needs a little glue, by inefficiently paste-ing a few utils:
paste "$2" \
<(yes "Sum =" | head -$(wc -l < "$2") ) \
<(numsum -r "$2")
Output:
1 2 3 4 5 Sum = 15
4 6 7 8 0 Sum = 25
Note -- to run the above line on a given file foo, first initialize $2 like so:
set -- "" foo
paste "$2" <(yes "Sum =" | head -$(wc -l < "$2") ) <(numsum -r "$2")

Count and sum up specific decimal number (bash,awk,perl)

I have a tab delimited file and I want to sum up certain decimal number to the output (1.5) each time its find number instead of character to the first column and print out the results for all the rows from the first to the last.
I have example file which look like this:
It has 8 rows
1st-column 2nd-Column
a ship
1 name
b school
c book
2 blah
e blah
3 ...
9 ...
Now I want my script to read line by line and if it finds number sum up 1.5 and give me output just for first column like this:
0
1.5
1.5
1.5
3
3
4.5
6
my script is:
#!/bin/bash
for c in {1..8}
do
awk 'NR==$c { if (/^[0-9]/) sum+=1.5} END {print sum }' file
done
but I don't get any output!
Thanks for your help in advance.
The last item in your expected output appears to be incorrect. If it is, then you can do:
$ awk '$1~/^[[:digit:]]+$/{sum+=1.5}{print sum+0}' file
0
1.5
1.5
1.5
3
3
4.5
6
use warnings;
use strict;
my $sum = 0;
while (<DATA>) {
my $data = (split)[0]; # 1st column
$sum += 1.5 if $data =~ /^\d+$/;
print "$sum\n";
}
__DATA__
a ship
1 name
b school
c book
2 blah
e blah
3 ...
6 ...
Why not just use awk:
awk '{if (/^[0-9]+[[:blank:]]/) sum+=1.5} {print sum+0 }' file
Edited to simplify based on jaypal's answer, bound the number and work with tabs and spaces.
How about
perl -lane 'next unless $F[0]=~/^\d+$/; $c+=1.5; END{print $c}' file
Or
awk '$1~/^[0-9]+$/{c+=1.5}END{print c}' file
These only produce the final sum as your script would have done. If you want to show the numbers as they grow use:
perl -lane 'BEGIN{$c=0}$c+=1.5 if $F[0]=~/^\d+$/; print "$c"' file
Or
awk 'BEGIN{c=0}{if($1~/^[0-9]+$/){c+=1.5}{print c}}' file
I'm not sure if you're multiplying the first field by 1.5 or adding 1.5 to a sum every time there's any number in $1 and ignoring the contents of the line otherwise. Here's both in awk, using your sample data as the contents of "file."
$ awk '$1~/^[0-9]+$/{val=$1*1.5}{print val+0}' file
0
1.5
1.5
1.5
3
3
4.5
9
$ awk '$1~/^[0-9]+$/{sum+=1.5}{print sum+0}' file
0
1.5
1.5
1.5
3
3
4.5
6
Or, here you go in ksh (or bash if you have a newer bash that can do floating point math), assuming the data is on STDIN
#!/usr/bin/ksh
sum=0
while read a b
do
[[ "$a" == +([0-9]) ]] && (( sum += 1.5 ))
print $sum
done

For loop and value changing using awk

I have the following file format
...
MODE P E
IMP:P 1 19r 0
IMP:E 1 19r 0
...
SDEF POS= 0 0 14.6 AXS= 0 0 1 EXT=d3 RAD= d4 cell=23 ERG=d1 PAR=2
SI1 L 0.020
SP1 1
SI4 0. 3.401
SI3 0.9
...
NPS 20000000
What I am trying to do is to locate a specific value(in particular the value after the sequence SI1 L) and create a series of files with different values. For instance ST1 L 0.020--->ST1 L 0.050. What I have in mind is to give a start value, an end value and a step so as to generate files with different values after the sequence SI1 L. For instance a for loop would work, but I don't know how to use it outside awk.
I am able to locate the value using
awk '$1=="SI1" {printf "%12s\n", $3}' file
I could also use the following to replace the value
awk '$1=="SI1" {gsub(/0.020/, "0.050"); printf "%12s\n", $3}' file
The thing is that the value won't always be 0.020. That's why I need a way to replace the value after the sequence SI1 L and this replacement should be done for many values.
How can this be acheived?
You can try:
awk -vval="0.05" '$1=="SI1"{$3=val}1' file
This will replace SI1 L 0.020 by SI1 L 0.05 in the input file.
Then use a bash script to call the awk program in a for loop..
For instance:
#! /bin/bash
vals=(0.02 0.03 0.04 0.05)
i=0
for val in "${vals[#]}"; do
i=$(($i+1))
awk -vval="$val" '$1=="SI1"{$3=val}1' file > "file${i}"
done
If your system has seq command, here is easier script for you.
for val in $(seq 0.02 0.01 0.05)
do
awk -vval="$val" '/SI1 L/{$3=val}1' file > "${val}"
# or Using sed
# sed: sed "s/SI1 L .*/SI1 L $val/" > "${val}"
done

reset row number count in awk

I have a file like this
file.txt
0 1 a
1 1 b
2 1 d
3 1 d
4 2 g
5 2 a
6 3 b
7 3 d
8 4 d
9 5 g
10 5 g
.
.
.
I want reset row number count to 0 in first column $1 whenever value of field in second column $2 changes, using awk or bash script.
result
0 1 a
1 1 b
2 1 d
3 1 d
0 2 g
1 2 a
0 3 b
1 3 d
0 4 d
0 5 g
1 5 g
.
.
.
As long as you don't mind a bit of excess memory usage, and the second column is sorted, I think this is the most fun:
awk '{$1=a[$2]+++0;print}' input.txt
This awk one-liner seems to work for me:
[ghoti#pc ~]$ awk 'prev!=$2{first=0;prev=$2} {$1=first;first++} 1' input.txt
0 1 a
1 1 b
2 1 d
3 1 d
0 2 g
1 2 a
0 3 b
1 3 d
0 4 d
0 5 g
1 5 g
Let's break apart the script and see what it does.
prev!=$2 {first=0;prev=$2} -- This is what resets your counter. Since the initial state of prev is empty, we reset on the first line of input, which is fine.
{$1=first;first++} -- For every line, set the first field, then increment variable we're using to set the first field.
1 -- this is awk short-hand for "print the line". It's really a condition that always evaluates to "true", and when a condition/statement pair is missing a statement, the statement defaults to "print".
Pretty basic, really.
The one catch of course is that when you change the value of any field in awk, it rewrites the line using whatever field separators are set, which by default is just a space. If you want to adjust this, you can set your OFS variable:
[ghoti#pc ~]$ awk -vOFS=" " 'p!=$2{f=0;p=$2}{$1=f;f++}1' input.txt | head -2
0 1 a
1 1 b
Salt to taste.
A pure bash solution :
file="/PATH/TO/YOUR/OWN/INPUT/FILE"
count=0
old_trigger=0
while read a b c; do
if ((b == old_trigger)); then
echo "$((count++)) $b $c"
else
count=0
echo "$((count++)) $b $c"
old_trigger=$b
fi
done < "$file"
This solution (IMHO) have the advantage of using a readable algorithm. I like what's other guys gives as answers, but that's not that comprehensive for beginners.
NOTE:
((...)) is an arithmetic command, which returns an exit status of 0 if the expression is nonzero, or 1 if the expression is zero. Also used as a synonym for let, if side effects (assignments) are needed. See http://mywiki.wooledge.org/ArithmeticExpression
Perl solution:
perl -naE '
$dec = $F[0] if defined $old and $F[1] != $old;
$F[0] -= $dec;
$old = $F[1];
say join "\t", #F[0,1,2];'
$dec is subtracted from the first column each time. When the second column changes (its previous value is stored in $old), $dec increases to set the first column to zero again. The defined condition is needed for the first line to work.

Resources