Using awk with Operations on Variables - bash

I'm trying to write a Bash script that reads files with several columns of data, multiplies each value in the second column by the corresponding value in the third column, and adds all of those products together.
For example if the file looked like this:
Column 1 Column 2 Column 3 Column 4
genome 1 30 500
genome 2 27 500
genome 3 83 500
...
The script should multiply 1*30 to give 30, then 2*27 to give 54 (adding that to 30), then 3*83 to give 249 (adding that to 84), etc.
I've been trying to use awk to parse the input file but am unsure of how to get the operation to proceed line by line. Right now it stops after the first line is read and the operations on the variables are performed.
Here's what I've written so far:
for file in fileone filetwo
do
set -- $(awk '/genome/ {print $2,$3}' $file.hist)
var1=$1
var2=$2
var3=$((var1*var2))
total=$((total+var3))
echo var1 \= $var1
echo var2 \= $var2
echo var3 \= $var3
echo total \= $total
done
I tried placing a "while read" loop around everything but could not get the variables to update with each line. I think I'm going about this the wrong way!
I'm very new to Linux and Bash scripting so any help would be greatly appreciated!

That's because awk reads the entire file and runs its program on each line. So the output you get from awk '/genome/ {print $2,$3}' $file.hist will look like
1 30
2 27
3 83
and so on, which means in the bash script, the set command makes the following variable assignments:
$1 = 1
$2 = 30
$3 = 2
$4 = 27
$5 = 3
$6 = 83
etc. But you only use $1 and $2 in your script, meaning that the rest of the file's contents - everything after the first line - is discarded.
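You can reproduce the flattening directly in an interactive shell; the unquoted command substitution is word-split on whitespace (including the newlines), so the three lines become six positional parameters:

```shell
# Simulate awk's multi-line output and hand it to `set --`
set -- $(printf '1 30\n2 27\n3 83\n')
echo "$#"        # 6 parameters, not 3 lines
echo "$1 $2 $3"  # 1 30 2
```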
Honestly, unless you're doing this just to learn how to use bash, I'd say just do it in awk. Since awk automatically runs over every line in the file, it'll be easy to multiply columns 2 and 3 and keep a running total.
awk '{ total += $2 * $3 } ENDFILE { print total; total = 0 }' fileone filetwo
Here ENDFILE is a special pattern (a GNU awk extension) that means "run this block at the end of each file, rather than on each line."
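If your awk lacks ENDFILE, a portable sketch of the same idea uses FNR, which resets to 1 at each new input file:

```shell
# Print the running total whenever a new file starts (FNR==1 after
# the very first record), and once more at the end for the last file.
awk 'FNR == 1 && NR > 1 { print total; total = 0 }
     { total += $2 * $3 }
     END { print total }' fileone filetwo
```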
If you are doing this for educational purposes, let me say this: the only thing you need to know about doing arithmetic in bash is that you should never do arithmetic in bash :-P Seriously though, when you want to manipulate numbers, bash is one of the least well-adapted tools for that job. But if you really want to know, I can edit this to include some information on how you could do this task primarily in bash.

I agree that awk is in general better suited for this kind of work, but if you are curious what a pure bash implementation would look like:
for f in file1 file2; do
    total=0
    # the "_" placeholders discard column 1 and everything after column 3
    while read -r _ x y _; do
        (( total += x * y ))
    done < "$f"
    echo "$total"
done

Related

How to put a line from a file into a table (variable)

I have the following file
Durand 12 9 14
Lucas 8 11 4
Martin 9 12 1
I need to display the name and the average of the three other numbers, using a function. The function part is easy.
I thought I could get line by line with:
head -i notes | tail -1
and then put the result of the command in a table in order to access it
table=(head -i notes | tail -1)
echo "${table[0]} averge : moy ${table[1]} ${table[2]} ${table[3]}"
You might use three important concepts to approach a problem like this.
Iterate over a file
Store values as variables
Do math to variables
A good way to read a file line by line is with a while loop:
while read line; do echo $line; done < notes
Notice how we use a file redirect < to treat the file as standard input. read consumes one full line at a time. Let's expand on that in order to store separate variables.
while read name a b c; do echo $name $a $b $c; done < notes
Now let's get math involved. You could use an external program like bc, but that's inefficient if we don't need floating point math (decimals). Bash has math built in!
while read name a b c; do echo $name $(( (a + b + c) / 3 )); done < notes
Like you said, the function part is easy :)
awk one liner:
awk '{print $1, ($2+$3+$4)/3}' notes

Sorting on multiple columns w/ an output file per key

I'm uncertain as to how I can use the until loop inside a while loop.
I have an input file of 500,000 lines that look like this:
9 1 1 0.6132E+02
9 2 1 0.6314E+02
10 3 1 0.5874E+02
10 4 1 0.5266E+02
10 5 1 0.5571E+02
1 6 1 0.5004E+02
1 7 1 0.5450E+02
2 8 1 0.5696E+02
11 9 1 0.6369E+02
.....
And what I'm hoping to achieve is to sort the numbers in the first column in numerical order, so that I can pull all the lines that share a first-column value into new text files "cluster${i}.txt". From there I want to sort the fourth column of each "cluster${i}.txt" file in numerical order. After sorting I would like to write the first row of each sorted "cluster${i}.txt" file into a single output file. A sample "cluster1.txt" would look like this:
1 6 1 0.5004E+02
1 7 1 0.5450E+02
1 11 1 0.6777E+02
....
as well as an output.txt file that would look like this:
1 6 1 0.5004E+02
2 487 1 0.3495E+02
3 34 1 0.0344E+02
....
Here is what I've written:
#!/bin/bash
input='input.txt'
i=1
sort -nk 1 $input > 'temp.txt'
while read line; do
awk -v var="$i" '$1 == var' temp.txt > "cluster${i}.txt"
until [[$i -lt 20]]; do
i=$((i+1))
done
done
for f in *.txt; do
sort -nk 4 > temp2.txt
head -1 temp2.txt
rm temp2.txt
done > output.txt
This only takes one line, if your sort -n knows how to handle exponential notation:
sort -nk 1,4 <in.txt | awk '{ of="cluster" $1 ".txt"; print $0 >>of }'
...or, to also write the first line for each index to output.txt:
sort -nk 1,4 <in.txt | awk '
{
if($1 != last) {
print $0 >"output.txt"
last=$1
}
of="cluster" $1 ".txt";
print $0 >of
}'
Consider using an awk implementation -- such as GNU awk -- which will cache file descriptors, rather than reopening each output file for every append; this will greatly improve performance.
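With an awk that doesn't cache descriptors, one workaround (a sketch, relying on the sorted input keeping each key's lines contiguous) is to close each cluster file as soon as its block is done, so at most two files are ever open:

```shell
sort -nk 1,4 <in.txt | awk '
    $1 != last {
        if (last != "") close("cluster" last ".txt")
        print > "output.txt"
        last = $1
    }
    { print > ("cluster" $1 ".txt") }'
```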
By the way, let's look at what was wrong with the original script:
It was slow. Really, really slow.
Starting a new instance of awk 20 times for every line of input (because the whole point of while read is to iterate over individual lines, so putting an awk inside a while read is going to run awk at least once per line) is going to have a very appreciable impact on performance. Not that it was actually doing this, because...
The while read line outer loop was reading from stdin, not temp.txt or input.txt.
Thus, the script was hanging if stdin didn't have anything written on it, or wasn't executing the contents of the loop at all if stdin pointed to a source with no content like /dev/null.
The inner loop wasn't actually processing the line read by the outer loop. line was being read, but all of temp.txt was being operated on.
The awk wasn't actually inside the inner loop, but rather was inside the outer loop, just before the inner loop. Consequently, it wasn't being run 20 times with different values for i, but run only once per line read, with whichever value for i was left over from previously executed code.
Whitespace is important to how commands are parsed. [[foo]] is wrong; it needs to be [[ foo ]].
To "fix" the inner loop, to do what I imagine you meant to write, might look like this:
# this is slow and awful, but at least it'll work.
while IFS= read -r line; do
i=0
until [[ $i -ge 20 ]]; do
awk -v var="$i" '$1 == var' <<<"$line" >>"cluster${i}.txt"
i=$((i+1))
done
done <temp.txt
...or, somewhat better (but still not as good as the solution suggested at the top):
# this is somewhat less awful.
for (( i=0; i<=20; i++ )); do
awk -v var="$i" '$1 == var' <temp.txt >"cluster${i}.txt"
head -n 1 "cluster${i}.txt"
done >output.txt
Note how the redirection to output.txt is done just once, for the whole loop -- this means we're only opening the file once.

Problems for Extracting data from multiple files with awk

If I have 300 files:
f_1.dat, f_2.dat,......, f_300.dat
For example, the file f_114.dat has the structure like
atom(index):
114
Ave: cosp1 cosp2 cosp3:
-0.74 -0.54 -0.37
...
I want to extract the 2nd line (e.g., the number 114) and the 4th line (the three numbers) from certain files (these files are f_114.dat, f_182.dat, ..., f_249.dat) among the 300 files and merge them into the same file, e.g.:
114 -0.74 -0.54 -0.37
182 -0.72 -0.59 -0.37
…
I tried with for structure, the command is
for i in `114,182,251,131,183,257,140,191,31,148,192,48,151,195,51,92,177,249`; do awk -v num=$i 'NR=2&&NR=4{print $0}' f_$i.dat > $i.dat; done
But I get a syntax error.
Could you give a solution for the problem ?
I think you want something like:
for i in {114,182,251,131,183,257,140,191,31,148,192,48,151,195,51,92,177,249} ; do
awk 'BEGIN {ORS=" "} NR==2 || NR==4' f_$i.dat
echo ""
done > merged-file.dat
There were a few problems with your for-loop
do was missing from the for-loop;
for-loops are structured like for i in $LIST ; do ... ; done
the $LIST part was invalid
-v num=$i was not needed as the num variable wasn't even used in the awk expression
you wanted to merge the output in the same file, so putting > $i.dat in the for-loop wouldn't work as they output to separate files;
you need to place > $OUTFILE.dat outside the for-loop or
append (>>) the output of each iteration of the for-loop to the same file, with something like: for i in ... ; do ... >> $OUTFILE.dat ; done
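Since FNR restarts at 1 for every input file (while NR keeps counting across files), the shell loop can also be dropped entirely; a sketch, with two of the file names standing in for the full list:

```shell
# Line 2 of each file is the index, line 4 is the three numbers;
# print them joined on one output line per file.
awk 'FNR == 2 { printf "%s", $0 }
     FNR == 4 { print " " $0 }' f_114.dat f_182.dat > merged-file.dat
```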

Bash For loop - multiple variables, not using arrays?

I have run into an issue that seems like it should have an easy answer, but I keep hitting walls.
I'm trying to create a directory structure that contains files that are named via two different variables. For example:
101_2465
203_9746
526_2098
I am looking for something that would look something like this:
for NUM1 in 101 203 526 && NUM2 in 2465 9746 2098
do
mkdir $NUM1_$NUM2
done
I thought about just setting the values of NUM1 and NUM2 into arrays, but it overcomplicated the script -- I have to keep each line of code as simple as possible, as it is being used by people who don't know much about coding. They are already familiar with a for loop set up using the example above (but only using 1 variable), so I'm trying to keep it as close to that as possible.
Thanks in advance!
while read NUM1 NUM2; do
mkdir ${NUM1}_$NUM2
done << END
101 2465
203 9746
526 2098
END
Note that underscore is a valid variable name character, so you need the braces to disambiguate the name NUM1 from the underscore; without them, bash would look for a variable named NUM1_.
...setting the values of NUM1 and NUM2 into arrays, but it overcomplicated the script...
No, no: anything else will be more complicated than arrays.
NUM1=( 101 203 526 )
NUM2=( 2465 9746 2098 )
# ${#NUM1[@]} is the element count; plain ${#NUM1} would be the
# string length of the first element
for (( i=0; i<${#NUM1[@]}; i++ )); do
    echo "${NUM1[$i]}_${NUM2[$i]}"
done
One way is to separate the entries in your two variables by newlines, and then use paste to get them together:
a='101 203 526'
b='2465 9746 2098'
# Convert space-separated lists into newline-separated lists
a="$(echo $a | sed 's/ /\n/g')"
b="$(echo $b | sed 's/ /\n/g')"
# Acquire newline-separated list of tab-separated pairs
pairs="$(paste <(echo "$a") <(echo "$b"))"
# Loop over lines in $pairs
IFS='
'
for p in $pairs; do
echo "$p" | awk '{print $1 "_" $2}'
done
Output:
101_2465
203_9746
526_2098
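For what it's worth, the conversion and joining steps can be collapsed; a sketch of the same idea using tr and bash process substitution, with the same two lists:

```shell
a='101 203 526'
b='2465 9746 2098'
# Turn each list into one item per line, then join pairwise with "_"
paste -d_ <(tr ' ' '\n' <<<"$a") <(tr ' ' '\n' <<<"$b")
```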

Bash: Sum fields of a line

I have a file with the following format:
a 1 2 3 4
b 7 8
c 120
I want it to be parsed into:
a 10
b 15
c 120
I know this can be easily done with awk, but I'm not familiar with the syntax and can't get it to work for me.
Thanks for any help
ok simple awk primer:
awk '{ for (i=2;i<=NF;i++) { total+=$i }; print $1,total; total=0 }' file
NF is an internal variable that is reset on each line and is equal to the number of fields on that line so
for (i=2;i<=NF;i++) starts a for loop that runs i from 2 up to NF
total+=$i adds the value of the i'th field to the variable total; this happens once per iteration of the loop above.
print $1,total prints the 1st field followed by the contents of OFS variable (space by default) then the total for that line.
total=0 resets the totals var ready for the next iteration.
all of the above is done on each line of input.
For more info see grymoires intro here
Start from column two and add them:
awk '{tot=0; for(i=2;i<=NF;i++) tot+=$i; print $1, tot}' file
A pure bash solution:
$ while read f1 f2
> do
> echo $f1 $((${f2// /+}))
> done < file
On running it, got:
a 10
b 15
c 120
The first field is read into variable f1 and the rest of the fields into f2. In f2, the spaces are replaced with + and the resulting expression is evaluated arithmetically.
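The substitution and arithmetic expansion can be seen in isolation:

```shell
f2='1 2 3 4'
echo "${f2// /+}"        # replace every space with "+": 1+2+3+4
echo "$(( ${f2// /+} ))" # evaluate the expression: 10
```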
Here's a tricky way to use a subshell, positional parameters and IFS. Works with various amounts of whitespace between the fields.
while read label numbers; do
echo $label $(set -- $numbers; IFS=+; bc <<< "$*")
done < filename
This works because the shell expands "$*" into a single string of the positional parameters joined by the first char of $IFS (documentation)
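A minimal demonstration of that join:

```shell
# Run in a subshell so the IFS change doesn't leak out
out=$(set -- 10 15 120; IFS=+; echo "$*")
echo "$out"   # 10+15+120
```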
