Bash: arithmetic addressed by line number and column - bash

I have normally done this with Excel, but as I am trying to learn bash, I'd like to ask for advice here on how to do so. My input file resembles:
# s0 legend "1001"
# s1 legend "1002"
#target G0.S0
#type xy
2.0 -1052.7396157664
2.5 -1052.7330560932
3.0 -1052.7540013664
3.5 -1052.7780321236
4.0 -1052.7948229060
4.5 -1052.8081313831
5.0 -1052.8190310613
&
#target G0.S1
#type xy
2.0 -1052.5384564253
2.5 -1052.7040374678
3.0 -1052.7542803612
3.5 -1052.7781686744
4.0 -1052.7948927247
4.5 -1052.8081704241
5.0 -1052.8190543049
&
where the above only shows two data sets: s0 and s1. In reality I have 17 data sets and will combine them arbitrarily. By combine, I mean I would like to:
For two data sets, extract the second column of each separately.
Subtract these two columns row by row.
Multiply the difference by a constant, $C.
Note: $C multiplies very small numbers and the only way I could get it to not divide by zero was to take a massive scale.
Edit: After requests, I was apparently not entirely clear what I was going for. Take for example:
set0
2 x
3 y
4 z
set1
2 r
3 s
4 t
I also have defined a constant C.
I would like to perform the following operation:
C*(r - x)
C*(s - y)
C*(t - z)
I will be doing this for sets > 1, up to 16, for example (set 10) minus (set 0). Therefore, I need the flexibility to target a value based on its line number and column number, and preferably acting over a range of line numbers to make it efficient.
So far this works:
C=$(echo "scale=45;x=(small numbers)*(small numbers); x" | bc -l)
sed -n '5,11p' input.in | cut -c 5-20 > tmp1.in
sed -n '15,21p' input.in | cut -c 5-20 > tmp2.in
pr -m -t -s tmp1.in tmp2.in > tmp3.in
awk '{printf $2-$1 "\n"}' tmp3.in > tmp4.in
but the multiplication failed:
awk '{printf "%11.2f\n", "$C"*$1 }' tmp4.in > tmp5.in
returning:
0.00
0.00
0.00
0.00
0.00
0.00
0.00
I have a feeling the whole thing can be accomplished more elegantly with awk. I also tried this:
for (( i=0; i<=6; i++ ))
do
n=5+$i
m=10+n
awk 'NR==n{a=$2};NR==m{b=$2} {printf "%d\n", $b-$a}' input.in > temp.in
done
but all I get in temp.in is a long column of 0s.
I also tried
awk 'NR==5,NR==11{a=$2};NR==15,NR==21{b=$2} {printf "%d\n", $b-$a}' input.in > temp.in
but got the error
awk: (FILENAME=input.in FNR=20) fatal: attempt to access field -1052
Any idea how to formulate this with awk, and if that doesn't work, then why I cannot multiply with awk above? Thank you!

this does the math in one go
$ awk -v c=1 '/^&/ {s++}
s==1 {a[$1]=$2}
s==3 {print $1,a[$1],$2,c*(a[$1]-$2)}
/#type/ {s++}' file
2.0 -1052.7396157664 -1052.5384564253 -0.201159
2.5 -1052.7330560932 -1052.7040374678 -0.0290186
3.0 -1052.7540013664 -1052.7542803612 0.000278995
3.5 -1052.7780321236 -1052.7781686744 0.000136551
4.0 -1052.7948229060 -1052.7948927247 6.98187e-05
4.5 -1052.8081313831 -1052.8081704241 3.9041e-05
5.0 -1052.8190310613 -1052.8190543049 2.32436e-05
you can remove the decorations and add print formatting easily. The magic numbers 1=g1 and 3=2*g2-1 correspond to data groups 1 and 2 as the order presented in the data file, can be converted to awk variables as well.
The counter s keeps track of whether you're in a set or not, Odd numbers correspond to sets and even numbers between sets. The increment is done both at the start pattern and end pattern. The order of increment statements were set in such a way they, they are not printed following the pattern (unset first, print set values, reset last}. You can change the order and observe the effects.

This might be what you're looking for:
$ cat tst.awk
/^[#&]/ { lineNr=0; next }
{
++lineNr
if (lineNr in prev) {
print $1, c * ($2 - prev[lineNr])
}
prev[lineNr] = $2
}
$ awk -v c=100000 -f tst.awk file
2.0 20115.9
2.5 2901.86
3.0 -27.8995
3.5 -13.6551
4.0 -6.98187
4.5 -3.9041
5.0 -2.32436

In your first try, you should replace that line:
awk '{printf "%11.2f\n", "$C"*$1 }' tmp4.in > tmp5.in
with that one:
awk -v C=$C '{printf "%11.2f\n", C*$1 }' tmp4.in > tmp5.in
You are mixing notations of bash shell with notation with awk.
in shell you define variable without $, and you use them with $.
Here you are in awk script, there is no $ to use variables. Yet there are some special variables : $1 $2 ...
You have put single quote ' around your awk script, so the shell variables cant be used. I mean you have written $C, but the shell can not see it inside single-quote. That is why you have to write awk -v C=$C so that the shell variable $C is transferred to an awk variable called C.
In your other tries with awk, we can see such errors also. Now I think you'll make it.

Related

AWK: subset randomly and without replacement a string in every row of a file

So I need to subset 10 characters from all strings in a particular column of a file, randomly and without repetition (i.e. I want to avoid drawing a character from any given index more than once).
For the sake of simplicity, let's say I have the following string:
ABCDEFGHIJKLMN
For which I should obtain, for example, this result:
DAKLFCHGBI
Notice that no letter occurs twice, which means that no position is extracted more than once.
For this other string:
CCCCCCCCCCCCGG
Analogously, I should never find more than two "G" characters in the output (otherwise it would mean that a "G" character has been sampled more than once), e.g.:
CCGCCCCCCC
Or, in other words, I want to shuffle all characters from each string, and keep the first 10. This can be easily achieved in bash using:
echo "ABCDEFGHIJKLMN" | fold -w1 | shuf -n10 | tr -d '\n'
However, since I need to perform this many times on dozens of files with over a hundred thousand lines each, this is way too slow. So looking around, I've arrived at the following awk code, which seems to work fine whenever the strings are passed to it one by one, e.g.:
awk '{srand(); len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} print ""}' <(echo "ABCDEFGHIJKLMN")
But when I input the following file with a string on each row, awk hangs and the output gets truncated on the second line:
echo "ABCDEFGHIJKLMN" > file.txt
echo "CCCCCCCCCCCCGG" >> file.txt
awk '{srand(); len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} print ""}' file.txt
This other version of the code which samples characters from the string with repetition works fine, so it looks like the issue lies in the part which populates the N array, but I'm not proficient in awk so I'm a bit stuck:
awk '{srand(); len=length($1); for(i=1;i<=10;i++) {k=int(rand()*len)+1; printf "%s", substr($1,k,1)} print ""}'
Anyone can help?
In case this matters: my actual file is more complex than the examples provided here, with several other columns, and unlike the ones in this example, its strings may have different lengths.
Thanks in advance for your time :)
EDIT:
As mentioned in the comments, I managed to make it work by removing the N array (so that it resets before processing each row):
awk 'BEGIN{srand()} {len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} split("", N); print ""}' file.txt
Do note however that if the string in $1 is shorter than 10, this will get stuck in an infinite loop, so make sure that all strings are always longer than the subset target size. The alternative solution provided by Andre Wildberg in the comments doesn't carry this issue.
I would harness GNU AWK for this task following way, let file.txt content be
ABCDEFGHIJKLMN
CCCCCCCCCCCCGG
then
awk 'function comp_func(i1, v1, i2, v2){return rand()-0.5}BEGIN{FPAT=".";PROCINFO["sorted_in"]="comp_func"}{s="";patsplit($0,arr);for(i in arr){s = s arr[i]};print substr(s,1,10)}' file.txt
might give output
NGLHCKEIMJ
CCCCCCCCGG
Explanation: I use custom Array Traversal Control function which does randomly decides which element should be considered greater. -0.5 is used as rand() gives values from 0 to 1. For each line array arr is populated by characters of line, then traversed in random order to create s string which are characters shuffled, then substr used to get first 10 characters. You might elect to add counter which will terminate for loop if you have very long lines in comparison to number of characters to select.
(tested in GNU Awk 5.0.1)
Iteratively construct a substring of the remaining letters.
Tested with
awk version 20121220
GNU Awk 4.2.1, API: 2.0
GNU Awk 5.2.1, API 3.2
mawk 1.3.4 20200120
% awk -v size=10 'BEGIN{srand()} {n=length($0); a=$0; x=0;
for(i=1; i<=n; i++){x++; na=length(a); rnd = int(rand() * na + 1)
printf("%s", substr(a, rnd, 1))
a=substr(a, 1, rnd - 1)""substr(a, rnd + 1, na)
if(x >= size){break}}
print ""}' file.txt
CJFMKHNDLA
CGCCCCCCCC
In consecutive iterative runs remember to check if srand works the way you expect in your version of awk. If in doubt use $RANDOM or, better, /dev/urandom.
if u don't need to be strictly within awk, then jot makes it super easy :
say you want 20 random characters between
"A" (ascii 65) and "N" (ascii 78), inc. repeats of same chars
jot -s '' -c -r 20 65 78
ANNKECLDMLMNCLGDIGNL

how to round the output in shell?

For our webshop we get from the manufacturers a csv file (automatically updated) with product data.
Some manufacturers use prices without Tax and some within.
I want to change prices with a shell script to add 21% TAX and round it to nearest .95 or .50
For example I get a sheet:
sku|ean|name|type|price_excl_vat|price
EU-123|123123123123|Product name|simple|24.9900
I use this code:
sed -i "1 s/price/price_excl_vat/" inputfile
awk '{FS="|"; OFS="|"; if (NR<=1) {print $0 "|price"} else {print $0 "|" $5*1.21}}' inputfile > outputfile
the output is:
sku|ean|name|type|price_excl_vat|price
EU-123|123123123123|Product name|simple|24.9900|30.2379
How do I round it to the correct price like below ?
sku|ean|name|type|price_excl_vat|price
EU-123|123123123123|Product name|simple|24.9900|29.95
awk to the rescue!
awk 'BEGIN {FS=OFS="|"}
$NF==$NF+0 {a=$NF*1.21;
r=a-int(a);
if (r<0.225) a=a-r-0.05;
else if (r<0.725) a=a-r+0.50;
else a=a-r+0.95;
$(NF+1)=a} 1'
note that in your example the nearest number for 30.2379 will be 30.50 Perhaps you want to round down?
To round down instead of the nearest, and with a variable price column. The new computed value will be appended to the end of the row.
awk 'BEGIN {FS=OFS="|"; k=5}
$k==$k+0 {a=$k*1.21;
r=a-int(a);
if (r<0.50) a=a-r-0.05;
else if (r<0.95) a=a-r+0.50;
else a=a-r+0.95;
$(NF+1)=a} 1'
awk '#define field separator in and out
BEGIN{FS=OFS="|"}
# add/modify a 6th field for price label if missing on header only
NR==1 && NF == 6 { $6 = "price"; print; next}
NR==1 && NF == 5 { $6 = "price"; print; next}
# add price with tva rounded to 0.01 if missing
NF == 5 { $6 = int( $5 * 121 ) / 100 }
# print the line (modified or not, ex empty lines) [7 is just a *not 0*)
7
' inputfile \
> outputfile
self documented
not sure about your sed for header becasue sample show already a header with price so take the one you want
Not knowing what you're program looks like, it makes it difficult to give you more information.
However, both awk and bash have the printf command. This command can be used for rounding floating point numbers. (Yes, Bash is integer arithmetic, but it can pretend a number is a decimal number).
I gave you the link for the C printf command because the one for Bash doesn't include the formatting codes. Read it and weep because the documentation is a bit dense, and if you've never used printf before, it can be quite difficult to understand. Fortunately, an example will bring things to light:
$ foo="23.42532"
$ printf "%2.2f\n", $foo
$ 23.43 #All rounded for you!
The f means it's a floating point number. The % tells you that this is the beginning of a formatting sequence. The 2.2 means you want 2 digits on the left side of the decimal and two digits on the right. If you said %4.2f, it would make sure there's enough room for four digits on the left side of the decimal, and left pad the number with spaces. The \n on the end is the New Line character.
Fortunately, although printf can be hard to understand at first, it's pretty much the same in almost all programming languages. It's in awk, Perl, Python, C, Java, and many more languages. And, if the information you need isn't in printf, try the documentation on sprintf which is like printf, but prints the formatted text into a string.
The best documentation I've seen is in the Perl sprintf documentation because it gives you plenty of examples.

Count and sum up specific decimal number (bash,awk,perl)

I have a tab delimited file and I want to sum up certain decimal number to the output (1.5) each time its find number instead of character to the first column and print out the results for all the rows from the first to the last.
I have example file which look like this:
It has 8 rows
1st-column 2nd-Column
a ship
1 name
b school
c book
2 blah
e blah
3 ...
9 ...
Now I want my script to read line by line and if it finds number sum up 1.5 and give me output just for first column like this:
0
1.5
1.5
1.5
3
3
4.5
6
my script is:
#!/bin/bash
for c in {1..8}
do
awk 'NR==$c { if (/^[0-9]/) sum+=1.5} END {print sum }' file
done
but I don't get any output!
Thanks for your help in advance.
The last item in your expected output appears to be incorrect. If it is, then you can do:
$ awk '$1~/^[[:digit:]]+$/{sum+=1.5}{print sum+0}' file
0
1.5
1.5
1.5
3
3
4.5
6
use warnings;
use strict;
my $sum = 0;
while (<DATA>) {
my $data = (split)[0]; # 1st column
$sum += 1.5 if $data =~ /^\d+$/;
print "$sum\n";
}
__DATA__
a ship
1 name
b school
c book
2 blah
e blah
3 ...
6 ...
Why not just use awk:
awk '{if (/^[0-9]+[[:blank:]]/) sum+=1.5} {print sum+0 }' file
Edited to simplify based on jaypal's answer, bound the number and work with tabs and spaces.
How about
perl -lane 'next unless $F[0]=~/^\d+$/; $c+=1.5; END{print $c}' file
Or
awk '$1~/^[0-9]+$/{c+=1.5}END{print c}' file
These only produce the final sum as your script would have done. If you want to show the numbers as they grow use:
perl -lane 'BEGIN{$c=0}$c+=1.5 if $F[0]=~/^\d+$/; print "$c"' file
Or
awk 'BEGIN{c=0}{if($1~/^[0-9]+$/){c+=1.5}{print c}}' file
I'm not sure if you're multiplying the first field by 1.5 or adding 1.5 to a sum every time there's any number in $1 and ignoring the contents of the line otherwise. Here's both in awk, using your sample data as the contents of "file."
$ awk '$1~/^[0-9]+$/{val=$1*1.5}{print val+0}' file
0
1.5
1.5
1.5
3
3
4.5
9
$ awk '$1~/^[0-9]+$/{sum+=1.5}{print sum+0}' file
0
1.5
1.5
1.5
3
3
4.5
6
Or, here you go in ksh (or bash if you have a newer bash that can do floating point math), assuming the data is on STDIN
#!/usr/bin/ksh
sum=0
while read a b
do
[[ "$a" == +([0-9]) ]] && (( sum += 1.5 ))
print $sum
done

AWK : Why does the for-loop take the second argument first?

Can someone explain me this behavior ?
$ echo "A_B_C_D" | awk '{split($0,a,"_"); for (i in a) {print i,a[i]}}'
2 B
3 C
4 D
1 A
Same with
$ awk '{split("A_B_C_D",a,"_"); for (i in a) {print i,a[i]}}' empty
2 B
3 C
4 D
1 A
Where empty is a file with one line.
However, this works :
$ echo "A_B_C_D" | awk '{n=split($0,a,"_"); for (i=1;i<=n;i++) {print i,a[i]}}'
1 A
2 B
3 C
4 D
Thanks
man awk and look up the in operator. If you want to control the output order using the in operator, you can do so with GNU awk by populating PROCINFO["sorted_in"]. See http://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Array-Traversal for details.
Thanks to #fedorqui, the answer was already asked here:
And here is the best answer which satisfies my curiosity :
From 8. Arrays in awk --> 8.5 Scanning All Elements of an Array in the
GNU Awk user's guide (gawk 3.1.0) when referring to the for (value in array)
syntax:
The order in which elements of the array are accessed by this statement
is determined by the internal arrangement of the array
elements within awk and cannot be controlled or changed. This can lead
to problems if new elements are added to array by statements in the
loop body; it is not predictable whether or not the for loop will
reach them. Similarly, changing var inside the loop may produce
strange results. It is best to avoid such things.
It's like python dictionary when looping over dict.keys(), isn't it ?

Using awk with Operations on Variables

I'm trying to write a Bash script that reads files with several columns of data and multiplies each value in the second column by each value in the third column, adding the results of all those multiplications together.
For example if the file looked like this:
Column 1 Column 2 Column 3 Column 4
genome 1 30 500
genome 2 27 500
genome 3 83 500
...
The script should multiply 1*30 to give 30, then 2*27 to give 54 (and add that to 30), then 3*83 to give 249 (and add that to 84) etc..
I've been trying to use awk to parse the input file but am unsure of how to get the operation to proceed line by line. Right now it stops after the first line is read and the operations on the variables are performed.
Here's what I've written so far:
for file in fileone filetwo
do
set -- $(awk '/genome/ {print $2,$3}' $file.hist)
var1=$1
var2=$2
var3=$((var1*var2))
total=$((total+var3))
echo var1 \= $var1
echo var2 \= $var2
echo var3 \= $var3
echo total \= $total
done
I tried placing a "while read" loop around everything but could not get the variables to update with each line. I think I'm going about this the wrong way!
I'm very new to Linux and Bash scripting so any help would be greatly appreciated!
That's because awk reads the entire file and runs its program on each line. So the output you get from awk '/genome/ {print $2,$3}' $file.hist will look like
1 30
2 27
3 83
and so on, which means in the bash script, the set command makes the following variable assignments:
$1 = 1
$2 = 30
$3 = 2
$4 = 27
$5 = 3
$6 = 83
etc. But you only use $1 and $2 in your script, meaning that the rest of the file's contents - everything after the first line - is discarded.
Honestly, unless you're doing this just to learn how to use bash, I'd say just do it in awk. Since awk automatically runs over every line in the file, it'll be easy to multiply columns 2 and 3 and keep a running total.
awk '{ total += $2 * $3 } ENDFILE { print total; total = 0 }' fileone filetwo
Here ENDFILE is a special address that means "run this next block at the end of each file, not at each line."
If you are doing this for educational purposes, let me say this: the only thing you need to know about doing arithmetic in bash is that you should never do arithmetic in bash :-P Seriously though, when you want to manipulate numbers, bash is one of the least well-adapted tools for that job. But if you really want to know, I can edit this to include some information on how you could do this task primarily in bash.
I agree that awk is in general better suited for this kind of work, but if you are curious what a pure bash implementation would look like:
for f in file1 file2; do
total=0
while read -r _ x y _; do
((total += x * y))
done < "$f"
echo "$total"
done

Resources