Awk error argument list too long, only two variables - bash

I am trying to sum two numbers of a table (in file) with awk (inside a loop), using variables passed from a bash script like described below. The numbers that I am dealing with ($value) are floating point values. $value only contains one number. Note that the line where the error occurs is 126, 125 is working fine.
111 while read line
112 do
120 n=2
121 sum=0
122
123 for x in $(seq 1 $number)
124 do
125 value=$(echo "$line" | awk -v n="$n" '{print $n}') # I am just getting the values to sum up here
126 sum=$(awk -v sum="$sum" -v value="$value" '{sum = sum + value; print sum}')
n=$((n+1))
129 done
done < $file
Where $number is defined previously.
I get the following error:
./script.sh: line 126: /bin/awk: Argument list too long
I am only trying to pass two variables on the awk command on this line, any idea why I am getting this error?
An example of the table in the "file":
A -0.717616 -0.623398 -0.214494 -0.352871
B -0.19373 -0.140626 -0.0523623 0.0248858
C -0.0822092 -0.302354 0.347158 -0.0373262
D 0.310213 0.312805 0.114366 0.353496
E -0.175354 -0.0263985 -0.125694 -0.155082
Thank you!

There are two problems with your call to awk: you haven't specified an input file, so it is reading from its inherited standard input (which is the file the while loop is also trying to read from), and it outputs each line it reads, not just the value of sum (fixed while I was typing this).
This is a very inefficient way to add up the numbers in the file, by the way, but here is a corrected version:
while read line
do
n=2
sum=0
for x in $(seq 1 $number)
do
value=$(echo "$line" | awk -v n="$n" '{print $n}')
sum=$(awk -v sum="$sum" -v value="$value" 'BEGIN {sum = sum + value; print sum}' </dev/null)
n=$((n+1))
done
done < $file
A better solution:
awk 'BEGIN {sum=0}
{for (i=2;i<=NF;i++) { sum = sum + $i }}
END {print sum}' < $file

Related

Awk sum up last n elements in for loop from csv file

I want to sum up the last n elements of a csv file that has only one row.
The expected result should be 110 see example csv file below.
Inside the awk function I casted the passed values from the bash script to int but it still fails.
The for loop should sum up each element, then the result should be assigned to the sum variable by the awk print statement
RWP="/foo/bar/rwp.csv"
DL=";"
function rrwpsh() {
local n="${1}"
local length=$(awk --field-separator=$DL '{ print NF; exit }' $RWP)
local lower_bound=$[$length-$n]
local sum=$(awk --field-separator=$DL -v lower_boundxxx=$lower_bound -v lengthxxx=$length 'BEGIN{lower_boundxxx_int=int(lower_boundxxx);lengthxxx_int=int(lengthxxx);for(j=lower_boundxxx_int;j<=lengthxxx_int;j++)x+=$j;print x}' $RWP)
echo "sum: " $sum
}
let n=3 # example value
result=$(rrwpsh $n)
echo -e "result " $result
the csv file contains numbers
20;1;2;100;7;3
Coming from stdin:
$ echo "20;1;2;100;7;3" | awk -F\; -v n=3 '{for(i=0;i<n;i++)s+=$(NF-i);print s}'
110
From file:
$ awk -F\; -v n=3 '{ # set delimiter and the n value
for(i=0;i<n;i++) # looping from 0..(n-1)
s+=$(NF-i) # sum up related fields
print s # output
}' file

Random line using sed

I want to select a random line with sed. I know shuf -n and sort -R | head -n does the job, but for shuf you have to install coreutils, and for the sort solution, it isn't optimal on large data :
Here is what I tested :
echo "$var" | shuf -n1
Which gives the optimal solution but I'm afraid for portability
that's why I want to try it with sed.
`var="Hi
i am a student
learning scripts"`
output:
i am a student
output:
hi
It must be Random.
It depends greatly on what you want your pseudo-random probability distribution to look like. (Don't try for random, be content with pseudo-random. If you do manage to generate a truly random value, go collect your nobel prize.) If you just want a uniform distribution (eg, each line has equal probability of being selected), then you'll need to know a priori how many lines of are in the file. Getting that distribution is not quite so easy as allowing the earlier lines in the file to be slightly more likely to be selected, and since that's easy, we'll do that. Assuming that the number of lines is less than 32769, you can simply do:
N=$(wc -l < input-file)
sed -n -e $((RANDOM % N + 1))p input-file
-- edit --
After thinking about it for a bit, I realize you don't need to know the number of lines, so you don't need to read the data twice. I haven't done a rigorous analysis, but I believe that the following gives a uniform distribution:
awk 'BEGIN{srand()} rand() < 1/NR { out=$0 } END { print out }' input-file
-- edit --
Ed Morton suggests in the comments that we should be able to invoke rand() only once. That seems like it ought to work, but doesn't seem to. Curious:
$ time for i in $(seq 400); do awk -v seed=$(( $(date +%s) + i)) 'BEGIN{srand(seed); r=rand()} r < 1/NR { out=$0 } END { print out}' input; done | awk '{a[$0]++} END { for (i in a) print i, a[i]}' | sort
1 205
2 64
3 37
4 21
5 9
6 9
7 9
8 46
real 0m1.862s
user 0m0.689s
sys 0m0.907s
$ time for i in $(seq 400); do awk -v seed=$(( $(date +%s) + i)) 'BEGIN{srand(seed)} rand() < 1/NR { out=$0 } END { print out}' input; done | awk '{a[$0]++} END { for (i in a) print i, a[i]}' | sort
1 55
2 60
3 37
4 50
5 57
6 45
7 50
8 46
real 0m1.924s
user 0m0.710s
sys 0m0.932s
var="Hi
i am a student
learning scripts"
mapfile -t array <<< "$var" # create array from $var
echo "${array[$RANDOM % (${#array}+1)]}"
echo "${array[$RANDOM % (${#array}+1)]}"
Output (e.g.):
learning scripts
i am a student
See: help mapfile
This seems to be the best solution for large input files:
awk -v seed="$RANDOM" -v max="$(wc -l < file)" 'BEGIN{srand(seed); n=int(rand()*max)+1} NR==n{print; exit}' file
as it uses standard UNIX tools, it's not restricted to files that are 32,769 lines long or less, it doesn't have any bias towards either end of the input, it'll produce different output even if called twice in 1 second, and it exits immediately after the target line is printed rather than continuing to the end of the input.
Update:
Having said the above, I have no explanation for why a script that calls rand() once per line and reads every line of input is about twice as fast as a script that calls rand() once and exits at the first matching line:
$ seq 100000 > file
$ time for i in $(seq 500); do
awk -v seed="$RANDOM" -v max="$(wc -l < file)" 'BEGIN{srand(seed); n=int(rand()*max)+1} NR==n{print; exit}' file;
done > o3
real 1m0.712s
user 0m8.062s
sys 0m9.340s
$ time for i in $(seq 500); do
awk -v seed="$RANDOM" 'BEGIN{srand(seed)} rand() < 1/NR{ out=$0 } END { print out}' file;
done > o4
real 0m29.950s
user 0m9.918s
sys 0m2.501s
They both produced very similar types of output:
$ awk '{a[$0]++} END { for (i in a) print i, a[i]}' o3 | awk '{sum+=$2; max=(NR>1&&max>$2?max:$2); min=(NR>1&&min<$2?min:$2)} END{print NR, sum, min, max}'
498 500 1 2
$ awk '{a[$0]++} END { for (i in a) print i, a[i]}' o4 | awk '{sum+=$2; max=(NR>1&&max>$2?max:$2); min=(NR>1&&min<$2?min:$2)} END{print NR, sum, min, max}'
490 500 1 3
Final Update:
Turns out it was calling wc that (unexpectedly to me at least!) was taking most of the time. Here's the improvement when we take it out of the loop:
$ time { max=$(wc -l < file); for i in $(seq 500); do awk -v seed="$RANDOM" -v max="$max" 'BEGIN{srand(seed); n=int(rand()*max)+1} NR==n{print; exit}' file; done } > o3
real 0m24.556s
user 0m5.044s
sys 0m1.565s
so the solution where we call wc up front and rand() once is faster than calling rand() for every line as expected.
on bash shell, first initialize seed to # line cube or your choice
$ i=;while read a; do let i++;done<<<$var; let RANDOM=i*i*i
$ let l=$RANDOM%$i+1 ;echo -e $var |sed -En "$l p"
if move your data to varfile
$ echo -e $var >varfile
$ i=;while read a; do let i++;done<varfile; let RANDOM=i*i*i
$ let l=$RANDOM%$i+1 ;sed -En "$l p" varfile
put the last inside loop e.g. for((c=0;c<9;c++)) { ;}
Using GNU sed and bash; no wc or awk:
f=input-file
sed -n $((RANDOM%($(sed = $f | sed '2~2d' | sed -n '$p')) + 1))p $f
Note: The three seds in $(...) are an inefficient way to fake wc -l < $f. Maybe there's a better way -- using only sed of course.
Using shuf:
$ echo "$var" | shuf -n 1
Output:
Hi

awk to calculate average of field in multiple text files and merge into one

I am trying to calculate the average of $2 in multiple test files in a directory and merge the output in one tab-delimeted output file. The output file is two fields, in which $1 is the file name that has been extracted by pref, and $2" is the calculated average with one decimal, rounded up. There is also a header in the outputSamplein$1andPercentin$2`. The below seems close but I am missing a few things (adding the header to the output, merging into one tab-delimeted file, and rounding to 3 decimal places), that I do not know how to do yet and not getting the desired output. Thank you :).
123_base.txt
AASS 99.81
ABAT 100.00
ABCA10 0.0
456_base.txt
ABL2 97.81
ABO 100.00
ACACA 99.82
desired output (tab-delimeted)
Sample Percent
123 66.6
456 99.2
Bash
for f in /home/cmccabe/Desktop/20x/percent/*.txt ; do
bname=$(basename $f)
pref=${bname%%_base_*.txt}
awk -v OFS='\t' '{ sum += $2 } END { if (NR > 0) print sum / NR }' $f /home/cmccabe/Desktop/NGS/bed/bedtools/IDP_total_target_length_by_panel/IDP_unix_trim_total_target_length.bed > /home/cmccabe/Desktop/20x/coverage/${pref}_average.txt
done
This one uses GNU awk, which provides handy BEGINFILE and ENDFILE events:
gawk '
BEGIN {print "Sample\tPercent"}
BEGINFILE {sample = FILENAME; sub(/_.*/,"",sample); sum = n = 0}
{sum += $2; n++}
ENDFILE {printf "%s\t%.1f\n", sample, sum/n}
' 123_base.txt 456_base.txt
If you're giving a pattern with the directory attached, I'd get the sample name like this:
match(FILENAME, /^.*\/([^_]+)/, m); sample = m[1]
and then, yes this is OK: gawk '...' /path/to/*_base.txt
And to steal against division by zero, inspired by James Brown's answer:
ENDFILE {printf "%s\t%.1f\n", sample, n==0 ? 0 : sum/n}
with perl
$ perl -ane '
BEGIN{ print "Sample\tPercent\n" }
$c++; $sum += $F[1];
if(eof)
{
($pref) = $ARGV=~/(.*)_base/;
printf "%s\t%.1f\n", $pref, $sum/$c;
$c = 0; $sum = 0;
}' 123_base.txt 456_base.txt
Sample Percent
123 66.6
456 99.2
print header using BEGIN block
-a option would split input line on spaces and save to #F array
For each line, increment counter and add to sum variable
If end of file eof is detected, print in required format
$ARGV contains current filename being read
If full path of filename is passed but only filename should be used to get pref, then use this line instead
($pref) = $ARGV=~/.*\/\K(.*)_base/;
In awk. Notice printf "%3.3s" to truncate the filename after 3rd char:
$ cat ave.awk
BEGIN {print "Sample", "Percent"} # header
BEGINFILE {s=c=0} # at the start of every file reset
{s+=$2; c++} # sum and count hits
ENDFILE{if(c>0) printf "%3.3s%s%.1f\n", FILENAME, OFS, s/c}
# above output if more than 0 lines
Run it:
$ touch empty_base.txt # test for division by zero
$ awk -f ave.awk 123_base.txt 123_base.txt empty_base.txt
Sample Percent
123 66.6
456 99.2
another awk
$ awk -v OFS='\t' '{f=FILENAME;sub(/_.*/,"",f);
a[f]+=$2; c[f]++}
END{print "Sample","Percent";
for(k in a) print k, sprintf("%.1f",a[k]/c[k])}' {123,456}_base.txt
Sample Percent
456 99.2
123 66.6

Pipe awk output to add to variable inside loop

I might be going about this the wrong way but I have tried every syntax and I am stuck on the closest error I could get to.
I have a log file, in which I want to filter to a set of lines like so:
Files : 1 1 1 1 1
Files : 3 3 4 4 5
Files : 10 4 2 3 1
Files : 254 1 1 1 1
The code I have will get me to this point, however, I want to use awk to perform addition of all of the first numeric column, in this instance giving 268 as the output (then performing a similar task on the other columns).
I have tried to pipe the awk output into a loop to perform the final step, but it won't add the values, throwing an error. I thought it could be due to awk handling the entries as a string, but as bash isn't strongly typed it should not matter?
Anyway, the code is:
x=0;
iconv -f UTF-16 -t UTF-8 "./TestLogs/rbTest.log" | grep "Files :" | grep -v "*.*" | egrep -v "Files : [a-zA-Z]" |awk '{$1=$1}1' OFS="," | awk -F "," '{print $4}' | while read i;
do
$x=$((x+=i));
done
Error message:
-bash: 0=1: command not found
-bash: 1=4: command not found
-bash: 4=14: command not found
-bash: 14=268: command not found
I tried a couple of the different addition syntaxes but I feel this has something to do with what I am trying to feed it than the addition itself.
This is currently just with integer values but I would also be looking to perform it with floats as well.
Any help much appreciated and I am sure there is a less convoluted way to achieve this, still learning.
You can do computations in awk itself:
awk '{for (c=3; c<=NF; c++) sum[c]+=$c} END{printf "Total : ";
for (c=3; c<=NF; c++) printf "%s%s", sum[c], ((c<NF)? OFS:ORS) }' file
Output:
Total : 268 9 8 9 8
Here sum is an associative array that holds sum for each column from #3 onwards.
Command breakup:
for (c=3; c<=NF; c++) # Iterate from 3rd col to last col
sum[c]+=$c # Add each col value into an array sum with index of col #
END # Execute this block after last record
printf "Total : " # Print literal "Total : "
for (c=3; c<=NF; c++) # Iterate from 3rd col to last col
printf "%s%s", # Use printf to format the output as 2 strings (%s%s)
sum[c], # 1st one is sum for the given index
((c<NF)? OFS:ORS) # 2nd is conditional string. It will print OFS if it is not last
# col and will print ORS if it is last col.
(Not an answer, but a formatted comment)
I always get antsy when I see a long pipeline of greps and awks (and seds, etc)
... | grep "Files :" | grep -v "*.*" | egrep -v "Files : [a-zA-Z]" | awk '{$1=$1}1' OFS="," | awk -F "," '{print $4}'
Can be written as
... | awk '/Files : [^[:alpha:]]/ && !/\*/ {print $4}'
Are you using grep -v "*.*" to filter out lines with dots, or lines with asterisks? Because you're achieving the latter.

How to efficiently sum two columns in a file with 270,000+ rows in bash

I have two columns in a file, and I want to automate summing both values per row
for example
read write
5 6
read write
10 2
read write
23 44
I want to then sum the "read" and "write" of each row. Eventually after summing, I'm finding the max sum and putting that max value in a file. I feel like I have to use grep -v to rid of the column headers per row, which like stated in the answers, makes the code inefficient since I'm grepping the entire file just to read a line.
I currently have this in a bash script (within a for loop where $x is the file name) to sum the columns line by line
lines=`grep -v READ $x|wc -l | awk '{print $1}'`
line_num=1
arr_num=0
while [ $line_num -le $lines ]
do
arr[$arr_num]=`grep -v READ $x | sed $line_num'q;d' | awk '{print $2 + $3}'`
echo $line_num
line_num=$[$line_num+1]
arr_num=$[$arr_num+1]
done
However, the file to be summed has 270,000+ rows. The script has been running for a few hours now, and it is nowhere near finished. Is there a more efficient way to write this so that it does not take so long?
Use awk instead and take advantage of modulus function:
awk '!(NR%2){print $1+$2}' infile
awk is probably faster, but the idiomatic bash way to do this is something like:
while read -a line; do # read each line one-by-one, into an array
# use arithmetic expansion to add col 1 and 2
echo "$(( ${line[0]} + ${line[1]} ))"
done < <(grep -v READ input.txt)
Note the file input file is only read once (by grep) and the number of externally forked programs is kept to a minimum (just grep, called only once for the whole input file). The rest of the commands are bash builtins.
Using the <( ) process substition, in case variables set in the while loop are required out of scope of the while loop. Otherwise a | pipe could be used.
Your question is pretty verbose, yet your goal is not clear. The way I read it, your numbers are on every second line, and you want only to find the maximum sum. Given that:
awk '
NR%2 == 1 {next}
NR == 2 {max = $1+$2; next}
$1+$2 > max {max = $1+$2}
END {print max}
' filename
You could also use a pipeline with tools that implicitly loop over the input like so:
grep -v read INFILE | tr -s ' ' + | bc | sort -rn | head -1 > OUTFILE
This assumes there are spaces between your read and write data values.
Why not run:
awk 'NR==1 { print "sum"; next } { print $1 + $2 }'
You can afford to run it on the file while the other script it still running. It'll be complete in a few seconds at most (prediction). When you're confident it's right, you can kill the other process.
You can use Perl or Python instead of awk if you prefer.
Your code is running grep, sed and awk on each line of the input file; that's damnably expensive. And it isn't even writing the data to a file; it is creating an array in Bash's memory that'll need to be printed to the output file later.
Assuming that it's always one 'header' row followed by one 'data' row:
awk '
BEGIN{ max = 0 }
{
if( NR%2 == 0 ){
sum = $1 + $2;
if( sum > max ) { max = sum }
}
}
END{ print max }' input.txt
Or simply trim out all lines that do not conform to what you want:
grep '^[0-9]\+\s\+[0-9]\+$' input.txt | awk '
BEGIN{ max = 0 }
{
sum = $1 + $2;
if( sum > max ) { max = sum }
}
END{ print max }' input.txt

Resources