I have a file with 2 columns and many rows. I would like to calculate the mean of each column for odd and even lines independently, so that in the end I would have a file with 4 values: 2 columns, each with an odd-line mean and an even-line mean.
My file looks like this:
2 4
4 4
6 8
3 5
6 9
2 1
In the end I would like to obtain a file with the mean of 2,6,6 and 4,3,2 in the first column and the mean of 4,8,9 and 4,5,1 in the second column, that is:
4.66 7
3 3.33
If anyone could give me some advice I'd really appreciate it. For the moment I'm only able to calculate the mean over all rows (not even and odd separately). Thank you very much in advance!
This is a hardcoded awk example, but you should get the point:
awk 'NR%2{e1+=$1;e2+=$2;c++;next}
{o1+=$1;o2+=$2;d++}
END{print e1/c"\t"e2/c"\n"o1/d"\t"o2/d}' your_file
4.66667 7
3 3.33333
A more generalized version of Juan Diego Godoy's answer. It relies on GNU awk for arrays of arrays.
gawk '
{
    parity = NR % 2 == 1 ? "odd" : "even"
    for (i=1; i<=NF; i++) {
        sum[parity][i] += $i
        count[parity][i] += 1
    }
}
function result(parity) {
    for (i=1; i<=NF; i++)
        printf "%g\t", sum[parity][i] / count[parity][i]
    print ""
}
END { result("odd"); result("even") }
' your_file
This answer uses Bash and bc. It assumes that the input file consists of only integers and that there is an even number of lines.
#!/bin/bash
while read -r oddcol1 oddcol2; read -r evencol1 evencol2
do
(( oddcol1sum += oddcol1 ))
(( oddcol2sum += oddcol2 ))
(( evencol1sum += evencol1 ))
(( evencol2sum += evencol2 ))
(( count++ ))
done < inputfile
cat <<EOF | bc -l
scale=2
print "Odd Column 1 Mean: "; $oddcol1sum / $count
print "Odd Column 2 Mean: "; $oddcol2sum / $count
print "Even Column 1 Mean: "; $evencol1sum / $count
print "Even Column 2 Mean: "; $evencol2sum / $count
EOF
It could be modified to use arrays to make it more flexible.
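For comparison, here is a minimal Python sketch of the same odd/even averaging. It is an illustration rather than one of the posted answers: the function name `parity_means` is made up here, and it assumes whitespace-separated numeric fields with the same number of columns on every line. Unlike the Bash version, it does not need an even number of lines.

```python
def parity_means(lines):
    """Return per-column means for odd- and even-numbered lines (1-based)."""
    sums = {"odd": None, "even": None}
    counts = {"odd": 0, "even": 0}
    for lineno, line in enumerate(lines, start=1):
        parity = "odd" if lineno % 2 == 1 else "even"
        values = [float(x) for x in line.split()]
        if sums[parity] is None:
            sums[parity] = values
        else:
            # Add this line's values column by column to the running sums.
            sums[parity] = [s + v for s, v in zip(sums[parity], values)]
        counts[parity] += 1
    # Only report a parity that actually occurred in the input.
    return {p: [s / counts[p] for s in sums[p]] for p in sums if counts[p]}


sample = ["2 4", "4 4", "6 8", "3 5", "6 9", "2 1"]
means = parity_means(sample)
for parity in ("odd", "even"):
    print("\t".join(f"{m:g}" for m in means[parity]))
```

To run it on a real file, pass an open file object instead of the sample list, e.g. `parity_means(open("your_file"))`.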
Related
While looking into this question, the challenge was to take this matrix:
4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w
And turn it into:
4 5
m d
t 7
h 5
r 5
4 1
x c
0 0
6 2 # top of next 2 columns...
6 7
4 2
... each N elements from each row of the matrix -- in this example, N=2...
3 4
4 1
d f
5 9
q w # last element is lower left of matrix
The OP stated the input was 'much bigger' than the example without specifying the shape of the actual input (millions of rows? millions of columns? or both?).
I assumed (mistakenly) that the file had millions of rows; it was later specified to have millions of columns.
The interesting thing is that most of the awk scripts posted ran at a perfectly acceptable speed IF the long dimension of the data was columns.
Example: @glennjackman posted a perfectly usable awk script so long as the long end was in columns, not in rows.
Here, you can use his Perl to generate an example matrix of rows x columns:
perl -E '
    my $cols = 2**20; # 1,048,576 columns - the long end
    my $rows = 2**3; # 8 rows
    my @alphabet = ( "a".."z", 0..9 ); # double quotes, since the script sits inside shell single quotes
    my $size = scalar @alphabet;
    for (my $r = 1; $r <= $rows; $r++) {
        for (my $c = 1; $c <= $cols; $c++) {
            my $idx = int rand $size;
            printf "%s ", $alphabet[$idx];
        }
        printf "\n";
    }' >file
Here are some candidate scripts that turn file (from that Perl script) into the output of 2 columns taken from the front of each row:
This is the speed champ regardless of the shape of input in Python:
$ cat col.py
import sys
cols=int(sys.argv[2])
offset=0
delim="\t"
with open(sys.argv[1], "r") as f:
    dat=[line.split() for line in f]
while offset<=len(dat[0])-cols:
    for sl in dat:
        print(delim.join(sl[offset:offset+cols]))
    offset+=cols
Here is a Perl that is also quick enough regardless of the shape of the data:
$ cat col.pl
push @rows, [@F];
END {
    my $delim = "\t";
    my $cols_per_group = 2;
    my $col_start = 0;
    while ( 1 ) {
        for my $row ( @rows ) {
            print join $delim, @{$row}[ $col_start .. ($col_start + $cols_per_group - 1) ];
        }
        $col_start += $cols_per_group;
        last if ($col_start + $cols_per_group - 1) > $#F;
    }
}
Here is an alternate awk that is slower but a consistent speed (and the number of lines in the file needs to be pre-calculated):
$ cat col3.awk
function join(start, end, result, i) {
for (i=start; i<=end; i++)
result = result $i (i==end ? ORS : FS)
return result
}
{ col_offset=0
for(i=1;i<=NF;i+=cols) {
s=join(i,i+cols-1)
col[NR+col_offset*nl]=join(i,i+cols-1)
col_offset++
++cnt
}
}
END { for(i=1;i<=cnt;i++) printf "%s", col[i]
}
And Glenn Jackman's awk (not to pick on him since ALL the awks had the same bad result with many rows):
function join(start, end, result, i) {
for (i=start; i<=end; i++)
result = result $i (i==end ? ORS : FS)
return result
}
{
c=0
for (i=1; i<NF; i+=n) {
c++
col[c] = col[c] join(i, i+n-1)
}
}
END {
for (i=1; i<=c; i++)
printf "%s", col[i] # the value already ends with newline
}
Here are the timings with many columns (i.e., in the Perl script that generates file above, my $cols = 2**20 and my $rows = 2**3):
echo 'glenn jackman awk'
time awk -f col1.awk -v n=2 file >file1
echo 'glenn jackman gawk'
time gawk -f col1.awk -v n=2 file >file5
echo 'perl'
time perl -lan columnize.pl file >file2
echo 'dawg Python'
time python3 col.py file 2 >file3
echo 'dawg awk'
time awk -f col3.awk -v nl=$(awk '{cnt++} END{print cnt}' file) -v cols=2 file >file4
Prints:
# 2**20 COLUMNS; 2**3 ROWS
glenn jackman awk
real 0m4.460s
user 0m4.344s
sys 0m0.113s
glenn jackman gawk
real 0m4.493s
user 0m4.379s
sys 0m0.109s
perl
real 0m3.005s
user 0m2.774s
sys 0m0.230s
dawg Python
real 0m2.871s
user 0m2.721s
sys 0m0.148s
dawg awk
real 0m11.356s
user 0m11.038s
sys 0m0.312s
But transpose the shape of the data by setting my $cols = 2**3 and my $rows = 2**20 and run the same timings:
# 2**3 COLUMNS; 2**20 ROWS
glenn jackman awk
real 23m15.798s
user 16m39.675s
sys 6m35.972s
glenn jackman gawk
real 21m49.645s
user 16m4.449s
sys 5m45.036s
perl
real 0m3.605s
user 0m3.348s
sys 0m0.228s
dawg Python
real 0m3.157s
user 0m3.065s
sys 0m0.080s
dawg awk
real 0m11.117s
user 0m10.710s
sys 0m0.399s
So question:
What would cause the first awk to be 100x slower if the data are transposed to millions of rows vs millions of columns?
It is the same number of elements processed and the same total data. The join function is called the same number of times.
Saving the result of a string concatenation in a variable is one of the slowest operations in awk (IIRC it's slower than I/O), because awk constantly has to find a new memory location to hold the result. In the posted scripts each col[c] string is appended to once per input row, so the more rows there are, the longer those strings get and the more of that reallocation and copying happens. It's probably all of that string concatenation that's causing the slowdown.
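To get a rough feel for why the shape matters, here is a back-of-the-envelope sketch in Python. It rests on a loudly stated assumption: that every append recopies the whole string built so far. Real awk implementations may well do better than that, so treat the numbers as a crude upper-bound model and not a measurement. The function name `copied_chars` and the 4-character piece length (two single-character fields, a separator, and a newline) are illustrative choices.

```python
def copied_chars(rows, cols, n=2, piece_len=4):
    """Characters copied if every append recopies the whole string (ASSUMED cost model)."""
    groups = cols // n                       # number of col[c] strings being built
    pieces_copied = rows * (rows + 1) // 2   # 1 + 2 + ... + rows pieces per string
    return groups * piece_len * pieces_copied


wide = copied_chars(rows=2**3, cols=2**20)   # many columns, few rows
tall = copied_chars(rows=2**20, cols=2**3)   # few columns, many rows
print(f"many columns: {wide:,} characters copied")
print(f"many rows:    {tall:,} characters copied")
print(f"ratio:        {tall / wide:,.0f}x")
```

Both shapes contain the same number of fields, but under this model the work grows with the square of the number of rows, because each of the few col[c] strings is extended once per row.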
Something like this should be fast and shouldn't be dependent on how many fields there are vs how many records:
$ cat tst.awk
{
for (i=1; i<=NF; i++) {
vals[++numVals] = $i
}
}
END {
for (i=1; i<=numVals; i+=2) {
valNr = i + ((i-1) * NF) # <- not correct, fix it!
print vals[valNr], vals[valNr+1]
}
}
I don't have time right now to figure out the correct math to calculate the index for the single-loop approach above (see the comment in the code), so here's a working version with 2 loops that doesn't require as much thought and shouldn't run much, if any, slower:
$ cat tst.awk
{
for (i=1; i<=NF; i++) {
vals[++numVals] = $i
}
}
END {
inc = NF - 1
for (i=0; i<NF; i+=2) {
for (j=1; j<=NR; j++) {
valNr = i + j + ((j-1) * inc)
print vals[valNr], vals[valNr+1]
}
}
}
$ awk -f tst.awk file
4 5
m d
t 7
h 5
r 5
4 1
x c
0 0
6 2
6 7
4 2
6 2
7 1
9 0
a 2
3 2
9 8
9 5
4 2
5 s
2 2
5 6
3 4
1 4
4 8
4 g
5 3
3 4
4 1
d f
5 9
q w
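The index arithmetic is easier to see with 0-based indexing. The following Python sketch is one possible way to do the single-loop version that the answer left open. It is not the author's code: `regroup` is a hypothetical name, and it assumes a rectangular input whose column count is a multiple of the group size.

```python
def regroup(rows, group=2):
    """Yield tuples of `group` adjacent values, one column block after another."""
    nr, nf = len(rows), len(rows[0])
    vals = [v for row in rows for v in row]   # flat, row-major, 0-based
    blocks = nf // group                      # number of column blocks
    for k in range(nr * blocks):              # single loop over the output lines
        block, r = divmod(k, nr)              # which column block, which input row
        start = r * nf + block * group        # flat index of the first value to print
        yield tuple(vals[start:start + group])


matrix = [line.split() for line in ["4 5 6 2", "m d 6 7", "t 7 4 2"]]
for pair in regroup(matrix):
    print(*pair)
```

The key step is `divmod(k, nr)`: the output line number k is split into the column block (the quotient) and the input row (the remainder), which gives the flat index without needing a nested loop.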
A play with strings:
$ awk '
{
a[NR]=$0 # hash rows to a
c[NR]=1 # index pointer
}
END {
b=4 # buffer size for match
i=NR # row count
n=length(a[1]) # process til the first row is done
while(c[1]<n)
for(j=1;j<=i;j++) { # of each row
match(substr(a[j],c[j],b),/([^ ]+ ?){2}/) # read 2 fields and separators
print substr(a[j],c[j],RLENGTH) # output them
c[j]+=RLENGTH # increase index pointer
}
}' file
b=4 is a buffer that is optimal for 2 single-character fields and 2 single-space separators (a b ), as given in the original question; for real-world data, b should be set to something more suitable. Omitting it, so that the match line becomes match(substr(a[j],c[j]),/([^ ]+ ?){2}/), kills the performance for data with lots of columns.
I got times around 8 seconds for datasets of sizes 2**20 x 2**3 and 2**3 x 2**20.
Ed Morton's approach did fix the speed issue.
Here is the awk I wrote that supports variable columns:
$ cat col.awk
{
for (i=1; i<=NF; i++) {
vals[++numVals] = $i
}
}
END {
for(col_offset=0; col_offset + cols <= NF; col_offset+=cols) {
for (i=1; i<=numVals; i+=NF) {
for(j=0; j<cols; j++) {
printf "%s%s", vals[i+j+col_offset], (j<cols-1 ? FS : ORS)
}
}
}
}
$ time awk -f col.awk -v cols=2 file >file.cols
real 0m5.810s
user 0m5.468s
sys 0m0.339s
This takes about 6 seconds for datasets of sizes 2**20 x 2**3 and 2**3 x 2**20.
But MAN it sure is nice to have strong support of arrays of arrays (such as in Perl or Python...)
I have an array in Bash that will print out a series of numbers. I would like to find the first available (read: not in the array) number divisible by 8 (including 0).
for i in "${NUMS[@]}"
do
echo "$i"
done
Will output:
0
1
2
3
8
9
10
11
So in this example, the value would be "16". If 0 or 8 were missing from that array, those would have been selected.
I'm looking at something like:
echo "${NUMS[@]}" | awk -v RS='\\s+' '{ a[$1] } END { for(i = 0; i in a; ++i); print i }'
which gives me the first missing integer (4), but I have not yet gotten a working result for a multiple of 8.
This should work:
printf '%s\n' "${NUMS[@]}" |
sort -n |
awk 'BEGIN { num=0 } $0 == num { num+=8 } END { print num }'
The idea is to start looking for the number 0, if you find it you start looking for 8 and so on. The variable num gets incremented by 8 each time the number is found to give the next multiple of 8 that hasn't been seen yet.
Sort is only needed if the array isn't already ordered.
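The same search can be sketched in Python without sorting, by putting the numbers in a set and walking the multiples of 8 upwards. This is only an illustration of the idea; the function name `first_free_multiple` is made up here.

```python
from itertools import count


def first_free_multiple(nums, step=8):
    """Return the smallest non-negative multiple of `step` that is not in `nums`."""
    taken = set(nums)
    # count(0, step) yields 0, 8, 16, ...; stop at the first one not taken.
    return next(n for n in count(0, step) if n not in taken)


print(first_free_multiple([0, 1, 2, 3, 8, 9, 10, 11]))
```

Because set membership does not depend on order, the input does not need to be sorted first.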
Another solution I had working prior to reading Graeme's (much better) solution:
POSSIBLE_VALUES=($(seq 0 8 255))
for i in "${POSSIBLE_VALUES[@]}"
do
    match=0
    for j in "${NUMS[@]}"
    do
        if [ "${i}" == "${j}" ]
        then
            match=1
            break
        fi
    done
    if [ "${match}" == 0 ]
    then
        c+=("$i")
    fi
done
echo "${c[0]}"
Hi :) I need help on a project. I'm working on shell scripting and need to figure out how to print car names after numbers in a list when they're divisible by certain values.
Here's the generator. It takes two integers from the user (the section where they're prompted is not included) and prints the even numbers between them. I need to print car names after numbers divisible by 5, 7 and 10:
5 = Ford 7 = Bmw 10 = Rover
Generator:
for ((n = n1; n <= n2; ++n)); do
    out=$(( n % 2 ))
    if [ "$out" -eq 0 ] ; then
        echo "$n"
    fi
done
Any help would be appreciated :( I'm completely clueless :(
Thanks
awk to the rescue!
$ awk -v start=0 -v end=70 -v pat='5=Ford,7=Bmw,10=Rover' \
'BEGIN{n=split(pat,p,",");
for(i=1;i<=n;i++)
{split(p[i],d,"=");
a[d[1]]=d[2]}
for(i=start;i<=end;i+=2)
{printf "%s ",i;
for(k in a)
if(i%k==0)
printf "%s ", a[k];
print ""}}'
Instead of hard-coding the FizzBuzz values, let the program handle them for you. The script parses the pattern and stores each divisor/name pair in a map a. While iterating from start to end in steps of two, it checks each divisor and, if it divides the number, appends the name to the line. Note that awk does not guarantee the order of for (k in a), so the names may not come out in the order given. This assumes start is entered as an even value; if not, you need to guard for that too.
Do you mean something like this?
n1=0
n2=10
for ((n = n1; n <= n2; ++n)); do
if (( n % 2 == 0)); then
echo -n $n
if (( n % 5 == 0)); then
echo -n ' Ford'
fi
if (( n % 7 == 0)); then
echo -n ' BMW'
fi
if (( n % 10 == 0)); then
echo -n ' Rover'
fi
echo
fi
done
Output
0 Ford BMW Rover
2
4
6
8
10 Ford Rover
Not sure you want the 0 line containing names though, but that's how % works. You can add an explicit check for 0 if you wish.
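The table-driven idea from the awk answer can also be sketched in Python, where a dict keeps the divisor/name pairs in the order they were written. This is an illustrative sketch, not either answerer's code; `tag_evens` is a hypothetical name, and it rounds an odd start up to the next even number, which the awk version leaves to the caller.

```python
def tag_evens(start, end, names):
    """Return one line per even number in [start, end], tagged with matching names."""
    lines = []
    first_even = start + (start % 2)   # round an odd start up to the next even number
    for n in range(first_even, end + 1, 2):
        # Collect every name whose divisor divides n, in dict insertion order.
        tags = [name for divisor, name in names.items() if n % divisor == 0]
        lines.append(" ".join([str(n), *tags]))
    return lines


cars = {5: "Ford", 7: "Bmw", 10: "Rover"}
print("\n".join(tag_evens(0, 10, cars)))
```

As with the Bash version, 0 is tagged with every name because 0 % k is 0 for any divisor k.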
I'm trying to write a bash script that calculates the average of numbers by rows and columns. An example of a text file that I'm reading in is:
1 2 3 4 5
4 6 7 8 0
There is an unknown number of rows and unknown number of columns. Currently, I'm just trying to sum each row with a while loop. The desired output is:
1 2 3 4 5 Sum = 15
4 6 7 8 0 Sum = 25
And so on and so forth with each row. Currently this is the code I have:
while read i
do
echo "num: $i"
(( sum=$sum+$i ))
echo "sum: $sum"
done < $2
To call the program it's stats -r test_file. "-r" indicates rows; I haven't started columns quite yet. My current code actually just takes the first number of each column and adds them together, and then the rest of the numbers error out as a syntax error. It says the error comes from line 16, which is the (( sum=$sum+$i )) line, but I honestly can't figure out what the problem is. I should tell you I'm extremely new to bash scripting, and I have googled and searched high and low for the answer to this and can't find it. Any help is greatly appreciated.
You are reading the file line by line, so $i holds a whole line, and adding a whole line such as "1 2 3 4 5" to sum is not a valid arithmetic expression. Try this:
while read -r i
do
    sum=0
    for num in $i
    do
        sum=$((sum + num))
    done
    echo "$i Sum: $sum"
done < "$2"
Just split each line into its numbers using a for loop. I hope this helps.
Another non bash way (con: OP asked for bash, pro: does not depend on bashisms, works with floats).
awk '{c=0;for(i=1;i<=NF;++i){c+=$i};print $0, "Sum:", c}'
Another way (not a pure bash):
while read line
do
sum=$(sed 's/[ ]\+/+/g' <<< "$line" | bc -q)
echo "$line Sum = $sum"
done < filename
The numsum -r util covers the row addition, but the output format needs a little glue, supplied here by (inefficiently) pasting together the output of a few utils:
paste "$2" \
<(yes "Sum =" | head -$(wc -l < "$2") ) \
<(numsum -r "$2")
Output:
1 2 3 4 5 Sum = 15
4 6 7 8 0 Sum = 25
Note -- to run the above line on a given file foo, first initialize $2 like so:
set -- "" foo
paste "$2" <(yes "Sum =" | head -$(wc -l < "$2") ) <(numsum -r "$2")
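For comparison, the per-row summing shown in the answers above can be sketched in Python as follows. This is an illustration only: `row_sums` and `to_number` are made-up names, and the sketch assumes whitespace-separated numeric fields. Integers stay integers and anything else is parsed as a float, so it also copes with the float case mentioned for the awk answer.

```python
def to_number(text):
    """Parse an integer if possible, otherwise fall back to a float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def row_sums(lines):
    """Return each input line with ' Sum = <total>' appended."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        total = sum(to_number(field) for field in line.split())
        out.append(f"{line} Sum = {total}")
    return out


print("\n".join(row_sums(["1 2 3 4 5", "4 6 7 8 0"])))
```

To process a file, pass an open file object, e.g. `row_sums(open("test_file"))`.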
Hi, I have a CSV file with the data below in it.
8/22/2013 1,417,196,108
8/23/2013 1,370,586,883
8/24/2013 1,362,561,606
8/25/2013 1,177,575,904
8/26/2013 1,228,394,403
8/27/2013 1,276,168,499
8/28/2013 1,265,333,615
I want a script that can help me to sum 2nd column and insert result in next row, so the output should look like:
8/22/2013 1,417,196,108
8/23/2013 1,370,586,883
8/24/2013 1,362,561,606
8/25/2013 1,177,575,904
8/26/2013 1,228,394,403
8/27/2013 1,276,168,499
8/28/2013 1,265,333,615
Total 9,097,817,018
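The tricky part of this question is the thousands separator.
Try this line (the ' flag in the printf format only adds the separators if your awk and locale support thousands grouping):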
awk '{print;gsub(/,/,"");s+=$2}END{printf "Total %'\''d\n",s}' file
Test with your data:
kent$ awk '{print;gsub(/,/,"");s+=$2}END{printf "Total %'\''d\n",s}' f
8/22/2013 1,417,196,108
8/23/2013 1,370,586,883
8/24/2013 1,362,561,606
8/25/2013 1,177,575,904
8/26/2013 1,228,394,403
8/27/2013 1,276,168,499
8/28/2013 1,265,333,615
Total 9,097,817,018
Pure bash solution:
#!/bin/bash
while read -r _ curNumber; do
(( answer += ${curNumber//,/} ))
done < file.csv
(( start = (${#answer} % 3 == 0 ? 3 : ${#answer} % 3) ))
echo -n "${answer:0:start}"
for ((i = start; i < ${#answer}; i += 3)); do
echo -n ",${answer:i:3}"
done
echo
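The same strip-sum-reformat steps can be sketched in Python, where the `,` format specifier inserts the thousands separators independently of the locale. This is an illustrative sketch under the assumption that every line has exactly two whitespace-separated fields, as in the sample data; `with_total` is a made-up name.

```python
def with_total(lines):
    """Return the input lines plus a 'Total' line summing the second column."""
    total = 0
    for line in lines:
        _, amount = line.split()                 # date, comma-grouped number
        total += int(amount.replace(",", ""))    # strip the separators before adding
    return list(lines) + [f"Total {total:,}"]    # ',' re-inserts thousands separators


data = [
    "8/22/2013 1,417,196,108",
    "8/23/2013 1,370,586,883",
    "8/24/2013 1,362,561,606",
    "8/25/2013 1,177,575,904",
    "8/26/2013 1,228,394,403",
    "8/27/2013 1,276,168,499",
    "8/28/2013 1,265,333,615",
]
print("\n".join(with_total(data)))
```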