Bash loop that calculates the sums of columns

I'm trying to write a loop in Bash that prints the sum of every column in a file. These columns are separated by tabs. What I have so far is this:
cols() {
    count=$(grep -c $'\t' $1)
    for n in $(seq 1 $count); do
        cat $FILE | awk '{sum+=$1} END{print "sum=",sum}'
    done
}
But this only prints out the sum of the first column. How can I do this for every column?

Your approach does the job, but it is somewhat overkill: you are counting the number of columns, then catting the file and calling awk once per column, when awk alone can do all of it:
awk -F"\t" '{for(i=1; i<=NF; i++) sum[i]+=$i} END {for (i in sum) print i, sum[i]}' file
This takes advantage of NF, which stores the number of fields in a line (what you were trying to compute with count=$(grep -c $'\t' $1)). Then it is just a matter of looping over the fields and adding each one into an array, where sum[i] holds the running sum for column i. Finally, the END block loops over the array and prints the results.
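For example, against a hypothetical two-line tab-separated file:
$ printf '1\t2\t3\n4\t5\t6\n' > file
$ awk -F"\t" '{for(i=1; i<=NF; i++) sum[i]+=$i} END {for (i in sum) print i, sum[i]}' file
1 5
2 7
3 9
(The order of the output lines may vary, since for (i in sum) iterates in an unspecified order.)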
Why isn't your approach summing a given column? Because when you say:
for n in $(seq 1 $count); do
    cat $FILE | awk '{sum+=$1} END{print "sum=",sum}'
done
you are always using $1 (the first field) as the element to sum. Instead, you should pass the value of $n to awk with something like:
awk -v col="$n" '{sum+=$col} END{print "sum=",sum}' $FILE # no need to cat $FILE
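Putting it together, a fixed version of your function could look like this (a sketch: it takes the column count from NF on the first line instead of counting tabs with grep, and still runs awk once per column):
cols() {
    local file=$1
    local count n
    count=$(awk -F'\t' '{print NF; exit}' "$file")   # number of columns, from the first line
    for n in $(seq 1 "$count"); do
        awk -F'\t' -v col="$n" '{sum+=$col} END{print "sum of col", col, "=", sum}' "$file"
    done
}
Still, the single-awk version above does the same work in one pass over the file.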

If you want a bash-builtins-only solution, this would work:
declare -i i l
declare -ai la sa=()
while IFS=$'\t' read -ra la; do
    for ((l=${#la[@]}, i=0; i<l; sa[i]+=la[i], ++i)); do :; done
done < file
(IFS=$'\t'; echo "${sa[*]}")
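As a quick sanity check, run against a hypothetical two-line file:
$ printf '1\t2\t3\n4\t5\t6\n' > file
$ bash colsums.sh   # the snippet above, saved under a hypothetical name
5	7	9
(The sums are joined by tabs, since the subshell sets IFS=$'\t' before expanding ${sa[*]}.)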
The performance of this should be decent, but quite a bit slower than something like awk.

Related

Adding similar lines in bash

I have a file with below records:
$ cat sample.txt
ABC,100
XYZ,50
ABC,150
QWE,100
ABC,50
XYZ,100
Expecting the output to be:
$ cat output.txt
ABC,300
XYZ,150
QWE,100
I tried the below script:
PREVVAL1=0
SUM1=0
cat sam.txt | sort > /tmp/Pos.part
while read line
do
    VAL1=$(echo $line | awk -F, '{print $1}')
    VAL2=$(echo $line | awk -F, '{print $2}')
    if [ $VAL1 == $PREVVAL1 ]
    then
        SUM1=`expr $SUM + $VAL2`
        PREVVAL1=$VAL1
        echo $VAL1 $SUM1
    else
        SUM1=$VAL2
        PREVVAL1=$VAL1
    fi
done < /tmp/Pos.part
I want a one-liner command that produces the required output and avoids the while loop: just add the numbers where the first column is the same and show each total on a single line.
awk -F, '{a[$1]+=$2} END{for (i in a) print i FS a[i]}' sample.txt
Output
QWE,100
XYZ,150
ABC,300
The first part is executed for each line and creates an associative array. The END part prints this array.
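Note that for (i in a) visits keys in an unspecified order, which is why the output above is not in input order. If you want deterministic output, pipe the result through sort:
awk -F, '{a[$1]+=$2} END{for (i in a) print i FS a[i]}' sample.txt | sort -t, -k1,1
ABC,300
QWE,100
XYZ,150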
It's an awk one-liner:
awk -F, -v OFS=, '{sum[$1]+=$2} END {for (key in sum) print key, sum[key]}' sample.txt > output.txt
sum[$1] += $2 creates an associative array whose keys are the first field and values are the corresponding sums.
This can also be done easily enough in native bash. The following uses no external tools, no subshells and no pipelines, and is thus far faster (I'd place money on 100x the throughput on a typical/reasonable system) than your original code:
declare -A sums=( )
while IFS=, read -r name val; do
    sums[$name]=$(( ${sums[$name]:-0} + val ))
done < sample.txt
for key in "${!sums[@]}"; do
    printf '%s,%s\n' "$key" "${sums[$key]}"
done
If you want to, you can make this a one-liner:
declare -A sums=( ); while IFS=, read -r name val; do sums[$name]=$(( ${sums[$name]:-0} + val )); done < sample.txt; for key in "${!sums[@]}"; do printf '%s,%s\n' "$key" "${sums[$key]}"; done

How to get values from one file that fall in a list of ranges from another file

I have a bunch of files with sorted numerical values, for example:
cat tag_1_file.val
234
551
626
cat tag_2_file.val
12
1023
1099
etc.
And one file with tags and the value ranges that fit my needs. It is sorted first by tag, then by the 2nd column, then by the 3rd. Ranges may overlap.
cat ranges.val
tag_1 200 300
tag_1 600 635
tag_2 421 443
and so on.
So I try to loop through the ranges file and, for every line, look for all values that fall in the range in the file with the appropriate tag:
cat ~/blahblah/ranges.val | while read -a line;
#read line as array
do
    cat ~/blahblah/${line[0]}_file.val | while read number;
    #get tag name and cat the appropriate file
    do
        if [[ "$number" -ge "${line[1]}" ]] && [[ "$number" -le "${line[2]}" ]]
        #check if the current value falls into the range
        then
            echo $number >> ${line[0]}.output
            #append the value that falls into the interval to another file
        elif [[ "$number" -gt "${line[2]}" ]]
        then break
        fi
    done
done
But these two nested while loops are deadly slow with huge files containing 100M+ lines.
I think, there must be more efficient way of doing such things and I'd be grateful for any hint.
UPD: The expected output based on this example is:
$ cat tag_1.output
234
626
Have you tried recoding the inner loop in something more efficient than Bash? Perl would probably be good enough:
while read tag low hi; do
    perl -nle "print if \$_ >= ${low} && \$_ <= ${hi}" \
        <${tag}_file.val >>${tag}.output
done <ranges.val
The behaviour of this version differs in two ways: the loop doesn't bail out once the high point is reached, and the output file is created even if it is empty. Over to you if that isn't what you want!
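If the early exit matters (the value files are sorted, per the question), a small tweak to the perl body restores it:
while read tag low hi; do
    perl -nle "last if \$_ > ${hi}; print if \$_ >= ${low}" \
        <${tag}_file.val >>${tag}.output
done <ranges.val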
Another, not-so-efficient implementation with awk:
$ awk 'NR==FNR {t[NR]=$1; s[NR]=$2; e[NR]=$3; next}
       {for(k in t)
          if(t[k]==FILENAME) {
            inout = t[k] "." ((s[k]<=$1 && $1<=e[k])?"in":"out");
            print > inout;
            next}}' ranges tag_1 tag_2
$ head tag_?.*
==> tag_1.in <==
234
==> tag_1.out <==
551
626
==> tag_2.out <==
12
1023
1099
Note that I renamed the files to match the tag names; otherwise you would have to extract the tag from each filename. The suffix ".in" marks values inside a range and ".out" values outside. This depends on the sorted order of the files. If you have thousands of tag files, adding another layer to filter the ranges per tag will speed it up; as written, it iterates over all ranges for every line.
I'd write
while read -u3 -r tag start end; do
    f="${tag}_file.val"
    if [[ -r $f ]]; then
        while read -u4 -r num; do
            (( start <= num && num <= end )) && echo "$num"
        done 4< "$f"
    fi
done 3< ranges.val
I'm deliberately reading the files on separate file descriptors, otherwise the inner while-read loop will also slurp up the rest of "ranges.val".
bash while-read loops are very slow. I'll be back in a few minutes with an alternate solution
here's a GNU awk answer (requires, I believe, a fairly recent version)
gawk '
@load "filefuncs"
function read_file(tag, start, end,    file, number, statdata) {
    file = tag "_file.val"
    if (stat(file, statdata) != -1) {
        while ((getline number < file) > 0) {
            if (start <= number && number <= end) print number
        }
        close(file)
    }
}
{read_file($1, $2, $3)}
' ranges.val
perl
perl -Mautodie -ane '
    $file = $F[0] . "_file.val";
    next unless -r $file;
    open $fh, "<", $file;
    while ($num = <$fh>) {
        print $num if $F[1] <= $num and $num <= $F[2]
    }
    close $fh;
' ranges.val
I have a solution for you from bioinformatics:
We have a format and a tool for this kind of task.
The format called .bed is used for describing ranges on chromosomes, but it should work with your tags too.
The best toolset for this format is bedtools, which is lightning fast.
The specific tool, which might help you is intersect.
With this installed, it becomes a task of formatting the data for the tool:
#!/bin/bash
#reformatting your positions to .bed format:
#1 adding the tag to each line
#2 repeating the position to make it a range
#3 converting to tab-separation
awk -F $'\t' 'BEGIN {OFS = FS} {print FILENAME, $0, $0}' *_file.val | sed 's/_file.val//g' >all_positions_in_one_range_file.bed
#making your range-file tab-separated
sed 's/ /\t/g' ranges.val >ranges_with_tab.bed
#doing the real comparison of the ranges with bedtools
bedtools intersect -a all_positions_in_one_range_file.bed -b ranges_with_tab.bed >all_positions_intersected.bed
#splitting the one result file back into files named by your tag
awk -F $'\t' '{print $2 >$1".out"}' all_positions_intersected.bed
Or if you prefer one-liners:
bedtools intersect -a <(awk -F $'\t' 'BEGIN {OFS = FS} {print FILENAME, $0, $0}' *_file.val | sed 's/_file.val//g') -b <(sed 's/ /\t/g' ranges.val) | awk -F $'\t' '{print $2 >$1".out"}'

Bash - Transpose a single field, keeping the rest the same, and repeat it across

I have a file with pipe-separated fields, e.g.:
1,2,3|xyz|abc
I need the output in below format:
1|xyz|abc
2|xyz|abc
3|xyz|abc
I have working code in bash:
while read i
do
    f1=`echo $i | cut -d'|' -f1`
    f2=`echo $i | cut -d'|' -f2-`
    echo $f1 | tr ',' '\n' | sed "s:$:|$f2:" >> output.txt
done < pipe_delimited_file.txt
Can anyone suggest a way to achieve this without using a loop? The file contains a large number of records.
Uses a loop, but it's inside awk, so very fast:
awk -F\| 'BEGIN{OFS="|"}{n = split($1, a, ","); $1=""; for(i=1; i<=n; i++) {print a[i] $0}}' pipe_delimited_file.txt
Perl may be a bit faster than awk:
perl -F'[|]' -ane 'for $n (split /,/, $F[0]) {$F[0] = $n; print join "|", @F}' file
bash is very slow, but here's a quicker way to use it. This uses plain bash without calling any external programs:
( # in a subshell:
    IFS=,     # use comma as field separator
    set -f    # turn off filename generation
    while IFS='|' read -r f1 rest; do   # temporarily using pipe as field separator,
                                        # read first field and rest of line
        for word in $f1; do             # iterate over comma-separated words
            echo "$word|$rest"
        done
    done
) < file
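Applied to the question's sample line 1,2,3|xyz|abc (saved as file), this prints:
1|xyz|abc
2|xyz|abc
3|xyz|abc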

Calculate median with shell script

I have a script that prints numbers in loops.
#!/bin/bash
for i in `seq 80 $i`
do
    for j in `seq 1 $4`
    do
        ./sujet1 $1 $2 $i
    done
done
./sujet1 $1 $2 $i is a compiled C program which prints a number (but I don't want it printed on screen).
I would like to calculate the median of the numbers the inner loop prints, then print that median on the screen,
so I'll have one median per value of $i at the end.
I know I should first use ./sujet1 $1 $2 $i >> mediane.txt to save the values. But I don't know how to read them back from the file, calculate the median, and erase them at the end of every loop.
EDIT:
I tried with awk as suggested in a comment, but I find it difficult to understand:
#!/bin/bash
for i in `seq 80 $i`
do
    for j in `seq 1 $4`
    do
        awk '{ total += ./sujet1 $1 $2 $i } END { print total/NR }' mediane.txt
    done
done
It doesn't work for me.
EDIT 2: For example, I type ./run.sh 30 40 90 3
so I'll have
//for($3= 80 )
2,3
3,5
4,4
//for($3= 81 )
4,5
1,3
5,6
...
//for($3=90)
2,4
3,5
5,4
You can see that for every value of $3 I have $4 values repeating. I want to calculate the median of these $4 values and print a single value.
Your question is very hard to understand, but I think you want to run the sujet program lots of times and average the answer.
for i in `seq 80 $i`
do
    for j in `seq 1 $4`
    do
        ./sujet1 $1 $2 $i
    done
done | awk '{total += $0} END{ print total/NR}'
Maybe you want the median of all the outputs of the sujet program. If so, pipe the output through sort first and then find the middle one with awk, something like this:
for ...
    for ...
        ./sujet ...
    done
done | sort -n | awk '{x[NR]=$0} END{middle=int((NR+1)/2); print x[middle]}'
You could use the backticks operator:
result="`./sujet1 $1 $2 $i`"
It runs an OS command inline and assigns its output to the left-side variable.
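In modern scripts the $( ) form is preferred; it does the same thing and nests cleanly:
result="$(./sujet1 "$1" "$2" "$i")"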

adding numbers without grep -c option

I have a txt file like
Peugeot:406:1999:Silver:1
Ford:Fiesta:1995:Red:2
Peugeot:206:2000:Black:1
Ford:Fiesta:1995:Red:2
I am looking for a command that counts the number of red Ford Fiesta cars.
The last number in each line is the amount of that particular car.
The command I am looking for CANNOT use the -c option of grep.
So this command should just output the number 4.
Any help would be welcome, thank you.
A simple bit of awk would do the trick:
awk -F: '$1=="Ford" && $4=="Red" { c+=$5 } END { print c }' file
Output:
4
Explanation:
The -F: switch means that the input field separator is a colon, so the car manufacturer is $1 (the 1st field), the model is $2, etc.
If the 1st field is "Ford" and the 4th field is "Red", then add the value of the 5th (last) field to the variable c. Once the whole file has been processed, print out the value of c.
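With this data every red Ford happens to be a Fiesta, but since the question asks for red Ford Fiestas specifically, you could also match the model field ($2) to be strict:
awk -F: '$1=="Ford" && $2=="Fiesta" && $4=="Red" { c+=$5 } END { print c }' file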
For a native bash solution:
c=0
while IFS=":" read -ra col; do
[[ ${col[0]} == Ford ]] && [[ ${col[3]} == Red ]] && (( c += col[4] ))
done < file && echo $c
Effectively applies the same logic as the awk one above, without any additional dependencies.
Methods:
1.) Use a scripting language for the counting, like awk or perl. An awk solution was already posted; here is a perl solution.
perl -F: -lane '$s+=$F[4] if m/Ford:.*:Red/}{print $s' < carfile
#or
perl -F: -lane '$s+=$F[4] if ($F[0]=~m/Ford/ && $F[3]=~/Red/)}{print $s' < carfile
Both examples print
4
2.) The second method is based on shell-pipelining
filter out the right rows
extract the column with the count
sum the numbers
e.g. some examples:
grep 'Ford:.*:Red:' carfile | cut -d: -f5 | paste -sd+ | bc
grep filters out the right rows
cut extracts the last column
paste creates a line like 2+2, which can then be evaluated by
bc, which does the arithmetic (traced below)
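Tracing each stage on the sample file:
$ grep 'Ford:.*:Red:' carfile | cut -d: -f5
2
2
$ grep 'Ford:.*:Red:' carfile | cut -d: -f5 | paste -sd+
2+2
$ grep 'Ford:.*:Red:' carfile | cut -d: -f5 | paste -sd+ | bc
4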
Another example:
sed -n 's/\(Ford:.*:Red\):\(.*\)/\2/p' carfile | paste -sd+ | bc
here sed does both the filtering and the extracting
Another example, a different way of counting:
(echo 0; sed -n 's/\(Ford:.*:Red\):\(.*\)/\2+/p' carfile; echo p) | dc
The numbers are summed by the RPN calculator called dc; in RPN the values come first and the operation last, e.g. 0 2 +.
the first echo pushes 0 onto the stack
the sed creates a stream of terms like 2+ 2+
the final echo p prints the top of the stack
There are many other ways to sum a stream of numbers,
e.g. in bash:
sum=0
while read -r num
do
    sum=$(( sum + num ))
done < <(sed -n 's/\(Ford:.*:Red\):\(.*\)/\2/p' carfile)
echo $sum
and pure bash:
sum=0
while IFS=: read -r maker model year color count
do
    if [[ "$maker" == "Ford" && "$color" == "Red" ]]
    then
        (( sum += count ))
    fi
done < carfile
echo $sum
