Adding similar lines in bash [duplicate]

I have a file with the records below:
$ cat sample.txt
ABC,100
XYZ,50
ABC,150
QWE,100
ABC,50
XYZ,100
Expecting the output to be:
$ cat output.txt
ABC,300
XYZ,150
QWE,100
I tried the script below:
PREVVAL1=0
SUM1=0
cat sample.txt | sort > /tmp/Pos.part
while read line
do
    VAL1=$(echo $line | awk -F, '{print $1}')
    VAL2=$(echo $line | awk -F, '{print $2}')
    if [ $VAL1 == $PREVVAL1 ]
    then
        SUM1=`expr $SUM1 + $VAL2`
        PREVVAL1=$VAL1
        echo $VAL1 $SUM1
    else
        SUM1=$VAL2
        PREVVAL1=$VAL1
    fi
done < /tmp/Pos.part
I want a one-liner that produces the required output and avoids the while loop: just add the numbers where the first column is the same and print each total on a single line.

awk -F, '{a[$1]+=$2} END{for (i in a) print i FS a[i]}' sample.txt
Output
QWE,100
XYZ,150
ABC,300
The first part is executed for each line and creates an associative array. The END part prints this array.
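Note that for (i in a) iterates in an unspecified order, so the summed lines can come out in any order. If you want them sorted by key, one option (my sketch, not part of the original answer) is to pipe through sort:
awk -F, '{a[$1]+=$2} END{for (i in a) print i FS a[i]}' sample.txt | sort -t, -k1,1 > output.txt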

It's an awk one-liner:
awk -F, -v OFS=, '{sum[$1]+=$2} END {for (key in sum) print key, sum[key]}' sample.txt > output.txt
sum[$1] += $2 creates an associative array whose keys are the first field and values are the corresponding sums.
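The -v OFS=, matters here: in print key, sum[key], the comma between print arguments emits the output field separator, so the result comes out comma-joined. A quick illustration:
awk -v OFS=, 'BEGIN {print "a", "b"}'   # prints a,b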

This can also be done easily enough in native bash. The following uses no external tools, no subshells and no pipelines, and is thus far faster (I'd place money on 100x the throughput on a typical/reasonable system) than your original code:
declare -A sums=( )
while IFS=, read -r name val; do
    sums[$name]=$(( ${sums[$name]:-0} + val ))
done
for key in "${!sums[@]}"; do
    printf '%s,%s\n' "$key" "${sums[$key]}"
done
If you want to, you can make this a one-liner:
declare -A sums=( ); while IFS=, read -r name val; do sums[$name]=$(( ${sums[$name]:-0} + val )); done; for key in "${!sums[@]}"; do printf '%s,%s\n' "$key" "${sums[$key]}"; done
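For example, with the loop saved as sumlines.sh (a hypothetical name) and fed the sample data, it prints one line per key, in unspecified order; note that declare -A requires bash 4.0 or newer:
$ bash sumlines.sh < sample.txt
ABC,300
QWE,100
XYZ,150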

Search equality in a certain field with AWK [duplicate]

I am trying to get the name out of /etc/passwd using awk to search only in the 5th field of every row, and then to cut some part of that line and print it out.
This is what I wrote, but it doesn't seem to work:
for iter in "$@"; do
    cat /etc/passwd | awk -F ":" '$5==$iter' | cut -d":" -f6
done
Concerning the delimiter syntax, everything should be fine, I guess?
So my problem is in the $5==$iter, I assume.
How can I change that $5==$iter so that, if the 5th field of a row contains my $iter var, it cuts and prints as described?
Sorry for the ignorance, I am a beginner :)
Thanks in advance.
See How do I use shell variables in an awk script?
-v should be used to pass shell variables into awk. Also, there's no reason to use either cat or cut here:
for iter in "$@"; do
    awk -F: -v iter="$iter" '$5==iter { print $6 }' </etc/passwd
done
As Charles Duffy commented, your code would be more efficient if it didn't need to read /etc/passwd every pass. And while this particular loop probably doesn't need to be optimized (after all, /etc/passwd is typically not that long and most OS's would cache the file anyway after the first read), it would be interesting to see an awk script read the file only once.
That said, here's another implementation where awk is only invoked once:
printf "%s\n" "$#" | awk -F: '
NR == FNR { etc_passwd[ $5 ] = $6; next }
{ print $0 , etc_passwd[ $0 ] }
' /etc/passwd /dev/stdin
The NR == FNR condition is an idiom that makes its associated action run only while the first file in the argument list (here, /etc/passwd) is being read.
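As a minimal sketch of the idiom (with hypothetical files keys.txt and data.txt): the first block runs only for keys.txt and records its first column; the bare $1 in seen pattern then prints the matching lines of data.txt:
awk 'NR==FNR { seen[$1]; next } $1 in seen' keys.txt data.txt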
You can also do everything in bash. For example:
#!/bin/bash
declare -A passwd                     # declare an associative array
# build the associative array "passwd" with the
# 5th field as the key and the 6th field as the value
while IFS=$':\n' read -a line; do     # emulate awk's field splitting
    [[ -n "${line[4]}" ]] || continue # skip blank keys
    passwd["${line[4]}"]=${line[5]}   # bash arrays are 0-indexed
done < /etc/passwd
for iter in "$@"; do
    if [ "${passwd[$iter]+x}" ]; then
        echo "${passwd[$iter]}"
    fi
done
(This version doesn't take multiple values for the 5th field into account.)
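The ${passwd[$iter]+x} expansion used above is the standard set-test: it yields x when the key exists (even if its value is empty), so only entries actually present in the array are printed. A tiny demonstration:
declare -A m=([a]="")
[ "${m[a]+x}" ] && echo "a is set"    # prints: a is set
[ "${m[b]+x}" ] || echo "b is unset"  # prints: b is unset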
Here is a better version that can handle blank values as well, like ./script.sh '':
while IFS=$':\n' read -a line; do
    for iter in "$@"; do
        if [ "$iter" == "${line[4]}" ]; then
            echo "${line[5]}"
            continue
        fi
    done
done < /etc/passwd
A pure awk solution could be:
#!/usr/bin/awk -f
BEGIN {
    FS = ":"
    for ( i = 1; i < ARGC; i++ ) {
        args[ARGV[i]] = 1
        delete ARGV[i]
    }
    ARGV[1] = "/etc/passwd"
}
($5 in args) { print $6 }
and you could call it as ./script.awk 'param1' 'param2' (the -f is already handled by the shebang).

Shell Script for combining 3 files

I have 3 files with below data
$cat File1.txt
Apple,May
Orange,June
Mango,July
$cat File2.txt
Apple,Jan
Grapes,June
$cat File3.txt
Apple,March
Mango,Feb
Banana,Dec
I require the below output file.
$ cat Output_file.txt
Apple,May|Jan|March
Orange,June
Mango,July|Feb
Grapes,June
Banana,Dec
The requirement: use the first column as a key; where the same key appears in column 1 across the files, the second-column values need to be joined with "|". If a key appears in only one file, its line should be printed as-is.
I have tried putting this in a while loop, but it gets slow as the file sizes increase. I want a simple shell-script solution.
This should work:
#!/bin/bash
for FRUIT in $( cat "$@" | cut -d "," -f 1 | sort | uniq )
do
    echo -ne "${FRUIT},"
    awk -F "," "\$1 == \"$FRUIT\" {printf(\"%s|\",\$2)}" "$@" | sed 's/.$/\'$'\n/'
done
Run it as :
$ ./script.sh File1.txt File2.txt File3.txt
A purely native-bash solution (calling no external tools, and thus limited only by the performance constraints of bash itself) might look like:
#!/usr/bin/env bash
case $BASH_VERSION in ''|[123].*) echo "ERROR: Bash 4 or newer required" >&2; exit 1;; esac
declare -A items=( )
for file in "$@"; do
    while IFS=, read -r key value; do
        items[$key]+="|$value"
    done <"$file"
done
for key in "${!items[@]}"; do
    value=${items[$key]}
    printf '%s,%s\n' "$key" "${value#'|'}"
done
...called as ./yourscript File1.txt File2.txt File3.txt
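The leading-separator trick is worth noting: every value is appended as "|$value", so each stored string carries one spurious leading "|", which ${value#'|'} strips at print time. In isolation:
value="|May|Jan|March"
printf '%s\n' "${value#'|'}"   # prints May|Jan|March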
This is fairly easily done with a single awk command:
awk 'BEGIN{FS=OFS=","} {a[$1] = a[$1] (a[$1] == "" ? "" : "|") $2}
END {for (i in a) print i, a[i]}' File{1,2,3}.txt
Orange,June
Banana,Dec
Apple,May|Jan|March
Grapes,June
Mango,July|Feb
If you want output in the same order as the keys first appear in the original files, then use this awk:
awk 'BEGIN{FS=OFS=","} !($1 in a) {b[++n] = $1}
{a[$1] = a[$1] (a[$1] == "" ? "" : "|") $2}
END {for (i=1; i<=n; i++) print b[i], a[b[i]]}' File{1,2,3}.txt
Apple,May|Jan|March
Orange,June
Mango,July|Feb
Grapes,June
Banana,Dec

Bash loop that calculates the sums of columns

I'm trying to write a loop in Bash that prints the sum of every column in a file. These columns are separated by tabs. What I have so far is this:
cols() {
    count=$(grep -c $'\t' $1)
    for n in $(seq 1 $count); do
        cat $FILE | awk '{sum+=$1} END{print "sum=",sum}'
    done
}
But this only prints out the sum of the first column. How can I do this for every column?
Your approach does the job, but it is somewhat overkill: you are counting the number of columns, then catting the file and calling awk, when awk alone can do all of it:
awk -F"\t" '{for(i=1; i<=NF; i++) sum[i]+=$i} END {for (i in sum) print i, sum[i]}' file
This takes advantage of NF, which stores the number of fields in a line (which is what you were trying to get with count=$(grep -c $'\t' $1)). Then it is just a matter of looping through the fields and adding each one into the array, so that sum[i] contains the sum for column i. Finally, it loops through the result and prints the values.
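One caveat (my note, not the original answer's): for (i in sum) iterates in an unspecified order, so the column sums may print out of order. If every line has the same number of fields, NF is still valid in the END block (it holds the field count of the last line read), and you can loop numerically instead:
awk -F"\t" '{for(i=1;i<=NF;i++) sum[i]+=$i} END{for(i=1;i<=NF;i++) print i, sum[i]}' file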
Why isn't your approach summing a given column? Because when you say:
for n in $(seq 1 $count); do
    cat $FILE | awk '{sum+=$1} END{print "sum=",sum}'
done
You are always using $1 as the element to sum. Instead, you should pass the value $n to awk by using something like:
awk -v col="$n" '{sum+=$col} END{print "sum=",sum}' $FILE # no need to cat $FILE
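This works because $ in awk is an operator that takes any numeric expression, so $col means "field number col". A quick sanity check:
printf '10\t20\t30\n' | awk -v col=2 '{sum+=$col} END{print "sum=",sum}'   # prints sum= 20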
If you want a bash builtin only solution, this would work:
declare -i i l
declare -ai la sa=()
while IFS=$'\t' read -ra la; do
    for ((l=${#la[@]}, i=0; i<l; sa[i]+=la[i], ++i)); do :; done
done < file
(IFS=$'\t'; echo "${sa[*]}")
The performance of this should be decent, but quite a bit slower than something like awk.

comparing two files and printing lines with similar strings in one file [duplicate]

I have two files which I need to compare, and if the first column in file1 matches part of the first column in file2, then add them side by side in file3. Below is an example:
File1:
123123,ABC,2016-08-18,18:53:53
456456,ABC,2016-08-18,18:53:53
789789,ABC,2016-08-18,18:53:53
123123,ABC,2016-02-15,12:46:22
File2:
789789_TTT,567774,223452
123123_TTT,121212,343434
456456_TTT,323232,223344
output:
123123,ABC,2016-08-18,18:53:53,123123_TTT,121212,343434
456456,ABC,2016-08-18,18:53:53,456456_TTT,323232,223344
789789,ABC,2016-08-18,18:53:53,789789_TTT,567774,223452
123123,ABC,2016-02-15,18:53:53,123123_TTT,121212,343434
Thanks..
Using GNU awk:
$ awk -F, 'NR==FNR{a[gensub(/([^_]*)_.*/,"\\1","g",$1)]=$0;next} $1 in a{print $0","a[$1]}' file2 file1
123123,ABC,2016-08-18,18:53:53,123123_TTT,121212,343434
456456,ABC,2016-08-18,18:53:53,456456_TTT,323232,223344
789789,ABC,2016-08-18,18:53:53,789789_TTT,567774,223452
123123,ABC,2016-02-15,12:46:22,123123_TTT,121212,343434
Explanation:
NR==FNR {                                    # while reading the first file (file2)
    a[gensub(/([^_]*)_.*/,"\\1","g",$1)]=$0  # store the line, keyed by the part before "_"
    next
}
$1 in a {                                    # if the key from the second file is in the array
    print $0","a[$1]                         # output the joined line
}
This awk solution matches keys formed from file2 against column 1 of file1. It should also work on Solaris using /usr/xpg4/bin/awk. (I took the liberty of assuming the last line of the OP's output has a typo.)
file1=$1
file2=$2
AWK=awk
[[ $(uname) == SunOS ]] && AWK=/usr/xpg4/bin/awk
$AWK -F',' '
BEGIN{OFS=","}
# file2 key is part of $1 till underscore
FNR==NR{key=substr($1,1,index($1,"_")-1); f2[key]=$0; next}
$1 in f2 {print $0, f2[$1]}
' $file2 $file1
Tested output:
123123,ABC,2016-08-18,18:53:53,123123_TTT,121212,343434
456456,ABC,2016-08-18,18:53:53,456456_TTT,323232,223344
789789,ABC,2016-08-18,18:53:53,789789_TTT,567774,223452
123123,ABC,2016-02-15,12:46:22,123123_TTT,121212,343434
Pure bash solution
file1=$1
file2=$2
while IFS= read -r line; do
    key=${line%%_*}
    f2[key]=$line
done <$file2
while IFS= read -r line; do
    key=${line%%,*}
    [[ -n ${f2[key]} ]] || continue
    echo "$line,${f2[key]}"
done <$file1
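A caveat (my note, not part of the original answer): f2 is never declared associative, so bash treats it as an indexed array and evaluates each subscript arithmetically; that happens to work here only because the keys are numeric. For arbitrary string keys, declare it first, before the first loop:
declare -A f2   # associative array: subscripts are treated as strings, not arithmetic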

Using Array With Awk [duplicate]

I am using an array of values, and I want to look for those values using awk and output to file. In the awk line if I replace the first "$i" with the numbers themselves, the script works, but when I try to use the variable "$i" the script no longer works.
declare -a arr=("5073770" "7577539")
for i in "${arr[@]}"
do
    echo "$i"
    awk -F'[;\t]' '$2 ~ "$i"{sub(/DP=/,"",$15); print $15}' $INPUT >> "$i"
done
The file I'm looking at contains many lines like the following:
chr12 3356475 . C A 76.508 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=3;CIGAR=1X;DP=3;DPB=3;DPRA=0;EPP=9.52472;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=8.76405;PAIRED=0;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=111;QR=0;RO=0;RPP=9.52472;RPPR=0;RUN=1;SAF=3;SAP=9.52472;SAR=0;SRF=0;SRP=0;SRR=0;TYPE=snp GT:DP:RO:QR:AO:QA:GL 1/1:3:0:0:3:111:-10,-0.90309,0
Pass the value $i to awk using -v:
awk -F'[;\t]' -v var="$i" '$2 ~ var{sub(/DP=/,"",$15); print $15}' $INPUT >> "$i"
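Note that ~ is a regex match, not string equality, so $2 ~ var succeeds whenever $2 contains var as a (regex) substring, which matches the asker's "contains my $iter var" requirement. For instance:
echo 'x;pre5073770post' | awk -F'[;\t]' -v var=5073770 '$2 ~ var {print "match"}'   # prints match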
awk will have no idea what the value of the shell's $i is unless you explicitly pass it into awk as a variable:
awk -F'[;\t]' -v "VAR=${i}" '$2 ~ VAR {....
I expect the result you see is because i is undefined inside awk and treated as empty/zero, which makes your test effectively '$2 ~ $0 {...'.
You can avoid awk and do this in BASH itself:
arr=("5073770" "7577539" "3356475")
for i in "${arr[@]}"; do
    while IFS='['$'\t'';]' read -ra arr; do
        [[ ${arr[1]} == *$i* ]] && { s="${arr[14]}"; echo "${s#DP=}"; }
    done < "$INPUT"
done
