How to combine columns that have the same headers within 1 file using Awk or Bash - bash

I would like to know how to combine columns with duplicate headers in a file using bash/sed/awk.
x y x y
s1 3 4 6 10
s2 3 9 10 7
s3 7 1 3 2
to :
x y
s1 9 14
s2 13 16
s3 10 3

$ cat file
x y x y
s1 3 4 6 10
s2 3 9 10 7
s3 7 1 3 2
$ cat tst.awk
NR==1 {
for (i=1;i<=NF;i++) {
flds[$i] = flds[$i] " " i+1
}
printf "%-3s",""
for (hdr in flds) {
printf "%3s",hdr
}
print ""
next
}
{
printf "%-3s",$1
for (hdr in flds) {
n = split(flds[hdr],fldNrs)
sum = 0
for (i=1; i<=n; i++) {
sum += $(fldNrs[i])
}
printf "%3d",sum
}
print ""
}
$ awk -f tst.awk file
x y
s1 9 14
s2 13 16
s3 10 3
$ time awk -f ./tst.awk file
x y
s1 9 14
s2 13 16
s3 10 3
real 0m0.265s
user 0m0.030s
sys 0m0.108s
Adjust the printf lines in the obvious ways for different output formatting if you like.
Here's the bash equivalent in response to the comments elsethread. Do NOT use this, the awk solution is the right one, this is just to show how you should write it in bash IF you wanted to do that for some inexplicable reason:
$ cat tst.sh
declare -A flds
while IFS= read -r rec
do
lineNr=$(( lineNr + 1 ))
set -- $rec
if (( lineNr == 1 ))
then
fldNr=1
for fld
do
fldNr=$(( fldNr + 1 ))
flds[$fld]+=" $fldNr"
done
printf "%-3s" ""
for hdr in "${!flds[#]}"
do
printf "%3s" "$hdr"
done
printf "\n"
else
printf "%-3s" "$1"
for hdr in "${!flds[#]}"
do
fldNrs=( ${flds[$hdr]} )
sum=0
for fldNr in "${fldNrs[#]}"
do
eval val="\$$fldNr"
sum=$(( sum + val ))
done
printf "%3d" "$sum"
done
printf "\n"
fi
done < "$1"
$
$ time ./tst.sh file
x y
s1 9 14
s2 13 16
s3 10 3
real 0m0.062s
user 0m0.031s
sys 0m0.046s
Note that it runs in roughly the same order of magnitude duration as the awk script (see comments elsethread). Caveat - I never write bash scripts for processing text files so I'm not claiming the above bash script is perfect, just an example of how to approach it in bash for comparison with the other script in this thread that I claimed should be rewritten!

This not a one line. You can do it using Bash v4, Bash's dictonaries, and some shell tools.
Execute the script below with the name of the file to process a parameter
bash script_below.sh your_file
Here is the script:
declare -A coltofield
headerdone=0
# Take the first line of the input file and extract all fields
# and their position. Start with position value 2 because of the
# format of the following lines
while read line; do
colnum=$(echo $line | cut -d "=" -f 1)
field=$(echo $line | cut -d "=" -f 2)
coltofield[$colnum]=$field
done < <(head -n 1 $1 | sed -e 's/^[[:space:]]*//;' -e 's/[[:space:]]*$//;' -e 's/[[:space:]]\+/\n/g;' | nl -v 2 -n ln | sed -e 's/[[:space:]]\+/=/g;')
# Read the rest of the file starting with the second line
while read line; do
declare -A computation
declare varname
# Turn the line in key value pair. The key is the position of
# the value in the line
while read value; do
vcolnum=$(echo $value | cut -d "=" -f 1)
vvalue=$(echo $value | cut -d "=" -f 2)
# The first value is the line variable name
# (s1, s2)
if [[ $vcolnum == "1" ]]; then
varname=$vvalue
continue
fi
# Get the name of the field by the column
# position
field=${coltofield[$vcolnum]}
# Add the value to the current sum for this field
computation[$field]=$((computation[$field]+${vvalue}))
done < <(echo $line | sed -e 's/^[[:space:]]*//;' -e 's/[[:space:]]*$//;' -e 's/[[:space:]]\+/\n/g;' | nl -n ln | sed -e 's/[[:space:]]\+/=/g;')
if [[ $headerdone == "0" ]]; then
echo -e -n "\t"
for key in ${!computation[#]}; do echo -n -e "$key\t" ; done; echo
headerdone=1
fi
echo -n -e "$varname\t"
for value in ${computation[#]}; do echo -n -e "$value\t"; done; echo
computation=()
done < <(tail -n +2 $1)

Yet another AWK alternative:
$ cat f
x y x y
s1 3 4 6 10
s2 3 9 10 7
s3 7 1 3 2
$ cat f.awk
BEGIN {
OFS="\t";
}
NR==1 {
#need header for 1st column
for(f=NF; f>=1; --f)
$(f+1) = $f;
$1="";
for(f=1; f<=NF; ++f)
fld2hdr[f]=$f;
}
{
for(f=1; f<=NF; ++f)
if($f ~ /^[0-9]/)
colValues[fld2hdr[f]]+=$f;
else
colValues[fld2hdr[f]]=$f;
for (i in colValues)
row = row colValues[i] OFS;
print row;
split("", colValues);
row=""
}
$ awk -f f.awk f
x y
s1 9 14
s2 13 16
s3 10 3

$ awk 'BEGIN{print " x y"} a=$2+$4, b=$3+$5 {print $1, a, b}' file
x y
s1 9 14
s2 13 16
s3 10 3
No doubt there is a better way to display the heading but my awk is a little sketchy.

Here's a Perl solution, just for fun:
cat table.txt | perl -e'#h=grep{$_}split/\s+/,<>;while(#l=grep{$_}split/\s+/,<>){for$i(1..$#l){$t{$l[0]}{$h[$i-1]}+=$l[$i]}};printf " %s\n",(join" ",sort keys%{$t{(keys%t)[0]}});for$h(sort keys%t){printf"$h %s\n",(join " ",map{sprintf"%2d",$_}#{$t{$h}}{sort keys%{$t{$h}}})};'

Related

How can I use 'echo' output as an operand for the 'seq' command within a terminal?

I have an excercise where I need to sum together every digit up until a given number like this:
Suppose I have the number 12, I need to do 1+2+3+4+5+6+7+8+9+1+0+1+1+1+2.
(numbers past 9 are split up into their separate digits eg. 11 = 1+1, 234 = 2+3+4, etc.)
I know I can just use:
seq -s '' 12
which outputs 123456789101112 and then add them all together with '+' in between and then pipe to 'bc' BUT I have to specifically do :
echo 12 | ...
as the first step (because the online IDE fills it in as the unchangeable first step for every testcase) and when I do this I start to have problems with seq
I tried
echo 12 | seq -s '' $1
### or just ###
echo 12 | seq -s ''
but can't get it to work as this just gives back a missing operand error for seq (because I'm in the terminal, not a script and the '12' isn't just assigned to $1 I assume), any recommendations on how to avoid it or how to get seq to interpret the 12 from echo as operand or alternative ways to go?
seq -s '' $(cat)
full solution:
echo "12" | seq -s '' $(cat) | sed 's/./&+/g; s/$/0/' | bc
Or
echo 12 | { echo $(( $({ seq -s '' $(< /dev/stdin); echo; } | sed -E 's/([[:digit:]])/\1+/g; s/$/0/') )); }
without sed:
d=$(echo 12 | { seq -s '' $(< /dev/stdin); echo; }); echo $(( "${d//?/&+}0" ))
echo 12 | awk '{
cnt=0
for(i=1;i<=$1;i++) {
cnt+=i
printf("%s%s",i,i<$1?"+":"=")
}
print cnt
}'
Prints:
1+2+3+4+5+6+7+8+9+10+11+12=78
If it is supposed to be just the digits added up:
echo 12 | awk '{s=""
for(i=1;i<=$1;i++) s=s i
split(s,ch,"")
for(i=1;i<=length(ch); i++) cnt+=ch[i]
print cnt
}'
51
Or a POSIX pipeline:
$ echo 12 | seq -s '' "$(cat)" | sed -E 's/([0-9])/\1+/g; s/$/0/' | bc
51

Bash - adding numbers from array matrix

When adding numbers in a matrix, I get this error:
line 271: 1 2 3 4: syntax error in expression (error token is "2 3 4")
My add function:
add()
{
#Reading matrices into temp files
while read line1 <&3 && read line2 <&4
do
echo "$line1" | tr "\n" "\t" >> "temp50"
echo "$line2" | tr "\n" "\t" >> "temp60"
done 3<$temp1 4<$fileTwo
echo >>"temp50"
echo >>"temp60"
cat "temp60" >> "temp50"
i=1
x=1
while [ $i -le $totalNum ]
do
sum=0
cut -f $i "temp50" > "temp55"
while read num
do
sum=$(($sum + $num))
done <"temp55"
echo "$sum" | tr "\n" "\t" >> "temp65"
#try and remove hanging tab
if [[ "$x" -eq "$numcolOne" ]]
then
rev "temp65" > "temp222"
cat "temp222" | cut -c 1- >"temp333"
rev "temp333">"temp65"
x=0
fi
i=$((i+1))
x=$((x+1))
done
Matrix array (temp1):
1 2 3 4
5 6 7 8
Function is supposed to add the matrix array in the temp1 file; sample output:
2 4 6 8
10 12 14 16
Appreciate anyone's help!

Run script on pipeline output variable number of times

I have a command that appends a computed column to stdout that I would like to apply a variable N number of times.
For example if my input was 'hello\nworld\n' and I wanted to append a column of 0, N=3 times I could type the following:
echo -e 'hello\nworld' | sed 's/$/ 0/' | sed 's/$/ 0/' | sed 's/$/ 0/'
I've been trying stupid ideas like:
echo -e 'hello\nworld' | (for i in $(seq 1 $N); do echo $(cat) 0; done)
and
echo -e 'hello\nworld' | (for i in $(seq 1 $N); do sed 's/$/ 0/'; done)
but clearly these are not chaining the pipeline.
Any ideas?
This is easily done with recursion:
repeat() {
count="$1"
shift
if [ "$count" -ge 1 ]
then
"$#" | repeat "$((count-1))" "$#"
else
cat
fi
}
Examples:
$ echo foo | repeat 0 sed 's/$/ 0/'
foo
$ echo foo | repeat 1 sed 's/$/ 0/'
foo 0
$ echo foo | repeat 3 sed 's/$/ 0/'
foo 0 0 0
So you want to append a value of 0 a multiple N = 3 times:
awk -v value="0" -v N=3 \
'{printf "%s", $0; for (i = 0; i < N; i++) printf " %s", value; print "" }'
Pass the value and the repeat count as variables to awk. For each line, print the input; then add N copies of the value; then emit a newline.
You could use OFS (the output field separator) in place of blanks to separate the output in the loop:
printf "%s%s", OFS, value;

how can i echo a line once , then the rest keep them the way they are in unix bash?

I have the following comment:
(for i in 'cut -d "," -f1 file.csv | uniq`; do var =`grep -c $i file.csv';if (($var > 1 )); then echo " you have the following repeated numbers" $i ; fi ; done)
The output that i get is : You have the following repeated numbers 455
You have the following repeated numbers 879
You have the following repeated numbers 741
what I want is the following output:
you have the following repeated numbers:
455
879
741
Try moving the echo of the header line before the for-loop :
(echo " you have the following repeated numbers"; for i in 'cut -d "," -f1 file.csv | uniq`; do var =`grep -c $i file.csv';if (($var > 1 )); then echo $i ; fi ; done)
Or only print the header once :
(header=" you have the following repeated numbers\n"; for i in 'cut -d "," -f1 file.csv | uniq`; do var =`grep -c $i file.csv';if (($var > 1 )); then echo -e $header$i ; header=""; fi ; done)
Well, here's what I came to:
1) generated input for testing
for x in {1..35},aa,bb ; do echo $x ; done > file.csv
for x in {21..48},aa,bb ; do echo $x ; done >> file.csv
for x in {32..63},aa,bb ; do echo $x ; done >> file.csv
unsort file.csv > new.txt ; mv new.txt file.csv
2) your line ( corrected syntax errors)
dtpwmbp:~ pwadas$ for i in $(cut -d "," -f1 file.csv | uniq);
do var=`grep -c $i file.csv`; if [ "$var" -ge 1 ] ;
then echo " you have the following repeated numbers" $i ; fi ; done | head -n 10
you have the following repeated numbers 8
you have the following repeated numbers 41
you have the following repeated numbers 18
you have the following repeated numbers 34
you have the following repeated numbers 3
you have the following repeated numbers 53
you have the following repeated numbers 32
you have the following repeated numbers 33
you have the following repeated numbers 19
you have the following repeated numbers 7
dtpwmbp:~ pwadas$
3) my line:
dtpwmbp:~ pwadas$ echo "you have the following repeated numbers:";
for i in $(cut -d "," -f1 file.csv | uniq); do var=`grep -c $i file.csv`;
if [ "$var" -ge 1 ] ; then echo $i ; fi ; done | head -n 10
you have the following repeated numbers:
8
41
18
34
3
53
32
33
19
7
dtpwmbp:~ pwadas$
I added quotes, changed if() to [..] expression, and finally moved description sentence out of loop. Number of occurences tested is digit near "-ge" condition. If it is "1", then numbers which appear once or more are printed. Note, that in this expression, if file contains e.g. numbers
8
12
48
then "8" is listed in output as appearing twice. with "-ge 2", if no digits appear more than once, no output (except heading) is printed.

How can I align the columns of tables in Bash?

I want to format text as a table. I tried echoing with a '\t' separator, but it was misaligned.
Desired output:
a very long string.......... 112232432 anotherfield
a smaller string 123124343 anotherfield
Use the column command:
column -t -s' ' filename
printf is great, but people forget about it.
$ for num in 1 10 100 1000 10000 100000 1000000; do printf "%10s %s\n" $num "foobar"; done
1 foobar
10 foobar
100 foobar
1000 foobar
10000 foobar
100000 foobar
1000000 foobar
$ for((i=0;i<array_size;i++));
do
printf "%10s %10d %10s" stringarray[$i] numberarray[$i] anotherfieldarray[%i]
done
Notice I used %10s for strings. %s is the important part. It tells it to use a string. The 10 in the middle says how many columns it is to be. %d is for numerics (digits).
See man 1 printf for more info.
function printTable()
{
local -r delimiter="${1}"
local -r data="$(removeEmptyLines "${2}")"
if [[ "${delimiter}" != '' && "$(isEmptyString "${data}")" = 'false' ]]
then
local -r numberOfLines="$(wc -l <<< "${data}")"
if [[ "${numberOfLines}" -gt '0' ]]
then
local table=''
local i=1
for ((i = 1; i <= "${numberOfLines}"; i = i + 1))
do
local line=''
line="$(sed "${i}q;d" <<< "${data}")"
local numberOfColumns='0'
numberOfColumns="$(awk -F "${delimiter}" '{print NF}' <<< "${line}")"
# Add Line Delimiter
if [[ "${i}" -eq '1' ]]
then
table="${table}$(printf '%s#+' "$(repeatString '#+' "${numberOfColumns}")")"
fi
# Add Header Or Body
table="${table}\n"
local j=1
for ((j = 1; j <= "${numberOfColumns}"; j = j + 1))
do
table="${table}$(printf '#| %s' "$(cut -d "${delimiter}" -f "${j}" <<< "${line}")")"
done
table="${table}#|\n"
# Add Line Delimiter
if [[ "${i}" -eq '1' ]] || [[ "${numberOfLines}" -gt '1' && "${i}" -eq "${numberOfLines}" ]]
then
table="${table}$(printf '%s#+' "$(repeatString '#+' "${numberOfColumns}")")"
fi
done
if [[ "$(isEmptyString "${table}")" = 'false' ]]
then
echo -e "${table}" | column -s '#' -t | awk '/^\+/{gsub(" ", "-", $0)}1'
fi
fi
fi
}
function removeEmptyLines()
{
local -r content="${1}"
echo -e "${content}" | sed '/^\s*$/d'
}
function repeatString()
{
local -r string="${1}"
local -r numberToRepeat="${2}"
if [[ "${string}" != '' && "${numberToRepeat}" =~ ^[1-9][0-9]*$ ]]
then
local -r result="$(printf "%${numberToRepeat}s")"
echo -e "${result// /${string}}"
fi
}
function isEmptyString()
{
local -r string="${1}"
if [[ "$(trimString "${string}")" = '' ]]
then
echo 'true' && return 0
fi
echo 'false' && return 1
}
function trimString()
{
local -r string="${1}"
sed 's,^[[:blank:]]*,,' <<< "${string}" | sed 's,[[:blank:]]*$,,'
}
SAMPLE RUNS
$ cat data-1.txt
HEADER 1,HEADER 2,HEADER 3
$ printTable ',' "$(cat data-1.txt)"
+-----------+-----------+-----------+
| HEADER 1 | HEADER 2 | HEADER 3 |
+-----------+-----------+-----------+
$ cat data-2.txt
HEADER 1,HEADER 2,HEADER 3
data 1,data 2,data 3
$ printTable ',' "$(cat data-2.txt)"
+-----------+-----------+-----------+
| HEADER 1 | HEADER 2 | HEADER 3 |
+-----------+-----------+-----------+
| data 1 | data 2 | data 3 |
+-----------+-----------+-----------+
$ cat data-3.txt
HEADER 1,HEADER 2,HEADER 3
data 1,data 2,data 3
data 4,data 5,data 6
$ printTable ',' "$(cat data-3.txt)"
+-----------+-----------+-----------+
| HEADER 1 | HEADER 2 | HEADER 3 |
+-----------+-----------+-----------+
| data 1 | data 2 | data 3 |
| data 4 | data 5 | data 6 |
+-----------+-----------+-----------+
$ cat data-4.txt
HEADER
data
$ printTable ',' "$(cat data-4.txt)"
+---------+
| HEADER |
+---------+
| data |
+---------+
$ cat data-5.txt
HEADER
data 1
data 2
$ printTable ',' "$(cat data-5.txt)"
+---------+
| HEADER |
+---------+
| data 1 |
| data 2 |
+---------+
REF LIB at: https://github.com/gdbtek/linux-cookbooks/blob/master/libraries/util.bash
To have the exact same output as you need, you need to format the file like this:
a very long string..........\t 112232432\t anotherfield\n
a smaller string\t 123124343\t anotherfield\n
And then using:
$ column -t -s $'\t' FILE
a very long string.......... 112232432 anotherfield
a smaller string 123124343 anotherfield
It's easier than you wonder.
If you are working with a separated-by-semicolon file and header too:
$ (head -n1 file.csv && sort file.csv | grep -v <header>) | column -s";" -t
If you are working with an array (using tab as separator):
for((i=0;i<array_size;i++));
do
echo stringarray[$i] $'\t' numberarray[$i] $'\t' anotherfieldarray[$i] >> tmp_file.csv
done;
cat file.csv | column -t
awk solution that deals with stdin
Since column is not POSIX, maybe this is:
mycolumn() (
file="${1:--}"
if [ "$file" = - ]; then
file="$(mktemp)"
cat > "${file}"
fi
awk '
FNR == 1 { if (NR == FNR) next }
NR == FNR {
for (i = 1; i <= NF; i++) {
l = length($i)
if (w[i] < l)
w[i] = l
}
next
}
{
for (i = 1; i <= NF; i++)
printf "%*s", w[i] + (i > 1 ? 1 : 0), $i
print ""
}
' "$file" "$file"
if [ "$1" = - ]; then
rm "$file"
fi
)
Test:
printf '12 1234 1
12345678 1 123
1234 123456 123456
' > file
Test commands:
mycolumn file
mycolumn <file
mycolumn - <file
Output for all:
12 1234 1
12345678 1 123
1234 123456 123456
See also:
Using awk to align columns in text file?
AWK: go through the file twice, doing different tasks
I am not sure where you were running this, but the code you posted would not produce the output you gave, at least not in the Bash version that I'm familiar with.
Try this instead:
stringarray=('test' 'some thing' 'very long long long string' 'blah')
numberarray=(1 22 7777 8888888888)
anotherfieldarray=('other' 'mixed' 456 'data')
array_size=4
for((i=0;i<array_size;i++))
do
echo ${stringarray[$i]} $'\x1d' ${numberarray[$i]} $'\x1d' ${anotherfieldarray[$i]}
done | column -t -s$'\x1d'
Note that I'm using the group separator character (0x1D) instead of tab, because if you are getting these arrays from a file, they might contain tabs.
Just in case someone wants to do that in PHP, I posted a gist on GitHub:
https://gist.github.com/redestructa/2a7691e7f3ae69ec5161220c99e2d1b3
Simply call:
$output = $tablePrinter->printLinesIntoArray($items, ['title', 'chilProp2']);
You may need to adapt the code if you are using a PHP version older than 7.2.
After that, call echo or writeLine depending on your environment.
The below code has been tested and does exactly what is requested in the original question.
Parameters:
%30s Column of 30 char and text right align.
%10d integer notation, %10s will also work. \
stringarray[0]="a very long string.........."
# 28Char (max length for this column)
numberarray[0]=1122324333
# 10digits (max length for this column)
anotherfield[0]="anotherfield"
# 12Char (max length for this column)
stringarray[1]="a smaller string....."
numberarray[1]=123124343
anotherfield[1]="anotherfield"
printf "%30s %10d %13s" "${stringarray[0]}" ${numberarray[0]} "${anotherfield[0]}"
printf "\n"
printf "%30s %10d %13s" "${stringarray[1]}" ${numberarray[1]} "${anotherfield[1]}"
# a var string with spaces has to be quoted
printf "\n Next line will fail \n"
printf "%30s %10d %13s" ${stringarray[0]} ${numberarray[0]} "${anotherfield[0]}"
a very long string.......... 1122324333 anotherfield
a smaller string..... 123124343 anotherfield
column -t skips empty fields when a line starts with a delimiter character or when there are two or more consecutive delimiter characters:
$ printf %s\\n a,b,c a,,c ,b,c|column -s, -t
a b c
a c
b c
Therefore I use this awk function instead (it requires gawk because it uses arrays of arrays):
$ tab(){ awk '{if(NF>m)m=NF;for(i=1;i<=NF;i++){a[NR][i]=$i;l=length($i);if(l>b[i])b[i]=l}}END{for(h in a){for(i=1;i<=m;i++)printf("%-"(b[i]+n)"s",a[h][i]);print""}}' n="${2-1}" "${1+FS=$1}"|sed 's/ *$//';}
$ printf %s\\n a,b,c a,,c ,b,c|tab ,
a b c
a c
b c
if you data doesn't contain the equal sign ("=") anywhere in it, you can use that as a shell-friendly delimiter for column without having to escape anything -
by modifying FS to be either a tab ("\t") plus any amount of spaces (" ") or tabs ("\t") on either side of it, or a contiguous chunk of 2 or more spaces, it also allows the input data to have any amount of single space within each field
echo "${inputdata2}" |
mawk NF=NF OFS== FS=' + |[ \t]*\t[ \t]*' |
column -s= -t
a very long string.......... 112232432 anotherfield
a smaller string 123124343 anotherfield
if the data does contain the equal sign, use a combo sep that's close to impossible to exist in typical data :
gawk -e NF=NF OFS='\301\372\5' FS=' + |[ \t]*\t[ \t]*' |
LC_ALL=C column -s$'\301\372\5' -t
a very long string.......... 112232432 anotherfield
a smaller string 123124343 anotherfield
and if ur data only has 2 columns, and you have ballpark sense of how wide the first field is, you can use this \r trick for nice on-screen formatting (but those don't become runs of spaces if u need to send it down the pipe) :
# each \t is 8-spaces at console terminal
mawk NF=2 FS=' + |[ \t]*\t[ \t]*' OFS='\r\t\t\t\t'
a very long string.......... 112232432
a smaller string 123124343

Resources