Awk separate column output - bash

What I want to do is create a table (maximum = 4 rows) from a one-column file using awk.
I have a file:
1 a,b
2 r,i
3 w
4 r,t
5 o,s
6 y
The desired output:
1 a,b 5 o,s
2 r,i 6 y
3 w
4 r,t
So far, I have just been separating the rows into different files and using "paste" to merge them into one. I would appreciate any more sophisticated method.

$ cat tst.awk
BEGIN {
    numRows = 4
    OFS = "\t"
}
{
    # map line NR to a row (1..numRows); start a new column every numRows lines
    rowNr = (NR - 1) % numRows + 1
    if (rowNr == 1) {
        numCols++
    }
    val[rowNr,numCols] = $0
}
END {
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s%s", val[rowNr,colNr], (colNr<numCols ? OFS : ORS)
        }
    }
}
$
$ awk -f tst.awk file
1 a,b 5 o,s
2 r,i 6 y
3 w
4 r,t

A combination of awk to join the lines and column to pretty-print them:
awk -v max=4 '
{ i = (NR-1) % max + 1; line[i] = line[i] "\t" $0 }
END { for(i=1; i<=max && i<=length(line); i++) print line[i] }' file | column -t -s $'\t'
Output:
1 a,b 5 o,s
2 r,i 6 y
3 w
4 r,t
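
Note that length(line) on an array is a gawk extension; on a strict POSIX awk, a minimal sketch of the same idea bounds the loop with the in operator instead:
awk -v max=4 '
{ i = (NR-1) % max + 1; line[i] = line[i] "\t" $0 }
END { for (i=1; i<=max && (i in line); i++) print line[i] }' file | column -t -s $'\t'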

Another:
$ awk ' {
i=(NR%4) # using modulo indexed array
a[i]=a[i] (a[i]==""?"":" ") $0 # append to it
}
END { # in the END
for(i=1;i<=4;i++) # loop all indexes in order
print a[i%4] # dont forget the modulo
}' file
1 a,b 5 o,s
2 r,i 6 y
3 w
4 r,t
Naturally it will be ugly if there are missing columns.
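One way to tame that (a sketch along the same lines, not part of the original answer: it pads short rows with a "-" placeholder and tab-joins the cells so column -t can align them):
awk '{
    i = (NR % 4)
    a[i] = a[i] (a[i]=="" ? "" : "\t") $0
}
END {
    cols = int((NR + 3) / 4)              # cells in the widest row
    for (i=1; i<=4; i++) {
        row = a[i%4]
        have = split(row, tmp, "\t")      # count cells already in this row
        while (have++ < cols) row = row "\t-"
        print row
    }
}' file | column -t -s $'\t'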

Here is another awk approach:
awk '
{
    A[++c] = $0
}
END {
    m = int((c + 3) / 4)          # number of columns needed for c lines
    for (i = 1; i <= 4; i++) {
        printf "%s", A[i]
        for (j = 1; j < m; j++)
            if ((i + j*4) <= c)   # skip cells past the last line
                printf "\t%s", A[i+(j*4)]
        printf "\n"
    }
}
' file

You can combine split and paste:
split -l 4 file part- && paste part-*
-l <number> tells split to break the file into smaller files of <number> lines each.
part- is a prefix of our choice for the new files. Note that they are named in alphabetical order, e.g. part-aa, part-ab, etc., so paste joins them in the expected order.
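Concretely, with the six-line sample file from the question, split produces two chunk files which paste then merges row by row:
$ split -l 4 file part-
$ head part-*
==> part-aa <==
1 a,b
2 r,i
3 w
4 r,t

==> part-ab <==
5 o,s
6 y
$ paste part-aa part-ab
1 a,b   5 o,s
2 r,i   6 y
3 w
4 r,t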

Related

How to sort a specified column in Linux

This is my two-column sequence file. I want to combine the columns into one and number the entries in Linux, but I don't know how to write the shell script to handle them.
GGCTGCAGCTAACAGGTGA TACTCGGGGAGCTGCGG
CCTCTGGCTCGCAGGTCATGGC CAGCGTCTTGCGCTCCT
GCTGCAGCTACATGGTGTCG CGCTCCGCTTCTCTCTACG
The sorted results are as follows (first column first, second column second, split by "\t"):
1 GGCTGCAGCTAACAGGTGA
2 CCTCTGGCTCGCAGGTCATGGC
3 GCTGCAGCTACATGGTGTCG
4 TACTCGGGGAGCTGCGG
5 CAGCGTCTTGCGCTCCT
6 CGCTCCGCTTCTCTCTACG
What should I do?
You can do it easily in awk by storing the second column in an array and then outputting the saved values in the END rule, e.g.
awk '
{
    print ++n, $1           # output first column
    a[n] = $2               # save second column in array
}
END {
    j = n + 1               # j is the next counter
    for (i=1; i<=n; i++)    # loop 1 - n
        print j++, a[i]     # output j and array value
}
' file.txt
Example Use/Output
With your input in file.txt, you can just copy/middle-mouse-paste the above into an xterm with file.txt in the current directory, e.g.
$ awk '
> {
> print ++n, $1 # output first column
> a[n] = $2 # save second column in array
> }
> END {
> j = n + 1 # j is next counter
> for (i=1;i<=n;i++) # loop 1 - n
> print j++, a[i] # output j and array value
> }
> ' file.txt
1 GGCTGCAGCTAACAGGTGA
2 CCTCTGGCTCGCAGGTCATGGC
3 GCTGCAGCTACATGGTGTCG
4 TACTCGGGGAGCTGCGG
5 CAGCGTCTTGCGCTCCT
6 CGCTCCGCTTCTCTCTACG
Or as a 1-liner:
$ awk '{print ++n, $1; a[n]=$2} END {j=n+1; for (i=1;i<=n;i++) print j++, a[i]}' file.txt
1 GGCTGCAGCTAACAGGTGA
2 CCTCTGGCTCGCAGGTCATGGC
3 GCTGCAGCTACATGGTGTCG
4 TACTCGGGGAGCTGCGG
5 CAGCGTCTTGCGCTCCT
6 CGCTCCGCTTCTCTCTACG
If you would like to create an awk script from the above, you can simply create the script file (say cmbcols.awk) as:
{
    print ++n, $1           # output first column
    a[n] = $2               # save second column in array
}
END {
    j = n + 1               # j is the next counter
    for (i=1; i<=n; i++)    # loop 1 - n
        print j++, a[i]     # output j and array value
}
Then to run the script on the file file.txt you can do:
$ awk -f cmbcols.awk file.txt
1 GGCTGCAGCTAACAGGTGA
2 CCTCTGGCTCGCAGGTCATGGC
3 GCTGCAGCTACATGGTGTCG
4 TACTCGGGGAGCTGCGG
5 CAGCGTCTTGCGCTCCT
6 CGCTCCGCTTCTCTCTACG
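
For comparison, a sketch without the array (an alternative, not part of the answer above): print the first column, then the second, and number the combined stream. This reads the file twice but keeps no state:
{ awk '{print $1}' file.txt; awk '{print $2}' file.txt; } | awk '{print NR "\t" $0}'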

How to calculate the mean of a row from a CSV file from the nth column?

This may look like a duplicate, but I could not solve the issue I'm having.
I'm trying to find the average of each row of a CSV/TSV file, starting from the 5th column. The data looks like below:
input.tsv
ID source random text val1 val2 val3 val4 val330
1 atttt eeeee test 0.9 0.5 0.2 0.54 0.89
2 afdg adfgrg tf 0.6 0.23 0.5 0.4 0.29
output.tsv
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
or at least
ID Avg
1 0.606
2 0.404
I tried a suggestion from here:
awk 'NR==1{next}
{printf("%s\t", $1
printf("%.2f\n", ($5 + $6 + $7)/3}' input.tsv
which threw an error,
and
awk '{ s = 4; for (i = 5; i <= NF; i++) s += $i; print $1, (NF > 1) ? s / (NF - 1) : 0; }' input.tsv
The code below also threw a syntax error:
for i in `cat input.tsv` do; VALUES=`echo $i | tr '\t' '\t'`;COUNT=0;SUM=0;typeset -i j;IFS=' ';for j in $VALUES; do;SUM=`expr $SUM + $j`;COUNT=`expr $COUNT + 1`;done;AVG=`expr $SUM / $COUNT`;echo $AVG;done
Help me resolve the issue and calculate the average of each row.
From your code reference:
awk 'NR==1{next}
{
    # missing the last ")". This prints the 1st column
    #printf("%s\t", $1
    printf("%s\t", $1)
    # missing the last ")", and it averages only 3 columns
    #printf("%.2f\n", ($5 + $6 + $7)/3
    printf("%.2f\n", ($5 + $6 + $7 + $8 + $9) / 5)
}' input.tsv
Your second attempt is not easy to work with: lots of subshells (backticks) and shell loops, and most of all it only handles integer values (expr) and sums the whole line rather than fields 5 to 9. Forget it, unless you really don't want awk for this.
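If a shell-only version is really wanted, a minimal sketch (an assumption, not the asker's fixed code: it needs bash and bc, since shell arithmetic is integer-only):
tail -n +2 input.tsv | while IFS=$'\t' read -r -a f; do
    sum=0
    for v in "${f[@]:4}"; do                 # fields 5 onward
        sum=$(echo "$sum + $v" | bc -l)
    done
    n=$(( ${#f[@]} - 4 ))                    # number of value fields
    printf '%s\t%.3f\n' "${f[0]}" "$(echo "$sum / $n" | bc -l)"
done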
For fun:
awk 'NR==1{
    # header
    print $0 OFS "Avg"
    Count = NF - 4            # number of value fields (5..NF)
    next
}
{
    # print each field of the line, summing the fields after col 4
    Avg = 0
    for (i=1; i<=NF; i++) {
        if (i >= 5) Avg += $i
        printf("%s ", $i)
    }
    # print the average
    printf("%.2f\n", Avg / Count)
}
' input.tsv
This assumes every line carries the full set of values; if some lines have fewer fields and empty values should not count, compute the divisor per line as NF - 4 instead of taking it from the header.
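Concretely, the per-line variant (just a sketch of that change):
awk 'NR==1 { print $0 OFS "Avg"; next }
{
    Avg = 0
    for (i=5; i<=NF; i++) Avg += $i
    printf("%s %.2f\n", $0, Avg / (NF - 4))   # divisor computed per line
}' input.tsv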
You could use this awk script:
awk 'NR>1{
    for (i=5; i<=NF; i++)
        sum += $i
}
{
    print $1,$2,$3,$4,(NF>4 && sum!="" ? sum/(NF-4) : (NR==1 ? "Avg" : ""))
    sum = 0
}' file | column -t
The first block accumulates the sum of the values from the 5th field onward (for every line after the header).
The second block prints the first four fields followed by either the Avg header (on line 1) or the computed average.
column -t displays the result in aligned columns.
This works as expected:
awk 'BEGIN{OFS="\t"}
(NR==1){ print $1,$2,$3,$4,"Avg:"; next }
{ s=0; for(i=5;i<=NF;++i) s+=$i }
{ print $1,$2,$3,$4, (NF>4 ? s/(NF-4) : s) }' input.tsv
or just for the fun of it, if you want to make the for-loop obfuscated:
awk 'BEGIN{OFS="\t"}
(NR==1){ print $1,$2,$3,$4,"Avg:"; next }
{ for(s=!(i=5);i<=NF;s+=$(i++)) {} }
{ print $1,$2,$3,$4, (NF>4 ? s/(NF-4) : s) }' input.tsv
$ cat tst.awk
NR == 1 { avg = "Avg" }
NR > 1 {
    sum = cnt = 0
    for (i=5; i<=NF; i++) {
        sum += $i
        cnt++
    }
    avg = (cnt ? sum / cnt : 0)
}
{ print $1, $2, $3, $4, avg }
$ awk -f tst.awk file
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
Using a Perl one-liner:
> perl -lane '{ $s=0; foreach (@F[4..8]) { $s += $_ } $F[4] = $s==0 ? "Avg" : $s/5; print "$F[0]\t$F[1]\t$F[2]\t$F[3]\t$F[4]" }' input.tsv
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
>

Extract desired column with values

Please help me with this small script I am making. I am trying to extract some columns, with their values, from a big tab-separated file (mainFileWithValues.txt) which has this format:
A B C ......... (total 700 columns)
80 2.08 23
14 1.88 30
12 1.81 40
Column names are in columnnam.nam:
cat columnnam.nam
A
B
.
.
.
till 20 names
I am first taking the column number from the big file using:
sed -n "1 s/${i}.*//p" mainFileWithValues.txt | sed 's/[^\t*]//g' |wc -c
Then, using cut, I am extracting the values.
I have made a for loop:
#!/bin/bash
for i in `cat columnnam.nam`
do
cut -f`sed -n "1 s/${i}.*//p" mainFileWithValues.txt | sed 's/[^\t*]//g' |wc -c` mainFileWithValues.txt > test.txt
done
cat test.txt
A
80
14
12
B
2.08
1.88
1.81
My problem is that I want the output test.txt to be in columns like the main file, i.e.
A B
80 2.08
How can I fix this in this script?
Here is a one-liner:
awk 'FNR==NR{h[NR]=$1;next}{for(i=1; i in h; i++){if(FNR==1){for(j=1; j<=NF; j++){if(tolower(h[i])==tolower($j)){d[i]=j; break }}}printf("%s%s",i>1 ? OFS:"", i in d ?$(d[i]):"")}print ""}' columns.nam mainfile
Explanation:
[Note: the header match is case-insensitive; remove tolower() if you want a strict match.]
awk '
FNR==NR{ # Here we read columns.nam file
h[NR]=$1; # h -> array, NR -> as array key, $1 -> as array value
next # go to next line
}
{ # Here we read second file
for(i=1; i in h; i++) # iterate array h
{
if(FNR==1) # if we are reading 1st row of second file, will parse header
{
for(j=1; j<=NF; j++) # iterate over fields of 1st row fields
{
# if it was the field we are looking for
if(tolower(h[i])==tolower($j))
{
# then
# d -> array, i -> as array key which is column order number
# j -> as array value which is column number
d[i]=j;
break
}
}
}
# for all records
# if field we searched was found then print such field
# from d[i] we access, column number
printf("%s%s",i>1 ? OFS:"", i in d ? $(d[i]): "");
}
# print newline char
print ""
}
' columns.nam mainfile
Test Results:
$ cat mainfile
A B C
80 2.08 23
14 1.88 30
12 1.81 40
$ cat columns.nam
A
C
$ awk 'FNR==NR{h[NR]=$1;next}{for(i=1; i in h; i++){if(FNR==1){for(j=1; j<=NF; j++){if(tolower(h[i])==tolower($j)){d[i]=j; break }}}printf("%s%s",i>1 ? OFS:"", i in d ?$(d[i]):"")}print ""}' columns.nam mainfile
A C
80 23
14 30
12 40
You can also save it as a script and run it:
akshay@db-3325:/tmp$ cat col_parser.awk
FNR == NR {
h[NR] = $1;
next
}
{
for (i = 1; i in h; i++) {
if (FNR == 1) {
for (j = 1; j <= NF; j++) {
if (tolower(h[i]) == tolower($j)) {
d[i] = j;
break
}
}
}
printf("%s%s", i > 1 ? OFS : "", i in d ? $(d[i]) : "");
}
print ""
}
akshay@db-3325:/tmp$ awk -v OFS="\t" -f col_parser.awk columns.nam mainfile
A C
80 23
14 30
12 40
Similar Answer
AWK to display a column based on Column name and remove header and last delimiter
Another awk approach:
awk 'NR == FNR {
    hdr[$1]
    next
}
FNR == 1 {
    for (i=1; i<=NF; i++)
        if ($i in hdr)
            h[i]
}
{
    s = ""
    for (i in h)    # note: "for (i in h)" iteration order is implementation-defined
        s = s (s == "" ? "" : OFS) $i
    print s
}' column.nam mainFileWithValues.txt
A B
80 2.08
14 1.88
12 1.81
To get formatted output, pipe the above command to column -t.
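For example, assuming the script above has been saved as extract.awk (the file name is just for illustration):
$ awk -f extract.awk column.nam mainFileWithValues.txt | column -t
A   B
80  2.08
14  1.88
12  1.81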

Min and max coordinates for same values in different column

I have a question: I am trying to think of a script for my data and I am totally lost.
INPUT:
1 BR.100.200
2 BR.100.200
3 BR.100.200
4 BR.100.200
1 BAL.11.235
2 BAL.11.235
3 BAL.11.235
1 JOJ.21.354
2 JOJ.21.354
OUTPUT:
BR.100.200 1 4
BAL.11.235 1 3
JOJ.21.354 1 2
What I want: whenever $2 has the same value on several rows, print that value together with the minimal and maximal values of $1 for it. I prefer awk, bash, or sed.
Thank you
Filip
Could probably be made better, but this works:
awk '!x[$2]{x[$2]=$1}y[$2]<$1{y[$2]=$1}x[$2]>$1{x[$2]=$1}END{for(i in y)print i,x[i],y[i]}' file
More readable:
awk '!min[$2]{min[$2]=$1} max[$2]<$1{max[$2]=$1} min[$2]>$1{min[$2]=$1} END{for(i in max)print i, min[i], max[i]}' file
#!/usr/bin/awk -f
NF == 0 { next }
$2 in min {
    if ($1 < min[$2]) {
        min[$2] = $1
    } else if ($1 > max[$2]) {
        max[$2] = $1
    }
    next
}
{
    min[$2] = max[$2] = $1
    keys[i++] = $2
}
END {
    for (i = 0; i in keys; ++i) {
        key = keys[i]
        if (i) {
            print ""
        }
        printf "%s\t%s\t%s\n", key, min[key], max[key]
    }
}
Run with:
awk -f script.awk your_file.txt
Output:
BR.100.200 1 4
BAL.11.235 1 3
JOJ.21.354 1 2
Another single-pass version, assuming the input is already grouped by the second column:
awk '$2 != prev { if (NR > 1) print prev, min, max; prev = $2; min = max = $1; next }
{ if ($1 < min) min = $1; if ($1 > max) max = $1 }
END { print prev, min, max }' inputfile

summing second columns of all files in bash

I have 1-N files in this format:
file 1:
1 1
2 5
3 0
4 0
5 0
file 2:
1 5
2 1
3 0
4 0
5 1
As output, I want to sum the second columns of all files, so the output looks like this:
output:
1 6
2 6
3 0
4 0
5 1
Thanks a lot.
(Alternatively, the best option for me would be to do this operation automatically on all files that share a name but start with a different number, e.g. 1A.txt, 2A.txt, 3A.txt as one output and 1AD.txt, 2AD.txt, 3AD.txt as the next output.)
Something like this should work:
cat *A.txt | awk '{sums[$1] += $2;} END { for (i in sums) print i " " sums[i]; }'
cat *AD.txt | awk '{sums[$1] += $2;} END { for (i in sums) print i " " sums[i]; }'
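Since awk accepts multiple file arguments and reads them in sequence, the cat is not strictly needed; the first command can be shortened to:
awk '{ sums[$1] += $2; } END { for (i in sums) print i " " sums[i]; }' *A.txt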
A quick summing solution can be done in awk:
{ sum[$1] += $2; }
END { for (i in sum) print i " " sum[i]; }
Grouping your input files is done easiest by building a list of suffixes and then globbing for them:
ls *.txt | sed -e 's/^[0-9]*//' | sort -u | while read suffix; do
awk '{ sum[$1] += $2; } END { for (i in sum) print i " " sum[i]; }' *$suffix > ${suffix}.sum
done
#!/bin/bash
suffixes=$(find . -name '*.txt' | sed 's/.*[0-9][0-9]*\(.*\)\.txt/\1/' | sort -u)
for suffix in ${suffixes}; do
paste *${suffix}.txt | awk '{sum = 0; for (i = 2; i <= NF; i += 2) sum += $i;
print $1" "sum}' > ${suffix}.sums.txt
done
exit 0
Pure Bash (the sums here are integers, which is all bash arithmetic can handle):
declare -a sum
for file in *A.txt; do
while read a b; do
((sum[a]+=b))
done < "$file"
done
for idx in ${!sum[*]}; do # iterate over existing indices
echo "$idx ${sum[$idx]}"
done
