How to sort a specified column in Linux - bash

These are my two columns of sequences. I want to combine them into one numbered column in Linux, but I don't know how to write a shell script to handle this.
GGCTGCAGCTAACAGGTGA TACTCGGGGAGCTGCGG
CCTCTGGCTCGCAGGTCATGGC CAGCGTCTTGCGCTCCT
GCTGCAGCTACATGGTGTCG CGCTCCGCTTCTCTCTACG
The sorted result should be as follows (all of the first column first, then the second column, with the number and the sequence split by "\t"):
1 GGCTGCAGCTAACAGGTGA
2 CCTCTGGCTCGCAGGTCATGGC
3 GCTGCAGCTACATGGTGTCG
4 TACTCGGGGAGCTGCGG
5 CAGCGTCTTGCGCTCCT
6 CGCTCCGCTTCTCTCTACG
What should I do?

You can do it easily in awk by storing the second column in an array and then outputting the saved values in the END rule, e.g.
awk '
{
    print ++n, $1    # output first column
    a[n] = $2        # save second column in array
}
END {
    j = n + 1                # j is next counter
    for (i=1; i<=n; i++)     # loop 1 - n
        print j++, a[i]      # output j and array value
}
' file.txt
Example Use/Output
With your input in file.txt, you can just copy/middle-mouse-paste the above in an xterm with file.txt in the current directory, e.g.
$ awk '
> {
>     print ++n, $1    # output first column
>     a[n] = $2        # save second column in array
> }
> END {
>     j = n + 1                # j is next counter
>     for (i=1; i<=n; i++)     # loop 1 - n
>         print j++, a[i]      # output j and array value
> }
> ' file.txt
1 GGCTGCAGCTAACAGGTGA
2 CCTCTGGCTCGCAGGTCATGGC
3 GCTGCAGCTACATGGTGTCG
4 TACTCGGGGAGCTGCGG
5 CAGCGTCTTGCGCTCCT
6 CGCTCCGCTTCTCTCTACG
Or as a 1-liner:
$ awk '{print ++n, $1; a[n]=$2} END {j=n+1; for (i=1;i<=n;i++) print j++, a[i]}' file.txt
1 GGCTGCAGCTAACAGGTGA
2 CCTCTGGCTCGCAGGTCATGGC
3 GCTGCAGCTACATGGTGTCG
4 TACTCGGGGAGCTGCGG
5 CAGCGTCTTGCGCTCCT
6 CGCTCCGCTTCTCTCTACG
If you would like to create an awk script from the above, you can simply create the script file (say cmbcols.awk) as:
{
    print ++n, $1    # output first column
    a[n] = $2        # save second column in array
}
END {
    j = n + 1                # j is next counter
    for (i=1; i<=n; i++)     # loop 1 - n
        print j++, a[i]      # output j and array value
}
Then to run the script on the file file.txt you can do:
$ awk -f cmbcols.awk file.txt
1 GGCTGCAGCTAACAGGTGA
2 CCTCTGGCTCGCAGGTCATGGC
3 GCTGCAGCTACATGGTGTCG
4 TACTCGGGGAGCTGCGG
5 CAGCGTCTTGCGCTCCT
6 CGCTCCGCTTCTCTCTACG
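Note that print ++n, $1 uses awk's default output field separator, a single space. Since you asked for the fields to be split by "\t", one small variation (a sketch with the same logic, just a tab OFS) is:

$ awk -v OFS='\t' '{print ++n, $1; a[n]=$2} END {for (i=1;i<=n;i++) print n+i, a[i]}' file.txt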

Related

How to print the data that has the largest length using awk?

I have this input:
1 happy,t,c,d
2 t,c
3 e,fgh,k
4 yk,j,f
5 leti,j,f,g
First, I want to print the length of each item (with comma as a delimiter), which should yield:
1 5,1,1,1
2 1,1
3 1,3,1
4 2,1,1
5 4,1,1,1
And then I want to select the max of those lengths for the 2nd column, finally creating:
1 5
2 1
3 3
4 2
5 4
How can I do this in awk?
1) For the first task I have tried:
awk -v col=$2 -F',' '{OFS=","; for(i=1; i<=NF; i++) print length($i);}' test.data.txt
which doesn't output the correct data:
7
1
1
1
3
1
3
3
1
4
1
1
6
1
1
1
The only problem is that I am not able to use the -v option properly to read only the second column. As a result, with -F',' the first field contains column 1, the space, and the first item, so its length gets mixed into the first value of each line.
2) To select the max value, I am doing:
awk -F',' '{OFS="\t"; m=length($1); for(i=1; i<=NF; i++) if (length($i) > m) m=length($i); print m}' test.data.txt
This finds a max properly, but because the 1st column (and the space after it) is part of the first comma-separated field, its characters inflate the max values, giving me:
7
3
3
4
6
instead of:
5
1
3
2
4
Lastly, I want to merge these two processes in one go. Any suggestions on improvements?
awk -F'[, ]' -v OFS="," '{m=length($2);for (i=3;i<=NF;i++) if (length($i) > m) m=length($i)}{print $1,m}' file
1,5
2,1
3,3
4,2
5,4
For the first case:
awk -F'[, ]' -v OFS="," '{printf "%s",$1;for (i=2;i<=NF;i++) printf "%s%s",(i==2?" ":OFS),length($i)}{print ""}' file
1 5,1,1,1
2 1,1
3 1,3,1
4 2,1,1
5 4,1,1,1
Shorter alternative:
awk -F'[, ]' -v OFS="," '{printf "%s ",$1;for (i=2;i<=NF;i++) printf "%s%s",length($i),(i==NF?ORS:OFS)}' file
While print in awk outputs its data and moves to the next line by printing a newline at the end, printf does not add a newline on its own.
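For example, this tiny standalone demonstration (not part of the answer above) prints abc followed by a single newline; only the final print adds the newline:

$ awk 'BEGIN { printf "a"; printf "b"; print "c" }'
abc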
PS: Thanks Ed Morton for the valuable comment.
We start with this data file:
$ cat data
1 happy,t,c,d
2 t,c
3 e,fgh,k
4 yk,j,f
5 leti,j,f,g
For the first task:
$ awk '{n=split($2,a,/,/); printf "%2s %s",$1,length(a[1]); for(i=2; i<=n; i++) printf ",%s",length(a[i]); print""}' data
1 5,1,1,1
2 1,1
3 1,3,1
4 2,1,1
5 4,1,1,1
For the second task:
$ awk '{n=split($2,a,/,/); max=length(a[1]); for(i=2; i<=n; i++) if (length(a[i])>max)max=length(a[i]); print $1,max}' data
1 5
2 1
3 3
4 2
5 4
How it works
For the second task:
n=split($2,a,/,/)
We split the contents of field 2 into array a on commas; split() returns the number of elements, stored in n.
max=length(a[1])
We assign the length of the first element of array a to the awk variable max.
for(i=2; i<=n; i++) if (length(a[i])>max)max=length(a[i])
If any succeeding element of array a is longer than max, we update max.
print $1,max
We print the first field and the value of max.
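As a quick standalone illustration of the split() call (a sketch, not part of the answer above):

$ awk 'BEGIN { n = split("happy,t,c,d", a, /,/); print n, a[1], length(a[1]) }'
4 happy 5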
Trying to golf this down:
gawk -F'[ ,]' '{m=0;for(i=2;i<=NF;i++){l=length($i);if(l>m)m=l}print$1,m}'
perl -MList::Util=max -F'\s+|,' -lne'$,=" ";print shift(@F),max map{length}@F'
perl -MList::Util=max -F'\s+|,' -lne'print"@{[shift(@F),max map{length}@F]}"'
perl -MList::Util=max -F'\s+|,' -lpe'$_="@{[shift(@F),max map{length}@F]}"'
ruby -F'[ ,]' -lape'$_="#{$F[0]} #{$F[1..-1].map{|e|e.size}.max}"'

Awk separate column output

What I want to do is create a table (maximum = 4 rows) from a one-column file using awk.
I have a file:
1 a,b
2 r,i
3 w
4 r,t
5 o,s
6 y
The desired output:
1 a,b 5 o,s
2 r,i 6 y
3 w
4 r,t
So far, I have just been separating the rows into different files and using "paste" to join them into one. I would appreciate any more sophisticated method.
$ cat tst.awk
BEGIN {
    numRows = 4
    OFS = "\t"
}
{
    rowNr = (NR - 1) % numRows + 1
    if ( rowNr == 1 ) {
        numCols++
    }
    val[rowNr,numCols] = $0
}
END {
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s%s", val[rowNr,colNr], (colNr<numCols ? OFS : ORS)
        }
    }
}
$
$ awk -f tst.awk file
1 a,b 5 o,s
2 r,i 6 y
3 w
4 r,t
Combination of awk to join lines and column to pretty-print them:
awk -v max=4 '
{ i = (NR-1) % max + 1; line[i] = line[i] "\t" $0 }
END { for(i=1; i<=max && i<=length(line); i++) print line[i] }' file | column -t -s $'\t'
Output:
1 a,b 5 o,s
2 r,i 6 y
3 w
4 r,t
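The $'\t' is bash ANSI-C quoting for a literal tab, passed to column as the field separator; note also that calling length() on an array, as in length(line), is a gawk extension that not every awk supports. In a shell without $'...' quoting, an equivalent sketch (same commands, only the quoting of the tab changed) would be:

awk -v max=4 '
    { i = (NR-1) % max + 1; line[i] = line[i] "\t" $0 }
    END { for(i=1; i<=max && i<=length(line); i++) print line[i] }' file |
column -t -s "$(printf '\t')"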
Another:
$ awk '{
    i = NR%4                              # using modulo indexed array
    a[i] = a[i] (a[i]==""?"":" ") $0      # append to it
}
END {                                     # in the END
    for (i=1; i<=4; i++)                  # loop all indexes in order
        print a[i%4]                      # don't forget the modulo
}' file
1 a,b 5 o,s
2 r,i 6 y
3 w
4 r,t
Naturally it will be ugly if there are missing columns.
Here is another awk approach:
awk '
{
    A[++c] = $0
}
END {
    m = sprintf( "%.0f", ( c / 4 ) )
    for ( i = 1; i <= 4; i++ )
    {
        printf "%s\t", A[i]
        for ( j = 1; j <= m; j++ )
            printf "%s\t", A[i+(j*4)]
        printf "\n"
    }
}
' file
You can combine split and paste:
split -l 4 file part- && paste part-*
-l <number> means to split the file into smaller files of <number> lines each.
part- is a prefix of our choice for the new files. Note that the new names will be in alphabetical order (e.g. part-aa, part-ab, etc.), so paste will join them in the expected order.
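For example, with the six-line file above, split -l 4 file part- creates part-aa (the first 4 lines) and part-ab (the remaining 2), and paste joins them side by side with tabs, so you should see something like:

$ split -l 4 file part- && paste part-*
1 a,b   5 o,s
2 r,i   6 y
3 w
4 r,t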

awk command to sum pairs of lines and filter them out under a particular condition

I have a file with numbers, and I want to sum the numbers from each pair of lines, column by column; then, as a last step, I want to filter out pairs of lines whose sums contain 3 or more '0' values. A small example to make it clear:
This is my file (without the comments, of course); it contains 2 pairs of lines (= 4 lines) with 5 columns each.
2 6 0 8 9 # pair 1.A
0 1 0 5 1 # pair 1.B
0 2 0 3 0 # pair 2.A
0 0 0 0 0 # pair 2.B
And I need to sum up pairs of lines so I get something like this (intermediate step)
2 7 0 13 10 # sum pair 1, it has one 0
0 2 0 3 0 # sum pair 2, it has three 0
Then I want to print the original lines, but only those pairs for which the count of 0s in the summed line is lower than 3, so I should get this:
2 6 0 8 9 # pair 1.A
0 1 0 5 1 # pair 1.B
Because the sum of the second pair of lines has three 0s, it should be excluded.
So from the first file I need to get the last output.
So far I have been able to sum pairs of lines, count zeros, and identify those with a count of 0s lower than 3, but I don't know how to print the two lines that contributed to the sum; I am only able to print one of the two lines (the last one). This is the awk I am using:
awk '
NR%2 { split($0, a); next }
{ for (i=1; i<=NF; i++) if (a[i]+$i == 0) SUM +=1;
if (SUM < 3) print $0; SUM=0 }' myfile
(That's what I get now)
0 1 0 5 1 # pair 1.B
Thanks!
Another variation, which could be useful to avoid extra loop iterations for some inputs:
awk '!(NR%2){ zeros=0; for(i=1;i<=NF;i++) { if(a[i]+$i==0) zeros++; if(zeros>=3) next }
print prev ORS $0 }{ split($0,a); prev=$0 }' file
The output:
2 6 0 8 9
0 1 0 5 1
Well, after digging a little bit more, I found that it was rather simple to print the previous line (I was overcomplicating it):
awk '
NR%2 { split($0, a) ; b=$0; next }
{ for (i=1; i<=NF; i++) if (a[i]+$i == 0) SUM +=1;
if (SUM < 3) print b"\n"$0; SUM=0}' myfile
So I just have to save the first line in a variable b and print both lines when the condition is met.
I hope it can help other people too.
$ cat tst.awk
!(NR%2) {
    split(prev,p)
    zeroCnt = 0
    for (i=1; i<=NF; i++) {
        zeroCnt += (($i + p[i]) == 0 ? 1 : 0)
    }
    if (zeroCnt < 3) {
        print prev ORS $0
    }
}
{ prev = $0 }
$ awk -f tst.awk file
2 6 0 8 9
0 1 0 5 1

Grouping elements by two fields in a space-delimited file

I have this data, ordered by column 2, then 3, then 1, in a space-delimited file (I used Linux sort to do that):
0 0 2
1 0 2
2 0 2
1 1 4
2 1 4
I want to create a new file (leaving the old file as is)
0 2 0,1,2
1 4 1,2
Basically, put fields 2 and 3 first and group the elements of field 1 (as a comma-separated list) by them. Is there a way to do that with an awk, sed, or bash one-liner, so as to avoid writing a Java or C++ app for it?
Since the file is already ordered, you can print each group as the key changes:
awk '
seen==$2 FS $3 { line=line "," $1; next }
{ if(seen) print seen, line; seen=$2 FS $3; line=$1 }
END { print seen, line }
' file
0 2 0,1,2
1 4 1,2
This will preserve the order of output.
With your input and output, this line may help:
awk '{f=$2 FS $3}!(f in a){i[++p]=f;a[f]=$1;next}
{a[f]=a[f]","$1}END{for(x=1;x<=p;x++)print i[x],a[i[x]]}' file
test:
kent$ cat f
0 0 2
1 0 2
2 0 2
1 1 4
2 1 4
kent$ awk '{f=$2 FS $3}!(f in a){i[++p]=f;a[f]=$1;next}{a[f]=a[f]","$1}END{for(x=1;x<=p;x++)print i[x],a[i[x]]}' f
0 2 0,1,2
1 4 1,2
awk 'a[$2, $3]++ { p = p "," $1; next } p { print p } { p = $2 FS $3 FS $1 } END { if (p) print p }' file
Output:
0 2 0,1,2
1 4 1,2
The solution assumes the data in the second and third columns is sorted.
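If the data were not already sorted, you could pre-sort it in the order the question describes (column 2, then 3, then 1) before the awk, e.g. this sketch:

sort -k2,2n -k3,3n -k1,1n file |
awk 'a[$2, $3]++ { p = p "," $1; next } p { print p } { p = $2 FS $3 FS $1 } END { if (p) print p }'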
Using awk:
awk '{k=$2 OFS $3} !(k in a){a[k]=$1; b[++n]=k; next} {a[k]=a[k] "," $1}
END{for (i=1; i<=n; i++) print b[i],a[b[i]]}' file
0 2 0,1,2
1 4 1,2
Yet another take:
awk -v SUBSEP=" " '
{ group[$2,$3] = group[$2,$3] $1 "," }
END {
    for (g in group) {
        sub(/,$/, "", group[g])
        print g, group[g]
    }
}
' file > newfile
The SUBSEP variable is the string that awk uses to join the subscripts when a multidimensional subscript like group[$2,$3] is stored in its (single-dimensional) arrays.
http://www.gnu.org/software/gawk/manual/html_node/Multidimensional.html#Multidimensional
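Note that for (g in group) does not guarantee any particular output order, so the groups may come out in a different order than in the other answers. As a quick standalone demonstration of SUBSEP (group[$2,$3] is really group[$2 SUBSEP $3]):

$ awk -v SUBSEP=' ' 'BEGIN { a["0","2"] = "x"; for (k in a) print k }'
0 2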
This might work for you (GNU sed):
sed -r ':a;$!N;/(. (. .).*)\n(.) \2.*/s//\1,\3/;ta;s/(.) (.) (.)/\2 \3 \1/;P;D' file
This appends the first column of the subsequent record to the first record until the second and third keys change. Then the fields in the first record are re-arranged and printed out.
This uses the data presented but can be adapted for more complex data.

summing second columns of all files in bash

I have files 1 to N in this format:
file 1:
1 1
2 5
3 0
4 0
5 0
file 2:
1 5
2 1
3 0
4 0
5 1
As output, I want to sum all the second columns across the files, so the output looks like this:
output:
1 6
2 6
3 0
4 0
5 1
Thanks a lot.
(Alternatively, the best option for me would be to do this operation automatically for all files that have the same name but start with a different number, e.g. 1A.txt, 2A.txt, 3A.txt as one output and 1AD.txt, 2AD.txt, 3AD.txt as the next output.)
Something like this should work:
cat *A.txt | awk '{sums[$1] += $2;} END { for (i in sums) print i " " sums[i]; }'
cat *AD.txt | awk '{sums[$1] += $2;} END { for (i in sums) print i " " sums[i]; }'
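As a side note, the cat is not strictly needed; awk can read the globbed files directly, e.g. this equivalent sketch:

awk '{sums[$1] += $2;} END { for (i in sums) print i " " sums[i]; }' *A.txt

Also be aware that for (i in sums) does not guarantee any particular order; pipe the output through sort -n if the order matters.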
A quick summing solution can be done in awk:
{ sum[$1] += $2; }
END { for (i in sum) print i " " sum[i]; }
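To run that as a standalone program, you could save it in a script file (say sum.awk; the name is just an example) and pass the data files to awk -f:

awk -f sum.awk *A.txt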
Grouping your input files is easiest done by building a deduplicated list of suffixes and then globbing for them:
ls *.txt | sed -e 's/^[0-9]*//' | sort -u | while read suffix; do
    awk '{ sum[$1] += $2; } END { for (i in sum) print i " " sum[i]; }' *$suffix > ${suffix}.sum
done
#!/bin/bash

suffixes=$(find . -name '*.txt' | sed 's/.*[0-9][0-9]*\(.*\)\.txt/\1/' | sort -u)

for suffix in ${suffixes}; do
    paste *${suffix}.txt | awk '{sum = 0; for (i = 2; i <= NF; i += 2) sum += $i;
        print $1" "sum}' > ${suffix}.sums.txt
done

exit 0
Pure Bash (note that (( )) arithmetic is integer-only, so this works for whole-number values):
declare -a sum
for file in *A.txt; do
    while read a b; do
        ((sum[a]+=b))
    done < "$file"
done

for idx in ${!sum[*]}; do    # iterate over existing indices
    echo "$idx ${sum[$idx]}"
done
