Split column using awk or sed - bash

I have a file containing the following text.
dog
aa 6469
bb 5946
cc 715
cat
aa 5692
Bird
aa 3056
bb 2893
cc 1399
dd 33
I need the following output:
A-Z ,aa ,bb, cc, dd
dog, 6469, 5946 ,715, 0
cat ,5692, 0, 0, 0
Bird ,3056, 2893, 1399, 33
I tried:
awk '{$1=$1}1' OFS="," RS=
But it does not give the format I need.
Thanks in advance for your help.
Cris

With Perl
perl -00 -nE'
($t, %p) = split /[\n\s]/; $h{$t} = {%p}; # Top line, Pairs on lines
$o{$t} = ++$c; # remember Order
$k{$_} = 1 for keys %p; # accumulate the full set of subKeys across paragraphs
}{ # END block starts
say join ",", "A-Z", sort keys %k;
for $t (sort { $o{$a} <=> $o{$b} } keys %h) {
say join ",", $t, map { ($h{$t}{$_} // 0) } sort keys %k;
}
' data.txt
prints, in the original order
A-Z,aa,bb,cc,dd
dog,6469,5946,715,0
cat,5692,0,0,0
Bird,3056,2893,1399,33

Here's a sed solution, which works on your input, but requires that you know the column names in advance and that the column names are given as sorted full ranges starting with the first column name (so nothing like aa, cc or bb, aa or bb, cc) and that every paragraph is followed by one empty line. You would also need to adjust the script if you don't have exactly four numeric columns:
echo 'A-Z, aa, bb, cc, dd';sed -e '/./{s/.* //;H;d};x;s/\n/, /g;s/, //;s/$/, 0, 0, 0, 0/;:a;s/,[^,]*//5;ta' file
If you need to look up the sed commands, you can look at info sed, especially 3.5 Less Frequently-Used Commands.

awk to the rescue!
awk -v OFS=, 'NF==1 {h[++c]=$1}
NF==2 {v[c,$1]=$2; ks[$1]}
END {printf "%s", "A-Z";
for(k in ks) printf "%s", OFS k;
print "";
for(i=1;i<=c;i++)
{printf "%s", h[i];
for(k in ks) printf "%s", OFS v[i,k]+0;
print ""}}' file
The order of the columns is unspecified, because for (k in ks) visits the keys in an arbitrary order.
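If a stable column order matters, one option is to record each column name the first time it appears and loop over that list instead of using for (k in ks). The following is a minimal sketch under that assumption. The function name pivot and the array names row, col, seen and val are made up for illustration, and the columns come out in first-seen order rather than sorted order.

```shell
# pivot: read "header / key value" blocks on stdin and print a CSV table.
# Hypothetical helper; all names are illustrative, not from the answers above.
pivot() {
  awk -v OFS=, '
    # a one-field line starts a new row
    NF == 1 { row[++nr] = $1 }
    # a two-field line is a key/value pair for the current row
    NF == 2 {
      if (!($1 in seen)) { seen[$1]; col[++nc] = $1 }  # remember columns in first-seen order
      val[nr, $1] = $2
    }
    END {
      printf "%s", "A-Z"
      for (j = 1; j <= nc; j++) printf "%s%s", OFS, col[j]
      print ""
      for (i = 1; i <= nr; i++) {
        printf "%s", row[i]
        # adding 0 turns a missing value into 0
        for (j = 1; j <= nc; j++) printf "%s%s", OFS, val[i, col[j]] + 0
        print ""
      }
    }'
}
```

It can be called as pivot < file. With the sample input the columns appear as aa,bb,cc,dd because that is the order in which they are first seen.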

Related

Counting occurrences of a character

I have a file that looks like this
chr1A_p1
chr1A_p2
chr10B_p1
chr10A_p1
chr11D_p2
chr18B_p2
chr9D_p1
I need to count the number of times A, B and D occur. Individually, I would do it like this:
awk '{if($1~/A/) print $0 }' < test.txt | wc
awk '{if($1~/B/) print $0 }' < test.txt | wc
awk '{if($1~/D/) print $0 }' < test.txt | wc
How can I combine these so that a single one-liner counts A, B and D instead of three separate commands?
For specific line format (where the needed char is before _):
$ awk -F"_" '{ seen[substr($1, length($1))]++ }END{ for(k in seen) print k, seen[k] }' file
A 3
B 2
D 2
Counting occurrences is generally done by keeping track of a counter. So any one of the OP's awk lines, for example
awk '{if($1~/A/) print $0}' < test.txt | wc
can be rewritten as
awk '($1~/A/){c++}END{print c}' test.txt
for multiple cases, you can now do:
awk '($1~/A/){c["A"]++}
($1~/B/){c["B"]++}
($1~/D/){c["D"]++}
END{for(i in c) print i,c[i]}' test.txt
Now you can even clean this up a bit more:
awk '{c["A"]+=($1~/A/)}
{c["B"]+=($1~/B/)}
{c["D"]+=($1~/D/)}
END{for(i in c) print i,c[i]}' test.txt
which you can clean up further as:
awk 'BEGIN{split("A B D",a)}
{for(i in a) c[a[i]]+=($1~a[i])}
END{for(i in c) print i,c[i]}' test.txt
But these versions only count how many lines contain the letter, not how many times the letter occurs. To count every occurrence, use the return value of gsub, which is the number of substitutions it made:
awk 'BEGIN{split("A B D",a)}
{for(i in a) c[a[i]]+=gsub(a[i],"",$1)}
END{for(i in c) print i,c[i]}' test.txt
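The difference between counting lines and counting occurrences only shows up when a letter appears more than once on a line. Here is a small sketch to make that visible. The function names count_lines and count_chars are made up for illustration, and the argument is treated as a regular expression.

```shell
# count_lines CH: number of input lines that contain CH at least once.
count_lines() {
  awk -v ch="$1" '$0 ~ ch { n++ } END { print n + 0 }'
}

# count_chars CH: total number of times CH occurs in the input.
# gsub returns the number of replacements it made on the current line.
count_chars() {
  awk -v ch="$1" '{ n += gsub(ch, "") } END { print n + 0 }'
}
```

For an input with the lines chrAA_p1, chr1B_p2 and chr2A_p1, count_lines A should report 2 while count_chars A should report 3.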
Perl to the rescue!
perl -lne '$seen{$1}++ if /([ABD])/; END { print "$_:$seen{$_}" for keys %seen }' < test.txt
-n reads the input line by line
-l removes newlines from input and adds them to output
a hash table %seen is used to keep the number of occurrences of each symbol. Each time it's matched it's captured and the corresponding field in the hash is incremented.
END is run when the file ends. It outputs all the keys of the hash, i.e. the matched characters, each followed by the number of occurrences.
datafile:
chr1A_p1
chr1A_p2
chr10B_p1
chr10A_p1
chr11D_p2
chr18B_p2
chr9D_p1
script.awk
BEGIN {
arr["A"]=0
arr["B"]=0
arr["D"]=0
}
/A/ { arr["A"]++ }
/B/ { arr["B"]++ }
/D/ { arr["D"]++ }
END {
printf "A: %s, B: %s, D: %s\n", arr["A"], arr["B"], arr["D"]
}
execution:
awk -f script.awk datafile
result:
A: 3, B: 2, D: 2

How to sort ROW in a line in BASH

Most sorting tools available in bash or as Linux terminal commands sort by a field (column). I couldn't figure out how to sort a row of three numbers, e.g. "1, 3, 2". I want the values ordered from left to right, smallest to largest, like "1,2,3", or vice versa.
So the input would be like line="5, 3, 10". After sorting, the output would be sorted_line="3,5,10".
Any tips? Thanks.
Note that asort is a gawk extension and is not available in other awks. Here is a solution for a file, a.txt:
gawk -F, '{n=split($0, w); asort(w); s=w[1]; for(i=2; i<=n; i++) s=s "," w[i]; print s }' a.txt
sample file, a.txt is
1,5,7,2
8,1,3,4
9,7,8,2
result,
1,2,5,7
1,3,4,8
2,7,8,9
This is one way :
echo "6 5,4,9 1,3 2,10,7 8" | awk '{ split($0,arr,"(,| )") ; asort(arr); exit; } END{ for ( i=1; i <= length(arr) ; i++ ) { print arr[i]} }'
I am using a regex as a delimiter so it can be comma or space separated.
Hope it helps!
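For a single line held in a shell variable, gawk is not strictly needed. Here is a minimal sketch using only tr, sort and paste. The function name sort_row is made up for illustration, and it assumes one comma-separated row of numbers on stdin.

```shell
# sort_row: read ONE comma-separated row of numbers on stdin and print it
# sorted in ascending numeric order, without spaces.
# Hypothetical helper; it treats the whole input as a single row.
sort_row() {
  tr -d ' ' |      # drop the spaces after the commas
  tr ',' '\n' |    # one number per line, so sort can work on it
  sort -n |        # numeric sort (use sort -nr for descending order)
  paste -sd, -     # join the lines back together with commas
}
```

Usage would look like sorted_line=$(printf '%s\n' "$line" | sort_row). For a file with many rows, the function would have to be called once per line, so the gawk approach above is likely the better fit there.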

Splitting a large, complex one column file into several columns with awk

I have a text file produced by some commercial software, looking like the one below. It consists of parenthesis-delimited sections, each containing several million elements, though the exact count varies from one case to another.
(1
2
3
...
)
(11
22
33
...
)
(111
222
333
...
)
I need to achieve an output like:
1; 11; 111
2; 22; 222
3; 33; 333
... ... ...
I found a complicated way that is:
perform sed operations to get
1
2
3
...
#
11
22
33
...
#
111
222
333
...
use awk as follows to split my file in several sub-files
awk -v RS="#" '{print > ("splitted-" NR ".txt")}'
remove white spaces from my subfiles again with sed
sed -i '/^[[:space:]]*$/d' splitted*.txt
join everything together:
paste splitted*.txt > out.txt
add a field separator (defined in my bash script)
awk -v sep=$my_sep 'BEGIN{OFS=sep}{$1=$1; print }' out.txt > formatted.txt
I feel this is crappy, as I loop over millions of lines several times.
Even if the return time is quite OK (~80sec), I'd like to find a full awk solution but can't get to it.
Something like:
awk 'BEGIN{RS="(\\n)"; OFS=";"} { print something } '
I found some related questions, especially this one, row to column conversion with awk, but it assumes a constant number of lines between brackets, which I can't guarantee.
Any help would be appreciated.
With GNU awk for multi-char RS and true multi dimensional arrays:
$ cat tst.awk
BEGIN {
RS = "(\\s*[()]\\s*)+"
OFS = ";"
}
NR>1 {
cell[NR][1]
split($0,cell[NR])
}
END {
for (rowNr=1; rowNr<=NF; rowNr++) {
for (colNr=2; colNr<=NR; colNr++) {
printf "%6s%s", cell[colNr][rowNr], (colNr<NR ? OFS : ORS)
}
}
}
$ awk -f tst.awk file
1; 11; 111
2; 22; 222
3; 33; 333
...; ...; ...
If you know you have 3 columns, you can do it in a very ugly way as following:
pr -3ts <file>
All that needs to be done then is to remove your brackets:
$ pr -3ts ~/tmp/f | awk 'BEGIN{OFS="; "}{gsub(/[()]/,"")}(NF){$1=$1; print}'
1; 11; 111
2; 22; 222
3; 33; 333
...; ...; ...
You can also do it in a single awk line, but it just complicates things. The above is quick and easy.
This awk program does the full generic version:
awk 'BEGIN{r=c=0}
/)/{r=0; c++; next}
{gsub(/[( ]/,"")}
(NF){a[r++,c]=$1; rm=rm>r?rm:r}
END{ for(i=0;i<rm;++i) {
printf "%s", a[i,0];
for(j=1;j<c;++j) printf "; %s", a[i,j];
print ""
}
}' <file>
Could you please try the following, assuming your actual Input_file has the same format as the samples shown.
awk -v RS="" '
{
gsub(/\n|, /,",")
}
1' Input_file |
awk '
{
while(match($0,/\([^\)]*/)){
value=substr($0,RSTART+1,RLENGTH-2)
$0=substr($0,RSTART+RLENGTH)
num=split(value,array,",")
for(i=1;i<=num;i++){
val[i]=val[i]?val[i] OFS array[i]:array[i]
}
}
for(j=1;j<=num;j++){
print val[j]
}
delete val
delete array
value=""
}' OFS="; "
OR (the script above assumes that the number of values inside each (...) is constant; the script below also works when the counts differ):
awk -v RS="" '
{
gsub(/\n/,",")
gsub(/, /,",")
}
1' Input_file |
awk '
{
while(match($0,/\([^\)]*/)){
value=substr($0,RSTART+1,RLENGTH-2)
$0=substr($0,RSTART+RLENGTH)
num=split(value,array,",")
for(i=1;i<=num;i++){
val[i]=val[i]?val[i] OFS array[i]:array[i]
max=num>max?num:max
}
}
for(j=1;j<=max;j++){
print val[j]
}
delete val
delete array
}' OFS="; "
Output will be as follows.
1; 11; 111
2; 22; 222
3; 33; 333
Explanation: Adding explanation for above code here.
awk -v RS="" ' ##Setting RS(record separator) as NULL here.
{ ##Starting BLOCK here.
gsub(/\n/,",") ##Using gsub to replace each newline with a comma here.
gsub(/, /,",") ##Using gsub to replace each comma followed by a space with a plain comma here.
}
1' Input_file | ##Mentioning 1 will be printing edited/non-edited line of Input_file. Using | means sending this output as Input to next awk program.
awk ' ##Starting another awk program here.
{
while(match($0,/\([^\)]*/)){ ##Using while loop which will run till a match is FOUND for (...) in lines.
value=substr($0,RSTART+1,RLENGTH-2) ##Storing the RLENGTH-2 characters that start at RSTART+1 in variable value here.
$0=substr($0,RSTART+RLENGTH) ##Re-creating current line with substring value from RSTART+RLENGTH till last of line.
num=split(value,array,",") ##Splitting value variable into array named array whose delimiter is comma here.
for(i=1;i<=num;i++){ ##Using for loop which runs from i=1 to till value of num(length of array).
val[i]=val[i]?val[i] OFS array[i]:array[i] ##Creating array val whose index is value of variable i and concatenating its own values.
}
}
for(j=1;j<=num;j++){ ##Starting a for loop from j=1 to till value of num here.
print val[j] ##Printing value of val whose index is j here.
}
delete val ##Deleting val here.
delete array ##Deleting array here.
value="" ##Nullifying variable value here.
}' OFS="; " ##Making OFS value as ; with space here.
NOTE: This should work for more than 3 values inside (...) brackets also.
awk 'BEGIN { RS = "\\s*[()]\\s*"; FS = "\\s*" }
NF > 0 {
maxCol++
if (NF > maxRow)
maxRow = NF
for (row = 1; row <= NF; row++)
a[row,maxCol] = $row
}
END {
for (row = 1; row <= maxRow; row++) {
for (col = 1; col <= maxCol; col++)
printf "%s", a[row,col] ";"
print ""
}
}' yourFile
output
1;11;111;
2;22;222;
3;33;333;
...;...;...;
Change FS= "\\s*" to FS = "\n*" when you also want to allow spaces inside your fields.
This script supports columns of different lengths.
When benchmarking also consider replacing [i,j] with [i][j] for GNU awk. I'm unsure which one is faster and did not benchmark the script myself.
Here is the Perl one-liner solution
$ cat edouard2.txt
(1
2
3
a
)
(11
22
33
b
)
(111
222
333
c
)
$ perl -lne ' $x=0 if s/[)(]// ; if(/(\S+)/) { @t=@{$val[$x]};push(@t,$1);$val[$x++]=[@t] } END { print join(";",@{$val[$_]}) for(0..$#val) }' edouard2.txt
1;11;111
2;22;222
3;33;333
a;b;c
I would convert each section to a row and then transpose after, e.g. assuming you are using GNU awk:
<infile awk '{ gsub("[( )]", ""); $1=$1 } 1' RS='\\)\n\\(' OFS=';' |
datamash -t';' transpose
Output:
1;11;111
2;22;222
3;33;333
...;...;...

deleting spaces between every other column

I have a large dataset that looks like this:
ID224912 A A A B B A B A B A B
and I want to make it look like:
ID224912 AA AB BA BA BA BA
I have tried modifying this code that I found somewhere else but no success:
AWK=''' { printf (""%s %s %s %s"", $1, $2, $3, $4); }
{ for (f = 5; f <= NF; f += 2) printf (""%s %s"", $(f), $(f + 1)); }
{ printf (""\n""); } '''
awk ""${AWK}"" InFile > OutFile
Any suggestions?
This might work for you (GNU sed):
sed -E 's/((\S+\s\S+\s)*\S+).*/\1/g;s/(\S+\s\S+)\s/\1/g' file
The solution is in two parts. First, keep an odd number of fields (the ID plus complete pairs), deleting a trailing unpaired field if there is one. Then delete every second space, which joins the remaining fields into pairs.
$ awk '{r=$1; for (i=2; i<NF; i+=2) r=r OFS $i $(i+1); print r}' file
ID224912 AA AB BA BA BA
You do not have to assign the AWK script into a variable. Just invoke it inline, which is simpler and safer.
It looks strange that you are grouping the first four fields. As far as I can see from your desired output, it would be enough just to treat the first (ID) field separately.
Try something like:
awk '{printf("%s", $1); for (i=2; i<=NF; i+=2) printf(" %s%s", $i, $(i+1)); print ""}' InFile > OutFile
Hope this helps.
For funsies here is a sed solution:
cat input | sed 's/\([ A-Z ]\) \([ A-Z ]\)/\1\2/g' > output
Just for clarification I tested on BSD sed.
Regarding InFile as your input file, you can use sed this way:
cat InFile |sed -e 's/\([a-zA-Z]\)[ \t]\([a-zA-Z]\)/\1\2/g'
N.B.: with the specified InFile in your initial question (with an odd count of letters), the result is:
ID224912 AA AB BA BA BA B
The following awk line
awk '{printf $1}{for(i=2;i<=NF;i+=2) printf OFS $i $(i+1); print "" }'
will output
ID224912 AA AB BA BA BA B
As you can see, there is an extra column B at the end, because the original input has an odd number of letters after the ID. As the OP does not want this, we can fix it with a simple change to the for-loop condition:
awk '{printf $1}{for(i=2;i<NF;i+=2) printf OFS $i $(i+1); print "" }'
will output
ID224912 AA AB BA BA BA

Restructure line fields in a file

I am a newbie to coding but would like to use either awk, sed or bash to solve this problem.
I have a file "input.txt" that looks like this:
Otu13 k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus 0.998
Otu24 k__Bacteria;p__Candidatus_Saccharibacteria;g__Saccharibacteria_genera_incertae_sedis; 1.000;;
Otu59 k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Prevotellaceae;g__Alloprevotella 0.991
Otu41 k__Bacteria;p__Bacteroidetes;g__Alloprevotella 0.998
First, I would like to drop the last column with the numbers. Then, for the remaining fields in each line, I want to write them out according to their prefix (k__, p__, c__, o__, f__, g__).
The values after the prefixes should be printed in the same order as in the parentheses, so that if one of the prefixes in the sequence is missing, as in lines 2 and 4, it is replaced with a blank. In the end I should have 7 fields.
My desired output is something like this:
Otu13; Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus
Otu24; Bacteria; Candidatus_Saccharibacteria; ; ; ;Saccharibacteria_genera_incertae_sedis
Otu59; Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Alloprevotella
Otu41; Bacteria;Bacteroidetes; ; ; ; Alloprevotella
Will greatly appreciate your assistance.
It's not clear how/why you'd get the output you show from the input you posted and the description of your requirements but I think this is what you really want:
$ cat tst.awk
BEGIN { n=split("k p c o f g",order); FS="[ ;]+|__"; OFS=";" }
{
sub(/[0-9.;[:space:]]+$/,"")
delete f
for (i=2;i<=NF;i+=2) {
f[$i] = $(i+1)
}
printf "%s%s", $1, OFS
for (i=1; i<=n; i++) {
printf "%s%s", f[order[i]], (i<n ? OFS : ORS)
}
}
$ awk -f tst.awk file
Otu13;Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus
Otu24;Bacteria;Candidatus_Saccharibacteria;;;;Saccharibacteria_genera_incertae_sedis
Otu59;Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Alloprevotella
Otu41;Bacteria;Bacteroidetes;;;;Alloprevotella

Resources