Splitting a large, complex one column file into several columns with awk - bash

I have a text file produced by some commercial software, looking like below. It consists in brackets delimited sections, each of which counts several million elements but the exact value changes from one case to another.
(1
2
3
...
)
(11
22
33
...
)
(111
222
333
...
)
I need to achieve an output like:
1; 11; 111
2; 22; 222
3; 33; 333
... ... ...
I found a complicated way that is:
perform sed operations to get
1
2
3
...
#
11
22
33
...
#
111
222
333
...
use awk as follows to split my file in several sub-files
awk -v RS="#" '{print > ("splitted-" NR ".txt")}'
remove white spaces from my subfiles again with sed
sed -i '/^[[:space:]]*$/d' splitted*.txt
join everything together:
paste splitted*.txt > out.txt
add a field separator (defined in my bash script)
awk -v sep=$my_sep 'BEGIN{OFS=sep}{$1=$1; print }' out.txt > formatted.txt
I feel this is crappy as I loop over million lines several time.
Even if the return time is quite OK (~80sec), I'd like to find a full awk solution but can't get to it.
Something like:
awk 'BEGIN{RS="(\\n)"; OFS=";"} { print something } '
I found some related questions, especially this one row to column conversion with awk, but it assumes a constant number of lines between brackets which I can't do.
Any help would be appreciated.

With GNU awk for multi-char RS and true multi dimensional arrays:
$ cat tst.awk
BEGIN {
RS = "(\\s*[()]\\s*)+"
OFS = ";"
}
NR>1 {
cell[NR][1]
split($0,cell[NR])
}
END {
for (rowNr=1; rowNr<=NF; rowNr++) {
for (colNr=2; colNr<=NR; colNr++) {
printf "%6s%s", cell[colNr][rowNr], (colNr<NR ? OFS : ORS)
}
}
}
$ awk -f tst.awk file
1; 11; 111
2; 22; 222
3; 33; 333
...; ...; ...

If you know you have 3 columns, you can do it in a very ugly way as following:
pr -3ts <file>
All that needs to be done then is to remove your brackets:
$ pr -3ts ~/tmp/f | awk 'BEGIN{OFS="; "}{gsub(/[()]/,"")}(NF){$1=$1; print}'
1; 11; 111
2; 22; 222
3; 33; 333
...; ...; ...
You can also do it in a single awk line, but it just complicates things. The above is quick and easy.
This awk program does the full generic version:
awk 'BEGIN{r=c=0}
/)/{r=0; c++; next}
{gsub(/[( ]/,"")}
(NF){a[r++,c]=$1; rm=rm>r?rm:r}
END{ for(i=0;i<rm;++i) {
printf a[i,0];
for(j=1;j<c;++j) printf "; " a[i,j];
print ""
}
}' <file>

Could you please try following once, considering that your actual Input_file is same as shown samples.
awk -v RS="" '
{
gsub(/\n|, /,",")
}
1' Input_file |
awk '
{
while(match($0,/\([^\)]*/)){
value=substr($0,RSTART+1,RLENGTH-2)
$0=substr($0,RSTART+RLENGTH)
num=split(value,array,",")
for(i=1;i<=num;i++){
val[i]=val[i]?val[i] OFS array[i]:array[i]
}
}
for(j=1;j<=num;j++){
print val[j]
}
delete val
delete array
value=""
}' OFS="; "
OR(above script is considering that numbers inside (...) will be constant, now adding script which will working even field numbers of not equal inside (....).
awk -v RS="" '
{
gsub(/\n/,",")
gsub(/, /,",")
}
1' Input_file |
awk '
{
while(match($0,/\([^\)]*/)){
value=substr($0,RSTART+1,RLENGTH-2)
$0=substr($0,RSTART+RLENGTH)
num=split(value,array,",")
for(i=1;i<=num;i++){
val[i]=val[i]?val[i] OFS array[i]:array[i]
max=num>max?num:max
}
}
for(j=1;j<=max;j++){
print val[j]
}
delete val
delete array
}' OFS="; "
Output will be as follows.
1; 11; 111
2; 22; 222
3; 33; 333
Explanation: Adding explanation for above code here.
awk -v RS="" ' ##Setting RS(record separator) as NULL here.
{ ##Starting BLOCK here.
gsub(/\n/,",") ##using gsub to substitute new line OR comma with space with comma here.
gsub(/, /,",")
}
1' Input_file | ##Mentioning 1 will be printing edited/non-edited line of Input_file. Using | means sending this output as Input to next awk program.
awk ' ##Starting another awk program here.
{
while(match($0,/\([^\)]*/)){ ##Using while loop which will run till a match is FOUND for (...) in lines.
value=substr($0,RSTART+1,RLENGTH-2) ##storing substring from RSTART+1 to till RLENGTH-1 value to variable value here.
$0=substr($0,RSTART+RLENGTH) ##Re-creating current line with substring valeu from RSTART+RLENGTH till last of line.
num=split(value,array,",") ##Splitting value variable into array named array whose delimiter is comma here.
for(i=1;i<=num;i++){ ##Using for loop which runs from i=1 to till value of num(length of array).
val[i]=val[i]?val[i] OFS array[i]:array[i] ##Creating array val whose index is value of variable i and concatinating its own values.
}
}
for(j=1;j<=num;j++){ ##Starting a for loop from j=1 to till value of num here.
print val[j] ##Printing value of val whose index is j here.
}
delete val ##Deleting val here.
delete array ##Deleting array here.
value="" ##Nullifying variable value here.
}' OFS="; " ##Making OFS value as ; with space here.
NOTE: This should work for more than 3 values inside (...) brackets also.

awk 'BEGIN { RS = "\\s*[()]\\s*"; FS = "\\s*" }
NF > 0 {
maxCol++
if (NF > maxRow)
maxRow = NF
for (row = 1; row <= NF; row++)
a[row,maxCol] = $row
}
END {
for (row = 1; row <= maxRow; row++) {
for (col = 1; col <= maxCol; col++)
printf "%s", a[row,col] ";"
print ""
}
}' yourFile
output
1;11;111;
2;22;222;
3;33;333;
...;...;...;
Change FS= "\\s*" to FS = "\n*" when you also want to allow spaces inside your fields.
This script supports columns of different lengths.
When benchmarking also consider replacing [i,j] with [i][j] for GNU awk. I'm unsure which one is faster and did not benchmark the script myself.

Here is the Perl one-liner solution
$ cat edouard2.txt
(1
2
3
a
)
(11
22
33
b
)
(111
222
333
c
)
$ perl -lne ' $x=0 if s/[)(]// ; if(/(\S+)/) { #t=#{$val[$x]};push(#t,$1);$val[$x++]=[#t] } END { print join(";",#{$val[$_]}) for(0..$#val) }' edouard2.txt
1;11;111
2;22;222
3;33;333
a;b;c

I would convert each section to a row and then transpose after, e.g. assuming you are using GNU awk:
<infile awk '{ gsub("[( )]", ""); $1=$1 } 1' RS='\\)\n\\(' OFS=';' |
datamash -t';' transpose
Output:
1;11;111
2;22;222
3;33;333
...;...;...

Related

Use an array created using awk as a variable in another awk script

I am trying to use awk to extract data using a conditional statement containing an array created using another awk script.
The awk script I use for creating the array is as follows:
array=($(awk 'NR>1 { print $1 }' < file.tsv))
Then, to use this array in the other awk script
awk var="${array[#]}" 'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1" && heading[i] in var){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt
However, when I run this, the following error occurs.
awk: fatal: cannot open file 'foo' for reading (No such file or directory)
I've already looked at multiple posts on why this error occurs and on how to correctly implement a shell variable in awk, but none of these have worked so far. However, when removing the shell variable and running the script it does work.
awk 'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1"){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt
I really need that conditional statement but don't know what I am doing wrong with implementing the bash variable in awk and would appreciate some help.
Thx in advance.
That specific error messages is because you forgot -v in front of var= (it should be awk -v var=, not just awk var=) but as others have pointed out, you can't set an array variable on the awk command line. Also note that array in your code is a shell array, not an awk array, and shell and awk are 2 completely different tools each with their own syntax, semantics, scopes, etc.
Here's how to really do what you're trying to do:
array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )
awk -v xyz="${array[*]}" '
BEGIN{ split(xyz,tmp,RS); for (i in tmp) var[tmp[i]] }
... now use `var` as you were trying to ...
'
For example:
$ cat file.tsv
col1 col2
a b c d e
f g h i j
$ cat -T file.tsv
col1^Icol2
a b^Ic d e
f g h^Ii j
$ awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv
a b
f g h
$ array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )
$ awk -v xyz="${array[*]}" '
BEGIN {
split(xyz,tmp,RS)
for (i in tmp) {
var[tmp[i]]
}
for (idx in var) {
print "<" idx ">"
}
}
'
<f g h>
<a b>
It's easier and more efficient to process both files in a single awk:
edit: fixed issues in comment, thanks #EdMorton
awk '
FNR == NR {
if ( FNR > 1 )
var[$1]
next
}
FNR == 1 {
for (i = 1; i <= NF; i++)
heading[i] = $i
next
}
{
for (i = 2; i <= NF; i++)
if ( $i == "1" && heading[i] in var) {
outFile = heading[i] ".txt"
print ">kmer" (NR-1) "\n" $1 >> (outFile)
close(outFile)
}
}
' file.tsv input.txt
You might store string in variable, then use split function to turn that into array, consider following simple example, let file1.txt content be
A B C
D E F
G H I
and file2.txt content be
1
3
2
then
var1=$(awk '{print $1}' file1.txt)
awk -v var1="$var1" 'BEGIN{split(var1,arr)}{print "First column value in line number",$1,"is",arr[$1]}' file2.txt
gives output
First column value in line number 1 is A
First column value in line number 3 is G
First column value in line number 2 is D
Explanation: I store output of 1st awk command, which is then used as 1st argument to split function in 2nd awk command. Disclaimer: this solutions assumes all files involved have delimiter compliant with default GNU AWK behavior, i.e. one-or-more whitespaces is always delimiter.
(tested in gawk 4.2.1)

How to print the row number and starting location of a pattern when multiple matches per row are present?

I want to use awk to match all the occurrences of a pattern within a large file. For each match, I would like to print the row number and the starting position of the pattern along the row (sort of xy coordinates). There are several occurrences of the pattern in each line.
I found this somewhat related question.
So far, I managed to do it only for the first (leftmost) occurrence in each line. As an example:
echo xyzABCdefghiABCdefghiABCdef | awk 'match($0, /ABC/) {print NR, RSTART } '
The resulting output is :
1 4
But what I would expect is something like this:
1 4
1 13
1 22
I tried using split instead of match. I manage to identify all the occurrences, but the RSTART is lost and printed as "0".
echo xyzABCdefghiABCdefghiABCdef | awk ' { split($0,t, /ABC/,m) ; for (i=1; i in m; i++) print (NR, RSTART) } '
Output:
1 0
1 0
1 0
Any advice would be appreciated. I am not limited to using awk but a awk solution would be appreciated.
Also, in my case the pattern to match would be a regex (/A.C/).
Thank you
This may be what you're trying to do:
echo xyzABCdefghiABCdefghiABCdef |
awk '{ begpos=1
while (match(substr($0, begpos), /ABC/)) {
print NR, begpos + RSTART - 1
begpos += RLENGTH + RSTART - 1
}
}'
Another option using gnu awk could be using split with a regex.
Using the split function, the 3rd field is the fieldsep array and the 4th field is the seps array which you can both use to calculate the positions.
echo xyzABCdefghiABCdefghiABCdef |
awk ' {
n=split($0, a, /ABC/, seps); pos=1
for(i=1; i<n; i++){
pos += length(a[i])
print NR, pos
pos += length(seps[i])
}
}'
Output
1 4
1 13
1 22
With your shown samples, please try following awk code.
awk '
{
prev=0
while(match($0,/ABC/)){
$0=substr($0,RSTART+RLENGTH)
print FNR,prev+RSTART
prev+=RSTART+2
}
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
prev=0 ##Setting prev variable to 0 here.
while(match($0,/ABC/)){ ##Using while loop to match ABC string and it runs till ABC match is ture in current line.
$0=substr($0,RSTART+RLENGTH) ##Re-creating current line by assigning value of rest of line(which starts after match of ABC).
print FNR,prev+RSTART ##Printing line number along with prev+RSTART value here.
prev+=RSTART+2 ##Setting prev to prev+RSTART+2 here.
}
}
' Input_file ##Mentioning Input_file name here.
Determination of the coordinates of a string with awk:
echo "xyzABCdefghiABCdefghiABCdef" \
| awk -v s="ABC" 'BEGIN{ len=length(s) }
{
for(i=1; i<=length($0); i++){
if(substr($0, i, len)==s){
print NR, i
}
}
}'
Output:
1 4
1 13
1 22
As one line:
echo xyzABCdefghiABCdefghiABCdef | awk -v s="ABC" 'BEGIN{ len=length(s) } { for(i=1; i<=length($0); i++){ if(substr($0,i,len)==s) { print NR,i } } }'
Source: Find position of character with awk
One awk idea using split() and some slicing-n-dicing of length() results:
ptn='ABC'
echo xyzABCdefghiABCdefghiABCdef |
awk -v ptn="${ptn}" '
{ pos=-(length(ptn)-1)
n=split($0,arr,ptn)
for (i=1;i<n;i++) {
pos+=length(arr[i] ptn)
print NR,pos
}
}'
This generates:
1 4
1 13
1 22

divide each column by max value/last value

I have a matrix like this:
A 25 27 50
B 35 37 475
C 75 78 80
D 99 88 76
0 234 230 681
The last row is the sum of all elements in the column - and it is also the maximum value.
What I would like to get is the matrix in which each value is divided by the last value in the column (e.g. for the first number in column 2, I would want "25/234="):
A 0.106837606837607 0.117391304347826 0.073421439060206
B 0.14957264957265 0.160869565217391 0.697503671071953
C 0.320512820512821 0.339130434782609 0.117474302496329
D 0.423076923076923 0.382608695652174 0.11160058737151
An answer in another thread gives an acceptable result for one column, but I was not able to loop it over all columns.
$ awk 'FNR==NR{max=($2+0>max)?$2:max;next} {print $1,$2/max}' file file
(this answer was provided here: normalize column data with maximum value of that column)
I would be grateful for any help!
In addition to the great approaches by #RavinderSingh13, you can also isolate the last line in the input file with, e.g. tail -n1 Input_file and then use the split() command in the BEGIN rule to separate the values. You can then make a single-pass through the file with awk to update the values as you indicate. In the end, you can pipe the output to head -n-1 to remove the unneeded final row, e.g.
awk -v lline="$(tail -n1 Input_file)" '
BEGIN { split(lline,a," ") }
{
printf "%s", $1
for(i=2; i<=NF; i++)
printf " %.15lf", $i/a[i]
print ""
}
' Input_file | head -n-1
Example Use/Output
$ awk -v lline="$(tail -n1 Input_file)" '
> BEGIN { split(lline,a," ") }
> {
> printf "%s", $1
> for(i=2; i<=NF; i++)
> printf " %.15lf", $i/a[i]
> print ""
> }
> ' Input_file | head -n-1
A 0.106837606837607 0.117391304347826 0.073421439060206
B 0.149572649572650 0.160869565217391 0.697503671071953
C 0.320512820512821 0.339130434782609 0.117474302496329
D 0.423076923076923 0.382608695652174 0.111600587371512
(note: this presumes you don't have trailing blank lines in your file and you really don't have blank lines between every row. If you do, let me know)
The differences between the approaches are largely negligible. In each case you are making a total of 3-passes through the file. Here with tail, awk and then head. In the other case with wc and then two-passes with awk.
Let either of us know if you have questions.
1st solution: Could you please try following, written and tested with shown samples in GNU awk. With exact 15 floating points as per OP's shown samples:
awk -v lines=$(wc -l < Input_file) '
FNR==NR{
if(FNR==lines){
for(i=2;i<=NF;i++){ arr[i]=$i }
}
next
}
FNR<lines{
for(i=2;i<=NF;i++){ $i=sprintf("%0.15f",(arr[i]?$i/arr[i]:"NaN")) }
print
}
' Input_file Input_file
2nd solution: If you don't care of floating points to be specific points then try following.
awk -v lines=$(wc -l < Input_file) '
FNR==NR && FNR==lines{
for(i=2;i<=NF;i++){ arr[i]=$i }
next
}
FNR<lines && FNR!=NR{
for(i=2;i<=NF;i++){ $i=(arr[i]?$i/arr[i]:"NaN") }
print
}
' Input_file Input_file
OR(placing condition of FNR==lines inside FNR==NR condition):
awk -v lines=$(wc -l < Input_file) '
FNR==NR{
if(FNR==lines){
for(i=2;i<=NF;i++){ arr[i]=$i }
}
next
}
FNR<lines{
for(i=2;i<=NF;i++){ $i=(arr[i]?$i/arr[i]:"NaN") }
print
}
' Input_file Input_file
Explanation: Adding detailed explanation for above.
awk -v lines=$(wc -l < Input_file) ' ##Starting awk program from here, creating lines which variable which has total number of lines in Input_file here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
if(FNR==lines){ ##Checking if FNR is equal to lines then do following.
for(i=2;i<=NF;i++){ arr[i]=$i } ##Traversing through all fields here of current line and creating an array arr with index of i and value of current field value.
}
next ##next will skip all further statements from here.
}
FNR<lines{ ##Checking condition if current line number is lesser than lines, this will execute when 2nd time Input_file is being read.
for(i=2;i<=NF;i++){ $i=sprintf("%0.15f",(arr[i]?$i/arr[i]:"NaN")) } ##Traversing through all fields here and saving value of divide of current field with arr current field value with 15 floating points into current field.
print ##Printing current line here.
}
' Input_file Input_file ##Mentioning Input_file names here.

How to run a bash script in a loop

i wrote a bash script in order to pull substrings and save it to an output file from two input files that looks like this:
input file 1
>genotype1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
input file 2
gene1 10 20
gene2 40 50
genen x y
my script
>output_file
cat input_file2 | while read row; do
echo $row > temp
geneName=`awk '{print $1}' temp`
startPos=`awk '{print $2}' temp`
endPos=`awk '{print $3}' temp`
length=$(expr $endPos - $startPos)
for i in temp; do
echo ">${geneName}" >> genes_fasta
awk -v S=$startPos -v L=$length '{print substr($0,S,L)}' input_file1 >> output file
done
done
how can i make it work in a loop for more than one string in the input file 1?
new input file looks like this:
>genotype1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>genotype2
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
>genotypen...
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn...
I would like to have a different out file for every genotype and that the file name would be the genotype name.
thank you!
If I'm understanding correctly, would you try the following:
awk '
FNR==NR {
name[NR] = $1
start[NR] = $2
len[NR] = $3 - $2
count = NR
next
}
/^>/ {
sub(/^>/,"")
genotype=$0
next
}
{
for (i = 1; i <= count; i++) {
print ">" name[i] > genotype
print substr($0, start[i], len[i]) >> genotype
}
close(genotype)
}' input_file2 input_file1
input_file1:
>genotype1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>genotype2
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
>genotype3
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Input_file2:
gene1 10 20
gene2 40 50
gene3 20 25
[Results]
genotype1:
>gene1
aaaaaaaaaa
>gene2
aaaaaaaaaa
>gene3
aaaaa
genotype2:
>gene1
bbbbbbbbbb
>gene2
bbbbbbbbbb
>gene3
bbbbb
genotype3:
>gene1
nnnnnnnnnn
>gene2
nnnnnnnnnn
>gene3
nnnnn
[EDIT]
If you want to store the output files to a different directory,
please try the following instead:
dir="./outdir" # directory name to store the output files
# you can modify the name as you want
mkdir -p "$dir"
awk -v dir="$dir" '
FNR==NR {
name[NR] = $1
start[NR] = $2
len[NR] = $3 - $2
count = NR
next
}
/^>/ {
sub(/^>/,"")
genotype=$0
next
}
{
for (i = 1; i <= count; i++) {
print ">" name[i] > dir"/"genotype
print substr($0, start[i], len[i]) >> dir"/"genotype
}
close(dir"/"genotype)
}' input_file2 input_file1
The 1st two lines are executed in bash to define and mkdir the destination directory.
Then the directory name is passed to awk via -v option
Hope this helps.
Could you please try following, where I am assuming that your Input_file1's column which starts with > should be compared with 1st column of Input_file2's first column (since samples are confusing so based on OP's attempt this has been written).
awk '
FNR==NR{
start_point[$1]=$2
end_point[$1]=$3
next
}
/^>/{
sub(/^>/,"")
val=$0
next
}
{
print val ORS substr($0,start_point[val],end_point[val])
val=""
}
' Input_file2 Input_file1
Explanation: Adding explanation for above code.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first Input_file named Input_file2 is being read.
start_point[$1]=$2 ##Creating an array named start_point with index $1 of current line and its value is $2.
end_point[$1]=$3 ##Creating an array named end_point with index $1 of current line and its value is $3.
next ##next will skip all further statements from here.
}
/^>/{ ##Checking condition if a line starts from > then do following.
sub(/^>/,"") ##Substituting starting > with NULL.
val=$0 ##Creating a variable val whose value is $0.
next ##next will skip all further statements from here.
}
{
print val ORS substr($0,start_point[val],end_point[val]) ##Printing val newline(ORS) and sub-string of current line whose start value is value of start_point[val] and end point is value of end_point[val].
val="" ##Nullifying variable val here.
}
' Input_file2 Input_file1 ##Mentioning Input_file names here.

match pattern and print corresponding columns from a file using awk or grep

I have a input file with repetitive headers (below):
A1BG A1BG A1CF A1CF A2ML1
aa bb cc dd ee
1 2 3 4 5
I want to print all columns with same header in one file. e.g for above file there should be three output files; 1 for A1BG with 2 columns; 2nd for A1CF with 2 columns; 3rd for A2ML1 with 1 column. I there any way to do it using one-liners by awk or grep?
I tried following one-liner:
awk -v f="A1BG" '!o{for(x=1;x<=NF;x++)if($x==f){o=1;next}}o{print $x}' trial.txt
but this searches the pattern in only one column (1 in this case). I want to look through all the header names and print all the corresponding columns which have A1BG in their header.
This awk solution takes the same approach as Lars but uses gawk 4.0 2D arrays
awk '
# fill cols map of header to its list of columns
NR==1 {
for(i=1; i<=NF; ++i) {
if(!($i in cols))
j=0
cols[$i][j++]=i
}
}
{
# write tab-delimited columns for each header to its cols.header file
for(h in cols) {
of="cols."h
for(i=0; i < length(cols[h]); ++i) {
if(i > 0) printf("\t") >of
printf("%s", $cols[h][i]) >of
}
printf("\n") >of
}
}
'
awk solution should be pretty fast - output files are tab-delimited and named cols.A1BG cols.A1CF etc
awk '
# fill cols columns map to header and tab map to track tab state per header
NR==1 {
for(i=1; i<=NF; ++i) {
cols[i]=$i
tab[$i]=0
}
}
{
# reset tab state for every header
for(h in tab) tab[h]=0
# write tab-delimited column to its cols.header file
for(i=1; i<=NF; ++i) {
hdr=cols[i]
of="cols." hdr
if(tab[hdr]) {
printf("\t") >of
} else
tab[hdr]=1
printf("%s", $i) >of
}
# newline for every header file
for(h in tab) {
of="cols." h
printf("\n") >of
}
}
'
This is the output from both of my awk solutions:
$ ./scr.sh <in.txt; head cols.*
==> cols.A1BG <==
A1BG A1BG
aa bb
1 2
==> cols.A1CF <==
A1CF A1CF
cc dd
3 4
==> cols.A2ML1 <==
A2ML1
ee
5
I cannot help you with a 1-liner but here is a 10-liner for GNU awk:
script.awk
NR == 1 { PROCINFO["sorted_in"] = "#ind_num_asc"
for( i=1; i<=NF; i++ ) { f2c[$i] = (i==1)? i : f2c[$i] " " i } }
{ for( n in f2c ) {
split( f2c[n], fls, " ")
tmp = ""
for( f in fls ) tmp = (f ==1) ? $fls[f] : tmp "\t" $fls[f]
print tmp > n
}
}
Use it like this: awk -f script.awk your_file
In the first action: it determines filenames from the columns in the first record (NR == 1).
In the second action: for each record: for each output file: its columns (as defined in the first record) are collected into tmp and written to the output file.
The use of PROCINFO requires GNU awk, see Ed Mortons comments for alternatives.
Example run and ouput:
> awk -f mpapccfaf.awk mpapccfaf.csv
> cat A1BG
A1BG A1BG
aa bb
1 2
Here y'go, a one-liner as requested:
awk 'NR==1{for(i=1;i<=NF;i++)a[$i][i]}{PROCINFO["sorted_in"]="#ind_num_asc";for(n in a){c=0;for(f in a[n])printf"%s%s",(c++?OFS:""),$f>n;print"">n}}' file
The above uses GNU awk 4.* for true multi-dimensional arrays and sorted_in.
For anyone else reading this who prefers clarity over the brevity the OP needs, here it is as a more natural multi-line script:
$ cat tst.awk
NR==1 {
for (i=1; i<=NF; i++) {
names2fldNrs[$i][i]
}
}
{
PROCINFO["sorted_in"] = "#ind_num_asc"
for (name in names2fldNrs) {
c = 0
for (fldNr in names2fldNrs[name]) {
printf "%s%s", (c++ ? OFS : ""), $fldNr > name
}
print "" > name
}
}
$ awk -f tst.awk file
$ cat A1BG
A1BG A1BG
aa bb
1 2
$ cat A1CF
A1CF A1CF
cc dd
3 4
$ cat A2ML1
A2ML1
ee
Since you wrote in one of the comments to my other answer that you have 20000 columns, lets consider a two step approach to ease debugging to find out which of the steps breaks.
step1.awk
NR == 1 { PROCINFO["sorted_in"] = "#ind_num_asc"
for( i=1; i<=NF; i++ ) { f2c[$i] = (f2c[$i]=="")? "$" i : (f2c[$i] " $" i) } }
NR== 2 { for( fn in f2c) printf("%s:%s\n", fn,f2c[fn])
exit
}
Step1 should give us a list of files together with their columns:
> awk -f step1.awk yourfile
Mpap_1:$1, $2, $3, $5, $13, $19, $25
Mpap_2:$4, $6, $8, $12, $14, $16, $20, $22, $26, $28
Mpap_3:$7, $9, $10, $11, $15, $17, $18, $21, $23, $24, $27, $29, $30
In my test data Mpap_1 is the header in column 1,2,3,5,13,19,25. Lets hope that this first step works with your large set of columns. (To be frank: I dont know if awk can deal with $20000.)
Step 2: lets create one of those famous one liners:
> awk -f step1.awk yourfile | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1 "\"" }; END { print "}" }' | awk -v "OFS=\t" -f - yourfile
The first part is our step 1, the second part builds on-the-fly a second awk script, with lines like this: print $1, $2, $3, $5, $13, $19, $25 > "Mpap_1". This second awk script is piped to the third part, which read the script from stdin (-f -) and applies the script to your input file.
In case something does not work: watch the output of each part of step2, you can execute the parts from the left up to (but not including) each of the | symbols and see what is going on, e.g.:
awk -f step1.awk yourfile
awk -f step1.awk yourfile | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1 "\"" }; END { print "}" }'
Following worked for me:
code for step1.awk:
NR == 1 { PROCINFO["sorted_in"] = "#ind_num_asc"
for( i=1; i<=NF; i++ ) { f2c[$i] = (f2c[$i]=="")? "$" i : (f2c[$i] " \"\t\" $" i) } }
NR== 2 { for( fn in f2c) printf("%s:%s\n", fn,f2c[fn])
exit
}
Then run one liner which uses above awk script:
awk -f step1.awk file.txt | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1".txt" "\"" }; END { print "}" }'| awk -f - file.txt
This outputs tab delimited .txt files having all the columns with same header in one file. (separate files for each type of header)
Thanks Lars Fischer and others.
Cheers

Resources