Conditional vlookup in bash with awk or sed?

I have two files (both have headers). Every line in both files starts with a date in the first column, in the same format. The separator is a semicolon.
The 9th column of the first file can only contain one of these ids: UK, JPN, or EUR.
I need to extend file1 with the value ("intel") from file2 that matches each line's date and id.
I could do it with a bash script and a "for" loop, of course, but I'm sure an awk (or other) command would be lighter on resources, if that's possible.
Thanks in advance for any hint.
PS: I tried unsuccessfully to adapt this method: https://unix.stackexchange.com/questions/428861/vlookup-equivalent-in-awk-scripting
The first file:
Date;$2;$3;$4;$5;$6;$7;$8;Id
2018-01-01; ;UK
2018-01-02; ;JPN
2018-01-03; ;EUR
2018-01-04; ;JPN
The second file:
Date;UKDIR;JPNDIR;EURDIR
2018-01-01;1;2;3
2018-01-02;4;5;6
2018-01-03;7;8;9
2018-01-04;11;10;12
Expected output:
Date;$2;$3;$4;$5;$6;$7;$8;Id ;Intel
2018-01-01; ;UK ;1
2018-01-02; ;JPN ;5
2018-01-03; ;EUR ;9
2018-01-04; ;JPN ;10

You may use this awk:
awk -F';' -v OFS='; ' 'NR==1 { for (i=2; i<=NF; i++) h[i]=$i; next }
FNR==NR { for (i=2; i<=NF; i++) a[$1,h[i]]=$i; next }
FNR==1 { print $0, "Intel"; next }
{ print $0, a[$1,$NF "DIR"] }' file2 file1
Date;$2;$3;$4;$5;$6;$7;$8;Id; Intel
2018-01-01; ;UK; 1
2018-01-02; ;JPN; 5
2018-01-03; ;EUR; 9
2018-01-04; ;JPN; 10
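For readers who want to try this, here is a commented sketch of the same two-key lookup, run against the sample data from the question. file2 is read first and every value is stored under a (date, column name) key; each file1 row then looks up its own date and Id. The scratch directory, the `result` variable, and the plain `;` output separator are assumptions made for this illustration.

```shell
#!/bin/sh
# Recreate the question's sample files in a scratch directory (hypothetical paths).
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
Date;$2;$3;$4;$5;$6;$7;$8;Id
2018-01-01; ;UK
2018-01-02; ;JPN
2018-01-03; ;EUR
2018-01-04; ;JPN
EOF
cat > "$tmp/file2" <<'EOF'
Date;UKDIR;JPNDIR;EURDIR
2018-01-01;1;2;3
2018-01-02;4;5;6
2018-01-03;7;8;9
2018-01-04;11;10;12
EOF

result=$(awk -F';' -v OFS=';' '
    NR == 1   { for (i = 2; i <= NF; i++) h[i] = $i; next }         # file2 header: column number -> name
    FNR == NR { for (i = 2; i <= NF; i++) a[$1, h[i]] = $i; next }  # file2 data: (date, name) -> value
    FNR == 1  { print $0, "Intel"; next }                           # file1 header line
              { print $0, a[$1, $NF "DIR"] }                        # file1 data: look up (date, Id "DIR")
' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$result"
rm -rf "$tmp"
```

The last field of each file1 row (`$NF`) is used as the Id, so the sketch does not depend on the row really having nine columns.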

Could you please try the following.
awk '
BEGIN{
count=count1=1
FS=OFS=";"
}
FNR!=NR && FNR==1{
print $0 OFS "Intel"
}
FNR==NR && /^[0-9]/{
a[$1]=$(++count)
count=count==4?1:count
next
}
NF && /^[0-9]/{
print $0 OFS a[$1]
count1=count1==4?1:count1
}
' second_file first_file
Output will be as follows.
Date;$2;$3;$4;$5;$6;$7;$8;Id;Intel
2018-01-01; ;UK;1
2018-01-02; ;JPN;5
2018-01-03; ;EUR;9
2018-01-04; ;JPN;11

$ cat tst.awk
BEGIN { FS=OFS=";" }
NR==FNR {
if (NR == 1) {
for (fldNr=2; fldNr<=NF; fldNr++) {
fldName = $fldNr
sub(/DIR/,"",fldName)
fldNr2name[fldNr] = fldName
}
}
else {
for (fldNr=2; fldNr<=NF; fldNr++) {
fldName = fldNr2name[fldNr]
dateFldName2val[$1,fldName] = $fldNr
}
}
next
}
{
print $0, (FNR>1 ? dateFldName2val[$1,$NF] : "Intel")
}
$ awk -f tst.awk file2 file1
Date;$2;$3;$4;$5;$6;$7;$8;Id;Intel
2018-01-01; ;UK;1
2018-01-02; ;JPN;5
2018-01-03; ;EUR;9
2018-01-04; ;JPN;10

Related

How to run command line awk script (with arguments) as bash script?

I have an awk script (tst.awk):
NR==FNR {
ids[++numIds] = $1","
next
}
FNR==1 { numFiles++ }
{
id = $1
sub(/^[^[:space:]]+[[:space:]]+/,"")
vals[id,numFiles] = $0
gsub(/[^[:space:],]+/,"NA")
naVal[numFiles] = $0
}
END {
for ( idNr=1; idNr<=numIds; idNr++) {
id = ids[idNr]
printf "%s%s", id, OFS
for (fileNr=1; fileNr<=numFiles; fileNr++) {
val = ((id,fileNr) in vals ? vals[id,fileNr] : naVal[fileNr])
printf "%s%s", val, (fileNr<numFiles ? OFS : ORS)
}
}
}
That is called on the command line with:
awk -f tst.awk master file1 file2 file3 > output.file
(note: there can be a variable number of arguments)
How can I change this script, and command line code, to run it as a bash script?
I have tried (tst_awk.sh):
#!/bin/bash
awk -f "$1" "$2" "$3" "$4"
'NR==FNR {
ids[++numIds] = $1","
next
}
FNR==1 { numFiles++ }
{
id = $1
sub(/^[^[:space:]]+[[:space:]]+/,"")
vals[id,numFiles] = $0
gsub(/[^[:space:],]+/,"NA")
naVal[numFiles] = $0
}
END {
for ( idNr=1; idNr<=numIds; idNr++) {
id = ids[idNr]
printf "%s%s", id, OFS
for (fileNr=1; fileNr<=numFiles; fileNr++) {
val = ((id,fileNr) in vals ? vals[id,fileNr] : naVal[fileNr])
printf "%s%s", val, (fileNr<numFiles ? OFS : ORS)
}
}
}' > output_file
called on command line with:
./tst_awk.sh master file1 file2 file3
I have also tried (tst_awk2.sh):
#!/bin/bash
awk -f master file1 file2 file3
'NR==FNR {
ids[++numIds] = $1","
next
}
FNR==1 { numFiles++ }
...
}
}
}' > output_file
called on command line with:
./tst_awk2.sh
-f needs to be followed by the name of the awk script. You're putting the first argument of the shell script after it.
You can use "$@" to get all the script arguments, so you're not limited to a fixed number of arguments.
#!/bin/bash
awk -f /path/to/tst.awk "$@" > output_file
Use an absolute path to the awk script so you can run the shell script from any directory.
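As a small illustration of why "$@" is the right expansion here, the sketch below forwards every argument to awk as its own word, so any number of files works, including names that contain spaces. The function name `count_files` and the scratch files are hypothetical, made up for this example.

```shell
#!/bin/bash
# count_files is a hypothetical helper: it hands all of its arguments to awk
# unchanged and reports how many (non-empty) input files awk read.
count_files() {
    awk 'FNR == 1 { n++ } END { print n + 0 }' "$@"
}

tmp=$(mktemp -d)
printf 'a\n' > "$tmp/master"
printf 'b\n' > "$tmp/file 1"      # a file name containing a space
printf 'c\n' > "$tmp/file2"

n_files=$(count_files "$tmp/master" "$tmp/file 1" "$tmp/file2")
echo "$n_files"
rm -rf "$tmp"
```

By contrast, "$#" expands to the number of arguments, not the arguments themselves, so awk would be asked to open a file literally named after that number.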
If you don't want to use the separate tst.awk, you just put the script as the literal first argument to awk.
#!/bin/bash
awk 'NR==FNR {
ids[++numIds] = $1","
next
}
FNR==1 { numFiles++ }
{
id = $1
sub(/^[^[:space:]]+[[:space:]]+/,"")
vals[id,numFiles] = $0
gsub(/[^[:space:],]+/,"NA")
naVal[numFiles] = $0
}
END {
for ( idNr=1; idNr<=numIds; idNr++) {
id = ids[idNr]
printf "%s%s", id, OFS
for (fileNr=1; fileNr<=numFiles; fileNr++) {
val = ((id,fileNr) in vals ? vals[id,fileNr] : naVal[fileNr])
printf "%s%s", val, (fileNr<numFiles ? OFS : ORS)
}
}
}' "$@" > output_file
you can make your awk script executable by adding the shebang
#! /bin/awk -f
NR==FNR {
ids[++numIds] = $1","
next
}...
don't forget to chmod +x tst.awk
and run
$ ./tst.awk master file1 file2 file3 > outfile

Match two files and print the matched strings based on the second file using awk

I have two files below named InputFile and Ref
InputFile
1234~code1=yyy:code2=fff:code3=vvv
1256~code2=ttt:code1=yyy:code4=zzz
4567~code4=uuu
8907~code8=ooo:code7=rrr
Ref
code2
code3
code8
code7
I have to match all the records in Ref against InputFile's second column (the file is ~ delimited, and the second column is split on colons). If a record in Ref is found in InputFile, the value following its = sign should be printed; otherwise an empty field should be printed.
Desired output
1234~fff~vvv~~
1256~ttt~~~
4567~~~~
8907~~~ooo~rrr
I'm about to load it to a table having the Ref records as the columns.
Here's my script as of now:
awk '
BEGIN{
FS=OFS="~"
}
FNR==NR{
a[$0]
next
}
FNR==1 && FNR!=NR{
print
next
}
{
num=split($2,array,"[=:]")
for(i=1;i<=num;i+=2){
if(array[i] in a){
val=val?val OFS array[i+1]:array[i+1]
}
else{
val=val?val OFS "~":"~"
}
}
print $1,val
val=""
}
' Ref InputFile
It prints the values of the codes (code1, code2, etc.) in InputFile that are present in Ref, but it doesn't print them in Ref's order.
Script's output
1234~~fff~vvv
1256~ttt
4567~
8907~ooo~rrr
something similar to yours
$ awk -F~ 'NR==FNR {c[NR]=$1; cs=NR; next}
{n=split($2,f,"[=:]");
delete k;
for(i=1;i<n;i+=2) k[f[i]]=f[i+1];
printf "%s", $1;
for(i=1;i<=cs;i++) printf "%s", FS k[c[i]];
print ""}' ref input
1234~fff~vvv~~
1256~ttt~~~
4567~~~~
8907~~~ooo~rrr
Since you want to keep the order of the ref file, don't insert its records as keys of the array; instead, add them as values indexed by their order number (here, the line number). Otherwise you're going to lose the order, which I think is the (only?) issue with your script.
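To make the ordering point concrete, here is a minimal sketch using the Ref values from the question. Storing each line as a value under its line number lets a counting loop replay the lines in file order, whereas `for (k in arr)` over keys has no guaranteed order in awk. The variable names are illustrative only.

```shell
#!/bin/sh
ordered=$(printf 'code2\ncode3\ncode8\ncode7\n' | awk '
    { c[NR] = $0 }                      # value stored under its line number, not used as a key
    END {
        for (i = 1; i <= NR; i++)       # counting loop: always replays input order
            printf "%s%s", c[i], (i < NR ? "~" : "\n")
    }')
echo "$ordered"
```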
$ cat tst.awk
BEGIN {
FS = "[~:=]"
OFS = "~"
}
NR == FNR {
refs[++numRefs] = $0
next
}
{
delete ref2val
for (fldNr=2; fldNr<NF; fldNr+=2) {
ref2val[$fldNr] = $(fldNr+1)
}
printf "%s%s", $1, OFS
for (refNr=1; refNr<=numRefs; refNr++) {
ref = refs[refNr]
printf "%s%s", ref2val[ref], (refNr<numRefs ? OFS : ORS)
}
}
$ awk -f tst.awk refs file
1234~fff~vvv~~
1256~ttt~~~
4567~~~~
8907~~~ooo~rrr

how to find out common columns and its records from two files using awk

I have two files:
File 1:
id|name|address|country
1|abc|efg|xyz
2|asd|dfg|uio
File 2(only headers):
id|name|country
Now, I want an output like:
OUTPUT:
id|name|country
1|abc|xyz
2|asd|uio
Basically, I have a user record file(file1) and a header file(file2).Now, I want to extract only those records from (file1) whose columns match with that in the header file.
I want to do this using awk or bash.
I tried using:
awk 'BEGIN { OFS="..."} FNR==NR { a[(FNR"")] = $0; next } { print a[(FNR"")], $0 > "test.txt"}' header.txt file.txt
and have no idea what to do next.
Thank You
Following awk may help you on same.
awk -F"|" 'FNR==NR{for(i=1;i<=NF;i++){a[$i]};next} FNR==1 && FNR!=NR{for(j=1;j<=NF;j++){if($j in a){b[++p]=j}}} {for(o=1;o<=p;o++){printf("%s%s",$b[o],o==p?ORS:OFS)}}' OFS="|" File2 File1
Adding a non-one liner form of solution too now.
awk -F"|" '
FNR==NR{
for(i=1;i<=NF;i++){
a[$i]};
next}
FNR==1 && FNR!=NR{
for(j=1;j<=NF;j++){
if($j in a){ b[++p]=j }}
}
{
for(o=1;o<=p;o++){
printf("%s%s",$b[o],o==p?ORS:OFS)}
}
' OFS="|" File2 File1
Edit by Ed Morton: FWIW here's the same script written with normal indenting/spacing and a couple of more meaningful variable names:
BEGIN { FS=OFS="|" }
NR==FNR {
for (i=1; i<=NF; i++) {
names[$i]
}
next
}
FNR==1 {
for (i=1; i<=NF; i++) {
if ($i in names) {
f[++numFlds] = i
}
}
}
{
for (i=1; i<=numFlds; i++) {
printf "%s%s", $(f[i]), (i<numFlds ? OFS : ORS)
}
}
with (lots of) unix pipes as Doug McIlroy intended...
$ function p() { sed 1q "$1" | tr '|' '\n' | cat -n | sort -k2; }
$ cut -d'|' -f"$(join -j2 <(p header) <(p file) | sort -k2n | cut -d' ' -f3 | paste -sd,)" file
id|name|country
1|abc|xyz
2|asd|uio
Solution using bash>4:
IFS='|' headers1=($(head -n1 $file1))
IFS='|' headers2=($(head -n1 $file2))
IFS=$'\n'
# find idxes we want to output, ie. mapping of headers1 to headers2
idx=()
for i in $(seq 0 $((${#headers2[@]}-1))); do
for j in $(seq 0 $((${#headers1[@]}-1))); do
if [ "${headers2[$i]}" == "${headers1[$j]}" ]; then
idx+=($j)
break
fi
done
done
# idx=(0 1 3) for example
# simple join output function from https://stackoverflow.com/questions/1527049/join-elements-of-an-array
join_by() { local IFS="$1"; shift; echo "$*"; }
# first line - output headers
join_by '|' "${headers2[@]}"
isfirst=true
while IFS='|' read -a vals; do
# ignore first (header line)
if $isfirst; then
isfirst=false
continue;
fi;
# filter from line only columns with idx indices
tmp=()
for i in "${idx[@]}"; do
tmp+=("${vals[$i]}")
done
# join output with '|'
join_by '|' "${tmp[@]}"
done < $file1
This one respects the order of the columns in file1 (here the header-only file), whose order I changed for the demonstration:
$ cat file1
id|country|name
The awk:
$ awk '
BEGIN { FS=OFS="|" }
NR==1 { # file1
n=split($0,a)
next
}
NR==2 { # file2 header
for(i=1;i<=NF;i++)
b[$i]=i
}
{ # output part
for(i=1;i<=n;i++)
printf "%s%s", $b[a[i]], (i==n?ORS:OFS)
}' file1 file2
id|country|name
1|xyz|abc
2|uio|asd
(Another version, using cut for the output, is in the revision history.)
This is similar to RavinderSingh13's solution, in that it first reads the headers from the shorter file, and then decides which columns to keep from the longer file based on the headers on the first line of it.
It however does the output differently. Instead of constructing a string, it shifts the columns to the left if it does not want to include a particular field.
BEGIN { FS = OFS = "|" }
# read headers from first file
NR == FNR { for (i = 1; i <= NF; ++i) header[$i]; next }
# mark fields in second file as "selected" if the header corresponds
# to a header in the first file
FNR == 1 {
for (i = 1; i <= NF; ++i)
select[i] = ($i in header)
}
{
skip = 0
pos = 1
for (i = 1; i <= NF; ++i)
if (!select[i]) { # we don't want this field
++skip
$pos = $(pos + skip) # shift fields left
} else
++pos
NF -= skip # adjust number of fields
print
}
Running this:
$ mawk -f script.awk file2 file1
id|name|country
1|abc|xyz
2|asd|uio

Remove duplicate from csv using bash / awk

I have a csv file with the format :
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
I want to group by first column unique id's and concat types in a single row like this:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
I found awk does a great job in handling such scenarios. But all I could achieve is this:
"id-1"|"A":"B":"D":"B"
"id-2"|"B":"C"
"id-3"|"A":"A"
I used this command:
awk -F "|" '{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
How can I remove the duplicates and also handle the formatting of the second column types?
quick fix:
$ awk -F "|" '!seen[$0]++{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
!seen[$0]++ will be true only if line was not already seen
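A minimal sketch of that idiom on its own, using three rows in the style of the sample data (the `deduped` variable is just for illustration). The post-increment returns the old count, which is 0 the first time a line appears, so the negated test is true only for first occurrences.

```shell
#!/bin/sh
# Three input rows, the first two identical; only first occurrences are printed.
deduped=$(printf '%s\n' '"id-3"|"A"' '"id-3"|"A"' '"id-1"|"B"' | awk '!seen[$0]++')
printf '%s\n' "$deduped"
```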
If second column should all be within double quotes
$ awk -v dq='"' 'BEGIN{FS=OFS="|"}
!seen[$0]++{a[$1]=a[$1] ? a[$1]":"$2 : $2}
END{for (i in a){gsub(dq,"",a[i]); print i, dq a[i] dq}}' file
"id-1"|"A:B:D"
"id-2"|"C:B"
"id-3"|"A"
With GNU awk for true multi-dimensional arrays and gensub() and sorted_in:
$ awk -F'|' '
{ a[$1][gensub(/"/,"","g",$2)] }
END {
PROCINFO["sorted_in"] = "@ind_str_asc"
for (i in a) {
c = 0
for (j in a[i]) {
printf "%s%s", (c++ ? ":" : i "|\""), j
}
print "\""
}
}
' file
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
The output rows and columns will both be string-sorted (i.e. alphabetically by characters) in ascending order.
Short GNU datamash + tr solution:
datamash -st'|' -g1 unique 2 <file | tr ',' ':'
The output:
"id-1"|"A":"B":"D"
"id-2"|"B":"C"
"id-3"|"A"
----------
If the double quotes between items should be eliminated, use the following alternative:
datamash -st'|' -g1 unique 2 <file | sed 's/","/:/g'
The output:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
For the sample input, the one-liners below will work, but the output is unsorted.
One-liner
# using two array ( recommended )
awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
# using regexp
awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2 ) : $2}END{for(i in a)print i,a[i]}' infile
Test Results:
$ cat infile
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
$ awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
$ awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2 ) : $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
Better Readable:
Using regexp
awk 'BEGIN{
FS=OFS="|"
}
{
a[$1] =$1 in a ?(a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2):$2
}
END{
for(i in a)
print i,a[i]
}
' infile
Using two array
awk 'BEGIN{
FS=OFS="|"
}
!seen[$1,$2]++{
a[$1] = ($1 in a ? a[$1] ":" : "") $2
}
END{
for(i in a)
print i,a[i]
}' infile
Note: you can also use !seen[$0]++, which uses the entire line as the index. If your real data
has more columns and you only want to deduplicate on some of them, prefer something like !seen[$1,$2]++,
where column 1 and column 2 together are used as the index.
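The difference only shows up when there are more columns than the ones you deduplicate on. The sketch below uses a made-up three-column input (the third column is hypothetical, it is not part of the question's data) and counts how many rows survive each variant.

```shell
#!/bin/sh
# Hypothetical rows: the first two agree on columns 1-2 but differ in column 3.
rows='id-1|A|x
id-1|A|y
id-1|B|x'

by_line=$(printf '%s\n' "$rows" | awk -F'|' '!seen[$0]++'     | wc -l | tr -d ' ')
by_cols=$(printf '%s\n' "$rows" | awk -F'|' '!seen[$1,$2]++' | wc -l | tr -d ' ')
echo "whole line: $by_line, columns 1-2: $by_cols"
```

Keyed on the whole line, all three rows count as distinct; keyed on the first two columns, the second row is treated as a duplicate of the first.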
awk + sort solution:
awk -F'|' '{ gsub(/"/,"",$2); a[$1]=b[$1]++? a[$1]":"$2:$2 }
END{ for(i in a) printf "%s|\"%s\"\n",i,a[i] }' <(sort -u file)
The output:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

match pattern and print corresponding columns from a file using awk or grep

I have a input file with repetitive headers (below):
A1BG A1BG A1CF A1CF A2ML1
aa bb cc dd ee
1 2 3 4 5
I want to print all columns with the same header into one file, e.g. for the above file there should be three output files: one for A1BG with 2 columns, a second for A1CF with 2 columns, and a third for A2ML1 with 1 column. Is there any way to do it with a one-liner in awk or grep?
I tried following one-liner:
awk -v f="A1BG" '!o{for(x=1;x<=NF;x++)if($x==f){o=1;next}}o{print $x}' trial.txt
but this searches the pattern in only one column (1 in this case). I want to look through all the header names and print all the corresponding columns which have A1BG in their header.
This awk solution takes the same approach as Lars but uses gawk 4.0 2D arrays
awk '
# fill cols map of header to its list of columns
NR==1 {
for(i=1; i<=NF; ++i) {
if(!($i in cols))
j=0
cols[$i][j++]=i
}
}
{
# write tab-delimited columns for each header to its cols.header file
for(h in cols) {
of="cols."h
for(i=0; i < length(cols[h]); ++i) {
if(i > 0) printf("\t") >of
printf("%s", $cols[h][i]) >of
}
printf("\n") >of
}
}
'
awk solution should be pretty fast - output files are tab-delimited and named cols.A1BG cols.A1CF etc
awk '
# fill cols columns map to header and tab map to track tab state per header
NR==1 {
for(i=1; i<=NF; ++i) {
cols[i]=$i
tab[$i]=0
}
}
{
# reset tab state for every header
for(h in tab) tab[h]=0
# write tab-delimited column to its cols.header file
for(i=1; i<=NF; ++i) {
hdr=cols[i]
of="cols." hdr
if(tab[hdr]) {
printf("\t") >of
} else
tab[hdr]=1
printf("%s", $i) >of
}
# newline for every header file
for(h in tab) {
of="cols." h
printf("\n") >of
}
}
'
This is the output from both of my awk solutions:
$ ./scr.sh <in.txt; head cols.*
==> cols.A1BG <==
A1BG A1BG
aa bb
1 2
==> cols.A1CF <==
A1CF A1CF
cc dd
3 4
==> cols.A2ML1 <==
A2ML1
ee
5
I cannot help you with a 1-liner but here is a 10-liner for GNU awk:
script.awk
NR == 1 { PROCINFO["sorted_in"] = "@ind_num_asc"
for( i=1; i<=NF; i++ ) { f2c[$i] = (i==1)? i : f2c[$i] " " i } }
{ for( n in f2c ) {
split( f2c[n], fls, " ")
tmp = ""
for( f in fls ) tmp = (f ==1) ? $fls[f] : tmp "\t" $fls[f]
print tmp > n
}
}
Use it like this: awk -f script.awk your_file
In the first action: it determines filenames from the columns in the first record (NR == 1).
In the second action: for each record: for each output file: its columns (as defined in the first record) are collected into tmp and written to the output file.
The use of PROCINFO requires GNU awk; see Ed Morton's comments for alternatives.
Example run and output:
> awk -f mpapccfaf.awk mpapccfaf.csv
> cat A1BG
A1BG A1BG
aa bb
1 2
Here y'go, a one-liner as requested:
awk 'NR==1{for(i=1;i<=NF;i++)a[$i][i]}{PROCINFO["sorted_in"]="@ind_num_asc";for(n in a){c=0;for(f in a[n])printf"%s%s",(c++?OFS:""),$f>n;print"">n}}' file
The above uses GNU awk 4.* for true multi-dimensional arrays and sorted_in.
For anyone else reading this who prefers clarity over the brevity the OP needs, here it is as a more natural multi-line script:
$ cat tst.awk
NR==1 {
for (i=1; i<=NF; i++) {
names2fldNrs[$i][i]
}
}
{
PROCINFO["sorted_in"] = "@ind_num_asc"
for (name in names2fldNrs) {
c = 0
for (fldNr in names2fldNrs[name]) {
printf "%s%s", (c++ ? OFS : ""), $fldNr > name
}
print "" > name
}
}
$ awk -f tst.awk file
$ cat A1BG
A1BG A1BG
aa bb
1 2
$ cat A1CF
A1CF A1CF
cc dd
3 4
$ cat A2ML1
A2ML1
ee
5
Since you wrote in one of the comments to my other answer that you have 20000 columns, let's consider a two-step approach, which makes it easier to debug and find out which of the steps breaks.
step1.awk
NR == 1 { PROCINFO["sorted_in"] = "@ind_num_asc"
for( i=1; i<=NF; i++ ) { f2c[$i] = (f2c[$i]=="")? "$" i : (f2c[$i] ", $" i) } }
NR== 2 { for( fn in f2c) printf("%s:%s\n", fn,f2c[fn])
exit
}
Step1 should give us a list of files together with their columns:
> awk -f step1.awk yourfile
Mpap_1:$1, $2, $3, $5, $13, $19, $25
Mpap_2:$4, $6, $8, $12, $14, $16, $20, $22, $26, $28
Mpap_3:$7, $9, $10, $11, $15, $17, $18, $21, $23, $24, $27, $29, $30
In my test data Mpap_1 is the header of columns 1, 2, 3, 5, 13, 19 and 25. Let's hope that this first step works with your large set of columns. (To be frank: I don't know if awk can deal with $20000.)
Step 2: let's create one of those famous one-liners:
> awk -f step1.awk yourfile | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1 "\"" }; END { print "}" }' | awk -v "OFS=\t" -f - yourfile
The first part is our step 1. The second part builds a second awk script on the fly, with lines like this: print $1, $2, $3, $5, $13, $19, $25 > "Mpap_1". This second awk script is piped to the third part, which reads the script from stdin (-f -) and applies it to your input file.
In case something does not work: watch the output of each part of step2, you can execute the parts from the left up to (but not including) each of the | symbols and see what is going on, e.g.:
awk -f step1.awk yourfile
awk -f step1.awk yourfile | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1 "\"" }; END { print "}" }'
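A variant of the same idea that writes each stage to a file instead of piping it may make the stages easier to inspect. This is only a sketch under stated assumptions: it uses the three-line sample from the question, a scratch directory, and the hypothetical file names cols.txt and gen.awk. It avoids PROCINFO so that it runs in any POSIX awk, which means the order of the lines in cols.txt is unspecified (harmless here, since each line targets its own output file).

```shell
#!/bin/sh
tmp=$(mktemp -d)                      # scratch directory (assumption)
printf 'A1BG A1BG A1CF A1CF A2ML1\naa bb cc dd ee\n1 2 3 4 5\n' > "$tmp/in.txt"

# Stage 1: map each header name to a comma-separated list of field references.
awk 'NR == 1 {
    for (i = 1; i <= NF; i++)
        f2c[$i] = (f2c[$i] == "" ? "$" i : f2c[$i] ", $" i)
    for (fn in f2c)
        printf "%s:%s\n", fn, f2c[fn]
    exit
}' "$tmp/in.txt" > "$tmp/cols.txt"

# Stage 2: turn that list into an awk program, one print statement per output file.
awk -F: -v dir="$tmp" '
    BEGIN { print "{" }
          { print "  print " $2 " > \"" dir "/" $1 "\"" }
    END   { print "}" }
' "$tmp/cols.txt" > "$tmp/gen.awk"

# Stage 3: run the generated program against the original input.
awk -v OFS='\t' -f "$tmp/gen.awk" "$tmp/in.txt"

a1bg=$(cat "$tmp/A1BG")               # keep one result for inspection before cleanup
printf '%s\n' "$a1bg"
rm -rf "$tmp"
```

Looking at cols.txt and gen.awk before the cleanup line shows exactly what each stage produced.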
Following worked for me:
code for step1.awk:
NR == 1 { PROCINFO["sorted_in"] = "@ind_num_asc"
for( i=1; i<=NF; i++ ) { f2c[$i] = (f2c[$i]=="")? "$" i : (f2c[$i] " \"\t\" $" i) } }
NR== 2 { for( fn in f2c) printf("%s:%s\n", fn,f2c[fn])
exit
}
Then run one liner which uses above awk script:
awk -f step1.awk file.txt | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1".txt" "\"" }; END { print "}" }'| awk -f - file.txt
This outputs tab delimited .txt files having all the columns with same header in one file. (separate files for each type of header)
Thanks Lars Fischer and others.
Cheers
