match pattern and print corresponding columns from a file using awk or grep - bash

I have an input file with repetitive headers (below):
A1BG A1BG A1CF A1CF A2ML1
aa bb cc dd ee
1 2 3 4 5
I want to print all columns with the same header in one file. E.g. for the above file there should be three output files: one for A1BG with 2 columns, a second for A1CF with 2 columns, and a third for A2ML1 with 1 column. Is there any way to do it using one-liners with awk or grep?
I tried following one-liner:
awk -v f="A1BG" '!o{for(x=1;x<=NF;x++)if($x==f){o=1;next}}o{print $x}' trial.txt
but this searches for the pattern in only one column (column 1 in this case). I want to look through all the header names and print every column whose header is A1BG.
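For reference, a minimal sketch that fixes the attempted one-liner for a single header: collect every column number whose header matches f on line 1, then print those columns (header line included) for each row:
awk -v f="A1BG" '
NR==1 { for (x=1; x<=NF; x++) if ($x==f) idx[++n]=x }   # remember matching column numbers
{ for (x=1; x<=n; x++) printf "%s%s", $idx[x], (x<n ? OFS : ORS) }
' trial.txt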

This awk solution takes the same approach as Lars's but uses gawk 4.0 2D arrays:
awk '
# fill cols, a map of header to its list of column numbers
NR==1 {
    for(i=1; i<=NF; ++i)
        cols[$i][n[$i]++] = i   # per-header counter, so repeated headers need not be adjacent
}
{
    # write tab-delimited columns for each header to its cols.header file
    for(h in cols) {
        of = "cols." h
        for(i=0; i < length(cols[h]); ++i) {
            if(i > 0) printf("\t") > of
            printf("%s", $(cols[h][i])) > of
        }
        printf("\n") > of
    }
}
'

This awk solution should be pretty fast. Output files are tab-delimited and named cols.A1BG, cols.A1CF, etc.:
awk '
# fill cols, a map of column number to header, and tab, a per-header tab-state map
NR==1 {
    for(i=1; i<=NF; ++i) {
        cols[i] = $i
        tab[$i] = 0
    }
}
{
    # reset tab state for every header
    for(h in tab) tab[h] = 0
    # write each tab-delimited column to its cols.header file
    for(i=1; i<=NF; ++i) {
        hdr = cols[i]
        of = "cols." hdr
        if(tab[hdr])
            printf("\t") > of
        else
            tab[hdr] = 1
        printf("%s", $i) > of
    }
    # newline for every header file
    for(h in tab) {
        of = "cols." h
        printf("\n") > of
    }
}
'
This is the output from both of my awk solutions:
$ ./scr.sh <in.txt; head cols.*
==> cols.A1BG <==
A1BG A1BG
aa bb
1 2
==> cols.A1CF <==
A1CF A1CF
cc dd
3 4
==> cols.A2ML1 <==
A2ML1
ee
5

I cannot help you with a 1-liner but here is a 10-liner for GNU awk:
script.awk
NR == 1 { PROCINFO["sorted_in"] = "@ind_num_asc"
          for( i=1; i<=NF; i++ ) { f2c[$i] = (f2c[$i]=="") ? i : f2c[$i] " " i } }
{ for( n in f2c ) {
      split( f2c[n], fls, " " )
      tmp = ""
      for( f in fls ) tmp = (f==1) ? $fls[f] : tmp "\t" $fls[f]
      print tmp > n
  }
}
Use it like this: awk -f script.awk your_file
In the first action: it determines filenames from the columns in the first record (NR == 1).
In the second action: for each record: for each output file: its columns (as defined in the first record) are collected into tmp and written to the output file.
The use of PROCINFO requires GNU awk; see Ed Morton's comments for alternatives.
Example run and output:
> awk -f mpapccfaf.awk mpapccfaf.csv
> cat A1BG
A1BG A1BG
aa bb
1 2
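For awks without PROCINFO["sorted_in"], a hedged sketch of the same idea that avoids sorted traversal: the column numbers are appended to f2c[$i] in ascending order, so split() returns them already ordered and an indexed loop preserves that order (only the order-insensitive loop over output files still uses in):
NR == 1 { for( i=1; i<=NF; i++ ) f2c[$i] = f2c[$i] " " i }
{ for( n in f2c ) {
      m = split( f2c[n], fls, " " )   # fls[1..m] holds the column numbers in ascending order
      tmp = ""
      for( f=1; f<=m; f++ ) tmp = (f==1) ? $fls[f] : tmp "\t" $fls[f]
      print tmp > n
  }
}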

Here y'go, a one-liner as requested:
awk 'NR==1{for(i=1;i<=NF;i++)a[$i][i]}{PROCINFO["sorted_in"]="@ind_num_asc";for(n in a){c=0;for(f in a[n])printf"%s%s",(c++?OFS:""),$f>n;print"">n}}' file
The above uses GNU awk 4.* for true multi-dimensional arrays and sorted_in.
For anyone else reading this who prefers clarity over the brevity the OP needs, here it is as a more natural multi-line script:
$ cat tst.awk
NR==1 {
    for (i=1; i<=NF; i++) {
        names2fldNrs[$i][i]
    }
}
{
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for (name in names2fldNrs) {
        c = 0
        for (fldNr in names2fldNrs[name]) {
            printf "%s%s", (c++ ? OFS : ""), $fldNr > name
        }
        print "" > name
    }
}
$ awk -f tst.awk file
$ cat A1BG
A1BG A1BG
aa bb
1 2
$ cat A1CF
A1CF A1CF
cc dd
3 4
$ cat A2ML1
A2ML1
ee

Since you wrote in one of the comments to my other answer that you have 20000 columns, let's consider a two-step approach to ease debugging and find out which of the steps breaks.
step1.awk
NR == 1 { PROCINFO["sorted_in"] = "@ind_num_asc"
          for( i=1; i<=NF; i++ ) { f2c[$i] = (f2c[$i]=="") ? "$" i : (f2c[$i] ", $" i) } }
NR == 2 { for( fn in f2c ) printf("%s:%s\n", fn, f2c[fn])
          exit
}
Step1 should give us a list of files together with their columns:
> awk -f step1.awk yourfile
Mpap_1:$1, $2, $3, $5, $13, $19, $25
Mpap_2:$4, $6, $8, $12, $14, $16, $20, $22, $26, $28
Mpap_3:$7, $9, $10, $11, $15, $17, $18, $21, $23, $24, $27, $29, $30
In my test data, Mpap_1 is the header of columns 1, 2, 3, 5, 13, 19 and 25. Let's hope that this first step works with your large set of columns. (To be frank: I don't know if awk can deal with $20000.)
Step 2: let's create one of those famous one-liners:
> awk -f step1.awk yourfile | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1 "\"" }; END { print "}" }' | awk -v "OFS=\t" -f - yourfile
The first part is our step 1; the second part builds, on the fly, a second awk script, with lines like this: print $1, $2, $3, $5, $13, $19, $25 > "Mpap_1". This second awk script is piped to the third part, which reads the script from stdin (-f -) and applies it to your input file.
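For the test data above, the script generated by the second part (and fed to the third via -f -) should look like this:
{
 print $1, $2, $3, $5, $13, $19, $25 > "Mpap_1"
 print $4, $6, $8, $12, $14, $16, $20, $22, $26, $28 > "Mpap_2"
 print $7, $9, $10, $11, $15, $17, $18, $21, $23, $24, $27, $29, $30 > "Mpap_3"
}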
In case something does not work: watch the output of each part of step 2. You can execute the parts from the left up to (but not including) each of the | symbols to see what is going on, e.g.:
awk -f step1.awk yourfile
awk -f step1.awk yourfile | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1 "\"" }; END { print "}" }'

The following worked for me:
code for step1.awk:
NR == 1 { PROCINFO["sorted_in"] = "@ind_num_asc"
          for( i=1; i<=NF; i++ ) { f2c[$i] = (f2c[$i]=="") ? "$" i : (f2c[$i] " \"\t\" $" i) } }
NR == 2 { for( fn in f2c ) printf("%s:%s\n", fn, f2c[fn])
          exit
}
Then run the one-liner which uses the above awk script:
awk -f step1.awk file.txt | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1".txt" "\"" }; END { print "}" }'| awk -f - file.txt
This outputs tab-delimited .txt files with all columns sharing a header collected in one file (a separate file for each distinct header).
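For the sample input at the top of this question, the generated A1BG.txt should then contain the two A1BG columns, tab-delimited:
$ cat A1BG.txt
A1BG    A1BG
aa      bb
1       2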
Thanks Lars Fischer and others.
Cheers

Related

Splitting a large, complex one column file into several columns with awk

I have a text file produced by some commercial software, looking like below. It consists of bracket-delimited sections, each of which contains several million elements, but the exact count changes from one case to another.
(1
2
3
...
)
(11
22
33
...
)
(111
222
333
...
)
I need to achieve an output like:
1; 11; 111
2; 22; 222
3; 33; 333
... ... ...
I found a complicated way that is:
perform sed operations to get
1
2
3
...
#
11
22
33
...
#
111
222
333
...
use awk as follows to split my file in several sub-files
awk -v RS="#" '{print > ("splitted-" NR ".txt")}'
remove white spaces from my subfiles again with sed
sed -i '/^[[:space:]]*$/d' splitted*.txt
join everything together:
paste splitted*.txt > out.txt
add a field separator (defined in my bash script)
awk -v sep=$my_sep 'BEGIN{OFS=sep}{$1=$1; print }' out.txt > formatted.txt
I feel this is crappy as I loop over millions of lines several times.
Even though the run time is quite OK (~80 sec), I'd like to find a pure awk solution but can't get to it.
Something like:
awk 'BEGIN{RS="(\\n)"; OFS=";"} { print something } '
I found some related questions, especially this one: row to column conversion with awk, but it assumes a constant number of lines between brackets, which I can't guarantee.
Any help would be appreciated.
With GNU awk for multi-char RS and true multi-dimensional arrays:
$ cat tst.awk
BEGIN {
    RS  = "(\\s*[()]\\s*)+"
    OFS = ";"
}
NR>1 {
    cell[NR][1]
    split($0,cell[NR])
}
END {
    for (rowNr=1; rowNr<=NF; rowNr++) {
        for (colNr=2; colNr<=NR; colNr++) {
            printf "%6s%s", cell[colNr][rowNr], (colNr<NR ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk file
1; 11; 111
2; 22; 222
3; 33; 333
...; ...; ...
If you know you have 3 columns, you can do it in a very ugly way as follows:
pr -3ts <file>
All that needs to be done then is to remove your brackets:
$ pr -3ts ~/tmp/f | awk 'BEGIN{OFS="; "}{gsub(/[()]/,"")}(NF){$1=$1; print}'
1; 11; 111
2; 22; 222
3; 33; 333
...; ...; ...
You can also do it in a single awk line, but it just complicates things. The above is quick and easy.
This awk program does the full generic version:
awk 'BEGIN{r=c=0}
/)/{r=0; c++; next}
{gsub(/[( ]/,"")}
(NF){a[r++,c]=$1; rm=rm>r?rm:r}
END{ for(i=0;i<rm;++i) {
printf a[i,0];
for(j=1;j<c;++j) printf "; " a[i,j];
print ""
}
}' <file>
Could you please try the following, considering that your actual Input_file is the same as the shown samples.
awk -v RS="" '
{
gsub(/\n|, /,",")
}
1' Input_file |
awk '
{
while(match($0,/\([^\)]*/)){
value=substr($0,RSTART+1,RLENGTH-2)
$0=substr($0,RSTART+RLENGTH)
num=split(value,array,",")
for(i=1;i<=num;i++){
val[i]=val[i]?val[i] OFS array[i]:array[i]
}
}
for(j=1;j<=num;j++){
print val[j]
}
delete val
delete array
value=""
}' OFS="; "
OR (the above script assumes the number of values inside (...) is constant; the following script works even when the number of fields inside (...) varies):
awk -v RS="" '
{
gsub(/\n/,",")
gsub(/, /,",")
}
1' Input_file |
awk '
{
while(match($0,/\([^\)]*/)){
value=substr($0,RSTART+1,RLENGTH-2)
$0=substr($0,RSTART+RLENGTH)
num=split(value,array,",")
for(i=1;i<=num;i++){
val[i]=val[i]?val[i] OFS array[i]:array[i]
max=num>max?num:max
}
}
for(j=1;j<=max;j++){
print val[j]
}
delete val
delete array
}' OFS="; "
Output will be as follows.
1; 11; 111
2; 22; 222
3; 33; 333
Explanation: Adding an explanation for the above code here.
awk -v RS="" ' ##Setting RS (record separator) to NULL here.
{ ##Starting main BLOCK here.
gsub(/\n/,",") ##Using gsub to substitute newlines with commas here.
gsub(/, /,",") ##Using gsub to substitute comma-space pairs with plain commas here.
}
1' Input_file | ##Mentioning 1 prints the (edited or non-edited) lines of Input_file. The | sends this output as input to the next awk program.
awk ' ##Starting another awk program here.
{
while(match($0,/\([^\)]*/)){ ##Using a while loop which runs as long as a match for (...) is found in the line.
value=substr($0,RSTART+1,RLENGTH-2) ##Storing the matched values (without the opening parenthesis and trailing comma) in variable value here.
$0=substr($0,RSTART+RLENGTH) ##Re-creating the current line from the substring starting at RSTART+RLENGTH to the end of the line.
num=split(value,array,",") ##Splitting variable value into an array named array whose delimiter is a comma.
for(i=1;i<=num;i++){ ##Running a for loop from i=1 to the value of num (length of array).
val[i]=val[i]?val[i] OFS array[i]:array[i] ##Creating array val whose index is the value of variable i, concatenating its own values.
}
}
for(j=1;j<=num;j++){ ##Running a for loop from j=1 to the value of num here.
print val[j] ##Printing the value of val whose index is j here.
}
delete val ##Deleting val here.
delete array ##Deleting array here.
value="" ##Nullifying variable value here.
}' OFS="; " ##Setting OFS to semicolon-space here.
NOTE: This should work for more than 3 values inside (...) brackets also.
awk 'BEGIN { RS = "\\s*[()]\\s*"; FS = "\\s*" }
NF > 0 {
    maxCol++
    if (NF > maxRow)
        maxRow = NF
    for (row = 1; row <= NF; row++)
        a[row,maxCol] = $row
}
END {
    for (row = 1; row <= maxRow; row++) {
        for (col = 1; col <= maxCol; col++)
            printf "%s", a[row,col] ";"
        print ""
    }
}' yourFile
output
1;11;111;
2;22;222;
3;33;333;
...;...;...;
Change FS = "\\s*" to FS = "\\n*" when you also want to allow spaces inside your fields.
This script supports columns of different lengths.
When benchmarking, also consider replacing [i,j] with [i][j] for GNU awk; I'm unsure which one is faster and did not benchmark the script myself. A sketch of that variant follows.
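For reference, a sketch of the [i][j] variant with GNU awk's true multidimensional arrays (only the subscripts change; the intent is identical behavior):
awk 'BEGIN { RS = "\\s*[()]\\s*"; FS = "\\s*" }
NF > 0 {
    maxCol++
    if (NF > maxRow)
        maxRow = NF
    for (row = 1; row <= NF; row++)
        a[row][maxCol] = $row   # gawk-only: true 2D array instead of the SUBSEP form
}
END {
    for (row = 1; row <= maxRow; row++) {
        for (col = 1; col <= maxCol; col++)
            printf "%s", a[row][col] ";"
        print ""
    }
}' yourFile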
Here is the Perl one-liner solution:
$ cat edouard2.txt
(1
2
3
a
)
(11
22
33
b
)
(111
222
333
c
)
$ perl -lne '$x=0 if s/[)(]//; if(/(\S+)/) { @t=@{$val[$x]}; push(@t,$1); $val[$x++]=[@t] } END { print join(";",@{$val[$_]}) for (0..$#val) }' edouard2.txt
1;11;111
2;22;222
3;33;333
a;b;c
I would convert each section to a row and then transpose after, e.g. assuming you are using GNU awk:
<infile awk '{ gsub("[( )]", ""); $1=$1 } 1' RS='\\)\n\\(' OFS=';' |
datamash -t';' transpose
Output:
1;11;111
2;22;222
3;33;333
...;...;...
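For the sample input, the intermediate rows produced by the awk stage (before datamash transposes them) should look like:
1;2;3;...
11;22;33;...
111;222;333;...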

how to find out common columns and its records from two files using awk

I have two files:
File 1:
id|name|address|country
1|abc|efg|xyz
2|asd|dfg|uio
File 2(only headers):
id|name|country
Now, I want an output like:
OUTPUT:
id|name|country
1|abc|xyz
2|asd|uio
Basically, I have a user record file (file1) and a header file (file2). Now, I want to extract from file1 only those columns that match the ones in the header file.
I want to do this using awk or bash.
I tried using:
awk 'BEGIN { OFS="..."} FNR==NR { a[(FNR"")] = $0; next } { print a[(FNR"")], $0 > "test.txt"}' header.txt file.txt
and have no idea what to do next.
Thank You
The following awk may help you with the same.
awk -F"|" 'FNR==NR{for(i=1;i<=NF;i++){a[$i]};next} FNR==1 && FNR!=NR{for(j=1;j<=NF;j++){if($j in a){b[++p]=j}}} {for(o=1;o<=p;o++){printf("%s%s",$b[o],o==p?ORS:OFS)}}' OFS="|" File2 File1
Adding a non-one-liner form of the solution too now.
awk -F"|" '
FNR==NR{
    for(i=1;i<=NF;i++){
        a[$i]
    }
    next
}
FNR==1 && FNR!=NR{
    for(j=1;j<=NF;j++){
        if($j in a){ b[++p]=j }
    }
}
{
    for(o=1;o<=p;o++){
        printf("%s%s",$b[o],o==p?ORS:OFS)
    }
}
' OFS="|" File2 File1
Edit by Ed Morton: FWIW here's the same script written with normal indenting/spacing and a couple of more meaningful variable names:
BEGIN { FS=OFS="|" }
NR==FNR {
    for (i=1; i<=NF; i++) {
        names[$i]
    }
    next
}
FNR==1 {
    for (i=1; i<=NF; i++) {
        if ($i in names) {
            f[++numFlds] = i
        }
    }
}
{
    for (i=1; i<=numFlds; i++) {
        printf "%s%s", $(f[i]), (i<numFlds ? OFS : ORS)
    }
}
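Assuming the rewrite is saved as tst.awk, it should run the same way and produce the requested output:
$ awk -f tst.awk file2 file1
id|name|country
1|abc|xyz
2|asd|uio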
with (lots of) unix pipes, as Doug McIlroy intended...
$ function p() { sed 1q "$1" | tr '|' '\n' | cat -n | sort -k2; }
$ cut -d'|' -f"$(join -j2 <(p header) <(p file) | sort -k2n | cut -d' ' -f3 | paste -sd,)" file
id|name|country
1|abc|xyz
2|asd|uio
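To unpack this: p lists a file's header fields one per line, numbered by position and sorted by name, so join can match the names and pair each with its position in both files; sorting by header position and cutting the file positions yields the cut field list. For the sample headers, the command substitution should reduce to 1,2,4:
$ join -j2 <(p header) <(p file) | sort -k2n | cut -d' ' -f3 | paste -sd,
1,2,4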
Solution using bash>4:
IFS='|' headers1=($(head -n1 $file1))
IFS='|' headers2=($(head -n1 $file2))
IFS=$'\n'
# find idxes we want to output, ie. mapping of headers1 to headers2
idx=()
for i in $(seq 0 $((${#headers2[@]}-1))); do
    for j in $(seq 0 $((${#headers1[@]}-1))); do
        if [ "${headers2[$i]}" == "${headers1[$j]}" ]; then
            idx+=($j)
            break
        fi
    done
done
# idx=(0 1 3) for example
# simple join output function from https://stackoverflow.com/questions/1527049/join-elements-of-an-array
join_by() { local IFS="$1"; shift; echo "$*"; }
# first line - output headers
join_by '|' "${headers2[@]}"
isfirst=true
while IFS='|' read -a vals; do
    # ignore first (header) line
    if $isfirst; then
        isfirst=false
        continue
    fi
    # filter from line only columns with idx indices
    tmp=()
    for i in ${idx[@]}; do
        tmp+=("${vals[$i]}")
    done
    # join output with '|'
    join_by '|' "${tmp[@]}"
done < $file1
This one respects the order of columns in file1. To test, I changed the column order:
$ cat file1
id|country|name
The awk:
$ awk '
BEGIN { FS=OFS="|" }
NR==1 {                  # header of file1 (wanted column order)
    n=split($0,a)
    next
}
NR==2 {                  # header of file2
    for(i=1;i<=NF;i++)
        b[$i]=i
}
{                        # output part
    for(i=1;i<=n;i++)
        printf "%s%s", $b[a[i]], (i==n?ORS:OFS)
}' file1 file2
id|country|name
1|xyz|abc
2|uio|asd
(Another version using cut for output is in the revisions.)
This is similar to RavinderSingh13's solution, in that it first reads the headers from the shorter file, and then decides which columns to keep from the longer file based on the headers on the first line of it.
It however does the output differently. Instead of constructing a string, it shifts the selected columns to the left, over any fields it does not want to include, and then shortens the record.
BEGIN { FS = OFS = "|" }
# read headers from first file
NR == FNR { for (i = 1; i <= NF; ++i) header[$i]; next }
# mark fields in second file as "selected" if the header corresponds
# to a header in the first file
FNR == 1 {
    for (i = 1; i <= NF; ++i)
        select[i] = ($i in header)
}
{
    pos = 0
    for (i = 1; i <= NF; ++i)
        if (select[i])        # we want this field
            $(++pos) = $i     # shift it left over any skipped fields
    NF = pos                  # drop the leftover fields
    print
}
Running this:
$ mawk -f script.awk file2 file1
id|name|country
1|abc|xyz
2|asd|uio

How to grep for multiple word occurrences from multiple files and list them grouped as rows and columns

Hello: I need your help to count word occurrences from multiple files and output them as rows and columns. I searched the site for a similar reference but could not locate one, hence posting this here.
Setup:
I have 2 files with the following content:
[a.log]
id,status
1,new
2,old
3,old
4,old
5,old
[b.log]
id,status
1,new
2,old
3,new
4,old
5,new
Results required
The result I require, using the command line only, is (preferably):
file count(new) count(old)
a.log 1 4
b.log 3 2
Script
The script below gives me the count for a single word across multiple files.
I am stuck trying to get results for multiple words. Please help.
grep -cw "old" *.log
You can get this output using gnu-awk, which here accepts the comma-separated words to be searched as a command-line argument:
awk -v OFS='\t' -F, -v wrds='new,old' 'BEGIN{n=split(wrds, a, /,/); for(i=1; i<=n; i++) b[a[i]]=a[i]} FNR==1{next} $2 in b{freq[FILENAME][$2]++} END{printf "%s", "file" OFS; for(i=1; i<=n; i++) printf "count(%s)%s", a[i], (i==n?ORS:OFS); for(f in freq) {printf "%s", f OFS; for(i=1; i<=n; i++) printf "%s%s", freq[f][a[i]], (i==n?ORS:OFS)}}' a.log b.log | column -t
Output:
file count(new) count(old)
a.log 1 4
b.log 3 2
PS: column -t was only used for formatting the output in tabular format.
Readable awk:
awk -v OFS='\t' -F, -v wrds='new,old' 'BEGIN {
n = split(wrds, a, /,/) # split input words list by comma with int index
for(i=1; i<=n; i++) # store words in another array with key as words
b[a[i]]=a[i]
}
FNR==1 {
next # skip first row from all the files
}
$2 in b {
freq[FILENAME][$2]++ # store filename and word frequency in 2-dimensional array
}
END { # print formatted result
printf "%s", "file" OFS
for(i=1; i<=n; i++)
printf "count(%s)%s", a[i], (i==n?ORS:OFS)
for(f in freq) {
printf "%s", f OFS
for(i=1; i<=n; i++)
printf "%s%s", freq[f][a[i]], (i==n?ORS:OFS)
}
}' a.log b.log
I think you're looking for something like this, but it's not overly clear what your objectives are (if you're going for efficiency for example, this isn't overly efficient)...
for file in *.log; do
    printf '%s\t' "$file"
    for word in "new" "old"; do
        printf '%s\t' "$(grep -cw "$word" "$file")"
    done
    echo
done
(for readability, I simplified the first line, but this doesn't work if there are spaces in the filenames -- the proper solution is to change the first line to read find . -maxdepth 1 -iname "*.log" | while read -r file; do)
for c in a b ; do egrep -o "new|old" $c.log | sort | uniq -c > $c.luc; done
Get rid of the header lines (egrep -o outputs only the matches), then sort and count.
join -1 2 -2 2 a.luc b.luc
> new 1 3
> old 4 2
Placing a new header is left as an exercise for the reader. Is there a 'flip' command for unix/linux/bash to transpose a table? (A hard-coded sketch follows below.)
Handling empty cells is left as an exercise too, but possible with join.
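One way to 'flip' the small join result, hard-coding the two-file, two-word case (a sketch, not a general transpose; datamash transpose, used in an earlier answer on this page, is the general tool):
join -1 2 -2 2 a.luc b.luc |
awk 'BEGIN { printf "file\tcount(new)\tcount(old)\n" }
     $1=="new" { na=$2; nb=$3 }    # counts of "new" in a.log and b.log
     $1=="old" { oa=$2; ob=$3 }    # counts of "old" in a.log and b.log
     END { printf "a.log\t%s\t%s\n", na, oa
           printf "b.log\t%s\t%s\n", nb, ob }'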
Without real multi-dimensional array support, this will count all values in field 2, not just "new"/"old". The header and the number of columns are dynamic too, following the number of distinct values.
$ awk -F, 'NR==1 {fs["file"]}
    FNR>1 {c[FILENAME,$2]++; fs[FILENAME]; ks[$2]
           c["file",$2]="count("$2")"}
    END   {for(f in fs) {
               printf "%s", f
               for(k in ks) printf "%s", OFS c[f,k]
               printf "\n"}}' file{1,2} | column -t
file count(new) count(old)
file1 1 4
file2 3 2
Awk solution:
awk 'BEGIN{
FS=","; OFS="\t"; print "file","count(new)","count(old)";
f1=ARGV[1]; f2=ARGV[2] # get filenames
}
FNR==1{ next } # skip the 1st header line
NR==FNR{ c1[$2]++; next } # accumulate occurrences of the 2nd field in 1st file
{ c2[$2]++ } # accumulate occurrences of the 2nd field in 2nd file
END{
print f1, c1["new"], c1["old"];
print f2, c2["new"], c2["old"]
}' a.log b.log
The output:
file count(new) count(old)
a.log 1 4
b.log 3 2

AWK: increment a field based on values from previous line

Given the following input for AWK:
10;20;20
8;41;41
15;52;52
How could I increase/decrease the values so that:
$1 = remains unchanged
$2 = $2 of previous line + $1 of previous line + 1
$3 = $3 of previous line + $1 of previous line + 1
So the desired output would be (note that the updated values feed into the next line: 31 = 20 + 10 + 1, then 40 = 31 + 8 + 1):
10;20;20
8;31;31
15;40;40
I need to auto-increment and loop over the lines,
using associative arrays, but it's confusing for me.
Surely, this doesn't work as desired:
#!/bin/awk -f
BEGIN { FS = ";" }
{
print ln, st, of
ln=$1
st=$2 + ln + 1
of=$3 + ln + 1
}
with awk:
awk -F";" -v OFS=";" 'NR!=1{ $2=a[2]+a[1]+1; $3=a[3]+a[1]+1 } { split($0,a,FS) } 1' file
We split the (possibly updated) line into an array so that, when processing the next line, we can use the stored values.
Test:
10;20;20
8;31;31
15;40;40
The following awk may help you with the same.
awk -F";" '
FNR==1{
    val=$1
    val1=$2
    val2=$3
    print
    next
}
{
    $2=val+val1+1
    $3=val+val2+1
    print
    val=$1
    val1=$2
    val2=$3
}' OFS=";" Input_file
For your given Input_file, output will be as follows.
10;20;20
8;31;31
15;40;40
awk 'BEGIN{
FS = OFS = ";"
}
FNR>1{
$2 = p2 + p1 + 1
$3 = p3 + p1 + 1
}
{
p1=$1; p2=$2; p3=$3
}1
' infile
Input:
$ cat infile
10;20;20
8;41;41
15;52;52
Output:
awk 'BEGIN{FS=OFS=";"}FNR>1{$2=p2+p1+1; $3=p3+p1+1 }{p1=$1; p2=$2; p3=$3}1' infile
10;20;20
8;31;31
15;40;40
Or store only the fields of interest:
awk -v myfields="2,3" '
BEGIN{
FS=OFS=";";
split(myfields,t,/,/)
}
{
for(i in t)
{
if(FNR>1)
{
$(t[i]) = a[t[i]] + a[1] + 1
}
a[t[i]] = $(t[i])
}
a[1] = $1
}1' infile

How to convert space separated key value data into CSV format in bash?

I am working on some data files where the data consists of key and value pairs separated by spaces.
The data in the files is inconsistent: the keys and values are not always all present, but the keys will always be Table, count and size.
The example below has table_name, count and size information:
cat sample1.txt
Table SCOTT.TABLE1 count 3889 size 300
Table SCOTT.TABLE2 count 7744
Table SCOTT.TABLE3 count 2622
Table SCOTT.TABLE4 size 2773 count 22
Table SCOTT.TABLE5 size 21
The file below has just table_name but no count and size data.
cat sample2.txt
Table SCOTT.TABLE1
Table SCOTT.TABLE2
Table SCOTT.TABLE3
Table SCOTT.TABLE4
Table SCOTT.TABLE5
So I am trying to convert these files into CSV format using the following:
cat <file_name> | awk -F' ' 'BEGIN { RS="\n"; print"Table,Count,Size";OFS="," } NR > 1 { print a["Table"], a["count"], a["size"]; delete a; next }{ a[$1]=$2 }{ a[$3]=$4 }{ a[$5]=$6 }'
cat sample1.txt | awk -F' ' 'BEGIN { RS="\n"; print"Table,Count,Size";OFS="," }
NR > 1 { print a["Table"], a["count"], a["size"]; delete a; next }
{ a[$1]=$2 }{ a[$3]=$4 }{ a[$5]=$6 }'
Table,Count,Size
SCOTT.TABLE1,3889,300
,,
,,
,,
And for the second sample
cat sample2.txt | awk -F' ' 'BEGIN { RS="\n"; print"Table,Count,Size";OFS="," } NR > 1 { print a["Table"], a["count"], a["size"]; delete a; next }{ a[$1]=$2 }{ a[$3]=$4 }{ a[$5]=$6 }'
Table,Count,Size
SCOTT.TABLE1,,
,,
,,
,,
But I expected the following:
For sample1.txt
TABLE,count,size
SCOTT.TABLE1,3889,300
SCOTT.TABLE2,7744,
SCOTT.TABLE3,2622,
SCOTT.TABLE4,22,2773
SCOTT.TABLE5,,21
For sample2.txt
Table,Count,Size
SCOTT.TABLE1,,
SCOTT.TABLE2,,
SCOTT.TABLE3,,
SCOTT.TABLE4,,
SCOTT.TABLE5,,
Thanks in advance.
awk to the rescue!
$ awk -v OFS=',' '{for(i=1;i<NF;i+=2)
{if(!($i in c)){c[$i];cols[++k]=$i};
v[NR,$i]=$(i+1)}}
END{for(i=1;i<=k;i++) printf "%s", cols[i] OFS;
print "";
for(i=1;i<=NR;i++)
{for(j=1;j<=k;j++) printf "%s", v[i,cols[j]] OFS;
print ""}}' file
Table,count,size,
SCOTT.TABLE1,3889,300,
SCOTT.TABLE2,7744,,
SCOTT.TABLE3,2622,,
SCOTT.TABLE4,22,2773,
SCOTT.TABLE5,,21,
If you have gawk you can simplify it more with sorted_in.
UPDATE: For the revised question, the header needs to be known in advance since the keys might be completely missing. This simplifies the problem, and the following script should do the trick.
$ awk -v header='Table,count,size' \
'BEGIN{OFS=","; n=split(header,h,OFS); print header}
{for(i=1; i<NF; i+=2) v[NR,$i]=$(i+1)}
END{for(i=1; i<=NR; i++)
{printf "%s", v[i,h[1]];
for(j=2; j<=n; j++) printf "%s", OFS v[i,h[j]];
print ""}}' file
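Saved as, say, script.awk (the program text only), a run against sample2.txt should then produce the empty-column CSV the OP expects:
$ awk -v header='Table,count,size' -f script.awk sample2.txt
Table,count,size
SCOTT.TABLE1,,
SCOTT.TABLE2,,
SCOTT.TABLE3,,
SCOTT.TABLE4,,
SCOTT.TABLE5,,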
Here is an inelegant but fast and comprehensible solution:
awk 'BEGIN{OFS=",";print "TABLE,count,size"}
{
t=$2
if($3=="count"){
c=$4
s=$6
}
else{
s=$4
c=$6
}
print t,c,s
}' 1.txt
output:
TABLE,count,size
SCOTT.TABLE1,3889,300
SCOTT.TABLE2,7744,
SCOTT.TABLE3,2622,
SCOTT.TABLE4,22,2773
SCOTT.TABLE5,,21
