convert data matrix using awk

Is it possible to transpose the following data matrix input to the desired output?
f1 x1 1.2
f1 x2 2.2
f1 x3 0
f2 x1 1.1
f2 x2 1.2
f2 x3 3.3
f3 x1 2.3
f3 x2 4.4
f3 x3 0.1
Desired output:
f1 f2 f3
x1 1.2 1.1 2.3
x2 2.2 1.2 4.4
x3 0 3.3 0.1

This can be a way:
awk '{ a[$1,$2] = $3; col[$1]; row[$2] }
     END {
         printf "%s", FS
         for (c in col) printf "%s%s", c, FS; print ""
         for (r in row) {
             printf "%s%s", r, FS
             for (c in col) printf "%s%s", a[c,r], FS
             print ""
         }
     }' file
The script is fairly self-explanatory, but in short:
Store the data in an array a[col, row].
Store the names of the columns and rows as they are seen.
Once the file has been read, loop over the stored names and print the transposed table.
For the given input it returns:
$ awk '{a[$1,$2]=$3; col[$1]; row[$2]} END {printf "%s", FS; for (c in col) printf "%s%s", c, FS; print ""; for (r in row) { printf "%s%s", r, FS; for (c in col) printf "%s%s", a[c,r], FS; print ""}}' a
f1 f2 f3
x1 1.2 1.1 2.3
x2 2.2 1.2 4.4
x3 0 3.3 0.1
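One caveat about this approach: the order of a for (c in col) loop is unspecified in POSIX awk, so the columns and rows may come out in a different order on another awk. With GNU awk you can force a deterministic traversal via PROCINFO["sorted_in"]; a minimal sketch (same program as above, gawk only):
awk 'BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc" }   # gawk extension: sorted index order
     { a[$1,$2] = $3; col[$1]; row[$2] }
     END {
         printf "%s", FS
         for (c in col) printf "%s%s", c, FS; print ""
         for (r in row) {
             printf "%s%s", r, FS
             for (c in col) printf "%s%s", a[c,r], FS
             print ""
         }
     }' file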

% cat mat.rix
f1 x1 1.2
f1 x2 2.2
f1 x3 0
f2 x1 1.1
f2 x2 1.2
f2 x3 3.3
f3 x1 2.3
f3 x2 4.4
f3 x3 0.1
% cat a.wk
{
    # remember each new value of $1 (these end up as the output header)
    if(! row[$1]) { i=i+1; rowname[i]=$1; row[$1]=1 }
    # remember each new value of $2 (these end up as the output row labels)
    if(! col[$2]) { j=j+1; colname[j]=$2; col[$2]=1 }
    # append the formatted value to the line being built for row $2
    if(c[$2])
        c[$2] = sprintf("%s%s%10.4f", c[$2], OFS, $3)
    else
        c[$2] = sprintf("%10.4f", $3)
}
END {
    printf " "
    for(n=1;n<i+1;n++){ printf "%10s%s", rowname[n], OFS } ; print ""
    for(n=1;n<j+1;n++){ print colname[n], c[colname[n]] }
}
% awk -f a.wk mat.rix
f1 f2 f3
x1 1.2000 1.1000 2.3000
x2 2.2000 1.2000 4.4000
x3 0.0000 3.3000 0.1000
%
Addendum
What if the column names are of different lengths?
% cat aw.k
{
    if(! row[$1]) { i=i+1; rowname[i]=$1; row[$1]=1 }
    if(! col[$2]) { j=j+1; colname[j]=$2; col[$2]=1 }
    if(c[$2])
        c[$2] = sprintf("%s%s%10.4f", c[$2], OFS, $3)
    else
        c[$2] = sprintf("%10.4f", $3)
}
END {
    for(n=1;n<j+1;n++){ l = length(colname[n]); if(l>lmax) lmax=l }
    format = sprintf("%%-%ds%%s%%s\n", lmax)
    for(n=1;n<lmax+1;n++) printf "."; printf OFS
    for(n=1;n<i+1;n++){ printf "%10s%s", rowname[n], OFS } ; print ""
    for(n=1;n<j+1;n++){ printf(format, colname[n], OFS, c[colname[n]]) }
}
% awk -f aw.k mat.rix
.. f1 f2 f3
x1 1.2000 1.1000 2.3000
x2 2.2000 1.2000 4.4000
x3 0.0000 3.3000 0.1000
%
Note the use of %% to produce literal percent signs when the format string itself is built with sprintf.
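For illustration only (the value of lmax is made up here), this shows what the sprintf-built format expands to and how the later printf uses it:
awk 'BEGIN {
    lmax = 2
    format = sprintf("%%-%ds%%s%%s\n", lmax)   # format is now "%-2s%s%s\n"
    printf format, "x1", OFS, "    1.2000"
}'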

Related

Create a matrix out of table using awk

I want to use this table:
a 16 moe max us
b 11 tom mic us
d 14 roe fox au
t 29 ann teo au
n 28 joe joe ca
and make this matrix by using awk (or any other simple option in bash):
a_16; b_11; d_14; t_29; n_28
us; moe_max; tom_mic; ; ;
au; ; ; roe_fox; ann_teo;
ca; ; ; ; ; joe_joe
I tried this but it didn't work:
awk '{a[$5]=a[$5]?a[$5] FS $1"_"$2:$1"_"$2; b[$5]=b[$5]?b[$5] FS $3"_"$4:$3"_"$4;} END{for (i in a){print i"\t" a[i] "\t" b[i];}}' fis.txt
Using any awk
$ cat tst.awk
{
    row = $NF
    col = $1 "_" $2
    vals[row,col] = $3 "_" $4
}
!seenRow[row]++ { rows[++numRows] = row }
!seenCol[col]++ { cols[++numCols] = col }
END {
    OFS = "; "
    printf " "
    for ( colNr=1; colNr<=numCols; colNr++ ) {
        col = cols[colNr]
        printf "%s%s", col, (colNr<numCols ? OFS : ORS)
    }
    for ( rowNr=1; rowNr<=numRows; rowNr++ ) {
        row = rows[rowNr]
        printf "%s%s", row, OFS
        for ( colNr=1; colNr<=numCols; colNr++ ) {
            col = cols[colNr]
            #val = ((row,col) in vals ? vals[row,col] : " ")
            val = vals[row,col]
            printf "%s%s", val, (colNr<numCols ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk file
a_16; b_11; d_14; t_29; n_28
us; moe_max; tom_mic; ; ;
au; ; ; roe_fox; ann_teo;
ca; ; ; ; ; joe_joe
I can't see the pattern in the expected output in your question of when there should be 1, 2, 3, or 4 spaces after each ; so I just used a consistent 2 in the above. Massage it to suit.
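Incidentally, the commented-out line in the END block matters if some row/column combinations can be missing from the input: the (row,col) in vals test checks membership without side effects, whereas a plain vals[row,col] lookup yields an empty string and also creates that entry in the array. A small illustration (the array contents are invented for the example):
awk 'BEGIN {
    vals["us","a_16"] = "moe_max"
    if (("au","a_16") in vals) print "present"; else print "absent"   # absent, nothing created
    x = vals["au","a_16"]                                             # a plain lookup creates an empty entry
    if (("au","a_16") in vals) print "present"; else print "absent"   # present now
}'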
Using gawk multidimensional arrays for collecting header columns and row indices:
awk '{
    head[NR] = $1"_"$2;
    idx[$5][NR] = $3"_"$4
}
END {
    h = ""; col_size = length(head);
    for (i = 1; i <= col_size; i++) {
        h = sprintf("%s %s", h, head[i])
    }
    print h;
    for (lab in idx) {
        printf("%s", lab);
        for (i = 1; i <= col_size; i++) {
            v = sprintf("%s; %s", v, idx[lab][i])
        }
        print v;
        v = "";
    }
}' test.txt
a_16 b_11 d_14 t_29 n_28
ca; ; ; ; ; joe_joe
au; ; ; roe_fox; ann_teo;
us; moe_max; tom_mic; ; ;
Here is a ruby solution:
ruby -e 'd=$<.read.
    split(/\R/).
    map(&:split).
    map{|sa| sa.each_slice(2).map{|ss| ss.join("_") } }.
    group_by{|sa| sa[-1] }
# {"us"=>[["a_16", "moe_max", "us"], ["b_11", "tom_mic", "us"]], "au"=>[["d_14", "roe_fox", "au"], ["t_29", "ann_teo", "au"]], "ca"=>[["n_28", "joe_joe", "ca"]]}
heads=d.values.flatten(1).map{|sa| sa[0]}
# ["a_16", "b_11", "d_14", "t_29", "n_28"]
hsh=Hash.new {|h,k| h[k] = ["\t"]*heads.length}
d.each{|k,v|
    v.each{|sa|
        hsh[k][heads.index(sa[0])]="\t#{sa[1]}"
    }
}
puts heads.map{|e| "\t#{e}" }.join(";")
hsh.each{|k,v| puts "#{k};\t#{v.join(";")}"}
' file
Prints:
a_16; b_11; d_14; t_29; n_28
us; moe_max; tom_mic; ; ;
au; ; ; roe_fox; ann_teo;
ca; ; ; ; ; joe_joe

Using awk to print data from a specific sequence of lines arranged as a row

I have a file organized like this:
a b c d
x1
x2
x3
e f g h
x4
x5
x6
and so on. I would like to use awk to write another file as follows:
x1 x2 x3
x4 x5 x6
and so on. I am struggling since I'm still beginning to learn awk and sed. Any suggestions?
I would harness GNU AWK for this task in the following way. Let file.txt content be
a b c d
x1
x2
x3
e f g h
x4
x5
x6
then
awk 'BEGIN{ORS=" "}NR==1{next}NF==1{print $1}NF>1{printf "\n"}' file.txt
gives output
x1 x2 x3
x4 x5 x6
Explanation: I inform GNU AWK to use a space as the output record separator (ORS), then for the 1st row I go to the next record (skipping the header). If a row has 1 field I print the 1st field ($1), which gets a trailing space rather than a newline, as I set ORS to a space. If there is more than one field I just printf a newline. Observe that printf does not append ORS, as opposed to print. If you want to know more about ORS or NR or NF then read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
(tested in gawk 4.2.1)
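If you would rather build each output line explicitly (and avoid the trailing space on each line and the missing final newline that ORS=" " leaves behind), a roughly equivalent sketch in plain awk collects the single-field lines and flushes them whenever a new header line appears:
awk '
NF > 1 { if (line != "") print line; line = ""; next }   # header line: flush the previous group
       { line = (line == "" ? $1 : line " " $1) }        # data line: append the value
END    { if (line != "") print line }                    # flush the last group
' file.txt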

transform csv file using awk script

I have a csv file as below:
C1, C2, C3,Cv1,Cv2,Cv3,Cv4 ... this one can have more columns
x1, x2 ,x3.1, 1.1, 1.2, 1.3, 1.4
x1, x2, x3.2, 2.1, 2.2, 2.3, 2.4
x1, x2, x3.3, 3.1, 3.2, 3.3, 3.4
I would like to transform this csv file as below:
C1,C2, C3,CTEXT,XVALUE
x1, x2, x3.1, Cv1 , 1.1
x1, x2, x3.1, Cv2 , 1.2
x1, x2, x3.1, Cv3 , 1.3
x1, x2, x3.1, Cv4 , 1.4
x1, x2, x3.2, Cv1 , 2.1
x1, x2, x3.2, Cv2 , 2.2
x1, x2, x3.2, Cv3 , 2.3
x1, x2, x3.2, Cv4 , 2.4
x1, x2, x3.3, Cv1 , 3.1
x1,x2,x3.3, Cv2 , 3.2
x1,x2,x3.3, Cv3 , 3.3
x1,x2,x3.3, Cv4 , 3.4
Below is my code:
#!/bin/bash
awk -F, -v OFS=, '{ if (NR==1)
{ print $1,$2,$3, "CTEXT","XVALUE"
i=4; while (i < NF) {
a[i]=$i; i=i+1
}
am=NF; next
}
i=4 ; while (i < am) {
if (i > NF) {print "record "NR" insufficient value" >/dev/stderr
break}
print $1,$2,$3,a[i],$i
i=i+1
}
if (am <NF) print "record "NR" too many values for text" >/dev/stderr
}' input.csv
When I run the script, it shows the error:
awk: syntax error near line 2
awk: bailing out near line 2
Edit by Ed Morton - I just ran the script through a beautifier (gawk -o- '...') so it's much easier to read/understand:
{
    if (NR == 1) {
        print $1, $2, $3, "CTEXT", "XVALUE"
        i = 4
        while (i < NF) {
            a[i] = $i
            i = i + 1
        }
        am = NF
        next
    }
    i = 4
    while (i < am) {
        if (i > NF) {
            print("record " NR " insufficient value") > (/dev/) stderr
            break
        }
        print $1, $2, $3, a[i], $i
        i = i + 1
    }
    if (am < NF) {
        print("record " NR " too many values for text") > (/dev/) stderr
    }
}
Even if you switch your Solaris awk to gawk or nawk, there still
remain some problems. Would you please try the following:
awk -F, -v OFS=, '
NR==1 {
    print $1,$2,$3, "CTEXT","XVALUE"
    for (i = 4; i <= NF; i++) a[i]=$i
    am=NF; next
}
{
    if (am < NF) {
        print "record "NR" too many values for text" > "/dev/stderr"
        next
    }
    for (i = 4; i <= am; i++) {
        if (i > NF) {
            print "record "NR" insufficient value" > "/dev/stderr"
            break
        }
        print $1,$2,$3,a[i],$i
    }
}' input.csv
You need to increment i up to NF or am (use <= rather than <).
Enclose /dev/stderr in quotes.
It is better to use a for loop rather than while.
Hope this helps.
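To see why the quotes matter: the target of a print redirection is an ordinary expression, and unquoted /dev/stderr is tokenized as the regular expression /dev/ followed by the variable stderr (which is exactly what the beautified output above shows as (/dev/) stderr), so depending on the awk you get a syntax error or output going to an oddly named file. A minimal check (the message text is just an example):
# redirection target is the string "/dev/stderr": the message goes to standard error
awk 'BEGIN { print "diagnostic" > "/dev/stderr" }'
# unquoted: regex /dev/ concatenated with the variable stderr; syntax error on stricter awks
# awk 'BEGIN { print "diagnostic" > /dev/stderr }'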
Something like this:
$ awk -F, 'BEGIN {OFS=FS}
       NR==1 {n=split($0,h);
              print $1,$2,$3,"CTEXT","XVALUE";
              next}
       n!=NF {print n<NF?"too many":"not enough";
              exit}
       {for(i=4;i<=NF;i++) print $1,$2,$3,h[i],$i}' file
C1,C2,C3,CTEXT,XVALUE
x1,x2,x3.1,Cv1,1.1
x1,x2,x3.1,Cv2,1.2
x1,x2,x3.1,Cv3,1.3
x1,x2,x3.1,Cv4,1.4
x1,x2,x3.2,Cv1,2.1
x1,x2,x3.2,Cv2,2.2
x1,x2,x3.2,Cv3,2.3
x1,x2,x3.2,Cv4,2.4
x1,x2,x3.3,Cv1,3.1
x1,x2,x3.3,Cv2,3.2
x1,x2,x3.3,Cv3,3.3
x1,x2,x3.3,Cv4,3.4

bash command for splitting cell content by delimiter into multiple rows in the cell column

To outline the task: I have this data frame:
x y1;y2;y3 z1;z2;z3
a b1;b2 c1;c2
I need:
x y1 z1
x y2 z2
x y3 z3
a b1 c1
a b2 c2
Column 1 always has a single value. The number of values in a cell can be anywhere from one to many, but it is always equal between columns 2 and 3. Thanks
In awk:
$ awk -F"(\t|;)" '{
for(i=2;i<=4;i++)
if($i!="")
print $1, $i, $(i+3)
}' file
x y1 z1
x y2 z2
x y3 z3
a b1 c1
a b2 c2
Edit: Another version:
$ awk -F"(\t+|;)" '{ # FS tabs or semicolon
for(i=2;i<=int(NF/2)+1;i++)
print $1,$i,$(i+int(NF/2))
}' file
x y1 z1
x y2 z2
x y3 z3
a b1 c1
a b2 c2
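To make the indexing in the edited version concrete, here is how the first sample line splits under that FS (assuming the three columns are tab separated, as the FS suggests): $1 is the key, the next int(NF/2) fields are the second column's values, and the rest are their third-column partners, so pairing $i with $(i+int(NF/2)) lines them up.
$ printf 'x\ty1;y2;y3\tz1;z2;z3\n' |
  awk -F'(\t+|;)' '{ for (i = 1; i <= NF; i++) printf "$%d=%s ", i, $i; print "(NF=" NF ")" }'
$1=x $2=y1 $3=y2 $4=y3 $5=z1 $6=z2 $7=z3 (NF=7)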
Something like this should do it:
declare -a cols=()                        # array for individual columns (line fields)
IFS=' ;'                                  # field separators
while read -a cols; do
    n=${#cols[@]}                         # number of fields in current line
    if (( n < 3 || n % 2 != 1 )); then    # skip invalid lines
        printf "skipping invalid line: %s\n" "${cols[*]}"
        continue
    fi
    for (( i = 1; i <= n / 2; i += 1 )); do   # loop over pairs of fields
        # print one output line
        printf "%s %s %s\n" "${cols[0]}" "${cols[i]}" "${cols[n/2+i]}"
    done
done < data.txt
Explanations:
IFS is the list of characters used by read to split a line into fields. In your case spaces and ; seem to be the separators.
read -a cols assigns the fields of the read line to the cols array, starting at cell 0.
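For instance, splitting one of the sample lines by hand shows where the fields land (the here-string is just for demonstration):
$ IFS=' ;' read -a cols <<< 'x y1;y2;y3 z1;z2;z3'
$ printf '%s\n' "${cols[@]}"
x
y1
y2
y3
z1
z2
z3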
Example of run:
$ cat data.txt
x y1;y2;y3 z1;z2;z3
a b1;b2 c1;c2
$ ./foo.sh
x y1 z1
x y2 z2
x y3 z3
a b1 c1
a b2 c2

Print awk output into new column

I have a lot of files modified (after filtering) and I need to print NR and a character count for each new file into a column. Let's see an example:
input files: x1, x2, x3, y1, y2, y3, z1, z2, z3 ...
script:
for i in x* y* z*
do awk -v h="$i" 'END{c+=length+1; print h "\t" NR "\t" c}' "$i" >> stats.txt
done;
my output looks like:
x1 NR c
x2 NR c
x3 NR c
y1 NR c
y2 NR c
y3 NR c
z1 NR c
z2 NR c
z3 NR c
And I need to save each loop's output to a new column, not a new line:
x1 NR c y1 NR c z1 NR c
x2 NR c y2 NR c z2 NR c
x3 NR c y3 NR c z3 NR c
so that corresponding files (after filtering) stay on the same line. I hope I am clear. I need to do this in bash and awk. Thank you for any help!
EDITED:
The real output looks like:
x 0.457143 872484
y 0.527778 445759
z 0.416667 382712
x 0.457143 502528
y 0.5 575972
z 0.444444 590294
x 0.371429 463939
y 0.694444 398033
z 0.56565 656565
.
.
.
and I need:
x 0.457143 872484 0.457143 502528 0.371429 463939
y 0.52777 445759 0.5 575972 0.694444 398033
.
.
.
I hope it is clear.
Try this:
cat data | tr -d , | awk '{for (i = 1; i <= NF; i += 3) print $i " NR c " $(i+1) " NR c " $(i+2) " NR c"}'
Output:
x1 NR c x2 NR c x3 NR c
y1 NR c y2 NR c y3 NR c
z1 NR c z2 NR c z3 NR c
Same table but transposed (for your task variant):
cat data | tr -d , | awk '{for (i = 1; i <= NF/3; i += 1) print $i " NR c " $(i+3) " NR c " $(i+6) " NR c"}'
Output:
x1 NR c y1 NR c z1 NR c
x2 NR c y2 NR c z2 NR c
x3 NR c y3 NR c z3 NR c
For your updated task, check the following solution (using bash):
cat data | sort | while read L;
do
    y=`echo $L | cut -f1 -d' '`;
    {
        test "$x" = "$y" && echo -n " `echo $L | cut -f2- -d' '`";
    } ||
    {
        x="$y"; echo -en "\n$L";
    };
done
(from my solution to a similar problem)
Updated script after comment:
sort data | while read L
do
    y="`echo \"$L\" | cut -f1 -d' '`"
    if [ "$x" = "$y" ]
    then
        echo -n " `echo \"$L\" | cut -f2- -d' '`"
    else
        x="$y"
        echo -en "\n$L"
    fi
done
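The same grouping can also be sketched in plain awk, which keeps the keys in first-seen order (x, y, z) instead of the sorted order produced by sort; the input file name data is assumed as in the commands above:
awk '
!($1 in row) { order[++n] = $1; row[$1] = $1 }    # remember each key the first time it appears
             { row[$1] = row[$1] " " $2 " " $3 }  # append the two value fields of this record
END          { for (i = 1; i <= n; i++) print row[order[i]] }
' data
With the edited sample input, each key's values end up on one line, matching the desired layout.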
