Adding columns to a csv table with AWK from multiple files - bash

I'm looking to build a CSV table by collecting values from several files with AWK. I have it working with two files, but I can't scale it beyond that: at the moment the contents of the third file end up appended below the output of the second, and so on.
Here are example files:
#file1  #file2  #file3  #file4
100     45      1       5
200     23      1       2
300     29      2       1
400     0       1       2
500     74      4       5
This is the goal:
#data.csv
1,100,45,1,5
2,200,23,1,2
3,300,29,2,1
4,400,0,1,2
5,500,74,4,5
This is what I have working:
awk 'FNR==NR { a[FNR""] = NR", " $0","; next } { print a[FNR""], $0}' $file1 $file2
With the result:
1, 100, 45
2, 200, 23
3, 300, 29
4, 400, 0
5, 500, 74
But when I try and get it to work on 3 or more files, like so:
awk 'FNR==NR { a[FNR""] = NR", " $0","; next } { print a[FNR""], $0; next } { print a[FNR""], $0}' $file1 $file2 $file3
I get this output:
1, 100, 45
2, 200, 23
3, 300, 29
4, 400, 0
5, 500, 74
1, 100, 1
2, 200, 1
3, 300, 2
4, 400, 1
5, 500, 4
In the first column the line count restarts, and the second column simply repeats the first file. The third and subsequent files end up added as new rows in the third column, where I would expect them to be added as new columns; no new rows should be needed.
Any help would be greatly appreciated. I have learned most of my AWK from Stack Exchange, and I know I'm missing something fundamental here. Thanks,

As already answered, you can use paste. To get the exact output with comma-delimited line numbering, you can do this:
paste -d, file{1..4} | nl -s, -w1
-s, sets the number separator to a comma (the default is a tab).
-w1 sets the number width to 1, so there is no leading padding (the default width is larger).
Another solution with awk:
awk '{a[FNR]=a[FNR] "," $0}
END {for (i=1;i<=length(a);i++) print i a[i]}' file{1..4}
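Note that calling length() on an array is a gawk extension rather than POSIX awk. If portability matters, a minimal variant of the same idea (a sketch) can track the line count itself:
awk '{ a[FNR] = a[FNR] "," $0; if (FNR > n) n = FNR }   # remember the highest line number seen
     END { for (i = 1; i <= n; i++) print i a[i] }' file{1..4}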

Why don't you use paste and then simply number each row:
paste -d"," file1 file2 file3 file4
100,45,1,5
200,23,1,2
300,29,2,1
400,0,1,2
500,74,4,5
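To add the leading line numbers from the desired output, pipe the result through nl (as in the previous answer) or a small awk; for example:
# Number each pasted row, comma-separated
paste -d"," file1 file2 file3 file4 | awk '{print NR "," $0}'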

An awk solution for a variable number of files:
awk '{ !line[FNR] && line[FNR]=FNR; line[FNR]=line[FNR]","$0 }
END { for (i=1; i<=length(line); i++) print line[i] }' file1 file2 ... fileN
For example:
$ awk '{ !line[FNR] && line[FNR]=FNR; line[FNR]=line[FNR]","$0 }
END { for (i=1; i<=length(line); i++) print line[i] }' \
<(seq 1 5) <(seq 11 15) <(seq 21 25) <(seq 31 35)
1,1,11,21,31
2,2,12,22,32
3,3,13,23,33
4,4,14,24,34
5,5,15,25,35

Here is a beginner-friendly solution. If you need to manipulate the data on the way in, you can clearly see which file is being read.
ARGIND is gawk-specific; it tells us which file we are processing. We fill two arrays a and b from file1 and file2, and then print your desired output while processing file3.
awk '
ARGIND == 1 { a[FNR] = $0 ; next }
ARGIND == 2 { b[FNR] = $0 ; next }
ARGIND == 3 { print FNR "," a[FNR] "," b[FNR] "," $0 }
' file1 file2 file3
Output:
1,100,45,1
2,200,23,1
3,300,29,2
4,400,0,1
5,500,74,4
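The same pattern extends to any fixed number of files by adding one array per extra input. A sketch for all four sample files from the question:
awk '
ARGIND == 1 { a[FNR] = $0 ; next }                                  # first file: remember each line
ARGIND == 2 { b[FNR] = $0 ; next }                                  # second file: remember each line
ARGIND == 3 { c[FNR] = $0 ; next }                                  # third file: remember each line
ARGIND == 4 { print FNR "," a[FNR] "," b[FNR] "," c[FNR] "," $0 }   # last file: emit the joined row
' file1 file2 file3 file4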

Related

find identical keys with different values from two text files

I have two files which have data in a format like this:
cat File1.txt
A: 1
B: 2
C: 3
D: 4
E: 5
cat File2.txt
A: 10
B: 2
C: 30
D: 4
F: 6
I was wondering how I could print the diff for common keys like:
A: 1, 10
C: 3, 30
You could try
awk -F":" 'NR==FNR{a[$1]=$2} FNR!=NR && a[$1] && a[$1]!=$2{print $1":"a[$1]","$2}' File1.txt File2.txt
As it seems there are no duplicates in the file, this should do:
$ awk '{if(($1 in a)&&$2!=a[$1])print $1,a[$1] ", " $2;else a[$1]=$2}' file1 file2
Output:
A: 1, 10
C: 3, 30
Explained:
$ awk '{
    if(($1 in a) && $2!=a[$1])   # if $1 already seen and $2 not equal to previous
        print $1,a[$1] ", " $2   # output
    else
        a[$1]=$2                 # else store the value as seen for the first time
}' file1 file2
$ cat tst.awk
BEGIN { OFS=", " }
NR==FNR {
    a[$1] = $2
    next
}
($1 in a) && (a[$1] != $2) {
    print $0, a[$1]
}
$ awk -f tst.awk file2 file1
A: 1, 10
C: 3, 30

losing data when comparing a column with awk

I have a text file and all I want to do is compare the third column and see if it's equal to 1 or 0, so I just simply used
awk '$3 == 1 { print $0 }' input > output1
awk '$3 == 0 { print $0 }' input > output2
This is part of a bash script and I'm certain there is a more elegant approach to this, but the code above should get the job done, only it does not. input has 425 rows of text, the third column in input is always a 1 or 0, therefore the total number of rows in output1 + output2 should be 425. But I get 417 rows.
Here is a sample of input (all of it is just one row, and there are 425 such rows):
out_first.dat 1 1 0.000000 265075.000000 6.000000e-01 1.005205e-03 9.000000e-01 9.000000e-01 2.889631e+00 -2.423452e+00 3.730018e+00 -1.532915e+00
If $3 is 1 or 0, it is equal to its own square, and the line is printed to output1 or output2; otherwise the line goes to other for inspection.
awk '$3*$3==$3{print > "output"(2-$3); next} {print > "other"}' file
If $3*$3==$3 is confusing, change it to $3==0 || $3==1.
For the curious: $3==0 || $3==1 can be written as $3*($3-1)==0, from which the above follows.
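To check where the missing rows ended up, compare the line counts of the three output files against the 425 input rows; for example:
wc -l output1 output2 other    # the three counts should add up to 425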

Bash: extract columns with cut and filter one column further

I have a tab-separated file and want to extract a few columns with cut.
Two example lines:
(...)
0 0 1 0 AB=1,2,3;CD=4,5,6;EF=7,8,9 0 0
1 1 0 0 AB=2,1,3;CD=1,1,2;EF=5,3,4 0 1
(...)
What I want to achieve is to select columns 2,3,5 and 7, however from column 5 only CD=4,5,6.
So my expected result is
0 1 CD=4,5,6; 0
1 0 CD=1,1,2; 1
How can I use cut for this problem and run grep on one of the extracted columns? Any other one-liner is of course also fine.
Here is another awk. Splitting on both tabs and semicolons makes CD=4,5,6 a field of its own ($6), so it can be printed directly:
$ awk -F'\t|;' -v OFS='\t' '{print $2,$3,$6,$NF}' file
0 1 CD=4,5,6 0
1 0 CD=1,1,2 1
Or with cut/paste:
$ paste <(cut -f2,3 file) <(cut -d';' -f2 file) <(cut -f7 file)
0 1 CD=4,5,6 0
1 0 CD=1,1,2 1
This is easier done with awk: split the 5th field using ; as the separator, and then print the second subfield.
awk 'BEGIN {FS="\t"; OFS="\t"}
{split($5, a, ";"); print $2, $3, a[2]";", $7 }' inputfile > outputfile
If you want to print whichever subfield begins with CD=, use a loop:
awk 'BEGIN {FS="\t"; OFS="\t"}
     {n = split($5, a, ";");
      for (i = 1; i <= n; i++) {
          if (a[i] ~ /^CD=/) subfield = a[i];
      }
      print $2, $3, subfield";", $7}' < inputfile > outputfile
I think awk is the best tool for this kind of task and the other two answers give you good short solutions.
I want to point out that you can use awk's built-in splitting facility to gain more flexibility when parsing input. Here is an example script that uses implicit splitting:
parse.awk
# Remember second, third and seventh columns
{
    a = $2
    b = $3
    d = $7
}

# Split the fifth column on ";". After this the positional variables
# (e.g. $1, $2, ..., $NF) contain the fields from the previous
# fifth column
{
    oldFS = FS
    FS = ";"
    $0 = $5
}

# For example, to test if the second element starts with "CD", do
# something like this
$2 ~ /^CD/ {
    c = $2
}

# Print the selected elements
{
    print a, b, c, d
}

# Restore FS
{
    FS = oldFS
}
Run it like this:
awk -f parse.awk FS='\t' OFS='\t' infile
Output:
0 1 CD=4,5,6 0
1 0 CD=1,1,2 1

Grouping elements by two fields on a space delimited file

I have this data, ordered by column 2, then 3, then 1, in a space-delimited file (I used Linux sort to do that):
0 0 2
1 0 2
2 0 2
1 1 4
2 1 4
I want to create a new file (leaving the old file as is)
0 2 0,1,2
1 4 1,2
Basically, put fields 2 and 3 first and group the elements of field 1 (as a comma-separated list) by them. Is there a way to do that with an awk, sed, or bash one-liner, so as to avoid writing a Java or C++ app for it?
Since the file is already ordered, you can print each line as the key changes:
awk '
seen==$2 FS $3 { line=line "," $1; next }
{ if(seen) print seen, line; seen=$2 FS $3; line=$1 }
END { print seen, line }
' file
0 2 0,1,2
1 4 1,2
This will preserve the order of output.
With your input and output, this line may help:
awk '{f=$2 FS $3}!(f in a){i[++p]=f;a[f]=$1;next}
{a[f]=a[f]","$1}END{for(x=1;x<=p;x++)print i[x],a[i[x]]}' file
test:
kent$ cat f
0 0 2
1 0 2
2 0 2
1 1 4
2 1 4
kent$ awk '{f=$2 FS $3}!(f in a){i[++p]=f;a[f]=$1;next}{a[f]=a[f]","$1}END{for(x=1;x<=p;x++)print i[x],a[i[x]]}' f
0 2 0,1,2
1 4 1,2
awk 'a[$2, $3]++ { p = p "," $1; next } p { print p } { p = $2 FS $3 FS $1 } END { if (p) print p }' file
Output:
0 2 0,1,2
1 4 1,2
The solution assumes the data is sorted on the second and third columns.
Using awk:
awk '{k=$2 OFS $3} !(k in a){a[k]=$1; b[++n]=k; next} {a[k]=a[k] "," $1}
END{for (i=1; i<=n; i++) print b[i],a[b[i]]}' file
0 2 0,1,2
1 4 1,2
Yet another take:
awk -v SUBSEP=" " '
{group[$2,$3] = group[$2,$3] $1 ","}
END {
    for (g in group) {
        sub(/,$/,"",group[g])
        print g, group[g]
    }
}
' file > newfile
The SUBSEP variable is the string awk uses to join the subscripts when a multi-key index such as group[$2,$3] is stored in its (one-dimensional) associative arrays.
http://www.gnu.org/software/gawk/manual/html_node/Multidimensional.html#Multidimensional
This might work for you (GNU sed):
sed -r ':a;$!N;/(. (. .).*)\n(.) \2.*/s//\1,\3/;ta;s/(.) (.) (.)/\2 \3 \1/;P;D' file
This appends the first column of the subsequent record to the first record until the second and third keys change. Then the fields in the first record are re-arranged and printed out.
This uses the data presented but can be adapted for more complex data.

Add leading zeroes to awk variable

I have the following awk command within a "for" loop in bash:
awk -v pdb="$pdb" 'BEGIN {file = 1; filename = pdb"_" file ".pdb"}
/ENDMDL/ {getline; file ++; filename = pdb"_" file ".pdb"}
{print $0 > filename}' < ${pdb}.pdb
This reads a series of files with the name $pdb.pdb and splits them in files called $pdb_1.pdb, $pdb_2.pdb, ..., $pdb_21.pdb, etc. However, I would like to produce files with names like $pdb_01.pdb, $pdb_02.pdb, ..., $pdb_21.pdb, i.e., to add padding zeros to the "file" variable.
I have tried without success using printf in different ways. Help would be much appreciated.
Here's how to create leading zeros with awk:
# echo 1 | awk '{ printf("%02d\n", $1) }'
01
# echo 21 | awk '{ printf("%02d\n", $1) }'
21
Replace the 2 in %02d with the total number of digits you need (including the leading zeros).
Replace file in the filename construction with sprintf("%02d", file).
Or even the whole assignment with filename = sprintf("%s_%02d.pdb", pdb, file);.
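Applied to the command from the question, that last substitution would look something like this (a sketch; adjust the width in %02d as needed):
awk -v pdb="$pdb" 'BEGIN {file = 1; filename = sprintf("%s_%02d.pdb", pdb, file)}
    /ENDMDL/ {getline; file++; filename = sprintf("%s_%02d.pdb", pdb, file)}
    {print $0 > filename}' < "${pdb}.pdb"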
This does it without resorting to printf, which is comparatively expensive. The first parameter is the string to pad, the second is the total length after padding.
echo 722 8 | awk '{ for(c = 0; c < $2; c++) s = s"0"; s = s$1; print substr(s, 1 + length(s) - $2); }'
If you know in advance the length of the result string, you can use a simplified version (say 8 is your limit):
echo 722 | awk '{ s = "00000000"$1; print substr(s, 1 + length(s) - 8); }'
The result in both cases is 00000722.
Here is a function that left or right-pads values with zeroes depending on the parameters: zeropad(value, count, direction)
function zeropad(s,c,d) {
    if(d!="r")
        d="l"       # l is the default and fallback value
    return sprintf("%" (d=="l"? "0" c:"") "d" (d=="r"?"%0" c-length(s) "d":""), s,"")
}
{                   # test main
    print zeropad($1,$2,$3)
}
Some tests:
$ cat test
2 3 l
2 4 r
2 5
a 6 r
The test:
$ awk -f program.awk test
002
2000
00002
000000
It's not fully battle-tested, so strange parameters may yield strange results.
