How to outer-join two CSV files, using shell script? - bash

I have two CSV files, like the following:
file1.csv
label,"Part-A"
"ABC mn","2.0"
"XYZ","3.0"
"PQR SN","6"
file2.csv
label,"Part-B"
"XYZ","4.0"
"LMN Wv","8"
"PQR SN","6"
"EFG","1.0"
Desired Output.csv
label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"
Currently with the below awk command i am able to combine the matching one's which have entries for label in both the files like PQR and XYZ but unable to append the ones that are not having label values present in both the files:
awk -F, 'NR==FNR{a[$1]=substr($0,length($1)+2);next} ($1 in a){print $0","a[$1]}' file1.csv file2.csv

This solution prints exactly the wished result with any AWK.
Please note that the sorting algorithm is taken from the mawk manual.
# SO71053039.awk
#-------------------------------------------------
# insertion sort of A[1..n]
function isort( A,A_SWAP, n,i,j,hold ) {
n = 0
for (j in A)
A_SWAP[++n] = j
for( i = 2 ; i <= n ; i++)
{
hold = A_SWAP[j = i]
while ( A_SWAP[j-1] "" > "" hold )
{ j-- ; A_SWAP[j+1] = A_SWAP[j] }
A_SWAP[j] = hold
}
# sentinel A_SWAP[0] = "" will be created if needed
return n
}
BEGIN {
FS = OFS = ","
out = "Output.csv"
# read file 1
fnr = 0
while ((getline < ARGV[1]) > 0) {
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME1["label"]]
LABEL_KEY1[$FIELDBYNAME1["label"]] = $FIELDBYNAME1["\"Part-A\""]
}
}
close(ARGV[1])
# read file2
fnr = 0
while ((getline < ARGV[2]) > 0) {
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME2["label"]]
LABEL_KEY2[$FIELDBYNAME2["label"]] = $FIELDBYNAME2["\"Part-B\""]
}
}
close(ARGV[2])
# print the header
print "label" OFS "\"Part-A\"" OFS "\"Part-B\"" > out
# get the result
z = isort(LABEL_KEY, LABEL_KEY_SWAP)
for (i = 1; i <= z; i++) {
result_string = sprintf("%s", LABEL_KEY_SWAP[i])
if (LABEL_KEY_SWAP[i] in LABEL_KEY1)
result_string = sprintf("%s", result_string OFS LABEL_KEY1[LABEL_KEY_SWAP[i]] OFS (LABEL_KEY_SWAP[i] in LABEL_KEY2 ? LABEL_KEY2[LABEL_KEY_SWAP[i]] : "NA"))
else
result_string = sprintf("%s", result_string OFS "NA" OFS LABEL_KEY2[LABEL_KEY_SWAP[i]])
print result_string > out
}
}
Call:
awk -f SO71053039.awk file1.csv file2.csv
=> result file Output.csv with content:
label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"

I would like to introduce Miller to you. It is a tool that can do a few things with a few file formats and that is available as a stand-alone binary. You just have to download the archive, put the mlr executable somewhere (preferably in your PATH) and you're done with the installation.
mlr --csv \
join -f file1.csv -j 'label' --ul --ur \
then \
unsparsify --fill-with 'NA' \
then \
sort -f 'label' \
file2.csv
Command parts:
mlr --csv
means that you want to read CSV files and output a CSV format. As an other example, if you want to read CSV files and output a JSON format it would be mlr --icsv --ojson
join -f file1.csv -j 'label' --ul --ur ...... file2.csv
means to join file1.csv and file2.csv on the field label and emit the unmatching records of both files
then is Miller's way of chaining operations
unsparsify --fill-with 'NA'
means to create the fields that didn't exist in each file and fill them with NA. It's needed for the records that had a uniq label
then sort -f 'label'
means to sort the records on the field label
Regarding the updated question: mlr handles the CSV quoting on its own. The only difference with your new expected output is that it removes the superfluous quotes:
label,Part-A,Part-B
ABC mn,2.0,NA
EFG,NA,1.0
LMN Wv,NA,8
PQR SN,6,6
XYZ,3.0,4.0

awk -v OFS=, '{
if(!o1[$1]) { o1[$1]=$NF; o2[$1]="NA" } else { o2[$1]=$NF }
}
END{
for(v in o1) { print v, o1[v], o2[v] }
}' file{1,2}
## output
LMN,8,NA
ABC,2,NA
PQR,6,6
EFG,1,NA
XYZ,3,4
I think this will do nicely.

We suggest gawk script which is standard Linux awk:
script.awk
NR == FNR {
valsStr = sprintf("%s,%s", $2, "na");
rowsArr[$1] = valsStr;
}
NR != FNR && $1 in rowsArr {
split(rowsArr[$1],valsArr);
valsStr = sprintf("%s,%s", valsArr[1], $2);
rowsArr[$1] = valsStr;
next;
}
NR != FNR {
valsStr = sprintf("%s,%s", "na", $2);
rowsArr[$1] = valsStr;
}
END {
printf("%s,%s\n", "label", rowsArr["label"]);
for (rowName in rowsArr) {
if (rowName == "label") continue;
printf("%s,%s\n", rowName, rowsArr[rowName]);
}
}
output:
awk -F, -f script.awk input.{1,2}.txt
label,Part-A,Part-B
LMN,na,8
ABC,2,na
PQR,6,6
EFG,na,1
XYZ,3,4

Since your question was titled with "how to do ... in a shell script?" and not necessarily with awk, I'm going to recommend GoCSV, a command-line tool with several sub-commands for processing CSVs (delimited files).
It doesn't have a single command that can accomplish what you need, but you can compose a number of commands to get the correct result.
The core of this solution is the join command which can perform inner (default), left, right, and outer joins; you want an outer join to keep the non-overlapping elements:
gocsv join -c 'label' -outer file1.csv file2.csv > joined.csv
echo 'Joined'
gocsv view joined.csv
Joined
+-------+--------+-------+--------+
| label | Part-A | label | Part-B |
+-------+--------+-------+--------+
| ABC | 2 | | |
+-------+--------+-------+--------+
| XYZ | 3 | XYZ | 4 |
+-------+--------+-------+--------+
| PQR | 6 | PQR | 6 |
+-------+--------+-------+--------+
| | | LMN | 8 |
+-------+--------+-------+--------+
| | | EFG | 1 |
+-------+--------+-------+--------+
The data-part is correct, but it'll take some work to get the columns correct, and to get the NA values in there.
Here's a complete pipeline:
gocsv join -c 'label' -outer file1.csv file2.csv \
| gocsv rename -c 1 -names 'Label_A' \
| gocsv rename -c 3 -names 'Label_B' \
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' \
| gocsv select -c 'label','Part-A','Part-B' \
| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' \
| gocsv sort -c 'label' \
> final.csv
echo 'Final'
gocsv view final.csv
which gets us the correct, final, file:
Final pipeline
+-------+--------+--------+
| label | Part-A | Part-B |
+-------+--------+--------+
| ABC | 2 | NA |
+-------+--------+--------+
| EFG | NA | 1 |
+-------+--------+--------+
| LMN | NA | 8 |
+-------+--------+--------+
| PQR | 6 | 6 |
+-------+--------+--------+
| XYZ | 3 | 4 |
+-------+--------+--------+
There's a lot going on in that pipeline, the high points are:
Merge the the two label fields
| gocsv rename -c 1 -names 'Label_A' \
| gocsv rename -c 3 -names 'Label_B' \
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' \
Pare-down to just the 3 columns you want
| gocsv select -c 'label','Part-A','Part-B' \
Add the NA values and sort by label
| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' \
| gocsv sort -c 'label' \
I've made a step-by-step explanation at this Gist.

You mentioned join in the comment on my other answer, and I'd forgotten about this utility:
#!/bin/sh
rm -f *sorted.csv
# Join two files, normally inner-join only, but
# - `-a 1 -a 2`: include "unpaired lines" from file 1 and file 2
# - `-1 1 -2 1`: the first column from each is the "join column"
# - `-o 0,1.2,2.2`: output the "join column" (0) and the second fields from files 1 and 2
join -a 1 -a 2 -1 1 -2 1 -o '0,1.2,2.2' -t, file1.csv file2.csv > joined.csv
# Add NA values
cat joined.csv | sed 's/,,/,NA,/' | sed 's/,$/,NA/' > unsorted.csv
# Sort, pull out header first
head -n 1 unsorted.csv > sorted.csv
# Then sort remainder
tail -n +2 unsorted.csv | sort -t, -k 1 >> sorted.csv
And, here's sorted.csv
+--------+--------+--------+
| label | Part-A | Part-B |
+--------+--------+--------+
| ABC mn | 2.0 | NA |
+--------+--------+--------+
| EFG | NA | 1.0 |
+--------+--------+--------+
| LMN Wv | NA | 8 |
+--------+--------+--------+
| PQR SN | 6 | 6 |
+--------+--------+--------+
| XYZ | 3.0 | 4.0 |
+--------+--------+--------+

As #Fravadona stated correctly in his comment, for CSV files that can contain the delimiter, a newline or double quotes inside a field a proper CSV parser is needed.
Actually, only two functions are needed: One for unquoting CSV fields to normal AWK fields and one for quoting the AWK fields to write the data back to CSV fields.
I have written a variant of my previous answer (https://stackoverflow.com/a/71056926/18135892) that uses Ed Morton's CSV parser (https://stackoverflow.com/a/45420607/18135892 with the gsub variant which works with any AWK version) to give an example of proper CSV parsing:
This solution prints the wished result correctly sorted with any AWK.
Please note that the sorting algorithm is taken from the mawk manual.
# SO71053039_2.awk
# unquote CSV:
# Ed Morton's CSV parser: https://stackoverflow.com/a/45420607/18135892
function buildRec( fpat,fldNr,fldStr,done) {
CurrRec = CurrRec $0
if ( gsub(/"/,"&",CurrRec) % 2 ) {
# The string built so far in CurrRec has an odd number
# of "s and so is not yet a complete record.
CurrRec = CurrRec RS
done = 0
}
else {
# If CurrRec ended with a null field we would exit the
# loop below before handling it so ensure that cannot happen.
# We use a regexp comparison using a bracket expression here
# and in fpat so it will work even if FS is a regexp metachar
# or a multi-char string like "\\\\" for \-separated fields.
CurrRec = CurrRec ( CurrRec ~ ("[" FS "]$") ? "\"\"" : "" )
$0 = ""
fpat = "([^" FS "]*)|(\"([^\"]|\"\")+\")"
while ( (CurrRec != "") && match(CurrRec,fpat) ) {
fldStr = substr(CurrRec,RSTART,RLENGTH)
# Convert <"foo"> to <foo> and <"foo""bar"> to <foo"bar>
if ( sub(/^"/,"",fldStr) && sub(/"$/,"",fldStr) ) {
gsub(/""/, "\"", fldStr)
}
$(++fldNr) = fldStr
CurrRec = substr(CurrRec,RSTART+RLENGTH+1)
}
CurrRec = ""
done = 1
}
return done
}
# quote CSV:
# Quote according to https://datatracker.ietf.org/doc/html/rfc4180 rules
function csvQuote(field, sep) {
if ((field ~ sep) || (field ~ /["\r\n]/)) {
gsub(/"/, "\"\"", field)
field = "\"" field "\""
}
return field
}
#-------------------------------------------------
# insertion sort of A[1..n]
function isort( A,A_SWAP, n,i,j,hold ) {
n = 0
for (j in A)
A_SWAP[++n] = j
for( i = 2 ; i <= n ; i++)
{
hold = A_SWAP[j = i]
while ( A_SWAP[j-1] "" > "" hold )
{ j-- ; A_SWAP[j+1] = A_SWAP[j] }
A_SWAP[j] = hold
}
# sentinel A_SWAP[0] = "" will be created if needed
return n
}
BEGIN {
FS = OFS = ","
# read file 1
fnr = 0
while ((getline < ARGV[1]) > 0) {
if (! buildRec())
continue
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME1["label"]]
LABEL_KEY1[$FIELDBYNAME1["label"]] = $FIELDBYNAME1["Part-A"]
}
}
close(ARGV[1])
# read file2
fnr = 0
while ((getline < ARGV[2]) > 0) {
if (! buildRec())
continue
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME2["label"]]
LABEL_KEY2[$FIELDBYNAME2["label"]] = $FIELDBYNAME2["Part-B"]
}
}
close(ARGV[2])
# print the header
print "label" OFS "Part-A" OFS "Part-B"
# get the result
z = isort(LABEL_KEY, LABEL_KEY_SWAP)
for (i = 1; i <= z; i++) {
result_string = sprintf("%s", csvQuote(LABEL_KEY_SWAP[i], OFS))
if (LABEL_KEY_SWAP[i] in LABEL_KEY1)
result_string = sprintf("%s", result_string OFS csvQuote(LABEL_KEY1[LABEL_KEY_SWAP[i]], OFS) OFS (LABEL_KEY_SWAP[i] in LABEL_KEY2 ? csvQuote(LABEL_KEY2[LABEL_KEY_SWAP[i]], OFS) : "NA"))
else
result_string = sprintf("%s", result_string OFS "NA" OFS csvQuote(LABEL_KEY2[LABEL_KEY_SWAP[i]], OFS))
print result_string
}
}
Call:
awk -f SO71053039_2.awk file1.csv file2.csv
=> result (superfluous quotes according to CSV rules are omitted):
label,Part-A,Part-B
ABC mn,2.0,NA
EFG,NA,1.0
LMN Wv,NA,8
PQR SN,6,6
XYZ,3.0,4.0

Related

Bash - Removing empty columns from .csv file

I have a large .csv file in which I have to remove columns which are empty. By empty, I mean that they have a header, but the rest of the column contains no data.
I've written a Bash script to try and do this, but am running into a few issues.
Here's the code:
#!/bin/bash
total="$(head -n 1 Reddit-cleaner.csv | grep -o ',' | wc -l)"
i=1
count=0
while [ $i -le $total ]; do
cat Reddit-cleaner.csv | cut -d "," -f$i | while read CMD; do if [ -n CMD ]; then count=$count+1; fi; done
if [ $count -eq 1 ]; then
cut -d "," -f$i --complement <Reddit-cleaner.csv >Reddit-cleanerer.csv
fi
count=0
i=$i+1
done
Firstly I find the number of columns, and store it in total. Then while the program has not reached the last column, I loop through the columns individually. The nested while loop checks if each row in the column is empty, and if there is more than one row that is not empty, it writes all other columns to another file.
I recognise that there are a few problems with this script. Firstly, the count modification occurs in a subshell, so count is never modified in the parent shell. Secondly, the file I am writing to will be overwritten every time the script finds an empty column.
So my question then is how can I fix this. I initially wanted to have it so that it wrote to a new file column by column, based on count, but couldn't figure out how to get that done either.
Edit: People have asked for a sample input and output.
Sample input:
User, Date, Email, Administrator, Posts, Comments
a, 20201719, a#a.com, Yes, , 3
b, 20182817, b#b.com, No, , 4
c, 20191618, , No, , 4
d, 20190126, , No, , 2
Sample output:
User, Data, Email, Administrator, Comments
a, 20201719, a#a.com, Yes, 3
b, 20182817, b#b.com, No, 4
c, 20191618, , No, 4
d, 20190126, , No, 2
In the sample output, the column which has no data in it except for the header (Posts) has been removed, while the columns which are either entirely or partially filled remain.
I may be misinterpreting the question (due to its lack of example input and expected output), but this should be as simple as:
$ x="1,2,3,,4,field 5,,,six,7"
$ echo "${x//,+(,)/,}"
1,2,3,4,field 5,six,7
This requires bash with extglob enabled. Otherwise, you can use an external call to sed:
$ echo "1,2,3,,4,field 5,,,six,7" |sed 's/,,,*/,/g'
1,2,3,4,field 5,six,7
There's a lot of redundancy in your sample code. You should really consider awk since it already tracks the current field count (as NF) and the number of lines (as NR), so you could add that up with a simple total+=NF on each line. With the empty fields collapsed, awk can just run on the field number you want.
$ echo "1,2,3,,4,field 5,,,six,7" |awk -F ',+' '
{ printf "line %d has %d fields, the 6th of which is <%s>\n", NR, NF, $6 }'
line 1 has 7 fields, the 6th of which is <six>
This uses printf to denote the number of records (NR, the current line number), the number of fields (NF) and the value of the sixth field ($6, can also be as a variable, e.g. $NF is the value of the final field since awk is one-indexed).
It is actually job of a CSV parser but you may use this awk script to get the job done:
cat removeEmptyCellsCsv.awk
BEGIN {
FS = OFS = ", "
}
NR == 1 {
for (i=1; i<=NF; i++)
e[i] = 1 # initially all cols are marked empty
next
}
FNR == NR {
for (i=1; i<=NF; i++)
e[i] = e[i] && ($i == "")
next
}
{
s = ""
for (i=1; i<=NF; i++)
s = s (i==1 || e[i-1] ? "" : OFS) (e[i] ? "" : $i)
print s
}
Then run it as:
awk -f removeEmptyCellsCsv.awk file.csv{,}
Using sample data provided in question, it will produce following output:
1, User, Date, Email, Administrator, Comments
2, a, 20201719, a#a.com, Yes, 3
3, b, 20182817, b#b.com, No, 4
4, c, 20191618, , No, 4
5, d, 20190126, , No, 2
Note that Posts columns has been removed because it is empty in every record.
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
if ( NR > 1 ) {
for (i=1; i<=NF; i++) {
if ( $i ~ /[^[:space:]]/ ) {
gotValues[i]
}
}
}
next
}
{
c=0
for (i=1; i<=NF; i++) {
if (i in gotValues) {
printf "%s%s", (c++ ? OFS : ""), $i
}
}
print ""
}
$ awk -f tst.awk file file
User, Date, Email, Administrator, Comments
a, 20201719, a#a.com, Yes, 3
b, 20182817, b#b.com, No, 4
c, 20191618, , No, 4
d, 20190126, , No, 2
See also What's the most robust way to efficiently parse CSV using awk? if you need to work with any more complicated CSVs than the one in your question.
You can use Miller (https://github.com/johnkerl/miller) and its remove-empty-columns verb.
Starting from
+------+----------+---------+---------------+-------+----------+
| User | Date | Email | Administrator | Posts | Comments |
+------+----------+---------+---------------+-------+----------+
| a | 20201719 | a#a.com | Yes | - | 3 |
| b | 20182817 | b#b.com | No | - | 4 |
| c | 20191618 | - | No | - | 4 |
| d | 20190126 | - | No | - | 2 |
+------+----------+---------+---------------+-------+----------+
and running
mlr --csv remove-empty-columns input.csv >output.csv
you will have
+------+----------+---------+---------------+----------+
| User | Date | Email | Administrator | Comments |
+------+----------+---------+---------------+----------+
| a | 20201719 | a#a.com | Yes | 3 |
| b | 20182817 | b#b.com | No | 4 |
| c | 20191618 | - | No | 4 |
| d | 20190126 | - | No | 2 |
+------+----------+---------+---------------+----------+

How to add Column values based on unique value of a different column

I am trying to add values in Column B based on unique value in Column A. How can i do it using AWK (or) Any other way using bash?
Column_A | Column_B
--------------------
A | 1
A | 2
A | 1
B | 3
B | 8
C | 5
C | 8
Result:
Column_A | Column_B
--------------------
A | 6
B | 11
C | 13
Considering that your Input_file is same as shown, sorted with first field, could you please try following(will edit solution for alignment too soon).
awk '
BEGIN{
OFS=" | "
}
FNR==1 || /^-/{
print
next
}
prev!=$1 && prev{
print prev,sum
prev=sum=""
}
{
sum+=$NF
prev=$1
}
END{
if(prev && sum){
print prev,sum
}
}' Input_file
another awk
$ awk 'NR<3 {print; next}
{a[$1]+=$NF; line[$1]=$0}
END {for(k in a) {sub(/[0-9]+$/,a[k],line[k]); print line[k]}}' file
Column_A | Column_B
--------------------
A | 4
B | 11
C | 13
note that A totals to 4, not 6.
One possible solution (Assuming file is in CSV format):
Input :
$ cat csvtest.csv
A,1
A,2
A,3
B,3
B,8
C,5
C,8
$ cat csvtest.csv | awk -F "," '{arr[$1]+=$2} END {for (i in arr) {print i","arr[i]}}'
A,6
B,11
C,13

Replace string in Nth array

I have a .txt file with strings in arrays which looks like these:
id | String1 | String2 | Counts
1 | Abc | Abb | 0
2 | Cde | Cdf | 0
And i want to add counts, so i need to replace last digit, but i need to change it only for the one line.
I am getting new needed value by this function:
$(awk -F "|" -v i=$idOpen 'FNR == i { gsub (" ", "", $0); print $4}' filename)"
And them I want to replace it with new value, which will be bigger for 1.
And im doing it right in there.
counts=(("$(awk -F "|" -v i=$idOpen 'FNR == i { gsub (" ", "", $0); print $4}' filename)"+1))
Where IdOpen is an id of the array, where i need to replace string.
So i have tried to replace the whole array by these:
counter="$(awk -v i=$idOpen 'BEGIN{FNqR == i}{$7+=1} END{ print $0}' bookmarks)"
N=$idOpen
sed -i "{N}s/.*/${counter}" bookmarks
But it doesn't work!
So is there a way to replace only last string with value which i have got earlier?
As result i need to get:
id | String1 | String2 | Counts
1 | Abc | Abb | 1 # if idOpen was 1 for 1 time
2 | Cde | Cdf | 2 # if idOpen was 2 for 2 times
And the last number will be increased by 1 everytime when i will activate these commands.
awk solution:
setting idOpen variable(for ex. 2):
idOpen=2
awk -F'|' -v i=$idOpen 'NR>1{if($1 == i) $4=" "$4+1}1' OFS='|' file > tmp && mv tmp file
The output(after executing the above command twice):
cat file
id | String1 | String2 | Counts
1 | Abc | Abb | 0
2 | Cde | Cdf | 2
NR>1 - skipping the header line

use awk to match rows for each column

How can awk be used to find values that match in row 2 for each column?
I would like to take in a tab limited file and for each column if any row below row 2 matches what is in row 2, print field with "match".
transforming this tab delimited file:
header1 | header2 | header3
1 | 1 | B
--------+---------+----------
3 | 1 | A
2 | A | B
1 | B | 1
To this:
header1 | header2 | header3
1 | 1 | B
--------+---------+----------
3 | 1 match | A
2 | A | B match
1 match | B | 1
I would go for something like this:
$ cat file
header1 header2 header3
1 1 B
3 1 A
2 A B
1 B 1
$ awk -v OFS='\t' 'NR == 2 { for (i=1; i<=NF; ++i) a[i] = $i }
NR > 2 { for(i=1;i<=NF;++i) if ($i == a[i]) $i = $i " match" }1' file
header1 header2 header3
1 1 B
3 1 match A
2 A B match
1 match B 1
On the second line, populate the array a with the contents of each field. On subsequent lines, add "match" when they match the corresponding value in the array. The 1 at the end is a common shorthand causing each line to be printed. Setting the output field separator OFS to a tab character preserves the format of the data.
Pedantically, with GNU Awk 4.1.1:
awk -f so.awk so.txt
header1 header2 header3
1 1 B
3 1* A
2 A B*
1* B 1
with so.awk:
{
if(1 == NR) {
print $0;
} else if(2 == NR) {
for(i = 1; i <= NF; i++) {
answers[i]=$i;
}
print $0;
} else {
for(i = 1; i <= NF; i++) {
field = $i;
if(answers[i]==$i) {
field = field "*" # a match
}
printf("%s\t",field);
}
printf("%s", RS);
}
}
and so.txt as a tab delimited data file:
header1 header2 header3
1 1 B
3 1 A
2 A B
1 B 1
This isn't homework, right...?

How to align text in columns (centered) without removing delimiter?

I would like to adjust my source centered in columns...
Source:
IP | ASN | Prefix | AS Name | CN | Domain | ISP
109.228.12.96 | 8560 | 109.228.0.0/18 | ONEANDONE | DE | fasthosts.com | Fast Hosts LTD
Goal:
IP | ASN | Prefix | AS Name | CN | Domain | ISP
109.228.12.96 | 8560 | 109.228.0.0/18 | ONEANDONE | DE | fasthosts.com | Fast Hosts LTD
I tried different things with the command column...but I have double spaces inside:
cat Source.txt | sed 's/ *| */#| /g' | column -s '#' -t
IP | ASN | Prefix | AS Name | CN | Domain | ISP
109.228.12.96 | 8560 | 109.228.0.0/18 | ONEANDONE | DE | fasthosts.com | Fast Hosts LTD
Is there a way to use column without removing the delimiter...or another solution?
Thanks in advance for your help!
You can also do everything in awk. Save the program to pr.awk and run
awk -f pr.awk input.dat
BEGIN {
FS = "|"
ARGV[2] = "pass=2" # a trick to read file two times
ARGV[3] = ARGV[1]
ARGC=4
pass = 1
}
function trim(s) {
sub(/^[[:space:]]+/, "", s) # remove leading
sub(/[[:space:]]+$/, "", s) # and trailing whitespaces
return s
}
pass == 1 {
for (i=1; i<=NF; i++) {
field = trim($i)
len = length(field)
w[i] = len>w[i] ? len : w[i] # find the maximum width
}
}
pass == 2 {
line = ""
for (i=1; i<=NF; i++) {
field = trim($i)
s = i==NF ? field : sprintf("%-" w[i] "s", field)
sep = i==1 ? "" : " | "
line = line sep s
}
print line
}
column has input sepatator -s and also output seperator -o
so call is like
cat file | column -t -s '|' -o '|'

Resources