Bash - Removing empty columns from .csv file

I have a large .csv file in which I have to remove columns which are empty. By empty, I mean that they have a header, but the rest of the column contains no data.
I've written a Bash script to try and do this, but am running into a few issues.
Here's the code:
#!/bin/bash
total="$(head -n 1 Reddit-cleaner.csv | grep -o ',' | wc -l)"
i=1
count=0
while [ $i -le $total ]; do
cat Reddit-cleaner.csv | cut -d "," -f$i | while read CMD; do if [ -n CMD ]; then count=$count+1; fi; done
if [ $count -eq 1 ]; then
cut -d "," -f$i --complement <Reddit-cleaner.csv >Reddit-cleanerer.csv
fi
count=0
i=$i+1
done
Firstly I find the number of columns, and store it in total. Then while the program has not reached the last column, I loop through the columns individually. The nested while loop checks if each row in the column is empty, and if there is more than one row that is not empty, it writes all other columns to another file.
I recognise that there are a few problems with this script. Firstly, the count modification occurs in a subshell, so count is never modified in the parent shell. Secondly, the file I am writing to will be overwritten every time the script finds an empty column.
So my question then is how can I fix this. I initially wanted to have it so that it wrote to a new file column by column, based on count, but couldn't figure out how to get that done either.
Edit: People have asked for a sample input and output.
Sample input:
User, Date, Email, Administrator, Posts, Comments
a, 20201719, a#a.com, Yes, , 3
b, 20182817, b#b.com, No, , 4
c, 20191618, , No, , 4
d, 20190126, , No, , 2
Sample output:
User, Date, Email, Administrator, Comments
a, 20201719, a#a.com, Yes, 3
b, 20182817, b#b.com, No, 4
c, 20191618, , No, 4
d, 20190126, , No, 2
In the sample output, the column which has no data in it except for the header (Posts) has been removed, while the columns which are either entirely or partially filled remain.
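The subshell problem described above can be reproduced in a few lines. This is only a minimal sketch: the two-column sample data and the variable names (csv, col, piped_count, redirected_count) are assumptions made for illustration and are not part of the original script. It also uses [ -n "$cell" ] and $((count + 1)), because [ -n CMD ] tests a literal string and count=$count+1 builds a string rather than adding.

```shell
# Hypothetical sample file, created here so the sketch is self-contained.
csv=$(mktemp)
printf 'h1,h2\na,\nb,\n' > "$csv"

# Piped: in bash the loop body runs in a subshell, so the parent's count stays 0.
count=0
cut -d, -f2 "$csv" | while read -r cell; do
    if [ -n "$cell" ]; then count=$((count + 1)); fi
done
piped_count=$count

# Redirected: the loop runs in the current shell, so the update survives.
col=$(mktemp)
cut -d, -f2 "$csv" > "$col"
count=0
while read -r cell; do
    if [ -n "$cell" ]; then count=$((count + 1)); fi
done < "$col"
redirected_count=$count

echo "piped=$piped_count redirected=$redirected_count"
rm -f "$csv" "$col"
```

In bash this prints piped=0 redirected=1. A count of 1 means only the header cell of column 2 is non-empty, which is the "empty column" condition the script tests for.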

I may be misinterpreting the question (due to its lack of example input and expected output), but this should be as simple as:
$ x="1,2,3,,4,field 5,,,six,7"
$ echo "${x//,+(,)/,}"
1,2,3,4,field 5,six,7
This requires bash with extglob enabled. Otherwise, you can use an external call to sed:
$ echo "1,2,3,,4,field 5,,,six,7" |sed 's/,,,*/,/g'
1,2,3,4,field 5,six,7
There's a lot of redundancy in your sample code. You should really consider awk since it already tracks the current field count (as NF) and the number of lines (as NR), so you could add that up with a simple total+=NF on each line. With the empty fields collapsed, awk can just run on the field number you want.
$ echo "1,2,3,,4,field 5,,,six,7" |awk -F ',+' '
{ printf "line %d has %d fields, the 6th of which is <%s>\n", NR, NF, $6 }'
line 1 has 7 fields, the 6th of which is <six>
This uses printf to report the record number (NR, the current line number), the number of fields (NF) and the value of the sixth field ($6). The field index can also be a variable: for example, $NF is the value of the final field, since awk fields are one-indexed.

It is actually a job for a CSV parser, but you may use this awk script to get the job done:
cat removeEmptyCellsCsv.awk
BEGIN {
FS = OFS = ", "
}
NR == 1 {
for (i=1; i<=NF; i++)
e[i] = 1 # initially all cols are marked empty
next
}
FNR == NR {
for (i=1; i<=NF; i++)
e[i] = e[i] && ($i == "")
next
}
{
s = ""
for (i=1; i<=NF; i++)
s = s (i==1 || e[i-1] ? "" : OFS) (e[i] ? "" : $i)
print s
}
Then run it as:
awk -f removeEmptyCellsCsv.awk file.csv{,}
Using the sample data provided in the question, it produces the following output:
User, Date, Email, Administrator, Comments
a, 20201719, a#a.com, Yes, 3
b, 20182817, b#b.com, No, 4
c, 20191618, , No, 4
d, 20190126, , No, 2
Note that the Posts column has been removed because it is empty in every record.

$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
if ( NR > 1 ) {
for (i=1; i<=NF; i++) {
if ( $i ~ /[^[:space:]]/ ) {
gotValues[i]
}
}
}
next
}
{
c=0
for (i=1; i<=NF; i++) {
if (i in gotValues) {
printf "%s%s", (c++ ? OFS : ""), $i
}
}
print ""
}
$ awk -f tst.awk file file
User, Date, Email, Administrator, Comments
a, 20201719, a#a.com, Yes, 3
b, 20182817, b#b.com, No, 4
c, 20191618, , No, 4
d, 20190126, , No, 2
See also What's the most robust way to efficiently parse CSV using awk? if you need to work with any more complicated CSVs than the one in your question.

You can use Miller (https://github.com/johnkerl/miller) and its remove-empty-columns verb.
Starting from
+------+----------+---------+---------------+-------+----------+
| User | Date | Email | Administrator | Posts | Comments |
+------+----------+---------+---------------+-------+----------+
| a | 20201719 | a#a.com | Yes | - | 3 |
| b | 20182817 | b#b.com | No | - | 4 |
| c | 20191618 | - | No | - | 4 |
| d | 20190126 | - | No | - | 2 |
+------+----------+---------+---------------+-------+----------+
and running
mlr --csv remove-empty-columns input.csv >output.csv
you will have
+------+----------+---------+---------------+----------+
| User | Date | Email | Administrator | Comments |
+------+----------+---------+---------------+----------+
| a | 20201719 | a#a.com | Yes | 3 |
| b | 20182817 | b#b.com | No | 4 |
| c | 20191618 | - | No | 4 |
| d | 20190126 | - | No | 2 |
+------+----------+---------+---------------+----------+

Related

How to outer-join two CSV files, using shell script?

I have two CSV files, like the following:
file1.csv
label,"Part-A"
"ABC mn","2.0"
"XYZ","3.0"
"PQR SN","6"
file2.csv
label,"Part-B"
"XYZ","4.0"
"LMN Wv","8"
"PQR SN","6"
"EFG","1.0"
Desired Output.csv
label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"
Currently, with the awk command below, I am able to combine the rows whose label appears in both files (such as PQR and XYZ), but I am unable to append the rows whose label appears in only one of the files:
awk -F, 'NR==FNR{a[$1]=substr($0,length($1)+2);next} ($1 in a){print $0","a[$1]}' file1.csv file2.csv
This solution produces exactly the desired result with any AWK.
Please note that the sorting algorithm is taken from the mawk manual.
# SO71053039.awk
#-------------------------------------------------
# insertion sort of A[1..n]
function isort( A,A_SWAP, n,i,j,hold ) {
n = 0
for (j in A)
A_SWAP[++n] = j
for( i = 2 ; i <= n ; i++)
{
hold = A_SWAP[j = i]
while ( A_SWAP[j-1] "" > "" hold )
{ j-- ; A_SWAP[j+1] = A_SWAP[j] }
A_SWAP[j] = hold
}
# sentinel A_SWAP[0] = "" will be created if needed
return n
}
BEGIN {
FS = OFS = ","
out = "Output.csv"
# read file 1
fnr = 0
while ((getline < ARGV[1]) > 0) {
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME1["label"]]
LABEL_KEY1[$FIELDBYNAME1["label"]] = $FIELDBYNAME1["\"Part-A\""]
}
}
close(ARGV[1])
# read file2
fnr = 0
while ((getline < ARGV[2]) > 0) {
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME2["label"]]
LABEL_KEY2[$FIELDBYNAME2["label"]] = $FIELDBYNAME2["\"Part-B\""]
}
}
close(ARGV[2])
# print the header
print "label" OFS "\"Part-A\"" OFS "\"Part-B\"" > out
# get the result
z = isort(LABEL_KEY, LABEL_KEY_SWAP)
for (i = 1; i <= z; i++) {
result_string = sprintf("%s", LABEL_KEY_SWAP[i])
if (LABEL_KEY_SWAP[i] in LABEL_KEY1)
result_string = sprintf("%s", result_string OFS LABEL_KEY1[LABEL_KEY_SWAP[i]] OFS (LABEL_KEY_SWAP[i] in LABEL_KEY2 ? LABEL_KEY2[LABEL_KEY_SWAP[i]] : "NA"))
else
result_string = sprintf("%s", result_string OFS "NA" OFS LABEL_KEY2[LABEL_KEY_SWAP[i]])
print result_string > out
}
}
Call:
awk -f SO71053039.awk file1.csv file2.csv
=> result file Output.csv with content:
label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"
I would like to introduce Miller to you. It is a tool that can do a few things with a few file formats and that is available as a stand-alone binary. You just have to download the archive, put the mlr executable somewhere (preferably in your PATH) and you're done with the installation.
mlr --csv \
join -f file1.csv -j 'label' --ul --ur \
then \
unsparsify --fill-with 'NA' \
then \
sort -f 'label' \
file2.csv
Command parts:
mlr --csv
means that you want to read CSV files and output CSV. As another example, if you want to read CSV files and output JSON, it would be mlr --icsv --ojson
join -f file1.csv -j 'label' --ul --ur ...... file2.csv
means to join file1.csv and file2.csv on the field label and emit the unmatching records of both files
then is Miller's way of chaining operations
unsparsify --fill-with 'NA'
means to create the fields that didn't exist in each file and fill them with NA. It's needed for the records that had a unique label
then sort -f 'label'
means to sort the records on the field label
Regarding the updated question: mlr handles the CSV quoting on its own. The only difference with your new expected output is that it removes the superfluous quotes:
label,Part-A,Part-B
ABC mn,2.0,NA
EFG,NA,1.0
LMN Wv,NA,8
PQR SN,6,6
XYZ,3.0,4.0
awk -v OFS=, '{
if(!o1[$1]) { o1[$1]=$NF; o2[$1]="NA" } else { o2[$1]=$NF }
}
END{
for(v in o1) { print v, o1[v], o2[v] }
}' file{1,2}
## output
LMN,8,NA
ABC,2,NA
PQR,6,6
EFG,1,NA
XYZ,3,4
I think this will do nicely.
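In the snippet above, a label that occurs only in the second file ends up in the first value column (EFG,1,NA in the sample output). A hedged variant that records which file each value came from might look like the following. The unquoted sample data, the temporary directory and the array names (hdr, a, b, seen) are assumptions for illustration, and the row order from for (k in seen) is unspecified, so pipe the rows through sort if order matters.

```shell
# Hypothetical inputs, created here so the sketch is self-contained.
dir=$(mktemp -d)
printf 'label,Part-A\nABC,2\nXYZ,3\nPQR,6\n' > "$dir/file1.csv"
printf 'label,Part-B\nXYZ,4\nLMN,8\nPQR,6\nEFG,1\n' > "$dir/file2.csv"

joined=$(awk -F, -v OFS=, '
    FNR == 1    { hdr[++nfiles] = $2; next }   # remember each header, skip the line
    nfiles == 1 { a[$1] = $2 }                 # values seen in the first file
    nfiles == 2 { b[$1] = $2 }                 # values seen in the second file
                { seen[$1] }                   # every label from either file
    END {
        print "label", hdr[1], hdr[2]
        for (k in seen)                        # iteration order is unspecified
            print k, (k in a ? a[k] : "NA"), (k in b ? b[k] : "NA")
    }' "$dir/file1.csv" "$dir/file2.csv")

echo "$joined"
rm -rf "$dir"
```

With this layout, EFG and LMN get NA in the Part-A column and ABC gets NA in the Part-B column.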
We suggest a gawk script (gawk is the default awk on many Linux distributions):
script.awk
NR == FNR {
valsStr = sprintf("%s,%s", $2, "na");
rowsArr[$1] = valsStr;
}
NR != FNR && $1 in rowsArr {
split(rowsArr[$1],valsArr);
valsStr = sprintf("%s,%s", valsArr[1], $2);
rowsArr[$1] = valsStr;
next;
}
NR != FNR {
valsStr = sprintf("%s,%s", "na", $2);
rowsArr[$1] = valsStr;
}
END {
printf("%s,%s\n", "label", rowsArr["label"]);
for (rowName in rowsArr) {
if (rowName == "label") continue;
printf("%s,%s\n", rowName, rowsArr[rowName]);
}
}
output:
awk -F, -f script.awk input.{1,2}.txt
label,Part-A,Part-B
LMN,na,8
ABC,2,na
PQR,6,6
EFG,na,1
XYZ,3,4
Since your question was titled with "how to do ... in a shell script?" and not necessarily with awk, I'm going to recommend GoCSV, a command-line tool with several sub-commands for processing CSVs (delimited files).
It doesn't have a single command that can accomplish what you need, but you can compose a number of commands to get the correct result.
The core of this solution is the join command which can perform inner (default), left, right, and outer joins; you want an outer join to keep the non-overlapping elements:
gocsv join -c 'label' -outer file1.csv file2.csv > joined.csv
echo 'Joined'
gocsv view joined.csv
Joined
+-------+--------+-------+--------+
| label | Part-A | label | Part-B |
+-------+--------+-------+--------+
| ABC | 2 | | |
+-------+--------+-------+--------+
| XYZ | 3 | XYZ | 4 |
+-------+--------+-------+--------+
| PQR | 6 | PQR | 6 |
+-------+--------+-------+--------+
| | | LMN | 8 |
+-------+--------+-------+--------+
| | | EFG | 1 |
+-------+--------+-------+--------+
The data-part is correct, but it'll take some work to get the columns correct, and to get the NA values in there.
Here's a complete pipeline:
gocsv join -c 'label' -outer file1.csv file2.csv \
| gocsv rename -c 1 -names 'Label_A' \
| gocsv rename -c 3 -names 'Label_B' \
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' \
| gocsv select -c 'label','Part-A','Part-B' \
| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' \
| gocsv sort -c 'label' \
> final.csv
echo 'Final'
gocsv view final.csv
which gets us the correct, final, file:
Final
+-------+--------+--------+
| label | Part-A | Part-B |
+-------+--------+--------+
| ABC | 2 | NA |
+-------+--------+--------+
| EFG | NA | 1 |
+-------+--------+--------+
| LMN | NA | 8 |
+-------+--------+--------+
| PQR | 6 | 6 |
+-------+--------+--------+
| XYZ | 3 | 4 |
+-------+--------+--------+
There's a lot going on in that pipeline; the high points are:
Merge the two label fields
| gocsv rename -c 1 -names 'Label_A' \
| gocsv rename -c 3 -names 'Label_B' \
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' \
Pare down to just the 3 columns you want
| gocsv select -c 'label','Part-A','Part-B' \
Add the NA values and sort by label
| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' \
| gocsv sort -c 'label' \
I've made a step-by-step explanation at this Gist.
You mentioned join in the comment on my other answer, and I'd forgotten about this utility:
#!/bin/sh
rm -f *sorted.csv
# Join two files, normally inner-join only, but
# - `-a 1 -a 2`: include "unpaired lines" from file 1 and file 2
# - `-1 1 -2 1`: the first column from each is the "join column"
# - `-o 0,1.2,2.2`: output the "join column" (0) and the second fields from files 1 and 2
# Note: join expects both inputs to be sorted on the join column
join -a 1 -a 2 -1 1 -2 1 -o '0,1.2,2.2' -t, file1.csv file2.csv > joined.csv
# Add NA values
sed 's/,,/,NA,/' joined.csv | sed 's/,$/,NA/' > unsorted.csv
# Sort, pull out header first
head -n 1 unsorted.csv > sorted.csv
# Then sort remainder
tail -n +2 unsorted.csv | sort -t, -k 1 >> sorted.csv
And, here's sorted.csv
+--------+--------+--------+
| label | Part-A | Part-B |
+--------+--------+--------+
| ABC mn | 2.0 | NA |
+--------+--------+--------+
| EFG | NA | 1.0 |
+--------+--------+--------+
| LMN Wv | NA | 8 |
+--------+--------+--------+
| PQR SN | 6 | 6 |
+--------+--------+--------+
| XYZ | 3.0 | 4.0 |
+--------+--------+--------+
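Since join expects both inputs to be sorted on the join field, one possible variant sorts the data rows first and lets join fill the gaps itself with -e. This is a sketch under assumptions: the fields are unquoted, the headers are handled separately, and the temporary file names (a.sorted, b.sorted) are made up for illustration.

```shell
export LC_ALL=C   # sort and join must agree on the collation order

# Hypothetical unquoted inputs, created here so the sketch is self-contained.
dir=$(mktemp -d)
printf 'label,Part-A\nABC mn,2.0\nXYZ,3.0\nPQR SN,6\n' > "$dir/file1.csv"
printf 'label,Part-B\nXYZ,4.0\nLMN Wv,8\nPQR SN,6\nEFG,1.0\n' > "$dir/file2.csv"

# Sort only the data rows of each file on the join field.
tail -n +2 "$dir/file1.csv" | sort -t, -k1,1 > "$dir/a.sorted"
tail -n +2 "$dir/file2.csv" | sort -t, -k1,1 > "$dir/b.sorted"

outer=$(
    echo 'label,Part-A,Part-B'
    # -a 1 -a 2 keeps unpaired lines from both files;
    # -e NA fills the fields listed in -o that have no value.
    join -t, -a 1 -a 2 -e NA -o '0,1.2,2.2' "$dir/a.sorted" "$dir/b.sorted"
)

echo "$outer"
rm -rf "$dir"
```

Because the inputs are already sorted on the label, the joined rows come out in label order and no separate sed or sort step is needed afterwards.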
As @Fravadona correctly stated in his comment, CSV files whose fields can contain the delimiter, a newline, or double quotes need a proper CSV parser.
Actually, only two functions are needed: One for unquoting CSV fields to normal AWK fields and one for quoting the AWK fields to write the data back to CSV fields.
I have written a variant of my previous answer (https://stackoverflow.com/a/71056926/18135892) that uses Ed Morton's CSV parser (https://stackoverflow.com/a/45420607/18135892 with the gsub variant which works with any AWK version) to give an example of proper CSV parsing:
This solution prints the desired result, correctly sorted, with any AWK.
Please note that the sorting algorithm is taken from the mawk manual.
# SO71053039_2.awk
# unquote CSV:
# Ed Morton's CSV parser: https://stackoverflow.com/a/45420607/18135892
function buildRec( fpat,fldNr,fldStr,done) {
CurrRec = CurrRec $0
if ( gsub(/"/,"&",CurrRec) % 2 ) {
# The string built so far in CurrRec has an odd number
# of "s and so is not yet a complete record.
CurrRec = CurrRec RS
done = 0
}
else {
# If CurrRec ended with a null field we would exit the
# loop below before handling it so ensure that cannot happen.
# We use a regexp comparison using a bracket expression here
# and in fpat so it will work even if FS is a regexp metachar
# or a multi-char string like "\\\\" for \-separated fields.
CurrRec = CurrRec ( CurrRec ~ ("[" FS "]$") ? "\"\"" : "" )
$0 = ""
fpat = "([^" FS "]*)|(\"([^\"]|\"\")+\")"
while ( (CurrRec != "") && match(CurrRec,fpat) ) {
fldStr = substr(CurrRec,RSTART,RLENGTH)
# Convert <"foo"> to <foo> and <"foo""bar"> to <foo"bar>
if ( sub(/^"/,"",fldStr) && sub(/"$/,"",fldStr) ) {
gsub(/""/, "\"", fldStr)
}
$(++fldNr) = fldStr
CurrRec = substr(CurrRec,RSTART+RLENGTH+1)
}
CurrRec = ""
done = 1
}
return done
}
# quote CSV:
# Quote according to https://datatracker.ietf.org/doc/html/rfc4180 rules
function csvQuote(field, sep) {
if ((field ~ sep) || (field ~ /["\r\n]/)) {
gsub(/"/, "\"\"", field)
field = "\"" field "\""
}
return field
}
#-------------------------------------------------
# insertion sort of A[1..n]
function isort( A,A_SWAP, n,i,j,hold ) {
n = 0
for (j in A)
A_SWAP[++n] = j
for( i = 2 ; i <= n ; i++)
{
hold = A_SWAP[j = i]
while ( A_SWAP[j-1] "" > "" hold )
{ j-- ; A_SWAP[j+1] = A_SWAP[j] }
A_SWAP[j] = hold
}
# sentinel A_SWAP[0] = "" will be created if needed
return n
}
BEGIN {
FS = OFS = ","
# read file 1
fnr = 0
while ((getline < ARGV[1]) > 0) {
if (! buildRec())
continue
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME1["label"]]
LABEL_KEY1[$FIELDBYNAME1["label"]] = $FIELDBYNAME1["Part-A"]
}
}
close(ARGV[1])
# read file2
fnr = 0
while ((getline < ARGV[2]) > 0) {
if (! buildRec())
continue
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME2["label"]]
LABEL_KEY2[$FIELDBYNAME2["label"]] = $FIELDBYNAME2["Part-B"]
}
}
close(ARGV[2])
# print the header
print "label" OFS "Part-A" OFS "Part-B"
# get the result
z = isort(LABEL_KEY, LABEL_KEY_SWAP)
for (i = 1; i <= z; i++) {
result_string = sprintf("%s", csvQuote(LABEL_KEY_SWAP[i], OFS))
if (LABEL_KEY_SWAP[i] in LABEL_KEY1)
result_string = sprintf("%s", result_string OFS csvQuote(LABEL_KEY1[LABEL_KEY_SWAP[i]], OFS) OFS (LABEL_KEY_SWAP[i] in LABEL_KEY2 ? csvQuote(LABEL_KEY2[LABEL_KEY_SWAP[i]], OFS) : "NA"))
else
result_string = sprintf("%s", result_string OFS "NA" OFS csvQuote(LABEL_KEY2[LABEL_KEY_SWAP[i]], OFS))
print result_string
}
}
Call:
awk -f SO71053039_2.awk file1.csv file2.csv
=> result (superfluous quotes according to CSV rules are omitted):
label,Part-A,Part-B
ABC mn,2.0,NA
EFG,NA,1.0
LMN Wv,NA,8
PQR SN,6,6
XYZ,3.0,4.0

How to add Column values based on unique value of a different column

I am trying to sum the values in Column B for each unique value in Column A. How can I do it using AWK, or any other way in bash?
Column_A | Column_B
--------------------
A | 1
A | 2
A | 1
B | 3
B | 8
C | 5
C | 8
Result:
Column_A | Column_B
--------------------
A | 6
B | 11
C | 13
Assuming your Input_file is the same as the sample shown and is sorted on the first field, could you please try the following (I will edit the solution for alignment soon).
awk '
BEGIN{
OFS=" | "
}
FNR==1 || /^-/{
print
next
}
prev!=$1 && prev{
print prev,sum
prev=sum=""
}
{
sum+=$NF
prev=$1
}
END{
if(prev && sum){
print prev,sum
}
}' Input_file
Another awk solution:
$ awk 'NR<3 {print; next}
{a[$1]+=$NF; line[$1]=$0}
END {for(k in a) {sub(/[0-9]+$/,a[k],line[k]); print line[k]}}' file
Column_A | Column_B
--------------------
A | 4
B | 11
C | 13
Note that with the sample input A totals 4 (1 + 2 + 1), not 6.
One possible solution (Assuming file is in CSV format):
Input :
$ cat csvtest.csv
A,1
A,2
A,3
B,3
B,8
C,5
C,8
$ cat csvtest.csv | awk -F "," '{arr[$1]+=$2} END {for (i in arr) {print i","arr[i]}}'
A,6
B,11
C,13
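The order in which for (i in arr) visits the keys is unspecified in awk, so the lines above may come out in a different order on another awk. A small sketch that pins the order by piping through sort follows; the inline sample data and the variable name summed are assumptions for illustration.

```shell
# Sum column 2 per key in column 1, then sort so the order is deterministic.
summed=$(printf 'A,1\nA,2\nA,3\nB,3\nB,8\nC,5\nC,8\n' |
    awk -F, '{ sum[$1] += $2 } END { for (k in sum) print k "," sum[k] }' |
    sort)

echo "$summed"
```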

CONCAT columns within a file

I'd like to concatenate columns 2 through 4 into a single column.
Example (first.txt):
|ID|column2|column3|column4|
|1 | a | b | c |
|2 | d | e | f |
To this (mynewfile.txt) :
ID|column2
1 | a b c
2 | d e f
This is my script in cygwin : $ awk '{print $2" "$3" "$4 }' first.txt > mynewfile.txt
Of course, it is not working out well.. How do I improve the script?
You need to set the field separator so that a pipe with optional whitespace around it is the field delimiter.
The pipe at the beginning of the line causes an empty field 1 before the pipe, so the ID is field 2, and columns 2-4 are fields 3-5. So it should be:
awk -F' *\\| *' 'NR == 1 {print "ID|column2|"} NR > 1 {printf("%d | %s %s %s |\n", $2, $3, $4, $5)}' first.txt > mynewfile.txt
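To see the field numbering this separator produces, a quick sketch can print every field of one data line with its index. The inline sample line and the variable name fields are assumptions for illustration.

```shell
# Print each field with its index to show how the leading pipe shifts the numbering.
fields=$(printf '|1 | a | b | c |\n' |
    awk -F' *\\| *' '{ for (i = 1; i <= NF; i++) printf "%d=<%s>\n", i, $i }')

echo "$fields"
```

Field 1 and the last field come out empty because the line starts and ends with a pipe, so the ID is in field 2 and the three columns to join are fields 3 to 5.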
Not especially general GNU sed method:
sed 's/^[|]//;1s/2.*/2/;1!{s/|/ /g2;s/ */ /2g}' first.txt
Output:
ID|column2
1 | a b c
2 | d e f

awk command to print multiple columns using for loop

I have a single file in which the 1st and 2nd columns contain the item code and name, and the 3rd to 12th columns contain 10 consecutive days of consumption quantities.
Now I need to convert that into 10 different files. In each, the 1st and 2nd columns should be the same item code and item name, and the 3rd column should contain the consumption quantity for one day.
input file:
Code | Name | Day1 | Day2 | Day3 |...
10001 | abcd | 5 | 1 | 9 |...
10002 | degg | 3 | 9 | 6 |...
10003 | gxyz | 4 | 8 | 7 |...
I need the Output in different file as
file 1:
Code | Name | Day1
10001 | abcd | 5
10002 | degg | 3
10003 | gxyz | 4
file 2:
Code | Name | Day2
10001 | abcd | 1
10002 | degg | 9
10003 | gxyz | 8
file 3:
Code | Name | Day3
10001 | abcd | 9
10002 | degg | 6
10003 | gxyz | 7
and so on....
I wrote a code like this
awk 'BEGIN { FS = "\t" } ; {print $1,$2,$3}' FILE_NAME > file1;
awk 'BEGIN { FS = "\t" } ; {print $1,$2,$4}' FILE_NAME > file2;
awk 'BEGIN { FS = "\t" } ; {print $1,$2,$5}' FILE_NAME > file3;
and so on...
Now I need to write this as a 'for' or 'while' loop, which I hope would be faster.
I don't know the exact code; maybe something like this:
for (( i=3; i<=NF; i++)) ; do awk 'BEGIN { FS = "\t" } ; {print $1,$2,$i}' input.tsv > $i.tsv; done
Kindly help me get the output as I explained.
If you absolutely need to use a loop in Bash, then your loop can be fixed like this:
for ((i = 3; i <= 12; i++)); do awk -v field="$i" 'BEGIN { FS = "\t" } { print $1, $2, $field }' input.tsv > "file$i.tsv"; done
But it would be really better to solve this using pure awk, without shell at all:
awk -v FS='\t' '
NR == 1 {
for (i = 3; i < NF; i++) {
fn = "file" (i - 2) ".txt";
print $1, $2, $i > fn;
print "" >> fn;
}
}
NR > 2 {
for (i = 3; i < NF; i++) {
fn = "file" (i - 2) ".txt";
print $1, $2, $i >> fn;
}
}' inputfile
That is, on the first record, create the output files by writing the header line and a blank line (as specified in your question).
For the 3rd and later records, append to the files.
Note that the code in your question suggests that the fields in the file are separated by tabs, but the example files seem to use | padded with variable number of spaces. It's not clear which one is your actual case. If it's really tab-separated, then the above code will work. If in fact it's as the example inputs, then change the first line to this:
awk -v OFS=' | ' -v FS='[ |]+' '
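The reason the loop in the question cannot work as written is that $i inside single quotes is evaluated by awk, not by the shell. A minimal sketch of the difference follows, using one made-up tab-separated line; the variable names wrong and right are assumptions for illustration.

```shell
i=4

# Inside single quotes awk sees its own unset variable i, so $i means $0 (the whole line).
wrong=$(printf '10001\tabcd\t5\t1\t9\n' |
    awk 'BEGIN { FS = "\t" } { print $i }')

# -v copies the shell value into the awk variable field, so $field is the 4th field.
right=$(printf '10001\tabcd\t5\t1\t9\n' |
    awk -v field="$i" 'BEGIN { FS = "\t" } { print $field }')

printf 'wrong=<%s>\nright=<%s>\n' "$wrong" "$right"
```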
bash + cut solution:
input.tsv test content:
Code | Name | Day1 | Day2 | Day3
10001 | abcd | 5 | 1 | 9
10002 | degg | 3 | 9 | 6
10003 | gxyz | 4 | 8 | 7
day_splitter.sh script:
#!/bin/bash
n=$(head -n 1 "$1" | awk -F'|' '{print NF}') # total number of fields
for ((i=3; i<=n; i++))
do
fn="Day$((i-2))" # file name containing the `Day` number
cut -d'|' -f1,2,"$i" "$1" > "$fn.txt" # keep Code, Name and the i-th column
done
Usage:
bash day_splitter.sh input.tsv
Results:
$cat Day1.txt
Code | Name | Day1
10001 | abcd | 5
10002 | degg | 3
10003 | gxyz | 4
$cat Day2.txt
Code | Name | Day2
10001 | abcd | 1
10002 | degg | 9
10003 | gxyz | 8
$cat Day3.txt
Code | Name | Day3
10001 | abcd | 9
10002 | degg | 6
10003 | gxyz | 7
In pure awk:
$ awk 'BEGIN{FS=OFS="|"}{for(i=3;i<=NF;i++) {f="file" (i-2); print $1,$2,$i >> f; close(f)}}' file
Explained:
$ awk '
BEGIN {
FS=OFS="|" } # set delimiters
{
for(i=3;i<=NF;i++) { # loop the consumption fields
f="file" (i-2) # create the filename
print $1,$2,$i >> f # append to target file
close(f) } # close the target file
}' file

Replace string in Nth array

I have a .txt file with rows of strings that looks like this:
id | String1 | String2 | Counts
1 | Abc | Abb | 0
2 | Cde | Cdf | 0
I want to increment the count, so I need to replace the last number, but only on one particular line.
I get the current value with this command:
$(awk -F "|" -v i=$idOpen 'FNR == i { gsub (" ", "", $0); print $4}' filename)"
Then I want to replace it with a new value that is larger by 1.
I compute that here:
counts=(("$(awk -F "|" -v i=$idOpen 'FNR == i { gsub (" ", "", $0); print $4}' filename)"+1))
Here idOpen is the id of the row in which I need to replace the string.
So I tried to replace the whole row with this:
counter="$(awk -v i=$idOpen 'BEGIN{FNqR == i}{$7+=1} END{ print $0}' bookmarks)"
N=$idOpen
sed -i "{N}s/.*/${counter}" bookmarks
But it doesn't work!
Is there a way to replace only the last string with the value I obtained earlier?
As a result I need to get:
id | String1 | String2 | Counts
1 | Abc | Abb | 1 # if idOpen was 1 for 1 time
2 | Cde | Cdf | 2 # if idOpen was 2 for 2 times
The last number should increase by 1 every time I run these commands.
awk solution:
Setting the idOpen variable (for example, to 2):
idOpen=2
awk -F'|' -v i=$idOpen 'NR>1{if($1 == i) $4=" "$4+1}1' OFS='|' file > tmp && mv tmp file
The output (after executing the above command twice):
cat file
id | String1 | String2 | Counts
1 | Abc | Abb | 0
2 | Cde | Cdf | 2
NR>1 - skipping the header line
