How to convert CSV to Excel, adding header rows between different tables, using a shell script? - bash

I want to process a CSV file line by line and, whenever table_name changes, add a header row.
Sample CSV (note that the header spans the first two lines):
table_name,no.,data
attribute,column_name,definition,data_type,valid_values,notes
archive_rule,1,ID,id,,int,,
archive_rule,2,EXECUTE SEQ,execute_seq,,int,,
archive_rule,3,ARCHIVE RULE NAME,archive_rule_name,,varchar,,
archive_rule,4,ARCHIVE RULE TABLE NAME,archive_rule_table_name,,varchar,,
archive_rule,5,ARCHIVE RULE PK NAME,archive_rule_pk_name,,varchar,,
archive_rule,6,ARCHIVE BATCH SIZE,archive_batch_size,,int,,
archive_rule,7,ACTIVE STATUS,active_status,,varchar,,
archive_table,1,ID,id,,int,,
archive_table,2,ARCHIVE RULE ID,archive_rule_id,,int,,
archive_table,3,EXECUTE SEQ,execute_seq,,int,,
archive_table,4,ARCHIVE DEPEND TABLE ID,archive_depend_table_id,,int,,
archive_table,5,ARCHIVE DEPEND LEVEL,archive_depend_level,,int,,
archive_table,6,ACTIVE STATUS,active_status,,varchar,,
batch_job,1,BATCH JOB ID,batch_job_id,,int,,
batch_job,2,JOB TYPE,job_type,,varchar,,
batch_job,3,JOB NAME,job_name,,varchar,,
batch_job,4,EXECUTION DATE,execution_date,,timestamp,,
batch_job,5,EXECUTION RESULT,execution_result,,varchar,,
batch_job,6,ERROR MESSAGE,error_message,,varchar,,
batch_job,7,REPORT OUTPUT,report_output,,varchar,,
Desired Result:
Data : archive_rule
no.,data attribute,column_name,definition,data_type,valid_values,notes
1,ID,id,,int,,
2,EXECUTE SEQ,execute_seq,,int,,
3,ARCHIVE RULE NAME,archive_rule_name,,varchar,,
4,ARCHIVE RULE TABLE NAME,archive_rule_table_name,,varchar,,
5,ARCHIVE RULE PK NAME,archive_rule_pk_name,,varchar,,
6,ARCHIVE BATCH SIZE,archive_batch_size,,int,,
...
Data: archive_table
no.,data attribute,column_name,definition,data_type,valid_values,notes
1,ID,id,,int,,
2,ARCHIVE RULE ID,archive_rule_id,,int,,
3,EXECUTE SEQ,execute_seq,,int,,
4,ARCHIVE DEPEND TABLE ID,archive_depend_table_id,,int,,
5,ARCHIVE DEPEND LEVEL,archive_depend_level,,int,,
...
Please help me find a way to produce this output.

I can only imagine one way here: read the input file line by line, and use cut to extract the first field. This should do the trick:
#! /bin/bash
# accept both process.sh file and process.sh < file
if [ $# -eq 1 ]
then file="$1"
else file=-
fi
# initialize table name to the empty string
cur=""
# process the input line by line after skipping the two header lines
cat "$file" | tail -n +3 | (
    while read -r line
    do
        tab=$( echo "$line" | cut -f 1 -d, )    # extract table name
        if [ "x$tab" != "x$cur" ]
        then
            cur=$tab          # if a new one, remember it
            echo "Data: $tab" # and write header
            echo "no.,data attribute,column_name,definition,data_type,valid_values,notes"
        fi
        echo "$line" | cut -f 2- -d,            # copy all except the first field
    done )
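For what it's worth, a quick way to test it (assuming the script is saved as process.sh and the sample CSV as input.csv; both names are just for illustration):
chmod +x process.sh
./process.sh input.csv
./process.sh < input.csv    # equivalent, reading from stdin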
But I would use a real scripting language like Ruby or Python here...

Using awk:
$ awk '
BEGIN { FS=OFS="," }    # set field separators
NR==1 {                 # first record, start building the header
    h=$2 OFS $3
    next
}
NR==2 {                 # second record, continue building the header
    h=h $0              # the space was at the end of record NR==1
    next
}
$1!=p {                 # when the table name changes
    print "Data : " $1  # print the table name
    print h             # and the header
}
{
    for(i=2;i<=NF;i++)                    # print fields 2..NF
        printf "%s%s",$i,(i==NF?ORS:OFS)  # field separator or newline
    p=$1                                  # remember the table name for the next record
}' file
Output:
Data : archive_rule
no.,data attribute,column_name,definition,data_type,valid_values,notes
1,ID,id,,int,,
2,EXECUTE SEQ,execute_seq,,int,,
...
Data : archive_table
no.,data attribute,column_name,definition,data_type,valid_values,notes
1,ID,id,,int,,
2,ARCHIVE RULE ID,archive_rule_id,,int,,
...
Data : batch_job
no.,data attribute,column_name,definition,data_type,valid_values,notes
1,BATCH JOB ID,batch_job_id,,int,,
2,JOB TYPE,job_type,,varchar,,
...
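If each table should instead land in its own file (say, to import the blocks into separate Excel sheets), the same change-detection logic can redirect the output per table. A sketch along the lines of the script above; the <table_name>.csv naming is an assumption:
awk '
BEGIN { FS="," }
NR==1 { h=$2 FS $3; next }                                 # first header line
NR==2 { h=h $0; next }                                     # second header line
$1!=p { close(out); out=$1 ".csv"; print h > out; p=$1 }   # new table: new file
{ rec=$0; sub(/^[^,]*,/, "", rec); print rec > out }       # drop the first field
' file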

Related

Bash script to compare and generate csv datafile

I have two CSV files, DATA1.csv and DATA2.csv; the content is something like this (with headers):
DATA1.csv
Client Name;strnu;addr;fav
MAD01;HDGF;11;V PO
CVOJF01;HHD-;635;V T
LINKO10;DH--JDH;98;V ZZ
DATA2.csv
USER;BINin;TYPE
XXMAD01XXXHDGFXX;11;N
KJDGD;635;M
CVOJF01XXHHD;635;N
Issues:
The values of the 1st and 2nd columns of DATA1.csv appear, embedded, in the first column of DATA2.csv.
For example, MAD01;HDGF exists in the first column of DATA2 as ***MAD01***HDGF** (* can be alphanumeric and/or symbol characters), and MAD01;HDGF might not appear in that order in the USER column of DATA2.
The value of strnu in DATA1 is equal to the value of the column BINin in DATA2.
The column fav in DATA1 is the same as TYPE in DATA2, because V T = M and V PO = N (some other values may exist but we won't need them; for example, line 3 of DATA1 should be ignored).
N.B.: some data may exist in one file but not the other.
My bash script needs to generate a new CSV file that should contain:
The column USER from DATA2
Client Name and strnu from DATA1
BINin from DATA2, only if it's equal to the corresponding line and value of strnu in DATA1
TYPE in M/N format, respecting the mapping V T = M and V PO = N
The first thing I tried was using grep to search for lines that exist in both files:
#!/bin/sh
DATA1="${1}"
DATA2="${2}"
for i in $(cat $DATA1 | awk -F";" '{print $1".*"$2}' | sed 1d) ; do
grep "$i" $DATA2
done
Result :
$ ./script.sh DATA1.csv DATA2.csv
MAD01;HDGF;11;V PO
XXMAD01XXXHDGFXX;11;N
CVOJF01;HHD-;635;V T
LINKO10;DH--JDH;98;V PO
Using grep and awk I could find lines that are present in both DATA1 and DATA2, but it doesn't work for all the lines; I guess that's because of the - and other special characters present in column 2 of DATA1, though those can be ignored.
I don't know how I can generate a new CSV that joins the lines present in both files, but the expected CSV should look like this:
USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;M
This can be done in a single awk program. This is join.awk:
BEGIN {
    FS = OFS = ";"
    print "USER", "Client Name", "strnu", "BINin", "TYPE"
}
FNR == 1 {next}     # skip each file's header line
NR == FNR {         # first file (DATA1.csv): remember strnu per client
    strnu[$1] = $2
    next
}
{                   # second file (DATA2.csv)
    for (client in strnu) {
        strnu_pattern = strnu[client]
        gsub(/-/, "", strnu_pattern)    # hyphens don't appear in DATA2, drop them
        if ($1 ~ client && $1 ~ strnu_pattern) {
            print $1, client, strnu[client], $2, $3
            break
        }
    }
}
and then
awk -f join.awk DATA1.csv DATA2.csv
outputs
USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;N
Assumptions/understandings:
ignore lines from DATA1.csv where the fav field is not one of V T or V PO
when matching fields we need to ignore any hyphens in the DATA1.csv fields
when matching fields the strings from DATA1.csv can show up in either order in DATA2.csv
the last line of the expected output should end with 635;N
One awk idea:
awk '
BEGIN { FS=OFS=";"
print "USER","Client Name","strnu","BINin","TYPE" # print new header
}
FNR==1 { next } # skip input headers
FNR==NR { if ($4 == "V PO" || $4 == "V T") { # only process if fav is one of "V PO" or "V T"
cnames[FNR]=$1 # save client name
strnus[FNR]=$2 # save strnu
}
next
}
{ for (i in cnames) { # loop through array indices
cname=cnames[i] # make copy of client name ...
strnu=strnus[i] # and strnu so that we can ...
gsub(/-/,"",cname) # strip hypens from both ...
gsub(/-/,"",strnu) # in order to perform the comparisons ...
if (index($1,cname) && index($1,strnu)) { # if cname and strnu both exist in $1 then index()>=1 in both cases so ...
print $1,cnames[i],strnus[i],$2,$3 # print to stdout
next # we found a match so break from loop and go to next line of input
}
}
}
' DATA1.csv DATA2.csv
This generates:
USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;N

Reading CSV file in Shell Scripting

I am trying to read values from a CSV file dynamically, based on the header. Here's what my input files can look like.
File 1:
name,city,age
john,New York,20
jane,London,30
or
File 2:
name,age,city,country
john,20,New York,USA
jane,30,London,England
I may not be following the best way to accomplish this, but I tried the following code.
#!/bin/bash
{
read -r line
line=`tr ',' ' ' <<< $line`
while IFS=, read -r `$line`
do
echo $name
echo $city
echo $age
done
} < file.txt
I am expecting the above code to read the values of the header as the variable names. I know that the order of columns can be different in the input file, but I expect the files to have name, city and age columns. Is this the right approach? If so, what is the fix for the above code? It fails with the error "line 7: name: command not found".
The issue is caused by the backticks. Bash will evaluate the contents and replace the backticks with the output from the command it just evaluated.
You can simply use the variable after the read command to achieve what you want:
#!/bin/bash
{
read -r line
line=`tr ',' ' ' <<< $line`
echo "$line"
while IFS=, read -r $line ; do
echo "person: $name -- $city -- $age"
done
} < file.txt
Some notes on your code:
The backtick syntax is legacy syntax; it is now preferred to use $(...) to evaluate commands. The new syntax is more flexible.
You can enable automatic script failure with set -euo pipefail. This will make your script stop if it encounters an error.
Your code is currently very sensitive to invalid header data:
with a file like
n ame,age,city,country
john,20,New York,USA
jane,30,London,England
your script (or rather the version at the beginning of my answer) will run without errors but with invalid output.
It is also good practice to quote variables to prevent unwanted word splitting (see the short demo below).
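A quick demo of those last two notes (the value is made up, just to show the splitting):
line=$(tr ',' ' ' <<< "$line")    # $(...) instead of backticks
city="New York"
printf '%s\n' $city               # unquoted: word-splits into two lines
printf '%s\n' "$city"             # quoted: prints a single line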
To make it much more robust, you can change it as follows:
#!/bin/bash
set -euo pipefail
# -e and -o pipefail will make the script exit
# in case of command failure (or piped command failure)
# -u will exit in case a variable is undefined
# (in you case, if the header is invalid)
{
read -r line
readarray -d, -t header < <(printf "%s" "$line")
# using an array allows detecting whether one of the header entries
# contains an invalid character
# the printf is needed because bash would add a newline to the
# command input if using a here-string (<<<).
while IFS=, read -r "${header[@]}" ; do
echo "$name"
echo "$city"
echo "$age"
done
} < file.txt
A slightly different approach can let awk handle the field separation and ordering of the desired output given either of the input files. Below awk stores the desired output order in the f[] (field) array set in the BEGIN rule. Then on the first line in a file (FNR==1) the array a[] is deleted and filled with the headings from the current file. At that point you just loop over the field names in-order in the f[] array and output the corresponding field from the current line, e.g.
awk -F, '
BEGIN { f[1]="name"; f[2]="city"; f[3]="age" } # desired order
FNR==1 { # on first line read header
delete a # clear a array
for (i=1; i<=NF; i++) # loop over headings
a[$i] = i # index by heading, val is field no.
next # skip to next record
}
{
print "" # optional newline between outputs
for (i=1; i<=3; i++) # loop over desired field order
if (f[i] in a) # validate field in a array
print $a[f[i]] # output the field's value
}
' file1 file2
Example Use/Output
With the content you show in file1 and file2, running the script above gives:
john
New York
20
jane
London
30
john
New York
20
jane
London
30
Both files are read and handled identically despite having different field orderings.
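If you would rather emit the reordered records as CSV rows instead of one field per line, the same a[] lookup works inside a single print. A sketch under the same assumptions (all three headings present in every file):
awk -F, '
BEGIN { OFS=","; f[1]="name"; f[2]="city"; f[3]="age" }
FNR==1 { delete a; for (i=1;i<=NF;i++) a[$i]=i; next }  # map heading -> field number
{ print $a[f[1]], $a[f[2]], $a[f[3]] }                  # emit name,city,age
' file1 file2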
If using Bash version ≥ 4.2, it is possible to use an associative array to capture an arbitrary number of fields, with each field's name as a key:
#!/usr/bin/env bash
# Associative array to store column names as keys and field values as values
declare -A fields
# Array to store column names by index
declare -a column_name
# Array to store row's values
declare -a line
# Commands block consuming CSV input
{
# Read first line to capture column names
IFS=, read -r -a column_name
# Process records
while IFS=, read -r -a line; do
# Store column values to corresponding field name
for ((i=0; i<${#column_name[@]}; i++)); do
# Fills fields' associative array
fields["${column_name[i]}"]="${line[i]}"
done
# Dump fields for debug|demo purpose
# Processing of each captured value could go there instead
declare -p fields
done
} < file.txt
Sample output with file 2:
declare -A fields=([country]="USA" [city]="New York" [age]="20" [name]="john" )
declare -A fields=([country]="England" [city]="London" [age]="30" [name]="jane" )
For older Bash versions without associative arrays, use indexed arrays with the column names instead:
#!/usr/bin/env bash
# Array to store column names by index
declare -a column_name
# Array to store values for a line
declare -a value
# Commands block consuming CSV input
{
# Read first line to capture column names
IFS=, read -r -a column_name
# Process records
while IFS=, read -r -a value; do
# Print record separator
printf -- '--------------------------------------------------\n'
# Print captured field name and value
for ((i=0; i<"${#column_name[@]}"; i++)); do
printf '%-18s: %s\n' "${column_name[i]}" "${value[i]}"
done
done
} < file.txt
Output:
--------------------------------------------------
name : john
age : 20
city : New York
country : USA
--------------------------------------------------
name : jane
age : 30
city : London
country : England

SED not adding a first line to .csv file

I am doing a project for school and my head has gone through 3 walls with how many times I have bashed it. The project is to ask for a name and a color and assign each to a variable, then make a directory named after the color variable in the /tmp directory, create a .csv file with a header, and pull selected columns, out of order, from a given .txt file. I have gotten to the point of adding the columns, but no matter what I do I can't get sed to add a header or import the information from the .txt file.
As you can see, I have tried multiple ways to modify the file, but I don't know enough yet to make it work.
The input file format is as follows:
1. 734-44-2041 James SMITH jsmith@beltec.us 360-555-4778 360-555-0158
and it should look like:
james,smith,james.smith@beltec.us,734-44-2041-000
I am assuming that the 3 commas are intended to be 0's at the end
This is the code I have so far:
#!/bin/bash
#interactive=
#variables
color=/tmp/$color
csvfile=/tmp/blue/midterm.csv
if [ "$1" == "" ]; then
echo "you should use the -c or -C flags
exit
fi
#adding the -c flag and setting the filename variable
while [ "$1" != "" ]; do
case $1 in
-c | -C ) shift
filename=$1
;;
* ) echo "you should use the -c flag"
exit 1
esac
shift
done
#get user's name
echo "what is your name"
read user_name
#get fav color from user
echo "what is your favorate color"
read color
# make the fav color directory
if [ ! -f /tmp/$color ]; then
mkdir /tmp/$color
else
echo "bad luck $user_name"
exit 1
fi
#cd into the directory
cd /tmp/$color
# make a csv file in /tmp/$color
touch midterm.csv
awk '
BEGIN { FS=OFS=","; print "Firstname","lastname","Maildomain","Password" }
{ print $2,$3,$4,$1 }
' "$filename" > "/tmp/$color/midterm.csv"
sed by default outputs its results on the standard output.
In case you need to overwrite the old file, use -i (or better, -i.bak, which keeps the previous version of the file in <filename>.bak).
Moreover, in case you need to add something only at the beginning of the file, use the following syntax:
sed '1iYOUR_TEXT'
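For example, with GNU sed, to prepend the header used in this question (the file path comes from the script above; -i.bak keeps a backup):
sed -i.bak '1i Firstname,Lastname,Maildomain,Password' /tmp/blue/midterm.csv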
You never need sed when you're using awk. All you need to create a header + content is:
awk '
BEGIN { FS=OFS=","; print "Firstname", "Lastname", "Maildomain", "Password" }
{ print $3, $4, $5, $2 }
' "$filename" > "/tmp/$color/midterm.csv"
Or, if your input file isn't a CSV (as it appears not to be, from your updated question):
awk '
BEGIN { OFS=","; print "Firstname", "Lastname", "Maildomain", "Password" }
{ print $3, $4, $5, $2 }
' "$filename" > "/tmp/$color/midterm.csv"

Append data to another column in a CSV if duplicate is found in first column

I have a CSV with data such as:
somename1,value1
somename1,value2
somename1,value3
anothername1,anothervalue1
anothername1,anothervalue2
anothername1,anothervalue3
I would like to rewrite the CSV so that, when a duplicate in column 1 is found, the data is appended as a new column on the first entry.
For instance, the desired output would be :
somename1,value1,value2,value3
anothername1,anothervalue1,anothervalue2,anothervalue3
How can I do this in a shell script?
TIA
You need much more than just removing duplicated lines when using awk; you need logic, as below, to create an array of elements for each unique entry in $1.
The solution creates a hash map with the unique values of $1 working as indices of the array, and the elements as the values, appended with a , separator.
awk 'BEGIN{FS=OFS=","; prev="";}{ if (prev != $1) {unique[$1]=$2;} else {unique[$1]=(unique[$1]","$2)} prev=$1; }END{for (i in unique) print i,unique[i]}' file
anothername1,anothervalue1,anothervalue2,anothervalue3
somename1,value1,value2,value3
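One caveat: for (i in unique) makes no ordering guarantee, which is why anothername1 comes out first above. If the output must follow the input order, record the order of first appearance (a sketch):
awk 'BEGIN{FS=OFS=","}
!($1 in unique){order[++n]=$1; unique[$1]=$2; next}
{unique[$1]=unique[$1] OFS $2}
END{for(j=1;j<=n;j++) print order[j], unique[order[j]]}' file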
A more readable version of the one-liner would be something like:
BEGIN {
# set input and output field separator to ',' and initialize
# variable holding last instance of $1 to empty
FS=OFS=","
prev=""
}
{
# Update the value of $2 directly in the hash array only when new
# unique elements are found in $1
if (prev != $1){
unique[$1]=$2
}
else {
unique[$1]=(unique[$1]","$2)
}
# Update the current $1
prev=$1
}
END {
for (i in unique) {
print i,unique[i]
}
}
An alternative approach uses plain shell tools (cut, grep, sort) instead of awk:
FILE=$1
NAMES=$(cut -d',' -f 1 "$FILE" | sort -u)
for NAME in $NAMES; do
echo -n "$NAME"
VALUES=$(grep "^${NAME}," "$FILE" | cut -d',' -f2)
for VAL in $VALUES; do
echo -n ",$VAL"
done
echo ""
done
Running with your data generates:
>bash script.sh data1.txt
anothername1,anothervalue1,anothervalue2,anothervalue3
somename1,value1,value2,value3
The filename of your data has to be passed as a parameter. Output can be written to a new file by redirecting:
>bash script.sh data1.txt > data_new.txt

Looping through multiline CSV rows in bash

I have the following csv file with 3 columns:
row1value1,row1value2,"row1
multi
line
value"
row2value1,row2value2,"row2
multi
line
value"
Is there a way to loop through its rows like this? (The following does not work; it reads physical lines, not logical rows.)
while read ROW
do
#some code that uses $ROW variable
done < file.csv
Using gnu-awk you can do this using FPAT:
awk -v RS='"\n' -v FPAT='"[^"]*"|[^,]*' '{
print "Record #", NR, " =======>"
for (i=1; i<=NF; i++) {
sub(/^"/, "", $i)
printf "Field # %d, value=[%s]\n", i, $i
}
}' file.csv
Record # 1 =======>
Field # 1, value=[row1value1]
Field # 2, value=[row1value2]
Field # 3, value=[row1
multi
line
value]
Record # 2 =======>
Field # 1, value=[row2value1]
Field # 2, value=[row2value2]
Field # 3, value=[row2
multi
line
value]
However, as I commented above, a dedicated CSV parser in PHP, Perl, or Python will be more robust for this job.
Here is a pure bash solution. The multiline_csv.sh script translates the multiline csv into standard csv by replacing the newline characters between quotes with some replacement string. So the usage is
./multiline_csv.sh CSVFILE SEP
I placed your example CSV in a file called ./multi.csv. Running the command ./multiline_csv.sh ./multi.csv "\n" yielded the following output:
[ericthewry@eric-arch-pc stackoverflow]$ ./multiline_csv.sh ./multi.csv "\n"
r1c2,r1c2,"row1\nmulti\nline\nvalue"
r2c1,r2c2,"row2\nmultiline\nvalue"
This can be easily translated back to the original csv file using printf:
[ericthewry@eric-arch-pc stackoverflow]$ printf "$(./multiline_csv.sh ./multi.csv "\n")\n"
r1c2,r1c2,"row1
multi
line
value"
r2c1,r2c2,"row2
multiline
value"
This might be an Arch-specific quirk of echo/printf (I'm not sure), but you could use some other separator string like ~~~++??//NEWLINE\\??++~~~ that you could sed out if need be.
# multiline_csv.sh
open=0
# Emits 1 if the line leaves a quoted field open, 0 otherwise,
# starting from the state passed as the second argument.
line_is_open(){
    open="$2"
    (printf "%s" "$1" | sed -e "s/\(.\)/\1\n/g") | (while read char; do
        if [[ "$char" = '"' ]]; then
            open=$((($open + 1) % 2))    # toggle on every double quote
        fi
    done && echo $open)
}
cat "$1" | while read ln ; do
    flatline="${ln}"
    open=$(line_is_open "${ln}" $open)
    # keep appending physical lines (joined by the separator $2)
    # until the quotes balance out again
    until [[ "$open" = "0" ]]; do
        if read newln
        then
            flatline="${flatline}$2${newln}"
            open=$(line_is_open "${newln}" $open)
        else
            break
        fi
    done
    echo "${flatline}"
done
Once you've done this translation, you can proceed as you would normally via the while read ROW; do ... done method.
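For instance, with the separator set to a token assumed never to occur in the data (<NL> here is an arbitrary choice):
./multiline_csv.sh file.csv '<NL>' | while IFS= read -r ROW
do
    printf 'row: %s\n' "$ROW"    # one complete logical record per iteration
done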
