How to convert CSV to Excel, adding header rows between different data, using a shell script? - bash
I want to process a CSV file line by line and, whenever table_name changes, add a header row.
Sample CSV:
table_name,no.,data
attribute,column_name,definition,data_type,valid_values,notes
archive_rule,1,ID,id,,int,,
archive_rule,2,EXECUTE SEQ,execute_seq,,int,,
archive_rule,3,ARCHIVE RULE NAME,archive_rule_name,,varchar,,
archive_rule,4,ARCHIVE RULE TABLE NAME,archive_rule_table_name,,varchar,,
archive_rule,5,ARCHIVE RULE PK NAME,archive_rule_pk_name,,varchar,,
archive_rule,6,ARCHIVE BATCH SIZE,archive_batch_size,,int,,
archive_rule,7,ACTIVE STATUS,active_status,,varchar,,
archive_table,1,ID,id,,int,,
archive_table,2,ARCHIVE RULE ID,archive_rule_id,,int,,
archive_table,3,EXECUTE SEQ,execute_seq,,int,,
archive_table,4,ARCHIVE DEPEND TABLE ID,archive_depend_table_id,,int,,
archive_table,5,ARCHIVE DEPEND LEVEL,archive_depend_level,,int,,
archive_table,6,ACTIVE STATUS,active_status,,varchar,,
batch_job,1,BATCH JOB ID,batch_job_id,,int,,
batch_job,2,JOB TYPE,job_type,,varchar,,
batch_job,3,JOB NAME,job_name,,varchar,,
batch_job,4,EXECUTION DATE,execution_date,,timestamp,,
batch_job,5,EXECUTION RESULT,execution_result,,varchar,,
batch_job,6,ERROR MESSAGE,error_message,,varchar,,
batch_job,7,REPORT OUTPUT,report_output,,varchar,,
Desired Result:
Data : archive_rule
no.,data attribute,column_name,definition,data_type,valid_values,notes
1,ID,id,,int,,
2,EXECUTE SEQ,execute_seq,,int,,
3,ARCHIVE RULE NAME,archive_rule_name,,varchar,,
4,ARCHIVE RULE TABLE NAME,archive_rule_table_name,,varchar,,
5,ARCHIVE RULE PK NAME,archive_rule_pk_name,,varchar,,
6,ARCHIVE BATCH SIZE,archive_batch_size,,int,,
...
Data : archive_table
no.,data attribute,column_name,definition,data_type,valid_values,notes
1,ID,id,,int,,
2,ARCHIVE RULE ID,archive_rule_id,,int,,
3,EXECUTE SEQ,execute_seq,,int,,
4,ARCHIVE DEPEND TABLE ID,archive_depend_table_id,,int,,
5,ARCHIVE DEPEND LEVEL,archive_depend_level,,int,,
...
Please help me find a way to get this output.
I can only imagine one way here: read the input file line by line, and use cut to extract the first field. This should do the trick:
#! /bin/bash
# accept both process.sh file and process.sh < file
if [ $# -eq 1 ]
then file="$1"
else file=-
fi
#initialize table name to the empty string
cur=""
# process the input line by line after skipping the header
cat "$file" | tail -n +3 | (
while read -r line                       # loop ends on end of file or error
do
    tab=$( echo "$line" | cut -f 1 -d, ) # extract table name
    if [ "x$tab" != "x$cur" ]
    then
        cur=$tab                         # if a new one, remember it
        echo "Data : $tab"               # and write header
        echo "no.,data attribute,column_name,definition,data_type,valid_values,notes"
    fi
    echo "$line" | cut -f 2- -d,         # copy all except first field
done )
But I would use a true script language like Ruby or Python here...
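As a sketch of that suggestion, here is a minimal Python version of the same logic. It assumes the layout of the sample CSV above, where the header is split over the first two physical lines; the function name and sample data are illustrative only:

```python
import csv

def add_table_headers(lines):
    """Group rows by table_name, emitting a 'Data :' line and a header per table."""
    rows = list(csv.reader(lines))
    # the header is split over the first two physical lines of the sample CSV
    header = ",".join(rows[0][1:]) + " " + ",".join(rows[1])
    out, current = [], None
    for row in rows[2:]:
        if row[0] != current:           # table name changed: emit section header
            current = row[0]
            out.append("Data : " + current)
            out.append(header)
        out.append(",".join(row[1:]))   # drop the table_name field
    return "\n".join(out)

sample = [
    "table_name,no.,data",
    "attribute,column_name,definition,data_type,valid_values,notes",
    "archive_rule,1,ID,id,,int,,",
    "archive_rule,2,EXECUTE SEQ,execute_seq,,int,,",
    "archive_table,1,ID,id,,int,,",
]
print(add_table_headers(sample))
```

Unlike the shell loop, this reads the whole file into memory, which is fine for table-definition files of this size.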
Using awk:
$ awk '
BEGIN { FS=OFS="," } # set field separators
NR==1 { # first record, start building the header
h=$2 OFS $3
next
}
NR==2 { # second record, continue header construct
h=h $0 # record NR==1 ended with a trailing space, so this joins the two parts
next
}
$1!=p { # when the table name changes
print "Data : " $1 # print table name
print h # and header
}
{
for(i=2;i<=NF;i++) # print fields 2->
printf "%s%s",$i,(i==NF?ORS:OFS) # field separator or newline
p=$1 # remember the table name for next record
}' file
Output:
Data : archive_rule
no.,data attribute,column_name,definition,data_type,valid_values,notes
1,ID,id,,int,,
2,EXECUTE SEQ,execute_seq,,int,,
...
Data : archive_table
no.,data attribute,column_name,definition,data_type,valid_values,notes
1,ID,id,,int,,
2,ARCHIVE RULE ID,archive_rule_id,,int,,
...
Data : batch_job
no.,data attribute,column_name,definition,data_type,valid_values,notes
1,BATCH JOB ID,batch_job_id,,int,,
2,JOB TYPE,job_type,,varchar,,
...
Related
Bash script to compare and generate csv datafile
I have two CSV files, data1.csv and data2.csv, with content like this (with headers):

DATA1.csv
Client Name;strnu;addr;fav
MAD01;HDGF;11;V PO
CVOJF01;HHD-;635;V T
LINKO10;DH--JDH;98;V ZZ

DATA2.csv
USER;BINin;TYPE
XXMAD01XXXHDGFXX;11;N
KJDGD;635;M
CVOJF01XXHHD;635;N

Issues:

The values of the 1st and 2nd columns of DATA1.csv appear, surrounded by other characters, in the first column of DATA2.csv. For example, MAD01;HDGF exists in the first column of DATA2 as ***MAD01***HDGF** (* can be alphanumeric and/or symbol characters), and MAD01 and HDGF might not appear in that order in the USER column of DATA2.
The value of strnu in DATA1 is equal to the value of the column BINin in DATA2.
The column fav in DATA1 corresponds to TYPE in DATA2, because V T = M and V PO = N (other values may exist, but we don't need them; for example, line 3 of DATA1 should be ignored).

N.B.: some data may exist in one file but not the other.

My bash script needs to generate a new CSV file that should contain:

The column USER from DATA2
Client Name and strnu from DATA1
BINin from DATA2, only if it equals the corresponding value of strnu in DATA1
TYPE in M/N format, respecting the condition that V T = M and V PO = N

The first thing I tried was using grep to search for lines that exist in both files:

#!/bin/sh
DATA1="${1}"
DATA2="${2}"
for i in $(cat $DATA1 | awk -F";" '{print $1".*"$2}' | sed 1d) ; do
    grep "$i" $DATA2
done

Result:

$ ./script.sh DATA1.csv DATA2.csv
MAD01;HDGF;11;V PO
XXMAD01XXXHDGFXX;11;N
CVOJF01;HHD-;635;V T
LINKO10;DH--JDH;98;V PO

Using grep and awk I could find lines that are present in both DATA1 and DATA2, but it doesn't work for all the lines, and I guess that's because of the - and other special characters present in column 2 of DATA1 (they can be ignored).

I don't know how I can generate a new CSV that would mix the lines present in both files, but the expected generated CSV should look like this:

USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;M
This can be done in a single awk program. This is join.awk:

BEGIN {
    FS = OFS = ";"
    print "USER", "Client Name", "strnu", "BINin", "TYPE"
}
FNR == 1 { next }
NR == FNR {
    strnu[$1] = $2
    next
}
{
    for (client in strnu) {
        strnu_pattern = strnu[client]
        gsub(/-/, "", strnu_pattern)
        if ($1 ~ client && $1 ~ strnu_pattern) {
            print $1, client, strnu[client], $2, $3
            break
        }
    }
}

and then

awk -f join.awk DATA1.csv DATA2.csv

outputs

USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;N
Assumptions/understandings:

ignore lines from DATA1.csv where the fav field is not one of V T or V PO
when matching fields we need to ignore any hyphens in the DATA1.csv fields
when matching fields the strings from DATA1.csv can show up in either order in DATA2.csv
the last line of the expected output should end with 635;N

One awk idea:

awk '
BEGIN { FS=OFS=";"
        print "USER","Client Name","strnu","BINin","TYPE"   # print new header
      }
FNR==1  { next }                                 # skip input headers
FNR==NR { if ($4 == "V PO" || $4 == "V T") {     # only process if fav is one of "V PO" or "V T"
              cnames[FNR]=$1                     # save client name
              strnus[FNR]=$2                     # save strnu
          }
          next
        }
{ for (i in cnames) {                            # loop through array indices
      cname=cnames[i]                            # make copy of client name ...
      strnu=strnus[i]                            # and strnu so that we can ...
      gsub(/-/,"",cname)                         # strip hyphens from both ...
      gsub(/-/,"",strnu)                         # in order to perform the comparisons ...
      if (index($1,cname) && index($1,strnu)) {  # if cname and strnu both exist in $1 then index()>=1 in both cases so ...
          print $1,cnames[i],strnus[i],$2,$3     # print to stdout
          next                                   # we found a match so move on to the next line of input
      }
  }
}
' DATA1.csv DATA2.csv

This generates:

USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;N
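For comparison, here is a Python sketch of the same fuzzy join. Unlike the two awk answers, which copy TYPE from DATA2, this version applies the V T = M / V PO = N mapping the question asks for, so the second output line ends in M as in the question's expected output. The inline data mirrors the question's files; everything else is illustrative:

```python
import csv
import io

# inline copies of the question's DATA1.csv / DATA2.csv
data1 = """Client Name;strnu;addr;fav
MAD01;HDGF;11;V PO
CVOJF01;HHD-;635;V T
LINKO10;DH--JDH;98;V ZZ
"""
data2 = """USER;BINin;TYPE
XXMAD01XXXHDGFXX;11;N
KJDGD;635;M
CVOJF01XXHHD;635;N
"""

fav_map = {"V PO": "N", "V T": "M"}   # the question's TYPE mapping

# keep only DATA1 rows whose fav is V PO or V T (this drops the header too)
clients = [row for row in csv.reader(io.StringIO(data1), delimiter=";")
           if row[3] in fav_map]

result = []
for user, binin, _type in list(csv.reader(io.StringIO(data2), delimiter=";"))[1:]:
    for name, strnu, _addr, fav in clients:
        # hyphens are ignored when matching, per the question
        if name in user and strnu.replace("-", "") in user:
            result.append(";".join([user, name, strnu, binin, fav_map[fav]]))
            break

print("USER;Client Name;strnu;BINin;TYPE")
print("\n".join(result))
```

Like the awk versions, this does a substring match per client, which is quadratic but fine for small files.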
Reading CSV file in Shell Scripting
I am trying to read values from a CSV file dynamically based on the header. Here's what my input files can look like.

File 1:

name,city,age
john,New York,20
jane,London,30

or File 2:

name,age,city,country
john,20,New York,USA
jane,30,London,England

I may not be following the best way to accomplish this, but I tried the following code:

#!/bin/bash
{
    read -r line
    line=`tr ',' ' ' <<< $line`
    while IFS=, read -r `$line`
    do
        echo $name
        echo $city
        echo $age
    done
} < file.txt

I am expecting the above code to read the values of the header as the variable names. I know that the order of columns can differ between input files, but I expect the files to have name, city and age columns. Is this the right approach? If so, what is the fix for the above code, which fails with the error "line 7: name: command not found"?
The issue is caused by the backticks: bash evaluates their contents and replaces the backticks with the output of the command it just evaluated. You can simply use the variable after the read command to achieve what you want:

#!/bin/bash
{
    read -r line
    line=$(tr ',' ' ' <<< $line)
    echo "$line"
    while IFS=, read -r $line ; do
        echo "person: $name -- $city -- $age"
    done
} < file.txt

Some notes on your code:

The backtick syntax is legacy syntax; it is now preferred to use $(...) to evaluate commands. The new syntax is more flexible.
You can enable automatic script failure with set -euo pipefail. This will make your script stop if it encounters an error.
Your code is currently very sensitive to invalid header data: with a file like

n ame,age,city,country
john,20,New York,USA
jane,30,London,England

your script (or rather the version at the beginning of my answer) will run without errors but produce invalid output.
It is also good practice to quote variables to prevent unwanted splitting.

To make it much more robust, you can change it as follows:

#!/bin/bash
set -euo pipefail
# -e and -o pipefail will make the script exit
# in case of command failure (or piped command failure)
# -u will exit in case a variable is undefined
# (in your case, if the header is invalid)
{
    read -r line
    readarray -d, -t header < <(printf "%s" "$line")
    # using an array allows detecting whether one of the header entries
    # contains an invalid character
    # the printf is needed because bash would add a newline to the
    # command input if using a herestring (<<<)
    while IFS=, read -r "${header[@]}" ; do
        echo "$name"
        echo "$city"
        echo "$age"
    done
} < file.txt
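As a point of comparison, Python's csv.DictReader does this header-to-variable mapping natively: each row becomes a dict keyed by the header names, so the column order never matters. A minimal sketch using the question's File 2 as inline data:

```python
import csv
import io

# the question's File 2, where the column order differs from File 1
data = """name,age,city,country
john,20,New York,USA
jane,30,London,England
"""

# DictReader keys each row by the header names, so column order is irrelevant
people = list(csv.DictReader(io.StringIO(data)))
for person in people:
    print(person["name"], person["city"], person["age"])
```

A missing column would surface as a KeyError rather than silently shifting values, which is the failure mode the bash version has to guard against by hand.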
A slightly different approach lets awk handle the field separation and the ordering of the desired output, given either of the input files. The awk below stores the desired output order in the f[] (field) array, set in the BEGIN rule. Then on the first line of each file (FNR==1) the array a[] is deleted and refilled with the headings from the current file. At that point you just loop over the field names in order in the f[] array and output the corresponding field from the current line, e.g.

awk -F, '
BEGIN { f[1]="name"; f[2]="city"; f[3]="age" }   # desired order
FNR==1 {                      # on first line read header
    delete a                  # clear a array
    for (i=1; i<=NF; i++)     # loop over headings
        a[$i] = i             # index by heading, val is field no.
    next                      # skip to next record
}
{
    print ""                  # optional newline between outputs
    for (i=1; i<=3; i++)      # loop over desired field order
        if (f[i] in a)        # validate field in a array
            print $a[f[i]]    # output field value
}
' file1 file2

Example Use/Output

In your case, with the content you show in file1 and file2, running the script over both files produces:

john
New York
20

jane
London
30

john
New York
20

jane
London
30

Both files are read and handled identically despite having different field orderings. Let me know if you have further questions.
If using Bash version ≥ 4.2, it is possible to use an associative array to capture an arbitrary number of fields with their names as keys:

#!/usr/bin/env bash

# Associative array to store column names as keys and their values
declare -A fields
# Array to store column names by index
declare -a column_name
# Array to store a row's values
declare -a line

# Command block consuming the CSV input
{
    # Read first line to capture column names
    IFS=, read -r -a column_name

    # Process records
    while IFS=, read -r -a line; do
        # Store column values under the corresponding field name
        for ((i=0; i<${#column_name[@]}; i++)); do
            # Fill the fields associative array
            fields["${column_name[i]}"]="${line[i]}"
        done
        # Dump fields for debug|demo purposes
        # Processing of each captured value could go here instead
        declare -p fields
    done
} < file.txt

Sample output with file 2:

declare -A fields=([country]="USA" [city]="New York" [age]="20" [name]="john" )
declare -A fields=([country]="England" [city]="London" [age]="30" [name]="jane" )

For older Bash versions, without associative arrays, use indexed column names instead:

#!/usr/bin/env bash

# Array to store column names by index
declare -a column_name
# Array to store values for a line
declare -a value

# Command block consuming the CSV input
{
    # Read first line to capture column names
    IFS=, read -r -a column_name

    # Process records
    while IFS=, read -r -a value; do
        # Print record separator
        printf -- '--------------------------------------------------\n'
        # Print captured field names and values
        for ((i=0; i<"${#column_name[@]}"; i++)); do
            printf '%-18s: %s\n' "${column_name[i]}" "${value[i]}"
        done
    done
} < file.txt

Output:

--------------------------------------------------
name              : john
age               : 20
city              : New York
country           : USA
--------------------------------------------------
name              : jane
age               : 30
city              : London
country           : England
SED not adding a first line to .csv file
I am doing a project for school and my head has gone through 3 walls with how many times I have bashed it. The project is to ask for a name and a color and assign each to a variable, then make a directory named after the color variable in the /tmp directory, create a .csv file with a header, and pull the information from a given .txt file, out of order, adding only select columns.

I have gotten to the point of adding the columns, but no matter what I do I can't get sed to add a header or import the information from the .txt file. As you can see, I have tried multiple ways to modify the file, but I don't know enough yet to make it work.

The input file format is as follows:

1. 734-44-2041 James SMITH jsmith@beltec.us 360-555-4778 360-555-0158

and it should look like:

james,smith,james.smith@beltec.us,734-44-2041-000

(I am assuming that the 3 commas are intended to be 0's at the end.)

This is the code I have so far:

#!/bin/bash
#interactive=
#variables
color=/tmp/$color
csvfile=/tmp/blue/midterm.csv

if [ "$1" == "" ]; then
    echo "you should use the -c or -C flags"
    exit
fi

#adding the -c flag and setting the filename variable
while [ "$1" != "" ]; do
    case $1 in
        -c | -C ) shift
                  filename=$1
                  ;;
        * )       echo "you should use the -c flag"
                  exit 1
    esac
    shift
done

#get user's name
echo "what is your name"
read user_name

#get fav color from user
echo "what is your favorite color"
read color

#make the fav color directory
if [ ! -f /tmp/$color ]; then
    mkdir /tmp/$color
else
    echo "bad luck $user_name"
    exit 1
fi

#cd into the directory
cd /tmp/$color

#make a csv file in /tmp/$color
touch midterm.csv

awk '
BEGIN { FS=OFS=","; print "Firstname","lastname","Maildomain","Password" }
{ print $2,$3,$4,$1 }
' "$filename" > "/tmp/$color/midterm.csv"
sed by default writes its results to standard output. If you need to overwrite the old file, use -i (or better, -i.bak, to keep the previous file version in <filename>.bak). Moreover, if you need to add something only at the beginning of the file, use the following syntax:

sed '1iYOUR_TEXT'
You never need sed when you're using awk. All you need to create a header plus content is:

awk '
BEGIN { FS=OFS=","; print "Firstname", "Lastname", "Maildomain", "Password" }
{ print $3, $4, $5, $2 }
' "$filename" > "/tmp/$color/midterm.csv"

Or, if your input file isn't a CSV (as it seems not to be, per your updated question):

awk '
BEGIN { OFS=","; print "Firstname", "Lastname", "Maildomain", "Password" }
{ print $3, $4, $5, $2 }
' "$filename" > "/tmp/$color/midterm.csv"
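The same header-plus-selected-columns output can be sketched with Python's csv module. This follows the awk answer's field choices (whitespace-split fields 3, 4, 5, 2), not the question's ambiguous james.smith form; the sample line is the one from the question:

```python
import csv
import io

# one sample line in the question's input format:
# index, SSN, first name, last name, email, phones
raw = "1. 734-44-2041 James SMITH jsmith@beltec.us 360-555-4778 360-555-0158"

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Firstname", "Lastname", "Maildomain", "Password"])

fields = raw.split()
# lower-case the names to match the desired "james,smith,..." output
writer.writerow([fields[2].lower(), fields[3].lower(), fields[4], fields[1]])

print(buf.getvalue(), end="")
```

Using csv.writer rather than string concatenation means any field containing a comma would be quoted correctly for free.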
Append data to another column in a CSV if duplicate is found in first column
I have a CSV with data such as:

somename1,value1
somename1,value2
somename1,value3
anothername1,anothervalue1
anothername1,anothervalue2
anothername1,anothervalue3

I would like to rewrite the CSV so that when a duplicate in column 1 is found, the data is appended as a new column on the first entry. For instance, the desired output would be:

somename1,value1,value2,value3
anothername1,anothervalue1,anothervalue2,anothervalue3

How can I do this in a shell script? TIA
You need more than just duplicate-line removal when using awk; you need logic like the one below to build a list of elements for each unique entry in $1. The solution creates a hash map whose indices are the unique values of $1 and whose elements are the values joined with a , separator:

awk 'BEGIN{FS=OFS=","; prev=""}
{
    if (prev != $1) { unique[$1]=$2 }
    else            { unique[$1]=(unique[$1]","$2) }
    prev=$1
}
END{for (i in unique) print i,unique[i]}' file

anothername1,anothervalue1,anothervalue2,anothervalue3
somename1,value1,value2,value3

A more readable version would be:

BEGIN {
    # set input and output field separators to ',' and initialize
    # the variable holding the last instance of $1 to empty
    FS=OFS=","
    prev=""
}
{
    # start a new entry in the hash array when a new unique
    # element is found in $1, otherwise append the value
    if (prev != $1) {
        unique[$1]=$2
    } else {
        unique[$1]=(unique[$1]","$2)
    }
    # remember the current $1
    prev=$1
}
END {
    for (i in unique) {
        print i,unique[i]
    }
}
FILE=$1
NAMES=`cut -d',' -f 1 $FILE | sort -u`
for NAME in $NAMES; do
    echo -n "$NAME"
    VALUES=`grep "$NAME" $FILE | cut -d',' -f2`
    for VAL in $VALUES; do
        echo -n ",$VAL"
    done
    echo ""
done

Running with your data generates:

>bash script.sh data1.txt
anothername1,anothervalue1,anothervalue2,anothervalue3
somename1,value1,value2,value3

The filename of your data has to be passed as a parameter. Output can be written to a new file by redirecting:

>bash script.sh data1.txt > data_new.txt
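For comparison, the same grouping is a few lines of Python using a dict, which is insertion-ordered (Python 3.7+), so the output keeps the input's first-seen order instead of awk's arbitrary for (i in unique) order. The function name and sample lines are illustrative:

```python
import csv

def merge_duplicates(lines):
    groups = {}  # dicts preserve insertion order, so output follows input order
    for name, value in csv.reader(lines):
        groups.setdefault(name, []).append(value)
    return [",".join([name] + values) for name, values in groups.items()]

lines = [
    "somename1,value1",
    "somename1,value2",
    "somename1,value3",
    "anothername1,anothervalue1",
    "anothername1,anothervalue2",
]
print("\n".join(merge_duplicates(lines)))
```

Unlike the prev-based awk version, this also groups duplicates correctly even when they are not adjacent in the input.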
Looping through multiline CSV rows in bash
I have the following csv file with 3 columns:

row1value1,row1value2,"row1
multi line value"
row2value1,row2value2,"row2
multi line value"

Is there a way to loop through its rows like this (the following does not work; it reads physical lines):

while read $ROW
do
    #some code that uses $ROW variable
done < file.csv
Using gnu-awk you can do this using RS and FPAT:

awk -v RS='"\n' -v FPAT='"[^"]*"|[^,]*' '{
    print "Record #", NR, " =======>"
    for (i=1; i<=NF; i++) {
        sub(/^"/, "", $i)
        printf "Field # %d, value=[%s]\n", i, $i
    }
}' file.csv

Record # 1  =======>
Field # 1, value=[row1value1]
Field # 2, value=[row1value2]
Field # 3, value=[row1
multi line value]
Record # 2  =======>
Field # 1, value=[row2value1]
Field # 2, value=[row2value2]
Field # 3, value=[row2
multi line value]

However, as I commented above, a dedicated CSV parser using PHP, Perl or Python will be more robust for this job.
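To illustrate that closing remark: Python's csv module handles quoted embedded newlines natively, so each iteration yields one logical record rather than one physical line. A minimal sketch with the question's data inline:

```python
import csv
import io

# the question's file, with real newlines inside the quoted fields
data = ('row1value1,row1value2,"row1\nmulti line value"\n'
        'row2value1,row2value2,"row2\nmulti line value"\n')

# csv.reader keeps the quoted newline inside the field, so each
# iteration yields one logical record, not one physical line
records = list(csv.reader(io.StringIO(data)))
for record in records:
    print(record)
```

The same applies when reading from a file: pass the file object opened with newline="" to csv.reader, as the Python docs recommend.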
Here is a pure bash solution. The multiline_csv.sh script translates the multiline csv into standard csv by replacing the newline characters between quotes with some replacement string, so the usage is:

./multiline_csv.sh CSVFILE SEP

I placed your example in a file called ./multi.csv. Running ./multiline_csv.sh ./multi.csv "\n" yielded the following output:

[ericthewry@eric-arch-pc stackoverflow]$ ./multiline_csv.sh ./multi.csv "\n"
r1c2,r1c2,"row1\nmulti\nline\nvalue"
r2c1,r2c2,"row2\nmultiline\nvalue"

This can easily be translated back to the original csv file using printf:

[ericthewry@eric-arch-pc stackoverflow]$ printf "$(./multiline_csv.sh ./multi.csv "\n")\n"
r1c2,r1c2,"row1
multi
line
value"
r2c1,r2c2,"row2
multiline
value"

This might be an Arch-specific quirk of echo/printf (I'm not sure), but you could use some other separator string like ~~~++??//NEWLINE\\??++~~~ that you could sed out if need be.

# multiline_csv.sh
open=0
line_is_open(){
    quote="$2"
    (printf "$1" | sed -e "s/\(.\)/\1\n/g") | (while read char; do
        if [[ "$char" = '"' ]]; then
            open=$((($open + 1) % 2))
        fi
    done && echo $open)
}

cat "$1" | while read ln ; do
    flatline="${ln}"
    open=$(line_is_open "${ln}" $open)
    until [[ "$open" = "0" ]]; do
        if read newln
        then
            flatline="${flatline}$2${newln}"
            open=$(line_is_open "${newln}" $open)
        else
            break
        fi
    done
    echo "${flatline}"
done

Once you've done this translation, you can proceed as you would normally via the while read ROW ... done method.