Shell script: copying columns by header in a csv file to another csv file - shell

I have a csv file which I'll be using as input with a format looking like this:
xValue,value1-avg,value1-median,value2-avg,value3-avg,value3-median
1,3,4,20,14,20
The key attributes of the input file are that each "value" will have a variable number of statistics, but the statistic type and "value" will always be separated by a "-". I then want to output the statistics of all the "values" to separate csv files.
The output would then look something like this:
value1.csv
xvalue,value1-avg,value1-median
1,3,4
value2.csv
xvalue,value2-avg
1,20
I've tried finding solutions to this, but all I can find are ways to copy by the column number, not the header name. I need to be able to use the header names to append the associated statistics to each of the output csv files.
Any help is greatly appreciated!
P.S. The output files may have already been written to during previous runs of this script, meaning the code should append to the existing output files.

Untested but should be close:
awk -F, '
NR==1 {
    for (i=2; i<=NF; i++) {
        outfile = $i
        sub(/-.*/, ".csv", outfile)
        outfiles[i] = outfile
    }
}
{
    delete(outstr)
    for (i=2; i<=NF; i++) {
        outfile = outfiles[i]
        outstr[outfile] = outstr[outfile] FS $i
    }
    for (outfile in outstr)
        print $1 outstr[outfile] >> outfile
}
' inFile.csv
Note that deleting a whole array with delete(outstr) is gawk-specific. With other awks you can use split("",outstr) to get the same effect.
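For example, these two statements have the same effect of emptying the array between records; only this one line of the script above needs to change:
delete(outstr)       # gawk and some other awks: clear the whole array in one go
split("", outstr)    # portable alternative: splitting the empty string empties the array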
Note that this appends the output you wanted to existing files, BUT that means you'll get the header line repeated on every execution. If that's an issue, tell us how to know when to generate the header line or not, but the solution I THINK you'll want would look something like this:
awk -F, '
NR==1 {
    for (i=2; i<=NF; i++) {
        outfile = $i
        sub(/-.*/, ".csv", outfile)
        outfiles[i] = outfile
    }
    for (i in outfiles) {
        outfile = outfiles[i]
        exists[outfile] = ( ((getline tmp < outfile) > 0) && (tmp != "") )
        close(outfile)
    }
}
{
    delete(outstr)
    for (i=2; i<=NF; i++) {
        outfile = outfiles[i]
        outstr[outfile] = outstr[outfile] FS $i
    }
    for (outfile in outstr)
        if ( (NR > 1) || !exists[outfile] )
            print $1 outstr[outfile] >> outfile
}
' inFile.csv

Just figure out the name associated with each column and use that mapping to manipulate the columns. If you're trying to do this in awk, you can use associative arrays to map the column names to the column numbers they correspond to. The same goes for ksh93 or bash; and if you're using perl or python or ruby or ... you can do the same.
Or push the column headers into an indexed array to map column numbers to names.
Either way, you then have a list of column headers, which can be manipulated however you need.
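For instance, a minimal sketch of that idea in awk (the column name value1-avg is just taken from the sample header in the question):
awk -F, '
NR==1 {
    for (i=1; i<=NF; i++) colnum[$i] = i   # associative array: header name -> column number
    next
}
{ print $(colnum["value1-avg"]) }          # columns can now be addressed by header name
' inFile.csv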

The solution I have found most useful for this kind of problem is to first retrieve the column numbers using an AWK script (encapsulated in a shell function) and then follow with a cut statement. This technique/strategy turns into a very concise, general and fast solution that can take advantage of co-processing. The non-append case is cleaner, but here is an example that handles the complication of the append you mentioned:
#! /bin/sh
fields() {
    LC_ALL=C awk -F, -v pattern="$1" '{
        j=0; split("", f)
        for (i=1; i<=NF; i++) if ($(i) ~ pattern) f[j++] = i
        if (j) {
            printf("%s", f[0])
            for (i=1; i<j; i++) printf(",%s", f[i])
        }
        exit 0
    }' "$2"
}

cut_fields_with_append() {
    if [ -s "$3" ]
    then
        cut -d, -f `fields "$1" "$2"` "$2" | sed '1 d' >> "$3"
    else
        cut -d, -f `fields "$1" "$2"` "$2" > "$3"
    fi
}
cut_fields_with_append '^[^-]+$|1-' values.csv value1.csv &
cut_fields_with_append '^[^-]+$|2-' values.csv value2.csv &
cut_fields_with_append '^[^-]+$|3-' values.csv value3.csv &
wait
The result is as you would expect:
$ ls
values values.csv
$ cat values.csv
xValue,value1-avg,value1-median,value2-avg,value3-avg,value3-median
1,3,4,20,14,20
$ ./values
$ ls
value1.csv value2.csv value3.csv values values.csv
$ cat value1.csv
xValue,value1-avg,value1-median
1,3,4
$ cat value2.csv
xValue,value2-avg
1,20
$ cat value3.csv
xValue,value3-avg,value3-median
1,14,20
$ ./values
$ cat value1.csv
xValue,value1-avg,value1-median
1,3,4
1,3,4
$ cat value2.csv
xValue,value2-avg
1,20
1,20
$ cat value3.csv
xValue,value3-avg,value3-median
1,14,20
1,14,20
$

Related

printing contents of variable to a specified line in outputfile with sed/awk

I have been working on a script to concatenate multiple csv files into a single, large csv. The csv files contain names of folders and their respective sizes, in a 2-column setup with the format "Size, Projectname".
Example of a single csv file:
49747851728,ODIN
32872934580,_WORK
9721820722,LIBRARY
4855839655,BASELIGHT
1035732096,ARCHIVE
907756578,USERS
123685100,ENV
3682821,SHOTGUN
1879186,SALT
361558,SOFTWARE
486,VFX
128,DNA
For my current test I have 25 similar files, with different numbers in the first column.
I am trying to get this script to do the following:
Read each csv file
For each Project it sees, scan the output file to check whether that Project was already printed to it. If not, print the Projectname
For each file, for each Project, if the Project was found, print the Size to the output csv.
However, I need the Projects to all be on textline 1, comma separated, so I can use this outputfile as input for a javascript graph. The Sizes should be added in the column below their projectname.
My current script:
csv_folder=$(echo "$1" | sed 's/^[ \t]*//;s/\/[ \t]*$//')
csv_allfiles="$csv_folder/*.csv"
csv_outputfile=$csv_folder.csv
echo -n "" > $csv_outputfile
for csv_inputfile in $csv_allfiles; do
    while read line && [[ $line != "" ]]; do
        projectname=$(echo $line | sed 's/^\([^,]*\),//')
        projectfound1=$(cat $csv_outputfile | grep -w $projectname)
        if [[ ! $projectfound1 ]]; then
            textline=1
            sed "${textline}s/$/${projectname}, /" >> $csv_outputfile
            for csv_foundfile in $csv_allfiles; do
                textline=$(echo $textline + 1 | bc )
                projectfound2=$(cat $csv_foundfile | grep -w $projectname)
                projectdata=$(echo $projectfound2 | sed 's/\,.*$//')
                if [[ $projectfound2 ]]; then
                    sed "${textline}s/$/$projectdata, /" >> $csv_outputfile
                fi
            done
        fi
    done < $csv_inputfile
done
My current script finds the right information (projectname, projectdata) and if I just 'echo' those variables, it prints the correct data to a file. However, with echo it only prints in a long list per project. I want it to 'jump back' to line 1 and print the new project at the end of the current line, then run the loop to print data at the end of each next line.
I was thinking this should be possible with sed or awk. sed should have a way of inserting text to a specific line with
sed '{n}s/search/replace/'
where {n} is the line to insert to
awk should be able to do the same thing with something like
awk -v l2="$textline" -v d="$projectdata" 'NR == l2 {print d} {print}' >> $csv_outputfile
However, while replacing the sed commands in the script with
echo $projectname
echo $projectdata
spits out the correct information (so I know my variables are filled correctly), the sed and awk commands tend to spit out the entire contents of their current input csv; not just the line that I want them to.
Pastebin outputs per variant of writing to file
https://pastebin.com/XwxiAqvT - sed output
https://pastebin.com/xfLU6wri - echo, plain output (single column)
https://pastebin.com/wP3BhgY8 - echo, detailed output per variable
https://pastebin.com/5wiuq53n - desired output
As you see, the sed output tends to paste the whole contents of the input csv, making the loop stop after one iteration (since it finds the other Projects after one loop).
So my question is one of these:
How do I make sed / awk behave the way I want it to; i.e. print only the info in my var to the current textline, instead of the whole input csv. Is sed capable of this, printing just one line of variable? Or
Should I output the variables through 'echo' into a temp file, then loop over the temp file to make sed sort the lines the way I want them to? (Bear in mind that more .csv files will be added in the future, I can't just make it loop x times to sort the info)
Is there a way to echo/print text to a specific text line without using sed or awk? Is there a printf option I'm missing? Other thoughts?
Any help would be very much appreciated.
A way to accomplish this transposition is to save the data to an associative array.
In the following example, we use a two-dimensional array to keep track of our data. Because ordering seems to be important, we create a col array and create a new increment whenever we see a new projectname -- this col array ends up being our first index into the data. We also create a row array which we increment whenever we see new data for the current column. The row number is our second index into data. At the end, we print out all the records.
#! /usr/bin/awk -f
BEGIN {
    FS = ","
    OFS = ", "
    rows=0
    cols=0
    head=""
    split("", data)
    split("", row)
    split("", col)
}
!($2 in col) { # new project
    if (head == "")
        head = $2
    else
        head = head OFS $2
    i = col[$2] = cols++
    row[i] = 0
}
{
    i = col[$2]
    j = row[i]++
    data[i,j] = $1
    if (j > rows)
        rows = j
}
END {
    print head
    for (j=0; j<=rows; ++j) {
        if ((0,j) in data)
            x = data[0,j]
        else
            x = ""
        for (i=1; i<cols; ++i) {
            if ((i,j) in data)
                x = x OFS data[i,j]
            else
                x = x OFS
        }
        print x
    }
}
As a bonus, here is a script to reproduce the detailed output from one of your pastebins.
#! /usr/bin/awk -f
BEGIN {
    FS = ","
    split("", data)   # accumulated data for a project
    split("", line)   # keep track of textline for data
    split("", idx)    # index into above to maintain input order
    sz = 0
}
$2 in idx { # have seen this projectname
    i = idx[$2]
    x = ORS "textline = " ++line[i]
    x = x ORS "textdata = " $1
    data[i] = data[i] x
    next
}
{ # new projectname
    i = sz++
    idx[$2] = i
    x = "textline = 1"
    x = x ORS "projectname = " $2
    x = x ORS "textline = 2"
    x = x ORS "projectdata = " $1
    data[i] = x
    line[i] = 2
}
END {
    for (i=0; i<sz; ++i)
        print data[i]
}
Fill parray with project names and array with values, then print them with bash printf. You can choose the column width in the printf command (currently 13 characters, %13s):
#!/bin/bash
declare -i index=0
declare -i pindex=0
while read project; do
    parray[$pindex]=$project
    index=0
    while read; do
        array[$pindex,$index]="$REPLY"
        index+=1
    done <<< $(grep -h "$project" *.csv|cut -d, -f1)
    pindex+=1
done <<< $(cat *.csv|cut -d, -f 2|sort -u)
maxi=$index
maxp=$pindex
for (( pindex=0; $pindex < $maxp ; pindex+=1 )); do
    STR="%13s $STR"
    VAL="$VAL ${parray[$pindex]}"
done
printf "$STR\n" $VAL
for (( index=0; $index < $maxi; index+=1 )); do
    STR=""; VAL=""
    for (( pindex=0; $pindex < $maxp; pindex+=1 )); do
        STR="%13s $STR"
        VAL="$VAL ${array[$pindex,$index]}"
    done
    printf "$STR\n" $VAL
done
If you are OK with the output being sorted by name this one-liner might be of use:
awk 'BEGIN {FS=",";OFS=","} {print $2,$1}' * | sort | uniq
The files have to be in the same directory; if not, a list of files replaces the *. First it exchanges the two fields; awk will take a list of files and do the concatenation. Then sort the lines and print just the unique lines. This depends on the project size always being the same.
The simple one-liner above gives you one line for each project. If you really want to do it all in awk and have awk write the two lines, then the following is needed. There is a second awk at the end that accumulates each column entry in an array and then spits it out at the end:
awk 'BEGIN {FS=","} {print $2,$1}' * | sort | uniq | awk 'BEGIN {n=0}
    {p[n]=$1; s[n++]=$2}
    END {for (i=0;i<n;i++) printf "%s,",p[i]; print "";
         for (i=0;i<n;i++) printf "%s,",s[i]; print ""}'
If you have the rs utility then this can be simplified to
awk 'BEGIN {FS=","} {print $2,$1}' *| sort |uniq | rs -C',' -T

Using awk to iterate unix command nm and sum output through multiple files

I am currently working on a script that will look through the output of nm and sum the values of column $1 using the following
read $filename
nm --demangle --size-sort --radix=d ~/object/$filename | {
awk '{ sum+= $1 } END { print "Total =" sum }'
}
I want to do the following for any number of files, looping through a directory and then outputting a summary of results. I want the result for each file and also the result of adding up the first column across all the files.
I am limited to using just bash and awk.
You need to put the read $filename into a while ... do ... done loop and feed the output of the entire loop to awk.
e.g.
while read filename ; do
    nm ... $filename
done | awk '{print $0} { sum+=$1 } END { print "Total="sum}'
The awk {print $0} will print every input line, so you can see the output for each file.
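If you also want a per-file subtotal as well as the grand total, one option (a sketch, assuming the file names arrive one per line on stdin, as with your original read) is to run awk once per file and let the shell accumulate the overall total:
total=0
while read -r filename; do
    sum=$(nm --demangle --size-sort --radix=d ~/object/"$filename" |
          awk '{ s += $1 } END { print s+0 }')
    echo "$filename: $sum"
    total=$(( total + sum ))
done
echo "Total = $total"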
The bash globstar option enables recursive file matching; you can use a pattern like **/*.txt at the end of the awk command.
$ shopt -s globstar
$ awk '
BEGINFILE {
    c="nm --demangle --size-sort --radix=d \"" FILENAME "\""
    while ((c | getline) > 0) { fs+=$1; ts+=$1; }
    printf "%45s %10'\''d\n", FILENAME, fs
    close(c); fs=0; nextfile
}
END {
    printf "%30s %s\n", " ", "-----------------------------"
    printf "%45s %10'\''d\n", "total", ts
}' **/*filename*

change timestamp in multiple columns to proper date (e.g YYYYMMDD)

I have the following file:
91001440737;1421687191;1421687966;10;true;true;1421816564;;;;;;;;;
91001477235;1422551333;;3;true;true;;;;;;1422789053;;;1422789053;
91001512152;;1423070412;2;true;true;;;;;;1423134381;;;;
91001520460;1421600655;;13;true;true;1421665705;;;;1422443201;;;;;
91001627323;1422724554;;10;true;true;1422939818;;;;;;;;;
91001680088;1421535875;;2;true;true;;;1422680695;;;1421579247;;;;
Some of the columns (like the 2nd and the 3rd, among others) contain timestamps. I would like to change them into proper dates.
I have used the following command line to do so:
cat fic_v1_entier.txt | while read line ; do echo $line\;$(date +%Y/%m/%d) ; done
But the command line is not correct as it gives me this result instead:
91001680088;1421535875;;2;true;true;;;1422680695;;;1421579247;;;;;2015/02/18
As you can see, only the last column has been changed, when I want the 2nd, the 3rd and also other specific columns to be changed.
Any tips are welcomed.
Maybe this can be done easily using awk:
awk -F\; 'BEGIN{OFS=";"}
    { $2 = strftime("%Y/%m/%d",$2)
      $3 = strftime("%Y/%m/%d",$3) }1'
Test
Here only the second and third ($2 and $3) are changed.
$ awk -F\; 'BEGIN {OFS=";"} { $2 = strftime("%Y/%m/%d",$2); $3 = strftime("%Y/%m/%d",$3)}1'
91001440737;2015/01/19;2015/01/19;10;true;true;1421816564;;;;;;;;;
91001477235;2015/01/29;1970/01/01;3;true;true;;;;;;1422789053;;;1422789053;
91001512152;1970/01/01;2015/02/04;2;true;true;;;;;;1423134381;;;;
91001520460;2015/01/18;1970/01/01;13;true;true;1421665705;;;;1422443201;;;;;
91001627323;2015/01/31;1970/01/01;10;true;true;1422939818;;;;;;;;;
91001680088;2015/01/18;1970/01/01;2;true;true;;;1422680695;;;1421579247;;;;
You can for example say:
while IFS=";" read -r f1 f2 f3
do
printf "%s;%s;%s\n" "$f1" $([ -n "$f2" ] && date -d#"$f2" "+%F%T" || echo "") "$f3"
done < file
That is, read every field and apply date to the required ones. To do the same with the rest of the variables you need to say read -r f1 f2 f3 ... fN and apply the same logic.
Note I used the %F%T format, whereas you can say %Y%m%d or whatever you prefer. And to do the conversion I use the expression date -d@timestamp "+format".
Also note you are saying cat file | while ..., whereas while ... < file is more than enough and even better; see the BashFAQ entry "I set variables in a loop that's in a pipeline. Why do they disappear after the loop terminates? Or, why can't I pipe data to read?".
Test
$ while IFS=";" read -r f1 f2 f3; do printf "%s;%s;%s\n" "$f1" $([ -n "$f2" ] && date -d@"$f2" "+%F%T" || echo "") "$f3"; done < file
91001440737;2015-01-1918:06:31;1421687966;10;true;true;1421816564;;;;;;;;;
91001477235;2015-01-2918:08:53;;3;true;true;;;;;;1422789053;;;1422789053;
91001512152;1423070412;2;true;true;;;;;;1423134381;;;;;
91001520460;2015-01-1818:04:15;;13;true;true;1421665705;;;;1422443201;;;;;
91001627323;2015-01-3118:15:54;;10;true;true;1422939818;;;;;;;;;
91001680088;2015-01-1800:04:35;;2;true;true;;;1422680695;;;1421579247;;;;
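To apply the same conversion to every timestamp column with this bash approach, rather than just the second one, a minimal sketch could read each record into an array; note the list of column numbers is an assumption taken from the sample data:
ts_cols=(2 3 7 9 11 12 15)                 # columns that may hold timestamps (assumed from the sample)
while IFS=';' read -r -a f; do
    for c in "${ts_cols[@]}"; do
        i=$(( c - 1 ))                     # bash arrays are 0-based, columns are 1-based
        [ -n "${f[i]:-}" ] && f[i]=$(date -d@"${f[i]}" "+%Y/%m/%d")
    done
    (IFS=';'; printf '%s;\n' "${f[*]}")    # re-join with ';' and restore the trailing ';'
done < file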
Using GNU awk for string functions:
$ cat tst.awk
BEGIN {
    FS=OFS=";"
    split("2 3 7 9 11 12 15",tsFlds,/ /)
}
{
    for (i=1; i in tsFlds; i++) {
        if ($(tsFlds[i]) != "") {
            $(tsFlds[i]) = strftime("%Y/%m/%d",$(tsFlds[i]))
        }
    }
    print
}
$
$ gawk -f tst.awk file
91001440737;2015/01/19;2015/01/19;10;true;true;2015/01/20;;;;;;;;;
91001477235;2015/01/29;;3;true;true;;;;;;2015/02/01;;;2015/02/01;
91001512152;;2015/02/04;2;true;true;;;;;;2015/02/05;;;;
91001520460;2015/01/18;;13;true;true;2015/01/19;;;;2015/01/28;;;;;
91001627323;2015/01/31;;10;true;true;2015/02/02;;;;;;;;;
91001680088;2015/01/17;;2;true;true;;;2015/01/30;;;2015/01/18;;;;
The split() enumerates the fields that can contain timestamps.

BASH parsing and generating MYSQL insert

I have the following text line :
"Field1":"Data1","Field2":"Data2","Field3":"Data3","Field4":"Data4" ...
And I need to generate the following INSERT statement :
INSERT INTO data (Field1,Field2,Field3,Field4 ... ) VALUES(Data1,Data2,Data3,Data4 ... );
Any ideas on how to do it in BASH ?
Thanks in advance!
$ cat file
"Field1":"Data1","Field2":"Data2","Field3":"Data3","Field4":"Data4"
$
$ cat tst.awk
BEGIN { FS="^\"|\"[:,]\"|\"$" }
{
    fields = values = ""
    for (i=2; i<NF; i+=2) {
        fields = fields (i>2 ? "," : "") $i
        values = values (i>2 ? "," : "") $(i+1)
    }
    printf "INSERT INTO data (%s) VALUES(%s);\n", fields, values
}
$
$ awk -f tst.awk file
INSERT INTO data (Field1,Field2,Field3,Field4) VALUES(Data1,Data2,Data3,Data4);
You could try this awk command:
$ cat file
"Field1":"Data1","Field2":"Data2","Field3":"Data3","Field4":"Data4"
$ awk -F'[:"]+' '{s=(NR>1?",":""); fields=fields s $2; data=data s $3} END{printf "INSERT INTO data(%s) VALUES(%s)\n", fields,data}' RS="," file
INSERT INTO data(Field1,Field2,Field3,Field4) VALUES(Data1,Data2,Data3,Data4)
Or a bit more readable
#!/usr/bin/awk -f
BEGIN {
    FS ="[:\"]+";
    RS=",";
}
{
    s=(NR>1?",":"")
    fields=fields s $2
    data=data s $3
}
END {
    printf "INSERT INTO data(%s) VALUES(%s)\n", fields, data
}
Save it in a file named script.awk, and run it like:
./script.awk file
Since you specifically asked for a BASH solution (rather than awk, perl, or python):
data='"Field1":"Data1","Field2":"Data2","Field3":"Data3","Field4":"Data4"'
data=${data//,/$'\n'} # replace comma with new-lines
data=${data//\"/} # remove the quotes
while IFS=':' read -r field item
do
if [[ -n $fields ]]
then
fields="$fields,$field"
items="$items,$item"
else
fields=$field
items=$item
fi
done < <(echo "$data")
stmt="INSERT INTO data ($fields) VALUES($items);"
echo "$stmt"
sed -n 's/$/) VALUES(/
: next
s/"\([^"]*\)":"\([^"]*\)"\(.*\)) VALUES(\(.*\)/\1\3) VALUES(\4,\2/
t next
s/VALUES(,/VALUES(/
s/.*/INSERT INTO data (&)/
p
' YourFile
Assuming there is no " in the data values, nor the literal string ) VALUES( (these could also be handled if needed).

How to print variable value always as last column in CSV file

I have a list of CSV files, and I have to print a variable value (which changes dynamically) as the last column in each CSV file.
Here is the code:
addProgramtypeID () {
    for csv in $1
    do
        file_name="$csv"
        echo $file_name
        f=`echo $file_name | cut -d '_' -f3 | cut -d '.' -f1`
        echo $f
        k=`grep -i $f Program_type.csv | cut -d ',' -f3`
        echo $k
        awk '{ print $0 "," "'"$k"'" }' "$csv" > tempfile && mv tempfile "$csv"
    done
}
addProgramtypeID "T_H_EDCGO.csv"
As of now, the variable value k is being printed at the 1st column of the CSV file, and it is also removing the first 2 characters of the first column in the file. My requirement is that the variable value should always come as the last column in the CSV file.
input :
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID
123,3,334,234,3
545,2,444,456,5
if suppose $k=2
output:
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID,2
123,3,334,234,3,2
545,2,444,456,5,2
Program_type.csv
type,desc,id
EDC,Alb,1
EDG,Gsc,2
Assuming there is nothing nasty in your CSV file, you can use awk as follows:
for csv_file in $ALL_MY_FILES
do
    cat "$csv_file" | awk 'BEGIN{FS=","}; {print($(NF))}'
done
Or even just
cat $ALL_MY_FILES | awk 'BEGIN{FS=","}; {print($(NF))}'
Both of these will print the last column of every line in all the csv files. The results from each CSV are just appended together (is that really what you want?).
The difficulties are on the awk side. It is completely unaware of things like quoted strings
or extra whitespace. My recommendation is to try the line above, see what goes wrong (if anything) and then start tweaking.
It looks like what you want is just:
$ cat tst.sh
addProgramtypeID () {
    csv="$1"
    awk -v csv="$csv" '
        BEGIN { FS=OFS=","; split(csv,csvA,/[_.]/); f=csvA[3] }
        NR==FNR { if ($0 ~ f) { k = $3 }; next }
        { print $0, k }
    ' Program_type.csv "$csv" > tempfile && mv tempfile "$csv"
}
addProgramtypeID "T_H_EDC.csv"
$ cat Program_type.csv
type,desc,id
EDC,Alb,1
EDG,Gsc,2
$ cat T_H_EDC.csv
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID
123,3,334,234,3
545,2,444,456,5
$ ./tst.sh
$ cat T_H_EDC.csv
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID,1
123,3,334,234,3,1
545,2,444,456,5,1
but it's hard to tell since your posted sample input could not produce your posted desired output so I had to make some up.
if ($0 ~ f) should probably just be if ($1 == f); I just copied what your original grep f <file> logic would do.
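That is, the lookup rule above would become (a hypothetical one-line change):
NR==FNR { if ($1 == f) { k = $3 }; next }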
