I have the following text line:
"Field1":"Data1","Field2":"Data2","Field3":"Data3","Field4":"Data4" ...
And I need to generate the following INSERT statement:
INSERT INTO data (Field1,Field2,Field3,Field4 ... ) VALUES(Data1,Data2,Data3,Data4 ... );
Any ideas on how to do it in BASH?
Thanks in advance!
$ cat file
"Field1":"Data1","Field2":"Data2","Field3":"Data3","Field4":"Data4"
$
$ cat tst.awk
BEGIN { FS="^\"|\"[:,]\"|\"$" }
{
fields = values = ""
for (i=2; i<NF; i+=2) {
fields = fields (i>2 ? "," : "") $i
values = values (i>2 ? "," : "") $(i+1)
}
printf "INSERT INTO data (%s) VALUES(%s);\n", fields, values
}
$
$ awk -f tst.awk file
INSERT INTO data (Field1,Field2,Field3,Field4) VALUES(Data1,Data2,Data3,Data4);
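Note that the generated statement leaves the values unquoted; if they are strings, most SQL dialects will want single quotes around them. A minimal variant of the above (an assumption on my part that all values are strings with no embedded single quotes):
$ cat tst2.awk
BEGIN { FS="^\"|\"[:,]\"|\"$" }
{
    fields = values = ""
    for (i=2; i<NF; i+=2) {
        fields = fields (i>2 ? "," : "") $i
        values = values (i>2 ? "," : "") "'" $(i+1) "'"
    }
    printf "INSERT INTO data (%s) VALUES(%s);\n", fields, values
}
$
$ awk -f tst2.awk file
INSERT INTO data (Field1,Field2,Field3,Field4) VALUES('Data1','Data2','Data3','Data4');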
You could try this awk command:
$ cat file
"Field1":"Data1","Field2":"Data2","Field3":"Data3","Field4":"Data4"
$ awk -F'[:"]+' '{s=(NR>1?",":""); fields=fields s $2;data=data s $3}END{printf "INSTERT INTO data(%s) VALUES(%s)\n", fields,data}' RS="," file
INSTERT INTO data(Field1,Field2,Field3,Field4) VALUES(Data1,Data2,Data3,Data4)
Or a bit more readable
#!/usr/bin/awk -f
BEGIN {
    FS="[:\"]+";
    RS=",";
}
{
    s=(NR>1?",":"")
    fields=fields s $2
    data=data s $3
}
END{
    printf "INSERT INTO data(%s) VALUES(%s)\n", fields, data
}
Save it in a file named script.awk, make it executable (chmod +x script.awk), and run it like:
./script.awk file
Since you specifically asked for a BASH solution (rather than awk, perl, or python):
data='"Field1":"Data1","Field2":"Data2","Field3":"Data3","Field4":"Data4"'
data=${data//,/$'\n'} # replace commas with newlines
data=${data//\"/} # remove the quotes
while IFS=':' read -r field item
do
    if [[ -n $fields ]]
    then
        fields="$fields,$field"
        items="$items,$item"
    else
        fields=$field
        items=$item
    fi
done < <(echo "$data")
stmt="INSERT INTO data ($fields) VALUES($items);"
echo "$stmt"
sed -n '
# append ") VALUES(" so field names collect before it and values after it
s/$/) VALUES(/
: next
# move the first "field":"value" pair: the name before ") VALUES(", the value after it
s/"\([^"]*\)":"\([^"]*\)"\(.*\)) VALUES(\(.*\)/\1\3) VALUES(\4,\2/
t next
# drop the leading comma in the value list
s/VALUES(,/VALUES(/
# wrap everything in the INSERT statement
s/.*/INSERT INTO data (&)/
p
' YourFile
This assumes no " appears inside a data value and that no value contains the literal string ) VALUES( (both cases could be handled too if needed).
We are getting a varying length input file as mentioned below. The text is varying length.
Input file:
ID|Text
1|name1=value1;name3;name4=value2;name5=value5
2|name1=value1;name2=value2;name6=;name7=value7;name8=value8
The text here has name=value pairs as its content, and it's of varying length. Please note that the name in the text column can contain a semicolon. We are trying to parse the input, but we are not able to handle it via AWK or BASH.
Desired Output:
1|name1=value1
1|name3;name4=value2
1|name5=value5
2|name1=value1
2|name2=value2
2|name6=
2|name7=value7
2|name8=value8
The snippet of code below works for ID=2, but doesn't for ID=1:
echo "2|name1=value1;name2=value2;name6=;name7=value7;name8=value8" | while IFS="|"; read id text;do dsc=`echo $text|tr ';' '\n'`;echo "$dsc" >tmp;done
cat tmp
2|name1=value1
2|name2=value2
2|name6=
2|name7=value7
2|name8=value8
echo "1|name1=value1;name3;name4=value2;name5=value5" | while IFS="|"; read id text;do dsc=`echo $text|tr ';' '\n'`;echo "$dsc" >tmp;sed -i "s/^/${id}\|/g" tmp;done
cat tmp
1|name1=value1
1|name3
1|name4=value2
1|name5=value5
Any help is greatly appreciated.
Could you please try the following, written and tested with the shown samples in a recent version of GNU awk. If your awk version is old, try changing awk to awk --re-interval.
awk '
BEGIN{
    FS=OFS="|"
}
FNR==1{ next }
{
    first=$1
    while(match($0,/(name[0-9]+;?){1,}=(value[0-9]+)?/)){
        print first,substr($0,RSTART,RLENGTH)
        $0=substr($0,RSTART+RLENGTH)
    }
}' Input_file
Output will be as follows.
1|name1=value1
1|name3;name4=value2
1|name5=value5
2|name1=value1
2|name2=value2
2|name6=
2|name7=value7
2|name8=value8
Explanation: Adding a detailed explanation for the above (the following is for explanation purposes only).
awk ' ##Starting the awk program from here.
BEGIN{ ##Starting the BEGIN section from here.
    FS=OFS="|" ##Setting FS and OFS to | here.
}
FNR==1{ next } ##If this is the first (header) line, skip to the next line without printing.
{
    first=$1 ##Saving the first field (the ID) in the variable first.
    while(match($0,/(name[0-9]+;?){1,}=(value[0-9]+)?/)){ ##Looping while the line still contains a name=value pair; the regex allows semicolon-joined names and an empty value.
        print first,substr($0,RSTART,RLENGTH) ##Printing the ID and the currently matched substring.
        $0=substr($0,RSTART+RLENGTH) ##Keeping only the rest of the line for the next iteration.
    }
}' Input_file ##Mentioning the Input_file name here.
Sample data:
$ cat name.dat
ID|Text
1|name1=value1;name3;name4=value2;name5=value5
2|name1=value1;name2=value2;name6=;name7=value7;name8=value8
One awk solution:
awk -F"[|;]" ' # use "|" and ";" as input field delimiters
FNR==1 { next } # skip header line
{ pfx=$1 "|" # set output prefix to field 1 + "|"
printpfx=1 # set flag to print prefix
for ( i=2 ; i<=NF ; i++ ) # for fields 2 to NF
{
if ( printpfx) { printf "%s", pfx ; printpfx=0 } # if print flag == 1 then print prefix and clear flag
if ( $(i) ~ /=/ ) { printf "%s\n", $(i) ; printpfx=1 } # if current field contains "=" then print it, end this line of output, reset print flag == 1
if ( $(i) !~ /=/ ) { printf "%s;", $(i) } # if current field does not contain "=" then print it and include a ";" suffix
}
}
' name.dat
The above generates:
1|name1=value1
1|name3;name4=value2
1|name5=value5
2|name1=value1
2|name2=value2
2|name6=
2|name7=value7
2|name8=value8
A Bash solution:
#!/usr/bin/env bash
while IFS=\| read -r id text || [ -n "$id" ]; do
    IFS=\; read -r -a kv_arr < <(printf %s "$text")
    printf "$id|%s\\n" "${kv_arr[@]}"
done < <(tail -n +2 a.txt)
A plain POSIX shell solution:
#!/usr/bin/env sh
# Chop the header line from the input file
tail -n +2 a.txt |
# While reading id and text Fields Separated by vertical bar
while IFS=\| read -r id text || [ -n "$id" ]; do
    # Sets the separator to a semicolon
    IFS=\;
    # Print each semicolon separated field formatted on
    # its own line with the ID
    # shellcheck disable=SC2086 # Explicit split on semicolon
    printf "$id|%s\\n" $text
done
Input a.txt:
ID|Text
1|name1=value1;name3;name4=value2;name5=value5
2|name1=value1;name2=value2;name6=;name7=value7;name8=value8
Output:
1|name1=value1
1|name3
1|name4=value2
1|name5=value5
2|name1=value1
2|name2=value2
2|name6=
2|name7=value7
2|name8=value8
You have some good answers and an accepted one already. Here is a much shorter GNU awk command that can also do the job:
awk -F '|' 'NR > 1 {
    for (s=$2; match(s, /([^=]+=[^;]*)(;|$)/, m); s=substr(s, RLENGTH+1))
        print $1 FS m[1]
}' file.txt
1|name1=value1
1|name3;name4=value2
1|name5=value5
2|name1=value1
2|name2=value2
2|name6=
2|name7=value7
2|name8=value8
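Note that the three-argument form of match(), which fills the array m with the capture groups, is a GNU awk extension, hence the GNU awk requirement.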
I have a folder of CSVs that I'd like to merge into one CSV. The CSVs all have the same header save a few exceptions, so I need to rewrite the name of each added column with the filename as a prefix to keep track of which file the column came from. I'd prefer a solution that uses bash rather than converting to a dataframe in Python, etc., as the files are quite big.
head file1.csv file2.csv
==> file1.csv <==
id,max,mean,90
2870316.0,111.77777777777777
2870317.0,63.888888888888886
2870318.0,73.6
2870319.0,83.88888888888889
==> file2.csv <==
ogc_fid,id,_sum
"1","2870316",9.98795110916615
"2","2870317",12.3311055738527
"3","2870318",9.81535963468479
"4","2870319",7.77729743926775
The id column of each file might be in a different "datatype" but in every file the id matches the line number. For example, line 2 is always id 2870316.
Anticipated output:
file1_id,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
I'm not quite sure how to do this, but I think I'd use the paste command at some point. I'm surprised that I couldn't find a similar question on Stack Overflow, but I guess it's not that common to have CSVs with the same id on the same line number.
edit:
I figured out the first part.
paste -d , * > ../rasterjointest.txt achieves what I want, but the header needs to be replaced
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {
fname = FILENAME
sub(/\.[^.]+$/,"",fname)
for (i=1; i<=NF; i++) {
$i = fname "_" $i
}
}
{ row[FNR] = (NR==FNR ? "" : row[FNR] OFS) $0 }
END {
for (rowNr=1; rowNr<=FNR; rowNr++) {
print row[rowNr]
}
}
$ awk -f tst.awk file1.csv file2.csv
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
To use minimal memory in awk, read only the header line of each file in a BEGIN block and let paste stream the data rows:
$ cat tst.awk
BEGIN {
    FS=OFS=","
    for (fileNr=1; fileNr<ARGC; fileNr++) {
        filename = ARGV[fileNr]
        if ( (getline < filename) > 0 ) {    # read just the header line of this file
            fname = filename
            sub(/\.[^.]+$/,"",fname)         # strip the extension
            for (i=1; i<=NF; i++) {
                $i = fname "_" $i            # prefix each header field with the file name
            }
        }
        row = (fileNr==1 ? "" : row OFS) $0
    }
    print row    # print only the combined header; paste supplies the data rows
    exit
}
$ awk -f tst.awk file1.csv file2.csv; paste -d, file1.csv file2.csv | tail -n +2
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
I have circa 400 CSV files with a large amount of data in them in the following format:
As at Date,3/12/2014
Header1,Header2,Header3...
Data1,Data2,Data3...
I want to add a new column (with a header) at the end of the header row, and the date shown in the first row appended to each line where data exists. An example of this would be:
As at Date,3/12/2014
Header1,Header2,Header3,Date
Data1,Data2,Data3,3/12/2014
Data4,Data5,Data6,3/12/2014
...
...
I know I can grab the details from the first row with:
head -q -n 1 *.csv
And I know that I can use sed to insert a header into the CSV file, but I'm just not too sure how to combine this all together.
Any help would be greatly appreciated.
I'd use awk for this
awk '
BEGIN {FS = OFS = ","}
NR == 1 {d = $2}
NR == 2 {$(NF+1) = "Date"}
NR > 2 {$(NF+1) = d}
{print}
' file
which can be "one-liner"ed to
awk -F, -vOFS=, 'NR==1{d=$2};NR==2{$(NF+1)="Date"};NR>2{$(NF+1)=d};1' file
If you want just bash, use
{
IFS=, read -r asat date; echo "$asat,$date"
IFS= read -r line; echo "$line,Date"
while IFS= read -r line; do echo "$line,$date"; done
} < file
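To update the file rather than write to stdout, you could redirect the group's output to a temporary file and move it back, e.g. (file names are just placeholders):
{
IFS=, read -r asat date; echo "$asat,$date"
IFS= read -r line; echo "$line,Date"
while IFS= read -r line; do echo "$line,$date"; done
} < file > file.new && mv file.new file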
Another awk solution, which also handles the header line:
awk -F, '{
    if (FNR==1) {
        mydate=$2
        print
    } else if (FNR==2) {
        print $0 ",Date"
    } else {
        print $0 "," mydate
    }
}' file.csv
I have the following file:
91001440737;1421687191;1421687966;10;true;true;1421816564;;;;;;;;;
91001477235;1422551333;;3;true;true;;;;;;1422789053;;;1422789053;
91001512152;;1423070412;2;true;true;;;;;;1423134381;;;;
91001520460;1421600655;;13;true;true;1421665705;;;;1422443201;;;;;
91001627323;1422724554;;10;true;true;1422939818;;;;;;;;;
91001680088;1421535875;;2;true;true;;;1422680695;;;1421579247;;;;
Some of the columns (like the 2nd and the 3rd one and others) have timestamps. I would like to change them into proper date.
I have used the following command line to do so:
cat fic_v1_entier.txt | while read line ; do echo $line\;$(date +%Y/%m/%d) ; done
But the command line is not correct as it gives me this result instead:
91001680088;1421535875;;2;true;true;;;1422680695;;;1421579247;;;;;2015/02/18
As you can see, only the last column has been changed, when I want the 2nd, the 3rd, and also other specific columns to be changed.
Any tips are welcomed.
Maybe this can be done easily using awk (note that strftime() is a GNU awk extension):
awk -F\; 'BEGIN{OFS=";"}
{ $2 = strftime("%Y/%m/%d",$2)
  $3 = strftime("%Y/%m/%d",$3) }1'
Test
Here only the second and third ($2 and $3) are changed.
$ awk -F\; 'BEGIN {OFS=";"} { $2 = strftime("%Y/%m/%d",$2); $3 = strftime("%Y/%m/%d",$3)}1'
91001440737;2015/01/19;2015/01/19;10;true;true;1421816564;;;;;;;;;
91001477235;2015/01/29;1970/01/01;3;true;true;;;;;;1422789053;;;1422789053;
91001512152;1970/01/01;2015/02/04;2;true;true;;;;;;1423134381;;;;
91001520460;2015/01/18;1970/01/01;13;true;true;1421665705;;;;1422443201;;;;;
91001627323;2015/01/31;1970/01/01;10;true;true;1422939818;;;;;;;;;
91001680088;2015/01/18;1970/01/01;2;true;true;;;1422680695;;;1421579247;;;;
You can for example say:
while IFS=";" read -r f1 f2 f3
do
printf "%s;%s;%s\n" "$f1" $([ -n "$f2" ] && date -d#"$f2" "+%F%T" || echo "") "$f3"
done < file
That is, read every field and apply date to the required ones. To do the same with the rest of the variables you need to say read -r f1 f2 f3 ... fN and apply the same logic.
Note I used the %F%T format, whereas you can say %Y%m%d or whatever you prefer. And to do the conversion I use the expression date -d@timestamp "+format".
Also note you are saying cat file | while ..., whereas while ... < file is more than enough and even better; see the Bash FAQ entry "I set variables in a loop that's in a pipeline. Why do they disappear after the loop terminates? Or, why can't I pipe data to read?".
Test
$ while IFS=";" read -r f1 f2 f3; do printf "%s;%s;%s\n" "$f1" $([ -n "$f2" ] && date -d#"$f2" "+%F%T" || echo "") "$f3"; done < file
91001440737;2015-01-1918:06:31;1421687966;10;true;true;1421816564;;;;;;;;;
91001477235;2015-01-2918:08:53;;3;true;true;;;;;;1422789053;;;1422789053;
91001512152;1423070412;2;true;true;;;;;;1423134381;;;;;
91001520460;2015-01-1818:04:15;;13;true;true;1421665705;;;;1422443201;;;;;
91001627323;2015-01-3118:15:54;;10;true;true;1422939818;;;;;;;;;
91001680088;2015-01-1800:04:35;;2;true;true;;;1422680695;;;1421579247;;;;
Using GNU awk for time functions:
$ cat tst.awk
BEGIN {
    FS=OFS=";"
    split("2 3 7 9 11 12 15",tsFlds,/ /)
}
{
    for (i=1; i in tsFlds; i++) {
        if ($(tsFlds[i]) != "") {
            $(tsFlds[i]) = strftime("%Y/%m/%d",$(tsFlds[i]))
        }
    }
    print
}
$
$ gawk -f tst.awk file
91001440737;2015/01/19;2015/01/19;10;true;true;2015/01/20;;;;;;;;;
91001477235;2015/01/29;;3;true;true;;;;;;2015/02/01;;;2015/02/01;
91001512152;;2015/02/04;2;true;true;;;;;;2015/02/05;;;;
91001520460;2015/01/18;;13;true;true;2015/01/19;;;;2015/01/28;;;;;
91001627323;2015/01/31;;10;true;true;2015/02/02;;;;;;;;;
91001680088;2015/01/17;;2;true;true;;;2015/01/30;;;2015/01/18;;;;
The split() enumerates the fields that can contain timestamps.
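If the set of timestamp columns varies, the list could also be passed in from the shell instead of being hard-coded (a sketch; the variable name ts is made up):
gawk -v ts="2 3 7 9 11 12 15" '
BEGIN { FS=OFS=";"; split(ts,tsFlds," ") }
{
    for (i=1; i in tsFlds; i++)
        if ($(tsFlds[i]) != "")
            $(tsFlds[i]) = strftime("%Y/%m/%d",$(tsFlds[i]))
    print
}' file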
I have a csv file which I'll be using as input with a format looking like this:
xValue,value1-avg,value1-median,value2-avg,value3-avg,value3-median
1,3,4,20,14,20
The key attributes of the input file are that each "value" will have a variable number of statistics, but the statistic type and "value" will always be separated by a "-". I then want to output the statistics of all the "values" to separate csv files.
The output would then look something like this:
value1.csv
xvalue,value1-avg,value1-median
1,3,4
value2.csv
xvalue,value2-avg
1,20
I've tried finding solutions to this, but all I can find are ways to copy by the column number, not the header name. I need to be able to use the header names to append the associated statistics to each of the output csv files.
Any help is greatly appreciated!
P.S. the output file may have already been written to during previous runs of this script, meaning the code should append to the output file
Untested but should be close:
awk -F, '
NR==1 {
    for (i=2;i<=NF;i++) {
        outfile = $i
        sub(/-.*/,".csv",outfile)    # e.g. "value1-avg" -> "value1.csv"
        outfiles[i] = outfile        # remember which output file column i belongs to
    }
}
{
    delete(outstr)
    for (i=2;i<=NF;i++) {
        outfile = outfiles[i]
        outstr[outfile] = outstr[outfile] FS $i    # collect this row's fields per output file
    }
    for (outfile in outstr)
        print $1 outstr[outfile] >> outfile        # prepend the xValue column and append
}
' inFile.csv
Note that deleting a whole array with delete(outstr) is gawk-specific. With other awks you can use split("",outstr) to get the same effect.
Note that this appends the output you wanted to existing files, BUT that means you'll get the header line repeated on every execution. If that's an issue, tell us how to know when to generate the header line or not, but the solution I THINK you'll want would look something like this:
awk -F, '
NR==1 {
    for (i=2;i<=NF;i++) {
        outfile = $i
        sub(/-.*/,".csv",outfile)
        outfiles[i] = outfile
    }
    for (i in outfiles) {
        outfile = outfiles[i]
        # remember whether this output file already exists with content
        exists[outfile] = ( ((getline tmp < outfile) > 0) && (tmp != "") )
        close(outfile)
    }
}
{
    delete(outstr)
    for (i=2;i<=NF;i++) {
        outfile = outfiles[i]
        outstr[outfile] = outstr[outfile] FS $i
    }
    for (outfile in outstr)
        if ( (NR > 1) || !exists[outfile] )    # write the header only to brand-new files
            print $1 outstr[outfile] >> outfile
}
' inFile.csv
Just figure out the name associated with each column and use that mapping to manipulate the columns. If you're trying to do this in awk, you can use associative arrays to store the column names and the rows those correspond to. If you're using ksh93 or bash, you can use associative arrays to store the column names and the rows those correspond to. If you're using perl or python or ruby or ... you can...
Or push the columns into an array to map the numbers to column numbers.
Either way, then you have a list of column headers, which can further be manipulated however you need to.
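For instance, a minimal bash sketch of that mapping (my own illustration, assuming bash 4+ for associative arrays and the inFile.csv of the question):
#!/usr/bin/env bash
# Map each header name to its 1-based column number.
IFS=, read -r -a hdr < inFile.csv
declare -A col
for i in "${!hdr[@]}"; do
    col[${hdr[$i]}]=$(( i + 1 ))    # cut counts fields from 1
done
# Columns can now be looked up by header name:
cut -d, -f "1,${col[value1-avg]},${col[value1-median]}" inFile.csv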
The solution I have found most useful for this kind of problem is to first retrieve the column numbers using an AWK script (encapsulated in a shell function) and then follow with a cut command. This technique/strategy turns into a very concise, general, and fast solution that can take advantage of co-processing. The non-append case is cleaner, but here is an example that handles the complication of the append you mentioned:
#! /bin/sh

fields() {
    LC_ALL=C awk -F, -v pattern="$1" '{
        j=0; split("", f)
        for (i=1; i<=NF; i++) if ($(i) ~ pattern) f[j++] = i
        if (j) {
            printf("%s", f[0])
            for (i=1; i<j; i++) printf(",%s", f[i])
        }
        exit 0
    }' "$2"
}

cut_fields_with_append() {
    if [ -s "$3" ]
    then
        cut -d, -f `fields "$1" "$2"` "$2" | sed '1 d' >> "$3"
    else
        cut -d, -f `fields "$1" "$2"` "$2" > "$3"
    fi
}

cut_fields_with_append '^[^-]+$|1-' values.csv value1.csv &
cut_fields_with_append '^[^-]+$|2-' values.csv value2.csv &
cut_fields_with_append '^[^-]+$|3-' values.csv value3.csv &
wait
The result is as you would expect:
$ ls
values values.csv
$ cat values.csv
xValue,value1-avg,value1-median,value2-avg,value3-avg,value3-median
1,3,4,20,14,20
$ ./values
$ ls
value1.csv value2.csv value3.csv values values.csv
$ cat value1.csv
xValue,value1-avg,value1-median
1,3,4
$ cat value2.csv
xValue,value2-avg
1,20
$ cat value3.csv
xValue,value3-avg,value3-median
1,14,20
$ ./values
$ cat value1.csv
xValue,value1-avg,value1-median
1,3,4
1,3,4
$ cat value2.csv
xValue,value2-avg
1,20
1,20
$ cat value3.csv
xValue,value3-avg,value3-median
1,14,20
1,14,20
$