bash insert additional header and data - bash

I have circa 400 CSV files with a large amount of data in them in the following format:
As at Date,3/12/2014
Header1,Header2,Header3...
Data1,Data2,Data3...
I'm wanting to add a new column (with a header) at the end of the header row, and the date shown in the first row at the end of each line where data exists. An example of this would be:
As at Date,3/12/2014
Header1,Header2,Header3,Date
Data1,Data2,Data3,3/12/2014
Data4,Data5,Data6,3/12/2014
...
...
I know I can grab the details from the first row with:
head -q -n 1 *.csv
And I know that I can use sed to insert a header into the CSV file, but I'm just not too sure how to combine this all together.
Any help would be greatly appreciated.

I'd use awk for this
awk '
BEGIN {FS = OFS = ","}
NR == 1 {d = $2}
NR == 2 {$(NF+1) = "Date"}
NR > 2 {$(NF+1) = d}
{print}
' file
which can be condensed into a one-liner:
awk -F, -vOFS=, 'NR==1{d=$2};NR==2{$(NF+1)="Date"};NR>2{$(NF+1)=d};1' file
If you want just bash, use
{
IFS=, read -r asat date; echo "$asat,$date"
IFS= read -r line; echo "$line,Date"
while IFS= read -r line; do echo "$line,$date"; done
} < file
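Since the question mentions ~400 files, here is one way to apply either version to every CSV in the directory — a minimal sketch using a temp file per input (assumes mktemp is available; try it on copies first):
for f in *.csv; do
    tmp=$(mktemp) &&
    awk -F, -v OFS=, 'NR==1{d=$2} NR==2{$(NF+1)="Date"} NR>2{$(NF+1)=d} 1' "$f" > "$tmp" &&
    mv "$tmp" "$f"           # replace the original only if awk succeeded
done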

Other awk solution:
awk -F, '{
    if (FNR==1) {
        mydate = $2
        print
    } else if (FNR==2) {
        print $0 ",Date"      # header row gets the new column label
    } else {
        print $0 "," mydate
    }
}' file.csv
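If your awk is GNU awk 4.1 or later, you could also let it rewrite all the files in place — a sketch using the -i inplace extension (the script keys on FNR, so the logic resets for each input file):
gawk -i inplace -F, '
FNR==1 { mydate=$2; print; next }
FNR==2 { print $0 ",Date"; next }
       { print $0 "," mydate }
' *.csv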

Related

Use an array created using awk as a variable in another awk script

I am trying to use awk to extract data using a conditional statement containing an array created using another awk script.
The awk script I use for creating the array is as follows:
array=($(awk 'NR>1 { print $1 }' < file.tsv))
Then, to use this array in the other awk script
awk var="${array[@]}" 'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1" && heading[i] in var){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt
However, when I run this, the following error occurs.
awk: fatal: cannot open file 'foo' for reading (No such file or directory)
I've already looked at multiple posts on why this error occurs and on how to correctly implement a shell variable in awk, but none of these have worked so far. However, when removing the shell variable and running the script it does work.
awk 'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1"){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt
I really need that conditional statement but don't know what I am doing wrong with implementing the bash variable in awk and would appreciate some help.
Thx in advance.
That specific error message is because you forgot -v in front of var= (it should be awk -v var=, not just awk var=), but as others have pointed out, you can't set an array variable on the awk command line. Also note that array in your code is a shell array, not an awk array; shell and awk are two completely different tools, each with their own syntax, semantics, scopes, etc.
Here's how to really do what you're trying to do:
array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )
awk -v xyz="${array[*]}" '
BEGIN{ split(xyz,tmp,RS); for (i in tmp) var[tmp[i]] }
... now use `var` as you were trying to ...
'
For example:
$ cat file.tsv
col1 col2
a b c d e
f g h i j
$ cat -T file.tsv
col1^Icol2
a b^Ic d e
f g h^Ii j
$ awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv
a b
f g h
$ array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )
$ awk -v xyz="${array[*]}" '
BEGIN {
split(xyz,tmp,RS)
for (i in tmp) {
var[tmp[i]]
}
for (idx in var) {
print "<" idx ">"
}
}
'
<f g h>
<a b>
It's easier and more efficient to process both files in a single awk:
edit: fixed issues in comment, thanks @EdMorton
awk '
FNR == NR {
    if ( FNR > 1 )
        var[$1]
    next
}
FNR == 1 {
    for (i = 1; i <= NF; i++)
        heading[i] = $i
    next
}
{
    for (i = 2; i <= NF; i++)
        if ( $i == "1" && heading[i] in var) {
            outFile = heading[i] ".txt"
            print ">kmer" (NR-1) "\n" $1 >> (outFile)
            close(outFile)
        }
}
' file.tsv input.txt
You might store the string in a variable, then use the split function to turn it into an array. Consider the following simple example. Let file1.txt content be
A B C
D E F
G H I
and file2.txt content be
1
3
2
then
var1=$(awk '{print $1}' file1.txt)
awk -v var1="$var1" 'BEGIN{split(var1,arr)}{print "First column value in line number",$1,"is",arr[$1]}' file2.txt
gives output
First column value in line number 1 is A
First column value in line number 3 is G
First column value in line number 2 is D
Explanation: I store the output of the 1st awk command, which is then used as the 1st argument to the split function in the 2nd awk command. Disclaimer: this solution assumes all files involved have delimiters compliant with default GNU AWK behavior, i.e. one or more whitespace characters is always the delimiter.
(tested in gawk 4.2.1)

how to find common columns and their records from two files using awk

I have two files:
File 1:
id|name|address|country
1|abc|efg|xyz
2|asd|dfg|uio
File 2(only headers):
id|name|country
Now, I want an output like:
OUTPUT:
id|name|country
1|abc|xyz
2|asd|uio
Basically, I have a user record file (file1) and a header file (file2). Now I want to extract from file1 only those columns whose headers match the ones in the header file.
I want to do this using awk or bash.
I tried using:
awk 'BEGIN { OFS="..."} FNR==NR { a[(FNR"")] = $0; next } { print a[(FNR"")], $0 > "test.txt"}' header.txt file.txt
and have no idea what to do next.
Thank You
The following awk may help you with this.
awk -F"|" 'FNR==NR{for(i=1;i<=NF;i++){a[$i]};next} FNR==1 && FNR!=NR{for(j=1;j<=NF;j++){if($j in a){b[++p]=j}}} {for(o=1;o<=p;o++){printf("%s%s",$b[o],o==p?ORS:OFS)}}' OFS="|" File2 File1
Adding a non-one-liner form of the solution too now.
awk -F"|" '
FNR==NR{
for(i=1;i<=NF;i++){
a[$i]};
next}
FNR==1 && FNR!=NR{
for(j=1;j<=NF;j++){
if($j in a){ b[++p]=j }}
}
{
for(o=1;o<=p;o++){
printf("%s%s",$b[o],o==p?ORS:OFS)}
}
' OFS="|" File2 File1
Edit by Ed Morton: FWIW here's the same script written with normal indenting/spacing and a couple of more meaningful variable names:
BEGIN { FS=OFS="|" }
NR==FNR {
    for (i=1; i<=NF; i++) {
        names[$i]
    }
    next
}
FNR==1 {
    for (i=1; i<=NF; i++) {
        if ($i in names) {
            f[++numFlds] = i
        }
    }
}
{
    for (i=1; i<=numFlds; i++) {
        printf "%s%s", $(f[i]), (i<numFlds ? OFS : ORS)
    }
}
with (lots of) unix pipes as Doug McIlroy intended...
$ function p() { sed 1q "$1" | tr '|' '\n' | cat -n | sort -k2; }
$ cut -d'|' -f"$(join -j2 <(p header) <(p file) | sort -k2n | cut -d' ' -f3 | paste -sd,)" file
id|name|country
1|abc|xyz
2|asd|uio
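To unpack that a bit: p numbers a file's header names and sorts them by name, join -j2 then pairs each common name with its column number in both files, and the tail of the pipeline recovers the data file's column numbers in header-file order. For the sample files the intermediate stages look roughly like this (cat -n whitespace trimmed for readability):
$ p header
3 country
1 id
2 name
$ join -j2 <(p header) <(p file) | sort -k2n | cut -d' ' -f3 | paste -sd,
1,2,4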
Solution using bash > 4:
IFS='|' headers1=($(head -n1 "$file1"))
IFS='|' headers2=($(head -n1 "$file2"))
IFS=$'\n'
# find idxes we want to output, ie. mapping of headers1 to headers2
idx=()
for i in $(seq 0 $((${#headers2[@]}-1))); do
    for j in $(seq 0 $((${#headers1[@]}-1))); do
        if [ "${headers2[$i]}" == "${headers1[$j]}" ]; then
            idx+=($j)
            break
        fi
    done
done
# idx=(0 1 3) for example
# simple join output function from https://stackoverflow.com/questions/1527049/join-elements-of-an-array
join_by() { local IFS="$1"; shift; echo "$*"; }
# first line - output headers
join_by '|' "${headers2[@]}"
isfirst=true
while IFS='|' read -r -a vals; do
    # ignore first (header) line
    if $isfirst; then
        isfirst=false
        continue
    fi
    # filter from the line only the columns with idx indices
    tmp=()
    for i in "${idx[@]}"; do
        tmp+=("${vals[$i]}")
    done
    # join output with '|'
    join_by '|' "${tmp[@]}"
done < "$file1"
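For the question's sample files this prints:
id|name|country
1|abc|xyz
2|asd|uio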
This one respects the order of columns in file1. To demonstrate, I changed the column order in file1:
$ cat file1
id|country|name
The awk:
$ awk '
BEGIN { FS=OFS="|" }
NR==1 {             # file1 header
    n=split($0,a)
    next
}
NR==2 {             # file2 header
    for(i=1;i<=NF;i++)
        b[$i]=i
}
{                   # output part
    for(i=1;i<=n;i++)
        printf "%s%s", $b[a[i]], (i==n?ORS:OFS)
}' file1 file2
id|country|name
1|xyz|abc
2|uio|asd
(Another version, using cut for the output, can be found in the edit revisions.)
This is similar to RavinderSingh13's solution, in that it first reads the headers from the shorter file, and then decides which columns to keep from the longer file based on the headers on the first line of it.
It however does the output differently. Instead of constructing a string, it shifts the wanted columns to the left, over the fields it does not want to include.
BEGIN { FS = OFS = "|" }
# read headers from first file
NR == FNR { for (i = 1; i <= NF; ++i) header[$i]; next }
# mark fields in second file as "selected" if the header corresponds
# to a header in the first file
FNR == 1 {
    for (i = 1; i <= NF; ++i)
        select[i] = ($i in header)
}
{
    skip = 0
    for (i = 1; i <= NF; ++i)
        if (!select[i])          # we don't want this field
            ++skip
        else
            $(i - skip) = $i     # shift wanted field left over skipped ones
    NF -= skip                   # adjust number of fields
    print
}
Running this:
$ mawk -f script.awk file2 file1
id|name|country
1|abc|xyz
2|asd|uio

awk/sed: How to perform nested replace operation with two files?

I'm a complete newbie to bash operations. I have two files, lets call them file A and file B.
In file A I have a line like this:
STRING_TO_BE_SEARCHED = "SOME_STRING_IN_FILE_A"
In file B I also have a similar line where just the string differs, like this:
STRING_TO_BE_SEARCHED = "SOME_STRING_IN_FILE_B"
What I need to do is to find the lines that start with STRING_TO_BE_SEARCHED in both files and replace the corresponding line in file B with the corresponding line in A.
How can I achieve this? Is it possible to do this in a single command?
You can do this using awk. It scans fileA first for the given search string, then uses the stored value to replace the matching line in fileB.
awk -v s='STRING_TO_BE_SEARCHED' 'BEGIN{ FS=OFS=" = " } FNR == NR && $1 == s {
a[$1] = $2; next } $1 in a { $2 = a[$1] } 1' fileA fileB
To save changes into fileB use:
awk -v s='STRING_TO_BE_SEARCHED' 'BEGIN{ FS=OFS=" = " } FNR == NR && $1 == s {
a[$1] = $2; next } $1 in a { $2 = a[$1] } 1' fileA fileB > $$.tmp && mv $$.tmp fileB
If you don't have any special chars, perhaps a two-step sed is easier:
key='STRING_TO_BE_SEARCHED *= *'
val=$(sed -n 's/'"$key"'//p' fileA)
sed -r 's/('"$key"').*/\1'"$val"'/' fileB
STRING_TO_BE_SEARCHED = "SOME_STRING_IN_FILE_A"
You can make the second sed replacement in place by adding the -i option.
1- Store the line in variable lineA
lineA=$(perl -ne 'if(/^\QSTRING_TO_BE_SEARCHED\E/){print;exit}' fileA)
2- Replace line in fileB, old file saved as .BAK
perl -i.BAK -pe 'BEGIN{$line=shift @ARGV}if(/^\QSTRING_TO_BE_SEARCHED\E/){$_="$line\n"}' "$lineA" fileB
or create new file: fileB.new without changing fileB
perl -pe 'BEGIN{$line=shift @ARGV}if(/^\QSTRING_TO_BE_SEARCHED\E/){$_="$line\n"}' "$lineA" fileB > fileB.new
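Either way, a quick grep should confirm the replacement (shown here for the fileB.new variant):
$ grep '^STRING_TO_BE_SEARCHED' fileB.new
STRING_TO_BE_SEARCHED = "SOME_STRING_IN_FILE_A"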

BASH parsing and generating MySQL insert

I have the following text line :
"Field1":"Data1","Field2":"Data2","Field3":"Data3","Field4":"Data4" ...
And I need to generate the following INSERT statement :
INSERT INTO data (Field1,Field2,Field3,Field4 ... ) VALUES(Data1,Data2,Data3,Data4 ... );
Any ideas on how to do it in BASH?
Thanks in advance!
$ cat file
"Field1":"Data1","Field2":"Data2","Field3":"Data3","Field4":"Data4"
$
$ cat tst.awk
BEGIN { FS="^\"|\"[:,]\"|\"$" }
{
fields = values = ""
for (i=2; i<NF; i+=2) {
fields = fields (i>2 ? "," : "") $i
values = values (i>2 ? "," : "") $(i+1)
}
printf "INSERT INTO data (%s) VALUES(%s);\n", fields, values
}
$
$ awk -f tst.awk file
INSERT INTO data (Field1,Field2,Field3,Field4) VALUES(Data1,Data2,Data3,Data4);
You could try this awk command:
$ cat file
"Field1":"Data1","Field2":"Data2","Field3":"Data3","Field4":"Data4"
$ awk -F'[:"]+' '{s=(NR>1?",":""); fields=fields s $2;data=data s $3}END{printf "INSTERT INTO data(%s) VALUES(%s)\n", fields,data}' RS="," file
INSTERT INTO data(Field1,Field2,Field3,Field4) VALUES(Data1,Data2,Data3,Data4)
Or, a bit more readable:
#!/usr/bin/awk -f
BEGIN {
    FS="[:\"]+"
    RS=","
}
{
    s=(NR>1?",":"")
    fields=fields s $2
    data=data s $3
}
END{
    printf "INSERT INTO data(%s) VALUES(%s)\n", fields, data
}
Save it in a file named script.awk, make it executable (chmod +x script.awk), and run it like:
./script.awk file
Since you specifically asked for a BASH solution (rather than awk, perl, or python):
data='"Field1":"Data1","Field2":"Data2","Field3":"Data3","Field4":"Data4"'
data=${data//,/$'\n'} # replace comma with new-lines
data=${data//\"/} # remove the quotes
while IFS=':' read -r field item
do
if [[ -n $fields ]]
then
fields="$fields,$field"
items="$items,$item"
else
fields=$field
items=$item
fi
done < <(echo "$data")
stmt="INSERT INTO data ($fields) VALUES($items);"
echo "$stmt"
sed -n 's/$/) VALUES(/
: next
s/"\([^"]*\)":"\([^"]*\)"\(.*\)) VALUES(\(.*\)/\1\3) VALUES(\4,\2/
t next
s/VALUES(,/VALUES(/
s/.*/INSERT INTO data (&)/
p
' YourFile
This assumes there is no " inside a data value and no ) VALUES( substring (both could also be handled if needed).

Shell script: copying columns by header in a csv file to another csv file

I have a csv file which I'll be using as input with a format looking like this:
xValue,value1-avg,value1-median,value2-avg,value3-avg,value3-median
1,3,4,20,14,20
The key attributes of the input file are that each "value" will have a variable number of statistics, but the statistic type and "value" will always be separated by a "-". I then want to output the statistics of all the "values" to separate csv files.
The output would then look something like this:
value1.csv
xvalue,value1-avg,value1-median
1,3,4
value2.csv
xvalue,value2-avg
1,20
I've tried finding solutions to this, but all I can find are ways to copy by the column number, not the header name. I need to be able to use the header names to append the associated statistics to each of the output csv files.
Any help is greatly appreciated!
P.S. the output file may have already been written to during previous runs of this script, meaning the code should append to the output file
Untested but should be close:
awk -F, '
NR==1 {
    for (i=2;i<=NF;i++) {
        outfile = $i
        sub(/-.*/,".csv",outfile)
        outfiles[i] = outfile
    }
}
{
    delete(outstr)
    for (i=2;i<=NF;i++) {
        outfile = outfiles[i]
        outstr[outfile] = outstr[outfile] FS $i
    }
    for (outfile in outstr)
        print $1 outstr[outfile] >> outfile
}
' inFile.csv
Note that deleting a whole array with delete(outstr) is gawk-specific. With other awks you can use split("",outstr) to get the same effect.
Note that this appends the output you wanted to existing files, BUT that means you'll get the header line repeated on every execution. If that's an issue, tell us how to know when to generate the header line or not, but the solution I THINK you'll want would look something like this:
awk -F, '
NR==1 {
    for (i=2;i<=NF;i++) {
        outfile = $i
        sub(/-.*/,".csv",outfile)
        outfiles[i] = outfile
    }
    for (outfile in outfiles) {
        exists[outfile] = ( ((getline tmp < outfile) > 0) && (tmp != "") )
        close(outfile)
    }
}
{
    delete(outstr)
    for (i=2;i<=NF;i++) {
        outfile = outfiles[i]
        outstr[outfile] = outstr[outfile] FS $i
    }
    for (outfile in outstr)
        if ( (NR > 1) || !exists[outfile] )
            print $1 outstr[outfile] >> outfile
}
' inFile.csv
Just figure out the name associated with each column and use that mapping to manipulate the columns. If you're trying to do this in awk, you can use associative arrays to store the column names and the rows those correspond to. If you're using ksh93 or bash, you can use associative arrays to store the column names and the rows those correspond to. If you're using perl or python or ruby or ... you can...
Or push the columns into an array to map the numbers to column numbers.
Either way, then you have a list of column headers, which can further be manipulated however you need to.
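For instance, here is a minimal awk sketch of that name-to-column mapping (value1-avg is just an example header taken from the input above):
awk -F, -v name="value1-avg" '
NR==1 { for (i=1; i<=NF; i++) colnum[$i] = i; next }   # map header name -> column number
      { print $(colnum[name]) }                        # then pick a column by its name
' inFile.csv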
The solution I have found most useful to this kind of problem is to first retrieve the column number using an AWK script (encapsulated in a shell function) and then follow with a cut statement. This technique/strategy turns into a very concise, general and fast solution that can take advantage of co-processing. The non-append case is cleaner, but here is an example that handles the complication of the append you mentioned:
#! /bin/sh
fields() {
    LC_ALL=C awk -F, -v pattern="$1" '{
        j=0; split("", f)
        for (i=1; i<=NF; i++) if ($(i) ~ pattern) f[j++] = i
        if (j) {
            printf("%s", f[0])
            for (i=1; i<j; i++) printf(",%s", f[i])
        }
        exit 0
    }' "$2"
}
cut_fields_with_append() {
    if [ -s "$3" ]
    then
        cut -d, -f `fields "$1" "$2"` "$2" | sed '1 d' >> "$3"
    else
        cut -d, -f `fields "$1" "$2"` "$2" > "$3"
    fi
}
cut_fields_with_append '^[^-]+$|1-' values.csv value1.csv &
cut_fields_with_append '^[^-]+$|2-' values.csv value2.csv &
cut_fields_with_append '^[^-]+$|3-' values.csv value3.csv &
wait
The result is as you would expect:
$ ls
values values.csv
$ cat values.csv
xValue,value1-avg,value1-median,value2-avg,value3-avg,value3-median
1,3,4,20,14,20
$ ./values
$ ls
value1.csv value2.csv value3.csv values values.csv
$ cat value1.csv
xValue,value1-avg,value1-median
1,3,4
$ cat value2.csv
xValue,value2-avg
1,20
$ cat value3.csv
xValue,value3-avg,value3-median
1,14,20
$ ./values
$ cat value1.csv
xValue,value1-avg,value1-median
1,3,4
1,3,4
$ cat value2.csv
xValue,value2-avg
1,20
1,20
$ cat value3.csv
xValue,value3-avg,value3-median
1,14,20
1,14,20
$
