How to add an input file name to multiple output files in awk? - bash

The question might be trivial. I'm trying to figure out a way to add a part of my input file name to multiple outputs generated by the following awk script.
Script:
zcat "$1" | awk 'BEGIN {
# the number of sequences per file
if (!N) N=10000;
# file prefix
if (!prefix) prefix = "seq";
# file suffix
if (!suffix) suffix = "fa";
# this keeps track of the sequences
count = 0
}
# skip empty lines at the beginning
/^$/ { next; }
# act on fasta header
/^>/ {
if (count % N == 0) {
if (output) close(output)
output = sprintf("%s%07d.%s", prefix, count, suffix)
}
print > output
count ++
next
}
# write the fasta body into the file
{
print >> output
}'
The input in $1 variable is 30_C_283_1_5.9.fa.gz
The output files generated by the script are
myseq0000000.fa, myseq1000000.fa and so on....
I would like the output to be
30_C_283_1_5.9_myseq000000.fa, 30_C_283_1_5.9_myseq100000.fa....
Looking forward to some input on this.

There's a way to direct the output from inside the Awk script:
https://www.gnu.org/software/gawk/manual/html_node/Redirection.html
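One way to get the input file name into the output names is to build the prefix in the shell and hand it to awk with -v. The following is only a sketch under a few assumptions: the helper name split_fasta, the tiny sample FASTA and the batch size of 2 are made up for the demo, and gzip -dc stands in for zcat for portability.

```shell
# Work in a scratch directory so the demo does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample input, named like the file in the question.
printf '>s1\nACGT\n>s2\nGGCC\n>s3\nTTAA\n' | gzip -c > 30_C_283_1_5.9.fa.gz

# split_fasta FILE [N]: hypothetical helper, not part of the original script.
split_fasta() {
    in=$1
    # Strip the directory and the .fa.gz extension: 30_C_283_1_5.9
    base=$(basename "$in" .fa.gz)
    gzip -dc "$in" | awk -v N="${2:-10000}" -v prefix="${base}_myseq" -v suffix="fa" '
        /^$/ { next }                       # skip empty lines
        /^>/ {                              # FASTA header
            if (count % N == 0) {           # start a new output file every N sequences
                if (output) close(output)
                output = sprintf("%s%07d.%s", prefix, count, suffix)
            }
            count++
        }
        { print > output }                  # header and body lines go to the current file
    '
}

split_fasta 30_C_283_1_5.9.fa.gz 2
ls
```

With the real data, calling split_fasta "$1" would keep the default of 10000 sequences per file.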

Related

Retrieve specific values from file

I have a file test.cf containing:
process {
withName : teq {
file = "/path/to/teq-0.20.9.txt"
}
}
process {
withName : cad {
file = "/path/to/cad-4.0.txt"
}
}
process {
withName : sik {
file = "/path/to/sik-20.0.txt"
}
}
I would like to retrieve the value at the end of the file path for teq, cad and sik.
I was first thinking about something like
grep -E 'teq' test.cf
and then keeping only the second row and stripping the repeated part of the line.
But it may be easier to do something like:
for a in test.cf
do
line=$(sed -n '{$a}p' test.cf)
if line=teq
#next line using sed -n?
do print nextline &> teq.txt
else if line=cad
do print nextline &> cad.txt
else if line=sik
do print nextline &> sik.txt
done
(obviously it doesn't work)
EDIT:
output wanted:
teq.txt containing teq-0.20.9, cad.txt containing cad-4.0 and sik.txt containing sik-20.0
Is there a good way to do that? Thank you for your comments
Based on your given sample:
awk '/withName/{close(f); f=$3 ".txt"}
/file/{sub(/.*\//, ""); sub(/\.txt".*/, "");
print > f}' ip.txt
/withName/{close(f); f=$3 ".txt"}: if the line contains withName, save the output filename in f, built from the third field. close() closes any previously opened file
/file/{sub(/.*\//, ""); sub(/\.txt".*/, "");: if the line contains file, remove everything except the required value
print > f: print the modified line, redirected to the filename in f
If the same name can appear more than once, use >> instead of > so that reopening the file after close() appends instead of truncating.
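As a rough, self-contained check of the approach above, the sketch below uses the >> variant. The sample test.cf is shortened, and the second teq entry (teq-0.21.0) is invented purely to show why appending matters.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Shortened sample config; the repeated teq entry is hypothetical.
cat > test.cf <<'EOF'
process {
withName : teq {
file = "/path/to/teq-0.20.9.txt"
}
}
process {
withName : cad {
file = "/path/to/cad-4.0.txt"
}
}
process {
withName : teq {
file = "/path/to/teq-0.21.0.txt"
}
}
EOF

awk '
    /withName/ { close(f); f = $3 ".txt" }      # remember the output file for this block
    /file =/   {
        sub(/.*\//, "")                         # drop everything up to the last slash
        sub(/\.txt".*/, "")                     # drop the .txt" tail
        print >> f                              # append, so a repeated name keeps both values
    }
' test.cf

cat teq.txt
cat cad.txt
```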
Here is a solution in awk:
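awk '/withName/{name=$3} /file =/{print $3 > (name ".txt")}' test.cf
/withName/{name=$3}: when I see a line containing "withName", I save the name from the third field
/file =/{print $3 > (name ".txt")}: when I see a line containing "file =", I print its third field (the quoted path) to a file named after the saved name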

Split CSV into two files based on column matching values in an array in bash / posh

I have an input CSV that I would like to split into two CSV files. If the value of column 4 matches any value in WLTarray, the row should go in output file 1; if it doesn't, it should go in output file 2.
WLTarray:
"22532" "79994" "18809" "21032"
input CSV file:
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
output CSV file1:
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
output CSV file2:
header1,header2,header3,header4,header5,header6,header7,header8
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
I've been looking at awk to filter this (python & perl not an option in my environment) but I think there is probably a much smarter way:
declare -a WLTarray=("22532" "79994" "18809" "21032")
for WLTvalue in "${WLTarray[@]}" #Everything in the WLTarray will go to $filename-WLT.tmp
do
awk -F, '($4=='$WLTvalue'){print}' $filename.tmp >> $filename-WLT.tmp #move the lines to the WLT file
# now filter to remove non matching values? why not just move the rows entirely?
done
With regular awk you can make use of split and substr (to handle double-quote removal for comparison) and split the csv file as you indicate. For example you can use:
awk 'BEGIN { FS=","; s="22532 79994 18809 21032"
split (s,a," ") # split s into array a
for (i in a) # loop over each index in a
b[a[i]]=1 # use value in a as index for b
}
FNR == 1 { # first record, write header to both output files
print $0 > "output1.csv"
print $0 > "output2.csv"
next
}
substr($4,2,length($4)-2) in b { # 4th field w/o quotes in b?
print $0 > "output1.csv" # write to output1.csv
next
}
{ print $0 > "output2.csv" } # otherwise write to output2.csv
' input.csv
Where:
in the BEGIN {...} rule, the field separator (FS) is set to a comma and the string s, which holds the field-4 values destined for output1.csv, is split into the array a; the loop then uses each value in a as an index in array b (to allow a simple (value in b) check);
the first rule applies to the first record of the file (the header line), which is simply written to both output files;
the next rule strips the double quotes surrounding field 4 and checks whether the result is an index in array b. If so, the record is written to output1.csv; otherwise the final rule writes it to output2.csv.
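If the match list lives in the shell rather than in the awk source, it can be passed in with -v. This is a sketch under assumptions: the variable names wlt, list and want are mine, the list is a plain space-separated string (with a bash array, "${WLTarray[*]}" could be passed the same way), and no field contains an embedded comma.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > input.csv <<'EOF'
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
EOF

wlt="22532 79994 18809 21032"

awk -F, -v list="$wlt" '
    BEGIN {
        n = split(list, a, " ")
        # Store each value with its surrounding quotes so it compares directly to $4.
        for (i = 1; i <= n; i++) want["\"" a[i] "\""] = 1
    }
    FNR == 1     { print > "output1.csv"; print > "output2.csv"; next }   # header to both
    ($4 in want) { print > "output1.csv"; next }                          # column 4 is listed
                 { print > "output2.csv" }                                # everything else
' input.csv
```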
Example Input File
$ cat input.csv
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
Resulting Output Files
$ cat output1.csv
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
$ cat output2.csv
header1,header2,header3,header4,header5,header6,header7,header8
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
You can use gawk like this:
test.awk
#!/usr/bin/gawk -f
BEGIN {
split("22532 79994 18809 21032", a)
for(i in a) {
WLTarray[a[i]]
}
FPAT="[^\",]+"
}
NR > 1 {
if ($4 in WLTarray) {
print >> "output1.csv"
} else {
print >> "output2.csv"
}
}
Make it executable and run it like this:
chmod +x test.awk
./test.awk input.csv
Using grep with a filter file as input was the simplest answer:
declare -a WLTarray=("22532" "79994" "18809" "21032")
for WLTvalue in "${WLTarray[@]}"
do
awkstring="'\$4 == "\"\\\"$WLTvalue\\\"\"" {print}'"
eval "awk -F, $awkstring input.csv >> output.WLT.csv"
done
grep -v -x -f output.WLT.csv input.csv > output.NonWLT.csv

Find, Replace, Remove - with in file

I'm currently using this code:
awk 'BEGIN { s = \"{$CNEW}\" } /WORD_MATCH/ { $0 = s; n = 1 } 1; END { if(!n) print s }' filename > new_filename
It finds the line matching WORD_MATCH in a file called filename and replaces it with $CNEW; the result is written to new_filename.
This all works well. But I have an issue where I may want to DELETE the line instead of replacing it.
So I set $CNEW = '', which leaves a blank line in the file but does not actually remove the line.
Is there any way to adapt the awk command to allow the removal of the line?
The total aim is :
If there isn't a line in the file containing WORD_MATCH add one, based on $CNEW
If there is a line in the file containing WORD_MATCH update that line with the new value from $CNEW
If $CNEW = '' then delete the line containing WORD_MATCH.
There will only be one line in the file containing WORD_MATCH
Thanks
awk -v s="$CNEW" '/WORD_MATCH/ { n=1; if (s) $0=s; else next; } 1; END { if(s && !n) print s }' file
How it works
-v s="$CNEW"
This creates s as an awk variable with the value $CNEW. Note that the use of -v neatly eliminates the quoting problems that can occur by trying to define s in a BEGIN block.
/WORD_MATCH/ { n=1; if (s) $0=s; else next; }
If the current line matches WORD_MATCH, then set n to 1. If s is non-empty, then set the current line to s. If not, skip the rest of the commands and start over on the next line.
1
This is cryptic shorthand for print the line.
END { if(s && !n) print s }
At the end of the file, if n is still not 1 and s is non-empty, then print s.
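To see the three cases side by side, here is a small sketch that wraps the same awk program in a shell function. The function name update_line and the sample files are made up for illustration. One caveat: a value that looks like the number 0 would also be treated as empty by the if (s) test.

```shell
# update_line NEWVALUE FILE: hypothetical wrapper around the command above.
update_line() {
    awk -v s="$1" '
        /WORD_MATCH/ { n = 1; if (s) $0 = s; else next }   # replace, or drop when s is empty
        1                                                  # print the (possibly modified) line
        END { if (s && !n) print s }                       # no match seen: append s
    ' "$2"
}

workdir=$(mktemp -d)
printf 'alpha\nWORD_MATCH=old\nbeta\n' > "$workdir/with.txt"
printf 'alpha\nbeta\n' > "$workdir/without.txt"

update_line 'WORD_MATCH=new' "$workdir/with.txt"      # line is replaced
update_line ''               "$workdir/with.txt"      # line is deleted
update_line 'WORD_MATCH=new' "$workdir/without.txt"   # line is appended
```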

How to pipe program output so as to eliminate specific text

I have a program whose terminal output contains a header and a footer. The header ends at the first line containing only '-' characters, and the footer begins at the last line containing a '-'. I would like to pass the output of this program through another program that will cut out the header and footer, leaving only the data. I am not sure what the most efficient way to do this is. The files are roughly 20MB in size. I am running Mac OS X.
You could use awk to do the work. Below is an awk program that I saved in a file named clip.awk.
You can trim a data file like the one you described (here data.txt) like this:
$ cat data.txt | awk -f clip.awk
Here is the program clip.awk:
BEGIN { state = 0; # HEADER
}
# match a line of all ----
/^-+$/ {
if (state == 0)
state = 1; # DATA
else
state = 2; # FOOTER
# Skip to next line
next;
}
# print any line while in DATA section
{ if (state == 1) print }
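Since the question asks for something to pipe the program's output through, the same state machine can also be written inline as a filter. A minimal sketch, assuming the header and footer are each delimited by a line made only of dashes; the function name clip and the sample report are invented for the demo.

```shell
# clip: keep only the lines between the first and second all-dash lines of stdin.
clip() {
    awk '
        /^-+$/ { state = (state == 0) ? 1 : 2; next }   # 0 = header, 1 = data, 2 = footer
        state == 1                                      # print lines while in the data section
    '
}

printf 'Report title\n-----\nrow 1\nrow 2\n-----\nTotal: 2\n' | clip
```

In real use this would look like your_program | clip, where your_program is a placeholder for the command that produces the report.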

Replace or append block of text in file with contents of another file

I have two files:
super.conf
someconfig=23;
second line;
#blockbegin
dynamicconfig=12
dynamicconfig2=1323
#blockend
otherconfig=12;
input.conf
newdynamicconfig=12;
anothernewline=1234;
I want to run a script and have input.conf replace the contents between the #blockbegin and #blockend lines.
I already have this:
sed -i -ne '/^#blockbegin/ {p; r input.conf' -e ':a; n; /#blockend/ {p; b}; ba}; p' super.conf
It works well until I change or remove the #blockend line in super.conf; then the script replaces every line after #blockbegin.
In addition, I want the script to replace the block or, if the block doesn't exist in super.conf, append a new block with the content of input.conf.
This could be done as remove + append, but how do I remove the block using sed or another Unix command?
Though I gotta question the utility of this scheme -- I tend to favor systems that complain loudly when expectations aren't met instead of being more loosey-goosey like this -- I believe the following script will do what you want.
Theory of operation: It reads in everything up-front, and then emits its output all in one fell swoop.
Assuming you name the file injector, call it like injector input.conf super.conf.
#!/usr/bin/awk -f
#
# Expects to be called with two files. First is the content to inject,
# second is the file to inject into.
FNR == 1 {
# This switches from "read replacement content" to "read template"
# at the boundary between reading the first and second files. This
# will of course do something surprising if you pass more than two
# files.
readReplacement = !readReplacement;
}
# Read a line of replacement content.
readReplacement {
rCount++;
replacement[rCount] = $0;
next;
}
# Read a line of template content.
{
tCount++;
template[tCount] = $0;
}
# Note the beginning of the replacement area.
/^#blockbegin$/ {
beginAt = tCount;
}
# Note the end of the replacement area.
/^#blockend$/ {
endAt = tCount;
}
# Finished reading everything. Process it all.
END {
if (beginAt && endAt) {
# Both beginning and ending markers were found; replace what's
# in the middle of them.
emitTemplate(1, beginAt);
emitReplacement();
emitTemplate(endAt, tCount);
} else {
# Didn't find both markers; just append.
emitTemplate(1, tCount);
emitReplacement();
}
}
# Emit the indicated portion of the template to stdout.
function emitTemplate(from, to) {
for (i = from; i <= to; i++) {
print template[i];
}
}
# Emit the replacement text to stdout.
function emitReplacement() {
for (i = 1; i <= rCount; i++) {
print replacement[i];
}
}
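For a quick trial without creating a separate script file, the same read-everything-then-emit idea can be condensed into a shell function. This is a sketch rather than a drop-in replacement: the function name inject is hypothetical, it assumes the replacement file is not empty, and like the script above it appends the replacement without adding markers when either marker is missing.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'someconfig=23;' 'second line;' '#blockbegin' \
    'dynamicconfig=12' 'dynamicconfig2=1323' '#blockend' 'otherconfig=12;' > super.conf
printf '%s\n' 'newdynamicconfig=12;' 'anothernewline=1234;' > input.conf

# inject REPLACEMENT TARGET: print TARGET with the marked block replaced,
# or with REPLACEMENT appended when the markers are not both present.
inject() {
    awk '
        NR == FNR { repl[++r] = $0; next }      # first file: replacement lines
        { tmpl[++t] = $0 }                      # second file: template lines
        /^#blockbegin$/ && !b { b = t }         # first begin marker
        /^#blockend$/ && b && !e { e = t }      # first end marker after it
        END {
            if (b && e) {
                for (i = 1; i <= b; i++) print tmpl[i]
                for (i = 1; i <= r; i++) print repl[i]
                for (i = e; i <= t; i++) print tmpl[i]
            } else {
                for (i = 1; i <= t; i++) print tmpl[i]
                for (i = 1; i <= r; i++) print repl[i]
            }
        }
    ' "$1" "$2"
}

inject input.conf super.conf > super.new
cat super.new
```

Writing to super.new and moving it over super.conf afterwards avoids truncating the file while awk is still reading it.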
I've written a Perl one-liner:
perl -0777lni -e 'BEGIN{open(F,pop(@ARGV))||die;$b="#blockbegin";$e="#blockend";local $/;$d=<F>;close(F);}s|\n$b(.*)$e\n||s;print;print "\n$b\n",$d,"\n$e\n" if eof;' edited.file input.file
Arguments:
edited.file - path to updated file
input.file - path to file with new content of block
The script first deletes the block (if it finds a matching one) and then appends a new block with the new content.
You mean, say:
sed '/^#blockbegin/,/#blockend/d' super.conf
