How to pipe program output so as to eliminate specific text - macOS

I have a program which writes results to the terminal that contain a header and a footer. The header ends when the first line consisting only of '-' characters is encountered, and the footer begins when the last line containing a '-' is encountered. I would like to pass the output of this program through another program that cuts out the header and footer, leaving only the data. I am not sure what the most efficient way to do this is. The files are roughly 20 MB in size. I am running Mac OS X.

You could use awk to do the work. Below is an awk program I wrote, saved in a file named clip.awk.
You can trim a data file like the one you described, say data.txt, like this:
$ cat data.txt | awk -f clip.awk
Here is the program clip.awk:
BEGIN {
    state = 0;  # HEADER
}
# match a line made up entirely of '-' characters
/^-+$/ {
    if (state == 0)
        state = 1;  # DATA
    else
        state = 2;  # FOOTER
    # skip to the next line
    next;
}
# print any line while in the DATA section
{ if (state == 1) print }
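If you'd rather not keep a separate script around, the same state machine collapses to a one-liner (a sketch of the same logic, assuming exactly two separator lines as in your description):
$ awk '/^-+$/ { state++; next } state == 1' data.txt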

Related

(sed/awk) extract values text file and write to csv (no pattern)

I have (several) large text files from which I want to extract some values to create a csv file with all of these values.
My current solution is to have a few different calls to sed from which I save the values and then have a python script in which I combine the data in different files to a single csv file. However, this is quite slow and I want to speed it up.
The file, let's call it my_file_1.txt, has a structure that looks something like this:
lines I don't need
start value 123
lines I don't need
epoch 1
...
lines I don't need
some epoch 18 words
stop value 234
lines I don't need
words start value 345 more words
lines I don't need
epoch 1
...
lines I don't need
epoch 72
stop value 456
...
and I would like to construct something like
file,start,stop,epoch,run
my_file_1.txt,123,234,18,1
my_file_1.txt,345,456,72,2
...
How can I get the results I want? It doesn't have to be sed or awk, as long as I don't need to install something new and it is reasonably fast.
I don't really have any experience with awk. With sed my best guess would be
filename=$1
echo 'file,start,stop,epoch,run' > my_data.csv
sed -n '
s/.*start value \([0-9]\+\).*/'"$filename"',\1,/
h
$!N
/.*epoch \([0-9]\+\).*\n.*stop value\([0-9]\+\)/{s/\2,\1/}
D
T
G
P
' $filename | sed -z 's/,\n/,/' >> my_data.csv
and then deal with not getting the run number. Furthermore, this is not quite correct, as the N will gobble up some "start value" lines, leading to wrong results. It feels like it could be done more easily with awk.
It is similar to 8992158 but I can't use that pattern and I know too little awk to rewrite it.
Solution (Edit)
I was not general enough in my description of the problem, so I changed it up a bit and fixed some inconsistencies.
Awk (Rusty Lemur's answer)
Here I generalised from knowing that the numbers were at the end of the line to using gensub. For this I should have specified the awk version, as gensub() is not available in all awk implementations (it is specific to GNU awk).
BEGIN {
    counter = 1
    OFS = ","  # the output field separator used by the print statement
    print "file", "start", "stop", "epoch", "run"  # print the header line
}
/start value/ {
    startValue = gensub(/.*start value ([0-9]+).*/, "\\1", 1, $0)
}
/epoch/ {
    epoch = gensub(/.*epoch ([0-9]+).*/, "\\1", 1, $0)
}
/stop value/ {
    stopValue = gensub(/.*stop value ([0-9]+).*/, "\\1", 1, $0)
    # we have everything to print our line
    print FILENAME, startValue, stopValue, epoch, counter
    counter = counter + 1
    startValue = ""  # clear variables so they aren't carried into the next iteration
    epoch = ""
}
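Saved as, say, extract.awk (a hypothetical name) and run with gawk, since gensub() is GNU awk only, this produces exactly the desired output from the sample input:
$ gawk -f extract.awk my_file_1.txt
file,start,stop,epoch,run
my_file_1.txt,123,234,18,1
my_file_1.txt,345,456,72,2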
I accepted this answer because it is the most understandable.
Sed (potong's answer)
sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
/^.*start value/{:a;N;/\n.*stop value/!ba;x
s/.*/expr & + 1/e;x;G;F
s/^.*start value (\S+).*\n.*epoch (\S+)\n.*stop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' my_file_1.txt | sed '1!N;s/\n//'
It's not clear how you'd get exactly the output you provided from the input you provided, but this may be what you're trying to do (using any awk in any shell on every Unix box):
$ cat tst.awk
BEGIN {
    OFS = ","
    print "file", "start", "stop", "epoch", "run"
}
{ f[$1] = $NF }
$1 == "stop" {
    print FILENAME, f["start"], f["stop"], f["epoch"], ++run
    delete f
}
$ awk -f tst.awk my_file_1.txt
file,start,stop,epoch,run
my_file_1.txt,123,234,N,1
my_file_1.txt,345,456,M,2
awk's basic structure is:
read a record from the input (by default a record is a line)
evaluate conditions
apply actions
The record is split into fields (by default based on whitespace as the separator).
The fields are referenced by their position, starting at 1. $1 is the first field, $2 is the second.
The last field is referenced by a variable named NF for "number of fields." $NF is the last field, $(NF-1) is the second-to-last field, etc.
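For example, using a line from the sample data above (a quick check you can run in any shell):
$ echo "start value 123" | awk '{ print $1, $NF }'
start 123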
A "BEGIN" section will be executed before any input file is read, and it can be used to initialize variables (which are implicitly initialized to 0).
BEGIN {
    counter = 1
    OFS = ","  # the output field separator used by the print statement
    print "file", "start", "stop", "epoch", "run"  # print the header line
}
/start value/ {
    startValue = $NF  # when a line contains "start value", store the last field as startValue
}
/epoch/ {
    epoch = $NF
}
/stop value/ {
    stopValue = $NF
    # we have everything to print our line
    print FILENAME, startValue, stopValue, epoch, counter
    counter = counter + 1
    startValue = ""  # clear variables so they aren't carried into the next iteration
    epoch = ""
}
Save that as processor.awk and invoke as:
awk -f processor.awk my_file_1.txt my_file_2.txt my_file_3.txt > output.csv
This might work for you (GNU sed):
sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
/^start value/{:a;N;/\nstop value/!ba;x
s/.*/expr & + 1/e;x;G;F
s/^start value (\S+).*\nepoch (\S+)\nstop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' file |
sed '1!N;s/\n//'
The solution contains two invocations of sed: the first formats all but the file name, and the second embeds the file name into the csv file.
Format the header line on the first line and prime the run number.
Gather up lines between start value and stop value.
Increment the run number, append it to the current line and output the file name. This prints two lines per record, the first is the file name and the second the remainder of the csv file.
In the second sed invocation read two lines at a time (except for the first line) and remove the newline between them, formatting the csv file.

Find, Replace, Remove - within file

I'm currently using this code:
awk 'BEGIN { s = \"{$CNEW}\" } /WORD_MATCH/ { $0 = s; n = 1 } 1; END { if(!n) print s }' filename > new_filename
This finds a match on WORD_MATCH and then replaces that line with $CNEW in a file called filename; the results are written to new_filename.
This all works well. But I have an issue where I may want to DELETE the line instead of replacing it.
So I set $CNEW = '', which works in that I get a blank line in the file, but it does not actually remove the line.
Is there any way to adapt the awk command to allow the removal of the line?
The total aim is:
If there isn't a line in the file containing WORD_MATCH, add one based on $CNEW
If there is a line in the file containing WORD_MATCH, update that line with the new value from $CNEW
If $CNEW = '' then delete the line containing WORD_MATCH.
There will only be one line in the file containing WORD_MATCH.
Thanks
awk -v s="$CNEW" '/WORD_MATCH/ { n=1; if (s) $0=s; else next; } 1; END { if(s && !n) print s }' file
How it works
-v s="$CNEW"
This creates s as an awk variable with the value $CNEW. Note that the use of -v neatly eliminates the quoting problems that can occur by trying to define s in a BEGIN block.
/WORD_MATCH/ { n=1; if (s) $0=s; else next; }
If the current line matches WORD_MATCH, then set n to 1. If s is non-empty, then set the current line to s. If not, skip the rest of the commands and start over on the next line.
1
This is cryptic shorthand for "print the line".
END { if(s && !n) print s }
At the end of the file, if n is still not 1 and s is non-empty, then print s.
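A quick way to exercise all three cases (a sketch; file, new_file, and the $CNEW values are placeholders):
# replace the matching line (or append it if no line matches WORD_MATCH)
CNEW="new setting"
awk -v s="$CNEW" '/WORD_MATCH/ { n=1; if (s) $0=s; else next; } 1; END { if(s && !n) print s }' file > new_file
# delete the matching line
CNEW=""
awk -v s="$CNEW" '/WORD_MATCH/ { n=1; if (s) $0=s; else next; } 1; END { if(s && !n) print s }' file > new_file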

How to add an input file name to multiple output files in awk?

The question might be trivial. I'm trying to figure out a way to add a part of my input file name to multiple outputs generated by the following awk script.
Script:
zcat "$1" | awk '
BEGIN {
    # the number of sequences per file
    if (!N) N = 10000;
    # file prefix
    if (!prefix) prefix = "seq";
    # file suffix
    if (!suffix) suffix = "fa";
    # this keeps track of the sequences
    count = 0
}
# skip empty lines at the beginning
/^$/ { next; }
# act on a fasta header
/^>/ {
    if (count % N == 0) {
        if (output) close(output)
        output = sprintf("%s%07d.%s", prefix, count, suffix)
    }
    print > output
    count++
    next
}
# write the fasta body into the file
{
    print >> output
}'
The input in the $1 variable is 30_C_283_1_5.9.fa.gz
The output files generated by the script are
myseq0000000.fa, myseq1000000.fa and so on....
I would like the output to be
30_C_283_1_5.9_myseq000000.fa, 30_C_283_1_5.9_myseq100000.fa....
Looking forward to some input in this regard.
There's a way to direct the output from inside the Awk script:
https://www.gnu.org/software/gawk/manual/html_node/Redirection.html
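One way to get there without touching the script's redirection logic (a sketch, untested; split_fasta.awk is a hypothetical file holding the awk body above, and base is just a helper variable): derive the prefix from the input file name in the shell and pass it in with -v, which the script already honors through its if (!prefix) guard.
base=$(basename "$1" .fa.gz)   # yields 30_C_283_1_5.9
zcat "$1" | awk -v prefix="${base}_myseq" -f split_fasta.awk
This produces names like 30_C_283_1_5.9_myseq0000000.fa.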

Shell script to Preprocess Fortran RECORDs to TYPEs?

I'm converting some old F77 code to compile under gfortran. I have a bunch of RECORDS used in the following manner:
RECORD /TEST/ this
this.field = 1
this.otherfield.sumthin = 2
func = func(%val(this.field,foo.bar,this.other.field))
I am trying to convert these all to TYPEs as such:
TYPE(TEST) this
this%field = 1
this%otherfield%sumthin = 2
func = func(%val(this%field,foo.bar,this%other%field))
I'm just OK with sed, and I can process the files to replace the RECORD declarations with TYPE declarations, but is there a way to write a preprocessing script using Linux tools to convert the this.field notation to this%field notation? I believe I would need something that can recognize the declared record name and target it specifically, to avoid borking other variables by accident. Also, any idea how I can deal with included files? I feel like that could get pretty messy, but if anyone has done something similar it would be good to include in a solution.
Edit:
I have Python 2.4 available to me.
You could use Python for that. The following script reads text from stdin and writes it to stdout with the replacement you asked for:
import re
import sys

txt = sys.stdin.read()
# collect the record variable names declared via RECORD /TEST/ <name>
names = re.findall(r"RECORD /TEST/\s*\b(.+)\b", txt, re.MULTILINE)
for name in list(set(names)):
    # (?m) turns on MULTILINE inline; re.sub's fourth positional
    # argument is a count, not flags, so it must not go there
    txt = re.sub(r"(?m)\b%s\.(.*)\b" % name, r"%s%%\1" % name, txt)
sys.stdout.write(txt)
EDIT: As for Python 2.4: yes, format() should be replaced with the % operator. As for structures with subfields, one can easily achieve that by using a function in the sub() call as below. I also added case insensitivity:
import re
import sys

def replace(match):
    # rewrite every '.' in the matched reference to '%'
    return match.group(0).replace(".", "%")

txt = sys.stdin.read()
names = re.findall(r"RECORD /TEST/\s*\b(.+)\b", txt, re.MULTILINE)
for name in names:
    # (?mi) sets MULTILINE and IGNORECASE inline, again because
    # re.sub's fourth positional argument is a count, not flags
    txt = re.sub(r"(?mi)\b%s(\.\w+)+\b" % name, replace, txt)
sys.stdout.write(txt)
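The script filters stdin to stdout, so a typical invocation looks like this (convert.py and the .f file names are placeholders):
$ python convert.py < old_source.f > new_source.f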
With GNU awk:
$ cat tst.awk
/RECORD/ { $0 = gensub(/[^/]+[/]([^/]+)[/]/,"TYPE(\\1)",""); name = tolower($NF) }
{
    while ( match(tolower($0), "\\<" name "[.][[:alnum:]_.]+") ) {
        $0 = substr($0, 1, RSTART-1) \
             gensub(/[.]/, "%", "g", substr($0, RSTART, RLENGTH)) \
             substr($0, RSTART+RLENGTH)
    }
}
{ print }
$ cat file
RECORD /TEST/ tHiS
this.field = 1
THIS.otherfield.sumthin = 2
func = func(%val(ThIs.field,foo.bar,this.other.field))
$ awk -f tst.awk file
TYPE(TEST) tHiS
this%field = 1
THIS%otherfield%sumthin = 2
func = func(%val(ThIs%field,foo.bar,this%other%field))
Note that I modified your input to show what would happen with multiple occurrences of this.field on one line and mixed in with other "." references (foo.bar). I also added some mixed-case occurrences of "this" to show how that works.
In response to the question below about how to handle included files, here's one way:
This script will not only expand all the lines that say "include subfile": by writing the result to a tmp file, resetting ARGV[1] (the highest-level input file) and leaving ARGV[2] (the tmp file) alone, it also lets awk do its normal record parsing on the result of the expansion, since that's now stored in the tmp file. If you don't need that, just print to stdout and remove any other references to a tmp file or ARGV[2].
awk 'function read(file) {
    while ( (getline < file) > 0 ) {
        if ($1 == "include") {
            read($2)
        } else {
            print > ARGV[2]
        }
    }
    close(file)
}
BEGIN {
    read(ARGV[1])
    ARGV[1] = ""
    close(ARGV[2])
}1' a.txt tmp
The result of running the above given these 3 files in the current directory:
a.txt           b.txt           c.txt
-----           -----           -----
1               3               5
2               4               6
include b.txt   include c.txt
                9               7
                10              8
would be to print the numbers 1 through 10 and save them in a file named "tmp".
So for this application you could replace the number "1" at the end of the above script with the contents of the first script posted above and it'd work on the tmp file that now includes the contents of the expanded files.
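Spliced together, that combination would look like this (a sketch; main.f and tmp are placeholder names for the top-level source file and the scratch file):
awk '
function read(file) {
    while ( (getline < file) > 0 ) {
        if ($1 == "include") { read($2) } else { print > ARGV[2] }
    }
    close(file)
}
BEGIN { read(ARGV[1]); ARGV[1] = ""; close(ARGV[2]) }
/RECORD/ { $0 = gensub(/[^/]+[/]([^/]+)[/]/,"TYPE(\\1)",""); name = tolower($NF) }
{
    while ( match(tolower($0), "\\<" name "[.][[:alnum:]_.]+") ) {
        $0 = substr($0, 1, RSTART-1) \
             gensub(/[.]/, "%", "g", substr($0, RSTART, RLENGTH)) \
             substr($0, RSTART+RLENGTH)
    }
}
{ print }
' main.f tmp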

Replace or append block of text in file with content of another file

I have two files:
super.conf
someconfig=23;
second line;
#blockbegin
dynamicconfig=12
dynamicconfig2=1323
#blockend
otherconfig=12;
input.conf
newdynamicconfig=12;
anothernewline=1234;
I want to run a script and have input.conf replace the contents between the #blockbegin and #blockend lines.
I already have this:
sed -i -ne '/^#blockbegin/ {p; r input.conf' -e ':a; n; /#blockend/ {p; b}; ba}; p' super.conf
It works well, but if I change or remove the #blockend line in super.conf, the script replaces all lines after #blockbegin.
In addition, I want the script to replace the block or, if the block doesn't exist in super.conf, append a new block with the content of input.conf to super.conf.
That can be accomplished by remove + append, but how do I remove the block using sed or another Unix command?
Though I gotta question the utility of this scheme -- I tend to favor systems that complain loudly when expectations aren't met instead of being more loosey-goosey like this -- I believe the following script will do what you want.
Theory of operation: It reads in everything up-front, and then emits its output all in one fell swoop.
Assuming you name the file injector, call it like injector input.conf super.conf.
#!/usr/bin/env awk -f
#
# Expects to be called with two files. First is the content to inject,
# second is the file to inject into.
FNR == 1 {
    # This switches from "read replacement content" to "read template"
    # at the boundary between reading the first and second files. This
    # will of course do something surprising if you pass more than two
    # files.
    readReplacement = !readReplacement;
}

# Read a line of replacement content.
readReplacement {
    rCount++;
    replacement[rCount] = $0;
    next;
}

# Read a line of template content.
{
    tCount++;
    template[tCount] = $0;
}

# Note the beginning of the replacement area.
/^#blockbegin$/ {
    beginAt = tCount;
}

# Note the end of the replacement area.
/^#blockend$/ {
    endAt = tCount;
}

# Finished reading everything. Process it all.
END {
    if (beginAt && endAt) {
        # Both beginning and ending markers were found; replace what's
        # in the middle of them.
        emitTemplate(1, beginAt);
        emitReplacement();
        emitTemplate(endAt, tCount);
    } else {
        # Didn't find both markers; just append.
        emitTemplate(1, tCount);
        emitReplacement();
    }
}

# Emit the indicated portion of the template to stdout.
function emitTemplate(from, to) {
    for (i = from; i <= to; i++) {
        print template[i];
    }
}

# Emit the replacement text to stdout.
function emitReplacement() {
    for (i = 1; i <= rCount; i++) {
        print replacement[i];
    }
}
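Since the script writes to stdout, a full round trip looks like this (merged.conf is a placeholder name for the scratch output):
$ ./injector input.conf super.conf > merged.conf
$ mv merged.conf super.conf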
I've written a perl one-liner:
perl -0777lni -e 'BEGIN{open(F,pop(@ARGV))||die;$b="#blockbegin";$e="#blockend";local $/;$d=<F>;close(F);}s|\n$b(.*)$e\n||s;print;print "\n$b\n",$d,"\n$e\n" if eof;' edited.file input.file
Arguments:
edited.file - path to the file being updated
input.file - path to the file with the new block content
The script first deletes the block (if it finds a matching one) and then appends a new block with the new content.
You mean something like:
sed '/^#blockbegin/,/#blockend/d' super.conf
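Pairing that delete with an append gives the remove-then-append approach mentioned in the question (a sketch; tmp.conf is a scratch file):
sed '/^#blockbegin/,/^#blockend/d' super.conf > tmp.conf
{ echo '#blockbegin'; cat input.conf; echo '#blockend'; } >> tmp.conf
mv tmp.conf super.conf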
