Shell script to Preprocess Fortran RECORDs to TYPEs? - bash

I'm converting some old F77 code to compile under gfortran. I have a bunch of RECORDS used in the following manner:
RecoRD /TEST/ this
this.field = 1
this.otherfield.sumthin = 2
func = func(%val(ThIs.field,foo.bar,this.other.field))
I am trying to convert these all to TYPEs as such:
TYPE(TEST) this
this%field = 1
this%otherfield%sumthin = 2
func = func(%val(ThIs%field,foo.bar,this%other%field))
I'm just OK with sed, and I can process the files to replace the RECORD declarations with TYPE declarations, but is there a way to write a preprocessing script using Linux tools to convert the this.field notation to this%field notation? I believe I would need something that can recognize the declared record name and target it specifically, to avoid borking other variables by accident. Also, any idea how I can deal with included files? I feel like that could get pretty messy, but if anyone has done something similar it would be good to include in a solution.
Edit:
I have Python 2.4 available to me.

You could use Python for that. The following script reads the text from stdin and writes it to stdout with the replacement you asked for:
import re
import sys

txt = sys.stdin.read()
# collect every record variable declared with RECORD /TEST/
names = re.findall(r"RECORD /TEST/\s*\b(.+)\b", txt)
for name in set(names):
    # re.sub() takes no flags argument in Python 2.4, and none are
    # needed here: \b works the same with or without MULTILINE
    txt = re.sub(r"\b%s\.(.*)\b" % name, r"%s%%\1" % name, txt)
sys.stdout.write(txt)
EDIT: As for Python 2.4: yes, format should be replaced with the % operator. As for structures with subfields, you can handle those by passing a function to sub(), as below. I also made the matching case-insensitive:
import re
import sys

def replace(match):
    # rewrite every "." in the matched reference to "%"
    return match.group(0).replace(".", "%")

txt = sys.stdin.read()
names = re.findall(r"RECORD /TEST/\s*\b(.+)\b", txt, re.IGNORECASE)
for name in names:
    # compile with flags; Python 2.4's re.sub() has no flags parameter
    pat = re.compile(r"\b%s(\.\w+)+\b" % name, re.IGNORECASE)
    txt = pat.sub(replace, txt)
sys.stdout.write(txt)
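If it helps, here's a quick way to try the script above from a shell. The file name rec2type.py is arbitrary, the sample lines are the question's own input, and the script also runs unchanged under python3:

```shell
#!/bin/sh
# Save the answer's script (name is arbitrary) and feed Fortran source
# on stdin; the printf lines reproduce the question's sample input.
cat > rec2type.py <<'EOF'
import re
import sys

def replace(match):
    return match.group(0).replace(".", "%")

txt = sys.stdin.read()
names = re.findall(r"RECORD /TEST/\s*\b(.+)\b", txt, re.IGNORECASE)
for name in names:
    pat = re.compile(r"\b%s(\.\w+)+\b" % name, re.IGNORECASE)
    txt = pat.sub(replace, txt)
sys.stdout.write(txt)
EOF
printf 'RECORD /TEST/ this\nthis.otherfield.sumthin = 2\n' | python3 rec2type.py
```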

With GNU awk:
$ cat tst.awk
/RECORD/ {
    $0 = gensub(/[^/]+[/]([^/]+)[/]/, "TYPE(\\1)", "")
    name = tolower($NF)
}
{
    while ( match(tolower($0), "\\<" name "[.][[:alnum:]_.]+") ) {
        $0 = substr($0,1,RSTART-1) \
             gensub(/[.]/, "%", "g", substr($0,RSTART,RLENGTH)) \
             substr($0,RSTART+RLENGTH)
    }
}
{ print }
$ cat file
RECORD /TEST/ tHiS
this.field = 1
THIS.otherfield.sumthin = 2
func = func(%val(ThIs.field,foo.bar,this.other.field))
$ awk -f tst.awk file
TYPE(TEST) tHiS
this%field = 1
THIS%otherfield%sumthin = 2
func = func(%val(ThIs%field,foo.bar,this%other%field))
Note that I modified your input to show what would happen with multiple occurrences of this.field on one line and mixed in with other "." references (foo.bar). I also added some mixed-case occurrences of "this" to show how that works.
In response to the question below about how to handle included files, here's one way:
This script will not only expand all the lines that say "include subfile", but by writing the result to a tmp file, resetting ARGV[1] (the highest level input file) and not resetting ARGV[2] (the tmp file), it then lets awk do any normal record parsing on the result of the expansion since that's now stored in the tmp file. If you don't need that, just do the "print" to stdout and remove any other references to a tmp file or ARGV[2].
awk 'function read(file) {
         while ( (getline < file) > 0 ) {
             if ($1 == "include") {
                 read($2)
             } else {
                 print > ARGV[2]
             }
         }
         close(file)
     }
     BEGIN {
         read(ARGV[1])
         ARGV[1] = ""
         close(ARGV[2])
     }1' a.txt tmp
The result of running the above given these 3 files in the current directory:
a.txt b.txt c.txt
----- ----- -----
1 3 5
2 4 6
include b.txt include c.txt
9 7
10 8
would be to print the numbers 1 through 10 and save them in a file named "tmp".
So for this application you could replace the number "1" at the end of the above script with the contents of the first script posted above and it'd work on the tmp file that now includes the contents of the expanded files.
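To make that concrete, here's a minimal two-pass sketch along those lines. The file names a.f and b.f and the hard-coded record name this are stand-ins; a real run would pick the record name up from the RECORD line as in the script at the top of this answer:

```shell
#!/bin/sh
# Two-pass sketch: pass 1 expands "include <file>" lines into tmp,
# pass 2 converts this.field -> this%field on the expanded result.
# a.f, b.f and the fixed record name "this" are illustrative only.
cd "$(mktemp -d)"
printf 'RECORD /TEST/ this\ninclude b.f\n' > a.f
printf 'this.field = 1\n' > b.f

# pass 1: recursive include expansion (same idea as the answer above)
awk 'function read(file) {
         while ( (getline line < file) > 0 ) {
             split(line, f)
             if (f[1] == "include") read(f[2])
             else                   print line > "tmp"
         }
         close(file)
     }
     BEGIN { read("a.f") }'

# pass 2: dot-to-percent conversion on the expanded file
awk '{ while (match($0, /this\.[A-Za-z0-9_.]+/)) {
           seg = substr($0, RSTART, RLENGTH)
           gsub(/\./, "%", seg)
           $0 = substr($0, 1, RSTART-1) seg substr($0, RSTART+RLENGTH)
       }
       print }' tmp
```

Pass 2 uses only POSIX awk features (no gensub), so it also works where gawk isn't installed.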

Related

How to process file content differently for each line using shell script?

I have a file which has this data -
view:
schema1.view1:/some-path/view1.sql
schema2.view2:/some-path/view2.sql
tables:
schema1.table1:/some-path/table1.sql
schema2.table2:/some-path/table2.sql
end:
I have to read the file and store the contents in different variables.
viewData=$(sed '/view/,/tables/!d;/tables/q' $file|sed '$d')
tableData=$(sed '/tables/,/end/!d;/end/q' $file|sed '$d')
echo $viewData
view:
schema1.view1:/some-path/view1.sql
schema2.view2:/some-path/view2.sql
echo $tableData
tables:
schema1.table1:/some-path/table1.sql
schema2.table2:/some-path/table2.sql
dataArray=("$viewData" "$tableData")
I need to use a for loop over dataArray so that I get all the components in 4 different variables.
Let's say for $viewData, the loop should be able to print like this:
objType=view
schema=schema1
view=view1
fileLoc=some-path/view1.sql
objType=view
schema=schema2
view=view2
fileLoc=some-path/view2.sql
I have tried the sed and cut commands but they are not working properly. And I need to do this using a shell script only.
Any help will be appreciated. Thanks!
Remark: if you add a space character between the : and the / in the input, then you would be able to use YAML-aware tools to parse it robustly.
Given your sample input, you can use this awk for generating the expected blocks:
awk '
match($0, /[^[:space:]]+:/) {
    key = substr($0, RSTART, RLENGTH-1)
    val = substr($0, RSTART+RLENGTH)
    if (i = index(key, ".")) {
        print "objType=" type
        print "schema=" substr(key, 1, i-1)
        print "view=" substr(key, i+1)
        print "fileLoc=" val
        printf "%c", 10
    } else
        type = key
}
' data.txt
objType=view
schema=schema1
view=view1
fileLoc=/some-path/view1.sql
objType=view
schema=schema2
view=view2
fileLoc=/some-path/view2.sql
objType=tables
schema=schema1
view=table1
fileLoc=/some-path/table1.sql
objType=tables
schema=schema2
view=table2
fileLoc=/some-path/table2.sql
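If the values actually need to land in shell variables rather than printed text, a plain-shell sketch of the same parse looks like this. The variable names follow the question; the inlined assignment stands in for the $viewData captured by the sed commands:

```shell
#!/bin/sh
# Parse one captured block ($viewData from the question) with IFS
# splitting: the first line yields objType, every later line splits at
# the first ":" into schema.view and the file location.
viewData='view:
schema1.view1:/some-path/view1.sql
schema2.view2:/some-path/view2.sql'

objType=${viewData%%:*}          # text before the first ":" -> "view"
printf '%s\n' "$viewData" | tail -n +2 |
while IFS=: read -r name fileLoc; do
    schema=${name%%.*}
    view=${name#*.}
    printf 'objType=%s\nschema=%s\nview=%s\nfileLoc=%s\n\n' \
        "$objType" "$schema" "$view" "$fileLoc"
done
```

Note that the loop runs in a pipeline subshell, so anything you assign inside it is gone after the loop; do the per-record work inside the loop body, or feed the loop with a redirection instead of a pipe in shells that support it.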

How to add an input file name to multiple output files in awk?

The question might be trivial. I'm trying to figure out a way to add a part of my input file name to multiple outputs generated by the following awk script.
Script:
zcat "$1" | awk 'BEGIN {
    # the number of sequences per file
    if (!N) N=10000;
    # file prefix
    if (!prefix) prefix = "seq";
    # file suffix
    if (!suffix) suffix = "fa";
    # this keeps track of the sequences
    count = 0
}
# skip empty lines at the beginning
/^$/ { next; }
# act on fasta header
/^>/ {
    if (count % N == 0) {
        if (output) close(output)
        output = sprintf("%s%07d.%s", prefix, count, suffix)
    }
    print > output
    count ++
    next
}
# write the fasta body into the file
{
    print >> output
}'
The input in $1 variable is 30_C_283_1_5.9.fa.gz
The output files generated by the script are
myseq0000000.fa, myseq1000000.fa and so on....
I would like the output to be
30_C_283_1_5.9_myseq000000.fa, 30_C_283_1_5.9_myseq100000.fa....
Looking forward to some inputs in this regard.
There's a way to direct the output from inside the Awk script:
https://www.gnu.org/software/gawk/manual/html_node/Redirection.html
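Redirection from inside the script is one half; the other half is getting the file name into the script. One hedged sketch: strip the extension in the shell, pass the result to awk with -v, and build the output name inside the script. The file name is the question's example, and N=1 here just forces a new file per sequence so the rotation is visible with two records:

```shell
#!/bin/sh
# Derive the prefix from the input name in the shell, hand it to awk
# with -v. "30_C_283_1_5.9.fa.gz" is the question's example file name.
cd "$(mktemp -d)"
f=30_C_283_1_5.9.fa.gz
base=${f%.fa.gz}                  # -> 30_C_283_1_5.9
printf '>s1\nACGT\n>s2\nGGCC\n' |
awk -v prefix="${base}_myseq" -v suffix=fa -v N=1 '
    /^>/ { if (count % N == 0) {             # same rotation as the script
               if (out) close(out)
               out = sprintf("%s%07d.%s", prefix, count, suffix)
           }
           count++
         }
    out  { print > out }'
ls
```

In the real pipeline the data would come from zcat "$f" instead of printf, and N would stay at 10000.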

awk substitution ascii table rules bash

I want to perform a hierarchical set of (non-recursive) substitutions in a text file.
I want to define the rules in an ascii file "table.txt" which contains lines of blank space tabulated pairs of strings:
aaa 3
aa 2
a 1
I have tried to solve it with an awk script "substitute.awk":
BEGIN { while (getline < file) { subs[$1]=$2 } }
{
    line=$0
    for (i in subs) { gsub(i, subs[i], line) }
    print line
}
When I call the script giving it the string "aaa":
echo aaa | awk -v file="table.txt" -f substitute.awk
I get
21
instead of the desired "3". Permuting the lines in "table.txt" doesn't help. Who can explain what the problem is here, and how to circumvent it? (This is a simplified version of my actual task. Where I have a large file containing ascii encoded phonetic symbols which I want to convert into Latex code. The ascii encoding of the symbols contains {$,&,-,%,[a-z],[0-9],...)).
Any comments and suggestions are welcome!
PS:
Of course in this application for a substitution table.txt:
aa ab
a 1
an original string "aa" should be converted into "ab" and not "1b". That means a string which was yielded by applying a rule must be left untouched.
How to account for that?
The order of the loop for (i in subs) is undefined by default.
In newer versions of awk you can use PROCINFO["sorted_in"] to control the sort order. See section 12.2.1 Controlling Array Traversal and (the linked) section 8.1.6 Using Predefined Array Scanning Orders for details about that.
Alternatively, if you can't or don't want to do that you could store the replacements in numerically indexed entries in subs and walk the array in order manually.
To do that you will need to store both the pattern and the replacement in the value of the array and that will require some care to combine. You can consider using SUBSEP or any other character that cannot be in the pattern or replacement and then split the value to get the pattern and replacement in the loop.
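A sketch of that manual version in portable awk, with the table inlined for the demonstration; the rule array and its SUBSEP-joined values are exactly the scheme described above:

```shell
#!/bin/sh
# Manual ordered traversal: rules keep their file order under integer
# indices; pattern and replacement travel together joined by SUBSEP.
table=$(mktemp)
printf 'aaa 3\naa 2\na 1\n' > "$table"
echo aaa | awk -v file="$table" '
    BEGIN {
        while ((getline line < file) > 0) {
            split(line, f)
            rule[++n] = f[1] SUBSEP f[2]     # preserve file order
        }
        close(file)
    }
    {
        for (i = 1; i <= n; i++) {
            split(rule[i], pr, SUBSEP)
            gsub(pr[1], pr[2])
        }
        print
    }'
rm -f "$table"
```

Because the rules are applied longest-first in file order, "aaa" becomes "3" rather than "21".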
Also note the caveats etc. with getline listed on http://awk.info/?tip/getline and consider not using it manually, but instead using NR==1{...} and just listing table.txt as the first file argument to awk.
Edit: Actually, for the manual loop version you could also just keep two arrays one mapping input file line number to the patterns to match and another mapping patterns to replacements. Then looping over the line number array will get you the pattern and the pattern can be used in the second array to get the replacement (for gsub).
Instead of storing the replacements in an associative array, put them in two arrays indexed by integer (one array for the strings to replace, one for the replacements) and iterate over the arrays in order:
BEGIN {
    i = 0
    while (getline < file) { subs[i] = $1; repl[i++] = $2 }
    n = i
}
{
    for (i = 0; i < n; i++) { gsub(subs[i], repl[i]) }
    print
}
It seems like perl's zero-width word boundary is what you want. It's a pretty straightforward conversion from the awk:
#!/usr/bin/env perl
use strict;
use warnings;

my %subs;
BEGIN {
    open my $f, '<', 'table.txt' or die "table.txt:$!";
    while (<$f>) {
        my ($k, $v) = split;
        $subs{$k} = $v;
    }
}
while (<>) {
    while (my ($k, $v) = each %subs) {
        s/\b$k\b/$v/g;
    }
    print;
}
Here's an answer pulled from another StackExchange site, from a fairly similar question: Replace multiple strings in a single pass.
It's slightly different in that it does the replacements in inverse order by length of target string (i.e. longest target first), but that is the only sensible order for targets which are literal strings, as appears to be the case in this question as well.
If you have tcc installed, you can use the following shell function, which processes the file of substitutions into a lex-generated scanner, then compiles and runs it using tcc's compile-and-run option.
# Call this as: substitute replacements.txt < text_to_be_substituted.txt
# Requires GNU sed because I was too lazy to write a BRE
substitute () {
    tcc -run <(
        {
            printf %s\\n "%option 8bit noyywrap nounput" "%%"
            sed -r 's/((\\\\)*)(\\?)$/\1\3\3/;
                    s/((\\\\)*)\\?"/\1\\"/g;
                    s/^((\\.|[^[:space:]])+)[[:space:]]*(.*)/"\1" {fputs("\3",yyout);}/' \
                "$1"
            printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
        } | lex -t)
}
With gcc or clang, you can use something similar to compile a substitution program from the replacement list, and then execute that program on the given text. Posix-standard c99 does not allow input from stdin, but gcc and clang are happy to do so provided you tell them explicitly that it is a C program (-x c). In order to avoid excess compilations, we use make (which needs to be gmake, Gnu make).
The following requires that the list of replacements be in a file with a .txt extension; the cached compiled executable will have the same name with a .exe extension. If the makefile were in the current directory with the name Makefile, you could invoke it as make repl (where repl is the name of the replacement file without a text extension), but since that's unlikely to be the case, we'll use a shell function to actually invoke make.
Note that in the following file, the whitespace at the beginning of each line starts with a tab character:
substitute.mak
.SECONDARY:

%: %.exe
	@$(<D)/$(<F)

%.exe: %.txt
	@{ printf %s\\n "%option 8bit noyywrap nounput" "%%"; \
	   sed -r \
	       's/((\\\\)*)(\\?)$$/\1\3\3/; #\
	        s/((\\\\)*)\\?"/\1\\"/g; #\
	        s/^((\\.|[^[:space:]])+)[[:space:]]*(.*)/"\1" {fputs("\3",yyout);}/' \
	       "$<"; \
	   printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"; \
	} | lex -t | c99 -D_POSIX_C_SOURCE=200809L -O2 -x c -o "$@" -
Shell function to invoke the above:
substitute() {
gmake -f/path/to/substitute.mak "${1%.txt}"
}
You can invoke the above command with:
substitute file
where file is the name of the replacements file. (The filename must end with .txt but you don't have to type the file extension.)
The format of the input file is a series of lines consisting of a target string and a replacement string. The two strings are separated by whitespace. You can use any valid C escape sequence in the strings; you can also \-escape a space character to include it in the target. If you want to include a literal \, you'll need to double it.
If you don't want C escape sequences and would prefer to have backslashes not be metacharacters, you can replace the sed program with a much simpler one:
sed -r 's/([\\"])/\\\1/g' "$<"; \
(The ; \ is necessary because of the way make works.)
a) Don't use getline unless you have a very specific need and fully understand all the caveats, see http://awk.info/?tip/getline
b) Don't use regexps when you want strings (yes, this means you cannot use sed).
c) The while loop needs to constantly move beyond the part of the line you've already changed or you could end up in an infinite loop.
You need something like this:
$ cat substitute.awk
NR==FNR {
    if (NF == 2) {
        strings[++numStrings] = $1
        old2new[$1] = $2
    }
    next
}
{
    for (stringNr=1; stringNr<=numStrings; stringNr++) {
        old = strings[stringNr]
        new = old2new[old]
        slength = length(old)
        tail = $0
        $0 = ""
        while ( sstart = index(tail,old) ) {
            $0 = $0 substr(tail,1,sstart-1) new
            tail = substr(tail,sstart+slength)
        }
        $0 = $0 tail
    }
    print
}
$ echo aaa | awk -f substitute.awk table.txt -
3
$ echo aaaa | awk -f substitute.awk table.txt -
31
and adding some RE metacharacters to table.txt to show they are treated just like every other character and showing how to run it when the target text is stored in a file instead of being piped:
$ cat table.txt
aaa 3
aa 2
a 1
. 7
\ 4
* 9
$ cat foo
a.a\aa*a
$ awk -f substitute.awk table.txt foo
1714291
Your new requirement requires a solution like this:
$ cat substitute.awk
NR==FNR {
    if (NF == 2) {
        strings[++numStrings] = $1
        old2new[$1] = $2
    }
    next
}
{
    delete news
    for (stringNr=1; stringNr<=numStrings; stringNr++) {
        old = strings[stringNr]
        new = old2new[old]
        slength = length(old)
        tail = $0
        $0 = ""
        charPos = 0
        while ( sstart = index(tail,old) ) {
            charPos += sstart
            news[charPos] = new
            $0 = $0 substr(tail,1,sstart-1) RS
            tail = substr(tail,sstart+slength)
        }
        $0 = $0 tail
    }
    numChars = split($0, olds, "")
    $0 = ""
    for (charPos=1; charPos <= numChars; charPos++) {
        $0 = $0 (charPos in news ? news[charPos] : olds[charPos])
    }
    print
}
$ cat table.txt
1 a
2 b
$ echo "121212" | awk -f substitute.awk table.txt -
ababab

split larger file into smaller files: help regarding 'split'

I have a large file (2GB) which looks something like this:
>10GS_A
YTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGD
LTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKD
DYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFP
LLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ
>11BA_A
KESAAAKFERQHMDSGNSPSSSSNYCNLMMCCRKMTQGKCKPVNTFVHESLADVKAV
CSQKKVTCKNGQTNCYQSKSTMRITDCRETGSSKYPNCAYKTTQVEKHIIVACGGKP
SVPVHFDASV
>11BG_A
KESAAKFERQHMDSGNSPSSSSNYCNLMMCCRKMTQGKCKPVNTFVHESLADVKAVCSQKKVT
CKNGQTNCYQSKSTMRITDCRETGSSKYPNCAYKTTQVEKHIIVACGGKPSVPVHFDASV
>121P_A
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRD
QYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYG
IPYIETSAKTRQGVEDAFYTLVREIRQH
I wanted to split this file into smaller files based in the delimiter ">" in such a way that, in this case, there are 4 files generated which contain the following text AND ARE NAMED IN THE FOLLOWING MANNER:
10gs_A.txt
11ba_A.txt
11bg_A.txt
121p_A.txt
AND THEY CONTAIN the following contents:
10gs_A.txt
>10GS_A
YTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGD
LTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKD
DYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFP
LLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ
11ba_A.txt
>11BA_A
KESAAAKFERQHMDSGNSPSSSSNYCNLMMCCRKMTQGKCKPVNTFVHESLADVKAV
CSQKKVTCKNGQTNCYQSKSTMRITDCRETGSSKYPNCAYKTTQVEKHIIVACGGKP
SVPVHFDASV
... and so on.
I am aware of splitting a larger text file using the split command in Linux; however, it names the files created as temp00, temp01, temp02, and so on.
Is there a way to split this larger file and have the files named as I want?
What is the split function to achieve this?
With gawk you can do -
gawk -v RS='>' 'NF{ print RS$0 > $1".txt" }' InputFile
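If fully lower-cased names are acceptable (10gs_a.txt rather than the 10gs_A.txt shown in the question, which keeps the chain suffix upper-case), tolower() folds the renaming into essentially the same one-liner. Since RS here is a single character, this variant also runs under any POSIX awk:

```shell
#!/bin/sh
# Variant of the answer above: tolower($1) lower-cases the whole header
# for the file name. The sample input is a trimmed copy of the
# question's file.
cd "$(mktemp -d)"
printf '>10GS_A\nYTVVYFPV\n>11BA_A\nKESAAAKF\n' > in.fa
awk -v RS='>' 'NF { print RS $0 > (tolower($1) ".txt") }' in.fa
ls
```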
How about using an awk script to split mybigfile
splitter.awk
BEGIN {outname = "noname.txt"}
/^>/ { outname = substr($0,2,40) ".txt"
next }
{ print > outname }
If you want the separator row in the output, then use the following:
splitter.awk
BEGIN {outname = "noname.txt"}
/^>/ { outname = substr($0,2,40) ".txt"}
{ print > outname }
Then run it with:
awk -f splitter.awk mybigfile

shell script to search attribute and store value along with filename

Looking out for a shell script which searches for an attribute (a string) in all the files in the current directory and stores the attribute values along with the file names.
e.g File1.txt
abc xyz = "pqr"
File2.txt
abc xyz = "klm"
Here File1 and File2 contains desired string "abc xyz" and have values "pqr" and "klm".
I want result something like this:
File1.txt:pqr
File2.txt:klm
Well, this depends on how you define a 'shell script'. Here are 3 one-line solutions:
Using grep/sed:
egrep -o 'abc xyz = ".*"' * | sed -e 's/abc xyz = "\(.*\)"/\1/'
Using awk:
awk '/abc xyz = "(.*)"/ { print FILENAME ":" gensub("abc xyz = \"(.*)\"", "\\1", 1) }' *
Using perl one-liner:
perl -ne 'if(s/abc xyz = "(.*)"/$ARGV:$1/) { print }' *
I personally would go with the last one.
Please don't use bash scripting for this.
There is much room for small improvements in the code,
but in 20 lines the damn thing does the job.
Note: the code assumes that "abc xyz" is at the beginning of the line.
#!/usr/bin/python
import os
import re

MYDIR = '/dir/you/want/to/search'

def search_file(fn):
    myregex = re.compile(r'abc xyz = \"([a-z]+)\"')
    f = open(fn, 'r')
    for line in f:
        m = myregex.match(line)
        if m:
            yield m.group(1)

for filename in os.listdir(MYDIR):
    if os.path.isfile(os.path.join(MYDIR, filename)):
        matches = search_file(os.path.join(MYDIR, filename))
        for match in matches:
            print filename + ':' + match,
Thanks to David Beazley, A.M. Kuchling, and Mark Pilgrim for sharing their vast knowledge.
I couldn't have done something like this without you guys leading the way.
