Parallelize an awk script with multiple input files and change the name of the output file - bash

I have a series of text files in a folder sub.yr_by_yr which I pass to a for loop to subset a Beagle file by its header values (the subsetting is done by my subbeagle.awk script). I want to parallelize this script. I use the name of each text file, via bash suffix removal (file11=${file1%.subbeagle.txt}), to build the name of the output file (MM.beagle.${file11}.gz)
for file1 in $(ls sub.yr_by_yr)
do
echo -e "Doing sub-samples \n $file1"
file11=${file1%.subbeagle.txt}
awk -f subbeagle.awk \
./sub.yr_by_yr/$file1 <(zcat ../MajorMinor.beagle.gz) | gzip > sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz
done
The for loop works, but takes forever, hence the need for parallelization. The folder sub.yr_by_yr contains more than 10 files named
similar to this: sp.yrseries.site1.1.subbeagle.txt, sp.yrseries.site1.2.subbeagle.txt, sp.yrseries.site1.3.subbeagle.txt...
I've tried
parallel "file11=${{}%.subbeagle.txt}; awk -f $SUBBEAGLEAWKSCRIPT ./sub.yr_by_yr/{} <(zcat ../MajorMinor.beagle.gz) | gzip > sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz" ::: sub.yr_by_yr/*.subbeagle.txt
But it gives me 'bad substitution'
How could I use the awk script in parallel and rename the files accordingly?
Content of subbeagle.awk:
# Source: https://stackoverflow.com/questions/74451358/select-columns-based-on-their-names-from-a-file-using-awk
BEGIN { FS=OFS="\t" } # uncomment if input/output fields are tab delimited
FNR==NR { headers[$1]; next }
{ sep=""
for (i=1; i<=NF; i++) {
if (FNR==1 && ($i in headers)) {
fldids[i]
}
if (i in fldids) {
printf "%s%s",sep,$i
sep=OFS # if not set elsewhere (eg, in a BEGIN{}block) then default OFS == <space>
}
}
print ""
}
Content of MajorMinor.beagle.gz
marker allele1 allele2 FINCH_WB_ID1_splitMerged FINCH_WB_ID1_splitMerged FINCH_WB_ID1_splitMerged FINCH_WB_ID2_splitMerged FINCH_WB_ID2_splitMerged
chr1_34273 G C 0.79924 0.20076 3.18183e-09 0.940649 0.0593509
chr1_34285 G A 0.79924 0.20076 3.18183e-09 0.969347 0.0306534
chr1_34291 G C 0.666111 0.333847 4.20288e-05 0.969347 0.0306534
chr1_34299 C G 0.000251063 0.999498 0.000251063 0.996035 0.00396529
UPDATE:
I was able to get this from this source:
parallel "awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{/.}_test.gz'" ::: sub.yr_by_yr/*.subbeagle.txt
The only fancy thing that needs to be removed is the .subbeagle part of the input file name...

So the parallel tutorial helped me here:
parallel --rpl '{mymy} s:.*/::; s:\.[^.]+$::;s:\.[^.]+$::;' "awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{mymy}.gz'" ::: sub.yr_by_yr/*.subbeagle.txt
Let's break this down:
--rpl '{mymy} s:.*/::; s:\.[^.]+$::;s:\.[^.]+$::;'
--rpl will "define a shorthand replacement string" (see parallel tutorial and another example here)
{mymy} is my 'new' replacement string, which applies the Perl expressions that follow it.
s:.*/::; is the definition of {/} (see the parallel tutorial, search for "Perl expression replacement string"; the last part of that section shows the definition of 7 'default' replacement strings)
s:\.[^.]+$::;s:\.[^.]+$::; removes 2 extensions (so .subbeagle.txt where .txt is the first extension and .subbeagle is the second)
"awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{mymy}.gz'"
is the subsetting and compressing part of the command. Note that {mymy} is where the replacement takes place. As you can see, {} will be the input file name. The rest is unchanged!
::: sub.yr_by_yr/*.subbeagle.txt will pass all the files to parallel as input.
The serial loop took ~2 hours to do about 5 files, but using 22 cores I could do all the files in a fraction of the time (~20 minutes)!
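An alternative that avoids defining a --rpl shorthand at all is GNU parallel's inline Perl replacement string {= ... =}. This is just a sketch of the same idea with the two substitutions written inline; it is not benchmarked against the version above:
parallel "awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{= s:.*/::; s:\.subbeagle\.txt$::; =}.gz'" ::: sub.yr_by_yr/*.subbeagle.txt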

Related

Bash split command to split line in comma separated values

I have a large file with 2000 hostnames and I want to create multiple files with 25 hosts per file, separated by commas, with no trailing comma.
Large.txt:
host1
host2
host3
.
.
host10000
The split command below creates multiple files like file1, file2 ... however, the hosts are not comma-separated and it's not the expected output.
split -d -l 25 large.txt file
The expected output is:
host1,host2,host3
You'll need to perform 2 separate operations ... 1) split the file and 2) reformat the files generated by split.
The first step is already done:
split -d -l 25 large.txt file
For the second step let's work with the results that are dumped into the first file by the basic split command:
$ cat file00
host1
host2
host3
...
host25
We want to pull these lines into a single line using a comma (,) as delimiter. For this example I'll use an awk solution:
$ cat file00 | awk '{ printf "%s%s", sep, $0 ; sep="," } END { print "" }'
host1,host2,host3...,host25
Where:
sep is initially undefined (aka empty string)
on each successive line processed by awk we set sep to a comma
the printf doesn't include a linefeed (\n) so each successive printf will append to the 'first' line of output
we END the script by printing a linefeed to the end of the file
It just so happens that split has an option to call a secondary script/code-snippet to allow for custom formatting of the output (generated by split); the option is --filter. A few issues to keep in mind:
the initial output from split is (effectively) piped as input to the command listed in the --filter option
it is necessary to escape (with backslash) certain characters in the command (eg, double quotes, dollar sign) so as to keep them from being expanded by the calling shell before split runs the filter command
the --filter option automatically has access to the current split outfile name using the $FILE variable
Pulling everything together gives us:
$ split -d -l 25 --filter="awk '{ printf \"%s%s\", sep, \$0 ; sep=\",\" } END { print \"\" }' > \$FILE" large.txt file
$ cat file00
host1,host2,host3...,host25
Using the --filter option on GNU split:
split -d -l 25 --filter="(perl -ne 'chomp; print \",\" if \$i++; print'; echo) > \$FILE" large.txt file
You can use the bash code snippet below.
INPUT FILE
~$ cat domainlist.txt
domain1.com
domain2.com
domain3.com
domain4.com
domain5.com
domain6.com
domain7.com
domain8.com
Script
#!/usr/bin/env bash
FILE_NAME=domainlist.txt
LIMIT=4
OUTPUT_PREFIX=domain_
CMD="csplit ${FILE_NAME} ${LIMIT} {1} -f ${OUTPUT_PREFIX}"
eval ${CMD}
#=====#
for file in ${OUTPUT_PREFIX}*; do
echo $file
sed -i ':a;N;$!ba;s/\n/,/g' $file
done
OUTPUT
./mysplit.sh
36
48
12
domain_00
domain_01
domain_02
~$ cat domain_00
domain1.com,domain2.com,domain3.com
Change LIMIT, the OUTPUT_PREFIX file name prefix, and the input file as per your requirement.
using awk:
awk '
BEGIN { PREFIX = "file"; n = 0; }
{ hosts = hosts sep $0; sep = ","; }
function flush() { print hosts > PREFIX n++; hosts = ""; sep = ""; }
NR % 25 == 0 { flush(); }
END { flush(); }
' large.txt
edit: improved comma separation handling stealing from markp-fuso's excellent answer :)

Replace a string with a random number for every line, in every file, in a directory in Bash

#!/bin/bash
for file in ~/tdg/*.TXT
do
while read p; do
randvalue=`shuf -i 1-99999 -n 1`
sed -i -e "s/55555/${randvalue}/" $file
done < $file
done
This is my script. I'm attempting to replace 55555 with a different random number every time I find it. This currently works, but it replaces every instance of 55555 with the same random number. I have attempted to replace $file at the end of the sed command with $p but that just blows up.
Really though, I'll be happy even if each instance on the same line gets the same random number, as long as a new random number is used for each line.
EDIT
I should have specified this. I would like to actually save the results of the replace in the file, rather than just printing the results to the console.
EDIT
The final working version of my script after JNevill's fantastic help:
#!/bin/bash
for file in ~/tdg/*.TXT
do
while read p;
do
gawk '{$0=gensub(/55555/, int(rand()*99999), "g", $0)}1' $file > ${file}.new
done < $file
mv -f ${file}.new $file
done
Since doing this in sed gets pretty awful pretty quickly, you may want to switch over to awk to perform this:
awk '{$0=gensub(/55555/, int(rand()*99999), "g", $0)}1' $file
Using this, you can remove the inner loop as this will run across the entire file line-by-line as awk does.
You could just swap out the entire script and feed the wildcard filename to awk directly too:
awk '{$0=gensub(/55555/, int(rand()*99999), "g", $0)}1' ~/tdg/*.TXT
This is how to REALLY do what you're trying to do with GNU awk:
awk -i inplace '{ while(sub(/55555/,int(rand()*99999)+1)); print }' ~/tdg/*.TXT
No shell loops or temp files required and it WILL replace every 55555 with a different random number within and across all files.
With other awks it'd be:
seed="$RANDOM"
for file in ~/tdg/*.TXT; do
seed=$(awk -v seed="$seed" '
BEGIN { srand(seed) }
{ while(sub(/55555/,int(rand()*99999)+1)); print > "tmp" }
END { print int(rand()*99999)+1 }
' "$file") &&
mv tmp "$file"
done
A variation on JNevill's solution that generates a different set of random numbers every time you run the script ...
A sample data file:
$ cat grand.dat
abc def 55555
xyz-55555-55555-__+
123-55555-55555-456
987-55555-55555-.2.
.+.-55555-55555-==*
And the script:
$ cat grand.awk
{ $0=gensub(/55555/,int(rand()*seed),"g",$0); print }
gensub(...) : works the same as in JNevill's answer, but we mix up the rand() multiplier by using our seed value [you can use any number here you wish to help determine the size of the resulting value]
Keep in mind that this will replace all occurrences of 55555 on a single line with the same random value.
Script in action:
$ awk -f grand.awk seed=${RANDOM} grand.dat
abc def 6939
xyz-8494-8494-__+
123-24685-24685-456
987-4442-4442-.2.
.+.-17088-17088-==*
$ awk -f grand.awk seed=${RANDOM} grand.dat
abc def 4134
xyz-5060-5060-__+
123-14706-14706-456
987-2646-2646-.2.
.+.-10180-10180-==*
$ awk -f grand.awk seed=${RANDOM} grand.dat
abc def 4287
xyz-5248-5248-__+
123-15251-15251-456
987-2744-2744-.2.
.+.-10558-10558-==*
seed=$RANDOM : have the shell generate a random int for us and pass it into the awk script as the seed variable

Bash - Search and Replace operation with reporting the files and lines that got changed

I have an input file "test.txt" as below -
hostname=abc.com hostname=xyz.com
db-host=abc.com db-host=xyz.com
In each line, the value before the space is the old value which needs to be replaced by the new value after the space, recursively in a folder named "test". I am able to do this using the shell script below.
#!/bin/bash
IFS=$'\n'
for f in `cat test.txt`
do
OLD=$(echo $f| cut -d ' ' -f 1)
echo "Old = $OLD"
NEW=$(echo $f| cut -d ' ' -f 2)
echo "New = $NEW"
find test -type f | xargs sed -i.bak "s/$OLD/$NEW/g"
done
"sed" replaces the strings on the fly in 100s of files.
Is there a trick or an alternative way by which I can get a report of the files changed, like the absolute path of the file and the exact lines that got changed?
PS - I understand that sed and other stream editors don't support this functionality out of the box. I don't want to use versioning as it would be overkill for this task.
Let's start with a simple rewrite of your script, to make it a little bit more robust at handling a wider range of replacement values, but also faster:
#!/bin/bash
# escape regexp and replacement strings for sed
escapeRegex() { sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$1"; }
escapeSubst() { sed 's/[&/\]/\\&/g' <<<"$1"; }
while read -r old new; do
find test -type f -exec sed "s/$(escapeRegex "$old")/$(escapeSubst "$new")/g" -i '{}' \;
done <test.txt
So, we loop over pairs of whitespace-separated fields (old, new) in lines from test.txt and run a standard sed in-place replace on all files found with find.
Pretty similar to your script, but we properly read lines from test.txt (no word splitting, pathname/variable expansion, etc.), we use Bash builtins whenever possible (no need to call external tools like cat, cut, xargs); and we escape sed metacharacters in old/new values for proper use as sed's regexp and replacement expressions.
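For example, with those two helper functions loaded in your shell, here is what they produce for a couple of made-up values (the inputs are illustrative; the outputs follow directly from the sed expressions above):
$ escapeRegex 'db-host=abc.com'
[d][b][-][h][o][s][t][=][a][b][c][.][c][o][m]
$ escapeSubst 'price: $5 & up/down'
price: $5 \& up\/down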
Now let's add logging from sed:
#!/bin/bash
# escape regexp and replacement strings for sed
escapeRegex() { sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$1"; }
escapeSubst() { sed 's/[&/\]/\\&/g' <<<"$1"; }
while read -r old new; do
find test -type f -printf '\n[%p]\n' -exec sed "/$(escapeRegex "$old")/{
h
s//$(escapeSubst "$new")/g
H
x
s/\n/ --> /
w /dev/stdout
x
}" -i '{}' > >(tee -a change.log) \;
done <test.txt
The sed script above changes each old to new, but it also writes an old --> new line to /dev/stdout (Bash-specific), which we in turn append to the change.log file. The -printf action in find outputs a "header" line with the file name for each file processed.
With this, your "change log" will look something like:
[file1]
hostname=abc.com --> hostname=xyz.com
[file2]
[file1]
db-host=abc.com --> db-host=xyz.com
[file2]
db-host=abc.com --> db-host=xyz.com
Just for completeness, a quick walk-through the sed script. We act only on lines containing the old value. For each such line, we store it to hold space (h), change it to new, append that new value to the hold space (joined with newline, H) which now holds old\nnew. We swap hold with pattern space (x), so we can run s command that converts it to old --> new. After writing that to the stdout with w, we move the new back from hold to pattern space, so it gets written (in-place) to the file processed.
From man sed:
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if SUFFIX supplied)
This can be used to create a backup file when replacing. You can then look for any backup files, which indicate which files were changed, and diff those with the originals. Once you're done inspecting the diff, simply remove the backup files.
If you formulate your replacements as sed statements rather than a custom format you can go one further, and use either a sed shebang line or pass the file to -f/--file to do all the replacements in one operation.
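A minimal sketch of that backup-and-diff workflow (the s/old-value/new-value/g expression and the test directory are illustrative, not from the question):
find test -type f ! -name '*.bak' -exec sed -i.bak 's/old-value/new-value/g' '{}' +
# sed leaves a .bak copy of every file it processed; compare each copy with the
# edited file to see which ones actually changed, show the diff, then clean up
find test -type f -name '*.bak' | while read -r bak
do
    orig=${bak%.bak}
    cmp -s "$bak" "$orig" || { echo "changed: $orig"; diff "$bak" "$orig"; }
    rm "$bak"
done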
There are several problems with your script; just replace it all with this (using GNU awk instead of GNU sed for in-place editing):
mapfile -t files < <(find test -type f)
awk -i inplace '
NR==FNR { map[$1] = $2; next }
{ for (old in map) gsub(old,map[old]) }
' test.txt "${files[@]}"
You'll find that is orders of magnitude faster than what you were doing.
That still has the issues your existing script does: failing when the "test.txt" strings contain regexp or backreference metacharacters, modifying previously-modified strings, and matching on partial strings - if any of that is an issue let us know, as it's easy to work around with awk (and extremely difficult with sed!).
To get whatever kind of report you want you just tweak the { for ... } line to print them, e.g. to print a record of the changes to stderr:
mapfile -t files < <(find test -type f)
awk -i inplace '
NR==FNR { map[$1] = $2; next }
{
orig = $0
for (old in map) {
gsub(old,map[old])
}
if ($0 != orig) {
printf "File %s, line %d: \"%s\" became \"%s\"\n", FILENAME, FNR, orig, $0 | "cat>&2"
}
}
' test.txt "${files[@]}"

awk substitution ascii table rules bash

I want to perform a hierarchical set of (non-recursive) substitutions in a text file.
I want to define the rules in an ascii file "table.txt" which contains lines of whitespace-separated pairs of strings:
aaa 3
aa 2
a 1
I have tried to solve it with an awk script "substitute.awk":
BEGIN { while (getline < file) { subs[$1]=$2; } }
{ line=$0; for(i in subs)
{ gsub(i,subs[i],line); }
print line;
}
When I call the script giving it the string "aaa":
echo aaa | awk -v file="table.txt" -f substitute.awk
I get
21
instead of the desired "3". Permuting the lines in "table.txt" doesn't help. Who can explain what the problem is here, and how to circumvent it? (This is a simplified version of my actual task, where I have a large file containing ascii-encoded phonetic symbols which I want to convert into LaTeX code. The ascii encoding of the symbols contains {$, &, -, %, [a-z], [0-9], ...}.)
Any comments and suggestions are welcome!
PS:
Of course, in this application, for a substitution table.txt:
aa ab
a 1
an original string "aa" should be converted into "ab" and not "1b". That means a string which was yielded by applying a rule must be left untouched.
How to account for that?
The order of the loop for (i in subs) is undefined by default.
In newer versions of awk you can use PROCINFO["sorted_in"] to control the sort order. See section 12.2.1 Controlling Array Traversal and (the linked) section 8.1.6 Using Predefined Array Scanning Orders for details about that.
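As a rough gawk-only sketch of that idea (the comparison function name is mine, not from the answer), you can traverse the patterns longest-first so that aaa is tried before aa and a:
# sort array indices by decreasing length before each traversal
function by_length_desc(i1, v1, i2, v2) { return length(i2) - length(i1) }
BEGIN { while ((getline < file) > 0) subs[$1] = $2 }
{
    PROCINFO["sorted_in"] = "by_length_desc"
    for (i in subs) gsub(i, subs[i])
    print
}
Run as echo aaa | gawk -v file=table.txt -f sorted.awk, this prints 3.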
Alternatively, if you can't or don't want to do that you could store the replacements in numerically indexed entries in subs and walk the array in order manually.
To do that you will need to store both the pattern and the replacement in the value of the array and that will require some care to combine. You can consider using SUBSEP or any other character that cannot be in the pattern or replacement and then split the value to get the pattern and replacement in the loop.
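A minimal sketch of that combined-value idea (file and variable names are illustrative):
# rules[n] holds "pattern SUBSEP replacement", keyed by the line number in table.txt
NR==FNR { rules[FNR] = $1 SUBSEP $2; n = FNR; next }
{
    for (i = 1; i <= n; i++) {
        split(rules[i], r, SUBSEP)   # r[1] = pattern, r[2] = replacement
        gsub(r[1], r[2])
    }
    print
}
Invoked as awk -f subs.awk table.txt -, it applies the rules in the order they appear in table.txt.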
Also note the caveats etc. with getline listed on http://awk.info/?tip/getline and consider not using it manually but instead using NR==FNR{...} and just listing table.txt as the first file argument to awk.
Edit: Actually, for the manual loop version you could also just keep two arrays one mapping input file line number to the patterns to match and another mapping patterns to replacements. Then looping over the line number array will get you the pattern and the pattern can be used in the second array to get the replacement (for gsub).
Instead of storing the replacements in an associative array, put them in two arrays indexed by integer (one array for the strings to replace, one for the replacements) and iterate over the arrays in order:
BEGIN {i=0; while (getline < file) { subs[i]=$1; repl[i++]=$2}
n = i}
{ for(i=0;i<n;i++) { gsub(subs[i],repl[i]); }
print tolower($0);
}
It seems like perl's zero-width word boundary is what you want. It's a pretty straightforward conversion from the awk:
#!/usr/bin/env perl
use strict;
use warnings;
my %subs;
BEGIN{
open my $f, '<', 'table.txt' or die "table.txt:$!";
while(<$f>) {
my ($k,$v) = split;
$subs{$k}=$v;
}
}
while(<>) {
while(my($k, $v) = each %subs) {
s/\b$k\b/$v/g;
}
print;
}
Here's an answer pulled from another StackExchange site, from a fairly similar question: Replace multiple strings in a single pass.
It's slightly different in that it does the replacements in inverse order by length of target string (i.e. longest target first), but that is the only sensible order for targets which are literal strings, as appears to be the case in this question as well.
If you have tcc installed, you can use the following shell function, which processes the file of substitutions into a lex-generated scanner which it then compiles and runs using tcc's compile-and-run option.
# Call this as: substitute replacements.txt < text_to_be_substituted.txt
# Requires GNU sed because I was too lazy to write a BRE
substitute () {
tcc -run <(
{
printf %s\\n "%option 8bit noyywrap nounput" "%%"
sed -r 's/((\\\\)*)(\\?)$/\1\3\3/;
s/((\\\\)*)\\?"/\1\\"/g;
s/^((\\.|[^[:space:]])+)[[:space:]]*(.*)/"\1" {fputs("\3",yyout);}/' \
"$1"
printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
} | lex -t)
}
With gcc or clang, you can use something similar to compile a substitution program from the replacement list, and then execute that program on the given text. Posix-standard c99 does not allow input from stdin, but gcc and clang are happy to do so provided you tell them explicitly that it is a C program (-x c). In order to avoid excess compilations, we use make (which needs to be gmake, Gnu make).
The following requires that the list of replacements be in a file with a .txt extension; the cached compiled executable will have the same name with a .exe extension. If the makefile were in the current directory with the name Makefile, you could invoke it as make repl (where repl is the name of the replacement file without the .txt extension), but since that's unlikely to be the case, we'll use a shell function to actually invoke make.
Note that in the following file, the whitespace at the beginning of each line starts with a tab character:
substitute.mak
.SECONDARY:
%: %.exe
@$(<D)/$(<F)
%.exe: %.txt
@{ printf %s\\n "%option 8bit noyywrap nounput" "%%"; \
sed -r \
's/((\\\\)*)(\\?)$$/\1\3\3/; #\
s/((\\\\)*)\\?"/\1\\"/g; #\
s/^((\\.|[^[:space:]])+)[[:space:]]*(.*)/"\1" {fputs("\3",yyout);}/' \
"$<"; \
printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"; \
} | lex -t | c99 -D_POSIX_C_SOURCE=200809L -O2 -x c -o "$@" -
Shell function to invoke the above:
substitute() {
gmake -f/path/to/substitute.mak "${1%.txt}"
}
You can invoke the above command with:
substitute file
where file is the name of the replacements file. (The filename must end with .txt but you don't have to type the file extension.)
The format of the input file is a series of lines consisting of a target string and a replacement string. The two strings are separated by whitespace. You can use any valid C escape sequence in the strings; you can also \-escape a space character to include it in the target. If you want to include a literal \, you'll need to double it.
If you don't want C escape sequences and would prefer to have backslashes not be metacharacters, you can replace the sed program with a much simpler one:
sed -r 's/([\\"])/\\\1/g' "$<"; \
(The ; \ is necessary because of the way make works.)
a) Don't use getline unless you have a very specific need and fully understand all the caveats, see http://awk.info/?tip/getline
b) Don't use regexps when you want strings (yes, this means you cannot use sed).
c) The while loop needs to constantly move beyond the part of the line you've already changed or you could end up in an infinite loop.
You need something like this:
$ cat substitute.awk
NR==FNR {
if (NF==2) {
strings[++numStrings] = $1
old2new[$1] = $2
}
next
}
{
for (stringNr=1; stringNr<=numStrings; stringNr++) {
old = strings[stringNr]
new = old2new[old]
slength = length(old)
tail = $0
$0 = ""
while ( sstart = index(tail,old) ) {
$0 = $0 substr(tail,1,sstart-1) new
tail = substr(tail,sstart+slength)
}
$0 = $0 tail
}
print
}
$ echo aaa | awk -f substitute.awk table.txt -
3
$ echo aaaa | awk -f substitute.awk table.txt -
31
and adding some RE metacharacters to table.txt to show they are treated just like every other character and showing how to run it when the target text is stored in a file instead of being piped:
$ cat table.txt
aaa 3
aa 2
a 1
. 7
\ 4
* 9
$ cat foo
a.a\aa*a
$ awk -f substitute.awk table.txt foo
1714291
Your new requirement requires a solution like this:
$ cat substitute.awk
NR==FNR {
if (NF==2) {
strings[++numStrings] = $1
old2new[$1] = $2
}
next
}
{
delete news
for (stringNr=1; stringNr<=numStrings; stringNr++) {
old = strings[stringNr]
new = old2new[old]
slength = length(old)
tail = $0
$0 = ""
charPos = 0
while ( sstart = index(tail,old) ) {
charPos += sstart
news[charPos] = new
$0 = $0 substr(tail,1,sstart-1) RS
tail = substr(tail,sstart+slength)
}
$0 = $0 tail
}
numChars = split($0, olds, "")
$0 = ""
for (charPos=1; charPos <= numChars; charPos++) {
$0 = $0 (charPos in news ? news[charPos] : olds[charPos])
}
print
}
$ cat table.txt
1 a
2 b
$ echo "121212" | awk -f substitute.awk table.txt -
ababab

Better way of extracting data from file for comparison

Problem: Comparison of files from Pre-check status and Post-check status of a node for specific parameters.
With some help from the community, I have written the following solution, which extracts the information from the files in the pre and post directories based on the "Node-ID" (which happens to be unique and is to be extracted from the files as well). After extracting the data from the Pre/Post folders, I create folders based on the node-id and dump the files into them.
My Code to extract data (The data is extracted from Pre and Post folders)
FILES=$(find postcheck_logs -type f -name *.log)
for f in $FILES
do
NODE=`cat $f | grep -m 1 ">" | awk '{print $1}' | sed 's/[>]//g'` ##Generate the node-id
echo "Extracting Post check information for " $NODE
mkdir temp/$NODE-post ## create a temp directory
cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param1/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param1.txt ## extract data
cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param2/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param2.txt
cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param3/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param3.txt
done
After this I have a structure as:
/Node1-pre/param1.txt
/Node1-post/param1.txt
and so on.
Now I am stuck on comparing the $NODE-pre and $NODE-post files.
I have tried to do it using recursive grep, but I am not finding a suitable way to do so. What is the best possible way to compare these files using diff?
Moreover, I find the above data extraction program very slow. I believe it's not the best possible way (using the least resources) to do so. Any suggestions?
Look askance at any instance of cat one-file — you could use I/O redirection on the next command in the pipeline instead.
You can do the whole thing more simply with:
for f in $(find postcheck_logs -type f -name *.log)
do
NODE=$(sed -n '/>/{ s/ .*//; s/>//g; p; q; }' $f) ##Generate the node-id
echo "Extracting Post check information for $NODE"
mkdir temp/$NODE-post
awk -v NODE="$NODE" -v DIR="temp/$NODE-post" \
'BEGIN { RS=NODE"> " }
/^param1/ { param1 = $0 }
/^param2/ { param2 = $0 }
/^param3/ { param3 = $0 }
END {
print RS param1 > DIR "/param1.txt"
print RS param2 > DIR "/param2.txt"
print RS param3 > DIR "/param3.txt"
}' $f
done
The NODE finding process is much better done by a single sed command than cat | grep | awk | sed, and you should plan to use $(...) rather than back-quotes everywhere.
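For example, on a made-up first log line, the single sed command pulls out just the node id:
$ printf 'NODE42> show params\nparam1 = 1\n' | sed -n '/>/{ s/ .*//; s/>//g; p; q; }'
NODE42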
The main processing of the log file should be done once; a single awk command is sufficient. The script is passed two variables: NODE and the directory name. The BEGIN is cleaned up; the $ before NODE was probably not what you intended. The main actions are very similar; each looks for the relevant parameter name and saves it in an appropriate variable. At the end, it writes the saved values to the relevant files, decorated with the value of RS. Semicolons are only needed when there's more than one statement on a line; there's just one statement per line in this expanded script. It looks bigger than the original, but that's only because I'm using vertical space.
As to comparing the before and after files, you can do it in many ways, depending on what you want to know. If you've got a POSIX-compliant diff (you probably do), you can use:
diff -r temp/$NODE-pre temp/$NODE-post
to report on the differences, if any, between the contents of the two directories. Alternatively, you can do it manually:
for file in param1.txt param2.txt param3.txt
do
if cmp -s temp/$NODE-pre/$file temp/$NODE-post/$file
then : No difference
else diff temp/$NODE-pre/$file temp/$NODE-post/$file
fi
done
Clearly, you can wrap that in a 'for each node' loop. And, if you are going to need to do that, then you probably do want to capture the output of the find command in a variable (as in the original code) so that you do not have to repeat that operation.
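A rough sketch of that outer loop, assuming the temp/<node>-pre and temp/<node>-post directories created earlier (the layout is inferred from the question, so adjust the paths to taste):
for pre in temp/*-pre
do
    node=${pre#temp/}; node=${node%-pre}
    for file in param1.txt param2.txt param3.txt
    do
        if cmp -s "$pre/$file" "temp/$node-post/$file"
        then : No difference
        else echo "== $node/$file =="
             diff "$pre/$file" "temp/$node-post/$file"
        fi
    done
done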
