Persistent AWK Program - bash

I have been tasked with writing a BASH script to filter log4j files and pipe them over netcat to another host. One of the requirements is that the script must keep track of what it has already sent to the server and not send it again due to licensing constraints on the receiving server (the product on the server is licensed on a data-per-day model).
To achieve the filtering I'm using AWK encapsulated in a BASH script. The BASH component works fine - it's the AWK program that's giving me grief when I try to get it to remember what has already been sent to the server. I am doing this by grabbing the time stamp of a line each time a line matches my pattern. At the end of the program the last time stamp is written to a hidden file in the current working directory. On successive runs of the program AWK will read this file into a variable. Now each time a line matches the pattern, its time stamp is also compared to the one in the variable. If it is newer it is printed, otherwise it is not.
Desired Output:
INFO 2012-11-07 09:57:12,479 [[artifactid].connector.http.mule.default.receiver.02] org.mule.api.processor.LoggerMessageProcessor: MsgID=5017f1ff-1dfa-48c7-a03c-ed3c29050d12 InteractionStatus=Accept InteractionDateTime=2012-08-07T16:57:33.379+12:00 Retailer=CTCT RequestType=RemoteReconnect
Hidden File:
2012-10-11 12:08:19,918
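One piece of the theory worth confirming (my note, not from the original post): zero-padded timestamps in this "YYYY-MM-DD HH:MM:SS,mmm" layout order correctly as plain strings, which is what an awk `>` comparison of two non-numeric operands relies on:

```shell
# String comparison of zero-padded stamps matches chronological order.
newer=$(awk 'BEGIN {
    if ("2012-11-07 09:57:12,479" > "2012-10-11 12:08:19,918")
        print "newer"
}')
echo "$newer"   # prints "newer"
```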
So that's the theory, now my issue.
The script works fine for contrived/trivial examples such as:
INFO 2012-11-07 09:57:12,479 [[artifactid].connector.http.mule.default.receiver.02] org.mule.api.processor.LoggerMessageProcessor: MsgID=5017f1ff-1dfa-48c7-a03c-ed3c29050d12 InteractionStatus=Accept InteractionDateTime=2012-08-07T16:57:33.379+12:00 Retailer=CTCT RequestType=RemoteReconnect
However, if I run it over a full-blown log file with stack traces etc. in it, then the indentation levels appear to wreak havoc on my program. The first run of the program will produce the desired results - matching lines will be printed and the latest time stamp written to the hidden file. Running it again is when the problem crops up. The output of the program contains the indented lines from stack traces etc. (see the block below) and I can't figure out why. This then stuffs the hidden file, as the last matching line doesn't contain a time stamp and some garbage is written to it, making any further runs pointless.
Undesired output:
at package.reverse.domain.SomeClass.someMethod(SomeClass.java:233)
at package.reverse.domain.processor.SomeClass.process(SomeClass.java:129)
at package.reverse.domain.processor.someClass.someMethod(SomeClass.java:233)
at package.reverse.domain.processor.SomeClass.process(SomeClass.java:129)
Hidden file after:
package.reverse.domain.process(SomeClass.java:129)
My awk program:
FNR == 1 {
    CMD = "basename " FILENAME
    CMD | getline FILE;
    FILE = "." FILE ".last";
    if (system("[ -f "FILE" ]") == 0) {
        getline FIRSTLINE < FILE;
        close(FILE);
        print FIRSTLINE;
    }
    else {
        FIRSTLINE = "1970-01-01 00:00:00,000";
    }
}
$0 ~ EXPRESSION {
    if (($2 " " $3) > FIRSTLINE) {
        print $0;
        LASTLINE = $2 " " $3;
    }
}
END {
    if (LASTLINE != "") {
        print LASTLINE > FILE;
    }
}
Any assistance with finding out why this is happening would be greatly appreciated.
UPDATE:
BASH Script:
#!/bin/bash
while getopts i:r:e:h:p: option
do
    case "${option}" in
        i) INPUT=${OPTARG};;
        r) RULES=${OPTARG};;
        e) PATFILE=${OPTARG};;
        h) HOST=${OPTARG};;
        p) PORT=${OPTARG};;
        ?) printf "Usage: %s: -i <\"file1.log file2.log\"> -r <\"rules1.awk rules2.awk\"> -e <\"patterns.pat\"> -h <host> -p <port>\n" $0;
           exit 1;;
    esac
done

#prepare expression with sed
EXPRESSION=`cat $PATFILE | sed ':a;N;$!ba;s/\n/|/g'`;
EXPRESSION="^(INFO|DEBUG|WARNING|ERROR|FATAL)[[:space:]]{2}[[:digit:]]{4}\\\\-[[:digit:]]{1,2}\\\\-[[:digit:]]{1,2}[[:space:]][[:digit:]]{1,2}:[[:digit:]]{2}:[[:digit:]]{2},[[:digit:]]{3}.*"$EXPRESSION".*";

#Make sure the temp file is empty
echo "" > .temp;

#input through awk.
for file in $INPUT
do
    awk -v EXPRESSION="$EXPRESSION" -f $RULES $file >> .temp;
done

#send contents of file to splunk indexer over udp
cat .temp;
#cat .temp | netcat -t $HOST $PORT;

#cleanup temporary files
if [ -f .temp ]
then
    rm .temp;
fi
Patterns File (The stuff I want to match):
Warning
Exception
Awk script as above.
Example.log
info 2012-09-04 16:00:11,638 [[adr-com-adaptor-stub].connector.http.mule.default.receiver.02] nz.co.amsco.interop.multidriveinterop: session not initialised
error 2012-09-04 16:00:11,639 [[adr-com-adaptor-stub].connector.http.mule.default.receiver.02] nz.co.amsco.adrcomadaptor.processor.comadaptorprocessor: nz.co.amsco.interop.exceptions.systemdownexception
nz.co.amsco.interop.exceptions.systemdownexception
at nz.co.amsco.adrcomadaptor.processor.comadaptorprocessor.getdeviceconfig(comadaptorprocessor.java:233)
at nz.co.amsco.adrcomadaptor.processor.comadaptorprocessor.process(comadaptorprocessor.java:129)
at org.mule.processor.chain.defaultmessageprocessorchain.doprocess(defaultmessageprocessorchain.java:99)
at org.mule.processor.chain.abstractmessageprocessorchain.process(abstractmessageprocessorchain.java:66)
at org.mule.processor.abstractinterceptingmessageprocessorbase.processnext(abstractinterceptingmessageprocessorbase.java:105)
at org.mule.processor.asyncinterceptingmessageprocessor.process(asyncinterceptingmessageprocessor.java:90)
at org.mule.processor.chain.defaultmessageprocessorchain.doprocess(defaultmessageprocessorchain.java:99)
at org.mule.processor.chain.abstractmessageprocessorchain.process(abstractmessageprocessorchain.java:66)
at org.mule.processor.AbstractInterceptingMessageProcessorBase.processNext(AbstractInterceptingMessageProcessorBase.java:105)
at org.mule.interceptor.AbstractEnvelopeInterceptor.process(AbstractEnvelopeInterceptor.java:55)
at org.mule.processor.AbstractInterceptingMessageProcessorBase.processNext(AbstractInterceptingMessageProcessorBase.java:105)
Usage:
./filter.sh -i "Example.log" -r "rules.awk" -e "patterns.pat" -h host -p port
Note that host and port are both unused in this version as the output is just thrown onto stdout.
So if I run this I get the following output:
info 2012-09-04 16:00:11,638 [[adr-com-adaptor-stub].connector.http.mule.default.receiver.02] nz.co.amsco.interop.multidriveinterop: session not initialised
error 2012-09-04 16:00:11,639 [[adr-com-adaptor-stub].connector.http.mule.default.receiver.02] nz.co.amsco.adrcomadaptor.processor.comadaptorprocessor: nz.co.amsco.interop.exceptions.systemdownexception
at nz.co.amsco.adrcomadaptor.processor.comadaptorprocessor.getdeviceconfig(comadaptorprocessor.java:233)
at nz.co.amsco.adrcomadaptor.processor.comadaptorprocessor.process(comadaptorprocessor.java:129)
If I run it again on the same unchanged file I should get no output however I am seeing:
nz.co.amsco.adrcomadaptor.processor.comadaptorprocessor.process(comadaptorprocessor.java:129)
I have been unable to determine why this is happening.

You didn't provide any sample input that could reproduce your problem so let's start by just cleaning up your script and go from there. Change it to this:
BEGIN {
    expression = "^(INFO|DEBUG|WARNING|ERROR|FATAL)[[:space:]]{2}[[:digit:]]{4}-[[:digit:]]{1,2}-[[:digit:]]{1,2}[[:space:]][[:digit:]]{1,2}:[[:digit:]]{2}:[[:digit:]]{2},[[:digit:]]{3}.*Exception|Warning"
    # Do you really want "(Exception|Warning)" in brackets instead?
    # As written "Warning" on its own will match the whole expression.
}
FNR == 1 {
    tstampFile = "/" FILENAME ".last"
    sub(/.*\//,".",tstampFile)
    if ( (getline prevTstamp < tstampFile) > 0 ) {
        close(tstampFile)
        print prevTstamp
    }
    else {
        prevTstamp = "1970-01-01 00:00:00,000"
    }
    nextTstamp = ""
}
$0 ~ expression {
    currTstamp = $2 " " $3
    if (currTstamp > prevTstamp) {
        print
        nextTstamp = currTstamp
    }
}
END {
    if (nextTstamp != "") {
        print nextTstamp > tstampFile
    }
}
Now, do you still have a problem? If so, show us how you run the script, i.e. the bash command you are executing, and post some small sample input that reproduces your problem.
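The precedence point flagged in the comments above is easy to demonstrate (my example; the class name is made up). ERE alternation binds whole branches, so `^INFO.*Exception|Warning` means `(^INFO.*Exception)` OR `(Warning)`, and an indented stack-trace line containing "Warning" matches even without a leading timestamp:

```shell
line='        at nz.co.WarningHandler.process(Foo.java:12)'
# Ungrouped: the bare "Warning" branch matches the indented line.
echo "$line" | awk '/^INFO.*Exception|Warning/ { print "ungrouped: matched" }'
# Grouped: the timestamp anchor applies to both alternatives; no match.
echo "$line" | awk '/^INFO.*(Exception|Warning)/ { print "grouped: matched" }'
```

Only the first command prints anything, which is exactly the stack-trace leakage described in the question.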

Related

Writing a script for large text file manipulation (iterative substitution of duplicated lines), weird bugs and very slow.

I am trying to write a script which takes a directory containing text files (384 of them) and modifies duplicate lines that have a specific format in order to make them not duplicates.
In particular, I have files in which some lines begin with the '#' character and contain the substring 0:0. A subset of these lines are duplicated one or more times. For those that are duplicated, I'd like to replace 0:0 with i:0 where i starts at 1 and is incremented.
So far I've written a bash script that finds duplicated lines beginning with '#', writes them to a file, then reads them back and uses sed in a while loop to search and replace the first occurrence of the line to be replaced. This is it below:
#!/bin/bash
fdir=$1"*"
#for each fastq file
for f in $fdir
do
    (
        #find duplicated read names and write to file $f.txt
        sort $f | uniq -d | grep ^# > "$f".txt
        #loop over each duplicated readname
        while read in; do
            rname=$in
            i=1
            #while this readname still exists in the file increment and replace
            while grep -q "$rname" $f; do
                replace=${rname/0:0/$i:0}
                sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
                let "i+=1"
            done
        done < "$f".txt
        rm "$f".txt
        rm "$f".bu
    done
    echo "done" >> progress.txt
    )&
    background=( $(jobs -p) )
    if (( ${#background[@]} == 40 )); then
        wait -n
    fi
done
The problem with it is that it's impractically slow. I ran it on a 48-core computer for over 3 days and it hardly got through 30 files. It also seemed to have removed about 10 files and I'm not sure why.
My question is where are the bugs coming from and how can I do this more efficiently? I'm open to using other programming languages or changing my approach.
EDIT
Strangely the loop works fine on one file. Basically I ran
sort $f | uniq -d | grep ^# > "$f".txt
while read in; do
    rname=$in
    i=1
    while grep -q "$rname" $f; do
        replace=${rname/0:0/$i:0}
        sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
        let "i+=1"
    done
done < "$f".txt
To give you an idea of what the files look like below are a few lines from one of them. The thing is that even though it works for the one file, it's slow. Like multiple hours for one file of 7.5 M. I'm wondering if there's a more practical approach.
With regard to the file deletions and other bugs, I have no idea what was happening. Maybe it was running into memory collisions or something when they were run in parallel?
Sample input:
#D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
#D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
Sample output:
#D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
#D00269:138:HJG2TADXX:2:1101:2:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
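Since duplicated headers are byte-for-byte identical, one way to do the whole job without sed-in-a-loop is a two-pass awk: count headers on the first pass, renumber only the duplicated ones on the second. This is my sketch, not from the answers that follow; it assumes the first `0:0` in a duplicated header is the field to renumber:

```shell
# Build a tiny sample in the question's format (file name is made up).
printf '%s\n' \
  '#D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT' \
  'GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA' \
  '#D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT' \
  'CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG' > sample.fq

# Pass 1 (NR==FNR) counts each '#' header; pass 2 renumbers only the
# duplicated headers as 1:0, 2:0, ... and leaves unique headers alone.
awk 'NR == FNR { if (/^#/) count[$0]++; next }
     /^#/ && count[$0] > 1 { sub(/0:0/, ++seen[$0] ":0") }
     1' sample.fq sample.fq
```

Reading the file twice is still only two sequential passes, which should be dramatically faster than re-running grep and sed per duplicate.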
Here's some code that produces the required output from your sample input.
Again, it is assumed that your input file is sorted by the first value (up to the first space character).
time awk '{
    #dbg if (dbg) print "#dbg:prev=" prev
    if (/^#/ && prev!=$1) { fixNum=0; if (dbg) print "prev!=$1=" prev "!=" $1 }
    if (/^#/ && (prev==$1 || NR==1) ) {
        prev=$1
        n=split($1,tmpArr,":") ; n++
        #dbg if (dbg) print "tmpArr[6]="tmpArr[6] "\tfixNum="fixNum
        fixNum++; tmpArr[6]=fixNum;
        # magic to rebuild $1 here
        for (i=1;i<n;i++) {
            tmpFix ? tmpFix=tmpFix":"tmpArr[i]"" : tmpFix=tmpArr[i]
        }
        $1=tmpFix ; $0=$0
        print $0
    }
    else { tmpFix=""; print $0 }
}' file > fixedFile
output
#D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
#D00269:138:HJG2TADXX:2:1101:2:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
I've left a few of the #dbg:... statements in place (but they are now commented out) to show how you can run a small set of data as you have provided, and watch the values of variables change.
Assuming a non-csh, you should be able to copy/paste the code block into a terminal window cmd-line and replace file > fixedFile at the end with your real file name and a new name for the fixed file. Recall that awk 'program' file > file (actually, any ...file>file) will truncate the existing file and then try to write, so you can lose all the data of a file trying to use the same name.
There are probably some syntax improvements that will reduce the size of this code, and there might be 1 or 2 things that could be done that will make the code faster, but this should run very quickly. If not, please post the result of time command that should appear at the end of the run, i.e.
real 0m0.18s
user 0m0.03s
sys 0m0.06s
IHTH
#!/bin/bash
i=4
sort $1 | uniq -d | grep ^# > dups.txt
while read in; do
    if [ $((i % 4)) -eq 0 ] && grep -q "$in" dups.txt; then
        x="$in"
        x=${x/"0:0 "/$i":0 "}
        echo "$x" >> $1"fixed.txt"
    else
        echo "$in" >> $1"fixed.txt"
    fi
    let "i+=1"
done < $1

Grep moving spaces?

{
    while read -r line ; do
        if grep -q '2015' $line
        then
            echo "inside then"
            echo "$line"
            echo "yes"
        fi
    done
} < testinput
Once I execute the above code the output is:
inside then
(nothing is printed on this line, just spaces)
yes
Why is the input line not getting printed on the second output line?
Your help is appreciated. The reason why I am asking is that I actually have to perform a few operations on the input line after the match using grep is successful.
Input file Sample :
2015-07-18-00.07.28.991321-240 I84033A497 LEVEL: Info
PID : 21233902 TID : 9510 PROC : db2sysc 0
INSTANCE: xxxxxxxx NODE : 000 DB : XXXXXXX
APPHDL : 0-8 APPID: *LOCAL.xxxxxxx.150718040727
AUTHID : XXXXXXXX
EDUID : 9510 EDUNAME: db2agent (XXXXXXXX) 0
FUNCTION: DB2 Common, SQLHA APIs for DB2 HA Infrastructure
I need to capture the time when SQLHA shows up in the input file or log file. To do that first I find the match for time in the input file and then I save that time in the variables. Once I find SQLHA I will write the time saved in the variables into an output file. So for every occurrence of SQLHA in the log, I will write the time to the output file.
After the update about what is really wanted, it is fairly clear that you should probably use awk, though sed would also be an option (but harder). You can do it in shell too, though that's messier.
awk '/^2015-/ { datetime = $1 } / SQLHA / { print datetime }' testinput
Using sed:
sed -n -e '/^2015-/ {s/ .*//; h; n; }' -e '/ SQLHA / { x; p; x; }' testinput
(If you find 2015- at the start of a line, remove the stuff after a space and save it in the hold space. If you find SQLHA with spaces on either side, swap the hold and pattern space (thus placing the saved date/time in the pattern space), then print it, then switch it back. The switch back means that if two lines contain SQLHA between occurrences of the date line, you'll get the same date printed twice, rather than a date and then the first of the SQLHA lines. You end up having to think about what can go wrong, as well as what to do when everything goes right — but that may be more for later than right now.)
Using just sh:
while read -r line
do
    case "$line" in
    (2015-*) datetime=$(set -- $line; echo $1);; # Absence of quotes is deliberate
    (*SQLHA*) echo "$datetime";;
    esac
done < testinput
There are many other ways to do that in shell. Some of them are safer than this. It'll work on the data shown safely, but you might get to run against maliciously created data.
while read -r line
do
    case "$line" in
    (2015-*) datetime=$(echo "$line" | sed 's/ .*//');;
    (*SQLHA*) echo "$datetime";;
    esac
done < testinput
This invokes sed once per date line. Using Bash, I guess you can use:
while read -r line
do
    case "$line" in
    (2015-*) datetime="${line/ */}";; # Replace blank and everything after with nothing
    (*SQLHA*) echo "$datetime";;
    esac
done < testinput
This is the least likely to go wrong and avoids executing an external command for each line. You could also avoid case…esac using if and probably [[ so as to get pattern matching. Etc.
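A quick illustration of that pattern-substitution expansion, using a line from the sample input (my example):

```shell
line='2015-07-18-00.07.28.991321-240 I84033A497 LEVEL: Info'
# ${line/ */} deletes the first blank and everything after it,
# leaving just the leading date/time token.
datetime="${line/ */}"
echo "$datetime"   # prints "2015-07-18-00.07.28.991321-240"
```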
Running your script on a Mac, I get error output such as:
grep: 2015-07-18-00.07.28.991321-240: No such file or directory
grep: I84033A497: No such file or directory
grep: LEVEL:: No such file or directory
Are you not seeing that? If you're not, then either you've sent errors to /dev/null (or some other location than the terminal) or you've not shown us exactly the code you're using — or there's a blank line at the top of your testinput file.
This will do what your script is trying to do:
#!/usr/bin/awk -f
/2015/ {
print "inside then"
print
print "yes"
}
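For completeness, the unquoted `$line` in the original loop means grep treats each word of the line as a file name rather than as text to search. Feeding the content on stdin (a pipe here, my sketch of the presumed intent) makes the original three-echo structure behave:

```shell
line='2015-07-18-00.07.28.991321-240 I84033A497 LEVEL: Info'
# Pipe the text to grep so it searches the content, not file names.
if printf '%s\n' "$line" | grep -q '2015'
then
    echo "inside then"
    echo "$line"
    echo "yes"
fi
```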
This is what I have written (very basic). I will try to run the same program with grep and post why I am getting the blank space soon.
{
    while read -r line
    do
        if [[ $line == *2015* ]];
        then
            dtime=`echo $line | cut -c1-26`
        fi
        if [[ $line == *SQLHA* ]];
        then
            echo $dtime
        fi
    done
} < testinput
Input Used:
2015-07-18-00.07.28.991321-240 I84033A497 LEVEL: Info
EDUID : 9510 EDUNAME: db2agent (SIEB_RPT) 0
FUNCTION: DB2 Common,APIs for DB2 HA Infrastructure, sqlhaAmIin
2015-07-18-00.07.29.991321-240 I84033A497 LEVEL: Info
FUNCTION: DB2 Common, SQLHA APIs for DB2 HA Infrastructure, sqlha
2015-07-18-00.07.48.991321-240 I84033A497 LEVEL: Info
EDUID : 9510 EDUNAME: db2agent (SIEB_RPT) 0
FUNCTION: DB2 Common, SQLHA APIs for DB2 HA Infrastructure, sqlha
Output:
2015-07-18-00.07.29.991321
2015-07-18-00.07.48.991321

Slow text parsing in bash script, any advice?

I have written a script, below, to parse a text file, effectively removing line returns. It will take input that looks like this:
TCP 0.0.0.0:135 SVR LISTENING 776
RpcSs
And return this to a new text document
TCP 0.0.0.0:135 SVR LISTENING 776 RpcSs
Some entries span more than two lines, so I was not able to write a script that simply removes the line return from every other line; instead I came up with the approach below. It worked fine for small collects, but a 7MB collect resulted in my computer running out of memory, and it took quite a bit of time to do this before it failed. I'm curious why it ran out of memory, and I'm hoping someone could educate me on a better way to do this.
#!/bin/bash
#
# VARS
writeOuput=""
#
while read line
do
    curLine=$line #grab current line from document
    varWord=$(echo $curLine | awk '{print $1}') #grab first word from each line
    if [ "$varWord" == "TCP" ] || [ "$varWord" == "UDP" ]; then
        #echo "$curLine" >> results.txt
        unset writeOutput
        writeOutput=$curLine
    elif [ "$varWord" == "Active" ]; then #new session
        printf "\n" >> results1.txt
        printf "New Session" >> results1.txt
        printf "\n" >> results1.txt
    else
        writeOutput+=" $curLine"
        #echo "$writeOutput\n"
        printf "$writeOutput\n" >> results1.txt
        #sed -e '"$index"s/$/"$curLine"'
    fi
done < $1
Consider replacing the line with the awk call with this line:
varWord=${curLine%% *} #grab first word from each line
This saves the fork that happens in each iteration by using Bash-internal functionality only and should make your program run several times faster. See also that other guy's comment linking to this answer for an explanation.
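A quick check of that expansion on one line from the sample (my example):

```shell
curLine='TCP 0.0.0.0:135 SVR LISTENING 776'
# %% * strips the longest suffix starting at a blank, i.e. everything
# from the first space onward, leaving just the first word.
varWord=${curLine%% *}
echo "$varWord"   # prints "TCP"
```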
As others have noted, the main bottleneck in your script is probably the forking where you pass each line through its own awk instance.
I have created an awk script which I hope does the same as your bash script, and I suspect it should run faster. Initially I just thought about replacing newlines with spaces, and manually adding newlines in front of every TCP or UDP, like this:
awk '
BEGIN {ORS=" "};
$1~/(TCP|UDP)/ {printf("\n")};
{print};
END {printf("\n")}
' <file>
But your script removes the 'Active' lines from the output, and adds three new lines before the line. You could, of course, pipe this through a second awk command:
awk '/Active/ {gsub(/Active /, ""); print("\nNew Session\n")}; {print}'
But this awk script is a bit closer to what you did with bash, but it should still be considerably faster:
$ cat join.awk
$1~/Active/ {print("\nNew Session\n"); next}
$1~/(TCP|UDP)/ {if (output) print output; output = ""}
{if (output) output = output " " $0; else output = $0}
END {print output}
$ awk -f join.awk <file>
First, it checks whether the line begins with the word "Active"; if it does, it prints the three lines and goes on to the next input line.
Otherwise it checks for the presence of TCP or UDP as the first word. If it finds them, it prints what has accumulated in output (provided there is something in the variable), and clears it.
It then adds whatever it finds in the line to output.
At the end, it prints what has accumulated since the last TCP or UDP.
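Run against the two-line sample from the question (with the script inlined here so the command is self-contained), this produces the expected joined line:

```shell
printf '%s\n' 'TCP 0.0.0.0:135 SVR LISTENING 776' 'RpcSs' |
awk '$1~/Active/ {print("\nNew Session\n"); next}
     $1~/(TCP|UDP)/ {if (output) print output; output = ""}
     {if (output) output = output " " $0; else output = $0}
     END {print output}'
# prints: TCP 0.0.0.0:135 SVR LISTENING 776 RpcSs
```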

Better way of extracting data from file for comparison

Problem: Comparison of files from Pre-check status and Post-check status of a node for specific parameters.
With some help from the community, I have written the following solution which extracts the information from files in the pre and post directories based on the "Node-ID" (which happens to be unique and is to be extracted from the files as well). After extracting the data from the Pre/Post folders, I have created folders based on the node-id and dumped the files into them.
My Code to extract data (The data is extracted from Pre and Post folders)
FILES=$(find postcheck_logs -type f -name *.log)
for f in $FILES
do
    NODE=`cat $f | grep -m 1 ">" | awk '{print $1}' | sed 's/[>]//g'` ##Generate the node-id
    echo "Extracting Post check information for " $NODE
    mkdir temp/$NODE-post ## create a temp directory
    cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param1/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param1.txt ## extract data
    cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param2/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param2.txt
    cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param3/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param3.txt
done
After this I have a structure as:
/Node1-pre/param1.txt
/Node1-post/param1.txt
and so on.
Now I am stuck on comparing the $NODE-pre and $NODE-post files. I have tried to do it using recursive grep, but I am not finding a suitable way to do so. What is the best possible way to compare these files using diff?
Moreover, I find the above data extraction program very slow. I believe it's not the best possible way (using least resources) to do so. Any suggestions?
Look askance at any instance of cat one-file — you could use I/O redirection on the next command in the pipeline instead.
You can do the whole thing more simply with:
for f in $(find postcheck_logs -type f -name *.log)
do
    NODE=$(sed -n '/>/{ s/ .*//; s/>//g; p; q; }' $f) ##Generate the node-id
    echo "Extracting Post check information for $NODE"
    mkdir temp/$NODE-post
    awk -v NODE="$NODE" -v DIR="temp/$NODE-post" \
        'BEGIN { RS = NODE "> " }
         /^param1/ { param1 = $0 }
         /^param2/ { param2 = $0 }
         /^param3/ { param3 = $0 }
         END {
             print RS param1 > DIR "/param1.txt"
             print RS param2 > DIR "/param2.txt"
             print RS param3 > DIR "/param3.txt"
         }' $f
done
The NODE finding process is much better done by a single sed command (note the -n so only the node-id line is printed) than cat | grep | awk | sed, and you should plan to use $(...) rather than back-quotes everywhere.
The main processing of the log file should be done once; a single awk command is sufficient. The script is passed two variables, NODE and the directory name. The BEGIN is cleaned up; the $ before NODE was probably not what you intended. The main actions are very similar; each looks for the relevant parameter name and saves it in an appropriate variable. At the end, it writes the saved values to the relevant files, decorated with the value of RS. Semicolons are only needed when there's more than one statement on a line; there's just one statement per line in this expanded script. It looks bigger than the original, but that's only because I'm using vertical space.
As to comparing the before and after files, you can do it in many ways, depending on what you want to know. If you've got a POSIX-compliant diff (you probably do), you can use:
diff -r temp/$NODE-pre temp/$NODE-post
to report on the differences, if any, between the contents of the two directories. Alternatively, you can do it manually:
for file in param1.txt param2.txt param3.txt
do
    if cmp -s temp/$NODE-pre/$file temp/$NODE-post/$file
    then : No difference
    else diff temp/$NODE-pre/$file temp/$NODE-post/$file
    fi
done
Clearly, you can wrap that in a 'for each node' loop. And, if you are going to need to do that, then you probably do want to capture the output of the find command in a variable (as in the original code) so that you do not have to repeat that operation.
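The cmp -s gate relies purely on the exit status (0 when the files are identical, non-zero when they differ); a throwaway check with temporary files (my example):

```shell
# Two identical files: cmp -s succeeds, so the then-branch runs.
dir=$(mktemp -d)
printf 'param1 value\n' > "$dir/pre.txt"
printf 'param1 value\n' > "$dir/post.txt"
if cmp -s "$dir/pre.txt" "$dir/post.txt"
then echo "No difference"
else diff "$dir/pre.txt" "$dir/post.txt"
fi
rm -rf "$dir"
# prints: No difference
```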

converting the hash tag timestamps in history file to desired string

When I store the output of the history command via ssh in a file, I get something like this:
ssh -i private_key user@ip 'export HISTFILE=~/.bash_history; export HISTTIMEFORMAT="%D-%T "; set -o history; history' > myfile.txt
OUTPUT
#1337431451
command
As far as I've learnt, this hash string represents a timestamp. How do I change it to a string of my desired format?
P.S. Using history over ssh is not outputting timestamps, and I have tried almost everything, so I guess the next best thing would be to convert these # timestamps to a readable date-time format myself. How do I go about it?
You can combine rows with the paste command:
paste -sd '#\n' .bash_history
and convert the date with strftime in awk:
echo 1461136015 | awk '{print strftime("%d/%m/%y %T",$1)}'
As a result, a bash history file with timestamps can be parsed by the following command:
paste -sd '#\n' .bash_history | awk -F"#" '{d=$2 ; $2="";print NR" "strftime("%d/%m/%y %T",d)" "$0}'
which converts:
#1461137765
echo lala
#1461137767
echo bebe
to
1 20/04/16 10:36:05 echo lala
2 20/04/16 10:36:07 echo bebe
You can also create a script like /usr/local/bin/fhistory with this content:
#!/bin/bash
paste -sd '#\n' $1 | awk -F"#" '{d=$2 ; $2="";print NR" "strftime("%d/%m/%y %T",d)" "$0}'
and quickly parse a bash history file with:
fhistory .bash_history
Interesting question: I have tried it but found no simple and clean solution to access the history in a non-interactive shell. However, the format of the history file is simple, and you can write a script to parse it. The following Python script might be interesting. Invoke it with ssh -i private_key user@ip 'path/to/script.py .bash_history':
#! /usr/bin/env python3
import re
import sys
import time

if __name__ == '__main__':
    pattern = re.compile(br'^#(\d+)$')
    out = sys.stdout.buffer
    for pathname in sys.argv[1:]:
        with open(pathname, 'rb') as f:
            for line in f:
                timestamp = 0
                while line.startswith(b'#'):
                    match = pattern.match(line)
                    if match: timestamp, = map(int, match.groups())
                    line = next(f)
                out.write(time.strftime('%F %T ', time.localtime(timestamp)).encode('ascii'))
                out.write(line)
Using just Awk and in a slightly more accurate way:
awk -F\# '/^#1[0-9]{9}$/ { if(cmd) printf "%5d %s %s\n",n,ts,cmd;
          ts=strftime("%F %T",$2); cmd=""; n++ }
          !/^#1[0-9]{9}$/ { if(cmd)cmd=cmd " " $0; else cmd=$0 }
          END { if(cmd) printf "%5d %s %s\n",n,ts,cmd }' .bash_history
This parses only lines starting with something that looks like a timestamp (/^#1[0-9]{9}$/), compiles all subsequent lines up until the next timestamp, combines multi-line commands with " " (1 space) and prints the commands in a format similar to history including a numbering.
Note that the numbering does not (necessarily) match if there are multi-line commands.
Without the numbering and breaking up multi-line commands with a newline:
awk -F\# '/^#1[0-9]{9}$/ { if(cmd) printf "%s %s\n",ts,cmd;
          ts=strftime("%F %T",$2); cmd="" }
          !/^#1[0-9]{9}$/ { if(cmd)cmd=cmd "\n" $0; else cmd=$0 }
          END { if(cmd) printf "%s %s\n",ts,cmd }' .bash_history
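For a self-contained check, here is that variant run on a two-entry sample (my example; an END block is included so the final command is printed too, strftime needs GNU awk, and the exact dates depend on your timezone):

```shell
# Sample history file with two timestamped entries (name is made up).
printf '%s\n' '#1461137765' 'echo lala' '#1461137767' 'echo bebe' > hist.sample
awk -F\# '/^#1[0-9]{9}$/ { if (cmd) printf "%s %s\n", ts, cmd
                           ts = strftime("%F %T", $2); cmd = "" }
          !/^#1[0-9]{9}$/ { if (cmd) cmd = cmd "\n" $0; else cmd = $0 }
          END { if (cmd) printf "%s %s\n", ts, cmd }' hist.sample
rm -f hist.sample
```

Both commands come out, each prefixed by its converted timestamp.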
Finally, a quick and dirty solution using GNU Awk (gawk) to also sort the list:
gawk -F\# -v histtimeformat="$HISTTIMEFORMAT" '
/^#1[0-9]{9}$/ { i=$2 FS NR; cmd[i]="" }
!/^#1[0-9]{9}$/ { if(cmd[i]) cmd[i]=cmd[i] "\n" $0; else cmd[i]=$0 }
END { PROCINFO["sorted_in"] = "#ind_str_asc"
for (i in cmd) { split(i,arr)
print strftime(histtimeformat,arr[1]) cmd[i]
}
}'
