How to run a program and parse the output using KornShell - shell

I am pretty new to KornShell (ksh). I have a project that needs to be done with ksh. The question is:
Please write a ksh script which will run the ‘bonnie’ benchmark
utility and parse the output to grab the values for block write, block
read and random seeks/s. Also consider how you might use these values
to compare to the results from previous tests. For the purpose of
this test, please limit yourself to standard GNU utilities (sed, awk,
grep, cut, etc.).
Here is the output from the ‘bonnie’ utility:
# bonnie -s 50M -d /tmp
File '/tmp/Bonnie.2001096837', size: 52428800
Writing with putc()...done
Rewriting...done
Writing intelligently...done
Reading with getc()...done
Reading intelligently...done
Seeker 1.S.e.eker 2.S.e.eker 3...start 'em...done...done...done...
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
50.0 36112 34.1 138026 1.9 179048 7.0 51361 51.1 312242 4.3 15211.4 10.3
Any suggestion on how to write this script would be really appreciated.
Thanks for reading!

Here's a simple solution to experiment with, that assumes the last line will always contain the data you want:
# -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
# Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
# 50.0 36112 34.1 138026 1.9 179048 7.0 51361 51.1 312242 4.3 15211.4 10.3
# block write, block read and random seeks/s
bonnie++ \
| awk '
    { line = $0 }
    END {
        # print "#dbg: last_line=" line
        split(line, lineArr)
        printf("blkWrt=%s\tblkRd=%s\tRandSks=%s\n", lineArr[4], lineArr[8], lineArr[12])
    }' # > bonnieOutput
       # --^^ remove the # to write the output to a file
(Note that the \ char after bonnie++ must be the last character on the line, NO SPACES OR TABS allowed!!! (It will blow up otherwise!) ;-) )
Awk reads all lines of input passed through the pipe. The {line=$0} rule saves each line as it is read, so by the time the END{} block runs, line holds the last line of input. split(line, lineArr) breaks that line into fields in lineArr[], and the printf prints just the elements you want, using the index number of each field: lineArr[4] is the 4th field of that last line, lineArr[12] the 12th, and so on. You may have to adjust the index numbers to get the data you want to display. (You'll have to figure that out! ;-)
To save the data to a file, use shell redirection by uncommenting it (removing the # char between }' and > bonnieOutput). Leave the # in place until you get the output you need, THEN you can redirect it to a file.
Needless to say, the labels I've used in the printf, like blkWrt=, are mostly for debugging. Once you are sure about what data you need to capture, and that it reliably appears in the same position each time, you can remove those labels and you'll have a nice clean datafile that you can process with other programs.
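On the "compare to the results from previous tests" part of the exercise, here is a minimal sketch, assuming you have saved an earlier run to a file called bonnieBaseline and the latest run to bonnieOutput (both names are just placeholders), each holding one tab-separated line of label=value pairs as produced by the printf above:
# Compare the latest run against a saved baseline, metric by metric.
paste bonnieBaseline bonnieOutput \
| awk -F'\t' '{
    n = NF / 2
    for (i = 1; i <= n; i++) {
        split($i,       old, "=")    # e.g. old[1]="blkWrt", old[2]="138026"
        split($(i + n), new, "=")
        printf("%s: baseline=%s current=%s delta=%+.1f\n", old[1], old[2], new[2], new[2] - old[2])
    }
}'
paste glues the two one-line files together with a tab, so the first half of the fields come from the baseline and the second half from the current run; awk then reports the difference per metric.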
Keep in mind that almost all Unix toolbox utilities are line oriented; that is, they expect to process one line of data at a time, and there are often tricks to see what is being processed. Note the #dbg line I've included at the top of the END{} block. You'll have to remove the '#' to uncomment it and see the debug output.
There's a lot more that can be done, but if you want to learn the ksh/Unix toolbox with awk, you'll have to spend the time understanding what the features are. If you've read the chapter that included the question you're working on and don't understand how to even start solving this problem, maybe you had better read the chapter again, OK? ;-)
Edit
Note that in awk, the variable $0 contains all the text of the current line (a "line" being defined by the RS variable, usually the Unix line-ending character, \n). The other numbered variables, e.g. $1, $2, refer to the first, second, etc. "field" on the current line ($0).
Based on my new understanding from your comment below, you want to extract values from lines that contain the text "Latency". This is even easier to process. The basic pattern will be
bonnie++ \
| awk '
    /Latency/ {
        # print "#dbg: latency_line=" $0
        printf("blkWrt=%s\tblkRd=%s\tRandSks=%s\n", $4, $8, $12)
    }' # > bonnieOutput
So this code says: read all output from bonnie++ into awk through the pipe, and when you find a line containing the text "Latency", print the values found in the 4th, 8th, and 12th fields, using a printf format string with self-describing tags like blkWrt=.
You'll have to change the $4, etc. to match the correct field numbers on that line for each element of data, i.e. maybe it's $5, $9, $13, or $3, $9, $24? OK?
Note that /Latency/ is case sensitive, and if there are other places in the output where the word appears, then we'll have to revise the reg-exp "rule" used to filter the output.
As a learning exercise, and as a very basic tool that any Unix person uses every day, skip awk and just see what bonnie++ | grep 'Latency' gets you.
IHTH

Just got the answer with the help of Shellter!
bonnie++ \
| awk '
    /Machine/ { f = 1; next }    # the "Machine ..." header line: data follows it
    f {
        print "#dbg: line_needed=" $0
        # $4 = block write K/sec, $10 = block read K/sec, $12 = random seeks/s (per the sample data line above)
        printf("blkWrt=%s\t blkRd=%s\t RandSks=%s\n", $4, $10, $12); exit
    }'

Related

Sum time output from processes (bash)

I've made a script which measures the time of some processes. This is the file that I get:
real 0m6.768s
real 0m5.719s
real 0m5.173s
real 0m4.245s
real 0m5.257s
real 0m5.479s
real 0m6.446s
real 0m5.418s
real 0m5.654s
The command I use to get the time is this one:
{ time my-command } |& grep real >> times.txt
What I need is to sum all these times and get the result as (hours, if applicable,) minutes and seconds, using a bash script.
From man bash (if your PAGER is less, search with /time):
If the time reserved word precedes a pipeline, the elapsed as well as user and system time consumed by its execution are reported when the pipeline terminates. The -p option changes the output format to that specified by POSIX. The TIMEFORMAT variable may be set to a format string that specifies how the timing information should be displayed; see the description of TIMEFORMAT under Shell Variables below.
then search with /TIMEFORMAT:
The optional l specifies a longer format, including minutes, of the form MMmSS.FFs. The value of p determines whether or not the fraction is included.
If this variable is not set, bash acts as if it had the value $'\nreal\t%3lR\nuser\t%3lU\nsys\t%3lS'. If the value is null, no timing information is displayed. A trailing newline is added when the format string is displayed.
If it can be changed to something like
TIMEFORMAT=$'\nreal\t%3R'
without the l, it may be easier to sum.
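For instance, a minimal sketch of generating times.txt with that simpler format (my-command stands in for whatever you are timing):
TIMEFORMAT=$'\nreal\t%3R'    # plain seconds, no m/s suffixes
{ time my-command; } |& grep real >> times.txt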
Note also that the format may depend on the locale LANG:
compare
(LANG=fr_FR.UTF-8; time sleep 1)
and
(LANG=C; time sleep 1)
In that case the sum can be done with an external tool like awk
awk '/^real/ {sum+=$2} END{print sum} ' times.txt
or perl
perl -aln -e '$sum+=$F[1] if /^real/; END{print $sum}' times.txt
Pipe the output to this command
grep real | awk '{ gsub("m","*60+",$2); gsub("s","+",$2); printf("%s",$2); } END { printf("0\n"); }' | bc
This should work if you have generated the output using the built-in time command. The output is in seconds.
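If you also want the total broken back out into hours, minutes and seconds, as the question asks, here is a minimal sketch that works directly on the default "real 0m6.768s" format shown above (no TIMEFORMAT change needed):
awk '/^real/ {
    split($2, t, "[ms]")             # t[1] = minutes, t[2] = seconds
    total += t[1] * 60 + t[2]
}
END {
    printf("%dh %dm %.3fs\n", int(total / 3600), int((total % 3600) / 60), total % 60)
}' times.txt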

End of Line Overflow in start of next line

So I have come across an AWK script that used to be working on HP-UX but has been ported over to RHEL6.4/6.5. It does some work to create headers and trailers in a file and the main script body handles the record formatting.
The problem I am seeing when it runs now is that the last letter from the first line flows onto the start of the next line. Then the last two letters of the second line flow into the start of the third and so on.
This is the section of the script that deals with the record formatting:
ls_buffer=ls_buffer $0;
while (length(ls_buffer)>99) {
    if (substr(ls_buffer,65,6)=="STUFF") {
        .....do some other stuff
    } else {
        if (substr(ls_buffer,1,1)!="\x01f" && substr(ls_buffer,1,1)!="^") {
            printf "%-100s\n", substr(ls_buffer,1,100);
        }
    };
    #----remove 1st 100 chars in string ls_buffer
    ls_buffer=substr(ls_buffer,100);
}
To start with, it looked like the file had picked up some LF, CR and FF characters, so I removed them with gsub hex replacements further up the code, but it is still ending the line at 100 and then re-printing the last character at the start of the second line.
This is some sample test output just in case it helps:
1234567890123456789012345678901 00000012345TESTS SUNDRY PAYME130 DE TESTLLAND GROUP
P1234567890123456789012345678901 00000012345TESTS SUNDRY PAYME131 TESTS RE TESTSLIN
NS1234567890123456789012345678901 00000012345TESTS SUNDRY PAYME132 TESTINGS MORTGAG
GES1234567890123456789012345678901 00000012345TESTS SUNDRY PAYME937 TESTS SUNDRY PA
Can anyone offer any suggestions as to why this is happening? Any help would be appreciated.
The problem here seems to be that the offsets are incorrect in the manual buffer printing loop.
Specifically, the loop prints 100 characters from the buffer but then strips only 99 characters off the front of the buffer (despite the comment's claim to the contrary).
The substr function in awk starts at the character position of its second argument. So to drop x characters from the front of the string you need to use x+1 as the argument to substr.
Example:
# Print the first ten characters from the string.
$ awk 'BEGIN {f="12345678901234567890"; print substr(f, 1, 10)}'
1234567890
# Attempt to chop off the first ten characters from the string.
$ awk 'BEGIN {f="12345678901234567890"; print substr(f, 10)}'
01234567890
# Correctly chop off the first ten characters from the string.
$ awk 'BEGIN {f="12345678901234567890"; print substr(f, 11)}'
1234567890
So the ls_buffer=substr(ls_buffer,100); line in the original script would seem to need to be ls_buffer=substr(ls_buffer,101); instead.
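As a minimal sketch (with the "STUFF" branch elided and the \x01f test kept as in the original), the corrected loop would look like this:
ls_buffer = ls_buffer $0;
while (length(ls_buffer) > 99) {
    chunk = substr(ls_buffer, 1, 100);
    if (substr(chunk, 65, 6) == "STUFF") {
        # ..... other handling, as in the original script
    } else {
        if (substr(chunk, 1, 1) != "\x01f" && substr(chunk, 1, 1) != "^") {
            printf "%-100s\n", chunk;
        }
    }
    # drop the 100 characters just consumed -- note the 101
    ls_buffer = substr(ls_buffer, 101);
}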
Given that you say the original script was working, however, I have to wonder whether the version of awk on that HP-UX machine had a slightly different interpretation of substr (not that I see how that could be possible).
The above aside this seems like a very odd way to go about this business (manually assembling a buffer and then chopping it up) but without seeing the input and the rest of the script I can't comment much more in that direction.

Parsing the output of Bash's time builtin

I'm running a C program from a Bash script, and running it through a command called time, which outputs some time statistics for the running of the algorithm.
If I were to perform the command
time $ALGORITHM $VALUE $FILENAME
It produces the output:
real 0m0.435s
user 0m0.430s
sys 0m0.003s
(The values depend on the particular run of the algorithm.)
However, what I would like to be able to do is to take the 0.435 and assign it to a variable.
I've read into awk a bit, enough to know that if I pipe the above command into awk, I should be able to grab the 0.435 and place it in a variable. But how do I do that?
Many thanks
You must be careful: there's the Bash builtin time and there's the external command time, usually located in /usr/bin/time (type type -a time to see all the time commands available on your system).
If your shell is Bash, when you issue
time stuff
you're calling the builtin time. You can't directly catch the output of time without some minor trickery. This is because time doesn't want to interfere with possible redirections or pipes you'll perform, and that's a good thing.
To get time output on standard out, you need:
{ time stuff; } 2>&1
(grouping and redirection).
Now, about parsing the output: parsing the output of a command is usually a bad idea, especially when it's possible to do without. Fortunately, Bash's time command accepts a format string. From the manual:
TIMEFORMAT
The value of this parameter is used as a format string specifying how the timing information for pipelines prefixed with the time reserved word should be displayed. The % character introduces an escape sequence that is expanded to a time value or other information. The escape sequences and their meanings are as follows; the braces denote optional portions.
%%
A literal `%`.
%[p][l]R
The elapsed time in seconds.
%[p][l]U
The number of CPU seconds spent in user mode.
%[p][l]S
The number of CPU seconds spent in system mode.
%P
The CPU percentage, computed as (%U + %S) / %R.
The optional p is a digit specifying the precision, the number of fractional digits after a decimal point. A value of 0 causes no decimal point or fraction to be output. At most three places after the decimal point may be specified; values of p greater than 3 are changed to 3. If p is not specified, the value 3 is used.
The optional l specifies a longer format, including minutes, of the form MMmSS.FFs. The value of p determines whether or not the fraction is included.
If this variable is not set, Bash acts as if it had the value
$'\nreal\t%3lR\nuser\t%3lU\nsys\t%3lS'
If the value is null, no timing information is displayed. A trailing newline is added when the format string is displayed.
So, to fully achieve what you want:
var=$(TIMEFORMAT='%R'; { time $ALGORITHM $VALUE $FILENAME; } 2>&1)
As @glennjackman points out, if your command sends any messages to standard output and standard error, you must take care of that too. For that, some extra plumbing is necessary:
exec 3>&1 4>&2
var=$(TIMEFORMAT='%R'; { time $ALGORITHM $VALUE $FILENAME 1>&3 2>&4; } 2>&1)
exec 3>&- 4>&-
Source: BashFAQ032 on the wonderful Greg's wiki.
You could try the awk command below, which uses the split function to split the second field on a digit followed by m, or on the trailing s.
$ foo=$(awk '/^real/{split($2,a,"[0-9]m|s$"); print a[2]}' file)
$ echo "$foo"
0.435
You can use this awk:
var=$(awk '$1=="real"{gsub(/^[0-9]+[hms]|[hms]$/, "", $2); print $2}' file)
echo "$var"
0.435

bash: intensive rw operations cause to damaged files

I have big txt files (say 1,000,000 lines each) and I want to sort them by some field and write the data to different output files in several dirs (one input file, one output dir). I can do it simply with awk:
awk '{print $0 >> "dir_"'$i'"/"$1".some_suffix"}' some_file;
If I process the files one by one it always works well, but if I try to work on many files at the same time, I usually (not always) get some truncated output files (I know exactly how many fields to expect; it's always the same in my case, so it's easy to find the bad files). I use a command like
for i in <input_files>; do
    awk '{print $0 >> "dir_"'$i'"/"$1".some_suffix"}' < $i &
done
so each process creates files in its own output dir. I also tried to parallelize it with xargs and got the same result: some random files were truncated.
How could this happen? Is it RAM, or a filesystem cache problem? Any suggestions?
Hardware: the RAM is not ECC, and the processors are AMD Opteron 6378. I used an SSD (Plextor M5S) and tmpfs, with ext4 and reiserfs (the output files are small).
You are probably running out of file descriptors in your awk process; if you check carefully you'll find that maybe only the first 1021 or so unique filenames work (just under a power of 2; check ulimit -n for the limit). Using print ... >> does not have the same behaviour as in a shell: it leaves the file open.
I assume you are using something more contemporary than a vintage awk, e.g. GNU's gawk:
https://www.gnu.org/software/gawk/manual/html_node/Close-Files-And-Pipes.html
Similarly, when a file or pipe is opened for output, awk remembers the file name or command associated with it, and subsequent writes to the same file or command are appended to the previous writes. The file or pipe stays open until awk exits.
This implies that special steps are necessary in order to read the same file again from the beginning, or to rerun a shell command (rather than reading more output from the same command). The close() function makes these things possible:
close(filename)
Try it with close():
gawk '{
    outfile="dir_"'$i'"/"$1".some_suffix"
    print $0 >> outfile
    close(outfile)
}' some_file;
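Closing after every single line is safe but can be slow when consecutive lines often go to the same file. As a hedged variant of the same idea (it only pays off when lines for the same key tend to arrive together, e.g. on sorted input), you can close only when the destination changes:
gawk '{
    outfile = "dir_"'$i'"/"$1".some_suffix"
    if (outfile != prev) {           # destination changed: release the previous descriptor
        if (prev != "") close(prev)
        prev = outfile
    }
    print $0 >> outfile
}' some_file;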
gawk offers the special ERRNO variable, which can be used to catch certain errors; sadly it is not set on output-redirection errors, so this condition cannot easily be detected. However, gawk does detect this condition internally (error EMFILE during an open operation) and attempts to close a not-recently-used file descriptor so that it can continue, but this isn't guaranteed to work in every situation.
With gawk, you can use --lint for various run-time checks, including hitting the file-descriptor limit and failure to explicitly close files:
$ seq 1 1050 | gawk --lint '{outfile="output/" $1 ".out"; print $0 >> outfile;}'
gawk: cmd. line:1: (FILENAME=- FNR=1022) warning: reached system limit for open files:
starting to multiplex file descriptors
gawk: (FILENAME=- FNR=1050) warning: no explicit close of file `output/1050.out' provided
gawk: (FILENAME=- FNR=1050) warning: no explicit close of file `output/1049.out' provided
[...]
gawk: (FILENAME=- FNR=1050) warning: no explicit close of file `output/1.out' provided

optimizing this script to match lines of one txt file with another

Okay, so I am at best a novice in bash scripting, but I wrote this very small script late last night to take the first 40 characters of each line of a fairly large text file (~300,000 lines), search through a much larger text file (~2.2 million lines) for matches, and then output all of the matching lines into a new text file.
so the script looks like this:
#!/bin/bash
while read -r line
do
    match=${line:0:40}
    grep "$match" large_list.txt
done < "small_list.txt"
and then calling the script like so:
$ bash my_script.sh > outputfile.txt &
and this gives me all the common elements between the two lists. Now this is all well and good, and it slowly works, but I am running this on an m1.small EC2 instance (fair enough, the processing power on it is poor; I could spin up a larger instance to handle all this, or do it on my desktop and upload the file). However, I would rather learn a more efficient way of accomplishing the same task, and I can't quite seem to figure it out. Any tidbits on how best to go about this, or on completing the task more efficiently, would be very much appreciated.
To give you an idea of how slow this is: I started the script about 10 hours ago and I am about 10% of the way through all the matches.
Also, I am not set on using bash, so scripts in other languages are fair game. I figure the pros on S.O. can easily improve on my rock-for-a-hammer approach.
edit: adding inputs and outputs and more information about the data
input: (small text file)
8E636C0B21E42A3FC6AA3C412B31E3C61D8DD062|Vice S01E09 HDTV XviD-FUM[ettv]|Video TV|http://bitsnoop.com/vice-s01e09-hdtv-xvid-fum-ettv-q49614889.html|http://torrage.com/torrent/36A02E282D49EB7D94ACB798654829493CA929CB.torrent
3B9403AD73124A84AAE12E83A2DE446149516AC3|Sons of Guns S04E08 HDTV XviD-FUM[ettv]|Video TV|http://bitsnoop.com/sons-of-guns-s04e08-hdtv-xvid-fum-e-q49613491.html|http://torrage.com/torrent/3B9403AD73124A84AAE12E83A2DE446149516AC3.torrent
C4ADF747050D1CF64E9A626CA2563A0B8BD856E7|Save Me S01E06 HDTV XviD-FUM[ettv]|Video TV|http://bitsnoop.com/save-me-s01e06-hdtv-xvid-fum-ettv-q49515711.html|http://torrage.com/torrent/C4ADF747050D1CF64E9A626CA2563A0B8BD856E7.torrent
B71EFF95502E086F4235882F748FB5F2131F11CE|Da Vincis Demons S01E08 HDTV x264-EVOLVE|Video TV|http://bitsnoop.com/da-vincis-demons-s01e08-hdtv-x264-e-q49515709.html|http://torrage.com/torrent/B71EFF95502E086F4235882F748FB5F2131F11CE.torrent
match against (large text file)
86931940E7F7F9C1A9774EA2EA41AE59412F223B|0|0
8E636C0B21E42A3FC6AA3C412B31E3C61D8DD062|4|2|20705|9550|21419
ADFA5DD6F0923AE641F97A96D50D6736F81951B1|0|0
CF2349B5FC486E7E8F48591EC3D5F1B47B4E7567|1|0|429|428|22248
290DF9A8B6EC65EEE4EC4D2B029ACAEF46D40C1F|1|0|523|446|14276
C92DEBB9B290F0BB0AA291114C98D3FF310CF0C3|0|0|21448
Output:
8E636C0B21E42A3FC6AA3C412B31E3C61D8DD062|4|2|20705|9550|21419
Additional clarification: basically there is a hash, the first 40 characters of each line of the input file (a file I have already reduced to about 15% of its original size). For each line in this file there is a matching hash in the larger text file (the one I am matching against) with some corresponding information, and it is that line in the larger file that I would like written to a new file, so that in the end I have a 1:1 ratio of lines in the smaller text file to lines in my output_file.txt.
In this case I am showing the first line of the input being matched (line 2 of the larger file) and then written to an output file.
awk solution adopted from this answer:
awk -F"|" 'NR==FNR{a[$1]=$2;next}{if (a[$1]) print}' small.txt large.txt
some python to the rescue.
I created two text-files using the following snippet:
#!/usr/bin/env python
import random
import string

N = 2000000
for i in range(N):
    s = ''.join(random.choice(string.ascii_uppercase + string.digits) for x in range(40))
    print s + '|4|2|20705|9550|21419'
one with 300k lines and one with 2M lines. This gives me the following files:
$ ll
-rwxr-xr-x 1 210 Jun 11 22:29 gen_random_string.py*
-rw-rw-r-- 1 119M Jun 11 22:31 large.txt
-rw-rw-r-- 1 18M Jun 11 22:29 small.txt
Then I appended a line from small.txt to the end of large.txt so that I had a matching pattern
Then some more python:
#!/usr/bin/env python
target = {}
with open("large.txt") as fd:
    for line in fd:
        target[line.split('|')[0]] = line.strip()

with open("small.txt") as fd:
    for line in fd:
        if line.split('|')[0] in target:
            print target[line.split('|')[0]]
Some timings:
$ time ./comp.py
3A8DW2UUJO3FYTE8C5ESE25IC9GWAEJLJS2N9CBL|4|2|20705|9550|21419
real 0m2.574s
user 0m2.400s
sys 0m0.168s
$ time awk -F"|" 'NR==FNR{a[$1]=$2;next}{if (a[$1]) print}' small.txt large.txt
3A8DW2UUJO3FYTE8C5ESE25IC9GWAEJLJS2N9CBL|4|2|20705|9550|21419
real 0m4.380s
user 0m4.248s
sys 0m0.124s
Update:
To conserve memory, do the dictionary-lookup the other way
#!/usr/bin/env python
target = {}
with open("small.txt") as fd:
    for line in fd:
        target[line.split('|')[0]] = line.strip()

with open("large.txt") as fd:
    for line in fd:
        if line.split('|')[0] in target:
            print line.strip()
