Quickly generating test data (UUIDs, large random numbers, etc) with bash scripting - bash

I have a small bash script with a function containing a case statement which echoes random data if the 1st argument matches the case parameter.
Code is as follows:
#!/usr/bin/env bash
AC='auto-increment'
UUID='uuid'
LAT='lat'
LONG='long'
IP='ip'
generate_mock_data() {
  # ARGS: $1 - data type, $2 - loop index
  case ${1} in
    ${AC})
      echo ${2} ;;
    ${UUID})
      uuidgen ;;
    ${LAT})
      echo $((RANDOM % 180 - 90)).$(shuf -i1000000-9999999 -n1) ;;
    ${LONG})
      echo $((RANDOM % 360 - 180)).$(shuf -i1000000-9999999 -n1) ;;
    ${IP})
      echo $((RANDOM%256)).$((RANDOM%256)).$((RANDOM%256)).$((RANDOM%256)) ;;
  esac
}
# Writing data to file
headers=('auto-increment' 'uuid' 'lat' 'long' 'ip')
for i in {1..2500}; do
  for header in "${headers[@]}"; do
    echo -n $(generate_mock_data ${header} ${i}),
  done
  echo # New line
done >> file.csv
However, execution time is incredibly slow for just 2500 rows:
real 0m8.876s
user 0m0.576s
sys 0m0.868s
What am I doing wrong? Is there anything I can do to speed up the process, or is bash simply not the right language for this type of operation?
I also tried profiling the entire script but after looking at the logs I didn't notice any significant bottlenecks.

It seems you can generate a UUID pretty fast with Python, so if you just execute Python once to generate 2,500 UUIDs, then (even if, like me, you aren't a Python programmer ;-) ) you can patch them up with awk:
python -c 'import uuid; print("\n".join([str(uuid.uuid4()).upper() for x in range(2500)]))' |
awk '{
lat=-90+180*rand();
lon=-180+360*rand();
ip=int(256*rand()) "." int(256*rand()) "." int(256*rand()) "." int(256*rand());
print NR,$0,lat,lon,ip
}' OFS=,
This takes 0.06s on my iMac.
OFS is the "Output Field Separator"
NR is the line number
$0 means "the whole input line"
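One caveat worth hedging: awk's rand() is seeded deterministically, so repeated runs produce the same "random" lat/long/ip values unless you call srand() in a BEGIN block, e.g.:
awk 'BEGIN { srand() }   # seed from the time of day so each run differs
{
  lat=-90+180*rand();
  lon=-180+360*rand();
  ip=int(256*rand()) "." int(256*rand()) "." int(256*rand()) "." int(256*rand());
  print NR,$0,lat,lon,ip
}' OFS=,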
You can try the Python on its own, like this:
python -c 'import uuid; print("\n".join([str(uuid.uuid4()).upper() for x in range(2500)]))'

Is Shell The Right Tool?
Not really, but if you avoid bad practices, you can make something relatively fast.
With ksh93, the below reliably runs in 0.5-0.6s wall-clock; with bash, 1.2-1.3s.
What Does It Look Like?
#!/usr/bin/env bash
# Comment these two lines if running with ksh93, obviously. :)
[ -z "$BASH_VERSION" ] && { echo "This requires bash 4.1 or newer" >&2; exit 1; }
[[ $BASH_VERSION = [123].* ]] && { echo "This requires bash 4.1 or newer" >&2; exit 1; }
uuid_stream() {
  python -c '
import uuid
try:
    while True:
        print(str(uuid.uuid4()).upper())
except IOError:
    pass  # probably an EPIPE because we were closed.
'
}
# generate a file descriptor that emits a shuffled stream of integers
exec {large_int_fd}< <(while shuf -r -i1000000-9999999; do :; done)
# generate a file descriptor that emits an endless stream of UUIDs
exec {uuid_fd}< <(uuid_stream)
generate_mock_data() {
  typeset val
  case $1 in
    auto-increment) val="$2" ;;
    uuid)           IFS= read -r val <&"$uuid_fd" || exit ;;
    lat)            IFS= read -r val <&"$large_int_fd" || exit
                    val="$((RANDOM % 180 - 90)).$val" ;;
    long)           IFS= read -r val <&"$large_int_fd" || exit
                    val="$((RANDOM % 360 - 180)).$val" ;;
    ip)             val="$((RANDOM%256)).$((RANDOM%256)).$((RANDOM%256)).$((RANDOM%256))" ;;
  esac
  printf '%s' "$val"
}
for ((i=0; i<2500; i++)); do
  for header in auto-increment uuid lat long ip; do
    generate_mock_data "$header" "$i"
    printf ,
  done
  echo
done > file.csv
What's Different?
There are no command substitutions inside the inner loop. That means we don't ever use $() or any synonym for same. Each of these involves a fork() -- creating a new OS-level copy of the process -- and a wait(), with a bunch of FIFO magic to capture our output.
There are no external commands inside the inner loop. Any external command is even worse than a command substitution: They require a fork, and then additionally require an execve, with the dynamic linker and loader being invoked to pull in all the library dependencies for whichever external command is being run.
Because there is no command substitution stripping trailing newlines for us, the function simply doesn't emit them in the first place (printf '%s' instead of echo).
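If you want to see the cost described in the first two points for yourself, a rough (machine-dependent) comparison is to time a command substitution that runs an external command against a builtin doing the same job:
# one fork() for the command substitution plus an execve() of /bin/date per iteration
time for i in {1..2000}; do d=$(date +%s); done
# no forks at all: printf is a builtin, and %(...)T formats a timestamp (bash 4.2+)
time for i in {1..2000}; do printf -v d '%(%s)T' -1; done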

Related

Simple bash program which compares values

I have a file which contains various data (date, time, speed, distance from the front, distance from the back); the file looks like this, just with more rows:
2003.09.23.,05:05:21:64,134,177,101
2009.03.10.,17:46:17:81,57,102,57
2018.01.05.,00:30:37:04,354,145,156
2011.07.11.,23:21:53:43,310,125,47
2011.06.26.,07:42:10:30,383,180,171
I'm trying to write a simple Bash program, which tells the dates and times when the 'distance from the front' is less than the provided parameter ($1)
So far I wrote:
#!/bin/bash
if [ $# -eq 0 -o $# -gt 1 ]
then
echo "wrong number of parameters"
fi
i=0
fdistance=()
input='auto.txt'
while IFS= read -r line
do
year=${line::4}
month=${line:5:2}
day=${line:8:2}
hour=${line:12:2}
min=${line:15:2}
sec=${line:18:2}
hthsec=${line:21:2}
fdistance=$(cut -d, -f 4)
if [ "$fdistance[$i]" -lt "$1" ]
then
echo "$year[$i]:$month[$i]:$day[$i],$hour[$i]:$min[$i]:$sec[$i]:$hthsec[$i]"
fi
i=`expr $i + 1`
done < "$input"
but this gives the error "whole expression required" and doesn't work at all.
If you have the option of using awk, the entire process can be reduced to:
awk -F, -v dist=150 '$4<dist {split($1,d,"."); print d[1]":"d[2]":"d[3]","$2}' file
In the example above, any record with a distance (field 4, $4) less than the dist variable value takes the date field (field 1, $1) and uses split() to break it into the array d on ".", so the first 3 elements are year, mo, day; it then simply prints those three elements separated by ":" (which eliminates the stray "." at the end of the field). The time (field 2, $2) is output unchanged.
Example Use/Output
With your sample data in file, you can do:
$ awk -F, -v dist=150 '$4<dist {split($1,d,"."); print d[1]":"d[2]":"d[3]","$2}' file
2009:03:10,17:46:17:81
2018:01:05,00:30:37:04
2011:07:11,23:21:53:43
Which provides the records in the requested format where the distance is less than 150. If you call awk from within your script you can pass the 150 in from the 1st argument to your script.
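For example, a minimal wrapper script (assuming, as in the question, that the data lives in auto.txt and the threshold is the script's first argument) might look like:
#!/bin/bash
# usage: ./front.sh 150   (front.sh is just an illustrative name)
awk -F, -v dist="$1" '$4<dist {split($1,d,"."); print d[1]":"d[2]":"d[3]","$2}' auto.txt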
You can also accomplish this task by substituting a ':' for each '.' in the first field with gsub() and outputting a substring of the first field with substr() that drops the last character, e.g.
awk -F, -v dist=150 '$4<dist {gsub(/[.]/,":",$1); print substr($1,0,length($1)-1),$2}' file
(same output)
While parsing the data is a great exercise for learning string handling in shell or bash, in practice awk will be orders of magnitude faster than a shell script. When processing a million-line file, the difference in runtime can be seconds with awk compared to minutes (or hours) with a shell script.
If this is an exercise to learn string handling in your shell, just put this in your hip pocket for later, understanding that awk is the real Swiss Army knife for text processing (well worth the effort to learn).
Would you try the following:
#!/bin/bash
if (( $# != 1 )); then
    echo "usage: $0 max_distance_from_the_front" >&2    # output error message to stderr
    exit 1
fi
input="auto.txt"
while IFS=, read -r mydate mytime speed fdist bdist; do  # split csv and assign variables
    mydate=${mydate%.}; mydate=${mydate//./:}             # reformat the date string
    if (( fdist < $1 )); then                             # if the front distance is less than $1
        echo "$mydate,$mytime"                            # then print the date and time
    fi
done < "$input"
Sample output with the same parameter as Keldorn:
$ ./test.sh 130
2009:03:10,17:46:17:81
2011:07:11,23:21:53:43
There are a few odd things in your script:
Why is fdistance an array? It is not necessary (and it is used incorrectly here), since the file is read line by line.
What is the cut in the line fdistance=$(cut -d, -f 4) supposed to cut? What is its input?
(Note: when the parameters are invalid, it is better to end the script right away; an exit 1 is added in the example below.)
Here is a working version (apart from the parsing of the date, but that is not what your question was about so I skipped it):
#!/usr/bin/env bash
if [ $# -eq 0 -o $# -gt 1 ]
then
    echo "wrong number of parameters"
    exit 1
fi
input='auto.txt'
while IFS= read -r line
do
    fdistance=$(echo "$line" | awk '{split($0,a,","); print a[4]}')
    if [ "$fdistance" -lt "$1" ]
    then
        echo $line
    fi
done < "$input"
Sample output:
$ ./test.sh 130
2009.03.10.,17:46:17:81,57,102,57
2011.07.11.,23:21:53:43,310,125,47
$

Can I make a shell function in a pipeline conditionally "disappear", without using cat?

I have a bash script that produces some text from a pipe of commands. Based on a command line option I want to do some validation on the output. For a contrived example...
CHECK_OUTPUT=$1
...
check_output()
{
    if [[ "$CHECK_OUTPUT" != "--check" ]]; then
        # Don't check the output. Passthrough and return.
        cat
        return 0
    fi
    # Check each line exists in the fs root
    while read line; do
        if [[ ! -e "/$line" ]]; then
            echo "Error: /$line does not exist"
            return 1
        fi
        echo "$line"
    done
    return 0
}
ls /usr | grep '^b' | check_output
[EDIT] better example: https://stackoverflow.com/a/52539364/1888983
This is really useful, particularly if I have multiple functions that can become passthroughs. Yes, I could move the CHECK_OUTPUT conditional and create a pipe with or without check_output, but I'd need to write lines for each combination as more functions are added. If there are better ways to dynamically build a pipe, I'd like to know.
The problem is the "useless use of cat". Can this be avoided and make check_output like it wasn't in the pipe at all?
Yes, you can do this -- by making your function a wrapper that conditionally injects a pipeline element, instead of being an unconditional pipeline element itself. For example:
maybe_checked() {
  if [[ $CHECK_OUTPUT != "--check" ]]; then
    "$@" # just run our arguments as a command, as if we weren't here
  else
    # run our arguments in a process substitution, reading from stdout of same.
    # ...some changes from the original code:
    # IFS= stops leading or trailing whitespace from being stripped
    # read -r prevents backslashes from being processed
    local line # avoid modifying $line outside our function
    while IFS= read -r line; do
      [[ -e "/$line" ]] || { echo "Error: /$line does not exist" >&2; return 1; }
      printf '%s\n' "$line" # see https://unix.stackexchange.com/questions/65803
    done < <("$@")
  fi
}
ls /usr | maybe_checked grep '^b'
Caveat of the above code: if the pipefail option is set, you'll want to check the exit status of the process substitution to have complete parity with the behavior that would otherwise be the case. In bash version 4.3 or later (IIRC), $! is set by process substitutions to the relevant PID, which can be waited for to retrieve the exit status.
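As a sketch of that check (assuming a bash new enough that wait accepts a process-substitution PID), the end of the else branch above could look like this:
    done < <("$@")
    # $! holds the PID of the process substitution above; waiting on it
    # propagates the wrapped command's exit status (needs a fairly recent bash).
    wait "$!"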
That said, this is also a use case wherein using cat is acceptable, and I'm saying this as a card-carrying member of the UUOC crowd. :)
Adopting the examples from John Kugelman's answers on the linked question:
maybe_sort() {
  if (( sort )); then
    "$@" | sort
  else
    "$@"
  fi
}
maybe_limit() {
  if [[ -n $limit ]]; then
    "$@" | head -n "$limit"
  else
    "$@"
  fi
}
printf '%s\n' "${haikus[@]}" | maybe_limit maybe_sort sed -e 's/^[ \t]*//'

Bash Script is super slow

I'm updating an old script to parse ARP data and get useful information out of it. We added a new router, and while I can pull the ARP data out of the router, it's in a new format. I've got a file "zTempMonth" which is all the ARP data from both sets of routers, which I need to compile down into a new, normalized data format. The lines of code below do what I need them to logically, but they're extremely slow - as in it will take days to run these loops, where previously the script took 20-30 minutes. Is there a way to speed this up, or to identify what's slowing it down?
Thank you in advance,
echo "Parsing zTempMonth"
while read LINE
do
wc=`echo $LINE | wc -w`
if [[ $wc -eq "6" ]]; then
true
out=$(echo $LINE | awk '{ print $2 " " $4 " " $6}')
echo $out >> zTempMonth.tmp
else
false
fi
if [[ $wc -eq "4" ]]; then
true
out=$(echo $LINE | awk '{ print $1 " " $3 " " $4}')
echo $out >> zTempMonth.tmp
else
false
fi
done < zTempMonth
1. while read loops are slow.
2. Subshells in a loop are slow.
3. >> (open(f, 'a')) calls in a loop are slow.
You could speed this up and remain in pure bash, just by losing #2 and #3:
#!/usr/bin/env bash
while read -a line; do
  case "${#line[@]}" in
    6) printf '%s %s %s\n' "${line[1]}" "${line[3]}" "${line[5]}" ;;
    4) printf '%s %s %s\n' "${line[0]}" "${line[2]}" "${line[3]}" ;;
  esac
done < zTempMonth >> zTempMonth.tmp
But if there are more than a few lines, this will still be slower than pure awk. Consider an awk script as simple as this:
BEGIN {
    print "Parsing zTempMonth"
}
NF == 6 {
    print $2 " " $4 " " $6
}
NF == 4 {
    print $1 " " $3 " " $4
}
You could execute it like this:
awk -f thatAwkScript zTempMonth >> zTempMonth.tmp
to get the same append approach as your current script.
When writing shell scripts, it’s almost always better to call a function directly rather than using a subshell to call the function. The usual convention that I’ve seen is to echo the return value of the function and capture that output using a subshell. For example:
#!/bin/bash
function get_path() {
    echo "/path/to/something"
}
mypath="$(get_path)"
This works fine, but there is a significant speed overhead to using a subshell and there is a much faster alternative. Instead, you can just have a convention wherein a particular variable is always the return value of the function (I use retval). This has the added benefit of also allowing you to return arrays from your functions.
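For instance, here is a minimal sketch of returning an array through retval (the function and variable names are just illustrative):
#!/bin/bash
function get_paths() {
    # assign the "return value" instead of echoing it
    retval=("/path/one" "/path/two" "/path/three")
}
get_paths
for p in "${retval[@]}"; do
    echo "$p"
done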
If you don’t know what a subshell is, for the purposes of this blog post, a subshell is another bash shell that is spawned whenever you use $() or `` and is used to execute the code you put inside.
I did some simple testing to allow you to observe the overhead. For two functionally equivalent scripts:
This one uses a subshell:
#!/bin/bash
function a() {
    echo hello
}
for (( i = 0; i < 10000; i++ )); do
    echo "$(a)"
done
This one uses a variable:
#!/bin/bash
function a() {
    retval="hello"
}
for (( i = 0; i < 10000; i++ )); do
    a
    echo "$retval"
done
The speed difference between these two is noticeable and significant.
$ for i in variable subshell; do
> echo -e "\n$i"; time ./$i > /dev/null
> done
variable
real 0m0.367s
user 0m0.346s
sys 0m0.015s
subshell
real 0m11.937s
user 0m3.121s
sys 0m0.359s
As you can see, when using variable, execution takes 0.367 seconds. subshell however takes a full 11.937 seconds!
Source: http://rus.har.mn/blog/2010-07-05/subshells/

Count the Words in text file without using the 'wc' command in unix shell scripting

Here I could not find the number of words in the text file. What are the possible changes I need to make?
What is the use of tty in this program?
echo "Enter File name:"
read filename
terminal=`tty`
exec < $filename
num_line=0
num_words=0
while read line
do
num_lines=`expr $num_lines + 1`
num_words=`expr $num_words + 1`
done
There is a simple way using arrays to read the number of words in a file:
#!/bin/bash
[ -n "$1" ] || {
    printf "error: insufficient input. Usage: %s\n" "${0//\//}"
    exit 1
}
fn="$1"
[ -r "$fn" ] || {
    printf "error: file not found: '%s'\n" "$fn"
    exit 1
}
declare -i cnt=0
while read -r line || [ -n "$line" ]; do   # read line from file
    tmp=( $line )                          # create tmp array of words
    cnt=$((cnt + ${#tmp[@]}))              # add no. of words to count
done <"$fn"
printf "\n %s words in %s\n\n" "$cnt" "$fn"  # show results
exit 0
input:
$ cat dat/wordfile.txt
Here I could not find the number of words in the text file. What
would be the possible changes do I need to make? What is the use
of tty in this program?
output:
$ bash wcount.sh dat/wordfile.txt
33 words in dat/wordfile.txt
wc -w confirmation:
$ wc -w dat/wordfile.txt
33 dat/wordfile.txt
tty?
The assignment terminal=`tty` stores the terminal device of the current interactive shell in the terminal variable. (It is a way to determine which tty device you are connected to, e.g. /dev/pts/4.)
The tty command prints the file name of the terminal connected to standard input. In the context of your program it does nothing significant, really; you might as well remove that line and run the script.
Regarding the word count calculation, you would need to split each line on spaces and count the resulting words. Currently the program just counts the number of lines ($num_lines) and uses the same calculation for $num_words.
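A minimal sketch of that fix, staying close to the original loop (read -a splits each line into an array of words; $filename is the name read at the top of the script):
num_lines=0
num_words=0
while read -r -a words; do                    # -a splits the line into an array
    num_lines=$((num_lines + 1))
    num_words=$((num_words + ${#words[@]}))   # add this line's word count
done < "$filename"
echo "Lines: $num_lines  Words: $num_words"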

Bash script to automatically test program output - C

I am very new to writing scripts and I am having trouble figuring out how to get started on a bash script that will automatically test the output of a program against expected output.
I want to write a bash script that will run a specified executable on a set of test inputs, say in1 in2 etc., against corresponding expected outputs, out1, out2, etc., and check that they match. The file to be tested reads its input from stdin and writes its output to stdout. So executing the test program on an input file will involve I/O redirection.
The script will be invoked with a single argument, which will be the name of the executable file to be tested.
I'm having trouble just getting going on this, so any help at all (links to any resources that further explain how I could do this) would be greatly appreciated. I've obviously tried searching myself but haven't been very successful in that.
Thanks!
If I get what you want, this might get you started:
A mix of bash and external tools like diff.
#!/bin/bash
# If number of arguments less then 1; print usage and exit
if [ $# -lt 1 ]; then
printf "Usage: %s <application>\n" "$0" >&2
exit 1
fi
bin="$1" # The application (from command arg)
diff="diff -iad" # Diff command, or what ever
# An array, do not have to declare it, but is supposedly faster
declare -a file_base=("file1" "file2" "file3")
# Loop the array
for file in "${file_base[@]}"; do
# Pad file_base with suffixes
file_in="$file.in" # The in file
file_out_val="$file.out" # The out file to check against
file_out_tst="$file.out.tst" # The outfile from test application
# Validate infile exists (do the same for out validate file)
if [ ! -f "$file_in" ]; then
printf "In file %s is missing\n" "$file_in"
continue;
fi
if [ ! -f "$file_out_val" ]; then
printf "Validation file %s is missing\n" "$file_out_val"
continue;
fi
printf "Testing against %s\n" "$file_in"
# Run application, redirect in file to app, and output to out file
"./$bin" < "$file_in" > "$file_out_tst"
# Execute diff
$diff "$file_out_tst" "$file_out_val"
# Check exit code from previous command (ie diff)
# We need to add this to a variable else we can't print it
# as it will be changed by the if [
# Iff not 0 then the files differ (at least with diff)
e_code=$?
if [ $e_code != 0 ]; then
printf "TEST FAIL : %d\n" "$e_code"
else
printf "TEST OK!\n"
fi
# Pause by prompt
read -p "Enter a to abort, anything else to continue: " input_data
# Iff input is "a" then abort
[ "$input_data" == "a" ] && break
done
# Clean exit with status 0
exit 0
Edit:
Added an exit code check, and a short walk-through.
This will in short do:
Check if argument is given (bin/application)
Use an array of "base names", loop this and generate real filenames.
I.e.: Having array ("file1" "file2") you get
In file: file1.in
Out file to validate against: file1.out
Out file: file1.out.tst
In file: file2.in
...
Execute the application, redirecting the in file to the application's stdin with <, and redirecting the application's stdout to the test out file with >.
Use a tool like diff to test whether they are the same.
Check exit / return code from tool and print message (FAIL/OK)
Prompt for continuance.
Any and all of which can of course be modified, removed, etc.
Some links:
TLDP; Advanced Bash-Scripting Guide (can be a bit more readable with this)
Arrays
File test operators
Loops and branches
Exit-status
...
bash-array-tutorial
TLDP; Bash-Beginners-Guide
Expect could be a perfect fit for this kind of problem:
Expect is a tool primarily for automating interactive applications
such as telnet, ftp, passwd, fsck, rlogin, tip, etc. Expect really
makes this stuff trivial. Expect is also useful for testing these same
applications.
First take a look at the Advanced Bash-Scripting Guide chapter on I/O redirection.
Then I have to ask: why use a bash script at all? Do it directly from your makefile.
For instance I have a generic makefile containing something like:
# type 'make test' to run a test.
# for example this runs your program with jackjill.txt as input
# and redirects the stdout to the file jackjill.out
test: $(program_NAME)
	./$(program_NAME) < jackjill.txt > jackjill.out
	diff -q jackjill.out jackjill.expected
You can add as many tests as you want like this. You just diff the output file each time against a file containing your expected output.
Of course this is only relevant if you're actually using a makefile for building your program. :-)
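If you are not using make, those two recipe lines translate directly into a tiny bash sketch (the program and file names are the same illustrative ones as above):
#!/bin/bash
./myprogram < jackjill.txt > jackjill.out
if diff -q jackjill.out jackjill.expected; then
    echo "TEST OK"
else
    echo "TEST FAIL"
fi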
Functions. Herestrings. Redirection. Process substitution. diff -q. test.
Expected outputs are a second kind of input.
For example, if you want to test a square function, you would have input like (0, 1, 2, -1, -2) and expected output as (0, 1, 4, 1, 4).
Then you would compare every result of input to the expected output and report errors for example.
You could work with arrays:
in=(0 1 2 -1 -2)
out=(0 1 4 2 4)
for i in $(seq 0 $((${#in[@]}-1)) )
do
    (( ${in[i]} * ${in[i]} - ${out[i]} )) && echo -n bad" " || echo -n fine" "
    echo $i ": " ${in[i]}"² ?= " ${out[i]}
done
fine 0 : 0² ?= 0
fine 1 : 1² ?= 1
fine 2 : 2² ?= 4
bad 3 : -1² ?= 2
fine 4 : -2² ?= 4
Of course you can read both arrays from a file.
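A sketch of that, using the same word-splitting form as the arr=($(< file1)) example further down and two hypothetical files holding whitespace-separated numbers:
in=($(< inputs.txt))      # e.g. contains: 0 1 2 -1 -2
out=($(< expected.txt))   # e.g. contains: 0 1 4 1 4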
Tests can involve arithmetic expressions (with (( ... )) as above), as well as strings and files (with test / [ ... ]). Try
help test
for an overview.
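A few illustrative one-liners covering those three kinds of tests:
(( 3 * 3 == 9 ))     && echo "arithmetic: fine"
[ "abc" = "abc" ]    && echo "string: fine"
[ -f /etc/passwd ]   && echo "file: exists"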
Reading strings wordwise from a file:
for n in $(< f1); do echo $n "-" ; done
Read into an array:
arr=($(< file1))
Read file linewise:
for i in $(seq 1 $(cat file1 | wc -l ))
do
    line=$(sed -n ${i}p file1)
    echo $line"#"
done
Testing against program output sounds like string comparison and capturing of program output, e.g. n=$(cmd param1 param2):
asux:~/prompt > echo -e "foo\nbar\nbaz"
foo
bar
baz
asux:~/prompt > echo -e "foo\nbar\nbaz" > file
asux:~/prompt > for i in $(seq 1 3); do line=$(sed -n ${i}p file); test "$line" = "bar" && echo match || echo fail ; done
fail
match
fail
Further useful: regular expression matching on strings with =~ inside [[ ... ]] brackets:
for i in $(seq 1 3)
do
    line=$(sed -n ${i}p file)
    echo -n $line
    if [[ "$line" =~ ba. ]]; then
        echo " "match
    else
        echo " "fail
    fi
done
foo fail
bar match
baz match
