My teacher told me the only way to have random non-correlated numbers in awk between consecutive runs of the same script, is saving the last seed used to file and then read it again when you start another execution.
So far I'm trying with this:
BEGIN {
getline seed < "myseed.txt";
srand(seed);
print rand();
print rand();
print srand() > "myseed.txt";
}
but the only content of myseed.txt is always 0 and I get the same random numbers at every execution.
Any ideas on how to save the status of the internal random number generator and then resuming generating random numbers exactly where it stopped between different executions of the same script?
There seems to be some confusion about what the seed actually is. It's a value used as a parameter to initialize the random number generator, and that's it; it does not contain the current state of the random number generator. If you call srand(N) with any N, and then generate some random numbers, and then call srand() again (with any seed), the return value will be that original N, no matter how many random numbers you have generated in the meantime.
If the goal is to have a reproducible string of numbers, allowing for interruptions, then you need to save not only the seed but the number of random numbers generated so far since the seeding, and generate and discard that many at the start of the next run. This is a terrible hack and you would probably be better implementing your own RNG or, more likely, using something other than awk.
If the goal is to make sure not to generate the same sequence even if run twice within a single second, then you should read the discussion at the link Ed Morton posted in his comment.
Either I misread your question or you are doing the complete opposite of what you are trying to achieve. If you read the seed from a file, use it, write the same seed back to the file for re-use the next time, you will always have the same seed, thus always produce the same run of numbers.
$ awk 'BEGIN{srand(1234);print rand()}' && sleep 1 && awk 'BEGIN{srand(1234);print rand()}'
0.240849
0.240849
If you don't provide a seed for srand() it will use seconds since epoch:
$ awk 'BEGIN{srand();print rand()}' && sleep 1 && awk 'BEGIN{srand();print rand()}'
0.2776
0.668099
... thus it will only produce new numbers once a second. You can work around this by using $RANDOM (if your shell supports it; changes on each use) or nanoseconds since epoch (if your date supports it) as seed, e.g.:
$ awk -v seed="$RANDOM" 'BEGIN{srand(seed);print rand()}' && awk -v seed="$RANDOM" 'BEGIN{srand(seed);print rand()}'
0.661197
0.325718
$ awk -v seed="$(date +%N)" 'BEGIN{srand(seed);print rand()}' && awk -v seed="$RANDOM" 'BEGIN{srand(seed);print rand()}'
0.588395
0.911353
srand(seed) takes an integer value for "seed"
rand() returns a decimal number >= 0 and < 1.
So, saving the output of rand() to use directly as the next seed will result in srand() always using a seed of zero. Look (remember a call to srand() returns the PREVIOUS seed used):
$ awk -v s=0.1 'BEGIN{srand(s); print rand(); print srand()}'
0.566305
0
$ awk -v s=0.9 'BEGIN{srand(s); print rand(); print srand()}'
0.566305
0
So, if you want to use the output of rand() as the next seed you need to multiply it by whatever number you want your seed to be less than, e.g. 1,000,000:
$ awk 'BEGIN{
if ( (getline s < "myseed.txt") > 0 ) {
print "seed read from file =", s
srand(1000000*s)
}
else {
print "using default seed of seconds since epoch"
srand()
}
r = rand()
print r > "myseed.txt"
print "seed actually used by srand() was:", srand()
print "rand() =", r
}'
using default seed of seconds since epoch
seed actually used by srand() was: 1368282525
rand() = 0.331514
$ awk 'BEGIN{
if ( (getline s < "myseed.txt") > 0 ) {
print "seed read from file =", s
srand(1000000*s)
}
else {
print "using default seed of seconds since epoch"
srand()
}
r = rand()
print r > "myseed.txt"
print "seed actually used by srand() was:", srand()
print "rand() =", r
}'
seed read from file = 0.331514
seed actually used by srand() was: 331514
rand() = 0.677688
$ awk 'BEGIN{
if ( (getline s < "myseed.txt") > 0 ) {
print "seed read from file =", s
srand(1000000*s)
}
else {
print "using default seed of seconds since epoch"
srand()
}
r = rand()
print r > "myseed.txt"
print "seed actually used by srand() was:", srand()
print "rand() =", r
}'
seed read from file = 0.677688
seed actually used by srand() was: 677688
rand() = 0.363388
On the first execution the file "myseed.txt" didn't exist.
Related
I want to plot some data of a spray simulation. There is a variable called the vaporpenetrationlength, which describes the distance from the injector to the position where the mass fraction is 0.1%. The simulation created many folders for each time step. Inside those folders there is one file which contains the mass fraction and the distance.
I want to create a script which goes through all the time step folders and search inside this one file and prints out the distance where the 0.1% were measured and in which time step it was.
I found a script, but I don't understand it because I just started to learn shell scripting.
Could someone please help me step by step in building such a script? I am interested in learning it, and therefore I want to understand ever line of the code.
Thanks in advance :)
This little script outputs TimeTabLengthTabMass based on the value of the "mass fraction":
printf '%s\t%s\t%s\n' 'Time' 'Length' 'Mass'
awk '
BEGIN { FS = OFS = "\t"}
FNR == 1 {
n = split(FILENAME,path,"/")
time = sprintf("%0.7f",path[n-1])
}
NF != 2 {next}
0.001 <= $2 && $2 < 0.00101 { print time,$1,$2 }
' postProcessing/singleGraphVapPen/*/*
remark: In fact, printing the header could be done within the awk program, but doing it with a separate printf command allows you to post-process the output of awk (for ex. if you need to sort the times and/or lengths and/or masses).
notes:
FNR == 1 is true for the first line of each input file. In the corresponding block, I extract the time value from the directory name.
NF != 2 {next} is for filtering out the gnuplot commands that are at the beginning of the input files. In words, this statement means "if the number of (tab-delimited) fields in the line isn't 2, then skip"
0.001 <= $2 && $2 < 0.00101 selects the lines based on the value of their second field, which is referred to as yheptane in your script. IDK the margin of error of your "0.1% of mass fraction" so I chose convenient conditions for the sample output below.
With the sample data, the output will be:
Time Length Mass
0.0001500 0.0895768 0.00100839
0.0002000 0.102057 0.00100301
0.0002000 0.0877939 0.00100832
0.0003500 0.0827694 0.00100114
0.0009000 0.0657509 0.00100015
0.0015000 0.0501911 0.00100016
0.0016500 0.0469495 0.00100594
0.0018000 0.0436538 0.00100853
0.0021500 0.0369005 0.00100809
0.0023000 0.100328 0.00100751
As an aside, here's a script for replacing your original code:
#!/bin/bash
set -- postProcessing/singleGraphVapPen/*/*
if ! [ -f VapPen.txt ]
then
{
printf '%s\t%s\n' 'Time [s]' 'VapPen [m]'
awk '
BEGIN {FS = OFS = "\t"}
FNR == 1 {
if (NR > 1)
print time,vappen
vappen = 0
n = split(FILENAME,path,"/")
time = sprintf("%0.7f",path[n-1])
}
NF != 2 {next}
$2 >= 0.001 { vappen = $1 }
END { if (NR) print time,vappen }
' "$#" |
sort -n -k1,1
} > VapPen.txt
fi
gnuplot -e '
set title "Verdunstungspenetration";
set xlabel "Zeit [s]";
set ylabel "Verdunstungspenetrationslänge [m]";
set grid;
plot "VapPen.txt" using 1:2 with linespoints title "Vapor penetraion 0,1% mass";
pause -1 "Hit return to continue";
'
With the provided data, it reduces the execution time from several minutes to 0.15s on my computer.
I've recently been working on some lab assignments and in order to collect and analyze results well, I prepared a bash script to automate my job. It was my first attempt to create such script, thus it is not perfect and my question is strictly connected with improving it.
Exemplary output of the program is shown below, but I would like to make it more general for more purposes.
>>> VARIANT 1 <<<
Random number generator seed is 0xea3495cc76b34acc
Generate matrix 128 x 128 (16 KiB)
Performing 1024 random walks of 4096 steps.
> Total instructions: 170620482
> Instructions per cycle: 3.386
Time elapsed: 0.042127 seconds
Walks accrued elements worth: 534351478
All data I want to collect is always in different lines. My first attempt was running the same program twice (or more times depending on the amount of data) and then using grep in each run to extract the data I need by looking for the keyword. It is very inefficient, as there probably are some possibilities of parsing whole output of one run, but I could not come up with any idea. At the moment the script is:
#!/bin/bash
write() {
o1=$(./progname args | grep "Time" | grep -o -E '[0-9]+.[0-9]+')
o2=$(./progname args | grep "cycle" | grep -o -E '[0-9]+.[0-9]+')
o3=$(./progname args | grep "Total" | grep -o -E '[0-9]+.[0-9]+')
echo "$1 $o1 $o2 $o3"
}
for ((i = 1; i <= 10; i++)); do
write $i >> times.dat
done
It is worth mentioning that echoing results in one line is crucial, as I am using gnuplot later and having data in columns is perfect for that use. Sample output should be:
1 0.019306 3.369 170620476
2 0.019559 3.375 170620475
3 0.021971 3.334 170620478
4 0.020536 3.378 170620480
5 0.019692 3.390 170620475
6 0.020833 3.375 170620477
7 0.019951 3.450 170620477
8 0.019417 3.381 170620476
9 0.020105 3.374 170620476
10 0.020255 3.402 170620475
My question is: how could I improve the script to collect such data in just one program execution?
You could use awk here and could get values into an array and later access them by index 1,2 and 3 in case you want to do this in a single command.
myarr=($(your_program args | awk '/Total/{print $NF;next} /cycle/{print $NF;next} /Time/{print $(NF-1)}'))
OR use following to forcefully print all elements into a single line, which will not come in new lines if someone using " to keep new lines safe for values.
myarr=($(your_program args | awk '/Total/{val=$NF;next} /cycle/{val=(val?val OFS:"")$NF;next} /Time/{print val OFS $(NF-1)}'))
Explanation: Adding detailed explanation of awk program above.
awk ' ##Starting awk program from here.
/Total/{ ##Checking if a line has Total keyword in it then do following.
print $NF ##Printing last field of that line which has Total in it here.
next ##next keyword will skip all further statements from here.
}
/cycle/{ ##Checking if a line has cycle in it then do following.
print $NF ##Printing last field of that line which has cycle in it here.
next ##next keyword will skip all further statements from here.
}
/Time/{ ##Checking if a line has Time in it then do following.
print $(NF-1) ##Printing 2nd last field of that line which has Time in it here.
}'
To access individual items you could use like:
echo ${myarr[0]}, echo ${myarr[1]} and echo ${myarr[2]} for Total, cycle and time respectively.
Example to access all elements by loop in case you need:
for i in "${myarr[#]}"
do
echo $i
done
You can execute your program once and save the output at a variable.
o0=$(./progname args)
Then you can grep that saved string any times like this.
o1=$(echo "$o0" | grep "Time" | grep -o -E '[0-9]+.[0-9]+')
Assumptions:
each of the 3x search patterns (Time, cycle, Total) occur just once in a set of output from ./progname
format of ./progname output is always the same (ie, same number of space-separated items for each line of output)
I've created my own progname script that just does an echo of the sample output:
$ cat progname
echo ">>> VARIANT 1 <<<
Random number generator seed is 0xea3495cc76b34acc
Generate matrix 128 x 128 (16 KiB)
Performing 1024 random walks of 4096 steps.
> Total instructions: 170620482
> Instructions per cycle: 3.386
Time elapsed: 0.042127 seconds
Walks accrued elements worth: 534351478"
One awk solution to parse and print the desired values:
$ i=1
$ ./progname | awk -v i=${i} ' # assign awk variable "i" = ${i}
/Time/ { o1 = $3 } # o1 = field 3 of line that contains string "Time"
/cycle/ { o2 = $5 } # o2 = field 5 of line that contains string "cycle"
/Total/ { o3 = $4 } # o4 = field 4 of line that contains string "Total"
END { printf "%s %s %s %s\n", i, o1, o2, o3 } # print 4x variables to stdout
'
1 0.042127 3.386 170620482
I am writing a script for automatically computing average runtime.
First I need to run $ time ./foo.py for 100 times and save output to file time.txt (working)
$ for i in `seq 100`; do { time ./foo.py; } 2>> time.txt; done
Output looks as follows
time ./foo.py
real 0m0,030s
user 0m0,030s
sys 0m0,000s
[...]
Runtimes from different scripts are in the same file. Each entry starts with time ./foo.py, followed by 100 "triplets" of real, user and sys.
Now, if possible, I would love to have the script automatically compute the average runtime for each tested file by using all 100 "triplets", and neatly returning only one "mean triplet".
I have thought about maybe using awk to calculate the mean, like this
awk '{ total += $2 } END { print total/NR }' time.txt
But the command would need to be adapted to fit my needs - after all, only the parts after the , (e.g. ,030s) may be used for computation and the s would also need to be disregarded.
Since I do not know how to achieve this objective, I thought to ask the community.
Any help is greatly appreciated.
It's easier if you tell time to output the time info in POSIX format:
awk '/^real/ { totalReal += $2 } /^user/ { totalUser += $2 } /^sys/ { totalSys += $2 } END { print "realAvg " totalReal/(NR/4) "\n" "userAvg " totalUser/(NR/4) "\n" "sysAvg " totalSys/(NR/4) }' time.txt
Prints output as follows:
realAvg 12.62
userAvg 27
sysAvg 3.8
Explanation:
Basically, tell awk to go through each line in the file, and if the line starts with real, add that to the totalReal variable, same for user and sys. So, basically, keep a running total of each of the three "types".
At the end, simply print the the three running totals, divided by the number of lines divided by 4. This is because you want each "set" of 4 lines to count as 1 instance, and awk's NR just counts the number of lines.
I need to split the bigger file into smaller chunks based on the last occurrence of the pattern in the bigger file using shell script. For eg.
Sample.txt ( File will be sorted based on the third field on which pattern to be searched )
NORTH EAST|0004|00001|Fost|Weaather|<br/>
NORTH EAST|0004|00001|Fost|Weaather|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
EAST|0007|00016|uytr|kert|<br/>
EAST|0007|00016|uytr|kert|<br/>
WEST|0002|00112|WERT|fersg|<br/>
WEST|0002|00112|WERT|fersg|<br/>
SOUTHWEST|3456|01134|GDFSG|EWRER|<br/>
"Pattern 1 = 00003 " to be searched output file must contain sample_00003.txt
NORTH EAST|0004|00001|Fost|Weaather|<br/>
NORTH EAST|0004|00001|Fost|Weaather|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
"Pattren 2 = 00112" to be searched output file must contain sample_00112.txt
EAST|0007|00016|uytr|kert|<br/>
EAST|0007|00016|uytr|kert|<br/>
WEST|0002|00112|WERT|fersg|<br/>
WEST|0002|00112|WERT|fersg|<br/>
Used
awk -F'|' -v 'pattern="00003"' '$3~pattern big_file' > smallfile
and grep commands but it was very time consuming since file is 300+ MB of size.
Not sure if you'll find a faster tool than awk, but here's a variant that fixes your own attempt and also speeds things up a little by using string matching rather than regex matching.
It processes lookup values in a loop, and outputs everything from where the previous iteration left off through the last occurrence of the value at hand to a file named smallfile<n>, where <n> is an index starting with 1.
ndx=0; fromRow=1
for val in '00003' '00112' '|'; do # 2 sample values to match, plus dummy value
chunkFile="smallfile$(( ++ndx ))"
fromRow=$(awk -F'|' -v fromRow="$fromRow" -v outFile="$chunkFile" -v val="$val" '
NR < fromRow { next }
{ if ($3 != val) { if (p) { print NR; exit } } else { p=1 } } { print > outFile }
' big_file)
done
Note that dummy value | ensures that any remaining rows after the last true value to match are saved to a chunk file too.
Note that moving all the logic into a single awk script should be much faster, because big_file would only have to be read once:
awk -F'|' -v vals='00003|00112' '
BEGIN { split(vals, val); outFile="smallfile" ++ndx }
{
if ($3 != val[ndx]) {
if (p) { p=0; close(outFile); outFile="smallfile" ++ndx }
} else {
p=1
}
print > outFile
}
' big_file
You can try with Perl:
perl -ne '/00003/ && print' big_file > small_file
and compare its timing with other solutions...
EDIT
Limiting my answer to the tools you didn't try already... you can also use:
sed -n '/00003/p' big_file > small_file
But I tend to believe perl will be faster. Again... I'd suggest you to measure the elapsed for different solutions on your own.
I work in telecoms and regularly need to expand number ranges.
For example, 6121234567X [note that there are 10 numbers preceeding the X] is shorthand for:
61212345670
61212345671
61212345672....... etc (a 10 number range)
and 612123456X [note that there are only 9 numbers preceeding the X] is shorthand for
61212345600
61212345601....... etc (a 100 number range)
So I need a grep command that...
reads how many characters in the line preceeding the X (to determine how many suffixes)
writes the appropriate amount of lines (10, 100, or 100) with ascending suffixes
hopefully removes the original line
Below is the Python script that does it, file-name is the expected first argument. Example usage: python script.py file.in > file.out
#!/usr/bin/env python
import sys
def generate(pattern):
p = pattern.lower().find('x')
ret = ""
for i in range(10**(10-p+1)):
ret += pattern[:p] + str(i).zfill(10-p+1) + " "
return ret
if __name__ == "__main__":
if len(sys.argv) <= 1:
print("Filename needed!")
else:
with open(sys.argv[1]) as f:
for ln in f:
print(generate(ln.rstrip()))
You can do this in awk quite quickly:
awk -v val=$a -v max=10
'BEGIN {
gsub("X","",val)
items=max - length(val)
for (i=0; i<=10^items; i++)
print val*(10^items)+i
}'
This works as an example. To do the same reading from a file, you just need to play with $1 (first field of the field) instead of val and move all the code from BEGIN into the main block.
Explanation
-v val=$a -v max=10 pass parameters: $a is the variable containing the string on the form 12345678X AND max contains the maximum amount of digits the number will have (10 in your case).
BEGIN {} perform all these actions [before/without] reading a file.
gsub("X","",val) remove X from val.
items=max - length(val) count the size of the variable without the X.
for (i=0; i<=10^items; i++) print val*(10^items)+i loop from 0 to 10^remaining_size. This means from 0 to 10 or from 0 to 100... depending on the result of 10 - size without X.
Test
With 9 as maximum:
$ a=12345678X
$ awk -v val=$a -v max=9 'BEGIN {gsub("X","",val); items=max - length(val); for (i=0; i<=10^items; i++) print val*(10^items)+i}'
123456780
123456781
123456782
123456783
123456784
123456785
123456786
123456787
123456788
123456789
123456790
echo 6121234567X | perl -nE 'm/(.*)X/;
say $1. $_ foreach (0..10**(11-length $1)-1)'
61212345670
61212345671
61212345672
61212345673
61212345674
61212345675
61212345676
61212345677
61212345678
61212345679
It's a little uglier to get the zero padded format:
echo 611234567X | perl -wne 'm/(.*)X/; $b=$1; $r=11 - length $b;
$fmt="%0" . $r . "s\n";
printf "$b$fmt", $_ foreach (0..10**$r-1) '