bash 'while read line' efficiency with a big file

I was using a while loop to process a task that reads records from a big file of about 10 million lines.
I found that the processing becomes slower and slower as time goes by,
so I made a simulated script with 1 million lines, shown below, which reproduces the problem.
But I still don't know why this happens. How does the read command work?
seq 1000000 > seq.dat
A=$(date +%s)   # initialize A so the first interval is measured correctly
while read s; do
    if [ `expr $s % 50000` -eq 0 ]; then
        echo -n $(expr `date +%s` - $A)' '
        A=`date +%s`
    fi
done < seq.dat
The terminal outputs the time intervals (seconds per 50,000 lines):
98 98 98 98 98 97 98 97 98 101 106 112 121 121 127 132 135 134
After roughly 500,000 lines the processing obviously becomes slower.

Using your code, I saw the same pattern of increasing times (right from the beginning!). If you want faster processing, you should rewrite using shell internal features. Here's my bash version:
tabChar=" "   # put a real tab char here, of course
seq 1000000 > seq.dat
A=$(date +%s)
while read s; do
    if (( ! ( s % 50000 ) )); then
        echo $s "${tabChar}" $(expr $(date +%s) - $A)
        A=$(date +%s)
    fi
done < seq.dat
Edit: fixed a bug. The output indicated every line was being processed; now only every 50,000th line gets the timing treatment. D'oh!
It was
if (( s % 50000 )) ;then
and was fixed to
if (( ! ( s % 50000 ) )) ;then
Output under ksh93 (${.sh.version} reports Version JM 93t+ 2010-05-24):
50000
100000 1
150000 0
200000 1
250000 0
300000 1
350000 0
400000 1
450000 0
500000 1
550000 0
600000 1
650000 0
700000 1
750000 0
800000 1
850000 0
900000 1
950000 0
1e+06 1
Output under bash:
50000 480
100000 3
150000 2
200000 3
250000 3
300000 2
350000 3
400000 3
450000 2
500000 2
550000 3
600000 2
650000 2
700000 3
750000 3
800000 2
850000 2
900000 3
950000 2
As to why your original test case is taking so long ... I'm not sure. I was surprised to see both the time for each test cycle AND the increase in time. If you really need to understand this, you may need to spend time instrumenting further tests. Maybe you'd see something running truss or strace (depending on your base OS).
I hope this helps.

read is a comparatively slow process, as the author of "Learning the Korn Shell" points out* (just above Section 7.2.2.1). There are other programs, such as awk or sed, that have been highly optimized to do what is essentially the same thing: read from a file one line at a time and perform some operations using that input.
Not to mention that you're calling an external process every time you do subtraction or take the modulus, which gets expensive. awk has both of those operations built in.
As the following test points out, awk is quite a bit faster:
#!/usr/bin/env bash
seq 1000000 |
awk '
    BEGIN {
        command = "date +%s"
        prevTime = 0
    }
    $1 % 50000 == 0 {
        command | getline currentTime
        close(command)
        print currentTime - prevTime
        prevTime = currentTime
    }
'
Output:
1335629268
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
Note that the first number is just the raw date +%s epoch value, since prevTime starts at 0. Just like in your test case, I let the first interval be.
Note
*Yes, the author is talking about the Korn shell, not bash as the OP tagged, but bash and ksh are rather similar in a lot of ways; bash adopts much of ksh's feature set. So I would assume that the read command is not drastically different from one shell to the other.
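For what it's worth, the timing loop itself can also be written with nothing but bash builtins in the body, so nothing at all is forked inside the loop. This is a minimal sketch, assuming bash 5.0+ for the EPOCHSECONDS variable (on older bash, $(date +%s) or $SECONDS can stand in); the per-line read overhead of course remains:
seq 1000000 > seq.dat
A=$EPOCHSECONDS
while read -r s; do
    if (( s % 50000 == 0 )); then
        # arithmetic and timestamps use builtins only; no expr or date forks
        printf '%s\t%s\n' "$s" "$(( EPOCHSECONDS - A ))"
        A=$EPOCHSECONDS
    fi
done < seq.dat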

Related

Problem with if condition on a "random walk" script

I'm trying to make the coordinate "x" move randomly in the interval [-1, 1]. However, my code works sometimes and sometimes it doesn't. I tried ShellCheck, but it says "no issues detected!". I'm new to conditionals; am I using them wrong?
I'm running this on the Windows Subsystem for Linux and editing it in nano. Since I have a script that will plot 200 of these "random walks", the code should work consistently, but I really don't understand why it doesn't.
Here's my code:
x=0
for num in {1..15}
do
    r=$RANDOM
    if [[ $r -lt 16383 ]]
    then
        p=1
    else
        p=-1
    fi
    if [[ $x -eq $p ]]
    then
        x=$(echo "$x-$p" | bc )
    else
        x=$(echo "$x+$p" | bc )
    fi
    echo "$num $x"
done
I expect something like this:
1 -1
2 0
3 1
4 0
5 1
6 0
7 1
8 0
9 1
10 0
11 -1
12 0
13 1
14 0
15 1
But the usual output is something like this:
1 1
2 0
3 -1
4 0
5 -1
6 0
7 -1
(standard_in) 1: syntax error
8
(standard_in) 1: syntax error
9
(standard_in) 1: syntax error
10
(standard_in) 1: syntax error
11
(standard_in) 1: syntax error
12
(standard_in) 1: syntax error
13
(standard_in) 1: syntax error
14
(standard_in) 1: syntax error
15
It always stops right after x reaches -1.
You can do this with bash arithmetic:
x=$(( x - p ))
or
(( x -= p ))
and you don't need bc.
Alternatively, replace x=$(echo "$x-$p" | bc ) with x=$(echo "$x-($p)" | bc ) to avoid passing bc the expression -1--1, which it rejects.
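Putting those fixes together, a minimal corrected sketch of the loop using only bash arithmetic (no bc at all, keeping the OP's threshold on $RANDOM) could look like:
x=0
for num in {1..15}
do
    # random step: +1 or -1
    if (( RANDOM < 16383 )); then
        p=1
    else
        p=-1
    fi
    # same rule as the original: step back to 0 when x already equals p
    if (( x == p )); then
        (( x -= p ))
    else
        (( x += p ))
    fi
    echo "$num $x"
done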
One-liner equivalents to the OP's 18-line random walk script, using bash arithmetic evaluation:
x=0; printf '%-5s\n' {1..15}\ $(( x=(RANDOM%2 ? 1 : -1) * (x==0) ))
x=0; printf '%-5s\n' {1..15}\ $(( x=( x ? 0 : (RANDOM%2 ? 1 : -1) ) ))
Sample output of either (the second column will vary between runs):
1 -1
2 0
3 -1
4 0
5 1
6 0
7 1
8 0
9 -1
10 0
11 1
12 0
13 -1
14 0
15 -1
How it works:
The brace expansion {1..15}\ $(( ...some code... )) produces the numbers 1 to 15, each followed by whatever the $(( ... )) code evaluates to for that word, because brace expansion is performed before arithmetic expansion. One flaw with this approach is that each of the resulting 15 pairs of numbers (e.g. 1 -1, 2 0, etc.) appears to bash as a single string rather than as two separate numbers.
(RANDOM%2): % is the modulo operator; here it gives the remainder of $RANDOM divided by 2, which is either 0 or 1.
(x==0): $x can be one of three values, but if the previous value of $x was -1 or 1 the only legal random step is back to 0, so a random number is only needed when the previous value of $x was 0.
The if logic is replaced with ternary shortcuts of the form ( expr ? expr : expr ); these follow the same logic as the OP's script.
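For readability, here is the same arithmetic written as an ordinary loop, a sketch that also sidesteps the single-string flaw mentioned above because each pair is printed by its own echo:
x=0
for num in {1..15}
do
    # if x is nonzero the only legal step is back to 0;
    # otherwise take a random +1/-1 step
    (( x = x ? 0 : (RANDOM % 2 ? 1 : -1) ))
    echo "$num $x"
done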

Wildcard symbol with grep -F

I have the following file
0 0
0 0.001
0 0.032
0 0.1241
0 0.2241
0 0.42
0.0142 0
0.0234 0
0.01429 0.01282
0.001 0.224
0.098 0.367
0.129 0
0.123 0.01282
0.149 0.16
0.1345 0.216
0.293 0
0.2439 0.01316
0.2549 0.1316
0.2354 0.5
0.3345 0
0.3456 0.0116
0.3462 0.316
0.3632 0.416
0.429 0
0.42439 0.016
0.4234 0.3
0.5 0
0.5 0.33
0.5 0.5
Notice that the two columns are sorted ascending, first by the first column and then by the second one. The minimum value is 0 and the maximum is 0.5.
I would like to count the number of lines that are:
0 0
and store that number in a file called "0_0". In this case, this file should contain "1".
Then, the same for those that are:
0 0.0*
For example,
0 0.032
and call it "0_0.0" (it should contain "2"), and so on for all combinations, considering only the first decimal digit (0 0.1*, 0 0.2* ... 0.0* 0, 0.0* 0.0* ... 0.5 0.5).
I am using this loop:
for i in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
    for j in 0 0.0 0.1 0.2 0.3 0.4 0.5
    do
        grep -F ""$i" "$j"" file | wc -l > "$i"_"$j"
    done
done
rm 0_0   # the 0_0 file produced above is wrong; the proper way is the next command, which accepts \n
pcregrep -M "0 0\n" file | wc -l > 0_0
The problem is that, for example, the line
0.0142 0
will not be counted in the "0.0 0" iteration, since there are digits after the "0.0". Removing the -F option from grep so that all numbers starting with "0.0" are considered will not work either, since the dot would then be treated as a wildcard symbol, and therefore, for example, in the "0.1 0" iteration the line
0.0142 0
would be counted, because 0.0142 reads as 0-"anything"-1.
I hope I am making myself clear!
Is there any way to include a wildcard symbol with grep -F, like in:
for i in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
    for j in 0 0.0 0.1 0.2 0.3 0.4 0.5
    do
        grep -F ""$i"* "$j"*" file | wc -l > "$i"_"$j"
    done
done
(Please notice the asterisks after the variables in the grep command).
Thank you!
Don't use shell loops just to manipulate text; that's what the people who invented the shell also invented awk to do. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice.
It sounds like all you need is:
awk '{cnt[substr($1,1,3)"_"substr($2,1,3)]++} END{ for (pair in cnt) {print cnt[pair] > pair; close(pair)} }' file
That will be vastly more efficient than your nested shell loops approach.
Here's what it'll be outputting to the files it creates:
$ awk '{cnt[substr($1,1,3)"_"substr($2,1,3)]++} END{for (pair in cnt) print pair "\t" cnt[pair]}' file
0.0_0.3 1
0_0.4 1
0.5_0 1
0.2_0.5 1
0.4_0.3 1
0.0_0 2
0.1_0.0 1
0.3_0 1
0.1_0.1 1
0.1_0.2 1
0.3_0.0 1
0_0 1
0.1_0 1
0.5_0.3 1
0.4_0 1
0.3_0.3 1
0.2_0.0 1
0_0.0 2
0.5_0.5 1
0.3_0.4 1
0.2_0.1 1
0.0_0.0 1
0_0.1 1
0_0.2 1
0.4_0.0 1
0.2_0 1
0.0_0.2 1
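For readability, here is the same awk program spread over several lines with comments; it is functionally identical to the one-liner above:
awk '
    {
        # key on the first three characters of each column, e.g. "0.0_0.3";
        # for a bare "0", substr() simply returns "0"
        cnt[substr($1,1,3) "_" substr($2,1,3)]++
    }
    END {
        # write each count to a file named after its pair of prefixes
        for (pair in cnt) {
            print cnt[pair] > pair
            close(pair)
        }
    }
' file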

How to find sum of elements in column inside of a text file (Bash)

I have a log file with lots of unnecessary information. The only important part of that file is a table which describes some statistics. My goal is to have a script which will accept a column name as argument and return the sum of all the elements in the specified column.
Example log file:
.........
Skipped....
........
WARNING: [AA[409]: Some bad thing happened.
--- TOOL_A: READING COMPLETED. CPU TIME = 0 REAL TIME = 2
--------------------------------------------------------------------------------
----- TOOL_A statistics -----
--------------------------------------------------------------------------------
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
AAA 885 0 0 0 0
AAAA2 1 0 2 0 0
AAAA4 0 0 2 0 0
AAAA8 0 0 2 0 0
AAAA16 0 0 2 0 0
AAAA1 0 0 2 0 0
AAAA8 0 0 23 0 0
AAAAAAA4 0 0 18 0 0
AAAA2 0 0 14 0 0
AAAAAA2 0 0 21 0 0
AAAAA4 0 0 23 0 0
AAAAA1 0 0 47 0 0
AAAAAA1 2 0 26 0
NOTE: Some notes
......
Skipped ......
The expected usage is script.sh Attr1
Expected output:
888
I've tried to find something with sed/awk but failed to figure out a solution.
tldr;
$ cat myscript.sh
#!/bin/sh
logfile=${1}
attribute=${2}
field=$(grep -o "NAME.\+${attribute}" ${logfile} | wc -w)
sed -nre '/NAME/,/NOTE/{/NAME/d;/NOTE/d;s/\s+/\t/gp;}' ${logfile} | \
cut -f${field} | \
paste -sd+ | \
bc
$ ./myscript.sh mylog.log Attr3
182
Explanation:
assign the command-line arguments ${1} and ${2} to the logfile and attribute variables, respectively
with wc -w, count the number of words in the header line up to and including ${attribute}; that word count is the field index, assigned to field
with sed:
suppress automatic printing (-n) and enable extended regular expressions (-r)
select the lines between the NAME and NOTE lines, inclusive
delete the NAME and NOTE lines themselves
translate each contiguous run of whitespace to a single tab and print the result
cut out the column at the field index
paste all the numbers together as an infix summation
evaluate the infix summation with bc, as illustrated below
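To make the pipeline concrete: on the sample log above, Attr3 is the 4th word of the header line, so field is 4, and running the middle stages by hand should produce something like this (a hand-worked illustration, not verified output):
$ sed -nre '/NAME/,/NOTE/{/NAME/d;/NOTE/d;s/\s+/\t/gp;}' mylog.log | cut -f4 | paste -sd+
0+2+2+2+2+2+23+18+14+21+23+47+26
$ echo '0+2+2+2+2+2+23+18+14+21+23+47+26' | bc
182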
Quick and dirty (without any further spec):
awk -v CountCol=2 '/^[^[:blank:]]/ && NF == 6 { S += $(CountCol) } END { print S + 0 }' YourFile
With the column name:
awk -v ColName='Attr1' '
    /NAME/ && NF == 6 { for (i = 1; i <= NF; i++) if ($i == ColName) CountCol = i }
    /^[^[:blank:]]/ && NF == 6 && CountCol { S += $(CountCol) }
    END { print S + 0 }
    ' YourFile
You should add a header/trailer filter to avoid noisy lines (a flag would suit this perfectly), but since there is not enough information about the structure to set such a flag, I use a simple field count (assuming that text fields evaluate to 0, so they don't change the sum when counted).
$ awk -v col='Attr3' '/NAME/{for (i=1;i<=NF;i++) f[$i]=i} col in f{sum+=$(f[col]); if (!NF) {print sum+0; exit} }' file
182
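A slightly restructured, commented sketch of the same header-mapping idea (it assumes, like the one-liner above, that a blank line follows the table):
awk -v col='Attr3' '
    # on the header line, map each column name to its field index
    $1 == "NAME" { for (i = 1; i <= NF; i++) f[$i] = i; next }
    # once the header has been seen, a blank line marks the end of the table
    col in f && !NF { print sum + 0; exit }
    # otherwise accumulate the requested column (non-numeric text adds 0)
    col in f { sum += $(f[col]) }
' file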

How to progress the echo from 1 to 100% in bash in a given time?

I am using a bash script in which I want to show a timer before my next function is called. Instead of a regular sleep 5 I wanted to use some kind of progress display, so I found this snippet on Stack Overflow:
for pc in $(seq 1 10); do
    echo -ne "$pc%\033[0K\r"
    sleep 1
done
But this only counts from 1% to 10% over 10 seconds. What I need is for the display to go from 0 to 100% in 10 seconds, in steps of 10 (0 10 20 ... 90 100), just like the script above.
Any suggestions would help.
for pc in $(seq 0 10 100); do
    echo -ne "$pc%\033[0K\r"
    sleep 1
done
This will start from 0 and proceed to 100 in steps of 10, one step per second.
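If you want an actual bar rather than just a percentage, a minimal sketch along the same lines (one '#' per step, roughly 10 seconds total) could be:
bar=""
for (( i = 0; i <= 10; i++ )); do
    printf '\r[%-10s] %3d%%' "$bar" $(( i * 10 ))
    if (( i < 10 )); then sleep 1; fi   # no sleep after the final 100% update
    bar+='#'
done
printf '\n'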

How to determine statistical significance in shell

I would like to determine the statistical significance of my results using shell scripting. My input file shows the number of errors in each trial, over 10000 observations. Part of it is listed below (using a threshold of at least 1 error):
ifile.txt
1
2
2
4
1
3
2
3
4
2
3
4
2
6
2
Then I calculated the probability of each error count as:
awk '{ count[$0]++; total++ }
END { for(i in count) printf("%d %.3f\n", i, count[i]/total) }' ifile.txt | sort -n > ofile.txt
where the first column in ofile.txt shows the number of errors and the second column shows its probability:
ofile.txt
1 0.133
2 0.400
3 0.200
4 0.200
6 0.067
Now I need to determine the statistical significance of this result, e.g. to highlight those results which are not statistically significant at the 1% level; i.e. we accept those error counts whose p-value is < 0.005, and if an error count has a p-value > 0.005 we reject it.
I can't think of any way to do this in the shell. Can anybody help or suggest something?
The desired output is something like:
outfile.txt
1 99999
2 0.400
3 0.200
4 0.200
6 99999
Here, I assumed the probability of showing 1 error is not statistically significant at the 1% level, but the probability of showing 2 errors is statistically significant, and so on.
With no statistics education or gnuplot experience, it's a bit difficult to work out exactly which method is wanted for a solution; the problem may not be described well enough, or my knowledge may be ill-equipped for it.
Either way, after looking at the relationship between the data presented and the desired output, I came up with this awk script to achieve it:
$ cat script.awk
function abs(v) { return v < 0 ? -v : v }
{ a[$0]++ }
END {
    obs = 10000
    sig = 1
    for (i in a) {
        r = a[i]/NR
        if (abs(r - sig/10) <= sig/20)
            print i, obs - sig
        else
            printf "%d %.3f\n", i, r
    }
}
$ awk -f script.awk ifile.txt | sort > outfile.txt
$ cat outfile.txt
1 9999
2 0.400
3 0.200
4 0.200
6 9999
This assumes that 9999 (10000 observations minus 1 error) was meant as the second field in the 1st and 5th lines of the desired output, not 99999.
Also, if you're using GNU Awk, the pipe into sort could be eliminated by sorting inside the script with the built-in asorti() function.
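For example, with GNU Awk the END block could sort the keys itself; a sketch (assuming gawk 4.0+ for the "@ind_num_asc" ordering):
END {
    obs = 10000
    sig = 1
    n = asorti(a, keys, "@ind_num_asc")   # sort the error counts numerically
    for (k = 1; k <= n; k++) {
        i = keys[k]
        r = a[i]/NR
        if (abs(r - sig/10) <= sig/20)
            print i, obs - sig
        else
            printf "%d %.3f\n", i, r
    }
}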
