Edit functions in bash - awk - bash

I have a function like this...
function size {
    export FILENAME=$1
    export SIZE=$(du -sb $FILENAME | awk '{ print $1 }')
    awk 'BEGIN{
        x = ENVIRON["SIZE"]
        split("Byte KiloByte MegaByte GigaByte TeraByte PetaByte ExaByte ZettaByte YottaByte", type)
        for (i = 8; y < 1; i--)
            y = x / (2**(10*i))
        print y " " type[i+2]
    }'
}
size "/home/foo.bar" # 1 MegaByte
How can I get the output of: print y " " type[i+2]
into the variable: SIZE_FILE ?
My attempt: SIZE_FILE=${print y " " type[i+2]} # error :-(
Thank you very much

The $( expr ) construct will save the result of evaluating "expr" into a variable:
theDate=$(date)
You can also use backticks, but I think the $() is more readable:
theDate=`date`
So for your script, you'll use:
function size {
    export FILENAME=$1
    SIZE=$(du -sb "$FILENAME" | awk '{ print $1 }')
    export FILE_SIZE=$(awk -v x="$SIZE" 'BEGIN{
        split("Byte KiloByte MegaByte GigaByte TeraByte PetaByte ExaByte ZettaByte YottaByte", type)
        for (i = 8; y < 1; i--)
            y = x / (2**(10*i))
        print y " " type[i+2]
    }')
    echo "$FILE_SIZE"
}
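Calling it as in the question then prints the size and also leaves it in FILE_SIZE:
size "/home/foo.bar"   # prints e.g. "1 MegaByte"
echo "$FILE_SIZE"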

You can do this without awk, which is better suited to processing text files.
function size () {
    # Non-environment variables should be lowercased
    # Always quote parameter expansions, in case they contain spaces
    local filename="$1"
    # Simpler way to get the file size in bytes
    local size=$(stat -c%s "$filename")
    # You could put all the units in an array, but we'll keep it simple.
    for unit in Byte KiloByte MegaByte GigaByte TeraByte PetaByte ExaByte ZettaByte YottaByte; do
        echo "$size $unit"
        (( size /= 1024 ))
    done
}
sizes=$( size "$myfile" )
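Note that stat -c%s is GNU coreutils syntax; on BSD/macOS the equivalent flag is -f %z, so there you would swap in:
local size=$(stat -f%z "$filename")   # BSD/macOS stat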


Generic "append to file if not exists" function in Bash

I am trying to write a util function in a bash script that can take a multi-line string and append it to the supplied file if it is not already present in the file.
This works fine using grep if the pattern does not contain \n.
if grep -qF "$1" "$2"
then
    return 1
else
    echo "$1" >> "$2"
fi
Example usage:
append 'sometext\nthat spans\n\tmultiple lines' ~/textfile.txt
I am on macOS, btw, which has presented some problems: some of the solutions I've seen posted elsewhere are very Linux-specific. I'd also like to avoid installing any other tools to achieve this, if possible.
Many thanks
If the files are small enough to slurp into a Bash variable (you should be OK up to a megabyte or so on a modern system), and don't contain NUL (ASCII 0) characters, then this should work:
IFS= read -r -d '' contents <"$2"
if [[ "$contents" == *"$1"* ]]; then
    return 1
else
    printf '%s\n' "$1" >>"$2"
fi
In practice, the speed of Bash's built-in pattern matching might be more of a limitation than the ability to slurp the file contents.
See the accepted, and excellent, answer to Why is printf better than echo? for an explanation of why I replaced echo with printf.
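Wrapped into the append function from the question, it might look like this (a sketch; the function name and argument order are the asker's):
append() {
    local contents
    IFS= read -r -d '' contents <"$2"
    if [[ "$contents" == *"$1"* ]]; then
        return 1
    else
        printf '%s\n' "$1" >>"$2"
    fi
}
append $'sometext\nthat spans\n\tmultiple lines' ~/textfile.txt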
Using awk:
awk '
    BEGIN {
        n = 0  # length of pattern in lines
        m = 0  # number of matching lines
    }
    NR == FNR {
        pat[n++] = $0
        next
    }
    {
        if ($0 == pat[m])
            m++
        else if (m > 0 && $0 == pat[0])
            m = 1
        else
            m = 0
    }
    m == n {
        exit
    }
    END {
        if (m < n) {
            for (i = 0; i < n; i++)
                print pat[i] >>FILENAME
        }
    }
' - "$2" <<EOF
$1
EOF
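As a usage sketch, assuming the awk program above is saved to append.awk, the asker's function body would reduce to:
append() {
    awk -f append.awk - "$2" <<EOF
$1
EOF
}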
If necessary, one would need to properly escape any metacharacters inside FS/OFS. The trick below slurps the whole input as a single record, uses the multi-line pattern itself as the field separator, and appends the pattern (via OFS and ++NF) only when the input does not already split on it:
jot 7 9 |
{m,g,n}awk 'BEGIN { FS = OFS = "11\n12\n13\n"
_^= RS = (ORS = "") "^$" } _<NF || ++NF'
9
10
11
12
13
14
15
jot 7 -2 | (... awk stuff ...)
-2
-1
0
1
2
3
4
11
12
13

How to calculate Adler32 checksum for zip in Bash?

I need to get the Adler32 checksum and store it in a variable in bash.
It will be used in an automated script, so it would be best if it required no additional app/library that the user would need to install.
Is it possible to get this value using common/basic Bash commands?
This is monumentally slow (about 60,000 times slower than C), but it shows that yes, it is possible.
#!/bin/bash
# Adler-32 of standard input, one byte at a time
sum1=1
sum2=0
while LANG=C IFS= read -r -d '' -n 1 ch; do
    printf -v val '%d' "'$ch"
    (( val = val < 0 ? val + 256 : val, sum1 = (sum1 + val) % 65521, sum2 = (sum2 + sum1) % 65521 ))
done
(( adler = sum1 + 65536 * sum2 ))
echo $adler
Hopefully someone who actually knows bash could vastly improve on this.
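For reference, usage might look like this (assuming the script is saved as adler32.sh; it checksums standard input):
adler=$(./adler32.sh < somefile)
echo "$adler"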
Maybe this solution? (Note that the checksum must be computed over the file's contents, not over its name, hence the open/read.)
python -c "import sys, zlib; print(zlib.adler32(open(sys.argv[1], 'rb').read()))" "$file"
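To store it in a variable, as the question asks:
adler=$(python -c "import sys, zlib; print(zlib.adler32(open(sys.argv[1], 'rb').read()))" "$file")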
I tried two Adler-32 bash functions,
one with an ord dictionary (an associative array mapping each byte character to its value) and one with printf.
I also tried some bit shifting, e.g.
replacing sum1=(sum1+val)%65521 with temp=(sum1+val); sum1=((temp>>16)*15 + (temp & 65535)) % 65521,
but wasn't able to improve it much; perhaps somebody knows a faster way.
The last function is an awk function; it is the fastest, and it also works on files.
#!/bin/bash

a=$'Hello World'; b=""
for ((i=0; i<1000; i++)); do b+=$a; done

#-- building associative array: ord value of each byte character
declare -Ai ordCHAR=()
for ((i=1; i<256; i++)); do
    printf -v hex "%x" $i
    printf -v char "\x"$hex
    ordCHAR[$char]=$i
done
unset hex char i
#-- building associative array: ord value of each byte character -- END

#-- with dictionary
function adler32_A ()
{
    local char; local -i sum1=1 sum2=0 val
    LC_ALL=C
    while read -rN 1 char; do
        val=${ordCHAR[$char]}
        (( sum1 = (sum1 + val) % 65521, sum2 = (sum2 + sum1) % 65521 ))
    done <<< $1
    #-- removing 0A=\n addition, because of the here string
    (( sum2 -= sum1, sum2 < 0 ? sum2 += 65521 : 0, sum1 -= val, sum1 < 0 ? sum1 += 65521 : 0 ))
    printf "%08x" $(( (sum2 << 16) + sum1 ))
    LC_ALL=""
}

#-- with printf
function adler32_B ()
{
    local char; local -i sum1=1 sum2=0 val
    LC_ALL=C
    while read -rN 1 char; do
        printf -v val '%d' "'$char"
        (( sum1 = (sum1 + val) % 65521, sum2 = (sum2 + sum1) % 65521 ))
    done <<< $1
    #-- removing 0A=\n addition, because of the here string
    (( sum2 -= sum1, sum2 < 0 ? sum2 += 65521 : 0, sum1 -= val, sum1 < 0 ? sum1 += 65521 : 0 ))
    printf "%x" $(( sum1 + 65536 * sum2 ))
    LC_ALL=""
}

#-- call: adler32_awk [text STR] [evaluate text as path bINT]
function adler32_awk ()
{
    local -i bPath=$2
    awk -b '
        BEGIN {
            RS = "^$"; bPath = '"$bPath"'
            for (i = 0; i < 256; i++) charOrdARR[sprintf("%c", i)] = i
            A = 1; B = 0
        }
        {
            recordSTR = substr($0, 1, length($0) - 1)
            if (bPath) { getline byte_data < recordSTR; close(recordSTR) } else byte_data = recordSTR
            l = length(byte_data)
            for (i = 1; i <= l; i++) {
                A += charOrdARR[substr(byte_data, i, 1)]; if (A > 65520) A -= 65521
                B += A; if (B > 65520) B -= 65521
            }
            printf "%x", lshift(B, 16) + A
        }
    ' <<<$1
}

time adler32_A "$b"
time adler32_B "$b"
#-- adler32 of a file -> adler32_awk "/home/.../your file" 1
time adler32_awk "$b"

How to split file by percentage of no. of lines?

How to split file by percentage of no. of lines?
Let's say I want to split my file into 3 portions (60%/20%/20% parts). I could do this manually, -_- :
$ wc -l brown.txt
57339 brown.txt
$ bc <<< "57339 / 10 * 6"
34398
$ bc <<< "57339 / 10 * 2"
11466
$ bc <<< "34398 + 11466"
45864
bc <<< "34398 + 11466 + 11475"
57339
$ head -n 34398 brown.txt > part1.txt
$ sed -n 34399,45864p brown.txt > part2.txt
$ sed -n 45865,57339p brown.txt > part3.txt
$ wc -l part*.txt
34398 part1.txt
11466 part2.txt
11475 part3.txt
57339 total
But I'm sure there's a better way!
There is a utility that takes as arguments the line numbers that should become the first of each respective new file: csplit. This is a wrapper around its POSIX version:
#!/bin/bash

usage () {
    printf '%s\n' "${0##*/} [-ks] [-f prefix] [-n number] file arg1..." >&2
}

# Collect csplit options
while getopts "ksf:n:" opt; do
    case "$opt" in
        k|s) args+=(-"$opt") ;;           # k: no remove on error, s: silent
        f|n) args+=(-"$opt" "$OPTARG") ;; # f: filename prefix, n: digits in number
        *) usage; exit 1 ;;
    esac
done
shift $(( OPTIND - 1 ))

fname=$1
shift
ratios=("$@")
len=$(wc -l < "$fname")

# Sum of ratios and array of cumulative ratios
for ratio in "${ratios[@]}"; do
    (( total += ratio ))
    cumsums+=("$total")
done

# Don't need the last element
unset cumsums[-1]

# Array of numbers of first line in each split file
for sum in "${cumsums[@]}"; do
    linenums+=( $(( sum * len / total + 1 )) )
done

csplit "${args[@]}" "$fname" "${linenums[@]}"
After the name of the file to split up, it takes the ratios for the sizes of the split files relative to their sum, i.e.,
percsplit brown.txt 60 20 20
percsplit brown.txt 6 2 2
percsplit brown.txt 3 1 1
are all equivalent.
Usage similar to the case in the question is as follows:
$ percsplit -s -f part -n 1 brown.txt 60 20 20
$ wc -l part*
34403 part0
11468 part1
11468 part2
57339 total
Numbering starts with zero, though, and there is no txt extension. The GNU version supports a --suffix-format option that would allow for .txt extension and which could be added to the accepted arguments, but that would require something more elaborate than getopts to parse them.
This solution plays nice with very short files (e.g., splitting a two-line file into two), and the heavy lifting is done by csplit itself.
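For comparison, a direct GNU csplit call with a .txt suffix might look like this (a sketch; 34399 and 45865 are the first lines of parts 2 and 3 as computed in the question):
csplit -s -f part --suffix-format='%d.txt' brown.txt 34399 45865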
$ cat file
a
b
c
d
e
$ cat tst.awk
BEGIN {
    split(pcts,p)
    nrs[1]
    for (i=1; i in p; i++) {
        pct += p[i]
        nrs[int(size * pct / 100) + 1]
    }
}
NR in nrs { close(out); out = "part" ++fileNr ".txt" }
{ print $0 " > " out }
$ awk -v size=$(wc -l < file) -v pcts="60 20 20" -f tst.awk file
a > part1.txt
b > part1.txt
c > part1.txt
d > part2.txt
e > part3.txt
Change the " > " to just > to actually write to the output files.
Usage
The following bash script allows you to specify the percentages like
./split.sh brown.txt 60 20 20
You can also use the placeholder . which fills the percentage up to 100%.
./split.sh brown.txt 60 20 .
The split files are written to
part1-brown.txt
part2-brown.txt
part3-brown.txt
The script always generates as many part files as numbers specified.
If the percentages sum up to 100, cat part* will always reproduce the original file (no duplicated or missing lines).
Bash Script: split.sh
#! /bin/bash
file="$1"
fileLength=$(wc -l < "$file")
shift

part=1
percentSum=0
currentLine=1
for percent in "$@"; do
    [ "$percent" == "." ] && ((percent = 100 - percentSum))
    ((percentSum += percent))
    if ((percent < 0 || percentSum > 100)); then
        echo "invalid percentage" 1>&2
        exit 1
    fi
    ((nextLine = fileLength * percentSum / 100))
    if ((nextLine < currentLine)); then
        printf "" # create empty file
    else
        sed -n "$currentLine,$nextLine"p "$file"
    fi > "part$part-$file"
    ((currentLine = nextLine + 1))
    ((part++))
done
An awk solution:
BEGIN {
    split(w, weight)
    total = 0
    for (i in weight) {
        weight[i] += total
        total = weight[i]
    }
}
FNR == 1 {
    if (NR != 1) {
        write_partitioned_files(weight, a)
        split("", a, ":") # empty a portably
    }
    name = FILENAME
}
{ a[FNR] = $0 }
END {
    write_partitioned_files(weight, a)
}

function write_partitioned_files(weight, a) {
    split("", threshold, ":")
    size = length(a)
    for (i in weight) {
        threshold[length(threshold)] = int((size * weight[i] / total) + 0.5) + 1
    }
    l = 1
    part = 0
    for (i in threshold) {
        close(out)
        out = name ".part" ++part
        for (; l < threshold[i]; l++) {
            print a[l] " > " out
        }
    }
}
Invoke as:
awk -v w="60 20 20" -f above_script.awk file_to_split1 file_to_split2 ...
Replace " > " with > in script to actually write partitioned files.
The variable w expects space-separated numbers, and files are partitioned in that proportion. For example, "2 1 1 3" will partition a file into four parts with line counts in the proportion 2:1:1:3. Any sequence of numbers adding up to 100 can be used as percentages.
For large files the array a may consume too much memory. If that is an issue, here is an alternative awk script:
BEGIN {
    split(w, weight)
    for (i in weight) {
        total += weight[i]; weight[i] = total # cumulative sum
    }
}
FNR == 1 {
    # get number of lines; take care of single quotes in filename
    name = gensub("'", "'\"'\"'", "g", FILENAME)
    "wc -l '" name "'" | getline size
    split("", threshold, ":")
    for (i in weight) {
        threshold[length(threshold)+1] = int((size * weight[i] / total) + 0.5) + 1
    }
    part = 1; close(out); out = FILENAME ".part" part
}
{
    if (FNR >= threshold[part]) {
        close(out); out = FILENAME ".part" ++part
    }
    print $0 " > " out
}
This passes through each file twice: once for counting lines (via wc -l), and again while writing the partitioned files. Invocation and effect are similar to the first method.
I like Benjamin W.'s csplit solution, but it's so long...
#!/bin/bash
# usage: ./splitpercs.sh file 60 20 20
n=`wc -l <"$1"` || exit 1
echo $* | tr ' ' '\n' | tail -n+2 | head -n`expr $# - 1` |
    awk -v n=$n 'BEGIN{r=1} {r+=n*$0/100; if(r > 1 && r < n){printf "%d\n",r}}' |
    uniq | xargs csplit -sfpart "$1"
(the if(r > 1 && r < n) and uniq bits are to prevent creating empty files or strange behavior for small percentages, files with small numbers of lines, or percentages that add to over 100.)
I just followed your lead and made what you do manually into a script. It may not be the fastest or "best", but if you understand what you are doing now and can just "scriptify" it, you may be better off should you need to maintain it.
#!/bin/bash
# thisScript.sh yourfile.txt 20 50 10 20
YOURFILE=$1
shift

# changed to cat | wc so I don't have to remove the filename which comes from wc -l
LINES=$(cat $YOURFILE | wc -l )

startpct=0
PART=1
for pct in "$@"
do
    # I am assuming that each parameter is on top of the last
    # so 10 30 10 would become 10, 10+30 = 40, 10+30+10 = 50, ...
    endpct=$( echo "$startpct + $pct" | bc)
    # your math, but changed to parts of 100 instead of parts of 10.
    # changed bc <<< to echo "..." | bc
    # so that one can capture the output into a bash variable.
    FIRSTLINE=$( echo "$LINES * $startpct / 100 + 1" | bc )
    LASTLINE=$( echo "$LINES * $endpct / 100" | bc )
    # use sed every time because the special case for head
    # doesn't really help performance.
    sed -n $FIRSTLINE,${LASTLINE}p $YOURFILE > part${PART}.txt
    ((PART++))
    startpct=$endpct
done

# get the rest if the % dont add to 100%
if [[ $( echo "$startpct < 100" | bc ) -gt 0 ]] ; then
    FIRSTLINE=$( echo "$LINES * $startpct / 100 + 1" | bc )
    sed -n $FIRSTLINE,${LINES}p $YOURFILE > part${PART}.txt
fi
wc -l part*.txt

Sorting strings from array takes a long time

Reading a text file into an array, extracting elements and sorting them is taking a very long time.
The text file is ffmpeg console output for R128 audio analysis. I need to get the highest M and S values. Example:
[Parsed_ebur128_0 @ 0x7fd32a60caa0] t: 4.49998 M: -22.2 S: -29.9 I: -27.0 LUFS LRA: 9.8 LU FTPK: -12.4 dBFS TPK: -9.7 dBFS
[Parsed_ebur128_0 @ 0x7fd32a60caa0] t: 4.69998 M: -22.5 S: -28.6 I: -25.9 LUFS LRA: 11.3 LU FTPK: -12.7 dBFS TPK: -9.7 dBFS
The text file can be hundreds or thousands of lines long, depending on the duration of the audio file being analysed.
I want to find the highest M (-22.2) and S (-28.6) values and assign them to the variables M and S.
This is what I am using currently:
ARRAY=()
while read LINE
do
    ARRAY+=("$LINE")
done < $tempDir/text.txt

for LINE in "${ARRAY[@]}"
do
    echo "$LINE" | sed -n '/B:/p' | sed 's/S:.*//' | sed -n -e 's/^.*M://p' | sed -n -e 's/-//p' >>$tempDir/R128M.txt
done

for LINE in "${ARRAY[@]}"
do
    echo "$LINE" | sed -n '/M:/p' | sed 's/I:.*//' | sed -n -e 's/^.*S://p' | sed -n -e 's/-//p' >>$tempDir/R128S.txt
done

cat $tempDir/R128M.txt
M=( $(sort $tempDir/R128M.txt) )
cat $tempDir/R128S.txt
S=( $(sort $tempDir/R128S.txt) )
Is there a faster way of doing this?
Rather than reading the whole file into memory, writing bits of it out to separate files, and reading those in again, just parse it and pick out the largest values:
$ awk '$7 > m || m == "" { m = $7 } $9 > s || s == "" { s = $9 } END { print m, s }' data
-22.2 -28.6
In your data, fields 7 and 9 contain the values of M and S. The awk script will update its m and s variables if it finds larger values in these fields, and print the largest found at the end. The m == "" and s == "" tests are needed to trigger initialization of the values if no values have been read yet.
Another way with awk, which may look cleaner:
$ awk 'FNR == 1 { m = $7; s = $9; next } $7 > m { m = $7 } $9 > s { s = $9 } END { print m, s }' data
To assign them to M and S in the shell:
$ declare $( awk 'FNR == 1 { m = $7; s = $9; next } $7 > m { m = $7 } $9 > s { s = $9 } END { printf("M=%f S=%f\n", m, s) }' data )
$ echo $M $S
-22.200000 -28.600000
Adjust the printf() format to use %s instead of %f if you want the original strings instead of float values, or set the number of decimals you might want with, e.g., %.2f in place of %f.
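For example, keeping the original strings:
$ declare $( awk 'FNR == 1 { m = $7; s = $9; next } $7 > m { m = $7 } $9 > s { s = $9 } END { printf("M=%s S=%s\n", m, s) }' data )
$ echo $M $S
-22.2 -28.6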
First of all, a three-process pipe is a bit redundant for extracting a single value, especially considering that you re-instantiate it anew for every line.
Next, you save all the values into a file and then sort that file, while all you need is the maximum value. You can easily find it during the very first (value-extraction) loop, for an additional O(N) running time, instead of the extra I/O and the O(N log N) expense of sorting. See ARITHMETIC EXPANSION and conditional expressions in the bash manual.
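A minimal sketch of that approach, assuming the line layout from the question (M is the 7th whitespace-separated word, S the 9th) and exactly one decimal digit, so the values can be compared as integers in tenths:
maxM= maxM10= maxS= maxS10=
while read -r _ _ _ _ _ _ m _ s _; do
    m10=${m/./}; s10=${s/./}   # "-22.2" -> -222 (tenths)
    if [[ -z $maxM ]] || (( m10 > maxM10 )); then maxM=$m maxM10=$m10; fi
    if [[ -z $maxS ]] || (( s10 > maxS10 )); then maxS=$s maxS10=$s10; fi
done < "$tempDir/text.txt"
M=$maxM S=$maxS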

How do I iterate through each line of a command's output in bash?

I have a script that reads from /proc/stat and calculates CPU usage. There are three relevant lines in /proc/stat:
cpu 1312092 24 395204 12582958 77712 456 3890 0 0 0
cpu0 617029 12 204802 8341965 62291 443 2718 0 0 0
cpu1 695063 12 190402 4240992 15420 12 1172 0 0 0
Currently, my script only reads the first line and calculates usage from that:
cpu=($( cat /proc/stat | grep '^cpu[^0-9] ' ))
unset cpu[0]
idle=${cpu[4]}
total=0
for value in "${cpu[@]}"; do
    let total=$(( total+value ))
done
let usage=$(( (1000*(total-idle)/total+5)/10 ))
echo "$usage%"
This works as expected, because the script only parses this line:
cpu 1312092 24 395204 12582958 77712 456 3890 0 0 0
It's easy enough to get only the lines starting with cpu0 and cpu1:
cpu=$( cat /proc/stat | grep '^cpu[0-9] ' )
but I don't know how to iterate over each line and apply this same process. I've tried resetting the internal field separator inside a subshell, like this:
cpus=$( cat /proc/stat | grep '^cpu[0-9] ' )
(
    IFS=$'\n'
    for cpu in $cpus; do
        cpu=($cpu)
        unset cpu[0]
        idle=${cpu[4]}
        total=0
        for value in "${cpu[@]}"; do
            let total=$(( total+value ))
        done
        let usage=$(( (1000*(total-idle)/total+5)/10 ))
        echo -n "$usage%"
    done
)
but this gets me a syntax error:
line 18: (1000*(total-idle)/total+5)/10 : division by 0 (error token is "+5)/10 ")
If I echo the cpu variable in the loop, it looks like it's separating the lines properly. I looked at this thread and I think I'm assigning the cpu variable to an array properly, but is there another error I'm not seeing?
I put my script into "what's wrong with my script" and it doesn't show me any errors apart from a warning about using cat within $(), so I'm stumped.
Change this line in the middle of your loop:
IFS=' ' cpu=($cpu)
You need this because outside of your loop you're setting IFS=$'\n', but with that setting cpu=($cpu) won't do what you expect.
Btw, I would write your script like this:
#!/bin/bash -e

grep ^cpu /proc/stat | while IFS=$'\n' read cpu; do
    cpu=($cpu)
    name=${cpu[0]}
    unset cpu[0]
    idle=${cpu[4]}

    total=0
    for value in "${cpu[@]}"; do
        ((total+=value))
    done

    ((usage=(1000 * (total - idle) / total + 5) / 10))
    echo "$name $usage%"
done
The equivalent using awk:
awk '/^cpu/ { total=0; idle=$5; for (i=2; i<=NF; ++i) { total += $i }; print $1, int((1000 * (total - idle) / total + 5) / 10) }' < /proc/stat
Because the OP asked, an awk program.
awk '
/cpu[0-9] .*/ {
    total = 0
    idle = $5
    # fields 2..NF hold the counters; $1 is the cpu name
    for (i = 2; i <= NF; i++) { total += $i }
    printf("%s: %f%%\n", $1, 100*(total-idle)/total)
}
' /proc/stat
The /cpu[0-9] .*/ means "execute for every line matching this expression".
The variables like $1 do what you'd expect, but the 1st field has index 1, not 0: $0 means the whole line in awk.
