Simplest method to convert file-size with suffix to bytes

Title says it all really, but I'm currently using a simple function with a case statement to convert human-readable file size strings into a size in bytes. It works well enough, but it's a bit unwieldy for porting into other code, so I'm curious to know if there are any widely available commands that a shell script could use instead?
Basically I want to take strings such as "100g" or "100gb" and convert them into bytes.
I'm currently doing the following:
to_bytes() {
value=$(echo "$1" | sed 's/[^0123456789].*$//g')
units=$(echo "$1" | sed 's/^[0123456789]*//g' | tr '[:upper:]' '[:lower:]')
case "$units" in
t|tb) let 'value *= 1024 * 1024 * 1024 * 1024' ;;
g|gb) let 'value *= 1024 * 1024 * 1024' ;;
m|mb) let 'value *= 1024 * 1024' ;;
k|kb) let 'value *= 1024' ;;
b|'') let 'value += 0' ;;
*)
value=
echo "Unsupported units '$units'" >&2
;;
esac
echo "$value"
}
It seems a bit overkill for something I would have thought was fairly common for scripts working with files; common enough that something might exist to do this more quickly.
If there are no widely available solutions (i.e. available on the majority of Unix and Linux flavours) then I'd still appreciate any tips for optimising the above function, as I'd like to make it smaller and easier to re-use.

See man numfmt.
# numfmt --from=iec 42 512K 10M 7G 3.5T
42
524288
10485760
7516192768
3848290697216
# numfmt --to=iec 42 524288 10485760 7516192768 3848290697216
42
512K
10M
7.0G
3.5T
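A minimal sketch of how numfmt could stand in for the hand-written case statement when the input looks like "100g" or "100gb". The wrapper name to_bytes_numfmt is hypothetical. It assumes GNU coreutils numfmt is installed and that a trailing "b" or "ib" can simply be ignored, so every suffix is treated as a power of 1024.

```shell
# Hypothetical wrapper: normalise "100gb" / "100GiB" / "100g" to "100G",
# then let numfmt do the conversion. Requires GNU coreutils numfmt.
to_bytes_numfmt() {
    size=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    size=${size%B}    # drop a trailing B, if any
    size=${size%I}    # drop the I left over from "KiB"-style suffixes
    numfmt --from=iec "$size"
}
```

With this in place, to_bytes_numfmt 100gb should print 107374182400, and an unrecognised suffix makes numfmt exit with an error instead of printing a value.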

toBytes() {
echo $1 | echo $((`sed 's/.*/\L\0/;s/t/Xg/;s/g/Xm/;s/m/Xk/;s/k/X/;s/b//;s/X/ *1024/g'`))
}

Here's something I wrote. It supports k, KB, and KiB. (It doesn't distinguish between powers of two and powers of ten suffixes, though, as in 1KB = 1000 bytes, 1KiB = 1024 bytes.)
#!/bin/bash
parseSize() {(
local SUFFIXES=('' K M G T P E Z Y)
local MULTIPLIER=1
shopt -s nocasematch
for SUFFIX in "${SUFFIXES[@]}"; do
local REGEX="^([0-9]+)(${SUFFIX}i?B?)?\$"
if [[ $1 =~ $REGEX ]]; then
echo $((${BASH_REMATCH[1]} * MULTIPLIER))
return 0
fi
((MULTIPLIER *= 1024))
done
echo "$0: invalid size \`$1'" >&2
return 1
)}
Notes:
Leverages bash's =~ regex operator, which stores matches in an array named BASH_REMATCH.
Notice the cleverly-hidden parentheses surrounding the function body. They're there to keep shopt -s nocasematch from leaking out of the function.
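The subshell trick from the second note is not specific to shopt. As a rough illustration, here is the same idea with the POSIX option set -f (noglob); the function name show_noglob is made up for this sketch.

```shell
# The function body is wrapped in ( ... ) instead of { ... }, so it runs in a
# subshell and any option changes are discarded when the function returns.
show_noglob() (
    set -f    # turn globbing off, but only inside this subshell
    case $- in
        *f*) echo "inside: noglob is on" ;;
        *)   echo "inside: noglob is off" ;;
    esac
)

show_noglob
# Back in the caller, the option is unchanged.
case $- in
    *f*) echo "outside: noglob is on" ;;
    *)   echo "outside: noglob is off" ;;
esac
```

The price of this isolation is a fork per call, and variables assigned inside the function do not survive either.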

don't know if this is ok:
awk 'BEGIN{b=1;k=1024;m=k*k;g=k^3;t=k^4}
/^[0-9.]+[kgmt]?b?$/&&/[kgmtb]$/{
sub(/b$/,"")
sub(/g/,"*"g)
sub(/k/,"*"k)
sub(/m/,"*"m)
sub(/t/,"*"t)
"echo "$0"|bc"|getline r; print r; exit;}
{print "invalid input"}'
this only handles single line input. if multilines are needed, remove the exit
this checks only pattern [kgmt] and optional b. e.g. kib, mib would fail. also currently is only for lower-case.
e.g.:
kent$ echo "200kb"|awk 'BEGIN{b=1;k=1024;m=k*k;g=k^3;t=k^4}
/^[0-9.]+[kgmt]?b?$/&&/[kgmtb]$/{
sub(/b$/,"")
sub(/g/,"*"g)
sub(/k/,"*"k)
sub(/m/,"*"m)
sub(/t/,"*"t)
"echo "$0"|bc"|getline r
print r; exit
}{print "invalid input"}'
204800

Okay, so it sounds like there's nothing built-in or widely available, which is a shame, so I've had a go at reducing the size of the function and come up with something that's only really 4 lines long, though it's a pretty complicated four lines!
I'm not sure if it's suitable as an answer to my original question as it's not really what I'd call the simplest method, but I want to put it up in case anyone thinks it's a useful solution, and it does have the advantage of being really short.
#!/bin/sh
to_bytes() {
units=$(echo "$1" | sed 's/^[0123456789]*//' | tr '[:upper:]' '[:lower:]')
index=$(echo "$units" | awk '{print index ("bkmgt kbgb  mbtb", $0)}')
mod=$(echo "1024^(($index-1)%5)" | bc)
[ "$mod" -gt 0 ] &&
echo $(echo "$1" | sed 's/[^0123456789].*$//g')"*$mod" | bc
}
To quickly summarise how it works: it first strips the number from the given string and forces the remainder to lowercase. It then uses awk to find the index of the suffix within a structured string of valid suffixes. The thing to note is that the string is arranged in groups of five characters (so it would need to be widened if more suffixes are added); for example, k and kb are at indices 2 and 7 respectively.
The index is then reduced by one and taken modulo five, so both k and kb become 1, m and mb become 2, and so on. That value is then used as the power to which 1024 is raised to get the multiplier. If the suffix was invalid this resolves to zero, and a suffix of b (or nothing) evaluates to 1.
So long as mod is greater than zero the input string is reduced to only the numeric part and multiplied by the modifier to get the end result.
This is actually how I would probably have solved this originally if I were using a language like PHP, Java etc., it's just a bit of a weird one to put together in a shell script.
I'd still very much appreciate any simplifications though!
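The index arithmetic described above can be checked in isolation. This is only a sketch: the helper name suffix_exponent is hypothetical, and it assumes the lookup string has two spaces before "mbtb" so that "mb" and "tb" start at indices 13 and 15.

```shell
# Print the power of 1024 that the lookup-string trick assigns to a suffix,
# or -1 if the suffix is not found in the string.
suffix_exponent() {
    printf '%s\n' "$1" | awk '{
        i = index("bkmgt kbgb  mbtb", $0)   # 1-based position, 0 if absent
        print (i > 0) ? (i - 1) % 5 : -1
    }'
}

for s in b k kb m mb g gb t tb; do
    printf '%s -> 1024^%s\n' "$s" "$(suffix_exponent "$s")"
done
```

Because index() matches any substring, a nonsense suffix such as "bg" is also found in the table, so the compact form accepts a few inputs the case statement would reject.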

Another variation, adding support for decimal values with a simpler T/G/M/K parser for outputs you might find from simpler Unix programs.
to_bytes() {
value=$(echo "$1" | sed -e 's/K//g' | sed -e 's/M//g' | sed -e 's/G//g' | sed -e 's/T//g' )
units=$(echo -n "$1" | grep -o .$ )
case "$units" in
T) value=$(bc <<< "scale=2; ($value * 1024 * 1024 * 1024 * 1024)") ;;
G) value=$(bc <<< "scale=2; ($value * 1024 * 1024 * 1024)") ;;
M) value=$(bc <<< "scale=2; ($value * 1024 * 1024)") ;;
K) value=$(bc <<< "scale=2; ($value * 1024)") ;;
b|'') let 'value += 0' ;;
*)
value=
echo "Unsupported units '$units'" >&2
;;
esac
echo "$value"
}

Related

If RANDOM only goes up to 32767, how can I generate a 9-digit random number?

How to generate 9 digit random number in shell?
I am trying something like this but it only gave numbers below 32768.
#!/bin/bash
mo=$((RANDOM%999999999))
echo "********Random"$mo
Please help
output should be ********Random453351111

In Linux with /dev/urandom:
$ rnd=$(tr -cd "[:digit:]" < /dev/urandom | head -c 9) && echo $rnd
463559879

I think this should do it (the lower bound is the smallest 9-digit number):
shuf -i 100000000-999999999 -n 1

As a work around, we could just simply ask for 1 random integer, for n times:
rand=''
for i in {1..9}; do
rand="${rand}$(( $RANDOM % 10 ))"
done
echo $rand
Note [1]: Since RANDOM's upper limit has a final digit of 7, there is a slightly smaller chance for each 'generated' digit to be an 8 or a 9.
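To put a number on that bias, the following sketch counts how often each final digit occurs across RANDOM's documented range of 0 to 32767. The function name digit_counts is made up for the example, and awk is used purely as a calculator.

```shell
# For every possible RANDOM value, tally value % 10, then print "digit count".
digit_counts() {
    awk 'BEGIN {
        for (v = 0; v <= 32767; v++) c[v % 10]++
        for (d = 0; d <= 9; d++) print d, c[d]
    }'
}

digit_counts
```

Digits 0 through 7 each occur 3277 times and digits 8 and 9 occur 3276 times, so the skew is roughly 0.03 percent per digit.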

Because of RANDOM's limited range, it can only be used to retrieve four base-10 digits at a time. Thus, to retrieve 9 digits, you need to call it three times.
If we don't care much about performance (are willing to pay the cost of command substitution, which forks a subshell per call), this may look like:
#!/usr/bin/env bash
get4() {
local newVal=32768
while (( newVal > 29999 )); do # avoid bias because of remainder
newVal=$RANDOM
done
printf '%04d' "$((newVal % 10000))"
}
result="$(get4)$(get4)$(get4)"
result=$(( 10#$result % 1000000000 )) # 10# forces base 10; a leading 0 would otherwise be read as octal
printf '%09d\n' "$result"
If we do care about performance, it may instead look like:
#!/usr/bin/env bash
get4() {
local newVal=32768 outVar=$1
while (( newVal > 29999 )); do # avoid bias because of remainder
newVal=$RANDOM
done
printf -v "$outVar" '%04d' "$((newVal % 10000))"
}
get4 out1; get4 out2; get4 out3
result="${out1}${out2}${out3}"
result=$(( 10#$result % 1000000000 )) # 10# forces base 10; a leading 0 would otherwise be read as octal
printf '%09d\n' "$result"

Use perl, as follows :
perl -e print\ rand | cut -c 3-11
Or
perl -MPOSIX -e 'print floor rand 10**9'

How to write a bash function that can detect if a given input ends in Kilobytes `K` or Megabytes `M`?

I have a bash function that is currently set up as:
MB=$(( $(echo $(FUNCTION_THAT_RETURNS_Kb_OR_Mb) | cut -d "K" -f 1 | sed 's/^.*- //') / 1000 ))
where the middle portion echo $(FUNCTION_THAT_RETURNS_Kb_OR_Mb) returns a value that ends in K or M, (for example: 515223 K or 36326 M) for Kilobytes or Megabytes. I currently have designed the function to strip the trailing units indicator for K, and then divide by 1000 to convert to megabytes. However, when the inside part of it ends in M, it fails. How can I write a function that detects if its in kilobytes or megabytes?

Don't reinvent the wheel - there is numfmt:
function_that_returns_Kb_or_Mb() { echo "515223 K"; }
mb=$(function_that_returns_Kb_or_Mb | numfmt -d '' --from=iec --to-unit=Mi)
# mb=504
function_that_returns_Kb_or_Mb() { echo "36326 M"; }
mb=$(function_that_returns_Kb_or_Mb | numfmt -d '' --from=iec --to-unit=Mi)
# mb=36326
Notes:
echo $(FUNCTION_THAT_RETURNS_Kb_OR_Mb) is a useless use of echo. It's like echo $(echo $(echo $(...))). Just FUNCTION_THAT_RETURNS_Kb_OR_Mb | blabla.
By convention UPPERCASE VARIABLES are used for exported variables, like PATH COLUMNS UID PWD etc. - use lower case identifiers in your scripts.
I assumed input and output is using IEC scale, for SI scale use --from=si --to-unit=M.
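Where numfmt is not available, the unit detection itself can be done with a plain case statement. The sketch below is an assumption-laden alternative, not the answer's method: the function name to_mib is hypothetical, it expects input shaped like "515223 K" or "36326 M", and it uses 1024 as the divisor with integer truncation (so it prints 503 where numfmt rounded up to 504).

```shell
# Convert "<number> K" or "<number> M" to whole mebibytes (truncating).
to_mib() {
    unit=${1#"${1%?}"}    # last character of the argument
    value=${1%?}          # everything before the unit...
    value=${value% }      # ...minus one optional separating space
    case $unit in
        K) echo $(( value / 1024 )) ;;
        M) echo "$value" ;;
        *) echo "unsupported unit '$unit'" >&2; return 1 ;;
    esac
}

to_mib "515223 K"
to_mib "36326 M"
```

Swapping 1024 for 1000 gives the SI interpretation used in the question's original one-liner.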

Creating histograms in bash

EDIT
I read the question that this is supposed to be a duplicate of (this one). I don't agree. In that question the aim is to get the frequencies of individual numbers in the column. However if I apply that solution to my problem, I'm still left with my initial problem of grouping the frequencies of the numbers in a particular range into the final histogram. i.e. if that solution tells me that the frequency of 0.45 is 2 and 0.44 is 1 (for my input data), I'm still left with the problem of grouping those two frequencies into a total of 3 for the range 0.4-0.5.
END EDIT
QUESTION-
I have a long column of data with values between 0 and 1.
This will be of the type-
0.34
0.45
0.44
0.12
0.45
0.98
.
.
.
A long column of decimal values with repetitions allowed.
I'm trying to change it into a histogram sort of output such as (for the input shown above)-
0.0-0.1 0
0.1-0.2 1
0.2-0.3 0
0.3-0.4 1
0.4-0.5 3
0.5-0.6 0
0.6-0.7 0
0.7-0.8 0
0.8-0.9 0
0.9-1.0 1
Basically the first column has the lower and upper bounds of each range and the second column has the number of entries in that range.
I wrote it (badly) as-
for i in $(seq 0 0.1 0.9)
do
awk -v var=$i '{if ($1 > var && $1 < var+0.1 ) print $1}' input | wc -l;
done
Which basically does a wc -l of the entries it finds in each range.
Output formatting is not a part of the problem. If I simply get the frequencies corresponding to the different bins , that will be good enough. Also please note that the bin size should be a variable like in my proposed solution.
I already read this answer and want to avoid the loop. I'm sure there's a much much faster way in awk that bypasses the for loop. Can you help me out here?

Following the same algorithm as in my previous answer, I wrote an awk script that is extremely fast.
The script is the following:
#!/usr/bin/awk -f
BEGIN{
bin_width=0.1;
}
{
bin=int(($1-0.0001)/bin_width);
if( bin in hist){
hist[bin]+=1
}else{
hist[bin]=1
}
}
END{
for (h in hist)
printf " * > %2.2f -> %i \n", h*bin_width, hist[h]
}
The bin_width is the width of each channel. To use the script just copy it in a file, make it executable (with chmod +x <namefile>) and run it with ./<namefile> <name_of_data_file>.
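A variant of the same single-pass idea that also prints empty bins and the range labels from the question might look like the sketch below. The wrapper name histogram is hypothetical, and it assumes every value lies between 0 and 1 and that the bin width divides 1 evenly.

```shell
# Read one value per line on stdin; the optional first argument is the bin width.
histogram() {
    awk -v w="${1:-0.1}" '
        BEGIN { n = int(1 / w + 0.5) }          # number of bins covering 0..1
        {
            b = int($1 / w)                     # bin index for this value
            if (b >= n) b = n - 1               # put 1.0 into the last bin
            count[b]++
        }
        END {
            for (b = 0; b < n; b++)
                printf "%.1f-%.1f %d\n", b * w, (b + 1) * w, count[b]
        }'
}

printf '%s\n' 0.34 0.45 0.44 0.12 0.45 0.98 | histogram 0.1
```

Values that sit exactly on a bin edge can land on either side because of floating-point division, which is presumably why the script above subtracts a small offset before dividing.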

For this specific problem, I would drop the last digit, then count occurrences of sorted data:
cut -b1-3 | sort | uniq -c
which gives, on the specified input set:
1 0.1
1 0.3
3 0.4
1 0.9
Output formatting can be done by piping through this awk command:
| awk 'BEGIN{r=0.0}
{while($2>r){printf "%1.1f-%1.1f %3d\n",r,r+0.1,0;r=r+.1}
printf "%1.1f-%1.1f %3d\n",$2,$2+0.1,$1}
END{while(r<0.9){printf "%1.1f-%1.1f %3d\n",r,r+0.1,0;r=r+.1}}'

The only loop you will find in this algorithm is around the line of the file.
This is an example on how to realize what you asked in bash. Probably bash is not the best language to do this since it is slow with math. I use bc, you can use awk if you prefer.
How the algorithm works
Imagine you have many bins, each corresponding to an interval. Each bin is characterized by a width (CHANNEL_DIM) and a position. Together, the bins must cover the entire interval over which your data are spread. Dividing a value by the bin width gives the position of its bin, so you just have to add 1 to that bin's count.
#!/bin/bash
# This is the input: you can use $1 and $2 to read input as cmd line argument
FILE='bash_hist_test.dat'
CHANNEL_NUMBER=9 # They are actually 10: 0 is already a channel
# check the max and the min to define the dimension of the channels:
MAX=`sort -n $FILE | tail -n 1`
MIN=`sort -rn $FILE | tail -n 1`
# Define the channel width
CHANNEL_DIM_LONG=`echo "($MAX-$MIN)/($CHANNEL_NUMBER)" | bc -l`
CHANNEL_DIM=`printf '%2.2f' $CHANNEL_DIM_LONG `
# Probably printf is not the best function in this context because
#+the result could be system dependent.
# Determine the channel for a given number
# Usage: find_channel <number_to_histogram> <width_of_histogram_channel>
function find_channel(){
NUMBER=$1
CHANNEL_DIM=$2
# The channel is found dividing the value for the channel width and
#+rounding it.
RESULT_LONG=`echo $NUMBER/$CHANNEL_DIM | bc -l`
RESULT=`printf '%.0f' $RESULT_LONG`
echo $RESULT
}
# Read the file and do the computation
while IFS='' read -r line || [[ -n "$line" ]]; do
CHANNEL=`find_channel $line $CHANNEL_DIM`
[[ -z ${HIST[$CHANNEL]} ]] && HIST[$CHANNEL]=0
let HIST[$CHANNEL]+=1
done < $FILE
counter=0
for i in ${HIST[*]}; do
CHANNEL_START=`echo "$CHANNEL_DIM * $counter - .04" | bc -l`
CHANNEL_END=`echo " $CHANNEL_DIM * $counter + .05" | bc`
printf '%+2.1f : %2.1f => %i\n' $CHANNEL_START $CHANNEL_END $i
let counter+=1
done
Hope this helps. Comment if you have other questions.

How do I get bc(1) to print the leading zero?

I do something like the following in a Makefile:
echo "0.1 + 0.1" | bc
(in the real file the numbers are dynamic, of course)
It prints .2 but I want it to print 0.2.
I would like to do this without resorting to sed but I can't seem to find how to get bc to print the zero. Or is bc just not able to do this?

You can also resort to awk to format:
echo "0.1 + 0.1" | bc | awk '{printf "%f", $0}'
or with awk itself doing the math:
echo "0.1 0.1" | awk '{printf "%f", $1 + $2}'

This might work for you:
echo "x=0.1 + 0.1; if(x<1) print 0; x" | bc

After a quick look at the source (see bc_out_num(), line 1461), I don't see an obvious way to make the leading 0 get printed if the integer portion is 0. Unless I missed something, this behaviour is not dependent on a parameter which can be changed using command-line flag.
Short answer: no, I don't think there's a way to make bc print numbers the way you want.
I don't see anything wrong with using sed if you still want to use bc. The following doesn't look that ghastly, IMHO:
[me@home]$ echo "0.1 + 0.1" | bc | sed 's/^\./0./'
0.2
If you really want to avoid sed, both eljunior's and choroba's suggestions are pretty neat, but they require value-dependent tweaking to avoid trailing zeros. That may or may not be an issue for you.

I cannot find anything about output format in the documentation. Instead of sed, you can also reach for printf:
printf '%3.1f\n' $(bc<<<0.1+0.1)

echo "$a / $b" | bc -l | sed -e 's/^-\./-0./' -e 's/^\./0./'
This should work for all cases where the results are:
"-.123"
".123"
"-1.23"
"1.23"
Explanation:
For everything that only starts with -., replace -. with -0.
For everything that only starts with ., replace . with 0.
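The two substitutions can be tried without bc by feeding the four sample results straight into sed. The wrapper name add_leading_zero is only for illustration.

```shell
# Filter: insert a 0 before a leading "." or after a leading "-" followed by ".".
add_leading_zero() {
    sed -e 's/^-\./-0./' -e 's/^\./0./'
}

printf '%s\n' -.123 .123 -1.23 1.23 | add_leading_zero
```

This prints -0.123, 0.123, -1.23 and 1.23, one per line; values that already have an integer part pass through untouched.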

Building on potong's answer,
For fractional results:
echo "x=0.1 + 0.1; if(x<1 && x > 0) print 0; x" | bc -l
Note that negative results will not be displayed correctly. Aquarius Power has a solution for that.

$ bc -l <<< 'x=-1/2; if (length (x) == scale (x) && x != 0) { if (x < 0) print "-",0,-x else print 0,x } else print x'
This one is pure bc. It detects the leading zero by comparing the result of the length with the scale of the expression. It works on both positive and negative number.

This one will also handle negative numbers:
echo "0.1 - 0.3" | bc | sed -r 's/^(-?)\./\10./'

For positive numbers, it may be as simple as printing a (string) zero:
$ echo '"0";0.1+0.1' | bc
0.2
To avoid the zero when the number is greater than or equal to 1:
$ echo 'x=0.1+0.1; if(x<1){"0"}; x' | bc
0.2
It gets a bit more complex if the number may be negative:
echo 'x= 0.3 - 0.5 ; s=1;if(x<0){s=-1};x*=s;if(s<0){"-"};if(x<1) {"0"};x' | bc
-0.2
You may define a function and add it to a library:
$ echo 'define leadzero(x){auto s;
s=1;if(x<0){s=-1};x*=s;if(s<0){"-"};if(x<1){"0"};
return(x)};
leadzero(2.1-12.4)' | bc
-10.3
$ echo 'define leadzero(x){auto s;
s=1;if(x<0){s=-1};x*=s;if(s<0){"-"};if(x<1){"0"};
return(x)};
leadzero(0.1-0.4)' | bc
-0.3

Probably, bc isn't really the best "bench calculator" for the modern age. Other languages will give you more control. Here are working examples that print values in the range (-1.0..+1.0) with a leading zero. These examples use bc, AWK, and Python 3, along with Here String syntax.
#!/bin/bash
echo "using bc"
time for (( i=-2; i<=+2; i++ ))
{
echo $(bc<<<"scale=1; x=$i/2; if (x==0||x<=-1||x>=1) { print x } else { if (x<0) { print \"-0\";-x } else { print \"0\";x } } ")
}
echo
echo "using awk"
time for (( i=-2; i<=+2; i++ ))
{
echo $(echo|awk "{printf \"%.1f\",$i/2}")
}
echo
echo "using Python"
time for (( i=-2; i<=+2; i++ ))
{
echo $(python3<<<"print($i/2)")
}
Note that the Python version is about 10x slower, if that matters (still very fast for most purposes).
Doing any non-trivial math with sh or bc is a fool's errand. There are much better bench calculators available nowadays. For example, you can embed and execute Python subroutines inside your Bash scripts using Here Documents.
function mathformatdemo {
python3<<SCRIPT
import sys
from math import *
x=${1} ## capture the parameter from the shell
if -1<=x<=+1:
#print("debug: "+str(x),file=sys.stderr)
y=2*asin(x)
print("2*asin({:2.0f})={:+6.2f}".format(x,y))
else: print("domain err")
SCRIPT
}
echo "using Python via Here-doc"
time for (( i=-2; i<=+2; i++ ))
{
echo $(mathformatdemo $i)
}
Output:
using Python via Here-doc
domain err
2*asin(-1)= -3.14
2*asin( 0)= +0.00
2*asin( 1)= +3.14
domain err

this only uses bc, and works with negative numbers:
bc <<< "x=-.1; if(x==0) print \"0.0\" else if(x>0 && x<1) print 0,x else if(x>-1 && x<0) print \"-0\",-x else print x";
try it with:
for y in "0" "0.1" "-0.1" "1.1" "-1.1"; do
bc <<< "x=$y; if(x==0) print \"0.0\" else if(x>0 && x<1) print 0,x else if(x>-1 && x<0) print \"-0\",-x else print x";
echo;
done

Another simple way, similar to one of the posts in this thread here:
echo 'x=0.1+0.1; print "0",x,"\n"' | bc
Print the list of variables, including the leading 0 and the newline.

Since you have the question tagged [bash] you can simply compute the answer and save it to a variable using command substitution (e.g. r="$(...)") and then using [[..]] with =~ to test if the first character in the result is [1-9] (e.g. [[ $r =~ ^[1-9].*$ ]]), and if the first character isn't, prepend '0' to the beginning of r, e.g.
r=$(echo "0.1 + 0.1" | bc) # compute / save result
[[ $r =~ ^[1-9].*$ ]] || r="0$r" # test 1st char [1-9] or prepend 0
echo "$r" # output result
Result
0.2
If the result r is 1.0 or greater, then no zero is prepended, e.g. (as a 1-liner)
$ r=$(echo "0.8 + 0.6" | bc); [[ $r =~ ^[1-9].*$ ]] || r="0$r"; echo "$r"
1.4
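The same check can be written without bash's [[ ]] by using case pattern matching, which also makes it easy to cover negative results such as -.2. This is a sketch with a hypothetical helper name, fix_leading_zero, that takes the bc result as its argument.

```shell
# Prepend the missing zero to a number string such as ".2" or "-.2".
fix_leading_zero() {
    case $1 in
        .*)  printf '%s\n' "0$1" ;;         # .2  becomes 0.2
        -.*) printf '%s\n' "-0${1#-}" ;;    # -.2 becomes -0.2
        *)   printf '%s\n' "$1" ;;          # anything else is left alone
    esac
}

fix_leading_zero .2
fix_leading_zero -.2
fix_leading_zero 1.4
```

It would typically be called as fix_leading_zero "$(echo "0.1 - 0.3" | bc)".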

Pick and print one of three strings at random in Bash script

How can I print one of the values 1, 2 or 3, chosen at random? My best guess failed:
#!/bin/bash
1 = "2 million"
2 = "1 million"
3 = "3 million"
print randomint(1,2,3)

To generate random numbers in bash, use the built-in $RANDOM shell variable:
arr[0]="2 million"
arr[1]="1 million"
arr[2]="3 million"
rand=$(( RANDOM % 3 )) # $[ ... ] is deprecated; use $(( ... ))
echo "${arr[$rand]}"
From bash manual for RANDOM:
Each time this parameter is
referenced, a random integer between 0
and 32767 is generated. The sequence
of random numbers may be initialized
by assigning a value to RANDOM. If
RANDOM is unset, it loses its
special properties, even if it is
subsequently reset.

Coreutils shuf
Present in Coreutils, this command works well if the strings don't contain newlines.
E.g. to pick a letter at random from a, b and c:
printf 'a\nb\nc\n' | shuf -n1
POSIX eval array emulation + RANDOM
Modifying Marty's eval technique to emulate arrays (which are non-POSIX):
a1=a
a2=b
a3=c
eval echo \$a$(expr $RANDOM % 3 + 1)
This still leaves the RANDOM non-POSIX.
awk's rand() is a POSIX way to get around that.
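A rough sketch of that awk route, picking one of its arguments at random without relying on RANDOM or arrays. The function name pick_one is hypothetical, and note that srand() with no argument seeds from the time of day, so two calls within the same second can return the same choice.

```shell
# Print one of the given arguments, chosen by awk's rand().
pick_one() {
    awk 'BEGIN {
        srand()                               # seed from the current time
        i = int(rand() * (ARGC - 1)) + 1      # ARGV[1] .. ARGV[ARGC-1] hold the choices
        print ARGV[i]
    }' "$@"
}

pick_one "2 million" "1 million" "3 million"
```

Because the program consists only of a BEGIN block, awk never tries to open the arguments as input files.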

64 chars alpha numeric string
randomString32() {
index=0
str=""
for i in {a..z}; do arr[index]=$i; index=`expr ${index} + 1`; done
for i in {A..Z}; do arr[index]=$i; index=`expr ${index} + 1`; done
for i in {0..9}; do arr[index]=$i; index=`expr ${index} + 1`; done
for i in {1..64}; do str="$str${arr[$RANDOM%$index]}"; done
echo $str
}

~.$ set -- "First Expression" Second "and Last"
~.$ eval echo \$$(expr $RANDOM % 3 + 1)
and Last
~.$

Want to corroborate using shuf from coreutils using the nice -n1 -e approach.
Example usage, for a random pick among the values a, b, c:
CHOICE=$(shuf -n1 -e a b c)
echo "choice: $CHOICE"
I looked at the balance for two sample sizes (1000 and 10000):
$ for lol in $(seq 1000); do shuf -n1 -e a b c; done > shufdata
$ less shufdata | sort | uniq -c
350 a
316 b
334 c
$ for lol in $(seq 10000); do shuf -n1 -e a b c; done > shufdata
$ less shufdata | sort | uniq -c
3315 a
3377 b
3308 c
Ref: https://www.gnu.org/software/coreutils/manual/html_node/shuf-invocation.html
