How much memory does a variable take? - bash

The variable "a=b" contains 1 char 'a' for the name, and 1 char 'b' for the value.
Together 2 bytes.
How many characters can you store with one byte ?
The variable needs a pointer. 8 bytes.
How many bytes do pointers take up ?
Together 10 bytes.
Does a variable "a=b" stored in memory take about 10 bytes ? And would 10 variables of the same size take about 100 bytes ?
So 1000 variables of 1000 bytes each would be almost 1MB of memory ?
I have a file data.sh that only contains variables.
I need to retrieve the value of one variable in that file.
I do this with a function
(called as 'function-name' 'datafile-name' 'variable-name').
#!/usr/pkg/bin/ksh93
readvar () {
  while read -r line
  do
    typeset "${line}"
  done < "${1}"
  nameref indirect="${2}"
  echo "${indirect}"
}
readvar datafile variable
The function reads the file data.sh line by line.
While it does that, it typesets each line.
After it is done with that, it makes a name reference from the variable name given in the function call to one of the variables of the file data.sh, and finally prints the value of that variable.
When the function is finished it no longer uses up memory, but as long as the function is running it does.
This means all variables in the file data.sh are at some point stored in memory.
Correct?
In reality I have a file with IP addresses as variable names and nicknames as values, so I suppose this will not be such a problem for memory. But if I also use this for posts by visitors, the variable values will be much larger. In that case it would be possible to have this function keep only, for instance, 10 variables in memory at a time.
However, I wonder whether my way of calculating the memory usage of variables makes any sense.
Edit:
This might be a solution to avoid loading the whole file in memory.
#!/bin/ksh
readvar () {
  input=$(print "${2}" | sed 's/\[/\\[/g' | sed 's/\]/\\]/g')
  line=$(grep "${input}" "${1}")
  typeset ${line}
  nameref indirect="${2}"
  print "${indirect}"
}
readvar ./test.txt input[0]
With the input test.txt
input[0]=192.0.0.1
input[1]=192.0.0.2
input[2]=192.0.0.2
And the output
192.0.0.1
Edit:
Of course !!!
In the original post
Bash read array from an external file
it said:
# you could do some validation here
so:
while read -r line
do
  # you could do some validation here
  declare "$line"
done < "$1"
lines would be declared (or typeset in ksh) under a condition.
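A minimal sketch of that idea inside readvar, assuming plain name=value lines, so that only the requested variable is ever stored:
readvar () {
  while read -r line
  do
    # validation: only typeset the one line whose name matches the requested variable
    case "${line}" in
      "${2}="*) typeset "${line}" ;;
    esac
  done < "${1}"
  nameref indirect="${2}"
  echo "${indirect}"
}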

Your real concern seems not to be "how much memory does this take?" but "how can I avoid taking uselessly much memory for this?". I'm going to answer this one first. For a bunch of thoughts about the original question, see the end of my answer.
To avoid using up memory, I propose using grep to get the one line that is of interest to you and ignore all the others:
line=$(grep "^$2=" "$1")
Then you can extract the information you need from this line:
result=$(echo "$line" | cut -d= -f 2)
Now the variable result contains the value that would have been assigned to $2 in the file $1. Since you never need to store more than one such result value, you definitely have no memory issue.
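Combined into a small lookup function in the spirit of the original readvar (a sketch, assuming the file contains plain name=value lines; -f 2- rather than -f 2 keeps values that themselves contain an '='):
readvar () {
  line=$(grep "^$2=" "$1") || return 1    # the single line of interest, or fail if it is absent
  echo "$line" | cut -d= -f 2-            # everything after the first '='
}
readvar data.sh variable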
Now, to the original question:
Finding out how much memory a shell uses for each variable is tricky. You would need to look into the shell's source to be sure about the implementation. It can vary from shell to shell (you appear to be using ksh, which can differ from bash in this respect), and it can also vary from version to version.
One way to get an idea would be to watch a shell process's memory usage while making the shell set variables in large amounts:
bash -c 'a="$(head -c 1000 /dev/zero | tr "\0" x)"; for ((i=0; i<1000; i++)); do eval a${i}="$a"; done; grep ^VmPeak /proc/$$/status'
bash -c 'a="$(head -c 1000 /dev/zero | tr "\0" x)"; for ((i=0; i<10000; i++)); do eval a${i}="$a"; done; grep ^VmPeak /proc/$$/status'
bash -c 'a="$(head -c 1000 /dev/zero | tr "\0" x)"; for ((i=0; i<100000; i++)); do eval a${i}="$a"; done; grep ^VmPeak /proc/$$/status'
bash -c 'a="$(head -c 1000 /dev/zero | tr "\0" x)"; for ((i=0; i<200000; i++)); do eval a${i}="$a"; done; grep ^VmPeak /proc/$$/status'
This prints the peak amount of memory in use by a bash which sets 1000, 10000, 100000, and 200000 variables with a value of 1000 x characters. On my machine (using bash 4.2.25(1)-release) this gave the following output:
VmPeak: 19308 kB
VmPeak: 30220 kB
VmPeak: 138888 kB
VmPeak: 259688 kB
This shows that the memory used is growing more or less in a linear fashion (plus a fixed offset of ~17000k) and that each new variable takes ~1.2kB of additional memory.
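The ~1.2kB figure falls straight out of the deltas between those runs, for example using the two largest measurements above:
awk 'BEGIN { print (259688 - 138888) / (200000 - 100000) }'   # ~1.2 kB per additional variable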
But as I said, other shells' results may vary.

Related

Bash split stdin by null and pipe to pipeline

I have a stream that is null delimited, with an unknown number of sections. For each delimited section I want to pipe it into another pipeline until the last section has been read, and then terminate.
In practice, each section is very large (~1GB), so I would like to do this without reading each section into memory.
For example, imagine I have the stream created by:
for I in {3..5}; do
  seq $I
  echo -ne '\0'
done
I'll get a stream that looks like:
1
2
3
^@1
2
3
4
^@1
2
3
4
5
^@
When piped through cat -v.
I would like to pipe each section through paste -sd+ | bc, so I get a stream that looks like:
6
10
15
This is simply an example. In actuality the stream is much larger and the pipeline is more complicated, so solutions that don't rely on streams are not feasible.
I've tried something like:
set -eo pipefail
while head -zn1 | head -c-1 | ifne -n false | paste -sd+ | bc; do :; done
but I only get
6
10
If I leave off bc I get
1+2+3
1+2+3+4
1+2+3+4+5
which is basically correct. This leads me to believe that the issue is potentially related to buffering and the way each process is actually interacting with the pipes between them.
Is there some way to fix the way that these commands exchange streams so that I can get the desired output? Or, alternatively, is there a way to accomplish this with other means?
In principle this is related to this question, and I could certainly write a program that reads stdin into a buffer, looks for the null character, and pipes the output to a spawned subprocess, as the accepted answer does for that question. Given the general support of streams and null delimiters in bash, I'm hoping to do something that's a little more "native". In particular, if I want to go this route, I'll have to escape the pipeline (paste -sd+ | bc) in a string instead of just letting the same shell interpret it. There's nothing too inherently bad about this, but it's a little ugly and will require a bunch of somewhat error prone escaping.
Edit
As was pointed out in an answer, head makes no guarantees about how much it buffers. Unless it buffered only a single byte at a time, which would be impractical, this will never work. Thus, it seems like the only solution would be to read it into memory, or write a specific program.
The issue with your original code is that head doesn't guarantee that it won't read more than it outputs. Thus, it can consume more than one (NUL-delimited) chunk of input, even if it's emitting only one chunk of output.
read, by contrast, guarantees that it won't consume more than you ask it for.
set -o pipefail
while IFS= read -r -d '' line; do
  bc <<<"${line//$'\n'/+}"
done < <(build_a_stream)
If you want native logic, there's nothing more native than just writing the whole thing in shell.
Calling external tools -- including bc, cut, paste, or others -- involves a fork() penalty. If you're only processing small amounts of data per invocation, the efficiency of the tools is overwhelmed by the cost of starting them.
while read -r -d '' -a numbers; do     # read up to the next NUL into an array
  sum=0                                # initialize an accumulator
  for number in "${numbers[@]}"; do    # iterate over that array
    (( sum += number ))                # ...using an arithmetic context for our math
  done
  printf '%s\n' "$sum"
done < <(build_a_stream)
For all of the above, I tested with the following build_a_stream implementation:
build_a_stream() {
  local i j IFS=$'\n'
  local -a numbers
  for ((i=3; i<=5; i++)); do
    numbers=( )
    for ((j=0; j<=i; j++)); do
      numbers+=( "$j" )
    done
    printf '%s\0' "${numbers[*]}"
  done
}
As discussed, the only real solution seemed to be writing a program to do this specifically. I wrote one in Rust called xstream-util. After installing it with cargo install xstream-util, you can pipe the input into
xstream -0 -- bash -c 'paste -sd+ | bc'
to get the desired output
6
10
15
It doesn't avoid having to pass the pipeline to bash as a string, so it still needs escaping if the pipeline is complicated. Also, it currently only supports single-byte delimiters.

bash loop taking extremely long time

I have a list of times in the format HH:MM:SS that I am looping through to find the nearest time that is not in the past. The code that I have is:
for i in ${times[@]}; do
  hours=$(echo $i | sed 's/\([0-9]*\):.*/\1/g')
  minutes=$(echo $i | sed 's/.*:\([0-9]*\):.*/\1/g')
  currentHours=$(date +"%H")
  currentMinutes=$(date +"%M")
  if [[ hours -ge currentHours ]]; then
    if [[ minutes -ge currentMinutes ]]; then
      break
    fi
  fi
done
The variable times is an array of all the times that I am sorting through (it's about 20-40 lines). I'd expect this to take less than 1 second; however, it is taking upwards of 5 seconds. Any suggestions for decreasing the time of the regular expression would be appreciated.
times=($(cat file.txt))
Here is a list of the times that are stored in a text file and are imported into the times variable using the above line of code.
6:05:00
6:35:00
7:05:00
7:36:00
8:08:00
8:40:00
9:10:00
9:40:00
10:11:00
10:41:00
11:11:00
11:41:00
12:11:00
12:41:00
13:11:00
13:41:00
14:11:00
14:41:00
15:11:00
15:41:00
15:56:00
16:11:00
16:26:00
16:41:00
16:58:00
17:11:00
17:26:00
17:41:00
18:11:00
18:41:00
19:10:00
19:40:00
20:10:00
20:40:00
21:15:00
21:45:00
One of the key things to understand in looking at bash scripts from a performance perspective is that while the bash interpreter is somewhat slow, the act of spawning an external process is extremely slow. Thus, while it can often speed up your scripts to use a single invocation of awk or sed to process a large stream of input, the cost of starting those invocations inside a tight loop greatly outweighs the performance of the tools once they're running.
Any command substitution -- $() -- causes a second copy of the interpreter to be fork()ed off as a subshell. Invoking any command not built into bash -- date, sed, etc -- then causes a subprocess to be fork()ed off for that process, and then the executable associated with that process to be exec()'d -- something that involves a great deal of OS-level overhead (the binary needs to be linked, loaded, etc).
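A quick, informal way to see that cost on your own machine is to time a loop that forks date against one that uses only builtins (the second form needs bash 4.2 or newer; the absolute numbers will vary):
time for ((i=0; i<100; i++)); do d=$(date +%H); done             # 100 fork()+exec() pairs
time for ((i=0; i<100; i++)); do printf -v d '%(%H)T' -1; done   # builtins only, no subprocesses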
This loop would be better written as:
IFS=: read -r currentHours currentMinutes < <(date +"%H:%M")
while IFS=: read -r hours minutes _; do
  # 10# forces base-10 so values with a leading zero (08, 09) aren't treated as invalid octal
  if (( 10#$hours >= 10#$currentHours )) && (( 10#$minutes >= 10#$currentMinutes )); then
    break
  fi
done <file.txt
In this form only one external command is run, date +"%H:%M", outside the loop. If you were only targeting bash 4.2 and newer (with built-in time formatting support), even this would be unnecessary:
printf -v currentHours '%(%H)T' -1
printf -v currentMinutes '%(%M)T' -1
...will directly place the current hour and minute into the variables currentHours and currentMinutes using only functionality built into modern bash releases.
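Putting those built-ins together, a minimal sketch of the whole lookup with no external commands at all (assuming bash 4.2+; comparing minutes-since-midnight avoids checking the hour and minute fields separately, and 10# keeps leading zeros from being read as octal):
printf -v currentHours '%(%H)T' -1
printf -v currentMinutes '%(%M)T' -1
now=$(( 10#$currentHours * 60 + 10#$currentMinutes ))   # current time in minutes since midnight
while IFS=: read -r hours minutes _; do
  if (( 10#$hours * 60 + 10#$minutes >= now )); then
    printf '%s:%s\n' "$hours" "$minutes"                # first listed time that is not already past
    break
  fi
done <file.txt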
See:
BashFAQ #1 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
BashFAQ #100 - How can I do native string manipulations in bash? (Subsection: "Splitting a string into fields")
To be honest, I'm not sure why it's taking such an extremely long time, but there are certainly some things that could be made more efficient.
currentHours=$(date +"%H")
currentMinutes=$(date +"%M")
for time in "${times[#]}"; do
IFS=: read -r hours minutes seconds <<<"$time"
if [[ hours -ge currentHours && minutes -ge currentMinutes ]]; then
break
fi
done
This uses read, a built-in command, to split the text into variables, rather than calling external commands and creating subshells.
I assume the script now runs quickly enough that it's safe to compute currentHours and currentMinutes once, before the loop, and reuse them throughout it.
Note that you can also just use awk to do the whole thing:
awk -F: -v currentHours="$(date +"%H")" -v currentMinutes="$(date +"%M")" '
  $1 >= currentHours && $2 >= currentMinutes { print; exit }' file.txt
Just to make the program produce some output, I added a print, so the first matching line is printed.
awk to the rescue!
awk -v time="12:12:00" '
  function pad(x) {split(x,ax,":"); return (ax[1]<10)?"0"x:x}
  BEGIN {time=pad(time)}
  time>pad($0) {next}
  {print; exit}' times
12:41:00
With the hour zero-padded, you can do a string-only comparison.
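To pin the comparison to the current time instead of a hard-coded 12:12:00, the same script can take its time from date (a usage sketch; since date +%H is already zero-padded, only the file's lines need pad() here):
awk -v time="$(date +%H:%M:%S)" '
  function pad(x) {split(x,ax,":"); return (ax[1]<10)?"0"x:x}
  time>pad($0) {next}
  {print; exit}' times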

Bash - Read only the last 100 lines of a long Log file

I am reading a long log file and splitting the columns into variables using bash.
cd $LOGDIR
IFS=","
while read LogTIME name md5
do
  LogTime+="$(echo $LogTIME)"
  Name+="$(echo $name)"
  LOGDatamd5+="$(echo $md5)"
done < LOG.txt
But this is really slow and I don't need all the lines. The last 100 lines are enough (but the log file itself needs all the other lines for different programs).
I tried to use tail -n 10 LOG.txt | while read LogTIME name md5, but that takes really long as well and I had no output at all.
Another way I tested without success was:
cd $LOGDIR
foo="$(tail -n 10 LOG.txt)"
IFS=","
while read LogTIME name md5
do
  LogTime+="$(echo $LogTIME)"
  Name+="$(echo $name)"
  LOGDatamd5+="$(echo $md5)"
done < "$foo"
But that gives me only the output of foo in total. Nothing was written into the variables inside the while loop.
There is probably a really easy way to do this, that I can't see...
Cheers,
BallerNacken
Process substitution is the common pattern:
while read LogTIME name md5 ; do
  LogTime+=$LogTIME
  Name+=$name
  LogDatamd5+=$md5
done < <(tail -n100 LOG.txt)
Note that you don't need "$(echo $var)", you can assign $var directly.
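If the log fields really are comma-separated, as the IFS="," in the original snippets suggests, the split can be scoped to the read itself; a sketch along the same lines:
while IFS=, read -r LogTIME name md5; do
  LogTime+=$LogTIME
  Name+=$name
  LogDatamd5+=$md5
done < <(tail -n 100 LOG.txt)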

How to pass array of values to shell script in command line and advice on loop

The goal is to get only the filenames from svn log based on the revision number. Every commit has a jira ticket number in the svn comment, so the svn revisions are found by looking for the jira ticket numbers.
The script so far works fine when I give only one jira ticket number, but I need it to work when I give more than one jira ticket.
The issue with this script is that the output has values only from ticket-2. How can I make the output include values from both ticket-1 and ticket-2?
I also need some help on how to pass ticket-1 and ticket-2 as arguments to the script rather than assigning them in the script.
Code:
#!/bin/sh
src_url=$1
target_url=$2
jira_ticket=("ticket-1 ticket-2")
for i in $jira_ticket; do
  revs=(`svn log $1 --limit 10 | grep -B 2 $i | grep "^r" | cut -d"r" -f2 | cut -d" " -f1 | sort -r`)
done
for revisions in ${!revs[*]}; do
  files=(`svn log -v $1 -r ${revs[$revisions]} | awk '$1~/^[AMD]$/{for(i=2;i<=NF;i++)print $i}'`)
  for (( i = 0; i < ${#files[@]}; i++ )); do
    echo "${files[$i]} #" ${revs[$revisions]} " will be merged."
  done
done
tl;dr
Because the second loop (processing revs) is outside the first loop (setting revs). Move the second loop to within the first loop to fix this problem.
Detailed repairs
This script needs some serious fixing.
The array jira_ticket was declared incorrectly - it should be jira_ticket=("ticket-1" "ticket-2").
To loop over every element in an array, use "${array[@]}" (the quotes are important to avoid unintended word splitting, and using @ instead of * makes the expansion split into one word per element, which is what you're after). $array is equivalent to ${array[0]}.
Same principle with looping over an array's keys: say "${!array[@]}" instead of ${!array[*]}. (See the short demonstration after these notes.)
Why loop over keys when you can loop over values and you don't need the keys?
Variable assignments in a loop are not guaranteed to be propagated out of it (they probably are here, but odd things happen in pipelines and such).
Did you mean to execute the second loop within the first loop, to use each copy of revs? (As it stands you're only processing the last copy.)
Please quote all your variable expansions ("$1", not $1).
Please use modern command substitution syntax $(command) instead of backquotes. It's much less error-prone.
You'll need to set IFS properly to split the command substitution results correctly. I think you're after an IFS of $'\n'; I may be wrong.
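A short, illustrative demonstration of those expansion differences:
arr=(a "b c" d)
printf '%s\n' "${arr[@]}"    # three words, one per element: a / b c / d
printf '%s\n' ${arr[*]}      # four words after word splitting: a / b / c / d
printf '%s\n' "${!arr[@]}"   # the keys (indices): 0 / 1 / 2
echo "$arr"                  # just: a  (same as ${arr[0]})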
Passing the tickets as arguments
Use shift after dealing with $1 to get rid of $1, then assign everything that's left to the jira_tickets array.
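For example, a minimal sketch of that argument handling (the repaired script below simply iterates over the remaining arguments directly):
src_url=$1; shift            # first argument: the source URL
jira_tickets=("$@")          # everything left: the ticket numbers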
The script, repaired as best I can:
#!/bin/sh
# First argument is the source URL; remaining args are ticket numbers
src_url="$1"; shift
#target_url="$2"; shift # Never used
# Handy syntax hint: `for i in "$@"; do` == `for i; do`
for ticket; do
  # Fixed below $1 to be $src_url
  revs=($(IFS=$'\n'; svn log "$src_url" --limit 10 | grep -B 2 "$ticket" | grep "^r" | cut -d"r" -f2 | cut -d" " -f1 | sort -r))
  for revision in "${revs[@]}"; do # I think you meant to loop over the values here, not the keys
    files=($(IFS=$'\n'; svn log -v "$src_url" -r "$revision" | awk '$1~/^[AMD]$/{for(i=2;i<=NF;i++)print $i}'))
    for file in "${files[@]}"; do # Think you wanted to loop over the values here too
      echo "$file # $revision will be merged."
    done
  done
done

What is the maximum number of characters that the ksh variable accepts?

I am trying to load and parse a really large text file. The loading itself is not a problem, but there are particular lines that have 2908778 characters on a single line.
This is causing an error in my script.
In the script below, I removed all the logic and just read each line.
I also removed all valid lines and left only the really long line in one text file. When running it I get the error below:
$ dowhiledebug.sh dump.txt
dowhiledebug.sh[6]: no space
Script Ended dump.txt
The actual script:
#!/bin/sh
filename=$1
count=1
if [ -f ${filename} ]; then
  echo "after then"
  while read line;
  do
    echo "$count"
    count=$((count+1))
  done < $filename
else
  echo "Could not open file $filename"
fi
echo "Script Ended $filename"
Updated (2013-01-17)
Follow-up question: Is it possible to increase the maximum number of characters that a ksh variable accepts?
What OS and version of ksh? Can you echo ${.sh.version} and get a value? If so, please include it in your question above. Or could this be pdksh?
Here's a test that will get you in the ballpark, assuming a modern ksh supporting (( i++ )) math evaluations:
#100 char var
var=1234578901234456789012345678901234567890123456789012345789012344567890123456789012345678901234567890
$ while (( i++ < 10000 )) ;do var="$var$var" ; print "i=$i\t" ${#var} ; done
i=1 200
i=2 400
i=3 800
i=4 1600
i=5 3200
i=6 6400
i=7 12800
i=8 25600
i=9 51200
i=10 102400
i=11 204800
i=12 409600
i=13 819200
i=14 1638400
i=15 3276800
i=16 6553600
i=17 13107200
i=18 26214400
i=19 52428800
i=20 104857600
i=21 209715200
i=22 419430400
-ksh: out of memory
$ print -- ${.sh.version}
Version JM 93t+ 2010-05-24
AND that is just the overall size of the environment that can be supported. When dealing with the command-line environment and "words" after the program name, there is a limit to the number of words, regardless of overall size.
Some shells' man pages have a LIMITS section that may show something like max-bytes 200MB, max-args 2048. This information may be in a different section, it will certainly have different labels and values than the ones I have included, or it may not be there at all, hence the code loop above. Look carefully, and if you find a source for this information, either add an answer to this question or update this one.
The bash 4.4 standard man page doesn't seem to have this information, and it's harder to find a ksh doc these days. Check your man ksh and hope that you can find a documented limit.
IHTH
The limit for any shell is the limit of the C command line maximum. Here's a little program that pulls the information out of /usr/include/limits.h for you:
cpp <<HERE | tail -1
#include <limits.h>
ARG_MAX
HERE
Mine gives me (256 * 1024) or 262144 characters.
Doesn't work if the C compiler isn't installed, but it's probably a similar limit.
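If no C compiler or preprocessor is available, getconf (a standard POSIX utility) can usually report the same limit:
getconf ARG_MAX    # should print the same value, e.g. 262144 on a system like the one above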
