Bash command line and input limit

Is there some sort of character limit imposed in bash (or other shells) for how long an input can be? If so, what is that character limit?
I.e. Is it possible to write a command in bash that is too long for the command line to execute?
If there is not a required limit, is there a suggested limit?

The limit for the length of a command line is not imposed by the shell, but by the operating system. This limit is usually on the order of a few hundred kilobytes to a few megabytes. POSIX denotes this limit ARG_MAX and on POSIX-conformant systems you can query it with
$ getconf ARG_MAX # Get argument limit in bytes
E.g. on Cygwin this is 32000, and on the different BSDs and Linux systems I use it is anywhere from 131072 to 2621440.
If you need to process a list of files exceeding this limit, you might want to look at the xargs utility, which calls a program repeatedly with a subset of arguments not exceeding ARG_MAX.
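A typical sketch (the pattern and glob are placeholders, and -print0/-0 are GNU/BSD options rather than strict POSIX): find feeds the names over a pipe and xargs splits them into exec-sized batches:
# -print0 / -0 keep names with spaces or newlines intact; xargs runs
# grep as many times as needed, each invocation staying under ARG_MAX.
find . -type f -name '*.log' -print0 | xargs -0 grep -l 'pattern'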
To answer your specific question: yes, it is possible to attempt to run a command with too long an argument list. The shell will fail with a message along the lines of "Argument list too long".
Note that the input to a program (as read on stdin or any other file descriptor) is not limited in this way (only by available program resources). So if your shell script reads a string into a variable, you are not restricted by ARG_MAX. The restriction also does not apply to shell builtins.
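As a tiny illustration of that distinction (the file name is a placeholder):
big=$(cat huge_file.txt)       # fine: no exec involved, only shell memory
echo "$big" > /dev/null        # fine: echo here is a shell builtin
/bin/echo $big > /dev/null     # may fail: external command, so ARG_MAX applies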

OK, Denizens. So I have accepted the command line length limits as gospel for quite some time. So, what to do with one's assumptions? Naturally: check them.
I have a Fedora 22 machine at my disposal (meaning: Linux with bash4). I have created a directory with 500,000 files in it, each with an 18-character filename, so the expanded command line is about 9,500,000 characters long. Created thus:
seq 1 500000 | while read digit; do
    touch $(printf "abigfilename%06d\n" $digit)
done
And we note:
$ getconf ARG_MAX
2097152
Note however I can do this:
$ echo * > /dev/null
But this fails:
$ /bin/echo * > /dev/null
bash: /bin/echo: Argument list too long
I can run a for loop:
$ for f in *; do :; done
which works because the loop and the : command are handled entirely within the shell; no exec is involved.
Careful reading of the documentation for ARG_MAX states that it is the "Maximum length of argument to the exec functions". This means: without calling exec, there is no ARG_MAX limitation. That explains why shell builtins are not restricted by ARG_MAX.
And indeed, I can ls my directory if my argument list is 109,948 files long, or about 2,089,000 characters (give or take). Once I add one more 18-character filename, though, I get an Argument list too long error. So ARG_MAX is working as advertised: the exec fails when there are more than ARG_MAX characters on the argument list, including, it should be noted, the environment data.
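A rough way to see how much of that budget is left after the environment (a one-liner sketch; a fuller version of the same arithmetic appears in a question further down this page):
$ expr $(getconf ARG_MAX) - $(env | wc -c)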

There is an input buffer limit of something like 1024 characters; a plain read will simply hang mid-paste or mid-input. To work around this, use the -e option.
http://linuxcommand.org/lc3_man_pages/readh.html
-e use Readline to obtain the line in an interactive shell
Change your read to read -e and the annoying input hang goes away.
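For example (the prompt text is made up):
$ read -e -p "Paste the long line here: " line && printf '%s\n' "$line"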

In the old days, tcsh had a limit of 1024 characters per command line, which made it difficult if you had a very long $PATH. I was forced to rebuild a private version of tcsh with the buffer size increased to allow users to have long $PATH settings. That was 2 decades ago. That was when I gave up using tcsh, and switched to zsh which did not have that limitation. Now I just use plain old bash because it is good enough.

Related

Argument list too long for UGE compute host

In my .profile (SUSE Linux with Korn shell) I have had the following code active for many years:-
case $0 in
-ksh|ksh)
    set -o ignoreeof
    set -a
    set -o vi
    PS1=$(print '\033[34m$(tput bold)ksh:$(hostname)->$(tput sgr0)\033[00m ')
    PS2="continue-> "
    PS3=": "
    PS4="$0.$LINENO+ "
    FCEDIT=/usr/bin/vi
    HISTFILE=~/.histories/${TTY}_$(hostname)_ksh_his
    HISTSIZE=500
    EDITOR=/bin/vi
    VISUAL=/bin/vi
    TERM=gnome
    set +a ;;
*) SHELL=${SHELL} ;;
esac
But lately that PS1 entry has been causing some unexpected problems. When I source a file of ~3K environment variables and then try to get onto a UGE compute host, the following error causes everything to break down:-
$ qrsh -V -j y -pe mt 1 -l "os_version=SUSE12.0,model=EMT3500,cpu_code=E5-2667v4" -P iheavy -now no
ksh: /usr/bin/tput: Argument list too long
ksh: /bin/hostname: Argument list too long
ksh: /usr/bin/tput: Argument list too long
ksh:->
If I request a CentOS system (os_version=CS7.0) I do not see the error--it is specific to SUSE. Also, if I eliminate the PS1 entry altogether, I can get onto a SUSE system without any errors. This is the simplest way to capture the bigger problem: when I issue a qsub for batch computing tasks, my .profile ends up partially initialized and jobs fail to launch.
I often change environments and shells, so color-coding my terminal prompt in a shell-specific and host/queue-specific way has always been helpful. I would rather not retire that PS1 entry just for one project that has such a long list of environment variable names.
I have done various searches and learned that increasing ulimit settings can sometimes help in situations like this; however, I tried that (increased stack size and number-of-open-files to their maximum) and the outcome did not change.
Is there a practical way to avoid this problem without removing the PS1 entry?
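One direction worth sketching (not from the thread, and only a sketch): the failures come from PS1 exec'ing tput and hostname while the exported environment is close to ARG_MAX, so a prompt that avoids forking external commands sidesteps the problem. The escape sequences below are the usual ANSI bold/reset/blue codes; check them against tput on your terminal before relying on them:
# ksh93 sketch: literal ANSI sequences in place of $(tput ...), and a
# hostname taken from a variable (set once, where exec still works)
# rather than from $(hostname) on every prompt.
BOLD=$'\033[1m' RESET=$'\033[0m' BLUE=$'\033[34m'
PS1="${BLUE}${BOLD}ksh:${HOSTNAME:-unknown}->${RESET} "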

Output time to a file with the Unix "time" command, but leave the command's own output on the console

I time a command that has some output. I want to write the real time from the time command to a file, but leave the output of the command on the console.
For example, if I do time my_command I get this printed in the console:
several lines of output from my_command ...
real 1m25.970s
user 0m0.427s
sys 0m0.518s
In this case, I want to store only 1m25.970s to a file, but still print the output of the command to the console.
The time command is tricky. The POSIX specification of time
doesn't define the default output format, but does define a format for the -p (presumably for 'POSIX') option. Note the (not easily understood) discussion of command sequences in pipelines.
The Bash specification says time prefixes a 'pipeline', which means that time cmd1 | cmd2 times both cmd1 and cmd2. It writes its results to standard error. The Korn shell is similar.
The POSIX format requires a single space between the tags such as real and the time; the default format often uses a tab instead of a space. Note that the /usr/bin/time command may have yet another output format. It does on macOS, for example, listing 3 times on a single line, by default, with the label after the time value; it supports -p to print in an approximation to the POSIX format (but it has multiple spaces between label and time).
You can easily get all the information written to standard error into a file:
(time my_command) 2> log.file
If my_command or any program it invokes reports errors to standard error, those will go to the log file too. And you will get all three lines of the output from time written to the file.
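If you need the command's own standard error to keep going to the console while only time's report lands in the file, one common arrangement (a sketch; my_command and log.file are the same placeholders as above) is to route the command's stderr back out through a spare file descriptor:
# fd 3 is a copy of the original stderr (the console); the command's
# errors go there, while the shell writes time's report to log.file.
{ time my_command 2>&3 ; } 3>&2 2> log.file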
If your shell is Bash, you may be able to use process substitution to filter some of the output.
I wouldn't try it with a single command line; the hieroglyphs needed to make it work are ghastly and best encapsulated in shell scripts.
For example, here is a shell script, time.filter, that captures the output from time and writes only the real time to a log file (default log.file, configurable by giving an alternative log file name as the first argument):
#!/bin/sh
output="${1:-log.file}"
[ $# -gt 0 ] && shift
sed -E '/^real[[:space:]]+(([0-9]+m)?[0-9]+[.][0-9]+s?)/{ s//\1/; w '"$output"'
d;}
/^(user|sys)[[:space:]]+(([0-9]+m)?[0-9]+[.][0-9]+s?)/d' "$@"
This assumes your sed uses -E to enable extended regular expressions.
The first line of the script finds the line containing the real label and the time after it (in a number of possible formats, but not all). It accepts an optional minutes value such as 60m05.003s, or just a seconds value such as 5.00s, or just 5.0 (POSIX formats; at least one digit after the decimal point is required). It captures the time part and writes it to the chosen file (by default, log.file; you can specify an alternative name as the first argument on the command line). Note that even GNU sed treats everything after the w command as the file name; you have to continue the d (delete) command and the closing brace } on a new line. GNU sed does not require the semicolon after d; BSD (macOS) sed does. The second line recognizes and deletes the lines reporting the user and sys times. Everything else is passed through unaltered.
The script processes any files you give it after the log file name, or standard input if you give it none. A better command line notation would use an explicit option (-l logfile) and getopts to specify the log file.
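That variant's option handling might start something like this (a sketch only; the sed body stays as above):
# Parse -l logfile instead of taking the log name as the first argument.
output="log.file"
while getopts 'l:' opt; do
    case "$opt" in
    l) output="$OPTARG" ;;
    *) echo "Usage: $0 [-l logfile] [file ...]" >&2; exit 1 ;;
    esac
done
shift $((OPTIND - 1))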
With that in place, we can devise a program that reports to standard error and standard output — my_command:
echo "nonsense: error: positive numbers are required for argument 1" >&2
dribbler -s 0.4 -r 0.1 -i data -t
echo "apoplexy: unforeseen problems induced temporary amnesia" >&2
You could use cat data instead of the dribbler command. The dribbler command as shown reads lines from data and writes them to standard output, with a random, Gaussian-distributed delay between lines. The mean delay is 0.4 seconds; the standard deviation is 0.1 seconds. The other two lines are pretending to be commands that report errors to standard error.
My data file contained a nonsense 'poem' called 'The Great Panjandrum'.
With this background in place, we can run the command and capture the real time in log.file, delete (ignore) the user and system time values, while sending the rest of standard error to standard error by using:
$ (time my_command) 2> >(tee raw.stderr | time.filter >&2)
nonsense: error: positive numbers are required for argument 1
So she went into the garden
to cut a cabbage-leaf
to make an apple-pie
and at the same time
a great she-bear coming down the street
pops its head into the shop
What no soap
So he died
and she very imprudently married the Barber
and there were present
the Picninnies
and the Joblillies
and the Garyulies
and the great Panjandrum himself
with the little round button at top
and they all fell to playing the game of catch-as-catch-can
till the gunpowder ran out at the heels of their boots
apoplexy: unforeseen problems induced temporary amnesia
$ cat log.file
0m7.278s
(The time taken is normally between 6 and 8 seconds. There are 17 lines, so you'd expect it to take around 6.8 seconds at 0.4 seconds per line.) The blank line is from time; it is pretty hard to remove that blank line, and only that blank line, especially as POSIX says it is optional. It isn't worth it.

How does rm * work with a huge number of files in AIX (ksh)?

Is there any limit on the number of arguments that can be passed to the rm command in AIX? I use ksh. I tried to run
rm *
in one directory that (now) contains >500,000 files, but I received a strange error. If I remember correctly it was something like "memory core dump". What does it mean? Can I assume some files were removed? I don't know how many files were there before I executed this command.
I think the shell was not able to collect all the filenames and pass them to the rm command, so nothing was removed, but I really don't know. Can someone advise me how this works?
How long an argument list is allowed to be is defined by ARG_MAX which you can query using getconf, e.g.:
$ uname -o
GNU/Linux
$ getconf ARG_MAX
2097152
This is not a limit set by the shell but by the underlying system call(s) involved.
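As a practical aside (a sketch, not from the original answer): to remove that many files without ever building an oversized argument list, let find batch the names itself. The idiom below restricts itself to the current directory using only POSIX find features:
# Prune at depth 1 so find does not descend, and let -exec ... + pack
# as many names per rm invocation as ARG_MAX allows.
find . ! -name . -prune -type f -exec rm -f {} +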

Maximum command length not reached, yet still exceeded

I'm having a weird problem when calling a MATLAB function from a (bash) shell script run in Cygwin.
This is the problematic command:
"$MATLAB_PATH/matlab" -wait -nojvm -nosplash -automation -logfile "$MATLAB_LOGFILE" -r "myFunction $(echo ${FUNCTION_ARGS[#]}); quit;"
which, when echoed on the bash command line, evaluates to something like the following:
/cygdrive/c/Program Files (x86)/MATLAB/R2010a/bin/matlab -wait -nojvm \
-nosplash -automation -logfile MATLAB_output.txt -r myFunction \
/path/to/relevant/data/data1.txt /path/to/other/relevant/data/data2.txt \
<<several more such arguments>>; quit;
In total, the length of the command is ~2000 characters, depending a bit on which path the script is called from.
The problem is that my MATLAB function receives only 17 arguments (~1017 characters), while I send it well over 30 arguments.
Other observed behavior:
When I copy-paste the echoed command line into a regular MATLAB session (that is, not the automation server), there seems to be no problem and the function executes just fine on all ~30 arguments.
When I reduce the length of the command line (for example, by removing the -wait option), the MATLAB function will suddenly receive 18 arguments, with the last argument a portion of the 18th string that I passed in.
Reducing or increasing the command line length by a few characters in other ways (duplicating slashes in paths, duplicating spaces, etc.) has no effect.
EDIT: The copy-pasted command line seems to have a maximum length of 1014 characters.
So apparently, somewhere along the tool chain, there is a limitation on the maximum length a command can have. I'm not finding anything relevant in the docs of MATLAB, its automation server, bash, or Cygwin -- they all have limits, but more on the order of 32K characters, way more than what I'm passing in.
So...I'm at a loss. I'm not sure how to diagnose which tool is causing this...any ideas?
EDIT:
Output of xargs --show-limits:
Your environment variables take up 5556 bytes
POSIX upper limit on argument length (this system): 24396
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 18840
Size of command buffer we are actually using: 24396
Output of expr $(getconf ARG_MAX) - $(env|wc -c) - $(env|wc -l) \* 4 - 2048:
24098
So, as I said, even the smallest of these does not come close to my ~2000 characters.
There may be a limit in MATLAB's processing of the command line. For example, the buffer allocated for the -r statement may be limited to 1014 characters.
Instead, save the function invocation in a MATLAB script file and pass that script's name to MATLAB on the command line.
http://www.mathworks.com/help/matlab/ref/matlabwindows.html says:
matlab -r "statement" starts MATLAB and executes the specified MATLAB statement. If statement is the name of a MATLAB function or script, do not specify the file extension. Any required file must be on the MATLAB search path or in the startup folder.
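A sketch of that idea against the original command (run_myFunction is a made-up name; the script must end up on the MATLAB search path or in the startup folder, per the documentation quoted above):
# Write the long invocation into a .m file, then pass only its name to -r.
printf 'myFunction %s; quit;\n' "${FUNCTION_ARGS[*]}" > run_myFunction.m
"$MATLAB_PATH/matlab" -wait -nojvm -nosplash -automation \
    -logfile "$MATLAB_LOGFILE" -r "run_myFunction"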

Do some programs not accept process substitution for input files?

I'm trying to use process substitution for an input file to a program, and it isn't working. Is it because some programs don't allow process substitution for input files?
The following doesn't work (first, for reference, the contents of the input file):
bash -c "cat meaningless_name"
>sequence1
gattacagattacagattacagattacagattacagattacagattacagattaca
>sequence2
gattacagattacagattacagattacagattacagattacagattacagattaca
bash -c "clustalw -align -infile=<(cat meaningless_name) -outfile=output_alignment.aln -newtree=output_tree.dnd"
Less verbose output, finishing with:
No sequences in file. No alignment!
But the following controls do work:
bash -c "clustalw -align -infile=meaningless_name -outfile=output_alignment.aln -newtree=output_tree.dnd"
Verbose output, finishing with:
CLUSTAL-Alignment file created [output_alignment.aln]
bash -c "cat <(cat meaningless_name) > meaningless_name2"
diff meaningless_name meaningless_name2
(No output: the two files are the same)
bash -c "clustalw -align -infile=meaningless_name2 -outfile=output_alignment.aln -newtree=output_tree.dnd"
Verbose output, finishing with:
CLUSTAL-Alignment file created [output_alignment.aln]
Which suggests that process substitution itself works, but that the clustalw program doesn't like it, perhaps because process substitution creates a non-standard file, or a file with an unusual filename.
Is it common for programs to not accept process substitution? How would I check whether this is the issue?
I'm running GNU bash version 4.0.33(1)-release (x86_64-pc-linux-gnu) on Ubuntu 9.10. Clustalw is version 2.0.10.
Process substitution creates a named pipe. You can't seek into a named pipe.
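You can see what the program is actually handed (a quick Linux/bash check; stat -c %F is GNU-specific):
$ echo <(echo hi)            # prints something like /dev/fd/63
$ stat -L -c %F <(echo hi)   # reports "fifo", i.e. a pipe, not a regular file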
Yes. I've noticed the same thing in other programs. For instance, it doesn't work in emacs either. It gives "File exists but can not be read". And it's definitely a special file; for me it is /proc/self/fd/some_number. And it doesn't work reliably in either less or most with default settings.
For most:
most <(/bin/echo 'abcdef')
and anything that short or shorter displays nothing; longer input has its beginning truncated. less apparently works, but only if you specify -f.
I find zsh's =(...) substitution much more useful in practice. It's syntactically the same, except = instead of <. But it just creates a temporary file, so support doesn't depend on the program.
EDIT:
I found zsh uses TMPPREFIX to choose the temporary filename. So even if you don't want your real /tmp to be tmpfs, you can mount one for zsh.
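Applied to the command from the question, that would look something like this (zsh syntax; untested against clustalw):
clustalw -align -infile==(cat meaningless_name) \
    -outfile=output_alignment.aln -newtree=output_tree.dnd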
