xargs command length limits - bash

I am using jsonlint to lint a bunch of files in a directory (recursively). I wrote the following command:
find ./config/pages -name '*.json' -print0 | xargs -0I % sh -c 'echo Linting: %; jsonlint -V ./config/schema.json -q %;'
It works for most files but some files I get the following error:
Linting: ./LONG_FILE_NAME.json
fs.js:500
return binding.open(pathModule._makeLong(path), stringToFlags(flags), mode);
^
Error: ENOENT, no such file or directory '%'
It appears to fail for long filenames. Is there a way to fix this? Thanks.
Edit 1:
Found the problem.
-I replstr
Execute utility for each input line, replacing one or more occurrences
of replstr in up to replacements (or 5 if no -R flag is specified)
arguments to utility with the entire line of input. The resulting
arguments, after replacement is done, will not be allowed to grow
beyond 255 bytes; this is implemented by concatenating as much of the
argument containing replstr as possible, to the constructed arguments
to utility, up to 255 bytes. The 255 byte limit does not apply to
arguments to utility which do not contain replstr, and furthermore, no
replacement will be done on utility itself. Implies -x.
Edit 2:
Partial solution. Supports longer file names than before but still not as long as I need.
find ./config/pages -name '*.json' -print0 | xargs -0I % sh -c 'file=%; echo Linting: $file; jsonlint -V ./config/schema.json -q $file;'

On BSD-like systems (e.g. Mac OS X)
If you happen to be on a Mac, FreeBSD, or similar, your xargs implementation may support the -J option, which does not suffer from the argument-size limit imposed on the -I option.
Excerpt from the manpage:
-J replstr
If this option is specified, xargs will use the data read from standard input to replace the first occurrence of replstr instead of appending that data after all other arguments. This option will not affect how many arguments will be read from input (-n), or the size of the command(s) xargs will generate (-s). The option just moves where those arguments will be placed in the command(s) that are executed. The replstr must show up as a distinct argument to xargs. It will not be recognized if, for instance, it is in the middle of a quoted string. Furthermore, only the first occurrence of the replstr will be replaced. For example, the following command will copy the list of files and directories which start with an uppercase letter in the current directory to destdir:
/bin/ls -1d [A-Z]* | xargs -J % cp -Rp % destdir
If you need to refer to the replstr multiple times (remember, -J only replaces the first occurrence), you can use this pattern:
echo hi | xargs -J{} sh -c 'arg=$0; echo "$arg $arg"' "{}"
=> hi hi
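Applying that pattern to the original jsonlint command might look something like this (an untested sketch; -J drops the whole batch of filenames read from stdin at the {} position, and the small sh loop then handles each one):
find ./config/pages -name '*.json' -print0 | xargs -0 -J {} sh -c 'for file in "$@"; do echo "Linting: $file"; jsonlint -V ./config/schema.json -q "$file"; done' _ {}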
POSIX-compliant method
The POSIX-compliant method of doing this would be to use some other tool, e.g. sed, to construct the code you want to execute and then use xargs just to specify the utility. When no replstr is used with xargs, the 255-byte limit does not apply. See the xargs POSIX spec.
find . -type f -name '*.json' -print |
sed "s_^_-c 'file=\\\"_g;s_\$_\\\"; echo \\\"Definitely over 255 byte script..$(printf "a%.0s" {1..255}): \\\$file\\\"; wc -l \\\"\\\$file\\\"'_g" |
xargs -L1 sh
This of course largely defeats the purpose of xargs to begin with, but it can still be used to leverage e.g. parallel execution via xargs -L1 -P10 sh, which is quite widely supported, though not POSIX.

Use -exec in find instead of piping to xargs.
find ./config/pages -name '*.json' -exec echo Linting: {} \; -exec jsonlint -V ./config/schema.json -q {} \;
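If the cost of spawning two processes per file matters, a sketch of a variant that batches many files into each sh invocation via the + terminator of -exec:
find ./config/pages -name '*.json' -exec sh -c 'for f in "$@"; do echo "Linting: $f"; jsonlint -V ./config/schema.json -q "$f"; done' _ {} +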

The limit on xargs's command-line length is imposed by the system limit ARG_MAX (it is not an environment variable). You can check it like:
$ getconf ARG_MAX
2097152
Surprisingly, there doesn't seem to be a way to change it, barring kernel modification.
But even more surprisingly, xargs by default caps itself at a much lower value, which you can increase with the -s option. Still, ARG_MAX is not the value you can pass to -s; according to man xargs you need to subtract the size of the environment, plus some "headroom", no idea why. To find out the actual number, use the following command (alternatively, passing an arbitrarily big number to -s will produce a descriptive error):
$ xargs --show-limits 2>&1 | grep "limit on argument length (this system)"
POSIX upper limit on argument length (this system): 2092120
So you need to run … | xargs -s 2092120 …, e.g. with your command:
find ./config/pages -name '*.json' -print0 | xargs -s 2092120 -0I % sh -c 'echo Linting: %; jsonlint -V ./config/schema.json -q %;'
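If you would rather not hard-code the number, a sketch that derives it from GNU xargs itself (assuming the --show-limits wording shown above):
limit=$(xargs --show-limits </dev/null 2>&1 | awk '/upper limit on argument length \(this system\)/ {print $NF}')
find ./config/pages -name '*.json' -print0 | xargs -s "$limit" -0I % sh -c 'echo Linting: %; jsonlint -V ./config/schema.json -q %;'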

Related

How to search and replace with egrep and sed on macOS?

I want to match a pattern in a file and replace it.
This command works with egrep, xargs and sed:
egrep -lRZ "hello" . | xargs -0 -l sed -i -e 's/hello/world/g'
The problem: It does not work on macOS because the xargs of macOS does not support the -l argument.
xargs: illegal option -- l
usage: xargs [-0opt] [-E eofstr] [-I replstr [-R replacements]] [-J replstr]
[-L number] [-n number [-x]] [-P maxprocs] [-s size]
[utility [argument ...]]
How is this solvable on MacOS?
There are actually three incompatibilities you're going to run into here between the GNU (Linux) and BSD (macOS) utilities.
The one you're getting an error message from is that BSD's xargs doesn't accept the -l option. But -l is equivalent to -L, except that -L requires an argument specifying the maximum number of lines to pass per invocation of the command, while -l defaults to one if it isn't specified. Thus, you can just replace -l with -L1. -L is understood the same way by both the GNU and BSD versions of xargs, so using this is portable between Linux and macOS.
But in this particular case, there's another even easier option: sed is perfectly capable of operating on multiple files per invocation, so there's no reason to limit it to one per invocation. This'll even be slightly faster, since it doesn't have to spend as much time launching new processes. So just leave -l off.
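On Linux, for instance, the same command with -l simply dropped would be (the macOS-specific fixes come next):
egrep -lRZ "hello" . | xargs -0 sed -i -e 's/hello/world/g'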
The GNU and BSD versions of egrep (and others in the grep family) both take the option -Z, but they use it to mean completely different things. With GNU, egrep -Z prints zero bytes (ASCII NUL characters) after each filename (matching what xargs -0 expects). But with BSD, egrep -Z is equivalent to zgrep: it treats its input files as compressed files and expands them before searching their contents.
Fortunately, both versions understand --null to invoke zero-byte delimiters, so you can use that portably on both platforms.
Both the GNU and BSD versions understand -i<suffix> to mean "edit in place, but back up the original with the specified filename suffix". And for both of them, if the suffix is zero-length, it doesn't keep a backup. Unfortunately, the way you specify a zero-length suffix is different and (as far as I've been able to find) irreconcilably incompatible. Specifically, GNU requires the suffix to be directly attached to the -i (e.g. -i.bkp), so just specifying -i by itself is enough to specify in-place-without-backup mode. But BSD allows the suffix to be passed as a separate argument (e.g. -i .bkp), so if you just specify -i by itself, it'll use whatever the next argument is as a suffix (e.g. sed -i -e 's/hello/world/g' will use "-e" as a suffix). To specify in-place-without-backup mode, you need to follow -i with an explicit empty argument (e.g. sed -i '' -e 's/hello/world/g'). But if you do that with GNU's sed, it'll try to execute the empty argument as its script, which will fail.
With all that, here's the macOS version of your command:
egrep -lR --null "hello" . | xargs -0 sed -i '' -e 's/hello/world/g'
...which will almost work on Linux -- the only difference is that you need to remove the '' argument to sed. If you want something that's fully portable between Linux and macOS, you need to specify a backup suffix (and attach it directly to the -i option, as in -i.bkp).
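Putting it together, a sketch that should run unchanged on both Linux and macOS, at the cost of leaving .bkp backup files behind (the cleanup with find -delete is supported by both GNU and BSD find, though it is not POSIX):
egrep -lR --null "hello" . | xargs -0 sed -i.bkp -e 's/hello/world/g'
find . -name '*.bkp' -type f -delete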
The grep options to recursively search for files are best avoided - they just clutter up your grep args and make your scripts non-portable. There's already a perfectly good tool designed to find files with a very obvious name.
Are you just trying to replace hello with world in all your files? If so that's just
find . -type f |
while IFS= read -r file; do
sed 's/hello/world/g' "$file" > "tmp$$" &&
mv "tmp$$" "$file"
done
That'll work in any shell on any UNIX box unless your file names contain newlines. If you didn't want to change timestamps etc. on files that don't contain hello one way is:
find . -type f -exec grep -q 'hello' {} \; -print |
while IFS= read -r file; do
sed 's/hello/world/g' "$file" > "tmp$$" &&
mv "tmp$$" "$file"
done

How can I search and execute two scripts simultaneously in Unix?

Suppose you have a folder that contains two files.
Example: stop_tomcat_center.sh and start_tomcat_center.sh.
In my example, ls *tomcat* returns these two scripts.
How can I search and execute these two scripts simultaneously?
I tried
ls *tomcat* | xargs sh
but only the first script is executed (not the second).
An easy way to do multiple things in parallel is with GNU Parallel:
parallel ::: ./*tomcat*
Or, if your scripts don't have a shebang at the first line:
parallel bash ::: ./*tomcat*
Or, if you prefer xargs, run one script per invocation with up to two in parallel:
ls *tomcat* | xargs -n 1 -P 2 sh
xargs is missing the -n 1 option.
From man xargs:
-n max-args, --max-args=max-args
Use at most max-args arguments per command line. Fewer than max-args arguments will be used if the size (see the -s option) is exceeded, unless the -x option is given, in which case xargs will exit.
xargs otherwise tries to execute the command with as many parameters as possible, which makes sense for most commands.
In your case ls *tomcat* | xargs sh is running sh stop_tomcat_center.sh start_tomcat_center.sh and the stop_tomcat_center.sh is probably just ignoring the $1 parameter.
Also, it is not a good idea to use the output of ls. A better way would be to use
find . -maxdepth 1 -name '*tomcat*' -print0 | xargs -0 -n 1 sh
or
for command in *tomcat*; do sh "$command"; done
This answer is based on the assumption that the OP meant "both with one command line" when he wrote "simultaneously".
For solutions on parallel execution, take a look at the other answers.
You can do the following to search and execute:
find . -name "*.sh" -exec sh {} \;
find locates each matching script and -exec runs it with sh.

How do you grep results from 'find'?

Trying to find a word/pattern within the files returned by the find command.
For instance, I have this command:
find . -name Gruntfile.js that returns several file names.
How do I grep within these for a word pattern?
Was thinking something along the lines of:
find . -name Gruntfile.js | grep -rnw -e 'purifycss'
However, this doesn't work.
Use the -exec {} + option to pass the list of filenames that are found as arguments to grep:
find -name Gruntfile.js -exec grep -nw 'purifycss' {} +
This is the safest and most efficient approach, as it doesn't break when the path to the file isn't "well-behaved" (e.g. contains a space). Like an approach using xargs, it also minimises the number of calls to grep by passing multiple filenames at once.
I have removed the -e and -r switches, as I don't think that they're useful to you here.
An excerpt from man find:
-exec command {} +
This variant of the -exec action runs the specified command on the selected files, but the command line is built by appending each selected file name at the end; the total number of invocations of the command will be much less than the number of matched files.
While this doesn't strictly answer your question, provided you have globstar turned on (shopt -s globstar), you could filter the results in bash like this:
grep something **/Gruntfile.js
I religiously used the approach described by Tom Fenech until I switched to zsh, which handles such things much better. Now all I do is:
grep text **/*(.)
which greps text through all regular files in current directory.
I believe this to be much cleaner syntax especially for day-to-day work in shell.
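If you only care about particular file types, the qualifier combines with an ordinary pattern (still assuming zsh, as above):
grep text **/*.js(.)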
When too many files exist for the * expansion to run:
$ grep -o 'xxmaj\|xxbos\|xxfld' train/* | wc -l
-bash: /bin/grep: Argument list too long
0
Then this code fixes the “too long” problem:
$ find junk -maxdepth 1 -type f | xargs grep -o 'TVDetails\|xxmaj\|xxbos\|xxfld'
junk/gum-.doc.out:TVDetails
junk/Zv0n.doc.out:TVDetails
$ find junk -maxdepth 1 -type f | xargs grep -o 'TVDetails\|xxmaj\|xxbos\|xxfld' | wc -l
2
It runs faster on my system, and maybe yours, when using the -P 0 option:
$ /usr/bin/time -f "%E Elapsed Real Time" find train -maxdepth 1 -type f | xargs -P 0 grep -o 'TVDetails\|xxmaj\|xxbos\|xxfld' | wc -l
0:02.45 Elapsed Real Time
358
$ /usr/bin/time -f "%E Elapsed Real Time" find train -maxdepth 1 -type f | xargs grep -o 'TVDetails\|xxmaj\|xxbos\|xxfld' | wc -l
0:11.96 Elapsed Real Time
358
Hope this helps.

Making xargs work in Cygwin

Linux/bash, taking the list of lines on input and using xargs to work on each line:
% ls -1 --color=never | xargs -I{} echo {}
a
b
c
Cygwin, take 1:
$ ls -1 --color=never | xargs -I{} echo {}
xargs: invalid option -- I
Usage: xargs [-0prtx] [-e[eof-str]] [-i[replace-str]] [-l[max-lines]]
[-n max-args] [-s max-chars] [-P max-procs] [--null] [--eof[=eof-str]]
[--replace[=replace-str]] [--max-lines[=max-lines]] [--interactive]
[--max-chars=max-chars] [--verbose] [--exit] [--max-procs=max-procs]
[--max-args=max-args] [--no-run-if-empty] [--version] [--help]
[command [initial-arguments]]
Cygwin, take 2:
$ ls -1 --color=never | xargs echo
a b c
(yes, I know there's a universal method of ls -1 --color=never | while read X; do echo ${X}; done, I have tested that it works in Cygwin too, but I'm looking for a way to make xargs work correctly in Cygwin)
damienfrancois's answer is correct. You probably want to use -n to force echo to print one file name at a time.
However, if you are really interested in taking each file and executing it one at a time, you may be better off using find:
$ find . -maxdepth 1 -exec echo {} \;
A few things:
This will pick up file names that begin with a period (including '.')
This will put a ./ in front of your file names.
The echo being used is from /bin/echo and not the built in shell version of echo.
However, it doesn't depend upon the shell executing ls * and possibly causing issues (such as coloring file names, or printing out files in sub-directories, which your command will do).
The purpose of xargs is to minimize the number of executions of a particular command:
$ find . -type f | xargs foo
In this case, xargs will execute foo only a minimal number of times. foo will only execute when the command line buffer gets full, or there are no more file names. However, if you are forcing an execution after each name, you're probably better off using find. It's a lot more flexible and you're not depending upon shell behavior.
Use the -n argument of xargs, which is really the one you should be using here; -I is an option that serves to give the argument a 'name' so you can make it appear anywhere in the command line:
$ ls -1 --color=never | xargs echo
a b c
$ ls -1 --color=never | xargs -n 1 echo
a
b
c
From the manpage:
-n max-args
Use at most max-args arguments per command line
-I replace-str
Replace occurrences of replace-str in the initial-arguments with names read from standard input.

Why does the wc utility generate multiple lines with "total"?

I am using the wc utility in a shell script that I run from Cygwin, and I noticed that there is more than one line with "total" in its output.
The following function is used to count the number of lines in my source files:
count_curdir_src() {
find . '(' -name '*.vb' -o -name '*.cs' ')' \
-a '!' -iname '*.Designer.*' -a '!' -iname '.svn' -print0 | \
xargs -0 wc -l
}
But its output for a certain directory looks like this:
$ find . '(' -name '*.vb' -o -name '*.cs' ')' -a '!' -iname '*.Designer.*' -a '!' -iname '.svn' -print0 | xargs -0 wc -l
19 ./dirA/fileABC.cs
640 ./dirA/subdir1/fileDEF.cs
507 ./dirA/subdir1/fileGHI.cs
2596 ./dirA/subdir1/fileJKL.cs
(...many others...)
58 ./dirB/fileMNO.cs
36 ./dirB/subdir1/filePQR.cs
122200 total
6022 ./dirB/subdir2/subsubdir/fileSTU.cs
24 ./dirC/fileVWX.cs
(...)
36 ./dirZ/Properties/AssemblyInfo.cs
88 ./dirZ/fileYZ.cs
25236 total
It looks like wc resets somewhere in the process. It cannot be caused by space characters in filenames or directory names, because I use the -print0 option. And it only happens when I run it on my largest source tree.
So, is this a bug in wc, or in Cygwin? Or something else? The wc manpage says:
Print newline, word, and byte counts
for each FILE, and a total line if
more than one FILE is specified.
It doesn't mention anything about multiple total lines (intermediate total counts or something), so who's to blame here?
What's happening is that xargs is running wc multiple times. xargs by default batches as many arguments as it thinks it can into each invocation of the command it's supposed to run, but if there are too many files it will run the command multiple times on subsets of the files.
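You can see that batching behaviour directly by forcing tiny batches with -n; each output line below is a separate invocation of echo:
$ seq 1 7 | xargs -n 3 echo
1 2 3
4 5 6
7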
There are a couple ways I see to fix this. The first, which will break if you have too many files, is to skip xargs and use the shell. This may not work well on Cygwin, but would look like this:
wc -l $(find . '(' -name '*.vb' -o -name '*.cs' ')' \
-a '!' -iname '*.Designer.*' -a '!' -iname '.svn' )
and you also lose the print0 capabilities.
The other is to use an awk (or perl) script to process the output of your find/xargs combo, skip "total" lines, and sum up the total yourself.
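A sketch of that awk approach, which ignores the intermediate "total" lines and prints a single grand total (it assumes no reported path is literally the word "total"):
find . '(' -name '*.vb' -o -name '*.cs' ')' -a '!' -iname '*.Designer.*' -a '!' -iname '.svn' -print0 | \
xargs -0 wc -l | awk '$2 != "total" {sum += $1} END {print sum, "total"}'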
You're calling wc multiple times - once for each "batch" of input arguments provided by xargs. You're getting one total per batch.
One alternative is to use a temporary file and the --files0-from option for wc:
$ find . '(' -name '*.vb' -o -name '*.cs' ')' -a '!' -iname '*.Designer.*' -a '!' -iname '.svn' -print0 > files
$ wc --files0-from files
The command-line length is much more limited under Cygwin than on a standard Linux box, and xargs must split the input to respect those limits. You can check the limits with xargs --show-limits:
On cygwin:
$ xargs --show-limits < /dev/null
Your environment variables take up 4913 bytes
POSIX upper limit on argument length (this system): 25039
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 20126
Size of command buffer we are actually using: 25039
On centos:
$ xargs --show-limits < /dev/null
Your environment variables take up 1816 bytes
POSIX upper limit on argument length (this system): 2617576
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2615760
Size of command buffer we are actually using: 131072
And to build on Jon Skeet's answer, you don't need to create an additional file; you can pipe your find results directly to wc by passing - as the argument to --files0-from:
find . -name '*.vb' -print0 | wc -l --files0-from=-
To avoid generating multiple lines with "total" counts when feeding the wc utility an enormous number of file paths as command-line arguments, you can use an intermediate xargs to cat the contents of the files to the stdin of wc (see "piping output of find to xargs wc gives unreasonable totals").
This is a workaround if your wc does not have the --files0-from option mentioned by Xavier.
count_curdir_src() (
export LC_ALL=C
find . -name '*.vb' -print0 | xargs -0 -n 1000 cat | wc -l
)
