Getting message "-bash: /usr/bin/grep: Argument list too long" - bash

I am trying to count the number of files containing a sub-string with a grep command:
grep -il "touch screen" * | wc -l
The number of files is almost 20,000. I am getting the message:
-bash: /usr/bin/grep: Argument list too long
Does it mean there are too many files? What is the remedy? I am using OS X.

Technically, there aren't too many files, but the combined length of all their names is too large. When starting a new process, you pass its arguments to it in the form of an array of strings, and the operating system puts a hard limit on how large that array is allowed to be.
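For reference, you can inspect that limit directly; getconf ARG_MAX reports the maximum combined size, in bytes, of the argument list and environment on both macOS and Linux:
getconf ARG_MAX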
A simple, though inefficient, solution is not to pass the expansion of * to grep as its argument list, but to iterate over it with a shell loop, where the limit does not apply:
for f in *; do
grep 'touch screen' "$f"
done
Here, the shell does not try to pass everything * expands to as a single array; it uses one element at a time. This requires a huge number of calls to grep, though, so a better solution is to use a tool that can batch the results of the pathname expansion into smaller, manageable sets:
find . -exec grep 'touch screen' {} +
Here, find passes as many files to grep as possible on each call, repeating the process until grep has been called on all the files.
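To reproduce the original count (files containing the string, case-insensitively) with this approach, something along these lines should work; -maxdepth 1, supported by both BSD/macOS and GNU find, keeps it from descending into subdirectories, though unlike * it also picks up dot files:
find . -maxdepth 1 -type f -exec grep -il 'touch screen' {} + | wc -l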
Both of the previous techniques work because grep processes each file independently: { grep '...' f; grep '...' g; } and grep '...' f g find the same matches (and with options like -l or -h the output is literally identical). If you can't split your command into multiple invocations over smaller subsets, the only remedy is to hope the command can read its arguments from a file (either a named file or standard input) instead.
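With grep you can approximate that by generating the list with a shell builtin (builtins do not start a new process, so the limit does not apply) and feeding it to xargs on standard input. A sketch:
printf '%s\0' * | xargs -0 grep -il 'touch screen' | wc -l
Here printf is a bash builtin, so the expansion of * never hits the kernel limit, and xargs -0 batches the NUL-delimited names into as few grep calls as will fit.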

Since you're trying to grep all the files of a directory, I would recommend using grep's recursive mode (-R):
grep -ilR "touch screen" . | wc -l
should be equivalent to your
grep -il "touch screen" * | wc -l
The two commands will be different if your directory contains subdirectories (but then you would have had errors with your current grep command).
Another option would have been to invoke grep on each file in a loop, then wc -l the output of the loop:
for file in *; do
grep -il "touch screen" "$file"
done | wc -l
I commented about xargs, which is indeed a good way to pass the output of one command as arguments to another:
ls . | xargs -L 20 grep -il "touch screen" | wc -l
Here -L 20 limits each grep invocation to at most 20 input lines (one file name per line), and xargs calls grep as many times as needed.
However, xargs cannot operate on the expansion of * directly; it has to read the output of a command such as ls, as shown. Parsing the output of ls is error prone, so I wouldn't recommend it.
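A safer sketch along the same lines replaces ls with find -print0 (both BSD/macOS and GNU tools support -print0 and -0) and uses -n to cap the batch size:
find . -maxdepth 1 -type f -print0 | xargs -0 -n 20 grep -il "touch screen" | wc -l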

Related

How to print the file names from which I grep some lines

I'm trying to get some lines from several json files using the following code:
cat $(find ./*/*/folderA/*DTI*.json) | grep -i -E '(phaseencodingdirection|phaseencodingaxis)' > phase_direction
It worked! The problem is that I don't know which line comes from which file.
With this find ./*/*/preprocessing/*DTI*.json -type f -printf "%f\n" I can print those names, but they appear at the end and not in order with their respective phaseencodingdirection|phaseencodingaxis extracted lines.
I don't know how to combine those commands so that each extracted line is printed together with the name of the file it came from.
Could you help me?
the problem is that I don't know which line comes from which file
Well no, you don't, because you have concatenated the contents of all the files into a single stream. If you want to be able to identify at the point of pattern matching which file each line comes from then you have to give that information to grep in the first place. Like this, for example:
find ./*/*/folderA/*DTI*.json |
xargs grep -i -E -H '(phaseencodingdirection|phaseencodingaxis)' > phase_direction
The xargs program converts lines read from its standard input into arguments to the specified command (grep in this case). The -H option to grep causes it to list the filename of each match along with the matching line itself.
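If any of those paths could contain spaces, a null-delimited sketch of the same pipeline is safer (assuming your find and xargs support -print0 and -0, as both the GNU and BSD versions do):
find ./*/*/folderA/*DTI*.json -print0 |
xargs -0 grep -i -E -H '(phaseencodingdirection|phaseencodingaxis)' > phase_direction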
Alternatively, this variation on the same thing is a little simpler, and closer in some senses to the original:
grep -i -E -H '(phaseencodingdirection|phaseencodingaxis)' \
$(find ./*/*/folderA/*DTI*.json) > phase_direction
That takes xargs out of the picture, and moves the command substitution directly to the argument list of grep.
But now observe that if the pattern ./*/*/folderA/*DTI*.json does not match any directories then find isn't actually doing anything useful for you. There is then no directory recursion to be done, and you haven't specified any tests, so the command substitution will simply expand to all the paths that match the pattern, just like the pattern would do if expanded without find. Thus, this is probably best of all:
grep -i -E -H '(phaseencodingdirection|phaseencodingaxis)' \
./*/*/folderA/*DTI*.json > phase_direction
Use the filenames as arguments to grep rather than cat.
grep -i -H -E '(phaseencodingdirection|phaseencodingaxis)' $(find ./*/*/folderA/*DTI*.json) > phase_direction
The -H option forces grep to include filenames in the output even if there's only one file.
But since your arguments to find are filenames, not directories to search recursively, there's no need to use it at all. Just pass the wildcard directly to grep. There's also no need to begin with ./. Any non-absolute pathname is interpreted relative to the current directory.
grep -i -H -E '(phaseencodingdirection|phaseencodingaxis)' */*/folderA/*DTI*.json > phase_direction
You may use recursive grep:
grep -iER 'phaseencodingdirection|phaseencodingaxis' --include=*DTI*.json */*/folderA

Human-readable filesize and line count

I want a bash command that will return a table, where each row is the human-readable filesize, number of lines, and filename. The table should be sorted by filesize.
I've been trying to do this using a combination of du -hs, wc -l, and sort -h, and find.
Here's where I'm at:
find . -exec echo $(du -h {}) $(wc -l {}) \; | sort -h
Your approach fell short not only because the shell expanded your command substitutions ($(...)) up front, but more fundamentally because you cannot pass shell command lines directly to find:
find's -exec action can only invoke external utilities with literal arguments - the only non-literal argument supported is the {} representing the filename(s) at hand.
choroba's answer fixes your immediate problem by invoking a separate shell instance in each iteration, to which the shell command to execute is passed as a string argument (-exec bash -c '...' \;).
While this works (assuming you pass the {} value as an argument rather than embedding it in the command-line string), it is also quite inefficient, because multiple child processes are created for each input file.
(While there is a way to have find pass (typically) all input files to a (typically) single invocation of the specified external utility - namely with terminator + rather than \;, this is not an option here due to the nature of the command line passed.)
An efficient and robust[1]
implementation that minimizes the number of child processes created would look like this:
Note: I'm assuming GNU utilities here, due to use of head -n -1 and sort -h.
Also, I'm limiting find's output to files only (as opposed to directories), because wc -l only works on files.
paste <(find . -type f -exec du -h {} +) <(find . -type f -exec wc -l {} + | head -n -1) |
awk -F'\t *' 'BEGIN{OFS="\t"} {sub(" .+$", "", $3); print $1,$2,$3}' |
sort -h -t$'\t' -k1,1
Note the use of -exec ... + rather than -exec ... \;, which ensures that typically all input filenames are passed to a single invocation to the external utility (if not all filenames fit on a single command line, invocations are batched efficiently to make as few calls as possible).
wc -l {} + prints a filename after each line count and invariably ends with a summary line, which head -n -1 strips away.
paste combines the lines from the two commands (whose respective inputs are provided by process substitutions, <(...)) into a single output stream.
The awk command then strips the extraneous filename that stems from wc from the end of each line.
Finally, the sort command sorts the result by the 1st (-k1,1) tab-separated (-t$'\t') column by human-readable numbers (-h), such as the numbers that du -h outputs (e.g., 1K).
[1] As with any line-oriented processing, filenames with embedded newlines are not supported, but I do not consider this a real-world problem.
Ok, I tried it with find/-exec as well, but the escaping is hell. With a shell function it is pretty straightforward:
#!/bin/bash
function dir
{
du=$(du -sh "$1" | awk '{print $1}')
wc=$(wc -l < "$1")
printf "%10s %10s %s\n" $du $wc "${1#./}"
}
printf "%10s %10s %s\n" "size" "lines" "name"
OIFS=$IFS; IFS=""
find . -type f -print0 | while read -r -d $'\0' f; do dir "$f"; done
IFS=$OIFS
Using bash's read with a NUL terminator it is even reasonably safe. The empty IFS is needed to keep read from stripping leading and trailing blanks from filenames.
BTW: $'\0' does not really work (same as '') - but it makes the intention clear.
Sample output:
size lines name
156K 708 sash
16K 64 hostname
120K 460 netstat
40K 110 fuser
644K 1555 dir/bash
28K 82 keyctl
2.3M 8067 vim
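As a side note, a variation (not part of the original script) scopes the empty IFS to the read call itself and passes the NUL delimiter as -d '', avoiding the global save/restore:
find . -type f -print0 | while IFS= read -r -d '' f; do dir "$f"; done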
The problem is that your shell interprets the $(...), so find doesn't get them. Escaping them doesn't help, either (\$\(du -h {}\)), as they become normal parameters to the commands, not command substitution.
The way to have them interpreted as command substitutions is to call a new shell, either directly
find . -exec bash -c 'echo $(du -h {}) $(wc -l {})' \; | sort -h
or by creating a script and calling it from find.
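A sketch of a middle ground (my addition, not one of the original answers): pass the filenames as arguments instead of embedding {} in the script, so find can batch them with + while a loop inside bash -c handles each one; the _ is just a placeholder for $0:
find . -type f -exec bash -c 'for f; do echo "$(du -h "$f") $(wc -l < "$f")"; done' _ {} + | sort -h
This still runs du and wc once per file, but it avoids spawning a separate bash for every file.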

perform an operation for *each* item listed by grep

How can I perform an operation for each item listed by grep individually?
Background:
I use grep to list all files containing a certain pattern:
grep -l '<pattern>' directory/*.extension1
I want to delete all listed files but also all files having the same file name but a different extension: .extension2.
I tried using the pipe, but it seems to take the output of grep as a whole.
In find there is the -exec option, but grep has nothing like that.
If I understand your specification, you want:
grep --null -l '<pattern>' directory/*.extension1 | \
xargs -n 1 -0 -I{} bash -c 'rm "$1" "${1%.*}.extension2"' -- {}
This is essentially the same as what @triplee's comment describes, except that it's newline-safe.
What's going on here?
grep with --null will return output delimited with nulls instead of newline. Since file names can have newlines in them delimiting with newline makes it impossible to parse the output of grep safely, but null is not a valid character in a file name and thus makes a nice delimiter.
xargs will take a stream of whitespace-delimited items and execute a given command, passing as many of those items as possible as parameters (or to echo if no command is given). Thus if you said:
printf 'one\ntwo three \nfour\n' | xargs echo
xargs would execute echo one two three four, with two and three becoming separate arguments. This is not safe for file names because they may contain spaces and even embedded newlines.
The -0 switch to xargs changes the delimiter to the null character (and turns off the special treatment of blanks and quotes). This makes it match the output we got from grep --null and makes it safe for processing a list of file names.
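A toy example makes the difference visible; with -0, the item containing a space survives as a single argument:
printf '%s\0' one 'two three' four | xargs -0 -n1 echo
This prints two three on a line of its own, because it reached echo as one argument, whereas the blank/newline-delimited version above would split it.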
Normally xargs simply appends the input to the end of the command. The -I switch changes this: it substitutes the input for the specified replacement string. To get the idea, try this experiment:
printf 'one\ntwo three \nfour\n' | xargs -I{} echo foo {} bar
And note the difference from the earlier printf | xargs command.
In the case of my solution the command I execute is bash, to which I pass -c. The -c switch causes bash to execute the commands in the following argument (and then terminate) instead of starting an interactive shell. The next block, 'rm "$1" "${1%.*}.extension2"', is the first argument to -c and is the script which will be executed by bash. Any arguments following the script argument to -c are assigned as the script's positional parameters. Thus, if I were to say:
bash -c 'echo $0' "Hello, world"
Then Hello, world would be assigned to $0 (the first argument to the script) and inside the script I could echo it back.
Since $0 is normally reserved for the script name I pass a dummy value (in this case --) as the first argument and, then, in place of the second argument I write {}, which is the replacement string I specified for xargs. This will be replaced by xargs with each file name parsed from grep's output before bash is executed.
The mini shell script might look complicated but it's rather trivial. First, the entire script is single-quoted to prevent the calling shell from interpreting it. Inside the script I invoke rm and pass it two file names to remove: the $1 argument, which was the file name passed when the replacement string was substituted above, and ${1%.*}.extension2. This latter is a parameter substitution on the $1 variable. The important part is %.* which says
% "Match from the end of the variable and remove the shortest string matching the pattern.
.* The pattern is a single period followed by anything.
This effectively strips the extension, if any, from the file name. You can observe the effect yourself:
foo='my file.txt'
bar='this.is.a.file.txt'
baz='no extension'
printf '%s\n'"${foo%.*}" "${bar%.*}" "${baz%.*}"
Since the extension has been stripped I concatenate the desired alternate extension .extension2 to the stripped file name to obtain the alternate file name.
The following prints the rm commands it would run; if the output does what you want, pipe it through /bin/sh.
grep -l 'RE' folder/*.ext1 | sed 's/\(.*\).ext1/rm "&" "\1.ext2"/'
Or if sed makes you itchy:
grep -l 'RE' folder/*.ext1 | while read file; do
echo rm "$file" "${file%.ext1}.ext2"
done
Remove echo if the output looks like the commands you want to run.
But you can do this with find as well:
find /path/to/start -name \*.ext1 -exec grep -q 'RE' {} \; -print | ...
where ... is either the sed script or the three lines from while to done.
The idea here is that find will ... well, "find" things based on the qualifiers you give it -- namely, that things match the file glob "*.ext1", AND that the result of the "exec" is successful. The -q tells grep to look for RE in {} (the file supplied by find), and exit with a TRUE or FALSE without generating any of its own output.
The only real difference between doing this in find vs doing it with grep is that you get to use find's awesome collection of conditions to narrow down your search further if required. man find for details. By default, find will recurse into subdirectories.
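A null-safe sketch of the while read variant above, for names containing spaces or newlines (assuming a grep that supports --null, as GNU grep and recent BSD grep do); keep the echo for a dry run:
grep --null -l 'RE' folder/*.ext1 | while IFS= read -r -d '' file; do
echo rm "$file" "${file%.ext1}.ext2"
done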
You can pipe the list to xargs:
grep -l '<pattern>' directory/*.extension1 | xargs rm
As for the second set of files with a different extension, I'd do this (as usual use xargs echo rm when testing to make a dry run; I haven't tested it, it may not work correctly with filenames with spaces in them):
filelist=$(grep -l '<pattern>' directory/*.extension1)
echo $filelist | xargs rm
echo ${filelist//.extension1/.extension2} | xargs rm
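If your bash is 4.0 or newer, a sketch using mapfile and an array avoids the word-splitting problem with spaces (embedded newlines would still break it); drop the echo after a dry run:
mapfile -t files < <(grep -l '<pattern>' directory/*.extension1)
echo rm -- "${files[@]}" "${files[@]/%.extension1/.extension2}"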
Pipe the result to xargs; it will allow you to run a command for each match.

How to execute the output of a command within the current shell?

I'm well aware of the source (aka .) utility, which will take the contents from a file and execute them within the current shell.
Now, I'm transforming some text into shell commands, and then running them, as follows:
$ ls | sed ... | sh
ls is just a random example, the original text can be anything. sed too, just an example for transforming text. The interesting bit is sh. I pipe whatever I got to sh and it runs it.
My problem is, that means starting a new sub shell. I'd rather have the commands run within my current shell. Like I would be able to do with source some-file, if I had the commands in a text file.
I don't want to create a temp file because it feels dirty.
Alternatively, I'd like to start my sub shell with the exact same characteristics as my current shell.
update
Ok, the solutions using backtick certainly work, but I often need to do this while I'm checking and changing the output, so I'd much prefer if there was a way to pipe the result into something in the end.
sad update
Ah, the /dev/stdin thing looked so pretty, but, in a more complex case, it didn't work.
So, I have this:
find . -type f -iname '*.doc' | ack -v '\.doc$' | perl -pe 's/^((.*)\.doc)$/git mv -f $1 $2.doc/i' | source /dev/stdin
Which ensures all .doc files have their extension lowercased.
And which, incidentally, can be handled with xargs, but that's beside the point.
find . -type f -iname '*.doc' | ack -v '\.doc$' | perl -pe 's/^((.*)\.doc)$/$1 $2.doc/i' | xargs -L1 git mv
So, when I run the former, it'll exit right away, nothing happens.
The eval command exists for this very purpose.
eval "$( ls | sed... )"
More from the bash manual:
eval [arguments]
The arguments are concatenated together into a single command, which is then read and executed, and its exit status returned as the exit status of eval. If there are no arguments or only empty arguments, the return status is zero.
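A toy example of why this differs from piping into sh: the generated command runs in the current shell, so side effects such as variable assignments persist (FOO is just a made-up name here):
eval "$( echo FOO=bar )"
echo "$FOO"    # prints bar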
$ ls | sed ... | source /dev/stdin
UPDATE: This works in bash 4.0, as well as tcsh, and dash (if you change source to .). Apparently this was buggy in bash 3.2. From the bash 4.0 release notes:
Fixed a bug that caused `.' to fail to read and execute commands from non-regular files such as devices or named pipes.
Try using process substitution, which replaces output of a command with a temporary file which can then be sourced:
source <(echo id)
Wow, I know this is an old question, but I've found myself with the same exact problem recently (that's how I got here).
Anyway - I don't like the source /dev/stdin answer, but I think I found a better one. It's deceptively simple actually:
echo ls -la | xargs xargs
Nice, right? Actually, this still doesn't do what you want, because if you have multiple lines it will concatenate them into a single command instead of running each command separately. So the solution I found is:
ls | ... | xargs -L 1 xargs
the -L 1 option means you use (at most) 1 line per command execution. Note: if your line ends with a trailing space, it will be concatenated with the next line! So make sure each line ends with a non-space.
Finally, you can do
ls | ... | xargs -L 1 xargs -t
to see what commands are executed (-t is verbose).
Hope someone reads this!
`ls | sed ...`
I sort of feel like ls | sed ... | source - would be prettier, but unfortunately source doesn't understand - to mean stdin.
I believe this is "the right answer" to the question:
ls | sed ... | while read line; do $line; done
That is, one can pipe into a while loop; the read command takes one line from its stdin and assigns it to the variable $line. $line then becomes the command executed within the loop, and this continues until there are no further lines in the input.
This still won't work with some control structures (like another loop), but it fits the bill in this case.
To use mark4o's solution on bash 3.2 (macOS), a here string can be used instead of a pipeline, as in this example:
. /dev/stdin <<< "$(grep '^alias' ~/.profile)"
I think your solution is command substitution with backticks: http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_03_04.html
See section 3.4.5
Why not use source then?
$ ls | sed ... > out.sh ; source out.sh

Use lines in a file as filenames for grep?

I have a file which contains filenames (and the full path to them) and I want to search for a word within all of them.
some pseudo-code to explain:
grep keyword <all files specified in files.txt>
or
cat files.txt > grep keyword
cat files.txt | grep keyword
the problem is that I can only get grep to search the filenames, not the contents of the actual files.
cat files.txt | xargs grep keyword
or
grep keyword `cat files.txt`
or (equivalent to previous but harder to mis-read)
grep keyword $(cat files.txt)
should do the trick.
Pitfalls:
If files.txt contains file names with spaces, either solution will malfunction, because "This is a filename.txt" will be interpreted as four files, "This", "is", "a", and "filename.txt". A good reason why you shouldn't have spaces in your filenames, ever.
There are ways around this, but none of them is trivial. (find ... -print0 / xargs -0 is one of them.)
The second (cat) version can result in a very long command line (which might fail when exceeding the limits of your environment). The first (xargs) version handles long input automatically; xargs offers several options to control the details.
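If you have GNU xargs, a middle-ground sketch is to split on newlines only, which copes with spaces (and quotes) in the names, though not with embedded newlines:
xargs -d '\n' grep keyword < files.txt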
Both of the answers from DevSolar work (tested on Linux Ubuntu), but the xargs version is preferable if there may be many files, since it will avoid running into command line length limits.
so:
cat files.txt | xargs grep keyword
is the way to go
tr '\n' '\0' <files.txt | LANG=C xargs -r0 grep -F keyword
tr delimits the names with the NUL character so that spaces are not significant (note the corresponding -0 option to xargs).
xargs -r will start a single grep process for a "large" number of files, but not start any grep process if there are no files.
LANG=C means use quick routines for matching, rather than slow locale ones
grep -F means use quick string matching rather than slow regular expression matching
bash, ksh & zsh version:
grep keyword $(<files.txt)
It has been a long time since I last wrote a bash shell script, but you could store the result of the first grep (the one finding all the filenames) in an array and iterate over it, issuing further grep commands.
A good starting point should be the bash scripting guide.
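A rough sketch of that idea applied directly to files.txt (assuming bash 4+ for mapfile; names containing newlines would still break it):
mapfile -t files < files.txt
for f in "${files[@]}"; do
grep -H keyword -- "$f"
done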
