Files with quotes, spaces causing bad behavior from xargs - bash

I want to find some files and calculate the shasum by using a pipe command.
find . -type f | xargs shasum
But there are files with quotes in my directory, for example a file named
file with "special" characters.txt
The pipe output looks like this:
user@home ~ $ find . -type f | xargs shasum
da39a3ee5e6b4b0d3255bfef95601890afd80709 ./empty1.txt
da39a3ee5e6b4b0d3255bfef95601890afd80709 ./empty2.txt
da39a3ee5e6b4b0d3255bfef95601890afd80709 ./empty3.txt
shasum: ./file:
shasum: with: No such file or directory
shasum: special: No such file or directory
shasum: characters.txt: No such file or directory
25ea78ccd362e1903c4a10201092edeb83912d78 ./file1.txt
25ea78ccd362e1903c4a10201092edeb83912d78 ./file2.txt
The quotes within the filename cause problems.
How can I tell shasum to process the files correctly?

The short explanation is that xargs is widely considered broken by design, unless you use extensions to the standard that disable its behavior of trying to parse and honor quoting and escaping in its input. See the xargs section of UsingFind for more details.
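To see that quote-parsing behavior in isolation, a quick sketch (printf stands in for shasum, so no files are needed) that feeds the problem name to xargs and shows the argument boundaries it produces:
printf '%s\n' 'file with "special" characters.txt' | xargs printf '<%s>\n'
<file>
<with>
<special>
<characters.txt>
xargs has stripped the quotes and split the single name into four separate arguments, which is exactly what shasum is complaining about above.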
Using NUL Delimited Streams
On a system with GNU or modern BSD extensions (including MacOS X), you can (and should) NUL-delimit the output from find:
find . -type f -print0 | xargs -0 shasum --
Using find -exec
That said, you can do even better by getting xargs out of the loop entirely in a way that's fully compliant with modern (~2006) POSIX:
find . -type f -exec shasum -- '{}' +
Note that the -- argument specifies to shasum that all future arguments are filenames. If you'd used find * -type f ..., then you could have a result starting with a dash; using -- ensures that this result isn't interpreted as a set of options.
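A small illustration of the difference, assuming a throwaway scratch directory (shasum's exact error text may vary):
touch -- '-r.txt' 'normal.txt'
shasum *        # the leading dash makes shasum try to parse -r.txt as an option
shasum -- *     # with --, both names are hashed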
Using Newline Delimiters (And Security Risks Thereof)
If you have GNU xargs, but don't have the option of a NUL-delimited input stream, then xargs -d $'\n' (in shells such as bash with ksh extensions) will avoid the quoting and escaping behavior:
xargs -d $'\n' shasum -- <files.txt
However, this is suboptimal, because newline literals are actually possible inside filenames, thus making it impossible to distinguish between a newline that separates two names and a newline that is part of an actual name. Consider the following scenario:
mkdir -p ./file.txt$'\n'/etc/passwd$'\n'/
touch ./file.txt$'\n'/etc/passwd$'\n'file.txt file.txt
find . -type f | xargs -d $'\n' shasum --
This will have output akin to the following:
da39a3ee5e6b4b0d3255bfef95601890afd80709 ./file.txt
da39a3ee5e6b4b0d3255bfef95601890afd80709 ./file.txt
c0c71bac843a3ec7233e99e123888beb6da8fbcf /etc/passwd
da39a3ee5e6b4b0d3255bfef95601890afd80709 file.txt
...thus allowing an attacker who can control filenames to cause a shasum for an arbitrary file outside the intended directory structure to be added to your output.
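If you would rather handle the names in the shell itself (say, to do more than run a single command per file), a NUL-delimited read loop sidesteps both xargs and the newline problem. A sketch, assuming bash:
while IFS= read -r -d '' file; do
    shasum -- "$file"
done < <(find . -type f -print0)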

Related

execute command on files returned by grep

Say I want to edit every .html file in a directory one after the other using vim. I can do this with:
find . -name "*.html" -exec vim {} \;
But what if I only want to edit every html file containing a certain string, one after the other? I use grep to find files containing those strings, but how can I pipe each one to vim, similar to the find command? Perhaps I should use something other than grep, or somehow pipe the find command to grep and then exec vim. Does anyone know how to edit files containing a certain string one after the other, in the same fashion as the find command example I give above?
grep -l 'certain string' *.html | xargs vim
This assumes you don't have eccentric file names with spaces etc in them. If you have to deal with eccentric file names, check whether your grep has a -z option to terminate output lines with null bytes (and xargs has a -0 option to read such inputs), and if so, then:
grep -zl 'certain string' *.html | xargs -0 vim
If you need to search subdirectories, maybe your version of Bash has support for **:
grep -zl 'certain string' **/*.html | xargs -0 vim
Note: these commands run vim on batches of files. If you must run it once per file, then you need to use -n 1 as extra options to xargs before you mention vim. If you have GNU xargs, you can use -r to prevent it running vim when there are no file names in its input (none of the files scanned by grep contain the 'certain string').
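Putting those options together, a sketch of the one-file-at-a-time variant with GNU xargs (you may also want -o / --open-tty, if your xargs has it, so vim still gets the terminal on standard input):
grep -zl 'certain string' *.html | xargs -0 -r -n 1 vim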
The variations can be continued as you invent new ways to confuse things.
With find:
find . -type f -name '*.html' -exec bash -c 'grep -q "yourtext" "${1}" && vim "${1}"' _ {} \;
For each file, this calls a bash command that greps the file for yourtext and opens it in vim if the text matches.
Solution with a for loop:
for i in $(find . -type f -name '*.html'); do vim $i; done
This should open each file in its own vim session, one after another as you close the previous one.
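The loop above word-splits the $(...) output, so it breaks on names containing spaces. A NUL-safe sketch that still opens the files one at a time (assuming bash; the extra file descriptor keeps vim's standard input attached to the terminal):
while IFS= read -r -d '' -u 3 i; do
    vim "$i"
done 3< <(find . -type f -name '*.html' -print0)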

When to use xargs when piping?

I am new to bash and I am trying to understand the use of xargs, which is still not clear to me. For example:
history | grep ls
Here I am searching for the command ls in my history. In this command, I did not use xargs and it worked fine.
find /etc -name "*.txt" | xargs ls -l
In this one, I had to use xargs, but I still cannot understand the difference, and I am not able to decide correctly when to use xargs and when not to.
xargs can be used when you need to take the output from one command and use it as an argument to another. In your first example, grep takes the data from standard input, rather than as an argument. So, xargs is not needed.
xargs takes data from standard input and executes a command. By default, the data is appended to the end of the command as an argument. It can be inserted anywhere however, using a placeholder for the input. The traditional placeholder is {}; using that, your example command might then be written as:
find /etc -name "*.txt" | xargs -I {} ls -l {}
If you have 3 text files in /etc you'll get a full directory listing of each. Of course, you could just as easily have written ls -l /etc/*.txt and saved the trouble.
Another example lets you rename those files, and requires the placeholder {} to be used twice.
find /etc -name "*.txt" | xargs -I {} mv {} {}.bak
These are both bad examples, and will break as soon as you have a filename containing whitespace. You can work around that by telling find to separate filenames with a null character.
find /etc -name "*.txt" -print0 | xargs -0 -I {} mv {} {}.bak
My personal opinion is that there are almost always alternatives to using xargs (such as the -exec argument to find) and you will be better served by learning those.
When you use piping without xargs, the actual data is fed into the next command. On the other hand, when using piping with xargs, the actual data is viewed as a parameter to the next command. To give a concrete example, say you have a folder with a.txt and b.txt. a.txt contains just a single line 'hello world!', and b.txt is just empty.
If you do
ls | grep txt
you would end up getting the output:
a.txt
b.txt
Yet, if you do
ls | xargs grep txt
you would get nothing since neither file a.txt nor b.txt contains the word txt.
If the command is
ls | xargs grep hello
you would get:
hello world!
That's because with xargs, the two filenames given by ls are passed to grep as arguments, rather than the actual content.
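A handy way to see exactly what command line xargs is about to run is to put echo in front of the real command (a quick sketch):
ls | xargs echo grep hello
grep hello a.txt b.txt
The second line is echo's output: the filenames have become arguments, which is what the real grep would then receive.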
Short answer: Avoid xargs for now. Return to xargs when you have written dozens or hundreds of scripts.
Commands can get their input from parameters (like rm bad_example) or from stdin (not just the y answering the prompt after rm -i is_this_bad_too, but also read answer). Other commands like grep and sed will look for parameters, and when the parameters don't name an input file, they fall back to reading stdin.
Your grep example works fine reading from stdin, nothing special needed.
Your ls needs the output of find as a parameter. xargs is just one way to turn things around. Use man xargs for more about xargs. Alternatives:
find /etc -name "*.txt" -exec ls -l {} \;
find /etc -name "*.txt" -ls
ls -l $(find /etc -name "*.txt" )
ls /etc/*.txt
First try to see which of these commands behaves best when you have a nasty filename with spaces.txt in /etc.
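You don't have to touch /etc to run that experiment; a sketch with a throwaway directory instead (the directory and file names here are made up):
mkdir -p /tmp/xargs-demo && cd /tmp/xargs-demo
touch 'nasty filename with spaces.txt' plain.txt
find . -name "*.txt" | xargs ls -l              # breaks: the spaced name is split into pieces
find . -name "*.txt" -exec ls -l {} \;          # handles the spaces
find . -name "*.txt" -print0 | xargs -0 ls -l   # also handles the spaces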
xargs(1) is dangerous (broken, exploitable, etc.) when reading non-NUL-delimited input.
If you're working with filenames, use find's -exec [command] {} + instead.
If you can get NUL-delimited output, use xargs -0.
GNU Parallel can do the same as xargs, but does not have the broken and exploitable "features".
You can learn GNU Parallel by looking at examples http://www.gnu.org/software/parallel/man.html#EXAMPLE:-Working-as-xargs--n1.-Argument-appending and walking through the tutorial http://www.gnu.org/software/parallel/parallel_tutorial.html
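For instance, a sketch of the earlier find example rewritten with GNU Parallel (it must be installed separately; like xargs -0, it is safest with NUL-delimited input):
find /etc -name "*.txt" -print0 | parallel -0 ls -l {}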

combine find and grep into a single command

How to combine below two into one line without changing the first one?
# find / -name sshd_config -print
# grep -I <sshd_config path> permitrootlogin
I came up with the following, but don't know whether it gives the correct result in different cases:
cat `find / -name sshd_config -print` |grep permitrootlogin
Don't do cat $(...) [$() is the modern replacement for backticks] -- that doesn't work reliably if your filenames contain special characters (spaces, wildcards, etc).
Instead, tell find to invoke cat for you, with as many filenames passed to each cat invocation as possible:
find / -name sshd_config -exec cat -- '{}' + | grep permitrootlogin
...or, even better, ignore cat altogether and just pass the filenames to grep literally:
find / -name sshd_config -exec grep -h -e permitrootlogin -- /dev/null '{}' +
Replace the -h with -H if you want filenames to be shown.
You could do something like this:
find / -name "somefilename" -print0 | xargs -0 grep "something"
The xargs command transforms the stdout of find into arguments on grep's command line.
I guess what you want is to use the output of find / -name sshd_config -print (which should be the path of the sshd_config file) as the second argument to grep (so that the sshd_config file gets searched for your string).
There are several ways to achieve this.
Commands in back-quotes (`) are replaced by their output. So
grep permitrootlogin `find / -name sshd_config -print`
will be replaced by
grep permitrootlogin /path/to/the/sshd_config
which will search /path/to/the/sshd_config for permitrootlogin.
The same happens with
grep permitrootlogin $(find / -name sshd_config -print)
As another answer already mentions, this syntax has some advantages over the back-ticks. Namely, it can be nested.
However, this still runs into a problem when the path where the file is found contains spaces. As both backticks and $(...) just perform text substitution, such a path would be passed as several arguments to grep, each probably being an invalid path. (/path/to the/sshd_config would become /path/to and the/sshd_config.)
Rather than working around this with fancy quoting and escaping strategies, remember that UNIX commands were already designed for being used in combination, usually by pipes. Indeed find has a -print0 action which will separate paths of found files by \0, so that they can be distinguished from paths containing whitespace. Alas, grep can't process a zero-delimited list of files and still wants the files to search as invocation arguments, not on stdin.
This is where xargs comes into play. It applies stuff it gets on stdin as arguments to other commands. And with its -0 option, it interprets stdin as a zero-delimited list instead of treating whitespace as delimiters.
So
find / -name sshd_config -print0 | xargs -0 grep permitrootlogin
should have you covered.
| is a pipe, which means that the standard output of cat `find / -name sshd_config -print` will go to the standard input of grep permitrootlogin, so you just have to be sure what the output of the first command is.

how to grep large number of files?

I am trying to grep 40k files in the current directory and I am getting this error.
for i in $(cat A01/genes.txt); do grep $i *.kaks; done > A01/A01.result.txt
-bash: /usr/bin/grep: Argument list too long
How do one normally grep thousands of files?
Thanks
Upendra
This makes David sad...
Everyone so far is wrong (except for anubhava).
Shell scripting is not like any other programming language because much of the interpretation of lines comes from the power of the shell interpolating them before the command is actually executed.
Let's take something simple:
$ set -x
$ ls
+ ls
bar.txt foo.txt fubar.log
$ echo The text files are *.txt
+ echo The text files are bar.txt foo.txt
The text files are bar.txt foo.txt
$ set +x
$
The set -x allows you to see how the shell actually interpolates the glob and then passes that back to the command as input. The line prefixed with + shows the command as it is actually executed, after expansion.
You can see that the echo command isn't interpreting the *. Instead, the shell grabs the * and replaces it with the names of the matching files. Then and only then does the echo command actually execute.
When you have 40K plus files, and you do grep *, you're expanding that * to the names of those 40,000 plus files before grep even has a chance to execute, and that's where the error message /usr/bin/grep: Argument list too long is coming from.
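The limit itself comes from the operating system, and you can ask what it is (a quick check; the number varies by system):
getconf ARG_MAX    # maximum combined size of arguments plus environment, in bytes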
Fortunately, Unix has a way around this dilemma:
$ find . -name "*.kaks" -type f -maxdepth 1 | xargs grep -f A01/genes.txt
The find . -name "*.kaks" -type f -maxdepth 1 will find all of your *.kaks files, and the -maxdepth 1 will only include files in the current directory. The -type f makes sure you only pick up files and not a directory.
The find command pipes the names of the files into xargs, and xargs appends those names to the grep -f A01/genes.txt command. However, xargs has a trick up its sleeve. It knows how long the command line buffer is, and will execute the grep when the command line buffer is full, then pass another batch of files to grep. This way, grep gets executed maybe three or ten times (depending upon the size of the command line buffer), and all of our files are used.
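You can watch that batching happen with a toy example that has nothing to do with files (a quick sketch):
seq 10 | xargs -n 4 echo
1 2 3 4
5 6 7 8
9 10
Here xargs runs echo three times, handing it at most four arguments per invocation, just as it would hand grep as many filenames as fit.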
Unfortunately, xargs uses whitespace as a separator for the file names. If your files contain spaces or tabs, you'll have trouble with xargs. Fortunately, there's another fix:
$ find . -name "*.kaks" -type f -maxdepth 1 -print0 | xargs -0 grep -f A01/genes.txt
The -print0 will cause find to print out the names of the files separated not by newlines, but by the NUL character. The -0 parameter tells xargs that the file separator isn't whitespace, but the NUL character. This fixes the issue.
You could also do this too:
$ find . -name "*.kaks" -type f -maxdepth 1 -exec grep -f A01/genes.txt {} \;
This will execute grep once for each and every file found, instead of doing what xargs does and running grep once per batch of files it can stuff onto the command line. The advantage of this is that it avoids shell interference entirely. However, it may or may not be less efficient.
What would be interesting is to experiment and see which one is more efficient. You can use time to see:
$ time find . -name "*.kaks" -type f -maxdepth 1 -exec grep -f A01/genes.txt {} \;
This will execute the command and then tell you how long it took. Try it with the -exec and with xargs and see which is faster. Let us know what you find.
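If your find supports the + terminator (POSIX since 2006), you can add the batched -exec form to the comparison as well; it groups files per invocation much like xargs does, without the pipe:
$ time find . -name "*.kaks" -type f -maxdepth 1 -exec grep -f A01/genes.txt {} +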
You can combine find with grep like this:
find . -maxdepth 1 -name '*.kaks' -exec grep -H -f A01/genes.txt '{}' \; > A01/A01.result.txt
You can use the recursive feature of grep:
for i in $(cat A01/genes.txt); do
    grep -r "$i" .
done > A01/A01.result.txt
though if you want to select only kaks files:
for i in $(cat A01/genes.txt); do
    find . -iregex '.*\.kaks$' -exec grep "$i" {} \;
done > A01/A01.result.txt
Put another for loop inside your outer one:
for f in *.kaks; do
    grep -H "$i" "$f"
done
By the way, are you interested in finding EVERY occurrence in each file, or merely whether the search string exists in there one or more times? If it is "good enough" to know the string occurs in there one or more times, you can specify "-m 1" to grep and it will not bother reading/searching the rest of the file after finding the first match, which could potentially save lots of time.
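For example, the inner grep in the loop above could become (a sketch of that option in use):
grep -m 1 -H "$i" "$f"    # stop scanning this file after its first matching line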
The following solution has worked for me:
Problem:
grep -r "example\.com" *
-bash: /bin/grep: Argument list too long
Solution:
grep -r "example\.com" .
["In newer versions of grep you can omit the “.“, as the current directory is implied."]
Source:
Reinlick, J. https://www.saotn.org/bash-grep-through-large-number-files-argument-list-too-long/

Bash find filter and copy - trouble with spaces

So after a lot of searching and trying to interpret others' questions and answers to my needs, I decided to ask for myself.
I'm trying to take a directory structure full of images and place all the images (regardless of extension) in a single folder. In addition to this, I want to be able to remove images matching certain filenames in the process. I have a find command working that outputs all the filepaths for me
find -type f -exec file -i -- {} + | grep -i image | sed 's/\:.*//'
but if I try to use that to copy files, I have trouble with the spaces in the filenames.
cp `find -type f -exec file -i -- {} + | grep -i image | sed 's/\:.*//'` out/
What am I doing wrong, and is there a better way to do this?
With the caveat that it won't work if files have newlines in their names:
find . -type f -exec file -i -- {} + |
awk -vFS=: -vOFS=: '$NF ~ /image/{NF--;printf "%s\0", $0}' |
xargs -0 cp -t out/
(Based on the answer by Jonathan Leffler and the subsequent comment discussion with him and @devnull.)
The find command works well if none of the file names contain any newlines. Within broad limits, the grep command works OK under the same circumstances. The sed command works fine as long as there are no colons in the file names. However, given that there are spaces in the names, the use of $(...) (command substitution, also indicated by back-ticks `...`) is a disaster. Unfortunately, xargs isn't readily a part of the solution; it splits on spaces by default. Because you have to run file and grep in the middle, you can't easily use the -print0 option to (GNU) find and the -0 option to (GNU) xargs.
In some respects, it is crude, but in many ways, it is easiest if you write an executable shell script that can be invoked by find:
#!/bin/bash
for file in "$#"
do
if file -i -- "$file" | grep -i -q "$file:.*image"
then cp "$file" out/
fi
done
This is a little painful in that it invokes file and grep separately for each name, but it is reliable. The file command is even safe if the file name contains a newline; the grep is probably not.
If that script is called 'copyimage.sh', then the find command becomes:
find . -type f -exec ./copyimage.sh {} +
And, given the way the grep command is written, the copyimage.sh file won't be copied, even though its name contains the magic word 'image'.
Pipe the results of your find command to
xargs -l --replace cp "{}" out/
Example of how this works for me on Ubuntu 10.04:
atomic@atomic-desktop:~/temp$ ls
img.png img space.png
atomic@atomic-desktop:~/temp$ mkdir out
atomic@atomic-desktop:~/temp$ find -type f -exec file -i \{\} \; | grep -i image | sed 's/\:.*//' | xargs -l --replace cp -v "{}" out/
`./img.png' -> `out/img.png'
`./img space.png' -> `out/img space.png'
atomic@atomic-desktop:~/temp$ ls out
img.png img space.png
atomic@atomic-desktop:~/temp$
