Using -exec on output from bashrc function - bash

In Cygwin, I wrote a bashrc function rg, which is basically a recursive grep statement:
rg () {
find . -type f -exec grep -il $1 {} \;
}
This works well, but now I need to run an additional grep on each output line to check for another word. I basically want it to list each file that contains both words (not necessarily on the same line). I tried this command:
rg Word1 -exec grep -il Word2 {} \;
This seems to just output the same files containing "Word1".
I tried this next command, and I thought it'd just output "Ha" on each line, but it still keeps listing the files from the "rg Word1" statement.
rg Word1 -exec echo "Ha" \;
So I'm clearly doing something wrong here. Can anyone clear up my confusion? I'm aware there's a way to do this within grep itself, but that seems to work on a per-line basis. I'm guessing what I'm trying to do is pretty common. Also, once I get this working, I'd like to put it in another bashrc function for convenience. Not sure if that makes it more complicated or not.

If you have it or can get it, the xargs command collects arguments for a command from its standard input, then runs the designated command. You could combine that with your rg function to filter the function's output. For example,
rg Word1 | xargs grep -il Word2
The arguments xargs reads from its standard input -- the file names emitted by function rg -- will be appended to the given command (grep -il Word2), and the resulting command run. If xargs's input is long enough, the arguments will be split across multiple invocations of the grep command, which makes no difference to the output (in this case), but avoids command execution failing on account of too many arguments.
Consider also structuring function rg in the same way (i.e. using xargs) to minimize the number of separate grep processes that are executed. Starting a new process is one of the more expensive things you can do.
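For example, a minimal sketch of both functions rebuilt on xargs (untested; it assumes the GNU find, xargs, and grep that Cygwin ships, for the -print0/-0/-Z null-delimiter options):
rg () {
# -print0/-0 keep file names NUL-delimited so spaces survive the pipe
find . -type f -print0 | xargs -0 grep -il "$1"
}
rg2 () {
# grep -Z emits NUL-terminated names for the second xargs;
# -r skips the second grep entirely when nothing matched the first word
find . -type f -print0 | xargs -0 grep -ilZ "$1" | xargs -0 -r grep -il "$2"
}
With these, rg2 Word1 Word2 lists the files that contain both words.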

Related

Change text in argument for xargs (or GNU Parallel)

I have a program that I can run in two ways: single-end or paired-end mode. Here's the syntax:
program <output-directory-name> <input1> [input2]
Where the output directory and at least one input is required. If I wanted to run this on three files, say, sample A, B, and C, I would use something like find with xargs or parallel:
user@host:~/single$ ls
sampleA.txt sampleB.txt sampleC.txt
user@host:~/single$ find . -name "sample*" | xargs -i echo program {}-out {}
program ./sampleA.txt-out ./sampleA.txt
program ./sampleB.txt-out ./sampleB.txt
program ./sampleC.txt-out ./sampleC.txt
user@host:~/single$ find . -name "sample*" | parallel --dry-run program {}-out {}
program ./sampleA.txt-out ./sampleA.txt
program ./sampleB.txt-out ./sampleB.txt
program ./sampleC.txt-out ./sampleC.txt
But when I want to run the program in "paired-end" mode, I need to give it two inputs. These are related files, but they can't simply be concatenated - you have to run the program with both as inputs. Files are named sensibly, e.g., sampleA_1.txt and sampleA_2.txt.
I want to be able to create this easily on the command line with something like xargs (or preferably parallel):
user@host:~/paired$ ls
sampleA_1.txt sampleB_1.txt sampleC_1.txt
sampleA_2.txt sampleB_2.txt sampleC_2.txt
user@host:~/paired$ find . -name "sample*_1.txt" | sed/awk? | parallel ?
program ./sampleA-out ./sampleA_1.txt ./sampleA_2.txt
program ./sampleB-out ./sampleB_1.txt ./sampleB_2.txt
program ./sampleC-out ./sampleC_1.txt ./sampleC_2.txt
Ideally, the command would strip off the _1.txt to create the output directory name (sampleA-out, etc), but I really need to be able to take that argument and change the _1 to a _2 for the second input.
I know this is dead simple with a script - I did this in Perl with a quick regular expression substitution. But I would love to be able to do this with a quick one-liner.
Thanks in advance.
I did this in Perl with a quick regular expression substitution. But I would love to be able to do this with a quick one-liner.
Perl has one-liners, too, just as sed and awk do. You can write:
find . -name "sample*_1.txt" | perl -pe 's/_1\.txt$//' | parallel program {}-out {}_1.txt {}_2.txt
(The -e flag means "the next argument is the program text"; the -p flag means "run the program in a loop: for each line of input, set $_ to that line, run the program, then print $_".)
With sed and xargs you could do something like this:
find . -name "sample*_1.txt" | sed -n 's/_1\..*$//;h;s/$/_out/p;g;s/$/_1.txt/p;g;s/$/_2.txt/p' | xargs -L 3 echo program
I.e.: sed creates the three arguments and xargs -L 3 composes commands lines with three arguments.
Assuming you always have exactly 2 files in your directory for each pair, and assuming find emits them in the right order (which you can ensure by piping the results of find through sort), xargs -L 2 might do the job on its own: it tells xargs to place 2 consecutive incoming parameters on each command line it executes.
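For instance, a sketch of that pairing (echo stands in for program here, and the output-directory argument is still missing):
find . -name "sample*_[12].txt" | sort | xargs -L 2 echo program
This prints lines like program ./sampleA_1.txt ./sampleA_2.txt.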
A shorter version:
parallel --xapply program {1.}.out {1} {2} :::: <(ls *_1.txt) <(ls *_2.txt)
but this only works if every _1.txt has a matching _2.txt and vice versa.
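If a strict one-liner turns out not to be required after all, the same pairing also falls out of plain shell parameter expansion (a sketch, assuming every _1.txt really has a matching _2.txt):
for f1 in sample*_1.txt; do
base=${f1%_1.txt}    # strip the _1.txt suffix
program "${base}-out" "$f1" "${base}_2.txt"
done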

Commandline find, sed, exec

I have a bunch of files in a folder, in subfolders and I'm trying to make some kind of one-liner for quick copy/pasting once in a while.
The contents is (too long to paste here): http://pastebin.com/4aZCPbwT
I've tried the following commands:
List all files and their directories
find . -name '[!.]*'
Replace all instances of "Namespace" with "Test":
find . -name '[!.]*' -print0 | sed 's/Namespace/Test/gI' | xargs -i -0 echo '{}'
What I need to do is:
Replace folder names as above, and copy the folders (including files) to another location. Create the folders if they don't exist (they most likely won't) - BUT, there are some of them that I don't need, like ./app, as this folder exists. I could use -wholename './app' for that.
When they are copied, I need to replace some text inside each file, same as above (Namespace with Test - also occours inside the files and save them of course).
Something like this I would imagine:
-print -exec sed -i 's/Namespace/Test/gI' {} \;
Can these 3 things be done in a one-liner? Replace text in files (Namespace <=> Test), copy files including their directories with cp -p (I don't want to overwrite folders), renaming each directory/file as above (Namespace <=> Test).
Thanks a lot :-)
Besides describing the how with painstaking verbosity below, this method may also be unique in that it incorporates built-in debugging. It basically doesn't do anything at all as written except compile and save to a variable all commands it believes it should do in order to perform the work requested.
It also explicitly avoids loops as much as possible. Besides sed's recursive search for more than one match of the pattern, there is no other recursion as far as I know.
And last, this is entirely null delimited - it doesn't trip on any character in any filename except the null, and I don't think you should have that in a filename anyway.
By the way, this is REALLY fast. Look:
% _mvnfind() { mv -n "${1}" "${2}" && cd "${2}"
> read -r SED <<SED
> :;s|${3}\(.*/[^/]*${5}\)|${4}\1|;t;:;s|\(${5}.*\)${3}|\1${4}|;t;s|^[0-9]*\(.*\)${5}|\1|p
> SED
> find . -name "*${3}*" -printf "%d\tmv %P ${5} %P\000" |
> sort -zg | sed -nz ${SED} | read -r ${6}
> cat <<EOF
> Prepared commands saved in variable: ${6}
> To view do: printf ${6} | tr "\000" "\n"
> To run do: sh <<EORUN
> $(printf ${6} | tr "\000" "\n")
> EORUN
> EOF
> }
% rm -rf "${UNNECESSARY:=/any/dirs/you/dont/want/moved}"
% time ( _mvnfind ${SRC=./test_tree} ${TGT=./mv_tree} \
> ${OLD=google} ${NEW=replacement_word} ${sed_sep=SsEeDd} \
> ${sh_io:=sh_io} ; printf %b\\000 "${sh_io}" | tr "\000" "\n" \
> | wc - ; echo ${sh_io} | tr "\000" "\n" | tail -n 2 )
<actual process time used:>
0.06s user 0.03s system 106% cpu 0.090 total
<output from wc:>
Lines Words Bytes
115 362 20691 -
<output from tail:>
mv .config/replacement_word-chrome-beta/Default/.../googlestars \
.config/replacement_word-chrome-beta/Default/.../replacement_wordstars
NOTE: The above function will likely require GNU versions of sed and find to properly handle find's -printf and sed's -z flag and the :label;s///;t recursive branching. If these are not available to you, the functionality can likely be duplicated with a few minor adjustments.
This should do everything you wanted from start to finish with very little fuss. I did fork with sed, but I was also practicing some sed recursive branching techniques so that's why I'm here. It's kind of like getting a discount haircut at a barber school, I guess. Here's the workflow:
rm -rf ${UNNECESSARY}
I intentionally left out any functional call that might delete or destroy data of any kind. You mention that ./app might be unwanted. Delete it or move it elsewhere beforehand, or, alternatively, you could build in a \( -path PATTERN -exec rm -rf \{\} \) routine to find to do it programmatically, but that one's all yours.
_mvnfind "${@}"
Declare its arguments and call the worker function. ${sh_io} is especially important in that it saves the return from the function. ${sed_sep} comes in a close second; this is an arbitrary string used to reference sed's recursion in the function. If ${sed_sep} is set to a value that could potentially be found in any of your path- or file-names acted upon... well, just don't let it be.
mv -n $1 $2
The whole tree is moved from the beginning. It will save a lot of headache; believe me. The rest of what you want to do - the renaming - is simply a matter of filesystem metadata. If you were, for instance, moving this from one drive to another, or across filesystem boundaries of any kind, you're better off doing so at once with one command. It's also safer. Note the -n (--no-clobber) option set for mv; as written, this function will not put ${SRC_DIR} where a ${TGT_DIR} already exists.
read -r SED <<SED
I located all of sed's commands here to save on escaping hassles and read them into a variable to feed to sed below. Explanation below.
find . -name "*${OLD}*" -printf
We begin the find process. With find we search only for anything that needs renaming because we already did all of the place-to-place mv operations with the function's first command. Rather than take any direct action with find, like an exec call, for instance, we instead use it to build out the command-line dynamically with -printf.
%dir-depth :tab: 'mv '%path-to-${SRC}' '${sed_sep}'%path-again :null delimiter:'
After find locates the files we need it directly builds and prints out (most) of the command we'll need to process your renaming. The %dir-depth tacked onto the beginning of each line will help to ensure we're not trying to rename a file or directory in the tree with a parent object that has yet to be renamed. find uses all sorts of optimization techniques to walk your filesystem tree and it is not a sure thing that it will return the data we need in a safe-for-operations order. This is why we next...
sort -general-numerical -zero-delimited
We sort all of find's output based on %directory-depth so that the paths nearest in relationship to ${SRC} are worked first. This avoids possible errors involving mving files into non-existent locations, and it minimizes the need for recursive looping. (In fact, you might be hard-pressed to find a loop at all.)
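A standalone sketch of that depth-sort idea (GNU find and sort assumed; tr is only there to make the NUL-delimited output readable):
find . -mindepth 1 -printf '%d\t%P\0' | sort -zn | tr '\0' '\n'
Shallow paths print before deep ones, so every parent is handled before its children.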
sed -ex :rcrs;srch|(save${sep}*til)${OLD}|\saved${SUBSTNEW}|;til ${OLD=0}
I think this is the only loop in the whole script, and it only loops over the second %Path printed for each string in case it contains more than one ${OLD} value that might need replacing. All other solutions I imagined involved a second sed process, and while a short loop may not be desirable, certainly it beats spawning and forking an entire process.
So basically what sed does here is search for ${sed_sep}, then, having found it, saves it and all characters it encounters until it finds ${OLD}, which it then replaces with ${NEW}. It then heads back to ${sed_sep} and looks again for ${OLD}, in case it occurs more than once in the string. If it is not found, it prints the modified string to stdout (which it then catches again next) and ends the loop.
This avoids having to parse the entire string, and ensures that the first half of the mv command string, which needs to include ${OLD} of course, does include it, and the second half is altered as many times as is necessary to wipe the ${OLD} name from mv's destination path.
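As a standalone illustration of that :label;s///;t branching idiom (not the function's exact script), the following keeps substituting until no 'old' remains after the 'SEP' marker, while anything before the marker is left alone:
echo 'old_src SEP old_dir/old_file' | sed -e :a -e 's|\(SEP.*\)old|\1new|' -e ta
The t command branches back to the :a label only while a substitution succeeded, so this prints old_src SEP new_dir/new_file and stops.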
sed -ex...-ex search|%dir_depth(save*)${sed_sep}|(only_saved)|out
The two -e expressions here run without a second fork. In the first, as we've seen, we modify the mv command as supplied by find's -printf command as necessary to properly alter all references of ${OLD} to ${NEW}, but in order to do so we had to use some arbitrary reference points which should not be included in the final output. So once sed finishes all it needs to do, we instruct it to wipe out its reference points from the hold-buffer before passing it along.
AND NOW WE'RE BACK AROUND
read will receive a command that looks like this:
% mv /path2/$SRC/$OLD_DIR/$OLD_FILE /same/path_w/$NEW_DIR/$NEW_FILE \000
read will store it in the variable named by ${6} (here, sh_io), which can be examined at will outside of the function.
Cool.
-Mike
I haven't tested this, but I think it's what you're after.
find . -name '[!.]*' -type f -print | while IFS= read -r line; do nfile=$(echo "$line" | sed 's/Namespace/Test/gI'); mkdir -p "$(dirname "$nfile")"; cp -p "$line" "$nfile"; sed -i 's/Namespace/Test/gI' "$nfile"; done

To understand xargs better

I want to understand the use of xargs man in Rampion's code:
screen -t man /bin/sh -c 'xargs man || read'
Thanks to Rampion: we do not need cat!
Why do we need xargs in the command?
I understand the xargs part as follows:
cat nothing to xargs
xargs makes a list of man commands
I have had an idea that xargs makes a list of commands. For instance,
find . -type f -print0 | xargs -0 grep masi
is the same as a list of commands:
find fileA AND grep masi in it
find fileB AND grep masi in it
and so on for fileC, fileD, ...
No, I don't cat nothing. I cat whatever input I get after I run the command. cat is actually extraneous here, so let's ignore it.
xargs man waits on user input. Which is necessary. Since in the script you grabbed that from, I can't paste in the argument for man until after I create the window. So the command that runs in the window needs to wait for me to give it something, before it tries to run man.
If we just ran screen /bin/sh -c 'man || read', it would always complain "What manual page do you want?" since we never told it.
xargs gathers arguments from stdin and executes the command given with those arguments.
So cat is waiting for something to be typed, and then xargs runs man with that input.
xargs is useful if you have a lot of files to process; I often use it with output from find.
xargs will stuff as many arguments as it can onto the command line.
It's great for doing something like
find . -name '*.o' -print | xargs rm
The cat command does not operate on nothing; it operates on standard input, up until it is told that the input is ended. As Rampion notes, the cat command is not necessary here, but it is operating on its implicit input (standard input), not on nothing.
The xargs command reads the output from cat, and groups the information into arguments to the man command specified as its (only) argument. When it reaches a limit (configurable on the command line), it will execute the man command.
The find ... -print0 | xargs -0 ... idiom deals with file names that contain awkward characters such as blanks, tabs and newlines. The find command prints each filename followed by an ASCII NUL ('\0'); this is one of two characters that cannot appear in a simple file name - the other being '/' (which appears in path names, of course, but not in simple file names). It is not directly equivalent to the sequence you provide; xargs groups collections of file names into a single argument list, up to a size limit. If the names are short enough (they usually are), then there will be fewer executions of grep than there are file names.
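A quick way to see that grouping is to put echo in front of the command; each output line then corresponds to one grep invocation:
find . -type f -print0 | xargs -0 echo grep masi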
Note, too, that grep only prefixes its output with the file name where the material is found if it has more than one file to search -- or if it supports an option to always print file names and that option is used: '-H' is a GNU extension to grep that does this. The portable way to ensure that the file names always appear is to list /dev/null as the first file (so 'xargs grep something /dev/null'); it doesn't take long to search /dev/null.
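For example, with the /dev/null trick in place, every matched line carries its file name even if xargs happens to pass only one real file to some invocation:
find . -type f -print0 | xargs -0 grep masi /dev/null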

How can I process a list of files that includes spaces in its names in Unix?

I'm trying to list the files in a directory and do something to them in the Mac OS X prompt.
It should go like this: for f in $(ls -1); do echo $f; done
If I have files without spaces in their names (fileA.txt, fileB.txt), the echo works fine.
If the files include spaces in their names ("file A.txt", "file B.txt"), I get 4 strings (file, A.txt, file, B.txt).
I've tried quoting the command substitution, but it only changed the problem.
If I do this: for f in "$(ls -1)"; do echo "$f"; done
I get: file A.txt\nfile B.txt
(It displays correctly, but it is a single string, and I need the 2 lines separated.)
Step away from ls if at all possible. Use find from the findutils package.
find /target/path -type f -print0 | xargs -0 your_command_here
-print0 will cause find to output the names separated by NUL characters (ASCII zero). The -0 argument to xargs tells it to expect the arguments separated by NUL characters too, so everything will work just fine.
Replace /target/path with the path under which your files are located.
-type f will only locate files. Use -type d for directories, or omit altogether to get both.
Replace your_command_here with the command you'll use to process the file names. (Note: If you run this from a shell using echo for your_command_here you'll get everything on one line - don't get confused by that shell artifact, xargs will do the expected right thing anyway.)
Edit: Alternatively (or if you don't have xargs), you can use the much less efficient
find /target/path -type f -exec your_command_here \{\} \;
\{\} \; is the escaped form of {} ;, where {} is the placeholder for the currently processed file. find will invoke your_command_here with {} replaced by the file name, and since your_command_here is launched by find and not by the shell, the spaces won't matter.
The second version will be less efficient, since find will launch a new process for each and every file found. xargs is smart enough to pass as many file names as it safely can to each process it launches. Prefer the xargs version if you have the choice.
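On find implementations that support it (GNU find, and anything conforming to POSIX 2008), terminating -exec with + instead of \; gives you the xargs-style batching without the pipe:
find /target/path -type f -exec your_command_here \{\} +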
for f in *; do echo "$f"; done
should do what you want. Why are you using ls instead of * ?
In general, dealing with spaces in shell is a PITA. Take a look at the $IFS variable, or better yet at Perl, Ruby, Python, etc.
Here's an answer using $IFS as discussed by derobert
http://www.cyberciti.biz/tips/handling-filenames-with-spaces-in-bash.html
You can pipe the arguments into read. For example, to cat all files in the directory:
ls -1 | while read -r FILENAME; do cat "$FILENAME"; done
This means you can still use ls, as you have in your question, or any other command that produces $IFS delimited output.
The while loop makes it much easier to do several things to the argument, and makes complex processing more readable in my opinion. A contrived example:
ls -1 | while read -r FILE
do
echo 1: "$FILE"
echo 2: "$FILE"
done
Look at ls's --quoting-style option.
For instance, --quoting-style=c would produce:
$ ls --quoting-style=c
"file1" "file2" "dir one"
Check out the manpage for xargs:
it works like this:
ls -1 /tmp/*.jpeg | xargs rm

How do you handle the "Too many files" problem when working in Bash?

I often have to work with directories containing hundreds of thousands of files, doing text matching, replacing and so on. If I go the standard route of, say
grep foo *
I get the too many files error message, so I end up doing
for i in *; do grep foo $i; done
or
find ../path/ | xargs -I{} grep foo "{}"
But these are less than optimal (they create a new grep process for each file).
This looks like more of a limitation in the size of the arguments programs can receive, because the * in the for loop works alright. But, in any case, what's the proper way to handle this?
PS: Don't tell me to do grep -r instead, I know about that, I'm thinking about tools that do not have a recursive option.
In newer versions of findutils, find can do the work of xargs (including the glomming behavior, such that only as many grep processes as needed are used):
find ../path -exec grep foo '{}' +
The use of + rather than ; as the last argument triggers this behavior.
If there is a risk of filenames containing spaces, you should remember to use the -print0 flag to find together with the -0 flag to xargs:
find . -print0 | xargs -0 grep -H foo
xargs does not start a new process for each file. It bunches together the arguments. Have a look at the -n option to xargs - it controls the number of arguments passed to each execution of the sub-command.
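For example (the batch size here is only illustrative):
find ../path -type f -print0 | xargs -0 -n 100 grep -H foo
runs one grep per 100 file names instead of one per file.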
I can't see that
for i in *; do
grep foo $i
done
would work since I thought the "too many files" was a shell limitation, hence it would fail for the for loop as well.
Having said that, I always let xargs do the grunt-work of splitting the argument list into manageable bits thus:
find ../path/ | xargs grep foo
It won't start a process per file but per group of files.
Well, I had the same problems, but it seems that everything I came up with has already been mentioned. Mostly, I had two problems: globbing is expensive, ls on a million-file directory takes forever (20+ minutes on one of my servers), and ls * on a million-file directory takes forever and then fails with an "argument list too long" error.
find /some -type f -exec some command {} \;
seems to help with both problems. Also, if you need to do more complex operations on these files, you might consider scripting your stuff with multiple threads. Here is a Python primer for scripting CLI stuff:
http://www.ibm.com/developerworks/aix/library/au-pythocli/?ca=dgr-lnxw06pythonunixtool&S_TACT=105AGX59&S_CMP=GR
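If the per-file work is independent, GNU xargs can also spread the batches across processes (a sketch; -P is a GNU extension):
find /some -type f -print0 | xargs -0 -P 4 -n 100 grep -l foo
This runs up to 4 greps at a time, each handed 100 file names.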
