concat a lot of files to stdout - bash

I have a large number of files (~100k) in a directory. I want to combine them and pipe them to standard output (I need that to upload them as one file elsewhere), but cat $(ls) fails with -bash: /bin/cat: Argument list too long. I know how to merge all those files into a temporary one, but can I avoid that?

For a start, cat $(ls) is not the right way to go about this - cat * would be more appropriate. If the number of files is too high, you can use find like this:
find . -exec cat {} +
This collects the results from find and passes them as arguments to cat, running as many separate cat instances as needed. It behaves much the same way as xargs, but doesn't require a separate process or any non-standard features like -print0, which is only supported in some versions of find.
find is recursive by default, so you can specify -maxdepth 1 to prevent this, if your version supports it. If there are other things in the directory, you can also filter by -type (but I guess there aren't, based on your original attempt).
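A minimal sketch of that suggestion, restricted to regular files in the current directory only (it assumes a find that supports -maxdepth, and uses a throwaway demo directory in place of your real one):

```shell
# Hypothetical demo directory; in practice run the find line in your own dir.
cd "$(mktemp -d)"
printf 'one\n' > f1
printf 'two\n' > f2
# Concatenate every regular file here (no recursion) to stdout;
# -exec ... + batches the names into as few cat invocations as possible.
find . -maxdepth 1 -type f -exec cat {} +
```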

find . -type f -print0 |xargs -0 cat
xargs will invoke cat several times, each time with as many arguments as it can fit on the command line (the combined length of the args can be no more than getconf ARG_MAX).
-print0 (separate file names with \0) for find, in combination with -0 (process file names separated by \0) for xargs, is just a good habit to follow, as it prevents the commands from breaking on file names that contain whitespace or other special characters.
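The batching behavior is easy to observe with a toy list and echo standing in for cat:

```shell
# With -n 2, five inputs are split into three echo invocations,
# so three lines are printed:
#   a b
#   c d
#   e
printf '%s\n' a b c d e | xargs -n 2 echo
```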

Related

Given a text file with file names, how can I find files in subdirectories of the current directory?

I have a bunch of files with different names in different subdirectories. I created a txt file with those names, but I cannot make find work using the file. I have seen posts on problems creating the list, and on not using find (though I do not understand the reason). Suggestions? It is difficult for me to come up with an example because I do not know how to reproduce the directory structure.
The following are the names of the files (just in case there is a formatting problem)
AO-169
AO-170
AO-171
The best that I came up with is:
cat ExtendedList.txt | xargs -I {} find . -name {}
It obviously dies in the first directory that it finds.
I also tried
ta="AO-169 AO-170 AO-171"
find . -name $ta
but it complains find: AO-170: unknown primary or operator
If you are trying to ask "how can I find files with any of these names in subdirectories of the current directory", the answer to that would look something like
xargs printf -- '-o\0-name\0%s\0' <ExtendedList.txt |
xargs -r0 find . -false
The -false is just a cute way to let the list of actual predicates start with "... or".
If the list of names in ExtendedList.txt is large, this could fail if the second xargs decides to break it up between -o and -name.
The option -0 is not portable, but should work e.g. on Linux or wherever you have GNU xargs.
If you can guarantee that the list of strings in ExtendedList.txt does not contain any characters which are problematic to the shell (like single quotes), you could simply say
sed "s/.*/-o -name '&'/" ExtendedList.txt |
xargs -r find . -false
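To see what either pipeline amounts to, here is the equivalent command line written out by hand for the three example names, run against a hypothetical throwaway layout:

```shell
# Hypothetical layout: one matching file, one non-matching file.
tmp=$(mktemp -d)
mkdir -p "$tmp/sub"
touch "$tmp/sub/AO-169" "$tmp/other.txt"
# This is the command line the xargs/sed pipelines above construct;
# it prints only the path ending in AO-169.
find "$tmp" -false -o -name 'AO-169' -o -name 'AO-170' -o -name 'AO-171'
```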

Find files in current directory, list differences from list within script

I am attempting to find differences for a directory and a list of files located in the bash script, for portability.
For example, search a directory with phpBB installed. Compare recursive directory listing to list of core installation files (excluding themes, uploads, etc). Display additional and missing files.
Thus far, I have attempted using diff, comm, and tr, getting "argument too long" errors. This is likely because the lists expand into file-name arguments, so the commands end up comparing the actual files rather than the lists themselves.
The file list in the script looks something like this (But I am willing to format differently):
./file.php
./file2.php
./dir/file.php
./dir/.file2.php
I am attempting to use one of the following to print the list:
find ./ -type f -printf "%P\n"
or
find ./ -type f -print
Then use any command you can think of to compare the results to the list of files inside the script.
The following are difficult to use, as there are often thousands of files to check; each version can change the listings, and it is a pain to update the whole script every time there is a new release.
find . ! -wholename './file.php' ! -wholename './file2.php'
find . ! -name './file.php' ! -name './file2.php'
find . ! -path './file.php' ! -path './file2.php'
With the lists being in different orders to accommodate any additional files, it can't be a straight comparison.
I'm just stumped. I greatly appreciate any advice or if I could be pointed in the right direction. Ask away for clarification!
You can use the -r option of the diff command to recursively compare the contents of the two directories. This way you don't need all the file names on the command line; just the two top-level directory names.
It will give you missing files, newly added files, and the difference of changed files. Many things can be controlled by different options.
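A quick sketch of that on two throwaway directories (hypothetical names):

```shell
# Two directories sharing one identical file; "actual" has one extra file.
tmp=$(mktemp -d)
mkdir "$tmp/expected" "$tmp/actual"
echo same > "$tmp/expected/common.php"
echo same > "$tmp/actual/common.php"
echo extra > "$tmp/actual/new.php"
# diff -r reports new.php as present only in "actual"; identical files
# are silent. (diff exits non-zero when differences are found.)
diff -r "$tmp/expected" "$tmp/actual" || true
```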
If you mean you have a list of expected files somewhere, and only one directory to be compared against it, then you can try using the tree command. The list can be first created using the tree command, and then at the time of comparison you can run the tree command again on the directory, and compare it with the stored "expected output" using the diff command.
Do you have to use coreutils? If so:
Put your list in a file, say list.txt, with one file path per line.
comm -23 <(find path/to/your/directory -type f | sort) \
<(sort path/to/list.txt) \
> diff.txt
diff.txt will have one line per file in path/to/your/directory that is not in your list.
If you care about files in your list that are not in path/to/your/directory, do comm -13 with the same parameters.
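In miniature, with hypothetical names standing in for the real listings:

```shell
cd "$(mktemp -d)"
# What's on disk, and the expected list; comm requires sorted input.
printf '%s\n' ./a.php ./b.php ./c.php | sort > found.txt
printf '%s\n' ./a.php ./c.php | sort > list.txt
# -23 suppresses lines unique to list.txt and lines common to both,
# leaving ./b.php: on disk but not in the list.
comm -23 found.txt list.txt
```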
Otherwise, you can also use sd (stream diff), which doesn't require sorting nor process substitution and supports infinite streams, like so:
find path/to/your/directory -type f | sd 'cat path/to/list.txt' > diff.txt
And just invert the streams to get the second requirement, like so:
cat path/to/list.txt | sd 'find path/to/your/directory -type f' > diff.txt
Probably not much of a benefit in this example other than succinctness, but still consider it; in some cases you won't be able to use comm, grep -F, or diff.
Here's a blogpost I wrote about diffing streams on the terminal, which introduces sd.

List files matching pattern when too many for bash globbing

I'd like to run the following:
ls /path/to/files/pattern*
and get
/path/to/files/pattern1
/path/to/files/pattern2
/path/to/files/pattern3
However, there are too many files matching the pattern in that directory, and I get
bash: /bin/ls: Argument list too long
What's a better way to do this? Maybe using the find command? I need to print out the full paths to the files.
This is where find in combination with xargs will help.
find /path/to/files -name "pattern*" -print0 | xargs -0 ls
Note from comments: xargs helps if you want to do something with the list once you have obtained it from find. If you only intend to list the files, find alone should suffice. However, if you wish to copy, delete, or perform any other action on the list, then piping to xargs (or using find's -exec) will help.
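If all you need is the paths, a sketch without xargs at all (on a hypothetical demo directory; add -maxdepth 1, where supported, to avoid recursing into subdirectories):

```shell
cd "$(mktemp -d)"          # stand-in for /path/to/files
touch pattern1 pattern2 unrelated
# find never builds one huge argument list, so ARG_MAX does not apply;
# it prints one full path per matching file.
find "$PWD" -maxdepth 1 -name 'pattern*'
```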

loop to move files not working

I am having one of those mornings where nothing goes to plan. I need to move files to a target directory in chunks of 1,000 at a time.
I wanted to loop through my files like so
for i in `find . -name '*XML'`
for((b=0; b<1000; b++))
do
mv $i targetdirect/
done
done
But I get a "-bash: syntax error near unexpected token `done:" error.
What I am missing??
The loop structure is a syntax error: the first for loop is missing its do before the second loop begins. Also, you should double-quote "$i".
What do you mean by moving 1000 files at a time? Something like this perhaps?
find . -name '*.XML' -print0 | xargs -r0 -n 1000 mv -t targetdirect
The -print0 and corresponding xargs -0 are a GNU extension to handle arbitrary file names. This works because the null character is an invalid character in file names on Unix; hence, it is safe to use as a delimiter between file names. For regularly named files (no quotes, no newlines etc in the file names) this may seem paranoid, but it is well-documented practice and a FAQ.
Your first for loop has no corresponding do (You have two done, but only one do.)
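If you'd rather stay in pure bash, here is a sketch of the batched move on a hypothetical layout; shell globbing itself is not subject to ARG_MAX, so the names can be collected in an array and handed to mv in slices:

```shell
# Hypothetical demo layout; use your real directory and batch=1000.
cd "$(mktemp -d)"
mkdir targetdirect
touch one.XML two.XML three.XML
files=(./*.XML)     # glob expansion has no ARG_MAX limit
batch=2
# Move the files batch-at-a-time; ${files[@]:i:batch} is a bash array slice.
for ((i = 0; i < ${#files[@]}; i += batch)); do
    mv -- "${files[@]:i:batch}" targetdirect/
done
```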

How do you handle the "Too many files" problem when working in Bash?

I often have to work with directories containing hundreds of thousands of files, doing text matching, replacing, and so on. If I go the standard route of, say
grep foo *
I get the too many files error message, so I end up doing
for i in *; do grep foo $i; done
or
find ../path/ | xargs -I{} grep foo "{}"
But these are less than optimal (they create a new grep process for each file).
This looks like more of a limitation in the size of the arguments programs can receive, because the * in the for loop works alright. But, in any case, what's the proper way to handle this?
PS: Don't tell me to do grep -r instead, I know about that, I'm thinking about tools that do not have a recursive option.
In newer versions of findutils, find can do the work of xargs (including the glomming behavior, such that only as many grep processes as needed are used):
find ../path -exec grep foo '{}' +
The use of + rather than ; as the last argument triggers this behavior.
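The difference between + and ; is visible if you substitute echo for grep, on a throwaway directory:

```shell
cd "$(mktemp -d)"
touch f1 f2 f3
# '+' batches: all three names go to one echo, printing a single line.
find . -type f -exec echo {} +
# ';' runs the command once per file: three invocations, three lines.
find . -type f -exec echo {} \;
```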
If there is a risk of filenames containing spaces, you should remember to use the -print0 flag to find together with the -0 flag to xargs:
find . -print0 | xargs -0 grep -H foo
xargs does not start a new process for each file. It bunches together the arguments. Have a look at the -n option to xargs - it controls the number of arguments passed to each execution of the sub-command.
I can't see that
for i in *; do
grep foo $i
done
would work, since I thought the "too many files" error was a shell limitation, hence it would fail for the for loop as well. (In fact, the limit is on the argument list handed to an executed program, not on the shell's own expansion, which is why the loop works.)
Having said that, I always let xargs do the grunt-work of splitting the argument list into manageable bits thus:
find ../path/ | xargs grep foo
It won't start a process per file but per group of files.
Well, I had the same problems, but it seems that everything I came up with has already been mentioned. Mostly, I had two problems: globbing is expensive, running ls in a directory with a million files takes forever (20+ minutes on one of my servers), and ls * in a directory with a million files fails with the "argument list too long" error.
find /some -type f -exec some command {} \;
seems to help with both problems. Also, if you need to do more complex operations on these files, you might consider scripting your stuff into multiple threads. Here is a Python primer for scripting CLI tasks:
http://www.ibm.com/developerworks/aix/library/au-pythocli/?ca=dgr-lnxw06pythonunixtool&S_TACT=105AGX59&S_CMP=GR
