How to exclude files from list - bash

I have ~100k files in a directory. I need to delete some of these files, excluding those that match a list of about 15k different patterns:
Directory: /20210111/
Example files:
/20210111/xxx_yyy_zzz.zip
/20210111/aaa_bbb_ccc.zip
/20210111/ddd_eee_fff.zip
...
Exclude.list
ddd
aaa
...
I tried with find:
find /20210111/ -type f -iname "*.zip" ! -iname "*$(cat Exclude.list)*" -exec ...
I get the error "arguments too long", because Exclude.list has a lot of lines.
How can I do that?

You can use grep to filter the output of find, then use xargs to process the resulting list.
find /20210111/ -type f -iname '*.zip' -print0 \
| grep -zvFf Exclude.list - \
| xargs -0 rm
The -print0, -z, and -0 are used to separate the filenames by the null byte, so filenames can contain any valid character (you can't store patterns containing literal newlines in your Exclude.list, anyway).
grep's -F interprets the patterns as fixed strings instead of regexes.
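Because rm is irreversible, it can help to preview the list before deleting. The sketch below recreates the example in a scratch directory. The mktemp location and the variable names to_delete and remaining are assumptions for illustration, and it assumes GNU grep (for -z) and GNU xargs (for -r):

```shell
# Work in a throwaway directory so nothing real is deleted.
demo=$(mktemp -d) && cd "$demo" || exit 1
touch xxx_yyy_zzz.zip aaa_bbb_ccc.zip ddd_eee_fff.zip
printf '%s\n' ddd aaa > Exclude.list

# Dry run: same find | grep pipeline, but print the names instead of removing them.
to_delete=$(find . -type f -iname '*.zip' -print0 \
  | grep -zvFf Exclude.list - \
  | tr '\0' '\n')
printf 'would delete: %s\n' "$to_delete"

# Real run: -r skips rm when the list is empty, -- ends option parsing for rm.
find . -type f -iname '*.zip' -print0 \
  | grep -zvFf Exclude.list - \
  | xargs -0 -r rm --

remaining=$(find . -type f -iname '*.zip' | sort)
printf 'kept:\n%s\n' "$remaining"
```

Two things are worth checking in a real list: the patterns match anywhere in the path (including directory names), and an empty line in Exclude.list matches every name, which with -v would leave nothing to delete.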

Related

unix command for file separation in two different folders

I am currently in data folder which has following files and folders
Folders:
ISOLATE
JUKEBOX
Files:
XXX-12-2345-67A-89T-1011-12.ab20.RenderBase20.ISOLATE.quantifier.txt
XXX-12-2345-67A-89T-1011-12.ab20.RenderBase20.JUKEBOX.quantifier.txt
XXX-24-2345-67A-89T-2022-24.ab10.RenderBase20.ISOLATE.quantifier.txt
XXX-24-2345-67A-89T-2022-24.ab10.RenderBase20.JUKEBOX.quantifier.txt
...
I want to put the files with .ISOLATE in Folder ISOLATE and .JUKEBOX ones in the JUKEBOX folder. How could I perform this task using terminal?
There are more than 12000 files, so I cannot really change the naming scheme.
Thanks in advance
Try to use wildcards:
mv *.ISOLATE.quantifier.txt ISOLATE/
mv *.JUKEBOX.quantifier.txt JUKEBOX/
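If the number of files is too high, the shell may refuse to run the command ("Argument list too long"), so you might need to move them in smaller batches:
find . -maxdepth 1 -name '*.ISOLATE.quantifier.txt' -exec mv -t ISOLATE/ {} +
-exec with + accumulates the command-line arguments the same way xargs does, so you shouldn't exceed the maximum argument length. Note that {} must be the last argument before +, which is why the target directory is passed with mv's -t option (GNU mv).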
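Since you're dealing with a huge number of files, you can combine mv with xargs (printf is a shell builtin, so the glob passed to it is not subject to the argument-length limit):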
printf '%s\0' *.ISOLATE.* | xargs -0 mv -t ISOLATE/
printf '%s\0' *.JUKEBOX.* | xargs -0 mv -t JUKEBOX/
In addition to trying wildcards (bash pattern match or globs), which at some point will hit an upper limit based on the number of files, you can also use find and xargs:
find . -maxdepth 1 -name '*.ISOLATE.*.txt' -print0 | xargs -0 -IFILE mv FILE ./ISOLATE
find . -maxdepth 1 -name '*.JUKEBOX.*.txt' -print0 | xargs -0 -IFILE mv FILE ./JUKEBOX
Doing this won't be subject to the maximum number of command line arguments that the glob solution may hit.
The key things in the commands above are:
-maxdepth 1 ensures that find won't descend into the ./ISOLATE or ./JUKEBOX subdirectories
-print0 causes find to delimit the file names with a null byte rather than whitespace. This protects you against files that have spaces or other special characters in their names.
-0 causes xargs to use the null byte delimiter rather than whitespace for the same reason
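-IFILE tells xargs to substitute each filename wherever the string FILE appears in the command, running mv once per file. By default xargs appends the filenames at the end of the command, which wouldn't work here because mv expects the destination directory last.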
I tested the approach with a small shell script:
touch XXX-12-2345-67A-89T-1011-12.ab20.RenderBase20.ISOLATE.quantifier.txt
touch XXX-12-2345-67A-89T-1011-12.ab20.RenderBase20.JUKEBOX.quantifier.txt
touch XXX-24-2345-67A-89T-2022-24.ab10.RenderBase20.ISOLATE.quantifier.txt
touch XXX-24-2345-67A-89T-2022-24.ab10.RenderBase20.JUKEBOX.quantifier.txt
mkdir ISOLATE
mkdir JUKEBOX
find . -maxdepth 1 -name '*.ISOLATE.*.txt' -print0 | xargs -0 -IFILE mv FILE ./ISOLATE
find . -maxdepth 1 -name '*.JUKEBOX.*.txt' -print0 | xargs -0 -IFILE mv FILE ./JUKEBOX
find .
Which outputs:
$ bash example.sh
.
./example.sh
./ISOLATE
./ISOLATE/XXX-12-2345-67A-89T-1011-12.ab20.RenderBase20.ISOLATE.quantifier.txt
./ISOLATE/XXX-24-2345-67A-89T-2022-24.ab10.RenderBase20.ISOLATE.quantifier.txt
./JUKEBOX
./JUKEBOX/XXX-12-2345-67A-89T-1011-12.ab20.RenderBase20.JUKEBOX.quantifier.txt
./JUKEBOX/XXX-24-2345-67A-89T-2022-24.ab10.RenderBase20.JUKEBOX.quantifier.txt
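The -I form starts one mv per file. A batched variant can be sketched as follows, assuming GNU mv (for -t) and using shortened, made-up file names in a scratch directory:

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
touch XXX-12.ab20.RenderBase20.ISOLATE.quantifier.txt \
      XXX-12.ab20.RenderBase20.JUKEBOX.quantifier.txt \
      XXX-24.ab10.RenderBase20.ISOLATE.quantifier.txt \
      XXX-24.ab10.RenderBase20.JUKEBOX.quantifier.txt
mkdir ISOLATE JUKEBOX

# {} + hands many names to a single mv; -t names the destination first (GNU mv).
find . -maxdepth 1 -type f -name '*.ISOLATE.*.txt' -exec mv -t ISOLATE/ {} +
find . -maxdepth 1 -type f -name '*.JUKEBOX.*.txt' -exec mv -t JUKEBOX/ {} +

# Count what ended up where.
isolate_count=$(find ISOLATE -type f | wc -l)
jukebox_count=$(find JUKEBOX -type f | wc -l)
left_behind=$(find . -maxdepth 1 -type f | wc -l)
echo "ISOLATE: $isolate_count, JUKEBOX: $jukebox_count, left in .: $left_behind"
```

With 12000 files this should need only a handful of mv processes instead of 12000, at the cost of depending on a GNU-specific option.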

grep cannot read filename after find folders with spaces

After I find the files and enclose their names in double quotes with the following command:
FILES=$(find . -type f -not -path "./.git/*" -exec echo -n '"{}" ' \; | tr '\n' ' ')
I do a for loop to grep a certain word inside each file that matches find:
for f in $FILES; do grep -Eq '(GNU)' $f; done
but grep complains about each entry that it cannot find file or directory:
grep: "./test/test.c": No such file or directory
whereas echo $FILES produces:
"./.DS_Store" "./.gitignore" "./add_license.sh" "./ads.add_lcs.log" "./lcs_gplv2" "./lcs_mit" "./LICENSE" "./new test/test.js" "./README.md" "./sxs.add_lcs.log" "./test/test.c" "./test/test.h" "./test/test.js" "./test/test.m" "./test/test.py" "./test/test.pyc"
EDIT
found the answer here. works perfectly!
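The issue is that your FILES variable contains filenames surrounded by literal " quotes. The shell does not treat quotes that come out of an expansion as quoting, so they stay part of each word, and names containing spaces are still split.
But worse, find's -exec cmd {} \; executes cmd separately for each file, which can be inefficient. As mentioned by @TomFenech in the comments, you can use -exec cmd {} + to search as many files within a single cmd invocation as possible.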
A better approach for recursive search is usually to let find output filenames to search, and pipe its results to xargs in order to grep inside as many filenames together as possible. Use -print0 and -0 respectively to correctly support filenames with spaces and other separators, by splitting results by a null character instead - this way you don't need quotes, reducing possibility of bugs.
Something like this:
find . -type f -not -path './.git/*' -print0 | xargs -0 egrep '(GNU)'
However in your question you had grep -q in a loop, so I suspect you may be looking for an error status (found/not found) for each file? If so, you could use -l instead of -q to make grep list matching filenames, and then pipe/send that output to where you need the results.
find . -print0 | xargs -0 egrep -l pattern > matching_filenames
Also note that grep -E (or egrep) uses extended regular expressions, which means parentheses create a regex group. If you want to search for files containing (GNU) (with the parentheses) use grep -F or fgrep instead, which treats the pattern as a string literal.
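If a per-file found/not-found result is really what is wanted, one possible sketch is to let find hand the names to a small inline sh loop, so no quoting inside a string is needed. The directory layout, file contents, and the report variable below are made up for illustration:

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
mkdir "new test" .git
printf 'GNU General Public License\n' > "new test/test.js"
printf 'MIT License\n' > README.md
printf 'GNU\n' > .git/description   # must be skipped by -not -path

# find passes the names as real arguments, so spaces in them are harmless.
report=$(find . -type f -not -path './.git/*' -exec sh -c '
  for f in "$@"; do
    if grep -Fq "GNU" "$f"; then
      printf "found: %s\n" "$f"
    else
      printf "missing: %s\n" "$f"
    fi
  done' sh {} + | sort)
printf '%s\n' "$report"
```

The trailing sh after the script fills $0 of the inline shell, so the file names start at $1 and "$@" covers all of them.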

issue with piping find into sed (find and replace)

Here is my current code, my goal is to find every file in a given directory (recursively) and replace "FIND" with "REPLACEWITH" and overwrite the files.
FIND='ALEX'
REPLACEWITH='<strong>ALEX</strong>'
DIRECTORY='/some/directory/'
find $DIRECTORY -type f -name "*.html" -print0 |
LANG=C xargs -0 sed -i "s|$FIND|$REPLACEWITH|g"
The error I am getting is:
sed: 1: "/some/directory ...": command a expects \ followed by text
As given in BashFAQ #21, you can use perl to perform search-and-replace operations with no potential for data being treated as code:
in="$FIND" out="$REPLACEWITH" find "$DIRECTORY" -type f -name '*.html' \
-exec perl -pi -e 's/\Q$ENV{"in"}/$ENV{"out"}/g' '{}' +
If you want to include only files matching the FIND string, find can be told to only pass files which grep flags on to perl:
in="$FIND" out="$REPLACEWITH" find "$DIRECTORY" -type f -name '*.html' \
-exec grep -F -q -e "$FIND" '{}' ';' \
-exec perl -pi -e 's/\Q$ENV{"in"}/$ENV{"out"}/g' '{}' +
Because grep is being used to evaluate individual files, it's necessary to use one grep call per file so its exit status can be evaluated on a per-file basis; thus, the use of the less efficient -exec ... {} ';' action. For perl, it's possible to put multiple files to process on one command, hence the use of -exec ... {} +.
Note that grep -F is line-oriented; if your FIND string contains multiple lines, then files with any one of those lines will be passed to perl for replacements.
You can have find invoke sed directly, although sed -i rewrites every file it is given, even those without a match, so the modification times of all your files will be affected (which might matter or not):
find "$DIRECTORY" -type f -name "*.html" -exec sed -i "s|$FIND|$REPLACEWITH|g" '{}' ';'

Awk/Sed: How to do a recursive find/replace of a string in files with a certain file extension?

I need to recursively find and replace a string in my .cpp and .hpp files.
Looking at an answer to this question I've found the following command:
find /home/www -type f -print0 | xargs -0 sed -i 's/subdomainA.example.com/subdomainB.example.com/g'
Changing it to include my file type did not work; not a single word was changed:
find /myprojects -type f -name *.cpp -print0 | xargs -0 sed -i 's/previousword/newword/g'
Help appreciated.
Don't bother with xargs; use the -exec primary. (Split across two lines for readability.)
find /home/www -type f -name '*.cpp' \
-exec sed -i 's/previousword/newword/g' '{}' \;
chepner's helpful answer proposes the simpler and more efficient use of find's -exec action instead of piping to xargs.
Unless special xargs features are needed, this change is always worth making, and maps to xargs features as follows:
find ... -exec ... {} \; is equivalent to find ... -print0 | xargs -0 -n 1 ...
find ... -exec ... {} + is equivalent to find ... -print0 | xargs -0 ...
In other words:
the \; terminator invokes the target command once for each matching file/folder.
the + terminator invokes the target command once overall, supplying all matching file/folder paths as a single list of arguments.
Multiple calls happen only if the resulting command line becomes too long, which is rare, especially on Linux, where getconf ARG_MAX, the max. command-line length, is large.
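The difference between the two terminators can be observed directly. In this sketch (scratch directory and file names are assumptions) the target command is an inline sh that prints one line per invocation, so counting lines counts invocations:

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
touch a.cpp b.cpp c.cpp

# Each run of the inner sh prints exactly one line (its argument count).
calls_semicolon=$(find . -type f -name '*.cpp' -exec sh -c 'echo "$#"' sh {} \; | wc -l)
calls_plus=$(find . -type f -name '*.cpp' -exec sh -c 'echo "$#"' sh {} + | wc -l)

printf 'terminator ";": %s invocation(s)\n' "$calls_semicolon"
printf 'terminator "+": %s invocation(s)\n' "$calls_plus"
```

For three files this gives three invocations with the \; terminator and a single one with +.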
Troubleshooting the OP's command:
Since the OP's xargs command passes all matching file paths at once and, per xargs defaults, appends them at the end of the command line, the resulting command will effectively look something like this:
sed -i 's/previousword/newword/g' /myprojects/file1.cpp /myprojects/file2.cpp ...
This can easily be verified by prepending echo to sed - though (conceptual) quoting of arguments that need it (paths with, e.g., embedded spaces) will not show (note the echo):
find /myprojects -type f -name '*.cpp' -print0 |
xargs -0 echo sed -i 's/previousword/newword/g'
Next, after running the actual command, use stat to check whether the last-modified date of the files has changed.
If it has, yet the contents haven't changed, the implication is that sed has processed the files, but the regex in the s function call didn't match anything.
It is conceivable that older GNU sed versions don't work properly when combining -i (in-place editing) with multiple file operands (though I couldn't find anything in the GNU sed release notes).
To rule that out, invoke sed once for each file:
If you still want to use xargs, add -n 1:
find /myprojects -type f -name '*.cpp' -print0 |
xargs -0 -n 1 sed -i 's/previousword/newword/g'
To use find's -exec action, see chepner's answer.
With a GNU sed version that does support updating of multiple files with the -i option - which is the case as of at least v4.2.2 - the best formulation of your command is (note the quoted *.cpp argument to prevent premature expansion by the shell, and the use of terminator + to only invoke sed once):
find /myprojects -type f -name '*.cpp' -exec sed -i 's/previousword/newword/g' '{}' +
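The effect of leaving *.cpp unquoted can be reproduced in a scratch directory. This is a sketch with invented file names; it assumes exactly one .cpp file in the current directory so the unquoted command still runs:

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
mkdir sub
touch one.cpp sub/two.cpp

# Unquoted: the shell expands *.cpp to one.cpp before find ever runs.
unquoted=$(find . -type f -name *.cpp | wc -l)
# Quoted: find receives the pattern itself and applies it in every directory.
quoted=$(find . -type f -name '*.cpp' | wc -l)

printf 'unquoted pattern matched %s file(s), quoted matched %s\n' "$unquoted" "$quoted"
```

With two or more .cpp files in the current directory the unquoted form fails outright, and with none the pattern is passed through unchanged, which is why the bug can appear intermittent.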

Find files containing a given text

In bash I want to return file name (and the path to the file) for every file of type .php|.html|.js containing the case-insensitive string "document.cookie" | "setcookie"
How would I do that?
egrep -ir --include=*.{php,html,js} "(document.cookie|setcookie)" .
The r flag means to search recursively (search subdirectories). The i flag means case insensitive.
If you just want file names add the l (lowercase L) flag:
egrep -lir --include=*.{php,html,js} "(document.cookie|setcookie)" .
Try something like grep -r -n -i --include='*.html' --include='*.php' --include='*.js' searchstringhere .
the -i makes it case insensitive
the . at the end means you want to start from your current directory; this could be substituted with any directory.
the -r means do this recursively, right down the directory tree
the -n prints the line number for matches.
the --include restricts the search to files whose names match the given glob; each option takes a single pattern, so repeat it once per extension.
For more info see: http://www.gnu.org/software/grep/
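A quick way to check the --include behaviour is a scratch directory with one file per case. The file names and contents below are invented, and GNU grep is assumed for -r and --include:

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
mkdir sub
printf "setCookie('id', 1);\n" > a.php                      # matches, case differs
printf '<script>document.cookie</script>\n' > sub/b.html    # matches, in a subdirectory
printf 'document.cookie\n' > notes.txt                      # wrong extension, ignored
printf 'var x = 1;\n' > c.js                                # right extension, no match

# -l prints only the names of matching files; one --include per extension.
matches=$(grep -rilE --include='*.php' --include='*.html' --include='*.js' \
  'document\.cookie|setcookie' . | sort)
printf '%s\n' "$matches"
```

Only ./a.php and ./sub/b.html are listed. Writing the options out one by one also avoids relying on bash brace expansion.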
find them and grep for the string:
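This will find all files of your 3 types in /starting/path and grep for the regular expression '(document\.cookie|setcookie)'. The parentheses around the -name tests are needed because -o binds more loosely than the implicit "and", so without them -type f would apply only to the first -name. Split over 2 lines with the backslash just for readability...
find /starting/path -type f \( -name "*.php" -o -name "*.html" -o -name "*.js" \) -print0 | \
xargs -0 egrep -i '(document\.cookie|setcookie)'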
Sounds like a perfect job for grep or perhaps ack
Or this wonderful construction:
find . -type f \( -name '*.php' -o -name '*.html' -o -name '*.js' \) -exec grep "document.cookie\|setcookie" /dev/null {} \;
find . -type f \( -name '*.php' -o -name '*.js' -o -name '*.html' \) -print0 |\
xargs -0 grep -liE 'document\.cookie|setcookie'
Just to include one more alternative, you could also use this:
find "/starting/path" -type f -regextype posix-extended -regex "^.*\.(php|html|js)$" -exec grep -EH '(document\.cookie|setcookie)' {} \;
Where:
-regextype posix-extended tells find what kind of regex to expect
-regex "^.*\.(php|html|js)$" tells find the regex itself filenames must match
-exec grep -EH '(document\.cookie|setcookie)' {} \; tells find to run the command (with its options and arguments) specified between the -exec option and the \; for each file it finds, where {} represents where the file path goes in this command.
while
E option tells grep to use extended regex (to support the parentheses) and...
H option tells grep to print file paths before the matches.
And, given this, if you only want file paths, you may use:
find "/starting/path" -type f -regextype posix-extended -regex "^.*\.(php|html|js)$" -exec grep -EH '(document\.cookie|setcookie)' {} \; | sed -r 's/(^.*):.*$/\1/' | sort -u
Where
| [pipe] send the output of find to the next command after this (which is sed, then sort)
r option tells sed to use extended regex.
s/HI/BYE/ tells sed to replace the first occurrence (per line) of "HI" with "BYE" and...
s/(^.*):.*$/\1/ tells it to replace the regex (^.*):.*$ (meaning a group [stuff enclosed by ()] including everything [.* = zero or more of any character] from the beginning of the line [^] up to the last ':' on the line, because .* is greedy, followed by anything until the end of line [$]) by the first group [\1] of the replaced regex. If the matched line itself contains a ':', part of that line ends up in the output.
u tells sort to remove duplicate entries (take sort -u as optional).
...FAR from being the most elegant way. As I said, my intention is to increase the range of possibilities (and also to give more complete explanations on some tools you could use).
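Since grep can print just the names of matching files with -l, the sed and sort steps can probably be dropped. The sketch below compares both variants on a scratch file whose matching line contains extra ':' characters. The file name and contents are made up, example.com is a placeholder, GNU find is assumed for -regextype, and sed -E is used as the equivalent of -r:

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
# The matching line deliberately contains extra ":" characters.
printf 'x = document.cookie; // see http://example.com:8080\n' > a.php

# Original approach: strip everything after the colon with sed.
with_sed=$(find . -type f -regextype posix-extended -regex '^.*\.(php|html|js)$' \
  -exec grep -EH '(document\.cookie|setcookie)' {} \; | sed -E 's/(^.*):.*$/\1/' | sort -u)
# Alternative: let grep list the matching file names itself.
with_l=$(find . -type f -regextype posix-extended -regex '^.*\.(php|html|js)$' \
  -exec grep -lE '(document\.cookie|setcookie)' {} +)

printf 'sed version: %s\n' "$with_sed"
printf 'grep -l:     %s\n' "$with_l"
```

Here the sed version keeps part of the matched line, because the greedy .* runs up to the last colon, while grep -l prints only ./a.php.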

Resources