Bash script to change file names recursively - bash

I have a script for changing the file names of .mht files, but it does not traverse directories and subdirectories. I asked on a local forum and was given this as a solution:
find . -type f -name "*.mhtml" -o -type f -name "*.mht" | xargs -I item sh -c '{ echo item; echo item | sed "s/[:?|]//g"; }' | xargs -n2 mv
But it generates an error. After some experimenting, it turns out that sh -c breaks file names containing spaces, and that this is what generates the error. How can I fix this?
#!/bin/bash
# renames.sh
# basic file renamer
for i in . *.mht
do
    j=`echo $i | sed 's/|/ /g' | sed 's/:/ /g' | sed 's/?//g' | sed 's/"//g'`
    mv "$i" "$j"
done

#! /bin/bash
find . -type f \( -name "*.mhtml" -o -name "*.mht" \) -print0 |
while IFS= read -r -d '' source; do
    target="${source//[:?|]/}"
    [ "X$source" != "X$target" ] &&
        mv -nv "$source" "$target"
done
Update: now does the rename according to the original question, with added support for .mht.

Use rename. With rename you can specify a renaming pattern:
find . -type f \( -name "*.mhtml" -o -name "*.mht" \) -print0 | xargs -0 -I'{}' rename 's/[:?|]//g' "{}"
This way you can properly handle names with spaces. xargs will replace {} with each file name provided by the find command. Also note the use of -print0 and -0: these use \0 (NUL) as the separator, which avoids problems with file names containing \n (newline).
The -o was not working the way it was intended to: you must use parentheses to group conditions.
You may also consider using -iname instead of -name if you have to deal with files ending in ".mHtml".
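Note that the rename used here is the Perl-based one (the s/[:?|]//g expression syntax implies it); the util-linux rename shipped by some distributions takes a plain from/to pair instead and will not understand this expression. The Perl rename also accepts many files per invocation and has a -n (no-act) flag, so you can preview the renames before touching anything:
find . -type f \( -name "*.mhtml" -o -name "*.mht" \) -print0 | xargs -0 rename -n 's/[:?|]//g'
Drop the -n once the printed renames look right.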

Related

How to apply my sed command on the files resulting from grep filtering only

I have crafted this sed command, which looks to be working fine, except that it's being applied to all the files in my directory:
find . -type f -name '*.js' -not -path './node_modules/*' -exec sed -i .bak -E '
1i\
const env = require('\''env-var'\'');
s/(^|[^[:alnum:]_])process\.env\.([[:alnum:]_]+) \|\| ([[:alnum:]_]+)($|[^[:alnum:]_])/\1env.get('\''\2'\'').default('\''\3'\'')\4/g
s/(^|[^[:alnum:]_])process\.env\.([[:alnum:]_]+)($|[^[:alnum:]_])/\1env.get('\''\2'\'')\3/g
' {} \;
I wish to apply those transformations only to the files which match this grep command :
grep -r "process\.env\." --exclude-dir=node_modules
I tried using the pipe but I can't make the two work together. What's the right way to handle it?
EDIT: I tried this
➜ app-service git:(chore/adding-env-example) ✗ grep -r "process\.env\." --exclude-dir=node_modules | sed -i .bak -E '
1i\
const env = require('\''env-var'\'');
s/(^|[^[:alnum:]_])process\.env\.([[:alnum:]_]+) \|\| ([[:alnum:]_]+)($|[^[:alnum:]_])/\1env.get('\''\2'\'').default('\''\3'\'')\4/g
s/(^|[^[:alnum:]_])process\.env\.([[:alnum:]_]+)($|[^[:alnum:]_])/\1env.get('\''\2'\'')\3/g
' {} \;
sed: {}: No such file or directory
I want only the files containing process.env.SOMETHING to be edited.
Work with pipes. xargs comes in handy:
find ... -print |
xargs -d '\n' grep -l 'regex' |
xargs -d '\n' sed 'stuff'
xargs: illegal option -- d
You can:
install GNU xargs
install GNU parallel
write a bash loop to read the files line by line, see https://mywiki.wooledge.org/BashFAQ/001 and the sketch after this list
make sure your files do not have spaces, tabs, or newlines in their names, and just remove the -d '\n' option.
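A minimal sketch of the loop option, assuming bash and a find with -print0 (the sed script is the one from the question, in BSD form with -i .bak):
find . -type f -name '*.js' -not -path './node_modules/*' -print0 |
while IFS= read -r -d '' f; do
  # only edit files that actually contain the pattern
  if grep -q 'process\.env\.' "$f"; then
    sed -i .bak -E '
1i\
const env = require('\''env-var'\'');
s/(^|[^[:alnum:]_])process\.env\.([[:alnum:]_]+) \|\| ([[:alnum:]_]+)($|[^[:alnum:]_])/\1env.get('\''\2'\'').default('\''\3'\'')\4/g
s/(^|[^[:alnum:]_])process\.env\.([[:alnum:]_]+)($|[^[:alnum:]_])/\1env.get('\''\2'\'')\3/g
' "$f"
  fi
done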
I suggest reversing the order of the commands: run sed on the filtered list of files.
The file filter is a grep filter applied on top of a find filter: grep -l "process\.env\." $(find . -type f -name '*.js' -not -path './node_modules/*').
sed -i .bak -E '
1i\
const env = require('\''env-var'\'');
s/(^|[^[:alnum:]_])process\.env\.([[:alnum:]_]+) \|\| ([[:alnum:]_]+)($|[^[:alnum:]_])/\1env.get('\''\2'\'').default('\''\3'\'')\4/g
s/(^|[^[:alnum:]_])process\.env\.([[:alnum:]_]+)($|[^[:alnum:]_])/\1env.get('\''\2'\'')\3/g
' $(grep -l "process\.env\." $(find . -type f -name '*.js' -not -path './node_modules/*'))
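Beware that the nested $(...) substitutions are word-split by the shell, so this breaks on paths containing spaces. With GNU grep and sed, the same reversed order works null-delimited end to end; grep's -Z prints NUL-terminated names after -l (abbreviated here to just the second substitution):
find . -type f -name '*.js' -not -path './node_modules/*' -print0 |
xargs -0 grep -lZ 'process\.env\.' |
xargs -0 sed -i.bak -E "s/(^|[^[:alnum:]_])process\.env\.([[:alnum:]_]+)($|[^[:alnum:]_])/\1env.get('\2')\3/g"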

Solution for find -exec if single and double quotes already in use

I would like to recursively go through all subdirectories and remove the oldest two PDFs in each subfolder named "bak":
Works:
find . -type d -name "bak" \
-exec bash -c "cd '{}' && pwd" \;
Does not work, as the double quotes are already in use:
find . -type d -name "bak" \
-exec bash -c "cd '{}' && rm "$(ls -t *.pdf | tail -2)"" \;
Any solution to the double quote conundrum?
In a double quoted string you can use backslashes to escape other double quotes, e.g.
find ... "rm \"\$(...)\""
If that is too convoluted, use variables:
cmd='$(...)'
find ... "rm $cmd"
However, I think your find -exec has more problems than that.
Using {} inside the command string "cd '{}' ..." is risky. If there is a ' inside the file name, things will break and might execute unexpected commands.
$() will be expanded by bash before find even runs. So ls -t *.pdf | tail -2 will be executed only once, in the top directory ., instead of once for each found directory, and rm will (try to) delete the same files for each found directory.
rm "$(ls -t *.pdf | tail -2)" will not work if ls lists more than one file: because of the quotes, both files would be passed as a single argument, so rm would try to delete one file named first.pdf\nsecond.pdf.
I'd suggest
cmd='cd "$1" && ls -t *.pdf | tail -n2 | sed "s/./\\\\&/g" | xargs rm'
find . -type d -name bak -exec bash -c "$cmd" -- {} \;
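The sed in that command string backslash-escapes every character so that xargs's own quote processing cannot mangle the names. If GNU xargs is available, delimiting on newlines reads a little more directly (still unsafe for names that themselves contain newlines):
cmd='cd "$1" && ls -t *.pdf | tail -n2 | xargs -d "\n" -r rm --'
find . -type d -name bak -exec bash -c "$cmd" -- {} \;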
You have a more fundamental problem; because you are using the weaker double quotes around the entire script, the $(...) command substitution will be interpreted by the shell which parses the find command, not by the bash shell you are starting, which will only receive a static string containing the result from the command substitution.
If you switch to single quotes around the script, you get most of it right; but that would still fail if the file name you find contains a double quote (just like your attempt would fail for file names with single quotes). The proper fix is to pass the matching files as command-line arguments to the bash subprocess.
But a better fix still is to use -execdir so that you don't have to pass the directory name to the subshell at all:
find . -type d -name "bak" \
-execdir bash -c 'ls -t *.pdf | tail -2 | xargs -r rm' \;
This could still fail in funny ways because you are parsing the output of ls, which is inherently fragile.
You are explicitly asking for find -exec. Usually I would just chain find -exec or find -delete, but in your case only two files per directory should be deleted, so the only method is running a subshell. Socowi already gave a nice solution; however, if your file names do not contain tabs or newlines, another workaround is a find piped into a while read loop.
This sorts the files by mtime:
find . -type d -iname 'bak' |
while read -r dir; do
    find "$dir" -maxdepth 1 -type f -iname '*.pdf' -printf "%T+\t%p\n" |
        sort | head -n2 | cut -f2- |
        while read -r file; do
            rm "$file"
        done
done
The above find while read loop as a "one-liner":
find . -type d -iname 'bak' | while read -r dir; do find "$dir" -maxdepth 1 -type f -iname '*.pdf' -printf "%T+\t%p\n" | sort | head -n2 | cut -f2- | while read -r file; do rm "$file"; done; done;
A find while read loop can also handle NUL-terminated file names. However, head cannot handle those (at least without GNU's -z option), so I improved the other answers to make this work with nontrivial file names (GNU tools + bash only).
Replace realpath with rm to actually delete; as written, this is a dry run that only prints the files:
#!/bin/bash
rm_old () {
find "$1" -maxdepth 1 -type f -iname \*.$2 -printf "%T+\t%p\0" | sort -z | sed -zn 's,\S*\t\(.*\),\1,p' | grep -zim$3 \.$2$ | xargs -0r realpath
}
export -f rm_old
find -type d -iname bak -execdir bash -c 'rm_old "{}" pdf 2' \;
However, bash -c might still be exploitable; to make it more secure, let stat's %N do the quoting:
#!/bin/bash
rm_old () {
local dir="$1"
# we don't like eval
# eval "dir=$dir"
# this works like eval
dir="${dir#?}"
dir="${dir%?}"
dir="${dir//"'$'\t''"/$'\011'}"
dir="${dir//"'$'\n''"/$'\012'}"
dir="${dir//$'\047'\\$'\047'$'\047'/$'\047'}"
find "$dir" -maxdepth 1 -type f -iname \*.$2 -printf '%T+\t%p\0' | sort -z | sed -zn 's,\S*\t\(.*\),\1,p' | grep -zim$3 \.$2$ | xargs -0r realpath
}
find -type d -iname bak -exec stat -c'%N' {} + | while read -r dir; do rm_old "$dir" pdf 2; done

Replacing a part of file path in exec

I would like to replace a part of each file path found by the Linux find command.
My approach is attached below:
find . -type f -name "*.txt" -exec echo {} | sed "s/f/u/g" {} \;
I expect the replacement of each letter "f" with "u" in each file path. Unfortunately I got this error:
find: missing argument to `-exec'
sed: can't read {}: No such file or directory
sed: can't read ;: No such file or directory
What did I do wrong? Thank you for your help.
I would like to replace a part of each file path
If you want to change just the file names/paths then use:
find . -type f -name "*.txt" -exec bash -c 'echo "$1" | sed "s/f/u/g"' - {} \;
or a bit more efficiently with xargs (it spawns one subshell for all the found files instead of one per file):
find . -type f -name "*.txt" -print0 |
xargs -0 bash -c 'for f; do sed "s/f/u/g" <<< "$f"; done' _
The _ placeholder fills $0, so no file name gets swallowed by it.
find . -type f -name "*.txt" | while read files
do
newname=$(echo "${files}" | sed s"#f#u#"g)
mv -v "${files}" "${newname}"
done
I don't completely understand what you meant by file path. If you weren't talking about the file name, please clarify further.
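If the paths may contain spaces or glob characters, the same idea works null-delimited; here is a sketch using bash's ${var//} expansion in place of sed (like the original, it rewrites every "f" anywhere in the path, so a changed directory part must already exist under the new name):
find . -type f -name '*.txt' -print0 |
while IFS= read -r -d '' f; do
    mv -v -- "$f" "${f//f/u}"
done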

bash shell script not working as intended using cmp with output redirection

I am trying to write a bash script that removes duplicate files from a folder, keeping only one copy.
The script is the following:
#!/bin/sh
for f1 in `find ./ -name "*.txt"`
do
if test -f $f1
then
for f2 in `find ./ -name "*.txt"`
do
if [ -f $f2 ] && [ "$f1" != "$f2" ]
then
# if cmp $f1 $f2 &> /dev/null # DOES NOT WORK
if cmp $f1 $f2
then
rm $f2
echo "$f2 purged"
fi
fi
done
fi
done
I want to redirect the output and stderr to /dev/null to avoid printing them to the screen, but with the commented statement the script does not work as intended and removes all files but the first.
I'll give more information if needed.
Thanks
A few comments:
First, the:
for f1 in `find ./ -name "*.txt"`
do
if test -f $f1
then
is the same as this (find only plain files with the txt extension):
for f1 in `find ./ -type f -name "*.txt"`
Better syntax (bash only) is
for f1 in $(find ./ -type f -name "*.txt")
and finally the whole thing is wrong, because if a filename contains a space, the f1 variable will not get the full path name. So instead of the for, use:
find ./ -type f -name "*.txt" -print | while read -r f1
and, as @Sir Athos pointed out, the filename can contain \n, so the best is to use:
find . -type f -name "*.txt" -print0 | while IFS= read -r -d '' f1
Second:
Use "$f1" instead of $f1 - again, because the $f1 can contain space.
Third:
Doing N*N comparisons is not very effective. You should compute a checksum (md5, or better, sha256) for every txt file; when the checksums are identical, the files are duplicates.
If you don't trust checksums, compare byte by byte only the files that have identical checksums. Files with different checksums are SURE not to be duplicates. ;)
Computing checksums is slow too, so you should first compare only files of the same size. Files of different sizes are not duplicates...
You can skip empty txt files; they are all duplicates of each other :).
So the final command can be:
find . -not -empty -type f -name \*.txt -printf "%s\n" | sort -rn | uniq -d |\
xargs -I% -n1 find . -type f -name \*.txt -size %c -print0 | xargs -0 md5sum |\
sort | uniq -w32 --all-repeated=separate
Commented:
#find all non-empty file with the txt extension and print their size (in bytes)
find . -not -empty -type f -name \*.txt -printf "%s\n" |\
#sort the sizes numerically, and keep only duplicated sizes
sort -rn | uniq -d |\
#for each size that is duplicated, find all files with that size and print their names (paths)
xargs -I% -n1 find . -type f -name \*.txt -size %c -print0 |\
#make an md5 checksum for them
xargs -0 md5sum |\
#sort the checksums and keep duplicated files separated with an empty line
sort | uniq -w32 --all-repeated=separate
Now you can simply edit the output and decide which files you want to remove and which to keep.
&> is bash syntax, you'll need to change the shebang line (first line) to #!/bin/bash (or the appropriate path to bash.
Or if you're really using the Bourne Shell (/bin/sh), then you have to use old-style redirection, i.e.
cmp ... >/dev/null 2>&1
Also, I think the &> was only introduced in bash 4, so if you're using bash, 3.X you'll still need the old-style redirections.
IHTH
Credit to @kobame for this answer: this is really a comment, posted as an answer for the formatting.
You don't need to call find twice; print out the size and the filename in one find command:
find . -not -empty -type f -name \*.txt -printf "%8s %p\n" |
# find the files that have duplicate sizes
sort -n | uniq -Dw 8 |
# strip off the size and get the md5 sum
cut -c 10- | xargs md5sum
An example
$ cat a.txt
this is file a
$ cat b.txt
this is file b
$ cat c.txt
different contents
$ cp a.txt d.txt
$ cp b.txt e.txt
$ find . -not -empty -type f -name \*.txt -printf "%8s %p\n" |
sort -n | uniq -Dw 8 | cut -c 10- | xargs md5sum
76fd4c1589ef708d9203f3cf09cfd032 ./a.txt
e2d75fd6a1080efb6230d0608b1f9014 ./b.txt
76fd4c1589ef708d9203f3cf09cfd032 ./d.txt
e2d75fd6a1080efb6230d0608b1f9014 ./e.txt
To keep one and delete the rest, I would pipe the output into:
... | awk '++seen[$1] > 1 {print $2}' | xargs echo rm
rm ./d.txt ./e.txt
Remove the echo if your testing is satisfactory.
Like many complex pipelines, this will break on filenames containing newlines.
All nice answers, so just one short suggestion: you can install and use fdupes:
fdupes -r .
From the man page:
Searches the given path for duplicate files. Such files are found by
comparing file sizes and MD5 signatures, followed by a byte-by-byte
comparison.
Added by @Francesco
fdupes -rf . | xargs rm -f
to remove the dupes (the -f option makes fdupes omit the first occurrence in each set, so it lists only the duplicate copies).
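One caveat: piping fdupes output through xargs splits on whitespace, so names containing spaces get mangled. fdupes can do the deletion itself, which avoids the problem:
# prompt for which copy in each set to keep
fdupes -r -d .
# or keep the first file of each set and delete the rest without prompting
fdupes -r -d -N .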

How to remove trailing whitespace of all files recursively?

How can you remove all of the trailing whitespace of an entire project? Starting at a root directory, and removing the trailing whitespace from all files in all folders.
Also, I want to be able to modify the files directly, not just print everything to stdout.
Here is an OS X >= 10.6 Snow Leopard solution.
It ignores .git and .svn folders and their contents. Also, it won't leave backup files.
(export LANG=C LC_CTYPE=C
find . -not \( -name .svn -prune -o -name .git -prune \) -type f -print0 | perl -0ne 'print if -T' | xargs -0 sed -Ei 's/[[:blank:]]+$//'
)
The enclosing parentheses run the commands in a subshell, preserving the current shell's L* variables.
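A quick illustration of that subshell scoping (plain bash semantics):
$ ( export FOO=inner; echo "inside: $FOO" )
inside: inner
$ echo "outside: ${FOO-unset}"
outside: unset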
Use:
find . -type f -print0 | xargs -0 perl -pi.bak -e 's/ +$//'
If you don't want the ".bak" files generated:
find . -type f -print0 | xargs -0 perl -pi -e 's/ +$//'
As a zsh user, you can omit the call to find and instead use:
perl -pi -e 's/ +$//' **/*
Note: To prevent destroying .git directory, try adding: -not -iwholename '*.git*'.
Two alternative approaches which also work with DOS newlines (CR/LF) and do a pretty good job at avoiding binary files:
Generic solution which checks that the MIME type starts with text/:
while IFS= read -r -d '' -u 9
do
if [[ "$(file -bs --mime-type -- "$REPLY")" = text/* ]]
then
sed -i 's/[ \t]\+\(\r\?\)$/\1/' -- "$REPLY"
else
echo "Skipping $REPLY" >&2
fi
done 9< <(find . -type f -print0)
Git repository-specific solution by Mat which uses the -I option of git grep to skip files which Git considers to be binary:
git grep -I --name-only -z -e '' | xargs -0 sed -i 's/[ \t]\+\(\r\?\)$/\1/'
In Bash:
find dir -type f -exec sed -i 's/ *$//' '{}' ';'
Note: If you're using a .git repository, try adding: -not -iwholename '*.git*'.
Ack was made for this kind of task.
It works just like grep, but knows not to descend into places like .svn, .git, .cvs, etc.
ack --print0 -l '[ \t]+$' | xargs -0 -n1 perl -pi -e 's/[ \t]+$//'
Much easier than jumping through hoops with find/grep.
Ack is available via most package managers (as either ack or ack-grep).
It's just a Perl program, so it's also available in a single-file version that you can just download and run. See: Ack Install
This worked for me in OSX 10.5 Leopard, which does not use GNU sed or xargs.
find dir -type f -print0 | xargs -0 sed -i.bak -E "s/[[:space:]]*$//"
Just be careful with this if you have files that need to be excluded (I did)!
You can use -not -path (or -prune) to ignore certain directories or files. For Python files in a git repository, you could use something like:
find dir -not -path '*.git*' -iname '*.py'
ex
Try using Ex editor (part of Vim):
$ ex +'bufdo!%s/\s\+$//e' -cxa **/*.*
Note: For recursion (bash 4 & zsh) we use the globstar glob (**/*.*). In bash, enable it with shopt -s globstar.
You may add the following function into your .bash_profile:
# Strip trailing whitespaces.
# Usage: trim *.*
# See: https://stackoverflow.com/q/10711051/55075
trim() {
    ex +'bufdo!%s/\s\+$//e' -cxa "$@"
}
sed
For using sed, check: How to remove trailing whitespaces with sed?
find
Use the following script (e.g. remove_trail_spaces.sh) to remove trailing whitespace from files:
#!/bin/sh
# Script to remove trailing whitespace of all files recursively
# See: https://stackoverflow.com/questions/149057/how-to-remove-trailing-whitespace-of-all-files-recursively
case "$OSTYPE" in
darwin*) # OSX 10.5 Leopard, which does not use GNU sed or xargs.
find . -type f -not -iwholename '*.git*' -print0 | xargs -0 sed -i .bak -E "s/[[:space:]]*$//"
find . -type f -name \*.bak -print0 | xargs -0 rm -v
;;
*)
find . -type f -not -iwholename '*.git*' -print0 | xargs -0 perl -pi -e 's/ +$//'
esac
Run this script from the directory you want to scan. On OSX, it will also remove all the .bak files at the end.
Or just:
find . -type f -name "*.java" -exec perl -p -i -e "s/[ \t]$//g" {} \;
which is the way recommended by the Spring Framework Code Style.
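Note that [ \t]$ matches only the very last character of each line, so a line ending in several spaces or tabs loses just one per run. If a single pass is wanted, add a quantifier (a variant, not the literal command the style guide gives):
find . -type f -name "*.java" -exec perl -p -i -e "s/[ \t]+$//g" {} \;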
I ended up not using find and not creating backup files.
sed -i '' 's/[[:space:]]*$//g' **/*.*
Depending on the depth of the file tree, this (shorter version) may be sufficient for your needs.
Note that this also touches binary files.
Instead of excluding files, here is a variation of the above that explicitly whitelists the files to strip, based on file extension; feel free to season to taste:
find . \( -name '*.rb' -or -name '*.html' -or -name '*.js' -or -name '*.coffee' -or \
    -name '*.css' -or -name '*.scss' -or -name '*.erb' -or -name '*.yml' -or -name '*.ru' \) \
    -print0 | xargs -0 sed -i '' -E "s/[[:space:]]*$//"
I ended up running this, which is a mix of pojo's and adam's versions.
It cleans both trailing whitespace and another form of trailing whitespace, the carriage return:
find . -not \( -name .svn -prune -o -name .git -prune \) -type f \
    -exec sed -i 's/[[:space:]]\+$//' \{} \; \
    -exec sed -i 's/\r$//' \{} \;
It won't touch the .git folder if there is one.
Edit: Made it a bit safer after the comment, so it won't touch files with ".git" or ".svn" in the path. But beware: it will touch binary files if you've got some. Use -iname "*.py" -or -iname "*.php" after -type f if you only want it to touch, e.g., .py and .php files.
Update 2: It now removes all kinds of whitespace at the end of a line (which means tabs as well).
This works well; add/remove --include for specific file types:
egrep -rl ' $' --include '*.c' * | xargs sed -i 's/\s\+$//g'
Ruby (run inside irb):
Dir['lib/**/*.rb'].each{|f| x = File.read(f); File.write(f, x.gsub(/[ \t]+$/,"")) }
1) Many other answers use -E. I am not sure why, as that's an undocumented BSD compatibility option; -r should be used instead.
2) Other answers use -i ''. That should be just -i (or -i'' if preferred), because -i takes its suffix immediately after it.
3) Git specific solution:
git config --global alias.check-whitespace \
'git diff-tree --check $(git hash-object -t tree /dev/null) HEAD'
git check-whitespace | grep trailing | cut -d: -f1 | sort -u | xargs -d '\n' sed --in-place -e 's/[ \t]\+$//'
The first one registers a git alias check-whitespace which lists the files with trailing whitespaces.
The second one runs sed on them.
I only use \t rather than [:space:] as I don't typically see vertical tabs, form feeds, or non-breaking spaces. Your mileage may vary.
I use regular expressions. Four steps:
Open the root folder in your editor (I use Visual Studio Code).
Tap the Search icon on the left, and enable the regular expression mode.
Enter " +\n" in the Search bar and "\n" in the Replace bar.
Click "Replace All".
This removes all trailing spaces at the end of each line in all files, and you can exclude files that don't fit this need.
This is what works for me (Mac OS X 10.8, GNU sed installed by Homebrew):
find . -path ./vendor -prune -o \
\( -name '*.java' -o -name '*.xml' -o -name '*.css' \) \
-exec gsed -i -E 's/\t/ /g' \{} \; \
-exec gsed -i -E 's/[[:space:]]*$//' \{} \; \
-exec gsed -i -E 's/\r$//' \{} \;
Removes trailing spaces, replaces tabs with spaces, and replaces Windows CRLF with Unix \n.
Note the /g on the tab substitution: without it, only the first tab on each line is replaced, and the command has to be run several times before all files come out clean.
