I have a directory which has 70000 xml files in it. Each file has a tag which looks something like this, for the sake of simplicity:
<ns2:apple>, <ns2:orange>, <ns2:grapes>, <ns2:melon>. Each file has only one fruit tag, i.e. there cannot be both apple and orange in the same file.
I would like to rename every file (adding "1_" to the beginning of the filename) that contains one of: <ns2:apple>, <ns2:orange>, <ns2:melon>.
I can find such files with egrep:
egrep -r '<ns2:apple>|<ns2:orange>|<ns2:melon>'
So how would it look as a bash script, which I can then use as a cron job?
P.S. Sorry I don't have any bash script draft, I have very little experience with it and the time is of the essence right now.
This may be done with this script:
#!/bin/sh
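# $f holds the full path, so the prefix must go on the file name, not on the path
find /path/to/directory/with/xml -type f | while IFS= read -r f; do
    grep -q -E '<ns2:apple>|<ns2:orange>|<ns2:melon>' "$f" && mv "$f" "$(dirname "$f")/1_$(basename "$f")"
done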
But it rescans the directory on every run and prepends 1_ to each file containing one of your tags. This means a lot of excess I/O, and files with those tags get another 1_ prefix on each run, resulting in names like 1_1_1_1_file.xml.
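One way to keep repeated cron runs from stacking prefixes is to skip any file whose name already starts with 1_. The sketch below is only an illustration of that idea, not part of the original answer. The function name prefix_matching is hypothetical, and it assumes the files sit directly in one directory, end in .xml, and that no unprocessed file is legitimately named 1_something.

```shell
#!/bin/sh
# prefix_matching DIR (hypothetical helper): prepend "1_" to every *.xml file
# in DIR that contains one of the tags, skipping files that already carry
# the prefix so that a second run changes nothing.
prefix_matching() {
    dir=$1
    for f in "$dir"/*.xml; do
        [ -f "$f" ] || continue          # also covers the "no match" case of the glob
        base=${f##*/}                    # file name without the directory part
        case $base in
            1_*) continue ;;             # assumption: a leading 1_ means "already processed"
        esac
        if grep -q -E '<ns2:apple>|<ns2:orange>|<ns2:melon>' "$f"; then
            mv "$f" "$dir/1_$base"
        fi
    done
}
```

It could be called from the cron script as prefix_matching /path/to/directory/with/xml. Files without a matching tag are still re-read on every run, so this removes the repeated renaming but not the extra I/O.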
You should probably rethink the design, e.g. move processed files into one of two directories depending on whether or not the file contains one of the tags:
#!/bin/sh
# create output dirs
mkdir -p /path/to/directory/with/xml/with_tags/ /path/to/directory/with/xml/without_tags/
find /path/to/directory/with/xml -maxdepth 1 -mindepth 1 -type f | while IFS= read -r f; do
if grep -q -E '<ns2:apple>|<ns2:orange>|<ns2:melon>' "$f"; then
mv "$f" /path/to/directory/with/xml/with_tags/
else
mv "$f" /path/to/directory/with/xml/without_tags/
fi
done
Run this command as a dry run, then remove --dry-run to actually rename the files:
grep -Pl '(<ns2:apple>|<ns2:orange>|<ns2:melon>)' *.xml | xargs rename --dry-run 's/^/1_/'
The command-line utility rename comes in many flavors. Most of them should work for this task. I used the rename version 1.601 by Aristotle Pagaltzis. To install rename, simply download its Perl script and place into $PATH. Or install rename using conda, like so:
conda install rename
Here, grep uses the following options:
-P : Use Perl regexes.
-l : Suppress normal output; instead print the name of each input file from which output would normally have been printed.
SEE ALSO:
grep manual
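With around 70000 files, expanding *.xml on a single command line may run into the system's argument-length limit. A possible workaround, sketched here under that assumption, is to let find hand the files to grep in batches. The function name list_matching is hypothetical, and -maxdepth is assumed to be available (it is in GNU and BSD find).

```shell
#!/bin/sh
# list_matching DIR (hypothetical helper): print the path of every *.xml file
# directly inside DIR that contains one of the three tags.
# "-exec ... {} +" makes find split the file list into batches that fit
# on a command line, so the shell never has to expand *.xml itself.
list_matching() {
    find "$1" -maxdepth 1 -type f -name '*.xml' \
        -exec grep -l -E '<ns2:apple>|<ns2:orange>|<ns2:melon>' {} +
}
```

The printed list can be reviewed first and then piped to xargs and rename as in the command above.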
I'm trying to run a series of commands on a list of files in multiple directories located directly under the current branch.
An example hierarchy is as follows:
/tmp
|-1
| |-a.txt
| |-b.txt
| |-c.txt
|-2
| |-a.txt
| |-b.txt
| |-c.txt
From the /tmp directory I'm sitting at my prompt and I'm trying to run a command against the a.txt file by renaming it to d.txt.
How do I get it to go into each directory and rename the file? I've tried the following and it won't work:
for i in ./*; do
mv "$i" $"(echo $i | sed -e 's/a.txt/d.txt/')"
done
It just doesn't jump into each directory. I've also tried to get it to create files for me, or folders under each hierarchy from the current directory just 1 folder deep, but it won't work using this:
for x in ./; do
mkdir -p cats
done
OR
for x in ./; do
touch $x/cats.txt
done
Any ideas ?
Place the below script in your base directory
#!/bin/bash
# Move 'a.txt's to 'd.txt's recursively
mover()
{
CUR_DIR=$(dirname "$1")
mv "$1" "$CUR_DIR/d.txt"
}
export -f mover
find . -type f -name "a.txt" -exec bash -c 'mover "$0"' {} \;
and execute it.
Note:
If you wish to generalize the script, you could accept the directory to search as a parameter to the script and pass that directory name to find.
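A minimal sketch of that generalization, assuming the directory and both file names are passed as arguments; the function name rename_in_tree and its defaults are hypothetical and not part of the original script:

```shell
#!/bin/sh
# rename_in_tree ROOT OLD NEW (hypothetical helper): rename every regular file
# called OLD below ROOT to NEW, keeping it in its own directory.
# Caution: like plain mv, this replaces an existing NEW in the same directory.
rename_in_tree() {
    root=${1:-.}
    old=${2:-a.txt}
    new=${3:-d.txt}
    find "$root" -type f -name "$old" -exec sh -c '
        new=$1
        shift
        for path in "$@"; do
            mv "$path" "$(dirname "$path")/$new"
        done
    ' sh "$new" {} +
}
```

For the hierarchy in the question this would be called as rename_in_tree /tmp a.txt d.txt. Note that find treats OLD as a glob pattern, so special characters in it would need escaping.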
> for i in ./*; do
As per your own description, this will assign ./1 and then ./2 to i. Neither of those matches any of the actual files. You want
for i in ./*/a.txt; do
As a further aside, the shell is perfectly capable of replacing simple strings itself, using parameter expansion with glob patterns, so sed is not needed. This also coincidentally fixes the problem of not quoting $i when you echo it.
mv "$i" "${i%/a.txt}/d.txt"
I am flattening a directory of nested folders/picture files down to a single folder. I want to move all of the nested files up to the root level.
There are 3,381 files (no directories included in the count). I calculate this number using these two commands and subtracting the directory count (the second command):
find ./ | wc -l
find ./ -type d | wc -l
To flatten, I use this command:
find ./ -mindepth 2 -exec mv -i -v '{}' . \;
Problem is that when I get a count after running the flatten command, my count is off by 46. After going through the list of files before and after (I have a backup), I found that the mv command is overwriting files sometimes even though I'm using -i.
Here's details from the log for one of these files being overwritten...
.//Vacation/CIMG1075.JPG -> ./CIMG1075.JPG
..more log
..more log
..more log
.//dog pics/CIMG1075.JPG -> ./CIMG1075.JPG
So I can see that it is overwriting. I thought -i was supposed to stop this. I also tried a -n and got the same number. Note, I do have about 150 duplicate filenames. Was going to manually rename after I flattened everything I could.
Is it a timing issue?
Is there a way to resolve?
NOTE: it is prompting me that some of the files are overwrites. On those prompts I just press Enter so as not to overwrite. In the case above, there is no prompt. It just overwrites.
Apparently the manual entry clearly states:
The -n and -v options are non-standard and their use in scripts is not recommended.
In other words, you should mimic the -n option yourself. To do that, just check if the file exists and act accordingly. In a shell script where the file is supplied as the first argument, this could be done as follows:
[ -f "${1##*/}" ]
The file, as first argument, contains directories which can be stripped using ##*/. Now simply execute the mv using ||, since we want to execute when the file doesn't exist.
[ -f "${1##*/}" ] || mv "$1" .
Using this, you can edit your find command as follows:
find ./ -mindepth 2 -exec bash -c '[ -f "${0##*/}" ] || mv "$0" .' '{}' \;
Note that we now use $0 because of the bash -c usage. The first argument after the command string is assigned to $0, which would normally hold the script name, but here there is no script. This means the argument positions are shifted by one with respect to a usual shell script.
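The check-then-move idea can also be wrapped in a small function that reports what it skips, which makes the leftover duplicates easy to find afterwards. This is a sketch under some assumptions: the function name flatten_no_clobber is hypothetical, only regular files are moved (-type f), and nothing else writes to the directory while it runs.

```shell
#!/bin/sh
# flatten_no_clobber ROOT (hypothetical helper): move every regular file found
# at depth 2 or deeper up into ROOT, but never onto an existing name.
# Skipped files stay where they are and are reported on stderr.
flatten_no_clobber() {
    root=$1
    find "$root" -mindepth 2 -type f -exec sh -c '
        root=$1
        shift
        for src in "$@"; do
            dest=$root/${src##*/}        # same file name, directly under ROOT
            if [ -e "$dest" ]; then
                printf "skipped (name taken): %s\n" "$src" >&2
            else
                mv "$src" "$dest"
            fi
        done
    ' sh "$root" {} +
}
```

Running flatten_no_clobber . 2> skipped.log would leave a list of the files that still need a manual rename.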
Why not check whether the file exists before moving it? Then you can leave the file where it is, rename it, or do something else.
test -f, or its [ ] form, should do the trick.
I am on a tablet and cannot easily include the source.
I want to consolidate into 1 directory files that are in multiple subdirectories.
The following comes close except that the random string is added after the extension; I want it before the extension:
find . -type f -iname "[a-z,0-9]*" -exec bash -c 'mv -v "$0" "./$( mktemp "$( basename "$0" ).XXX" )"' '{}' \;
I've searched through dozens of other posts but nothing addressed the specifics of my situation:
I'm on OS X (so it's a BSD flavor of Bash; for ex. there's no -t option for mv)
Many of the files have identical names, so I need to rename them during the mv (and I can't just use the -n option for mv, because then too many files would not get moved)
The files are not all the same kind, so I need to use a find -type f
I want to exclude .DS_store files, so it seems like a good option is find -type f -iname "[a-z,0-9]*"
I want the rewritten files' names to be in the form of: oldname-random_string.xyz (but I'm also OK with having the files renamed as a sequential list: 00001.xyz, 00002.xyz, etc.)
The files are buried 4 levels down from my master directory:
Master/Top dir
Dir 2
Dir 3
Dir 4
Dir 5
file
For the sake of simplicity I prefer a bash command to a .sh script (but I'm happy with either)
GNU Solution
This uses basically the same command that you were using but I supply a template to mktemp so that the XXX pattern appears just before the suffix. With GNU sed:
find . -type f -iname "[a-z,0-9]*" -exec bash -c 'mv -v "$1" "./$(mktemp -u "$(basename "$1" | sed -E -e '\''s/\.([^.]+)$/.XXX.\1/'\'' -e '\''/XXX/ !s/$/.XXX/'\'')" )"' _ '{}' \;
The key addition above is the use of sed to insert XXX before the suffix in the file name:
sed -E -e 's/\.([^.]+)$/.XXX.\1/' -e '/XXX/ !s/$/.XXX/'
This has two commands. The first puts .XXX before the extension. The second command is run only if the file name has no extension in which case it adds .XXX to the end of the file name.
In the first command, the source regex consists of two parts. The first is \. which matches a period. The second is ([^.]+)$ which captures the extension into group 1. The substitution replaces this with .XXX.\1 where \1 is sed notation for group 1 which, in our case, is the file's extension.
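To see what the two sed expressions do on their own, they can be tried on a few sample names. The wrapper function insert_placeholder and the sample names are made up for illustration:

```shell
#!/bin/sh
# insert_placeholder NAME (hypothetical helper): print NAME with .XXX inserted
# before the extension, or appended when NAME has no extension.
insert_placeholder() {
    printf '%s\n' "$1" |
        sed -E -e 's/\.([^.]+)$/.XXX.\1/' -e '/XXX/!s/$/.XXX/'
}

insert_placeholder photo.jpg        # prints photo.XXX.jpg
insert_placeholder README           # prints README.XXX
insert_placeholder archive.tar.gz   # prints archive.tar.XXX.gz
```

Only the last extension counts, so archive.tar.gz gets the placeholder between tar and gz.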
OSX Solution
Under OSX, mktemp is not useful because it only supports templates with the XXX part trailing. As a workaround, we can use a bash script that generates non-overlapping file names:
#!/bin/bash
find . -type f -iname "[a-z,0-9]*" -print0 |
while IFS= read -r -d '' fname
do
new=$(basename "$fname")
[ "$fname" = "./$new" ] && continue
[ "$new" = .DS_Store ] && continue
name=${new%.*}
ext=${new#"$name"}
n=0
new=$(printf '%s.%03i%s' "$name" "$n" "$ext")
while [ -f "$new" ]
do
n=$(($n + 1))
new=$(printf '%s.%03i%s' "$name" "$n" "$ext")
done
mv -v "$fname" "$new"
done
The above uses the find command to get the file names. The option -print0 is used to ensure that it works with difficult file names. The while loop reads these file names one by one into the variable fname, which includes the full path to the source file. The file name without the path is then stored in new. Then two checks are performed. If the source file is already in the current directory, the script continues with the next iteration. Similarly, if the file name is .DS_Store, it is also skipped. (The find command, as given, already skips these files. This line is there just for future flexibility.) Next, the file name is split into two parts: name and ext, the extension. ext includes the leading period. Finally, a loop checks for files of the form name.NNN.ext and stops at the first one that doesn't yet exist. The source file is moved to a file of that name.
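The name-splitting and numbering steps described above can be isolated into a small function for experimenting. This is a sketch rather than a replacement for the script; next_free_name is a hypothetical name, and it only prints the candidate without moving anything.

```shell
#!/bin/sh
# next_free_name DIR FILENAME (hypothetical helper): print the first name of
# the form name.NNN.ext that does not yet exist in DIR.
next_free_name() {
    dir=$1
    base=$2
    name=${base%.*}          # everything before the last period
    ext=${base#"$name"}      # the rest, including the leading period (may be empty)
    n=0
    candidate=$(printf '%s.%03d%s' "$name" "$n" "$ext")
    while [ -e "$dir/$candidate" ]; do
        n=$((n + 1))
        candidate=$(printf '%s.%03d%s' "$name" "$n" "$ext")
    done
    printf '%s\n' "$candidate"
}
```

In an empty directory, next_free_name . photo.jpg prints photo.000.jpg; once that file exists, the next call prints photo.001.jpg. There is a small window between the check and a later mv, so this assumes nothing else creates files in the directory at the same time.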
Related Notes Regarding the GNU Solution and its Compatibility
Quoting in the above GNU command is complex. The argument to bash -c needs to be in single-quotes to prevent the calling bash from performing premature variable substitution. In addition, the sed commands need to be in single-quotes when executed by the bash subshell to prevent history expansion from interfering with the use of negation, !, within the sed command.
The OSX (BSD) sed does not support combining commands together with semicolons. Consequently, each command is supplied to sed via a separate -e option.
The OSX (BSD) sed seems to treat + differently from the GNU sed. This incompatibility seems to go away when using the -E (extended regex) option. (The corresponding GNU option is -r but, as an undocumented compatibility feature, GNU sed supports -E also.)
I have a bunch of xml files in a directory that need to have the dos2unix command run on them, and new files will be added every so often. Instead of manually running dos2unix on each file every time, I would like to automate it all with a script. I have never even looked at a shell script in my life, but so far I have this from what I have read in a few tutorials:
FILES=/tmp/testFiles/*
for f in $FILES
do
fname=`basename $f`
dos2unix *.xml $f $fname
done
However I keep getting the 'usage' output showing up. I think the problem is that I am not assigning the name of the new file correctly (fname).
The reason you're getting a usage message is that dos2unix doesn't take the extra arguments you're supplying. It will, however, accept multiple filenames (also via globs). You don't need a loop unless you're processing more files than can be accepted on the command line.
dos2unix /tmp/testFiles/*.xml
Should be all you need, unless you need recursion:
find /tmp/testFiles -name '*.xml' -exec dos2unix {} +
(for GNU find)
If all files are in one directory (no recursion needed) then you're almost there.
for file in /tmp/testFiles/*.xml ; do
dos2unix "$file"
done
By default dos2unix should convert in place and overwrite the original.
If recursion is needed you'll have to use find as well:
find /tmp/testFiles -name '*.xml' -print0 | while IFS= read -r -d '' file ; do
dos2unix "$file"
done
Which will work on all files ending with .xml in /tmp/testFiles/ and all of its sub-directories.
If no other steps are required, you can skip the shell loop entirely:
Non-recursive:
find /tmp/testFiles -maxdepth 1 -name '*.xml' -exec dos2unix {} +
And for recursive:
find /tmp/testFiles -name '*.xml' -exec dos2unix {} +
In your original command I see you finding the base name of each file name and trying to pass that to dos2unix, but your intent is not clear. Later, in a comment, you say you just want to overwrite the files. My solution performs the conversion, creates no backups and overwrites the original with the converted version. I hope this was your intent.
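Since new files keep arriving, a scheduled job might only want to touch files that still contain DOS line endings. One possible way to find them, sketched here with a hypothetical function name list_crlf_files, is to let grep look for a carriage return:

```shell
#!/bin/sh
# list_crlf_files DIR (hypothetical helper): print every *.xml file below DIR
# that still contains at least one carriage return character.
list_crlf_files() {
    cr=$(printf '\r')        # a literal carriage return to search for
    find "$1" -type f -name '*.xml' -exec grep -l "$cr" {} +
}

# Possible use (assumes file names without newlines):
#   list_crlf_files /tmp/testFiles | while IFS= read -r f; do dos2unix "$f"; done
```

Files that were already converted are then listed no more, so repeated runs only convert the new arrivals.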
mkdir -p /tmp/testFiles/converted/
for f in /tmp/testFiles/*.xml
do
# assumes a dos2unix that takes an input and an output file name
dos2unix "$f" "${f/testFiles\//testFiles\/converted\/}"
# or for pure sh:
# dos2unix "$f" "$(echo "$f" | sed s#testFiles/#testFiles/converted/#)"
done
The result will be saved in the converted/ subdirectory.
The construction ${f/testFiles\//testFiles\/converted\/} (thanks to Rush)
or sed is used here to add converted/ before the name of the file:
$ echo /tmp/testFiles/1.xml | sed s#testFiles/#testFiles/converted/#
/tmp/testFiles/converted/1.xml
It is not clear which implementation of dos2unix you are using. Different implementations require different arguments. There are many different implementations around.
On RedHat/Fedora/Suse Linux you could just type
dos2unix /tmp/testFiles/*.xml
On SunOS you are required to give an input and output file name, and the above command would destroy several of your files.
Duplicate
Unable to remove everything else in a folder except FileA
I guess that it is slightly similar to this:
delete [^Music]
However, it does not work.
Put the following command in your ~/.bashrc
shopt -s extglob
You can now delete everything in the folder except the Music folder with
rm -r !(Music)
Please be careful with this command.
It is powerful, but dangerous too.
I recommend always testing it first with the command
echo rm -r !(Music)
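Where extglob is not available (for example in a plain sh script), a similar effect can be had with find. The following is a sketch only; the function name remove_all_except is hypothetical, and -mindepth/-maxdepth are assumed to be supported (they are in GNU and BSD find).

```shell
#!/bin/sh
# remove_all_except DIR KEEP (hypothetical helper): delete every entry directly
# inside DIR except the one named KEEP. Refuses to run without both arguments.
remove_all_except() {
    dir=$1
    keep=$2
    [ -d "$dir" ] || return 1
    [ -n "$keep" ] || return 1
    # For a dry run, replace "-exec rm -r {} +" with "-print".
    find "$dir" -mindepth 1 -maxdepth 1 ! -name "$keep" -exec rm -r {} +
}
```

Unlike !(Music), which by default does not match hidden files, this also removes entries whose names start with a dot, so the -print dry run is worth doing first.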
In the fish shell, the command
rm (ls | grep -v '^Music$')
should work. If some of your "files" are also subdirectories, then you want to recursively delete them, too:
rm -r (ls | grep -v '^Music$')
Warning: rm -r can be dangerous and you could accidentally delete a lot of files. If you would like to confirm what you will be deleting, try looking at the output of
ls | grep -v '^Music$'
Explanation:
The ls command lists directory contents; without an argument, it defaults to the current directory.
The pipe symbol | redirects output to another command; when the output of ls is redirected in this way, it prints filenames one-per-line, rather than in a column format as you would see if you type ls at an interactive terminal.
The grep command matches lines for patterns; the -v switch means to print lines that don't match the pattern.
The pattern ^Music$ means to match a line starting and ending with Music -- that is, only the string Music; the effect of the ^ (beginning of line) and $ (end of line) characters can also be achieved with the -x switch, as in grep -vx Music.
The syntax command (subcommand) is fish's way of taking the output of one command and passing it over as command-line arguments to another.
The rm command removes files. By default, it does not remove directories, but the -r ("recursive") option changes that.
You can learn about these commands and more by typing man command, where command is what you want to learn about.
So I was looking all over for a way to remove all files in a directory except for some directories, and files, I wanted to keep around. After much searching I devised a way to do it using find.
find -E . -regex './(dir1|dir2|dir3)' -and -type d -prune -o -print -exec rm -rf {} \;
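Essentially it uses a regex to select the directories to exclude from the results, then removes everything that remains. Note that -E (extended regular expressions) is an option of BSD find, as shipped with macOS; GNU find does not accept it. Just wanted to put it out here in case someone else needed it.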