EDIT: In the course of working on and reediting this question, I was able to get this to work. However, I'm sure there's a better way to do it, so I'm leaving it up to hear from those more experienced.
Periodically I need to reproduce several dozen copies of a few files. For example, given:
company_a_results_30d.py
company_a_results_90d.py
company_a_results_120d.py
company_a_results_all_time.py
I need to make copies where company_a is replaced with company_b, company_c, etc. (The next step is to find and replace a number of terms within the files, but that I have managed to do with a perl script.)
I'm sure this should be possible with a bash script and cp, but I haven't quite got the hang of it. Something like:
#!/usr/bin/env bash
my_array=(company_b company_c company_d)
for i in "${my_array[@]}"
do
for file in *.py
do
cp "$file" "${file/company_a/$i}"
done
done
I'd prefer a solution compatible with zsh, which is what I use.
bash
Slightly modified from the OP's answer:
#!/usr/bin/env bash
set -x # So you can see what's happening - feel free to omit
company_a_files=(company_a*.py) # <== Save the list of files first
my_array=(company_b company_c company_d)
for i in "${my_array[@]}"
do
for file in "${company_a_files[@]}" # <== Use the saved list
do
cp "$file" "${file/company_a/$i}"
done
done
When the inner loop in the OP's answer runs for file in *.py, the glob will pick up whatever company_b &c. files have already been created. So you wind up with a lot of set -x output like:
+ cp company_b_results_30d.py company_b_results_30d.py
cp: 'company_b_results_30d.py' and 'company_b_results_30d.py' are the same file
Instead, save the glob of company_a files into a shell array first, and then
loop over that array.
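To see the difference for yourself in a scratch directory (the setup below is invented for illustration), the saved-array version runs cleanly where the naive glob version emits "same file" errors:

mkdir /tmp/copytest && cd /tmp/copytest
touch company_a_results_30d.py company_a_results_90d.py
company_a_files=(company_a*.py)          # snapshot the glob before any copies exist
for i in company_b company_c; do
    for file in "${company_a_files[@]}"; do
        cp "$file" "${file/company_a/$i}"
    done
done
ls    # the company_a originals plus company_b and company_c copies; no self-copy errors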
perl
As a one-liner for Perl 5.14+:
perl -MFile::Copy=copy -E 'for my $file (@ARGV) { copy $file, $file =~ s/company_a/$_/r foreach qw(company_b company_c company_d) }' company_a*.py
The Perl version switches the loop order compared to the bash version. For each file given on the command line (the for ... @ARGV), it copies from that file to each name-modified file in turn (the foreach).
$file =~ s/company_a/$_/r is a non-destructive (/r) replace in $file (the filename) that changes company_a to $_ (the current value from foreach).
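To watch the non-destructive substitution on its own (a throwaway illustration, separate from the one-liner above):

perl -E 'my $f = "company_a_results_30d.py"; say $f =~ s/company_a/company_b/r; say $f'
# company_b_results_30d.py
# company_a_results_30d.py   <== $f itself is unchanged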
This was the solution I came up with:
#!/usr/bin/env bash
my_array=(company_b company_c company_d)
for i in "${my_array[@]}"
do
for file in *.py
do
cp "$file" "${file/company_a/$i}"
done
done
Related
I have a three-machine OSX setup that was using Syncthing to keep shared drives synchronized remotely. Someone made some mistakes and a lot of files ended up getting renamed.
So all throughout this drive I have situations where there's a file of size 0KB named, for example, file.jpg and another file with real size named file.sync-conflict201705-4528.jpg. I need to search the entire drive recursively and, whenever I find a file with the sync-conflict string in its name, check whether the same file without the 'sync-conflict' string exists with a size of 0KB. If it does, I need to rename the sync-conflict file to overwrite the 0KB file.
I have considered tackling this with a bash script or a Perl script. Using bash, I think the 'find' command with -regex would get me started, but I don't really know how to process the results and run the next test. I am studying and working on it.
Same problem with Perl. I can get through the first step using File::Find::find and select what I need with a regex to filter out the files, but there again I am stuck at the next step, which would be finding the original file in the same directory and performing the necessary file move.
In both cases I am willing to put in the time to figure it out, but I wonder what the caveats will be. Can both of these approaches handle recursing over a large number of files without problems? Is there perhaps a better approach anyone can recommend?
One good tool in Perl for this is File::Find::Rule.
Find all sync-conflict files, then test whether corresponding files exist and are zero size
use warnings;
use strict;
use FindBin qw($RealBin);
use File::Copy qw(move);
use File::Find::Rule;
my $dir = shift || '.'; # top of hierarchy to search (from command line, or ./)
my @conflict_files = File::Find::Rule
->file->name('*sync-conflict*.jpg')->in($dir);
foreach my $conflict (@conflict_files)
{
my ($file) = $conflict =~ m|(.*)\.sync-conflict|;
$file .= '.jpg';
if (-z "$RealBin/$file") {
print "Rename $conflict to $file\n"
#move($conflict, $file) or warn "Can't move $conflict to $file: $!";
}
}
This builds the file's name file for each file.sync-conflict file and applies the -z file test (see the -X entry in perlfunc), which checks both that the file exists and that it has zero size. Then it renames the conflict file using move from the core File::Copy module.
Note that file-test operators need the full path, while File::Find::Rule returns paths relative to the $dir it searches. I use $RealBin, provided by FindBin, which is the path to the directory where the script was started, with all links resolved, to build the full path for -z.
Uncomment the move line after sufficient testing (and with having made a backup first).
The code makes some assumptions about file names, please adjust as needed.
The $dir supplied on the command line is expected to be relative to the script's directory.
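For example, if the script were saved as fix_conflicts.pl (a name invented here) in the top directory of the drive, a dry run over one subdirectory would look like:

perl fix_conflicts.pl photos    # prints the planned renames; 'photos' is relative to the script's directory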
find is great. But as you've noted, you need more.
What find gets you in this scenario is the ability to search recursively and match certain patterns. As it happens, as of Bash version 4 you can do that right in the shell.
(Note that macOS ships with bash version 3, so for this solution, you'll need to install bash 4 from Macports, Homebrew or Fink.)
$ shopt -s globstar nullglob
$ for file in **/*sync-conflict2017*.*; do echo mv -v "$file" "${file%sync-conf*}${file##*.}"; done
mv -v file.sync-conflict201705-4528.jpg file.jpg
mv -v foo/bar.sync-conflict201705-4528.ext foo/bar.ext
You can remove the echo to actually run the mv command.
The way this works is that the double asterisk, **, is treated by bash like a * that recurses. We're using parameter expansion to strip the parts of the filename we want in order to construct the "target" filename.
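To see the two parameter expansions in isolation (the file name is invented for illustration):

file=foo/bar.sync-conflict201705-4528.jpg
echo "${file%sync-conf*}"    # foo/bar.  <== shortest suffix matching 'sync-conf*' removed
echo "${file##*.}"           # jpg       <== longest prefix up to the last dot removed
echo "${file%sync-conf*}${file##*.}"     # foo/bar.jpg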
Create a function to fix the name:
$ function fixname() { file="$1"; newname=$( echo "$file" | sed "s/\.sync-conflict.*\.jpg$/.jpg/" ); if [ -f "$newname" -a ! -s "$newname" ]; then mv "$file" "$newname"; fi; }
Or, spread out a bit:
function fixname() {
file="$1"
newname=$( echo "$file" | sed "s/\.sync-conflict.*\.jpg$/.jpg/" )
# If empty file exists
if [ -f "$newname" -a ! -s "$newname" ]; then
mv "$file" "$newname"
fi
}
Export the function:
$ export -f fixname
Run find to execute the function:
$ find . -type f -name \*sync-conflict\*.jpg -exec bash -c 'fixname {}' bash \;
Caveat: It will not work with spaces or funky characters in the filenames.
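If you do need to handle spaces, one common variation (a sketch, not part of the original answer) is to pass the file name to bash -c as a positional parameter rather than splicing {} into the command string:

$ find . -type f -name \*sync-conflict\*.jpg -exec bash -c 'fixname "$1"' bash {} \;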
I have a bunch of files (more than 1000) like the following:
$ ls
org.allenai.ari.solvers.termselector.BaselineLearnersurfaceForm-dev.lc
org.allenai.ari.solvers.termselector.BaselineLearnersurfaceForm-dev.lex
org.allenai.ari.solvers.termselector.BaselineLearnersurfaceForm-train.lc
org.allenai.ari.solvers.termselector.BaselineLearnersurfaceForm-train.lex
org.allenai.ari.solvers.termselector.BaselineLearnersurfaceForm.lc
org.allenai.ari.solvers.termselector.BaselineLearnersurfaceForm.lex
org.allenai.ari.solvers.termselector.ExpandedLearner.lc
org.allenai.ari.solvers.termselector.ExpandedLearner.lex
org.allenai.ari.solvers.termselector.ExpandedLearnerSVM.lc
org.allenai.ari.solvers.termselector.ExpandedLearnerSVM.lex
....
I have to rename these files by adding learners right before the capitalized name. For example
org.allenai.ari.solvers.termselector.BaselineLearnersurfaceForm.lex
would change to
org.allenai.ari.solvers.termselector.learners.BaselineLearnersurfaceForm.lex
and this one
org.allenai.ari.solvers.termselector.ExpandedLearner.lc
would change to
org.allenai.ari.solvers.termselector.learners.ExpandedLearner.lc
Any ideas how to do this automatically?
for f in org.*; do
echo mv "$f" "$( sed 's/\.\([A-Z]\)/.learner.\1/' <<< "$f" )"
done
This short loop outputs an mv command that renames the files in the manner that you wanted. Run it as-is first, and when you are certain it's doing what you want, remove the echo and run again.
The sed bit in the middle takes a filename ($f, via a here-string, so this requires bash) and replaces the first occurrence of a dot followed by a capital letter with .learners. followed by that same capital letter.
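To sanity-check the substitution on a single name before the loop runs anything:

sed 's/\.\([A-Z]\)/.learners.\1/' <<< org.allenai.ari.solvers.termselector.ExpandedLearner.lc
# org.allenai.ari.solvers.termselector.learners.ExpandedLearner.lc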
There is a tool called perl-rename, sometimes rename. Not to be confused with rename from util-linux.
It's very good for tasks like this as it takes a perl expression and renames accordingly:
perl-rename 's/(?=\.[A-Z])/.learners/' *
You can play with the regex online
Alternatively, you can use a for loop and $BASH_REMATCH:
for file in *; do
[ -e "$file" ] || continue
[[ "$file" =~ ^([^A-Z]*)(.*)$ ]]
mv -- "$file" "${BASH_REMATCH[1]}learners.${BASH_REMATCH[2]}"
done
A very simple approach (useful if you only need to do this one time) is to ls >dummy the names into a text file dummy, and then use find/replace in a text editor to make lines of the form mv xxx.yyy xxx.learners.yyy. Then you can simply execute the resulting file with sh dummy.
The exact find/replace commands depend on the text editor you use, but something like
replace org. with mv org.. That gets you the mv in the beginning.
replace mv org.allenai.ari.solvers.termselector.$1 with mv org.allenai.ari.solvers.termselector.$1 org.allenai.ari.solvers.termselector.learners.$1 to duplicate the filename and insert the learners.
There is also a for syntax which could probably do it in one (long) line, but I cannot explain it; try help for if you want to learn about it.
Hi, I have a script that sorts some code and reformats it. I have over 200 files to apply it to, with incremental names run001, run002, etc. Is there a quick way to write a shell script to execute this over all the files? The executable creates a new file called run001an, etc., so just running it over all files containing run doesn't work. How do I increment the file number?
Cheers
how about:
for i in ./run*; do
process_the_file "$i"
done
which is valid Bash/Ksh
To be more specific with the run### files, you can use
for file in dir/run[0-9][0-9][0-9]; do
do_something "$file"
done
dir could simply be . or some other directory. If directory names contain spaces, quote the directory part only, e.g. "some dir"/run[0-9][0-9][0-9]; the glob itself must stay unquoted.
In bash, you can make use of extended patterns to match any number of digits, not just three:
shopt -s extglob
for file in dir/run+([0-9]); do
do_something "$file"
done
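For example, with extglob enabled, run+([0-9]) matches run001 and run1234 alike, but still skips generated outputs like run001an, because a glob must match the whole name:

shopt -s extglob
ls -d run+([0-9])    # run001 run002 ... but not run001an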
So I am going to post a question about shell scripting again.
Problem Definition: For all files under a dir, e.g.:
A_anything.txt, B_anything.txt, ......
I want to execute a script, say 'CMD', on each of them, with the output files named like:
A_result.txt, B_result.txt, ......
In addition, the first line of each output file should contain the file name of the original.
The 'find -exec' utility seems to me unable to extract part of the file name.
Does someone know a solution to this problem, by any means (shell, python, find, etc.)? Thank you!
cd /directory
for file in *.txt ; do
newfilename=`echo "$file"|sed 's/\(.\+\)_.*/\1_result.txt/'`
echo "$file" > "$newfilename"
your-command "$file" >> "$newfilename"
done
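To check the sed expression against one name before the loop touches anything:

echo "A_anything.txt" | sed 's/\(.\+\)_.*/\1_result.txt/'    # A_result.txt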
HTH
Well, there's more than one way to do it (including using Perl, where that's the motto), but probably I'd write it like this:
find . -name '[A-Z]_*.txt' -type f -print0 |
xargs -0 modify_rename.sh
And then I'd write the script modify_rename.sh like this:
#!/bin/sh
for file in "$#"
do
dirname=$(dirname "$file")
basename=$(basename "$file" .txt)
leadname=${file%_*}
outname="$dirname/${leadname}_result.txt"
# Optionally check for pre-existence of $outname
{
# Optionally echo "$basename.txt" instead of "$file"
echo "$file"
# Does this invocation of CMD write to standard output?
# If not, adjust invocation appropriately.
CMD "$file"
} > "$outname"
done
The advantage of this separation into separate scripting operations is that the rename/modify operation can be checked out separately from the search process - which runs less risk of zapping your entire directory structure with bad commands.
Bash has the tools to avoid invoking basename and dirname, but the notation is moderately excruciating; I find the clarity of the command names worth having. I'd be happy if bash implemented them as built-ins. There are plenty of other ways to get the prefix of the file; this should be safe, though, even in the presence of spaces (tabs, newlines) in file or directory names because of the careful use of double quotes.
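For reference, the pure-bash near-equivalents look like this (note that ${file%/*} gives the whole name when there is no slash, where dirname would print '.'):

file=path/to/A_anything.txt      # invented example path
dir=${file%/*}        # path/to         <== like dirname "$file"
base=${file##*/}      # A_anything.txt  <== like basename "$file"
stem=${base%.txt}     # A_anything      <== like basename "$file" .txt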
I have a perl script (or any executable) E which will take a file foo.xml and write a file foo.txt. I use a Beowulf cluster to run E for a large number of XML files, but I'd like to write a simple job server script in shell (bash) which doesn't overwrite existing txt files.
I'm currently doing something like
#!/bin/sh
PATTERN="[A-Z]*0[1-2][a-j]"; # this matches foo in all cases
todo=`ls *.xml | grep $PATTERN -o`;
isdone=`ls *.txt | grep $PATTERN -o`;
whatsleft=todo - isdone; # what's the unix magic?
#tack on the .xml prefix with sed or something
#and then call the job server;
jobserve E "$whatsleft";
and then I don't know how to get the difference between $todo and $isdone. I'd prefer using sort/uniq to something like a for loop with grep inside, but I'm not sure how to do it (pipes? temporary files?)
As a bonus question, is there a way to do lookahead search in bash grep?
To clarify/extend the problem:
I have a bunch of programs that take input from sources like (but not necessarily) data/{branch}/special/{pattern}.xml and write output to another directory results/special/{branch}-{pattern}.txt (or data/{branch}/intermediate/{pattern}.dat, e.g.). I want to check in my jobfarming shell script if that file already exists.
So E transforms data/{branch}/special/{pattern}.xml->results/special/{branch}-{pattern}.dat, for instance. I want to look at each instance of the input and check if the output exists. One (admittedly simpler) way to do this is just to touch *.done files next to each input file and check for those results, but I'd rather not manage those, and sometimes the jobs terminate improperly so I wouldn't want them marked done.
N.B. I don't need to check concurrency yet or lock any files.
So a simple, clear way to solve the above problem (in pseudocode) might be
for i in `/bin/ls *.xml`
do
replace xml suffix with txt
if [that file exists]
add to whatsleft list
end
done
but I'm looking for something more general.
#!/bin/bash
shopt -s extglob # allow extended glob syntax, for matching the filenames
LC_COLLATE=C # use a sort order comm is happy with
IFS=$'\n' # so filenames can have spaces but not newlines
# (newlines don't work so well with comm anyhow;
# shame it doesn't have an option for null-separated
# input lines).
files_todo=( **([A-Z])0[1-2][a-j]*.xml )
files_done=( **([A-Z])0[1-2][a-j]*.txt )
files_remaining=( \
$(comm -23 --nocheck-order \
<(printf "%s\n" "${files_todo[@]%.xml}") \
<(printf "%s\n" "${files_done[@]%.txt}") ))
echo jobserve E $(for f in "${files_remaining[@]}"; do printf "%s\n" "${f}.xml"; done)
This assumes that you want a single jobserve E call with all the remaining files as arguments; it's rather unclear from the specification if such is the case.
Note the use of extended globs rather than parsing the output of ls, which is considered very poor practice.
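As a toy reminder of what comm -23 leaves behind (lines unique to the first input; the columns for 'only in the second' and 'in both' are suppressed):

comm -23 <(printf '%s\n' a b c) <(printf '%s\n' b)    # prints: a c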
To transform input to output names without using anything other than shell builtins, consider the following:
if [[ $in_name =~ data/([^/]+)/special/([^/]+)\.xml ]] ; then
out_name=results/special/${BASH_REMATCH[1]}-${BASH_REMATCH[2]}.dat
else
: # ...handle here the fact that you have a noncompliant name...
fi
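A quick usage example of that test (the path is invented):

in_name=data/branch1/special/pattern7.xml
[[ $in_name =~ data/([^/]+)/special/([^/]+)\.xml ]] &&
    echo "results/special/${BASH_REMATCH[1]}-${BASH_REMATCH[2]}.dat"
# prints: results/special/branch1-pattern7.dat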
The question title suggests that you might be looking for:
set -o noclobber
The question content indicates a wholly different problem!
It seems you want to run 'jobserve E' on each '.xml' file without a matching '.txt' file. You'll need to assess the TOCTOU (Time of Check, Time of Use) problems here because you're in a cluster environment. But the basic idea could be:
todo=""
for file in *.xml
do [ -f "${file%.xml}.txt" ] || todo="$todo $file"
done
jobserve E $todo
This will work with Korn shell as well as Bash. In Bash you could explore making 'todo' into an array; that will deal with spaces in file names better than this will.
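A sketch of that array variant (bash-only; it keeps names with spaces intact):

todo=()
for file in *.xml
do [ -f "${file%.xml}.txt" ] || todo+=("$file")
done
jobserve E "${todo[@]}"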
If you have processes still generating '.txt' files for '.xml' files while you run this check, you will get some duplicated effort (because this script cannot tell that the processing is happening). If the 'E' process creates the corresponding '.txt' file as it starts processing, that minimizes the chance of duplicated effort. Or consider separating the processed files from the unprocessed ones, so the 'E' process moves the '.xml' file from a 'to-be-done' directory to a 'done' directory (and writes the '.txt' file to the 'done' directory too). If done carefully, this can avoid most of the multi-processing problems. For example, you could link the '.xml' into the 'done' directory when processing starts, and ensure appropriate cleanup with an atexit() handler (if you are moderately confident your processing program does not crash). Or other trickery of your own devising.
whatsleft=$( ls *.xml *.txt | grep $PATTERN -o | sort | uniq -u )
Note this actually gets a symmetric difference.
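That is, uniq -u keeps only lines occurring exactly once, so a name present in both listings drops out, but a stray .txt without a matching .xml survives too:

printf '%s\n' a b b c | sort | uniq -u    # prints: a c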
I am not exactly sure what you want, but you can check for the existence of the file first and, if it exists, create a new name. (Or do this check in your E perl script.)
if [ -f "$file" ];then
newname="...."
fi
...
jobserve E .... > "$newname"
If it's not what you want, describe more clearly in your question what you mean by "don't overwrite files".
For posterity's sake, this is what I found to work:
TMPA='neverwritethis.tmp'
TMPB='neverwritethat.tmp'
ls *.xml | grep $PATTERN -o > $TMPA;
ls *.txt | grep $PATTERN -o > $TMPB;
whatsleft=$( sort $TMPA $TMPB | uniq -u | sed "s/$/.xml/" );
rm $TMPA $TMPB;