Scripting for file management with a very large number of files - bash

I have a three-machine OS X setup that was using Syncthing to keep shared drives synchronized remotely. Someone made some mistakes and a lot of files ended up getting renamed.
So all throughout this drive I have situations where there's a file of size 0KB named, for example, file.jpg, and another file with real size named
file.sync-conflict201705-4528.jpg. I need to search the entire drive recursively, and whenever I find a file with the sync-conflict string in its name, check whether the same file without the 'sync-conflict' string exists with a size of 0KB. If it does, I need to rename the sync-conflict file to overwrite the 0KB file.
I have considered tackling this with a bash script or a Perl script. Using bash, I think the 'find' command with -regex would get me started, but I don't really know how to process the results and run the next test. I am studying and working on it.
Same problem with Perl. I can get through the first step using File::Find::find and select what I need with a regex to filter the files, but there again I am stuck at the next step: finding the original file in the same directory and performing the necessary file move.
In both of these cases I am willing to put in the time to figure it out, but I wonder what the caveats will be? Can both of these scenarios handle recursing a large number of files without exception? Is there perhaps a better approach anyone can recommend?

One good tool in Perl for this is File::Find::Rule.
Find all sync-conflict files, then test whether the corresponding files exist and are zero size:
use warnings;
use strict;
use FindBin qw($RealBin);
use File::Copy qw(move);
use File::Find::Rule;

my $dir = shift || '.';  # top of hierarchy to search (from command line, or ./)

my @conflict_files = File::Find::Rule
    ->file->name('*sync-conflict*.jpg')->in($dir);

foreach my $conflict (@conflict_files)
{
    my ($file) = $conflict =~ m|(.*)\.sync-conflict|;
    $file .= '.jpg';

    if (-z "$RealBin/$file") {
        print "Rename $conflict to $file\n";
        #move($conflict, $file) or warn "Can't move $conflict to $file: $!";
    }
}
This builds the original file's name (file.jpg) for each file.sync-conflict*.jpg file and applies the -z file test (see -X in perldoc), which checks that the file exists and has zero size. It then renames the conflict file over it using move from the core File::Copy module.
Note that file-test operators need the full path while File::Find::Rule returns the path relative to the $dir it searches. I use $RealBin provided by FindBin, which is the path to the directory where the script was started with all links resolved, to build the full path for -z.
Uncomment the move line after sufficient testing (and with having made a backup first).
The code makes some assumptions about file names; please adjust as needed.
The $dir supplied on the command line is expected to be relative to the script's directory.

find is great. But as you've noted, you need more.
What find gets you in this scenario is the ability to search recursively and match certain patterns. As it happens, as of Bash version 4, you can do that right in the shell.
(Note that macOS ships with Bash version 3, so for this solution you'll need to install Bash 4 from MacPorts, Homebrew, or Fink.)
$ shopt -s globstar nullglob
$ for file in **/*sync-conflict2017*.*; do echo mv -v "$file" "${file%sync-conf*}${file##*.}"; done
mv -v file.sync-conflict201705-4528.jpg file.jpg
mv -v foo/bar.sync-conflict201705-4528.ext foo/bar.ext
You can remove the echo to actually run the mv command.
The way this works is that the double asterisk, **, is treated by bash like a * that also recurses into subdirectories. We're using parameter expansion to strip away the conflict marker and keep only the parts of the filename we need to construct the "target" filename.
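For completeness, here is a rough sketch (assuming bash 4+ with globstar and nullglob set, as above) that also checks that the plain-named file exists and is empty, per the original requirement, before doing the move:

for file in **/*sync-conflict*.jpg; do
    target="${file%sync-conf*}${file##*.}"
    # only overwrite if the non-conflict file exists and is 0 bytes
    if [[ -f "$target" && ! -s "$target" ]]; then
        echo mv -v "$file" "$target"    # drop the echo once you're happy with the output
    fi
done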

Create a function to fix the name:
$ function fixname() { file="$1"; newname=$( echo "$file" | sed "s/\.sync-conflict.*\.jpg$/.jpg/" ); if [ -f "$newname" -a ! -s "$newname" ]; then mv "$file" "$newname"; fi; }
Or, spread out a bit:
function fixname() {
    file="$1"
    newname=$( echo "$file" | sed "s/\.sync-conflict.*\.jpg$/.jpg/" )

    # If an empty file with the original name exists
    if [ -f "$newname" -a ! -s "$newname" ]; then
        mv "$file" "$newname"
    fi
}
Export the function:
$ export -f fixname
Run find to execute the function:
$ find . -type f -name \*sync-conflict\*.jpg -exec bash -c 'fixname {}' bash \;
Caveat: It will not work with spaces or funky characters in the filenames.
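If the filenames may contain spaces, a commonly used variant (a sketch; the function must still be exported as above) is to pass the name to bash -c as a positional argument rather than substituting {} into the command string:
$ find . -type f -name \*sync-conflict\*.jpg -exec bash -c 'fixname "$1"' bash {} \;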

Related

Cycle through a list of terms and batch rename files

EDIT: In the course of working on and reediting this question, I was able to get this to work. However, I'm sure there's a better way to do it, so I'm leaving it up to hear from those more experienced.
Periodically I need to reproduce several dozen copies of a few files. For example, given:
company_a_results_30d.py
company_a_results_90d.py
company_a_results_120d.py
company_a_results_all_time.py
I need to make copies where company_a is replaced with company_b, company_c, etc. (The next step is to find and replace a number of terms within the files, but that I have managed to do with a Perl script.)
I'm sure this should be possible with a bash script and mv, but I haven't quite got the hang of it. Something like:
#!/usr/bin/env bash
my_array=(company_b company_c company_d)
for i in "${my_array[@]}"
do
    for file in *.py
    do
        cp "$file" "${file/company_a/$i}"
    done
done
I'd prefer a solution compatible with zsh, which is what I use.
bash
Slightly modified from the OP's answer:
#!/usr/bin/env bash
set -x # So you can see what's happening - feel free to omit
company_a_files=(company_a*.py) # <== Save the list of files first
my_array=(company_b company_c company_d)
for i in "${my_array[@]}"
do
    for file in "${company_a_files[@]}" # <== Use the saved list
    do
        cp "$file" "${file/company_a/$i}"
    done
done
When the inner loop in the OP's answer runs for file in *.py, the glob will pick up whatever company_b &c. files have already been created. So you wind up with a lot of set -x output like:
+ cp company_b_1.py company_b_1.py
cp: 'company_b_1.py' and 'company_b_1.py' are the same file
Instead, save the glob of company_a files into a shell array first, and then
loop over that array.
perl
As a one-liner for Perl 5.14+:
perl -MFile::Copy=copy -E 'for my $file (@ARGV) { copy $file, $file =~ s/company_a/$_/r foreach qw(company_b company_c company_d) }' company_a*.py
The Perl version switches the loop order compared to the bash version. For each file given on the command line (the for ... @ARGV), it copies from that file to each name-modified file in turn (the foreach).
$file =~ s/company_a/$_/r is a non-destructive (/r) replace in $file (the filename) that changes company_a to $_ (the current value from foreach).
This was the solution I came up with:
#!/usr/bin/env bash
my_array=(company_b company_c company_d)
for i in "${my_array[@]}"
do
    for file in *.py
    do
        cp "$file" "${file/company_a/$i}"
    done
done

How to get a substring with varied length?

So I am writing a script which takes as input a path to a file (/path/to/file.ext); if the directory (/path/to) does not exist, it will run mkdir -p /path/to and then touch file.ext.
My question is this: how can I use cut to get the /path/to part when there is a potentially unknown number of /'s?
My script currently looks like this:
INPUT=$0
SUBSTRING_PATH=`$INPUT | cut -d'/' -f 2`
if [! -d $SUBSTRING_PATH]; then
mkdir -p $SUBSTRING_PATH
fi
touch $INPUT
Instead of cut, use dirname and basename:
input=/path/to/foo
dir=$(dirname "$input")
file=$(basename "$input")
Now $dir is /path/to and $file is foo.
dirname also gives you a valid directory for paths relative to the working directory ($(dirname file.txt) is .). This means, for example, that you can write "$dir/some/stuff/foo" without having to worry that you end up in a completely different directory tree (such as /some/stuff rather than ./some/stuff).
As @ruakh mentions in the comments, if you didn't have a directory but a string of tokens of which you wanted to discard the last (a line of a CSV file, perhaps), one way to do it would be "${input%,*}", where the comma can be replaced by any delimiter. To my knowledge this is a bash extension. I only edit this in because a stray visitor in the future might have better luck seeing it here than in the comments; for your particular use case, dirname and basename are a better fit.
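Putting it together for the original task, a minimal sketch (assuming the path is passed as the first argument, $1, rather than $0):

#!/usr/bin/env bash
input=$1                    # e.g. /path/to/file.ext
dir=$(dirname "$input")     # /path/to

# create the directory if needed, then create the file
if [ ! -d "$dir" ]; then
    mkdir -p "$dir"
fi
touch "$input"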

Recursively search a directory for each file in the directory on IBMi IFS

I'm trying to write two (edit: shell) scripts and am having some difficulty. I'll explain the purpose and then provide the script and current output.
1: get a list of every file name in a directory recursively. Then search the contents of all files in that directory for each file name. Should return the path, filename, and line number of each occurrence of the particular file name.
2: get a list of every file name in a directory recursively. Then search the contents of all files in the directory for each file name. Should return the path and filename of each file which is NOT found in any of the files in the directories.
I ultimately want to use script 2 to find and delete (actually move them to another directory for archiving) unused files in a website. Then I would want to use script 1 to see each occurrence and filter through any duplicate filenames.
I know I can make script 2 move each file as it is running rather than as a second step, but I want to confirm the script functions correctly before I do any of that. I would modify it after I confirm it is functioning correctly.
I'm currently testing this on an IBM i system in STRQSH.
My test folder structure is:
scriptTest
---subDir1
------file4.txt
------file5.txt
------file6.txt
---subDir2
------file1.txt
------file7.txt
------file8.txt
------file9.txt
---file1.txt
---file2.txt
---file3.txt
I have text in some of those files which contains existing file names.
This is my current script 1:
#!/bin/bash
files=`find /www/Test/htdocs/DLTest/scriptTest/ ! -type d -exec basename {} \;`
for i in $files
do
    grep -rin $i "/www/Test/htdocs/DLTest/scriptTest" >> testReport.txt;
done
Right now it functions correctly except that it doesn't provide the path to the file that had a match. Doesn't grep return the file path by default?
I'm a little further away with script 2:
#!/bin/bash
files=`find /www/Test/htdocs/DLTest/scriptTest/ ! -type d`
for i in $files
do
    #split $i on '/' and store into an array
    IFS='/' read -a array <<< "$i"
    #get last element of the array
    echo "${array[-1]}"
    #perform a grep similar to script 2 and store it into a variable
    filename="grep -rin $i "/www/Test/htdocs/DLTest/scriptTest" >> testReport.txt;"
    #Check if the variable has anything in it
    if [ $filename = "" ]
    #if not then output $i for the full path of the current needle.
    then echo $i;
    fi
done
I don't know how to split the string $i into an array. I keep getting an error on line 6
001-0059 Syntax error on line 6: token redirection not expected.
I'm planning on trying this on an actual linux distro to see if I get different results.
I appreciate any insight in advance.
Introduction
This isn't really a full solution, as I'm not 100% sure I understand what you're trying to do. However, the following contain pieces of a solution that you may be able to stitch together to do what you want.
Create Test Harness
cd /tmp
mkdir -p scriptTest/subDir{1,2}
touch scriptTest/subDir1/file{4,5,6}.txt
touch scriptTest/subDir2/file{1,8}.txt
touch scriptTest/file{1,2,3}.txt
Finding and Deleting Duplicates
In the most general sense, you could use find's -exec flag or a Bash loop to run grep or another comparison on your files. However, if all you're trying to do is remove duplicates, you might simply be better off using the fdupes or duff utilities to identify (and optionally remove) files with duplicate contents.
For example, given that all the .txt files in the test corpus are zero-length duplicates, consider the following duff and fdupes examples.
duff
Duff has more options, but won't delete files for you directly. You'll likely need to use a command like duff -e0 * | xargs -0 rm to delete duplicates. To find duplicates using the default comparisons:
$ duff -r scriptTest/
8 files in cluster 1 (0 bytes, digest da39a3ee5e6b4b0d3255bfef95601890afd80709)
scriptTest/file1.txt
scriptTest/file2.txt
scriptTest/file3.txt
scriptTest/subDir1/file4.txt
scriptTest/subDir1/file5.txt
scriptTest/subDir1/file6.txt
scriptTest/subDir2/file1.txt
scriptTest/subDir2/file8.txt
fdupes
This utility offers the ability to delete duplicates directly in various ways. One such way is to invoke fdupes . --delete --noprompt once you're confident that you're ready to proceed. However, to find the list of duplicates:
$ fdupes -R scriptTest/
scriptTest/subDir1/file4.txt
scriptTest/subDir1/file5.txt
scriptTest/subDir1/file6.txt
scriptTest/subDir2/file1.txt
scriptTest/subDir2/file8.txt
scriptTest/file1.txt
scriptTest/file2.txt
scriptTest/file3.txt
Get a List of All Files, Including Non-Duplicates
$ find scriptTest -name \*.txt
scriptTest/file1.txt
scriptTest/file2.txt
scriptTest/file3.txt
scriptTest/subDir1/file4.txt
scriptTest/subDir1/file5.txt
scriptTest/subDir1/file6.txt
scriptTest/subDir2/file1.txt
scriptTest/subDir2/file8.txt
You could then act on each file with find's -exec ... {} + feature, or simply use a grep that supports the --recursive and --files-with-matches flags to find files with matching content.
Passing Find Results to a Bash Loop as an Array
Alternatively, if you know for sure that you won't have spaces in the file names, you can also use a Bash array to store the files into a variable you can iterate over in a Bash for-loop. For example:
files=( $(find scriptTest -name \*.txt) )
for file in "${files[@]}"; do
    : # do something with each "$file"
done
Looping like this is often slower, but may provide you with the additional flexibility you need if you're doing something complicated. YMMV.
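Putting those pieces together for "script 2", here is a rough sketch (assuming GNU grep and no spaces in the file names) that reports every file whose name is not mentioned in any other file under the tree:

#!/bin/bash
top=/www/Test/htdocs/DLTest/scriptTest

find "$top" -type f | while read -r path; do
    name=$(basename "$path")
    # -r recurse, -l list matching files, -F match the name literally,
    # --exclude skips the file itself when searching for its own name
    if ! grep -rlF --exclude="$name" "$name" "$top" > /dev/null; then
        echo "unreferenced: $path"
    fi
done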

Copying unique files to a new location

I am a total newbie to Linux and bash scripting and am currently stumped with this problem!
I have a directory containing many images from which I need to copy the unique images to a new location. I know there are numerous options for how to go about doing this but have very limited knowledge at the moment so appreciate I may be going about this the wrong way.
I used find and cat to create this list and have attempted to copy the files across with the intention of comparing them (using md5 and checking file names) when they are there.
However, the text file has 30 files on it but only 18 have been copied over. Can anyone advise?
My code to find files is -
find $1 -name "IMG_****.JPG" | cat > list.txt
and my code to copy from the list is
for image in $(cat list.txt);
do
cp $image $2
done
You're making this much too complicated. Do not pipe find's output through cat to build the list; that is an unnecessary use of cat. If you need the list file, you can redirect the program's output directly:
find "$1" -name "IMG_*.JPG" > list.txt
Also, do not use for to read lines from a file. Better use while with read:
while read -r filename; do
cp "$filename" "$2"
done < list.txt
But it's even easier. You can just work with the files directly from find:
find "$1" -name "IMG_*.JPG" -exec cp {} "$2" \;
Here, {} will be replaced by each filename that find finds. Don't forget to quote your variables, so that spaces in file paths are no problem.
Another much simpler method with Bash options:
shopt -s nullglob globstar
cp -t "$2" -- "$1"/**/IMG_*.JPG
Here, globstar enables recursive matching of directories through **. The -t option to cp specifies the target of the copy operation.* The command will be expanded to cp -t target -- source1/IMG_foo.JPG source2/IMG_bar.JPG et cetera.
Now, as to your original issue, it could have been that some images have a space in their name. This would have broken your original script. If your image files contained a newline in their name, it also wouldn't have worked with while read … – but you would have gotten an error in that case of a file not being found.
Also note that cp overwrites files with the same name. Without asking for confirmation. So if in your subdirectories there are images with the same filename, you'd only get one result, with the latest overwriting the existing one.
* The -- isn't strictly necessary, but it's a good habit to include it to tell the command when the option arguments are over.
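As a quick pre-flight check for such collisions, you could list any basenames that occur more than once before copying (a sketch using the same name pattern as above):
find "$1" -name "IMG_*.JPG" -exec basename {} \; | sort | uniq -d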

Shell script: execute cmd on a file, with additional processing of file name

So I am going to post a question about shell scripting again.
Problem Definition: For all files under a dir, ex.:
A_anything.txt, B_anything.txt, ......
I want to execute a script, say 'CMD', on each of them, with the output files named like:
A_result.txt, B_result.txt, ......
In addition, at the first line of these output file, I want to have the file name of the original one.
The 'find -exec' utility doesn't seem to be able to extract part of the file name.
Does someone know a solution to this problem, by any means (shell, Python, find, etc.)? Thank you!
cd /directory
for file in *.txt ; do
    newfilename=`echo "$file" | sed 's/\(.\+\)_.*/\1_result.txt/'`
    echo "$file" > "$newfilename"
    your-command "$file" >> "$newfilename"
done
HTH
Well, there's more than one way to do it (including using Perl, where that's the motto), but probably I'd write it like this:
find . -name '[A-Z]_*.txt' -type f -print0 |
xargs -0 modify_rename.sh
And then I'd write the script modify_rename.sh like this:
#!/bin/sh
for file in "$@"
do
    dirname=$(dirname "$file")
    basename=$(basename "$file" .txt)
    leadname=${basename%_*}
    outname="$dirname/${leadname}_result.txt"
    # Optionally check for pre-existence of $outname
    {
        # Optionally echo "$basename.txt" instead of "$file"
        echo "$file"
        # Does this invocation of CMD write to standard output?
        # If not, adjust invocation appropriately.
        CMD "$file"
    } > "$outname"
done
The advantage of this separation into separate scripting operations is that the rename/modify operation can be checked out separately from the search process - which runs less risk of zapping your entire directory structure with bad commands.
Bash has the tools to avoid invoking basename and dirname but the notation is moderately excruciating; I find the clarity of the command names worth having. I'd be happy if bash implemented them as built-ins. There are plenty of other ways to get the prefix of the file; this should be safe, though, even in the presence of spaces (tabs, newlines) in file or directory names because of the careful use of double quotes.
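For reference, a sketch of the parameter expansions that could replace the dirname/basename calls above (assuming the path contains a slash and ends in .txt):

file=./some/dir/A_anything.txt
dirname=${file%/*}            # ./some/dir
basename=${file##*/}          # A_anything.txt
basename=${basename%.txt}     # A_anything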
