Using 'diff' with mismatched directories and filenames - shell

I have two separate directories that mostly contain the same files, but the directory structure is completely different between the two. The filenames do not correspond either.
So, for example:
FOLDER 1
  - Subfolder A
    - file1
    - file2
  - Subfolder B
    - file3
    - file4
FOLDER 2
  - Subfolder C
    - Subfolder C1
      - file5
      - file6
      - file7
    - Subfolder C2
      - file8
      - file9
Let's suppose that file1=file5, file2=file6, file3=file7, file4=file8
And file9 is unmatched.
Is there some combination of options to the diff command that will identify the matches? Doing a recursive diff with -r doesn't seem to do the job.

This is a way to get the different and/or identical files with find and xargs:
find FOLDER1 -type f -print0 |
xargs -0 -I % find FOLDER2 -type f -exec diff -qs --from-file="%" '{}' \+
Sample output:
Files FOLDER1/SubfolderB/file3 and FOLDER2/SubfolderC/SubfolderC1/file5 differ
Files FOLDER1/SubfolderB/file3 and FOLDER2/SubfolderC/SubfolderC1/file7 are identical
So you can filter the ones you want with grep (see the example below).
Note that this solution supports filenames with embedded spaces and special characters (e.g. newlines), so you don't have to worry about them.
Explanation
For every file in FOLDER1 (find FOLDER1 -type f -print0), it executes:
find FOLDER2 -type f -exec diff -qs --from-file="%" '{}' \+
That calls find again to get all the files in FOLDER2 and, after substitution, executes:
diff -qs --from-file="<a file from FOLDER1>" <all the files from FOLDER2>
From man diff:
--from-file=FILE1
Compare FILE1 to all operands. FILE1 can be a directory.
Example
This is the directory tree and the file content:
$ find FOLDER1 FOLDER2 -type f -exec sh -c 'echo "$0": && cat "$0"' '{}' \;
FOLDER1/SubfolderA/file1:
1=5
FOLDER1/SubfolderA/file2:
2=6
FOLDER1/SubfolderB/file3:
3=7
FOLDER1/SubfolderB/file4:
4=8
FOLDER2/SubfolderC/SubfolderC1/file5:
1=5
FOLDER2/SubfolderC/SubfolderC1/file6:
2=6
FOLDER2/SubfolderC/SubfolderC1/file7:
3=7
FOLDER2/SubfolderC/SubfolderC2/file8:
4=8
FOLDER2/SubfolderC/SubfolderC2/file9:
anything
And this is the command (pipeline) getting just the identical ones:
$ find FOLDER1 -type f -print0 |
> xargs -0 -I % find FOLDER2 -type f -exec diff -qs --from-file="%" '{}' \+ |
> grep "identical$"
Files FOLDER1/SubfolderA/file1 and FOLDER2/SubfolderC/SubfolderC1/file5 are identical
Files FOLDER1/SubfolderA/file2 and FOLDER2/SubfolderC/SubfolderC1/file6 are identical
Files FOLDER1/SubfolderB/file3 and FOLDER2/SubfolderC/SubfolderC1/file7 are identical
Files FOLDER1/SubfolderB/file4 and FOLDER2/SubfolderC/SubfolderC2/file8 are identical
Enhanced solution with bash's Process Substitution and Arrays
If you're using bash, you can first save all the FOLDER2 filenames in an array to avoid calling find for each file in FOLDER1:
# first of all, we save all the FOLDER2 filenames (recursively) in an array
while read -r -d '' file; do
    folder2_files+=("$file")
done < <(find FOLDER2 -type f -print0)
# now we compare each file in FOLDER1 with the files in the array
find FOLDER1 -type f -exec diff -qs --from-file='{}' "${folder2_files[@]}" \; |
grep "identical$"
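If your bash is 4.4 or newer, mapfile can read the NUL-delimited find output directly, which shortens the loop above; a minimal sketch under that assumption:
# assumes bash 4.4+ (mapfile -d '' needs it)
mapfile -d '' folder2_files < <(find FOLDER2 -type f -print0)
find FOLDER1 -type f -exec diff -qs --from-file='{}' "${folder2_files[@]}" \; |
grep "identical$"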

Create a temporary Git repository. Add the first directory tree to it and commit.
Remove all the files, add the second directory tree, and make a second commit.
A git diff between those two commits will turn on rename detection, and you will probably see something more enlightening.
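A minimal sketch of that idea, assuming FOLDER1 and FOLDER2 sit next to the temporary repository (the paths and commit messages are illustrative):
mkdir tmp-repo && cd tmp-repo && git init -q
cp -R ../FOLDER1/. . && git add -A && git commit -qm "tree 1"
git rm -r -q . && cp -R ../FOLDER2/. . && git add -A && git commit -qm "tree 2"
# -M turns on rename detection; matched files show up as R100 <old> <new>
git diff -M --name-status HEAD^ HEAD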

Related

How can I diff two directories in bash recursively for only 1 file name?

Currently I am trying this:
diff -r /develop /us-prod
which shows all the differences between the two, but all I really care about here is a file named schema.json, which is guaranteed to be there in all directories, but this file can be different.
I want to diff these two directories, but only if the file name is schema.json.
I see that you can use -x to exclude files, but it is difficult to say which other files could be in there.
Some files are guaranteed to be there, but some are not. Is there an "include" option rather than an exclude?
You can try this:
find /develop -type f -name schema.json -exec bash -c \
    'diff "$1" "/us-prod${1#/develop}"' _ {} \;
Assuming both directories have just one schema.json file each, including their subdirectories, you could try:
diff $(find /develop -type f -name schema.json) $(find /us-prod -type f -name schema.json)
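If there can be more than one schema.json per tree, a hedged sketch that pairs each file under /develop with its counterpart under /us-prod (assuming both trees share the same relative layout):
find /develop -type f -name schema.json -print0 |
while IFS= read -r -d '' f; do
    diff -u "$f" "/us-prod${f#/develop}"
done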

Simple way of listing all directories up to git directories

I have my projects in ${HOME}/projects/ in a hierarchical structure, something like this:
${HOME}/projects/
  - customer1/
    - c1_git_repo1
      - subdir1
      - ...
    - c1_git_repo2
      - ...
  - customer2/
    - c2_project1/
      - c2_p1_git_repo1
        - ...
      - c2_p1_git_repo2
        - ...
  ...
I want to fill my CDPATH with all directories below ${HOME}/projects up to the git repos. So in the example above this would be:
CDPATH=${HOME}/projects/customer1:${HOME}/projects/customer1/c1_git_repo1:${HOME}/projects/customer1/c1_git_repo2:${HOME}/projects/customer2:${HOME}/projects/customer2/c2_project1:${HOME}/projects/customer2/c2_project1/c2_p1_git_repo1:${HOME}/projects/customer2/c2_project1/c2_p1_git_repo2
So I need to find all directories below ${HOME}/projects/, but stop at any directory which contains a ".git" folder.
Is there some cmd tool that can list the directories for me?
The following fetches all directories from $HOME/projects, except the .git directories and their contents.
paths=$(
    find "$HOME/projects/" -mindepth 2 -type d -not -ipath '*/.git*' \
    | while read -r d
    do
        builtin printf %q: "$d"
    done
)
echo "$paths"
Sample output
/home/user/projects/dir\ with\ spaces:/home/user/projects/prj2
In the while loop we escape each directory path and append a colon by means of the built-in printf function.
The result is stored in paths variable.
I would use something like this to find the directories:
find "$HOME/projects" -type d -execdir test -d {}/.git \; -print -prune
This stops the find whenever there's a .git, so it won't descend into the repos looking for submodules, nor will it actually descend into any of the .git directories themselves.
Next you want to get the result of that command in an array (I named the array repos). If you're using bash version 4+, the easiest way to do that is mapfile, which won't split on any spaces inside the pathnames:
mapfile -t repos < <(
find "$HOME/projects" -type d -execdir test -d {}/.git \; -print -prune)
If you're on a Mac using /bin/bash, you don't have that option, so you can use read with an altered IFS and the -d option to achieve the same result:
IFS=$'\n' read -r -d '' -a repos < <(
find "$HOME/projects" -type d -execdir test -d {}/.git \; -print -prune)
Either way, once you have the pathnames in the repos array, you can then assign CDPATH like so:
CDPATH=${repos[0]}$(printf ":%s" "${repos[@]:1}")
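If GNU find is available, a hedged one-liner sketch that joins the same repo list with colons directly, with no array needed (-printf is a GNU extension, so this won't work with stock macOS find):
CDPATH=$(find "$HOME/projects" -type d -execdir test -d {}/.git \; -printf '%p:' -prune)
# drop the trailing colon left by the last -printf
CDPATH=${CDPATH%:}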

bash shell: recursively search folder and subfolders for files from list

Until now when I want to gather files from a list I have been using a list that contains full paths and using:
cat pathlist.txt | xargs -I % cp % folder
However, I would like to be able to recursively search through a folder and its subfolders and copy all files that are in a plain-text list of just filenames (not full paths).
How would I go about doing this?
Thanks!
Assuming your list of file names contains bare file names, as would be suitable for passing as an argument to find -name, you can do just that.
find folders to search -type f \( $(sed 's/^/-name /;1!s/^/-o /' pathlist.txt) \) \
    -exec cp -t folder '{}' +
If your cp doesn't support the -t option for specifying the destination folder before the sources, or your find doesn't support -exec ... +, you will need to adapt this.
Just to explain what's going on here, the input
test.txt
radish.avi
:
is being interpolated into
find folders to search -type f \( -name test.txt -o -name radish.avi \
    -o -name : \) -exec cp -t folder '{}' +
Try something like
find folder_to_search -type f | grep -f pattern_file | xargs -I % cp % folder
Use the find command.
while read -r line
do
    find /path/to/search/for -type f -name "$line" -exec cp -R {} /path/to/copy/to \;
done < plain_text_file_containing_file_names
Assumption:
The files in the list have standard names without, say, newlines or special characters in them.
Note:
If the files in the list have non-standard filenames, it will be a different ballgame. For more information, see the find manpage and look for -print0. In short, you should then be operating with null-terminated strings, as in the sketch below.
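A hedged sketch of that null-terminated variant, assuming GNU xargs (-0 and -r) and GNU cp (-t); the list itself still holds one plain name per line:
while IFS= read -r name; do
    # let find emit NUL-terminated matches so odd filenames survive the pipe
    find /path/to/search/for -type f -name "$name" -print0 |
        xargs -0 -r cp -t /path/to/copy/to
done < plain_text_file_containing_file_names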

Recursively move files of certain type and keep their directory structure

I have a directory which contains multiple sub-directories with mov and jpg files.
/dir/
  /subdir-a/    # contains a-1.jpg, a-2.jpg, a-1.mov
  /subdir-b/    # contains b-1.mov
  /subdir-c/    # contains c-1.jpg
  /subdir-d/    # contains d-1.mov
  ...           # more directories with the same pattern
I need to find a way, using command-line tools (on Mac OS X, ideally), to move all the mov files to a new location. However, one requirement is to keep the directory structure, i.e.:
/dir/
  /subdir-a/    # contains a-1.mov
  /subdir-b/    # contains b-1.mov
  # NOTE: subdir-c isn't copied because it doesn't have mov files
  /subdir-d/    # contains d-1.mov
  ...
I am familiar with find, grep, and xargs but wasn't sure how to solve this issue. Thank you very much beforehand!
It depends slightly on your O/S and, more particularly, on the facilities in your version of tar and whether you have the command cpio. It also depends a bit on whether you have newlines (in particular) in your file names; most people don't.
Option #1
cd /old-dir
find . -name '*.mov' -print | cpio -pvdumB /new-dir
Option #2
find . -name '*.mov' -print | tar -c -f - -T - |
(cd /new-dir; tar -xf -)
The cpio command has a pass-through (copy) mode which does exactly what you want given a list of file names, one per line, on its standard input.
Some versions of the tar command have an option to read the list of file names, one per line, from standard input; on Mac OS X, that option is -T - (where the lone - means 'standard input'). In the first tar command, the -f - option means 'write the archive to standard output' (in the context of creating an archive with -c); in the second tar command, the -x option means that -f - is 'read the archive from standard input'.
There may be other options; look at the manual page or help output of tar rather carefully.
This process copies the files rather than moving them. The second half of the operation would be:
find . -name '*.mov' -exec rm -f {} +
ASSERT: No files have newline characters in them. Spaces, however, are AOK.
# TEST FIRST: CREATION OF FOLDERS
find . -type f -iname \*.mov -printf '%h\n' | sort | uniq | xargs -n 1 -d '\n' -I '{}' echo mkdir -vp "/TARGET_FOLDER_ROOT/{}"
# EXECUTE CREATION OF EMPTY TARGET FOLDERS
find . -type f -iname \*.mov -printf '%h\n' | sort | uniq | xargs -n 1 -d '\n' -I '{}' mkdir -vp "/TARGET_FOLDER_ROOT/{}"
# TEST FIRST: REVIEW FILES TO BE MOVED
find . -type f -iname \*.mov -exec echo mv {} /TARGET_FOLDER_ROOT/{} \;
# EXECUTE MOVE FILES
find . -type f -iname \*.mov -exec mv {} /TARGET_FOLDER_ROOT/{} \;
Since these are large files, if they are on the same file system you don't want to copy them, just replicate their directory structure while moving them.
You can use this function:
# moves a file (or folder) preserving its folder structure (relative to source path)
# usage: move_keep_path source destination
move_keep_path () {
    # create directories up to one level up
    mkdir -p "$(dirname "$2")"
    mv "$1" "$2"
}
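Hypothetical usage, with /new-dir standing in for wherever the files should end up:
# moves /dir/subdir-a/a-1.mov to /new-dir/subdir-a/a-1.mov,
# creating /new-dir/subdir-a first
move_keep_path /dir/subdir-a/a-1.mov /new-dir/subdir-a/a-1.mov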
Or, adding support for merging existing directories:
# moves a file (or folder) preserving its folder structure (relative to source path)
# usage: move_keep_path source destination
move_keep_path () {
    # create directories up to one level up
    mkdir -p "$(dirname "$2")"
    if [[ -d "$1" && -d "$2" ]]; then
        # merge existing folder
        find "$1" -depth 1 | while read -r file; do
            # call recursively for all files inside
            move_keep_path "$file" "$2/$(basename "$file")"
        done
        # remove after merge
        rmdir "$1"
    else
        # either file or non-existing folder
        mv "$1" "$2"
    fi
}
It is easier to just copy the files like:
cp --parents some/folder/*/*.mov new_folder/
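If the mov files can sit deeper than one level, a hedged variant of the same idea driven by find (it still relies on GNU cp's --parents, which stock macOS cp lacks; the paths are illustrative):
# run from /dir; ../new_folder must already exist
find . -name '*.mov' -exec cp --parents '{}' ../new_folder \;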
From the parent directory of "dir", execute this:
find ./dir -name "*.mov" | xargs tar cif mov.tar
Then cd to the directory you want to move the files to and execute this:
tar xvf /path/to/parent-directory-of-dir/mov.tar
This should work if you want to move all the mov files to a single directory called newlocation:
find ./dir -iname '*.mov' -exec mv '{}' ./newlocation \;
However, if you wish to move the mov files along with their sub-dirs then you can do something like this -
Step 1: Copy entire structure of /dir to a new location using cp
cp -iprv dir/ newdir
Step 2: Find jpg files from newdir and delete them.
find ./newdir -iname "*.jpg" -delete
Test:
[jaypal:~/Temp] ls -R a
a.mov aa b.mov
a/aa:
aaa c.mov d.mov
a/aa/aaa:
e.mov f.mov
[jaypal:~/Temp] mkdir d
[jaypal:~/Temp] find ./a -iname '*.mov' -exec mv '{}' ./d \;
[jaypal:~/Temp] ls -R d
a.mov b.mov c.mov d.mov e.mov f.mov
I amended the function from @djjeck, because it didn't work as I needed. The function below moves a source file to a destination directory, also creating the needed levels of hierarchy from the source file path (see the example below):
# moves a file, creates needed levels of hierarchy in destination
# usage: move_with_hierarchy source_file destination top_level_directory
move_with_hierarchy () {
    path_tail=$(dirname "$(realpath --relative-to="$3" "$1")")
    cd "$2"
    mkdir -p "$path_tail"
    cd - > /dev/null
    mv "$1" "${2}/${path_tail}"
}
example:
$ ls /home/sergei/tmp/dir1/dir2/bla.txt
/home/sergei/tmp/dir1/dir2/bla.txt
$ rm -rf tmp2
$ mkdir tmp2
$ move_with_hierarchy /home/sergei/tmp/dir1/dir2/bla.txt /home/sergei/tmp2 /home/sergei/tmp
$ tree ~/tmp2
/home/sergei/tmp2
└── dir1
└── dir2
└── bla.txt
2 directories, 1 file

Delete all files but keep all directories in a bash script?

I'm trying to do something which is probably very simple. I have a directory structure such as:
dir/
  subdir1/
  subdir2/
    file1
    file2
    subsubdir1/
      file3
I would like to run a command in a bash script that will delete all files recursively from dir on down, but leave all directories. I.e.:
dir/
  subdir1/
  subdir2/
    subsubdir1/
What would be a suitable command for this?
find dir -type f -print0 | xargs -0 rm
find lists all files that match a certain expression in a given directory, recursively. -type f matches regular files. -print0 prints the names using \0 as the delimiter (since any other character, including \n, might appear in a path name). xargs gathers the file names from standard input and passes them as parameters; -0 makes sure xargs understands the \0 delimiter.
xargs is wise enough to call rm multiple times if the parameter list would get too long, so it is much better than trying something like rm $(find ...). It is also much faster than calling rm once per file, as in find ... -exec rm {} \;.
With GNU's find you can use the -delete action:
find dir -type f -delete
With standard find you can use -exec rm:
find dir -type f -exec rm {} +
find dir -type f -exec rm '{}' +
find dir -type f -exec rm {} \;
where dir is the top level of where you want to delete files from
Note that this will only delete regular files, not symlinks, not devices, etc. If you want to delete everything except directories, use
find dir -not -type d -exec rm {} \;
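If GNU find's -delete (mentioned above) is available, the same "everything except directories" variant can be written without rm; a small sketch:
# deletes every non-directory under dir; -delete implies depth-first traversal
find dir -not -type d -delete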
