I have a suspicion that a few years ago someone accidentally copied a folder structure from /home/data to /home/something/data. Since then /home/data has had many updates and changes.
What is the easiest way to check whether any files in /home/something/data are unique to that location (by name and location), to help me confirm that everything in there was copied from /home/data?

Using diff -r dir1 dir2, you can recursively scan directories for differences in structure and content. Additional flags can tweak the output and behavior to your liking.
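As a rough illustration of the diff approach, here is a minimal sketch on two throwaway trees. The paths under $work are hypothetical stand-ins for /home/data and /home/something/data. Adding -q limits the output to which entries differ, and filtering for "Only in" lines on the copy's side leaves the files that exist only there.

```shell
# Build two small throwaway trees (hypothetical stand-ins for the real paths).
work=$(mktemp -d)
mkdir -p "$work/data/sub" "$work/copy/sub"
echo one > "$work/data/a.txt"
echo one > "$work/copy/a.txt"
echo two > "$work/copy/sub/only_here.txt"

# -r recurses; -q reports only which entries differ or are missing.
# "Only in <dir>: <name>" lines mark entries present on one side only.
only_in_copy=$(LC_ALL=C diff -rq "$work/data" "$work/copy" | grep -F "Only in $work/copy")
printf '%s\n' "$only_in_copy"
```

If the filtered output is empty, every name under the copy also exists under the original.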

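Use rsync in dry-run mode to see whether copying /home/something/data into /home/data would actually copy any data. The trailing slashes make rsync compare the contents of the two directories (without them it would target /home/data/data), and -v is needed because a dry run otherwise prints nothing:
rsync -rv --dry-run /home/something/data/ /home/data/
If a file under /home/something/data is identical to a file under /home/data, it would not be copied, and rsync --dry-run will not report it. By default rsync decides this by size and modification time; add --checksum to compare file contents instead.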

You may or may not like this approach: it can take a while to scan all the files, but I generally have a good feeling when I do it.
Go to the top of each directory structure, run find, and get the md5 checksum of every file. Your switches may vary, as I am on OS X.
cd /home/data
find . -type f -exec md5 -r {} + > /tmp/a
cd /home/something/data
find . -type f -exec md5 -r {} + > /tmp/b
When they are finished, run the output files through sort and uniq -u to tell you the lines that only appear once (they should all appear twice if the files are the same in both directories):
sort /tmp/a /tmp/b | uniq -u
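If only names and locations matter, as the question asks, the checksum step can be skipped. The sketch below is one way to do that, using comm rather than uniq: it lists relative paths under each root and keeps only those that appear in the suspected copy. The directory names under $work are made up for the example.

```shell
# Throwaway trees standing in for /home/data and /home/something/data.
work=$(mktemp -d)
mkdir -p "$work/data/sub" "$work/something/data/sub"
touch "$work/data/a.txt" "$work/data/sub/b.txt"
touch "$work/something/data/a.txt" "$work/something/data/sub/b.txt" \
      "$work/something/data/sub/only_here.txt"

# List relative paths from each root, sorted the same way for comm.
(cd "$work/data" && find . -type f | LC_ALL=C sort) > "$work/orig.list"
(cd "$work/something/data" && find . -type f | LC_ALL=C sort) > "$work/copy.list"

# comm -13 suppresses lines unique to the first list and lines common to both,
# leaving only paths that exist solely in the suspected copy.
unique_to_copy=$(LC_ALL=C comm -13 "$work/orig.list" "$work/copy.list")
printf '%s\n' "$unique_to_copy"
```

This is much faster than hashing, but it says nothing about whether files with the same name have the same content.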


copy only the files in a directory

I have 2 directories
I want to copy only the files in dir2/results2 folder into dir1/results1 folder so that the result is like this:
I tried the shell command
cp -R dir2/results2/ dir1/results1/
but it is getting copied as
what is the right way to do it?
In your concrete case,
cp dir/results2/* dir/results1
would do what you want. It would not work well in two cases:
If you have files starting with a period, for instance dir/results2/.abc. These files would not be copied.
If you have subdirectories in dir/results2. While they indeed would not be copied (as you required, because you want to copy only files, not directories), you would get an error message, which is at least not elegant.
There are solutions to both problems, so if this is an issue for you, create a separate post with the respective topic.
(UPDATE) If the filename expansion would generate an argument list which is longer than the allowed maximum (for instance, if there are many files in the directory, or files with long names), my solution would not work either. In this case, something like the following would do:
find dir/results2 -maxdepth 1 -type f | xargs -I {} --no-run-if-empty cp {} dir/results1
This would also solve the problem with the hidden files, which I have mentioned above.
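A variant that avoids xargs entirely is to let find run cp itself. The following is a minimal sketch on a throwaway tree (the names under $work are hypothetical); it shows that dot files are picked up while subdirectories and their contents are left alone.

```shell
# Throwaway source and destination directories.
work=$(mktemp -d)
mkdir -p "$work/results2/subdir" "$work/results1"
echo visible > "$work/results2/report.txt"
echo hidden > "$work/results2/.abc"
echo nested > "$work/results2/subdir/nested.txt"

# -maxdepth 1 keeps find from descending into subdirectories;
# -type f selects regular files only, dot files included.
# -exec ... \; runs cp once per file, so no argument-length limit applies.
find "$work/results2" -maxdepth 1 -type f -exec cp {} "$work/results1/" \;

ls -A "$work/results1"
```

Running cp once per file is slower than a single cp call, but it also copes with file names containing spaces or newlines.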
(cd dir1 && find . -maxdepth 1 -type f -print0 | tar --null -T - -cf - ) | (cd dir2 && tar -xf -)
Handles all cases including . files and very large files, but won't copy subdirs. Remove the -maxdepth 1 to copy the files in subdirs as well. Requires GNU tar.
The tar command is very handy for that.
Give this a try:
tar cf - -C dir2/results2 . | ( cd dir1/results1 && tar xf - )
It will copy not only plain files but also anything else found in dir2/results2, such as directories.

finding and copying files with directory names

I have a folder with a lot of subdirectories. I want to find html.txt files and move them into a new directory with the same structure.
I'm using
find . -name "*.html.txt" | cp -t /home/....
but it is not keeping the name and directory structure.
Is there any way to do this?
find . -name '*.html.txt' | xargs cp -t /home --parents
xargs is often useful to turn various lists into a list of arguments, especially with the output of 'find'. You can probably also do this with an exec command from within find, but I always find that harder to quickly get right.
--parents does the trick of creating all the parent directories as they are in the list of files.
I'm using this type of command in csh, and it should work in bash as well.
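For file names that may contain spaces, a NUL-separated variant of the same idea can be sketched as follows. It assumes GNU cp (for --parents and -t) and an xargs that supports -0 and -r; the directory names under $work are hypothetical.

```shell
# Throwaway source tree and destination.
work=$(mktemp -d)
mkdir -p "$work/src/a/b" "$work/dest"
echo page > "$work/src/a/b/index.html.txt"
echo other > "$work/src/a/skip.txt"

# Run from the source root so the recreated paths are relative to it.
# -print0 / -0 pass names NUL-separated, which keeps spaces and newlines intact.
# -r skips running cp when nothing matches; --parents and -t are GNU cp options.
(cd "$work/src" && find . -name '*.html.txt' -print0 |
    xargs -0 -r cp --parents -t "$work/dest")

find "$work/dest" -type f
```

This copies rather than moves; the originals can be deleted afterwards once the copy has been checked.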

GNU Parallel for copying a large amount of data (around 40 GB)

My requirement is simple, but I am unable to find a way to do it efficiently.
I have a directory named Code which contains around 14 lakh (1.4 million) files and is around 40 GB. All I want to do is create a copy of this Code folder at the same directory level, meaning Code and Code_Copy are in the same folder.
If I use find . | parallel cp -rpf {} ../Code_Copy in the Code folder, I get the correct directory structure, but all the files present in the Code folder recursively are also copied directly into Code_Copy.
If I use tree | parallel cp -rpf {} ../Code_Copy, the directory structure is created properly, but the command keeps running for a long time afterwards, giving "cannot stat file" errors for lines of the tree command's output.
Please help me find a solution.
Thanks in advance
tree will fail because it does not give paths but draws the paths using ascii art. This is not useful input for parallel.
The simplest solution is:
cp -a Code Code_Copy
but that may be slow. Running rsync in parallel may be the fastest solution depending on your disk system:
find . -type f | parallel 'mkdir -p ../Copy/{//}; rsync -a {} ../Copy/{}'
# To get permissions and special files fixed
rsync -a ./ ../Copy/
To get the most out of GNU Parallel, consider walking through its tutorial (man parallel_tutorial).
Late answer but you need to do it in two steps. Build the directory tree first, then copy in the files. From the parent directory of Code:
find Code -type d -print0 | parallel -j$NUM_JOBS -0 "mkdir -p Code_Copy/{}"
find Code ! -type d -print0 | parallel -j$NUM_JOBS -0 "cp -P {} Code_Copy/{}"
The -P option preserves softlinks within the tree copied.
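The same two-step idea can be tried without GNU Parallel. The sketch below swaps in xargs -P (a GNU/BSD extension, not parallel itself) and runs from inside Code so that the copy lands directly in Code_Copy; the job count of 4 and the tiny sample tree are assumptions made for the example.

```shell
# Small throwaway tree standing in for the real Code directory.
work=$(mktemp -d)
mkdir -p "$work/Code/src/lib"
echo main > "$work/Code/src/main.c"
echo util > "$work/Code/src/lib/util.c"
ln -s main.c "$work/Code/src/link.c"

# Step 1: recreate the directory tree first.
(cd "$work/Code" && find . -type d -print0 |
    xargs -0 -I{} mkdir -p "../Code_Copy/{}")

# Step 2: copy everything that is not a directory, 4 jobs at a time.
# cp -P copies symlinks as symlinks instead of following them.
(cd "$work/Code" && find . ! -type d -print0 |
    xargs -0 -P 4 -I{} cp -P {} "../Code_Copy/{}")

# An empty diff means the two trees match.
diff -r "$work/Code" "$work/Code_Copy"
```

Whether parallel copying beats a plain cp -a depends on the storage; on a single spinning disk it can easily be slower.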

rsync : Recursively sync all files while ignoring the directory structure

I am trying to create a bash script for syncing music from my desktop to a mobile device. The desktop is the source.
Is there a way to make rsync recursively sync files but ignore the directory structure? If a file was deleted from the desktop, I want it to be deleted on the device as well.
The directory structure on my desktop is something like this.
The directory structure that I want on the device is:
rsync -a --delete --include='*.mp3' --exclude='*' \
pathToSongs/Theme*/Artist*/. destuser@desthost:Music/.
would do the job if your path hierarchy has a fixed number of levels.
WARNING: if two song files have exactly the same name, they end up in the same destination directory and your backup will miss one of them!
Otherwise, and to answer strictly what you asked (ignoring the directory structure), you could use bash's shopt -s globstar feature:
shopt -s globstar
rsync -a --delete --include='*.mp3' --exclude='*' \
pathToSongsRoot/**/. destuser@desthost:Music/.
This way, there is no need to fork a find command.
Recursively sync all files while ignoring the directory structure
To answer the question strictly, the sync must not be limited to one extension:
shopt -s globstar
rsync -d --delete sourceRoot/**/. destuser@desthost:destRoot/.
With this, directories will be copied too, but without their content. All files and directories would be stored at the same level in destRoot/.
WARNING: If different files with the same name exist in different directories, they will simply overwrite one another on the destination during the rsync run, so that only one of them, effectively an arbitrary one, is finally stored.
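Before flattening a tree, it may be worth checking for the name collisions these warnings describe. The following is a minimal sketch with made-up directory and file names: it strips the directory part from every path and prints any base name that occurs more than once.

```shell
# Throwaway music tree with one deliberately duplicated file name.
work=$(mktemp -d)
mkdir -p "$work/Rock/ArtistA" "$work/Jazz/ArtistB"
touch "$work/Rock/ArtistA/intro.mp3" "$work/Jazz/ArtistB/intro.mp3" \
      "$work/Jazz/ArtistB/blue.mp3"

# Strip the directory part of every path, then print names seen more than once.
duplicates=$(find "$work" -type f -name '*.mp3' -exec basename {} \; | sort | uniq -d)
printf '%s\n' "$duplicates"
```

An empty result means the flattened destination will hold every file; any name listed would be overwritten.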
Maybe this is a recent option, but I see the option --no-relative mentioned in the documentation for --files-from, and it worked great.
find SourceDir -name \*.mp3 | rsync -av --files-from - --no-relative . DestinationDir/
The answer to your question: No, rsync cannot do this alone. But with some help of other tools, we can get there... After a few tries I came up with this:
rsync -d --delete $(find . -type d|while read d ; do echo $d/ ; done) /targetDirectory && rmdir /targetDirectory/* 2>&-
The difficulty is this: To enable deletion of files at the target position, you need to:
specify directories as sources for rsync (it doesn't delete if the source is a list of files).
give it the complete list of sources at once (rsync within a loop will give you the contents of the last directory only at the target).
end the directory names with a slash (otherwise it creates the directories at the target directory)
So the command substitution (the part enclosed in $( )) does this: it finds all directories and adds a slash (/) at the end of each directory name. Now rsync sees a list of source directories, all terminated with a slash, and so copies their contents to the target directory. The option -d tells it not to copy recursively.
The second trick is the rmdir /targetDirectory/* which removes the empty directories which rsync created (although we didn't ask it to do that).
I tested that here, and deletion of files removed in the source tree worked just fine.
If you can make a list of files, you've already solved the problem.
find /path/to/src/ -name \*.mp3 > list.txt
rsync -avi --no-relative --progress --files-from=list.txt / user@server:/path/to/dest
If you run the script again for new files, it will only copy the missing files.
If you don't like the list, then try a single command (but the logic is different):
find /path/to/src/ -name \*.mp3 -type f \
-exec rsync -avi --progress {} user@server:/path/to/dest/ \;
In this case, rsync is invoked separately for each file on every run, because this form of command cannot build the file list in advance.

Unix tar: do not preserve full pathnames

When I try to compress files and directories with tar using absolute paths, the absolute path is preserved in the resulting compressed file. I need to use absolute paths to tell tar where the folder I wish to compress is located, but I only want it to compress that folder – not the whole path.
For example, tar -cvzf test.tar.gz /home/path/test – where I want to compress the folder test. However, what I actually end up compressing is /home/path/test. Is there anything that can be done to avoid this? I have tried playing with the -C operand to no avail.
This is ugly... but it works...
I had this same problem but with multiple folders; I just wanted to flatten all the files. You can use the --transform option to pass a sed expression, and it works as expected.
this is the expression:
's/.*\///g' (delete everything before '/')
This is the final command:
tar --transform 's/.*\///g' -zcvf tarballName.tgz */*/*.info
Use -C to change to the directory relative to which the paths look the way you want, and then name the files as seen from that directory:
tar -cvzf test.tar.gz -C /home/path test
multi-directory example
tar cvzf my.tar.gz -C dir1 files_under_dir1 -C dir2 files_under_dir2
The files under dir1 and dir2 are then stored without their leading path.
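To check what -C actually stores, the archive can be listed after it is created. Here is a small sketch on a throwaway tree; the home/path/test layout under $work only mimics the question's example.

```shell
# Throwaway tree mimicking /home/path/test.
work=$(mktemp -d)
mkdir -p "$work/home/path/test"
echo hello > "$work/home/path/test/file.txt"

# -C changes directory before adding "test", so only the relative name is stored.
tar -czf "$work/test.tar.gz" -C "$work/home/path" test

# List the archive members to confirm no leading directories were recorded.
members=$(tar -tzf "$work/test.tar.gz")
printf '%s\n' "$members"
```

The order matters: -C only affects the file arguments that come after it on the command line.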
tar can perform transformations on the filenames on the way in and out of the archive. Consider this example that stores a bunch of files in a flat tarfile:
in the root ~/ directory
find logger -name \*.sh | tar -cvf test.tar -T - --xform='s|^.*/||' --show-transformed
The -T - option tells tar to read a list of files from stdin, and --xform='s|^.*/||' applies the sed expression to each file name after it is read and before it is stored. --show-transformed is just a nicety to show you the file names after they are transformed; the default is to show the names as they are read.
There are no doubt other ways besides using find to specify files to archive. For instance, if you have globstar set in bash (shopt -s globstar), you can use ** patterns to match any number of directory levels, shortening the previous command to this:
tar -cvf test.tar --xform='s|^.*/||' --show-transformed logger/**/*.sh
You’ll have to judge what is best for your situation given the files you’re after.
find -type f | tar --transform 's/.*\///g' -zcvf comp.tar.gz -T -
Where find -type f finds all the files in the directory tree and using tar with --transform compresses them without including the folder structure. This is very useful if you want to compress only the files that are the result of a certain search or the files of a specific type like:
find -type f -name "*.txt" | tar --transform 's/.*\///g' -zcvf comp.tar.gz -T -
Unlike the other answers, you don't have to include */*/* specifying the depth of the directory. find handles that for you.
