GNU Parallel for copying a large amount of data (around 40 GB) - parallel-processing

My requirement is simple, but I have been unable to find a way to do it effectively.
I have a directory named Code which contains around 1.4 million (14 lakh) files and is around 40 GB in size. All I want to do is create a copy of this Code folder at the same directory level, i.e. Code and Code_Copy end up in the same parent folder.
If I use find . | parallel cp -rpf {} ../Code_Copy inside the Code folder, I get the correct directory structure, but all the files found recursively in Code also end up copied directly into the top level of Code_Copy.
If I use tree | parallel cp -rpf {} ../Code_Copy, the directory structure is created properly, but the command keeps running for a long time even after that, giving "cannot stat file:" errors for the lines of tree's output.
Please help in providing a solution.
Thanks in advance

tree will fail because it does not print paths; it draws them using ASCII art (lines such as "├── foo"), which is not useful input for parallel.
The simplest solution is:
cp -a Code Code_Copy
but that may be slow. Running rsync in parallel may be the fastest solution depending on your disk system:
find . -type f | parallel 'mkdir -p ../Copy/{//}; rsync -a {} ../Copy/{}'
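# {//} is GNU Parallel's replacement string for the directory part (dirname) of each input path,
# so the file's parent directory is recreated under ../Copy before rsync copies the file into it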
# To get permissions and special files fixed
rsync -a ./ ../Copy/
To get the most out of GNU Parallel consider walking through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html

Late answer but you need to do it in two steps. Build the directory tree first, then copy in the files. From the parent directory of Code:
NUM_JOBS=4
find Code -type d -print0 | parallel -j$NUM_JOBS -0 "mkdir -p Code_Copy/{}"
find Code ! -type d -print0 | parallel -j$NUM_JOBS -0 "cp -P {} Code_Copy/{}"
The -P option to cp preserves symbolic links (copying them as links rather than following them) within the copied tree.
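As a quick sanity check afterwards (not part of the original answer, just a sketch), comparing file counts of the two trees catches most copy failures; diff -r would be more thorough but much slower on 40 GB:
find Code -type f | wc -l
find Code_Copy -type f | wc -l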

Related

Operating on multiple specific folders at once with cp and rm commands

I'm new to Linux (using bash) and I wanted to ask about something that I do often while I work. I'll give two examples:
Deleting multiple specific folders inside a certain directory.
Copying multiple specific folders into a certain directory.
I successfully did this with files, using find with some regex and then -exec and -delete. But for folders I found it more problematic, because I had trouble piping the list of folders to the cp/rm command successfully, each time getting a "No such file or directory" error.
Looking online I found the following command (in my case for copying all folders starting with a Z):
cp -r $(ls -A | grep "Z*") destination
But when I execute it, it says nothing and the prompt won't come back until I hit Ctrl+C, and nothing is copied.
How can I achieve what I'm looking for? For both cp and rm.
Thanks in advance!
First of all, you are grepping for "Z*", but as a regular expression that means "zero or more Z characters", so it matches every line rather than only names starting with Z; for that you want grep "^Z".
Also, try executing ls -A in a terminal: you will get multiple columns. You need at least ls -1A to print one result per line.
So for your command, try something like:
cp -r $(ls -1A | grep "^Z") destination
or
cp -r $(ls -1A | grep "^Z") -t destination
But all the above is just to correct the syntax of your example.
It is much better to use find. Note that find passes {} to cp as a single argument, so it does not need extra quoting:
find <PATH_FROM> -type d -exec cp -r {} -t target \;
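For the deletion half of the question, the same find pattern works; a sketch, assuming the folders to remove start with Z and sit directly inside the current directory (adjust -name and -maxdepth as needed):
find . -maxdepth 1 -type d -name 'Z*' -exec rm -r {} +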

Create text document of every file in a directory recursively

Part of a script I currently use is using "ls -FCRlhLoprt" to list every file inside of a root directory recursively to a text document. The problem is, every time I run the script, ls includes that document in its output so the text document grows each time I run it. I believe I can use -i or --ignore, but how can I use that when ls is using a few variables? I keep getting errors:
ls "$lsopt" "$masroot"/ >> "$masroot"/"$client"_"$jobnum"_"$mas"_drive_contents.txt . #this works
If I try:
ls -FCRlhLoprt --ignore=""$masroot"/"$client"_"$jobnum"_"$mas"_drive_contents.txt"" "$masroot"/ >> "$masroot"/"$client"_"$jobnum"_"$mas"_drive_contents.txt #this does not work
I get errors. Basically, I want the output file itself not to be included in the listing the second time I run this command.
Additional, all I am trying to do is create an easy to read document of every file inside of a directory recursively. If there is a better way, please let me know.
To list every file in a directory recursively, the find command does exactly what you want, and admits further programmatic manipulation of the files found if you wish.
Examples:
To list every file under the current directory, recursively:
find ./ -type f
To list files under /etc/ and /usr/share, showing their owners and permissions:
find /etc /usr/share -type f -printf "%-100p %#m %10u %10g\n"
To show line counts of all files recursively, but ignoring subdirectories of .git:
find ./ -type f ! -regex ".*\.git.*" -exec wc -l {} +
To search under $masroot but ignore files generated by past searches, and dump the results into a file:
find "$masroot" -type f ! -regex ".*/[a-zA-Z]+_[0-9]+_.+_drive_contents.txt" | tee "$masroot/${client}_${jobnum}_${mas}_drive_contents.txt"
(Some of that might be slightly different on a Mac. For more information see man find.)
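If you would rather keep the ls approach: --ignore matches the entry's basename, not its full path, and the failing attempt doubles the double quotes. A sketch of a corrected invocation (assuming GNU ls; the out variable is just shorthand introduced here):
out="$masroot/${client}_${jobnum}_${mas}_drive_contents.txt"
ls -FCRlhLoprt --ignore="$(basename "$out")" "$masroot"/ >> "$out"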

copy only the files in a directory

I have 2 directories
dir1/results1/a.xml
dir1/results1/b.txt
and
dir2/results2/c.xml
dir2/results2/d.txt
I want to copy only the files in dir2/results2 folder into dir1/results1 folder so that the result is like this:
dir1/results1/a.xml
dir1/results1/b.txt
dir1/results1/c.xml
dir1/results1/d.txt
I tried the shell command
cp -R dir2/results2/ dir1/results1/
but it is getting copied as
dir1/results1/a.xml
dir1/results1/b.txt
dir1/results1/results2
what is the right way to do it?
In your concrete case,
cp dir2/results2/* dir1/results1
would do what you want. It would not work well in two cases:
If you have files starting with a period, for instance dir2/results2/.abc. These files would not be copied.
If you have subdirectories in dir2/results2. While they indeed would not be copied (as you required, since you want to copy only files, not directories), you would get an error message, which is at least not elegant.
There are solutions to both problems, so if this is an issue for you, create a separate post with the respective topic.
(UPDATE) If the filename expansion would generate an argument list longer than the allowed maximum (for instance, if there are many files in the directory, or files with long names), my solution would not work either. In this case, something like the following would:
find dir2/results2 -maxdepth 1 -type f | xargs --no-run-if-empty -I {} cp {} dir1/results1
This would also solve the problems with the hidden files, which I have mentioned above.
(cd dir2/results2 && find . -maxdepth 1 -type f -print0 | tar --null -T - -cf -) | (cd dir1/results1 && tar -xf -)
Handles all cases, including dot files and very large numbers of files, but won't copy subdirectories. Remove the -maxdepth 1 to copy subdirectories as well. Requires GNU tar.
The tar command is very handy for that.
Give this a try:
tar cf - -C dir2/results2 . | ( cd dir1/results1 ; tar xf - )
It will not only copy plain files but also anything else found in dir2/results2, such as directories etc.
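If rsync is available, another way to copy only the top-level files (a sketch, not from the answers above) is to exclude directories entirely; the trailing / on the pattern makes it match only directories:
rsync -a --exclude='*/' dir2/results2/ dir1/results1/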

finding and copying files with directory names

I have a folder with a lot of subdirectories. I want to find html.txt files and move them into a new directory with the same structure.
I'm using
find . -name "*.html.txt" | cp -t /home/....
but it is not keeping the name and directory structure.
Is there any way to do this?
Try:
find . -name '*.html.txt' | xargs cp -t /home --parents
xargs is often useful to turn various lists into a list of arguments, especially with the output of 'find'. You can probably also do this with an exec command from within find, but I always find that harder to quickly get right.
--parents does the trick of creating all the parent directories as they are in the list of files.
I'm using this type of command in csh, and it should work in bash as well.
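If any of the file names contain spaces or newlines, a null-delimited variant (assuming GNU find, xargs and cp) is safer:
find . -name '*.html.txt' -print0 | xargs -0 cp --parents -t /home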
Mostly answered here already:
https://serverfault.com/questions/180853/how-to-copy-file-preserving-directory-path-in-linux

find files unique to different paths BASH

I have a suspicion that a few years ago someone accidentally copied a folder structure from /home/data to /home/something/data. Since then /home/data has had many updates and changes.
What is the easiest way to check if there are any files in /home/something/data unique
(by name and location) to that location, to help me confirm if everything in there was a copy from /home/data?
Using diff -r dir1 dir2, you can recursively scan directories for differences in structure and content. Additional flags can tweak the output and behavior to your liking.
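For example, adding -q keeps the output to just which files differ or exist on only one side:
diff -rq /home/data /home/something/data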
Use rsync in dry-run mode to see whether copying /home/something/data into /home/data would actually transfer anything:
rsync -rv --dry-run /home/something/data/ /home/data
If a file under /home/something/data is identical (same size and modification time) to the corresponding file under /home/data, it would not be copied, and rsync --dry-run will not report it.
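If you want the comparison to go by content rather than size and timestamp, rsync's -c/--checksum flag does that (slower, since every file has to be read):
rsync -rcv --dry-run /home/something/data/ /home/data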
You may or may not like this approach; it can take a while to scan all the files, but I generally have a good feeling when I do it.
Go to the top of each directory structure, run find, and get the MD5 checksum of each and every file - your switches may vary, as I am on OSX:
cd /home/data
find . -type f -exec md5 -r {} + > /tmp/a
cd /home/something/data
find . -type f -exec md5 -r {} + > /tmp/b
When they are finished, run the output files through sort and uniq -u to show the lines that appear only once (they should all appear twice if the files are the same in both directories):
sort /tmp/a /tmp/b | uniq -u
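On Linux the same idea works with md5sum from GNU coreutils (a sketch; the paths are relative to each top directory, so matching files produce identical lines):
cd /home/data && find . -type f -exec md5sum {} + > /tmp/a
cd /home/something/data && find . -type f -exec md5sum {} + > /tmp/b
sort /tmp/a /tmp/b | uniq -u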
