Unix shell scripting for copying files and creating directories - shell

I have a source directory, e.g. /my/source/directory/, and a destination directory, e.g. /my/dest/directory/, which I want to mirror with some constraints.
I want to copy files which meet certain criteria of the find command, e.g. -ctime -2 (changed less than 2 days ago), to the dest directory to mirror it.
I want to include some of the prefix so I know where it came from, e.g. /source/directory.
I'd like to do all this with absolute paths so it doesn't depend on which directory I run from.
I'd guess not having cd commands is good practice too.
I want the subdirectories created if they don't exist
So
/my/source/directory/1/foo.txt -> /my/dest/directory/source/directory/1/foo.txt
/my/source/directory/2/3/bar.txt -> /my/dest/directory/source/directory/2/3/bar.txt
I've hacked together the following command line but it seems a bit ugly, can anyone do better?
find /my/source/directory -ctime -2 -type f -printf "%P\n" | xargs -IFILE rsync -avR /my/./source/directory/FILE /my/dest/directory/
Please comment if you think I should add this command line as an answer myself, I didn't want to be greedy for reputation.

This is remarkably similar to a (closed) question: Bash scripting copying files without overwriting. The answer I gave cites the 'find | cpio' solution mentioned in other answers (minus the time criteria, but that's the difference between 'similar' and 'same'), and also outlines a solution using GNU 'tar'.
ctime
When I tested on Solaris, neither GNU tar nor (Solaris) cpio was able to preserve the ctime setting; indeed, I'm not sure that there is any way to do that. For example, the touch command can set the atime or the mtime or both - but not the ctime. The utime() system call also only takes the mtime or atime values; it does not handle ctime. So, I believe that if you find a solution that preserves ctime, that solution is likely to be platform-specific. (Weird example: hack the disk device and edit the data in the inode - not portable, requires elevated privileges.) Rereading the question, though, I see that 'preserving ctime' is not part of the requirements (phew); it is simply the criterion for whether the file is copied or not.
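For instance, a quick way to see this on a GNU/Linux system (a sketch; the timestamp and file name are arbitrary):
touch -m -t 202001010000 somefile                 # sets the mtime explicitly
touch -a -t 202001010000 somefile                 # sets the atime explicitly
stat -c 'atime=%x mtime=%y ctime=%z' somefile     # the ctime still shows the current time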
chdir
I think that the 'cd' operations are necessary, but they can be wholly localized to the script or command line, as illustrated in the question cited and in the command lines below, the second of which assumes GNU tar.
(cd /my; find source/directory -ctime -2 | cpio -pvdm /my/dest/directory)
(cd /my; find source/directory -ctime -2 | tar -cf - -T - ) |
(cd /my/dest/directory; tar -xf -)
Without using chdir() (aka cd), you need specialized tools or options to handle the manipulation of the pathnames on the fly.
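rsync's --relative (-R) option, as used in the question, is one such tool: a /./ marker in the source path tells rsync where the preserved part of the path begins, so no cd is needed. A minimal sketch:
rsync -avR /my/./source/directory/1/foo.txt /my/dest/directory/
# creates /my/dest/directory/source/directory/1/foo.txt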
Names with blanks, newlines, etc
The GNU-specific 'find -print0' and 'xargs -0' are very powerful and effective, as noted by Adam Hawes. Funnily enough, GNU cpio has an option to handle the output from 'find -print0', and that is '--null' or its short form '-0'. So, using GNU find and GNU cpio, the safe command is:
(cd /my; find source/directory -ctime -2 -print0 |
cpio -pvdm0 /my/dest/directory)
Note: This does not overwrite pre-existing files under the backup directory. Add -u to the cpio command for that.
Similarly, GNU tar supports --null (apparently with no -0 short-form), and could also be used:
(cd /my; find source/directory -ctime -2 -print0 | tar -cf - -T - --null ) |
(cd /my/dest/directory; tar -xf -)
The GNU handling of file names with the null terminator is extremely clever and a valuable innovation (though I only became aware of it fairly recently, courtesy of SO; it has been in GNU tar for at least a decade).

You could try cpio using the copy-pass mode, -p. I usually use it with overwrite all (-u), create directories (-d), and maintain modification time (-m).
find myfiles | cpio -pmud target-dir
Keep in mind that find should produce relative path names, which doesn't fit your absolute-path criterion. This could of course be 'solved' using cd, which you also don't like (why not?)
(cd mypath; find myfiles | cpio ... )
The parentheses spawn a subshell and keep the state change (i.e. the directory switch) local. You could also define a small procedure to abstract away the 'ugliness', as sketched below.
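For example, a small wrapper function might look like this (a sketch; the function name copy_recent is made up):
copy_recent() {
# $1 = base dir, $2 = source dir relative to base, $3 = destination dir
( cd "$1" && find "$2" -ctime -2 -type f | cpio -pvdm "$3" )
}
copy_recent /my source/directory /my/dest/directory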

If you're using find, always use -print0 and pipe the output through xargs -0; well, almost always. The first file with a space in its name will bork the script if you use the default newline-terminated output of find.
I agree with all the other posters - use cpio or tar if you can. It'll do what you want and save the hassle.

An alternative is to use tar,
(cd "$SOURCE" && tar cf - .) | (cd "$DESTINATION" && tar xf -)
EDIT:
Ah, I missed the bit about preserving CTIME. I believe most implementations of tar will preserve mtime, but if preserving ctime is critical, then cpio is indeed the only way.
Also, some tar implementations (GNU tar being one) can select the files to include based on atime and mtime, though seemingly not ctime.

#!/bin/sh
SRC=/my/source/directory
DST=/my/dest/directory
for i in $(find $SRC -ctime -2 -type f) ; do
SUBDST=$DST$(dirname $i)
mkdir -p $SUBDST
cp -p $i $SUBDST
done
And I suppose, since you want to include "where it came from", that you are going to use different source directories. This script can be modified to take the source dir as an argument simply by replacing SRC=/my/source/directory with SRC=$1.
EDIT: Removed redundant if statement.
Does not work when filenames contain whitespace; a whitespace-safe variant is sketched below.
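A whitespace-safe variant is possible with GNU find's -print0 and bash's read -d '' (a sketch, assuming bash rather than plain sh):
#!/bin/bash
SRC=/my/source/directory
DST=/my/dest/directory
find "$SRC" -ctime -2 -type f -print0 |
while IFS= read -r -d '' i ; do
SUBDST=$DST$(dirname "$i")
mkdir -p "$SUBDST"
cp -p "$i" "$SUBDST"
done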

#!/usr/bin/sh
# Script to copy files with the same directory structure
echo "Please enter Full Path of Source DIR (Starting with / and ending with /):"
read spath
echo "Please enter Full Path of Destination location (Starting with / and ending with /):"
read dpath
si=`echo "$spath" | awk -F/ '{print NF-1}'`
for fname in `find $spath -type f -print`
do
cdir=`echo $fname | awk -F/ '{ for (i='$si'; i<NF; i++) printf "%s/", $i; printf "\n"; }'`
if [ $cdir ]; then
if [ ! -d "$dpath$cdir" ]; then
mkdir -p $dpath$cdir
fi
fi
cp $fname $dpath$cdir
done

Related

Terminal command to create a .tar.gz files from 1,000,000 .json files (without including any directory)

I have a directory with more than 1,000,000 .json files and used the following command to build j.tar.gz from only the json files (without including the /Library/WebServer/a/a/e/j/ path):
cd /Library/WebServer/a/a/e/j && tar -zcvf j.tar.gz *.json
This error happened: ...Argument list too long. Would you suggest a better command to accomplish this task? Thanks.
An initial caveat: tar is not a standards-defined tool (the POSIX archiver is pax), so its behavior can vary between platforms without any minimal guaranteed baseline. Your mileage may vary.
Since this is flagged for bash, you can use <() -- a process substitution -- to generate a filename which, when read, will emit a subprocess's output without the need for a temporary file. (This will typically be implemented as either a /dev/fd name if your operating system supports them, or a named pipe otherwise).
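A minimal illustration of the mechanism, separate from the tar command below (file names are hypothetical):
diff <(sort names_a.txt) <(sort names_b.txt)    # diff reads two "files" that are really subprocess output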
If you only want the cd to apply to the tar command, you can do that as follows, putting it in a subshell and using exec to have the subshell replace itself with the tar command, avoiding the fork penalty that a subshell otherwise creates:
dir=/Library/WebServer/a/a/e/j
(cd "$dir" && exec tar --null -zcvf j.tar.gz -T <(printf '%s\0' *.json) )
Alternately, if your tar supports it, you can use --include to tell tar itself to filter the names:
tar -C "$dir" --include='*.json' -cvzf "$dir/j.tar.gz" .
Points of note:
printf '%s\0' *.json is immune from this because printf is a shell builtin; thus, the glob results aren't put in an execv-family syscall's arguments, so ARG_MAX doesn't apply.
Using --null on tar and '%s\0' on printf (or -print0 if you were generating your list of names with find) prevents a maliciously-generated name with a literal newline from being able to inject arbitrary names into your stream. Think about what happens if someone runs mkdir -p $'hello/\n/etc/passwd\n.json' -- you don't want /etc/passwd going into your tarball.
Try:
find . -type f -name "*.json" > ./include_file && tar -zcvf j.tar.gz --files-from ./include_file
NOTE: This was tested successfully on CentOS/RedHat 6.7.
There is a limit set by your system. You can check
$ getconf ARG_MAX
mine returns
131072
Alternatively, you can create a file list for tar and use -T, --files-from F option to get names instead of globbing which hits the max args limit.
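For example (a sketch; the list file name is arbitrary):
find . -maxdepth 1 -type f -name '*.json' > json.list
tar -czf j.tar.gz -T json.list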
How about something like:
> cd /Library/WebServer/a/a/e/j
> find . -maxdepth 1 -name '*.json' | xargs tar -czvf j.tar.gz --add-file
It does not require a temporary file and does not expand *.json in the shell, which would fail.
Checked on Ubuntu; I haven't got a Mac at hand.

Can I limit the recursion when copying using find (bash)

I have been given a list of folders which need to be found and copied to a new location.
I have basic knowledge of bash and have created a script to find and copy.
The basic command I am using is working, to a certain degree:
find ./ -iname "*searchString*" -type d -maxdepth 1 -exec cp -r {} /newPath/ \;
The problem I want to resolve is that each found folder contains the files that I want, but also contains subfolders which I do not want.
Is there any way to limit the recursion so that only the files at the root level of the found folder are copied: all subdirectories and files therein should be ignored.
Thanks in advance.
If you remove -R, cp doesn't copy directories:
cp *searchstring*/* /newpath
The command above copies dir1/file1 to /newpath/file1, but these commands copy it to /newpath/dir1/file1:
cp --parents *searchstring*/*(.) /newpath
for GNU cp and zsh
. is a qualifier for regular files in zsh
cp --parents dir1/file1 dir2 copies file1 to dir2/dir1 in GNU cp
t=/newpath;for d in *searchstring*/;do mkdir -p "$t/$d";cp "$d"* "$t/$d";done
find *searchstring*/ -type f -maxdepth 1 -exec rsync -R {} /newpath \;
-R (--relative) is like --parents in GNU cp
find . -ipath '*searchstring*/*' -type f -maxdepth 2 -exec ditto {} /newpath/{} \;
ditto is only available on OS X
ditto file dir/file creates dir if it doesn't exist
So ... you've been given a list of folders. Perhaps in a text file? You haven't provided an example, but you've said in comments that there will be no name collisions.
One option would be to use rsync, which is available as an add-on package for most versions of Unix and Linux. Rsync is basically an advanced copying tool -- you provide it with one or more sources, and a destination, and it makes sure things are synchronized. It knows how to copy things recursively, but it can't be told to limit its recursion to a particular depth, so the following will copy each item specified to your target, but it will do so recursively.
xargs -L 1 -J % rsync -vi -a % /path/to/target/ < sourcelist.txt
If sourcelist.txt contains a line with /foo/bar/slurm, then the slurm directory will be copied in its entirety to /path/to/target/slurm/. But this would include directories contained within slurm.
This will work in pretty much any shell, not just bash. But it will fail if any line in sourcelist.txt contains whitespace or various special characters. So it's important to make sure that your sources (on the command line or in sourcelist.txt) are formatted correctly. Also, rsync behaves differently depending on whether a source directory includes a trailing slash (see the example below), and you should read the man page and decide which behaviour you want.
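To illustrate the trailing-slash difference (using the example path from above):
rsync -a /foo/bar/slurm  /path/to/target/     # copies the directory itself -> /path/to/target/slurm/...
rsync -a /foo/bar/slurm/ /path/to/target/     # copies only its contents    -> /path/to/target/...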
You can sanitize your input file fairly easily in sh, or bash. For example:
#!/bin/sh
# Avoid commented lines...
grep -v '^[[:space:]]*#' sourcelist.txt | while read line; do
# Remove any trailing slash, just in case
source=${line%%/}
# make sure source exist before we try to copy it
if [ -d "$source" ]; then
rsync -vi -a "$source" /path/to/target/
fi
done
But this still uses rsync's -a option, which copies things recursively.
I don't see a way to do this using rsync alone. Rsync has no -depth option, as find has. But I can see doing this in two passes -- once to copy all the directories, and once to copy the files from each directory.
So I'll make up an example, and assume further that folder names do not contain special characters like spaces or newlines. (This is important.)
First, let's do a single-pass copy of all the directories themselves, not recursing into them:
xargs -L 1 -J % rsync -vi -d % /path/to/target/ < sourcelist.txt
The -d option copies the directories named in sourcelist.txt themselves, without recursing into their contents, provided they exist.
Second, let's walk through the list of sources, copying each one:
# Basic sanity checking on input...
grep -v '^[[:space:]]*#' sourcelist.txt | while read line; do
if [ -d "$line" ]; then
# Strip trailing slashes, as before
source=${line%%/}
# Grab the directory name from the source path
target=${source##*/}
rsync -vi -a "$source/" "/path/to/target/$target/"
fi
done
Note the trailing slash after $source on the rsync line. This causes rsync to copy the contents of the directory, rather than the directory.
Does all this make sense? Does it match your requirements?
You can use find's ipath argument:
find . -maxdepth 2 -ipath './*searchString*/*' -type f -exec cp '{}' '/newPath/' ';'
Notice the path starts with ./ to match find's search directory, ends with /* in order to exclude files in the top level directory, and maxdepth is set to 2 to only recurse one level deep.
Edit:
Re-reading your comments, it seems like you want to preserve the directory you're copying from? E.g. when searching for foo*:
./foo1/* ---> copied to /newPath/foo1/* (not to /newPath/*)
./foo2/* ---> copied to /newPath/foo2/* (not to /newPath/*)
Also, the other requirement is to keep maxdepth at 1 for speed reasons.
(As pointed out in the comments, the following solution has security issues for specially crafted names)
Combining both, you could use this:
find . -maxdepth 1 -type d -iname 'searchString' -exec sh -c "mkdir -p '/newPath/{}'; cp "{}/*" '/newPath/{}/' 2>/dev/null" ';'
Edit 2:
Why not ditch find altogether and use a pure bash solution:
for d in *searchString*/; do mkdir -p "/newPath/$d"; cp "$d"* "/newPath/$d"; done
Note the / at the end of the search string, causing only directories to be considered for matching.

Copying files with specific size to other directory

It's an interview question. The interviewer asked this "basic" shell script question when he understood I don't have experience in shell scripting. Here is the question.
Copy files which have a size greater than 500 K from one directory to another directory.
I could do it immediately in C, but it seems difficult in a shell script as I have never tried it. I am familiar with basic Unix commands, so I tried, but I was only able to extract those file names using the command below.
du -sk * | awk '{ if ($1>500) print $2 }'
Also, let me know a good book of shell script examples.
It can be done in several ways. I'd try and use find:
find $FIRSTDIRECTORY -size +500k -exec cp "{}" $SECONDDIRECTORY \;
To limit the search to the current directory, use the -maxdepth option.
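For example, restricting the search to regular files directly inside $FIRSTDIRECTORY (a sketch building on the command above):
find $FIRSTDIRECTORY -maxdepth 1 -type f -size +500k -exec cp "{}" $SECONDDIRECTORY \;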
du recurses into subdirectories, which is probably not desired (you could have asked for clarification if that point was ambiguous). More likely you were expected to use ls -l or ls -s to get the sizes.
But what you did works to select some files and print their names, so let's build on it. You have a command that outputs a list of names. You need to put the output of that command into the command line of a cp. If your du|awk outputs this:
Makefile
foo.c
bar.h
you want to run this:
cp Makefile foo.c bar.h otherdirectory
So how you do that is with COMMAND SUBSTITUTION which is written as $(...) like this:
cd firstdirectory
cp $(du -sk * | awk '{ if ($1>500) print $2 }') otherdirectory
And that's a functioning script. The du|awk command runs first, and its output is used to build the cp command. There are a lot of subtle drawbacks that would make it unsuitable for general use, but that's how beginner-level shell scripts usually are.
find . -mindepth 1 -maxdepth 1 -type f -size +BYTESc -exec cp -t DESTDIR {} +
The c suffix on the size is essential; the size is in bytes. Otherwise, you get probably-unexpected rounding behaviour in determining the result of the -size check. If the copying is meant to be recursive, you will need to take care of creating any destination directory also.
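To illustrate the rounding (a sketch; the 1025-byte test file is made up):
printf '%1025s' '' > sample                       # create a 1025-byte file
find . -maxdepth 1 -name sample -size +1k         # matches: the size is rounded up to 2k
find . -maxdepth 1 -name sample -size +1024c      # matches: exact byte comparison
find . -maxdepth 1 -name sample -size +2048c      # no match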

diff a directory recursively, ignoring all binary files

Working on a Fedora Constantine box. I am looking to diff two directories recursively to check for source changes. Due to the setup of the project (prior to my own engagement with said project! sigh), the directories contain both source and binaries, as well as large binary datasets. While diffing eventually works on these directories, it would take perhaps twenty seconds if I could ignore the binary files.
As far as I understand, diff does not have an 'ignore binary files' mode, but it does have an ignore option which ignores lines matching a regular expression within a file. I don't know what to write there to ignore binary files, regardless of extension.
I'm using the following command, but it does not ignore binary files. Does anyone know how to modify this command to do this?
diff -rq dir1 dir2
Kind of cheating but here's what I used:
diff -r dir1/ dir2/ | sed '/Binary\ files\ /d' >outputfile
This recursively compares dir1 to dir2, sed removes the lines for binary files (lines beginning with "Binary files "), and the result is redirected to outputfile.
Maybe use grep -I (which is equivalent to grep --binary-files=without-match) as a filter to sort out binary files.
dir1='folder-1'
dir2='folder-2'
IFS=$'\n'
for file in $(grep -Ilsr -m 1 '.' "$dir1"); do
diff -q "$file" "${file/${dir1}/${dir2}}"
done
I came to this (old) question looking for something similar (config files on a legacy production server compared to a default Apache installation). Following @fearlesstost's suggestion in the comments, git is sufficiently lightweight and fast that it's probably more straightforward than any of the above suggestions. Copy version 1 to a new directory. Then do:
git init
git add .
git commit -m 'Version 1'
Now delete all the files from version 1 in this directory and copy version 2 into the directory. Now do:
git add .
git commit -m 'Version 2'
git show
This will show you Git's version of all the differences between the first commit and the second. For binary files it will just say that they differ. Alternatively, you could create a branch for each version and try to merge them using git's merge tools.
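A rough sketch of the branch-per-version idea (paths and branch names are made up; it assumes the initial branch is called master):
git init
git add . && git commit -m 'Version 1'      # working tree currently holds version 1
git checkout -b version2
git rm -rq .                                # drop version 1's tracked files
cp -r /path/to/version2/. .
git add -A && git commit -m 'Version 2'
git diff master version2                    # binary files are just reported as differing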
If the names of the binary files in your project follow a specific pattern (*.o, *.so, ...) as they usually do, you can put those patterns in a file and specify it using -X (hyphen X).
Contents of my exclude_file
*.o
*.so
*.git
Command:
diff -X exclude_file -r . other_tree > my_diff_file
UPDATE:
-x can be used instead of -X, to specify exclusion patterns on the command line rather than in a file:
diff -r -x '*.o' -x '*.so' -x '*.git' dir1 dir2
Use a combination of find and the file command. This requires you to do some research on the output of the file command in your directory; below I'm assuming that the files you want to diff are reported as ASCII. Or, use grep -v to filter out the binary files.
#!/bin/bash
dir1=/path/to/first/folder
dir2=/path/to/second/folder
cd $dir1
files=$(find . -type f -print | xargs file | grep ASCII | cut -d: -f1)
for i in $files;
do
echo diffing $i ---- $dir2/$i
diff -q $i $dir2/$i
done
Since you probably know the names of the huge binaries, place them in an associative array and only do the diff when a file is not in the array, something like this:
#!/bin/bash
dir1=/path/to/first/directory
dir2=/path/to/second/directory
content_dir1=$(mktemp)
content_dir2=$(mktemp)
(cd $dir1 && find . -type f -print > $content_dir1)
(cd $dir2 && find . -type f -print > $content_dir2)
echo Files that only exist in one of the paths
echo -----------------------------------------
diff $content_dir1 $content_dir2
#Files 2 Ignore
declare -A F2I
F2I=( [sqlite3]=1 [binfile2]=1 )
while read f;
do
b=$(basename $f)
if ! [[ ${F2I[$b]} ]]; then
diff $dir1/$f $dir2/$f
fi
done < $content_dir1
Well, as a crude sort of check, you could ignore files that match /\0/.

Unable to convert dot-files to non-dotfiles safely

I ran these unsuccessfully on a Mac:
mv .* *
and
mv .* ./*
My files disappeared into thin air.
How can you convert dot-files to non-dotfiles safely?
for i in `ls -d .*`; do mv $i "`echo $i | sed 's/^.//'`"; done
or, much easier,
rename 's/^.//' `ls -d .*`
if your system has got it.
In zsh, you could just use .* safely, but in bash you'll have to use ls -d .*
You can't use mv to rename multiple files like that. What you want is mmv (get it here).
mmv .\* \#1
You have to escape the asterisk to prevent bash from expanding it. Use the -n flag to do a test run to make sure what will happen is what you want.
You could also do this in shell scripting, but I much prefer mmv because the -n flag shows what it would do. You'd have to alter your script to echo instead of mv, which seems more dangerous than dropping the -n flag (especially when things get more complicated).
The tricky part about this is selecting dotfiles without selecting "." and "..".
ls .??* is sometimes used for this, since it forces the filenames to be three or more characters long. There is a risk though, of overlooking a dotfile with a short name, such as ".x"
ls -d .* prevents directories from being expanded, but it doesn't filter out "." or ".."
The find command could be used, as in find . -maxdepth 1 -type f -name '.*'. The maxdepth limits it to the current directory and not subdirectories. The -type f limits it to files, eliminating directories such as "." and "..". Then again, maybe you want to rename the .ssh directory to ssh.
Here's an alternative that selects dotfiles while avoiding "." and "..".
ls -A | sed -n 's/^\.\(.*\)/mv ".\1" "\1"/p' | bash
The -A lists all files and dotfiles, yet eliminates "." and ".." for us. Then the sed command selects only those lines with "." as the first character, and prints out appropriate "mv" commands, complete with quotes in case you have a bizarre dotfilename with a space in it.
Run it without the "| bash" first, to see what mv commands are generated.
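For example, in a directory containing .bashrc and .profile (hypothetical names), the preview would print:
mv ".bashrc" "bashrc"
mv ".profile" "profile"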
I don't know what type of system you're on, but it looks Unix-like, so I would do
ls -1d .?* | cut -b2- | xargs -I{} mv .{} {}
this lists everything that starts with a dot, cuts off the leading dot, then pipes that list to a move command
In Linux, there is usually a rename utility available (a perl script, if I am not mistaken):
rename 's/^.//' .*
It is available for the Mac. You can install it by following the tips here.
Even simpler:
for x in .*; do mv $x ${x/./}; done
