I have a Bash script that repeatedly copies files every 5 seconds. But this is a touch overkill as usually there is no change.
I know about the Linux command watch, but as this script will be used on OS X computers (which don't have watch, and I don't want to make everyone install MacPorts), I need to be able to check whether a file has been modified or not with straight Bash code.
Should I be checking the file modified time? How can I do that?
Edit: I was hoping to expand my script to do more than just copy the file, if it detected a change. So is there a pure-bash way to do this?
I tend to agree with the rsync answer if you have big trees of files to manage, but you can use the -u (--update) flag to cp to copy the file(s) over only if the source is newer than the destination.
cp -u
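In the original 5-second loop, that might look something like this (a sketch; the source and destination paths are illustrative):
while true; do
    cp -u "$SRC"/* "$DST"/
    sleep 5
done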
Edit
Since you've updated the question to indicate that you'd like to take some additional actions, you'll want to use the -nt check in the [ (test) builtin command:
#!/bin/bash
if [ "$1" -nt "$2" ]; then
    echo "File 1 is newer than file 2"
else
    echo "File 1 is not newer than file 2"
fi
From the man page:
file1 -nt file2
True if file1 is newer (according to modification date) than
file2, or if file1 exists and file2 does not.
Hope that helps.
OS X has the stat command. Something like this should give you the modification time of a file:
stat -f '%m' filename
The GNU equivalent would be:
stat --printf '%Y\n' filename
You might find it more reliable to detect changes in the file content by comparing the file size (if the sizes differ, the content does) and the hash of the contents. It probably doesn't matter much which hash you use for this purpose: SHA1 or even MD5 is probably adequate, and you might find that the cksum command is sufficient.
File modification times can change without the content changing (think touch file), and modification times can stay the same even when the content does change (that's harder to arrange, but you could use touch -r ref-file file to set the modification time of file to match ref-file after editing it).
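As a sketch of how a content check could drive the copy loop from the question (the file names are illustrative), you could poll a checksum and only act on a change:
old=""
while true; do
    new=$(cksum source.txt)
    if [ "$new" != "$old" ]; then
        # the content changed (or this is the first pass): copy, or do any other work here
        cp source.txt /some/destination/
        old=$new
    fi
    sleep 5
done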
No. You should be using rsync or one of its frontends to copy the files, since it will detect if the files are different and only copy them if they are.
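For example (a sketch; the trailing slashes and paths are illustrative):
rsync -a /path/to/source/ /path/to/destination/              # quick check by size and mtime
rsync -a --checksum /path/to/source/ /path/to/destination/   # compare file contents instead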
In system call open(), if I open with O_CREAT | O_EXCL, the system call ensures that the file will only be created if it does not exist. The atomicity is guaranteed by the system call. Is there a similar way to create a file in an atomic fashion from a bash script?
UPDATE:
I found two different atomic ways:
Use set -o noclobber. Then you can use the > operator atomically.
Just use mkdir; mkdir is atomic.
A 100% pure bash solution:
set -o noclobber
{ > file ; } &> /dev/null
This command creates a file named file if no file by that name exists. If a file named file already exists, it does nothing (but returns a non-zero return code).
Pros of > over the touch command:
Doesn't update timestamp if file already existed
100% bash builtin
Return code as expected: fail if file already existed or if file couldn't be created; success if file didn't exist and was created.
Cons:
need to set the noclobber option (but it's okay in a script, if you're careful with redirections, or unset it afterwards).
I guess this solution is really the bash counterpart of the open system call with O_CREAT | O_EXCL.
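For example, a minimal sketch of using it to guard a script (the file name is illustrative):
#!/bin/bash
set -o noclobber
if { > /tmp/myscript.lock ; } 2> /dev/null; then
    echo "Created the file"
else
    echo "The file already exists" >&2
    exit 1
fi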
Here's a bash function using the mv -n trick:
function mkatomic() {
    f="$(mktemp)"
    # mv -n will not overwrite an existing destination
    mv -n "$f" "$1"
    if [ -e "$f" ]; then
        # the temporary file is still here, so the destination already existed
        rm "$f"
        echo "ERROR: file exists:" "$1" >&2
        return 1
    fi
}
Examples:
$ mkatomic foo
$ wc -c foo
0 foo
$ mkatomic foo
ERROR: file exists: foo
You could create it under a randomly-generated name, then rename it into place with the desired name (mv -n random desired). mv -n will refuse to overwrite an existing file, so if your randomly-named file is still there afterwards, the desired name already existed.
Like this:
#!/bin/bash
touch randomFileName
mv -n randomFileName lockFile
if [ -e randomFileName ] ; then
    echo "Failed to acquire lock"
else
    echo "Acquired lock"
fi
Just to be clear, ensuring the file will only be created if it doesn't exist is not the same thing as atomicity. The operation is atomic if and only if, when two or more separate threads attempt to do the same thing at the same time, exactly one will succeed and all others will fail.
The best way I know of to create a file atomically in a shell script follows this pattern (and it's not perfect):
create a file that has an extremely high chance of not existing (using a decent random number selection or something in the file name), and place some unique content in it (something that no other thread would have - again, a random number or something)
verify that the file exists and contains the contents you expect it to
create a hard link from that file to the desired file
verify that the desired file contains the expected contents
In particular, touch is not atomic, since it will create the file if it's not there, or simply update the timestamp. You might be able to play games with different timestamps, but reading and parsing a timestamp to see if you "won" the race is harder than the above. mkdir can be atomic, but you would have to check the return code, because otherwise, you can only tell that "yes, the directory was created, but I don't know which thread won". If you're on a file system that doesn't support hard links, you might have to settle for a less ideal solution.
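A rough sketch of the pattern described above (the lock-file name and token scheme are just illustrative, and as noted it's not perfect):
#!/bin/bash
target=lockfile
token="$$-$RANDOM-$(date +%s)"            # something no other process should produce
tmp="$target.$token"
echo "$token" > "$tmp" || exit 1          # step 1: a unique file with unique content
[ "$(cat "$tmp")" = "$token" ] || exit 1  # step 2: verify what we wrote
ln "$tmp" "$target" 2>/dev/null           # step 3: hard link to the desired name
rm -f "$tmp"
if [ "$(cat "$target" 2>/dev/null)" = "$token" ]; then  # step 4: did we win?
    echo "Acquired $target"
else
    echo "Another process already created $target" >&2
    exit 1
fi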
Another way to do this is to use umask to try to create the file and open it for writing, without creating it with write permissions, like this:
LOCK_FILE=only_one_at_a_time_please
UMASK=$(umask)
umask 777                           # new files are created with no permissions at all
echo "$$" > "$LOCK_FILE" || exit 1  # the redirection fails if the (unwritable) lock file already exists
umask "$UMASK"
trap "rm '$LOCK_FILE'" EXIT
If the file is missing, the script will succeed at creating and opening it for writing, despite the file being created without writing permissions. If it already exists, the script won't be able to open the file for writing. It would be possible to use exec to open the file and keep the file descriptor around.
rm requires you to have write permission on the directory itself, regardless of the file's own permissions.
touch is the command you are looking for. It updates timestamps of the provided file if the file exists or creates it if it doesn't.
How do you split a very large directory, containing potentially millions of files, into smaller directories of some custom defined maximum number of files, such as 100 per directory, on UNIX?
Bonus points if you know of a way to have wget download files into these subdirectories automatically. So if there are 1 million .html pages at the top-level path at www.example.com, such as
/1.html
/2.html
...
/1000000.html
and we only want 100 files per directory, it will download them to folders something like
./www.example.com/1-100/1.html
...
./www.example.com/999901-1000000/1000000.html
I only really need to be able to run the UNIX command on the folder after wget has downloaded the files, but if it's possible to do this with wget as it's downloading, I'd love to know!
Another option:
i=1;while read l;do mkdir $i;mv $l $((i++));done< <(ls|xargs -n100)
Or using parallel:
ls|parallel -n100 mkdir {#}\;mv {} {#}
-n100 takes 100 arguments at a time and {#} is the sequence number of the job.
You can run this through a couple of loops, which should do the trick (at least for the numeric part of the file name). I think that doing this as a one-liner is over-optimistic.
#! /bin/bash
for hundreds in {0..99}
do
    min=$(($hundreds*100+1))
    max=$(($hundreds*100+100))
    current_dir="$min-$max"
    mkdir $current_dir
    for ones_tens in {1..100}
    do
        current_file="$(($hundreds*100+$ones_tens)).html"
        #touch $current_file
        mv $current_file $current_dir
    done
done
I did performance testing by first commenting out mkdir $current_dir and mv $current_file $current_dir and uncommenting touch $current_file. This created 10000 files (one-hundredth of your target of 1000000 files). Once the files were created, I reverted to the script as written:
$ time bash /tmp/test.bash 2>&1
real 0m27.700s
user 0m26.426s
sys 0m17.653s
As long as you aren't moving files across file systems, the time for each mv command should be constant, so you should see similar or better performance. Scaling this up to a million files would take roughly 100 times as long, around 2,770 seconds, i.e. about 46 minutes. There are several avenues for optimization, such as moving all the files for a given directory in one command, or removing the inner for loop.
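For instance, with GNU seq available, the inner loop could be replaced by a single mv per directory (a sketch; it assumes all the files exist):
mv $(seq -f '%g.html' "$min" "$max") "$current_dir"
That cuts the process count from one mv per file to one per hundred files.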
Doing the wget to grab a million files is going to take far longer than this, and is almost certainly going to require some optimization; tuning the HTTP transfers (persistent connections, for instance) alone will cut the run time by hours. I don't think a shell script is the right tool for that job; a library such as WWW::Curl on CPAN will be much easier to optimize.
To make ls|parallel more practical to use, add a variable assignment to the destination dir:
DST=../brokenup; ls | parallel -n100 mkdir -p $DST/{#}\;cp {} $DST/{#}
Note: cd <src_large_dir> before executing.
The DST defined above will contain a copy of the current directory's files, but a maximum of 100 per subdirectory.
I've got an irritating closed-source tool which writes specific information into its configuration file. If you then try to use the configuration on a different file, then it loads the old file. Grrr...
Luckily, the configuration files are text, so I can version control them, and it turns out that if one just removes the offending line from the file, no harm is done.
But the tool keeps putting the lines back in. So every time I want to check in new versions of the config files, I have to remove all lines containing the symbol openDirFile.
I'm about to construct some sort of bash command to run grep -v on each file, store the result in a temporary file, and then delete the original and rename the temporary, but I wondered if anyone knew of a nice clean solution, or had already concocted and debugged a similar invocation.
For extra credit, how can this be done without destroying a symbolic link in the same directory (favourite.rc->signals.rc)?
sed -i '/openDirFile/d' *.conf
This does the removal on all .conf files.
You can also combine it with the find command if your conf files are located in different paths.
Note that -i does the removal "in place".
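For example, something like this (a sketch; the paths are illustrative and it assumes GNU sed's -i):
find /path/one /path/two -name '*.conf' -exec sed -i '/openDirFile/d' {} +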
This was the bash-spell that I came up with:
for i in *.rc ; do
    TMP=$(mktemp)
    grep -v openDirFile "$i" >"$TMP" && mv "$TMP" "$i"
done
(You can obviously turn this into a one-liner by replacing the newlines with semicolons, except after do.)
Kent's answer is clearly superior.
I've written a bash script on Cygwin which is rather like rsync, although different enough that I believe I can't actually use rsync for what I need. It iterates over about a thousand pairs of files in corresponding directories, comparing them with cmp.
Unfortunately, this seems to run abysmally slowly -- taking about ten (Edit: actually 25!) times as long as it takes to generate one of the sets of files using a Python program.
Am I right in thinking that this is surprisingly slow? Are there any simple alternatives that would go faster?
(To elaborate a bit on my use-case: I am autogenerating a bunch of .c files in a temporary directory, and when I re-generate them, I'd like to copy only the ones that have changed into the actual source directory, leaving the unchanged ones untouched (with their old creation times) so that make will know that it doesn't need to recompile them. Not all the generated files are .c files, though, so I need to do binary comparisons rather than text comparisons.)
Maybe you should use Python to do some - or even all - of the comparison work too?
One improvement would be to only bother running cmp if the file sizes are the same; if they're different, clearly the file has changed. Instead of running cmp, you could think about generating a hash for each file, using MD5 or SHA1 or SHA-256 or whatever takes your fancy (using Python modules or extensions, if that's the correct term). If you don't think you'll be dealing with malicious intent, then MD5 is probably sufficient to identify differences.
Even in a shell script, you could run an external hashing command, and give it the names of all the files in one directory, then give it the names of all the files in the other directory. Then you can read the two sets of hash values plus file names and decide which have changed.
Yes, it does sound like it is taking too long. But the trouble includes having to launch 1000 copies of cmp, plus the other processing. Both the Python and the shell script suggestions above have in common that they avoid running a program 1000 times; they try to minimize the number of programs executed. This reduction in the number of processes executed will give you a pretty big bang for your buck, I expect.
If you can keep the hashes from 'the current set of files' around and simply generate new hashes for the new set of files, and then compare them, you will do well. Clearly, if the file containing the 'old hashes' (current set of files) is missing, you'll have to regenerate it from the existing files. This is slightly fleshing out information in the comments.
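A sketch of that idea using md5sum's check mode (the directory and manifest names are illustrative; files not yet listed in the manifest won't be flagged, so seed it once with a plain md5sum * run):
cd "$GEN_DIR" || exit 1
# print only the names whose checksums no longer match the saved manifest, and copy those
md5sum --quiet -c "$SRC_DIR/manifest.md5" 2>/dev/null | cut -d: -f1 |
while read -r f; do
    cp "$f" "$SRC_DIR/$f"
done
md5sum * > "$SRC_DIR/manifest.md5"   # refresh the manifest for the next run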
One other possibility: can you track changes in the data that you use to generate these files and use that to tell you which files will have changed (or, at least, limit the set of files that may have changed and that therefore need to be compared, as your comments indicate that most files are the same each time).
If you can reasonably do the comparison of a thousand odd files within one process rather than spawning and executing a thousand additional programs, that would probably be ideal.
The short answer: Add --silent to your cmp call, if it isn't there already.
You might be able to speed up the Python version by doing some file size checks before checking the data.
First, a quick-and-hacky bash(1) technique that might be far easier if you can change to a single build directory: use the bash -N test:
$ echo foo > file
$ if [ -N file ] ; then echo newer than last read ; else echo older than last read ; fi
newer than last read
$ cat file
foo
$ if [ -N file ] ; then echo newer than last read ; else echo older than last read ; fi
older than last read
$ echo blort > file # regenerate the file here
$ if [ -N file ] ; then echo newer than last read ; else echo older than last read ; fi
newer than last read
$
Of course, if some subset of the files depend upon some other subset of the generated files, this approach won't work at all. (This might be reason enough to avoid this technique; it's up to you.)
Within your Python program, you could also check the file sizes using os.stat() to determine whether or not you should call your comparison routine; if the files are different sizes, you don't really care which bytes changed, so you can skip reading both files. (This would be difficult to do in bash(1) -- I know of no mechanism to get the file size in bash(1) without executing another program, which defeats the whole point of this check.)
The cmp program will do the size comparison internally iff you are using the --silent flag, both files are regular files, and both files are positioned at the same place (the starting offsets can be changed via the --ignore-initial flag). If you're not using --silent, add it and see what the difference is.
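Putting that together with your use-case, the copy step might look something like this (a sketch; the directory names are illustrative):
for f in "$GEN_DIR"/*; do
    base=${f##*/}
    # copy only if the destination is missing or differs byte-for-byte
    if ! cmp --silent "$f" "$SRC_DIR/$base"; then
        cp "$f" "$SRC_DIR/$base"
    fi
done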
I need to write a Linux shell script which scans a root directory and prints files which were modified after they were last executed.
For example, if File A was executed yesterday and I modify it today, the shell script must print File A. However, if File B was executed yesterday and I haven't modified it since, then File B shouldn't be printed.
Your primary problem is tracking when the files were executed.
The trouble is, Linux does not keep separate track of when a file was executed as opposed to when it was read for other purposes (such as backup, or review), so it is going to be extremely tricky to get going.
There are a variety of tricks that could be considered, but none of them are particularly trivial or inviting. One option might be to enable process accounting. Another might be to modify each script to record when it is executed.
The 'last accessed' time (or atime, or st_atime, based on the name of the field in struct stat that contains the information) doesn't help you because, as already noted, it is modified whenever the file is read. Although an executed file would certainly have been accessed, there may be many read accesses that do not execute the file but that do trigger an update of the access time.
With those caveats in place, it may be that the access time is the best that you can do, and your script needs to look for files where the access time is equal to the modify time (which means the file was modified and has not been accessed since it was modified - neither read nor printed nor executed). It is less than perfect, but it may be the best approximation available, short of a complex execution tracking system.
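With GNU find you can approximate that check in one pass (a sketch; here "modified since last accessed" is taken as mtime strictly newer than atime, and it assumes the filesystem isn't mounted with noatime and that paths contain no newlines):
find /path/to/root -type f -printf '%T@ %A@ %p\n' |
awk '$1 > $2 { sub(/^[^ ]+ [^ ]+ /, ""); print }'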
Once you've got a mechanism in place to track the execution times of files, then you can devise an appropriate means of working out which files were modified since they were last executed.
A Unix system stores three time values for any file:
last access
last modification
last change.
I don't think you can get the last execution time without using some artificial means, like creating a log or temp file, etc., when an executable file runs.
PS: Remember that not every file in Unix is executable, which is probably why a last-execution timestamp was never stored for files.
However, if you do want to get these time values, then use:
stat -c "%X" file-name # to get last accessed time value as seconds since Epoch
stat -c "%Y" file-name # to get last modified time value as seconds since Epoch
stat -c "%Z" file-name # to get last change time value as seconds since Epoch
It is very hard to do this in shell, simply because it is very hard to get atime or mtime in a sensible format in shell. Consider moving the routine to a more full-featured language like Ruby or Perl:
ruby -e 'puts Dir["**/*"].select{ |file| File.mtime(file) > File.atime(file) }'
Use **/* for all files in current directory and below, **/*.rb for all Ruby scripts in current directory in below, /* for all files in root... you get the pattern.
Take note of what I wrote in a comment to @JonathanLeffler: UNIX does not differentiate between reading a file and executing it. Thus, printing the script out with cat ./script will have the same effect as executing it with ./script, as far as this procedure is concerned. There is no way to differentiate reading and executing that I can think of, short of making your own kernel.
However, in most cases, you probably won't read the executables; and if you edit them, the save will come after opening, so mtime will still trump atime. The only bad scenario is if you open a file in an editor then exit without saving it (or just view it with less, without modification). As long as you avoid this, the method will work.
Also note that many editors will not actually modify a file in place, but create a new file, copy the contents from the old one, and then overwrite the old one with the new one. This does not set the mtime, but the ctime. Modify the script accordingly, if this is your usage pattern.
EDIT: Apparently, stat can help with the sensible representation. This is in bash:
#!/bin/sh
for FILE in `find .`; do
    # BSD/macOS stat prints "<mtime> -gt <atime>", which word-splits into a valid test expression
    if [ `stat -f "%m -gt %a" $FILE` ]; then
        echo $FILE
    fi
done
Replace "find ." (with backticks) with * for just current directory, or /* for root. To use ctime instead of mtime, use %c instead of %m.