Speeding up file comparisons (with `cmp`) on Cygwin? - bash

I've written a bash script on Cygwin which is rather like rsync, although different enough that I believe I can't actually use rsync for what I need. It iterates over about a thousand pairs of files in corresponding directories, comparing them with cmp.
Unfortunately, this seems to run abysmally slowly -- taking about ten (Edit: actually 25!) times as long as it takes to generate one of the sets of files using a Python program.
Am I right in thinking that this is surprisingly slow? Are there any simple alternatives that would go faster?
(To elaborate a bit on my use-case: I am autogenerating a bunch of .c files in a temporary directory, and when I re-generate them, I'd like to copy only the ones that have changed into the actual source directory, leaving the unchanged ones untouched (with their old creation times) so that make will know that it doesn't need to recompile them. Not all the generated files are .c files, though, so I need to do binary comparisons rather than text comparisons.)

Maybe you should use Python to do some - or even all - of the comparison work too?
One improvement would be to only bother running cmp if the file sizes are the same; if they're different, clearly the file has changed. Instead of running cmp, you could think about generating a hash for each file, using MD5 or SHA1 or SHA-256 or whatever takes your fancy (using Python modules or extensions, if that's the correct term). If you don't think you'll be dealing with malicious intent, then MD5 is probably sufficient to identify differences.
Even in a shell script, you could run an external hashing command, and give it the names of all the files in one directory, then give it the names of all the files in the other directory. Then you can read the two sets of hash values plus file names and decide which have changed.
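For instance, a rough sketch of that idea in the shell, assuming GNU md5sum, flat directories, and placeholder directory names tmpdir/ (newly generated files) and srcdir/ (current files):
( cd tmpdir && md5sum * | sort ) > /tmp/new.md5
( cd srcdir && md5sum * | sort ) > /tmp/old.md5
# Lines present only in the new listing are files that changed (or are new):
diff /tmp/old.md5 /tmp/new.md5 | awk '/^>/ { print $3 }'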
Yes, it does sound like it is taking too long. But the trouble includes having to launch 1000 copies of cmp, plus the other processing. Both the Python and the shell script suggestions above have in common that they avoid running a program 1000 times; they try to minimize the number of programs executed. This reduction in the number of processes executed will give you a pretty big bang for your buck, I expect.
If you can keep the hashes from 'the current set of files' around and simply generate new hashes for the new set of files, and then compare them, you will do well. Clearly, if the file containing the 'old hashes' (current set of files) is missing, you'll have to regenerate it from the existing files. This is slightly fleshing out information in the comments.
One other possibility: can you track changes in the data that you use to generate these files and use that to tell you which files will have changed (or, at least, limit the set of files that may have changed and that therefore need to be compared, as your comments indicate that most files are the same each time).

If you can reasonably do the comparison of a thousand odd files within one process rather than spawning and executing a thousand additional programs, that would probably be ideal.
The short answer: Add --silent to your cmp call, if it isn't there already.
You might be able to speed up the Python version by doing some file size checks before checking the data.
First, a quick-and-hacky bash(1) technique that might be far easier if you can change to a single build directory: use the bash -N test:
$ echo foo > file
$ if [ -N file ] ; then echo newer than last read ; else echo older than last read ; fi
newer than last read
$ cat file
foo
$ if [ -N file ] ; then echo newer than last read ; else echo older than last read ; fi
older than last read
$ echo blort > file # regenerate the file here
$ if [ -N file ] ; then echo newer than last read ; else echo older than last read ; fi
newer than last read
$
Of course, if some subset of the files depend upon some other subset of the generated files, this approach won't work at all. (This might be reason enough to avoid this technique; it's up to you.)
Within your Python program, you could also check the file sizes using os.stat() to determine whether or not you should call your comparison routine; if the files are different sizes, you don't really care which bytes changed, so you can skip reading both files. (This would be difficult to do in bash(1) -- I know of no mechanism to get the file size in bash(1) without executing another program, which defeats the whole point of this check.)
The cmp program will do the size comparison internally if (and only if) you are using the --silent flag, both files are regular files, and both files start at the same position (the starting offset is what the --ignore-initial flag adjusts). If you're not using --silent, add it and see what difference it makes.
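Putting that together in the shell, a minimal sketch of the copy-only-if-changed loop might look like this (tmpdir/ and srcdir/ are placeholder names, the generated files are assumed to sit flat in one directory, and it still launches one cmp per pair):
for new in tmpdir/*; do
    old=srcdir/${new##*/}
    # cmp --silent exits non-zero on any difference (and short-circuits on a
    # size mismatch for regular files), so copy only when the files differ.
    if [ ! -e "$old" ] || ! cmp --silent "$new" "$old"; then
        cp -p "$new" "$old"
    fi
done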

Related

Does bash cache sourced files such that it doesn't need to reparse them from disk?

Say I source a bash script from within $PROMPT_COMMAND, which is to say every time Enter is pressed, so quite often. Does bash optimize this somehow when the file hasn't changed?
EDIT:
Just to clarify, I only ask about loading the script's content from disk, not optimizing the code itself.
An example of an optimization one could do manually is to check whether the sourced file has the same modification date and size[1]; if so, don't read the file from disk again, but use the already parsed script from memory and execute that instead. If the file contains only bash function definitions, one could also imagine an optimization where these definitions need not be re-evaluated at all, given that the contents are the same.
Is checking file size and modification date sufficient to determine whether a file has changed? It can certainly be subverted, but given that this is what rsync does by default, it is surely a method worth considering.
[1] If a filesystem also stores checksums for files then this would be an even better way to determine if a file on disk has or hasn't changed.
Just to avoid misunderstandings regarding the term optimization:
It seems you are concerned with the time it takes to load the sourced file from the disk (this special form of optimization is usually called caching)
... not about the time it takes to execute an already loaded file (optimization as done by compilers, e.g. gcc -O2)
As far as I know, bash neither caches file contents nor optimizes scripts. Although the underlying file system or operating system may cache files, bash would still have to parse the cached file again, which probably takes longer than loading it from a modern disk (e.g. an SSD) anyway.
I wouldn't worry too much about such things unless they actually become a problem for you. If they do, you can easily ...
Cache the script yourself
Wrap the entire content of the sourced file in a function definition. Then source the file once on shell startup. After that, you can run the function from memory.
define-my-prompt-command.sh
my_prompt_command() {
# a big script
}
.bashrc
source define-my-prompt-command.sh
PROMPT_COMMAND=my_prompt_command
You can try adding the following snippet to your script:
if ! type reload-sourced-file > /dev/null 2>&1 ; then
    echo "This is run when you first source the file..."
    PROMPT_COMMAND=$'reload-sourced-file\n'"$PROMPT_COMMAND"
    absolute_script_path=$(realpath -e "$BASH_SOURCE")
    script_previous_stat=$(stat -c "%B%y%z" "$absolute_script_path")
fi

reload-sourced-file() {
    local stat
    stat=$(stat -c "%B%y%z" "$absolute_script_path")
    if ! test "$stat" = "$script_previous_stat"; then  # Re-source when the stat output changes.
        script_previous_stat="$stat"
        echo "You'll see this message for the following re-sourcings."
        source "$absolute_script_path"
    fi
}
The script will be re-sourced whenever its stat output changes. Hopefully the stat information is cached by the file system.

zsh: argument list too long: sudo

I have a command I need to run where one of the arguments is a list of comma-separated IDs. The list of IDs is over 50k. I've stored the list of IDs in a file and I'm running the command in the following way:
sudo ./mycommand --ids `cat /tmp/ids.txt`
However I get an error zsh: argument list too long: sudo
This, I believe, is because the kernel has a maximum size for the argument list it can take. One option for me is to manually split the file into smaller pieces (since the IDs are comma-separated I can't just break it evenly) and then run the command once for each file.
Is there a better approach?
ids.txt file looks like this:
24342,24324234,122,54545,565656,234235
Converting comments into a semi-coherent answer.
The file ids.txt contains a single line of comma-separated values, and the total size of the file can be too big to be the argument list to a program.
Under many circumstances, using xargs is the right answer, but it relies on being able to split the input up into manageable chunks of work, and it must be OK to run the program several times to get the job done.
In this case, xargs doesn't help because of the size and format of the file.
It isn't stated absolutely clearly that all the values in the file must be processed in a single invocation of the command. It also isn't absolutely clear whether the list of numbers must all be in a single argument or whether multiple arguments would work instead. If multiple invocations are not an issue, it is feasible to reformat the file so that xargs can split it into manageable chunks, and each chunk can, if need be, be turned back into a single comma-separated argument.
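A rough sketch of that reformat-and-batch approach, assuming multiple invocations of mycommand are acceptable and that batches of (say) 1000 IDs fit within the limit:
tr ',' '\n' < /tmp/ids.txt |     # one ID per line so xargs can split them
  xargs -n 1000 |                # emit batches of 1000 IDs per line (via echo)
  while read -r batch; do
    sudo ./mycommand --ids "${batch// /,}"   # rejoin each batch with commas
  done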
However, it appears that these options are not acceptable. In that case, something has to change.
If you must supply a single argument that is too big for your system, you're hosed until you change something — either the system parameters or your program.
Changing the program is usually easier than reconfiguring the o/s, especially if you take into account reconfiguring upgrades to the o/s.
One option worth reviewing is changing the program to accept a file name instead of the list of numbers on the command line:
sudo ./mycommand --ids-list=/tmp/ids.txt
and the program opens and reads the ID numbers from the file. Note that this preserves the existing --ids …comma,separated,list,of,IDs notation. The use of the = is optional; a space also works.
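If mycommand happens to be a shell script, a purely hypothetical sketch of that option handling (heavily simplified, not the program's real code) might be:
case "$1" in
    --ids-list=*)
        # Read the whole file into one comma-separated list, stripping whitespace.
        ids=$(tr -d '[:space:]' < "${1#--ids-list=}")
        ;;
    --ids)
        ids=$2    # the existing comma-separated list given on the command line
        ;;
esac
# ... work with "$ids" from here on ...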
Indeed, many programs work on the basis that the arguments provided to them are file names to be processed (the Unix filter programs: think grep, sed, sort, cat, …), so simply using:
sudo ./mycommand /tmp/ids.txt
might be sufficient, and you could have multiple files in a single invocation by supplying multiple names:
sudo ./mycommand /tmp/ids1.txt /tmp/ids2.txt /tmp/ids3.txt …
Each file could be processed in turn. Whether the set of files constitutes a single batch operation or each file is its own batch operation depends on what mycommand is really doing.

Prog Challenge - Find paths to files called from configuration files or scripts

I have no idea how to do that, so I come here for help :) Here is what I need. I need to parse some configuration files or bash/sh scripts on a Red Hat Linux system, and look for the paths to the files/commands/scripts meant to be executed by them. The configuration files can have different syntaxes or use different languages.
Here are the files I have to look at:
Config scripts:
/etc/inittab
/var/spool/cron/root
/var/spool/cron/tabs/root
/etc/crontab
/etc/xinetd.conf
Files located under /etc/cron.d/* recursively
Bash / Sh scripts:
Files located under /etc/init.d/* or /etc/rc.d/* recursively. These folders contain only shell scripts so maybe all the other files listed above need separate treatment.
Now here are the challenges I can think of:
The paths within the files may be absolute or relative;
The paths within the files may be at the beginning of lines or preceded by a character such as a space, colon or semicolon;
File paths expressed as arguments to commands/scripts must be ignored;
Paths to directories must be ignored;
Shell functions or built-in commands must be ignored;
Some examples (extracted from /etc/init.d/avahi-daemon):
if [ -s /etc/localtime ]; then
cp -fp /etc/localtime /etc/avahi/etc >/dev/null 2>&1
-> Only /bin/cp and /bin/[ must be returned in the snippet above (they are the only commands actually executed)
AVAHI_BIN=/usr/sbin/avahi-daemon
$AVAHI_BIN -r
-> /usr/sbin/avahi-daemon must be returned, but only because the variable is invoked afterwards.
Note that I do not have access to the actual filesystem, I just have a copy of the files to parse.
After writing this up, I realize how complicated it is and how unlikely a 100% working solution is... But if you like programming challenges :)
The good part is I can use any scripting language: bash/sh/grep/sed/awk, php, python, perl, ruby or a combination of these..
I tried to start writing this up in PHP but I am struggling to get coherent results.
Thanks!
The language you use to implement this doesn't matter. What matters is that the problem is undecidable, because it is equivalent to the halting problem.
Just as we know that it is impossible to determine if a program will halt, it is impossible to know if a program will call another program. For example, you may think your script will invoke X then Z, but if X never returns, Z will never be invoked. Also, you may not notice that your script invokes Y, because the string Y may be determined dynamically and never actually appear in the program text.
There are other problems which may stymie you along the way, too, such as:
python -c 'import subprocess; subprocess.call("ls")'
Now you need a complete parser not only for Bash, but also for Python, not to mention solving the halting problem in Python.
In other words, what you want is not possible. To make it feasible you would have to significantly reduce the scope of the problem, e.g. "Find everything starting with /usr/bin or /bin that isn't in a comment". And it's unclear how useful that would be.
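For what it's worth, a crude sketch of that reduced-scope version (the directories are placeholders for wherever your copies of the files live, and it will still produce false positives and miss dynamically built paths):
grep -rhEv '^[[:space:]]*#' /etc/init.d /etc/rc.d 2>/dev/null |   # drop comment lines
  grep -oE '(/usr)?/s?bin/[A-Za-z0-9._-]+' |                      # path-looking tokens
  sort -u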

Shell script to print files if modified after last executed?

I need to write a Linux shell script that scans a root directory and prints files which were modified after they were last executed.
For example, if File A was executed yesterday and I modify it today, the shell script must print File A. However, if File B was executed yesterday and I haven't modified it since, then File B shouldn't be printed.
Your primary problem is tracking when the files were executed.
The trouble is, Linux does not keep separate track of when a file was executed as opposed to when it was read for other purposes (such as backup, or review), so it is going to be extremely tricky to get going.
There are a variety of tricks that could be considered, but none of them are particularly trivial or inviting. One option might be to enable process accounting. Another might be to modify each script to record when it is executed.
The 'last accessed' time (or atime, or st_atime, based on the name of the field in struct stat that contains the information) doesn't help you because, as already noted, it is modified whenever the file is read. Although an executed file would certainly have been accessed, there may be many read accesses that do not execute the file but that do trigger an update of the access time.
With those caveats in place, it may be that the access time is the best that you can do, and your script needs to look for files where the access time is equal to the modify time (which means the file was modified and has not been accessed since it was modified - neither read nor printed nor executed). It is less than perfect, but it may be the best approximation available, short of a complex execution tracking system.
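For what it's worth, a minimal sketch of that approximation with GNU find (the root path is a placeholder, and the test used is "modification time at least as recent as access time"):
find /path/to/root -type f -printf '%T@ %A@ %p\n' |
  awk '$1 >= $2' |       # mtime not older than atime
  cut -d' ' -f3-         # strip the two timestamps, keep the path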
Once you've got a mechanism in place to track the execution times of files, then you can devise an appropriate means of working out which files were modified since they were last executed.
A Unix system stores three time values for any file:
last access
last modification
last change.
I don't think you can get the last execution time without using some artificial means, such as creating a log or temp file when an executable file runs.
PS: Remember that not every file in Unix is executable, which is probably why a file's last execution timestamp was never stored in the first place.
However if you do want to get these time values then use:
stat -c "%X" file-name # to get last accessed time value as seconds since Epoch
stat -c "%Y" file-name # to get last modified time value as seconds since Epoch
stat -c "%Z" file-name # to get last change time value as seconds since Epoch
It is very hard to do this in shell, simply because it is very hard to get atime or mtime in a sensible format in shell. Consider moving the routine to a more full-featured language like Ruby or Perl:
ruby -e 'puts Dir["**/*"].select{ |file| File.mtime(file) > File.atime(file) }'
Use **/* for all files in the current directory and below, **/*.rb for all Ruby scripts in the current directory and below, /* for all files in the root... you get the pattern.
Take note of what I wrote in a comment to #JohanthanLeffer: UNIX does not differentiate between reading a file and executing it. Thus, printing the script out with cat ./script will have the same effect as executing it with ./script, as far as this procedure is concerned. There is no way to differentiate reading and executing that I can think of, short of making your own kernel.
However, in most cases, you probably won't read the executables; and if you edit them, the save will come after opening, so mtime will still trump atime. The only bad scenario is if you open a file in an editor then exit without saving it (or just view it with less, without modification). As long as you avoid this, the method will work.
Also make note that most editors will not actually modify a file, but create a new file and copy the contents from the old one, then overwrite the old one with the new one. This does not set the mtime, but ctime. Modify the script accordingly, if this is your usage pattern.
EDIT: Apparently, stat can help with the sensible representation. This is in bash:
#!/bin/bash
for FILE in $(find . -type f); do
    # BSD/macOS stat syntax; with GNU coreutils, use: stat -c '%Y' and '%X'
    if [ "$(stat -f '%m' "$FILE")" -gt "$(stat -f '%a' "$FILE")" ]; then
        echo "$FILE"
    fi
done
Replace "find ." (with backticks) with * for just current directory, or /* for root. To use ctime instead of mtime, use %c instead of %m.

Bash script — determine if file modified?

I have a Bash script that repeatedly copies files every 5 seconds. But this is a touch overkill as usually there is no change.
I know about the Linux command watch, but as this script will be used on OS X computers (which don't have watch, and I don't want to make everyone install MacPorts), I need to be able to check whether a file has been modified or not with straight Bash code.
Should I be checking the file modified time? How can I do that?
Edit: I was hoping to expand my script to do more than just copy the file, if it detected a change. So is there a pure-bash way to do this?
I tend to agree with the rsync answer if you have big trees of files to manage, but you can use the -u (--update) flag to cp to copy the file(s) over only if the source is newer than the destination.
cp -u
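For example (file names invented for illustration), assuming your cp supports -u:
cp -u source.txt /path/to/destination/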
Edit
Since you've updated the question to indicate that you'd like to take some additional actions, you'll want to use the -nt check in the [ (test) builtin command:
#!/bin/bash
if [ "$1" -nt "$2" ]; then
    echo "File 1 is newer than file 2"
else
    echo "File 1 is not newer than file 2"
fi
From the man page:
file1 -nt file2
True if file1 is newer (according to modification date) than
file2, or if file1 exists and file2 does not.
Hope that helps.
OS X has the stat command. Something like this should give you the modification time of a file:
stat -f '%m' filename
The GNU equivalent would be:
stat --printf '%Y\n' filename
You might find it more reliable to detect changes in the file content by comparing the file size (if the sizes differ, the content does) and the hash of the contents. It probably doesn't matter much which hash you use for this purpose: SHA1 or even MD5 is probably adequate, and you might find that the cksum command is sufficient.
File modification times can change without the content changing (think touch file); conversely, modification times can be left unchanged even when the content does change (doing this is harder, but you could use touch -r ref-file file to set the modification time of file to match ref-file after editing the file).
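As a rough sketch of that size-plus-checksum idea (the file name and state file are made up for illustration):
file=watched.txt
state_file=.watched.cksum      # holds the "checksum size name" line from the last run

new_state=$(cksum "$file")
if [ "$new_state" != "$(cat "$state_file" 2>/dev/null)" ]; then
    echo "$file has changed"
    # ... copy the file or take whatever other action here ...
    printf '%s\n' "$new_state" > "$state_file"
fi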
No. You should be using rsync or one of its frontends to copy the files, since it will detect if the files are different and only copy them if they are.

Resources