How to determine when a file was created - shell

I have a .jar file that is compiled on a server and is later copied down to a local machine. Doing ls -lon the local machine just gives me the time it was copied down onto the local machine, which could be much later than when it was created on the server. Is there a way to find that time on the command line?

UNIX-like systems do not record file creation time.
Each directory entry has 3 timestamps, all of which can be shown by running the stat command or by providing options to ls -l:
Last modification time (ls -l)
Last access time (ls -lu)
Last status (inode) change time (ls -lc)
For example, if you create a file, wait a few minutes, then update it, read it, and do a chmod to change its permissions, there will be no record in the file system of the time you created it.
If you're careful about how you copy the file to the local machine (for example, using scp -p rather than just scp), you might be able to avoid updating the modification time. I presume that a .jar file probably won't be modified after it's first created, so the modification time might be good enough.
Or, as Etan Reisner suggests in a comment, there might be useful information in the .jar file itself (which is basically a zip file). I don't know enough about .jar files to comment further on that.

wget and curl have options that allow you to preserve the file's modified time stamp. This is close enough to what I was looking for.

Related

Can "rsync --append" replace files that are larger at destination?

I have an rsync job that moves log files from a web server to an archive. The server rotates its own logs, so I might see a structure like this:
/logs
error.log
error.log.20200420
error.log.20200419
error.log.20200418
I use rsync to sync these log files every few minutes:
rsync --append --size-only /foo/logs/* /mnt/logs/
This command syncs everything with the least amount of processing. And it's important - calculating checksums or writing an entire file every time a few lines are added is a no-go. But it ignores files if there is a larger version on the server instead of replacing them:
man rsync:
--append [...] If a file needs to be transferred and its size on the receiver is the
same or longer than the size on the sender, the file is skipped.
Is there a way to tell rsync to replace files instead in this case? Using --append is important for me and works well for other log files that use unique filenames. Maybe there's a better tool for this?
The service is a packaged application that I can't really edit or configure unfortunately, so changing the file structure or paths isn't an option for me.

Retrieving latest file in a directory from a remote server

I was hoping to crack this myself, but it seems I have fallen at the first hurdle because I can't make head nor tale of other options I've read about.
I wish to access a database file hosted as follows (i.e. the hhsuite_dbs is a folder containing several databases)
http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/pdb70_08Oct15.tgz
Periodically, they update these databases, and so I want to download the lastest version. My plan is to run a bash script via cron, most likely monthly (though I've yet to even tackle the scheduling aspect of the task).
I believe the database is refreshed fortnightly, so if my script runs monthly I can expect there to be a new version. I'll then be running downstream programs that require the database.
My question is then, how do I go about retrieving this (and for a little more finesse I'd perhaps like to be able to check whether the remote file has changed in name or content to avoid a large download if unnecessary)? Is the best approach to query the name of the file, or the file property of date last modified (given that they may change the naming syntax of the file too?). To my naive brain, some kind of globbing of the pdb70 (something I think I can rely on to be in the filename) then pulled down with wget was all I had come up with so far.
EDIT Another confounding issue that has just occurred to me is that the file I want wont necessarily be the newest in the folder (as there are other types of databases there too), but rather, I need the newest version of, in this case, the pdb70 database.
Solutions I've looked at so far have mentioned weex, lftp, curlftpls but all of these seem to suggest logins/passwords for the server which I don't have/need if I was to just download it via the web. I've also seen mention of rsync, but of a cursory read it seems like people are steering clear of it for FTP uses.
Quite a few barriers in your way for this.
My first suggestion is that rather than getting the filename itself, you simply mirror the directory using wget, which should already be installed on your Ubuntu system, and let wget figure out what to download.
base="http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/"
cd /some/place/safe/
wget --mirror -nd "$base"
And new files will be created in the "safe" directory.
But that just gets you your mirror. You're still after is the "newest" file.
Luckily, wget sets the datestamp of files it downloads, if it can. So after mirroring, you might be able to do something like:
newestfile=$(ls -t /some/place/safe/pdb70*gz | head -1)
Note that this fails if ever there are newlines in the filename.
Another possibility might be to check the difference between the current file list and the last one. Something like this:
#!/bin/bash
base="http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/"
cd /some/place/safe/
wget --mirror -nd "$base"
rm index.html* *.gif # remove debris from mirroring an index
ls > /tmp/filelist.txt.$$
if [ -f /tmp/filelist.txt ]; then
echo "Difference since last check:"
diff /tmp/filelist.txt /tmp/filelist.txt.$$
fi
mv /tmp/filelist.txt.$$ /tmp/filelist.txt
You can parse the diff output (man diff for more options) to determine what file has been added.
Of course, with a solution like this, you could run your script every day and hopefully download a new update within a day of it being ready, rather than a fortnight later. Nice thing about --mirror is that it won't download files that are already on-hand.
Oh, and I haven't tested what I've written here. That's one monstrously large file.

Block Level Copying and Rsync

I am trying to use grsync (A GUI for rsync) for Windows to run backups. In the directory that I am backing up there are many larger files that are updated periodically. I would like to be able to sync just the changes to those files and not the entire file each backup. I was under the impression that rsync is a block-level file copier and would only copy the bytes that had changed between each sync. Perhaps this is not the case, or I have misunderstood what block-level file coping is!
To test this I used grsync to synchronize a 5GB zip file between two directories. Then I added a very small text file to the zip file and ran grsync again. However it proceeded to copy over the entire zip file again. Is there a utility that would only copy over the changes to this zip file and not the entire file again? Or is there a command within grsync that could be used to this effect?
The reason the entire file was copied is simply that the algorithm that handles block-level changes is disabled when copying between two directories on a local filesystem.
This would have worked, because the file is being copied (or updated) to a remote system:
rsync -av big_file.zip remote_host:
This will not use the "delta" algorithm and the entire file will be copied:
rsync -av big_file.zip D:\target\folder\
Some notes
Even if the target is a network share, rsync will treat it as path of your local filesystem and will disable the "delta" (block changes) algorithm.
Adding data to the beginning or middle of a data file will not upset the algorithm that handles the block-level changes.
Rationale
The delta algorithm is disabled when copying between two local targets because it needs to read both the source and the destination file completely in order to determine which blocks need changing. The rationale is that the time taken to read the target file is much the same as just writing to it, and so there's no point reading it first.
Workaround
If you know for definite that reading from your target filesystem is significantly faster than writing to it you can force the block-level algorithm to run by including the --no-whole-file flag.
If you add a file to a zip the entire zip file can change if the file was added as the first file in the archive. The entire archive will shift. so yours is not a valid test.
I was just looking for this myself, I think you have to use
rsync -av --inplace
for this to work.

How to create a batch file in Mac?

I need to find a solution at work to backup specific folders daily, hopefully to a RAR or ZIP file.
If it was on PC, I would have done it already. But I don't have any idea to how to approach it on a Mac.
What I basically want to achieve is an automated task, that can be run with an executable, that does:
compress a specific directory (/Volumes/Audio/Shoko) to a rar or zip file.
(in the zip file exclude all *.wav files in all sub Directories and a directory names "Videos").
move It to a network share (/Volumes/Post Shared/Backup From Sound).
(or compress directly to this folder).
automate the file name of the Zip file with dynamic date and time (so no duplicate file names).
Shutdown Mac when finished.
I want to say again, I don't usually use Mac, so things like what kind of file to open for the script, and stuff like that is not trivial for me, yet.
I have tried to put Mark's bash lines (from the first answer, below) in a txt file and executed it, but it had errors and didn't work.
I also tried to use Automator, but it's too plain, no advanced options.
How can I accomplish this?
I would love a working example :)
Thank You,
Dave
You can just make a bash script that does the backup and then you can either double-click it or run it on a schedule. I don't know your paths and/or tools of choice, but some thing along these lines:
#!/bin/bash
FILENAME=`date +"/Volumes/path/to/network/share/Backup/%Y-%m-%d.tgz"`
cd /directory/to/backup || exit 1
tar -cvz "$FILENAME" .
You can save that on your Desktop as backup and then go in Terminal and type:
chmod +x ~/Desktop/backup
to make it executable. Then you can just double click on it - obviously after changing the paths to reflect what you want to backup and where to.
Also, you may prefer to use some other tools - such as rsync but the method is the same.

How can I find a directory-diff of millions of files to script maintenance?

I have been working on how to verify that millions of files that were on file system A have infact been moved to file system B. While working on a system migration, it became evident that all the files needed to be audited to prove that the files have been moved. The files were initially moved via rsync, which does provide logs, although not in a format that is helpful for doing an audit. So, I wrote this script to index all the files on System A:
#!/bin/bash
# Get directories and file list to be used to verify proper file moves have worked successfully.
LOGDATE=`/usr/bin/date +%Y-%m-%d`
FILE_LIST_OUT=/mounts/A_files_$LOGDATE.txt
MOUNT_POINTS="/mounts/AA mounts/AB"
touch $FILE_LIST_OUT
echo TYPE,USER,GROUP,BYTES,OCTAL,OCTETS,FILE_NAME > $FILE_LIST_OUT
for directory in $MOUNT_POINTS; do
# format: type,user,group,bytes,octal,octets,file_name
gfind $directory -mount -printf "%y","%u","%g","%s","%m","%p\n" >> $FILE_LIST_OUT
done
The file indexing works fine and takes about two hours to index ~30 million files.
On side B is where we run into issues. I have written a very simple shell script that reads the index file, tests to see if the file is there, and then counts up how many files are there, but it's running out of memory while looping through the 30 million lines on indexed file names. Effectively doing this little bit of code below through a while loop, and counters to increment for files found and not found.
if [ -f "$TYPE" "$FILENAME" ] ; then
print file found
++
else
file not found
++
fi
My questions are:
Can a shell script do this type of reporting from such a large list. A 64 bit unix system ran out of memory while trying to execute this script. I have already considered breaking up the input script into smaller chunks to make it faster. Currently it can
If as shell script is inappropriate, what would you suggest?
You just used rsync, use it again...
--ignore-existing
This tells rsync to skip updating files that already exist on the destination (this does not ignore existing directories, or nothing would get done). See also --existing.
This option is a transfer rule, not an exclude, so it doesn’t affect the data that goes into the file-lists, and thus it doesn’t affect deletions. It just limits the files that the receiver requests to be transferred.
This option can be useful for those doing backups using the --link-dest option when they need to continue a backup run that got interrupted. Since a --link-dest run is copied into a new directory hierarchy (when it is used properly), using --ignore existing will ensure that the already-handled files don’t get tweaked (which avoids a change in permissions on the hard-linked files). This does mean that this option is only looking at the existing files in the destination hierarchy itself.
That will actually fix any problems (at least in the same sense that any diff-list on file-exist tests could fix problem. Using --ignore-existing means rsync only does the file-exist tests (so it'll construct the diff list as you request and use it internally). If you just want information on the differences, check --dry-run, and --itemize-changes.
Lets say you have two directories, foo and bar. Let's say bar has three files, 1,2, and 3. Let's say that bar, has a directory quz, which has a file 1. The directory foo is empty:
Now, here is the result,
$ rsync -ri --dry-run --ignore-existing ./bar/ ./foo/
>f+++++++++ 1
>f+++++++++ 2
>f+++++++++ 3
cd+++++++++ quz/
>f+++++++++ quz/1
Note, you're not interested in the cd+++++++++ -- that's just showing you that rsync issued a chdir. Now, let's add a file in foo called 1, and let's use grep to remove the chdir(s),
$ rsync -ri --dry-run --ignore-existing ./bar/ ./foo/ | grep -v '^cd'
>f+++++++++ 2
>f+++++++++ 3
>f+++++++++ quz/1
f is for file. The +++++++++ means the file doesn't exist in the DEST dir.
Here is the bonus, remove --dry-run, and, it'll go ahead and make the changes for you.
Have you considered a solution such as kdiff3, which will diff directories of files ?
Note the feature for version 0.9.84
Directory-Comparison: Option "Full Analysis" allows to show the number
of solved vs. unsolved conflicts or deltas vs. whitespace-changes in
the directory tree.
There is absolutely no problem reading a 30 million line file in a shell script. The reason why your process failed was most likely that you tried to read the file entirely into memory, e.g. by doing something wrong like for i in $(cat file).
The correct way of reading a file is:
while IFS= read -r line
do
echo "Something with $line"
done < someFile
A shell script is inappropriate, yes. You should be using a diff tool:
diff -rNq /original /new
If you're not particular about the solution being a script, you could also look into meld, which would let you diff directory trees quite easily and you can also set ignore patterns if you have any.

Resources