bash scripting de-dupe - bash

I have a shell script. A cron job runs it once a day. At the moment it just downloads a file from the web using wget, appends a timestamp to the filename, then compresses it. Basic stuff.
This file doesn't change very frequently though, so I want to discard the downloaded file if it already exists.
Easiest way to do this?
Thanks!

Do you really need to compress the file ?
wget provides -N, --timestamping which obviously, turns on time-stamping. What that does is say your file is located at www.example.com/file.txt
The first time you do:
$ wget -N www.example.com/file.txt
[...]
[...] file.txt saved [..size..]
The next time it'll be like this:
$ wget -N www.example.com/file.txt
Server file no newer than local file “file.txt” -- not retrieving.
Except if the file on the server was updated.
That would solve your problem, if you didn't compress the file.
If you really need to compress it, then I guess I'd go with comparing the hash of the new file/archive and the old. What matters in that case is, how big is the downloaded file ? is it worth compressing it first then checking the hashes ? is it worth decompressing the old archive and comparing the hashes ? is it better to store the old hash in a txt file ? do all these have an advantage over overwriting the old file ?
You only know that, make some tests.
So if you go the hash way, consider sha256 and xz (lzma2 algorithm) compression.
I would do something like this (in Bash):
newfilesum="$(wget -q www.example.com/file.txt -O- | tee file.txt | sha256sum)"
oldfilesum="$(xzcat file.txt.xz | sha256sum)"
if [[ $newfilesum != $oldfilesum ]]; then
xz -f file.txt # overwrite with the new compressed data
else
rm file.txt
fi
and that's done;

Calculate a hash of the content of the file and check against the new one. Use for instance md5sum. You only have to save the last MD5 sum to check if the file changed.
Also, take into account that the web is evolving to give more information on pages, that is, metadata. A well-founded web site should include file version and/or date of modification (or a valid, expires header) as part of the response headers. This, and quite other things, is what makes up the scalability of Web 2.0.

How about downloading the file, and checking it against a "last saved" file?
For example, the first time it downloads myfile, and saves it as myfile-[date], and compresses it. It also adds a symbolic link, such as lastfile pointing to myfile-[date]. The next time the script runs, it can check if the contents of whatever lastfile points to is the same as the new downloaded file.
Don't know if this would work well, but it's what I could think of.

You can compare the new file with the last one using the sum command. This takes the checksum of the file. If both files have the same checksum, they are very, very likely to be exactly the same. There's another command called md5 that takes the md5 fingerprint, but the sum command is on all systems.

Related

how to determine if a file is completely downloaded using kqueue?

I want to implement a function which monitor a directory and perform some action when a new file is downloaded from the Internet, but found it difficult to determine if the file is completely downloaded, is there a way to do that?
Usually tools that show the hash of a file will give the state of a file - this should be compared to the hash of another file - if identical then we know the file has downloaded successfully.
md5 (native to bsd) is available - but is only practical on a local file -
If you are retrieving the remote file via HTTP , then there is no way to get the hash of the file without downloading it first (whether it is to STDOUT or piped to file , using wget -O- or curl )
If the file host has a second file that contains the md5 hash of the file being downloaded - then a comparison of the locally downloaded hash is comparable to the hash provided by the file provider.
To do anything more swish will require a comprehensive program to be written - such as the combination of this question and accepted answer :
Python Compare local and remote file MD5 Hash
Besides MD5, there is a simple way to do this:
Partially downloaded file usually has a temporary filename, and it will be renamed to original filename after fully downloaded. You can make your program to ignore or monitor only certain filename extensions.

Merge lines in bash

I would like to write a script that restores a file, but preserving the changes that may be done after the backout file is created.
With more details: at some moment I create a backup of a file (file_orig). Do some changes to the original file as well(file_my_changes). After that, the original file can be changed again (file_additional_changes), but after the restore I want to have the backup file, plus the additional changes (file_orig + file_addtional_changes). In general backing out my changes only.
I am talking about grub.cfg file, so the expected possible changes will be adding or removing parts of a line.
Is it possible this to be done with a bash script?
I have 2 ideas:
Add some comments above the lines I am going to change, and then before the resotore if the line differ from the one from the backed out file, to read the comment, which will tell me what exactly to remove from the line;
If there is a way to display only the part of the line that differs from the file_orig and file_additional_changes, then to replace this line with the line from file_orig + the part that differs. But I am not sure if this is possible to be done at all.
Example"
line1: This is line1
line2: This is another line1
Is it possible to display only "another"?
Of course any other ideas are welcome!
Thank you!
Unclear, but perhaps if you're using a bash script you could run a diff on the 2 edited file and the last one and save that output someplace that you want to keep it? That would mean you have a copy of the changes.
Or just use git like everybody else.
One possibility would be to use POSIX commands patch and
diff.
Create the backup:
cp operational-file operational-file.001
Edit the operational file.
Create a patch from the differences:
diff -u operational-file.001 operational-file > operational-file.patch001
Copy the operational file again.
cp operational-file operational-file.002
Edit the operational file again.
Create a new patch
diff -u operational-file.002 operational-file > operational-file.patch002
If you need to recover but skip the changes from patch.001, then:
cp operational-file.001 operational-file
patch -i patch.002
This would apply just the second set of changes to the original file, as log as there's no overlap.
Consider using a version control system to keep records of the file changes. Consider using date/time stamps instead of version numbers on the file names.

Downloading Only Newest File Using Wget / Curl

How would I use wget or curl to download the newest file in a directory?
This seems really easy, however the filename won't always be predictable, and as new data comes in it'll be replaced with a random filename.
Specifically, the directory I wish to download data from has the following naming structure, where the last string of characters is a randomly generated timestamp:
MRMS_RotationTrackML1440min_00.50_20160530-175837.grib2.gz
MRMS_RotationTrackML1440min_00.50_20160530-182639.grib2.gz
MRMS_RotationTrackML1440min_00.50_20160530-185637.grib2.gz
The randomly generated timestamp is in the format of: {hour}{minute}{second}
The directory in question is here: http://mrms.ncep.noaa.gov/data/2D/RotationTrackML1440min/
Could it have to be something with something in the headers, where you'd use curl to sift through the last-modified timestamp?
Any help would be appreciated here, thanks in advance.
You can just run following command periodically:
wget -r -nc --level=1 http://mrms.ncep.noaa.gov/data/2D/RotationTrackML1440min/
It will download recursively whatever is new in the directory after last run.

how to read / copy a tmp file which exists for a very short time

my nginx creates tmp files for requests which are bigger than 16kb. I am trying to read this tmp files but they only exist for a rly short period of time (1ms?). Is there unix command / programm which can help me to read this files before they are gone?
the ngnix warning message looks like
a client request body is buffered to a temporary file /var/lib/nginx/body/0000001851
EDIT
i am not in the position to alter the ngnix source code nor am i able to edit the source code of the request origin. I just want to take a look at this files for debugging purpose as i cant imagine what kind of request will bloat up to 16k
In general you'll probably want to get nginx's assistance for this or if that's not possible and it's really important change the source code as Leo suggests.
There is one cringe-inducing, wtf-provoking trick which I am mentioning as a curiosity. You can set the append-only mode on the directory. If your filesystem supports it you can say:
chattr +a mydir
Your process will be able to create stuff inside but not remove it. Then at your leisure you can use inotify_wait to monitor the directory for changes. I don't know of any clean ways to remove the files though.
Well you could try parsing the output with something like:
stdbuf -oL nginx 2>&1 |
grep -F --line-buffered \
"a client request body is buffered to a temporary file" | {
while read -a line
cp line[${#line[#]}-1] /dest/path
}
Although you might find that this is too slow and the file is gone before you can copy it.
A better solution might be to use inotify. inotify_wait as mentioned by cnicutar would work. You could try the following:
while true
do
file=$(inotifywait -e create --format %f -r /var/lib/nginx/body/)
cp "/var/lib/nginx/body/$file" "/dest/path/$file"
done
If you don't get what you are looking for (eg if the files are copied before all the data is written), you could experiment with different events instead of create, maybe close, close_write or modify.

Sync File Modification Time Across Multiple Directories

I have a computer A with two directory trees. The first directory contains the original mod dates that span back several years. The second directory is a copy of the first with a few additional files. There is a second computer be which contains a directory tree which is the same as the second directory on computer A (new mod times and additional files). How update the files in the two newer directories on both machines so that the mod times on the files are the same as the original? Note that these directory trees are in the order of 10s of gigabytes so the solution would have to include some method of sending only the date information to the second computer.
The answer by Paul is partly correct, rsync is able to do this, however with different parameters. The correct command is
rsync -Prt --size-only original_dir copy_dir
where -P enables partial transfers and displays a progress indicator, -r recurses through subdirectories, -t preserves time stamps and --size-only doesn't transfer files that match in size.
The following command will make sure that TEST2 gets the same date assigned that TEST1 has
touch -t `stat -t '%Y%m%d%H%M.%S' -f '%Sa' TEST1` TEST2
Now instead of using hard-coded values here, you could find the files using "find" utility and then run touch via SSH on the remote machine. However, that means you may have to enter the password for each file, unless you switch SSH to cert authentication. I'd rather not do it all in a super fancy one-liner. Instead let's work with temp files. First go to the directory in question and run a find (you can filter by file type, size, extension, whatever pleases you, see "man find" for details. I'm just filtering by type file here to exclude any directories):
find . -type f -print -exec stat -t '%Y%m%d%H%M.%S' -f '%Sm' "{}" \; > /tmp/original_dates.txt
Now we have a file that looks like this (in my example there are only two entries there):
# cat /tmp/original_dates.txt
./test1
200809241840.55
./test2
200809241849.56
Now just copy the file over to the other machine and place it in the directory (so the relative file paths match) and apply the dates:
cat original_dates.txt | (while read FILE && read DATE; do touch -t $DATE "$FILE"; done)
Will also work with file names containing spaces.
One note: I used the last "modification" date at stat, as that's what you wrote in the question. However, it rather sounds as if you want to use the "creation" date (every file has a creation date, last modification date and last access date), you need to alter the stat call a bit.
'%Sm' - last modification date
'%Sc' - creation date
'%Sa' - last access date
However, touch can only change the modification time and access time, I think it can't change the creation time of a file ... so if that was your real intention, my solution might be sub-optimal... but in that case your question was as well ;-)
I would go through all the files in the source directory tree and gather the modification times from them into a script that I could run on the other directory trees. You will need to be careful about a few 'gotchas'. First, make sure that your output script has relative paths, and make sure you run it from the proper target directory, which should be the root directory of the target tree. Also, when changing machines make sure you are using the same timezone as you were on the machine where you generated the script.
Here's a Perl script I put together that will output the touch commands needed to update the times on the other directory trees. Depending on the target machines, you may need to tweak the date formats or command options, but this should give you a place to start.
#!/usr/bin/perl
my $STARTDIR="$HOME/test";
chdir $STARTDIR;
my #files = `find . -type f`;
chomp #files;
foreach my $file (#files) {
my $mtime = localtime((stat($file))[9]);
print qq(touch -m -d "$mtime" "$file"\n);
}
The other approach you could try is to attach the remote directory using NFS and then copy the times using find and touch -r.
I think rsync (with the right options)
will do this - it claims to only send
file differences, so presumably will
work out that there are no differences
to be transferred.
--times preserves the modification times, which is what you want.
See (for instance)
http://linux.die.net/man/1/rsync
Also add -I, --ignore-times don't skip files that match size and time
so that all files are "transferred' and trust to rsync's file differences optimisation to make it "fairly efficient" - see excerpt from the man page below.
-t, --times
This tells rsync to transfer modification times along with the files and update them on the remote system. Note that if this option is not used, the optimization that excludes files that have not been modified cannot be effective; in other words, a missing -t or -a will cause the next transfer to behave as if it used -I, causing all files to be updated (though the rsync algorithm will make the update fairly efficient if the files haven't actually changed, you're much better off using -t).
I used the following Python scripts instead.
Python scripts run much faster than an approach creating new processes for each file (like using find and stat). The solution below also works in case of timezone differences between systems, as it uses UTC times. It also works with paths containing spaces (but not paths containing newline!). It doesn't set times for symlinks, because the operating system provides no mechanism to modify the timestamp of a symlink, but in a file manager the time of the file the symlink points at is shown instead anyway. It uses a maxTime parameter to avoid resetting dates for files that are actually modified after copying from the original directory.
listMTimes.py:
import os
from datetime import datetime
from pytz import utc
for dirpath, dirnames, filenames in os.walk('./'):
for name in filenames+dirnames:
path = os.path.join(dirpath, name)
# Avoid symlinks because os.path.getmtime and os.utime get and
# set the time of the pointed file, and in the new directory,
# the link may have been redirected.
if not os.path.islink(path):
mtime = datetime.fromtimestamp(os.path.getmtime(path), utc)
print(mtime.isoformat()+" "+path)
setMTimes.py:
import datetime, fileinput, os, sys, time
import dateutil.parser
from pytz import utc
# Based on
# http://stackoverflow.com/questions/6999726/python-getting-millis-since-epoch-from-datetime
def unix_time(dt):
epoch = datetime.datetime.fromtimestamp(0, utc)
delta = dt - epoch
return delta.total_seconds()
if len(sys.argv) != 2:
print('Syntax: '+sys.argv[0]+' <maxTime>')
print(' where <maxTime> an ISO time, e. g. "2013-12-02T23:00+02:00".')
exit(1)
# A file with modification time newer than maxTime is not reset to
# its original modification time.
maxTime = unix_time(dateutil.parser.parse(sys.argv[1]))
for line in fileinput.input([]):
(datetimeString, path) = line.rstrip('\r\n').split(' ', 1)
mtime = dateutil.parser.parse(datetimeString)
if os.path.exists(path) and not os.path.islink(path):
if os.path.getmtime(path) <= maxTime:
os.utime(path, (time.time(), unix_time(mtime)))
Usage: in the first directory (the original) run
python listMTimes.py >/tmp/original_dates.txt
Then in the second directory (a copy of the original, possibly with some files modified/added/deleted) run something like this:
python setMTimes.py 2013-12-02T23:00+02:00 </tmp/original_dates.txt

Resources