Let me explain the tree structure: I have a network directory into which our database copies new .txt files several times a day. Those files sit in directories named after usernames. On the local disk I have the same structure (directories named after usernames), which needs to be updated with the latest .txt files. It's not a sync procedure: I copy the remote file to a local destination and I don't care what happens to it after that, so I don't need to keep it in sync. However, I do need to copy ONLY the new files and not those that I have already copied. It would look something like:
Remote disk
/mnt/remote/database
+ user1/
+ user2/
+ user3/
+ user4/
Local disk
/var/database
+ user1/
+ user2/
+ user3/
+ user4/
I played with
find /mnt/remote/database/ -type f -mtime +1
and other variants, but it's not working very well.
So, the script I am trying to write does the following:
1- check /mnt/remote/database recursively for *.txt
2- check the files date to see if they are new (since the last time I checked, maybe maintain a text file with the last time checked on it as a reference?)
3- if the file is new, copy it to the proper destination in /var/database (so /mnt/remote/database/user1/somefile.txt will be copied to /var/database/user1/)
I'll run the script through a cron job.
I'm doing this in C right now, but the IT people are not very good in debugging or writing C and if they need to add or fix something they can handle bash scripts better, which I am not very good at.
Any ideas out there?
thank you!
you could consider using local rsync between the input & output directories. it has all the options you want to make its sync policy very flexible.
find /mnt/remote/database/ -type f -newer $TIMESTAMP_FILE | xargs $CP_COMMAND
touch $TIMESTAMP_FILE
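A slightly fuller sketch of the timestamp-file idea, assuming bash and a find that supports -print0 (GNU and BSD find both do). The function name copy_new_txt and the marker-file location are made up for illustration. The new marker is created before the scan starts, so a file that arrives while the copy is running is picked up on the next run instead of being missed.

```shell
#!/bin/bash
# Sketch: copy *.txt files that are newer than a marker file, keeping the
# per-user subdirectory. copy_new_txt and its arguments are illustrative names.
copy_new_txt() {
    local src=${1%/} dst=${2%/} stamp=$3

    # First run: no marker yet, so create a very old one and copy everything.
    [ -e "$stamp" ] || touch -t 197001020000 "$stamp"

    # Record the start time of this scan before looking at any file.
    touch "$stamp.next"

    # -newer compares each file's modification time with the marker's.
    find "$src" -type f -name '*.txt' -newer "$stamp" -print0 |
    while IFS= read -r -d '' file; do
        rel=${file#"$src"/}                  # e.g. user1/somefile.txt
        mkdir -p "$dst/$(dirname "$rel")"
        cp -p "$file" "$dst/$rel"
    done

    # Only now replace the old marker with the one taken at scan start.
    mv "$stamp.next" "$stamp"
}
```

From cron this could be called as copy_new_txt /mnt/remote/database /var/database /var/database/.last_copy, with the marker path being an arbitrary choice.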
The solution is here:
http://www.movingtofreedom.org/2007/04/15/bash-shell-script-copy-only-files-modifed-after-specified-date/
I want to write the following bash script which copies files from one GCS bucket to another with renaming options.
My input folder is gs://test-rtt-integration/result/frd/*.orc
and my destination folder is gs://test-rtt-integration/recent_files/frd
The renaming of the copied file should be done based on the name provided from gs://test-rtt-integration/complex-files/TAN/recent_files/today/frd
once the copy with renaming is done I need to clean gs://test-rtt-integration/result/frd
I tested the following commands, but they are not working properly
NAME = "$(gsutil ls gs://test-rtt-integration/complex-files/TAN/recent_files/today/frd)"
gsutil mv gs://test-rtt-integration/result/frd/*.orc gs://test-rtt-integration/recent_files/frd/$NAME
gsutil rm -rf gs://test-rtt-integration/result/frd
( all .orc files and other files should be deleted)
But this is not working properly as I have to split the NAME based on / and get the last split , so if the result of split is called SPLIT , I have to do gsutil mv gs://test-rtt-integration/result/frd/*.orc gs://test-rtt-integration/recent_files/frd/$SPLIT
Any idea on how to do this?
The question is a little bit confusing. You say that you want to move files from one Google Cloud Storage bucket to another, but all the operations are made in one single bucket called test-rtt-integration.
However, once you have the file location from the command gsutil ls gs://[BUCKET_NAME]/folder, e.g. gs://[BUCKET_NAME]/folder/[FILENAME].orc, the gs://[BUCKET_NAME]/folder/ part is the same for every object in the folder. Replace that prefix with an empty string and you are left with only the object name, [FILENAME].orc.
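The prefix stripping described above can be done with shell parameter expansion, without calling an external tool. A minimal sketch: object_name is a hypothetical helper and the URL is a made-up example.

```shell
#!/bin/bash
# Print the last path component of a gs:// URL (everything after the final "/").
object_name() {
    local url=${1%/}          # drop one trailing slash, if present
    printf '%s\n' "${url##*/}"
}

object_name "gs://my-bucket/folder/part-0001.orc"   # prints part-0001.orc
```

In the question's script this would correspond to SPLIT=$(object_name "$NAME"), assuming gsutil ls returned exactly one line.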
I am not sure if this is exactly what you are looking for, but I did a little bit of coding myself and I have created a bash script that:
1- Gets the name of each object in the gs://[BUCKET_NAME]/from/ bucket folder
2- Copies all objects from the gs://[BUCKET_NAME]/from/ bucket folder to the gs://[BUCKET_NAME]/to/ bucket folder
3- Deletes all objects from the gs://[BUCKET_NAME]/from/ bucket folder
Inside there are comments that explain how every operation works in detail. If that is not exactly what you are looking for, you can get the basic idea of how it works and implement it in a different way that suits you better. I have tested the script myself in Google Cloud Shell and it is working. The example code can be found in GitHub.
This is a theoretical question about minimizing side effects in bash scripting.
I recently used a simple mechanism for formatting a bunch of json files, in a nested directory structure...
for f in `find ./ -name *json`; do echo $f ; python -mjson.tool $f > /tmp/1 && cp /tmp/1 $f ; done.
The mechanism is simply to
format each file using python's mjson.tool,
write it to a tmp location, and
then rewrite it back in place.
Is there a way to do this which is more elegant, i.e. with minimal side effects? I'm assuming bash experts have a better way of doing this sort of thing .
Unix tools work on a streaming basis; they don't store all of the contents of the files in memory at once. Therefore, you have to use an intermediate location, since otherwise you would be overwriting a file that is currently being read from.
You may consider that your snippet isn't fault tolerant. If you make a mistake, you would have just overwritten all your data. You should store the output in a new location, verify, then move to overwrite. :)
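A sketch of the "write elsewhere, verify, then move" advice, assuming bash, mktemp and python3 are available. format_json_in_place is a hypothetical helper. The temporary file is created next to the original so the final mv stays on the same filesystem; note that the replaced file ends up with the temporary file's permissions.

```shell
#!/bin/bash
# Reformat one JSON file in place: write to a temp file next to it and
# replace the original only if the formatter succeeded.
format_json_in_place() {
    local file=$1 tmp
    tmp=$(mktemp "$file.XXXXXX") || return 1
    if python3 -m json.tool "$file" > "$tmp"; then
        mv "$tmp" "$file"
    else
        rm -f "$tmp"          # leave the original untouched on failure
        return 1
    fi
}

# To apply it to every *.json file under the current directory:
# find . -type f -name '*.json' -print0 |
#     while IFS= read -r -d '' f; do format_json_in_place "$f"; done
```

Unlike the shared /tmp/1 in the question, each file gets its own temporary name, so a failed run cannot copy stale output over another file.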
Using Eclipse IDE we can achieve formatting for multiple JSON files
Import the files into eclipse and select the files (you wish to format) or folder(for all the files) and right click -> source -> format
I was looking for something similar and just noticed I can select all JSON files I have in my VSCode file panel and CTRL + Click > "Format". Works like magic for a one-off operation, it's formatting the files in-place.
I have a directory with files. The archive is very big and has 1.5 million pdf files inside.
the directory is stored on an IBM i server with OS V7R1 and the machine is new and very fast.
The files are named like this :
invoice_[custno]_[year]_[invoice_number].pdf
invoice_081500_2013_7534435564.pdf
Now I try to find files with the find command using the shell.
find . -name 'invoice_2013_*.pdf' -type f | ls -l > log.dat
The command took a long time so I aborted the operation with no result.
If I try it with smaller directories all works fine.
Later I want to have a job that runs every day and finds the files created in the last 24 hours, but if it always runs this slowly I can forget about it.
That invocation would never work because ls does not read filenames from stdin.
Possible solutions are:
Use the find utility's built-in list option:
find . -name 'invoice_2013_*.pdf' -type f -ls > log.dat
Use the find utility's -exec option to execute ls -l for each matching file:
find . -name 'invoice_2013_*.pdf' -type f -exec ls -l {} \; > log.dat
Pipe the filenames to the xargs utility and let it execute ls -l with the filenames as parameters:
find . -name 'invoice_2013_*.pdf' -type f | xargs ls -l > log.dat
A pattern search of 1.5 million files in a single directory is going to be inefficient on any filesystem.
For looking only at a list of new entries in the directory, you might consider journaling the directory. You would specify INHERIT(*NO) to prevent journaling all the files in the directory as well. Then you could simply extract the recent journal entries with DSPJRN to find out what objects had been added.
I don't think I'd put more than maybe 15k files in a single directory. Some QShell utilities run into trouble at around 16k files. But I'm not sure I'd store them in a directory in any case, except maybe for ones over 16MB if that's a significant fraction of the total. I'd possibly look to store them in CLOBs/BLOBs in the database first.
Storing as individual streamfile objects brings ownership/authority problems that need to be addressed. Some profile is getting entries into its owned-objects table, and I'd expect that profile to be getting pretty large. Perhaps getting to one or more limits.
By storing in the database, you drop to a single owned object.
Or perhaps a few similar objects... There might be a purging/archiving process that moves rows off to a secondary or tertiary table. Hard to guess how that might need to be structured, if at all.
Saves could also benefit, especially SAVSECDTA and SAV saves. Security data is greatly reduced. And saving a 4GB table is faster than saving a thousand 4MB objects (or whatever the breakdown might be).
Other than determining how the original setup and implementation would go in your environment, the big tricky part could involve volatility. If these are stable objects with relatively few changes and few deletions, it should be okay. But if BLOBs are often modified, it can bring trouble when the table takes up a significant fraction of DASD capacity. It gets particularly rough when it exceeds the size of DASD free space and a re-org is needed. With low volatility, that's much less of a concern.
Typically what is done in such cases is to create subdirectories, perhaps named after the first letter of each file. For example, the file
abcsdsjahdjhfdsfds.xyz would be stored in
/something/a/abcsdsjahdjhfdsfds.xyz
That would cut down on the size of each subdirectory.
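A minimal bash sketch of that layout, with shard_file as a hypothetical helper. It keys on the first character of the file name, as in the example above; for the invoice names in the question every file starts with "i", so a different key (customer number or year, say) would have to be extracted instead.

```shell
#!/bin/bash
# Move a file into a subdirectory named after the first character of its name.
# shard_file and the directory layout are illustrative only.
shard_file() {
    local path=$1 root=${2%/}
    local name=${path##*/}        # file name without its directory
    local bucket=${name:0:1}      # first character of the name
    mkdir -p "$root/$bucket"
    mv "$path" "$root/$bucket/$name"
}
```

For example, shard_file ./abcsdsjahdjhfdsfds.xyz /something would move the file to /something/a/abcsdsjahdjhfdsfds.xyz.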
I am trying to figure out a way to search a directory for a file older than 365 days. If it finds a match, I'd like it to both delete the file and locate any other files in the directory that have the same basename, and delete those as well.
File name examples: 12345.pdf (Search for) then delete, 12345_a.pdf, 12345_xyz.pdf (delete if exist).
Thanks! I am very new to BASH scripting, so patience is appreciated ;-))
I doubt this can be done cleanly in a single pass.
Your best bet is to use -mtime or a variant to collect names and then use another find command to delete files matching those names.
UPDATE
With respect to your comment, I mean something like:
# find basenames of old files
find .... -printf '%f\n' | sort -u > oldfiles
# delete every file whose name appears in the list
while IFS= read -r file; do find . -name "$file" -exec rm {} \; ; done < oldfiles
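A sketch that also handles the related files from the question (12345_a.pdf, 12345_xyz.pdf), assuming bash, a find with -maxdepth and -print0 (GNU or BSD), and that the "main" files are exactly those .pdf names without an underscore. purge_old is a hypothetical name. Replacing rm -f with echo for a first run shows what would be deleted.

```shell
#!/bin/bash
# For every NNN.pdf last modified more than 365 days ago, delete it together
# with its NNN_*.pdf siblings. purge_old is an illustrative helper name.
purge_old() {
    local dir=$1
    find "$dir" -maxdepth 1 -type f -name '*.pdf' ! -name '*_*' -mtime +365 -print0 |
    while IFS= read -r -d '' old; do
        stem=${old%.pdf}
        # The stem is quoted, so only the trailing _*.pdf part is a glob;
        # rm -f stays quiet when no sibling exists.
        rm -f -- "$old" "$stem"_*.pdf
    done
}
```

Matching on "$stem"_*.pdf rather than "$stem"* keeps an unrelated file such as 123456.pdf from being removed along with 12345.pdf.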
I have a computer A with two directory trees. The first directory contains the original mod dates that span back several years. The second directory is a copy of the first with a few additional files. There is a second computer, B, which contains a directory tree that is the same as the second directory on computer A (new mod times and additional files). How can I update the files in the two newer directories on both machines so that the mod times on the files are the same as the original? Note that these directory trees are on the order of tens of gigabytes, so the solution would have to include some method of sending only the date information to the second computer.
The answer by Paul is partly correct, rsync is able to do this, however with different parameters. The correct command is
rsync -Prt --size-only original_dir copy_dir
where -P enables partial transfers and displays a progress indicator, -r recurses through subdirectories, -t preserves time stamps and --size-only doesn't transfer files that match in size.
The following command (BSD/macOS stat syntax) makes sure that TEST2 gets the same modification date that TEST1 has:
touch -t `stat -t '%Y%m%d%H%M.%S' -f '%Sm' TEST1` TEST2
Now instead of using hard-coded values here, you could find the files using the "find" utility and then run touch via SSH on the remote machine. However, that means you may have to enter the password for each file, unless you switch SSH to key-based authentication. I'd rather not do it all in a super fancy one-liner. Instead let's work with temp files. First go to the directory in question and run a find (you can filter by file type, size, extension, whatever pleases you, see "man find" for details; I'm just filtering by type file here to exclude any directories):
find . -type f -print -exec stat -t '%Y%m%d%H%M.%S' -f '%Sm' "{}" \; > /tmp/original_dates.txt
Now we have a file that looks like this (in my example there are only two entries there):
# cat /tmp/original_dates.txt
./test1
200809241840.55
./test2
200809241849.56
Now just copy the file over to the other machine and place it in the directory (so the relative file paths match) and apply the dates:
cat original_dates.txt | (while read FILE && read DATE; do touch -t $DATE "$FILE"; done)
Will also work with file names containing spaces.
One note: I used the last "modification" date in the stat call, as that's what you wrote in the question. However, it rather sounds as if you want to use the "creation" date (every file has a creation date, last modification date and last access date). In that case you need to alter the stat call a bit:
'%Sm' - last modification date
'%SB' - creation (birth) date ('%Sc' is the inode change time, not the creation date)
'%Sa' - last access date
However, touch can only change the modification time and access time, I think it can't change the creation time of a file ... so if that was your real intention, my solution might be sub-optimal... but in that case your question was as well ;-)
I would go through all the files in the source directory tree and gather the modification times from them into a script that I could run on the other directory trees. You will need to be careful about a few 'gotchas'. First, make sure that your output script has relative paths, and make sure you run it from the proper target directory, which should be the root directory of the target tree. Also, when changing machines make sure you are using the same timezone as you were on the machine where you generated the script.
Here's a Perl script I put together that will output the touch commands needed to update the times on the other directory trees. Depending on the target machines, you may need to tweak the date formats or command options, but this should give you a place to start.
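#!/usr/bin/perl
use strict;
use warnings;

# Directory tree whose modification times should be recorded.
my $STARTDIR = "$ENV{HOME}/test";
chdir $STARTDIR or die "Cannot chdir to $STARTDIR: $!";

my @files = `find . -type f`;
chomp @files;

# Emit one touch command per file; run the output from the root of the target tree.
foreach my $file (@files) {
    my $mtime = localtime((stat($file))[9]);
    print qq(touch -m -d "$mtime" "$file"\n);
}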
The other approach you could try is to attach the remote directory using NFS and then copy the times using find and touch -r.
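A sketch of that find plus touch -r approach, assuming bash and that both trees are reachable as local paths (for instance with the remote one mounted over NFS). copy_mtimes is a hypothetical helper; files that exist only in the source tree are skipped.

```shell
#!/bin/bash
# Copy modification times from files under $1 to the same relative paths
# under $2. copy_mtimes is an illustrative helper name.
copy_mtimes() {
    local src=${1%/} dst=${2%/}
    find "$src" -type f -print0 |
    while IFS= read -r -d '' file; do
        rel=${file#"$src"/}
        if [ -e "$dst/$rel" ]; then
            # -r takes the timestamp from a reference file; -m limits the
            # change to the modification time.
            touch -m -r "$file" "$dst/$rel"
        fi
    done
}
```

Because touch -r reads the timestamp straight from the reference file, no date strings are formatted or parsed, which sidesteps the timezone caveat mentioned above.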
I think rsync (with the right options) will do this. It claims to send only file differences, so presumably it will work out that there are no differences to be transferred.
--times preserves the modification times, which is what you want.
See (for instance)
http://linux.die.net/man/1/rsync
Also add -I (--ignore-times, "don't skip files that match size and time") so that all files are "transferred", and trust rsync's file-difference optimisation to make it "fairly efficient"; see the excerpt from the man page below.
-t, --times
This tells rsync to transfer modification times along with the files and update them on the remote system. Note that if this option is not used, the optimization that excludes files that have not been modified cannot be effective; in other words, a missing -t or -a will cause the next transfer to behave as if it used -I, causing all files to be updated (though the rsync algorithm will make the update fairly efficient if the files haven't actually changed, you're much better off using -t).
I used the following Python scripts instead.
Python scripts run much faster than an approach that creates new processes for each file (like using find and stat). The solution below also works in case of timezone differences between systems, as it uses UTC times. It also works with paths containing spaces (but not paths containing a newline!). It doesn't set times for symlinks, because os.path.getmtime and os.utime as used here follow the link and act on the file it points at, and in a file manager the time of the file the symlink points at is shown instead anyway. It uses a maxTime parameter to avoid resetting dates for files that were actually modified after copying from the original directory.
listMTimes.py:
import os
from datetime import datetime
from pytz import utc

for dirpath, dirnames, filenames in os.walk('./'):
    for name in filenames+dirnames:
        path = os.path.join(dirpath, name)
        # Avoid symlinks because os.path.getmtime and os.utime get and
        # set the time of the pointed file, and in the new directory,
        # the link may have been redirected.
        if not os.path.islink(path):
            mtime = datetime.fromtimestamp(os.path.getmtime(path), utc)
            print(mtime.isoformat()+" "+path)
setMTimes.py:
import datetime, fileinput, os, sys, time
import dateutil.parser
from pytz import utc

# Based on
# http://stackoverflow.com/questions/6999726/python-getting-millis-since-epoch-from-datetime
def unix_time(dt):
    epoch = datetime.datetime.fromtimestamp(0, utc)
    delta = dt - epoch
    return delta.total_seconds()

if len(sys.argv) != 2:
    print('Syntax: '+sys.argv[0]+' <maxTime>')
    print(' where <maxTime> an ISO time, e. g. "2013-12-02T23:00+02:00".')
    exit(1)

# A file with modification time newer than maxTime is not reset to
# its original modification time.
maxTime = unix_time(dateutil.parser.parse(sys.argv[1]))

for line in fileinput.input([]):
    (datetimeString, path) = line.rstrip('\r\n').split(' ', 1)
    mtime = dateutil.parser.parse(datetimeString)
    if os.path.exists(path) and not os.path.islink(path):
        if os.path.getmtime(path) <= maxTime:
            os.utime(path, (time.time(), unix_time(mtime)))
Usage: in the first directory (the original) run
python listMTimes.py >/tmp/original_dates.txt
Then in the second directory (a copy of the original, possibly with some files modified/added/deleted) run something like this:
python setMTimes.py 2013-12-02T23:00+02:00 </tmp/original_dates.txt