Script to copy files that are on CD but not on hard disk to a new directory - Windows

I need to copy files from a set of CDs that contain a lot of duplicate content, both across the CDs and relative to what's already on my hard disk. The file names of identical files are not the same, and they sit in sub-directories with different names. I want to copy the non-duplicate files from the CDs into a new directory on the hard disk. I don't care about the sub-directories - I will sort that out later - I just want the unique files.
I can't find software to do that - see my post at SuperUser https://superuser.com/questions/129944/software-to-copy-non-duplicate-files-from-cd-dvd
Someone at SuperUser suggested I write a script using GNU's "find" and the Win32 version of some checksum tools. I glanced at that, and have not done anything like that before. I'm hoping something exists that I can modify.
I found a good program to delete duplicates, Duplicate Cleaner (it compares checksums), but it won't help me here: I'd have to copy all the CDs to disk first, and each is probably about 80% duplicates. I don't have room for that - I'd have to cycle through a few CDs at a time, copying everything and then turning around and deleting 80% of it, working the hard drive a lot.
Thanks for any help.

I don't use Windows, but I'll give a suggestion: a combination of GNU find and a Lua script. For find you can try
find / -exec md5sum '{}' ';'
If your GNU software includes xargs the following will be equivalent but may be significantly faster:
find / -print0 | xargs -0 md5sum
This will give you a list of checksums and corresponding filenames. We'll throw away the filenames and keep the checksums:
#!/usr/bin/env lua
-- read "checksum  filename" lines from standard input (the hard-disk scan)
-- and remember every checksum already present on disk
local checksums = {}
for l in io.lines() do
    local checksum, pathname = l:match('^(%S+)%s+(.*)$')
    checksums[checksum] = true
end
-- scan the CD and copy each file whose checksum has not been seen before
local cdfiles = assert(io.popen('find e:/ -print0 | xargs -0 md5sum'))
for l in cdfiles:lines() do
    local checksum, pathname = l:match('^(%S+)%s+(.*)$')
    if not checksums[checksum] then
        io.stderr:write('copying file ', pathname, '\n')
        os.execute('cp ' .. pathname .. ' c:/files/from/cd')
        checksums[checksum] = true
    end
end
You can then pipe the output from
find / -print0 | xargs -0 md5sum
into this script.
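For example, if the Lua code above is saved as dedupe.lua (the name is just a placeholder), the whole run could look like this; -type f is added so md5sum only sees regular files:
find / -type f -print0 | xargs -0 md5sum | lua dedupe.lua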
There are a few problems:
If the filename has special characters, it will need to be quoted. I don't know the quoting conventions on Windows.
It would be more efficient to write the checksums to disk rather than to run find every time. You could try
local csums = assert(io.open('/tmp/checksums', 'w'))
for cs in pairs(checksums) do csums:write(cs, '\n') end
csums:close()
And then read checksums back in from the file using io.lines again.
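Equivalently, you can build the hard-disk checksum list once from the shell and feed it to the script on later runs (again assuming the placeholder script name dedupe.lua):
find / -type f -print0 | xargs -0 md5sum > /tmp/checksums
lua dedupe.lua < /tmp/checksums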
I hope this is enough to get you started. You can download Lua from http://lua.org, and I recommend the superb book Programming in Lua (check out the previous edition free online).

Related

Why is my bash script randomly selecting files and renaming them?

On my Mac I am writing a terminal script that renames some files in a folder with an incrementing counter. The files have a number at the end and, read from the top down, they are in order:
foobar101.png
foobar107.png
foobar115.png
foobar121.png
foobar127.png
foobar133.png
foobar141.png
foobar145.png
foobar151.png
foobar155.png
When I create and run my loop it works:
DIR="/customlocation/on/mac"
add=1;
for thefile in $(find $DIR -name "*.png" ); do
    cd $DIR
    mv -v "${thefile}" foobar"${add}".png
    ((add++))
done
However, when it runs the increment it's not as expected:
foobar101.png -> need foobar1.png but is foobar10.png
foobar107.png -> need foobar2.png but is foobar3.png
foobar115.png -> need foobar3.png but is foobar4.png
foobar121.png -> need foobar4.png but is foobar2.png
foobar127.png -> need foobar5.png but is foobar9.png
foobar133.png -> need foobar6.png but is foobar6.png
foobar141.png -> need foobar7.png but is foobar1.png
foobar145.png -> need foobar8.png but is foobar5.png
foobar151.png -> need foobar9.png but is foobar8.png
foobar155.png -> need foobar10.png but is foobar7.png
I've tried searching on SO, Linux/Unix, Ask Ubuntu, and SuperUser, but I don't see any questions that solve the issue of controlling the increment, and I don't know if there is something in particular I should be looking at. So how can I control the increment starting from the lowest number/filename, instead of the Mac renaming with a seemingly random increment, so that I get the desired output?
EDIT:
After a comment from Etan I looked into the numerical values at the end: some of the files are named foobarXXXX, and that is the issue. The answer below, while awesome and a new approach I will look into, still produces the same outcome because of those other files. If I remove all files named foobarXXXX and only leave files named foobarXXX, both my code and the code in fedorqui's answer work. Is there a way I can handle this inside the loop, or do I have to examine every name, test the length of the numeric part, and adjust accordingly?
You cannot rely on the order of a find command, which returns entries in whatever order the filesystem (VFS) hands them to it.
You may, instead, want to sort it:
DIR="/customlocation/on/mac"
add=1;
while IFS= read -r thefile; do
    cd $DIR
    mv -v "${thefile}" foobar"${add}".png
    ((add++))
done < <(find $DIR -name "*.png" | sort)
#-------^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note this uses process substitution, which feeds the while loop:
Process substitution is a form of redirection where the input or output of a process (some sequence of commands) appears as a temporary file.
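A minimal, self-contained illustration of the construct:
# the loop reads the command's output as if it were a file
while IFS= read -r line; do
    echo "got: $line"
done < <(printf '%s\n' one two three)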

Grep zip files in Windows - I have a process that works, but could it be faster?

I have seen posts about zipgrep for Linux.
For example - grep -f on files in a zipped folder
rem zipgrep -s "pattern" TestZipFolder.zip
rem zipgrep [egrep_options] pattern file[.zip] [file(s) ...] [-x xfile(s) ...]
Using Google, I did find http://www.info-zip.org/mans/zipgrep.html, but looking in the Info-ZIP archives I don't see zipgrep in there. It also seems the Info-ZIP binaries/code have not been updated in quite a while. I suppose I could grab some of their source and compile it myself.
I also looked on the Cygwin site and see they are toying with this as well.
Here is what I am using today. Just wondering if I could make this faster:
D:\WORK\Scripts\unzip -c D:\Logs\ArchiveTemp\%computername%-04-07-2014-??-00-00-compressed.zip server.log.* | D:\WORK\Scripts\grep -i ">somestring<" >> somestring.txt
A couple of issues with the code I have posted:
* Does not show which log file the string is in
* Does not show which zip file the string is in
While the command I posted works, it has a lot of room for improvement.
Not much headroom for optimization, but it is worth noting that different implementations of unzip vary in performance. For speed on Windows, decompress the zip file using 7-Zip or the Cygwin unzip utility (obtainable via setup -nqP unzip, or through the setup GUI).
After unzipping, fgrep the directory structure recursively using grep -r.
In summary:
1) copy the zip file to fooCopy.zip
2) unzip fooCopy.zip
3) fgrep -r "regular expression" fooCopy
Rationale: because the file is compressed, you would have to incrementally uncompress the pieces to grep them anyway. Doing it as one batch job is faster, and clearer for someone else to understand.
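As a rough sketch of those three steps for the original example, assuming 7-Zip's 7z.exe and a Windows build of GNU grep are on the PATH (the paths are placeholders); since grep -r prefixes each match with the path of the extracted file, this also tells you which log file the string was found in:
copy D:\Logs\ArchiveTemp\original.zip D:\Work\fooCopy.zip
7z x D:\Work\fooCopy.zip -oD:\Work\fooCopy
grep -r -i ">somestring<" D:\Work\fooCopy >> somestring.txt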

find files in huge directory - very slow

I have a directory of files. The archive is very big and has 1.5 million PDF files inside.
The directory is stored on an IBM i server running OS V7R1; the machine is new and very fast.
The files are named like this :
invoice_[custno]_[year]_[invoice_number].pdf
invoice_081500_2013_7534435564.pdf
Now I try to find files with the find command in the shell.
find . -name 'invoice_2013_*.pdf' -type f | ls -l > log.dat
The command took a long time so I aborted the operation with no result.
If I try it with smaller directories all works fine.
Later I want to have a job that runs every day and finds the files created in the last 24 hours, but it always runs so slowly that I can forget about this.
That invocation would never work because ls does not read filenames from stdin.
Possible solutions are:
Use the find utility's built-in list option:
find . -name 'invoice_2013_*.pdf' -type f -ls > log.dat
Use the find utility's -exec option to execute ls -l for each matching file:
find . -name 'invoice_2013_*.pdf' -type f -exec ls -l {} \; > log.dat
Pipe the filenames to the xargs utility and let it execute ls -l with the filenames as parameters:
find . -name 'invoice_2013_*.pdf' -type f | xargs ls -l > log.dat
A pattern search of 1.5 million files in a single directory is going to be inefficient on any filesystem.
For looking only at a list of new entries in the directory, you might consider journaling the directory. You would specify INHERIT(*NO) to prevent journaling all the files in the directory as well. Then you could simply extract the recent journal entries with DSPJRN to find out what objects had been added.
I don't think I'd put more than maybe 15k files in a single directory. Some QShell utilities run into trouble at around 16k files. But I'm not sure I'd store them in a directory in any case, except maybe for ones over 16MB if that's a significant fraction of the total. I'd possibly look to store them in CLOBs/BLOBs in the database first.
Storing as individual streamfile objects brings ownership/authority problems that need to be addressed. Some profile is getting entries into its owned-objects table, and I'd expect that profile to be getting pretty large. Perhaps getting to one or more limits.
By storing in the database, you drop to a single owned object.
Or perhaps a few similar objects... There might be a purging/archiving process that moves rows off to a secondary or tertiary table. Hard to guess how that might need to be structured, if at all.
Saves could also benefit, especially SAVSECDTA and SAV saves. Security data is greatly reduced. And saving a 4GB table is faster than saving a thousand 4MB objects (or whatever the breakdown might be).
Other than determining how the original setup and implementation would go in your environment, the big tricky part could involve volatility. If these are stable objects with relatively few changes and few deletions, it should be okay. But if BLOBs are often modified, it can bring trouble when the table takes up a significant fraction of DASD capacity. It gets particularly rough when it exceeds the size of DASD free space and a re-org is needed. With low volatility, that's much less of a concern.
Typically what is done in such cases is to create subdirectories -- perhaps keyed on the first letter of each file name. For example, the file abcsdsjahdjhfdsfds.xyz would be stored in
/something/a/abcsdsjahdjhfdsfds.xyz
That would cut down on the size of each subdirectory.
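A rough shell sketch of that layout, assuming the flat directory is the hypothetical /something and that the first character is a good enough key (for the invoice files above, a slice of the customer number or the year would spread them more evenly):
cd /something || exit 1
for f in *; do
    [ -f "$f" ] || continue                # skip the bucket subdirectories themselves
    first=$(printf '%s' "$f" | cut -c1)    # first character of the file name
    mkdir -p -- "$first"
    mv -- "$f" "$first/"
done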

Bash script to find file older than X days, then subsequently delete it, and any files with the same base name?

I am trying to figure out a way to search a directory for a file older than 365 days. If it finds a match, I'd like it to both delete the file and locate any other files in the directory that have the same basename, and delete those as well.
File name examples: search for 12345.pdf and delete it, then delete 12345_a.pdf and 12345_xyz.pdf if they exist.
Thanks! I am very new to BASH scripting, so patience is appreciated ;-))
I doubt this can be done cleanly in a single pass.
Your best bet is to use -mtime or a variant to collect names and then use another find command to delete files matching those names.
UPDATE
With respect to your comment, I mean something like:
# find basenames of old files
find .... -printf '%f\n' | sort -u > oldfiles
for file in $(< oldfiles); do find . -name "$file" -exec rm {} \; ; done
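For the naming scheme in the question (12345.pdf plus 12345_a.pdf, 12345_xyz.pdf), a hedged sketch of that two-pass idea could look like the following; it assumes GNU find, the directory path is a placeholder, and it is worth replacing rm with echo for a dry run first:
dir=/path/to/pdfs
# pass 1: base files (no underscore in the name) older than 365 days
find "$dir" -maxdepth 1 -type f -name '*.pdf' ! -name '*_*' -mtime +365 -printf '%f\n' |
while IFS= read -r f; do
    base=${f%.pdf}
    # pass 2: remove the old file and any siblings that share its base name
    rm -f -- "$dir/$base".pdf "$dir/$base"_*.pdf
done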

Sync File Modification Time Across Multiple Directories

I have a computer A with two directory trees. The first directory contains the original modification dates, which span back several years. The second directory is a copy of the first with a few additional files. There is a second computer, B, which contains a directory tree that is the same as the second directory on computer A (new modification times and additional files). How can I update the files in the two newer directories on both machines so that the modification times on the files are the same as the originals? Note that these directory trees are on the order of tens of gigabytes, so the solution would have to include some method of sending only the date information to the second computer.
The answer by Paul is partly correct: rsync is able to do this, but with different parameters. The correct command is
rsync -Prt --size-only original_dir copy_dir
where -P enables partial transfers and displays a progress indicator, -r recurses through subdirectories, -t preserves time stamps and --size-only doesn't transfer files that match in size.
The following command will make sure that TEST2 gets the same date assigned that TEST1 has
touch -t `stat -t '%Y%m%d%H%M.%S' -f '%Sa' TEST1` TEST2
Now, instead of using hard-coded values here, you could find the files using the "find" utility and then run touch via SSH on the remote machine. However, that means you may have to enter the password for each file, unless you switch SSH to certificate authentication. I'd rather not do it all in one super fancy one-liner; instead, let's work with temp files. First go to the directory in question and run a find (you can filter by file type, size, extension, whatever pleases you - see "man find" for details; I'm just filtering by type file here to exclude any directories):
find . -type f -print -exec stat -t '%Y%m%d%H%M.%S' -f '%Sm' "{}" \; > /tmp/original_dates.txt
Now we have a file that looks like this (in my example there are only two entries there):
# cat /tmp/original_dates.txt
./test1
200809241840.55
./test2
200809241849.56
Now just copy the file over to the other machine and place it in the directory (so the relative file paths match) and apply the dates:
cat original_dates.txt | (while read FILE && read DATE; do touch -t $DATE "$FILE"; done)
Will also work with file names containing spaces.
One note: I used the last "modification" date in stat, as that's what you wrote in the question. However, it rather sounds as if you want to use the "creation" date (every file has a creation date, a last modification date and a last access date); in that case you need to alter the stat call a bit.
'%Sm' - last modification date
'%Sc' - creation date
'%Sa' - last access date
However, touch can only change the modification time and access time; I think it can't change the creation time of a file ... so if that was your real intention, my solution might be sub-optimal ... but in that case your question was as well ;-)
I would go through all the files in the source directory tree and gather the modification times from them into a script that I could run on the other directory trees. You will need to be careful about a few 'gotchas'. First, make sure that your output script has relative paths, and make sure you run it from the proper target directory, which should be the root directory of the target tree. Also, when changing machines make sure you are using the same timezone as you were on the machine where you generated the script.
Here's a Perl script I put together that will output the touch commands needed to update the times on the other directory trees. Depending on the target machines, you may need to tweak the date formats or command options, but this should give you a place to start.
#!/usr/bin/perl
my $STARTDIR="$HOME/test";
chdir $STARTDIR;
my @files = `find . -type f`;
chomp @files;
foreach my $file (@files) {
    my $mtime = localtime((stat($file))[9]);
    print qq(touch -m -d "$mtime" "$file"\n);
}
The other approach you could try is to attach the remote directory using NFS and then copy the times using find and touch -r.
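A hedged sketch of that NFS idea, assuming GNU find (which substitutes {} even when it is embedded in an argument) and placeholder mount points, run from the root of the target tree:
# copy each regular file's timestamp from the NFS-mounted original onto the local copy
cd /local/copy &&
find . -type f -exec touch -r /mnt/original/{} {} \;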
I think rsync (with the right options) will do this - it claims to only send file differences, so presumably will work out that there are no differences to be transferred.
--times preserves the modification times, which is what you want.
See (for instance) http://linux.die.net/man/1/rsync
Also add -I, --ignore-times ("don't skip files that match size and time") so that all files are "transferred", and trust rsync's file-differences optimisation to make it "fairly efficient" - see the excerpt from the man page below.
-t, --times
This tells rsync to transfer modification times along with the files and update them on the remote system. Note that if this option is not used, the optimization that excludes files that have not been modified cannot be effective; in other words, a missing -t or -a will cause the next transfer to behave as if it used -I, causing all files to be updated (though the rsync algorithm will make the update fairly efficient if the files haven't actually changed, you're much better off using -t).
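Putting those two options together, a rough example with a placeholder host and paths (the trailing slash on the source makes rsync copy its contents rather than the directory itself):
rsync -rt --ignore-times /path/to/original/ user@computerB:/path/to/copy/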
I used the following Python scripts instead.
Python scripts run much faster than an approach creating new processes for each file (like using find and stat). The solution below also works in case of timezone differences between systems, as it uses UTC times. It also works with paths containing spaces (but not paths containing newline!). It doesn't set times for symlinks, because the operating system provides no mechanism to modify the timestamp of a symlink, but in a file manager the time of the file the symlink points at is shown instead anyway. It uses a maxTime parameter to avoid resetting dates for files that are actually modified after copying from the original directory.
listMTimes.py:
import os
from datetime import datetime
from pytz import utc
for dirpath, dirnames, filenames in os.walk('./'):
    for name in filenames+dirnames:
        path = os.path.join(dirpath, name)
        # Avoid symlinks because os.path.getmtime and os.utime get and
        # set the time of the pointed file, and in the new directory,
        # the link may have been redirected.
        if not os.path.islink(path):
            mtime = datetime.fromtimestamp(os.path.getmtime(path), utc)
            print(mtime.isoformat()+" "+path)
setMTimes.py:
import datetime, fileinput, os, sys, time
import dateutil.parser
from pytz import utc
# Based on
# http://stackoverflow.com/questions/6999726/python-getting-millis-since-epoch-from-datetime
def unix_time(dt):
    epoch = datetime.datetime.fromtimestamp(0, utc)
    delta = dt - epoch
    return delta.total_seconds()
if len(sys.argv) != 2:
    print('Syntax: '+sys.argv[0]+' <maxTime>')
    print('    where <maxTime> is an ISO time, e. g. "2013-12-02T23:00+02:00".')
    exit(1)
# A file with modification time newer than maxTime is not reset to
# its original modification time.
maxTime = unix_time(dateutil.parser.parse(sys.argv[1]))
for line in fileinput.input([]):
    (datetimeString, path) = line.rstrip('\r\n').split(' ', 1)
    mtime = dateutil.parser.parse(datetimeString)
    if os.path.exists(path) and not os.path.islink(path):
        if os.path.getmtime(path) <= maxTime:
            os.utime(path, (time.time(), unix_time(mtime)))
Usage: in the first directory (the original) run
python listMTimes.py >/tmp/original_dates.txt
Then in the second directory (a copy of the original, possibly with some files modified/added/deleted) run something like this:
python setMTimes.py 2013-12-02T23:00+02:00 </tmp/original_dates.txt
