find files in huge directory - very slow - shell

I have a directory with files. The archive is very big and has 1.5 million PDF files inside.
The directory is stored on an IBM i server running OS V7R1, and the machine is new and very fast.
The files are named like this :
invoice_[custno]_[year]_[invoice_number].pdf
invoice_081500_2013_7534435564.pdf
Now I try to find files with the find command using the shell.
find . -name 'invoice_2013_*.pdf' -type f | ls -l > log.dat
The command took a long time so I aborted the operation with no result.
If I try it with smaller directories all works fine.
Later I want to have a job that runs every day and finds the files created in the last 24 hours, but it always runs so slowly that I can forget about this.

That invocation would never work because ls does not read filenames from stdin.
Possible solutions are:
Use the find utility's built-in list option:
find . -name 'invoice_2013_*.pdf' -type f -ls > log.dat
Use the find utility's -exec option to execute ls -l for each matching file:
find . -name 'invoice_2013_*.pdf' -type f -exec ls -l {} \; > log.dat
Pipe the filenames to the xargs utility and let it execute ls -l with the filenames as parameters:
find . -name 'invoice_2013_*.pdf' -type f | xargs ls -l > log.dat
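If your find supports the POSIX + terminator for -exec (worth verifying in QShell/PASE on V7R1, so treat this as a sketch), you get the same batching as xargs without a pipe, and filenames containing spaces are handled safely:
find . -name 'invoice_2013_*.pdf' -type f -exec ls -l {} + > log.dat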
A pattern search of 1.5 million files in a single directory is going to be inefficient on any filesystem.

For looking only at a list of new entries in the directory, you might consider journaling the directory. You would specify INHERIT(*NO) to prevent journaling all the files in the directory as well. Then you could simply extract the recent journal entries with DSPJRN to find out what objects had been added.

I don't think I'd put more than maybe 15k files in a single directory. Some QShell utilities run into trouble at around 16k files. But I'm not sure I'd store them in a directory in any case, except maybe for ones over 16MB if that's a significant fraction of the total. I'd possibly look to store them in CLOBs/BLOBs in the database first.
Storing them as individual streamfile objects brings ownership/authority problems that need to be addressed. Some profile is accumulating entries in its owned-objects table, and I'd expect that profile to be getting pretty large, perhaps approaching one or more limits.
By storing in the database, you drop to a single owned object.
Or perhaps a few similar objects... There might be a purging/archiving process that moves rows off to a secondary or tertiary table. Hard to guess how that might need to be structured, if at all.
Saves could also benefit, especially SAVSECDTA and SAV saves. Security data is greatly reduced. And saving a 4GB table is faster than saving a thousand 4MB objects (or whatever the breakdown might be).
Other than determining how the original setup and implementation would go in your environment, the big tricky part could involve volatility. If these are stable objects with relatively few changes and few deletions, it should be okay. But if BLOBs are often modified, it can bring trouble when the table takes up a significant fraction of DASD capacity. It gets particularly rough when it exceeds the size of DASD free space and a re-org is needed. With low volatility, that's much less of a concern.

Typically what is done in such cases is to create subdirectories -- perhaps keyed on the first letter of each file name. For example, the file
abcsdsjahdjhfdsfds.xyz would be stored in
/something/a/abcsdsjahdjhfdsfds.xyz
That would cut down on the size of each subdirectory.
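A minimal sketch of fanning an existing flat directory out this way (the /something path comes from the example above; everything else is an assumption, written portably so it should also run under QShell):
for f in /something/*; do
    [ -f "$f" ] || continue                                # skip the bucket directories themselves
    b=$(basename "$f")                                     # e.g. abcsdsjahdjhfdsfds.xyz
    d="/something/$(printf '%s\n' "$b" | cut -c1)"         # bucket named after the first character
    mkdir -p "$d"
    mv "$f" "$d/"
done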

Related

Merge files with same name in more than 100 folders

I have a problem similar to Merge files with same name in different folders. I have about 100 different folders, each containing a .txt file "replaced_txt". I need to merge those files, but since there are 100 different folders I want to know if there is something quicker than doing:
cat /folder1/replaced_txt /folder2/replaced_txt /folder3/replaced_txt ...
The cat command is just about the simplest there is, so there is no obvious, portable way to make the copying of file contents any faster. The bottleneck is probably going to be finding the files anyway, not copying them. If indeed the files are all in subdirectories immediately below the root directory,
cat /*/replaced_txt >merged_txt
will expand the wildcard alphabetically (so /folder10/replaced_txt comes before /folder2/replaced_txt), but it might run into "Argument list too long" and/or take a long time to expand the wildcard if some of these directories are large (especially on an older Linux system with an ext3 filesystem, which doesn't scale to large directories very well). A more general solution is find, which is better at finding files in arbitrarily nested subdirectories and won't run into "Argument list too long", because it never tries to assemble all the file names into an alphabetized list; instead, it enumerates the files as it traverses the directories in whichever order the filesystem reports them, and it starts a new cat process whenever the argument list would otherwise exceed the system's ARG_MAX limit.
find / -type f -name replaced_txt -xdev -exec cat {} + >merged_txt
If you want to limit how far subdirectories will be traversed or you only want to visit some directories, look at the find man page for additional options.
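For example, this sketch (assuming GNU find, since -mindepth and -maxdepth are not required by POSIX) only visits replaced_txt files sitting exactly one directory level below the root:
find / -xdev -mindepth 2 -maxdepth 2 -type f -name replaced_txt -exec cat {} + >merged_txt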

Bash Script to Check That Files Are Being Created

We have an Amazon EC2 instance where we upload output from our security cameras. Every now and then, the cameras have an issue, stop uploading, and need to be rebooted. The easy way for us to determine this is by seeing whether files are still being created. The problem is that the cameras create lots and lots of files. If I use find with -ctime, the script takes a very long time to run. Is there a faster way to check whether files have been created since yesterday? I just need to capture the result (yes, there are some files, or no, there are not) and email a message, but it would be nice to have something that didn't take half an hour to run.
#!/bin/bash
find /vol/security_ftp/West -ctime -1
find /vol/security_ftp/BackEntrance -ctime -1
find /vol/security_ftp/BoardroomDoor -ctime -1
find /vol/security_ftp/MainEntrance -ctime -1
find /vol/security_ftp/North -ctime -1
find /vol/security_ftp/South -ctime -1
Using find is a natural solution, but if you really must avoid it, you can see the newest file in a directory using ls and sorting the output according to ctime, e.g.
ls /vol/security_ftp/West -clt | head --lines=1
This would be enough if you want to see the date.
If you need better-formatted output (or only the ctime, to process it further), you can feed the file name to stat; note that ls prints bare names, so the directory has to be prepended:
stat --format="%z" "/vol/security_ftp/West/$( ls -ct /vol/security_ftp/West | head --lines=1 )"
This does not automatically answer whether any file was created recently, though.
The simple solution (and the one recommended by man find) is:
find /vol/security_ftp/ -mtime 0
To find files in /vol/security_ftp modified within the last 24 hours. Give it a try and see if it meets your time requirements. We can look for another solution if the default can't do it quickly enough. If the delay is due to numerous subdirectories under /vol/security_ftp, then limit the depth and type with:
find /vol/security_ftp/ -maxdepth 1 -type f -mtime 0
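Since the goal is just a yes/no answer plus an email, a small wrapper along these lines might be enough (a sketch only: -print -quit assumes GNU find, and the mail command and recipient address are placeholders to adapt):
#!/bin/bash
newest=$(find /vol/security_ftp/ -type f -mtime 0 -print -quit)    # stop at the first recent file found
if [ -z "$newest" ]; then
    echo "No camera files created in the last 24 hours" | mail -s "Camera upload alert" admin@example.com
fi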

how to find image files without extensions (on macos 10.8)

I have an app that has decided to die which had a library of images it stored on my hard drive in a series of GUID-like folders. The files themselves have no file extensions; there must have been an internal database (unrecoverable/corrupt) that associated each file with its name/extension/MIME type. So to get my stuff back out, I'd like to be able to search the disk to at least identify which of the files are images (JPEG and PNG files). I know that both JPEG and PNG have particular byte sequences in the first few bytes of the file. Is there a grep command that can match these known byte sequences in the first few bytes of each file in the massively nested file system structure that I have (e.g. folders 0 through f, each containing folders 0 through f, nested several levels deep, with files with UID filenames)?
Starting at the current directory .:
find . -type f -print0 | xargs -J fname -0 -P 4 identify -ping fname 2>|/dev/null
This will print the files that ImageMagick can identify, which are mostly images, but there are also exceptions (like txt files). ImageMagick is not particularly fast for this task either, so depending on what you have available there might be faster alternatives. For instance, the PIL package for Python will make this faster simply because it supports a smaller set of image formats, which might be enough for your task.
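Another alternative worth sketching, since it inspects exactly the magic bytes the question mentions, is the stock file(1) utility; the descriptions assumed here ("JPEG image data", "PNG image data") are what file's magic database normally prints, but verify the wording on your system:
find . -type f -print0 | xargs -0 file | grep -E 'JPEG image data|PNG image data' > images.txt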

BASH script to copy files based on date, with a catch

Let me explain the tree structure: I have a network directory where new .txt files are copied several times a day by our database. Those files sit in directories named after usernames. On the local disk I have the same structure (directories named after usernames), and it needs to be updated with the latest .txt files. It's not a sync procedure: I copy the remote file to a local destination and I don't care what happens with it after that, so I don't need to keep it in sync. However, I do need to copy ONLY the new files and not those that I have already copied. It would look something like:
Remote disk
/mnt/remote/database
+ user1/
+ user2/
+ user3/
+ user4/
Local disk
/var/database
+ user1/
+ user2/
+ user3/
+ user4/
I played with
find /mnt/remote/database/ -type f -mtime +1
and other variants, but it's not working very well.
So, the script I am trying to figure out is the following:
1- check /mnt/remote/database recursively for *.txt
2- check the files date to see if they are new (since the last time I checked, maybe maintain a text file with the last time checked on it as a reference?)
3- if the file is new, copy it to the proper destination in /var/database (so /mnt/remote/database/user1/somefile.txt will be copied to /var/database/user1/)
I'll run the script through a cron job.
I'm doing this in C right now, but the IT people are not very good at debugging or writing C, and if they need to add or fix something they can handle bash scripts better, which I am not very good at.
Any ideas out there?
thank you!
You could consider using local rsync between the input and output directories. It has all the options you need to make its copy policy very flexible.
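A one-line sketch of that idea (paths taken from the question; --ignore-existing makes rsync skip anything already present locally, which matches the "copy only new files" requirement as long as the local copies are not deleted afterwards):
rsync -av --ignore-existing /mnt/remote/database/ /var/database/
Another approach is to keep a timestamp file and copy only files newer than it: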
find /mnt/remote/database/ -type f -newer $TIMESTAMP_FILE | xargs $CP_COMMAND
touch $TIMESTAMP_FILE
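Fleshing that out into something closer to the cron job described above (a sketch under assumptions: the stamp file location is arbitrary, and the while/read loop assumes the file names contain no newlines):
#!/bin/bash
SRC=/mnt/remote/database
DST=/var/database
STAMP=$DST/.last_copy

[ -e "$STAMP" ] || touch -t 197001010000 "$STAMP"    # first run: treat every file as new

find "$SRC" -type f -name '*.txt' -newer "$STAMP" | while IFS= read -r f; do
    rel=${f#"$SRC"/}                        # e.g. user1/somefile.txt
    mkdir -p "$DST/$(dirname "$rel")"       # recreate the per-user directory locally
    cp "$f" "$DST/$rel"
done

touch "$STAMP"                              # remember when this run happened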
The solution is here:
http://www.movingtofreedom.org/2007/04/15/bash-shell-script-copy-only-files-modifed-after-specified-date/

Script to copy files on CD and not on hard disk to a new directory

I need to copy files from a set of CDs that have a lot of duplicate content, with each other, and with what's already on my hard disk. The file names of identical files are not the same, and are in sub-directories of different names. I want to copy non-duplicate files from the CD into a new directory on the hard disk. I don't care about the sub-directories - I will sort it out later - I just want the unique files.
I can't find software to do that - see my post at SuperUser https://superuser.com/questions/129944/software-to-copy-non-duplicate-files-from-cd-dvd
Someone at SuperUser suggested I write a script using GNU's "find" and the Win32 version of some checksum tools. I glanced at that, and have not done anything like that before. I'm hoping something exists that I can modify.
I found a good program to delete duplicates, Duplicate Cleaner (it compares checksums), but it won't help me here, as I'd have to copy all the CDs to disk, and each is probably about 80% duplicates, and I don't have room to do that - I'd have to cycle through a few at a time copying everything, then turning around and deleting 80% of it, working the hard drive a lot.
Thanks for any help.
I don't use Windows, but I'll give a suggestion: a combination of GNU find and a Lua script. For find you can try
find / -exec md5sum '{}' ';'
If your GNU software includes xargs the following will be equivalent but may be significantly faster:
find / -print0 | xargs -0 md5sum
This will give you a list of checksums and corresponding filenames. We'll throw away the filenames and keep the checksums:
#!/usr/bin/env lua

-- First pass: read "checksum  pathname" lines from stdin and remember
-- every checksum that already exists on the hard disk.
local checksums = {}
for l in io.lines() do
  local checksum = l:match('^(%S+)%s+(.*)$')
  checksums[checksum] = true
end

-- Second pass: checksum everything on the CD and copy only files whose
-- checksum has not been seen before.
local cdfiles = assert(io.popen('find e:/ -print0 | xargs -0 md5sum'))
for l in cdfiles:lines() do
  local checksum, pathname = l:match('^(%S+)%s+(.*)$')
  if not checksums[checksum] then
    io.stderr:write('copying file ', pathname, '\n')
    os.execute('cp ' .. pathname .. ' c:/files/from/cd')
    checksums[checksum] = true
  end
end
You can then pipe the output from
find / -print0 | xargs -0 md5sum
into this script.
There are a few problems:
If the filename has special characters, it will need to be quoted. I don't know the quoting conventions on Windows.
It would be more efficient to write the checksums to disk rather than run find over the whole disk every time. You could try
local csums = assert(io.open('/tmp/checksums', 'w'))
for cs in pairs(checksums) do csums:write(cs, '\n') end
csums:close()
And then read checksums back in from the file using io.lines again.
I hope this is enough to get you started. You can download Lua from http://lua.org, and I recommend the superb book Programming in Lua (check out the previous edition free online).
