get_dir_file_info() hangs when run on a large directory - codeigniter

I have made a little function that deletes files based on date. Prior to doing the deletions, it lets the user choose how many days/months back to delete files, telling them how many files and how much memory it would clean up.
It worked great in my test environment, but when I attempted to test it on a larger directory (approximately 100K files), it hangs.
I’ve stripped everything else from my code to ensure that it is the get_dir_file_info() function that is causing the issue.
$this->load->helper('file');
$folder = "iPad/images/";
set_time_limit (0);
echo "working<br />";
$dirListArray = get_dir_file_info($folder);
echo "still working";
When I run this, the page loads for approximately 60 seconds and then displays only the first message, “working”, never the second, “still working”.
It doesn’t seem to be a system/PHP memory problem, since the script comes back after 60 seconds, and the server respects my set_time_limit(), as I’ve had to use that for other processes.
Is there some other memory/time limit I might be hitting that I need to adjust?

From the CI user guide, get_dir_file_info() is described as:
Reads the specified directory and builds an array containing the filenames, filesize, dates, and permissions. Sub-folders contained within the specified path are only read if forced by sending the second parameter, $top_level_only to FALSE, as this can be an intensive operation.
So if you have 100K files, the best way to do it is to cut the work into two steps:
First: use get_filenames('path/to/directory/') to retrieve all your filenames without their information.
Second: use get_file_info('path/to/file', $file_information) to retrieve a specific file's info only when you need it, as you might not need all the file information immediately; it can be done on a filename click or some similar event.
The idea here is not to force your server to deal with a large amount of processing while in production; that would hurt both responsiveness and performance.
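A minimal sketch of that two-step approach, using CodeIgniter's standard file helper (the 100-file page size and the array_slice pagination are just illustrative assumptions; the folder path comes from the question):
$this->load->helper('file');
$folder = "iPad/images/";
set_time_limit(0);
// Step 1: names only -- far cheaper than get_dir_file_info() on ~100K files.
$filenames = get_filenames($folder);
// Step 2: fetch details lazily, e.g. only for the files shown on the current page.
$details = array();
foreach (array_slice($filenames, 0, 100) as $name) {
    // Request only the fields you actually need.
    $details[$name] = get_file_info($folder . $name, array('size', 'date'));
}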

Related

Automatically saving notebook (or other type files in mathematica) files

I have been facing this problem for some time now, a laziness caused in part by the fact that Microsoft Office automatically saves the files you are working on, with versions and automatic recovery.
Many times when I am starting a new notebook in mathematica to do some tests or whatever, I often forget to save what I am doing.
Every now and then, depending on the computer I am using, the computer crashes and all the beautiful work I was doing is lost forever...
Is there a way to get around this other than manically saving my files every five minutes? How about file versioning?
BTW: Using MMA V8
Regarding autosaving, you may want to check out the NotebookAutoSave option, which can be set to True through Format -> Option Inspector. You have to choose "Selected notebook", then go to Notebook Options -> File Options, and set NotebookAutoSave to True. Then your notebook will be saved after every evaluation. Whether or not this is a satisfactory solution depends, of course, on the situation.
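If you prefer not to go through the Option Inspector, you should also be able to set the same option programmatically from within the notebook itself, e.g.:
(* set NotebookAutoSave for the notebook this cell is evaluated in *)
SetOptions[EvaluationNotebook[], NotebookAutoSave -> True]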
But my experience is that the most reliable way is to develop a CTRL+S reflex - this one never lets me down and is working quite well.
As for the versioning, it is much easier with packages, for which you can use WorkBench which has integrated support for CVS and support for SVN via Eclipse plugin. For notebooks, I refer you to this SO thread. You may also find this Mathgroup discussion of some interest.
EDIT
For M8, for auto-saving purposes you can probably also run
RunScheduledTask[NotebookSave[EvaluationNotebook[]],{300}]
But I cannot test this code at the moment.
EDIT2
I just came across this post in the Toolbag repository - which may also be an alternative for the autosave part of the question (but please see also the discussion in comments on the relative advantages of scheduled tasks vs. Dynamic)
Since you have MMA version 8 you could use:
saveTask = CreateScheduledTask[FrontEndExecute[FrontEndToken["Save"]], 5*60];
StartScheduledTask[saveTask];
to save every 5 minutes (change the term 5*60 for other timings).
To remove the auto-save task use:
RemoveScheduledTask[saveTask];
To save only a fixed, specific notebook, store its handle in nb (finding it using Notebooks, SelectedNotebook, InputNotebook or EvaluationNotebook) and use FrontEndToken[nb, "Save"] instead of just FrontEndToken["Save"].
I have a Mathematica package that provides auto-backup functionality. When enabled, the current notebook--call it "blah.nb"--will be backed up to "blah.nb~" after a configurable amount of time has elapsed. I use it constantly and it has saved me from losing work many, many times. It's better than autosaving since it doesn't touch the actual notebook file: if you screw something up or something gets corrupted you don't want to overwrite your main file. :)
It's on GitHub here.
I've got an autosave routine that saves a copy of every open, modified notebook every 5 minutes (or whatever interval you prefer). It leaves your manually-saved copy alone, and saves a "swap file" in a separate directory that can be easily recovered if need be. The code (to be copied to init.m) is given in this answer: https://mathematica.stackexchange.com/questions/18380/automatic-recovery-after-crash/65852#65852, and copied below:
Motivated by the same concerns, I wrote the following code and added it to my init.m file. There are two main entries you'll want to change to use this. The global variable $SwapDirectory is where the swap files are saved (by swap file, I mean it in the Vim sense: an "extra" copy of your notebook, separate from your manually saved copy, that periodically saves any new work). The swap files are organized within the swap directory in a directory structure which "mirrors" their original file locations, and have ".swp" appended to their file names. The other variable you might want to change is the number of seconds between autosaves, indicated by the "300" (corresponding to 5 minutes) near the bottom of the code below.
At the appropriate times, this code will (automatically, in the background) save swap files for ALL open notebooks, unless they are unmodified from their manually-saved versions (this exception makes the code more efficient and, more importantly, prevents the storage of swap files for documentation notebooks, for example).
In its current form, the code does not filter for only the input cells, but hopefully you can use the other answers to make that modification yourself.
Some things to note:
1) the Mathematica Put command seems to have trouble writing to network drives, even when offline access is enabled. Therefore, it is probably best to choose a SwapDirectory that is on your local machine.
2) Within SwapDirectory, you should create a sub-directory called "Recovery". This is where the AutoSaveSwap routine will make an initial save of any notebooks for which there is NO existing manual save location.
3) To recover a file, simply evaluate
RecoverSwap["filePath"]
where "filePath" is a string representing the file path of the MANUALLY-SAVED copy of the file (i.e., not the file that was created by AutoSave). This will then pop up a window containing the most recent auto-saved version of the file. The manually saved version is NEVER overwritten unless you explicitly choose to do so. Once the recovered version pops up, you can save it wherever you like, or discard it at your discretion.
4) You should probably add this code to the KERNEL version of init.m ($UserBaseDirectory/Kernel/init.m) rather than the frontend version... this way, if you quit and restart the kernel, the autosave feature will also restart. On the other hand, this means that you must evaluate at least one expression after each start or restart to begin auto-saving. Once this initial evaluation is done, you do NOT need to have evaluated a cell for it to be backed up (unlike the built-in autosave utility).
Hope this helps someone! Feel free to respond with any questions, suggestions, or requests for improvement you may have. And if you find this post useful, upvotes would be most appreciated! Take care.
$SwapDirectory= "C:\\Users\\pacoj\\Swap Files\\";
SaveSwap[nb_NotebookObject]:=Module[
{fileName, swapFileName, nbout, nbdir, nbdirout, recoveryDir},
If[ ! SameQ[Quiet[NotebookFileName[nb]], $Failed],
(* if the notebook is already saved to the file system *)
fileName = Last[ FileNameSplit[ NotebookFileName[nb]] ];
swapFileName = fileName <> ".swp";
nbdir = Rest[FileNameSplit @ NotebookDirectory[nb]];
nbdirout= FileNameJoin[ FileNameSplit[$SwapDirectory]~Join~nbdir]<>"\\";
If[!DirectoryQ[nbdirout], CreateDirectory[nbdirout]];
nbout = NotebookGet[nb];
Put[nbout, nbdirout <> swapFileName],
(* else, if the file has never been saved, save as untitled *)
recoveryDir = $SwapDirectory <> "Recovery\\";
fileName= ("WindowTitle" /. NotebookInformation[nb])<>".nb";
NotebookSave[nb, recoveryDir <> fileName]
]
];
RecoverSwap::noswp= "swap file `1` not found in expected location";
RecoverSwap[nbfilename_String]:=Module[
{fileName, swapFileName, nbin, nbdir, nbdirout},
fileName= Last[ FileNameSplit[ nbfilename] ];
swapFileName= fileName <> ".swp";
nbdir= Most[ Rest[FileNameSplit @ nbfilename] ];
nbdirout= FileNameJoin[ FileNameSplit[$SwapDirectory]~Join~nbdir]<>"\\";
If[ FileNames[swapFileName, {nbdirout}] == {},
Message[RecoverSwap::noswp,nbdirout <> swapFileName]; Return[],
nbin= Get[nbdirout <> swapFileName]; NotebookPut[nbin]
]
];
AutoSaveSwaps= CreateScheduledTask[
SaveSwap /@ Select[Notebooks[], "ModifiedInMemory" /. NotebookInformation[#] &],
300
]
StartScheduledTask[AutoSaveSwaps]

command line wisdom for 2 panel file manager user

I want to upgrade my file management productivity by replacing a two-panel file manager with the command line (bash or Cygwin). Can the command line give the same speed? Please advise on a guru way of doing, e.g., a copy of some file in directory A to directory B. Is it heavy use of pushd/popd? Or creation of links to most often used directories? What are the best practices and day-to-day routine for managing files as a command-line master?
Can the command line give the same speed?
My experience is that command-line copying is significantly faster (especially in the Windows environment). Of course, the basic laws of physics still apply: a file that is 1000 times bigger than a file that copies in 1 second will still take roughly 1000 seconds to copy.
...(how to) copy some file in directory A to directory B.
Because I often have 5-10 projects that use similar directory structures, I set up variables for each subdir using a naming convention:
project=NewMatch
NM_scripts=${project}/scripts
NM_data=${project}/data
NM_logs=${project}/logs
NM_cfg=${project}/cfg
proj2=AlternateMatch
altM_scripts=${proj2}/scripts
altM_data=${proj2}/data
altM_logs=${proj2}/logs
altM_cfg=${proj2}/cfg
You can make this sort of thing as spartan or baroque as needed to match your theory of living/programming.
Then you can easily copy the cfg files from one project to another:
cp -p $NM_cfg/*.cfg ${altM_cfg}
Is it heavy use of pushd/popd?
Some people seem to really like that. You can try it and see what you think.
Or creation of links to most often used directories?
Links to dirs are, in my experience, used more for software development, where source code expects a certain set of dir names and your installation has different names; making links to supply the expected dir paths is then helpful. For production data, a link is just one more thing that can get messed up or blow up. That's not always true, and maybe you'll have a really good reason to have links, but I wouldn't start out that way just because it is possible to do.
What are the best practices and day-to-day routine for managing files as a command-line master?
(Per the above, use a standardized directory structure for all projects.
Have scripts save any small files to a directory your dept keeps in the /tmp dir,
i.e. /tmp/MyDeptsTmpFile (named to fit your local conventions).)
It depends. If you're talking about data and logfiles, dated filenames can save you a lot of time. I recommend date formats like YYYYMMDD(_HHMMSS) if you need the extra resolution.
Dated logfiles are very handy: when a current process seems like it is taking a long time, you can look at the log file from a week, a month, or 6 months ago (up to however much space you can afford) and quantify exactly how long the process took then. Logfiles should also capture all STDERR messages, so you never have to re-run a bombed program just to see what the error message was.
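For example (the script and log names here are purely illustrative), a dated log name plus STDERR capture might look like:
# illustrative names only: dated log file that also captures STDERR
logfile="$HOME/logs/nightly_load_$(date +%Y%m%d_%H%M%S).log"
./nightly_load.sh > "$logfile" 2>&1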
This is Linux/Unix you're using, right? Read the man page for the cp command installed on your machine. I recommend using an alias like alias CP='/bin/cp -pi' so you always copy a file with the same permissions and the original file's time stamp. Then it is easy to use /bin/ls -ltr to see a sorted list of files with the most recent ones at the bottom (no need to scroll back to the top when you sort by time, reversed). Also, the '-i' option will warn you when you are about to overwrite a file, and this has saved me more than a couple of times.
I hope this helps.
P.S. as you appear to be a new user, if you get an answer that helps you please remember to mark it as accepted, and/or give it a + (or -) as a useful answer.

FTP FileWatcher

So, I am in this little predicament where I am stuck watching a few FTP folders to see if they have new files added to them. If they do, the watcher needs to throw an event with the file name, thereby telling something else to download that file.
This is a pretty simple object to make; I was just curious whether anyone knew how expensive this operation would be.
I plan on using the command NLIST because I don't need file size information, and there will be no sub-directories in the folder. Each file in the folder will have exactly 25 characters in its name.
There could be anywhere from 10 to 'maybe' a couple thousand (max around 2000) files per folder (usually on the lower end, 100-300, but currently growing).
The files are anywhere from 250kb to a very VERY unlikely 10mb (usually within the 250kb to 4mb range).
There possibly could be up to a few hundred folders (in which case I could change the watch frequency depending on number of folders), but currently there are only a few (6-10ish).
There also would be multiple logins for the ftp server, different logins would have access to different folders.
I am not asking for an implementation, just whether anyone has some first- or second-hand knowledge about FTP and how this could affect my network.
I am not opposed to putting in file retention times or changing the frequency with which I check for new files.
Do you have any control over the remote servers? FTP isn't really optimized for this, and you could probably do a lot better with some sort of dedicated mini-server. You could use file system monitoring on the remote side and just send out the filenames when they arrive rather than continuously polling. You'd only need to have one connection open too, rather than the two that FTP requires.

Verify whether ftp is complete or not?

I have an application which is polling a folder continuously. Once any file is FTP'd to the folder, the application has to move this file to some other folder for processing.
Here, we don't have any option to verify whether the FTP transfer is complete or not.
One command, "lsof", is suggested in the technical forums. It has a file descriptor column which gives the file status.
Since this is a FreeBSD command and not present in old versions of Linux, I want to clarify the usage of this command.
Can you guys tell us your experience in file verification and is there any other alternative solution available?
Also, is there any risk in using this utility?
Appreciate your help in advance.
Thanks,
Mathew Liju
We've done this before in a number of different ways.
Method one:
If you can control the process sending the files, have it send the file itself followed by a sentinel file. For example, send the real file "contracts.doc" followed by a one-byte "contracts.doc.sentinel".
Then have your listener process watch out for the sentinel files. When one of them is created, you should process the equivalent data file, then delete both.
Get rid of any data file that's more than a day old and doesn't have a corresponding sentinel file - it was a failed transmission.
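As a rough shell illustration of such a listener (the directory, the polling interval, and the process_file handler are all hypothetical):
incoming=/data/incoming    # hypothetical landing directory
while true; do
    for sentinel in "$incoming"/*.sentinel; do
        [ -e "$sentinel" ] || continue           # glob matched nothing
        datafile="${sentinel%.sentinel}"         # contracts.doc.sentinel -> contracts.doc
        if [ -f "$datafile" ]; then
            process_file "$datafile"             # your real handler goes here
            rm -f "$datafile" "$sentinel"
        fi
    done
    # discard data files older than a day that never got a sentinel
    find "$incoming" -maxdepth 1 -type f ! -name '*.sentinel' -mmin +1440 \
        -exec sh -c '[ ! -e "$1.sentinel" ] && rm -f "$1"' _ {} \;
    sleep 30
done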
Method two:
Keep an eye on the files themselves (specifically the last modification date/time). Only process files whose modification time is more than N minutes in the past. That increases the latency of processing the files but you can usually be certain that, if a file hasn't been written to in five minutes (for example), it's done.
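Expressed as a shell sketch (the directory and the process_file handler are hypothetical, and the 5-minute threshold is just the example value from above):
# process only files untouched for more than 5 minutes
find /data/incoming -maxdepth 1 -type f -mmin +5 -print0 |
    while IFS= read -r -d '' f; do
        process_file "$f"    # hypothetical handler
    done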
Conclusion:
Both those methods have been used by us successfully in the past. I prefer the first but we had to use the second one once when we were not allowed to change the process sending the files.
The advantage of the first one is that you know the file is ready when the sentinel file appears. With both lsof (I'm assuming you're treating files that aren't open by any process as ready for processing) and the timestamps, it's possible that the FTP crashed in the middle and you may be processing half a file.
There are normally three approaches to this sort of problem.
providing a signal file so that when your file is transferred, an additional file is sent to mark that transfer is complete
add an entry to a log file within that directory to indicate a transfer is complete (this really only works if you have a single peer updating the directory, to avoid concurrency issues)
parsing the file to determine completeness, e.g. does the file start with a length field, or is it obviously incomplete? For example, parsing an incomplete XML file will result in a parse error due to the lack of an end element. Depending on your file's size and format, this can be trivial or very time-consuming.
lsof would possibly be an option, although you've identified your Linux portability issue. If you use this, note the -F option, which formats the output suitable for processing by other programs, rather than being human-readable.
EDIT: Pax identified a fourth (!) method I'd forgotten - using the fact that the timestamp of the file hasn't updated in some time.
There is a fifth method. You can also check if the FTP session is still active. This will work if every peer has its own FTP user account. As long as the user has not logged off from FTP, assume the files are not complete.

How can I speed up Perl's readdir for a directory with 250,000 files?

I am using Perl's readdir to get a file listing; however, the directory contains more than 250,000 files, and this takes a long time (longer than 4 minutes) to perform the readdir and uses over 80 MB of RAM. As this is intended to be a recurring job every 5 minutes, this lag time will not be acceptable.
More info:
Another job (run once per day) will fill the directory being scanned.
This Perl script is responsible for processing the files. A file count is specified for each script iteration, currently 1000 per run.
The Perl script is to run every 5 min and process (if applicable) up to 1000 files.
The file count limit is intended to allow downstream processing to keep up as Perl pushes data into a database, which triggers a complex workflow.
Is there another way to obtain filenames from the directory, ideally limited to 1000 (set by a variable), which would greatly increase the speed of this script?
What exactly do you mean when you say readdir is taking minutes and 80 MB? Can you show that specific line of code? Are you using readdir in scalar or list context?
Are you doing something like this:
foreach my $file ( readdir($dir) ) {
#do stuff here
}
If that's the case, you are reading the entire directory listing into memory. No wonder it takes a long time and a lot of memory.
The rest of this post assumes that this is the problem; if you are not using readdir in list context, ignore the rest of the post.
The fix for this is to use a while loop and use readdir in a scalar context.
while (defined(my $file = readdir $dir)) {
    # do stuff.
}
Now you only read one item at a time. You can add a counter to keep track of how many files you process, too.
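Putting the two ideas together, a hedged sketch for your case (the directory path is a placeholder, and the 1000-file cap comes from your question):
use strict;
use warnings;

my $dir_path = '/path/to/big/dir';   # placeholder
my $limit    = 1000;                 # per-run cap from the question

opendir(my $dir, $dir_path) or die "Cannot open $dir_path: $!";
my $count = 0;
while (defined(my $file = readdir $dir)) {
    next if $file eq '.' or $file eq '..';
    # process one file here...
    last if ++$count >= $limit;      # stop after 1000 files this run
}
closedir $dir;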
The solution may lie at the other end: the script that fills the directory...
Why not create a directory tree (an arborescence) to store all those files, so that you have lots of directories, each with a manageable number of files?
Instead of creating "mynicefile.txt", why not "m/my/mynicefile", or something like that?
Your file system would thank you for it (especially if you remove the empty directories when you have finished with them).
This is not exactly an answer to your query, but I think having that many files in the same directory is not a very good thing for overall speed (including the speed at which your filesystem handles add and delete operations, not just the listing you have seen).
A solution to that design problem is to have a sub-directory for each possible first letter of the file names, and to put all files beginning with that letter inside that directory. Recurse to the second, third, etc. letter if need be.
You will probably see a definite speed improvement on many operations.
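As a hedged sketch of the sharding on the producer side (the root directory and file name are placeholders, and the two-level depth is just one choice):
use strict;
use warnings;
use File::Path qw(make_path);
use File::Copy qw(move);

# shard "mynicefile.txt" into <root>/m/my/mynicefile.txt
my $root  = '/data/files';                     # placeholder
my $name  = 'mynicefile.txt';
my $shard = lc(substr($name, 0, 1)) . '/' . lc(substr($name, 0, 2));
make_path("$root/$shard");                     # create m/my/ if missing
move("$root/$name", "$root/$shard/$name")
    or die "move failed: $!";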
You're saying that the content gets there by unpacking zip file(s). Why don't you just work on the zip files instead of creating/using 250K files in one directory?
Basically, to speed it up you don't need anything specific in Perl, but rather something at the filesystem level. If you are 100% sure that you have to work with 250K files in a directory (and I can't imagine a situation where something like this would be required), you're much better off finding a better filesystem to handle it than finding some "magical" module in Perl that would scan it faster.
Probably not. I would guess most of the time is spent reading the directory entries.
However, you could preprocess the entire directory listing, creating one listing file per 1000 entries. Then your process could work through one of those listing files each time and not incur the expense of reading the entire directory.
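A possible pre-processing step (the paths and chunk prefix are placeholders) could be as simple as:
# build the listing once, then split it into 1000-line chunk files
ls /path/to/big/dir > /tmp/listing.txt
split -l 1000 /tmp/listing.txt /tmp/listing.chunk.
# each 5-minute run then consumes (and removes) one /tmp/listing.chunk.* file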
Have you tried just readdir() through the directory without any other processing at all to get a baseline?
You aren't going to be able to speed up readdir, but you can speed up the task of monitoring a directory. You can ask the OS for updates -- Linux has inotify, for example. Here's an article about using it:
http://www.ibm.com/developerworks/linux/library/l-ubuntu-inotify/index.html?ca=drs-
You can use Inotify from Perl:
http://metacpan.org/pod/Linux::Inotify2
The difference is that you will have one long-running app instead of a script that is started by cron. In the app, you'll keep a queue of files that are new (as provided by inotify). Then, you set a timer to go off every 5 minutes, and process 1000 items. After that, control returns to the event loop, and you either wake up in 5 minutes and process 1000 more items, or inotify sends you some more files to add to the queue.
(BTW, you will need an event loop to handle the timers; I recommend EV.)
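A hedged sketch of that design using Linux::Inotify2 with EV (the directory path, the event mask, and the process_file stub are assumptions; the EV::io line follows the integration pattern from the Linux::Inotify2 documentation):
use strict;
use warnings;
use EV;
use Linux::Inotify2;

my $dir = '/path/to/big/dir';                  # placeholder
my @queue;                                     # filenames waiting to be processed

sub process_file { my ($path) = @_; print "processing $path\n" }   # stub

my $inotify = Linux::Inotify2->new
    or die "Unable to create inotify object: $!";

# queue each file once it is fully written or moved into the directory
$inotify->watch($dir, IN_CLOSE_WRITE | IN_MOVED_TO, sub {
    my $event = shift;
    push @queue, $event->fullname;
});

# feed inotify events into the EV event loop
my $inotify_watcher = EV::io $inotify->fileno, EV::READ, sub { $inotify->poll };

# every 5 minutes, drain up to 1000 queued files
my $timer = EV::timer 300, 300, sub {
    process_file($_) for splice @queue, 0, 1000;
};

EV::run;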
