I found that the following Perl code executes at surprisingly varying speeds, sometimes fast, sometimes very slow. I have a few folders containing tens of thousands of files, which I need to run this code through. I am running this on Cygwin with Windows 7. I just wonder if someone could please help me speed it up, or at least figure out why the speed varies. My CPU and memory should be plentiful in all these situations.
# outer loop to iterate through a list of $dir's
opendir(DIR, $dir);
@all = readdir(DIR);
@files = (0..$#all);
$i = -1;
foreach $current (@all) {
    if (-f "$dir/$current") {
        $files[++$i] = $current;
    }
}
push @Allfiles, @files[0..$i];
closedir(DIR);
You're probably I/O bound, so changes to your code probably won't affect the total runtime - runtime will be affected by whether the directory entries are in cache or not.
But your code uses temporary arrays for no good reason, using too much RAM if the directories are very large. You could simplify it to:
opendir(DIR, $dir);
while (defined(my $file = readdir(DIR))) {
    push @Allfiles, $file if (-f "$dir/$file");
}
closedir(DIR);
No temporary arrays.
If it is slow the first time you run, and fast after that, then the problem is that your system is caching the reads. The first time you run your code, data has to be read off your disk. After that, the data is still cached in RAM. If you wait long enough, the cache will flush and you will have to hit the disks again.
Or you may sometimes be running some other disk-intensive task at the same time, and not at other times when you run your code.
Some other people at work and I have been trying to figure out exactly why this excerpt of the script runs so much faster in ISE than in the shell.
For context, the entire script (which compares AD hashes to a list of known compromised hashes) will run in ISE in about 30 minutes with the expected results. However, when invoked remotely or run locally from the shell, it takes up to 10 days in some cases.
We've found that this little bit of code in a function is where things go wonky. I'm not 100% certain, but I believe it may result from the use of System.IO.StreamReader, specifically calling the ReadLine() method; but I'm really not sure.
$fsHashDictionary = New-Object IO.FileStream $HashDictionary,'Open','Read','Read'
$frHashDictionary = New-Object System.IO.StreamReader($fsHashDictionary)
while (($lineHashDictionary = $frHashDictionary.ReadLine()) -ne $null) {
    if ($htADNTHashes.ContainsKey($lineHashDictionary.Split(":")[0].ToUpper()))
    {
        $foFoundObject = [PSCustomObject]@{
            User      = $htADNTHashes[$lineHashDictionary.Split(":")[0].ToUpper()]
            Frequency = $lineHashDictionary.Split(":")[1]
            Hash      = $lineHashDictionary.Split(":")[0].ToUpper()
        }
        $mrMatchedResults += $foFoundObject
    }
}
AFAIK, there isn't anything that can explain a "script runs hundreds of times faster in ISE than in the shell" scenario; therefore I suspect that the difference in available memory between one session and the other is causing your script to run into performance issues.
Custom PowerShell objects are pretty heavy. To give you an idea of how much memory they consume, try something like this:
$memBefore = (Get-Process -Id $pid).WS
$foFoundObject = [PSCustomObject]@{
    User      = $htADNTHashes[$lineHashDictionary.Split(":")[0].ToUpper()]
    Frequency = $lineHashDictionary.Split(":")[1]
    Hash      = $lineHashDictionary.Split(":")[0].ToUpper()
}
$memAfter = (Get-Process -Id $pid).WS
$memAfter - $memBefore
Together with the fact that arrays (such as $mrMatchedResults) are immutable, which causes the array to be rebuilt every time you use the increase assignment operator (+=), the PowerShell session might be running out of physical memory, causing Windows to constantly swap memory pages.
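As a rough illustration of what += costs (the element count of 20,000 is arbitrary), you could compare something like:

# Rebuilds (copies) the array on every iteration
Measure-Command { $a = @(); foreach ($i in 1..20000) { $a += $i } }

# Lets PowerShell collect the loop output in one go
Measure-Command { $a = foreach ($i in 1..20000) { $i } }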
.Net methods like [System.IO.StreamReader] are definitely a lot faster than PowerShell cmdlets (such as Get-Content), but that doesn't mean you have to put everything into memory. Meaning, instead of collecting the results in a variable (which keeps everything in memory), stream each object to the next cmdlet.
Especially for your main object, try to respect the PowerShell pipeline. As recommended in Why should I avoid using the increase assignment operator (+=) to create a collection?, you'd better not assign the output at all, but pass the pipeline output directly to the next cmdlet (and eventually release it to its destination, e.g. display, AD, or disk) to free up memory.
And if you do use .Net classes (such as the StreamReader class), make sure that you dispose of the object, as shown in the PowerShell scripting performance considerations article; otherwise your function might leak even more memory than required.
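Putting those two points together, here is a minimal sketch (the wrapping Get-MatchedResult function name and the Export-Csv destination are made up for illustration; the variable names are borrowed from your snippet):

function Get-MatchedResult {
    param (
        [string]$HashDictionary,     # path to the dictionary file
        [hashtable]$htADNTHashes     # AD hashes keyed by NT hash
    )
    $frHashDictionary = New-Object System.IO.StreamReader($HashDictionary)
    try {
        while ($null -ne ($lineHashDictionary = $frHashDictionary.ReadLine())) {
            $parts = $lineHashDictionary.Split(":")
            $key   = $parts[0].ToUpper()
            if ($htADNTHashes.ContainsKey($key)) {
                # Emit each object to the pipeline instead of appending with +=
                [PSCustomObject]@{
                    User      = $htADNTHashes[$key]
                    Frequency = $parts[1]
                    Hash      = $key
                }
            }
        }
    }
    finally {
        $frHashDictionary.Dispose()   # release the file handle even on error
    }
}

# Usage: stream straight to the destination instead of collecting in $mrMatchedResults
Get-MatchedResult -HashDictionary $HashDictionary -htADNTHashes $htADNTHashes |
    Export-Csv -Path '.\MatchedResults.csv' -NoTypeInformation

Because each result object is emitted as soon as it is found, nothing bigger than the current line and the current object has to stay in memory.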
The performance of a complete (PowerShell) solution is supposed to be better than the sum of its parts. Meaning, don't focus too much on a single function when it concerns performance issues; instead look at your whole solution. The PowerShell pipeline gives you the opportunity to e.g. load objects from AD and process them almost simultaneously, while using just a little more memory than each single object requires.
It's probably because ISE uses the WPF framework and benefits from hardware acceleration, while a PowerShell console does not.
I have a large script that parses several computers for information. At the beginning of my script, I create several files to write output to as the script finds information.
Since I want my script to work at scale (hundreds or thousands of machines), I am wondering if it is too costly to write to the files each time the script identifies new information. In terms of performance, should I write to the files only once (after the script finishes), or do the writes to files become inexpensive after my first write?
That should not be much of an overhead, as long as your system resources and disk I/O allow it.
I would recommend using the .NET StreamWriter to write the file, because that's a pretty fast and effective way to accomplish this:
$stream = [System.IO.StreamWriter] "t.txt"
1..100 | % {
    $stream.WriteLine($_)
}
$stream.Close()
Avoid locking: do not try to access the same file in parallel; that will keep the complexity down.
And it is completely feasible, as long as you make sure that other heavy applications are not using the disk I/O at the same time.
Hope it helps.
I have made a little function that deletes files based on date. Prior to doing the deletions, it lets the user choose how many days/months back to delete files, telling them how many files it would delete and how much disk space it would free up.
It worked great in my test environment, but when I attempted to test it on a larger directory (approximately 100K files), it hangs.
I’ve stripped everything else from my code to ensure that it is the get_dir_file_info() function that is causing the issue.
$this->load->helper('file');
$folder = "iPad/images/";
set_time_limit (0);
echo "working<br />";
$dirListArray = get_dir_file_info($folder);
echo "still working";
When I run this, the page loads for approximately 60 seconds, then displays only the first message “working” and not the following message “still working”.
It doesn’t seem to be a system/php memory problem as it is coming back after 60 seconds and the server respects my set_time_limit() as I’ve had to use that for other processes.
Is there some other memory/time limit I might be hitting that I need to adjust?
From the CI user guide, get_dir_file_info() is described as:
Reads the specified directory and builds an array containing the filenames, filesize, dates, and permissions. Sub-folders contained within the specified path are only read if forced by sending the second parameter, $top_level_only to FALSE, as this can be an intensive operation.
So if you are saying that you have 100k files, then the best way to do it is to cut it into two steps (see the sketch after this list):
First: use get_filenames('path/to/directory/') to retrieve all your files without their information.
Second: use get_file_info('path/to/file', $file_information) to retrieve a specific file's info, as you might not need all of the file information immediately. It can be done on a filename click or something similarly relevant.
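A minimal sketch of those two steps, assuming the CodeIgniter file helper is loaded (the 1000-file slice is just an arbitrary example):

$this->load->helper('file');
$folder = "iPad/images/";

// Step 1: just the filenames, which is much cheaper than get_dir_file_info()
$fileNames = get_filenames($folder);

// Step 2: fetch details only for the files you actually need right now,
// e.g. the first 1000, or the one the user clicked
$details = array();
foreach (array_slice($fileNames, 0, 1000) as $name) {
    $details[$name] = get_file_info($folder . $name, array('name', 'size', 'date'));
}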
The idea here is not to force your server to deal with a large amount of processing while in production; that would kill two things: responsiveness and performance (I haven't found a better definition for performance). But the idea here is clear.
In my Azure web role OnStart() I need to deploy a huge unmanaged program the role depends on. The program is previously compressed into a 400-megabyte .zip archive, split into files of 20 megabytes each, and uploaded to a blob storage container. That program doesn't change - once uploaded it can stay that way for ages.
My code does the following:
CloudBlobContainer container = ... ;
String localPath = ...;
using( FileStream writeStream = new FileStream(
        localPath, FileMode.OpenOrCreate, FileAccess.Write ) )
{
    for( int i = 0; i < blobNames.Size(); i++ ) {
        String blobName = blobNames[i];
        container.GetBlobReference( blobName ).DownloadToStream( writeStream );
    }
    writeStream.Close();
}
It just opens a file, then writes the parts into it one by one. It works great, except that it takes about 4 minutes when run from a single-core (Extra Small) instance, which means an average download speed of about 1.7 megabytes per second.
This worries me - it seems too slow. Should it be so slow? What am I doing wrong? What could I do instead to solve my problem with deployment?
Adding to what Richard Astbury said: an Extra Small instance has a very small fraction of the bandwidth that even a Small gives you. You'll see approx. 5 Mbps on an Extra Small, and approx. 100 Mbps on a Small (for Small through Extra Large, you'll get approx. 100 Mbps per core).
The extra small instance has limited IO performance. Have you tried going for a medium sized instance for comparison?
In some ad-hoc testing I have done in the past, I found that there is no discernible difference between downloading 1 large file and downloading N smaller files in parallel. It turns out that the bandwidth of the NIC is usually the limiting factor no matter what, and a large file will saturate it just as easily as many smaller ones. The reverse is not true, btw: you do benefit from uploading in parallel, as opposed to one at a time.
The reason I mention this is that it seems like you should be using 1 large zip file here and something like Bootstrapper. That would be 1 line of code for you to download, unzip, and possibly run. Even better, it won't do it more than once on reboot unless you force it to.
As others have already aptly mentioned, the NIC bandwidth on the XS instances is vastly smaller than on even an S instance. You will see much faster downloads by bumping up the VM size slightly.
I am using Perl readdir to get a file listing; however, the directory contains more than 250,000 files, and this results in a long time (longer than 4 minutes) to perform the readdir, using over 80 MB of RAM. As this is intended to be a recurring job every 5 minutes, this lag time will not be acceptable.
More info:
Another job fills the directory being scanned (once per day).
This Perl script is responsible for processing the files. A file count is specified for each script iteration, currently 1000 per run.
The Perl script is to run every 5 min and process (if applicable) up to 1000 files.
The file count limit is intended to allow downstream processing to keep up as Perl pushes data into a database, which triggers a complex workflow.
Is there another way to obtain filenames from the directory, ideally limited to 1000 (set by a variable), which would greatly increase the speed of this script?
What exactly do you mean when you say readdir is taking minutes and 80 MB? Can you show that specific line of code? Are you using readdir in scalar or list context?
Are you doing something like this:
foreach my $file ( readdir($dir) ) {
    # do stuff here
}
If that's the case, you are reading the entire directory listing into memory. No wonder it takes a long time and a lot of memory.
The rest of this post assumes that this is the problem; if you are not using readdir in list context, ignore the rest of the post.
The fix for this is to use a while loop and use readdir in a scalar context.
while ( defined( my $file = readdir $dir ) ) {
    # do stuff.
}
Now you only read one item at a time. You can add a counter to keep track of how many files you process, too.
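For example, a minimal sketch that stops after the first 1000 regular files (the $limit value and the handle names are just illustrative):

my $limit = 1000;    # files to handle per run, per your requirement
my $count = 0;

opendir( my $dh, $dir ) or die "Cannot open $dir: $!";
while ( defined( my $file = readdir $dh ) ) {
    next unless -f "$dir/$file";    # skip subdirectories, '.' and '..'
    # ... process the file here ...
    last if ++$count >= $limit;
}
closedir $dh;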
The solution may lie at the other end: in the script that fills the directory...
Why not create a directory tree to store all those files, so that you have lots of directories, each with a manageable number of files?
Instead of creating "mynicefile.txt", why not "m/my/mynicefile", or something like that?
Your file system would thank you for that (especially if you remove the empty directories when you have finished with them).
This is not exactly an answer to your query, but I think having that many files in the same directory is not a very good thing for overall speed (including the speed at which your filesystem handles add and delete operations, not just listing, as you have seen).
A solution to that design problem is to have sub-directories for each possible first letter of the file names, and to put all files beginning with that letter inside that directory. Recurse to the second, third, etc. letter if need be.
You will probably see a definite speed improvement on many operations.
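A minimal sketch of that naming scheme on the writer side (the two-level depth and the sharded_path() helper are just for illustration):

use File::Path qw(make_path);
use File::Spec;

# Map "mynicefile.txt" to ".../m/my/mynicefile.txt" based on its first letters.
sub sharded_path {
    my ( $base_dir, $file_name ) = @_;
    my $first  = lc substr( $file_name, 0, 1 );
    my $second = lc substr( $file_name, 0, 2 );
    my $dir    = File::Spec->catdir( $base_dir, $first, $second );
    make_path($dir) unless -d $dir;
    return File::Spec->catfile( $dir, $file_name );
}

# e.g. open( my $fh, '>', sharded_path( $base_dir, "mynicefile.txt" ) ) or die $!;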
You're saying that the content gets there by unpacking zip file(s). Why don't you just work on the zip files instead of creating/using 250k files in one directory?
Basically, to speed this up you don't need anything specific in Perl, but rather something at the filesystem level. If you are 100% sure that you have to work with 250k files in one directory (and I can't imagine a situation where something like this would be required), you're much better off finding a better filesystem to handle it than finding some "magical" module in Perl that would scan it faster.
Probably not. I would guess most of the time is spent reading the directory entries.
However, you could preprocess the entire directory listing, creating one file per 1000 entries. Then your process could read one of those listing files each time and not incur the expense of reading the entire directory.
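A rough sketch of that preprocessing step (the directory path and the listing_NNNN.txt naming are made up):

use strict;
use warnings;

my $dir        = '/path/to/big/dir';    # directory being filled by the daily job
my $chunk_size = 1000;
my @chunk;
my $chunk_no = 0;

opendir( my $dh, $dir ) or die "Cannot open $dir: $!";
while ( defined( my $file = readdir $dh ) ) {
    next unless -f "$dir/$file";         # skip '.', '..' and subdirectories
    push @chunk, $file;
    if ( @chunk == $chunk_size ) {
        write_chunk( $chunk_no++, \@chunk );
        @chunk = ();
    }
}
closedir $dh;
write_chunk( $chunk_no++, \@chunk ) if @chunk;

sub write_chunk {
    my ( $no, $files ) = @_;
    my $name = sprintf 'listing_%04d.txt', $no;
    open( my $fh, '>', $name ) or die "Cannot write $name: $!";
    print {$fh} "$_\n" for @$files;
    close $fh;
}

Each 5-minute run then only has to read (and afterwards delete or rename) one small listing file.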
Have you tried just readdir() through the directory without any other processing at all to get a baseline?
You aren't going to be able to speed up readdir, but you can speed up the task of monitoring a directory. You can ask the OS for updates -- Linux has inotify, for example. Here's an article about using it:
http://www.ibm.com/developerworks/linux/library/l-ubuntu-inotify/index.html?ca=drs-
You can use Inotify from Perl:
http://metacpan.org/pod/Linux::Inotify2
The difference is that you will have one long-running app instead of a script that is started by cron. In the app, you'll keep a queue of files that are new (as provided by inotify). Then, you set a timer to go off every 5 minutes, and process 1000 items. After that, control returns to the event loop, and you either wake up in 5 minutes and process 1000 more items, or inotify sends you some more files to add to the queue.
(BTW, you will need an event loop to handle the timers; I recommend EV.)
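A minimal sketch of that pattern using Linux::Inotify2 with EV (the directory path, the batch size, and the process_files() stub are placeholders):

use strict;
use warnings;
use EV;
use Linux::Inotify2;

my $dir = '/path/to/incoming';    # directory being filled by the other job
my @queue;

my $inotify = Linux::Inotify2->new or die "Cannot create inotify object: $!";
$inotify->blocking(0);
$inotify->watch( $dir, IN_CLOSE_WRITE | IN_MOVED_TO, sub {
    my $event = shift;
    push @queue, $event->fullname;    # queue newly arrived files
} );

# Feed inotify events into the EV loop.
my $io_watcher = EV::io $inotify->fileno, EV::READ, sub { $inotify->poll };

# Every 5 minutes, process up to 1000 queued files.
my $timer = EV::timer 300, 300, sub {
    my @batch = splice @queue, 0, 1000;
    process_files(@batch) if @batch;
};

EV::run;

sub process_files {
    # placeholder: your existing per-file processing goes here
    warn "would process: $_\n" for @_;
}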