Powershell Get-ChildItem progress question - performance

So, I've got a set of directories 00-99 in a folder. Each of those directories has 100 subdirectories, 00-99. Each of those subdirectories has thousands of images.
What I'm attempting to do is basically get a progress report while it's computing the average file size, but I can't get that to work. Here's my current query:
get-childitem <MyPath> -recurse -filter *.jpeg | Where-Object { Write-Progress "Examining File $($_.Fullname)" true } | measure-object -Property length -Average
This shows me a bar that updates as each of the files gets processed, but at the end I get back no average file size data. Clearly, I'm doing something wrong, because I figure trying to hack the Where-Object to print a progress statement is probably a bad idea(tm).
Since there are millions and millions of images, this query obviously takes a VERY LONG time to work. get-childitem is pretty much going to be the bulk of query time, if I understand things correctly. Any pointers to get what I want? AKA, my result would ideally be:
Starting...
Examining File: \00\00\Sample.jpeg
Examining File: \00\00\Sample2.jpeg
Examining File: \00\00\Sample3.jpeg
Examining File: \00\00\Sample4.jpeg
...
Examining File: \99\99\Sample9999.jpg
Average File Size: 12345678.244567
Edit: I can do the simple option of:
get-childitem <MyPath> -recurse -filter *.jpeg | measure-object -Property length -Average
And then just walk away from my workstation for a day and a half or something, but that seems a bit inefficient =/

Something like this?
get-childitem -recurse -filter *.exe |
%{Write-Host Examining file: $_.fullname; $_} |
measure-object -Property length -Average
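(The reason your Where-Object attempt returned no average, by the way, is that Write-Progress produces no output, so the script block evaluates to nothing/false for every file and everything gets filtered out before Measure-Object.)
If you would rather keep the progress bar than write to the console, here is a minimal single-pass sketch (no up-front count, so no percentage; <MyPath> stands for your root folder as in the question):
Get-ChildItem <MyPath> -Recurse -Filter *.jpeg |
    ForEach-Object {
        # Show which file we are on, then pass the object along so Measure-Object still receives it.
        Write-Progress -Activity "Computing average file size" -Status "Examining File: $($_.FullName)"
        $_
    } |
    Measure-Object -Property Length -Average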

A little more detailed progress:
$images = get-childitem -recurse -filter *.jpeg
$images | % -begin { $i=0 } `
-process { write-progress -activity "Computing average..." -status "Examining File: $($_.FullName) ($i of $($images.count))" -percentcomplete ($i/$images.count*100); $i+=1 } `
-end { write-output "Average file size is: $(($images | measure-object -Property length -Average).Average)" }

Powershell replace file content with output of previous command [duplicate]

I am having a helluva time trying to understand why this script is not working as intended. It is a simple script in which I am attempting to import a CSV, select a few columns that I want, then export the CSV and copy over itself. (Basically we have archived data that I only need a few columns from for another project due to memory size constraints). This script is very simple, which apparently has an inverse relationship with how much frustration it causes when it doesn't work... Right now the end result is I end up with an empty csv instead of a csv containing only the columns I selected with Select-Object.
$RootPath = "D:\SomeFolder"
$csvFilePaths = Get-ChildItem $RootPath -Recurse -Include *.csv |
ForEach-Object{
Import-CSV $_ |
Select-Object Test_Name, Test_DataName, Device_Model, Device_FW, Data_Avg_ms, Data_StdDev |
Export-Csv $_.FullName -NoType -Force
}
Unless you read the input file into memory in full, up front, you cannot safely read from and write back to the same file in a given pipeline.
Specifically, a command such as Import-Csv file.csv | ... | Export-Csv file.csv will erase the content of file.csv.
The simplest solution is to enclose the command that reads the input file in (...), but note that:
The file's content (transformed into objects) must fit into memory as a whole.
There is a slight risk of data loss if the pipeline is interrupted before all (transformed) objects have been written back to the file.
Applied to your command:
$RootPath = "D:\SomeFolder"
Get-ChildItem $RootPath -Recurse -Include *.csv -OutVariable csvFiles |
ForEach-Object{
(Import-CSV $_.FullName) | # NOTE THE (...)
Select-Object Test_Name, Test_DataName, Device_Model, Device_FW,
Data_Avg_ms, Data_StdDev |
Export-Csv $_.FullName -NoType -Force
}
Note that I've used -OutVariable csvFiles in order to collect the CSV file-info objects in output variable $csvFiles. Your attempt to collect the file paths via $csvFilePaths = ... doesn't work, because it attempts to collect Export-Csv's output, but Export-Csv produces no output.
Also, to be safe, I've changed the Import-Csv argument from $_ to $_.FullName to ensure that Import-Csv finds the input file (because, regrettably, file-info object $_ is bound as a string, which sometimes expands to the mere file name).
A safer solution would be to output to a temporary file first, and (only) on successful completion replace the original file.
With either approach, the replacement file will have default file attributes and permissions; if the original file had special attributes and/or permissions that you want to preserve, you must recreate them explicitly.
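For reference, a minimal sketch of that temporary-file approach (the .tmp name next to the original is an arbitrary choice, not a requirement):
$RootPath = "D:\SomeFolder"
Get-ChildItem $RootPath -Recurse -Include *.csv | ForEach-Object {
    $tempFile = "$($_.FullName).tmp"   # arbitrary temp name beside the original
    Import-Csv $_.FullName |
        Select-Object Test_Name, Test_DataName, Device_Model, Device_FW, Data_Avg_ms, Data_StdDev |
        Export-Csv $tempFile -NoTypeInformation
    # Replace the original only after the temp file was written successfully.
    if ($?) { Move-Item -LiteralPath $tempFile -Destination $_.FullName -Force }
}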
As Matt commented, your last $PSItem ($_) no longer refers to the Get-ChildItem output but to the output of Select-Object, which doesn't have a FullName property.
You can use a different foreach approach:
$RootPath = "D:\SomeFolder"
$csvFilePaths = Get-ChildItem $RootPath -Recurse -Include *.csv
foreach ($csv in $csvFilePaths)
{
Import-CSV $csv.FullName |
Select-Object Test_Name,Test_DataName,Device_Model,Device_FW,Data_Avg_ms,Data_StdDev |
Export-Csv $csv.FullName -NoType -Force
}
Or keeping your code, add $CsvPath Variable containing the csv path and use it later on:
$RootPath = "D:\SomeFolder"
Get-ChildItem $RootPath -Recurse -Include *.csv | ForEach-Object{
$CsvPath = $_.FullName
Import-CSV $CsvPath |
Select-Object Test_Name,Test_DataName,Device_Model,Device_FW,Data_Avg_ms,Data_StdDev |
Export-Csv $CsvPath -NoType -Force
}
So I have figured it out. In the original code I was piping the Import-Csv cmdlet's output straight through Select-Object and Export-Csv; instead, I declared a variable that holds the imported CSV and piped that variable through the Select-Object and Export-Csv cmdlets. Here is the code snippet that gets what I wanted to get done, done. Thank you all for your assistance, I appreciate it!
$RootPath = "\someDirectory\"
$CsvFilePaths = @(Get-ChildItem $RootPath -Recurse -Include *.csv)
$ColumnsWanted = @('Test_Name','Test_DataName','Device_Model','Device_FW','Data_Avg_ms','Data_StdDev')
for($i = 0; $i -lt $CsvFilePaths.Length; $i++){
    $csvPath = $CsvFilePaths[$i]
    Write-Host $csvPath
    $importedCsv = Import-CSV $csvPath
    $importedCsv | Select-Object $ColumnsWanted | Export-CSV $csvPath -NoTypeInformation
}

Quickly find the newest file with PowerShell 2

For PowerShell 2.0 in Win 2008,
I need to check what's the newest file in a directory with about 1.6 million files.
I know I can use Get-ChildItem like so:
$path="G:\Calls"
$filter='*.wav'
$lastFile = Get-ChildItem -Recurse -Path $path -Include $filter | Sort-Object -Property LastWriteTime | Select-Object -Last 1
$lastFile.Name
$lastFile.LastWriteTime
The issue is that it takes sooooo long to find the newest file due to the sheer amount of files.
Is there a faster way to find that?
Sort-Object is slow here because it has to collect and sort every single item before it can emit the last one.
But you don't need to do that as you might just go over each file and keep track of the latest one:
Get-ChildItem -Recurse | ForEach-Object `
    -Begin { $Newest = $Null } `
    -Process { if ($_.LastWriteTime -gt $Newest.LastWriteTime) { $Newest = $_ } } `
    -End { $Newest }
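Applied to the paths from the question (same logic, just with the original $path and $filter plugged in; a sketch):
$path = "G:\Calls"
$filter = '*.wav'
Get-ChildItem -Recurse -Path $path -Filter $filter | ForEach-Object `
    -Begin { $Newest = $null } `
    -Process { if ($_.LastWriteTime -gt $Newest.LastWriteTime) { $Newest = $_ } } `
    -End { $Newest | Select-Object Name, LastWriteTime }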
There are a couple of things that can be done to improve performance.
First, use -Filter rather than -Include, because the filter is passed to the underlying Win32 API, which is a bit faster.
Also, because the script gathers all the files and then sorts them, you might be creating a very large memory footprint during the sorting phase. I don't know if it's possible to query the MFT or some other mechanism that avoids retrieving each file and inspecting its LastWriteTime, but an alternative approach could be:
gci -rec -file -filter *.wav | %{$v = $null}{if ($_.lastwritetime -gt $v.lastwritetime){$v=$_}}{$v}
I tried this with all files and saw the following:
measure-command{ ls -rec -file |sort lastwritetime|select -last 1}
. . .
TotalSeconds : 142.1333641
vs
measure-command { gci -rec -file | %{$v = $null}{if ($_.lastwritetime -gt $v.lastwritetime){$v=$_}}{$v} }
. . .
TotalSeconds : 87.7215093
which is a pretty good savings. There may be additional ways to improve performance.
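One such option, if a .NET 4+ runtime is available (which may not be the case on a stock PowerShell 2.0 / Server 2008 box, so treat this as a sketch), is to stream paths straight from the framework instead of building FileInfo objects in the pipeline:
$newest = $null
$newestTime = [datetime]::MinValue
foreach ($f in [System.IO.Directory]::EnumerateFiles('G:\Calls', '*.wav', 'AllDirectories')) {
    # Only the timestamp is fetched per file; no FileInfo objects pile up in memory.
    $t = [System.IO.File]::GetLastWriteTime($f)
    if ($t -gt $newestTime) { $newestTime = $t; $newest = $f }
}
"$newest ($newestTime)"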

How to monitor progress of md5 hashing of large drives/many files?

I am looking for the simplest and least intrusive way to monitor the progress of MD5 fingerprinting of a large drive with many files (8 TB, 2 million).
What would be the best option, so that if it gets stuck or enters an infinite loop, I can see which file is causing the trouble?
The code:
Get-childitem -recurse -file | select-object @{n="Hash";e={get-filehash -algorithm MD5 -path $_.FullName | Select-object -expandproperty Hash}},lastwritetime,length,fullname | export-csv "$((Get-Date).ToString("yyyyMMdd_HHmmss"))_filelistcsv_MD5_LWT_size_path_file.csv" -notypeinformation
If you want to list progress, you need to know where your process will end, so you need to list all the files BEFORE you start operating on them.
Write-Host "Listing Files..." -Fore Yellow
$AllFiles = Get-ChildItem -Recurse -File
$CurrentFile = 0 ; $TotalFiles = $AllFiles.Count
Write-Host "Hashing Files..." -Fore Yellow
$AllHashes = foreach ($File in $AllFiles){
    Write-Progress -Activity "Hashing Files" -Status "$($CurrentFile)/$($TotalFiles) $($File.FullName)" -PercentComplete (($CurrentFile++/$TotalFiles)*100)
    [PSCustomObject]@{
        File = $File.FullName
        Hash = (Get-FileHash -LiteralPath $File.FullName -Algorithm MD5).Hash
        LastWriteTime = $File.LastWriteTime
        Size = $File.Length
    }
}
$AllHashes | Export-Csv "File.csv" -NoTypeInformation
This will give you a nice header with a progress bar, both in the ISE and in a normal console window (screenshots omitted).
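If a hang is a real concern, one variation (a sketch; Export-Csv -Append needs PowerShell 3.0+) is to write each row as soon as it is produced, so the last line of the CSV always tells you the last file that finished hashing:
$AllFiles = Get-ChildItem -Recurse -File
$CurrentFile = 0 ; $TotalFiles = $AllFiles.Count
foreach ($File in $AllFiles){
    Write-Progress -Activity "Hashing Files" -Status "$($CurrentFile)/$($TotalFiles) $($File.FullName)" -PercentComplete (($CurrentFile++/$TotalFiles)*100)
    [PSCustomObject]@{
        File = $File.FullName
        Hash = (Get-FileHash -LiteralPath $File.FullName -Algorithm MD5).Hash
        LastWriteTime = $File.LastWriteTime
        Size = $File.Length
    } | Export-Csv "File.csv" -NoTypeInformation -Append   # append row by row instead of all at once
}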

Select directory from a file

I need my program to give me every folder containing files whose paths exceed Windows' character limit. That is, if a file's path is longer than 260 characters (248 for folders), I need it to write the path of the file's parent, and I need it to write it only once. For now, I'm using this code:
$maxLength = 248
Get-ChildItem $newPath -Recurse |
Where-Object { ($_.FullName.Length -gt $maxLength) } |
Select-Object -ExpandProperty FullName |
Split-Path $_.FullName
But the Split-Path won't work (this is the first time I use it). It tells me the -Path parameter has a null value (I can write -Path but it doesn't change anything).
If you want an example of what I need: imagine folder3 has a 230-character address and file.txt has a 280-character address:
C:\users\folder1\folder2\folder3\file.txt
Would write:
C:\users\folder1\folder2\folder3
I'm using PS2, by the way.
Spoiler: the tool you are building may not be able to report paths over the limit since Get-ChildItem cannot access them. You can try nevertheless, and also find other solutions in the links at the bottom.
Issue in your code: $_ only works in specific contexts, for example a ForEach-Object loop.
But here, at the end of the pipeline, you're only left with a string containing the full path (not the complete file object any more), so directly passing it to Split-Path should work:
$maxLength = 248
Get-ChildItem $newPath -Recurse |
Where-Object { ($_.FullName.Length -gt $maxLength) } |
Select-Object -ExpandProperty FullName |
Split-Path
as "C:\Windows\System32\regedt32.exe" | Split-Path would output C:\Windows\System32
Sidenote: what do (Get-Item C:\Windows\System32\regedt32.exe).DirectoryName and (Get-Item C:\Windows\System32\regedt32.exe).Directory.FullName output on your computer ? These both show the directory on my system.
Adapted code example:
$maxLength = 248
Get-ChildItem $newPath -Recurse |
Where-Object { ($_.FullName.Length -gt $maxLength) } |
ForEach-Object { $_.Directory.FullName } |
Select-Object -Unique
Additional information about MAX_PATH:
How do I find files with a path length greater than 260 characters in Windows?
Why does the 260 character path length limit exist in Windows?
http://www.powershellmagazine.com/2012/07/24/jaap-brassers-favorite-powershell-tips-and-tricks/
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
https://gallery.technet.microsoft.com/scriptcenter/Get-ChildItemV2-to-list-29291aae
You cannot use Get-ChildItem to list paths greater than the Windows character limit.
There are a couple of alternatives for you. Try an external library like AlphaFS, or you can use robocopy. Boe Prox has a script that utilizes robocopy; it is available on TechNet, but I am not sure whether it will work on PS v2. Anyway, you can give it a try.
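For what it's worth, a rough sketch of the robocopy idea (list-only mode: /L means nothing is copied, so the destination below is just a dummy path; the other switches strip the output down to bare full file paths):
$maxLength = 248
robocopy $newPath C:\__dummy_dest__ /L /S /NJH /NJS /NDL /NC /NS /FP |
    ForEach-Object { $_.Trim() } |
    Where-Object { $_ -and $_.Length -gt $maxLength } |
    Split-Path |
    Select-Object -Unique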
I've had a similar problem and resolved it like this:
$PathTooLong = @()
Get-ChildItem -LiteralPath $Path -Recurse -ErrorVariable +e -ErrorAction SilentlyContinue
$e | where {$_.Exception -like 'System.IO.PathTooLongException*'} | ForEach-Object {
    $PathTooLong += $_.TargetObject
    $Global:Error.Remove($_)
}
$PathTooLong
On every path that is too long, or that the PowerShell engine can't handle, Get-ChildItem will throw an error. This error is saved in the ErrorVariable called e in the example above.
When all errors are collected in $e you can filter out the ones you need by checking the error Exception for the string System.IO.PathTooLongException.
Hope it helps you out.
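To get exactly what the question asked for (each offending parent folder, listed once), a short follow-up on top of $PathTooLong: the TargetObject collected above holds the path that failed, so its parent is what you want.
$PathTooLong | Split-Path | Select-Object -Unique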

Compare a log file of file paths to a directory structure and remove files not in log file

I have a file transfer/sync job that is copying files from the main network into a totally secure network using a custom protocol (ie no SMB). The problem is that because I can't look back to see what files exist, the destination is filling up, as the copy doesn't remove any files it hasn't touched (like robocopy MIR does).
Initially I wrote a script that:
1. Opens the log file and grabs the file paths out (this is quite quick and painless)
2. Does a Get-ChildItem on the destination folder (now using dir /s /b as it's way faster than gci)
3. Compared the two, and then removed the differences.
The problem is that there are more jobs that require this clean-up, but the log files are 100MB and the folders contain 600,000 files, so it's taking ages and using tons of memory. I actually have yet to see one finish. I'd really like some ideas on how to make this faster (memory/cpu use doesn't bother me too much, but speed is essential).
$destinationMatch = "//server/fileshare/folder/"
The log file contains some headers and footers and then 600,000 lines like this one:
"//server/fileshare/folder/dummy/deep/tags/20140826/more_stuff/Deeper/2012-07-02_2_0.dat_v2" 33296B 0B completed
Here's the script:
[CmdletBinding(SupportsShouldProcess=$True)]
param(
[Parameter(Mandatory=$True)]
[String]$logName,
[Parameter(Mandatory=$True)]
[String]$destinationMatch
)
$logPath = [string]("C:\Logs\" + $logName)
$manifestFile = gci -Path $logPath | where {$_.name -match "manifest"} | sort creationtime -descending | select Name -first 1
$manifestFileName = [string]$manifestFile.name
$manifestFullPath = $logPath + "\" + $manifestFileName
$copiedList = @()
(gc $manifestFullPath -ReadCount 0) | where {$_.trim() -match $DestinationMatch} | % {
    if ( $_ -cmatch '(?<=")[^"]*(?=")' ){
        $copiedList += ($matches[0]).replace("/","\")
    }
}
$dest = $destinationMatch.replace("/","\")
$actualPathString = (gci -Path $dest -Recurse | select fullname).fullname
Compare-Object -ReferenceObject $copiedList -DifferenceObject $actualPathString -PassThru | % {
    $leaf = Split-Path $_ -leaf
    if ($leaf.contains(".")){
        $fsoData = gci -Path $_
        if (!($fsoData.PSIsContainer)){
            Remove-Item $_ -Force
        }
    }
}
$actualDirectory | where {$_.PSIsContainer -and @(gci -LiteralPath $_.FullName -Recurse -WarningAction SilentlyContinue -ErrorAction SilentlyContinue | where {!$_.PSIsContainer}).Length -eq 0} | remove-item -Recurse -Force
Ok, so let's assume that your file copy preserves the last modified date/time stamp. If you really need to pull a directory listing, and compare it against a log, I think you're doing a decent job of it. The biggest slow down is obviously going to be pulling your directory listing. I'll address that shortly. For right now I would propose the following modification of your code:
[CmdletBinding(SupportsShouldProcess=$True)]
param(
[Parameter(Mandatory=$True)]
[String]$logName,
[Parameter(Mandatory=$True)]
[String]$destinationMatch
)
$logPath = [string]("C:\Logs\" + $logName)
$manifestFile = gci -Path $logPath | where {$_.name -match "manifest"} | sort creationtime -descending | select -first 1
$RegExPattern = [regex]::escape($DestinationMatch)
$FilteredManifest = gc $manifestFile.FullName | where {$_ -match "`"($RegexPattern[^`"]*)`""} |%{$matches[1] -replace '/','\'}
$dest = $destinationMatch.replace("/","\")
$DestFileList = gci -Path $dest -Recurse | select Fullname,Attributes
$DestFileList | Where{$FilteredManifest -notcontains $_.FullName -and $_.Attributes -notmatch "Directory"} | ForEach-Object {Remove-Item -LiteralPath $_.FullName -Force}
$DestFileList | Where{$FilteredManifest -notcontains $_.FullName -and $_.Attributes -match "Directory" -and @(gci -LiteralPath $_.FullName -Recurse -WarningAction SilentlyContinue -ErrorAction SilentlyContinue).Count -eq 0} | ForEach-Object {Remove-Item -LiteralPath $_.FullName -Recurse -Force}
This stops you from duplicating efforts. There's no need to get your manifest file, and then assign different variables to different properties of the file object, just reference them directly. Then later when you pull your directory listing of the drive (the slow part here), keep the full name and attributes of the files/folders. That way you can easily filter against Attributes to see what's a directory and what not, so we can deal with files first, then clean up directories later after the files are cleaned up.
That script should be a bit more streamlined version of yours. Now, about pulling that directory listing... Here's the deal, using Get-ChildItem is going to be slower than some alternatives (such as dir /s /b) but it stops you from having to duplicate efforts by later checking what's a file, and what's a directory. I suppose if the actual files/folders that you are concerned with are a small percentage of the total, then the double work may actually be worth the time and effort to pull the list with something like dir /s /b, and then parse against the log, and only pull folder/file info for the specific items you need to address.
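A rough sketch of that dir /s /b variant (reusing the $dest and $FilteredManifest variables from above; dir /s /b emits both files and directories, so Test-Path separates them, and for 600,000 entries a HashSet lookup would beat -notcontains, but the shape of the approach is the same):
$allEntries = cmd /c dir /s /b "$dest"
$toRemove = $allEntries | Where-Object { $FilteredManifest -notcontains $_ }
# Files first...
$toRemove | Where-Object { Test-Path -LiteralPath $_ -PathType Leaf } |
    ForEach-Object { Remove-Item -LiteralPath $_ -Force }
# ...then any directories that are left empty.
$toRemove | Where-Object { (Test-Path -LiteralPath $_ -PathType Container) -and
        -not (Get-ChildItem -LiteralPath $_ -Recurse -Force | Where-Object { -not $_.PSIsContainer }) } |
    ForEach-Object { Remove-Item -LiteralPath $_ -Recurse -Force }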
