How to split a huge folder?

We have a folder on Windows that's ... huge. I ran "dir > list.txt". The command stopped responding after 1.5 hours. The output file is about 200 MB and shows that there are at least 2.8 million files. I know the situation is stupid, but let's focus on the problem itself. If I have such a folder, how can I split it into some "manageable" sub-folders? Surprisingly, all the solutions I have come up with involve getting the full list of files in the folder at some point, which is a no-no in my case. Any suggestions?
Thanks to Keith Hill and Mehrdad. I accepted Keith's answer because that's exactly what I wanted to do, but I couldn't quite get PS working quickly.
With Mehrdad's tip, I wrote this little program. It took 7+ hours to move 2.8 million files. So the initial dir command did finish; for some reason it just never returned to the console.
using System;
using System.IO;

namespace SplitHugeFolder
{
    class Program
    {
        // Usage: SplitHugeFolder <sourceDir> <destinationDir> <filesPerBatch>
        static void Main(string[] args)
        {
            var destination = args[1];
            if (!Directory.Exists(destination))
                Directory.CreateDirectory(destination);

            var di = new DirectoryInfo(args[0]);
            var batchCount = int.Parse(args[2]);
            int currentBatch = 0;
            string targetFolder = GetNewSubfolder(destination);

            // EnumerateFiles streams entries one by one instead of
            // loading the whole listing into memory first.
            foreach (var fileInfo in di.EnumerateFiles())
            {
                // Start a new subfolder once the current one is full.
                if (currentBatch == batchCount)
                {
                    Console.WriteLine("New Batch...");
                    currentBatch = 0;
                    targetFolder = GetNewSubfolder(destination);
                }

                var source = fileInfo.FullName;
                var target = Path.Combine(targetFolder, fileInfo.Name);
                File.Move(source, target);
                currentBatch++;
            }
        }

        // Creates and returns a randomly named subfolder that does not exist yet.
        private static string GetNewSubfolder(string parent)
        {
            string newFolder;
            do
            {
                newFolder = Path.Combine(parent, Path.GetRandomFileName());
            } while (Directory.Exists(newFolder));

            Directory.CreateDirectory(newFolder);
            return newFolder;
        }
    }
}

I use Get-ChildItem to index my whole C: drive every night into c:\filelist.txt. That's about 580,000 files and the resulting file size is ~60MB. Admittedly I'm on Win7 x64 with 8 GB of RAM. That said, you might try something like this:
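md c:\newdir
Get-ChildItem C:\hugedir -r |
    Foreach -Begin {$i = $j = 0} -Process {
        # Start a new destination folder every 100,000 items.
        if ($i++ % 100000 -eq 0) {
            $dest = "C:\newdir\dir$j"
            md $dest
            $j++
        }
        # Pass the full path explicitly so the move does not depend on the current location.
        Move-Item -LiteralPath $_.FullName -Destination $dest
    }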
The key is to do the move in a streaming manner. That is, don't collect all the Get-ChildItem results into a single variable and then proceed. That would require all 2.8 million FileInfos to be in memory at once. Also, if you use the Name parameter on Get-ChildItem, it will output a single string containing the file's path relative to the base dir. Even then, perhaps this size will just overwhelm the memory available to you. And no doubt it will take quite a while to execute. IIRC, my indexing script takes several hours.
If it does work, you should wind up with c:\newdir\dir0 thru dir28, but then again, I haven't tested this script at all, so your mileage may vary. BTW, this approach assumes that your huge dir is a pretty flat dir.
Update: Using the Name parameter is almost twice as slow so don't use that parameter.

I found out that Get-ChildItem is the slowest option when working with many items in a directory.
Look at the results:
Measure-Command { Get-ChildItem C:\Windows -rec | Out-Null }
TotalSeconds : 77,3730275
Measure-Command { listdir C:\Windows | Out-Null }
TotalSeconds : 20,4077132
measure-command { cmd /c dir c:\windows /s /b | out-null }
TotalSeconds : 13,8357157
(with listdir function defined like this:
function listdir($dir) {
$dir
[system.io.directory]::GetFiles($dir)
foreach ($d in [system.io.directory]::GetDirectories($dir)) {
listdir $d
}
}
)
With this in mind, here is what I would do: stay in PowerShell, but use a lower-level approach with .NET methods:
function DoForFirst($directory, $max, $action) {
function go($dir, $options)
{
foreach ($f in [system.io.Directory]::EnumerateFiles($dir))
{
if ($options.Remaining -le 0) { return }
& $action $f
$options.Remaining--
}
foreach ($d in [system.io.directory]::EnumerateDirectories($dir))
{
if ($options.Remaining -le 0) { return }
go $d $options
}
}
go $directory (New-Object PsObject -Property @{Remaining=$max })
}
doForFirst c:\windows 100 {write-host File: $args }
# I use PsObject to avoid global variables and ref parameters.
To use the code you have to switch to the .NET 4.0 runtime; the enumerating methods are new in .NET 4.0.
You can specify any scriptblock as the -action parameter, so in your case it would be something like {Move-item -literalPath $args -dest c:\dir }.
Just try to list the first 1000 items; I hope it will finish very quickly:
doForFirst c:\yourdirectory 1000 {write-host '.' -nonew }
And of course you can process all items at once, just use
doForFirst c:\yourdirectory ([long]::MaxValue) {move-item ... }
and each item should be processed immediately after it is returned. So the whole list is not read at once and then processed, but it is processed during reading.
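Combining this streaming enumeration with the batching idea from the question, a minimal sketch might look like the following. It assumes .NET 4.0 or later (for EnumerateFiles), PowerShell 3.0 or later, a flat source directory, and full (not relative) paths. The function name Split-HugeFolder, its parameters, and the batchNNNNN folder naming are made up for illustration, and it has not been tried on millions of files.

```powershell
function Split-HugeFolder {
    param(
        [Parameter(Mandatory)] [string] $Source,
        [Parameter(Mandatory)] [string] $Destination,
        [int] $BatchSize = 10000
    )

    $count  = 0      # files moved so far
    $batch  = 0      # index of the next batch folder to create
    $target = $null  # folder that currently receives files

    # EnumerateFiles yields one path at a time instead of building the full list.
    foreach ($file in [System.IO.Directory]::EnumerateFiles($Source)) {
        # Open a new batch folder every $BatchSize files.
        if ($count % $BatchSize -eq 0) {
            $target = Join-Path $Destination ('batch{0:D5}' -f $batch)
            [void][System.IO.Directory]::CreateDirectory($target)
            $batch++
        }
        $name = [System.IO.Path]::GetFileName($file)
        [System.IO.File]::Move($file, (Join-Path $target $name))
        $count++
    }

    $count   # emit the number of files moved
}
```

It could be called as Split-HugeFolder -Source C:\hugedir -Destination C:\newdir -BatchSize 100000. Calling the .NET methods directly, rather than Move-Item, keeps the per-file overhead low, which is the same reasoning as in the measurements above.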

How about starting with this:
cmd /c dir /b > list.txt
That should get you a list of all the file names.
If you're doing "dir > list.txt" from a powershell prompt, get-childitem is aliased as "dir". Get-childitem has known issues enumerating large directories, and the object collections it returns can get huge.

Related

Powershell Performance

I have a problem with PowerShell performance while searching a 40 GB log file.
I need to check whether any of 1000 email addresses are included in this 40 GB file. This would take 180 hours :D Any ideas?
$logFolder = "H:\log.txt"
$adressen= Get-Content H:\Adressen.txt
$ergebnis = @()
foreach ($adr in $adressen){
$suche = Select-String -Path $logFolder -Pattern "\[\(\'from\'\,.*$adr.*\'\)\]" -List
$aktiv= $false
$adr
if ($suche){
$aktiv = $true
}
if ($aktiv -eq $true){
$ergebnis+=$adr + ";Ja"
}
else{
$ergebnis+=$adr + ";Nein"
}
}
$ergebnis |Out-File H:\output.txt

Don't read the file 1000 times.
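Build one regex with all 1000 addresses (it's going to be a huge line, but hey, much smaller than 40 GB). Escape each address and wrap the alternation in a group, so that the | does not split the rest of the pattern. Like:
$Pattern = "\[\(\'from\'\,.*(?:$( ($adressen | ForEach-Object { [regex]::Escape($_) }) -join '|' )).*\'\)\]"
Then run Select-String once and save the result, so you can do the address-by-address search in that result. Hopefully the result will be much smaller than 40 GB, and searching it should be much faster.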
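One way this single-pass idea could be put together is sketched below, assuming PowerShell 5 or later (for ::new) and log lines that look like the pattern in the question. The function name Find-AddressesInLog and its parameters are hypothetical. A capturing group records which address matched, so no second search is needed; if a line contains several of the listed addresses, only one of them is credited.

```powershell
function Find-AddressesInLog {
    param(
        [string]   $LogPath,
        [string[]] $Addresses
    )

    # Escape each address so '.' and '+' match literally, then join all of
    # them into one alternation inside a capturing group.
    $escaped = $Addresses | ForEach-Object { [regex]::Escape($_) }
    $pattern = "\[\('from',.*(" + ($escaped -join '|') + ").*'\)\]"

    # Select-String matches case-insensitively by default, so the set does too.
    $comparer = [System.StringComparer]::OrdinalIgnoreCase
    $found = [System.Collections.Generic.HashSet[string]]::new($comparer)

    # One pass over the log: every matching line contributes the address it captured.
    Select-String -Path $LogPath -Pattern $pattern | ForEach-Object {
        [void]$found.Add($_.Matches[0].Groups[1].Value)
    }

    # Emit one "address;Ja" or "address;Nein" line per input address.
    foreach ($adr in $Addresses) {
        if ($found.Contains($adr)) { "$adr;Ja" } else { "$adr;Nein" }
    }
}
```

With the file names from the question, usage would be something like Find-AddressesInLog -LogPath H:\log.txt -Addresses (Get-Content H:\Adressen.txt) | Out-File H:\output.txt. A 1000-way alternation is not cheap per line, so this is likely still slow, but the 40 GB file is read once instead of 1000 times.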

As mentioned in the comments, replace
$ergebnis = @()
with
$ergebnis = New-Object System.Collections.ArrayList
and
$ergebnis+=$adr + ";Ja"
with
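[void]$ergebnis.Add("$adr;Ja")
or, respectively,
[void]$ergebnis.Add("$adr;Nein")
(the [void] cast suppresses the index that Add returns). This will speed up your script quite a bit, because += copies the whole array on every addition.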
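Another option, shown here only as a sketch, is to let PowerShell collect the loop output itself by assigning the foreach statement to the variable. The sample data below ($adressen with three made-up addresses and $treffer as the list of addresses that were found) is hypothetical and stands in for the real search.

```powershell
# Hypothetical sample data; in the question these come from Adressen.txt and the log search.
$adressen = 'a@example.com', 'b@example.com', 'c@example.com'
$treffer  = 'a@example.com', 'c@example.com'

# Assigning the foreach statement collects everything the loop emits in one go,
# instead of rebuilding the array on every += .
$ergebnis = foreach ($adr in $adressen) {
    if ($treffer -contains $adr) { "$adr;Ja" } else { "$adr;Nein" }
}
```

This avoids both the array copying of += and the extra collection object.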

Invoke-Command faster than the command itself?

I was trying to measure some ways to write to files in PowerShell. No question about that but I don't understand why the first Measure-Command statement below takes longer to be executed than the 2nd statement.
They are the same but in the second one I write a scriptblock to send to Invoke-Command and in the 1st one I only run the command.
All the information I can find about Invoke-Command speed is about remoting.
This block takes about 4 seconds:
Measure-Command {
$stream = [System.IO.StreamWriter] "$PSScriptRoot\t.txt"
$i = 0
while ($i -le 1000000) {
$stream.WriteLine("This is the line number: $i")
$i++
}
$stream.Close()
} # takes 4 sec
And this code below which is exactly the same but written in a scriptblock passed to Invoke-Command takes about 1 second:
Measure-Command {
$cmdtest = {
$stream = [System.IO.StreamWriter] "$PSScriptRoot\t2.txt"
$i = 0
while ($i -le 1000000) {
$stream.WriteLine("This is the line number: $i")
$i++
}
$stream.Close()
}
Invoke-Command -ScriptBlock $cmdtest
} # Takes 1 second
How is that possible?

As it turns out, based on feedback from a PowerShell team member on GitHub issue #8911, the issue is more generally about (implicit) dot-sourcing (such as direct invocation of an expression) vs. running in a child scope, such as with &, the call operator, or, in the case at hand, with Invoke-Command -ScriptBlock.
Running in a child scope avoids variable lookups that are performed when (implicitly) dot-sourcing.
Therefore, as of Windows PowerShell v5.1 / PowerShell (Core) 7.2.x, you can speed up statements involving script blocks by invoking them via & { ... }, in a child scope (somewhat counter-intuitively, given that creating a new scope involves extra work).
Note that using & means that such blocks then cannot modify the caller's variables directly, but there are workarounds.
The following simplified code, which uses a foreach expression to loop 1 million times (1e6) demonstrates the performance advantage of running via & { ... }:
# REGULAR, direct invocation of an expression (a `foreach` statement in this case),
# which is implicitly DOT-SOURCED
(Measure-Command { $result = foreach ($n in 1..1e6) { $n } }).TotalSeconds
# OPTIMIZED invocation in CHILD SCOPE, using & { ... }
# up to 10+ TIMES FASTER, depending on OS and PowerShell edition
(Measure-Command { $result = & { foreach ($n in 1..1e6) { $n } } }).TotalSeconds
However, note that the performance advantage diminishes and can even go away the more preexisting variables are being referenced in the script block:
# Define a few sample variables to reference in the script blocks.
# Note that, due to PowerShell's dynamic scoping, even the child
# scope created by & { ... } sees these variables.
$i1=1; $i2=2; $i3=3; $i4=4; $i5=5
(Measure-Command { $result = foreach ($n in 1..1e6) { $n, $i1, $i2, $i3, $i4, $i5 } }).TotalSeconds
# MAY OR MAY NOT BE FASTER, depending on the OS and PowerShell edition.
(Measure-Command { $result = & { foreach ($n in 1..1e6) { $n, $i1, $i2, $i3, $i4, $i5 } } }).TotalSeconds
The reason is that variables that aren't created in the script block (by assigning to them inside it) require a variable lookup with & { ... } too, due to PowerShell's dynamic scoping (see this answer).

Converting a powershell script to Runspace

I wrote a quick script to find the percentage of users in one user list (TEMP.txt) that are also in another user list (TEMP2.txt). It worked great for a while, until my user lists grew past a couple hundred thousand entries or so... now it's too slow. I want to convert it to use runspaces to speed it up, but I am failing miserably. The original script is:
$USERLIST1 = gc .\TEMP.txt
$i = 0
ForEach ($User in $USERLIST1){
If (gc .\TEMP2.txt |Select-String $User -quiet){
$i = $i + 1
}
}
$Count = gc .\TEMP2.txt | Measure-object -Line
$decimal = $i / $count.lines
$percent = $decimal * 100
Write-Host "$percent %"
Sorry I am still new at powershell.

Not sure how much this will help you, I am new with runspaces as well but here is some code I used with a Windows Form running things asynchronously in a separate runspace, you might be able to manipulate it to do what you need:
$Runspace = [Management.Automation.Runspaces.RunspaceFactory]::CreateRunspace($Host)
$Runspace.ApartmentState = 'STA'
$Runspace.ThreadOptions = 'ReuseThread'
$Runspace.Open()
#Add the Form object to the Runspace environment
$Runspace.SessionStateProxy.SetVariable('Form', $Form)
#Create a new PowerShell object (a Thread)
$PowerShellRunspace = [System.Management.Automation.PowerShell]::Create()
#Initializes the PowerShell object with the runspace
$PowerShellRunspace.Runspace = $Runspace
#Add the scriptblock which should run inside the runspace
$PowerShellRunspace.AddScript({
[System.Windows.Forms.Application]::Run($Form)
})
#Open and run the runspace asynchronously
$AsyncResult = $PowerShellRunspace.BeginInvoke()
#End the pipeline of the PowerShell object
$PowerShellRunspace.EndInvoke($AsyncResult)
#Close the runspace
$Runspace.Close()
#Remove the PowerShell object and its resources
$PowerShellRunspace.Dispose()

Leaving runspaces aside, the following script should already run faster, because it reads TEMP2.txt once instead of once per user:
$USERLIST1 = gc .\TEMP.txt
$USERLIST2 = gc .\TEMP2.txt
$i = 0
ForEach ($User in $USERLIST1) {
if ($USERLIST2.Contains($User)) {
$i += 1
}
}
$Count = $USERLIST2.Count
$decimal = $i / $count
$percent = $decimal * 100
Write-Host "$percent %"
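For lists with hundreds of thousands of entries, the array lookup above still scans the second list for every user. A hash set makes each lookup roughly constant time. The following is a minimal sketch, assuming PowerShell 5 or later and a non-empty second list; the function name Get-OverlapPercent and its parameters are made up. Like the script above, it does exact, case-sensitive matching, which is not quite what Select-String does in the question.

```powershell
function Get-OverlapPercent {
    param(
        [string] $ListPath1,
        [string] $ListPath2
    )

    # Load the second list into a hash set once; lookups no longer scan the whole list.
    $set   = [System.Collections.Generic.HashSet[string]]::new()
    $total = 0
    foreach ($user in Get-Content $ListPath2) {
        [void]$set.Add($user)
        $total++
    }

    # Count how many users from the first list appear in the second.
    $hits = 0
    foreach ($user in Get-Content $ListPath1) {
        if ($set.Contains($user)) { $hits++ }
    }

    # Same ratio as the original script: matches divided by the line count of the second list.
    100 * $hits / $total
}
```

Usage would be along the lines of Write-Host "$(Get-OverlapPercent .\TEMP.txt .\TEMP2.txt) %".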

Compare-Object PowerShell performance and Operation VS Loop

I was looking at this question where the OP wanted to know how to compare items in two arrays without looping through each array.
The command given was:
$array3 = @(Compare-Object $array1 $array2 | select -Expand InputObject)
My question is two-fold:
One, does this actually avoid iterating through the arrays in any form? Or does it simply obfuscate the operation from the user by doing it behind the scenes.
Two, as far as performance goes is this the best method for comparing objects? It appears to me it is actually significantly slower.
I made a real crude test:
$Array1 = @("1","2","Orchid","Envy","Sam","Map Of the World","Short String","s","V","DM","qwerty","1234567891011")
$Array2 = @("Bob", "Helmet", "Jane")
$Date1 = Get-Date
$Array2 | ForEach-Object `
{
if ($Array1 -contains $_){}
}
$Date2 = Get-Date
$Time1 = [TimeSpan]$Date2.Subtract($Date1)
Write-Host $Time1
$Date1 = Get-Date
$Array3 = @(Compare-Object $Array1 $Array2)
$Date2 = Get-Date
$Time2 = [TimeSpan]$Date2.Subtract($Date1)
Write-Host $Time2
And my times came out:
ForEach-Object: 00:00:00.0030001
Compare-Object: 00:00:00.0030002
Edit
I updated the script to make it more fair, and it essentially evened out the times.
So what is the behind the scenes difference between Compare-Object and a traditional loop? Am I correct in assuming none?
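Since both timings above are only a few milliseconds for a single run, a repeated measurement may separate the two approaches better. The sketch below reuses the arrays from the test; the repeat count of 1000 is an arbitrary choice, and no particular result is implied.

```powershell
$Array1 = @("1","2","Orchid","Envy","Sam","Map Of the World","Short String","s","V","DM","qwerty","1234567891011")
$Array2 = @("Bob", "Helmet", "Jane")
$runs   = 1000   # arbitrary repeat count, chosen only to get well above timer resolution

# Time the explicit loop with -contains, repeated $runs times.
$loopTime = Measure-Command {
    foreach ($r in 1..$runs) {
        foreach ($item in $Array2) { $null = $Array1 -contains $item }
    }
}

# Time Compare-Object on the same input, repeated $runs times.
$compareTime = Measure-Command {
    foreach ($r in 1..$runs) { $null = @(Compare-Object $Array1 $Array2) }
}

'Loop:           {0:N4} ms per run' -f ($loopTime.TotalMilliseconds / $runs)
'Compare-Object: {0:N4} ms per run' -f ($compareTime.TotalMilliseconds / $runs)
```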
Edit 2
I found this code using the decompiler:
internal int Compare(ObjectCommandPropertyValue first, ObjectCommandPropertyValue second)
{
if (first.IsExistingProperty && second.IsExistingProperty)
return this.Compare(first.PropertyValue, second.PropertyValue);
if (first.IsExistingProperty)
return -1;
return second.IsExistingProperty ? 1 : 0;
}
public int Compare(object first, object second)
{
if (ObjectCommandComparer.IsValueNull(first) && ObjectCommandComparer.IsValueNull(second))
return 0;
PSObject psObject1 = first as PSObject;
if (psObject1 != null)
first = psObject1.BaseObject;
PSObject psObject2 = second as PSObject;
if (psObject2 != null)
second = psObject2.BaseObject;
try
{
return LanguagePrimitives.Compare(first, second, !this.caseSensitive, (IFormatProvider) this.cultureInfo) * (this.ascendingOrder ? 1 : -1);
}
catch (InvalidCastException ex)
{
}
catch (ArgumentException ex)
{
}
return string.Compare(((object) PSObject.AsPSObject(first)).ToString(), ((object) PSObject.AsPSObject(second)).ToString(), !this.caseSensitive, this.cultureInfo) * (this.ascendingOrder ? 1 : -1);
}
I have traced it around as best as I can, and I believe these are the two worker methods. It appears Compare-Object actually only does a 1 <==> 1 check down the list. Am I missing something here?

Delete files older than so many days in multiple zips

Help!
I need to scan a folder with 200 GB of zipped .log files and delete all the files that are more than 584 days old.
I have found this, and have left a reply in there, but if anyone can help in the meantime then thanks
http://social.technet.microsoft.com/Forums/en/ITCG/thread/793118fb-8345-4711-9710-9c3e485e6d89?prof=required
Cheers

Using SevenZipSharp. Backup any important data first :)
Make sure to read and change any paths that don't make sense, etc.
[Reflection.Assembly]::LoadFile("c:\lib\SevenZipSharp.dll")
[SevenZip.SevenZipExtractor]::SetLibraryPath("c:\lib\7z.dll")
$zipFiles = Get-ChildItem D:\zips\ -Filter "*.zip"
$oldDate = (get-date).AddDays(-584)
$zipFiles | % {
$compressor = [SevenZip.SevenZipCompressor]("C:\")
$compressor.ArchiveFormat = "zip"
$extractor = [SevenZip.SevenZipExtractor]($_.FullName)
$object = New-Object 'system.collections.generic.dictionary[int,string]'
$extractor.ArchiveFileData | %{
if ($_.LastWriteTime -lt $oldDate){
#null index deletes the file
$object.add($_.Index,"")
}
else {
$object.add($_.Index,$_.FileName)
}
}
$compressor.ModifyArchive($_.FullName,$object)
}
