In a Powershell script, I have two data sets that have multiple columns. Not all these columns are shared.
For example, data set 1:
A B XY ZY
- - -- --
1 val1 foo1 bar1
2 val2 foo2 bar2
3 val3 foo3 bar3
4 val4 foo4 bar4
5 val5 foo5 bar5
6 val6 foo6 bar6
and data set 2:
A B ABC GH
- - --- --
3 val3 foo3 bar3
4 val4 foo4 bar4
5 val5 foo5 bar5
6 val6 foo6 bar6
7 val7 foo7 bar7
8 val8 foo8 bar8
I want to merge these two dataset, specifying which columns act as key (A and B in my simple case). The expected result is:
A B XY ZY ABC GH
- - -- -- --- --
1 val1 foo1 bar1
2 val2 foo2 bar2
3 val3 foo3 bar3 foo3 bar3
4 val4 foo4 bar4 foo4 bar4
5 val5 foo5 bar5 foo5 bar5
6 val6 foo6 bar6 foo6 bar6
7 val7 foo7 bar7
8 val8 foo8 bar8
The concept is very similar to a SQL full outer join.
I have been able to write a function that merges the objects successfully. Unfortunately, the computation time grows roughly quadratically with the input size.
If I generate my data sets using :
$dsLength = 10
$dataset1 = 0..$dsLength | %{
New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
}
$dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
}
I get these results:
$dsLength = 10 ==> 33ms (fine)
$dsLength = 100 ==> 89ms (fine)
$dsLength = 1000 ==> 1563ms (acceptable)
$dsLength = 5000 ==> 35764ms (too much)
$dsLength = 10000 ==> 138047ms (too much)
$dsLength = 20000 ==> 573614ms (far too much)
How can I merge datasets efficiently when data sets are large (my target is around 20K items) ?
Right now, I have these function defined :
function Merge-Objects{
param(
[Parameter(Mandatory=$true)]
[object[]]$Dataset1,
[Parameter(Mandatory=$true)]
[object[]]$Dataset2,
[Parameter()]
[string[]]$Properties
)
$result = @()
$ds1props = $Dataset1 | gm -MemberType Properties
$ds2props = $Dataset2 | gm -MemberType Properties
$ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
$ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }
foreach($row1 in $Dataset1){
$result += $row1
$ds2propsNotInDs1Props | % {
$row1 | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
}
}
foreach($row2 in $Dataset2){
$existing = foreach($candidate in $result){
$match = $true
foreach($prop in $Properties){
if(-not ($row2.$prop -eq $candidate.$prop)){
$match = $false
break
}
}
if($match){
$candidate
break
}
}
if(!$existing){
$ds1propsNotInDs2Props | % {
$row2 | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
}
$result += $row2
}else{
$ds2propsNotInDs1Props | % {
$existing.$($_.Name) = $row2.$($_.Name)
}
}
}
$result
}
I call these function like this :
Measure-Command -Expression {
$data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B"
}
My feeling is that the slowness is due to the second loop, where I try to match an existing row in each iteration
[Edit] Second approach using a hash as an index. Surprisingly, it's even slower than the first try.
$dsLength = 1000
$dataset1 = 0..$dsLength | %{
New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
}
$dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
}
function Get-Hash{
param(
[Parameter(Mandatory=$true)]
[object]$InputObject,
[Parameter()]
[string[]]$Properties
)
$InputObject | Select-object $properties | Out-String
}
function Merge-Objects{
param(
[Parameter(Mandatory=$true)]
[object[]]$Dataset1,
[Parameter(Mandatory=$true)]
[object[]]$Dataset2,
[Parameter()]
[string[]]$Properties
)
$result = @()
$index = @{}
$ds1props = $Dataset1 | gm -MemberType Properties
$ds2props = $Dataset2 | gm -MemberType Properties
$allProps = $ds1props + $ds2props | select -Unique
$ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
$ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }
$ds1index = @{}
foreach($row1 in $Dataset1){
$tempObject = new-object psobject
$result += $tempObject
$ds2propsNotInDs1Props | % {
$tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
}
$ds1props | % {
$tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $row1.$($_.Name)
}
$hash1 = Get-Hash -InputObject $row1 -Properties $Properties
$ds1index.Add($hash1, $tempObject)
}
foreach($row2 in $Dataset2){
$hash2 = Get-Hash -InputObject $row2 -Properties $Properties
if($ds1index.ContainsKey($hash2)){
# merge object
$existing = $ds1index[$hash2]
$ds2propsNotInDs1Props | % {
$existing.$($_.Name) = $row2.$($_.Name)
}
$ds1index.Remove($hash2)
}else{
# add object
$tempObject = new-object psobject
$ds1propsNotInDs2Props | % {
$tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
}
$ds2props | % {
$tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $row2.$($_.Name)
}
$result += $tempObject
}
}
$result
}
Measure-Command -Expression {
$data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B"
}
[Edit2] Putting Measure-Command around the two loops shows that even the first loop is slow. Actually, the first loop takes more than 50% of the total time.
I agree with @Matt. Use a hashtable -- something like the below. This should run in m + 2n rather than mn time.
Timings on my system
original Solution above
#10 TotalSeconds : 0.07788
#100 TotalSeconds : 0.37937
#1000 TotalSeconds : 5.25092
#10000 TotalSeconds : 242.82018
#20000 TotalSeconds : 906.01584
This definitely looks O(n^2)
Solution Below
#10 TotalSeconds : 0.094
#100 TotalSeconds : 0.425
#1000 TotalSeconds : 3.757
#10000 TotalSeconds : 45.652
#20000 TotalSeconds : 92.918
This looks linear.
Solution
I used three techniques to increase the speed:
Change over to a hashtable. This allows constant time lookups so that you don't have to have nested loops. This is the only change really needed to go from O(n^2) to linear time. It does have the disadvantage that there is more setup work done. So, the advantage of linear time won't be seen until the loop count is large enough to pay for the setup.
Use ArrayList instead of a native array. Adding an item to a native array requires that the array be reallocated and all the items copied, so building the result with += is also an O(n^2) operation overall. Since this operation is being done at the engine level, the constant is very small, so it really won't make a difference until much later (see the timing sketch after this list).
Use PsObject.Copy to create the new object. This is a small optimization compared to the other two, but it cut the run time in half for me.
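As a rough, machine-dependent illustration of points 1 and 2 (the size $n and the exact timings are arbitrary and will vary), building a collection with += versus ArrayList.Add, plus hashtable lookups, can be compared like this:
$n = 20000
(Measure-Command {
    $a = @(); foreach ($i in 1..$n) { $a += $i }          # native array: reallocates and copies on every add
}).TotalMilliseconds
(Measure-Command {
    $list = [System.Collections.ArrayList]::new()
    foreach ($i in 1..$n) { [void]$list.Add($i) }         # ArrayList: amortized constant-time Add
}).TotalMilliseconds
(Measure-Command {
    $h = @{}; foreach ($i in 1..$n) { $h[$i] = $true }    # build the index once...
    foreach ($i in 1..$n) { $null = $h.ContainsKey($i) }  # ...then each lookup is constant time
}).TotalMilliseconds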
--
function Get-Hash{
param(
[Parameter(Mandatory=$true)]
[object]$InputObject,
[Parameter()]
[string[]]$Properties
)
$arr = [System.Collections.ArrayList]::new()
foreach($p in $Properties) { $arr += $InputObject.$($p) }
return ( $arr -join ':' )
}
function Merge-Objects{
param(
[Parameter(Mandatory=$true)]
[object[]]$Dataset1,
[Parameter(Mandatory=$true)]
[object[]]$Dataset2,
[Parameter()]
[string[]]$Properties
)
$results = [System.Collections.ArrayList]::new()
$ds1props = $Dataset1 | gm -MemberType Properties
$ds2props = $Dataset2 | gm -MemberType Properties
$ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
$ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }
$hash = @{}
$Dataset2 | % { $hash.Add( (Get-Hash $_ $Properties), $_) }
foreach ($row in $dataset1) {
$key = Get-Hash $row $Properties
$tempObject = $row.PSObject.Copy()
if ($hash.containskey($key)) {
$r2 = $hash[$key]
$hash.remove($key)
$ds2propsNotInDs1Props | % {
$tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $r2.$($_.Name)
}
} else {
$ds2propsNotInDs1Props | % {
$tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
}
}
[void]$results.Add($tempObject)
}
foreach ($row in $hash.values ) {
# add missing dataset2 objects and extend
$tempObject = $row.PSObject.Copy()
$ds1propsNotInDs2Props | % {
$tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
}
[void]$results.Add($tempObject)
}
$results
}
########
$dsLength = 10000
$dataset1 = 0..$dsLength | %{
New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
}
$dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
}
Measure-Command -Expression {
$data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B"
}
I have had a lot of doubts about incorporating a binary search (or a hash table) into my Join-Object cmdlet (see also: In Powershell, what's the best way to join two tables into one?), as there are a few issues to overcome that are conveniently left out of the example in the question.
Unfortunately, I can't compete with the performance of @mhhollomon's solution:
dsLength Steve1 Steve2 mhhollomon Join-Object
-------- ------ ------ ---------- -----------
10 19 129 21 50
100 145 915 158 329
1000 2936 9646 1575 3355
5000 56129 69558 5814 12653
10000 183813 95472 14740 25730
20000 761450 265061 36822 80644
But I think that I can add some value:
Not right
Hash keys are strings, which means that you need to cast the related key properties to strings, which is a little questionable, simply because:
$Left -eq $Right ≠ "$Left" -eq "$Right"
In most cases it will work, especially when the source is a .csv file, but it might go wrong, e.g. when the data comes from a cmdlet where $Null means something other than an empty string (''). Therefore, I recommend explicitly encoding $Null keys, e.g. with a control character.
And as property values could easily contain a colon (:), I would also recommend using a control character for separating (joining) multiple keys.
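A minimal sketch of such a key builder (the function name, the sentinel and the separator characters are arbitrary illustrative choices, not what Join-Object actually uses; note e.g. that $null -eq '' is $false, but "$null" -eq '' is $true once cast to string):
function Get-JoinKey {
    param([psobject]$InputObject, [string[]]$Properties)
    $separator = [char]0x1F   # unit separator: unlikely to occur inside property values
    $nullToken = [char]0x00   # sentinel so that $Null is not confused with an empty string
    ($Properties | ForEach-Object {
        $value = $InputObject.$_
        if ($null -eq $value) { $nullToken } else { "$value" }
    }) -join $separator
}
With this, Get-JoinKey $row ('A','B') produces a key that keeps the key values apart even if one of them contains a colon or is $Null.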
Also right
There is another pitfall in using a hash table which actually doesn't have to be an issue: what if the left side ($dataset1) and/or the right side ($dataset2) has multiple matches? Take e.g. the following data sets:
$dataset1 = ConvertFrom-SourceTable '
A B XY ZY
- - -- --
1 val1 foo1 bar1
2 val2 foo2 bar2
3 val3 foo3 bar3
4 val4 foo4 bar4
4 val4 foo4a bar4a
5 val5 foo5 bar5
6 val6 foo6 bar6
'
$dataset2 = ConvertFrom-SourceTable '
A B ABC GH
- - --- --
3 val3 foo3 bar3
4 val4 foo4 bar4
5 val5 foo5 bar5
5 val5 foo5a bar5a
6 val6 foo6 bar6
7 val7 foo7 bar7
8 val8 foo8 bar8
'
In this case, I would expect an outcome similar to the SQL join, and no "Item has already been added. Key in dictionary" error:
$Dataset1 | FullJoin $dataset2 -On A, B | Format-Table
A B XY ZY ABC GH
- - -- -- --- --
1 val1 foo1 bar1
2 val2 foo2 bar2
3 val3 foo3 bar3 foo3 bar3
4 val4 foo4 bar4 foo4 bar4
4 val4 foo4a bar4a foo4 bar4
5 val5 foo5 bar5 foo5 bar5
5 val5 foo5 bar5 foo5a bar5a
6 val6 foo6 bar6 foo6 bar6
7 val7 foo7 bar7
8 val8 foo8 bar8
Only Right
As you might have figured out, there is no reason to put both sides in a hash table; you might instead consider streaming the left side (rather than choking the input). In the example in the question, both datasets are loaded directly into memory, which is hardly a typical use case. It is more common that your data comes from somewhere else, e.g. remotely from Active Directory, where you might be able to search each incoming object in the hash table while the next object is still coming in. The same goes for the following cmdlet in the pipeline: it might directly start processing the output and doesn't have to wait until your cmdlet is finished (note that the data is immediately released from the Join-Object cmdlet when it is ready). In such a case, measuring the performance using Measure-Command requires a completely different approach...
See also: Computer Programming: Is the PowerShell pipeline sequential mode more memory efficient? Why or why not?
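A minimal sketch of that streaming idea, assuming the right side fits in memory, only left-side rows are emitted, and duplicate right-side keys are not handled (the function name and parameters are illustrative; this is not the actual Join-Object implementation):
function Merge-Streaming {
    param(
        [Parameter(ValueFromPipeline = $true)][psobject]$LeftRow,
        [Parameter(Mandatory = $true)][object[]]$Right,
        [Parameter(Mandatory = $true)][string[]]$On
    )
    begin {
        $sep = [char]0x1F                                   # control character separator, as recommended above
        $index = @{}
        foreach ($r in $Right) {
            $index[(($On | ForEach-Object { "$($r.$_)" }) -join $sep)] = $r
        }
    }
    process {
        $key = ($On | ForEach-Object { "$($LeftRow.$_)" }) -join $sep
        $out = $LeftRow.PSObject.Copy()
        if ($index.ContainsKey($key)) {
            foreach ($p in $index[$key].PSObject.Properties) {
                if (-not $out.PSObject.Properties[$p.Name]) {   # add only the right-side-only columns
                    Add-Member -InputObject $out -MemberType NoteProperty -Name $p.Name -Value $p.Value
                }
            }
        }
        $out                                                # emitted immediately, so downstream cmdlets can start working
    }
}
# usage: $dataset1 | Merge-Streaming -Right $dataset2 -On A, B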
Related
Hello, I am working on a PowerShell script that checks on remote PCs whether any recent update has failed and generates a txt file to send via email as a daily report. Here is the script:
$machines = Get-Content "Q:\Users\E\servers2.txt"
$outputfile = 'Q:\Users\E\Processes.dat'
foreach ( $computer in $machines){
Invoke-Command -ComputerName $computer -ScriptBlock {
function Convert-WuaResultCodeToName
{
param( [Parameter(Mandatory=$true)][int] $ResultCode)
$Result = $ResultCode
switch($ResultCode)
{
2 {$Result = "Succeeded"}
3 {$Result = "Succeeded With Errors"}
4 {$Result = "Failed"}
}
return $Result}
function Get-WuaHistory
{
$session = (New-Object -ComObject 'Microsoft.Update.Session')
$history = $session.QueryHistory("",0,22) | ForEach-Object {
$Result = Convert-WuaResultCodeToName -ResultCode $_.ResultCode
$_ | Add-Member -MemberType NoteProperty -Value $Result -Name Result
$Product = $_.Categories | Where-Object {$_.Type -eq 'Product'} | Select-Object -First 1 -ExpandProperty Name
$_ | Add-Member -MemberType NoteProperty -Value $_.UpdateIdentity.UpdateId -Name UpdateId
$_ | Add-Member -MemberType NoteProperty -Value $_.UpdateIdentity.RevisionNumber -Name RevisionNumber
$_ | Add-Member -MemberType NoteProperty -Value $Product -Name Product -PassThru
Write-Output $_
}
$history | Where-Object {$_.title} | Select-Object @{Name='psComputerName';Expression={$env:COMPUTERNAME}},Result,Date,Title,Product
}
$fileout = @()
$date = Get-Date -format dd.MM.yy
$time = Get-Date -format HH:mm
$compa = (Get-Date).AddHours(-12)
$result = Get-WuaHistory |Select-Object -Unique -Property PSComputerName,Result,Date,Title
foreach ($item in $result){
if ($item.Result -like 'Failed'-and $item.Date -gt $compa){
$outputLine = $item.PSComputerName + "|" + $date + " " + $time + "|Recent_update_fail|NONE|WARNING|" + $item.Title +" on " + $item.Date + " Status: " + $item.Result
$fileout += $outputLine
}
else {
$outputLine = $item.PSComputerName + "|" + $date + " " + $time + "|Recent_update_fail|NONE|OK|" + $item.Title +" on " + $item.Date + " Status: " + $item.Result
$fileout += $outputLine
}
}
}
Clear-Content "$outputfile"
Start-sleep 1
Add-Content "$outputfile" $fileout
$stream = [IO.File]::OpenWrite($outputfile)
$stream.Close()
$stream.Dispose()
}
So far the script works: it gets all the failed updates and generates a string. But the problem is that inside the string the PC name is missing; $item.PSComputerName should return the name of the remote computer. I tried different combinations and casting $result to a string, but the computer name is still empty.
Thanks in advance for your time
Edit 1: Following the solution from @Toni, the script now works and picks up the server names, but now I am struggling to output the file. A test output on the console looks fine, and the output statement currently generates a file, but unfortunately it is an empty file. How can I resolve this?
The problem is that you try to access the property psComputerName on the elements stored in the array $history. But those elements do not have a property called psComputerName; the available properties are:
$history[0] | gm | ?{$_.membertype -match 'property'}
Name MemberType Definition
---- ---------- ----------
Product NoteProperty System.String Product=Microsoft Defender Antivirus
Result NoteProperty object Result=null
RevisionNumber NoteProperty int RevisionNumber=200
UpdateId NoteProperty string UpdateId=df8b6e72-7b5a-4b4a-ac52-3562fe306501
Categories Property ICategoryCollection Categories () {get}
ClientApplicationID Property string ClientApplicationID () {get}
Date Property Date Date () {get}
Description Property string Description () {get}
HResult Property int HResult () {get}
Operation Property tagUpdateOperation Operation () {get}
ResultCode Property OperationResultCode ResultCode () {get}
ServerSelection Property ServerSelection ServerSelection () {get}
ServiceID Property string ServiceID () {get}
SupportUrl Property string SupportUrl () {get}
Title Property string Title () {get}
UninstallationNotes Property string UninstallationNotes () {get}
UninstallationSteps Property IStringCollection UninstallationSteps () {get}
UnmappedResultCode Property int UnmappedResultCode () {get}
UpdateIdentity Property IUpdateIdentity UpdateIdentity () {get}
To include the computer name you can use a calculated property:
$history | Where-Object {$_.title} | Select-Object @{Name='psComputerName';Expression={$env:COMPUTERNAME}},Result,Date,Title,Product
This code returns unique and shared lines between two files. Unfortunately, it runs forever if the files have 1 million lines. Is there a faster way to do this (e.g., -eq, -match, wildcard, Compare-Object), or are containment operators the optimal approach?
$afile = Get-Content (Read-Host "Enter 'A' file")
$bfile = Get-Content (Read-Host "Enter 'B' file")
$afile |
? { $bfile -notcontains $_ } |
Set-Content lines_ONLY_in_A.txt
$bfile |
? { $afile -notcontains $_ } |
Set-Content lines_ONLY_in_B.txt
$afile |
? { $bfile -contains $_ } |
Set-Content lines_in_BOTH_A_and_B.txt
As mentioned in my answer to a previous question of yours, -contains is a slow operation, particularly with large arrays.
For exact matches you could use Compare-Object and discriminate the output by side indicator:
Compare-Object $afile $bfile -IncludeEqual | ForEach-Object {
switch ($_.SideIndicator) {
'<=' { $_.InputObject | Add-Content 'lines_ONLY_in_A.txt' }
'=>' { $_.InputObject | Add-Content 'lines_ONLY_in_B.txt' }
'==' { $_.InputObject | Add-Content 'lines_in_BOTH_A_and_B.txt' }
}
}
If that's still too slow, try reading each file into a hashtable (shown here for file A; build $bhash from file B the same way):
$afile = Get-Content (Read-Host "Enter 'A' file")
$ahash = @{}
$afile | ForEach-Object {
$ahash[$_] = $true
}
and process the files like this:
$afile | Where-Object {
-not $bhash.ContainsKey($_)
} | Set-Content 'lines_ONLY_in_A.txt'
If that still doesn't help, you need to identify the bottleneck (reading the files, comparing the data, doing multiple comparisons, ...) and proceed from there.
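For example, each phase can be timed separately with Measure-Command (the file names here are placeholders):
$tRead    = Measure-Command { $afile = Get-Content 'A.txt'; $bfile = Get-Content 'B.txt' }
$tIndex   = Measure-Command { $bhash = @{}; foreach ($line in $bfile) { $bhash[$line] = $true } }
$tCompare = Measure-Command { $onlyA = $afile | Where-Object { -not $bhash.ContainsKey($_) } }
"read: {0:N0} ms, index: {1:N0} ms, compare: {2:N0} ms" -f $tRead.TotalMilliseconds, $tIndex.TotalMilliseconds, $tCompare.TotalMilliseconds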
try this:
$All = @()
$All += Get-Content "c:\temp\a.txt" | %{[pscustomobject]@{Row=$_;File="A"}}
$All += Get-Content "c:\temp\b.txt" | %{[pscustomobject]@{Row=$_;File="B"}}
$All | group row | %{
$InA=$_.Group.File.Contains("A")
$InB=$_.Group.File.Contains("B")
if ($InA -and $InB)
{
$_.Group.Row | select -unique | Out-File c:\temp\lines_in_A_And_B.txt -Append
}
elseif ($InA)
{
$_.Group.Row | select -unique | Out-File c:\temp\lines_Only_A.txt -Append
}
else
{
$_.Group.Row | select -unique | Out-File c:\temp\lines_Only_B.txt -Append
}
}
Full code for the best option (@ansgar-wiechers). A unique, B unique, and A,B shared lines:
$afile = Get-Content (Read-Host "Enter 'A' file")
$ahash = @{}
$afile | ForEach-Object {
$ahash[$_] = $true
}
$bfile = Get-Content (Read-Host "Enter 'B' file")
$bhash = @{}
$bfile | ForEach-Object {
$bhash[$_] = $true
}
$afile | Where-Object {
-not $bhash.ContainsKey($_)
} | Set-Content 'lines_ONLY_in_A.txt'
$bfile | Where-Object {
-not $ahash.ContainsKey($_)
} | Set-Content 'lines_ONLY_in_B.txt'
$afile | Where-Object {
$bhash.ContainsKey($_)
} | Set-Content 'lines_in_BOTH_A_and_B.txt'
Considering my suggestion to do a binary search, I have created a reusable Search-SortedArray function for this:
Description
The Search-SortedArray function (alias Search) performs a binary search for a string in a sorted array. If the string is found, the index of the string in the array is returned; otherwise, $Null is returned.
Function Search-SortedArray ([String[]]$SortedArray, [String]$Find, [Switch]$CaseSensitive) {
$l = 0; $r = $SortedArray.Count - 1
While ($l -le $r) {
$m = [int](($l + $r) / 2)
Switch ([String]::Compare($find, $SortedArray[$m], !$CaseSensitive)) {
-1 {$r = $m - 1}
1 {$l = $m + 1}
Default {Return $m}
}
}
}; Set-Alias Search Search-SortedArray
$afile |
? {(Search $bfile $_) -eq $Null} |
Set-Content lines_ONLY_in_A.txt
$bfile |
? {(Search $afile $_) -eq $Null} |
Set-Content lines_ONLY_in_B.txt
$afile |
? {(Search $bfile $_) -ne $Null} |
Set-Content lines_in_BOTH_A_and_B.txt
Note 1: Due to the overhead, a binary search will only give advantage with (very) large arrays.
Note 2: The array has to be sorted otherwise the result will be unpredictable.
Note 3: The search doesn't account for duplicates. In case of duplicate values, just one index will be returned (which isn't a concern for this specific question).
Added 2017-11-07 based on the comment from @Ansgar Wiechers:
Quick benchmark with 2 files with a couple thousand lines each (including duplicate lines): binary search: 2400ms; compare-object: 1850ms; hashtable lookup: 250ms
The idea is that the binary search will show its advantage in the long run: the larger the arrays, the more it gains in performance proportionally.
Taking $afile | ? { $bfile -notcontains $_ } as an example, using the performance measurements in the comment and assuming that "a couple thousand lines" means 3000 lines:
For a standard search, you will need an average of 1500 iterations in the $bfile:*1
(3000 + 1) / 2 = 3001 / 2 = 1500
For a binary search, you will need an average of 6.27 iterations in the $bfile:
(log2 3000 + 1) / 2 = (11.55 + 1) / 2 = 6.27
In both situations you do this 3000 times (for each item in $afile)
This means that each single iteration takes:
For a standard search: 250ms / 1500 / 3000 = 56 nanoseconds
For a binary search: 2400ms / 6.27 / 3000 = 127482 nanoseconds
The breakeven point will be at about:
56 * ((x + 1) / 2 * 3000) = 127482 * ((log2 x + 1) / 2 * 3000)
Which is (according to my calculations) at about 40000 entries.
*1 presuming that a hashtable lookup doesn’t do a binary search itself as it is unaware that the array is sorted
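A quick numeric check of that breakeven estimate, taking the 56 ns and 127482 ns per-iteration costs above as given:
$x = 1000
while ((127482 * (([math]::Log($x, 2) + 1) / 2)) -gt (56 * (($x + 1) / 2))) { $x += 1000 }
"breakeven at roughly $x entries"   # compare with the ~40000 estimate above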
Added 2017-11-07
Conclusion from the comments: hash tables appear to use similar associative-array algorithms that can't be outperformed with low-level programming commands.
This may be a philosophical question but I would like to know how the following 2 items differ, from a speed and efficiency perspective. In PowerShell I have 2 objects that look like this:
$ObjectA = @()
1..10 | foreach-object{
$obj = New-Object System.Object
$obj | Add-Member -Type NoteProperty -Name index -Value $_
$ObjectA += $obj
}
$ObjectB = @()
5..15 | foreach-object{
$obj = New-Object System.Object
$obj | Add-Member -Type NoteProperty -Name index -Value $_
$ObjectB += $obj
}
Now, I want to get the objects that exist in both. I can do it 1 of 2 ways.
Solution 1:
$ObjectA | foreach-object{
$ind = $_
$matching = $ObjectB | where {$_ -eq $ind}
if (![string]::IsNullOrEmpty($matching)){
##do stuff with the match
}
}
Solution 2:
$matches = Compare-Object $ObjectA $ObjectB -Property index -IncludeEqual -PassThru | where {$_.SideIndicator -eq '=='}
$matches | foreach-object {
##do stuff with the matches.
}
My question is, when my array of objects gets very large (30K+) which one is going to be a better solution from a performance perspective? I don't know how the Compare-Object cmdlet works internally so I really don't know. Or does it not matter?
Thanks in advance.
As @Knows Not Much has pointed out, Compare-Object usually offers better performance than iterating the collection and comparing objects yourself. But the other answer fails to use the -ExcludeDifferent parameter and instead iterates over the Compare-Object output. This means doing many useless string comparisons for the SideIndicator property. For optimal performance, and simpler code, just use -IncludeEqual and -ExcludeDifferent:
$ObjectA = @()
1..10000 | %{
$obj = New-Object System.Object
$obj | Add-Member -Type NoteProperty -Name index -Value $_
$ObjectA += $obj
}
$ObjectB = @()
1000..7000 | %{
$obj = New-Object System.Object
$obj | Add-Member -Type NoteProperty -Name index -Value $_
$ObjectB += $obj
}
# Iterating over the result of Compare-Object takes 2.6 seconds.
Measure-Command { $matches_where_eq = Compare-Object $ObjectA $ObjectB -Property index -IncludeEqual | where {$_.SideIndicator -eq '=='} ; echo $matches_where_eq.count }
# Using -IncludeEqual and -ExcludeDifferent takes 2.1 seconds (80% of previous).
Measure-Command { $matches_ed_ie = Compare-Object $ObjectA $ObjectB -Property index -ExcludeDifferent -IncludeEqual; echo $matches_ed_ie.Count }
Even if you take a dataset of size 10000, you can easily see that Compare-Object is way, way faster.
I modified your code to make it work on PowerShell 3.0:
cls
$ObjectA = @()
1..10000 | %{
$obj = New-Object System.Object
$obj | Add-Member -Type NoteProperty -Name index -Value $_
$ObjectA += $obj
}
$ObjectB = @()
1000..7000 | %{
$obj = New-Object System.Object
$obj | Add-Member -Type NoteProperty -Name index -Value $_
$ObjectB += $obj
}
Measure-Command {
$count = 0
$matches = Compare-Object $ObjectA $ObjectB -Property index -IncludeEqual | where {$_.SideIndicator -eq '=='}
}
echo $matches.length
echo $matches.Count
Measure-Command {
$count = 0
$ObjectA | %{
$ind = $_
$matching = $ObjectB | where {$_.Index -eq $ind.Index}
if (![string]::IsNullOrEmpty($matching)){
$count = $count + 1
}
}
echo $count
}
Compare-Object returns in less than 5 seconds, but the other approach just gets stuck forever.
This will build a regex search out of one array and then perform a regex match against the other array.
Solution 3:
[regex]$RegMatch = '(' + (($ObjectA |foreach {[regex]::escape($_)}) -join "|") + ')'
$ObjectB -match $RegMatch
You might want to add some logic to build the regex out of the smaller set of data and then run the larger set against it to speed things up, but I'm pretty sure this would be fastest.
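A sketch of that refinement (variable names as in Solution 3 above; note that without anchors -match does substring matching, so exact-line comparison would need ^...$ anchors):
# pick the smaller array to build the pattern from, then match the larger one against it
if ($ObjectA.Count -le $ObjectB.Count) { $small = $ObjectA; $large = $ObjectB }
else { $small = $ObjectB; $large = $ObjectA }
[regex]$RegMatch = '(' + (($small | ForEach-Object { [regex]::Escape($_) }) -join '|') + ')'
$large -match $RegMatch   # returns the items of the larger set that match the pattern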
Is it possible to sort the output of the Format-List cmdlet by property name?
Suppose that I have an object $x with two properties "A" and "B", and when I run Format-List with it I get
(PS) > $x | Format-List
B : value b
A : value a
I would like to have
(PS) > $x | Format-List
A : value a
B : value b
NOTE: I should have specified from the beginning that, unlike in the example with "A" and "B" properties, the real object I have to deal with has quite a lot of properties, and new ones could be added in the future, so I don't know all the property names in advance.
AFAIK, Format-List does not provide such an option.
For your particular example this should work:
$x | Select-Object A, B | Format-List
If the property set is not fixed/known then the procedure is trickier: use Get-Member and some preprocessing to build a sorted parameter array for Select-Object.
EDIT:
Here it is (let's use $host instead of $x):
$host | Select-Object ([string[]]($host | Get-Member -MemberType Property | %{ $_.Name } | Sort-Object)) | Format-List
Christopher is right, Select-Object is not absolutely needed:
$host | Format-List ([string[]]($host | Get-Member -MemberType Property | %{ $_.Name } | Sort-Object))
Nothing wrong with the accepted answer, but a really quick-and-dirty option for a one-off—that doesn't require having the collection already in a variable—might be...
... | Format-List | Out-String -Stream | Sort-Object
...which does a sort on each line of the output of Format-List.
Note that any property values that go onto the next line will be broken (and probably appear at the top of the output), but this could be fixed by the slightly-less-memorable...
... | Format-List | Out-String -Stream -Width ([Int32]::MaxValue) | Sort-Object
...at the expense of column indentation.
Of course, all object/pipeline info is lost by that Out-String call, although—considering the same is true of Format-List—you probably aren't going to care by that point.
Expanding on Christopher's idea, using get-member and format-list -Property:
$x | fl -property ($x| gm | sort name).name
The closest I can think of is to create a new psobject based off the old one but with the properties sorted e.g.:
$x | %{$obj = new-object psobject; `
$_.psobject.properties | Sort Name | `
%{Add-Member -Inp $obj NoteProperty $_.Name $_.Value}; $obj} | fl
You could get fancier and give the new psobject a typename that matches the old one, etc.
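For example, copying the original type name onto the new object would look something like this (a sketch, reusing $x from above; property values are copied as-is):
$sorted = New-Object psobject
$x.PSObject.Properties | Sort-Object Name | ForEach-Object {
    Add-Member -InputObject $sorted -MemberType NoteProperty -Name $_.Name -Value $_.Value
}
$sorted.PSObject.TypeNames.Insert(0, $x.PSObject.TypeNames[0])   # reuse the original type name
$sorted | Format-List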
If you are dealing with a small number of properties, you can specify their order with the -Property parameter.
Here is an example:
Format-List -Property Owner, Path
If you have a lot of properties, I am not sure there is any easy way to sort them in Format-List, like Roman said.
This seems to work OK (edited so it accepts pipeline input):
function Format-SortedList
{
param (
[Parameter(ValueFromPipeline = $true)]
[Object]$InputObject,
[Parameter(Mandatory = $false)]
[Switch]$Descending
)
process
{
$properties = $InputObject | Get-Member -MemberType Properties
if ($Descending) {
$properties = $properties | Sort-Object -Property Name -Descending
}
$longestName = 0
$longestValue = 0
$properties | ForEach-Object {
if ($_.Name.Length -gt $longestName) {
$longestName = $_.Name.Length
}
if ($InputObject."$($_.Name)".ToString().Length -gt $longestValue) {
$longestValue = $InputObject."$($_.Name)".ToString().Length * -1
}
}
Write-Host ([Environment]::NewLine)
$properties | ForEach-Object {
Write-Host ("{0,$longestName} : {1,$longestValue}" -f $_.Name, $InputObject."$($_.Name)".ToString())
}
}
}
$Host, $MyInvocation | Format-SortedList
$Host, $MyInvocation | Format-SortedList -Descending
I feel sure that you can achieve the desired output. I suggest that you experiment with both Sort-Object (or plain Sort) and also Group-Object (plain Group)
My idea is to place the sort, or group before | format-list
Thus $x | sort-object -property xyz | Format-List
By using Select-Object with a calculated property (@{}) and then excluding it (-ExcludeProperty) you can also order the properties as you want. This works even when you don't know what's coming upfront.
@(
[PSCustomObject]@{
Color = 'Green'
Type = 'Fruit'
Name = 'kiwi'
Flavour = 'Sweet'
}
) | Select-Object @{Name = 'Flavour'; Expression = { $_.Flavour } },
@{Name = 'Name'; Expression = { $_.Name } }, * -ExcludeProperty Name, Flavour |
Format-List
Output:
Flavour : Sweet
Name : kiwi
Color : Green
Type : Fruit
How can I get a du-ish analysis using PowerShell? I'd like to periodically check the size of directories on my disk.
The following gives me the size of each file in the current directory:
foreach ($o in gci)
{
Write-output $o.Length
}
But what I really want is the aggregate size of all files in the directory, including subdirectories. Also I'd like to be able to sort it by size, optionally.
There is an implementation available at the "Exploring Beautiful Languages" blog:
"An implementation of 'du -s *' in Powershell"
function directory-summary($dir=".") {
get-childitem $dir |
% { $f = $_ ;
get-childitem -r $_.FullName |
measure-object -property length -sum |
select @{Name="Name";Expression={$f}},Sum}
}
(Code by the blog owner: Luis Diego Fallas)
Output:
PS C:\Python25> directory-summary
Name Sum
---- ---
DLLs 4794012
Doc 4160038
include 382592
Lib 13752327
libs 948600
tcl 3248808
Tools 547784
LICENSE.txt 13817
NEWS.txt 88573
python.exe 24064
pythonw.exe 24576
README.txt 56691
w9xpopen.exe 4608
I modified the command in the answer slightly to sort descending by size and include size in MB:
gci . |
%{$f=$_; gci -r $_.FullName |
measure-object -property length -sum |
select @{Name="Name"; Expression={$f}},
@{Name="Sum (MB)";
Expression={"{0:N3}" -f ($_.sum / 1MB) }}, Sum } |
sort Sum -desc |
format-table -Property Name,"Sum (MB)", Sum -autosize
Output:
PS C:\scripts> du
Name Sum (MB) Sum
---- -------- ---
results 101.297 106217913
SysinternalsSuite 56.081 58805079
ALUC 25.473 26710018
dir 11.812 12385690
dir2 3.168 3322298
Maybe it is not the most efficient method, but it works.
If you only need the total size of that path, one simplified version can be,
Get-ChildItem -Recurse ${HERE_YOUR_PATH} | Measure-Object -Sum Length
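For example, to report that total in MB (the path is just a placeholder):
$total = (Get-ChildItem -Recurse 'C:\Temp' | Measure-Object -Sum Length).Sum
'{0:N2} MB' -f ($total / 1MB)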
function Get-DiskUsage ([string]$path=".") {
$groupedList = Get-ChildItem -Recurse -File $path | Group-Object directoryName | select name,@{name='length'; expression={($_.group | Measure-Object -sum length).sum } }
foreach ($dn in $groupedList) {
New-Object psobject -Property @{ directoryName=$dn.name; length=($groupedList | where { $_.name -like "$($dn.name)*" } | Measure-Object -Sum length).sum }
}
}
Mine is a bit different; I group all of the files on directoryname, then walk through that list building totals for each directory (to include the subdirectories).
Building on previous answers, this will work for those that want to show sizes in KB, MB, GB, etc., and still be able to sort by size. To change units, just change "MB" to desired units in both "Name=" and "Expression=". You can also change the number of decimal places to show (rounding), by changing the "2".
function du($path=".") {
Get-ChildItem $path |
ForEach-Object {
$file = $_
Get-ChildItem -File -Recurse $_.FullName | Measure-Object -Property length -Sum |
Select-Object -Property @{Name="Name";Expression={$file}},
@{Name="Size(MB)";Expression={[math]::round(($_.Sum / 1MB),2)}} # round 2 decimal places
}
}
This gives the size as a number not a string (as seen in another answer), therefore one can sort by size. For example:
PS C:\Users\merce> du | Sort-Object -Property "Size(MB)" -Descending
Name Size(MB)
---- --------
OneDrive 30944.04
Downloads 401.7
Desktop 335.07
.vscode 301.02
Intel 6.62
Pictures 6.36
Music 0.06
Favorites 0.02
.ssh 0.01
Searches 0
Links 0
My own take using the previous answers:
function Format-FileSize([int64] $size) {
if ($size -lt 1024)
{
return $size
}
if ($size -lt 1Mb)
{
return "{0:0.0} Kb" -f ($size/1Kb)
}
if ($size -lt 1Gb)
{
return "{0:0.0} Mb" -f ($size/1Mb)
}
return "{0:0.0} Gb" -f ($size/1Gb)
}
function du {
param(
[System.String]
$Path=".",
[switch]
$SortBySize,
[switch]
$Summary
)
$Path = (Get-Item $Path).FullName
$groupedList = Get-ChildItem -Recurse -File $Path |
Group-Object directoryName |
select name,@{name='length'; expression={($_.group | Measure-Object -sum length).sum } }
$results = ($groupedList | % {
$dn = $_
if ($summary -and ($path -ne $dn.name)) {
return
}
$size = ($groupedList | where { $_.name -like "$($dn.name)*" } | Measure-Object -Sum length).sum
New-Object psobject -Property @{
Directory=$dn.name;
Size=Format-FileSize($size);
Bytes=$size
}
})
if ($SortBySize)
{ $results = $results | sort-object -property Bytes }
$results | more
}