PowerShell: sorting a really large collection of objects

I am trying to generate histograms from a very large collection of objects (-gt 250k). I need to sort the collection on a property of each object. My line of script looks like:
$ch = $ch | sort TotalCount -descending
where $ch[x].totalcount would be some integer.
The script works but it takes over an hour to sort and consumes 6GB of memory. How do I speed up the process?
I've done some searching for a solution and several web sites suggest using [array]::sort as it is much quicker. As this is a collection of objects, I'm not sure how I would use a static System.Array sort method. Even if I could, I don't see how to make the array descending (although reversing the result should be pretty simple).
Any suggestions on how to sort really large collections with powershell?

Let's create an array with 2500 elements. Each element of the array is an object containing the property totalCount and we assign an integer to it.
$array = @()
1..2500 | % {
    $array += New-Object pscustomobject -Property @{
        totalCount = $_;
    }
}
Now let's sort this array and measure the total time for executing the command.
We start with the classic Sort-Object using the -descending parameter:
(Measure-Command {
$array = $array | Sort-Object TotalCount -descending
}).TotalSeconds
Total time in seconds is : 0.1217965
Let's now use the method Reverse of the class System.Array : [Array]::Reverse()
(Measure-Command {
[Array]::Reverse([array]$array)
}).TotalSeconds
Total time in seconds is : 0.0002594
Quite a difference!
Let's now look at another possibility and create a System.Collections.ArrayList:
$array = New-Object System.Collections.ArrayList
1..2500 | % {
    $o = New-Object pscustomobject -Property @{
        totalCount = $_;
    }
    [Void] $array.Add($o)
}
And we rinse and repeat. We first use the Reverse method of the class System.Collections.ArrayList, then we pass the collection to the Reverse method of System.Array.
(Measure-Command {
$array.reverse()
}).TotalSeconds
Total time in seconds is : 0.0002459
Slight improvement, but pretty similar overall.
Now we typecast the system collection and use [Array]::Reverse()
(Measure-Command {
[Array]::Reverse([array]$array)
}).TotalSeconds
Total time in seconds is : 0.0008172
More than three times as long. This clearly shows it wasn't a good idea, so we scrap it.
Conclusion:
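A System.Array with [Array]::Reverse() is definitely faster than Sort-Object, but note that Reverse only yields a descending order here because the test data was created in ascending order; it does not sort. Also keep in mind that a System.Array has a fixed size (each += allocates a new array and copies every element), so if building the array is part of the performance issue, I'd definitely recommend using System.Collections.ArrayList, since it can grow in place.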

[array]::reverse() is NOT sorting the array in any way.
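As a language-neutral illustration of that comment (sketched in Python rather than PowerShell; the Entry class and the sort_descending helper are made up for this example), reversing a collection only produces a descending order when the input was already ascending, as it was in the benchmark above. On shuffled data you need an actual sort on the property:

```python
from dataclasses import dataclass
import random


@dataclass
class Entry:
    total_count: int


def sort_descending(entries):
    """Return a new list ordered by total_count, largest first."""
    return sorted(entries, key=lambda e: e.total_count, reverse=True)


random.seed(0)
entries = [Entry(n) for n in range(1, 2501)]
random.shuffle(entries)  # unlike the benchmark above, start from unsorted data

reversed_only = list(reversed(entries))     # just flips the current order
properly_sorted = sort_descending(entries)  # orders by the property

print([e.total_count for e in properly_sorted[:3]])  # [2500, 2499, 2498]
print(reversed_only == properly_sorted)              # reversing alone did not sort
```

The timings in the answer therefore compare a reversal with a sort, which are different operations.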

Related

I'm having problems making a Powershell Linq "Join" work [Explicit Argument Typing?]

I've been working on provisioning scripts between Oracle and Active Directory, and more specifically using Powershell scripts. I found an excellent resource on using Linq in Powershell (High Performance Powershell by Michael Sorens) but I'm having trouble with the JOIN method, and I think it may be related to how I'm trying to type my arguments. I have to admit I haven't fully grasped the example on the page (Cross-Join). I'll set up the problem and then show what I'm trying (that has failed so far).
I have a database query that returns users who should be in Active Directory, and I have a "Get-ADUser" command that gets every person who IS in Active Directory. I previously got the "Except" operator to work by reducing the number of properties in both to the ID (samaccountname). So, at that point I could derive everyone who needed to be added, as well as everyone who needed to be removed. But I was now reduced to a list of IDs (i.e. I no longer had the full complement of fields I would need... either to add the AD record OR to send a "you're about to be removed" email).
So, seeing the Join operator, I thought I'd re-join the remove list to the "get all users" AD result-set. But... I keep getting the error
Cannot find an overload for "Join" and the argument count: "5"
The following was an attempt to simplify the moving parts, so it's two AD query results rather than the original problem (shows same error though).
$ad_host="my.adserver.edu"
$left = Get-ADUser -Server $ad_host -Identity 'knownuser' -Properties sAMAccountName | select sAMAccountName
$right = Get-ADUser -Server $ad_host -Filter * -SearchBase "OU=KnownUsersOU,OU=Students,OU=Users-Students,DC=my,DC=domain,DC=edu" -Properties sAMAccountName, givenName, sn | select sAMAccountName, givenName, sn
$outerKeyDelegate = [Func[Microsoft.ActiveDirectory.Management.ADAccount,string]] { $args[0].sAMAccountName }
$innerKeyDelegate = [Func[Microsoft.ActiveDirectory.Management.ADAccount,string]] { $args[0].sAMAccountName }
#$resultDelegate = [Func[Microsoft.ActiveDirectory.Management.ADAccount,Microsoft.ActiveDirectory.Management.ADAccount,string,string]] {'{0}, {1}, {2}, {3}, {4}' -f $args[0].sAMAccountName, $args[1].givenName, $args[1].sn, $args[1].mail, $args[1].employeeID }
$resultDelegate = [Func[Microsoft.ActiveDirectory.Management.ADAccount,string,string]] {'{0}, {1}' -f $args[0].sAMAccountName, $args[1].sAMAccountName }
[Linq.Enumerable]::Join($left, $right, $outerKeyDelegate, $innerKeyDelegate, $resultDelegate) | foreach { Add-Content -Path to_delete.csv -Value $_ }
So, in this case, I'm trying to explicitly type my join properties as Microsoft.ActiveDirectory.Management.ADAccount objects... I actually originally was using "string" since, the actual join property was the samaccountname, and when I ran a "getType()" on that, it returned "String"... well, actually it was "Name: String, BaseType: System.Object".
At this point, what I know is outweighed by what I don't know :) I could do this EASILY by moving it all into a database to make the "list", but this seems like it'd be a lot more elegant if I could master Powershell-Linq!
I believe your problem is with types. Consider this command:
$left = Get-ADUser -Server $ad_host -Identity 'knownuser' -Properties sAMAccountName | select sAMAccountName
The type of the object returned by Get-ADUser will be ADUser.
And for this command:
$right = Get-ADUser -Server $ad_host -Filter * -SearchBase "OU=KnownUsersOU,OU=Students,OU=Users-Students,DC=my,DC=domain,DC=edu" -Properties sAMAccountName, givenName, sn | select sAMAccountName, givenName, sn
The type of the object will be Object[]. It needs to be ADUser[].
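You should be able to cast it like this (note that the trailing select is dropped, since Select-Object returns PSCustomObject instances, which cannot be cast to ADUser):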
$right = [Microsoft.ActiveDirectory.Management.ADUser[]](Get-ADUser -Server $ad_host -Filter * -SearchBase "OU=KnownUsersOU,OU=Students,OU=Users-Students,DC=my,DC=domain,DC=edu" -Properties sAMAccountName, givenName, sn)
Then, since you're dealing with ADUser objects, your key delegates must also match:
$outerKeyDelegate = [Func[Microsoft.ActiveDirectory.Management.ADUser,string]] { $args[0].sAMAccountName }
$innerKeyDelegate = [Func[Microsoft.ActiveDirectory.Management.ADUser,string]] { $args[0].sAMAccountName }
and your result delegate must also match to the type of objects you're working on (you were closer in your commented out code):
$resultDelegate = [Func[Microsoft.ActiveDirectory.Management.ADUser,Microsoft.ActiveDirectory.Management.ADUser,string]] {'{0}, {1}' -f $args[0].sAMAccountName, $args[1].sAMAccountName }
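For readers unsure what the five arguments to Join represent, here is a rough Python sketch of the same inner-join idea: two sequences, a key selector for each, and a function that builds one result per matching pair. The join helper and the sample records are hypothetical and only mimic the shape of Enumerable.Join; they are not the .NET implementation.

```python
from collections import defaultdict


def join(outer, inner, outer_key, inner_key, result):
    """Inner join: yield result(o, i) for every pair whose keys are equal."""
    lookup = defaultdict(list)
    for item in inner:  # index the inner sequence by key once
        lookup[inner_key(item)].append(item)
    for o in outer:
        for i in lookup.get(outer_key(o), []):
            yield result(o, i)


# Hypothetical stand-ins for the two Get-ADUser result sets.
to_remove = [{"sAMAccountName": "jdoe"}]
all_users = [
    {"sAMAccountName": "asmith", "givenName": "Ann", "sn": "Smith"},
    {"sAMAccountName": "jdoe", "givenName": "John", "sn": "Doe"},
]

rows = list(join(
    to_remove,
    all_users,
    lambda u: u["sAMAccountName"],  # outer key selector
    lambda u: u["sAMAccountName"],  # inner key selector
    lambda o, i: "{0}, {1}, {2}".format(o["sAMAccountName"], i["givenName"], i["sn"]),
))
print(rows)  # ['jdoe, John, Doe']
```

In the typed .NET version, each of the three functions has to be declared with the element types of the sequences it receives, which is why the delegate types in the answer must match the type of the objects being joined.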

Except doesn't work with Laravel Collections

I have the following code:
$object = Object::with("prototypes.fields")->findOrFail($id)->get();
$object_copied = $object->except(['id', 'prefix', 'prototypes']);
dd($object_copied->all());
The last line returns a collection that still contains the fields that should have been excluded: 'id', 'prefix', 'prototypes'
The first thing:
$object = Object::with("prototypes.fields")->findOrFail($id)->get();
This is probably wrong.
You should either use:
$object = Object::with("prototypes.fields")->findOrFail($id);
or
$object = Object::with("prototypes.fields")->get();
The second thing is what you actually want to achieve. The except method might not be the right tool here if you only want some of the columns. In that case a better option would be to use select when fetching the data from the database, or perhaps the map method.
If $object holds a collection of models, except removes whole items from the collection by key, and the keys of that collection are numeric (0, 1, ... x). So you would pass numeric keys here, for example to leave out the first model in the collection.
Try $object_copied = collect($object)->except('id', 'prefix', 'prototypes');

Limiting Eloquent chunks

I have a very large result set to process and so I'm using the chunk() method to reduce the memory footprint of the job. However, I only want to process a certain number of total results to prevent the job from running too long.
Currently I'm doing this, but it does not seem like an elegant solution:
$count = 0;
$max = 1000000;
$lists = Lists::whereReady(true);
$lists->chunk(1000, function (Collection $lists) use (&$count, $max) {
    if ($count >= $max)
        return;
    foreach ($lists as $list) {
        if ($count >= $max)
            break;
        $count++;
        // ...do stuff
    }
});
Is there a cleaner way to do this?
As of right now, I don't believe so.
There have been some issues and pull requests submitted to have chunk respect previously set skip/limits, but Taylor has closed them, saying it is expected behavior that chunk overwrites these.
There is currently an open issue in the laravel/internals repo where he said he'd take another look, but I don't think it is high on the priority list. I doubt it is something he would work on himself, but he may be more receptive to another pull request now.
Your solution looks fine, except for one thing. chunk() will end up reading your entire table, unless you return false from your closure. Currently, you are just returning null, so even though your "max" is set to 1000000, it will still read the entire table. If you return false from your closure when $count >= $max, chunk() will stop querying the database. It will cause chunk() to return false itself, but your example code doesn't care about the return of chunk() anyway, so that's okay.
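To make the difference between returning null and returning false concrete, here is a small Python simulation. The chunk function below is a toy stand-in that only copies the stop-on-false rule described above; it is not Eloquent's code, and stats, make_handler and the row counts are invented for the example.

```python
def chunk(rows, size, callback, stats):
    """Feed rows to callback in pages of `size`; stop if it returns False."""
    for start in range(0, len(rows), size):
        stats["pages"] += 1  # one simulated database query per page
        if callback(rows[start:start + size]) is False:
            return False
    return True


def make_handler(max_rows, processed, stop_early):
    def handle(page):
        for row in page:
            if len(processed) >= max_rows:
                # False stops chunk(); None (like PHP's null) lets it continue.
                return False if stop_early else None
            processed.append(row)
        return None
    return handle


rows = list(range(10_000))

done_a, stats_a = [], {"pages": 0}
chunk(rows, 1000, make_handler(2500, done_a, stop_early=False), stats_a)

done_b, stats_b = [], {"pages": 0}
chunk(rows, 1000, make_handler(2500, done_b, stop_early=True), stats_b)

print(len(done_a), stats_a["pages"])  # 2500 10
print(len(done_b), stats_b["pages"])  # 2500 3
```

Both runs process the same 2500 rows, but only the handler that returns False keeps the remaining pages from being read.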
Another option, assuming you're using sequential ids, would be to get the ending id and then add a where clause to your chunked query to get all the records with an id less than your max id. So, something like:
$max = 1000000;
$maxId = Lists::whereReady(true)->skip($max)->take(1)->value('id');
$lists = Lists::whereReady(true)->where('id', '<', $maxId);
$lists->chunk(1000, function (Collection $lists) {
foreach ($lists as $list) {
// ...do stuff
}
});
Code is slightly cleaner, but it is still a hack, and requires one extra query (to get the max id).

Rearranging active record elements in Yii

I am using a CDbCriteria with its own condition, with and order clauses. However, the order I want to give to the elements in the array is far too complex to specify in the order clause.
The solution I have in mind consists of obtaining the active records with the defined criteria like this
$theModelsINeed = MyModel::model()->findAll($criteria);
and then rearranging the order in my PHP code. How can I do this? I mean, I know how to iterate through its elements, but I don't know if it is possible to actually change them.
I have been looking into this link about populating active records, but it seems quite complicated and maybe someone could have some better advice.
Thanks
There is nothing special about Yii's active records. The find family of methods will return an array of objects, and you can sort this array like any other array in PHP.
If you have complex sort criteria, this means that probably the best tool for this is usort. Since you will be dealing with objects, your user-defined comparison functions will look something like this:
function compare($x, $y)
{
    // First sort criterion: $obj->Name
    if ($x->Name != $y->Name) {
        return $x->Name < $y->Name ? -1 : 1; // this is an ascending sort
    }
    // Second sort criterion: $obj->Age
    if ($x->Age != $y->Age) {
        return $x->Age < $y->Age ? 1 : -1; // this is a descending sort
    }
    // Add more criteria here
    return 0; // if we get this far, the items are equal
}
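The same two-criteria comparison, together with the step of actually applying it to a list, can be sketched in Python as follows. This is only an illustration of the comparator idea: the Person class and the sample data are invented, and functools.cmp_to_key plays the role that usort plays in PHP.

```python
from dataclasses import dataclass
from functools import cmp_to_key


@dataclass
class Person:
    name: str
    age: int


def compare(x, y):
    # First criterion: name, ascending.
    if x.name != y.name:
        return -1 if x.name < y.name else 1
    # Second criterion: age, descending.
    if x.age != y.age:
        return 1 if x.age < y.age else -1
    return 0  # the items are equal on every criterion


people = [Person("Bob", 30), Person("Alice", 25), Person("Alice", 40)]
people.sort(key=cmp_to_key(compare))  # sorts the list in place

print([(p.name, p.age) for p in people])
# [('Alice', 40), ('Alice', 25), ('Bob', 30)]
```

The objects themselves are untouched; only their position in the list changes, which is all that reordering the fetched records requires.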
If you do want to get an array as a result, you can use this method for fetching data that supports dbCriteria:
$model = MyModel::model()->myScope();
$model->dbCriteria->condition .= " AND date BETWEEN :d1 AND :d2";
$model->dbCriteria->order = 'field1 ASC, field2 DESC';
$model->dbCriteria->params = array(':d1'=>$d1, ':d2'=>$d2);
$theModelsINeed = $model->getCommandBuilder()
->createFindCommand($model->tableSchema, $model->dbCriteria)
->queryAll();
The above example shows using a defined scope and modifying the condition with named parameters.
If you don't need Active Record, you could also look into Query Builder, but the above method has worked pretty well for me when I want to use AR but need an array for my result.

Does this cause a MongoDB performance issue (when doing the `limit` on the client-side by 'breaking' the `cursor`)?

Though this has nothing to do with PHP specifically, I use PHP in the following examples.
Let's say this is the 'normal' way of limiting results.
$db->users->find()->limit(10);
This is probably the fastest way, but there are some restrictions here... In the following example, I'll filter out all rows that have the same value for a certain column as the previous row:
$cursor = $db->users->find();
$prev = null;
$results = array();
foreach ($cursor as $row) {
    if ($row['coll'] != $prev['coll']) {
        $results[] = $row;
        $prev = $row;
    }
}
But you still want to limit the results to 10, of course. So you could use the following:
$cursor = $db->users->find();
$prev = null;
$results = array();
foreach ($cursor as $row) {
    if ($row['coll'] != $prev['coll']) {
        $results[] = $row;
        if (count($results) == 10) break;
        $prev = $row;
    }
}
Explanation: since the $cursor does not actually load the results from the database, breaking the foreach-loop will limit it just as the limit(...)-function does.
Just to be sure: does this really work the way I describe, or are there any performance issues I'm not aware of?
Thank you very much,
Tim
"Explanation: since the $cursor does not actually load the results from the database, breaking the foreach-loop will limit it just as the limit(...)-function does."
This is not 100% true.
When you do the foreach, you're basically issuing a series of hasNext / getNext that is looping through the data.
However, underneath this layer, the driver is actually requesting and receiving batches of results. When you do a getNext the driver will seamlessly fetch the next batch for you.
You can control the batch size. The details in the documentation should help clarify what's happening.
In your second example, if you get to 10 and then break there are two side effects:
The cursor remains open on the server (times out in 10 minutes, generally not a big impact).
You may have more data cached in $cursor. This cache will go away when $cursor goes out of scope.
In most cases, these side effects are "not a big deal". But if you're doing lots of this processing in a single process, you'll want to "clean up" to avoid having cursors hanging around.
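The batching behaviour can be pictured with a toy model in Python. BatchedCursor below is not the MongoDB driver; it is a made-up class that hands out documents one at a time while pulling them from a pretend server in batches, and the batch size of 100 is an arbitrary assumption.

```python
class BatchedCursor:
    """Toy cursor that pulls documents from a 'server' one batch at a time."""

    def __init__(self, docs, batch_size):
        self._docs = docs
        self._batch_size = batch_size
        self.fetched = 0  # documents transferred so far

    def __iter__(self):
        while self.fetched < len(self._docs):
            batch = self._docs[self.fetched:self.fetched + self._batch_size]
            self.fetched += len(batch)  # the whole batch crosses the wire
            yield from batch


# 1000 made-up documents in which every 'coll' value appears twice in a row.
docs = [{"coll": n // 2} for n in range(1000)]
cursor = BatchedCursor(docs, batch_size=100)

results = []
prev = None
for row in cursor:
    if row["coll"] != prev:
        results.append(row)
        if len(results) == 10:
            break  # stop iterating on the client side
        prev = row["coll"]

print(len(results), cursor.fetched)  # 10 100
```

Breaking out of the loop stops further batches from being requested, but the batch already in flight has been transferred in full, which is the extra cached data mentioned above.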
