Text file processing using PowerShell - performance issue

I am using the PowerShell script below to read and process a 17 MB text file. The input file contains around 200,000 rows and 12 columns. Currently the script takes almost an hour to process the file. How can I optimize the processing time?
Script:
$fields = Get-Content Temp.txt
$results = @()
foreach($i in $fields)
{
$field = $i -split '\t' -replace '^\s*|\s*$'
$field1 = $field[0]
$field2 = $field[1]
$field3 = $field[2]
$field4 = $field[3]
$field5 = $field[4]
$field6 = $field[5]
$field7 = $field[6]
$field8 = $field[7]
$field9 = $field[8]
$field10 = $field[9]
$field11 = $field[10]
$field12 = $field[11]
if ($field1 -eq "4803" -and $field[2].substring(0,2) -eq "60")
{
$field2 = "5000000"
}
else
{
$field2 = $field[1]
}
$details = @{
Column1 = $field1
Column2 = $field2
Column3 = $field3
Column4 = $field4
Column5 = $field5
Column6 = $field6
Column7 = $field7
Column8 = $field8
Column9 = $field9
Column10 = $field10
Column11 = $field11
Column12 = $field12
}
$results += New-Object PSObject -Property $details
}
$results | ForEach-Object { '{0} {1} ... {11}' -f $_.Column1,$_.Column2,...,$_.Column12 } | Set-Content -Path Temp.txt
[Environment]::Exit(0)

Unless I'm missing something here, the goal is to take in tab-delimited data, modify one field based on another, and then output it as CSV data, correct? If so, this one-liner should execute MUCH faster.
Import-Csv test.txt -Header @(1..12) -Delimiter "`t" | % { if (($_.1 -eq "4803") -and ($_.3.Substring(0,2) -eq "60")) { $_.2 = "5000000" }; $_ } | Export-Csv test2.csv -NoTypeInformation
It avoids all the weird string parsing and gets around the biggest problem which is
$results += New-Object PSObject -Property $details
That line copies your entire results array into a new array for every line of the input file, which is terrible for performance. The rest of the changes just make things slightly faster.
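If you want to keep the line-by-line parsing, you can also avoid the += cost by collecting rows in a List instead of an array. A minimal sketch of that idea (only three of the twelve columns shown, same Temp.txt input assumed):
$lines = Get-Content Temp.txt
$results = [System.Collections.Generic.List[object]]::new()
foreach ($line in $lines) {
    $field = $line -split '\t' -replace '^\s*|\s*$'
    if ($field[0] -eq "4803" -and $field[2].Substring(0,2) -eq "60") {
        $field[1] = "5000000"
    }
    # Add() appends in place; += would rebuild the whole array every time
    $results.Add([pscustomobject]@{ Column1 = $field[0]; Column2 = $field[1]; Column3 = $field[2] })
}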

If this were me, I would start to think about not using Get-Content if your files are going to get much bigger. Memory consumption will become an issue, because Get-Content pulls everything into memory, and that won't scale well for really big files. And remember it will use more memory than the size of the file, because it has to represent each line as an object (which is still smaller than an XML DOM, but it still takes memory).
So first of all, you could loop through the input file using a StreamReader; I have an example here: https://stackoverflow.com/a/32337282/380016
You can also write your output file with a StreamWriter, instead of accumulating a big array like you are, only to loop through it and write it to a file at the end.
In the while loop of my example you can still split the string as you are now, do your manipulations, and then write the line out. There is no need to accumulate everything and wait to write it all at the end.
This approach should be faster and should use hardly any memory.
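A minimal sketch of that StreamReader/StreamWriter pattern applied to the transformation in the question (the paths are placeholders, and only the relevant columns are touched):
$reader = [System.IO.StreamReader]::new("C:\temp\Temp.txt")
$writer = [System.IO.StreamWriter]::new("C:\temp\Temp_out.txt")
try {
    while ($null -ne ($line = $reader.ReadLine())) {
        $field = $line -split '\t' -replace '^\s*|\s*$'
        if ($field[0] -eq "4803" -and $field[2].Substring(0,2) -eq "60") {
            $field[1] = "5000000"
        }
        # Write each transformed line immediately instead of accumulating an array
        $writer.WriteLine($field -join ' ')
    }
}
finally {
    $reader.Close()
    $writer.Close()
}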

Related

PowerShell: compare 2 large CSV files to find users that don't exist in one of them

I have two CSV files with ~10,000 users each. I need to count how many users appear in csv1 and not in csv2. At the moment I have the code below. However I'm aware that this is probably extremely inefficient, as it is potentially looping through up to 10,000 users 10,000 times. The code takes forever to run and I'm sure there must be a more efficient way. Any help or suggestions are appreciated; I am fairly new to PowerShell.
foreach ($csv1User in $csv1) {
$found = $false
foreach ($csv2User in $csv2) {
if ($csv1User.identifier -eq $csv2User.identifier)
{
$found = $true
break
}
}
if ($found -ne $true){
$count++
}
}
If you replace your nested loops with HashSets, you have a few ways of calculating the difference between the two:
Using SymmetricExceptWith()
The HashSet<T>.SymmetricExceptWith() function allows us to calculate the subset of terms that exist in either collection but not in both:
# Create hashset from one list
$userIDs = [System.Collections.Generic.HashSet[string]]::new([string[]]$csv1.identifier)
# Pass the other list to `SymmetricExceptWith`
$userIDs.SymmetricExceptWith([string[]]$csv2.identifier)
# Now we have an efficient filter!
$relevantRecords = @($csv1;$csv2) |Where-Object { $userIDs.Contains($_.identifier) } |Sort-Object -Unique identifier
Using sets to track duplicates
Similarly, we can use hash sets to keep track of which terms have been observed at least once, and which ones have been seen more than once:
# Create sets for tracking
$seenOnce = [System.Collections.Generic.HashSet[string]]::new()
$seenTwice = [System.Collections.Generic.HashSet[string]]::new()
# Loop through whole superset of records
foreach($record in @($csv1;$csv2)){
# Always attempt to add to the $seenOnce set
if(!$seenOnce.Add($record.identifier)){
# We've already seen this identifier once, add it to $seenTwice
[void]$seenTwice.Add($record.identifier)
}
}
# Just like the previous example, we now have an efficient filter!
$relevantRecords = @($csv1;$csv2) |Where-Object { $seenOnce.Contains($_.identifier) -and -not $seenTwice.Contains($_.identifier) } |Sort-Object -Unique identifier
Using a hash table as a grouping construct
You could also use a dictionary type (like a [hashtable] for example) to group records from both csv files based on their identifier, and then filter on number of record values in each dictionary entry:
# Groups records on their identifier value
$groupsById = @{}
foreach($record in @($csv1;$csv2)){
if(-not $groupsById.ContainsKey($record.identifier)){
$groupsById[$record.identifier] = @()
}
$groupsById[$record.identifier] += $record
}
# Filter based on number of records with a distinct identifier
$relevantRecords = $groupsById.GetEnumerator() |Where-Object { $_.Value.Count -eq 1 } |Select-Object -Expand Value
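If all you actually need is the count from the original question (users in csv1 that are not in csv2), a single HashSet of the csv2 identifiers gives you a constant-time lookup per user. A minimal sketch, assuming $csv1 and $csv2 have already been loaded with Import-Csv:
# Build a set of identifiers present in csv2
$csv2Ids = [System.Collections.Generic.HashSet[string]]::new([string[]]$csv2.identifier)
# Count csv1 users whose identifier is not in csv2
@($csv1 | Where-Object { -not $csv2Ids.Contains($_.identifier) }).Count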
If you're just looking for the count, then this should be much faster.
$csv2 = Import-Csv $csvfile2
Import-Csv $csvfile1 |
Where-Object identifier -NotIn $csv2.identifier |
Measure-Object | Select-Object -ExpandProperty Count
Here's a small example
$csvfile1 = New-TemporaryFile
$csvfile2 = New-TemporaryFile
@'
identifier
bob
sally
john
sue
'@ | Set-Content $csvfile1 -Encoding UTF8
@'
identifier
bill
sally
john
stan
'@ | Set-Content $csvfile2 -Encoding UTF8
$csv2 = Import-Csv $csvfile2
Import-Csv $csvfile1 |
Where-Object identifier -NotIn $csv2.identifier |
Measure-Object | Select-Object -ExpandProperty Count
Output is simply
2

Compare columns between 2 files and delete non common columns using Powershell

I have a bunch of files in folder A and their corresponding metadata files in folder B. I want to loop through the data files and check whether the columns are the same as in the metadata file (since incoming data files could have new columns added at any position without notice). If the columns in both files match, no action is to be taken. If a data file has more columns than its metadata file, then those columns should be deleted from the incoming data file. Any help would be appreciated. Thanks!
Data file is ps_job.txt
“empid”|”name”|”deptid”|”zipcode”|”salary”|”gender”
“1”|”Tom”|”10″|”11111″|”1000″|”M”
“2”|”Ann”|”20″|”22222″|”2000″|”F”
Meta data file is ps_job_metadata.dat
“empid”|”name”|”zipcode”|”salary”
I would like my output to be
“empid”|”name”|”zipcode”|”salary”
“1”|”Tom”|”11111″|”1000″
“2”|”Ann”|”22222″|”2000″
That's a seemingly simple question with a very complicated answer. However, I've broken down the code for what you will need to do. Here are the steps that need to happen in order for PowerShell to do everything you're asking of it.
Read the .dat file
Save the .dat data into an object
Read the .txt file
Save the .txt header into an object
Check for the differences
Delete the old text file (that had too many columns)
Create a new text file with the new columns
I've made some assumptions in how this looks. However, with the way I've structured the code, it should be easy enough to make modifications as necessary if my assumptions are wrong. Here are my assumptions:
The text file will always have all of the columns that the DAT file has (even though it will sometimes have more)
The dat file is structured like a text file and can be directly imported into PowerShell.
And here is the code, with comments. I've done my best to explain the purpose of each section, but I've written this with the expectation that you have a basic knowledge of PowerShell, especially arrays. If you have questions, I'll do my best to answer, though I'll ask that you refer to the section of code you have questions on.
###
### The paths. I'm sure you will have multiples of each file. However, I didn't want to attempt to pull in
### the files with this sample code as it can vary so much in your environment.
###
$dat = "C:\StackOverflow\thingy.dat"
$txt = "C:\stackoverflow\ps_job.txt"
###
### This is the section to process the DAT file
###
# This will read the file and put it in a variable
$dat_raw = get-content -Path $dat
# Now, let's separate out the punctuation and give us our object
$dat_array = $dat_raw.split("|")
$dat_object = @()
foreach ($thing in $dat_array)
{
$dat_object+=$thing.Replace("""","")
}
###
### This is the section to process the TXT file
###
# This will read the file and put it into a variable
$txt_raw = get-content -Path $txt
# Now, let's separate out the punctuation and give us our object
$txt_header_array = $txt_raw[0].split("|")
$txt_header_object = @()
foreach ($thing in $txt_header_array)
{
$txt_header_object += $thing.Replace("""","")
}
###
### Now, let's figure out which columns we're eliminating (if any)
###
$x = 0
$total = $txt_header_object.count
$to_keep = @()
While ($x -lt $total)
{
if ($dat_object -contains $txt_header_object[$x])
{
$to_keep += $x
}
$x++
}
### Now that we know which objects to keep, we can apply the changes to each line of the text file.
### We will save each line to a new variable. Then, once we have the new variable, we will delete
### the existing file and replace it with a new file that has only the data we want. Note, we will
### only run this code if there's a difference in the files.
if ($total -ne $to_keep.count)
{
### This first section will go line by line and 'fix' the number of columns
$new_text_file = @()
foreach ($line in $txt_raw)
{
if ($line.Length -gt 0)
{
# Blank out the array each time
$line_array = @()
foreach ($number in $to_keep)
{
$line_array += ($line.split("|"))[$number]
}
$new_text_file += $line_array -join "|"
}
else
{
$new_text_file +=""
}
}
### This second section will delete the original file and replace it with our good
### file that has been created.
Remove-item -Path $txt
$new_text_file | out-file -FilePath $txt
}
This small example can be a start for your solution:
$ps_job = Import-Csv D:\ps_job.txt -Delimiter '|'
$ps_job_metadata = (Get-Content D:\ps_job_metadata.txt) -split '\|' -replace '"'
# Column names present in the data file
$column = $ps_job[0].psobject.Properties.Name
foreach( $d in (Compare-Object $column $ps_job_metadata))
{
if($d.SideIndicator -eq '<=')
{
$ps_job | %{ $_.psobject.Properties.Remove($d.InputObject) }
}
}
$ps_job | Export-Csv -Path D:\output.txt -Delimiter '|' -NoTypeInformation
I tried this and it works.
$outputFile = "C:\Script_test\ps_job_mod.dat"
$sample = Import-Csv -Path "C:\Script_test\ps_job.dat" -Delimiter '|'
$metadataLine = Get-Content -Path "C:\Script_test\ps_job_metadata.txt" -First 1
$desiredColumns = $metadataLine.Split("|").Replace("`"","")
$sample | select $desiredColumns | Export-Csv $outputFile -Encoding UTF8 -NoTypeInformation -Delimiter '|'
Please note that the smart quotes are inconsistent across the rows and there are empty lines between the rows (I highly recommend reformatting/updating your question).
Anyways, as long as the quoting of the header is consistent between the two (ps_job.txt and ps_job_metadata.dat) files:
# $JobTxt = Get-Content .\ps_job.txt
$JobTxt = @'
“empid”|”name”|”deptid”|”zipcode”|”salary”|”gender”
“1”|”Tom”|”10″|”11111″|”1000″|”M”
“2”|”Ann”|”20″|”22222″|”2000″|”F”
'@
# $MetaDataTxt = Get-Content .\ps_job_metadata.dat
$MetaDataTxt = @'
“empid”|”name”|”zipcode”|”salary”
'@
$Job = ConvertFrom-Csv -Delimiter '|' $JobTxt
$MetaData = ConvertFrom-Csv -Delimiter '|' (@($MetaDataTxt) + 'x|')
$Job | Select-Object $MetaData.PSObject.Properties.Name
“empid” ”name” ”zipcode” ”salary”
------- ------ --------- --------
“1” ”Tom” ”11111″ ”1000″
“2” ”Ann” ”22222″ ”2000″
Here's the same answer I posted to your question on Powershell.org
$jobfile = "ps_job.dat"
$metafile = "ps_job_metadata.dat"
$outputfile = "some_file.csv"
$meta = ((Get-Content $metafile -First 1 -Encoding UTF8) -split '\|')
Class ColumnSelector : System.Collections.Specialized.OrderedDictionary {
Select($line,$meta)
{
$meta | foreach{$this.add($_,(iex "`$line.$_"))}
}
ColumnSelector($line,$meta)
{
$this.select($line,$meta)
}
}
import-csv $jobfile -Delimiter '|' |
foreach{[pscustomobject]([columnselector]::new($_,$meta))} |
Export-CSV $outputfile -Encoding UTF8 -NoTypeInformation -Delimiter '|'
Output
PS C:\>Get-Content $outputfile
"empid"|"name"|"zipcode"|"salary"
"1"|"Tom"|"11111"|"1000"
"2"|"Ann"|"22222"|"2000"
Provided you want to keep those curly quotes, and your code page and console font support all the characters, you can do the following:
# Create array of properties delimited by |
$headers = (Get-Content .\ps_job_metadata.dat -Encoding UTF8) -split '\|'
Import-Csv ps_job.dat -Delimiter '|' -Encoding utf8 | Select-Object $headers

Powershell Plus or Minus Comparison Operator (Fuzzy Logic)?

So let me tell you what I'm trying to do here. Our SolarWinds alerts report on disk capacity as read by Windows, not the Virtual Machine vDisk size setting. What I'm trying to do is match the size so that I can find the correct vDisk and report on its datastore free space to determine whether or not we can add more.
Here's the problem: the GB number never matches between Windows and VMware. Say the disk has a 149.67 GB capacity as reported by Windows; the VMware setting is 150, or 150.18854, or anything of that sort. I cannot find the vDisk without knowing the exact number, but theoretically I could find it if I had a comparison operator with some breathing room, like plus or minus 1 or even 0.5. For example:
Get-HardDisk -Vm SERVERNAME | Where-Object {
$_.CapacityGB -lt $size + 0.5 -and
$_.CapacityGB -gt $size - 0.5
}
This doesn't work though, for whatever reason. I need something similar to this. Any ideas?
UPDATE: It turns out to be user error; I was experimenting with the wrong number when testing the command. I thought it was the syntax, but it was the number I was using.
So because I managed to answer my own question, I thought I'd post a script for achieving this here. Note that you will need a txt file with a comma-separated server name and capacity on each line. You could probably modify this to do many other things with VMware data gathering if you wanted. In the end you'll need to know which columns are which and import to Excel as comma delimited.
Most of the variables are decimal values.
Also note that I have not yet figured out a way to programmatically deal with the discovery of multiple matching disks.
$serverlist = Get-Content "./ServerList.txt"
$logfile = "./Stores.txt"
remove-item "./Stores.txt"
Function LogWrite {
Param (
[string]$srv,
[string]$disk,
[string]$store
)
Add-Content $logfile -value $srv","$disk","$store
}
foreach ($item in $serverlist){
$store = "Blank"
$disk = "Blank"
try {
$server,$arg = $item.split(',')
$round = [math]::Round($arg,0)
$disk = get-harddisk -vm $server | where-object{$_.CapacityGB -lt ($round + 2) -and $_.CapacityGB -gt ($round - 2) }
if ([string]::IsNullOrEmpty($disk)){
$disk = "Problem locating disk."
$store = "N/A"
continue
}
if ($disk.count -gt 1) {
$disk = "More than one matching disk."
$store = "N/A"
} else {
$store = get-harddisk -vm $server | where-object{$_.CapacityGB -lt ($round + 2) -and $_.CapacityGB -gt ($round - 2) } | Get-Datastore | %{ "{0},{1},{2}" -f $_.Name,[math]::Round($_.FreeSpaceGB,1),[math]::Round($_.CapacityGB,1) }
}
}
catch {
$disk = "Physical"
$store = "N/A"
}
LogWrite $server $disk $store
}
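As a side note, the plus-or-minus window used twice in the script above can also be written as a single absolute-difference check, which may read more clearly (a minimal sketch, assuming $server and the rounded capacity $round from the script):
# Match any vDisk whose capacity is within +/- 2 GB of the reported size
$disk = Get-HardDisk -Vm $server | Where-Object { [math]::Abs($_.CapacityGB - $round) -le 2 }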

Parsing data from multiple text files into a CSV

I have a directory full of files with content similar to the below. I want to copy everything after //TEST: and before //, along with the date and time and the IPO number, into a CSV.
IPO 7 604 1148 17 - Psuedo text here doesnt mean anything just filler text, beep, boop.txt
werqwerwqerw
erqwerwqer
2. (test) On 7 July 2017 at 0600Z, wqerwqerwqerwerwqerqwerwqjeroisduhsuf //TEST: 37MGUI2974027//,
sdfajsfjiosauf
sadfu
(test2) On 7 July 2017 at 0600Z, blah blah //TEST: 89MTU34782374//
blah blah text here //TEST: GHO394749374// (this is uneeded)
Now, each file has multiple instances of this data, and there may be hundreds of them.
I want to output it into a CSV similar to this:
89MTU34782374, 3 July 2016 at 0640Z, IPO 7 604 1148 17
I have successfully created that with the following, and I feel like I'm on the right track:
$x = "D:\New folder\"
$s = Get-Content $x
$ipo = [regex]::Match($s,'IPO([^/)]+?) -').Groups[1].Value
$test = [regex]::Matches($s,'//TEST: ([^/)]+?)//').Groups[1].Value
$date = [regex]::Matches($s,' On([^/)]+?),').Groups[1].Value
Write-Host $test"," $date"," IPO $ipo
However, I am having trouble getting it to find and select every instance in the file and print each one onto a new line. I should also note that every text file is formatted the same way as the sample above.
Not only am I having issues getting it to print each string/variable in the text document onto a new line, I'm also having trouble figuring out how to do it for multiple files.
I have tried the following, but it seems to find the terms it's looking for in the first file only, and then spit that same result out once for every file contained in the directory:
$files = Get-ChildItem "D:\New folder\*.txt"
$s = Get-Content $files
for ($i=0; $i -lt $files.Count; $i++) {
$ipo = [regex]::Match($s,'IPO([^/)]+?) -').Groups[1].Value
$test = [regex]::Matches($s,'//TEST: ([^/)]+?)//').Groups[1].Value
$date = [regex]::Matches($s,' On([^/)]+?),').Groups[1].Value
Write-Host $test"," $date"," IPO $ipo
}
Does anyone have any ideas on how this could be done?
I did a bad job at explaining this.
Every document has an IPO number.
Every TEST string has a date/time associated with it.
There may be other TEST strings, but they can be ignored; they are unneeded without a date/time. I could clean it up easily if they got included in the output, though.
Every TEST + date/time combo should have the IPO number of the file it came from.
If the date and the //TEST: ...// substring always appear as pairs and in the same order, you should be able to extract both values with a single regular expression. Try something like this:
Get-ChildItem "D:\New folder\*.txt" | ForEach-Object {
$s = Get-Content $_.FullName
$ipo = [regex]::Matches($s,'(IPO .+?) -').Groups[1].Value
[regex]::Matches($s,' On (.+?),[\s\S]*?//TEST: (.+?)//') | ForEach-Object {
New-Object -Type PSObject -Property @{
IPO = $ipo
Date = $_.Groups[1].Value
Test = $_.Groups[2].Value
}
}
} | Export-Csv 'C:\path\to\output.csv' -NoType
Like so? Most of your code seems to be fine, if I understand your question.
It's the loop that seems incorrect: you are repeating the same thing once for each file found, but never actually referring to the individual files. Also, $s = ... should be inside the loop so you get the content of each file.
$files = Get-ChildItem "D:\New folder\*.txt"
foreach($file in $files){
$s = Get-content $file
$ipo = [regex]::Match($s,'IPO([^/)]+?) -').Groups[1].Value
$test = [regex]::Matches($s,'//TEST: ([^/)]+?)//').Groups[1].Value
$date = [regex]::Matches($s,' On([^/)]+?),').Groups[1].Value
Write-Host "$test, $date, IPO $ipo"
}
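If you need the result in a CSV file rather than on the console, you can emit objects instead of using Write-Host and pipe them to Export-Csv, as the first answer does. A minimal sketch of that variation (the output path is a placeholder; like the loop above, it takes the first match from each file):
Get-ChildItem "D:\New folder\*.txt" | ForEach-Object {
    $s = Get-Content $_.FullName
    [pscustomobject]@{
        Test = [regex]::Matches($s,'//TEST: ([^/)]+?)//').Groups[1].Value
        Date = [regex]::Matches($s,' On([^/)]+?),').Groups[1].Value
        IPO  = "IPO " + [regex]::Match($s,'IPO([^/)]+?) -').Groups[1].Value
    }
} | Export-Csv 'D:\output.csv' -NoTypeInformation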

Sort very large text file in PowerShell

I have standard Apache log files, between 500 MB and 2 GB in size. I need to sort the lines in them (each line starts with a date in the form yyyy-MM-dd hh:mm:ss, so no treatment is necessary for sorting).
The simplest and most obvious thing that comes to mind is
Get-Content unsorted.txt | sort | get-unique > sorted.txt
I am guessing (without having tried it) that doing this using Get-Content would take forever in my 1GB files. I don't quite know my way around System.IO.StreamReader, but I'm curious if an efficient solution could be put together using that?
Thanks to anyone who might have a more efficient idea.
[edit]
I tried this subsequently, and it took a very long time; some 10 minutes for 400MB.
Get-Content is terribly inefficient for reading large files, and Sort-Object is not very fast either.
Let's set up a base line:
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$c = Get-Content .\log3.txt -Encoding Ascii
$sw.Stop();
Write-Output ("Reading took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$s = $c | Sort-Object;
$sw.Stop();
Write-Output ("Sorting took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u = $s | Get-Unique
$sw.Stop();
Write-Output ("uniq took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u | Out-File 'result.txt' -Encoding ascii
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);
With a 40 MB file having 1.6 million lines (made of 100k unique lines repeated 16 times) this script produces the following output on my machine:
Reading took 00:02:16.5768663
Sorting took 00:02:04.0416976
uniq took 00:01:41.4630661
saving took 00:00:37.1630663
Totally unimpressive: more than 6 minutes to sort a tiny file. Every step can be improved a lot. Let's use a StreamReader to read the file line by line into a HashSet, which will remove duplicates, then copy the data into a List and sort it there, then use a StreamWriter to dump the results back out.
$hs = new-object System.Collections.Generic.HashSet[string]
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$reader = [System.IO.File]::OpenText("D:\log3.txt")
try {
while (($line = $reader.ReadLine()) -ne $null)
{
$t = $hs.Add($line)
}
}
finally {
$reader.Close()
}
$sw.Stop();
Write-Output ("read-uniq took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$ls = new-object system.collections.generic.List[string] $hs;
$ls.Sort();
$sw.Stop();
Write-Output ("sorting took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
try
{
$f = New-Object System.IO.StreamWriter "d:\result2.txt";
foreach ($s in $ls)
{
$f.WriteLine($s);
}
}
finally
{
$f.Close();
}
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);
this script produces:
read-uniq took 00:00:32.2225181
sorting took 00:00:00.2378838
saving took 00:00:01.0724802
On the same input file it runs more than 10 times faster. I am still surprised, though, that it takes 30 seconds to read the file from disk.
I've grown to hate this part of Windows PowerShell; it is a memory hog on these larger files. One trick is to read the lines with [System.IO.File]::ReadLines('file.txt') | sort -u | out-file file2.txt -encoding ascii
Another trick, seriously, is to just use Linux.
cat file.txt | sort -u > output.txt
Linux is so insanely fast at this, it makes me wonder what the heck Microsoft is thinking with this setup.
It may not be feasible in all cases, I understand, but if you have a Linux machine, you can copy 500 MB to it, sort and unique it, and copy it back in under a couple of minutes.
If each line of the log is prefixed with a timestamp, and the log messages don't contain embedded newlines (which would require special handling), I think it would take less memory and execution time to convert the timestamp from [String] to [DateTime] before sorting. The following assumes each log entry is of the format yyyy-MM-dd HH:mm:ss: <Message> (note that the HH format specifier is used for a 24-hour clock):
Get-Content unsorted.txt `
| ForEach-Object {
# Ignore empty lines; can substitute with [String]::IsNullOrWhitespace($_) on PowerShell 3.0 and above
if (-not [String]::IsNullOrEmpty($_))
{
# Split into at most two fields, even if the message itself contains ': '
[String[]] $fields = $_ -split ': ', 2;
return New-Object -TypeName 'PSObject' -Property #{
Timestamp = [DateTime] $fields[0];
Message = $fields[1];
};
}
} | Sort-Object -Property 'Timestamp', 'Message';
If you are processing the input file for interactive display purposes you can pipe the above into Out-GridView or Format-Table to view the results. If you need to save the sorted results you can pipe the above into the following:
| ForEach-Object {
# Reconstruct the log entry format of the input file
return '{0:yyyy-MM-dd HH:mm:ss}: {1}' -f $_.Timestamp, $_.Message;
} `
| Out-File -Encoding 'UTF8' -FilePath 'sorted.txt';
(Edited to be more clear based on n0rd's comments)
It might be a memory issue. Since you're loading the entire file into memory to sort it (and adding the overhead of the pipe into Sort-Object and the pipe into Get-Unique), it's possible that you're hitting the memory limits of the machine and forcing it to page to disk, which will slow things down a lot. One thing you might consider is splitting the logs up before sorting them, and then splicing them back together.
This probably won't match your format exactly, but if I've got a large log file for, say, 8/16/2012 which spans several hours, I can split it up into a different file for each hour using something like this:
for($i=0; $i -le 23; $i++){ Get-Content .\u_ex120816.log | ? { $_ -match "^2012-08-16 $i`:" } | Set-Content -Path "$i.log" }
This is creating a regular expression for each hour of that day and dumping all the matching log entries into a smaller log file named by the hour (e.g. 16.log, 17.log).
Then I can run your process of sorting and getting unique entries on a much smaller subsets, which should run a lot faster:
for($i=0; $i -le 23; $i++){ Get-Content "$i.log" | sort | get-unique > "${i}sorted.txt" }
And then you can splice them back together.
Depending on the frequency of the logs, it might make more sense to split them by day, or minute; the main thing is to get them into more manageable chunks for sorting.
Again, this only makes sense if you're hitting the memory limits of the machine (or if Sort-Object is using a really inefficient algorithm).
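For the splicing step, since each hourly file is already sorted and the hours themselves are in order, concatenating the per-hour results in hour order is enough. A minimal sketch, assuming the per-hour sorted files produced above:
# Concatenate the per-hour sorted files back into one sorted log
0..23 | ForEach-Object { Get-Content "${_}sorted.txt" } | Set-Content sorted_full.txt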
"Get-Content" can be faster than you think. Check this code-snippet in addition to the above solution:
foreach ($block in (get-content $file -ReadCount 100)) {
foreach ($line in $block){[void] $hs.Add($line)}
}
There doesn't seem to be a great way to do it in PowerShell, including [IO.File]::ReadLines(), but with the native Windows sort.exe or the GNU sort.exe, either run from cmd.exe, 30 million random numbers can be sorted in about 5 minutes with around 1 GB of RAM. GNU sort automatically breaks things up into temp files to save RAM. Both commands have options to start the sort at a certain character column. GNU sort can merge sorted files. See external sorting.
30 million line test file:
& { foreach ($i in 1..300kb) { get-random } } | set-content file.txt
And then in cmd:
copy file.txt+file.txt file2.txt
copy file2.txt+file2.txt file3.txt
copy file3.txt+file3.txt file4.txt
copy file4.txt+file4.txt file5.txt
copy file5.txt+file5.txt file6.txt
copy file6.txt+file6.txt file7.txt
copy file7.txt+file7.txt file8.txt
With GNU sort.exe from http://gnuwin32.sourceforge.net/packages/coreutils.htm . Don't forget the dependency DLLs -- libiconv2.dll & libintl3.dll. Within cmd.exe:
.\sort.exe < file8.txt > filesorted.txt
Or windows sort.exe within cmd.exe:
sort.exe < file8.txt > filesorted.txt
With the function below:
PS> PowerSort -SrcFile C:\windows\win.ini
function PowerSort {
param(
[string]$SrcFile = "",
[string]$DstFile = "",
[switch]$Force
)
if ($SrcFile -eq "") {
write-host "USAGE: PowerSort -SrcFile (srcfile) [-DstFile (dstfile)] [-Force]"
return 0;
}
else {
$SrcFileFullPath = Resolve-Path $SrcFile -ErrorAction SilentlyContinue -ErrorVariable _frperror
if (-not($SrcFileFullPath)) {
throw "Source file not found: $SrcFile";
}
}
[Collections.Generic.List[string]]$lines = [System.IO.File]::ReadAllLines($SrcFileFullPath)
$lines.Sort();
# Write Sorted File to Pipe
if ($DstFile -eq "") {
foreach ($line in $lines) {
write-output $line
}
}
# Write Sorted File to File
else {
$pipe_enable = 0;
$DstFileFullPath = Resolve-Path $DstFile -ErrorAction SilentlyContinue -ErrorVariable ev
# Destination File doesn't exist
if (-not($DstFileFullPath)) {
$DstFileFullPath = $ev[0].TargetObject
}
# Destination Exists and -force not specified.
elseif (-not $Force) {
throw "Destination file already exists: ${DstFile} (using -Force Flag to overwrite)"
}
write-host "Writing-File: $DstFile"
[System.IO.File]::WriteAllLines($DstFileFullPath, $lines)
}
return
}
