Filtering a log file quickly - algorithm

I'm trying to write a PowerShell script to do some analysis on production log files.
I need to filter roughly 2.5 million lines of text by roughly 50-100 values - potentially 250,000,000 iterations if O(nm).
I've tried Get-Content | Select-String but this seems incredibly slow.
Is there any way to approach this without iterating over every line once for each value?
EDIT
So, the log files look a bit like this (datetime : process_id : log_level : message)
2016-01-30 14:01:22.349 [ 27] INFO XXX YYY XXXFX
2016-01-30 14:01:28.146 [ 16] INFO XXXD YY Z YYY XXXX
2016-01-30 14:01:28.162 [ 16] DEBUG YY XXXXX YY XX P YYY
2016-01-30 14:01:28.165 [ 16] DEBUG YY XXXXX YY XX YYY
2016-01-30 14:01:28.167 [ 16] DEBUG YY XXXXX YY XX YYY
2016-01-30 14:01:28.912 [ 27] INFO XXX YY XXGXXX YYYYYY YY XX
and I may be looking for the values D, F, G and Z.
The values could be strings of binary digits, hexadecimal digits, combinations of the two, strings of regular text and punctuation, or pipe-delimited values.

Rules of thumb:
StreamReader is faster than Get-Content is faster than Import-Csv.
String operation is faster than wildcard match is faster than regular expression match.
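These rules of thumb are easy to verify against your own log with Measure-Command; a rough sketch, using a hypothetical path:
$path = 'C:\path\to\your.log'
Measure-Command { Get-Content $path | Out-Null }
Measure-Command {
    $r = [IO.StreamReader]$path
    while ($r.Peek() -ge 0) { $null = $r.ReadLine() }
    $r.Close(); $r.Dispose()
}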
You probably want something like this if it's sufficient to check whether your log lines contain any of the given strings:
$reader = [IO.StreamReader]'C:\path\to\your.log'
$filters = 'foo', 'bar', ...
while ($reader.Peek() -ge 0) {
$line = $reader.ReadLine()
if ($filters | Where-Object {$line.Contains($_)}) {
$line
}
}
$reader.Close()
$reader.Dispose()
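Note that Where-Object drains the whole $filters list even after a hit; a foreach loop with break (or, on PowerShell 4+, the .Where({...}, 'First') method) stops at the first matching value. A minimal sketch of the inner test, assuming the same $line and $filters as above:
foreach ($f in $filters) {
    if ($line.Contains($f)) { $line; break }   # emit the line and stop checking further filters
}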
If you want to use a StreamWriter instead of just echoing the output, simply adjust the code like this:
$reader = [IO.StreamReader]'C:\path\to\your.log'
$writer = [IO.StreamWriter]'C:\path\to\output.txt'
$filters = 'foo', 'bar', ...
while ($reader.Peek() -ge 0) {
$line = $reader.ReadLine()
if ($filters | Where-Object {$line.Contains($_)}) {
$writer.WriteLine($line)
}
}
$reader.Close(); $reader.Dispose()
$writer.Close(); $writer.Dispose()
Depending on the structure of your log lines, the filter values, and how they need to be applied, the filter logic may need adjustments. You would need to actually show the log format and filter examples for that, though.
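For the format shown in the question's edit (datetime : process_id : log_level : message), one possible adjustment - a sketch, not a drop-in solution - is to strip the fixed prefix and test only the message portion, so a filter value can never accidentally match the timestamp, process id, or level. The prefix regex below is an assumption based on the sample lines:
$reader  = [IO.StreamReader]'C:\path\to\your.log'
$filters = 'D', 'F', 'G', 'Z'
$prefix  = '^\S+ \S+ \[\s*\d+\]\s+\w+\s+'    # datetime, [ pid], level - assumed from the sample
while ($reader.Peek() -ge 0) {
    $line = $reader.ReadLine()
    $message = $line -replace $prefix, ''    # keep only the message portion
    foreach ($f in $filters) {
        if ($message.Contains($f)) { $line; break }
    }
}
$reader.Close(); $reader.Dispose()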

I would try a StreamReader + StreamWriter to speed up reading/writing, as Get-Content is slow for big files. Also, I would build one combined regex (word OR word OR word, etc.) to avoid iterating over each line once per value. Ex:
$words = "foo","bar","donkey"
#Create regex-pattern (usually faster to match)
$regex = ($words | % { [regex]::Escape($_) }) -join '|'
$reader = New-Object System.IO.StreamReader -ArgumentList "c:\myinputfile.txt"
$writer = New-Object System.IO.StreamWriter -ArgumentList "c:\myOUTputfile.txt"
while (($line = $reader.ReadLine()) -ne $null) {
if($line -match $regex) { $writer.WriteLine($line) }
}
#Close writer
$writer.Close()
$writer.Dispose()
#Close reader
$reader.Close()
$reader.Dispose()
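With millions of lines it may also pay to precompile the combined pattern once with the Compiled option (the same idea as the precompiled-regex variants benchmarked further down); a rough sketch of how the read loop above would change (PowerShell 5+ [regex]::new syntax):
$compiled = [regex]::new($regex, 'Compiled')
while (($line = $reader.ReadLine()) -ne $null) {
    if ($compiled.IsMatch($line)) { $writer.WriteLine($line) }
}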

Related

Powershell 7.x How to Select a Text Substring of Unknown Length Only Using Boundary Substrings

I am trying to extract and store a substring of a text file that is delimited by a known beginning and end within the original file. I am new to PowerShell, so my methods are simple/crude. Basically my approach has been:
Roughly get what I want from the start of the string
Worry about trimming off what I don't want later
My minimum reproducible example is as follows:
# selectStringTest.ps
$inputFile = Get-Content -Path "C:\test\test3\Copy of 31832_226140__0001-00006.txt"
# selected text string needs to span from $refName up to $boundaryName
[string]$refName = "001 BARTLETT"
[string]$boundaryName = "001 BEECH"
# a rough estimate of the text file lines required
[int]$lines = 200
if (Select-String -InputObject $inputFile -pattern $refName) {
Write-Host "Selected shortened string found!"
# this selects the start of required string but with extra text
[string]$newFileStart = $inputFile | Select-String $refName -CaseSensitive -SimpleMatch -Context 0, $lines
}
else {
Write-Host "Selected string NOT FOUND."
}
# tidy up the start of the string by removing rubbish
$newFileStart = $newFileStart.TrimStart('> ')
# this is the kind of thing I want but it doesn't work
$newFileStart = $newFileStart - $newFileStart.StartsWith($boundaryName)
$newFileStart | Out-File tempOutputFile
As it is, the output begins correctly, but I cannot remove the text at and after $boundaryName.
The original text file is OCR-generated (Optical Character Recognition), so it is unevenly formatted and there are newlines in odd places, which limits my options when it comes to delimiting.
I am not sure my if (Select-String -InputObject $inputFile -pattern $refName) is valid, although it appears to work correctly. The general design seems crude, in that I am guessing how many lines I will need. Finally, I have tried various methods of trimming the string from $boundaryName without success. For this:
string.split() is not practical
replacing spaces with newlines in an array and looping through to the elements of $boundaryName is possible, but I don't know how to terminate the array at that point before returning it to a string.
Any suggestions would be appreciated.
Abbreviated content of the single Copy of 31832_226140__0001-00006.txt file (2 x 200 listings) is:
Beginning of text file
________________
BARTLETT-BEDGGOOD
PENCARROW COMPOSITE ROLL
PAGE 6
PAGE 7
PENCARROW COMPOSITE ROLL
BEECH-BEST
www.
.......................
001 BARTLETT. Lois Elizabeth
Middle of text file
............. 15 St Ronans Av. Lower Hutt Marned 200 BEDGGOOD. Percy Lloyd
............15 St Ronans Av, Lower Mutt. Coachbuild
001 BEECH, Margaret ..........
End of text file
..............312 Munita Rood Eastbourne, Civil Eng 200 BEST, Dons Amy .........
..........50 Man Street, Wamuomata, Marned
SO NON
To use a regex across newlines, the file needs to be read as a single string; Get-Content -Raw will do that. This assumes that you do not want the lines containing $refName and $boundaryName included in the output.
$c = Get-Content -Path '.\beech.txt' -Raw
$refName = "001 BARTLETT"
$boundaryName = "001 BEECH"
if ($c -match "(?smi).*$refName.*?`r`n(.*)$boundaryName.*?`r`n.*") {
$result = $Matches[1]
}
$result
More information at https://stackoverflow.com/a/12573413/447901
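One caveat: $refName and $boundaryName are interpolated into the pattern as-is, so if they ever contain regex metacharacters (dots, pipes, parentheses), it would be safer to escape them first, for example:
$refName      = [regex]::Escape('001 BARTLETT')
$boundaryName = [regex]::Escape('001 BEECH')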
How close does this come to what you want?
function Process-File {
param (
[Parameter(Mandatory = $true, Position = 0)]
[string]$HeadText,
[Parameter(Mandatory = $true, Position = 1)]
[string]$TailText,
[Parameter(ValueFromPipeline)]
$File
)
Process {
$Inside = $false;
switch -Regex -File $File.FullName {
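# NOTE: branch order matters - without break/continue every matching clause runs for each line,
# so the Tail pattern is tested first (turning $Inside off before the generic line clause runs)
# and the Head pattern last (so the head line itself is not echoed by the generic clause).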
#'^\s*$' { continue }
"(?i)^\s*$TailText(?<Tail>.*)`$" { $Matches.Tail; $Inside = $false }
'^(?<Line>.+)$' { if($Inside) { $Matches.Line } }
"(?i)^\s*$HeadText(?<Head>.*)`$" { $Matches.Head; $Inside = $true }
default { continue }
}
}
}
$File = 'Copy of 31832_226140__0001-00006.txt'
#$Path = $PSScriptRoot
$Path = 'C:\test\test3'
$Result = Get-ChildItem -Path "$Path\$File" | Process-File '001 BARTLETT' '001 BEECH'
$Result | Out-File -FilePath "$Path\SpanText.txt"
This is the output:
. Lois Elizabeth
............. 15 St Ronans Av. Lower Hutt Marned 200 BEDGGOOD. Percy Lloyd
............15 St Ronans Av, Lower Mutt. Coachbuild
, Margaret ..........

PowerShell5. Modify ascii text file string with line number string is on. Switch and .NET framework or cmdlets & the pipeline? Which is faster?

How do you modify a string (LINE2 "line number LINE2 is on") in a Windows ASCII text file, using search strings that are easy to read and easy to add/modify/delete, with PowerShell 5? This script will parse a 2500-line file, find 139 instances of the strings, replace them, and overwrite the original in less than 165 ms on average, depending on which method you use. Which method is faster? Which method makes the strings easier to add/modify/delete?
Search for the strings "AROUND LINE {1-9999}" and "LINE2 {1-9999}" and replace {1-9999} with the {line number} the code is on. The tests were done with a 2500-line file, not the two-line sample.bat.
sample.bat contains two lines:
ECHO AROUND LINE 5936
TITLE %TIME% DISPLAY TCP-IP SETTINGS LINE2 5937
Method One: Using Get-Content + -replace + Set-Content:
Measure-command {
copy-item $env:temp\sample9.bat -d $env:temp\sample.bat -force
(gc $env:temp\sample.bat) | foreach -Begin {$lc = 1} -Process {
$_ -replace 'AROUND LINE \d+', "AROUND LINE $lc" -replace 'LINE2 \d+', "LINE2 $lc"
++$lc
} | sc -Encoding Ascii $env:temp\sample.bat}
Results: 175ms-387ms in ten runs for an average of 215ms.
You modify the search by adding / removing / modifying -replace.
-replace 'AROUND LINE \d+', "AROUND LINE $lc" -replace 'LINE2 \d+', "LINE2 $lc" -replace 'PLACEMARK \d+', "PLACEMARK $lc"
powershell $env:temp\sample.ps1 $env:temp\sample.bat:
(gc $args[0]) | foreach -Begin {$lc = 1} -Process {
$_ -replace 'AROUND LINE \d+', "AROUND LINE $lc" -replace 'LINE2 \d+', "LINE2 $lc"
++$lc
} | sc -Encoding Ascii $args[0]
Method Two: Using switch and the .NET Framework:
Measure-command {
copy-item $env:temp\sample9.bat -d $env:temp\sample.bat -force
$file = "$env:temp\sample.bat"
$lc = 0
$updatedLines = switch -Regex ([IO.File]::ReadAllLines($file)) {
'^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$' { $Matches[1] + ++$lc + $Matches[2] }
default { ++$lc; $_ }
}
[IO.File]::WriteAllLines($file, $updatedLines, [Text.Encoding]::ASCII)}
Results: 73ms-816ms in ten runs for an average of 175ms.
Method Three: Using switch and the .NET Framework, optimized with a precompiled regex:
Measure-command {
copy-item $env:temp\sample9.bat -d $env:temp\sample.bat -force
$file = "$env:temp\sample.bat"
$regex = [Regex]::new('^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$', 'Compiled, IgnoreCase, CultureInvariant')
$lc = 0
$updatedLines = & {foreach ($line in [IO.File]::ReadLines($file)) {
$lc++
$m = $regex.Match($line)
if ($m.Success) {
$g = $m.Groups
$g[1].Value + $lc + $g[2].Value
} else { $line }
}}
[IO.File]::WriteAllLines($file, $updatedLines, [Text.Encoding]::ASCII)}
Results: 71ms-236ms in ten runs for an average of 106ms.
Add/Modify/Delete your search string:
AROUND LINE|LINE2|PLACEMARK
AROUND LINE|LINE3
LINE4
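For example, with PLACEMARK added, the precompiled pattern would presumably become:
$regex = [Regex]::new('^(.*? (?:AROUND LINE|LINE2|PLACEMARK) )\d+(.*)$', 'Compiled, IgnoreCase, CultureInvariant')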
powershell $env:temp\sample.ps1 $env:temp\sample.bat:
$file=$args[0]
$regex = [Regex]::new('^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$', 'Compiled, IgnoreCase, CultureInvariant')
$lc = 0
$updatedLines = & {foreach ($line in [IO.File]::ReadLines($file)) {
$lc++
$m = $regex.Match($line)
if ($m.Success) {
$g = $m.Groups
$g[1].Value + $lc + $g[2].Value
} else { $line }
}}
[IO.File]::WriteAllLines($file, $updatedLines, [Text.Encoding]::ASCII)
Editor's note: This is a follow-up question to Iterate a backed up ascii text file, find all instances of {LINE2 1-9999} replace with {LINE2 "line number the code is on"}. Overwrite. Faster?
The evolution of this question from youngest to oldest:
1. 54757890 2. 54737787 3. 54712715 4. 54682186
Update: I've used mklement0's regex solution.
switch -Regex -File $file {
'^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$' { $Matches[1] + ++$lc + $Matches[2] }
default { ++$lc; $_ }
}
Given that regex ^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$ contains only 2 capture groups - the part of the line before the number to replace (\d+) and the part of the line after, you must reference these groups with indices 1 and 2 into the automatic $Matches variable in the output (not 2 and 3).
Note that (?:...) is a non-capturing group, so by design it isn't reflected in $Matches.
Instead of reading the file with [IO.File]::ReadAllLines($file), I'm using the -File option with switch, which directly reads the lines from file $file.
The ++$lc inside default { ++$lc; $_ } ensures that the line counter is also incremented for non-matching lines before passing the line at hand through ($_).
Performance notes
You can improve the performance slightly with the following obscure optimization:
# Enclose the switch statement in & { ... } to speed it up slightly.
$updatedLines = & { switch -Regex -File ... }
With high iteration counts (a large number of lines), using a precompiled [regex] instance rather than a string literal that PowerShell converts to a regex behind the scenes can speed things up further - see benchmarks below.
Additionally, if case-sensitive matching is sufficient, you can squeeze out a little more performance by adding the -CaseSensitive option to the switch statement.
At a high level, what makes the solution fast is the use of switch -File to process the lines and, generally, the use of .NET types for file I/O rather than cmdlets ([IO.File]::WriteAllLines() in this case, as shown in the question) - see also this related answer.
That said, marsze's answer offers a highly optimized foreach loop approach based on a precompiled regex that is faster with higher iteration counts - it is, however, more verbose.
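For instance, combining the & { ... } wrapper with -CaseSensitive (assuming case-sensitive matching really is acceptable for your input) would look roughly like this:
$lc = 0
$updatedLines = & {
    switch -Regex -CaseSensitive -File $file {
        '^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$' { $Matches[1] + ++$lc + $Matches[2] }
        default { ++$lc; $_ }
    }
}
[IO.File]::WriteAllLines($file, $updatedLines, [Text.Encoding]::ASCII)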
Benchmarks
The following code compares the performance of this answer's switch approach with marsze's foreach approach.
Note that in order to make the two solutions fully equivalent, the following tweaks were made:
The & { ... } optimization was added to the switch command as well.
The IgnoreCase and CultureInvariant options were added to the foreach approach to match the options PS regexes implicitly use.
Instead of a 6-line sample file, performance is tested with a 600-line, a 3,000-line, and a 30,000-line file respectively, so as to show the effect of the iteration count on performance.
100 runs are being averaged.
Sample results from my Windows 10 machine running Windows PowerShell v5.1 - the absolute times aren't important, but hopefully the relative performance shown in the Factor column is generally representative:
VERBOSE: Averaging 100 runs with a 600-line file of size 0.03 MB...
Factor Secs (100-run avg.) Command
------ ------------------- -------
1.00 0.023 # switch -Regex -File with regex string literal...
1.16 0.027 # foreach with precompiled regex and [regex].Match...
1.23 0.028 # switch -Regex -File with precompiled regex...
VERBOSE: Averaging 100 runs with a 3000-line file of size 0.15 MB...
Factor Secs (100-run avg.) Command
------ ------------------- -------
1.00 0.063 # foreach with precompiled regex and [regex].Match...
1.11 0.070 # switch -Regex -File with precompiled regex...
1.15 0.073 # switch -Regex -File with regex string literal...
VERBOSE: Averaging 100 runs with a 30000-line file of size 1.47 MB...
Factor Secs (100-run avg.) Command
------ ------------------- -------
1.00 0.252 # foreach with precompiled regex and [regex].Match...
1.24 0.313 # switch -Regex -File with precompiled regex...
1.53 0.386 # switch -Regex -File with regex string literal...
Note how at lower iteration counts switch -regex with a string literal is fastest, but at around 1,500 lines the foreach solution with a precompiled [regex] instance starts to get faster; using a precompiled [regex] instance with switch -regex pays off to a lesser degree, only with higher iteration counts.
Benchmark code, using the Time-Command function:
# Sample file content (6 lines)
$fileContent = @'
TITLE %TIME% NO "%zmyapps1%\*.*" ARCHIVE ATTRIBUTE LINE2 1243
TITLE %TIME% DOC/SET YQJ8 LINE2 1887
SET ztitle=%TIME%: WINFOLD LINE2 2557
TITLE %TIME% _*.* IN WINFOLD LINE2 2597
TITLE %TIME% %%ZDATE1%% YQJ25 LINE2 3672
TITLE %TIME% FINISHED. PRESS ANY KEY TO SHUTDOWN ... LINE2 4922
'@
# Determine the full path to a sample file.
# NOTE: Using the *full* path is a *must* when calling .NET methods, because
# the latter generally don't see the same working dir. as PowerShell.
$file = "$PWD/test.bat"
# Note: input is the number of 6-line blocks to write to the sample file,
# which amounts to 600 vs. 3,000 vs. 30,000 lines.
100, 500, 5000 | % {
# Create the sample file with the sample content repeated N times.
$repeatCount = $_
[IO.File]::WriteAllText($file, $fileContent * $repeatCount)
# Warm up the file cache and count the lines.
$lineCount = [IO.File]::ReadAllLines($file).Count
# Define the commands to compare as an array of scriptblocks.
$commands =
{ # switch -Regex -File with regex string literal
& {
$i = 0
$updatedLines = switch -Regex -File $file {
'^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$' { $Matches[1] + ++$i + $Matches[2] }
default { ++$i; $_ }
}
[IO.File]::WriteAllLines($file, $updatedLines, [text.encoding]::ASCII)
}
}, { # switch -Regex -File with precompiled regex
& {
$i = 0
$regex = [Regex]::new('^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$', 'Compiled, IgnoreCase, CultureInvariant')
$updatedLines = switch -Regex -File $file {
$regex { $Matches[1] + ++$i + $Matches[2] }
default { ++$i; $_ }
}
[IO.File]::WriteAllLines($file, $updatedLines, [text.encoding]::ASCII)
}
}, { # foreach with precompiled regex and [regex].Match
& {
$regex = [Regex]::new('^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$', 'Compiled, IgnoreCase, CultureInvariant')
$i = 0
$updatedLines = foreach ($line in [IO.File]::ReadLines($file)) {
$i++
$m = $regex.Match($line)
if ($m.Success) {
$g = $m.Groups
$g[1].Value + $i + $g[2].Value
} else { $line }
}
[IO.File]::WriteAllLines($file, $updatedLines, [Text.Encoding]::ASCII)
}
}
# How many runs to average.
$runs = 100
Write-Verbose -vb "Averaging $runs runs with a $lineCount-line file of size $('{0:N2} MB' -f ((Get-Item $file).Length / 1mb))..."
Time-Command -Count $runs -ScriptBlock $commands | Out-Host
}
Alternative solution:
$regex = [Regex]::new('^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$', 'Compiled, IgnoreCase, CultureInvariant')
$lc = 0
$updatedLines = & {foreach ($line in [IO.File]::ReadLines($file)) {
$lc++
$m = $regex.Match($line)
if ($m.Success) {
$g = $m.Groups
$g[1].Value + $lc + $g[2].Value
} else { $line }
}}
[IO.File]::WriteAllLines($file, $updatedLines, [Text.Encoding]::ASCII)

Iterate a windows ascii text file, find all instances of {LINE2 1-9999} replace with {LINE2 "line number the code is on"}. Overwrite. Faster?

This code works. I just want to see how much faster someone can make it work.
Backup your Windows 10 batch file in case something goes wrong. Find all instances of string {LINE2 1-9999} and replace with {LINE2 "line number the code is on"}. Overwrite, encoding as ASCII.
If _61.bat is:
TITLE %TIME% NO "%zmyapps1%\*.*" ARCHIVE ATTRIBUTE LINE2 1243
TITLE %TIME% DOC/SET YQJ8 LINE2 1887
SET ztitle=%TIME%: WINFOLD LINE2 2557
TITLE %TIME% _*.* IN WINFOLD LINE2 2597
TITLE %TIME% %%ZDATE1%% YQJ25 LINE2 3672
TITLE %TIME% FINISHED. PRESS ANY KEY TO SHUTDOWN ... LINE2 4922
Results:
TITLE %TIME% NO "%zmyapps1%\*.*" ARCHIVE ATTRIBUTE LINE2 1
TITLE %TIME% DOC/SET YQJ8 LINE2 2
SET ztitle=%TIME%: WINFOLD LINE2 3
TITLE %TIME% _*.* IN WINFOLD LINE2 4
TITLE %TIME% %%ZDATE1%% YQJ25 LINE2 5
TITLE %TIME% FINISHED. PRESS ANY KEY TO SHUTDOWN ... LINE2 6
Code:
Copy-Item $env:windir\_61.bat -d $env:temp\_61.bat
(gc $env:windir\_61.bat) | foreach -Begin {$lc = 1} -Process {
$_ -replace "LINE2 \d*", "LINE2 $lc";
$lc += 1
} | Out-File -Encoding Ascii $env:windir\_61.bat
I expect this to take less than 984 milliseconds. It takes 984 milliseconds. Can you think of anything to speed it up?
The key to better performance in PowerShell code (short of embedding C# code compiled on demand with Add-Type, which may or may not help) is to:
avoid use of cmdlets and the pipeline in general,
especially invocation of a script block ({...}) for each pipeline input object, such as with ForEach-Object and Where-Object
However, it isn't the pipeline per se that is to blame, it is the current inefficient implementation of these cmdlets - see GitHub issue #10982 - and there is a workaround that noticeably improves pipeline performance:
# Faster alternative to:
# 1..10 | ForEach-Object { $_ * 10 }
1..10 | . { process { $_ * 10 } }
# Faster alternative to:
# 1..10 | Where-Object { $_ -gt 5 }
1..10 | . { process { if ($_ -gt 5) { $_ } } }
avoiding the pipeline requires direct use of the .NET framework types as an alternative to cmdlets.
if feasible, use switch statements for array or line-by-line file processing - switch statements generally outperform foreach loops.
To be clear: The pipeline and cmdlets offer clear benefits, so avoiding them should only be done if optimizing performance is a must.
In your case, the following code, which combines the switch statement with direct use of the .NET framework for file I/O, seems to offer the best performance - note that the input file is read into memory as a whole, as an array of lines, and a copy of that array with the modified lines is created before it is written back to the input file:
$file = "$env:temp\_61.bat" # must be a *full* path.
$lc = 0
$updatedLines = & { switch -Regex -File $file {
'^(.*? LINE2 )\d+(.*)$' { $Matches[1] + ++$lc + $Matches[2] }
default { ++$lc; $_ } # pass non-matching lines through
} }
[IO.File]::WriteAllLines($file, $updatedLines, [Text.Encoding]::ASCII)
Note:
Enclosing the switch statement in & { ... } is an obscure performance optimization explained in this answer.
If case-sensitive matching is sufficient, as suggested by the sample input, you can improve performance a little more by adding the -CaseSensitive option to the switch command.
In my tests (see below), this provided a more than 4-fold performance improvement in Windows PowerShell relative to your command.
Here's a performance comparison via the Time-Command function:
The commands compared are:
The switch command from above.
A slightly streamlined version of your own command.
A PowerShell Core v6.1+ alternative that uses the -replace operator with the array of lines as the LHS and a scriptblock as the replacement expression.
Instead of a 6-line sample file, a 6,000-line file is used.
100 runs are being averaged.
It's easy to adjust these parameters.
# Sample file content (6 lines)
$fileContent = @'
TITLE %TIME% NO "%zmyapps1%\*.*" ARCHIVE ATTRIBUTE LINE2 1243
TITLE %TIME% DOC/SET YQJ8 LINE2 1887
SET ztitle=%TIME%: WINFOLD LINE2 2557
TITLE %TIME% _*.* IN WINFOLD LINE2 2597
TITLE %TIME% %%ZDATE1%% YQJ25 LINE2 3672
TITLE %TIME% FINISHED. PRESS ANY KEY TO SHUTDOWN ... LINE2 4922
'@
# Determine the full path to a sample file.
# NOTE: Using the *full* path is a *must* when calling .NET methods, because
# the latter generally don't see the same working dir. as PowerShell.
$file = "$PWD/test.bat"
# Create the sample file with the sample content repeated N times.
$repeatCount = 1000 # -> 6,000 lines
[IO.File]::WriteAllText($file, $fileContent * $repeatCount)
# Warm up the file cache and count the lines.
$lineCount = [IO.File]::ReadAllLines($file).Count
# Define the commands to compare as an array of scriptblocks.
$commands =
{ # switch -Regex -File + [IO.File]::Read/WriteAllLines()
$i = 0
$updatedLines = & { switch -Regex -File $file {
'^(.*? LINE2 )\d+(.*)$' { $Matches[1] + ++$i + $Matches[2] }
default { ++$i; $_ }
} }
[IO.File]::WriteAllLines($file, $updatedLines, [text.encoding]::ASCII)
},
{ # Get-Content + -replace + Set-Content
(Get-Content $file) | ForEach-Object -Begin { $i = 1 } -Process {
$_ -replace "LINE2 \d*", "LINE2 $i"
++$i
} | Set-Content -Encoding Ascii $file
}
# In PS Core v6.1+, also test -replace with a scriptblock operand.
if ($PSVersionTable.PSVersion.Major -ge 6 -and $PSVersionTable.PSVersion.Minor -ge 1) {
$commands +=
{ # -replace with scriptblock + [IO.File]::Read/WriteAllLines()
$i = 0
[IO.File]::WriteAllLines($file,
([IO.File]::ReadAllLines($file) -replace '(?<= LINE2 )\d+', { (++$i) }),
[text.encoding]::ASCII
)
}
} else {
Write-Warning "Skipping -replace-with-scriptblock command, because it isn't supported in this PS version."
}
# How many runs to average.
$runs = 100
Write-Verbose -vb "Averaging $runs runs with a $lineCount-line file of size $('{0:N2} MB' -f ((Get-Item $file).Length / 1mb))..."
Time-Command -Count $runs -ScriptBlock $commands
Here are sample results from my Windows 10 machine (the absolute timings aren't important, but hopefully the relative performance shown in the Factor column is somewhat representative); the PowerShell Core version used is v6.2.0-preview.4:
# Windows 10, Windows PowerShell v5.1
WARNING: Skipping -replace-with-scriptblock command, because it isn't supported in this PS version.
VERBOSE: Averaging 100 runs with a 6000-line file of size 0.29 MB...
Factor Secs (100-run avg.) Command
------ ------------------- -------
1.00 0.108 # switch -Regex -File + [IO.File]::Read/WriteAllLines()...
4.22 0.455 # Get-Content + -replace + Set-Content...
# Windows 10, PowerShell Core v6.2.0-preview 4
VERBOSE: Averaging 100 runs with a 6000-line file of size 0.29 MB...
Factor Secs (100-run avg.) Command
------ ------------------- -------
1.00 0.101 # switch -Regex -File + [IO.File]::Read/WriteAllLines()…
1.67 0.169 # -replace with scriptblock + [IO.File]::Read/WriteAllLines()…
4.98 0.503 # Get-Content + -replace + Set-Content…

Speed of Powershell Script. Optimisation sought

I have a working script whose objective is to parse data files for malformed rows before importing into Oracle. Processing a 450 MB csv file with more than 1 million rows of 8 columns takes a little over 2.5 hours and maxes out a single CPU core. Small files complete quickly (in seconds).
Oddly, a 350 MB file with a similar number of rows and 40 columns only takes 30 minutes.
My issue is that the files will grow over time, and 2.5 hours tying up a CPU isn't good. Can anyone recommend a code optimisation? A similarly titled post recommended local paths - which I'm already using.
$file = "\Your.csv"
$path = "C:\Folder"
$csv = Get-Content "$path$file"
# Count number of file headers
$count = ($csv[0] -split ',').count
# https://blogs.technet.microsoft.com/gbordier/2009/05/05/powershell-and-writing-files-how-fast-can-you-write-to-a-file/
$stream1 = [System.IO.StreamWriter] "$path\Passed$file-Pass.txt"
$stream2 = [System.IO.StreamWriter] "$path\Failed$file-Fail.txt"
# 2 validation steps: (1) the row's column count must be >= the header count; (2) after splitting off the first column, the remainder must be at least 40 characters.
$csv | Select -Skip 1 | % {
if( ($_ -split ',').count -ge $count -And ($_.split(',',2)[1]).Length -ge 40) {
$stream1.WriteLine($_)
} else {
$stream2.WriteLine($_)
}
}
$stream1.close()
$stream2.close()
Sample Data File:
C1,C2,C3,C4,C5,C6,C7,C8
ABC,000000000000006732,1063,2016-02-20,0,P,ESTIMATE,2015473497A10
ABC,000000000000006732,1110,2016-06-22,0,P,ESTIMATE,2015473497A10
ABC,,2016-06-22,,201501
,,,,,,,,
ABC,000000000000006732,1135,2016-08-28,0,P,ESTIMATE,2015473497B10
ABC,000000000000006732,1167,2015-12-20,0,P,ESTIMATE,2015473497B10
Get-Content is extremely slow in its default mode, which produces an array, when the file contains millions of lines - on all PowerShell versions, including 5.1. What's worse, you're assigning it to a variable, so nothing else happens until the entire file is read and split into lines. On an Intel i7-3770K CPU at 3.9 GHz, $csv = Get-Content $path takes more than 2 minutes to read a 350 MB file with 8 million lines.
Solution: Use IO.StreamReader to read a line and process it immediately.
In PowerShell 2, StreamReader is less optimized than in PS3+, but it is still faster than Get-Content.
Pipelining via | is at least several times slower than direct enumeration via flow-control statements such as while or foreach (the statement, not the cmdlet).
Solution: use the statements.
Splitting each line into an array of strings is slower than manipulating the single string directly.
Solution: use the IndexOf and Replace methods (not the -replace operator) to count character occurrences.
PowerShell always creates an internal pipeline when loops are used.
Solution: use the Invoke-Command { } trick for 2-3x speedup in this case!
Below is PS2-compatible code.
It's faster in PS3+ (30 seconds for 8 million lines in a 350MB csv on my PC).
$reader = New-Object IO.StreamReader ('r:\data.csv', [Text.Encoding]::UTF8, $true, 4MB)
$header = $reader.ReadLine()
$numCol = $header.Split(',').count
$writer1 = New-Object IO.StreamWriter ('r:\1.csv', $false, [Text.Encoding]::UTF8, 4MB)
$writer2 = New-Object IO.StreamWriter ('r:\2.csv', $false, [Text.Encoding]::UTF8, 4MB)
$writer1.WriteLine($header)
$writer2.WriteLine($header)
Write-Progress 'Filtering...' -status ' '
$watch = [Diagnostics.Stopwatch]::StartNew()
$currLine = 0
Invoke-Command { # the speed-up trick: disables internal pipeline
while (!$reader.EndOfStream) {
$s = $reader.ReadLine()
$slen = $s.length
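# Keep the line if (a) everything after the first comma is at least 40 characters long, and
# (b) the comma count (original length minus length with commas removed) plus 1 equals the header's column count.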
if ($slen-$s.IndexOf(',')-1 -ge 40 -and $slen-$s.Replace(',','').length+1 -eq $numCol){
$writer1.WriteLine($s)
} else {
$writer2.WriteLine($s)
}
if (++$currLine % 10000 -eq 0) {
$pctDone = $reader.BaseStream.Position / $reader.BaseStream.Length
Write-Progress 'Filtering...' -status "Line: $currLine" `
-PercentComplete ($pctDone * 100) `
-SecondsRemaining ($watch.ElapsedMilliseconds * (1/$pctDone - 1) / 1000)
}
}
} #Invoke-Command end
Write-Progress 'Filtering...' -Completed -status ' '
echo "Elapsed $($watch.Elapsed)"
$reader.close()
$writer1.close()
$writer2.close()
Another approach is to use regex in two passes (it's slower than the above code, though).
PowerShell 3 or newer is required due to array element property shorthand syntax:
$text = [IO.File]::ReadAllText('r:\data.csv')
$header = $text.substring(0, $text.indexOfAny("`r`n"))
$numCol = $header.split(',').count
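# Intended to match a line that contains exactly $numCol comma-separated fields ($numCol-1 commas)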
$rx = [regex]"\r?\n(?:[^,]*,){$($numCol-1)}[^,]*?(?=\r?\n|$)"
[IO.File]::WriteAllText('r:\1.csv', $header + "`r`n" +
($rx.matches($text).groups.value -join "`r`n"))
[IO.File]::WriteAllText('r:\2.csv', $header + "`r`n" + $rx.replace($text, ''))
If you feel like installing awk, you can do 1,000,000 records in under a second - seems like a good optimisation to me :-)
awk -F, '
NR==1 {f=NF; printf("Expecting: %d fields\n",f)} # First record, get expected number of fields
NF!=f {print > "Fail.txt"; next} # Fail for wrong field count
length($0)-length($1)<40 {print > "Fail.txt"; next} # Fail for wrong length
{print > "Pass.txt"} # Pass
' MillionRecord.csv
You can get gawk for Windows from here.
Windows is a bit awkward with single quotes in parameters, so if running under Windows I would use the same code, but formatted like this:
Save this in a file called commands.awk:
NR==1 {f=NF; printf("Expecting: %d fields\n",f)}
NF!=f {print > "Fail.txt"; next}
length($0)-length($1)<40 {print > "Fail.txt"; next}
{print > "Pass.txt"}
Then run with:
awk -F, -f commands.awk Your.csv
The remainder of this answer relates to a "Beat Hadoop with the shell" challenge mentioned in the comments section, and I wanted somewhere to save my code, so it's here. It runs in 6.002 seconds on my iMac over the 3.5 GB in 1,543 files, amounting to around 104 million records:
#!/bin/bash
doit(){
awk '!/^\[Result/{next} /1-0/{w++;next} /0-1/{b++} END{print w,b}' "$@"
}
export -f doit
find . -name \*.pgn -print0 | parallel -0 -n 4 -j 12 doit {}
Try experimenting with different looping strategies; for example, switching to a for loop cuts the processing time by more than 50%:
[String] $Local:file = 'Your.csv';
[String] $Local:path = 'C:\temp';
[System.Array] $Local:csv = $null;
[System.IO.StreamWriter] $Local:objPassStream = $null;
[System.IO.StreamWriter] $Local:objFailStream = $null;
[Int32] $Local:intHeaderCount = 0;
[Int32] $Local:intRow = 0;
[String] $Local:strRow = '';
[TimeSpan] $Local:objMeasure = 0;
try {
# Load.
$objMeasure = Measure-Command {
$csv = Get-Content -LiteralPath (Join-Path -Path $path -ChildPath $file) -ErrorAction Stop;
$intHeaderCount = ($csv[0] -split ',').count;
} #measure-command
'Load took {0}ms' -f $objMeasure.TotalMilliseconds;
# Create stream writers.
try {
$objPassStream = New-Object -TypeName System.IO.StreamWriter ( '{0}\Passed{1}-pass.txt' -f $path, $file );
$objFailStream = New-Object -TypeName System.IO.StreamWriter ( '{0}\Failed{1}-fail.txt' -f $path, $file );
# Process CSV (v1).
$objMeasure = Measure-Command {
$csv | Select-Object -Skip 1 | Foreach-Object {
if( (($_ -Split ',').Count -ge $intHeaderCount) -And (($_.Split(',',2)[1]).Length -ge 40) ) {
$objPassStream.WriteLine( $_ );
} else {
$objFailStream.WriteLine( $_ );
} #else-if
} #foreach-object
} #measure-command
'Process took {0}ms' -f $objMeasure.TotalMilliseconds;
# Process CSV (v2).
$objMeasure = Measure-Command {
for ( $intRow = 1; $intRow -lt $csv.Count; $intRow++ ) {
if( (($csv[$intRow] -Split ',').Count -ge $intHeaderCount) -And (($csv[$intRow].Split(',',2)[1]).Length -ge 40) ) {
$objPassStream.WriteLine( $csv[$intRow] );
} else {
$objFailStream.WriteLine( $csv[$intRow] );
} #else-if
} #for
} #measure-command
'Process took {0}ms' -f $objMeasure.TotalMilliseconds;
} #try
catch [System.Exception] {
'ERROR : Failed to create stream writers; exception was "{0}"' -f $_.Exception.Message;
} #catch
finally {
$objFailStream.close();
$objPassStream.close();
} #finally
} #try
catch [System.Exception] {
'ERROR : Failed to load CSV.';
} #catch
exit 0;

Powershell: Count instances of strings in a file using a list

I am trying to count, efficiently, the number of times each string (varying from 40 to 400+ characters) in "file1" occurs in "file2". file1 has about 2k lines and file2 has about 130k lines. I currently have a Unix solution that does it in about 2 minutes in a VM and about 5 in Cygwin, but I am trying to do it with PowerShell/Python since the files are on Windows and I use the output in Excel and with automation (AutoIt).
I have a solution, but it takes WAY too long (in about the same time that Cygwin finished all 2k lines, I had done only 40-50 lines in PowerShell!)
Although I haven't prepared a solution yet, I am open to using Python as well if there is a solution that can be fast and accurate.
Here is the Unix Code:
while read SEARCH_STRING;
do printf "%s$" "${SEARCH_STRING}";
grep -Fc "${SEARCH_STRING}" file2.csv;
done < file1.csv | tee -a output.txt;
And here is the Powershell code I currently have
$Target = Get-Content .\file1.csv
Foreach ($line in $Target){
#Just to keep strings small, since I found that not all
#strings were being compared correctly if they were 250+ chars
$line = $line.Substring(0,180)
$Coll = Get-Content .\file2.csv | Select-string -pattern "$line"
$cnt = $Coll | measure
$cnt.count
}
Any ideas of suggestions will help.
Thanks.
EDIT
I'm trying a modified solution suggested by C.B.
del .\output.txt
$Target = Get-Content .\file1.csv
$file= [System.IO.File]::ReadAllText( "C:\temp\file2.csv" )
Foreach ($line in $Target){
$line = [string]$line.Substring(0, $line.length/2)
$cnt = [regex]::matches( [string]$file, $line).count >> ".\output.txt"
}
But since the strings in file1 vary in length, I kept getting OutOfBound exceptions from the Substring function, so I halved (/2) the input string to try to get a match. And when I halve them, if the result has an unmatched open parenthesis, it tells me this:
Exception calling "Matches" with "2" argument(s): "parsing "CVE-2013-0796,04/02/2013,MFSA2013-35 SeaMonkey: WebGL
crash with Mesa graphics driver on Linux (C" - Not enough )'s."
At C:\temp\script_test.ps1:6 char:5
+ $cnt = [regex]::matches( [string]$file, $line).count >> ".\output.txt ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (:) [], MethodInvocationException
+ FullyQualifiedErrorId : ArgumentException
I don't know if there is a way to raise the input limit in PowerShell (my biggest string at the moment is 406 characters, but it could be bigger in the future), or whether I should just give up and try a Python solution.
Thoughts?
EDIT
Thanks to C.B. I got the correct answer, and it matches the output of the Bash script perfectly. Here is the full code that outputs results to a text file:
$Target = Get-Content .\file1.csv
$file= [System.IO.File]::ReadAllText( "C:\temp\file2.csv" )
Foreach ($line in $Target){
$cnt = [regex]::matches( $file, [regex]::escape($line)).count >> ".\output.txt"
}
Give this a try:
$Target = Get-Content .\file1.csv
$file= [System.IO.File]::ReadAllText( "c:\test\file2.csv" )
Foreach ($line in $Target){
$line = $line.Substring(0,180)
$cnt = [regex]::matches( $file, [regex]::escape($line)).count
}
One issue with your script is that you read file2.csv over and over again, for each line from file1.csv. Reading the file just once and storing the content in a variable should significantly speed things up. Try this:
$f2 = Get-Content .\file2.csv
foreach ($line in (gc .\file1.csv)) {
$line = $line.Substring(0,180)
($f2 | ? { $_ -match $line }).Count
}
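Since the strings from file1 are literal values rather than patterns, another variation on the same idea is to skip regex entirely and count literal occurrences with IndexOf; a rough sketch, reusing the hypothetical paths from above:
$file = [IO.File]::ReadAllText('C:\temp\file2.csv')
foreach ($line in (Get-Content .\file1.csv)) {
    # Count non-overlapping literal occurrences of $line in $file
    $count = 0
    $pos = $file.IndexOf($line)
    while ($pos -ge 0) {
        $count++
        $pos = $file.IndexOf($line, $pos + $line.Length)
    }
    $count
}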
