I have a working script whose objective is to parse data files for malformed rows before importing into Oracle. Processing a 450MB CSV file with more than 1 million rows and 8 columns takes a little over 2.5 hours and maxes out a single CPU core. Small files complete quickly (in seconds).
Oddly, a 350MB file with a similar number of rows and 40 columns takes only 30 minutes.
My issue is that the files will grow over time, and tying up a CPU for 2.5 hours isn't good. Can anyone recommend code optimisations? A similarly titled post recommended local paths, which I'm already using.
$file = "\Your.csv"
$path = "C:\Folder"
$csv = Get-Content "$path$file"
# Count number of file headers
$count = ($csv[0] -split ',').count
# https://blogs.technet.microsoft.com/gbordier/2009/05/05/powershell-and-writing-files-how-fast-can-you-write-to-a-file/
$stream1 = [System.IO.StreamWriter] "$path\Passed$file-Pass.txt"
$stream2 = [System.IO.StreamWriter] "$path\Failed$file-Fail.txt"
# 2 validation steps: (1) count number of headers is ge (2) Row split after first col. Those right hand side cols must total at least 40 characters.
$csv | Select -Skip 1 | % {
if( ($_ -split ',').count -ge $count -And ($_.split(',',2)[1]).Length -ge 40) {
$stream1.WriteLine($_)
} else {
$stream2.WriteLine($_)
}
}
$stream1.close()
$stream2.close()
Sample Data File:
C1,C2,C3,C4,C5,C6,C7,C8
ABC,000000000000006732,1063,2016-02-20,0,P,ESTIMATE,2015473497A10
ABC,000000000000006732,1110,2016-06-22,0,P,ESTIMATE,2015473497A10
ABC,,2016-06-22,,201501
,,,,,,,,
ABC,000000000000006732,1135,2016-08-28,0,P,ESTIMATE,2015473497B10
ABC,000000000000006732,1167,2015-12-20,0,P,ESTIMATE,2015473497B10
Get-Content is extremely slow in the default mode that produces an array when the file contains millions of lines on all PowerShell versions, including 5.1. What's worse, you're assigning it to a variable so until the entire file is read and split into lines nothing else happens. On Intel i7 3770K CPU at 3.9GHz $csv = Get-Content $path takes more than 2 minutes to read a 350MB file with 8 million lines.
Solution: Use IO.StreamReader to read a line and process it immediately.
In PowerShell2 StreamReader is less optimized than in PS3+ but still faster than Get-Content.
Pipelining via | is at least several times slower than direct enumeration via flow control statements such as while or foreach statement (not cmdlet).
Solution: use the statements.
Splitting each line into an array of strings is slower than manipulating only one string.
Solution: use IndexOf and Replace method (not operator) to count character occurrences.
PowerShell always creates an internal pipeline when loops are used.
Solution: use the Invoke-Command { } trick for 2-3x speedup in this case!
Below is PS2-compatible code.
It's faster in PS3+ (30 seconds for 8 million lines in a 350MB csv on my PC).
$reader = New-Object IO.StreamReader ('r:\data.csv', [Text.Encoding]::UTF8, $true, 4MB)
$header = $reader.ReadLine()
$numCol = $header.Split(',').count
$writer1 = New-Object IO.StreamWriter ('r:\1.csv', $false, [Text.Encoding]::UTF8, 4MB)
$writer2 = New-Object IO.StreamWriter ('r:\2.csv', $false, [Text.Encoding]::UTF8, 4MB)
$writer1.WriteLine($header)
$writer2.WriteLine($header)
Write-Progress 'Filtering...' -status ' '
$watch = [Diagnostics.Stopwatch]::StartNew()
$currLine = 0
Invoke-Command { # the speed-up trick: disables internal pipeline
while (!$reader.EndOfStream) {
$s = $reader.ReadLine()
$slen = $s.length
if ($slen-$s.IndexOf(',')-1 -ge 40 -and $slen-$s.Replace(',','').length+1 -eq $numCol){
$writer1.WriteLine($s)
} else {
$writer2.WriteLine($s)
}
if (++$currLine % 10000 -eq 0) {
$pctDone = $reader.BaseStream.Position / $reader.BaseStream.Length
Write-Progress 'Filtering...' -status "Line: $currLine" `
-PercentComplete ($pctDone * 100) `
-SecondsRemaining ($watch.ElapsedMilliseconds * (1/$pctDone - 1) / 1000)
}
}
} #Invoke-Command end
Write-Progress 'Filtering...' -Completed -status ' '
echo "Elapsed $($watch.Elapsed)"
$reader.close()
$writer1.close()
$writer2.close()
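For readers who want to check the filtering rule outside PowerShell, here is a minimal Python sketch of the same per-line test: it counts commas instead of splitting the line, and measures the text after the first comma. The function name `row_passes` and the `min_tail` parameter are illustrative only, and this is not the code that produced the timings above.

```python
def row_passes(line: str, num_cols: int, min_tail: int = 40) -> bool:
    """Return True when the row has exactly num_cols comma-separated fields
    and at least min_tail characters after the first comma."""
    # str.find returns -1 when there is no comma, just like IndexOf,
    # so the tail length then equals the full line length.
    tail_len = len(line) - line.find(",") - 1
    # Counting separators avoids building a list of fields for every row.
    field_count = line.count(",") + 1
    return tail_len >= min_tail and field_count == num_cols


header = "C1,C2,C3,C4,C5,C6,C7,C8"
num_cols = header.count(",") + 1

good_row = "ABC,000000000000006732,1063,2016-02-20,0,P,ESTIMATE,2015473497A10"
short_row = "ABC,,2016-06-22,,201501"

print(row_passes(good_row, num_cols))
print(row_passes(short_row, num_cols))
```

The design point is the same as in the answer: two cheap scans of one string are enough to decide pass or fail, so no per-row array is ever allocated.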
Another approach is to use regex in two passes (it's slower than the above code, though).
PowerShell 3 or newer is required due to array element property shorthand syntax:
$text = [IO.File]::ReadAllText('r:\data.csv')
$header = $text.substring(0, $text.indexOfAny("`r`n"))
$numCol = $header.split(',').count
$rx = [regex]"\r?\n(?:[^,]*,){$($numCol-1)}[^,]*?(?=\r?\n|$)"
[IO.File]::WriteAllText('r:\1.csv', $header + "`r`n" +
($rx.matches($text).groups.value -join "`r`n"))
[IO.File]::WriteAllText('r:\2.csv', $header + "`r`n" + $rx.replace($text, ''))
If you feel like installing awk, you can do 1,000,000 records in under a second - seems like a good optimisation to me :-)
awk -F, '
NR==1 {f=NF; printf("Expecting: %d fields\n",f)} # First record, get expected number of fields
NF!=f {print > "Fail.txt"; next} # Fail for wrong field count
length($0)-length($1)<40 {print > "Fail.txt"; next} # Fail for wrong length
{print > "Pass.txt"} # Pass
' MillionRecord.csv
You can get gawk for Windows from here.
Windows is a bit awkward with single quotes in parameters, so if running under Windows I would use the same code, but formatted like this:
Save this in a file called commands.awk:
NR==1 {f=NF; printf("Expecting: %d fields\n",f)}
NF!=f {print > "Fail.txt"; next}
length($0)-length($1)<40 {print > "Fail.txt"; next}
{print > "Pass.txt"}
Then run with:
awk -F, -f commands.awk Your.csv
The remainder of this answer relates to a "Beat hadoop with the shell" challenge mentioned in the comments section, and I wanted somewhere to save my code, so it's here.... runs in 6.002 seconds on my iMac over the 3.5GB in 1543 files amounting to around 104 million records:
#!/bin/bash
doit(){
awk '!/^\[Result/{next} /1-0/{w++;next} /0-1/{b++} END{print w,b}' "$@"
}
export -f doit
find . -name \*.pgn -print0 | parallel -0 -n 4 -j 12 doit {}
Try experimenting with different looping strategies, for example, switching to a for loop cuts the processing time by more than 50%, e.g.:
[String] $Local:file = 'Your.csv';
[String] $Local:path = 'C:\temp';
[System.Array] $Local:csv = $null;
[System.IO.StreamWriter] $Local:objPassStream = $null;
[System.IO.StreamWriter] $Local:objFailStream = $null;
[Int32] $Local:intHeaderCount = 0;
[Int32] $Local:intRow = 0;
[String] $Local:strRow = '';
[TimeSpan] $Local:objMeasure = 0;
try {
# Load.
$objMeasure = Measure-Command {
$csv = Get-Content -LiteralPath (Join-Path -Path $path -ChildPath $file) -ErrorAction Stop;
$intHeaderCount = ($csv[0] -split ',').count;
} #measure-command
'Load took {0}ms' -f $objMeasure.TotalMilliseconds;
# Create stream writers.
try {
$objPassStream = New-Object -TypeName System.IO.StreamWriter ( '{0}\Passed{1}-pass.txt' -f $path, $file );
$objFailStream = New-Object -TypeName System.IO.StreamWriter ( '{0}\Failed{1}-fail.txt' -f $path, $file );
# Process CSV (v1).
$objMeasure = Measure-Command {
$csv | Select-Object -Skip 1 | Foreach-Object {
if( (($_ -Split ',').Count -ge $intHeaderCount) -And (($_.Split(',',2)[1]).Length -ge 40) ) {
$objPassStream.WriteLine( $_ );
} else {
$objFailStream.WriteLine( $_ );
} #else-if
} #foreach-object
} #measure-command
'Process took {0}ms' -f $objMeasure.TotalMilliseconds;
# Process CSV (v2).
$objMeasure = Measure-Command {
for ( $intRow = 1; $intRow -lt $csv.Count; $intRow++ ) {
if( (($csv[$intRow] -Split ',').Count -ge $intHeaderCount) -And (($csv[$intRow].Split(',',2)[1]).Length -ge 40) ) {
$objPassStream.WriteLine( $csv[$intRow] );
} else {
$objFailStream.WriteLine( $csv[$intRow] );
} #else-if
} #for
} #measure-command
'Process took {0}ms' -f $objMeasure.TotalMilliseconds;
} #try
catch [System.Exception] {
'ERROR : Failed to create stream writers; exception was "{0}"' -f $_.Exception.Message;
} #catch
finally {
$objFailStream.close();
$objPassStream.close();
} #finally
} #try
catch [System.Exception] {
'ERROR : Failed to load CSV.';
} #catch
exit 0;
I've written this script (called SpeedTest.pl) to log my internet speed, to help resolve a problem with my ISP.
It works well, but only if I run it with a Perl interpreter (by double-clicking the script). I want to compile it into a stand-alone executable so I can run it on a different PC without Perl installed.
I've tried both pp and Perl2Exe, but when I launch SpeedTest.exe I see a lot of processes called "SpeedTest.exe" in Task Manager. If I don't kill all these processes, the OS crashes (a pop-up says: "the memory can't be written, blah blah blah").
Any ideas?
This is the script:
#!/usr/local/bin/perl
use strict;
use warnings;
use App::SpeedTest;
my($day, $month_temp, $year_temp)=(localtime)[3,4,5];
my $year = $year_temp+1900;
my $month = $month_temp+1;
my $date = "0"."$day"."-"."0"."$month"."-"."$year";
my $filename = "Speed Test - "."$date".".csv";
if (-e $filename) {
goto SPEEDTEST;
} else {
goto CREATEFILE;
}
CREATEFILE:
open(FILE, '>', $filename);
print FILE "Date".";"."Time".";"."Download [Mbit/s]".";"."Upload [Mbit/s]".";"."\n";
close FILE;
goto SPEEDTEST;
SPEEDTEST:
my $download = qx(speedtest -Q -C --no-upload);
my $upload = qx(speedtest -Q -C --no-download);
my @download_chars = split("", $download);
my @upload_chars = split("", $upload);
my $time = "$download_chars[12]"."$download_chars[13]"."$download_chars[14]"."$download_chars[15]"."$download_chars[16]";
my $download_speed = "$download_chars[49]"."$download_chars[50]"."$download_chars[51]"."$download_chars[52]"."$download_chars[53]";
my $upload_speed = "$upload_chars[49]"."$upload_chars[50]"."$upload_chars[51]"."$upload_chars[52]"."$upload_chars[53]";
my $output = "$date".";"."$time".";"."$download_speed".";"."$upload_speed".";";
open(FILE, '>>', $filename);
print FILE $output."\n";
close FILE;
sleep 300;
my($day_check, $month_temp_check, $year_temp_check)=(localtime)[3,4,5];
my $year_check = $year_temp_check+1900;
my $month_check = $month_temp_check+1;
my $date_check = "0"."$day_check"."-"."0"."$month_check"."-"."$year_check";
my $filename_check = "Speed Test - "."$date_check".".csv";
if ($filename eq $filename_check) {
goto SPEEDTEST;
} else {
$filename = $filename_check;
goto CREATEFILE;
}
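Well, Steffen really answered this in a comment, but here it is as an answer. Compile your Perl script into an EXE that does NOT have the same name as the program the script calls. Windows file names are case-insensitive, so when the script is compiled to SpeedTest.exe, each qx(speedtest ...) call most likely launches the compiled script itself instead of the real speedtest tool, and every copy then spawns more copies. For example:
speedtest.pl compiled into myspeedtest.exe, which calls speedtest.exe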
After spending some time looking for the most clear-cut way to check whether the body of a file has the same number of delimiters as the header, I came up with this code:
Param #user enters the directory path and delimiter they are checking for
(
[string]$source,
[string]$delim
)
#try {
$lineNum = 1
$thisOK = 0
$badLine = 0
$noDelim = 0
$archive = ("*archive*","*Archive*","*ARCHIVE*");
foreach ($files in Get-ChildItem $source -Exclude $archive) #folder directory may have sub folders, as a temp workaround just made sure to exclude any folder with archive
{
$read2 = New-Object System.IO.StreamReader($files.FullName)
$DataLine = (Get-Content $files.FullName)[0]
$validCount = ([char[]]$DataLine -eq $delim).count #count of delimeters in the header
$lineNum = 1 #used to write to host which line is bad in file
$thisOK = 0 #used for if condition to let the host know that the file has delimeters that line up with header
$badLine = 0 #used so the write-host doesnt meet the if condition and write the file is ok after throwing an error
while (!$read2.EndOfStream)
{
$line = $read2.ReadLine()
$total = $line.Split($delim).Length - 1;
if ($total -eq $validCount)
{
$thisOK = 1
}
elseif ($total -ne $validCount)
{
Write-Output "Error on line $lineNum for file $files. Line number $lineNum has $total delimeters and the header has $validCount"
$thisOK = 0
$badLine = 1
break; #break or else it will repeat each line that is bad
}
$lineNum++
}
if ($thisOK -eq 1 -and $badLine -eq 0 -and $validCount -ne 0)
{
Write-Output "$files is ok"
}
if ($validCount -eq 0)
{
Write-Output "$files does not contain entered delimeter: $delim"
}
$read2.Close()
$read2.Dispose()
} #end foreach loop
#} catch {
# $ErrorMessage = $_.Exception.Message
# $FailedItem = $_.Exception.ItemName
#}
It works for what I have tested so far. However, when it comes to larger files, it takes considerably longer. I was wondering what I can do or change for this code to make it process these text/CSV files more quickly?
Also, my try..catch statements are commented out since the script doesn't seem to run when I include them - no error just enters a new command line. As a thought I was looking to incorporate a simple GUI for other users to double check.
Sample file:
HeaderA|HeaderB|HeaderC|HeaderD //header line
DataLnA|DataLnBBDataLnC|DataLnD|DataLnE //bad line
DataLnA|DataLnB|DataLnC|DataLnD| //bad line
DataLnA|DataLnB|DataLnC|DataLnD //good line
Now that I look at it, I guess there could be an issue where there is the correct number of delimiters but the columns are misaligned, like this:
HeaderA|HeaderB|HeaderC|HeaderD
DataLnA|DataLnBDataLnC|DataLnD|
The main problem that I see is that you are reading the file twice: once with the call to Get-Content, which reads the entire file into memory, and a second time with your while loop. You can avoid the first full read, which should cut the run time substantially, by replacing this line:
$DataLine = (Get-Content $files.FullName)[0] #inefficient
with this:
$DataLine = Get-Content $files.FullName -First 1 #efficient
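The same idea can be taken one step further by reading the header from the stream that the loop already uses, so the file is opened and read exactly once. Below is a minimal Python sketch of that single-pass check, not a drop-in replacement for the script above; the function name `first_bad_line` and its return shape are assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple


def first_bad_line(lines: Iterable[str], delim: str) -> Optional[Tuple[int, int, int]]:
    """Return (line_number, found, expected) for the first line whose
    delimiter count differs from the header, or None if all lines agree."""
    it = iter(lines)
    header = next(it, None)
    if header is None:
        return None  # empty input: nothing to compare
    expected = header.rstrip("\r\n").count(delim)
    # Line 1 is the header, so data lines start at 2.
    for line_num, line in enumerate(it, start=2):
        found = line.rstrip("\r\n").count(delim)
        if found != expected:
            return line_num, found, expected
    return None


sample = [
    "HeaderA|HeaderB|HeaderC|HeaderD\n",
    "DataLnA|DataLnB|DataLnC|DataLnD\n",
    "DataLnA|DataLnB|DataLnC|DataLnD|\n",
]
print(first_bad_line(sample, "|"))
```

Because the function accepts any iterable of lines, an open file object can be passed directly, which keeps memory use flat regardless of file size.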
I am trying to do this on macOS. I tried fdupes, but it didn't work. Here is what I am trying to achieve:
There are 100 files in directory 'alpha'
Pick one file A and compare it with each remaining file in the directory 'alpha'
If content of file A matches any file (duplicate), delete the duplicate file
Move to file B, and compare with the remaining file, and do the same (check for duplicate)
Repeat the same until all files are checked for duplicates. Remaining files should be unique
Update
I slightly modified something similar I found here, but it does not detect all duplicates in a single run; I have to run it several times. I'm not sure it is working correctly.
use Digest::MD5;
%check = ();
while (<*>) {
-d and next;
$fname = "$_";
print "checking .. $fname\n";
$md5 = getmd5($fname) . "\n";
if ( !defined( $check{$md5} ) ) {
$check{$md5} = "$fname";
}
else {
print "Found duplicate files: $fname and $check{$md5}\n";
print "Deleting duplicate $check{$md5}\n";
unlink $check{$md5};
}
}
sub getmd5 {
my $file = "$_";
open( FH, "<", $file ) or die "Cannot open file: $!\n";
binmode(FH);
my $md5 = Digest::MD5->new;
$md5->addfile(FH);
close(FH);
return $md5->hexdigest;
}
You should limit the number of times that you have to read each file's contents:
Inventory the files using Path::Class or some similar method.
a. Build a hash relating file size and Digest::MD5 digest to a list of file names.
Compare likely duplicates only, i.e. files whose size and digest both match.
The following is untested:
use strict;
use warnings;
use Path::Class;
use Digest::MD5;
my $dir = dir('.');
my %files_per_digest;
# Inventory Directory
while ( my $file = $dir->next ) {
    next if $file->is_dir;    # skip '.', '..' and subdirectories
    my $size = $file->stat->size;
    my $digest = do {
        my $md5 = Digest::MD5->new;
        $md5->addfile( $file->openr );
        $md5->hexdigest;
    };
    push @{ $files_per_digest{"$size - $digest"} }, $file;
}
# Compare likely duplicates only
for my $files ( grep { @$_ > 1 } values %files_per_digest ) {
    # Sort by alpha
    @$files = sort @$files;
    print "Comparing: @$files\n";
    for my $i ( reverse 0 .. $#$files ) {
        for my $j ( 0 .. $i - 1 ) {
            my $fh1 = $files->[$i]->openr;
            my $fh2 = $files->[$j]->openr;
            my $diff = 0;
            while ( !eof($fh1) && !eof($fh2) ) {
                $diff = 1, last if scalar(<$fh1>) ne scalar(<$fh2>);
            }
            # Duplicate only if no line differed and both files ended together
            if ( !$diff and eof($fh1) and eof($fh2) ) {
                print " $files->[$i] ($i) is duplicate of $files->[$j] ($j)\n";
                $files->[$i]->remove();
                splice @$files, $i, 1;
                last;    # file $i is gone; move on to the next candidate
            }
        }
    }
}
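The inventory-then-compare idea above can also be sketched in Python. This version differs from the Perl in two labeled ways: it groups by size first so that files with a unique size are never read at all, and it uses SHA-256 from `hashlib` rather than MD5. It only reports groups of identical files and deletes nothing. The names `file_digest` and `find_duplicate_groups` are illustrative.

```python
import hashlib
import os
from collections import defaultdict
from typing import Dict, List


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(directory: str) -> List[List[str]]:
    """Return lists of paths whose contents hash identically (top level only)."""
    by_size: Dict[int, List[str]] = defaultdict(list)
    for entry in os.scandir(directory):
        if entry.is_file(follow_symlinks=False):
            by_size[entry.stat(follow_symlinks=False).st_size].append(entry.path)

    groups: List[List[str]] = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        by_digest: Dict[str, List[str]] = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return groups
```

Calling `find_duplicate_groups("alpha")` returns one sorted list per set of identical files; keeping the first path in each list and removing the rest would leave only unique files, in a single run.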
I've used rdfind in the past with very good success. It's very accurate, fast, and seems to run leaner than fdupes. According to RDFind's web site (http://rdfind.pauldreik.se/), it can be installed using MacPorts.
PowerShell question
Currently I have 5-10 log files, each about 20-25GB, and I need to search each of them to check whether any of 900 different search parameters match. I have written a basic PowerShell script that searches the whole log file for one search parameter. If it matches, it dumps the results into a separate text file. The problem is that it is pretty slow. I was wondering if there is a way to speed this up by searching for all 900 parameters at once and reading through the log only once. Any help would be good, even if it's just improving the script.
basic overview :
1 csv file with all the 900 items listed under an "item" column
1 log file (.txt)
1 result file (.txt)
1 ps1 file
here is the code i have below for powershell in a PS1 file:
$search = "filepath to csv file"
$log = "filepath to log file"
$result = "file path to result text file"
$list = Import-Csv $search
foreach ($address in $list) {
    Get-Content $log | Select-String $address.item | Add-Content $result
    # below is just for displaying a rudimentary counter of how far through searching it is
    $i = $i + 1
    echo $i
}
900 search terms is quite large a group. Can you reduce its size by using regular expressions? A trivial solution is based on reading the file row-by-row and looking for matches. Set up a collection that contains regexps or literal strings for search terms. Like so,
$terms = @("Keyword[12]", "KeywordA", "KeyphraseOne") # Array of regexps
$src = "path-to-some-huge-file" # Path to the file
$reader = new-object IO.StreamReader($src) # Stream reader to file
while(($line = $reader.ReadLine()) -ne $null){ # Read one row at a time
foreach($t in $terms) { # For each search term...
if($line -match $t) { # check if the line read is a match...
$("Hit: {0} ({1})" -f $line, $t) # and print match
}
}
}
$reader.Close() # Close the reader
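If the 900 items are plain strings rather than patterns, one way to avoid the inner loop over terms is to fold them into a single alternation and test each line once. The following Python sketch shows the idea under that assumption (every term is escaped and treated as literal text, unlike the regexps in the snippet above); the function name `matching_lines` is hypothetical.

```python
import re
from typing import Iterable, Iterator, List


def matching_lines(lines: Iterable[str], terms: List[str]) -> Iterator[str]:
    """Yield each line that contains at least one of the literal terms,
    scanning the input once."""
    if not terms:
        return  # an empty alternation would match every line
    # re.escape makes characters such as '.' match literally.
    pattern = re.compile("|".join(re.escape(term) for term in terms))
    for line in lines:
        if pattern.search(line):
            yield line.rstrip("\r\n")


log_lines = [
    "12:00 connect from 10.0.0.1 ok\n",
    "12:01 nothing to see here\n",
    "12:02 lookup host-7 failed\n",
    "12:03 connect from 10x0y0z1 ok\n",
]
for hit in matching_lines(log_lines, ["10.0.0.1", "host-7"]):
    print(hit)
```

Passing an open file object as `lines` streams the log line by line, so each 20-25GB file is read once regardless of how many terms are in the list.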
Surely this is going to be incredibly painful on any parser you use just based on the file sizes you have there, but if your log files are of a format that is standard (for example IIS log files) then you could consider using a Log parsing app such as Log Parser Studio instead of Powershell?