Increase performance for checking file delimiters

After spending some time looking for the most clear-cut way to check whether the body of a file has the same number of delimiters as its header, I came up with this code:
#user enters the directory path and delimiter they are checking for
Param (
[string]$source,
[string]$delim
)
#try {
$lineNum = 1
$thisOK = 0
$badLine = 0
$noDelim = 0
$archive = ("*archive*","*Archive*","*ARCHIVE*");
foreach ($files in Get-ChildItem $source -Exclude $archive) #the directory may have subfolders; as a temporary workaround, exclude any folder with "archive" in its name
{
$read2 = New-Object System.IO.StreamReader($files.FullName)
$DataLine = (Get-Content $files.FullName)[0]
$validCount = ([char[]]$DataLine -eq $delim).count #count of delimiters in the header
$lineNum = 1 #used to write to host which line in the file is bad
$thisOK = 0 #used in the if condition to tell the host that the file's delimiters line up with the header
$badLine = 0 #used so the write-host doesn't meet the if condition and report the file as OK after throwing an error
while (!$read2.EndOfStream)
{
$line = $read2.ReadLine()
$total = $line.Split($delim).Length - 1;
if ($total -eq $validCount)
{
$thisOK = 1
}
elseif ($total -ne $validCount)
{
Write-Output "Error on line $lineNum for file $files. Line number $lineNum has $total delimiters and the header has $validCount"
$thisOK = 0
$badLine = 1
break; #break or else it will repeat each line that is bad
}
$lineNum++
}
if ($thisOK -eq 1 -and $badLine -eq 0 -and $validCount -ne 0)
{
Write-Output "$files is ok"
}
if ($validCount -eq 0)
{
Write-Output "$files does not contain entered delimiter: $delim"
}
$read2.Close()
$read2.Dispose()
} #end foreach loop
#} catch {
# $ErrorMessage = $_.Exception.Message
# $FailedItem = $_.Exception.ItemName
#}
It works for what I have tested so far. However, on larger files it takes considerably longer. What can I do or change in this code to make it process these text/CSV files more quickly?
Also, my try..catch statements are commented out because the script doesn't seem to run when I include them - no error, it just returns to a new command line. As a further thought, I was looking to incorporate a simple GUI so other users can double-check.
Sample file:
HeaderA|HeaderB|HeaderC|HeaderD //header line
DataLnA|DataLnBBDataLnC|DataLnD|DataLnE //bad line
DataLnA|DataLnB|DataLnC|DataLnD| //bad line
DataLnA|DataLnB|DataLnC|DataLnD //good line
Now that I look at it, I guess there could be an issue where there is the correct number of delimiters but the columns mismatch, like this:
HeaderA|HeaderB|HeaderC|HeaderD
DataLnA|DataLnBDataLnC|DataLnD|

The main problem that I see is that you are reading the file twice -- once with the call to Get-Content, which reads the entire file into memory, and a second time with your while loop. You can double the speed of your process by replacing this line:
$DataLine = (Get-Content $files.FullName)[0] #inefficient
with this:
$DataLine = Get-Content $files.FullName -First 1 #efficient
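Going further, here is a sketch of the same idea that avoids Get-Content altogether: since $read2 is already open on the file, its first ReadLine() call can supply the header, so the file is only ever read once.
$read2 = New-Object System.IO.StreamReader($files.FullName)
$DataLine = $read2.ReadLine() #the first line read is the header
$validCount = ([char[]]$DataLine -eq $delim).Count #delimiter count in the header
#the existing while (!$read2.EndOfStream) loop then sees only the data rows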

Related

Powershell IF conditional isn't firing in the way I expected. Unsure what I'm doing wrong

I am writing a simple script that makes use of 7zip's command-line to extract archives within folders and then delete the original archives.
There is a part of my script that isn't behaving how I would expect it to. I can't get my if statement to trigger correctly. Here's a snippet of the code:
if($CurrentRar.Contains(".part1.rar")){
[void] $RarGroup.Add($CurrentRar)
# Value of CurrentRar:
# Factory_Selection_2.part1.rar
$CurrentRarBase = $CurrentRar.TrimEnd(".part1.rar")
# Value: Factory_Selection_2
for ($j = 1; $j -lt $AllRarfiles.Count; $j++){
$NextRar = $AllRarfiles[$j].Name
# Value: Factory_Selection_2.part2.rar
if($NextRar.Contains("$CurrentRarBase.part$j.rar")){
Write-Host "Test Hit" -ForegroundColor Green
# Never fires, and I have no idea why
# [void] $RarGroup.Add($NextRar)
}
}
$RarGroups.Add($RarGroup)
}
if($NextRar.Contains("$CurrentRarBase.part$j.rar")) is the line that I can't get to fire.
If I shorten it to if($NextRar.Contains("$CurrentRarBase.part")), it fires true. But as soon as I add the inline $j it always triggers false. I've tried casting $j to string but it still doesn't work. Am I missing something stupid?
Appreciate any help.
The issue seems to be your for statement and the fact that arrays / lists are zero-indexed (meaning they start at 0).
In your case, index 0 of $AllRarfiles is probably part1, and your for statement starts at 1, but the file name at index 1 does not contain part1 ($NextRar.Contains("$CurrentRarBase.part$j.rar")) but part2 ($j + 1).
As a table comparison:
Index / $j | Value                         | Built string for comparison (with index)
0          | Factory_Selection_2.part1.rar | Factory_Selection_2.part0.rar
1          | Factory_Selection_2.part2.rar | Factory_Selection_2.part1.rar
2          | Factory_Selection_2.part3.rar | Factory_Selection_2.part2.rar
3          | Factory_Selection_2.part4.rar | Factory_Selection_2.part3.rar
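Given that off-by-one, a minimal fix (a sketch that keeps your loop structure) is to build the comparison name from $j + 1:
for ($j = 1; $j -lt $AllRarfiles.Count; $j++){
    $NextRar = $AllRarfiles[$j].Name
    # the file at index $j is part($j + 1), so build the name with that offset
    if($NextRar.Contains("$CurrentRarBase.part$($j + 1).rar")){
        [void] $RarGroup.Add($NextRar)
    }
}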
Another, simpler approach
Since it seems you want to group split RAR files that belong together, you could also use a simpler approach with Group-Object:
# collect and group all RAR files.
$rarGroups = Get-ChildItem -LiteralPath 'C:\somewhere\' -Filter '*.rar' | Group-Object -Property { $_.Name -replace '\.part\d+\.rar$' }
# do some stuff afterwards
foreach($rarGroup in $rarGroups){
Write-Verbose -Verbose "Processing RAR group: $($rarGroup.Name)"
foreach($rarFile in $rarGroup.Group) {
Write-Verbose -Verbose "`tCurrent RAR file: $($rarFile.Name)"
# do some stuff per file
}
}
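As a hypothetical follow-up for the extract-and-delete part (the 7z.exe path here is an assumption based on a default 7-Zip install, not something from the question): extracting the first part of a set makes 7-Zip read the remaining parts on its own.
$sevenZip = 'C:\Program Files\7-Zip\7z.exe' # adjust to your install
foreach($rarGroup in $rarGroups){
    # lexical sort is fine while part numbers stay single-digit
    $firstPart = $rarGroup.Group | Sort-Object Name | Select-Object -First 1
    & $sevenZip x $firstPart.FullName "-o$($firstPart.DirectoryName)" -y
}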

Split database backup file using PowerShell and merge it back

I am new to PowerShell and having some issues with files getting corrupted when splitting them and reattaching them using PowerShell.
I have a remote server from which I need to download a .bak file 44GB in size. To be able to do this I am splitting the file into smaller (100MB) pieces using this script.
$from = "D:\largebakfile\largefile.bak"
$rootName = "D:\foldertoplacelargebakfile\part"
$ext = "PART_"
$upperBound = 100MB
$fromFile = [io.file]::OpenRead($from)
$buff = new-object byte[] $upperBound
$count = $idx = 0
try {
do {
"Reading $upperBound"
$count = $fromFile.Read($buff, 0, $buff.Length)
if ($count -gt 0) {
$to = "{0}{1}{2}" -f ($rootName, $idx, $ext)
$toFile = [io.file]::OpenWrite($to)
try {
"Writing $count to $to"
$tofile.Write($buff, 0, $count)
} finally {
$tofile.Close()
}
}
$idx ++
} while ($count -gt 0)
}
finally {
$fromFile.Close()
}
After this is done and the "PART_" files are downloaded to the local computer, I use this script to merge the files back together into one .bak file.
# replace with the location of the "PARTS" file
Set-Location "C:\Folderwithsplitfiles\Parts"
# replace with the SQL backup folder in your computer.
$outFile = "C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Backup\newname.bak"
#The prefix for all PARTS files
$infilePrefix ="C:\Folderwithsplitfiles\Parts\PART_"
$ostream = [System.Io.File]::OpenWrite($outFile)
$chunkNum = 1
$infileName = "$infilePrefix$chunkNum"
$offset = 0
while(Test-Path $infileName) {
$bytes = [System.IO.File]::ReadAllBytes($infileName)
$ostream.Write($bytes, 0, $bytes.Count)
Write-Host "read $infileName"
$chunkNum += 1
$infileName = "$infilePrefix$chunkNum"
}
$ostream.close();
#Get-FileHash $outfile | Format-List
When trying to restore the database in SSMS I get an error basically saying that the file is corrupted and can't be restored.
I have been struggling with this for a couple of days now and can't seem to get my head around it.
Everything seems like it's working, but something is causing these errors. Does anyone have any ideas?
Might I suggest you look at the Background Intelligent Transfer Service (BITS): you should be able to download the whole file in one piece and save overcomplicating anything. (It also supports starting/stopping the transfer, etc.)
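For example, a minimal sketch (the server, share, and file names are placeholders; BITS needs an SMB or HTTP(S) source):
Import-Module BitsTransfer
# pulls the whole 44GB file in one resumable transfer
Start-BitsTransfer -Source '\\RemoteServer\Backups\largefile.bak' `
                   -Destination 'D:\largebakfile\largefile.bak' `
                   -DisplayName 'bak download'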

Kaldi librispeech data preparation error

I'm trying to build an ASR system. I'm using the Kaldi manual and the LibriSpeech corpus.
In the data preparation step I get this error:
utils/data/get_utt2dur.sh: segments file does not exist so getting durations from wave files
utils/data/get_utt2dur.sh: could not get utterance lengths from sphere-file headers, using wav-to-duration
utils/data/get_utt2dur.sh: line 99: wav-to-duration: command not found
And here is the piece of code where this error occurs:
if cat $data/wav.scp | perl -e '
while (<>) { s/\|\s*$/ |/; # make sure final | is preceded by space.
@A = split;
if (!($#A == 5 && $A[1] =~ m/sph2pipe$/ &&
$A[2] eq "-f" && $A[3] eq "wav" && $A[5] eq "|")) { exit (1); }
$utt = $A[0]; $sphere_file = $A[4];
if (!open(F, "<$sphere_file")) { die "Error opening sphere file $sphere_file"; }
$sample_rate = -1; $sample_count = -1;
for ($n = 0; $n <= 30; $n++) {
$line = <F>;
if ($line =~ m/sample_rate -i (\d+)/) { $sample_rate = $1; }
if ($line =~ m/sample_count -i (\d+)/) { $sample_count = $1; }
if ($line =~ m/end_head/) { break; }
}
close(F);
if ($sample_rate == -1 || $sample_count == -1) {
die "could not parse sphere header from $sphere_file";
}
$duration = $sample_count * 1.0 / $sample_rate;
print "$utt $duration\n";
} ' > $data/utt2dur; then
echo "$0: successfully obtained utterance lengths from sphere-file headers"
else
echo "$0: could not get utterance lengths from sphere-file headers, using wav-to-duration"
if ! command -v wav-to-duration >/dev/null; then
echo "$0: wav-to-duration is not on your path"
exit 1;
fi
In the file wav.scp I have lines like this:
6295-64301-0002 flac -c -d -s /home/tinin/kaldi/egs/librispeech/s5/LibriSpeech/dev-clean/6295/64301/6295-64301-0002.flac |
In this dataset I have only flac files (they were downloaded via the provided script), and I don't understand why it searches for wav files. And how do I run the data preparation correctly? (I didn't change the source code from the manual.)
Also, if you could explain what is happening in this code, I would be very grateful, because I'm not familiar with bash and Perl.
Thanks a lot!
The problem I see from this line
utils/data/get_utt2dur.sh: line 99: wav-to-duration: command not found
is that you have not added the Kaldi tools to your path.
Check the file path.sh and see if the directories it adds to your path are correct (it has ../../.. inside, which might not match your current folder setup).
As for the Perl script, it counts the samples of the sound file and then divides by the sample rate to get the duration. Don't worry about the word 'wav': your files may be in another format; it's just the name of the Kaldi tools.
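As a quick check (a sketch, assuming you run it from egs/librispeech/s5 and that KALDI_ROOT in path.sh points at your actual Kaldi checkout):
. ./path.sh   # should put the compiled Kaldi binaries (featbin etc.) on PATH
command -v wav-to-duration >/dev/null \
  && echo "wav-to-duration found" \
  || echo "still missing: check KALDI_ROOT in path.sh and that src/ is compiled"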

Speed of Powershell Script. Optimisation sought

I have a working script whose objective is to parse data files for malformed rows before importing into Oracle. Processing a 450MB CSV file with more than 1 million rows and 8 columns takes a little over 2.5 hours and maxes out a single CPU core. Small files complete quickly (in seconds).
Oddly, a 350MB file with a similar number of rows and 40 columns only takes 30 minutes.
My issue is that the files will grow over time, and 2.5 hours tying up a CPU ain't good. Can anyone recommend code optimisation? A similarly titled post recommended local paths - which I'm already doing.
$file = "\Your.csv"
$path = "C:\Folder"
$csv = Get-Content "$path$file"
# Count number of file headers
$count = ($csv[0] -split ',').count
# https://blogs.technet.microsoft.com/gbordier/2009/05/05/powershell-and-writing-files-how-fast-can-you-write-to-a-file/
$stream1 = [System.IO.StreamWriter] "$path\Passed$file-Pass.txt"
$stream2 = [System.IO.StreamWriter] "$path\Failed$file-Fail.txt"
# 2 validation steps: (1) the field count must be -ge the header count; (2) after splitting off the first column, the right-hand side columns must total at least 40 characters.
$csv | Select -Skip 1 | % {
if( ($_ -split ',').count -ge $count -And ($_.split(',',2)[1]).Length -ge 40) {
$stream1.WriteLine($_)
} else {
$stream2.WriteLine($_)
}
}
$stream1.close()
$stream2.close()
Sample Data File:
C1,C2,C3,C4,C5,C6,C7,C8
ABC,000000000000006732,1063,2016-02-20,0,P,ESTIMATE,2015473497A10
ABC,000000000000006732,1110,2016-06-22,0,P,ESTIMATE,2015473497A10
ABC,,2016-06-22,,201501
,,,,,,,,
ABC,000000000000006732,1135,2016-08-28,0,P,ESTIMATE,2015473497B10
ABC,000000000000006732,1167,2015-12-20,0,P,ESTIMATE,2015473497B10
Get-Content is extremely slow in the default mode that produces an array when the file contains millions of lines, on all PowerShell versions including 5.1. What's worse, you're assigning it to a variable, so until the entire file is read and split into lines nothing else happens. On an Intel i7 3770K CPU at 3.9GHz, $csv = Get-Content $path takes more than 2 minutes to read a 350MB file with 8 million lines.
Solution: Use IO.StreamReader to read a line and process it immediately.
In PowerShell2 StreamReader is less optimized than in PS3+ but still faster than Get-Content.
Pipelining via | is at least several times slower than direct enumeration via flow-control statements such as while or foreach (the statement, not the cmdlet).
Solution: use the statements.
Splitting each line into an array of strings is slower than manipulating only one string.
Solution: use the IndexOf and Replace methods (not the -replace operator) to count character occurrences.
PowerShell always creates an internal pipeline when loops are used.
Solution: use the Invoke-Command { } trick for 2-3x speedup in this case!
Below is PS2-compatible code.
It's faster in PS3+ (30 seconds for 8 million lines in a 350MB csv on my PC).
$reader = New-Object IO.StreamReader ('r:\data.csv', [Text.Encoding]::UTF8, $true, 4MB)
$header = $reader.ReadLine()
$numCol = $header.Split(',').count
$writer1 = New-Object IO.StreamWriter ('r:\1.csv', $false, [Text.Encoding]::UTF8, 4MB)
$writer2 = New-Object IO.StreamWriter ('r:\2.csv', $false, [Text.Encoding]::UTF8, 4MB)
$writer1.WriteLine($header)
$writer2.WriteLine($header)
Write-Progress 'Filtering...' -status ' '
$watch = [Diagnostics.Stopwatch]::StartNew()
$currLine = 0
Invoke-Command { # the speed-up trick: disables internal pipeline
while (!$reader.EndOfStream) {
$s = $reader.ReadLine()
$slen = $s.length
if ($slen-$s.IndexOf(',')-1 -ge 40 -and $slen-$s.Replace(',','').length+1 -eq $numCol){
$writer1.WriteLine($s)
} else {
$writer2.WriteLine($s)
}
if (++$currLine % 10000 -eq 0) {
$pctDone = $reader.BaseStream.Position / $reader.BaseStream.Length
Write-Progress 'Filtering...' -status "Line: $currLine" `
-PercentComplete ($pctDone * 100) `
-SecondsRemaining ($watch.ElapsedMilliseconds * (1/$pctDone - 1) / 1000)
}
}
} #Invoke-Command end
Write-Progress 'Filtering...' -Completed -status ' '
echo "Elapsed $($watch.Elapsed)"
$reader.close()
$writer1.close()
$writer2.close()
Another approach is to use regex in two passes (it's slower than the above code, though): the pattern matches only rows with exactly $numCol comma-separated fields, so the matches form the pass file, and replacing them away leaves the fail file.
PowerShell 3 or newer is required due to the array element property shorthand syntax:
$text = [IO.File]::ReadAllText('r:\data.csv')
$header = $text.substring(0, $text.indexOfAny("`r`n"))
$numCol = $header.split(',').count
$rx = [regex]"\r?\n(?:[^,]*,){$($numCol-1)}[^,]*?(?=\r?\n|$)"
[IO.File]::WriteAllText('r:\1.csv', $header + "`r`n" +
($rx.matches($text).groups.value -join "`r`n"))
[IO.File]::WriteAllText('r:\2.csv', $header + "`r`n" + $rx.replace($text, ''))
If you feel like installing awk, you can do 1,000,000 records in under a second - seems like a good optimisation to me :-)
awk -F, '
NR==1 {f=NF; printf("Expecting: %d fields\n",f)} # First record, get expected number of fields
NF!=f {print > "Fail.txt"; next} # Fail for wrong field count
length($0)-length($1)<40 {print > "Fail.txt"; next} # Fail for wrong length
{print > "Pass.txt"} # Pass
' MillionRecord.csv
You can get gawk for Windows from here.
Windows is a bit awkward with single quotes in parameters, so if running under Windows I would use the same code, but formatted like this:
Save this in a file called commands.awk:
NR==1 {f=NF; printf("Expecting: %d fields\n",f)}
NF!=f {print > "Fail.txt"; next}
length($0)-length($1)<40 {print > "Fail.txt"; next}
{print > "Pass.txt"}
Then run with:
awk -F, -f commands.awk Your.csv
The remainder of this answer relates to a "Beat hadoop with the shell" challenge mentioned in the comments section, and I wanted somewhere to save my code, so it's here. It runs in 6.002 seconds on my iMac over 3.5GB in 1,543 files, amounting to around 104 million records:
#!/bin/bash
doit(){
awk '!/^\[Result/{next} /1-0/{w++;next} /0-1/{b++} END{print w,b}' "$@"
}
export -f doit
find . -name \*.pgn -print0 | parallel -0 -n 4 -j 12 doit {}
Try experimenting with different looping strategies, for example, switching to a for loop cuts the processing time by more than 50%, e.g.:
[String] $Local:file = 'Your.csv';
[String] $Local:path = 'C:\temp';
[System.Array] $Local:csv = $null;
[System.IO.StreamWriter] $Local:objPassStream = $null;
[System.IO.StreamWriter] $Local:objFailStream = $null;
[Int32] $Local:intHeaderCount = 0;
[Int32] $Local:intRow = 0;
[String] $Local:strRow = '';
[TimeSpan] $Local:objMeasure = 0;
try {
# Load.
$objMeasure = Measure-Command {
$csv = Get-Content -LiteralPath (Join-Path -Path $path -ChildPath $file) -ErrorAction Stop;
$intHeaderCount = ($csv[0] -split ',').count;
} #measure-command
'Load took {0}ms' -f $objMeasure.TotalMilliseconds;
# Create stream writers.
try {
$objPassStream = New-Object -TypeName System.IO.StreamWriter ( '{0}\Passed{1}-pass.txt' -f $path, $file );
$objFailStream = New-Object -TypeName System.IO.StreamWriter ( '{0}\Failed{1}-fail.txt' -f $path, $file );
# Process CSV (v1).
$objMeasure = Measure-Command {
$csv | Select-Object -Skip 1 | Foreach-Object {
if( (($_ -Split ',').Count -ge $intHeaderCount) -And (($_.Split(',',2)[1]).Length -ge 40) ) {
$objPassStream.WriteLine( $_ );
} else {
$objFailStream.WriteLine( $_ );
} #else-if
} #foreach-object
} #measure-command
'Process took {0}ms' -f $objMeasure.TotalMilliseconds;
# Process CSV (v2).
$objMeasure = Measure-Command {
for ( $intRow = 1; $intRow -lt $csv.Count; $intRow++ ) {
if( (($csv[$intRow] -Split ',').Count -ge $intHeaderCount) -And (($csv[$intRow].Split(',',2)[1]).Length -ge 40) ) {
$objPassStream.WriteLine( $csv[$intRow] );
} else {
$objFailStream.WriteLine( $csv[$intRow] );
} #else-if
} #for
} #measure-command
'Process took {0}ms' -f $objMeasure.TotalMilliseconds;
} #try
catch [System.Exception] {
'ERROR : Failed to create stream writers; exception was "{0}"' -f $_.Exception.Message;
} #catch
finally {
$objFailStream.close();
$objPassStream.close();
} #finally
} #try
catch [System.Exception] {
'ERROR : Failed to load CSV.';
} #catch
exit 0;

Powershell: Count instances of strings in a file using a list

I am trying to count, efficiently, the number of times each string (varying from 40 to 400+ characters) in "file1" occurs in "file2". file1 has about 2k lines and file2 has about 130k lines. I currently have a Unix solution that does it in about 2 minutes in a VM and about 5 in Cygwin, but I am trying to do it with PowerShell/Python since the files are on Windows, and I use the output in Excel and with automation (AutoIt).
I have a solution, but it takes WAY too long (in about the same time that Cygwin finished all 2k lines, PowerShell had done only 40-50 lines!).
Although I haven't prepared a solution yet, I am open to using Python as well if there is a solution that is fast and accurate.
Here is the Unix code:
while read SEARCH_STRING;
do printf "%s$" "${SEARCH_STRING}";
grep -Fc "${SEARCH_STRING}" file2.csv;
done < file1.csv | tee -a output.txt;
And here is the PowerShell code I currently have:
$Target = Get-Content .\file1.csv
Foreach ($line in $Target){
#Just to keep strings small, since I found that not all
#strings were being compared correctly if they where 250+ chars
$line = $line.Substring(0,180)
$Coll = Get-Content .\file2.csv | Select-string -pattern "$line"
$cnt = $Coll | measure
$cnt.count
}
Any ideas or suggestions will help.
Thanks.
EDIT
I'm trying a modified solution suggested by C.B.
del .\output.txt
$Target = Get-Content .\file1.csv
$file= [System.IO.File]::ReadAllText( "C:\temp\file2.csv" )
Foreach ($line in $Target){
$line = [string]$line.Substring(0, $line.length/2)
$cnt = [regex]::matches( [string]$file, $line).count >> ".\output.txt"
}
But since my strings in file1 vary in length, I kept getting OutOfBound exceptions from the Substring function, so I halved (/2) the input string to try to get a match. And when I try to halve them, if a string has an open parenthesis, it tells me this:
Exception calling "Matches" with "2" argument(s): "parsing "CVE-2013-0796,04/02/2013,MFSA2013-35 SeaMonkey: WebGL
crash with Mesa graphics driver on Linux (C" - Not enough )'s."
At C:\temp\script_test.ps1:6 char:5
+ $cnt = [regex]::matches( [string]$file, $line).count >> ".\output.txt ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (:) [], MethodInvocationException
+ FullyQualifiedErrorId : ArgumentException
I don't know if there is a way to raise the input limit in PowerShell (my biggest size at the moment is 406 characters, but it could be bigger in the future), or whether I should just give up and try a Python solution.
Thoughts?
EDIT
Thanks to @C.B. I got the correct answer, and it matches the output of the Bash script perfectly. Here is the full code that outputs the results to a text file:
$Target = Get-Content .\file1.csv
$file= [System.IO.File]::ReadAllText( "C:\temp\file2.csv" )
Foreach ($line in $Target){
$cnt = [regex]::matches( $file, [regex]::escape($line)).count >> ".\output.txt"
}
Give this a try - [regex]::Escape treats the search string as a literal, so characters like an unbalanced "(" no longer break the pattern:
$Target = Get-Content .\file1.csv
$file= [System.IO.File]::ReadAllText( "c:\test\file2.csv" )
Foreach ($line in $Target){
$line = $line.Substring(0,180)
$cnt = [regex]::matches( $file, [regex]::escape($line)).count
}
One issue with your script is that you read file2.csv over and over again, for each line from file1.csv. Reading the file just once and storing the content in a variable should significantly speed things up. Try this:
$f2 = Get-Content .\file2.csv
foreach ($line in (gc .\file1.csv)) {
$line = $line.Substring(0,180)
($f2 | ? { $_ -match $line }).Count
}
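Combining both suggestions - read file2.csv only once and escape every search string - gives a sketch like this (the "string,count" CSV output format is my own choice, for pasting into Excel):
$text = [System.IO.File]::ReadAllText('.\file2.csv')
Get-Content .\file1.csv | ForEach-Object {
    # literal (escaped) match; emit one "string,count" row per search string
    '{0},{1}' -f $_, [regex]::Matches($text, [regex]::Escape($_)).Count
} | Set-Content .\output.txt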
