Insertion sort script on powershell - algorithm

I am trying to implement the insertion sort algorithm in PowerShell as a learning exercise.
Code:
$TestArrayList = [System.Collections.ArrayList]@(8, 2, 11, 12, 5, 6, 7)
for ($i = 0; $i -lt $TestArrayList.Count; $i++) {
    $key = $TestArrayList[$i]
    $j = $i-1
    while($j -gt 0 -and $key -lt $TestArrayList[$j]){
        $TestArrayList[$j+1] = $TestArrayList[$j]
        $TestArrayList[$j] = $key
        $j = $j-1
        #Write-Output $TestArrayList[$i]
    }
    Write-Output $TestArrayList[$i]
}
The output of the code is:
8
2
11
12
12
12
12
Can you please help me analyze what the problem is? I am trying to sort the list from smallest to largest.
Expected to be sorted:
2, 5, 6, 7, 8, 11, 12

There are two problems with your code right now:
The output is not the final sorted list; it is intermediate output from the Write-Output statement inside the loop.
There is an off-by-one bug in the nested while loop that causes the first item to always be ignored.
To fix the first issue, simply remove the Write-Output statement from the loop.
To fix the second issue, change the first comparison in the while condition to $j -ge 0 instead of $j -gt 0:
for ($i = 0; $i -lt $TestArrayList.Count; $i++) {
    $key = $TestArrayList[$i]
    $j = $i - 1
    while($j -ge 0 -and $key -lt $TestArrayList[$j]){
        $TestArrayList[$j+1] = $TestArrayList[$j]
        $TestArrayList[$j] = $key
        $j = $j - 1
    }
}
# the list is now sorted correctly
$TestArrayList
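As a quick sanity check, you can join the list and compare it against the expected order (a small verification sketch of my own, run after the fixed loop above):
$TestArrayList -join ', '    # prints: 2, 5, 6, 7, 8, 11, 12
if (($TestArrayList -join ', ') -eq '2, 5, 6, 7, 8, 11, 12') { 'sorted as expected' }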

Related

Speed of Powershell Script. Optimisation sought

I have a working script whose objective is to parse data files for malformed rows before importing into Oracle. To process a 450MB csv file with more than 1 million rows of 8 columns, it takes a little over 2.5 hours and maxes out a single CPU core. Small files complete quickly (in seconds).
Oddly, a 350MB file with a similar number of rows and 40 columns only takes 30 minutes.
My issue is that the files will grow over time, and 2.5 hours tying up a CPU core isn't good. Can anyone recommend code optimisations? A similarly titled post recommended local paths, which I'm already doing.
$file = "\Your.csv"
$path = "C:\Folder"
$csv = Get-Content "$path$file"
# Count number of file headers
$count = ($csv[0] -split ',').count
# https://blogs.technet.microsoft.com/gbordier/2009/05/05/powershell-and-writing-files-how-fast-can-you-write-to-a-file/
$stream1 = [System.IO.StreamWriter] "$path\Passed$file-Pass.txt"
$stream2 = [System.IO.StreamWriter] "$path\Failed$file-Fail.txt"
# 2 validation steps: (1) the row's field count must be >= the header's field count; (2) after splitting off the first column, the remaining columns must total at least 40 characters.
$csv | Select -Skip 1 | % {
    if( ($_ -split ',').count -ge $count -And ($_.split(',',2)[1]).Length -ge 40) {
        $stream1.WriteLine($_)
    } else {
        $stream2.WriteLine($_)
    }
}
$stream1.close()
$stream2.close()
Sample Data File:
C1,C2,C3,C4,C5,C6,C7,C8
ABC,000000000000006732,1063,2016-02-20,0,P,ESTIMATE,2015473497A10
ABC,000000000000006732,1110,2016-06-22,0,P,ESTIMATE,2015473497A10
ABC,,2016-06-22,,201501
,,,,,,,,
ABC,000000000000006732,1135,2016-08-28,0,P,ESTIMATE,2015473497B10
ABC,000000000000006732,1167,2015-12-20,0,P,ESTIMATE,2015473497B10
Get-Content is extremely slow in the default mode that produces an array when the file contains millions of lines, on all PowerShell versions including 5.1. What's worse, you're assigning it to a variable, so nothing else happens until the entire file is read and split into lines. On an Intel i7 3770K CPU at 3.9GHz, $csv = Get-Content $path takes more than 2 minutes to read a 350MB file with 8 million lines.
Solution: Use IO.StreamReader to read a line and process it immediately.
In PowerShell 2 StreamReader is less optimized than in PS3+, but still faster than Get-Content.
Pipelining via | is at least several times slower than direct enumeration via flow-control statements such as the while or foreach statement (the language statement, not the ForEach-Object cmdlet).
Solution: use the statements.
Splitting each line into an array of strings is slower than manipulating only one string.
Solution: use the IndexOf and Replace methods (not the operators) to count character occurrences.
PowerShell always creates an internal pipeline when loops are used.
Solution: use the Invoke-Command { } trick for a 2-3x speedup in this case!
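If you want to gauge the pipeline-suppression effect on its own before rewriting anything, a micro-benchmark along these lines can help (my own sketch; absolute numbers will vary by machine and PowerShell version):
Measure-Command { $n = 0; while ($n -lt 1000000) { $n++ } }                    # plain loop at script scope
Measure-Command { Invoke-Command { $n = 0; while ($n -lt 1000000) { $n++ } } } # same loop wrapped in Invoke-Command { }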
Below is PS2-compatible code.
It's faster in PS3+ (30 seconds for 8 million lines in a 350MB csv on my PC).
$reader = New-Object IO.StreamReader ('r:\data.csv', [Text.Encoding]::UTF8, $true, 4MB)
$header = $reader.ReadLine()
$numCol = $header.Split(',').count
$writer1 = New-Object IO.StreamWriter ('r:\1.csv', $false, [Text.Encoding]::UTF8, 4MB)
$writer2 = New-Object IO.StreamWriter ('r:\2.csv', $false, [Text.Encoding]::UTF8, 4MB)
$writer1.WriteLine($header)
$writer2.WriteLine($header)
Write-Progress 'Filtering...' -status ' '
$watch = [Diagnostics.Stopwatch]::StartNew()
$currLine = 0
Invoke-Command { # the speed-up trick: disables the internal pipeline
    while (!$reader.EndOfStream) {
        $s = $reader.ReadLine()
        $slen = $s.length
        if ($slen-$s.IndexOf(',')-1 -ge 40 -and $slen-$s.Replace(',','').length+1 -eq $numCol){
            $writer1.WriteLine($s)
        } else {
            $writer2.WriteLine($s)
        }
        if (++$currLine % 10000 -eq 0) {
            $pctDone = $reader.BaseStream.Position / $reader.BaseStream.Length
            Write-Progress 'Filtering...' -status "Line: $currLine" `
                -PercentComplete ($pctDone * 100) `
                -SecondsRemaining ($watch.ElapsedMilliseconds * (1/$pctDone - 1) / 1000)
        }
    }
} #Invoke-Command end
Write-Progress 'Filtering...' -Completed -status ' '
echo "Elapsed $($watch.Elapsed)"
$reader.close()
$writer1.close()
$writer2.close()
Another approach is to use regex in two passes (it's slower than the above code, though).
PowerShell 3 or newer is required due to array element property shorthand syntax:
$text = [IO.File]::ReadAllText('r:\data.csv')
$header = $text.substring(0, $text.indexOfAny("`r`n"))
$numCol = $header.split(',').count
$rx = [regex]"\r?\n(?:[^,]*,){$($numCol-1)}[^,]*?(?=\r?\n|$)"
[IO.File]::WriteAllText('r:\1.csv', $header + "`r`n" +
($rx.matches($text).groups.value -join "`r`n"))
[IO.File]::WriteAllText('r:\2.csv', $header + "`r`n" + $rx.replace($text, ''))
If you feel like installing awk, you can do 1,000,000 records in under a second - seems like a good optimisation to me :-)
awk -F, '
NR==1 {f=NF; printf("Expecting: %d fields\n",f)} # First record, get expected number of fields
NF!=f {print > "Fail.txt"; next} # Fail for wrong field count
length($0)-length($1)<40 {print > "Fail.txt"; next} # Fail for wrong length
{print > "Pass.txt"} # Pass
' MillionRecord.csv
You can get gawk for Windows from here.
Windows is a bit awkward with single quotes in parameters, so if running under Windows I would use the same code, but formatted like this:
Save this in a file called commands.awk:
NR==1 {f=NF; printf("Expecting: %d fields\n",f)}
NF!=f {print > "Fail.txt"; next}
length($0)-length($1)<40 {print > "Fail.txt"; next}
{print > "Pass.txt"}
Then run with:
awk -F, -f commands.awk Your.csv
The remainder of this answer relates to a "Beat hadoop with the shell" challenge mentioned in the comments section, and I wanted somewhere to save my code, so it's here. It runs in 6.002 seconds on my iMac over 3.5GB in 1543 files amounting to around 104 million records:
#!/bin/bash
doit(){
    awk '!/^\[Result/{next} /1-0/{w++;next} /0-1/{b++} END{print w,b}' "$@"
}
export -f doit
find . -name \*.pgn -print0 | parallel -0 -n 4 -j 12 doit {}
Try experimenting with different looping strategies, for example, switching to a for loop cuts the processing time by more than 50%, e.g.:
[String] $Local:file = 'Your.csv';
[String] $Local:path = 'C:\temp';
[System.Array] $Local:csv = $null;
[System.IO.StreamWriter] $Local:objPassStream = $null;
[System.IO.StreamWriter] $Local:objFailStream = $null;
[Int32] $Local:intHeaderCount = 0;
[Int32] $Local:intRow = 0;
[String] $Local:strRow = '';
[TimeSpan] $Local:objMeasure = 0;
try {
    # Load.
    $objMeasure = Measure-Command {
        $csv = Get-Content -LiteralPath (Join-Path -Path $path -ChildPath $file) -ErrorAction Stop;
        $intHeaderCount = ($csv[0] -split ',').count;
    } #measure-command
    'Load took {0}ms' -f $objMeasure.TotalMilliseconds;
    # Create stream writers.
    try {
        $objPassStream = New-Object -TypeName System.IO.StreamWriter ( '{0}\Passed{1}-pass.txt' -f $path, $file );
        $objFailStream = New-Object -TypeName System.IO.StreamWriter ( '{0}\Failed{1}-fail.txt' -f $path, $file );
        # Process CSV (v1).
        $objMeasure = Measure-Command {
            $csv | Select-Object -Skip 1 | Foreach-Object {
                if( (($_ -Split ',').Count -ge $intHeaderCount) -And (($_.Split(',',2)[1]).Length -ge 40) ) {
                    $objPassStream.WriteLine( $_ );
                } else {
                    $objFailStream.WriteLine( $_ );
                } #else-if
            } #foreach-object
        } #measure-command
        'Process took {0}ms' -f $objMeasure.TotalMilliseconds;
        # Process CSV (v2).
        $objMeasure = Measure-Command {
            for ( $intRow = 1; $intRow -lt $csv.Count; $intRow++ ) {
                if( (($csv[$intRow] -Split ',').Count -ge $intHeaderCount) -And (($csv[$intRow].Split(',',2)[1]).Length -ge 40) ) {
                    $objPassStream.WriteLine( $csv[$intRow] );
                } else {
                    $objFailStream.WriteLine( $csv[$intRow] );
                } #else-if
            } #for
        } #measure-command
        'Process took {0}ms' -f $objMeasure.TotalMilliseconds;
    } #try
    catch [System.Exception] {
        'ERROR : Failed to create stream writers; exception was "{0}"' -f $_.Exception.Message;
    } #catch
    finally {
        $objFailStream.close();
        $objPassStream.close();
    } #finally
} #try
catch [System.Exception] {
    'ERROR : Failed to load CSV.';
} #catch
exit 0;

Increase performance for checking file delimiters

After spending some time looking for the most clear-cut way to check whether the body of a file has the same number of delimiters as the header, I came up with this code:
Param #user enters the directory path and delimiter they are checking for
(
[string]$source,
[string]$delim
)
#try {
$lineNum = 1
$thisOK = 0
$badLine = 0
$noDelim = 0
$archive = ("*archive*","*Archive*","*ARCHIVE*");
foreach ($files in Get-ChildItem $source -Exclude $archive) #folder may have sub folders; as a temp workaround I made sure to exclude any folder with "archive" in the name
{
    $read2 = New-Object System.IO.StreamReader($files.FullName)
    $DataLine = (Get-Content $files.FullName)[0]
    $validCount = ([char[]]$DataLine -eq $delim).count #count of delimiters in the header
    $lineNum = 1  #used to report which line in the file is bad
    $thisOK = 0   #used for the if condition to report that the file's delimiters line up with the header
    $badLine = 0  #used so the write-host doesn't meet the if condition and report the file as ok after throwing an error
    while (!$read2.EndOfStream)
    {
        $line = $read2.ReadLine()
        $total = $line.Split($delim).Length - 1;
        if ($total -eq $validCount)
        {
            $thisOK = 1
        }
        elseif ($total -ne $validCount)
        {
            Write-Output "Error on line $lineNum for file $files. Line number $lineNum has $total delimiters and the header has $validCount"
            $thisOK = 0
            $badLine = 1
            break; #break, or else it will repeat for each line that is bad
        }
        $lineNum++
    }
    if ($thisOK -eq 1 -and $badLine -eq 0 -and $validCount -ne 0)
    {
        Write-Output "$files is ok"
    }
    if ($validCount -eq 0)
    {
        Write-Output "$files does not contain entered delimiter: $delim"
    }
    $read2.Close()
    $read2.Dispose()
} #end foreach loop
#} catch {
# $ErrorMessage = $_.Exception.Message
# $FailedItem = $_.Exception.ItemName
#}
It works for what I have tested so far. However, when it comes to larger files, it takes considerably longer. I was wondering what I can do or change in this code to make it process these text/CSV files more quickly?
Also, my try..catch statements are commented out since the script doesn't seem to run when I include them; there is no error, it just drops to a new command line. As a follow-up thought I was looking to incorporate a simple GUI so other users can double-check files.
Sample file:
HeaderA|HeaderB|HeaderC|HeaderD //header line
DataLnA|DataLnBBDataLnC|DataLnD|DataLnE //bad line
DataLnA|DataLnB|DataLnC|DataLnD| //bad line
DataLnA|DataLnB|DataLnC|DataLnD //good line
Now that I look at it, I guess there could also be an issue where there is the correct number of delimiters but the columns mismatch, like this:
HeaderA|HeaderB|HeaderC|HeaderD
DataLnA|DataLnBDataLnC|DataLnD|
The main problem that I see is that you are reading the file twice -- once with the call to Get-Content, which reads the entire file into memory, and a second time with your while loop. You can double the speed of your process by replacing this line:
$DataLine = (Get-Content $files.FullName)[0] #inefficient
with this:
$DataLine = Get-Content $files.FullName -First 1 #efficient
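If you want to avoid reading the file twice altogether, you can also take the header from the StreamReader you already open (a sketch of my own, reusing $read2, $files, and $delim from the question; note the while loop then only sees data lines, so $lineNum should start at 2):
$read2 = New-Object System.IO.StreamReader($files.FullName)
$DataLine = $read2.ReadLine()                       # header, read once from the same stream
$validCount = ([char[]]$DataLine -eq $delim).Count  # delimiter count in the header
$lineNum = 2                                        # data now starts on line 2
# ...the rest of the while loop from the question stays the same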

Create a dynamic query using php inputs through dropdowns

I need to print data from a database based on the selections in three dropdown lists. The user can select one, two, or all three dropdowns, and the query should be built from the values of whichever dropdowns were selected. I'm new to PHP and still learning. Can anyone sort out the problem in my code?
if(isset($_POST['submt']))
{
    $a = $_POST['prog'];
    $b = $_POST['cntr'];
    $c = $_POST['sectr'];
    $a1 = 'Programme_name';
    $b1 = 'Center_name';
    $c1 = 'Name_of_trade';
    $x=0; $y=0; $p=0; $q=0;
    if($a=='' && $b!='' && $c!='') { $x = $b1; $y = $c1; $p = $b; $q = $c; }
    if($b=='' && $c!='' && $a!='') { $x = $c1; $y = $a1; $p = $c; $q = $a; }
    if($c=='' && $a!='' && $b!='') { $x = $a1; $y = $b1; $p = $a; $q = $b; }
    echo $x." ".$y;
    mysql_connect("localhost","root","sherk005");
    mysql_select_db("erp");
    $hai = mysql_query("SELECT * FROM student_master_1 WHERE $x = '$p' AND $y = '$q'");
    while(mysql_fetch_row($hai)>0) {
        echo $hai['Partner_name'] . " " . $hai['Programme_name'];
        echo "<br>";
    }
}
You should probably escape your variables in your query to get it right.
$hai = mysql_query("SELECT * FROM student_master_1 WHERE ".$x." = '".$p."' AND ".$y." = '".$q."'");
Consider also using the mysqli functions, as the mysql_* ones are deprecated: http://php.net//manual/en/book.mysqli.php

How to improve performance for this code?

This code runs on a file of at least 200M lines, and it takes a lot of time.
I would like to know if I can improve the runtime of this loop.
my @bin_list; # list of 0's and 1's
while (my $line = $input_io->getline) {
    if ($bin_list[$i]) {
        $line =~ s/^.{3}/XXX/;
    } else {
        $line =~ s/^.{3}/YYY/;
    }
    $output_io->appendln($line);
    $i++;
}
A regex solution may be overkill here. How about replacing the if/else blocks with:
substr($line, 0, 3, $bin_list[$i] ? 'XXX' : 'YYY');
The smallest change is probably to buffer between appendln calls:
my @bin_list; # list of 0's and 1's
my $i = 0;
my $buffer = '';
while (my $line = $input_io->getline) {
    if ($bin_list[$i]) {
        $line =~ s/^.{3}/XXX/;
    } else {
        $line =~ s/^.{3}/YYY/;
    }
    $buffer .= $line;
    if ( $i % 1000 == 0 ) {
        $output_io->appendln($buffer);
        $buffer = '';
    }
    $i++;
}
if ( $buffer ne '' ) {
    $output_io->appendln($buffer);
}
Are you using IO::All?
I couldn't find anything else with appendln...
Replacing this:
my $input_io = io 'tmp.this';
my $output_io = io 'tmp.out';
while (my $line = $input_io->getline) {
    $output_io->appendln($line);
}
With this:
open(IFH, 'tmp.this');
open(OFH, '>>tmp.out');
while (my $line = <IFH>) {
    print OFH $line;
}
close IFH;
close OFH;
Is quite a bit faster (1 sec vs 23 in my test case).

Human readable byte script in Powershell

I'm trying to write a script where the user can input a number and the script will convert it into human-readable byte units.
Here's what I've got:
# human-readable-byte.ps1
$ans = Read-Host
if ($ans -gt 1TB) {
    Write-Host ($ans/1TB) "TB"
} elseif ($ans -gt 1GB) {
    Write-Host ($ans/1GB) "GB"
} elseif ($ans -gt 1MB) {
    Write-Host ($ans/1MB) "MB"
} elseif ($ans -gt 1KB) {
    Write-Host ($ans/1KB) "KB"
} else {
    Write-Host $ans "B"
}
The problem I get is that everything under 2.0 comes out in B, but everything over comes out in TB. Why? It seems like everything in between is ignored. I've tried to do this in many different ways, but can't get it to work.
Any ideas?
Comparison operators in PowerShell convert the right operand to the type of the left operand. Read-Host returns a string, so in your case the comparisons convert the number on the right to a string, and -gt therefore does a string comparison.
You'd need to convert $ans to the right type:
[long]$ans = Read-Host
or swap the operands:
if (1TB -lt $ans) ...
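To make the failure mode concrete, here is a small demonstration of my own (sample values chosen arbitrarily) of string versus numeric comparison:
'9' -gt 1TB         # True:  as strings, '9' sorts after '1099511627776'
'100' -gt 2         # False: as text, '100' sorts before '2'
[long]'100' -gt 2   # True:  numeric comparison once the left operand is a number
[long]$ans = Read-Host   # the fix above: 1TB, 1GB, ... are then compared numerically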
