Powershell TCP receive file slow - performance

This is the code for a PowerShell script that is meant to receive a file in chunks from a Python server.
$chunkSize = {chunkSize}
$chunk = New-Object Byte[] $chunkSize
while ($bytes -gt 0)
{
    if ($bytes -ge $chunkSize)
    {
        $buf = $chunkSize
    }
    else
    {
        $buf = $bytes
    }
    $chunk = New-Object Byte[] $buf
    [void]$s.Read($chunk, 0, $buf)
    $file += $chunk
    $bytes -= $buf
}
Set-Content -Value $file -Encoding Byte -Path "$env:temp\\{filename}"
rv file; rv bytes; rv buf; rv chunkSize; rv chunk
As you can see, I put [void] in front of the Read call, which significantly shortened the receive time. However, the receiving process is still very slow with 1024-byte chunks. In my socket-based C# projects my files would send effortlessly with little to no lag.
With PowerShell, however, it takes forever, even after suppressing the output of the Read method.
Is there something I'm missing in my PowerShell script that is slowing things down?
Thank you for your help!
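For what it's worth, the usual culprit in snippets like this is the $file += $chunk line: PowerShell arrays are fixed in size, so every += allocates a new array and copies everything received so far, which gets quadratically slower as the file grows. Below is a minimal sketch of the alternative, not a drop-in replacement: it assumes $s is the connected NetworkStream and $bytes holds the remaining byte count as in the original, and the output file name is just a placeholder. A larger buffer (64 KB here instead of 1 KB) also cuts down the number of trips through the loop.
$chunkSize = 64KB
$chunk = New-Object Byte[] $chunkSize
$fs = [System.IO.File]::Create("$env:temp\received.bin")  # placeholder file name
try {
    while ($bytes -gt 0) {
        # Read may return fewer bytes than requested, so use its return value
        $read = $s.Read($chunk, 0, [Math]::Min($chunkSize, $bytes))
        if ($read -le 0) { break }
        $fs.Write($chunk, 0, $read)
        $bytes -= $read
    }
}
finally {
    $fs.Close()
}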

Related

Powershell Plus or Minus Comparison Operator (Fuzzy Logic)?

So let me tell you what I'm trying to do here. Our SolarWinds alerts report on disk capacity as read by Windows, not the Virtual Machine vDisk size setting. What I'm trying to do is match the size so that I can find the correct vDisk and report on its datastore free space to determine whether or not we can add more.
Here's the problem: the GB number never matches between Windows and VMware. Say the disk has a capacity of 149.67 GB as reported by Windows; the VMware setting is 150, or 150.18854, or anything of that sort. I cannot find the vDisk without knowing the exact number, but theoretically I could find it if I had a comparison operator with some breathing room, like plus or minus 1 or even 0.5. For example:
Get-HardDisk -Vm SERVERNAME | Where-Object {
    $_.CapacityGB -lt $size + 0.5 -and
    $_.CapacityGB -gt $size - 0.5
}
This doesn't work though, for whatever reason. I need something similar to this. Any ideas?
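As an aside, the same plus-or-minus tolerance can be written as a single absolute-difference test; a minimal sketch using the 0.5 GB window from above:
Get-HardDisk -Vm SERVERNAME | Where-Object { [math]::Abs($_.CapacityGB - $size) -le 0.5 }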
UPDATE: It turns out to have been user error; I was experimenting with the wrong number when testing the command. I thought it was the syntax, but it was the number itself.
Since I managed to answer my own question, I thought I'd post a script for achieving this here. Note that you will need a txt file with a comma-separated server name and capacity on each line. You could probably modify this to do many other things with VMware data gathering if you wanted. In the end you'll need to know which columns are which and import the result into Excel as comma delimited.
Most of the variables are decimal values.
Also note that I have not yet figured out a way to programmatically deal with the discovery of multiple matching disks.
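For reference, each line of ServerList.txt is expected to look something like the following (the server name and capacity are made-up examples):
SERVER01,149.67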
$serverlist = Get-Content "./ServerList.txt"
$logfile = "./Stores.txt"
Remove-Item "./Stores.txt"

Function LogWrite {
    Param (
        [string]$srv,
        [string]$disk,
        [string]$store
    )
    Add-Content $logfile -Value "$srv,$disk,$store"
}

foreach ($item in $serverlist) {
    $store = "Blank"
    $disk = "Blank"
    try {
        $server, $arg = $item.split(',')
        $round = [math]::Round($arg, 0)
        $disk = Get-HardDisk -Vm $server | Where-Object { $_.CapacityGB -lt ($round + 2) -and $_.CapacityGB -gt ($round - 2) }
        if ([string]::IsNullOrEmpty($disk)) {
            $disk = "Problem locating disk."
            $store = "N/A"
            continue
        }
        if ($disk.count -gt 1) {
            $disk = "More than one matching disk."
            $store = "N/A"
        } else {
            $store = Get-HardDisk -Vm $server | Where-Object { $_.CapacityGB -lt ($round + 2) -and $_.CapacityGB -gt ($round - 2) } | Get-Datastore | % { "{0},{1},{2}" -f $_.Name, [math]::Round($_.FreeSpaceGB, 1), [math]::Round($_.CapacityGB, 1) }
        }
    }
    catch {
        $disk = "Physical"
        $store = "N/A"
    }
    LogWrite $server $disk $store
}

Split a text file by lines

With PowerShell I'm trying to split a text file into multiple files, using the beginning of each line as a delimiter.
Input file (transfer.txt):
3M|9935551876|11.99|2235641|001|1|100|N|780
3M|1135741031|13.99|8735559|003|1|100|N|145
3M|5835551001|20.50|4556481|002|1|100|N|222
3M|4578420001|33.00|1125785|001|1|100|N|652
8L|00811444243|134148|4064080040|1|02/05/2017 21:15:13|8|170502707|19.85
8L|00811444243|130925|4189133003|1|02/05/2017 21:15:13|8|170502707|4.69
8L|00811444243|136513|4186144003|2|02/05/2017 21:15:13|8|170502707|10.83
Output file (Article.txt):
3M|9935551876|11.99|2235641|001|1|100|N|780
3M|1135741031|13.99|8735559|003|1|100|N|145
3M|5835551001|20.50|4556481|002|1|100|N|222
3M|4578420001|33.00|1125785|001|1|100|N|652
Here's a snippet of my code:
$Path = "D:\BATCH\"
$InputFile = (Join-Path $Path "transfer.txt")
$Reader = New-Object System.IO.StreamReader($InputFile)
while (($Line = $Reader.ReadLine()) -ne $null) {
    if ($Line.StartsWith("3M")) {
        $OutputFile = "Article.txt"
    }
    Add-Content (Join-Path $Path $OutputFile) $Line
}
As a result, this creates the same file as the input file. What's wrong with the code?
The line below is the problem. It is outside the if block, so it adds every line to the output file. As I understand it, that is not what you want: you want only the lines that pass the if condition to be added to the output file. Hence, it needs to be inside the if block.
Add-Content (Join-Path $Path $OutputFile) $Line
That said, I am not too fond of this approach, because you would be making as many disk I/O operations as there are lines that pass the if condition. Not very good for scalability.
You can change your code to something like this to reduce the number of disk I/O operations to just one.
$out = While (($Line = $Reader.ReadLine()) -ne $null) {
    If ($Line.StartsWith("3M")) {
        $Line
    }
}
$OutputFile = "Article.txt"
Add-Content (Join-Path $Path $OutputFile) $Out
As others have already pointed out, you never change the output file to anything different from "Article.txt", and you write all input lines to the defined output file.
If you want to write the lines of the input file to different files depending on the value of the first field, I'd recommend naming the output files after that value. And since you're writing the output with Add-Content, I'd also suggest reading the input file via Get-Content for simplicity. Use a StreamReader when performance is an issue (in which case you'll want to use a StreamWriter too), but not just because.
Get-Content $InputFile | ForEach-Object {
$basename, $null = $_.Split('|', 2)
Add-Content (Join-Path $Path "${basename}.txt") $_
}
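If performance does become a concern, the StreamReader/StreamWriter route mentioned above can keep one writer per prefix so that each output file is opened only once. A minimal sketch, assuming the same transfer.txt input and naming each output file after the first field:
$writers = @{}
$reader = New-Object System.IO.StreamReader (Join-Path $Path 'transfer.txt')
try {
    while (($line = $reader.ReadLine()) -ne $null) {
        $prefix = $line.Split('|')[0]
        if (-not $writers.ContainsKey($prefix)) {
            # open a writer for this prefix the first time we see it
            $writers[$prefix] = New-Object System.IO.StreamWriter (Join-Path $Path "$prefix.txt")
        }
        $writers[$prefix].WriteLine($line)
    }
}
finally {
    $reader.Close()
    foreach ($w in $writers.Values) { $w.Close() }
}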

Sort very large text file in PowerShell

I have standard Apache log files, between 500 MB and 2 GB in size. I need to sort the lines in them (each line starts with a date in yyyy-MM-dd hh:mm:ss format, so no treatment is necessary for sorting).
The simplest and most obvious thing that comes to mind is
Get-Content unsorted.txt | sort | get-unique > sorted.txt
I am guessing (without having tried it) that doing this using Get-Content would take forever in my 1GB files. I don't quite know my way around System.IO.StreamReader, but I'm curious if an efficient solution could be put together using that?
Thanks to anyone who might have a more efficient idea.
[edit]
I tried this subsequently, and it took a very long time; some 10 minutes for 400MB.
Get-Content is terribly inefficient for reading large files. Sort-Object is not very fast, either.
Let's set up a base line:
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$c = Get-Content .\log3.txt -Encoding Ascii
$sw.Stop();
Write-Output ("Reading took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$s = $c | Sort-Object;
$sw.Stop();
Write-Output ("Sorting took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u = $s | Get-Unique
$sw.Stop();
Write-Output ("uniq took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u | Out-File 'result.txt' -Encoding ascii
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);
With a 40 MB file having 1.6 million lines (made of 100k unique lines repeated 16 times) this script produces the following output on my machine:
Reading took 00:02:16.5768663
Sorting took 00:02:04.0416976
uniq took 00:01:41.4630661
saving took 00:00:37.1630663
Totally unimpressive: more than 6 minutes to sort a tiny file. Every step can be improved a lot. Let's use a StreamReader to read the file line by line into a HashSet, which will remove duplicates, then copy the data to a List and sort it there, then use a StreamWriter to dump the results back.
$hs = new-object System.Collections.Generic.HashSet[string]
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$reader = [System.IO.File]::OpenText("D:\log3.txt")
try {
    while (($line = $reader.ReadLine()) -ne $null)
    {
        $t = $hs.Add($line)
    }
}
finally {
    $reader.Close()
}
$sw.Stop();
Write-Output ("read-uniq took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$ls = new-object system.collections.generic.List[string] $hs;
$ls.Sort();
$sw.Stop();
Write-Output ("sorting took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
try
{
    $f = New-Object System.IO.StreamWriter "d:\result2.txt";
    foreach ($s in $ls)
    {
        $f.WriteLine($s);
    }
}
finally
{
    $f.Close();
}
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);
This script produces:
read-uniq took 00:00:32.2225181
sorting took 00:00:00.2378838
saving took 00:00:01.0724802
On the same input file it runs more than 10 times faster. I am still surprised, though, that it takes 30 seconds to read the file from disk.
I've grown to hate this part of Windows PowerShell; it is a memory hog on these larger files. One trick is to read the lines with [System.IO.File]::ReadLines('file.txt') | sort -u | out-file file2.txt -encoding ascii
Another trick, seriously, is to just use Linux.
cat file.txt | sort -u > output.txt
Linux is so insanely fast at this that it makes me wonder what the heck Microsoft is thinking with this setup.
It may not be feasible in all cases, and I understand that, but if you have a Linux machine, you can copy 500 megs to it, sort and unique it, and copy it back in under a couple of minutes.
If each line of the log is prefixed with a timestamp, and the log messages don't contain embedded newlines (which would require special handling), I think it would take less memory and execution time to convert the timestamp from [String] to [DateTime] before sorting. The following assumes each log entry is of the format yyyy-MM-dd HH:mm:ss: <Message> (note that the HH format specifier is used for a 24-hour clock):
Get-Content unsorted.txt |
    ForEach-Object {
        # Ignore empty lines; can substitute with [String]::IsNullOrWhitespace($_) on PowerShell 3.0 and above
        if (-not [String]::IsNullOrEmpty($_))
        {
            # Split into at most two fields, even if the message itself contains ': '
            [String[]] $fields = $_ -split ': ', 2;
            return New-Object -TypeName 'PSObject' -Property @{
                Timestamp = [DateTime] $fields[0];
                Message   = $fields[1];
            };
        }
    } | Sort-Object -Property 'Timestamp', 'Message';
If you are processing the input file for interactive display purposes you can pipe the above into Out-GridView or Format-Table to view the results. If you need to save the sorted results you can pipe the above into the following:
| ForEach-Object {
# Reconstruct the log entry format of the input file
return '{0:yyyy-MM-dd HH:mm:ss}: {1}' -f $_.Timestamp, $_.Message;
} `
| Out-File -Encoding 'UTF8' -FilePath 'sorted.txt';
(Edited to be more clear based on n0rd's comments)
It might be a memory issue. Since you're loading the entire file into memory to sort it (and adding the overhead of the pipe into Sort-Object and the pipe into Get-Unique), it's possible that you're hitting the memory limits of the machine and forcing it to page to disk, which will slow things down a lot. One thing you might consider is splitting the logs up before sorting them, and then splicing them back together.
This probably won't match your format exactly, but if I've got a large log file for, say, 8/16/2012 which spans several hours, I can split it up into a different file for each hour using something like this:
for($i=0; $i -le 23; $i++){ Get-Content .\u_ex120816.log | ? { $_ -match "^2012-08-16 $i`:" } | Set-Content -Path "$i.log" }
This is creating a regular expression for each hour of that day and dumping all the matching log entries into a smaller log file named by the hour (e.g. 16.log, 17.log).
Then I can run your process of sorting and getting unique entries on much smaller subsets, which should run a lot faster:
for($i=0; $i -le 23; $i++){ Get-Content "$i.log" | sort | get-unique > "${i}sorted.txt" }
And then you can splice them back together.
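A sketch of that splice step, assuming the hourly files were written as 0sorted.txt through 23sorted.txt as above:
# each hourly file is already sorted, so reading them in hour order keeps the global order
0..23 | ForEach-Object { if (Test-Path "${_}sorted.txt") { Get-Content "${_}sorted.txt" } } | Set-Content sorted.txt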
Depending on the frequency of the logs, it might make more sense to split them by day, or minute; the main thing is to get them into more manageable chunks for sorting.
Again, this only makes sense if you're hitting the memory limits of the machine (or if Sort-Object is using a really inefficient algorithm).
"Get-Content" can be faster than you think. Check this code-snippet in addition to the above solution:
foreach ($block in (get-content $file -ReadCount 100)) {
    foreach ($line in $block) { [void] $hs.Add($line) }
}
There doesn't seem to be a great way to do it in PowerShell, including [IO.File]::ReadLines(), but with the native Windows sort.exe or the GNU sort.exe, either run from cmd.exe, 30 million random numbers can be sorted in about 5 minutes with around 1 GB of RAM. GNU sort automatically breaks things up into temp files to save RAM. Both commands have options to start the sort at a certain character column. GNU sort can merge sorted files. See external sorting.
30 million line test file:
& { foreach ($i in 1..300kb) { get-random } } | set-content file.txt
And then in cmd:
copy file.txt+file.txt file2.txt
copy file2.txt+file2.txt file3.txt
copy file3.txt+file3.txt file4.txt
copy file4.txt+file4.txt file5.txt
copy file5.txt+file5.txt file6.txt
copy file6.txt+file6.txt file7.txt
copy file7.txt+file7.txt file8.txt
With GNU sort.exe from http://gnuwin32.sourceforge.net/packages/coreutils.htm (don't forget the dependency DLLs, libiconv2.dll and libintl3.dll). Within cmd.exe:
.\sort.exe < file8.txt > filesorted.txt
Or Windows sort.exe within cmd.exe:
sort.exe < file8.txt > filesorted.txt
With the function below:
PS> PowerSort -SrcFile C:\windows\win.ini
function PowerSort {
    param(
        [string]$SrcFile = "",
        [string]$DstFile = "",
        [switch]$Force
    )
    if ($SrcFile -eq "") {
        write-host "USAGE: PowerSort -SrcFile (srcfile) [-DstFile (dstfile)] [-Force]"
        return 0;
    }
    else {
        $SrcFileFullPath = Resolve-Path $SrcFile -ErrorAction SilentlyContinue -ErrorVariable _frperror
        if (-not($SrcFileFullPath)) {
            throw "Source file not found: $SrcFile";
        }
    }
    [Collections.Generic.List[string]]$lines = [System.IO.File]::ReadAllLines($SrcFileFullPath)
    $lines.Sort();

    # Write Sorted File to Pipe
    if ($DstFile -eq "") {
        foreach ($line in $lines) {
            write-output $line
        }
    }
    # Write Sorted File to File
    else {
        $pipe_enable = 0;
        $DstFileFullPath = Resolve-Path $DstFile -ErrorAction SilentlyContinue -ErrorVariable ev
        # Destination File doesn't exist
        if (-not($DstFileFullPath)) {
            $DstFileFullPath = $ev[0].TargetObject
        }
        # Destination Exists and -Force not specified.
        elseif (-not $Force) {
            throw "Destination file already exists: ${DstFile} (use the -Force flag to overwrite)"
        }
        write-host "Writing-File: $DstFile"
        [System.IO.File]::WriteAllLines($DstFileFullPath, $lines)
    }
    return
}

Formatting large text file in Windows Powershell

I'm trying to format large text files (~300MB) with between 0 and 3 columns:
12345|123 Main St, New York|91110
23456|234 Main St, New York
34567|345 Main St, New York|91110
And the output should be:
000000000012345,"123 Main St, New York",91110,,,,,,,,,,,,
000000000023456,"234 Main St, New York",,,,,,,,,,,,,
000000000034567,"345 Main St, New York",91110,,,,,,,,,,,,
I'm new to PowerShell, but I've read that I should avoid Get-Content, so I am using a StreamReader. It is still much too slow:
function append-comma{} #helper function to append the correct amount of commas to each line
$separator = '|'
$infile = "\large_data.csv"
$outfile = "new_file.csv"
$target_file_in = New-Object System.IO.StreamReader -Arg $infile
If ($header -eq 'TRUE') {
    $firstline = $target_file_in.ReadLine() #skip header if exists
}
while (!$target_file_in.EndOfStream) {
    $line = $target_file_in.ReadLine()
    $a = $line.split($separator)[0].trim()
    $b = ""
    $c = ""
    if ($dataType -eq 'ECN') { $a = $a.padleft(15,'0') }
    if ($line.split($separator)[1].length -gt 0) { $b = $line.split($separator)[1].trim() }
    if ($line.split($separator)[2].length -gt 0) { $c = $line.split($separator)[2].trim() }
    $line = $a + ',"' + $b + '","' + $c + '"'
    $line -replace '(?m)"([^,]*?)"(?=,|$)', '$1' | append-comma >> $outfile
}
$target_file_in.close()
I am building this for other people on my team and wanted to add a GUI using this guide:
http://blogs.technet.com/b/heyscriptingguy/archive/2014/08/01/i-39-ve-got-a-powershell-secret-adding-a-gui-to-scripts.aspx
Is there a faster way to do this in PowerShell?
I wrote a script using Linux bash (Cygwin64 on Windows) and a separate one in Python. Both ran much faster, but I am trying to script something that would be "approved" on a Windows platform.
All that splitting and replacing costs you way more time than you gain from the StreamReader. The code below cut execution time to about 20% of the original for me:
$separator = '|'
$infile = "\large_data.csv"
$outfile = "new_file.csv"
if ($header -eq 'TRUE') {
$linesToSkip = 1
} else {
$linesToSkip = 0
}
Get-Content $infile | select -Skip $linesToSkip | % {
[int]$a, [string]$b, [string]$c = $_.split($separator)
'{0:d15},"{1}",{2},,,,,,,,,,,,,' -f $a, $b.Trim(), $c.Trim()
} | Set-Content $outfile
How does this work for you? I was able to read and process a 35MB file in about 40 seconds on a cheap ole workstation.
File Size: 36,548,820 bytes
Processed In: 39.7259722 seconds
Function CheckPath {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory=$True,
                   ValueFromPipeline=$True)]
        [string[]]$Path
    )
    BEGIN {}
    PROCESS {
        IF ((Test-Path -LiteralPath $Path) -EQ $False) {Write-host "Invalid File Path $Path"}
    }
    END {}
}
$infile = "infile.txt"
$outfile = "restult5.txt"
#Check File Path
CheckPath $InFile
#Initiate StreamReader
$Reader = New-Object -TypeName System.IO.StreamReader($InFile);
#Create New File Stream Object For StreamWriter
$WriterStream = New-Object -TypeName System.IO.FileStream(
    $outfile,
    [System.IO.FileMode]::Create,
    [System.IO.FileAccess]::Write);
#Initiate StreamWriter
$Writer = New-Object -TypeName System.IO.StreamWriter(
    $WriterStream,
    [System.Text.Encoding]::ASCII);
If ($header -eq $True) {
    $Reader.ReadLine() | Out-Null #Skip First Line In File
}
while ($Reader.Peek() -ge 0) {
    $line = $Reader.ReadLine() #Read Line
    $Line = $Line.split('|') #Split Line
    $OutPut = "$($($line[0]).PadLeft(15,'0')),`"$($Line[1])`",$($Line[2]),,,,,,,,,,,,"
    $Writer.WriteLine($OutPut)
}
$Reader.Close();
$Reader.Dispose();
$Writer.Flush();
$Writer.Close();
$Writer.Dispose();
$endDTM = (Get-Date) #Get Script End Time For Measurement
Write-Host "Elapsed Time: $(($endDTM-$startDTM).totalseconds) seconds" #Echo Time elapsed
Regex is fast:
$infile = ".\large_data.csv"
gc $infile|%{
$x=if($_.indexof('|')-ne$_.lastindexof('|')){
$_-replace'(.+)\|(.+)\|(.+)',('$1,"$2",$3'+','*12)
}else{
$_-replace'(.+)\|(.+)',('$1,"$2"'+','*14)
}
('0'*(15-($x-replace'([^,]),.+','$1').length))+$x
}
I have another approach: let PowerShell read the input file as a CSV file, with a pipe character as the delimiter, then format the output the way you want it. I have not tested this for speed with large files.
$infile = "\large-data.csv"
$outfile = "new-file.csv"
import-csv $infile -header id,addr,zip -delimiter "|" |
% {'{0},"{1}",{2},,,,,,,,,,,,,' -f $_.id.padleft(15,'0'), $_.addr.trim(), $_.zip} |
set-content $outfile

monitor for creation and then read a file via windows .bat script

I would like to write a batch script which will poll a Windows directory for a certain time limit and pick up a file as soon as it is placed in the directory.
It will also time out after a certain time if the file is not placed in that directory within that time frame.
I would also like to parse the XML file and check for a status.
Here's a PowerShell script that will do what you asked.
The $content variable will store the contents of the file (it will actually be an array of lines, so you can throw it into a foreach loop).
$file = 'C:\flag.xml'
$timeout = New-TimeSpan -Minutes 1
$sw = [diagnostics.stopwatch]::StartNew()
while ($sw.elapsed -lt $timeout)
{
    if (Test-Path $file)
    {
        "$(Get-Date -f 'HH:mm:ss') Found a file: $file"
        $content = gc $file
        if ($content -contains 'something interesting')
        {
            "$(Get-Date -f 'HH:mm:ss') File has something interesting in it!"
        }
        break
    }
    else
    {
        "$(Get-Date -f 'HH:mm:ss') Still no file: $file"
    }
    Start-Sleep -Seconds 5
}
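For the XML status check mentioned in the question, something along these lines could replace the plain-text -contains test. This is a minimal sketch in which response and status are placeholder element names, so adjust them to the real schema:
[xml]$xml = Get-Content $file
# 'response' and 'status' are placeholders for whatever the actual XML uses
if ($xml.response.status -eq 'SUCCESS') {
    "$(Get-Date -f 'HH:mm:ss') File reports a SUCCESS status"
}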
