PowersShell: Quickly Split File with commas inside Quotes

PowersShell: Quickly Split File with commas inside Quotes - performance

I am dealing with the exact same question from here, but the answer in this post is actually worse than the originally posted problem. I have a csv file with 7 million records. I have run several tests of ways to read and write this file and once I introduce the regex split, the performance drops off significantly.
Below are some code examples (these have all the processing removed, and are just focusing on the reading, splitting and then writing back out as a .csv, in the real programs, there is processing of these fields, but that is the same regardless of the read and write method)
Does anyone have another suggestion to split a csv with commas in quotes efficiently?
Thank you.
When I use the standard split (which is fast, but creates issues because I have commas inside quoted fields) it runs fairly fast:
`
#CSV Split
$reader = [System.IO.File]::OpenText($path)
$writer = [System.IO.StreamWriter] $outputFileName1
measure-command {
while ($null -ne ($line = $reader.ReadLine())) {
$item = $line.Split(",")
$writer.WriteLine("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15}"
, $item[0], $item[1], $item[2], $item[3], $item[4], $item[5], $item[6], $item[7], $item[8], $item[9]
, $item[10], $item[11], $item[12], $item[13], $item[14], $item[15])
}
}
$reader.Close()
$writer.Close()
`
TotalMinutes : 2.50485881
When I add Regex, it fixes the comma within quotes issue, but becomes VERY slow.
`
#regex Split
$RegexOptions = [System.Text.RegularExpressions.RegexOptions]
$csvSplit = '(,)(?=(?:[^"]|"[^"]*")*$)'
$reader = [System.IO.File]::OpenText($path)
$writer = [System.IO.StreamWriter] $outputFileName2
measure-command {
while ($null -ne ($line = $reader.ReadLine())) {
$item = [regex]::Split($Line, $csvSplit, $RegexOptions::ExplicitCapture)
$writer.WriteLine("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15}"
, $item[0], $item[1], $item[2], $item[3], $item[4], $item[5], $item[6], $item[7], $item[8], $item[9]
, $item[10], $item[11], $item[12], $item[13], $item[14], $item[15])
}
}
$reader.Close()
$writer.Close()
`
TotalMinutes : 8.84508987333333
When I use the suggested answer from the link above, it becomes even slower.
`
`# With CSV
$writer = [System.IO.StreamWriter] $outputFileName3
measure-command {
Import-Csv $path | ForEach-Object {
$writer.WriteLine("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15}"
, $_.BANK, $_.SOURCE, $_.ACCOUNT, $_.COSTCENTER, $_.TRANCD, $_.DOCNO, $_.TRACENO, $_.EFFDATE, $_.AMOUNT, $_.DESCRIP, $_.DESCRIP1, $_.DESCRIP3, $_.DESCRIP2, $_.DESCHYPH, $_.POSTDATE, $_.SJECHK)
}
}
$writer.Close()`
`
TotalMinutes : 18.0537001666667

Related

OledbDataReader using up all the RAM (in powershell)

From all my reading, the oledb datareader does not store records in memory, but this code is maxing out the RAM. Its meant to pull data from an Oracle db (about 10M records) and write them to a GZIP file. I have tried everything (including commenting out the Gzip write) and it still ramps up the RAM until it falls over. Is there are way to just execute the reader without it staying in memory? What am I doing wrong?
$tableName='ACCOUNTS'
$fileNum=1
$gzFilename="c:\temp\gzip\$tableName.$fileNum.txt.gz"
$con=Open-Con ORA -tns $tns -userName $userName -fetchSize $fetchSize
$cmd = New-Object system.Data.OleDb.OleDbCommand($sql,$con);
$cmd.CommandTimeout = '0';
$output = New-Object System.IO.FileStream $gzFilename, ([IO.FileMode]::Create), ([IO.FileAccess]::Write), ([IO.FileShare]::None)
[System.IO.Compression.GzipStream]$gzipStream = New-Object System.IO.Compression.GzipStream $output, ([IO.Compression.CompressionMode]::Compress)
$encoding = [System.Text.Encoding]::UTF8
$reader=$cmd.ExecuteReader()
[int]$j=0
While ($reader.Read())
{
$j++
$str=$reader[0..$($reader.Fieldcount-1)] -join '|'
$out=$encoding.GetBytes($("$str`n").ToString() )
$gzipStream.Write($out,0, $out.length)
if($j % 10000 -eq 0){write-host $j}
if($j % 1000000 -eq 0){
write-host 'creating new gz file'
$gzipStream.Close();
$gzipStream.Dispose()
$fileNum+=1
$gzFilename="c:\temp\gzip\$tableName.$fileNum.txt.gz"
$output = New-Object System.IO.FileStream $gzFilename, ([IO.FileMode]::Create), ([IO.FileAccess]::Write), ([IO.FileShare]::None)
[System.IO.Compression.GzipStream]$gzipStream = New-Object System.IO.Compression.GzipStream $output, ([IO.Compression.CompressionMode]::Compress)
}
}
Edit:
from the comments, [system.gc]::Collect() had no effect. Also, stripping it down to the simplest form and only reading a single field also had no effect. This code ramps up to 16GB memory (viewed in task manager) and then quits with OOM
$con=Open-Con ORA -tns $tns -userName $userName -fetchSize $fetchSize
$cmd = New-Object system.Data.OleDb.OleDbCommand($sql,$con);
$cmd.CommandTimeout = '0';
$reader=$cmd.ExecuteReader()
[int]$j=0
While ($reader.Read())
{
$str=$reader[0]
}

Possibly it's using up virtual address space rather than actual RAM. That's a common problem with the underlying .Net garbage collector used with (at least) the ADO.Net and string objects created here, especially if any of the records have fields with lots of text.
Building on that, it looks like you're doing most of the correct things to avoid this issue (using DataReader, writing directly to a stream, etc). What you could do to improve this is writing to the stream one field at a time, rather than using -join to push all the fields into the same string and then writing, and making sure we re-use the same $out array buffer (though I'm not sure exactly what this last looks like in PowerShell or with Encoding.GetBytes().
This may help, but it still can create issues with how it concatenates the fieldDelimiter and line terminator. If you find this runs for longer, but still eventually produces an error, you probably need to do the tedious work to have separate write operations to the gzip stream for each of those values.
$tableName='ACCOUNTS'
$fileNum=1
$gzFilename="c:\temp\gzip\$tableName.$fileNum.txt.gz"
$con=Open-Con ORA -tns $tns -userName $userName -fetchSize $fetchSize
$cmd = New-Object system.Data.OleDb.OleDbCommand($sql,$con);
$cmd.CommandTimeout = '0';
$output = New-Object System.IO.FileStream $gzFilename, ([IO.FileMode]::Create), ([IO.FileAccess]::Write), ([IO.FileShare]::None)
[System.IO.Compression.GzipStream]$gzipStream = New-Object System.IO.Compression.GzipStream $output, ([IO.Compression.CompressionMode]::Compress)
$encoding = [System.Text.Encoding]::UTF8
$reader=$cmd.ExecuteReader()
[int]$j=0
While ($reader.Read())
{
$j++
$fieldDelimiter= ""
$terminator = ""
for ($k=0;$k -lt $reader.Fieldcount;$k++) {
if ($k -eq $reader.Fieldcount - 1) { $terminator = "`n"}
$out = $encoding.GetBytes("$fieldDelimiter$($reader[$k])$terminator")
$gzipStream.Write($out,0,$out.length)
$fieldDelimiter= "|"
}
if($j % 10000 -eq 0){write-host $j}
if($j % 1000000 -eq 0){
write-host 'creating new gz file'
$gzipStream.Close();
$gzipStream.Dispose()
$fileNum+=1
$gzFilename="c:\temp\gzip\$tableName.$fileNum.txt.gz"
$output = New-Object System.IO.FileStream $gzFilename, ([IO.FileMode]::Create), ([IO.FileAccess]::Write), ([IO.FileShare]::None)
[System.IO.Compression.GzipStream]$gzipStream = New-Object System.IO.Compression.GzipStream $output, ([IO.Compression.CompressionMode]::Compress)
}
}

How save png as jpg without saving the file in dir

I'm using FromFile to get the image out of files, and it has the following error for the png's on the FromFile line:
Exception calling "FromFile" with "1" argument(s): "The given path's
format is not supported."
So, I'm trying to convert the bmp's to jpg, (see convert line above FromFile below) but all the examples I see (that seem usable) are saving the file. I don't want to save the file in the dir. All I need is the image format, so FromFile can use it like this example. I saw ConvertTo-Jpeg, but I don't think this is a standard powershell module, or don't see how to install it.
I saw this link, but I don't think that would leave the image in the format needed by FromFile.
This is my code:
$imageFile2 = Get-ChildItem -Recurse -Path $ImageFullBasePath -Include #("*.bmp","*.jpg","*.png") | Where-Object {$_.Name -match "$($pictureName)"} #$imageFile | Select-String -Pattern '$($pictureName)' -AllMatches
Write-Host $imageFile2
if($imageFile2.Exists)
{
if($imageFile2 -Match "png")
{
$imageFile2 | .\ConvertTo-Jpeg #I don't think this will work with FromFile below
}
$image = [System.Drawing.Image]::FromFile($imageFile2) step
}
else {
Write-Host "$($imageFile2) does not exist"
}
And then I put it in excel:
$xlsx = $result | Export-Excel -Path $outFilePath -WorksheetName $errCode -Autosize -AutoFilter -FreezeTopRow -BoldTopRow -PassThru # -ClearSheet can't ClearSheet every time or it clears previous data ###left off
$ws = $xlsx.Workbook.Worksheets[$errCode]
$ws.Dimension.Columns #number of columns
$tempRowCount = $ws.Dimension.Rows #number of rows
#only change width of 3rd column
$ws.Column(3).Width
$ws.Column(3).Width = 100
#Change all row heights
for ($row = 2 ;( $row -le $tempRowCount ); $row++)
{
#Write-Host $($ws.Dimension.Rows)
#Write-Host $($row)
$ws.Row($row).Height
$ws.Row($row).Height = 150
#place the image in spreadsheet
#https://github.com/dfinke/ImportExcel/issues/1041 https://github.com/dfinke/ImportExcel/issues/993
$drawingName = "$($row.PictureID)_Col3_$($row)" #Name_ColumnIndex_RowIndex
Write-Host $image
$picture = $ws.Drawings.AddPicture("$drawingName",$image)
$picture.SetPosition($row - 1, 0, 3 - 1, 0)
if($ws.Row($row).Height -lt $image.Height * (375/500)) {
$ws.Row($row).Height = $image.Height * (375/500)
}
if($ws.Column(3).Width -lt $image.Width * (17/120)){
$ws.Column(3).Width = $image.Width * (17/120)
}
}
Update:
I just wanted to reiterate that FromFile can't be used for a png image. So where Hey Scripting Guy saves the image like this doesn't work:
$image = [drawing.image]::FromFile($imageFile2)

I figured out that the $imageFile2 path has 2 filenames in it. It must be that two met the Get-ChildItem/Where-Object/match criteria. The images look identical, but have similar names, so will be easy to process. After I split the names, it does FromFile ok.

Powershell Performance

i have a Problem with powershell Performance while searching a 40gb log file.
i Need to check if any of 1000 email adresses are included in this 40gb file. This would take 180 hours :D any ideas?
$logFolder = "H:\log.txt"
$adressen= Get-Content H:\Adressen.txt
$ergebnis = #()
foreach ($adr in $adressen){
$suche = Select-String -Path $logFolder -Pattern "\[\(\'from\'\,.*$adr.*\'\)\]" -List
$aktiv= $false
$adr
if ($suche){
$aktiv = $true
}
if ($aktiv -eq $true){
$ergebnis+=$adr + ";Ja"
}
else{
$ergebnis+=$adr + ";Nein"
}
}
$ergebnis |Out-File H:\output.txt

Don't read the file 1000 times.
Build a regexp line with all 1000 addresses (it's gonna be a huge line, but hey, much smaller than 40TB). Like:
$Pattern = "\[\(\'from\'\,.*$( $adressen -join '|' ).*\'\)\]"
Then do your Select-String, and save the result to do an address-by-address search in it. Hopefully, the result will be much smaller than 40Gb, and should be much faster.

As mentioned in the comments, replace
$ergebnis = #()
with
$ergebnis = New-Object System.Collections.ArrayList
and
$ergebnis+=$adr + ";Ja"
with
$ergebnis.add("$adr;Ja")
or respective
$ergebnis.add("$adr;Nein")
This will speed up your script quite a bit.

Perl Out Of Memory

I have a script that reads two csv files and compares them to find out if an ID that appears in one also appears in the other. The error I am receiving is as follows:
Out of memory during "large" request for 67112960 bytes, total sbrk() is 348203008 bytes
And now for the code:
use strict;
use File::Basename;
my $DAT = $ARGV[0];
my $OPT = $ARGV[1];
my $beg_doc = $ARGV[2];
my $end_doc = $ARGV[3];
my $doc_counter = 0;
my $page_counter = 0;
my %opt_beg_docs;
my %beg_docs;
my ($fname, $dir, $suffix) = fileparse($DAT, qr/\.[^.]*/);
my $outfile = $dir . $fname . "._IMGLOG";
open(OPT, "<$OPT");
while(<OPT>){
my #OPT_Line = split(/,/, $_);
$beg_docs{#OPT_Line[0]} = "Y" if(#OPT_Line[3] eq "Y");
$opt_beg_docs{#OPT_Line[0]} = "Y";
}
close(OPT);
open(OUT, ">$outfile");
while((my $key, my $value) = each %opt_beg_docs){
print OUT "$key\n";
}
close(OUT);
open(DAT, "<$DAT");
readline(DAT); #skips header line
while(<DAT>){
$_ =~ s/\xFE//g;
my #DAT_Line = split(/\x14/, $_);
#gets the prefix and the range of the beg and end docs
(my $pre = #DAT_Line[$beg_doc]) =~ s/[0-9]//g;
(my $beg = #DAT_Line[$beg_doc]) =~ s/\D//g;
(my $end = #DAT_Line[$end_doc]) =~ s/\D//g;
#print OUT "BEGDOC: $beg ENDDOC: $end\n";
foreach($beg .. $end){
my $doc_id = $pre . $_;
if($opt_beg_docs{$doc_id} ne "Y"){
if($beg_docs{$doc_id} ne "Y"){
print OUT "$doc_id,DOCUMENT NOT FOUND IN OPT FILE\n";
$doc_counter++;
} else {
print OUT "$doc_id,PAGE NOT FOUND IN OPT FILE\n";
$page_counter++;
}
}
}
}
close(DAT);
close(OUT);
print "Found $page_counter missing pages and $doc_counter missing document(s)";
Basically I get all the ID's from the file I am checking against to see if the ID exists in. Then I loop over the and generate the ID's for the other file, because they are presented as a range. Then I take the generated ID and check for it in the hash of ID's.
Also forgot to note I am using Windows

You're not using use warnings;, you're not checking for errors on opening files, and you're not printing out debugging statements showing the lines that you are reading in.
Do you know what the input file looks like? If it has no line breaks, you are reading the entire file in all at once, which will be disastrous if it is large. Pay attention to how you are parsing the file.

I'm not sure if it's the cause of your error, but inside your loop where you're reading DAT, you probably want to replace this:
(my $pre = #DAT_Line[$beg_doc]) =~ s/[0-9]//g;
with this:
(my $pre = $DAT_Line[$beg_doc]) =~ s/[0-9]//g;
and same for the other two lines there.

You're closing your OUT file handle and then trying to print to it inside the DAT loop, which, I think might be outputting to random memory, since you closed the FILEHANDLE - surprised this didn't output an error.
Remove the first close(OUT); and see if that improves.
I still don't know what your question is, if it's about the error message it means you've run out of memory. If it's about the message itself - you're trying to consume too much memory. If it's why you're consuming too much memory, I'd first ask if you read my message above, then I'd ask how much memory your system has, then I'd follow up with seeing if it improves if you take the regex away.

How to split a huge folder?

We have a folder on Windows that's ... huge. I ran "dir > list.txt". The command lost response after 1.5 hours. The output file is about 200 MB. It shows there're at least 2.8 million files. I know the situation is stupid but let's focus the problem itself. If I have such a folder, how can I split it to some "manageable" sub-folders? Surprisingly all the solutions I have come up with all involve getting all the files in the folder at some point, which is a no-no in my case. Any suggestions?
Thank Keith Hill and Mehrdad. I accepted Keith's answer because that's exactly what I wanted to do but I couldn't quite get PS working quickly.
With Mehrdad's tip, I wrote this little program. It took 7+ hours to move 2.8 million files. So the initial dir command did finish. But somehow it didn't return to console.
namespace SplitHugeFolder
{
class Program
{
static void Main(string[] args)
{
var destination = args[1];
if (!Directory.Exists(destination))
Directory.CreateDirectory(destination);
var di = new DirectoryInfo(args[0]);
var batchCount = int.Parse(args[2]);
int currentBatch = 0;
string targetFolder = GetNewSubfolder(destination);
foreach (var fileInfo in di.EnumerateFiles())
{
if (currentBatch == batchCount)
{
Console.WriteLine("New Batch...");
currentBatch = 0;
targetFolder = GetNewSubfolder(destination);
}
var source = fileInfo.FullName;
var target = Path.Combine(targetFolder, fileInfo.Name);
File.Move(source, target);
currentBatch++;
}
}
private static string GetNewSubfolder(string parent)
{
string newFolder;
do
{
newFolder = Path.Combine(parent, Path.GetRandomFileName());
} while (Directory.Exists(newFolder));
Directory.CreateDirectory(newFolder);
return newFolder;
}
}
}

I use Get-ChildItem to index my whole C: drive every night into c:\filelist.txt. That's about 580,000 files and the resulting file size is ~60MB. Admittedly I'm on Win7 x64 with 8 GB of RAM. That said, you might try something like this:
md c:\newdir
Get-ChildItem C:\hugedir -r |
Foreach -Begin {$i = $j = 0} -Process {
if ($i++ % 100000 -eq 0) {
$dest = "C:\newdir\dir$j"
md $dest
$j++
}
Move-Item $_ $dest
}
The key is to do the move in a streaming manner. That is, don't collect up all the Get-ChildItem results into a single variable and then proceed. That would require all 2.8 million FileInfos to be in memory at once. Also, if you use the Name parameter on Get-ChildItem it will output a single string containing the file's path relative to the base dir. Even then, perhaps this size will just overwhelm the memory available to you. And no doubt, it will take quite a while to execute. IIRC correctly, my indexing script takes several hours.
If it does work, you should wind up with c:\newdir\dir0 thru dir28 but then again, I haven't tested this script at all so your mileage may vary. BTW this approach assumes that you're huge dir is a pretty flat dir.
Update: Using the Name parameter is almost twice as slow so don't use that parameter.

I found out the GetChildItem is the slowest option when working with many items in a directory.
Look at the results:
Measure-Command { Get-ChildItem C:\Windows -rec | Out-Null }
TotalSeconds : 77,3730275
Measure-Command { listdir C:\Windows | Out-Null }
TotalSeconds : 20,4077132
measure-command { cmd /c dir c:\windows /s /b | out-null }
TotalSeconds : 13,8357157
(with listdir function defined like this:
function listdir($dir) {
$dir
[system.io.directory]::GetFiles($dir)
foreach ($d in [system.io.directory]::GetDirectories($dir)) {
listdir $d
}
}
)
With this in mind, what I would do: I would stay in PowerShell but use more lowlevel approach with .NET methods:
function DoForFirst($directory, $max, $action) {
function go($dir, $options)
{
foreach ($f in [system.io.Directory]::EnumerateFiles($dir))
{
if ($options.Remaining -le 0) { return }
& $action $f
$options.Remaining--
}
foreach ($d in [system.io.directory]::EnumerateDirectories($dir))
{
if ($options.Remaining -le 0) { return }
go $d $options
}
}
go $directory (New-Object PsObject -Property #{Remaining=$max })
}
doForFirst c:\windows 100 {write-host File: $args }
# I use PsObject to avoid global variables and ref parameters.
To use the code you have to switch to .NET 4.0 runtime -- enumerating methods are new in .NET 4.0.
You can specify any scriptblock as -action parameter, so in your case it would be something like {Move-item -literalPath $args -dest c:\dir }.
Just try to list first 1000 items, I hope it will finish very quickly:
doForFirst c:\yourdirectory 1000 {write-host '.' -nonew }
And of course you can process all items at once, just use
doForFirst c:\yourdirectory ([long]::MaxValue) {move-item ... }
and each item should be processed immediately after it is returned. So the whole list is not read at once and then processed, but it is processed during reading.

How about starting with this:
cmd /c dir /b > list.txt
That should get you a list of all the file names.
If you're doing "dir > list.txt" from a powershell prompt, get-childitem is aliased as "dir". Get-childitem has known issues enumerating large directories, and the object collections it returns can get huge.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

PowersShell: Quickly Split File with commas inside Quotes - performance

Related

OledbDataReader using up all the RAM (in powershell)

How save png as jpg without saving the file in dir

Powershell Performance

Perl Out Of Memory

How to split a huge folder?

Categories

Resources