Formatting large text file in Windows PowerShell

I'm trying to format large text files (~300 MB) that have up to three pipe-delimited columns:
12345|123 Main St, New York|91110
23456|234 Main St, New York
34567|345 Main St, New York|91110
And the output should be:
000000000012345,"123 Main St, New York",91110,,,,,,,,,,,,
000000000023456,"234 Main St, New York",,,,,,,,,,,,,
000000000034567,"345 Main St, New York",91110,,,,,,,,,,,,
I'm new to PowerShell, but I've read that I should avoid Get-Content, so I am using StreamReader. It is still much too slow:
function append-comma {} # helper function to append the correct number of commas to each line
$separator = '|'
$infile = "\large_data.csv"
$outfile = "new_file.csv"
$target_file_in = New-Object System.IO.StreamReader -Arg $infile
if ($header -eq 'TRUE') {
    $firstline = $target_file_in.ReadLine() # skip header if it exists
}
while (!$target_file_in.EndOfStream) {
    $line = $target_file_in.ReadLine()
    $a = $line.split($separator)[0].trim()
    $b = ""
    $c = ""
    if ($dataType -eq 'ECN') { $a = $a.padleft(15,'0') }
    if ($line.split($separator)[1].length -gt 0) { $b = $line.split($separator)[1].trim() }
    if ($line.split($separator)[2].length -gt 0) { $c = $line.split($separator)[2].trim() }
    $line = $a + ',"' + $b + '","' + $c + '"'
    $line -replace '(?m)"([^,]*?)"(?=,|$)', '$1' | append-comma >> $outfile
}
$target_file_in.close()
I am building this for other people on my team and wanted to add a GUI using this guide:
http://blogs.technet.com/b/heyscriptingguy/archive/2014/08/01/i-39-ve-got-a-powershell-secret-adding-a-gui-to-scripts.aspx
Is there a faster way to do this in PowerShell?
I wrote a script using bash (Cygwin64 on Windows) and a separate one in Python. Both ran much faster, but I am trying to script something that would be "approved" on a Windows platform.

All that splitting and replacing costs you way more time than you gain from the StreamReader. The code below cut execution time to ~20% of the original for me:
$separator = '|'
$infile = "\large_data.csv"
$outfile = "new_file.csv"
if ($header -eq 'TRUE') {
    $linesToSkip = 1
} else {
    $linesToSkip = 0
}
Get-Content $infile | select -Skip $linesToSkip | % {
    # the [string] casts turn a missing third field ($null) into an empty string,
    # so $c.Trim() doesn't throw on two-column lines
    [int]$a, [string]$b, [string]$c = $_.split($separator)
    '{0:d15},"{1}",{2},,,,,,,,,,,,,' -f $a, $b.Trim(), $c.Trim()
} | Set-Content $outfile
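If that is still too slow, a further variant (my own sketch, not part of the answer) is to let Get-Content emit batches of lines with -ReadCount, which reduces per-object pipeline overhead; the inner loop then formats each line in the batch. Header skipping is omitted here for brevity:
$separator = '|'
Get-Content $infile -ReadCount 1000 | ForEach-Object {
    # $_ is an array of up to 1000 lines here
    foreach ($line in $_) {
        # [long] so IDs longer than 9 digits don't overflow the cast
        [long]$a, [string]$b, [string]$c = $line.Split($separator)
        '{0:d15},"{1}",{2},,,,,,,,,,,,,' -f $a, $b.Trim(), $c.Trim()
    }
} | Set-Content $outfile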

How does this work for you? I was able to read and process a 35MB file in about 40 seconds on a cheap ole workstation.
File Size: 36,548,820 bytes
Processed In: 39.7259722 seconds
Function CheckPath {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory=$True,
                   ValueFromPipeline=$True)]
        [string[]]$Path
    )
    BEGIN {}
    PROCESS {
        if ((Test-Path -LiteralPath $Path) -eq $false) { Write-Host "Invalid File Path $Path" }
    }
    END {}
}
$infile = "infile.txt"
$outfile = "restult5.txt"
#Check File Path
CheckPath $InFile
#Initiate StreamReader
$Reader = New-Object -TypeName System.IO.StreamReader($InFile);
#Create New File Stream Object For StreamWriter
$WriterStream = New-Object -TypeName System.IO.FileStream(
$outfile,
[System.IO.FileMode]::Create,
[System.IO.FileAccess]::Write);
#Initiate StreamWriter
$Writer = New-Object -TypeName System.IO.StreamWriter(
$WriterStream,
[System.Text.Encoding]::ASCII);
If ($header -eq $True) {
$Reader.ReadLine() |Out-Null #Skip First Line In File
}
while ($Reader.Peek() -ge 0) {
$line = $Reader.ReadLine() #Read Line
$Line = $Line.split('|') #Split Line
$OutPut = "$($($line[0]).PadLeft(15,'0')),`"$($Line[1])`",$($Line[2]),,,,,,,,,,,,"
$Writer.WriteLine($OutPut)
}
$Reader.Close();
$Reader.Dispose();
$Writer.Flush();
$Writer.Close();
$Writer.Dispose();
$endDTM = (Get-Date) #Get Script End Time For Measurement
Write-Host "Elapsed Time: $(($endDTM-$startDTM).totalseconds) seconds" #Echo Time elapsed

Regex is fast:
$infile = ".\large_data.csv"
gc $infile|%{
$x=if($_.indexof('|')-ne$_.lastindexof('|')){
$_-replace'(.+)\|(.+)\|(.+)',('$1,"$2",$3'+','*12)
}else{
$_-replace'(.+)\|(.+)',('$1,"$2"'+','*14)
}
('0'*(15-($x-replace'([^,]),.+','$1').length))+$x
}
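The final line is dense; for reference, a more readable equivalent of that zero-padding step (my own sketch) splits off the first field and pads it with PadLeft:
# Sketch: equivalent of ('0' * (15 - ...)) + $x
$first, $rest = $x -split ',', 2
'{0},{1}' -f $first.PadLeft(15, '0'), $rest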

I have another approach: let PowerShell read the input file as a CSV file with the pipe character as delimiter, then format the output the way you want it. I have not tested this for speed with large files.
$infile = "\large-data.csv"
$outfile = "new-file.csv"
import-csv $infile -header id,addr,zip -delimiter "|" |
% {'{0},"{1}",{2},,,,,,,,,,,,,' -f $_.id.padleft(15,'0'), $_.addr.trim(), $_.zip} |
set-content $outfile
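To see whether this holds up on a 300 MB file, the pipeline can be timed the same way as in the other answers (hypothetical usage):
(Measure-Command {
    Import-Csv $infile -Header id, addr, zip -Delimiter "|" |
        % { '{0},"{1}",{2},,,,,,,,,,,,,' -f $_.id.PadLeft(15,'0'), $_.addr.Trim(), $_.zip } |
        Set-Content $outfile
}).TotalSeconds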

Related

Powershell file updating with future and past time

I have a text file with content in this manner:
One;Thomas;Newyork;2020-12-31 14:00:00;0
Two;David;London;2021-01-31 12:00:00;0
Three;James;Chicago;2021-01-20 15:00:00;0
Four;Edward;India;2020-12-25 15:00:00;0
Of these entries, according to the date and time, two are past entries and two are future entries. The last 0 in each line is a flag; for past entries, that flag needs to be changed to 1.
Assume all the entries have been read into an array. I tried this block of code, but it does not solve the problem:
for ($item = 0; $item -lt $entries.count; $item++)
{
    if ($entries.DateTime[$item] -lt (Get-Date -Format "yyyy-MM-dd HH:mm:ss"))
    {
        $cont = Get-Content $entries -ErrorAction Stop
        $string = $entries.number[$item] + ";" + $entries.name[$item] + ";" +
                  $entries.city[$item] + ";" + $entries.DateTime[$item]
        $lineNum = $cont | Select-String $string
        $line = $lineNum.LineNumber + 1
        $cont[$line] = $string + ";1"
        Set-Content -path $entries
    }
}
I am getting errors with this approach.
The output should come out as:
One;Thomas;Newyork;2020-12-31 14:00:00;1 (past deployment with respect to the current date)
Two;David;London;2021-01-31 12:00:00;0
Three;James;Chicago;2021-01-20 15:00:00;0
Four;Edward;India;2020-12-25 15:00:00;1 (past deployment with respect to the current date)
This output needs to be written back over the file the content was extracted from, i.e. Entries.txt.
param(
    $exampleFileName = "d:\tmp\file.txt"
)
@"
One;Thomas;Newyork;2020-12-31 14:00:00;0
Two;David;London;2021-01-31 12:00:00;0
Three;James;Chicago;2021-01-20 15:00:00;0
Four;Edward;India;2020-12-25 15:00:00;0
"@ | Out-File $exampleFileName
Remove-Variable out -ErrorAction SilentlyContinue
Get-Content $exampleFileName | ForEach-Object {
    # past entries (date earlier than now) get their trailing flag flipped to 1
    $out += ($_ -and [datetime]::Parse(($_ -split ";")[3]) -lt [datetime]::Now) ? $_.SubString(0, $_.Length - 1) + "1`r`n" : $_ + "`r`n"
}
Out-File -InputObject $out -FilePath $exampleFileName
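Note that the ?: ternary operator requires PowerShell 7+. A sketch of the same loop body for Windows PowerShell 5.1, using a plain if/else:
Get-Content $exampleFileName | ForEach-Object {
    if ($_ -and [datetime]::Parse(($_ -split ";")[3]) -lt [datetime]::Now) {
        # past entry: replace the trailing 0 flag with 1
        $out += $_.Substring(0, $_.Length - 1) + "1`r`n"
    }
    else {
        $out += $_ + "`r`n"
    }
}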

Powershell script - break loop

Given this little script in PowerShell:
$index = 1
$file = "C:\Users\myfile"
while ($index -le 100000)
{
    $init_size = Write-Host((Get-Item $file).length/1KB)
    <here is my command which populates $file>
    $final_size = Write-Host((Get-Item $file).length/1KB)
    $index++
    sleep 5
    If ($final_size -eq $init_size) { break }
}
I don't understand why it breaks even when $init_size is different from $final_size.
Any suggestions?
Write-Host writes directly to the screen buffer and doesn't output anything, so the values of both $init_size and $final_size are effectively $null when you reach the if statement.
Do Write-Host $variable after assigning to $variable and it'll work:
$index = 1
$file = "C:\Users\myfile"
while ($index -le 100000) {
    $init_size = (Get-Item $file).Length / 1KB
    Write-Host $init_size
    <here is my command which populates $file>
    $final_size = (Get-Item $file).Length / 1KB
    Write-Host $final_size
    $index++
    sleep 5
    If ($final_size -eq $init_size) { break }
}
Calling Write-Host on the results of the assignment expression itself would work too:
Write-Host ($init_size = (Get-Item $file).Length / 1KB)
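A minimal illustration of the difference (my own example): Write-Host prints to the console but returns nothing to the pipeline, while Write-Output (or the bare expression) does return a value:
$a = Write-Host 'hello'    # prints "hello" to the console, assigns nothing
$b = Write-Output 'hello'  # assigns the string
$null -eq $a               # True
$null -eq $b               # False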

Powershell: improving LDIF file to CSV conversion

I have the below code to convert an LDIF file (over 100,000 lines) to a CSV file (over 4,000 lines), but I'm not sure I'm happy with the time it takes, although I don't know how long it should really take; maybe that's a normal time on my laptop (Core i5 7th gen, 16 GB RAM, SSD drive)?
Would there be any room for improvement? (Especially in the parsing, if possible, which takes 30 seconds.)
# Reducing & editing data to process:
# -----------------------------------
$original = Get-Content $IN_ldif_file
$reduced = (($original | Select-String -Pattern '^cust[A-Z]', '^$' -CaseSensitive).Line) -replace ':: ', ': ' -replace '^cust', ''
"Writing reduced LDIF file..." # < 1 sec
(Measure-Command { Set-Content $reducedLDIF -Value $reduced -Encoding UTF8 }).TotalSeconds

# Parsing the relevant data:
# --------------------------
$inData = New-Object -TypeName System.IO.StreamReader -ArgumentList $reducedLDIF
$a = @{} # initialize the temporary hash
$lineNum = $rcdNum = 0 # initialize the counters
"Parsing reduced LDIF file..." # 27-36 sec
(Measure-Command {
    # Begin reading and processing the input file:
    $results = while (-not $inData.EndOfStream)
    {
        $line = $inData.ReadLine()
        Write-Verbose "$("{0:D4}" -f ++$lineNum)|$("{0:D4}|" -f $rcdNum)$line"
        if (($line -match "^\s*$") -or $inData.EndOfStream)
        {
            # blank line or end of stream - dump the hash as an object and reinit the hash
            [PSCustomObject]$a
            $a = @{}
            $rcdNum++
        } else {
            # build up hash table for the object
            $key, $value = $line -split ": "
            $a[$key] = $value
        }
    }
    $inData.Close()
}).TotalSeconds

# Populating & writing the CSV file:
# ----------------------------------
"Populating the CSV data..." # 7-11 sec
(Measure-Command {
    $out = $results |
        select "Attribute01",
               "Attribute02",
               "Attribute03",
               <# etc... #>
               @{n="Attribute39"; E={$_."Attribute20"}}, # Attribute39 (not in LDIF) takes value of Attribute20
               "Attribute40"
}).TotalSeconds
"Writing CSV file..." # < 1 sec
(Measure-Command { $out | Export-CSV $OUT_csv_file -NoTypeInformation }).TotalSeconds
Note: I actually don't need to export the "$reduced" data to a file (e.g. "$reducedLDIF"), but the piece of code I found for the parsing seems to require a file.
Thanks!
So I found a way to cut the parsing time by almost half, by re-using the data in the $reduced variable that's already in memory:
$a = @{} # initialize the temporary hash
$lineNum = $rcdNum = 0 # initialize the counters
"Parsing reduced LDIF file..."
(Measure-Command {
    $results = ForEach ($line in $reduced) {
        Write-Verbose "$("{0:D6}" -f ++$lineNum)|$("{0:D4}|" -f $rcdNum)$line"
        if ($line -match "^\s*$")
        {
            # blank line - dump the hash as an object and reinit the hash
            [PSCustomObject]$a
            $a = @{}
            $rcdNum++
        }
        else {
            # build up hash table for the object
            $key, $value = $line -split ": "
            $a[$key] = $value
        }
    }
}).TotalSeconds
This is already more acceptable (about 16 sec instead of 30).
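Since $reduced is already in memory, another avenue worth measuring (a sketch of my own, not tested on the real data) is to join the lines once and split on blank lines, so each record is processed as a block; the per-line Write-Verbose call also costs time even when verbose output is off:
# Sketch: split the reduced LDIF into records on blank lines.
$results = ($reduced -join "`n") -split "`n\s*`n" | ForEach-Object {
    $a = @{}
    foreach ($pair in ($_ -split "`n")) {
        $key, $value = $pair -split ': ', 2
        if ($key) { $a[$key] = $value }
    }
    [PSCustomObject]$a
}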

Replacing multiple lines in a file

I'm trying to read lines from a file and replace each line that matches an entry in the first array with the corresponding line from the second array, so sometimes with different lines. I made a script, but I do not understand why it does not work.
$OldStrings = @(
    "desktopwidth:i:1440",
    "desktopheight:i:900",
    "winposstr:s:0,1,140,60,1596,999"
)
$NewStrings = @(
    "desktopwidth:i:1734",
    "desktopheight:i:990",
    "winposstr:s:0,1,50,7,1800,1036"
)
$LinesArray = Get-Content -Path 'C:\temp\My Copy\Default.rdp'
$LinesCount = $LinesArray.Count
for ($i = 0; $i -lt $LinesCount; $i++) {
    foreach ($OldString in $OldStrings) {
        foreach ($NewString in $NewStrings) {
            if ($LinesArray[$i] -like $OldString) {
                $LinesArray[$i] = $LinesArray[$i] -replace $OldString, $NewString
                Write-Host "`nline" $i "takes on value:" $LinesArray[$i] "`n" -ForegroundColor Gray
            }
        }
    }
}
The problem is probably that the file is not being read correctly at all.
After executing the script, I see only
line 2 takes on value: desktopwidth:i:1734
line 3 takes on value: desktopwidth:i:1734
line 5 takes on value: desktopwidth:i:1734
You're looping over both string arrays independently, so every old string gets paired with every new string. You want two loops: one over the lines in the file, and one over the indices of the string arrays, so each old string is paired with its own replacement. I think this should work:
$OldStrings = @(
    "desktopwidth:i:1440",
    "desktopheight:i:900",
    "winposstr:s:0,1,140,60,1596,999"
)
$NewStrings = @(
    "desktopwidth:i:1734",
    "desktopheight:i:990",
    "winposstr:s:0,1,50,7,1800,1036"
)
$LinesArray = Get-Content -Path 'C:\temp\My Copy\Default.rdp'
# loop through each line
for ($i = 0; $i -lt $LinesArray.Count; $i++)
{
    for ($j = 0; $j -lt $OldStrings.Count; $j++)
    {
        if ($LinesArray[$i] -match $OldStrings[$j])
        {
            $LinesArray[$i] = $LinesArray[$i] -replace $OldStrings[$j], $NewStrings[$j]
            Write-Host "`nline" $i "takes on value:" $LinesArray[$i] "`n" -ForegroundColor Gray
        }
    }
}
$LinesArray | Set-Content -Path 'C:\temp\My Copy\Default.rdp'
You don't need to bother checking the lines for matches. Since you have the replacements ready, just do the replacements outright; it should be faster this way as well.
$stringReplacements = @{
    "desktopwidth:i:1440" = "desktopwidth:i:1734"
    "desktopheight:i:900" = "desktopheight:i:990"
    "winposstr:s:0,1,140,60,1596,999" = "winposstr:s:0,1,50,7,1800,1036"
}
$path = 'C:\temp\My Copy\Default.rdp'
# Read the file in as a single string.
$fileContent = Get-Content $path | Out-String
# Iterate over each key/value pair
$stringReplacements.Keys | ForEach-Object {
    # Attempt the replacement for each search/replace pair
    $fileContent = $fileContent.Replace($_, $stringReplacements[$_])
}
# Write changes back to file.
# $fileContent | Set-Content $path
$stringReplacements is a key/value hash of search and replace strings. I don't see you writing the changes back to the file, so I left a line at the end for you to uncomment.
You could add the checks back in if you value the Write-Host lines, but I figured that was for debugging and you already know how to do that.
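A small aside (my addition): on PowerShell 3.0 and later, Get-Content -Raw reads the file as a single string directly, avoiding the Out-String round trip:
# Equivalent single-string read without Out-String (PS 3.0+).
$fileContent = Get-Content $path -Raw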

Powershell - Speeding up writing to files

I wrote this script to find all of the folders in a directory and, for each folder, check whether certain strings exist in a common file, and add them if they don't. I needed to insert the strings in particular places; not really knowing how to do that, I opted for a simpler find-and-replace at the spots where the strings needed to be inserted. Anyway, this script takes almost an hour to work through 800 files. I'm hoping some experienced members can point out ways to make the task quicker, as I have only been working with PowerShell for two days. Many thanks!
# First find and replace items.
$FindOne =
$ReplaceOneA =
$ReplaceOneB =
$ReplaceOneC =
# Second find and replace items.
$FindTwo =
$ReplaceTwo =
# Strings to test if exist.
# To avoid duplicate entries.
$PatternOne =
$PatternTwo =
$PatternThree =
$PatternFour =
# Gets window folder names.
$FilePath = "$ProjectPath\$Station\WINDOW"
$Folders = Get-ChildItem $FilePath | Where-Object { $_.mode -match "d" }
# Adds folder names to an array.
$FolderName = @()
$Folders | ForEach-Object { $FolderName += $_.name }
# Adds code to each builder file.
ForEach ($Name in $FolderName) {
    $File = "$FilePath\$Name\main.xaml"
    $Test = Test-Path $File
    # First tests if file exists. If not, no action.
    If ($Test -eq $True) {
        $StringOne = Select-String -Pattern $PatternOne -Path $File
        $StringTwo = Select-String -Pattern $PatternTwo -Path $File
        $StringThree = Select-String -Pattern $PatternThree -Path $File
        $StringFour = Select-String -Pattern $PatternFour -Path $File
        $Content = Get-Content $File
        # If namespaces or object don't exist, add them.
        If ($StringOne -eq $null) {
            $Content = $Content -Replace $FindOne, $ReplaceOneA
        }
        If ($StringTwo -eq $null) {
            $Content = $Content -Replace $FindOne, $ReplaceOneB
        }
        If ($StringThree -eq $null) {
            $Content = $Content -Replace $FindOne, $ReplaceOneC
        }
        If ($StringFour -eq $null) {
            $Content = $Content -Replace $FindTwo, $ReplaceTwo
        }
        $Content | Set-Content $File
    }
}
# End of program.
You could try writing to the file with a stream, like this:
$stream = [System.IO.StreamWriter] $File
$stream.WriteLine($content)
$stream.close()
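If managing a stream by hand feels heavy, an alternative sketch (my suggestion, assuming $content holds the array of lines from Get-Content) is a single .NET call; note that .NET resolves relative paths against the process working directory, so a full path is safest:
# Write every line in one call instead of piping through Set-Content.
[System.IO.File]::WriteAllLines($File, [string[]]$content)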
