i have a Problem with powershell Performance while searching a 40gb log file.
i Need to check if any of 1000 email adresses are included in this 40gb file. This would take 180 hours :D any ideas?
$logFolder = "H:\log.txt"
$adressen= Get-Content H:\Adressen.txt
$ergebnis = #()
foreach ($adr in $adressen){
$suche = Select-String -Path $logFolder -Pattern "\[\(\'from\'\,.*$adr.*\'\)\]" -List
$aktiv= $false
$adr
if ($suche){
$aktiv = $true
}
if ($aktiv -eq $true){
$ergebnis+=$adr + ";Ja"
}
else{
$ergebnis+=$adr + ";Nein"
}
}
$ergebnis |Out-File H:\output.txt
Don't read the file 1000 times.
Build a regexp line with all 1000 addresses (it's gonna be a huge line, but hey, much smaller than 40TB). Like:
$Pattern = "\[\(\'from\'\,.*$( $adressen -join '|' ).*\'\)\]"
Then do your Select-String, and save the result to do an address-by-address search in it. Hopefully, the result will be much smaller than 40Gb, and should be much faster.
As mentioned in the comments, replace
$ergebnis = #()
with
$ergebnis = New-Object System.Collections.ArrayList
and
$ergebnis+=$adr + ";Ja"
with
$ergebnis.add("$adr;Ja")
or respective
$ergebnis.add("$adr;Nein")
This will speed up your script quite a bit.
In Powershell, instead of downloaded a CSV file to drive from a web request, can you download a file directly into a string? I often pull earthquake data from the USGS database which outputs the data as a downloadable CVS file. The issue is data is pulled per day and when pulling years of data I end up pulling a thousand or more files, then pipe them together. If I can download directly into a string I could eliminate the read write file time needed for a thousand or more files by doing it all in memory.
The basic portion of code used to pull the file is:
$U = $env:userprofile
$Location = "Region_Earthquakes"
$MaxLat = "72.427"
$MinLat = "50.244"
$MaxLon = "-140.625"
$MinLon = "-176.133"
$yearspulled = 1
$ts = New-TimeSpan -Days 1
$Today = Get-Date
$StartDate = [datetime](((Get-Date -format "yyyy") - $yearspulled + 1).ToString() + "-01-01 00:00:00")
While ($StartDate -le $Today)
{$EndDate = $Startdate + $ts
$Start = $StartDate.ToString("yyyy-MM-dd HH:mm:ss")
$FileDate = $StartDate.ToString("yyyy_MM_dd")
$End = $EndDate.ToString("yyyy-MM-dd HH:mm:ss")
$url = "http://earthquake.usgs.gov/fdsnws/event/1/query.csv?starttime=$Start&endtime=$End&minmagnitude=-1.00&maxmagnitude=10.00&maxlatitude=$MaxLat&minlatitude=$MinLat&maxlongitude=$MaxLon&minlongitude=$MinLon&eventtype=earthquake&orderby=time"
$output = "$U\DownLoads\$Location" + "_$FileDate.csv"
(New-Object System.Net.WebClient).DownloadFile($url, $output)
Write-Output "Date pulled for $Start"
$Startdate = $Startdate + $ts}
When in doubt, read the documentation. The System.Net.WebClient class has a method DownloadString() that does exactly what you're asking.
WebClient.DownloadString Method
Namespace: System.Net
Assemblies: System.dll, netstandard.dll, System.Net.WebClient.dll, System.Net.dll
Downloads the requested resource as a String. The resource to download may be specified as either String containing the URI or a Uri.
Emphasis mine.
$s = (New-Object System.Net.WebClient).DownloadString($url)
I wrote a quick script to find the percentage of users in one user list (TEMP.txt) that are also in another user list (TEMP2.txt) It worked great for a while until my user lists got up above a couple 100,000 or so... its too slow. I want to convert it to runspace to speed it up, but I am failing miserably. The original script is:
$USERLIST1 = gc .\TEMP.txt
$i = 0
ForEach ($User in $USERLIST1){
If (gc .\TEMP2.txt |Select-String $User -quiet){
$i = $i + 1
}
}
$Count = gc .\TEMP2.txt | Measure-object -Line
$decimal = $i / $count.lines
$percent = $decimal * 100
Write-Host "$percent %"
Sorry I am still new at powershell.
Not sure how much this will help you, I am new with runspaces as well but here is some code I used with a Windows Form running things asynchronously in a separate runspace, you might be able to manipulate it to do what you need:
$Runspace = [Management.Automation.Runspaces.RunspaceFactory]::CreateRunspace($Host)
$Runspace.ApartmentState = 'STA'
$Runspace.ThreadOptions = 'ReuseThread'
$Runspace.Open()
#Add the Form object to the Runspace environment
$Runspace.SessionStateProxy.SetVariable('Form', $Form)
#Create a new PowerShell object (a Thread)
$PowerShellRunspace = [System.Management.Automation.PowerShell]::Create()
#Initializes the PowerShell object with the runspace
$PowerShellRunspace.Runspace = $Runspace
#Add the scriptblock which should run inside the runspace
$PowerShellRunspace.AddScript({
[System.Windows.Forms.Application]::Run($Form)
})
#Open and run the runspace asynchronously
$AsyncResult = $PowerShellRunspace.BeginInvoke()
#End the pipeline of the PowerShell object
$PowerShellRunspace.EndInvoke($AsyncResult)
#Close the runspace
$Runspace.Close()
#Remove the PowerShell object and its resources
$PowerShellRunspace.Dispose()
Apart from runspace concept, next script could run a bit faster:
$USERLIST1 = gc .\TEMP.txt
$USERLIST2 = gc .\TEMP2.txt
$i = 0
ForEach ($User in $USERLIST1) {
if ($USERLIST2.Contains($User)) {
$i += 1
}
}
$Count = $USERLIST2.Count
$decimal = $i / $count
$percent = $decimal * 100
Write-Host "$percent %"
Ok so we have a manual process that runs through PL/SQL Developer to run a query and then export to csv.
I am trying to automate that process using powershell since we are working in a windows environment.
I have created two files that seems to be exact duplicates from the automated and manual process but they don't work the same so I assume I am missing some hidden characters but I can't find them or figure out how to remove them.
The most obvious example of them working differently is opening them in excel. The manual file opens in excel automatically putting each column in it's own seperate column. The automated file instead puts everything into one column.
Can anybody shed some light? I am hoping that by resolving this or at least getting some info will help with the bigger problem of it not processing correctly.
Thanks.
ex one column
"rownum","year","month","batch","facility","transfer_facility","trans_dt","meter","ticket","trans_product","trans","shipper","customer","supplier","broker","origin","destination","quantity"
ex seperate column
"","ROWNUM","RPT_YR","RPT_MO","BATCH_NBR","FACILITY_CD","TRANSFER_FACILITY_CD","TRANS_DT","METER_NBR","TKT_NBR","TRANS_PRODUCT_CD","TRANS_CD","SHIPPER_CD","CUSTOMER_NBR","SUPPLIER_NBR","BROKER_CD","ORIGIN_CD","DESTINATION_CD","NET_QTY"
$connectionstring = "Data Source=database;User Id=user;Password=password"
$connection = New-Object System.Data.OracleClient.OracleConnection($connectionstring)
$command = New-Object System.Data.OracleClient.OracleCommand($query, $connection)
$connection.Open()
Write-Host -ForegroundColor Black " Opening Oracle Connection"
Start-Sleep -Seconds 2
#Getting data from oracle
Write-Host
Write-Host -ForegroundColor Black "Getting data from Oracle"
$Oracle_data=$command.ExecuteReader()
Start-Sleep -Seconds 2
if ($Oracle_data.read()){
Write-Host -ForegroundColor Green "Connection Success"
while ($Oracle_data.read()) {
#Variables for recordset
$rownum = $Oracle_data.GetDecimal(0)
$rpt_yr = $Oracle_data.GetDecimal(1)
$rpt_mo = $Oracle_data.GetDecimal(2)
$batch_nbr = $Oracle_data.GetString(3)
$facility_cd = $Oracle_data.GetString(4)
$transfer_facility_cd = $Oracle_data.GetString(5)
$trans_dt = $Oracle_data.GetDateTime(6)
$meter_nbr = $Oracle_data.GetString(7)
$tkt_nbr = $Oracle_data.GetString(8)
$trans_product_cd = $Oracle_data.GetString(9)
$trans_cd = $Oracle_data.GetString(10)
$shipper_cd = $Oracle_data.GetString(11)
$customer_nbr = $Oracle_data.GetString(12)
$supplier_nbr = $Oracle_data.GetString(13)
$broker_cd = $Oracle_data.GetString(14)
$origin_cd = $Oracle_data.GetString(15)
$destination_cd = $Oracle_data.GetString(16)
$net_qty = $Oracle_data.GetDecimal(17)
#Define new file
$filename = "Pipeline" #Get-Date -UFormat "%b%Y"
$filename = $filename + ".csv"
$fileLocation = $newdir + "\" + $filename
$fileExists = Test-Path $fileLocation
#Create object to hold record
$obj = new-object psobject -prop #{
rownum = $rownum
year = $rpt_yr
month = $rpt_mo
batch = $batch_nbr
facility = $facility_cd
transfer_facility = $transfer_facility_cd
trans_dt = $trans_dt
meter = $meter_nbr
ticket = $tkt_nbr
trans_product = $trans_product_cd
trans = $trans_cd
shipper = $shipper_cd
customer = $customer_nbr
supplier = $supplier_nbr
broker = $broker_cd
origin = $origin_cd
destination = $destination_cd
quantity = $net_qty
}
$records += $obj
}
}else {
Write-Host -ForegroundColor Red " Connection Failed"
}
#Write records to file with headers
$records | Select-Object rownum,year,month,batch,facility,transfer_facility,trans_dt,meter,ticket,trans_product,trans,shipper,customer,supplier,broker,origin,destination,quantity |
ConvertTo-Csv |
Select -Skip 1|
Out-File $fileLocation
Why are you skipping the first row(usually the headers)? Also, try using Export-CSV instead:
#Write records to file with headers
$records | Select-Object rownum, year, month, batch, facility, transfer_facility, trans_dt, meter, ticket, trans_product, trans, shipper, customer, supplier, broker, origin, destination, quantity |
Export-Csv $fileLocation -NoTypeInformation
I have to work with a huge number of text files. I am able to consolidate the files into one single file. But I also have the use of the file name in my work and I would like to have it before the text of the file itself in excel format, preferably the first column should contain the names of files and the columns afterwards can contain the data.
Any help would be appreciated. Thanks.
Here's the Powershell script. You might need to modify it a bit to look for specific file extensions as now it's only looking for PS1 files
[System.Threading.Thread]::CurrentThread.CurrentCulture = New-Object System.Globalization.CultureInfo("en-US")
$excel = new-Object -comobject Excel.Application
$excel.visible = $false
$workBook = $excel.Workbooks.Add()
$sheet = $workBook.Sheets.Item(1)
$sheet.Name = "Files"
$sheet.Range("A1", "B1").Font.Bold = $true
$sheet.Range("A1","A2").ColumnWidth = 40
$sheet.Range("B1","B2").ColumnWidth = 100
$sheet.Cells.Item(1,1) = "Filename"
$sheet.cells.Item(1,2) = "Content"
$files = get-childitem C:\PST -recurse | where {$_.extension -eq ".ps1"}
$index = 2
foreach($file in $files)
{
$sheet.Cells.Item($index,1) = $file.FullName
$sheet.Cells.Item($index,2) = [System.IO.File]::ReadAllText($file.FullName)
$index++
}
$workBook.SaveAs("C:\PST\1.xlsx")
$excel.Quit()
Note: I'm not pretending that it's perfect, you still need to polish it and refactor it, but at least it will give you direction