This code returns the unique and shared lines between two files. Unfortunately, it effectively runs forever when the files have 1 million lines. Is there a faster way to do this (e.g., -eq, -match, wildcards, Compare-Object), or are the containment operators the optimal approach?
$afile = Get-Content (Read-Host "Enter 'A' file")
$bfile = Get-Content (Read-Host "Enter 'B' file")
$afile |
? { $bfile -notcontains $_ } |
Set-Content lines_ONLY_in_A.txt
$bfile |
? { $afile -notcontains $_ } |
Set-Content lines_ONLY_in_B.txt
$afile |
? { $bfile -contains $_ } |
Set-Content lines_in_BOTH_A_and_B.txt
As mentioned in my answer to a previous question of yours, -contains is a slow operation, particularly with large arrays.
For exact matches you could use Compare-Object and discriminate the output by side indicator:
Compare-Object $afile $bfile -IncludeEqual | ForEach-Object {
switch ($_.SideIndicator) {
'<=' { $_.InputObject | Add-Content 'lines_ONLY_in_A.txt' }
'=>' { $_.InputObject | Add-Content 'lines_ONLY_in_B.txt' }
'==' { $_.InputObject | Add-Content 'lines_in_BOTH_A_and_B.txt' }
}
}
If that's still too slow, try reading each file into a hashtable:
$afile = Get-Content (Read-Host "Enter 'A' file")
$ahash = @{}
$afile | ForEach-Object {
$ahash[$_] = $true
}
and process the files like this (with a $bhash built the same way from the B file):
$afile | Where-Object {
-not $bhash.ContainsKey($_)
} | Set-Content 'lines_ONLY_in_A.txt'
If that still doesn't help, you need to identify the bottleneck (reading the files, comparing the data, doing multiple comparisons, ...) and proceed from there.
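For that last point, a rough way to see where the time goes is to time each phase separately with Measure-Command. A minimal sketch, assuming $afile and $bhash already exist as in the snippets above; the file name is a placeholder:
Measure-Command { Get-Content 'A.txt' | Out-Null }                                    # reading the file
Measure-Command { $h = @{}; foreach ($line in $afile) { $h[$line] = $true } }         # building the lookup table
Measure-Command { $afile | Where-Object { -not $bhash.ContainsKey($_) } | Out-Null }  # doing the comparison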
try this:
$All=@()
$All+= Get-Content "c:\temp\a.txt" | %{[pscustomobject]@{Row=$_;File="A"}}
$All+= Get-Content "c:\temp\b.txt" | %{[pscustomobject]@{Row=$_;File="B"}}
$All | group row | %{
$InA=$_.Group.File.Contains("A")
$InB=$_.Group.File.Contains("B")
if ($InA -and $InB)
{
$_.Group.Row | select -unique | Out-File c:\temp\lines_in_A_And_B.txt -Append
}
elseif ($InA)
{
$_.Group.Row | select -unique | Out-File c:\temp\lines_Only_A.txt -Append
}
else
{
$_.Group.Row | select -unique | Out-File c:\temp\lines_Only_B.txt -Append
}
}
Full code for the best option (@Ansgar Wiechers's hashtable approach): A-only lines, B-only lines, and lines shared by A and B:
$afile = Get-Content (Read-Host "Enter 'A' file")
$ahash = @{}
$afile | ForEach-Object {
$ahash[$_] = $true
}
$bfile = Get-Content (Read-Host "Enter 'B' file")
$bhash = @{}
$bfile | ForEach-Object {
$bhash[$_] = $true
}
$afile | Where-Object {
-not $bhash.ContainsKey($_)
} | Set-Content 'lines_ONLY_in_A.txt'
$bfile | Where-Object {
-not $ahash.ContainsKey($_)
} | Set-Content 'lines_ONLY_in_B.txt'
$afile | Where-Object {
$bhash.ContainsKey($_)
} | Set-Content 'lines_in_BOTH_A_and_B.txt'
Considering my suggestion to do a binary search, I have created a reusable Search-SortedArray function for this:
Description
The Search-SortedArray function (alias Search) performs a binary search for a string in a sorted array. If the string is found, the index of the found string in the array is returned; if the string is not found, $Null is returned.
Function Search-SortedArray ([String[]]$SortedArray, [String]$Find, [Switch]$CaseSensitive) {
$l = 0; $r = $SortedArray.Count - 1
While ($l -le $r) {
$m = [int](($l + $r) / 2)
Switch ([String]::Compare($find, $SortedArray[$m], !$CaseSensitive)) {
-1 {$r = $m - 1}
1 {$l = $m + 1}
Default {Return $m}
}
}
}; Set-Alias Search Search-SortedArray
$afile |
? {(Search $bfile $_) -eq $Null} |
Set-Content lines_ONLY_in_A.txt
$bfile |
? {(Search $afile $_) -eq $Null} |
Set-Content lines_ONLY_in_B.txt
$afile |
? {(Search $bfile $_) -ne $Null} |
Set-Content lines_in_BOTH_A_and_B.txt
Note 1: Due to the overhead, a binary search will only give an advantage with (very) large arrays.
Note 2: The array has to be sorted, otherwise the result will be unpredictable.
Note 3: The search doesn't account for duplicates. In case of duplicate values, just one index will be returned (which isn't a concern for this specific question).
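Regarding Note 2: if the input files aren't sorted yet, you could sort them up front. A minimal sketch, assuming the default Sort-Object collation matches the comparison done inside Search-SortedArray:
$afile = Get-Content (Read-Host "Enter 'A' file") | Sort-Object
$bfile = Get-Content (Read-Host "Enter 'B' file") | Sort-Object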
Added 2017-11-07 based on the comment from @Ansgar Wiechers:
Quick benchmark with 2 files with a couple thousand lines each (including duplicate lines): binary search: 2400ms; compare-object: 1850ms; hashtable lookup: 250ms
The idea is that the binary search pays off in the long run: the larger the arrays, the more it gains proportionally in performance.
Taking $afile |? { $bfile -notcontains $_ } as an example, and assuming the performance measurements in the comment apply with “a couple thousand lines” meaning 3000 lines:
For a standard search, you will need an average of 1500 iterations in the $bfile:*1
(3000 + 1) / 2 = 3001 / 2 = 1500
For a binary search, you will need an average of 6.27 iterations in the $bfile:
(log2 3000 + 1) / 2 = (11.55 + 1) / 2 = 6.27
In both situations you do this 3000 times (for each item in $afile)
This means that each single iteration takes:
For a standard search: 250ms / 1500 / 3000 ≈ 56 nanoseconds
For a binary search: 2400ms / 6.27 / 3000 ≈ 127482 nanoseconds
The break-even point will be at about:
56 * ((x + 1) / 2 * 3000) = 127482 * ((log2 x + 1) / 2 * 3000)
which is (according to my calculations) at about 40000 entries.
*1 presuming that a hashtable lookup doesn’t do a binary search itself as it is unaware that the array is sorted
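For what it's worth, the same break-even inequality can also be solved numerically; a quick sketch using the per-iteration costs assumed above:
$tLinear = 56        # assumed nanoseconds per linear-search iteration (from above)
$tBinary = 127482    # assumed nanoseconds per binary-search iteration (from above)
$x = 2
while ($tLinear * (($x + 1) / 2) -lt $tBinary * (([math]::Log($x, 2) + 1) / 2)) { $x++ }
$x    # roughly the array size at which the binary search starts to win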
Added 2017-11-07
Conclusion from the comments: hash tables appear to use a similar associative-array algorithm that can't be outperformed with low-level programming commands.
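For completeness, the same set logic can also be written against .NET's HashSet[string], which exposes the set operations directly. A minimal sketch (PowerShell 5+ syntax; note the default comparer is case sensitive, unlike -contains or hashtable keys, so pass [System.StringComparer]::OrdinalIgnoreCase to the constructors if you need case-insensitive behaviour — treat this as an assumption, not a drop-in replacement):
$aset = [System.Collections.Generic.HashSet[string]]::new([string[]]$afile)
$bset = [System.Collections.Generic.HashSet[string]]::new([string[]]$bfile)
$onlyA = [System.Collections.Generic.HashSet[string]]::new($aset)
$onlyA.ExceptWith($bset)        # lines only in A
$onlyB = [System.Collections.Generic.HashSet[string]]::new($bset)
$onlyB.ExceptWith($aset)        # lines only in B
$both = [System.Collections.Generic.HashSet[string]]::new($aset)
$both.IntersectWith($bset)      # lines in both A and B
$onlyA | Set-Content 'lines_ONLY_in_A.txt'
$onlyB | Set-Content 'lines_ONLY_in_B.txt'
$both | Set-Content 'lines_in_BOTH_A_and_B.txt'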
Related
I have a text file that has content in this manner:
One;Thomas;Newyork;2020-12-31 14:00:00;0
Two;David;London;2021-01-31 12:00:00;0
Three;James;Chicago;2021-01-20 15:00:00;0
Four;Edward;India;2020-12-25 15:00:00;0
In these entries, according to date and time, two are past entries and two are future entries. The last 0 in each string is the flag. For the past entries that flag needs to be changed to 1.
Consider that all the entries are separated into an array. I tried this block of code, but it's not working to solve the problem here.
for ($item=0 ; $item -lt $entries.count ; $item++)
{
if ($entries.DateTime[$item] -lt (Get-Date -Format "yyyy-MM-dd HH:mm:ss"))
{
$cont = Get-Content $entries -ErrorAction Stop
$string = $entries.number[$item] + ";" + $entries.name[$item] + ";" +
$entries.city[$item]+ ";" + $entries.DateTime[$item]
$lineNum = $cont | Select-String $string
$line = $lineNum.LineNumber + 1
$cont[$line] = $string + ";1"
Set-Content -path $entries
}
}
I am getting errors with this concept.
Output should come as:
One;Thomas;Newyork;2020-12-31 14:00:00;1 (Past Deployment with respect to current date)
Two;David;London;2021-01-31 12:00:00;0
Three;James;Chicago;2021-01-20 15:00:00;0
Four;Edward;India;2020-12-25 15:00:00;1 (Past Deployment with respect to current date)
This output needs to be written back to the file from which the content was extracted, i.e. Entries.txt.
param(
$exampleFileName = "d:\tmp\file.txt"
)
#"
One;Thomas;Newyork;2020-12-31 14:00:00;0
Two;David;London;2021-01-31 12:00:00;0
Three;James;Chicago;2021-01-20 15:00:00;0
Four;Edward;India;2020-12-25 15:00:00;0
"# | Out-File $exampleFileName
Remove-Variable out -ErrorAction SilentlyContinue
Get-Content $exampleFileName | ForEach-Object {
$out += ($_ -and [datetime]::Parse(($_ -split ";")[3]) -lt [datetime]::Now) ? $_.SubString(0,$_.Length-1) + "1`r`n" : $_ + "`r`n"
}
Out-File -InputObject $out -FilePath $exampleFileName
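Note that the ? : ternary operator used above requires PowerShell 7 or later. On Windows PowerShell 5.1 a plain if/else does the same job; a rough equivalent under the same parsing assumptions:
$out = Get-Content $exampleFileName | ForEach-Object {
    if ($_ -and [datetime]::Parse(($_ -split ';')[3]) -lt [datetime]::Now) {
        $_.Substring(0, $_.Length - 1) + '1'   # past entry: flip the trailing flag to 1
    } else {
        $_                                     # future (or empty) entry: leave unchanged
    }
}
$out | Set-Content $exampleFileName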
I'm looking for a way to accelerate the (Windows 10) PowerShell command Where-Object for a sorted array.
In the end the array will contain thousands of lines from a log file. All lines in the log file start with date and time and are sorted by date/time (new lines will always be appended).
The following command would work but is extremely slow and ineffective with a sorted array:
$arrFileContent | where {($_ -ge $Start) -and ($_ -le $End)}
Here is a (strongly simplified) example:
$arrFileContent = @("Bernie", "Emily", "Fred", "Jake", "Keith", "Maria", "Paul", "Richard", "Sally", "Tim", "Victor")
$Start = "E"
$End = "P"
Expected result: "Emily", "Fred", "Jake", "Keith", "Maria", "Paul".
I guess that using "nested intervals" it should be much faster, e.g. "find the first entry starting with 'E' or above and the last entry starting with 'P' or below, and return all entries in between".
I suppose there must be a simple PowerShell or .NET solution for this, so I won't have to code it myself, correct?
Edit 31.08.19: Not sure if "nested intervals" (German "Intervallschachtelung") is the right term.
What I mean is the "telephone book principle": Open the book in the middle, check if the wanted name is listed before or after, open the book in the middle of the first (or last) half, and so on.
In this case (checking 100,000 lines of a log file for a given date range):
- check line no. 50,000
- if it is after the given start date, check line no. 75,000; else check no. 25,000
- check line no. 75,000 (or 25,000)
- if it is after the given start date, check line no. 87,500 (or ...); else check no. 62,500 (or ...)
and so on ...
The log file contains lines like this:
2018-01-17 14:28:19 Installation xxx started
(only with a lot more text)
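For the record, the "telephone book" idea above corresponds to a binary search on the two range boundaries. A minimal sketch using [Array]::BinarySearch, assuming the lines are sorted by their leading yyyy-MM-dd HH:mm:ss timestamp (so ordinal string order matches time order) and that $Start and $End are full timestamps in the same format; the values below are made up:
$Start = '2018-01-17 00:00:00'            # hypothetical lower boundary
$End   = '2018-01-17 23:59:59'            # hypothetical upper boundary
$cmp = [System.StringComparer]::Ordinal   # must match how the lines are ordered

$i = [Array]::BinarySearch($arrFileContent, $Start, $cmp)
if ($i -lt 0) { $i = -bnot $i }           # index of the first line -ge $Start
$j = [Array]::BinarySearch($arrFileContent, $End, $cmp)
if ($j -lt 0) { $j = (-bnot $j) - 1 }     # index of the last line -le $End
$arrFileNarrowed = if ($i -le $j) { $arrFileContent[$i..$j] } else { @() }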
Let's measure all the ways mentioned in the comments. Let's mimic thousands of lines from a log file using Get-ChildItem:
$arrFileContent = (
Get-ChildItem d:\bat\* -File -Recurse -ErrorAction SilentlyContinue
).Name | Sort-Object -Unique
$Start = "E"
$End = "P"
$arrFileContent.Count
('Where-Object', $(Measure-Command {
$arrFileNarrowed = $arrFileContent | Where-Object {
($_ -ge $Start) -and ($_ -le $End)
}
}).TotalMilliseconds, $arrFileNarrowed.Count) -join "`t"
('Where method', $(Measure-Command {
$arrFileNarrowed = $arrFileContent.Where( {
($_ -ge $Start) -and ($_ -le $End)
})
}).TotalMilliseconds, $arrFileNarrowed.Count) -join "`t"
('foreach + if', $(Measure-Command {
$arrFileNarrowed = foreach ($OneName in $arrFileContent) {
if ( ($OneName -ge $Start) -and ($OneName -le $End) ) {
$OneName
}
}
}).TotalMilliseconds, $arrFileNarrowed.Count) -join "`t"
Output using Get-ChildItem d:\bat\*:
D:\PShell\SO\56993333.ps1
2777
Where-Object 111,5433 535
Where method 56,8577 535
foreach + if 6,542 535
Output using Get-ChildItem d:\* (much more names):
D:\PShell\SO\56993333.ps1
89570
Where-Object 4056,604 34087
Where method 1636,9539 34087
foreach + if 422,8259 34087
"Nested intervals", to me, means "intervals within intervals." I think I'd describe what you're looking to do is select a range. We can exploit the fact that the data is sorted to stop enumerating as soon as the end of the range is found.
.NET's LINQ queries allow us to do this easily. Assuming this content for Names.txt...
Bernie
Emily
Fred
Jake
Keith
Maria
Paul
Richard
Sally
Tim
Victor
...in C# the filtering would be as simple as...
IEnumerable<string> filteredNames = System.IO.File.ReadLines("Names.txt")
.Where(name => name[0] >= 'E')
.TakeWhile(name => name[0] <= 'P');
ReadLines() enumerates the lines of the file, Where() filters the output of ReadLines() (setting a lower bound on the range), and TakeWhile() stops enumerating Where() (and, therefore, ReadLines()) once its condition is no longer true (setting an upper bound on the range). This is all very efficient because A) the file is enumerated rather than read entirely into memory and B) enumeration stops as soon as the end of the desired range is reached.
We can invoke LINQ methods from PowerShell, too, but since PowerShell supports neither extension methods nor lambda expressions, the equivalent code is a little more verbose...
$source = [System.IO.File]::ReadLines($inputFilePath)
$rangeStartPredicate = [Func[String, Boolean]] {
$name = $args[0]
return $name[0] -ge [Char] 'E'
}
$rangeEndPredicate = [Func[String, Boolean]] {
$name = $args[0]
return $name[0] -le [Char] 'P'
}
$filteredNames = [System.Linq.Enumerable]::TakeWhile(
[System.Linq.Enumerable]::Where($source, $rangeStartPredicate),
$rangeEndPredicate
)
In order for this to work you have to invoke the static LINQ methods directly and get all of the types correct. Thus, the first parameter of Where() is a System.Collections.Generic.IEnumerable[String], which is what ReadLines() returns (that's why I used a file for this). The predicate parameters of Where() and TakeWhile() are of type [Func[String, Boolean]] (a function that takes a String and returns a Boolean), which is why the ScriptBlocks must be explicitly cast to that type.
After this code executes $filteredNames will contain a query object; that is, it doesn't contain the results but rather a blueprint for how to get the results...
PS> $filteredNames.GetType()
IsPublic IsSerial Name BaseType
-------- -------- ---- --------
False False <TakeWhileIterator>d__27`1 System.Object
Only when the query is executed/evaluated does file enumeration and filtering actually occur...
PS> $filteredNames
Emily
Fred
Jake
Keith
Maria
Paul
If you are going to access the results multiple times you should store them in an array to avoid reading the file multiple times...
PS> $filteredNames = [System.Linq.Enumerable]::ToArray($filteredNames)
PS> $filteredNames.GetType()
IsPublic IsSerial Name BaseType
-------- -------- ---- --------
True True String[] System.Array
PS> $filteredNames
Emily
Fred
Jake
Keith
Maria
Paul
I tried a variation on @josefz's answer. I didn't get amazing results from breaking out once I was past the last line I wanted; even when the range was just 'a' to 'b', I only saved about a minute. Maybe the slowness is due to Get-Content? "Get-content log" will be slower than "get-content -readcount -1 log".
$arrFileContent = Get-ChildItem -name -File -Recurse | select -first 89570 | sort -u
$start = 'e'
$end = 'p'
measure-command {
$arrFileNarrowed = foreach ($OneName in $arrFileContent) {
if ($OneName -ge $Start) {
if ($OneName -le $End ) {
$OneName
}
}
}
} | fl seconds, milliseconds
# break early
measure-command {
$arrFileNarrowed = foreach ($OneName in $arrFileContent) {
if ($OneName -ge $Start) {
if ($OneName -le $End ) {
$OneName
} else {
break
}
}
}
} | fl seconds, milliseconds
Output:
Seconds : 1
Milliseconds : 207
Seconds : 1
Milliseconds : 174
Trying out get-content vs switch -file:
$start = 'e'
$end = 'p'
# uses more memory
measure-command {
$result1 = get-content -readcount -1 log | foreach { $_ |
where { $_ -ge $start -and $_ -le $end } }
} | fl seconds,milliseconds
measure-command {
$result2 = switch -file log {
{ $_ -ge $start -and $_ -le $end } { $_ } }
} | fl seconds,milliseconds
Output:
Seconds : 4
Milliseconds : 491
Seconds : 2
Milliseconds : 747
It is also effective to simply replace
$arr | where { <expression> }
with
$arr | & { process { if (<expression>) { $_ } } }
For example:
$arrFileContent | & { process { if ($_ -ge $Start -and $_ -le $End) { $_ } } }
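If you want to verify the gain on your own data, a quick comparison of the two forms with Measure-Command might look like this (illustrative only):
(Measure-Command { $arrFileContent | Where-Object { $_ -ge $Start -and $_ -le $End } }).TotalMilliseconds
(Measure-Command { $arrFileContent | & { process { if ($_ -ge $Start -and $_ -le $End) { $_ } } } }).TotalMilliseconds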
I'm having an issue sorting a hash table. I've broken down my code to just bare necessities so as not to overwhelm anyone with my original script.
Write-Host "PowerShell Version = " ([string]$psversiontable.psversion)
$h = @{}
$Value = @{SortOrder=1;v1=1;}
$h.Add(1, $Value)
$Value = @{SortOrder=2;v1=1;}
$h.Add(2, $Value)
$Value = @{SortOrder=3;v1=1;}
$h.Add(3, $Value)
$Value = @{SortOrder=4;v1=1;}
$h.Add(4, $Value)
Write-Host "Ascending"
foreach($f in $h.GetEnumerator() | Sort-Object Value.SortOrder)
{
Write-Host $f.Value.SortOrder
}
Write-Host "Descending"
foreach($f in $h.GetEnumerator() | Sort-Object Value.SortOrder -descending)
{
Write-Host $f.Value.SortOrder
}
The output is
PowerShell Version = 3.0
Ascending
2
1
4
3
Descending
2
1
4
3
I'm sure this is just a simple case of not knowing the correct usage of Sort-Object. The sort works correctly on Sort-Object Name so maybe it has something to do with not knowing how to handle the Value.SortOrder?
Sort-Object accepts a property name or a script block used to sort. Since you're trying to sort on a property of a property, you'll need to use a script block:
Write-Host "Ascending"
$h.GetEnumerator() |
Sort-Object { $_.Value.SortOrder } |
ForEach-Object { Write-Host $_.Value.SortOrder }
Write-Host "Descending"
$h.GetEnumerator() |
Sort-Object { $_.Value.SortOrder } -Descending |
ForEach-Object { Write-Host $_.Value.SortOrder }
You can filter using the Where-Object cmdlet:
Write-Host "Ascending"
$h.GetEnumerator() |
Where-Object { $_.Name -ge 2 } |
Sort-Object { $_.Value.SortOrder } |
ForEach-Object { Write-Host $_.Value.SortOrder }
You usually want to put Where-Object before any Sort-Object cmdlets, since filtering first leaves fewer items to sort.
I was using a hash table as a frequency table, to count the occurrence of words in filenames.
$words = @{}
get-childitem *.pdf | foreach-object -process {
$name = $_.name.substring($_.name.indexof("-") + 1, $_.name.indexof(".") - $_.name.indexof("-") - 1)
$name = $name.replace("_", " ")
$word = $name.split(" ")[0]
if ( $words.contains($word) ){
$words[$word] = $words[$word] + 1
}else{
$words.add($word, 1)
}
}
$words.getenumerator() | sort-object -property value
It's that last line that does the magic, sorting the hash table on the value (the frequency).
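If you want the most frequent words first, the same enumeration can be sorted descending and trimmed; for example:
$words.GetEnumerator() | Sort-Object -Property Value -Descending | Select-Object -First 10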
I have a folder A with 75000 files which are to be processed. I have four folders (B, C, D, E) alongside it which can process 3000 files at a time.
I want a script to take 3000 files from A and put them in B. It should then take another 3000 files and put them in C, then D, and finally E.
Below is the code I have so far. This takes 10 files and moves them into B, but then it just sits forever without putting any files into C, D, or E.
Is there a way to quit out of the EnumerateFiles section of code? I just want the first X files it finds to get moved, I don't care about how many files are in A.
Any idea?
$dirBase = "\\networkDir\A\"
$dirProc1 = "\\networkDir\B\"
$dirProc2 = "\\networkDir\C\"
$dirProc3 = "\\networkDir\D\"
$dirProc4 = "\\networkDir\E\"
cd $dirBase
$directoryInfo1 = Get-ChildItem $dirProc1 | Measure-Object
$directoryInfo2 = Get-ChildItem $dirProc2 | Measure-Object
$directoryInfo3 = Get-ChildItem $dirProc3 | Measure-Object
$directoryInfo4 = Get-ChildItem $dirProc4 | Measure-Object
if ($directoryInfo1.count -eq 0) {
MoveFiles $dirBase $dirProc1
}
if ($directoryInfo2.count -eq 0) {
MoveFiles $dirBase $dirProc2
}
if ($directoryInfo3.count -eq 0) {
MoveFiles $dirBase $dirProc3
}
if ($directoryInfo4.count -eq 0) {
MoveFiles $dirBase $dirProc4
}
function MoveFiles([string]$srcDir, [string]$dest)
{
$FileLimit = 10
$Counter = 0
[IO.Directory]::EnumerateFiles($srcDir) | Where-Object {$Counter -lt $FileLimit} | %{
#Get-ChildItem $srcDir | Select-Object -first $FileLimit | %{
Move-Item $_ -destination $dest
$Counter++
}
}
Get-ChildItem $dirProc1 | select -first 3000
?
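Expanding on that suggestion: Select-Object -First N stops the upstream pipeline once N items have gone through, so a rewrite of MoveFiles along these lines should avoid enumerating all 75000 files. This is a sketch using the folder variables from the question, not a tested drop-in:
function MoveFiles([string]$srcDir, [string]$dest, [int]$FileLimit = 3000)
{
    Get-ChildItem -Path $srcDir -File |
        Select-Object -First $FileLimit |
        Move-Item -Destination $dest
}
MoveFiles $dirBase $dirProc1    # define the function before the calls that use it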
How can I get a du-ish analysis using PowerShell? I'd like to periodically check the size of directories on my disk.
The following gives me the size of each file in the current directory:
foreach ($o in gci)
{
Write-output $o.Length
}
But what I really want is the aggregate size of all files in the directory, including subdirectories. Also I'd like to be able to sort it by size, optionally.
There is an implementation available at the "Exploring Beautiful Languages" blog:
"An implementation of 'du -s *' in Powershell"
function directory-summary($dir=".") {
get-childitem $dir |
% { $f = $_ ;
get-childitem -r $_.FullName |
measure-object -property length -sum |
select @{Name="Name";Expression={$f}},Sum}
}
(Code by the blog owner: Luis Diego Fallas)
Output:
PS C:\Python25> directory-summary
Name Sum
---- ---
DLLs 4794012
Doc 4160038
include 382592
Lib 13752327
libs 948600
tcl 3248808
Tools 547784
LICENSE.txt 13817
NEWS.txt 88573
python.exe 24064
pythonw.exe 24576
README.txt 56691
w9xpopen.exe 4608
I modified the command in the answer slightly to sort descending by size and include size in MB:
gci . |
%{$f=$_; gci -r $_.FullName |
measure-object -property length -sum |
select #{Name="Name"; Expression={$f}},
#{Name="Sum (MB)";
Expression={"{0:N3}" -f ($_.sum / 1MB) }}, Sum } |
sort Sum -desc |
format-table -Property Name,"Sum (MB)", Sum -autosize
Output:
PS C:\scripts> du
Name Sum (MB) Sum
---- -------- ---
results 101.297 106217913
SysinternalsSuite 56.081 58805079
ALUC 25.473 26710018
dir 11.812 12385690
dir2 3.168 3322298
Maybe it is not the most efficient method, but it works.
If you only need the total size of that path, one simplified version can be:
Get-ChildItem -Recurse ${HERE_YOUR_PATH} | Measure-Object -Sum Length
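If you want that single total in a friendlier unit, one possible follow-up (keeping the placeholder path from above):
$total = (Get-ChildItem -Recurse ${HERE_YOUR_PATH} | Measure-Object -Sum Length).Sum
'{0:N2} MB' -f ($total / 1MB)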
function Get-DiskUsage ([string]$path=".") {
$groupedList = Get-ChildItem -Recurse -File $path | Group-Object directoryName | select name,@{name='length'; expression={($_.group | Measure-Object -sum length).sum } }
foreach ($dn in $groupedList) {
New-Object psobject -Property @{ directoryName=$dn.name; length=($groupedList | where { $_.name -like "$($dn.name)*" } | Measure-Object -Sum length).sum }
}
}
Mine is a bit different; I group all of the files by directory name, then walk through that list building totals for each directory (including its subdirectories).
Building on previous answers, this will work for those that want to show sizes in KB, MB, GB, etc., and still be able to sort by size. To change units, just change "MB" to desired units in both "Name=" and "Expression=". You can also change the number of decimal places to show (rounding), by changing the "2".
function du($path=".") {
Get-ChildItem $path |
ForEach-Object {
$file = $_
Get-ChildItem -File -Recurse $_.FullName | Measure-Object -Property length -Sum |
Select-Object -Property @{Name="Name";Expression={$file}},
@{Name="Size(MB)";Expression={[math]::round(($_.Sum / 1MB),2)}} # round 2 decimal places
}
}
This gives the size as a number not a string (as seen in another answer), therefore one can sort by size. For example:
PS C:\Users\merce> du | Sort-Object -Property "Size(MB)" -Descending
Name Size(MB)
---- --------
OneDrive 30944.04
Downloads 401.7
Desktop 335.07
.vscode 301.02
Intel 6.62
Pictures 6.36
Music 0.06
Favorites 0.02
.ssh 0.01
Searches 0
Links 0
My own take using the previous answers:
function Format-FileSize([int64] $size) {
if ($size -lt 1024)
{
return $size
}
if ($size -lt 1Mb)
{
return "{0:0.0} Kb" -f ($size/1Kb)
}
if ($size -lt 1Gb)
{
return "{0:0.0} Mb" -f ($size/1Mb)
}
return "{0:0.0} Gb" -f ($size/1Gb)
}
function du {
param(
[System.String]
$Path=".",
[switch]
$SortBySize,
[switch]
$Summary
)
$Path = (Get-Item $Path).FullName
$groupedList = Get-ChildItem -Recurse -File $Path |
Group-Object directoryName |
select name,@{name='length'; expression={($_.group | Measure-Object -sum length).sum } }
$results = ($groupedList | % {
$dn = $_
if ($summary -and ($path -ne $dn.name)) {
return
}
$size = ($groupedList | where { $_.name -like "$($dn.name)*" } | Measure-Object -Sum length).sum
New-Object psobject -Property @{
Directory=$dn.name;
Size=Format-FileSize($size);
Bytes=$size
}
})
if ($SortBySize)
{ $results = $results | sort-object -property Bytes }
$results | more
}