find string with most occurrences in .txt file with powershell - windows

I'm currently working on a school assignment in PowerShell and I have to display the word longer than 6 characters with the most occurrences in a txt file. I tried this code, but it returns the number of occurrences for each word, which is not what I need. Please help.
$a= Get-Content -Path .\germinal_split.txt
foreach($object in $a)
{
if($object.length -gt 6){
$object| group-object | sort-object -Property "Count" -Descending | ft -Property ("Name", "Count");
}
}

From the question we don't know what's in the text file. The approaches so far will only work if there's only 1 word per line. I think something like below will work regardless:
$Content = (Get-Content 'C:\temp\test12-01-19' -Raw) -split '\b'
$Content |
Where-Object { $_.Length -gt 6 } |
Group-Object -NoElement | Sort-Object Count -Descending | Select-Object -First 1
Here I'm reading in the file as a single string using the -Raw parameter, then splitting on word boundaries. Where-Object still filters out words of 6 characters or fewer. Group-Object then groups identical words, as seen in the other examples, and sorting by Count in descending order puts the most frequent word first.
I don't use the word boundary RegEx very often. My concern is it might be weird around punctuation, but my tests look pretty good.
Let me know what you think.
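If punctuation around the word boundaries turns out to be a problem, one alternative is to collect only runs of word characters instead of splitting. The snippet below is a minimal sketch under that assumption; the sample text is made up, and `\w+` is a swapped-in pattern, not the one used in the answer above.

```powershell
# Made-up sample text; a real run would read the file with Get-Content -Raw.
$text = 'Germinal, germinal; mineurs! Germinal? Oui.'

# \w+ matches runs of word characters, so punctuation and spaces never become tokens.
$words = [regex]::Matches($text, '\w+') | ForEach-Object { $_.Value }

# Keep only the words longer than 6 characters.
$longWords = $words | Where-Object { $_.Length -gt 6 }
$longWords
```

Note that Group-Object compares strings case-insensitively by default, so "Germinal" and "germinal" would later land in the same group.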

You can do something like the following:
$a = Get-Content -Path .\germinal_split.txt
$a | Where Length -gt 6 | Group-Object -NoElement | Sort-Object Count -Descending
Explanation:
Where specifies the Length property's condition. Group-Object -NoElement leaves off the Group property, which contains the actual object data. Sort-Object sorts the grouped output in ascending order by default. Here the Count property is specified as the sorted property and the -Descending parameter reverses the default sort order.
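To show only the single most frequent word rather than the whole table, the sorted output can be cut down with Select-Object -First 1. This is a small sketch with a made-up word list standing in for the file contents (one word per line is assumed):

```powershell
# Made-up stand-in for: $a = Get-Content -Path .\germinal_split.txt
$a = 'germinal', 'mineurs', 'charbon', 'germinal', 'mineurs', 'germinal', 'mine'

$top = $a |
    Where-Object Length -gt 6 |          # words longer than 6 characters
    Group-Object -NoElement |            # count identical words
    Sort-Object Count -Descending |      # most frequent first
    Select-Object -First 1               # keep only the winner

'{0} ({1} occurrences)' -f $top.Name, $top.Count
```

If several words tie for the highest count, this returns just one of them.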

Related

Extract multiple columns from multiple test files in powershell

I have 450 files from computational model calculations for a nanosystem. Each file starts with three lines holding the title, conditions and date/time. The fourth line has the column labels (x y z t n m lag lead bus cond rema dock). The data runs from the fifth line up to the 55th line. The columns are separated by a varying number of spaces.
I want to:
I) create new text files with only the x, y, z, n, m and rema columns
II) collect only the x, y, z and n values of all the txt files in a single file
How can I do this in PowerShell? Please help!
Based on your description, I guess the content of your files looks something like this:
Title: MyFile
Conditions: Critical
Date: 2020-02-23T11:33:02
x y z t n m lag lead bus cond rema dock
sdasd asdfafd awef wefaewf aefawef aefawrgt eyjrteujer bhtnju qerfqeg 524rwefqwert q3tgqr4fqr4 qregq5g
avftgwb ryhwtwtgqreg efqerfe rgwetgq ergqreq erwf ef 476j q4 w4th2 ef 42r13gg asdfasdrv
You can always read files like that by typing them out, line by line and only keep the lines you actually want. In your case, the data is in line 4-55 (including headers).
To get to that data, you can use this command:
Get-Content MyFile.txt | Select-Object -Skip 3 -First 52
If you can confirm that this is the data you want, you can start working on the next issue: the multiple-spaces delimiter.
Since the number of spaces is not fixed, you need to replace multiple spaces with a single space. Assuming that the values you are looking for contain no spaces, you can add this to your pipeline:
Get-Content C:\MyFile.txt | Select-Object -Skip 3 -First 52 | ForEach-Object {$_ -replace '( )+',' '}
The '( )+' part means one or more spaces.
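A quick way to check the replacement is to run it on a single string. The line below is made up; the extra Trim() call is an assumption for files whose lines start or end with spaces, which would otherwise produce an empty first or last field.

```powershell
# Made-up line with uneven spacing, including leading and trailing spaces.
$line = '  sdasd   asdfafd awef     wefaewf '

# Collapse each run of spaces to one space, then drop the outer spaces.
$clean = ($line -replace '( )+', ' ').Trim()

# The line can now be split on a single space.
$fields = $clean -split ' '
$fields
```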
Now you have proper csv data. To convert this to a proper object, you just need to convert the data from csv like this:
ConvertFrom-Csv -InputObject (Get-Content C:\MyFile.txt | Select-Object -Skip 3 -First 52 | ForEach-Object {$_ -replace '( )+',' '}) -Delimiter ' '
From here it is pretty simple to select the values you want:
ConvertFrom-Csv -InputObject (Get-Content C:\MyFile.txt | Select-Object -Skip 3 -First 52 | ForEach-Object {$_ -replace '( )+',' '}) -Delimiter ' ' | Select-Object x,y,z,n,m,rema
You also need to get all the files done, so you might start by getting the files like this:
foreach ($file in (Get-ChildItem C:\MyFiles)){
ConvertFrom-Csv -InputObject (Get-Content $file.FullName | Select-Object -Skip 3 -First 52 | ForEach-Object {$_ -replace '( )+',' '}) -Delimiter ' ' | Select-Object x,y,z,n,m,rema
}
You might want to split up the code into a more readable format, but this should pretty much cover it.
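Putting the steps together, here is one possible sketch that covers both parts of the question. The helper name ConvertFrom-ModelFile, the output file names and the demo values are all hypothetical; it assumes three preamble lines, the header on line 4, data up to line 55, and no spaces inside values.

```powershell
# Hypothetical helper: turn one model output file into objects.
function ConvertFrom-ModelFile {
    param([Parameter(Mandatory)][string]$Path)

    Get-Content -Path $Path |
        Select-Object -Skip 3 -First 52 |                    # lines 4-55: header + data
        ForEach-Object { ($_ -replace '\s+', ' ').Trim() } | # normalise the delimiter
        Where-Object { $_ } |                                # drop empty lines
        ConvertFrom-Csv -Delimiter ' '
}

# Demo input in a temporary folder (made-up values).
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ([guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $dir | Out-Null
@(
    'Title: MyFile'
    'Conditions: Critical'
    'Date: 2020-02-23T11:33:02'
    'x  y z   t n m lag lead bus cond rema dock'
    '1  2 3   4 5 6 7   8    9   10   11   12'
) | Set-Content -Path (Join-Path $dir 'run1.txt')

$combined = foreach ($file in Get-ChildItem -Path $dir -Filter *.txt) {
    $rows = ConvertFrom-ModelFile -Path $file.FullName

    # I) one reduced file per input file
    $rows | Select-Object x, y, z, n, m, rema |
        Export-Csv -Path ($file.FullName + '.reduced.csv') -NoTypeInformation

    # II) pass x, y, z, n on so they are collected across all files
    $rows | Select-Object x, y, z, n
}

$combined | Export-Csv -Path (Join-Path $dir 'all_xyzn.csv') -NoTypeInformation
```

Export-Csv writes comma-separated files; if the output has to stay space-delimited, the -Delimiter parameter can be set on export as well.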

Removing duplicates from CSV yet keeping column headers

I have a working powershell script that removes duplicates in a csv file, but it sorts the column headers within the data, which I don't want, and cannot figure out a way to keep the column headers.
Get-Content C:\testdata.csv | ConvertFrom-Csv -Header "Column1", "Column2", "Column3", "Column4" | sort -Unique -Property Column1 | % {"{0},{1},{2},{3}" -f $_.Column1, $_.Column2, $_.Column3, $_.Column4} | set-content c:\output.csv
The test data csv is as follows:
Name,IDNumber,OtherNumber,UniqueCode
Tom,10,133,abcd
Tom,10,133,abcd
Bill,4,132,efgh
Bill,4,132,efgh
Bill,4,132,efgh
Lefty,3,122,ijkl
Lefty,3,122,ijkl
Lefty,3,122,ijkl
Lefty,3,122,ijkl
Is there a way to accomplish this with Powershell?
Using Import-Csv and Export-Csv makes this process much easier as they are built to deal with csv files and headers.
Import-Csv "C:\testdata.csv" | Sort-Object * -Unique | Export-Csv "c:\output.csv" -NoTypeInformation
Untested, but try this...
Import-Csv -Path 'C:\path\to\File.csv' |
Select * -Unique |
Export-Csv 'C:\path\to\NewFile.csv' -NoTypeInformation
You could use Select -Skip 1 to skip over the original header row:
Get-Content testdata.csv | Select -Skip 1 | ConvertFrom-Csv -Header "Column1","Column2","Column3","Column4" | sort -Unique -Property Column1 | % {"{0},{1},{2},{3}" -f $_.Column1, $_.Column2, $_.Column3, $_.Column4} | set-content output.csv
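The header handling can be checked without touching any files by feeding CSV text to ConvertFrom-Csv and ConvertTo-Csv. This is a small sketch using a shortened copy of the test data; the properties are listed explicitly here so it is clear which columns decide uniqueness.

```powershell
# Shortened copy of the test data from the question.
$csv = @(
    'Name,IDNumber,OtherNumber,UniqueCode'
    'Tom,10,133,abcd'
    'Tom,10,133,abcd'
    'Bill,4,132,efgh'
    'Bill,4,132,efgh'
    'Lefty,3,122,ijkl'
)

# The first line becomes the property names, so it is never sorted as data.
$rows = $csv | ConvertFrom-Csv

# Rows are unique when all four listed properties match.
$unique = $rows | Sort-Object -Property Name, IDNumber, OtherNumber, UniqueCode -Unique

# The header is written back automatically as the first output line.
$out = $unique | ConvertTo-Csv -NoTypeInformation
$out
```

The output is sorted by the listed columns, so the original row order is not preserved.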

Iterate through txt files and find rows that are not in all files

I have folder with 3 text files.
File 1, call it test1.txt has values
11
22
22
test2.txt has values
11
22
22
33
test3.txt has values
11
22
22
33
44
44
How can I get my final result (New.txt) to be:
44
44
These values are not in the other 2 files, so this is what I want.
So far code:
$result = "C:\NonDuplicate.txt"
$filesvalues=gci "C:\*.txt" | %{$filename=$_.Name; gc $_ | %{[pscustomobject]@{FileName= $filename; Row=$_ }}}
#list file where not exists others file with same value
$filesvalues | % {
$valtockeck=$_
[pscustomobject]@{
Val=$valtockeck
Exist=$filesvalues.Where({ $_.FileName -ne $valtockeck.FileName -and $_.Row -eq $valtockeck.Row }).Count -gt 0
}
} |
where Exist -NE $true |
% {$_.Val.Row | out-file $result -Append}
This is the error:
Where-Object : Cannot bind parameter 'FilterScript'. Cannot convert the "Exist" value of type "System.String" to type "System.Management.Automation.ScriptBlock".
At line:16 char:23
+ where <<<< Exist -NE $true |
+ CategoryInfo : InvalidArgument: (:) [Where-Object], ParameterBindingException
+ FullyQualifiedErrorId : CannotConvertArgumentNoMessage,Microsoft.PowerShell.Commands.WhereObjectCommand
try this
#list files/values couple
$filesvalues=gci "C:\temp\test\test*.txt" -file | %{$filename=$_.Name; gc $_ | %{[pscustomobject]@{FileName= $filename; Row=$_ }}}
#list file where not exists others file with same value
$filesvalues | % {
$valtockeck=$_
[pscustomobject]@{
Val=$valtockeck
Exist=$filesvalues.Where({ $_.FileName -ne $valtockeck.FileName -and $_.Row -eq $valtockeck.Row }).Count -gt 0
}
} |
where Exist -NE $true |
% {$_.Val.Row | out-file "c:\temp\test\New.txt" -Append}
$file1 = ".\test1.txt"
$file2 = ".\test2.txt"
$file3 = ".\test3.txt"
$results = ".\New.txt"
$Content = Get-Content $File1
$Content += Get-Content $File2
Get-Content $file3 | Where {$Content -notcontains $_}| Set-Content $Results
Other solution 1
#get couple files/values
$filesvalues=gci "C:\temp\test\test*.txt" -file |
%{$filename=$_.Name; gc $_ |
%{[pscustomobject]@{FileName= $filename; Row=$_ }}}
#group by value and filter by number of distinct filename, then extract data into file
($filesvalues | group -Property Row | where {($_.Group.FileName | Get-Unique).Count -eq 1 }).Group.Row |
out-file "C:\temp\test\New2.txt" -Append
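The grouping idea can be tried on in-memory data first. The sketch below mirrors the three files from the question with a hypothetical ordered hash table instead of reading from disk, and uses Sort-Object -Unique to count distinct file names per value.

```powershell
# Hypothetical in-memory stand-in for the three files in the question.
$files = [ordered]@{
    'test1.txt' = '11', '22', '22'
    'test2.txt' = '11', '22', '22', '33'
    'test3.txt' = '11', '22', '22', '33', '44', '44'
}

# Build file name / value pairs, one object per row.
$filesValues = foreach ($name in $files.Keys) {
    foreach ($row in $files[$name]) {
        [pscustomobject]@{ FileName = $name; Row = $row }
    }
}

# Keep the values whose rows all come from exactly one file.
$uniqueRows = $filesValues |
    Group-Object -Property Row |
    Where-Object { @($_.Group.FileName | Sort-Object -Unique).Count -eq 1 } |
    ForEach-Object { $_.Group.Row }

$uniqueRows
```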
The Compare-Object cmdlet's purpose is to compare two sets of inputs.
Nesting two Compare-Object calls yields the desired output:
$file1Lines = Get-Content .\test1.txt
$file2Lines = Get-Content .\test2.txt
$file3Lines = Get-Content .\test3.txt
(Compare-Object `
(Compare-Object -IncludeEqual $file1Lines $file2Lines).InputObject `
$file3Lines |
Where-Object SideIndicator -eq '=>'
).InputObject
Compare-Object outputs [pscustomobject] instances whose .InputObject property contains the input object and whose .SideIndicator property indicates which operand the value is unique to - <= (LHS) or => (RHS) - and, with -IncludeEqual, if it is contained in both operands (==).
-IncludeEqual in the 1st Compare-Object call not only outputs the lines that differ, but also includes the ones that are the same, resulting in a union of the lines from file test1.txt and test2.txt.
By not specifying switches for the 2nd Compare-Object call, only [objects wrapping] the lines that differ are output (the default behavior).
Filter Where-Object SideIndicator -eq '=>' then filters the differences down to those lines that are unique to the RHS.
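The SideIndicator filtering can be seen in isolation with two small arrays. This is a minimal sketch with made-up values standing in for the union of the reference lines and the lines of the third file:

```powershell
# Made-up stand-ins: the union of the reference lines and the difference lines.
$reference  = '11', '22', '22', '33'
$difference = '11', '22', '22', '33', '44', '44'

# Without -IncludeEqual only differing items are output;
# '=>' marks items found only in the second (RHS) collection.
$onlyInDifference = (Compare-Object $reference $difference |
    Where-Object SideIndicator -eq '=>').InputObject

$onlyInDifference
```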
To generalize the command to N > 3 files and output to a new file:
# Get all input files as file objects.
$files = Get-ChildItem .\test*.txt
# I'll assume that all files but the last are the *reference files* - the
# files for which the union of all their lines should be formed first...
$refFiles = $files[0..$($files.count-2)]
# ... and that the last file is the *difference file* - the file whose lines
# to compare against the union of lines from the reference files.
$diffFile = $files[($files.count-1)]
# The output file path.
$results = ".\New.txt"
# Build the union of all lines from the reference files.
$unionOfLines = @()
$refFiles | ForEach-Object {
$unionOfLines = (Compare-Object -IncludeEqual $unionOfLines (Get-Content $_)).InputObject
}
# Compare the union of lines to the difference file and
# output only the lines unique to the difference file to the output file.
(Compare-Object $unionOfLines (Get-Content $diffFile) |
Where-Object SideIndicator -eq '=>').InputObject |
Set-Content $results
Note that in Windows PowerShell, Set-Content uses the Windows legacy single-byte encoding by default. Use the -Encoding parameter to change that.
Well, instead of writing the result in the $results file, save it in a variable $tmpResult and then do the same check as above for $tmpResult and $file3 to gain a final result. And if you have more than 3 files, you can create a loop to repeat the check.
But something is missing in the code above - you only get the unique lines in file2 and not those in file1.

Powershell sort and filter

I have a csv file containing detailed data, say columns A,B,C,D etc. Columns A and B are categories and C is a time stamp.
I am trying to create a summary file showing one row for each combination of A and B. It should pick the row from the original data where C is the most recent date.
Below is my attempt at solving the problem.
Import-CSV InputData.csv | `
Sort-Object -property @{Expression="ColumnA";Descending=$false}, `
@{Expression="ColumnB";Descending=$false}, `
@{Expression={[DateTime]::ParseExact($_.ColumnC,"dd-MM-yyyy HH:mm:ss",$null)};Descending=$true} | `
Sort-Object ColumnA, ColumnB -unique `
| Export-CSV OutputData.csv -NoTypeInformation
First the file is read, then everything is sorted by all 3 columns, the second Sort-Object call is supposed to then take the first row of each. However, Sort-Object with the -unique switch seems to pick a random row, rather than the first one. Thus this does get one row for each AB combination, but not the one corresponding to most recent C.
Any suggestions for improvements? The data set is very large, so going through the file line by line is awkward, so would prefer a powershell solution.
You should look into Group-Object. I didn't create a sample CSV (you should provide it :-) ) so I haven't tested this out, but I think it should work:
Import-CSV InputData.csv | `
Select-Object -Property *, @{Label="DateTime";Expression={[DateTime]::ParseExact($_.ColumnC,"dd-MM-yyyy HH:mm:ss",$null)}} | `
Group-Object ColumnA, ColumnB | `
% {
$sum = ($_.Group | Measure-Object -Property ColumnD -Sum).Sum
$_.Group | Sort-Object -Property "DateTime" -Descending | Select-Object -First 1 -Property *, @{name="SumD";e={ $sum } } -ExcludeProperty DateTime
} | Export-CSV OutputData.csv -NoTypeInformation
This returns the same columns that were in the input, plus SumD (the DateTime helper column is excluded from the output).
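Since the answer above is untested, here is a reduced sketch of the same grouping idea that can be run on in-memory data. The sample rows are made up, the sum is left out, and InvariantCulture is passed to ParseExact as an assumption so the result does not depend on the machine's regional settings.

```powershell
# Made-up sample rows in the dd-MM-yyyy HH:mm:ss format from the question.
$rows = @(
    'ColumnA,ColumnB,ColumnC,ColumnD'
    'a,x,01-02-2020 10:00:00,1'
    'a,x,03-02-2020 09:30:00,2'
    'b,y,15-01-2020 08:00:00,3'
) | ConvertFrom-Csv

$culture = [System.Globalization.CultureInfo]::InvariantCulture

# One group per ColumnA/ColumnB combination; keep the newest row of each group.
$latest = $rows | Group-Object ColumnA, ColumnB | ForEach-Object {
    $_.Group |
        Sort-Object { [DateTime]::ParseExact($_.ColumnC, 'dd-MM-yyyy HH:mm:ss', $culture) } -Descending |
        Select-Object -First 1
}

$latest
```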

Powershell v2, getting specific lines from file, sorting

I have a text file with a simple structure, which is actually the content of an ftp:
1.0
1.0a
10.0
10.0b
11.0
11.0f
2.0
3.0
4.0
...(and so on)
random string
random string
I'm using get-content to get the contents of the file but then I want to be able to retrieve only the lines that contain the max number and the max-1 number. In this case for example I would want it to return:
10.0
10.0b
11.0
11.0f
I tried using sort-object but it didn't work. Is there a way to use sort-object so that it knows it is sorting numbers and not strings (so that it doesn't place 10 after 1), sorts according to the digits before the full stop, and ignores the random strings at the end altogether?
Or if you have another method to suggest please do so... Thank you.
You can pass scriptblocks to some cmdlets, in this case Sort-Object and Group-Object. To clarify a bit more:
Load the data
Get-Content foo.txt |
Group by the number (ignoring the suffix, if present):
Group-Object { $_ -replace '\D+$' } |
This will remove non-digits at the end of the string first and use the remainder of the string (hopefully now just containing a floating-point number) as the group name.
Sort by that group name, numerically.
Sort-Object { [int] $_.Name } |
This is done simply by converting the name of the group to a number and sort by that, similar to how we grouped by something derived from the original line.
Then we can get the last two groups, representing all lines with the maximum number and second-to-maximum number and unwrap the groups. The -Last parameter is fairly self-explanatory, the -ExpandProperty selects the values of a property instead of constructing a new object with a filtered property list:
Select-Object -Last 2 -ExpandProperty Group
And there we are. You can try this pipeline in various stages just to get a feeling for what the commands do:
PS Home:\> gc foo.txt
1.0
1.0a
10.0
10.0b
11.0
11.0f
2.0
3.0
4.0
PS Home:\> gc foo.txt | group {$_ -replace '\D+$'}
Count Name Group
----- ---- -----
2 1.0 {1.0, 1.0a}
2 10.0 {10.0, 10.0b}
2 11.0 {11.0, 11.0f}
1 2.0 {2.0}
1 3.0 {3.0}
1 4.0 {4.0}
PS Home:\> gc foo.txt | group {$_ -replace '\D+$'} | sort {[int]$_.Name}
Count Name Group
----- ---- -----
2 1.0 {1.0, 1.0a}
1 2.0 {2.0}
1 3.0 {3.0}
1 4.0 {4.0}
2 10.0 {10.0, 10.0b}
2 11.0 {11.0, 11.0f}
PS Home:\> gc foo.txt | group {$_ -replace '\D+$'} | sort {[int]$_.Name} | select -l 2 -exp group
10.0
10.0b
11.0
11.0f
If you need the items within the groups (and thus in the final result for the last two groups) sorted by suffix, you can stick another Sort-Object directly after the Get-Content.
You can pass an expression to Sort-Object, the sort will then use that expression to sort the objects. This is done by passing a hash table with key expression (can be abbreviated to e). To reverse the order add a second key descending (or d) with value $true.
In your case
...input... | Sort @{e={convert $_ as required}}
Multiple property names and hash tables can be supplied: so 11.0f could be split into a number and suffix.
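As a concrete sketch of two sort expressions, the version below sorts first by the numeric part and then by the suffix. The two regular expressions are assumptions about the line format (digits and a dot, optionally followed by a letter suffix) and would need adjusting for the "random string" lines.

```powershell
# Sample lines from the question, without the trailing random strings.
$lines = '1.0', '1.0a', '10.0', '10.0b', '11.0', '11.0f', '2.0', '3.0', '4.0'

$sorted = $lines | Sort-Object `
    @{ Expression = { [double]($_ -replace '[^\d.]+$') } },  # numeric part, e.g. 10.0
    @{ Expression = { $_ -replace '^[\d.]+' } }              # suffix, e.g. 'b' or ''

$sorted
```

Because the first key is cast to [double], 10.0 sorts after 4.0 instead of right after 1.0.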
If there is a lot of overlap between the sort expressions you could pre-process the input into objects with the sort properties first (and remove after):
...input... | %{
if ($_ -match '^(\d+\.0)(.)?') {
new-object PSObject -prop @{value=$_; a=[double]::Parse($matches[1]); b=$matches[2] }
} else {
new-object PSObject -prop @{value=$_; a=[double]::MinValue; b=$null }
}
} | sort a,b | select -expand value

Resources