I am a newbie in PowerShell, but this is driving me a bit crazy. I have looked at various questions here, but could not find an answer, so here I go. Apologies if this has been covered already.
I have two text files containing columns of numbers. I would like to create an array containing those 2 columns and sort it by column 1 or 2.
If we had
$a=@(1,5,10,15,25)
$b=@(100,99,98,99,10)
we create
$c=$a,$b
My initial thought was to try something like this:
$c | sort { [int]$_[0] }
But it does not work. I have tried many different things so any advice would be appreciated.
I am editing this as my question was not so clear. Ultimately, if I sort $c by ascending column 2, I expect something like:
25,10
10,98
5,99
15,99
1,100
Any idea how to achieve this ?
I am not sure how you declared your two-dimensional array; it looks like you want it to be declared like this, or something similar:
$c = @(@(1,100),@(5,99),@(10,98),@(15,99),@(25,10))
If it was in that state then sorting is a breeze
$c | Sort-Object @{Expression={$_[1]}; Ascending=$True} | %{
"$($_[0]),$($_[1])"
}
Sort-Object works well with one-dimensional arrays. When multiple properties are involved you need to specify which property to sort on to get the expected output. Since the inner arrays have no named properties, we use a calculated expression to sort on the second "column".
Sample Output
25,10
10,98
5,99
15,99
1,100
If you really want to work with your arrays like that we need an intermediate step to convert what you have to how it can be sorted the way you expect.
$a=@(1,5,10,15,25)
$b=@(100,99,98,99,10)
$c = @()
for($i = 0;$i -lt $a.Count; $i++){
    $c += ,@($a[$i],$b[$i])
}
After running this code $c will work just like it does with my sorting.
Welcome to the PowerShell world. The syntax is slightly different from classical programming languages: cmdlets usually take their input from the current pipeline. In this case the command you are talking about is Sort-Object, and you can pipe the array content directly into it:
$c = ($a | Sort-Object), ($b | Sort-Object)
I'm trying to get my head around PowerShell and write a function as a cmdlet. I found the following code sample in one of the articles, but it doesn't seem to want to work as a cmdlet even though it has the [cmdletbinding()] declaration at the top of the file.
When I try to do something like
1,2,3,4,5 | .\measure-data
it returns empty response (the function itself works just fine if I invoke it at the bottom of the file and run the file itself).
Here's the code that I am working with, any help will be appreciated :)
Function Measure-Data {
<#
.Synopsis
Calculate the median and range from a collection of numbers
.Description
This command takes a collection of numeric values and calculates the
median and range. The result is written as an object to the pipeline.
.Example
PS C:\> 1,4,7,2 | measure-data
Median Range
------ -----
3 6
.Example
PS C:\> dir c:\scripts\*.ps1 | select -expand Length | measure-data
Median Range
------ -----
1843 178435
#>
[cmdletbinding()]
Param (
[Parameter(Mandatory=$True,ValueFromPipeline=$True)]
[ValidateRange([int64]::MinValue,[int64]::MaxValue)]
[psobject]$InputObject
)
Begin {
#define an array to hold incoming data
Write-Verbose "Defining data array"
$Data=@()
} #close Begin
Process {
#add each incoming value to the $data array
Write-Verbose "Adding $inputobject"
$Data+=$InputObject
} #close process
End {
#take incoming data and sort it
Write-Verbose "Sorting data"
$sorted = $data | Sort-Object
#count how many elements in the array
$count = $data.Count
Write-Verbose "Counted $count elements"
#region calculate median
if ($sorted.count%2) {
<#
if the number of elements is odd, add one to the count
and divide by two to get the middle number. But arrays start
counting at 0 so subtract one
#>
Write-Verbose "processing odd number"
[int]$i = (($sorted.count+1)/2-1)
#get the corresponding element from the sorted array
$median = $sorted[$i]
}
else {
<#
if number of elements is even, find the average
of the two middle numbers
#>
Write-Verbose "processing even number"
$i = $sorted.count/2
#get the lower number
$x = $sorted[$i-1]
#get the upper number
$y = $sorted[-$i]
#average the two numbers to calculate the median
$median = ($x+$y)/2
} #else even
#endregion
#region calculate range
Write-Verbose "Calculating the range"
$range = $sorted[-1] - $sorted[0]
#endregion
#region write result
Write-Verbose "Median = $median"
Write-Verbose "Range = $range"
#define a hash table for the custom object
$hash = @{Median=$median;Range=$Range}
#write result object to pipeline
Write-Verbose "Writing result to the pipeline"
New-Object -TypeName PSobject -Property $hash
#endregion
} #close end
} #close measure-data
This is the article where I took the code from:
https://mcpmag.com/articles/2013/10/15/blacksmith-part-4.aspx
Edit: maybe I should add that versions of this code from previous parts of the article worked just fine, but after adding all the things that make it a proper cmdlet, like the help section and verbose lines, this thing just doesn't want to work, and I believe there is something missing. I have a feeling that this could be because it was written for PowerShell 3 and I am testing it on Win 10 with PS 5-point-something, but honestly I don't even know in which direction I should look; that's why I am asking you for help.
There is nothing wrong with the code (apart from possible optimizations), but the way you call it can't work:
1,2,3,4,5 | .\measure-data
When you call a script file that contains a named function, it is expected that "nothing happens". Actually, the script runs, but PowerShell does not know which function it should call (there could be multiple). So it just runs any code outside of functions.
You have two options to fix the problem:
Option 1
Remove the function keyword and the curly braces that belong to it. Keep the [cmdletbinding()] and Param sections.
[cmdletbinding()]
Param (
[Parameter(Mandatory=$True,ValueFromPipeline=$True)]
[ValidateRange([int64]::MinValue,[int64]::MaxValue)]
[psobject]$InputObject
)
Begin {
# ... your code ...
} #close Begin
Process {
# ... your code ...
} #close process
End {
# ... your code ...
}
Now the script itself is the "function" and can be called as such:
1,2,3,4,5 | .\measure-data
Option 2
Turn the script into a module. Basically you just need to save it with .psm1 extension (there is more to it, but for getting started it will suffice).
In the script where you want to use the function you have to import the module before you can use its functions. If the module is not installed, you can import it by specifying its full path.
# Import module from directory where current script is located
Import-Module $PSScriptRoot\measure-data.psm1
# Call a function of the module
1,2,3,4,5 | Measure-Data
A module is the way to go when there are multiple functions in a single script file. It is also more efficient when a function will be called multiple times, because PowerShell needs to parse it only once (it remembers Import-Module calls).
It works as-is, you just need to call it properly. Since the code is now a function, you cannot call it like before, when the code was directly in the file:
# method when code is directly in file with no Function Measure-Data {}
1,2,3,4,5 | .\measure-data
Now that you've defined the function you instead need to dot source the file so that it loads your function(s) into memory. Then you can call your function by its name (which happens to be the same as the filename, but doesn't have to be)
# Load the functions by dot-sourcing
. .\measure-data.ps1
# Use the function
1,2,3,4,5 | Measure-Data
You're not passing it an Object but an array of integers. If you change the parameter to:
Param (
[Parameter(Mandatory=$True,ValueFromPipeline=$True)]
[ValidateRange([int64]::MinValue,[int64]::MaxValue)]
[Int[]]$InputObject
)
Now things work:
PS> 1,2,3,4,5 | Measure-Data
Median Range
------ -----
3 4
I've been doing research on this and I find a plethora of articles related to Text, but they don't seem to be working for me.
To be clear this formula works, I'm just looking to make it more efficient. My formula looks like:
if [organization_id] = 1 or [organization_id] = 2 or [organization_id] = 3 then "North" else if … where organization_id is of type "WholeNumber"
I'd like to simplify this by doing something like:
if [organization_id] in {1, 2, 3} then "North" else if …
I've tried wrapping in Parenthesis, Braces, & Brackets. Nothing seems to work. Most articles are using some form of text.replace function and mine is just a custom column.
Does MCode within Power Query have any efficiencies like this or do I have to write out each individual statement like the first line?
I've had success with a List.Contains formulation:
List.Contains({1,2,3}, [organization_id])
The above checks if [organization_id] is in the list supplied in the first argument.
In some cases, you may not want to hardcode a list as shown above but reference a table column instead. For example,
List.Contains(TableWithDesiredIds[id_column], [organization_id])
My apologies ahead of time - I'm not sure that there is an answer for this one using only Linux command-line fu. Please note I am not a programmer, but I have been playing around with bash and python a bit over the last few years.
I have a large text file with rows and columns that resemble the following (note - fields are separated with tabs):
1074 Beetle OOB11061MNH 12/22/16 Confirmed
3430 Hightop 0817BESTYET 08/07/17 Queued
3431 Hightop 0817BESTYET 08/07/17 Queued
3078 Copland 2017GENERAL 07/07/17 Confirmed
3890 Bartok FOODS 09/11/17 Confirmed
5440 Alphapha 00B1106IMNH 01/09/18 Queued
What I want to do is find and output only those rows where the third field is either identical OR similar to another in the list. I don't really care whether the other fields are similar or not, but they should all be included in the output. By similar, I mean no more than [n] characters are different in that particular field (for example, no more than 3 characters are different). So the output I would want would be:
1074 Beetle OOB11061MNH 12/22/16 Confirmed
3430 Hightop 0817BESTYET 08/07/17 Queued
3431 Hightop 0817BESTYET 08/07/17 Queued
5440 Alphapha 00B1106IMNH 01/09/18 Queued
The line beginning 1074 has a third field that differs by 3 characters with 5440, so both of them are included. 3430 and 3431 are included because they are exactly identical. 3078 and 3890 are eliminated because they are not similar.
Through googling the forums I've managed to piece together this rather longish pipeline to be able to find all of the instances where field 3 is exactly identical:
cat inputfile.txt | awk 'BEGIN { OFS=FS="\t" } {if (count[$3] > 1) print $0; else if (count[$3] == 1) { print save[$3]; print $0; } else save[$3] = $0; count[$3]++; }' > outputfile.txt
I must confess I don't really understand awk all that well; I'm just copying and adapting from the web. But that seemed to work great at finding exact duplicates (i.e., it would output only 3430 and 3431 above). But I have no idea how to approach trying to find strings that are not identical but that differ in no more than 3 places.
For instance, in my example above, it should match 1074 and 5440 because they would both fit the pattern:
??B1106?MNH
But I would want it to be able to match also any other random pattern of matches, as long as there are no more than three differences, like this:
20?7G?N?RAL
These differences could be arbitrarily in any position.
The reason for needing this is we are trying to find a way to automatically find typographical errors in a serial-number-like field. There might be a mis-key, or perhaps a letter "O" replaced with a number "0", or the like.
So... any ideas? Thanks for the help!
you can use this script
$ more hamming.awk
function hamming(x,y,xs,ys,min,max,h) {
if(x==y) return 0;
else {
nx=split(x,xs,"");
mx=split(y,ys,"");
min=nx<mx?nx:mx;
max=nx<mx?mx:nx;
for(i=1;i<=min;i++) if(xs[i]!=ys[i]) h++;
return h+(max-min);
}
}
BEGIN {FS=OFS="\t"}
NR==FNR {
if($3 in a) nrs[NR];
for(k in a)
if(hamming(k,$3)<4) {
nrs[NR];
nrs[a[k]];
}
a[$3]=NR;
next
}
FNR in nrs
usage
$ awk -f hamming.awk file{,}
It's a double-scan algorithm that finds the Hamming distance (the one you described) between keys. Note that it's an O(n^2) algorithm, so it may not be suitable for very large data sets. However, I'm not sure any other algorithm can do better.
NB Additional note based on the comment which I missed from the post. This algorithm compares the keys character by character, so displacements won't be identified. For example 123 and 23 will give a distance of 3.
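Since the question mentions Python, here is a minimal sketch of the same position-by-position comparison in that language. It is not a port of the awk script: the names hamming and similar_rows, the max_diff parameter, and the inlined sample data are assumptions made for illustration. Like the awk version, it compares every pair of keys, so it is also O(n^2).

```python
import csv
import io
from itertools import combinations


def hamming(x: str, y: str) -> int:
    """Count differing positions; extra trailing characters each count as one."""
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches + abs(len(x) - len(y))


def similar_rows(rows, field=2, max_diff=3):
    """Keep rows whose `field` is within max_diff of another row's field."""
    keep = set()
    for (i, r1), (j, r2) in combinations(enumerate(rows), 2):
        if hamming(r1[field], r2[field]) <= max_diff:
            keep.update((i, j))
    return [rows[i] for i in sorted(keep)]  # preserve input order


# Sample rows from the question, tab-separated (inlined here for illustration).
SAMPLE = (
    "1074\tBeetle\tOOB11061MNH\t12/22/16\tConfirmed\n"
    "3430\tHightop\t0817BESTYET\t08/07/17\tQueued\n"
    "3431\tHightop\t0817BESTYET\t08/07/17\tQueued\n"
    "3078\tCopland\t2017GENERAL\t07/07/17\tConfirmed\n"
    "3890\tBartok\tFOODS\t09/11/17\tConfirmed\n"
    "5440\tAlphapha\t00B1106IMNH\t01/09/18\tQueued\n"
)

rows = list(csv.reader(io.StringIO(SAMPLE), delimiter="\t"))
for row in similar_rows(rows):
    print("\t".join(row))
```

On the sample data this prints the rows starting with 1074, 3430, 3431 and 5440. To read a real file instead, the io.StringIO(SAMPLE) part would be replaced with an open file handle.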
Levenshtein distance aka "edit distance" suits your task best. Perl script below requires installing a module Text::Levenshtein (for debian/ubuntu do: sudo apt install libtext-levenshtein-perl).
use Text::Levenshtein qw(distance);
$maxdist = shift;
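@ll = (<>);
@k = map {
$k = (split /\t/, $_)[2];
# $k =~ s/O/0/g;
$k; # return the key itself, so the map still works if the line above is uncommented
} @ll;
for ($i = 0; $i < @ll; ++$i) {
for ($j = 0; $j < @ll; ++$j) {
# <= so that a limit of 3 allows up to 3 differences, as the question asks
if ($i != $j and distance($k[$i], $k[$j]) <= $maxdist) {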
print $ll[$i];
last;
}
}
}
Usage:
perl lev.pl 3 inputfile.txt > outputfile.txt
The algorithm is the same O(n^2) as in @karakfa's post, but matching is more flexible.
Also note the commented line # $k =~ s/O/0/g;. If you uncomment it, then all O's in key will become 0's, which will fix keys damaged by O->0 transformation. When working with damaged data I always use small rules like this to fix data gradually, refining rules from run to run, to the point where data is almost perfect and fuzzy match is no longer needed.
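To make the difference between the two distances concrete, here is a small pure-Python sketch of edit distance, with no external module. It is a textbook dynamic-programming version, not the implementation used by Text::Levenshtein, and the normalize helper is a hypothetical stand-in for the O->0 rule described above.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def normalize(key: str) -> str:
    """Hypothetical cleanup rule: treat the letter O as the digit 0."""
    return key.replace("O", "0")


print(levenshtein("123", "23"))                  # 1 (character-by-character comparison gives 3)
print(levenshtein("OOB11061MNH", "00B1106IMNH"))  # 3
print(levenshtein(normalize("OOB11061MNH"), normalize("00B1106IMNH")))  # 1
```

The first line shows why edit distance copes with a dropped or inserted character, and the last line shows how a single cleanup rule shrinks the distance between the two keys from the question.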
The most popular answer for this question involves the following Windows PowerShell code (edited to fix a bug):
$file1 = Get-Content C:\temp\file1.txt
$file2 = Get-Content C:\temp\file2.txt
$Diff = Compare-Object $File1 $File2
$LeftSide = ($Diff | Where-Object {$_.SideIndicator -eq '<='}).InputObject
$LeftSide | Set-Content C:\temp\file3.txt
I always get a zero byte file as the output, even if I remove the $Diff line.
Why is the output file always null, and how can it be fixed?
PetSerAl, as he routinely does, has provided the crucial pointer in a comment on the question:
Member-access enumeration - the ability to access a member (a property or a method) on a collection and have it implicitly applied to each of its elements, with the results getting collected in an array, was introduced in PSv3.[1]
Member-access enumeration is not only expressive and convenient, it is also faster than alternative approaches.
A simplified example:
PS> ((Get-Item /), (Get-Item $HOME)).Mode
d--hs- # The value of (Get-Item /).Mode
d----- # The value of (Get-Item $HOME).Mode
Applying .Mode to the collection that the (...)-enclosed command outputs causes the .Mode property to be accessed on each item in the collection, with the resulting values returned as an array (a regular PowerShell array, of type [System.Object[]]).
Caveats: Member-access enumeration handles the resulting array like the pipeline does, which means:
If the array has only a single element, that element's property value is returned directly, not inside a single-element array:
PS> @([pscustomobject] @{foo=1}).foo.GetType().Name
Int32 # 1 was returned as a scalar, not as a single-element array.
If the property values being collected are themselves arrays, a flat array of values is returned:
PS> @([pscustomobject] @{foo=1,2}, [pscustomobject] @{foo=3,4}).foo.Count
4 # a single, flat array was returned: 1, 2, 3, 4
Also, member-access enumeration only works for getting (reading) property values, not for setting (writing) them.
This asymmetry is by design, to avoid potentially unwanted bulk modification; in PSv4+, use .ForEach('<property-name>', <new-value>) as the quickest workaround (see below).
This convenient feature is NOT available, however:
if you're running on PSv2 (categorically)
if the collection itself has a member by the specified name, in which case the collection-level member is applied.
For instance, even in PSv3+ the following does NOT perform member-access enumeration:
PS> ('abc', 'cdefg').Length # Try to report the string lengths
2 # !! The *array's* .Length property value (item count) is reported, not the items'
In such cases - and in PSv2 in general - a different approach is needed:
Fastest alternative, using the foreach statement, assuming that the entire collection fits into memory as a whole (which is implied when using member-access enumeration).
PS> foreach ($s in 'abc', 'cdefg') { $s.Length }
3
5
PSv4+ alternative, using collection method .ForEach(), also operating on the collection as a whole:
PS> ('abc', 'cdefg').ForEach('Length')
3
5
Note: If applicable to the input collection, you can also set property values with .ForEach('<prop-name>', <new-value>), which is the fastest workaround to not being able to use .<prop-name> = <new-value>, i.e. the inability to set property values with member-access enumeration.
Slowest, but memory-efficient approaches, using the pipeline:
Note: Use of the pipeline is only memory-efficient if you process the items one by one, in isolation, without collecting the results in memory as well.
Using the ForEach-Object cmdlet, as in Burt Harris' helpful answer:
PS> 'abc', 'cdefg' | ForEach-Object { $_.Length }
3
5
For properties only (as opposed to methods), Select-Object -ExpandProperty is an option; it is conceptually clear and simple, and virtually on par with the ForEach-Object approach in terms of performance (for a performance comparison, see the last section of this answer):
PS> 'abc', 'cdefg' | Select-Object -ExpandProperty Length
3
5
[1] Previously, the feature was semi-officially known as just member enumeration, introduced in this 2012 blog post along with the feature itself. A decision to formally introduce the term member-access enumeration was made in early 2022.
Perhaps instead of
$LeftSide = ($Diff | Where-Object {$_.SideIndicator -eq '<='}).InputObject
PowerShell 2 might work better with:
$LeftSide = $Diff | Where-Object {$_.SideIndicator -eq '<='} |
Foreach-object { $_.InputObject }
I have a variable titled F.
Describe F returns:
F: {group: bytearray,indexkey: {(indexkey: chararray)}}
Dump F returns:
(321,{(CHOW),(DREW)})
(5011,{(CHOW),(DREW)})
(5825,{(TANNER),(SPITZENBERGER)})
(16631,{(CHOW),(DREW)})
(34299,{(CHOW),(DREW)})
(35044,{(TANNER),(SPITZENBERGER)})
(65623,{(CHOW),(DREW)})
(74597,{(SPITZENBERGER),(TANNER)})
(83499,{(SPITZENBERGER),(TANNER)})
(90257,{(SPITZENBERGER),(TANNER)})
What I need is to produce an output that looks like this (only 1st row as an example):
(321,DREW,{(CHOW)})
I've tried using dereference to pull out the first element by using this:
G = FOREACH F generate indexkey.$0;
But, this still returns the whole tuple.
Can anyone suggest a method for doing this? I was under the impression that the dereference operator should allow me to do this.
Thanks in advance!
Daniel
You can't index into bags like that. The reason for that is bags don't have any notion of ordering. Selecting the first item in a bag should be treated as picking a random one.
Either way, if you want only one item instead of all of them, you can use a nested FOREACH to pull a LIMIT of 1:
first = FOREACH F {
lim = LIMIT indexkey 1;
GENERATE group, lim;
}
(disclaimer: I can't test this code right now, if it doesn't work let me know. Hopefully you can get the gist)
You can take this a bit further and FLATTEN it to remove the bag of one item entirely, but be careful: if the bag is empty, I think the entire record gets thrown away in this case.
first = FOREACH F {
lim = LIMIT indexkey 1;
GENERATE group, FLATTEN(lim);
}