Powershell Sort of Strings with Underscores - sorting

The following list does not sort properly (IMHO):
$a = @( 'ABCZ', 'ABC_', 'ABCA' )
$a | sort
ABC_
ABCA
ABCZ
My handy ASCII chart and Unicode C0 Controls and Basic Latin chart
have the underscore (low line) with an ordinal of 95 (U+005F). This is a higher number than the capital letters A-Z. Sort should have put the string ending with an underscore last.
Get-Culture is en-US
The next set of commands does what I expect:
$a = @( 'ABCZ', 'ABC_', 'ABCA' )
[System.Collections.ArrayList] $al = $a
$al.Sort( [System.StringComparer]::Ordinal )
$al
ABCA
ABCZ
ABC_
Now I create an ANSI encoded file containing those same 3 strings:
Get-Content -Encoding Byte data.txt
65 66 67 90 13 10 65 66 67 95 13 10 65 66 67 65 13 10
$a = Get-Content data.txt
[System.Collections.ArrayList] $al = $a
$al.Sort( [System.StringComparer]::Ordinal )
$al
ABC_
ABCA
ABCZ
Once more the string containing the underscore/lowline is not sorted correctly. What am I missing?
Edit:
Let's reference this example #4:
'A' -lt '_'
False
[char] 'A' -lt [char] '_'
True
Seems like both statements should be False or both should be True. I'm comparing strings in the first statement, and then comparing the Char type. A string is merely a collection of Char types so I think the two comparison operations should be equivalent.
And now for example #5:
Get-Content -Encoding Byte data.txt
65 66 67 90 13 10 65 66 67 95 13 10 65 66 67 65 13 10
$a = Get-Content data.txt
$b = @( 'ABCZ', 'ABC_', 'ABCA' )
$a[0] -eq $b[0]; $a[1] -eq $b[1]; $a[2] -eq $b[2];
True
True
True
[System.Collections.ArrayList] $al = $a
[System.Collections.ArrayList] $bl = $b
$al[0] -eq $bl[0]; $al[1] -eq $bl[1]; $al[2] -eq $bl[2];
True
True
True
$al.Sort( [System.StringComparer]::Ordinal )
$bl.Sort( [System.StringComparer]::Ordinal )
$al
ABC_
ABCA
ABCZ
$bl
ABCA
ABCZ
ABC_
The two ArrayList contain the same strings, but are sorted differently. Why?

PowerShell often wraps objects in PSObject and unwraps them again. This is usually done transparently, so you do not even notice it, but in your case it is what causes the trouble.
$a='ABCZ', 'ABC_', 'ABCA'
$a|Set-Content data.txt
$b=Get-Content data.txt
[Type]::GetTypeArray($a).FullName
# System.String
# System.String
# System.String
[Type]::GetTypeArray($b).FullName
# System.Management.Automation.PSObject
# System.Management.Automation.PSObject
# System.Management.Automation.PSObject
As you can see, the objects returned from Get-Content are wrapped in PSObject, which prevents StringComparer from seeing the underlying strings and comparing them properly. A strongly typed string collection cannot store PSObjects, so PowerShell unwraps the strings before storing them in such a collection, which allows StringComparer to see the strings and compare them properly.
Edit:
First of all, when you write $a[1].GetType() or $b[1].GetType(), you do not call .NET methods directly but PowerShell methods, which normally call the .NET methods on the wrapped object. Thus you cannot get the real type of the objects this way. Moreover, these methods can be overridden; consider this code:
$c='String'|Add-Member -Type ScriptMethod -Name GetType -Value {[int]} -Force -PassThru
$c.GetType().FullName
# System.Int32
Let us call the .NET methods through reflection:
$GetType=[Object].GetMethod('GetType')
$GetType.Invoke($c,$null).FullName
# System.String
$GetType.Invoke($a[1],$null).FullName
# System.String
$GetType.Invoke($b[1],$null).FullName
# System.String
Now we get the real type for $c, but it says that the type of $b[1] is String, not PSObject. As I said, in most cases unwrapping is done transparently, so you see the wrapped String and not the PSObject itself. One particular case where it does not happen is when you pass an array: the array elements are not unwrapped. So, let us add an additional level of indirection here:
$Invoke=[Reflection.MethodInfo].GetMethod('Invoke',[Type[]]([Object],[Object[]]))
$Invoke.Invoke($GetType,($a[1],$null)).FullName
# System.String
$Invoke.Invoke($GetType,($b[1],$null)).FullName
# System.Management.Automation.PSObject
Now, as we pass $b[1] as part of an array, we can see its real type: PSObject. That said, I prefer to use [Type]::GetTypeArray instead.
About StringComparer: when the compared objects are not both strings, StringComparer relies on IComparable.CompareTo for the comparison. PSObject implements the IComparable interface, so the sorting is done according to PSObject's IComparable implementation.

Windows uses Unicode, not ASCII, so what you're seeing is the Unicode sort order for en-US. The general rules for sorting are:
numbers, then lowercase and uppercase intermixed
Special characters occur before numbers.
Extending your example,
$a = @( 'ABCZ', 'ABC_', 'ABCA', 'ABC4', 'abca' )
$a | sort-object
ABC_
ABC4
abca
ABCA
ABCZ

If you really want to do this.... I will admit it's ugly but it works. I would create a function if this is something you need to do on a regular basis.
$a = @( 'ABCZ', 'ABC_', 'ABCA', 'ab1z' )
$ascii = @()
foreach ($item in $a)
{
$string = ""
for ($i = 0; $i -lt $item.length; $i++)
{
$char = [int] [char] $item[$i]
$string += "$char;"
}
$ascii += $string
}
$b = @()
foreach ($item in $ascii | Sort-Object)
{
$string = ""
$array = $item.Split(";")
foreach ($char in $array)
{
$string += [char] [int] $char
}
$b += $string
}
$a
$b
ABCA
ABCZ
ABC_

I tried the following and the sort is as expected:
[System.Collections.ArrayList] $al = [String[]] $a


Building table from list using nearest value?

I have a list similar to this...
1ID:42
2ID:85853
Name:Chris
Age:99
Species:Monkey
Name:Bob
Age:23
Species:Fish
Name:Alex
Age:67
Species:Cat
1ID:987
2ID:775437
Name:Tiffany
Age:32
Species:Dog
1ID:777
2ID:65336
Name:Becky
Age:122
Species:Hippo
I want to create a table where some of the data is taken from the nearest result. This prevents me from simply replacing "\n", "Name:", etc to make my table.
This is what I want to end up with...
Chris 99 Monkey 42 85853
Bob 23 Fish 42 85853
Alex 67 Cat 42 85853
Tiffany 32 Dog 987 775437
Becky 122 Hippo 777 65336
I hope that makes sense. The last 2 columns are taken from the nearest previous 1ID and 2ID.
There could be any number of entries after the "ID" values.
Assumptions:
data is always formatted as presented and there is always a complete 3-tuple of name/age/species
first field of each line is spelled/capitalized exactly as in the example (the solution is based on an exact match)
Sample data file:
$ cat species.dat
1ID:42
2ID:85853
Name:Chris
Age:99
Species:Monkey
Name:Bob
Age:23
Species:Fish
Name:Alex
Age:67
Species:Cat
1ID:987
2ID:775437
Name:Tiffany
Age:32
Species:Dog
1ID:777
2ID:65336
Name:Becky
Age:122
Species:Hippo
One awk solution:
awk -F":" '
$1 == "1ID" { id1=$2 ; next }
$1 == "2ID" { id2=$2 ; next }
$1 == "Name" { name=$2 ; next }
$1 == "Age" { age=$2 ; next }
$1 == "Species" { print name,age,$2,id1,id2 }
' species.dat
NOTE: The next clauses are optional since each line is matching on a specific value in field 1 ($1).
Running the above generates:
Chris 99 Monkey 42 85853
Bob 23 Fish 42 85853
Alex 67 Cat 42 85853
Tiffany 32 Dog 987 775437
Becky 122 Hippo 777 65336
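For comparison, the same carry-forward idea can be sketched in plain bash, without awk. This is only a sketch under the same assumptions as the awk version (exact field names, complete name/age/species triples); the function name build_table is made up for illustration.

```shell
#!/bin/bash
# build_table (hypothetical name): read "key:value" lines on stdin and print
# one row per Species line, reusing the most recently seen 1ID and 2ID.
build_table() {
    local key value id1 id2 name age
    while IFS=: read -r key value; do
        case $key in
            1ID)     id1=$value ;;
            2ID)     id2=$value ;;
            Name)    name=$value ;;
            Age)     age=$value ;;
            Species) printf '%s %s %s %s %s\n' "$name" "$age" "$value" "$id1" "$id2" ;;
        esac
    done
}

# Small demonstration using the first few lines of the sample data.
build_table <<'EOF'
1ID:42
2ID:85853
Name:Chris
Age:99
Species:Monkey
Name:Bob
Age:23
Species:Fish
EOF
```

With the sample file it would be called as build_table < species.dat. The IDs are only overwritten when a new 1ID or 2ID line is read, which is what makes every later record pick up the nearest previous values.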
Please see if the following code fits your requirements
use strict;
use warnings;
use feature 'say';
my($id1,$id2,$name,$age,$species);
my $ready = 0;
$~ = 'STDOUT_HEADER';
write;
$~ = 'STDOUT';
while(<DATA>) {
$id1 = $1 if /^1ID:\s*(\d+)/;
$id2 = $1 if /^2ID:\s*(\d+)/;
$name = $1 if /^Name:\s*(\w+)/;
$age = $1 if /^Age:\s*(\d+)/;
$species = $1 if /^Species:\s*(\w+)/;
$ready = 1 if /^Species:/; # trigger flag for output
if( $ready ) {
$ready = 0;
write;
}
}
format STDOUT_HEADER =
Name Age Species Id1 Id2
---------------------------------
.
format STDOUT =
@<<<<<<< @>> @<<<<<< @>> @>>>>>>
$name, $age, $species, $id1, $id2
.
__DATA__
1ID:42
2ID:85853
Name:Chris
Age:99
Species:Monkey
Name:Bob
Age:23
Species:Fish
Name:Alex
Age:67
Species:Cat
1ID:987
2ID:775437
Name:Tiffany
Age:32
Species:Dog
1ID:777
2ID:65336
Name:Becky
Age:122
Species:Hippo
Output
Name Age Species Id1 Id2
---------------------------------
Chris 99 Monkey 42 85853
Bob 23 Fish 42 85853
Alex 67 Cat 42 85853
Tiffany 32 Dog 987 775437
Becky 122 Hippo 777 65336
Would you try the following:
awk -F: '{a[$1]=$2} /^Species:/ {print a["Name"],a["Age"],a["Species"],a["1ID"],a["2ID"]}' file.txt
Here is an example in Perl:
use feature qw(say);
use strict;
use warnings;
my $fn = 'file.txt';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my ($id1, $id2);
while( my $line = <$fh> ) {
chomp $line;
if ( $line =~ /^1ID:(\d+)/ ) {
$id1 = $1;
}
elsif ( $line =~ /^2ID:(\d+)/ ) {
$id2 = $1;
}
else {
my ( $name, $age, $species ) = get_block( $fh, $line );
say "$name $age $species $id1 $id2";
}
}
close $fh;
sub get_value {
my ( $line, $key ) = @_;
my ($key2, $value) = $line =~ /^(\S+):(.*)/;
if ( $key2 ne $key ) {
die "Bad format";
}
return $value;
}
sub get_block {
my ( $fh, $line ) = @_;
my $name = get_value( $line, 'Name' );
$line = <$fh>;
my $age = get_value( $line, 'Age' );
$line = <$fh>;
my $species = get_value( $line, 'Species' );
return ( $name, $age, $species );
}
Output:
Chris 99 Monkey 42 85853
Bob 23 Fish 42 85853
Alex 67 Cat 42 85853
Tiffany 32 Dog 987 775437
Becky 122 Hippo 777 65336
This might work for you (GNU sed):
sed -En '/^1ID./{N;h};/^Name/{N;N;G;s/\S+://g;s/\n/ /gp}' file
Stow the ID's in the hold space. Gather up the record in the pattern space, append the ID's, remove the labels and replace the newlines by spaces.

Why does grep not return all pairs of numbers?

Executing echo "123456" | grep -Eo "[[:digit:]]{1,2}" will return three pairs: 12, 34, 56.
Why does it not return 12, 23, 34, 45, 56?
Your regular expression does not print all five pairs because grep -o only reports non-overlapping matches.
Your regex matches one or two digits (it is the equivalent of [0-9][0-9]?) and is applied greedily starting from the left; so if you have 123456 the steps would be something like:
Start at 1 -> the longest match is 12; print it.
Resume at 3 -> the longest match is 34; print it.
Resume at 5 -> the longest match is 56; print it.
Note that the search resumes after the end of each match, not one character further along, so digits consumed by one match are never reused in the next one.
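A quick way to see both effects (matches never overlap, and {1,2} also accepts a single digit) is to feed grep a string with an odd number of digits. The helper name digit_runs below is made up for this sketch.

```shell
#!/bin/bash
# digit_runs (hypothetical helper): print every match that grep -o finds for
# "one or two digits" in the given string, one match per line.
digit_runs() {
    printf '%s\n' "$1" | grep -Eo '[[:digit:]]{1,2}'
}

digit_runs 123456   # three matches: 12, 34, 56
digit_runs 12345    # the last match is the single digit 5
```

If the pattern really required two digits, the trailing 5 in the second call would not be printed at all.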
You can use other solutions for your problem.
For example, if you need all pairs in that string you can use a function that takes the first two digits, cuts off the first one and checks again, until the string is too short...
#!/bin/bash
check_pairs() {
local str="${1}"
if [ "${#str}" -ge 2 ]; then
printf "%s\n" "${str}" | sed -e "s/^\([0-9][0-9]\).*$/\1/"
check_pairs "${str#?}"
fi
}
check_pairs "123456"
exit 0
Probably there are other solutions (better, faster, stronger), but I cannot think of them right now.
While you already have a valid answer, if your shell is bash (or another advanced shell that allows string indexes), it would be far more efficient to use bash's built-in string indexing and a C-style for loop to output each pair in a given string rather than spawning a separate subshell on each loop iteration created by calling sed.
Bash string indexing allows you to access len characters within a string beginning at index start (where the valid indexes run from 0 to ${#var}-1) using the form:
${var:$start:$len}
Combine that with a C-style for loop over each index i in the string, beginning at index 1 (the 2nd character), and output the pair of characters given by:
"${var:$((i-1)):2}"
A short example would be:
str=123456
for ((i = 1; i < ${#str}; i++)); do
echo "${str:$((i-1)):2}"
done
Example Use/Output
$ str=123456; \
for ((i = 1; i < ${#str}; i++)); do echo "${str:$((i-1)):2}"; done
12
23
34
45
56
Look things over and let me know if you have further questions.
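The same substring expansion generalises from pairs to windows of any width. The following is a minimal sketch; the function name windows and its second parameter are assumptions made for illustration, not part of the answers above.

```shell
#!/bin/bash
# windows (hypothetical name): print every overlapping substring of length n.
windows() {
    local str=$1 n=$2 i
    # Stop as soon as a full window no longer fits in the string.
    for (( i = 0; i + n <= ${#str}; i++ )); do
        printf '%s\n' "${str:i:n}"
    done
}

windows 123456 2   # 12 23 34 45 56, one per line
windows 123456 3   # 123 234 345 456, one per line
```

Inside ${str:i:n} the offset and length are evaluated arithmetically, so the variables need no $ prefix.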

Convert decimal to Base-4 in bash

I have been using a pretty basic, and for the most part straightforward, method of converting base-10 numbers {1..256} to base-4 (quaternary) numbers. I use simple division $(($NUM/4)) to get the quotient and $(($NUM%4)) to get the remainder at each step, and then print the remainders in reverse to arrive at the result. I use the following bash script to do this:
#!/bin/bash
NUM="$1"
main() {
local EXP1=$(($NUM/4))
local REM1=$(($NUM%4))
local EXP2=$(($EXP1/4))
local REM2=$(($EXP1%4))
local EXP3=$(($EXP2/4))
local REM3=$(($EXP2%4))
local EXP4=$(($EXP3/4))
local REM4=$(($EXP3%4))
echo "
$EXP1 remainder $REM1
$EXP2 remainder $REM2
$EXP3 remainder $REM3
$EXP4 remainder $REM4
Answer: $REM4$REM3$REM2$REM1
"
}
main
This script works fine for numbers 0-255 (four base-4 digits top out at 3333 = 255). Beyond this range the leading digits are dropped, so results repeat or are inaccurate. This isn't so much of a problem, as I don't intend to convert numbers beyond 256 or less than 0 (negative numbers [yet]).
My question is: "Is there a simpler method to do this, possibly using expr or bc?"
Base 4 conversion in bash
int2b4() {
local val out num ret=\\n;
for ((val=$1;val;val/=4)){
out=$((val%4))$out;
}
printf ${2+-v} $2 %s${ret[${2+1}]} $out
}
Invoked with only 1 argument, this will convert to base 4 and print the result followed by a newline. If a second argument is present, a variable of this name will be populated, no printing.
int2b4 135
2013
int2b4 12345678
233012011032
int2b4 5432 var
echo $var
1110320
Detailed explanation:
The main part is (could be written):
out=""
for (( val=$1 ; val > 0 ; val = val / 4 )) ;do
out="$((val%4))$out"
done
The conversion loop should be easy to understand (I hope).
local ensures that out, val and num are local, empty variables and initialises ret='\n' locally.
The printf line uses some bashisms:
${2+-v} is empty if $2 is unset and expands to -v if it is set.
${ret[${2+1}]} becomes, respectively, ${ret[]} (i.e. ${ret[0]}, which holds \n) and ${ret[1]} (unset, so empty).
So this line becomes
printf "%s\n" $out
if there is no second argument ($2), and
printf -v var "%s" $out
if the second argument is var. (Note that no newline is appended when populating a variable; it is only added for terminal printing.)
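For readers who find the compact printf line hard to follow, here is a more explicit sketch of the same idea. It is not the author's function: the name int_to_base4 is made up, the optional second argument is tested with a plain if, and 0 is handled separately because the bare loop would produce an empty string for it.

```shell
#!/bin/bash
# int_to_base4 (hypothetical name): convert a non-negative integer to base 4.
# With one argument the result is printed; with a second argument the result
# is stored in the variable of that name instead.
int_to_base4() {
    local val=$1 out=""
    if (( val == 0 )); then
        out=0                        # the loop below never runs for 0
    fi
    while (( val > 0 )); do
        out="$(( val % 4 ))$out"     # prepend the next remainder
        val=$(( val / 4 ))
    done
    if [ "$#" -ge 2 ]; then
        printf -v "$2" '%s' "$out"   # populate the named variable, no newline
    else
        printf '%s\n' "$out"         # print for the terminal
    fi
}

int_to_base4 135            # prints 2013
int_to_base4 5432 result    # stores 1110320 in $result
echo "$result"
```

One caveat of this sketch: the target variable must not be called val or out, since those names are local to the function.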
Conversion back to decimal:
There is a bashism letting you compute with arbitrary base, under bash:
echo $((4#$var))
5432
echo $((4#1110320))
5432
In a script:
for integer in {1234..1248};do
int2b4 $integer quaternary
backint=$((4#$quaternary))
echo $integer $quaternary $backint
done
1234 103102 1234
1235 103103 1235
1236 103110 1236
1237 103111 1237
1238 103112 1238
1239 103113 1239
1240 103120 1240
1241 103121 1241
1242 103122 1242
1243 103123 1243
1244 103130 1244
1245 103131 1245
1246 103132 1246
1247 103133 1247
1248 103200 1248
Create a look-up table taking advantage of brace expansion
$ echo {a..c}
a b c
$ echo {a..c}{r..s}
ar as br bs cr cs
$ echo {0..3}{0..3}
00 01 02 03 10 11 12 13 20 21 22 23 30 31 32 33
and so, for 0-255 in decimal to base-4
$ base4=({0..3}{0..3}{0..3}{0..3})
$ echo "${base4[34]}"
0202
$ echo "${base4[255]}"
3333

How to efficiently parse out number from a string in Shell?

For example if I have the following string:
I ate [ 6 ] chicken wings and [ 5 ] dishes of salad today.
I want to parse out 6 and 5 from this string to store to two variables A and B respectively. I am thinking of using [ and ] as the delimiters, and then narrow down to delimiting with spaces.. I am looking for simpler solutions to this. Thanks.
You can do this pretty easily using "sed" to replace all non-digits with spaces and let the shell use the space as the separator:
LINE="hi 1 there, 65 apples and 73 pears"
for i in $(echo $LINE | sed -e "s/[^0-9]/ /g" )
do
echo $i
done
1
65
73
Of course you can assign "i" to any variable you want as well, or you can create an array of your numbers and print them out:
LINE="hi 1 there 65 apples and 73 pears"
nums=($(echo "$LINE" | sed -e "s/[^0-9]/ /g" ))
echo ${nums[@]}
1 65 73
grep -o with a perl regex:
line="I ate [ 6 ] chicken wings and [ 5 ] dishes of salad today."
n=( $( echo "$line" | grep -oP '(?<=\[ )\d+(?= \])' ) )
a=${n[0]} b=${n[1]}
You can work with the n array directly too:
for num in "${n[@]}"; do echo $num; done
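If spawning sed or grep is undesirable, bash's own =~ operator can do the extraction. The following is a sketch that assumes the numbers always appear as "[ N ]" with single spaces, as in the question; the function name extract_numbers and the global array numbers are made up for illustration.

```shell
#!/bin/bash
# extract_numbers (hypothetical name): collect every number that appears as
# "[ N ]" in the given string into the global array "numbers".
extract_numbers() {
    local rest=$1
    local re='\[ ([0-9]+) \]'
    numbers=()
    while [[ $rest =~ $re ]]; do
        numbers+=("${BASH_REMATCH[1]}")
        # Drop everything up to and including this match, then search again.
        rest=${rest#*"${BASH_REMATCH[0]}"}
    done
}

line="I ate [ 6 ] chicken wings and [ 5 ] dishes of salad today."
extract_numbers "$line"
A=${numbers[0]} B=${numbers[1]}
echo "A=$A B=$B"    # A=6 B=5
```

Keeping the regex in a variable avoids quoting problems inside [[ ... =~ ... ]].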

Simplest method to convert file-size with suffix to bytes

Title says it all really, but I'm currently using a simple function with a case statement to convert human-readable file size strings into a size in bytes. It works well enough, but it's a bit unwieldy for porting into other code, so I'm curious to know if there are any widely available commands that a shell script could use instead?
Basically I want to take strings such as "100g" or "100gb" and convert them into bytes.
I'm currently doing the following:
to_bytes() {
value=$(echo "$1" | sed 's/[^0123456789].*$//g')
units=$(echo "$1" | sed 's/^[0123456789]*//g' | tr '[:upper:]' '[:lower:]')
case "$units" in
t|tb) let 'value *= 1024 * 1024 * 1024 * 1024' ;;
g|gb) let 'value *= 1024 * 1024 * 1024' ;;
m|mb) let 'value *= 1024 * 1024' ;;
k|kb) let 'value *= 1024' ;;
b|'') let 'value += 0' ;;
*)
value=
echo "Unsupported units '$units'" >&2
;;
esac
echo "$value"
}
It seems a bit overkill for something I would have thought was fairly common for scripts working with files; common enough that something might exist to do this more quickly.
If there are no widely available solutions (i.e - majority of unix and linux flavours) then I'd still appreciate any tips for optimising the above function as I'd like to make it smaller and easier to re-use.
See man numfmt.
# numfmt --from=iec 42 512K 10M 7G 3.5T
42
524288
10485760
7516192768
3848290697216
# numfmt --to=iec 42 524288 10485760 7516192768 3848290697216
42
512K
10M
7.0G
3.5T
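numfmt expects upper-case suffixes such as K or G, so inputs like "100g" or "100gb" from the question need a little normalising first. The wrapper below is a sketch under that assumption; it needs GNU coreutils for numfmt and bash 4 or later for ${1^^}, and the name to_bytes_numfmt is made up.

```shell
#!/bin/bash
# to_bytes_numfmt (hypothetical name): accept sizes such as 100g or 100GB.
to_bytes_numfmt() {
    local size=${1^^}     # upper-case the suffix (needs bash 4 or later)
    size=${size%B}        # drop a trailing B so that "GB" becomes "G"
    numfmt --from=iec "$size"
}

to_bytes_numfmt 512kb   # 524288
to_bytes_numfmt 10M     # 10485760
```

All suffixes are treated as powers of 1024 here, matching the function in the question.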
toBytes() {
echo $1 | echo $((`sed 's/.*/\L\0/;s/t/Xg/;s/g/Xm/;s/m/Xk/;s/k/X/;s/b//;s/X/ *1024/g'`))
}
Here's something I wrote. It supports k, KB, and KiB. (It doesn't distinguish between powers of two and powers of ten suffixes, though, as in 1KB = 1000 bytes, 1KiB = 1024 bytes.)
#!/bin/bash
parseSize() {(
local SUFFIXES=('' K M G T P E Z Y)
local MULTIPLIER=1
shopt -s nocasematch
for SUFFIX in "${SUFFIXES[@]}"; do
local REGEX="^([0-9]+)(${SUFFIX}i?B?)?\$"
if [[ $1 =~ $REGEX ]]; then
echo $((${BASH_REMATCH[1]} * MULTIPLIER))
return 0
fi
((MULTIPLIER *= 1024))
done
echo "$0: invalid size \`$1'" >&2
return 1
)}
Notes:
Leverages bash's =~ regex operator, which stores matches in an array named BASH_REMATCH.
Notice the cleverly-hidden parentheses surrounding the function body. They're there to keep shopt -s nocasematch from leaking out of the function.
don't know if this is ok:
awk 'BEGIN{b=1;k=1024;m=k*k;g=k^3;t=k^4}
/^[0-9.]+[kgmt]?b?$/&&/[kgmtb]$/{
sub(/b$/,"")
sub(/g/,"*"g)
sub(/k/,"*"k)
sub(/m/,"*"m)
sub(/t/,"*"t)
"echo "$0"|bc"|getline r; print r; exit;}
{print "invalid input"}'
this only handles single line input. if multilines are needed, remove the exit
this checks only pattern [kgmt] and optional b. e.g. kib, mib would fail. also currently is only for lower-case.
e.g.:
kent$ echo "200kb"|awk 'BEGIN{b=1;k=1024;m=k*k;g=k^3;t=k^4}
/^[0-9.]+[kgmt]?b?$/&&/[kgmtb]$/{
sub(/b$/,"")
sub(/g/,"*"g)
sub(/k/,"*"k)
sub(/m/,"*"m)
sub(/t/,"*"t)
"echo "$0"|bc"|getline r
print r; exit
}{print "invalid input"}'
204800
Okay, so it sounds like there's nothing built-in or widely available, which is a shame, so I've had a go at reducing the size of the function and come up with something that's only really 4 lines long, though it's a pretty complicated four lines!
I'm not sure if it's suitable as an answer to my original question as it's not really what I'd call the simplest method, but I want to put it up in case anyone thinks it's a useful solution, and it does have the advantage of being really short.
#!/bin/sh
to_bytes() {
units=$(echo "$1" | sed 's/^[0123456789]*//' | tr '[:upper:]' '[:lower:]')
index=$(echo "$units" | awk '{print index ("bkmgt kbgb  mbtb", $0)}')
mod=$(echo "1024^(($index-1)%5)" | bc)
[ "$mod" -gt 0 ] &&
echo $(echo "$1" | sed 's/[^0123456789].*$//g')"*$mod" | bc
}
To quickly summarise how it works, it first strips the number from the given string and forces what remains to lowercase. It then uses awk to grab the index of the extension within a structured string of valid suffixes. The thing to note is that the string is arranged in groups of five characters (so it would need to be widened if more extensions are added); for example, k and kb are at indices 2 and 7 respectively.
The index is then reduced by one and taken modulo five, so both k and kb become 1, m and mb become 2 and so on. That value is then used as the power to which 1024 is raised to get the multiplier. If the extension was invalid this will resolve to a value of zero, and an extension of b (or nothing) will evaluate to 1.
So long as mod is greater than zero the input string is reduced to only the numeric part and multiplied by the modifier to get the end result.
This is actually how I would probably have solved this originally if I were using a language like PHP, Java etc., it's just a bit of a weird one to put together in a shell script.
I'd still very much appreciate any simplifications though!
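The index arithmetic can be checked in pure bash. This sketch assumes the lookup string pads every group to five characters, i.e. 'bkmgt kbgb  mbtb' with two spaces before mbtb; the function name suffix_power is made up for illustration.

```shell
#!/bin/bash
# suffix_power (hypothetical name): print the power of 1024 for a lower-case
# suffix, or fail if the suffix does not occur in the table.
suffix_power() {
    # Assumed layout: each group of suffixes is padded to five characters.
    local table='bkmgt kbgb  mbtb' units=$1 prefix
    [[ -n $units && $table == *"$units"* ]] || return 1
    prefix=${table%%"$units"*}        # text before the first occurrence
    echo $(( ${#prefix} % 5 ))        # same as (index - 1) % 5 with awk's 1-based index
}

for s in b k kb m mb g gb t tb; do
    printf '%s -> 1024^%s\n' "$s" "$(suffix_power "$s")"
done
```

Like the awk index lookup it mirrors, this accepts any substring of the table (for example bg), so it illustrates the arithmetic rather than validating input.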
Another variation, adding support for decimal values with a simpler T/G/M/K parser for outputs you might find from simpler Unix programs.
to_bytes() {
value=$(echo "$1" | sed -e 's/K//g' | sed -e 's/M//g' | sed -e 's/G//g' | sed -e 's/T//g' )
units=$(echo -n "$1" | grep -o .$ )
case "$units" in
T) value=$(bc <<< "scale=2; ($value * 1024 * 1024 * 1024 * 1024)") ;;
G) value=$(bc <<< "scale=2; ($value * 1024 * 1024 * 1024)") ;;
M) value=$(bc <<< "scale=2; ($value * 1024 * 1024)") ;;
K) value=$(bc <<< "scale=2; ($value * 1024)") ;;
b|'') let 'value += 0' ;;
*)
value=
echo "Unsupported units '$units'" >&2
;;
esac
echo "$value"
}
