Sort text column alphanumerically (letters before numbers)

I am trying to sort our inventory list alphanumerically by part number such that letters are sorted before numbers. For instance, given the list:
0004006A
AN42B10
0400975
1968
MS21042L3
0004006
AN414A
J961393
AN4H16A
SR22SCW20
STD1410
4914
15KE51CA
21
560
the sorted list should be:
AN4H16A
AN414A
AN42B10
J961393
MS21042L3
SR22SCW20
STD1410
0004006
0004006A
0400975
15KE51CA
1968
21
4914
560
Currently, I can only get it to sort with numbers before letters so the list looks like:
0004006
0004006A
0400975
15KE51CA
1968
21
4914
560
AN414A
AN42B10
AN4H16A
J961393
MS21042L3
SR22SCW20
STD1410
(Note especially AN4H16A coming after AN42B10 and AN414A rather than before.)
I have tried adding a custom list (A, B, C, ..., 7, 8, 9) but get the same result sorting by that list.
Is this possible?

The following solution is for LO Calc. If the data is in A1 through A15, then enter the following formula into B1.
=IF(LEN($A1)<COLUMN()-1;-1;IF(CODE(MID($A1;COLUMN()-1;1))<=CODE(9);MID($A1;COLUMN()-1;1)+27;CODE(MID($A1;COLUMN()-1;1))-CODE("A")))
This gets the first character of the string in A1 and then determines a sorting value for that character, with "A" becoming 0 (the first in sorted order) and "9" becoming 36 (the last in sorted order).
Now, drag and fill over to J1 for the rest of the characters in the string, then down to J15 for the other strings.
Then, go to Data -> Sort. Sort Key 1 is column B, sort key 2 is column C, and so on through J.
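The per-character ranking that the formula computes can also be sketched outside the spreadsheet. The Python below is only an illustration: the names char_rank and sort_key are hypothetical, and it assumes part numbers contain only the letters A-Z and digits, with at most nine characters (columns B through J).

```python
def char_rank(ch):
    """Rank one character: 'A'..'Z' -> 0..25, '0'..'9' -> 27..36."""
    if ch in "0123456789":
        return int(ch) + 27
    return ord(ch.upper()) - ord("A")


def sort_key(part, width=9):
    """One rank per character position, padded with -1 like the formula."""
    ranks = [char_rank(ch) for ch in part[:width]]
    return ranks + [-1] * (width - len(ranks))


parts = ["0004006A", "AN42B10", "0400975", "1968", "MS21042L3", "0004006",
         "AN414A", "J961393", "AN4H16A", "SR22SCW20", "STD1410", "4914",
         "15KE51CA", "21", "560"]
ordered = sorted(parts, key=sort_key)
print(ordered)
```

Sorting by the list of ranks plays the same role as sorting by columns B through J in order: the first differing position decides, and a shorter string wins a tie because of the -1 padding.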
Alternatively, select A1 through A15 and then run the following Python macro.
import uno
def custom_sort():
    # sort the currently selected cell range in place
    oSelect = XSCRIPTCONTEXT.getDocument().getCurrentSelection()
    rowTuples = oSelect.getDataArray()
    rowTuples = sorted(rowTuples, key=letters_then_numbers)
    oSelect.setDataArray(rowTuples)
def letters_then_numbers(rowTuple):
    strval = str(rowTuple[0])
    sresult = ""
    for c in strval:
        if c in "0123456789":  # if character is a digit
            c = chr(ord('z') + 1 + int(c))  # then order it after z ('0' must not collide with 'z')
        sresult += c
    return sresult
g_exportedScripts = custom_sort,


Explanation of Spark ML CountVectorizer output

Please help understand the output of the Spark ML CountVectorizer and suggest which documentation explains it.
val cv = new CountVectorizer()
  .setInputCol("Tokens")
  .setOutputCol("Frequencies")
  .setVocabSize(5000)
  .setMinTF(1)
  .setMinDF(2)
val fittedCV = cv.fit(tokenDF.select("Tokens"))
fittedCV.transform(tokenDF.select("Tokens")).show(false)
2374 should be the number of terms (words) in the dictionary.
What is the "[2,6,328,548,1234]"?
Are they the indices of the words "[airline, bag, vintage, world, champion]" in the dictionary? If so, why does the same word "airline" have a different index "0" in the second line?
+------------------------------------------+----------------------------------------------------------------+
|Tokens |Frequencies |
+------------------------------------------+----------------------------------------------------------------+
...
|[airline, bag, vintage, world, champion] |(2374,[2,6,328,548,1234],[1.0,1.0,1.0,1.0,1.0]) |
|[airline, bag, vintage, jet, set, brown] |(2374,[0,2,6,328,405,620],[1.0,1.0,1.0,1.0,1.0,1.0]) |
+------------------------------------------+----------------------------------------------------------------+
[1]: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer
There is some documentation explaining the basics, but it is pretty bare.
Yes, the numbers are indices into the vocabulary. However, the order in the frequencies vector does not correspond to the order in the tokens vector.
airline, bag and vintage appear in both rows, hence they correspond to the indices [2,6,328]. But you can't rely on them being in the same order as the tokens.
The row data type is a SparseVector. The first array holds the indices and the second the values.
e.g
vector[328]
=> 1.0
A mapping could be as follows:
vocabulary
airline 328
bag 6
vintage 2
Frequencies
2374, [2, 6, 328], [99, 5, 7]
# counts
vintage x 99
bag x 5
airline x 7
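To make the (size, indices, values) layout concrete without a Spark session, here is a plain-Python sketch. It does not use the Spark API; sparse_get and index_to_word are hypothetical names, and the vocabulary fragment is just the toy mapping above.

```python
def sparse_get(size, indices, values, i):
    """Value at position i of a sparse vector given as (size, indices, values)."""
    if not 0 <= i < size:
        raise IndexError(i)
    for idx, val in zip(indices, values):
        if idx == i:
            return val
    return 0.0  # positions that are not listed are implicitly zero


# Hypothetical vocabulary fragment (index -> word), taken from the toy mapping.
index_to_word = {2: "vintage", 6: "bag", 328: "airline"}

size, indices, values = 2374, [2, 6, 328], [99.0, 5.0, 7.0]
word_counts = {index_to_word[i]: v for i, v in zip(indices, values)}

print(word_counts)
print(sparse_get(size, indices, values, 328))
print(sparse_get(size, indices, values, 3))
```

The indices are always stored in ascending order, which is why they cannot be expected to follow the order of the tokens in the row.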
In order to get the words back, you can do a lookup in the vocabulary. The vocabulary needs to be broadcast to the workers. You will most probably also want to explode the counts per doc into separate rows.
Here is a Python snippet that extracts the 25 most frequent words per doc with a udf into separate rows and computes the mean for each word:
import pyspark.sql.types as T
import pyspark.sql.functions as F
from pyspark.sql import Row
vocabulary = sc.broadcast(fittedCV.vocabulary)
def _top_scores(v):
    # create one Row per non-zero index (i) of the sparse vector (v);
    # int()/float() turn the numpy values into plain Python types
    counts = [Row(i=int(i), count=float(c)) for i, c in zip(v.indices, v.values)]
    # => [Row(i=2, count=30.0), Row(i=362, count=40.0)]
    # return the 25 rows with the highest counts
    # (x['count'] is used because x.count would resolve to the tuple method)
    counts = sorted(counts, reverse=True, key=lambda x: x['count'])
    return counts[:25]
top_scores = F.udf(_top_scores, T.ArrayType(T.StructType().add('i', T.IntegerType()).add('count', T.DoubleType())))
def _vecToWord(i):
    return vocabulary.value[i]
# the udf must be created after the function it wraps has been defined
vec_to_word = F.udf(_vecToWord, T.StringType())
res = df.withColumn('word_count', F.explode(top_scores('Frequencies')))
=>
+-----+-----+----------+
doc_id, ..., word_count
(i, count)
+-----+-----+----------+
4711, ..., (2, 30.0)
4711, ..., (362, 40.0)
+-----+-----+----------+
res = res \
    .groupBy('word_count.i').agg(
        F.avg('word_count.count').alias('mean')) \
    .orderBy('mean', ascending=False)
res = res.withColumn('token', vec_to_word('i'))
=>
+---+---------+----------+
i, token, mean
+---+---------+----------+
328, airline, 30
2, vintage, 15
+---+---------+----------+

Sorting an array of strings in Forth

I have used CREATE to create an array of strings:
create mystringarray s" This" , s" is" , s" a", s" list" ,
And I want to sort this in ascending order. I've found some tutorials for assembly language online, but I want to do it in Forth. What is the best practice method?
You need to first make sure that your data representation is accurate.
A literal string in Forth is obtained using the word s" and so you would write, for example:
s" This" ok
Once entered, if you do .s, you'll see two values:
.s <2> 7791776 4 ok
This is a pointer to the actual string (array of characters), and a count of the number of characters in the string. Certain words in Forth understand this string representation. type is one of them. If you now entered type, you'd get the string typed on the display:
type This ok
So now you know you need two cells to represent a string obtained by s". Your create needs to take this into account and use the 2, word to store 2 cells per entry, rather than , which stores only one cell:
create myStringArray
s" This" 2,
s" is" 2,
s" an" 2,
s" array" 2,
s" of" 2,
s" strings" 2,
This is an array of address/count pairs for the strings. If you want to access one of them, you can do so as follows:
: myString ( u1 -- caddr u2 ) \ given the index, get the string address/count
  \ fetch 2 cells from myStringArray + (sizeof 2 cells)*index
  myStringArray swap 2 cells * + 2@ ;
Breaking this down, you need to take the base of your array variable, myStringArray, and add to it the correct offset to the string address/count you want. That offset is the size of an array entry (2 cells) times the index (which is on the data stack). Thus the expression myStringArray swap 2 cells * +. This is followed by 2@, which retrieves the double word (address and count) at that location.
Put to use...
3 myString type array ok
0 myString type This ok
etc...
Now that you know the basics of indexing the array, the "best practice" for sorting follows the normal best practices for choosing a sort algorithm for the kind of array you wish to sort. In this case, a bubble sort is probably appropriate for a very small array of strings. You would use the compare word to compare two strings. For example:
s" This" 0 myString compare .s <1> 0 ok
Result is 0 meaning the strings are equal.
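For readers who want to see the control flow of such a bubble sort before writing it in Forth, here is a rough Python sketch. It is not Forth code; compare is a hypothetical stand-in that returns -1, 0 or 1 in the same spirit as the Forth word, and the word list is the sample array from above.

```python
def compare(s1, s2):
    """Three-way comparison: -1 if s1 < s2, 0 if equal, 1 if s1 > s2."""
    return (s1 > s2) - (s1 < s2)


def bubble_sort(items):
    """Sort a list in place; stop early once a pass makes no swap."""
    n = len(items)
    for done in range(1, n):
        swapped = False
        for i in range(n - done):
            if compare(items[i], items[i + 1]) > 0:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:
            break
    return items


words = ["This", "is", "an", "array", "of", "strings"]
print(bubble_sort(words))
```

In the Forth version the swap would exchange two address/count pairs (two cells each) rather than the characters themselves.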
The best practice method to sort an array is to use some existing library. If the existing libraries don't fit your needs, or your main purpose is learning, then it makes sense to implement your own library.
Using a library
For example, Cell array module from The Forth Foundation Library (FFL) can be used to sort an array of any items.
Code example
include ffl/car.fs
include ffl/str.fs
0 car-new value arr \ new array in the heap
\ shortcut to keep -- add string into our 'arr' array
: k ( a1 u1 -- ) str-new dup arr car-push str-set ;
\ set compare method
:noname ( a1 a2 -- n ) >r str-get r> str-get compare ; arr car-compare!
\ dump strings from the array
: dump-arr ( -- ) arr car-length@ 0 ?do i arr car-get str-get type cr loop ;
\ populate the array
s" This" k s" is" k s" a" k s" list" k
\ test sorting
dump-arr cr
arr car-sort
dump-arr cr
The output
This
is
a
list
This
a
is
list
Using bare Forth
If you need a bare Forth solution just for learning, look at the bubble sort sample.
An array of strings should contain the string addresses only. The strings themselves should be kept in some other place. The counted string format is useful in this case, so we use the c" word for string literals. To keep the strings themselves, we place the initialization code into a definition (:noname in this case); it will keep the strings in the dictionary space.
The bubble sort is adapted from the variant for numbers to the variant for strings just by replacing the word that compares items. Note that the 2@ word returns the value from the lower address on top of the stack.
Code example
\ some helper words
: bounds ( addr1 u1 -- addr2 addr1 ) over + swap ;
: lt-cstring ( a1 a2 -- flag ) >r count r> count compare -1 = ;
\ create an array of counted strings
:noname ( -- addr cnt )
  here
  c" This" , c" is" , c" a" , c" list" ,
  here over - >cells
; execute constant cnt constant arr
\ dump strings from the array
: dump-arr ( -- ) cnt 0 ?do i cells arr + @ count type cr loop ;
\ bubble sort
: sort-arr ( -- )
  cnt 2 u< if exit then
  cnt 1 do true
    arr cnt i - cells bounds do
      i 2@ ( a2 a1 ) lt-cstring if i 2@ swap i 2! false and then
    cell +loop
    if leave then
  loop
;
\ test sorting
dump-arr cr
sort-arr
dump-arr cr
\ the output is the same as before

Confusing matrix generation example in Lua

If you want to create NxM matrix in Lua you basically do the following:
function get_zero_matrix(rows, cols)
  local matrix = {}  -- local, so the table does not leak into the globals
  for i = 1, rows do
    matrix[i] = {}
    for j = 1, cols do
      matrix[i][j] = 0
    end
  end
  return matrix
end
However, on the official Lua website I've seen the second variant:
function get_zero_matrix2(rows, cols)
  local mt = {}  -- one flat table holds every element
  for i = 1, rows do
    for j = 1, cols do
      mt[i*cols + j] = 0
    end
  end
  return mt
end
First, I don't understand how it works. How [i*M + j] index is supposed to create rows and columns?
Second, I tried this variant. It works, but what it returns is actually a flat array, not an NxM matrix:
M = get_zero_matrix2(10, 20)
print(#M, #M[1])
> attempt to get length of a nil value (field '?')
Can you please explain how the second variant works?
Maybe I am misinterpreting it.
I don't understand how it works.
For a 2D array of dimension N (rows) x M (cols), the total number of required elements is N * M. Creating all N * M elements in one go as a single array essentially gives a 1D array (a flattened 2D array) in memory. Since the formula assumes that array indices start at 0 and not at 1 (Lua's usual convention), we'll use 0-based indices here: the first M items with indices [0, M - 1] form row 0, the next M items with indices [M, 2M - 1] form row 1, and so on.
Memory layout for a 5 x 2 array; index 4 in this 1D array is (2, 0) in the 2D array.
--+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--
... | 0,0 | 0,1 | 1,0 | 1,1 | 2,0 | 2,1 | 3,0 | 3,1 | 4,0 | 4,1 | ...
--+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--
|-- row 0 --|-- row 1 --|-- row 2 --|-- row 3 --|-- row 4 --|
To access element (i, j), one has to get past the rows that come before row i and then access the jth item of that row. With 1-based counting that would be i - 1 rows, but since the index starts at 0 it is already one less, so i can be used as-is. Thus i * cols + j gives the right index.
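The index arithmetic can be checked with a small 0-based sketch. Python is used here only because its lists are 0-indexed like the layout in the diagram; flat_index and zero_matrix are hypothetical helper names.

```python
def flat_index(i, j, cols):
    """Row-major position of element (i, j), all indices 0-based."""
    return i * cols + j


def zero_matrix(rows, cols):
    """A rows x cols matrix stored as one flat list of zeros."""
    return [0] * (rows * cols)


rows, cols = 5, 2
m = zero_matrix(rows, cols)
m[flat_index(2, 0, cols)] = 7  # element (2, 0) of the 5 x 2 matrix

print(flat_index(2, 0, cols))
print(m)
```

Note that the multiplier is the number of columns, since skipping one full row means skipping cols elements.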
How [i*M + j] index is supposed to create rows and columns?
It doesn't. It's an abstraction over a 1D array of numbers, giving the interface of a matrix. In languages like C, most implementations do something similar when a 2D array is declared: int a[2][3] creates an array of 6 integers and indexes it with the formula above, so this is not an uncommon pattern.
The variant shown in PiL (Programming in Lua) is actually a representation of a 2D array as a 1D array.
Basically, consider the following 2x3 array/matrix:
{
{11, 12, 13},
{21, 22, 23}
}
The variant instead creates it as:
{
[4] = 11,
[5] = 12,
.
.
.
[9] = 23
}
Now, when you want to, let's say, get matrix[1][x], you'll instead fetch:
matrix[1 * cols + x]
It does not, in any way, create rows and columns. It is just data stored in a single row of numbers. You will have to implement your own indexing logic, which here is essentially i * M + j.
The i*M + j formula is usually seen in languages with 0-indexed arrays, like C, where the matrix would be:
{
[0] = 11,
[1] = 12,
[2] = 13,
.
.
.
[5] = 23
}
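The two key ranges can be reproduced with a short sketch for the 2x3 example. This is Python rather than Lua, so a dict stands in for the Lua table; lua_style and c_style are hypothetical names.

```python
rows, cols = 2, 3
values = [11, 12, 13, 21, 22, 23]

# Keys produced by i*M + j when i and j start at 1 (the Lua loops).
lua_style = [i * cols + j for i in range(1, rows + 1) for j in range(1, cols + 1)]

# Keys produced by the same formula when i and j start at 0 (as in C).
c_style = [i * cols + j for i in range(rows) for j in range(cols)]

lua_table = dict(zip(lua_style, values))

print(lua_style)
print(c_style)
print(lua_table[1 * cols + 2])  # element in row 1, column 2
```

With 1-based loops the first cols keys stay unused, which is why the flat table starts at key 4 in this example.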

How to convert an array of integer into a single integer

Using pseudo-code, if I have an array of integers, how can I make a single big integer that represents the same array in bits?
Example of input (using bits):
[10101, 10001, 00010, 01100]
The integer should be:
10101100010001001100
or
01100000101000110101
number = 0
for each element e
    number *= 1 + maximumRepresentableNumber
    number += e
For your example, maximumRepresentableNumber will be 11111, as that is the maximum number we can represent using the allowed number of bits (5). Adding 1 to that gives us 100000, and, if we multiply by that, it will be equivalent to a bit-shift by 5 to the left.
This would work for decimal representation as well, i.e. [123, 55, 29] will return 123055029. In this case maximumRepresentableNumber will be 999, so we'd just be multiplying by 1000.
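As a quick check of both examples, the pseudo-code translates almost line by line into Python. The function name pack is hypothetical, and the sketch assumes every element fits into the chosen width.

```python
def pack(elements, max_representable):
    """Concatenate fixed-width elements into one integer."""
    number = 0
    for e in elements:
        number *= 1 + max_representable  # make room for the next element
        number += e
    return number


bits = pack([0b10101, 0b10001, 0b00010, 0b01100], 0b11111)
dec = pack([123, 55, 29], 999)

print(format(bits, "020b"))
print(dec)
```

Leading zeros of an element (such as 00010 or 055) are preserved because the multiplier depends on the fixed width, not on the element's own value.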
What you are looking for is well known in functional programming land as fold or reduce. The basic idea is that in a list
a, b, c, d, ..., x
we replace the commas with an operation we want (symbolized by $ here) and the empty list with some default value Z:
a $ (b $ (c $ (d $ ... (x $ Z)))) // right fold
A bit different is the left fold, where we start out with Z:
((((Z $ a) $ b) $ c) ... $ x) // left fold
The general imperative algorithm for a left fold would be:
result = Z
for each e in list do result = result $ e
Now, the only problem left is to identify $ and Z, that is the function we want to apply subsequently to all list elements to reach the goal and the starting value. In your case, what you want is either:
append the stringified element to the result string. Z is the empty string.
or: add the element to the result so far multiplied with 2^5. Z would be 0.
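Both choices of $ and Z can be written as left folds with functools.reduce from the Python standard library. This is a sketch that assumes a fixed width of 5 bits per element, as in the question.

```python
from functools import reduce

elements = [0b10101, 0b10001, 0b00010, 0b01100]

# $ = "multiply the result so far by 2^5, then add the element", Z = 0
as_number = reduce(lambda acc, e: acc * 2**5 + e, elements, 0)

# $ = "append the stringified element", Z = the empty string
as_string = reduce(lambda acc, e: acc + format(e, "05b"), elements, "")

print(as_number)
print(as_string)
```

The two results describe the same value: parsing the string as a binary number gives the integer from the first fold.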
In Ruby:
answer = 0
[0b10101, 0b10001, 0b00010, 0b01100].each do |x|
  answer <<= 5
  answer |= x
end
puts answer.to_s(2) # 10101100010001001100
In Python this would be:
a = [0b10101, 0b10001, 0b00010, 0b01100]
b = 0
for elem in a:
    b <<= 5  # fixed width of 5 bits; elem.bit_length() would drop leading zeros
    b |= elem
print(bin(b))  # 0b10101100010001001100

Modified VlookUp which return k-th value corresponding to lookup value

I would like to modify this function: Custom Excel VBA Function (Modified VLOOKUP) from Cell referring to a range in different file gives an error
The functionality I need is conceptually simple: I need a VLOOKUP that returns the value corresponding to the k-th occurrence of the lookup value instead of the standard first one.
If the k-th occurrence doesn't exist, the function should return an error. Example:
Spreadsheet-like data:
    A    B
1   "a"  "1a"
2   "a"  "2a"
3   "b"  "1b"
4   "a"  "3a"
5   "b"  "2b"
VLOOKUPnew(lookup_value =A1, table_array =A1:B5,
col_index_num = 2, exactMatch =0, k=1) should return 1a
VLOOKUPnew(lookup_value =A1, table_array =A1:B5,
col_index_num = 2, exactMatch =0, k=2) should return 2a
VLOOKUPnew(lookup_value =A1, table_array =A1:B5,
col_index_num = 2, exactMatch =0, k=3) should return 3a
VLOOKUPnew(lookup_value =A3, table_array =A1:B5,
col_index_num = 2, exactMatch =0, k=1) should return 1b
VLOOKUPnew(lookup_value =A3, table_array =A1:B5,
col_index_num = 2, exactMatch =0, k=2) should return 2b
VLOOKUPnew(lookup_value =A3, table_array =A1:B5,
col_index_num = 2, exactMatch =0, k=3) should return error
I'm familiar with R and Matlab, so my thinking is vector oriented. I first tried to write code for the cases k=1 and k=2 by rewriting one line of code (from the post I'm linking to):
row = .Match(lookup_value, table_array.Columns(1), 0)
into :
If k = 2 Then
    row_1 = .Match(lookup_value, table_array.Columns(1), 0)
    number_of_rows = table_array.Columns(1).Rows.Count
    ' The next line is pseudocode because I don't have any idea how to write it properly
    ' (.Rows((row_1+1):number_of_rows) is a vector of numbers and it looks quite funny)
    row = .Match(lookup_value, table_array.Columns(1).Rows((row_1+1):number_of_rows), 0)
Else
    row = .Match(lookup_value, table_array.Columns(1), 0)
End If
For k > 2 it would be simple (but inefficient) to put this code into a For loop.
I've noticed that a modified .Match() that also takes k as a parameter would do the whole job. Using a For loop to find the position of the k-th occurrence of a value seems quite slow, or maybe I'm just not very familiar with VBA.
You may try out both of these Excel-based formulas; adjust them according to your data table.
Method 1:
The COUNTIF function allows you to count the number of occurrences of a lookup value in a column range.
=COUNTIF(columnRange,lookupvalue)
Assuming this is what you are looking for (data extracted from the reference):
The CUST column is populated using =F78&COUNTIF($F$75:$F78,F78)
Master data runs from F75 to H84
Customer  CUST     Phone number
Smith     Smith1   320-966-4023
Smith     Smith2   686-612-7782
Jason     Jason1   122-617-7154
Albert    Albert1  547-436-7376
Nancy     Nancy1   956-633-7322
Smith     Smith3   132-716-5240
Grove     Grove1   340-267-0529
Andy      Andy1    531-413-4718
Jason     Jason2   613-228-4294
Nancy     Nancy2   272-525-2042
Final nth Lookup:
e.g. Phone number for the 4th occurrence of Customer = Smith
=VLOOKUP($D$74&"4",$G$75:$H$93,2,FALSE)
Lookup
Customer Smith
Phone number
1st 320-966-4023
2nd 686-612-7782
3rd 132-716-5240
4th 185-813-8883
Reference from Chandoo: 4. Lookup 2nd / 3rd / 4th occurrence of an item in a list.
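The helper-column idea (name plus running count as a unique key) can be sketched in Python to show what the CUST column and the final VLOOKUP are doing. This is an illustration only, not VBA or an Excel feature; build_helper_index is a hypothetical name and the rows are the ten shown in the table above.

```python
from collections import defaultdict


def build_helper_index(rows):
    """Map 'name + running count' (e.g. 'Smith2') to the phone number."""
    seen = defaultdict(int)
    index = {}
    for name, phone in rows:
        seen[name] += 1  # plays the role of COUNTIF($F$75:$F78, F78)
        index[f"{name}{seen[name]}"] = phone
    return index


master = [
    ("Smith", "320-966-4023"), ("Smith", "686-612-7782"),
    ("Jason", "122-617-7154"), ("Albert", "547-436-7376"),
    ("Nancy", "956-633-7322"), ("Smith", "132-716-5240"),
    ("Grove", "340-267-0529"), ("Andy", "531-413-4718"),
    ("Jason", "613-228-4294"), ("Nancy", "272-525-2042"),
]
index = build_helper_index(master)

print(index["Smith3"])
print("Smith4" in index)
```

One caveat of this kind of key: names that themselves end in digits could collide (for example "A1" with count 1 and "A" with count 11), so a separator between name and count may be safer.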
Method 2:
Sample data used for following formula:
Formula:
=INDEX(ALTable,SMALL(IF(OFFSET(ALTable,0,0,ROWS(ALTable),1)=F90,
ROW(OFFSET(ALTable,0,0,ROWS(ALTable),1))-ROW(OFFSET(ALTable,0,0,1,1))+1,
ROW(OFFSET(ALTable,ROWS(ALTable)-1,0,1,1))+1),F91),2)
Reference from CPearson: Arbitrary Lookups.
Personally, though, I don't fancy volatile functions such as OFFSET().
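The idea behind the array formula (collect the positions of all matching rows, then take the k-th smallest) can be sketched in Python as follows. It is an illustration of the logic only, not a translation of the VBA function; lookup_kth is a hypothetical name and the table is the sample data from the question.

```python
def lookup_kth(table, lookup_value, col_index, k):
    """Return column col_index (1-based) of the k-th row whose first
    column equals lookup_value; raise LookupError if there is none."""
    matching_rows = [r for r, row in enumerate(table) if row[0] == lookup_value]
    if not 1 <= k <= len(matching_rows):
        raise LookupError(f"no occurrence number {k} of {lookup_value!r}")
    return table[matching_rows[k - 1]][col_index - 1]


table = [("a", "1a"), ("a", "2a"), ("b", "1b"), ("a", "3a"), ("b", "2b")]

print(lookup_kth(table, "a", 2, 3))
print(lookup_kth(table, "b", 2, 2))
```

A single pass over the first column is enough here, so a plain loop in VBA that counts matches until it reaches k should follow the same pattern.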
