Sorting an array of strings in Forth - sorting

I have used CREATE to create an array of strings:
create mystringarray s" This" , s" is" , s" a", s" list" ,
And I want to sort this in ascending order. I've found some tutorials for assembly language online, but I want to do it in Forth. What is the best practice method?

You need to first make sure that your data representation is accurate.
A literal string in Forth is obtained using the word s" and so you would write, for example:
s" This" ok
Once entered, if you do .s, you'll see two values:
.s <2> 7791776 4 ok
This is a pointer to the actual string (array of characters), and a count of the number of characters in the string. Certain words in Forth understand this string representation. type is one of them. If you now entered type, you'd get the string typed on the display:
type This ok
So now you know you need two cells to represent a string obtained by s". Your create needs to take this into account and use the 2, word to store 2 cells per entry, rather than , which stores only one cell:
create myStringArray
s" This" 2,
s" is" 2,
s" an" 2,
s" array" 2,
s" of" 2,
s" strings" 2,
This is an array of address/count pairs for the strings. If you want to access one of them, you can do so as follows:
: myString ( u1 -- caddr u2 ) \ given the index, get the string address/count
\ fetch 2 cells from myStringArray + (sizeof 2 cells)*index
myStringArray swap 2 cells * + 2@ ;
Breaking this down, you need to take the base of your array variable, myStringArray, and add to it the correct offset to the string address/count you want. That offset is the size of an array entry (2 cells) times the index (which is on the data stack). Thus the expression myStringArray swap 2 cells * +. This is followed by 2@, which retrieves the double word (address and count) at that location.
Put to use...
3 myString type array ok
0 myString type This ok
etc...
Now that you know the basics of indexing the array, the "best practice" for sorting follows the normal best practices for choosing a sort algorithm for the kind of array you wish to sort. In this case, a bubble sort is probably appropriate for a very small array of strings. You would use the compare word to compare two strings. For example:
s" This" 0 myString compare .s <1> 0 ok
Result is 0 meaning the strings are equal.

The best practice method to sort an array is to use some existing library. If the existing libraries don't fit your needs, or your main purpose is learning, then it makes sense to implement your own library.
Using a library
For example, Cell array module from The Forth Foundation Library (FFL) can be used to sort an array of any items.
Code example
include ffl/car.fs
include ffl/str.fs
0 car-new value arr \ new array in the heap
\ shortcut to keep -- add string into our 'arr' array
: k ( a1 u1 -- ) str-new dup arr car-push str-set ;
\ set compare method
:noname ( a1 a2 -- n ) >r str-get r> str-get compare ; arr car-compare!
\ dump strings from the array
: dump-arr ( -- ) arr car-length@ 0 ?do i arr car-get str-get type cr loop ;
\ populate the array
s" This" k s" is" k s" a" k s" list" k
\ test sorting
dump-arr cr
arr car-sort
dump-arr cr
The output
This
is
a
list
This
a
is
list
Using bare Forth
If you need a bare Forth solution just for learning, look at the bubble sort sample.
An array of strings should contain the string addresses only. The strings themselves should be kept somewhere else. The counted string format is convenient in this case, so we use the c" word for string literals. To keep the strings themselves, we place the initialization code inside a definition (:noname in this case), which keeps the strings in the dictionary space.
The bubble sort is adapted from the variant for numbers to a variant for strings simply by replacing the word that compares items. Note that the 2@ word returns the value at the lower address on top of the stack.
Code example
\ some helper words
: bounds ( addr1 u1 -- addr2 addr1 ) over + swap ;
: lt-cstring ( a1 a2 -- flag ) >r count r> count compare -1 = ;
\ create an array of counted strings
:noname ( -- addr cnt )
here
c" This" , c" is" , c" a" , c" list" ,
here over - >cells
; execute constant cnt constant arr
\ dump strings from the array
: dump-arr ( -- ) cnt 0 ?do i cells arr + @ count type cr loop ;
\ bubble sort
: sort-arr ( -- )
cnt 2 u< if exit then
cnt 1 do true
arr cnt i - cells bounds do
i 2@ ( a2 a1 ) lt-cstring if i 2@ swap i 2! false and then
cell +loop
if leave then
loop
;
\ test sorting
dump-arr cr
sort-arr
dump-arr cr
\ the output is the same as before
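For readers who find the stack manipulation hard to follow, the same early-exit bubble sort can be sketched in Python. This is only an illustrative translation, not part of the Forth code above: the name bubble_sort_strings is made up here, and the items are byte strings so that the comparison is byte by byte, which is an assumption about how compare orders these ASCII strings.

```python
def bubble_sort_strings(items):
    """Sort a list of byte strings in place, ascending, by bubble sort."""
    n = len(items)
    for done in range(1, n):
        swapped = False
        # After each pass the largest remaining item has bubbled to the end,
        # so the inner loop can stop one slot earlier every time.
        for i in range(n - done):
            if items[i + 1] < items[i]:  # bytes compare by byte value
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:  # a pass without swaps means the list is sorted
            break
    return items


words = [b"This", b"is", b"a", b"list"]
print(bubble_sort_strings(words))  # [b'This', b'a', b'is', b'list']
```

"This" sorts first because an uppercase "T" has a smaller byte value than any lowercase letter, which matches the output shown earlier.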

Related

Sort text column alphanumerically (letters before numbers)

I am trying to sort our inventory list alphanumerically by part number such that letters are sorted before numbers. For instance, given the list:
0004006A
AN42B10
0400975
1968
MS21042L3
0004006
AN414A
J961393
AN4H16A
SR22SCW20
STD1410
4914
15KE51CA
21
560
the sorted list should be:
AN4H16A
AN414A
AN42B10
J961393
MS21042L3
SR22SCW20
STD1410
0004006
0004006A
0400975
15KE51CA
1968
21
4914
560
Currently, I can only get it to sort with numbers before letters so the list looks like:
0004006
0004006A
0400975
15KE51CA
1968
21
4914
560
AN414A
AN42B10
AN4H16A
J961393
MS21042L3
SR22SCW20
STD1410
(Note especially AN4H16A coming after AN42B10 and AN414A rather than before.)
I have tried adding a custom list (A, B, C, ..., 7, 8, 9) but get the same result sorting by that list.
Is this possible?
The following solution is for LO Calc. If the data is in A1 through A15, then enter the following formula into B1.
=IF(LEN($A1)<COLUMN()-1;-1;IF(CODE(MID($A1;COLUMN()-1;1))<=CODE(9);MID($A1;COLUMN()-1;1)+27;CODE(MID($A1;COLUMN()-1;1))-CODE("A")))
This gets the first character of the string in A1 and then determines a sorting value for that character, with "A" becoming 0 (the first in sorted order) and "9" becoming 36 (the last in sorted order).
Now, drag and fill over to J1 for the rest of the characters in the string, then down to J15 for the other strings.
Then, go to Data -> Sort. Sort Key 1 is column B, sort key 2 is column C, and so on through J.
Alternatively, select A1 through A15 and then run the following Python macro.
import uno

def custom_sort():
    oSelect = XSCRIPTCONTEXT.getDocument().getCurrentSelection()
    rowTuples = oSelect.getDataArray()
    rowTuples = sorted(rowTuples, key=letters_then_numbers)
    oSelect.setDataArray(rowTuples)

def letters_then_numbers(rowTuple):
    strval = str(rowTuple[0])
    sresult = ""
    for c in strval:
        if c in (str(i) for i in range(10)):  # if character is a number
            c = chr(ord('z') + int(c))  # then order it after z
        sresult += c
    return sresult

g_exportedScripts = custom_sort,
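The ordering rule itself can be tried outside LibreOffice with plain Python. The sketch below is a variant of the key function, not the macro above: the name letters_before_digits is hypothetical, and it ranks each character with a (is_digit, character) pair instead of shifting digits past "z".

```python
def letters_before_digits(text):
    """Sort key: compare character by character, ranking any digit after any letter."""
    return [(ch.isdigit(), ch) for ch in text]


parts = ["0004006A", "AN42B10", "0400975", "1968", "MS21042L3", "0004006",
         "AN414A", "J961393", "AN4H16A", "SR22SCW20", "STD1410", "4914",
         "15KE51CA", "21", "560"]

for part in sorted(parts, key=letters_before_digits):
    print(part)
```

Because False sorts before True, a letter always wins over a digit at the first position where two part numbers differ, which puts AN4H16A ahead of AN414A as the question asks.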

Blocks as various data structures in Rebol

I gather that in Rebol one is expected to use a block for representing arbitrary structured data. Are there built-in or standard ways of treating blocks as data structures other than lists?
I am thinking of:
stacks
queues (possibly double-ended)
sets
maps aka. associative arrays
Rebol has three holders of arbitrary data that can all be treated the same way.
block! implemented as an array, for fast index (integer) referencing
list! implemented as a linked list, for fast inserting and removing data
hash! implemented as a hash referenced list, for fast lookup of both data and key
You operate on them in the same way with
insert append index? find poke select ...
but they differ a little in result and particularly in response time.
In your case use
block! for a stack
list! for queues (I think)
hash! for associative arrays
As mentioned all operate similarly (even the hash! can be referenced by index). Hence you can treat any of them as an associative array.
>> x: [a one b two c 33]
== [a one b two c 33]
>> x/a
== one
>> x/c
== 33
>> select x 'b
== two
>> pick x 4
== two
which would give exactly the same results for a hash! defined as x: make hash! [a one b two c 33]. So to add a new key/value pair:
>> x: make hash! [ a 1 b 2 c 33]
== make hash! [a 1 b 2 c 33]
>> append x [ key value ]
== make hash! [a 1 b 2 c 33 key value]
>> x/key
== value
>> select x 'key
== value
>> pick x 8
== value
Note that Rebol has no notion of key/value pairs; the hash! is just an ordered list of values for which hashes are built internally for lookup. You can therefore just as well ask what follows the value 33 above:
>> select x 33
== key
To really use it for key value pairs, use the skip refinement
>> select/skip x 33 2
== none
For the associative arrays you can also use object! in case it does not need to have dynamic fields.
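The difference between select and select/skip shown above can be modelled in a few lines of Python. This is only a rough sketch of the lookup rule, not Rebol's implementation: the function names are made up, and the real return value of select/skip may differ between Rebol versions.

```python
def select(series, value):
    """Rough model of `select`: return the item after the first match."""
    for i, item in enumerate(series[:-1]):
        if item == value:
            return series[i + 1]
    return None


def select_skip(series, value, size):
    """Rough model of `select/skip`: only the first item of each record is a key."""
    for i in range(0, len(series) - 1, size):
        if series[i] == value:
            return series[i + 1]
    return None


x = ["a", 1, "b", 2, "c", 33, "key", "value"]
print(select(x, 33))            # key
print(select_skip(x, 33, 2))    # None
print(select_skip(x, "c", 2))   # 33
```

With a record size of 2, the value 33 sits in a value slot and is never considered as a key, which is why the lookup comes back empty.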

How to convert an array of integer into a single integer

Using pseudo-code, if I have an array of integer, how can I make a single big integer that represents the same array in bits?
Example of input (using bits):
[10101, 10001, 00010, 01100]
The integer should be:
10101100010001001100
or
01100000101000110101
number = 0
for each element e
    number *= 1 + maximumRepresentableNumber
    number += e
For your example, maximumRepresentableNumber will be 11111, as that is the maximum number we can represent using the allowed number of bits (5). Adding 1 to that gives us 100000, and, if we multiply by that, it will be equivalent to a bit-shift by 5 to the left.
This would work for decimal representation as well, i.e. [123, 55, 29] will return 123055029. In this case maximumRepresentableNumber will be 999, so we'd just be multiplying by 1000.
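As a minimal Python sketch of the loop above, with both examples from this answer: the function names pack and unpack are hypothetical, and unpack is an addition that assumes you know how many elements were packed.

```python
def pack(values, max_value):
    """Combine values into one integer; each value must be in 0..max_value."""
    number = 0
    for e in values:
        number *= max_value + 1
        number += e
    return number


def unpack(number, max_value, count):
    """Inverse of pack: recover `count` values, first element first."""
    values = []
    for _ in range(count):
        number, e = divmod(number, max_value + 1)
        values.append(e)
    return values[::-1]


print(bin(pack([0b10101, 0b10001, 0b00010, 0b01100], 0b11111)))  # 0b10101100010001001100
print(pack([123, 55, 29], 999))                                  # 123055029
print(unpack(123055029, 999, 3))                                 # [123, 55, 29]
```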
What you are looking for is well known in functional programming land as fold or reduce. The basic idea is, that in a list
a,b,c,d, ..., x
we replace the commas with an operation we want (the operation being symbolized by $ here):
a $ (b $ (c $ (d $ ... (x $ Z)))) // right fold
and the empty list with some default value Z.
The left fold is a bit different, in that we start out with Z:
((((Z $ a) $ b) $ c) ... $ x)
The general imperative algorithm for a left fold would be:
result = Z
for each e in list do result = result $ e
Now, the only problem left is to identify $ and Z, that is the function we want to apply subsequently to all list elements to reach the goal and the starting value. In your case, what you want is either:
append the stringified element to the result string. Z is the empty string.
or: add the element to the result so far multiplied with 2^5. Z would be 0.
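Both choices of $ and Z can be written as a left fold with functools.reduce in Python. This is a sketch for illustration only; the 5-bit width is taken from the example in the question and the variable names are made up.

```python
from functools import reduce

elements = [0b10101, 0b10001, 0b00010, 0b01100]
WIDTH = 5  # bits per element, taken from the example in the question

# Option 1: $ appends the zero-padded binary string, Z is the empty string.
as_text = reduce(lambda acc, e: acc + format(e, f"0{WIDTH}b"), elements, "")

# Option 2: $ multiplies the result so far by 2^5 and adds the element, Z is 0.
as_number = reduce(lambda acc, e: acc * 2**WIDTH + e, elements, 0)

print(as_text)                       # 10101100010001001100
print(int(as_text, 2) == as_number)  # True
```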
In ruby:
answer = 0
[0b10101, 0b10001, 0b00010, 0b01100].each do |x|
answer <<= 5
answer |= x
end
puts answer.to_s(2) # 10101100010001001100
In Python this would be:
a = [0b10101, 0b10001, 0b00010, 0b01100]
b = 0
for elem in a:
    b <<= 5  # fixed 5-bit fields; elem.bit_length() would drop leading zeros
    b |= elem
print(bin(b))  # 0b10101100010001001100

How do I modify the Damerau-Levenshtein algorithm, such that it also includes the start index, and the end index of the larger substring?

Here is my code:
#http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
# used for fuzzy matching of two strings
# for indexing, seq2 must be the parent string
def dameraulevenshtein(seq1, seq2)
oneago = nil
min = 100000000000 #index
max = 0 #index
thisrow = (1..seq2.size).to_a + [0]
seq1.size.times do |x|
twoago, oneago, thisrow = oneago, thisrow, [0] * seq2.size + [x + 1]
seq2.size.times do |y|
delcost = oneago[y] + 1
addcost = thisrow[y - 1] + 1
subcost = oneago[y - 1] + ((seq1[x] != seq2[y]) ? 1 : 0)
thisrow[y] = [delcost, addcost, subcost].min
if (x > 0 and y > 0 and seq1[x] == seq2[y-1] and seq1[x-1] == seq2[y] and seq1[x] != seq2[y])
thisrow[y] = [thisrow[y], twoago[y-2] + 1].min
end
end
end
return thisrow[seq2.size - 1], min, max
end
There has to be some way to get the starting and ending index of the substring, seq1, within the parent string, seq2, right?
I'm not entirely sure how this algorithm works, even after reading the wiki article on it. I mean, I understand the highest-level explanation, in that it finds the insertion, deletion, and transposition difference (the lines in the second loop), but beyond that I'm a bit lost.
Here is an example of something that I want to be able to do with this (^):
substring = "hello there"
search_string = "uh,\n\thello\n\t there"
the indexes should be:
start: 5
end: 18 (last char of string)
Ideally, the search_string will never be modified. But, I guess I could take out all the white space characters (since there are only.. 3? \n \r and \t) store the indexes of each white space character, get the indexes of my substring, and then re-add in the white space characters, making sure to compensate the substring's indexes as I offset them with the white space characters that were originally in there in the first place. -- but if this could all be done in the same method, that would be amazing, as the algorithm is already O(n^2).. =(
At some point, I'd like to only allow white space characters to split up the substring (s1).. but one thing at a time
I don't think this algorithm is the right choice for what you want to do. The algorithm is simply calculating the distance between two strings in terms of the number of modifications you need to make to turn one string into another. If we rename your function to dlmatch for brevity and only return the distance, then we have:
dlmatch("hello there", "uh, \n\thello\n\t there")
=> 7
meaning that you can convert one string into the other in 7 steps (effectively by removing seven characters from the second). The problem is that 7 steps is a pretty big difference:
dlmatch("hello there", "panda here")
=> 6
This would actually imply that "hello there" and "panda here" are closer matches than the first example.
If what you are trying to do is "find a substring that mostly matches", I think you are stuck with an O(n^3) algorithm as you feed the first string to a series of substrings of the second string, and then selecting the substring that provides you the closest match.
Alternatively, you may be better off trying to do pre-processing on the search string and then doing regexp matching with the substring. For example, you could strip off all special characters and then build a regexp that looks for words in the substring that are case insensitive and can have any amount of whitespace between them.
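A minimal sketch of the regexp idea in Python, under a few assumptions: the function name find_loose is made up, whitespace is the only thing allowed between words, and the stripping of special characters mentioned above is left out. It returns the start index and the exclusive end index in the unmodified search string.

```python
import re


def find_loose(substring, search_string):
    """Return (start, end) of the first whitespace-tolerant match, or None.

    `end` is exclusive, as in Python slicing.
    """
    words = substring.split()
    if not words:
        return None
    # The words must appear in order, separated by any run of whitespace.
    pattern = r"\s+".join(re.escape(word) for word in words)
    match = re.search(pattern, search_string, re.IGNORECASE)
    return match.span() if match else None


print(find_loose("hello there", "uh,\n\thello\n\t there"))  # (5, 18)
```

Since the match is made against the original string, no index compensation for removed whitespace is needed.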

Algorithms for "fuzzy matching" strings

By fuzzy matching I don't mean similar strings by Levenshtein distance or something similar, but the way it's used in TextMate/Ido/Icicles: given a list of strings, find those which include all characters in the search string, but possibly with other characters between, preferring the best fit.
I've finally understood what you were looking for. The issue is interesting however looking at the 2 algorithms you found it seems that people have widely different opinions about the specifications ;)
I think it would be useful to state the problem and the requirements more clearly.
Problem:
We are looking for a way to speed up typing by letting users type only a few letters of the keyword they intended and offering them a list from which to select.
It is expected that all the letters of the input be in the keyword
It is expected that the letters in the input be in the same order in the keyword
The list of keywords returned should be presented in a consistent (reproducible) order
The algorithm should be case insensitive
Analysis:
The first two requirements can be summed up as follows: for an input axg we are looking for words matching the regular expression [^a]*a[^x]*x[^g]*g.*
The third requirement is purposely loose. The order in which the words appear in the list needs to be consistent; however, it's difficult to guess whether a scoring approach would be better than alphabetical order. If the list is extremely long, then a scoring approach could be better, but for a short list it's easier for the eye to find a particular item in a list sorted in an obvious manner.
Also, the alphabetical order has the advantage of consistency during typing: ie adding a letter does not completely reorder the list (painful for the eye and brain), it merely filters out the items that do not match any longer.
Nothing is specified about handling Unicode characters: for example, is à similar to a, or another character altogether? Since I know of no language that currently uses such characters in its keywords, I'll let it slip for now.
My solution:
For any input, I would build the regular expression described earlier. It is suitable for Python because the language already features case-insensitive matching.
I would then match my (alphabetically sorted) list of keywords, and output it so filtered.
In pseudo-code:
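import re

WORDS = ['Bar', 'Foo', 'FooBar', 'Other']

def GetList(input, words=WORDS):
    # one "[^c]*c" group per typed character, joined into a single pattern
    expr = ''.join('[^' + re.escape(c) + ']*' + re.escape(c) for c in input)
    return [w for w in words if re.match(expr, w, re.IGNORECASE)]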
I could have used a one-liner but thought it would obscure the code ;)
This solution works very well for incremental situations (i.e., when you match as the user types and thus keep rebuilding), because when the user adds a character you can simply refilter the result you just computed. Thus:
Either there are few characters, so the matching is quick and the length of the list does not matter much
Or there are a lot of characters, which means we are filtering a short list, so it does not matter too much if the matching takes a bit longer per element
I should also note that this regular expression does not involve back-tracking and is thus quite efficient. It could also be modeled as a simple state machine.
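The state-machine view can be sketched without any regular expression. The code below is an illustration rather than the code from this answer: the function names are made up, and the state is simply the number of typed letters matched so far.

```python
def matches(query, keyword):
    """True if the letters of query occur in keyword in order, ignoring case."""
    position = 0  # state: how many query letters have been matched so far
    query = query.lower()
    for ch in keyword.lower():
        if position < len(query) and ch == query[position]:
            position += 1
    return position == len(query)


def filter_keywords(query, keywords):
    """Keep the keywords that match, preserving their existing order."""
    return [k for k in keywords if matches(query, k)]


words = ["Bar", "Foo", "FooBar", "Other"]
print(filter_keywords("fb", words))  # ['FooBar']
print(filter_keywords("o", words))   # ['Foo', 'FooBar', 'Other']
```

Each keyword is scanned once from left to right, so the cost per keyword is linear in its length.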
Levenshtein 'Edit Distance' algorithms will definitely work on what you're trying to do: they will give you a measurement of how closely two words or addresses or phone numbers, psalms, monologues and scholarly articles match each other, allowing you to rank the results and choose the best match.
A more lightweight approach is to count up the common substrings: it's not as good as Levenshtein, but it provides usable results and runs quickly in slow languages which have access to fast 'InString' functions.
I published an Excel 'Fuzzy Lookup' in Excellerando a few years ago, using a 'FuzzyMatchScore' function that is, as far as I can tell, exactly what you need:
http://excellerando.blogspot.com/2010/03/vlookup-with-fuzzy-matching-to-get.html
It is, of course, in Visual Basic for Applications. Proceed with caution, crucifixes and garlic:
Public Function SumOfCommonStrings( _
ByVal s1 As String, _
ByVal s2 As String, _
Optional Compare As VBA.VbCompareMethod = vbTextCompare, _
Optional iScore As Integer = 0 _
) As Integer
Application.Volatile False
' N.Heffernan 06 June 2006
' THIS CODE IS IN THE PUBLIC DOMAIN
' Function to measure how much of String 1 is made up of substrings found in String 2
' This function uses a modified Longest Common String algorithm.
' Simple LCS algorithms are unduly sensitive to single-letter
' deletions/changes near the midpoint of the test words, eg:
' Wednesday is obviously closer to WedXesday on an edit-distance
' basis than it is to WednesXXX. So it would be better to score
' the 'Wed' as well as the 'esday' and add up the total matched
' Watch out for strings of differing lengths:
'
' SumOfCommonStrings("Wednesday", "WednesXXXday")
'
' This scores the same as:
'
' SumOfCommonStrings("Wednesday", "Wednesday")
'
' So make sure the calling function uses the length of the longest
' string when calculating the degree of similarity from this score.
' This is coded for clarity, not for performance.
Dim arr() As Integer ' Scoring matrix
Dim n As Integer ' length of s1
Dim m As Integer ' length of s2
Dim i As Integer ' start position in s1
Dim j As Integer ' start position in s2
Dim subs1 As String ' a substring of s1
Dim len1 As Integer ' length of subs1
Dim sBefore1 ' documented in the code
Dim sBefore2
Dim sAfter1
Dim sAfter2
Dim s3 As String
SumOfCommonStrings = iScore
n = Len(s1)
m = Len(s2)
If s1 = s2 Then
SumOfCommonStrings = n
Exit Function
End If
If n = 0 Or m = 0 Then
Exit Function
End If
's1 should always be the shorter of the two strings:
If n > m Then
s3 = s2
s2 = s1
s1 = s3
n = Len(s1)
m = Len(s2)
End If
n = Len(s1)
m = Len(s2)
' Special case: s1 is an exact substring of s2
If InStr(1, s2, s1, Compare) Then
SumOfCommonStrings = n
Exit Function
End If
For len1 = n To 1 Step -1
For i = 1 To n - len1 + 1
subs1 = Mid(s1, i, len1)
j = 0
j = InStr(1, s2, subs1, Compare)
If j > 0 Then
' We've found a matching substring...
iScore = iScore + len1
' Now clip out this substring from s1 and s2...
' And search the fragments before and after this excision:
If i > 1 And j > 1 Then
sBefore1 = left(s1, i - 1)
sBefore2 = left(s2, j - 1)
iScore = SumOfCommonStrings(sBefore1, _
sBefore2, _
Compare, _
iScore)
End If
If i + len1 < n And j + len1 < m Then
sAfter1 = right(s1, n + 1 - i - len1)
sAfter2 = right(s2, m + 1 - j - len1)
iScore = SumOfCommonStrings(sAfter1, _
sAfter2, _
Compare, _
iScore)
End If
SumOfCommonStrings = iScore
Exit Function
End If
Next
Next
End Function
Private Function Minimum(ByVal a As Integer, _
ByVal b As Integer, _
ByVal c As Integer) As Integer
Dim min As Integer
min = a
If b < min Then
min = b
End If
If c < min Then
min = c
End If
Minimum = min
End Function
Two algorithms I've found so far:
LiquidMetal
Better Ido Flex-Matching
I'm actually building something similar to Vim's Command-T and ctrlp plugins for Emacs, just for fun. I have just had a productive discussion with some clever workmates about ways to do this most efficiently. The goal is to reduce the number of operations needed to eliminate files that don't match. So we create a nested map, where at the top-level each key is a character that appears somewhere in the search set, mapping to the indices of all the strings in the search set. Each of those indices then maps to a list of character offsets at which that particular character appears in the search string.
In pseudo code, for the strings:
controller
model
view
We'd build a map like this:
{
"c" => {
0 => [0]
},
"o" => {
0 => [1, 5],
1 => [1]
},
"n" => {
0 => [2]
},
"t" => {
0 => [3]
},
"r" => {
0 => [4, 9]
},
"l" => {
0 => [6, 7],
1 => [4]
},
"e" => {
0 => [8],
1 => [3],
2 => [2]
},
"m" => {
1 => [0]
},
"d" => {
1 => [2]
},
"v" => {
2 => [0]
},
"i" => {
2 => [1]
},
"w" => {
2 => [3]
}
}
So now you have a mapping like this:
{
character-1 => {
word-index-1 => [occurrence-1, occurrence-2, occurrence-n, ...],
word-index-n => [ ... ],
...
},
character-n => {
...
},
...
}
Now searching for the string "oe":
Initialize a new map where the keys will be the indices of strings that match, and the values the offset read through that string so far.
Consume the first char from the search string "o" and look it up in the lookup table.
Since strings at indices 0 and 1 match the "o", put them into the map {0 => 1, 1 => 1}.
Now consume the next char in the input string, "e", and look it up in the table.
Here 3 strings match, but we know that we only care about strings 0 and 1.
Check if there are any offsets > the current offsets. If not, eliminate the items from our map, otherwise update the offset: {0 => 8, 1 => 3}.
Now by looking at the keys in our map that we've accumulated, we know which strings matched the fuzzy search.
Ideally, if the search is being performed as the user types, you'll keep track of the accumulated hash of results and pass it back into your search function. I think this will be a lot faster than iterating all search strings and performing a full wildcard search on each one.
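The lookup table and the incremental search can be sketched in Python as follows. This is an illustrative reading of the steps above, not the author's code: build_index and refine are made-up names, and the starting state of -1 for every string (meaning nothing matched yet) is an assumption added so that the first typed character needs no special case.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(strings):
    """Map character -> {string index -> sorted offsets of that character}."""
    index = defaultdict(dict)
    for string_idx, text in enumerate(strings):
        for offset, ch in enumerate(text):
            index[ch].setdefault(string_idx, []).append(offset)
    return index


def refine(index, state, typed):
    """Advance every surviving candidate past each newly typed character.

    `state` maps string index -> offset of the last matched character
    (-1 when nothing has been matched yet).
    """
    for ch in typed:
        occurrences = index.get(ch, {})
        survivors = {}
        for string_idx, last in state.items():
            offsets = occurrences.get(string_idx, [])
            pos = bisect_right(offsets, last)  # first offset greater than last
            if pos < len(offsets):
                survivors[string_idx] = offsets[pos]
        state = survivors
    return state


strings = ["controller", "model", "view"]
index = build_index(strings)
state = {i: -1 for i in range(len(strings))}
state = refine(index, state, "o")  # the user types "o"
state = refine(index, state, "e")  # the user then types "e"
print([strings[i] for i in sorted(state)])  # ['controller', 'model']
```

Passing the previous state back into refine is what makes the search incremental: each keystroke only touches the strings that are still candidates.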
The interesting thing about this is that you could also efficiently store the Levenshtein distance along with each match, assuming you only care about insertions, not substitutions or deletions. Though perhaps it would not be hard to add that logic too.
I recently had to solve the same problem. My solution involves scoring strings with consecutively matched letters highly and excluding strings that don't contain the typed letters in order.
I've documented the algorithm in detail here: http://blog.kazade.co.uk/2014/10/a-fuzzy-filename-matching-algorithm.html
If your text is predominantly English then you may try your hand at various Soundex algorithms
1. Classic soundex
2. Metaphone
These algorithms will let you choose words which sound like each other and will be a good way to find misspelled words.

Resources