Concise (one line?) binary search in Raku - algorithm

Many common operations aren't built in to Raku because they can be concisely expressed with a combination of (meta) operators and/or functions. It feels like binary search of a sorted array ought to be expressible in that way (maybe with .rotor? or …?) but I haven't found a particularly good way to do so.
For example, the best I've come up with for searching a sorted array of Pairs is:
sub binary-search(@a, $target) {
    when +@a ≤ 1 { @a[0].key == $target ?? @a[0] !! Empty }
    &?BLOCK(@a[0..^*/2, */2..*][@a[*/2].key ≤ $target], $target)
}
That's not awful, but I can't shake the feeling that it could be an awful lot better (both in terms of concision and readability). Can anyone see what elegant combo of operations I might be missing?

Here's one approach that technically meets my requirements (in that the function body fits on a single normal-length line). [But see the edit below for an improved version.]
sub binary-search(@a, \i is copy = my $=0, :target($t)) {
    for +@a/2, */2 … *≤1 {@a[i] cmp $t ?? |() !! return @a[i] with i -= $_ × (@a[i] cmp $t)}
}
# example usage (now slightly different, because it returns the index)
my @a = ((^20 .pick(*)) Z=> 'a'..*).sort;
say @a[binary-search(@a».key, :target(17))];
say @a[binary-search(@a».key, :target(1))];
I'm still not super happy with this code, because it loses a bit of readability – I still feel like there could/should be a concise way to do a binary search that also clearly expresses the underlying logic. Using a 3-way comparison feels like it's on that track, but still isn't quite there.
[edit: After a bit more thought, I came up with a more readable version of the above using reduce.
sub binary-search(@a, :target(:$t)) {
    (@a/2, */2 … *≤.5).reduce({ $^i - $^pt×(@a[$^i] cmp $t || return @a[$^i]) }) && Nil
}
In English, that reads as: for a sequence starting at the midpoint of the array and dropping by 1/2, move your index $^i by the value of the next item in the sequence – with the direction of the move determined by whether the item at that index is greater or lesser than the target. Continue until you find the target (in which case, return it) or you finish the sequence (which means the target wasn't present; return Nil)]
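For comparison, here is a minimal Python sketch of the conventional loop form of binary search, built around the same three-way comparison idea. It is not a translation of the Raku one-liners above: the lo/hi bookkeeping, the function name, and the representation of each Pair as a (key, value) tuple are all assumptions made for illustration.

```python
def binary_search(pairs, target):
    """Return the (key, value) tuple whose key equals target, or None."""
    lo, hi = 0, len(pairs) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        key = pairs[mid][0]
        # Three-way comparison: -1, 0 or 1, similar in spirit to Raku's cmp
        order = (key > target) - (key < target)
        if order == 0:
            return pairs[mid]
        if order < 0:
            lo = mid + 1   # key is too small, search the upper half
        else:
            hi = mid - 1   # key is too large, search the lower half
    return None


# Hypothetical sample data: keys 0..19 paired with letters, sorted by key
pairs = sorted(zip(range(20), "abcdefghijklmnopqrst"))
print(binary_search(pairs, 17))
print(binary_search(pairs, 25))
```

The explicit loop is longer than the reduce version, but the sign of the comparison directly picks which half to keep, which is the logic the English description is trying to convey.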

Related

What is the time complexity performance of Scala's Vector data structure?

I know that most of the Vector methods are effectively O(1) (constant time) because of the tree they use, but I cannot find any information on the contains method. My first thought is that it would have to be O(n) to check all the elements but I am not sure.
Answering the question in the title, performance characteristics (2.13 docs version) of basic operations head, tail, apply, update, prepend, append, insert are all listed as eC for Vector:
eC The operation takes effectively constant time, but this might depend on some assumptions such as maximum length of a vector or distribution of hash keys.
You are correct that contains is O(N), as there is no hashing or anything else that would avoid the need to compare against all items. Still, if you want to be sure, it is best to check the implementation.
As finding the correct implementation in the library sources can be difficult because of the many traits and overrides used to implement the containers, the best way to check this is with the debugger. Use code like:
val v = Vector(0, 1, 2)
v.contains(1)
Use the debugger to step into v.contains and the source you will see is:
def contains[A1 >: A](elem: A1): Boolean = exists (_ == elem)
If you are still not convinced at this point, some more "step into" will lead you to:
def exists(p: A => Boolean): Boolean = {
  var res = false
  while (!res && hasNext) res = p(next())
  res
}
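To make the linear cost concrete, the sketch below mirrors that exists-based loop in Python and counts the comparisons it performs. It is only an illustration of the same scan-until-found idea, not Scala's implementation; the function name and the returned counter are assumptions.

```python
def contains(items, elem):
    """Linear scan that stops at the first match; also reports comparisons made."""
    comparisons = 0
    for item in items:
        comparisons += 1
        if item == elem:
            return True, comparisons
    return False, comparisons


print(contains([0, 1, 2], 1))            # found after two comparisons
print(contains(list(range(1000)), -1))   # absent: every element is compared
```

A missing element forces one comparison per item, which is the O(N) worst case described above.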

Huffman decoding (in Scala)

I'm trying to write an algorithm to perform Huffman decoding. I am doing it in Scala - it's an assignment for a Coursera course and I don't want to violate the honor code, so the below is pseudocode rather than Scala.
The algorithm I have written takes in a tree tree and a list of bits bits, and is supposed to return the message. However, when I try it on the provided tree, I get a NoSuchElementException (head of empty list). I can't see why.
I know that my code could be tidied up a bit - I'm still very new to functional programming so I've written it in a way that makes sense to me, rather than, probably, in the most compact way.
def decode(tree, bits) [returns a list of chars]: {
  def dc(aTree, someBits, charList) [returns a list of chars]: {
    if aTree is a leaf:
      if someBits is empty: return char(leaf) + charList
      else: dc(aTree, someBits, char(leaf) + charList)
    else aTree is a fork:
      if someBits.head is 0: dc(leftFork, someBits.tail, charList)
      else someBits.head is 1: dc(rightFork, someBits.tail, charList)
  }
  dc(tree, bits, [empty list])
}
Thanks in advance for your help. It's my first time on StackOverflow, so I probably have some learning to do as to how best to use the site.
If I understand it correctly, you want to follow the forks (taking directions from the bits) until you find a leaf. Then you add the leaf's value to your char list and repeat the same steps from that point.
If I am right, then at a leaf you should pass the original tree to your helper method, rather than the current subtree (a leftFork or rightFork that is now a leaf).
So it would be something like:
    if aTree is a leaf:
      if someBits is empty: return char(leaf) + charList
      else: dc(tree, someBits, char(leaf) + charList)
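The restart-at-the-root idea can be sketched in Python rather than Scala. This is an illustrative sketch only: the Leaf and Fork classes, their field names, and the iterative loop are assumptions, and it presumes the root is a fork and the bit list ends exactly on a leaf.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    char: str


@dataclass
class Fork:
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Fork]


def decode(tree, bits):
    """Walk down from the root one bit at a time; on reaching a leaf,
    emit its character and jump back to the root of the whole tree."""
    chars = []
    node = tree
    for bit in bits:
        node = node.left if bit == 0 else node.right
        if isinstance(node, Leaf):
            chars.append(node.char)
            node = tree  # restart from the original tree, not the subtree
    return chars


# Hypothetical tree: 'a' = 0, 'b' = 10, 'c' = 11
tree = Fork(Leaf("a"), Fork(Leaf("b"), Leaf("c")))
print(decode(tree, [0, 1, 0, 1, 1]))
```

Appending to the list keeps the characters in message order, whereas prepending each character (as in the pseudocode) yields them reversed.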

When are numbers NOT Magic?

I have a function like this:
float_as_thousands_str_with_precision(value, precision)
If I use it like this:
float_as_thousands_str_with_precision(volts, 1)
float_as_thousands_str_with_precision(amps, 2)
float_as_thousands_str_with_precision(watts, 2)
Are those 1/2s magic numbers?
Yes, they are magic numbers. It's obvious that the numbers 1 and 2 specify precision in the code sample but not why. Why do you need amps and watts to be more precise than volts at that point?
Also, avoiding magic numbers allows you to centralize code changes rather than having to scour the code for the literal number 2 when your precision needs to change.
I would propose something like:
HIGH_PRECISION = 3;
MED_PRECISION = 2;
LOW_PRECISION = 1;
And your client code would look like:
float_as_thousands_str_with_precision(volts, LOW_PRECISION)
float_as_thousands_str_with_precision(amps, MED_PRECISION)
float_as_thousands_str_with_precision(watts, MED_PRECISION)
Then, if in the future you do something like this:
HIGH_PRECISION = 6;
MED_PRECISION = 4;
LOW_PRECISION = 2;
All you do is change the constants...
But to try and answer the question in the OP title:
IMO the only numbers that can truly be used and not be considered "magic" are -1, 0 and 1 when used in iteration, testing lengths and sizes and many mathematical operations. Some examples where using constants would actually obfuscate code:
for (int i=0; i<someCollection.Length; i++) {...}
if (someCollection.Length == 0) {...}
if (someCollection.Length < 1) {...}
int MyRidiculousSignReversalFunction(int i) {return i * -1;}
Those are all pretty obvious examples: start at the first element and increment by one, test whether a collection is empty, and sign reversal... ridiculous, but it works as an example. Now replace all of the -1, 0 and 1 values with 2:
for (int i=2; i<50; i+=2) {...}
if (someCollection.Length == 2) {...}
if (someCollection.Length < 2) {...}
int MyRidiculousDoublinglFunction(int i) {return i * 2;}
Now you have to start asking yourself: Why am I starting iteration on the 3rd element and checking every other one? And what's so special about the number 50? What's so special about a collection with two elements? The doubler example actually makes sense here, but you can see that the non -1, 0, 1 values of 2 and 50 immediately become magic because there's obviously something special in what they're doing and we have no idea why.
No, they aren't.
A magic number in that context would be a number that has an unexplained meaning. In your case, it specifies the precision, which is clearly visible.
A magic number would be something like:
int calculateFoo(int input)
{
    return 0x3557 * input;
}
You should be aware that the phrase "magic number" has multiple meanings. In this case, it specifies a number in source code, that is unexplainable by the surroundings. There are other cases where the phrase is used, for example in a file header, identifying it as a file of a certain type.
A literal numeral IS NOT a magic number when:
it is used one time, in one place, with very clear purpose based on its context
it is used with such common frequency and within such a limited context as to be widely accepted as not magic (e.g. the +1 or -1 in loops that people so frequently accept as being not magic).
some people accept the +1 of a zero offset as not magic. I do not. When I see variable + 1 I still want to know why, and ZERO_OFFSET cannot be mistaken.
As for the example scenario of:
float_as_thousands_str_with_precision(volts, 1)
And the proposed
float_as_thousands_str_with_precision(volts, HIGH_PRECISION)
The 1 is magic if that function for volts with 1 is going to be used repeatedly for the same purpose. Then sure, it's "magic", not because the meaning is unclear, but because you simply have multiple occurrences.
Paul's answer focused on the "unexplained meaning" part thinking HIGH_PRECISION = 3 explained the purpose. IMO, HIGH_PRECISION offers no more explanation or value than something like PRECISION_THREE or THREE or 3. Of course 3 is higher than 1, but it still doesn't explain WHY higher precision was needed, or why there's a difference in precision. The numerals offer every bit as much intent and clarity as the proposed labels.
Why is there a need for varying precision in the first place? As an engineering guy, I can assume there are three possible reasons: (a) a true engineering justification that the measurement itself is only valid to X precision, so the display should reflect that, or (b) there's only enough display space for X precision, or (c) the viewer won't care about anything higher than X precision even if it's available.
Those are complex reasons that are difficult to capture in a constant label, and are probably better served by a comment (to explain why something is being done).
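As a rough illustration of that point, the Python sketch below names each constant after the quantity it formats and puts the reason in a comment. The constant names, the stated reasons, and the body of float_as_thousands_str_with_precision are all hypothetical; the original question only shows the function's name and arguments.

```python
# The reasons in these comments are invented for illustration only.
VOLTS_DECIMALS = 1  # assumed: the voltage sensor is only accurate to 0.1 V
AMPS_DECIMALS = 2   # assumed: operators compare currents to 0.01 A
WATTS_DECIMALS = 2  # assumed: kept consistent with the amps column


def float_as_thousands_str_with_precision(value, precision):
    """Assumed behaviour: thousands separators plus a fixed number of decimals."""
    return f"{value:,.{precision}f}"


print(float_as_thousands_str_with_precision(1234.567, VOLTS_DECIMALS))
print(float_as_thousands_str_with_precision(0.98765, AMPS_DECIMALS))
```

With this layout the "why" lives next to the value in one place, and each call site still reads clearly.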
IF the use of those functions were in one place, and one place only, I would not consider the numerals magic. The intent is clear.
For reference:
A literal numeral IS magic when
"Unique values with unexplained meaning or multiple occurrences which could (preferably) be replaced with named constants." http://en.wikipedia.org/wiki/Magic_number_%28programming%29 (3rd bullet)

How to split a string into words. Ex: "stringintowords" -> "String Into Words"?

What is the right way to split a string into words?
(string doesn't contain any spaces or punctuation marks)
For example: "stringintowords" -> "String Into Words"
Could you please advise what algorithm should be used here?
Update: For those who think this question is just for curiosity: this algorithm could be used to camelcase domain names ("sportandfishing .com" -> "SportAndFishing .com"), and this algo is currently used by aboutus dot org to do this conversion dynamically.
Let's assume that you have a function isWord(w), which checks if w is a word using a dictionary. Let's for simplicity also assume for now that you only want to know whether for some word w such a splitting is possible. This can be easily done with dynamic programming.
Let S[1..length(w)] be a table with Boolean entries. S[i] is true if the prefix w[1..i] can be split into words. Then set S[1] = isWord(w[1]) and for i = 2 to length(w) calculate
S[i] = isWord(w[1..i]) or (there exists j in {2..i} such that S[j-1] and isWord(w[j..i])).
This takes O(length(w)^2) time, if dictionary queries are constant time. To actually find the splitting, just store the winning split in each S[i] that is set to true. This can also be adapted to enumerate all solutions by storing all such splits.
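The recurrence can be sketched in Python as follows. This is a minimal illustration under a few assumptions: indices are 0-based instead of 1-based, is_word is any caller-supplied predicate (a set lookup here), and the table stores the start of the last word so that one splitting can be rebuilt afterwards.

```python
def split_into_words(w, is_word):
    """Return one splitting of w into dictionary words, or None if impossible."""
    n = len(w)
    # last_start[i] is the start index of the final word in some valid
    # splitting of the prefix w[:i]; None means that prefix cannot be split.
    last_start = [None] * (n + 1)
    last_start[0] = 0  # the empty prefix is trivially splittable
    for i in range(1, n + 1):
        for j in range(i):
            if last_start[j] is not None and is_word(w[j:i]):
                last_start[i] = j
                break
    if last_start[n] is None:
        return None
    # Walk the stored split points backwards to recover the words.
    words = []
    i = n
    while i > 0:
        j = last_start[i]
        words.append(w[j:i])
        i = j
    return words[::-1]


# Hypothetical toy dictionary
dictionary = {"string", "into", "words", "in", "to"}
words = split_into_words("stringintowords", dictionary.__contains__)
print(" ".join(word.capitalize() for word in words))
```

Keeping only the first split point found per prefix yields a single solution; storing every valid j per prefix would allow all splittings to be enumerated.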
As mentioned by many people here, this is a standard, easy dynamic programming problem: the best solution is given by Falk Hüffner. Some additional info, though:
(a) you should consider implementing isWord with a trie, which will save you a lot of time if you use it properly (that is, by testing for words incrementally).
(b) searching for "segmentation dynamic programming" yields a score of more detailed answers, including university-level lectures with pseudocode, such as this lecture at Duke (which even goes so far as to provide a simple probabilistic approach for handling words that aren't contained in any dictionary).
There should be a fair bit in the academic literature on this. The key words you want to search for are word segmentation. This paper looks promising, for example.
In general, you'll probably want to learn about Markov models and the Viterbi algorithm. The latter is a dynamic programming algorithm that may allow you to find plausible segmentations for a string without exhaustively testing every possible segmentation. The essential insight here is that if you have n possible segmentations for the first m characters, and you only want to find the most likely segmentation, you don't need to evaluate every one of these against subsequent characters - you only need to continue evaluating the most likely one.
If you want to ensure that you get this right, you'll have to use a dictionary based approach and it'll be horrendously inefficient. You'll also have to expect to receive multiple results from your algorithm.
For example: windowsteamblog (of http://windowsteamblog.com/ fame)
windows team blog
window steam blog
Consider the sheer number of possible splittings for a given string. If you have n characters in the string, there are n-1 possible places to split. For example, for the string cat, you can split before the a and you can split before the t. This results in 4 possible splittings.
You could look at this problem as choosing where you need to split the string. You also need to choose how many splits there will be. So there are Sum(i = 0 to n - 1, (n - 1) choose i) possible splittings. By the binomial theorem, with x and y both being 1, this is equal to pow(2, n-1).
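The count is easy to check by brute force for short strings. The sketch below enumerates every splitting by choosing which of the n - 1 gaps to cut; the function name is an assumption and the input is assumed to be non-empty.

```python
from itertools import combinations


def all_splittings(s):
    """Enumerate every way to cut a non-empty string s into contiguous pieces."""
    n = len(s)
    result = []
    for k in range(n):  # k = number of cut points, from 0 to n - 1
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            result.append([s[a:b] for a, b in zip(bounds, bounds[1:])])
    return result


for pieces in all_splittings("cat"):
    print(" ".join(pieces))
print(len(all_splittings("abcde")) == 2 ** (5 - 1))
```

For cat this lists the four splittings mentioned above, and the count doubles with each extra character.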
Granted, a lot of this computation rests on common subproblems, so Dynamic Programming might speed up your algorithm. Off the top of my head, computing a boolean matrix M such that M[i,j] is true if and only if the substring of your given string from i to j is a word would help out quite a bit. You still have an exponential number of possible segmentations, but you would quickly be able to eliminate a segmentation if an early split did not form a word. A solution would then be a sequence of integers (i_0, j_0, i_1, j_1, ...) with the condition that j_k = i_(k+1).
If your goal is to correctly camel-case URLs, I would sidestep the problem and go for something a little more direct: Get the homepage for the URL, remove any spaces and capitalization from the source HTML, and search for your string. If there is a match, find that section in the original HTML and return it. You'd need an array NumSpaces that records, for each character of the preprocessed text, how many spaces precede it in the original string, like so:
Needle:       isashort
Haystack:     This is a short phrase
Preprocessed: thisisashortphrase
NumSpaces:    000011233333444444
And your answer would come from:
location = Preprocessed.Search(Needle)
locationInOriginal = location + NumSpaces[location]
originalLength = Needle.length() + NumSpaces[location + Needle.length() - 1] - NumSpaces[location]
Haystack.substring(locationInOriginal, originalLength)
Of course, this would break if madduckets.com did not have "Mad Duckets" somewhere on the home page. Alas, that is the price you pay for avoiding an exponential problem.
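A minimal Python sketch of this lookup is shown below. The function name is an assumption, only plain spaces are stripped (real HTML would need tag and whitespace handling), and the text is assumed to be ASCII so that lowercasing never changes its length. The end of the match is computed from the last matched character, so trailing spaces are not included.

```python
def find_original_phrase(haystack, needle):
    """Find needle in haystack ignoring spaces and case, and return the
    matching slice of the original haystack (or None if there is no match)."""
    if not needle:
        return None
    chars = []
    num_spaces = []  # num_spaces[i] = spaces preceding chars[i] in haystack
    spaces = 0
    for ch in haystack:
        if ch == " ":
            spaces += 1
        else:
            chars.append(ch.lower())
            num_spaces.append(spaces)
    preprocessed = "".join(chars)
    location = preprocessed.find(needle)
    if location < 0:
        return None
    start = location + num_spaces[location]
    last = location + len(needle) - 1          # last matched character
    end = last + num_spaces[last] + 1          # exclusive end in haystack
    return haystack[start:end]


print(find_original_phrase("This is a short phrase", "isashort"))
```

For the example above, the match starts at preprocessed index 4 and maps back to "is a short" in the original text.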
This can actually be done (to a certain degree) without a dictionary. Essentially, this is an unsupervised word segmentation problem. You need to collect a large list of domain names, apply an unsupervised segmentation learning algorithm (e.g. Morfessor) and apply the learned model to new domain names. I'm not sure how well it would work, though (but it would be interesting).
This is basically a variation of a knapsack problem, so what you need is a comprehensive list of words and any of the solutions covered in Wiki.
With a fair-sized dictionary this is going to be an insanely resource-intensive and lengthy operation, and you cannot even be sure that this problem will be solved.
Create a list of possible words, sort it from long words to short words.
Check each entry in the list against the first part of the string. If it matches, remove that part from the string and append the word to your sentence, followed by a space. Repeat this.
A simple Java solution which has O(n^2) running time.
import java.util.HashMap;
import java.util.HashSet;

public class Solution {
    // should contain the list of all words, or you can use any other data structure (e.g. a Trie)
    private HashSet<String> dictionary;

    public String parse(String s) {
        return parse(s, new HashMap<String, String>());
    }

    // map memoizes results per suffix; a null value means "cannot be split"
    public String parse(String s, HashMap<String, String> map) {
        if (map.containsKey(s)) {
            return map.get(s);
        }
        if (dictionary.contains(s)) {
            return s;
        }
        for (int left = 1; left < s.length(); left++) {
            String leftSub = s.substring(0, left);
            if (!dictionary.contains(leftSub)) {
                continue;
            }
            String rightSub = s.substring(left);
            String rightParsed = parse(rightSub, map);
            if (rightParsed != null) {
                String parsed = leftSub + " " + rightParsed;
                map.put(s, parsed);
                return parsed;
            }
        }
        map.put(s, null);
        return null;
    }
}
I was looking at the problem and thought maybe I could share how I did it.
It's a little too hard to explain my algorithm in words so maybe I could share my optimized solution in pseudocode:
string mainword = "stringintowords";
array substrings = get_all_substrings(mainword);
/** this way, one does not check the dictionary to check for word validity
 * on every substring; It would only be queried once and for all,
 * eliminating multiple travels to the data storage
 */
string query = "select word from dictionary where word in " + substrings;
array validwords = execute(query).getArray();
validwords = validwords.sort(length, desc);
array segments = [];
while(mainword != ""){
    for(x = 0; x < validwords.length; x++){
        if(mainword.startswith(validwords[x])) {
            segments.push(validwords[x]);
            mainword = mainword.remove(validwords[x]);
            x = -1; // restart from the longest word (x++ brings x back to 0)
        }
    }
    /**
     * remove the first character if any of valid words do not match, then start again
     * you may need to add the first character to the result if you want to
     */
    mainword = mainword.substring(1);
}
string result = segments.join(" ");

An efficient technique to replace an occurrence in a sequence with mutable or immutable state

I am searching for an efficient technique to find a sequence of Op occurrences in a Seq[Op]. Once an occurrence is found, I want to replace the occurrence with a defined replacement and run the same search again until the list stops changing.
Scenario:
I have three types of Op case classes. Pop() extends Op, Push() extends Op and Nop() extends Op. I want to replace the occurrence of Push(), Pop() with Nop(). Basically the code could look like seq.replace(Push() ~ Pop() ~> Nop()).
Problem:
Now that I call seq.replace(...) I will have to search in the sequence for an occurrence of Push(), Pop(). So far so good. I find the occurrence. But now I will have to splice the occurrence from the list and insert the replacement.
Now there are two options. My list could be mutable or immutable. If I use an immutable list I am scared regarding performance because those sequences are usually 500+ elements in size. If I replace a lot of occurrences like A ~ B ~ C ~> D ~ E I will create a lot of new objects, if I am not mistaken. However, I could also use a mutable sequence like ListBuffer[Op].
Basically from a linked-list background I would just do some pointer-bending and after a total of four operations I am done with the replacement without creating new objects. That is why I am now concerned about performance. Especially since this is a performance-critical operation for me.
Question:
How would you implement the replace() method in a Scala fashion and what kind of data structure would you use keeping in mind that this is a performance-critical operation?
I am happy with answers that point me in the right direction or pseudo code. No need to write a full replace method.
Thank you.
Ok, some considerations to be made. First, recall that, on lists, tail does not create objects, and prepending (::) only creates one object for each prepended element. That's pretty much as good as you can get, generally speaking.
One way of doing this would be this:
def myReplace(input: List[Op], pattern: List[Op], replacement: List[Op]) = {
  // This function should be part of a KMP algorithm instead, for performance
  def compare(pattern: List[Op], list: List[Op]): Boolean = (pattern, list) match {
    case (x :: xs, y :: ys) if x == y => compare(xs, ys)
    // Pattern exhausted: it matches as a prefix, whatever remains of the list
    case (Nil, _) => true
    case _ => false
  }
  var processed: List[Op] = Nil
  var unprocessed: List[Op] = input
  val patternLength = pattern.length
  val reversedReplacement = replacement.reverse
  // Do this until we finish processing the whole sequence
  while (unprocessed.nonEmpty) {
    // This inside algorithm would be better if replaced by KMP
    // Quickly process non-matching sequences
    while (unprocessed.nonEmpty && unprocessed.head != pattern.head) {
      processed ::= unprocessed.head
      unprocessed = unprocessed.tail
    }
    if (unprocessed.nonEmpty) {
      if (compare(pattern, unprocessed)) {
        processed :::= reversedReplacement
        unprocessed = unprocessed drop patternLength
      } else {
        processed ::= unprocessed.head
        unprocessed = unprocessed.tail
      }
    }
  }
  processed.reverse
}
You may gain speed by using KMP, particularly if the pattern searched for is long.
Now, what is the problem with this algorithm? The problem is that it won't test if the replaced pattern causes a match before that position. For instance, if I replace ACB with C, and I have an input AACBB, then the result of this algorithm will be ACB instead of C.
To avoid this problem, you should create a backtrack. First, you check at which position in your pattern the replacement may happen:
val positionOfReplacement = pattern.indexOfSlice(replacement)
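Then, you modify the replacement part of the algorithm like this:
      if (compare(pattern, unprocessed)) {
        if (positionOfReplacement > 0) {
          // Drop the matched pattern and put the replacement back so it is re-scanned
          unprocessed = replacement ::: (unprocessed drop patternLength)
          // processed is kept in reverse, so restore the original order when moving elements back
          unprocessed :::= (processed take positionOfReplacement).reverse
          processed = processed drop positionOfReplacement
        } else {
          processed :::= reversedReplacement
          unprocessed = unprocessed drop patternLength
        }
      } else {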
This will backtrack enough to solve the problem.
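The same re-scanning idea can be sketched in Python with a stack of finished elements and a queue of pending ones. This is a stack-based variant written for illustration, not the code above: the function name is hypothetical, elements are compared with ==, and the replacement is assumed to be shorter than the pattern so that the loop is guaranteed to terminate.

```python
from collections import deque


def rewrite(seq, pattern, replacement):
    """Repeatedly replace pattern with replacement until nothing matches."""
    pattern = list(pattern)
    replacement = list(replacement)
    if not pattern or len(replacement) >= len(pattern):
        raise ValueError("this sketch assumes a replacement shorter than the pattern")
    pending = deque(seq)   # elements still to be scanned
    done = []              # elements already scanned, in original order
    k = len(pattern)
    while pending:
        done.append(pending.popleft())
        if done[-k:] == pattern:
            del done[-k:]
            # Put the replacement back at the front so that it is scanned
            # again together with the elements still sitting in done.
            pending.extendleft(reversed(replacement))
    return done


print("".join(rewrite("AACBB", "ACB", "C")))
print(rewrite(["Push", "Push", "Pop", "Pop"], ["Push", "Pop"], ["Nop"]))
```

Because earlier elements stay on the done stack, a replacement that completes a new match with them is caught on the next comparison, so the AACBB example reduces all the way to C.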
This algorithm won't deal efficiently, however, with multiple patterns at the same time, which I guess is where you are going. For that, you'll probably need some adaptation of KMP, to do it efficiently, or, otherwise, use a DFA to control possible matchings. It gets even worse if you want to match both AB and ABC.
In practice, the full-blown problem is equivalent to regex match & replace, where the replace is a function of the match. Which means, of course, you may want to start looking into regex algorithms.
EDIT
I was forgetting to complete my reasoning. If that technique doesn't work for some reason, then my advice is going with an immutable tree-based vector. Tree-based vectors enable replacement of partial sequences with low amount of copying.
And if that doesn't do, then the solution is doubly linked lists. And pick one from a library with slice replacement -- otherwise you may end up spending way too much time debugging a known but tricky algorithm.

Resources