O(n) string handling in Scheme

Background: I've been writing a little interpreter in Scheme (R5RS).
The reader/lexer takes a (sometimes long) string from input and tokenises it. It does this by matching the first few characters of the string against some token and returning the token and the remaining unmatched part of the string.
Problem: to return the remaining portion of the string, a new string is created every time a token is read. This means the reader is O(n^2) in the number of tokens present in the string.
Possible solution: convert the string to a list, which can be done in time O(n), then pull tokens from the list instead of the string, returning the remainder of the list instead of the remainder of the string. But this seems terribly inefficient and artificial.
Question: am I imagining it, or is there just no other way to do this efficiently in Scheme due to its purely functional outlook?
Edit: in R5RS Scheme, there isn't a way to return a pointer into a string. The "substring" function is the only function which extracts an object which is itself a string. But the Scheme standard insists this be a newly allocated string. Why? Because strings are not immutable in Scheme R5RS, e.g. see the "string-set!" function!!
One solution suggested below which works is to store an index into the string. Then one can read off the characters one at a time from that index until a token is read. Too bad the regexp library I'm using for the tokenisation requires an actual string not an index into one...

Consider making a shared-substring implementation of strings (this is how Java's String.substring worked before Java 7u6, for example; later versions copy). So when you want to grab a substring of a given string, rather than copying the characters, simply keep a pointer to (some location in) those characters, and a length.
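A minimal sketch of the shared-substring idea, written in Ruby rather than Scheme purely for illustration. StringView and its methods are hypothetical names, not part of any library. The view is only safe as long as nobody mutates the backing string, which is the reason R5RS makes substring copy.

```ruby
# Hypothetical StringView: shares the backing string instead of copying it.
class StringView
  attr_reader :offset, :length

  def initialize(source, offset = 0, length = source.length - offset)
    @source = source
    @offset = offset
    @length = length
  end

  # A new view onto the same backing string; no characters are copied.
  def drop(n)
    n = [n, @length].min
    StringView.new(@source, @offset + n, @length - n)
  end

  # Character at position i within the view, or nil when out of range.
  def [](i)
    return nil if i < 0 || i >= @length
    @source[@offset + i]
  end

  def empty?
    @length.zero?
  end

  # Copies characters; only call this when a real string is needed.
  def to_s
    @source[@offset, @length]
  end
end

view = StringView.new("(define x 42)")
rest = view.drop(1)      # skip the opening parenthesis
puts rest[0]             # d
puts rest.drop(7).to_s   # x 42)
```

A reader built this way returns the token plus `rest` after each step, so "the remaining input" costs a constant amount of work per token instead of a copy.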

Related

Extract substrings/values from a long text

I have a long string/text, e.g.
...blahblahblahblah,"shortcode":"Bk5z5Lgn1234",blahblahblablha...,"shortcode":"Wuipsz5Lgn1234",blahblahblablh...
I'm looking to extract all substrings of the following pattern:
"shortcode":"Bk5z5Lgn1234"
"shortcode":"Wuipsz5Lgn1234"
The values of the shortcodes, i.e. Bk5z5Lgn1234 and Wuipsz5Lgn1234, are of constant length (11 characters). Just getting the values will be fine. If getting all the occurrences of shortcode values is complicated, just getting the first value will be sufficient.
I know how to find the substrings (using the scan method), but I have no idea how to traverse the string and pull out the shortcode values.
If the code is always in the exact format that you specified, and 11 characters long, this regular expression will find them:
"shortcode":"(.{11})"
The following will return all the matches:
text.scan(/"shortcode":"(.{11})"/)
This is admittedly unlikely to be the most efficient solution, but it is simple and easy to use. Parsing HTML with regular expressions is never the best idea.
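A short sketch of how the matches can be pulled out. Note that the sample values shown in the question are not actually 11 characters long, so this version assumes the value is simply "everything up to the next double quote" and uses [^"]+ instead of .{11}; swap the pattern back if the real data is fixed-width.

```ruby
text = '...,"shortcode":"Bk5z5Lgn1234",...,"shortcode":"Wuipsz5Lgn1234",...'

# With one capture group, scan returns a one-element array per match,
# so flatten turns the result into a plain list of values.
codes = text.scan(/"shortcode":"([^"]+)"/).flatten
p codes   # ["Bk5z5Lgn1234", "Wuipsz5Lgn1234"]

# If only the first value is needed, String#[] with a capture index avoids
# scanning the rest of the text.
first_code = text[/"shortcode":"([^"]+)"/, 1]
p first_code   # "Bk5z5Lgn1234"
```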

Evaluating a frozen string

My vague understanding is that, with Ruby 2.2's freeze method on string or Ruby 2.3's frozen-string-literal: true pragma, a relevant frozen string literal is evaluated only once throughout program execution if and only if the string does not have interpolation. The following seems to illustrate this:
Not interpolated
#frozen-string-literal: true
5.times{p "".object_id}
Outputs (same object IDs):
70108065381260
70108065381260
70108065381260
70108065381260
70108065381260
Interpolated
#frozen-string-literal: true
5.times{p "#{}".object_id}
Outputs (different object IDs):
70108066220720
70108066220600
70108066220420
70108066220300
70108066220180
1. What is this property (i.e., being evaluated only once) called? It should be distinct from immutability.
2. Is my understanding of the condition when strings come to have such property correct? Where is the official documentation mentioning this?
3. Is there a way to make an interpolated string be evaluated only once?
1. Interning. The strings are said to be interned.
2. Not completely. It is more like if the interpreter can decide what the value of the string would be before evaluating it. For example, consider:
5.times { puts "#{'foo'}".object_id }
The id is the same even though there is interpolation involved.
3. No. This is an internal optimization. The main point of Object#freeze is immutability.
UPDATE: Only literal strings get interned. This is evident here.
I couldn't find the part of the code responsible for interpolation, so I'm not sure why "#{'foo'}" is considered a literal string. Note that wherever this translation occurs, it is at a lower parser level and happens well before any actual processing. This is evident from the fact that String#freeze is mapped to rb_str_freeze, which doesn't call opt_str_freeze.
"Frozen" is not about whether the string is evaluated more than once. It is, you are right, about mutability.
A string literal will be evaluated every time the line containing it is encountered.
The (only) way to make it be evaluated only once, is to put it in a line of source code that is only executed once, instead of in a loop. A string literal in a loop (or any other part of source code) will always be evaluated every time that line of source code is executed in program flow.
This is indeed separate from whether it is frozen/immutable or not, once evaluated.
The accepted answer is kind of misleading. "It is more like if the interpreter can decide what the value of the string would be before evaluating it." Nope. Not at all. It needs to be evaluated. If the string is frozen, then once it IS evaluated, it will use the same location in memory and the same object/object_id (which are two ways of saying the same thing) as all other equivalent strings. But it's still being evaluated, with or without interpolation.
(Without interpolation, 'evaluation' of a string literal is very very quick. With simple interpolation it's usually pretty quick too. You can of course use interpolation to call out to an expensive method though, hypothetically).
Without interpolation, I wouldn't worry about it at all. With interpolation, if you think your interpolation is expensive enough you don't want to do it in a loop -- the only way to avoid it is not to do it in a loop, but create the string once outside the loop.
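A small sketch of the point about expensive interpolation. The lambda expensive_label is a hypothetical stand-in for a costly method, and the counter exists only to show how many times the interpolated expression runs.

```ruby
# expensive_label is a hypothetical stand-in for a costly method call.
calls = 0
expensive_label = lambda do
  calls += 1
  "label"
end

# Interpolated inside the loop: the expression runs on every iteration.
3.times.map { "item-#{expensive_label.call}" }
puts calls   # 3

# Built once outside the loop and reused: the expression runs only once.
calls = 0
prefix = "item-#{expensive_label.call}".freeze
3.times.map { |i| "#{prefix}-#{i}" }
puts calls   # 1
```

Calling freeze here only guards the hoisted string against mutation; it is the hoisting, not the freezing, that avoids the repeated work.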
Ruby docs probably talk about "String literals" rather than "literal Strings". A "String literal" is any String created by bytes in source code (using '', "", %Q[], or any of the other ways of creating string literals in source code in Ruby), with or without interpolation.
So what kinds of Strings aren't created by String literals? Well, a string created by reading in bytes from a file or network for instance. Or a String created by taking an existing string and calling a method on it that returns a copy, like some_string.dup. "String literal" means a string created literally in source code, rather than by reading from external input. http://ruby-doc.org/core-2.1.1/doc/syntax/literals_rdoc.html

Julia: Strange characters in my string

I scraped some text from the internet, which I put in a UTF8String. I can use this string normally, but when I select some specific characters (strange characters with accents, like ú in my case), which are not part of the UTF8 standard, I get an error saying that I used invalid indexes. This only happens when the string contains strange characters; my code works with normal strings that do not contain strange characters.
Any way to solve this?
EDIT:
I have a variable word of type SubString{UTF8String}
When I call method(word), no problems occur. When I call method(word[2:end]) (assuming length of at least 2), I get an error in case the second character is strange (not in UTF8).
Julia does indexing on byte positions instead of character positions. It is way more efficient for a variable-length encoding like UTF-8, but it makes some operations need a bit more boilerplate.
The problem is that some code points are encoded as multiple bytes, and when you slice the string from 2:end you would get half of the first character (which is invalid, so you get an error).
The solution is to get the second valid index instead of 2 in the slice. I think that is something like str[nextind(str, 1):end]
PS. Sorry for a less than clear answer on my phone.
EDIT:
I tried this, and it seems that SubString{UTF8String} and UTF8String behave differently on slicing. I've reported it as bug #7811 on GitHub.
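The multi-byte issue described in this answer can be reproduced outside Julia. The sketch below uses Ruby only as an illustration of byte offsets versus character boundaries; it is not Julia's API, and Ruby byte offsets are 0-based where Julia's indices are 1-based.

```ruby
word = "a\u00FAb"   # "aúb": ú occupies two bytes in UTF-8

p word.length       # 3 (characters)
p word.bytesize     # 4 (bytes)

# Cutting at a byte offset inside ú leaves a dangling continuation byte.
broken = word.byteslice(2..-1)
p broken.valid_encoding?   # false

# Cutting at the start of a character gives a valid string.
ok = word.byteslice(1..-1)
p ok.valid_encoding?       # true
```

Julia's nextind plays the role of "find the next offset that starts a character", which is why the suggested slice uses it instead of a hard-coded 2.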

Get first lines from string (w/o processing entire string)

I know that
my_str.split("\n").first
gives me the first line of the string.
But sadly that cuts the entire string into an array. If that string is several MB in size and I only need the first 5 lines then... There's gotta be a better alternative. I could write my own method to process the string character by character, but there is probably some better method or even a built-in one for what I need?
There's String#each_line:
my_str.each_line.take(5)
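A quick sketch of why this avoids splitting the whole string: without a block, each_line returns an Enumerator, and take or first stops pulling lines once it has enough. The sample string is made up for the demonstration.

```ruby
# A made-up multi-line string standing in for the several-MB input.
my_str = (1..100_000).map { |i| "line #{i}\n" }.join

# Only the first five lines are ever produced; the rest is not split.
first_five = my_str.each_line.first(5)
p first_five
# ["line 1\n", "line 2\n", "line 3\n", "line 4\n", "line 5\n"]

# chomp: true (Ruby 2.4 and later) drops the trailing newline from each line.
first_two = my_str.each_line(chomp: true).first(2)
p first_two   # ["line 1", "line 2"]
```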

Programming idiom to parse a string in multiple-passes

I'm working on a Braille translation library, and I need to translate a string of text into braille. I plan to do this in multiple passes, but I need a way to keep track of which parts of the string have been translated and which have not, so I don't retranslate them.
I could always create a class which would track the ranges of positions in the string which had been processed, and then design my search/replace algorithm to ignore them on subsequent passes, but I'm wondering if there isn't a more elegant way to accomplish the same thing.
I would imagine that multi-pass string translation isn't all that uncommon, I'm just not sure what the options are for doing it.
A more usual approach would be to tokenize your input, then work on the tokens. For example, start by tokenizing the string into a token for each character. Then, in a first pass generate a straightforward braille mapping, token by token. In subsequent passes, you can replace more of the tokens - for example, by replacing sequences of input tokens with a single output token.
Because your tokens are objects or structs, rather than simple characters, you can attach additional information to each - such as the source token(s) you translated (or rather, transliterated) the current token from.
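A rough sketch of the token approach, in Ruby for illustration. Everything here is hypothetical: the Token struct, the helper names, and especially the output strings, which are bracketed placeholders rather than real braille cells or real contraction rules.

```ruby
# A token carries its output text, the source text it came from, and a
# flag saying whether a pass has already translated it.
Token = Struct.new(:text, :source, :done)

# Pass 0: one token per input character, nothing translated yet.
def tokenize(str)
  str.each_char.map { |ch| Token.new(ch, ch, false) }
end

# Pass 1: replace each run of untranslated tokens spelling `word`
# with a single finished token.
def contract(tokens, word, output)
  result = []
  i = 0
  while i < tokens.length
    window = tokens[i, word.length]
    if window.none?(&:done) && window.map(&:text).join == word
      result << Token.new(output, word, true)
      i += word.length
    else
      result << tokens[i]
      i += 1
    end
  end
  result
end

# Pass 2: translate whatever is still untranslated, one token at a time.
def transliterate(tokens)
  tokens.map { |t| t.done ? t : Token.new("<#{t.text}>", t.source, true) }
end

tokens = transliterate(contract(tokenize("the cat"), "the", "[THE]"))
puts tokens.map(&:text).join      # [THE]< ><c><a><t>
p tokens.map(&:source)            # ["the", " ", "c", "a", "t"]
```

Because each token records its own done flag and source, later passes can skip finished tokens without any separate bookkeeping of processed ranges.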
Check out some basic compiler theory:
Lexical Analysis
Parsing/Syntax Analysis
