Why is String#scan not finding all the matches? - ruby

Let's use this as sample data:
text=<<EOF
#if A==20
int b = 20;
#else
int c = 30;
#endif
EOF
And this code:
puts text.scan(/\#.*?\#/m)
Why is this only capturing this:
#if A==20
int b = 20;
#
I was expecting this to match as well:
#else
int c = 30;
#
What do I have to modify so that it captures that as well? I used /m for multiline matching, but it doesn't seem to work.

It doesn't match the second part because the "#" before the else has already been consumed by the first match, so all that's left is
else
int c = 30;
#
which does not match the pattern. You can fix this by using lookahead to match the second # without consuming it:
text.scan(/#.*?(?=#)/m)
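For comparison, here is a minimal sketch of the same effect in Python, assuming re.findall as a rough stand-in for Ruby's String#scan (both return successive non-overlapping matches). The variable names are made up for the illustration.

```python
import re

text = """#if A==20
int b = 20;
#else
int c = 30;
#endif
"""

# Consuming the closing '#': it becomes part of the first match, so the
# search resumes after it and never finds another "#...#" pair.
consuming = re.findall(r"#.*?#", text, flags=re.DOTALL)

# Lookahead: the closing '#' is asserted but not consumed, so it is still
# available to start the next match.
lookahead = re.findall(r"#.*?(?=#)", text, flags=re.DOTALL)

print(consuming)
print(lookahead)
```

The first list has a single element ending in the "#" of "#else"; the second has two elements, the "#if" block and the "#else" block. The trailing "#endif" is not matched in either case because no further "#" follows it.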

The second # in your input was already matched as part of the first substring that scan found. From there, it proceeds to scan the remaining part of the string, which is:
else
int c = 30;
#endif
which of course doesn't contain anything to match your regex anymore.

.*? finds the shortest match. Try just .* instead.

Related

Duplicate substring searching

Is there an efficient way to find duplicate substrings? Here, duplicate means that two identical substrings appear directly next to each other without overlapping. For example, the source string is:
ABCDDEFGHFGH
'D' and 'FGH' are duplicated. 'F' appears twice in the sequence, but the two occurrences are not adjacent, so it does not count as a duplicate. So the algorithm should return ['D', 'FGH']. Is there an elegant algorithm for this, rather than the brute-force method?
This is related to the longest repeated substring problem, which can be solved by building a suffix tree in Θ(n) time and space.
Not very efficient (suffix tree/array are better for very large strings), but very short regular expression solution (C#):
string source = @"ABCDDEFGHFGH";
string[] result = Regex
.Matches(source, @"(.+)\1")
.OfType<Match>()
.Select(match => match.Groups[1].Value)
.ToArray();
Explanation
(.+) - group of any (at least 1) characters
\1 - the same group (group #1) repeated
Test
Console.Write(string.Join(", ", result));
Outcome
D, FGH
In case of ambiguity, e.g. "AAAA", where both "AA" and "A" are valid answers, the quantifier is greedy and thus "AA" is returned.
Without using any regex, which might turn out to be very slow, I guess it's best to use two cursors running hand in hand. The algorithm should be clear from the JS code below.
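The same backreference idea can be sketched in Python to show how greediness decides the ambiguous case. The function name and the greedy switch are assumptions made for this illustration, not part of the C# answer.

```python
import re

def adjacent_duplicates(source, greedy=True):
    # (.+) captures one or more characters; \1 requires the same text to
    # repeat immediately. (.+?) prefers the shortest repeated unit instead.
    pattern = r"(.+)\1" if greedy else r"(.+?)\1"
    return [m.group(1) for m in re.finditer(pattern, source)]

print(adjacent_duplicates("ABCDDEFGHFGH"))        # ['D', 'FGH']
print(adjacent_duplicates("AAAA"))                # ['AA']
print(adjacent_duplicates("AAAA", greedy=False))  # ['A', 'A']
```

Like the C# version, this relies on regex backtracking, so it is best suited to short inputs.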
function getNborDupes(s){
var cl = 0, // cursor left
cr = 0, // cursor right
ts = "", // test string
res = []; // result array
while (cl < s.length){
cr = cl;
while (++cr < s.length){
ts = s.slice(cl,cr); // ts starting from cl to cr (char @ cr excluded)
// check ts with subst from cr to cr + ts.length (char @ cr + ts.length excluded)
// if they match push it to result advance cursors to cl + ts.length and continue
ts === s.substr(cr,ts.length) && (res.push(ts), cl = cr += ts.length);
}
cl++;
}
return res;
}
var str = "ABCDDEFGHFGH";
console.log(getNborDupes(str));
Throughout the whole process ts will take the following values.
A
AB
ABC
ABCD
ABCDD
ABCDDE
ABCDDEF
ABCDDEFG
ABCDDEFGH
ABCDDEFGHF
ABCDDEFGHFG
B
BC
BCD
BCDD
BCDDE
BCDDEF
BCDDEFG
BCDDEFGH
BCDDEFGHF
BCDDEFGHFG
C
CD
CDD
CDDE
CDDEF
CDDEFG
CDDEFGH
CDDEFGHF
CDDEFGHFG
D
E
EF
EFG
EFGH
EFGHF
EFGHFG
F
FG
FGH
The cl = cr += ts.length part decides whether the search restarts before or after the matched substring. With the code as written, the input "ABABABAB" returns ["AB","AB"], but if you change it to cr = cl += ts.length, you should expect the result to be ["AB", "AB", "AB"].

Regex cuts word if end of string

I want to check and capture 2 or x words after and before a target string in a multiline text. The problem is that if the words matched are less than x number of words, then regex cuts off the last word and splits it till x.
For example
text = "This is an example /year"
if example is the target:
Matching Data: "is" , "an", "/yea", "r"
If I add random words after /year, it matches correctly.
How could I fix this so that if less than x words exist just stop there or return empty for the rest of the matches?
So it should be
Matching Data: "is" , "an", "/year", ""
def checkWords(target, text, numLeft = 2, numRight = 2)
target = target.compact.map{|x| x.inspect}.join('').gsub(/"/, '')
regex = ""
regex += "\\s+{,2}(\\S+)\\s+{,2}" * numLeft
regex += target
regex += "\\s+{,2}(\\S+)" * numRight
pattern = Regexp.new(regex)
matches = pattern.match(text)
puts matches.inspect
end
Since you want to capture the words before and after target, you need to set a capturing group around the whole regex parts that match the 0 to 2 occurrences of spaces-non-spaces. Also, you need to allow a minimum bound of 0 - use the {0,2} (or the more succinct {,2}) limiting quantifier to make sure you get the context on the left even if it is missing on the right:
/((?:\S+\s+){,2})target((?:\s+\S+){,2})/
^ ^ ^ ^
See this Rubular demo
If you use /(?:(\S+)\s+){0,2}target(?:\s+(\S+)){0,2}/, all captured values but the last one will be lost, i.e. once quantified, repeated capturing groups only store the value captured during the last iteration in the group buffer.
Also note that setting a {,2} quantifier on the + quantifier makes no sense, \\s+{,2} = \\s+.
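A rough Python sketch of the same pattern, assuming a hypothetical helper context_words that builds the regex from the target and the two word counts. It uses {0,n} explicitly and escapes the target so that characters such as "/" or "." are matched literally.

```python
import re

def context_words(target, text, num_left=2, num_right=2):
    # One capturing group around each repeated (non-capturing) word group,
    # so all context words survive instead of only the last iteration.
    pattern = re.compile(
        r"((?:\S+\s+){0,%d})%s((?:\s+\S+){0,%d})"
        % (num_left, re.escape(target), num_right)
    )
    match = pattern.search(text)
    if match is None:
        return None
    # Split each captured context run into its individual words.
    return match.group(1).split(), match.group(2).split()

print(context_words("example", "This is an example /year"))
# (['is', 'an'], ['/year'])
```

When fewer than num_right words follow the target, the right-hand list is simply shorter, and no word gets cut in the middle.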

Removing occurrences of #ifdef/#endif from a file with perl

I have a code file that has some #ifdefs I would like removed in the header file after building a library. My first thought was to do this as a perl script that XCode can run. While I can certainly open the header file and read all content of it into a string in perl, I'm curious as to the best way to do the following
Find any occurrence of #ifdef EXAMPLE
Remove it and anything in between the following #endif
So the example is:
int i;
NSString *someString;
#ifdef EXAMPLE
NSString *exampleString;
#endif
bool done;
and the output would be:
int i;
NSString *someString;
bool done;
Options I'm considering:
finding index of every #ifdef EXAMPLE and removing it via substring with the next found #endif
Write a regex that can somehow remove these occurrences.
Considering I haven't written Perl before (Objective-C is my primary language) I was curious if any XCode or Perl developers had any suggestions on what the best approach would be
I'm not sure why you want to strip out ifdefs, and you can probably use a C pre-processor to do this, but here's how you'd do it in Perl because it means I get to play with the flip-flop operator.
First thing is to craft a sufficient regex to match the ifdefs. IIRC they can be indented and there can be indentation between the # and the word.
#ifdef
# ifdef
#ifdef
Not sure if that last one is valid, but I'm going with it anyway.
my $ifdef_re = qr{^\s*#\s*ifdef\b};
my $endif_re = qr{^\s*#\s*endif\b};
If it was just removing text between #ifdef and #endif, Perl has the little used flip flop scalar .. operator.
#!/usr/bin/env perl
use strict;
use warnings;
my $ifdef_re = qr{^\s*#\s*ifdef\b};
my $endif_re = qr{^\s*#\s*endif\b};
while(<DATA>) {
my $in_ifdef = /$ifdef_re/ .. /$endif_re/;
print if !$in_ifdef;
}
__DATA__
int i;
NSString *someString;
#ifdef EXAMPLE
NSString *exampleString;
#endif
bool done;
But since we need to worry about nested ifdefs, it's insufficient. A depth counter takes care of that.
#!/usr/bin/env perl
use strict;
use warnings;
my $ifdef_re = qr{^\s*#\s*ifdef\b};
my $endif_re = qr{^\s*#\s*endif\b};
my $ifdef_count = 0;
while(<DATA>) {
$ifdef_count++ if /$ifdef_re/;
print if $ifdef_count <= 0;
$ifdef_count-- if /$endif_re/;
}
__DATA__
int i;
NSString *someString;
#ifdef EXAMPLE
NSString *exampleString;
# ifdef FOO
this should not appear
# endif
nor should this
#endif
bool done;
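For readers who don't use Perl, the depth-counter idea can be sketched in Python as follows. The function name strip_ifdef_blocks is hypothetical, and unlike the Perl script this sketch refuses to let the counter go negative on a stray #endif, which is an assumption about the desired behaviour.

```python
import re

IFDEF_RE = re.compile(r"^\s*#\s*ifdef\b")
ENDIF_RE = re.compile(r"^\s*#\s*endif\b")

def strip_ifdef_blocks(lines):
    depth = 0  # how many #ifdef blocks we are currently inside
    kept = []
    for line in lines:
        if IFDEF_RE.search(line):
            depth += 1
        if depth == 0:
            kept.append(line)
        # Decrement after the check so the #endif line itself is dropped.
        if ENDIF_RE.search(line) and depth > 0:
            depth -= 1
    return kept

sample = [
    "int i;",
    "NSString *someString;",
    "#ifdef EXAMPLE",
    "NSString *exampleString;",
    "# ifdef FOO",
    "this should not appear",
    "# endif",
    "nor should this",
    "#endif",
    "bool done;",
]
print(strip_ifdef_blocks(sample))
# ['int i;', 'NSString *someString;', 'bool done;']
```

As with the Perl version, only #ifdef is counted; #ifndef and #if blocks would need their own patterns.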
I love regexes, but for this problem I wouldn't use a regex. I'd just read line by line, keeping track of whether I was inside an ifdef:
my $nesting = 0;
while (<STDIN>)
{
$nesting += 1 if /^#ifdef/;
print $_ unless $nesting;
$nesting -= 1 if /^#endif/;
}
If you really want to use a regex, and have read the whole file into the variable $source, I think this will work, if you don't need to worry about nesting:
$source =~ s/^#ifdef.*?^#endif.*?$//gms;
The ^ characters anchor those parts of the expression to the beginning of a line. The $ makes the last part of the match only happen at the end of a line.
The .*? behaves almost like .*, which matches zero or more characters, except that it does minimal matching. So instead of matching all the way to the last #endif, it matches to the first one.
The /gms at the end makes it:
Substitute every occurrence, not just one (that's the g)
Make ^ and $ match at line boundaries, not just string boundaries (the m)
Make . match newlines (the s)
You might want to follow every #ifdef and #endif with \s, to only match if there is whitespace following that string.
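A small Python sketch of the same substitution, assuming re.MULTILINE and re.DOTALL as the counterparts of Perl's /m and /s (re.sub already replaces every occurrence, like /g). The second pattern is a variant added for this illustration: it also consumes the newline after the #endif line.

```python
import re

source = (
    "int i;\n"
    "#ifdef EXAMPLE\n"
    "NSString *exampleString;\n"
    "#endif\n"
    "bool done;\n"
)
flags = re.MULTILINE | re.DOTALL

# Direct translation of the Perl substitution: the block is removed, but the
# newline that followed "#endif" stays behind as an empty line.
stripped = re.sub(r"^#ifdef.*?^#endif.*?$", "", source, flags=flags)

# Variant that also swallows the rest of the #endif line and its newline.
stripped_tight = re.sub(r"^#ifdef.*?^#endif[^\n]*\n?", "", source, flags=flags)

print(repr(stripped))        # 'int i;\n\nbool done;\n'
print(repr(stripped_tight))  # 'int i;\nbool done;\n'
```

Neither pattern handles nested blocks, for the reason given above: the minimal match stops at the first #endif.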
I'd just do this with unifdef. XCode installs this by default:
-U will remove #ifdef and matching #else/#endif as if <constant> is undefined.
-D will remove #ifdef and matching #else/#endif as if <constant> is defined.
Here's an example:
$ cat test.h
#ifdef TEST
#ifdef DEBUG
# define AWESOME_DEBUG_LEVEL 1
#else
# define AWESOME_DEBUG_LEVEL 0
#endif
#endif
$ unifdef -U DEBUG test.h
#ifdef TEST
# define AWESOME_DEBUG_LEVEL 0
#endif
$ unifdef -U DEBUG -D TEST test.h
# define AWESOME_DEBUG_LEVEL 0

How to detect the difference between ' as used in an abbreviation and as quotation markers

I'm attempting to parse blocks of text and need a way to detect the difference between apostrophes in different contexts. Possession and abbreviation in one group, quotations in the other.
e.g.
"I'm the cars' owner" -> ["I'm", "the", "cars'", "owner"]
but
"He said 'hello there' " -> ["He","said"," 'hello there' "]
Detecting whitespace on either side won't help as things like " 'ello " and " cars' " would parse as one end of a quotation, same with matching pairs of apostrophes. I'm getting the feeling that there's no way of doing it other than an outrageously complicated NLP solution and I'm just going to have to ignore any apostrophes not occurring mid-word, which would be unfortunate.
EDIT:
Since writing I have realised this is impossible. Any regex-ish based parser would have to parse:
'ello there my mates' dogs
in 2 different ways, and could only do that with understanding of the rest of the sentence. Guess I'm for the inelegant solution of ignoring the least likely case and hoping it's rare enough to only cause infrequent anomalies.
Hm, I'm afraid this won't be easy. Here's a regex that kinda works, alas only for stuff like "I'm" and "I've":
>> s1 =~ /[\w\s]*((?<!I)'(?:[^']+)')[\w\s]*/
=> nil
>> s2 =~ /[\w\s]*((?<!I)'(?:[^']+)')[\w\s]*/
=> 0
>> $1
=> "'hello there'"
If you play around with it a bit more, you may be able to eliminate some other common contractions, which might still be better than nothing.
Some rules to think about:
Quotes will start with an apostrophe with a whitespace character or nothing before it.
Quotes will end with an apostrophe with punctuation or a whitespace character after it.
Some words may look like the end of quotes, e.g., peoples'.
Quote delimiting apostrophes will never have letters directly before and after them.
Use a very simple two-phase process.
In pass 1 of 2, start with this regular expression to break the text down into alternating segments of word and non-word characters.
/(\w+)|(\W+)/gi
Store the matches in a list like this (I'm using AS3-style pseudo-code, since I don't work with ruby):
class MatchedWord
{
var text:String;
var charIndex:int;
var isWord:Boolean;
var isContraction:Boolean = false;
function MatchedWord( text:String, charIndex:int, isWord:Boolean )
{
this.text = text; this.charIndex = charIndex; this.isWord = isWord;
}
}
var match:Object;
var matched_word:MatchedWord;
var matched_words:Vector.<MatchedWord> = new Vector.<MatchedWord>();
var words_regex:RegExp = /(\w+)|(\W+)/gi
words_regex.lastIndex = 0; //this is where to start looking for matches, and is updated to the end of the last match each time exec is called
while ((match = words_regex.exec( original_text )) != null)
matched_words.push( new MatchedWord( match[0], match.index, match[1] != null ) ); //match[0] is the entire match and match[1] is the first parenthetical group (if it's null, then it's not a word and match[2] would be non-null)
In pass 2 of 2, iterate over the list of matches to find contractions by checking to see if each (trimmed, non-word) match ENDS with an apostrophe. If it does, then check the next adjacent (word) match to see if it matches one of only 8 common contraction endings. Despite all the two-part contractions I could think of, there are only 8 common endings.
d
l
ll
m
re
s
t
ve
Once you've identified such a pair of matches (non-word)="'" and (word)="d", then you just include the preceding adjacent (word) match and concatenate the three matches to get your contraction.
Understanding the process just described, one modification you must make is expand that list of contraction endings to include contractions that start with apostrophe, such as "'twas" and "'tis". For those, you simply don't concatenate the preceding adjacent (word) match, and you look at the apostrophe match a little more closely to see if it included other non-word character before it (that's why it's important it ends with an apostrophe). If the trimmed string EQUALS an apostrophe, then merge it with the next match, and if it only ENDS with an apostrophe, then strip off the apostrophe and merge it with the following match. Likewise, conditions that will include the prior match should first check to ensure the (trimmed non-word) match ending with an apostrophe EQUALS an apostrophe, so there are no extra non-word characters included accidentally.
Another modification you may need to make is expand that list of 8 endings to include endings that are whole words such as "g'day" and "g'night". Again, it's a simple modification involving a conditional check of the preceding (word) match. If it's "g", then you include it.
That process should capture the majority of contractions, and is flexible enough to include new ones you can think of.
The data structure would look like this.
Condition(Ending, PreCondition)
where PreCondition is
"*", "!", or "<exact string>"
The final list of conditions would look like this:
new Condition("d","*") //if apostrophe d is found, include the preceding word string and count as successful contraction match
new Condition("l","*");
new Condition("ll","*");
new Condition("m","*");
new Condition("re","*");
new Condition("s","*");
new Condition("t","*");
new Condition("ve","*");
new Condition("twas","!"); //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
new Condition("tis","!");
new Condition("day","g"); //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
new Condition("night","g");
If you just process those conditions as I explained, that should cover all of these 86 contractions (and more):
'tis 'twas ain't aren't can't could've couldn't didn't doesn't don't
everybody's g'day g'night hadn't hasn't haven't he'd he'll he's how'd
how'll how's I'd I'll I'm I've isn't it'd it'll it's let's li'l
might've mightn't mustn't needn't nobody's nothing's shan't she'd
she'll she's should've shouldn't that'd that'll that's there's they'd
they'll they're they've wasn't we'd we'll we're we've weren't what'll
what're what'd what's what've when'd when'll when's where'd where'll
where's who's who'll who're who'd who'll who's who've why'd why'll
why's won't would've wouldn't you'd you'll you're you've
On a side note, don't forget about slang contractions that don't use apostrophes such as "gotta" > "got to" and "gonna" > "going to".
Here is the final AS3 code. Overall, you're looking at less than 50 lines of code to parse the text into alternating word and non-word groups, and identify and merge contractions. Simple. You could even add a Boolean "isContraction" variable to the MatchedWord class and set the flag in the code below when a contraction is identified.
//Automatically merge known contractions
var conditions:Array = [
["d","*"], //if apostrophe d is found, include the preceding word string and count as successful contraction match
["l","*"],
["ll","*"],
["m","*"],
["re","*"],
["s","*"],
["t","*"],
["ve","*"],
["twas","!"], //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
["tis","!"],
["day","g"], //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
["night","g"]
];
for (i = 0; i < matched_words.length - 1; i++) //not a type-o, intentionally stopping at next to last index to avoid a condition check in the loop
{
var m:MatchedWord = matched_words[i];
var apostrophe_text:String = StringUtils.trim( m.text ); //check if this ends with an apostrophe first, then deal more closely with it
if (!m.isWord && StringUtils.endsWith( apostrophe_text, "'" ))
{
var m_next:MatchedWord = matched_words[i + 1]; //no bounds check necessary, since loop intentionally stopped at next to last index
var m_prev:MatchedWord = ((i - 1) >= 0) ? matched_words[i - 1] : null; //bounds check necessary for previous match, since we're starting at beginning, since we may or may not need to look at the prior match depending on the precondition
for each (var condition:Array in conditions)
{
if (StringUtils.trim( m_next.text ) == condition[0])
{
var pre_condition:String = condition[1];
switch (pre_condition)
{
case "*": //success after one final check, include prior match, merge current and next match into prior match and delete current and next match
if (m_prev != null && apostrophe_text == "'") //EQUAL apostrophe, not just ENDS with apostrophe
{
m_prev.text += m.text + m_next.text;
m_prev.isContraction = true;
matched_words.splice( i, 2 );
}
break;
case "!": //success after one final check, do not include prior match, merge current and next match, and delete next match
if (apostrophe_text == "'")
{
m.text += m_next.text;
m.isWord = true; //match now includes word text so flip it to a "word" block for logical consistency
m.isContraction = true;
matched_words.splice( i + 1, 1 );
}
else
{ //strip apostrophe off end and merge with next item, nothing needs deleted
//preserve spaces and match start indexes by manipulating untrimmed strings
var apostrophe_end:int = m.text.lastIndexOf( "'" );
var apostrophe_ending:String = m.text.substring( apostrophe_end, m.text.length );
m.text = m.text.substring( 0, m.text.length - apostrophe_ending.length); //strip apostrophe and any trailing spaces
m_next.text = apostrophe_ending + m_next.text;
m_next.charIndex = m.charIndex + apostrophe_end;
m_next.isContraction = true;
}
break;
default: //conditional success, check prior match meets condition
if (m_prev != null && m_prev.text == pre_condition)
{
m_prev.text += m.text + m_next.text;
m_prev.isContraction = true;
matched_words.splice( i, 2 );
}
break;
}
}
}
}
}
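For anyone not working in AS3, here is a condensed Python sketch of the same two-pass idea. It is a simplification under stated assumptions: tokens are plain strings rather than MatchedWord objects, character indexes are not tracked, and an apostrophe must sit directly against the following word (no trimming). The names CONDITIONS and merge_contractions are made up for the sketch.

```python
import re

# Ending -> precondition: "*" attaches to the preceding word, "!" stands
# alone, anything else must equal the preceding word.
CONDITIONS = {
    "d": "*", "l": "*", "ll": "*", "m": "*", "re": "*", "s": "*",
    "t": "*", "ve": "*", "twas": "!", "tis": "!", "day": "g", "night": "g",
}

def merge_contractions(text):
    # Pass 1: alternating word / non-word segments.
    tokens = re.findall(r"\w+|\W+", text)
    out = []
    i = 0
    # Pass 2: merge apostrophe tokens with known endings.
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        pre = CONDITIONS.get(nxt.lower())
        if tok.endswith("'") and pre == "!":
            # 'tis / 'twas: keep any leading non-word text as its own token.
            if tok[:-1]:
                out.append(tok[:-1])
            out.append("'" + nxt)
            i += 2
        elif tok == "'" and pre is not None and out and (
            pre == "*" or out[-1].lower() == pre
        ):
            out[-1] += tok + nxt  # e.g. I + ' + m -> I'm
            i += 2
        else:
            out.append(tok)
            i += 1
    return out

print(merge_contractions("I'm sure they've said 'tis fine, g'day"))
```

On that input the result contains "I'm", "they've", "'tis" and "g'day" as single tokens, and joining the tokens reproduces the original text. A quoted phrase such as 'hello there' is left untouched because "hello" is not a known ending.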

Best word wrap algorithm? [closed]

Word wrap is one of the must-have features in a modern text editor.
How word wrap be handled? What is the best algorithm for word-wrap?
If text is several million lines, how can I make word-wrap very fast?
Why do I need the solution? Because my projects must draw text with various zoom level and simultaneously beautiful appearance.
The running environment is Windows Mobile devices. The maximum 600 MHz speed with very small memory size.
How should I handle line information? Let's assume original data has three lines.
THIS IS LINE 1.
THIS IS LINE 2.
THIS IS LINE 3.
Afterwards, the break text will be shown like this:
THIS IS
LINE 1.
THIS IS
LINE 2.
THIS IS
LINE 3.
Should I allocate three lines more? Or any other suggestions?
Here is a word-wrap algorithm I've written in C#. It should be fairly easy to translate into other languages (except perhaps for IndexOfAny).
static char[] splitChars = new char[] { ' ', '-', '\t' };
private static string WordWrap(string str, int width)
{
string[] words = Explode(str, splitChars);
int curLineLength = 0;
StringBuilder strBuilder = new StringBuilder();
for(int i = 0; i < words.Length; i += 1)
{
string word = words[i];
// If adding the new word to the current line would be too long,
// then put it on a new line (and split it up if it's too long).
if (curLineLength + word.Length > width)
{
// Only move down to a new line if we have text on the current line.
// Avoids situation where wrapped whitespace causes emptylines in text.
if (curLineLength > 0)
{
strBuilder.Append(Environment.NewLine);
curLineLength = 0;
}
// If the current word is too long to fit on a line even on its own then
// split the word up.
while (word.Length > width)
{
strBuilder.Append(word.Substring(0, width - 1) + "-");
word = word.Substring(width - 1);
strBuilder.Append(Environment.NewLine);
}
// Remove leading whitespace from the word so the new line starts flush to the left.
word = word.TrimStart();
}
strBuilder.Append(word);
curLineLength += word.Length;
}
return strBuilder.ToString();
}
private static string[] Explode(string str, char[] splitChars)
{
List<string> parts = new List<string>();
int startIndex = 0;
while (true)
{
int index = str.IndexOfAny(splitChars, startIndex);
if (index == -1)
{
parts.Add(str.Substring(startIndex));
return parts.ToArray();
}
string word = str.Substring(startIndex, index - startIndex);
char nextChar = str.Substring(index, 1)[0];
// Dashes and the like should stick to the word occurring before them. Whitespace doesn't have to.
if (char.IsWhiteSpace(nextChar))
{
parts.Add(word);
parts.Add(nextChar.ToString());
}
else
{
parts.Add(word + nextChar);
}
startIndex = index + 1;
}
}
It's fairly primitive - it splits on spaces, tabs and dashes. It does make sure that dashes stick to the word before it (so you don't end up with stack\n-overflow) though it doesn't favour moving small hyphenated words to a newline rather than splitting them. It does split up words if they are too long for a line.
It's also fairly culturally specific, as I don't know much about the word-wrapping rules of other cultures.
Donald E. Knuth did a lot of work on the line breaking algorithm in his TeX typesetting system. This is arguably one of the best algorithms for line breaking - "best" in terms of visual appearance of result.
His algorithm avoids the problems of greedy line filling where you can end up with a very dense line followed by a very loose line.
An efficient algorithm can be implemented using dynamic programming.
A paper on TeX's line breaking.
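To make the dynamic-programming idea concrete, here is a deliberately simplified Python sketch. It is not TeX's algorithm (no stretchable glue, hyphenation, or penalties); it only assumes a cost of squared trailing space per line, with the last line free, and picks the break points that minimise the total. The function name is hypothetical.

```python
def wrap_min_raggedness(words, width):
    n = len(words)
    # best[i] = minimal cost of laying out words[i:];
    # nxt[i]  = index of the word that starts the following line.
    best = [float("inf")] * n + [0.0]
    nxt = [n] * (n + 1)
    for i in range(n - 1, -1, -1):
        length = -1  # cancels the space counted before the first word
        for j in range(i, n):
            length += len(words[j]) + 1
            if length > width and j > i:
                break  # words[i..j] no longer fit on one line
            # Assumption: an overlong single word gets its own line, no penalty.
            slack = max(width - length, 0)
            cost = 0 if j == n - 1 else slack ** 2
            if cost + best[j + 1] < best[i]:
                best[i] = cost + best[j + 1]
                nxt[i] = j + 1
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:nxt[i]]))
        i = nxt[i]
    return lines

print(wrap_min_raggedness("aaa bb cc ddddd".split(), 6))
# ['aaa', 'bb cc', 'ddddd']
```

A greedy fill of the same input would produce "aaa bb", "cc", "ddddd", leaving one tight line followed by a very loose one; the dynamic-programming version spreads the slack more evenly.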
I had occasion to write a word wrap function recently, and I want to share what I came up with.
I used a TDD approach almost as strict as the one from the Go example. I started with the test that wrapping the string "Hello, world!" at 80 width should return "Hello, world!". Clearly, the simplest thing that works is to return the input string untouched. Starting from that, I made more and more complex tests and ended up with a recursive solution that (at least for my purposes) quite efficiently handles the task.
Pseudocode for the recursive solution:
Function WordWrap (inputString, width)
Trim the input string of leading and trailing spaces.
If the trimmed string's length is <= the width,
Return the trimmed string.
Else,
Find the index of the last space in the trimmed string, starting at width
If there are no spaces, use the width as the index.
Split the trimmed string into two pieces at the index.
Trim trailing spaces from the portion before the index,
and leading spaces from the portion after the index.
Concatenate and return:
the trimmed portion before the index,
a line break,
and the result of calling WordWrap on the trimmed portion after
the index (with the same width as the original call).
This only wraps at spaces, and if you want to wrap a string that already contains line breaks, you need to split it at the line breaks, send each piece to this function and then reassemble the string. Even so, in VB.NET running on a fast machine, this can handle about 20 MB/second.
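The pseudocode above translates almost line for line into Python. This is a sketch rather than the original VB.NET code; it assumes width is at least 1 and that the input contains no line breaks.

```python
def word_wrap(text, width):
    text = text.strip(" ")
    if len(text) <= width:
        return text
    # Last space at or before index `width`; if there is none,
    # fall back to a hard break at `width`.
    index = text.rfind(" ", 0, width + 1)
    if index == -1:
        index = width
    head = text[:index].rstrip(" ")
    tail = text[index:].lstrip(" ")
    return head + "\n" + word_wrap(tail, width)

print(word_wrap("Hello, world!", 80))   # Hello, world!
print(word_wrap("THIS IS LINE 1.", 7))  # THIS IS / LINE 1. on two lines
```

Because each call handles one output line, very long inputs could hit Python's recursion limit; an iterative loop over the same steps avoids that.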
I don't know of any specific algorithms, but the following could be a rough outline of how it should work:
For the current text size, font, display size, window size, margins, etc., determine how many characters can fit on a line (if fixed-type), or how many pixels can fit on a line (if not fixed-type).
Go through the line character by character, calculating how many characters or pixels have been recorded since the beginning of the line.
When you go over the maximum characters/pixels for the line, move back to the last space/punctuation mark, and move all text to the next line.
Repeat until you go through all text in the document.
In .NET, word wrapping functionality is built into controls like TextBox. I am sure that a similar built-in functionality exists for other languages as well.
With or without hyphenation?
Without it's easy. Just encapsulate your text as wordobjects per word and give them a method getWidth(). Then start at the first word adding up the rowlength until it is greater than the available space. If so, wrap the last word and start counting again for the next row starting with this one, etc.
With hyphenation you need hyphenation rules in a common format like: hy-phen-a-tion
Then it's the same as the above except you need to split the last word which has caused the overflow.
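The no-hyphenation case can be sketched in Python as a greedy fill. The measure parameter plays the role of getWidth(); it defaults to character count here, and a pixel-based function could be passed instead. Function and parameter names are assumptions for the illustration.

```python
def greedy_wrap(words, max_width, measure=len, space_width=1):
    lines, current, used = [], [], 0
    for word in words:
        w = measure(word)
        # Width of the row if this word were appended to it.
        needed = w if not current else used + space_width + w
        if current and needed > max_width:
            # Overflow: close the row and start the next one with this word.
            lines.append(current)
            current, used = [word], w
        else:
            current.append(word)
            used = needed
    if current:
        lines.append(current)
    return [" ".join(line) for line in lines]

print(greedy_wrap("aaa bb cc ddddd".split(), 6))
# ['aaa bb', 'cc', 'ddddd']
```

A word wider than max_width is placed on a row by itself rather than split, since splitting is exactly what the hyphenation rules would be needed for.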
A good example and tutorial of how to structure your code for an excellent text editor is given in the Gang of Four Design Patterns book. It's one of the main samples on which they show the patterns.
I wondered about the same thing for my own editor project. My solution was a two-step process:
Find the line ends and store them in an array.
For very long lines, find suitable break points at roughly 1K intervals and save them in the line array, too. This is to catch the "4 MB text without a single line break".
When you need to display the text, find the lines in question and wrap them on the fly. Remember this information in a cache for quick redraw. When the user scrolls a whole page, flush the cache and repeat.
If you can, do loading/analyzing of the whole text in a background thread. This way, you can already display the first page of text while the rest of the document is still being examined. The most simple solution here is to cut the first 16 KB of text away and run the algorithm on the substring. This is very fast and allows you to render the first page instantly, even if your editor is still loading the text.
You can use a similar approach when the cursor is initially at the end of the text; just read the last 16 KB of text and analyze that. In this case, use two edit buffers and load all but the last 16 KB into the first while the user is locked into the second buffer. And you'll probably want to remember how many lines the text has when you close the editor, so the scroll bar doesn't look weird.
It gets hairy when the user can start the editor with the cursor somewhere in the middle, but ultimately it's only an extension of the end-problem. Only you need to remember the byte position, the current line number, and the total number of lines from the last session, plus you need three edit buffers or you need an edit buffer where you can cut away 16 KB in the middle.
Alternatively, lock the scrollbar and other interface elements while the text is loading; that allows the user to look at the text while it loads completely.
I can't claim this is bug-free, but I needed one that word wrapped and obeyed boundaries of indentation. I claim nothing about this code other than it has worked for me so far. This is an extension method and violates the integrity of the StringBuilder, but it could be made with whatever inputs / outputs you desire.
public static void WordWrap(this StringBuilder sb, int tabSize, int width)
{
string[] lines = sb.ToString().Replace("\r\n", "\n").Split('\n');
sb.Clear();
for (int i = 0; i < lines.Length; ++i)
{
var line = lines[i];
if (line.Length < 1)
sb.AppendLine();//empty lines
else
{
int indent = line.TakeWhile(c => c == '\t').Count(); //tab indents
line = line.Replace("\t", new String(' ', tabSize)); //need to expand tabs here
string lead = new String(' ', indent * tabSize); //create the leading space
do
{
//get the string that fits in the window
string subline = line.Substring(0, Math.Min(line.Length, width));
if (subline.Length < line.Length && subline.Length > 0)
{
//grab the last non white character
int lastword = subline.LastOrDefault() == ' ' ? -1 : subline.LastIndexOf(' ', subline.Length - 1);
if (lastword >= 0)
subline = subline.Substring(0, lastword);
sb.AppendLine(subline);
//next part
line = lead + line.Substring(subline.Length).TrimStart();
}
else
{
sb.AppendLine(subline); //everything fits
break;
}
}
while (true);
}
}
}
Here is mine that I was working on today for fun in C:
Here are my considerations:
No copying of characters, just printing to standard output. Therefore, since I don't like to modify the argv[x] arguments, and because I like a challenge, I wanted to do it without modifying it. I did not go for the idea of inserting '\n'.
I don't want
This line breaks here
to become
This line breaks
here
so changing characters to '\n' is not an option given this objective.
If the linewidth is set at say 80, and the 80th character is in the middle of a word, the entire word must be put on the next line. So as you're scanning, you have to remember the position of the end of the last word that didn't go over 80 characters.
So here is mine, it's not clean; I've been breaking my head for the past hour trying to get it to work, adding something here and there. It works for all edge cases that I know of.
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
int isDelim(char c){
switch(c){
case '\0':
case '\t':
case ' ' :
return 1;
break; /* As a matter of style, put the 'break' anyway even if there is a return above it.*/
default:
return 0;
}
}
void printLine(const char * start, const char * end){
const char * p = start;
while ( p <= end )
putchar(*p++);
putchar('\n');
}
int main ( int argc , char ** argv ) {
if( argc <= 2 )
exit(1);
char * start = argv[1];
char * lastChar = argv[1];
char * current = argv[1];
int wrapLength = atoi(argv[2]);
int chars = 1;
while( *current != '\0' ){
while( chars <= wrapLength ){
while ( !isDelim( *current ) ) ++current, ++chars;
if( chars <= wrapLength){
if(*current == '\0'){
puts(start);
return 0;
}
lastChar = current-1;
current++,chars++;
}
}
if( lastChar == start )
lastChar = current-1;
printLine(start,lastChar);
current = lastChar + 1;
while(isDelim(*current)){
if( *current == '\0')
return 0;
else
++current;
}
start = current;
lastChar = current;
chars = 1;
}
return 0;
}
So basically, I have start and lastChar that I want to set as the start of a line and the last character of a line. When those are set, I output to standard output all the characters from start to end, then output a '\n', and move on to the next line.
Initially everything points to the start, then I skip words with the while(!isDelim(*current)) ++current,++chars;. As I do that, I remember the last character that was before 80 chars (lastChar).
If, at the end of a word, I have passed my number of chars (80), then I get out of the while(chars <= wrapLength) block. I output all the characters between start and lastChar and a newline.
Then I set current to lastChar+1 and skip delimiters (and if that leads me to the end of the string, we're done, return 0). Set start, lastChar and current to the start of the next line.
The
if(*current == '\0'){
puts(start);
return 0;
}
part is for strings that are too short to be wrapped even once. I added this just before writing this post because I tried a short string and it didn't work.
I feel like this might be doable in a more elegant way. If anyone has anything to suggest I'd love to try it.
And as I wrote this I asked myself "what's going to happen if I have a string that is one word that is longer than my wraplength" Well it doesn't work. So I added the
if( lastChar == start )
lastChar = current-1;
before the printLine() statement (if lastChar hasn't moved, then we have a word that is too long for a single line so we just have to put the whole thing on the line anyway).
I took the comments out of the code since I'm writing this but I really feel that there must be a better way of doing this than what I have that wouldn't need comments.
So that's the story of how I wrote this thing. I hope it can be of use to people and I also hope that someone will be unsatisfied with my code and propose a more elegant way of doing it.
It should be noted that it works for all edge cases: words too long for a line, strings that are shorter than one wrapLength, and empty strings.
I may as well chime in with a perl solution that I made, because gnu fold -s was leaving trailing spaces and other bad behavior. This solution does not (properly) handle text containing tabs or backspaces or embedded carriage returns or the like, although it does handle CRLF line-endings, converting them all to just LF. It makes minimal change to the text, in particular it never splits a word (doesn't change wc -w), and for text with no more than single space in a row (and no CR) it doesn't change wc -c (because it replaces space with LF rather than inserting LF).
#!/usr/bin/perl
use strict;
use warnings;
my $WIDTH = 80;
if ($ARGV[0] =~ /^[1-9][0-9]*$/) {
$WIDTH = $ARGV[0];
shift @ARGV;
}
while (<>) {
s/\r\n$/\n/;
chomp;
if (length $_ <= $WIDTH) {
print "$_\n";
next;
}
@_=split /(\s+)/;
# make @_ start with a separator field and end with a content field
unshift @_, "";
push @_, "" if @_%2;
my ($sep,$cont) = splice(@_, 0, 2);
do {
if (length $cont > $WIDTH) {
print "$cont";
($sep,$cont) = splice(@_, 0, 2);
}
elsif (length($sep) + length($cont) > $WIDTH) {
printf "%*s%s", $WIDTH - length $cont, "", $cont;
($sep,$cont) = splice(@_, 0, 2);
}
else {
my $remain = $WIDTH;
{ do {
print "$sep$cont";
$remain -= length $sep;
$remain -= length $cont;
($sep,$cont) = splice(@_, 0, 2) or last;
}
while (length($sep) + length($cont) <= $remain);
}
}
print "\n";
$sep = "";
}
while ($cont);
}
@ICR, thanks for sharing the C# example.
I did not succeed using it, but I came up with another solution. If there is any interest in this, please feel free to use this:
WordWrap function in C#. The source is available on GitHub.
I've included unit tests / samples.
