T4 template whitespace control - t4

Are there any commands to easily control T4 template output whitespace? I'm getting some excessive tabbing. I thought I remembered a way to control template whitespace...

PushIndent, PopIndent, ClearIndent http://msdn.microsoft.com/en-us/library/bb126474.aspx
Do not format your template for readability. Any white space you have outside of control block will end up in the output
for(int i=0; i < 10; i++)
{
#>
Hello
<#
}
will end up as
Hello
Hello
Hello
Hello
Hello
Hello

There's probably no great fix to this, it's a problem with the T4 engine itself, IMO. But if you're trying to reduce leading tabs/spaces in your output while preserving directive nesting you can do the following.
Before
<# for (...) { #>
<# if (...) { #>
SomeText
<# } #>
<# } #>
After
<# for (...) { #>
<# if (...) { #>
SomeText
<# } #>
<# } #>
E.g. start your directives at column 0, indent within the directive itself! In addition to this, you may want to trim extra lines:
private void TrimExtraneousLineBreaksAfterCommentsFromGeneratedFile(ref string fileText)
{
Regex regex = new Regex(#"(//.+?)(?:\r?\n){2,}");
// Replace multiple coniguous line breaks, after a comment, with a single line break.
fileText = regex.Replace(fileText, "\r\n");
}
private void TrimExtraneousLineBreaksFromGeneratedFile(ref string fileText)
{
Regex regex = new Regex(#"\r?\n(?:\s*?\r?\n)+");
// Replace multiple coniguous line breaks with 2 line breaks.
fileText = regex.Replace(fileText, "\r\n\r\n");
// Remove spaces/line breaks from the file.
fileText = fileText.Trim();
}
YMMV

just incase someone is looking adding tabs using writeLine method. The escape character works.
<#
for(int i=0; i < 10; i++)
{
this.WriteLine("\tHello");
}
#>

Related

I need Ruby to parse and edit parts of files with a custom syntax

I have a set of .txt files that show a custom languange, files I want to systematically modify using a Ruby script. The syntax of that language is as follows:
(I will use [some text] as meta variables for expressions, like [atom 1] to indicate an arbitrary atom, and [atom 2] to indicate an arbitrary atom diferent from the former)
Atoms: alphanumerical strings, possibly surrounded by double quotes. Examples:
same_realm
"Ok"
Statements: either
[atom_1] = [atom_2]
or
[atom_1] = { [atom or statement 1] ... [atom or statement n] }
comments: in any line of the text, any character after a # is ignored. Example:
[atom_1] = [atom_2] #This is a comment and will be ignored
If an statement is of the form [atom 1] = {[atom or statement 1] ... [atom or statement n]}, we call [atom 1] the head of the satement and [atom or statement 1] ... [atom or statement n] the body of the statement.
Before and after =, { and } there can be an arbitrary number (possibly 0) of space characters.
Between two consecutive atoms must be at least one space character, but can be any number higher than that.
So, the two expressions any_realm_lord = {...} and any_realm_lord = {...} in the example below are valid, the only syntactical difference between them being the use of any_realm_lord/any_province_lord as the head of each statement.
#Example file
#previous text
any_realm_lord={any_character={limit={same_realm=ROOT}set_character_flag=my_flag}
} any_province_lord={any_character = { #some comment
limit = {#some other comment
same_realm = ROOT} set_character_flag =
my_flag
}
}#more text
Once that's explained, this is what I want to do with ruby (I will use the Example file above to illustrate it)
1) Open a file and locate the statements which are not within the body of other statements
(in the example, I'd want it to locate the any_realm_character = {...} and the any_province_character = {...} statements)
2) Iterate over the statements located in 1) and select the ones whose head matches a certain string. If the match is in a line where there is also other atoms or statements, separate them. From now on, I will refer to the statement whose head matches the string as "the target statement".
(Say the string to match is "any_province_lord". After this step the file will look like this:
#Example file
#previous text
any_realm_lord={any_character={limit={same_realm=ROOT}set_character_flag=my_flag}
}
any_province_lord={any_character = { #some comment
limit = {#some other comment
same_realm = ROOT} set_character_flag =
my_flag
}
}#more text
)
3) create a blank line above the line where the head of the target statement is, and cut and paste there any comment in the lines that the target statement encompass
(
#Example file
#previous text
any_realm_lord={any_character={limit={same_realm=ROOT}set_character_flag=my_flag}
}
#some comment#some other comment
any_province_lord={any_character = {
limit = {
same_realm = ROOT} set_character_flag =
my_flag
}
}#more text
)
4)If the closing bracket of the target statement is in the same line as another atom(s) or statement(s), add a \n after the closing bracket
(
#Example file
#previous text
any_realm_lord={any_character={limit={same_realm=ROOT}set_character_flag=my_flag}
}
#some comment#some other comment
any_province_lord={any_character = {
limit = {
same_realm = ROOT} set_character_flag =
my_flag
}
}
#more text
)
5)Erase the body of the target statement (but not the brackets), and add new content between the brackets that I have already defined, with nice spacing.
(
#Example file
#previous text
any_realm_lord={any_character={limit={same_realm=ROOT}set_character_flag=my_flag}
}
#some comment#some other comment
any_province_lord={
#my predefined content will be here
}
#more text
)
What would then be the best way to do this in terms of efficiency? I need my program to do this to over a thousand of files (each one of an average size of 500Kb). I'm fairly new to Ruby, so I am still figuring out if for these things is best to use read, readlines, or readline in term of efficiency.
What do you think?
I hope I've been clear in the explanation of what I need and not too unnecesarily verbose

How to make a syntax manipulator?

Firstly, sorry for the question, I know I've heard something that could help, but I just can't remember.
Basically I would like to create my own syntax for a programming language. For example this code:
WRITE OUT 'Hello World!'
NEW LINE
would turn into this Java code:
System.out.print("Hello World!");
System.out.println();
How could I achieve this? Is there a method?
Olá.
There are techniques and proper algorithms to do that.
Search for "compiler techniques" and "Interpreter pattern".
An initial approach could be a basic pattern interpreter.
Assuming simple sentences and only one sentence per line, you could read the input file line by line and search for defined patterns (regular expressions).
The patterns describe the structure of the commands in your invented language.
If you get a match then you do the translation.
In particular, we use the regex.h library in c to perform the regular expression search.
Of course regex is also available in java.
Ex. NEW LINE match the pattern " *NEW +LINE *"
The * means that the preceding character occurs 0 or more times.
The + means that the preceding character occurs 1 or more times.
Thus, this pattern can match the command " NEW LINE " with arbitrary spaces between the words.
Ex. WRITE OUT 'Hello World!' match the pattern "WRITE OUT '([[:print:]]*)'"
or if you want to allow spaces " *WRITE +OUT +'([[:print:]]*)' *"
[[:print:]] means: match one printable character (ex. 'a' or 'Z' or '0' or '+')
Thus, [[:print:]]* match a sequence of 0, 1 or more printable characters
If a line of your input file matched the pattern of some command then you can do the translation, but in most cases you will need to retrieve some information before,
ex. the arbitrary text after WRITE OUT. Thats why you need to put parenthesis around [[:print:]]*. That will indicate to the function that perform the search that you want retrieve that particular part of your pattern.
A nice coincidence is that I recently assisted a friend with an college project similar to the problem you want to solve: a translator from c to basic. I reused that code to make an example for you.
I tested the code and it works.
It can translate:
WRITE OUT 'some text'
WRITE OUT variable
NEW LINE
#include <stdio.h>
#include <stdlib.h>
#include <regex.h>
#include <string.h>
#define STR_SHORT 100
#define MATCHES_SIZE 10
/**************************************************************
Returns the string of a match
**************************************************************/
char * GetExp(char *Source, char *Destination, regmatch_t Matches) {
//Source The string that was searched
//Destination Will contains the matched string
//Matches One element of the vector passed to regexec
int Length = Matches.rm_eo - Matches.rm_so;
strncpy(Destination, Source+Matches.rm_so, Length);
Destination[Length]=0;
return Destination;
}
/**************************************************************
MAIN
**************************************************************/
int main(int argc, char *argv[]) {
//Usage
if (argc==1) {
printf("Usage:\n");
printf("interpreter source_file\n");
printf("\n");
printf("Implements a very basic interpreter\n");
return 0;
}
//Open the source file
FILE *SourceFile;
if ( (SourceFile=fopen(argv[1], "r"))==NULL )
return 1;
//This variable is used to get the strings that matched the pattern
//Matches[0] -> the whole string being searched
//Matches[1] -> first parenthetical
//Matches[2] -> second parenthetical
regmatch_t Matches[MATCHES_SIZE];
char MatchedStr[STR_SHORT];
//Regular expression for NEW LINE
regex_t Regex_NewLine;
regcomp(&Regex_NewLine, " *NEW +LINE *", REG_EXTENDED);
//Regular expression for WRITE OUT 'some text'
regex_t Regex_WriteOutStr;
regcomp(&Regex_WriteOutStr, " *WRITE +OUT +'([[:print:]]*)' *", REG_EXTENDED);
//Regular expresion for WRITE OUT variable
regex_t Regex_WriteOutVar;
regcomp(&Regex_WriteOutVar, " *WRITE +OUT +([_[:alpha:]][[:alnum:]]*) *", REG_EXTENDED);
//Regular expression for an empty line'
regex_t Regex_EmptyLine;
regcomp(&Regex_EmptyLine, "^([[:space:]]+)$", REG_EXTENDED);
//Now we read the file line by line
char Buffer[STR_SHORT];
while( fgets(Buffer, STR_SHORT, SourceFile)!=NULL ) {
//printf("%s", Buffer);
//Shorcut for an empty line
if ( regexec(&Regex_EmptyLine, Buffer, MATCHES_SIZE, Matches, 0)==0 ) {
printf("\n");
continue;
}
//NEW LINE
if ( regexec(&Regex_NewLine, Buffer, MATCHES_SIZE, Matches, 0)==0 ) {
printf("System.out.println();\n");
continue;
}
//WRITE OUT 'some text'
if ( regexec(&Regex_WriteOutStr, Buffer, MATCHES_SIZE, Matches, 0)==0 ) {
printf("System.out.print(\"%s\");\n", GetExp(Buffer, MatchedStr, Matches[1]));
continue;
}
//WRITE OUT variable
//Assumes variable is a string variable
if ( regexec(&Regex_WriteOutVar, Buffer, MATCHES_SIZE, Matches, 0)==0 ) {
printf("System.out.print(\"%%s\", %s);\n", GetExp(Buffer, MatchedStr, Matches[1]));
continue;
}
//Unknown command
printf("Unknown command: %s", Buffer);
}
return 0;
}
Proper solution for this question requires the following steps:
Parse the original syntax code and create a syntax tree.
That is commonly done with tools like ANTLR.
Go through the syntax tree and either convert it to Java code, or to a Java syntax tree.
Both of those steps have their own complexity, so it would be better to ask separate questions about specific issues you encounter while implementing them.
Strictly speaking you can skip step 2 and generate Java directly when parsing, but unless your language is very simple renaming of Java concepts, you wouldn't be able to do that easily.

How do I handle C-style comments in Ruby using Parslet?

Taking as a starting point the code example from the Parslet's own creator (available in this link) I need to extend it so as to retrieve all the non-commented text from a file written in a C-like syntax.
The provided example is able to successfully parse C-style comments, treating these areas as regular line spaces. However, this simple example only expects 'a' characters in the non-commented areas of the file such as the input example:
a
// line comment
a a a // line comment
a /* inline comment */ a
/* multiline
comment */
The rule used to detect the non-commented text is simply:
rule(:expression) { (str('a').as(:a) >> spaces).as(:exp) }
Therefore, what I need is to generalize the previous rule to get all the other (non-commented) text from a more generic file such as:
word0
// line comment
word1 // line comment
phrase /* inline comment */ something
/* multiline
comment */
I am new to Parsing Expression Grammars and neither of my previous trials succeeded.
The general idea is that everything is code (aka non-comment) until one of the sequences // or /* appears. You can reflect this with a rule like this:
rule(:code) {
(str('/*').absent? >> str('//').absent? >> any).repeat(1).as(:code)
}
As mentioned in my comment, there is a small problem with strings, though. When a comment occurs inside a string, it obviously is part of the string. If you were to remove comments from your code, you would then alter the meaning of this code. Therefore, we have to let the parser know what a string is, and that any character inside there belongs to it. Another thing are escape sequences. For example the string "foo \" bar /*baz*/", which contains a literal double quote, would actually be parsed as "foo \", followed by some code again. This is of course something that needs to be addressed. I have written a complete parser that handles all of the above cases:
require 'parslet'
class CommentParser < Parslet::Parser
rule(:eof) {
any.absent?
}
rule(:block_comment_text) {
(str('*/').absent? >> any).repeat.as(:comment)
}
rule(:block_comment) {
str('/*') >> block_comment_text >> str('*/')
}
rule(:line_comment_text) {
(str("\n").absent? >> any).repeat.as(:comment)
}
rule(:line_comment) {
str('//') >> line_comment_text >> (str("\n").present? | eof)
}
rule(:string_text) {
(str('"').absent? >> str('\\').maybe >> any).repeat
}
rule(:string) {
str('"') >> string_text >> str('"')
}
rule(:code_without_strings) {
(str('"').absent? >> str('/*').absent? >> str('//').absent? >> any).repeat(1)
}
rule(:code) {
(code_without_strings | string).repeat(1).as(:code)
}
rule(:code_with_comments) {
(code | block_comment | line_comment).repeat
}
root(:code_with_comments)
end
It will parse your input
word0
// line comment
word1 // line comment
phrase /* inline comment */ something
/* multiline
comment */
to this AST
[{:code=>"\n word0\n "#0},
{:comment=>" line comment"#13},
{:code=>"\n word1 "#26},
{:comment=>" line comment"#37},
{:code=>"\n phrase "#50},
{:comment=>" inline comment "#61},
{:code=>" something \n "#79},
{:comment=>" multiline\n comment "#94},
{:code=>"\n"#116}]
To extract everything except the comments you can do:
input = <<-CODE
word0
// line comment
word1 // line comment
phrase /* inline comment */ something
/* multiline
comment */
CODE
ast = CommentParser.new.parse(input)
puts ast.map{|node| node[:code] }.join
which will produce
word0
word1
phrase something
Another way to handle comments is to consider them white space. For example:
rule(:space?) do
space.maybe
end
rule(:space) do
(block_comment | line_comment | whitespace).repeat(1)
end
rule(:whitespace) do
match('/s')
end
rule(:block_comment) do
str('/*') >>
(str('*/').absent >> match('.')).repeat(0) >>
str('*/')
end
rule (:line_comment) do
str('//') >> match('[^\n]') >> str("\n")
end
Then, when you are writing rules with white-space, such as this entirely off-the-cuff and probably wrong rule for C,
rule(:assignment_statement) do
lvalue >> space? >> str('=') >> space? >> rvalue >> str(';')
end
comments get "eaten" by the parser without any fuss. Anywhere white space can or must appear, comments of any kind are allowed, and are treated as white space.
This approach is not as suitable for your exact problem, which is to recognize non-comment text in a C program, but it works very well in a parser which must recognize the full language.

Ruby parslet: parsing multiple lines

I'm looking for a way to match multiple lines Parslet.
The code looks like this:
rule(:line) { (match('$').absent? >> any).repeat >> match('$') }
rule(:lines) { line.repeat }
However, lines will always end up in an infinite loop which is because match('$') will endlessly repeat to match end of string.
Is it possible to match multiple lines that can be empty?
irb(main)> lines.parse($stdin.read)
This
is
a
multiline
string^D
should match successfully. Am I missing something? I also tried (match('$').absent? >> any.maybe).repeat(1) >> match('$') but that doesn't match empty lines.
Regards,
Danyel.
I usually define a rule for end_of_line. This is based on the trick in http://kschiess.github.io/parslet/tricks.html for matching end_of_file.
class MyParser < Parslet::Parser
rule(:cr) { str("\n") }
rule(:eol?) { any.absent? | cr }
rule(:line_body) { (eol?.absent? >> any).repeat(1) }
rule(:line) { cr | line_body >> eol? }
rule(:lines?) { line.repeat (0)}
root(:lines?)
end
puts MyParser.new.parse(""" this is a line
so is this
that was too
This ends""").inspect
Obviously if you want to do more with the parser than you can achieve with String::split("\n") you will replace the line_body with something useful :)
I had a quick go at answering this question and mucked it up. I just though I would explain the mistake I made, and show you how to avoid mistakes of that kind.
Here is my first answer.
rule(:eol) { str('\n') | any.absent? }
rule(:line) { (eol.absent? >> any).repeat >> eol }
rule(:lines) { line.as(:line).repeat }
I didn't follow my usual rules:
Always make repeat count explicit
Any rule that can match zero length strings, should have name ending in a '?'
So lets apply these...
rule(:eol?) { str('\n') | any.absent? }
# as the second option consumes nothing
rule(:line?) { (eol.absent? >> any).repeat(0) >> eol? }
# repeat(0) can consume nothing
rule(:lines?) { line.as(:line?).repeat(0) }
# We have a problem! We have a rule that can consume nothing inside a `repeat`!
Here see why we get an infinite loop. As the input is consumed, you end up with just the end of file, which matches eol? and hence line? (as the line body can be empty). Being inside lines' repeat, it keeps matching without consuming anything and loops forever.
We need to change the line rule so it always consumes something.
rule(:cr) { str('\n') }
rule(:eol?) { cr | any.absent? }
rule(:line_body) { (eol.absent? >> any).repeat(1) }
rule(:line) { cr | line_body >> eol? }
rule(:lines?) { line.as(:line).repeat(0) }
Now line has to match something, either a cr (for empty lines), or at least one character followed by the optional eol?. All repeats have bodies that consume something. We are now golden.
I think you have two, related, problems with your matching:
The pseudo-character match $ does not consume any real characters. You still need to consume the newlines somehow.
Parslet is munging the input in some way, making $ match in places you might not expect. The best result I could get using $ ended up matching each individual character.
Much safer to use \n as the end-of-line character. I got the following to work (I am somewhat of a beginner with Parslet myself, so apologies if it could be clearer):
require 'parslet'
class Lines < Parslet::Parser
rule(:text) { match("[^\n]") }
rule(:line) { ( text.repeat(0) >> match("\n") ) | text.repeat(1) }
rule(:lines) { line.as(:line).repeat }
root :lines
end
s = "This
is
a
multiline
string"
p Lines.new.parse( s )
The rule for the line is complex because of the need to match empty lines and a possible final line without a \n.
You don't have to use the .as(:line) syntax - I just added it to show clearly that the :line rule is matching each line individually, and not simply consuming the whole input.

Best word wrap algorithm? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
Improve this question
Word wrap is one of the must-have features in a modern text editor.
How word wrap be handled? What is the best algorithm for word-wrap?
If text is several million lines, how can I make word-wrap very fast?
Why do I need the solution? Because my projects must draw text with various zoom level and simultaneously beautiful appearance.
The running environment is Windows Mobile devices. The maximum 600 MHz speed with very small memory size.
How should I handle line information? Let's assume original data has three lines.
THIS IS LINE 1.
THIS IS LINE 2.
THIS IS LINE 3.
Afterwards, the break text will be shown like this:
THIS IS
LINE 1.
THIS IS
LINE 2.
THIS IS
LINE 3.
Should I allocate three lines more? Or any other suggestions?
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
Here is a word-wrap algorithm I've written in C#. It should be fairly easy to translate into other languages (except perhaps for IndexOfAny).
static char[] splitChars = new char[] { ' ', '-', '\t' };
private static string WordWrap(string str, int width)
{
string[] words = Explode(str, splitChars);
int curLineLength = 0;
StringBuilder strBuilder = new StringBuilder();
for(int i = 0; i < words.Length; i += 1)
{
string word = words[i];
// If adding the new word to the current line would be too long,
// then put it on a new line (and split it up if it's too long).
if (curLineLength + word.Length > width)
{
// Only move down to a new line if we have text on the current line.
// Avoids situation where wrapped whitespace causes emptylines in text.
if (curLineLength > 0)
{
strBuilder.Append(Environment.NewLine);
curLineLength = 0;
}
// If the current word is too long to fit on a line even on it's own then
// split the word up.
while (word.Length > width)
{
strBuilder.Append(word.Substring(0, width - 1) + "-");
word = word.Substring(width - 1);
strBuilder.Append(Environment.NewLine);
}
// Remove leading whitespace from the word so the new line starts flush to the left.
word = word.TrimStart();
}
strBuilder.Append(word);
curLineLength += word.Length;
}
return strBuilder.ToString();
}
private static string[] Explode(string str, char[] splitChars)
{
List<string> parts = new List<string>();
int startIndex = 0;
while (true)
{
int index = str.IndexOfAny(splitChars, startIndex);
if (index == -1)
{
parts.Add(str.Substring(startIndex));
return parts.ToArray();
}
string word = str.Substring(startIndex, index - startIndex);
char nextChar = str.Substring(index, 1)[0];
// Dashes and the likes should stick to the word occuring before it. Whitespace doesn't have to.
if (char.IsWhiteSpace(nextChar))
{
parts.Add(word);
parts.Add(nextChar.ToString());
}
else
{
parts.Add(word + nextChar);
}
startIndex = index + 1;
}
}
It's fairly primitive - it splits on spaces, tabs and dashes. It does make sure that dashes stick to the word before it (so you don't end up with stack\n-overflow) though it doesn't favour moving small hyphenated words to a newline rather than splitting them. It does split up words if they are too long for a line.
It's also fairly culturally specific, as I don't know much about the word-wrapping rules of other cultures.
Donald E. Knuth did a lot of work on the line breaking algorithm in his TeX typesetting system. This is arguably one of the best algorithms for line breaking - "best" in terms of visual appearance of result.
His algorithm avoids the problems of greedy line filling where you can end up with a very dense line followed by a very loose line.
An efficient algorithm can be implemented using dynamic programming.
A paper on TeX's line breaking.
I had occasion to write a word wrap function recently, and I want to share what I came up with.
I used a TDD approach almost as strict as the one from the Go example. I started with the test that wrapping the string "Hello, world!" at 80 width should return "Hello, World!". Clearly, the simplest thing that works is to return the input string untouched. Starting from that, I made more and more complex tests and ended up with a recursive solution that (at least for my purposes) quite efficiently handles the task.
Pseudocode for the recursive solution:
Function WordWrap (inputString, width)
Trim the input string of leading and trailing spaces.
If the trimmed string's length is <= the width,
Return the trimmed string.
Else,
Find the index of the last space in the trimmed string, starting at width
If there are no spaces, use the width as the index.
Split the trimmed string into two pieces at the index.
Trim trailing spaces from the portion before the index,
and leading spaces from the portion after the index.
Concatenate and return:
the trimmed portion before the index,
a line break,
and the result of calling WordWrap on the trimmed portion after
the index (with the same width as the original call).
This only wraps at spaces, and if you want to wrap a string that already contains line breaks, you need to split it at the line breaks, send each piece to this function and then reassemble the string. Even so, in VB.NET running on a fast machine, this can handle about 20 MB/second.
I don't know of any specific algorithms, but the following could be a rough outline of how it should work:
For the current text size, font, display size, window size, margins, etc., determine how many characters can fit on a line (if fixed-type), or how many pixels can fit on a line (if not fixed-type).
Go through the line character by character, calculating how many characters or pixels have been recorded since the beginning of the line.
When you go over the maximum characters/pixels for the line, move back to the last space/punctuation mark, and move all text to the next line.
Repeat until you go through all text in the document.
In .NET, word wrapping functionality is built into controls like TextBox. I am sure that a similar built-in functionality exists for other languages as well.
With or without hyphenation?
Without it's easy. Just encapsulate your text as wordobjects per word and give them a method getWidth(). Then start at the first word adding up the rowlength until it is greater than the available space. If so, wrap the last word and start counting again for the next row starting with this one, etc.
With hyphenation you need hyphenation rules in a common format like: hy-phen-a-tion
Then it's the same as the above except you need to split the last word which has caused the overflow.
A good example and tutorial of how to structure your code for an excellent text editor is given in the Gang of Four Design Patterns book. It's one of the main samples on which they show the patterns.
I wondered about the same thing for my own editor project. My solution was a two-step process:
Find the line ends and store them in an array.
For very long lines, find suitable break points at roughly 1K intervals and save them in the line array, too. This is to catch the "4 MB text without a single line break".
When you need to display the text, find the lines in question and wrap them on the fly. Remember this information in a cache for quick redraw. When the user scrolls a whole page, flush the cache and repeat.
If you can, do loading/analyzing of the whole text in a background thread. This way, you can already display the first page of text while the rest of the document is still being examined. The most simple solution here is to cut the first 16 KB of text away and run the algorithm on the substring. This is very fast and allows you to render the first page instantly, even if your editor is still loading the text.
You can use a similar approach when the cursor is initially at the end of the text; just read the last 16 KB of text and analyze that. In this case, use two edit buffers and load all but the last 16 KB into the first while the user is locked into the second buffer. And you'll probably want to remember how many lines the text has when you close the editor, so the scroll bar doesn't look weird.
It gets hairy when the user can start the editor with the cursor somewhere in the middle, but ultimately it's only an extension of the end-problem. Only you need to remember the byte position, the current line number, and the total number of lines from the last session, plus you need three edit buffers or you need an edit buffer where you can cut away 16 KB in the middle.
Alternatively, lock the scrollbar and other interface elements while the text is loading; that allows the user to look at the text while it loads completely.
I cant claim the bug-free-ness of this, but I needed one that word wrapped and obeyed boundaries of indentation. I claim nothing about this code other than it has worked for me so far. This is an extension method and violates the integrity of the StringBuilder but it could be made with whatever inputs / outputs you desire.
public static void WordWrap(this StringBuilder sb, int tabSize, int width)
{
string[] lines = sb.ToString().Replace("\r\n", "\n").Split('\n');
sb.Clear();
for (int i = 0; i < lines.Length; ++i)
{
var line = lines[i];
if (line.Length < 1)
sb.AppendLine();//empty lines
else
{
int indent = line.TakeWhile(c => c == '\t').Count(); //tab indents
line = line.Replace("\t", new String(' ', tabSize)); //need to expand tabs here
string lead = new String(' ', indent * tabSize); //create the leading space
do
{
//get the string that fits in the window
string subline = line.Substring(0, Math.Min(line.Length, width));
if (subline.Length < line.Length && subline.Length > 0)
{
//grab the last non white character
int lastword = subline.LastOrDefault() == ' ' ? -1 : subline.LastIndexOf(' ', subline.Length - 1);
if (lastword >= 0)
subline = subline.Substring(0, lastword);
sb.AppendLine(subline);
//next part
line = lead + line.Substring(subline.Length).TrimStart();
}
else
{
sb.AppendLine(subline); //everything fits
break;
}
}
while (true);
}
}
}
Here is mine that I was working on today for fun in C:
Here are my considerations:
No copying of characters, just printing to standard output. Therefore, since I don't like to modify the argv[x] arguments, and because I like a challenge, I wanted to do it without modifying it. I did not go for the idea of inserting '\n'.
I don't want
This line breaks here
to become
This line breaks
here
so changing characters to '\n' is not an option given this objective.
If the linewidth is set at say 80, and the 80th character is in the middle of a word, the entire word must be put on the next line. So as you're scanning, you have to remember the position of the end of the last word that didn't go over 80 characters.
So here is mine, it's not clean; I've been breaking my head for the past hour trying to get it to work, adding something here and there. It works for all edge cases that I know of.
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
int isDelim(char c){
switch(c){
case '\0':
case '\t':
case ' ' :
return 1;
break; /* As a matter of style, put the 'break' anyway even if there is a return above it.*/
default:
return 0;
}
}
int printLine(const char * start, const char * end){
const char * p = start;
while ( p <= end )
putchar(*p++);
putchar('\n');
}
int main ( int argc , char ** argv ) {
if( argc <= 2 )
exit(1);
char * start = argv[1];
char * lastChar = argv[1];
char * current = argv[1];
int wrapLength = atoi(argv[2]);
int chars = 1;
while( *current != '\0' ){
while( chars <= wrapLength ){
while ( !isDelim( *current ) ) ++current, ++chars;
if( chars <= wrapLength){
if(*current == '\0'){
puts(start);
return 0;
}
lastChar = current-1;
current++,chars++;
}
}
if( lastChar == start )
lastChar = current-1;
printLine(start,lastChar);
current = lastChar + 1;
while(isDelim(*current)){
if( *current == '\0')
return 0;
else
++current;
}
start = current;
lastChar = current;
chars = 1;
}
return 0;
}
So basically, I have start and lastChar that I want to set as the start of a line and the last character of a line. When those are set, I output to standard output all the characters from start to end, then output a '\n', and move on to the next line.
Initially everything points to the start, then I skip words with the while(!isDelim(*current)) ++current,++chars;. As I do that, I remember the last character that was before 80 chars (lastChar).
If, at the end of a word, I have passed my number of chars (80), then I get out of the while(chars <= wrapLength) block. I output all the characters between start and lastChar and a newline.
Then I set current to lastChar+1 and skip delimiters (and if that leads me to the end of the string, we're done, return 0). Set start, lastChar and current to the start of the next line.
The
if(*current == '\0'){
puts(start);
return 0;
}
part is for strings that are too short to be wrapped even once. I added this just before writing this post because I tried a short string and it didn't work.
I feel like this might be doable in a more elegant way. If anyone has anything to suggest I'd love to try it.
And as I wrote this I asked myself "what's going to happen if I have a string that is one word that is longer than my wraplength" Well it doesn't work. So I added the
if( lastChar == start )
lastChar = current-1;
before the printLine() statement (if lastChar hasn't moved, then we have a word that is too long for a single line so we just have to put the whole thing on the line anyway).
I took the comments out of the code since I'm writing this but I really feel that there must be a better way of doing this than what I have that wouldn't need comments.
So that's the story of how I wrote this thing. I hope it can be of use to people and I also hope that someone will be unsatisfied with my code and propose a more elegant way of doing it.
It should be noted that it works for all edge cases: words too long for a line, strings that are shorter than one wrapLength, and empty strings.
I may as well chime in with a perl solution that I made, because gnu fold -s was leaving trailing spaces and other bad behavior. This solution does not (properly) handle text containing tabs or backspaces or embedded carriage returns or the like, although it does handle CRLF line-endings, converting them all to just LF. It makes minimal change to the text, in particular it never splits a word (doesn't change wc -w), and for text with no more than single space in a row (and no CR) it doesn't change wc -c (because it replaces space with LF rather than inserting LF).
#!/usr/bin/perl
use strict;
use warnings;
my $WIDTH = 80;
if ($ARGV[0] =~ /^[1-9][0-9]*$/) {
$WIDTH = $ARGV[0];
shift #ARGV;
}
while (<>) {
s/\r\n$/\n/;
chomp;
if (length $_ <= $WIDTH) {
print "$_\n";
next;
}
#_=split /(\s+)/;
# make #_ start with a separator field and end with a content field
unshift #_, "";
push #_, "" if #_%2;
my ($sep,$cont) = splice(#_, 0, 2);
do {
if (length $cont > $WIDTH) {
print "$cont";
($sep,$cont) = splice(#_, 0, 2);
}
elsif (length($sep) + length($cont) > $WIDTH) {
printf "%*s%s", $WIDTH - length $cont, "", $cont;
($sep,$cont) = splice(#_, 0, 2);
}
else {
my $remain = $WIDTH;
{ do {
print "$sep$cont";
$remain -= length $sep;
$remain -= length $cont;
($sep,$cont) = splice(#_, 0, 2) or last;
}
while (length($sep) + length($cont) <= $remain);
}
}
print "\n";
$sep = "";
}
while ($cont);
}
#ICR, thanks for sharing the C# example.
I did not succeed using it, but I came up with another solution. If there is any interest in this, please feel free to use this:
WordWrap function in C#. The source is available on GitHub.
I've included unit tests / samples.

Resources