I'd like to take some user input text and quickly parse it to produce some latex code. At the moment, I'm replacing % with \% and \n with \n\n, but I'm wondering if there are other replacements I should be making to make the conversion from plain text to latex.
I'm not super worried about safety here (can you even write malicious latex code?), as this should only be used by the user to convert their own text into latex, and so they should probably be allowed to used their own latex markup in the pre-converted text, but I'd like to make sure the output doesn't include accidental latex commands if possible. If there's a good library to make such a conversion, I'd take a look.
Apparently, the following characters
\ { } $ ^ _ % ~ # &
are special in LaTeX, so you should make sure to escape them (prefixing with backslash will do for some of them, see Thomas' answer for special cases) or tell your users not to use them unless they deliberately want to use LaTeX commands (or a mix of both, depending on the character).
Some additional pitfalls:
Not every line break in the text might be intended as a new paragraph.
If your users use a language other than English (or Latin), you will need to \usepackage something that deals with the encoding (like utf8) or convert the characters yourself (e.g. ä -> \"a).
As dmckee points out, quotes also need to be treated separately.
EDIT: Since this has become the accepted answer, I also added the points raised in the other answers, so this is now a summary.
As Heinzi said, the following need attention:
\ { } $ ^ _ % ~ # &
Most can be escaped with a backslash, but \ becomes \textbackslash and ~ becomes \textasciitilde.
I think you might want to leave line breaks alone. LaTeX handles these in exactly the same way as many content management systems; many people have come to expect that "double line break" = "paragraph break". Heck, even stackoverflow itself works that way.
(You cannot write malicious LaTeX code; everything that happens inside LaTeX stays inside LaTeX. Unless you explicitly enable write18 when running latex, but it's disabled by default.)
Heinzi has already shown most of the basic characters that need to be escaped, but the hard part here is insuring that the quoting comes out right.
She said "He didn't do it".
needs to be converted to
She said ``He didn't do it''.
which looks easy in this trivial case, but is full of gatcha's that require careful handling. For modest size texts, I generally use a naive substitution generated in sed and diddle the results by hand. Things are both easier and harder if your "plain text" uses curly quotes.
Here "naive quote substitution" means that quotes followed by word characters are replaced by (one or two as appropriate) back ticks, and all others are replaced by (one or two) single-quotes ('). That catches most cases in prose, but you will have to clean up all the triple-quote cases by hand.
Another possible solution is to make all "special" characters into ordinary ones before inserting the user's text. That might avoid many headaches, but might also create new ones...
You can do this by changing the catcode of the character. The TeX Wikibook knows more.
\catcode`\$=12
will turn $ into an ordinary character. However, for some reason some characters don't come out as you'd expect. \ becomes a double open quote, { becomes a dash... and redefining } inside a group ({...}) makes TeX choke entirely.
Long story short: only recommended if you know what you're doing.
Related
I wrote a parser, which recognizes elements of text based on certain pattern.
My program is able to recognize paragraph, chapter etc. The problem is it shouldn't recognize elements, when there's a quote. For example:
Paragraph 1
Something here...
would be proceed as Paragraph.
And:
Paragraph 1
"Paragraph 2"
shouldn't. But as my program is based on regexp patterns, it looks for the word "Paragraph". I'm going line by line and recognize patterns for each line. I don't know how to tell my program: if you see quotes mark, leave text alone without doing anything? My mentor told me to use raise, but I'm not sure how to do it.
OK, so I'm still a bit of a beginner, I don't know if there is a way to direct the regex to ignore things inside quotes, but if I wanted to solve this problem, I would first make a copy of the text to be parsed, run a regex over that and delete everything inside quotes, then run the parser over the remaining text.
A bit kludgy and inelegant I admit, and may have performance issues over a large enough text, but it would get the job done.
See HERE for link to documentation of ruby regex. About a third of the way down it discusses quotes:
/\p{Pi}/ - 'Punctuation: Initial Quote'
/\p{Pf}/ - 'Punctuation: Final Quote'
You may be able to bake that into the regex with the ^ to direct it to ignore items in quotes.
Right single quotation mark (U+2019)
vs.
Apostrophe (U+0027)
What is the difference between these two characters?
I ran into this issue where I use CAtlString to load a string from a resource file, and on some Windows installations, the LoadString fails when trying to load a string that contains U+2019, but it works on some other Windows installations. The U+2019 character appears in strings in my resource file that I copied from Word, and U+0027 appears in stirngs that I hand coded. Why does LoadString (sometimes) choke on this?
What is the difference between these two characters?
Arguable!
Going by the names, one would imagine that the curly ‹’› is only for use as a quotation mark, and that the straight ‹'› is only for use as a real apostrophe, an indicator of omitted letters.
However traditional typesetting practice in English is always to use a curly ‹’› to render an apostrophe. Personally—and I may be alone here—I don't like this. It can make for more ambiguous reading:
“He said, ‘It’s fish ’n’ chips’...”
with the apostrophes being straight it's (marginally) clearer where the quotation ends:
“He said, ‘It's fish 'n' chips’...”
and the apostrophe being ‘straight’ makes more sense to me because its purpose of indicating omitted letters has no inherent directionality, whereas quotation marks are clearly asymmetrical in purpose.
In traditional ASCII, of course, there are no smart quotes, so the apostrophe is always used for both...
on some Windows installations, the LoadString fails when trying to load a string that contains U+2019, but it works on some other Windows installations.
Here you are meeting the horror of the ‘ANSI’ code page. This is a default character encoding that is different across different Windows install locales. So on a machine in the Western region, you get different results when you read a resource to when you read it on a Japanese Windows.
It is highly unfortunate that Windows has varying default code pages instead of using a single global encoding like UTF-8, but it's too late to fix now. If you compile your whole application as a Unicode app (so you'll be using LoadStringW rather than LoadStringA) then you can cope with non-ASCII characters like the smart quotes much better.
If you can't move to a Unicode application you're a bit stuck. You won't be able to handle non-ASCII characters like the smart quotes globally, so stick with ASCII characters like the straight apostrophe ‹'› alone.
The U+2019 character appears in strings in my resource file that I copied from Word
Yes, Word has an annoying AutoCorrect feature that replaces all apostrophes you type with smart quotes. This is especially undesirable when you are dealing with code, where ‹’› will break the program; but it's also wrong even for plain old English, as it's not possible to correctly guess the desired direction of the quote. (It'll get one of the apostrophes in “fish 'n' chips” the wrong way round, for example.)
I suggest turning off the automatic-replace-with-smart-quotes feature. If you want the smart quotes, it's better to type them deliberately. Unfortunately they are inconvenient to type on most keyboard layouts, often requiring obscure Alt+numpad sequences. Personally I use this one to drop them onto Alt+[] keys.
Historically, single-quote and double-quote come in pairs, left (open) and right (close).
For many years the character sets of computers were limited, having a single form of each.
Now, with the advent of Unicode, the full forms are available, but support for them is still limited. Programming languages still use the simple forms, and the full forms can still cause problems.
(Using Ruby 1.8)
I only have a brief understanding of encoding and such...but what I want to know is, in any given script handling any given text-file, is there some universal library or call I need to make to turn non-standard characters into their nearest printable equivalent. I realize there's no "all-in-one" fix, but this is for a English (U.S. gov't) text file, and so I'm wondering if there's something that mitigates what must be a relatively common issue in English text formatting.
For example, in a text file, I have an entry like this:
0-823
That hyphen is just literally a hyphen as I've typed it out. In the file though, it's something that looks like a hyphen (an n-dash?) but when copy and pasting it...for example, into this browser text box, it doesn't show up.
Printing it out via a Ruby script gets this:
08�23
How do I get my script to resolve it into a dash. Or something other than a gremlin?
It's very common to run into hyphen-like characters and dashes, especially in the output of word-processors. Converting them isn't too hard if you know what the byte is that represents the character, but gets to be a pain when you get a document with several different ones. It gets worse as you throw other accented characters into the mix.
Ruby 1.8 doesn't support multibyte and Unicode character sets as well as 1.9+, but you can work around that somewhat by using the Iconv library.
Iconv lets you convert between various character-sets, such as US-ASCII, ISO-8859-1 and WIN-1252. It's smarter than a regex, because it knows how to convert from accented characters, to similarly looking characters, or ignore them if nothing similar exists, allowing your transliteration to degrade gracefully.
I have some example code in an answer to a related question. Also read James Grey's article linked in the answer. It explains the problem and ways to fix it, ending up with recommending Iconv too.
You could whitelist with gsub:
string.gsub(/[^a-zA-Z0-9]/)
Without knowing more information, I can't build the perfect regex for you, but the general idea is to replace anything that's not what you're expecting (anything not a letter or number or expected symbols).
I'm trying to pass string to Win32 program from command line so it will be printed without changes.
Why I have to escape
"AAA <BBB#pobox.com>" as """AAA <BBB#pobox.com>"""
but
"AAA <BBB#pobox.com>", (comma included) as "\"AAA ^<BBB#pobox.com^>\","
I see no consistency in escaping rules for windows command line
P.S. I'm trying to generate a .cmd file
Update:
I'm using simple C program for testing that is compiled with gcc, no additional object files linked. If I replace it with perl, rules remain same.
I'm trying to create a general escaping algorithm. It will generate .cmd file which will call perl with output redirect. Currently I have a problem that if string contains odd number of double quotes which are escaped with backslash, output redirect does not function. Same problem is described in the last comment to http://blogs.msdn.com/b/oldnewthing/archive/2010/09/17/10063629.aspx .
If I use "" as escape for ", it splits on space, so it will result it 2 parameters instead of one. Also "" has some artifacts.
In windows there is no one way of getting a command line and parsing it. Mostly programs have generally been left to deal with that themselves.
There is a recent post by Raymond Chen about the CommandLineToArgvW function which mentions various rules about quoting but they'll only apply if the program uses that particular function. http://blogs.msdn.com/b/oldnewthing/archive/2010/09/17/10063629.aspx
In windows the command line is passed to the program unmolested (i.e. no wildcards expanded) and then the program needs to deal with it. The programming language may provide a convenience which does some default argument parsing, and this might use a standard windows function like CommandLineToArgvW but even so the program could opt to read the unadulterated string itself thereby skipping those standards.
This means you need to figure out the rules for the particular program you are trying to script yourself and then use them.
I've just tried those as parameters into one of my own programs, and both versions (with or without the comma) can be escaped in both ways (using either """ or \" to escape the quotes). The only reason I can see that the < and > need to be escaped with ^ in the second version is that as the command line is seeing them as I/O redirections prior to passing them to the application, due to the different way of escaping the string quotes.
I've written a little program to download images to different folders from the web. I want to create a quick and dirty batch file syntax and was wondering what the best delimiter would be for the different variables.
The variables might include urls, folder paths, filenames and some custom messages.
So are there any characters that cannot be used for the first three? That would be the obvious choice to use as a delimiter. How about the good old comma?
Thanks!
You can use either:
A Control character: Control characters don't appear in files. Tab (\t) is probably the best choice here.
Some combination of characters which is unlikely to occur in your files. For e.g. #s# etc.
Tab is the generally preferred choice though.
Why not just use something that exists already? There are one or two choices, perl, python, ruby, bash, sh, csh, Groovy, ECMAscript, heavens for forbid windows scripting files.
I can't see what you'd gain by writing yet another batch file syntax.
Tabs. And then expand or compress any tabs found in the text.
Choose a delimiter that has the least chance of collision with the names of any variable that you may have (which precludes #, /, : etc). The comma (,) looks good to me (unless your custom message has a few) or < and > (subject to previous condition).
However, you may also need to 'escape' delimiter characters occurring as part of the variables you want to delimit.
This sounds like a really bad idea. There is no need to create yet another (data-representation) language, there are plenty ones which might fit your needs. In addition to Ruby, Perl, etc., you may want to consider YAML.
Designing good syntax for these sort of this is difficult and fraught with peril. Does reinventing the wheel ring a bell?
I would use '|'
It's one of the rarest characters.
How about String.fromCharCode(1) ?