In the RStudio Search/Replace tool, I am trying to replace two consecutive linebreaks by one. However, I cannot get it to find line breaks with any regex I am aware of - even when I introduce them by replacing something with \n, searching for [\n]+ and similar patterns does not result in any matches. Any suggestions would be much appreciated. (I am on Windows.)
How do I write a regular expression to find ellipses in a text file using VBScript? The text will look something like this:
>…………………………………………………………………………………………………………………<
that I want to find, and replace with something else.
I've tried the following as the search pattern to no avail:
">[\133]*<"
">[…]*<"
">[\133]+<"
">[…]+<"
">[\133]{1,}<"
">[…]{1,}<"
">[\x85]+<"
The first one finds the zero case, but not if an ellipsis occurs between the >< characters. Several of these work when using Notepad++ regular expressions. Any help is appreciated.
I think I've found how to do it.
">[\W]{2,}<"
does it in my file, since the ellipses aren't word characters.
In the above context, I can't help but think that a regular expression is a bit overkill, but I had a quick look: \>…+\< will work - it won't capture anything, though, but you could put some parentheses around it if you wanted...
The ellipsis is a character. From what I can see, the ellipsis is character #133 (0x85) in the Windows-1252 code page; ASCII proper stops at 127. The character used in your question, however, registers as #226. Most likely that is because it is U+2026 encoded as UTF-8 (bytes E2 80 A6), and 226 (0xE2) is just the first of those bytes. In any event, assuming it is Chr(133), it should be easy enough to construct a string pattern in VBScript to accomplish the above.
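As a rough illustration of the byte values discussed above, here is a minimal Python sketch (Python rather than VBScript, purely to show the idea). It assumes the character in the file is U+2026 HORIZONTAL ELLIPSIS; the function name and sample string are made up for the example.

```python
import re

# Assumption: the character in the file is U+2026 (horizontal ellipsis).
ELLIPSIS = "\u2026"

# One or more ellipsis characters between > and <.
ELLIPSIS_RUN = re.compile(">" + ELLIPSIS + "+<")


def replace_ellipsis_run(text, replacement):
    """Replace every >......< run (including the angle brackets)."""
    return ELLIPSIS_RUN.sub(replacement, text)


sample = "a>" + ELLIPSIS * 5 + "<b"
result = replace_ellipsis_run(sample, "><")

# The same character has different byte values in different encodings.
utf8_first_byte = ELLIPSIS.encode("utf-8")[0]   # first of three UTF-8 bytes
cp1252_byte = ELLIPSIS.encode("cp1252")[0]      # single Windows-1252 byte
```

If the pattern does not match in a given tool, checking which encoding the file was read with is probably the first thing to try.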
I wrote a parser, which recognizes elements of text based on certain pattern.
My program is able to recognize paragraph, chapter etc. The problem is it shouldn't recognize elements, when there's a quote. For example:
Paragraph 1
Something here...
would be processed as a Paragraph.
And:
Paragraph 1
"Paragraph 2"
shouldn't. But as my program is based on regexp patterns, it looks for the word "Paragraph". I'm going line by line and matching the patterns against each line. I don't know how to tell my program: if you see quotation marks, leave the text alone without doing anything. My mentor told me to use raise, but I'm not sure how to do it.
OK, so I'm still a bit of a beginner, I don't know if there is a way to direct the regex to ignore things inside quotes, but if I wanted to solve this problem, I would first make a copy of the text to be parsed, run a regex over that and delete everything inside quotes, then run the parser over the remaining text.
A bit kludgy and inelegant I admit, and may have performance issues over a large enough text, but it would get the job done.
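A minimal sketch of that strip-then-parse idea, written in Python for illustration. The heading pattern and the function name are hypothetical, and it only handles straight double quotes that open and close on the same line.

```python
import re

# Straight double-quoted spans on a single line (assumption: no nesting,
# no quotes spanning several lines).
QUOTED = re.compile(r'"[^"]*"')

# Hypothetical pattern for a heading such as "Paragraph 1".
HEADING = re.compile(r"^Paragraph \d+")


def find_paragraph_headings(lines):
    """Return lines that look like headings once quoted text is removed."""
    found = []
    for line in lines:
        # Work on a copy with the quoted spans deleted; the original
        # line is left untouched.
        unquoted = QUOTED.sub("", line).strip()
        if HEADING.match(unquoted):
            found.append(unquoted)
    return found
```

With the example from the question, only the unquoted "Paragraph 1" line is reported; the quoted "Paragraph 2" line is empty after stripping and is skipped.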
See the Ruby regex documentation. About a third of the way down it discusses quotes:
/\p{Pi}/ - 'Punctuation: Initial Quote'
/\p{Pf}/ - 'Punctuation: Final Quote'
You may be able to bake that into the regex with the ^ to direct it to ignore items in quotes.
(Using Ruby 1.8)
I only have a brief understanding of encoding and such...but what I want to know is, in any given script handling any given text file, is there some universal library or call I need to make to turn non-standard characters into their nearest printable equivalent? I realize there's no "all-in-one" fix, but this is for an English (U.S. gov't) text file, and so I'm wondering if there's something that mitigates what must be a relatively common issue in English text formatting.
For example, in a text file, I have an entry like this:
0-823
That hyphen is just literally a hyphen as I've typed it out. In the file though, it's something that looks like a hyphen (an en dash?) but when copying and pasting it...for example, into this browser text box, it doesn't show up.
Printing it out via a Ruby script gets this:
08�23
How do I get my script to resolve it into a dash, or at least something other than a gremlin?
It's very common to run into hyphen-like characters and dashes, especially in the output of word-processors. Converting them isn't too hard if you know what the byte is that represents the character, but gets to be a pain when you get a document with several different ones. It gets worse as you throw other accented characters into the mix.
Ruby 1.8 doesn't support multibyte and Unicode character sets as well as 1.9+, but you can work around that somewhat by using the Iconv library.
Iconv lets you convert between various character sets, such as US-ASCII, ISO-8859-1 and Windows-1252. It's smarter than a regex because it knows how to convert accented characters to similar-looking characters, or ignore them if nothing similar exists, allowing your transliteration to degrade gracefully.
I have some example code in an answer to a related question. Also read James Gray's article linked in the answer. It explains the problem and ways to fix it, and ends up recommending Iconv too.
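For comparison, the same "degrade gracefully" idea can be sketched in Python with the standard unicodedata module. This is not Iconv and not the code from the linked answer; the dash table and the function name are assumptions made for the example.

```python
import unicodedata

# Assumed set of hyphen-like code points to fold into a plain hyphen.
DASHES = {
    0x2010: "-",  # hyphen
    0x2011: "-",  # non-breaking hyphen
    0x2012: "-",  # figure dash
    0x2013: "-",  # en dash
    0x2014: "-",  # em dash
    0x2212: "-",  # minus sign
}


def to_ascii(text):
    """Map dashes to '-', strip accents, and drop anything else non-ASCII."""
    text = text.translate(DASHES)
    # NFKD splits accented letters into base letter + combining mark,
    # so the mark can be discarded by the ASCII encode below.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

This only helps once the file has been decoded with the right encoding; if the bytes are read as the wrong character set, the dash is already a gremlin before any of this runs.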
You could whitelist with gsub:
string.gsub(/[^a-zA-Z0-9]/, '')
Without knowing more information, I can't build the perfect regex for you, but the general idea is to replace anything that's not what you're expecting (anything not a letter or number or expected symbols).
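The whitelist idea looks roughly like this in Python. The set of allowed characters (letters, digits, space, hyphen) and the function name are guesses for illustration, not something taken from the question.

```python
import re

# Anything that is NOT a letter, digit, space or hyphen (assumed whitelist).
UNEXPECTED = re.compile(r"[^A-Za-z0-9 \-]")


def keep_expected(text, replacement=""):
    """Replace every character outside the whitelist with `replacement`."""
    return UNEXPECTED.sub(replacement, text)
```

Passing "-" as the replacement turns the stray dash in the example into a hyphen, at the cost of also turning every other unexpected character into one.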
I'd like to take some user input text and quickly parse it to produce some latex code. At the moment, I'm replacing % with \% and \n with \n\n, but I'm wondering if there are other replacements I should be making to make the conversion from plain text to latex.
I'm not super worried about safety here (can you even write malicious latex code?), as this should only be used by the user to convert their own text into latex, and so they should probably be allowed to use their own latex markup in the pre-converted text, but I'd like to make sure the output doesn't include accidental latex commands if possible. If there's a good library to make such a conversion, I'd take a look.
Apparently, the following characters
\ { } $ ^ _ % ~ # &
are special in LaTeX, so you should make sure to escape them (prefixing with backslash will do for some of them, see Thomas' answer for special cases) or tell your users not to use them unless they deliberately want to use LaTeX commands (or a mix of both, depending on the character).
Some additional pitfalls:
Not every line break in the text might be intended as a new paragraph.
If your users use a language other than English, you will need to \usepackage something that deals with the encoding (like inputenc with the utf8 option) or convert the characters yourself (e.g. ä -> \"a).
As dmckee points out, quotes also need to be treated separately.
EDIT: Since this has become the accepted answer, I also added the points raised in the other answers, so this is now a summary.
As Heinzi said, the following need attention:
\ { } $ ^ _ % ~ # &
Most can be escaped with a backslash, but \ becomes \textbackslash and ~ becomes \textasciitilde. (Like \~, \^ is an accent command, so ^ is safer written as \textasciicircum.)
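A minimal Python sketch of such an escaping step, assuming the ten characters listed above are the only ones that matter. The table and function name are made up for the example, and mapping ^ to \textasciicircum is my own choice (since \^ is an accent command), not something stated in the answers.

```python
# Replacement for each LaTeX special character. The trailing {} keeps the
# text commands from swallowing the space that follows them.
LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "{": r"\{",
    "}": r"\}",
    "$": r"\$",
    "^": r"\textasciicircum{}",  # assumption: treat ^ like ~
    "_": r"\_",
    "%": r"\%",
    "~": r"\textasciitilde{}",
    "#": r"\#",
    "&": r"\&",
}


def escape_latex(text):
    """Escape LaTeX specials in one pass over the characters."""
    # A single pass avoids re-escaping the backslashes and braces that
    # earlier replacements introduce.
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in text)
```

Doing the replacements one character at a time, rather than as a chain of string replaces, is what keeps \textbackslash{} from having its own braces escaped afterwards.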
I think you might want to leave line breaks alone. LaTeX handles these in exactly the same way as many content management systems; many people have come to expect that "double line break" = "paragraph break". Heck, even stackoverflow itself works that way.
(You cannot write malicious LaTeX code; everything that happens inside LaTeX stays inside LaTeX. Unless you explicitly enable write18 when running latex, but it's disabled by default.)
Heinzi has already shown most of the basic characters that need to be escaped, but the hard part here is ensuring that the quoting comes out right.
She said "He didn't do it".
needs to be converted to
She said ``He didn't do it''.
which looks easy in this trivial case, but is full of gotchas that require careful handling. For modest-size texts, I generally use a naive substitution generated in sed and diddle the results by hand. Things are both easier and harder if your "plain text" uses curly quotes.
Here "naive quote substitution" means that quotes followed by word characters are replaced by (one or two as appropriate) back ticks, and all others are replaced by (one or two) single-quotes ('). That catches most cases in prose, but you will have to clean up all the triple-quote cases by hand.
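The naive substitution can be sketched in Python as follows (the answer uses sed; this is only an illustration). One assumption goes beyond the description above: a single quote counts as an opening quote only when it is not preceded by a word character, so apostrophes in words like "didn't" are left alone.

```python
import re


def naive_latex_quotes(text):
    """Turn straight quotes into LaTeX-style `` '' and ` ' pairs."""
    # Single quotes first: opening if followed by a word character and
    # not preceded by one (assumption, to spare apostrophes).
    text = re.sub(r"(?<!\w)'(?=\w)", "`", text)
    # Double quotes followed by a word character open a quotation...
    text = re.sub(r'"(?=\w)', "``", text)
    # ...and every remaining double quote closes one.
    text = text.replace('"', "''")
    return text
```

Nested and triple-quote cases still come out wrong and need the hand clean-up mentioned above.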
Another possible solution is to make all "special" characters into ordinary ones before inserting the user's text. That might avoid many headaches, but might also create new ones...
You can do this by changing the catcode of the character. The TeX Wikibook knows more.
\catcode`\$=12
will turn $ into an ordinary character. However, for some reason some characters don't come out as you'd expect. \ becomes a double open quote, { becomes a dash... and redefining } inside a group ({...}) makes TeX choke entirely.
Long story short: only recommended if you know what you're doing.
I've written a little program to download images to different folders from the web. I want to create a quick and dirty batch file syntax and was wondering what the best delimiter would be for the different variables.
The variables might include urls, folder paths, filenames and some custom messages.
So are there any characters that cannot be used for the first three? That would be the obvious choice to use as a delimiter. How about the good old comma?
Thanks!
You can use either:
A control character: control characters don't normally appear in URLs, folder paths or filenames. Tab (\t) is probably the best choice here.
Some combination of characters which is unlikely to occur in your data, e.g. #s#.
Tab is the generally preferred choice though.
Why not just use something that exists already? There are one or two choices: Perl, Python, Ruby, bash, sh, csh, Groovy, ECMAScript or, heaven forbid, Windows scripting files.
I can't see what you'd gain by writing yet another batch file syntax.
Tabs. And then expand or compress any tabs found in the text.
Choose a delimiter that has the least chance of collision with the names of any variable that you may have (which precludes #, /, : etc). The comma (,) looks good to me (unless your custom message has a few) or < and > (subject to previous condition).
However, you may also need to 'escape' delimiter characters occurring as part of the variables you want to delimit.
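To make the escaping point concrete, here is a small Python sketch that uses the standard csv module with a tab delimiter, so any field that happens to contain the delimiter is quoted automatically. The function names and the four-field layout (URL, folder, filename, message) are assumptions for the example.

```python
import csv
import io


def write_jobs(rows):
    """Serialize rows as tab-separated text, quoting fields when needed."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


def read_jobs(text):
    """Parse tab-separated text back into a list of rows."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t")]
```

A row whose message contains a comma or even a tab survives the round trip unchanged, which is the main thing a hand-rolled split on the delimiter would get wrong.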
This sounds like a really bad idea. There is no need to create yet another (data-representation) language; there are plenty of existing ones which might fit your needs. In addition to Ruby, Perl, etc., you may want to consider YAML.
Designing good syntax for this sort of thing is difficult and fraught with peril. Does reinventing the wheel ring a bell?
I would use '|'
It's one of the rarest characters.
How about String.fromCharCode(1) ?