How can I get an apostrophe at the beginning or the end of the word? This would be necessary for old-style
'Tis
instead of
It's
Or the apostrophe at the end of a word in plural, like
arguments'
Of course I could also just type
arguments&#8217;
but this defeats the purpose of using markdown.
Edit: It does not seem to me that there is a defined inline quotation style with single quote at beginning and end, like
'some sort of quotation'
so it shouldn't be too much of a stretch?
I think the best you can do is to go ahead and specify the single-right-quote symbol as you have done, but you don't have to use the numeric notation (&#8217;). AsciiDoc has a predefined symbol for that ({rsquo}), so it's not quite so ugly.
.Examples of Single-Apostrophe Notation
[width="50%",cols="",options="header"]
|===
|Use this |To get this
|\'italics' |'italics'
|\'\'single-quoted'' (two single apostrophes each) |''single-quoted''
|it's |it's (automatically formatted)
|its' |its' (ugly)
|'tis |'tis (ugly)
|its'\{empty\} |its'{empty} (still ugly)
|\{empty\}'tis |{empty}'tis (still ugly)
|its\{rsquo\} **{nbsp} <- This is what you want** |its{rsquo}
|\{rsquo\}tis **{nbsp} <- And this** |{rsquo}tis
|===
I am writing code to extract some data between (__italic__, --bold--) characters. (Very similar to the SO comment feature.)
I actually wrote the method for that (using a loop and checking characters), but I wondered if I can re-write that method using Regex.
I tried Rubular, but I am not that good at Regex:
This kinda works for italic, but I think it is not a good solution for using all other special chars (like -- and possibly others)
regex: _{2}([^_]*)_{2}
text: __word1__ not_italic __a__ --bolder--
Is it possible to do that with one match call and one regex, or do I have to create a special regex for each special formatting character?
Sure you can. Here's a nifty construct you can use: (__|--)((?:(?!\1).)+)\1
Demo + explanation: http://regex101.com/r/tO4tW1
The content you're after will be in the second backreference every time.
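As a rough illustration, here is a minimal Ruby sketch of that pattern applied to the sample text from the question. The variable names are made up for the example and are not part of the original answer.

```ruby
# Group 1 captures the delimiter ("__" or "--"); group 2 captures everything
# up to the next occurrence of that same delimiter.
formatted = /(__|--)((?:(?!\1).)+)\1/

text = "__word1__ not_italic __a__ --bolder--"

# String#scan returns one [delimiter, content] pair per match.
pairs = text.scan(formatted)
contents = pairs.map { |_delimiter, content| content }

p pairs
p contents
```

Because the closing delimiter is the backreference `\1`, supporting another marker should only require adding it to the first alternation.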
I'm doing some text processing with Ruby.
For some text I'm working with: single quotes should never be outside of double quotes. So, I'd like to craft a RegEx which matches single-quoted strings, but not those enclosed in double quotes already, so I can swap them with a script. Make sense?
Thus, in the following examples, sentences #1, 2, 4, 6 and 8 are OK, while sentences #3, 5, and 7 contain incorrectly nested single quotes, which I'd like to swap:
1. This is a sentence.
2. This is a sentence "with double quotes."
3. This is a sentence 'with single quotes.'
4. This is a sentence "with a 'nested single quote.' Sometimes there are 'more than one.'"
5. This is a sentence 'with a "nested double quote." Sometimes there are "more than one."'
6. This is a sentence "without a double 'closing quote,' which is common in this text.
7. This is a sentence 'without a single "closing quote," common too, unfortunately.
8. I don't want to match apostrophes, however. That won't work.
(bold face indicates the matches I'd like to make with the RegEx, so I can swap quotes.)
The point: I am trying to quote extended passages which already have quotes within them. This requires me to swap their doubles with singles.
Is this possible? I've been trying for hours, and I can't seem to get it. Any help appreciated.
I don't think regular expressions are the way to go for this one. Why not just scan through the text yourself?
(pseudocode)
for each char in text
if char is `"`, then ignore until next `"`
else if char is `'` (and not part of a contraction), then capture until next `'` or `.`
end for
I foresee future issues with this.
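The pseudocode above could be sketched in Ruby roughly as follows. This is only one possible reading of it: the method names are hypothetical, "part of a contraction" is assumed to mean "has a letter on both sides", and a span is captured up to the next non-apostrophe single quote (or the end of the text), without the stop-at-period rule. A trailing possessive such as arguments' would still be mistaken for a quote.

```ruby
# Assumption: a single quote with a letter on both sides is an apostrophe.
def apostrophe?(text, i)
  i > 0 && i + 1 < text.length &&
    text[i - 1].match?(/[A-Za-z]/) && text[i + 1].match?(/[A-Za-z]/)
end

# Returns the single-quoted spans that are not inside double quotes.
def single_quoted_spans(text)
  spans = []
  i = 0
  while i < text.length
    if text[i] == '"'
      # Skip everything up to the matching double quote.
      closing = text.index('"', i + 1)
      break if closing.nil? # unclosed double quote: ignore the rest
      i = closing + 1
    elsif text[i] == "'" && !apostrophe?(text, i)
      # Capture up to the next real single quote, or to the end of the text.
      j = i + 1
      j += 1 until j >= text.length || (text[j] == "'" && !apostrophe?(text, j))
      spans << text[i..j]
      i = j + 1
    else
      i += 1
    end
  end
  spans
end

p single_quoted_spans(%q{This is a sentence 'with single quotes.'})
p single_quoted_spans(%q{This is a sentence "with a 'nested single quote.'"})
```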
This might not be a perfect answer, but you could try using a gsub with something like this for #5:
a=> This is a sentence 'with a "nested double quote." Sometimes there are "more than one."'
a.gsub(/^[A-Z][a-zA-Z\s]*'[a-zA-Z\s]*(".*")[a-zA-Z\s]*'/) { |m| m.tr(%q{'"}, %q{"'}) }
For # 3 use:
a.gsub(/^[A-Z][a-zA-Z\s]*('.*')/) { |m| m.gsub("'", '"') }
etc. for the others.
These are just examples, but hopefully they help. I think you have to be very
careful with this because depending on the data and regex you use, you can get
unexpected results and it may change your data in a way that makes things
worse. Make sure to get some rspec tests written and test with a very large
sample to play around with the best regex to process this with.
Another issue you may have is identifying sentences if they are in paragraphs.
It becomes much more complicated and you may need to use something like NLP to
identify them.
Additionally, you may consider using chr() and ord() in your code.
Good luck!
How can I match a balanced pair of delimiters not escaped by backslash (that is itself not escaped by a backslash) (without the need to consider nesting)? For example with backticks, I tried this, but the escaped backtick is not working as escaped.
regex = /(?!<\\)`(.*?)(?!<\\)`/
"hello `how\` are` you"
# => $1: "how\\"
# expected "how\\` are"
And the regex above does not consider a backslash that is escaped by a backslash and is in front of a backtick, but I would like to.
How does StackOverflow do this?
The purpose of this is not very complicated. I have documentation texts, which include the backtick notation for inline code just like StackOverflow, and I want to display that in an HTML file with the inline code decorated with some span material. There would be no nesting, but escaped backticks or escaped backslashes may appear anywhere.
Lookbehind is the first thing everyone thinks of for this kind of problem, but it's the wrong tool, even in flavors like .NET that support unrestricted lookbehinds. You can hack something up, but it's going to be ugly, even in .NET. Here's a better way:
`[^`\\]*(\\.[^`\\]*)*`
The first part starts from the opening delimiter and gobbles up anything that's not the delimiter or a backslash. If the next character is a backslash, it consumes that and the character following it, whatever it may be. It could be the delimiter character, another backslash, or anything else, it doesn't matter.
It repeats those steps as many times as necessary, and when neither [^`\\] nor \\. can match, the next character must be the closing delimiter. Or the end of the string, but I'm assuming the input is well formed. But if it's not well formed, this regex will fail very quickly. I mention that because of this other approach I see a lot:
`(?:[^`\\]+|\\.)*`
This works fine on well-formed input, but what happens if you remove the last backtick from your sample input?
"hello `how\` are you"
According to RegexBuddy, after encountering the first backtick, this regex performed 9,252 distinct operations (or steps) before it could give up and report failure; mine failed in ten steps.
EDIT To extract just the part inside the delimiters, wrap that part in a capturing group. You'll still have to remove the backslashes manually.
`([^`\\]*(?:\\.[^`\\]*)*)`
I also changed the other group to non-capturing, which I should have done from the start. I don't avoid capturing religiously, but if you are using them to capture stuff, any other groups you use should be non-capturing.
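A minimal Ruby sketch of how the capturing version might be used, including the manual backslash removal mentioned above. The method name `inline_code_spans` is hypothetical, and the sketch assumes every backslash inside a span escapes the character after it.

```ruby
# Returns the contents of each backtick-delimited span, with escapes removed.
def inline_code_spans(text)
  text.scan(/`([^`\\]*(?:\\.[^`\\]*)*)`/).flatten.map do |raw|
    # Drop each escaping backslash and keep the character it protected.
    raw.gsub(/\\(.)/) { Regexp.last_match(1) }
  end
end

p inline_code_spans('hello `how\` are` you')  # escaped backtick inside the span
p inline_code_spans('hello `how\` are you')   # closing backtick missing
```

With the well-formed sample the escaped backtick stays inside the captured span; with the closing backtick removed, no span is returned.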
EDIT I think I've been reading too much into the question. On StackOverflow, if you want to include literal backticks in an inline-code segment or a comment, you use three backticks as the delimiter, not just one. Since there's no need to escape backticks, you can ignore backslashes as well. Your regex could turn out to be as simple as this:
```(.*?)```
Dealing with the possibility of false delimiters, you use the same basic technique:
```([^`]*(?:`(?!``)[^`]*)*)```
Is this what you're after?
By the way, this answer doesn't contradict @nneonneo's comment above. This answer doesn't consider the context in which the match is taking place. Is it in the source code of a program or web page? If it is, did the match occur inside a comment or a string literal? How do I even know the first backtick I found wasn't escaped? Regexes don't know anything about the context in which they operate; that's what parsers are for.
If you don't need nesting, regexes can indeed be a proper tool. Lexers of programming languages, for instance, use regexes to tokenize strings, and strings usually allow their own delimiters as an escaped content. Anything more complicated than that will probably need a full-blown parser though.
The "general formula" is to match an escaped character (\\.) or any character that's valid as content but doesn't need to be escaped ([^{list of invalid chars}]). A "naïve" solution would be joining them with or (|), but for a more efficient variant see @AlanMoore's answer.
The complete example is shown below, in two variants: the first assumes that backslashes should only be used for escaping inside the string, the second assumes that a backslash anywhere in the text escapes the next character.
`((?:\\.|[^`\\])*)`
(?:\\.|[^`\\])*`((?:\\.|[^`\\])*)`
Working examples here and here. However, as @nneonneo commented (and I endorsed), regexes are not meant to do a complete parse, so you'd better keep things simple if you want them to work out right (do you want to find a token in the text, or do you want to delimit it already knowing where it starts? The answer to that question is important to decide which strategy works best for your case).
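To make the difference between the two variants concrete, here is a small Ruby sketch. The sample text and variable names are made up for illustration; the text has an escaped backtick before the real inline-code span.

```ruby
# Variant 1: backslashes only matter between the delimiters.
inside_only = /`((?:\\.|[^`\\])*)`/
# Variant 2: a backslash anywhere in the text escapes the next character.
anywhere = /(?:\\.|[^`\\])*`((?:\\.|[^`\\])*)`/

text = 'a \` b `code` c'

# String#[] with a regex and a group number returns that capture.
first  = text[inside_only, 1]
second = text[anywhere, 1]

p first
p second
```

The first variant treats the escaped backtick as an opening delimiter and captures the text after it, while the second skips over it and captures the intended span.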
In Matt's post about drying up cucumber tests, Aslak suggests the following.
When I have lots of quotes, I prefer this:
Given %{I enter “#{User.first.username}” in “username”}
What is the %{CONTENT} construct called? Will someone mind referencing it in some documentation? I'm not sure how to go about looking it up.
There's also the stuff about %Q. Is that equivalent to just %? What of the curly braces? Can you use square braces? Do they function differently?
Finally, what is the #{<ruby stuff to be evaluated>} construct called? Is there a reference to that in documentation somewhere, too?
None of the other answers actually answer the question.
This is percent sign notation. The percent sign indicates that the next character is a literal delimiter, and you can use any (non alphanumeric) one you want. For example:
%{stuff}
%[stuff]
%?stuff?
etc. This allows you to put double quotes, single quotes etc into the string without escaping:
%{foo='bar with embedded "baz"'}
returns the literal string:
foo='bar with embedded "baz"'
The percent sign can be followed by a letter modifier to determine how the string is interpolated. For example, %Q[ ] is an interpolated String, %q[ ] is a non-interpolated String, %i[ ] is a non-interpolated Array of Symbols etc. So for example:
%i#potato tuna#
returns this array of Symbols:
[:potato, :tuna]
Details are here: Wikibooks
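A short Ruby sketch of the forms described above; the variable names and the `|` delimiter are just illustrative choices.

```ruby
# Any non-alphanumeric delimiter works; quotes inside need no escaping.
plain    = %{foo='bar with embedded "baz"'}
brackets = %[stuff]
pipes    = %|stuff|

# %Q interpolates #{...}; %q leaves it as literal text.
interpolated = %Q[1 + 1 = #{1 + 1}]
literal      = %q[1 + 1 = #{1 + 1}]

# %i builds an array of symbols from whitespace-separated words.
symbols = %i[potato tuna]

puts plain
p brackets, pipes, interpolated, literal, symbols
```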
"Percent literals" is usually a good way to google some information:
http://www.sampierson.com/articles/ruby-percent-literals
http://en.wikibooks.org/wiki/Ruby_Programming/Syntax/Literals#The_.25_Notation
#{} is called "string interpolation".
The #{1+1} is called String Interpolation.
I, and Wikibooks, refer to the % stuff as just "% notation". Reference here. The % notation takes any delimiter, so long as it's non alphanumeric. It can also take modifiers (kind of like how regular expressions take options), one of which, interestingly enough, is whether you'll permit #{}-style string interpolation (this is also enabled by default).
% then does some special stuff to it, giving that notation some distinct, if a bit cryptic to beginners, terseness. For example %w{hello world} returns an array ['hello','world']. %s{hello} returns a symbol :hello.
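As a rough sketch of the points above: a bare % literal interpolates by default, which is what makes the Cucumber step from the question work. The `username` variable here is a stand-in for `User.first.username`, and straight quotes are used in place of the curly ones.

```ruby
username = "alice" # stand-in for User.first.username

# Bare % behaves like %Q: #{...} is interpolated, and the double quotes
# inside the literal need no escaping.
step = %{I enter "#{username}" in "username"}

# %w gives an array of strings, %s gives a symbol.
words  = %w{hello world}
symbol = %s{hello}

puts step
p words, symbol
```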
I know that you can use this to remove blank lines
sed /^$/d
and this to remove comments starting with #
sed /^#/d
but how do you delete all the comments starting with // ?
You just need to "escape" the slashes with the backslash.
/\/\//
The ^ anchor binds the pattern to the front of the line, so your example will only affect comments starting in the first column. You could try adding spaces and tabs in there, too, and then use the alternation operator | to choose between two comment identifiers.
/^[ \t]*(\/\/|#)/
Edit:
If you simply want to remove comments from the file, then you can do something like:
/(\/\/|#).*/
I don't know what the 'd' operator at the end does, but the above expression should match for you modulo having to escape the parentheses or the alternation operator (the '|' character)
Edit 2:
I just realized that using a Mac you may be "shelling" that command and using the system sed. In that case, you could try putting quotation marks around the search pattern so that the shell doesn't do anything crazy to all of your magic characters. :) In this case, 'd' means "delete the pattern space," so just stick a 'd' after the last example I gave and you should be set.
Edit 3:
Oh, I just realized you'll want to beware of comment markers inside quotes (i.e. you don't want to delete from # to end of line if it's in a string!). The regexp becomes quite a bit more complicated in that case, unfortunately, unless you just forgo checking lines with strings for comments. ...but then you'd need to use sed's substitution operation rather than search-and-delete-match. ...and you'd need to put in more escapes, and it becomes madness. I suggest searching for an online sed helper (there are good regex testers out there, maybe there's one for sed?).
Sorry to sort of abandon the project at this point. This "problem" is one that sed can do but it becomes substantially more complex at every stage, as opposed to just whipping up a bit of Python to do it.
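For what it's worth, the "whip up a short script" route might look something like this in Ruby rather than Python. This is only a sketch under a simplifying assumption: it drops lines whose first non-blank characters are // or #, and leaves trailing comments and anything inside strings alone. The method name is hypothetical.

```ruby
# Removes whole-line comments that start with // or # (after optional
# leading spaces or tabs). Comment markers later in a line are left alone.
def strip_comment_lines(source)
  comment_line = %r{\A[ \t]*(?://|#)}
  source.each_line.reject { |line| line.match?(comment_line) }.join
end

sample = "a = 1\n// gone\n  # gone too\nb = \"x // kept\"\n"
puts strip_comment_lines(sample)
```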