Ruby regex for stripping BBCode

I'm trying to remove BBCode from a given string (just using gsub with some regex).
Here's an example string:
The [b]quick[/b] brown [url=http://example.com]fox[/url] jumps over the lazy dog [img=http://example.com/lazy_dog.png]
And what I need that to output is:
The quick brown fox jumps over the lazy dog
So what's a way to do that? I've found various examples of doing this, but none have worked for my use case.
One that I've tried: /\[(\w+)[^w]*?](.*?)\[\/\1]/
But that wouldn't catch the ending [img] tag.

The purpose of this post is to show the disparity in how BBCode is interpreted, which you should take into consideration when stripping BBCode tags while preserving the content.
This will only remove BBCode tags as defined by this page.
It may remove more than what is considered a valid BBCode tag, though. For example, [b ]Bold[/b] is not bolded by this BBCode tester, so by rights those tags should be left alone, but [/b] will be removed by the regex below. It will also remove clearly non-BBCode text such as [/b=something].
Another example is [url=http://example.com/ ][/url] (note the space). This might or might not be OK, depending on the BBCode parser. The regex below ignores the opening tag but removes the closing tag.
/\[\/?(?:b|u|i|s|size|color|center|quote|url|img|ul|ol|list|li|\*|code|table|tr|th|td|youtube|gvideo)(?:=[^\]\s]+)?\]/
The [code] tag is also not treated correctly by the regex, as seen in this demo. The replacement should leave any tags between [code] and [/code] alone.
This BBCode tester allows [b][b][b]Text[/b][/b][/b] to be parsed into bolded Text, but the other one interprets it as [b][b]Text[/b][/b], with the part [b][b]Text bolded and the rest not bolded. If you allow nested tags, then a regex is not a good choice.
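As a minimal sketch of how the regex above could be applied with gsub, assuming the question's sample string (the constant BBCODE_TAG and the helper strip_bbcode are names made up for this illustration, not part of any library):

```ruby
# BBCode tag pattern from the answer above; the constant name is ours.
BBCODE_TAG = /\[\/?(?:b|u|i|s|size|color|center|quote|url|img|ul|ol|list|li|\*|code|table|tr|th|td|youtube|gvideo)(?:=[^\]\s]+)?\]/

# Hypothetical helper: delete every matching tag, then trim whitespace
# left at the edges (for example by a trailing [img=...] tag).
def strip_bbcode(text)
  text.gsub(BBCODE_TAG, "").strip
end

sample = "The [b]quick[/b] brown [url=http://example.com]fox[/url] jumps over the lazy dog [img=http://example.com/lazy_dog.png]"
puts strip_bbcode(sample)
# → The quick brown fox jumps over the lazy dog

# Tags the pattern does not recognise are left in place:
puts strip_bbcode("[b ]Bold[/b]")
# → [b ]Bold
```

This still has the limitations described above: the contents of [code] blocks are not protected, and nesting is not validated.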

Related

Apostrophe at end (or beginning) of word

How can I get an apostrophe at the beginning or the end of the word? This would be necessary for old-style
'Tis
instead of
It's
Or the apostrophe at the end of a word in plural, like
arguments'
Of course I could also just type
arguments&#8217;
but this defeats the purpose of using markdown.
Edit: It does not seem to me that there is a defined inline quotation style with single quote at beginning and end, like
'some sort of quotation'
so it shouldn't be too much of a stretch?

I think the best you can do is to go ahead and specify the single-right-quote symbol as you have done, but you don't have to use the numeric notation (&#8217;). AsciiDoc has a predefined symbol for that ({rsquo}), so it's not quite so ugly.
.Examples of Single-Apostrophe Notation
[width="50%",cols="",options="header"]
|===
|Use this |To get this
|\'italics' |'italics'
|\'\'single-quoted'' (two single apostrophes each) |''single-quoted''
|it's |it's (automatically formatted)
|its' |its' (ugly)
|'tis |'tis (ugly)
|its'\{empty\} |its'{empty} (still ugly)
|\{empty\}'tis |{empty}'tis (still ugly)
|its\{rsquo\} **{nbsp} <- This is what you want** |its{rsquo}
|\{rsquo\}tis **{nbsp} <- And this** |{rsquo}tis
|===

Multi-Line Regex: Find A where B is absent

I have been reading a lot about regex lately and have seen many answers that match one word only where a second word is absent. I have seen many regex examples that search for a given word (or a more complex pattern in its place) and check that another word is missing.
This seems to work very well on a line-by-line basis, but after enabling multi-line mode it still doesn't seem to match properly.
Example: Match an entire file string where the word foo is included, but the word bar is absent from the file. What I have so far is (?m)^(?=.*?(foo))((?!bar).)*$ which is based on the linked example. I have been testing with a Ruby regex tester, but I think it is an open-ended regex question. It seems to match smaller pieces; I would like it to either match or not match the entire string as one big chunk.
In the example above, matches seem to be found on a line-by-line basis. What changes need to be made to the regex so that it applies to the ENTIRE string?
EDIT: I know there are other, more efficient ways to solve this problem that don't involve a regex. I am not looking for a solution by other means; I am asking from a theoretical regex point of view. Regex has a multi-line mode (which looks like it "works") and negative/positive lookarounds that can be combined on a line-by-line basis, so why doesn't combining these two principles yield the expected result?

Sawa's answer can be simplified. All that's needed is a positive lookahead and a negative lookahead; since you're in multiline mode, .* takes care of the rest:
/(?=.*foo)(?!.*bar).*/m
In Ruby, the m flag means that . also matches \n, and .* is greedy, so the whole string will match without the need for anchors.
Update
@Sawa makes a good point for the \A being necessary but not the \Z.
Actually, looking at it again, the positive lookahead seems unnecessary:
/\A(?!.*bar).*foo.*/m
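The difference the \A anchor makes can be checked with a short Ruby sketch. The two sample strings and the constant names below are made up for illustration:

```ruby
UNANCHORED = /(?=.*foo)(?!.*bar).*/m   # first version, no anchor
ANCHORED   = /\A(?!.*bar).*foo.*/m     # updated version

good = "first line\nfoo here\nlast line"   # contains foo, no bar
bad  = "bar first\nfoo later"              # contains both foo and bar

p ANCHORED.match?(good)            # → true
p ANCHORED.match(good)[0] == good  # → true (the whole string is matched)
p ANCHORED.match?(bad)             # → false

# Without \A the engine retries at later positions and finds a match
# that begins just past the "b" of "bar":
p UNANCHORED.match(bad)[0]         # → "ar first\nfoo later"
```

The unanchored pattern only rejects start positions from which bar is still visible ahead, which is why the anchor is needed to test the string as a whole.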

A regex that matches an entire string that does not include foo is:
/\A(?!.*foo.*).*\z/m
and a regex that matches from the beginning of an entire string that includes bar is:
/\A.*bar/m
Since you want to satisfy both of these, take a conjunction of these by putting one of them in a lookahead:
/\A(?=.*bar)(?!.*foo.*).*\z/m
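The same conjunction can be built for arbitrary words. The helper below is a hypothetical sketch (its name and parameters are not from the answer); it uses Regexp.escape so that the words are treated literally:

```ruby
# Hypothetical helper: builds a regex that matches the entire string only
# when `required` occurs somewhere in it and `forbidden` occurs nowhere.
def include_exclude_regex(required, forbidden)
  /\A(?=.*#{Regexp.escape(required)})(?!.*#{Regexp.escape(forbidden)}).*\z/m
end

re = include_exclude_regex("bar", "foo")
p re.match?("line one\nhas bar\nline three")  # → true
p re.match?("has bar\nbut also foo")          # → false
p re.match?("neither word")                   # → false
```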

Regex Markdown Header

I'm trying to create a (Ruby) regular expression which checks for multiple conditions. I use this regex to replace the content of my object. My regex is close to finished, except for two problems I'm facing with regard to Markdown.
First off, headers are giving me trouble. For example, I don't want to replace the word "Hi" with "Hello" if "Hi" is in a header.
Hi John <== # should not change
==================
Text: Hi, how are you? <== # Should be: Hello, how are you? after substitution
Or:
#### Hi Peter <== # should not change
Text: Hi, how are you? <== # Should be: Hello, how are you? after substitution
Question: How can I escape markdown headers within my regex? I've tried negative lookbehind and lookahead assertions, but to no avail.
My second problem should be quite easy, but somehow I'm struggling. If a word is italic (_hi_), I want to find and replace it without changing the underscores. I can find the word with this regex:
\b[_]*hi[_]*\b
Question 2: But if I replace the match, I also replace the underscores. Is there a way to only detect the word itself and replace it, while still using word boundaries?
Code Example
@website.autolinks.all.each do |autolink|
  autolink.name # for example, returns "Iphone5"
  autolink.url  # for example, returns "http://www.apple.com"
  regex = /\b(?<!##\s)(?<![\d.\[])([_]*)#{autolink.name}([_]*)(?![\d'"<\/a>])\b/
  if @permalink.blog_entry.content.match(regex)
    @permalink.blog_entry.content.gsub!(regex, "[#{autolink.name}](#{autolink.url})")
  end
end
Example text
Iphone5
==============
Iphone5 is the best mobile phone there is, even though the people at Samsung probably think, or perhaps only hope that their Samsung Galaxy S3 is better.
#### Samsung Galaxy S3?
Yes, that's the name of the newest Samsung phone.
The Markdown converter eventually turns this into text with HTML tags, but my regex runs before the converter, so the content it sees still uses Markdown syntax.

Regexes work best when they do one clear thing. If you have multiple conditions, your code should usually reflect that by dividing the processing into steps.
In this case, you have two clear steps:
1. Use a simple regex or other logic to skip over the header portion of the message.
2. Once you know you are in the content, use another regex to process the content.
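The two steps could be sketched in Ruby roughly as follows. This is only an illustration under some assumptions: headers are either lines starting with # or lines underlined with = or - on the next line, and the helper name replace_outside_headers is made up here.

```ruby
# Hypothetical helper: replaces `word` with `replacement` in body lines only.
def replace_outside_headers(text, word, replacement)
  lines = text.lines
  underline = /\A(=+|-+)\s*\z/                 # a row of = or - characters
  pattern = /\b#{Regexp.escape(word)}\b/

  lines.each_with_index.map do |line, i|
    next_line = lines[i + 1]
    # Step 1: decide whether this line belongs to a header.
    header = line.start_with?("#") ||                           # "#### Title"
             line.match?(underline) ||                          # the underline row itself
             (!next_line.nil? && next_line.match?(underline))   # title above an underline
    # Step 2: only body lines are processed by the replacement regex.
    header ? line : line.gsub(pattern, replacement)
  end.join
end

text = "Hi John\n=======\nText: Hi, how are you?\n#### Hi Peter\nText: Hi there\n"
puts replace_outside_headers(text, "Hi", "Hello")
```

Keeping header detection separate means the replacement regex itself stays a plain word match with no lookarounds.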

I've found a solution:
regex = /(?<!##\s)(?<![\d.\[a-z])#{autolink.name}(?![\d'"a-z<\/a>])(?!.*\n(==|--))/i
if @permalink.blog_entry.content.match(regex)
  @permalink.blog_entry.content.gsub!(regex, "[\\0](#{autolink.url})")
end

Using Regex to Extract between __italic__ and some other

I am writing code to extract some data between marker characters such as __italic__ and --bold-- (very similar to the SO comment feature).
I actually wrote the method for that (using a loop and checking characters), but I wondered if I can re-write that method using Regex.
I tried Rubular, but I am not that good at Regex:
This kinda works for italic, but I think it is not a good solution for using all other special chars (like -- and possibly others)
regex: _{2}([^_]*)_{2}
text: __word1__ not_italic __a__ --bolder--
Is it possible to do that with one match call and one regex, or do I have to create a special regex for each formatting character?

Sure you can. Here's a nifty construct you can use: (__|--)((?:(?!\1).)+)\1
Demo + explanation: http://regex101.com/r/tO4tW1
The content you're after will be in the second backreference every time.
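In Ruby, one way to try this construct is String#scan, which returns the capture groups of every match. A small sketch using the sample text from the question (the constant name is made up):

```ruby
# Group 1 is the delimiter; group 2 is everything up to the next
# occurrence of that same delimiter.
DELIMITED = /(__|--)((?:(?!\1).)+)\1/

text = "__word1__ not_italic __a__ --bolder--"

# scan yields [delimiter, content] pairs; keep only the content.
contents = text.scan(DELIMITED).map { |_delimiter, content| content }
p contents
# → ["word1", "a", "bolder"]
```

Supporting another marker should only require adding it to the first alternation.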

Problem capturing data inside of a capture that is optional

It's best to start with an example and what I've gotten so far.
Sample Data:
FOO foo#acme.com 5545
<Data><Name>tester</Name><Foo>bar</Foo></Data>
Current regex:
/FOO\s(.{1,20}#[^\s]+)\s.{0,20}\s{1,2}(<Data>.{0,100}<Name>(.{0,20})<\/Name>.{0,100}<\/Data>)?/m
Matches from regex:
foo#acme.com
testerbar
tester
I've wrapped the <Data> section in parentheses followed by a ? because the entire data section may or may not exist. However, the <Name> section is also optional; it may or may not exist. So I tried putting parentheses around <Name> with a question mark as well, but then I don't get the matches:
/FOO\s(.{1,20}#[^\s]+)\s.{0,20}\s{1,2}(<Data>.{0,100}(<Name>(.{0,20})<\/Name>)?.{0,100}<\/Data>)?/m
I've posted my regex and sample data on a regex site to make it easier to test/validate what I'm trying to do: http://www.rubular.com/r/ZhQzlNp1vv
In the <Data> section there is <Name> and even <Foo>. The point is, there may be many different elements in <Data> and I only care about extracting data from some of them. I need to use regex for my particular situation so please don't suggest using some XML parsing library (thanks!).
Thanks in advance.

/FOO\s(\S+#\S+).*?\n(?:.{0,100}(.{0,20})</Name>.{0,100}</Data>)?/m
http://www.rubular.com/r/IhisH7HYJR

To capture an optional group, use a non-capturing group to indicate the optionality inside a capturing group.
i.e.
((?:content)?)
The outer parentheses form the capturing group - if the optional group doesn't match you get an empty string. The (?:...) is the non-capturing group, which allows you to group the content (so it can all be made optional) without capturing it.
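The difference between the two forms can be seen in a short Ruby sketch. The patterns here are simplified from the question and the variable names are only illustrative:

```ruby
without_name = "<Data><Foo>bar</Foo></Data>"
with_name    = "<Data><Name>tester</Name></Data>"

plain   = /<Data>(<Name>.*?<\/Name>)?/      # optional capturing group
wrapped = /<Data>((?:<Name>.*?<\/Name>)?)/  # capturing group around an optional non-capturing group

p plain.match(without_name)[1]    # → nil
p wrapped.match(without_name)[1]  # → ""
p wrapped.match(with_name)[1]     # → "<Name>tester</Name>"
```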
Update:
Whenever you have a complex regex, use free-spacing comment mode (flag=x) to make it readable (and thus far easier to figure out what's going on), like this:
FOO\s(.{1,20}\#[^\s]+)\s.{0,20}\s{1,2}
((?:<Data>
# up to 200 chars, excluding captured tags or end tag (repeated below)
(?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}
# Capture 3:
((?:<Name>.{0,20}<\/Name>)?)
(?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}
# Capture 4:
((?:<Foo>.{0,20}<\/Foo>)?)
(?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}
# Capture 5:
((?:<Bob>.{0,20}<\/Bob>)?)
(?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}
<\/Data>)?)
Which at rubular results in:
1. foo#acme.com
2. <Data><Name>tester</Name><Foo>bar</Foo></Data>
3. <Name>tester</Name>
4. <Foo>bar</Foo>
5.
Annoyingly rubular doesn't seem to provide a multi-line editor when x is turned on, which sucks, and it also doesn't support standard comment syntax, so I had to change those #... to (?#...) which is less readable. Oh well.
If you need the values without the tags, you'll need a separate expression to strip those.
( Or, y'know, use a tool actually designed for the job. ;) )
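In Ruby itself, free-spacing mode is available through the x flag on a regex literal, and ordinary # comments work there. The sketch below is a reduced version of the pattern above, covering only <Name>, followed by a second expression that strips the tags. The constant and helper names are assumptions for illustration, and %r{...} is used so that / does not need escaping.

```ruby
# x: whitespace and # comments are ignored; m: "." also matches newlines.
DATA_PATTERN = %r{
  <Data>
  (?:(?!<Name>|</Data>).){0,200}   # skip text that starts neither a Name element nor the end tag
  ((?:<Name>.{0,20}</Name>)?)      # capture 1: the optional Name element, "" when it is absent
  (?:(?!</Data>).){0,200}          # skip whatever remains inside Data
  </Data>
}mx

# Hypothetical helper: returns the text inside <Name>, or nil when the
# <Data> block or the <Name> element is missing.
def extract_name(input)
  match = DATA_PATTERN.match(input)
  return nil if match.nil? || match[1].empty?
  match[1][%r{<Name>(.*?)</Name>}m, 1]   # separate expression to strip the tags
end

sample = "FOO foo#acme.com 5545\n<Data><Name>tester</Name><Foo>bar</Foo></Data>"
p extract_name(sample)                          # → "tester"
p extract_name("<Data><Foo>bar</Foo></Data>")   # → nil
```

Note that in free-spacing mode an unescaped # starts a comment, so a literal # inside such a pattern has to be written as \#.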
