I need to wrap all instances of %{ ... } with <span class='notranslate'>...</span> UNLESS the %{ ... } appears within an HTML tag. For example, this:
"Or %{register_text} for a new account by <a href='%{path}'>clicking here</a>."
needs to become this:
"Or <span class='notranslate'>%{register_text}</span> for a new account by <a href='%{path}'>clicking here</a>."
My current regex doesn't take into account the HTML tag situation:
x.gsub(/[?<!]%\{([a-zA-Z0-9_\-]*)\}[?>!]/i) {|s| "<span class='notranslate'>#{s}</span>"}
so I am wondering how to do this in Ruby with regex.
Any takers?
I am not sure about the input space, so this is the best that I can come up with. I also cleaned up the regex a bit along the way.
/%\{[\w-]+\}(?![^<>]*>)/
For well-formed HTML, it will only match tokens that are outside a tag: the negative lookahead rejects any token that is followed by a > with no < or > in between. If the HTML is malformed, I don't think I'm up to the task of writing the regex.
I also assume that there is no embedded JavaScript in the page, since > and < in JavaScript are not escaped.
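As a rough sketch of how this could be plugged into gsub: the version below uses the lookahead with [^<>]* (any number of non-bracket characters before the closing >). The constant and method names are made up for illustration, and it assumes well-formed HTML with no inline scripts.

```ruby
# Hypothetical names; the pattern skips %{...} tokens that sit inside a tag,
# i.e. tokens followed by ">" with no "<" or ">" in between.
TOKEN_OUTSIDE_TAG = /%\{[\w-]+\}(?![^<>]*>)/

def wrap_tokens(text)
  text.gsub(TOKEN_OUTSIDE_TAG) { |token| "<span class='notranslate'>#{token}</span>" }
end

input = "Or %{register_text} for a new account by <a href='%{path}'>clicking here</a>."
puts wrap_tokens(input)
# => Or <span class='notranslate'>%{register_text}</span> for a new account by <a href='%{path}'>clicking here</a>.
```

The token inside href is left alone because the lookahead sees '> right after it.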
I have set of strings with nested [quote] tags in following format:
[quote name="John"]Some text. [quote name="Piter"]Inner quote.[/quote][/quote]
As you see it is not like ordinary BBCode. So I can't find a suitable regexp for gsub in Ruby to convert them to strings like this:
<blockquote>
<p>Some text.
<blockquote>
<p>Inner quote.</p>
<small>Piter</small>
</blockquote>
</p>
<small>John</small>
</blockquote>
Can anybody please help me with such regexp?
I'm pretty sure that regexes fundamentally can't cope with nesting. What you could do is make it do a minimal match (e.g. only the inner quote levels), replace them, and then repeat as long as you have more matches. Once you've replaced a level it will just be HTML so will not match the regex any more.
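A minimal sketch of that repeat-until-done idea, assuming the tags always look like [quote name="..."] and are properly balanced. The constant and method names are hypothetical, and the output is emitted on one line rather than indented.

```ruby
# Matches only an innermost quote: the body may not contain another
# "[quote " opener or a "[/quote]" closer.
INNERMOST_QUOTE = /\[quote name="([^"]*)"\]((?:(?!\[quote |\[\/quote\]).)*)\[\/quote\]/m

def quotes_to_html(text)
  result = text.dup
  loop do
    # gsub! returns nil once no quote is left, which ends the loop.
    replaced = result.gsub!(INNERMOST_QUOTE) do
      "<blockquote><p>#{$2}</p><small>#{$1}</small></blockquote>"
    end
    break unless replaced
  end
  result
end

input = '[quote name="John"]Some text. [quote name="Piter"]Inner quote.[/quote][/quote]'
puts quotes_to_html(input)
```

Each pass converts the deepest remaining level; the converted HTML no longer matches the pattern, so the next pass picks up the enclosing quote.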
I tested my regex in rubular.com and it works, but when I run the code it behaves differently.
I want to parse whole paragraphs out of some HTML code
Here is my regex
description = ad_page.body.scan(/(?<=<span id="preview-local-desc">).+(?=<\/span>)/m)
Here is some of the HTML source
<span id="preview-local-desc"> I want to pick up everything typed here.
Paragraphs, everything.
</span>
The match begins where I need it to but then it keeps matching all the way to the end of the document.
Aside from the fact that you shouldn't parse HTML with regex, you want non-greedy matching:
/(?<=<span id="preview-local-desc">).+?(?=<\/span>)/m
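To see the difference, here is a small comparison of the greedy and non-greedy patterns. The extra <p> line with a second span is made up for the demonstration; it stands in for the rest of the page.

```ruby
html = <<~HTML
  <span id="preview-local-desc"> I want to pick up everything typed here.
  Paragraphs, everything.
  </span>
  <p>Unrelated text <span>later span</span></p>
HTML

greedy = html[/(?<=<span id="preview-local-desc">).+(?=<\/span>)/m]
lazy   = html[/(?<=<span id="preview-local-desc">).+?(?=<\/span>)/m]

# The greedy version runs on to the last </span> in the document,
# the lazy one stops at the first.
puts greedy.include?("later span") # => true
puts lazy.include?("later span")   # => false
```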
Parsing XML or HTML with a regex is marginally OK for trivial tasks, if you own or control the file's format. If you don't, then a simple change to the file could break your regex.
Using a parser will avoid that problem; I've parsed some horrible XML with Nokogiri and it didn't even notice. After writing an RSS aggregator that was handling 1000+ feeds, I was hooked on using a parser.
require 'nokogiri'
html = '<span id="preview-local-desc"> I want to pick up everything typed here.
Paragraphs, everything.
</span>'
doc = Nokogiri.HTML(html)
doc.at('span').text
# => " I want to pick up everything typed here.\n Paragraphs, everything.\n "
If there are multiple <span> tags you want:
doc.search('span').map(&:text)
# => [" I want to pick up everything typed here.\n Paragraphs, everything.\n "]
If there are multiple <span> tags and you only want this one:
doc.at('span#preview-local-desc').text
# => " I want to pick up everything typed here.\n Paragraphs, everything.\n "
I'm writing a simple Ruby on Rails app. I have a model with a "description" attribute, which is a string.
I'd like to display this string in a view, but have some of the words in the string rendered using a special music font (one of the ones located here), and the rest to use the main font of the website. Problem is, since the description attribute is just a string that is persisted to the database, there's no real way to tell which words should use the special font...
The only way I can think of would be to define my own "escape sequence" or "special characters" that would allow me to indicate to the view whether a word should use the special font.
For example, say I have the following string:
cat dog rabbit elephant
If I wanted "dog" and "elephant" to use the special font, I could then store the string in the database as:
cat ${dog} rabbit ${elephant}
In other words, use "${}" as the custom escape sequence.
And then in my view I would have a helper method to process the string, and generate the appropriate HTML/CSS for words that use the escape sequence. For example, this is the kind of output I would expect it to produce:
<p>cat <span class="music">dog</span> rabbit <span class="music">elephant</span></p>
Does this seem like a reasonable solution? If so, how would I implement the view method to parse the string and the escape characters? I'm guessing some sort of regular expression?
In a way, it's sort of similar to how LaTeX allows you to render mathematical equations. For instance, in LaTeX, to activate the mathematical font for particular characters, you can do:
\mathnormal{some text}
You could just store the html snippet directly in your database.
<p>cat <span class="music">dog</span> rabbit <span class="music">elephant</span></p>
Otherwise, you would have to remake the HTML string every time it was requested.
If that's not an option, you could use a simple regex to replace the ${} with an html snippet.
string = "cat ${dog} rabbit ${elephant}"
string.gsub(/\$\{([^\}]+)\}/, '<span class="music">\1</span>')
# => "cat <span class=\"music\">dog</span> rabbit <span class=\"music\">elephant</span>"
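If the description can contain user-entered text, it may be worth escaping it before adding the spans. The following is only a sketch: music_markup is a hypothetical helper name, and it uses ERB::Util.html_escape from Ruby's standard library for the escaping.

```ruby
require 'erb'

# Hypothetical helper: escape the raw text first, then turn ${word}
# into a span. "$", "{" and "}" are not touched by the escaping step.
def music_markup(description)
  escaped = ERB::Util.html_escape(description)
  escaped.gsub(/\$\{([^}]+)\}/) { "<span class=\"music\">#{$1}</span>" }
end

puts music_markup("cat ${dog} rabbit ${elephant}")
# => cat <span class="music">dog</span> rabbit <span class="music">elephant</span>
```

In a Rails view the returned string would presumably still have to be marked as safe (for example with html_safe) so that the spans are not escaped again on output.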
For some reason I have to have one HTML tag per line. So if the following is the input:
<p><div class="class1 <%= "class3" %>class2">div content</div></p>
Output should be:
<p>
<div class="class1 <%= "class3" %>class2">div content
</div>
</p>
The regular expression should be able to recognize the difference between the erb script tag and HTML tag. Indentation is not needed.
How can this be done through regular expression?
You can replace (?=<[\w/]) with \n. This is a lookahead that matches the position before a < sign that is followed by a letter or a slash. (Another option is (?=<(?!%)).)
This works for your posted code, but fails in quite a few scenarios, notably < in attributes, or < in server-side scripts and JavaScript blocks. If you need anything more complex, you may need a stronger solution, like an erb parser.
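In Ruby, the lookahead replacement could be tried out roughly like this. It is a sketch against the sample line only; the lstrip is an addition to drop the newline that gets inserted before the very first tag.

```ruby
line = %q(<p><div class="class1 <%= "class3" %>class2">div content</div></p>)

# Insert a newline at every position that is followed by "<" plus a word
# character or "/". "<%=" is skipped because "%" is neither.
result = line.gsub(/(?=<[\w\/])/, "\n").lstrip

puts result
# <p>
# <div class="class1 <%= "class3" %>class2">div content
# </div>
# </p>
```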
Replace "(?<!%)>\s*<(?!%)" with ">\n<" and replace "(?<!(\s|^))</" with "\n</".
This makes sure that % is not found either before or after >whitespace<,
then always breaks on </.
I think Kobi's answer is better :)
I have a string like this.
<p class='link'>try</p>bla bla</p>
I want to get only <p class='link'>try</p>
I have tried this.
/<p class='link'>[^<\/p>]+<\/p>/
But it doesn't work.
How can I do this?
Thanks,
If that is your string, and you want the text between those p tags, then this should work...
/<p\sclass='link'>(.*?)<\/p>/
The reason yours is not working is that you put <\/p> inside a negated character class. A character class does not match that sequence literally; it excludes each of those characters individually.
Of course, it is mandatory that I mention there are better tools for parsing HTML fragments (such as an HTML parser).
'/<p[^>]+>([^<]+)<\/p>/'
will get you "try"
It looks like you used this block: [^<\/p>]+ intending to match anything except for </p>. Unfortunately, that's not what it does. A [^...] block matches any single character that is not listed inside, so [^<\/p>]+ matches a run of characters that are none of <, /, p or >. It gets through "try" in your sample, but it stops at the first p or / in the text, so content such as "help" can never be followed by the expected </p> and there is no match.
Alex's solution, using a non-greedy quantifier, is how I tend to approach this sort of problem.
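A quick way to check how the negated character class behaves next to the non-greedy version. The second sample string, with "help" as the content, is made up for the comparison.

```ruby
char_class = /<p class='link'>[^<\/p>]+<\/p>/
non_greedy = /<p class='link'>.*?<\/p>/

sample = "<p class='link'>try</p>bla bla</p>"
with_p = "<p class='link'>help</p>bla bla</p>"

p sample[char_class] # => "<p class='link'>try</p>" ("try" contains none of < / p >)
p with_p[char_class] # => nil (the run stops at the "p" in "help")
p with_p[non_greedy] # => "<p class='link'>help</p>"
```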
I tried to make one less specific to any particular tag.
(<[^/]+?\s+[^>]*>[^>]*>)
this returns:
<p class='link'>try</p>