I'm trying to write a Ruby script to turn a small markup language I wrote into HTML, but I can't figure out how to parse links. It's basically a trimmed-down version of BBCode, so for example, if someone enters [i]{text}[/i], I use "[i]{text}[/i]".gsub('[i]','<i>').gsub('[/i]','</i>'). How would I turn [url=website.com]site[/url] into <a href="website.com">site</a>? I'm not using a premade BBCode parser because there are a few tags that are different, and I don't want people to use some of the tags such as [img][/img].
Very naïvely:
s.gsub(/\[url=(.*?)\](.*?)\[\/url\]/) { "<a href='#{$1}'>#{$2}</a>" }
Note that HTML injection would be quite easy with this, because $1 and $2 are inserted into the output unescaped. The usual advice (write a proper parser) still applies to what you're doing.
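One way to reduce the injection risk mentioned above is to escape both captures before building the tag. This is only a sketch: the method name bbcode_links_to_html is made up for illustration, and it assumes CGI.escapeHTML from Ruby's standard library is acceptable for your output. It does not validate the URL scheme, so something like javascript: links would still need separate handling.

```ruby
require 'cgi'

# Hypothetical helper (not from the original answer): converts
# [url=...]...[/url] into an anchor tag, escaping both captured parts.
def bbcode_links_to_html(text)
  text.gsub(%r{\[url=(.*?)\](.*?)\[/url\]}m) do
    href  = CGI.escapeHTML(Regexp.last_match(1))
    label = CGI.escapeHTML(Regexp.last_match(2))
    %(<a href="#{href}">#{label}</a>)
  end
end

puts bbcode_links_to_html('[url=website.com]site[/url]')
# prints <a href="website.com">site</a>
```

Using the block form of gsub keeps the escaping next to the substitution, so nothing captured from user input reaches the output unescaped.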
I agree with kch about using a regular expression, but if you want to wrap your head around it using gsub() like you've been doing:
s = "[url=website.com]site[/url]"
s2 = s.gsub('[url=','<a href="').gsub('[/url]','</a>').gsub(']','">')
I want to remove everything contained within two HTML tags, as well as the tags themselves, using regular expressions in Ruby. Here's an example:
<tag>a bunch of stuff between the tags, no matter what it is</tag>
Basically, I want to use gsub! to filter all instances of this type out, like so:
text_file_contents.gsub!(/appropriate regex/, '')
What would be a good Ruby regular expression for doing so?
As has been said in the comments, use an HTML parser. If, however, you just want to remove everything between two tags and don't care about nesting (e.g. if you have <tag><tag></tag></tag>), then you can simply use:
text_file_contents.gsub!(/<tag>.*?<\/tag>/, '')
But again, this is flaky. Nokogiri is really easy to use and will be a lot more stable; please use that.
require 'nokogiri'
doc = Nokogiri::XML(yourfile)
doc.search('//tag').each do |node|
node.remove
end
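If you do stay with the regex for a quick job, one detail worth knowing is that . does not match newlines by default in Ruby, so a tag whose content spans lines is skipped unless the m flag is added. A small sketch with made-up sample text:

```ruby
# Sample text is invented for illustration; the first <tag> spans two lines.
text = "keep <tag>drop\nthis</tag> and <tag>that</tag> too"

# Without /m the first tag survives, because . stops at the newline.
without_m = text.gsub(%r{<tag>.*?</tag>}, '')

# With /m, . also matches newlines, so both tags are removed.
cleaned = text.gsub(%r{<tag>.*?</tag>}m, '')

puts cleaned
# prints: keep  and  too
```

This still ignores nesting, so it shares the limitations described above.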
I have a JSON string that looks like {\"heading\":\"Test\",\"id\":1} and I want to wipe the ID data from the string.
I've tried test.gsub(/\,\\"id\\"\:d+/, '') but that's not working.
How best to achieve this?
Sergio's JSON.parse is something you should consider. But barring that, those \'s you are seeing probably aren't really part of the string. That's just how irb is displaying it.
So test.gsub(/,"id":\d+/, '') should be what you want. (Also fixed a few other small bugs in the regex).
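For comparison, here is roughly what the JSON.parse route could look like. It assumes the string really is the JSON shown in the question (without the backslashes) and uses only the json library that ships with Ruby.

```ruby
require 'json'

# The string from the question, as it actually exists in memory.
test = '{"heading":"Test","id":1}'

data = JSON.parse(test)   # => {"heading"=>"Test", "id"=>1}
data.delete('id')         # drop the key regardless of where it appears

result = JSON.generate(data)
puts result
# prints {"heading":"Test"}
```

Unlike the regex, this keeps working if "id" is the first key or if its value is not a plain integer.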
When I do
puts /<title>(.*?)<\/title>/.match(html)
I get
<h2>foobar</h2>
But I want just
foobar
What's the most elegant method for doing so?
The most elegant way would be to parse HTML with an HTML parser:
require 'nokogiri'
html = '<title><h2>Pancakes</h2></title>'
doc = Nokogiri::HTML(html)
title = doc.at('title').text
# title is now 'Pancakes'
If you try to do this with a regular expression, you will probably fail. For example, if you have an <h2> in your <title> what's to prevent you from having something like this:
<title><strong>Where</strong> is <span>pancakes</span> <em>house?</em></title>
Trying to handle something like that with a single regex is going to be ugly but doc.at('title').text handles that as easily as it handles <title>Pancakes</title> or <title><h2>Pancakes</h2></title>.
Regular expressions are great tools but they shouldn't be the only tool in your toolbox.
Something of this style will return just the contents of the match.
html[/<title>(.*?)<\/title>/,1]
Maybe you need to tell us more, like what html might contain, but right now, you are capturing the contents of the title block, irrespective of the internal tags. I think that is the way you should do it, rather than assuming that there is an internal tag you want to handle, especially because what would happen if you had two internal tags? This is why everyone is telling you to use an html parser, which you really should do.
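To make the difference concrete, here is a short sketch of what the string-index form returns with and without the capture index, using the HTML from the question. The final tag-stripping step is a naive assumption of mine, not something the answers recommend, and it will not cope with everything real HTML can contain.

```ruby
html = '<title><h2>foobar</h2></title>'

# No index: the whole match, including the <title> tags.
whole_match = html[%r{<title>(.*?)</title>}m]

# Index 1: only what the first capture group matched.
inner = html[%r{<title>(.*?)</title>}m, 1]

# Naive cleanup: remove anything that looks like a tag.
plain = inner.gsub(/<[^>]*>/, '').strip

puts inner   # prints <h2>foobar</h2>
puts plain   # prints foobar
```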
I have a very large XML file which I load as a string.
My XML looks like:
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
<article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>
I want to count the number of occurrences of the string
article ID="5705641" contentstatus="Changed"
How can I convert the ID to a regex?
Here is what I have tried doing
searchstr = 'article ID=\"/[1-9]{7}/\" contentstatus=\"Changed\"'
count = ((xml.scan(searchstr).length)).to_s
puts count
Please let me know how can I achieve this?
Thanks
I'm going to go out on a limb and guess that you're new to Ruby. First, it's not necessary to convert count into a string to puts it; puts automatically calls to_s on anything you send to it.
Second, it's rarely a good idea to handle XML with string manipulation. I would strongly advise that you use a full-fledged XML parser such as Nokogiri.
That said, you can't embed a regex in a string like that. The entire query string would need to be a regex.
Something like
/article ID="\d{7}" contentstatus="Changed"/
Quotation marks aren't special characters in a regex, so you don't need to escape them. Note the \d rather than [1-9]: an ID such as 5705641 contains a 0, which [1-9] would not match.
When in doubt about regex in Ruby, I recommend checking out Rubular.com.
And once again, I can't emphasize enough that I really don't condone trying to manipulate XML via regex. Nokogiri will make dealing with XML a billion times easier and more reliable.
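If you do go the regex route anyway, a minimal sketch of the counting step might look like this. The XML snippet is trimmed sample data I made up (the last ID is the one from the question), and the variable names are only illustrative.

```ruby
# Trimmed, invented sample; only the attributes that matter are kept.
xml = <<~XML
  <volume>
    <article ID="5756261" contentstatus="Changed"/>
    <article ID="5756262" contentstatus="Unchanged"/>
    <article ID="5705641" contentstatus="Changed"/>
  </volume>
XML

# \d matches any digit, including 0.
pattern = /article ID="\d{7}" contentstatus="Changed"/
count = xml.scan(pattern).length

# For comparison: [1-9] skips the ID that contains a 0.
strict_count = xml.scan(/article ID="[1-9]{7}" contentstatus="Changed"/).length

puts count         # prints 2
puts strict_count  # prints 1
```

The regex still depends on the exact attribute order and spacing, which is one reason a parser is the safer choice.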
If XPath is an option, it is a preferred way of selecting XML elements. You can use the selector:
//article[@contentstatus="Changed"]
Or, if possible:
count(//article[@contentstatus="Changed"])
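As a rough illustration of the first selector in Ruby, here is a sketch using REXML instead of Nokogiri. That is my substitution, on the assumption that REXML is available (it ships with Ruby, as a bundled gem in recent versions). The XML is a trimmed, well-formed sample rather than the full file from the question.

```ruby
require 'rexml/document'

# Trimmed sample; REXML needs well-formed XML, so the root is closed here.
xml = <<~XML
  <publication ID="7728">
    <volume>
      <article ID="5756261" contentstatus="Changed"/>
      <article ID="5756262" contentstatus="Unchanged"/>
      <article ID="5756263" contentstatus="Changed"/>
    </volume>
  </publication>
XML

doc = REXML::Document.new(xml)

# Select every <article> whose contentstatus attribute is "Changed".
changed = REXML::XPath.match(doc, '//article[@contentstatus="Changed"]')

puts "#{changed.size} found"
changed.each { |node| puts node.attributes['ID'] }
```

Because the selection is done on parsed elements, attribute order and whitespace in the source file no longer matter.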
Nokogiri is my recommended Ruby XML parser. It's very robust, and is probably the standard for the language now.
I added two more "articles" to show how easily you can find and manipulate the contents, without having to rely on a regex.
require 'nokogiri'
xml =<<EOT
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
<article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
<article ID="5756262" contentstatus="Unchanged" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
<article ID="5756263" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>
EOT
doc = Nokogiri::XML(xml)
puts doc.search('//article[@contentstatus="Changed"]').size.to_s + ' found'
puts doc.search('//article[@contentstatus="Changed"]').map{ |n| "#{ n['ID'] } #{ n['doi'] } #{ n['idID'] }" }
>> 2 found
>> 5756261 10.1109/TNB.2011.2145270 0b0000648151d8ca
>> 5756263 10.1109/TNB.2011.2145270 0b0000648151d8ca
The problem with using regex on HTML or XML is that the pattern breaks really easily if the XML changes, or if your XML comes from different sources or is malformed. Regex was never designed to handle that sort of problem, but a parser was. You could have XML with line ends after every tag, or none at all, and the parser won't really care as long as the XML is well-formed. A good parser like Nokogiri can even do fixups if the XML is broken, in order to try to make sense of it.
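Your current string is close. Remove the errant / from around the numbers, and make it a regex literal rather than a string, because scan treats a plain string as literal text to search for (\d is used here so that IDs containing a 0 also match):
searchstr = /article ID="\d{7}" contentstatus="Changed"/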
I am using ruby 1.8.7. I am not using rails.
How do I find all the links which are not already in anchor tag.
s = %Q{ <a href='www.a.com'><b>www.a.com</b></a> www.b.com <div>www.c.com</div> }
The output of above string should be
www.b.com
www.c.com
I know "b" tag before www.a.com complicates the case but that's what I have to work with.
You are going to want to use a real XML parser (Nokogiri will do). Regexes are unsuitable for a task like this, especially in Ruby 1.8.7, where negative lookbehind is not supported.
Dirty way to get rid of anchor tags. Doesn't work the way you want if they're nested. Also use a real parser ;-)
s.gsub(%r[<a\b.*?</a>]i, "")
=> " www.b.com <div>www.c.com</div> "
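Building on that, a rough sketch of the full task is to strip the anchors first and then scan what is left for things that look like links. The www. pattern below is my own simplifying assumption about what counts as a link; it is not from the question and will miss URLs in other forms.

```ruby
s = %Q{ <a href='www.a.com'><b>www.a.com</b></a> www.b.com <div>www.c.com</div> }

# Step 1: drop every anchor element (non-nested only), as in the answer above.
without_anchors = s.gsub(%r{<a\b.*?</a>}im, '')

# Step 2: collect anything starting with "www." made of word chars, dots, or hyphens.
links = without_anchors.scan(/\bwww\.[\w.-]+/)

puts links
# prints:
# www.b.com
# www.c.com
```

The same caveats apply: nested or malformed anchors will defeat the first step, which is where a real parser earns its keep.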