How do I find matching <pre> tags using a reqular expression? - ruby

I am trying to create a simple blog that has code inclosed in <pre> tags.
I want to display "read more" after the first closing </pre> tag is encountered, thus showing only the first code segment.
I need to display all text, HTML, code up to the first closing </pre> tag.
What I've come up with so far is the follow:
/^(.*<\/pre>).*$/m
However, this matches every closing </pre> tag up to the last one encountered.
I thought something like the following would work:
/^(.*<\/pre>{1}).*$/m
It of course does not.
I've been using Rubular.
My solution thanks to your guys help:
require 'nokogiri'
module PostsHelper
def readMore(post)
doc = Nokogiri::HTML(post.message)
intro = doc.search("div[class='intro']")
result = Nokogiri::XML::DocumentFragment.parse(intro)
result << link_to("Read More", post_path(post))
result.to_html
end
end
Basically in my editor for the blog I wrap the blog preview in div class=intro
Thus, only the intro is displayed with read more added on to it.

This is not a job for regular expressions, but for a HTML/XML parser.
Using Nokogiri, this will return all <pre> blocks as HTML, making it easy for you to grab the one you want:
require 'nokogiri'
html = <<EOT
<html>
<head></head>
<body>
<pre><p>block 1</p></pre>
<pre><p>block 2</p></pre>
</body>
</html>
EOT
doc = Nokogiri::HTML(html)
pre_blocks = doc.search('pre')
puts pre_blocks.map(&:to_html)
Which will output:
<pre><p>block 1</p></pre>
<pre><p>block 2</p></pre>

You can capture all text upto the first closing pre tag by modifying your regular expression to,
/^(.*?<\/pre>{1}).*$/m
This way you can get the matched text by,
text.match(regex)[1]
which will return only the text upto the first closing pre tag.

Reluctant matching might help in your case:
/^(.*?<\/pre>).*$/m
But it's probably not the best way to do the thing, consider using some html parser, like Nokogiri.

Related

Nokogiri/Mechanize xpath locator breaks when there is a stray start tag

I loaded a page using Mechanize:
url = 'http://www.blah.com'
agent = Mechanize.new
page = agent.get(url)
and tried to access an element using an XPath selector:
found = page.at('/html/body/table')
It returns nil because the HTML, which is out of my control, has an opening tag where it shouldn't be:
<html>
<body>
<tr>
<table>
. . .
The "stray start tag," as Firefox calls it, is ignored when the browser renders the page in real life (and Firefox gives me xpaths that ignore it), but Nokogiri can't see anything past that extra <tr>.
Is there any way to clean the HTML of hanging tags like this?
In your example it would be:
page.at '/html/body/tr/table'
But maybe it makes more sense to just do:
page.at 'table'
Use a less brittle XPath query?
found = page.at('//table')
You can clean this using Nokogiri easily:
require 'nokogiri'
html = '<html><body><tr><table><tr><td>foo</td></tr></table></tr></body></html>'
doc = Nokogiri::HTML(html)
inner_table = doc.at('//body/tr/table')
if (inner_table)
doc.at('body tr').replace(inner_table)
end
puts doc.to_html
With the result being:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><table><tr><td>foo</td></tr></table></body></html>
If your HTML is more complex, then find some sort of marker similar to the <body><tr><table> node-chain, and substitute it into the code above.
Note that I'm mixing both XPath and CSS accessors. I prefer CSS for their readability, but sometimes XPath makes it easier to get at something or is more self-documenting.
Also notice that I'm using both XPath and CSS with Nokogiri's at method. Though Nokogiri supports both at, at_css and at_xpath, I rely on at unless I need to explicitly tell Nokogiri that what I'm using as an accessor is CSS or XPath. It's a convenience thing. The same applies to Nokogiri's search method.

Nokogiri grab only visible inner_text

Is there a better way to extract the visible text on a web page using Nokogiri? Currently I use the inner_text method, however that method counts a lot of JavaScript as visible text. The only text I want to capture is the visible text on the screen.
For example, in IRB if I do the following in Ruby 1.9.2-p290:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
words = doc.inner_text
words.scan(/\w+/)
If I search for the word "function" I see that it appears 20 times in the list, however if I go to http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX the word "function" does not appear anywhere in the visible text.
Can I ignore JavaScript or is there a better way of doing this?
You could try:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
doc.traverse{ |x|
if x.text? && x.text !~ /^\s*$/
puts x.text
end
}
I have not done much with Nokogiri, but I believe this should find/output all text nodes in the document that are not blanks. This at least seems to be ignoring the javascript and all the text I checked was visible on the page (though some of it in the dropdown menus).
You can ignore JavaScript and there is a better way. You're ignoring the power of Nokogiri. Badly.
Rather than provide you with the direct answer, it will do you well to learn to "fish" using Nokogiri.
In a document like:
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
I recommend starting with CSS accessors because they're generally more familiar to people:
doc = Nokogiri::HTML(var_containing_html) will parse and return the HTML DOM in doc.
doc.at('p') will return a Node, which basically points to the first <p> node.
doc.search('p') will return a NodeSet of all matching nodes, which acts like an array, in this case all <p> nodes.
doc.at('p').text will return the text inside a node.
doc.search('p').map{ |n| n.text } will return all the text in the <p> nodes as an array of text strings.
As your document gets more complex you need to drill down. Sometimes you can do it using a CSS accessor, such as 'body p' or something similar, and sometimes you need to use XPaths. I won't go into those but there are great tutorials and references out there.
Nokogiri's tutorials are very good. Go through them and they will reveal all you need to know.
In addition, there are many answers on Stack Overflow discussing this sort of problem. Check out the "Related" links on the right of the page.
Ignore the tags where JavaScript lives (<script>). While we’re at it, we should also ignore CSS (<styles>).
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
doc.css('style').each(&:remove)
doc.css('script').each(&:remove)
puts doc.text
# Alternatively, for cleaner output:
# puts doc.text.split("\n").map(&:strip).reject(&:empty?)

How to get a mail address from HTML code with Nokogiri

How can I get the mail address from HTML code with Nokogiri? I'm thinking in regex but I don't know if it's the best solution.
Example code:
<html>
<title>Example</title>
<body>
This is an example text.
Mail to me
</body>
</html>
Does a method exist in Nokogiri to get the mail address if it is not between some tags?
You can extract the email addresses using xpath.
The selector //a will select any a tags on the page, and you can specify the href attribute using # syntax, so //a/#href will give you the hrefs of all a tags on the page.
If there are a mix of possible a tags on the page with different urls types (e.g. http:// urls) you can use xpath functions to further narrow down the selected nodes. The selector
//a[starts-with(#href, \"mailto:\")]/#href
will give you the href nodes of all a tags that have a href attribute that starts with "mailto:".
Putting this all together, and adding a little extra code to strip out the "mailto:" from the start of the attribute value:
require 'nokogiri'
selector = "//a[starts-with(#href, \"mailto:\")]/#href"
doc = Nokogiri::HTML.parse File.read 'my_file.html'
nodes = doc.xpath selector
addresses = nodes.collect {|n| n.value[7..-1]}
puts addresses
With a test file that looks like this:
<html>
<title>Example</title>
<body>
This is an example text.
Mail to me
A Web link
<a>An empty anchor.</a>
</body>
</html>
this code outputs the desired example#example.com. addresses is an array of all the email addresses in mailto links in the document.
I'll preface this by saying that I know nothing about Nokogiri. But I just went to their website and looked at the documentation and it looks pretty cool.
If you add an email_field class (or whatever you want to call it) to your email link, you can modify their example code to do what you are looking for.
require 'nokogiri'
require 'open-uri'
# Get a Nokogiri::HTML:Document for the page we’re interested in...
doc = Nokogiri::HTML(open('http://www.yoursite.com/your_page.html'))
# Do funky things with it using Nokogiri::XML::Node methods...
####
# Search for nodes by css
doc.css('.email_field').each do |email|
# assuming you have than one, do something with all your email fields here
end
If I were you, I would just look at their documentation and experiment with some of their examples.
Here's the site: http://nokogiri.org/
CSS selectors can now (finally) find text at the beginning of a parameter:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
blah
blah
EOT
doc.at('a[href^="mailto:"]')
.to_html # => "blah"
Nokogiri tries to track the jQuery extensions. I used to have a link to a change-notice or message from one of the maintainers talking about it but my mileage has varied.
See "CSS Attribute Selectors" for more information.
Try to get the whole html page and use regular expressions.

Convert HTML to plain text and maintain structure/formatting, with ruby

I'd like to convert html to plain text. I don't want to just strip the tags though, I'd like to intelligently retain as much formatting as possible. Inserting line breaks for <br> tags, detecting paragraphs and formatting them as such, etc.
The input is pretty simple, usually well-formatted html (not entire documents, just a bunch of content, usually with no anchors or images).
I could put together a couple regexs that get me 80% there but figured there might be some existing solutions with more intelligence.
First, don't try to use regex for this. The odds are really good you'll come up with a fragile/brittle solution that will break with changes in the HTML or will be very hard to manage and maintain.
You can get part of the way there very quickly using Nokogiri to parse the HTML and extract the text:
require 'nokogiri'
html = '
<html>
<body>
<p>This is
some text.</p>
<p>This is some more text.</p>
<pre>
This is
preformatted
text.
</pre>
</body>
</html>
'
doc = Nokogiri::HTML(html)
puts doc.text
>> This is
>> some text.
>> This is some more text.
>>
>> This is
>> preformatted
>> text.
The reason this works is Nokogiri is returning the text nodes, which are basically the whitespace surrounding the tags, along with the text contained in the tags. If you do a pre-flight cleanup of the HTML using tidy you can sometimes get a lot nicer output.
The problem is when you compare the output of a parser, or any means of looking at the HTML, with what a browser displays. The browser is concerned with presenting the HTML in as pleasing way as possible, ignoring the fact that the HTML can be horribly malformed and broken. The parser is not designed to do that.
You can massage the HTML before extracting the content to remove extraneous line-breaks, like "\n", and "\r" followed by replacing <br> tags with line-breaks. There are many questions here on SO explaining how to replace tags with something else. I think the Nokogiri site also has that as one of the tutorials.
If you really want to do it right, you'll need to figure out what you want to do for <li> tags inside <ul> and <ol> tags, along with tables.
An alternate attack method would be to capture the output of one of the text browsers like lynx. Several years ago I needed to do text processing for keywords on websites that didn't use Meta-Keyword tags, and found one of the text-browsers that let me grab the rendered output that way. I don't have the source available so I can't check to see which one it was.

how to get the content between the first html tag and the second html tag in ruby

Hi friend~
I want get the content between the first html tag and the second html tag.
for example
<p>Hello <bold>world</bold>!</p>
will return
Hello
What should I do in Ruby?
Thank you~
Regex will be: <[^>]*>([^<]*)
<[^>]*> - math thiw first tag "<...>"
([^<]*) - capture text to open next tag "<...> some text <...>"
how apply him on Rubby - i dont know
look http://www.regular-expressions.info/ruby.html
A regular expression catching everthing between the first and the second pair of angle brackets lools like
/<.*?>(.*?)</m
The result will be in the first capturing group of the first match.
Note that this will probably fail on HTML comments and JavaScript.

Resources