Finding partial string within horrible HTML using Nokogiri - ruby

Using Nokogiri, I want to fetch the part of the paragraph that comes after the <span> tags.
I am no regex hero, and it is the only thing that I need to discover before I can move forward. The only constant in the list is the | symbol, and the ugly way is to get the whole thing and split and join it I guess. Hopefully, there is a smarter, more elegant way!
<ul>
<li>
<p>
<strong>I don't care about </strong>
<span>|</span>
this I do care about
</p></li> ...
</ul>

If your HTML is that simple, then this will work:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<ul>
<li>
<p>
<strong>I don't care about </strong>
<span>|</span>
this I do care about
</p></li> ...
</ul>
EOT
doc.at('p').children.last # => #<Nokogiri::XML::Text:0x3ff1995c5b00 "\nthis I do care about\n">
doc.at('p').children.last.text # => "\nthis I do care about\n"
Parsing HTML and XML is really a matter of looking for landmarks that can be used to find what you want. In this case, <span> is OK, but getting the content you want from it isn't quite as easy as looking up one level to the <p> tag, grabbing its children, and selecting the last node in that list, which is the text node containing the text you want.
The reason using the <span> tag is not the way I'd go is, if the HTML formatting changes, the number of nodes between <span> and your desired text could change. Intervening text nodes containing "\n" could be introduced for the formatting of the source, which would mess up a simple indexed lookup. To work around that, the code would have to ignore blank nodes and find the one that wasn't blank.
I am no regex hero...
And you shouldn't try to be with HTML or XML. They're too flexible and can confound regular expressions unless you're dealing with extremely trivial searches on very static HTML, which isn't very likely in the real internet unless you're scanning abandoned pages. Instead, learn and rely on decent HTML/XML parsers, that can reduce a page into a DOM, making it easy to search and traverse the markup.

Related

XPath: How do I find a page element which contains another element, using the full text of both?

I have an HTML page which contains the following:
<div class="book-info">
The book is <i>Italicized Title</i> by Author McWriter
</div>
When I view this in Chrome Dev Tools, it looks like:
<div class="book-info">
"The book is "
<i>Italicized Title</i>
" by Author McWriter"
</div>
I need a way to find this single div using XPath.
Constraints:
There are many book-info divs on the page, so I can't just look for a div with that class.
Any part of the text within the book-info div might also appear in another, but the complete text within the div is unique. So I want to match the entire text, if possible.
It is not guaranteed that an <i> will exist within the book-info div. The following could also exist, and I need to be able to find it as well (but my code is working for this case):
<div class="book-info">
"Author McWriter's Legacy"
</div>
I think I can detect whether the div I'm looking for contains an <i> or not, and construct a different XPath expression depending on that.
Things I have tried:
//div[text()=concat("The book is ","Italicized Title"," by Author McWriter")]
//div[text()=concat("The book is ","<i>Italicized Title"</i>," by Author McWriter")]
//div[text()=concat("The book is ",[./i[text()="Italicized Title"]," by Author McWriter")]
//div[concat(text()="The book is ", i[text()="Italicized Title"],text()=" by Author McWriter")]
None of these worked for me. What XPath expression would?
You can use this combination of XPath-1.0 predicates in one expression. It matches both cases:
//div[@class="book-info" and ((i and contains(text()[1],"The book is") and contains(text()[2],"by Author McWriter")) or (not(i) and contains(string(.),"Author McWriter's Legacy")))]

How to find the nth element that has a class of .foo in the document with Capybara/Nokogiri

I'm trying to find the n-th element that has a special class in a document. The elements are not necessarily children of the same parent. So for example
<ul>
<li><div class="foo">This</div></li>
<li><div>Nothing</div>
<ul>
<li><div class="foo">This also</div></li>
</ul>
</li>
<li><div class="foo">And this</div></li>
</ul>
I'd like to find the first, second or third element that has the class .foo.
I tried
page.find '.foo'
Which errors in Capybara::Ambiguous: Ambiguous match, found 3 elements matching css ".foo"
I then tried
page.all('.foo')[n]
Which works nicely, except that it doesn't seem to wait for a short time the way Capybara's find does, which I need because the HTML is actually generated from ajax data. So how do I do this correctly with find?
Okay, after a short chat in #RubyOnRails on freenode it became clear to me that this isn't as easy as it first sounds. The problem is that Capybara can't know whether the .foos that are already inserted into the page are "all" of them. That's why .all has no (or doesn't need) support for waiting like .find has.
The solution would be to manually wait for an appropriate amount of time and then just use .all.
Nokogiri's CSS queries are effective for finding elements of certain classes. It is explained in the tutorial.
For example you can use the following Ruby one-liner to read from a given file and find the second element of class foo:
ruby -rnokogiri -e 'puts Nokogiri::HTML(readlines.join).css(".foo")[1]' sample.html
which returns
<div class="foo">This also</div>
Replace the number in [1] with the index of the element you want to find and replace sample.html with the html file you want to search in. If you want to pick out certain parts of the elements you can use methods of Nokogiri::XML::Element, e.g. content to get its contents.

matching <div></div> tag with regular expression in ruby

Could anyone tell me how can I match the start of <div> tag to the end of </div> tag with a regular expression in Ruby?
For example let say I have a:
<div>
<p>test content</p>
</div>
So far I have this:
< div [^>]* > [^<]*<\/div>
but it doesn't seem to work.
Nokogiri is great but, imho, there are situations when it can not be used.
For your simple case you can use this:
puts str.scan(/<div>(.*)<\/div>/im).flatten.first
<p>test content</p>
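One thing to watch with that pattern is that .* is greedy. A small sketch, using a made-up str with two <div> blocks, shows the difference between the greedy and the non-greedy form:

```ruby
# Hypothetical input with two sibling <div> blocks.
str = "<div>\n<p>first</p>\n</div>\n<div>\n<p>second</p>\n</div>"

# Greedy: .* runs to the LAST </div>, so both blocks come back as one match.
greedy = str.scan(/<div>(.*)<\/div>/im).flatten

# Non-greedy: .*? stops at the first </div> after each <div>.
lazy = str.scan(/<div>(.*?)<\/div>/im).flatten.map(&:strip)

puts greedy.size
puts lazy.inspect
```

Even the non-greedy form pairs each <div> with the nearest </div>, so it will cut nested <div> elements in the wrong place.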
To match the <div> when it's all on one line, use:
/<div[^>]*>/
But, that will break on any markup with a new-line inside the tag. It'll also break if there is whitespace between < and div, which there could be.
Eventually, after you've added in all the extra checks for the possible ways a tag can be written you'll want to consider a better way, which would be to use a parser, like Nokogiri, which makes working with HTML and XML much easier.
For instance, since you're trying to tear apart the HTML:
<div>
<p>test content</p>
</div>
it's pretty easy to guess you really want to get to "test content". What if the HTML changed to:
<div><p>test content</p></div>
or worse:
<div
><p>
test
content
</div>
A browser won't care, nor will a good parser, but a regex will get upset and require rework.
require 'nokogiri'
require 'pp'
doc = Nokogiri.HTML(<<EOT)
<div
><p>
test
content
</div>
EOT
pp doc.at('p').text.strip.gsub(/\s+/, ' ')
# => "test content"
That's why we recommend parsers.
An HTML parser such as Nokogiri would probably be a better option than using a Regex as PinnyM pointed out.
Here is a tutorial on the Nokogiri page that describes how to search an HTML/XML document.
This stackoverflow question demonstrates something similar to what you want to accomplish using CSS selectors. Perhaps something like that would work for you.

Convert HTML to plain text and maintain structure/formatting, with ruby

I'd like to convert html to plain text. I don't want to just strip the tags though, I'd like to intelligently retain as much formatting as possible. Inserting line breaks for <br> tags, detecting paragraphs and formatting them as such, etc.
The input is pretty simple, usually well-formatted html (not entire documents, just a bunch of content, usually with no anchors or images).
I could put together a couple regexs that get me 80% there but figured there might be some existing solutions with more intelligence.
First, don't try to use regex for this. The odds are really good you'll come up with a fragile/brittle solution that will break with changes in the HTML or will be very hard to manage and maintain.
You can get part of the way there very quickly using Nokogiri to parse the HTML and extract the text:
require 'nokogiri'
html = '
<html>
<body>
<p>This is
some text.</p>
<p>This is some more text.</p>
<pre>
This is
preformatted
text.
</pre>
</body>
</html>
'
doc = Nokogiri::HTML(html)
puts doc.text
>> This is
>> some text.
>> This is some more text.
>>
>> This is
>> preformatted
>> text.
The reason this works is Nokogiri is returning the text nodes, which are basically the whitespace surrounding the tags, along with the text contained in the tags. If you do a pre-flight cleanup of the HTML using tidy you can sometimes get a lot nicer output.
The problem is when you compare the output of a parser, or any means of looking at the HTML, with what a browser displays. The browser is concerned with presenting the HTML in as pleasing a way as possible, ignoring the fact that the HTML can be horribly malformed and broken. The parser is not designed to do that.
You can massage the HTML before extracting the content to remove extraneous line-breaks, like "\n" and "\r", followed by replacing <br> tags with line-breaks. There are many questions here on SO explaining how to replace tags with something else. I think the Nokogiri site also has that as one of the tutorials.
If you really want to do it right, you'll need to figure out what you want to do for <li> tags inside <ul> and <ol> tags, along with tables.
An alternate attack method would be to capture the output of one of the text browsers like lynx. Several years ago I needed to do text processing for keywords on websites that didn't use Meta-Keyword tags, and found one of the text-browsers that let me grab the rendered output that way. I don't have the source available so I can't check to see which one it was.

how to get xpath of text between <br> or <br />?

</div>
apple
<br>
banana
<br/>
watermelon
<br>
orange
Assuming the above, how is it possible to use XPath to grab each fruit? It must use XPath of some sort.
Should I use substring-after(following-sibling...)?
EDIT: I am using Nokogiri parser.
Well, you could use "//br/text()", but that will return all the text-nodes inside the <br> tags. But since the above isn't well-formed xml I'm not sure how you are going to use xpath on it. Regex is usually a poor choice for html, but there are html (not xhtml) parsers available. I hesitate to suggest one for ruby simply because that isn't "my area" and I'd just be googling...
Try the following, which gets all text siblings of <br> tags as an array of strings with leading and trailing whitespace stripped:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML(DATA)
# Union of the text nodes on either side of every <br>, in document order
fruits = doc.xpath('//br/following-sibling::text() | //br/preceding-sibling::text()')
            .map { |fruit| fruit.to_s.strip }
puts fruits
__END__
</div>
apple
<br>
banana
<br/>
watermelon
<br>
orange
Is this what you want?
There are several issues here:
XPath works on XML - you have HTML which is not XML (basically, the tags don't match so an XML parser will throw an exception when you give it that text)
XPath normally also works by finding the attributes inside tags. Seeing as your <br> tags don't actually contain the text, they're just in-between it, this will also prove difficult
Because of this, what you probably want to do is use XPath (or similar) to get the contents of the div, and then split the string based on <br> occurrences.
As you've tagged this question with ruby, I'd suggest looking into hpricot, as it's a really nice and fast HTML (and XML) parsing library, which should be much more useful than mucking around with XPath
