HtmlUnit processing whitespace

I'm using HtmlUnit to do some processing of an Html page. My problem is that it does not seem to be correctly maintaining whitespace.
The original html looks like:
<div><cite>www.<b>example</b>.com</cite>
Which renders as:
www.example.com
After using HtmlUnit to do some parsing on other parts of the DOM, I print the HTML back out using getXml(). Doing so causes the HTML to be pretty-printed:
<div>
<cite>
www.
<b>
example
</b>
.com
</cite>
This ends up rendering as:
www. example .com
Note the extra space before and after example.
I tried just trimming the whitespace from the resulting pretty-printed DOM, but then you lose spaces in places where you actually want them.
Stepping through the generated DOM, it appears that HtmlUnit trims all of the DomText nodes when it creates them, so the space information is lost.
Is there any way I can configure HtmlUnit to track this information? Or is there some alternative that better maintains the original HTML? I just need to be able to extract portions of the HTML via XPath.

I think this should return the original html:
WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage("http://www.yourpage.com");
String originalHtml = page.getWebResponse().getContentAsString();

Using JavaScript gets the html without the extra whitespace:
WebClient client = new WebClient(BrowserVersion.FIREFOX_17);
HtmlPage page = client.getPage(url);
client.waitForBackgroundJavaScript(5000);
String html = page.executeJavaScript("document.body.parentNode.outerHTML")
.getJavaScriptResult()
.toString();

Related

Cast a Nokogiri::XML::Document to a Nokogiri::HTML::Document

I want to transform an XML document to HTML using XSL, tinker with it a little, then render it out. This is essentially what I'm doing:
source = Nokogiri::XML(File.read 'source.xml')
xsl = Nokogiri::XSLT(File.read 'transform.xsl')
transformed = xsl.transform(source)
html = Nokogiri::HTML(transformed.to_html)
html.title = 'Something computed'
Stylesheet::transform always returns XML::Document, but I need a HTML::Document instance to use methods like title=.
The code above works, but exporting and re-parsing as HTML is just awful. Since the target is a subclass of the source, there must be a more effective way to perform the conversion.
How can I clean up this mess?
As a side question, Nokogiri has generally underwhelmed me with its handling of doctypes, unawareness of <meta charset= etc... does anyone know of a less auto-magic library with similar capabilities?
Many thanks ;)

HTML::Document extends XML::Document, but the individual nodes in a HTML document are just plain XML::Nodes, i.e. there aren’t any HTML::Nodes. This suggests a way of converting an XML document to HTML by creating a new empty HTML::Document and setting its root to that of the XML document:
html = Nokogiri::HTML::Document.new
html.root = transformed.root
The new document has the HTML methods like title= and meta_encoding= available, and when serializing it creates an HTML document rather than XML: it adds an HTML doctype, correctly uses empty tags like <br>, displays minimized attributes where appropriate (e.g. <input type="checkbox" selected>) and doesn't escape things like > in <script> blocks.

Passing JSON as HTML element text

Would there be bad consequences from transporting JSON in HTML like this:
<div id="json" style="display: none;">{"foo": "bar"}</div>
assuming HTML chars such as < are escaped as &lt; in the element text?
The JSON could be strictly parsed:
var blah = $.parseJSON($('#json').html())
in a try/catch statement, for example. The rationale is to enable passing of JSON in Ajax'd HTML responses, when script tags are being stripped and not executed. An example would be Ajax requests made using the jQuery .load() special selector syntax:
$('#here').load('some.html #fragment')
...which ditches all script tags and thus prevents the use of:
<script>var blah = {"foo":"bar"}</script>
I've seen JSON being passed around in HTML attributes, and I'd guess this is equivalent - w.r.t. weirdness, security, etc - but is far less readable due to all the additional quote-escaping.

The natural way of passing JS data in HTML is through JavaScript code (if it is part of actual JavaScript code, as with initial values or configuration) or through HTML5 data- attributes (whenever JS code is not necessary, and always when the data needs to be attached to DOM elements).
In your example this would be probably the best:
<div id="json" style="display: none;"
data-something="{&quot;foo&quot;:&quot;bar&quot;}">
</div>
but reorganize your data to actually follow HTML structure:
<div class="profile-container"
data-profile="{&quot;name&quot;:&quot;John Doe&quot;,&quot;id&quot;:123}">
... profile 123 ...
</div>
<div class="profile-container"
data-profile="{&quot;name&quot;:&quot;Jane Doe&quot;,&quot;id&quot;:321}">
... profile 321 ...
</div>
(quoting should be done server-side, e.g. using PHP's htmlspecialchars(...), or Python's cgi.escape(..., True), which newer Python versions replace with html.escape(...)).
And then you can obtain the data in one of multiple ways, e.g. using jQuery's .data() method.
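For illustration, here is a minimal sketch of the server-side quoting step, written in Ruby rather than PHP or Python. The helper name profile_div and the sample data are hypothetical; JSON.generate and CGI.escapeHTML come from Ruby's standard library.

```ruby
require 'cgi'
require 'json'

# Hypothetical helper: renders a div that carries JSON in a data- attribute.
def profile_div(profile)
  json = JSON.generate(profile)      # compact JSON text
  escaped = CGI.escapeHTML(json)     # turns " into &quot;, < into &lt;, etc.
  %(<div class="profile-container" data-profile="#{escaped}">... profile #{profile['id']} ...</div>)
end

html = profile_div({ 'name' => 'John Doe', 'id' => 123 })
puts html
```

The browser decodes the entities when it parses the attribute, so the client side sees plain JSON again.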
EDIT:
Yes, your approach of embedding JSON as the content of HTML tags and hiding it with CSS has gotchas. As I said, if you want to pass data in HTML, the only "best practice" way is to attach it to one of the HTML elements (you are kind of doing that anyway, but you use CSS to hide it, while you could use existing solutions for passing JSON/data without being affected by clients that override your styles). The proof of one of the disadvantages is here: http://jsfiddle.net/NY7Bs/ (the data is passed both ways, but one simple external style overrides your inline styles and shows the content, not to mention the effect on the semantics of your document).

Why not simply use the .ajax() function? Then you would get only the JSON string, which you could parse as you suggested.

How do I find matching <pre> tags using a regular expression?

I am trying to create a simple blog that has code enclosed in <pre> tags.
I want to display "read more" after the first closing </pre> tag is encountered, thus showing only the first code segment.
I need to display all text, HTML, and code up to the first closing </pre> tag.
What I've come up with so far is the following:
/^(.*<\/pre>).*$/m
However, this matches every closing </pre> tag up to the last one encountered.
I thought something like the following would work:
/^(.*<\/pre>{1}).*$/m
It of course does not.
I've been using Rubular.
My solution, thanks to your help:
require 'nokogiri'
module PostsHelper
  def readMore(post)
    doc = Nokogiri::HTML(post.message)
    intro = doc.search("div[class='intro']")
    result = Nokogiri::XML::DocumentFragment.parse(intro)
    result << link_to("Read More", post_path(post))
    result.to_html
  end
end
Basically, in my editor for the blog I wrap the blog preview in div class=intro.
Thus, only the intro is displayed, with "read more" added on to it.

This is not a job for regular expressions, but for an HTML/XML parser.
Using Nokogiri, this will return all <pre> blocks as HTML, making it easy for you to grab the one you want:
require 'nokogiri'
html = <<EOT
<html>
<head></head>
<body>
<pre><p>block 1</p></pre>
<pre><p>block 2</p></pre>
</body>
</html>
EOT
doc = Nokogiri::HTML(html)
pre_blocks = doc.search('pre')
puts pre_blocks.map(&:to_html)
Which will output:
<pre><p>block 1</p></pre>
<pre><p>block 2</p></pre>

You can capture all text up to the first closing pre tag by making the quantifier in your regular expression reluctant:
/^(.*?<\/pre>{1}).*$/m
This way you can get the matched text with:
text.match(regex)[1]
which will return only the text up to the first closing pre tag.
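To see the difference between the greedy and the reluctant pattern side by side, here is a small sketch. The sample text is made up for illustration; the two patterns are the ones from the question and from this answer, minus the redundant {1}.

```ruby
# Made-up post body with two code segments.
text = "<p>Intro</p>\n<pre>first</pre>\n<p>More</p>\n<pre>second</pre>\n<p>End</p>"

greedy = /^(.*<\/pre>).*$/m    # .* grabs as much as possible, so it stops at the LAST </pre>
lazy   = /^(.*?<\/pre>).*$/m   # .*? grabs as little as possible, so it stops at the FIRST </pre>

greedy_match = text.match(greedy)[1]
lazy_match   = text.match(lazy)[1]

puts lazy_match
```

In Ruby the /m flag lets . match newlines, which is why both patterns can span several lines of the post.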

Reluctant matching might help in your case:
/^(.*?<\/pre>).*$/m
But it's probably not the best way to do this; consider using an HTML parser such as Nokogiri.

How can HtmlAgilityPack extract text from an HTML node whose class attribute is appended dynamically?

I want to extract the text 平均3.6 星 from this code segment, excerpted from amazon.cn.
<div class="content"><ul>
<li><b>用户评分:</b>
<span class="crAvgStars" style="white-space:no-wrap;">
<span class="asinReviewsSummary" ref="dp_db_cm_cr_acr_pop_" name="B004GUSIKO">
<a>
<span class="swSprite s_star_3_5 " title="平均3.6 星">
<span>平均3.6 星</span>
</span>
</a>
My problem is that part of the span's class value ("s_star_3_5 ") varies with the customer rating and is appended dynamically. I tried doc.DocumentNode.SelectSingleNode("//span[@class='swSprite']").InnerText and //span[@class='swSprite s_star_3_5 '], but I get either an error or not the result I want.
Any suggestions?

First of all, I suggest saving the value of doc.DocumentNode.OuterHtml to a local .html file to check that the markup you're getting is the markup you expect. Sometimes, when you start parsing a website with HtmlAgilityPack, the very first problem is that you're not getting the right HTML at all: maybe you're getting a 404 error, a redirection, etc.
I'm suggesting this because I tested //span[@class='swSprite s_star_3_5 '] and it worked correctly.
That was the issue in the following questions:
Selecting nodes that have an attribute with spaces using HTMLAgilityPack
XPath Query Problem using HTML Agility Pack
If that doesn't help, post the HTML code and I'll help you ;)

This works for me:
HtmlDocument doc = new HtmlDocument();
doc.Load(myHtml);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//span[starts-with(@class, 'swSprite')]");
Console.WriteLine("Text=" + node.InnerText.Trim());
and outputs
平均3.6 星
Note that I use the XPath starts-with function.

Convert HTML to plain text and maintain structure/formatting, with ruby

I'd like to convert html to plain text. I don't want to just strip the tags though, I'd like to intelligently retain as much formatting as possible. Inserting line breaks for <br> tags, detecting paragraphs and formatting them as such, etc.
The input is pretty simple, usually well-formatted html (not entire documents, just a bunch of content, usually with no anchors or images).
I could put together a couple of regexes that get me 80% of the way there, but figured there might be some existing solutions with more intelligence.

First, don't try to use regex for this. The odds are really good you'll come up with a fragile/brittle solution that will break with changes in the HTML or will be very hard to manage and maintain.
You can get part of the way there very quickly using Nokogiri to parse the HTML and extract the text:
require 'nokogiri'
html = '
<html>
<body>
<p>This is
some text.</p>
<p>This is some more text.</p>
<pre>
This is
preformatted
text.
</pre>
</body>
</html>
'
doc = Nokogiri::HTML(html)
puts doc.text
>> This is
>> some text.
>> This is some more text.
>>
>> This is
>> preformatted
>> text.
The reason this works is Nokogiri is returning the text nodes, which are basically the whitespace surrounding the tags, along with the text contained in the tags. If you do a pre-flight cleanup of the HTML using tidy you can sometimes get a lot nicer output.
The problem comes when you compare the output of a parser, or any other means of looking at the HTML, with what a browser displays. The browser is concerned with presenting the HTML in as pleasing a way as possible, ignoring the fact that the HTML can be horribly malformed and broken. The parser is not designed to do that.
You can massage the HTML before extracting the content to remove extraneous line-breaks, like "\n", and "\r" followed by replacing <br> tags with line-breaks. There are many questions here on SO explaining how to replace tags with something else. I think the Nokogiri site also has that as one of the tutorials.
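As a rough sketch of that pre-flight massage only: the helper name massage_html and the sample fragment below are made up, and plain string substitution is used just for this cleanup step. The massaged HTML would still be handed to a parser such as Nokogiri to extract the text.

```ruby
# Hypothetical pre-flight helper; the result is still HTML, not plain text.
# Caveat: collapsing line breaks this way also flattens the content of <pre> blocks.
def massage_html(html)
  html
    .gsub(/[\r\n]+/, ' ')         # source line breaks become single spaces
    .gsub(/<br\s*\/?>/i, "\n")    # <br>, <br/> and <br /> become real line breaks
end

fragment = "<p>This is\nsome text.<br>Second line.</p>"
massaged = massage_html(fragment)
puts massaged
```

The order matters: the original line breaks are removed first, so the only newlines left afterwards are the ones that came from <br> tags.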
If you really want to do it right, you'll need to figure out what you want to do for <li> tags inside <ul> and <ol> tags, along with tables.
An alternate attack method would be to capture the output of one of the text browsers like lynx. Several years ago I needed to do text processing for keywords on websites that didn't use Meta-Keyword tags, and found one of the text-browsers that let me grab the rendered output that way. I don't have the source available so I can't check to see which one it was.
