Determine character index in HTML source given a DOMRange from a WebKit selection - cocoa

I'm attempting to synchronize a DOMRange (representing a user selection from a Cocoa WebView) with the original HTML source currently rendered in that view, as a kind of Dreamweaver-style split editor.
My first idea was to get the DOMRange object's startContainer and offset and walk up the DOM tree from there, accumulating the overall character offset up to the body tag.
Unfortunately, this approach presents some problems:
1. Clearly the document's outerHTML will differ from the original HTML source if the DOM was manipulated via JavaScript or the parser needed to clean up malformed tags.
2. I can't figure out how to get the offset of a node within its parent's text content (e.g., 4 characters to "target" in <p>some<div>target</div>text</p>), and normalize doesn't seem to make this any easier.
3. Trying to account for some of the problems in #1, or just going from HTML source to WebView, will probably require separately parsing the HTML and then correlating the two DOM trees.
One ray of hope is that HTML5 specifies a standard parsing algorithm for dealing with invalid HTML (which WebKit has since adopted), so in theory it should be possible to use an off-the-shelf HTML5 parser to generate the same tree as WebKit — right?
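As a rough illustration of the source side of that correlation, here is a minimal Python sketch (not Cocoa) that records where each text run starts in the raw HTML. It uses the standard library's html.parser, which is only a tokenizer and not an HTML5 tree builder, so it makes no attempt to reproduce WebKit's tree. The names TextRunLocator, text_runs, and source_offset are hypothetical helpers invented for this example, and matching a run by its exact text is an assumption that only holds when the text is unique in the document.

```python
from html.parser import HTMLParser


class TextRunLocator(HTMLParser):
    """Collect (absolute_source_offset, text) for each text run."""

    def __init__(self, source):
        # convert_charrefs=False keeps text runs aligned with the raw source;
        # entity references such as &amp; are then reported separately
        # (and are ignored by this sketch).
        super().__init__(convert_charrefs=False)
        self.runs = []
        # Offset at which each source line starts, used to convert getpos().
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
        self.feed(source)
        self.close()

    def handle_data(self, data):
        line, column = self.getpos()  # line is 1-based, column is 0-based
        self.runs.append((self.line_starts[line - 1] + column, data))


def text_runs(source):
    """Return the list of (offset, text) pairs found in source."""
    return TextRunLocator(source).runs


def source_offset(source, node_text, index_in_node):
    """Source offset of character index_in_node in the first run equal to node_text."""
    for start, text in text_runs(source):
        if text == node_text:
            return start + index_in_node
    raise ValueError("text run not found in source")


if __name__ == "__main__":
    html = "<p>some<div>target</div>text</p>"
    print(text_runs(html))  # [(3, 'some'), (12, 'target'), (24, 'text')]
    print(source_offset(html, "target", 4))  # 16
```

A selection's startContainer text and startOffset could then be looked up with source_offset, provided the rendered text node corresponds to a single run in the source; anything the parser repaired or a script rewrote would still need the tree correlation described above.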
This is the most similar existing question I could find, but it's for a slightly different problem:
Getting source HTML from a WebView in Cocoa

Your problem #1 is actually not so bad; you can just turn off JS interpretation.
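Look at QWebSettings::JavascriptEnabled (this is the QtWebKit API; in a Cocoa WebView the counterpart is the javaScriptEnabled setting on WebPreferences), or just drop this in before you load any HTML: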
QWebSettings::globalSettings()->setAttribute(QWebSettings::JavascriptEnabled, false);
That should leave your DOM un-mangled by JS. Good luck!

Related

Polymer svg iconset and performance with multiple same icon

I am using the Polymer SVG iconset and the iron-icon element in my page. Let's say I have a large number of rows in a table and each row has a couple of icons in it (using iron-icon). These icons are repeated in every row. When I inspect the DOM, I see that each iron-icon in each row has the same icon SVG as part of the DOM (inside the shadow root of iron-icon).
Isn't this a huge performance bottleneck? IE11 is slow at parsing the DOM, and this can cause further slowness. Would a font-based icon set be more optimized here? Is Polymer's approach of using an SVG iconset wrong?
From my experience, the performance issue is not in the size of the DOM itself, but the JS API interaction with it. The way Polymer implements the icons, it acts as a polyfill for Web Components custom elements. What actually happens in older browsers that don't understand their declaration is that if you write
<iron-icon icon="search"></iron-icon>
a script cycling through the DOM replaces what are considered unknown elements with DOM elements the browser understands (you'd have to look in the DOM inspector to see what is actually used in a specific browser).
A more direct approach could be to use something that IE understands natively, for example the SVG sprite pattern. Include an invisible <svg> element that contains symbols
<svg display="none">
<symbol id="search" viewBox="..."><path d="..." /></symbol>
...
</svg>
and reference them
<svg class="icon"><use xlink:href="#search"/></svg>
If you can achieve that when compiling the page server-side, it avoids the use of scripting in the client and should give a nice performance boost.
Even if your table cells are constructed client-side, adding these elements to the DOM directly might still be faster than first adding something that a script has to replace later in a second run. (But that is only my guess without experience to back it up.)

When does a web page appear the first time?

I'd like to know when a web page first appears on screen, especially in relation to events like DOMContentLoaded or load.
If I knew the event in question then I could minimize the HTTP requests up until that point and lazy-load resources after that. My knowledge on the subject is admittedly limited and I know that it's a very broad topic but I'd rather like some practical info.
According to Google, this is the general sequence of events:
Process HTML markup and build the DOM tree.
Process CSS markup and build the CSSOM tree.
Combine the DOM and CSSOM into a render tree.
Run layout on the render tree to compute geometry of each node.
Paint the individual nodes to the screen.
According to Google again,
domContentLoaded typically marks when both the DOM and CSSOM are ready.
Together, I would say that in general, DOMContentLoaded is the closest event related to the painting of the markup, and Load is the closest event to when the rendering and loading of external resources is finished.
But, this could vary based on browser implementation, HTML version (4, 5, etc.), and probably other things I'm not thinking about.

Nokogiri strategy for identifying largest text on a page?

I'm doing a comparison of a bunch of landing pages in the wild. I'm trying to pull out the main header and the call to action, but of course the HTML formatting of the pages varies wildly.
I started looking for H1, H2, etc. assuming that the header tags correspond to primacy, but this is often not the case. Rendered font-size* might be a better indicator, however this seems messy and wouldn't handle cases where images with alt tags are used.
What's a good strategy to identify the main heading of 100 wild landing pages using Nokogiri?
*Also, is there a clever selector for rendered font-size?
You can't do it unless you have an AI running that can determine the most semantically important section of a document.
You can't count on the tags, such as headers or meta-tags, because those can be missing entirely.
You can't count on location in the source because CSS can move things anywhere.
And, even if you think you've got it nailed by looking at the CSS, the JavaScript can rip that reality from you because it can override everything, relying on the fact it takes a human's eyes and brain to make sense of the final rendered page.
So, basically, you're going to be mostly shooting in the dark unless you have code that can understand the content of the page and determine how often a word occurs, along with its synonyms and their root words, and then determine their placement on the page after CSS and JavaScript have been run.
It's really a tough task that a lot of big companies are spending a lot of money on.

Is it valid to include images with <object> instead of <img>?

Inspired by this question, where the poster casually states as fact that <object> should be used instead of <img> to embed images in HTML documents.
I'm developing a web app at the moment, and, being a perfectionist, I try to make it all shiny and compliant with all the standards. As far as I know, the <img> tag is to be deprecated in the upcoming standards for xHTML, and as nowadays even IE is able to handle <object> properly, I wanted to use the <object> tag for all the images on my site
It became clear that the "upcoming standards" the poster was talking about was the abandoned XHTML2 spec, which didn't even formally deprecate <img> anyway. (Although there were apparently rumors to that effect.)
To the best of my knowledge, I've not seen anyone in the web development community advocating for the usage of the general-purpose <object> tag over the arguably more semantic and definitely more compatible <img> tag.
Is there a good reason to use <object> instead of <img>? Should <object> even be used at all in place of <img> – what might this break?
To answer the question in the heading: yes, it is valid, of course. The validity of an object element does not even depend on the type of data being embedded. If you meant to ask whether it is correct, then the answer is yes, there is nothing in the specifications that would forbid it or recommend against it.
Among the possible reasons for using object to embed an image, the most practical is that it allows the fallback content to contain HTML markup, such as headings, lists, tables, and phrase markup. The img element lets you specify only plain text as fallback content—even paragraph breaks cannot be specified.
For accessibility reasons, any image should have fallback content to be rendered e.g. when the document is used in nonvisual browsing (screen reader, Braille, etc.) or the image is not displayed for one reason or other. For any content-rich image (say, an organization chart, or a drawing describing a complex process), the fallback content needs to be long and needs to have some structure.
However, it is rare to use object for embedding an image. The importance of fallback content is not widely understood, and practical economical and technical considerations often cause fallback issues to be ignored. Moreover, object has a long history of slow, buggy, and qualitatively poor implementation in browsers. Only recently has it become viable to use object fairly safely for image inclusion.
The question of which element is more semantic is mostly futile, and answers typically reflect just various ways to misunderstand the concept “semantic.” Both img and object mean inclusion (embedding) of external content. The img element is in principle for the inclusion of images, whatever that means, though it has also been used to include videos. For the object element, the type attribute can be used to specify the type of embedded content, down to a specific image type, e.g. type=image/gif, or it may be left open.
This implies that the object element is more flexible: you can leave the type unspecified, letting it be specified in HTTP headers. This way, the type of the embedded data could be changed without changing the object element or the embedding document in general; e.g., you could start with a simple version where the embedded content is an image, and later replace it with an HTML document (containing an image and text, for example).
The only time I've ever seen an object used to show an image is to create a "fallback" when the intended object couldn't be loaded for whatever reason. Take this example from the W3 specs:
<OBJECT title="The Earth as seen from space" classid="http://www.observer.mars/TheEarth.py">
<!-- Else, try the MPEG video -->
<OBJECT data="TheEarth.mpeg" type="application/mpeg">
<!-- Else, try the GIF image -->
<OBJECT data="TheEarth.gif" type="image/gif">
<!-- Else render the text -->
The <STRONG>Earth</STRONG> as seen from space.
</OBJECT>
</OBJECT>
</OBJECT>
Only ever attempting to load an image via an object basically strips the semantic meaning from the image, which I would consider very bad practice.
There is no good practical reason to use object instead of img. object has always been inconsistently and messily supported. Use img for images, embed for Flash content, and video or audio for multimedia content. Basically the only use of object left is for invoking specific plugins.
That said, the philosophical position for object's existence is still very good. It was meant to be a generic element for any type of content which can contain nested fallback content. W3C/XHTML2 had a very idealistic roadmap for HTML which emphasized syntactic and semantic purity and wanted to do things like allow href on any element (eliminating a), and eliminate img in favor of object. However, browsers could never seem to get object right in a fully generic way. Additionally, it was difficult to define the JS APIs for a generic object element. That's a big reason why video and audio are separate: an object serving a video won't expose the JS video APIs.
However, XHTML2 lost and HTML5 won and HTML5 favors img, so use img.
I came across a real-world use case for using object over img tags. I’m using PlantUML to generate SVGs that include tooltips. If I use the img tag to include the image, none of the native SVG pointer events work. But if I use the object tag, presto: all of the hover/mouseover behaviors work as expected.

PDF generation under Ruby - block should not be cut by a page break

My PDF consists of a number of blocks (actually, a list of quotations) that go one after another until the end of the document. If the text of a quotation does not fit on the page, the whole quotation should start from the top of the next page instead of being torn apart. How can I implement that with any Ruby library?
Try PrinceXML - this is a standalone executable that generates PDF out of HTML or XML. It supports a lot of special CSS properties that will even help you control page breaks. Refer to http://www.princexml.com/doc/6.0/page-breaks/
This application is available for Windows and Linux. I was using it to generate pretty complicated PDF documents with headers and footers on every page except the first one. And since you don't need to output a PDF with precise positioning of elements, it might be a perfect solution for you.
I haven't tried it, but in Prawn I would try using either the Document#text_box method or looking up the table methods and putting your text in cells with invisible borders. The documentation's unclear on how page break functionality fits in with the bounding box models, but it's worth a shot.
HTMLDoc, which converts HTML to PDF, has a page break facility.
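Whichever tool does the rendering, the keep-together rule from the question comes down to a simple layout decision. The following is a minimal, library-agnostic Python sketch of that rule, not the API of PrinceXML, Prawn, or HTMLDoc. The function name paginate and the idea of knowing each block's rendered height in advance are assumptions made for illustration.

```python
def paginate(block_heights, page_height):
    """Assign each block to a page without splitting any block.

    Returns a list of pages, each a list of block indices.
    A block taller than a whole page still gets a page of its own
    (it would overflow; this sketch does not try to split it).
    """
    pages = [[]]
    used = 0.0
    for index, height in enumerate(block_heights):
        # Start a new page when the block does not fit in the remaining
        # space, unless the current page is still empty.
        if pages[-1] and used + height > page_height:
            pages.append([])
            used = 0.0
        pages[-1].append(index)
        used += height
    return pages


if __name__ == "__main__":
    # Four quotations of heights 300, 300, 300, 100 on 700-unit pages.
    print(paginate([300, 300, 300, 100], 700))  # [[0, 1], [2, 3]]
```

In a real PDF library the heights would have to come from measuring the laid-out text first, which is the part the tools above handle for you.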
