I've read that Firefox 3.5 has a new feature in its parser:
Improvements to the Gecko layout
engine, including speculative parsing
for faster content rendering.
Could you explain that in simple terms?
It's all to do with this entry in bugzilla: https://bugzilla.mozilla.org/show_bug.cgi?id=364315
In that entry, Anders Holbøll suggested:
It seems that when encountering a script tag that references an external file,
the browser does not attempt to load any elements after the script tag until
the external script file is loaded. This makes sites that reference several
or large JavaScript files slow.
...
Here file1.js will be loaded first, followed sequentially by file2.js. Then
img1.gif, img2.gif and file3.js will be loaded concurrently. When file3.js has
loaded completely, img3.gif will be loaded.
One might argue that since the js files could contain, for instance, a line like
"document.write('<!--');", there is no way of knowing whether any of the content
following a script tag will ever be shown before the script has been executed.
But I would assume that it is far more probable that the content would be shown
than not. And these days it is quite common for pages to reference many
external JavaScript files (Ajax libraries, statistics and advertising), which
with the current behavior causes the page load to be serialized.
So essentially, the HTML parser continues reading through the HTML file and fetching the referenced resources, even when it is blocked from rendering by a script.
It's called "speculative" because the script might do things like set CSS properties such as "display: none" or comment out sections of the following HTML, thereby making certain loads unnecessary. However, in the 95% use case most of the references will be loaded, so the parser is usually guessing correctly.
I think it means that when the browser would normally block (for example for a script tag), it will continue to parse the HTML. It will not create an actual DOM until the missing pieces are loaded, but it will start fetching script files and stylesheets in the background.
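Purely as an illustration (the file names below just echo the ones in the bug comment; the real markup was not included there), this is the kind of page where speculative parsing pays off:

<script src="file1.js"></script>
<script src="file2.js"></script>
<img src="img1.gif">
<img src="img2.gif">
<script src="file3.js"></script>
<img src="img3.gif">

A strictly blocking parser stalls at file1.js before it has even seen the img tags; a speculative parser keeps scanning ahead and can start fetching img1.gif, img2.gif and file3.js while the earlier scripts are still downloading.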
I'm looking for some bite-sized examples of how Hugo might be managing site-wide data, like .Site.AllPages.
Specifically, Hugo seems too fast to be reading in every file and its metadata before beginning to generate pages and making things like .Site.AllPages available -- but obviously that has to be the case.
Are Ruby (Jekyll) and Python (Pelican) really just that slow, or is there some specific (algorithmic) method that Hugo employs to generate pages before everything is ready?
There is no magic, and Hugo does not start any rendering until the .Site.Pages etc. collections are filled and ready.
Some key points here:
We have a processing pipeline where we do concurrent processing whenever we can, so your CPUs should be pretty busy.
Whenever we do content manipulation (shortcodes, emojis etc.), you will most likely see a hand crafted parser or replacement function that is built for speed.
We really care about the "being fast" part, so we have a solid set of benchmarks to reveal any performance regressions.
Hugo is built with Go -- which is really fast and has a really great set of tools for this (pprof, benchmark support, etc. -- see the sketch below).
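For readers unfamiliar with Go's tooling, this is roughly what that benchmark support looks like -- a minimal sketch with a made-up renderPage helper, not code taken from Hugo's source:

package hugobench

import "testing"

// renderPage is a hypothetical stand-in for one of Hugo's render steps.
func renderPage(content string) string {
    return "<p>" + content + "</p>"
}

func BenchmarkRenderPage(b *testing.B) {
    for i := 0; i < b.N; i++ {
        renderPage("already-converted content")
    }
}

Running go test -bench=. -cpuprofile cpu.out and feeding the profile to go tool pprof is the usual way to chase down the kind of performance regressions mentioned above.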
Some other points that make the hugo server variant even faster than the regular hugo build:
Hugo uses a virtual file system, and we render directly to memory when in server/development mode.
We have some partial reloading logic in there. So, even if we render everything every time, we try to reload and rebuild only the content files that have changed and we don't reload/rebuild templates if it is a content change etc.
I'm bep on GitHub, the main developer on Hugo.
You can see AllPages in hugolib/page_collections.go.
A git blame shows that it was modified in Sept. 2016 for Hugo v0.18 in commit 698b994, as part of PR 2297, "Fix Node vs Page".
That PR references the discussion/improvement proposal "Node improvements":
Most of the "problems" with this get much easier once we agree that a page is just a page that is just a ... page...
And that a Node is just a Page with a discriminator.
So:
Today's pages are Page with discriminator "page"
Homepage is Page with discriminator "home" or whatever
Taxonomies are Pages with discriminator "taxonomy"
...
They have some structural differences (pagination etc.), but they are basically just pages.
With that in mind we can put them all in one collection and add query filters on discriminator:
.Site.Pages: filtered by discriminator = 'page'
.Site.All: no filter
where: when the sequence is Pages add discriminator = 'page', but let user override
That key (the discriminator) makes it possible to quickly retrieve all the 'pages'.
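That is roughly how it ended up in Hugo, where the discriminator is exposed in templates as the page's Kind (the collection names above come from the proposal and differ slightly from what shipped). A minimal template sketch of the filtering idea:

{{/* every page in the site, regardless of kind */}}
{{ $all := .Site.Pages }}

{{/* only regular content pages, i.e. discriminator/kind = "page" */}}
{{ range where .Site.Pages "Kind" "page" }}
  <a href="{{ .Permalink }}">{{ .Title }}</a>
{{ end }}

Current Hugo also exposes .Site.RegularPages as a built-in shortcut for exactly that filter.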
I've got a rather large AsciiDoc document that I translate dynamically to PDF for our developer guide. Since the doc often refers to Java classes that are documented in our JavaDoc, we converted them into links directly in the docs, e.g.:
In this block we create a new
https://www.codenameone.com/javadoc/com/codename1/ui/Form.html[Form]
named `hi`.
This works rather well for the most part and looks great in HTML, as every reference to a class leads directly to its JavaDoc, making the reference/guide process much simpler.
However, when we generate a PDF, some pages end up with a block of repeated link footnotes at the bottom.
Normally I wouldn't mind a lot of footnotes, or even repeats from a previous page; but in this case the footnote for the link to Container appears 3 times on the same page.
I could remove some of the links, but I'd rather not since they make a lot of sense in the web version. And since I also have no idea where the page breaks will land, I'd rather not do it by hand.
This looks to me like a bug somewhere: if the link is the same, the footnote for the link should only be generated once.
I'm fine with removing all link footnotes in the document if that is the price to pay, although I'd rather be able to do this on a case-by-case basis so that some links would remain printable.
Adding these two parameters to fo-pdf.xsl removes the footnotes:
<xsl:param name="ulink.footnotes" select="0"></xsl:param>
<xsl:param name="ulink.show" select="0"></xsl:param>
The first parameter disables footnotes, which makes the URLs appear inline again.
The second parameter removes the URLs from the text. Links remain active and clickable.
Setting either parameter to a non-zero value enables its behavior again.
Source:
http://docbook.sourceforge.net/release/xsl/1.78.1/doc/fo/ulink.show.html
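If your toolchain runs the DocBook XSL stylesheets through xsltproc, the same parameters can usually be passed on the command line instead of editing fo-pdf.xsl (the file names below are placeholders):

xsltproc --stringparam ulink.footnotes 0 \
         --stringparam ulink.show 0 \
         fo-pdf.xsl developer-guide.xml > developer-guide.fo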
We were looking for something similar in a slightly different situation and didn't find a solution. We ended up writing a processor that just stripped away some of the links, e.g. every link to the same URL within a section that started with '==='.
Not an ideal situation, but as far as I know it's the only way.
I am rebuilding a site with DocPad, and it's very liberating to form a folder structure that makes sense for my content-creation workflow, but I'm running into a problem with DocPad's hard division between content-to-be-rendered and 'static' content.
DocPad recommends that you put things like images in /files instead of /documents, and the documentation makes it sound as if there will otherwise be some processing overhead incurred.
First, I'd like an explanation if anyone has it of why a file with a
single extension (therefore no rendering) and no YAML front-matter,
such as a .jpg, would impact site-regeneration time when placed
within /documents.
Second, the real issue: is there a way, if it does indeed create a
performance hit, to mitigate it? For example, to specify an 'ignore'
list with regex, etc...
My use case
I would like to do this for posts and their associated images to make authoring a post more natural. I can easily see the images I have to work with and all the related files are in one place.
I am also doing this for an artwork I am displaying. In this case it's an even stronger use case, as the only data in my html.eco file is YAML front matter holding various metadata; my layout automatically generates the gallery from all the attached images located in a folder with the same name as the post. I can match the relative output path folder in my /files directory, but it's error prone, because you're in one folder (src/files/artworks/) when creating the folder of images and another (src/documents/artworks/) when creating the HTML file -- typos are far more likely (as you can't ever see the folder and the HTML file side by side)...
Even without justifying a use case, I can't see why DocPad should be putting forth such a hard division. A performance consideration should not be passed on to the end user like that if it can be avoided in any way; since with DocPad I am likely to be managing my blog through the file system, I ought to have full control over that structure, and I certainly don't want my content divided up based on some framework limitation or performance concern rather than on logical content divisions.
I think the key is the line about "metadata". Even though a file does NOT have a double extension, it can still have metadata at the top of the file, which needs to be scanned and read. The double extension really just tells DocPad to convert the file from one format and output it as another. If I create a plain HTML file in the documents folder, I can still include a metadata header in the form:
---
tags: ['tag1','tag2','tag3']
title: 'Some title'
---
When the file is copied to the out directory, this metadata will be removed. If I do the same thing to an HTML file in the files directory, the file will be copied to the out directory with the metadata header intact. So the answer to your question is that even though your file has a single extension and is not "rendered" as such, it still needs to be opened and processed.
The point you make about keeping images and documents together, however, is a good one. I can see a good argument for excluding certain file extensions (like image files) from being processed, or perhaps for only including certain file extensions.
So what I would like to do is scrape this site: http://boxerbiography.blogspot.com/
and create one HTML page that I can either print or send to my Kindle.
I am thinking of using Hpricot, but am not too sure how to proceed.
How do I set it up so it recursively checks each link, gets the HTML, either stores it in a variable or dumps it to the main HTML page and then goes back to the table of contents and keeps doing that?
You don't have to tell me EXACTLY how to do it, but just the theory behind how I might want to approach it.
Do I literally have to look at the source of one of the articles (which is EXTREMELY ugly, btw), e.g. view-source:http://boxerbiography.blogspot.com/2006/12/10-progamer-lim-yohwan-e-sports-icon.html, and manually program the script to extract text between certain tags (e.g. h3, p, etc.)?
If I take that approach, then I will have to look at the individual source of each chapter/article and then do that. Kinda defeats the purpose of writing a script to do it, no?
Ideally I would like a script that will be able to tell the difference between JS and other code and just the 'text' and dump it (formatted with the proper headings and such).
Would really appreciate some guidance.
Thanks.
I'd recommend using Nokogiri instead of Hpricot. It's more robust, uses fewer resources, has fewer bugs, is easier to use, and is faster.
I did some extensive scraping for work at one time and had to switch to Nokogiri, because Hpricot would crash inexplicably on some pages.
Check this RailsCast:
http://railscasts.com/episodes/190-screen-scraping-with-nokogiri
and:
http://nokogiri.org/
http://www.rubyinside.com/nokogiri-ruby-html-parser-and-xml-parser-1288.html
http://www.engineyard.com/blog/2010/getting-started-with-nokogiri/
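To make the theory concrete, here is a minimal Nokogiri sketch of the approach (the CSS selectors and file names are placeholders -- you would inspect the blog's actual markup and narrow them down):

require 'open-uri'
require 'nokogiri'

# Fetch and parse the table-of-contents page.
toc = Nokogiri::HTML(URI.open('http://boxerbiography.blogspot.com/'))

# Collect candidate chapter links; 'a' is a placeholder selector.
links = toc.css('a').map { |a| a['href'] }
           .compact
           .select { |href| href.start_with?('http') }
           .uniq

# Visit each link and keep only the parts you care about.
chapters = links.map do |url|
  page  = Nokogiri::HTML(URI.open(url))
  title = page.at_css('h3')&.text                 # placeholder selector
  body  = page.css('p').map(&:text).join("\n\n")  # keeps the text, drops scripts
  "<h2>#{title}</h2>\n<p>#{body}</p>"
end

# Dump everything into one page you can print or send to a Kindle.
File.write('book.html', "<html><body>\n#{chapters.join("\n")}\n</body></html>")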
Say that on page1.html I need mootools.js and main.js... I guess that these tools should generate one minified JS file (say min1.js).
Then on page2.html I need mootools.js, main.js AND page2.js... Do those tools serve min1.js (already cached by the browser) plus page2.js? Or do they combine these 3 .js files and serve the resulting minified file, which needs to be fully cached again by the browser?
Thank you
Assuming you are using the Apache module mod_pagespeed, since you tagged the question with it but didn't say whether you are or not...
If you turn on ModPagespeedEnableFilters combine_javascript (which is disabled by default), it operates on the whole page. According to the documentation:
This filter generates URLs that are essentially the concatenation of
the URLs of all the CSS files being combined.
page1.html would combine mootools.js and main.js into one file, and page2.html would combine mootools.js, main.js, and page2.js into another.
To answer your question, then: yes, the browser will end up caching several copies of the repeated JavaScript files (one per combined URL).
However,
By default, the filter will combine together script files from
different paths, placing the combined element at the lowest level
common to both origins. In some cases, this may be undesirable. You
can turn off the behavior with: ModPagespeedCombineAcrossPaths off
If you turn cross-path combining off and deliberately spread your files across paths (common libraries in one path, page-specific scripts in another), the common scripts will be combined into one file and the page-specific scripts combined on their own. This keeps the duplication of large, common libraries down.
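For reference, a hypothetical pagespeed.conf fragment wiring the directives discussed above together might look like this (only the two directives quoted from the documentation are taken from the answer; the rest is ordinary Apache boilerplate):

<IfModule pagespeed_module>
    ModPagespeed on
    # Combine the JavaScript on each page into a single file (off by default).
    ModPagespeedEnableFilters combine_javascript
    # Only combine scripts that live under the same path, so a shared
    # libraries path stays cacheable across pages.
    ModPagespeedCombineAcrossPaths off
</IfModule>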