How does Hugo maintain site-wide data, like .Site.AllPages? - go

I'm looking for some bite-sized examples of how Hugo might manage site-wide data, like .Site.AllPages.
Specifically, Hugo seems too fast to be reading in every file and its metadata before it starts generating pages and making things like .Site.AllPages available -- but obviously that has to be the case.
Are Ruby (Jekyll) and Python (Pelican) really just that slow, or is there some specific (algorithmic) method that Hugo employs to generate pages before everything is ready?

There is no magic, and Hugo does not start any rendering until the .Site.Pages etc. collections are filled and ready.
Some key points here:
We have a processing pipeline where we do concurrent processing whenever we can, so your CPUs should be pretty busy (a rough sketch of that pattern follows this list).
Whenever we do content manipulation (shortcodes, emojis etc.), you will most likely see a hand-crafted parser or replacement function that is built for speed.
We really care about the "being fast" part, so we have a solid set of benchmarks to reveal any performance regressions.
Hugo is built with Go -- which is really fast and has a really great set of tools for this (pprof, benchmark support etc.).
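The sketch below is not Hugo's actual source -- just a minimal, hypothetical illustration of the fan-out pattern in the first point (the Page type and render function are made up), showing how one worker per CPU can keep all cores busy once the full page collection is ready:

package main

import (
    "fmt"
    "runtime"
    "sync"
)

// Page stands in for Hugo's page type; the real one lives in hugolib.
type Page struct{ Title string }

// render is a placeholder for the expensive per-page work (templates, shortcodes, ...).
func render(p Page) string { return "<h1>" + p.Title + "</h1>" }

func main() {
    pages := []Page{{"Home"}, {"About"}, {"First post"}, {"Second post"}}

    jobs := make(chan Page)
    out := make(chan string, len(pages))

    // Fan out: one worker per CPU pulls pages off the channel and renders them.
    var wg sync.WaitGroup
    for i := 0; i < runtime.NumCPU(); i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for p := range jobs {
                out <- render(p)
            }
        }()
    }

    // The complete page collection exists before rendering starts, so every
    // worker can safely read site-wide data while it works.
    for _, p := range pages {
        jobs <- p
    }
    close(jobs)
    wg.Wait()
    close(out)

    for html := range out {
        fmt.Println(html)
    }
}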
Some other points that make the hugo server variant even faster than the regular hugo build:
Hugo uses a virtual file system, and we render directly to memory when in server/development mode.
We have some partial reloading logic in there. So, even though we render everything every time, we try to reload and rebuild only the content files that have changed, and we don't reload/rebuild templates if it is only a content change, etc.
I'm bep on GitHub, the main developer on Hugo.

You can see AllPages in hugolib/page_collections.go.
A git blame shows that it was modified in Sept. 2016 for Hugo v0.18 in commit 698b994, as part of PR 2297, "Fix Node vs Page".
That PR references the discussion/improvement proposal "Node improvements":
Most of the "problems" with this get much easier once we agree that a page is just a page that is just a ... page...
And that a Node is just a Page with a discriminator.
So:
Today's pages are Page with discriminator "page"
Homepage is Page with discriminator "home" or whatever
Taxonomies are Pages with discriminator "taxonomy"
...
They have some structural differences (pagination etc.), but they are basically just pages.
With that in mind we can put them all in one collection and add query filters on discriminator:
.Site.Pages: filtered by discriminator = 'page'
.Site.All: no filter
where: when the sequence is Pages, add discriminator = 'page', but let the user override
That key (the discriminator) makes it quick to retrieve all 'pages'.
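A hypothetical sketch of that single-collection idea (not the actual hugolib code -- Page, Kind and pagesOfKind are made-up names here):

package main

import "fmt"

// Page carries a discriminator ("kind") alongside its content, so home pages,
// taxonomy pages and regular pages can all live in one collection.
type Page struct {
    Kind  string
    Title string
}

// pagesOfKind is the "query filter on discriminator" from the proposal.
func pagesOfKind(all []Page, kind string) []Page {
    var out []Page
    for _, p := range all {
        if p.Kind == kind {
            out = append(out, p)
        }
    }
    return out
}

func main() {
    all := []Page{
        {Kind: "home", Title: "Home"},
        {Kind: "page", Title: "First post"},
        {Kind: "taxonomy", Title: "Tags"},
        {Kind: "page", Title: "Second post"},
    }

    // .Site.All: no filter; .Site.Pages: filtered by discriminator = "page".
    fmt.Println(len(all), len(pagesOfKind(all, "page"))) // prints: 4 2
}

The where behaviour described in the proposal is the same idea exposed at the template level: one collection, queried by its discriminator.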

Related

Getting most relevant content from page

I need to create a universal web scraper to parse articles on different websites. Of course, I know about XPath, but I want to try to make it universal for any website, regardless of the HTML markup of the page.
I need to determine whether there is an article on the page and, if there is, parse the title, body and tags (if they exist).
Frankly speaking, my knowledge of data science is not very deep, but I assume this task (determining whether the page is an article, and parsing only the needed parts) is possible to solve.
What tools should I use? Any help?
Actually, for the second task, I need to implement something similar to what Google Chrome does on mobile: when a page is not optimised for mobile, it proposes to show the page in an adaptive mode (just the title and the main content).
If you are using Python, some libraries to look at are:
scrapy, which scrapes data and can also extract some of the results, and
BeautifulSoup, which is more geared towards the extraction part itself.
It is possible to request a version of a website (e.g. for Chrome, Safari, Mobile, old-school systems) by creating a custom header for your scraper.
Have a look at the relevant documentation, and you can get an idea of how to use headers in scrapy here.
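The custom-header trick itself is not tied to any particular library; as a rough, standalone sketch of the idea (shown here with Go's standard net/http package, with a placeholder URL and an example mobile User-Agent string):

package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    // Placeholder URL -- swap in the page you actually want to scrape.
    req, err := http.NewRequest("GET", "https://example.com/some-article", nil)
    if err != nil {
        panic(err)
    }

    // Pretend to be a mobile browser so the server returns its mobile layout
    // (the exact User-Agent string here is just an example).
    req.Header.Set("User-Agent", "Mozilla/5.0 (Linux; Android 10) Mobile Safari/537.36")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }
    fmt.Println(resp.Status, len(body), "bytes of HTML to feed into your extractor")
}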
I do not know of any more specialised tools. Your tasks are more analytical and are typically not performed with models that estimate, for example, what content is where on a webpage. This might be an interesting research direction though: to see if you can create a model that generalises across many websites to extract the desired content.
That leads me on to my last point, which is that creating a single scraper that works for any website (containing your article type) is not usually possible. People create websites however they see fit, which also means they change them however they see fit. This usually leads to a good scraper requiring constant updates as time (and developers) move on.
EDIT:
Then, if you have lots of labelled examples, it might be possible to train a model. The challenge might be the look-back range of the model. For example, a typical LSTM model is given a parameter that tells it how far into the past to look, and that history is stored internally in its memory. In your case, you might be looking for the start and end HTML tags of an article, to then extract just that part. These tags could be thousands of words apart -- something a standard LSTM might not be fit to retain and use.
If you could pose your problem a little differently, then there are other approaches that might be plausible. E.g., you could make it a "question-answer" problem by saying: I have this HTML, where is the article content? If that sounds OK for your use case, have a look here for some model-based approaches.

How do I merge or even disable footnote links in asciidoc fop

I've got a rather large asciidoc document that I translate dynamically to PDF for our developer guide. Since the doc often refers to Java classes that are documented in our developer guide, we converted those references into links directly in the docs, e.g.:
In this block we create a new
https://www.codenameone.com/javadoc/com/codename1/ui/Form.html[Form]
named `hi`.
This works rather well for the most part and looks great in HTML as every reference to a class leads directly to its JavaDoc making the reference/guide process much simpler.
However, when we generate a PDF, some pages end up with a long run of repeated link footnotes at the bottom.
Normally I wouldn't mind a lot of footnotes or even repeats from a previous page. However, in this case the link to Container appears 3 times.
I could remove some of the links but I'd rather not since they make a lot of sense on the web version. Since I also have no idea where the page break will land I'd rather not do it myself.
This looks to me like a bug somewhere: if the link is the same, the footnote for it should only be generated once.
I'm fine with removing all link footnotes in the document if that is the price to pay, although I'd rather be able to do this on a case-by-case basis so that some links would remain printable.
Adding these two parameters in fo-pdf.xsl removes the footnotes:
<xsl:param name="ulink.footnotes" select="0"></xsl:param>
<xsl:param name="ulink.show" select="0"></xsl:param>
The first parameter disables footnotes, which makes the URLs re-appear inline.
The second parameter removes the URLs from the text. Links remain active and clickable.
Setting either parameter to a non-zero value turns its behaviour back on.
Source:
http://docbook.sourceforge.net/release/xsl/1.78.1/doc/fo/ulink.show.html
We were looking for something similar in a slightly different situation and didn't find a solution. We ended up writing a processor that just stripped away some of the links, e.g. every repeated link to the same URL within a section (one starting with '===').
Not an ideal situation, but as far as I know it's the only way.

Resources for Building Dynamic Lift Shopping Cart?

Here's what I'd like to do with Lift: I want to build a dynamic shopping cart, with lines able to be added and removed via AJAX calls. The total needs to be wired to the specific lines. Each line would include a number, the length of time for a lease, and a calculated price based on that, so I would have to add wired cells on each addable/removable line as well. So it would look something like this:
Number Length of Lease Price Remove?
(AJAX Textbox) (AJAX Dropdown Select) (Plain Updateable Text) (Ajax Checkbox)
(Another Row)...
+ Add
Total: ______
The problem I'm running into is that I can only find resources for building a static page that displays all of this via Wiring. Using the Lift Demo site, I can pull up code that will let me add new lines, but it doesn't seem conducive to removing lines (this in general is one of my frustrations with Lift at the moment: a "little extra detail" to change from a tutorial ends up requiring me to completely change tack and spend hours more on work and research, and I want to figure out how I'm probably approaching these problems wrongly!). Alternatively, I can use CSS selectors to dynamically create content, but I don't know how to effectively wire these together.
In addition, all of my attempts end up creating 2-3 times the amount of code I would have written to simply do some jQuery updates on the page, so I suspect that I'm doing something wrong and overcomplicating everything.
What resources would people recommend to set me on the right path?
These are your best resources for learning Lift:
Simply Lift
Lift Cookbook
Lift in Action
For any specific questions, I highly recommend you join us at the Lift Community Google Group. It is the official support channel for Lift. Although a few of us occasionally help out here on Stack Overflow, the best Lift help can be found there.

Which CodeIgniter Template Library fits requirements?

I would like to use tags like {{headline}} in CodeIgniter views instead of PHP, so I'm looking for a template parser. CodeIgniter has a built-in template parser: http://www.ellislab.com/codeigniter/user-guide/libraries/parser.html
The question is whether it's better to use the built-in parser or another one. Are there any limitations with the CI template parser, such as not supporting loops, if statements, etc.?
If so, there are a number of other parsers, but it seems that a developer works on one for a while and then it falls into a dormant state where it's no longer supported. I'm looking for a parser that will still be supported in a year:
Bucket
http://backstack.ca/projects/bucket/
Comper Template Parser
http://parser.comper.sk/en/
Ocular-Template-Library
http://github.com/lonnieezell/Ocular-Template-Library
Phil Sturgeon Template library
http://philsturgeon.co.uk/code/codeigniter-template
PyroCMS Lex Parser
http://github.com/pyrocms/lex
Template Library for CodeIgniter
http://www.williamsconcepts.com/ci/codeigniter/libraries/template/
The most active seem to be Comper and Lex Parser. What is the difference between the Phil Sturgeon Template library and the PyroCMS Lex Parser, given that they are from the same developer?
What I am looking for is:
- Separation of PHP and HTML/CSS in views
- Solidly supported so that it's not stalled within a year
- Use of simple tags but also loops, if statements and other functions
Can anyone give me a tip? The existing information on the CI forum and elsewhere has not been really useful.
Many thanks!
Philip
How to Choose a Template Engine
I went through a similar exercise when choosing a PHP/CMS system, and here are some points that may carry over to your decision-making process.
I first look at the documentation to get a sense of how much support there is for the system, to evaluate the range of features, and so on. I also see if there is an online forum with enough activity to get some help if needed.
I then try out the installation to see if it goes smoothly. If I have trouble at this stage, I may simply quit and try another system unless there is an online forum or help desk with a ready answer.
I then set up a sample website (2-3 pages) and try out the features that I need. In the case of CodeIgniter, I may have content stored in multiple database tables and I evaluate how much effort it takes to get the data from my SQL queries into the array structure that can be used by the template system. This is usually the step that takes the most effort when developing the website.
I also check to see how easily I could integrate a PHP function into the mix. For example, I once had to build a specialized function to determine a range of dates, and those dates had to be passed to the template engine. I was able to do it, but it took a lot of effort. The template system had almost no support for parsing dates, and I had to resort to a PHP function to do the work.
Summary
Ultimately, you will need to try a few of these systems out to get a feel for them. Once that is done, pick one that makes sense for your coding style, ease of use, and your data structure.
PS
I have not used the systems that you listed above, but I have spent quite a bit of time using the template engine in Expression Engine (the CMS from the same group that created CodeIgniter). My comments are based on my experience implementing database-driven websites with Expression Engine and dealing with the limitations and quirks of that particular platform.

CodeIgniter pagination

I need two paginations on one page. Is it possible to do this with CodeIgniter? Of course, they must operate independently of one another.
Yes and no. If you want two different pagination visuals (customized renderings of the library), then sure. The problem you'll run into is that, by default, the pagination library automatically pulls the current page out of your $ci->uri->segments() list to determine which page to mark as "active".
I do not know of a way to explicitly override this. Perhaps if you made a MY_Pagination that took an additional $config value for the current page, you could get it to behave like that. I haven't looked at the library's code in a while, so you'd have to do some digging.
Honestly though, I'd suggest you build your own; it's not incredibly hard to do some simple math to determine which numbers to link.
Also, you'll run into issues with CI's Pagination library if you want the "current page" part to be anything other than the last segment in your URL. This may have been fixed lately, but last time I looked it was the deal-breaker that kept me from using the library at all.
Bottom line: invest the time in making your own if you want more than its basic functionality. It's simple enough; just make yours reusable if you can.
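For what it's worth, the "simple math" is roughly this (a language-agnostic sketch shown here in Go rather than CodeIgniter's PHP; nothing below comes from the CI library):

package main

import "fmt"

// pageNumbers returns the page links to show around the current page,
// e.g. a window of `around` pages on each side, clamped to the valid range.
func pageNumbers(totalItems, perPage, current, around int) []int {
    totalPages := (totalItems + perPage - 1) / perPage // ceiling division

    first := current - around
    if first < 1 {
        first = 1
    }
    last := current + around
    if last > totalPages {
        last = totalPages
    }

    var pages []int
    for p := first; p <= last; p++ {
        pages = append(pages, p)
    }
    return pages
}

func main() {
    // 95 items, 10 per page, currently on page 5: query offset 40, links 3..7.
    offset := (5 - 1) * 10
    fmt.Println(offset, pageNumbers(95, 10, 5, 2)) // prints: 40 [3 4 5 6 7]
}

Each pagination on the page just gets its own current-page value (from a query string or a different URL segment), so the two can't step on each other.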
