Generate PDF from content in Magnolia CMS - pdf-generation

We would like to generate a PDF document for a single page. Only this link (and the other discussion linked from there) talks about this subject, and the information given is quite slim.
Could anybody share any success stories made so far including source-code?
Has someone succeeded in using wkhtmltopdf?
(we plan to use Magnolia 4.5.6)

After evaluating both Aspose.pdf (a commercial product) and iText, we went with LaTeX. We had some quite specific requirements (e.g. a two-column layout with footnotes, very large tables), which were not possible with the two products mentioned above.
We are very happy with this solution, but a few things should be noted: first and foremost, you leave the JVM, and second, LaTeX is itself another macro language to be learned. The quality of the output, however, is very good.
wkhtmltopdf is used in another project, and the output is also good for more straightforward formatting.
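If you go the wkhtmltopdf route, the integration is essentially just shelling out to the binary. A minimal sketch, shown in Python for brevity (the call looks the same from any language that can spawn a process); it assumes the wkhtmltopdf binary is installed and on the PATH, and the URL and filename are placeholders:

import subprocess

def render_pdf(url, output_path):
    # wkhtmltopdf takes the source URL (or local file) first, then the output file.
    # Assumes the wkhtmltopdf binary is installed and on the PATH.
    subprocess.run(["wkhtmltopdf", url, output_path], check=True)

# Placeholder URL; point this at the page you want to export.
render_pdf("https://example.com/my-page.html", "my-page.pdf")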

Related

Getting most relevant content from page

I need to create a universal web scraper to parse articles on different websites. Of course, I know about XPath, but I want to try to make it universal for any website, regardless of a page's HTML markup.
I need to determine whether there is an article on the page and, if there is, parse the text of the title, body and tags (if they exist).
Frankly speaking, my knowledge of data science is not very deep, but I assume this task (determining whether the page contains an article, and parsing only the needed parts) can be solved.
What tools should I use? Any help?
Actually, for the second task, I need to implement something similar to what Google Chrome mobile does: when a page is not optimised for mobile, it proposes to show the page in adaptive mode (just the title and main content).
If you are using Python, some libraries to look at are:
scrapy, which scrapes data and can extract some of the results, and
BeautifulSoup, which is more geared towards the extraction part itself.
It is possible to request a specific version of a website (e.g. for Chrome, Safari, mobile, old-school systems) by sending a custom header from your scraper.
Have a look at the relevant documentation, and you can get an idea of how to use headers in scrapy here.
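To make both points concrete, here is a rough sketch using requests and BeautifulSoup; the User-Agent string and the <h1>/<article> heuristics are assumptions that will not hold on every site, and in scrapy you would pass the same headers dict to scrapy.Request instead:

import requests
from bs4 import BeautifulSoup

# Example mobile User-Agent string; swap in whichever client you want to emulate.
HEADERS = {"User-Agent": "Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 Mobile Safari/537.36"}

def extract_article(url):
    response = requests.get(url, headers=HEADERS, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    # Naive heuristics: many sites wrap the article in an <article> tag and put
    # the headline in <h1>, but this varies from site to site.
    title = soup.find("h1")
    body = soup.find("article")
    return {
        "title": title.get_text(strip=True) if title else None,
        "body": body.get_text(" ", strip=True) if body else None,
    }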
I do not know of any more specialised tools. Your tasks are more analytical, and are typically not performed with models that estimate e.g. what content is where on a webpage. This might be an interesting research direction, though: to see if you can create a model that generalises across many websites to extract the desired content.
That leads me on to my last point, which is that creating a single scraper that works for any website (containing your article type) is not usually possible. People build websites however they see fit, which also means they change them however they see fit. This usually leads to a good scraper requiring constant updates as time (and developers) move on.
EDIT:
Then if you have lots of labelled examples, it might be possible to train a model. The challenge might be the look-back range of the model. For example, a typical LSTM model is given a parameter that tells it how far to look back into the past; information is stored within its memory internally. In your case, you might be looking for the start and end HTML tags of an article, to then extract just that part. These tags could be thousands of words apart, which is something a standard LSTM might not be fit to retain and use.
If you could pose your problem a little differently, then there are other approaches that might be plausible. For example, you could frame it as a "question-answer" problem: I have this HTML, where is the article content? If that sounds OK for your use case, have a look here for some model-based approaches.
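For illustration only, a question-answering pipeline from the transformers library could be wired up roughly like this; the model name is just an example, and in practice you would strip the HTML down to visible text and chunk it, since whole pages exceed typical context windows:

from transformers import pipeline

# Any extractive question-answering model can be substituted here.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

# Raw HTML is usually far too long for the model's context window,
# so strip tags and/or chunk the text before asking the question.
page_text = "...visible text extracted from the page..."
result = qa(question="What is the article content?", context=page_text)
print(result["answer"], result["score"])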

Generate EDGAR FTP File Path List

I'm brand new to programming (though I'm willing to learn), so apologies in advance for my very basic question.
The SEC makes all of its filings available via FTP, and eventually I would like to download a subset of these files in bulk. However, before creating such a script, I need to generate a list of the locations of these files, which follow this format:
/edgar/data/51143/000005114313000007/0000051143-13-000007-index.htm
51143 = the company ID, and I already accessed the list of company IDs I need via FTP
000005114313000007/0000051143-13-000007 = the report ID, aka "accession number"
I'm struggling with how to figure this out as the documentation is fairly light. If I already have the 000005114313000007/0000051143-13-000007 (what the SEC calls the "accession number") then it's pretty straightforward. But I'm looking for ~45k entries and would obviously need to generate these automatically for a given CIK ID (which I already have).
Is there an automated way to achieve this?
Welcome to SO.
I'm currently scraping the same site, so I'll explain what I've done so far. What I am assuming is that you'll have the CIK numbers of the companies you're looking to scrape. If you search the company's CIK, you'll get a list of all of the files that are available for the company in question. Let's use Apple as an example (since they have a TON of files):
Link to Apple's Filings
From here you can set a search filter. The document you linked was a 10-Q, so let's use that. If you filter 10-Q, you'll have a list of all of the 10-Q documents. You'll notice that the URL changes slightly, to accommodate the filter.
You can use Python and its web scraping libraries to take that URL and scrape all of the URLs of the documents in the table on that page. For each of these links you can scrape whatever links or information you want off the page. I personally use BeautifulSoup4, but lxml is another choice for web scraping, should you choose Python as your programming language. I would recommend using Python, as it's fairly easy to learn the basics and some intermediate programming constructs.
Past that, the project is yours. Good luck, I've posted some links below to get you started. I'm only allowed to post two links since I'm new to the site, so I'll give you the beautiful soup link:
Beautiful Soup Home Page
If you choose to use Python and are new to the language, check out the codecademy python course, and don't forget to check out lxml, as some people prefer it over BeautifulSoup (some people also use both in conjunction, so it's all a matter of personal preference).
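To make the scraping step a little more concrete, here is a rough BeautifulSoup sketch for pulling the filing index links for a given CIK; the query parameters, the "documentsbutton" id and the User-Agent string reflect the EDGAR browse page layout and policies at the time of writing and may need adjusting:

import requests
from bs4 import BeautifulSoup

# EDGAR asks scrapers to identify themselves; the string below is a placeholder.
HEADERS = {"User-Agent": "your-name your-email@example.com"}

def filing_index_urls(cik, form_type="10-Q"):
    # The query parameters and the "documentsbutton" id match the EDGAR browse
    # page layout at the time of writing and may need adjusting.
    url = ("https://www.sec.gov/cgi-bin/browse-edgar"
           f"?action=getcompany&CIK={cik}&type={form_type}&count=100")
    soup = BeautifulSoup(requests.get(url, headers=HEADERS).text, "html.parser")
    # Each filing row has a "Documents" button linking to its index page,
    # whose path contains the accession number you are after.
    return ["https://www.sec.gov" + a["href"]
            for a in soup.find_all("a", id="documentsbutton")]

# Apple's CIK, as in the example above.
for link in filing_index_urls("0000320193"):
    print(link)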

Algorithm to find out if a website is a blog?

This is a creative one :-)
I'll be receiving a list of hundreds of new URLs regularly and want to find out if they are linking to a blog or not - between 80% and 95% accuracy would be sufficient.
Obviously I need to analyze the HTML of the page - but how exactly would you approach this (e.g. meta tags, structural analysis, pattern matching, machine learning ...)?
I would look at the generator <meta> tag for known blog editors. For example, here's how it looks for WordPress:
<meta name="generator" content="WordPress.com" />
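A minimal sketch of that check with BeautifulSoup might look like this (the list of generators is just an example to be extended):

import requests
from bs4 import BeautifulSoup

# Known blog generators to look for; extend this list as needed.
KNOWN_GENERATORS = ("wordpress", "blogger", "typepad", "movable type")

def looks_like_blog(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    tag = soup.find("meta", attrs={"name": "generator"})
    if tag and tag.get("content"):
        return any(name in tag["content"].lower() for name in KNOWN_GENERATORS)
    return False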
Building on Darin's solution, I would look for the generator <meta> tag for known blog editors and combine it with a lookup table of common hosting sites, e.g. WordPress.com, Blogspot.com, LiveJournal.com, and so forth. That should give you 80-95% in the near term, though it won't be robust enough for an ongoing process over an extended period of time.
An extended solution is much harder, given the amorphous definition of the term "blog". In that case, you'll want to consider breaking the list down by hosting site and defining characteristics, and creating hard-and-fast rules on what constitutes a blog:
Is it hosted by a blogging service provider?
Is it listed in a blog aggregator, such as Technorati?
Does it include blog-like services, such as user-generated articles, tags, and the ability to comment?
Does it provide meta information that I can use to easily identify it as a blog?
Does it otherwise identify itself as a blog, via the inclusion of the term "blog" or some other criteria?
I can easily see a neural network being constructed to determine whether a page is a blog or not, but this severely oversteps the bounds of your requirements. I'd say start simple, then extend your solution relative to the proposed lifetime of your system.
The above suggestions are good, and probably will work if you're aiming for 80-90% accuracy.
I would go one step further and look for any .xml RSS feed, either in a meta tag or as a link. Then check the feed to see if there are any comment tags (since there are feeds for other purposes too). I would OMIT this check for certain blog platforms that don't give you a feed, such as Tumblr.
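A rough sketch of that feed check (the <link type="application/rss+xml"> convention and the <comments> heuristic are assumptions that most, but not all, blog platforms follow):

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def find_feed_url(page_url):
    # Feeds are usually advertised via <link type="application/rss+xml" ...>.
    soup = BeautifulSoup(requests.get(page_url, timeout=10).text, "html.parser")
    link = soup.find("link", type="application/rss+xml")
    return urljoin(page_url, link["href"]) if link and link.get("href") else None

def feed_has_comments(feed_url):
    # Crude heuristic: blog feeds usually carry per-item <comments> elements,
    # while feeds for other purposes usually do not.
    return "<comments>" in requests.get(feed_url, timeout=10).text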

Inline data representation

I would like to represent data that gives an overview but allows the user to drill down in an inline fashion - so if you had a grouping of, say, 6 objects, the user could expand the data and it would show the 6 objects immediately below it, before any more high-level data.
It would appear that MSHFlexGrid gives this ability, but I can't find any information about actually using it, or what its limitations are (can you have differing numbers of fields and/or can they have different spacing, what about column headers, indentation at the start, etc.).
I found this site, but the images are broken (in IE8 and FF3.5). Google searches show people just using the flat data representation, but nothing using the hierarchical properties. Does anyone know any good tutorials or forums with a good discussion about pitfalls?
Due to the lack of information about using it, I am thinking of coding my own version, but if anyone has done work in this area I haven't found it - I would have thought it would be a natural wish for data representation. If someone has coded a version of this (in any language), I wouldn't mind reading about it - maybe my idea of how to do it wouldn't be the best way.
You might want to check out vbAccelerator. He has a Multi-Column Treeview control that sounds like what you may be looking for. He gives you the source and has some pretty decent samples.
Have you looked at the MSHFlexGrid reference pages and the "Using the MSHFlexGrid" topic in the Visual Basic manual?
Sorry if you've already looked at these!

What simple syntax can be used for rich text?

In an application, I want a simple text input enriched with some marks to include formatting or semantic labeling. I want the syntax to be as easy as possible, and I want to be able to include self-defined labels.
Example:
[bold]Stackoverflow[/bold] is a [tag]good[/tag] resource for programmers.
Tables would be needed too.
HTML/XML and LaTeX are powerful enough to allow this, but too complicated. Wiki syntax seems simple, but it uses a different symbol for each markup, has unclear quoting, and every wiki seems to have a different syntax. For tables and similar things, wiki syntax becomes very complicated.
Is there a language/syntax that matches my needs, or that can be slightly changed to do so? Or do I have to invent something myself? In that case, do you have suggestions?
Definitely do NOT invent your own. There are plenty of simple markup languages already, and users HATE learning new ones. Trust me on this!
I would suggest using one of the following:
Textile
Markdown
BBCode
Make your decision based on your userbase, as well as what tools and parsers are available in your chosen language. For my site, we went with Textile, but I've found that BBCode tends to be the language that most people already know. However, this will vary with different user demographics.
StackOverflow, along with several other sites, uses Markdown. I think it will give you the best balance between features and simplicity.
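If your application happens to be in Python, rendering Markdown is nearly a one-liner with the Python-Markdown package. A small sketch; the "tables" extension is included as an example because the question asks about tables:

# pip install markdown  (the Python-Markdown package)
import markdown

text = "**Stackoverflow** is a *good* resource for programmers."
# The "tables" extension adds pipe-table support on top of core Markdown.
html = markdown.markdown(text, extensions=["tables"])
print(html)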
Let me add ReStructuredText to the list.
An additional benefit of using it is the availability of the ReStructuredText to Anything service, which makes it extremely easy to create HTML or PDF versions of the document.
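For comparison, a small sketch of rendering reStructuredText to HTML locally with docutils; the table uses reST's simple-table syntax, and publish_parts returns a dict of rendered fragments:

# pip install docutils
from docutils.core import publish_parts

source = """
**Stackoverflow** is a *good* resource for programmers.

=====  =====
Col A  Col B
=====  =====
1      2
=====  =====
"""

# "html_body" holds just the rendered document content.
print(publish_parts(source=source, writer_name="html")["html_body"])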
As already pointed out, there are a lot of lightweight markup languages (many are listed in this Wikipedia article); there should be no need to create your own.