batch html file editing - html-editor

I have a collection of one thousand HTML files and need to trim them. I need to delete all the tags inside the <body></body> area of each file except for one, <div.pg>, so that they print cleanly. The excess is navigation links, which make the prints messy and make the pages take up more paper. The contents are not the same in every file, so I can't find and replace a fixed code excerpt, but the tags are the same; for example, there are 3 <table> tags to be deleted, each with a specific class. How can I manipulate specific tags inside a batch of HTML files?
Any batch processing technique or software to do this job?
What is an easy solution on Windows?

I would use an XSLT transform on each HTML page you have. Batch is not the tool to manipulate HTML files. You can use batch as a "manager" to pass the required file to the XSL transform. Also, Windows has a rudimentary msxml utility which you can download and install on your machine: http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=21714
That's how I would do it. I am sure there are more options.

If it is XHTML you could use XSLT to transform your HTML to "another" format. Look for example here: http://www.w3schools.com/xsl/ or here: http://help.hannonhill.com/discussions/how-do-i/269-strip-specific-html-tag-in-xslt
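
As an alternative to the XSLT route suggested above (not something the answers propose), the same clean-up can be sketched in Python with only the standard library. This is a minimal sketch under stated assumptions: <div.pg> is taken to mean a div whose class attribute contains "pg", the markup is assumed to be reasonably well nested, the files are assumed to be UTF-8, and the names KeepDivPg, keep_only_div_pg and clean_folder are made up for illustration.

```python
from html.parser import HTMLParser
from pathlib import Path

# Elements that never have a closing tag, so they must not change the depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class KeepDivPg(HTMLParser):
    """Re-emit a page, dropping everything in <body> except <div class="pg">."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_body = False
        self.depth = 0  # nesting depth inside the kept div; 0 means outside it

    def _keeping(self):
        return not self.in_body or self.depth > 0

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        if tag == "body":
            self.in_body = True
            self.out.append(raw)
            return
        if self.in_body and self.depth == 0:
            # Assumption: the div to keep is marked with class "pg".
            classes = (dict(attrs).get("class") or "").split()
            if tag == "div" and "pg" in classes:
                self.depth = 1
                self.out.append(raw)
            return
        if self.depth > 0 and tag not in VOID:
            self.depth += 1
        self.out.append(raw)

    def handle_startendtag(self, tag, attrs):
        if self._keeping():
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
            self.depth = 0
            self.out.append("</body>")
        elif not self.in_body:
            self.out.append(f"</{tag}>")
        elif self.depth > 0 and tag not in VOID:
            self.out.append(f"</{tag}>")
            self.depth -= 1

    def handle_data(self, data):
        if self._keeping():
            self.out.append(data)

    def handle_entityref(self, name):
        if self._keeping():
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self._keeping():
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if self._keeping():
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def keep_only_div_pg(html):
    parser = KeepDivPg()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def clean_folder(src_dir, dst_dir):
    """Write a cleaned copy of every .html file in src_dir to dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.html"):
        text = path.read_text(encoding="utf-8", errors="replace")  # assumed encoding
        (dst / path.name).write_text(keep_only_div_pg(text), encoding="utf-8")
```

Calling clean_folder("pages", "pages_clean") with hypothetical folder names would leave the originals untouched and write the trimmed copies next to them, which makes it easy to compare a few files before printing.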

Related

Migrate from bookdown to pure Pandoc: split the HTML output in one page per section

I have a book project in RMarkdown, but since I do not use Knitr or other RMarkdown specific features I am considering switching to pure Pandoc to remove the R burden from the dependencies.
For what concerns PDF and ePub output it seems all straightforward to me, but I have some troubles with the HTML output. In fact Pandoc generates a single HTML file with the entire book.
With Bookdown I used the gitbook HTML output, which generates a page for each section; each page has the complete TOC in the left sidebar and its footnotes and partial bibliography at the bottom.
To achieve this I thought to write a md file for each section and convert them one by one with Pandoc (for the HTML output, and merge them to one unique file for converting to PDF and ePub), but in this way I cannot have references across sections, have a full bibliography at the end and also easily create a TOC.
So my question is if there is an easy way (e.g. a Pandoc filter or a script) to generate an HTML book (similar to gitbook in behavior, the style doesn't matter) without installing R and Bookdown?
Pandoc follows the philosophy of only writing files that have explicitly been specified on the command line. This is why no such feature is built in.
It would be possible to do what you want with the help of a custom writer. The basics would be doable in a few lines of Lua code, but it's likely that you'd have to implement all bookdown features yourself.
The best (IMHO) alternative is to use Quarto, a standalone tool built on top of pandoc, created in part by the authors of bookdown. That way you can remove R from your dependencies but retain the features of bookdown -- and more.
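
For a rough idea of what the "do it yourself" route involves, here is a minimal sketch that post-processes the single HTML file instead of writing a Lua custom writer. It is not what the answer describes: it is a swapped-in approach in Python, and it assumes that every top-level section starts at an <h1> tag. The function names and the section-N.html file names are hypothetical. Cross-references between sections and footnotes are not handled, which is exactly the kind of bookdown feature that would have to be reimplemented.

```python
import re

# Zero-width split point in front of every <h1 ...> opening tag.
H1_START = re.compile(r"(?=<h1[\s>])", re.IGNORECASE)
H1_TEXT = re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def split_by_h1(body_html):
    """Split an HTML fragment into chunks that each start at an <h1>."""
    return [chunk for chunk in H1_START.split(body_html) if chunk.strip()]


def chunk_title(chunk, fallback):
    """Return the plain text of the chunk's <h1>, or the fallback if none."""
    match = H1_TEXT.search(chunk)
    return TAG.sub("", match.group(1)).strip() if match else fallback


def build_pages(body_html):
    """Map hypothetical file names to pages that all share the same TOC."""
    chunks = split_by_h1(body_html)
    names = [f"section-{i}.html" for i in range(len(chunks))]
    titles = [chunk_title(c, "Front matter") for c in chunks]
    toc = "<ul>" + "".join(
        f'<li><a href="{name}">{title}</a></li>'
        for name, title in zip(names, titles)
    ) + "</ul>"
    return {
        name: f"<html><body><nav>{toc}</nav>{chunk}</body></html>"
        for name, chunk in zip(names, chunks)
    }
```

Each value in the returned dictionary can then be written to a file of the same name; styling and the sidebar layout are left out on purpose.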

Convert .docx to Html without CSS with Docx4j

I'm trying to "upload" an html-converted .docx file into a CKEditor. So far, the conversion from .docx to html is nearly perfect and I'm able to pass the code from Java (Spring/Maven) to my webapp (ZK framework, using native CKEditor and JavaScript).
The problems I've had so far revolve around the fact that the loaded text is either half-formatted or not formatted at all, and that's the actual reason I'm working on this (to avoid the loss of formatting caused by copy-pasting). I've managed to find the reason for this behaviour: CK likes HTML tags or won't use multiple styles per container (i.e. style="font-weight: bold" is OK, but style="font-style: italic; font-weight: bold" isn't; it will pick one or the other), and Docx4j uses inline styling for formatting because of XHTML (as far as I've read).
After that I tried to force the styles in CKEditor by the config file, but that wasn't the solution as one element will overwrite the another, resulting in only one style being used.
With all that, I decided to manipulate a test docx (It's literally a "hello world" line bold, with italics and underline), converted it and forced the tags b, i and u on the resulting HTML file through Java. The result was the desired one.
Now my focus is to configure docx4j so it uses tags instead of inline CSS, as so far it's the "easiest" solution and I liked the resulting html from it. After reading some more I came across an old class with a method that (by name) will do that, but it's not present in my imported library. I tried both the new and the old methods to convert to html, but the results are the same.
Is there a setting or a way to let docx4j (v8.2.3 reference) know that I want html tags instead of CSS styles? I've seen the examples and looked into the javadoc, but it's a bit outdated and didn't really help me that much. This seems to be the only way to do this, other than building my own parser, which is simply not an option due to time constraints.
Thanks!
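
The manual workaround described in the question (forcing b, i and u tags onto the converted HTML) can be sketched as a small post-processing step. This is not a docx4j setting, and the question did this step in Java; the sketch below is in Python purely for illustration. It assumes the converter emits simple, non-nested <span style="..."> elements, which the question does not confirm, and it discards any other declarations (such as colour) on a span it rewrites. The name spans_to_tags is hypothetical.

```python
import re

# (CSS property, value to look for, replacement tag)
STYLE_TO_TAG = [
    ("font-weight", "bold", "b"),
    ("font-style", "italic", "i"),
    ("text-decoration", "underline", "u"),
]

# Assumption: styled runs are plain, non-nested <span style="..."> elements.
SPAN = re.compile(r'<span\s+style="([^"]*)"\s*>(.*?)</span>',
                  re.IGNORECASE | re.DOTALL)


def parse_style(style):
    """Turn 'a: b; c: d' into {'a': 'b', 'c': 'd'} (lower-cased)."""
    decls = {}
    for item in style.split(";"):
        if ":" in item:
            prop, value = item.split(":", 1)
            decls[prop.strip().lower()] = value.strip().lower()
    return decls


def spans_to_tags(html):
    """Replace bold/italic/underline spans with nested <b>, <i>, <u> tags."""
    def replace(match):
        decls = parse_style(match.group(1))
        tags = [tag for prop, value, tag in STYLE_TO_TAG
                if value in decls.get(prop, "")]
        if not tags:
            return match.group(0)  # leave spans with other styling untouched
        opening = "".join(f"<{t}>" for t in tags)
        closing = "".join(f"</{t}>" for t in reversed(tags))
        return opening + match.group(2) + closing

    return SPAN.sub(replace, html)
```

Nesting one tag per style sidesteps the "one style per container" behaviour described above, because each container then carries a single piece of formatting.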

fenced_divs pandoc extension in RMarkdown

Is there a way, either in YAML or within an R script/Rmd, to turn on the fenced_divs pandoc extension?
If possible, I would prefer being able to turn on fenced_divs without having to specify it inside each individual output format in the YAML block but rather once, globally.
The reason is that I want to have within-document links to items that are not headers using the same code for .docx and .html.
Thanks.

how to use markdown and eco together?

I want to have a template variable pre-processed in a markdown doc.
I tried converting the filename to file.html.md.eco but it just comes out as plain text - ie the markdown plugin doesn't seem to get applied.
The file just as html.md renders fine.
Do I need to add the plugins to docpad.coffee to make sure they're applied when using multiple passes?
The FAQ states how to use multiple processors:
http://docpad.org/docs/faq
... Alternatively, we can get pretty inventive and do something like this: .html.md.eco which means process this with Eco, then Markdown and finally render it as HTML.

Using Processing Sketches With Tabs In Processing JS

I've got a Processing sketch that I'd like to display on my site with Processing.js rather than as a Java applet, however I'm not sure it supports tabs - or classes. Does it need to be written as procedural script, or is there an <include> I can use - or another option?
Thanks
You can also include multiple .pde files in the html canvas tag separated by spaces
<canvas data-processing-sources="hello-web.pde class.pde"></canvas>
mentioned near the top of this page: http://processingjs.org/reference/articles/jsQuickStart
I've answered a Processing-related question and used classes, but I simply pasted the class after the rest of the program. I don't know if this fully answers your question, but here's an example
