Generating PDF from OpenXml

I am trying to find an SDK that can generate PDF from OpenXml. I have used the Open XML Power Tools to convert the Open XML to HTML and then used iTextSharp to parse the HTML to PDF, but the result is a terrible-looking PDF.
I have not yet tried iText's RTF parser. If I go in this direction, I will also need an RTF converter, turning a simple conversion into a double-step nightmare.
It almost looks like I might end up writing a custom converter based on the Power Tools OpenXml-to-HTML converter. Any advice is appreciated. At this time I really can't go for a professional converter, as the licenses are too expensive (Aspose Words / TX Text Control).
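For reference, here is roughly what the current pipeline looks like. This is only a sketch: HtmlConverter and HtmlConverterSettings come from the Power Tools samples and HTMLWorker from iTextSharp, so treat the exact names and signatures as assumptions from the versions I have been using.
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open("input.docx", false))
{
    // Step 1: Open XML -> HTML via Power Tools
    HtmlConverterSettings settings = new HtmlConverterSettings();
    XElement html = HtmlConverter.ConvertToHtml(wordDoc, settings);

    // Step 2: HTML -> PDF via iTextSharp's simple HTML parser
    Document pdfDoc = new Document();
    PdfWriter.GetInstance(pdfDoc, new FileStream("output.pdf", FileMode.Create));
    pdfDoc.Open();
    foreach (IElement element in HTMLWorker.ParseToList(new StringReader(html.ToString()), null))
        pdfDoc.Add(element);
    pdfDoc.Close();
}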
I thought I would put some more effort into my investigation. I went back to the conversion utility "http://msdn.microsoft.com/en-us/library/ff628051.aspx" and looked through its code. The biggest thing it missed was reading the underlying styles and generating a style attribute. With that added, the PDF looked much better, with the limitation of not handling custom TrueType fonts. More investigation tomorrow. I am hoping someone has done something like this or faced the same weird issues and can shed some light. Here is the style-extraction helper I added:
private static StringDictionary GetStyle(XElement el)
{
    IEnumerable<XElement> jcL = el.Elements(W.jc);
    IEnumerable<XElement> rPL = el.Elements(W.rPr);
    StringDictionary sd = new StringDictionary();
    if (HasAttribute(jcL, W.val)) sd.Add("text-align", GetAttribute(jcL, W.val));
    // run properties exist
    if (rPL.Count() > 0)
    {
        XElement r = rPL.First();
        if (r.Element(W.b) != null) sd.Add("font-weight", "bolder");
        if (r.Element(W.i) != null) sd.Add("font-style", "italic");
        if (r.Element(W.u) != null) sd.Add("text-decoration", "underline");
        if (r.Element(W.color) != null && HasAttribute(r.Element(W.color), W.val))
            sd.Add("color", "#" + GetAttribute(r.Element(W.color), W.val));
        if (r.Element(W.rFonts) != null)
        {
            // prefer the complex-script font, fall back to hAnsi
            if (HasAttribute(r.Element(W.rFonts), W.cs))
                sd.Add("font-family", GetAttribute(r.Element(W.rFonts), W.cs));
            else if (HasAttribute(r.Element(W.rFonts), W.hAnsi))
                sd.Add("font-family", GetAttribute(r.Element(W.rFonts), W.hAnsi));
        }
        // w:sz is measured in half-points, so divide by 2 to get CSS points
        if (r.Element(W.sz) != null && HasAttribute(r.Element(W.sz), W.val))
            sd.Add("font-size", (double.Parse(GetAttribute(r.Element(W.sz), W.val)) / 2) + "pt");
    }
    return sd.Keys.Count > 0 ? sd : null;
}
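For context, here is (roughly) how I consume the dictionary when emitting the HTML; the paragraph and element names are just illustrative:
// Illustrative only: flatten the dictionary into a CSS style attribute.
StringDictionary sd = GetStyle(paragraph);
if (sd != null)
{
    string css = string.Join(";", sd.Keys.Cast<string>().Select(k => k + ":" + sd[k]));
    htmlParagraph.Add(new XAttribute("style", css));
}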

I don't know of any direct converter with source code available, but yes, my thought is that you may need to build a converter from scratch. Luckily (I guess), Word's WordprocessingML is the simplest of the Open XML formats, and you can look to other projects for inspiration, such as:
TextGlow - Word to Silverlight converter
Word to XAML Converter - Word to XAML converter (probably very similar to TextGlow above)
OpenXML-DAISY - conversion to DAISY
ODF Converter - convert from/to OpenOffice formats and OpenXML
The XHTML solution by Eric White you already referenced.
For commercial & server-side solutions, you can use either Word Automation Services (requires SharePoint) or Aspose.Words for .NET.

Any ar js multimarkers learning tutorial?

I have been searching for an ar.js multimarkers tutorial or anything that explains how it works, but all I can find is two examples with no tutorials or explanations.
So far, I understand that it requires learning the pattern or order of the markers, which it then stores in localStorage. This data is used later to display the image.
What I don't understand is how this "learner" is implemented. Also, the learning process is only used once, by the "creator", right? The output file should be stored and then served later when needed, not created from scratch on each person's phone or computer.
Any help is appreciated.
Since the question is mostly about the learner page, I'll try to break it down as much as I can:
1) You need to have an array of {type, URL} objects.
A sample of creating the default array is shown below (source code):
var markersControlsParameters = [
    {
        type : 'pattern',
        patternUrl : 'examples/marker-training/examples/pattern-files/pattern-hiro.patt',
    },
    {
        type : 'pattern',
        patternUrl : 'examples/marker-training/examples/pattern-files/pattern-kanji.patt',
    }
]
2) You need to feed this to the 'learner' object.
By default the above object is encoded into the URL (source) and then decoded by the learner site. What is important happens on the site:
for each object in the array, an ArMarkerControls object is created and stored:
markersControlsParameters.forEach(function(markerParams){
    var markerRoot = new THREE.Group()
    scene.add(markerRoot)
    // create markerControls for our markerRoot
    var markerControls = new THREEx.ArMarkerControls(arToolkitContext, markerRoot, markerParams)
    subMarkersControls.push(markerControls)
})
The subMarkersControls array is used to create the object that does the actual learning. At long last:
var multiMarkerLearning = new THREEx.ArMultiMakersLearning(arToolkitContext, subMarkersControls)
The example learner site has multiple utility functions, but as far as I know, the most important here are the ArMultiMakersLearning members, which can be used in the following order (or any other):
// this method resets previously collected statistics
multiMarkerLearning.resetStats()
// this member flag enables data collection
multiMarkerLearning.enabled = true
// this member flag stops data collection
multiMarkerLearning.enabled = false
// To obtain the 'learned' data, simply call .toJSON()
var jsonString = multiMarkerLearning.toJSON()
That's all. If you store the jsonString as
localStorage.setItem('ARjsMultiMarkerFile', jsonString);
then it will be used as the default multimarker file later on. If you want a custom name or more areas, you'll have to modify the name in the source code.
3) The debug UI (as of 2.1.4)
It seems that the debug UI is broken: the UI buttons do exist but are nowhere to be seen. A hot fix would be reusing the 'markersAreaEnabled' span style for the div containing the buttons (see this source bit); a sketch follows below.
It's all in this glitch; you can find it under the phrase 'CHANGES HERE' in the arjs code.
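Something along these lines, though note that both selectors below are placeholders; inspect the learner page to find the real ids in your copy:
// Hot-fix sketch: copy the inline style of the visible 'markersAreaEnabled'
// span onto the div containing the learner buttons (selectors are hypothetical).
var visibleSpan = document.querySelector('#markersAreaEnabled');
var buttonContainer = document.querySelector('#learnerButtons');
if (visibleSpan && buttonContainer) {
    buttonContainer.style.cssText = visibleSpan.style.cssText;
}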

Parsing xml with Go, ignoring nested elements?

I am trying to parse an HTML document with the Go xml parser. I have managed to extract all the <li> elements, but if an element contains a link <a>, then the content of the link is ignored. I would like to just ignore the nested <a> and display its content as plain text, but I don't know how.
Here is my code:
d := xml.NewDecoder(resp.Body)
d.Strict = false
d.AutoClose = xml.HTMLAutoClose
d.Entity = xml.HTMLEntity
type list_item struct {
    Data string `xml:",chardata"`
}
for {
    t, _ := d.Token()
    if t == nil {
        break
    }
    switch se := t.(type) {
    case xml.StartElement:
        if se.Name.Local == "li" {
            var q list_item
            d.DecodeElement(&q, &se)
            c.Infof("%+v\n", q)
        }
    }
}
Is there any way to just ignore nested elements and display their content?
Consider using a specialized package for parsing HTML. In general, HTML is not XML (XHTML 1.0 is, but documents formatted with it are not very common, and that standard has been deprecated).
An even better approach in my opinion, given your apparent use case, would be using XPath to extract the necessary information with a query.
As to the question as stated, I think there's no built-in way to do what you want: xml.Decoder implements the Skip() method, but that only allows you to skip over unneeded content, and there's nothing that returns "inner XML" as is. You could roll this yourself using xml.Decoder's RawToken(): immediately render whatever it returns until it returns something denoting the end element you're looking for (you'll have to implement support for handling nested elements).
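A rough, untested sketch of that approach, reusing the decoder d from the question (HTML's unclosed tags would need extra care on top of this):
// innerText gathers all character data up to the end tag that closes the
// element we are currently inside, flattening nested tags such as <a> into
// plain text. Call it right after consuming the xml.StartElement for "li".
func innerText(d *xml.Decoder) (string, error) {
    var sb strings.Builder
    depth := 1 // we are just past the <li> start tag
    for depth > 0 {
        t, err := d.RawToken()
        if err != nil {
            return "", err
        }
        switch tok := t.(type) {
        case xml.StartElement:
            depth++ // descend into nested elements
        case xml.EndElement:
            depth-- // the matching end tag brings depth back to zero
        case xml.CharData:
            sb.Write(tok) // keep the text, drop the markup
        }
    }
    return sb.String(), nil
}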
I found a library that uses the jQuery style of getting HTML information: http://godoc.org/github.com/PuerkitoBio/goquery
I used that and it solved the problem.
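For completeness, the goquery version of the original loop looks roughly like this (a sketch with error handling trimmed; Text() returns the combined text of a node and all of its descendants, so nested <a> content comes back as plain text):
doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
    log.Fatal(err)
}
doc.Find("li").Each(func(i int, s *goquery.Selection) {
    fmt.Println(s.Text())
})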

Read image IPTC data

I'm having some trouble reading out the IPTC data of some images. The reason I want to do this is that my client already has all the keywords in the IPTC data and doesn't want to re-enter them on the site.
So I created this simple script to read them out:
$size = getimagesize($image, $info);
if (isset($info['APP13'])) {
    $iptc = iptcparse($info['APP13']);
    print '<pre>';
    var_dump($iptc['2#025']);
    print '</pre>';
}
This works perfectly in most cases, but it's having trouble with some images:
Notice: Undefined index: 2#025
even though I can clearly see the keywords in Photoshop.
Are there any decent small libraries that can read the keywords in every image, or am I doing something wrong here?
I've seen a lot of weird IPTC problems. It could be that you have two APP13 segments. I noticed that, for some reason, some JPEGs have multiple IPTC blocks; it's possibly the result of using several photo-editing programs or some manual file manipulation.
It could be that PHP is trying to read an empty APP13 or even embedded "thumbnail metadata".
It could also be a problem with segment lengths: APP13 or 8BIM have length marker bytes that might hold wrong values.
Try a HEX editor and check the file "manually".
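A quick sanity check before reaching for the hex editor is to list which segments PHP actually sees; if APP13 is missing or empty here, iptcparse() has nothing to work with:
getimagesize($image, $info);
var_dump(array_keys($info));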
I have found that IPTC is almost always embedded as XML using the XMP format, and is often not in the APP13 slot. You can sometimes get the IPTC info by using iptcparse($info['APP1']), but the most reliable way to get it without a third-party library is to simply search through the image file for the relevant XML string (I got this from another answer, but I haven't been able to find it again, otherwise I would link!):
The XML for the keywords always has the form "<dc:subject>...<rdf:Seq><rdf:li>Keyword 1</rdf:li><rdf:li>Keyword 2</rdf:li>...<rdf:li>Keyword N</rdf:li></rdf:Seq>...</dc:subject>"
So you can just get the file as a string using file_get_contents(get_attached_file($attachment_id)), use strpos() to find each opening (<rdf:li>) and closing (</rdf:li>) XML tag, and grab the keyword between them using substr().
The following snippet works for all JPEGs I have tested it on. It will fill the array $keys with IPTC keywords taken from an image on WordPress with ID $attachment_id:
$content = file_get_contents(get_attached_file($attachment_id));
// Look for XMP data: the "dc:subject" tag is where keywords are stored
$xmp_data_start = strpos($content, '<dc:subject>');
// Only proceed if able to find the dc:subject tag
if ($xmp_data_start !== FALSE) {
    $xmp_data_start += 12; // skip past the opening tag itself
    $xmp_data_end = strpos($content, '</dc:subject>');
    $xmp_data_length = $xmp_data_end - $xmp_data_start;
    $xmp_data = substr($content, $xmp_data_start, $xmp_data_length);
    // Look for the "rdf:Seq" tag where individual keywords are listed
    $key_data_start = strpos($xmp_data, '<rdf:Seq>');
    // Only proceed if able to find the rdf:Seq tag
    if ($key_data_start !== FALSE) {
        $key_data_start += 9; // skip past the opening tag itself
        $key_data_end = strpos($xmp_data, '</rdf:Seq>');
        $key_data_length = $key_data_end - $key_data_start;
        $key_data = substr($xmp_data, $key_data_start, $key_data_length);
        // $ctr tracks the position of each <rdf:li> tag, starting with the first
        $ctr = strpos($key_data, '<rdf:li>');
        // Initialize an empty array to store keywords
        $keys = array();
        // Store each keyword, then search for the next keyword tag
        while ($ctr !== FALSE && $ctr < $key_data_length) {
            // Skip past the tag to get to the keyword itself
            $key_begin = $ctr + 8;
            // The keyword ends where the closing tag begins
            $key_end = strpos($key_data, '</rdf:li>', $key_begin);
            // Make sure the keyword has a closing tag
            if ($key_end === FALSE) break;
            // Make sure the keyword is not too long (not sure what WP can handle)
            $key_length = $key_end - $key_begin;
            $key_length = (100 < $key_length ? 100 : $key_length);
            // Add the keyword to the keyword array
            array_push($keys, substr($key_data, $key_begin, $key_length));
            // Find the next keyword's opening tag
            $ctr = strpos($key_data, '<rdf:li>', $key_end);
        }
    }
}
I have this implemented in a plugin to put IPTC keywords into WP's "Description" field, which you can find here.
ExifTool is very robust if you can shell out to it (from PHP, it looks like?).
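A minimal sketch of that shell-out, assuming the exiftool binary is on the PATH (-j requests JSON output):
// Ask ExifTool for the keywords as JSON and decode them in PHP.
$json = shell_exec('exiftool -j -Keywords ' . escapeshellarg($image));
$data = json_decode($json, true);
// ExifTool returns one object per file; Keywords may be a string or an array.
$keywords = isset($data[0]['Keywords']) ? $data[0]['Keywords'] : array();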

How do you combine PDFs in ruby?

This was asked in 2008. Hopefully there's a better answer now.
How can you combine PDFs in ruby?
I'm using the pdf-stamper gem to fill out a form in a PDF. I'd like to take n PDFs, fill out a form in each of them, and save the result as an n-page document.
Can you do this with a native library like prawn? Can you do this with rjb and iText? pdf-stamper is a wrapper on iText.
I'd like to avoid using two libraries (i.e. pdftk and iText), if possible.
As of 2013 you can use Prawn to merge PDFs. Gist: https://gist.github.com/4512859
class PdfMerger
  def merge(pdf_paths, destination)
    first_pdf_path = pdf_paths.delete_at(0)

    Prawn::Document.generate(destination, :template => first_pdf_path) do |pdf|
      pdf_paths.each do |pdf_path|
        pdf.go_to_page(pdf.page_count)
        template_page_count = count_pdf_pages(pdf_path)
        (1..template_page_count).each do |template_page_number|
          pdf.start_new_page(:template => pdf_path, :template_page => template_page_number)
        end
      end
    end
  end

  private

  def count_pdf_pages(pdf_file_path)
    pdf = Prawn::Document.new(:template => pdf_file_path)
    pdf.page_count
  end
end
After a long search for a pure-Ruby solution, I ended up writing code from scratch to parse and combine/merge PDF files.
(I feel it is such a mess with the current tools. I wanted something native, but they all seem to have different issues and dependencies... even Prawn dropped the template support it used to have.)
I posted the gem online and you can find it at GitHub as well.
You can install it with:
gem install combine_pdf
It's very easy to use (with or without saving the PDF data to a file).
For example, here is a "one-liner":
(CombinePDF.load("file1.pdf") << CombinePDF.load("file2.pdf") << CombinePDF.load("file3.pdf")).save("out.pdf")
If you find any issues, please let me know and I will work on a fix.
Use ghostscript to combine PDFs:
options = "-q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite"
system "gs #{options} -sOutputFile=result.pdf file1.pdf file2.pdf"
I wrote a ruby gem to do this — PDF::Merger. It uses iText. Here's how you use it:
pdf = PDF::Merger.new
pdf.add_file "foo.pdf"
pdf.add_file "bar.pdf"
pdf.save_as "combined.pdf"
Haven't seen great options in Ruby, so I got the best results shelling out to pdftk:
system "pdftk #{file_1} multistamp #{file_2} output #{file_combined}"
We're closer than we were in 2008, but not quite there yet.
The latest dev version of Prawn lets you use an existing PDF as a template, but not use a template over and over as you add more pages.
Via iText, this will work... though you should flatten the forms before you merge them to avoid field name conflicts. That or rename the fields one page at a time.
Within PDF, fields with the same name share a value. This is usually not the desired behavior, though it comes in handy from time to time.
Something along the lines of this (in Java):
Document document = new Document();
PdfCopy mergedPDF = new PdfCopy( document, new FileOutputStream( outPath ) );
document.open();
for (String path : paths) {
    PdfReader reader = new PdfReader( path );
    ByteArrayOutputStream curFormOut = new ByteArrayOutputStream();
    PdfStamper stamper = new PdfStamper( reader, curFormOut );
    stamper.getAcroFields().setField( name, value ); // ad nauseam
    stamper.setFormFlattening( true ); // flattening only takes effect during close()
    stamper.close();

    byte[] curFormBytes = curFormOut.toByteArray();
    PdfReader combineMe = new PdfReader( curFormBytes );
    int pages = combineMe.getNumberOfPages();
    for (int i = 1; i <= pages; ++i) { // "1" is the first page
        mergedPDF.addPage( mergedPDF.getImportedPage( combineMe, i ) );
    }
}
document.close();
If you want to stamp a template (created with macOS Pages or Google Docs) onto every page using the combine_pdf gem, you can try this:
final_pdf = CombinePDF.new
company_template = CombinePDF.load("template_file.pdf").pages[0]
pdf = CombinePDF.load("content_file.pdf")
pdf.pages.each { |page| final_pdf << (company_template << page) }
final_pdf.save "final_document.pdf"

Image tag not closing with HTMLAgilityPack

Using the HTMLAgilityPack to write out a new image node, it seems to remove the self-closing slash of the image tag: e.g. it should be <img ... />, but when you check the outer HTML, it has <img ...>.
string strIMG = "<img src='" + imgPath + "' height='" + pubImg.Height + "px' width='" + pubImg.Width + "px' />";
HtmlNode newNode = HtmlNode.Create(strIMG);
This breaks XHTML.
Telling it to output XML as Micky suggests works, but if you have other reasons not to want XML, try this:
doc.OptionWriteEmptyNodes = true;
Edit 1: Here is how to fix an HTML Agility Pack document to correctly display image (img) tags:
if (HtmlNode.ElementsFlags.ContainsKey("img"))
{
    HtmlNode.ElementsFlags["img"] = HtmlElementFlag.Closed;
}
else
{
    HtmlNode.ElementsFlags.Add("img", HtmlElementFlag.Closed);
}
Replace "img" with any other tag to fix it as well (input, select, and option come up frequently); repeat as needed. Keep in mind that this will produce <img></img> rather than <img />, because of the HAP bug preventing the "closed" and "empty" flags from being set simultaneously.
Source: Mike Bridge
Original answer:
Having just labored over solutions to this issue, and not finding any sufficient answers (doctype set properly, using Output as XML, Check Syntax, AutoCloseOnEnd, and Write Empty Node options), I was able to solve this with a dirty hack.
This will certainly not solve the issue outright for everyone, but for anyone returning their generated HTML/XML as a string (e.g. via a web service), the simple solution is to use fake tags that the Agility Pack doesn't know to break.
Once you have finished doing everything you need to do on your document, call the following method once for each tag giving you a headache (notable examples being option, input, and img). Immediately after, render your final string, do a simple replace for each tag prefixed with "fix_", and return your string.
This is only marginally better in my opinion than the regex solution proposed in another question I cannot locate at the moment (something along the lines of )
private void fixHAPUnclosedTags(ref HtmlDocument doc, string tagName, bool hasInnerText = false)
{
    HtmlNode tagReplacement = null;
    foreach (var tag in doc.DocumentNode.SelectNodes("//" + tagName))
    {
        tagReplacement = HtmlTextNode.CreateNode("<fix_" + tagName + "></fix_" + tagName + ">");
        foreach (var attr in tag.Attributes)
        {
            tagReplacement.SetAttributeValue(attr.Name, attr.Value);
        }
        if (hasInnerText) // for option tags and other non-empty nodes, the next (text) node will be its inner HTML
        {
            tagReplacement.InnerHtml = tag.InnerHtml + tag.NextSibling.InnerHtml;
            tag.NextSibling.Remove();
        }
        tag.ParentNode.ReplaceChild(tagReplacement, tag);
    }
}
As a note, if I were a betting man I would guess that Mike Bridge's answer above inadvertently identifies the source of this bug in the pack: something is causing the closed and empty flags to be mutually exclusive.
Additionally, after a bit more digging, I don't appear to be the only one who has taken this approach:
HtmlAgilityPack Drops Option End Tags
Furthermore, in cases where you ONLY need non-empty elements, there is a very simple fix listed in that same question, as well as in the HAP CodePlex discussion: it essentially applies the element-flag fix listed in Mike Bridge's answer above permanently, everywhere (see the sketch below).
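In code, that permanent fix is essentially one line (shown here for option; substitute whatever tag you need treated as a normal element):
// Remove HAP's special "empty" handling for <option> so its end tag survives.
// Dictionary.Remove is a no-op if the key is absent, so no guard is needed.
HtmlNode.ElementsFlags.Remove("option");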
There is an option to turn on XML output that makes this issue go away.
var htmlDoc = new HtmlDocument();
htmlDoc.OptionOutputAsXml = true;
htmlDoc.LoadHtml(rawHtml);
This seems to be a bug with HtmlAgilityPack. There are many ways to reproduce it, for example:
Debug.WriteLine(HtmlNode.CreateNode("<img id=\"bla\"></img>").OuterHtml);
This outputs malformed HTML. Using the suggested fixes in the other answers does nothing.
HtmlDocument doc = new HtmlDocument();
doc.OptionOutputAsXml = true;
HtmlNode node = doc.CreateElement("x");
node.InnerHtml = "<img id=\"bla\"></img>";
doc.DocumentNode.AppendChild(node);
Debug.WriteLine(doc.DocumentNode.OuterHtml);
Produces malformed XML / XHTML like <x><img id="bla"></x>
I have created an issue on CodePlex for this.
