How to convert PDF and DOC files to HTML using Cocoa

I would like to convert PDF and DOC files to HTML files using Cocoa.
Please help me with this.
Thanks in advance.

You can convert Word files to HTML using NSAttributedString. You can't do this in pure Cocoa for PDF files; you'll have to use a conversion tool, such as the ones stigi suggested. To run such a tool from your app, use NSTask.

Cocoa's PDFKit framework can convert a PDF file to text, through PDFDocument's -string method for example. Of course this won't copy images or formatting though, and it depends on PDFKit being able to recognize text in the file.
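Since the extracted text carries no images or formatting, the most it can become is a very plain HTML page. A minimal sketch in Python, assuming the text has already been obtained (for example from PDFDocument's -string) and that blank lines separate paragraphs; the function name text_to_html is made up for illustration:

```python
import html


def text_to_html(text, title="Extracted PDF text"):
    """Wrap plain text in a minimal HTML page, one <p> per blank-line-separated block."""
    # Assumption: paragraphs in the extracted text are separated by blank lines.
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    paragraphs = "\n".join(
        # Escape &, <, > so the text cannot be mistaken for markup.
        "<p>" + html.escape(b).replace("\n", "<br>\n") + "</p>"
        for b in blocks
    )
    return (
        '<!DOCTYPE html>\n<html>\n<head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title></head>\n"
        f"<body>\n{paragraphs}\n</body>\n</html>\n"
    )
```

This only preserves line and paragraph breaks; layout, fonts and images are lost, exactly as described above.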

There are a couple of tools for the Unix command line that do this kind of conversion.
Check out http://pdftohtml.sourceforge.net/ and http://rtf2html.sourceforge.net/
You may want to see if there are other tools like these.
But to get back to your question: these command-line tools can be called from within your Cocoa app (this won't work on the iPhone) and produce the HTML result.
Check out this link for a guide on how to embed such command-line tools within your app.
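The idea of driving such a tool from another program can be sketched outside Cocoa as well. The following Python sketch plays the role NSTask would play in an app: it builds the argument list and runs the tool as a child process. It assumes a pdftohtml binary on PATH that accepts "input.pdf output.html" as arguments; check the usage text of your build, since that argument order and the helper names here are assumptions.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftohtml_command(pdf_path, out_dir, tool="pdftohtml"):
    """Return the argument list for converting one PDF (assumed CLI: tool in.pdf out.html)."""
    pdf = Path(pdf_path)
    target = Path(out_dir) / (pdf.stem + ".html")
    return [tool, str(pdf), str(target)]


def convert(pdf_path, out_dir, tool="pdftohtml"):
    """Run the external converter and return the path of the HTML file it was asked to write."""
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} is not installed or not on PATH")
    cmd = build_pdftohtml_command(pdf_path, out_dir, tool)
    # check=True raises CalledProcessError if the tool exits with a non-zero status.
    subprocess.run(cmd, check=True)
    return cmd[-1]
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with file names that contain spaces.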

Related

PDF file basic usage in Cocoa (Objective-C)

Can anybody please give me a step by step guide on how to use a locally saved pdf file in a cocoa application and display its contents?
P.S. I am basically a noob in Cocoa, and I want to have a proper idea of what I am learning. Therefore I need a step-by-step guide on this.
Thanks,
Animesh
I do not know of any step by step guide. Apple has the PDFKit framework to display PDF files. Read the PDFKit Programming Guide, which is part of Apple's developer documentation. For the user interface you need a window and a PDF view to display the contents of the PDF file.
Keep in mind that PDFKit is not a beginning technology. If you are learning Cocoa and Objective-C, you should start with a simpler project before you move on to displaying PDF files. An example of a simpler project would be an app that converts temperature from Celsius to Fahrenheit or the other way around.

Replace text in PDF using Cocoa

I am looking for a way to replace text in a PDF document in my Mac application, but I don't know how. I am thinking of converting the PDF to an HTML file, so I can use stringByReplacingOccurrencesOfString:, and then converting it back to a PDF, but I cannot find out how.
I also tried to replace the text using CGPDFDocumentRef, but I couldn't find a valid method.
Can anyone please help me to solve this issue?
Thanks, David
It is not possible to replace text in a PDF using the CGPDF* API. PDF -> HTML -> PDF will not work because the double conversion will lose content (the PDF and HTML formats are not fully compatible).
The only solution is to find a 3rd party toolkit that supports this functionality.

Batch convert Mac iWork files to PDF on the command line

I'm trying to batch convert a bunch of assorted iWork files (Numbers, Pages, Keynote) to PDF on the command line.
I've been trying cups-filter but there's no MIME type filter for the iWork types. I then looked into using qlmanage to generate the preview image and use that, but this doesn't seem to work for multi file Keynote documents as they generate as HTML rather than PDF.
Any suggestions? I'd rather not resort to AppleScript.
I created an .applescript script that converts all .pages files within a folder to .docx. .pdf support can be easily added. In pages2docx.applescript you just need to replace Microsoft Word with PDF.
Here's what I ended up going with, since I really wanted to avoid AppleScript.
When saving an iWork document there's an "Include Preview In Document" checkbox. Checking this creates a "QuickLook/Preview.pdf" inside the iWork document bundle (which is actually a zip file). Luckily I had this checked for most of the files, so it was simply a case of unzipping to NSTemporaryDirectory and grabbing that file.
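The unzip-and-grab step can be sketched in Python as follows. This is a rough sketch, assuming the document really is a zip archive containing "QuickLook/Preview.pdf" as described above; the function name extract_preview and the choice to name the output after the document are illustrative only.

```python
import tempfile
import zipfile
from pathlib import Path

# Member path taken from the description above.
PREVIEW_MEMBER = "QuickLook/Preview.pdf"


def extract_preview(iwork_path, dest_dir=None):
    """Copy the embedded preview PDF out of an iWork file; return its path, or None."""
    # Documents saved as folders, or non-zip files, have nothing to extract here.
    if not zipfile.is_zipfile(iwork_path):
        return None
    dest = Path(dest_dir) if dest_dir else Path(tempfile.mkdtemp())
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(iwork_path) as bundle:
        if PREVIEW_MEMBER not in bundle.namelist():
            return None  # saved without "Include Preview In Document"
        out = dest / (Path(iwork_path).stem + ".pdf")
        out.write_bytes(bundle.read(PREVIEW_MEMBER))
    return out
```

A None result marks the files that need the qlmanage fallback.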
For those that didn't I put together a script to run qlmanage to create the document preview. For some that creates the PDF, for others it creates an HTML file. You can then use http://code.google.com/p/wkhtmltopdf/ to convert this HTML to a PDF.
Well... you need something that
understands the iWork file formats, and
can render the documents to then create the PDF.
Unless you want to re-invent the iWork suite... Sounds simpler to just tell the iWork apps what you want from them.
You would do that via the Scripting Bridge
I would use AppleScript, but perhaps you can use Ruby or Python with the Scripting Bridge to accomplish what you need.
With Scripting Bridge, RubyCocoa and PyObjC scripts can do what AppleScript scripts can do: control scriptable applications and exchange data with them.
I haven't used the Scripting Bridge in a while, but I believe you can tell applications to print documents. And any application that can print in OS X can send it to PDF instead.
Here are a couple of commands to help those who want to get this working without much thought. It worked for me with a ppt file.
Make sure to get wkhtmltopdf from here.
qlmanage -p -o /tmp /path/of/file.ppt
wkhtmltopdf /tmp/file.ppt.qlpreview/Preview.html /output/to/file.pdf
You may have to fiddle with sizes if you want the original pages to stay consistent. For the ppt I was using, the following parameters did the job:
wkhtmltopdf --page-width 200 --page-height 145 Preview.html file.pdf
Edit: I have written a Python script to do a batch conversion. Hopefully people can contribute to make it more robust:
https://github.com/matthewfitch23/DocToPdf
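For readers who want to see the shape of such a batch run, here is a small independent sketch (not the linked script). It chains the two commands shown above, assuming qlmanage writes its output to "<work_dir>/<file name>.qlpreview/Preview.html" as in the example; all function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def qlmanage_command(src, work_dir):
    """Command that asks Quick Look to write a preview of src into work_dir."""
    return ["qlmanage", "-p", "-o", str(work_dir), str(src)]


def preview_html_path(src, work_dir):
    """Where the HTML preview is expected, following the /tmp/file.ppt.qlpreview example."""
    return Path(work_dir) / (Path(src).name + ".qlpreview") / "Preview.html"


def wkhtmltopdf_command(html, pdf, page_width=None, page_height=None):
    """Command that renders the HTML preview to PDF, with optional page size."""
    cmd = ["wkhtmltopdf"]
    if page_width is not None:
        cmd += ["--page-width", str(page_width)]
    if page_height is not None:
        cmd += ["--page-height", str(page_height)]
    return cmd + [str(html), str(pdf)]


def convert_all(sources, work_dir, out_dir, **page_size):
    """Run both steps for each source file; skip files with no HTML preview."""
    for src in map(Path, sources):
        subprocess.run(qlmanage_command(src, work_dir), check=True)
        html = preview_html_path(src, work_dir)
        if not html.exists():
            # qlmanage may have produced a PDF directly, or nothing usable.
            print(f"skipping {src}: no Preview.html produced")
            continue
        pdf = Path(out_dir) / (src.stem + ".pdf")
        subprocess.run(wkhtmltopdf_command(html, pdf, **page_size), check=True)
```

For example, convert_all(["/path/of/file.ppt"], "/tmp", "/output/to", page_width=200, page_height=145) would reproduce the commands above on a Mac with both tools installed.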

Convert indesign output to html5

I want to write a viewer that converts InDesign output to HTML5, so that everything a user designs in Adobe InDesign can be displayed in a browser, but I do not know which output format is suitable. I think I can retrieve all the information about the InDesign document from an IDML export, but the problem is parsing that XML and displaying the tags as HTML5. Is there a simple way to convert the output format into HTML5?
Is it possible to download the Adobe InDesign SDK and use its methods for this purpose?
You can use in5 to export HTML5 (layout intact) from InDesign.
Full disclosure: I am the creator of in5.
Exporting to EPUB would result in XHTML 1.1. The Epub file that InDesign generates is a zip file, in which you will find a number of files. (At least) one of them is an XHTML file.
XHTML 1.1 would surely be an easier source to use than the idml, however you will have to make sure that the ePub export is good enough to start with (the pages won't come out exactly the same as in InDesign).
Would that be a solution?
EPUB export is supported from InDesign CS4 onward. As I understand it, in CS4 it is a JavaScript-based export option outside the object model, and from CS5 it is a built-in export option that is part of the object model.
You don't mention what version of InDesign you are using. CS5, CS5.5 and CS6 all allow you to export to HTML. The problem is that the HTML is version 4 and it creates badly written CSS. What I like to do is to use XML to build my own HTML. Just create a set of HTML5 tags you want to use and then map the existing Paragraph and Character styles to the XML tags.
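Because the EPUB is a zip file, locating the XHTML inside it takes only a few lines. A minimal Python sketch, assuming the content files end in .xhtml, .html or .htm and are UTF-8 encoded; the function names are illustrative:

```python
import zipfile


def xhtml_members(epub_path):
    """List the (X)HTML files stored inside an EPUB archive."""
    with zipfile.ZipFile(epub_path) as epub:
        return sorted(
            name
            for name in epub.namelist()
            # Assumption: content documents use one of these extensions.
            if name.lower().endswith((".xhtml", ".html", ".htm"))
        )


def read_xhtml(epub_path, member):
    """Return the text of one XHTML file from the archive (assumes UTF-8)."""
    with zipfile.ZipFile(epub_path) as epub:
        return epub.read(member).decode("utf-8")
```

The returned markup can then be cleaned up or restyled as HTML5 by whatever means the viewer needs.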
When you're done you will have a basic content structure. Then I use the Structure pane to add different elements as needed. You can add Parents or children as you need to right there and then export to XML. When you save the file, just change its name to .HTML and edit the code to remove the one reference to "xml".
It takes a little time, but it is very doable.
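The final rename-and-edit step could be scripted along these lines. This sketch assumes that the "one reference to xml" is the XML declaration on the first line of the exported file, which is a guess and should be checked against an actual export; xml_export_to_html is a made-up name.

```python
import re
from pathlib import Path

# Assumption: the only "xml" reference is a leading <?xml ... ?> declaration.
XML_DECLARATION = re.compile(r"^\s*<\?xml[^>]*\?>\s*", re.IGNORECASE)


def xml_export_to_html(xml_path):
    """Write a copy of the exported XML as .html with the XML declaration removed."""
    src = Path(xml_path)
    # utf-8-sig drops a byte-order mark if the export has one.
    text = src.read_text(encoding="utf-8-sig")
    cleaned = XML_DECLARATION.sub("", text, count=1)
    out = src.with_suffix(".html")
    out.write_text(cleaned, encoding="utf-8")
    return out
```

The original .xml file is left untouched, so the export can be repeated if the tag mapping changes.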

Converting Word to PDF Using SharePoint 2010 Word Automation Services

I have tried to find a way to lock the PDF file or disable copy and paste on it after the conversion. I looked at the ConversionJobSettings properties but was not able to accomplish this.
Based on what I have read, the SharePoint 2010 Word Automation Services API provides very limited control over the conversion logic, but is there any way I can lock down the content so that it cannot be copied?
Thanks for your help.
You will either need to code something up yourself or get a third party product such as this one, which allows conversion as well as PDF manipulation including security and watermarking.
Note that I worked on this product, so I am obviously biased. Having said that, it works brilliantly.
The only way to prevent copy and paste (as text) is to create image versions of the pages and save those as a PDF.
A possible solution:
1) Use Word automation to print to a PostScript (PS) printer driver to get a .ps file
2) Use GhostScript to convert the PS to tif files
3) Create a PDF using the tif files (possibly with GhostScript too)
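Step 2 might look like the following when scripted. This is only a sketch: it assumes Ghostscript is installed as "gs" and uses its tiff24nc output device at 150 dpi, both of which are illustrative choices rather than anything stated above. Steps 1 and 3 are not covered.

```python
def ps_to_tiff_command(ps_path, out_pattern="page-%03d.tif", dpi=150, device="tiff24nc"):
    """Build a Ghostscript command that rasterizes each page of a .ps file to a TIFF."""
    return [
        "gs",
        "-dNOPAUSE",                    # do not wait for a key press between pages
        "-dBATCH",                      # exit when the input has been processed
        "-dSAFER",                      # restrict file access from the PostScript program
        f"-sDEVICE={device}",           # assumed device: 24-bit colour TIFF
        f"-r{dpi}",                     # resolution in dots per inch
        f"-sOutputFile={out_pattern}",  # %03d is replaced by the page number
        str(ps_path),
    ]
```

The list can be passed to subprocess.run(..., check=True); a higher dpi gives sharper page images at the cost of a larger PDF.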

Resources