I am trying to do exactly what is described in the following thread:
AppleScript/Automator: renaming PDF with extracted text content of this PDF
So I am using Chino22's version, and there are two issues with it:
First, instead of the contents of the pdf, theFileContentsText gets some metadata stuff.
Second, although the script runs to the end, I get the following error for the last step:
error "The variable thisFile is not defined." number -2753 from "thisFile"
So, how do I get the text contents instead, and how do I define thisFile to the current pdf that is being processed in the loop?
Thanks in advance!
I would not expect the linked script to work.
Except for document metadata, extracting text content from PDF is notoriously difficult and unreliable, and not a road you want to go down if you can possibly avoid it. Adobe’s PDF file format is designed for printing, not for data processing. PDF files contain blocks of Postscript-like page drawing instructions, typically compressed, and while it’s possible for PDFs also to include the original plain text for accessibility use, most PDF generators do not do this so the only way to get the original text is by reconstructing it from those low-level drawing instructions—not a trivial job.
AppleScript’s read command only reads that raw file data; it does not parse it into drawing instructions, never mind translating those drawing instructions back into plain text. Change a PDF file’s extension to .txt and open it in a plain text editor, and you’ll see what I mean. Nasty.
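To make that concrete, here is a minimal sketch of why a raw read does not hand you the text. It assumes a made-up page content stream (the drawing instructions and the "Invoice" string are hypothetical) and compresses it with zlib, which is the same deflate scheme PDF's FlateDecode filter uses. It is an illustration of the principle, not a PDF parser.

```python
import zlib

# Hypothetical page content: PDF-style drawing instructions that paint a string.
content_stream = b"BT /F1 12 Tf 72 720 Td (Invoice 2024-001) Tj ET"

# PDF generators typically compress such streams (FlateDecode is zlib/deflate).
compressed = zlib.compress(content_stream)


def contains_text(raw: bytes, needle: bytes) -> bool:
    """Return True if the needle appears verbatim in the raw bytes."""
    return needle in raw


# What a plain file read sees: compressed bytes, so the search fails.
found_in_raw = contains_text(compressed, b"Invoice")

# Only after inflating the stream do the drawing instructions become visible,
# and even then the text is still wrapped in operators such as Tj.
found_after_inflate = contains_text(zlib.decompress(compressed), b"Invoice")

print(found_in_raw, found_after_inflate)
```

Even the inflated form is only the easy case: real PDFs often encode strings through font-specific character codes, so a string search can still come up empty.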
If you need to work with the PDF’s original content (text, images, whatever), your best solution is to get those files before they were converted into a PDF.
If you must extract content from a PDF file, use an existing tool that knows how to do it.
For instance, if you’re lucky enough to have PDFs that contain XFDF (XML form) or accessibility data, there are 3rd-party apps and libraries to extract that content in readable form. I can’t think offhand of any that are AppleScriptable (Adobe Acrobat has only minimal AS support) so you’ll probably need to find one you can run from command line (do shell script in AS).
Or, if the PDFs have a consistent visual structure, a 3rd-party library such as Python’s PDFMiner (which I’ve used in the past) can identify blocks of characters by position and convert those back into strings with varying degrees of reliability (it has to convert font glyphs back into Unicode characters, guess at which characters are close enough to constitute a word, and where to insert space and return characters between those words). You’ll have to write some Python code to extract the bits you want, so look for tutorials to get started (or pay someone to write it for you).
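To give a feel for the "guess which characters are close enough to constitute a word" step, here is a simplified sketch. It is not PDFMiner's actual algorithm; the glyph tuples, the sample coordinates, and the gap_threshold value are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Each glyph: (character, left x, right x) on a single line of text.
Glyph = Tuple[str, float, float]


def glyphs_to_text(glyphs: List[Glyph], gap_threshold: float = 2.0) -> str:
    """Join glyphs left to right, inserting a space wherever the horizontal
    gap between neighbours exceeds gap_threshold (a hypothetical tuning value)."""
    ordered = sorted(glyphs, key=lambda glyph: glyph[1])
    pieces: List[str] = []
    previous_right: Optional[float] = None
    for char, left, right in ordered:
        if previous_right is not None and left - previous_right > gap_threshold:
            pieces.append(" ")
        pieces.append(char)
        previous_right = right
    return "".join(pieces)


# Hypothetical glyph positions: a wide gap between "i" and "y".
sample = [("H", 0, 6), ("i", 6.5, 8), ("y", 14, 19), ("o", 19.5, 24), ("u", 24.5, 29)]
line = glyphs_to_text(sample)
print(line)
```

A fixed threshold like this breaks down as soon as font sizes or letter spacing vary, which is one reason the results come with "varying degrees of reliability".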
But again, if you can possibly avoid having to extract text from PDF, you should. You will save yourself a lot of trouble.
As in the title: imagine there is some Gimp .xcf file containing many layers. Some of these layers contain text. Is there any format I can export the .xcf file to that somehow preserves 'human readable' text?
The final goal is to process that text and put it back into the file. I am aware that this sounds unusual, but maybe some of you have an idea how to achieve a scenario like that.
I did some research and saw that I can export the image to .psd format and then process that image with an NPM package to extract the text. This only partially solves the problem, because I will not know how to put the processed text back into the .psd file (unless I decompile this NPM package and try to write some implementation myself...)
Any solutions and alternatives highly appreciated.
You can script Gimp (using Scheme or Python). Technically you cannot change the text in a layer (there is no API for that), but you can recover the characteristics of a text layer (original text, font type, font size...) and recreate a new layer with a new text. Here is some Python code to recover the text information:
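from gimpfu import pdb  # GIMP 2.x Python-Fu

def text_info(img, layer):
    parasites = None
    try:
        # Names of all parasites (attached metadata blobs) on this layer
        parasites = layer.parasite_list()
    except Exception:
        # Not every kind of layer supports parasites; treat as "none found"
        pass
    if parasites and 'gimp-text-layer' in parasites:
        # The text layer description: original text, font, size...
        data = layer.parasite_find('gimp-text-layer').data
        pdb.gimp_message('Text layer "%s": %s' % (layer.name, data))
    else:
        pdb.gimp_message('No text information found for layer "%s"' % layer.name)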
(This information is only present if the file has been saved; it is not available on a newly created layer, but this shouldn't be a problem in your case.)
Of course if the text is in a plain bitmap layer of its own this cannot be done, you have to guess the font type & size (but sometimes the code above can still recover the text information)
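Once you have the parasite data, you will probably want it as a dictionary rather than one long string. Here is a rough sketch that assumes the data is a series of one-per-line "(key value)" entries such as (text "Hello") and (font "Sans"); the sample string below is made up, and the exact set of keys varies between GIMP versions. It does not handle text that spans several lines or escaped quotes, and under Python 3 you may need to decode the data from bytes first.

```python
import re


def parse_text_parasite(data: str) -> dict:
    """Parse top-level '(key value)' lines into a dict of strings.

    Assumes one entry per line; multi-line text values are not handled.
    """
    info = {}
    for match in re.finditer(r'^\((\S+)\s+(.*)\)\s*$', data, re.MULTILINE):
        key, value = match.group(1), match.group(2)
        # Strip the surrounding quotes from string values such as (text "Hello").
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        info[key] = value
    return info


# Hypothetical parasite contents, for illustration only.
sample = (
    '(text "Hello world")\n'
    '(font "Sans")\n'
    '(font-size 18.000000)\n'
    '(font-size-unit pixels)\n'
)
info = parse_text_parasite(sample)
print(info["text"], info["font"], info["font-size"])
```

With the text, font and size in hand, you can process the text and create a replacement text layer with the same characteristics.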
But if your XCF has a simple structure, it can be a lot simpler to decompose it into individual images, and build a new image with ImageMagick, using some of these layers plus new text images (or directly rendered text).
Is there an option for me to ask Ghostscript to indent the PostScript it creates?
Everything starts at the beginning of a line and I find it difficult to follow.
Alternatively, I am using Emacs and ps-mode.
If anyone knows how to indent code in this mode I would appreciate a tip (apologies, because this may not be relevant to this StackExchange).
No, there is no option for indenting the output.
PostScript is pretty much regarded as a write-only language anyway, and the output of ps2write (which is what I assume you are using though you don't say) is particularly difficult since it fundamentally outputs PDF syntax with a PostScript program on the front to parse it into PostScript operations.
Why do you want to read it ?
[EDIT]
You can always edit your question, you don't need to post a new answer.
I'm afraid what you want to do isn't as simple as you might think.
It might be possible for this use case if the PDF files you receive are always created the same way, but there are significant problems.
The font you use as a substitute for the missing font must be encoded the same way. Say for example the font in the PDF file is encoded so that 0x41 is 'A', you need to make sure that the replacement font is also encoded so that 0x41 is an 'A'. So just the findfont, scalefont, setfont sequence is not always going to be sufficient, sometimes you will need to re-encode the font.
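As a rough illustration of the encoding point, here is a sketch that compares two encodings on just the character codes a piece of text actually uses. The dictionaries are made-up stand-ins for font encoding vectors (code to glyph name); real fonts have up to 256 entries and you would have to extract them from the font or the PDF yourself.

```python
def encoding_mismatches(original: dict, substitute: dict, used_codes: bytes) -> list:
    """Return (code, original glyph, substitute glyph) for every used
    character code on which the two encodings disagree."""
    mismatches = []
    for code in sorted(set(used_codes)):
        wanted = original.get(code)
        actual = substitute.get(code)
        if wanted != actual:
            mismatches.append((code, wanted, actual))
    return mismatches


# Hypothetical encodings: they agree on 'A' and 'B' but not on code 0x80.
original_encoding = {0x41: "A", 0x42: "B", 0x80: "bullet"}
substitute_encoding = {0x41: "A", 0x42: "B", 0x80: "Euro"}

problems = encoding_mismatches(original_encoding, substitute_encoding, b"AB\x80")
print(problems)
```

An empty result means a plain findfont, scalefont, setfont would render the same characters; anything else means the substitute needs re-encoding first.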
CIDFonts will be a major stumbling block. Firstly because ps2write simply doesn't emit CIDFonts at all. These were not part of level 2 PostScript. As a result all text in a CIDFont will be embedded as bitmaps. If your original file doesn't contain the CIDFont then you'll get the fallback CIDFont bitmapped.
Secondly CIDFonts can use multiple-byte character codes, of variable length. You can't simply replace a CIDFont with a Font, it just won't work.
The best solution, obviously, is to have the PDF files created with the required fonts embedded. This is best practice. If you can't get that, then I'd suggest that rather than trying to hand edit PostScript, you use the fontmap.GS and cidfmap files which Ghostscript uses to find fonts.
Ghostscript already has a load of code to do font substitution automatically, using both Fonts and CIDFonts as substitutes, and it does all the hard work of re-encoding the fonts or building CMaps as required. If you are on Windows much of this may already be done for you, when you install Ghostscript it will ask if you want to create font mappings. If you said yes then it will
Add the font substitutions you want to use in those files (they have comments explaining the layout) and then use the pdfwrite device to make a new PDF file. Set EmbedAllFonts to true (you may need to add an AlwaysEmbed font array as well, listing the fonts specifically) and SubsetFonts to false.
That should create a new PDF file where the missing fonts have been replaced by your defined substitutes; those substitutes will have been embedded in the new PDF file and they will not have been subset (Acrobat will generally refuse to edit text in a subset font).
The switches I mentioned above are standard Adobe Distiller parameters, but they are documented for pdfwrite here. There's some documentation on adding fonts here and here and specifically for CIDFonts here.
Basically I'd suggest you define your substitutions and let Ghostscript do the work for you.
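For what it's worth, the pdfwrite invocation described above can be sketched as a small Python wrapper. The file names and the font name are placeholders, the executable is assumed to be called gs (on Windows it is usually gswin64c), and the font substitutions themselves still have to be set up in the Fontmap/cidfmap files beforehand. Depending on the fonts involved you may also need to adjust NeverEmbed, so treat this as a starting point rather than a recipe.

```python
import subprocess
from typing import List, Sequence


def pdfwrite_command(src: str, dst: str, always_embed: Sequence[str] = ()) -> List[str]:
    """Build a Ghostscript command line that rewrites src to dst with all
    fonts embedded and not subset."""
    cmd = [
        "gs",                      # assumed executable name
        "-o", dst,                 # output file (also implies batch mode)
        "-sDEVICE=pdfwrite",
        "-dEmbedAllFonts=true",
        "-dSubsetFonts=false",
    ]
    if always_embed:
        # Array-valued distiller parameters are passed as PostScript via -c.
        names = " ".join("/" + name for name in always_embed)
        cmd += ["-c", "<< /AlwaysEmbed [ %s ] >> setdistillerparams" % names, "-f"]
    cmd.append(src)
    return cmd


def convert(src: str, dst: str, always_embed: Sequence[str] = ()) -> None:
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(pdfwrite_command(src, dst, always_embed), check=True)


# Placeholder file and font names, for illustration only.
command = pdfwrite_command("card.pdf", "card-embedded.pdf", ["HelveticaMonospacedPro-RG"])
print(" ".join(command))
```

Only pdfwrite_command is exercised here; convert needs Ghostscript installed and a real input file.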
This is not an answer to the problem but rather an answer to KenS's question about "Why do you want to read it?"
I tried to put it in the comment box but it was too long.
I am a retired engineer with a strong programming background.
I would like to read and understand the postscript code for the reason shown below.
I play duplicate bridge as a hobby. I receive a PDF file of what is known as a convention card (a single-page document of bridge agreements).
Frequently I would like to edit these files.
When I open with Adobe Illustrator I have to spend a significant amount of time replacing fonts that are not on my system with fonts that I do have.
I can take the PDF and export it as a postscript file using Ghostscript.
I was going to write a little program to replace the embedded fonts with the fonts that I use to replace them.
I was going to leave the postscript file unaltered and insert things like
/HelveticaMonospacedPro-RG findfont
12 scalefont setfont
just above where the text is written.
I was planning on using the fonts that I have on my system (e.g., HelveticaMonospacedPro-RG).
Every now and then I run into a situation when I need to email a piece of code from emacs. When I paste text into my email program (not emacs), all the color highlighting is lost. This is especially disappointing when pasting from org-mode, which relies heavily on colors for readability. It would be good to preserve font faces.
Is there a way to do this? I am looking for output similar to that of ps-print-buffer-with-faces.
If your email program can handle HTML, try M-x htmlfontify-buffer, which converts the contents of the current buffer (with faces) to CSS-styled HTML.
I'm looking for a nice way to generate either a Keynote file from XML or a Powerpoint file that I can then import to Keynote. Basically, I'm looking for a simple human-writable markup format (for easy scripting) that can be exported into slides.
I volunteer with a local nonprofit, where anything remotely technical falls to me. On a fairly regular basis, I'm sent information for events and produce a nice looking printed program in Word, though much of the same material also goes into slides in Keynote. (Keynote is used rather than PowerPoint so that Keynote Remote can be used.)
Anyway, there's a large volume of text I work with that I'm sent via email, and it has to go in both a Keynote presentation and a Word document, and requires all sorts of odd manual formatting to not break pages or slides at odd times, also requiring a good deal of manual restyling, since I'm not going to allow something I do to come out looking like something sloppy from the 1990s.
My hope is to write up a Ruby script that I can feed the source text to, and it'll go do all the processing for me, at least for Powerpoint or Keynote. I've normally had fantastic luck finding a gem for just about any format or service I've wanted to work with, but I haven't found anything that works with Powerpoint or Keynote.
My next thought was to have the Ruby code generate appropriate XML, since both Office and iWork allegedly open the Office XML format, but I couldn't find any actual friendly documentation for human-writable XML code.
Is it wishful thinking to want to be able to do something like the following?
<SLIDE FORMAT="Title & Bullets">
<SLIDE_TITLE>
Lorem Ipsum
</SLIDE_TITLE>
<PARAGRAPH>
[etc.]
All I can find as far as converter scripts go is related to charts and tables and such (which is of zero use here), usually revolves around opening or converting FROM Powerpoint or Keynote rather than creating, and furthermore generally seems to be for Windows using OLE or VBScript. This needs to run on the Macs they have there, so no Visual Studio stuff, Windows-related scripting, etc. will work. I don't HAVE to do it in Ruby, but that's what I'd be most comfortable with on the Mac end of things.
So is there documentation out there on a marginally friendly XML format for Powerpoint or Keynote, or even better, a Ruby gem for either?
If all you need to do is title + bullet point slides, you simply need to create an ascii text file. Each line of text will become the title of a new slide. But if the first character in a line of text is a tab, the line will become a first level bullet point on the same slide as the previous title. If two tabs, it indents the text to a second level bullet point and so on.
This becomes the title on slide one
This becomes the title on slide two
<tab>This is a bullet point, first level
<tab><tab>And this is a bullet point, second level
<tab>Back to first level bullet point
And another new slide
Once you have the text file, you can do File Open in PPT, set Files of type to All Files (*.*), and select your .TXT file. Or you can use Insert Slide From File to bring the .TXT file into an existing presentation.
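Generating that outline from a script is straightforward. Here is a minimal sketch in Python (the same few lines translate directly to Ruby); the slide structure and the sample titles are invented for illustration, and level 1 means a first-level bullet.

```python
from typing import List, Tuple

# Each slide: (title, [(level, bullet text), ...]); level 1 = first-level bullet.
Slide = Tuple[str, List[Tuple[int, str]]]


def slides_to_outline(slides: List[Slide]) -> str:
    """Render slides as tab-indented outline text: an unindented line starts
    a new slide, and each leading tab is one bullet level."""
    lines = []
    for title, bullets in slides:
        lines.append(title)
        for level, text in bullets:
            lines.append("\t" * level + text)
    return "\n".join(lines) + "\n"


# Hypothetical programme content.
outline = slides_to_outline([
    ("Welcome", []),
    ("Program", [(1, "Opening remarks"), (2, "Five minutes"), (1, "Dinner")]),
])
print(outline)
```

Write the returned string to a .txt file and open that file as described above.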
There's a limit to the number of slides you can create at one go like this; 100 perhaps?
Note also that VBA disappeared in Mac Office 2008 but is back in Mac Office 2011, so if you can find examples of VB/VBA code that do what you want, you can use them on Mac, so long as it doesn't have to happen in Office 2008.
I was initially dabbling with IFrames to launch a document, and found that for large files, the memory in all browsers (I first noticed this in FF) jumped to 500,000 K.
At first I thought it might have been some bad JS code that I had written, but removing all the extraneous code and just OPENING the text file still displayed the same problem.
So right now, all I'm doing is going to a site http://url/largefile, and seeing the file slowly display to the screen.
Is there any efficient way for me to display the file without the browser exploding? What am I missing here?
EDIT: I've received responses to use a text editor for this purpose. My original goal was to allow a user to click the url, which would append a search term as a post variable. The opened textfile would then scroll to the specified point of the search term. Is there a way to auto-open a text editor ... on that person's computer and then go directly to the search point?
30MB is kind of big, even for a regular text editor, so I suspect you will be unable to convince FF to handle it well. I might try one of the following:
implement paging/searching in your web site so it only displays a portion of the file at one time
open the file in an actual text editor - it's what they are good at after all
Your paging implementation (if suitably clever) might only load the text around the selected piece of the file, and when they scroll up or down use AJAX to load additional parts of the file (kind of like a virtual list control in windows). This might help to mitigate the performance impact.
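The server side of that paging idea might look something like the sketch below: find the byte offset of the search term by scanning the file in chunks, then return only a window of bytes around it. The function names, window sizes and the temporary demo file are all assumptions; wiring this into an actual web handler and the AJAX calls is left out.

```python
import os
import tempfile


def find_term(path: str, term: bytes, chunk_size: int = 1 << 20) -> int:
    """Return the byte offset of the first occurrence of term, or -1.
    Reads chunk by chunk so the whole file is never held in memory."""
    overlap = max(len(term) - 1, 0)
    position = 0          # file offset of the first byte in `buffer`
    buffer = b""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return -1
            buffer += chunk
            index = buffer.find(term)
            if index != -1:
                return position + index
            # Keep only the tail that could start a match spanning two chunks.
            keep = buffer[-overlap:] if overlap else b""
            position += len(buffer) - len(keep)
            buffer = keep


def read_window(path: str, offset: int, before: int = 2048, after: int = 2048) -> bytes:
    """Return the bytes from `before` bytes ahead of offset to `after` bytes past it."""
    start = max(0, offset - before)
    with open(path, "rb") as handle:
        handle.seek(start)
        return handle.read(offset - start + after)


# Demo on a small throwaway file standing in for the large one.
with tempfile.NamedTemporaryFile(delete=False) as demo:
    demo.write(b"a" * 5000 + b"NEEDLE" + b"b" * 5000)
    demo_path = demo.name

offset = find_term(demo_path, b"NEEDLE", chunk_size=4096)
window = read_window(demo_path, offset, before=3, after=9)
missing = find_term(demo_path, b"absent", chunk_size=4096)
os.remove(demo_path)
print(offset, window, missing)
```

The page would then request further windows above or below as the user scrolls, so the browser only ever holds a few kilobytes of the file at a time.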
Is it xml? Firefox tries to create a DOM for xml files that can be many times larger than the file itself.