As in the title: imagine a GIMP .xcf file containing many layers, some of which contain text. Is there any format I can export the .xcf file to that preserves the text in 'human readable' form?
The final goal is to process that text and put it back into the file. I am aware that this sounds unusual, but maybe some of you have an idea how to achieve a scenario like that.
I did some research and saw that I can export the image to .psd format and then use an NPM package to process that image and extract the text. This only partially solves the problem, because I would not know how to put the processed text back into the .psd file (unless I decompile the NPM package and try to write an implementation myself...).
Any solutions and alternatives are highly appreciated.
You can script Gimp (using Scheme or Python). Technically you cannot change the text in a layer (there is no API for that), but you can recover the characteristics of a text layer (original text, font type, font size...) and recreate a new layer with a new text. Here is some Python code to recover the text information:
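from gimpfu import pdb  # already defined in GIMP's Python-Fu console

def text_info(img, layer):
    # A saved text layer carries its settings in the 'gimp-text-layer' parasite.
    parasites = None
    try:
        parasites = layer.parasite_list()
    except Exception:
        pass  # treat a failed lookup as "no parasites"
    if parasites and 'gimp-text-layer' in parasites:
        data = layer.parasite_find('gimp-text-layer').data
        pdb.gimp_message('Text layer "%s": %s' % (layer.name, data))
    else:
        pdb.gimp_message('No text information found for layer "%s"' % layer.name)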
(this information is only present if the file has been saved; it is not available on a newly created layer, but this shouldn't be a problem in your case)
Of course if the text is in a plain bitmap layer of its own this cannot be done, you have to guess the font type & size (but sometimes the code above can still recover the text information)
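Once you have the parasite data, you still need to pick the fields out of it. As far as I can tell it is plain text made of parenthesised key/value settings, one per line. Here is a minimal sketch of a parser under that assumption; the sample string is made up for illustration (not copied from a real file), parse_text_parasite is a hypothetical helper name, and the escape handling is a guess. It runs in plain Python, outside GIMP.

```python
import re

# Made-up sample in the assumed format; real parasite data may differ.
SAMPLE = (
    '(text "Hello \\"world\\"")\n'
    '(font "Sans")\n'
    '(font-size 18.000000)\n'
    '(color (color-rgb 0.000000 0.000000 0.000000))\n'
)

# One "(key value)" setting per line; value is a quoted string or a bare token.
PAIR = re.compile(
    r'^\((?P<key>[\w-]+)\s+(?P<value>"(?:[^"\\]|\\.)*"|[^()\s]+)\)\s*$',
    re.MULTILINE,
)
# Assumed escape sequences inside quoted strings.
ESCAPES = {"n": "\n", "t": "\t", "r": "\r"}


def parse_text_parasite(data):
    """Return simple settings as a dict; nested values such as color are skipped."""
    fields = {}
    for match in PAIR.finditer(data):
        value = match.group("value")
        if value.startswith('"'):
            # Strip the quotes and undo backslash escapes.
            value = re.sub(
                r"\\(.)",
                lambda m: ESCAPES.get(m.group(1), m.group(1)),
                value[1:-1],
            )
        fields[match.group("key")] = value
    return fields


if __name__ == "__main__":
    for key, value in sorted(parse_text_parasite(SAMPLE).items()):
        print("%s = %s" % (key, value))
```

With the text, font and size in hand you can process the text and feed the values into whatever creates the replacement layer.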
But if your XCF has a simple structure, it can be a lot simpler to decompose it into individual images, and build a new image with ImageMagick, using some of these layers plus new text images (or directly rendered text).
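For the ImageMagick route, here is a rough sketch of building the command that draws new text onto an exported layer image. The file names, font name and coordinates are placeholders, annotate_command is a hypothetical helper, and the binary is assumed to be "convert" (ImageMagick 7 installs usually call it "magick"). The function only builds the argument list; running it is left to you.

```python
import subprocess


def annotate_command(base_png, text, out_png, font="DejaVu-Sans",
                     pointsize=24, x=10, y=40, fill="black", binary="convert"):
    """Build an ImageMagick command that draws `text` at offset (x, y) on base_png."""
    return [
        binary, base_png,
        "-font", font,
        "-pointsize", str(pointsize),
        "-fill", fill,
        "-annotate", "+%d+%d" % (x, y), text,
        out_png,
    ]


def render(base_png, text, out_png, **options):
    """Run the command; requires ImageMagick to be installed."""
    subprocess.run(annotate_command(base_png, text, out_png, **options), check=True)


if __name__ == "__main__":
    # Placeholder file names; prints the command instead of running it.
    print(" ".join(annotate_command("background.png", "New caption", "result.png")))
```

Passing the arguments as a list avoids shell quoting problems when the text contains spaces or quotes.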
I am trying to do exactly what is described in the following thread:
AppleScript/Automator: renaming PDF with extracted text content of this PDF
So I am using Chino22's version, and there are two issues with it:
First, instead of the contents of the PDF, theFileContentsText gets some metadata stuff.
Second, although the script runs to the end, I get the following error for the last step:
error "The variable thisFile is not defined." number -2753 from "thisFile"
So, how do I get the text contents instead, and how do I define thisFile to the current pdf that is being processed in the loop?
Thanks in advance!
I would not expect the linked script to work.
Except for document metadata, extracting text content from PDF is notoriously difficult and unreliable, and not a road you want to go down if you can possibly avoid it. Adobe’s PDF file format is designed for printing, not for data processing. PDF files contain blocks of Postscript-like page drawing instructions, typically compressed, and while it’s possible for PDFs also to include the original plain text for accessibility use, most PDF generators do not do this so the only way to get the original text is by reconstructing it from those low-level drawing instructions—not a trivial job.
AppleScript’s read command only reads that raw file data; it does not parse it into drawing instructions, never mind translating those drawing instructions back into plain text. Change a PDF file’s extension to .txt and open it in a plain text editor, and you’ll see what I mean. Nasty.
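To make that concrete, here is a toy sketch in Python (not AppleScript) of what "parsing" means even in the simplest case: find a compressed stream, inflate it, and look for the text-showing operator. Everything here is simplified on purpose: make_toy_stream_object and shown_strings are hypothetical helpers, the stream is a hand-made fragment rather than a complete PDF, and real files add font encodings, hex strings, TJ arrays, indirect lengths and more, which this ignores.

```python
import re
import zlib


def make_toy_stream_object(content):
    """Wrap drawing instructions in a Flate-compressed stream, as PDF writers usually do."""
    packed = zlib.compress(content)
    header = (b"<< /Length " + str(len(packed)).encode("ascii")
              + b" /Filter /FlateDecode >>\nstream\n")
    return header + packed + b"\nendstream"


# Simplified: assumes /Length is a literal number inside the stream dictionary.
STREAM = re.compile(rb"/Length (\d+)[^>]*>>\s*stream\r?\n")
# "(some text) Tj" is the basic show-text instruction.
SHOWN_TEXT = re.compile(rb"\(((?:[^()\\]|\\.)*)\)\s*Tj")


def shown_strings(pdf_bytes):
    """Return literal strings shown with Tj in Flate-compressed streams."""
    found = []
    for match in STREAM.finditer(pdf_bytes):
        start = match.end()
        packed = pdf_bytes[start:start + int(match.group(1))]
        try:
            drawing = zlib.decompress(packed)
        except zlib.error:
            continue  # not Flate data, or a filter this sketch does not handle
        found.extend(s.decode("latin-1") for s in SHOWN_TEXT.findall(drawing))
    return found


if __name__ == "__main__":
    toy = make_toy_stream_object(b"BT /F1 12 Tf 72 720 Td (Invoice 2041) Tj ET")
    print(shown_strings(toy))
```

Even when this works, the strings come back in drawing order and in the font's own encoding, which is why a dedicated tool is the saner choice.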
If you need to work with the PDF’s original content (text, images, whatever), your best solution is to get those files before they were converted into a PDF.
If you must extract content from a PDF file, use an existing tool that knows how to do it.
For instance, if you’re lucky enough to have PDFs that contain XFDF (XML form) or accessibility data, there are 3rd-party apps and libraries to extract that content in readable form. I can’t think offhand of any that are AppleScriptable (Adobe Acrobat has only minimal AS support) so you’ll probably need to find one you can run from command line (do shell script in AS).
Or, if the PDFs have a consistent visual structure, a 3rd-party library such as Python’s PDFMiner (which I’ve used in the past) can identify blocks of characters by position and convert those back into strings with varying degrees of reliability (it has to convert font glyphs back into Unicode characters, guess at which characters are close enough to constitute a word, and where to insert space and return characters between those words). You’ll have to write some Python code to extract the bits you want, so look for tutorials to get started (or pay someone to write it for you).
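To give an idea of the guessing involved, here is a minimal sketch of the position-based grouping step. This is not PDFMiner's actual algorithm or API; the Glyph tuple, the thresholds and glyphs_to_text are all made up for illustration, and it assumes the glyphs have already been mapped to Unicode characters.

```python
from collections import namedtuple

# One positioned character: x0/x1 are its left/right edges, y its baseline.
Glyph = namedtuple("Glyph", "char x0 x1 y")


def glyphs_to_text(glyphs, word_gap=1.5, line_tol=2.0):
    """Rebuild text from positioned characters using simple distance thresholds."""
    lines = []  # list of (baseline, [glyphs]) from top of page to bottom
    # PDF y coordinates grow upwards, so larger y comes first.
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x0)):
        if lines and abs(lines[-1][0] - g.y) <= line_tol:
            lines[-1][1].append(g)
        else:
            lines.append((g.y, [g]))

    rebuilt = []
    for _, row in lines:
        row.sort(key=lambda g: g.x0)
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            # A gap wider than word_gap is taken to be a space between words.
            if cur.x0 - prev.x1 > word_gap:
                text += " "
            text += cur.char
        rebuilt.append(text)
    return "\n".join(rebuilt)


if __name__ == "__main__":
    page = [
        Glyph("o", 0.0, 5.0, 680.0), Glyph("k", 5.2, 9.0, 680.0),
        Glyph("H", 0.0, 5.0, 700.0), Glyph("i", 5.5, 7.0, 700.0),
        Glyph("y", 12.0, 17.0, 700.0), Glyph("o", 17.5, 22.0, 700.0),
    ]
    print(glyphs_to_text(page))
```

The thresholds have to be tuned per document, which is where the "varying degrees of reliability" comes from.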
But again, if you can possibly avoid having to extract text from PDF, you should. You will save yourself a lot of trouble.
Given a rectangle that represents an area on a Windows screen that contains text, what is the best way to extract the text?
I know that it is possible using OCR, but even after significant preprocessing, the quality is really poor.
Getting the window text using the Win32 API does not always work either.
Assuming that the text was rendered using a font, is it possible to get it from there?
Any directions would be extremely helpful. Thanks!
Given a rectangle that represents an area on a Windows screen, the best way to extract text is indeed OCR. Use a better OCR library like this one from Microsoft.
The reason getting the window text using the Win32 API does not work well is that there may be multiple windows in that rectangle. You will have to find out which windows the rectangle contains and send a message to each one to get its text. It is not impossible, but it is difficult, and even if you manage it you will run into issues with text alignment, etc. OCR is your best option.
It does seem possible without using OCR, as NirSoft SysExporter can do this:
https://www.nirsoft.net/utils/sysexp.html
This may be suitable for programmatic use as it can be run from a command line:
Starting from version 1.70, you can export the content of Windows
control from command-line, without displaying any user interface.
You may not be able to target it at a specific rectangle on the screen, but maybe the same result could be achieved by first scraping everything followed by some post-processing.
Further basic info:
SysExporter utility allows you to grab the data stored in standard
list-views, tree-views, list boxes, combo boxes, text-boxes, and
WebBrowser/HTML controls from almost any application running on your
system, and export it to text, HTML or XML file.
...
Known Limitations
SysExporter can export data from most combo boxes, list boxes,
tree-view, and list-view controls, but not from all of them. There are
some applications that use these controls to display data, but the
data itself is not actually stored in the control, but in another
location in the computer's memory. In such cases, SysExporter won't be
able to export the data.
Personally I've used it to grab text from what look like label controls.
While in class I like to take handwritten notes; afterwards I scan them and then type them up (it helps me remember them and also makes them easily searchable). The main issue I have is that I use A LOT of drawings and complex math. Converting the math formulas into LaTeX (or Word) is very time consuming, and the drawings require that I keep both the PDF and the text document. What I would like to do is take the basic text that I have typed myself (no OCR) and add it as a text layer to the PDFs. That way the PDFs will be searchable and I can save a lot of time by not converting the math or drawings.
I've looked into Preview, PDFpenPro, Acrobat and a couple of Linux programs, but so far I haven't really found anything that will do this.
Any idea of how I could do this or a program to use?
I also scan my notes. Sometimes I go back and add some text to them using this technique:
Open up the scanned PDF in Preview, then click on the "Edit" button in the top right corner, then the "Text tools" button on the left side (it's a little box with Aa in it). From there you can drag open a text box and type into it.
Now the secret trick: if you save it as it is and try to open it on your iPad using PDFExpert or some other program, the text might not be there. Here's how to get past that slight hiccup: after you've annotated your notes how you want, instead of just saving as a PDF, use the Print option (File->Print or Command+P), then click the PDF button on the left to "Save it as a pdf". Now that it's printed, you can open it and search it in any program that reads PDFs. Attached is an example.
One other thing, it seems like maybe you want to write over your existing handwritten text with typed text? I'm not sure if this is the best way. But if that's what I was trying to do I would:
1. Scan my notes
2. Read through them, typing them up as you said
3. Open the scanned notes in Photoshop or some other program
4. Draw a giant White Fill White Stroke rectangle over the handwritten text
5. Save it as a pdf
6. Do the technique above and copy and paste the typed text from step 2.
I hope this helps. And I wish you luck, I'm still working out the kinks myself for scanned notes but the possibilities have me pretty excited!
EDIT: I just checked out PDFpenPro, which I highly recommend because you don't have to go through that printing trick, you can just save the pdf document after annotating and other programs will recognize the annotations.
I've read a lot about PDF extraction and libraries (such as iText), but I just haven't found a solution to extract images and text (with coordinates) from a PDF.
The task is to scan PDF with catalog of products and extract each image. There is an image code printed next to each image and also a list of product codes for products that are shown on the image.
I know that there is no way to extract structured info from a PDF like this but with coordinates of all image and text objects I could write code to identify linked text by its distance from the image. Then I could split text using a RegExp and find out what is a product code, what is an image code etc.
Could you recommend a good and working solution for the task?
Use XPDF (http://www.foolabs.com/xpdf/)
It can extract all the characters in the PDF with co-ordinates (pdftotext -bbox [sourcefile] [outputfile]) and also all the images and SVGs in the PDF.
It's open source (GPLv2) and supports a lot of additional extraction functionalities as well.
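Once you have coordinates, the matching idea from the question could be sketched roughly like this. It assumes the -bbox output is XHTML with one word element per word carrying xMin/yMin/xMax/yMax attributes (check what your version actually writes). The sample data, the image boxes, the IMG-/P- code patterns and all function names are invented for illustration, and centre-to-centre distance is a deliberately crude measure.

```python
import math
import re
import xml.etree.ElementTree as ET

# Invented sample in the assumed -bbox layout.
SAMPLE_BBOX = """<html xmlns="http://www.w3.org/1999/xhtml"><body><doc>
<page width="595.0" height="842.0">
<word xMin="60.0" yMin="210.0" xMax="110.0" yMax="222.0">IMG-001</word>
<word xMin="60.0" yMin="226.0" xMax="100.0" yMax="238.0">P-1234</word>
<word xMin="320.0" yMin="210.0" xMax="370.0" yMax="222.0">IMG-002</word>
<word xMin="320.0" yMin="226.0" xMax="360.0" yMax="238.0">P-5678</word>
</page></doc></body></html>"""

# Hypothetical image boxes (x0, y0, x1, y1), e.g. from a separate image extraction step.
IMAGES = {"left": (50.0, 50.0, 250.0, 200.0), "right": (310.0, 50.0, 510.0, 200.0)}


def centre(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def read_words(xhtml):
    """Return (text, box) for every word element, ignoring the XML namespace."""
    words = []
    for el in ET.fromstring(xhtml).iter():
        if el.tag.endswith("word"):
            box = tuple(float(el.get(k)) for k in ("xMin", "yMin", "xMax", "yMax"))
            words.append((el.text or "", box))
    return words


def distance(box_a, box_b):
    (ax, ay), (bx, by) = centre(box_a), centre(box_b)
    return math.hypot(ax - bx, ay - by)


def group_by_nearest_image(words, images):
    """Attach each word to the image whose centre is closest."""
    groups = {name: [] for name in images}
    for text, box in words:
        nearest = min(images, key=lambda name: distance(box, images[name]))
        groups[nearest].append(text)
    return groups


def classify(texts):
    """Split words into image codes and product codes using made-up patterns."""
    result = {"image_codes": [], "product_codes": []}
    for text in texts:
        if re.fullmatch(r"IMG-\d+", text):
            result["image_codes"].append(text)
        elif re.fullmatch(r"P-\d+", text):
            result["product_codes"].append(text)
    return result


if __name__ == "__main__":
    groups = group_by_nearest_image(read_words(SAMPLE_BBOX), IMAGES)
    for name in sorted(groups):
        print(name, classify(groups[name]))
```

For a real catalogue you would probably want edge-to-edge distance and a maximum cut-off, so that stray text is not attached to a far-away image.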
Several Java libraries can do this. Have you looked at JPedal or PdfBox?
If a commercial library is an option for you, you could try Amyuni PDF Creator .Net or Amyuni PDF Creator ActiveX. You could use the method IacDocument.GetObjectsInRectangle to retrieve all the "graphic objects" of your interest, then use the ObjectType attribute to separate images from text. The library already provides an algorithm for putting close text together. From the documentation:
IacDocument.GetObjectsInRectangle Method
The GetObjectsInRectangle method gets all the objects that are in the specified rectangle.
Usual disclaimer applies.
Does anyone know how to extract the character images from a font (TTF) file?
TTF is a vector format, so there are no character images in it, really. Load the font, select it into a device context (a memory one), render a character, grab a bitmap.
Relevant APIs: AddFontResource, CreateFont, CreateDC, CreateBitmap, SelectObject, TextOut (or DrawText).
You can use GetGlyphOutline with GGO_BEZIER to get the shape of a single character.
For the sake of completeness I'd like to add a GUI and Python way to this pretty old thread.
If the goal is to extract images (e.g. as PNG) from a .ttf file, I found two pretty straightforward ways, which both involve the open-source program FontForge (Link to their website):
GUI Way (Suitable for extracting a handful of characters): Open the .ttf file in FontForge and click on the character you want to export. Then: file -> export -> format:png
CLI / Python Way (Suitable for automation): FontForge has a CLI API for Python 2.7 which allows you to automate the extraction of the images. Refer to this superuser thread for a complete script.
Link 1: https://fontforge.org/en-US/
Link 2: https://superuser.com/questions/1337567/how-do-i-convert-a-ttf-into-individual-png-character-images
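For reference, a rough sketch of what the Python way can look like. It is not the script from the linked thread; png_name and export_glyphs are names I made up, the pixel size is arbitrary, and it has to be run with a Python that has FontForge's own fontforge module available (it does not come from the standard library).

```python
import os
import re


def png_name(glyph_name, codepoint):
    """Build a file name from the code point so 'A' and 'a' cannot collide."""
    safe = re.sub(r"[^A-Za-z0-9_.-]", "_", glyph_name)
    prefix = "u%04X" % codepoint if codepoint >= 0 else "unencoded"
    return "%s_%s.png" % (prefix, safe)


def export_glyphs(font_path, out_dir, pixel_size=128):
    """Export every drawable glyph of the font as a PNG; returns the count."""
    import fontforge  # FontForge's scripting module, imported here on purpose

    font = fontforge.open(font_path)
    count = 0
    for glyph in font.glyphs():
        if not glyph.isWorthOutputting():
            continue  # skip empty slots
        target = os.path.join(out_dir, png_name(glyph.glyphname, glyph.unicode))
        glyph.export(target, pixel_size)
        count += 1
    font.close()
    return count


if __name__ == "__main__":
    print(png_name("A", 0x41))
```

Call export_glyphs with the path to your .ttf and an existing output directory.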