How to fetch data from a PDF file - algorithm

I would like to know if there is a way to extract only the relevant data from a PDF file. Suppose the file contains something like Name:John. Can we automate taking just this field's value so that it can be stored somewhere, such as a predefined database or file? Thanks.

Use pdftotext to extract the text content from your PDF file, then parse the resulting text file with your favorite programming language.
If your PDF contains only images of text rather than real text, you will need optical character recognition (OCR) software to extract the text.
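The two steps above can be sketched in Python as follows. This is only a minimal sketch: it assumes the pdftotext command-line tool is installed and on the PATH, and that each field sits on its own line in the form Label:Value. The names pdf_to_text, parse_fields and FIELD_RE are made up for this illustration.

```python
import re
import subprocess

# Assumed field layout: one "Label:Value" pair per line, e.g. "Name:John".
FIELD_RE = re.compile(r"^\s*([A-Za-z][\w ]*?)\s*:\s*(.+?)\s*$")


def pdf_to_text(pdf_path):
    """Run pdftotext and return the extracted text ("-" sends output to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def parse_fields(text):
    """Collect every "Label:Value" line of the text into a dictionary."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields
```

Calling parse_fields(pdf_to_text("form.pdf"))["Name"] would then give the value to store, provided the extracted text really has that layout. Real PDFs often need a pattern tuned to the document.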

Related

How to replace text in a Gimp image programmatically

As in the title: imagine a Gimp .xcf file containing many layers, some of which contain text. Is there any format I can export the .xcf file to that preserves the text in 'human readable' form?
The final goal is to process that text and put it back into the file. I am aware that this sounds unusual, but maybe some of you have an idea how to achieve a scenario like that.
I did some research and saw that I can export the image to .psd format and then use an NPM package to process that image and extract the text. This only partially solves the problem, because I would not know how to put the processed text back into the .psd file (unless I decompile the NPM package and try to write an implementation myself...)
Any solutions and alternatives highly appreciated.
You can script Gimp (using Scheme or Python). Technically you cannot change the text in a layer (there is no API for that), but you can recover the characteristics of a text layer (original text, font type, font size...) and recreate a new layer with a new text. Here is some Python code to recover the text information:
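from gimpfu import *  # GIMP 2.x Python-Fu; provides pdb (already defined in the Python-Fu console)

def text_info(img, layer):
    # Text layers keep their settings in a parasite named 'gimp-text-layer'.
    parasites = None
    try:
        parasites = layer.parasite_list()
    except Exception:
        pass  # ignore layers whose parasites cannot be listed
    if parasites and 'gimp-text-layer' in parasites:
        data = layer.parasite_find('gimp-text-layer').data
        pdb.gimp_message('Text layer "%s": %s' % (layer.name, data))
    else:
        pdb.gimp_message('No text information found for layer "%s"' % layer.name)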
(This information is only present if the file has been saved; it is not available on a newly created layer, but that shouldn't be a problem in your case.)
Of course, if the text is in a plain bitmap layer of its own, this cannot be done and you have to guess the font type and size (though sometimes the code above can still recover the text information).
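To reuse the recovered characteristics, the parasite data has to be split into its fields. The sketch below assumes the data is a series of one-line entries such as (text "Hello") and (font-size 18.000000), which is how it typically appears. The sample string is made up, parse_text_parasite is a hypothetical helper, and text spanning several lines is not handled.

```python
import re

# Assumed entry shape: "(key value)" on a single line.
ENTRY_RE = re.compile(r'^\(([\w-]+)\s+(.*)\)$')


def parse_text_parasite(data):
    """Turn 'gimp-text-layer' parasite data into a {key: value} dictionary."""
    info = {}
    for line in data.splitlines():
        match = ENTRY_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a one-line entry
        key, value = match.group(1), match.group(2)
        # Quoted values are strings: drop the quotes and unescape inner quotes.
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1].replace('\\"', '"')
        info[key] = value
    return info
```

The resulting dictionary (text, font, font-size and so on) holds what is needed to create the replacement text layer.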
But if your XCF has a simple structure, it can be a lot simpler to decompose it into individual images, and build a new image with ImageMagick, using some of these layers plus new text images (or directly rendered text).

AppleScript: renaming PDF with content of PDF

I am trying to do exactly what is described in the following thread:
AppleScript/Automator: renaming PDF with extracted text content of this PDF
I am using Chino22's version, and there are two issues with it:
First, instead of the contents of the PDF, theFileContentsText gets some metadata stuff.
Second, although the script runs to the end, I get the following error for the last step:
error "The variable thisFile is not defined." number -2753 from "thisFile"
So, how do I get the text contents instead, and how do I define thisFile to the current pdf that is being processed in the loop?
Thanks in advance!
I would not expect the linked script to work.
Except for document metadata, extracting text content from PDF is notoriously difficult and unreliable, and not a road you want to go down if you can possibly avoid it. Adobe's PDF file format is designed for printing, not for data processing. PDF files contain blocks of PostScript-like page drawing instructions, typically compressed. While it's possible for PDFs also to include the original plain text for accessibility use, most PDF generators do not do this, so the only way to get the original text is to reconstruct it from those low-level drawing instructions, which is not a trivial job.
AppleScript’s read command only reads that raw file data; it does not parse it into drawing instructions, never mind translating those drawing instructions back into plain text. Change a PDF file’s extension to .txt and open it in a plain text editor, and you’ll see what I mean. Nasty.
If you need to work with the PDF’s original content (text, images, whatever), your best solution is to get those files before they were converted into a PDF.
If you must extract content from a PDF file, use an existing tool that knows how to do it.
For instance, if you’re lucky enough to have PDFs that contain XFDF (XML form) or accessibility data, there are 3rd-party apps and libraries to extract that content in readable form. I can’t think offhand of any that are AppleScriptable (Adobe Acrobat has only minimal AS support) so you’ll probably need to find one you can run from command line (do shell script in AS).
Or, if the PDFs have a consistent visual structure, a 3rd-party library such as Python’s PDFMiner (which I’ve used in the past) can identify blocks of characters by position and convert those back into strings with varying degrees of reliability (it has to convert font glyphs back into Unicode characters, guess at which characters are close enough to constitute a word, and where to insert space and return characters between those words). You’ll have to write some Python code to extract the bits you want, so look for tutorials to get started (or pay someone to write it for you).
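To give a feel for the guesswork involved, here is a toy sketch of the word-grouping step on its own. It is not PDFMiner's code or API: group_into_words, the (x0, x1, character) tuples and the max_gap threshold are all made up for illustration, and the characters are assumed to sit on a single line of text.

```python
def group_into_words(chars, max_gap=1.0):
    """Group positioned characters into words by horizontal distance.

    chars: list of (x0, x1, character) tuples for one line of text, where
    x0 and x1 are the left and right edges of the glyph (hypothetical input).
    A gap wider than max_gap between two glyphs is treated as a word break.
    """
    words = []
    current = ""
    prev_right = None
    for x0, x1, ch in sorted(chars):  # left-to-right reading order
        if prev_right is not None and x0 - prev_right > max_gap:
            words.append(current)
            current = ""
        current += ch
        prev_right = x1
    if current:
        words.append(current)
    return words
```

Choosing max_gap is the unreliable part: too small and words are split at kerning gaps, too large and neighbouring words are merged.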
But again, if you can possibly avoid having to extract text from PDF, you should. You will save yourself a lot of trouble.

Extract Images and Words with coordinates and sizes from PDF

I've read a lot about PDF extraction and libraries (such as iText), but I just haven't found a solution to extract images and text (with coordinates) from a PDF.
The task is to scan PDF with catalog of products and extract each image. There is an image code printed next to each image and also a list of product codes for products that are shown on the image.
I know that there is no way to extract structured info from a PDF like this but with coordinates of all image and text objects I could write code to identify linked text by its distance from the image. Then I could split text using a RegExp and find out what is a product code, what is an image code etc.
Could you recommend a good and working solution for the task?
Use XPDF (http://www.foolabs.com/xpdf/)
It can extract all the characters in the PDF with co-ordinates (pdftotext -bbox [sourcefile] [outputfile]) and also all the images and SVGs in the PDF.
It's open source (GPLv2) and supports a lot of additional extraction functionalities as well.
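Once the words and their boxes are available, the distance-based matching described in the question can be sketched as below. The sketch assumes the -bbox output is XHTML in which each word is a word element with xMin, yMin, xMax and yMax attributes (the layout Poppler's pdftotext produces), so check it against the output of your own build. The image rectangle is assumed to come from elsewhere and to use the same coordinate system. All function names are made up for illustration.

```python
import math
import xml.etree.ElementTree as ET


def parse_bbox_words(xhtml):
    """Return [(text, (xMin, yMin, xMax, yMax)), ...] from -bbox style output."""
    root = ET.fromstring(xhtml)
    words = []
    for element in root.iter():
        # Tags may carry an XHTML namespace prefix such as "{...}word".
        if element.tag.rsplit("}", 1)[-1] != "word":
            continue
        box = tuple(float(element.get(k)) for k in ("xMin", "yMin", "xMax", "yMax"))
        words.append((element.text or "", box))
    return words


def center(box):
    """Center point of an (xMin, yMin, xMax, yMax) rectangle."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def words_near(words, image_box, max_distance):
    """Words whose center lies within max_distance of the image center, nearest first."""
    cx, cy = center(image_box)
    hits = []
    for text, box in words:
        wx, wy = center(box)
        distance = math.hypot(wx - cx, wy - cy)
        if distance <= max_distance:
            hits.append((distance, text))
    return [text for _, text in sorted(hits)]
```

The words returned for each image can then be run through a regular expression to tell image codes from product codes.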
Several Java libraries can do this. Have you looked at JPedal or PdfBox?
If a commercial library is an option for you, you could try Amyuni PDF Creator .Net or Amyuni PDF Creator ActiveX. You could use the method IacDocument.GetObjectsInRectangle to retrieve all the "graphic objects" of your interest, then use the ObjectType attribute to separate images from text. The library already provides an algorithm for putting close text together. From the documentation:
IacDocument.GetObjectsInRectangle Method
The GetObjectsInRectangle method gets all the objects that are in the specified rectangle.
Usual disclaimer applies.

Is there an AutoCAD export format that is programmatically readable?

I would like to extract the coordinates (latitude, longitude) and some properties, such as name and colour, from AutoCAD files. I may do this from a Java program.
Which format should I export to from AutoCAD so that I can programmatically parse the file, look for objects and get their properties (coordinates, name, colour...)?
I know AutoCAD's DWG format is a proprietary binary format that changes every 3 years, so I need an export format that I can read easily.
Thanks!
DXF is what you're looking for. It's a documented format for drawing exchange in plain text.
http://images.autodesk.com/adsk/files/acad_dxf0.pdf
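An ASCII DXF file is a sequence of two-line pairs: a numeric group code followed by its value. A minimal reading sketch, shown in Python for brevity (the same loop ports directly to Java), could look like this. It only handles the ENTITIES section and a handful of group codes (0 entity type, 8 layer, 62 colour number, 10/20/30 first point), and the function names are made up for illustration.

```python
def read_pairs(text):
    """Yield (group_code, value) pairs from ASCII DXF text."""
    lines = text.splitlines()
    for i in range(0, len(lines) - 1, 2):
        yield int(lines[i].strip()), lines[i + 1].strip()


def read_entities(text):
    """Return a list of dictionaries, one per entity in the ENTITIES section."""
    entities = []
    section = None
    expect_section_name = False
    current = None
    for code, value in read_pairs(text):
        if code == 0:  # group code 0 starts a new entity or structural marker
            if current is not None:
                entities.append(current)
                current = None
            if value == "SECTION":
                expect_section_name = True
            elif value == "ENDSEC":
                section = None
            elif section == "ENTITIES":
                current = {"type": value}
            continue
        if expect_section_name and code == 2:
            section = value
            expect_section_name = False
        elif current is not None:
            if code == 8:
                current["layer"] = value
            elif code == 62:
                current["color"] = int(value)  # AutoCAD colour index
            elif code in (10, 20, 30):
                current["xyz"[code // 10 - 1]] = float(value)
    if current is not None:
        entities.append(current)
    return entities
```

Note that the coordinates are in drawing units. They are only latitude and longitude if the drawing was created that way.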
DXF is definitely the best format if you're not using a vertical product like Civil or Map. Some vertical products can also export SHP/SHX (shape) files.
Here's references for DXF: http://en.wikipedia.org/wiki/AutoCAD_DXF, http://images.autodesk.com/adsk/files/acad_dxf0.pdf
Here's a SHP reference: http://en.wikipedia.org/wiki/Shapefile

Converting image containing text to editable text

I have a PDF file that was scanned from a hard copy, so the PDF contains an image of the hard copy. When I try to convert the PDF to Word, I don't get an editable document; I get an image sitting in the Word document. Is there any way I can make an editable Word document out of it? Is there any software that will help me do that?
It's called optical character recognition (OCR).
There are lots of software packages that do this. To do it from a program, try http://code.google.com/p/tesseract-ocr/
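As a rough sketch, Tesseract can be driven from Python through its command-line tool. This assumes the tesseract executable is installed and on the PATH, and that each page of the scanned PDF has already been saved as an image file (for example a PNG), since the sketch feeds it an image rather than the PDF itself. The function names are made up for illustration.

```python
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build the command line; "stdout" makes tesseract print the text."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run tesseract on one page image and return the recognised text."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

The returned text can then be pasted or written into an editable document. The original layout is not preserved.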
