Decode Identity-H embedded font text while reading a PDF with iText7

I am working on PDF generation with iText7. At the start I set the font to Identity-H encoding, with the font embedded. At some point I need to read the document content back to search for a particular text, but I am not able to decode the Identity-H embedded font text.
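For reference, iText7's parser package decodes Identity-H strings through the font's ToUnicode CMap during extraction; if the output comes back garbled, the embedded font likely lacks a usable ToUnicode map. A minimal sketch of such a search (the file name and search term are placeholders, not from the question):

import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;
import com.itextpdf.kernel.pdf.canvas.parser.listener.LocationTextExtractionStrategy;

public class SearchText {
    public static void main(String[] args) throws Exception {
        try (PdfDocument pdf = new PdfDocument(new PdfReader("generated.pdf"))) {
            for (int i = 1; i <= pdf.getNumberOfPages(); i++) {
                // The strategy decodes Identity-H strings via the font's ToUnicode CMap
                String text = PdfTextExtractor.getTextFromPage(
                        pdf.getPage(i), new LocationTextExtractionStrategy());
                if (text.contains("particular text")) {
                    System.out.println("Found on page " + i);
                }
            }
        }
    }
}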

Related

Converting pdf to image - prevent text output

I know Ghostscript can translate a PDF into PNG.
Can you tell me which lines in the source code to comment out so that blocks of text are simply skipped (ignored) when converting PDF to PNG?
Don't modify the source. Instead, use -dFILTERTEXT, which drops text rather than rendering it.
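For example, an invocation along these lines (device, resolution, and file names are placeholders):
gs -o page-%03d.png -sDEVICE=png16m -r300 -dFILTERTEXT input.pdf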

Change text in PDF

I have a PDF and I want to change its text programmatically: not fonts or colors, just the letters.
I tried
pdf-toolkit - just metadata
prawn - templates not supported any more
combine_pdf - some fonts not supported
Is there easier way to change just text?
Just decode the text inside the PDF file, change it, and encode it back?
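It is not that simple: the text lives in binary content streams, not XML, and with an embedded subset font (e.g. Identity-H) the string operands are font-specific glyph IDs rather than character codes. A made-up fragment for illustration (the hex values are invented):
BT
/F1 12 Tf
72 720 Td
(Hello) Tj                      % simple font: bytes map to characters
ET
BT
/F2 12 Tf
72 700 Td
<002B0048005100510052> Tj       % Identity-H: 2-byte glyph IDs, meaningless without the font's CMap
ET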

Using Sphinx to generate a PDF via LaTeX, how to change the chapter font size and section font size?

When I use Sphinx to generate a PDF, I do not know which package to set up to change the chapter font size, the section font size, and the main text font size.
How can I do this?
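One possible approach, sketched here as an assumption rather than a tested recipe: inject a sectioning package through latex_elements in conf.py (sectsty is one choice of package; the size commands are up to you):

# conf.py
latex_elements = {
    'pointsize': '11pt',          # base body font size
    'preamble': r'''
\usepackage{sectsty}
\chapterfont{\huge}    % chapter heading size
\sectionfont{\Large}   % section heading size
''',
}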

Scanned Image/PDF to Searchable Image/PDF

Can anyone suggest how to convert a scanned image into a searchable image, or a scanned PDF into a searchable PDF?
I have been stuck on this for quite a while now.
I have tried the pdfocr application on Ubuntu, but with no success.
Tesseract version 3.03 supports creating searchable PDFs from images. For a PDF input, you can use Ghostscript to convert it to images before sending them to Tesseract.
https://github.com/tesseract-ocr/tesseract
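For example (resolution and file names are placeholders):
gs -o page-%03d.png -sDEVICE=png16m -r300 scan.pdf
tesseract page-001.png page-001 pdf
The second command writes page-001.pdf with the recognized text as a hidden, selectable layer over the image.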
Currently, there is no right way of doing this on Ubuntu. All OCR engines output plain text, and there is no way to add that text as a hidden layer over the image in the PDF.
Option 1: Use gscan2pdf, which will produce a searchable PDF, but the OCRed text is placed in the top-left corner of the page, invisible and much too small.
Option 2: Use PDF X-Change Viewer, which has an OCR option and works correctly, adding a text layer over the scanned image that lines up with it. You'll have to run it in Wine, because it is a Windows application.

How to save text file in UTF-8 format using pdftotext

I am using the open-source pdftotext tool to convert PDFs to text files. How can I save the text files in UTF-8 format so that I retain all the accented characters? I am using the command below, which extracts the content to a text file, but I cannot see any accented characters.
pdftotext -enc UTF-8 book1.pdf book1.txt
Please help me to resolve this issue.
Thanks in advance,
You can get a list of available encodings using the command:
pdftotext -listenc
and pick the right one with the -enc argument. Mine seems to produce UTF-8 by default, i.e. your "UTF-8" is superfluous:
pdftotext -enc UTF-8 your.pdf
You may want to check your locale (LC_ALL, LANG, ...).
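For example, to check the locale and verify what the converter actually produced (the file name is a placeholder):
locale
file book1.txt
file should report something like "UTF-8 Unicode text" when the conversion worked.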
EDIT:
I downloaded the following PDF:
http://www.i18nguy.com/unicode/unicodeexample.pdf
and converted it on a Windows 7 PC (German) with Xpdf 3.02pl5, using the command:
pdftotext.exe -enc UTF-8 unicodeexample.pdf
The text file is definitely UTF-8 encoded, as all characters are displayed correctly. What are you using the text file for? If you're displaying it through a web application, your content encoding might simply be wrong, while the text file has been converted as you wanted it to.
Double-check using either a browser (force the encoding in Firefox to ISO-8859-1 and UTF-8) or using a hex editor.
Things are getting a little bit messy, so I'm adding another answer.
I took the PDF apart and my best guess would be a "problem" with the font used:
open the PDF file in Acrobat Reader
select all the text on the page
copy it and paste it into a Unicode-aware text editor (there's no "hidden" OCR, so you're copying actual data)
You'll see that the codepoints you end up with aren't the ones you see in the PDF reader. Whatever the font is, its mapping may differ from the one defined in the Unicode standard. As such, your content is "wrong" and there's not much you can do about it.
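This is what a font's ToUnicode CMap is supposed to prevent: it maps the font's internal codes back to Unicode, and when it is missing or wrong, extraction returns the wrong codepoints. A fragment of such a CMap might look like this (the code/Unicode pairs below are invented for illustration):
3 beginbfchar
<0003> <0020>
<0024> <0041>
<0048> <0065>
endbfchar
Each line pairs a glyph code used in the content stream with the Unicode value it should decode to.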
