How to save text file in UTF-8 format using pdftotext

I am using the open-source pdftotext tool to convert PDFs to text files. How can I save the text files in UTF-8 format so that all the accented characters are retained? I am using the command below, which extracts the content to a text file, but I am not able to see any accented characters.
pdftotext -enc UTF-8 book1.pdf book1.txt
Please help me resolve this issue.
Thanks in advance.

You can get a list of available encodings using the command:
pdftotext -listenc
and pick the right one using the -enc argument. Mine here seems to do UTF-8 by default, i.e. your "UTF-8" is superfluous:
pdftotext -enc UTF-8 your.pdf
You may want to check your locale (LC_ALL, LANG, ...).
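For example, on a Unix-like system (en_US.UTF-8 is an assumption here; use any UTF-8 locale installed on your machine):
locale                                                       # shows LC_ALL, LC_CTYPE, LANG, ...
LC_ALL=en_US.UTF-8 pdftotext -enc UTF-8 book1.pdf book1.txt  # force a UTF-8 locale for this run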
EDIT:
I downloaded the following PDF:
http://www.i18nguy.com/unicode/unicodeexample.pdf
and converted it on a Windows 7 PC (german) and XPDF 3.02PL5 using the command:
pdftotext.exe -enc UTF-8 unicodeexample.pdf
The text file is definitely UTF-8 encoded, as all characters are displayed correctly. What are you using the text file for? If you're displaying it through a web application, your content encoding might simply be wrong, while the text file itself has been converted exactly as you wanted.
Double-check using either a browser (force the encoding in Firefox to ISO-8859-1 and UTF-8) or using a hex editor.
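From the command line, a quick check might look like this (c3 a9 is the UTF-8 byte sequence for é):
file book1.txt                # should report something like "UTF-8 Unicode text"
hexdump -C book1.txt | head   # accented characters appear as multi-byte sequences, e.g. c3 a9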

Things are getting a little bit messy, so I'm adding another answer.
I took the PDF apart and my best guess would be a "problem" with the font used:
open the PDF file in Acrobat Reader
select all the text on the page
copy it and paste it into a Unicode-aware text editor (there's no "hidden" OCR, so you're copying actual data)
You'll see that the codepoints you end up with aren't the ones you're seeing in the PDF reader. Whatever the font is, it may have a mapping different from the one defined in the Unicode standard. As such, your content is "wrong" and there's not much you can do about it.
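If you want to see exactly which codepoints the extraction produces, you can dump the raw bytes (a quick sketch for a Unix-like shell; the 200-byte limit is arbitrary):
pdftotext -enc UTF-8 unicodeexample.pdf - | head -c 200 | hexdump -C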

Related

How to use ImageMagick with Chinese fonts on Text to Image Handling

I am trying to use ImageMagick to render a Chinese character to an image on my MacBook.
I used the following command to check the Chinese fonts available on my system:
convert -list font | grep Font
I did not get any.
Judging from the ImageMagick guide Text to Image Handling, Chinese fonts such as ZenKaiUni seem to be supported.
And the Font Book application on my MacBook shows plenty of Chinese fonts.
So the fonts are there. How can I get ImageMagick to use them?
You can either tell ImageMagick about all the fonts on your system (by generating a type.xml file that lists them, which a helper script can do), and then they will show up when you run:
convert -list font
Then you can use shorthand:
convert -font Arial ...
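For example, once a font shows up in the list, a complete invocation might look like this (Arial stands in for whatever name convert -list font reports on your system):
convert -font Arial -pointsize 64 label:"Hello" arial.png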
Or, you can just tell ImageMagick the full path to any font on a per-invocation basis:
printf "Hello" | convert -pointsize 72 \
-font "/Applications/iMovie.app/Contents/Frameworks/Flexo.framework/Versions/A/Resources/Fonts/Zingende Regular.ttf" \
label:#- result.png
You would probably put Unicode for Chinese characters in place of my "Hello".
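For instance, a minimal sketch, assuming a UTF-8 terminal and that a Chinese-capable font exists at the given path (older macOS releases shipped Arial Unicode there; substitute any Chinese-capable font from your machine):
printf "你好世界" | convert -pointsize 72 \
  -font "/Library/Fonts/Arial Unicode.ttf" \
  label:@- chinese.png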
I do not have any Chinese fonts on my system, but here is an example of what I would suggest, using the Symbol font. First download a proper Chinese font, i.e. one that covers the Unicode characters you need. Then open a UTF-8 compatible text editor, choose that font and type your string. For example, here is a screensnap of the symbols.txt file that I created using the Symbol font in my UTF-8 compatible BBEdit text editor on my Mac.
Then using ImageMagick,
convert -size 100x -font "/Library/Fonts/GreekMathSymbols Normal.ttf" label:@symbols.txt symbol.gif
The resulting symbol.gif image shows the symbols rendered correctly.
Adding .utf8 as suffix to your file is not adequate. You must create a text file in a UTF-8 compatible text editor using a UTF-8 compatible font.
Furthermore, most terminal windows do not support UTF-8 characters / fonts. So typing your characters directly into the command line in the terminal window does not always work. See http://www.imagemagick.org/Usage/text/#unicode
You can't do it in ImageMagick, as the font information it uses doesn't include language support, but it's easy with Font Book.app by creating a Smart Collection that matches fonts supporting Chinese:
On my Mac I have 35 fonts which include Chinese characters.
(The dimmed/greyed fonts are available but will need to be downloaded from Apple servers before I can use them, an automatic process done when selecting those fonts in any app.)

How to create png from unicode character?

I have been looking far and wide on the internet for images/vectors of unicode characters in any font, and have not found any. I need image files of unicode characters for the project I am working on, where I cannot just use text. Is there a way to "convert" unicode characters from a font into an image file? Or does anyone know where I can find this? Thank you.
Try BMFont (Bitmap Font Generator). It supports Unicode and generates PNG images - looks like a perfect match.

generated docx with opentbs converted by unoconv and libreoffice

For some reason I am experiencing some strange behaviour.
When I merge my docx template with OpenTBS, it all works fine and looks correct in the generated docx.
But now I need to convert the docx into a PDF, for which I am using unoconv and LibreOffice on Mac OS X 10.11.
When I do this, all strings with multiple lines (which are displayed correctly in the docx) are displayed as a single line in the PDF.
Also, if I open the generated docx with LibreOffice, all multi-line strings are displayed as a single line.
I figured out that I can use ;strconv=no.
This then does exactly the opposite: all multi-line strings in the docx are displayed as a single line, but in LibreOffice, or when converting to PDF with unoconv, they are displayed correctly with multiple lines.
Does anyone have a solution for this problem?

How to convert a source code text file (e.g. asp php js) to jpg with syntax highlight using command line / bash routine?

I need to create images of the first page of some source code text files, like asp or php or js files for example.
I usually accomplish this by typing a command like
enscript --no-header --pages=1 "${input_file}" -o - | ps2pdf - "${temp_pdf_file}"
convert -quality 100 -density 150x150 -append "${temp_pdf_file}"[0] "${output_file}"
trash "${temp_pdf_file}"
This works nicely for my needs, but it obviously outputs an image "as is", with no "eye-candy" features.
I was wondering if there's a way to add syntax highlighting too.
This might come in handy to speed up the creation of presentations of developed work, for example.
Pygments is a source highlighting library which has PNG, JPEG, GIF and BMP formatters. No intermediate steps:
pygmentize -o jquery.png jquery-1.7.1.js
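The PNG formatter also takes options, and piping lets you limit the output to roughly the first page (script.php is a placeholder file name; the -O names are standard Pygments ImageFormatter options; PNG output needs the Pillow imaging library installed):
pygmentize -f png -O style=monokai,font_size=14 -o code.png script.php
head -n 60 script.php | pygmentize -l php -f png -o first_page.png   # roughly one page of source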
Edit: adding a source code image to a document means you are doing it wrong to begin with. I would suggest LaTeX, Markdown or similar for the whole document, so that the source code sections can be generated.
Another easy/lazy way would be to create an HTML document using pygmentize and copy-paste it into the document. Not professional, but better than a raster image.
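For example (script.php is again a placeholder; the full option makes Pygments emit a standalone HTML page instead of a fragment):
pygmentize -f html -O full,style=friendly -o code.html script.php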
Here's how I do it on my Mac:
I open up the file with MacVim. MacVim supports syntax highlighting.
I print the file to a PDF. This gives me a paged document with highlighted syntax.
When I print, the program Preview opens up to display the file. I can export it to a JPG, or whatever my heart desires.
I don't have a Mac
This works with Windows too.
You have to get Vim, although Notepad++ may also work. Any programmer's editor will support syntax highlighting and allow you to print out with the highlighted syntax. So, pick what you like.
You have to get some sort of PDF producing print driver such as CutePDF.
Converting it to a JPG: I think Adobe Acrobat may be able to export a PDF to a JPG, or maybe the print driver can print to a JPG instead of a PDF. Or, you can send it to a friend who has a Mac.

Creating JPEGs from binary

If I read some binary from a stream, save it in a text file and then rename it with the .jpg extension, how come the file won't open as an image?
As a reference, I took the source image, opened it up in Notepad and compared both files - side-by-side they have exactly the same content.
I'd guess that you didn't open your text file in binary mode. Some bytes will be changed when you write data in text mode (most notably the end-of-line byte sequence), and those changes will be ignored by Notepad because it thinks everything is text. Try using comp (I think that's the right command) to compare your files rather than Notepad.
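On Windows, a byte-level comparison could look like this (file names are placeholders; fc /b is a common alternative to comp):
comp original.jpg saved.jpg
fc /b original.jpg saved.jpg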
