How to use ImageMagick with Chinese fonts on Text to Image Handling - macos

I am trying to use ImageMagick to render a Chinese character to an image on my MacBook.
I used the following command to check which fonts are available on my system:
convert -list font | grep Font
It did not return anything.
According to the ImageMagick guide on Text to Image Handling, Chinese fonts such as ZenKaiUni appear to be supported.
And the Font Book application on my MacBook shows plenty of Chinese fonts.
So the fonts are installed. How can I get ImageMagick to use them?

You can either tell ImageMagick about all the fonts on your system (for example, by registering them in a type.xml configuration file), and then they will show up when you run:
convert -list font
Then you can use shorthand:
convert -font Arial ...
Or, you can just tell ImageMagick the full path to any font on a per-invocation basis:
printf "Hello" | convert -pointsize 72 \
-font "/Applications/iMovie.app/Contents/Frameworks/Flexo.framework/Versions/A/Resources/Fonts/Zingende Regular.ttf" \
label:#- result.png
You would probably put your Unicode Chinese characters in place of my "Hello".
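For a Chinese character, a minimal sketch of the same per-invocation approach (the font path is an assumption; check Font Book, /System/Library/Fonts and /Library/Fonts for a CJK font that is actually installed on your Mac):
printf "你好" | convert -pointsize 72 \
-font "/System/Library/Fonts/STHeiti Medium.ttc" \
label:@- chinese.png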

I do not have any Chinese fonts on my system, but here is an example of what I would suggest, using a symbol font. First download a proper Unicode Chinese font, i.e. one that contains the characters you need. Then open a UTF-8 compatible text editor, choose that font, and type your string. For example, here is a screenshot of the symbols.txt file that I created using the symbol font in BBEdit, a UTF-8 compatible text editor, on my Mac.
Then, using ImageMagick:
convert -size 100x -font "/Library/Fonts/GreekMathSymbols Normal.ttf" label:@symbols.txt symbol.gif
The resulting symbol.gif shows the characters rendered in that font.
Simply adding .utf8 as a suffix to your file name is not adequate. You must create the text file in a UTF-8 compatible text editor using a UTF-8 compatible font.
Furthermore, most terminal windows do not support UTF-8 characters / fonts, so typing your characters directly into the command line in the terminal window does not always work. See http://www.imagemagick.org/Usage/text/#unicode
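If you want to skip the text-editor step, here is a small sketch that writes the UTF-8 file from a script and then renders it (assuming your script or shell is itself UTF-8 encoded; the Arial Unicode font path is an assumption, so substitute any installed font that covers Chinese):
printf '你好世界' > chinese.txt
convert -size 200x -font "/Library/Fonts/Arial Unicode.ttf" label:@chinese.txt chinese.gif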

You can't do it in ImageMagick, as the font information it uses doesn't include language support, but it's easy with Font Book.app by creating a Smart Collection that filters on the languages a font supports.
On my Mac I have 35 fonts which include Chinese characters.
(The dimmed/greyed fonts are available but will need to be downloaded from Apple servers before I can use them, an automatic process done when selecting those fonts in any app.)

Related

ImageMagick adds thick horizontal lines to PNGs extracted from PDF

Edit July 7, 2017: Downgrading to ImageMagick 6.9.5 solved this problem, which may be Cygwin-specific. I still don't know the underlying cause.
I need to extract data via OCR from images in PDF reports published by Chicago Public Schools. An example PDF is here (NB: this link downloads the file automatically rather than opening it in the browser). Here's a sample image (from PDF page 11, print page 8), extracted with pdfimages -png version 0.52.0 on Cygwin:
I'd like to crop each bar into its own file and extract the text with OCR. But when I try this with ImageMagick (version 7.0.4-5 Q16 x86_64 2017-01-25 according to convert -version), using the command convert chart.png -crop 320x600+0+0 bar.png, I get this image, with horizontal lines that interfere with OCR:
Running pdfimages to extract to PPM format first and then converting to PNG while cropping gives the same result, as does round-trip converting the extracted images to SVG format with ImageMagick's rsvg delegate; fiddling with the PNG alpha channel changes the lines' colors from gray to white or black but doesn't eliminate them. I've found a workaround of round-trip converting the extracted images through JPG (introducing ringing artifacts, which I hope are irrelevant), but I don't see why I should have to do this. Incidentally, ImageMagick introduces the lines to PNGs even if I run a null conversion convert chart.png chart.png, which ought to leave the image unchanged:
I have found other complaints that PDF software adds horizontal lines to images, but none of them exactly matches this problem. A discussion thread mentions that versions of the PDF standard somehow differ in their treatment of alpha channels, but my knowledge of graphics is too poor to understand the discussion fully; besides, my images get the horizontal lines added after they're extracted from the PDF, because of something internal to ImageMagick. Can anyone shed some light on the causes of the gray lines?
Using the latest ImageMagick 7.0.6.0 Q16 Mac OS X, I get a good result. As mentioned above by Bonzo, the correct syntax for IM 7 is magick rather than convert. The use of convert reverts to IM 6. Also do not use magick convert either.
magick chart.png -crop 320x600+0+0 +repage bar.png
If this does not work for you, then there may have been a bug in your older version of IM 7, so you should upgrade.
Note also that +repage is needed to remove the virtual canvas left behind by -crop.
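If the goal is one file per bar, a small loop sketch along the same lines (the 320x600 crop geometry and the count of four bars are assumptions taken from the example command; adjust them to your chart):
for i in 0 1 2 3; do
  magick chart.png -crop 320x600+$((i*320))+0 +repage "bar_$i.png"
done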

How to create png from unicode character?

I have been looking far and wide on the internet for images/vectors of Unicode characters in any font, and have not found any. I need image files of Unicode characters for the project I am working on, where I cannot just use text. Is there a way to "convert" Unicode characters from a font into an image file? Or does anyone know where I can find this? Thank you.
Try BMFont (Bitmap Font Generator). It supports Unicode and generates PNG images - looks like a perfect match.
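Since the rest of this page revolves around ImageMagick, here is also a one-line sketch that rasterises a single Unicode character to a PNG with it (the font path and character are placeholders; use any installed font that actually contains the glyph):
convert -background none -fill black -font "/Library/Fonts/Arial Unicode.ttf" -pointsize 128 label:"☃" snowman.png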

CropBox and MediaBox in GhostScript

I'm using Ghostscript to turn PDFs into jpeg thumbnails. It works great for most files, but I've got a few that end up looking bad - like a tiny thumbnail on a huge white background.
This is because, on those problem PDFs, the MediaBox is set to a much larger size than the CropBox. I can fix this in Ghostscript by using -dUseCropBox to make it ignore the MediaBox dimensions ... but that does not work on other PDFs that have no CropBox defined.
So I can think of two solutions:
Somehow check a PDF file before import to see whether it has a CropBox defined. If it has a CropBox, then use the -dUseCropBox switch. If it does not have a CropBox defined, then we do not use that switch.
Modify the MediaBox dimensions in the PDF file itself so that they match the CropBox dimensions.
So what code would I use to check a PDF file for CropBox/MediaBox dimensions and, if necessary, edit them?
What do you plan to do with files that have no CropBox? It seems to me that you are already doing everything you can: if a CropBox is present (and you select -dUseCropBox) it is used; if not, then (if I recall correctly) GS will use the MediaBox anyway.
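If that is right, the switch can simply be passed unconditionally. A minimal thumbnailing sketch along those lines (the device, resolution, and file names are assumptions; with -dUseCropBox Ghostscript falls back to the MediaBox when no CropBox is defined):
gs -dBATCH -dNOPAUSE -dUseCropBox -sDEVICE=jpeg -r72 -sOutputFile=thumb-%d.jpg input.pdf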
I think what you're really looking for is a program / script to crop whitespace from PDFs, irrespective of the media/trim/crop box setting, you could try either of these freeware pdf croppers:
pdfcrop - a perl script, works on multiple platforms
http://tug.ctan.org/tex-archive/support/pdfcrop
(requires *tex, ghostscript and obviously perl)
PDF Cropper, for Windows
http://www.noliturbare.com/pdf-tools/pdf-cropper
(requires ghostscript and .NET 3.5)
Alternatively, if you have a Mac, you could use the crop function in the Preview application. This sets the CropBox without touching the MediaBox (at least it does on Mac OS X 10.4), allowing you to use -dUseCropBox.
Instead of Ghostscript, use ImageMagick. For example:
convert -resize 70x file.pdf file.jpg
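A slightly fuller sketch of the same idea (the density value and page index are assumptions; rasterising at a higher density before resizing keeps the thumbnail sharp, and [0] restricts it to the first page; note that ImageMagick still calls Ghostscript under the hood to render the PDF):
convert -density 150 file.pdf[0] -resize 70x thumb.jpg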

How to convert a source code text file (e.g. asp php js) to jpg with syntax highlight using command line / bash routine?

I need to create images of the first page of some source code text files, like asp or php or js files for example.
I usually accomplish this by typing a command like
enscript --no-header --pages=1 "${input_file}" -o - | ps2pdf - "${temp_pdf_file}"
convert -quality 100 -density 150x150 -append "${temp_pdf_file}"[0] "${output_file}"
trash "${temp_pdf_file}"
This works nicely for my needs, but it obviously outputs the image "as is", with no "eye-candy" features.
I was wondering if there's a way to add syntax highlighting too.
This might come handy to speed up the creation of presentations of developed works for example.
Pygments is a source highlighting library which has PNG, JPEG, GIF and BMP formatters. No intermediate steps:
pygmentize -o jquery.png jquery-1.7.1.js
Edit: adding a source code image to a document means you are doing it wrong to begin with. I would suggest LaTeX, Markdown or similar for the whole document, so that the source code portion can be generated.
Another easy/lazy way would be to create an HTML document using pygmentize and copy-paste that into the document. Not professional, but better than a raster image.
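To keep the "first page only" behaviour of the original enscript pipeline, a small sketch (the 60-line cutoff, style name and font size are assumptions; when pygmentize reads from stdin it needs an explicit lexer via -l, and the image formatter is picked from the output file extension, so make ${output_file} end in .png or .jpg):
head -n 60 "${input_file}" | pygmentize -l php -O style=default,font_size=14 -o "${output_file}"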
Here's how I do it on my Mac:
I open up the file with MacVim. MacVim supports syntax highlighting.
I print the file to a PDF. This gives me a paged document with highlighted syntax.
When I print, the Preview application opens up to display the file. I can export it to a JPG, or whatever my heart desires.
I don't have a Mac
This works with Windows too.
You have to get Vim, although Notepad++ may also work. Most programmers' editors support syntax highlighting and allow you to print out with the highlighted syntax, so pick whichever you like.
You have to get some sort of PDF producing print driver such as CutePDF.
Convert it to a JPG. I think Adobe Acrobat may be able to export a PDF to a JPG, or maybe the print driver can print to a JPG instead of a PDF. Or you can send it to a friend who has a Mac.

How to save text file in UTF-8 format using pdftotext

I am using the pdftotext open source tool to convert PDFs to text files. How can I save the text files in UTF-8 format so that all accented characters are retained? I am using the command below, which extracts the content to a text file, but I am not able to see any accented characters.
pdftotext -enc UTF-8 book1.pdf book1.txt
Please help me to resolve this issue.
Thanks in advance,
You can get a list of available encodings using the command:
pdftotext -listenc
and pick the right one using the -enc argument. Mine here seems to use UTF-8 by default, i.e. your "UTF-8" is superfluous:
pdftotext -enc UTF-8 your.pdf
You may want to check your locale (LC_ALL, LANG, ...).
EDIT:
I downloaded the following PDF:
http://www.i18nguy.com/unicode/unicodeexample.pdf
and converted it on a Windows 7 PC (german) and XPDF 3.02PL5 using the command:
pdftotext.exe -enc UTF-8 unicodeexample.pdf
The text file is definitely UTF-8 encoded, as all characters are displayed correctly. What are you using the text file for? If you're displaying it through a web application, your content encoding might simply be wrong, while the text file has been converted exactly as you wanted.
Double-check using either a browser (force the encoding in Firefox to ISO-8859-1 and UTF-8) or using a hex editor.
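If you prefer the command line to a hex editor, a quick sketch of that check on a Unix-like system (file guesses the encoding from the byte content, and hexdump lets you confirm the multi-byte UTF-8 sequences behind the accented characters; the file names are taken from the question):
pdftotext -enc UTF-8 book1.pdf book1.txt
file book1.txt
hexdump -C book1.txt | head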
Things are getting a little bit messy, so I'm adding another answer.
I took the PDF apart and my best guess would be a "problem" with the font used:
open the PDF file in Acrobat Reader
select all the text on the page
copy it and paste it into a Unicode-aware text editor (there's no "hidden" OCR, so you're copying actual data)
You'll see that the codepoints you end up with aren't the ones you're seeing in the PDF reader. Whatever the font is, it may have a mapping different from the one defined in the Unicode standard. As such, your content is "wrong" and there's not much you can do about it.
