Ruby: extract Arabic text from PDF

I usually use this code to extract text from PDFs:
require 'rubygems'
require 'pdf/reader'

filename = File.expand_path(File.dirname(__FILE__)) + "/myfile.pdf"

PDF::Reader.open(filename) do |reader|
  reader.pages.each do |page|
    puts page.text
  end
end
This time I want to parse an Arabic PDF, but with this code I get a bunch of weird characters, for example: ±πNuô ≠ö ¥πbËÊ ´Lö Ë«_°u«» ±GKIW √±U±Nr ËîUÅW √Ê ´bœ Ë≠w «∞LπLuŸ, ¥L
I have already read that coding: utf-8 is fine for Arabic, so is there any solution?

The text in this PDF is not properly encoded: the relation between what appears on the screen and the character code it represents is not stored in this PDF. That's why you get 'random' text.
Also notable: the text appears in the correct order, but only because the font's glyphs are drawn mirrored and the text itself is also drawn mirrored -- a typical hackish workaround to properly typeset Arabic in QuarkXPress (there used to be an XTension that 'enabled' this).
As this wrong encoding seems to be actually defined as such inside the fonts ("Font uses built-in encoding", according to Acrobat Pro's "Inventory" function), you might be able to find a translation table between the characters you are reading and what they should actually be. Be aware that these tables may very well differ for each of the fonts in this document, so you have to check which font each of your text strings is using.
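As a starting point, you can list which fonts each page uses straight from Ruby. A sketch assuming pdf-reader's Page#fonts accessor; the labels and dictionaries printed depend entirely on your document:

require 'pdf/reader'

# List the font resources of each page, so you can tell which
# per-font translation table applies to which text strings.
PDF::Reader.open("myfile.pdf") do |reader|
  reader.pages.each_with_index do |page, i|
    page.fonts.each do |label, font|
      puts "page #{i + 1}: #{label} -> #{font.inspect}"
    end
  end
end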
Addition
I did some further investigation, and it agrees with your own and Acrobat Pro's findings. Your sample text appears like this:
/F1 1 Tf % set font and size "HGKECF+PHBagdad"
...
[ (´Mb ) -24.4 (¢'b¥b ) -24.4 («®{05}d«ØU¢Nr, ) -24.4 (Ë«ù´öÂ ) -24.4 (°LDU{03}&Nr.) ] TJ
Usually, font entries in a PDF contain a table that 'translates' glyph indexes into actual character codes. That is also true for this font (and all the others):
<<
/Type /Font
/Subtype /Type1
/BaseFont /HGKECF+PHBagdad
/Encoding 66 0 R
/ToUnicode 69 0 R
>>
(only relevant entries listed). The /Encoding entry points to a simple array mapping indexes to character codes, and /ToUnicode to a more formal table, which essentially contains the same information. Both lists translate to the same text.
As the font inventory shows, the font contains Arabic glyphs (mirrored), but the codes linked to these glyphs are not the correct ones for Arabic. It's like the old "Symbol" font hack: type 'a' to get an alpha, 'b' for a beta, 'g' for a gamma: the text on your screen appears to be "ɑβɣ" but in truth it says "abg".
Addition 2
See also this Adobe Forum thread: Arabic - ToUnicode Map incorrect?
Quote:
Arabic XT fonts are not Arabic fonts from the operating system point of view (MacOS or Windows). They use the Mac Roman encoding; the Arabic glyphs are placed in place of the Roman glyphs.
I tried to find a "correcting" encoding for your fonts but have so far not been successful. If I could locate a translation table, it ought to be possible to exchange the existing /ToUnicode table for a corrected one, and you'd get the correct text when extracting. (Although it may be simpler to use the same table to change the text strings after extraction in your programming language of choice, as sketched below.)
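If you do find (or reconstruct) such a table, applying it after extraction could look like this in Ruby. A minimal sketch; every table entry below is a hypothetical placeholder, not a real mapping:

# Hypothetical per-font translation table: garbled character =>
# correct Arabic character. You would have to work out the real
# pairs yourself, per font.
TRANSLATION = {
  "±" => "\u0645", # hypothetical
  "π" => "\u062C", # hypothetical
}.freeze

def fix_text(garbled)
  garbled.chars.map { |c| TRANSLATION.fetch(c, c) }.join
end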

Related

How to get glyph Unicode using FreeType?

I'm trying to use FreeType to enumerate the glyphs (name and Unicode) in a font file.
For getting the name, I'm using FT_Get_Glyph_Name.
But how can I get the glyph's Unicode value?
I'm a newbie to glyphs and fonts.
Technically, the Unicode codepoint is not stored together with the glyph in a TrueType/OpenType font. One has to iterate the font's cmap table to get the mapping; that mapping could also be a non-Unicode one, and multiple mappings may point to the same glyph. The good news is that FreeType provides facilities in its API to iterate the glyph codepoints in the currently selected character map, and they are very well documented. So, in code:
// Ensure a Unicode character map is selected
FT_Select_Charmap(face, FT_ENCODING_UNICODE);

FT_ULong charcode;
FT_UInt  gid;
charcode = FT_Get_First_Char(face, &gid);
while (gid != 0)
{
    std::cout << std::format("Codepoint: {:x}, gid: {}", charcode, gid) << std::endl;
    charcode = FT_Get_Next_Char(face, charcode, &gid);
}
With this information you can create a best-effort map from glyphs to Unicode code points.
One would expect the FT_CharMap to hold this info:
[...] The currently active charmap is available as face->charmap.
but unfortunately it only defines the kind of encoding (Unicode, MacRoman, Shift-JIS, etc.). Apparently the actual lookup of a code is done elsewhere – and .notdef simply gets returned when that character turns out to be unavailable after all.
Looking into one of my own FreeType-based OpenType renderers, which reports 'by name' where possible, I found in the initialization sequence some code that stores the name of a glyph if it has one, and its Unicode otherwise. But that code relied on the presence of glyph names.
Thinking further: you can test every possible Unicode codepoint and see whether it returns 0 (.notdef) or a valid glyph index. So initialize an empty table for all possible glyphs and fill in each one's Unicode only if the following routine finds it.
For a moderately modern font you only need to check up to U+FFFF; for something like a heavy Chinese font (up to U+2F9F4 for Heiti SC) or emoji (up to U+1FA95 for Segoe UI Emoji) you need a considerably larger array. (Getting that maximum number out of a font is an entirely different story, alas. What to do depends on what you want to use this for.)
printf ("num glyphs: %u\n", face->num_glyphs);
for (code=1; code<=0xFFFF; code++)
{
glyph_index = FT_Get_Char_Index(face, code);
/* 0 = .notdef */
if (glyph_index)
{
printf ("%d -> %04X\n", glyph_index, code);
}
}
This short C snippet prints the translation table from font glyph index to the corresponding Unicode codepoint. Beware that (1) not every glyph in a font has a Unicode codepoint associated with it. Some fonts have tons of 'extra' glyphs, to be used in OpenType substitutions (such as alternative designs and custom ligatures) or for other purposes (the aforementioned Segoe UI Emoji, for instance, contains color masks for all of its emoji). And (2) some glyphs may be associated with multiple Unicode characters: the glyph design for A, for example, can serve as both Latin Capital Letter A and Greek Capital Letter Alpha.
Not all glyphs in a font will necessarily have a Unicode code point. In OpenType text display, there is an m:n mapping between Unicode character sequences and glyph sequences. If you are interested in a relationship between Unicode code points and glyphs, the thing that makes most sense is to use the mapping from Unicode code points to default glyphs contained in a font's 'cmap' table.
For more background, see OpenType spec: Advanced Typographic Extensions - OpenType Layout.
As for glyph names, every glyph can have a name, regardless of whether it is mapped from a code point in the 'cmap' table or not. Glyph names are contained in the 'post' table. But not all fonts necessarily include glyph names. For example, a CJK font is unlikely to include glyph names.

Italic and bold Latin, and Greek letters, using a custom Unicode font in gnuplot to produce (e)ps or pdf

I would like to create a PostScript or PDF figure with enhanced notation: italic or bold Latin characters, and sometimes (regular) Greek characters. How can I do that in general?
Let's say I downloaded CMU Sans Serif, a font that has glyphs for all the strange characters I'd ever want to use. I converted the fonts to pfa with an online tool and copied the files to the working directory.
Expectations
Let's say I'd like to produce the following notation somewhere.
What I tried: original
I create a gnuplot script encoded in a utf-8 file (without BOM) with the content
set term postscript eps enhanced "CMUSansSerif" 15 fontfile add 'CMUSansSerif.pfa' fontfile add 'CMUSansSerif-Oblique.pfa' fontfile add 'CMUSansSerif-Bold.pfa'
set encoding utf8
set o "print.eps"
p x t "Label: {/CMUSansSerif-Bold important }{/CMUSansSerif-Oblique note}: ∫⟨α₂ + β²⟩ = äßű"
set o
and executed it with the newest gnuplot, version 5.2.6.
What I got
I used a vector graphics editor to open the eps file, and the relevant part looks like this:
What I also tried
Following Ethan's answer, I added adobeglyphnames to the termoptions. That made at least the letters available, but other Unicode symbols are still missing. The result is:
Question
What went wrong? How could I produce the desired output?
There are so many places where things can go wrong: Is the font not suitable for this task? Did I download a wrong version of it? Did the pfa converter do a bad job? Did I include the font files incorrectly? Was there something wrong with set encoding? Am I using a bad vector graphics editor? Do I have wrong fonts installed that the vector graphics editor tries to use?
I am afraid the answer is that, in general, PostScript is the wrong tool for this. If it is at all possible for you to work with PDF output instead, I suggest you do that. It is even possible that the resulting PDF file can be translated to a PostScript file by standard tools (e.g. pdf2ps). That is likely to work if the non-ASCII characters are limited to Greek and other relatively common symbols, but I don't know how much of the full Unicode tables is covered by those standard tools.
If you really need to produce PostScript with additional Unicode characters directly from gnuplot, you can find full instructions and sample character encoding tables in the gnuplot distribution files:
.../term/PostScript/unicode_maps.README
.../term/PostScript/unicode_big.map
.../term/PostScript/unicode_small.map
I am not familiar with the online font-conversion tool you used, but it probably failed because it did not have, or at any rate did not use, suitable character encoding tables for the desired conversion.
===
One other thought. There are two ways that a *.pfa font can encode Unicode characters that are common enough to have a name assigned by Adobe for use in PostScript: (1) it may use generic names like uni0439 for Unicode code points, or (2) it may use Adobe-specific names from the list here:
agl-aglfn glyph list
When selecting PostScript output from gnuplot you can tell it which of these two conventions is used by the font you provide. The default is "noadobeglyphnames".
set term postscript {no}adobeglyphnames
===
(recipe for using "set term pdfcairo")
Font handling is unfortunately system-specific, so I cannot tell you how to install or configure fonts on all your target machines. I will show a procedure that works on a Linux desktop that uses the fontconfig utilities for system font handling.
1. Create the directory /home/share/fonts/CMUSans
2. Add this directory to the search list in /etc/fonts/local.conf
3. Copy the *.ttf files from the CMU Sans Serif zip archive you link to in your original query into this directory. The fontconfig tools should now be able to find these fonts; by inspection they self-report as "CMU Sans Serif".
4. In gnuplot (tested with version 5.2.6):
set term pdfcairo font "CMU Sans Serif,15"
set output 'enhanced_utf8.pdf'
load 'enhanced_utf8.dem'
5. Convert the output PDF file to PostScript with the following command:
pdf2ps enhanced_utf8.pdf enhanced_utf8.ps
A screenshot of the result is shown below.
It seems that CMU Sans Serif doesn't contain the Unicode characters you are asking for. Check the font with a font editor like Birdfont: although the webpage shows the symbols you want to use, the font itself does not contain them. Your browser may show the symbols, but they are just fallback representations from other fonts.
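If you'd rather verify this from code than with a font editor, here is a sketch in Ruby using the ttfunk gem (assuming you have a TTF copy of the font; the filename and the characters checked are illustrative):

require 'ttfunk' # gem install ttfunk

# Look up each character in the font's Unicode cmap; a glyph id of 0
# means .notdef, i.e. the font has no glyph for that character.
font = TTFunk::File.open("CMUSansSerif.ttf")
cmap = font.cmap.unicode.first

["∫", "⟨", "α", "₂", "β", "²", "⟩"].each do |ch|
  gid = cmap[ch.ord]
  status = gid.zero? ? "missing" : "glyph #{gid}"
  puts format("U+%04X (%s): %s", ch.ord, ch, status)
end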

Arabic-English Transliteration using unsupported font

I am working on language transliteration for Arabic and English text.
Here is the link which displays character by character replacement : https://github.com/Shnoulle/Ar-PHP/blob/master/Arabic/data/Transliteration.xml
Now the issue is:
I am dealing with the fonts robert_bold.ttf and robert_regular_0.ttf, which have some special characters with underlines and overlines, as in this snap.
I have the .ttf files, so I can see these fonts on my system. But in my application, and in the above Transliteration.xml, those characters come out as junk, like [, } [ etc.
How can I add support for these unsupported characters in the Transliteration.xml file?
<pair>
  <search>ي</search>
  <replace>y</replace>
</pair>
<pair>
  <search>ى</search>
  <replace>a</replace>
</pair>
<pair>
  <search>أ</search>
  <replace>^</replace> <!-- here the intended replacement, s with underline, is not supported -->
</pair>
It seems that the font is not Unicode encoded but contains the underlined letters at arbitrarily assigned codes. While this works up to a point, it of course does not work across applications; it works only when that specific font is used.
The proper way is to use the correct Unicode characters, such as U+1E0F LATIN SMALL LETTER D WITH LINE BELOW "ḏ", and, for rendering, try to find fonts containing them.
An alternative is to use just basic Latin letters with some markup, say <u>d</u>. This means that the text must not be treated as plain text in later processing, and that in rendering the markup should be interpreted as a request for a line under the letter(s).
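If the XML table cannot hold these characters, another option is to do the replacement in code with proper Unicode targets. A minimal sketch in Ruby; the mapping below is illustrative and would need to be filled in from Transliteration.xml:

# Illustrative mapping to real Unicode characters instead of
# font-private codes; extend it from Transliteration.xml.
MAP = {
  "ي" => "y",
  "ى" => "a",
  "ذ" => "\u1E0F", # ḏ: LATIN SMALL LETTER D WITH LINE BELOW
}.freeze

def transliterate(text)
  text.gsub(Regexp.union(MAP.keys), MAP)
end

puts transliterate("ذي") # => "ḏy"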

How can I change the background color of specific characters in a RTF document?

I'm trying to output RTF (Rich Text Format) from a Ruby program, and I'd prefer to emit RTF directly without using the RTF gem, as I'm doing pretty simple stuff.
I would like to highlight specific characters in a DNA sequence alignment, and from the docs it seems that I can use either \highlightN ... \highlight0 or \cbN ... \cb1.
The problem is that I cannot get \cb to work in either Word:Mac 2008 or Mac TextEdit (\cf works fine, so I know it's not a color table issue).
\highlight does work, but seemingly only with two of the possible colors (black and red), and it does not use the custom color table.
By creating simple docs in Word with character shading and saving them as RTF, I can see blocks of ridiculously verbose RTF code that presumably does what I want, but it is so impenetrable that I can't see the wood for the trees.
Part of the problem may well be that Mac Word is just not implementing RTF properly. I don't have a Windows version of Word handy.
Anyone know the right way to shade blocks of text?
Thanks
--Rob
There is a note in the RTF Pocket Guide that says MS Word does not implement the \cb command; instead, MS Word uses \chshdng0\chcbpatN (where N is the color number you would use with \cb). The book recommends something like the following for compatibility with programs that implement \cbN and/or \chshdng0\chcbpatN: {\chshdng0\chcbpat5\cb5 text}.
Note: The copy of the book I have was published in 2003, so it might be a bit out-of-date.
The sequence of RTF commands that seems to be most universally supported by RTF-capable applications is:
\chshdng10000\chcbpatN\chcfpatN\cbN
These commands:
set the shading to 100 percent
set the pattern foreground and background colors to the color from the color table (we're not actually specifying a shading pattern)
set the character background to the color from the color table
Word was the most difficult application to properly render background colors in:
Despite what the latest (1.9.1) RTF spec says, Word 2013 does not resolve \highlightN colors from the \colortbl. Instead, \highlightN maps to a predefined list of colors. It looks like those colors come from the 1.5 version of the RTF spec.
Regarding \cb, the 1.9.1 spec contains this helpful pointer at the end of the section on Color Table:
Note: Windows versions of Word have never supported \cbN, but it can be emulated by the control word sequence \chshdng0\chcbpatN.
This is almost a useful suggestion, except that the documentation for \chshdngN reads:
Character shading. The N argument is a value representing the shading of the text in hundredths of a percent.
So 0 turns out not to be a very useful value; 100 / 0.01 gives the 10000 we used in the sequence above.
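Since the question was about emitting RTF from Ruby: a minimal sketch that writes a small document using the sequence above. The color table and the color number 2 (yellow here) are illustrative:

# Wrap text in the widely supported background-color sequence:
# \chshdng10000 (100% shading), \chcbpatN/\chcfpatN (pattern colors),
# plus \cbN for readers that do implement it. N indexes the color table.
def shaded(text, n)
  "{\\chshdng10000\\chcbpat#{n}\\chcfpat#{n}\\cb#{n} #{text}}"
end

doc = <<~RTF
  {\\rtf1\\ansi
  {\\colortbl;\\red0\\green0\\blue0;\\red255\\green255\\blue0;}
  Plain text #{shaded('ACGT', 2)} plain again.\\par
  }
RTF

File.write("highlight.rtf", doc)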
Use WordPad to create RTF documents, not Word. WordPad creates much simpler documents, i.e. approaching human-readable.
I use WordPad every time I need to display formatted text in a WinForms application and need something that the RichTextBox control can handle being assigned to its Rtf property.

How to Programmatically Identify a PI Font (a Dingbat) under OS X

There is a class of fonts called Pi fonts whose glyphs, under OS X, get mapped to the private Unicode range 0xF021-0xF0FF, such that you can subtract 0xF000 from each Unicode character to retrieve the 8-bit version of the character and draw it as if it were a standard Roman character.
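The remapping itself is trivial; purely as an illustration (a sketch in Ruby, though the logic is language-agnostic):

# Map a codepoint from the Pi-font private range 0xF021-0xF0FF back to
# its 8-bit character code, as described above.
def pi_font_byte(codepoint)
  (0xF021..0xF0FF).cover?(codepoint) ? codepoint - 0xF000 : codepoint
end

pi_font_byte(0xF041) # => 0x41, i.e. 'A'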
My question is: how do I recognize these fonts? It's obvious the system can do so, because there is a category on the Special Characters palette called "Pi Fonts" which apparently lists the various such fonts installed on my system; in my case BookshelfSymbolSeven, MSReferenceSpecialty, MT-Extras, Marlett, MonotypeSorts, Webdings, and various Wingdings. If I use the old-fashioned QuickDraw routines to ask for the TextEncoding of these fonts, I get a value of 0x20000, which I do not see in the system header file TextCommon.h. Am I supposed to treat any font with a TextEncoding of 0x20000 as a Pi font? And I'd rather not use any QuickDraw font handling routines, for obvious reasons.
The closest thing I know of is the NSFontSymbolicClass symbolic trait (NSFontSymbolicTraits) of NSFontDescriptor. The code
NSFontDescriptor *fontDesc = [NSFontDescriptor fontDescriptorWithFontAttributes:nil];
fontDesc = [fontDesc fontDescriptorWithSymbolicTraits:NSFontSymbolicClass];
NSArray *foo = [fontDesc matchingFontDescriptorsWithMandatoryKeys:[NSSet setWithObject:NSFontTraitsAttribute]];
NSLog(@"%@", foo);
gives me a list of Pi fonts + Braille + a bit more.
