I am trying to process a PDF with pdf-reader gem. It is mostly fine, but where there should be a summation symbol, I'm getting \u0001 instead of \u2211. The relevant font object is:
{:Type=>:Font,
:Subtype=>:Type1,
:FirstChar=>1,
:LastChar=>2,
:Widths=>[1444, 1056],
:Encoding=>{:Type=>:Encoding, :Differences=>[1, :summationdisplay, :summationtext]},
:BaseFont=>:"APHKGN+CMEX10",
:FontDescriptor=>
{:Type=>:FontDescriptor,
:Ascent=>0,
:CapHeight=>0,
:Descent=>0,
:Flags=>4,
:FontBBox=>[0, -1400, 1387, 0],
:FontName=>:"APHKGN+CMEX10",
:ItalicAngle=>0,
:StemV=>47,
:StemH=>47,
:CharSet=>"/summationdisplay/summationtext",
:FontFile3=>
#<PDF::Reader::Stream:0x007faab138a528
#data=
"H\x89bd`ab`dd\xE4s\f\xF0\xF0v\xF7\xD3v\xF6u\x8D04\x00\x89(\xFD\x90e\xFC!\xCE\xF2C\x8EG\xACX\xE6K\x81\f\xEB\xBA\x9F3X\xBF;\xF1\x7Fw\x13\xF8\xEE%\xB8\xE2\x87\xA7\x10\x03\vP\x9F\\rfqinnbIf~^IjE\t\x9C\x93\x92Y\\\x90\x93X\xE9\x9C_PY\x94\x99\x9EQ\xA2\xA0\xE1\xAC\xA9`hii\xAE\xE0\x98\x9BZ\x94\x99\x9C\x98\xA7\xE0\x9BX\x92\x91\nR\x9D\x9C\x98\xA3\x10\x9C\x9F\x9C\x99ZR\xA9\xA7\xE0\x98\x93\xA3\x10\x04\xD2Q\xAC\x10\x94Z\x9CZT\x96\x9A\x02u\x15\xD0Y\xED\x8C\fL\x01\x11\f\xCC\x8C\x8C\xECE?\xFF3\xFA\x86\x86\xF1\xFDg\x91\xEFO\xF8Ws\xE8\x97\xECf\xC6\x1F\xD5\x7Ff\x88N\x9A\xD2\xDB\xD7/\xD5\xDF\xD5\xD3:E\xEE\xF7\xCD\x1FA\xAC?\x14\xD8\xBE\xB3}\xAFj\xF9\xED\x7FQ~\t\x9B\xE9\xF7:\xD6\xBF\x17\xD9\n\xBA\xBAr\xE4\x7F0\xFE\xE9\xFA\xFD\xFD\x8F7kscWg\xBBT\xC3\x94\xEE\xB9r?/\xB2=\xFC\xDE\xCBZ\xC4V\xE4\xE0\xE1g\x96\xC7\xD1V\xEDV\xFC[]\xFA\x8F-\e\xDF\x7F\xD6%\x85'd~u<\x92a\xF9\xB8\x9BQ\x86\xE5\x13\x90-\xFA\x9D\xF7\xFB\x15\xA0\xEA\x14eE\xF7\xDF\xEC\xB9\x1Cme\x9A\x85\xBFC\xA4\xFF\xBCg\xFB1\xF1\xC7K\xD6I\x93{\xFB&H\xF5v\xF7\xB5L\x95\xFB\x93\xF6S\x90\xF5\xC7\x0E\xB6\xEFR\xCFj;\xA7\xC8\x1Fl~Tu+rI\xF5\xF9\xB8\xB5V\x1CK\xD8~\xF3~_\xCB*\xF3;\x89\xAD\xA4\xAB\xAB\xB5C\xBE\xAB\xA3\xBB\xA2A\xEA\xC7\xD2\xBF\x19\x7Ff\xFD\xF9\xCC\xDAX\xDF\xDD\xD6\x05q _\xF9|6\x99\xDF\x95\xF3\xD9\xE5\x16\xB8O\x9D9\xE3?\x0F\xE7.\xAE]\xDC\x9B'\xF1\xF0\x001/#\x80\x01\x00J\xBC\xBFN\n",
#hash={:Filter=>:FlateDecode, :Length=>464, :Subtype=>:Type1C},
#udata=nil>}}
Since the Adobe glyphlist.txt (replicated at pdf-reader/lib/pdf/reader/glyphlist.txt) only includes summation, and not summationtext nor summationdisplay, #differences don't get applied to #mapping in PDF::Reader::Encoding#differences=, and #state.current_font.to_utf8(1) fails to fetch the correct glyph (it returns the glyphcode as a fallback, which is why I end up with \u0001). I.e. The font mapping differences inside the PDF font object should (according to my understanding) reference glyphs on the master glyph list by name, but these two don't match.
What am I missing? If summationdisplay and summationtext are not on Adobe's glyphlist.txt, how do other PDF readers render this font correctly?
This is defining a font subset with custom encoding and non-standard glyph names. Furthermore it does not include a ToUnicode reverse mapping from the custom encoding.
The PDF-32000 Specification covers this scenario:
9.10 Extraction of Text Content
9.10.1 General
...
When extracting character content, a conforming reader can easily convert text to Unicode values if a font’s characters are identified according to a standard character set that is known to the conforming reader. This character identification can occur if either the font uses a standard named encoding or the characters in the font are identified by standard character names or CIDs in a well-known collection. 9.10.2, "Mapping Character Codes to Unicode Values", describes in detail the overall algorithm for mapping character codes to Unicode values.
If a font is not defined in one of these ways, the glyphs can still be shown, but the characters cannot be converted to Unicode values without additional information:
• This information can be provided as an optional ToUnicode entry in the font dictionary (PDF 1.2; see 9.10.3, "ToUnicode CMaps"), whose value shall be a stream object containing a special kind of CMap file that maps character codes to Unicode values.
pdf-reader does seem to be conforming to the above. There is a custom sub-set encoding with /summationdisplay mapped to \u0001. There enough information to render, but not to reverse map the font back to Unicode.
Related
I have searched many a page on wikipedia, the official GS1 specifications, but have yet to find a definite answer to the question
What is the actual HEX / binary value of the GS1 FNC1 character?
There is much information about how to use the GS1 identifiers, how to print the barcodes with ZPL and how to encode the FNC1, but I want to know the actual HEX value of that character.
The special function characters such as FNC1 through FNC4 belong to the class of "non-data characters" that can be encoded within various barcode symbologies but with do not have any direct ASCII representation in the decoded data stream. Each symbology that supports such characters has a different scheme for encoding them in its internal representation quite distinct from any byte-orientated character data.
The FNC characters serve both as flag characters (indicating something special to the reader) and as formatting characters (modifying the meaning of the encoded data). As such they are not intended to be transmitted directly in the data received by the host system from a basic barcode reader, although in both cases they may have an "effect" on the transmitted message.
The usual purpose of each of the FNC characters are as follows:
FNC1 - Structured Data flag character indicating GS1 and AIM formatting AND group separator formatting character, amongst other uses.
FNC2 - Message Append flag character for buffering the data in groups of symbols for a single read.
FNC3 - Reader Programming flag character for device configuration purposes.
FNC4 - Extended ASCII formatting character for encoding characters with ordinals 128-255.
Be aware that they may not all be available in certain barcode symbologies and may even be specified in different, non-typical or overloaded ways.
Encoding an FNC character in a symbol's internal data is accomplished via an "escape mechanism" that is specific to the encoding software. Each library has a different way of accepting these non-data characters within their input. For example, to use FNC1 in its typical GS1 structured data role for the data "(01)00312345678906(21)123456789012(30)0144" you might see the FNC1 characters escaped as {FNC1} so that the input looks like {FNC1}010031234567890621123456789012{FNC1}300144.
Some libraries will even use a set of regular or extended ASCII characters as placeholders for the FNC characters, but these are arbitrary representations and it is a mistake to consider them to be actual ASCII values for these non-data characters.
Upon scanning a barcode the symbol's internal data is typically decoded then transmitted to the host over a basic channel (e.g. keyboard wedge) as a sequence of bytes to be interpreted according to the Latin-1 character encoding. The FNC characters cannot be represented in such a manner and are excluded from the data stream, however their formatting effect on the data remains.
For instance, the standards for most symbologies specify that when an FNC1 character is being used in its role as a field separator in data conforming to GS1 Application Identifier Standard Format it should be decoded and transmitted as GS (ASCII 29). Explicitly stated, the formatting effect of a FNC1 character used as a GS1 Application Identifier separator is to place a GS character at the end of the variable-length field. But in other roles (such as when FNC1 is used in "first/second position" as a flag character and with non-GS1 formatted data) there is no formatting effect on the carried data and therefore no ASCII representation during decoding.
Another instance of the special function characters having a formatting effect on the data is with symbologies that use FNC4 to extend their reach from 7-bit ASCII into extended ASCII as described in this answer.
A subtle technical point is that the data transferred to the host is often prefixed with a short symbol indicator header known as a "symbology identifier" which denotes the type and usage of the symbol from which the data is being read. This is often modified by the presence of otherwise invisible flag characters within the symbol data, for example to indicate the presence of GS1 formatted data with "FNC1 in first" or to indicate reader programming mode when FNC3 appears anywhere in the symbol. The details are symbology specific.
Aside: In addition to FNC non-data characters, there are other non-data characters commonly supported by barcode symbologies that have no direct ASCII representation but affect the overall message. These include macro characters (that wrap the message data in an "envelope"), and ECI indicators that require the use of a transmission protocol beyond the typical "basic channel" mode but which enable the use of extended character sets amongst other enhancements.
Important is to know (and to setup a scanner properly) that the FNC1 character at the first position is translated to a symbology identifier according ISO/IEC 15424. The modifier m of the symbology identifier shows if there was a FNC1 or not. If this is not done the application cannot see anymore if a GS1 Structure was intended or not. Other structures are identified by e.g. Macro 06 in a data matrix code (ISO/IEC 16022, ISO/IEC 15434). Its required to figure our the difference to take the correct action to process the data.
I am using CoreText CTFontGetGlyphsForCharactersto get glyphs that correspond to unicode chars , i.e. UniChar. Now, I'd like to retrieve Ligature characters, for instance fi. Is there any way to retrieve ligature glyphs from CoreText (for instance by a series of unichars that should be combined)?
Thanks,
moka
After some useful advice from the Core Text mailing list, here is the solution:
You can't get ligatures directly from CTFont, as no API exists to do that.
What you do instead is the following:
Use CTLineCreateWithAttributedString() to create a text line from the character sequence whose ligature you want to get.
Make sure to set kCTLigatureAttributeName to either 1 or 2 on the attributed string to get ligatures.
If done correctly, with a ligature that exists in the font, you should get a CTLine containing one CTRun containing one CGGlyph which will be the ligated glyph you are looking for!
If you don't want to go down this route, the only other way appears to be parsing the Open - / True Type font tables yourself which I don't recommend.
Hope that helps!
Problem
Problem is simple: I have XML containing this value
Mu¨ller
This appears to be valid XML format for representing a u with an umlaut, like this.
Müller
But all the parsers we have tried so far result in u¨ -- two distinct characters.
Background
This form of unicode (UTF-8) uses two codepoints to represent a single character; and is called Normalized Form Decomposed or NFD, and in binary is \303\274.
Most characters can also be represented as a single codepoint and entity, including this case. The XML could also have included ü or ü or ü and in binary is \195\188. This is called Normalized Form Composed. Any of these would work fine.
Getting Right to the Question
So I think the question is one of:
Is there a parser (doesn't seem to be nokogiri) that can detect and normalize to our preferred form?
Is there a reasonable way for us to reliably detect entities in the NFD form and convert them to the NFC form (or is there something that will do that out there?)
Thanks!
The character you’re using, U+00A8 (DIAERESIS) isn’t a combining character – it is distinct from U+0308 (COMBINING DIAERESIS). (I’ve only just discovered this myself – I don’t know what the use for the non-combining diaeresis is).
It looks like in this case this behaviour is correct and your XML is wrong (it should be using ̈ and not ¨).
I have a Farsi word that if shown in UTF-8 coding is like this:
"خطاب"
I have two versions of this word, both in Notepad++ in UTF-8 are shown as above.
But if I look at them in ANSI mode then I see:
ïºïºŽï»„ﺧ
and for the other one I see:
خطاب
How come the same words have such a different representation in ANSI format? When I use PIL in Python to draw these, the result is correct for one of these and not correct for the other.
I appreciate any help on this.
In Unicode you can represent some character in more than one way.
In this case, these Arabic characters are represented with code points from the Arabic Presentation Forms-B Block in the first case, and with code points from the regular Arabic Block in the second case.
If you convert the text
ïºïºŽï»„ﺧ
to a byte stream, you get
EFBA0F EFBA8E EFBB84 EFBAA7
Notice that you are not seeing a character representing the 0F byte in the text above, because it's a non-visual character.
Now that byte stream is representing a UTF-8-encoded text. Decoding it will give you the following Unicode code points:
FE8F FE8E FEC4 FEA7
You can match those in the Arabic Presentation Forms-B Block to form your Farsi text:
خطاب
You can do the same process for the other text: خطاب gives you the byte stream D8AE D8B7 D8A7 D8A8, which represents UTF-8-encoded text, which decoded gives you the Unicode code points 062e 0637 0627 0628, which matched to the regular Arabic Block gives you again the text خطاب.
The ¬ character (0xAC in ISO-8859-1) works for normal text if I ensure that ISO-8859-1 is always used as the encoding throughout. However, when using it in attributes it is escaped to: %C2%AC. I understand that it needs to be escaped for urls, but not why it escapes it in the same way as it would for UTF-8, rather than just %AC as I'd expect it to for ISO-8859-1.
Since the escapes are in the output html file the only conclusion is that the xslt processor is the cause.
Example:
input.xml
stylesheet.xslt
makefile
Which for me generates:
output.html
Output was generated using xsltproc, compiled against libxml 20707, libxslt 10126 and libexslt 815. This was on #! Linux (amd64). I have also tried: xmlstarlet tr (also uses libxml), xalan and google chrome (by adding an <?xml-stylesheet ... >, see input_ss.xml tag) with the same result.
Opera doesn't escape it at all, and it allows ¬ to be used literally in the url and attribute.
Is this standard behaviour for xslt or is this a bug in the way the attributes are escaped? And either way, is there a solution other than replacing %C2%AC with %AC bearing in mind it is almost certainly the same for other characters that are valid ISO-8859-1 and invalid in UTF-8.
There are 3 different text-based technologies in use here, XML, HTML and URIs.
All of these have escape mechanisms - that is to say, ways to use text to indicate other text that it is impossible or difficult to indicate in a given context.
The not-sign character ¬ (U+00AC) could be escaped in the first two as ¬ or ¬ perhaps with some leading zeros, in both XML and HTML (¬ would also work in HTML). This escape would be used no matter what encoding the XML or HTML was in, because it relates to the character ¬, not to its set of octets in a given character encoding - indeed, we would generally only use it in the case where there was no such set of octets in the encoding being used.
In this case, this is unnecessary, since the output is in a character encoding in which there is no need to escape it, and so in the source you can see The ¬ character unescaped.
This HTML includes the text of a URI. The encoding of the HTML has nothing to do with this, because the encoding is how we get the text of the HTML from one machine to another, but when the HTML is being parsed to read this URI we're past that point and are dealing with some text at the level of text - that is to say, it doesn't have an encoding any more.
Now, URIs have their own escape mechanisms. This must be used in the case of ¬, as it is not a character allowed in URIs (as opposed to IRIs). Sadly, unlike the escapes in XML and HTML, these escapes are based on octets in a given encoding rather than the code-point of the character itself.
It's easy to see this as a mistake now, but URIs were specified in 1994 and that formalised work going back to 1989/1990 while Unicode 1.0 was released in 1991 and didn't have the ground-breaking 2.0 until 1996, so hindsight has considerably more benefits than URI's inventors. (HTML had the same problem many years ago, but the format of its encodings made it much easier to fix this without as many backwards-compatibility issues).
So, what encoding should we use for those octets? The original specs left this undefined, but really the only possible choice is UTF-8. It's the only encoding that gives those escapes commonly used for chracters special to URIs their escapes in the range 0x20 - 0x7F while also covering all of the UCS.
There's also no way to indicate another choice could be more appropriate. Remember, we're working at the level of text, so your use of ISO-8859-1 is completely irrelevant. Even if we kept track of the encoding while parsing the HTML, the URI is going to be made use of in a way that is nothing to do with the document, so we still couldn't use it. In all, if we have to make use of an octet-based encoding, and we have to keep characters in the ASCII range matching the octets they'd have in ASCII, the only possible basis for the encoding is UTF-8.
For that reason, the escape in any URI for ¬ must always be %C2%AC.
There can be some legacy systems that expect URIs to use other encodings, but the solution is to fix the bit that's broken, not the bit that works, so if something expects ¬ to be %AC then catch it close to that by converting %C2%AC close to its use (and if it outputs %AC itself then of course you'll need to fix it to %C2%AC before it hits the outside world).
The XSLT spec says that when serializing URI-valued attributes, all non-ASCII characters are escaped using the %HH-escaping of the UTF-8 octets that represent the character. Although %HH-escaping of other encodings has been used in the past, it is no longer used today. This is quite independent of the encoding of the document itself.