New iText Producer field causes validation failure

I switched from the old iText library to the iTextPdf library and noticed a problem. The new library sets the Producer entry to a value that includes non-ASCII characters (the registered-trademark and copyright symbols). The problem is that validation programs that read this text choke on these characters.
Can I get iText to fix this (w/o paying for a license)? I am ok with iText getting credit. I just want the credits to be Unicode clean.
<</Producer(iText® 5.5.0 ©2000-2013 iText Group NV \(AGPL-version\))/ModDate(D:20150126155550-07'00')/CreationDate(D:20150126155550-07'00')>>

You are looking at the document information dictionary of a PDF, more exactly at the value of its Producer entry. It is specified as:
Producer  (text string, Optional)  If the document was converted to PDF from another format, the name of the conforming product that converted it to PDF.
(Table 317 – Entries in the document information dictionary)
So the value must have the type text string. This in turn is specified as:
The text string type shall be used for character strings that shall be encoded in either PDFDocEncoding or the UTF-16BE Unicode character encoding scheme. PDFDocEncoding can encode all of the ISO Latin 1 character set and is documented in Annex D.
(section 7.9.2.2 Text String Type)
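In practice, decoding a text string means checking for a UTF-16BE byte order mark and otherwise falling back to PDFDocEncoding. A minimal sketch (my illustration, not any particular library's API; Latin-1 is used as a stand-in for the full Annex D table, with which it agrees for the characters at issue):

    def decode_pdf_text_string(raw: bytes) -> str:
        # Per 7.9.2.2: a leading FE FF byte order mark means UTF-16BE;
        # anything else is PDFDocEncoding (Annex D).
        if raw.startswith(b"\xfe\xff"):
            return raw[2:].decode("utf-16-be")
        return raw.decode("latin-1")  # stand-in for PDFDocEncoding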
In Annex D you find:
CHAR  NAME         CODE (OCTAL)
                   STD  MAC  WIN  PDF
...
©     copyright    —    251  251  251
...
®     registered   —    250  256  256
...
(D.2 Latin Character Set and Encodings)
Thus, these characters are completely valid here, and validators which choke on them are broken.
You should report this as a bug to the developers of the validators in question.
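As a quick self-check (a sketch, not a validator; Latin-1 stands in for PDFDocEncoding, which agrees for these two characters), the Producer bytes decode cleanly and their octal values match the WIN/PDF columns of the Annex D excerpt above:

    producer = b"iText\xae 5.5.0 \xa92000-2013 iText Group NV (AGPL-version)"
    # 0xAE = octal 256 (registered) and 0xA9 = octal 251 (copyright) are both
    # listed in Annex D, so both are valid PDFDocEncoding code points.
    print(producer.decode("latin-1"))
    print(oct(0xAE), oct(0xA9))  # 0o256 0o251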


bwip.js: How to use the Group Separator character with GS1-128

There is a service for generating barcodes hosted on metafloor.com, using bwip.js.
I want to generate a barcode for the following data (the GS character is represented by {GS}).
(01)10875066000333(10)1212{GS}(17)121212(30)8{GS}
According to the documentation, I'm able to generate a barcode for the data without the GS character:
https://bwipjs-api.metafloor.com/?bcid=gs1-128&text=(01)10875066000333(10)1212(17)121212(30)8
But the scanner requires the GS characters.
The documentation is clear:
Special characters must be encoded in the format ^NNN.
The parse option has to be enabled, using the parsefnc parameter.
The parameter has to be URL-encoded.
So for my string it's:
https://bwipjs-api.metafloor.com/?bcid=gs1-128&text=(01)10875066000333(10)1212%5E029(17)121212(30)8%5E029&parsefnc
But this gives me Error: bwipp.GS1badCSET82character: AI 10: Invalid CSET 82 character.
I also tried
Send GS char directly as %1D
Send GS char as %5EGS
Send GS char as ^029
Send GS char directly
Set parsefnc=true
Combination of all above
But still getting the same error.
Is there something I'm doing wrong or is the problem on the other side?
For GS1 Application Identifier based data, trust the library to encode the data correctly by selecting the GS1-specific encoder for the symbology (gs1-128 in this case) and then providing the input in bracketed AI notation, i.e. without FNC1 / GS separators.
The encoder will automatically add all of the necessary FNC1 non-data characters (which are transmitted as ASCII GS characters when read by a scanner) and it will also validate the contents of the AI data that you supply.
Users that select a generic symbology and then attempt to perform the AI encoding themselves are prone to making several mistakes:
Omitting the required FNC1 in first position.
Omitting the required FNC1 separators at the end of AIs with no pre-determined width.
Terminating pre-defined length AIs with unnecessary FNC1 characters.
Terminating the message with an unnecessary FNC1 character.
Encoding ASCII GS data characters instead of the canonical FNC1 non-data characters.
Including illegal, literal parentheses to denote the AIs.
Providing improperly formatted or invalid AI values.
Omitting requisite AI attributes.
Including mutually-exclusive AI pairings.
Many of these mistakes will result in failure to decode and interpret the GS1 AI data (even if the barcode appears to read successfully) which may result in charge-backs and necessitate relabelling or disposal.
The data that you are providing falls afoul of at least some of these pitfalls.
See this article for a thorough description of the checks that BWIPP (and hence BWIP-JS) implements to prevent such data quality issues.
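For instance, with the hosted service from the question, the request might be built like this (a sketch; it assumes the service passes the text through unchanged to the gs1-128 encoder, which accepts bracketed AI notation as described above):

    import urllib.parse

    # Bracketed AI notation: no literal GS bytes, no ^NNN escapes, no
    # parsefnc. The encoder inserts the FNC1 separators itself.
    data = "(01)10875066000333(10)1212(17)121212(30)8"
    query = urllib.parse.urlencode({"bcid": "gs1-128", "text": data})
    print("https://bwipjs-api.metafloor.com/?" + query)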

Unpacking COMP-3 digit using Record Editor/Jrecord

I have created a layout based on a COBOL copybook.
Layout snapshot: [screenshot not included]
I tried to load the data selecting the same layout, but it gives me wrong results for some columns. I tried using all the binary numeric types.
CLASS-ORDER-EDGE
DIV-NO-EDG
OFFICE-NO-EDG
REG-AREA-NO-EDG
CITY-NO-EDG
COUNTY-NO-EDG
BILS-COUNT-EDG
REV-AMOUNT-EDG
USAGE-QTY-EDG
GAS-CCF-EDG
Result snapshot: [screenshot not included]
The input file can be found here:
https://drive.google.com/open?id=0B-whK3DXBRIGa0I0aE5SUHdMTDg
Expected output: [screenshot not included]
Related thread
Unpacking COMP-3 digit using Java
First problem: you have done an EBCDIC --> ASCII conversion on the file!
The EBCDIC --> ASCII conversion will try to convert binary fields as well as text.
For example:
Comp-3 value   Hex       Hex after ASCII conversion
400            x'400c'   x'200c'

x'40' is the EBCDIC space character; it gets converted to the ASCII space character x'20'.
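To make the damage concrete, here is a minimal packed-decimal decoder (my own sketch, not JRecord code) applied to the bytes before and after the bogus translation:

    def unpack_comp3(raw: bytes) -> int:
        # COMP-3 stores two decimal digits per byte; the final nibble
        # is the sign (0xC/0xF positive, 0xD negative).
        nibbles = []
        for b in raw:
            nibbles.append(b >> 4)
            nibbles.append(b & 0x0F)
        sign = nibbles.pop()
        value = int("".join(map(str, nibbles)))
        return -value if sign == 0x0D else value

    print(unpack_comp3(bytes.fromhex("400c")))  # 400: the original EBCDIC bytes
    print(unpack_comp3(bytes.fromhex("200c")))  # 200: after the ASCII translation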
You need to do a binary transfer, keeping the file as EBCDIC:
Check the file on the mainframe: if it has RECFM=FB, you can do a straight binary transfer.
If the file is RECFM=VB, make sure you transfer the RDW (Record Descriptor Word), or copy the VB file to an FB file on the mainframe.
Other points:
You will have to update the settings in RecordEditor/JRecord:
The font will need to be an EBCDIC one (cp037 for US EBCDIC; for other regions, look up the appropriate code page).
The FileStructure/FileOrganisation needs to change (fixed length / VB).
Finally
BILS-Count-EDG is either 9 characters long or starts in column 85 (and is 8 bytes long).
You should include the XML as text, not paste in a picture of it.
In the RecordEditor, if you right-click >>> Edit Record, it will show the fields as Value, Raw Text and Hex. That is useful for seeing what is going on.
You do not seem to accept many answers; it is not relevant whether an answer solves your problem, only whether it is the correct answer to the question.

Unrecognised glyphs in PDF (summationdisplay, summationtext)

I am trying to process a PDF with pdf-reader gem. It is mostly fine, but where there should be a summation symbol, I'm getting \u0001 instead of \u2211. The relevant font object is:
{:Type=>:Font,
:Subtype=>:Type1,
:FirstChar=>1,
:LastChar=>2,
:Widths=>[1444, 1056],
:Encoding=>{:Type=>:Encoding, :Differences=>[1, :summationdisplay, :summationtext]},
:BaseFont=>:"APHKGN+CMEX10",
:FontDescriptor=>
{:Type=>:FontDescriptor,
:Ascent=>0,
:CapHeight=>0,
:Descent=>0,
:Flags=>4,
:FontBBox=>[0, -1400, 1387, 0],
:FontName=>:"APHKGN+CMEX10",
:ItalicAngle=>0,
:StemV=>47,
:StemH=>47,
:CharSet=>"/summationdisplay/summationtext",
:FontFile3=>
#<PDF::Reader::Stream:0x007faab138a528
#data=
"H\x89bd`ab`dd\xE4s\f\xF0\xF0v\xF7\xD3v\xF6u\x8D04\x00\x89(\xFD\x90e\xFC!\xCE\xF2C\x8EG\xACX\xE6K\x81\f\xEB\xBA\x9F3X\xBF;\xF1\x7Fw\x13\xF8\xEE%\xB8\xE2\x87\xA7\x10\x03\vP\x9F\\rfqinnbIf~^IjE\t\x9C\x93\x92Y\\\x90\x93X\xE9\x9C_PY\x94\x99\x9EQ\xA2\xA0\xE1\xAC\xA9`hii\xAE\xE0\x98\x9BZ\x94\x99\x9C\x98\xA7\xE0\x9BX\x92\x91\nR\x9D\x9C\x98\xA3\x10\x9C\x9F\x9C\x99ZR\xA9\xA7\xE0\x98\x93\xA3\x10\x04\xD2Q\xAC\x10\x94Z\x9CZT\x96\x9A\x02u\x15\xD0Y\xED\x8C\fL\x01\x11\f\xCC\x8C\x8C\xECE?\xFF3\xFA\x86\x86\xF1\xFDg\x91\xEFO\xF8Ws\xE8\x97\xECf\xC6\x1F\xD5\x7Ff\x88N\x9A\xD2\xDB\xD7/\xD5\xDF\xD5\xD3:E\xEE\xF7\xCD\x1FA\xAC?\x14\xD8\xBE\xB3}\xAFj\xF9\xED\x7FQ~\t\x9B\xE9\xF7:\xD6\xBF\x17\xD9\n\xBA\xBAr\xE4\x7F0\xFE\xE9\xFA\xFD\xFD\x8F7kscWg\xBBT\xC3\x94\xEE\xB9r?/\xB2=\xFC\xDE\xCBZ\xC4V\xE4\xE0\xE1g\x96\xC7\xD1V\xEDV\xFC[]\xFA\x8F-\e\xDF\x7F\xD6%\x85'd~u<\x92a\xF9\xB8\x9BQ\x86\xE5\x13\x90-\xFA\x9D\xF7\xFB\x15\xA0\xEA\x14eE\xF7\xDF\xEC\xB9\x1Cme\x9A\x85\xBFC\xA4\xFF\xBCg\xFB1\xF1\xC7K\xD6I\x93{\xFB&H\xF5v\xF7\xB5L\x95\xFB\x93\xF6S\x90\xF5\xC7\x0E\xB6\xEFR\xCFj;\xA7\xC8\x1Fl~Tu+rI\xF5\xF9\xB8\xB5V\x1CK\xD8~\xF3~_\xCB*\xF3;\x89\xAD\xA4\xAB\xAB\xB5C\xBE\xAB\xA3\xBB\xA2A\xEA\xC7\xD2\xBF\x19\x7Ff\xFD\xF9\xCC\xDAX\xDF\xDD\xD6\x05q _\xF9|6\x99\xDF\x95\xF3\xD9\xE5\x16\xB8O\x9D9\xE3?\x0F\xE7.\xAE]\xDC\x9B'\xF1\xF0\x001/#\x80\x01\x00J\xBC\xBFN\n",
#hash={:Filter=>:FlateDecode, :Length=>464, :Subtype=>:Type1C},
#udata=nil>}}
Since the Adobe glyphlist.txt (replicated at pdf-reader/lib/pdf/reader/glyphlist.txt) only includes summation, and not summationtext or summationdisplay, #differences don't get applied to #mapping in PDF::Reader::Encoding#differences=, and #state.current_font.to_utf8(1) fails to fetch the correct glyph (it returns the glyph code as a fallback, which is why I end up with \u0001). That is, the font's encoding differences should (as I understand it) reference glyphs on the master glyph list by name, but these two names don't match anything there.
What am I missing? If summationdisplay and summationtext are not on Adobe's glyphlist.txt, how do other PDF readers render this font correctly?
This is defining a font subset with custom encoding and non-standard glyph names. Furthermore it does not include a ToUnicode reverse mapping from the custom encoding.
The PDF-32000 Specification covers this scenario:
9.10 Extraction of Text Content
9.10.1 General
...
When extracting character content, a conforming reader can easily convert text to Unicode values if a font’s characters are identified according to a standard character set that is known to the conforming reader. This character identification can occur if either the font uses a standard named encoding or the characters in the font are identified by standard character names or CIDs in a well-known collection. 9.10.2, "Mapping Character Codes to Unicode Values", describes in detail the overall algorithm for mapping character codes to Unicode values.
If a font is not defined in one of these ways, the glyphs can still be shown, but the characters cannot be converted to Unicode values without additional information:
• This information can be provided as an optional ToUnicode entry in the font dictionary (PDF 1.2; see 9.10.3, "ToUnicode CMaps"), whose value shall be a stream object containing a special kind of CMap file that maps character codes to Unicode values.
pdf-reader does seem to be conforming to the above. There is a custom subset encoding with /summationdisplay mapped to \u0001. There is enough information to render the text, but not to reverse-map it to Unicode.
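If you just need usable extracted text, one pragmatic workaround (a hypothetical post-processing step, not part of the pdf-reader API) is to keep your own supplemental table for glyph names that glyphlist.txt lacks; both CMEX10 names denote the summation sign U+2211:

    # Supplemental glyph names absent from Adobe's glyphlist.txt.
    EXTRA_GLYPHS = {
        "summationdisplay": "\u2211",  # display-size summation in CMEX10
        "summationtext": "\u2211",     # text-size summation in CMEX10
    }

    # Codes 1 and 2 come from the font's /Differences array shown above.
    differences = {1: "summationdisplay", 2: "summationtext"}

    def code_to_unicode(code: int) -> str:
        name = differences.get(code)
        return EXTRA_GLYPHS.get(name, chr(code))  # fall back to the raw code

    print(code_to_unicode(1))  # prints the summation sign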

What is the actual HEX / binary value of the GS1 FNC1 character?

I have searched many a page on Wikipedia and the official GS1 specifications, but have yet to find a definitive answer to the question:
What is the actual HEX / binary value of the GS1 FNC1 character?
There is much information about how to use the GS1 identifiers, how to print the barcodes with ZPL and how to encode the FNC1, but I want to know the actual HEX value of that character.
The special function characters such as FNC1 through FNC4 belong to the class of "non-data characters" that can be encoded within various barcode symbologies but which do not have any direct ASCII representation in the decoded data stream. Each symbology that supports such characters has a different scheme for encoding them in its internal representation, quite distinct from any byte-orientated character data.
The FNC characters serve both as flag characters (indicating something special to the reader) and as formatting characters (modifying the meaning of the encoded data). As such they are not intended to be transmitted directly in the data received by the host system from a basic barcode reader, although in both cases they may have an "effect" on the transmitted message.
The usual purpose of each of the FNC characters is as follows:
FNC1 - Structured Data flag character indicating GS1 and AIM formatting AND group separator formatting character, amongst other uses.
FNC2 - Message Append flag character for buffering the data in groups of symbols for a single read.
FNC3 - Reader Programming flag character for device configuration purposes.
FNC4 - Extended ASCII formatting character for encoding characters with ordinals 128-255.
Be aware that they may not all be available in certain barcode symbologies and may even be specified in different, non-typical or overloaded ways.
Encoding an FNC character in a symbol's internal data is accomplished via an "escape mechanism" that is specific to the encoding software. Each library has a different way of accepting these non-data characters within their input. For example, to use FNC1 in its typical GS1 structured data role for the data "(01)00312345678906(21)123456789012(30)0144" you might see the FNC1 characters escaped as {FNC1} so that the input looks like {FNC1}010031234567890621123456789012{FNC1}300144.
Some libraries will even use a set of regular or extended ASCII characters as placeholders for the FNC characters, but these are arbitrary representations and it is a mistake to consider them to be actual ASCII values for these non-data characters.
Upon scanning a barcode the symbol's internal data is typically decoded then transmitted to the host over a basic channel (e.g. keyboard wedge) as a sequence of bytes to be interpreted according to the Latin-1 character encoding. The FNC characters cannot be represented in such a manner and are excluded from the data stream, however their formatting effect on the data remains.
For instance, the standards for most symbologies specify that when an FNC1 character is being used in its role as a field separator in data conforming to GS1 Application Identifier Standard Format it should be decoded and transmitted as GS (ASCII 29). Explicitly stated, the formatting effect of a FNC1 character used as a GS1 Application Identifier separator is to place a GS character at the end of the variable-length field. But in other roles (such as when FNC1 is used in "first/second position" as a flag character and with non-GS1 formatted data) there is no formatting effect on the carried data and therefore no ASCII representation during decoding.
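As a concrete illustration (a sketch using the example data from earlier in this answer; the leading FNC1 flag is not transmitted, AI (01) is fixed-length, and the final AI needs no trailing separator):

    # Host-side transmission for (01)00312345678906(21)123456789012(30)0144:
    # the FNC1 terminating the variable-length AI (21) arrives as GS, hex 1D.
    payload = b"0100312345678906" + b"21123456789012" + b"\x1d" + b"300144"
    gs_at = payload.index(b"\x1d")
    print(hex(payload[gs_at]))  # 0x1d (decimal 29, the ASCII Group Separator)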
Another instance of the special function characters having a formatting effect on the data is with symbologies that use FNC4 to extend their reach from 7-bit ASCII into extended ASCII as described in this answer.
A subtle technical point is that the data transferred to the host is often prefixed with a short symbol indicator header known as a "symbology identifier" which denotes the type and usage of the symbol from which the data is being read. This is often modified by the presence of otherwise invisible flag characters within the symbol data, for example to indicate the presence of GS1 formatted data with "FNC1 in first" or to indicate reader programming mode when FNC3 appears anywhere in the symbol. The details are symbology specific.
Aside: In addition to FNC non-data characters, there are other non-data characters commonly supported by barcode symbologies that have no direct ASCII representation but affect the overall message. These include macro characters (that wrap the message data in an "envelope"), and ECI indicators that require the use of a transmission protocol beyond the typical "basic channel" mode but which enable the use of extended character sets amongst other enhancements.
It is important to know (and to set up the scanner accordingly) that the FNC1 character in first position is translated into a symbology identifier according to ISO/IEC 15424. The modifier m of the symbology identifier shows whether there was an FNC1 or not. If this is not done, the application can no longer tell whether a GS1 structure was intended. Other structures are identified by, for example, Macro 06 in a Data Matrix code (ISO/IEC 16022, ISO/IEC 15434). It is necessary to recognise the difference in order to take the correct action when processing the data.
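For example, a check of that prefix might look like this (a sketch; ]C1 is the ISO/IEC 15424 identifier for Code 128 with FNC1 in first position, ]d2 the one for GS1 Data Matrix):

    # A scanner configured to transmit symbology identifiers prefixes the
    # data with e.g. "]C1"; the trailing digit is the modifier m that
    # indicates FNC1 was in first position.
    def is_gs1_formatted(transmission: bytes) -> bool:
        return transmission[:3] in (b"]C1", b"]d2")

    print(is_gs1_formatted(b"]C10100312345678906"))  # True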

IMultiLanguage2::ConvertStringFromUnicode - how to avoid compound prefix?

I am using IMultiLanguage2::ConvertStringFromUnicode to convert from UTF-16. For some languages (Japanese, Chinese, Korean), I am getting an escape sequence (e.g. 0x1B, 0x24, 0x29, 0x43 for codepage 50225 (ISO-2022 Korean)). WideCharToMultiByte exhibits the same behavior.
I am building a MIME message, so the encoding is specified in the header itself and the escape prefix is displayed as-is.
Is there a way to convert without the prefix?
Thank you!
I don't really see a problem here. That is a valid byte sequence in ISO 2022:
Escape sequences to designate character sets take the form ESC I [I...] F, where there are one or more intermediate I bytes from the range 0x20–0x2F, and a final F byte from the range 0x40–0x7F. (The range 0x30–0x3F is reserved for private-use F bytes.) The I bytes identify the type of character set and the working set it is to be designated to, while the F byte identifies the character set itself.
...
Code: ESC $ ) F
Hex: 1B 24 29 F
Abbr: G1DM4
Name: G1-designate multibyte 94-set F
Effect: selects a 94n-character set to be used for G1.
As F is 0x43 (C), this byte sequence tells a decoder to switch to ISO-2022-KR:
Character encodings using ISO/IEC 2022 mechanism include:
...
ISO-2022-KR. An encoding for Korean.
ESC $ ) C to switch to KS X 1001-1992, previously named KS C 5601-1987 (2 bytes per character) [designated to G1]
In this case, you have to specify iso-2022-kr as the charset in a MIME Content-Type or RFC 2047-encoded header. But an ISO 2022 decoder still has to be able to switch charsets dynamically while decoding, so it is valid for the data to include an initial switch sequence to the Korean charset.
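The quoted structure is easy to check mechanically (a sketch of the ESC I... F rule, nothing more):

    def split_designation(seq: bytes):
        # ESC, then intermediate bytes in 0x20-0x2F, then a final byte
        # in 0x40-0x7F, as quoted above.
        assert seq[0] == 0x1B
        i = 1
        while 0x20 <= seq[i] <= 0x2F:
            i += 1
        assert 0x40 <= seq[i] <= 0x7F
        return seq[1:i], seq[i]

    inter, final = split_designation(b"\x1b\x24\x29\x43")
    print(inter, chr(final))  # b'$)' C: G1-designate multibyte 94-set, KS X 1001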
Is there a way to convert without the prefix?
Not with IMultiLanguage2 or WideCharToMultiByte(), no. They have no clue how you are going to use their output, so it makes sense for them to include an initial switch sequence to the Korean charset; a decoder without access to charset info from MIME (or another source) would still know what charset to use initially.
When you put the data into a MIME message, you will have to manually strip off the charset switch sequence when you set the MIME charset to iso-2022-kr. If you do not want to strip it manually, you will have to find (or write) a Unicode encoder that does not output that initial switch sequence.
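A manual strip might look like this (a sketch; the four bytes are the designation sequence quoted above, and this assumes the encoder emitted it at the very start of its output):

    # ESC $ ) C: the ISO-2022-KR designation sequence (1B 24 29 43).
    KR_DESIGNATION = b"\x1b\x24\x29\x43"

    def strip_initial_designation(body: bytes) -> bytes:
        # Safe to drop only because the MIME header already declares
        # charset=iso-2022-kr; a bare decoder would need the prefix.
        if body.startswith(KR_DESIGNATION):
            return body[len(KR_DESIGNATION):]
        return body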
That was a red herring; it turned out the escape sequence is necessary. The problem was in my code, which was trimming the names and addresses with the Delphi Trim() function; Trim() strips all characters less than or equal to space (0x20), and that includes the escape character (0x1B).
Switching to my own trimming function that removes only spaces fixed the problem.
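The pitfall is easy to reproduce (a sketch; the payload text is a placeholder, only the leading ESC matters):

    encoded = "\x1b$)C" + "ENCODED-KOREAN-PAYLOAD"  # placeholder body

    # Mimics Delphi's Trim(): strips leading/trailing chars <= U+0020,
    # which swallows the leading ESC (0x1B) and corrupts the sequence.
    low_chars = "".join(chr(c) for c in range(0x21))
    print(encoded.strip(low_chars).startswith("\x1b"))  # False: ESC lost

    # A space-only trim (the fix described above) keeps the escape intact.
    print(encoded.strip(" ").startswith("\x1b"))        # True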
