I have searched many pages on Wikipedia and the official GS1 specifications, but have yet to find a definite answer to the question:
What is the actual HEX / binary value of the GS1 FNC1 character?
There is much information about how to use the GS1 identifiers, how to print the barcodes with ZPL and how to encode the FNC1, but I want to know the actual HEX value of that character.
The special function characters such as FNC1 through FNC4 belong to the class of "non-data characters" that can be encoded within various barcode symbologies but which do not have any direct ASCII representation in the decoded data stream. Each symbology that supports such characters has a different scheme for encoding them in its internal representation, quite distinct from any byte-oriented character data.
The FNC characters serve both as flag characters (indicating something special to the reader) and as formatting characters (modifying the meaning of the encoded data). As such they are not intended to be transmitted directly in the data received by the host system from a basic barcode reader, although in both cases they may have an "effect" on the transmitted message.
The usual purposes of the FNC characters are as follows:
FNC1 - Structured Data flag character indicating GS1 and AIM formatting AND group separator formatting character, amongst other uses.
FNC2 - Message Append flag character for buffering the data in groups of symbols for a single read.
FNC3 - Reader Programming flag character for device configuration purposes.
FNC4 - Extended ASCII formatting character for encoding characters with ordinals 128-255.
Be aware that they may not all be available in certain barcode symbologies and may even be specified in different, non-typical or overloaded ways.
Encoding an FNC character in a symbol's internal data is accomplished via an "escape mechanism" that is specific to the encoding software. Each library has a different way of accepting these non-data characters within their input. For example, to use FNC1 in its typical GS1 structured data role for the data "(01)00312345678906(21)123456789012(30)0144" you might see the FNC1 characters escaped as {FNC1} so that the input looks like {FNC1}010031234567890621123456789012{FNC1}300144.
Some libraries will even use a set of regular or extended ASCII characters as placeholders for the FNC characters, but these are arbitrary representations and it is a mistake to consider them to be actual ASCII values for these non-data characters.
Upon scanning a barcode the symbol's internal data is typically decoded then transmitted to the host over a basic channel (e.g. keyboard wedge) as a sequence of bytes to be interpreted according to the Latin-1 character encoding. The FNC characters cannot be represented in such a manner and are excluded from the data stream, however their formatting effect on the data remains.
For instance, the standards for most symbologies specify that when an FNC1 character is being used in its role as a field separator in data conforming to GS1 Application Identifier Standard Format it should be decoded and transmitted as GS (ASCII 29). Explicitly stated, the formatting effect of a FNC1 character used as a GS1 Application Identifier separator is to place a GS character at the end of the variable-length field. But in other roles (such as when FNC1 is used in "first/second position" as a flag character and with non-GS1 formatted data) there is no formatting effect on the carried data and therefore no ASCII representation during decoding.
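To make that concrete, here is a minimal Python sketch of what a basic reader might transmit for the example data used earlier in this answer, and how the GS separator marks the end of the variable-length field. This is deliberately simplified and is not a full GS1 AI parser:

# Sketch: transmitted form of "(01)00312345678906(21)123456789012(30)0144".
# FNC1 in first position is not transmitted; the FNC1 separating AI (21)
# from AI (30) arrives as GS (ASCII 29).
GS = "\x1d"

scanned = "0100312345678906" "21123456789012" + GS + "300144"

# AI (01) has a pre-determined length (2-digit AI + 14 digits), so no
# separator is needed; the variable-length AI (21) ends at the GS character.
ai_01 = scanned[2:16]
rest = scanned[16:]
ai_21, _, rest = rest[2:].partition(GS)
ai_30 = rest[2:]

print(ai_01, ai_21, ai_30)  # 00312345678906 123456789012 0144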
Another instance of the special function characters having a formatting effect on the data is with symbologies that use FNC4 to extend their reach from 7-bit ASCII into extended ASCII as described in this answer.
A subtle technical point is that the data transferred to the host is often prefixed with a short symbol indicator header known as a "symbology identifier" which denotes the type and usage of the symbol from which the data is being read. This is often modified by the presence of otherwise invisible flag characters within the symbol data, for example to indicate the presence of GS1 formatted data with "FNC1 in first" or to indicate reader programming mode when FNC3 appears anywhere in the symbol. The details are symbology specific.
Aside: In addition to FNC non-data characters, there are other non-data characters commonly supported by barcode symbologies that have no direct ASCII representation but affect the overall message. These include macro characters (that wrap the message data in an "envelope"), and ECI indicators that require the use of a transmission protocol beyond the typical "basic channel" mode but which enable the use of extended character sets amongst other enhancements.
It is important to know (and to set up a scanner properly) that an FNC1 character in the first position is translated into a symbology identifier according to ISO/IEC 15424. The modifier m of the symbology identifier shows whether there was an FNC1 or not. If this is not done, the application can no longer tell whether a GS1 structure was intended. Other structures are identified by, e.g., Macro 06 in a Data Matrix code (ISO/IEC 16022, ISO/IEC 15434). It is necessary to work out the difference in order to take the correct action when processing the data.
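A small sketch of how a host application might act on that prefix. The identifier values below (]C1, ]d2, ]Q3) are the ones commonly documented for GS1-formatted Code 128, Data Matrix and QR Code symbols, but verify them against ISO/IEC 15424 and your reader's manual before relying on them:

# Sketch: deciding whether a scan carries GS1-formatted data by checking the
# symbology identifier prefix transmitted by the reader (if enabled).
GS1_IDENTIFIERS = {
    "]C1",  # Code 128 with FNC1 in first position (GS1-128)
    "]d2",  # Data Matrix ECC 200 with FNC1 in first position (GS1 DataMatrix)
    "]Q3",  # QR Code with FNC1 in first position (GS1 QR Code)
}

def is_gs1_scan(message: str) -> bool:
    """Return True if the 3-character symbology identifier flags GS1 data."""
    return message[:3] in GS1_IDENTIFIERS

print(is_gs1_scan("]C1" + "0100312345678906"))  # True
print(is_gs1_scan("]d1" + "plain data"))        # False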
Related
There is a service hosted on metafloor.com for generating barcodes using bwip-js.
I want to generate a barcode for the following data (the GS character is represented by {GS}).
(01)10875066000333(10)1212{GS}(17)121212(30)8{GS}
According to the documentation I'm able to generate a barcode for the data without GS characters:
https://bwipjs-api.metafloor.com/?bcid=gs1-128&text=(01)10875066000333(10)1212(17)121212(30)8
But the scanner requires GS characters.
The documentation is clear
Special characters must be encoded in format ^NNN
The parse option has to be enabled by passing the parsefnc parameter
The parameter has to be URL-encoded.
So for my string it's:
https://bwipjs-api.metafloor.com/?bcid=gs1-128&text=(01)10875066000333(10)1212%5E029(17)121212(30)8%5E029&parsefnc
But this gives me Error: bwipp.GS1badCSET82character: AI 10: Invalid CSET 82 character.
I also tried
Send GS char directly as %1D
Send GS char as %5EGS
Send GS char as ^029
Send GS char directly
Set parsefnc=true
Combination of all above
But still getting the same error.
Is there something I'm doing wrong or is the problem on the other side?
For GS1 Application Identifier based data, trust the library to encode the data correctly by selecting the GS1-specific encoder for the symbology (gs1-128 in this case) and then provide the input in bracketed AI notation, i.e. without FNC1 / GS separators.
The encoder will automatically add all of the necessary FNC1 non-data characters (which are transmitted as ASCII GS characters when read by a scanner) and it will also validate the contents of the AI data that you supply.
Users that select a generic symbology and then attempt to perform the AI encoding themselves are prone to making several mistakes:
Omitting the required FNC1 in first position.
Omitting the required FNC1 separators at the end of AIs with no pre-determined width.
Terminating pre-defined length AIs with unnecessary FNC1 characters.
Terminating the message with an unnecessary FNC1 character.
Encoding ASCII GS data characters instead of the canonical FNC1 non-data characters.
Including illegal, literal parentheses to denote the AIs.
Providing improperly formatted or invalid AI values.
Omitting requisite AI attributes.
Including mutually-exclusive AI pairings.
Many of these mistakes will result in failure to decode and interpret the GS1 AI data (even if the barcode appears to read successfully) which may result in charge-backs and necessitate relabelling or disposal.
The data that you are providing falls afoul of at least some of these pitfalls.
See this article for a thorough description of the checks that BWIPP (and hence BWIP-JS) implements to prevent such data quality issues.
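As a rough sketch of the advice above (untested; the endpoint and parameters are simply those from the URLs in the question), requesting the symbol with bracketed AI data and letting the encoder insert the separators might look like this in Python:

# Sketch: supply bracketed AI data only and let the GS1-aware encoder insert
# the FNC1/GS separators itself. Endpoint and parameters as in the question.
from urllib.parse import urlencode
from urllib.request import urlopen

params = urlencode({
    "bcid": "gs1-128",
    "text": "(01)10875066000333(10)1212(17)121212(30)8",  # no GS characters
})
url = "https://bwipjs-api.metafloor.com/?" + params

with urlopen(url) as response:
    image = response.read()  # the rendered barcode image (PNG by default, as far as I know)

with open("gs1-128.png", "wb") as f:
    f.write(image)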
I am testing code that decides which characters to show using an "ASCII value" routine: my program should drop control characters but keep characters that a user may enter. It seems that while the routine is called "ascii", it does not just return ASCII values: giving it the character ƒ returns 402.
For example, I have found this web site,
but it doesn't list ƒ as 402, as far as I can see.
I need to know whether there are other codes above 402 that I need to test my code with. The software that the "ascii" routine is written in uses UCS-2 internally. The web site I found doesn't mention UCS-2.
There are probably many interpretations of »Control Character« out there, but I'll assume you mean C0 and C1 control characters (includes references to the relevant Unicode Standards).
The commonly used 32-bit integer representation of Unicode characters in general is the codepoint notation: »U+« followed by a hexadecimal number of at least four digits, which you will find near mentions of characters, e.g. as in »U+007F (delete)«. The result of your »ASCII value« routine will probably be this number without the »U+«.
UCS-2 is a specific encoding for Unicode characters (which you probably won't need to care about directly), and it is equivalent to Unicode codepoints only for characters within the range of the BMP.
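If the underlying goal is to drop control characters while keeping anything a user might legitimately type, testing the Unicode character category is usually more robust than comparing against lists of numeric values. A minimal Python sketch, assuming the numbers your "ascii" routine returns are ordinary Unicode codepoints:

import unicodedata

def strip_control_chars(text: str) -> str:
    # Unicode category 'Cc' covers the C0 and C1 control characters.
    # Note: tab and newline are also 'Cc'; whitelist them if you need them.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")

print(ord("ƒ"))                                    # 402, i.e. codepoint U+0192
print(repr(strip_control_chars("ab\x07c\x9fƒ")))   # 'abcƒ'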
When I get data from some websites, sometimes the data is encoded in UTF-8 but looks like this:
Thỏ , Nạt
The accent mark is separated from the character, when in fact the string should be:
Thỏ, Nạt
I don't know what the problem is here or how to correct it. Can someone help me with this?
The first sample string contains two Vietnamese characters in decomposed form. The first one of them is “ỏ”, consisting of simple letter “o” followed by U+0309 COMBINING HOOK ABOVE.
The second sample string has those characters in precomposed form. The first one of them is “ỏ” U+1ECF LATIN SMALL LETTER O WITH HOOK ABOVE.
The decomposed and precomposed forms are defined to be “canonically equivalent” and are normally expected to result in the same rendering (though this does not always happen). They are not identical, however; in programmatic comparison of characters and strings, they are very much different.
Mostly Latin letters with diacritics, such as “é” and “ä”, are used in precomposed form only, since that’s what keyboard drivers, online keyboards, character picking utilities, etc., normally produce. However, Vietnamese keyboard drivers often work so that some diacritic marks are entered after entering a base character, and the diacritic is thus produced as a combining character, i.e. the letter (like “ỏ”) is then in decomposed form.
One way of dealing with this issue, recommended in many contexts, is to convert your strings to Normalization Form C (NFC). This would put these characters into precomposed form. Note, however, that conversion to NFC removes some other distinctions, too (but this is not relevant if the text is in Vietnamese only and does not contain special symbols).
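For example, in Python the conversion to NFC is a single standard-library call. A minimal sketch using the “ỏ” case discussed above:

import unicodedata

decomposed = "Tho\u0309"                   # 'o' followed by U+0309 COMBINING HOOK ABOVE
precomposed = unicodedata.normalize("NFC", decomposed)

print(precomposed == "Th\u1ecf")           # True: now uses U+1ECF, the precomposed letter
print(decomposed == precomposed)           # False: canonically equivalent, but not identical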
It remains a mystery why the first sample string has a space character before the comma.
Suppose I allow my users to submit a form containing some text fields (I'm not talking about passwords). My users would occasionally use non-ASCII characters like Russian, Chinese, etc. so I use UTF-8 charsets in my database. The question is, should I really allow all of the possible UTF-8 characters? I had a look at the ASCII table and saw that characters 0 to 31 have nothing to do with text, except for newlines and white spaces. Characters 176 to 223 seem to be for decorative purposes :p. Should I restrict them?
The W3C skips these characters in their example regular expression in Multilingual form encoding:
$field =~
m/\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x;
Make sure it is valid UTF-8 and Unicode? Yes
Make sure it does not include certain characters, such as control codes? Probably not necessary
You should be aware that even though you are using UTF-8 in your form, you may not get valid UTF-8 from all user-agents when they send form data to you, and you will have to filter it as necessary. Invalid UTF-8 can take many forms, some of them being
Overlong encodings (which can lead to security issues)
Other invalid UTF-8 byte sequences, which may indicate that the user-agent ignored the character encoding and has submitted something like Windows-1252 or ISO-8859-1 encoding instead.
Code points that lie in reserved surrogate space in Unicode
All the above need to be filtered out during input, otherwise you are not storing valid Unicode.
If you want to serve valid HTML or XHTML, which use a subset of Unicode, you will also need to filter out (either at input or output; see the sketch after this list):
C0 control codes 0x00 to 0x1F (apart from tab, space, new line, carriage return)
0x7F
C1 control codes 0x80 to 0x9F
(probably) any code point above 0x10FFFF
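A minimal Python sketch of both steps, validating the UTF-8 and stripping the characters listed above (Python's strict UTF-8 decoder already rejects overlong encodings, invalid byte sequences and surrogate code points):

def clean_form_field(raw: bytes) -> str:
    # Strict decoding covers overlongs, invalid sequences and surrogates.
    text = raw.decode("utf-8")             # raises UnicodeDecodeError if invalid
    allowed = {"\t", "\n", "\r"}
    # Drop the C0 controls, DEL (0x7F) and the C1 controls (0x80-0x9F).
    return "".join(
        ch for ch in text
        if ch in allowed or not (ord(ch) < 0x20 or 0x7F <= ord(ch) <= 0x9F)
    )

print(clean_form_field("Дом, 渋谷\n".encode("utf-8")))  # non-ASCII text passes through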
No.
It's a very bad idea to try to "pre-clean" user input. What you consider "decorative" might be absolutely necessary to readers of another language. The best solution is to store the text as-is in the database, and then sanitize it before writing to the page.
When you say "the ASCII table" you're talking about this page, aren't you? That page is garbage. Only the first 128 characters (ie, 0..127) are "ASCII"; the mappings they show for the numbers 128..255 are from an ASCII extension called cp437. There are a lot of "extended ASCII's" out there, and cp437 is far from the most common one.
But I digress. Your question isn't about character encodings, it's about filtering, and a filter should be based on the properties of the characters: is it a letter, a digit, a control character? Most modern programming languages provide methods or functions to obtain such information, and most provide regex support as well. As for what you should filter, or whether you should filter at all, only you can know that.
It sounds like you need to learn more about character encodings and Unicode, though. Start here.
I have a procedure that imports a binary file containing some strings. The strings can contain extended ASCII, e.g. CHR(224), 'à'. The procedure is taking a RAW and converting the BCD bytes into characters in a string one by one.
The problem is that the extended ASCII characters are getting lost. I suspect this is due to their values meaning something else in UTF8.
I think what I need is a function that takes an ASCII character index and returns the appropriate UTF8 character.
Update: If I happen to know the equivalent Oracle character set for the incoming text can I then convert the raw bytes to UTF8? The source text will always be single byte.
There's no such thing as "extended ASCII." Or, to be more precise, so many encodings are supersets of ASCII, sharing the same first 128 code points, that the term is too vague to be meaningful. You need to find out if the strings in this file are encoded using UTF-8, ISO-8859-whatever, MacRoman, etc.
The answer to the second part of your question is the same. UTF-8 is, by design, a superset of ASCII. Any ASCII character (i.e. 0 through 127) is also a UTF-8 character. To translate some non-ASCII character (i.e. >= 128) into UTF-8, you first need to find out what encoding it's in.
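For example, if the incoming text turned out to be ISO-8859-1 (that is only an assumption here; substitute the real source character set), the conversion is simply a decode of the raw bytes followed by a re-encode as UTF-8. A minimal Python sketch of the idea:

# Sketch: the incoming bytes are single-byte encoded; ISO-8859-1 is an
# assumption here -- use whatever the real source character set is.
raw_bytes = bytes([0x54, 0x68, 0xE0])        # "Th" followed by 0xE0 ('à' in ISO-8859-1)

text = raw_bytes.decode("iso-8859-1")        # interpret the bytes correctly
utf8_bytes = text.encode("utf-8")            # 'à' becomes the two bytes C3 A0

print(text)                                  # Thà
print(utf8_bytes.hex(" "))                   # 54 68 c3 a0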