Is byte 0xFF valid in a UTF-8 encoded string?

Can a UTF-8 string contain the byte 0xFF (255)?

No. It is specifically forbidden by the spec.

In the UTF-8 encoding table, a one-byte sequence covers the code points U+0000 through U+007F, and its single byte always has the high bit set to 0. Lead bytes of longer sequences begin with two to four 1 bits followed by a 0, and continuation bytes begin with 10. No valid pattern yields 0xFE or 0xFF, so those two bytes can never appear anywhere in well-formed UTF-8.
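This is easy to confirm empirically; for example, in Python:

try:
    b"\xff".decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte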

Related

APDU always gets response trailer SW1 SW2 = 0x67 0x00

I'm working with a PN5180 module to read data from my ePassport (ICAO 9303). I can send RATS (receiving the ATS) and PPS, so technically I can now exchange data using APDU commands. First, I tried to select LDS1, but however I tried, I always got SW1 SW2 = 0x67 0x00, which means "Wrong length".
Here is my trace:
RATS: 0xE0 0x80
ATS: 0E 78 77 D4 03 4D 4B 6A 43 4F 53 2D 33 37
PPS: 0xD0 0x11 0x00
PPS_resp: 0xD0
APDU_SELECT: 0x0A 0x00 0x00 0x00 0xA4 0x04 0x0C 0x07 0xA0 0x00 0x00 0x02 0x47 0x10 0x01
APDU_SELECT_resp: 0x0A 0x00 0x67 0x00
So maybe my INF in APDU_SELECT is incorrect, but the thing is, I have used a PN532 to communicate before, and I could read my ePassport with the same INF (using InListPassiveTarget and InDataExchange).
If anyone who sees this post has worked with the PN5180 or smart cards before, please let me know.
To everybody who is stuck on this problem like me:
It was my fault, because I had read the 2001 version of ISO 14443-4, but ISO has updated it to the 2018 version. The difference is the block format.
Take a look at the block format: in v2018, one byte is prepended as the first byte of the block. It is a length byte, which v2001 does not include in its format.
That's why, whatever I tried, my ePassport always took 0x0C (right before 0x07, which is the true Lc) as the Lc, and so it returned '6700' ("Wrong length") every time.
Link about ISO14443-4 2018.
Good luck to everybody who reads my post, and thank you so much, Maarten Bodewes!
I'm going to make a bit of a guess here; it may be the answer and it doesn't fit in a comment.
If you look at the I-block, the final nibble is defined as follows:
b4 : CID following, if bit is set to 1
b3 : NAD following, if bit is set to 1
b2 shall be set to 1
b1 : Block number
I presume that this is the first APDU, so the block number is probably the initialized one. However, as you provide both a CID and a NAD, I presume that you need to set both of those bits to 1. That would make a nibble with value 1110, which translates to E, not A. To the card, I would assume that the start of the APDU is then off by one byte, so it receives 0x0C instead of 0x07 as the Lc.
I don't see anything particularly wrong with the APDU, so I presume that the error lies outside of it.
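To make the bit arithmetic concrete, here is a quick sketch in Python (the helper name is mine, not from any library):

# Compose the low nibble of an ISO 14443-4 I-block PCB from the bits above.
def iblock_pcb_low_nibble(cid_following, nad_following, block_number):
    return (cid_following << 3) | (nad_following << 2) | (1 << 1) | block_number

print(hex(iblock_pcb_low_nibble(1, 1, 0)))  # 0xe: CID and NAD both announced
print(hex(iblock_pcb_low_nibble(1, 0, 0)))  # 0xa: CID only, as in the trace above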

How to detect and fix incorrect character encoding

An upstream service reads a stream of UTF-8 bytes, assumes they are ISO-8859-1, transcodes them from ISO-8859-1 to UTF-8, and sends them to my service, labeled as UTF-8.
The upstream service is out of my control. They may fix it, or it may never be fixed.
I know that I can fix the encoding by transcoding from UTF-8 back to ISO-8859-1 and then labeling the bytes as UTF-8. But what happens if my upstream fixes their issue?
Is there any way to detect this issue and fix the encoding only when I find a bad encoding?
I'm also not sure that the upstream encoding is ISO-8859-1. I think the upstream is Perl, so that encoding makes sense, and each sample I've tried has decoded correctly when I apply the ISO-8859-1 transform.
When the source sends e2 9c 94 (✔) to my upstream, my upstream sends me c3 a2 c2 9c c2 94 (â).
utf-8 string ✔ as bytes: e2 9c 94
bytes e2 9c 94 as latin1 string: â (plus two invisible control characters, U+009C and U+0094)
utf-8 string â as bytes: c3 a2 c2 9c c2 94
I can fix it by applying upstream.encode('ISO-8859-1').force_encoding('UTF-8'), but that will break as soon as the upstream issue is fixed.
Since you know how it is mangled, you can try to unmangle it by decoding the received UTF-8 bytes, encoding to latin1, and decoding as UTF-8 again. Only your mangled strings, pure ASCII strings, or very unlikely latin-1 character combinations will successfully decode twice. If that decoding fails, assume the upstream was fixed and just decode once as UTF-8. A pure ASCII string decodes correctly with either method, so there is no issue there either. There are valid UTF-8-encoded sequences that survive a double decode, but they are unlikely to occur in normal text.
Here's an example in Python (you didn't mention a language...):
# Assume bytes are latin1, but return encoded UTF-8.
def bad(b):
    return b.decode('latin1').encode('utf8')

# Assume bytes are UTF-8, and pass them along.
def good(b):
    return b

def decoder(b):
    try:
        return b.decode('utf8').encode('latin1').decode('utf8')
    except UnicodeError:
        return b.decode('utf8')

b = '✔'.encode('utf8')
print(decoder(bad(b)))
print(decoder(good(b)))
Output:
✔
✔
Bare ISO 8859-1 text is almost guaranteed to be invalid UTF-8. Attempting to reverse the mangling (encode back to ISO 8859-1, then decode as UTF-8) and falling back to simply decoding as UTF-8 when this produces invalid byte sequences should work for this specific case.
In some more detail, the UTF-8 encoding severely restricts which non-ASCII character sequences are allowed. The allowed patterns are extremely unlikely in ISO-8859-1 because in this encoding, they represent sequences like à followed by an unprintable control character or mathematical operator, which simply do not tend to occur in any valid text.
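For instance, a minimal illustration in Python of how restrictive the allowed patterns are:

# A lone latin-1 'æ' (0xE6) is not valid UTF-8: 0xE6 is a lead byte that
# demands two continuation bytes, which a following space is not.
print(bytes([0xE6, 0x20]).decode('utf-8', errors='replace'))  # '� '
# The pairs that are valid, such as 0xC3 0xB6 ('Ã' then '¶' in latin-1),
# decode fine as UTF-8 but virtually never occur in genuine latin-1 text.
print(bytes([0xC3, 0xB6]).decode('utf-8'))  # 'ö'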
Based on Mark Tolonen's answer, again in Python 3:
def maybe_fix_encoding(utf8_string, possible_codec="cp1252"):
    """Attempts to fix mangled text caused by interpreting UTF8 as cp1252
    (or other codec: https://docs.python.org/3/library/codecs.html)"""
    try:
        return utf8_string.encode(possible_codec).decode('utf8')
    except UnicodeError:
        return utf8_string
>>> maybe_fix_encoding("some normal text and some scandinavian characters Ã¦ Ã¸ Ã¥ Ã† Ã˜ Ã…")
'some normal text and some scandinavian characters æ ø å Æ Ø Å'
Based on turpachull's answer and the Python 3 list of standard encodings (and Mark Amery's answer listing the set for various versions of Python), here's a script that attempts every encoding transform on stdin and outputs each version that differs from the plain utf_8 one.
#!/usr/bin/env python3
import sys

encodings = ["ascii", "big5hkscs", "cp1006", "cp1125", "cp1250", "cp1252", "cp1254", "cp1256", "cp1258", "cp273", "cp437", "cp720", "cp775", "cp852", "cp856", "cp858", "cp861", "cp863", "cp865", "cp869", "cp875", "cp949", "euc_jis_2004", "euc_kr", "gbk", "hz", "iso2022_jp_1", "iso2022_jp_2004", "iso2022_jp_ext", "iso8859_11", "iso8859_14", "iso8859_16", "iso8859_3", "iso8859_5", "iso8859_7", "iso8859_9", "koi8_r", "koi8_u", "latin_1", "mac_cyrillic", "mac_iceland", "mac_roman", "ptcp154", "shift_jis_2004", "utf_16_be", "utf_32", "utf_32_le", "utf_7", "utf_8_sig", "big5", "cp037", "cp1026", "cp1140", "cp1251", "cp1253", "cp1255", "cp1257", "cp424", "cp500", "cp737", "cp850", "cp855", "cp857", "cp860", "cp862", "cp864", "cp866", "cp874", "cp932", "cp950", "euc_jisx0213", "euc_jp", "gb18030", "gb2312", "iso2022_jp", "iso2022_jp_2", "iso2022_jp_3", "iso2022_kr", "iso8859_10", "iso8859_13", "iso8859_15", "iso8859_2", "iso8859_4", "iso8859_6", "iso8859_8", "johab", "koi8_t", "kz1048", "mac_greek", "mac_latin2", "mac_turkish", "shift_jis", "shift_jisx0213", "utf_16", "utf_16_le", "utf_32_be", "utf_8"]

def maybe_fix_encoding(utf8_string, possible_codec="utf_8"):
    try:
        return utf8_string.encode(possible_codec).decode('utf_8')
    except UnicodeError:
        return utf8_string

for line in sys.stdin:
    i = line.rstrip('\n')
    for e in encodings:
        result = maybe_fix_encoding(i, e)
        if result != i or e == 'utf_8':
            print("\t".join([e, result]))
    print("\n")
usage e.g.:
$ echo 'Requiem der morgenröte' | ~/decode_string.py
cp1252 Requiem der morgenröte
cp1254 Requiem der morgenröte
iso2022_jp_1 Requiem der morgenr(D**B"yte
iso2022_jp_2 Requiem der morgenr(D**B"yte
iso2022_jp_2004 Requiem der morgenr(Q):B"yte
iso2022_jp_3 Requiem der morgenr(O):B"yte
iso2022_jp_ext Requiem der morgenr(D**B"yte
latin_1 Requiem der morgenröte
iso8859_9 Requiem der morgenröte
iso8859_14 Requiem der morgenröte
iso8859_15 Requiem der morgenröte
mac_iceland Requiem der morgenr̦te
mac_roman Requiem der morgenr̦te
mac_turkish Requiem der morgenr̦te
utf_7 Requiem der morgenr+AMMAtg-te
utf_8 Requiem der morgenröte
utf_8_sig Requiem der morgenröte

UTF-8 value of a character in ColdFusion?

In ColdFusion I can determine the ASCII value of a character by using asc().
How do I determine the UTF-8 value of a character?
<cfscript>
x = "漢"; // 3 bytes
// bytes of unicode character, a.k.a. String.getBytes("UTF-8")
bytes = charsetDecode(x, "UTF-8");
writeDump(bytes); // -26-68-94
// convert the 3 bytes to Hex
hex = binaryEncode(bytes, "HEX");
writeDump(hex); // E6BCA2
// convert the Hex to Dec
dec = inputBaseN(hex, 16);
writeDump(dec); // 15121570
// asc() uses the UCS-2 representation: 漢 = Hex 6F22 = Dec 28450
asc = asc(x);
writeDump(asc); // 28450
</cfscript>
UCS-2 is fixed at 2 bytes, so it cannot support all Unicode characters (a single character can take as many as 4 bytes). But what are you actually trying to achieve here?
Note: If you run this example and get more than 3 bytes returned, make sure CF picks up the file as UTF-8 (with BOM).
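For comparison, here is the same computation sketched in Python:

x = "漢"
b = x.encode("utf-8")     # the same 3 bytes: e6 bc a2
print(b.hex().upper())    # E6BCA2
print(int(b.hex(), 16))   # 15121570
print(ord(x))             # 28450, the value asc() reports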

How to restore PDF from ASCII?

I have a question: how can I restore a PDF file when all I have is its ASCII output?
Example:
%PDF-1.3
%���������
4 0 obj
<< /Length 5 0 R /Filter /FlateDecode >>
stream
x�ѽ
�0�ݧ8O�����[�AAqp� �jK|{S�"�f�2���[�
�(M#���#�FFIw�=*��?J4'�P�y^TP`�Q�
+�i�E�8ψ�g���º��(6�񮽗֭,���s0�T��ZL�~�e�.EA��`J�f��<��M�
[...]
0000120481 00000 n
0000122448 00000 n
trailer
<</Size 94 /Root 57 0 R /Prev 116103 /Info 1 0 R>>
startxref
122488
%%EOF
That is the beginning and the end of the output I have, and I need to restore it to a readable form. I tried a few things, but was unlucky.
It is impossible; the information was lost.
You can't represent binary data as printable text using the ASCII encoding at a one-byte-to-one-char ratio.
There are many non-printable characters in the ASCII table that get suppressed when the PDF's binary contents are converted as text, destroying the original data.
Quoted-printable and Base64 encodings are more suitable for such an application.
Check this out: Binary-to-text_encoding
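By contrast, a binary-to-text encoding such as Base64 carries arbitrary bytes through a text channel losslessly; a minimal sketch in Python:

import base64

data = bytes(range(256))                       # arbitrary binary, including non-printable bytes
text = base64.b64encode(data).decode("ascii")  # printable ASCII, safe to store or paste as text
assert base64.b64decode(text) == data          # round-trips to the exact original bytes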

Explain what those escaped numbers mean in Unicode encoding in Ruby 1.8.7

0186 is the Unicode "code". Where do 198 and 134 come from? How can I go the other way around, from these byte codes back to Unicode strings?
>> c = JSON '["\\u0186"]'
[
[0] "Ɔ"
]
>> c[0][0]
198
>> c[0][1]
134
>> c[0][2]
nil
Another confusing thing is unpack, which yields yet another seemingly arbitrary number. Where does that come from? Is it even correct? From the 1.8.7 String#unpack documentation:
U | Integer | UTF-8 characters as unsigned integers
>> c[0].unpack('U')
[
[0] 390
]
You can find your answers at Unicode Character 'LATIN CAPITAL LETTER OPEN O' (U+0186):
Note that 186 (hexadecimal) === 390 (decimal)
C/C++/Java source code : "\u0186"
UTF-32 (decimal) : 390
UTF-8 (hex) : 0xC6 0x86 (i.e. 198 134)
You can read more about UTF-8 encoding on Wikipedia's article on UTF-8.
UTF-8 (UCS Transformation Format, 8-bit) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32.
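To go the other way around, as the question asks, the two bytes follow mechanically from the code point; sketched here in Python (the arithmetic is identical in Ruby):

cp = 0x0186                    # 390 decimal
lead = 0xC0 | (cp >> 6)        # 0xC6 = 198: '110' prefix + the high bits
cont = 0x80 | (cp & 0x3F)      # 0x86 = 134: '10' prefix + the low 6 bits
print(lead, cont)              # 198 134
print(bytes([lead, cont]).decode("utf-8"))  # Ɔ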
