A SubjectAltName contains a mix of ASCII and UTF-8 characters, such as:
6DC3A16E - m(small a with acute)n
I'm using X509_NAME_oneline to parse and getting mixed escape sequences like 'm\xC3\xA1n'
Is there an openssl function which would return a full UTF-8 string?
Thanks
John
"m\xC3\xA1n" is a full UTF-8 string. UTF-8 is a variable length encoding, and all of the ASCII characters (which have codepoints less than 128) are encoded identically in ASCII and UTF-8. The character m, for example, is just the single byte 0x6d in both ASCII and UTF-8.
In my source files I have string containing non-ASCII characters like
sCursorFormat = TRANSLATE("Frequency (Hz): %s\nDegree (°): %s");
But when I extract them they vanish like
msgid ""
"Frequency (Hz): %s\n"
"Degree (): %s"
msgstr ""
I have specified the encoding when extracting as
xgettext --from-code=UTF-8
I'm running under MS Windows and the source files are C++ (not that it should matter).
The encoding of your source file is probably not UTF-8 but ANSI, which on Windows means whatever the code page for non-Unicode applications happens to be (probably Windows-1252). If you opened the file in a hex editor you would see the byte 0xB0 standing for the degree symbol. That byte is not valid UTF-8; in UTF-8 the degree symbol is represented by the two bytes 0xC2 0xB0. This is why the character vanishes when you use --from-code=UTF-8.
The solution to your problem is to use --from-code=windows-1252, or, better yet, to save all source files as UTF-8 and then use --from-code=UTF-8.
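For a one-off re-encoding of an existing source file, iconv can do the conversion on most systems (the filenames here are just placeholders):
$ iconv -f WINDOWS-1252 -t UTF-8 source.cpp > source.utf8.cpp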
I want to convert a text file from ASCII encoding to UTF-8 encoding.
So far I have tried this:
open( my $test, ">:encoding(utf-8)", $test_file ) or die("Error: Could not open file!\n");
and ran the command below, which shows the encoding of the file:
file $test_file
test_file: ASCII text
Please let me know if I am missing something here.
Any file that is in ASCII (i.e. containing only codepoints from 0 to 127) is already in UTF-8. There will be no difference in encoding and, hence, no way for file to identify it as UTF-8.
Differences in encoding only appear with characters whose codepoints are 128 or above.
It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.
(From the Wikipedia article on UTF-8)
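You can watch the difference appear at codepoint 128 and above. For example, é (U+00E9) encodes to one byte in iso-8859-1 but two bytes in UTF-8; a quick sketch using Perl's Encode module:
$ perl -M5.010 -MEncode -e'say length(encode("iso-8859-1", "\x{E9}")); say length(encode("UTF-8", "\x{E9}"))'
1
2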
You are doing it correctly.
ASCII is a subset of UTF-8.
        decode       encode
ASCII  ⇒  Unicode  ⇒  UTF-8
-----     -------     -----
 00       U+0000       00
 01       U+0001       01
 02       U+0002       02
  ⋮          ⋮          ⋮
 7E       U+007E       7E
 7F       U+007F       7F
-----     -------     -----
ASCII  ⇐  Unicode  ⇐  UTF-8
        encode       decode
As such, an ASCII file is a UTF-8 file.[1]
When you only use that subset, file identifies the file as being encoded using ASCII.
$ perl -M5.010 -e'use utf8; use open ":std", ":encoding(UTF-8)"; say "abcdef"' | file -
/dev/stdin: ASCII text
Going out of that subset causes file to identify the file as text encoded using UTF-8.
$ perl -M5.010 -e'use utf8; use open ":std", ":encoding(UTF-8)"; say "abcdéf"' | file -
/dev/stdin: UTF-8 Unicode text
[1] It is also an iso-latin-1 file, an iso-latin-2 file, an iso-latin-3 file, a cp1250 file, a cp1251 file, a cp1252 file, etc, etc, etc
I have the following text in Notepad++ which is in "ANSI" encoding:
VallÃ©s, Ramon Casas
When I tell Notepad++ to "encode in UTF8" it displays as:
Vallés, Ramon Casas
The two characters Ã© are c3 a9 in hex. How can they become an é in UTF-8 when é is e9?
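For what it's worth, this is exactly how the bytes c3 a9 look from each side: decoded as Windows-1252 they are the two characters Ã©, decoded as UTF-8 they are the single character é; é is e9 only in ANSI. A quick sketch with Perl's Encode module (assuming a UTF-8 terminal):
$ perl -M5.010 -MEncode -e'use open ":std", ":encoding(UTF-8)"; say decode("cp1252", "\xC3\xA9"); say decode("UTF-8", "\xC3\xA9")'
Ã©
é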
I have a String that has non-ASCII characters encoded as "\\'fc" (without quotes), where fc is the hex value (decimal 252) that corresponds to the German ü umlaut.
I managed to find all occurrences and can replace them, but I have not been able to convert the fc to an ü.
"fc".hex.chr
gives me another representation...but if I do
puts "fc".hex.chr
I get nothing back...
Thanks in advance
PS: I'm working on ruby 1.9 and have
# coding: utf-8
at the top of the file.
fc is not the correct UTF-8 encoding for that character; that's iso-8859-1 or windows-1252. The UTF-8 encoding for ü is the two-byte sequence c3 bc. Further, a lone FC byte is not a valid UTF-8 sequence at all.
Since UTF-8 is assumed in Ruby 1.9, you should be able to get the literal u-umlaut with: "\xc3\xbc"
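The conversion the question asks for, sketched in Perl terms for comparison (assuming a UTF-8 terminal): decode the latin-1 byte to a character, then let a UTF-8 output layer encode it.
$ perl -M5.010 -MEncode -e'use open ":std", ":encoding(UTF-8)"; say decode("iso-8859-1", "\xFC")'
ü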
Have you tried
puts "fc".hex.chr(Encoding::UTF_8)
Ruby docs:
int.chr
Encoding
UPDATE:
Jason True is right: fc on its own is invalid UTF-8. My example still works, though; see the sketch below for why.
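The likely reason it works: chr with an explicit encoding argument treats the integer as a codepoint rather than a byte, so 0xFC becomes U+00FC (ü), which is then encoded as the two UTF-8 bytes c3 bc. The same distinction sketched in Perl:
$ perl -M5.010 -MEncode -e'say join " ", map { sprintf "%02X", ord } split //, encode("UTF-8", chr(0xFC))'
C3 BC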
Long story short:
+ I'm using ffmpeg to check the artist name of an MP3 file.
+ If the artist name contains Asian characters, the output is UTF-8.
+ If it contains just ASCII characters, the output is ASCII.
The output does not include any BOM at the beginning.
The problem is that if the artist has, for example, an "ä" in the name, the output is ASCII, just not US-ASCII, so "ä" is not valid UTF-8 and is skipped.
How can I tell whether the output text file from ffmpeg is UTF-8 or not? The application does not have any switches, and I just think it's plain dumb not to always go with UTF-8. :/
Something like this would be perfect:
http://linux.die.net/man/1/isutf8
If anyone knows of a Windows version?
Thanks a lot beforehand, guys!
This program/source might help you:
Detect Encoding for In- and Outgoing
Detect the encoding of a text without BOM (Byte Order Mark) and choose the best Encoding ...
You say, "ä" is not valid UTF-8 ... This is not correct...
It seems you don't have a clear understanding of what UTF-8 is. UTF-8 is a system for encoding Unicode codepoints. The question of validity lies not in the character itself; it is a question of how it has been encoded...
There are many systems which can encode Unicode codepoints; UTF-8 is one and UTF-16 is another... "ä" is quite legal in the UTF-8 system.. Actually all characters are valid, so long as the character has a Unicode codepoint.
However, ASCII has only 128 valid values, which equate identically to the first 128 characters in the Unicode codepoint system. Unicode itself is nothing more than a big look-up table. What does the work is the encoding system, e.g. UTF-8.
Because the 128 ASCII characters are identical to the first 128 Unicode characters, and because UTF-8 can represent these 128 values in a single byte, just as ASCII does, the data in an ASCII file is identical to the data in a file you would call a UTF-8 file. Simply put: ASCII is a subset of UTF-8... the two are indistinguishable for data in the ASCII range (i.e. the first 128 characters).
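That claim is easy to verify with Perl's Encode module: encoding the same all-ASCII string as ASCII and as UTF-8 produces byte-for-byte identical output (a quick sketch):
$ perl -M5.010 -MEncode -e'say encode("ascii", "abc") eq encode("UTF-8", "abc") ? "identical" : "different"'
identical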
You can check a file for 7-bit ASCII compliance..
# If nothing is output to stdout, the file is 7-bit ASCII compliant
# Output lines containing ERROR chars -- to stdout
perl -l -ne '/^[\x00-\x7F]*$/ or print' "$1"
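For example (the printf \x escapes assume bash; the second input contains the two UTF-8 bytes for é, so the line is flagged and echoed back, displayed here on a UTF-8 terminal):
$ printf 'plain ascii\n' | perl -l -ne '/^[\x00-\x7F]*$/ or print'
$ printf 'caf\xC3\xA9\n' | perl -l -ne '/^[\x00-\x7F]*$/ or print'
café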
Here is a similar check for UTF-8 compliance..
perl -l -ne '/
^( ([\x00-\x7F]) # 1-byte pattern
|([\xC2-\xDF][\x80-\xBF]) # 2-byte pattern
|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})) # 4-byte pattern
)*$ /x or print' "$1"
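As a quick sanity check of the pattern, here is a trimmed-down version that keeps only the 1-byte and 2-byte alternatives (the printf \x escapes assume bash):
# c3 bc is a valid 2-byte sequence -- no output
$ printf 'm\xC3\xBCn\n' | perl -l -ne '/^(([\x00-\x7F])|([\xC2-\xDF][\x80-\xBF]))*$/ or print'
# a lone fc byte is not valid UTF-8 -- the offending line is printed
$ printf 'm\xFCn\n' | perl -l -ne '/^(([\x00-\x7F])|([\xC2-\xDF][\x80-\xBF]))*$/ or print'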