convert text from UTF to readable text - UTF-8

I have some UTF text starting with "ef bb bf". How can I turn this message into human-readable text? vim, gedit, etc. interpret the file as plain text and show all the ef-text even when I force them to read the file with several UTF encodings. I tried the "recode" tool, but it didn't work. Even PHP's utf8_decode failed to produce the expected text output.
Please help, how can I convert this file so that I can read it?

ef bb bf is the UTF-8 BOM. Strip off the first three bytes and try utf8_decode on the remainder.
$text = "\xef\xbb\xbf...."; // the 3-byte BOM followed by the rest of the UTF-8 text
echo utf8_decode(substr($text, 3)); // drop the BOM, then decode

Is it UTF8, UTF16, or UTF32? It matters a lot! I assume you want to convert the text into old-fashioned ASCII (where every character is 1 byte long).
UTF8 should already be (at least mostly) readable, as it uses 1 byte for standard ASCII characters and only uses multiple bytes for special/multilingual characters (character codes > 127). It sounds like your file isn't UTF8, or you'd already be able to read it! Online content is generally UTF-8.
Unicode character codes are the same as the old ASCII codes up to 127.
UTF16 uses at least 2 bytes and UTF32 always uses 4 bytes to encode every character, even characters that would fit in a single byte. That makes the text unreadable in an editor that expects UTF8.
Gedit supports UTF16 and UTF32, but you need to 'add' those encodings explicitly in the open dialog box (and possibly select them explicitly instead of relying on auto-detect).
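To see why a UTF16 or UTF32 file looks like garbage in an editor that assumes UTF8, compare the raw bytes of the same short string. A quick Ruby sketch, just for illustration:
text = "Hi±"
puts text.encode("UTF-8").bytes.map { |b| format("%02x", b) }.join(" ")    # => 48 69 c2 b1
puts text.encode("UTF-16LE").bytes.map { |b| format("%02x", b) }.join(" ") # => 48 00 69 00 b1 00
puts text.encode("UTF-32LE").bytes.map { |b| format("%02x", b) }.join(" ") # => 48 00 00 00 69 00 00 00 b1 00 00 00
The ASCII letters keep their single-byte values in UTF-8, while UTF-16 and UTF-32 pad every character with NUL bytes, which an editor expecting UTF-8 renders as control characters or mojibake.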

Related

reading a file stored in storage or memory

Everything is stored digitally as 0s and 1s, whether it's a binary-format file or any other kind of file.
So when we open an executable file with a hex editor, I'd expect it to show the ASCII characters produced from those groups of 7 bits of 0s and 1s, since whatever the file is, it's just 0s and 1s in storage or in memory.
So why does it show strange characters that are not ASCII characters?
Doesn't a hex editor just parse the data 7 (or 8) bits at a time? Or does it read some metadata in the file and then decode based on that?
Your hex editor is choosing to decode the bytes not as ASCII but as some other character encoding.
You are right that the ASCII character set has 128 code points and that the ASCII character encoding encodes them as single bytes in the range 0 to 127. But bytes in an arbitrary file can range from 0 to 255, and the ASCII character set isn't used much for text files, so decoding as ASCII would reveal less about potential text than a more likely character encoding would, and it would cover only half of the possible byte values in a binary file.
A hex editor's job is to display and allow you to edit bytes. Additional presentations and editing capability are extra features. Many do present text, some allow search and replace of text. Some even work with other data formats such as decimal integers, multibyte integers, floating point, etc.
There is no text but encoded text. So, to support text, a hex editor—and any other program (including compilers)—must choose or allow you to choose a character encoding. For byte values that can't be decoded using that encoding, a dot or question mark is often substituted in the text display of a hex editor.
If you can't find out which character encoding your hex editor uses, you could test it by creating a file with the byte values 0 to 255, seeing what it displays, and matching that against the many, many possibilities. It might be the one your operating system uses as its "default": in a Windows cmd prompt, run chcp; in a Linux terminal, run locale.
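A quick way to produce such a test file, as a Ruby sketch (the file name is just a placeholder):
# write every byte value 0..255 into a file, in order
File.open("all_bytes.bin", "wb") { |f| f.write((0..255).to_a.pack("C*")) }
Open that file in the hex editor and compare what it shows for bytes 128 to 255 against candidate encodings such as Windows-1252 or ISO 8859-1.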

Octal, Hex, Unicode

I have a character appearing over the wire with hex value \xb1 and octal value \261.
This is what my header looks like:
From: "\261Central Station <sip#...>"
Looking at the ASCII table, that character is "±".
What I don't understand:
If I try to test the same thing by passing "±Central Station" in the header, I see it converted to "\xC2\xB1". Why?
How can I have "\xB1" or "\261" appear over the wire instead of "\xC2\xB1"?
If I try to print "\xB1" or "\261" I never see "±" printed. But if I print "\u00b1" it prints the desired character; I'm assuming that's because "\u00b1" is the Unicode format.
From the page you linked to:
The extended ASCII codes (character code 128-255)
There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1.
That's worth reading twice. The character codes 128–255 aren't ASCII (ASCII is a 7-bit encoding and ends at 127).
Assuming that you're correct that the character in question is ± (it's likely, but not guaranteed), your text could be encoded as ISO 8859-1 or, as #muistooshort kindly pointed out in the comments, any of a number of other ISO 8859-X or CP-12XX (Windows-12XX) encodings. We do know, however, that the text isn't (valid) UTF-8, because 0xB1 on its own isn't a valid UTF-8 sequence.
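You can check that last point in irb, for example:
"\xB1".force_encoding("UTF-8").valid_encoding?     # => false (0xB1 alone is not well-formed UTF-8)
"\xC2\xB1".force_encoding("UTF-8").valid_encoding? # => true  (these two bytes are "±")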
If you're lucky, whatever client is sending this text specified the encoding in the Content-Type header.
As to your questions:
If I try to test the same by passing ±Central Station in header I see it get converted to \xC2\xB1. Why?
The text you're passing is in UTF-8, and the bytes that represent ± in UTF-8 are 0xC2 0xB1.
How can I have \xB1 or \261 appearing over the wire instead of \xC2\xB1?
We have no idea how you're testing this, so we can't answer this question. In general, though: Either send the text encoded as ISO 8859-1 (Encoding::ISO_8859_1 in Ruby), or whatever encoding the original text was in, or as raw bytes (Encoding::ASCII_8BIT or Encoding::BINARY, which are aliases for each other).
If I try to print \xB1 or \261 I never see ± being printed. But if I print \u00b1 it prints the desired character. (I'm assuming because \u00b1 is the unicode format but I will love If some can explain this in detail.)
That's not a question, but the reason is that \xB1 (\261) is not a valid UTF-8 character. Some interfaces will print � for invalid characters; others will simply elide them. \u00b1, on the other hand, is a valid Unicode code point, which Ruby knows how to represent in UTF-8.
Brief aside: UTF-8 (like UTF-16 and UTF-32) is a character encoding specified by the Unicode standard. U+00B1 is the Unicode code point for ±, and 0xC2 0xB1 are the bytes that represent that code point in UTF-8. In Ruby we can represent UTF-8 characters using either the Unicode code point (\u00b1) or the UTF-8 bytes (in hex: \xC2\xB1; or octal: \302\261, although I don't recommend the latter since fewer Rubyists are familiar with it).
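Here's a small irb sketch of those two representations and of the ISO 8859-1 conversion mentioned above:
utf8  = "\u00b1Central Station"           # ± is the two bytes 0xC2 0xB1 in UTF-8
latin = utf8.encode(Encoding::ISO_8859_1) # ± becomes the single byte 0xB1 in ISO 8859-1
utf8.bytes.first(2).map { |b| format("%02x", b) }  # => ["c2", "b1"]
latin.bytes.first(1).map { |b| format("%02x", b) } # => ["b1"]
Sending the latin string gives you the single 0xB1 byte on the wire, provided whatever writes the header doesn't transcode it back to UTF-8 along the way.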
Character encoding is a big topic, well beyond the scope of a Stack Overflow answer. For a good primer, read Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", and for more details on how character encoding works in Ruby read Yehuda Katz's "Encodings, Unabridged". Reading both will take you less than 30 minutes and will save you hundreds of hours of pain in the future.

how notepad++ displays extended chars

I'm working with binary data and want to find out what is wrong.
I use Notepad++ to preview the binary. I have set View->Show Symbol->Show All Characters to see all characters, but there are still some characters I cannot identify, e.g. â©ÎÅ. The problem is that ASCII has a firm standard for the numbers 0 to 127, but extended ASCII can be displayed in many different ways, so I have trouble with the characters that represent the numbers 128 to 255.
Is there a table of Notepad++'s extended characters, or some option to make it show the symbol's code instead of the symbol?
Maybe not a solution for you, but in the PSPad and SynWrite editors you can:
create a text converter (an INI file) which changes ASCII 127..255 to strings like <127>...<255> or similar,
apply this converter to the text.
Text converter usage is described in the help of both apps.
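If you'd rather not rely on a specific editor, a short script can produce the same kind of view. A Ruby sketch (the file name is hypothetical):
# print the file with every byte above 127 replaced by its decimal code in angle brackets
data = File.read("dump.bin", mode: "rb")
puts data.bytes.map { |b| b > 127 ? "<#{b}>" : b.chr }.join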

Converting ANSI to UTF8 with Ruby

I have a Ruby script that generates an ANSI file.
I want to convert the file to UTF8.
What's the easiest way to do it?
If your data is within the ASCII range 0 to 0x7F, it's already valid UTF8, so you don't need to do anything.
Or, if there are characters above 0x7F, you could use Iconv, giving it whichever 8-bit "ANSI" code page the file actually uses as the source encoding:
require 'iconv'
text = Iconv.conv('UTF-8', 'ISO-8859-1', text)  # or 'WINDOWS-1252', etc.
The 8-bit Unicode Transformation Format (UTF-8) was designed to be backwards compatible with the American Standard Code for Information Interchange (ASCII). Therefore, by definition, any valid ASCII sequence is also a valid UTF-8 sequence. For more information, read the UTF FAQ and Unicode FAQ.
Any ASCII file is already a valid UTF8 file, so going by your question's title, no conversion is needed at all.
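Note that Iconv was deprecated in Ruby 1.9 and is no longer shipped with newer Rubies; String#encode does the same job. A sketch, assuming the "ANSI" file is actually Windows-1252 (adjust to whatever code page your data really uses):
ansi = File.read("input.txt", mode: "rb")                  # read the raw bytes; the file name is just an example
utf8 = ansi.force_encoding("Windows-1252").encode("UTF-8") # label the bytes, then transcode
File.write("output.txt", utf8)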

Importing extended ASCII into Oracle

I have a procedure that imports a binary file containing some strings. The strings can contain extended ASCII, e.g. CHR(224), 'à'. The procedure is taking a RAW and converting the BCD bytes into characters in a string one by one.
The problem is that the extended ASCII characters are getting lost. I suspect this is due to their values meaning something else in UTF8.
I think what I need is a function that takes an ASCII character index and returns the appropriate UTF8 character.
Update: if I happen to know the equivalent Oracle character set for the incoming text, can I then convert the raw bytes to UTF8? The source text will always be single-byte.
There's no such thing as "extended ASCII." Or, to be more precise, so many encodings are supersets of ASCII, sharing the same first 127 code points, that the term is too vague to be meaningful. You need to find out if the strings in this file are encoded using UTF-8, ISO-8859-whatever, MacRoman, etc.
The answer to the second part of your question is the same. UTF-8 is, by design, a superset of ASCII. Any ASCII character (i.e. 0 through 127) is also a UTF-8 character. To translate some non-ASCII character (i.e. >= 128) into UTF-8, you first need to find out what encoding it's in.
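To illustrate with Ruby (used elsewhere on this page; the same principle applies in whatever tool ends up doing the conversion), the byte 0xE0 only becomes 'à' once you declare which single-byte character set it came from:
raw = "\xE0".b                                   # one raw byte, i.e. CHR(224)
raw.force_encoding("ISO-8859-1").encode("UTF-8") # => "à" (UTF-8 bytes 0xC3 0xA0)
Declare a different source character set and the same byte becomes a different character, which is why the encoding has to be identified before any conversion to UTF-8.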
