Assembly problems with ASCII extended characters - terminal

I want to know how to solve this issue with extended ASCII characters. I don't understand why it prints strange symbols instead of the letter that 0x90 represents.
When I use PutStr with c381, nothing happens.

This has nothing to do with assembly language and everything to do with UTF-8 (which your terminal is expecting) vs. ISO-8859-1 (Latin-1) or Windows-1252 (IDK which), the extended 8-bit character set you seem to be looking up codes from. It would be the same if you wrote a C program with those bytes in a char array[] and used stdio puts.
As @Fuz says, "Á does not have an ASCII code." ASCII only includes characters from 0..127 (and the low 32 are non-printable): http://www.asciitable.com/. Extended-ASCII 8-bit character sets only overlap with UTF-8 for code points from 0 to 127.
Any program that makes a write() system call to write a 0x90 byte to stdout will do the same thing, regardless of what language it was written in. (Use strace ./program to see what yours does, or pipe it into hexdump -C). For example, in bash run printf '\x90\n' to do exactly the same thing. 90 0a is not a valid UTF-8 multi-byte sequence, so your terminal prints a � glyph (a ? in a diamond).
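To see the same thing from C, here is a minimal sketch (my own illustration, not from the original post) that write()s the raw byte 0x90 to stdout; a UTF-8 terminal renders the replacement glyph, and piping it into hexdump -C shows the same 90 0a bytes as the assembly version:
#include <unistd.h>

int main(void) {
    /* 90 0a: the same bytes as printf '\x90\n' in bash */
    char bytes[] = { (char)0x90, '\n' };
    write(1, bytes, 2);    /* fd 1 is stdout */
    return 0;
}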
You could set your gnome-terminal to ISO-8859-1 or Windows 1252 (right click and use the dropdown, or find the menu entry). I'm using konsole, and it does support both those non-UTF-8 character encodings.
You'll probably want to set export LANG=en_US in that terminal only (not the usual en_US.UTF-8) if you do that, so other programs will continue to work well.
Or en_CA or whatever locale you actually use, just use the non-UTF-8 version of it so man's line-drawing will work, and so will full-screen text things like gdb's TUI layout reg mode, or editors like jed.

Related

Octal, Hex, Unicode

I have a character appearing over the wire that has a hex value and octal value \xb1 and \261.
This is what my header looks like:
From: "\261Central Station <sip#...>"
Looking at the ASCII table, the character in question is "±".
What I don't understand:
If I try to test the same by passing "±Central Station" in the header, I see it converted to "\xC2\xB1". Why?
How can I have "\xB1" or "\261" appearing over the wire instead of "\xC2\xB1"?
If I try to print "\xB1" or "\261", I never see "±" being printed. But if I print "\u00b1" it prints the desired character; I'm assuming that's because "\u00b1" is the Unicode format.
From the page you linked to:
The extended ASCII codes (character code 128-255)
There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1.
That's worth reading twice. The character codes 128–255 aren't ASCII (ASCII is a 7-bit encoding and ends at 127).
Assuming that you're correct that the character in question is ± (it's likely, but not guaranteed), your text could be encoded as ISO 8859-1 or, as @muistooshort kindly pointed out in the comments, any of a number of other ISO 8859-X or CP-12XX (Windows-12XX) encodings. We do know, however, that the text isn't (valid) UTF-8, because 0xb1 on its own isn't a valid UTF-8 character.
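A quick way to check that claim (a sketch of my own, not part of the original answer): in UTF-8, any byte of the form 10xxxxxx is a continuation byte and can never begin a character, and 0xB1 has exactly that form.
#include <stdio.h>

int main(void) {
    unsigned char b = 0xB1;                /* 1011 0001 */
    if ((b & 0xC0) == 0x80)                /* top two bits 10 -> continuation byte */
        printf("0x%02X cannot start a UTF-8 sequence\n", b);
    return 0;
}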
If you're lucky, whatever client is sending this text specified the encoding in the Content-Type header.
As to your questions:
If I try to test the same by passing ±Central Station in the header I see it get converted to \xC2\xB1. Why?
The text you're passing is in UTF-8, and the bytes that represent ± in UTF-8 are 0xC2 0xB1.
How can I have \xB1 or \261 appearing over the wire instead of \xC2\xB1?
We have no idea how you're testing this, so we can't answer this question. In general, though: Either send the text encoded as ISO 8859-1 (Encoding::ISO_8859_1 in Ruby), or whatever encoding the original text was in, or as raw bytes (Encoding::ASCII_8BIT or Encoding::BINARY, which are aliases for each other).
If I try to print \xB1 or \261 I never see ± being printed. But if I print \u00b1 it prints the desired character. (I'm assuming because \u00b1 is the Unicode format, but I would love it if someone could explain this in detail.)
That's not a question, but the reason is that \xB1 (\261) is not a valid UTF-8 character. Some interfaces will print � for invalid characters; others will simply elide them. \u00b1, on the other hand, is a valid Unicode code point, which Ruby knows how to represent in UTF-8.
Brief aside: UTF-8 (like UTF-16 and UTF-32) is a character encoding specified by the Unicode standard. U+00B1 is the Unicode code point for ±, and 0xC2 0xB1 are the bytes that represent that code point in UTF-8. In Ruby we can represent UTF-8 characters using either the Unicode code point (\u00b1) or the UTF-8 bytes (in hex: \xC2\xB1; or octal: \302\261, although I don't recommend the latter since fewer Rubyists are familiar with it).
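To make those bytes concrete, here is a small sketch in C (my own illustration, since the thread itself is about Ruby) of how a code point in the range U+0080..U+07FF is packed into two UTF-8 bytes:
#include <stdio.h>

int main(void) {
    unsigned int cp = 0x00B1;                 /* Unicode code point for ± */
    unsigned char b1 = 0xC0 | (cp >> 6);      /* 110xxxxx lead byte -> 0xC2 */
    unsigned char b2 = 0x80 | (cp & 0x3F);    /* 10xxxxxx continuation -> 0xB1 */
    printf("%02X %02X\n", b1, b2);            /* prints: C2 B1 */
    return 0;
}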
Character encoding is a big topic, well beyond the scope of a Stack Overflow answer. For a good primer, read Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", and for more details on how character encoding works in Ruby read Yehuda Katz's "Encodings, Unabridged". Reading both will take you less than 30 minutes and will save you hundreds of hours of pain in the future.

Where can I find a list of terminal ANSI codes sent by Ctrl-key sequences?

I am writing some behavioural tests for code that interacts with a terminal and I need to assert behaviour on the sequence C-p C-q (ctrl-p ctrl-q). In order to do this, I need to write the raw characters to the PTY. I have a small mapping at the moment for things like C-d => 0x04, C-h => 0x08.
Is there somewhere I can get a basic mapping of human-readable control sequences to the raw byte sequences for xterm?
Take the ASCII value of the character (e.g., for ^H, take 72, the value of H), and subtract 64. Thus, ^H is 8.
This works for any control character. Using it, you can discover that, for example, ^@ is the NUL character and ^[ is ESC.
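There's no need for a big lookup table, so a small C sketch (my own, covering the keys mentioned above) can generate the mapping:
#include <stdio.h>

int main(void) {
    /* C-p C-q C-d C-h C-[ C-@: control code = uppercase ASCII value - 64 */
    const char *keys = "PQDH[@";
    for (const char *k = keys; *k; k++)
        printf("C-%c => 0x%02X\n", *k, (unsigned)(*k - 64));
    return 0;
}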

Using non-ASCII characters in a cmd batch file

I'm working on a .bat program, and the program is written in Finnish. The problem is that CMD doesn't know these "special" letters, such as Ä, Ö, Å.
Is there a way to make those work? I'd also like it if the user could use those letters too.
Part of my code:
@echo off
/u
title JustATestProgram
goto test123
:test123
echo Letters : Ää Öö Åå
pause
exit
When I open this file, the letters come out as garbled symbols.
Try putting this line at the top of the batch file:
chcp 65001
It should change the console encoding to UTF-8, and you should be able to read the file properly in the script after that.
Theoretically you just need to use the /u (Unicode) switch:
c:\>cmd /u
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
c:\>echo Ä
Ä
If you use Notepad++, you can simply change the charset; doing so lets you write letters from the desired charset. The Western (US) charset should support them.
You can do it from a drop-down menu in Notepad++, or by hand by writing chcp 437. But I recommend doing it in Notepad++, as it shows you the output as it will appear in the batch file, so you can easily see whether you're using the right code page, and it's easy to switch if you want more special symbols. You can also, as stated in previous posts, try UTF-8.
You can read more about this here: http://ss64.com/nt/chcp.html. And here's a list of the different code pages (check out the OEM pages): Code Page Identifiers.
The command prompt uses DOS encoding. Windows uses ANSI or Unicode.
PS I'm assuming you are in the US with code page 437 rather than international English/Western European 850.
So I used Character Map to get the DOS codes, then found out what ANSI characters those codes map to.
These are the Notepad contents:
echo Ž„™”†
This was made by typing the DOS codes for your characters (0142, 0132, 0153, 0148, 0143, 0134) into Notepad, where they display as the ANSI characters above.
Command prompt output
C:\Windows\system32>echo ÄäÖöÅå
ÄäÖöÅå
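The mismatch is easy to reproduce from code. A minimal WinAPI sketch (my own illustration, not from the answer): decode those six bytes as OEM code page 437 and they come out as ÄäÖöÅå, which is exactly why Notepad, assuming ANSI code page 1252, shows Ž„™”† instead:
#include <windows.h>

int main(void) {
    unsigned char dos[] = { 142, 132, 153, 148, 143, 134 };      /* ÄäÖöÅå in CP437 */
    wchar_t wide[6];
    MultiByteToWideChar(437, 0, (const char *)dos, 6, wide, 6);  /* decode as OEM 437 */
    DWORD written;
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), wide, 6, &written, NULL);
    return 0;
}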
Alt + Character Code
Holding down Alt and typing the character code on the numeric keypad will enter that character. The keyboard language in use must support entering that character: if it does, the code is shown on the right-hand side of the status bar in Character Map; otherwise that section of the status bar is empty. The status bar is also empty for characters with well-known keys, like the letters A to Z.
However, there are two ways of entering codes. The point to remember is that the characters are the same for the first 127 codes; the difference lies in whether the first digit typed is a zero or not. If it is, the code inserts the character from the current (ANSI) character set; otherwise it inserts a character from the OEM character set. Codes over 255 enter the Unicode character and are given in decimal. Characters entered are converted to OEM for DOS applications, and to either ANSI or Unicode depending on the Windows application. See Converting Between Decimal and Hexadecimal.
E.g., Alt + 0 then 6 then 5, then releasing Alt, enters the letter A.
From Shortcut Keys and Key Modifiers by Me at https://1drv.ms/f/s!AvqkaKIXzvDieQFjUcKneSZhDjw

WriteConsoleW, wprintf and Unicode

#include <windows.h>
#include <stdio.h>
int main(void) {
    AllocConsole();
    HANDLE consoleHandle = GetStdHandle(STD_OUTPUT_HANDLE);
    WriteConsoleW(consoleHandle, L"qweąęėšų\n", 9, NULL, NULL);  /* prints fine */
    _wfreopen(L"CONOUT$", L"w", stdout);
    wprintf(L"qweąęėšų\n");  /* stops after "qwe" */
    return 0;
}
Output is:
qweąęėšų
qwe
Why does wprintf stop after printing qwe? AFAIK, a \0 byte encountered in ą should terminate the wide-char string.
At first I accepted Hans Passant's answer, but the root cause of wprintf not printing to UTF-8 streams is that wprintf behaves as though it uses the function wcrtomb, which encodes a wide character (wchar_t) into a multibyte sequence depending on the current locale.
Windows does not have a UTF-8-capable locale (a locale which would support the UTF-8 code page, 65001).
Quote from MSDN:
The set of available locale names, languages, country/region codes, and code pages includes all those supported by the Windows NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8.
The stdout stream can be redirected and therefore always operates in 8-bit mode. The Unicode string you pass to wprintf() gets converted from utf-16 to the 8-bit code page that's selected for the console. By default that's the olden 437 OEM code page. That's where the buck stops, that code page doesn't support the character.
You'll need to switch to another 8-bit code page, one that does support that character. A good choice is 65001, the code page for utf-8. Fix:
SetConsoleOutputCP(CP_UTF8);
Or use SetConsoleCP() if you want stdin to use utf-8 as well.
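Putting it together, a minimal sketch of the fix (my own illustration): switch the console output code page to UTF-8, after which the narrow, byte-oriented printf can emit those characters as UTF-8 bytes:
#include <windows.h>
#include <stdio.h>

int main(void) {
    SetConsoleOutputCP(CP_UTF8);  /* console now decodes output bytes as UTF-8 */
    printf("qwe\xC4\x85\xC4\x99\xC4\x97\xC5\xA1\xC5\xB3\n");  /* UTF-8 bytes for ąęėšų */
    return 0;
}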

Convert text from UTF to readable text

I have some UTF text starting with "ef bb bf". How can I turn this message into human-readable text? vim, gedit, etc. interpret the file as plain text and show all the ef-text even when I force them to read the file with several UTF encodings. I tried the "recode" tool; it doesn't work. Even PHP's utf8_decode failed to produce the expected text output.
Please help, how can I convert this file so that I can read it?
ef bb bf is the UTF-8 BOM. Strip off the first three bytes and try to utf8_decode the remainder.
$text = "\xef\xbb\xbf....";
echo utf8_decode(substr($text, 3));
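The same idea in C (a sketch of my own, assuming the text is already in a buffer): check for the three BOM bytes and skip them before processing the rest:
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *text = "\xEF\xBB\xBFHello";       /* BOM followed by the payload */
    if (strncmp(text, "\xEF\xBB\xBF", 3) == 0)
        text += 3;                                /* skip the 3-byte BOM */
    puts(text);                                   /* prints: Hello */
    return 0;
}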
Is it UTF8, UTF16, or UTF32? It matters a lot! I assume you want to convert the text into old-fashioned ASCII (where every character is 1 byte long).
UTF8 should already be (at least mostly) readable, as it uses 1 byte for standard ASCII characters and only uses multiple bytes for special/multilingual characters (character codes > 127). It sounds like your file isn't UTF8, or you'd already be able to read it! Online content is generally UTF-8.
Unicode character codes are the same as the old ASCII codes up to 127.
UTF16 and UTF32 use at least 2 and exactly 4 bytes respectively to encode every character, whether or not those characters could be represented in a single byte. That makes the text unreadable in an editor that is expecting UTF8.
Gedit supports UTF16 and UTF32, but you need to add those encodings explicitly in the open dialog box (and possibly select them explicitly instead of using auto-detect).
