I'm taking a Python class right now, and I just learned about the backspace character. Like newline (\n), backspace is a special character; it has ASCII code 8. My teacher couldn't think of a reason to use it, but I'm curious as to how it's used. Maybe just historical reasons? When I tried print("Hellow\b World"), I got just what I expected: Hello World.
What's the reason for the backspace character, and how might it be used?
Edit: I am aware that it isn't python specific, but when writing the original question I was thinking mainly about Python and forgot this fact. I've tried to edit to make this more clear.
Backspace is a control character that moves the cursor one character back in the console but doesn't delete it.
What's the reason for the backspace character, and how might it be used?
It was used historically in the ASCII world to print accented characters.
For example à could be produced using the three character sequence a Backspace ` (or, using the characters' hex values, 0x61 0x08 0x60).
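As a small illustration of that sequence, the following Python sketch builds the three characters and prints their byte values, so the invisible backspace can be seen. The variable names are only illustrative, and what a modern terminal actually shows for this string will vary, since most terminals replace rather than overprint.

```python
# Build the overprint sequence described above: 'a', BACKSPACE, '`'.
overprint = "a\b`"

# Show the raw bytes so the invisible backspace becomes visible.
hex_bytes = overprint.encode("ascii").hex(" ")
print(hex_bytes)  # 61 08 60
```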
See more on backspace here
Backspace key vs Backspace character
A lot of people confuse the two. The Backspace key on a keyboard almost universally deletes the previous character (that is, it moves the cursor back and deletes that character). The backspace character '\b', however, only moves the cursor one position back in the console window and doesn't delete anything.
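To make the "moves back but does not delete" behaviour concrete, here is a minimal sketch that simulates a one-line display in Python. The function render is hypothetical and assumes a very simple terminal in which only '\b' is special; real terminals handle many more control codes.

```python
def render(text: str) -> str:
    """Simulate how a simple one-line terminal would display `text`.

    Assumption: only the backspace character is treated specially; every
    other character is written at the cursor and advances it by one column.
    """
    cells = []   # characters currently on the line
    cursor = 0   # column where the next character will be written
    for ch in text:
        if ch == "\b":
            cursor = max(0, cursor - 1)   # move left, erase nothing
        elif cursor < len(cells):
            cells[cursor] = ch            # overwrite what is already there
            cursor += 1
        else:
            cells.append(ch)
            cursor += 1
    return "".join(cells)


print(render("Hellow\b World"))  # Hello World
print(render("abc\b"))           # abc  (the c is still there)
```

In the first call the w disappears only because the space that follows the backspace is written on top of it.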
Maybe it helps to first understand what is happening there:
print() is writing to the standard output and it is writing everything there including the w and the backspace.
Now something has to display it. Most likely a terminal emulator.
What theoretically happens is that the w is displayed and then overwritten by the space that follows the backspace, but the output was not flushed in between, so it happened too fast to actually be seen.
So real-life applications will almost always use \b at the beginning of the printed text.
I wrote a short example that shows a little spinner on a progress indicator. The example prints "-" followed by "\b\\" (replacing the - with \), then "\b|" (replacing the \ with |), and so on.
That way - \ | / - \ | / looks like an animated rotating line.
#!/usr/bin/env python3
import time

spinner = "\\|/-"
print("----------\r", end='', flush=True)
for pos in range(10):
    print("-", end='', flush=True)
    for spin in range(25):
        # here should be a break as soon as some task has progressed further
        time.sleep(0.1)
        print("\b" + spinner[spin % 4], end='', flush=True)
    print("\b*", end='', flush=True)
print()
P.S.: A lot of existing programs use control characters like \b, \r and \033 to display a status line. The most popular is probably wget. I have also seen such output from at least one Python script (although I can't remember which one any more).
This is not a feature of Python, but a symbol defined by ASCII. Python just supports it (like all other languages).
More specifically, it is a control character that is used either to erase the last character printed or to overprint it. The first version of ASCII was published in 1963 when the common way to output symbols was to send them to a printer and physically print the letters on paper. Here's an excerpt from Wikipedia:
Printing control characters were first used to control the physical mechanism of printers, the earliest output device. [...] The backspace character (BS) moves the printing position one character space backwards. On printers, this is most often used so the printer can overprint characters to make other, not normally available, characters. On terminals and other electronic output devices, there are often software (or hardware) configuration choices which will allow a destruct backspace (i.e., a BS, SP, BS sequence) which erases, or a non-destructive one which does not.
A small example of how it works:
>>> text = "Robert" + "\b\b\b\b" + 'p' + "\b\b" + 'u'
>>> print(text)
Rupert
>>> print(list(text))
['R', 'o', 'b', 'e', 'r', 't', '\x08', '\x08', '\x08', '\x08', 'p', '\x08', '\x08', 'u']
>>> text += "\bob"
>>> print(text)
Robert
>>> print(list(text))
['R', 'o', 'b', 'e', 'r', 't', '\x08', '\x08', '\x08', '\x08', 'p', '\x08', '\x08', 'u', '\x08', 'o', 'b']
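The destructive backspace mentioned in the Wikipedia excerpt (a BS, SP, BS sequence) can be sketched in Python as follows. The names DESTRUCTIVE_BS and erase are made up for this illustration, and the visible result assumes the output goes to a terminal that interprets '\b' as "move the cursor left".

```python
import sys

# BS, SP, BS: step back, blank the cell with a space, step back again.
DESTRUCTIVE_BS = "\b \b"


def erase(count: int) -> str:
    """Return a sequence that blanks the last `count` characters on the line."""
    return DESTRUCTIVE_BS * count


# On a terminal this should end up looking like "Hello World".
sys.stdout.write("Hellow" + erase(1) + " World\n")
```

Unlike a bare '\b', this leaves a blank cell behind even if nothing else is printed afterwards.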
Sometimes you have a TCP/IP connection to a device that sends data per keystroke rather than sending the entire line when it reaches a line end. This still happens a lot today, and in that case you actually need to send backspaces to undo characters already typed. You need a command that means "unsend the last character". This is that character, and if you need it, use it.
I am new to Go. Just learnt the various uses of fmt.Println(). I tried the following stuff in the official playground but got a pretty unexpected output. Please explain where I have gone wrong in my understanding.
input: fmt.Println("hi\b", "there!")
output: hi� there!
expected: h there!
input: fmt.Println("hi", '\b', "there!")
output: hi 8 there!
expected: hithere!... assuming runes are not appended with spaces
input: fmt.Println("hi", "\bthere!")
output: hi �there!
expected: hithere!
(Note: above, the placeholder character has been substituted by U+FFFD, as the original character does not render consistently between environments.)
Your program outputs exactly what you told it to. The problem is mostly with your output viewer.
Control characters and sequences only have their expected effect when sent to a compatible virtual console (or a physical terminal, or a printer or teletypewriter; but the latter are pretty rare these days). What the Go playground does is capture the output of your program as-is and send it unmodified to the browser to display. The browser does not interpret terminal control codes (other than the newline character, and even that only sometimes); instead, it expects formatting to be conveyed via HTML markup. Since the backspace character does not have an assigned glyph, browsers will usually display a placeholder glyph instead, or sometimes nothing at all.
You would get a similar effect if, when running your Go program on your local machine, you redirected its output into a text file and then opened the file in a text editor: the editor will not interpret any escape sequence contained in the text file; sometimes it will even actively prevent control characters from being interpreted by the terminal displaying the editor (if it happens to be console-based editor), by substituting a symbolic, conventional representation of the character like ^H.
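The ^H convention mentioned above is usually called caret notation. A minimal sketch of how such a substitution could be computed, in Python and with hypothetical function names, is:

```python
def caret_notation(ch: str) -> str:
    """Return '^X' style notation for an ASCII control character."""
    code = ord(ch)
    if code < 0x20 or code == 0x7F:
        return "^" + chr(code ^ 0x40)   # flip bit 6: 0x08 -> 'H', 0x7F -> '?'
    return ch


def make_visible(text: str) -> str:
    """Replace every control character in `text` with its caret notation."""
    return "".join(caret_notation(ch) for ch in text)


print(make_visible("hi\b there!"))  # hi^H there!
```

This is only an illustration of the convention; actual editors differ in which characters they substitute and how.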
In the middle example, the '\b' literal evaluates to an integer with the value of the character’s Unicode code point number (what Go terms a ‘rune’). This is explained in the specification:
A rune literal represents a rune constant, an integer value identifying a Unicode code point. A rune literal is expressed as one or more characters enclosed in single quotes, as in 'x' or '\n'. Within the quotes, any character may appear except newline and unescaped single quote. A single quoted character represents the Unicode value of the character itself, while multi-character sequences beginning with a backslash encode values in various formats.
Since '\b' represents U+0008, what is passed to fmt.Println is the integer value 8. The function then prints the integer as its decimal representation, instead of interpreting it as a character code.
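The same distinction between a character and its code point can be reproduced in Python (this is an analogue for illustration, not Go code). Python's print also puts a space between its arguments, so the result looks like the Go output; code_point is just an illustrative name.

```python
# Python analogue (not Go): the code point of backspace is the integer 8.
code_point = ord("\b")

print("hi", code_point, "there!")        # hi 8 there!
print("hi", chr(code_point), "there!")   # sends the actual backspace character
```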
The first thing to check is your terminal: '\b' is terminal dependent. Check whether the terminal running your program handles it as "move cursor one character back" (most Unix-like systems will; I don't know about Windows). Your first and third examples work exactly as you expect on my terminal (st).
input: fmt.Println("hi", '\b', "there!")
output: hi 8 there!
expected: hithere!... assuming runes are not appended with spaces
Here your assumption is not what package fmt does:
For each Printf-like function, there is also a Print function that takes no format and is equivalent to saying %v for every operand. Another variant Println inserts blanks between operands and appends a newline.
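The fmt package formats a rune under %v as %d, not %c, so '\b' is printed as the string "8" (the character with ASCII value 56), not as the character '\b' (ASCII value 8). Println also inserts a space between operands, so the rune gets a space on either side when it sits between two arguments.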
What Println does for this input is:
print string "hi"
print space
format number 8 then print string "8"
print space
print string "there!"
To debug problems such as the rendering of invisible characters, I suggest using the encoding/hex package. For example:
package main

import (
	"encoding/hex"
	"fmt"
	"os"
)

func main() {
	d := hex.Dumper(os.Stdout)
	defer d.Close()
	fmt.Fprintln(d, "hi", '\b', "there!")
}
Output: 00000000 68 69 20 38 20 74 68 65 72 65 21 0a |hi 8 there!.|
Playground: https://go.dev/play/p/F-I2mdh43K7
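For comparison, a similar debugging trick in Python is to print the repr of a string, which shows control characters as escape sequences. The helper show_invisible below is a hypothetical wrapper, shown only as a sketch of the idea.

```python
def show_invisible(text: str) -> str:
    """Return `text` with control characters shown as escape sequences."""
    return repr(text)


print(show_invisible("hi\b there!"))   # 'hi\x08 there!'
print(show_invisible("hi 8 there!"))   # 'hi 8 there!'
```

The first line contains a real backspace (shown as \x08); the second contains the digit 8, which is what the Go program above actually printed.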
I'm new to golang and learning it now. I'm reading "The Go Programming Language" book and trying to run the dup1 example on my Mac, but I noticed a very weird issue: the count in the output contains an extra "D". Does anyone have any idea why?
> go run dup1New.go test
test
test
hello
hello
world
3D test
2 hello
> cat dup1New.go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	counts := make(map[string]int)
	input := bufio.NewScanner(os.Stdin)
	for input.Scan() {
		counts[input.Text()]++
	}
	// NOTE: ignoring potential errors from input.Err()
	for line, n := range counts {
		if n > 1 {
			fmt.Printf("%d\t%s\n", n, line)
		}
	}
}
go version go1.13.5 darwin/amd64
You're getting that D character from Ctrl+D because of the echoctl option in your terminal device interface. You can easily turn it off by running this command in your shell/terminal:
stty -echoctl
Ref: man stty
As wlisrausr answered, this is in part from your MacOS Terminal stty settings. (You probably should not turn off echoctl, though.)
To be more complete: when you type the CTRL+D sequence to signal EOF,[1] the tty driver[2] "displays" the character as the two-character sequence ^D, but then prints two backspace or CTRL+H characters. More precisely, it does so as long as the ECHOCTL flag is set in the lflags control field in the underlying tty settings.
The window that is displaying the interactive Terminal session is treating output as directives to draw particular characters, move (position) the cursor, and have other interesting effects. Some character codes, particularly those in the range 0x20 (32 decimal) through 0x7e (126 decimal), are displayable ASCII characters. Others are control characters (or ANSI escape codes), or Unicode characters that have been encoded in UTF-8. Go itself uses UTF-8 extensively, to encode runes, so Go's use of UTF-8 dovetails nicely with Terminal's use of UTF-8.[3]
The CTRL+H, ASCII code 8—which they call BACKSPACE or BS—has the effect of moving the cursor back one display-column. That is, it is a cursor-positioning control code. (There are many of these; see the ANSI escape codes page. This stuff has a very long history, going back to just after the first glass tty.)
So, the CTRL+D has been displayed as ^D, but the cursor is positioned over the ^ (hat or caret or circumflex) character. Now you, in your Go program, send to the Terminal display-handling code, a sequence of ASCII codes: 3, which is 0x33 or 51 decimal; then TAB or CTRL+I or ASCII Horizontal Tab (HT), which is code 9; then the ASCII codes for the letters test (0x74, 0x65, 0x73, 0x74), then a newline or CTRL+J or ASCII NL, which is code 10.
Like backspace, a horizontal tab is a cursor positioning operation. It directs the terminal (or window emulation of terminal) to move the cursor to the next tab-stop, without changing anything else on the display. So you first overwrite the ^ with 3, leaving 3D visible, and the cursor positioned over the letter D. Then you have Terminal move the cursor to column 9 (columns are numbered from 1 and the default tab stop is at every eighth column) and display the word test, and then move the cursor to column 1 of a new line. The result is that the line shows:
3D test
(with exactly six blank positions between D and the first t). On the newly exposed or created line, which is currently all-blank, you print the character 2, move to column 9, and print the letters hello (and another newline directive).
[1] In fact, control-D simply pushes the accumulating line through the "input canonization" queue as is. If the line is empty, this sends a zero-length record up the tty's read side. Reading zero bytes from a file or device-file is interpreted as EOF by many systems, including Go's os.File reader. If you type a partial line, without a terminating newline, and then use control-D to send it, you can no longer edit that partial line, and a reader that is reading and is not concerned with newlines will have obtained the data and be using it at this point. A second control-D is then required to signal the EOF: the reader simply got the non-newline terminated input from the first control-D.
[2] This link describes Linux tty drivers, but Linux tty drivers are derived from the same common ancestor behind MacOS tty drivers.
[3] This is not an accident, even though the Go folks are not the Darwin folks: again, all this stuff goes back (via different paths) to some common ancestors.
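The overwrite described in this answer can be reproduced with a small simulation. The sketch below is in Python rather than Go, the function render_line is hypothetical, and it assumes a terminal that only understands backspace and tab, with tab stops every eight columns.

```python
def render_line(text: str, tab_width: int = 8) -> str:
    """Simulate one terminal line that understands only '\\b' and '\\t'."""
    cells = []
    cursor = 0  # zero-based column of the next character
    for ch in text:
        if ch == "\b":
            cursor = max(0, cursor - 1)                      # move left only
        elif ch == "\t":
            cursor = (cursor // tab_width + 1) * tab_width   # next tab stop
        else:
            while len(cells) < cursor:
                cells.append(" ")        # columns skipped by a tab stay blank
            if cursor < len(cells):
                cells[cursor] = ch       # overwrite an existing character
            else:
                cells.append(ch)
            cursor += 1
    return "".join(cells)


echoed = "^D\b\b"      # what the tty driver echoes for Ctrl+D with echoctl set
program = "3\ttest"    # what the Go program prints for the first count
print(render_line(echoed + program))
```

The 3 lands on top of the ^, the D is never overwritten, and the tab skips six blank columns before test.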
I'm echoing the serial port input to a CRichEditCtrl, one char at a time as it arrives. The problem I've come across is that when I receive '\r' followed by '\n' I end up two lines further down page, not one. Debugging it a little I realise that sending "\r\n" results in (what I'd consider to be) the correct, single new line insertion, but sending '\r' and '\n' separately yields two new lines.
Simple example, where m_Output is obviously a rich edit control variable:
m_Output.SetSel(-1, -1);
m_Output.ReplaceSel(_T("X\r\n"));
m_Output.SetSel(-1, -1);
m_Output.ReplaceSel(_T("Y"));
m_Output.SetSel(-1, -1);
m_Output.ReplaceSel(_T("\r"));
m_Output.SetSel(-1, -1);
m_Output.ReplaceSel(_T("\n"));
m_Output.SetSel(-1, -1);
m_Output.ReplaceSel(_T("Z"));
The output from the above is:
X
Y

Z
Why the extra line?!?!
I figure maybe something about the behaviour of Set/ReplaceSel(), but it doesn't insert lines between regular characters in this way, e.g. if I send 'a' followed by 'b' the output is simply "ab" ...
The various versions of the RichEdit control are documented as using different characters for paragraph breaks; RichEdit 1.0 used \r\n, RichEdit 2.0 is documented as using \r and RichEdit 3.0 (and presumably higher) can use both.
What this looks like though is that the control is actually seeing a solitary \n as a break as well (i.e. it sounds like it accepts \r, \n and \r\n as all representing a single break). This doesn't match the documentation but then again it wouldn't be the first time Microsoft documentation was somewhat inaccurate.
Internally the control probably doesn't store the actual break character verbatim, so when you feed it a \r and then separately a \n it isn't able to join them together into a single break.
It sounds like the easiest solution for you would be to filter out \n characters rather than sending them to the control. That way all the control will see are the \r characters and you'll only end up with a single break in the text.
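The filtering idea is language independent; a minimal sketch in Python (not the MFC code from the question) could look like this. The name drop_line_feeds is made up, and the sketch assumes the device always sends '\r' before '\n', so a lone '\n' would simply be dropped.

```python
def drop_line_feeds(chars):
    """Yield every character except '\\n', so the control only sees '\\r'."""
    for ch in chars:
        if ch != "\n":
            yield ch


received = "X\r\nY\r\nZ"   # characters as they arrive from the serial port
forwarded = "".join(drop_line_feeds(received))
print(repr(forwarded))     # 'X\rY\rZ'
```

Because the filter looks at one character at a time, it works the same whether the '\r' and '\n' arrive together or separately.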
To dynamically delete a character from a string, you can use the \b character.
puts "Hello\b World!" #=> Hell World!
\b basically does the same thing as a backspace. Is there a character that emulates a forward delete?
In the execution of:
puts "Hello\b World!"
The \b doesn't delete the prior character. This is a common misconception since a backspace used on a keyboard will delete the previously typed character prior to the cursor on screen and from the keyboard input buffer. That behavior occurs because of how the keyboard input software of the operating system is designed.
In the case of the above puts, the o still exists in the string. What happens is that, when displayed, the backspace causes the o to be overwritten by the following space. This occurs because the o is displayed first, followed by the backspace (the output cursor is backed up one character position), followed by the space, in sequence.
If you could have such a case where:
puts "Hello<del> World!"
would display HelloWorld!, then that would mean the output of the value of <del> would somehow cause the following output of the space ( ) to not occur. In other words, the <del> would have the function of "whatever the next character is that comes for the output, skip it". I don't believe such a control character exists in Windows or Linux output, although I suppose it would be possible to write an output driver that would have that behavior for some defined control character.
You might even be able to do something like this:
"Hello W<left-arrow><left-arrow><del><right-arrow>orld!"
Which would display HelloWorld if your terminal is set up to accept control characters that move the cursor left or right. But it still obviously isn't the same functionality as the "delete in the future" case.
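On terminals that implement the ANSI/ECMA-48 control sequences, the placeholders above roughly correspond to "cursor back" (CSI n D), "delete character" (CSI n P) and "cursor forward" (CSI n C). The following Python sketch builds such a string; the helper names are hypothetical, and whether it displays as intended depends entirely on the terminal.

```python
CSI = "\x1b["   # Control Sequence Introducer (ESC followed by '[')


def cursor_left(n: int) -> str:
    return f"{CSI}{n}D"   # CUB: move the cursor back n columns


def cursor_right(n: int) -> str:
    return f"{CSI}{n}C"   # CUF: move the cursor forward n columns


def delete_chars(n: int) -> str:
    return f"{CSI}{n}P"   # DCH: delete n characters at the cursor


# "Hello W", step back onto the space, delete it, step past the W, continue.
line = "Hello W" + cursor_left(2) + delete_chars(1) + cursor_right(1) + "orld!"
print(line)
```

As with '\b', the escape sequences remain part of the string; only a terminal that interprets them shows the shortened text.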
I need to encode/convert a Unicode string to its escaped form, with backslashes. Anybody know how?
In Ruby 1.8.x, String#inspect may be what you are looking for, e.g.
>> multi_byte_str = "hello\330\271!"
=> "hello\330\271!"
>> multi_byte_str.inspect
=> "\"hello\\330\\271!\""
>> puts multi_byte_str.inspect
"hello\330\271!"
=> nil
In Ruby 1.9 if you want multi-byte characters to have their component bytes escaped, you might want to say something like:
>> multi_byte_str.bytes.to_a.map(&:chr).join.inspect
=> "\"hello\\xD8\\xB9!\""
In both Ruby 1.8 and 1.9 if you are instead interested in the (escaped) unicode code points, you could do this (though it escapes printable stuff too):
>> multi_byte_str.unpack('U*').map{ |i| "\\u" + i.to_s(16).rjust(4, '0') }.join
=> "\\u0068\\u0065\\u006c\\u006c\\u006f\\u0639\\u0021"
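For readers comparing languages, the same two kinds of escaping can be sketched in Python (an analogue, not Ruby). The example string is the same "hello" plus U+0639 used above; the second form pads to four hex digits and is only an illustration, since code points above U+FFFF would need a different escape.

```python
text = "hello\u0639!"

# Escape only the non-ASCII characters.
escaped = text.encode("unicode_escape").decode("ascii")
print(escaped)       # hello\u0639!

# Escape every character as \uXXXX, like the unpack('U*') example.
all_escaped = "".join(f"\\u{ord(ch):04x}" for ch in text)
print(all_escaped)   # \u0068\u0065\u006c\u006c\u006f\u0639\u0021
```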
To use a Unicode character in Ruby, use the "\uXXXX" escape, where XXXX is the Unicode code point written as four hexadecimal digits. See http://leejava.wordpress.com/2009/03/11/unicode-escape-in-ruby/
If you have Rails kicking around you can use the JSON encoder for this:
require 'active_support'
x = ActiveSupport::JSON.encode('µ')
# x is now "\u00b5"
The usual non-Rails JSON encoder doesn't "\u"-ify Unicode.
There are two components to your question as I understand it: Finding the numeric value of a character, and expressing such values as escape sequences in Ruby. Further, the former depends on what your starting point is.
Finding the value:
Method 1a: from Ruby with String#dump:
If you already have the character in a Ruby String object (or can easily get it into one), this may be as simple as displaying the string in the repl (depending on certain settings in your Ruby environment). If not, you can call the #dump method on it. For example, with a file called unicode.txt that contains some UTF-8 encoded data in it – say, the currency symbols €£¥$ (plus a trailing newline) – running the following code (executed either in irb or as a script):
s = File.read("unicode.txt", :encoding => "utf-8") # this may be enough, from irb
puts s.dump # this will definitely do it.
... should print out:
"\u20AC\u00A3\u00A5$\n"
Thus you can see that € is U+20AC, £ is U+00A3, and ¥ is U+00A5. ($ is not converted, since it's straight ASCII, though it's technically U+0024. The code below could be modified to give that information, if you actually need it. Or just add leading zeroes to the hex values from an ASCII table – or reference one that already does so.)
(Note: a previous answer suggested using #inspect instead of #dump. That sometimes works, but not always. For example, running ruby -E UTF-8 -e 'puts "\u{1F61E}".inspect' prints an unhappy face for me, rather than an escape sequence. Changing inspect to dump, though, gets me the escape sequence back.)
Method 1b: with Ruby using String#encode and rescue:
Now, if you're trying the above with a larger input file, the above may prove unwieldy – it may be hard to even find escape sequences in files with mostly ASCII text, or it may be hard to identify which sequences go with which characters. In such a case, one might replace the second line above with the following:
encodings = {} # hash to store mappings in
s.split("").each do |c| # loop through each "character"
  begin
    c.encode("ASCII") # try to encode it to ASCII
  rescue Encoding::UndefinedConversionError # but if that fails
    encodings[c] = $!.error_char.dump # capture a dump, mapped to the source character
  end
end
# And then print out all the captured non-ASCII characters:
encodings.each do |char, dumped|
  puts "#{char} encodes to #{dumped}."
end
With the same input as above, this would then print:
€ encodes to "\u20AC".
£ encodes to "\u00A3".
¥ encodes to "\u00A5".
Note that it's possible for this to be a bit misleading. If there are combining characters in the input, the output will print each component separately. For example, for input of 🙋🏾 ў ў, the output would be:
🙋 encodes to "\u{1F64B}".
🏾 encodes to "\u{1F3FE}".
ў encodes to "\u045E".
у encodes to "\u0443".
 ̆ encodes to "\u0306".
This is because 🙋🏾 is actually encoded as two code points: a base character (🙋 - U+1F64B), with a modifier (🏾, U+1F3FE; see also). Similarly with one of the letters: the first, ў, is a single pre-combined code point (U+045E), while the second, ў – though it looks the same – is formed by combining у (U+0443) with the modifier ̆ (U+0306 - which may or may not render properly, including on this page, since it's not meant to stand alone). So, depending on what you're doing, you may need to watch out for such things (which I leave as an exercise for the reader).
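The difference between the pre-combined and the combined form of that letter can be checked with Unicode normalization. This sketch uses Python's standard unicodedata module rather than Ruby, and the variable names are only illustrative.

```python
import unicodedata

precomposed = "\u045e"        # the letter as a single code point
decomposed = "\u0443\u0306"   # base letter followed by a combining breve

print(len(precomposed), len(decomposed))   # 1 2
print(precomposed == decomposed)           # False

# Normalizing to a common form makes the two spellings comparable.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
```

Normalizing input to one form before looping over its characters is one way to avoid the surprise described above.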
Method 2a: from web-based tools: specific characters:
Alternatively, if you have, say, an e-mail with a character in it, and you want to find the code point value to encode, if you simply do a web search for that character, you'll frequently find a variety of pages that give unicode details for the particular character. For example, if I do a google search for ✓, I get, among other things, a wiktionary entry, a wikipedia page, and a page on fileformat.info, which I find to be a useful site for getting details on specific unicode characters. And each of those pages lists the fact that that check mark is represented by unicode code point U+2713. (Incidentally, searching in that direction works well, too.)
Method 2b: from web-based tools: by name/concept:
Similarly, one can search for unicode symbols to match a particular concept. For example, I searched above for unicode check marks, and even on the Google snippet there was a listing of several code points with corresponding graphics, though I also find this list of several check mark symbols, and even a "list of useful symbols" which has a bunch of things, including various check marks.
This can similarly be done for accented characters, emoticons, etc. Just search for the word "unicode" along with whatever else you're looking for, and you'll tend to get results that include pages that list the code points. Which then brings us to putting that back into ruby:
Representing the value, once you have it:
The Ruby documentation for string literals describes two ways to represent unicode characters as escape sequences:
\unnnn Unicode character, where nnnn is exactly 4 hexadecimal digits ([0-9a-fA-F])
\u{nnnn ...} Unicode character(s), where each nnnn is 1-6 hexadecimal digits ([0-9a-fA-F])
So for code points with a 4-digit representation, e.g. U+2713 from above, you'd enter (within a string literal that's not in single quotes) this as \u2713. And for any unicode character (whether or not it fits in 4 digits), you can use braces ({ and }) around the full hex value for the code point, e.g. \u{1f60d} for 😍. This form can also be used to encode multiple code points in a single escape sequence, separating characters with whitespace. For example, \u{1F64B 1F3FE} would result in the base character 🙋 plus the modifier 🏾, thus ultimately yielding the abstract character 🙋🏾 (as seen above).
This works with shorter code points, too. For example, that currency character string from above (€£¥$) could be represented with \u{20AC A3 A5 24} – requiring only 2 digits for three of the characters.
You can directly use unicode characters if you just add #Encoding: UTF-8 to the top of your file. Then you can freely use ä, ǹ, ú and so on in your source code.
Try this gem. It converts Unicode or non-ASCII punctuation and symbols to the nearest ASCII punctuation and symbols.
https://github.com/qwuen/punctuate
example usage:
"100٪".punctuate
=> "100%"
the gem uses the reference in https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/designDoc/UDF/unicode/DefaultTables/symbolTable.html for the conversion.