I would like to display box-drawing characters (single or double line) in batch scripts that are intended to run in Windows CMD environments (XP, 7, 8 and ReactOS). These "box" symbols are specified in code page 1252.
From the script I set the necessary code page, 850 or 437, with the CHCP command:
chcp 437
For writing I use the ECHO command:
ECHO "char to display"
What file encoding should I use (ANSI, UTF-8, ...)?
Open a command prompt and run chcp (change code page) without any parameter. The Windows command processor outputs the code page that cmd.exe expects when interpreting a batch file, which depends on the country configured for the user account used to execute the batch file.
However, it is possible to use, for example, chcp 437 >nul to set code page 437 explicitly before the batch file outputs characters with the command echo. In this case, all characters in the batch file should be encoded using code page 437. Code page 437 is used by default in North American countries (Canada, USA) and is therefore supported by all fonts used by default for Windows console windows.
Another very common code page for the Windows console is code page 850, which is similar to code page 437 but has fewer box drawing characters. This code page is used by default in Western European countries. It is also supported by all fonts used by default for Windows console windows.
The two referenced Wikipedia pages about code pages 437 and 850 show the box drawing characters and their decimal and hexadecimal code values when encoded with one byte per character, i.e. using "ANSI" encoding. "ANSI" is not really a correct term here because code pages 437 and 850 are OEM code pages, which are not standardized by the American National Standards Institute (ANSI). But Microsoft used the term ANSI for all character encodings using just one byte per character.
The Wikipedia pages about code pages 437 and 850 also show the Unicode code point of each character, which matters if UTF-8 encoding is used for the batch file. But please be aware that some fonts used by default for the Windows console window, such as Terminal (a raster font, the default on Windows 7), do not support Unicode. For details see my answer on Using another language (code page) in a batch file made for others and the comments below the answer.
I recommend using "ANSI", or more precisely OEM, character encoding for a batch file whose echo command lines output box drawing characters encoded with code page 437.
The "ANSI" encoding used by default by Windows GUI text editors for countries in North America and Western Europe is Windows-1252. This is important to know if the text editor cannot display the batch file content by interpreting the bytes with code page 437. In that case it is necessary to enter the Windows-1252 characters whose code values result in the box drawing characters when interpreted with OEM code page 437.
Some editors like UltraEdit support displaying a one byte per character encoded text file with any code page as long as the configured font supports also this code page.
The font Terminal is definitely a good choice as the text editor font when writing a batch file that should output box drawing characters.
Example:
A batch file contains the following command lines, OEM encoded with code page 437:
@echo off
%SystemRoot%\System32\chcp.com 437 >nul
echo ┌───────────────┐
echo │ box drawing 1 │
echo └───────────────┘
echo/
echo ╔═══════════════╗
echo ║ box drawing 2 ║
echo ╚═══════════════╝
This batch file contains the following bytes (offset: hexadecimal bytes ; ASCII representation):
0000h: 40 65 63 68 6F 20 6F 66 66 0D 0A 25 53 79 73 74 ; @echo off..%Syst
0010h: 65 6D 52 6F 6F 74 25 5C 53 79 73 74 65 6D 33 32 ; emRoot%\System32
0020h: 5C 63 68 63 70 2E 63 6F 6D 20 34 33 37 20 3E 6E ; \chcp.com 437 >n
0030h: 75 6C 0D 0A 65 63 68 6F 20 DA C4 C4 C4 C4 C4 C4 ; ul..echo ÚÄÄÄÄÄÄ
0040h: C4 C4 C4 C4 C4 C4 C4 C4 C4 BF 0D 0A 65 63 68 6F ; ÄÄÄÄÄÄÄÄÄ¿..echo
0050h: 20 B3 20 62 6F 78 20 64 72 61 77 69 6E 67 20 31 ; ³ box drawing 1
0060h: 20 B3 0D 0A 65 63 68 6F 20 C0 C4 C4 C4 C4 C4 C4 ; ³..echo ÀÄÄÄÄÄÄ
0070h: C4 C4 C4 C4 C4 C4 C4 C4 C4 D9 0D 0A 65 63 68 6F ; ÄÄÄÄÄÄÄÄÄÙ..echo
0080h: 2F 0D 0A 65 63 68 6F 20 C9 CD CD CD CD CD CD CD ; /..echo ÉÍÍÍÍÍÍÍ
0090h: CD CD CD CD CD CD CD CD BB 0D 0A 65 63 68 6F 20 ; ÍÍÍÍÍÍÍÍ»..echo
00a0h: BA 20 62 6F 78 20 64 72 61 77 69 6E 67 20 32 20 ; º box drawing 2
00b0h: BA 0D 0A 65 63 68 6F 20 C8 CD CD CD CD CD CD CD ; º..echo ÈÍÍÍÍÍÍÍ
00c0h: CD CD CD CD CD CD CD CD BC 0D 0A ; ÍÍÍÍÍÍÍÍ¼..
The ASCII representation of the bytes uses code page Windows-1252. It shows how the same byte value can result in a different character being displayed just because a different code page is used to interpret an "ANSI" encoded text file.
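As a small illustration of that point, the following Python sketch decodes a few bytes taken from the first box line of the dump with both code pages. It only uses Python's built-in cp437 and cp1252 codecs; the variable names are made up for this example.

```python
# Bytes taken from the first box line of the hex dump above:
# DA = upper left corner, C4 = horizontal line, BF = upper right corner.
raw = bytes([0xDA, 0xC4, 0xC4, 0xBF])

# Interpretation with the OEM code page set by "chcp 437".
as_oem = raw.decode("cp437")

# Interpretation by an editor that assumes Windows-1252.
as_ansi = raw.decode("cp1252")

print(as_oem)   # box drawing characters
print(as_ansi)  # accented letters and an inverted question mark
```

The bytes are identical in both cases; only the table used to map them to characters differs.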
Encoded as UTF-8 without byte order mark, the same batch file would contain the following bytes:
0000h: 40 65 63 68 6F 20 6F 66 66 0D 0A 25 53 79 73 74
0010h: 65 6D 52 6F 6F 74 25 5C 53 79 73 74 65 6D 33 32
0020h: 5C 63 68 63 70 2E 63 6F 6D 20 34 33 37 20 3E 6E
0030h: 75 6C 0D 0A 65 63 68 6F 20 E2 94 8C E2 94 80 E2
0040h: 94 80 E2 94 80 E2 94 80 E2 94 80 E2 94 80 E2 94
0050h: 80 E2 94 80 E2 94 80 E2 94 80 E2 94 80 E2 94 80
0060h: E2 94 80 E2 94 80 E2 94 80 E2 94 90 0D 0A 65 63
0070h: 68 6F 20 E2 94 82 20 62 6F 78 20 64 72 61 77 69
0080h: 6E 67 20 31 20 E2 94 82 0D 0A 65 63 68 6F 20 E2
0090h: 94 94 E2 94 80 E2 94 80 E2 94 80 E2 94 80 E2 94
00a0h: 80 E2 94 80 E2 94 80 E2 94 80 E2 94 80 E2 94 80
00b0h: E2 94 80 E2 94 80 E2 94 80 E2 94 80 E2 94 80 E2
00c0h: 94 98 0D 0A 65 63 68 6F 2F 0D 0A 65 63 68 6F 20
00d0h: E2 95 94 E2 95 90 E2 95 90 E2 95 90 E2 95 90 E2
00e0h: 95 90 E2 95 90 E2 95 90 E2 95 90 E2 95 90 E2 95
00f0h: 90 E2 95 90 E2 95 90 E2 95 90 E2 95 90 E2 95 90
0100h: E2 95 97 0D 0A 65 63 68 6F 20 E2 95 91 20 62 6F
0110h: 78 20 64 72 61 77 69 6E 67 20 32 20 E2 95 91 0D
0120h: 0A 65 63 68 6F 20 E2 95 9A E2 95 90 E2 95 90 E2
0130h: 95 90 E2 95 90 E2 95 90 E2 95 90 E2 95 90 E2 95
0140h: 90 E2 95 90 E2 95 90 E2 95 90 E2 95 90 E2 95 90
0150h: E2 95 90 E2 95 90 E2 95 9D 0D 0A
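Both dumps can be reproduced with a short Python sketch, assuming the batch file text is exactly the nine lines of the example with CR+LF line endings. The names below are chosen for illustration only.

```python
CRLF = "\r\n"

# The nine lines of the example batch file.
lines = [
    "@echo off",
    r"%SystemRoot%\System32\chcp.com 437 >nul",
    "echo ┌" + "─" * 15 + "┐",
    "echo │ box drawing 1 │",
    "echo └" + "─" * 15 + "┘",
    "echo/",
    "echo ╔" + "═" * 15 + "╗",
    "echo ║ box drawing 2 ║",
    "echo ╚" + "═" * 15 + "╝",
]
text = CRLF.join(lines) + CRLF

oem = text.encode("cp437")   # one byte per character
utf8 = text.encode("utf-8")  # three bytes per box drawing character

print(len(oem), len(utf8))
print(oem[0x39:0x3A].hex(), utf8[0x39:0x3C].hex())
```

Writing `oem` to a file opened in binary mode (for example a hypothetical `boxes.bat`) gives the OEM encoded variant that the answer recommends.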
Note: Depending on the font used by your browser, the two boxes in the batch file code above may not be displayed as real closed boxes with the same width on all six lines. In a console window on Windows XP and Windows 7 they are displayed correctly with the default raster font as well as with Lucida Console, which is also available by default in the properties of a console window. Lucida Console supports many more characters than Terminal, but it is not the default font for console windows.
The text editor UltraEdit has an ASCII Table view for which the font Terminal can be set which is an OEM font. This makes it very easy to enter the box drawing characters which are displayed in ASCII Table view with font Terminal and which can be inserted into the batch file with double clicking on these characters in the view.
What file encoding should i use (ANSI, UTF8,..)?
AFAIK it doesn't make a difference what file encoding you use in this case.
The character set used by your editor makes all the difference.
I'm using Notepad++ and need to set "Encoding -> Character sets -> Western European -> OEM 850".
Do I need to chcp in my batch file or cmd console?
If you don't need all 40 box-drawing characters and can make do with just 22 it's usually not necessary (I'm not sure if and how CMD's charset is affected by eg. Cyrillic, Japanese or Chinese Windows versions/settings).
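The counts of 40 and 22 can be checked with a small Python sketch. It assumes that "box-drawing character" means a character in the Unicode Box Drawing block (U+2500 to U+257F) and relies on Python's built-in cp437 and cp850 codecs; the helper name is hypothetical.

```python
def box_chars(codec):
    """Return the box drawing characters available in a one-byte code page."""
    # Decode the upper half of the code page (bytes 0x80 to 0xFF).
    chars = bytes(range(0x80, 0x100)).decode(codec)
    # Keep only characters from the Unicode Box Drawing block.
    return [c for c in chars if "\u2500" <= c <= "\u257f"]

in_437 = box_chars("cp437")
in_850 = box_chars("cp850")

print(len(in_437), len(in_850))
# Characters that need code page 437 and are missing in code page 850.
print("".join(c for c in in_437 if c not in in_850))
```

The second line of output lists the mixed single/double line characters that code page 850 replaced with accented letters and other symbols.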
I've got some code to create labels in Gmail, which usually works fine. But now the requirement is to create a label with Japanese characters, specifically "アーカイブ". I am encoding the json like this:
7B 0D 0A 22 6E 61 6D 65 22 3A 22 E3 82 A2 E3 83 {.."name":".....
BC E3 82 AB E3 82 A4 E3 83 96 22 2C 0D 0A 22 6D ..........",.."m
65 73 73 61 67 65 4C 69 73 74 56 69 73 69 62 69 essageListVisibi
6C 69 74 79 22 3A 22 73 68 6F 77 22 2C 0D 0A 22 lity":"show",.."
6C 61 62 65 6C 4C 69 73 74 56 69 73 69 62 69 6C labelListVisibil
69 74 79 22 3A 22 6C 61 62 65 6C 53 68 6F 77 22 ity":"labelShow"
0D 0A 7D 0D 0A 00 00 00 00 00 00 00 00 00 00 00 ..}.............
As you can see, the first character is the UTF8 sequence E3 82 A2, which if you look at this table (https://www.utf8-chartable.de/unicode-utf8-table.pl?start=12352&names=-) seems to be correct for that first character. The others look OK also.
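The byte sequence in the dump can be double-checked without a lookup table. This is a minimal Python sketch that only encodes the label name from the question; the variable names are illustrative.

```python
label = "アーカイブ"

# Encode the label name the same way the JSON body should be encoded.
encoded = label.encode("utf-8")

# Print the bytes in the same style as the hex dump in the question.
hex_bytes = encoded.hex(" ").upper()
print(hex_bytes)
```

The printed bytes match the ones between the quotes after "name": in the dump, so the encoding itself is not the problem.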
As a test, I created a Japanese folder with that name in the UI, then got a dump of the json that Gmail produces when I get a list of existing folders. What Gmail produces is exactly the same as what I'm trying to import. So I don't see what I could be doing wrong here. Any help appreciated.
Never mind this - turns out my Japanese characters translate to "Archive" which is apparently a reserved folder name.
I have a csv file with text and numbers.
Numbers of 1000 or more are formatted like this: 1 000,
so there appears to be a space as the thousands separator, but it is not a real space. I tried to remove it with sed, which worked where there was a real space, but not in this format.
It is also not TAB, I removed all the TABs with "expand -t 1".
The following is a line that demonstrates the issue:
x17_Provident_GDN_REMARKETING_provident.hu_listák;Display_Hálózat;Szeged;2021-03-09;Kedd;Mobil;HUF;1 736;9;130.83;0.00
The problem is the value in column 8: 1 736.
And running this: grep -E -m 1 -e '[;]1[^;]+736[;]' <yourfile.csv | hexdump -C
gives:
00000000 78 31 37 5f 50 72 6f 76 69 64 65 6e 74 5f 47 44 |x17_Provident_GD|
00000010 4e 5f 52 45 4d 41 52 4b 45 54 49 4e 47 5f 70 72 |N_REMARKETING_pr|
00000020 6f 76 69 64 65 6e 74 2e 68 75 5f 6c 69 73 74 c3 |ovident.hu_list.|
00000030 a1 6b 3b 44 69 73 70 6c 61 79 5f 48 c3 a1 6c c3 |.k;Display_H..l.|
00000040 b3 7a 61 74 3b 53 7a 65 67 65 64 3b 32 30 32 31 |.zat;Szeged;2021|
00000050 2d 30 33 2d 30 39 3b 4b 65 64 64 3b 4d 6f 62 69 |-03-09;Kedd;Mobi|
00000060 6c 3b 48 55 46 3b 31 c2 a0 37 33 36 3b 39 3b 31 |l;HUF;1..736;9;1|
00000070 33 30 2e 38 33 3b 30 2e 30 30 0a |30.83;0.00.|
0000007b
It's a 2 byte, UTF-8 encoded non breaking space - c2 a0.
You can use perl to safely remove it.
perl -pe 's/\xc2\xa0//g' dirty.csv > clean.csv
Once we knew it was a no-break space, I simply removed it with sed on a Mac. The character between the slashes below is a no-break space, typed with opt+space:
cat test4.csv | sed 's/ //g'
Similar to perl, you can use GNU sed with LC_ALL=C:
LC_ALL=C sed 's/\xc2\xa0//g'
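The same cleanup can be sketched in Python, assuming the file is UTF-8 encoded as the hexdump suggests. The sample line is the one from the question, with the no-break space written as an explicit escape so it cannot be confused with a regular space.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE, encoded in UTF-8 as the bytes c2 a0

line = ("x17_Provident_GDN_REMARKETING_provident.hu_listák;Display_Hálózat;"
        "Szeged;2021-03-09;Kedd;Mobil;HUF;1\u00a0736;9;130.83;0.00")

# Remove every no-break space; regular spaces are left untouched.
clean = line.replace(NBSP, "")

fields = clean.split(";")
print(fields[7])  # the former "1 736" column
```

For a whole file, the same `replace` call can be applied to each line read with `open(path, encoding="utf-8")`, where `path` is whatever the CSV file is called.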
I send the command DISPLAY TEXT to phone by using SMS-SUBMIT message.
Why don't I get any text on my phone's screen?
details:
My SMS-SUBMIT message
according to ETSI TS 23.048, I don't use any security capabilities (no KIC, KID, RC/CC/DS)
My DISPLAY TEXT command
D0 3F ; This is a proactive command of length 0x3f
81 03 01 21 81 ; The command is DISPLAY TEXT, high priority, wait for user
82 02 81 02 ; It was sent from the SIM to the display
8D 34 04 ; Encoding is 8-bit default SMS (ASCII), message:
48 65 6C 6C 6F 20 77 6F 72 6C 64 21 20 49 20 61 6D 20
; "Hello world! I am "
61 6E 20 61 6C 74 65 72 6E 61 74 69 76 65 20 53 49 4D
; "an alternative SIM"
20 54 6F 6F 6C 6B 69 74 20 73 74 61 63 6B 2E
; " Toolkit stack."
I merge the SMS-SUBMIT and DISPLAY TEXT and send them to the SIM by SMS.
But I don't get any text on the screen of the mobile phone.
Why?
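Before looking at the SMS transport, it can help to confirm that the DISPLAY TEXT command listed above is internally consistent. The following Python sketch splits it into its TLV objects and checks the lengths. The parser is a simplified assumption that only handles one-byte lengths and the 0x81-prefixed form, and it says nothing about why the phone stays silent.

```python
def parse_tlvs(data):
    """Split bytes into (tag, value) pairs (simplified BER-TLV lengths)."""
    pos = 0
    result = []
    while pos < len(data):
        tag = data[pos]
        length = data[pos + 1]
        pos += 2
        if length == 0x81:  # length is carried in the following byte
            length = data[pos]
            pos += 1
        result.append((tag, data[pos:pos + length]))
        pos += length
    return result

message = b"Hello world! I am an alternative SIM Toolkit stack."
command = bytes.fromhex("D0 3F 81 03 01 21 81 82 02 81 02 8D 34 04") + message

# Outer object: proactive command tag 0xD0 with 0x3F bytes of content.
(outer_tag, body), = parse_tlvs(command)

# Inner objects: command details, device identities, text string.
inner = parse_tlvs(body)
for tag, value in inner:
    print(f"{tag:02X} len={len(value):2d} {value.hex(' ')}")
```

All declared lengths match the actual content here, so the command body itself appears well formed.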
I have a 240MB logfile from a PuTTY session. This was mistakenly logged in the "SSH packets and raw data" format instead of "All session output". If I open the file in a text editor, I can see the data I require (the plain text).
The problem is extracting that from the raw data.
For example:
Incoming raw data at 2016-01-06 15:47:42
00000000 e8 fd c2 d2 88 a9 39 b9 2a 77 2a 7b 4a 60 fc 21 ......9.*w*{J`.!
00000010 1d f5 fc d4 b1 58 1f 4d 68 a4 ef 83 03 39 59 b7 .....X.Mh....9Y.
00000020 41 be 36 7b b5 3c 10 fa 65 27 77 30 77 97 02 39 A.6{.<..e'w0w..9
00000030 46 4c 28 da 5c c6 2c 1e ae 33 db e1 a8 09 ea 4a FL(.\.,..3.....J
00000040 06 94 c6 eb 38 8e d3 d3 33 13 78 08 7c 5f 41 56 ....8...3.x.|_AV
00000050 f1 13 9e e1 ....
Incoming packet #0x31, type 94 / 0x5e (SSH2_MSG_CHANNEL_DATA)
00000000 00 00 01 00 00 00 00 20 64 69 73 61 62 6c 69 6e ....... disablin
00000010 67 20 61 20 72 75 6e 6e 69 6e 67 20 77 61 74 63 g a running watc
00000020 68 64 6f 67 2e 2e 0d 0a hdog....
Incoming raw data at 2016-01-06 15:47:42
00000000 dc 96 f3 54 f8 a8 5c 83 80 7b a8 07 da 79 95 50 ...T..\..{...y.P
00000010 3f 19 2f 0c f0 03 a1 01 a3 33 2f 97 75 9d 47 15 ?./......3/.u.G.
00000020 b9 95 df c6 66 e0 50 32 88 1e db 5b 73 1b 7b ad ....f.P2...[s.{.
I think what I need to do is read only the sections of the file labelled "Incoming packet". Then I can read the ascii character codes and convert to readable text (this will recover the tabs, linefeeds and carriage returns).
I'm not familiar with awk or sed, but I know a bit of grep. How can I go about firstly extracting the sections (of variable size) that I need to translate from ASCII codes to text?
sed -n '/^Incoming packet/,/^Incoming raw data/{//!p}'
This will print lines between the matches Incoming packet and Incoming raw. Process this output further to get your desired output.
Print only ASCII characters (print last 17 characters) from the matching line:
sed -n '/Incoming packet/,/Incoming raw data/{//!{s/^.*\(.\{17\}\)/\1/;p}}'
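The ASCII column replaces control characters with dots, so tabs and line endings cannot be recovered from it. As an alternative, here is a hedged Python sketch that rebuilds the bytes from the hex column instead. It assumes the log looks like the excerpt in the question, and that each SSH2_MSG_CHANNEL_DATA dump starts with a 4-byte channel number and a 4-byte data length, which is consistent with the sample (00 00 00 20 followed by 32 bytes of data). All function names are made up.

```python
import re
import struct

HEX_PAIR = re.compile(r"[0-9a-f]{2}")
DUMP_LINE = re.compile(r"^[0-9a-f]{8}\s")

def dump_bytes(line):
    """Return the bytes of one hex dump line (at most 16 per line)."""
    out = bytearray()
    for token in line.split()[1:17]:
        if not HEX_PAIR.fullmatch(token):
            break  # reached the ASCII column
        out.append(int(token, 16))
    return bytes(out)

def payload(packet):
    """Strip channel number and length field from a CHANNEL_DATA dump."""
    if len(packet) < 8:
        return b""
    (length,) = struct.unpack(">I", packet[4:8])
    return packet[8:8 + length]

def extract_text(lines):
    """Concatenate the data of all incoming SSH2_MSG_CHANNEL_DATA packets."""
    chunks, current = [], None
    for line in lines:
        if not line.strip():
            continue
        if DUMP_LINE.match(line):
            if current is not None:
                current += dump_bytes(line)
            continue
        if current is not None:  # a header line ends the previous packet
            chunks.append(payload(bytes(current)))
        wanted = line.startswith("Incoming packet") and "SSH2_MSG_CHANNEL_DATA" in line
        current = bytearray() if wanted else None
    if current is not None:
        chunks.append(payload(bytes(current)))
    return b"".join(chunks)

sample = [
    "Incoming packet #0x31, type 94 / 0x5e (SSH2_MSG_CHANNEL_DATA)",
    "00000000 00 00 01 00 00 00 00 20 64 69 73 61 62 6c 69 6e ....... disablin",
    "00000010 67 20 61 20 72 75 6e 6e 69 6e 67 20 77 61 74 63 g a running watc",
    "00000020 68 64 6f 67 2e 2e 0d 0a hdog....",
    "Incoming raw data at 2016-01-06 15:47:42",
    "00000000 dc 96 f3 54 f8 a8 5c 83 80 7b a8 07 da 79 95 50 ...T..\\..{...y.P",
]
text = extract_text(sample)
print(text)
```

For the real log, the lines could be read from the file object directly, for example `extract_text(open("putty.log", errors="replace"))` with a hypothetical file name, which avoids loading all 240MB as a list.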
The following is a bash file I wrote to convert all C++ style(//) comments in a C file to C style(/**/).
#!/bin/bash
lang=`echo $LANG`
# It's necessary to change the local setting. I don't know why.
export LANG=C
# Can comment the following statement if there is not dos2unix command.
dos2unix -q $1
sed -i -e 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' $1
export LANG=$lang
It works. But I found a problem I cannot explain. By default, my locale setting is en_US.UTF-8. And in my C code, there are comments written in Chinese, such as
// some english 一些中文注释
If I don't change the locale setting, i.e., do not run the statement export LANG=C, I'll get
/* some english */一些中文注释
instead of
/* some english 一些中文注释*/
I don't know why. I just found a solution by trial and error.
After reading Jonathan Leffler's answer, I think I made a mistake that led to a misunderstanding. In the question, those Chinese words were typed into Google Chrome and were not the actual words in my C file. 一些中文注释 just means "some Chinese comments".
Now I typed // some english 一些中文注释 in Visual C++ 6.0 on Windows XP and copied the C file to Debian. Then I just ran sed -i -e 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' $1 and got
/* some english 一些 */中文注释
I think a different character encoding (GB18030, GBK, UTF-8?) causes the different results.
The following are my results on Debian:
~/sandbox$ uname -a
Linux xyt-dev 2.6.30-1-686 #1 SMP Sat Aug 15 19:11:58 UTC 2009 i686 GNU/Linux
~/sandbox$ echo $LANG
en_US.UTF-8
~/sandbox$ cat tt.c | od -c -t x1
0000000 / / s o m e e n g l i s h
2f 2f 20 73 6f 6d 65 20 65 6e 67 6c 69 73 68 20
0000020 322 273 320 251 326 320 316 304 327 242 312 315
d2 bb d0 a9 d6 d0 ce c4 d7 a2 ca cd
0000034
~/sandbox$ ./convert_comment_style_cpp2c.sh tt.c
~/sandbox$ cat tt.c | od -c -t x1
0000000 / * s o m e e n g l i s h
2f 2a 20 20 73 6f 6d 65 20 65 6e 67 6c 69 73 68
0000020 322 273 320 251 * / 326 320 316 304 327 242 312 315
20 d2 bb d0 a9 20 2a 2f d6 d0 ce c4 d7 a2 ca cd
0000040
~/sandbox$
I think these Chinese characters are encoded with 2 bytes each (Unicode).
Here is another example:
~/sandbox$ cat tt.c | od -c -t x1
0000000 / / I n W i n d o w : 250 250 ?
2f 2f 20 49 6e 57 69 6e 64 6f 77 3a 20 a8 a8 3f
0000020 1 ?
31 3f
0000022
~/sandbox$ ./convert_comment_style_cpp2c.sh tt.c
~/sandbox$ cat tt.c | od -c -t x1
0000000 / * I n W i n d o w : *
2f 2a 20 20 49 6e 57 69 6e 64 6f 77 3a 20 20 2a
0000020 / 250 250 ? 1 ?
2f a8 a8 3f 31 3f
Which platform are you working on? Your sed script works fine on MacOS X without changing locale. The Linux terminal was less happy with the Chinese characters, but it is not set up to use UTF-8. Moreover, a hex dump of the string that it did get contained a zero byte 0x00 where the Chinese started, which might lead to the confusion. (I note that your regex adds a space before the comment text if it starts // with a space.)
MacOS X (10.6.8)
The 'odx' command used here is a hex-dump program.
$ echo "// some english 一些中文注释" > x3.utf8
$ odx x3.utf8
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20 // some english
0x0010: E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 E9 ................
0x0020: 87 8A 0A ...
0x0023:
$ utf8-unicode x3.utf8
0x2F = U+002F
0x2F = U+002F
0x20 = U+0020
0x73 = U+0073
0x6F = U+006F
0x6D = U+006D
0x65 = U+0065
0x20 = U+0020
0x65 = U+0065
0x6E = U+006E
0x67 = U+0067
0x6C = U+006C
0x69 = U+0069
0x73 = U+0073
0x68 = U+0068
0x20 = U+0020
0xE4 0xB8 0x80 = U+4E00
0xE4 0xBA 0x9B = U+4E9B
0xE4 0xB8 0xAD = U+4E2D
0xE6 0x96 0x87 = U+6587
0xE6 0xB3 0xA8 = U+6CE8
0xE9 0x87 0x8A = U+91CA
0x0A = U+000A
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8
/* some english 一些中文注释 */
$
All of which looks clean and tidy.
Linux (RHEL 5)
I copied the x3.utf8 file to a Linux box, and dumped it. Then I ran the sed script on it, and all seemed OK:
$ odx x3.utf8
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20 // some english
0x0010: E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 E9 ................
0x0020: 87 8A 0A ...
0x0023:
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8 | odx
0x0000: 2F 2A 20 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 /* some english
0x0010: 20 E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 ...............
0x0020: E9 87 8A 20 2A 2F 0A ... */.
0x0027:
$
So far, so good. I also tried:
$ echo $LANG
en_US.UTF-8
$ echo $LC_CTYPE
$ env | grep LC_
$ bash --version
GNU bash, version 3.2.25(1)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2005 Free Software Foundation, Inc.
$ cat x3.utf8
// some english 一些中文注释
$ echo $(<x3.utf8)
// some english 一些中文注释
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8
/* some english 一些中文注释 */
$
So, the terminal is nominally working in UTF-8 after all, and it certainly seems to display the data OK.
However, if I echo the string at the terminal, it gets into a tizzy. When I cut'n'pasted the string to the Linux terminal, it said:
$ echo "// some english d8d^G:
> "
// some english d8d:
$
and beeped.
$ echo "// some english d8d^G:
> " | odx
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20 // some english
0x0010: 64 38 64 07 3A 0A 0A d8d.:..
0x0017:
$
I'm not quite sure what to make of that. I think it means that something in the input side of bash is having some problems, but I'm not quite sure. I also am getting slightly inconsistent results. The first time I tried it, I got:
$ cat > xxx
's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;'
// some english d8^#d:^[d8-f^Gf3(i^G
$ odx xxx
0x0000: 27 73 3B 5E 5C 28 5B 5B 3A 62 6C 61 6E 6B 3A 5D 's;^\([[:blank:]
0x0010: 5D 2A 5C 29 2F 2F 5C 28 2E 2A 5C 29 3B 5C 31 2F ]*\)//\(.*\);\1/
0x0020: 2A 20 5C 32 20 2A 2F 3B 27 0A 2F 2F 20 73 6F 6D * \2 */;'.// som
0x0030: 65 20 65 6E 67 6C 69 73 68 20 64 38 00 64 3A 1B e english d8.d:.
0x0040: 64 38 2D 66 07 66 33 28 69 07 0A 0A d8-f.f3(i...
0x004C:
$
And in that hex dump, you can see a 0x00 byte (offset 0x003C). That appears at the position where you got the end comment, and a null there could confuse sed; but the whole input is such a mess it is hard to know what to make of it.
Okay, here's the correct answer...
The GNU regular expression library (regex) doesn't match everything when you put a . in your expression. Yup, I know how braindead that sounds.
The problem comes from the word "character". Reasonable people will say that everything in the input file for sed is characters, and even in your case they are perfectly correct. But regex has been programmed to require that the input be perfectly correctly formatted characters of the current locale's character set (UTF-8). If they are correctly formatted characters for a Windows character set instead (the od dump in the question is consistent with GBK), they are not "characters".
So as . only matches "characters" it doesn't match your characters.
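This can be made concrete with the bytes from the od output in the question. The Python sketch below assumes those bytes are GBK (they decode cleanly as such) and asks a strict UTF-8 decoder where the first invalid byte is. The variable names are illustrative.

```python
# The twelve bytes after "// some english " in the od dump of tt.c.
comment = bytes.fromhex("d2bb d0a9 d6d0 cec4 d7a2 cacd")

# Interpreted as GBK, they are the six Chinese characters.
as_gbk = comment.decode("gbk")
print(as_gbk)

# Interpreted as UTF-8, decoding fails part-way through.
first_bad = None
try:
    comment.decode("utf-8")
except UnicodeDecodeError as err:
    first_bad = err.start

print(first_bad)
print(comment[:first_bad].decode("gbk"))
```

The first four bytes happen to form two valid two-byte UTF-8 sequences, and the fifth byte is where decoding fails. That is the same position at which sed placed the closing */ in the Debian output, after the first two Chinese characters.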
If you used the regex //.*$, ie: pinned it to the end of the line it wouldn't match at all because there's something that's not a "character" between the // and the end of the line.
And no you can't do anything like //\(.\|[^.]\)*$, it's just impossible to match those characters without switching to the C locale.
This will also, sometimes, destroy 8-bit transparency; ie: a binary piped through sed will get corrupted even if no changes are made.
Fortunately the C locale still uses the reasonable interpretation so anything that's not a perfectly correctly formatted ASCII-68 character is still a "character".
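A byte-oriented substitution behaves the same way as sed in the C locale. As a sketch of that idea in Python, the regular expression below is compiled from a bytes pattern, so `.` matches any byte except a newline regardless of encoding. This is an illustration, not a replacement for the script in the question, and the function name is hypothetical.

```python
import re

# Same pattern as the sed expression, written for bytes instead of text.
CPP_COMMENT = re.compile(rb"^([ \t]*)//(.*)$")

def convert(line: bytes) -> bytes:
    """Turn a line consisting of a // comment into a /* */ comment."""
    return CPP_COMMENT.sub(rb"\1/* \2 */", line)

# The GBK encoded comment from the od dump in the question.
gbk_text = bytes.fromhex("d2bbd0a9d6d0cec4d7a2cacd")
source = b"// some english " + gbk_text

converted = convert(source)
print(converted)
```

Because no decoding takes place, the bytes of the comment pass through unchanged, whatever their encoding is.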