Nokogiri: replace newlines and spaces with `<br>` and `&nbsp;` inside `<pre>`

A downstream sink that I don't control, when fed HTML, eagerly collapses every run of spaces and newlines into a single space. So I need to 'escape' those characters inside all <pre> tags.
I've managed to 'protect' newlines by replacing them with <br>, but the spaces have me stumped:
require 'nokogiri'
doc = Nokogiri::HTML.fragment "<pre><code>  1\n\n  2</code></pre>"
doc.css('pre').each do |node|
  text = node.to_s.gsub(/\n/, '<br>').gsub(/\s/) { '&nbsp;' }
  node.replace Nokogiri::HTML.fragment text
end
puts doc
still yields:
<pre><code>  1<br><br>  2</code></pre>
(instead of the expected <pre><code>&nbsp;&nbsp;1<br><br>&nbsp;&nbsp;2</code></pre>)
i.e., Nokogiri turns the entities back into spaces!
Now, here's what's also interesting: those are not actual spaces, but (I gather) UTF-8 non-breaking space characters:
$ ruby 1.rb | hexdump -C
00000000 3c 70 72 65 3e 3c 63 6f 64 65 3e c2 a0 c2 a0 31 |<pre><code>....1|
00000010 3c 62 72 3e 3c 62 72 3e c2 a0 c2 a0 32 3c 2f 63 |<br><br>....2</c|
00000020 6f 64 65 3e 3c 2f 70 72 65 3e 0a |ode></pre>.|
0000002b
Notice c2 a0 c2 a0 31 (<nbsp><nbsp>1) instead of the more common 20 20 31.
I don't think it's prudent to blindly force a non-UTF-8 encoding on my output, so playing with various encodings isn't an option.
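The behaviour described above can be reproduced without Nokogiri. The sketch below uses only Python's standard library (not the asker's Ruby setup) to show that the entity and the bytes c2 a0 are the same character in two spellings. The helper protect_whitespace is hypothetical, and it assumes that post-processing the already-serialized string is acceptable for the downstream sink.

```python
import html

NBSP = "\u00a0"  # NO-BREAK SPACE, the character that &nbsp; names

# An HTML parser decodes the entity into the character itself.
decoded = html.unescape("&nbsp;&nbsp;1")
print(decoded == NBSP * 2 + "1")          # True

# A UTF-8 serializer then writes that character as the bytes C2 A0.
print(decoded.encode("utf-8").hex(" "))   # c2 a0 c2 a0 31


def protect_whitespace(serialized):
    """Hypothetical helper: turn NBSP characters in serialized HTML back into entities."""
    return serialized.replace(NBSP, "&nbsp;")


print(protect_whitespace("<pre><code>" + NBSP * 2 + "1</code></pre>"))
```

If the sink only collapses ASCII whitespace, the raw c2 a0 bytes may already survive it, in which case no re-escaping is needed; that depends on the sink and is not something the question settles.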

Related

Box drawing characters in batch scripts (Windows CMD)

I would like to display box drawing characters (single or double line) in batch scripts that are meant to run in Windows CMD environments (XP, 7, 8 and ReactOS). These symbols for "boxes" are specified in code page 1252.
From the script I am setting the necessary code page, 850 or 437, with the CHCP command.
chcp 437
For writing I am using the ECHO command
ECHO "char to display"
What file encoding should I use (ANSI, UTF-8, ...)?

Open a command prompt and run chcp (change code page) without any parameter. The Windows command processor then outputs the code page that cmd.exe expects when interpreting a batch file, which depends on the country configured for the user account executing the batch file.
However, it is possible to use for example chcp 437 >nul to explicitly set code page 437 before the batch file outputs characters with the echo command. In this case all characters in the batch file should be encoded using code page 437. Code page 437 is used by default in North American countries (Canada, USA) and is for that reason supported by all fonts used by default for Windows console windows.
Another very common code page for the Windows console is code page 850, which is similar to code page 437 but has fewer box drawing characters. This code page is used by default in Western European countries. It is also supported by all fonts used by default for Windows console windows.
The Wikipedia pages about code pages 437 and 850 show the box drawing characters and their decimal and hexadecimal code values when encoded with one byte per character, i.e. using "ANSI" encoding. "ANSI" is not really a correct term here because code pages 437 and 850 are OEM code pages, which are not standardized by the American National Standards Institute (ANSI). But Microsoft used the term ANSI for all character encodings using just one byte per character.
The Wikipedia pages about code pages 437 and 850 also show the Unicode code point to use if the batch file is UTF-8 encoded. But please be aware that some fonts used by default for Windows console windows, like Terminal (the raster font used by default on Windows 7), do not support UTF-8 encoding. For details see my answer on Using another language (code page) in a batch file made for others and the comments below the answer.
I recommend using "ANSI", or more precisely OEM, character encoding for a batch file whose echo command lines output box drawing characters encoded with code page 437.
The "ANSI" encoding used by default by Windows GUI text editors for countries in North America and Western Europe is Windows-1252. This is important to know if the text editor cannot display the batch file content by interpreting the bytes with code page 437. In that case it is necessary to enter the Windows-1252 characters whose code values result in the box drawing characters when interpreted with OEM code page 437.
Some editors like UltraEdit support displaying a one byte per character encoded text file with any code page as long as the configured font also supports this code page.
The font Terminal is definitely a good choice as the text editor font when writing a batch file that should output box drawing characters.
Example:
A batch file contains the following command lines, OEM encoded with code page 437:
@echo off
%SystemRoot%\System32\chcp.com 437 >nul
echo ┌───────────────┐
echo │ box drawing 1 │
echo └───────────────┘
echo/
echo ╔═══════════════╗
echo ║ box drawing 2 ║
echo ╚═══════════════╝
This batch file contains the following bytes (offset: hexadecimal bytes ; ASCII representation):
0000h: 40 65 63 68 6F 20 6F 66 66 0D 0A 25 53 79 73 74 ; @echo off..%Syst
0010h: 65 6D 52 6F 6F 74 25 5C 53 79 73 74 65 6D 33 32 ; emRoot%\System32
0020h: 5C 63 68 63 70 2E 63 6F 6D 20 34 33 37 20 3E 6E ; \chcp.com 437 >n
0030h: 75 6C 0D 0A 65 63 68 6F 20 DA C4 C4 C4 C4 C4 C4 ; ul..echo ÚÄÄÄÄÄÄ
0040h: C4 C4 C4 C4 C4 C4 C4 C4 C4 BF 0D 0A 65 63 68 6F ; ÄÄÄÄÄÄÄÄÄ¿..echo
0050h: 20 B3 20 62 6F 78 20 64 72 61 77 69 6E 67 20 31 ; ³ box drawing 1
0060h: 20 B3 0D 0A 65 63 68 6F 20 C0 C4 C4 C4 C4 C4 C4 ; ³..echo ÀÄÄÄÄÄÄ
0070h: C4 C4 C4 C4 C4 C4 C4 C4 C4 D9 0D 0A 65 63 68 6F ; ÄÄÄÄÄÄÄÄÄÙ..echo
0080h: 2F 0D 0A 65 63 68 6F 20 C9 CD CD CD CD CD CD CD ; /..echo ÉÍÍÍÍÍÍÍ
0090h: CD CD CD CD CD CD CD CD BB 0D 0A 65 63 68 6F 20 ; ÍÍÍÍÍÍÍÍ»..echo
00a0h: BA 20 62 6F 78 20 64 72 61 77 69 6E 67 20 32 20 ; º box drawing 2
00b0h: BA 0D 0A 65 63 68 6F 20 C8 CD CD CD CD CD CD CD ; º..echo ÈÍÍÍÍÍÍÍ
00c0h: CD CD CD CD CD CD CD CD BC 0D 0A ; ÍÍÍÍÍÍÍÍ¼..
The ASCII representation of the bytes uses code page Windows-1252. It shows how the same byte value can result in a different character being displayed, purely because a different code page is used to interpret the "ANSI" encoded text file.
Encoded in UTF-8 without a byte order mark, the same batch file would contain the following bytes:
0000h: 40 65 63 68 6F 20 6F 66 66 0D 0A 25 53 79 73 74
0010h: 65 6D 52 6F 6F 74 25 5C 53 79 73 74 65 6D 33 32
0020h: 5C 63 68 63 70 2E 63 6F 6D 20 34 33 37 20 3E 6E
0030h: 75 6C 0D 0A 65 63 68 6F 20 E2 94 8C E2 94 80 E2
0040h: 94 80 E2 94 80 E2 94 80 E2 94 80 E2 94 80 E2 94
0050h: 80 E2 94 80 E2 94 80 E2 94 80 E2 94 80 E2 94 80
0060h: E2 94 80 E2 94 80 E2 94 80 E2 94 90 0D 0A 65 63
0070h: 68 6F 20 E2 94 82 20 62 6F 78 20 64 72 61 77 69
0080h: 6E 67 20 31 20 E2 94 82 0D 0A 65 63 68 6F 20 E2
0090h: 94 94 E2 94 80 E2 94 80 E2 94 80 E2 94 80 E2 94
00a0h: 80 E2 94 80 E2 94 80 E2 94 80 E2 94 80 E2 94 80
00b0h: E2 94 80 E2 94 80 E2 94 80 E2 94 80 E2 94 80 E2
00c0h: 94 98 0D 0A 65 63 68 6F 2F 0D 0A 65 63 68 6F 20
00d0h: E2 95 94 E2 95 90 E2 95 90 E2 95 90 E2 95 90 E2
00e0h: 95 90 E2 95 90 E2 95 90 E2 95 90 E2 95 90 E2 95
00f0h: 90 E2 95 90 E2 95 90 E2 95 90 E2 95 90 E2 95 90
0100h: E2 95 97 0D 0A 65 63 68 6F 20 E2 95 91 20 62 6F
0110h: 78 20 64 72 61 77 69 6E 67 20 32 20 E2 95 91 0D
0120h: 0A 65 63 68 6F 20 E2 95 9A E2 95 90 E2 95 90 E2
0130h: 95 90 E2 95 90 E2 95 90 E2 95 90 E2 95 90 E2 95
0140h: 90 E2 95 90 E2 95 90 E2 95 90 E2 95 90 E2 95 90
0150h: E2 95 90 E2 95 90 E2 95 9D 0D 0A
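The relationship between the two dumps above can be checked with a few lines of Python. This is only an illustrative sketch: the shortened echo line is made up for the example, while cp437, cp1252 and utf-8 are codec names from Python's standard library.

```python
line = "echo ┌─┐ ╔═╗"   # shortened, made-up version of the echo lines above

oem = line.encode("cp437")             # OEM code page 437: one byte per character
print(oem.hex(" "))                    # 65 63 68 6f 20 da c4 bf 20 c9 cd bb

# The same bytes interpreted as Windows-1252, as in the ASCII column of the first dump.
print(oem.decode("cp1252"))            # echo ÚÄ¿ ÉÍ»

# UTF-8 needs three bytes for each box drawing character.
print("┌".encode("utf-8").hex(" "))    # e2 94 8c
print(len(oem), len(line.encode("utf-8")))  # 12 24
```

The bytes never change between the first two prints; only the code page used to interpret them does.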
Note: The font used by your browser to display the batch file code above could result in getting the two boxes not displayed as real closed boxes with same width on all six lines as it does in a Windows console window of Windows XP and Windows 7 with default raster font or with font Lucida Console which is by default also available in properties of a Windows console window. Lucida Console supports much more characters than Terminal, but it is not the default font for console windows.
The text editor UltraEdit has an ASCII Table view for which the font Terminal can be set which is an OEM font. This makes it very easy to enter the box drawing characters which are displayed in ASCII Table view with font Terminal and which can be inserted into the batch file with double clicking on these characters in the view.

What file encoding should i use (ANSI, UTF8,..)?
AFAIK it doesn't make a difference what file encoding you use in this case.
The character set used by your editor makes all the difference.
I'm using Notepad++ and need to set "Encoding -> Character sets -> Western European -> OEM 850".
Do I need to chcp in my batch file or cmd console?
If you don't need all 40 box-drawing characters and can make do with just 22 it's usually not necessary (I'm not sure if and how CMD's charset is affected by eg. Cyrillic, Japanese or Chinese Windows versions/settings).
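The 40 versus 22 figure can be verified by asking each code page which of its upper 128 byte values decode into the Unicode Box Drawing block (U+2500 to U+257F). A minimal sketch, assuming Python's bundled cp437 and cp850 codecs faithfully reflect the Windows code pages; box_drawing_chars is a name invented for this example.

```python
def box_drawing_chars(codec):
    """Return the box drawing characters (U+2500..U+257F) a single-byte codec contains."""
    chars = []
    for byte in range(0x80, 0x100):
        ch = bytes([byte]).decode(codec)
        if 0x2500 <= ord(ch) <= 0x257F:
            chars.append(ch)
    return chars


cp437 = box_drawing_chars("cp437")
cp850 = box_drawing_chars("cp850")
print(len(cp437), len(cp850))                 # 40 22
print("".join(cp850))                         # the set that works under both code pages
print("".join(sorted(set(cp437) - set(cp850))))  # mixed single/double pieces only in 437
```

Sticking to the 22 shared characters, which sit at the same byte values in both code pages, is what makes the chcp call optional on systems that default to either code page.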

Can't use unicode character as rune

It appears that golang doesn't support all unicode characters for its runes
package main
import "fmt"
func main() {
standardSuits := []rune{'♠️', '♣️', '♥️', '♦️'}
fmt.Println(standardSuits)
}
Generates the following error:
./main.go:6: missing '
./main.go:6: invalid identifier character U+FE0F '️'
./main.go:6: syntax error: unexpected ️, expecting comma or }
./main.go:6: missing '
./main.go:6: invalid identifier character U+FE0F '️'
./main.go:6: missing '
./main.go:6: invalid identifier character U+FE0F '️'
./main.go:6: missing '
./main.go:6: invalid identifier character U+FE0F '️'
./main.go:6: missing '
./main.go:6: too many errors
Is there a way to get around this, or should I just live with this limitation and use something else?

It looks to me like a parsing issue. You could use the Unicode code points to produce those runes, which should give the same result as using the characters.
package main
import "fmt"
func main() {
standardSuits := []rune{'\u2660', '\u2663', '\u2665', '\u2666', '⌘'}
fmt.Println(standardSuits)
}
Generates
[9824 9827 9829 9830 8984]
Playground link: https://play.golang.org/p/jTLsbs7DM1
I added a fifth rune to check whether a code point escape and a literal character give the same result. Looks like they do.
Edit:
Not sure what is wrong with your chars (did not view them in a hex editor, have none around), but something is strange about them.
I also got this to run by copy pasting the chars from Wikipedia:
package main
import "fmt"
func main() {
standardSuits := []rune{'♠', '♣', '♥', '♦'}
fmt.Println(standardSuits)
}
https://play.golang.org/p/CKR0u2_IIB

The Unicode string you use in your source code consists of more than one "character", but a character constant '...' is not allowed to contain more than one character. In more detail:
If I copy&paste your source code and print a hexdump, I can see the exact bytes in your source code:
>>> hexdump -C x.go
00000000 70 61 63 6b 61 67 65 20 6d 61 69 6e 0a 0a 69 6d |package main..im|
00000010 70 6f 72 74 20 22 66 6d 74 22 0a 0a 66 75 6e 63 |port "fmt"..func|
00000020 20 6d 61 69 6e 28 29 20 7b 0a 20 20 73 74 61 6e | main() {. stan|
00000030 64 61 72 64 53 75 69 74 73 20 3a 3d 20 5b 5d 72 |dardSuits := []r|
00000040 75 6e 65 7b 27 e2 99 a0 ef b8 8f 27 2c 20 27 e2 |une{'......', '.|
00000050 99 a3 ef b8 8f 27 2c 20 27 e2 99 a5 ef b8 8f 27 |.....', '......'|
00000060 2c 20 27 e2 99 a6 ef b8 8f 27 7d 0a 20 20 66 6d |, '......'}. fm|
00000070 74 2e 50 72 69 6e 74 6c 6e 28 73 74 61 6e 64 61 |t.Println(standa|
00000080 72 64 53 75 69 74 73 29 0a 7d 0a |rdSuits).}.|
This shows, for example, that your first rune literal (the spade) is encoded using the hex bytes e2 99 a0 ef b8 8f. In UTF-8 encoding this corresponds to the two(!) characters \u2660 \uFE0F. This is not obvious from looking at the code, since \uFE0F is not a printable character (it is a variation selector that requests emoji-style rendering of the preceding character), but Go complains because you have more than one character in a character constant. Using '♠' or '\u2660' instead works as expected.
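The same inspection can be done without a hex editor. The sketch below is in Python rather than Go and uses only the standard library's unicodedata module; the byte string is copied from the hexdump above.

```python
import unicodedata

raw = bytes.fromhex("e2 99 a0 ef b8 8f")   # bytes between the first pair of quotes
text = raw.decode("utf-8")

# List every code point hiding in what looks like a single symbol.
for ch in text:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+2660 BLACK SPADE SUIT
# U+FE0F VARIATION SELECTOR-16

# Dropping the variation selector leaves a single code point,
# which is what a Go rune literal requires.
cleaned = text.replace("\ufe0f", "")
print(len(text), len(cleaned))             # 2 1
```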

Convert PuTTY Raw SSH output to plain text

I have a 240MB logfile from a PuTTY session. It was mistakenly logged in the "SSH packets and raw data" format instead of "All session output". If I open the file in a text editor, I can see the data I require (the plain text).
The problem is extracting that from the raw data.
For example:
Incoming raw data at 2016-01-06 15:47:42
00000000 e8 fd c2 d2 88 a9 39 b9 2a 77 2a 7b 4a 60 fc 21 ......9.*w*{J`.!
00000010 1d f5 fc d4 b1 58 1f 4d 68 a4 ef 83 03 39 59 b7 .....X.Mh....9Y.
00000020 41 be 36 7b b5 3c 10 fa 65 27 77 30 77 97 02 39 A.6{.<..e'w0w..9
00000030 46 4c 28 da 5c c6 2c 1e ae 33 db e1 a8 09 ea 4a FL(.\.,..3.....J
00000040 06 94 c6 eb 38 8e d3 d3 33 13 78 08 7c 5f 41 56 ....8...3.x.|_AV
00000050 f1 13 9e e1 ....
Incoming packet #0x31, type 94 / 0x5e (SSH2_MSG_CHANNEL_DATA)
00000000 00 00 01 00 00 00 00 20 64 69 73 61 62 6c 69 6e ....... disablin
00000010 67 20 61 20 72 75 6e 6e 69 6e 67 20 77 61 74 63 g a running watc
00000020 68 64 6f 67 2e 2e 0d 0a hdog....
Incoming raw data at 2016-01-06 15:47:42
00000000 dc 96 f3 54 f8 a8 5c 83 80 7b a8 07 da 79 95 50 ...T..\..{...y.P
00000010 3f 19 2f 0c f0 03 a1 01 a3 33 2f 97 75 9d 47 15 ?./......3/.u.G.
00000020 b9 95 df c6 66 e0 50 32 88 1e db 5b 73 1b 7b ad ....f.P2...[s.{.
I think what I need to do is read only the sections of the file labelled "Incoming packet". Then I can read the ascii character codes and convert to readable text (this will recover the tabs, linefeeds and carriage returns).
I'm not familiar with awk or sed, but I know a bit of grep. How can I go about firstly extracting the sections (of variable size) that I need to translate from ASCII codes to text?

sed -n '/^Incoming packet/,/^Incoming raw data/{//!p}'
This will print the lines between the matches Incoming packet and Incoming raw data. Process this output further to get your desired output.
Print only the ASCII column (the last 17 characters) of each matching line:
sed -n '/Incoming packet/,/Incoming raw data/{//!{s/^.*\(.\{17\}\)/\1/;p}}'
Ref:1, 2
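Taking the last 17 characters recovers only what the ASCII column shows, so tabs, carriage returns and linefeeds come out as dots. An alternative is to decode the hex bytes themselves. The following is a rough Python sketch, not a tested tool for real PuTTY logs. It assumes each dump line has the layout seen in the question (offset, up to 16 hex pairs, ASCII column) and that, as in the sample, the dumped SSH2_MSG_CHANNEL_DATA payload starts with a 4-byte channel number and a 4-byte big-endian data length. The function name is made up.

```python
import re

OFFSET = re.compile(r"[0-9a-f]{8}")
HEX_PAIR = re.compile(r"[0-9a-f]{2}")


def extract_channel_data(lines):
    """Concatenate the data of all incoming SSH2_MSG_CHANNEL_DATA packets."""
    packets = []      # one bytearray of dumped bytes per wanted packet
    current = None    # bytearray being filled, or None outside a wanted packet
    for line in lines:
        line = line.strip()
        tokens = line.split()
        if line.startswith("Incoming packet") and "SSH2_MSG_CHANNEL_DATA" in line:
            current = bytearray()
            packets.append(current)
        elif current is not None and tokens and OFFSET.fullmatch(tokens[0]):
            for token in tokens[1:17]:          # at most 16 bytes per dump line
                if not HEX_PAIR.fullmatch(token):
                    break                       # reached the ASCII column
                current.append(int(token, 16))
        else:
            current = None                      # any other line ends the packet

    output = bytearray()
    for raw in packets:
        if len(raw) < 8:
            continue
        length = int.from_bytes(raw[4:8], "big")   # declared data length
        output += raw[8:8 + length]                # skip channel number and length
    return bytes(output)


sample = [
    "Incoming raw data at 2016-01-06 15:47:42",
    "00000050 f1 13 9e e1 ....",
    "Incoming packet #0x31, type 94 / 0x5e (SSH2_MSG_CHANNEL_DATA)",
    "00000000 00 00 01 00 00 00 00 20 64 69 73 61 62 6c 69 6e ....... disablin",
    "00000010 67 20 61 20 72 75 6e 6e 69 6e 67 20 77 61 74 63 g a running watc",
    "00000020 68 64 6f 67 2e 2e 0d 0a hdog....",
    "Incoming raw data at 2016-01-06 15:47:42",
]
print(extract_channel_data(sample))   # b'disabling a running watchdog..\r\n'
```

Trimming each packet to its declared length also discards any ASCII-column text on a short final line that happens to look like a hex pair. For the 240MB file, the open file object can be passed in directly so the log is read line by line.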

Extract data between two matched patterns in a binary file containing non-ASCII characters using bash

I am trying to extract a jpeg image from a binary text file. I want to extract all data between 0xFF 0xD8 (start of image) and 0xFF 0xD9 (end of image) inclusive. Earlier, I have successfully run the following command to get the desired image.jpg from a single paragraph file received.txt:
sed 's/.*\xFF\xD8/\xFF\xD8/; s/\xFF\xD9.*/\xFF\xD9/' received.txt > image.jpg
But when I tried to run the same operation on a different file, it didn't work. I also tried using
sed -n '/\xFF\xD8/,/\xFF\xD9/p' received.txt > temp.txt
sed 's/.*\xFF\xD8/\xFF\xD8/; s/\xFF\xD9.*/\xFF\xD9/' temp.txt > image.jpg
to remove any lines before or after the matched lines but got no success.
Although the file was too large, I pasted the hex dump of the relevant portion below:
0a 55 57 5d 50 cf ff d8 ff fe ff ff ff d9 df 47 fe e7 c9 3b e9 9b 6b 55 c4 57 9b 98 73 fd 15 f7 77 7e f7 95 dd 55 f7 55 05 cc 55 97 55 dd 62 d1 1f 51 ef f1 ef fb e9 bf ed 5f bf f2 9d 75 af fe 6b fb bf 8f f7 f7 7e ff d3 bf 8e d5 5f df 57 75 fe 77 7b bf d7 af df 5d fb 0a 47 de d5 ff c1 23 9b 20 08 20 65 3c 06 83 11 05 30 50 a0 20 55 20 84 41 04 c2 59 50 89 64 44 44 10 05 20 87 28 1d a9
The hex dump of the desired output in this case is:
ff d8 ff fe ff ff ff d9
Update
While trying to resolve the issue, I found that the sed command removes the characters before or after a matched pattern only up to the nearest non-ASCII byte (0x80 - 0xFF) and does not go beyond that byte. As an example, if we try:
echo 55 57 5d 50 cf 50 65 7f ff d8 ff fe ff ff ff d9 | xxd -r -p | sed 's/.*\xFF\xD8/\xFF\xD8/' > output
The hex dump of the output can be seen as:
xxd output
which is:
55 57 5d 50 cf ff d8 ff fe ff ff ff d9
As can be seen, the characters between the non-ASCII character and matched pattern are removed but the characters before the non-ASCII character are not.
Alternative Solution (not perfect)
I used the following commands to somewhat resolve the problem:
sed 's/\xFF\xD8/\x0A\xFF\xD8/; s/\xFF\xD9/\xFF\xD9\x0A/' received.txt > temp.txt
then run the following command (which will work if there is no new line character (0x0A) somewhere between 0xFF 0xD8 and 0xFF 0xD9):
sed -n '/\xFF\xD8/{/\xFF\xD9/p}' temp.txt > image.jpg
but if image.jpg file is empty (after execution of the above command), then run the following command:
sed -n '/\xFF\xD8/,/\xFF\xD9/p' temp.txt > image.jpg
These commands will do the desired job except that it puts 0x0A at the end of the image.jpg file (i.e., after 0xFF 0xD9). In my case, it did not create any issue as JPEG file automatically discards data after 0xFF 0xD9 marker.
I was stuck on implementing the 'if the image file is empty' condition when @chaos came up with a perfect solution. So I am now following his solution. Thanks a lot, @chaos!
Please follow the link below for chaos's solution!
https://unix.stackexchange.com/questions/231289/extract-data-between-two-matched-patterns-in-a-binary-file
Notes:
Here is how you can get the actual data from its hex dump which you can pipe to sed command:
echo 0a 55 57 5d 50 cf ff d8 ff fe ff ff ff d9 df 47 fe e7 c9 3b e9 9b 6b 55 c4 57 9b 98 73 fd 15 f7 77 7e f7 95 dd 55 f7 55 05 cc 55 97 55 dd 62 d1 1f 51 ef f1 ef fb e9 bf ed 5f bf f2 9d 75 af fe 6b fb bf 8f f7 f7 7e ff d3 bf 8e d5 5f df 57 75 fe 77 7b bf d7 af df 5d fb 0a 47 de d5 ff c1 23 9b 20 08 20 65 3c 06 83 11 05 30 50 a0 20 55 20 84 41 04 c2 59 50 89 64 44 44 10 05 20 87 28 1d a9 | xxd -r -p
and you can see the hex dump of a file by:
xxd file.txt
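For comparison, the extraction is straightforward in a tool that treats the input as raw bytes instead of locale-dependent text lines. This is a minimal Python sketch and not the linked solution. It takes the first start marker and the first end marker after it, which is an assumption that will not hold for every JPEG: a file with an embedded thumbnail, for instance, contains an earlier 0xFF 0xD9.

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def extract_jpeg(data):
    """Return the bytes from the first SOI through the next EOI, or None if absent."""
    start = data.find(SOI)
    if start == -1:
        return None
    end = data.find(EOI, start + len(SOI))
    if end == -1:
        return None
    return data[start:end + len(EOI)]


# First bytes of the hex dump from the question.
sample = bytes.fromhex("0a 55 57 5d 50 cf ff d8 ff fe ff ff ff d9 df 47")
print(extract_jpeg(sample).hex(" "))   # ff d8 ff fe ff ff ff d9
```

Because bytes.find works on bytes, neither newlines (0x0A) nor bytes in the 0x80 - 0xFF range need special handling, and no trailing 0x0A is appended.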

Change the locale setting to make sed work correctly, but why?

The following is a bash script I wrote to convert all C++ style (//) comments in a C file to C style (/* */).
#!/bin/bash
lang=`echo $LANG`
# It's necessary to change the locale setting. I don't know why.
export LANG=C
# The following statement can be commented out if the dos2unix command is not available.
dos2unix -q $1
sed -i -e 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' $1
export LANG=$lang
It works. But I found a problem I cannot explain. By default, my locale setting is en_US.UTF-8. And in my C code, there are comments written in Chinese, such as
// some english 一些中文注释
If I don't change the locale setting, i.e., do not run the statement export LANG=C, I'll get
/* some english */一些中文注释
instead of
/* some english 一些中文注释*/
I don't know why. I just found a solution by trial and error.
After reading Jonathan Leffler's answer, I think I made a mistake that led to some misunderstanding. In the question, those Chinese words were typed in Google Chrome and were not the actual bytes in my C file. 一些中文注释 just means some Chinese comments.
Now I typed // some english 一些中文注释 in Visual C++ 6.0 on Windows XP and copied the C file to Debian. Then I just ran sed -i -e 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' $1 and got
/* some english 一些 */中文注释
I think the different character encodings (GB18030, GBK, UTF-8?) cause the different results.
The following are the results I got on Debian:
~/sandbox$ uname -a
Linux xyt-dev 2.6.30-1-686 #1 SMP Sat Aug 15 19:11:58 UTC 2009 i686 GNU/Linux
~/sandbox$ echo $LANG
en_US.UTF-8
~/sandbox$ cat tt.c | od -c -t x1
0000000 / / s o m e e n g l i s h
2f 2f 20 73 6f 6d 65 20 65 6e 67 6c 69 73 68 20
0000020 322 273 320 251 326 320 316 304 327 242 312 315
d2 bb d0 a9 d6 d0 ce c4 d7 a2 ca cd
0000034
~/sandbox$ ./convert_comment_style_cpp2c.sh tt.c
~/sandbox$ cat tt.c | od -c -t x1
0000000 / * s o m e e n g l i s h
2f 2a 20 20 73 6f 6d 65 20 65 6e 67 6c 69 73 68
0000020 322 273 320 251 * / 326 320 316 304 327 242 312 315
20 d2 bb d0 a9 20 2a 2f d6 d0 ce c4 d7 a2 ca cd
0000040
~/sandbox$
I think these Chinese characters are encoded with 2 bytes each (Unicode).
Here is another example:
~/sandbox$ cat tt.c | od -c -t x1
0000000 / / I n W i n d o w : 250 250 ?
2f 2f 20 49 6e 57 69 6e 64 6f 77 3a 20 a8 a8 3f
0000020 1 ?
31 3f
0000022
~/sandbox$ ./convert_comment_style_cpp2c.sh tt.c
~/sandbox$ cat tt.c | od -c -t x1
0000000 / * I n W i n d o w : *
2f 2a 20 20 49 6e 57 69 6e 64 6f 77 3a 20 20 2a
0000020 / 250 250 ? 1 ?
2f a8 a8 3f 31 3f

Which platform are you working on? Your sed script works fine on MacOS X without changing the locale. The Linux terminal was less happy with the Chinese characters, but it is not set up to use UTF-8. Moreover, a hex dump of the string that it did get contained a zero byte 0x00 where the Chinese started, which might lead to the confusion. (I note that your regex adds a space before the comment text if it starts // with a space.)
MacOS X (10.6.8)
The 'odx' command used here is a hex-dump program.
$ echo "// some english 一些中文注释" > x3.utf8
$ odx x3.utf8
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20 // some english
0x0010: E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 E9 ................
0x0020: 87 8A 0A ...
0x0023:
$ utf8-unicode x3.utf8
0x2F = U+002F
0x2F = U+002F
0x20 = U+0020
0x73 = U+0073
0x6F = U+006F
0x6D = U+006D
0x65 = U+0065
0x20 = U+0020
0x65 = U+0065
0x6E = U+006E
0x67 = U+0067
0x6C = U+006C
0x69 = U+0069
0x73 = U+0073
0x68 = U+0068
0x20 = U+0020
0xE4 0xB8 0x80 = U+4E00
0xE4 0xBA 0x9B = U+4E9B
0xE4 0xB8 0xAD = U+4E2D
0xE6 0x96 0x87 = U+6587
0xE6 0xB3 0xA8 = U+6CE8
0xE9 0x87 0x8A = U+91CA
0x0A = U+000A
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8
/* some english 一些中文注释 */
$
All of which looks clean and tidy.
Linux (RHEL 5)
I copied the x3.utf8 file to a Linux box, and dumped it. Then I ran the sed script on it, and all seemed OK:
$ odx x3.utf8
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20 // some english
0x0010: E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 E9 ................
0x0020: 87 8A 0A ...
0x0023:
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8 | odx
0x0000: 2F 2A 20 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 /* some english
0x0010: 20 E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 ...............
0x0020: E9 87 8A 20 2A 2F 0A ... */.
0x0027:
$
So far, so good. I also tried:
$ echo $LANG
en_US.UTF-8
$ echo $LC_CTYPE
$ env | grep LC_
$ bash --version
GNU bash, version 3.2.25(1)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2005 Free Software Foundation, Inc.
$ cat x3.utf8
// some english 一些中文注释
$ echo $(<x3.utf8)
// some english 一些中文注释
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8
/* some english 一些中文注释 */
$
So, the terminal is nominally working in UTF-8 after all, and it certainly seems to display the data OK.
However, if I echo the string at the terminal, it gets into a tizzy. When I cut'n'pasted the string to the Linux terminal, it said:
$ echo "// some english d8d^G:
> "
// some english d8d:
$
and beeped.
$ echo "// some english d8d^G:
> " | odx
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20 // some english
0x0010: 64 38 64 07 3A 0A 0A d8d.:..
0x0017:
$
I'm not quite sure what to make of that. I think it means that something in the input side of bash is having some problems, but I'm not quite sure. I also am getting slightly inconsistent results. The first time I tried it, I got:
$ cat > xxx
's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;'
// some english d8^#d:^[d8-f^Gf3(i^G
$ odx xxx
0x0000: 27 73 3B 5E 5C 28 5B 5B 3A 62 6C 61 6E 6B 3A 5D 's;^\([[:blank:]
0x0010: 5D 2A 5C 29 2F 2F 5C 28 2E 2A 5C 29 3B 5C 31 2F ]*\)//\(.*\);\1/
0x0020: 2A 20 5C 32 20 2A 2F 3B 27 0A 2F 2F 20 73 6F 6D * \2 */;'.// som
0x0030: 65 20 65 6E 67 6C 69 73 68 20 64 38 00 64 3A 1B e english d8.d:.
0x0040: 64 38 2D 66 07 66 33 28 69 07 0A 0A d8-f.f3(i...
0x004C:
$
And in that hex dump, you can see a 0x00 byte (offset 0x003C). That appears at the position where you got the end comment, and a null there could confuse sed; but the whole input is such a mess it is hard to know what to make of it.

Okay, here's the correct answer...
The GNU regular expression library (regex) doesn't match everything when you put a . in your expression. Yup, I know how braindead that sounds.
The problem comes from the word "character". Reasonable people will say that everything in sed's input file is characters, and even in your case they are perfectly correct. But regex has been programmed to require that the input be perfectly correctly formatted characters of the current locale's character set (UTF-8). If they're correctly formatted characters for a Windows character set instead (judging by the od dump in the question, a legacy Chinese code page such as GBK), they're not "characters".
So as . only matches "characters", it doesn't match your characters.
If you used the regex //.*$, i.e. anchored it to the end of the line, it wouldn't match at all, because there's something that's not a "character" between the // and the end of the line.
And no, you can't do anything like //\(.\|[^.]\)*$; it's just impossible to match those bytes without switching to the C locale.
This will also, sometimes, destroy 8-bit transparency; i.e. a binary piped through sed will get corrupted even if no changes are made.
Fortunately the C locale still uses the reasonable interpretation, so anything that's not a perfectly correctly formatted ASCII-68 character is still a "character".
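The point at which . stops matching can be located by decoding the bytes from the question's od dump. The Python sketch below is only an analogy for what the regex library does in a UTF-8 locale, not a model of sed itself; first_invalid_utf8_offset is a helper invented for this example.

```python
line = bytes.fromhex(
    "2f 2f 20 73 6f 6d 65 20 65 6e 67 6c 69 73 68 20"   # "// some english "
    "d2 bb d0 a9 d6 d0 ce c4 d7 a2 ca cd"               # the Chinese comment text
)


def first_invalid_utf8_offset(data):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


print(line[16:].decode("gbk"))            # 一些中文注释 (the bytes are valid GBK)
print(first_invalid_utf8_offset(line))    # 20

# Treating every byte as one character, loosely like the C locale, never fails.
print(len(line.decode("latin-1")))        # 28
```

Offset 20 is four bytes into the Chinese text: d2 bb and d0 a9 happen to be well-formed two-byte UTF-8 sequences, while d6 d0 is not. That is exactly where the closing */ landed in the question's output.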
