How to split on NULs in shell [duplicate] - shell

I am using zsh as a shell.
I would like to execute the unix find command and put the result into a shell array variable, something like:
FILES=($(find . -name '*.bak'))
so that I can iterate over the values with something like
for F in "$FILES[@]"; do echo "<<$F>>"; done
However, my filenames contain spaces at least, and perhaps other funky characters, so the above doesn't work. What does work is:
IFS=$(echo -n -e "\0"); FILES=($(find . -name '*.bak' -print0)); unset IFS
but that's fugly. This is already a bit beyond my comfort limit with zsh syntax, so I'm hoping someone can point me to some basic feature that I never knew about but should.

I tend to use read for that. A quick Google search showed me that zsh also seems to support it:
find . -name '*.bak' | while IFS= read -r file; do echo "<<$file>>"; done
That doesn't split on zero bytes, but it works with file names containing whitespace other than newlines. If the file names go at the very end of the command to be executed, you can use xargs, which also works with newlines in file names:
find . -name '*.bak' -print0 | xargs -0 cp -t /tmp/dst
copies all files found into the directory /tmp/dst. The downside of the xargs approach is that you don't have the file names in a variable, of course, so it is not always applicable.
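For bash rather than zsh, a NUL-delimited read loop fills an array directly from find -print0. This is only a sketch: the temporary directory and the file names in it are made up for the demonstration, and process substitution requires bash rather than plain sh.

```bash
#!/usr/bin/env bash
# Demo directory with awkward names (hypothetical, for illustration only).
dir=$(mktemp -d)
touch "$dir/plain.bak" "$dir/with space.bak" "$dir/with"$'\n'"newline.bak"

files=()
# -d '' makes read stop at each NUL; IFS= and -r keep every byte of the name.
while IFS= read -r -d '' f; do
    files+=("$f")
done < <(find "$dir" -name '*.bak' -print0)

# Each array element is one complete file name.
for f in "${files[@]}"; do
    printf '<<%s>>\n' "$f"
done

rm -rf "$dir"
```

The loop runs in the current shell because the find output arrives through process substitution instead of a pipe, so the array is still populated after the loop ends.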

The only way I figured out is using eval:
(zyx:~/tmp) % F="$(find . -maxdepth 1 -name '* *' -print0)"
(zyx:~/tmp) % echo $F | hexdump -C
00000000 2e 2f 20 10 30 00 2e 2f d0 96 d1 83 d1 80 d0 bd |./ .0../........|
00000010 d0 b0 d0 bb 20 c2 ab d0 a1 d0 b0 d0 bc d0 b8 d0 |.... ...........|
00000020 b7 d0 b4 d0 b0 d1 82 c2 bb 2e d0 9c d0 b8 d1 82 |................|
00000030 d1 8e d0 b3 d0 b8 d0 bd d0 b0 20 d0 9e d0 bb d1 |.......... .....|
00000040 8c d0 b3 d0 b0 2e 20 d0 93 d1 80 d0 b0 d0 bd d0 |...... .........|
00000050 b8 20 d0 be d1 82 d1 80 d0 b0 d0 b6 d0 b5 d0 bd |. ..............|
00000060 d0 b8 d0 b9 2e 68 74 6d 6c 00 2e 2f 0a 20 0a 6e |.....html../. .n|
00000070 00 2e 2f 20 7b 5b 5d 7d 20 28 29 21 26 7c 00 2e |../ {[]} ()!&|..|
00000080 2f 74 65 73 74 32 20 2e 00 2e 2f 74 65 73 74 33 |/test2 .../test3|
00000090 20 2e 00 2e 2f 74 74 74 0a 74 74 0a 0a 74 20 00 | .../ttt.tt..t .|
000000a0 2e 2f 74 65 73 74 20 2e 00 2e 2f 74 74 0a 74 74 |./test .../tt.tt|
000000b0 0a 74 20 00 2e 2f 74 74 5c 20 74 0a 74 74 74 00 |.t ../tt\ t.ttt.|
000000c0 2e 2f 0a 20 0a 0a 00 2e 2f 7b 5c 5b 5d 7d 20 28 |./. ..../{\[]} (|
000000d0 29 21 26 7c 00 0a |)!&|..|
000000d6
(zyx:~/tmp) % echo $F[1]
.
(zyx:~/tmp) % eval 'F=( ${(s.'$'\0''.)F} )'
(zyx:~/tmp) % echo $F[1]
./ 0
${(s.\0.)...} and ${(s.$'\0'.)...} do not work.
You can use function:
function SplitAt0()
{
    local -r VAR=$1
    shift
    local -a CMD
    set -A CMD "$@"
    local -r CMDRESULT="$($CMD)"
    eval "$VAR="'( ${(s.'$'\0''.)CMDRESULT} )'
}
Usage: SplitAt0 varname command [arguments]
I should have used ${(ps.\0.)F}, not ${(s.\0.)F}:
% F=( ${(ps.\0.)"$(find . -maxdepth 1 -name '* *' -print0)"} )
% echo $F[1]
./ 0

Alternatively, since you're using zsh, you can use zsh's extended globbing syntax to find files rather than using the find command. As far as I know, all the functionality of find is present in the globbing syntax, and it handles filenames with whitespaces properly. See the zshexpn(1) manpage for more info. If you use zsh on a fulltime basis, it's well worth learning the syntax.

I've tried
F=(*)
and it handles even the files with newlines in the filename.
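For readers on bash instead of zsh, a rough counterpart to recursive globbing is the globstar option (bash 4.0 or later). This is a sketch under that assumption, not a replacement for zsh's glob qualifiers; the directory layout is invented for the demonstration.

```bash
#!/usr/bin/env bash
shopt -s globstar nullglob   # ** recurses; unmatched patterns expand to nothing

# Hypothetical demo tree with a space and a newline in the names.
dir=$(mktemp -d)
mkdir -p "$dir/sub dir"
touch "$dir/a b.bak" "$dir/sub dir/c"$'\n'"d.bak" "$dir/keep.txt"

# Globbing never goes through word splitting, so each match is one element.
baks=("$dir"/**/*.bak)
printf '%d matches\n' "${#baks[@]}"

rm -rf "$dir"
```

Unlike find, the glob skips dot files unless dotglob is also set.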

Related

Convert PuTTY Raw SSH output to plain text

I have a 240MB logfile from a PuTTY session. This was mistakenly logged in the "SSH packets and raw data" format instead of "All session output". If I open the file in a text editor then I can see the data I require (the plain text).
The problem is extracting that from the raw data.
For example:
Incoming raw data at 2016-01-06 15:47:42
00000000 e8 fd c2 d2 88 a9 39 b9 2a 77 2a 7b 4a 60 fc 21 ......9.*w*{J`.!
00000010 1d f5 fc d4 b1 58 1f 4d 68 a4 ef 83 03 39 59 b7 .....X.Mh....9Y.
00000020 41 be 36 7b b5 3c 10 fa 65 27 77 30 77 97 02 39 A.6{.<..e'w0w..9
00000030 46 4c 28 da 5c c6 2c 1e ae 33 db e1 a8 09 ea 4a FL(.\.,..3.....J
00000040 06 94 c6 eb 38 8e d3 d3 33 13 78 08 7c 5f 41 56 ....8...3.x.|_AV
00000050 f1 13 9e e1 ....
Incoming packet #0x31, type 94 / 0x5e (SSH2_MSG_CHANNEL_DATA)
00000000 00 00 01 00 00 00 00 20 64 69 73 61 62 6c 69 6e ....... disablin
00000010 67 20 61 20 72 75 6e 6e 69 6e 67 20 77 61 74 63 g a running watc
00000020 68 64 6f 67 2e 2e 0d 0a hdog....
Incoming raw data at 2016-01-06 15:47:42
00000000 dc 96 f3 54 f8 a8 5c 83 80 7b a8 07 da 79 95 50 ...T..\..{...y.P
00000010 3f 19 2f 0c f0 03 a1 01 a3 33 2f 97 75 9d 47 15 ?./......3/.u.G.
00000020 b9 95 df c6 66 e0 50 32 88 1e db 5b 73 1b 7b ad ....f.P2...[s.{.
I think what I need to do is read only the sections of the file labelled "Incoming packet". Then I can read the ascii character codes and convert to readable text (this will recover the tabs, linefeeds and carriage returns).
I'm not familiar with awk or sed, but I know a bit of grep. How can I go about firstly extracting the sections (of variable size) that I need to translate from ASCII codes to text?
sed -n '/^Incoming packet/,/^Incoming raw data/{//!p}'
This will print lines between the matches Incoming packet and Incoming raw. Process this output further to get your desired output.
Print only ASCII characters (print last 17 characters) from the matching line:
sed -n '/Incoming packet/,/Incoming raw data/{//!{s/^.*\(.\{17\}\)/\1/;p}}'
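The ASCII column shows control characters only as dots, so recovering tabs and line endings means decoding the hex column instead. Below is a bash sketch of that idea. decode_packets is a hypothetical helper name, and it assumes every packet of interest is an SSH2_MSG_CHANNEL_DATA packet laid out as in the sample, where the first 8 bytes (channel number and data length) come before the text.

```bash
#!/usr/bin/env bash
# decode_packets: hypothetical helper, not part of PuTTY or any other tool.
decode_packets() {
    local line field count skip in_packet=0
    local -a fields
    while IFS= read -r line; do
        case $line in
            "Incoming packet"*SSH2_MSG_CHANNEL_DATA*) in_packet=1; skip=8; continue ;;
            Incoming\ *|Outgoing\ *) in_packet=0; continue ;;
        esac
        (( in_packet )) || continue
        read -r -a fields <<< "$line"
        count=0
        for field in "${fields[@]:1}"; do      # fields[0] is the offset column
            # Stop at the ASCII column (not a hex pair) or after 16 bytes.
            [[ $field =~ ^[0-9a-fA-F]{2}$ ]] && (( count < 16 )) || break
            (( count += 1 ))
            # Assumption: drop the 8-byte channel/length prefix of each packet.
            if (( skip > 0 )); then (( skip -= 1 )); continue; fi
            printf "\\x$field"
        done
    done
}

# Sample lines copied from the question.
log=$(mktemp)
cat > "$log" <<'EOF'
Incoming raw data at 2016-01-06 15:47:42
00000000 e8 fd c2 d2 88 a9 39 b9 2a 77 2a 7b 4a 60 fc 21 ......9.*w*{J`.!
Incoming packet #0x31, type 94 / 0x5e (SSH2_MSG_CHANNEL_DATA)
00000000 00 00 01 00 00 00 00 20 64 69 73 61 62 6c 69 6e ....... disablin
00000010 67 20 61 20 72 75 6e 6e 69 6e 67 20 77 61 74 63 g a running watc
00000020 68 64 6f 67 2e 2e 0d 0a hdog....
EOF
decode_packets < "$log"
```

One known weakness: on a short final line, an ASCII column that happens to begin with a two-character hex-looking word followed by a space would be misread as a byte. A fixed-column parse of the original, unflattened log would avoid that.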

Extract data between two matched patterns in a binary file containing non-ASCII characters using bash

I am trying to extract a jpeg image from a binary text file. I want to extract all data between 0xFF 0xD8 (start of image) and 0xFF 0xD9 (end of image) inclusive. Earlier, I have successfully run the following command to get the desired image.jpg from a single paragraph file received.txt:
sed 's/.*\xFF\xD8/\xFF\xD8/; s/\xFF\xD9.*/\xFF\xD9/' received.txt > image.jpg
But when I tried to run the same operation on a different file, it didn't work. I also tried using
sed -n '/\xFF\xD8/,/\xFF\xD9/p' received.txt > temp.txt
sed 's/.*\xFF\xD8/\xFF\xD8/; s/\xFF\xD9.*/\xFF\xD9/' temp.txt > image.jpg
to remove any lines before or after the matched lines but got no success.
Although the file was too large, I pasted the hex dump of the relevant portion below:
0a 55 57 5d 50 cf ff d8 ff fe ff ff ff d9 df 47 fe e7 c9 3b e9 9b 6b 55 c4 57 9b 98 73 fd 15 f7 77 7e f7 95 dd 55 f7 55 05 cc 55 97 55 dd 62 d1 1f 51 ef f1 ef fb e9 bf ed 5f bf f2 9d 75 af fe 6b fb bf 8f f7 f7 7e ff d3 bf 8e d5 5f df 57 75 fe 77 7b bf d7 af df 5d fb 0a 47 de d5 ff c1 23 9b 20 08 20 65 3c 06 83 11 05 30 50 a0 20 55 20 84 41 04 c2 59 50 89 64 44 44 10 05 20 87 28 1d a9
The hex dump of the desired output in this case is:
ff d8 ff fe ff ff ff d9
Update
While trying to resolve the issue, I found that the sed command removes all the characters before or after a matched pattern up to the nearest non-ASCII character (0x80 - 0xFF) but does not go beyond that non-ASCII character. As an example, if we try:
echo 55 57 5d 50 cf 50 65 7f ff d8 ff fe ff ff ff d9 | xxd -r -p | sed 's/.*\xFF\xD8/\xFF\xD8/' > output
The hex dump of the output can be seen as:
xxd output
which is:
55 57 5d 50 cf ff d8 ff fe ff ff ff d9
As can be seen, the characters between the non-ASCII character and matched pattern are removed but the characters before the non-ASCII character are not.
Alternative Solution (not perfect)
I used the following commands to somewhat resolve the problem:
sed 's/\xFF\xD8/\x0A\xFF\xD8/; s/\xFF\xD9/\xFF\xD9\x0A/' received.txt > temp.txt
then run the following command (which will work if there is no new line character (0x0A) somewhere between 0xFF 0xD8 and 0xFF 0xD9):
sed -n '/\xFF\xD8/{/\xFF\xD9/p}' temp.txt > image.jpg
but if image.jpg file is empty (after execution of the above command), then run the following command:
sed -n '/\xFF\xD8/,/\xFF\xD9/p' temp.txt > image.jpg
These commands will do the desired job except that it puts 0x0A at the end of the image.jpg file (i.e., after 0xFF 0xD9). In my case, it did not create any issue as JPEG file automatically discards data after 0xFF 0xD9 marker.
I was stuck at the implementation of the 'if image file is empty' condition when @chaos came up with a perfect solution. So, I am now following his solution. Thanks a lot @chaos!
Please follow the link below for chaos's solution!
https://unix.stackexchange.com/questions/231289/extract-data-between-two-matched-patterns-in-a-binary-file
Notes:
Here is how you can get the actual data from its hex dump which you can pipe to sed command:
echo 0a 55 57 5d 50 cf ff d8 ff fe ff ff ff d9 df 47 fe e7 c9 3b e9 9b 6b 55 c4 57 9b 98 73 fd 15 f7 77 7e f7 95 dd 55 f7 55 05 cc 55 97 55 dd 62 d1 1f 51 ef f1 ef fb e9 bf ed 5f bf f2 9d 75 af fe 6b fb bf 8f f7 f7 7e ff d3 bf 8e d5 5f df 57 75 fe 77 7b bf d7 af df 5d fb 0a 47 de d5 ff c1 23 9b 20 08 20 65 3c 06 83 11 05 30 50 a0 20 55 20 84 41 04 c2 59 50 89 64 44 44 10 05 20 87 28 1d a9 | xxd -r -p
and you can see the hex dump of a file by:
xxd file.txt
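One way to sidestep sed's line and locale handling altogether is to do the matching on a hex dump and convert the result back to bytes. The following bash sketch does that with od and parameter expansion. It is not the solution from the linked answer; extract_jpeg is a hypothetical name, and the approach holds the whole dump in a shell variable, so it is only practical for small files.

```bash
#!/usr/bin/env bash
# extract_jpeg FILE: print the bytes from the first ff d8 through the
# first ff d9 that follows it. Hypothetical helper, for illustration only.
extract_jpeg() {
    local hex body byte
    # One long line of hex pairs, each preceded by a space, so that
    # patterns can only match on byte boundaries.
    hex=" $(od -An -v -tx1 "$1" | tr -s ' \n' ' ')"
    [[ $hex == *" ff d8"*" ff d9"* ]] || return 1
    body=${hex#*" ff d8"}      # drop everything through the first start marker
    body=${body%%" ff d9"*}    # drop everything from the next end marker onward
    for byte in ff d8 $body ff d9; do
        printf "\\x$byte"
    done
}

# Demo using the first 16 bytes of the hex dump from the question.
sample=$(mktemp)
printf '\x0a\x55\x57\x5d\x50\xcf\xff\xd8\xff\xfe\xff\xff\xff\xd9\xdf\x47' > "$sample"
extract_jpeg "$sample" | od -An -v -tx1
rm -f "$sample"
```

Stopping at the first end marker is a design choice that matches the sample data; a JPEG with an embedded thumbnail contains more than one ff d9 and would need a different rule.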

Can $() always replace backticks for command substitution?

The following code fails to generate binary numbers if backticks are replaced by dollar-parenthesis syntax:
#!/bin/bash
rm test.bin 2>/dev/null
for character in {0..255}
do
char=`printf '\\\\x'"%02x" $character`
printf "$char" >> test.bin
done
hexdump -C test.bin
Result:
00000000 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f |................|
00000010 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f |................|
00000020 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f | !"#$%&'()*+,-./|
00000030 30 31 32 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f |0123456789:;<=>?|
00000040 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f |@ABCDEFGHIJKLMNO|
00000050 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f |PQRSTUVWXYZ[\]^_|
00000060 60 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f |`abcdefghijklmno|
00000070 70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f |pqrstuvwxyz{|}~.|
00000080 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f |................|
00000090 90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f |................|
000000a0 a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aa ab ac ad ae af |................|
000000b0 b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf |................|
000000c0 c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf |................|
000000d0 d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 da db dc dd de df |................|
000000e0 e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee ef |................|
000000f0 f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe ff |................|
That's ok so far. Let's replace backticks and see what we get:
#!/bin/bash
rm test.bin 2>/dev/null
for character in {0..255}
do
char=$(printf '\\\\x'"%02x" $character)
printf "$char" >> test.bin
done
hexdump -C test.bin
Result:
00000000 5c 78 30 30 5c 78 30 31 5c 78 30 32 5c 78 30 33 |\x00\x01\x02\x03|
00000010 5c 78 30 34 5c 78 30 35 5c 78 30 36 5c 78 30 37 |\x04\x05\x06\x07|
00000020 5c 78 30 38 5c 78 30 39 5c 78 30 61 5c 78 30 62 |\x08\x09\x0a\x0b|
00000030 5c 78 30 63 5c 78 30 64 5c 78 30 65 5c 78 30 66 |\x0c\x0d\x0e\x0f|
00000040 5c 78 31 30 5c 78 31 31 5c 78 31 32 5c 78 31 33 |\x10\x11\x12\x13|
00000050 5c 78 31 34 5c 78 31 35 5c 78 31 36 5c 78 31 37 |\x14\x15\x16\x17|
00000060 5c 78 31 38 5c 78 31 39 5c 78 31 61 5c 78 31 62 |\x18\x19\x1a\x1b|
00000070 5c 78 31 63 5c 78 31 64 5c 78 31 65 5c 78 31 66 |\x1c\x1d\x1e\x1f|
.
.
While I prefer the dollar-parenthesis syntax, it appears to fail in this case, but why?
Credits for the code snippet:
http://code.activestate.com/recipes/578441-a-building-block-bash-binary-file-manipulation
You are running into one of the reasons that $() is preferred to the backtick notation. The shell parsing of $() is more consistent (as it introduces a new parsing context as I understand it).
So your escaping, while correct for the backtick code, is excessive for the $() code.
Try this:
$ : > test.bin; for character in {0..255}
do
char=$(printf '\\x'"%02x" $character)
printf "$char" >> test.bin
done; hexdump -C test.bin
00000000 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f |................|
00000010 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f |................|
00000020 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f | !"#$%&'()*+,-./|
00000030 30 31 32 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f |0123456789:;<=>?|
00000040 40 41 42 43 44 45 46 47 48 49 4a 4b 4c 4d 4e 4f |@ABCDEFGHIJKLMNO|
00000050 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f |PQRSTUVWXYZ[\]^_|
00000060 60 61 62 63 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f |`abcdefghijklmno|
00000070 70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f |pqrstuvwxyz{|}~.|
00000080 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f |................|
00000090 90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f |................|
000000a0 a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aa ab ac ad ae af |................|
000000b0 b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf |................|
000000c0 c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf |................|
000000d0 d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 da db dc dd de df |................|
000000e0 e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee ef |................|
000000f0 f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe ff |................|
00000100
To see it a little more clearly, compare this
$ printf %s\\n `printf %s "\\\\ff"`
\ff
$ printf %s\\n `printf %s '\\\\ff'`
\\ff
to this
$ printf %s\\n $(printf %s "\\\\ff")
\\ff
$ printf %s\\n $(printf %s '\\\\ff')
\\\\ff
This is the difference:
echo `echo '\\'`
\
echo $(echo '\\')
\\
From the manual, Command substitution section:
When the old-style backquoted form of substitution is used, backslash retains its literal meaning except when followed by "$", "`", or "\".
When using the "$(COMMAND)" form, all characters between the parentheses make up the command; none are treated specially.
$() doesn't need extra escaping for \. Use:
char=$(printf '\\x'"%02x" $character)
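The rule quoted from the manual can be checked by counting characters. This small bash sketch passes the same single-quoted argument through both forms of command substitution; the variable names old and new are only illustrative.

```bash
#!/usr/bin/env bash
old=`printf %s '\\\\x41'`     # backticks consume one level of backslashes first
new=$(printf %s '\\\\x41')    # $() passes the quoted text through unchanged

printf 'backticks: %s (%d chars)\n' "$old" "${#old}"
printf '$():       %s (%d chars)\n' "$new" "${#new}"
```

The backtick form ends up with two backslashes and the $() form with four, which is why the original script needed four backslashes with backticks but only two with $().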

Bash echo command with binary data?

Would someone please explain why this script sometimes returns only 15 bytes in the hex string representation?
for i in {1..10}; do
API_IV=`openssl rand 16`; API_IV_HEX=`echo -n "$API_IV" | od -vt x1 -w16 | awk '{$1="";print}'`; echo $API_IV_HEX;
done
like this:
c2 2a 09 0f 9a cd 64 02 28 06 43 f8 13 80 a5 04
fa c4 ac b1 95 23 7c 36 95 2d 5e 0e bf 05 fe f4
38 55 d3 b4 32 bb 61 f4 fd 17 92 67 e2 9b b4 04
6d a7 f8 46 e9 99 bd 89 87 f9 7f 2b 15 5a 17 8a
11 c8 89 f4 8f 66 93 f1 6d b9 2b 64 7e 01 61 68
93 e3 9d 28 95 e1 c8 92 e5 62 d9 bf 20 b3 1c dd
37 64 ef b0 2f da c7 60 1c c8 20 b8 28 9d f9
29 f0 5a e9 cc 36 66 de 02 82 fc 8e 36 bf 5d d1
b2 57 d8 79 21 df 73 1c af 07 e9 80 0a 67 c6 15
ba 77 cb 92 39 42 39 f9 a4 57 c8 c4 be 62 19 54
If I pipe "openssl rand 16" directly to the od command then it works fine, but I need the binary value. Thanks for your help.
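echo, like various other standard commands, treats \x00 as an end-of-string marker, so it stops displaying after it. (In bash, the NUL byte is in fact lost even earlier: a shell variable cannot hold it, so command substitution drops it.)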
Maybe you are looking to the -hex option of openssl rand:
sh$ openssl rand 16 -hex
4248bf230fc9dd927ab53f799e2a9708
Given that option is available on your system, your example could be rewritten:
sh$ openssl version
OpenSSL 1.0.1e 11 Feb 2013
sh$ for i in {1..10}; do
openssl rand 16 -hex | sed -e 's|..|& |g' -e 's| $||'
done
20 cb 6b 7a 85 2d 0b fe 9e c7 d0 4b 91 88 1b bb
5d 74 99 5e 05 c9 7d 9d 37 dd 02 f3 23 bb c5 b7
51 e9 0f dc 58 04 5e 30 e3 6b 9f 63 aa fc 95 05
fc 6b b8 cb 05 82 53 85 78 0e 59 13 3b e7 c1 4b
cf fa fc d9 1a 25 df e0 f8 59 71 a6 2c 64 c5 87
93 1a 29 b4 5a 52 77 bb 3f bb 1d 0a 46 5d c8 b4
0c bb c2 b2 b4 89 d4 37 1c 86 0a 7a 58 b8 64 e2
ee fc a7 ec 6c f8 7f 51 04 43 d6 00 d8 79 65 43
b9 73 9e cc 4b 42 9e 64 9d 5b 21 6a 20 b7 c3 16
06 8a 15 22 6a d5 ae ab 9a d2 9f 60 f1 a9 26 bd
If you need to later convert from hex to bytes use this perl one-liner:
echo MY_HEX_STRING |
perl -ne 's/([0-9a-f]{2})/print chr hex $1/gie' |
my_tool_reading_binary_input_from_stdin
(from https://stackoverflow.com/a/1604770/2363712)
Please note I pipe to the client program and do not use a shell variable here, as I think it cannot properly handle the \x00.
As bash cannot properly deal with binary strings containing \x00, your best bet, if you absolutely want to stick with shell programming, is to use an intermediate file to store binary data rather than a variable.
This is the idea. Feel free to adapt to your needs:
for i in {1..10}; do
openssl rand 16 > api_iv_$i # better use `mktemp` here
API_IV_HEX=`od -vt x1 -w16 < api_iv_$i | awk '{$1="";print}'`
echo $API_IV_HEX;
done
sh$ bash t.sh
cf 06 ab ab 86 fd ef 22 1a 2c bd 7f 8c 45 27 e5
2a 01 9c 7a fa 15 d3 ea 40 89 8b 26 d5 4f 97 08
55 2e c9 d3 cd 0d 3a 6f 1b a0 fe 38 6d 0e 20 07
fe 60 35 62 17 80 f2 db 64 7a af da 81 ff f7 e0
74 9a 5c 39 0e 1a 6b 89 a3 21 65 01 a3 de c4 1c
c3 11 45 e3 e0 dc 66 a3 e8 fb 5b 8a bd d0 7d 43
a4 ee 80 f8 c8 8b 4e 50 5c dd 21 00 3b d0 bc cf
e2 d5 11 d4 7d 98 08 a7 16 7b 8c 56 44 ba 6d 53
ad 63 65 fd bf 3f 1f 4a a1 c5 d0 58 23 ae d1 47
80 74 f1 d0 b9 00 e5 1d 50 74 53 96 4b ce 59 50
sh$ hexdump -C ./api_iv_10
00000000 80 74 f1 d0 b9 00 e5 1d 50 74 53 96 4b ce 59 50 |.t......PtS.K.YP|
00000010
As a personal opinion, if you really have a lot of binary data processing to do, I would recommend switching to a more data-oriented language (Python, ...)
Because the missing byte was an ASCII NUL '\0' '\x00'. The echo command stops printing its argument(s) when it comes across a null byte in each argument.
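The loss can be reproduced without openssl. In this bash sketch a five-byte payload with a NUL in the middle is written to a temporary file and then read back through command substitution; the payload and variable names are made up for the demonstration.

```bash
#!/usr/bin/env bash
tmp=$(mktemp)
printf 'ab\0cd' > "$tmp"      # 5 bytes, with a NUL in the middle

# bash drops the NUL while capturing (newer versions also warn on stderr).
var=$(cat "$tmp")
printf 'file: %s bytes, variable: %s bytes\n' \
    "$(wc -c < "$tmp" | tr -d ' ')" "${#var}"

# Reading the file directly keeps all five bytes.
od -An -v -tx1 < "$tmp"
```

In bash the variable comes out one byte shorter than the file, which matches the 15-byte lines in the question: the data is already damaged before echo ever sees it.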

Change the locale setting to make sed work correctly, but why?

The following is a bash file I wrote to convert all C++ style(//) comments in a C file to C style(/**/).
#!/bin/bash
lang=$LANG
# It's necessary to change the locale setting. I don't know why.
export LANG=C
# The following statement can be commented out if the dos2unix command is not available.
dos2unix -q "$1"
sed -i -e 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' "$1"
export LANG=$lang
It works. But I found a problem I cannot explain. By default, my locale setting is en_US.UTF-8. And in my C code, there are comments written in Chinese, such as
// some english 一些中文注释
If I don't change the locale setting, i.e., do not run the statement export LANG=C, I'll get
/* some english */一些中文注释
instead of
/* some english 一些中文注释*/
I don't know why. I just found a solution by trial and error.
After reading Jonathan Leffler's answer, I think I made a mistake that led to some misunderstanding. In the question, those Chinese words were typed in Google Chrome and were not the actual words in my C file. 一些中文注释 just means "some Chinese comments".
Now I typed // some english 一些中文注释 in Visual C++ 6.0 on Windows XP, and copied the C file to Debian. Then I just ran sed -i -e 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' $1 and got
/* some english 一些 */中文注释
I think the different character encodings (GB18030, GBK, UTF-8?) cause the different results.
The following are the results I got on Debian:
~/sandbox$ uname -a
Linux xyt-dev 2.6.30-1-686 #1 SMP Sat Aug 15 19:11:58 UTC 2009 i686 GNU/Linux
~/sandbox$ echo $LANG
en_US.UTF-8
~/sandbox$ cat tt.c | od -c -t x1
0000000 / / s o m e e n g l i s h
2f 2f 20 73 6f 6d 65 20 65 6e 67 6c 69 73 68 20
0000020 322 273 320 251 326 320 316 304 327 242 312 315
d2 bb d0 a9 d6 d0 ce c4 d7 a2 ca cd
0000034
~/sandbox$ ./convert_comment_style_cpp2c.sh tt.c
~/sandbox$ cat tt.c | od -c -t x1
0000000 / * s o m e e n g l i s h
2f 2a 20 20 73 6f 6d 65 20 65 6e 67 6c 69 73 68
0000020 322 273 320 251 * / 326 320 316 304 327 242 312 315
20 d2 bb d0 a9 20 2a 2f d6 d0 ce c4 d7 a2 ca cd
0000040
~/sandbox$
I think these Chinese characters are encoded with 2 bytes each (Unicode).
Here is another example:
~/sandbox$ cat tt.c | od -c -t x1
0000000 / / I n W i n d o w : 250 250 ?
2f 2f 20 49 6e 57 69 6e 64 6f 77 3a 20 a8 a8 3f
0000020 1 ?
31 3f
0000022
~/sandbox$ ./convert_comment_style_cpp2c.sh tt.c
~/sandbox$ cat tt.c | od -c -t x1
0000000 / * I n W i n d o w : *
2f 2a 20 20 49 6e 57 69 6e 64 6f 77 3a 20 20 2a
0000020 / 250 250 ? 1 ?
2f a8 a8 3f 31 3f
Which platform are you working on? Your sed script works fine on MacOS X without changing locale. The Linux terminal was less happy with the Chinese characters, but it is not setup to use UTF-8. Moreover, a hex dump of the string that it did get contained a zero byte 0x00 where the Chinese started, which might lead to the confusion. (I note that your regex adds a space before the comment text if it starts // with a space.)
MacOS X (10.6.8)
The 'odx' command I use is a hex-dump program.
$ echo "// some english 一些中文注释" > x3.utf8
$ odx x3.utf8
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20 // some english
0x0010: E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 E9 ................
0x0020: 87 8A 0A ...
0x0023:
$ utf8-unicode x3.utf8
0x2F = U+002F
0x2F = U+002F
0x20 = U+0020
0x73 = U+0073
0x6F = U+006F
0x6D = U+006D
0x65 = U+0065
0x20 = U+0020
0x65 = U+0065
0x6E = U+006E
0x67 = U+0067
0x6C = U+006C
0x69 = U+0069
0x73 = U+0073
0x68 = U+0068
0x20 = U+0020
0xE4 0xB8 0x80 = U+4E00
0xE4 0xBA 0x9B = U+4E9B
0xE4 0xB8 0xAD = U+4E2D
0xE6 0x96 0x87 = U+6587
0xE6 0xB3 0xA8 = U+6CE8
0xE9 0x87 0x8A = U+91CA
0x0A = U+000A
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8
/* some english 一些中文注释 */
$
All of which looks clean and tidy.
Linux (RHEL 5)
I copied the x3.utf8 file to a Linux box, and dumped it. Then I ran the sed script on it, and all seemed OK:
$ odx x3.utf8
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20 // some english
0x0010: E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 E9 ................
0x0020: 87 8A 0A ...
0x0023:
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8 | odx
0x0000: 2F 2A 20 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 /* some english
0x0010: 20 E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 ...............
0x0020: E9 87 8A 20 2A 2F 0A ... */.
0x0027:
$
So far, so good. I also tried:
$ echo $LANG
en_US.UTF-8
$ echo $LC_CTYPE
$ env | grep LC_
$ bash --version
GNU bash, version 3.2.25(1)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2005 Free Software Foundation, Inc.
$ cat x3.utf8
// some english 一些中文注释
$ echo $(<x3.utf8)
// some english 一些中文注释
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8
/* some english 一些中文注释 */
$
So, the terminal is nominally working in UTF-8 after all, and it certainly seems to display the data OK.
However, if I echo the string at the terminal, it gets into a tizzy. When I cut'n'pasted the string to the Linux terminal, it said:
$ echo "// some english d8d^G:
> "
// some english d8d:
$
and beeped.
$ echo "// some english d8d^G:
> " | odx
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20 // some english
0x0010: 64 38 64 07 3A 0A 0A d8d.:..
0x0017:
$
I'm not quite sure what to make of that. I think it means that something in the input side of bash is having some problems, but I'm not quite sure. I also am getting slightly inconsistent results. The first time I tried it, I got:
$ cat > xxx
's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;'
// some english d8^#d:^[d8-f^Gf3(i^G
$ odx xxx
0x0000: 27 73 3B 5E 5C 28 5B 5B 3A 62 6C 61 6E 6B 3A 5D 's;^\([[:blank:]
0x0010: 5D 2A 5C 29 2F 2F 5C 28 2E 2A 5C 29 3B 5C 31 2F ]*\)//\(.*\);\1/
0x0020: 2A 20 5C 32 20 2A 2F 3B 27 0A 2F 2F 20 73 6F 6D * \2 */;'.// som
0x0030: 65 20 65 6E 67 6C 69 73 68 20 64 38 00 64 3A 1B e english d8.d:.
0x0040: 64 38 2D 66 07 66 33 28 69 07 0A 0A d8-f.f3(i...
0x004C:
$
And in that hex dump, you can see a 0x00 byte (offset 0x003C). That appears at the position where you got the end comment, and a null there could confuse sed; but the whole input is such a mess it is hard to know what to make of it.
Okay, here's the correct answer...
The GNU regular expression library (regex) doesn't match everything when you put a . in your expression. Yup, I know how braindead that sounds.
The problem comes from the word "character". Reasonable people will say that everything in the input file for sed is characters, and even in your case they are perfectly correct. But regex has been programmed to require that the input be perfectly correctly formatted characters of the current locale's character set (UTF-8); if they're correctly formatted characters for the Windows character set (UTF-16), they're not "characters".
So as . only matches "characters" it doesn't match your characters.
If you used the regex //.*$, ie: pinned it to the end of the line it wouldn't match at all because there's something that's not a "character" between the // and the end of the line.
And no you can't do anything like //\(.\|[^.]\)*$, it's just impossible to match those characters without switching to the C locale.
This will also, sometimes, destroy 8-bit transparency; ie: a binary piped through sed will get corrupted even if no changes are made.
Fortunately the C locale still uses the reasonable interpretation so anything that's not a perfectly correctly formatted ASCII-68 character is still a "character".
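The C locale can also be applied to the sed command alone instead of exporting and restoring LANG around it. This bash sketch uses the four non-UTF-8 bytes d2 bb d0 a9 from the od dump in the question as input; the variable names are illustrative.

```bash
#!/usr/bin/env bash
# Bytes d2 bb d0 a9 come from the od dump in the question; they are not valid UTF-8.
line=$'// some english \xd2\xbb\xd0\xa9'

# LC_ALL=C applies to this one sed process only, so nothing needs restoring.
converted=$(printf '%s\n' "$line" |
    LC_ALL=C sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;')

printf '%s\n' "$converted" | od -c
```

LC_ALL is used here rather than LANG because it takes precedence over both LANG and any LC_CTYPE setting, so the override still works when those are set in the environment.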

Resources