I've built a kernel on x86_64 (kernel ver 3.18.22) with kmemcheck enabled.
Relevant configs:
# grep KMEMCHECK /boot/config-3.18.22
CONFIG_HAVE_ARCH_KMEMCHECK=y
CONFIG_KMEMCHECK=y
CONFIG_KMEMCHECK_DISABLED_BY_DEFAULT=y
# CONFIG_KMEMCHECK_ENABLED_BY_DEFAULT is not set
# CONFIG_KMEMCHECK_ONESHOT_BY_DEFAULT is not set
CONFIG_KMEMCHECK_QUEUE_SIZE=64
CONFIG_KMEMCHECK_SHADOW_COPY_SHIFT=5
CONFIG_KMEMCHECK_PARTIAL_OK=y
# CONFIG_KMEMCHECK_BITOPS_OK is not set
#
I wrote a quick kernel module to test whether kmemcheck catches uninitialized slab memory accesses. The function that runs this simple test case:
static int slab_test(void)
{
    void *kbuf;
    kbuf = kmalloc(512, GFP_KERNEL);
    if (!kbuf) {
        pr_warn("out of memory!\n");
        return -ENOMEM;
    }
    pr_info("### slab_test: kbuf=%p\n", kbuf);
    /* Deliberately read the freshly allocated, uninitialized buffer */
    print_hex_dump_bytes("### ", DUMP_PREFIX_ADDRESS, kbuf, 32);
    kfree(kbuf);
    return 0;
}
I enable kmemcheck, insert the module and call the above function, log the output - all via a small wrapper script below:
# cat tst.sh
MOD=kmemchk_test
echo 0 > /proc/sys/kernel/kmemcheck
dmesg -C
rmmod ${MOD} 2>/dev/null
echo 1 > /proc/sys/kernel/kmemcheck
insmod ${MOD}.ko
sleep 1
echo 0 > /proc/sys/kernel/kmemcheck
dmesg > out.txt
#
My problem is this: kmemcheck does not seem to catch the uninitialized memory access at all! Here's the output:
# dmesg
--snip--
kern :info : [ +0.000005] ### slab_test: kbuf=ffff88003ccc8000
kern :debug : [ +0.000003] ### ffff88003ccc8000: 00 8c cc 3c 00 88 ff ff 75 6c 65 2f 6b 6d 65 6d ...<....ule/kmem
kern :debug : [ +0.000003] ### ffff88003ccc8010: 63 68 6b 5f 74 65 73 74 00 41 43 54 49 4f 4e 3d chk_test.ACTION=
#
Any idea why? TIA..
I've encountered a strange behaviour with bash string substitution.
I expected the same substitution on $r1 and $var to yield exactly the same results.
Both strings seem to have the same value.
But that is not the case, and I can't understand what I am missing.
Maybe it is because of the glob? I just don't know. I am not a pure IT person, so maybe it's something that will be evident to you.
mkdir -p T21805
touch T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz
r1=T21805/*R1*
echo $r1;
echo ${r1%%_S1*z}
var=T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz
echo ${var%%_S1*z}
echo $r1| hexdump -C
echo $var | hexdump -C
output :
echo $r1
T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz
echo ${r1%%_S1*z}
T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz
echo ${var%%_S1*z}
T21805/T21805_SI-GA-D8-BH25N7DSXY
echo $r1| hexdump -C
00000000 54 32 31 38 30 35 2f 54 32 31 38 30 35 5f 53 49
|T21805/T21805_SI|
00000010 2d 47 41 2d 44 38 2d 42 48 32 35 4e 37 44 53 58
|-GA-D8-BH25N7DSX|
00000020 59 5f 53 31 5f 4c 30 30 31 5f 52 31 5f 30 30 31
|Y_S1_L001_R1_001|
00000030 2e 66 61 73 74 71 2e 67 7a 0a
|.fastq.gz.| 0000003a
echo $var | hexdump -C
00000000 54 32 31 38 30 35 2f 54 32 31 38 30 35 5f 53 49
|T21805/T21805_SI|
00000010 2d 47 41 2d 44 38 2d 42 48 32 35 4e 37 44 53 58
|-GA-D8-BH25N7DSX|
00000020 59 5f 53 31 5f 4c 30 30 31 5f 52 31 5f 30 30 31
|Y_S1_L001_R1_001|
00000030 2e 66 61 73 74 71 2e 67 7a 0a
|.fastq.gz.| 0000003a
I am interested in understanding why this is not working; I can achieve my desired output using sed, for example.
Glob expansion doesn't happen at assignment time.
$ mkdir -p T21805
$ touch T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz
$ touch T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_002.fastq.gz
$ r1=T21805/*R1*
$ printf '%s\n' "$r1"
T21805/*R1*
$ printf '%s\n' $r1
T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz
T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_002.fastq.gz
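It happens only after the unquoted $r1 has been expanded. When you write ${r1%%_S1*z}, the value of r1 is still the literal pattern T21805/*R1*, which doesn't contain the string _S1, so nothing is stripped; only after ${r1} expands and the result is globbed is there an S1 you could match against.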
If you set an array, the assignment rules are different. The glob expands before the assignment, and so you can do your filtering on each element of the array.
$ r1=( T21805/*R1* )
$ printf '%s\n' "${r1[@]}"
T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz
T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_002.fastq.gz
$ printf '%s\n' "${r1[@]%%_S1*z}"
T21805/T21805_SI-GA-D8-BH25N7DSXY
T21805/T21805_SI-GA-D8-BH25N7DSXY
I ran it after set -xv to see the contents of r1.
$ r1=T21805/*R1*
+ r1='T21805/*R1*'
$ var=T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz
+ var=T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz
The value of r1 in ${r1%%_S1*z} is T21805/*R1*.
r1 does not contain anything matching _S1*z.
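A minimal sketch of the point above, assuming a throwaway directory created with mktemp (the directory and the variable names literal and stem are only for illustration). The assignment stores the pattern itself, while a for loop expands the glob first, so the suffix removal sees a real file name:

```shell
# Work in a scratch directory so nothing in the current directory matters.
orig_dir=$PWD
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p T21805
touch T21805/T21805_SI-GA-D8-BH25N7DSXY_S1_L001_R1_001.fastq.gz

# No pathname expansion on assignment: r1 holds the pattern, not the file name.
r1=T21805/*R1*
literal=$r1
printf 'stored value: %s\n' "$literal"

# The glob in the for list is expanded before the loop body runs,
# so f is a real path and %% has an _S1...z suffix to remove.
for f in T21805/*R1*; do
    stem=${f%%_S1*z}
    printf 'stem: %s\n' "$stem"
done

cd "$orig_dir" || exit 1
rm -rf "$workdir"
```

With several matching files the loop simply runs once per file, which is usually what you want in a pipeline script.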
I've spent an embarrassingly long time trying to understand why the second conditional in the "foo" script below fails but the first one succeeds.
Please note:
The current directory contains two files: bar and foo.
All three strings $s1, $s2 and $s3 are equal according to hexdump.
Thanks in advance for any help.
Session: (Running on a Centos7 host):
>ls
bar foo
>cat foo
#!/bin/bash
s1="bar foo"
s2="bar foo"
s3=`ls`
echo -n $s1 | hexdump -C
echo -n $s2 | hexdump -C
echo -n $s3 | hexdump -C
if [ "$s1" = "$s2" ]; then # True
echo s1 = s2
fi
if [ "$s1" = "$s3" ]; then # NOT true! Why?
echo s1 = s3
fi
>foo
00000000 62 61 72 20 66 6f 6f |bar foo|
00000007
00000000 62 61 72 20 66 6f 6f |bar foo|
00000007
00000000 62 61 72 20 66 6f 6f |bar foo|
00000007
s1 = s2
>
Quote the variables when echoing.
echo -n "$s3" | hexdump -C
You'll see a newline between the file names, as ls behaves as if -1 were given when its output is not a terminal.
Your demo would be more convincing with echo -n "$s1" etc. That would show that there's a newline in the middle of s3 where there's a space in s1 and s2. The echo without the double quotes mangles the newline into a space (and generally each sequence of one or more white space characters in the string into a single space).
Given:
#!/bin/bash
s1="bar foo"
s2="bar foo"
s3=`ls`
echo -n "$s1" | hexdump -C
echo -n "$s2" | hexdump -C
echo -n "$s3" | hexdump -C
if [ "$s1" = "$s2" ]; then # True
echo s1 = s2
fi
if [ "$s1" = "$s3" ]; then # NOT true because s3 contains a newline!
echo s1 = s3
fi
I get:
$ sh foo
00000000 2d 6e 20 62 61 72 20 66 6f 6f 0a |-n bar foo.|
0000000b
00000000 2d 6e 20 62 61 72 20 66 6f 6f 0a |-n bar foo.|
0000000b
00000000 2d 6e 20 62 61 72 0a 66 6f 6f 0a |-n bar.foo.|
0000000b
s1 = s2
$ bash foo
00000000 62 61 72 20 66 6f 6f |bar foo|
00000007
00000000 62 61 72 20 66 6f 6f |bar foo|
00000007
00000000 62 61 72 0a 66 6f 6f |bar.foo|
00000007
s1 = s2
$
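The same effect can be reproduced in isolation. This is only a sketch, assuming a scratch directory made with mktemp that contains just the two files bar and foo; the names unquoted and quoted are illustrative:

```shell
orig_dir=$PWD
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch bar foo

s1="bar foo"
s3=$(ls)                 # ls is not writing to a terminal: one name per line

# Unquoted: the shell splits $s3 on whitespace and echo rejoins with spaces.
unquoted=$(echo $s3)
# Quoted: the embedded newline is passed through untouched.
quoted=$(echo "$s3")

printf '%s' "$unquoted" | od -c
printf '%s' "$quoted" | od -c

if [ "$s1" = "$s3" ]; then
    echo "s1 = s3"
else
    echo "s1 != s3 (s3 contains a newline)"
fi

cd "$orig_dir" || exit 1
rm -rf "$workdir"
```

The od -c output makes the \n visible, which the unquoted echo in the original script was hiding.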
This is the script I've constructed
It takes a list of files according to the extension supplied as an argument.
It then removes everything before the pattern 00000000: in those files.
The pattern 00000000: is preceded by the string <pre>, it then removes those five first characters.
The script then removes the last three lines of the file.
The script then outputs only the hexdump data of the file.
The script runs xxd to convert the hexdump to a file.jpg
if [[ $# -eq 0 ]] ; then
echo 'Run script as ./hexconv ext'
exit 0
fi
for file in *.$1
do
filename=$(basename $file)
extension="${filename##*.}"
filename="${filename%.*}"
sed -n '/00000000:/,$p' $file | sed '1s/^.....//' | head -n -3 | awk '{print $2" "$3" "$4" "$5" "$6" "$7" "$8" "$9" "$10" "$11" "$12" "$13" "$14" "$15" "$16" "$17}' | xxd -p -r > $filename.jpg
done
It works as I want it to, but I suspect there are ways to improve it; alas, I am a novice in the use of awk and sed.
Excerpt from file
<th>response-head:</th>
<td>HTTP/1.1 200 OK
Date: Sun, 15 Dec 2013 04:27:04 GMT
Server: PWS/8.0.18
X-Px: ms h0-s34.p6-lhr ( h0-s35.p6-lhr), ht-d h0-s35.p6-lhr.cdngp.net
Etag: "4556354-9fbf8-4e40387aadfc0"
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0, max-age=0
Accept-Ranges: bytes
Content-Length: 654328
Content-Type: image/jpeg
Last-Modified: Thu, 15 Aug 2013 21:55:19 GMT
Pragma: no-cache
</td>
</tr>
</table>
<hr/>
<pre>00000000: ff d8 ff e0 00 10 4a 46 49 46 00 01 01 01 00 48 ......JFIF.....H
00000010: 00 48 00 00 ff e1 00 18 45 78 69 66 00 00 49 49 .H......Exif..II
00000020: 2a 00 08 00 00 00 00 00 00 00 00 00 00 00 ff ed *...............
00000030: 00 48 50 68 74 73 68 70 20 33 2e 30 00 .HPhotoshop 3.0.
00000040: 38 42 49 4d 04 04 00 00 00 00 00 1c 01 5a 00 8BIM..........Z.
00000050: 03 1b 25 47 1c 02 00 00 02 00 02 00 38 42 49 4d ..%G........8BIM
00000060: 04 25 00 00 00 00 00 10 fc e1 89 c8 b7 c9 78 .%.............x
00000070: 34 62 34 07 58 77 eb ff e1 03 a5 68 74 74 70 /4b4.Xw.....http
00000080: 3a 6e 73 2e 61 64 62 65 2e 63 6d ://ns.adobe.com/
00000090: 78 61 70 31 2e 30 00 3c 78 70 61 63 6b xap/1.0/.<?xpack
000000a0: 65 74 20 62 65 67 69 6e 3d 22 ef bb bf 22 20 69 et begin="..." i
000000b0: 64 3d 22 57 35 4d 30 4d 70 43 65 68 69 48 7a 72 d="W5M0MpCehiHzr
000000c0: 65 53 7a 4e 54 63 7a 6b 63 39 64 22 3e 20 3c eSzNTczkc9d"?> <
000000d0: 78 3a 78 6d 70 6d 65 74 61 20 78 6d 6c 6e 73 3a x:xmpmeta xmlns:
000000e0: 78 3d 22 61 64 62 65 3a 6e 73 3a 6d 65 74 61 x="adobe:ns:meta
000000f0: 22 20 78 3a 78 6d 70 74 6b 3d 22 41 64 62 /" x:xmptk="Adob
00000100: 65 20 58 4d 50 20 43 72 65 20 35 2e 30 2d 63 e XMP Core 5.0-c
00000110: 30 36 31 20 36 34 2e 31 34 30 39 34 39 2c 20 32 061 64.140949, 2
00000120: 30 31 30 31 32 30 37 2d 31 30 3a 35 37 3a 010/12/07-10:57:
Although @CodeGnome is right and this might belong on Code Review SE, here you go anyway:
Slightly more efficient to combine the multiple sed commands into one, for example:
sed -n -e 's/^<pre>//' -e '/00000000:/,$p'
I decided to retract this part, as I'm not all that sure it's any better or clearer. Your version is fine, except that s/^<pre>// is better than s/^.....//.
Use exit 1 when checking the number of arguments to signal an error
What is for file in *. there? Iterate for all files ending with a dot? Typo?
Unless you're 100% sure the filenames will never contain spaces, you should quote them, but don't quote where you don't need, for example:
filename=$(basename "$file") # need to quote
extension=${filename##*.} # no need,
filename=${filename%.*} # no need
sed ... "$file" # need to quote
... | xxd > "$filename".jpg # need to quote
The last awk could be shorter and less error prone as a loop:
... | awk '{printf $2; for (i=3; i<=17; ++i) printf " " $i; print ""}'
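To check what the loop prints, here is a small sketch run on the first line of the excerpt from the question. The function name hex_fields and the variable sample are made up for the example, and it uses printf "%s" so that the data is never interpreted as a format string:

```shell
# One line of the dump: offset, 16 hex bytes, then the ASCII column.
sample='00000000: ff d8 ff e0 00 10 4a 46 49 46 00 01 01 01 00 48 ......JFIF.....H'

# Print fields 2 through 17 (the 16 hex bytes) separated by single spaces.
hex_fields() {
    awk '{ printf "%s", $2; for (i = 3; i <= 17; ++i) printf " %s", $i; print "" }'
}

out=$(printf '%s\n' "$sample" | hex_fields)
printf '%s\n' "$out"
```

Note that a line with fewer than 16 bytes would pull part of the ASCII column into the output, so this relies on the short final lines being removed earlier in the pipeline.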
It seems you want to learn. You might be interested in this other answer too: What are the rules to write robust shell scripts?
The error message should be sent to stderr, should not hard-code the name of the script in case you rename it later, and should exit with a nonzero value.
if (( ! $# )); then
echo >&2 "Run script as '$0' \$extension"
exit 1
fi
If you're going to put the then on the same line as the if, then you should put the do on the same line as the for, too, for consistency:
for file in *.$1; do
Using file for the full name and filename for the basename is confusing variable name choice. I would use basename for the variable, to match the operation. And you need to quote the parameter expansion:
basename=$(basename "$file")
But you don't need to quote the right hand side of an assignment:
extension=${basename##*.}
The part of a filename without the extension is sometimes called the root (in vi and csh :-modifiers, you get it with :r)... using that name would be less confusing than changing an existing variable and reusing it:
root=${basename%.*}
As far as the actual pipeline, I would reorder it to put the head before the awk, since the sed and the head are all about what lines to print out and should be grouped together before the awk which modifies those selected lines. I would also use a loop and printf to make the awk a little more wieldy:
sed -n '/0\{8\}:/,$p' "$file" |
head -n -3 |
awk '{ printf "%s", $2; for (f=3;f<=17;++f) { printf " %s", $f }; print "" }' |
xxd -p -r > "$root.jpg"
done
I hope you can give me an idea about what's going wrong.
The scenario:
I run gitweb (CGI) with a script in FastCGI mode:
#!/bin/sh
export FCGI_SOCKET_PATH=127.0.0.1:7001
su git -c "/var/www/vh_[vhost]/htdocs/gitweb.cgi --fastcgi &"
Then I use nginx to serve that content:
...
fastcgi_pass 127.0.0.1:7001;
...
Everything works as expected, but here's the problem:
$ wget "http://git.[host].de/?p=[repo].git;a=summary" -O /tmp/test.txt && file --mime-encoding /tmp/test.txt
> /tmp/test.txt: iso-8859-1
$ su git -c "./gitweb.cgi \"?p=[repo].git;a=summary\" > ./test" && file --mime-encoding ./test
> ./test: utf-8
Which obviously means that fast-cgi output is utf8 while content served by nginx is iso-8859-1.
Firebug's response headers:
Server nginx
Date Fri, 02 Sep 2011 14:14:08 GMT
Content-Type application/xhtml+xml; charset=utf-8
Transfer-Encoding chunked
Connection close
It looks like the transfer using the socket leads to an encoding problem.
I've tested a lot but can't figure out how to solve this.
Although you aren't using PHP, I found the fix for my issue by wrapping the pieces that were being exposed as ISO-8859-1 with utf8_encode(): http://php.net/manual/en/function.utf8-encode.php
If your CGI is in Perl, maybe http://perldoc.perl.org/utf8.html will solve your problem. It solved mine ... Zürich
Another option could be to add the following to the http { } statement in your nginx.conf:
charset utf-8;
-sd
I can make it work by using fcgiwrap.
I thought some environment variables were different between the two methods, so I added the following code to the gitweb.cgi dispatch() sub:
open my $tmplogfile, ">", "/tmp/gitweb-env.txt";
foreach my $varkey (sort keys %ENV) {
print $tmplogfile "$varkey = $ENV{$varkey}\n";
}
close $tmplogfile;
but the environments were the same.
fcgiwrap may be doing something differently; I have not yet found what.
Here are the commands I use and the differences I found using tcpdump on the fcgi socket:
# gitweb spawned by fcgiwrap outputs utf-8
/usr/bin/spawn-fcgi -d /usr/share/gitweb -a 127.0.0.1 -p 3000 -u www-data -g gitolite -P /run/gitweb/gitweb.cgi.pid -- /usr/sbin/fcgiwrap
# Require the following nginx gitweb_fastcgi_params
# fastcgi_param QUERY_STRING $query_string;
# fastcgi_param REQUEST_METHOD $request_method;
# fastcgi_param SCRIPT_NAME $fastcgi_script_name;
# fastcgi_param DOCUMENT_ROOT $document_root;
# With the following nginx configuration
# upstream gitweb {
# server 127.0.0.1:3000;
# }
#
# server {
# listen 80;
#
# server_name git.example.net;
#
# root /usr/share/gitweb;
#
# access_log /var/log/nginx/gitweb-access.log;
# error_log /var/log/nginx/gitweb-errors.log;
#
# location / {
# alias /usr/share/gitweb/gitweb.cgi;
# include gitweb_fastcgi_params;
# fastcgi_pass gitweb;
# }
#
# location /static {
# alias /usr/share/gitweb/static;
# expires 31d;
# }
# }
# STDOUT captured on lo
# Begin of the FCGI answer
# 00000000 01 06 00 01 1f f8 00 00 53 74 61 74 75 73 3a 20 ........ Status:
# 00000010 32 30 30 20 4f 4b 0d 0a 43 6f 6e 74 65 6e 74 2d 200 OK.. Content-
# 00000020 54 79 70 65 3a 20 61 70 70 6c 69 63 61 74 69 6f Type: ap plicatio
# 00000030 6e 2f 78 68 74 6d 6c 2b 78 6d 6c 3b 20 63 68 61 n/xhtml+ xml; cha
# 00000040 72 73 65 74 3d 75 74 66 2d 38 0d 0a 0d 0a 3c 3f rset=utf -8....<?
# 00000050 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 xml vers ion="1.0
# [...]
#
# "Guido Günther" as UTF-8
# 00000FA0 6c 65 3d 22 53 65 61 72 63 68 20 66 6f 72 20 63 le="Sear ch for c
# 00000FB0 6f 6d 6d 69 74 73 20 61 75 74 68 6f 72 65 64 20 ommits a uthored
# 00000FC0 62 79 20 47 75 69 64 6f 20 47 c3 bc 6e 74 68 65 by Guido G..nthe
# 00000FD0 72 22 20 63 6c 61 73 73 3d 22 6c 69 73 74 22 20 r" class ="list"
Before, gitweb --fastcgi was directly spawned by spawn-fcgi:
# gitweb spawned by spawn-fcgi outputs iso-8859-1
/usr/bin/spawn-fcgi -d /usr/share/gitweb -a 127.0.0.1 -p 3000 -u www-data -g gitolite -P /run/gitweb/gitweb.cgi.pid -- /usr/share/gitweb/gitweb.cgi --fastcgi
# STDOUT captured on lo
# Begin of the FCGI answer with "00 46 02" in place of "1f f8 00" for utf-8 output
# 00000000 01 06 00 01 00 46 02 00 53 74 61 74 75 73 3a 20 .....F.. Status:
# 00000010 32 30 30 20 4f 4b 0d 0a 43 6f 6e 74 65 6e 74 2d 200 OK.. Content-
# 00000020 54 79 70 65 3a 20 61 70 70 6c 69 63 61 74 69 6f Type: ap plicatio
# 00000030 6e 2f 78 68 74 6d 6c 2b 78 6d 6c 3b 20 63 68 61 n/xhtml+ xml; cha
# 00000040 72 73 65 74 3d 75 74 66 2d 38 0d 0a 0d 0a 00 00 rset=utf -8......
# 00000050 01 06 00 01 02 88 00 00 3c 3f 78 6d 6c 20 76 65 ........ <?xml ve
# 00000060 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f rsion="1 .0" enco
# 00000070 64 69 6e 67 3d 22 75 74 66 2d 38 22 3f 3e 0a 3c ding="ut f-8"?>.<
# [...]
#
# "Guido Günther" as ISO-8859-1
# 00001128 74 6c 65 3d 22 53 65 61 72 63 68 20 66 6f 72 20 tle="Sea rch for
# 00001138 63 6f 6d 6d 69 74 73 20 61 75 74 68 6f 72 65 64 commits authored
# 00001148 20 62 79 20 47 75 69 64 6f 20 47 fc 6e 74 68 65 by Guid o G.nthe
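As a quick way to tell the two captures apart without reading a full dump, here is a sketch that prints the bytes of the name fragment in both encodings. The helper hex_of is a made-up name, and the byte values are taken from the tcpdump output above (c3 bc for UTF-8, fc for ISO-8859-1):

```shell
# Print stdin as space-separated hex bytes on one line.
hex_of() {
    od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# "Günther" with the u-umlaut written as octal escapes for portability.
utf8_hex=$(printf 'G\303\274nther' | hex_of)     # UTF-8: two bytes, c3 bc
latin1_hex=$(printf 'G\374nther' | hex_of)       # ISO-8859-1: one byte, fc

printf 'utf-8:      %s\n' "$utf8_hex"
printf 'iso-8859-1: %s\n' "$latin1_hex"
```

Piping a saved response body through the same helper and looking for fc versus c3 bc shows which encoding the server actually sent, independent of what the Content-Type header claims.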
The following is a bash file I wrote to convert all C++ style(//) comments in a C file to C style(/**/).
#!/bin/bash
lang=`echo $LANG`
# It's necessary to change the local setting. I don't know why.
export LANG=C
# Can comment the following statement if there is not dos2unix command.
dos2unix -q $1
sed -i -e 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' $1
export LANG=$lang
It works. But I found a problem I cannot explain. By default, my locale setting is en_US.UTF-8. And in my C code, there are comments written in Chinese, such as
// some english 一些中文注释
If I don't change the locale setting, i.e., do not run the statement export LANG=C, I'll get
/* some english */一些中文注释
instead of
/* some english 一些中文注释*/
I don't know why. I just found a solution by trial and error.
After reading Jonathan Leffler's answer, I think I made a mistake that led to some misunderstanding. In the question, those Chinese words were typed in Google Chrome and were not the actual words in my C file. 一些中文注释 just means "some Chinese comments".
Now I typed // some english 一些中文注释 in Visual C++ 6.0 on Windows XP, and copied the C file to Debian. Then I just ran sed -i -e 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' $1 and got
/* some english 一些 */中文注释
I think it's the different character encodings (GB18030, GBK, UTF-8?) that cause the different results.
The following are my results on Debian:
~/sandbox$ uname -a
Linux xyt-dev 2.6.30-1-686 #1 SMP Sat Aug 15 19:11:58 UTC 2009 i686 GNU/Linux
~/sandbox$ echo $LANG
en_US.UTF-8
~/sandbox$ cat tt.c | od -c -t x1
0000000 / / s o m e e n g l i s h
2f 2f 20 73 6f 6d 65 20 65 6e 67 6c 69 73 68 20
0000020 322 273 320 251 326 320 316 304 327 242 312 315
d2 bb d0 a9 d6 d0 ce c4 d7 a2 ca cd
0000034
~/sandbox$ ./convert_comment_style_cpp2c.sh tt.c
~/sandbox$ cat tt.c | od -c -t x1
0000000 / * s o m e e n g l i s h
2f 2a 20 20 73 6f 6d 65 20 65 6e 67 6c 69 73 68
0000020 322 273 320 251 * / 326 320 316 304 327 242 312 315
20 d2 bb d0 a9 20 2a 2f d6 d0 ce c4 d7 a2 ca cd
0000040
~/sandbox$
I think these Chinese characters are encoded with 2 bytes each (Unicode).
Here is another example:
~/sandbox$ cat tt.c | od -c -t x1
0000000 / / I n W i n d o w : 250 250 ?
2f 2f 20 49 6e 57 69 6e 64 6f 77 3a 20 a8 a8 3f
0000020 1 ?
31 3f
0000022
~/sandbox$ ./convert_comment_style_cpp2c.sh tt.c
~/sandbox$ cat tt.c | od -c -t x1
0000000 / * I n W i n d o w : *
2f 2a 20 20 49 6e 57 69 6e 64 6f 77 3a 20 20 2a
0000020 / 250 250 ? 1 ?
2f a8 a8 3f 31 3f
Which platform are you working on? Your sed script works fine on MacOS X without changing locale. The Linux terminal was less happy with the Chinese characters, but it is not setup to use UTF-8. Moreover, a hex dump of the string that it did get contained a zero byte 0x00 where the Chinese started, which might lead to the confusion. (I note that your regex adds a space before the comment text if it starts // with a space.)
MacOS X (10.6.8)
The 'odx' command I use is a hex-dump program.
$ echo "// some english 一些中文注释" > x3.utf8
$ odx x3.utf8
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20 // some english
0x0010: E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 E9 ................
0x0020: 87 8A 0A ...
0x0023:
$ utf8-unicode x3.utf8
0x2F = U+002F
0x2F = U+002F
0x20 = U+0020
0x73 = U+0073
0x6F = U+006F
0x6D = U+006D
0x65 = U+0065
0x20 = U+0020
0x65 = U+0065
0x6E = U+006E
0x67 = U+0067
0x6C = U+006C
0x69 = U+0069
0x73 = U+0073
0x68 = U+0068
0x20 = U+0020
0xE4 0xB8 0x80 = U+4E00
0xE4 0xBA 0x9B = U+4E9B
0xE4 0xB8 0xAD = U+4E2D
0xE6 0x96 0x87 = U+6587
0xE6 0xB3 0xA8 = U+6CE8
0xE9 0x87 0x8A = U+91CA
0x0A = U+000A
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8
/* some english 一些中文注释 */
$
All of which looks clean and tidy.
Linux (RHEL 5)
I copied the x3.utf8 file to a Linux box, and dumped it. Then I ran the sed script on it, and all seemed OK:
$ odx x3.utf8
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20 // some english
0x0010: E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 E9 ................
0x0020: 87 8A 0A ...
0x0023:
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8 | odx
0x0000: 2F 2A 20 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 /* some english
0x0010: 20 E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 ...............
0x0020: E9 87 8A 20 2A 2F 0A ... */.
0x0027:
$
So far, so good. I also tried:
$ echo $LANG
en_US.UTF-8
$ echo $LC_CTYPE
$ env | grep LC_
$ bash --version
GNU bash, version 3.2.25(1)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2005 Free Software Foundation, Inc.
$ cat x3.utf8
// some english 一些中文注释
$ echo $(<x3.utf8)
// some english 一些中文注释
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8
/* some english 一些中文注释 */
$
So, the terminal is nominally working in UTF-8 after all, and it certainly seems to display the data OK.
However, if I echo the string at the terminal, it gets into a tizzy. When I cut'n'pasted the string to the Linux terminal, it said:
$ echo "// some english d8d^G:
> "
// some english d8d:
$
and beeped.
$ echo "// some english d8d^G:
> " | odx
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20 // some english
0x0010: 64 38 64 07 3A 0A 0A d8d.:..
0x0017:
$
I'm not quite sure what to make of that. I think it means that something in the input side of bash is having some problems, but I'm not quite sure. I also am getting slightly inconsistent results. The first time I tried it, I got:
$ cat > xxx
's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;'
// some english d8^#d:^[d8-f^Gf3(i^G
$ odx xxx
0x0000: 27 73 3B 5E 5C 28 5B 5B 3A 62 6C 61 6E 6B 3A 5D 's;^\([[:blank:]
0x0010: 5D 2A 5C 29 2F 2F 5C 28 2E 2A 5C 29 3B 5C 31 2F ]*\)//\(.*\);\1/
0x0020: 2A 20 5C 32 20 2A 2F 3B 27 0A 2F 2F 20 73 6F 6D * \2 */;'.// som
0x0030: 65 20 65 6E 67 6C 69 73 68 20 64 38 00 64 3A 1B e english d8.d:.
0x0040: 64 38 2D 66 07 66 33 28 69 07 0A 0A d8-f.f3(i...
0x004C:
$
And in that hex dump, you can see a 0x00 byte (offset 0x003C). That appears at the position where you got the end comment, and a null there could confuse sed; but the whole input is such a mess it is hard to know what to make of it.
Okay, here's the correct answer...
The GNU regular expression library (regex) doesn't match everything when you put a . in your expression. Yup, I know how braindead that sounds.
The problem comes from the word "character". Reasonable people will say that everything in the input file for sed is characters, and even in your case they are perfectly correct. But regex has been programmed to require that the input be perfectly correctly formatted characters of the current locale's character set (UTF-8). If they're correctly formatted characters for a different character set (the two-byte sequences in your od dump look like a GBK-family encoding from Windows rather than UTF-8), they're not "characters".
So as . only matches "characters" it doesn't match your characters.
If you used the regex //.*$, ie: pinned it to the end of the line it wouldn't match at all because there's something that's not a "character" between the // and the end of the line.
And no you can't do anything like //\(.\|[^.]\)*$, it's just impossible to match those characters without switching to the C locale.
This will also, sometimes, destroy 8-bit transparency; ie: a binary piped through sed will get corrupted even if no changes are made.
Fortunately the C locale still uses the reasonable interpretation so anything that's not a perfectly correctly formatted ASCII-68 character is still a "character".
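A minimal sketch of that workaround, scoped to the single sed call instead of exporting LANG for the whole script. It assumes the twelve non-UTF-8 bytes shown in the question's od dump, and the temporary file and the variable converted are illustrative only. The behaviour described here is what GNU sed does; other sed implementations may treat invalid bytes differently:

```shell
tmp=$(mktemp)

# "// some english " followed by the raw bytes d2 bb d0 a9 d6 d0 ce c4 d7 a2 ca cd
printf '// some english \322\273\320\251\326\320\316\304\327\242\312\315\n' > "$tmp"

# LC_ALL=C applies only to this command: every byte counts as a character,
# so .* runs to the end of the line and the closing */ lands after the comment.
converted=$(LC_ALL=C sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' "$tmp")

printf '%s\n' "$converted" | od -c
rm -f "$tmp"
```

Setting the variable on the command line avoids having to save and restore LANG, and LC_ALL takes precedence if LC_CTYPE or LC_ALL happens to be set in the environment.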