dieharder - proper binary input file - random

I use the dieharder tool with ASCII-format input files and the results are OK, but now it's time to use binary files instead. When I converted my random data to a BIN file as described below, no bias was seen in any test. The documentation mentions a raw binary input format when running on my Ubuntu machine, but what should such a file look like? My file content is as follows:
(UINT32 as bitstream in file)
0001110000111111000011101110111001111001010000000101110111111011111010011111001011111100100001 ...
call program as:
dieharder -g 201 -f <myFile.bin> -a
some sample probes of my input values:
473894638
00011100001111110000111011101110
2034261499
01111001010000000101110111111011
3925015684
11101001111100101111110010000100
...
All p-values will remain at 0.00000 when applying that binary format file.

I am curious how you wrote the .bin file. I guess you wrote the binary digits as ASCII characters, but that is not the proper file_input_raw format that dieharder needs. You should write the file as raw bytes (binary), not ASCII. I hope this post helps; please comment if it does not :)
I tested several files generated with MT19937 (Mersenne Twister) and found the proper input file format.
When you write a binary file for a dieharder test, keep the following 2 things in mind:
Remove the header (6 lines, from ####... to numbit: 32).
Convert the integers from ASCII to little-endian bytes.
Dieharder Test example
The data below comes from MT19937 (32-bit, NOT 64-bit) implemented in Go, with seed=0, generating 10,000,000 integers.
(Decimal) ASCII example
#==================================================================
# generator MT19937 seed = 0
#==================================================================
type: d
count: 10000000
numbit: 32
2357136044
2546248239
3071714933
3626093760
...
Binary file example
This is what I see with the VSCode Hex Editor:
AC 0A 7F 8C 2F AA C4 97 75 A6 16 B7 C0 CC 21 D8
43 B3 4E 9A FB 52 A2 DB C3 76 7D 8B 67 7D E5 D8
09 A4 74 6C D3 DE A1 9F 15 51 59 A5 F2 D6 66 62
24 B7 05 70 57 3A 2B 4C 46 3C 4B E4 D8 BD 84 0E
58 9A B2 F6 8C CD CC 45 3A 39 29 62 C1 42 48 7A
E6 7D AE CA 27 4A EA CF 57 A8 65 87 AE C8 DF 7A
58 5E 6B 91 51 8B 8D 64 A5 E6 F3 EC 19 42 09 D6
...
The first value is 2357136044 = 0x8C7F0AAC, and the file's first 4 bytes are 'AC' '0A' '7F' '8C'. That shows 2 things: there is no header, and the byte order is little-endian.
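The two steps (drop the header, write each integer as 4 little-endian bytes) can be sketched in Python as follows. This is only an illustration, not the code used for the tests in this post; the function name ascii_to_raw and the rule used to skip header lines are assumptions.

```python
import struct

def ascii_to_raw(lines):
    """Convert dieharder ASCII input lines to raw little-endian uint32 bytes."""
    out = bytearray()
    for line in lines:
        text = line.strip()
        # Skip blank lines, '#' comment lines and the 'type:', 'count:',
        # 'numbit:' header lines (assumed to be the only lines with a colon).
        if not text or text.startswith("#") or ":" in text:
            continue
        out += struct.pack("<I", int(text))  # '<I' = little-endian unsigned 32-bit
    return bytes(out)

sample = [
    "# generator MT19937 seed = 0",
    "type: d",
    "count: 10000000",
    "numbit: 32",
    "2357136044",
    "2546248239",
]
raw = ascii_to_raw(sample)
print(raw.hex(" ").upper())  # AC 0A 7F 8C 2F AA C4 97

# A string of '0'/'1' characters, as in the question, must first be parsed
# into an integer; writing the text itself gives an ASCII file.
bits = "00011100001111110000111011101110"
value = int(bits, 2)
print(value, struct.pack("<I", value).hex(" ").upper())
```

The first printed line reproduces the first 8 bytes of the hex dump above.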
Code in Golang
I know the code below may not help you directly. As far as I know, there is no official pure MT19937 generator in Go, so I ported the pseudo-code from the wiki to Go (1.17.1) myself.
// Requires imports: "bufio", "encoding/binary", "log", "os".
// NewMT19937 and NextUint32 come from my own MT19937 port (not shown).
littleEndianFile, err := os.Create("./MT19937_LittleEndian.bin")
if err != nil {
	log.Fatal(err)
}
defer littleEndianFile.Close()
littleEndianFileBuffer := bufio.NewWriter(littleEndianFile)
littleEndianByte := make([]byte, 4)
test := NewMT19937(0) // seed = 0
for i := 0; i < 10000000; i++ {
	// Encode each 32-bit output as 4 little-endian bytes, with no header.
	binary.LittleEndian.PutUint32(littleEndianByte, test.NextUint32())
	if _, err := littleEndianFileBuffer.Write(littleEndianByte); err != nil {
		log.Fatal(err)
	}
}
if err := littleEndianFileBuffer.Flush(); err != nil {
	log.Fatal(err)
}
Result - (Decimal) ASCII example
> dieharder -a -g 202 -f ./generated/MT19937_10000000.dat
#=============================================================================#
# dieharder version 3.31.1 Copyright 2003 Robert G. Brown #
#=============================================================================#
rng_name | filename |rands/second|
file_input|./generated/MT19937_10000000.dat| 7.79e+06 |
#=============================================================================#
test_name |ntup| tsamples |psamples| p-value |Assessment
#=============================================================================#
diehard_birthdays| 0| 100| 100|0.63638992| PASSED
diehard_operm5| 0| 1000000| 100|0.00012670| WEAK
diehard_rank_32x32| 0| 40000| 100|0.93085433| PASSED
diehard_rank_6x8| 0| 100000| 100|0.07088597| PASSED
diehard_bitstream| 0| 2097152| 100|0.10456387| PASSED
Result - Little-Endian example
> dieharder -a -g 201 -f ./generated/MT19937_10000000_LittleEndian.bin
#=============================================================================#
# dieharder version 3.31.1 Copyright 2003 Robert G. Brown #
#=============================================================================#
rng_name | filename |rands/second|
file_input_raw|./generated/MT19937_10000000_LittleEndian.bin| 5.60e+07 |
#=============================================================================#
test_name |ntup| tsamples |psamples| p-value |Assessment
#=============================================================================#
diehard_birthdays| 0| 100| 100|0.63638992| PASSED
diehard_operm5| 0| 1000000| 100|0.00012670| WEAK
diehard_rank_32x32| 0| 40000| 100|0.93085433| PASSED
diehard_rank_6x8| 0| 100000| 100|0.07088597| PASSED
diehard_bitstream| 0| 2097152| 100|0.10456387| PASSED
You can see that the 2 tests above (decimal ASCII and little-endian) give the same results (p-values).
Result - Big-Endian example
> dieharder -a -g 201 -f ./generated/MT19937_10000000_BigEndian.bin
#=============================================================================#
# dieharder version 3.31.1 Copyright 2003 Robert G. Brown #
#=============================================================================#
rng_name | filename |rands/second|
file_input_raw|./generated/MT19937_10000000_BigEndian.bin| 5.65e+07 |
#=============================================================================#
test_name |ntup| tsamples |psamples| p-value |Assessment
#=============================================================================#
diehard_birthdays| 0| 100| 100|0.46325487| PASSED
diehard_operm5| 0| 1000000| 100|0.00000093| FAILED
diehard_rank_32x32| 0| 40000| 100|0.93085433| PASSED
diehard_rank_6x8| 0| 100000| 100|0.27138035| PASSED
diehard_bitstream| 0| 2097152| 100|0.75581067| PASSED
diehard_opso| 0| 2097152| 100|0.25961325| PASSED
diehard_oqso| 0| 2097152| 100|0.00025268| WEAK
However, you can see that some p-values differ between the results above and the big-endian file. That shows that a proper dieharder raw input file should be written as little-endian binary.
Conclusion and Comments
I am afraid that you wrote the binary digits as ASCII characters. If you can read the data in a normal text editor such as Windows Notepad, it was written as ASCII characters and is an improper input file. You have to write little-endian binary instead. The test results in this post show that little-endian is right and that file_input_raw does not need a header.
I am not sure whether little-endian versus big-endian makes a difference when analyzing the test results. In NIST SP800-22, the statistical randomness tests are of the kind "count the number of 0s or 1s" or "check for patterns such as '0101', '001100', etc." I think there is no difference at the "truth level", that is, in whether the generator is random or not.
Still, I recommend writing the binary file in little-endian, because we don't know whether the test's author had a deeper reason for it. We just follow the "proper" directions for use. :)

Related

how to grep a hex data area

I have a hex file, I need to extract a range of it to a text file
From range:
To Range:
I need Output: AC:E4:B5:9A:53:1C
I tried many things, but none of them met the requirements. Output: Binary file filehex matches
grep "["'\x9f\x87\x6f\x11'"-"'\x9f\x87\x70\x11'"]" filehex > test.txt
Hope someone can help me.
Use -a to force the text interpretation of the input.
Use -o to only output the matching part.
The expression you used doesn't make much sense. It matches any character in the set \x9f, \x87, \x6f, and then the range \x11-\x9f, etc.
You are rather interested in something that starts with \x9f\x87\x6f\x11 and ends in \x9f\x87\x70\x11, and there can be anything in between.
You can use cut to remove the leading and trailing 4 bytes.
grep -oa $'\x9f\x87\x6f\x11.*\x9f\x87\x70\x11' hexfile | cut -b5-21
If you know the length of the string will always be 17 bytes, you can use .\{17\} instead of .*.
OK, I've built a random binary $file
with your string at a location where the hd command splits it across two output lines.
Note: regarding k314159's comment, I use hd to produce hexdump output similar to CentOS's hexdump tool.
One shot using sed:
hd $file |sed -e 'N;/ 9f \+\(|.*\n[0-9a-f]\+ \+\|\)87 \+\(|.*\n[0-9a-f]\+ \+\|\)6f \+\(|.*\n[0-9a-f]\+ \+\|\)11 /p;D;'
000161c0 96 7a b2 21 28 f1 b3 32 63 43 93 ff 50 a6 9f 87 |.z.!(..2cC..P...|
000161d0 6f 11 0d 7a a5 a9 81 9e 32 9d fb 71 27 6d 60 f2 |o..z....2..q'm`.|
0002c3a0
Explanation:
N merge next line in current buffer
\(|.*\n[0-9a-f]\+ \+\|\) match a | followed by anything and a newline (\n), then immediately an hexadecimal number and a space OR nothing.
p print current buffer (two lines)
D Delete up to newline in current buffer, keep last line for next sed loop.
The last hexadecimal number, 0002c3a0, corresponds to the size of my binary $file:
printf "%x\n" $(stat -c %s $file)
Using bash + grep:
printf -v var "\x9f\x87\x6f\x11"
IFS=: read -r offset _ < <(grep -abo "$var" $file)
hd $file | sed -ne "$((offset/16-1)),+4p"
000161a0 b7 8f 4a 4d ed 89 6c 0b 25 f9 e7 c9 8c 99 6e 23 |..JM..l.%.....n#|
000161b0 3c ba 80 ec 2e 32 dd f3 a4 a2 09 bd 74 bf 66 11 |<....2......t.f.|
000161c0 96 7a b2 21 28 f1 b3 32 63 43 93 ff 50 a6 9f 87 |.z.!(..2cC..P...|
000161d0 6f 11 0d 7a a5 a9 81 9e 32 9d fb 71 27 6d 60 f2 |o..z....2..q'm`.|
000161e0 15 86 c2 bd 11 d0 08 90 c4 84 b9 80 04 4e 17 f1 |.............N..|
Where you could read your string:
000161c0 9f 87 | ..|
000161d0 6f 11 |o. |
For testing, I've built my test file by:
dd if=/vmlinuz bs=90574 count=1 of=/tmp/testfile
printf '\x9f\x87\x6f\x11' >>/tmp/testfile
dd if=/vmlinuz bs=90574 count=1 >>/tmp/testfile
file=/tmp/testfile
Use grep to search the original binary file, not the hex dump. Extending choroba's answer, I think you may have problems with grep trying to interpret your search pattern as UTF-8 or some other encoding. You should temporarily set the environment variable LC_ALL=C so that grep treats each byte individually. Also, you can use the -P option to enable lookbehind and lookahead in your pattern. So your command becomes:
LC_ALL=C grep -oaP $'(?<=\x9f\x87\x6f\x11).*(?=\x9f\x87\x70\x11)' binary-file > test.txt
Proof that it works:
$ echo $'BEFORE\x9f\x87\x6f\x11AC:E4:B5:9A:53:1C\x9f\x87\x70\x11AFTER' | LC_ALL=C grep -oaP $'(?<=\x9f\x87\x6f\x11).*(?=\x9f\x87\x70\x11)'
AC:E4:B5:9A:53:1C
$
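If a scripting language is an option, the same extraction can be done on raw bytes, which avoids locale issues entirely. This is a minimal Python sketch; the function name extract_between is made up for illustration, and the markers are the ones used in the grep commands above.

```python
import re

START = b"\x9f\x87\x6f\x11"
END = b"\x9f\x87\x70\x11"

def extract_between(data, start=START, end=END):
    """Return every byte string found between a start and an end marker."""
    # bytes patterns match byte by byte; DOTALL lets '.' also match b"\n".
    pattern = re.escape(start) + b"(.*?)" + re.escape(end)
    return [m.group(1) for m in re.finditer(pattern, data, re.DOTALL)]

sample = b"BEFORE" + START + b"AC:E4:B5:9A:53:1C" + END + b"AFTER"
found = extract_between(sample)
print(found)  # [b'AC:E4:B5:9A:53:1C']
```

For a real file, read it in binary mode (open(path, "rb").read()) and pass the result to extract_between.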

how to decode a base64 string into human readable characters on gnome terminal

I want to decode a base64-encoded string into human-readable data, and I am looking for the right encoding for it.
This is the command that I am trying:
echo H4sICJVHi14AA2ZsYWcyLnR4dAAzsvLzdHb193O1Kkktyk3KzLNKLjMp4gIAtRX2oBcAAAA= | base64 -d
The above outputs some garbled, non-human-readable data:
�G�^flag2.txt3���tv��s�*I-�M�̳J.3)����
Why are many characters missing?
How can I read all the characters?
My GNOME Terminal is set to UTF-8. Is there a better / wider encoding? How do I set that?
Your Base64-encoded data is binary, a mix of printable and non-printable characters.
Let's see what it actually contains with hexdump:
<<<'H4sICJVHi14AA2ZsYWcyLnR4dAAzsvLzdHb193O1Kkktyk3KzLNKLjMp4gIAtRX2oBcAAAA=' base64 -d | hexdump -C
00000000 1f 8b 08 08 95 47 8b 5e 00 03 66 6c 61 67 32 2e |.....G.^..flag2.|
00000010 74 78 74 00 33 b2 f2 f3 74 76 f5 f7 73 b5 2a 49 |txt.3...tv..s.*I|
00000020 2d ca 4d ca cc b3 4a 2e 33 29 e2 02 00 b5 15 f6 |-.M...J.3)......|
00000030 a0 17 00 00 00 |.....|
00000035
You can extract valid text with the strings command:
<<<'H4sICJVHi14AA2ZsYWcyLnR4dAAzsvLzdHb193O1Kkktyk3KzLNKLjMp4gIAtRX2oBcAAAA=' base64 -d | strings
flag2.txt
J.3)
Or save it to a bin file:
<<<'H4sICJVHi14AA2ZsYWcyLnR4dAAzsvLzdHb193O1Kkktyk3KzLNKLjMp4gIAtRX2oBcAAAA=' >file.bin base64 -d
Let's check what it is:
file file.bin
file.bin: gzip compressed data, was "flag2.txt", last modified: Mon Apr 6 15:15:33 2020, from Unix, original size modulo 2^32 23
Since it is gzip'ed data, let's gunzip it:
<file.bin gunzip
2:NICEONE:termbin:cv4r
Or do it all in one line:
<<<'H4sICJVHi14AA2ZsYWcyLnR4dAAzsvLzdHb193O1Kkktyk3KzLNKLjMp4gIAtRX2oBcAAAA=' base64 -d | gunzip
2:NICEONE:termbin:cv4r
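The same decode-then-decompress chain can be sketched in Python, checking the gzip magic bytes (1f 8b, visible at the start of the hexdump above) before decompressing. Only the standard base64 and gzip modules are used; the variable names are arbitrary.

```python
import base64
import gzip

encoded = "H4sICJVHi14AA2ZsYWcyLnR4dAAzsvLzdHb193O1Kkktyk3KzLNKLjMp4gIAtRX2oBcAAAA="

raw = base64.b64decode(encoded)
# Base64 only encodes bytes; what those bytes mean is a separate question.
# Here they start with the gzip magic number, so decompress them.
is_gzip = raw[:2] == b"\x1f\x8b"
if is_gzip:
    text = gzip.decompress(raw).decode("ascii")
else:
    text = raw.decode("utf-8", errors="replace")
print(text)
```

No terminal encoding would make the raw bytes readable, because they are compressed data rather than text in some other character set.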

Mifare 1K write block but cannot read value block

For the last three days I have been reading about binary blocks and value blocks on the Mifare 1K.
For example, I successfully wrote data to block 1 with this APDU:
< FF D6 00 01 10 61 79 79 69 6C 64 69 7A 66 61 74 69 68 31 31 31
- Start Block 01
- Number of Bytes to Write: 16
- Data: ayyildizfatih111
> 90 00
- Write Binary Block Success
Then I can read as below APDU:
< FF B0 00 01 10
- Data Read at Start Block 01
- Number of Bytes Read: 16
> 61 79 79 69 6C 64 69 7A 66 61 74 69 68 31 31 31 90 00
- ASCII Mode: ayyildizfatih111
- Read Binary Block Success
But when I tried to read it as a value block, I got this error:
< FF B1 00 01 04
- ACR122U Read Value Block
> 63 00
- Operation failed
So my question is what is the difference? When I am writing data, should I use binary blocks or value blocks. Which one is better?
Reading the value block fails because your block 1 is not a value block. Binary data blocks and value blocks share the same memory, the difference is just how you format the contents of the block and how you set the permissions for the block.
In order to turn block 1 into a value block, you would set the block's access bits to allow value block operations (decrement, transfer, restore, and (optional) increment). You would then write the block as a value block (with ACR122U V2.02: either using the Value Block Operation command or using a regular Update Binary Block command).
The format of a value block (when using binary data block operations) is:
+----------+----------+----------+----+----+----+----+
Byte | 0..3 | 4..7 | 8..11 | 12 | 13 | 14 | 15 |
+----------+----------+----------+----+----+----+----+
Data | xxxxxxxx | yyyyyyyy | xxxxxxxx | uu | vv | uu | vv |
+----------+----------+----------+----+----+----+----+
Where xxxxxxxx is a 4 byte signed (2's complement) integer (LSB = byte 0), yyyyyyyy is the inverted value of xxxxxxxx, uu is an address byte (can be used when implementing a backup mechanism), vv is the inverted value of uu.
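As an illustration of that layout, here is a small Python sketch that builds the 16 bytes of a value block and checks whether an arbitrary block is well formed. The function names are made up for this example; how the bytes are sent to the card (APDU, access bits) is not covered.

```python
import struct

def build_value_block(value, addr=0):
    """Build the 16-byte value block layout: x, ~x, x, u, ~u, u, ~u."""
    v = struct.pack("<i", value)           # 4-byte signed integer, LSB first
    inv = bytes(b ^ 0xFF for b in v)       # bitwise inverse of the value
    a = addr & 0xFF
    na = a ^ 0xFF                          # bitwise inverse of the address byte
    return v + inv + v + bytes([a, na, a, na])

def parse_value_block(block):
    """Return (value, addr) if the block follows the format, else None."""
    if len(block) != 16:
        return None
    v, inv, v2 = block[0:4], block[4:8], block[8:12]
    a, na, a2, na2 = block[12:16]
    if v != v2 or inv != bytes(b ^ 0xFF for b in v):
        return None
    if a != a2 or na != na2 or na != a ^ 0xFF:
        return None
    return struct.unpack("<i", v)[0], a

block = build_value_block(1, addr=0)
print(block.hex(" "))                          # 01 00 00 00 fe ff ff ff 01 00 00 00 00 ff 00 ff
print(parse_value_block(block))                # (1, 0)
print(parse_value_block(b"ayyildizfatih111"))  # None
```

The last line shows why the read fails in the question: the text written to block 1 does not follow the value block format.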
Whether you should use binary data blocks or the value block format depends on your application. If you want to store a 4-byte integer value and want to use value block operations, you may prefer the value block format. If you want to store other data, don't need the redundancy of the value block format, and only want to use binary read/write operations, you may prefer to use the block as a free-form binary data block.

CRC16 and data communications

Hi, I have been trying to calculate a CRC for a device that I want to write a software interface for. For simplicity, X is the device and Y is the hardware controller. I am looking for a nudge in the right direction; I think I am on the correct track, just a little confused on a few points.
When the device is idle, it sends the following strings of data every 2 seconds or so, and one byte looks like it is counting up in hex. I assume the 2 bytes between the | | are the CRC. (XX) is the varying byte.
X: 96 10 01 E1 (E4) 01 FF 10 17 | F7 EC | 10 06 E1 96 FE
X: 96 10 01 E1 (E6) 01 FF 10 17 | 7F FA | 10 06 E1 96 FE
X: 96 10 01 E1 (E8) 01 FF 10 17 | C7 9B | 10 06 E1 96 FE
X: 96 10 01 E1 (EA) 01 FF 10 17 | 4F 8D | FE 10 06 E1 96 FE
X: 96 10 01 E1 (EC) 01 FF 10 17 | D7 B6 | FE 10 06 E1 96 FE
X: 96 10 01 E1 (EE) 01 FF 10 17 | 5F A0 | FE 10 06 E1 96 FE
Using reveng with reveng -w 16 -s and the above sets of data I get:
width=16 poly=0x1021 init=0x1e69 refin=true refout=true xorout=0x0000 check=0x3da6 name=(none)
When I intercept a command from the controller I get:
X: 96 10 01 E1 (EE) 01 FF 10 17 | 5F A0 | FE 10 06 E1 96 FE -- Last line before command
Y: E1 10 01 96 (22) 05 01 C0 A8 35 00 10 17 |0B B8| FE 10 06 96 E1 FE
Where (22) is the modifier and |0B B8| is the CRC. How is the 22 derived from the E4? Is it another CRC?
When I sent the same command several times I intercepted the following:
Y: E1100196220501C0A8350010170BB8FE100696E1FE
Y: E11001962A0501C0A835001017C1C7FE100696E1FE
Y: E11001962E0501C0909400101753C8FE100696E1FE
Y: E1100196300501809094001017C3EEFE100696E1FE
Y: E1100196360501C090940010170D48FE100696E1FE
Y: E11001962A0501C09094001017B6F7FE100696E1FE
Y: E11001962A0501C09094001017B6F7FE100696E1FE
Using reveng with reveng -w 16 -s and the above sets of data I get:
width=16 poly=0x1021 init=0xd313 refin=true refout=true xorout=0x0000 check=0x295f name=(none)
The polynomial is the same but init and check vary. Sorry for the long post; here is a summary of my questions:
1) Is it common for say the device to use the same polynomial but different init and check to the controller?
2) Is the constant counting strings from the device used to offset the variable byte used to calculate the checksum? If so what is this mechanism called and what methods could be used to derive the relationship between the count and the byte?
3) Am I on the right track or have I got lost along the way?
Thanks for taking the time to read this and would really appreciate a kick in the right direction.
Drop the first byte off of your X and Y sequences, and then you'll get for both:
width=16 poly=0x1021 init=0xffff refin=true refout=true xorout=0xffff check=0x906e name="X-25"
To wit:
% reveng -w 16 -s 100196220501C0A8350010170BB8 1001962A0501C0A835001017C1C7 1001962E0501C0909400101753C8 100196300501809094001017C3EE 100196360501C090940010170D48 1001962A0501C09094001017B6F7
width=16 poly=0x1021 init=0xffff refin=true refout=true xorout=0xffff check=0x906e name="X-25"
% reveng -w 16 -s 1001E1E401FF1017F7EC 1001E1E601FF10177FFA 1001E1E801FF1017C79B 1001E1EA01FF10174F8D 1001E1EC01FF1017D7B6 1001E1EE01FF10175FA0
width=16 poly=0x1021 init=0xffff refin=true refout=true xorout=0xffff check=0x906e name="X-25"
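For reference, the X-25 parameters reported by reveng can be turned into a short bitwise implementation. This is a sketch in Python; 0x8408 is the bit-reversed form of poly=0x1021, used because refin and refout are true. Which part of the frame is covered, and the order of the two CRC bytes on the wire, remain assumptions to verify against the captures.

```python
def crc16_x25(data):
    """CRC-16/X-25: poly 0x1021 (reflected 0x8408), init 0xFFFF, xorout 0xFFFF."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Reflected algorithm: shift right, apply the reversed polynomial
            # whenever the bit shifted out was set.
            crc = (crc >> 1) ^ 0x8408 if crc & 1 else crc >> 1
    return crc ^ 0xFFFF

# Standard check input; matches the check=0x906e reported by reveng.
print(hex(crc16_x25(b"123456789")))  # 0x906e

# First X frame without its leading 0x96 byte and without the two CRC bytes.
payload = bytes.fromhex("1001E1E401FF1017")
print(hex(crc16_x25(payload)))
```

Compare the second printed value with the two bytes between the | | in the first X line, trying both byte orders.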

Change the local setting to enable sed work correctly, but why?

The following is a bash file I wrote to convert all C++ style(//) comments in a C file to C style(/**/).
#!/bin/bash
lang=`echo $LANG`
# It's necessary to change the local setting. I don't know why.
export LANG=C
# Can comment the following statement if there is not dos2unix command.
dos2unix -q $1
sed -i -e 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' $1
export LANG=$lang
It works, but I found a problem I cannot explain. By default, my locale setting is en_US.UTF-8, and my C code contains comments written in Chinese, such as
// some english 一些中文注释
If I don't change the locale setting, i.e., do not run the statement export LANG=C, I get
/* some english */一些中文注释
instead of
/* some english 一些中文注释*/
I don't know why. I just found a solution by trial and error.
After reading Jonathan Leffler's answer, I think I made a mistake that led to some misunderstanding. In the question, those Chinese words were typed into Google Chrome and are not the actual bytes in my C file. 一些中文注释 just means "some Chinese comments".
Now I typed // some english 一些中文注释 in Visual C++ 6.0 on Windows XP and copied the C file to Debian. Then I just ran sed -i -e 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' $1 and got
/* some english 一些 */中文注释
I think the different character encodings (GB18030, GBK, UTF-8?) cause the different results.
The following is my results gotten on Debian
~/sandbox$ uname -a
Linux xyt-dev 2.6.30-1-686 #1 SMP Sat Aug 15 19:11:58 UTC 2009 i686 GNU/Linux
~/sandbox$ echo $LANG
en_US.UTF-8
~/sandbox$ cat tt.c | od -c -t x1
0000000 / / s o m e e n g l i s h
2f 2f 20 73 6f 6d 65 20 65 6e 67 6c 69 73 68 20
0000020 322 273 320 251 326 320 316 304 327 242 312 315
d2 bb d0 a9 d6 d0 ce c4 d7 a2 ca cd
0000034
~/sandbox$ ./convert_comment_style_cpp2c.sh tt.c
~/sandbox$ cat tt.c | od -c -t x1
0000000 / * s o m e e n g l i s h
2f 2a 20 20 73 6f 6d 65 20 65 6e 67 6c 69 73 68
0000020 322 273 320 251 * / 326 320 316 304 327 242 312 315
20 d2 bb d0 a9 20 2a 2f d6 d0 ce c4 d7 a2 ca cd
0000040
~/sandbox$
I think these Chinese characters are encoded with 2 bytes each (Unicode).
Here is another example:
~/sandbox$ cat tt.c | od -c -t x1
0000000 / / I n W i n d o w : 250 250 ?
2f 2f 20 49 6e 57 69 6e 64 6f 77 3a 20 a8 a8 3f
0000020 1 ?
31 3f
0000022
~/sandbox$ ./convert_comment_style_cpp2c.sh tt.c
~/sandbox$ cat tt.c | od -c -t x1
0000000 / * I n W i n d o w : *
2f 2a 20 20 49 6e 57 69 6e 64 6f 77 3a 20 20 2a
0000020 / 250 250 ? 1 ?
2f a8 a8 3f 31 3f
Which platform are you working on? Your sed script works fine on MacOS X without changing the locale. The Linux terminal was less happy with the Chinese characters, but it is not set up to use UTF-8. Moreover, a hex dump of the string that it did get contained a zero byte 0x00 where the Chinese started, which might lead to the confusion. (I note that your regex adds a space before the comment text if it starts // with a space.)
MacOS X (10.6.8)
The 'odx' command used here is a hex-dump program.
$ echo "// some english 一些中文注释" > x3.utf8
$ odx x3.utf8
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20 // some english
0x0010: E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 E9 ................
0x0020: 87 8A 0A ...
0x0023:
$ utf8-unicode x3.utf8
0x2F = U+002F
0x2F = U+002F
0x20 = U+0020
0x73 = U+0073
0x6F = U+006F
0x6D = U+006D
0x65 = U+0065
0x20 = U+0020
0x65 = U+0065
0x6E = U+006E
0x67 = U+0067
0x6C = U+006C
0x69 = U+0069
0x73 = U+0073
0x68 = U+0068
0x20 = U+0020
0xE4 0xB8 0x80 = U+4E00
0xE4 0xBA 0x9B = U+4E9B
0xE4 0xB8 0xAD = U+4E2D
0xE6 0x96 0x87 = U+6587
0xE6 0xB3 0xA8 = U+6CE8
0xE9 0x87 0x8A = U+91CA
0x0A = U+000A
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8
/* some english 一些中文注释 */
$
All of which looks clean and tidy.
Linux (RHEL 5)
I copied the x3.utf8 file to a Linux box, and dumped it. Then I ran the sed script on it, and all seemed OK:
$ odx x3.utf8
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20 // some english
0x0010: E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 E9 ................
0x0020: 87 8A 0A ...
0x0023:
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8 | odx
0x0000: 2F 2A 20 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 /* some english
0x0010: 20 E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 ...............
0x0020: E9 87 8A 20 2A 2F 0A ... */.
0x0027:
$
So far, so good. I also tried:
$ echo $LANG
en_US.UTF-8
$ echo $LC_CTYPE
$ env | grep LC_
$ bash --version
GNU bash, version 3.2.25(1)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2005 Free Software Foundation, Inc.
$ cat x3.utf8
// some english 一些中文注释
$ echo $(<x3.utf8)
// some english 一些中文注释
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8
/* some english 一些中文注释 */
$
So, the terminal is nominally working in UTF-8 after all, and it certainly seems to display the data OK.
However, if I echo the string at the terminal, it gets into a tizzy. When I cut'n'pasted the string to the Linux terminal, it said:
$ echo "// some english d8d^G:
> "
// some english d8d:
$
and beeped.
$ echo "// some english d8d^G:
> " | odx
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20 // some english
0x0010: 64 38 64 07 3A 0A 0A d8d.:..
0x0017:
$
I'm not quite sure what to make of that. I think it means that something in the input side of bash is having some problems, but I'm not quite sure. I also am getting slightly inconsistent results. The first time I tried it, I got:
$ cat > xxx
's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;'
// some english d8^#d:^[d8-f^Gf3(i^G
$ odx xxx
0x0000: 27 73 3B 5E 5C 28 5B 5B 3A 62 6C 61 6E 6B 3A 5D 's;^\([[:blank:]
0x0010: 5D 2A 5C 29 2F 2F 5C 28 2E 2A 5C 29 3B 5C 31 2F ]*\)//\(.*\);\1/
0x0020: 2A 20 5C 32 20 2A 2F 3B 27 0A 2F 2F 20 73 6F 6D * \2 */;'.// som
0x0030: 65 20 65 6E 67 6C 69 73 68 20 64 38 00 64 3A 1B e english d8.d:.
0x0040: 64 38 2D 66 07 66 33 28 69 07 0A 0A d8-f.f3(i...
0x004C:
$
And in that hex dump, you can see a 0x00 byte (offset 0x003C). That appears at the position where you got the end comment, and a null there could confuse sed; but the whole input is such a mess it is hard to know what to make of it.
Okay, here's the correct answer...
The GNU regular expression library (regex) doesn't match everything when you put a . in your expression. Yup, I know how braindead that sounds.
The problem comes from the word "character". Reasonable people will say that everything in sed's input file is characters, and in your case they are perfectly correct. But regex has been programmed to require that the input be correctly formatted characters of the current locale's character set (UTF-8). If the bytes are correctly formatted characters of a different character set (the hex dump in the question looks like a Windows GBK/GB2312 file), they are not "characters".
So as . only matches "characters", it doesn't match your characters.
If you used the regex //.*$, i.e. pinned it to the end of the line, it wouldn't match at all, because there's something that's not a "character" between the // and the end of the line.
And no, you can't do anything like //\(.\|[^.]\)*$; it's just impossible to match those bytes without switching to the C locale.
This will also, sometimes, destroy 8-bit transparency; i.e. a binary piped through sed can get corrupted even if no changes are made.
Fortunately, the C locale still uses the reasonable interpretation, so anything that's not a correctly formatted ASCII character is still a "character".
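This can be checked against the od dump in the question with a few lines of Python. The sketch assumes the file is GBK-encoded (a guess based on the byte values) and looks for the first byte that cannot be part of a valid UTF-8 sequence.

```python
# The 12 bytes that follow "// some english " in the od dump of tt.c.
raw = bytes.fromhex("d2bbd0a9d6d0cec4d7a2cacd")

try:
    raw.decode("utf-8")
    first_bad = None
except UnicodeDecodeError as err:
    first_bad = err.start  # offset of the first byte that is not valid UTF-8

print("first invalid UTF-8 offset:", first_bad)  # 4
print("decoded as GBK:", raw.decode("gbk"))      # 一些中文注释
```

The first 4 bytes happen to form two valid UTF-8 sequences, so in a UTF-8 locale .* matches them and stops at offset 4. That is exactly where the converted file in the question has its " */" (after d2 bb d0 a9).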

Resources