I tried sed -ne '/\"/!p' theinput > theproduct but that got me nowhere. It didn't do anything. What can I try?
You don't need to escape the quote. Write:
sed '/"/d' theinput > theproduct
or
sed -i '/"/d' theinput
to alter the file directly.
In case you have other quotes, as @Jonathan Leffler suggests, you have to find out which ones. Then you can match them with \x, which specifies a byte by its hexadecimal value (a GNU sed extension).
sed -i '/\x22/d' theinput
The line above would delete all rows in theinput containing the ordinary (ASCII 34) quote. You'll have to try the code points Jonathan suggested.
Try this:
grep -v '"' theinput > theproduct
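As a quick check, the three commands suggested so far can be run side by side on the two-line sample used elsewhere in this thread. This is only a sketch: the temporary directory and the by_* variable names are made up for illustration, and the \x22 form assumes GNU sed.

```shell
# Build the two-line sample: one line with a double quote, one without.
tmp=$(mktemp -d)
printf 'foo"bar\nfoo.bar\n' > "$tmp/theinput"

by_sed=$(sed '/"/d' "$tmp/theinput")       # delete lines containing "
by_hex=$(sed '/\x22/d' "$tmp/theinput")    # same byte as hex (GNU sed extension)
by_grep=$(grep -v '"' "$tmp/theinput")     # keep lines that do not contain "

# All three should print the same surviving line.
printf '%s\n' "$by_sed" "$by_hex" "$by_grep"
rm -rf "$tmp"
```

If the three results differ on your real data, that would point at the input (for example non-ASCII quotes) rather than at the commands.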
The command you showed us should have worked.
$ cat theinput
foo"bar
foo.bar
$ sed -ne '/\"/!p' theinput > theproduct
$ cat theproduct
foo.bar
$
unless you're using csh or tcsh as your interactive shell. In that case, you'd need to escape the ! character, even within quotation marks:
% cat theinput
foo"bar
foo.bar
% sed -ne '/\"/!p' theinput > theproduct
sed -ne '/"/pwd' theinput > theproduct
sed: -e expression #1, char 5: extra characters after command
% rm theproduct
% sed -ne '/\"/\!p' theinput > theproduct
% cat theproduct
foo.bar
%
But that's inconsistent with your statement that "It didn't do anything", so it's not clear what's really going on (and the question is tagged bourne-shell anyway).
But there are much simpler ways to accomplish the same task, particularly the grep command suggested by @Mike Sokolov.
Are you sure you have 'ASCII' input? Could you have Unicode (UTF-8) with characters that are not ASCII 34, or Unicode U+0022, but something else?
Alternative Unicode 'double quotes' could be:
U+2033 DOUBLE PRIME;
U+201C LEFT DOUBLE QUOTATION MARK;
U+201D RIGHT DOUBLE QUOTATION MARK;
U+201F DOUBLE HIGH-REVERSED-9 QUOTATION MARK;
U+02DD DOUBLE ACUTE ACCENT;
(and there could easily be others I've left out).
You can look to debug this with the od command:
$ cat theinput
No double quote here
Double quote " here
Unicode pseudo-double-quotes include “”‟″˝.
$ od -c theinput
0000000 N o d o u b l e q u o t e
0000020 h e r e \n D o u b l e q u o t
0000040 e " h e r e \n U n i c o d e
0000060 p s e u d o - d o u b l e - q
0000100 u o t e s i n c l u d e “ **
0000120 ** ” ** ** ‟ ** ** ″ ** ** ˝ ** . \n
0000136
$ od -x theinput
0000000 6f4e 6420 756f 6c62 2065 7571 746f 2065
0000020 6568 6572 440a 756f 6c62 2065 7571 746f
0000040 2065 2022 6568 6572 550a 696e 6f63 6564
0000060 7020 6573 6475 2d6f 6f64 6275 656c 712d
0000100 6f75 6574 2073 6e69 6c63 6475 2065 80e2
0000120 e29c 9d80 80e2 e29f b380 9dcb 0a2e
0000136
$ odx theinput
0x0000: 4E 6F 20 64 6F 75 62 6C 65 20 71 75 6F 74 65 20 No double quote
0x0010: 68 65 72 65 0A 44 6F 75 62 6C 65 20 71 75 6F 74 here.Double quot
0x0020: 65 20 22 20 68 65 72 65 0A 55 6E 69 63 6F 64 65 e " here.Unicode
0x0030: 20 70 73 65 75 64 6F 2D 64 6F 75 62 6C 65 2D 71 pseudo-double-q
0x0040: 75 6F 74 65 73 20 69 6E 63 6C 75 64 65 20 E2 80 uotes include ..
0x0050: 9C E2 80 9D E2 80 9F E2 80 B3 CB 9D 2E 0A ..............
0x005E:
$ sed '/"/d' theinput > theproduct
$ cat theproduct
No double quote here
Unicode pseudo-double-quotes include “”‟″˝.
$
(odx is my own command for dumping data in hex.)
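If the file does turn out to contain some of those look-alikes, one possible approach is to list their UTF-8 byte sequences explicitly and drop every line that contains any of them. The sketch below takes the byte values from the hex dump above; the strip_quote_lines helper and the variable names are hypothetical, and -a is assumed to be available (GNU or BSD grep).

```shell
# UTF-8 bytes of the look-alikes, written as printf octal escapes.
ldq=$(printf '\342\200\234')   # e2 80 9c  U+201C LEFT DOUBLE QUOTATION MARK
rdq=$(printf '\342\200\235')   # e2 80 9d  U+201D RIGHT DOUBLE QUOTATION MARK
hr9=$(printf '\342\200\237')   # e2 80 9f  U+201F DOUBLE HIGH-REVERSED-9 QUOTATION MARK
dpr=$(printf '\342\200\263')   # e2 80 b3  U+2033 DOUBLE PRIME
dac=$(printf '\313\235')       # cb 9d     U+02DD DOUBLE ACUTE ACCENT

strip_quote_lines() {
    # Fixed-string, byte-wise match: drop any line containing the ASCII
    # quote or one of the sequences above. Reads files given as arguments,
    # or stdin when there are none.
    LC_ALL=C grep -a -v -F -e '"' -e "$ldq" -e "$rdq" -e "$hr9" -e "$dpr" -e "$dac" "$@"
}

# Usage: strip_quote_lines theinput > theproduct
```

The list is not exhaustive; any other code point you find with od would need its own -e entry.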
I have a hex file, I need to extract a range of it to a text file
From range:
To Range:
I need Output: AC:E4:B5:9A:53:1C
I tried many things, but none met the requirement. The output was: Binary file filehex matches
grep "["'\x9f\x87\x6f\x11'"-"'\x9f\x87\x70\x11'"]" filehex > test.txt
Hope someone can help me.
Use -a to force the text interpretation of the input.
Use -o to only output the matching part.
The expression you used doesn't make much sense. It matches any single character in the set \x9f, \x87, \x6f, and then the range \x11-\x9f, etc.
What you actually want is something that starts with \x9f\x87\x6f\x11 and ends with \x9f\x87\x70\x11, with anything in between.
You can use cut to remove the leading and trailing 4 bytes.
grep -oa $'\x9f\x87\x6f\x11.*\x9f\x87\x70\x11' hexfile | cut -b5-21
If you know the length of the string will always be 17 bytes, you can use .\{17\} instead of .*.
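A sketch of the fixed-length variant, tried on a made-up sample line rather than on your real file. The extract_payload helper and the start/end variable names are hypothetical; the delimiters are built with printf octal escapes so the snippet does not depend on bash's $'...' quoting, and LC_ALL=C is assumed so that grep treats every byte as a character.

```shell
start=$(printf '\237\207\157\021')   # bytes 9f 87 6f 11
end=$(printf '\237\207\160\021')     # bytes 9f 87 70 11

extract_payload() {
    # -o prints only the match, -a forces text mode; the match is
    # 4 + 17 + 4 bytes, and cut keeps bytes 5..21, i.e. the 17-byte payload.
    LC_ALL=C grep -oa "${start}.\{17\}${end}" "$@" | cut -b5-21
}

# Sample input with some junk around the delimited payload.
printf 'BEFORE%sAC:E4:B5:9A:53:1C%sAFTER\n' "$start" "$end" | extract_payload
# prints AC:E4:B5:9A:53:1C
```

With the fixed count, grep cannot run past the first end marker the way a greedy .* can when the marker occurs more than once on a line.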
OK, I've built a random binary $file
with your string placed at a location where the hd command splits it across two lines.
Note: regarding k314159's comment, I use hd to produce hexdump output similar to CentOS's hexdump tool.
One shot using sed:
hd $file |sed -e 'N;/ 9f \+\(|.*\n[0-9a-f]\+ \+\|\)87 \+\(|.*\n[0-9a-f]\+ \+\|\)6f \+\(|.*\n[0-9a-f]\+ \+\|\)11 /p;D;'
000161c0 96 7a b2 21 28 f1 b3 32 63 43 93 ff 50 a6 9f 87 |.z.!(..2cC..P...|
000161d0 6f 11 0d 7a a5 a9 81 9e 32 9d fb 71 27 6d 60 f2 |o..z....2..q'm`.|
0002c3a0
Explanation:
N appends the next line to the current buffer
\(|.*\n[0-9a-f]\+ \+\|\) matches either a | followed by anything and a newline (\n), then immediately a hexadecimal number and spaces, OR nothing.
p prints the current buffer (two lines)
D deletes up to the first newline in the current buffer, keeping the last line for the next sed loop.
The last hexadecimal number, 0002c3a0, corresponds to the size of my binary $file:
printf "%x\n" $(stat -c %s $file)
Using bash + grep:
printf -v var "\x9f\x87\x6f\x11"
IFS=: read -r offset _ < <(grep -abo "$var" $file)
hd $file | sed -ne "$((offset/16-1)),+4p"
000161a0 b7 8f 4a 4d ed 89 6c 0b 25 f9 e7 c9 8c 99 6e 23 |..JM..l.%.....n#|
000161b0 3c ba 80 ec 2e 32 dd f3 a4 a2 09 bd 74 bf 66 11 |<....2......t.f.|
000161c0 96 7a b2 21 28 f1 b3 32 63 43 93 ff 50 a6 9f 87 |.z.!(..2cC..P...|
000161d0 6f 11 0d 7a a5 a9 81 9e 32 9d fb 71 27 6d 60 f2 |o..z....2..q'm`.|
000161e0 15 86 c2 bd 11 d0 08 90 c4 84 b9 80 04 4e 17 f1 |.............N..|
Where you could read your string:
000161c0 9f 87 | ..|
000161d0 6f 11 |o. |
For testing, I've built my test file by:
dd if=/vmlinuz bs=90574 count=1 of=/tmp/testfile
printf '\x9f\x87\x6f\x11' >>/tmp/testfile
dd if=/vmlinuz bs=90574 count=1 >>/tmp/testfile
file=/tmp/testfile
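The offsets in the dumps above can be cross-checked from the way the test file was built. This is just the arithmetic, assuming /vmlinuz is at least 90574 bytes long so that each dd really copies a full block; the variable names are made up for illustration.

```shell
chunk=90574        # bs= value used in both dd commands
marker_len=4       # bytes appended by printf '\x9f\x87\x6f\x11'

marker_offset=$chunk                       # marker starts right after the first chunk
total=$((2 * chunk + marker_len))          # chunk + marker + chunk
first_line=$((marker_offset / 16 - 1))     # first hd line selected by the sed range

printf 'marker offset: %08x\n' "$marker_offset"               # 000161ce
printf 'file size:     %08x\n' "$total"                       # 0002c3a0
printf 'line %d starts at offset %08x\n' \
    "$first_line" "$(( (first_line - 1) * 16 ))"              # line 5659, 000161a0
```

So the marker begins at byte 14 of the hd line labelled 000161c0, which is why it is split across two lines, and the final offset printed by hd equals the file size.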
Use grep to search the original binary file, not the hex dump. Extending choroba's answer, I think you may have problems with grep trying to interpret your search pattern as UTF-8 or some other encoding. You should temporarily switch grep to the C locale so that it treats each byte individually; LANG=C does that as long as no LC_ALL or LC_CTYPE setting overrides it, in which case use LC_ALL=C instead. Also, you can use the -P option to enable use of lookbehind and lookahead in your pattern. So your command becomes:
LANG=C grep -oaP $'(?<=\x9f\x87\x6f\x11).*(?=\x9f\x87\x70\x11)' binary-file > test.txt
Proof that it works:
$ echo $'BEFORE\x9f\x87\x6f\x11AC:E4:B5:9A:53:1C\x9f\x87\x70\x11AFTER' | LANG=C grep -oaP $'(?<=\x9f\x87\x6f\x11).*(?=\x9f\x87\x70\x11)'
AC:E4:B5:9A:53:1C
$
I would like to correct a bad encoding for thousands of files. The error is always the same: an unknown char should be replaced with a French é.
$ find . -type f | grep 127427
./documents/1778_commande_127427_accus�_de_r�ception.pdf
$ find . -type f | grep 127427 | hexdump -C
00000000 2e 2f 64 6f 63 75 6d 65 6e 74 73 2f 31 37 37 38 |./documents/1778|
00000010 5f 63 6f 6d 6d 61 6e 64 65 5f 31 32 37 34 32 37 |_commande_127427|
00000020 5f 61 63 63 75 73 ef bf bd 5f 64 65 5f 72 ef bf |_accus..._de_r..|
00000030 bd 63 65 70 74 69 6f 6e 2e 70 64 66 0a |.ception.pdf.|
0000003d
So I am looking for ef bf bd, which does not look like a Unicode char. Unfortunately, looking for the 0xef does not work:
$ find . -type f | grep -P '\xef'
(nothing)
Any clues?
Next I am planning to do something like:
$ find . -type f | grep <magic-here> | xargs -n1 -I{} sh -c 'mv "{}" $(echo "{}" | sed s/<magic-here>/é/) '
Like this:
echo $'\x2e\x2f\x64\x6f\x63\x75\x6d\x65\x6e\x74\x73\x2f\x31\x37\x37\x38\x5f\x63\x6f\x6d\x6d\x61\x6e\x64\x65\x5f\x31\x32\x37\x34\x32\x37\x5f\x61\x63\x63\x75\x73\xef\xbf\xbd\x5f\x64\x65\x5f\x72\xef\xbf\xbd\x63\x65\x70\x74\x69\x6f\x6e\x2e\x70\x64\x66\x0a'\
| grep -Fa $'\xef\xbf\xbd'
-a treats binary files as text. -F performs a fixed-string search, with no regular expressions. $'...' is bash's ANSI-C quoting, which turns the \xHH escapes into the actual bytes.
The find command should look like this:
find ... -exec sed $'s/\xef\xbf\xbd/é/g' {} +
When you are sure that it works, use -i, this will change files in place:
find ... -exec sed -i $'s/\xef\xbf\xbd/é/g' {} +
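Note that sed, with or without -i, rewrites the contents of the files it is given, while the question is about the file names. A minimal sketch of a rename instead, assuming file names without embedded newlines and a find that supports -exec ... {} +. The fix_names helper and the BAD/GOOD variables are made up for illustration; the bytes are written as printf octal escapes.

```shell
# ef bf bd is the UTF-8 encoding of U+FFFD (REPLACEMENT CHARACTER);
# c3 a9 is the UTF-8 encoding of é.
BAD=$(printf '\357\277\275')
GOOD=$(printf '\303\251')
export BAD GOOD    # the inner sh started by find needs to see them

fix_names() {
    # $1: directory to scan. Only the last path component is rewritten,
    # so directories on the way are left alone.
    find "$1" -type f -name "*${BAD}*" -exec sh -c '
        for path in "$@"; do
            dir=$(dirname "$path")
            base=$(basename "$path")
            new=$(printf "%s\n" "$base" | LC_ALL=C sed "s/$BAD/$GOOD/g")
            mv -- "$path" "$dir/$new"
        done' sh {} +
}

# Usage: fix_names ./documents
```

It is probably wise to replace mv with echo mv for a first dry run, since every damaged character is blindly assumed to have been an é.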
I've spent an embarrassingly long time trying to understand why the second conditional in the "foo" script below fails but the first one succeeds.
Please note:
The current directory contains two files: bar and foo.
All three strings $s1, $s2 and $s3 are equal according to hexdump.
Thanks in advance for any help.
Session: (Running on a Centos7 host):
>ls
bar foo
>cat foo
#!/bin/bash
s1="bar foo"
s2="bar foo"
s3=`ls`
echo -n $s1 | hexdump -C
echo -n $s2 | hexdump -C
echo -n $s3 | hexdump -C
if [ "$s1" = "$s2" ]; then # True
echo s1 = s2
fi
if [ "$s1" = "$s3" ]; then # NOT true! Why?
echo s1 = s3
fi
>foo
00000000 62 61 72 20 66 6f 6f |bar foo|
00000007
00000000 62 61 72 20 66 6f 6f |bar foo|
00000007
00000000 62 61 72 20 66 6f 6f |bar foo|
00000007
s1 = s2
>
Quote the variables when echoing.
echo -n "$s3" | hexdump -C
You'll see a newline between the file names, as ls behaves as if -1 were given when its output is not a terminal.
Your demo would be more convincing with echo -n "$s1" etc. That would show that there's a newline in the middle of s3 where there's a space in s1 and s2. The echo without the double quotes mangles the newline into a space (and generally each sequence of one or more white space characters in the string into a single space).
Given:
#!/bin/bash
s1="bar foo"
s2="bar foo"
s3=`ls`
echo -n "$s1" | hexdump -C
echo -n "$s2" | hexdump -C
echo -n "$s3" | hexdump -C
if [ "$s1" = "$s2" ]; then # True
echo s1 = s2
fi
if [ "$s1" = "$s3" ]; then # NOT true because s3 contains a newline!
echo s1 = s3
fi
I get:
$ sh foo
00000000 2d 6e 20 62 61 72 20 66 6f 6f 0a |-n bar foo.|
0000000b
00000000 2d 6e 20 62 61 72 20 66 6f 6f 0a |-n bar foo.|
0000000b
00000000 2d 6e 20 62 61 72 0a 66 6f 6f 0a |-n bar.foo.|
0000000b
s1 = s2
$ bash foo
00000000 62 61 72 20 66 6f 6f |bar foo|
00000007
00000000 62 61 72 20 66 6f 6f |bar foo|
00000007
00000000 62 61 72 0a 66 6f 6f |bar.foo|
00000007
s1 = s2
$
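The effect can be reproduced without ls at all. In this sketch s3 is a stand-in for the ls output (two names separated by a newline), the default IFS is assumed, and printf '%s' is used instead of echo -n because, as the sh run above shows, not every echo understands -n. The unquoted/quoted variable names are made up for illustration.

```shell
s1="bar foo"
s3=$(printf 'bar\nfoo')        # stand-in for `ls`: newline between the names

unquoted=$(echo $s3)           # word splitting: echo receives two words, joins them with a space
quoted=$(printf '%s' "$s3")    # quoting keeps the newline intact

[ "$s1" = "$unquoted" ] && echo "unquoted s3 equals s1"
[ "$s1" = "$quoted" ]   || echo "quoted s3 differs from s1"

# Dump the real contents of s3; the newline shows up as \n.
printf '%s' "$s3" | od -c
```

This is why the original hexdump lines all looked identical: the comparison saw the newline, but the unquoted echo never did.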
The following is a bash file I wrote to convert all C++ style(//) comments in a C file to C style(/**/).
#!/bin/bash
lang=$LANG
# It's necessary to change the locale setting. I don't know why.
export LANG=C
# Comment out the following statement if the dos2unix command is not available.
dos2unix -q "$1"
sed -i -e 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' "$1"
export LANG=$lang
It works. But I found a problem I cannot explain. By default, my locale setting is en_US.UTF-8. And in my C code, there are comments written in Chinese, such as
// some english 一些中文注释
If I don't change the locale setting, i.e., do not run the statement export LANG=C, I'll get
/* some english */一些中文注释
instead of
/* some english 一些中文注释*/
I don't know why. I just found a workaround by trial and error.
After reading Jonathan Leffler's answer, I think I made a mistake that led to some misunderstanding. In the question, those Chinese words were typed in Google Chrome and were not the actual words in my C file. 一些中文注释 just means "some Chinese comments".
Now I typed // some english 一些中文注释 in Visual C++ 6.0 on Windows XP, and copied the C file to Debian. Then I just ran sed -i -e 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' $1 and got
/* some english 一些 */中文注释
I think a different character encoding (GB18030, GBK, UTF-8?) causes the different results.
The following are the results I got on Debian:
~/sandbox$ uname -a
Linux xyt-dev 2.6.30-1-686 #1 SMP Sat Aug 15 19:11:58 UTC 2009 i686 GNU/Linux
~/sandbox$ echo $LANG
en_US.UTF-8
~/sandbox$ cat tt.c | od -c -t x1
0000000 / / s o m e e n g l i s h
2f 2f 20 73 6f 6d 65 20 65 6e 67 6c 69 73 68 20
0000020 322 273 320 251 326 320 316 304 327 242 312 315
d2 bb d0 a9 d6 d0 ce c4 d7 a2 ca cd
0000034
~/sandbox$ ./convert_comment_style_cpp2c.sh tt.c
~/sandbox$ cat tt.c | od -c -t x1
0000000 / * s o m e e n g l i s h
2f 2a 20 20 73 6f 6d 65 20 65 6e 67 6c 69 73 68
0000020 322 273 320 251 * / 326 320 316 304 327 242 312 315
20 d2 bb d0 a9 20 2a 2f d6 d0 ce c4 d7 a2 ca cd
0000040
~/sandbox$
I think these Chinese characters are encoded with 2 bytes each (Unicode).
Here is another example:
~/sandbox$ cat tt.c | od -c -t x1
0000000 / / I n W i n d o w : 250 250 ?
2f 2f 20 49 6e 57 69 6e 64 6f 77 3a 20 a8 a8 3f
0000020 1 ?
31 3f
0000022
~/sandbox$ ./convert_comment_style_cpp2c.sh tt.c
~/sandbox$ cat tt.c | od -c -t x1
0000000 / * I n W i n d o w : *
2f 2a 20 20 49 6e 57 69 6e 64 6f 77 3a 20 20 2a
0000020 / 250 250 ? 1 ?
2f a8 a8 3f 31 3f
Which platform are you working on? Your sed script works fine on MacOS X without changing locale. The Linux terminal was less happy with the Chinese characters, but it is not set up to use UTF-8. Moreover, a hex dump of the string that it did get contained a zero byte 0x00 where the Chinese started, which might lead to the confusion. (I note that your regex adds a space before the comment text if the text after // already starts with a space.)
MacOS X (10.6.8)
The 'odx' command used is a hex-dump program.
$ echo "// some english 一些中文注释" > x3.utf8
$ odx x3.utf8
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20 // some english
0x0010: E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 E9 ................
0x0020: 87 8A 0A ...
0x0023:
$ utf8-unicode x3.utf8
0x2F = U+002F
0x2F = U+002F
0x20 = U+0020
0x73 = U+0073
0x6F = U+006F
0x6D = U+006D
0x65 = U+0065
0x20 = U+0020
0x65 = U+0065
0x6E = U+006E
0x67 = U+0067
0x6C = U+006C
0x69 = U+0069
0x73 = U+0073
0x68 = U+0068
0x20 = U+0020
0xE4 0xB8 0x80 = U+4E00
0xE4 0xBA 0x9B = U+4E9B
0xE4 0xB8 0xAD = U+4E2D
0xE6 0x96 0x87 = U+6587
0xE6 0xB3 0xA8 = U+6CE8
0xE9 0x87 0x8A = U+91CA
0x0A = U+000A
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8
/* some english 一些中文注释 */
$
All of which looks clean and tidy.
Linux (RHEL 5)
I copied the x3.utf8 file to a Linux box, and dumped it. Then I ran the sed script on it, and all seemed OK:
$ odx x3.utf8
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20 // some english
0x0010: E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 E9 ................
0x0020: 87 8A 0A ...
0x0023:
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8 | odx
0x0000: 2F 2A 20 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 /* some english
0x0010: 20 E4 B8 80 E4 BA 9B E4 B8 AD E6 96 87 E6 B3 A8 ...............
0x0020: E9 87 8A 20 2A 2F 0A ... */.
0x0027:
$
So far, so good. I also tried:
$ echo $LANG
en_US.UTF-8
$ echo $LC_CTYPE
$ env | grep LC_
$ bash --version
GNU bash, version 3.2.25(1)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2005 Free Software Foundation, Inc.
$ cat x3.utf8
// some english 一些中文注释
$ echo $(<x3.utf8)
// some english 一些中文注释
$ sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;' x3.utf8
/* some english 一些中文注释 */
$
So, the terminal is nominally working in UTF-8 after all, and it certainly seems to display the data OK.
However, if I echo the string at the terminal, it gets into a tizzy. When I cut'n'pasted the string to the Linux terminal, it said:
$ echo "// some english d8d^G:
> "
// some english d8d:
$
and beeped.
$ echo "// some english d8d^G:
> " | odx
0x0000: 2F 2F 20 73 6F 6D 65 20 65 6E 67 6C 69 73 68 20 // some english
0x0010: 64 38 64 07 3A 0A 0A d8d.:..
0x0017:
$
I'm not quite sure what to make of that. I think it means that something in the input side of bash is having some problems, but I'm not quite sure. I also am getting slightly inconsistent results. The first time I tried it, I got:
$ cat > xxx
's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;'
// some english d8^#d:^[d8-f^Gf3(i^G
$ odx xxx
0x0000: 27 73 3B 5E 5C 28 5B 5B 3A 62 6C 61 6E 6B 3A 5D 's;^\([[:blank:]
0x0010: 5D 2A 5C 29 2F 2F 5C 28 2E 2A 5C 29 3B 5C 31 2F ]*\)//\(.*\);\1/
0x0020: 2A 20 5C 32 20 2A 2F 3B 27 0A 2F 2F 20 73 6F 6D * \2 */;'.// som
0x0030: 65 20 65 6E 67 6C 69 73 68 20 64 38 00 64 3A 1B e english d8.d:.
0x0040: 64 38 2D 66 07 66 33 28 69 07 0A 0A d8-f.f3(i...
0x004C:
$
And in that hex dump, you can see a 0x00 byte (offset 0x003C). That appears at the position where you got the end comment, and a null there could confuse sed; but the whole input is such a mess it is hard to know what to make of it.
Okay, here's the correct answer...
The GNU regular expression library (regex) doesn't match everything when you put a . in your expression. Yup, I know how braindead that sounds.
The problem comes from the word "character". Reasonable people will say that everything in sed's input file is characters, and even in your case they are perfectly correct. But regex has been programmed to require that the input be perfectly correctly formatted characters of the current locale's character set (UTF-8). If they're correctly formatted characters in some other encoding, such as the Windows code page your file was saved in, they're not "characters".
So, as . only matches "characters", it doesn't match your characters.
If you used the regex //.*$, i.e. pinned it to the end of the line, it wouldn't match at all, because there's something that's not a "character" between the // and the end of the line.
And no, you can't do anything like //\(.\|[^.]\)*$; it's just impossible to match those characters without switching to the C locale.
This will also, sometimes, destroy 8-bit transparency; i.e. a binary piped through sed will get corrupted even if no changes are made.
Fortunately, the C locale still uses the reasonable interpretation, so anything that's not a perfectly correctly formatted ASCII-68 character is still a "character".
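A small sketch that reproduces this with the exact bytes from the od dump in the question. The to_c_comment helper and the variable names are made up for illustration, and en_US.UTF-8 is assumed to be installed; if it is not, sed stays in the C locale and both runs will likely look the same.

```shell
# "// some english " followed by the 12 non-UTF-8 bytes from the question
# (octal 322 273 320 251 326 320 316 304 327 242 312 315).
line=$(printf '// some english \322\273\320\251\326\320\316\304\327\242\312\315')

to_c_comment() {
    # $1: locale to run sed under; the line to convert is read from stdin.
    LC_ALL=$1 sed 's;^\([[:blank:]]*\)//\(.*\);\1/* \2 */;'
}

c_out=$(printf '%s\n' "$line" | to_c_comment C)
utf8_out=$(printf '%s\n' "$line" | to_c_comment en_US.UTF-8)

# In the C locale the closing */ lands after all 12 bytes; compare the
# position of "* /" in the two dumps.
printf '%s\n' "$c_out" | od -c
printf '%s\n' "$utf8_out" | od -c
```

In the question's own dump the closing */ appears after the first four bytes, which happen to form two valid UTF-8 sequences; the fifth byte starts an invalid one, and that is where .* stops.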
Hi everybody, can anyone tell me the answer to this question?
I created a simple txt file. It contains only two words, and the words are hello word. As far as I have studied, a computer uses ASCII code to store text on disk or in memory. In ASCII code, each letter or symbol is represented by one byte; in simple words, one byte is used to store a symbol.
Now the problem is this: whenever I look at the size of the file, it shows 11 bytes. I understand 9 bytes for the words plus one byte for the space makes a total of 10, so why is it showing a size of 11 bytes? I tried different things, such as changing the name of the file and saving it with the shortest or longest name possible, but it did not change the total storage.
So can anybody explain why this is happening? I tried this on Windows and Linux (Ubuntu, CentOS) systems and the result is the same.
pax> echo hello word >outfile.txt
pax> ls -al outfile.txt
-rw-r--r-- 1 pax pax 11 2010-11-19 15:34 outfile.txt
pax> od -xcb outfile.txt
0000000 6568 6c6c 206f 6f77 6472 000a
h e l l o w o r d \n
150 145 154 154 157 040 167 157 162 144 012
pax> hd outfile.txt
00000000 68 65 6c 6c 6f 20 77 6f 72 64 0a |hello word.|
0000000b
As per above, you're storing "hello word" and the newline character. That's 11 characters in total. If you don't want the newline, you can use something like the -n option of echo (which doesn't add the newline):
pax> echo -n hello word >outfile.txt
pax> ls -al outfile.txt
-rw-r--r-- 1 pax pax 10 2010-11-19 15:36 outfile.txt
pax> od -xcb outfile.txt
0000000 6568 6c6c 206f 6f77 6472
h e l l o w o r d
150 145 154 154 157 040 167 157 162 144
pax> hd outfile.txt
00000000 68 65 6c 6c 6f 20 77 6f 72 64 |hello word|
0000000a
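The same count can be checked without creating a file at all. A minimal sketch using printf, which only emits a newline when asked to, so it avoids any doubt about whether a given echo supports -n; the variable names are made up for illustration.

```shell
# 5 letters + 1 space + 4 letters = 10 bytes; the trailing newline is the 11th.
with_newline=$(( $(printf 'hello word\n' | wc -c) ))
without_newline=$(( $(printf 'hello word' | wc -c) ))

echo "with newline:    $with_newline bytes"      # 11
echo "without newline: $without_newline bytes"   # 10
```

The $(( ... )) wrapper only strips the padding that some wc implementations print around the number.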
If you want to see the content of the file you can perform an octal dump of it using the "od" command under Linux "od ". Most probably what you will see is a CR (carriage return) and an LF (linefeed).
The name of the file has nothing to do with its size.
Luis
Did you add a newline in the text file (\n)? Just because this character cannot be seen does not mean it is not there.