Why won't sed remove my NULLs? (OS X) - macos

Can someone tell me why sed won't remove my NULLs?
This is on OS X.
$ printf '123\x00456' | sed 's/\x00/Z/g' | hexdump
0000000 31 32 33 00 34 35 36 0a
This doesn't work either:
$ printf '123'$(echo "\000")'456' | sed 's/'$(echo "\000")'/Z/g' | hexdump
0000000 31 32 33 00 34 35 36 0a

To delete a single character, or to translate one character into a single other character (not including multibyte characters), use tr. Unlike sed, it supports all characters, including NULs, and has done so in all versions of Unix since the beginning.
For translating:
tr '\0' Z
And for deleting:
tr -d '\0'
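As a quick check of the two tr commands above, here is a minimal sketch that feeds the question's sample input through each of them. The variable names `translated` and `deleted` are only illustrative.

```shell
# Translate every NUL byte to 'Z' (the question's original goal).
translated=$(printf '123\000456' | tr '\0' 'Z')

# Delete every NUL byte instead.
deleted=$(printf '123\000456' | tr -d '\0')

printf '%s\n' "$translated" "$deleted"
# prints:
# 123Z456
# 123456
```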

Related

Why does xxd add chars?

I am trying to reverse an od command from a system where I have no hexdump or base64 tools.
I do it like this (of course, in reality the encoding takes place on the "small" system and the decoding is done on my workstation, but to test it, I try the whole thing in one line first):
echo TEST | od -tx1 | xxd -r
Of course, echo TEST is just a placeholder here for eg. cat test.bmp or anything else.
> echo TEST
TEST
> echo TEST | od -tx1
0000000 54 45 53 54 0a
0000005
> echo TEST | od -tx1 | xxd -r
TEST
That looks right, but it is different, as we can see here if we give it to od again:
> echo TEST | od -tx1 | xxd -r | od -tx1
0000000 54 45 53 54 0a 00 00 00
0000010
Why does xxd -r add those 00s?
You're getting those three nul bytes because of xxd -r trying and failing to parse input that's in a different format than what it expects. od -tx1 adds an extra line with an offset but no data bytes. Plus the offsets in xxd have a colon after them, and are printed with a different width, and printable bytes are displayed as well as the hex dump, and possibly in a different base... Something about that doesn't play well with xxd, and it adds the extra bytes as a result.
Examples:
$ echo TEST | xxd
00000000: 5445 5354 0a TEST.
$ echo TEST | xxd | xxd -r
TEST
$ echo TEST | xxd | xxd -r | xxd
00000000: 5445 5354 0a TEST.
$ echo TEST | xxd | xxd -r | od -tx1
0000000 54 45 53 54 0a
0000005
$ echo TEST | od -tx1 | head -1 | xxd -r | od -tx1
0000000 54 45 53 54 0a
0000005
See how they're not present when giving xxd -r its expected xxd style input? And how they're not there when you prune that extra line from od's output? Don't mix and match incompatible data formats.
It seems to work if I remove the offsets entirely:
> echo TEST | od -tx1 -An
54 45 53 54 0a
> echo TEST | od -tx1 -An | xxd -r -p
TEST
> echo TEST | od -tx1 -An | xxd -r -p | od -tx1 -An
54 45 53 54 0a
Bingo! Note the extra " " in front of the bytes. It seems to have no effect.

Grep not matching certain parts of man page

Grep doesn't seem to match certain strings from man output. It seems to be random in that I can't work out any rhyme or reason as to whether a string will match or not.
man sed | head -7:
SED(1) BSD General Commands Manual SED(1)
NAME
sed -- stream editor
SYNOPSIS
$ man sed | head -7 | grep sed # no match
$ man sed | head -7 | grep stream # match on "stream"
sed -- stream editor
$ man sed | head -7 | grep '\-\-' # match on "--"
sed -- stream editor
$ man sed | head -7 | grep NAME # no match
$ man sed | head -7 | grep SYNOPSIS # no match
This also happens when redirecting the output to a file and grepping that
$ man sed | head -7 > /tmp/sed.man
$ cat /tmp/sed.man | grep sed # no match
$ cat /tmp/sed.man | grep stream # match on "stream"
sed -- stream editor
$ grep sed /tmp/sed.man # no match
$ grep stream /tmp/sed.man # match on "stream"
sed -- stream editor
grep: grep (BSD grep) 2.5.1-FreeBSD
man: version 1.6c
macOS: 10.14.6 Beta
bash: GNU bash, version 5.0.7(1)-release (x86_64-apple-darwin18.5.0)
$ man sed | head -7 | hexdump -C
00000000 0a 53 45 44 28 31 29 20 20 20 20 20 20 20 20 20 |.SED(1) |
00000010 20 20 20 20 20 20 20 20 20 20 20 42 53 44 20 47 | BSD G|
00000020 65 6e 65 72 61 6c 20 43 6f 6d 6d 61 6e 64 73 20 |eneral Commands |
00000030 4d 61 6e 75 61 6c 20 20 20 20 20 20 20 20 20 20 |Manual |
00000040 20 20 20 20 20 20 20 20 20 53 45 44 28 31 29 0a | SED(1).|
00000050 0a 4e 08 4e 41 08 41 4d 08 4d 45 08 45 0a 20 20 |.N.NA.AM.ME.E. |
00000060 20 20 20 73 08 73 65 08 65 64 08 64 20 2d 2d 20 | s.se.ed.d -- |
00000070 73 74 72 65 61 6d 20 65 64 69 74 6f 72 0a 0a 53 |stream editor..S|
00000080 08 53 59 08 59 4e 08 4e 4f 08 4f 50 08 50 53 08 |.SY.YN.NO.OP.PS.|
00000090 53 49 08 49 53 08 53 0a |SI.IS.S.|
00000098
Googling is hard for this problem as any combination of "man" or "grep" doesn't mention my problem that strings (with no special characters) are not matching.
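Man pages use the roff format (https://man.openbsd.org/roff), and the formatted output marks bold text by overstriking: each character is followed by a backspace and the same character again. To see this, do the following: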
man sed > sed.man
vi sed.man
so you see:
SED(1) BSD General Commands Manual SED(1)
N^HNA^HAM^HME^HE
s^Hse^Hed^Hd -- stream editor
To convert a man page to text without the ^H sequences, have a look at http://www.schweikhardt.net/man_page_howto.html#q10
Create a Perl script called strip-headers with the content:
#!/usr/bin/perl -wn
# make it slurp the whole file at once:
undef $/;
# delete first header:
s/^\n*.*\n+//;
# delete last footer:
s/\n+.*\n+$/\n/g;
# delete page breaks:
s/\n\n+[^ \t].*\n\n+(\S+).*\1\n\n+/\n/g;
# collapse two or more blank lines into a single one:
s/\n{3,}/\n\n/g;
# see what is left...
print;
Make the Perl script executable with chmod 750 strip-headers and run it with:
man sed | ./strip-headers | col -bx > sed.man
or
man sed | ./strip-headers | col -bx | head -7 | grep sed
macOS man doesn't support the --ascii flag, so I used col -bx to strip the annoying formatting from man for piping into other commands.
man sed | col -bx | grep SYNOPSIS
col -b: Do not output any backspaces, printing only the last character written to each column position.
col -x: Output multiple spaces instead of tabs.
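The effect can be reproduced without a man page. The sketch below builds the overstruck NAME heading from the hexdump above by hand, shows that grep cannot find NAME in it, and then strips every "character plus backspace" pair. The sed expression is only a stand-in for col -b, chosen so the example runs where col is not installed. The variable names are illustrative.

```shell
# A literal backspace character, so the sed expression stays portable.
bs=$(printf '\b')

# "NAME" as it appears in the man output: N BS N, A BS A, M BS M, E BS E.
overstruck=$(printf 'N\bNA\bAM\bME\bE')

# grep does not see the word, because backspaces sit between the letters.
if printf '%s\n' "$overstruck" | grep -q 'NAME'; then
    raw_match=yes
else
    raw_match=no
fi

# Remove each "any character + backspace" pair, keeping the last character
# written to each position.
plain=$(printf '%s\n' "$overstruck" | sed "s/.$bs//g")

printf '%s\n' "raw text matches NAME: $raw_match" "cleaned text: $plain"
# prints:
# raw text matches NAME: no
# cleaned text: NAME
```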
Notes:
I've read that man is meant to detect whether you're piping to another command or into a file, etc., but that was not my experience, at least with man 1.6c, the default on macOS.
Solution using col: https://unix.stackexchange.com/a/15866
Thanks @Cyrus - I didn't know about hexdump
Thanks @Oliver Gaida - I didn't know cat and vi would display it differently

How to make a hexdump line by line with xxd?

Create a sample file with cat:
cat > /tmp/test.txt <<EOF
> X1
> X22
> X333
> X4444
> EOF
Check the content of the sample file:
cat /tmp/test.txt
X1
X22
X333
X4444
Make a hexdump with xxd:
xxd /tmp/test.txt
0000000: 5831 0a58 3232 0a58 3333 330a 5834 3434 X1.X22.X333.X444
0000010: 340a
How can I make a hexdump line by line with xxd, so that the output looks like this?
58 31 0a
58 32 32 0a
58 33 33 33 0a
58 34 34 34 34 0a
$ xxd -ps </tmp/test.txt|sed -e 's/\(..\)/\1 /g' -e 's/0a /0a\n/g'
58 31 0a
58 32 32 0a
58 33 33 33 0a
58 34 34 34 34 0a
In the end, I found the hexdump tool in combination with sed to be the best solution:
hexdump -v -e '/1 "%02x "' /tmp/test.txt | sed 's/0a /0a\n/g'
Please download the sample file and save it as /home/urls.csv.
Test with hexdump and sed:
hexdump -v -e '/1 "%02x "' /home/urls.csv | sed 's/0a /0a\n/g'
Test with xxd via xargs:
xargs -I'{}' bash -c 'xxd <<< "${1}"' -- '{}' < /home/urls.csv
Test with hd via xargs:
xargs -I'{}' bash -c 'hd <<< "${1}"' -- '{}' < /home/urls.csv
Test with xxd -ps and sed:
xxd -ps </home/urls.csv |sed -e 's/\(..\)/\1 /g' -e 's/0a /0a\n/g'
It is clear that Ipor Sircer's answer is not robust enough for long files.
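For comparison, here is a sketch of a variant that avoids two likely trouble spots in the sed-based commands: xxd -ps wraps its output after a fixed number of bytes, and not every sed (BSD sed, for example) turns \n in the replacement into a newline. It uses od and awk instead. The function name `hex_lines` is made up for this example.

```shell
# Hypothetical helper: read bytes on stdin, print them as hex pairs,
# starting a new output line after every 0a (newline) byte.
hex_lines() {
    # -v stops od from collapsing repeated lines into "*".
    od -An -v -tx1 |
        # One hex pair per line.
        tr -s ' \n' '\n\n' |
        # Join pairs with spaces; end the output line after each 0a.
        awk 'NF { printf "%s%s", $0, ($0 == "0a" ? "\n" : " ") }'
}

printf 'X1\nX22\nX333\nX4444\n' | hex_lines
# prints:
# 58 31 0a
# 58 32 32 0a
# 58 33 33 33 0a
# 58 34 34 34 34 0a
```

For a file, run it as hex_lines < /tmp/test.txt.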

Grep from file fails but grep with individual lines from the file works

I am trying to extract lines from file genome.gff that contain a line from file suspicious.txt. suspicious.txt was derived from genome.gff and every line should match.
Using grep on a single line from suspicious.txt works as expected:
grep 'gene10002' genome.gff
NC_007082.3 Gnomon gene 1269632 1273520 . + . ID=gene10002;Dbxref=BEEBASE:GB54789,GeneID:409846;Name=bur;gbkey=Gene;gene=bur;gene_biotype=protein_coding
NC_007082.3 Gnomon mRNA 1269632 1273520 . + . ID=rna21310;Parent=gene10002;Dbxref=GeneID:409846,Genbank:XM_393336.5,BEEBASE:GB54789;Name=XM_393336.5;gbkey=mRNA;gene=bur;product=burgundy;transcript_id=XM_393336.5
But every variation on using grep from a file that I've been able to think of or find online produces no output or an empty file:
grep -f suspicious.txt genome.gff
grep -F -f suspicious.txt genome.gff
while read line; do grep "$line" genome.gff; done<suspicious.txt
while read line; do grep '$line' genome.gff; done<suspicious.txt
while read line; do grep "${line}" genome.gff; done<suspicious.txt
cat suspicious.txt | while read line; do grep '$line' genome.gff; done
cat suspicious.txt | while read line; do grep '$line' genome.gff >> suspicious.gff; done
cat suspicious.txt | while read line; do grep -e "${line}" genome.gff >> suspicious.gff; done
cat "$(cat suspicious_bee_geneIDs_test.txt)" | while read line; do grep -e "${line}" genome.gff >> suspicious.gff; done
Running it as a script also produces an empty file:
#!/bin/bash
SUSP=$1
GFF=$2
while read -r line; do
grep -e "${line}" $GFF >> suspicious_bee_genes.gff
done<$SUSP
This is what the files look like:
head genome.gff
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build Amel_4.5
#!genome-build-accession NCBI_Assembly:GCF_000002195.4
##sequence-region NC_007070.3 1 29893408
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=7460
NC_007070.3 RefSeq region 1 29893408 . + . ID=id0;Dbxref=taxon:7460;Name=LG1;gbkey=Src;genome=chromosome;linkage-group=LG1;mol_type=genomic DNA;strain=DH4
NC_007070.3 Gnomon gene 181 211962 . - . ID=gene0;Dbxref=BEEBASE:GB42164,GeneID:726912;Name=cort;gbkey=Gene;gene=cort;gene_biotype=protein_coding
NC_007070.3 Gnomon mRNA 181 71559 . - . ID=rna0;Parent=gene0;Dbxref=GeneID:726912,Genbank:XM_006557348.1,BEEBASE:GB42164;Name=XM_006557348.1;gbkey=mRNA;gene=cort;product=cortex%2C transcript variant X2;transcript_id=XM_006557348.1
wc -l genome.gff
457742
head suspicious.txt
gene10002
gene1001
gene1003
gene10038
gene10048
gene10088
gene10132
gene10134
gene10181
gene10209
wc -l suspicious.txt
928
Does anyone know what's going wrong here?
This can happen when the input file is in DOS format: each line will have a trailing CR character at the end, which will break the matching.
One way to check if this is the case is using hexdump, for example (just the first few lines):
$ hexdump -C suspicious.txt
00000000 67 65 6e 65 31 30 30 30 32 0d 0a 67 65 6e 65 31 |gene10002..gene1|
00000010 30 30 31 0d 0a 67 65 6e 65 31 30 30 33 0d 0a 67 |001..gene1003..g|
00000020 65 6e 65 31 30 30 33 38 0d 0a 67 65 6e 65 31 30 |ene10038..gene10|
In the ASCII representation at the right, notice the .. after each gene. These dots correspond to 0d and 0a. The 0d is the CR character.
Without the CR character, the output should look like this:
$ hexdump -C <(tr -d '\r' < suspicious.txt)
00000000 67 65 6e 65 31 30 30 30 32 0a 67 65 6e 65 31 30 |gene10002.gene10|
00000010 30 31 0a 67 65 6e 65 31 30 30 33 0a 67 65 6e 65 |01.gene1003.gene|
00000020 31 30 30 33 38 0a 67 65 6e 65 31 30 30 34 38 0a |10038.gene10048.|
Just one . after each gene, corresponding to 0a, and no 0d.
Another way to see the DOS line endings is in the vi editor. If you open the file with vi, the status line will show [dos], or you can run the ex command :set ff? to make it tell you the file format (the status line will say fileformat=dos).
You can remove the CR characters on the fly like this:
grep -f <(tr -d '\r' < suspicious.txt) genome.gff
Or you could remove them in vi, by running the ex command :set ff=unix and then saving the file. There are other command-line tools that can remove DOS line endings too.
Another possibility is that instead of a trailing CR character, you might have trailing whitespace. The output of hexdump -C should make that perfectly clear. After the trailing whitespace characters are removed, the grep -f should work as expected.
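The failure and the fix can be reproduced on a small scale. The sketch below uses made-up file names and contents in a temporary directory (none of them come from the question) and counts matching lines before and after stripping the CR characters.

```shell
# Scratch directory with a DOS-format pattern file and a Unix-format data file.
workdir=$(mktemp -d)
printf 'gene1\r\ngene2\r\n' > "$workdir/patterns_dos.txt"
printf 'ID=gene1;x\nID=gene3;y\n' > "$workdir/data.txt"

# With CRLF patterns, grep looks for "gene1<CR>", which never occurs.
# (grep -c exits non-zero on zero matches, hence the "|| true".)
before=$(grep -c -F -f "$workdir/patterns_dos.txt" "$workdir/data.txt" || true)

# Strip the CR characters and try again.
tr -d '\r' < "$workdir/patterns_dos.txt" > "$workdir/patterns_unix.txt"
after=$(grep -c -F -f "$workdir/patterns_unix.txt" "$workdir/data.txt" || true)

echo "matches with CRLF patterns: $before, after stripping CR: $after"
# prints: matches with CRLF patterns: 0, after stripping CR: 1

rm -rf "$workdir"
```

Note that a pattern such as gene1001 is also a substring of IDs like gene10010, so depending on the data, adding -w to grep may be worth considering.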

Strange bash behaviour

So I'm trying to get a list of all the directories I'm currently running a program in, so I can keep track of the numerous jobs I have running at the moment.
When I run the commands individually, they all seem to work, but when I chain them together, something goes wrong... (ll is just the regular ls -l alias)
for pid in `top -n 1 -u will | grep -iP "(programs|to|match)" | awk '{print $1}'`;
do
ll /proc/$pid/fd | head -n 2 | tail -n 1;
done
Why is it that when I have the ll /proc/31353/fd inside the for loop, it cannot access the file, but when I use it normally it works fine?
And piped through hexdump -C:
$ top -n 1 -u will |
grep -iP "(scatci|congen|denprop|swmol3|sword|swedmos|swtrmo)" |
awk '{print $1}' | hexdump -C
00000000 1b 28 42 1b 5b 6d 1b 28 42 1b 5b 6d 32 31 33 35 |.(B.[m.(B.[m2135|
00000010 33 0a 1b 28 42 1b 5b 6d 1b 28 42 1b 5b 6d 32 39 |3..(B.[m.(B.[m29|
00000020 33 33 31 0a 1b 28 42 1b 5b 6d 1b 28 42 1b 5b 6d |331..(B.[m.(B.[m|
00000030 33 30 39 39 36 0a 1b 28 42 1b 5b 6d 1b 28 42 1b |30996..(B.[m.(B.|
00000040 5b 6d 32 36 37 31 38 0a |[m26718.|
00000048
chepner had the right hunch. The output of top is designed for humans, not for parsing. The hexdump shows that top is producing some terminal escape sequences. These escape sequences are part of the first field of the line, so the resulting file name is something like /proc/\e(B\e[m\e(B\e[m21353/fd instead of /proc/21353/fd, where \e is an escape character.
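To make the problem concrete, this sketch rebuilds the first field exactly as the hexdump shows it (1b 28 42 1b 5b 6d, twice, followed by the digits) and checks whether it is a plain number. The helper name `is_clean_pid` is made up for this example.

```shell
# Hypothetical helper: succeeds only if its argument consists solely of digits.
is_clean_pid() {
    case $1 in
        '' | *[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# The first field as top printed it: ESC ( B ESC [ m, twice, then the PID.
pid_from_top=$(printf '\033(B\033[m\033(B\033[m21353')

for candidate in "$pid_from_top" 21353; do
    if is_clean_pid "$candidate"; then
        echo "usable: /proc/$candidate/fd"
    else
        echo "rejected: contains non-digit bytes"
    fi
done
# prints:
# rejected: contains non-digit bytes
# usable: /proc/21353/fd
```

On a terminal the two values look identical, which is why the loop appeared to fail for no reason.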
Use ps, pgrep or pidof instead. Under Linux, you can use the -C option to ps to match an exact program name (repeat the option to allow multiple names). Use the -o option to control the display format.
for pid in $(ps -o pid= -C scatci -C congen -C denprop -C swmol3 -C sword -C swedmos -C swtrmo); do
ls -l /proc/$pid/fd | head -n 2 | tail -n 1
done
If you want to sort by decreasing CPU usage:
for pid in $(ps -o %cpu=,pid= \
-C scatci -C congen -C denprop -C swmol3 -C sword -C swedmos -C swtrmo |
sort -k 1gr |
awk '{print $2}'); do
ls -l /proc/$pid/fd | head -n 2 | tail -n 1
done
Additionally, use dollar-parenthesis instead of backticks for command substitution: quotes inside backticks behave somewhat bizarrely, and it's easy to make a mistake there. Quoting inside dollar-parenthesis is intuitive.
Try using "cut" instead of "awk", something like this:
for pid in `top -n 1 -u will | grep -iP "(scatci|congen|denprop|swmol3|sword|swedmos|swtrmo)" | sed 's/ / /g' | cut -d ' ' -f2`; do echo /proc/$pid/fd | head -n 2 | tail -n 1; done
