Grep not matching certain parts of man page - bash

Grep doesn't seem to match certain strings from man output. It seems to be random in that I can't work out any rhyme or reason as to whether a string will match or not.
man sed | head -7:
SED(1) BSD General Commands Manual SED(1)
NAME
sed -- stream editor
SYNOPSIS
$ man sed | head -7 | grep sed # no match
$ man sed | head -7 | grep stream # match on "stream"
sed -- stream editor
$ man sed | head -7 | grep '\-\-' # match on "--"
sed -- stream editor
$ man sed | head -7 | grep NAME # no match
$ man sed | head -7 | grep SYNOPSIS # no match
This also happens when redirecting the output to a file and grepping that
$ man sed | head -7 > /tmp/sed.man
$ cat /tmp/sed.man | grep sed # no match
$ cat /tmp/sed.man | grep stream # match on "stream"
sed -- stream editor
$ grep sed /tmp/sed.man # no match
$ grep stream /tmp/sed.man # match on "stream"
sed -- stream editor
grep: grep (BSD grep) 2.5.1-FreeBSD
man: version 1.6c
macOS: 10.14.6 Beta
bash: GNU bash, version 5.0.7(1)-release (x86_64-apple-darwin18.5.0)
$ man sed | head -7 | hexdump -C
00000000 0a 53 45 44 28 31 29 20 20 20 20 20 20 20 20 20 |.SED(1) |
00000010 20 20 20 20 20 20 20 20 20 20 20 42 53 44 20 47 | BSD G|
00000020 65 6e 65 72 61 6c 20 43 6f 6d 6d 61 6e 64 73 20 |eneral Commands |
00000030 4d 61 6e 75 61 6c 20 20 20 20 20 20 20 20 20 20 |Manual |
00000040 20 20 20 20 20 20 20 20 20 53 45 44 28 31 29 0a | SED(1).|
00000050 0a 4e 08 4e 41 08 41 4d 08 4d 45 08 45 0a 20 20 |.N.NA.AM.ME.E. |
00000060 20 20 20 73 08 73 65 08 65 64 08 64 20 2d 2d 20 | s.se.ed.d -- |
00000070 73 74 72 65 61 6d 20 65 64 69 74 6f 72 0a 0a 53 |stream editor..S|
00000080 08 53 59 08 59 4e 08 4e 4f 08 4f 50 08 50 53 08 |.SY.YN.NO.OP.PS.|
00000090 53 49 08 49 53 08 53 0a |SI.IS.S.|
00000098
Googling is hard for this problem as any combination of "man" or "grep" doesn't mention my problem that strings (with no special characters) are not matching.

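Man pages are written in the roff format (https://man.openbsd.org/roff), and the formatted output marks bold text by overstriking: each character is followed by a backspace and the same character again. To see this, do the following: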
man sed > sed.man
vi sed.man
so you see:
SED(1) BSD General Commands Manual SED(1)
N^HNA^HAM^HME^HE
s^Hse^Hed^Hd -- stream editor
To convert a man page to plain text without the ^H sequences, have a look at http://www.schweikhardt.net/man_page_howto.html#q10
Create a Perl script called strip-headers with the following content:
#!/usr/bin/perl -wn
# make it slurp the whole file at once:
undef $/;
# delete first header:
s/^\n*.*\n+//;
# delete last footer:
s/\n+.*\n+$/\n/g;
# delete page breaks:
s/\n\n+[^ \t].*\n\n+(\S+).*\1\n\n+/\n/g;
# collapse two or more blank lines into a single one:
s/\n{3,}/\n\n/g;
# see what is left...
print;
Make the Perl script executable with chmod 750 strip-headers and run it with:
man sed | ./strip-headers | col -bx > sed.man
or
man sed | ./strip-headers | col -bx | head -7 | grep sed

macOS man doesn't support the --ascii flag, so I used col -bx to strip the annoying formatting from man for piping into other commands.
man sed | col -bx | grep SYNOPSIS
col -b: Do not output any backspaces, printing only the last character written to each column position.
col -x: Output multiple spaces instead of tabs.
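To see why stripping the backspaces makes grep behave, here is a minimal sketch on simulated input. The sample string is made up to mimic the hexdump in the question, and the sed expression is a stand-in for col -b that only handles this simple "character, backspace, character" case:

```shell
# Simulated man output: each bold letter is "char, backspace, char".
sample=$(printf 'N\bNA\bAM\bME\bE\n     s\bse\bed\bd -- stream editor\n')

# grep sees the backspaces, so "NAME" is not a contiguous string.
raw_hits=$(printf '%s\n' "$sample" | grep -c NAME)

# Drop every "char + backspace" pair, keeping the last character written.
bs=$(printf '\b')
clean=$(printf '%s\n' "$sample" | sed "s/.${bs}//g")
clean_hits=$(printf '%s\n' "$clean" | grep -c NAME)

echo "raw: $raw_hits, cleaned: $clean_hits"
```

On real man output, col -b does the same job and also copes with cases this sed one-liner ignores.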
Notes:
I've read that man is meant to detect whether you're piping to another command or into a file, etc., but that was not my experience, at least with man 1.6c, the default on macOS.
Solution using col: https://unix.stackexchange.com/a/15866
Thanks @Cyrus - I didn't know about hexdump
Thanks @Oliver Gaida - I didn't know cat and vi would display the file differently

Related

how to grep a hex data area

I have a hex file, I need to extract a range of it to a text file
From range:
To Range:
I need Output: AC:E4:B5:9A:53:1C
I tried many variations, but none of them gives the output I need. Instead I get: Binary file filehex matches
grep "["'\x9f\x87\x6f\x11'"-"'\x9f\x87\x70\x11'"]" filehex > test.txt
I hope someone can help me.
Use -a to force the text interpretation of the input.
Use -o to only output the matching part.
The expression you used doesn't make much sense. It is a bracket expression, so it matches any single character in the set \x9f, \x87, \x6f, and then the range \x11-\x9f, etc.
You are rather interested in something that starts with \x9f\x87\x6f\x11 and ends in \x9f\x87\x70\x11, and there can be anything in between.
You can use cut to remove the leading and trailing 4 bytes.
grep -oa $'\x9f\x87\x6f\x11.*\x9f\x87\x70\x11' hexfile | cut -b5-21
If you know the length of the string will always be 17 bytes, you can use .\{17\} instead of .*.
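The fixed-length variant can be tried on a throwaway file. This is only a sketch: the sample content is invented, the markers are written as octal escapes so that printf stays portable, and LC_ALL=C is added as an assumption so that grep treats every byte as one character:

```shell
# The octal escapes are the bytes 9f 87 6f 11 (start) and 9f 87 70 11 (end).
start=$(printf '\237\207\157\021')
end=$(printf '\237\207\160\021')

# Hypothetical sample file with the wanted string between the two markers.
sample=$(mktemp)
printf 'BEFORE%sAC:E4:B5:9A:53:1C%sAFTER\n' "$start" "$end" > "$sample"

# Match start marker + exactly 17 bytes + end marker,
# then keep bytes 5-21, which drops the two 4-byte markers.
mac=$(LC_ALL=C grep -oa "${start}.\{17\}${end}" "$sample" | cut -b5-21)

echo "$mac"
rm -f "$sample"
```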
OK, I've built a random binary $file with your string placed at an offset where the hd command splits it across two output lines.
Note: regarding k314159's comment, I use hd to produce hexdump output similar to CentOS's hexdump tool.
One-shot using sed:
hd $file |sed -e 'N;/ 9f \+\(|.*\n[0-9a-f]\+ \+\|\)87 \+\(|.*\n[0-9a-f]\+ \+\|\)6f \+\(|.*\n[0-9a-f]\+ \+\|\)11 /p;D;'
000161c0 96 7a b2 21 28 f1 b3 32 63 43 93 ff 50 a6 9f 87 |.z.!(..2cC..P...|
000161d0 6f 11 0d 7a a5 a9 81 9e 32 9d fb 71 27 6d 60 f2 |o..z....2..q'm`.|
0002c3a0
Explanation:
N merge next line in current buffer
\(|.*\n[0-9a-f]\+ \+\|\) match a | followed by anything and a newline (\n), then immediately an hexadecimal number and a space OR nothing.
p print current buffer (two lines)
D Delete up to the first newline in the current buffer, keeping the last line for the next sed loop.
The last hexadecimal number, 0002c3a0, corresponds to the size of my binary $file:
printf "%x\n" $(stat -c %s $file)
Using bash + grep:
printf -v var "\x9f\x87\x6f\x11"
IFS=: read -r offset _ < <(grep -abo "$var" $file)
hd $file | sed -ne "$((offset/16-1)),+4p"
000161a0 b7 8f 4a 4d ed 89 6c 0b 25 f9 e7 c9 8c 99 6e 23 |..JM..l.%.....n#|
000161b0 3c ba 80 ec 2e 32 dd f3 a4 a2 09 bd 74 bf 66 11 |<....2......t.f.|
000161c0 96 7a b2 21 28 f1 b3 32 63 43 93 ff 50 a6 9f 87 |.z.!(..2cC..P...|
000161d0 6f 11 0d 7a a5 a9 81 9e 32 9d fb 71 27 6d 60 f2 |o..z....2..q'm`.|
000161e0 15 86 c2 bd 11 d0 08 90 c4 84 b9 80 04 4e 17 f1 |.............N..|
Where you could read your string:
000161c0 9f 87 | ..|
000161d0 6f 11 |o. |
For testing, I've built my test file by:
dd if=/vmlinuz bs=90574 count=1 of=/tmp/testfile
printf '\x9f\x87\x6f\x11' >>/tmp/testfile
dd if=/vmlinuz bs=90574 count=1 >>/tmp/testfile
file=/tmp/testfile
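The addresses in the dumps above can be cross-checked with a little arithmetic. This sketch assumes /vmlinuz is at least 90574 bytes long, so that each dd call really copies a full block:

```shell
offset=90574                        # bytes written before the 4-byte marker
row_addr=$(( offset / 16 * 16 ))    # start address of the 16-byte hexdump row
col=$(( offset % 16 ))              # position of the first marker byte in that row
total=$(( 2 * 90574 + 4 ))          # two blocks plus the marker

printf 'row=%08x col=%d size=%08x\n' "$row_addr" "$col" "$total"
# prints: row=000161c0 col=14 size=0002c3a0
```

Column 14 is the second-to-last byte of the row, which is why 9f 87 ends one hexdump line and 6f 11 starts the next.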
Use grep to search the original binary file, not the hex dump. Extending choroba's answer, I think you may have problems with grep trying to interpret your search pattern as UTF-8 or some other encoding. You should temporarily set the environment variable LC_ALL=C so that grep treats each byte individually. Also, with GNU grep you can use the -P option to enable lookbehind and lookahead in your pattern. So your command becomes:
LC_ALL=C grep -oaP $'(?<=\x9f\x87\x6f\x11).*(?=\x9f\x87\x70\x11)' binary-file > test.txt
Proof that it works:
$ echo $'BEFORE\x9f\x87\x6f\x11AC:E4:B5:9A:53:1C\x9f\x87\x70\x11AFTER' | LC_ALL=C grep -oaP $'(?<=\x9f\x87\x6f\x11).*(?=\x9f\x87\x70\x11)'
AC:E4:B5:9A:53:1C
$

What does 'BS' stands for in sublime text on macOS?

On macOS, in a zsh terminal, I ran the command 'man sort > sort-man.txt'.
When I open sort-man.txt with Sublime Text, I see many 'BS' markers.
What does 'BS' stand for in Sublime Text on macOS?
Could it be an encoding issue?
The man command outputs a “bold” character by printing the character, then printing a backspace character, then printing the character again. Thus:
:; man sort | hexdump -C | head
00000000 0a 53 4f 52 54 28 31 29 20 20 20 20 20 20 20 20 |.SORT(1) |
00000010 20 20 20 20 20 20 20 20 20 20 20 42 53 44 20 47 | BSD G|
00000020 65 6e 65 72 61 6c 20 43 6f 6d 6d 61 6e 64 73 20 |eneral Commands |
00000030 4d 61 6e 75 61 6c 20 20 20 20 20 20 20 20 20 20 |Manual |
00000040 20 20 20 20 20 20 20 20 53 4f 52 54 28 31 29 0a | SORT(1).|
00000050 0a 4e 08 4e 41 08 41 4d 08 4d 45 08 45 0a 20 20 |.N.NA.AM.ME.E. |
^ ^ ^
| | +--- ASCII N
| +------ ASCII Backspace
+--------- ASCII N
Way back in the days of physical terminals that printed on paper, this would have the effect of overstriking the character, making it appear bolder.
These days, your terminal emulator app interprets a sequence like this by changing the color or font of the character.
I guess Sublime Text shows the backspace character as BS.
Consulting the man man page, I find this under “TIPS”:
To get a plain text version of a man page, without backspaces and underscores, try
# man foo | col -b > foo.mantxt
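To make the "backspaces and underscores" remark concrete, here is a small sketch on an invented fragment. It counts the backspace bytes that an editor would display as BS, then keeps only the last character written to each column. The sed expression is a simplified stand-in for col -b, not a replacement for it:

```shell
# Simulated fragment: bold "NAME" (char, BS, char) and underlined "file" (_, BS, char).
sample=$(printf 'N\bNA\bAM\bME\bE _\bf_\bi_\bl_\be')

# Count the backspace bytes (0x08) in the text.
bs_count=$(printf '%s' "$sample" | tr -cd '\b' | wc -c | tr -d ' ')

# Remove each "char + backspace" pair so only the final character per column remains.
bs=$(printf '\b')
plain=$(printf '%s\n' "$sample" | sed "s/.${bs}//g")

echo "backspaces: $bs_count"
echo "plain text: $plain"
```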

Grep from file fails but grep with individual lines from the file works

I am trying to extract lines from file genome.gff that contain a line from file suspicious.txt. suspicious.txt was derived from genome.gff and every line should match.
Using grep on a single line from suspicious.txt works as expected:
grep 'gene10002' genome.gff
NC_007082.3 Gnomon gene 1269632 1273520 . + . ID=gene10002;Dbxref=BEEBASE:GB54789,GeneID:409846;Name=bur;gbkey=Gene;gene=bur;gene_biotype=protein_coding
NC_007082.3 Gnomon mRNA 1269632 1273520 . + . ID=rna21310;Parent=gene10002;Dbxref=GeneID:409846,Genbank:XM_393336.5,BEEBASE:GB54789;Name=XM_393336.5;gbkey=mRNA;gene=bur;product=burgundy;transcript_id=XM_393336.5
But every variation on using grep from a file that I've been able to think of or find online produces no output or an empty file:
grep -f suspicious.txt genome.gff
grep -F -f suspicious.txt genome.gff
while read line; do grep "$line" genome.gff; done<suspicious.txt
while read line; do grep '$line' genome.gff; done<suspicious.txt
while read line; do grep "${line}" genome.gff; done<suspicious.txt
cat suspicious.txt | while read line; do grep '$line' genome.gff; done
cat suspicious.txt | while read line; do grep '$line' genome.gff >> suspicious.gff; done
cat suspicious.txt | while read line; do grep -e "${line}" genome.gff >> suspicious.gff; done
cat "$(cat suspicious_bee_geneIDs_test.txt)" | while read line; do grep -e "${line}" genome.gff >> suspicious.gff; done
Running it as a script also produces an empty file:
#!/bin/bash
SUSP=$1
GFF=$2
while read -r line; do
grep -e "${line}" $GFF >> suspicious_bee_genes.gff
done<$SUSP
This is what the files look like:
head genome.gff
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build Amel_4.5
#!genome-build-accession NCBI_Assembly:GCF_000002195.4
##sequence-region NC_007070.3 1 29893408
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=7460
NC_007070.3 RefSeq region 1 29893408 . + . ID=id0;Dbxref=taxon:7460;Name=LG1;gbkey=Src;genome=chromosome;linkage-group=LG1;mol_type=genomic DNA;strain=DH4
NC_007070.3 Gnomon gene 181 211962 . - . ID=gene0;Dbxref=BEEBASE:GB42164,GeneID:726912;Name=cort;gbkey=Gene;gene=cort;gene_biotype=protein_coding
NC_007070.3 Gnomon mRNA 181 71559 . - . ID=rna0;Parent=gene0;Dbxref=GeneID:726912,Genbank:XM_006557348.1,BEEBASE:GB42164;Name=XM_006557348.1;gbkey=mRNA;gene=cort;product=cortex%2C transcript variant X2;transcript_id=XM_006557348.1
wc -l genome.gff
457742
head suspicious.txt
gene10002
gene1001
gene1003
gene10038
gene10048
gene10088
gene10132
gene10134
gene10181
gene10209
wc -l suspicious.txt
928
Does anyone know what's going wrong here?
This can happen when the input file is in DOS format: each line will have a trailing CR character at the end, which will break the matching.
One way to check if this is the case is using hexdump, for example (just the first few lines):
$ hexdump -C suspicious.txt
00000000 67 65 6e 65 31 30 30 30 32 0d 0a 67 65 6e 65 31 |gene10002..gene1|
00000010 30 30 31 0d 0a 67 65 6e 65 31 30 30 33 0d 0a 67 |001..gene1003..g|
00000020 65 6e 65 31 30 30 33 38 0d 0a 67 65 6e 65 31 30 |ene10038..gene10|
In the ASCII representation at the right, notice the .. after each gene. These dots correspond to 0d and 0a. The 0d is the CR character.
Without the CR character, the output should look like this:
$ hexdump -C <(tr -d '\r' < suspicious.txt)
00000000 67 65 6e 65 31 30 30 30 32 0a 67 65 6e 65 31 30 |gene10002.gene10|
00000010 30 31 0a 67 65 6e 65 31 30 30 33 0a 67 65 6e 65 |01.gene1003.gene|
00000020 31 30 30 33 38 0a 67 65 6e 65 31 30 30 34 38 0a |10038.gene10048.|
Just one . after each gene, corresponding to 0a, and no 0d.
Another way to see the DOS line endings is in the vi editor. If you open the file with vi, the status line will show [dos], or you can run the ex command :set ff? to make it tell you the file format (the status line will say fileformat=dos).
You can remove the CR characters on the fly like this:
grep -f <(tr -d '\r' < suspicious.txt) genome.gff
Or you could remove in vi, by running the ex command :set ff=unix and then save the file. There are other command line tools too that can remove the DOS line ending.
Another possibility is that instead of a trailing CR character, you might have trailing whitespace. The output of hexdump -C should make that perfectly clear. After the trailing whitespace characters are removed, the grep -f should work as expected.
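The effect is easy to reproduce on two tiny files. This is a sketch with invented sample data, using a temporary file instead of process substitution so that it also runs in plain sh:

```shell
dir=$(mktemp -d)

# Pattern file with DOS (CRLF) line endings, and a two-line stand-in for genome.gff.
printf 'gene10002\r\ngene1001\r\n' > "$dir/suspicious.txt"
printf 'chr1\tID=gene10002;Name=bur\nchr1\tID=gene5;Name=x\n' > "$dir/genome.gff"

# How many pattern lines contain a CR?
cr_lines=$(grep -c "$(printf '\r')" "$dir/suspicious.txt")

# With the CRs in place every pattern ends in \r, so nothing matches.
with_cr=$(grep -c -F -f "$dir/suspicious.txt" "$dir/genome.gff")

# After stripping the CRs the same patterns match as expected.
tr -d '\r' < "$dir/suspicious.txt" > "$dir/suspicious.unix.txt"
without_cr=$(grep -c -F -f "$dir/suspicious.unix.txt" "$dir/genome.gff")

echo "CR lines: $cr_lines, matches with CR: $with_cr, without CR: $without_cr"
rm -rf "$dir"
```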

Strange bash behaviour

So I'm trying to get a list of all the directories I'm currently running a program in, so I can keep track of the numerous jobs I have running at the moment.
When I run the commands individually, they all seem to work, but when I chain them together, something goes wrong... (ll is just the regular ls -l alias)
for pid in `top -n 1 -u will | grep -iP "(programs|to|match)" | awk '{print $1}'`;
do
ll /proc/$pid/fd | head -n 2 | tail -n 1;
done
Why is it that when I have ll /proc/31353/fd inside the for loop, it cannot access the file, but when I use it normally it works fine?
And piped through hexdump -C:
$ top -n 1 -u will |
grep -iP "(scatci|congen|denprop|swmol3|sword|swedmos|swtrmo)" |
awk '{print $1}' | hexdump -C
00000000 1b 28 42 1b 5b 6d 1b 28 42 1b 5b 6d 32 31 33 35 |.(B.[m.(B.[m2135|
00000010 33 0a 1b 28 42 1b 5b 6d 1b 28 42 1b 5b 6d 32 39 |3..(B.[m.(B.[m29|
00000020 33 33 31 0a 1b 28 42 1b 5b 6d 1b 28 42 1b 5b 6d |331..(B.[m.(B.[m|
00000030 33 30 39 39 36 0a 1b 28 42 1b 5b 6d 1b 28 42 1b |30996..(B.[m.(B.|
00000040 5b 6d 32 36 37 31 38 0a |[m26718.|
00000048
chepner had the right hunch. The output of top is designed for humans, not for parsing. The hexdump shows that top is producing some terminal escape sequences. These escape sequences are part of the first field of the line, so the resulting file name is something like /proc/\e(B\e[m\e(B\e[m21353/fd instead of /proc/21353/fd, where \e is an escape character.
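A short sketch of what the loop actually receives, using a field rebuilt by hand from the hexdump in the question. Stripping the two sequences with sed is shown for illustration only: it handles just these two sequences and is not a general escape-sequence filter:

```shell
# First field as top emitted it: two "ESC ( B ESC [ m" groups, then the PID.
pid_field=$(printf '\033(B\033[m\033(B\033[m21353')
path="/proc/${pid_field}/fd"

# The field is 17 characters long, not 5, so the path does not exist.
echo "field length: ${#pid_field}"
printf '%s\n' "$path" | od -c | head -n 3

# Remove exactly the two sequences seen in the hexdump.
esc=$(printf '\033')
clean_pid=$(printf '%s\n' "$pid_field" | sed "s/${esc}(B//g; s/${esc}\[m//g")
echo "clean PID: $clean_pid"
```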
Use ps, pgrep or pidof instead. Under Linux, you can use the -C option to ps to match an exact program name (repeat the option to allow multiple names). Use the -o option to control the display format.
for pid in $(ps -o pid= -C scatci -C congen -C denprop -C swmol3 -C sword -C swedmos -C swtrmo); do
ls -l /proc/$pid/fd | head -n 2 | tail -n 1
done
If you want to sort by decreasing CPU usage:
for pid in $(ps -o %cpu= -o pid= \
    -C scatci -C congen -C denprop -C swmol3 -C sword -C swedmos -C swtrmo |
    sort -k 1gr |
    awk '{print $2}'); do
  ls -l /proc/$pid/fd | head -n 2 | tail -n 1
done
Additionally, use dollar-parenthesis instead of backticks for command substitution: quotes inside backticks behave somewhat bizarrely, and it's easy to make a mistake there. Quoting inside dollar-parenthesis is intuitive.
Try using "cut" instead of "awk", something like this:
for pid in `top -n 1 -u will | grep -iP "(scatci|congen|denprop|swmol3|sword|swedmos|swtrmo)" | sed 's/ / /g' | cut -d ' ' -f2`; do echo /proc/$pid/fd | head -n 2 | tail -n 1; done

See PIDs only by specific users in bash

So I'm having some trouble finding a way to isolate the PIDs from top using pipelines, without being able to use awk or perl. So far I'm able to isolate the specific users (it cannot be my username or root), and now I'm not sure how to move on from here. I've tried using cut and several other options, but it's not working. Here's my work so far:
top -n 1 | tail -n +8 | grep -Ev '\broot\b | \bmyUserName\b'
This outputs all the information minus the heading, and I need to remove everything else but the PIDs... Could anyone help at all?
EDIT: Also, right now what seems to work is just adding | cut -c 4-11, which shows only the PID, because there is only one other user that is not root on the system. I'm not sure it will work if there are more. Are there any better ideas as to how to make it work?
In theory:
top -n 1 | tail -n +8 | grep -Ev ' root | myUserName ' |
sed -e 's/^[ ]*\([0-9][0-9]*\) .*/\1/'
The sed command looks for start of line, optional blanks followed by a number and a blank and trailing garbage. However, this doesn't work because top generates screen control characters:
28433 jleffler 20 0 1511m 403m 31m S 2 1.3 70:35.76 chrome
looks OK, but when pushed through a hex dump, the output is:
0x0000: 1B 28 42 1B 5B 6D 32 38 34 33 33 20 6A 6C 65 66 .(B.[m28433 jlef
0x0010: 66 6C 65 72 20 20 32 30 20 20 20 30 20 31 35 31 fler 20 0 151
0x0020: 30 6D 20 34 30 34 6D 20 20 33 31 6D 20 53 20 20 0m 404m 31m S
0x0030: 20 20 34 20 20 31 2E 33 20 20 37 30 3A 33 37 2E 4 1.3 70:37.
0x0040: 38 37 20 63 68 72 6F 6D 65 20 20 20 20 20 20 20 87 chrome
To suppress that, use top -b (batch mode):
top -b -n 1 | tail -n +8 | grep -Ev ' root | myUserName ' |
sed -e 's/^[ ]*\([0-9][0-9]*\) .*/\1/'
This should generate a list of PIDs; it did for me.
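The difference batch mode makes can be seen by feeding the same sed expression one clean line and one line prefixed with the escape sequence from the dump above. Both sample lines are reconstructed by hand, and extract_pid is a hypothetical helper name used only here:

```shell
# Same substitution as above, wrapped in a helper for reuse.
extract_pid() {
    sed -e 's/^[ ]*\([0-9][0-9]*\) .*/\1/'
}

clean_line='28433 jleffler  20   0 1511m  403m  31m S    2  1.3  70:35.76 chrome'
dirty_line=$(printf '\033(B\033[m%s' "$clean_line")

# Without escape sequences the PID is isolated.
from_clean=$(printf '%s\n' "$clean_line" | extract_pid)

# With the leading escape sequence the anchored pattern fails,
# so sed passes the whole line through unchanged.
from_dirty=$(printf '%s\n' "$dirty_line" | extract_pid)

echo "clean: $from_clean"
```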
If you were allowed awk, you might simplify that to:
top -b -n 1 | awk 'NR<=8 || $2~/^(root|myUserName)$/ {next} {print $1}'
And all this is predicated on 'Using top is a good way to go', rather than using ps (which is the normal tool to use for gathering PIDs).
If you want the PID's of all processes not spawned by root or some_user then you could list these processes using ps with -U user and the negation option -N:
ps -U root -U some_user -N -o pid
The -o option specifies that we're only interested in the PID in the output.
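Now you can easily do something with these PIDs in a loop or similar (writing pid= with an empty header name suppresses the PID header line, so the loop only sees numbers):
for pid in $(ps -U root -U some_user -N -o pid=); do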
# something to $pid
done
