Does anyone know how to get grep, or similar tool, to retrieve offsets of hex strings in a file?
I have a bunch of hexdumps (from GDB) that I need to check for strings and then run again and check if the value has changed.
I have tried hexdump and dd, but because their output is a stream, I lose the offsets into the files.
Someone must have had this problem and a workaround. What can I do?
To clarify:
I have a series of dumped memory regions from GDB (typically several hundred MB)
I am trying to narrow down a number by searching for all the places the number is stored, then doing it again and checking if the new value is stored at the same memory location.
I cannot get grep to do anything useful: I am looking for hex values, and every attempt so far (like a bazillion, roughly) has failed to give me the correct output.
The hex dumps are just complete binary files; the patterns sit inside float values, so 8 bytes at the largest.
The patterns are not line-wrapping, as far as I am aware. I know what the value changes to, so I can repeat the process and compare the lists to see which offsets match.
Perl COULD be an option, but at this point I assume my lack of knowledge of bash and its tools is the main culprit.
Desired output format
It's a little hard to describe the output I am getting, since I am really not getting any output.
I am expecting something along the lines of:
<offset>:<searched value>
which is pretty much the standard output I would normally get with grep -URbFo <searchterm> . > <output>
What I tried:
A. Problem is, when I try to search for hex values, grep does not treat them as hex. If I search for 00 I should get about a million hits, because that is always the padding, but instead it searches for 00 as text, i.e. 0x3030 in hex.
Any ideas?
B. I CAN force it through hexdump or something of the like, but because it is a stream it will not give me the offsets and the filename the match was found in.
C. Using grep's -b option doesn't seem to work either; I tried all the flags that seemed useful to my situation, and nothing worked.
D. Using xxd -u /usr/bin/xxd as an example I get output that would be useful, but I cannot use it for searching:
0004760: 73CC 6446 161E 266A 3140 5E79 4D37 FDC6 s.dF..&j1@^yM7..
0004770: BF04 0E34 A44E 5BE7 229F 9EEF 5F4F DFFA ...4.N[."..._O..
0004780: FADE 0C01 0000 000C 0000 0000 0000 0000 ................
Nice output, just what I want to see, but it just doesn't work for me in this situation..
E. Here are some of the things I've tried since posting this:
xxd -u /usr/bin/xxd | grep 'DF'
00017b0: 4010 8D05 0DFF FF0A 0300 53E3 0610 A003 @.........S.....
root# grep -ibH "df" /usr/bin/xxd
Binary file /usr/bin/xxd matches
xxd -u /usr/bin/xxd | grep -H 'DF'
(standard input):00017b0: 4010 8D05 0DFF FF0A 0300 53E3 0610 A003 @.........S.....
This seems to work for me:
LANG=C grep --only-matching --byte-offset --binary --text --perl-regexp "<\x-hex pattern>" <file>
short form:
LANG=C grep -obUaP "<\x-hex pattern>" <file>
Example:
LANG=C grep -obUaP "\x01\x02" /bin/grep
Output (cygwin binary):
153: <\x01\x02>
33210: <\x01\x02>
53453: <\x01\x02>
So you can grep this again to extract offsets. But don't forget to use binary mode again.
Note: LANG=C is needed to avoid utf8 encoding issues.
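As a quick sanity check, here is a minimal sketch (the temp file and its bytes are made up) that plants the pattern at known offsets and confirms grep reports them; it assumes GNU grep for the -P flag:

```shell
# Build an 8-byte demo file with \x01\x02 at offsets 1 and 5, then grep for it.
tmp=$(mktemp)
printf '\x00\x01\x02\x00\x00\x01\x02\x00' > "$tmp"

# -o: print only the match, -b: byte offset, -U: binary mode, -a: treat as text, -P: perl regex
offsets=$(LANG=C grep -obUaP '\x01\x02' "$tmp" | cut -d: -f1 | xargs)
echo "$offsets"    # 1 5
rm -f "$tmp"
```

The cut/xargs at the end is just to turn `offset:match` lines into a plain list of offsets.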
There's also a pretty handy tool called binwalk, written in python, which provides for binary pattern matching (and quite a lot more besides). Here's how you would search for a binary string, which outputs the offset in decimal and hex (from the docs):
$ binwalk -R "\x00\x01\x02\x03\x04" firmware.bin
DECIMAL HEX DESCRIPTION
--------------------------------------------------------------------------
377654 0x5C336 Raw string signature
We tried several things before arriving at an acceptable solution:
xxd -u /usr/bin/xxd | grep 'DF'
00017b0: 4010 8D05 0DFF FF0A 0300 53E3 0610 A003 @.........S.....
root# grep -ibH "df" /usr/bin/xxd
Binary file /usr/bin/xxd matches
xxd -u /usr/bin/xxd | grep -H 'DF'
(standard input):00017b0: 4010 8D05 0DFF FF0A 0300 53E3 0610 A003 @.........S.....
Then we found we could get usable results with
xxd -u /usr/bin/xxd > /tmp/xxd.hex ; grep -H 'DF' /tmp/xxd.hex
Note that using a simple search target like 'DF' will incorrectly match hex digits that span byte boundaries, i.e.
xxd -u /usr/bin/xxd | grep 'DF'
00017b0: 4010 8D05 0DFF FF0A 0300 53E3 0610 A003 @.........S.....
--------------------^^
So we use an ORed regexp to search for ' DF' OR 'DF ' (the search target preceded or followed by a space character).
The final result seems to be
xxd -u DumpFile > DumpFile.hex
egrep ' DF|DF ' DumpFile.hex
(Note: xxd's -ps plain mode would strip the spaces this anchored search relies on, so the default grouped output is used here.)
0001020: 0089 0424 8D95 D8F5 FFFF 89F0 E8DF F6FF ...$............
-----------------------------------------^^
0001220: 0C24 E871 0B00 0083 F8FF 89C3 0F84 DF03 .$.q............
--------------------------------------------^^
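The boundary issue is easy to reproduce; a sketch with a made-up two-byte file that contains no DF byte at all, yet whose hex text reads 0DF0:

```shell
tmp=$(mktemp)
printf '\x0d\xf0' > "$tmp"                        # bytes 0D F0 -- no DF byte anywhere

naive=$(xxd -u "$tmp" | grep -c 'DF')             # hits the "DF" straddling the two bytes
anchored=$(xxd -u "$tmp" | grep -Ec ' DF|DF ' || true)

echo "naive=$naive anchored=$anchored"            # naive=1 anchored=0
rm -f "$tmp"
```

The naive search reports a false positive; the space-anchored ORed regexp correctly rejects it.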
grep has a -P switch that allows using Perl regexp syntax.
Perl regexes can look at individual bytes, using the \x.. syntax.
So you can look for a given hex string in a file with: grep -aP "\xdf"
But the output won't be very useful; indeed, it's better to run a regexp on the hexdump output.
grep -P can be useful, however, to simply find files matching a given binary pattern.
Or to do a binary query of a pattern that actually happens in text
(see for example How to regexp CJK ideographs (in utf-8) )
I just used this:
grep -c $'\x0c' filename
To search for and count a page-break (form feed) control character in the file.
So to include an offset in the output:
grep -b -o $'\x0c' filename | less
I am just piping the result to less because the character I am grepping for does not print well, and less displays the results cleanly.
Output example:
21:^L
23:^L
2005:^L
If you want to search for printable strings, you can use:
strings -ao filename | grep string
strings will output all printable strings from a binary along with their offsets, and grep will then search within that.
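For example, a sketch with made-up file contents (note that GNU strings' plain -o prints octal offsets, so -t d is used here to get decimal):

```shell
tmp=$(mktemp)
printf '\x00\x00hello\x00world\x00' > "$tmp"

# -a: scan the whole file, -t d: print each string's byte offset in decimal
strings -a -t d "$tmp" | grep hello
# prints the decimal offset (2) followed by "hello"

off=$(strings -a -t d "$tmp" | grep hello | awk '{print $1}')
rm -f "$tmp"
```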
If you want to search for any binary string, here is your friend:
https://github.com/tmbinc/bgrep
Related
I am searching all files on my drive for a given hexadecimal value, after it is found I need to copy and save the next 32 bytes after the found occurrence (there may be many occurrences in one file).
Right now I'm searching for files like this:
ggrep -obaRUP "\x01\x02\x03\x04" . > outputfile.txt
But this script returns only the file path. Preferably I'd like to use only standard Linux/Mac tools.
With -P (--perl-regexp) you can use the \K escape sequence to clear the matching buffer. Then match .{32} more chars(!):
LANG=C grep -obaRUP "\x01\x02\x03\x04\K.{32,32}" . > output.file
Note:
I'm using LANG=C to enforce a locale with a single-byte encoding, not UTF-8. This makes sure .{32} will not accidentally match multi-byte Unicode chars(!), but bytes instead.
The -P option is only supported by GNU grep (along with a few others used in your example)
You may want to open the output.file in a hex editor to actually see characters. For example hexdump, hd or xxd could be used.
Note, the above command will additionally print the filename (implied by grep -R, recursive) and the byte offset (from -b) in front of the match.
To get only the 32 bytes in the output, and nothing else, I suggest to use find:
find . -type f -exec grep -oaUP '\x01\x02\x03\x04\K.{32}' {} \;
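A self-contained sketch of the \K extraction (the marker bytes and the 32-byte payload are made up for the test):

```shell
tmp=$(mktemp)
printf '\x01\x02\x03\x04HERE ARE THE THIRTY-TWO BYTES !!' > "$tmp"

# \K drops the 4 marker bytes from the reported match; .{32} grabs the payload
got=$(LANG=C grep -oaUP '\x01\x02\x03\x04\K.{32}' "$tmp")
echo "$got"    # HERE ARE THE THIRTY-TWO BYTES !!
rm -f "$tmp"
```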
My test was a little simple, but this worked for me.
$: IFS=: read -r file offset data <<< "$(grep -obaRUP "\x01\x02\x03\x04.{32}" .)"
$: echo "$file # $((offset+4)):[${data#????}]"
./x # 10:[HERE ARE THE THIRTY-TWO BYTES !!]
Rather than do a complicated look-behind, I just grabbed the ^A^B^C^D and the next 32 bytes, and stripped off the leading 4 bytes from the field.
@hek2mgl's \K makes all that unnecessary, though. Use -h to eliminate filenames.
$: grep -obahRUP "\x01\x02\x03\x04\K.{32}" .
10:HERE ARE THE THIRTY-TWO BYTES !!
Take out the -b if you don't want the offset.
$: grep -oahRUP "\x01\x02\x03\x04\K.{32}" .
HERE ARE THE THIRTY-TWO BYTES !!
Basically I want a "multiline grep that takes binary strings as patterns".
For example:
printf '\x00\x01\n\x02\x03' > big.bin
printf '\x01\n\x02' > small.bin
printf '\x00\n\x02' > small2.bin
Then the following should hold:
small.bin is contained in big.bin
small2.bin is not contained in big.bin
I don't want to have to convert the files to ASCII hex representation with xxd as shown e.g. at: https://unix.stackexchange.com/questions/217936/equivalent-command-to-grep-binary-files because that feels wasteful.
Ideally, the tool should handle large files that don't fit into memory.
Note that the following attempts don't work.
grep -f matches where it shouldn't, because it must be splitting the pattern on newlines:
grep -F -f small.bin big.bin
# Correct: Binary file big.bin matches
grep -F -f small2.bin big.bin
# Wrong: Binary file big.bin matches
Shell substitution as in $(cat) fails because it is impossible to store NUL bytes in Bash variables AFAIK, so the string just gets mangled at the first NUL byte I believe:
grep -F "$(cat small.bin)" big.bin
# Correct: Binary file big.bin matches
grep -F "$(cat small2.bin)" big.bin
# Wrong: Binary file big.bin matches
A C question has been asked at: How can i check if binary file's content is found in other binary file? but is it possible with any widely available CLI (hopefully POSIX, or GNU coreutils) tools?
Notably, implementing a non-naive algorithm such as Boyer-Moore is not entirely trivial.
I can hack up a working Python one liner as follows, but it won't work for files that don't fit into memory:
grepbin() ( python -c 'import sys;sys.exit(not open(sys.argv[1],"rb").read() in open(sys.argv[2],"rb").read())' "$1" "$2" )
grepbin small.bin big.bin && echo 1
grepbin small2.bin big.bin && echo 2
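For reference, a sketch wiring the question's example files to that one-liner (opening in binary mode, "rb", which Python 3 requires for this to be byte-accurate):

```shell
dir=$(mktemp -d)
printf '\x00\x01\n\x02\x03' > "$dir/big.bin"
printf '\x01\n\x02'         > "$dir/small.bin"
printf '\x00\n\x02'         > "$dir/small2.bin"

# Exit 0 iff the first file's bytes occur inside the second file's bytes
grepbin() {
  python3 -c 'import sys;sys.exit(not open(sys.argv[1],"rb").read() in open(sys.argv[2],"rb").read())' "$1" "$2"
}

grepbin "$dir/small.bin"  "$dir/big.bin" && r1=match || r1=nomatch
grepbin "$dir/small2.bin" "$dir/big.bin" && r2=match || r2=nomatch
echo "$r1 $r2"    # match nomatch
```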
I could also find the following two tools on GitHub:
https://github.com/tmbinc/bgrep in C, installable with (amazing :-)):
curl -L 'https://github.com/tmbinc/bgrep/raw/master/bgrep.c' | gcc -O2 -x c -o /usr/local/bin/bgrep -
https://github.com/gahag/bgrep in Rust, installable with:
cargo install bgrep
but they don't seem to support taking the pattern from a file; you provide the pattern as hex ASCII on the command line. I could use:
bgrep $(xxd -p small.bin | tr -d '\n') big.bin
since it does not matter as much if the small file gets converted with xxd, but it's not really nice.
In any case, if I were to implement the feature, I'd likely add it to the Rust tool above.
bgrep is also mentioned at: How does bgrep work?
Tested on Ubuntu 20.10.
How to check if a binary file is contained inside another binary from the Linux command line?
The very POSIX portable way would be to use od to convert to hex and then check for substring with grep, along with some sed scripting in between.
The usual normal portable way, would be to use xxd instead of od:
xxd -p small.bin | tr -d ' \n' > small.bin2
xxd -p big.bin | tr -d ' \n' > big.bin2
grep -F -f small.bin2 big.bin2
which works fine, tested in Docker on Alpine with BusyBox.
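One pitfall worth noting with the hex-text comparison (a sketch with made-up bytes): a match can land on a half-byte (nibble) boundary that corresponds to no real byte offset, giving a false positive:

```shell
dir=$(mktemp -d)
printf '\x01\x23\x45' > "$dir/big.bin"    # hex text: 012345
printf '\x12\x34'     > "$dir/small.bin"  # hex text: 1234 -- NOT a byte substring of big.bin

xxd -p "$dir/big.bin"   | tr -d ' \n' > "$dir/big.hex"
xxd -p "$dir/small.bin" | tr -d ' \n' > "$dir/small.hex"

hits=$(grep -cF -f "$dir/small.hex" "$dir/big.hex")
echo "$hits"    # 1 -- "1234" matches inside "012345" at an odd nibble offset
```

If byte-exact containment matters, the match offset has to be checked for even alignment before trusting it.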
But:
I don't want to have to convert the files to ASCII hex representation with xxd as shown
then you can't work with binary files in the shell. Pick another language. The shell was created to parse nice-looking, human-readable strings; for anything else it's utterly unpleasant, and for files with zero bytes xxd is the first thing you should reach for.
I can hack up a working Python one liner as follows,
awk is also POSIX and available everywhere - I believe someone more skilled in awk may come and write the exact 1:1 of your python script, but:
but it won't work for files that don't fit into memory:
So write a different algorithm, that will not do that.
Overall, given the constraint of not using xxd (or od) to convert a binary file with zero bytes to its hex representation:
is it possible with any widely available CLI (hopefully POSIX, or GNU coreutils) tools?
No. Write your own program for that. You may also write it in perl, it's sometimes available on machines that don't have python.
method1:
$ echo -n "The quick brown fox jumps over the lazy dog" | openssl sha1 | base64
MmZkNGUxYzY3YTJkMjhmY2VkODQ5ZWUxYmI3NmU3MzkxYjkzZWIxMgo=
method2:
$ echo -n "The quick brown fox jumps over the lazy dog" | openssl sha1 | xxd -r -p | base64
L9ThxnotKPzthJ7hu3bnORuT6xI=
method3:
$ echo -n "The quick brown fox jumps over the lazy dog" | openssl sha1 | xxd -b -p | base64
MzI2NjY0MzQ2NTMxNjMzNjM3NjEzMjY0MzIzODY2NjM2NTY0MzgzNDM5NjU2NTMxNjI2MjM3MzY2NTM3CjMzMzkzMTYyMzkzMzY1NjIzMTMyMGEK
I am basically trying to checksum the input string The quick brown fox jumps over the lazy dog with sha1 and then base64 the result, and I have the two methods above. I think method2 is the correct answer, but I have to do an extra step to convert the hex back into binary via xxd -r (with plain format -p) before I feed it into base64 again. Why do I have to do this extra step?
I can't find anywhere that the base64 command-line tool expects its input to be binary. But let's assume so: when I explicitly convert the input into binary and feed it to base64 via method3's xxd -b option, the result is different again.
This would probably be easier in a programming language, because there we have full control, but with a few command-line tools it's a bit confusing. Could someone help me explain this?
There are three different results here because you are passing in three different strings to base64.
Per your question on base64 expecting the input to be binary, @chepner is right here:
All data is binary; text is just a stream of bytes representing an encoding (ASCII, UTF-8, etc) of text.
Intermediary steps
Let's store the shared command in a variable for clarity.
$ msg='The quick brown fox jumps over the lazy dog'
$ sha_val="$(printf "$msg" | openssl sha1 | awk '{ print $2 }')"
$ printf "$sha_val"
2fd4e1c67a2d28fced849ee1bb76e7391b93eb12
A couple things to note:
Using printf because it is more consistent, especially when we are comparing bytes and hashes.
Piping to awk '{ print $2 }' as openssl may prepend with (stdin)=.
Comparing the bytes
We can use xxd to compare the bytes for each, using -c 1000 to use 1000-char lines (i.e. don't add newlines for < 1000-char strings). This is useful for strings like the output in method2, where there are control characters that can't be printed.
method 1
This is the hex representation of the sha value. For example, the first 2 in the sha output is 32 in this result because hex 32 <=> dec 50 <=> ASCII/UTF-8 "2". If this is confusing, take a look at an ASCII table.
$ printf "$sha_val" | xxd -p -c 1000
32666434653163363761326432386663656438343965653162623736653733393162393365623132
method 2
This output is the EXACT SAME as $sha_val, given that we are converting from hex to raw bytes and then back with xxd. Note that converting the sha value from hex to binary is not necessary for base64.
$ printf "$sha_val" | xxd -r -p | xxd -p -c 1000
2fd4e1c67a2d28fced849ee1bb76e7391b93eb12
method 3
xxd's -p option is overriding the -b option, so xxd -b -p <=> xxd -p.
$ printf "$sha_val" | xxd -p -c 1000 | xxd -p -c 1000
33323636363433343635333136333336333736313332363433323338363636333635363433383334333936353635333136323632333733363635333733333339333136323339333336353632333133323061
As you can see, base64 generates three different strings because it receives three different strings.
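The difference can be demonstrated end-to-end (a sketch; the SHA-1 of this string is a well-known test vector, and openssl plus GNU base64 are assumed to be available):

```shell
msg='The quick brown fox jumps over the lazy dog'
sha_val=$(printf '%s' "$msg" | openssl sha1 | awk '{print $2}')
echo "$sha_val"    # 2fd4e1c67a2d28fced849ee1bb76e7391b93eb12

m1=$(printf '%s' "$sha_val" | base64)                 # base64 of the 40 hex *characters*
m2=$(printf '%s' "$sha_val" | xxd -r -p | base64)     # base64 of the 20 raw digest *bytes*

echo "$m2"    # L9ThxnotKPzthJ7hu3bnORuT6xI=
```

m1 differs from the question's method1 only in that no trailing newline is included here; m2 reproduces the question's method2 result exactly.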
cat /dev/urandom is always a fun way to create scrolling characters on your display, but produces too many non-printable characters.
Is there an easy way to encode it on the command line such that all of its output is readable characters, for example with base64 or uuencode?
Note that I prefer solutions that require no additional files to be created.
What about something like
cat /dev/urandom | base64
Which gives (lots of) stuff like
hX6VYoTG6n+suzKhPl35rI+Bsef8FwVKDYlzEJ2i5HLKa38SLLrE9bW9jViSR1PJGsDmNOEgWu+6
HdYm9SsRDcvDlZAdMXAiHBmq6BZXnj0w87YbdMnB0e2fyUY6ZkiHw+A0oNWCnJLME9/6vJUGsnPL
TEw4YI0fX5ZUvItt0skSSmI5EhaZn09gWEBKRjXVoGCOWVlXbOURkOcbemhsF1pGsRE2WKiOSvsr
Xj/5swkAA5csea1TW5mQ1qe7GBls6QBYapkxEMmJxXvatxFWjHVT3lKV0YVR3SI2CxOBePUgWxiL
ZkQccl+PGBWmkD7vW62bu1Lkp8edf7R/E653pi+e4WjLkN2wKl1uBbRroFsT71NzNBalvR/ZkFaa
2I04koI49ijYuqNojN5PoutNAVijyJDA9xMn1Z5UTdUB7LNerWiU64fUl+cgCC1g+nU2IOH7MEbv
gT0Mr5V+XAeLJUJSkFmxqg75U+mnUkpFF2dJiWivjvnuFO+khdjbVYNMD11n4fCQvN9AywzH23uo
03iOY1uv27ENeBfieFxiRwFfEkPDgTyIL3W6zgL0MEvxetk5kc0EJTlhvin7PwD/BtosN2dlfPvw
cjTKbdf43fru+WnFknH4cQq1LzN/foZqp+4FmoLjCvda21+Ckediz5mOhl0Gzuof8AuDFvReF5OU
Or, without the (useless) cat + pipe:
base64 /dev/urandom
(Same kind of output ^^ )
EDIT: you can also use the --wrap option of base64, to avoid having "short lines":
base64 --wrap=0 /dev/urandom
This will remove wrapping, and you'll get "full-screen" display ^^
A number of folks have suggested catting and piping through base64 or uuencode. One issue with this is that you can't control how much data to read (it will continue forever, or until you hit ctrl+c). Another possibility is to use the dd command, which will let you specify how much data to read before exiting. For example, to read 1kb:
dd if=/dev/urandom bs=1k count=1 2>/dev/null | base64
Another option is to pipe to the strings command, which may give more variety in its output (non-printable characters are discarded, and any run of at least 4 printable characters [by default] is displayed). The problem with strings is that it displays each "run" on its own line.
dd if=/dev/urandom bs=1k count=1 2>/dev/null | strings
(of course you can replace the entire command with
strings /dev/urandom
if you don't want it to ever stop).
If you want something really funky, try one of:
cat -v /dev/urandom
dd if=/dev/urandom bs=1k count=1 2>/dev/null | cat -v
So, what is wrong with
cat /dev/urandom | uuencode -
?
Fixed after the first attempt didn't actually work... ::sigh::
BTW-- Many unix utilities use '-' in place of a filename to mean "use the standard input".
There are already several good answers on how to base64 encode random data (i.e. cat /dev/urandom | base64). However in the body of your question you elaborate:
... encode [urandom] on the command-line in such a way that all of its output are readable characters, base64 or uuencode for example.
Given that you don't actually require parseable base64 and just want it to be readable, I'd suggest
cat /dev/urandom | tr -dC '[:graph:]'
base64 only outputs alphanumeric characters and two symbols (+ and / by default). [:graph:] will match any printable non-whitespace ascii, including many symbols/punctuation-marks that base64 lacks. Therefore using tr -dC '[:graph:]' will result in a more random-looking output, and have better input/output efficiency.
I often use < /dev/random stdbuf -o0 tr -Cd '[:graph:]' | stdbuf -o0 head --bytes 32 for generating strong passwords.
You can do more interesting stuff with Bash's process substitution:
uuencode <(head -c 200 /dev/urandom | base64 | gzip)
cat /dev/urandom | tr -dc 'a-zA-Z0-9'
Try
xxd -ps /dev/urandom
xxd(1)
I'm writing a script to change the UUID of an NTFS partition (AFAIK, no existing tool does this). That means writing 8 bytes from 0x48 to 0x4F (72-79 decimal) of /dev/sdaX (X being the number of my partition).
If I wanted to change it to a random UUID, I could use this:
dd if=/dev/urandom of=/dev/sdaX bs=8 count=1 seek=9 conv=notrunc
Or I could change /dev/urandom to /dev/sdaY to clone the UUID from another partition.
But... what if I want to craft a personalized UUID? I already have it stored (and regex-checked) in a $UUID variable in hexadecimal string format (16 characters), like this:
UUID="2AE2C85D31835048"
I was thinking about this approach:
echo "$UUID" | xxd -r -p | dd of=/dev/sdaX ...
This is just a scratch... I’m not sure about the exact options to make it work. My question is:
Is the echo $var | xxd -r | dd really the best approach?
What would be the exact command and options to make it work?
As for the answers, I’m also looking for:
An explanation of all the options used, and what they do.
If possible, an alternative command to test it in a file and/or screen before changing the partition.
I already have a 100-byte dump file called ntfs.bin that I can use for tests, checking the results with
xxd ntfs.bin
So any solution that provides me a way to check results using xxd in screen so I can compare with original ntfs.bin file would be highly appreciated.
Try:
UUID="2AE2C85D31835048"
echo "$UUID" | xxd -r -p | wc -c
echo "$UUID" | xxd -r -p | dd of=file obs=1 oseek=72 conv=notrunc
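To dry-run the whole thing on a 100-byte stand-in for ntfs.bin before touching a real partition, a sketch (file names are made up; GNU dd and xxd assumed), verifying the patched bytes with xxd as the question asks:

```shell
UUID="2AE2C85D31835048"
img=$(mktemp)    # stand-in for the 100-byte ntfs.bin test dump
dd if=/dev/zero of="$img" bs=100 count=1 2>/dev/null

# bs=1 seek=72 : write single bytes starting at offset 72 (0x48)
# conv=notrunc : do not truncate the file after the 8 written bytes
printf '%s' "$UUID" | xxd -r -p | dd of="$img" bs=1 seek=72 conv=notrunc 2>/dev/null

xxd -s 72 -l 8 "$img"    # dump only the 8 patched bytes for inspection
got=$(xxd -s 72 -l 8 "$img" | awk '{print $2 $3 $4 $5}')
size=$(wc -c < "$img")
```

Comparing this xxd output against the original dump shows exactly which bytes changed, and the file size check confirms conv=notrunc kept the rest intact.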