Hex Dump Specific Parts of File - Bash - bash

I'm trying to write a bash script to audit hard drives that have been wiped to ensure the wiping system is working properly. I would like to find a way to hex dump specific parts of a drive without having to hex dump the entire drive and extract the parts I'd like (as this seems to run for too long to make the script worth writing). Ideally, I'd be able to grab parts from the beginning, middle, and end of the drive.
I would like to take the output of the hex dump and check it for the existence of only one character (indicating the drive has been successfully wiped). This part, I can handle, but I thought it may affect any advice I may get.
I've used head piped into xxd to get the beginning of the file which has worked, but I'm still stuck on the other parts. I've tried using tail to just get the end of the drive, but that doesn't seem to work quickly either. Is it possible to do this efficiently? Possibly using dd or something else and pipe it into a hex editor? I've looked through options for xxd as well as hexdump to no avail. If someone could point me in the right direction, it would be greatly appreciated!

xxd has options to skip a ways into the file (-s) and dump a limited length (-l). If you use its plain hex (-p) option, you may be able to use grep to find any anomalies:
$ xxd -s 8192 -l 256 -p /dev/disk3s2 | grep '[^0]'
000000010000000000000000000000000000000000000000000000000000
000000000000000000000000300000000000000800000000000000000000
dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb
dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb
dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb
od has similar skip (-j) and limit length (-N). Similarly, dd has skip= and count= (although these are counted in blocks, not bytes; you can change the block size with bs=).
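As a minimal sketch of the beginning/middle/end check the question describes, the skip and length options above can be wrapped in two small functions. The names region_is_zero and sample_is_zero are hypothetical, the sketch assumes the wipe pattern is all zero bytes, and it uses GNU stat for the size, which works for regular files and disk images; for a block device the size would have to come from something like blockdev --getsize64 instead.

```shell
#!/bin/bash

# Hypothetical helper: succeed if LEN bytes at OFFSET in FILE are all 0x00.
region_is_zero() {
    local file=$1 offset=$2 len=$3
    # -An drops the offset column, -v disables "*" folding, -tx1 prints single bytes.
    # After deleting spaces, newlines and the digit 0, anything left is a non-zero nibble.
    [ -z "$(od -An -v -tx1 -j "$offset" -N "$len" "$file" | tr -d ' \n0')" ]
}

# Hypothetical helper: sample LEN bytes at the start, middle and end of FILE.
sample_is_zero() {
    local file=$1 len=${2:-512} size
    size=$(stat -c %s "$file")      # assumption: regular file or image, GNU stat
    (( size >= len )) || len=$size
    region_is_zero "$file" 0 "$len" &&
    region_is_zero "$file" $(( (size - len) / 2 )) "$len" &&
    region_is_zero "$file" $(( size - len )) "$len"
}
```

It could be used as sample_is_zero /path/to/disk.img 4096 && echo "samples look wiped". Three samples only spot-check the drive, so a clean result is evidence rather than proof.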
EDIT: Since xxd -p is giving weird results (not stopping at what should be the end of the device), I'd recommend running some tests to figure out what's going on. First, back up anything important on the computer, because if something is weird at the device access level, it's possible that some of these tests might overwrite something unexpected, possibly even on another disk.
Next, try dumping to the end of the device with different tools, and see if they all behave the same way:
xxd -s 65451982336 /dev/sdb | more # This *should* dump 512 bytes (32 lines) then stop, but apparently keeps going
od -xv -j 65451982336 /dev/sdb | more # This also *should* dump 512 bytes then stop
dd if=/dev/sdb skip=127835903 | xxd | more # This again should do the same thing (note that the skip value is in 512-byte blocks)
Do the other tools read past what fdisk reports as the end of the disk? If all three read more data, I'm going with the "fdisk is wrong/misleading" answer. You can test further by writing some nonzero data past the "end" and seeing what the results are:
dd if=/dev/random of=/dev/sdb seek=127835903 count=2
...then repeat the various dump commands. If they show two blocks (=64 lines) of random data followed by zeroes, I'm pretty sure the device is bigger than you think it is.
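The byte offset given to xxd and od and the block number given to dd above describe the same position, since 65451982336 / 512 = 127835903. A small sketch of that conversion, with a hypothetical function name and a default block size of 512 bytes to match dd's default:

```shell
#!/bin/bash

# Hypothetical helper: convert a byte offset into a dd block number.
# Refuses offsets that do not fall on a block boundary.
offset_to_block() {
    local offset=$1 bs=${2:-512}
    if (( offset % bs != 0 )); then
        echo "offset $offset is not a multiple of $bs" >&2
        return 1
    fi
    echo $(( offset / bs ))
}

offset_to_block 65451982336    # prints 127835903
```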

I am not near my shell, but something along these lines should get you started:
dd if=/dev/hda1 | hexdump -v -e '/1 "%02x\n"' | grep -v '^00$'
will print all non-zero bytes, one per line.
dd if=/dev/hda1 | od -x -j100
will give you a hexadecimal dump with offsets, starting 100 bytes in.

Related

encode / decode binary data in a qr-code using qrencode and zbarimg in bash

I have some binary data that I want to encode in a qr-code and then be able to decode, all of that in bash. After a search, it looks like I should use qrencode for encoding, and zbarimg for decoding. After a bit of troubleshooting, I still do not manage to decode what I had encoded
Any idea why? Currently the closest I am to a solution is:
$ dd if=/dev/urandom bs=10 count=1 status=none > data.bin
$ xxd data.bin
00000000: b255 f625 1cf7 a051 3d07 .U.%...Q=.
$ cat data.bin | qrencode -l H -8 -o data.png
$ zbarimg --raw --quiet data.png | xxd
00000000: c2b2 55c3 b625 1cc3 b7c2 a051 3d07 0a ..U..%.....Q=..
It looks like I am not very far, but something is still off.
Edit 1: a possible fix is to use base64 wrapping, as explained in the answer by @leagris.
Edit 2: using base64 encoding doubles the size of the message. The reason why I use binary in the first place is to be size-efficient so I would like to avoid that. De-accepting the answer by @leagris as I would like to have it 'full binary', sorry.
Edit 3: as of 2020-03-03 it looks like this is a well-known issue of zbarimg and that a pull request to fix this is on its way:
https://github.com/mchehab/zbar/pull/64
Edit 4: if you know of another command-line tool on Linux that is able to decode QR codes with binary content, please feel free to let me know.
My pull request has been applied. ZBar version 0.23.1 and newer will be able to decode binary QR codes:
zbarimg --raw --oneshot -Sbinary qr.png
zbarcam --raw --oneshot -Sbinary
QR codes have several encoding modes. The simplest, most commonly used and widely supported is the alphanumeric encoding which is suitable for simple text. The byte encoding allows storing arbitrary 8 bit data in the QR code. The ECI mode is like 8 bit mode but with additional metadata that tells the decoder which character set to use in order to decode the binary data back to text. Here's a list of known ECI values and the character encodings they represent. For example, when a decoder encounters an ECI 26 mode QR code it knows to decode the binary data as UTF-8.
The qrencode tool is doing its job correctly: it is creating a byte mode QR code with the data you gave it as its contents. The problem is most decoders were explicitly designed to handle textual data first and foremost. The retrieval of the raw binary data is a detail at best.
Current versions of the zbar library will treat byte mode QR codes as if they were unknown ECI mode QR codes. If a character set isn't specified, it will attempt to guess the encoding and convert the data to it. This will most likely mangle the binary data. As you noted, I brought this up in issue #55 and after some time managed to submit a pull request to improve this. Should it be merged, the library will have binary decoder option that will instruct decoders to return the raw binary data without converting it. Another source of data mangling is the tendency of the command line tools to append line feeds to the output. I submitted a pull request to allow users to prevent this and it has already been merged.
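The mangling in the question's output is consistent with this description. As an illustration only, assuming the decoder guessed ISO-8859-1 and emitted UTF-8 (an assumption based on the bytes shown, not something confirmed for every input), iconv reproduces the same byte sequence. The helper names hex and mangle are hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: print stdin as one lowercase hex string.
hex() { od -An -v -tx1 | tr -d ' \n'; echo; }

# Assumption: the decoder treats the bytes as ISO-8859-1 and converts them to UTF-8.
mangle() { iconv -f ISO-8859-1 -t UTF-8; }

# First three bytes of data.bin from the question: b2 55 f6 (written as octal escapes).
printf '\262U\366' | mangle | hex    # prints c2b255c3b6, as in the zbarimg output
```

Bytes below 0x80 pass through unchanged, while each byte from 0x80 up becomes two bytes, which is why the decoded output is longer than the input.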
The zxing-cpp library will also try to guess the encoding of binary data in QR codes. The comments suggest that the QR code specification requires that decoders pick an encoding without specifying a default or allowing them to return the raw binary data. In order to make that possible, the binary data is copied to a byte array which can be accessed through the DecoderResult. When I have some free time, I intend to write zximg and zxcam tools with binary decoding support for this library.
It's always possible to encode binary data as base 64 and encode the result as an alphanumeric QR code. However, base 64 encoding will increase the size of the data and the alphanumeric mode doesn't allow use of the QR code's maximum capacity. In a comment, you mentioned what you intend to use binary QR codes for:
I want to have a package to effectively dump some gpg stuff in a format that makes recovery easy.
That is the exact use case I'm attempting to enable with my pull request: an easier-to-restore paperkey. 4096 bit RSA secret keys can be directly QR encoded in 8 bit mode but not in alphanumeric mode as base 64-encoded data.
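To put a number on the base 64 overhead mentioned above: padded base 64 emits four characters for every three input bytes, rounded up. A small sketch with a hypothetical function name:

```shell
#!/bin/bash

# Hypothetical helper: length in characters of the padded base 64 encoding of N bytes
# (line breaks added by the base64 tool are not counted).
base64_len() { echo $(( ($1 + 2) / 3 * 4 )); }

base64_len 10     # prints 16
base64_len 512    # prints 684
```

So the 10-byte test file from the question grows to 16 characters before it is even handed to qrencode.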
See also: Storing binary data in QR codes
It looks like zbarimg only supports printable characters and appends a newline:
printf '%s' 'Hello World!' >data.bin
xxd data.bin
qrencode -l H -8 -o data.png -r data.bin
zbarimg --raw --quiet data.png | xxd
I think a better, more portable option would be to base64 encode your binary data before QR encoding.
Like this:
dd if=/dev/urandom bs=10 count=1 status=none > data.bin
xxd data.bin
base64 <data.bin | qrencode -l H -8 -o data.png
zbarimg --raw --quiet data.png | base64 -d | xxd

How to suppress the general information for top command

I wish to suppress the general information for the top command
using a top parameter.
By general information I mean the below stuff :
top - 09:35:05 up 3:26, 2 users, load average: 0.29, 0.22, 0.21
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.3%us, 0.7%sy, 0.0%ni, 96.3%id, 0.8%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 3840932k total, 2687880k used, 1153052k free, 88380k buffers
Swap: 3998716k total, 0k used, 3998716k free, 987076k cached
What I do not wish to do is :
top -u user | grep process_name
or
top -bp $(pgrep process_name) | do_something
How can I achieve this?
Note: I am on Ubuntu 12.04 and top version is 3.2.8.
Came across this question today. I have a potential solution - create a top configuration file from inside top's interactive mode when the summary area is disabled. Since this file is also read at startup of top in batch mode, it will cause the summary area to be disabled in batch mode too.
Follow these steps to set it up..
Launch top in interactive mode.
Once inside interactive mode, disable the summary area by successively pressing 'l', 'm' and 't'.
Press 'W' (upper case) to write your top configuration file (normally, ~/.toprc)
Exit interactive mode.
Now when you run top in batch mode the summary area will not appear (!)
Taking it one step further...
If you only want this for certain situations and still want the summary area most of the time, you could use an alternate top configuration file. However, AFAIK, the way to get top to use an alternate config file is a bit funky. There are a couple of ways to do this. The approach I use is as follows:
Create a soft-link to the top executable. This does not have to be done as root, as long as you have write access to the link's location...
ln -s /usr/bin/top /home/myusername/bin/omgwtf
Launch top by typing the name of the link ('omgwtf') rather than 'top'. You will be in normal top interactive mode, but when you save the configuration file it will write to ~/.omgwtfrc, leaving ~/.toprc alone.
Disable the summary area and write the configuration file same as before (press 'l', 'm', 't' and 'W')
In the future, when you're ready to run top without summary info in batch mode, you'll have to invoke top via the link name you created. For example,
% omgwtf -usyslog -bn1
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
576 syslog 20 0 264496 8144 1352 S 0.0 0.1 0:03.66 rsyslogd
%
If you're running top in batch mode (-b -n1), just delete the header lines with sed:
top -b -n1 | sed 1,7d
That will remove the first 7 header lines that top outputs and return only the processes.
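Counting lines depends on how many summary lines a given top prints. A minimal sketch that keys on the column header instead, assuming the header line begins with optional spaces followed by PID (the function name is hypothetical):

```shell
#!/bin/bash

# Hypothetical helper: drop everything before the "PID ..." column header on stdin,
# then print the header and all process lines that follow it.
strip_top_summary() {
    awk '/^ *PID / { seen = 1 } seen'
}
```

It would be used as top -b -n1 | strip_top_summary.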
It's known as the "Summary Area" and I don't think there is a way to disable it at top initialization.
But while top is running, you can disable those by pressing l, t, m.
From man top:
Summary-Area-defaults
'l' - Load Avg/Uptime On (thus program name)
't' - Task/Cpu states On (1+1 lines, see '1')
'm' - Mem/Swap usage On (2 lines worth)
'1' - Single Cpu On (thus 1 line if smp)
This will dump the output and it can be redirected to any file if needed.
top -n1 |grep -Ev "Tasks:|Cpu\(s\):|Swap:|Mem:"
To monitor a particular process, the following command works for me:
top -sbn1 -p $(pidof <process_name>) | grep $(pidof <process_name>)
And to get the information for all processes, you can use the following:
top -sbn1|sed -n '/PID/,/^$/p'
egrep may be good enough in this case, but I would add that perl -lane could do this kind of thing with lightning speed:
top -b -n 1 | perl -lane '/PID/ and $x=1; $x and print' | head -n10
This way you may forget the precise arguments for grep, sed, awk, etc. for good because perl is typically much faster than those tools.
On a mac you cannot use -b which is used in many of the other answers.
In that case the command would be top -n1 -l1 | sed 1,10d
This grabs only the first process line (and its header) and logs a single sample instead of running interactively; sed then suppresses the general information, which is the first 10 lines.

How to resume reading a file?

I'm trying to find the best and most efficient way to resume reading a file from a given point.
The given file is being written frequently (this is a log file).
This file is rotated on a daily basis.
In the log file I'm looking for the pattern 'slow transaction'. Such lines end with a number in parentheses. I want the sum of those numbers.
Example of log line:
Jun 24 2015 10:00:00 slow transaction (5)
Jun 24 2015 10:00:06 slow transaction (1)
This is the easy part, which I could do with an awk command to get a total of 6 for the above example.
Now my challenge is that I want to get the values from this file on a regular basis. I've an external system that polls a custom OID using SNMP. When hitting this OID the Linux host runs a couple of basic commands.
I want this SNMP polling event to get the number of events since the last polling only. I don't want to have the total every time, just the total of the newly added lines.
Just to mention that only bash can be used, or basic commands such as awk sed tail etc. No perl or advanced programming language.
I hope my description is clear enough. Apologies if this is a duplicate. I did some research before posting but did not find anything that precisely corresponds to my need.
Thank you for any assistance
In addition to the methods in the comment link, you can also simply use dd and stat to read the logfile size, save it and sleep 300 then check the logfile size again. If the filesize has changed, then skip over the old information with dd and read the new information only.
Note: the example tests for the case where the logfile is deleted and then restarted with a smaller size (newsize < size), or was empty to begin with, and reads the whole file in that case, since dd cannot skip a zero-sized block.
Here is a short example with 5 minute intervals:
#!/bin/bash
lfn=${1:-/path/to/logfile}
size=$(stat -c "%s" "$lfn")          ## save original log size
while :; do
    newsize=$(stat -c "%s" "$lfn")   ## get new log size
    if ((size != newsize)); then     ## if change, use new info
        if ((size > 0 && newsize > size)); then
            ## use dd to skip over existing text to new text
            newtext=$(dd if="$lfn" bs="$size" skip=1 2>/dev/null)
        else
            ## log was empty or restarted: read all of it
            newtext=$(<"$lfn")
        fi
        ## process newtext however you need
        printf "\nnewtext:\n\n%s\n" "$newtext"
        size=$newsize                ## update size to newsize
    fi
    sleep 300
done
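The loop above keeps the last size in memory. For a poll-driven setup like the SNMP hook in the question, the offset could instead live in a small state file so that each poll is a single short command. This is only a sketch under assumptions: the function name and state file are hypothetical, it relies on GNU stat and head -c, and it uses tail -c rather than dd to skip the already-read bytes.

```shell
#!/bin/bash

# Hypothetical helper: print the sum of "slow transaction (N)" values added to LOG
# since the previous call, remembering the byte offset in STATE.
sum_new_slow() {
    local log=$1 state=$2 old=0 new
    [ -r "$state" ] && old=$(cat "$state")
    new=$(stat -c %s "$log")
    (( new < old )) && old=0            # log was rotated: start from the top
    printf '%s\n' "$new" > "$state"
    # tail skips the bytes already seen; head stops at the size recorded above
    tail -c +"$(( old + 1 ))" "$log" | head -c "$(( new - old ))" |
        awk -F'[()]' '/slow transaction/ { sum += $2 } END { print sum + 0 }'
}
```

Called as sum_new_slow /path/to/logfile /var/tmp/slow.offset, it prints 6 for the two example lines on the first poll and 0 on the next poll if nothing was appended. A rotation that leaves the new file larger than the old offset would go unnoticed by the size test alone.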

How to zgrep the last line of a gz file without tail

Here is my problem, I have a set of big gz log files, the very first info in the line is a datetime text, e.g.: 2014-03-20 05:32:00.
I need to check what set of log files holds a specific data.
For the init I simply do a:
'-query-data-'
zgrep -m 1 '^20140320-04' 20140320-0{3,4}*gz
BUT HOW can I do the same with the last line without processing the whole file, as would be done with zcat (too heavy):
zcat foo.gz | tail -1
Additional info: those logs are named with the date and time of their initial record, so if I want to query logs at 14:00:00 I also have to search in files created BEFORE 14:00:00, as a file could be created at 13:50:00 and closed at 14:10:00.
The easiest solution would be to alter your log rotation to create smaller files.
The second easiest solution would be to use a compression tool that supports random access.
Projects like dictzip, BGZF, and csio each add sync flush points at various intervals within gzip-compressed data, which lets a program that is aware of that extra information seek within the file. While the standard provides for this, vanilla gzip does not add such markers either by default or by option.
Files compressed by these random-access-friendly utilities are slightly larger (by perhaps 2-20%) due to the markers themselves, but fully support decompression with gzip or another utility that is unaware of these markers.
You can learn more at this question about random access in various compression formats.
There's also a "Blasted Bioinformatics" blog by Peter Cock with several posts on this topic, including:
BGZF - Blocked, Bigger & Better GZIP! – gzip with random access (like dictzip)
Random access to BZIP2? – An investigation (result: can't be done, though I do it below)
Random access to blocked XZ format (BXZF) – xz with improved random access support
Experiments with xz
xz (an LZMA compression format) actually has random access support on a per-block level, but you will only get a single block with the defaults.
File creation
xz can concatenate multiple archives together, in which case each archive would have its own block. The GNU split can do this easily:
split -b 50M --filter 'xz -c' big.log > big.log.sp.xz
This tells split to break big.log into 50MB chunks (before compression) and run each one through xz -c, which outputs the compressed chunk to standard output. We then collect that standard output into a single file named big.log.sp.xz.
To do this without GNU, you'd need a loop:
split -b 50M big.log big.log-part
for p in big.log-part*; do xz -c "$p"; done > big.log.sp.xz
rm big.log-part*
Parsing
You can get the list of block offsets with xz --verbose --list FILE.xz. If you want the last block, you need its compressed size (column 5) plus 36 bytes for overhead (found by comparing the size to hd big.log.sp0.xz |grep 7zXZ). Fetch that block using tail -c and pipe that through xz. Since the above question wants the last line of the file, I then pipe that through tail -n1:
SIZE=$(xz --verbose --list big.log.sp.xz |awk 'END { print $5 + 36 }')
tail -c $SIZE big.log.sp.xz |unxz -c |tail -n1
Side note
Version 5.1.1 introduced support for the --block-size flag:
xz --block-size=50M big.log
However, I have not been able to extract a specific block since it doesn't include full headers between blocks. I suspect this is nontrivial to do from the command line.
Experiments with gzip
gzip also supports concatenation. I (briefly) tried mimicking this process for gzip without any luck. gzip --verbose --list doesn't give enough information and it appears the headers are too variable to find.
This would require adding sync flush points, and since their size varies on the size of the last buffer in the previous compression, that's too hard to do on the command line (use dictzip or another of the previously discussed tools).
I did apt-get install dictzip and played with dictzip, but just a little. It doesn't work without arguments, creating a (massive!) .dz archive that neither dictunzip nor gunzip could understand.
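One way around the hard-to-find headers, which is not what was tried above, is to record each member's starting offset while writing the concatenated file instead of searching for it afterwards. The sketch below assumes you control how the archive is created; the function names and the .idx sidecar file are hypothetical, and it relies on gzip accepting concatenated members as noted above.

```shell
#!/bin/bash

# Hypothetical helper: gzip SRC into OUT as concatenated members of LINES lines each,
# appending every member's starting byte offset to the sidecar file OUT.idx.
write_chunked_gz() {
    local src=$1 out=$2 lines=${3:-100000} tmp part
    tmp=$(mktemp -d)
    split -l "$lines" "$src" "$tmp/part."
    : > "$out"
    : > "$out.idx"
    for part in "$tmp"/part.*; do
        stat -c %s "$out" >> "$out.idx"   # current size is where the next member starts
        gzip -c "$part" >> "$out"
    done
    rm -rf "$tmp"
}

# Hypothetical helper: decompress only the final member and print its last line.
last_line_chunked_gz() {
    local out=$1 offset
    offset=$(tail -n 1 "$out.idx")
    tail -c +"$(( offset + 1 ))" "$out" | gunzip -c | tail -n 1
}
```

The result is still a valid gzip stream, so zcat and zgrep read it normally; only the last-line lookup uses the index.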
Experiments with bzip2
bzip2 has headers we can find. This is still a bit messy, but it works.
Creation
This is just like the xz procedure above:
split -b 50M --filter 'bzip2 -c' big.log > big.log.sp.bz2
I should note that this is considerably slower than xz (48 min for bzip2 vs 17 min for xz vs 1 min for xz -0) as well as considerably larger (97M for bzip2 vs 25M for xz -0 vs 15M for xz), at least for my test log file.
Parsing
This is a little harder because we don't have the nice index. We have to guess at where to go, and we have to err on the side of scanning too much, but with a massive file, we'd still save I/O.
My guess for this test was 50000000 (out of the original 52428800, a pessimistic guess that isn't pessimistic enough for e.g. an H.264 movie.)
GUESS=50000000
LAST=$(tail -c$GUESS big.log.sp.bz2 \
|grep -abo 'BZh91AY&SY' |awk -F: 'END { print '$GUESS'-$1 }')
tail -c $LAST big.log.sp.bz2 |bunzip2 -c |tail -n1
This takes just the last 50 million bytes, finds the binary offset of the last BZIP2 header, subtracts that from the guess size, and pulls that many bytes off of the end of the file. Just that part is decompressed and thrown into tail.
Because this has to query the compressed file twice and has an extra scan (the grep call seeking the header, which examines the whole guessed space), this is a suboptimal solution. See also the below section on how slow bzip2 really is.
Perspective
Given how fast xz is, it's easily the best bet; using its fastest option (xz -0) is quite fast to compress or decompress and creates a smaller file than gzip or bzip2 on the log file I was testing with. Other tests (as well as various sources online) suggest that xz -0 is preferable to bzip2 in all scenarios.
                ————— No Random Access ——————   ——————— Random Access ———————
FORMAT           SIZE   RATIO   WRITE    READ    SIZE   RATIO   WRITE    SEEK
—————————       —————————————————————————————   —————————————————————————————
(original)      7211M  1.0000       -    0:06   7211M  1.0000       -    0:00
bzip2             96M  0.0133   48:31    3:15     97M  0.0134   47:39    0:00
gzip              79M  0.0109    0:59    0:22
dictzip          605M  0.0839    1:36  (fail)
xz -0             25M  0.0034    1:14    0:12     25M  0.0035    1:08    0:00
xz                14M  0.0019   16:32    0:11     14M  0.0020   16:44    0:00
Timing tests were not comprehensive, I did not average anything and disk caching was in use. Still, they look correct; there is a very small amount of overhead from split plus launching 145 compression instances rather than just one (this may even be a net gain if it allows an otherwise non-multithreaded utility to consume multiple threads).
Well, you can access randomly a gzipped file if you previously create an index for each file ...
I've developed a command line tool which creates indexes for gzip files which allow for very quick random access inside them:
https://github.com/circulosmeos/gztool
The tool has two options that may be of interest for you:
-S supervises a still-growing file and creates an index for it as it grows. This can be useful for gzipped rsyslog files, as it reduces the index creation time to practically zero.
-t tails a gzip file: this way you can do: $ gztool -t foo.gz | tail -1
Please note that if the index doesn't exist, this will take as long as a complete decompression; but since the index is reusable, subsequent searches will be much faster!
This tool is based on zran.c demonstration code from original zlib, so there's no out-of-the-rules magic!

fastest hashing in a unix environment?

I need to examine the output of a certain script 1000s of times on a unix platform and check if any of it has changed from before.
I've been doing this:
(script_stuff) | md5sum
and storing this value. I actually don't really need "md5", JUST a simple hash function which I can compare against a stored value to see if it's changed. It's okay if there is an occasional false positive.
Is there anything better than md5sum that works faster and generates a fairly usable hash value? The script itself generates a few lines of text - maybe 10-20 on average to max 100 or so.
I had a look at fast md5sum on millions of strings in bash/ubuntu - that's wonderful, but I can't compile a new program. Need a system utility... :(
Additional "background" details:
I've been asked to monitor the DNS record of a set of 1000 or so domains and immediately call certain other scripts if there has been any change. I intend to do a dig xyz +short statement and hash its output and store that, and then check it against a previously stored value. Any change will trigger the other script, otherwise it just goes on. Right now, we're planning on using cron for a set of these 1000, but can think completely differently for "seriously heavy" usage - ~20,000 or so.
I have no idea what the use of such a system would be, I'm just doing this as a job for someone else...
The cksum utility calculates a non-cryptographic CRC checksum.
How big is the output you're checking? A hundred lines max. I'd just save the entire original file then use cmp to see if it's changed. Given that a hash calculation will have to read every byte anyway, the only way you'll get an advantage from a checksum type calculation is if the cost of doing it is less than reading two files of that size.
And cmp won't give you any false positives or negatives :-)
pax> echo hello >qq1.txt
pax> echo goodbye >qq2.txt
pax> cp qq1.txt qq3.txt
pax> cmp qq1.txt qq2.txt >/dev/null
pax> echo $?
1
pax> cmp qq1.txt qq3.txt >/dev/null
pax> echo $?
0
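Applied to the script-output case, the cmp approach might look like the following sketch. The function name and the idea of one state file per monitored item are assumptions for illustration; the function reads the fresh output on stdin, compares it with the stored copy, and replaces the copy when it differs.

```shell
#!/bin/bash

# Hypothetical helper: read new output on stdin and compare it with STATE.
# Returns 0 (and updates STATE) on first run or change, 1 if unchanged.
output_changed() {
    local state=$1 tmp
    tmp=$(mktemp)
    cat > "$tmp"
    if [ -f "$state" ] && cmp -s "$tmp" "$state"; then
        rm -f "$tmp"
        return 1
    fi
    mv "$tmp" "$state"
    return 0
}
```

It would be used as (script_stuff) | output_changed /path/to/state && run_other_script, where run_other_script stands in for whatever should fire on a change.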
Based on your question update:
I've been asked to monitor the DNS record of a set of 1000 or so domains and immediately call certain other scripts if there has been any change. I intend to do a dig xyz +short statement and hash its output and store that, and then check it against a previously stored value. Any change will trigger the other script, otherwise it just goes on. Right now, we're planning on using cron for a set of these 1000, but can think completely differently for "seriously heavy" usage - ~20,000 or so.
I'm not sure you need to worry too much about the file I/O. The following script executed dig microsoft.com +short 5000 times first with file I/O then with output to /dev/null (by changing the comments).
#!/bin/bash
rm -rf qqtemp
mkdir qqtemp
((i = 0))
while [[ $i -ne 5000 ]] ; do
    #dig microsoft.com +short >qqtemp/microsoft.com.$i
    dig microsoft.com +short >/dev/null
    ((i = i + 1))
done
The elapsed times at 5 runs each are:
File I/O | /dev/null
----------+-----------
3:09 | 1:52
2:54 | 2:33
2:43 | 3:04
2:49 | 2:38
2:33 | 3:08
After removing the outliers and averaging, the results are 2:49 for the file I/O and 2:45 for the /dev/null. The time difference is four seconds for 5000 iterations, only 1/1250th of a second per item.
However, since an iteration over the 5000 takes up to three minutes, that's how long it will take maximum to detect a problem (a minute and a half on average). If that's not acceptable, you need to move away from bash to another tool.
Given that a single dig only takes about 0.012 seconds, you should theoretically do 5000 in sixty seconds assuming your checking tool takes no time at all. You may be better off doing something like this in Perl and using an associative array to store the output from dig.
Perl's semi-compiled nature means that it will probably run substantially faster than a bash script and Perl's fancy stuff will make the job a lot easier. However, you're unlikely to get that 60-second time much lower just because that's how long it takes to run the dig commands.

Resources