fastest hashing in a unix environment? - shell

I need to examine the output of a certain script 1000s of times on a unix platform and check if any of it has changed from before.
I've been doing this:
(script_stuff) | md5sum
and storing this value. I actually don't really need "md5", JUST a simple hash function which I can compare against a stored value to see if its changed. Its okay if there are an occassional false positive.
Is there anything better than md5sum that works faster and generates a fairly usable hash value? The script itself generates a few lines of text - maybe 10-20 on average to max 100 or so.
I had a look at fast md5sum on millions of strings in bash/ubuntu - that's wonderful, but I can't compile a new program. Need a system utility... :(
Additional "background" details:
I've been asked to monitor the DNS record of a set of 1000 or so domains and immediately call certain other scripts if there has been any change. I intend to do a dig xyz +short statement and hash its output and store that, and then check it against a previously stored value. Any change will trigger the other script, otherwise it just goes on. Right now, we're planning on using cron for a set of these 1000, but can think completely diffeerently for "seriously heavy" usage - ~20,000 or so.
I have no idea what the use of such a system would be, I'm just doing this as a job for someone else...

The cksum utility calculates a non-cryptographic CRC checksum.

How big is the output you're checking? A hundred lines max. I'd just save the entire original file then use cmp to see if it's changed. Given that a hash calculation will have to read every byte anyway, the only way you'll get an advantage from a checksum type calculation is if the cost of doing it is less than reading two files of that size.
And cmp won't give you any false positives or negatives :-)
pax> echo hello >qq1.txt
pax> echo goodbye >qq2.txt
pax> cp qq1.txt qq3.txt
pax> cmp qq1.txt qq2.txt >/dev/null
pax> echo $?
1
pax> cmp qq1.txt qq3.txt >/dev/null
pax> echo $?
0
Based on your question update:
I've been asked to monitor the DNS record of a set of 1000 or so domains and immediately call certain other scripts if there has been any change. I intend to do a dig xyz +short statement and hash its output and store that, and then check it against a previously stored value. Any change will trigger the other script, otherwise it just goes on. Right now, we're planning on using cron for a set of these 1000, but can think completely diffeerently for "seriously heavy" usage - ~20,000 or so.
I'm not sure you need to worry too much about the file I/O. The following script executed dig microsoft.com +short 5000 times first with file I/O then with output to /dev/null (by changing the comments).
#!/bin/bash
rm -rf qqtemp
mkdir qqtemp
((i = 0))
while [[ $i -ne 5000 ]] ; do
#dig microsoft.com +short >qqtemp/microsoft.com.$i
dig microsoft.com +short >/dev/null
((i = i + 1))
done
The elapsed times at 5 runs each are:
File I/O | /dev/null
----------+-----------
3:09 | 1:52
2:54 | 2:33
2:43 | 3:04
2:49 | 2:38
2:33 | 3:08
After removing the outliers and averaging, the results are 2:49 for the file I/O and 2:45 for the /dev/null. The time difference is four seconds for 5000 iterations, only 1/1250th of a second per item.
However, since an iteration over the 5000 takes up to three minutes, that's how long it will take maximum to detect a problem (a minute and a half on average). If that's not acceptable, you need to move away from bash to another tool.
Given that a single dig only takes about 0.012 seconds, you should theoretically do 5000 in sixty seconds assuming your checking tool takes no time at all. You may be better off doing something like this in Perl and using an associative array to store the output from dig.
Perl's semi-compiled nature means that it will probably run substantially faster than a bash script and Perl's fancy stuff will make the job a lot easier. However, you're unlikely to get that 60-second time much lower just because that's how long it takes to run the dig commands.

Related

Bash script - check how many times public IP changes

I am trying to create my first bash script. The goal of this script is to check at what rate my public IP changes. It is a fairly straight forward script. First it checks if the new address is different from the old one. If so then it should update the old one to the new one and print out the date along with the new IP address.
At this point I have created a simple script in order to accomplish this. But I have two main problems.
First the script keeps on printing out the IP even tough it hasn't changed and I have updated the PREV_IP with the CUR_IP.
My second problem is that I want the output to direct to a file instead of outputting it into the terminal.
The interval is currently set to 1 second for test purposes. This will change to a higher interval in the final product.
#!/bin/bash
while true
PREV_IP=00
do
CUR_IP=$(curl https://ipinfo.io/ip)
if [ $PREV_IP != "$CUR_IP" ]; then
PREV_IP=$CUR_IP
"$(date)"
echo "$CUR_IP"
sleep 1
fi
done
I also get a really weird output. I have edited my public IP to xx.xxx.xxx.xxx:
Sat 20 Mar 09:45:29 CET 2021
xx.xxx.xxx.xxx
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:--
while true
PREV_IP=00
do
is the reason you are seeing ip each loop. It's the same as while true; PREV_IP=00; do. The exit status of true; PREV_IP=00 is the exit status of last command - the exit status of assignment is 0 (success) - so the loop will always execute. But PREV_IP will be reset to 00 each loop... This is a typo and you meant to set prev_ip once, before the loop starts.
"$(date)"
will try execute the output of date command, as a next command. So it will print:
$ "$(date)"
bash: sob, 20 mar 2021, 10:57:02 CET: command not found
And finally, to silence curl, read man curl first and then find out about -s. I use -sS so errors are also visible.
Do not use uppercase variables in your scripts. Prefer lower case variables. Check you scripts with http://shellcheck.net . Quote variable expansions.
I would sleep each loop. Your script could look like this:
#!/bin/bash
prev=""
while true; do
cur=$(curl -sS https://ipinfo.io/ip)
if [ "$prev" != "$cur" ]; then
prev="$cur"
echo "$(date) $cur"
fi
sleep 1
done
that I want the output to direct to a file instead of outputting it into the terminal.
Then research how redirection works in shell and how to use it. The simplest would be to redirect echo output.
echo "$(date) $cur" >> "a_file.txt"
The interval is currently set to 1 second for test purposes. This will change to a higher interval in the final product.
You are still limited with the time it takes to connect to https://ipinfo.io/ip. And from ipinfo.io documentation:
Free usage of our API is limited to 50,000 API requests per month.
And finally, I wrote a script where I tried to use many public services as I found ,get_ip_external for getting external ip address. You may take multiple public services for getting ipv4 address and choose a random/round-robin one so that rate-limiting don't kick that fast.

bash asynchronous variable setting (dns lookup)

Let's say we had a loop that we want to have run as quickly as possible. Let's say something was being done to a list of hosts inside that loop; just for the sake of argument, let's say it was a redis query. Let's say that the list of hosts may change occasionally due to hosts being added/removed from a pool (not load balanced); however, the list is predictable (e.g., they all start with “foo” and end with 2 digits. So we want to run this occasionally; say, once every 15 minutes:
listOfHosts=$(dig +noall +ans foo{00..99}.domain | while read -r n rest; do printf '%s\n' ${n%.}; done)
to get the list of hosts. Let's say our loop looked something like this:
while :; do
for i in $listOfHosts; do
redis-cli -h $i llen something
done
(( ( $(date +%s) % 60 * 15) == 0 )) && callFunctionThatSetslistOfHosts
done
(now obviously there's some things missing, like testing to see if we've already run callFunctionThatSetslistOfHosts in the current minute and only running it once, and doing something with the redis output, and maybe the list of hosts should be an array, but basically this is it.)
How can we run callFunctionThatSetslistOfHosts asynchronously so that it doesn't slow down the loop. I.e., have it running in the background setting listOfHosts occasionally (e.g. once every 15 minutes), so that the next time the inner loop is run it gets a potentially different set of hosts to run the redis query on?
My major problem seems to be that in order to set listOfHosts in a loop, that loop has to be a subshell, and listOfHosts is local to that subshell, and setting it doesn't affect the global listOfHosts.
I may resort to pipes, but will have to poll the reader before generating a new list — not that that's terribly bad if I poll slowly, but I thought I'd present this as a problem.
Thanks.

Keep looping until ram is full then pause and continue the loop once it get free again

I have to run a command on many files which I do by bash looping and an ampersand at the end of each command so files run in parallel. Yet I don't want to consume all the ram on the server in case where I have 100 or such files. Is there a way to keep the loop going on until ram usage exceed certain threshold and at this point the loop pauses and once ram is free again then continue looping to next files ? Thanks
Yes, you can. Use the free command to determine the RAM usage. Then compare the current usage to your threshold. Wrap the condition of the loop into a function to make it a bit more readable:
ramAboveThreshold() {
local threshold=6000000000 # threshold in bytes (6 GB)
local used="$(free -b | awk 'NR == 2 {print $2}')"
(( used > threshold ))
}
Inside your old loop, place another loop that waits for the RAM usage to drop under your threshold:
for i in myFiles/*; do
while ramAboveThreshold; do
sleep 5
done
myCommand "$i" &
done
free does not only print the used RAM, but also the free and total RAM, so the script could be altered to have a a threshold like »at least n bytes free« or even »less than 60% used«.

How to resume reading a file?

I'm trying to find the best and most efficient way to resume reading a file from a given point.
The given file is being written frequently (this is a log file).
This file is rotated on a daily basis.
In the log file I'm looking for a pattern 'slow transaction'. End of such lines have a number into parentheses. I want to have the sum of the numbers.
Example of log line:
Jun 24 2015 10:00:00 slow transaction (5)
Jun 24 2015 10:00:06 slow transaction (1)
This is easy part that I could do with awk command to get total of 6 with above example.
Now my challenge is that I want to get the values from this file on a regular basis. I've an external system that polls a custom OID using SNMP. When hitting this OID the Linux host runs a couple of basic commands.
I want this SNMP polling event to get the number of events since the last polling only. I don't want to have the total every time, just the total of the newly added lines.
Just to mention that only bash can be used, or basic commands such as awk sed tail etc. No perl or advanced programming language.
I hope my description will be clear enough. Apologizes if this is duplicate. I did some researches before posting but did not find something that precisely correspond to my need.
Thank you for any assistance
In addition to the methods in the comment link, you can also simply use dd and stat to read the logfile size, save it and sleep 300 then check the logfile size again. If the filesize has changed, then skip over the old information with dd and read the new information only.
Note: you can add a test to handle the case where the logfile is deleted and then restarted with 0 size (e.g. if $((newsize < size)) then read all.
Here is a short example with 5 minute intervals:
#!/bin/bash
lfn=${1:-/path/to/logfile}
size=$(stat -c "%s" "$lfn") ## save original log size
while :; do
newsize=$(stat -c "%s" "$lfn") ## get new log size
if ((size != newsize)); then ## if change, use new info
## use dd to skip over existing text to new text
newtext=$(dd if="$lfn" bs="$size" skip=1 2>/dev/null)
## process newtext however you need
printf "\nnewtext:\n\n%s\n" "$newtext"
size=$((newsize)); ## update size to newsize
fi
sleep 300
done

Hex Dump Specific Parts of File - Bash

I'm trying to write a bash script to audit hard drives that have been wiped to ensure the wiping system is working properly. I would like to find a way to hex dump specific parts of a drive without having to hex dump the entire drive and extract the parts I'd like (as this seems to run for too long to make the script worth writing). Ideally, I'd be able to grab parts from the beginning, middle, and end of the drive.
I would like to take the output of the hex dump and check it for the existence of only one character (indicating the drive has been successfully wiped). This part, I can handle, but I thought it may affect any advice I may get.
I've used head piped into xxd to get the beginning of the file which has worked, but I'm still stuck on the other parts. I've tried using tail to just get the end of the drive, but that doesn't seem to work quickly either. Is it possible to do this efficiently? Possibly using dd or something else and pipe it into a hex editor? I've looked through options for xxd as well as hexdump to no avail. If someone could point me in the right direction, it would be greatly appreciated!
xxd has options to skip a ways into the file (-s) and dump a limited length (-l). If you use its plain hex (-p) option, you may be able to use grep to find any anomalies:
$ xxd -s 8192 -l 256 -p /dev/disk3s2 | grep [^0]
000000010000000000000000000000000000000000000000000000000000
000000000000000000000000300000000000000800000000000000000000
dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb
dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb
dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb
od has similar skip (-j) and limit length (-N). Similarly, dd has skip= and count= (although these are counted in blocks, not bytes; you can change the block size with bs=).
EDIT: Since xxd -p is giving weird results (not stopping at what should be the end of the device), I'd recommend running some tests to figure out what's going on. First, back up anything important on the computer, because if something is weird at the device access level, it's possible that some of these tests might overwrite something unexpected, possibly even on another disk.
Next, try dumping to the end of the device with different tools, and see if they all behave the same way:
xxd -s 65451982336 /dev/sdb | more # This *should* dump 512 bytes (32 lines) then stop, but apparently keeps going
od -xv -j 65451982336 /dev/sdb | more # This also *should* dump 512 bytes then stop
dd if=/dev/sdb skip=127835903 | xxd | more # This again should do the same thing (note that the skip value is in 512-byte blocks)
Do the other tools read past what fdisk reports as the end of the disk? If all three read more data, I'm going with the "fdisk is wrong/misleading" answer. You can test further by writing some nonzero data past the "end" and seeing what the results are:
dd if=/dev/random of=/dev/sdb seek=127835903 count=2
...then repeat the various dump commands. If they show two blocks (=64 lines) of random data followed by zeroes, I'm pretty sure the device is bigger than you think it is.
I am not near my shell, but something along these lines should get you started:
dd if=/dev/hda1 | hexdump -C | grep [^00]
will print all non-zero bytes.
dd if=/dev/hda1 | od -x -j100
will give you a hexadecimal dump with offsets, starting 100 bytes in.

Resources