Cancelling long-running Elasticsearch tasks times out - bash

My _search requests had been gradually becoming slower and slower, to the point of 504 gateway timeouts. Then I saw dozens of super-long-running indices:data/read/search tasks with no end in sight, so I tried to cancel them using POST _tasks/_cancel?actions=*search (note that I only have one node of interest, so I didn't need the &node=... param).
This only resulted in another (cancel) task being registered, and now even my GET _tasks and GET _cat/tasks?v requests are timing out.
I'm wondering whether it's possible to:
set a cap on the running_time_in_nanos attribute of all search tasks and/or auto-cancel all tasks that exceed it
force-cancel the tasks without having to restart the ES service, given that the Tasks API itself is timing out
Side note: I already have a health-check bash script:
if [ "$(curl -s -m 20 https://my-es-instance | grep "You Know, for Search" | wc -l)" -ne 1 ];
then
    echo "$(date "+%F %T") app not responding" >> "$my_log_file"
    ...
but it doesn't account for the fact that while the root endpoint (GET /) may be responding, the _search endpoints are not.
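One variation (a sketch, reusing the placeholder host and log file from above) is to probe a cheap _search call instead of only GET /, so that a wedged search thread pool also shows up as a failure:
# Sketch: check a lightweight search instead of just the root endpoint.
# Adjust authentication and index name to your setup.
status=$(curl -s -m 20 -o /dev/null -w '%{http_code}' "https://my-es-instance/_search?size=0")
if [ "$status" -ne 200 ]; then
    echo "$(date "+%F %T") search endpoint not responding (HTTP $status)" >> "$my_log_file"
fi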
What are the best practices here?

Related

Avoid mass e-mail notification in error analysis bash script

I am selecting error log details from a Docker container and deciding within a shell script how and when to alert about the issue via Discord and/or email.
Because I am receiving the email alerts too often with the same information in the email body, I want to implement the following two adjustments:
Fatal error log selection:
FATS="$(docker logs --since 24h $NODENAME 2>&1 | grep 'FATAL' | grep -v 'INFO')"
Email sent, in case FATS has some content:
swaks --from "$MAILFROM" --to "$MAILTO" --server "$MAILSERVER" --auth LOGIN --auth-user "$MAILUSER" --auth-password "$MAILPASS" --h-Subject "FATAL ERRORS FOUND" --body "$FATS" --silent "1"
How can I send the email only when FATS has different content than it had on the previous run of the script? I have thought about hashing its content and storing the hash in a text file. If the hash is the same as on the previous script run, the email will be skipped.
Another option could be a local, temporary variable in the user's global bash profile, so that no file has to be stored on the file system (to avoid reads/writes).
How can I do that?
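A minimal sketch of the hash-and-compare idea described above (the state-file path is hypothetical, and the swaks call is just the one from the question):
# Send the mail only when the hash of FATS differs from the stored one.
state_file=/var/tmp/last_fats.md5      # hypothetical location for the stored hash

new_hash=$(printf '%s' "$FATS" | md5sum | awk '{print $1}')
old_hash=$(cat "$state_file" 2>/dev/null)

if [ -n "$FATS" ] && [ "$new_hash" != "$old_hash" ]; then
    swaks --from "$MAILFROM" --to "$MAILTO" --server "$MAILSERVER" --auth LOGIN \
          --auth-user "$MAILUSER" --auth-password "$MAILPASS" \
          --h-Subject "FATAL ERRORS FOUND" --body "$FATS" --silent "1"
    printf '%s\n' "$new_hash" > "$state_file"
fi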
When you are writing a script for your monitoring, add functions for additional functionality, like:
logging all the alerts that have been sent
making sure you don't send more than 1 alert each hour (see the sketch after this list)
sending warnings only during working hours
escalating a message when it fails N times without intermediate success
possibly sending an alert to different receivers (different email addresses, or also to SMS or Teams)
making an interface for an operator so they can look back at when something first went wrong.
When you have control over which messages you send, it is easy to filter out duplicate messages (after changing --since).
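For example, the one-alert-per-hour rule could be sketched with a marker file (the path and the send_alert placeholder are hypothetical):
# Skip the alert if the previous one went out less than an hour ago.
marker=/var/tmp/last_alert_sent        # hypothetical marker file

# GNU stat -c %Y reads the marker's mtime as a Unix timestamp.
if [ ! -e "$marker" ] || [ $(( $(date +%s) - $(stat -c %Y "$marker") )) -ge 3600 ]; then
    send_alert                         # placeholder for the actual swaks/Discord call
    touch "$marker"
fi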
I've chosen the proposal of #ralf-dreager and reduced the selection to 1d and 1h. Consequently, I've changed my monitoring script to either go through the results of 1d or just 1h, without the need to select each time again and again. Huge performance improvement, and no need to store anything else in a variable or on the file system.
FATS="$(docker logs --since 1h $NODENAME 2>&1 | grep 'FATAL' | grep -v 'INFO')"

Bash script - check how many times public IP changes

I am trying to create my first bash script. The goal of this script is to check at what rate my public IP changes. It is a fairly straightforward script. First it checks if the new address is different from the old one. If so, it should update the old one to the new one and print out the date along with the new IP address.
At this point I have created a simple script in order to accomplish this. But I have two main problems.
First, the script keeps on printing out the IP even though it hasn't changed and I have updated PREV_IP with CUR_IP.
My second problem is that I want the output to direct to a file instead of outputting it into the terminal.
The interval is currently set to 1 second for test purposes. This will change to a higher interval in the final product.
#!/bin/bash
while true
PREV_IP=00
do
    CUR_IP=$(curl https://ipinfo.io/ip)
    if [ $PREV_IP != "$CUR_IP" ]; then
        PREV_IP=$CUR_IP
        "$(date)"
        echo "$CUR_IP"
        sleep 1
    fi
done
I also get a really weird output. I have edited my public IP to xx.xxx.xxx.xxx:
Sat 20 Mar 09:45:29 CET 2021
xx.xxx.xxx.xxx
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:--
while true
PREV_IP=00
do
is the reason you are seeing the IP each loop. It's the same as while true; PREV_IP=00; do. The exit status of true; PREV_IP=00 is the exit status of the last command, and the exit status of an assignment is 0 (success), so the loop will always execute. But PREV_IP will be reset to 00 each loop... This is a typo and you meant to set prev_ip once, before the loop starts.
"$(date)"
will try to execute the output of the date command as the next command. So it will print:
$ "$(date)"
bash: sob, 20 mar 2021, 10:57:02 CET: command not found
And finally, to silence curl, read man curl first and then find out about -s. I use -sS so errors are also visible.
Do not use uppercase variables in your scripts; prefer lowercase variables. Check your scripts with http://shellcheck.net . Quote variable expansions.
I would sleep each loop. Your script could look like this:
#!/bin/bash
prev=""
while true; do
    cur=$(curl -sS https://ipinfo.io/ip)
    if [ "$prev" != "$cur" ]; then
        prev="$cur"
        echo "$(date) $cur"
    fi
    sleep 1
done
that I want the output to direct to a file instead of outputting it into the terminal.
Then research how redirection works in shell and how to use it. The simplest would be to redirect echo output.
echo "$(date) $cur" >> "a_file.txt"
The interval is currently set to 1 second for test purposes. This will change to a higher interval in the final product.
You are still limited by the time it takes to connect to https://ipinfo.io/ip. And from the ipinfo.io documentation:
Free usage of our API is limited to 50,000 API requests per month.
And finally, I once wrote a script, get_ip_external, that tries many public services for getting the external IP address. You could query multiple public services for the IPv4 address and pick one randomly or round-robin, so that rate limiting doesn't kick in as fast.
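A rough sketch of that round-robin idea (assuming each of these services returns a bare address; verify that before relying on them):
#!/bin/bash
# Rotate between several public "what is my IP" services to spread out requests.
services=(https://ipinfo.io/ip https://api.ipify.org https://icanhazip.com)
n=0
prev=""
while true; do
    cur=$(curl -sS "${services[n % ${#services[@]}]}")
    n=$((n + 1))
    if [ -n "$cur" ] && [ "$prev" != "$cur" ]; then
        prev="$cur"
        echo "$(date) $cur" >> "a_file.txt"
    fi
    sleep 60
done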

How to make script in bash aware that a server is still busy installing/configuring and wait for reboot?

The issue / dilemma
I am currently creating a script to kickstart servers (with CentOS 6.x and CentOS 7.x) remotely. So far the script is working, but it gets stuck on one minor thing. Well, actually it does not hang, but it does not give detailed information about what is happening. In other words, I am not getting the correct information back in bash about the job being finished correctly.
I have tried various things; however, it keeps repeating the following message endlessly:
servername is still installing and configuring packages...
PING 100.125.150.175 (100.125.150.175) 56(84) bytes of data.
64 bytes from 100.125.150.175: icmp_seq=1 ttl=63 time=0.152 ms
64 bytes from 100.125.150.175: icmp_seq=2 ttl=63 time=0.157 ms
64 bytes from 100.125.150.175: icmp_seq=3 ttl=63 time=0.157 ms
64 bytes from 100.125.150.175: icmp_seq=4 ttl=63 time=0.143 ms
64 bytes from 100.125.150.175: icmp_seq=5 ttl=63 time=0.182 ms
--- 100.125.150.175 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 120025ms
rtt min/avg/max/mdev = 0.143/0.158/0.182/0.015 ms
servername is still installing and configuring packages...
PING 100.125.150.175 (100.125.150.175) 56(84) bytes of data.
64 bytes from 100.125.150.175: icmp_seq=1 ttl=63 time=0.153 ms
64 bytes from 100.125.150.175: icmp_seq=2 ttl=63 time=0.132 ms
64 bytes from 100.125.150.175: icmp_seq=3 ttl=63 time=0.142 ms
etc....
So for some reason it does not continue to the next line of code or perform the next action. Since it's only feedback to me (or another user), it's not a major issue. But it would be nice to get this functional and provide (detailed) information back about the current progress or what the script/server is actually doing at the moment. That is unfortunately not the case for the above (last) piece of code.
This is the current code snippet I have (yes, it's a mess):
while true;
do
    #ping -c3 -i3 $HWNODEIP > /dev/null
    #ping -c5 -i30 $HWNODEIP > /dev/null
    ping -c5 -i30 $HWNODEIP
    if [ $? -eq 1 ] || [ $? -eq 2 ] || [ $? -eq 68 ]
    then
        echo -e " "
        echo -e "Kickstart part II also done. $HOSTNAME will be rebooted one more time."
        sleep 5
        ######return 0
        echo -e " "
        printf "%s" "Waiting for $HOSTNAME to come back online: "
        while ! ping -c 1 -n -w 30 $HWNODEIP &> /dev/null
        do
            printf "%c" "."
            #sleep 10
        done
        echo -e " "
        echo -e "Reboot is done and $HOSTNAME is back online. Performing final check. Please wait..."
        sleep 10
        echo -e " "
        sudo /usr/local/collectHWdata.pl $HWNODEIP
        ssh root@$HWNODEIP "while ! test -e /root/kickstart-DONE; do sleep 3; done; echo KICKSTART IS DONE\!"
        echo -e " "
        exit
    else
        echo -e " "
        echo -e "$HOSTNAME is still installing and configuring packages..."
    fi
done
Sidenote: I removed > /dev/null #5 for debugging (not that it helped)
I am guessing I am using things incorrectly, and I am by no means an experienced scripter; I can only do minor stuff, but of course I am doing my best. I have been fooling around with this since last week and still have no result on this part.
What am I trying to achieve?
The server is rebooted after installing the selected CentOS version, creating partitions and setting up the network. This all works. The above snippet runs after that reboot. The server then installs the packages I selected, configures various things (like Nagios) and installs/compiles certain Perl modules. And a few other minor things.
This is done correctly in the background. I wanted the script (the above piece of code) to report that the server is still busy installing things and such. Since I lack the knowledge to do that, I decided on a different approach: check if the server is online (in other words, that it's still installing). As long as the server is online, it's still installing/configuring things. After that is done, the server reboots once more to perform the final 2 commands (as seen in my snippet). However (here is the problem), it never runs those commands, even though the kickstart is completely done.
So I am guessing I am doing something wrong and might even have messed things up (or got confused by doing so). Maybe someone has an idea, a solution or a completely different approach to tackle and fix this problem (or at least I hope so).
Other things I have tried so far? Well, I tried various ping commands and I also tried nc (netcat), but also without a good result. Every single time I hit a brick wall with the last 2 commands, and it keeps pinging instead of showing that the kickstart was done... I think I have spent several hours (since last week) on this already without getting anywhere.
So I am hoping someone can take a look at this and tell me what I am doing wrong and maybe there is a better approach (other than pinging a server) to see if it's still busy. Maybe a (remote) check on yum, perl or a service, so that the script knows it's still busy.
Sorry for the long post, but I know when I provide as much information as possible including code examples and results, this is more "appreciated". So I am hoping I provided adequate information. If not, let me know. I will try to add as much information as I can. As always I am always willing to learn or change my approach.
Thank you already for reading my post!
As noted in the comments under the question:
The server may already be rebooted by the time ping -c5 -i30 $HWNODEIP finishes. The command sends 5 packets (-c flag), waiting 30 seconds between each packet (-i interval flag). So that's 5*30 = 150 seconds, which is a bit more than 2 minutes. A server could reboot just fine within 2 minutes, especially if there's an SSD in use. So try lowering the total time it would take this command to complete.
[ $? -eq 68 ] is probably unnecessary. $HWNODEIP is just an IP address, and exit code 68 means a domain name could not be resolved, which doesn't apply to IP addresses.
The if statement could be simplified to
if ! ping -c5 -i30 "$HWNODEIP"
These are minor suggestions, probably not bulletproof. As confirmed by the OP in the comments, lowering the interval helps. There are other small improvements that could be made (like quoting variables), but that's outside the scope of the question, so I'll leave it for now.
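Putting those suggestions together, the waiting loop could be sketched roughly like this (the counts, intervals and sleep are placeholder values to tune, not something tested against the original setup):
# Keep pinging until the node stops answering, which is taken to mean it is rebooting.
while true; do
    if ! ping -c 3 -i 5 "$HWNODEIP" > /dev/null; then
        echo "Kickstart part II also done. $HOSTNAME will be rebooted one more time."
        break
    fi
    echo "$HOSTNAME is still installing and configuring packages..."
    sleep 30
done
After the break, the rest of the original script (waiting for the node to come back online, then the final SSH check for /root/kickstart-DONE) can follow unchanged.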

bash asynchronous variable setting (dns lookup)

Let's say we had a loop that we want to have run as quickly as possible. Let's say something was being done to a list of hosts inside that loop; just for the sake of argument, let's say it was a redis query. Let's say that the list of hosts may change occasionally due to hosts being added/removed from a pool (not load balanced); however, the list is predictable (e.g., they all start with "foo" and end with 2 digits). So we want to run this occasionally, say once every 15 minutes:
listOfHosts=$(dig +noall +ans foo{00..99}.domain | while read -r n rest; do printf '%s\n' ${n%.}; done)
to get the list of hosts. Let's say our loop looked something like this:
while :; do
    for i in $listOfHosts; do
        redis-cli -h $i llen something
    done
    (( $(date +%s) % (60 * 15) == 0 )) && callFunctionThatSetslistOfHosts
done
(Now obviously there are some things missing, like testing to see if we've already run callFunctionThatSetslistOfHosts in the current minute and only running it once, doing something with the redis output, and maybe the list of hosts should be an array, but basically this is it.)
How can we run callFunctionThatSetslistOfHosts asynchronously so that it doesn't slow down the loop? I.e., have it running in the background, setting listOfHosts occasionally (e.g. once every 15 minutes), so that the next time the inner loop runs it gets a potentially different set of hosts to run the redis query on?
My major problem seems to be that in order to set listOfHosts in a loop, that loop has to be a subshell, and listOfHosts is local to that subshell, and setting it doesn't affect the global listOfHosts.
I may resort to pipes, but will have to poll the reader before generating a new list — not that that's terribly bad if I poll slowly, but I thought I'd present this as a problem.
Thanks.
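One possible approach (a sketch, not from the thread; the hosts_file name and the 900-second sleep are arbitrary choices): have a background job rewrite a file every 15 minutes and have the main loop re-read it. The state then lives in a file rather than a shell variable, which sidesteps the subshell-scope problem:
#!/bin/bash
hosts_file=$(mktemp)    # hypothetical scratch file shared between the two loops

refresh_hosts() {
    # Rewrite the host list every 15 minutes, independently of the main loop.
    while :; do
        dig +noall +answer foo{00..99}.domain \
            | while read -r name _; do printf '%s\n' "${name%.}"; done > "$hosts_file"
        sleep 900
    done
}
refresh_hosts &    # background job; the main loop never blocks on it

while :; do
    mapfile -t listOfHosts < "$hosts_file"    # pick up whatever the refresher last wrote
    for i in "${listOfHosts[@]}"; do
        redis-cli -h "$i" llen something
    done
done
On the very first pass the file may still be empty, so in practice you might run the dig pipeline once synchronously before the loop starts, and kill the background job with a trap on exit.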

fastest hashing in a unix environment?

I need to examine the output of a certain script 1000s of times on a unix platform and check if any of it has changed from before.
I've been doing this:
(script_stuff) | md5sum
and storing this value. I don't actually need "md5", just a simple hash function which I can compare against a stored value to see if it's changed. It's okay if there is an occasional false positive.
Is there anything better than md5sum that works faster and generates a fairly usable hash value? The script itself generates a few lines of text - maybe 10-20 on average to max 100 or so.
I had a look at fast md5sum on millions of strings in bash/ubuntu - that's wonderful, but I can't compile a new program. Need a system utility... :(
Additional "background" details:
I've been asked to monitor the DNS record of a set of 1000 or so domains and immediately call certain other scripts if there has been any change. I intend to do a dig xyz +short statement and hash its output and store that, and then check it against a previously stored value. Any change will trigger the other script, otherwise it just goes on. Right now, we're planning on using cron for a set of these 1000, but can think completely differently for "seriously heavy" usage - ~20,000 or so.
I have no idea what the use of such a system would be, I'm just doing this as a job for someone else...
The cksum utility calculates a non-cryptographic CRC checksum.
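As a drop-in for md5sum in the question's pipeline (cksum prints the checksum followed by a byte count when reading stdin; stored_sum here stands for whatever value you saved from the previous run):
# Same store-and-compare flow as before, just with cksum instead of md5sum.
new_sum=$( (script_stuff) | cksum )
if [ "$new_sum" != "$stored_sum" ]; then
    echo "output changed"
fi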
How big is the output you're checking? A hundred lines max. I'd just save the entire original file then use cmp to see if it's changed. Given that a hash calculation will have to read every byte anyway, the only way you'll get an advantage from a checksum type calculation is if the cost of doing it is less than reading two files of that size.
And cmp won't give you any false positives or negatives :-)
pax> echo hello >qq1.txt
pax> echo goodbye >qq2.txt
pax> cp qq1.txt qq3.txt
pax> cmp qq1.txt qq2.txt >/dev/null
pax> echo $?
1
pax> cmp qq1.txt qq3.txt >/dev/null
pax> echo $?
0
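Putting the cmp idea into a store-and-compare flow for one domain might look roughly like this (the state directory and the trigger script path are hypothetical):
# Keep the previous dig output on disk and diff it with cmp.
domain=example.com
state_dir=/var/tmp/dns-state           # assumed location for the stored outputs
mkdir -p "$state_dir"

dig "$domain" +short > "$state_dir/$domain.new"
if ! cmp -s "$state_dir/$domain.new" "$state_dir/$domain.old"; then
    /path/to/other-script "$domain"    # hypothetical script to trigger on change
fi
mv "$state_dir/$domain.new" "$state_dir/$domain.old"
A missing .old file (the first run) makes cmp fail, so the trigger fires once initially; guard against that if it matters.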
Based on your question update:
I've been asked to monitor the DNS record of a set of 1000 or so domains and immediately call certain other scripts if there has been any change. I intend to do a dig xyz +short statement and hash its output and store that, and then check it against a previously stored value. Any change will trigger the other script, otherwise it just goes on. Right now, we're planning on using cron for a set of these 1000, but can think completely differently for "seriously heavy" usage - ~20,000 or so.
I'm not sure you need to worry too much about the file I/O. The following script executed dig microsoft.com +short 5000 times first with file I/O then with output to /dev/null (by changing the comments).
#!/bin/bash
rm -rf qqtemp
mkdir qqtemp
((i = 0))
while [[ $i -ne 5000 ]] ; do
    #dig microsoft.com +short >qqtemp/microsoft.com.$i
    dig microsoft.com +short >/dev/null
    ((i = i + 1))
done
The elapsed times at 5 runs each are:
File I/O | /dev/null
----------+-----------
3:09 | 1:52
2:54 | 2:33
2:43 | 3:04
2:49 | 2:38
2:33 | 3:08
After removing the outliers and averaging, the results are 2:49 for the file I/O and 2:45 for the /dev/null. The time difference is four seconds for 5000 iterations, only 1/1250th of a second per item.
However, since an iteration over the 5000 takes up to three minutes, that's how long it will take maximum to detect a problem (a minute and a half on average). If that's not acceptable, you need to move away from bash to another tool.
Given that a single dig only takes about 0.012 seconds, you should theoretically do 5000 in sixty seconds assuming your checking tool takes no time at all. You may be better off doing something like this in Perl and using an associative array to store the output from dig.
Perl's semi-compiled nature means that it will probably run substantially faster than a bash script and Perl's fancy stuff will make the job a lot easier. However, you're unlikely to get that 60-second time much lower just because that's how long it takes to run the dig commands.
