Elasticsearch index cleanup

Version: Elasticsearch 5.6.*.
I'm looking for a way to implement a mechanism by which one of my indices (which grows quickly, roughly 1 million documents per day) manages its storage constraints automatically.
For example: I would define the maximum number of documents or the maximum index size as a variable 'n'.
I'd write a scheduler that checks whether 'n' has been reached. If it has, I'd want to delete the oldest 'x' documents (based on time).
I have a couple of questions here:
Apparently, I do not want to delete too much or too little. How would I know what 'x' is? Can I simply tell Elasticsearch, "Hey, delete the oldest documents worth 5GB"? My intent is simply to free up a fixed amount of storage. Is this possible?
Secondly, I'd like to know the best practice here. Obviously I don't want to reinvent the wheel, and if there's anything (e.g. Curator, which I've only been hearing about recently) that does the job, I'd be happy to use it.

In your case, the best practice is to work with time-based indices, either daily, weekly or monthly indices, whichever makes sense for the amount of data you have and the retention you want. You also have the possibility to use the Rollover API in order to decide when a new index needs to be created (based on time, number of documents or index size).
It is much easier to delete an entire index than delete documents matching certain conditions within an index. If you do the latter, the documents will be deleted but the space won't be freed until the underlying segments get merged. Whereas if you delete an entire time-based index, then you're guaranteed to free up space.
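For example, with the Rollover API you write to an alias and periodically ask Elasticsearch to roll the alias over to a fresh index once a condition is met. A minimal sketch (the index name, alias name and thresholds below are made up for illustration):
# One-time setup: create the first index and point a write alias at it
curl -s -X PUT "http://localhost:9200/logs-000001" \
     -H 'Content-Type: application/json' \
     -d '{"aliases": {"logs_write": {}}}'
# Periodically (e.g. from cron): roll over once the current index is a day old
# or holds a million documents
curl -s -X POST "http://localhost:9200/logs_write/_rollover" \
     -H 'Content-Type: application/json' \
     -d '{"conditions": {"max_age": "1d", "max_docs": 1000000}}'
Retention then becomes a single call per expired index, e.g. DELETE /logs-000001, which frees the space without any document-level deletes.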

I came up with a rather simple bash script solution to clean up time-based indices in Elasticsearch, which I thought I'd share in case anyone is interested. Curator seems to be the standard answer for doing this, but I really didn't want to install and manage a Python application with all the dependencies it requires. You can't get much simpler than a bash script executed via cron, and it doesn't have any dependencies beyond core Linux.
#!/bin/bash
# Make sure expected arguments were provided
if [ $# -lt 3 ]; then
echo "Invalid number of arguments!"
echo "This script is used to clean time based indices from Elasticsearch. The indices must have a"
echo "trailing date in a format that can be represented by the UNIX date command such as '%Y-%m-%d'."
echo ""
echo "Usage: `basename $0` host_url index_prefix num_days_to_keep [date_format]"
echo "The date_format argument is optional and defaults to '%Y-%m-%d'"
echo "Example: `basename $0` http://localhost:9200 cflogs- 7"
echo "Example: `basename $0` http://localhost:9200 elasticsearch_metrics- 31 %Y.%m.%d"
exit 1
fi
elasticsearchUrl=$1
indexNamePrefix=$2
numDaysDataToKeep=$3
dateFormat=%Y-%m-%d
if [ $# -ge 4 ]; then
dateFormat=$4
fi
# Get the current date in a 'seconds since epoch' format
curDateInSecondsSinceEpoch=$(date +%s)
#echo "curDateInSecondsSinceEpoch=$curDateInSecondsSinceEpoch"
# Subtract numDaysDataToKeep from current epoch value to get the last day to keep
let "targetDateInSecondsSinceEpoch=$curDateInSecondsSinceEpoch - ($numDaysDataToKeep * 86400)"
#echo "targetDateInSecondsSinceEpoch=$targetDateInSecondsSinceEpoch"
while : ; do
# Subtract one day from the target date epoch
let "targetDateInSecondsSinceEpoch=$targetDateInSecondsSinceEpoch - 86400"
#echo "targetDateInSecondsSinceEpoch=$targetDateInSecondsSinceEpoch"
# Convert targetDateInSecondsSinceEpoch into a YYYY-MM-DD format
targetDateString=$(date --date="@$targetDateInSecondsSinceEpoch" +$dateFormat)
#echo "targetDateString=$targetDateString"
# Format the index name using the prefix and the calculated date string
indexName="$indexNamePrefix$targetDateString"
#echo "indexName=$indexName"
# First check if an index with this date pattern exists
# Curl options:
# -s silent mode. Don't show progress meter or error messages
# -w "%{http_code}\n" Causes curl to display the HTTP status code only after a completed transfer.
# -I Fetch the HTTP-header only in the response. For HEAD commands there is no body so this keeps curl from waiting on it.
# -o /dev/null Prevents the output in the response from being displayed. This does not apply to the -w output though.
httpCode=$(curl -o /dev/null -s -w "%{http_code}\n" -I "$elasticsearchUrl/$indexName")
#echo "httpCode=$httpCode"
if [ $httpCode -ne 200 ]
then
echo "Index $indexName does not exist. Stopping processing."
break;
fi
# Send the command to Elasticsearch to delete the index. Save the HTTP return code in a variable
httpCode=$(curl -o /dev/null -s -w "%{http_code}\n" -X DELETE $elasticsearchUrl/$indexName)
#echo "httpCode=$httpCode"
if [ $httpCode -eq 200 ]
then
echo "Successfully deleted index $indexName."
else
echo "FAILURE! Delete command failed with return code $httpCode. Continuing processing with next day."
continue;
fi
# Verify the index no longer exists. Should return 404 when the index isn't found.
httpCode=$(curl -o /dev/null -s -w "%{http_code}\n" -I "$elasticsearchUrl/$indexName")
#echo "httpCode=$httpCode"
if [ $httpCode -eq 200 ]
then
echo "FAILURE! Delete command responded successfully, but index still exists. Continuing processing with next day."
continue;
fi
done
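For example, a crontab entry to run it once a day might look like this (the script path and log file are hypothetical; the arguments match the usage examples above):
# Keep 7 days of cflogs- indices; run every night at 01:30
30 1 * * * /opt/scripts/es-index-cleanup.sh http://localhost:9200 cflogs- 7 >> /var/log/es-index-cleanup.log 2>&1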

I responded to the same question at https://discuss.elastic.co/t/elasticsearch-efficiently-cleaning-up-the-indices-to-save-space/137019
If your index is always growing, then deleting documents is not best practice. It sounds like you have time-series data. If so, then what you want is time-series indices, or better yet, rollover indices.
5GB is also a rather small amount to be purging, as a single Elasticsearch shard can healthily grow to 20GB-50GB in size. Are you storage constrained? How many nodes do you have?
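If you're unsure, the cat APIs make it easy to see how big your indices and shards actually are before deciding on a purge strategy (assuming the default localhost:9200 endpoint):
# per-index document counts and on-disk sizes
curl -s 'http://localhost:9200/_cat/indices?v'
# per-shard sizes, useful for judging the 20GB-50GB guideline above
curl -s 'http://localhost:9200/_cat/shards?v'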

Related

Use curl and a loop in a shell script to step through different jq values

I am using a curl command to get JSON data from an application called "Jira".
Stupidly (in my view), you cannot use the API to return more than 50 values at a time.
The only choice is to do it in multiple commands, which they call "pagination". It is not possible to get more than 50 results, no matter the command.
This is the command here:
curl -i -X GET 'https://account_name.atlassian.net/rest/api/3/project/search?jql=ORDER%20BY%20Created&maxResults=50&startAt=100' --user 'scouse_bob@mycompany.com:<sec_token_deets>'
This is the key piece of what I am trying to work into a loop to avoid having to do this manually each time:
startAt=100
My goal is to "somehow" have this loop in blocks of fifty, so, startAt=50 then startAt=100, startAt=150 etc and append the entire output to a file until the figure 650 is reached and / or there is no further output available.
I have played around with a command like this:
#!/bin/ksh
i=1
while [[ $i -lt 1000 ]] ; do
curl -i -X GET 'https://account_name.atlassian.net/rest/api/3/project/search?jql=ORDER%20BY%20Created&maxResults=50&startAt=100' --user 'scouse_bob@mycompany.com:<sec_token_deets>'
echo "$i"
(( i += 1 ))
done
This does not really get me far: although it will loop, I am uncertain how to apply the variable to the startAt value in the URL.
Help appreciated.
My goal is to "somehow" have this loop in blocks of fifty, so, startAt=50 then startAt=100, startAt=150 etc and append the entire output to a file until the figure 650 is reached and / or there is no further output available.
The former is easy:
i=0
while [[ $i -lt 650 ]]; do
# if you meant until 650 inclusive, change to -le 650 or -lt 700
curl "https://host/path?blah&startAt=$i"
# pipe to/through some processing if desired
# note URL is in " so $i is expanded but
# other special chars like & don't screw up parsing
# also -X GET is the default (without -d or similar) and can be omitted
(( i+=50 ))
done
The latter depends on just what 'no further output available' looks like. I'd expect you probably don't get an HTTP error, but either a content type indicating an error, or JSON containing an end or error indication or a no-data indication. How to recognize this depends on what you get, and I don't know this API. I'll guess you probably want something more or less like:
curl ... >tmpfile
if jq -e '.eof==true' tmpfile; then break; else cat/whatever tmpfile; fi
# or
if jq -e '.data|length==0' tmpfile; then break; else cat/whatever tmpfile; fi
where tmpfile is some suitable filename that won't conflict with your other files; the most general way is to use $(mktemp) (saved in a variable). Or instead of a file put the data in a variable var=$(curl ...) and then use <<<$var as input to anything that reads stdin.
EDIT: I meant to make this CW to make it easier for anyone to add/fix the API specifics, but forgot; instead I encourage anyone who knows to edit.
You may want to stop when you get partial output i.e. if you ask for 50 and get 37, it may mean there is no more after those 37 and you don't need to try the next batch. Again this depends on the API which I don't know.
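Putting the two pieces together, here is one possible sketch in the same ksh style as the question; it assumes the response is JSON with a "values" array and treats an empty (or missing) array as the end of the data, which you'd need to verify against the real API:
#!/bin/ksh
outfile=jira_projects.json    # hypothetical output file
: > "$outfile"                # start with an empty file
i=0
while [[ $i -lt 650 ]]; do
    page=$(curl -s "https://account_name.atlassian.net/rest/api/3/project/search?jql=ORDER%20BY%20Created&maxResults=50&startAt=$i" \
                --user 'scouse_bob@mycompany.com:<sec_token_deets>')
    # stop when the page has no results (assumes a "values" array in the response)
    if ! jq -e '.values | length > 0' >/dev/null 2>&1 <<< "$page"; then
        break
    fi
    printf '%s\n' "$page" >> "$outfile"
    (( i += 50 ))
done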

Removing all but most recent file whose name contains a substring

I am looking for a bash script (or one-liner) to accomplish the following:
Check to see if there is more than one file containing the substring "slurm-"
If so, remove all of the files containing the substring except for the newest one
Any help would be greatly appreciated, thank you.
The following isn't exceptionally efficient with a very long list of files, but (1) it's fast with a short list (low constant-time startup cost), and (2) it's very explicit about how it operates (easy to read and understand).
shopt -s nullglob
candidates=( slurm-* )
(( ${#candidates[@]} < 2 )) && exit 0 ## nothing to do if <2 files exist
latest=${candidates[0]} ## populate latest variable w/ first
for candidate in "${candidates[@]}"; do ## loop through the whole set
if [[ $candidate -nt $latest ]]; then ## and if one is newer, call it "latest"
latest=$candidate
fi
done
for candidate in "${candidates[@]}"; do ## iterate through the whole set
if [[ $candidate != "$latest" ]]; then ## and for everything but the latest file
rm -f -- "$candidate" ## run a deletion
fi
done
Answering the XY problem: you might find it a better course of action to add #SBATCH -o output.txt to your submission file so that the Slurm output file is overwritten every time, if your intent is to keep a clean working directory while submitting the same job several times in a row until it runs properly.

Shell script to rsync a file every week without a cron job (school assignment)

#!/bin/bash
z=1
b=$(date)
while [[ $z -eq 1 ]]
do
a=$(date)
if [ "$a" == "$b" ]
then
b=$(date -d "+7 days")
rsync -v -e ssh user@ip_address:~/sample.tgz /home/kartik2
sleep 1d
fi
done
I want to rsync a file every week. But if I start this script on every boot, the file will be rsynced every time the system starts. How do I alter the code so the rsync happens on a weekly basis? (PS: I don't want to do this through a cron job; it's a school assignment.)
You are talking about having this run for weeks, right? So, we have to take into account that the system will be rebooted and it needs to be run unattended. In short, you need some means of ensuring the script is run at least once every week even when no one is around. The options look like this
Option 1 (worst)
You set a reminder for yourself and you log in every week and run the script. While you may be reliable as a person, this doesn't allow you to go on vacation. Besides, it goes against our principle of "when no one is around".
Option 2 (okay)
You can background the process (./once-a-week.sh &), but this will not be reliable over time. Among other things, if the system restarts then your script will not be running and you won't know.
Option 3 (better)
For this to be reliable over weeks one option is to daemonize the script. For a more detailed discussion on the matter, see: Best way to make a shell script daemon?
You would need to make sure the daemon is started after reboot or system failure. For more discussion on that matter, see: Make daemon start up with Linux
Option 4 (best)
You said no cron, but it really is the best option. In particular, it would consume no system resources for the 6 days, 23 hours and 59 minutes when it does not need to be running. Additionally, it is naturally resilient to reboots and the like. So, I feel compelled to say that creating a crontab entry like the following would be my top vote: @weekly /full/path/to/script
If you do choose option 2 or 3 above, you will need to modify your script to track the week number (date +%V) in which the script last successfully completed a run. The problem is that just keeping that in memory means it will not survive a reboot.
To make any of the above more resilient, it might be best to create a directory where you can store a file to serve as a semaphore (e.g. week21.txt) or a file to store the state of the last run. Something like once-a-week.state to which you would write a value when run:
date +%V > once-a-week.state # write the week number to a file
Then to read the value, you would:
file="/path/to/once-a-week.state" # the file where the week number is stored
read -d $'\x04' name < "$file"
echo "$name"
You would then need to check to see if the week number matched this present week number and handle the needed action based on match or not.
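Put together, a minimal sketch of that check (the state-file path and the rsync command are placeholders taken from the question):
#!/bin/bash
state="/path/to/once-a-week.state"         # placeholder path
while true; do
    current_week=$(date +%V)
    last_week=$(cat "$state" 2>/dev/null)  # empty if the script has never run
    if [ "$current_week" != "$last_week" ]; then
        if rsync -v -e ssh user@ip_address:~/sample.tgz /home/kartik2; then
            echo "$current_week" > "$state"   # record the successful run
        fi
    fi
    sleep 1h                                  # re-check every hour
done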
#!/bin/bash
z=1
b=$(cat f1.txt)
while [[ $z -eq 1 ]]
do
a=$(date +"%d-%m-%y")
if [ "$a" == "$b" ] || [ "$b" == "" ] || [$a -ge $b ]
then
b=$(date +"%d-%m-%y" -d "+7 days")
echo $b > f1.txt
rsync -v -e ssh HOST@ip:~/sample.tgz /home/user
if [ $? -eq 0 ]
then
sleep 1d
fi
fi
done
This code seems to work well. Any changes to it, let me know.

How to make sure that original file and gzipped version are not the same

My test equipment generates large text files which tend to grow in size over a period of several days as data is added.
But the text files are transferred to a PC for backup purposes daily, where they're compressed with gzip, even before they've finished growing.
This means I frequently have both file.txt and a compressed form file.txt.gz where the uncompressed file may be more up to date than the compressed version.
I decide which to keep with the following bash script gzandrm:
#!/usr/bin/bash
# Given an uncompressed file, look in the same directory for
# a gzipped version of the file and delete the uncompressed
# file if zdiff reveals they're identical. Otherwise, the
# file can be compressed.
# eg: find . -name '*.txt' -exec gzandrm {} \;
if [[ -e $1 && -e $1.gz ]]
then
# simple check: use zdiff and count the characters
DIFFS=$(zdiff "$1" "$1.gz" | wc -c)
if [[ $DIFFS -eq 0 ]]
then
# difference is '0', delete the uncompressed file
echo "'$1' already gzipped, so removed"
rm "$1"
else
# difference is non-zero, check manually
echo "'$1' and '$1.gz' are different"
fi
else
# go ahead and compress the file
echo "'$1' not yet gzipped, doing it now"
gzip "$1"
fi
and this has worked well, but it would make more sense to compare the modification dates of the files: gzip does not change the modification date when it compresses, so two files with the same date are really the same file, even if one of them is compressed.
How can I modify my script to compare files by date, rather than size?
It's not entirely clear what the goal is, but it seems to be simple efficiency, so I think you should make two changes: 1) check modification times, as you suggest, and don't bother comparing content if the uncompressed file is no newer than the compressed file, and 2) use zcmp instead of zdiff.
Taking #2 first, your script does this:
DIFFS=$(zdiff "$1" "$1.gz" | wc -c)
if [[ $DIFFS -eq 0 ]]
which will perform a full diff of potentially large files, count the characters in diff's output, and examine the count. But all you really want to know is whether the content differs. cmp is better for that, since it will scan byte by byte and stop if it encounters a difference. It doesn't take the time to format a nice textual comparison (which you will mostly ignore); its exit status tells you the result. zcmp isn't quite as efficient as raw cmp, since it'll need to do an uncompress first, but zdiff has the same issue.
So you could switch to zcmp (and remove the use of a subshell, eliminate wc, not invoke [[, and avoid putting potentially large textual diff data into a variable) just by changing the above two lines to this:
if zcmp -s "$1" # if $1 and $1.gz are the same
To go a step further and check modification times first, you can use the -nt (newer than) option to the test command (also known as square bracket), rewriting the above line as this:
if [ ! "$1" -nt "$1.gz" ] || zcmp -s "$1"
which says that if the uncompressed version is no newer than the compressed version OR if they have the same content, then $1 is already gzipped and you can remove it. Note that if the uncompressed file is no newer, zcmp won't run at all, saving some cycles.
The rest of your script should work as is.
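Concretely, the middle of gzandrm would then look something like this (same structure as your original script, with only the comparison changed):
if [[ -e $1 && -e $1.gz ]]
then
    # skip zcmp entirely when the uncompressed file is no newer than the .gz
    if [ ! "$1" -nt "$1.gz" ] || zcmp -s "$1"
    then
        echo "'$1' already gzipped, so removed"
        rm "$1"
    else
        echo "'$1' and '$1.gz' are different"
    fi
else
    echo "'$1' not yet gzipped, doing it now"
    gzip "$1"
fi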
One caveat: modification times are very easy to change. Just moving the compressed file from one machine to another could change its modtime, so you'll have to consider your own case to know whether the modtime check is a valid optimization or more trouble than it's worth.
You can get an easy-to-compare date stamp for a file using stat with either the %Y or %Z format string, which give the time of last modification or last change, respectively, in seconds since the epoch.
if [ $(stat -c %Z "$1") -eq $(stat -c %Z "$1.gz") ]; then
echo "Last changed time of $1 is the same as $1.gz"
fi

What is the Best Way to Perform Timestamp Comparison in Bash

I have an alert script that I am trying to keep from spamming me, so I'd like to add a condition: if an alert has been sent within, say, the last hour, don't send another one. I have a cron job that checks the condition every minute, because I need to be alerted quickly when the condition is met, but I don't need to get the email every minute until I get the issue under control. What is the best way to compare times in bash to accomplish this?
By far the easiest is to store time stamps as modification times of dummy files. GNU touch and date commands can set/get these times and perform date calculations. Bash has tests to check whether a file is newer than (-nt) or older than (-ot) another.
For example, to only send a notification if the last notification was more than an hour ago:
touch -d '-1 hour' limit
if [ limit -nt last_notification ]; then
#send notification...
touch last_notification
fi
Use "test":
if test file1 -nt file2; then
# file1 is newer than file2
fi
EDIT: If you want to know when an event occurred, you can use "touch" to create a file which you can later compare using "test".
Use the date command to convert the two times into a standard format, and subtract them. You'll probably want to store the previous execution time in a dotfile then do something like:
last=$(cat /tmp/.lastrun 2>/dev/null || echo 0)
curr=$(date '+%s')
diff=$(( curr - last ))
if [ $diff -gt 3600 ]; then
# ...
fi
echo "$curr" >/tmp/.lastrun
(Thanks, Steve.)
