Bash script to limit a directory size by deleting files accessed last - bash

I had previously used a simple find command to delete tar files not accessed in the last x days (in this example, 3 days):
find /PATH/TO/FILES -type f -name "*.tar" -atime +3 -exec rm {} \;
I now need to improve this script by deleting in order of access date and my bash writing skills are a bit rusty. Here's what I need it to do:
1) check the size of a directory /PATH/TO/FILES
2) if size in 1) is greater than X size, get a list of the files by access date
3) delete files in order until size is less than X
The benefit here is that, for cache and backup directories, I will only delete as much as I need to in order to keep the directory within a limit, whereas the simpler method might go over the size limit if one day is particularly large. I'm guessing I need to use stat and a bash for loop?

I improved brunner314's example and fixed the problems in it.
Here is a working script I'm using:
#!/bin/bash
DELETEDIR="$1"
MAXSIZE="$2" # in MB
if [[ -z "$DELETEDIR" || -z "$MAXSIZE" || "$MAXSIZE" -lt 1 ]]; then
    echo "usage: $0 [directory] [maxsize in megabytes]" >&2
    exit 1
fi
# %T@ is the modification time in seconds since the epoch (GNU find);
# use %A@ instead to order by access time
find "$DELETEDIR" -type f -printf "%T@::%p::%s\n" \
| sort -rn \
| awk -v maxbytes="$((1024 * 1024 * $MAXSIZE))" -F "::" '
  BEGIN { curSize=0; }
  {
    curSize += $3;
    if (curSize > maxbytes) { print $2; }
  }
  ' \
| tac | awk '{printf "%s\0",$0}' | xargs -0 -r rm
# delete empty directories
find "$DELETEDIR" -mindepth 1 -depth -type d -empty -exec rmdir "{}" \;

Here's a simple, easy to read and understand method I came up with to do this:
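DIRSIZE=$(du -s /PATH/TO/FILES | awk '{print $1}')   # du -s reports KB on most systems
if [ "$DIRSIZE" -gt "$SOMELIMIT" ]; then
    for f in `ls -rt --time=atime /PATH/TO/FILES/*.tar`; do
        FILESIZE=`stat -c "%s" "$f"`
        # assumed loop body: delete the least recently accessed file and subtract its size (bytes -> KB)
        rm -- "$f"
        DIRSIZE=$((DIRSIZE - FILESIZE / 1024))
        if [ "$DIRSIZE" -lt "$LIMITSIZE" ]; then break; fi
    done
fi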

I didn't need to use loops, just some careful application of stat and awk. Details and explanation below, first the code:
find /PATH/TO/FILES -name '*.tar' -type f \
| sed 's/ /\\ /g' \
| xargs stat -f "%a::%z::%N" \
| sort -r \
| awk -v maxSize="$X_SIZE" '
  BEGIN{curSize=0; FS="::"}
  {curSize += $2}
  curSize > maxSize {print $3}
  ' \
| sed 's/ /\\ /g' \
| xargs rm
Note that this is one logical command line, but for the sake of sanity I split it up.
It starts with a find command based on the one above, without the parts that limit it to files older than 3 days. It pipes that to sed, to escape any spaces in the file names find returns, then uses xargs to run stat on all the results. The -f "%a::%z::%N" tells stat the format to use, with the time of last access in the first field, the size of the file in the second, and the name of the file in the third. I used '::' to separate the fields because it is easier to deal with spaces in the file names that way. Sort then sorts them on the first field, with -r to reverse the ordering.
Now we have a list of all the files we are interested in, in order from latest accessed to earliest accessed. Then the awk script adds up all the sizes as it goes through the list, and begins outputting file names once the running total goes over $X_SIZE. The files that are not output this way are the ones kept; the other file names go to sed again to escape any spaces and then to xargs, which runs rm on them.
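The same "sort by access time, keep a running total, delete the tail" idea can also be sketched for GNU tools in a way that tolerates spaces and newlines in file names. This is only a sketch under assumptions: the function name prune_by_atime is hypothetical, it relies on GNU find (-printf) and GNU sort (-z), and it takes the limit in bytes.

```shell
#!/usr/bin/env bash
# prune_by_atime DIR MAX_BYTES   (hypothetical helper, not from the answers above)
# Walks the files from most to least recently accessed, keeps a running total
# of their sizes, and deletes every file seen after the total exceeds MAX_BYTES.
# Assumes GNU find (-printf) and GNU sort (-z).
prune_by_atime() {
    local dir=$1 max_bytes=$2 total=0 rec atime rest size path
    # Each NUL-terminated record looks like: "<atime> <size> <path>"
    while IFS= read -r -d '' rec; do
        atime=${rec%% *}      # first field (only needed for sorting)
        rest=${rec#* }
        size=${rest%% *}      # second field: size in bytes
        path=${rest#* }       # remainder: the path, may contain spaces
        total=$((total + size))
        if (( total > max_bytes )); then
            rm -f -- "$path"
        fi
    done < <(find "$dir" -type f -printf '%A@ %s %p\0' | sort -z -rn)
}
```

For example, prune_by_atime /PATH/TO/FILES $((500 * 1024 * 1024)) would aim for roughly 500 MB of the most recently accessed files. Swapping rm for echo first is a cheap way to preview what would be removed.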


Chain grep commands to search for a pattern inside files that match another pattern

How can I chain multiple grep commands?
For example, if I want to search recursively for all PHP files that are publicly accessible, i.e. those which contain $_user_location = 'public; and then search for SendQueue() inside all these files, what should I do?
A few of my failed attempts:
grep -rnw ./* -e "^.*user_location.*public" *.php | grep -i "^.*SendQueue().*" --color
grep -rnw ./* -e "^.*user_location.*public" *.php | xargs -0 -i "^.*SendQueue().*" --color
Print grep results with filename, extract filenames and pass those filenames to second grep.
grep -H ..... | cut -d: -f1 | xargs -d'\n' grep ....
Works as long as there are no : in filenames and usually there are none.
You could always do a plain old loop:
for i in *.php; do
    if grep -q .... "$i"; then
        grep .... "$i"
    fi
done
Using awk:
$ awk '
/SendQueue\(\)/ {              # hash all SendQueue() containing records
    a[++i]=$0
}
/.*user_location.*public/ {    # if condition met, flag up
    f=1
}
END {
    if(f)                      # if flag up
        for(j=1;j<=i;j++)      # output all hashed records
            print a[j]
}' file
$_user_location = 'public;
In the lack of sample output you only get:
For multiple files:
$ for f in *.php ; do awk ... $f ; done
If you add the -l option to your first grep, you'll get all the file names, which you can feed to your second grep, like:
grep -i "^.*SendQueue().*" --color $(grep -l ...)
assuming you don't have special characters in file names.
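If special characters in file names are a concern, the same two-step idea can be sketched with NUL-separated file names. The function name below is hypothetical, and the sketch assumes GNU grep (-r, --include, -Z) and GNU xargs (-0, -r); the two search strings are taken from the question.

```shell
#!/usr/bin/env bash
# search_public_for_sendqueue DIR   (hypothetical helper)
# Step 1: list, NUL-separated, the *.php files containing the "public" marker.
# Step 2: search only those files for the literal string SendQueue().
# Assumes GNU grep and GNU xargs.
search_public_for_sendqueue() {
    local dir=$1
    grep -rlZ --include='*.php' -e "_user_location = 'public" "$dir" \
        | xargs -0 -r grep -Hn -F 'SendQueue()'
}
```

The -F on the second grep treats SendQueue() as a fixed string, so the parentheses need no escaping. The exit status is non-zero when nothing matches, which is worth remembering if this runs under set -e.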
Some alternative, could be quicker...
1. Using sed
sed -s '/\(SendQueue()\|_user_location = \o47public\)/H;${ x;s/\n/ /g;/SendQueue.*_user_location\|_user_location.*SendQueue/F;};d' *.php
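Could be written:
sed -s '
    /\(SendQueue()\|_user_location = \o47public\)/H;
    ${
        x;
        s/\n/ /g;
        /SendQueue.*_user_location\|_user_location.*SendQueue/F;
    };
    d' *.php
Or with find:
find /path -type f -name '*.php' -exec sed -s '
    /\(SendQueue()\|_user_location = \o47public\)/H;
    ${
        x;
        s/\n/ /g;
        /SendQueue.*_user_location\|_user_location.*SendQueue/F;
    };
    d' {} +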
2. Using grep
But reading each file only 1 time
grep -c "\(SendQueue()\|_user_location = 'public\)" *.php | grep :2$
grep -c "\(SendQueue()\|_user_location = 'public\)" *.txt | sed -ne 's/:2$//p'
find /path -type f -name '*.php' -exec grep -c \
"\(SendQueue()\|_user_location = 'public\)" {} + |
sed -ne 's/:2$//p'
Of course, this works only if you're sure each of the two strings appears at most once per file.
To ensure no commented line will pollute the result, you could replace the regex with
"^[^#/]*\(SendQueue()\|_user_location = 'public\)"
In all submitted alternatives
I can mention two ways of doing this:
You can use find(1) in order to do recursive search. find is defined by POSIX and is most likely included in your system.
find . -type f -name '*.php' -exec grep -q "\$_user_location.*=.*'public" {} \; -exec grep 'SendQueue()' {} +
Here is the explanation for what this command does:
-type f Look for files
-name '*.php' With the suffix .php
-exec grep -q ... {} \; Run the first grep sequence individually.
-exec grep {} + Run the second grep sequence on the files that were matched previously.
Ripgrep way
ripgrep is a really fast recursive grep tool. This will take much less search time, but you will need to obtain it separately.
rg --glob '*.php' -l "\$_user_location.*=.*'public" | xargs rg 'SendQueue\(\)'
Here is the explanation for what this command does:
--glob '*.php' Only looks inside files with the suffix .php
-l Only lists files that match
We enter the first query and pipe all the matching files to xargs
xargs runs rg with the second query and adds the received files as arguments so that ripgrep only searches those files.
Which one to use
ripgrep really shines on huge directories, but it really isn't necessary otherwise for what you are asking. Picking find is enough for most cases. The time you will spend obtaining ripgrep will probably be more than the time you will save by using it for this specific operation. ripgrep is a really nice tool regardless.
The find command has 2 -exec options:
-exec grep (...) {} \; This calls the grep command for each file match. This will run the following:
grep (query) file1.php
grep (query) file2.php
grep (query) file3.php
find tracks the command result for each file, and passes them to the next test if they succeed.
-exec grep (...) {} + This calls the command with all the files attached as arguments. This will expand as:
grep (query) file1.php file2.php file3.php
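A small experiment can make the difference between the two -exec forms visible. The sketch below uses two hypothetical helper functions; each command that find starts prints exactly one line, so counting lines counts invocations.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: count how many commands find starts for each -exec form.

# With "\;" find starts one command per matching file.
count_runs_per_file() {
    find "$1" -type f -name '*.php' -exec sh -c 'echo run' sh {} \; | wc -l
}

# With "+" find appends as many file names as fit to a single command line.
count_runs_batched() {
    find "$1" -type f -name '*.php' -exec sh -c 'echo run' sh {} + | wc -l
}
```

On a directory with three PHP files, the first function should report 3 and the second 1. With a very large number of files the batched form may still be split into a few invocations.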

Select parent directory if non-unique directory is found

Hello I am trying to figure out how I can parse directories using built-in bash functionality.
The directory structure would look something like.
So far I have narrowed down to the name of the plugin which covers most of what I needed for the rest of the script.
find /home/mikal/PluginSDK -type f -name plugin-config.json | sed -r 's|/[^/]+$||' | awk -F "/" '{print $NF}'
The problem that I am running into is when the same vendor has different versions of a plugin available for the same release. We may not always want to run a newer version of the plugin due to compatibility or performance, so having these show something like ver1-plugin_name or similar would be preferable. I can't find anything that would be able to pick out the non-unique plugin/version so that I can make an array with all of the options.
This is the entirety of what I have written right now for this section of the script I am writing to make configuration changes to the system.
while IFS= read -r line; do
    options+=( "$line" )
done < <( find /home/mikal/PluginSDK -type f -name plugin-config.json | sed -r 's|/[^/]+$||' | awk -F "/" '{print $NF}' )
select opt_number in "${options[@]}" "Quit";
do
    if [[ $opt_number == "Quit" ]];
    then
        echo "Quitting"
        break
    else
        # reset every plugin, then mark the selected one as preferred
        find /home/mikal/PluginSDK -type f -name plugin-config.json -exec sed -i 's/"preferred": true/"preferred": false/g' {} \;
        find "/home/mikal/PluginSDK/${options[$(($REPLY-1))]}" -type f -name plugin-config.json -exec sed -i 's/"preferred": false/"preferred": true/g' {} \;
        break
    fi
done
Desired output for the entire thing would be something like.
1.) Ver1-Plugin_name
2.) Ver2-Plugin_name
3.) Plugin_name
4.) Plugin_name
5.) Quit
I apologize if my formatting is bad. First time posting.
lst=( Quit
      $( find /home/mikal/PluginSDK -type f -name plugin-config.json |
         awk -F/ '{ if (7==NF) { print $6 } else { print $6"-"$7 } }' ) )
select opt_number in "${lst[@]}"
. . .
You might want to see BashFAQ 20 if your filenames could have any weirdness like embedded spaces.
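If the depth of the tree is not fixed, another way to approach the "non-unique" part is to count how often each directory name occurs and prefix the parent directory only for names that occur more than once. The following is a rough sketch, not the answer's method: label_plugin_dirs is a hypothetical name, it needs bash 4 or later for associative arrays, and it assumes one directory path per input line.

```shell
#!/usr/bin/env bash
# label_plugin_dirs   (hypothetical helper, requires bash >= 4)
# Reads directory paths on stdin, one per line, and prints one label per path:
# "<name>" when the last path component is unique, "<parent>-<name>" otherwise.
label_plugin_dirs() {
    local -a dirs=()
    local -A seen=()
    local d base parent
    # First pass: remember every path and count each final component.
    while IFS= read -r d; do
        dirs+=("$d")
        base=${d##*/}
        seen[$base]=$(( ${seen[$base]:-0} + 1 ))
    done
    # Second pass: prefix the parent directory only for duplicated names.
    for d in "${dirs[@]}"; do
        base=${d##*/}
        if [ "${seen[$base]}" -gt 1 ]; then
            parent=${d%/*}
            parent=${parent##*/}
            printf '%s-%s\n' "$parent" "$base"
        else
            printf '%s\n' "$base"
        fi
    done
}
```

With GNU find, the directories holding a plugin-config.json could be fed in with something like find /home/mikal/PluginSDK -type f -name plugin-config.json -printf '%h\n' | label_plugin_dirs, and the output read into the options array with mapfile -t.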

How to get list of certain strings in a list of files using bash?

The title is maybe not really descriptive, but I couldn't find a more concise way to describe the problem.
I have a directory containing different files which have a name that e.g. looks like this:
{some text}2019Q2{some text}.pdf
So the filenames have somewhere in the name a year followed by a capital Q and then another number. The other text can be anything, but it won't contain anything matching the format year-Q-number. There will also be no numbers directly before or after this format.
I can work something out to get this from one filename, but I actually need a 'list' so I can do a for-loop over this in bash.
So, if my directory contains the files:
I want a for loop that goes over 2019Q2, 2019Q3, 2020Q1, and 2020Q2.
This is what I have so far. It is able to extract the substrings, but it still has duplicates. Since I'm already inside the loop, I don't see how I can remove them.
find original/*.pdf -type f -print0 | while IFS= read -r -d '' line; do
    echo "$line" | grep -oP '[0-9]{4}Q[0-9]'
done
# list all _filenames_ that end with .pdf from the folder original
find original -maxdepth 1 -name '*.pdf' -type f -printf "%f\n" |
# extract the pattern (print only the lines where it was found)
sed -n 's/.*\([0-9]\{4\}Q[0-9]\).*/\1/p' |
# iterate
while IFS= read -r file; do
    echo "$file"
done
I used -printf "%f\n" to print just the filename instead of the full path. GNU sed has a -z option that you can use with -print0 (or -printf "%f\0").
Given the way you wanted to do this, if your files have no newlines in their names, there is no need to loop over the list in bash (as a rule of thumb, try to avoid while read line; it's very slow):
find original -maxdepth 1 -name '*.pdf' -type f | grep -oP '[0-9]{4}Q[0-9]'
or with a zero-separated stream:
find original -maxdepth 1 -name '*.pdf' -type f -print0 |
grep -zoP '[0-9]{4}Q[0-9]' | tr '\0' '\n'
If you want to remove duplicate elements from the list, pipe it to sort -u.
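Putting these pieces together, the de-duplicated list can be produced by one small function and then looped over. This is a sketch under assumptions: list_quarters is a hypothetical name, it uses GNU find's -printf "%f\n" so that only the file name (not the directory) is searched, and it assumes no newlines in file names.

```shell
#!/usr/bin/env bash
# list_quarters DIR   (hypothetical helper)
# Prints each distinct year-Q-number token found in the names of the
# *.pdf files directly inside DIR, sorted, one per line.
# Assumes GNU find (-printf) and file names without newlines.
list_quarters() {
    find "$1" -maxdepth 1 -type f -name '*.pdf' -printf '%f\n' \
        | grep -oE '[0-9]{4}Q[0-9]' \
        | sort -u
}

# Possible use: read the list into an array, then loop over it.
# mapfile -t quarters < <(list_quarters original)
# for q in "${quarters[@]}"; do echo "processing $q"; done
```

Reading into an array first keeps the loop body in the current shell, so variables set inside the loop are still visible afterwards.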
Try this, in bash:
~ > $ ls
costumerA_2019Q2_something.pdf costumerB_2019Q2_something.pdf
costumerA_2019Q3_something.pdf other.pdf
costumerA_2020Q1_something.pdf someother.file.txt
~ > $ for x in `(ls)`; do [[ ${x} =~ [0-9]Q[1-4] ]] && echo $x; done;
~ > $ (for x in *; do [[ ${x} =~ ([0-9]{4}Q[1-4]).+pdf ]] && echo ${BASH_REMATCH[1]}; done;) | sort -u

Bash - Multiple replace with sed statement

I'm going mad over a script's performance.
Basically I have to replace 600 strings in more than 35000 files.
I have got something like this:
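patterns=(
    oldText1 newText1
    oldText2 newText2
    oldText3 newText3
)
pattern_count=${#patterns[@]}
files=(`find \. -name '*.js'`);
files_count=${#files[@]}
for ((i=0; i < $pattern_count ; i=i+2)); do
    search=${patterns[i]};
    replace=${patterns[i+1]};
    echo -en "\e[0K\r Status "$proggress"%. Iteration: "$i" of " $pattern_count;
    for ((j=0; j < $files_count; j++)); do
        command sed -i s#$search#$replace#g ${files[j]};
        proggress=$((j * 100 / files_count));   # assumed progress formula
        echo -en "\e[0K\r Inside the second loop: " $proggress"%. File: "$j" of "$files_count;
    done
    proggress=$((i * 100 / pattern_count));     # assumed progress formula
    echo -en "\e[0K\r Status "$proggress"%. Iteration: "$i" of " $pattern_count;
done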
But this takes a very long time. Is there another solution? Probably using sed just once per file and not in a double loop?
Thanks a lot.
Create a proper sed script:
s/oldText1/newText1/g
s/oldText2/newText2/g
s/oldText3/newText3/g
Run this script with sed -f script.sed file (or in whatever way is required).
You may create that sed script using your array:
printf 's/%s/%s/g\n' "${patterns[@]}" >script.sed
Applying it to the files:
find . -type f -name '*.js' -exec sed -i -f script.sed {} ';'
I don't quite know how GNU sed (which I assume you're using) is handling multiple files when you use -i, but you may also want to try
find . -type f -name '*.js' -exec sed -i -f script.sed {} +
which may potentially be much more efficient (executing as few sed commands as possible). As always, test on data that you can afford to throw away after testing.
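One caveat with generating the script by a plain printf is that a search or replacement string containing /, &, a backslash or a regex metacharacter would change the meaning of the generated command. A possible way to guard against that, assuming the strings are meant literally and contain no newlines, is to escape them first. The helper name make_sed_script is hypothetical.

```shell
#!/usr/bin/env bash
# make_sed_script OLD1 NEW1 [OLD2 NEW2 ...]   (hypothetical helper)
# Prints one "s/OLD/NEW/g" line per pair, escaping the characters that are
# special to sed so that every string is treated literally.
# Assumes the strings contain no newlines.
make_sed_script() {
    local search replace
    while (( $# >= 2 )); do
        # Pattern side: escape ] [ \ . * ^ $ and the / delimiter.
        search=$(printf '%s' "$1" | sed 's|[][\.*^$/]|\\&|g')
        # Replacement side: escape \ & and the / delimiter.
        replace=$(printf '%s' "$2" | sed 's|[\&/]|\\&|g')
        printf 's/%s/%s/g\n' "$search" "$replace"
        shift 2
    done
}
```

With the array from the question this could be called as make_sed_script "${patterns[@]}" >script.sed, after which the find commands above apply unchanged.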
For more information about using -exec with find, see
You don't need to run sed multiple times over one file. You can separate sed commands with ';'
You can execute multiple seds in parallel
For example:
patterns=(
    oldText1 newText1
    oldText2 newText2
    oldText3 newText3
)
# construct sed argument such as 's/old/new/g;s/old2/new2/g;...'
sedarg=$(
    for ((i = 0; i < ${#patterns[@]}; i += 2)); do
        echo -n "s/${patterns[i]}/${patterns[i+1]}/g;"
    done
)
# find all files named '*.js' and pass them to xargs with zero as separator
# xargs will parse them:
# -0 use zero as separator
# --verbose will print the line before execution (ie. sed -i .... file)
# -n1 pass one argument/one line to one sed
# -P8 run 8 seds simultaneously (experiment with that value, depends on how fast your cpu and harddrive is)
find . -type f -name '*.js' -print0 | xargs -0 --verbose -n1 -P8 sed -i "$sedarg"
If you need the progress bar so much, I guess you can count the lines xargs --verbose returns or better use parallel --bar, see this post.

How to make this script grep only the 1st line

for i in USER; do
    find /home/$i/public_html/ -type f -iname '*.php' \
    | xargs grep -A1 -l 'GLOBALS\|preg_replace\|array_diff_ukey\|gzuncompress\|gzinflate\|post_var\|sF=\|qV=\|_REQUEST'
done
It's ignoring the -A1. The end result I want is for it to show me files that contain any of the matching words, but only on the first line of the script. If there is a better, more efficient, less resource-intensive way, that would be great as well, as this will be run on very large shared servers.
Use awk instead:
for i in USER; do
    find /home/$i/public_html/ -type f -iname '*.php' -exec \
        awk 'FNR == 1 && /GLOBALS|preg_replace|array_diff_ukey|gzuncompress|gzinflate|post_var|sF=|qV=|_REQUEST/ { print FILENAME }' {} +
done
This will print the current input file if the first line matches. It's not ideal, since it will read all of each file. If your version of awk supports it, you can use
awk '/GLOBALS|.../ { print FILENAME } {nextfile}'
The nextfile command will execute for the first line, effectively skipping the rest of the file after awk tests if it matches the regular expression.
The following code is untested:
for i in USER; do
    find /home/$i/public_html/ -type f -iname '*.php' | while read -r; do
        head -n1 "$REPLY" | grep -q 'GLOBALS\|preg_replace\|array_diff_ukey\|gzuncompress\|gzinflate\|post_var\|sF=\|qV=\|_REQUEST' \
            && echo "$REPLY"
    done
done
The idea is to loop over each find result, explicitly test the first line, and print the filename if a match was found. I don't like it though because it feels so clunky.
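A variant of the same idea that avoids starting head and grep for every file is to let bash read only the first line itself. This is a sketch, not a tuned solution: first_line_hits is a hypothetical name, and the pattern is an extended regular expression as used by bash's [[ =~ ]].

```shell
#!/usr/bin/env bash
# first_line_hits PATTERN FILE...   (hypothetical helper)
# Prints the name of every FILE whose first line matches the extended
# regular expression PATTERN. Only the first line of each file is read.
first_line_hits() {
    local pattern=$1 file first
    shift
    for file in "$@"; do
        first=''
        # read fails on unreadable files and on a last line without a newline;
        # in the latter case the partial line is still in $first.
        IFS= read -r first < "$file" || [ -n "$first" ] || continue
        if [[ $first =~ $pattern ]]; then
            printf '%s\n' "$file"
        fi
    done
    return 0
}
```

It could be driven by find with NUL-separated names, for example find /home/USER/public_html/ -type f -iname '*.php' -print0 | while IFS= read -r -d '' f; do first_line_hits 'GLOBALS|preg_replace|_REQUEST' "$f"; done.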
for j in $(find /home/$i/public_html/ -type f -iname '*.php');
do result=$(head -n1 "$j" | grep "$stuff");
[[ x$result != x ]] && echo "$j: $result";
done
You'll need a little more effort to skip leading blank lines. fgrep will save resources.
A little perl would bring great improvement, but it's hard to type it on a phone.
On a less cramped keyboard, I inserted a less brief solution.
