Grab random files from a directory using just bash

I am looking to create a bash script that can grab files fitting a certain glob pattern and cp them to another folder. For example, given:
$foo/
a.txt
b.txt
c.txt
e.txt
f.txt
g.txt
running a script that requests 2 files, I would get:
$bar/
c.txt
f.txt
I am not sure if bash has a random number generator, or how to use one to pull from a list. The directory is also large (over 100K files), so some of the glob-based approaches won't work.
Thanks in advance

Using GNU shuf, this copies N random files matching the given glob pattern in the given source directory to the given destination directory.
#!/bin/bash -e
shopt -s failglob
n=${1:?} glob=${2:?} source=${3:?} dest=${4:?}
IFS=
[[ -d "$source" ]]
[[ -d "$dest" && -w "$dest" ]]
cd "$dest"
dest=$PWD
cd "$OLDPWD"
cd "$source"
printf '%s\0' $glob |
shuf -zn "$n" |
xargs -0 cp -t "$dest"
Use like:
./cp-rand 2 '?.txt' /source/dir /dest/dir
This will work for a directory containing thousands of files. xargs will manage limits like ARG_MAX.
$glob, unquoted, undergoes filename expansion (glob expansion). Because IFS is empty, the glob pattern can contain whitespace.
Matching sub-directories will cause cp to error and exit prematurely (some files may already have been copied). Use cp -r to allow sub-directories.
cp -t target and xargs -0 are not POSIX.
Note that using a random number to select files from a list can cause duplicates, so you might copy fewer than N files. Hence using GNU shuf, which samples without replacement.
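If the non-POSIX pieces are a concern, a rough sketch (still relying on GNU shuf and bash's read -d '', and reusing the script's $glob, $n and $dest) copies files one at a time instead of using cp -t with xargs -0:
printf '%s\0' $glob |
shuf -zn "$n" |
while IFS= read -r -d '' f; do
    cp "$f" "$dest/"
done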

Try this:
#!/bin/bash
sourcedir="files"
# Arguments processing
if [[ $# -ne 1 ]]
then
    echo "Usage: random_files.bash NUMBER-OF-FILES"
    echo "       NUMBER-OF-FILES: how many random files to select"
    exit 0
else
    numberoffiles="$1"
fi
# Build the list of files
listoffiles=()
while IFS='' read -r line; do listoffiles+=("$line"); done < <(find "$sourcedir" -type f -print)
totalnumberoffiles=${#listoffiles[@]}
# loop on the number of files the user wanted
for (( i=1; i<=numberoffiles; i++ ))
do
    # Select a random index between 0 and $totalnumberoffiles - 1
    randomnumber=$(( RANDOM % totalnumberoffiles ))
    echo "${listoffiles[$randomnumber]}"
done
Build an array with the filenames.
Pick a random number from 0 to the size of the array.
Display the filename at that index.
I built in a loop in case you want to randomly select more than one file.
You could set up another argument for the location of the files; I hard-coded it here.
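If you want to avoid picking the same file twice without reaching for shuf, a sketch (assuming bash 4.3+ for negative array subscripts) is to overwrite each picked slot with the last element and shrink the array:
i=0
while (( i < numberoffiles && ${#listoffiles[@]} > 0 )); do
    randomnumber=$(( RANDOM % ${#listoffiles[@]} ))
    echo "${listoffiles[$randomnumber]}"
    # overwrite the chosen slot with the last element, then drop the last one
    listoffiles[$randomnumber]=${listoffiles[-1]}
    unset 'listoffiles[-1]'
    (( i++ ))
done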
Another method, if the array approach fails because of too many files in the same directory, could be:
#!/bin/bash
sourcedir="files"
# Arguments processing
if [[ $# -ne 1 ]]
then
    echo "Usage: random_files.bash NUMBER-OF-FILES"
    echo "       NUMBER-OF-FILES: how many random files to select"
    exit 0
else
    numberoffiles="$1"
fi
# Build the list of files
find "$sourcedir" -type f -print >list.txt
totalnumberoffiles=$(wc -l list.txt | awk '{print $1}')
# loop on the number of files the user wanted
for (( i=1; i<=numberoffiles; i++ ))
do
    # Select a random number between 1 and $totalnumberoffiles
    randomnumber=$(( ( RANDOM % totalnumberoffiles ) + 1 ))
    sed -n "${randomnumber}p" list.txt
done
/bin/rm -f list.txt
Build a list of the files, so that each filename is on one line.
Select a random number.
In this version, randomnumber must be offset by +1, since line numbering starts at 1, not at 0 like array indices.
Use sed to print that random line from the list of files.
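One caveat given the 100K-file directory from the question: bash's RANDOM only yields 0-32767, so RANDOM % totalnumberoffiles can never select lines beyond 32767. A hedged workaround combines two draws (not perfectly uniform, but close):
# RANDOM is 15 bits; two draws give a 30-bit value that covers large lists
randomnumber=$(( ( (RANDOM << 15 | RANDOM) % totalnumberoffiles ) + 1 ))
sed -n "${randomnumber}p" list.txt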


More "random" alternative to shuf for selecting files in a directory

I put together the following Bash function (in my .bashrc) to open a "random" image from a given folder, one at a time until the user types N, after which it exits. The script works fine aside from the actual randomness of the images generated - in a quick test of 10 runs, only 4 images are unique.
Is this simply unavoidable due to the limited number of images in the directory (20), or is there an alternative to the shuf command that will yield more random results?
If it is unavoidable, what's the best way to adapt the function to avoid repeats (i.e. discard images that have already been selected)?
function generate_image() {
    while true; do
        command cd "D:\Users\Hashim\Pictures\Data" &&
        image="$(find . -type f -exec file --mime-type {} \+ | awk -F: '{if ($2 ~/image\//) print $1}' | shuf -n1)" &&
        echo "Opening $image" &&
        cygstart "$image"
        read -p "Open another random image? [Y/n]"$'\n' -n 1 -r
        echo
        if [[ $REPLY =~ ^[Nn]$ ]]
        then exit
        fi
    done
}
One way to handle this is by searching the filesystem and creating an array with a list of files in randomized order, and going through everything in that list before searching again.
Because you go through everything from one batch of shuf output before starting the next, there's no longer a risk of repeats until everything has been seen.
refresh_image_list() {
    # respect prior image_dir value if set before the function is called
    image_dir=${image_dir:-'D:/Users/Hashim/Pictures/Data'}
    readarray -d '' image_list < <(
        find "$image_dir" -type f -exec file -0 --mime-type -- {} + \
        | while IFS= read -r -d '' filename && IFS= read -r desc; do
            [[ $desc = *image* ]] && printf '%s\0' "$filename"
          done \
        | shuf -z
    )
}
generate_image() {
    while true; do
        (( ${#image_list[@]} )) || refresh_image_list  # if list is empty, recreate
        set -- "${image_list[@]}"       # set argument list from image list
        while (( $# )); do              # argument list isn't empty?
            echo "Opening $1"           # ...try the first item on it
            cygstart "$1"
            shift                       # ...and then discard that item
            read -p $'Open another random image? [Y/n]\n' -n 1 -r
            echo
            if [[ $REPLY = [Nn] ]]; then    # user wants to quit?
                image_list=( "$@" )         # store unused images back to the list
                return 0
            fi
        done
    done
}
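Usage is then just calling the function, optionally pointing it somewhere else first (the override path below is hypothetical):
image_dir='D:/Users/Hashim/Pictures/Other'   # hypothetical override
generate_image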
We can simplify this if we're willing to just stop after the user has seen every image once, instead of generating a new batch, and don't need persistence across invocations:
generate_image() {
    while IFS= read -r -d '' filename <&3; do
        echo "Opening $filename"
        cygstart "$filename"
        read -p $'Open another random image? [Y/n]\n' -n 1 -r
        echo
        [[ $REPLY = [Nn] ]] && return 0
    done 3< <(
        find "$image_dir" -type f -exec file -0 --mime-type -- {} + \
        | while IFS= read -r -d '' filename && IFS= read -r desc; do
            [[ $desc = *image* ]] && printf '%s\0' "$filename"
          done \
        | shuf -z
    )
}
File listings are rarely so gigantic that they can't fit into RAM for awk:
find … -print0 |
mawk 'BEGIN { FS = "\0"
_^= RS = "^$"
} END { printf("%*s", srand()*!_, $(int(rand()*(NF-_))+_)) }'
That'll randomly print out the filename for one of the image files found, with no trailing byte of either \0 or \n, without having to perform any sort of sorting/shuffling.
NF - 1 is used because find prints a final \0, so NF is always 1 more than the number of files found.
It also protects against empty input: instead of referencing a negative field number, simply nothing gets printed at all.
From there, you can decide you want to open this image file.
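For reference, a hedged sketch of wiring that up end to end ($image_dir is carried over from the earlier answers as an assumption; no mime-type filtering here, just every file under the directory):
image=$(find "$image_dir" -type f -print0 |
        mawk 'BEGIN { FS = "\0"
                      _^= RS = "^$"
        } END { printf("%*s", srand()*!_, $(int(rand()*(NF-_))+_)) }')
[[ -n $image ]] && cygstart "$image"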
Charles' answer is definitely the superior answer here, but for completeness I thought I would also add a middle-ground solution that I stumbled across while experimenting earlier on.
I learnt that shuf can be seeded with an external source of randomness, so by seeding it with /dev/urandom - the randomness generator device available on all UNIX-like systems - it can be made more random:
shuf -n1 --random-source=/dev/urandom
From my tests this appears to result in significantly fewer repeats than a standard shuf command, and could be an ideal solution if you want a little more randomness but can tolerate the occasional repeat.

Separate folders into subfolders according to numbering in bash

I have the following directory tree:
1_loc
2_buzdfg
4_foodga
5_bardfg
6_loc
8_buzass
9_foossd
12_bardaf
There may be numbers missing in the folder ordering.
I want to separate these folders into subfolders according to their numbers, so that all folders with a number smaller than 6 (before the second _loc folder) would go to folder1 and all folders with a number equal to or greater than 6 would go to folder2.
I can solve the problem very easily using the mouse, of course, but I wanted a suggestion of how to do this automatically from the terminal.
Any ideas?
while read -r line; do
    # Strip find's leading "./" so the regex sees the number itself
    dir="${line#./}"
    # Regex match the beginning of the name for a digit between 1 and 5
    # followed by "_" - change this as you'd please to any regex
    FOLDERNUMBER=""
    [[ "$dir" =~ ^[1-5]_ ]] && FOLDERNUMBER="1" || FOLDERNUMBER="2"
    # So FOLDERPATH = "./folder1", if FOLDERNUMBER=1
    FOLDERPATH="./folder$FOLDERNUMBER"
    # Check folder exists, if not create it
    [[ ! -d "$FOLDERPATH" ]] && mkdir "$FOLDERPATH"
    # Finally, move the directory to FOLDERPATH
    mv "$line" "$FOLDERPATH/"
done < <(find . -maxdepth 1 -mindepth 1 -type d -name '[0-9]*_*')
# This iterates through each line of the command in the brackets - in this case, each directory path from the `find` command.
I think the solution is to loop through the files and check the number before the first _.
Firstly, let's check how to get the number before _:
$ d="1_loc_b"
$ echo "${d%%_*}"
1
OK, so this works. Then, let's loop:
for file in *
do
    echo "$file"
    (( ${file%%_*} > 5 )) && echo "moving to dir2/" || echo "moving to dir1/"
done
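A hedged completion that actually moves instead of echoing (assuming dir1 and dir2 already exist, and that every entry here has a numeric prefix):
for file in *_*
do
    if (( ${file%%_*} > 5 )); then
        mv -- "$file" dir2/
    else
        mv -- "$file" dir1/
    fi
done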
Supposing folder1 and folder2 exist in the same directory, I would do it like this:
for d in *_*; do  # to avoid folder1 and folder2
    # check if the first field separated by _ is less than 6
    if (( $(echo "$d" | cut -d"_" -f1) < 6 )); then
        mv "$d" folder1/"$d"
    else
        mv "$d" folder2/"$d"
    fi
done
(more about cut)
You can go to the current directory and run these simple commands:
mv {1..5}_* folder1/
mv {6..12}_* folder2/
This assumes no other files/directories start with these prefixes (i.e. 1-12). Note that brace expansion happens before globbing, so any prefix with no matching entry will make mv complain about a literal name like 3_*.
Another pure-bash, parameter-expansion solution:
#!/bin/bash
# 'find' returns folders having a '_' in their names; the -print0 option
# preserves special characters in names.
# The folders are named like './1_folder', './2_folder'; bash parameter
# expansion is used to strip down to the leading number.
# '-v' flag in 'mv' for verbose action
while IFS= read -r -d '' folder; do
    folderName="${folder%_*}"      # strip the characters after the '_'
    finalName="${folderName##*/}"  # strip everything before the '/'
    ((finalName > 5)) && mv -v "$folder" folder2 || mv -v "$folder" folder1
done < <(find . -maxdepth 1 -mindepth 1 -name "*_*" -type d -print0)
You can create a script with the following code; when you run it, the folders will be moved as desired.
#!/bin/bash
# separate the folders into 2 folders
# this is a generic solution for any folder that starts with a number
for file in *
do
    prefix=$(echo "$file" | awk 'BEGIN{FS="_"};{print $1}')
    if [[ $prefix != ?(-)+([0-9]) ]]
    then continue
    fi
    if [ "$prefix" -le 5 ]
    then mv "$file" folder1
    elif [ "$prefix" -ge 6 ]
    then mv "$file" folder2
    fi
done

counting files, directories and subdirectories in a given path

I am trying to figure out how to run a simple script "count.sh" that is called together with a path, e.g.:
count.sh /home/users/documents/myScripts
The script will need to iterate over the directories in this path and print how many files and folders (including hidden ones) exist at each level of this path.
For example:
7
8
9
10
(myScripts - 7, documents - 8, users - 9, home - 10)
And by the way, can I run this script using count.sh pwd?
More or less something like that:
#!/bin/sh
P="$1"
while [ "/" != "$P" ]; do
    echo "$P `find \"$P\" -maxdepth 1 -type f | wc -l`"
    P=`dirname "$P"`
done
echo "$P `find \"$P\" -maxdepth 1 -type f | wc -l`"
You can use it from the current directory with script.sh `pwd`
Another approach is to tokenize the path into an array, controlling word-splitting with the internal field separator (IFS). You can include the root directory if desired (you would then need to trim the additional leading '/' in the printout).
#!/bin/bash
[ -z "$1" -o ! -d "$1" ] && {
printf "error: directory argument required.\n"
exit 1
}
p="$1" ## remove two lines to include /
[ "${p:0:1}" = "/" ] && p="${p:1}"
oifs="$IFS" ## save internal field separator value
IFS=$'/' ## set to break on '/'
array=( $p ) ## tokenize given path into array
IFS="$oifs" ## restore original IFS
## print out path level info using wc
for ((i=0; i<${#array[#]}; i++)); do
dirnm="${dirnm}/${array[i]}"
printf "%d. %s -- %d\n" "$((i+1))" "$dirnm" $(($(ls -Al "$dirnm" | wc -l)-1))
done
Example Output
$ bash filedircount.sh /home/david/tmp
1. /home -- 5
2. /home/david -- 132
3. /home/david/tmp -- 113
As an alternative, you could use a for loop to loop through and count the items in the directory at each level instead of using wc if desired.
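For instance, a minimal sketch of that loop-based count (reusing $dirnm and $i from the script above; nullglob/dotglob are assumptions so hidden entries are counted and an empty directory yields 0):
shopt -s nullglob dotglob
count=0
for entry in "$dirnm"/*; do
    ((count++))
done
printf "%d. %s -- %d\n" "$((i+1))" "$dirnm" "$count"
shopt -u nullglob dotglob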
You could try the following
#!/bin/bash
DIR=$(cd "$1" ; pwd)
PREFIX=
until [ "$DIR" = / ] ; do
    echo -n "$PREFIX"$(basename "$DIR")" "$(ls -Ab "$DIR" | wc -l)
    DIR=$(dirname "$DIR")
    PREFIX=", "
done
echo
(ls -Ab lists all files and folders except . and .. and escapes special characters so that only one line is printed per file even if filenames include newline characters. wc -l counts lines.)
You can invoke the script using
count.sh `pwd`

Adding up file sizes in a bash shell

I've written a shell script that takes a directory as an arg and prints the file names and sizes. I want to find out how to add up the file sizes and store the total so that I can print it after the loop. I've tried a few things but haven't gotten anywhere so far. Any ideas?
#!/bin/bash
echo "Directory <$1> contains the following files:"
let "x=0"
TEMPFILE=./count.tmp
echo 0 > $TEMPFILE
ls $1 |
while read file
do
    if [ -f $1/$file ]
    then
        echo "file: [$file]"
        stat -c%s $file > $TEMPFILE
    fi
    cat $TEMPFILE
done
echo "number of files:"
cat ./count.tmp
Help would be thoroughly appreciated.
A number of issues in your code:
Don't parse ls
Quote variables in the large majority of cases
Don't use temp files when they're not needed
Use already made tools like du for this (see comments)
Assuming you're just wanting to get practice at this and/or want to do something else other than what du already does, you should change syntax to something like
#!/bin/bash
dir="$1"
[[ $dir == *'/' ]] || dir="$dir/"
if [[ -d $dir ]]; then
    echo "Directory <$1> contains the following files:"
else
    echo "<$1> is not a valid directory, exiting"
    exit 1
fi
shopt -s dotglob
for file in "$dir"*; do
    if [[ -f $file ]]; then
        echo "file: [$file]"
        ((size+=$(stat -c%s "$file")))
    fi
done
echo "$size"
Note:
You don't have to pre-allocate variables in bash; $size is assumed to be 0.
You can use (( )) for math that doesn't require decimal places.
You can use globs (*) to get all files (including dirs, symlinks, etc...) in a particular directory (and globstar ** for recursion).
shopt -s dotglob is needed so hidden .whatever files are included in glob matching.
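As a hedged sketch of that recursive variant (assuming bash 4+ for globstar, and reusing $dir from the script above, which ends in a slash):
shopt -s globstar dotglob
for file in "$dir"**; do
    [[ -f $file ]] && ((size+=$(stat -c%s "$file")))
done
echo "$size"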
You can use ls -l to find size of files:
echo "Directory $1 contains the following:"
size=0
for f in "$1"/*; do
if [[ ! -d $f ]]; then
while read _ _ _ _ bytes _; do
if [[ -n $bytes ]]; then
((size+=$bytes))
echo -e "\tFile: ${f/$1\//} Size: $bytes bytes"
fi
done < <(ls -l "$f")
fi
done
echo "$1 Files total size: $size bytes"
Parsing ls results for size is ok here as byte size will always be found in the 5th field.
If you know what the date stamp format for ls is on your system and portability isn't important, you can parse ls to reliably find both the size and file in a single while read loop.
echo "Directory $1 contains the following:"
size=0
while read _ _ _ _ bytes _ _ _ file; do
if [[ -f $1/$file ]]; then
((size+=$bytes))
echo -e "\tFile: $file Size: $bytes bytes"
fi
done < <(ls -l "$1")
echo "$1 Files total size: $size bytes"
Note: These solutions would not include hidden files. Use ls -la for that.
Depending on the need or preference, ls can also print sizes in a number of different formats using options like -h or --block-size=SIZE.
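For example (both are GNU ls options, not POSIX):
ls -lh "$1"                 # human-readable sizes like 4.0K or 2.3M
ls -l --block-size=KB "$1"  # sizes in units of 1000 bytes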
#!/bin/bash
echo "Directory <$1> contains the following files:"
find "${1}"/* -prune -type f -ls | \
awk '{print; SIZE+=$7} END {print ""; print "total bytes: " SIZE}'
Use find with -prune (so it does not recurse into subdirectories) and -type f (so it will only list files and no symlinks or directories) and -ls (so it lists the files).
Pipe the output into awk and
for each line, print the whole line (print; replace with print $NF to print only the last item of each line, which is the filename including the directory). Also add the value of the 7th field, which is the file size (in my version of find), to the variable SIZE.
After all lines have been processed (END) print the calculated total size.
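A hedged variant of the same pipeline that prints only the filenames plus the total (field positions in -ls output can vary between systems, as noted above):
find "${1}"/* -prune -type f -ls | \
awk '{print $NF; SIZE+=$7} END {print "total bytes: " SIZE}'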

Why is my while loop not working?

AIM: To find files with a word count less than 1000 and move them to another folder. Loop until all under-1k files are moved.
STATUS: It will only move one file, then error with "Unable to move file as it doesn't exist". For some reason $INPUT_SMALL doesn't seem to update with the new file name.
What am I doing wrong?
Current Script:
Check for input files already under 1k and move to Split folder
INPUT_SMALL=$( ls -S /folder1/ | grep -i reply | tail -1 )
INPUT_COUNT=$( cat /folder1/$INPUT_SMALL 2>/dev/null | wc -l )
function moveSmallInput() {
    while [[ $INPUT_SMALL != "" ]] && [[ $INPUT_COUNT -le 1003 ]]
    do
        echo "Files smaller than 1k have been found in input folder, these will be moved to the split folder to be processed."
        mv /folder1/$INPUT_SMALL /folder2/
    done
}
I assume you are looking for files that have the word reply somewhere in the path. My solution is:
wc -w $(find /folder1 -type f -path '*reply*') | \
while read wordcount filename
do
    if [[ $wordcount -lt 1003 ]]
    then
        printf "%4d %s\n" $wordcount $filename
        #mv "$filename" /folder2
    fi
done
Run the script once, if the output looks correct, then uncomment the mv command and run it for real this time.
Update
The above solution has trouble with files with embedded spaces. The problem occurs when the find command hands its output to the wc command. After a little bit of thinking, here is my revised solution:
find /folder1 -type f -path '*reply*' | \
while read filename
do
    set $(wc -w "$filename")   # $1 = word count, $2 = filename
    wordcount=$1
    if [[ $wordcount -lt 1003 ]]
    then
        printf "%4d %s\n" $wordcount $filename
        #mv "$filename" /folder2
    fi
done
A somewhat shorter version
#!/bin/bash
find ./folder1 -type f | while read f
do
    (( $(wc -w "$f" | awk '{print $1}' ) < 1000 )) && cp "$f" folder2
done
I left cp instead of mv for safety reasons. Change to mv after validating.
If you also want to filter with reply, use @Hai's version of the find command.
Your variables INPUT_SMALL and INPUT_COUNT are not functions; they're just values you assigned once. You either need to move those assignments inside your while loop or turn them into functions that are evaluated each time (rather than just expanding the variable values, as you are now).
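A minimal sketch of the first option, recomputing both values on each iteration (keeping the original ls -S | grep | tail selection for continuity, with the usual caveats about parsing ls):
function moveSmallInput() {
    while true
    do
        INPUT_SMALL=$( ls -S /folder1/ | grep -i reply | tail -1 )
        [[ $INPUT_SMALL != "" ]] || break
        INPUT_COUNT=$( wc -l < "/folder1/$INPUT_SMALL" 2>/dev/null )
        [[ $INPUT_COUNT -le 1003 ]] || break
        echo "Moving $INPUT_SMALL to the split folder."
        mv "/folder1/$INPUT_SMALL" /folder2/
    done
}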
