I have a directory with about 2000 files. How can I select a random sample of N files using either a bash script or a list of piped commands?
Here's a script that uses GNU sort's random option:
ls | sort -R | tail -n "$N" | while IFS= read -r file; do
# Something involving $file, or you can leave
# off the while to just get the filenames
done
You can use shuf (from the GNU coreutils package) for that. Just feed it a list of file names and ask it to return the first line from a random permutation:
ls dirname | shuf -n 1
# probably faster and more flexible:
find dirname -type f | shuf -n 1
# etc..
Adjust the -n, --head-count=COUNT value to return the number of wanted lines. For example to return 5 random filenames you would use:
find dirname -type f | shuf -n 5
Here are a few possibilities that don't parse the output of ls and that are 100% safe regarding files with spaces and funny symbols in their name. All of them will populate an array randf with a list of random files. This array is easily printed with printf '%s\n' "${randf[@]}" if needed.
This one will possibly output the same file several times, and N needs to be known in advance. Here I chose N=42.
a=( * )
randf=( "${a[RANDOM%${#a[@]}]"{1..42}"}" )
This feature is not very well documented.
If N is not known in advance, but you really liked the previous possibility, you can use eval. But it's evil, and you must really make sure that N doesn't come directly from user input without being thoroughly checked!
N=42
a=( * )
eval randf=( \"\${a[RANDOM%\${#a[@]}]\"\{1..$N\}\"}\" )
I personally dislike eval and hence this answer!
The same using a more straightforward method (a loop):
N=42
a=( * )
randf=()
for((i=0;i<N;++i)); do
    randf+=( "${a[RANDOM%${#a[@]}]}" )
done
If you don't want to possibly have several times the same file:
N=42
a=( * )
randf=()
for((i=0;i<N && ${#a[@]};++i)); do
    ((j=RANDOM%${#a[@]}))
    randf+=( "${a[j]}" )
    a=( "${a[@]:0:j}" "${a[@]:j+1}" )
done
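For reuse, the no-duplicates loop above could be wrapped in a function. This is only a sketch: sample_files is a made-up helper name, the directory argument is my addition, and it assumes a bash new enough for arrays and arithmetic expansion.

```bash
#!/usr/bin/env bash
# sample_files N DIR: fill the global array randf with up to N distinct
# entries of DIR. (sample_files is a hypothetical helper name.)
sample_files() {
    local n=$1 dir=$2 i j
    local a=( "$dir"/* )
    # An unmatched glob stays literal; treat that case as "no files".
    [[ -e ${a[0]} || -L ${a[0]} ]] || a=()
    randf=()
    for (( i = 0; i < n && ${#a[@]} > 0; ++i )); do
        j=$(( RANDOM % ${#a[@]} ))          # random index into what is left
        randf+=( "${a[j]}" )
        a=( "${a[@]:0:j}" "${a[@]:j+1}" )   # drop the chosen element
    done
}
```

Calling sample_files 5 /some/dir followed by printf '%s\n' "${randf[@]}" would then print up to five distinct paths. If the directory holds fewer than N entries, you simply get all of them.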
Note. This is a late answer to an old post, but the accepted answer links to an external page that shows terrible bash practice, and the other answer is not much better as it also parses the output of ls. A comment to the accepted answer points to an excellent answer by Lhunath which obviously shows good practice, but doesn't exactly answer the OP.
ls | shuf -n 10 # ten random files
A simple solution for selecting 5 random files without parsing ls. It also works with files containing spaces, newlines and other special characters:
shuf -ezn 5 * | xargs -0 -n1 echo
Replace echo with the command you want to execute for your files.
This is an even later response to @gniourf_gniourf's late answer, which I just upvoted because it's by far the best answer, twice over. (Once for avoiding eval and once for safe filename handling.)
But it took me a few minutes to untangle the "not very well documented" feature(s) this answer uses. If your Bash skills are solid enough that you saw immediately how it works, then skip this comment. But I didn't, and having untangled it I think it's worth explaining.
Feature #1 is the shell's own file globbing. a=(*) creates an array, $a, whose members are the files in the current directory. Bash understands all the weirdnesses of filenames, so that list is guaranteed correct, guaranteed escaped, etc. No need to worry about properly parsing textual file names returned by ls.
Feature #2 is Bash parameter expansions for arrays, one nested within another. This starts with ${#ARRAY[@]}, which expands to the length of $ARRAY.
That expansion is then used to subscript the array. The standard way to get a random number between 0 and N-1 is to take a random number modulo N. We want a random index between 0 and one less than the length of our array. Here's the approach, broken into two lines for clarity's sake:
LENGTH=${#a[@]}
RANDOM_FILE=${a[RANDOM%$LENGTH]}
But this solution does it in a single line, removing the unnecessary variable assignment.
Feature #3 is Bash brace expansion, although I have to confess I don't entirely understand it. Brace expansion is used, for instance, to generate a list of 25 files named filename1.txt, filename2.txt, etc: echo "filename"{1..25}".txt".
The expression inside the array assignment above, "${a[RANDOM%${#a[@]}]"{1..42}"}", uses that trick to produce 42 separate expansions. The brace expansion places a number in between the ] and the }, which at first I thought was subscripting the array, but if so it would be preceded by a colon. (It would also have returned 42 consecutive items from a random spot in the array, which is not at all the same thing as returning 42 random items from the array.) I think it's just making the shell run the expansion 42 times, thereby returning 42 random items from the array. (But if someone can explain it more fully, I'd love to hear it.)
The reason N has to be hardcoded (to 42) is that brace expansion happens before variable expansion.
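The ordering is easy to see for yourself. A small sketch (the variable names are just for illustration): with a literal range the braces expand, but with $N inside them the brace step sees no valid range and leaves the text alone, and only afterwards is $N substituted.

```bash
#!/usr/bin/env bash
N=3
literal=$(echo {1..3})      # brace expansion sees a literal range
variable=$(echo {1..$N})    # $N is still unexpanded when braces are processed
echo "$literal"             # prints: 1 2 3
echo "$variable"            # prints: {1..3}

# A C-style loop is one way to get a variable-length sequence instead:
looped=$(for (( i = 1; i <= N; ++i )); do printf '%s ' "$i"; done)
echo "${looped% }"          # prints: 1 2 3
```

That is why the eval version further up has to build the command text first and evaluate it in a second pass.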
Finally, here's Feature #4, if you want to do this recursively for a directory hierarchy:
shopt -s globstar
a=( ** )
This turns on a shell option that causes ** to match recursively. Now your $a array contains every file in the entire hierarchy.
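One caveat worth a sketch: ** matches directories as well as files, so if you only want regular files you probably need to filter. Something along these lines should work, assuming bash 4.0 or later; collect_files and allfiles are names I made up, and hidden files are skipped unless dotglob is set.

```bash
#!/usr/bin/env bash
# collect_files DIR: fill the global array allfiles with every regular file
# below DIR. (collect_files and allfiles are hypothetical names.)
collect_files() {
    local dir=$1 f
    allfiles=()
    shopt -s globstar nullglob     # globstar needs bash 4.0 or later
    for f in "$dir"/**; do
        if [[ -f $f ]]; then       # skip directories and other non-files
            allfiles+=( "$f" )
        fi
    done
    # Note: this switches both options off even if the caller had them on.
    shopt -u globstar nullglob
}
```

After collect_files . you can index allfiles with RANDOM exactly as with the a array above.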
If you have Python installed (works with either Python 2 or Python 3):
To select one file (or line from an arbitrary command), use
ls -1 | python -c "import sys; import random; print(random.choice(sys.stdin.readlines()).rstrip())"
To select N files/lines, use (note N is at the end of the command, replace this by a number)
ls -1 | python -c "import sys; import random; print(''.join(random.sample(sys.stdin.readlines(), int(sys.argv[1]))).rstrip())" N
If you want to copy a sample of those files to another folder:
ls | shuf -n 100 | xargs -I % cp % ../samples/
Create the samples directory first, obviously.
MacOS does not have the sort -R and shuf commands, so I needed a bash only solution that randomizes all files without duplicates and did not find that here. This solution is similar to gniourf_gniourf's solution #4, but hopefully adds better comments.
The script should be easy to modify to stop after N samples, using a counter with if or gniourf_gniourf's for loop with N. $RANDOM only yields values from 0 to 32767, so this approach covers at most 32768 files, but that should do for most cases.
#!/bin/bash
array=(*) # this is the array of files to shuffle
# echo "${array[@]}"
for dummy in "${array[@]}"; do # loop length(array) times; once for each file
    length=${#array[@]}
    randomi=$(( $RANDOM % $length )) # select a random index
    filename=${array[$randomi]}
    echo "Processing: '$filename'" # do something with the file
    unset -v "array[$randomi]" # remove the element at index $randomi, leaving a gap
    array=("${array[@]}") # copy the array to re-index it and close the gap
done
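If the 32768-file ceiling of $RANDOM mentioned above is a problem, one possible workaround is to glue two $RANDOM values together into a 30-bit number. This is a sketch only: rand_below and rand_result are names I invented, and the modulo step introduces a slight bias, which is fine for picking test files but not for anything statistical.

```bash
#!/usr/bin/env bash
# rand_below MAX: store a pseudo-random integer in [0, MAX) in the global
# variable rand_result. (Both names are hypothetical.)
# Two 15-bit $RANDOM values are combined, so MAX can go up to 2^30.
rand_below() {
    local max=$1
    rand_result=$(( ((RANDOM << 15) | RANDOM) % max ))
}

array=( * )                         # files to choose from
rand_below "${#array[@]}"
echo "Random pick: ${array[rand_result]}"
```

The result goes into a variable rather than being echoed so that $RANDOM is always read in the current shell, not in a command substitution.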
If you have more files in your folder, you can use the below piped command I found in unix stackexchange.
find /some/dir/ -type f -print0 | xargs -0 shuf -e -n 8 -z | xargs -0 cp -vt /target/dir/
Here I wanted to copy the files, but if you want to move files or do something else, just change the last command where I have used cp.
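If you would rather have the sample in a bash array than pipe it straight into cp, something like the following sketch might do. It assumes GNU find and a GNU shuf that supports -z; pick_random and picked are made-up names.

```bash
#!/usr/bin/env bash
# pick_random N DIR: fill the global array picked with up to N random
# regular files found under DIR. (pick_random and picked are hypothetical.)
pick_random() {
    local n=$1 dir=$2 f
    picked=()
    # NUL-delimited all the way, so spaces and newlines in names are safe.
    while IFS= read -r -d '' f; do
        picked+=( "$f" )
    done < <(find "$dir" -type f -print0 | shuf -z -n "$n")
}
```

After pick_random 8 /some/dir you could run cp -v -- "${picked[@]}" /target/dir/ or loop over the array. Since shuf reads the whole list from stdin here, the sample size does not depend on how xargs happens to split a long argument list.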
This is the only script I can get to play nice with bash on MacOS. I combined and edited snippets from the following two links:
ls command: how can I get a recursive full-path listing, one line per file?
http://www.linuxquestions.org/questions/linux-general-1/is-there-a-bash-command-for-picking-a-random-file-678687/
#!/bin/bash
# Reads a given directory and picks a random file.
# The directory you want to use. You could use "$1" instead if you
# wanted to parametrize it.
DIR="/path/to/"
# DIR="$1"
# Internal Field Separator set to newline, so file names with
# spaces do not break our script.
IFS='
'
if [[ -d "${DIR}" ]]
then
# Runs ls on the given dir, and dumps the output into a matrix,
# it uses the new lines character as a field delimiter, as explained above.
# file_matrix=($(ls -LR "${DIR}"))
file_matrix=($(ls -R "$DIR" | awk '; /:$/&&f{s=$0;f=0}; /:$/&&!f{sub(/:$/,"");s=$0;f=1;next}; NF&&f{ print s"/"$0 }'))
num_files=${#file_matrix[*]}
# This is the command you want to run on a random file.
# Change "ls -l" by anything you want, it's just an example.
ls -l "${file_matrix[$((RANDOM%num_files))]}"
fi
exit 0
I use this: it uses a temporary file, but it descends into a directory tree until it finds a regular file and returns it.
# find for a quasi-random file in a directory tree:
# directory to start search from:
ROOT="/";
tmp=/tmp/mytempfile
TARGET="$ROOT"
FILE="";
n=
r=
while [ -e "$TARGET" ]; do
TARGET="$(readlink -f "${TARGET}/$FILE")" ;
if [ -d "$TARGET" ]; then
ls -1 "$TARGET" 2> /dev/null > $tmp || break;
n=$(cat $tmp | wc -l);
if [ $n != 0 ]; then
FILE=$(shuf -n 1 $tmp)
# or if you dont have/want to use shuf:
# r=$(($RANDOM % $n)) ;
# FILE=$(tail -n +$(( $r + 1 )) $tmp | head -n 1);
fi ;
else
if [ -f "$TARGET" ] ; then
rm -f $tmp
echo $TARGET
break;
else
# is not a regular file, restart:
TARGET="$ROOT"
FILE=""
fi
fi
done;
How about a Perl solution slightly doctored from Mr. Kang over here:
How can I shuffle the lines of a text file on the Unix command line or in a shell script?
$ ls | perl -MList::Util=shuffle -e '@lines = shuffle(<>); print @lines[0..4]'
I've been following this tutorial (the idea can also be found in other posts of SO)
http://www.cyberciti.biz/faq/bash-loop-over-file/
This is my test script:
function getAllTests {
allfiles=$TEST_SCRIPTS/*
# Getting all stests in the
if [[ $1 == "s" ]]; then
for f in $allfiles
do
echo $f
done
fi
}
The idea is to print all files (one per line) in the directory found in TEST_SCRIPTS.
Instead of that this is what I get as an output:
/path/to/dir/*
(The actual path obviously, but this is to convey the idea).
I have tried the following experiment in bash. Doing this
a=(./*)
reads all files in the current directory into a as an array. However, if anything other than ./ is used, it does not work.
How can I use this procedure with a directory other than ./?
When there are no matches, the wildcard is not expanded.
I speculate that TEST_SCRIPTS contains a path which does not exist; but without access to your code, there is obviously no way to diagnose this properly.
Common solutions include shopt -s nullglob which causes the shell to replace the wildcard with nothing when there are no matches; and explicitly checking for the expanded value being equal to the wildcard (in theory, this could misfire if there is a single file named literally * so this is not completely bulletproof!)
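Both behaviours are easy to reproduce in a throwaway empty directory. A minimal sketch (the variable names are only for illustration):

```bash
#!/usr/bin/env bash
empty_dir=$(mktemp -d)

shopt -u nullglob
without=( "$empty_dir"/* )   # no match: the pattern stays as literal text
shopt -s nullglob
with=( "$empty_dir"/* )      # no match: the pattern expands to nothing
shopt -u nullglob

echo "without nullglob: ${#without[@]} element(s): ${without[*]}"
echo "with nullglob:    ${#with[@]} element(s)"

# The explicit check: is the "match" just the unexpanded wildcard?
if [[ ${without[0]} == "$empty_dir/*" ]]; then
    echo "wildcard did not match anything"
fi

rmdir "$empty_dir"
```

Without nullglob the array has one element, the pattern itself, which is exactly the /path/to/dir/* output from the question.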
By the by, the allfiles variable appears to be superfluous, and you should generally be much more meticulous about quoting. See When to wrap quotes around a shell variable? for details.
function getAllTests {
local nullglob
shopt -q nullglob || nullglob=reset
shopt -s nullglob
# Getting all stests in the # fix sentence fragment?
if [[ $1 == "s" ]]; then
for f in "$TEST_SCRIPTS"/*; do # notice quotes
echo "$f" # ditto
done
fi
# Unset if it wasn't set originally
case $nullglob in 'reset') shopt -u nullglob;; esac
}
Setting and unsetting nullglob inside a single function is probably excessive; most commonly, you would set it once at the beginning of your script, and then write the script accordingly.
Is there a shorthand in bash to select an arbitrary file? * enumerates all files in the current directory, but what if I only want one file and don't care which it is?
FWIW I'm testing several different ffmpeg commands in a directory with similarly named video files, so tab-complete is cumbersome.
Here's the robust way of getting the first or a random file in a directory, handling the edge case of not having any files:
#!/bin/bash
# Let globs expand to 0 elements instead of themselves if no matches
shopt -s nullglob
# Add all the files in the current dir to an array
files=(*)
# Check if the array has any elements
if [[ ${#files[@]} -gt 0 ]]
then
first_file=${files[0]}
random_file=${files[RANDOM%${#files[@]}]}
echo "The first file is ${first_file}"
echo "A random file is ${random_file}"
else
echo "There are no files in the current directory."
fi
If you just want something short and hacky for interactive testing, you can create an array and reference it unindexed to get the first element with minimal typing:
$ testfile=( *.avi )
$ ffmpeg -i "$testfile" test.mp3
You can also bind Tab to zsh style completion:
$ bind 'TAB:menu-complete'
now, for the rest of this session, when you press Tab you'll get a complete filename instead of just a prefix (press Tab again to cycle through matches). This will let you conveniently pick a file with a single keystroke.
Occasionally I use shuf:
find -name '*whatever*' | shuf | head -n 1
shuf is a tool from GNU coreutils that prints its input lines in random order. In other words, it shuffles the lines.
I have a directory config with the following file listing:
$ ls config
file one
file two
file three
I want a bash script that will, when given no arguments, iterate over all those files; when given names of files as arguments, I want it to iterate over the named files.
#!/bin/sh
for file in ${@:-config/*}
do
echo "Processing '$file'"
done
As above, with no quotes around the list term in the for loop, it produces the expected output in the no-argument case, but breaks when you pass an argument (it splits the file names on spaces). Quoting the list term (for file in "${@:-config/*}") works when I pass file names, but fails to expand the glob if I don't.
Is there a way to get both cases to work?
For a simpler solution, just modify your IFS variable
#!/bin/bash
IFS=''
for file in ${@:-config/*}
do
echo "Processing '$file'"
done
IFS=$' \n\t'
$IFS is a shell variable that lists the separators the shell uses for word splitting. If you remove the space from this list, the shell won't split on spaces anymore. You should set it back to its default value after your function so that it doesn't cause other functions to misbehave later in your script.
NOTE: This seems to misbehave with dash (I used Debian, where #!/bin/sh links to dash). If you use an empty $IFS, the args passed will be returned as only 1 file. However, if you put some other value (e.g. IFS=':'), the behaviour will be the one you wanted (except if there is a : in your file names).
This works fine with #!/bin/bash, though
Set the positional parameters explicitly if none are given; then the for loop is the same for both cases:
[ $# -eq 0 ] && set -- config/*
for file in "$@"; do
echo "Processing '$file'"
done
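To try this out without touching a real script, the same logic can be put in a function, since set -- inside a function only replaces that function's own positional parameters. A sketch, with process_all as a made-up name:

```bash
#!/usr/bin/env bash
# process_all [FILE...]: iterate over the given files, or over config/*
# when called without arguments. (process_all is a hypothetical name.)
process_all() {
    local file
    if [ "$#" -eq 0 ]; then
        set -- config/*          # replaces this function's own arguments
    fi
    for file in "$@"; do
        echo "Processing '$file'"
    done
}
```

Run from the directory that contains config, process_all prints one line per file, and process_all "config/file one" prints a single line with the space intact.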
Put the processing code in a function, and then use different loops to call it:
if [ $# -eq 0 ]
then for file in config/*
do processing_func "$file"
done
else for file in "$@"
do processing_func "$file"
done
fi
I have a simple bash script, simple.sh, as follows:
#!/usr/local/bin/bash
for i in $1
do
echo The current file is $i
done
When I run it with the following argument:
./simple.sh /home/test/*
it would only print and list out the first file located in the directory.
However, if I change my simple.sh to:
#!/usr/local/bin/bash
DIR=/home/test/*
for i in $DIR
do
echo The current file is $i
done
it would correctly print out the files within the directory. Can someone help explain why the argument being passed is not showing the same result?
"$1" is only the first argument, that is, the first file or directory the wildcard matched.
You should do it in this way:
for i in "$@"
do
echo The current file is ${i}
done
If you execute it with:
./simple.sh *
it lists all files in the current directory.
"$1" is alphabetically the first file/directory of your current directory, so in the for loop "i" takes a single value, e.g. a1.sh, and the loop then ends!
If you do:
DIR=/home/<s.th.>/*
DIR stores the pattern; it expands to all the files/directories when $DIR is used unquoted in the for loop!
This is as portable as it gets, has no useless forks to ls and runs with a minimum of CPU cycles wasted:
#!/bin/sh
cd "$1" || exit 1
for i in *; do
echo The current file is "$i"
done
Run as ./simple.sh /home/test
Your script does not receive "/home/test/*" as an argument; the shell expands the pattern to the list of files that match, and your script receives multiple arguments, one per matching file. Quoting the argument will work:
./simple.sh "/home/test/*"
Your change to using DIR=/home/test/* did what you expected because filename generation is not performed on the RHS of a variable assignment. When you left $DIR unquoted in the for loop, the pattern was expanded to the list of matching files.
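You can watch where the expansion happens by counting arguments. A small sketch in a temporary directory (count_args and the variable names are just for this demonstration):

```bash
#!/usr/bin/env bash
count_args() { echo "$#"; }        # prints how many arguments it received

demo_dir=$(mktemp -d)
touch "$demo_dir/a" "$demo_dir/b" "$demo_dir/c"

unquoted=$(count_args "$demo_dir"/*)    # caller's shell expands the glob
quoted=$(count_args "$demo_dir/*")      # pattern is passed through as one word
pattern="$demo_dir/*"                   # no expansion in an assignment
expanded=$(count_args $pattern)         # deliberately unquoted: expands here

echo "unquoted glob:     $unquoted argument(s)"
echo "quoted glob:       $quoted argument(s)"
echo "unquoted variable: $expanded argument(s)"

rm -rf "$demo_dir"
```

With three files in the directory, the first and third calls see three arguments and the quoted one sees a single argument, which is the difference between the two versions of simple.sh.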
How about listing the files manually instead of using *:
#!/usr/local/bin/bash
for i in $(ls $1)
do
echo The current file is $i
done
and type
./simple.sh /home/test/