Split a folder into multiple subfolders in terminal/bash script - bash

I have several folders, each with between 15,000 and 40,000 photos. I want each of these to be split into sub folders - each with 2,000 files in them.
What is a quick way to do this that will create each folder I need on the go and move all the files?
Currently I can only find how to move the first x items in a folder into a pre-existing directory. In order to use this on a folder with 20,000 items... I would need to create 10 folders manually, and run the command 10 times.
ls -1 | sort -n | head -2000| xargs -i mv "{}" /folder/
I tried putting it in a for-loop, but am having trouble getting it to make folders properly with mkdir. Even after I get around that, I need the program to only create folders for every 2,000th file (start of a new group). It wants to make a new folder for each file.
So... how can I easily move a large number of files into folders of an arbitrary number of files in each one?
Any help would be very... well... helpful!

Try something like this:
for i in `seq 1 20`; do mkdir -p "folder$i"; find . -maxdepth 1 -type f | head -n 2000 | xargs -i mv "{}" "folder$i"; done
Full script version:
#!/bin/bash
dir_size=2000
dir_name="folder"
n=$((`find . -maxdepth 1 -type f | wc -l`/$dir_size+1))
for i in `seq 1 $n`;
do
mkdir -p "$dir_name$i";
find . -maxdepth 1 -type f | head -n $dir_size | xargs -i mv "{}" "$dir_name$i"
done
For dummies:
create a new file: vim split_files.sh
update the dir_size and dir_name values to match your desires
note that the dir_name will have a number appended
navigate into the desired folder: cd my_folder
run the script: sh ../split_files.sh
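One detail worth knowing: computing the folder count as `count / dir_size + 1` yields one folder too many when the file count is an exact multiple of `dir_size` (4,000 files with a size of 2,000 give 3 folders, the last one empty). A minimal sketch of a ceiling division that avoids this; `dirs_needed` is a hypothetical helper name, not part of the script above:

```bash
# dirs_needed COUNT SIZE
# Prints how many directories of SIZE files are needed to hold COUNT files
# (integer ceiling division). Hypothetical helper, for illustration only.
dirs_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}
```

It could replace the calculation of n, for example: n=$(dirs_needed "$(find . -maxdepth 1 -type f | wc -l)" "$dir_size")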

This solution worked for me on MacOS:
i=0; for f in *; do d=dir_$(printf %03d $((i/100+1))); mkdir -p $d; mv "$f" $d; let i++; done
It creates subfolders of 100 elements each.

This solution can handle names with whitespace and wildcards and can be easily extended to support less straightforward tree structures. It will look for files in all direct subdirectories of the working directory and sort them into new subdirectories of those. New directories will be named 0, 1, etc.:
#!/bin/bash
maxfilesperdir=20
# loop through all top level directories:
while IFS= read -r -d $'\0' topleveldir
do
# enter top level subdirectory:
cd "$topleveldir"
declare -i filecount=0 # number of moved files per dir
declare -i dircount=0 # number of subdirs created per top level dir
# loop through all files in that directory and below
while IFS= read -r -d $'\0' filename
do
# whenever file counter is 0, make a new dir:
if [ "$filecount" -eq 0 ]
then
mkdir "$dircount"
fi
# move the file into the current dir:
mv "$filename" "${dircount}/"
filecount+=1
# whenever our file counter reaches its maximum, reset it, and
# increase dir counter:
if [ "$filecount" -ge "$maxfilesperdir" ]
then
dircount+=1
filecount=0
fi
done < <(find . -type f -print0)
# go back to top level:
cd ..
done < <(find . -mindepth 1 -maxdepth 1 -type d -print0)
The find -print0/read combination with process substitution has been stolen from another question.
It should be noted that simple globbing can handle all kinds of strange directory and file names as well. It is however not easily extensible for multiple levels of directories.
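For comparison, here is a rough sketch of the globbing variant mentioned above. It only looks one level deep (files directly inside each top-level directory), which is exactly the limitation noted. The function name `split_subdirs` is made up for this example, and it assumes no existing file is already named 0, 1, etc.:

```bash
# split_subdirs SIZE
# For every direct subdirectory of the current directory, move the regular
# files it contains into numbered subdirectories (0, 1, ...) of at most
# SIZE files each. The body runs in a subshell so its variables stay local.
split_subdirs() (
    size=$1
    for top in ./*/; do
        [ -d "$top" ] || continue            # the glob matched nothing
        filecount=0                          # files moved into the current dir
        dircount=0                           # index of the current dir
        for f in "$top"*; do
            [ -f "$f" ] || continue          # regular files only
            if [ "$filecount" -eq 0 ]; then
                mkdir -p "$top$dircount"
            fi
            mv -- "$f" "$top$dircount/"
            filecount=$((filecount + 1))
            if [ "$filecount" -ge "$size" ]; then
                dircount=$((dircount + 1))
                filecount=0
            fi
        done
    done
)
```

Run it from the parent directory, for example: split_subdirs 20. Hidden files are skipped because the glob does not match names starting with a dot.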

The code below assumes that the filenames do not contain linefeeds, spaces, tabs, single quotes, double quotes, or backslashes, and that filenames do not start with a dash. It also assumes that IFS has not been changed, because it uses while read instead of while IFS= read, and because variables are not quoted. Add setopt shwordsplit in Zsh.
i=1;while read l;do mkdir $i;mv $l $((i++));done< <(ls|xargs -n2000)
The code below assumes that filenames do not contain linefeeds and that they do not start with a dash. -n2000 takes 2000 arguments at a time and {#} is the sequence number of the job. Replace {#} with '{=$_=sprintf("%04d",$job->seq())=}' to pad numbers to four digits.
ls|parallel -n2000 mkdir {#}\;mv {} {#}
The command below assumes that filenames do not contain linefeeds. It uses the implementation of rename by Aristotle Pagaltzis which is the rename formula in Homebrew, where -p is needed to create directories, where --stdin is needed to get paths from STDIN, and where $N is the number of the file. In other implementations you can use $. or ++$::i instead of $N.
ls|rename --stdin -p 's,^,1+int(($N-1)/2000)."/",e'

I would go with something like this:
#!/bin/bash
# outnum generates the name of the output directory
outnum=1
# n is the number of files we have moved
n=0
# Go through all JPG files in the current directory
for f in *.jpg; do
# Create new output directory if first of new batch of 2000
if [ $n -eq 0 ]; then
outdir=folder$outnum
mkdir $outdir
((outnum++))
fi
# Move the file to the new subdirectory
mv "$f" "$outdir"
# Count how many we have moved to there
((n++))
# Start a new output directory if we have sent 2000
[ $n -eq 2000 ] && n=0
done

The answer above is very useful, but there is a very important point on the Mac (10.13.6) terminal. Because the xargs "-i" argument is not available there, I changed the command above to the one below.
ls -1 | sort -n | head -2000| xargs -I '{}' mv {} /folder/
Then I used the shell script below (based on tmp's answer):
#!/bin/bash
dir_size=500
dir_name="folder"
n=$((`find . -maxdepth 1 -type f | wc -l`/$dir_size+1))
for i in `seq 1 $n`;
do
mkdir -p "$dir_name$i";
find . -maxdepth 1 -type f | head -n $dir_size | xargs -I '{}' mv {} "$dir_name$i"
done

This is a tweak of Mark Setchell's
Usage:
bash splitfiles.bash $PWD/directoryoffiles splitsize
It doesn't require the script to be located in the same directory as the files to split, it operates on all files rather than just the .jpg files, and it lets you specify the split size as an argument.
#!/bin/bash
# outnum generates the name of the output directory
outnum=1
# n is the number of files we have moved
n=0
if [ "$#" -ne 2 ]; then
echo Wrong number of args
echo Usage: bash splitfiles.bash $PWD/directoryoffiles splitsize
exit 1
fi
# Go through all files in the specified directory
for f in "$1"/*; do
# Create new output directory if first of new batch
if [ $n -eq 0 ]; then
outdir="$1/$outnum"
mkdir "$outdir"
((outnum++))
fi
# Move the file to the new subdirectory
mv "$f" "$outdir"
# Count how many we have moved to there
((n++))
# Start a new output directory if current new dir is full
[ $n -eq $2 ] && n=0
done

Can be directly run in the terminal
i=0;
for f in *;
do
d=picture_$(printf %03d $((i/2000+1)));
mkdir -p $d;
mv "$f" $d;
let i++;
done
This script moves all files in the current directory into picture_001, picture_002, and so on. Each newly created folder contains 2000 files (the last one may contain fewer).
2000 is the chunk size
%03d controls the zero-padded width of the numeric suffix (currently 001, 002, 003)
picture_ is the folder prefix
The subfolders are created inside the current directory.
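If you want to reuse this with different values, the three knobs can be turned into arguments. This is only a sketch: the function name `chunk_files` and its defaults are assumptions for illustration, and unlike the one-liner it skips anything that is not a regular file (so existing subdirectories are left alone):

```bash
# chunk_files [SIZE] [PREFIX]
# Moves regular files from the current directory into PREFIX001, PREFIX002, ...
# with SIZE files per folder. Defaults: SIZE=2000, PREFIX=picture_.
chunk_files() (
    size=${1:-2000}
    prefix=${2:-picture_}
    i=0
    for f in ./*; do
        [ -f "$f" ] || continue                        # skip directories
        d=$(printf '%s%03d' "$prefix" $((i / size + 1)))
        mkdir -p "$d"
        mv -- "$f" "$d/"
        i=$((i + 1))
    done
)
```

For example, chunk_files 500 batch_ would create batch_001, batch_002, and so on with 500 files each.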

You'll certainly have to write a script for that.
Hints of things to include in your script:
First count the number of files within your source directory
NBFILES=$(find . -type f -name '*.jpg' | wc -l)
Divide this count by 2000 and add 1, to determine the number of directories to create
NBDIR=$(( NBFILES / 2000 + 1 ))
Finally, loop through your files and move them across the subdirs.
You'll have to use two nested loops: one to pick and create the destination directory, the other to move 2000 files into this subdir, then create the next subdir and move the next 2000 files to the new one, etc.
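The two nested loops described above could look roughly like this. It is a sketch under a few assumptions: the function name `split_jpgs` and the folder prefix are made up, only .jpg files directly in the current directory are handled, and the file list is walked with the positional parameters, so the count does not actually need to be computed up front:

```bash
# split_jpgs SIZE
# Moves ./*.jpg into folder1, folder2, ... with at most SIZE files each.
split_jpgs() (
    size=$1
    set -- ./*.jpg                    # file list as positional parameters
    [ -e "$1" ] || exit 0             # no .jpg files: nothing to do
    dir=1
    while [ "$#" -gt 0 ]; do          # outer loop: one pass per destination
        mkdir -p "folder$dir"
        moved=0
        # inner loop: move up to SIZE files into the current destination
        while [ "$#" -gt 0 ] && [ "$moved" -lt "$size" ]; do
            mv -- "$1" "folder$dir/"
            shift
            moved=$((moved + 1))
        done
        dir=$((dir + 1))
    done
)
```

Call it from inside the source directory, for example: split_jpgs 2000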

Related

Looping through sub dirs in large data set, making a new folder with the subdir name, & then hardlinking select files to that new directory

I'm struggling immensely with getting a nested for loop to work for this. The data set I am working with is very large (a little over a million files).
I was looking at a nested for loop but it seems unstable.
count=0
for dir in $(find "$sourceDir" -mindepth 1 -maxdepth 1 -type d)
do
(
mkdir -p "$destDir/$dir"
for file in $(find . -type f)
do
(
if [ $((count % 3)) -eq 2 ]
then
cp -prl "$file" $destDir/$dir
fi
((count ++))
)
done
)
((count++))
done
^^ this is only going into the last directory and finding the 3rd file. I need it to enter every directory and find the third file
I've thought of breaking this up into chunks and running several scripts instead of just one to make it more scalable.
I was able to figure out the answer thanks to the commenters!! My input was a folder with 4 sub folders and within each of those 4 subfolders, there are 12 files.
My ideal output was having every 3rd file (starting with three) hardlinked at an external location sorted within their subdirectories... so something like this -
subdirA (3rdfile hardlink,6thfile hardlink,9thfile hardlink,12thfile hardlink) subdirB (3rdfile hardlink,6thfile hardlink,...)
... and so on!!
Here is what got it to work:
#!/bin/bash
for d in *;
do
echo $d
mkdir Desktop/testjan16/$d
#### loops through each file in the folder and hardlinks every third file (starting w 3) to the appropriate directory
for f in `find ./$d -type f | sort | awk 'NR %3 == 0'`; do ln $f Desktop/testjan16/$d; done
done
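Note that looping over the output of find with backticks breaks on filenames containing spaces. A possible whitespace-safe sketch of the same idea using globs is shown below. It is an assumption-laden variant, not the script that was actually used: the function name `link_every_third` is hypothetical, it only looks at files directly inside each subdirectory, and it relies on glob order rather than sort:

```bash
# link_every_third SRC DEST
# For each subdirectory of SRC, hard-link every third regular file
# (3rd, 6th, 9th, ... in glob order) into DEST/<subdirectory name>.
link_every_third() (
    src=$1
    dest=$2
    for d in "$src"/*/; do
        [ -d "$d" ] || continue          # the glob matched nothing
        name=$(basename "$d")
        mkdir -p "$dest/$name"
        count=0                          # restart counting in each directory
        for f in "$d"*; do
            [ -f "$f" ] || continue
            count=$((count + 1))
            if [ $((count % 3)) -eq 0 ]; then
                ln "$f" "$dest/$name/"
            fi
        done
    done
)
```

Hard links only work within one filesystem, so SRC and DEST need to live on the same one.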

Breaking down a filename into lexicographic based folders

Let's say I have thousands of images in a folder in the format filename_order.jpg.
filename is encoded as a 7-digit integer from 0000000 to 9999999
order is a number between 0 and 9
folder/
6398305_0.jpg
6398305_1.jpg
6398305_2.jpg
...
6399305_0.jpg
Is there an easy way to sort them into evenly distributed folders based on the filenames?
folder/
  6/3/9/
    8/3/0/5/
      6398305_0.jpg
      6398305_1.jpg
      6398305_2.jpg
      ...
    9/3/0/7/
      6399307_0.jpg
Is there a way to do the reverse operation as well: given a nested tree structure, bring it back to a single level?
The goal is being able to store them in S3 in an efficient way for millions of images.
Thank you.
This would do it in pure Bash:
#!/usr/bin/env bash
# extglob needed to expand the number into a series of folder paths
shopt -s extglob
# Starting folder name
folder=folder
# Iterate all *.jpg files in folder
for file in "$folder/"*.jpg; do
# Remove leading directory path from file to get basename
basename="${file##*/}"
# Remove everything after the last _ to get only the numbers
numbers="${basename%_*}"
# Insert / before each number to create a directory path from numbers
# Need Bash extglob
dir="$folder${numbers//?()/\/}"
# Create the directory path
echo mkdir -p "$dir"
# move file to its directory
echo mv "$file" "$dir/"
done
Remove the echo if the output matches your expectations.
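A digit-by-digit directory path can also be built without extglob. A minimal sketch using sed; `nested_dir` is a hypothetical helper name and not part of the script above:

```bash
# nested_dir NUMBER
# Prints NUMBER with a slash after every character.
nested_dir() {
    printf '%s\n' "$1" | sed 's|.|&/|g'
}

nested_dir 6398305   # prints 6/3/9/8/3/0/5/
```

The result can then be used as, for example, mkdir -p "folder/$(nested_dir "$numbers")".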
Nesting a flat folder,
cp -R flat_folder/ nested_folder/
cd nested_folder/
for f in *_[0-9].jpg
do
filename=${f%.*}
extension=${f##*.}
number=${filename%_*}
index=${filename##*_}
folder=$(echo $number | sed 's/\(.\)\(.\)\(.\)\(.\)\(.\)\(.\)\(.\)/\1\/\2\/\3\/\4\/\5\/\6\/\7/')
mkdir -p $folder
mv $f $folder/
done
Flattening a nested folder,
cd nested_folder/
find . -name "*.jpg" -exec cp {} ../flat_folder/ \;

bash: recursively shorten directory name to first 10 characters

I need to recursively rename all subdirectories to the first 10 characters of the original subdirectory name.
For example, the below directory:
/Documents/super-long-folder-name/other-folder-name/
would be renamed to:
/Documents/super-long/other-fold/
I have found a way to rename files to the first 10 characters of the original file name, but now I need to do this for directories.
To recursively rename the file names, I installed the perl rename function: brew install rename and then executed the code below:
find . -path '????????????????????*' -exec rename 's/^(.{10}).*(\..*)$/$1$2/' * {} \;
The above code finds files with file paths greater than 20 characters, then renames the file to the first 10 characters of the original file name.
Now I am trying to find a similar solution that would allow me to do this to directory names.
Thank you in advance for any insight you might have!
Please try the following:
#!/bin/bash
finddir="."
# calculate the max depth
depth=1
while IFS= read -r -d "" dir; do
str="${dir//[^\/]}"
if (( depth < ${#str} )); then
depth=${#str}
fi
done < <(find "$finddir" -type d -links 2 -print0)
# change long dirnames by starting with depth=1 incrementally
for (( i=1; i<=depth; i++ )); do
while IFS= read -r -d "" dir; do
stem=${dir%/*} # parent dir
leaf=${dir##*/} # current target dir
if (( ${#leaf} > 10 )); then
short=${leaf:0:10}
if [[ -d $stem/$short ]]; then
echo "$stem/$short exists. $stem/$leaf unchanged."
else
mv -- "$stem/$leaf" "$stem/$short"
# echo mv "$stem/$leaf" "$stem/$short"
fi
fi
done < <(find "$finddir" -mindepth "$i" -maxdepth "$i" -type d -print0)
done
The difficulty is that pathnames change dynamically during execution, so the pathnames reported by find may differ from the actual (renamed) pathnames.
My approach is:
to calculate the maximum depth in advance.
to rename level by level, starting at depth=1 and going down to the maximum depth calculated above, so that each find call sees the already-renamed parents.
Hope this helps.
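A different technique that sidesteps the changing-pathnames problem is to let find list directories depth-first with -depth, so every directory is renamed after its children and before its parent. The following is only a sketch of that alternative, not the approach of the script above; the function name `shorten_dirs` is made up, and the inner script uses bash for the ${leaf:0:N} substring expansion:

```bash
# shorten_dirs ROOT [MAXLEN]
# Renames every directory below ROOT whose name is longer than MAXLEN
# (default 10) to its first MAXLEN characters, deepest directories first.
shorten_dirs() (
    root=$1
    max=${2:-10}
    find "$root" -mindepth 1 -depth -type d -exec bash -c '
        max=$1
        shift
        for dir in "$@"; do
            parent=${dir%/*}                 # path of the parent directory
            leaf=${dir##*/}                  # name of the directory itself
            [ "${#leaf}" -gt "$max" ] || continue
            short=${leaf:0:$max}
            if [ -e "$parent/$short" ]; then
                echo "$parent/$short exists. $dir unchanged." >&2
            else
                mv -- "$dir" "$parent/$short"
            fi
        done
    ' bash "$max" {} +
)
```

Usage: shorten_dirs . (or shorten_dirs /Documents 10). As with the script above, a directory is left unchanged when the shortened name already exists.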

Bash: how to copy multiple files with same name to multiple folders

I am working on Linux machine.
I have a lot of files named the same, with a directory structure like this:
P45_input_foo/result.dat
P45_input_bar/result.dat
P45_input_tar/result.dat
P45_input_cool/result.dat ...
It is difficult to copy them one by one. I want to copy them into another folder named data, with similar folder names and the same file names:
/data/foo/result.dat
/data/bar/result.dat
/data/tar/result.dat
/data/cool/result.dat ...
Instead of copying them one by one, what should I do?
Using a for loop in bash :
# List every file matching the pattern ./<somedirname>/<any file>.
# To restrict which folders are matched, change the pattern here,
# e.g. 'for f in P45*/*' to match only folders starting with P45.
for f in */*
do
# Strip the filename from the path to keep only the directory,
# e.g. 'P45_input_foo/result.dat' becomes 'P45_input_foo'.
newpath="${f%/*}"
# mkdir -p "/data/${newpath##*_}" creates the new data structure:
# - ${newpath##*_} extracts the text after the last _, in our example 'foo'
# - mkdir -p creates the directory and any missing parents
# - cp "$f" "$_" copies the file to the new directory ($_ is the last argument of the previous command). It does not run if mkdir returns an error
mkdir -p "/data/${newpath##*_}" && cp "$f" "$_"
done
The ${newpath##*_} and ${f%/*} expansions are part of Bash's string manipulation features (parameter expansion).
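To see what the two expansions do in isolation, here is a small sketch that only computes the destination directory for a given path; `target_dir` is a hypothetical helper name introduced for the example:

```bash
# target_dir PATH
# Prints the destination directory for PATH, e.g.
# P45_input_foo/result.dat -> /data/foo
target_dir() {
    dir=${1%/*}                         # drop the shortest trailing "/..." part
    printf '/data/%s\n' "${dir##*_}"    # drop everything up to the last "_"
}

target_dir P45_input_foo/result.dat     # prints /data/foo
```

A single % or # removes the shortest match, while %% or ## removes the longest one.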
You will need to extract the third "_"-separated field:
P45_input_foo --> foo
create the directory (if needed) and copy the file to it. Something like this (not tested, might need editing):
STARTING_DIR="/"
cd "$STARTING_DIR" || exit 1
VAR=$(ls -1)
while read -r DIR; do
# third "_"-separated field, e.g. P45_input_foo -> foo
TARGET_DIR=$(echo "$DIR" | cut -d'_' -f3)
NEW_DIR="/data/$TARGET_DIR"
if [ ! -d "$NEW_DIR" ]; then
mkdir -p "$NEW_DIR"
fi
cp "$DIR/result.dat" "$NEW_DIR/result.dat"
if [ $? -ne 0 ]; then
echo "ERROR: encountered an error while copying"
fi
done <<<"$VAR"
Explanation: this assumes all the paths you've mentioned are under root / (if not, change STARTING_DIR accordingly). ls produces the list of directories, which is stored in VAR. The content of VAR is then passed to the while loop.
With a bit of find and a few bash tricks, the script below could do the trick for you. Remember to first do a dry run with echo in front of the mv, and check that "/data/$folder/" is the actual path you want to move the file(s) to.
#!/bin/bash
while IFS= read -r -d '' file
do
fileNew="${file%/*}" # Everything before the last '/'
fileNew="${fileNew#*/}" # Everything after the first '/'
IFS="_" read -r _ _ folder <<<"$fileNew"
mv -v "$file" "/data/$folder/"
done < <(find . -type f -name "result.dat" -print0)

command line find first file in a directory

My directory structure is as follows
Directory1\file1.jpg
\file2.jpg
\file3.jpg
Directory2\anotherfile1.jpg
\anotherfile2.jpg
\anotherfile3.jpg
Directory3\yetanotherfile1.jpg
\yetanotherfile2.jpg
\yetanotherfile3.jpg
I'm trying to use the command line in a bash shell on ubuntu to take the first file from each directory and rename it to the directory name and move it up one level so it sits alongside the directory.
In the above example:
file1.jpg would be renamed to Directory1.jpg and placed alongside the folder Directory1
anotherfile1.jpg would be renamed to Directory2.jpg and placed alongside the folder Directory2
yetanotherfile1.jpg would be renamed to Directory3.jpg and placed alongside the folder Directory3
I've tried using:
find . -name "*.jpg"
but it does not list the files in sequential order (I need the first file).
This line:
find . -name "*.jpg" -type f -exec ls "{}" +;
lists the files in the correct order but how do I pick just the first file in each directory and move it up one level?
Any help would be appreciated!
Edit: When I refer to the first file what I mean is each jpg is numbered from 0 to however many files in that folder - for example: file1, file2...... file34, file35 etc... Another thing to mention is the format of the files is random, so the numbering might start at 0 or 1a or 1b etc...
You can go inside each dir and run:
$ mv "$(ls | head -n 1)" ..
If first means whatever the shell glob finds first (lexical, but probably affected by LC_COLLATE), then this should work:
for dir in */; do
for file in "$dir"*.jpg; do
echo mv "$file" "${file%/*}.jpg" # If it does what you want, remove the echo
break 1
done
done
Proof of concept:
$ mkdir dir{1,2,3} && touch dir{1,2,3}/file{1,2,3}.jpg
$ for dir in */; do for file in "$dir"*.jpg; do echo mv "$file" "${file%/*}.jpg"; break 1; done; done
mv dir1/file1.jpg dir1.jpg
mv dir2/file1.jpg dir2.jpg
mv dir3/file1.jpg dir3.jpg
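Since the files are numbered, plain glob order would put file10.jpg before file2.jpg. If that matters, one option is a version sort. A sketch, assuming a sort that supports -V (GNU coreutils does) and filenames without newlines; `first_by_version` is a hypothetical helper name:

```bash
# first_by_version DIR
# Prints the .jpg in DIR that comes first in version order
# (file2.jpg sorts before file10.jpg).
first_by_version() {
    for f in "$1"/*.jpg; do
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done | sort -V | head -n 1
}
```

Inside the loop over directories it could be used as: file=$(first_by_version "${dir%/}")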
Look for all first-level directories, identify the first file in each directory, and then move it one level up:
find . -type d \! -name . -prune | while read -r d; do
f=$(ls "$d" | head -1)
mv "$d/$f" .
done
Building on the top answer, here is a general use bash function that simply returns the first path that resolves to a file within the given directory:
getFirstFile() {
for dir in "$1"; do
for file in "$dir"*; do
if [ -f "$file" ]; then
echo "$file"
break 1
fi
done
done
}
Usage:
# don't forget the trailing slash
getFirstFile ~/documents/
NOTE: it will silently return nothing if you pass it an invalid path.
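If the silent behaviour is a problem, a variant can report failure through its exit status instead. This is a sketch rather than a drop-in replacement; the name `first_file` is made up, and it accepts the directory with or without a trailing slash:

```bash
# first_file DIR
# Prints the first regular file in DIR (glob order) and returns 0.
# Returns 1 if DIR is not a directory or contains no regular file.
first_file() {
    [ -d "$1" ] || return 1
    for file in "${1%/}"/*; do
        if [ -f "$file" ]; then
            printf '%s\n' "$file"
            return 0
        fi
    done
    return 1
}
```

Usage: first_file ~/documents || echo "no file found" >&2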
