Organizing Files In Directories with Terminal - bash

So I am wondering if there is any way to organize a directory on a mac with the terminal. I am a beginner with using the terminal and just seeing if this is possible.
I have a script that will scrape various pages and save certain data to a file (data irrelevant), such as this picture.
[Image: directory that needs organizing]
I would like to know if I can write something that will read the file names and create directories that correspond. For example, it runs a loop that will read all files with "Year2014", create a folder named "Year2014", then place the files inside.
If you have any other questions, let me know!

The short answer is "Yes", and the longer answer is there are many ways to do it. Since you are using bash (or any POSIX shell), you have parameter expansion with substring removal available to help you trim text from the end of each filename to isolate the "YearXXXX" part of the filename that you can then use to (1) create the directory, and (2) move the file into the newly created directory.
Presuming Filenames Formatted WeekXXYearXXXX.txt
Take for example a simple for loop where the loop variable f will contain each filename in turn. You can isolate the "WeekXX" part of the name by using a parameter expansion that trims from the right of the string through 'Y', leaving whatever "WeekXX" is (save the result in a temporary variable). You can then use that temporary variable to remove the "WeekXX" text from the original filename, leaving "YearXXXX.txt". You then simply remove ".txt" from that result to arrive at the name of the directory to put the file in.
Scriptwise it would look like:
for f in *.txt; do              ## loop over .txt files using variable $f
    tmp="${f%%Y*}"              ## remove through 'Y' from right
    dname="${f#"$tmp"}"         ## remove contents of tmp from left
    dname="${dname%.txt}"       ## remove .txt
    mkdir -p "$dname"           ## create dname (no error if exists)
    mv "$f" "$dname"            ## move $f to $dname
done
Where the temporary variable used is tmp and the final directory name is stored in the variable dname.
(note: you may want to use mv -i if you want mv to prompt before overwriting if the filename already exists in the target directory)
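As a rough sketch of the same idea, the loop can be wrapped in two small functions so that it only touches files whose names actually contain "Year" and uses mv -i as suggested in the note. The names year_dir_for and organize_by_year are made up for this illustration, and the sketch trims on the literal text "Year" rather than the single letter 'Y', which is an assumption about how the files are named.

```shell
#!/bin/bash
# Hypothetical helper: print the "YearXXXX" directory name for a file
# named like WeekXXYearXXXX.txt; fail if the name has no "Year" part.
year_dir_for() {
    local f=$1 tmp dname
    case $f in
        *Year*.txt) ;;
        *) return 1 ;;
    esac
    tmp=${f%%Year*}            # everything before "Year", e.g. Week03
    dname=${f#"$tmp"}          # strip that prefix -> Year2014.txt
    printf '%s\n' "${dname%.txt}"
}

# Hypothetical helper: sort every matching .txt file in directory $1
# into a sub-directory named after its year.
organize_by_year() {
    local dir=$1 f dname
    for f in "$dir"/*.txt; do
        [ -e "$f" ] || continue                      # glob matched nothing
        dname=$(year_dir_for "${f##*/}") || continue # skip other names
        mkdir -p "$dir/$dname"
        mv -i "$f" "$dir/$dname/"                    # prompt before overwrite
    done
}
```

Running organize_by_year . would then process the current directory and leave files without a "Year" part where they are.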
You can refer to man bash under the Parameter Expansion heading to read the specifics of each expansion which (among many more) are described as:
${var#pattern} Strip shortest match of pattern from front of $var
${var##pattern} Strip longest match of pattern from front of $var
${var%pattern} Strip shortest match of pattern from back of $var
${var%%pattern} Strip longest match of pattern from back of $var
Note that this set of parameter expansions is POSIX, so it will work with any POSIX shell, while most of the remaining expansions are bashisms (bash-only).
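To see the difference between the shortest and longest forms, here is a small illustration on a made-up filename with two dots (archive.tar.gz is just an example value, not one of your files):

```shell
f="archive.tar.gz"             # hypothetical sample name

front_short=${f#*.}            # strip shortest "*." from front -> tar.gz
front_long=${f##*.}            # strip longest  "*." from front -> gz
back_short=${f%.*}             # strip shortest ".*" from back  -> archive.tar
back_long=${f%%.*}             # strip longest  ".*" from back  -> archive

printf '%s\n' "$front_short" "$front_long" "$back_short" "$back_long"
```

With only one 'Y' in a name like WeekXXYearXXXX.txt, the % and %% forms give the same result, which is why either would work in the loop above.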
Let me know if you have further questions.


BASH Shell Find Multiple Files with Wildcard and Perform Loop with Action

I have a script that I call with an application, I can't run it from command line. I derive the directory where the script is called and in the next variable go up 1 level where my files are stored. From there I have 3 variables with the full path and file names (with wildcard), which I will refer to as "masks".
I need to find and "do something with" (copy/write their names to a new file, whatever else) to each of these masks. The do something part isn't my obstacle as I've done this fine when I'm working with a single mask, but I would like to do it cleanly in a single loop instead of duplicating loop and just referencing each mask separately if possible.
Assume in my $FILESFOLDER directory below that I have 2 existing files, aaa0.csv & bbb0.csv, but no file matching the ccc*.csv mask.
#!/bin/bash
SCRIPTFOLDER=${0%/*}
FILESFOLDER="$(dirname "$SCRIPTFOLDER")"
ARCHIVEFOLDER="$FILESFOLDER"/archive
LOGFILE="$SCRIPTFOLDER"/log.txt
FILES1="$FILESFOLDER"/"aaa*.csv"
FILES2="$FILESFOLDER"/"bbb*.csv"
FILES3="$FILESFOLDER"/"ccc*.csv"
ALLFILES="$FILES1
$FILES2
$FILES3"
#here as an example I would like to do a loop through $ALLFILES and copy anything that matches to $ARCHIVEFOLDER.
for f in $ALLFILES; do
cp -v "$f" "$ARCHIVEFOLDER" > "$LOGFILE"
done
echo "$ALLFILES" >> "$LOGFILE"
The thing that really spins my head is when I run something like this (I haven't done it with the copy command in place) that log file at the end shows:
filesfolder/aaa0.csv filesfolder/bbb0.csv filesfolder/ccc*.csv
Where I would expect echoing $ALLFILES just to show me the masks
filesfolder/aaa*.csv filesfolder/bbb*.csv filesfolder/ccc*.csv
In my "do something" area, I need to be able to use whatever method to find the files by their full path/name with the wildcard if at all possible. Sometimes my network is down for maintenance and I don't want to risk failing a change directory. I rarely work in linux (primarily SQL background) so feel free to poke holes in everything I've done wrong. Thanks in advance!
Here's a light refactoring with significantly fewer distracting variables.
#!/bin/bash
script=${0%/*}
folder="$(dirname "$script")"
archive="$folder"/archive
log="$folder"/log.txt # you would certainly want this in the folder, not $script/log.txt
shopt -s nullglob
all=()
for prefix in aaa bbb ccc; do
cp -v "$folder/$prefix"*.csv "$archive" >>"$log" # append, don't overwrite
all+=("$folder/$prefix"*.csv)
done
echo "${all[@]}" >> "$log"
The change in the loop to append the output of cp -v instead of overwriting it is a bug fix; otherwise the log would only contain the output from the last loop iteration.
I would probably prefer to have the files echoed from inside the loop as well, one per line, instead of collecting them all on one humongous line. Then you can remove the array all and instead simply
printf '%s\n' "$folder/$prefix"*.csv >>"$log"
shopt -s nullglob is a Bash extension (so it won't work with sh) which says to discard any wildcard which doesn't match any files (the default behavior is to leave globs unexpanded if they don't match anything). If you want a different solution, perhaps see Test whether a glob has any matches in Bash.
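A quick way to convince yourself of the difference is a throwaway experiment like the following. It is only a sketch: the temporary directory and the array names without, with and matched are invented for the demonstration.

```shell
#!/bin/bash
tmpdir=$(mktemp -d)            # scratch directory for the experiment
touch "$tmpdir/aaa0.csv"       # one file that matches aaa*.csv only

shopt -u nullglob
without=("$tmpdir"/ccc*.csv)   # no match: the pattern is kept literally

shopt -s nullglob
with=("$tmpdir"/ccc*.csv)      # no match: expands to nothing
matched=("$tmpdir"/aaa*.csv)   # a real match is unaffected by nullglob
shopt -u nullglob

echo "${#without[@]} ${#with[@]} ${#matched[@]}"   # prints: 1 0 1
rm -rf "$tmpdir"
```

The literal, unexpanded pattern in the first case is what showed up as filesfolder/ccc*.csv in your log.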
You should use lower case for your private variables so I changed that, too. Notice also how the script variable doesn't actually contain a folder name (or "directory" as we adults prefer to call it); fixing that uncovered a bug in your attempt.
If your wildcards are more complex, you might want to create an array for each pattern.
tmpspaces=(/tmp/*\ *)
homequest=("$HOME"/*\?*)
for file in "${tmpspaces[@]}" "${homequest[@]}"; do
    : stuff with "$file", with proper quoting
done
The only robust way to handle file names which could contain shell metacharacters is to use an array variable; using string variables for file names is notoriously brittle.
Perhaps see also https://mywiki.wooledge.org/BashFAQ/020
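To illustrate why a string variable is brittle, here is a minimal sketch that counts how many arguments a command would receive. The helper count_args and the sample name are hypothetical, and the default IFS is assumed.

```shell
#!/bin/bash
# Hypothetical helper: report how many arguments it was called with.
count_args() { echo "$#"; }

name='my file.csv'                    # one file name containing a space

# Unquoted string variable: the shell splits it on the space.
n_unquoted=$(count_args $name)        # 2

# Array: each element stays one argument when expanded with "${arr[@]}".
files=('my file.csv')
n_array=$(count_args "${files[@]}")   # 1

echo "unquoted string: $n_unquoted, array: $n_array"
```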

How to obtain the full PATH, *allowing* for symbolic links

I have written bash scripts that accept a directory name as an argument. A single dot ('.') is a valid directory name, but I sometimes need to know where '.' is. The readlink and realpath commands provide a resolved path, which does not help because I need to allow for symbolic links.
For example, the resolved path to the given directory might be something like /mnt/vol_01/and/then/some, whereas the script is called with '.' where '.' is /app/then/some (a sym link which would resolve to the first path I gave).
What I have done to solve my problem is use cd and pwd in combination to provide the full path I want, and it seems to have worked OK so far.
A simplified example of a script:
DEST_DIR=$1
# Convert the given destination directory to a full path, ALLOWING
# for symbolic links. This is necessary in cases where '.' is
# given as the destination directory.
DEST_DIR=$(cd $DEST_DIR && pwd -L)
# Do stuff in $DEST_DIR
My question is: is my use of cd and pwd the best way to get what I want? Or is there a better way?
If all you want to do is to make an absolute path that has minimal changes from a relative path then a simple, safe, and fast way to do it is:
[[ $dest_dir == /* ]] || dest_dir=$PWD/$dest_dir
(See Correct Bash and shell script variable capitalization for an explanation of why dest_dir is preferable to DEST_DIR.)
The code above will work even if the directory doesn't exist (yet) or if it's not possible to cd to it (e.g. because its permissions don't allow it). It may produce paths with redundant '.' components, '..' components, and redundant slashes ('/a//b', '//a/b/', ...).
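Wrapped in a function, the one-liner might look like the sketch below. The function name make_absolute is invented here, and the outputs mentioned in the comments follow directly from the rule "prefix $PWD unless the path already starts with /".

```shell
#!/bin/bash
# Hypothetical wrapper around the one-liner: print an absolute version
# of $1 without touching the filesystem or resolving symlinks.
make_absolute() {
    local p=$1
    [[ $p == /* ]] || p=$PWD/$p
    printf '%s\n' "$p"
}

make_absolute /etc      # already absolute: printed unchanged
make_absolute notes     # relative: becomes $PWD/notes
make_absolute .         # becomes $PWD/. (the redundant '.' is kept)
```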
If you want a minimally cleaned path (leaving symlinks unresolved), then a modified version of your original code may be a reasonable option:
dest_dir=$(cd -- "$dest_dir"/ && pwd)
The -- is necessary to handle directory names that begin with '-'.
The quotes in "$dest_dir" are necessary to handle names that contain whitespace (actually $IFS characters) or glob characters.
The trailing slash on "$dest_dir"/ is necessary to handle a directory whose relative name is simply -.
Plain pwd is sufficient because it behaves as if -L was specified by default.
Note that the code will set dest_dir to the empty string if the cd fails. You probably want to check for that before doing anything else with the variable.
Note also that $(cd ...) will create a subshell with Bash. That's good in one way because there's no need to cd back to the starting directory afterwards (which may not be possible), but it could cause a performance problem if you do it a lot (e.g. in a loop).
Finally, note that the code won't work if the directory name contains one or more trailing newlines (e.g. as created by mkdir $'dir\n'). It's possible to fix the problem (in case you really care about it), but it's messy. See How to avoid bash command substitution to remove the newline character? and shell: keep trailing newlines ('\n') in command substitution. One possible way to do it is:
dest_dir=$(cd -- "$dest_dir"/ && printf '%s.' "$PWD") # Add a trailing '.'
dest_dir=${dest_dir%.} # Remove the trailing '.'
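The check for a failed cd mentioned above could be folded into a small function, roughly as follows. This is a sketch only: resolve_dir is a made-up name, and the guard against an empty argument is my addition (with an empty value, "$dest_dir"/ would become "/" and the cd would quietly succeed).

```shell
#!/bin/bash
# Hypothetical helper: print the absolute logical path of directory $1
# (symlinks left unresolved), or return non-zero if it cannot be entered.
resolve_dir() {
    local d
    [ -n "$1" ] || return 1                       # "" would turn into "/"
    d=$(cd -- "$1"/ 2>/dev/null && pwd) || return 1
    printf '%s\n' "$d"
}

if dest_dir=$(resolve_dir "${1:-.}"); then
    echo "Working in: $dest_dir"
else
    echo >&2 "Cannot use directory: ${1:-.}"
fi
```

Like the original, this does not handle directory names that end in a newline.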

How do I move files into folders with similar names in Unix?

I'm sorry if this question has been asked before, I just didn't know how to word it as a search query.
I have a set of folders that look like this:
Brain - Amygdala/ Brain - Spinal cord (cervical c-1)/ Skin - Sun Exposed (Lower leg)/
Brain - Caudate (basal ganglia)/ Lung/ Whole Blood/
I also have a set of files that look like this:
Brain_Amygdala.v7.covariates_output.txt Skin_Not_Sun_Exposed_Suprapubic.v7.covariates_output.txt
Brain_Caudate_basal_ganglia.v7.covariates_output.txt Skin_Sun_Exposed_Lower_leg.v7.covariates_output.txt
Brain_Spinal_cord_cervical_c-1.v7.covariates_output.txt Whole_Blood.v7.covariates_output.txt
As you can see, the files do not perfectly match up with the directories in their names. For example, Brain_Amygdala.v7.covariates_output.txt is not totally identical to Brain - Amygdala/. Even if we were to excise the tissue name from the covariates file, Brain_Amygdala is formatted differently from its corresponding folder.
Same with Whole Blood/. It is different from Whole_Blood.v7.covariates_output.txt, even if you were to isolate the tissue name from the covariates file Whole_Blood.
What I want to do, however, is to move each of these tissue files to their corresponding folder. If you notice, the covariate files are named after the tissue leading up to the first dot . in the file name. They are separated by underscores _. How I was thinking about approaching this was to break up the first few words leading up to the first . of the file name so that I can easily move it to its corresponding folder.
e.g.
Brain_Amygdala.v7.covariates_output.txt -> Brain*Amygdala [mv]-> Brain*Amygdala/
a) I'm not sure how to isolate the first words of a file name leading up to the first . in a filename
b) if I were to do that, I don't know how to insert a wildcard in between each word and match that to the corresponding folder.
However, I am completely open to other ways of doing something like this.
Not a full answer, but it should address some of your concerns:
a) to isolate the first word of a string, leading up to the first .: use Parameter Expansions
string=Brain_Amygdala.v7.covariates_output.txt
until_dot=${string%%.*}
echo "$until_dot"
will output Brain_Amygdala (which we saved in the variable until_dot).
b) You may want to use the ${parameter/pattern/string} parameter expansion:
# Replace all non-alphabetic characters by the glob *
glob_pattern=${until_dot//[^[:alpha:]]/*}
echo "$glob_pattern"
will output (with the same variables as above) Brain*Amygdala
c) To use all of this: it's probably a good idea to determine the possible targets first, and do some basic checks:
# Use nullglob to have non matching glob expand to nothing
shopt -s nullglob
# DO NOT USE QUOTES IN THE FOLLOWING EXPANSION:
# the variable is actually a glob!
# Could also do dirs=( $glob_pattern*/ ) to check if directory
dirs=( $glob_pattern/ )
# Now check how many matches there are:
if ((${#dirs[@]} == 0)); then
    echo >&2 "No matches for $glob_pattern"
elif ((${#dirs[@]} > 1)); then
    echo >&2 "More than one match for $glob_pattern: ${dirs[@]}"
else
    echo "All good!"
    # Remove the echo to actually perform the move
    echo mv "$string" "${dirs[0]}"
fi
I don't know how your data will effectively conform to these, but I hope this answer actually answers some of your questions! (and to learn more about parameter expansions, do read — and experiment with — the link to the reference I gave you).
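Putting a), b) and c) together, one possible shape for the whole thing is sketched below. The function name move_to_matching_dir and its two arguments are invented for the example, and it uses the $glob_pattern*/ variant from the comment above so that directory names ending in something like ")" still match. Whether the patterns are unambiguous for all of your real directories is something you would need to check.

```shell
#!/bin/bash
shopt -s nullglob    # non-matching globs expand to nothing

# Hypothetical helper: move file $2 into the single directory under $1
# whose name matches the words before the file name's first dot.
move_to_matching_dir() {
    local root=$1 file=$2 name until_dot glob_pattern dirs
    name=${file##*/}                           # drop any leading path
    until_dot=${name%%.*}                      # e.g. Brain_Amygdala
    glob_pattern=${until_dot//[^[:alpha:]]/*}  # e.g. Brain*Amygdala
    # Deliberately unquoted: $glob_pattern must be expanded as a glob.
    dirs=( "$root"/$glob_pattern*/ )
    if (( ${#dirs[@]} != 1 )); then
        echo >&2 "Expected one match for $glob_pattern, found ${#dirs[@]}"
        return 1
    fi
    mv -- "$file" "${dirs[0]}"
}

# Possible use, from the directory holding both files and folders:
# for f in *.covariates_output.txt; do move_to_matching_dir . "$f"; done
```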

Removing an optional / (directory separator) in Bash

I have a Bash script that takes in a directory as a parameter, and after some processing will do some output based on the files in that directory.
The command would be like the following, where dir is a directory with the following structure inside
dir/foo
dir/bob
dir/haha
dir/bar
dir/sub-dir
dir/sub-dir/joe
> myscript ~/files/stuff/dir
After some processing, I'd like the output to be something like this
foo
bar
sub-dir/joe
The code I have to remove the path passed in is the following:
shopt -s extglob
for file in $files ; do
filename=${file#${1}?(/)}
This gets me to the following, but for some reason the optional / is not being taken care of. Thus, my output looks like this:
/foo
/bar
/sub-dir/joe
The reason I'm making it optional is because if the user runs the command
> myscript ~/files/stuff/dir/
I want it to still work. And, as it stands, if I run that command with the trailing slash, it outputs as desired.
So, why does my ?(/) not work? Based on everything I've read, that should be the right syntax, and I've tried a few other variations as well, all to no avail.
Thanks.
that other guy's helpful answer solves your immediate problem, but there are two things worth noting:
enumerating filenames with an unquoted string variable (for file in $files) is ill-advised, as sjsam's helpful answer points out: it will break with filenames with embedded spaces and filenames that look like globs; as stated, storing filenames in an array is the robust choice.
there is no strict need to change global shell option shopt -s extglob: parameter expansions can be nested, so the following would work without changing shell options:
# Sample values:
file='dir/sub-dir/joe'
set -- 'dir/' # set $1; value 'dir' would have the same effect.
filename=${file#${1%/}} # -> '/sub-dir/joe'
The inner parameter expansion, ${1%/}, removes a trailing (%) / from $1, if any.
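If the separator itself should go as well, one option (a sketch, not necessarily what the snippet above intended) is to put a literal / after the inner expansion, so the pattern is always the directory without a trailing slash followed by exactly one slash. The function name strip_dir_prefix is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print path $1 with directory prefix $2 removed,
# whether or not $2 was given with a trailing slash.
strip_dir_prefix() {
    local file=$1 dir=$2
    printf '%s\n' "${file#"${dir%/}"/}"
}

strip_dir_prefix 'dir/sub-dir/joe' 'dir/'   # -> sub-dir/joe
strip_dir_prefix 'dir/sub-dir/joe' 'dir'    # -> sub-dir/joe
```

Quoting the inner expansion makes any glob characters in the directory name match literally.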
I suggested you change files to an array which is a possible workaround for non-standard filenames that may contain spaces.
files=("dir/A/B" "dir/B" "dir/C")
for filename in "${files[@]}"
do
    echo "${filename##dir/}" # replace dir/ with your param.
done
Output
A/B
B
C
Here's the documentation from man bash under "Parameter Expansion":
${parameter#word}
${parameter##word}
Remove matching prefix pattern. The word is
expanded to produce a pattern just as in pathname
expansion. If the pattern matches the beginning of
the value of parameter, then the result of the
expansion is the expanded value of parameter with
the shortest matching pattern (the ``#'' case) or
the longest matching pattern (the ``##'' case)
deleted.
Since # tries to delete the shortest match, it will never include any trailing optional parts.
You can just use ## instead:
filename=${file##${1}?(/)}
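The two forms can be compared side by side with sample values (the variable names prefix, shortest and longest are just for this illustration):

```shell
#!/bin/bash
shopt -s extglob                    # needed for the ?(...) pattern

file='dir/sub-dir/joe'
prefix='dir'                        # no trailing slash, as in the question

shortest=${file#${prefix}?(/)}      # shortest match is "dir"  -> /sub-dir/joe
longest=${file##${prefix}?(/)}      # longest match is "dir/"  -> sub-dir/joe

printf '%s\n' "$shortest" "$longest"
```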
Depending on what your script does and how it works, you can also just rewrite it to cd to the directory to always work with paths relative to .

Modify text file based on file's name, repeat for all files in folder

I have a folder with several files named : something_1001.txt; something_1002.txt; something_1003.txt; etc.
Inside the files there is some text. Of course each file has a different text but the structure is always the same: some lines identified with the string ">TEXT", which are the ones I am interested in.
So my goal is :
for each file in the folder, read the file's name and extract the number between "_" and ".txt"
modify all the lines in this particular file that contain the string ">TEXT" in order to make it ">{NUMBER}_TEXT"
For example : file "something_1001.txt"; change all the lines containing ">TEXT" by ">1001_TEXT"; move on to file "something_1002.txt" change all the lines containing ">TEXT" by ">1002_TEXT"; etc.
Here is the code I wrote so far :
for i in /folder/*.txt
NAME=`echo $i | grep -oP '(?<=something_/).*(?=\.txt)'`
do
sed -i -e 's/>TEXT/>${NAME}_TEXT/g' /folder/something_${NAME}.txt
done
I created a small bash script to run the code but it's not working. There seems to be syntax errors and a loop error, but I can't figure out where.
Any help would be most welcome !
There are two problems here. One is that your loop syntax is wrong; the other is that you are using single quotes around the sed script, which prevents the shell from interpolating your variable.
The grep can be avoided, anyway; the shell has good built-in facilities for extracting the base name of a file.
for i in /folder/*.txt
do
base=${i#/folder/something_}
sed -i -e "s/>TEXT/>${base%.txt}_TEXT/" "$i"
done
The shell's ${var#prefix} and ${var%suffix} variable manipulation facility produces the value of $var with the prefix and suffix trimmed off, respectively.
As an aside, avoid uppercase variable names, because those are reserved for system use, and take care to double-quote any variable whose contents may include shell metacharacters.
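For experimenting on a single file, the same expansions can be packaged roughly like this. It is a sketch under a few assumptions: tag_file is an invented name, the number is taken to be whatever sits between the last "_" and ".txt", and a temporary file is used instead of sed -i because the in-place flag is handled differently by GNU and BSD/macOS sed.

```shell
#!/bin/bash
# Hypothetical helper: rewrite ">TEXT" as ">NUMBER_TEXT" in file $1,
# where NUMBER comes from a name like something_1001.txt.
tag_file() {
    local file=$1 base num tmp
    base=${file##*/}          # something_1001.txt
    num=${base##*_}           # 1001.txt
    num=${num%.txt}           # 1001
    tmp=$(mktemp) || return 1
    if sed "s/>TEXT/>${num}_TEXT/" "$file" >"$tmp"; then
        cat "$tmp" >"$file"   # overwrite contents, keep permissions
    fi
    rm -f "$tmp"
}

# Possible use: for i in /folder/*.txt; do tag_file "$i"; done
```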
