Recursively concatenating (joining) and renaming text files in a directory tree - macos

I am using Mac OS X Lion.
I have a folder: LITERATURE with the following structure:
LITERATURE > Y > YATES, DORNFORD > THE BROTHER OF DAPHNE:
Chapters 01-05.txt
Chapters 06-10.txt
Chapters 11-end.txt
I want to recursively concatenate the chapters that are split into multiple files (not all are). Then, I want to write the concatenated file to its parent's parent directory. The name of the concatenated file should be the same as the name of its parent directory.
For example, after running the script (in the folder structure shown above) I should get the following.
LITERATURE > Y > YATES, DORNFORD:
THE BROTHER OF DAPHNE.txt
THE BROTHER OF DAPHNE:
Chapters 01-05.txt
Chapters 06-10.txt
Chapters 11-end.txt
In this example, the parent directory is THE BROTHER OF DAPHNE and the parent's parent directory is YATES, DORNFORD.
[Updated March 6th—Rephrased the question/answer so that the question/answer is easy to find and understand.]

It's not clear what you mean by "recursively" but this should be enough to get you started.
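#!/bin/bash
# Requires Bash 4 or later for the ${var,,} and ${var^} case conversions.
titlecase () {  # adapted from http://stackoverflow.com/a/6969886/874188
    local arr
    arr=("${@,,}")       # lowercase every word
    echo "${arr[@]^}"    # capitalize the first letter of each word
}
for book in LITERATURE/?/*/*; do
    title=$(titlecase ${book##*/})   # unquoted on purpose, so the title splits into words
    for file in "$book"/*.txt; do    # *.txt keeps the output file itself out of the loop
        cat "$file"
        echo
    done >"$book/$title"
    echo '# not doing this:' rm "$book"/*.txt
done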
This loops over LITERATURE/initial/author/BOOK TITLE and creates a file Book Title (where should a space be added?) from the catenated files in each book directory. (I would generate it in the parent directory and then remove the book directory completely, assuming it contains nothing of value any longer.) There is no recursion, just a loop over this directory structure.
Removing the chapter files is a bit risky so I'm not doing it here. You could remove the echo prefix from the line after the first done to enable it.
If you have book names which contain an asterisk or some other shell metacharacter this will be rather more complex -- the title assignment assumes you can use the book title unquoted.
Only the parameter expansion with case conversion is beyond the very basics of Bash. The array operations could perhaps also be a bit scary if you are a complete beginner. Proper understanding of quoting is also often a challenge for newcomers.
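As a rough sketch of the variant mentioned above (writing the joined file next to the book directory instead of inside it), the loop below keeps the directory name unchanged, so it needs no case conversion and should also run on older Bash versions. The helper name join_books and the throwaway demo tree are assumptions made for illustration; only the LITERATURE/initial/author/BOOK TITLE layout comes from the question.

```shell
#!/bin/bash
# Sketch: join each book's *.txt files into "<author dir>/<BOOK TITLE>.txt".
# join_books is a hypothetical helper; the layout root/initial/author/BOOK is assumed.
join_books () {
    root=$1
    for book in "$root"/?/*/*/; do
        book=${book%/}                 # strip the trailing slash left by the glob
        [ -d "$book" ] || continue     # the glob matched nothing
        set -- "$book"/*.txt
        [ -e "$1" ] || continue        # no .txt files in this book directory
        cat "$@" > "$(dirname "$book")/${book##*/}.txt"
    done
}

# Demo on a throwaway tree that mirrors the example in the question.
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
sample="$demo/LITERATURE/Y/YATES, DORNFORD/THE BROTHER OF DAPHNE"
mkdir -p "$sample"
printf 'one\n'   > "$sample/Chapters 01-05.txt"
printf 'two\n'   > "$sample/Chapters 06-10.txt"
printf 'three\n' > "$sample/Chapters 11-end.txt"

join_books "$demo/LITERATURE"
cat "$sample.txt"
```

The chapter files are joined in glob (name-sorted) order, so this only gives the right reading order when the file names sort that way.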

cat Chapters*.txt > FinaleFile.txt.raw
Chapters="$( ls -1 Chapters*.txt | sed -n 'H;${x;s/\
//g;s/ *Chapters //g;s/\.txt/ /g;s/ *$//p;}' )"
mv FinaleFile.txt.raw "FinaleFile ${Chapters}.txt"
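cat concatenates all the chapter .txt files at once (this assumes the name-sorted order is the reading order).
ls and sed collect the chapter ranges from the file names, stripping the Chapters prefix and the .txt extension.
mv renames the concatenated file so that its name includes those chapter ranges.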

Shell doesn't like white space in names. However, over the years, Unix has come up with some tricks that'll help:
$ find . -name "Chapters*.txt" -type f -print0 | xargs -0 cat >> final_file.txt
Might do what you want.
find recursively finds all of the directory entries in a file tree that match the query (in this case, the type must be a file, and the name must match the pattern Chapters*.txt).
Normally, find separates the directory entry names with NL, but -print0 tells it to separate them with the NUL character instead. NL is a valid character in a file name, but NUL isn't.
The xargs command takes the output of find and processes it. xargs gathers all the names and passes them in bulk to the command you give it, in this case cat.
Normally, xargs splits its input on white space, which means Chapters would be one file and 01-05.txt would be another. However, -0 tells xargs to use NUL as the separator, which is exactly what -print0 emits.
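To see the difference described above, here is a small sketch that runs the same pipeline with and without the NUL separators. The temporary directory and the file contents are made up for the demonstration.

```shell
#!/bin/bash
# Demo: names with spaces survive the find | xargs hand-off only with -print0 / -0.
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
mkdir -p "$demo/THE BROTHER OF DAPHNE"
printf 'first\n'  > "$demo/THE BROTHER OF DAPHNE/Chapters 01-05.txt"
printf 'second\n' > "$demo/THE BROTHER OF DAPHNE/Chapters 06-10.txt"

# NUL-separated: each path reaches cat as a single argument.
find "$demo" -name 'Chapters*.txt' -type f -print0 | xargs -0 cat > "$demo/final_file.txt"

# Whitespace-separated: every path is split at its spaces, so cat cannot open the pieces.
find "$demo" -name 'Chapters*.txt' -type f | xargs cat > "$demo/broken.txt" 2>/dev/null || true

wc -l < "$demo/final_file.txt"   # both chapter lines arrive
wc -c < "$demo/broken.txt"       # nothing was read
```

Note that find does not sort its results, so the order in which the chapters land in final_file.txt depends on the traversal order of the file system.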

Thanks for all your input. It got me thinking, and I managed to concatenate the files using the following steps:
This script replaces spaces in filenames with underscores.
#!/bin/bash
# Walk the directory tree one level deeper on each pass, up to a maximum depth of 20,
# so that parent directories are already renamed before their contents are processed.
for i in $(seq 1 20)
do
    # 'find' lists every file and directory whose name contains a space. The | (pipe) …
    # … forwards the list to a 'while' loop that iterates through each entry in the list.
    find . -maxdepth "$i" -name '* *' | while IFS= read -r file
    do
        # Here, we use 'sed' to replace spaces in the path with underscores.
        # The 'echo' prints a message to the console before renaming the entry using 'mv'.
        item=$(echo "$file" | sed 's/ /_/g')
        echo "Renaming '$file' to '$item'"
        mv "$file" "$item"
    done
done
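A possible single-pass alternative to the depth loop, sketched here on a throwaway tree: with -depth, find reports a directory's contents before the directory itself, so only the last path component has to be renamed and the parent path is still valid at that moment. The demo directory and its contents are assumptions for illustration.

```shell
#!/bin/bash
# Sketch: rename the deepest entries first (-depth) and touch only the last path component.
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
mkdir -p "$demo/YATES, DORNFORD/THE BROTHER OF DAPHNE"
: > "$demo/YATES, DORNFORD/THE BROTHER OF DAPHNE/Chapters 01-05.txt"

find "$demo" -depth -name '* *' -exec sh -c '
    for path do
        dir=${path%/*}       # everything up to the last slash (not renamed yet)
        base=${path##*/}     # the component that contains the space(s)
        mv "$path" "$dir/$(printf "%s" "$base" | tr " " "_")"
    done' sh {} +

find "$demo" -print
```

Passing the paths to sh as arguments (rather than piping them through a while read loop) keeps names with unusual characters intact.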
This script concatenates text files that start with Part, Chapter, Section, or Book.
#!/bin/bash
# Here, we go through all the directories (up to a depth of 20).
# The unquoted $(find …) is safe only because the previous script removed all spaces.
for D in $(find . -maxdepth 20 -type d)
do
    # Check if the directory contains any files of interest.
    if ls "$D"/Part*.txt &>/dev/null ||
       ls "$D"/Chapter*.txt &>/dev/null ||
       ls "$D"/Section*.txt &>/dev/null ||
       ls "$D"/Book*.txt &>/dev/null
    then
        # If we get here, then there are split files in the directory; we will concatenate them.
        # First, we trim the full directory path ($D) so that we are left with the path to the …
        # … files' parent's parent directory. We will write the concatenated file here. (✝)
        ppdir="$(dirname "$D")"
        # Here, we concatenate the files using 'cat'. 'basename' extracts the name of …
        # … the parent directory from the full directory path ($D) and gives us the filename.
        # Finally, we write the concatenated file to its parent's parent directory. (✝)
        cat "$D"/*.txt > "$ppdir/$(basename "$D").txt"
    fi
done
Now, we delete all the files that we concatenated so that their parent directory is left empty.
find . -name 'Part*' -delete
find . -name 'Chapter*' -delete
find . -name 'Section*' -delete
find . -name 'Book*' -delete
The following command will delete empty directories. (✝) We wrote the concatenated file to its parent's parent directory so that its parent directory is left empty after deleting all the split files.
find . -type d -empty -delete
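If renaming everything to underscores is not desired, the concatenation step can probably be done with the original names by letting find hand each directory to a small quoted shell loop. This is only a sketch under the same assumptions as the script above (split files start with Part, Chapter, Section, or Book); the demo tree is invented.

```shell
#!/bin/bash
# Sketch: for every directory that holds split files, write "<dir name>.txt" beside it.
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
sample="$demo/Y/YATES, DORNFORD/THE BROTHER OF DAPHNE"
mkdir -p "$sample"
printf 'one\n' > "$sample/Chapters 01-05.txt"
printf 'two\n' > "$sample/Chapters 06-10.txt"

find "$demo" -type d -exec sh -c '
    for dir do
        for f in "$dir"/Part*.txt "$dir"/Chapter*.txt "$dir"/Section*.txt "$dir"/Book*.txt; do
            if [ -f "$f" ]; then
                # ${dir%/*} is the parent of $dir, ${dir##*/} is its own name.
                cat "$dir"/*.txt > "${dir%/*}/${dir##*/}.txt"
                break
            fi
        done
    done' sh {} +

cat "$sample.txt"
```

Nothing is deleted here; the cleanup commands above would still have to be run separately once the result has been checked.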

Related

Bash: Find files with "find" in a folder tree that has trailing whitespace

I have a super weird directory structure, that might have trailing whitespace in folder names such as:
"/path/to /file /with /folders /that /contain /whitespaces /file.ext"
I want the bash "find" function to pick those up while it traverses the file tree - the default recursive nature of it doesn't pick them up apparently.
I'd need a specific find command, that picks up all directory structures (e.g. not filtering only for those that have this "anomaly"), including those having trailing (white)spaces.
Can someone point me to the right direction?
Try this, and see if it offers any clues:
find . -type d -print0 | xargs -0 -I D echo \"D\"
"-print0" and "xargs -0" should preserve your tricky directory names.

How to move files based on file names in a .csv doc - macOS Terminal?

Terminal noob need a little help :)
I have a 98 row long filename list in a .csv file. For example:
name01; name03, etc.
I have an external hard drive with a lot of files in chaotic file
structure. BUT the file names are consistent, something like:
name01_xy; name01_zq; name02_xyz etc.
I would like to copy every file and directory from the external hard
drive which begins with the filename stored in the .csv file to my
computer.
So basically it's a search and copy based on a text file from an eHDD to my computer. I guess the easiest way to do is a Terminal command. Do you have any advice? Thanks in advance!
The task can be split into three: read search criteria from file; find files by criteria; copy found files. We discuss each one separately and combine them in a one-liner step-by-step:
Read search criteria from .csv file
Since your .csv file is pretty much just a text file with one criterion per line, it's pretty easy: just cat the file.
$ cat file.csv
bea001f001
bea003n001
bea007f005
bea008f006
bea009n003
Find files
We will use find. Example: you have a directory /Users/me/where/to/search and want to find all files in there whose names start with bea001f001:
$ find /Users/me/where/to/search -type f -name "bea001f001*"
If you want to find all files that end with bea001f001, move the star wildcard (zero-or-more) to the beginning of the search criterion:
$ find /Users/me/where/to/search -type f -name "*bea001f001"
Now you can already guess what the search criterion for all files containing the name bea001f001 would look like: "*bea001f001*".
We use -type f to tell find that we are interested only in finding files and not directories.
Combine reading and finding
We use xargs to pass each line of the file to find as its -name argument:
$ cat file.csv | xargs -I [] find /Users/me/where/to/search -type f -name "[]*"
/Users/me/where/to/search/bea001f001_xy
/Users/me/where/to/search/bea001f001_xyz
/Users/me/where/to/search/bea009n003_zq
Copy files
We use cp. It is pretty straightforward: cp file target will copy file to directory target (if it is a directory, or replace file named target).
Complete one-liner
We pass results from find to cp not by piping, but by using the -exec argument passed to find:
$ cat file.csv | xargs -I [] find /Users/me/where/to/search -type f -name "[]*" -exec cp {} /Users/me/where/to/copy \;
Sorry, this is my first post here. In response to the comments above: only the last file is selected, most likely because the other lines end with a carriage return (\r). If you first prepend the directory to each filename in the csv, you can perform the copy with the following command, which strips the \r.
cp `tr -d '\r' < file.csv` /your/target/directory
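Putting the pieces from both answers together, a sketch might look like the following: strip the carriage returns first, then run one find per name. The directory names, file names, and CSV contents in the demo are placeholders, not the asker's real data.

```shell
#!/bin/bash
# Sketch: copy every file whose name starts with an entry from a CRLF-terminated list.
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
mkdir -p "$demo/search/sub" "$demo/copy"
: > "$demo/search/bea001f001_xy"
: > "$demo/search/sub/bea009n003_zq"
: > "$demo/search/other_file"
printf 'bea001f001\r\nbea009n003\r\n' > "$demo/file.csv"   # CRLF line endings

tr -d '\r' < "$demo/file.csv" |
while IFS= read -r name; do
    [ -n "$name" ] || continue     # skip blank lines
    find "$demo/search" -type f -name "$name*" -exec cp {} "$demo/copy/" \;
done

ls "$demo/copy"
```

Files with the same name in different source directories would overwrite each other in the target directory, so it may be worth checking for duplicates before copying.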

save filename and information from the file into a two column txt doc. ubuntu terminal

I have a question regarding the manipulation and creation of text files in the ubuntu terminal. I have a directory that contains several 1000 subdirectories. In each directory, there is a file with the extension stats.txt. I want to write a piece of code that will run from the parent directory, and create a file with the name of all the stats.txt files in the first column, and then returns to me all the information from the 5th line of the same stats.txt file in the next column. The 5th line of the stats.txt file is a sentence of six words, not a single value.
For reference, I have successfully used the sed command in combination with find and cat to make a file containing the 5th line from each stats.txt file. I then used the ls command to save a list of all my subdirectories. I assumed both files would be in alphabetical order of the subdirectories, and thus easy to merge, but I was wrong. The find and cat functions, or at least my implementation of them, resulted in a file that appeared to be random in order (see below). No need to try to remedy this code, I'm open to all solutions.
# loop through subdirectories and save the 5th line of stats.txt as a different file.
for f in ~/*; do [ -d "$f" ] && cd "$f" && sed -n 5p *stats.txt > final.stats.txt; done
# find the final.stats.txt files and save them as a single file
find ./ -name 'final.stats.txt' -exec cat {} \; > compiled.stats.txt
Maybe something like this can help you get on track:
find . -name "*stats.txt" -exec awk 'FNR==5{print FILENAME, $0}' '{}' + > compiled.stats
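A small sketch of that awk approach on invented data, with a tab as the column separator and a sort added so the rows come out in path order. The directory names and file contents are assumptions for the demo.

```shell
#!/bin/bash
# Sketch: build a two-column file (path, 5th line) from every *stats.txt below a directory.
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
for d in b_sample a_sample; do
    mkdir -p "$demo/$d"
    printf 'l1\nl2\nl3\nl4\nfifth line of %s stats\nl6\n' "$d" > "$demo/$d/run.stats.txt"
done

# FNR restarts at 1 for each input file, so FNR==5 selects line 5 of every file.
(cd "$demo" &&
 find . -name '*stats.txt' -exec awk 'FNR==5 {print FILENAME "\t" $0}' {} + |
 sort > compiled.stats)

cat "$demo/compiled.stats"
```

Because the file name and the line are printed by the same awk call, the two columns cannot get out of step the way two separately generated lists can.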

Visit all subdirectories and extract first page from every pdf

I have a few folders with e-books and I want to extract the first page from every book. There are over two hundred books, so doing this manually would be a big pain and very time consuming.
I have a command that does the job for single file
pdftk TehInput.pdf cat 1 output cover_TehInput.pdf
How do I wrap this into a single script that visits every file and names the output like cover_wtv-original-name-is.pdf? The output files can go anywhere, for example in the directory where the script was started or next to the original file.
You want to use the find command for this. Something like:
find . -iname '*.pdf' -exec pdftk '{}' cat 1 output '{}'.cover.pdf ';'
This will find all PDFs from the current directory (.) downwards, and execute
pdftk filename.pdf cat 1 output filename.pdf.cover.pdf
on it. It's the whole path that will get passed to pdftk, so you'll end up with the cover PDFs in the same directory as the original files. (You could do something to get rid of the .pdf.cover.pdf extensions if you need to.)
If you use no blanks or newlines in filenames:
find . -iname '*.pdf' -printf "%h %f\n" | sed -E 's|(.*) (.*)|echo pdftk \1/\2 cat 1 output \1/cover_\2|' | sh
If output is okay, remove "echo ".
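For names that do contain blanks, a dry-run sketch along these lines may help. It only prints the pdftk commands; cover_path is a hypothetical helper that builds the cover_ name asked for in the question, and the demo tree is invented.

```shell
#!/bin/bash
# cover_path is a hypothetical helper: ./dir/Book.pdf -> ./dir/cover_Book.pdf
cover_path () {
    printf '%s/cover_%s\n' "$(dirname "$1")" "$(basename "$1")"
}

demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
mkdir -p "$demo/Sci Fi"
: > "$demo/Sci Fi/Some Book.pdf"

# Dry run: print one pdftk command per PDF, skipping covers made by an earlier run.
plan=$(find "$demo" -iname '*.pdf' ! -name 'cover_*' |
       while IFS= read -r pdf; do
           echo pdftk "$pdf" cat 1 output "$(cover_path "$pdf")"
       done)
printf '%s\n' "$plan"
```

To actually run pdftk, drop the echo inside the loop. Reading line by line copes with spaces but still breaks on file names that contain newlines.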

Script to prepend all filenames within a directory

I've found an issue with Adobe's Bates numbering tool, where file names are messing up the order in which the files are numbered.
I was hoping to write a script that users would be able to click on and then enter the folder path for all the files.
The script would then prepend a sequence number to every file name within the folder: 000001filename.pdf, 000002filename.pdf, etc.
I've never combined scripts before, but I've found scripts that either rename OR prepend, and I couldn't find anything that would rename sequentially with leading zeros.
without much testing:
n=0                # or 1 if you like
format="%06d"      # format of prefix
find . -maxdepth 1 -type f |   # only one level, no dirs but also no symlinks etc
cut -d/ -f2 |                  # remove leading ./
sort |                         # plug in your sorting here
while IFS= read -r file
do
    prefix=$(printf "$format" "$n")
    mv "$file" "$prefix$file"  # but mv is dangerous!
    n=$((n+1))
done
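A variant of the same idea that relies on the shell's own sorted glob instead of find, cut, and sort, tried out on a throwaway directory. The file names are invented, and numbering starts at 1 here.

```shell
#!/bin/bash
# Sketch: prefix every regular file in one directory with a zero-padded counter.
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
: > "$demo/b report.pdf"
: > "$demo/a report.pdf"
: > "$demo/c report.pdf"

n=1
for path in "$demo"/*; do          # the glob is expanded once, in sorted order
    [ -f "$path" ] || continue     # skip directories
    name=${path##*/}
    mv "$path" "$demo/$(printf '%06d' "$n")$name"
    n=$((n + 1))
done

ls "$demo"
```

Because the glob is expanded before the loop starts, files that have already been renamed are not picked up a second time.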
