Comparing two directories to produce output - bash

I am writing a Bash script that will replace files in folder A (source) with files from folder B (target). But before this happens, I want to record two files.
The first file will contain a list of the files in folder B that are newer than those in folder A, along with the files that are different or orphaned in folder B relative to folder A.
The second file will contain a list of the files in folder A that are newer than those in folder B, along with the files that are different or orphaned in folder A relative to folder B.
How do I accomplish this in Bash? I've tried using diff -qr but it yields the following output:
Files old/VERSION and new/VERSION differ
Files old/conf/mime.conf and new/conf/mime.conf differ
Only in new/data/pages: playground
Files old/doku.php and new/doku.php differ
Files old/inc/auth.php and new/inc/auth.php differ
Files old/inc/lang/no/lang.php and new/inc/lang/no/lang.php differ
Files old/lib/plugins/acl/remote.php and new/lib/plugins/acl/remote.php differ
Files old/lib/plugins/authplain/auth.php and new/lib/plugins/authplain/auth.php differ
Files old/lib/plugins/usermanager/admin.php and new/lib/plugins/usermanager/admin.php differ
I've also tried this
(rsync -rcn --out-format="%n" old/ new/ && rsync -rcn --out-format="%n" new/ old/) | sort | uniq
but it doesn't give me the scope of results I require. The struggle here is that the data isn't in the correct format; I just want files, not directories, to show in the text files, e.g.:
conf/mime.conf
data/pages/playground/
data/pages/playground/playground.txt
doku.php
inc/auth.php
inc/lang/no/lang.php
lib/plugins/acl/remote.php
lib/plugins/authplain/auth.php
lib/plugins/usermanager/admin.php

List of files in directory B (new/) that are newer than directory A (old/):
find new -newermm old
This runs find over new/ using the -newerXY test, with X and Y both set to m (modification time) and old itself as the reference, so it lists everything under new/ that was modified more recently than the old directory.
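Since your desired output lists only files as relative paths, a small variant of the above (my addition, assuming GNU find) restricts the match to regular files and strips the leading directory name:
find new -type f -newermm old | sed 's:^new/::'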
Files that are missing in directory B (new/) but are present in directory A (old/):
A=old B=new
diff -u <(find "$B" |sed "s:$B::") <(find "$A" |sed "s:$A::") \
|sed "/^+\//!d; s::$A/:"
This sets the variables $A and $B to your target directories, then runs a unified diff on their contents, using process substitution to list each tree with find and strip the leading directory name with sed so that diff compares matching relative paths. The final sed keeps only the additions (lines starting with +/) and rewrites that leading +/ into the directory name plus a slash; all other lines are deleted.
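If diff's machinery feels heavy for this, the same list can be produced with comm on the two sorted listings; a minimal sketch under the same $A/$B variables (comm requires sorted input):
# -13: keep only lines unique to the second listing, i.e. present in $A but not in $B
comm -13 <(find "$B" | sed "s:^$B::" | sort) \
         <(find "$A" | sed "s:^$A::" | sort) \
    | sed "s:^:$A:"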
Here is a bash script that will create the file:
#!/bin/bash
# Usage: bash script.bash OLD_DIR NEW_DIR [OUTPUT_FILE]
# compare given directories
if [ -n "$3" ]; then  # the optional 3rd argument is the output file
    OUTPUT="$3"
else  # if it isn't provided, escape path slashes to underscores
    OUTPUT="${2////_}-newer-than-${1////_}"
fi
{
    find "$2" -newermm "$1"
    diff -u <(find "$2" |sed "s:$2::") <(find "$1" |sed "s:$1::") \
        |sed "/^+\//!d; s::$1/:"
} |sort > "$OUTPUT"
First, this determines the output file, which either comes from the third argument or else is built from the other two arguments, converting slashes to underscores in case they are paths. For example, running bash script.bash /usr/local/bin /usr/bin would write its file list to _usr_bin-newer-than-_usr_local_bin in the current working directory.
The brace group combines the output of the two commands, which is then sorted. There won't be any duplicates, so you don't need to worry about that (if there were, you'd use sort -u).
You can get your first and second files by changing the order of arguments as you invoke this script.
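For example, with the directory names from your question:
bash script.bash old new first.txt    # newer in new/, plus files present only in old/
bash script.bash new old second.txt   # newer in old/, plus files present only in new/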

Related

Script to automate moving files from multiple directories to newly created directories with the same suffix (not file extension)

I'm processing a large collection of born-digital materials for an archive but I'm being slowed down by the fact that I'm having to manually create directories and find and move files from multiple directories into newly created directories.
Problem: I have three directories containing three different types of content derived from different sources:
-disk_images
-evidence_photos
-document_scans
The disk images were created from CDs that come in cases, and the writing on the cases needs to be accessible and preserved for posterity, so pictures have been taken of them and loaded into the evidence photos folder with a prefix and inventory number. Some CDs came with indexes on paper, which have been scanned, OCR'd, and loaded into the document scans folder with a prefix and an inventory number. Not all disk images have corresponding photos or scans, so the inventory numbers in those folders are not linear.
I've been trying to think of ways to write a script that would look through each of these directories and move files with the same suffix (not extension) into newly created directories for each inventory number, but this is way beyond my expertise. Any help would be much appreciated, and I will be more than happy to clarify if need be.
examples of file names:
-disk_images/ahacd_001.iso
-evidence_photos/ahacd_case_001.jpg
-document_scans/ahacd_notes_001.pdf
Potential new directory name: ahacd_001
All files with inventory number 001 would need to end up in ahacd_001.
(The inventory number is the 001 part of each file name.)
Here is a skeleton of a program that iterates through your three starting folders and splits your file names:
for folder in */  # list the directories
do
    echo "moving folder $folder"
    ls "$folder" | while read -r file  # list the files in the directory
    do
        echo "$file"
        # split the file name with awk: take the first part ('ahacd') and the last ('001')
        target=$(echo "$file" | awk -F '.' '{print $1}' | awk -F '_' '{print $1 "_" $NF}')
        echo "$target"
        # when you are satisfied that your file splitting works, uncomment these:
        #mkdir -p "$target"            # create your folder
        #mv "$folder$file" "$target"/  # move the file
    done
done
A few pointers for splitting the filenames:
Get last field using awk substr
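If you'd rather avoid awk, bash parameter expansion can do the same split; a quick sketch using one of the example names:
f="ahacd_case_001.jpg"
base="${f%%.*}"          # strip the extension      -> ahacd_case_001
num="${base##*_}"        # keep the last _ field    -> 001
prefix="${base%%_*}"     # keep the first _ field   -> ahacd
echo "${prefix}_${num}"  # -> ahacd_001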
First I would like to say that file or directory names starting with - are a bad idea, even if they're allowed.
Test case:
mkdir -p /tmp/test/{-disk_images,-evidence_photos,-document_scans}
cd /tmp/test
touch -- "-disk_images/ahacd_001.iso" #create your three test files
touch -- "-evidence_photos/ahacd_case_001.jpg"
touch -- "-document_scans/ahacd_notes_001.pdf"
find -type f|perl -nlE \
'm{.*/(.*?)_(.*_)?(\d+)\.}&&say qq(mkdir -p target/$1_$3; mv "$_" target/$1_$3)'
...will not move the files; it just shows you the commands it thinks should be run.
If those commands are what you want run, then run them by adding | bash at the end of the same find | perl command:
find -type f|perl -nlE \
'm{.*/(.*?)_(.*_)?(\d+)\.}&&say qq(mkdir -p target/$1_$3; mv "$_" target/$1_$3)' \
| bash
find -ls #to see the result
All three files are now in the target/ahacd_001/ subfolder.
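For reference, here is how the match groups line up against one of the example names (my annotation, not the original answer's):
# m{.*/(.*?)_(.*_)?(\d+)\.}  applied to  ./-evidence_photos/ahacd_case_001.jpg
#   .*/     greedy, up to the last slash  ->  ./-evidence_photos/
#   (.*?)   $1, shortest run before '_'   ->  ahacd
#   (.*_)?  $2, optional middle part      ->  case_
#   (\d+)   $3, the inventory number      ->  001
# generated: mkdir -p target/ahacd_001; mv "./-evidence_photos/ahacd_case_001.jpg" target/ahacd_001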

Move files in S3 to folders based on filename

I have an S3 folder where files are staged from an application.
I need to move these files into a specified folder structure, based on the filenames.
The files are named in a particular format:
s3://bucketname/staging/file1_YYYY_MM_DD_HH_MM_SS
s3://bucketname/staging/file1_YYYY_MM_DD_HH_MM_SS
I need to move them to s3 folders of this format:
s3://bucketname/file1/YYYY/MM/DD
I have the following code now to store all the filenames present in the staging folder in a file.
path=s3://bucketname/staging
count=`s3cmd ls $path | wc -l`
echo $count
if [[ $count -gt 0 ]]; then
    list_files_to_move_s3=$(s3cmd ls -r $path | awk '{print $4}' > files_in_bucket.txt)
    echo "exists"
else
    echo "do not exist"
fi
I now need to read the filenames and move the files accordingly.
Can you please help?
You can parse the contents of files_in_bucket.txt with sed to produce the output you want:
---> cat tests3.txt
s3://bucketname/staging/file1_YYYY_MM_DD_HH_MM_SS
s3://bucketname/staging/file1_YYYY_MM_DD_HH_MM_SS
---> sed -r "s|^(s3://.*)/.*/(.*)_(.*)_(.*)_(.*)_.*_.*_.*$|\1/\2/\3/\4/\5|g" tests3.txt
s3://bucketname/file1/YYYY/MM/DD
s3://bucketname/file1/YYYY/MM/DD
--->
What's happening there is that sed parses each line from the file tests3.txt, with each bit inside parentheses saved as a capture group, which can then be referenced in the substitution string as \1, \2, \3, etc. So it picks out the first part up through the bucket name, skips the "staging" component, and then picks out the file name and date portions of the file name.
Note that this assumes a very standardized layout of the filenames and your desired output.
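From there, you can feed each staged path through the same substitution and hand the result to s3cmd mv; a minimal sketch of my own (assuming every line of files_in_bucket.txt follows the layout above):
while IFS= read -r src; do
    # derive the destination prefix from the source path
    dst=$(printf '%s\n' "$src" |
        sed -r "s|^(s3://.*)/.*/(.*)_(.*)_(.*)_(.*)_.*_.*_.*$|\1/\2/\3/\4/\5|")
    s3cmd mv "$src" "$dst/"  # trailing slash so the object keeps its name under the prefix
done < files_in_bucket.txt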
Let me know if you have any questions about this or need further help.

Recursively concatenating (joining) and renaming text files in a directory tree

I am using Mac OS X Lion.
I have a folder: LITERATURE with the following structure:
LITERATURE > Y > YATES, DORNFORD > THE BROTHER OF DAPHNE:
Chapters 01-05.txt
Chapters 06-10.txt
Chapters 11-end.txt
I want to recursively concatenate the chapters that are split into multiple files (not all are). Then, I want to write the concatenated file to its parent's parent directory. The name of the concatenated file should be the same as the name of its parent directory.
For example, after running the script (in the folder structure shown above) I should get the following.
LITERATURE > Y > YATES, DORNFORD:
THE BROTHER OF DAPHNE.txt
THE BROTHER OF DAPHNE:
Chapters 01-05.txt
Chapters 06-10.txt
Chapters 11-end.txt
In this example, the parent directory is THE BROTHER OF DAPHNE and the parent's parent directory is YATES, DORNFORD.
[Updated March 6th—Rephrased the question/answer so that the question/answer is easy to find and understand.]
It's not clear what you mean by "recursively" but this should be enough to get you started.
#!/bin/bash
titlecase () { # adapted from http://stackoverflow.com/a/6969886/874188
    local arr
    arr=("${@,,}")
    echo "${arr[@]^}"
}
for book in LITERATURE/?/*/*; do
    title=$(titlecase ${book##*/})
    for file in "$book"/*; do
        cat "$file"
        echo
    done >"$book/$title"
    echo '# not doing this:' rm "$book"/*.txt
done
This loops over LITERATURE/initial/author/BOOK TITLE and creates a file Book Title (where should a space be added?) from the catenated files in each book directory. (I would generate it in the parent directory and then remove the book directory completely, assuming it contains nothing of value any longer.) There is no recursion, just a loop over this directory structure.
Removing the chapter files is a bit risky so I'm not doing it here. You could remove the echo prefix from the line after the first done to enable it.
If you have book names which contain an asterisk or some other shell metacharacter this will be rather more complex -- the title assignment assumes you can use the book title unquoted.
Only the parameter expansion with case conversion is beyond the very basics of Bash. The array operations could perhaps also be a bit scary if you are a complete beginner. Proper understanding of quoting is also often a challenge for newcomers.
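If the case-conversion expansions are new to you, here is a quick illustration. Note that ${var,,} and ${arr[@]^} need bash 4.0 or later, and OS X Lion ships bash 3.2, so you may need a newer bash than /bin/bash (an assumption worth checking on your system):
s="THE BROTHER OF DAPHNE"
arr=(${s,,})        # lowercase and word-split: the brother of daphne
echo "${arr[@]^}"   # capitalize each element: The Brother Of Daphne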
cat Chapters*.txt > FinaleFile.txt.raw
Chapters="$( ls -1 Chapters*.txt | sed -n 'H;${x;s/\
//g;s/ *Chapters //g;s/\.txt/ /g;s/ *$//p;}' )"
mv FinaleFile.txt.raw "FinaleFile ${Chapters}.txt"
cat all the txt files at once (assuming a name-sorted list)
take the chapter numbers/refs from the ls of the folder, with a sed to adapt the format
rename the concatenated file to include the chapters
Shell doesn't like white space in names. However, over the years, Unix has come up with some tricks that'll help:
$ find . -name "Chapters*.txt" -type f -print0 | xargs -0 cat >> final_file.txt
Might do what you want.
The find recursively finds all of the directory entries in a file tree that matches the query (In this case, the type must be a file, and the name matches the pattern Chapter*.txt).
Normally, find separates the directory entry names with NL (newline), but -print0 says to separate the entry names with the NUL character instead. NL is a valid character in a file name, but NUL isn't.
The xargs command takes the output of the find and processes it. xargs gathers all the names and passes them in bulk to the command you give it -- in this case the cat command.
Normally, xargs separates out files by white space which means Chapters would be one file and 01-05.txt would be another. However, the -0 tells xargs, to use NUL as a file separator -- which is what -print0 does.
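One caveat of my own: find makes no promise about ordering, and chapter order matters when concatenating. If your sort supports NUL-separated input via -z (GNU coreutils does; verify on your system), you can sort the list in between:
find . -name "Chapters*.txt" -type f -print0 | sort -z | xargs -0 cat >> final_file.txt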
Thanks for all your input. They got me thinking, and I managed to concatenate the files using the following steps:
This script replaces spaces in filenames with underscores.
#!/bin/bash
# We are going to iterate through the directory tree, up to a maximum depth of 20.
for i in `seq 1 20`
do
    # In UNIX-based systems, files and directories are the same (Everything is a File!).
    # The 'find' command lists all files which contain spaces in their names. The | (pipe) …
    # … forwards the list to a 'while' loop that iterates through each file in the list.
    find . -maxdepth $i -name '* *' | while IFS= read -r file
    do
        # Here, we use 'sed' to replace the spaces in the filename with underscores.
        # The 'echo' prints a message to the console before renaming the file using 'mv'.
        item=`echo "$file" | sed 's/ /_/g'`
        echo "Renaming '$file' to '$item'"
        mv "$file" "$item"
    done
done
This script concatenates text files that start with Part, Chapter, Section, or Book.
#!/bin/bash
# Here, we go through all the directories (up to a depth of 20).
for D in `find . -maxdepth 20 -type d`
do
    # Check if the directory contains any files of interest.
    if ls $D/Part*.txt &>/dev/null ||
       ls $D/Chapter*.txt &>/dev/null ||
       ls $D/Section*.txt &>/dev/null ||
       ls $D/Book*.txt &>/dev/null
    then
        # If we get here, then there are split files in the directory; we will concatenate them.
        # First, we trim the full directory path ($D) so that we are left with the path to the …
        # … files' parent's parent directory; we will write the concatenated file here. (✝)
        ppdir="$(dirname "$D")"
        # Here, we concatenate the files using 'cat'. The 'awk' command extracts the name of …
        # … the parent directory (the last /-separated field of $D) and gives us the filename.
        # Finally, we write the concatenated file to its parent's parent directory. (✝)
        cat $D/*.txt > $ppdir/`echo $D | awk -F'/' '{print $NF}'`.txt
    fi
done
Now, we delete all the files that we concatenated, so that their parent directories are left empty.
find . -name 'Part*' -delete
find . -name 'Chapter*' -delete
find . -name 'Section*' -delete
find . -name 'Book*' -delete
The following command will delete empty directories. (✝) We wrote the concatenated file to its parent's parent directory so that its parent directory is left empty after deleting all the split files.
find . -type d -empty -delete

Script to prepend all filenames within a directory

I've found an issue with Adobe's Bates numbering tool, where file names are messing up the order in which documents are numbered.
I was hoping to write a script that users would be able to click on and point at the folder containing all the files.
The script would then prepend all the file names within the folder sequentially: 000001filename.pdf, 000002filename.pdf, etc.
I've never combined scripts before, but I've found scripts that either rename OR prepend, and I couldn't find anything that would rename sequentially with preceding 0's.
without much testing:
n=0            # or 1 if you like
format="%06d"  # format of the prefix
find . -maxdepth 1 -type f |  # only one level, no dirs, but also no symlinks etc.
cut -d/ -f2 |                 # remove the leading ./
sort |                        # plug in your sorting here
while read -r file
do
    prefix=`printf "$format" $n`
    mv "$file" "$prefix$file"  # but mv is dangerous!
    n=$((n+1))
done
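printf does the zero-padded formatting; for example:
$ printf "%06d\n" 7
000007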

Read file names from directory in Bash

I need to write a script that reads all the file names from a directory and then, depending on the file name (for example, whether it contains R1 or R2), concatenates all the files that contain, say, R1 in the name.
Can anyone give me some tip how to do this?
The only thing I was able to do is:
#!/bin/bash
FILES="path to the files"
for f in $FILES
do
cat $f
done
and this only shows me that the variable FILES is a directory, not the files it contains.
To make the smallest change that fixes the problem:
dir="path to the files"
for f in "$dir"/*; do
cat "$f"
done
To accomplish what you describe as your desired end goal:
shopt -s nullglob
dir="path to the files"
substrings=( R1 R2 )
for substring in "${substrings[@]}"; do
    cat /dev/null "$dir"/*"$substring"* >"${substring}.out"
done
Note that cat can take multiple files in one invocation; in fact, if you aren't doing that, you usually don't need to use cat at all. (The /dev/null argument ensures that cat never sits waiting on stdin when a glob matches nothing, and that the output file is still created, empty, in that case.)
Simple hack:
ls -al *R1* | awk '{print $9}' >outputfilenameR1
ls -al *R2* | awk '{print $9}' >outputfilenameR2
Your expectation that
for f in $FILES
would loop over all the file names in the directory was disappointed because, as you observed, the value of FILES itself was the only item processed by the for loop.
In order to get a list of files out of a value pointing to a directory, you need to provide a filename pattern which, when the shell evaluates $FILES, expands to the directory's contents.
You can do this by appending /* to the directory path stored in FILES. The * guarantees that all entries in the directory are returned, so the list will contain not only files but also sub-directories, if there are any.
In other words, if you change the assignment to:
FILES="path to the files/*"
the script will then behave like you expected. (One caveat: because $FILES is expanded unquoted, this only works if the path itself contains no whitespace; the "$dir"/* form in the first answer handles that case.)
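Putting that together, a minimal sketch of the corrected script (assuming, per the caveat above, a path without whitespace):
#!/bin/bash
FILES="/path/to/the/files/*"
for f in $FILES  # deliberately unquoted so the glob expands
do
    cat "$f"
done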
