Find all duplicate subdirectories in directory - bash

I need to make a shell script that "lists all identical sub-directories (recursively) under the current working directory."
I'm new to shell scripts. How do I approach this?
To me, this means:
for each directory under some starting directory, compare it to every other directory by name.
if another directory has the same name, check size.
if the sizes also match, recursively compare the contents of each directory item by item, maybe by md5sum(?), and keep doing so for each subdirectory within the directories (recursively?)
then, continue by recursively calling this on every subdirectory encountered
then, repeat for every other directory in the directory structure
It would have been the most complicated program I'd ever have written, so I assume I'm just not aware of some shell command that does most of it for me?
I.e., how should I have approached this? All the other parts were about googling until I discovered the shell command that did 90% of it for me.
(This was a previous assignment that I wasn't able to finish; I took a zero on this part and need to know how to approach it in the future.)

I'd be surprised to hear that there is a special Unix tool, or special usage of a standard Unix tool, that does exactly what you describe. Maybe your understanding of the task is more complex than what the task giver intended. Maybe "identical" was meant to refer to some kind of linking; but hardlinking directories is normally not allowed, so that probably isn't meant either.
Anyway, I'd approach this task by creating checksums for all nodes in your tree, i.e. recursively:
For a directory take the names of all entries and their checksums (recursion) and compute a checksum of them,
for a plain file compute a checksum of its contents,
for symlinks and special files (devices, etc.) consider what you want (I'll leave this out).
After creating checksums for all elements, search for duplicates (by sorting a list of all and searching for consecutive lines).
A quick solution could be like this:
#!/bin/bash

dirchecksum() {
    if [ -f "$1" ]
    then
        # plain file: checksum of its contents
        checksum=$(md5sum < "$1")
    elif [ -d "$1" ]
    then
        # directory: checksum over the names and checksums of its entries
        checksum=$(
            find "$1" -maxdepth 1 \( ! -path "$1" \) -printf "%P " \
                 -exec bash -c 'dirchecksum "$1"' _ {} \; |
                md5sum
        )
    fi
    echo "$checksum"            # result for the caller (fd 1)
    echo "$checksum $1" 1>&3    # report line for the final duplicate search (fd 3)
}
export -f dirchecksum

# collect the report lines from fd 3, discard the fd-1 result
list=$(dirchecksum "$1" 3>&1 1>/dev/null)

# sort by checksum so that duplicates end up on consecutive lines
lastChecksum=''
while read checksum _ path
do
    if [ "$checksum" = "$lastChecksum" ]
    then
        echo "duplicate found: $path = $lastPath"
    fi
    lastChecksum=$checksum
    lastPath=$path
done < <(sort <<< "$list")
This script uses two tricks which might not be clear, so I mention them:
To pass a shell function to find -exec one can export -f it (done right after the function definition) and then call bash -c ... to execute it.
The shell function has two output streams, one for returning the resulting checksum (this is via stdout, i.e. fd 1), and one for reporting each checksum found along the way (this is via fd 3).
The sorting at the end uses the list given out via fd 3 as input.
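As a minimal standalone illustration of the export -f trick (the function name greet here is just a hypothetical example):
greet() { echo "visiting: $1"; }
export -f greet
# each directory found is handed to a child bash, which can see the exported function
find . -maxdepth 1 -type d -exec bash -c 'greet "$1"' _ {} \;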

Maybe something like this:
$ find -type d -exec sh -c "echo -n {}\ ; sh -c \"ls -s {}; basename {}\"|md5sum " \; | awk '$2 in a {print "Match:"; print a[$2], $1; next} a[$2]=$1{next}'
Match:
./bar/foo ./foo
find all directories: find -type d, output:
.
./bar
./bar/foo
./foo
ls -s {}; basename {} will print the simplified directory listing and the basename of the directory listed, for example for directory foo: ls -s foo; basename foo
total 0
0 test
foo
Those cover the files in each dir, their sizes and the dir name. That output is piped to md5sum, and the hash is then printed along with the dir name:
. 674e2573b49826d4e32dfe81d9680369 -
./bar 4c2d588c5fa9781ad63ad8e86e575e01 -
./bar/foo ff8d1569685be86366f18ea89851db35 -
./foo ff8d1569685be86366f18ea89851db35 -
will be sent to awk:
$2 in a {           # hash (field 2) used as array key: seen before
    print "Match:"  # separate hits in output
    print a[$2], $1 # print the matching dirs
    next            # go to the next record
}
a[$2]=$1 {next}     # otherwise store the first dir seen for this hash, then go to the next record
Test dir structure:
$ mkdir -p test/foo; mkdir -p test/bar/foo; touch test/foo/test; touch test/bar/foo/test
$ find test/
test/
test/bar
test/bar/foo
test/bar/foo/test # touch test
test/foo
test/foo/test # touch test

Related

Create archive from difference of two folders

I have the following problem.
There are two directory trees, A and B. They are mostly identical, but B has a few files that A does not. (These are two mounted rootfs images.)
I want to create a shell script that does the following:
1. Find out which files are contained in B but not in A.
2. Copy the files found in 1. from B and create a tar.gz that contains these files, keeping the folder structure.
The goal is to import the additional data from image B afterwards on an embedded system that contains the contents of image A.
For the first step I put together the following code snippet. A note on grep "Nur": diff runs here in a German locale, and "Nur in" means "Only in":
diff -rq <A> <B>/ 2>/dev/null | grep Nur | awk '{print substr($3, 1, length($3)-1) "/" substr($4, 1, length($4)-1)}'
The result is the output of the paths relative to folder B.
I have no idea how to implement the second step. Can someone give me some help?
Using diff for finding files which don't exist is severe overkill; you are doing a lot of calculations to compare the contents of the files, where clearly all you care about is whether a file name exists or not.
Maybe try this instead.
tar zcf newfiles.tar.gz $(comm -13 <(cd A && find . -type f | sort) <(cd B && find . -type f | sort) | sed 's/^\./B/')
The find commands produce a listing of the file name hierarchies; comm -13 extracts the elements which are unique to the second input file (which here isn't really a file at all; we are using the shell's process substitution facility to provide the input) and the sed command adds the path into B back to the beginning.
Passing a command substitution $(...) as the argument to tar is problematic; if there are a lot of file names, you will run into "command line too long", and if your file names contain whitespace or other irregularities in them, the shell will mess them up. The standard solution is to use xargs but using xargs tar cf will overwrite the output file if xargs ends up calling tar more than once; though perhaps your tar has an option to read the file names from standard input.
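If your tar is GNU tar, it can instead read the member names from standard input with -T -, which sidesteps both problems. A rough sketch, assuming GNU tar and file names without embedded newlines (the archive members here are stored relative to B):
( cd B &&
  comm -13 <(cd ../A && find . -type f | sort) <(find . -type f | sort) |
  tar -czf ../newfiles.tar.gz -T - )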
With find:
$ mkdir -p A B
$ touch A/a A/b
$ touch B/a B/b B/c B/d
$ cd B
$ find . -type f -exec sh -c '[ ! -f ../A/"$1" ]' _ {} \; -print
./c
./d
The idea is to use the exec action with a shell script that tests the existence of the current file in the other directory. There are a few subtleties:
The first argument of sh -c is the script to execute, the second (here _ but could be anything else) corresponds to the $0 positional parameter of the script and the third ({}) is the current file name as set by find and passed to the script as positional parameter $1.
The -print action at the end is needed, even if it is normally the default with find, because the use of -exec cancels this default.
Example of use to generate your tarball with GNU tar:
$ cd B
$ find . -type f -exec sh -c '[ ! -f ../A/"$1" ]' _ {} \; -print > ../list.txt
$ tar -c -v -f ../diff.tar --files-from=../list.txt
./c
./d
Note: if you have unusual file names the --verbatim-files-from GNU tar option can help. Or a combination of the -print0 action of find and the --null option of GNU tar.
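A sketch of that last combination, assuming GNU find and GNU tar:
$ cd B
$ find . -type f -exec sh -c '[ ! -f ../A/"$1" ]' _ {} \; -print0 | tar --null -c -v -f ../diff.tar -T -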
Note: if the shell is POSIX (e.g., bash) you can also run find from the parent directory and get the path of the files relative from there, if you prefer:
$ mkdir -p A B
$ touch A/a A/b
$ touch B/a B/b B/c B/d
$ find B -type f -exec sh -c '[ ! -f A"${1#B}" ]' _ {} \; -print
B/c
B/d

linux show head of the first file from ls command

I have a folder, e.g. named 'folder'. There are 50000 txt files under it, e.g, '00001.txt, 00002.txt, etc'.
Now I want to use one command line to show the head 10 lines in '00001.txt'. I have tried:
ls folder | head -1
which will show the filename of the first:
00001.txt
But I want to show the contents of folder/00001.txt
So, how do I do something like os.path.join(folder, xx) and show its head -10?
The better way to do this is not to use ls at all; see Why you shouldn't parse the output of ls, and the corresponding UNIX & Linux question Why not parse ls (and what to do instead?).
On a shell with arrays, you can glob into an array, and refer to items it contains by index.
#!/usr/bin/env bash
# ^^^^- bash, NOT sh; sh does not support arrays
# make array files contain entries like folder/0001.txt, folder/0002.txt, etc
files=( folder/* ) # note: if no files found, it will be files=( "folder/*" )
# make sure the first item in that array exists; if it doesn't, that means
# the glob failed to expand because no files matching the string exist.
if [[ -e ${files[0]} || -L ${files[0]} ]]; then
    # file exists; pass the name to head
    head -n 10 <"${files[0]}"
else
    # file does not exist; spit out an error
    echo "No files found in folder/" >&2
fi
If you wanted more control, I'd probably use find. For example, to skip directories, the -type f predicate can be used (with -maxdepth 1 to turn off recursion):
IFS= read -r -d '' file < <(find folder -maxdepth 1 -type f -print0 | sort -z)
head -10 -- "$file"
Although it's hard to understand what you are asking, I think something like this will work:
head -10 "folder/$(ls folder | head -1)"
Basically, you get the file name from $(ls folder | head -1), prepend the folder, and then print the content.
If you instead let the shell expand a glob and invoke ls as ls -d "$PWD"/folder/*, each entry will be printed with its absolute path.
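For example (this still parses the output of ls, so it remains fragile with whitespace or glob characters in file names):
head -10 "$(ls -d "$PWD"/folder/* | head -1)"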

How to remove files from a directory if their names are not in a text file? Bash script

I am writing a bash script and want it to tell me if the names of the files in a directory appear in a text file and if not, remove them.
Something like this:
counter = 1
numFiles = ls -1 TestDir/ | wc -l
while [$counter -lt $numFiles]
do
if [file in TestDir/ not in fileNames.txt]
then
rm file
fi
((counter++))
done
So what I need help with is the if statement, which is still pseudo-code.
You can simplify your script logic a lot :
#!/bin/bash
# for loop to iterate over all files in the testdir
for file in TestDir/*
do
    # if grep exits with 1 (name not found in the text document), we delete the file
    grep -qxF -- "$(basename "$file")" fileNames.txt || rm "$file"
done
It looks like you've got a solution that works, but I thought I'd offer this one as well, as it might still be of help to you or someone else.
find /Path/To/TestDir -type f ! -name '.*' -exec basename {} + | grep -xvF -f /Path/To/filenames.txt
Breakdown
find: This gets file paths in the specified directory (which would be TestDir) that match the given criteria. In this case, I've specified it return only regular files (-type f) whose names don't start with a period (! -name '.*'). It then uses its -exec action to run the next command on the results:
basename: Given a file path (which is what find spits out), it will return the base filename only, or, more specifically, everything after the last /.
|: This is a pipe, which takes the output of the previous command and uses it as input to the next command.
grep: This is a regular-expression matching utility that, in this case, is given two lists of files: one fed in through the pipe from find—the files of your TestDir directory; and the files listed in filenames.txt. Ordinarily, the filenames in the text file would be used to match against filenames returned by find, and those that match would be given as the output. However, the -v flag inverts the matching process, so that grep returns those filenames that do not match.
What results is a list of files that exist in the directory TestDir, but do not appear in the filenames.txt file. These are the files you wish to delete, so you can simply use this pipeline inside a command substitution $(...) to supply rm with the files it should delete.
The full command chain—after you cd into TestDir—looks like this:
rm $(find . -type f ! -name '.*' -exec basename {} + | grep -xvF -f filenames.txt)

Bash - Directory dependent script

I am trying to run a python script in a directory and using bash apply this script to each of its subdirectories.
I found a script on Unix Stack Exchange that does it for one level of subdirectories here. But I want it to work recursively for all sub-directories.
The problem is I have a single wav.py in the parent directory but none in the sub-directories.
for d in ./*/ ; do (cd "$d" && python3 $1 SA1.wav); done
As you can see, $1 (wav.py) is the path to my Python file, set when I call the bash script. I would also like the path to be relative to how many levels of the subdirectory tree I have traversed. I know I can use an absolute path, but it will cause issues later on, so I'd like to avoid it.
Eg. for 1 level
for d in ./*/ ; do (cd "$d" && python3 "../$1" SA1.wav); done
for 2 levels
for d in ./*/ ; do (cd "$d" && python3 "../../$1" SA1.wav); done
Sorry if this seems trivial. I'm still new to bash.
Additional Info:
This is my full directory path:
root@Chiku-Y700:/mnt/e/Code/Python - WorkSpace/timit/TIMIT/TEST/DR1# bash recursive.sh wav.py suit rag
the full command I'm trying to run is:
python3 $1 SA1.wav $2 SA2.wav $3
$2 and $3 are unrelated to any directory info.
I get:
python3: can't open file '/mnt/e/Code/Python': [Errno 2] No such file or directory
This error came 12 times for 11 subdirectories.
Let's look at your command, with wav.py being $1:
for d in ./*/ ; do (cd "$d" && python3 $1 SA1.wav); done
Can we reduce the complexity by making wav.py executable and giving it a shebang, so that you can call it directly? Then you can move it into a directory on your PATH, or temporarily add its location to the PATH. It's generally a good habit for a script not to depend on the place from which it is invoked, and in particular not to need to sit in the same directory as the data it is called on.
PATH=$PWD:$PATH
for d in ./*/ ; do (cd "$d" && wav.py SA1.wav); done
The input data shouldn't be tied to the current directory either: you should be able to call the script from any dir, with the data sitting in an arbitrary dir, too:
for d in ./*/ ; do wav.py $d/SA1.wav; done
If the program produces an output file that is currently written to the working directory, you should either derive the output dir from the input dir (if that is what you always want), or let the user specify an output dir; a sensible default might still be the input dir or the current dir. Or you write to stdout and redirect the output to a file of your choice, as sketched below.
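For instance, something along these lines (hypothetical, assuming wav.py prints its result to stdout):
for d in ./*/ ; do wav.py "${d}SA1.wav" > "${d}SA1.out"; done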
But your full command is:
python3 $1 SA1.wav $2 SA2.wav $3
That's fine for simple commands, but maybe you can name these parameters in a meaningful way:
pyprog="$1"
samplerate="$2"
log="$3"
python3 $pyprog SA1.wav $samplerate SA2.wav $log
or, as done before
$pyprog SA1.wav "$samplerate" SA2.wav "$log"
Then, John1024's solution might work:
find . -type d -execdir "$pyprog" SA1.wav "$samplerate" SA2.wav "$log" ";"
If changing pyprog is not an option, there is a second approach to solve the problem:
Write a wrapper script, which takes the directory to work in as a parameter and test it with different depths of directories.
Then call that wrapper by find:
find . -type d -exec ./wrapper.sh {} ";"
The wrapper.sh should start with:
#!/bin/bash
#
#
directory="$1"
and use it where needed.
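A minimal sketch of such a wrapper (hypothetical; it resolves wav.py once at the top level, so the directory depth no longer matters, and it skips directories without the input file):
#!/bin/bash
# wrapper.sh -- run the Python program inside the directory passed as $1
# invoked as: find . -type d -exec ./wrapper.sh {} ";"
directory="$1"
pyprog="$PWD/wav.py"              # resolved once, where find is started
(
    cd "$directory" || exit 1
    [ -f SA1.wav ] || exit 0      # nothing to do in directories without the input file
    python3 "$pyprog" SA1.wav
)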
Btw.: I would also rename Python - WorkSpace to Python-WorkSpace (even better: python-workspace), because blanks in file and path names always cause trouble.

Trouble iterating through all files in directory

Part of my Bash script's intended function is to accept a directory name and then iterate through every file.
Here is part of my code:
#! /bin/bash
# sameln --- remove duplicate copies of files in specified directory
D=$1
cd $D #go to directory specified as default input
fileNum=0 #save file numbers
DIR=".*|*"
for f in $DIR #for every file in the directory
do
files[$fileNum]=$f #save that file into the array
fileNum=$((fileNum+1)) #increment the fileNum
echo aFile
done
The echo statement is for testing purposes. I passed as an argument the name of a directory with four regular files, and I expected my output to look like:
aFile
aFile
aFile
aFile
but the echo statement only shows up once.
A single operation
Use find for this, it's perfect for it.
find <dirname> -maxdepth 1 -type f -exec echo "{}" \;
The flags explained: -maxdepth defines how deep in the hierarchy you want to look (dirs in dirs in dirs), -type f selects files, as opposed to -type d for dirs. And -exec allows you to process each found file/dir, which can be accessed through {}. You can alternatively pass it to a bash function to perform more tasks.
This simple bash script takes a dir as argument and lists all its files:
#!/bin/bash
find "$1" -maxdepth 1 -type f -exec echo "{}" \;
Note that the last line is essentially equivalent to find "$1" -maxdepth 1 -type f -print.
Performing multiple tasks
Using find one can also perform multiple tasks by either piping to xargs or while read, but I prefer to use a function. An example:
#!/bin/bash
function dostuff {
    # echo filename
    echo "filename: $1"
    # remove extension from file
    mv "$1" "${1%.*}"
    # get containing dir of file
    dir="${1%/*}"
    # get filename without containing dirs
    file="${1##*/}"
    # do more stuff like echoing results
    echo "containing dir = $dir and file was called $file"
}; export -f dostuff
# export the function so you can call it in a subshell (important!!!)
find . -maxdepth 1 -type f -exec bash -c 'dostuff "$1"' _ {} \;
Note that the function needs to be exported, as you can see. This is so you can call it in a subshell, which is opened by executing bash -c 'dostuff ...'. To test it out, I suggest you comment out the mv command in dostuff, otherwise you will remove all your extensions haha.
Also note that this is safe for weird characters like spaces in filenames so no worries there.
Closing note
If you decide to go with the find command, which is a great choice, I advise you to read up on it, because it is a very powerful tool. A simple man find will teach you a lot, and you will learn a lot of useful options. You can, for instance, make find quit once it has found a result, which is a handy and fast way to check whether dirs contain videos or not. It's truly an amazing tool that can be used on various occasions, and often you'll be done with a one-liner (kind of like awk).
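For example, with GNU find the -quit action stops the search after the first hit, which makes such a quick check cheap (the movies/ path and *.mp4 pattern are just placeholders):
find movies/ -type f -name '*.mp4' -print -quit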
You can directly read the files into the array, then iterate through them:
#! /bin/bash
cd "$1"
files=(*)
for f in "${files[@]}"
do
    echo "$f"
done
If you are iterating only files below a single directory, you are better off using simple filename/path expansion to avoid certain uncommon filename issues. The following will iterate through all files in a given directory passed as the first argument (default ./):
#!/bin/bash
srchdir="${1:-.}"
for i in "$srchdir"/*; do
    printf " %s\n" "$i"
done
If you must iterate below an entire subtree that includes numerous branches, then find will likely be your only choice. However, be aware that using find or ls to populate a for loop brings with it the potential for problems with embedded characters such as a \n within a filename, etc. See Why for i in $(find . -type f) is wrong even though unavoidable at times.
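If you do have to recurse, a commonly used safe pattern in bash is to feed find's NUL-delimited output into a while read loop, which survives spaces and even newlines in file names; a sketch reusing $srchdir from above:
while IFS= read -r -d '' i; do
    printf " %s\n" "$i"
done < <(find "$srchdir" -type f -print0)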
