grep for two patterns independently (in different lines) - bash

I have some directories with the following structure:
DAY1/ # Files under this directory should have DAY1 in the name.
|-- Date
| |-- dir1 # Something wrong here, there are files with DAY2 and files with DAY1.
| |-- dir2
| |-- dir3
| |-- dir4
DAY2/ # Files under this directory should all have DAY2 in the name.
|-- Date
| |-- dir1
| |-- dir2 # Something wrong here, there are files with DAY2, and files with DAY1.
| |-- dir3
| |-- dir4
In each dir there are hundreds of thousands of files with names containing DAY, for example 0.0000.DAY1.01927492. Files with DAY1 in the name should only appear under the parent directory DAY1.
Something went wrong when copying files around, so that I now have mixed files with DAY1 and DAY2 in some of the dir directories.
I wrote a script to find folders that contain mixed files, so I can then look at them more closely. My script is the following:
for directory in */; do
if ls $directory | grep -q DAY2 ; then
if ls $directory | grep -q DAY1; then
echo "mixed files in $directory";
fi ;
fi;
done
The problem here is that I'm going through all files twice, which doesn't make sense considering that I'd only have to look through the files once.
What would be a more efficient way to achieve what I want?

If I understand you correctly, you need to recursively find the files under the DAY1 directory that have DAY2 in their names, and likewise the files under the DAY2 directory that have DAY1 in their names.
If so, for DAY1 directory:
find DAY1/ -type f -name '*DAY2*'
this will get you the files under DAY1 directory that have DAY2 in their names. Similarly for DAY2 directory:
find DAY2/ -type f -name '*DAY1*'
Both are recursive operations.
To get the directory names only:
find DAY1/ -type f -name '*DAY2*' -exec dirname {} +
Note that the current directory ($PWD) will be shown as . (a single dot).
To get uniqueness, pass the output to sort -u:
find DAY1/ -type f -name '*DAY2*' -exec dirname {} + | sort -u
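The two searches above differ only in the root directory and the tag, so they can be wrapped in one small function. This is a minimal sketch, not part of the original answer: the function name dirs_with_tag is hypothetical, and it assumes a dirname that accepts several operands (as GNU coreutils does) and file names without newlines.

```shell
# Print each directory under ROOT that holds at least one regular file
# whose name contains TAG. One line per directory, duplicates removed.
# Usage: dirs_with_tag ROOT TAG
dirs_with_tag() {
    find "$1" -type f -name "*$2*" -exec dirname {} + | sort -u
}

# Hypothetical usage for the layout in the question:
#   dirs_with_tag DAY1 DAY2   # directories under DAY1 that contain DAY2 files
#   dirs_with_tag DAY2 DAY1   # directories under DAY2 that contain DAY1 files
```

If nothing matches, find never runs dirname and the function prints nothing.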

Given that the difference between going through them once and going through them twice is just a factor-of-two difference, changing to an approach that goes through them only once might actually not be a win, since the new approach might easily take twice as long per file.
So you'll definitely want to experiment; it's not necessarily something that you can confidently reason about.
However, I will say that in addition to going through the files twice, the ls version also sorts the files, which probably has a more-than-linear cost (unless it's doing some kind of bucket-sort). Eliminating that, by writing ls --sort=none instead of just ls, will actually improve your algorithmic complexity, and is almost certain to give a tangible improvement.
But FWIW, here's a version that only goes through the files once, that you can try:
for directory in */; do
    find "$directory" -maxdepth 1 \( -name '*DAY1*' -or -name '*DAY2*' \) -print0 \
    | { saw_day1=
        saw_day2=
        while IFS= read -d '' subdirectory ; do
            if [[ "$subdirectory" == *DAY1* ]] ; then
                saw_day1=1
            fi
            if [[ "$subdirectory" == *DAY2* ]] ; then
                saw_day2=1
            fi
            if [[ "$saw_day1" ]] && [[ "$saw_day2" ]] ; then
                echo "mixed files in $directory"
                break
            fi
        done
      }
done
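Another option worth timing is to let find stop at the first match for each tag, so that neither check has to list the whole directory. The sketch below is an assumption-laden alternative, not the answer's method: the names has_tag and is_mixed are hypothetical, and -quit is a find extension (available in GNU find) rather than POSIX. -mindepth 1 keeps the directory's own name out of the match.

```shell
# True if DIR directly contains at least one entry whose name contains TAG.
# find exits at the first hit thanks to -quit.
# Usage: has_tag DIR TAG
has_tag() {
    [ -n "$(find "$1" -mindepth 1 -maxdepth 1 -name "*$2*" -print -quit)" ]
}

# True if DIR holds both DAY1 and DAY2 entries.
# Usage: is_mixed DIR
is_mixed() {
    has_tag "$1" DAY1 && has_tag "$1" DAY2
}

# Hypothetical usage, mirroring the loop in the question:
#   for directory in */; do
#       if is_mixed "$directory"; then echo "mixed files in $directory"; fi
#   done
```

Whether this beats the single-pass loop depends on how early the first DAY1 and DAY2 entries turn up, so it still needs measuring on the real data.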

Related

Rename certain portion of filepaths in current directory recursively

Let's assume I have following directory tree:
.
|-- foo
`-- foodir
|-- bardir
| |-- bar
| `-- foo
|-- foo -> bardir/foo
`-- foodir
|-- bar
`-- foo
3 directories, 6 files
How can I rename all foo into buz, including symlinks? like:
.
|-- buz
`-- buzdir
|-- bardir
| |-- bar
| `-- buz
|-- buz -> bardir/buz
`-- buzdir
|-- bar
`-- buz
3 directories, 6 files
At first glance I thought this would be relatively easy, but it turned out to be unexpectedly tough.
Firstly, I tried to mv around all files using git ls-files:
$ for file in $(git ls-files '*foo*'); do mv "$file" "${file//foo/buz}"; done
This gave me a bunch of errors saying that I have to create the new directories before doing so:
mv: cannot move 'foodir/bardir/bar' to 'buzdir/bardir/bar': No such file or directory
mv: cannot move 'foodir/bardir/foo' to 'buzdir/bardir/buz': No such file or directory
mv: cannot move 'foodir/foo' to 'buzdir/buz': No such file or directory
mv: cannot move 'foodir/foodir/bar' to 'buzdir/buzdir/bar': No such file or directory
mv: cannot move 'foodir/foodir/foo' to 'buzdir/buzdir/buz': No such file or directory
I didn't want to deal with cleaning up empty directories after copying, so I tried find -exec, expecting it to handle the renaming while finding files based on their names.
$ find . -path .git -prune -o -name '*foo*' -exec bash -c 'mv "$0" "${0//foo/buz}"' "{}" \;
But find still seems to try to descend into paths that have already been renamed:
find: ./foodir: No such file or directory
My final solution is to find only the first matching file/directory for every single mv command.
#!/bin/bash
# Rename file paths recursively, one match per find invocation
while :; do
    # paths printed by "find ." start with "./", so the .git pattern needs it too
    path=$(find . -path ./.git -prune -o -name '*foo*' -print -quit)
    if [ -z "$path" ]; then
        break
    fi
    if ! mv "$path" "${path/foo/buz}"; then
        break
    fi
done
# Change symlink targets as well
find . -path ./.git -prune -o -type l -exec bash -c '
    target=$(readlink "$0")
    if [ "$target" != "${target//foo/buz}" ]; then
        # pass the link itself ("$0") as the name to re-create
        ln -sfn "${target//foo/buz}" "$0"
    fi
' "{}" \;
This is kind of lame, but it works as I expected. So my questions are:
Can I assume find always outputs a directory before its subdirectories/files?
Is there any chance to avoid using find multiple times?
Thank you.
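On the two questions: by default find reports a directory before its contents, and the POSIX -depth option reverses that so that everything inside a directory is reported before the directory itself. That ordering makes a single find pass possible, because a parent still has its old name while its children are renamed. The following is only a sketch under assumptions: the function names rename_deepest_first and retarget_symlinks are hypothetical, OLD is treated as a literal string without glob characters, and read -d '' requires bash. Since -prune has no effect together with -depth, .git contents are filtered out with a -path test instead, which means find still walks into .git.

```shell
# Rename every entry below ROOT whose name contains OLD, replacing OLD with NEW.
# Usage: rename_deepest_first ROOT OLD NEW
rename_deepest_first() {
    local root=$1 old=$2 new=$3 path dir base new_base
    # -depth: contents first, then the directory, so parents keep their
    # old names until all of their children have been handled.
    find "$root" -mindepth 1 -depth ! -path '*/.git/*' -name "*$old*" -print0 |
    while IFS= read -r -d '' path; do
        dir=${path%/*}                  # everything before the last slash
        base=${path##*/}                # last path component
        new_base=${base//"$old"/$new}   # only the last component changes
        mv -- "$path" "$dir/$new_base"
    done
}

# Re-create symlinks whose target text mentions OLD so they point at NEW.
# Usage: retarget_symlinks ROOT OLD NEW
retarget_symlinks() {
    local root=$1 old=$2 new=$3 link target new_target
    find "$root" ! -path '*/.git/*' -type l -print0 |
    while IFS= read -r -d '' link; do
        target=$(readlink "$link")
        new_target=${target//"$old"/$new}
        if [ "$target" != "$new_target" ]; then
            ln -sfn -- "$new_target" "$link"
        fi
    done
}

# Hypothetical usage for the tree in the question:
#   rename_deepest_first . foo buz
#   retarget_symlinks . foo buz
```

Renaming only the last path component is what avoids the "No such file or directory" errors: the directory part of each path has not been touched yet when its entry is moved.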

Rename files based on their parent directory in Bash

I've been trying to piece together a couple of previous posts for this task.
The directory tree looks like this:
TEST
|ABC_12345678
3_XYZ
|ABC_23456789
3_XYZ
etc
Each folder within the parent folder named "TEST" always starts with ABC_\d{8}; the 8 digits are always different. Within each ABC_\d{8} folder there is always a folder named 3_XYZ, which always contains a file named "MD2_Phd.txt". The goal is to rename each "MD2_PhD.txt" file using the 8-digit ID found in the ABC folder name, i.e. "\d{8}_PhD.txt".
After several iterations on various bits of code from different posts this is the best I can come up with,
cd /home/etc/Desktop/etc/TEST
find -type d -name 'ABC_(\d{8})' |
find $d -name "*_PhD.txt" -execdir rename 's/MD2$/$d/' "{}" \;
done
find + bash solution:
find -type f -regextype posix-egrep -regex ".*/TEST/ABC_[0-9]{8}/3_XYZ/MD2_Phd\.txt" \
-exec bash -c 'abc="${0%/*/*}"; fp="${0%/*}/";
mv "$0" "$fp${abc##*_}_PhD.txt" ' {} \;
Viewing results:
$ tree TEST/ABC_*
TEST/ABC_12345678
└── 3_XYZ
└── 12345678_PhD.txt
TEST/ABC_1234ss5678
└── 3_XYZ
└── MD2_Phd.txt
TEST/ABC_23456789
└── 3_XYZ
└── 23456789_PhD.txt
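The find + bash answer relies on three parameter expansions, which are easier to follow when tried on a single path. The sketch below just isolates them; the function name new_path_for is hypothetical and the sample path is made up to match the tree shown above.

```shell
# Compute the new path for a file located at .../ABC_<id>/3_XYZ/<name>.
# Usage: new_path_for PATH
new_path_for() {
    local path=$1 abc fp
    abc=${path%/*/*}    # drop the last two components -> .../ABC_12345678
    fp=${path%/*}/      # directory holding the file  -> .../ABC_12345678/3_XYZ/
    # ${abc##*_} strips everything up to the last underscore, leaving the ID
    printf '%s\n' "$fp${abc##*_}_PhD.txt"
}

new_path_for ./TEST/ABC_12345678/3_XYZ/MD2_Phd.txt
# prints ./TEST/ABC_12345678/3_XYZ/12345678_PhD.txt
```

Because ${abc##*_} only looks for the last underscore, it is the -regex test in the find command that guarantees the remainder really is eight digits.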
You are piping find output to another find. That won't work.
Use a loop instead:
dir_re='^.+_([[:digit:]]{8})/'
for file in *_????????/3_XYZ/MD2_PhD.txt; do
    [[ -f $file ]] || continue
    if [[ $file =~ $dir_re ]]; then
        dir_num="${BASH_REMATCH[1]}"
        new_name="${file%MD2_PhD.txt}${dir_num}_PhD.txt" # replace the MD2 prefix with the directory's 8-digit ID
        echo mv "$file" "$new_name" # remove echo from here once tested
    fi
done

Recursive deletion of a base folder when none of its content has been modified in the last n days

I have some folders which need my regular attention:
dir1
dir2
dir3
I know that I can safely remove them if anything inside has not been changed for 10 days. But how to do it?
I wanted to use "find /basedir/ -maxdepth 1 -mtime +10 -print | xargs -1 rm -f"
But this deletes directories whose own modification time is old, even if something inside them was modified recently. Incomplete content in any of dir1, dir2, or dir3 is useless, so I need to decide whether to delete or keep each of dir1-3 as a whole, based on how old its entire content is.
Does anyone know an easy way to do it?
Don't use -maxdepth. Iterate over the directories in a for loop:
for dir in /basedir/dir{1,2,3} ; do
    if ! find "$dir" -mtime -10 | grep -q ^ ; then
        rm -rf "$dir"
    fi
done
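Since rm -rf is irreversible, it may be worth listing the candidates first. The sketch below reuses the same find | grep -q test but only prints the directories that would be removed; the function names is_stale and list_stale are hypothetical, and the day count is passed as an argument instead of being fixed at 10.

```shell
# True when nothing under DIR (DIR itself included) was modified
# within the last DAYS days.
# Usage: is_stale DIR DAYS
is_stale() {
    ! find "$1" -mtime "-$2" | grep -q '^'
}

# Print the stale directories among the arguments instead of deleting them.
# Usage: list_stale DAYS DIR...
list_stale() {
    local days=$1 dir
    shift
    for dir in "$@"; do
        if is_stale "$dir" "$days"; then
            printf '%s\n' "$dir"
        fi
    done
}

# Hypothetical usage:
#   list_stale 10 /basedir/dir{1,2,3}
```

Once the printed list looks right, the printf line can be swapped for the rm -rf from the answer.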

How to remove intermediate folders containing only one folder each?

I had been playing around with mv, and now I have a situation.
Earlier, say
Folder1 had file1,2,3.
Now Folder1 has Folder2 which has Folder3 which has Folder4 which contains file1,2,3.
I am trying to write a bash script such that it identifies intermediate folders containing only 1 directory and moves all its contents up one level, ultimately giving back only Folder1->file1,2,3, and rest folders deleted.
I tried to write something like the script below, but I am:
1. unable to distinguish between a file and a folder
2. unable to find the name of the file/directory stored inside the current folder
3. not sure how to do this recursively.
#!/bin/bash
echo "Directory Name?"
read dir_name
no_files=`ls -A| wc -l`
if [ $no_file==1 ] && [ itisaDirectory()];
then `mv folder_name/* dir_name`
fi
If you do not care about error messages and just want to move all files in the subdirectories to the current directory and remove the remaining empty directories, do something like
find . -type f -exec mv {} "${dir_name}" \; 2>/dev/null
rm -r */
You asked for something different, though: only move files up when the intermediate directory is the only entry at its level. That is the case if exactly one subdir has that dir as its parent. The parent of a dir can be found with dirname.
When a dir has one subdir, only one subdir will have it as a parent. You can list all dirs, look for the parent and select the unique paths.
find . -type d -exec dirname {} \; | sort | uniq -u | while read dir; do
echo "${dir} has exactly one subdir"
done
The problem is that the dir can have files as well. We try to improve the above solution:
find . -exec dirname {} \; | sort | uniq -u | while read dir; do
echo "${dir} has exactly one subdir or one file"
done
You can test the content of the dir with if [ -d "${dir}/*" ] but I do not need to know:
find . -exec dirname {} \; | sort | uniq -u | while read dir; do
echo "${dir} has exactly one subdir or one file"
find "${dir}"/*/ -type f -exec mv {} "${dir_name}" \; 2>/dev/null
done
The path ${dir}/*/ will only exist when ${dir} has a subdirectory in it, and will move the files beneath. When $dir only has one file, the find command will find nothing.
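For the exact case in the question (Folder1 containing only Folder2, which contains only Folder3, and so on), a loop that repeatedly lifts the contents of the single subdirectory may be simpler to reason about. This is a hedged sketch rather than the method above: the function name collapse_chain and the temporary name .collapse.$$ are hypothetical, -mindepth and -maxdepth are find extensions, and file names are assumed not to contain newlines.

```shell
# While TOP contains exactly one entry and that entry is a directory,
# move that directory's contents up into TOP and remove it.
# Usage: collapse_chain TOP
collapse_chain() {
    local top=$1 only tmp
    while :; do
        # Stop unless TOP holds exactly one entry (hidden entries count too).
        [ "$(find "$top" -mindepth 1 -maxdepth 1 | wc -l)" -eq 1 ] || break
        # Stop if that single entry is not a real directory.
        only=$(find "$top" -mindepth 1 -maxdepth 1 -type d)
        [ -n "$only" ] || break
        # Park the subdirectory under a temporary name so that a child with
        # the same name as its parent cannot collide while moving up.
        tmp=$top/.collapse.$$
        mv -- "$only" "$tmp" || return 1
        find "$tmp" -mindepth 1 -maxdepth 1 -exec mv -- {} "$top"/ \;
        rmdir -- "$tmp" || return 1
    done
}

# Hypothetical usage:
#   collapse_chain Folder1
```

The loop stops as soon as the top folder holds more than one entry or a single file, so a folder that already contains file1,2,3 directly is left untouched.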

Diff which folders are missing

I have two folders each which contain a lot of subfolders. Most of the files and folders should be the same, but I want to be able to figure out which folders are missing.
For example:
Folder1/
A/
1.jpg
B/
2.jpg
C/
3.jpg
D/
4.jpg
and
Folder2/
A/
1.jpg
E/
2.jpg
C/
3.jpg
D/
4.jpg
Is there any way to know that "B" got deleted? I'm running windows, but I have cygwin installed so bash scripts, diff, or python/perl would work.
I know I can just "diff -q -r Folder1 Folder2" everything in both folders, but that takes FOREVER and spits out everything that's changed, including files in those folders, where I just need the folders themselves.
Any suggestions?
Thanks!
diff -u <(cd Folder1 ; find | sort) <(cd Folder2 ; find | sort)
Some notes:
This would include files that are added/removed, but not files that are merely modified. If you don't even want to include files that are added/removed, change find to find -type d, as herby suggests.
If a given directory is added/removed, this will also list out all the files and directories within that directory. If that's a problem, you can address it by appending something like | perl -ne 'print unless m/^\Q$previous\E\//; $previous = $_;'.
Barron's answer makes me realize that you didn't actually specify that you need the folders to be examined recursively. If you just need the very top level, you can change find to either find -maxdepth 1 (or find -maxdepth 1 -type d) or ls -d * (or ls -d */), as you prefer.
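If only the directories missing from the second tree are of interest, comm can replace diff so that the output contains nothing but those paths. This is a small sketch in the same spirit as the command above, not part of the original answer; the function name missing_dirs is hypothetical, and process substitution requires bash.

```shell
# Print directories (as relative paths) that exist under OLD but not under NEW.
# comm -23 keeps only the lines unique to its first input; both inputs
# must be sorted the same way, hence the two sort calls.
# Usage: missing_dirs OLD NEW
missing_dirs() {
    comm -23 <(cd "$1" && find . -type d | sort) \
             <(cd "$2" && find . -type d | sort)
}

# Hypothetical usage for the example in the question:
#   missing_dirs Folder1 Folder2
```

As with the diff version, a missing directory is listed together with every directory below it.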
(cd Folder1 && find . -type d >/tmp/$$.1)
(cd Folder2 && find . -type d >/tmp/$$.2)
diff /tmp/$$.1 /tmp/$$.2
rm /tmp/$$.1 /tmp/$$.2
This is how I hacked it together in bash:
dirs=`ls $PWD/Folder1`
for dir in ${dirs[*]}; do
if [ ! -e $PWD/Folder2/$dir ]; then
echo "$dir missing"
fi
done
I make no claim that this is an ideal solution, but since I'm also learning bash, I'd be interested to hear why this is a particularly good or bad way to go about it.
If you really want only one level of nesting, you can do this:
(cd Folder1 && find -type d -mindepth 1 -maxdepth 1) >list1
(cd Folder2 && find -type d -mindepth 1 -maxdepth 1) >list2
while read dir; do
fgrep -qx "$dir" list2 || echo "\"$dir\" has been deleted"
done <list1
If you are sure only to have directories in both folders, replace the find commands with a simple ls.
