Recursively check length of directory name - bash

I need to determine if there are any directory names > 31 characters in a given directory (i.e. look underneath that root).
I know I can use something like find /path/to/root/dir -type d >> dirnames.txt
This will give me a text file of complete paths.
What I need is to get the actual number of characters in each directory name. Not sure if parsing the above results w/sed or awk makes sense. Looking for ideas/thoughts/suggestions/tips on how to accomplish this. Thanks!

This short script does it all in one go, i.e. finds all directory names and then outputs any which are greater than 31 characters in length (along with their length in characters):
# read NUL-delimited paths so names containing spaces survive intact
find /path/to/root/dir -type d -print0 | while IFS= read -r -d '' d ; do
    name=${d##*/}      # last path component, like basename
    len=${#name}       # length in characters, with no trailing newline
    if [ "$len" -gt 31 ] ; then
        echo "$name = $len characters"
    fi
done

Using your dirnames.txt file created by your find cmd, you can then sort the data by length of pathname, i.e.
awk '{print length($0) "\t" $0}' dirnames.txt | sort -k1,1nr > dirNamesWithSize.txt
This will present the longest path names (based on the value of length) at the top of the file.
I hope this helps.

Try this
find . -type d -exec bash -c '[ $(wc -c <<<"${1##*/}") -gt 32 ] && echo "${1}"' -- {} \; 2>/dev/null
The one bug, which I consider minor, is that it will over-count directory name length by 1 every time.
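If the off-by-one bothers you, a variant that measures the name with the shell's ${#var} length expansion instead of wc -c avoids counting the trailing newline (a sketch of the same approach):
find . -type d -exec bash -c 'n=${1##*/}; [ ${#n} -gt 31 ] && echo "$1"' -- {} \; 2>/dev/null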
If what you wanted was the whole path rather than the last path component, then use this:
find . -type d | sed -e '/.\{32,\}/!d'
This version also has a bug, but only when file names have embedded newlines.
The output of both commands is a list of file names which match the criteria. Counting the length of each one is trivial from there.
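For instance, to print each matching path together with its length, either command can be piped through a small awk stage (a sketch; awk's length($0) counts the characters of the whole line):
find . -type d | sed -e '/.\{32,\}/!d' | awk '{ print length($0) "\t" $0 }'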


Bash find: exec in reverse order

I am iterating over files like so:
find "$directory" -type f -exec codesign {} \;
Now the problem here is that files higher up in the hierarchy are signed first.
Is there a way to iterate over a directory tree and handle the deepest files first?
So that
/My/path/to/app/bin
is handled before
/My/path/mainbin
Yes, just use -depth:
-depth
The primary shall always evaluate as true; it shall cause descent of the directory hierarchy to be done so that all entries in a directory are acted on before the directory itself. If a -depth primary is not specified, all entries in a directory shall be acted on after the directory itself. If any -depth primary is specified, it shall apply to the entire expression even if the -depth primary would not normally be evaluated.
For example:
$ mkdir -p top/a/b/c/d/e/f/g/h
$ find top -print
top
top/a
top/a/b
top/a/b/c
top/a/b/c/d
top/a/b/c/d/e
top/a/b/c/d/e/f
top/a/b/c/d/e/f/g
top/a/b/c/d/e/f/g/h
$ find top -depth -print
top/a/b/c/d/e/f/g/h
top/a/b/c/d/e/f/g
top/a/b/c/d/e/f
top/a/b/c/d/e
top/a/b/c/d
top/a/b/c
top/a/b
top/a
top
Note that at a particular level, ordering is still arbitrary.
Using GNU utilities and the decorate-sort-undecorate pattern (aka Schwartzian transform):
find . -type f -printf '%d %p\0' |
sort -znr |
sed -z 's/[0-9]* //' |
xargs -0 -I# echo codesign #
Drop the echo if the output looks ok.
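To see what the decorated stream looks like before the undecorate step, you can swap the NULs for newlines (hypothetical paths shown; %d is the depth below the starting point):
$ find . -type f -printf '%d %p\0' | tr '\0' '\n'
3 ./a/b/x.txt
1 ./y.txt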
Using find's -depth option as in my other answer, or a naive sort as in some other answers, only ensures that sub-directories of a directory are processed before the directory itself, not that the deepest level overall is processed first.
For example:
$ mkdir -p top/a/b/d/f/h top/a/c/e/g
$ find top -depth -print
top/a/c/e/g
top/a/c/e
top/a/c
top/a/b/d/f/h
top/a/b/d/f
top/a/b/d
top/a/b
top/a
top
For overall deepest level to be processed first, the ordering should be something like:
top/a/b/d/f/h
top/a/c/e/g
top/a/b/d/f
top/a/c/e
top/a/b/d
top/a/c
top/a/b
top/a
top
To determine this ordering, the entire list must be known, and then the number of levels (i.e. the number of / characters) in each path counted to enable ranking.
A simple-ish Perl script (assigned to a shell function for this example) to do this ordering is:
$ dsort(){
    perl -ne '
        BEGIN { $/ = "\0" }            # null-delimited i/o
        $fname[$.] = $_;               # remember each path by record number
        $depth[$.] = tr|/||;           # count the / characters = depth
        END {
            print
                map  { $fname[$_] }
                sort { $depth[$b] <=> $depth[$a] }
                keys @fname
        }
    '
}
Then:
$ find top -print0 | dsort | xargs -0 -I# echo #
top/a/b/d/f/h
top/a/c/e/g
top/a/b/d/f
top/a/c/e
top/a/b/d
top/a/c
top/a/b
top/a
top
How about sorting the output of find in descending order:
while IFS= read -d "" -r f; do
    codesign "$f"
done < <(find "$directory" -type f -print0 | sort -zr)
<(command ...) is a process substitution which feeds the output
of the command to the read command in the while loop via the redirect.
The -print0, sort -z and read -d "" combination uses a null character
as the filename delimiter. This protects filenames which include
special characters such as whitespace.
I don't know of a native way in find, but you can pipe its output into a loop and process it line by line as you wish:
find . | while IFS= read -r file; do echo filename: "$file"; done
In your case, if you are happy just reversing the output of find (and your filenames contain no newlines), you can go with something like:
find "$directory" -type f | tac | while IFS= read -r file; do codesign "$file"; done

Script for printing out file names and their number of appearance starting from a given folder

I need to write a shell script which, given a folder name as an argument, prints out the names of the folders and files in it, and how many times each name appears in the given folder.
Edit: I need to check only the names, without taking the file extensions into consideration.
#!/bin/bash
folder="$1"
for f in "$folder"
do
echo "$f"
done
And I would expect to see something like this (if I have 3 files with the same name and different extensions, like x.html, x.css, x.sh, and so on, in a directory called dir)
x
3 times
after executing the script with dir (the name of the directory) as a parameter.
The find command already does most of this for you.
find . -printf "%f\n" |
sort | uniq -c
This will not work correctly if you have files whose names contain a newline.
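If GNU sort and uniq are available, a NUL-delimited variant sidesteps the newline problem (a sketch; the -z flags are GNU extensions, and the final tr is only there to make the output printable):
find . -printf '%f\0' | sort -z | uniq -zc | tr '\0' '\n'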
If your find doesn't support -printf, maybe try
find . -exec basename {} \; |
sort | uniq -c
To restrict to just file names or directory names, add -type f or -type d, respectively, before the action (-exec or -printf).
If you genuinely want to remove extensions, try
find .... whatever ... |
sed 's%\.[^./]*$%%' |
sort | uniq -c
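Putting those pieces together for files only, counting names with extensions stripped, that would be roughly:
find . -type f -printf '%f\n' | sed 's%\.[^./]*$%%' | sort | uniq -c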
You can try this:
#!/bin/bash
IFS=$'\n' array=($(ls))
iter=0;
for file in ${array[*]}; do
    filename=$(basename -- "$file")
    extension="${filename##*.}"
    filename="${filename%.*}"
    filenamearray[$iter]=$filename
    iter=$((iter+1))
done
for filename in "${filenamearray[@]}"; do
    echo "$filename";
    grep -o "$filename" <<< "${filenamearray[@]}" | wc -l
done
You can try with find and awk:
find . -type f -print0 |
awk '
    BEGIN {
        FS = "/"
        RS = "\0"
    }
    {
        k = split( $NF , b , "." )
        if ( k > 1 )
            sub ( "\\." b[k] "$" , "" , $NF )   # strip the literal dot + extension at the end
        a[$NF]++
    }
    END {
        for ( i in a ) {
            j = a[i]>1 ? "s" : ""
            print i
            print a[i] " time" j
        }
    }'

find only the first file from many directories

I have a lot of directories:
13R
613
AB1
ACT
AMB
ANI
Each directory contains a lot of files:
20140828.13R.file.csv.gz
20140829.13R.file.csv.gz
20140830.13R.file.csv.gz
20140831.13R.file.csv.gz
20140901.13R.file.csv.gz
20131114.613.file.csv.gz
20131115.613.file.csv.gz
20131116.613.file.csv.gz
20131117.613.file.csv.gz
20141114.ab1.file.csv.gz
20141115.ab1.file.csv.gz
20141116.ab1.file.csv.gz
20141117.ab1.file.csv.gz
etc..
The purpose is to get the first file from each directory.
The result I expect is:
13R|20140828
613|20131114
AB1|20141114
That is, the name of the directory, a pipe, then the date from the filename.
I guess I need find and head plus awk, but I can't make it work. I need your help.
Here is what I have tested:
for f in $(ls -1);do ls -1 $f/ | head -1;done
But the folder name is missing.
By the first file, I mean the first file returned in alphabetical order within the folder.
Thanks.
You can do this with a Bash loop.
Given:
/tmp/test
/tmp/test/dir_1
/tmp/test/dir_1/file_1
/tmp/test/dir_1/file_2
/tmp/test/dir_1/file_3
/tmp/test/dir_2
/tmp/test/dir_2/file_1
/tmp/test/dir_2/file_2
/tmp/test/dir_2/file_3
/tmp/test/dir_3
/tmp/test/dir_3/file_1
/tmp/test/dir_3/file_2
/tmp/test/dir_3/file_3
/tmp/test/file_1
/tmp/test/file_2
/tmp/test/file_3
Just loop through the directories and form an array from a glob and grab the first one:
prefix="/tmp/test"
cd "$prefix"
for fn in dir_*; do
    cd "$prefix"/"$fn"
    arr=(*)
    echo "$fn|${arr[0]}"
done
Prints:
dir_1|file_1
dir_2|file_1
dir_3|file_1
If your definition of 'first' is different from Bash's, just sort the array arr according to your definition before taking the first element.
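For example, to impose an explicit ordering instead of relying on the shell's glob order, you could fill the array with a sorted list (a sketch assuming bash 4's mapfile, GNU sort, and file names without newlines):
mapfile -t arr < <(printf '%s\n' * | sort -r)   # e.g. reverse alphabetical
echo "$fn|${arr[0]}"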
You can also do this with find and awk:
$ find /tmp/test -mindepth 2 -print0 | awk -v RS="\0" '{s=$0; sub(/[^/]+$/,"",s); if (s in paths) next; paths[s]; print $0}'
/tmp/test/dir_1/file_1
/tmp/test/dir_2/file_1
/tmp/test/dir_3/file_1
And insert a sort (or use gawk) to sort as desired
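For example, with GNU sort's -z inserted between find and awk (a sketch):
find /tmp/test -mindepth 2 -print0 | sort -z |
awk -v RS="\0" '{s=$0; sub(/[^/]+$/,"",s); if (s in paths) next; paths[s]; print $0}'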
sort has a unique option (-u). Only the directory part should be unique, so sort on the first /-separated field with -k1,1. This works when the list of files is already sorted.
printf "%s\n" */* | sort -k1,1 -t/ -u | sed 's#\(.*\)/\([0-9]*\).*#\1|\2#'
You will need to change the sed command if the date field may be followed by another number.
This works for me:
for dir in $(find "$FOLDER" -type d); do   # assumes directory names without whitespace
    # -p marks directories with a trailing /, which grep -v / filters out
    FILE=$(ls -1 -p "$dir" | grep -v / | head -n1)
    if [ -n "$FILE" ]; then
        echo "$dir/$FILE"
    fi
done

Indexing files and parsing name

I have a directory, ./grd_files/lat36/, that has 7 files in it (n36e114.grd, n36e115.grd, n36e116.grd, n36e117.grd, n36e118.grd, n36e119.grd, n36e120.grd). Also beneath ./grd_files/ are other folders named lat37, lat38, lat39. Each contains some files named in the same format as those in lat36, except that instead of n36e114.grd, the file for the e114 longitude in the lat37 folder would be called n37e114.grd. Now, not all lat** folders contain all the longitudes, but I need them to.
I have written the part of the script that determines which lat** folder has the most columns in it (it is lat36, with 7 longitudes). I want to compare the longitudes that exist in the lat36 folder to the other folders, and if a column is missing in another folder, I will make it. I can handle the if-then statement, but I am stumped on how to compare the lists in bash.
I was thinking of making a list of the file names in the row1 folder and comparing that to the files in the other folders, but the names won't and shouldn't match -- only the column part of the name will and should match. So far I have tried to make an array of the file names and then parse it for just the column part of the name. Note that these are actually map tiles, so the names are really coordinates in northing (row) and easting (col), e.g. n36e114.grd. So I want to isolate all the e114-style parts of the names and check that each exists in the other rows. I hope that makes sense. Below is what I attempted, but I am not great with bash syntax, so I'm stumped. Thanks so much for the help.
col_list_raw=( $(find $maxdirectory -name "*.grd" -exec basename {} .grd \;) )
col_list=( for c in ${col_list_raw[@]}; do echo ${col_list_raw[$c]:3:7}; done )
where $maxdirectory is the one with the most columns.
UPDATE: I have removed my original attempt described above and attempted to incorporate the solution from John1024. Below is the code.
cd ./grd_files
for row in lat*/
do
    ls "$row" | sed 's/.*lon/lon/' >"${row%/}.tmp"
done
for f in lat*.tmp
do
    grep -vFf "$f" ${latXX}.tmp >missing.tmp
    [ -s missing.tmp ] && echo ${f%.tmp} is missing $(cat missing.tmp)
done
cd ..
Where latXX is the folder with the most longitudes. John1024's first loop works nicely, and I get the correct lists for each of the lat** folders, but the second loop simply compares the lists wholesale, returning:
lat37 is missing n36e114.grd n36e115.grd n36e116.grd n36e117.grd n36e118.grd n36e119.grd n36e120.grd
lat38 is missing n36e114.grd n36e115.grd n36e116.grd n36e117.grd n36e118.grd n36e119.grd n36e120.grd
lat39 is missing n36e114.grd n36e115.grd n36e116.grd n36e117.grd n36e118.grd n36e119.grd n36e120.grd
I need that loop to compare only part of the file name, i.e. I want to check each folder for the existence of each longitude, so that if file n37e114.grd exists, nothing happens, but if it does not exist, that information is returned and I can execute a command based on the missing file. I hope my edits clear up the naming convention and are understandable. Thanks again for the help. AM
SOLUTION:
Thanks to the help of @John1024 I was able to find a solution; I have reproduced the final solution below. Following this, I read in the *.out files and run my command on each line of them.
cd ./grd_files
for lat in */
do
    ls "$lat" | sed 's/[a-z][1-9][1-9].*\([a-z][0-9][0-9]*\).grd/\1/' >"${lat%/}.tmp"
done
for file in *.tmp
do
    lat=$(echo $file | awk -F "." '{print $1}')
    grep -vFf "$file" ${xXX}.tmp >${lat}missing.out
    [ -s ${lat}missing.out ] && echo ${file%.tmp} is missing $(cat ${lat}missing.out)
done
The question includes two different naming schemes for the files. Both would work the same, but to keep it simple and intuitive, this answer uses the first scheme.
It is possible to loop through bash arrays to find the missing columns. However, grep is well-suited to this task, greatly simplifies the logic, and, if there are many columns and rows, it is likely much faster. Using grep:
cd ./grd_files
for row in row*/
do
    ls "$row" | sed 's/.*col/col/' >"${row%/}.tmp"
done
for f in row*.tmp
do
    grep -vFf "$f" row1.tmp >missing.tmp
    [ -s missing.tmp ] && echo ${f%.tmp} is missing $(cat missing.tmp)
done
The first loop above creates lists of the columns that exist in each of the rows. These lists are saved in temporary files named row1.tmp, row2.tmp, etc.
The second loop compares each of those lists to the reference row, row1.tmp. The list of columns missing from a row is saved in temporary file missing.tmp. If missing.tmp has a nonzero size, then there are missing columns and a report is generated.
For cleanup, one might want to delete the tmp files. If so, add this line to the end of the script:
rm row*.tmp missing.tmp
Fancier version
Using process substitution, the need for many of the temporary files can be eliminated:
trap "rm missing.tmp" EXIT
for row in row*/
do
ls row1/ | sed 's/.*col/col/' | grep -vFf <(ls "$row" | sed 's/.*col/col/') >missing.tmp
[ -s missing.tmp ] && echo $row is missing $(cat missing.tmp)
done
This version also uses trap to assure that the sole remaining temporary file is removed when the script is finished.
Using the other naming scheme, as per the revised question:
cd ./grd_files
for row in lat*/
do
    ls "$row" | sed 's/.*n[0-9][0-9]e/e/' >"${row%/}.tmp"
done
for f in lat*.tmp
do
    grep -vFf "$f" ${latXX}.tmp >missing.tmp
    [ -s missing.tmp ] && echo ${f%.tmp} is missing $(cat missing.tmp)
done
cd ..
As I said in the comment, supplying test data is good practice. In this case you would have gotten many more answers if you had supplied a script that creates a test case, something like the following:
mkdir grid
cd grid
mkdir lat3{5..9}
#if you don't know the {5..9} expansion, simply write
#mkdir lat35 lat36 lat37 lat38 lat39
touch lat35/n35e111.grd
touch lat36/n36e11{4..9}.grd lat36/n36e120.grd
touch lat37/n37e11{4,6,8}.grd
touch lat38/n38e11{4..9}.grd
#39 missing all files
A script that creates a test case helps much more than a full page of words. ;) Or, if there is no script, at least supply the output of find, e.g. find grid -print. Your first edit helps a bit (I missed it), and +100 to @John1024's work.
Now about the solution.
Your final solution has one problem: what if the directory with the MOST longitudes (your latXX) is missing some grid file that exists in some other directory? I.e. it has the most grid files, but still not all of them. In the above test case, for example, lat36 contains 7 files (the most of all), but it is still missing n36e111.grd (because the 111 longitude exists only in lat35).
Therefore I created an alternative solution which eliminates this problem and shows the result as the following matrix:
111 114 115 116 117 118 119 120
35: + no no no no no no no # the 111 is here
36: no + + + + + + + # the dir with a MOST of longitudes but missing 111
37: no + no + no + no no
38: no + + + + + + no
39: no no no no no no no no # missing all longitudes
The script:
start="./test/grid"
cd "$start" || err "can cd to $start" || exit 1
known_longs=$(find . -type f -name \*.grd -print | sed 's:.*/n.*e\([0-9][0-9]*\)\.grd:\1:' | sort -u)
known_lats=$(find . -type d -print | grep -oP 'lat\K\d+(?=/?)' | sort -u)
print_matrix() {
    echo -ne "\t"
    paste -s - <<<"$known_longs"
    for lat in $known_lats
    do
        echo -en "$lat:"
        for long in $known_longs
        do
            [[ -e "./lat${lat}/n${lat}e${long}.grd" ]] && echo -en "\t+" || echo -en "\tno"
        done
        echo
    done
}
print_matrix
The logic is simple:
search for all known longitudes, i.e. the eNNN part of each filename
search for all known latitudes, i.e. the latNN directories
in a nested loop, test for the existence of each file
The matrix printed above is probably not very useful by itself, because you probably want to do something with the found or missing files, so here is an action variant of the script.
start="./test/grid"
cd "$start" || err "can cd to $start" || exit 1
known_longs=$(find . -type f -name \*.grd -print | sed 's:.*/n.*e\([0-9][0-9]*\)\.grd:\1:' | sort -u)
known_lats=$(find . -type d -print | grep -oP 'lat\K\d+(?=/?)' | sort -u)
do_if_exists() {
    local xlat="$1"
    local xlong="$2"
    filename="n${xlat}e${xlong}.grd"
    #do nothing
}
do_if_missing() {
    local xlat="$1"
    local xlong="$2"
    filename="n${xlat}e${xlong}.grd"
    echo "from lat$xlat missing $filename"
}
do_actions() {
    for lat in $known_lats
    do
        for long in $known_longs
        do
            [[ -e "./lat${lat}/n${lat}e${long}.grd" ]] && do_if_exists $lat $long || do_if_missing $lat $long
        done
    done
}
do_actions
which performs an action for each missing file (it echoes what is missing); the output is:
from lat35 missing n35e114.grd
from lat35 missing n35e115.grd
from lat35 missing n35e116.grd
from lat35 missing n35e117.grd
from lat35 missing n35e118.grd
from lat35 missing n35e119.grd
from lat35 missing n35e120.grd
from lat36 missing n36e111.grd
from lat37 missing n37e111.grd
from lat37 missing n37e115.grd
from lat37 missing n37e117.grd
from lat37 missing n37e119.grd
from lat37 missing n37e120.grd
from lat38 missing n38e111.grd
from lat38 missing n38e120.grd
from lat39 missing n39e111.grd
from lat39 missing n39e114.grd
from lat39 missing n39e115.grd
from lat39 missing n39e116.grd
from lat39 missing n39e117.grd
from lat39 missing n39e118.grd
from lat39 missing n39e119.grd
from lat39 missing n39e120.grd
Of course, it is possible to optimise further, e.g.:
run the find only once (helps if the directory tree is large) by creating a list of filenames with the find command
don't test each file individually, but check for the filename in the previously created list of filenames
as in the following:
startdir="./test/grid"
(cd "$startdir" || err "can cd to $start" || exit 1
gridlist="/tmp/griglist.$$"
trap "rm -f $gridlist;exit" 0 2
find . -regex '\./lat[0-9][0-9]*.*' -print >$gridlist
known_longs=($(sed -n 's:^.*/n[0-9][0-9]*e\([0-9][0-9]*\)\.grd$:\1:p' $gridlist | sort -u))
known_lats=($(grep -oP '/lat\K\d+((?=/?)|$)' $gridlist | sort -u))
full_list() {
for lat in ${known_lats[#]}
do
for long in ${known_longs[#]}
do
echo "./lat${lat}/n${lat}e${long}.grd"
done
done
}
comm -13 $gridlist <(full_list)) | while read missing
do
#do something with the miising file
echo "$missing"
done

Efficient way to find paths from a list of filenames

From a list of file names stored in a file f, what's the best way to find the relative path of each file name under dir, outputting this new list to file p? I'm currently using the following:
while IFS= read -r name
do
    find dir -type f -name "$name" >> p
done < f
which is too slow for a large list, or a large directory tree.
EDIT: A few numbers:
Number of directories under dir: 1870
Number of files under dir: 80622
Number of filenames in f: 73487
All files listed in f do exist under dir.
The following piece of Python code does the trick. The key is to run find once and store the output in a hash map, giving an O(1) way to get from a file name to the list of paths for that name.
#!/usr/bin/env python
import os
import subprocess

# Read the list of file names, one per line.
with open("f") as fh:
    file_names = [line.rstrip("\n") for line in fh]

# Run find exactly once, then index every path by its basename so that
# looking up the paths for a given file name is O(1).
file_paths = subprocess.check_output(["find", ".", "-type", "f"], text=True).splitlines()

file_names_to_paths = {}
for file_path in file_paths:
    file_name = os.path.basename(file_path)
    file_names_to_paths.setdefault(file_name, []).append(file_path)  # duplicates kept

with open("p", "w") as out_file:
    for file_name in file_names:
        for path in file_names_to_paths.get(file_name, []):
            out_file.write(path + "\n")
Try this Perl one-liner:
perl -e '%H=map{chomp;$_=>1}<>;sub R{my($p)=@_;map R($_),<$p/*> if -d$p;($b=$p)=~s|.*/||;print"$p\n" if$H{$b}}R"."' f
1- create a hashmap whose keys are filenames : %H=map{chomp;$_=>1}<>
2- define a recursive subroutine to traverse directories : sub R{}
2.1- recursive call for directories : map R($_),<$p/*> if -d$p
2.2- extract the filename from the path : ($b=$p)=~s|.*/||
2.3- print if the hashmap contains the filename : print"$p\n" if$H{$b}
3- call R with the current directory as path : R"."
EDIT: to also traverse hidden directories (.*):
perl -e '%H=map{chomp;$_=>1}<>;sub R{my($p)=@_;map R($_),grep !m|/\.\.?$|,<$p/.* $p/*> if -d$p;($b=$p)=~s|.*/||;print"$p\n" if$H{$b}}R"."' f
I think this should do the trick:
xargs locate -b < f | grep ^dir > p
Edit: I can't think of an easy way to prefix dir/*/ to the list of file names, otherwise you could just pass that directly to xargs locate.
Depending on what percentage of the directory tree is considered a match, it might be faster to find every file, then grep out the matching ones:
find "$dir" -type f | grep -f <( sed 's+\(.*\)+/\1$+' "$f" )
The sed command pre-processes your list of file names into regular expressions that will only match full names at the end of a path.
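For example, a list containing foo.txt and bar.c is turned into anchored patterns like so:
$ printf 'foo.txt\nbar.c\n' | sed 's+\(.*\)+/\1$+'
/foo.txt$
/bar.c$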
Here is an alternative using bash and grep:
#!/bin/bash
flist(){
    for x in "$1"/*; do
        [ -d "$x" ] && flist "$x" || echo "$x"
    done
}
dir=/etc            #the directory you are searching
list=$(< myfiles)   #the file with file names
#format the list into a single grep pattern: /name1$\|/name2$\|...
list="/${list//
/\$\|/}"
flist "$dir" | grep "$list"
...if you need full POSIX shell compliance (busybox ash, hush, etc...) replace the $list substring manipulation with a variant of chepner's sed and replace $(< file) with $(cat file)
