print recursively the number of files in folders - bash

I struggled for hours to get this ugly line to work:
wcr() { find "$@" -type d | while read F; do find $F -maxdepth 0 && printf "%5d " $(ls $F | wc) && printf "$F"; done; echo; }
Here is the result:
39 41 754 ../matlab.sh
1 1 19 ./matlab.sh./micmac
1 1 14 ./micmac
My first question is: how can I write it smarter?
Second question: I would like the names printed before the counts, but I don't know how to tabulate the outputs, so I cannot do better than this:
.
./matlab.sh 1 1 19
./matlab.sh./micmac 1 1 19
./micmac 1 1 14

I don't see what that find $F -maxdepth 0 is supposed to be good for, so I would just strip it.
Also, if a filename contains a %, you are in trouble if you use it as the format string to printf, so I'd add an explicit format string. And I combined the two printfs. To switch the columns (see also below for more on this topic), just switch the arguments and adapt the format string accordingly.
You should use double quotes around variables ("$F" instead of $F) to avoid problems with filenames containing spaces or other special characters.
Then, if a filename starts with spaces, your read would skip those, rendering the resulting variable useless. To avoid that, set IFS to an empty string for the duration of the read.
To get only the number of directory entries, you should use option -l for wc to only count the lines (and not also words and characters).
Use option --sort=none to ls to speed up things by avoiding useless sorting.
Use option -b to ls to escape newline characters in file names and thus avoid breaking of counting.
Indent your code properly if you want others to read it.
This is the result:
wcr() {
    find "$@" -type d | while IFS='' read F
    do
        printf "%5d %s\n" "$(ls --sort=none -b "$F" | wc -l)" "$F"
    done
    echo
}
I'd object to switching the columns. The potentially widest column should be at the end (in this case the path to the file). Otherwise you will have to live with unreadable output. But if you really want to do this, you'd have to do two passes: One to determine the longest entry and a second to format the output accordingly.
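If you do go down that road, here is a minimal two-pass sketch (assuming GNU find/ls as above, no newlines in the directory names, and an illustrative function name wcr2):
wcr2() {
    local width=0 d n
    # pass 1: find the longest directory name
    while IFS='' read -r d; do
        (( ${#d} > width )) && width=${#d}
    done < <(find "$@" -type d)
    # pass 2: print each name left-justified to that width, then its entry count
    while IFS='' read -r d; do
        n=$(ls --sort=none -b "$d" | wc -l)
        printf "%-${width}s %5d\n" "$d" "$n"
    done < <(find "$@" -type d)
}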

for i in $(find . -type d); do
    printf '%-10s %s\n' "$(ls $i | wc -l)" "$i"
done
You probably could pre-process the output and use column to make some fancier output with whatever order, but since the path can get big, doing this is probably simpler.
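As a rough sketch of that column idea (assuming no tabs or newlines in the paths, and a column that supports -t/-s, e.g. from util-linux):
for i in $(find . -type d); do
    printf '%s\t%s\n' "$i" "$(ls "$i" | wc -l)"
done | column -t -s $'\t'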

How do I find duplicate files by comparing them by size (ie: not hashing) in bash

How do I find duplicate files by comparing them by size (ie: not hashing) in bash.
Testbed files:
-rw-r--r-- 1 usern users 68239 May 3 12:29 The W.pdf
-rw-r--r-- 1 usern users 68239 May 3 12:29 W.pdf
-rw-r--r-- 1 usern users 8 May 3 13:43 X.pdf
Yes, files can have spaces (Boo!).
I want to check files in the same directory and move the ones which match something else into a 'these are probably duplicates' folder.
My probable use-case is going to have humans randomly mis-naming a smaller set of files (ie: not generating files of arbitrary length). It is fairly unlikely that two files will be the same size and yet be different files. Sure, as a backup I could hash and check two files of identical size. But mostly, it will be people taking a file and misnaming it / re-adding it to a pile it is already in.
So, preferably a solution with widely installed tools (posix?). And I'm not supposed to parse the output of ls, so I need another way to get the actual size (and not a du approximation).
"Vote to close!"
Hold up cowboy.
I bet you're going to suggest this (cool, you can google search):
https://unix.stackexchange.com/questions/71176/find-duplicate-files
No fdupes (nor jdupes, nor...), nor finddup, nor rmlint, nor fslint - I can't guarantee those on other systems (much less mine), and I don't want to be stuck as customer support dealing with installing them on random systems from now to eternity, nor even in getting emails about that sh...stuff and having to tell them to RTFM and figure it out. Plus, in reality, I should write my script to test functionality of what is installed, but, that's beyond the scope.
https://unix.stackexchange.com/questions/192701/how-to-remove-duplicate-files-using-bash
All these solutions want to start by hashing. Some cool ideas in some of these: hash just a chunk of both files, starting somewhere past the header, then only do a full compare if those turn up matching. Good idea for double-checking work, but I would prefer to only do that on the very, very few that actually are duplicates. Looking over the first several thousand of these by hand, not one duplicate has been even close to a different file.
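For what it's worth, that chunk idea can be sketched with plain posix tools; chunk_matches below is a hypothetical helper and the offsets/sizes are arbitrary:
# checksum a slice of each file past the header; only when the slices match
# is a full cmp worth running
chunk_matches() {
    a=$(dd if="$1" bs=512 skip=8 count=64 2>/dev/null | cksum)
    b=$(dd if="$2" bs=512 skip=8 count=64 2>/dev/null | cksum)
    [ "$a" = "$b" ]
}
# usage: chunk_matches "The W.pdf" "W.pdf" && cmp -s "The W.pdf" "W.pdf" && echo duplicate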
https://unix.stackexchange.com/questions/277697/whats-the-quickest-way-to-find-duplicated-files
Proposed:
$find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
Breaks for me:
find: unknown option -- n
usage: find [-dHhLXx] [-f path] path ... [expression]
uniq: unknown option -- w
usage: uniq [-ci] [-d | -u] [-f fields] [-s chars] [input_file [output_file]]
find: unknown option -- t
usage: find [-dHhLXx] [-f path] path ... [expression]
xargs: md5sum: No such file or directory
https://unix.stackexchange.com/questions/170693/compare-directory-trees-regarding-file-name-and-size-and-date
Haven't been able to figure out how rsync -nrvc --delete might work in the same directory, but there might be solution in there.
Well how about cmp? Yeah, that looks pretty good, actually!
cmp -z file1 file2
Bummer, my version of cmp does not include the -z size option.
However, I tried implementing it just for grins - and when it failed, looking at it I realized that I also need help constructing my loop logic. Removing things from my loops in the midst of processing them is probably a recipe for breakage, duh.
if [ ! -d ../Dupes/ ]; then
    mkdir ../Dupes/ || exit 1 # Cuz no set -e, and trap not working
fi
for i in ./*
do
    for j in ./*
    do
        if [[ "$i" != "$j" ]]; then # Yes, it will be identical to itself
            if [[ $(cmp -s "$i" "$j") ]]; then
                echo "null" # Cuz I can't use negative of the comparison?
            else
                mv -i "$i" ../Dupes/
            fi
        fi
    done
done
https://unix.stackexchange.com/questions/367749/how-to-find-and-delete-duplicate-files-within-the-same-directory
Might have something I could use, but I'm not following what's going on in there.
https://superuser.com/questions/259148/bash-find-duplicate-files-mac-linux-compatible
If it were something that returns size, instead of md5, maybe one of the answers in here?
https://unix.stackexchange.com/questions/570305/what-is-the-most-efficient-way-to-find-duplicate-files
Didn't really get answered.
TIL: Sending errors from . scriptname will close my terminal instantly. Thanks, Google!
TIL: Sending errors from scripts executed via $PATH will close the terminal if shopt -s extdebug + trap checkcommand DEBUG are set in profile to try and catch rm -r * - but at least will respect my alias for exit
TIL: Backticks deprecated, use $(things) - Ugh, so much re-writing to do :P
TIL: How to catch non-ascii characters in filenames, without using basename
TIL: "${file##*/}" (quick illustration below)
TIL: file - yes, X.pdf is not a PDF.
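For reference, that "${file##*/}" expansion just strips everything up to the last slash; a quick illustration with a made-up path:
file='/some/dir/The W.pdf'
printf '%s\n' "${file##*/}"   # -> The W.pdf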
On the matter of POSIX
I'm afraid you cannot get the actual file size (not the number of blocks allocated by the file) in a plain posix shell without using ls. All the solutions like du --apparent-size, find -printf %s, and stat are not posix.
However, as long as your filenames don't contain linebreaks (spaces are ok) you could create safe solutions relying on ls. Correctly handling filenames with linebreaks would require very non-posix tools (like GNU sort -z) anyway.
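For example, here is a hedged sketch of such an ls-based lookup (size_of is a hypothetical helper; field 5 of the ls -l format is the size in bytes, and -n avoids locale-dependent user/group names):
size_of() {
    # -d: don't list directory contents; -n: numeric uid/gid; -L: follow symlinks
    set -- $(ls -dnL -- "$1")
    printf '%s\n' "$5"
}
# usage: size_of "The W.pdf"   # -> 68239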
Bash+POSIX Approach Actually Comparing The Files
I would drop the approach to compare only the file sizes and use cmp instead. For huge directories the posix script will be slow no matter what you do. Also, I expect cmp to do some fail fast checks (like comparing the file sizes) before actually comparing the file contents. For common scenarios with only a few files speed shouldn't matter anyway as even the worst script will run fast enough.
The following script places each group of actual duplicates (at least two, but can be more) into its own subdirectory of dups/. The script should work with all filenames; spaces, special symbols, and even linebreaks are ok. Note that we are still using bash (which is not posix). We just assume that all tools (like mv, find, ...) are posix.
#! /usr/bin/env bash
files=()
for f in *; do [ -f "$f" ] && files+=("$f"); done
max=${#files[@]}
for (( i = 0; i < max; i++ )); do
    sameAsFileI=()
    for (( j = i + 1; j < max; j++ )); do
        cmp -s "${files[i]}" "${files[j]}" &&
            sameAsFileI+=("${files[j]}") &&
            unset 'files[j]'
    done
    (( ${#sameAsFileI[@]} == 0 )) && continue
    mkdir -p "dups/$i/"
    mv "${files[i]}" "${sameAsFileI[@]}" "dups/$i/"
    # no need to unset files[i] because loops won't visit this entry again
    files=("${files[@]}") # un-sparsify array
    max=${#files[@]}
done
Fairly Portable Non-POSIX Approach Using File Sizes Only
If you need a faster approach that only compares the file sizes, I suggest not using a nested loop. Loops in bash are slow already, but if you nest them you have quadratic time complexity. It is faster and easier to:
print only the file sizes without file names,
apply sort | uniq -d to retrieve duplicates in time O(n log n), and
move all files having one of the duplicated sizes to a directory.
This solution is not strictly posix conformant. However, I tried to verify that the tools and options in this solution are supported by most implementations. Your find has to support the non-posix options -maxdepth and -printf with %s for the actual file size and %f for the file basename (%p for the full path would be acceptable too).
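If you want to guard against a find that lacks these options (the question even mentions testing for installed functionality), a quick hedged check could be:
if find . -maxdepth 0 -printf '%s %f\n' >/dev/null 2>&1; then
    echo "find supports -maxdepth and -printf"
else
    echo "this find lacks the required non-posix options" >&2
    exit 1
fi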
The following script places all files of the same size into the directory potential-dups/. If there are two files of size n and two files of size m, all four files end up in this single directory. The script should work with all file names except those with linebreaks (that is \n; \r should be fine though).
#! /usr/bin/env sh
all=$(find . -maxdepth 1 -type f -printf '%s %f\n' | sort)
dupRegex=$(printf %s\\n "$all" | cut -d' ' -f1 | uniq -d |
sed -e 's/[][\.|$(){}?+*^]/\\&/g' -e 's/^/^/' | tr '\n' '|' | sed 's/|$//')
[ -z "$dupRegex" ] && exit
mkdir -p potential-dups
printf %s\\n "$all" | grep -E "$dupRegex" | cut -d' ' -f2- |
sed 's/./\\&/' | xargs -I_ mv _ potential-dups
In case you wonder about some of the sed commands: They quote the file names such that spaces and special symbols are processed correctly by subsequent tools. sed 's/[][\.|$(){}?+*^]/\\&/g' is for turning raw strings into equivalent extended regular expressions (ERE) and sed 's/./\\&/' is for literal processing by xargs. See the posix documentation of xargs:
-I replstr [...] Any <blank>s at the beginning of each line shall be ignored.
[...]
Note that the quoting rules used by xargs are not the same as in the shell. [...] An easy rule that can be used to transform any string into a quoted form that xargs interprets correctly is to precede each character in the string with a backslash.

Shorten filename to n characters while preserving file extension

I'm trying to shorten a filename while preserving the extension.
I think cut may be the best tool to use, but I'm not sure how to preserve the file extension.
For example, I'm trying to rename abcdefghijklmnop.txt to abcde.txt
I'd like to simply lop off the end of the filename so that the total character length doesn't exceed [in this example] 5 characters.
I'm not concerned with filename clashes because my dataset likely won't contain any, and anyway I'll do a find, analyze the files, and test before I rename anything.
The background for this is ultimately that I want to mass truncate filenames that exceed 135 characters so that I can rsync the files to an encrypted share on a Synology NAS.
I found a good way to search for all filenames that exceed 135 characters:
find . -type f | awk -F'/' 'length($NF)>135{print $0}'
And I'd like to pipe that to a simple cut command to trim the filename down to size. Perhaps there is a better way than this. I found a method to shorten filenames while preserving extensions, but I need to recurse through all sub-directories.
Any help would be appreciated, thank you!
Update for clarification:
I'd like to use a one-liner with a syntax like this:
find . -type f | awk -F'/' 'length($NF)>135{print $0}' | some_code_here_to_shorten_the_filename_while_preserving_the_extension
With GNU find and bash:
export n=10 # change according to your needs
find . -type f \
    ! -name '.*' \
    -regextype egrep \
    ! -regex '.*\.[^/.]{'"$n"',}' \
    -regex '.*[^/]{'$((n+1))',}' \
    -execdir bash -c '
        echo "PWD=$PWD"
        for f in "${@#./}"; do
            ext=${f#"${f%.*}"}
            echo mv -- "$f" "${f:0:n-${#ext}}${ext}"
        done' bash {} +
This performs a dry run, that is, it shows folders followed by the commands to be executed within them. Once you're happy with its output you can drop the echo before mv (and the echo "PWD=$PWD" line too if you want) and it'll actually rename all the files whose names exceed n characters to names of exactly n characters, extension included.
Note that this excludes hidden files, and files whose extensions are equal to or longer than n in length (e.g. .hidden, app.properties where n=10).
Use bash string manipulations.
Details: https://www.linuxtopia.org/online_books/advanced_bash_scripting_guide/string-manipulation.html (scroll to "Substring Extraction").
The example below cuts the filename to 10 characters while preserving the extension:
~ % cat test
rawFileName=$(basename "$1")
filename="${rawFileName%.*}"
ext="${rawFileName##*.}"
if [[ ${#filename} -gt 10 ]]; then
    echo "${filename:0:10}.${ext}"
else
    echo "$1"
fi
And tests:
~ % ./test 12345678901234567890.txt
1234567890.txt
~ % ./test 1234567.txt
1234567.txt
Update
Since your files are distributed in a tree of directories, you can use my original approach, but pass the script to bash -c via the -exec option of find (bash is needed for the ${b:0:n} expansion):
n=5 find . -type f -exec bash -c 'f=$1; d=${f%/*}; b=${f##*/}; e=${b##*.}; b=${b%.*}; mv -- "$f" "$d/${b:0:n}.$e"' bash {} \;
Original answer
If the filename is in a variable x, then ${x:0:5}.${x##*.} should do the job.
So you might do something like
n=5 # or 135, or whatever you like
for f in *; do
mv -- "$f" "${f:0:n}.${f##*.}"
done
Clearly this assumes that there are no clashes between the shortened names. If there are clashes, then only one would survive! So be careful.
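If clashes are a concern after all, here is a hedged variant of the same loop that simply skips a rename when the shortened name already exists:
n=5
for f in *; do
    new="${f:0:n}.${f##*.}"
    if [ -e "$new" ]; then
        printf 'skipping %s: %s already exists\n' "$f" "$new" >&2
        continue
    fi
    mv -- "$f" "$new"
done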

How to loop through the sorted list of file names in two directories

Please note, I have read entries like "For loop for files in multiple folders - bash shell" and they ask for a significantly different thing.
I want to loop through the file names in a sorted order that exist in either of two directories. Files can potentially have spaces in them.
Let's say I have:
1/
a
a c b
b
c
2/
a
d
I would want to loop through: 'a', 'a c b', 'b', 'c', 'd'.
I have tried to do the following:
for fname in $((ls -1 -f -A "${dir1}"; ls -1 -f -A "${dir2}")|sort --unique); do
    echo "testing ${fname}"
done
The result then is:
testing .
testing ..
testing a
testing a
testing c
testing b
testing b
testing c
testing d
For whatever reason I am getting '.' and '..' entries, which I was trying to exclude with -A, and also the file 'a c b' gets broken down into three strings.
I have tried to resolve it by adding --zero to the sort command, which changed nothing, and by quoting the whole $(ls...|sort) part, which resulted in the for loop receiving a single entry: the entire multi-line string with one filename per line.
Do not ever parse the output of the ls command (see Why you shouldn't parse the output of ls(1)); it has lots of potential pitfalls. Use the find command with its -print0 option to null-delimit the files, so that file names with spaces, newlines or any metacharacters are handled, and subsequently use GNU sort with the same null delimiter to sort them alphabetically and remove duplicate files. If dir1 and dir2 are shell variables containing the names of the folders to look up, you can do
while IFS= read -r -d '' file; do
    printf '%s\n' "$file"
done < <(find "${dir1}" "${dir2}" -maxdepth 1 -type f -printf "%f\0" | sort -t / -u -z)
A much simpler approach might be to loop over everything and exclude duplicates by other means.
#!/bin/bash
# Keep an associative array of which names you have already processed
# Requires Bash 4
declare -A done
for file in 1/* 2/*; do
    base=${file#*/}  # trim directory prefix from value
    test "${done[$base]}" && continue
    : do things ...
    done["$base"]="$file"
done
Answer:
Change the for delimiter from whitespace to \n using the following command:
IFS=$'\n'
You used -f for ls, which implies -a (and overrides -A); use --color=never instead.
To summarize:
IFS=$'\n'
for fname in $( (ls -1 --color=never -A "${dir1}"; ls -1 --color=never -A "${dir2}") | sort --unique); do
    echo "testing ${fname}"
done
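One hedged refinement: run the loop in a subshell so the IFS change does not leak into the rest of your script, for example:
(
    IFS=$'\n'
    for fname in $( (ls -1 --color=never -A "${dir1}"; ls -1 --color=never -A "${dir2}") | sort --unique); do
        echo "testing ${fname}"
    done
)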

Can't both add extra newlines in find -printf and pipe output through sort

I'm trying to print out the top 10 largest files in my current directory.
For now I'm using
find . -maxdepth 1 -printf '%s %p\n'|sort -nr| head
I'm not sure how to add an additional newline after each file size. Typing \n\n doesn't do anything, nor does \n\\n. I also need a tab before each line.
Don't use newlines here. They're the wrong tool for the job: Filenames can contain literal newlines, so you don't know if a newline in find's output here came from your format string or a literal filename. Moreover, by default, sort uses newlines as record separators, so the empty lines and the names they were previously next to are no longer next to each other when it's done (since they've been... well... sorted into different locations).
So -- generate your output in a format that makes sense given the content constraints, and then reformat it into what you actually want for output later:
while IFS= read -r -d$'\t' size && IFS= read -r -d '' filename; do
    printf '\t%s %s\n\n' "$size" "$filename"
done < <(find . -maxdepth 1 -printf '%s\t%p\0' | sort -znr | head -z)
Note that this is full of GNUisms (sort -z, head -z) -- but find -printf is a GNUism itself.
By way of understanding this code, some pointers:
BashFAQ #1: "How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?"
BashFAQ #3: "How can I sort or compare files based on some metadata attribute (newest / oldest modification time, size, etc)?"
Process substitution: The <( ... ) syntax used to expose the output from the pipeline as a file.
The printf builtin: A discussion of the bash printf command.

Grabbing every 4th file

I have 16,000 jpgs from a webcam screen grabber that I let run for a year pointing into the back yard. I want to find a way to grab every 4th image so that I can put them into another directory and later turn them into a movie. Is there a simple bash script or other way under Linux that I can do this?
They are named like so......
frame-44558.jpg
frame-44559.jpg
frame-44560.jpg
frame-44561.jpg
Thanks from a newb needing help.
Seems to have worked.
Couple of errors in my original post: there were actually 280,000 images and the naming was:
/home/baldy/Desktop/webcamimages/webcam_2007-05-29_163405.jpg
/home/baldy/Desktop/webcamimages/webcam_2007-05-29_163505.jpg
/home/baldy/Desktop/webcamimages/webcam_2007-05-29_163605.jpg
I ran:
cp $(ls | awk '{nr++; if (nr % 10 == 0) print $0}') ../newdirectory/
Which appears to have copied the images. 70-900 per day from the looks of it.
Now I'm running
mencoder mf://*.jpg -mf w=640:h=480:fps=30:type=jpg -ovc lavc -lavcopts vcodec=msmpeg4v2 -nosound -o ../output-msmpeg4v2.avi
I'll let you know how the movie works out.
UPDATE: Movie did not work.
Only has images from 2007 in it even though the directory has 2008 as well.
webcam_2008-02-17_101403.jpg webcam_2008-03-27_192205.jpg
webcam_2008-02-17_102403.jpg webcam_2008-03-27_193205.jpg
webcam_2008-02-17_103403.jpg webcam_2008-03-27_194205.jpg
webcam_2008-02-17_104403.jpg webcam_2008-03-27_195205.jpg
How can I modify my mencoder line so that it uses all the images?
One simple way is:
$ touch a b c d e f g h i j k l m n o p q r s t u v w x y z
$ mv $(ls | awk '{nr++; if (nr % 4 == 0) print $0}') destdir
Create a script move.sh which contains this:
#!/bin/sh
mv $4 ../newdirectory/
Make it executable and then do this in the folder:
ls *.jpg | xargs -n 4 ./move.sh
This takes the list of filenames, passes four at a time into move.sh, which then ignores the first three and moves the fourth into a new folder.
This will work even if the numbers are not exactly in sequence (e.g. if some frame numbers are missing, then using mod 4 arithmetic won't work).
As suggested, you should use
seq -f 'frame-%g.jpg' 1 4 number-of-frames
to generate the list of filenames, since 'ls *.jpg' will fail (argument list too long) on 280k files. So the final solution would be something like:
for f in `seq -f 'frame-%g.jpg' 1 4 number-of-frames` ; do
mv $f destdir/
done
seq -f 'frame-%g.jpg' 1 4 number-of-frames
…will print the names of the files you need.
An easy way in perl (probably easily adaptable to bash) is to glob the filenames into an array, then get the sequence number and remove those that are not divisible by 4.
Something like this will print the files you need:
ls -1 /path/to/files/ | perl -e 'while (<STDIN>) {($seq)=/(\d*)\.jpg$/; print $_ if $seq && $seq % 4 ==0}'
You can replace the print by a move...
This will work if the files are numbered in sequence, even if the number of digits is not constant (like file_9.jpg followed by file_10.jpg).
Given masto's caveats about sorting:
ls | sed -n '1~4 p' | xargs -i mv {} ../destdir/
The thing I like about this solution is that everything's doing what it was designed to do, so it feels unixy to me.
Just iterate over a list of files:
files=( frame-*.jpg )
i=0
while [[ $i -lt ${#files[@]} ]] ; do
    cur_file=${files[$i]}
    mungle_frame "$cur_file"
    i=$( expr $i + 4 )
done
This is pretty cheesy, but it should get the job done. Assuming you're currently cd'd into the directory containing all of your files:
mkdir ../outdir
ls | sort -n | while read fname; do mv "$fname" ../outdir/; read; read; read; done
The sort -n is there assuming your filenames don't all have the same number of digits; otherwise ls will sort in lexical order where frame-123.jpg comes before frame-4.jpg and I don't think that's what you want.
Please be careful, back up your files before trying my solution, etc. I don't want to be responsible for you losing a year's worth of data.
Note that this solution does handle files with spaces in the name, unlike most of the others. I know that wasn't part of the sample filenames, but it's easy to write shell commands that don't handle spaces safely, so I wanted to do that in this example.
Brace expansion {m..n..s} is more efficient than seq, AND it allows a bit of output formatting:
$ echo {0000..0010..2}
0000 0002 0004 0006 0008 0010
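Applied to the frame files from the original question, a hedged sketch (44558 and 44561 are just the first and last frame numbers shown in the question; adjust the range to your real numbers):
mkdir -p ../destdir
for f in frame-{44558..44561..4}.jpg; do
    [ -e "$f" ] && mv -- "$f" ../destdir/
done
The -e test skips generated names for which no file exists, so gaps in the numbering are harmless.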
Postscript: in curl, if you only want every fourth (nth) numbered image, you can tell curl a step counter too. This example range goes from 0 to 100 with an increment of 4 (n):
curl -O "http://example.com/[0-100:4].png"
