Shorten filename to n characters while preserving file extension - bash

I'm trying to shorten a filename while preserving the extension.
I think cut may be the best tool to use, but I'm not sure how to preserve the file extension.
For example, I'm trying to rename abcdefghijklmnop.txt to abcde.txt
I'd like to simply lop off the end of the filename so that the total character length doesn't exceed [in this example] 5 characters.
I'm not concerned with filename clashes because my dataset likely won't contain any, and anyway I'll do a find, analyze the files, and test before I rename anything.
The background for this is ultimately that I want to mass truncate filenames that exceed 135 characters so that I can rsync the files to an encrypted share on a Synology NAS.
I found a good way to search for all filenames that exceed 135 characters:
find . -type f | awk -F'/' 'length($NF)>135{print $0}'
And I'd like to pipe that to a simple cut command to trim the filename down to size. Perhaps there is a better way than this. I found a method to shorten filenames while preserving extensions, but I need to recurse through all sub-directories.
Any help would be appreciated, thank you!
Update for clarification:
I'd like to use a one-liner with a syntax like this:
find . -type f | awk -F'/' 'length($NF)>135{print $0}' | some_code_here_to_shorten_the_filename_while_preserving_the_extension

With GNU find and bash:
export n=10 # change according to your needs
find . -type f \
! -name '.*' \
-regextype egrep \
! -regex '.*\.[^/.]{'"$n"',}' \
-regex '.*[^/]{'$((n+1))',}' \
-execdir bash -c '
echo "PWD=$PWD"
for f in "${@#./}"; do
ext=${f#"${f%.*}"}
echo mv -- "$f" "${f:0:n-${#ext}}${ext}"
done' bash {} +
This performs a dry run: it shows each folder followed by the commands that would be executed within it. Once you're happy with the output, drop echo before mv (and the echo "PWD=$PWD" line too, if you want) and it will actually rename all files whose names exceed n characters to names exactly n characters long, including the extension.
Note that this excludes hidden files, and files whose extensions are equal to or longer than n in length (e.g. .hidden, app.properties where n=10).
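The length arithmetic in the loop body can be tried in isolation. The following is a minimal sketch, assuming bash; shorten_name is a hypothetical helper name that is not part of the answer above, n is the total length including the extension, and nothing is renamed.

```bash
#!/usr/bin/env bash
# shorten_name NAME N: print NAME cut to at most N characters in total,
# keeping the extension (the last dot and everything after it).
# Hypothetical helper for illustration; it does not touch the filesystem.
shorten_name() {
  local f=$1 n=$2 ext
  ext=${f#"${f%.*}"}             # ".txt" for "a.txt", empty if there is no dot
  if (( ${#f} <= n || ${#ext} >= n )); then
    printf '%s\n' "$f"           # already short enough, or extension too long: unchanged
  else
    printf '%s\n' "${f:0:n-${#ext}}${ext}"
  fi
}

shorten_name abcdefghijklmnop.txt 9    # prints abcde.txt
```

Names that are already short enough, and names whose extension alone reaches n characters, are printed unchanged, which mirrors the exclusions described above.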

use bash string manipulations
Details: https://www.linuxtopia.org/online_books/advanced_bash_scripting_guide/string-manipulation.html.
scroll to "Substring Extraction"
example below cut filename to 10 chars preserving extension
~ % cat test
rawFileName=$(basename "$1")
filename="${rawFileName%.*}"
ext="${rawFileName##*.}"
if (( ${#filename} > 10 )); then
echo "${filename:0:10}.${ext}"
else
echo "$1"
fi
And tests:
~ % ./test 12345678901234567890.txt
1234567890.txt
~ % ./test 1234567.txt
1234567.txt

Update
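Since your files are distributed in a tree of directories, you can use my original approach, passing the script to a bash command via the -exec option of find (bash rather than sh, because the ${b:0:n} substring syntax is not available in every sh; the file name is passed as an argument instead of being pasted into the script):
n=5 find . -type f -exec bash -c 'f=$1; d=${f%/*}; b=${f##*/}; e=${b##*.}; b=${b%.*}; mv -- "$f" "$d/${b:0:n}.$e"' bash {} \;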
Original answer
If the filename is in a variable x, then ${x:0:5}.${x##*.} should do the job.
So you might do something like
n=5 # or 135, or whatever you like
for f in *; do
mv -- "$f" "${f:0:n}.${f##*.}"
done
Clearly this assumes that there are no clashes between the shortened names. If there are clashes, then only one would survive! So be careful.
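Since clashing names would silently overwrite each other, a guard can be put in front of mv. This is a minimal sketch, assuming bash; rename_truncated is a hypothetical name, it truncates the stem rather than the whole name, and the demo runs in a throwaway directory created with mktemp.

```bash
#!/usr/bin/env bash
# rename_truncated FILE N: keep the first N characters of the stem plus the
# extension, but refuse to overwrite an existing file.
# Hypothetical helper; assumes FILE has an extension and is in the current directory.
rename_truncated() {
  local f=$1 n=$2 stem ext target
  stem=${f%.*}
  ext=${f##*.}
  target="${stem:0:n}.${ext}"
  [[ $f == "$target" ]] && return 0        # already short enough
  if [[ -e $target ]]; then
    printf 'skipping %s: %s already exists\n' "$f" "$target" >&2
    return 1
  fi
  mv -- "$f" "$target"
}

# Demo in a scratch directory: the second file would clash and is skipped.
demo=$(mktemp -d)
(
  cd "$demo" || exit 1
  touch abcdefgh.txt abcdexyz.txt
  rename_truncated abcdefgh.txt 5            # becomes abcde.txt
  rename_truncated abcdexyz.txt 5 || true    # skipped, abcde.txt exists
  ls
)
```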

Related

How do I find duplicate files by comparing them by size (ie: not hashing) in bash

How do I find duplicate files by comparing them by size (ie: not hashing) in bash.
Testbed files:
-rw-r--r-- 1 usern users 68239 May 3 12:29 The W.pdf
-rw-r--r-- 1 usern users 68239 May 3 12:29 W.pdf
-rw-r--r-- 1 usern users 8 May 3 13:43 X.pdf
Yes, files can have spaces (Boo!).
I want to check files in the same directory, move the ones which match something else into 'these are probably duplicates' folder.
My probable use-case is going to have humans randomly mis-naming a smaller set of files (ie: not generating files of arbitrary length). It is fairly unlikely that two files will be the same size and yet be different files. Sure, as a backup I could hash and check two files of identical size. But mostly, it will be people taking a file and misnaming it / re-adding it to a pile, of which it is already there.
So, preferably a solution with widely installed tools (posix?). And I'm not supposed to parse the output of ls, so I need another way to get actual size (and not a du approximate).
"Vote to close!"
Hold up cowboy.
I bet you're going to suggest this (cool, you can google search):
https://unix.stackexchange.com/questions/71176/find-duplicate-files
No fdupes (nor jdupes, nor...), nor finddup, nor rmlint, nor fslint - I can't guarantee those on other systems (much less mine), and I don't want to be stuck as customer support dealing with installing them on random systems from now to eternity, nor even in getting emails about that sh...stuff and having to tell them to RTFM and figure it out. Plus, in reality, I should write my script to test functionality of what is installed, but, that's beyond the scope.
https://unix.stackexchange.com/questions/192701/how-to-remove-duplicate-files-using-bash
All these solutions want to start by hashing. Some cool ideas in some of these: hash just a chunk of both files, starting somewhere past the header, then only do full compare if those turn up matching. Good idea for double checking work, but would prefer to only do that on the very, very few that actually are duplicate. As, looking over the first several thousand of these by hand, not one duplicate has been even close to a different file.
https://unix.stackexchange.com/questions/277697/whats-the-quickest-way-to-find-duplicated-files
Proposed:
find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
Breaks for me:
find: unknown option -- n
usage: find [-dHhLXx] [-f path] path ... [expression]
uniq: unknown option -- w
usage: uniq [-ci] [-d | -u] [-f fields] [-s chars] [input_file [output_file]]
find: unknown option -- t
usage: find [-dHhLXx] [-f path] path ... [expression]
xargs: md5sum: No such file or directory
https://unix.stackexchange.com/questions/170693/compare-directory-trees-regarding-file-name-and-size-and-date
Haven't been able to figure out how rsync -nrvc --delete might work in the same directory, but there might be solution in there.
Well how about cmp? Yeah, that looks pretty good, actually!
cmp -z file1 file2
Bummer, my version of cmp does not include the -z size option.
However, I tried implementing it just for grins - and when it failed, looking at it I realized that I also need help constructing my loop logic. Removing things from my loops in the midst of processing them is probably a recipe for breakage, duh.
if [ ! -d ../Dupes/ ]; then
mkdir ../Dupes/ || exit 1 # Cuz no set -e, and trap not working
fi
for i in ./*
do
for j in ./*
do
if [[ "$i" != "$j" ]]; then # Yes, it will be identical to itself
if [[ $(cmp -s "$i" "$j") ]]; then
echo "null" # Cuz I can't use negative of the comparison?
else
mv -i "$i" ../Dupes/
fi
fi
done
done
https://unix.stackexchange.com/questions/367749/how-to-find-and-delete-duplicate-files-within-the-same-directory
Might have something I could use, but I'm not following what's going on in there.
https://superuser.com/questions/259148/bash-find-duplicate-files-mac-linux-compatible
If it were something that returns size, instead of md5, maybe one of the answers in here?
https://unix.stackexchange.com/questions/570305/what-is-the-most-efficient-way-to-find-duplicate-files
Didn't really get answered.
TIL: Sending errors from . scriptname will close my terminal instantly. Thanks, Google!
TIL: Sending errors from scripts executed via $PATH will close the terminal if shopt -s extdebug + trap checkcommand DEBUG are set in profile to try and catch rm -r * - but at least will respect my alias for exit
TIL: Backticks deprecated, use $(things) - Ugh, so much re-writing to do :P
TIL: How to catch non-ascii characters in filenames, without using basename
TIL: "${file##*/}"
TIL: file - yes, X.pdf is not a PDF.
On the matter of POSIX
I'm afraid you cannot get the actual file size (not the number of blocks allocated by the file) in a plain posix shell without using ls. All the solutions like du --apparent-size, find -printf %s, and stat are not posix.
However, as long as your filenames don't contain linebreaks (spaces are ok) you could create safe solutions relying on ls. Correctly handling filenames with linebreaks would require very non-posix tools (like GNU sort -z) anyway.
Bash+POSIX Approach Actually Comparing The Files
I would drop the approach to compare only the file sizes and use cmp instead. For huge directories the posix script will be slow no matter what you do. Also, I expect cmp to do some fail fast checks (like comparing the file sizes) before actually comparing the file contents. For common scenarios with only a few files speed shouldn't matter anyway as even the worst script will run fast enough.
The following script places each group of actual duplicates (at least two, but can be more) into its own subdirectory of dups/. The script should work with all filenames; spaces, special symbols, and even linebreaks are ok. Note that we are still using bash (which is not posix). We just assume that all tools (like mv, find, ...) are posix.
#! /usr/bin/env bash
files=()
for f in *; do [ -f "$f" ] && files+=("$f"); done
max=${#files[@]}
for (( i = 0; i < max; i++ )); do
sameAsFileI=()
for (( j = i + 1; j < max; j++ )); do
cmp -s "${files[i]}" "${files[j]}" &&
sameAsFileI+=("${files[j]}") &&
unset 'files[j]'
done
(( ${#sameAsFileI[@]} == 0 )) && continue
mkdir -p "dups/$i/"
mv "${files[i]}" "${sameAsFileI[@]}" "dups/$i/"
# no need to unset files[i] because loops won't visit this entry again
files=("${files[@]}") # un-sparsify array
max=${#files[@]}
done
Fairly Portable Non-POSIX Approach Using File Sizes Only
If you need a faster approach that only compares the file sizes I suggest to not use a nested loop. Loops in bash are slow already, but if you nest them you have quadratic time complexity. It is faster and easier to ...
print only the file sizes without file names
apply sort | uniq -d to retrieve duplicates in time O(n log n)
Move all files having one of the duplicated sizes to a directory
This solution is not strictly POSIX-conformant. However, I tried to verify that the tools and options in this solution are supported by most implementations. Your find has to support the non-POSIX options -maxdepth and -printf with %s for the actual file size and %f for the file basename (%p for the full path would be acceptable too).
The following script places all files of the same size into the directory potential-dups/. If there are two files of size n and two files of size m, all four files end up in this single directory. The script should work with all file names except those with linebreaks (that is \n; \r should be fine though).
#! /usr/bin/env sh
all=$(find . -maxdepth 1 -type f -printf '%s %f\n' | sort)
dupRegex=$(printf %s\\n "$all" | cut -d' ' -f1 | uniq -d |
sed -e 's/[][\.|$(){}?+*^]/\\&/g' -e 's/^/^/' | tr '\n' '|' | sed 's/|$//')
[ -z "$dupRegex" ] && exit
mkdir -p potential-dups
printf %s\\n "$all" | grep -E "$dupRegex" | cut -d' ' -f2- |
sed 's/./\\&/' | xargs -I_ mv _ potential-dups
In case you wonder about some of the sed commands: They quote the file names such that spaces and special symbols are processed correctly by subsequent tools. sed 's/[][\.|$(){}?+*^]/\\&/g' is for turning raw strings into equivalent extended regular expressions (ERE) and sed 's/./\\&/' is for literal processing by xargs. See the posix documentation of xargs:
-I replstr [...] Any <blank>s at the beginning of each line shall be ignored.
[...]
Note that the quoting rules used by xargs are not the same as in the shell. [...] An easy rule that can be used to transform any string into a quoted form that xargs interprets correctly is to precede each character in the string with a backslash.
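The two approaches can also be combined: compare sizes first and run cmp only when they match. The sketch below is an illustration under stated assumptions rather than part of the scripts above. same_file_content is a hypothetical name, and the byte count comes from wc -c reading the file on standard input, which is a standard utility but may read the whole file on some implementations.

```bash
#!/usr/bin/env bash
# same_file_content A B: succeed if A and B have the same size and the same bytes.
# Hypothetical helper; sizes come from wc -c, so no ls output is parsed.
same_file_content() {
  local size_a size_b
  size_a=$(wc -c < "$1") || return 2
  size_b=$(wc -c < "$2") || return 2
  # Arithmetic expansion tolerates the leading blanks some wc versions print.
  [ "$((size_a))" -eq "$((size_b))" ] || return 1
  cmp -s -- "$1" "$2"
}

# Demo with scratch files, including a name that contains a space.
dir=$(mktemp -d)
printf 'same\n' > "$dir/The W.pdf"
printf 'same\n' > "$dir/W.pdf"
printf 'x\n' > "$dir/X.pdf"
same_file_content "$dir/The W.pdf" "$dir/W.pdf" && echo "probable duplicate"
```

The size check is only a cheap first filter; cmp still decides whether the contents really match.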

sh/bash: Find all files in a folder that start with a number followed by a _blank_

I am working on a shell script, and I have come to a point where I need to rename files that start with a number and a blank by removing this pattern and moving them to a specific folder, whose name is the string between the first and second " - ".
example :
001 - folder1 - example.doc > /folder1/example.doc
002 - folder2 - someexample.doc > /folder2/someexample.doc
003 - folder3 - someotherexample.doc > /folder3/someotherexample.doc
I want to do something like
find /tmp -name '*.doc' -print | rename .... ...
what I do not know is:
- how to tell find that the file starts with a number,
and second
- how to explode the name by a pattern like " - " and tell rename to place the file in the folder
If possible, I would avoid find and just use bash's regular expression matching. If you don't need to recursively search /tmp (as your version of find is doing), this should work in any version of bash:
regex='^/tmp/[[:digit:]]+ - (.+) - (.+)$'
for f in /tmp/*.doc; do
[[ $f =~ $regex ]] || continue
mv -- "$f" "/${BASH_REMATCH[1]}/${BASH_REMATCH[2]}"
done
If you do need to recursively search, and you are using bash 4 or later, you can use the globstar option:
shopt -s globstar
regex='^[[:digit:]]+ - (.+) - (.+)$'
for f in /tmp/**/*.doc; do
b=$(basename "$f")
[[ $b =~ $regex ]] || continue
mv -- "$f" "/${BASH_REMATCH[1]}/${BASH_REMATCH[2]}"
done
If your destination folder and file names have no spaces, and if all your original files are in the current directory, you could try something like:
while read f; do
if [[ "$f" =~ ^\./[0-9]+\ -\ ([^\ ]+)\ -\ (.+\.doc)$ ]]; then
mkdir -p "${BASH_REMATCH[1]}"
mv "$f" "${BASH_REMATCH[1]}/${BASH_REMATCH[2]}"
fi
done < <(find . -maxdepth 1 -name '[0-9]*.doc')
Explanations:
find . -maxdepth 1... restricts the search to the current directory.
find ... -name '[0-9]*.doc' matches only files whose names start with at least one digit and end with .doc.
The regular expression models your original file names (with the initial ./ that find adds). It contains two sub-expressions enclosed in (...) corresponding to the folder name and to the destination file name. These sub-expressions are stored by bash in the BASH_REMATCH array if there is a match, at positions 1 and 2, respectively.
The regular expression removes leading and trailing spaces from the destination folder name and the leading spaces from the destination file name (I assume this is what you want).
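The capture groups can be checked without moving anything. This is a minimal sketch, assuming bash; split_name is a hypothetical helper that only prints the destination path built from BASH_REMATCH, and its regular expression is a simplified variant that works on a bare file name without the leading ./.

```bash
#!/usr/bin/env bash
# split_name NAME: print "folder/file" for names like "001 - folder1 - example.doc".
# Hypothetical helper for illustration; it only prints and moves nothing.
split_name() {
  local regex='^[0-9]+ - (.+) - (.+\.doc)$'
  [[ $1 =~ $regex ]] || return 1
  printf '%s/%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
}

split_name '001 - folder1 - example.doc'     # prints folder1/example.doc
split_name 'notes.doc' || echo 'no match'    # prints no match
```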
With gawk:
find . -regex '^.*[0-9]+ - [a-zA-Z]+[0-9] - [a-zA-Z]+\.doc$' -printf '%f\n' | awk -F- '{ print "mv \""$0"\" /"gensub(" ","","g",$2)"/"gensub(" ","","g",$3) }'
Use find with regular expressions, print one file name per line, and pipe the output through awk to generate the move commands.
When you have verified that the commands are OK, run them by piping the output to sh:
find . -regex '^.*[0-9]+ - [a-zA-Z]+[0-9] - [a-zA-Z]+\.doc$' -printf '%f\n' | awk -F- '{ print "mv \""$0"\" /"gensub(" ","","g",$2)"/"gensub(" ","","g",$3) }' | sh
Be wary that such a strategy can be open to a risk of command injection.
the answer is very close to the suggestion made by Raman, I removed printf, added the part that creates the folder and added the part removing the leading and the trailing blank.
in the end it looks like this:
find . -regex '^.*[0-9].*\.doc$' | awk -F- '{ print "mkdir \""gensub(" $","","g",gensub("^ ","","g",$2))"\" \nmv \""$f"\" \"./"gensub(" $","","g",gensub("^ ","","g",$2))"/"gensub(" $","","g",gensub("^ ","","g",$2"-"$3))"\"" }'|sh
thank you everybody for the suggestions.

How to find Disk usage in UNIX? [duplicate]

So, in many situations I wanted a way to know how much of my disk space is used by what, so I know what to get rid of, convert to another format, store elsewhere (such as data DVDs), move to another partition, etc. In this case I'm looking at a Windows partition from a SliTaz Linux bootable media.
In most cases, what I want is the size of files and folders, and for that I use NCurses-based ncdu.
But in this case, I want a way to get the size of all files matching a regex. An example regex for .bak files:
.*\.bak$
How do I get that information, considering a standard Linux with core GNU utilities or BusyBox?
Edit: The output is intended to be parseable by a script.
I suggest something like: find . -regex '.*\.bak' -print0 | du --files0-from=- -ch | tail -1
Some notes:
The -print0 option for find and --files0-from for du are there to avoid issues with whitespace in file names
The regular expression is matched against the whole path, e.g. ./dir1/subdir2/file.bak, not just file.bak, so if you modify it, take that into account
I used h flag for du to produce a "human-readable" format but if you want to parse the output, you may be better off with k (always use kilobytes)
If you remove the tail command, you will additionally see the sizes of particular files and directories
Sidenote: a nice GUI tool for finding out who ate your disk space is FileLight. It doesn't do regexes, but is very handy for finding big directories or files clogging your disk.
du is my favorite answer. If you have a fixed filesystem structure, you can use:
du -hc *.bak
If you need to add subdirs, just add:
du -hc *.bak **/*.bak **/**/*.bak
etc etc
However, this isn't a very useful command, so using your find:
TOTAL=0;for I in $(find . -name \*.bak); do TOTAL=$((TOTAL+$(du $I | awk '{print $1}'))); done; echo $TOTAL
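That will echo the total size of all of the files you find, in du's default block units (1024-byte blocks with GNU du, 512-byte blocks on BSD/macOS unless BLOCKSIZE is set), not in bytes.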
Hope that helps.
Run this in Bash (the $'\n' quoting is not understood by a plain Bourne shell) to declare a function that calculates the sum of sizes of all the files matching a regex pattern in the current directory:
sizeofregex() { IFS=$'\n'; for x in $(find . -regex "$1" 2> /dev/null); do du -sk "$x" | cut -f1; done | awk '{s+=$1} END {print s}' | sed 's/^$/0/'; unset IFS; }
(Alternatively, you can put it in a script.)
Usage:
cd /where/to/look
sizeofregex 'myregex'
The result will be a number (in KiB), including 0 (if there are no files that match your regex).
If you do not want it to look in other filesystems (say you want to look for all .so files under /, which is a mount of /dev/sda1, but not under /home, which is a mount of /dev/sdb1), add the -xdev option to find in the function above.
The previous solutions didn't work properly for me (I had trouble piping du) but the following worked great:
find path/to/directory -iregex ".*\.bak$" -exec du -csh '{}' + | tail -1
The iregex option is a case insensitive regular expression. Use regex if you want it to be case sensitive.
If you aren't comfortable with regular expressions, you can use the iname or name flags (the former being case insensitive):
find path/to/directory -iname "*.bak" -exec du -csh '{}' + | tail -1
In case you want the size of every match (rather than just the combined total), simply leave out the piped tail command:
find path/to/directory -iname "*.bak" -exec du -csh '{}' +
These approaches avoid the subdirectory problem in @MaddHackers' answer.
Hope this helps others in the same situation (in my case, finding the size of all DLL's in a .NET solution).
If you're OK with glob-patterns and you're only interested in the current directory:
stat -c "%s" *.bak | awk '{sum += $1} END {print sum}'
or
sum=0
while read size; do (( sum += size )); done < <(stat -c "%s" *.bak)
echo $sum
The %s directive to stat gives bytes not kilobytes.
If you want to descend into subdirectories, with bash version 4, you can shopt -s globstar and use the pattern **/*.bak
The accepted reply suggests to use
find . -regex '.*\.bak' -print0 | du --files0-from=- -ch | tail -1
but that doesn't work on my system, as du there doesn't know the --files0-from option. Only GNU du knows that option; it's not part of the POSIX standard (so you won't find it in FreeBSD or macOS), nor will you find it on BusyBox-based Linux systems (e.g. most embedded Linux systems) or any other Linux system that does not use the GNU du version.
Then there's a reply suggesting to use:
find path/to/directory -iregex ".*\.bak$" -exec du -csh '{}' + | tail -1
This solution works as long as there aren't too many files found. The + means that find will try to call du with as many hits as possible in a single call. However, a system supports only a limited number of arguments (N) per call; if there are more hits than that, find calls du multiple times, splitting the hits into groups of at most N items each. In that case the result is wrong, because tail -1 shows only the total of the last du call.
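One way around the batching problem is to drop -c and tail, and add up the per-file lines instead, so that it no longer matters how many times find invokes du. This is a minimal sketch, assuming file names without newlines; total_kib is a hypothetical name, and -name with a glob stands in for -iregex.

```bash
#!/usr/bin/env bash
# total_kib DIR PATTERN: sum the disk usage (in KiB, as reported by du -k)
# of regular files under DIR whose names match the glob PATTERN.
# Hypothetical helper; every du output line is summed, so several du
# batches give the same result as a single one.
total_kib() {
  find "$1" -type f -name "$2" -exec du -k {} + |
    awk '{s += $1} END {print s + 0}'
}
```

It could be called as total_kib path/to/directory '*.bak'; the pattern is quoted so that find, not the shell, expands it. The result is a plain number, which also suits the request for script-parseable output.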
Finally there is an answer using stat and awk, which is a nice way to do it, but it relies on shell globbing in a way that only Bash 4.x or later supports. It will not work with older versions, and whether it works with other shells is unpredictable.
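A more portable solution that doesn't suffer from the argument limit and doesn't depend on the shell would be the following. Note that it is not strictly POSIX: neither -regex nor stat is specified there, and stat -f "%z" is the BSD/macOS syntax, while GNU stat on Linux uses stat -c "%s" instead: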
find . -regex '.*\.bak' -exec stat -f "%z" {} \; | awk '{s += $1} END {print s}'
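Because the stat options differ between GNU and BSD systems, another possibility is to avoid stat altogether. The following sketch sums apparent sizes in bytes with wc -c; total_bytes is a hypothetical name, -name with a glob stands in for -regex, and reading each file this way can be slow for large files.

```bash
#!/usr/bin/env bash
# total_bytes DIR PATTERN: print the total size in bytes of regular files
# under DIR whose names match the glob PATTERN (0 if none match).
# Hypothetical helper; it relies only on find -exec, sh, wc and awk.
total_bytes() {
  find "$1" -type f -name "$2" -exec sh -c '
    for f do
      wc -c < "$f"          # one byte count per file, no file name in the output
    done' sh {} + |
    awk '{s += $1} END {print s + 0}'
}
```

Since no file name ever reaches the pipe, names with spaces or newlines do not disturb the sum, and batching by find is harmless because every line is added up.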

Is there a bash command which counts files?

Is there a bash command which counts the number of files that match a pattern?
For example, I want to get the count of all files in a directory which match this pattern: log*
This simple one-liner should work in any shell, not just bash:
ls -1q log* | wc -l
ls -1q will give you one line per file, even if they contain whitespace or special characters such as newlines.
The output is piped to wc -l, which counts the number of lines.
Lots of answers here, but some don't take into account
file names with spaces, newlines, or control characters in them
file names that start with hyphens (imagine a file called -l)
hidden files, that start with a dot (if the glob was *.log instead of log*)
directories that match the glob (e.g. a directory called logs that matches log*)
empty directories (i.e. the result is 0)
extremely large directories (listing them all could exhaust memory)
Here's a solution that handles all of them:
ls 2>/dev/null -Ubad1 -- log* | wc -l
Explanation:
-U causes ls to not sort the entries, meaning it doesn't need to load the entire directory listing in memory
-b prints C-style escapes for nongraphic characters, crucially causing newlines to be printed as \n.
-a prints out all files, even hidden files (not strictly needed when the glob log* implies no hidden files)
-d prints out directories without attempting to list the contents of the directory, which is what ls normally would do
-1 makes sure that it's on one column (ls does this automatically when writing to a pipe, so it's not strictly necessary)
2>/dev/null redirects stderr so that if there are 0 log files, ignore the error message. (Note that shopt -s nullglob would cause ls to list the entire working directory instead.)
wc -l consumes the directory listing as it's being generated, so the output of ls is never in memory at any point in time.
-- File names are separated from the command using -- so as not to be understood as arguments to ls (in case log* is removed)
The shell will expand log* to the full list of files, which may exhaust memory if it's a lot of files, so running ls without a glob and filtering with grep is better:
ls -Uba1 | grep ^log | wc -l
This last one handles extremely large directories of files without using a lot of memory (albeit it does use a subshell). The -d is no longer necessary, because it's only listing the contents of the current directory.
You can do this safely (i.e. won't be bugged by files with spaces or \n in their name) with bash:
$ shopt -s nullglob
$ logfiles=(*.log)
$ echo ${#logfiles[@]}
You need to enable nullglob so that you don't get the literal *.log in the $logfiles array if no files match. (See How to "undo" a 'set -x'? for examples of how to safely reset it.)
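One way to keep the option change local is to save and replay the shopt state inside a function. This is a minimal sketch, assuming bash; count_glob is a hypothetical name, and the pattern has to be passed quoted so that it is expanded inside the function rather than by the caller.

```bash
#!/usr/bin/env bash
# count_glob PATTERN: print how many paths match PATTERN (0 if none),
# enabling nullglob only for the duration of the call.
# Hypothetical helper; PATTERN must not contain spaces, because it is
# expanded unquoted on purpose.
count_glob() {
  local restore matches
  restore=$(shopt -p nullglob) || true   # e.g. "shopt -u nullglob"; replayed below
  shopt -s nullglob
  matches=($1)
  eval "$restore"
  echo "${#matches[@]}"
}
```

A call such as count_glob '*.log' leaves nullglob in whatever state it had before.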
For a recursive search:
find . -type f -name '*.log' -printf x | wc -c
wc -c will count the number of characters in the output of find, while -printf x tells find to print a single x for each result. This avoids any problems with files with odd names which contain newlines etc.
For a non-recursive search, do this:
find . -maxdepth 1 -type f -name '*.log' -printf x | wc -c
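The -printf action is a GNU find extension. Where it is unavailable, the same idea can be approximated by letting find run the external printf utility, which repeats its format once per argument. This is a minimal sketch, assuming a printf program is on the PATH; count_matching is a hypothetical name.

```bash
#!/usr/bin/env bash
# count_matching DIR PATTERN: count regular files under DIR (recursively)
# whose names match the glob PATTERN, without printing any file name.
# Hypothetical helper; '%.0sx' prints zero characters of each argument
# followed by one literal x.
count_matching() {
  local n
  n=$(find "$1" -type f -name "$2" -exec printf '%.0sx' {} + | wc -c)
  echo "$((n))"      # arithmetic expansion strips blanks some wc versions add
}
```

For example, count_matching . '*.log' counts recursively; as with -printf x, file names containing newlines are counted once because the names themselves are never written to the pipe.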
The accepted answer for this question is wrong, but I have low rep so can't add a comment to it.
The correct answer to this question is given by Mat:
shopt -s nullglob
logfiles=(*.log)
echo ${#logfiles[@]}
The problem with the accepted answer is that wc -l counts the number of newline characters, and counts them even if they print to the terminal as '?' in the output of 'ls -l'. This means that the accepted answer FAILS when a filename contains a newline character. I have tested the suggested command:
ls -l log* | wc -l
and it erroneously reports a value of 2 even if there is only 1 file matching the pattern whose name happens to contain a newline character. For example:
touch log$'\n'def
ls log* -l | wc -l
An important comment
(not enough reputation to comment)
This is BUGGY:
ls -1q some_pattern | wc -l
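If shopt -s nullglob happens to be set and nothing matches, the pattern expands to nothing, so ls lists the whole current directory and the command prints the number of ALL files there, not just the ones matching the pattern (tested on CentOS-8 and Cygwin). Who knows what other meaningless bugs ls has?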
This is CORRECT and much faster:
shopt -s nullglob; files=(some_pattern); echo ${#files[@]};
It does the expected job.
And the running times differ.
The 1st: 0.006 on CentOS, and 0.083 on Cygwin (in case it is used with care).
The 2nd: 0.000 on CentOS, and 0.003 on Cygwin.
If you have a lot of files and you don't want to use the elegant shopt -s nullglob and bash array solution, you can use find and so on as long as you don't print out the file name (which might contain newlines).
find -maxdepth 1 -name "log*" -not -name ".*" -printf '%i\n' | wc -l
This will find all files that match log* and that don't start with a dot. The -not -name ".*" is redundant here, but it's important to note that ls hides dot-files by default while find includes them.
This is a correct answer, and handles any type of file name you can throw at it, because the file name is never passed around between commands.
But, the shopt nullglob answer is the best answer!
Here is my one liner for this.
file_count=$( shopt -s nullglob ; set -- "$directory_to_search_inside"/* ; echo $#)
You can define such a command easily, using a shell function. This method does not require any external program and does not spawn any child process. It does not attempt hazardous ls parsing and handles “special” characters (whitespaces, newlines, backslashes and so on) just fine. It only relies on the file name expansion mechanism provided by the shell. It is compatible with at least sh, bash and zsh.
The line below defines a function called count which prints the number of arguments with which it has been called.
count() { echo $#; }
Simply call it with the desired pattern:
count log*
For the result to be correct when the globbing pattern has no match, the shell option nullglob (or failglob — which is the default behavior on zsh) must be set at the time expansion happens. It can be set like this:
shopt -s nullglob # for sh / bash
setopt nullglob # for zsh
Depending on what you want to count, you might also be interested in the shell option dotglob.
Unfortunately, with bash at least, it is not easy to set these options locally. If you don’t want to set them globally, the most straightforward solution is to use the function in this more convoluted manner:
( shopt -s nullglob ; shopt -u failglob ; count log* )
If you want to recover the lightweight syntax count log*, or if you really want to avoid spawning a subshell, you may hack something along the lines of:
# sh / bash:
# the alias is expanded before the globbing pattern, so we
# can set required options before the globbing gets expanded,
# and restore them afterwards.
count() {
eval "$_count_saved_shopts"
unset _count_saved_shopts
echo $#
}
alias count='
_count_saved_shopts="$(shopt -p nullglob failglob)"
shopt -s nullglob
shopt -u failglob
count'
As a bonus, this function is of a more general use. For instance:
count a* b* # count files which match either a* or b*
count $(jobs -ps) # count stopped jobs (sh / bash)
By turning the function into a script file (or an equivalent C program), callable from the PATH, it can also be composed with programs such as find and xargs:
find "$FIND_OPTIONS" -exec count {} \+ # count results of a search
You can use the -R option to find the files along with those inside the recursive directories
ls -R | wc -l # to find all the files
ls -R | grep log | wc -l # to find the files whose names contain the word log
you can use patterns on the grep
I've given this answer a lot of thought, especially given the don't-parse-ls stuff. At first, I tried
<WARNING! DID NOT WORK>
du --inodes --files0-from=<(find . -maxdepth 1 -type f -print0) | awk '{sum+=int($1)}END{print sum}'
</WARNING! DID NOT WORK>
which worked if there was only a filename like
touch $'w\nlf.aa'
but failed if I made a filename like this
touch $'firstline\n3 and some other\n1\n2\texciting\n86stuff.jpg'
I finally came up with what I'm putting below. Note I was trying to get a count of all files in the directory (not including any subdirectories). I think it works, as do the answers by @Mat and @Dan_Yard, and it meets at least most of the requirements set out by @mogsie (I'm not sure about memory). I think the answer by @mogsie is correct, but I always try to stay away from parsing ls unless it's an extremely specific situation.
awk -F"\0" '{print NF-1}' < <(find . -maxdepth 1 -type f -print0) | awk '{sum+=$1}END{print sum}'
More readably:
awk -F"\0" '{print NF-1}' < \
<(find . -maxdepth 1 -type f -print0) | \
awk '{sum+=$1}END{print sum}'
This runs find specifically for files, delimiting the output with null characters (to avoid problems with spaces and linefeeds), then counts the null characters. Each file name is followed by exactly one null character, so the number of files equals the number of null characters, which is one less than the number of fields (NF) awk sees on each line.
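Not every awk handles null bytes in its input, so the same count can also be sketched with tr and wc, which only need to delete everything except the terminators. This is a minimal sketch; count_files_nul is a hypothetical name, and a find that supports -maxdepth and -print0 is assumed.

```bash
#!/usr/bin/env bash
# count_files_nul DIR: count regular files directly inside DIR by counting
# the NUL terminators that find -print0 emits, one per file.
# Hypothetical helper; -maxdepth and -print0 are assumed to be supported.
count_files_nul() {
  local n
  n=$(find "$1" -maxdepth 1 -type f -print0 | tr -cd '\000' | wc -c)
  echo "$((n))"
}
```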
To answer the OP's question, there are two cases to consider
1) Non-recursive search:
awk -F"\0" '{print NF-1}' < \
<(find . -maxdepth 1 -type f -name "log*" -print0) | \
awk '{sum+=$1}END{print sum}'
2) Recursive search. Note that what's inside the -name parameter might need to be changed for slightly different behavior (hidden files, etc.).
awk -F"\0" '{print NF-1}' < \
<(find . -type f -name "log*" -print0) | \
awk '{sum+=$1}END{print sum}'
If anyone would like to comment on how these answers compare to those I've mentioned in this answer, please do.
Note, I got to this thought process while getting this answer.
This can be done with standard POSIX shell grammar.
Here is a simple count_entries function:
#!/usr/bin/env sh
count_entries()
{
# Emulating Bash nullglob
# If argument 1 is not an existing entry
if [ ! -e "$1" ]
# argument is a returned pattern
# then shift it out
then shift
fi
echo $#
}
for a compact definition:
count_entries(){ [ ! -e "$1" ]&&shift;echo $#;}
Featured POSIX compatible file counter by type:
#!/usr/bin/env sh
count_files()
# Count the file arguments matching the file operator
# Synopsis:
# count_files operator FILE [...]
# Arguments:
# $1: The file operator
# Allowed values:
# -a FILE True if file exists.
# -b FILE True if file is block special.
# -c FILE True if file is character special.
# -d FILE True if file is a directory.
# -e FILE True if file exists.
# -f FILE True if file exists and is a regular file.
# -g FILE True if file is set-group-id.
# -h FILE True if file is a symbolic link.
# -L FILE True if file is a symbolic link.
# -k FILE True if file has its `sticky' bit set.
# -p FILE True if file is a named pipe.
# -r FILE True if file is readable by you.
# -s FILE True if file exists and is not empty.
# -S FILE True if file is a socket.
# -t FD True if FD is opened on a terminal.
# -u FILE True if the file is set-user-id.
# -w FILE True if the file is writable by you.
# -x FILE True if the file is executable by you.
# -O FILE True if the file is effectively owned by you.
# -G FILE True if the file is effectively owned by your group.
# -N FILE True if the file has been modified since it was last read.
# $@: The files arguments
# Output:
# The number of matching files
# Return:
# 1: Unknown file operator
{
operator=$1
shift
case $operator in
-[abcdefghLkprsStuwxOGN])
for arg; do
# If file is not of required type
if ! test "$operator" "$arg"; then
# Drop one positional parameter so that $# shrinks by one
shift
fi
done
echo $#
;;
*)
printf 'Invalid file operator: %s\n' "$operator" >&2
return 1
;;
esac
}
count_files "$@"
Example usages:
count_files -f log*.txt
count_files -d datadir*
Alternative: count non-directory entries without a loop:
#!/bin/sh
# Creates strings of as many dots as expanded arguments
# dotted string for entries matching star pattern
star=$(printf '%.0s.' ./*)
# dotted string for entries matching star slash pattern (directories)
star_dir=$(printf '%.0s.' ./*/)
# dotted string for entries matching dot star pattern
dot_star=$(printf '%.0s.' ./.*)
# dotted string for entries matching dot star slash pattern (directories)
dot_star_dir=$(printf '%.0s.' ./.*/)
# Print pattern matches count excluding directories matches
printf 'Files count: %d\n' $((
${#star} - ${#star_dir} +
${#dot_star} - ${#dot_star_dir}
))
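The trick in the script above is the printf format: %.0s consumes one argument but prints zero characters of it, and the literal dot that follows is printed once per argument. A minimal illustration with three placeholder words instead of file names:

```shell
#!/bin/sh
# One dot is emitted per argument; the arguments themselves are not printed.
dots=$(printf '%.0s.' one two three)
echo "$dots"
# The string length is therefore the argument count.
echo "${#dots}"
```

Keep in mind that a pattern matching nothing is normally passed through literally, so it still contributes one dot.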
Here is a generic Bash function you can use in your scripts.
# @see https://stackoverflow.com/a/11307382/430062
function countFiles {
shopt -s nullglob
logfiles=($1)
echo ${#logfiles[@]}
}
FILES_COUNT=$(countFiles "$file-*")
ls -1 log* | wc -l
That is, list one file per line (-1) and pipe the result to wc, whose -l option makes it count lines.
Here's what I always do:
ls log* | awk 'END{print NR}'
To count everything, just pipe ls to wc -l:
ls | wc -l
To count with pattern, pipe to grep first:
ls | grep log | wc -l

How can I process a list of files that includes spaces in its names in Unix?

I'm trying to list the files in a directory and do something to them in the Mac OS X prompt.
It should go like this: for f in $(ls -1); do echo $f; done
If I have files without spaces in their names (fileA.txt, fileB.txt), the echo works fine.
If the files include spaces in their names ("file A.txt", "file B.txt"), I get 4 strings (file, A.txt, file, B.txt).
I've tried quoting the listing command, but it only changed the problem.
If I do this: for f in "$(ls -1)"; do echo $f; done
I get: file A.txt\nfile B.txt
(It displays correctly, but it is a single string and I need the 2 lines separated.)
Step away from ls if at all possible. Use find from the findutils package.
find /target/path -type f -print0 | xargs -0 your_command_here
-print0 will cause find to output the names separated by NUL characters (ASCII zero). The -0 argument to xargs tells it to expect the arguments separated by NUL characters too, so everything will work just fine.
Replace /target/path with the path under which your files are located.
-type f will only locate files. Use -type d for directories, or omit altogether to get both.
Replace your_command_here with the command you'll use to process the file names. (Note: If you run this from a shell using echo for your_command_here you'll get everything on one line - don't get confused by that shell artifact, xargs will do the expected right thing anyway.)
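To check that each name really arrives as one argument, here is a small sketch that stands in for your_command_here with an inline sh script. The scratch directory and the two file names are assumptions made for the demonstration.

```shell
#!/bin/sh
demo=$(mktemp -d)   # hypothetical scratch directory
touch "$demo/file A.txt" "$demo/file B.txt"

# Each NUL-terminated name reaches the inner shell as exactly one argument.
# ${f##*/} strips the directory part so only the file name is printed.
names=$(find "$demo" -type f -print0 |
    xargs -0 sh -c 'for f in "$@"; do printf "<%s>\n" "${f##*/}"; done' sh |
    sort)
echo "$names"

rm -rf "$demo"
```

The angle brackets make the argument boundaries visible: a name split at its space would show up as two bracketed pieces.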
Edit: Alternatively (or if you don't have xargs), you can use the much less efficient
find /target/path -type f -exec your_command_here \{\} \;
\{\} \; is the escaped form of {} ; where {} is the placeholder for the currently processed file and ; marks the end of the command. find will then invoke your_command_here with {} replaced by the file name, and since your_command_here is launched by find and not by the shell, the spaces won't matter.
The second version is less efficient because find launches a new process for each and every file found, whereas xargs passes as many file names as fit on one command line to each process it starts. Prefer the xargs version if you have the choice.
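If xargs is not available, find can do similar batching on its own: ending -exec with + instead of \; appends many names to a single invocation. A minimal sketch, again with a made-up scratch directory, where the inline sh script simply reports how many names it received:

```shell
#!/bin/sh
demo=$(mktemp -d)   # hypothetical scratch directory
touch "$demo/file A.txt" "$demo/file B.txt" "$demo/file C.txt"

# With "+", find collects the names and runs the command once for the batch.
report=$(find "$demo" -type f -exec sh -c 'echo "$# names in one call"' sh {} +)
echo "$report"

rm -rf "$demo"
```

With \; in place of + the same command would run three times, each time reporting a single name.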
for f in *; do echo "$f"; done
should do what you want. Why are you using ls instead of * ?
In general, dealing with spaces in shell is a PITA. Take a look at the $IFS variable, or better yet at Perl, Ruby, Python, etc.
Here's an answer using $IFS as discussed by derobert
http://www.cyberciti.biz/tips/handling-filenames-with-spaces-in-bash.html
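For readers who cannot follow the link, this is a minimal sketch of the $IFS idea. It is not copied from the linked page, and it assumes that no file name contains a newline. The scratch directory and names are invented.

```shell
#!/bin/sh
demo=$(mktemp -d)   # hypothetical scratch directory
touch "$demo/file A.txt" "$demo/file B.txt"

old_ifs=$IFS        # assumes IFS was set, as it is by default
IFS='
'                   # split on newlines only, not on spaces
set -f              # keep names like "*" from being glob-expanded
count=0
for f in $(ls -1 "$demo"); do
    count=$((count + 1))
    echo "entry: $f"
done
set +f
IFS=$old_ifs
echo "$count"

rm -rf "$demo"
```

Restoring IFS afterwards matters, because every later unquoted expansion in the script would otherwise split on newlines only.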
You can pipe the arguments into read. For example, to cat all files in the directory:
ls -1 | while IFS= read -r FILENAME; do cat "$FILENAME"; done
This means you can still use ls, as you have in your question, or any other command that produces newline-delimited output. IFS= and -r keep read from trimming leading or trailing whitespace and from interpreting backslashes; names that contain a newline will still break.
The while loop makes it much easier to do several things to the argument, and makes complex processing more readable in my opinion. A contrived example:
ls -1 | while IFS= read -r FILE
do
echo 1: "$FILE"
echo 2: "$FILE"
done
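One caveat with read is worth seeing in action. The sketch below feeds the same made-up name, with surrounding spaces and a backslash, to a bare read and to IFS= read -r, and prints what each one stored.

```shell
#!/bin/sh
name='  a\b  '   # invented name: leading/trailing spaces and a backslash

# Bare read trims the surrounding whitespace and consumes the backslash.
plain=$(printf '%s\n' "$name" | { read x; printf '%s' "$x"; })
# IFS= read -r keeps the line exactly as it was written.
raw=$(printf '%s\n' "$name" | { IFS= read -r x; printf '%s' "$x"; })

printf 'read:         [%s]\n' "$plain"
printf 'IFS= read -r: [%s]\n' "$raw"
```

Only the second form hands the loop body a name that still refers to the original file.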
Look at the --quoting-style option of GNU ls.
For instance, --quoting-style=c would produce:
$ ls --quoting-style=c
"file1" "file2" "dir one"
Check out the manpage for xargs:
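It works like this (note that xargs splits its input on whitespace by default, so names containing spaces need the -print0 / -0 form shown earlier):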
ls -1 /tmp/*.jpeg | xargs rm

Resources