How to count all the human readable files in Bash? - bash

I'm taking an intro course to UNIX and have a homework question that follows:
How many files in the previous question are text files? A text file is any file containing human-readable content. (TRICK QUESTION. Run the file command on a file to see whether the file is a text file or a binary data file! If you simply count the number of files with the .txt extension you will get no points for this question.)
The previous question simply asked how many regular files there were, which was easy to figure out by doing find . -type f | wc -l.
I'm just having trouble determining what "human readable content" is, since I'm assuming it means anything besides binary/assembly, but I thought that's what -type f displays. Maybe that's what the professor meant by saying "trick question"?
This question has a follow up later that also asks "What text files contain the string "csc" in any mix of upper and lower case?". Obviously "text" is referring to more than just .txt files, but I need to figure out the first question to determine this!

Quotes added for clarity:
Run the "file" command on a file to see whether the file is a text file or a binary data file!
The file command will inspect files and tell you what kind of file they appear to be. The word "text" will (almost) always be in the description for text files.
For example:
desktop.ini: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
tw2-wasteland.jpg: JPEG image data, JFIF standard 1.02
So the first part is asking you to run the file command and parse its output.
I'm just having trouble determining what "human readable content" is, since I'm assuming it means anything besides binary/assembly, but I thought that's what -type f displays.
find -type f finds files. It filters out other filesystem objects like directories, symlinks, and sockets. It will match any type of file, though: binary files, text files, anything.
Maybe that's what the professor meant by saying "trick question"?
It sounds like he's just saying don't do find -name '*.txt' or some such command to find text files. Don't assume a particular file extension. File extensions have much less meaning in UNIX than they do in Windows. Lots of files don't even have file extensions!
I'm thinking the professor wants us to be able to run the file command on all files and count the number of ones with 'text' in it.
How about a multi-part answer? I'll give the straightforward solution in #1, which is probably what your professor is looking for. And if you are interested I'll explain its shortcomings and how you can improve upon it.
One way is to use xargs, if you've learned about that. xargs runs another command, using the data from stdin as that command's arguments.
$ find . -type f | xargs file
./netbeans-6.7.1.desktop: ASCII text
./VMWare.desktop: a /usr/bin/env xdg-open script text executable
./VMWare: cannot open `./VMWare' (No such file or directory)
(copy).desktop: cannot open `(copy).desktop' (No such file or directory)
./Eclipse.desktop: a /usr/bin/env xdg-open script text executable
That works. Sort of. It'd be good enough for a homework assignment. But not good enough for a real world script.
Notice how it broke on the file VMWare (copy).desktop because it has a space in it. This is due to xargs's default behavior of splitting the arguments on whitespace. We can fix that by using xargs -0 to split command arguments on NUL characters instead of whitespace. File names can't contain NUL characters, so this will be able to handle anything.
$ find . -type f -print0 | xargs -0 file
./netbeans-6.7.1.desktop: ASCII text
./VMWare.desktop: a /usr/bin/env xdg-open script text executable
./VMWare (copy).desktop: a /usr/bin/env xdg-open script text executable
./Eclipse.desktop: a /usr/bin/env xdg-open script text executable
This is good enough for a production script, and is something you'll encounter a lot. But I personally prefer an alternative syntax which doesn't require a pipe: find's own -exec action.
$ find . -type f -exec file {} \;
./netbeans-6.7.1.desktop: ASCII text
./VMWare.desktop: a /usr/bin/env xdg-open script text executable
./VMWare (copy).desktop: a /usr/bin/env xdg-open script text executable
./Eclipse.desktop: a /usr/bin/env xdg-open script text executable
To understand that: -exec runs file once per file found, replacing {} with each file name. The escaped semi-colon \; marks the end of the file command. (If your find supports it, -exec file {} + batches many file names into each invocation, the same way xargs does.)
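To actually answer the homework question you still need a count. Here is a minimal sketch, not necessarily the exact form your professor expects: it assumes file prints one "name: description" line per file, strips the names first so a file called notext.bin can't cause a false match, and counts the descriptions containing "text".
find . -type f -exec file {} + |
    cut -d: -f2- |     # keep only the description (breaks if a name contains ':')
    grep -c text       # count lines whose description mentions "text"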

There's a nice and easy way to determine whether a file is a human-readable text file: just use file --mime-type <filename> and look for 'text/plain'. It works whether the file has no extension at all or an extension other than .txt.
So you would do something like:
fileTotal=0
# read find's output line by line; process substitution keeps the loop
# (and fileTotal) in the current shell, and the quoting handles spaces
while IFS= read -r file; do
    mime=$(file --brief --mime-type "$file")
    if [ "$mime" = "text/plain" ]; then
        fileTotal=$(( fileTotal + 1 ))
        echo "$fileTotal - $file"
    fi
done < <(find "$YOUR_DIR" -type f)
echo "$fileTotal human readable files found!"
and the output would be something like:
1 - /sampledir/samplefile
2 - /sampledir/anothersamplefile
....
23 human readable files found!
If you want to take it further to more MIME types that are human readable (e.g. do HTML and/or XML count?), have a look at http://www.feedforall.com/mime-types.htm
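For instance, a hedged sketch that counts anything whose MIME type starts with text/ (so text/html, text/xml, text/csv and so on all count), assuming your file supports --brief and --mime-type:
find "$YOUR_DIR" -type f -exec file --brief --mime-type {} + |
    grep -c '^text/'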

Related

how to list just one file from a (bash) shell directory listing

A bit lowly a query but here goes:
bash shell script. POSIX, Mint 21
I just want one/any (mp3) file from a directory. As a sample.
In normal execution, a full run, the code would be such
for f in *.mp3
do
    #statements
done
This works fine but if I wanted to sample just one file of such an array/glob (?) without looping, how might I do that? I don't care which file, just that it is an mp3 from the directory I am working in.
Should I just start this for-loop and then exit(break) after one statement, or is there a neater way more tailored-for-the-job way?
for f in *.mp3
do
    #statement
    break
done
Ta (can not believe how dopey I feel asking this one, my forehead will hurt when I see the answers )
Since you are using Linux (Mint) you've got GNU find so one way to get one .mp3 file from the current directory is:
mp3file=$(find . -maxdepth 1 -mindepth 1 -name '*.mp3' -printf '%f' -quit)
-maxdepth 1 -mindepth 1 causes the search to be restricted to one level under the current directory.
-printf '%f' prints just the filename (e.g. foo.mp3). The -print option would print the path to the filename (e.g. ./foo.mp3). That may not matter to you.
-quit causes find to exit as soon as one match is found and printed.
Another option is to use the Bash : (colon) command and $_ (dollar underscore) special variable:
: *.mp3
mp3file=$_
: *.mp3 runs the : command with the list of .mp3 files in the current directory as arguments. The : command ignores its arguments and does nothing.
mp3file=$_ sets the value of the mp3file variable to the last argument supplied to the previous command (:).
The second option should not be used if the number of .mp3 files is large (hundreds or more) because it will find all of the files and sort them by name internally.
In both cases $mp3file should be checked to ensure that it really exists (e.g. [[ -e $mp3file ]]) before using it for anything else, in case there are no .mp3 files in the directory.
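Putting the two pieces together, a minimal sketch (assumes bash with its default globbing behaviour, where an unmatched glob is passed through literally):
: *.mp3
mp3file=$_
# if there were no .mp3 files, $mp3file holds the literal pattern,
# so the existence test below fails cleanly
if [[ -e $mp3file ]]; then
    echo "sample file: $mp3file"
else
    echo "no .mp3 files found" >&2
fi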
I would do it like this in POSIX shell:
mp3file=
for f in *.mp3; do
if [ -f "$f" ]; then
mp3file=$f
break
fi
done
# At this point, the variable mp3file contains a filename which
# represents a regular file (or a symbolic link) with the .mp3
# extension, or empty string if there is no such a file.
The fact that you use
for f in *.mp3
suggests to me that the MP3s are named without too many strange characters in their filenames.
In that case, if you really don't care which MP3, you could:
f=$(ls *.mp3 | head -n 1)
statement
Or, if you want a different one every time:
f=$(ls *.mp3|sort -R | tail -1)
Note: if your filenames get more complicated (including spaces or other special characters), this will not work anymore.
Assuming you don't have spaces in your filenames (and personally I don't understand why the collective taboo is against using ls in scripts at all, rather than against having spaces in filenames), then:
ls *.mp3 | tr ' ' '\n' | sed -n '1p'

Output file empty for Bash script that does "find" using GNU sed (gsed)

I have many files, each in a directory. My script should:
Find a string in a file. Let's say the file is called "results" and the string is "average."
Then append everything else on the string's line to another file called "allResults." After running the script, the file "allResults" should contain as many lines as there are "results" files, like
allResults.txt (what I want):
Everything on the same line as the string, "average" in directory1/results
Everything on the same line as the string, "average" in directory2/results
Everything on the same line as the string, "average" in directory3/results
...
Everything on the same line as the string, "average" in directory-i/results
My script can find what I need. I have checked by doing a "cat" on "allResults.txt" as the script is working and an "ls -l" on the parent directory of "allResults.txt." I.e., I can see the output of the "find" on my screen and the size of "allResults.txt" increases briefly, then goes back to 0. The problem is that "allResults.txt" is empty when the script has finished. So the results of the "find" are not being appended/added to "allResults.txt." They're being overwritten.
Here is my script (I use "gsed", GNU sed, because I'm a Mac OSX Sierra user):
#!/bin/bash
# Loop over all directories, find.
let allsteps=100000
for ((step=0; step <= allsteps; step++)); do
    i=$((step));
    findme="average"
    find ${i}/experiment-1/results.dat -type f -exec gsed -n -i "s/${findme}//p" {} \; >> allResults.txt
done
Please note that I have used ">>" in my example here because I read that it appends (which is what I want--a list of all lines matching my "find" from all files), whereas ">" overwrites. However, in both cases (when I use ">" or ">>"), I end up with an empty allResults.txt file.
The reason allResults.txt ends up empty is the -i flag: with -i, gsed edits each results.dat in place and sends nothing to standard output, so there is nothing for >> (or >) to capture. Worse, -n together with -i and the p flag rewrites each results.dat so it contains only the matching lines, with "average" stripped out. grep's default behavior is to print matching lines to standard output, and it never touches the input files, so using sed here is overkill.
You also don't need an explicit loop. Indeed, excess looping is a common trope programmers tend to import from other languages where looping is common. Most shell commands and constructs accept multiple file names.
grep average */experiment-1/results.dat > allResults.txt
What's nice about this is the output file is only opened once and is written to in one fell swoop.
If you indeed have hundreds of thousands of files to process you might encounter a command-line length limit. If that happens you can switch to a find call which will make sure not to call grep with too many files at once.
find . -name results.dat -exec grep average {} + > allResults.txt
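If you also want to keep the original directory-layout restriction and make sure each output line is prefixed with the file it came from, here is a variant sketch (assumes a find with -path and a grep with -H, both of which the GNU and macOS/BSD toolsets provide):
find . -path '*/experiment-1/results.dat' -exec grep -H average {} + > allResults.txt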

How to remove unknown file extensions from files using script

I can remove file extensions if I know the extensions, for example to remove .txt from files:
foreach file (`find . -type f`)
mv $file `basename $file .txt`
end
However if I don't know what kind of file extension to begin with, how would I do this?
I tried:
foreach file (`find . -type f`)
mv $file `basename $file .*`
end
but it wouldn't work.
What shell is this? At least in bash you can do:
find . -type f | while read -r; do
    mv -- "$REPLY" "${REPLY%.*}"
done
(The usual caveats apply: this doesn't handle files whose names contain newlines.)
You can use sed to compute base file name.
foreach file (`find . -type f`)
mv $file `echo $file | sed -e 's/^\(.*\)\.[^.]\+$/\1/'`
end
Be cautious: The command you seek to run could cause loss of data!
If you don't think your file names contain newlines or double quotes, then you could use:
find . -type f -name '?*.*' |
sed 's/\(.*\)\.[^.]*$/mv "&" "\1"/' |
sh
This generates your list of files (making sure that the names contain at least one character plus a .), runs each file name through the sed script to convert it into an mv command by effectively removing the material from the last . onwards, and then running the stream of commands through a shell.
Clearly, you test this first by omitting the | sh part. Consider running it with | sh -x to get a trace of what the shell's doing. Consider making sure you capture the output of the shell, standard output and standard error, into a log file so you've got a record of the damage that occurred.
Do make sure you've got a backup of the original set of files before you start playing with this. It need only be a tar file stored in a different part of the directory hierarchy, and you can remove it as soon as you're happy with the results.
You can choose any shell; this doesn't rely on any shell constructs except pipes and single quotes and double quotes (pretty much common to all shells), and the sed script is version neutral too.
Note that if you have files xyz.c and xyz.h before you run this, you'll only have a file xyz afterwards (and what it contains depends on the order in which the files are processed, which needn't be alphabetic order).
If you think your file names might contain double quotes (but not single quotes), you can play with the changing the quotes in the sed script. If you might have to deal with both, you need a more complex sed script. If you need to deal with newlines in file names, then it is time to (a) tell your user(s) to stop being silly and (b) fix the names so they don't contain newlines. Then you can use the script above. If that isn't feasible, you have to work a lot harder to get the job done accurately — you probably need to make sure you've got a find that supports -print0, a sed that supports -z and an xargs that supports -0 (installing the most recent GNU versions if you don't already have the right support in place).
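For completeness, here is a hedged sketch of that more robust route in bash, sidestepping the generated-commands approach by reading NUL-delimited names (assumes a find with -print0; it only echoes the mv commands, so drop the echo once you're satisfied with what it would do):
find . -type f -name '?*.*' -print0 |
while IFS= read -r -d '' f; do
    echo mv -- "$f" "${f%.*}"
done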
It's very simple:
$ set filename=/home/foo/bar.dat
$ echo ${filename:r}
/home/foo/bar
See more in man tcsh, in "History substitution":
r
Remove a filename extension '.xxx', leaving the root name.
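Since the question's loops are csh/tcsh (foreach), the same modifier can be used directly inside one; a minimal sketch, with the usual caveat that the backtick-find form breaks on names containing whitespace:
foreach file (`find . -type f`)
    mv $file $file:r
end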

Terminal - run 'file' (file type) for the whole directory

I'm a beginner in the terminal and bash language, so please be gentle and answer thoroughly. :)
I'm using Cygwin terminal.
I'm using the file command, which returns the file type, like:
$ file myfile1
myfile1: HTML document, ASCII text
Now, I have a directory called test, and I want to check the type of all files in it.
My endeavors:
I checked in the man page for file (man file), and I could see in the examples that you could type the names of all files after the command and it gives the types of all, like:
$ file myfile{1,2,3}
myfile1: HTML document, ASCII text
myfile2: gzip compressed data
myfile3: HTML document, ASCII text
But my files' names are random, so there's no specific pattern to follow.
I tried using the for loop, which I think is going to be the answer, but this didn't work:
$ for f in ls; do file $f; done
ls: cannot open `ls' (No such file or directory)
$ for f in ./; do file $f; done
./: directory
Any ideas?
Every Unix or Linux shell supports some kind of globbing. In your case, all you need is the * glob. This magic symbol expands to all folders and files in the given path.
e.g., file directory/*
Shell will substitute the glob with all matching files and directories in the given path. The resulting command that will actually get executed might be something like:
file directory/foo directory/bar directory/baz
You can use a combination of the find and xargs commands.
For example:
find /your/directory/ | xargs file
HTH
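As written, that also runs file on the directories themselves and trips over names containing whitespace; a slightly more careful sketch (assumes find -print0 and xargs -0, which GNU, BSD, and Cygwin toolsets provide):
find /your/directory/ -type f -print0 | xargs -0 file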
file directory/*
Is probably the shortest simplest solution to fix your issue, but this is more of an answer as to why your loops weren't working.
for f in ls; do file $f; done
ls: cannot open `ls' (No such file or directory)
For this loop it is saying "for f in the literal word 'ls'; do...", so the loop runs exactly once with f=ls, and the file command then complains that there is no file called ls. If you wanted it to execute the ls command then you would need to do something like this
for f in `ls`; do file "$f"; done
But that wouldn't work correctly if any of the filenames contain whitespace. It is safer and more efficient to use the shell's builtin "globbing" like this
for f in *; do file "$f"; done
For this one there's an easy fix.
for f in ./; do file $f; done
./: directory
Currently, you're asking it to run the file command on the directory "./" itself.
Changing it to "./*" makes the glob expand to everything within the current directory (which is the same thing as just *).
for f in ./*; do file "$f"; done
Remember, double quote variables to prevent globbing and word splitting.
https://github.com/koalaman/shellcheck/wiki/SC2086
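A tiny illustration of why the quotes matter, using a hypothetical filename containing spaces:
f="some file (copy).html"
file $f      # word-split into three arguments: "some", "file", "(copy).html"
file "$f"    # passed as a single argument, as intended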

Bash: Check all files in a location against another for existence

I'm after a little help with some Bash scripting (on OSX). I want to create a script that takes two parameters - source folder and target folder - and checks all files in the source hierarchy to see whether or not they exist in the target hierarchy. i.e. Given a data DVD check whether the files contained on it are already on the internal drive.
What I've come up with so far is
#!/bin/bash
if [ $# -ne 2 ]
then
echo "Usage is command sourcedir targetdir"
exit 0
fi
source="$1"
target="$2"
for f in "$( find $source -type f -name '*' -print )"
do
I'm now not sure how it's best to obtain the filename without its path and then see if it exists. I am really a beginner at scripting.
Edit: The answers given so far are all very efficient in terms of compact code. However I need to be able to look for files found within the total source hierarchy anywhere within the target hierarchy. If found I would like to compare checksums and last modified dates etc and comment or, if not found, I would like to note this. The purpose is to check whether files on external media have been uploaded to a file server.
This should give you some ideas:
#!/bin/bash
DIR1="tmpa"
DIR2="tmpb"
function sorted_contents
{
cd "$1"
find . -type f | sort
}
DIR1_CONTENTS=$(sorted_contents "$DIR1")
DIR2_CONTENTS=$(sorted_contents "$DIR2")
diff -y <(echo "$DIR1_CONTENTS") <(echo "$DIR2_CONTENTS")
In my test directories, the output was:
[user@host so]$ ./dirdiff.sh
./address-book.dat ./address-book.dat
./passwords.txt ./passwords.txt
./some-song.mp3 <
./the-holy-grail.info ./the-holy-grail.info
> ./victory.wav
./zzz.wad ./zzz.wad
If it's not clear: "some-song.mp3" was only in the first directory, while "victory.wav" was only in the second. The rest of the files were common to both.
Note that this only compares the file names, not the contents. If you like where this is headed, you could play with the diff options (maybe --suppress-common-lines if you want cleaner output).
But this is probably how I'd approach it -- offload a lot of the work onto diff.
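For example, the same comparison showing only the lines that differ, so that only the files unique to one side remain:
diff -y --suppress-common-lines <(echo "$DIR1_CONTENTS") <(echo "$DIR2_CONTENTS")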
EDIT: I should also point out that something as simple as:
[user@host so]$ diff tmpa tmpb
would also work:
Only in tmpa: some-song.mp3
Only in tmpb: victory.wav
... but not feel as satisfying as writing a script yourself. :-)
To list only files in $source_dir that do not exist in $target_dir:
comm -23 <(cd "$source_dir" && find .|sort) <(cd "$target_dir" && find .|sort)
You can limit it to just regular files with -type f on the find commands, etc.
The comm command (short for "common") finds lines in common between two text files and outputs three columns: lines only in the first file, lines only in the second file, and lines common to both. The numbers suppress the corresponding column, so the output of comm -23 is only the lines from the first file that don't appear in the second.
The process substitution syntax <(command) is replaced by the pathname to a named pipe connected to the output of the given command, which lets you use a "pipe" anywhere you could put a filename, instead of only stdin and stdout.
The commands in this case generate lists of files under the two directories - the cd makes the output relative to the directories being compared, so that corresponding files come out as identical strings, and the sort ensures that comm won't be confused by the same files listed in different order in the two folders.
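Putting the -type f restriction together with the command above, a minimal sketch:
comm -23 <(cd "$source_dir" && find . -type f | sort) \
         <(cd "$target_dir" && find . -type f | sort)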
A few remarks about the line for f in "$( find $source -type f -name '*' -print )":
Make that "$source". Always use double quotes around variable substitutions. Otherwise the result is split into words that are treated as wildcard patterns (a historical oddity in the shell parsing rules); in particular, this would fail if the value of the variable contain spaces.
You can't iterate over the output of find that way. Because of the double quotes, there would be a single iteration through the loop, with $f containing the complete output from find. Without double quotes, file names containing spaces and other special characters would trip the script.
-name '*' is a no-op, it matches everything.
As far as I understand, you want to look for files by name independently of their location, i.e. you consider /dvd/path/to/somefile to be a match to /internal-drive/different/path-to/somefile. So make a list of files on each side indexed by name. You can do this by massaging the output of find a little. The code below can cope with any character in file names except newlines.
list_files () {
    find . -type f -print |
    sed 's:^\(.*\)/\(.*\)$:\2/\1/\2:' |
    sort
}
source_files="$(cd "$1" && list_files)"
dest_files="$(cd "$2" && list_files)"
join -t / -v 1 <(echo "$source_files") <(echo "$dest_files") |
sed 's:^[^/]*/::'
The list_files function generates a list of file names with paths, and prepends the file name in front of the files, so e.g. /mnt/dvd/some/dir/filename.txt will appear as filename.txt/./some/dir/filename.txt. It then sorts the files.
The join command prints out lines like filename.txt/./some/dir/filename.txt when there is a file called filename.txt in the source hierarchy but not in the destination hierarchy. We finally massage its output a little since we no longer need the filename at the beginning of the line.
