Counting the number of occurrences of a character in multiple files with unix shell - shell

I would like to help out my girlfriend - she needs the count of a certain character in around 200 files (reported per file).
I already found How can I use the UNIX shell to count the number of times a letter appears in a text file?, but that only shows the total, not the number of occurrences per file. Basically, what I want is the following:
$ ls
test1 test2
$ cat test1
ddddnnnn
ddnnddnnnn
$ cat test2
ddnnddnnnn
$ grep -o 'n' * | wc -w
16
$ <insert command here>
test1 10
test2 6
$
or something similar regarding the output. As this will be on her university machine, I cannot code anything in Perl or the like; only shell is allowed. My shell knowledge is a bit rusty, so I cannot come up with a better solution - maybe you could be of assistance.

grep -Ho n * | uniq -c
produces
10 test1:n
6 test2:n
If you want exactly your output:
grep -Ho n * | uniq -c | while read count file; do echo "${file%:n} $count"; done
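If the letter may vary, a small variation (my sketch, not part of the answer above) keeps it in a variable so the grep pattern and the suffix stripping stay in sync (assuming the letter is an ordinary character, not a glob metacharacter):
letter=n
grep -Ho "$letter" * | uniq -c | while read -r count file; do
    echo "${file%:$letter} $count"
done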

It's not exactly elegant, but the most obvious solution is:
letter='n'
for file in *; do
    count=$(grep -o "$letter" "$file" | wc -w)
    echo "$file contains $letter $count times"
done
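With the test1 and test2 files from the question, that loop should print something like:
test1 contains n 10 times
test2 contains n 6 times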

Glen's answer is far better for the flavors of UNIX that support it. This one will work on any UNIX that merely claims to be POSIX-compliant; it is meant for the poor folks for whom the other answer does not fly.
POSIX grep says nothing about grep -H or -o; see: http://pubs.opengroup.org/onlinepubs/009604499/utilities/grep.html
Get a list of the files you want and call it list.txt. I chose the character ^ (Shift+6) for no particular reason.
while read -r fname
do
    cnt=$(tr -dc '^' < "$fname" | wc -c)
    echo "$fname: $cnt"
done < list.txt
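To build list.txt in the first place, any command that writes one path per line will do; a sketch (assuming the file names contain no newlines):
find . -type f > list.txt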

Related

grep from 7 GB text file OR many smaller ones

I have about two thousand text files in a folder.
I want to loop over each one and search its lines for a specific word.
for file in "./*.txt";
do
cat $file | grep "banana"
done
I was wondering if joining all the text files into one would be faster.
The whole directory has about 7 GB.
You're not actually looping, you're calling cat just once on the string ./*.txt, i.e., your script is equivalent to
cat ./*.txt | grep 'banana'
This is not equivalent to
grep 'banana' ./*.txt
though, as the output for the latter would prefix the filename for each match; you could use
grep -h 'banana' ./*.txt
to suppress filenames.
The problem you could run into is that ./*.txt expands to something that is longer than the maximum command line length allowed; to prevent that, you could do something like
printf '%s\0' ./*.txt | xargs -0 grep -h 'banana'
which is safe both for file names containing blanks and for those containing shell metacharacters, and calls grep as few times as possible.¹
This can even be parallelized; to run 4 grep processes in parallel, each handling 5 files at a time:
printf '%s\0' ./*.txt | xargs -0 -L 5 -P 4 grep -h 'banana'
What I think you intended to run is this:
for file in ./*.txt; do
cat "$file" | grep "banana"
done
which would call cat/grep once per file.
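If a per-file loop is really what's wanted (say, to add per-file handling later), the cat can be dropped; a sketch:
for file in ./*.txt; do
    grep 'banana' "$file"
done
With a single file argument grep prints no filename prefix, so the output matches the cat version.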
¹ At first I thought that printf would run into trouble with command line length limitations as well, but it seems that, as a shell built-in, it's exempt:
$ touch '%s\0' {1000000..10000000} > /dev/null
-bash: /usr/bin/touch: Argument list too long
$ printf '%s\0' {1000000..10000000} > /dev/null
$

Count unique words in all text files in directory, and delete those having less than 2?

This gets me the count. But how to delete those files having count < 2?
$ cat ./a1esso.doc | grep -o -E '\w+' | sort -u -f | wc --words
1
$ cat ./a1brit.doc | grep -o -E '\w+' | sort -u -f | wc --words
4
How do I grab the filenames of those that have fewer than 2, so we can delete them? I will be scanning millions of files. A find command can find all the files, but the filename needs to be propagated through the pipeline, it seems. At the right end, the rm command can then be used.
Thanks for reading.
Update:
The correct answer is going to use an input pipeline to feed filenames. This is not negotiable. This program is not for use on the one input file shown in the example, but will operate on a dynamic list of many files.
A filter apparatus to identify the names of the files which meet the criterion will also be present in the accepted answer. This is not negotiable either.
You could do this …
test $(grep -o -E '\w+' ./a1esso.doc | sort -u -f | wc --words) -lt 2 && rm ./a1esso.doc
Update: removed useless cat as per David's comment.
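If the file names must arrive on a pipeline, as the update demands, here is a sketch along the same lines (the *.doc pattern and the threshold of 2 come from the question; it assumes names without newlines, and it only prints the matching names until you swap in rm):
find . -type f -name '*.doc' | while IFS= read -r f; do
    [ "$(grep -o -E '\w+' "$f" | sort -u -f | wc --words)" -lt 2 ] && printf '%s\n' "$f"    # once verified, replace the printf with: rm -- "$f"
done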

Bash: displaying wc with three digit output?

I'm counting the files in a directory:
ls | wc -l
If the output is "17", I would like it to display as "017".
I have played with | printf with little luck.
Any suggestions would be appreciated.
printf is the way to go to format numbers:
printf "There were %03d files\n" "$(ls | wc -l)"
ls | wc -l will tell you how many lines it encountered parsing the output of ls, which may not be the same as the number of (non-dot) filenames in the directory. What if a filename has a newline? One reliable way to get the number of files in a directory is
x=(*)
printf '%03d\n' "${#x[@]}"
But that will only work with a shell that supports arrays. If you want a POSIX compatible approach, use a shell function:
countargs() { printf '%03d\n' $#; }
countargs *
This works because when a glob expands, the shell keeps each member of the expansion as a separate word, regardless of the characters in the filename. But when you pipe a filename, the command on the other side of the pipe can't tell it's anything other than a normal string, so it can't do any special handling.
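A quick way to see the difference (a sketch; the file name is invented, and the directory is assumed to be otherwise empty):
$ touch 'one
file.txt'
$ ls | wc -l
2
$ countargs *
001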
You could use sed.
ls | wc -l | sed 's/^17$/017/'
And this one handles all two-digit numbers:
ls | wc -l | sed '/^[0-9][0-9]$/s/.*/0&/'

Output number of lines in a text file to screen in Unix [duplicate]

Possible Duplicate:
bash echo number of lines of file given in a bash variable
Was wondering how you output the number of lines in a text file to screen and then store it in a variable.
I have a file called stats.txt and when I run wc -l stats.txt it outputs 8 stats.txt
I tried doing x = wc -l stats.txt, thinking it would store just the number (the rest being only for display), but it does not work :(
Thanks for the help
There are two POSIX-standard syntaxes for doing this:
x=`cat stats.txt | wc -l`
or
x=$(cat stats.txt | wc -l)
They both run the command and replace the invocation in the script with its standard output, in this case assigning it to the variable x. However, be aware that both trim trailing newlines (which is actually what you want here, but can be dangerous when you do expect a trailing newline).
Also, the second case can be easily nested (example: $(cat $(ls | head -n 1) | wc -l)). You can also do it with the first case, but it is more complex:
`cat \`ls | head -n 1\` | wc -l`
There are also quotation issues. You can include these expressions inside double-quotes, but with the back-ticks you must continue quoting inside the command, while the parentheses let you start a new quoting group:
"`echo \"My string\"`"
"$(echo "My string")"
Hope this helps =)
you may try:
x=`cat stats.txt | wc -l`
or (from the another.anon.coward's comment):
x=`wc -l < stats.txt`
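The redirection matters because wc then has no file name to print; with the 8-line stats.txt from the question (GNU wc shown; some implementations pad the number with spaces):
$ wc -l stats.txt
8 stats.txt
$ wc -l < stats.txt
8
$ x=$(wc -l < stats.txt)
$ echo "$x"
8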

Best way to choose a random file from a directory in a shell script

What is the best way to choose a random file from a directory in a shell script?
Here is my solution in Bash, but I would be very interested in a more portable (non-GNU) version for use on Unix proper.
dir='some/directory'
file=`/bin/ls -1 "$dir" | sort --random-sort | head -1`
path=`readlink --canonicalize "$dir/$file"` # Converts to full path
echo "The randomly-selected file is: $path"
Anybody have any other ideas?
Edit: lhunath makes a good point about parsing ls. I guess it comes down to whether you want to be portable or not. If you have the GNU findutils and coreutils then you can do:
find "$dir" -maxdepth 1 -mindepth 1 -type f -print0 \
| sort --zero-terminated --random-sort \
| sed 's/\d000.*//g'
Whew, that was fun! Also it matches my question better since I said "random file". Honestly though, these days it's hard to imagine a Unix system deployed out there having GNU installed but not Perl 5.
files=(/my/dir/*)
printf "%s\n" "${files[RANDOM % ${#files[#]}]}"
And don't parse ls. Read http://mywiki.wooledge.org/ParsingLs
Edit: Good luck finding a non-bash solution that's reliable. Most will break for certain types of filenames, such as filenames with spaces or newlines or dashes (it's pretty much impossible in pure sh). To do it right without bash, you'd need to fully migrate to awk/perl/python/... without piping that output for further processing or such.
Is "shuf" not portable?
shuf -n1 -e /path/to/files/*
or find if files are deeper than one directory:
find /path/to/files/ -type f | shuf -n1
it's part of coreutils but you'll need 6.4 or newer to get it... so RH/CentOS does not include it.
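If the file names may contain newlines and your shuf is new enough to have -z (a GNU extension; an assumption about your version), a sketch:
find /path/to/files/ -type f -print0 | shuf -z -n1 | xargs -0 printf '%s\n'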
# ******************************************************************
# ******************************************************************
function randomFile {
    tmpFile=$(mktemp)
    find . -type f > "$tmpFile"
    total=$(wc -l < "$tmpFile")
    randomNumber=$((RANDOM % total))
    i=0
    while read -r line; do
        if [ "$i" -eq "$randomNumber" ]; then
            # Do stuff with file
            amarok "$line"
            break
        fi
        i=$((i+1))
    done < "$tmpFile"
    rm "$tmpFile"
}
Something like:
let x="$RANDOM % ${#path[@]}"
echo "The randomly-selected file is ${path[$x]}"
$RANDOM in bash is a special variable that returns a random number, then I use modulus division to get a valid index, then reference that index in the array.
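A fuller sketch of that idea (filling in the array setup that the answer leaves implicit; the directory is the one from the question):
path=(some/directory/*)
x=$((RANDOM % ${#path[@]}))
echo "The randomly-selected file is ${path[$x]}"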
This boils down to: How can I create a random number in a Unix script in a portable way?
Because once you have a random number n between 1 and the number of entries, you can use head -n "$n" | tail -n 1 to cut out the corresponding line. Unfortunately, I know of no portable way to get that number with the shell alone. If you have Python or Perl, you can easily use their random support, but AFAIK there is no standard rand(1) command.
I think Awk is a good tool to get a random number. According to the Advanced Bash Guide, Awk is a good random number replacement for $RANDOM.
Here's a version of your script that avoids Bash-isms and GNU tools.
#! /bin/sh
dir='some/directory'
n_files=`/bin/ls -1 "$dir" | wc -l | cut -f1`
rand_num=`awk "BEGIN{srand();print int($n_files * rand()) + 1;}"`
file=`/bin/ls -1 "$dir" | sed -ne "${rand_num}p"`
path=`cd "$dir" && echo "$PWD/$file"` # Converts to full path.
echo "The randomly-selected file is: $path"
It inherits the problems other answers have mentioned should files contain newlines.
Problems with spaces in file names can be avoided by setting IFS as follows (note that names containing actual newlines will still break):
#!/bin/sh
OLDIFS=$IFS
IFS=$(echo -en "\n\b")
DIR="/home/user"
for file in $(ls -1 $DIR)
do
echo $file
done
IFS=$OLDIFS
Here's a shell snippet that relies only on POSIX features and copes with arbitrary file names (but omits dot files from the selection). The random selection uses awk, because that's all you get in POSIX. It's a very poor random number generator, since awk's RNG is seeded with the current time in seconds (so it's easily predictable, and returns the same choice if you call it multiple times per second).
set -- *
n=$(echo $# | awk '{srand(); print int(rand()*$0) + 1}')
eval "file=\$$n"
echo "Processing $file"
If you don't want to ignore dot files, the file name generation code (set -- *) needs to be replaced by something more complicated.
set -- *; [ -e "$1" ] || shift
set .[!.]* "$#"; [ -e "$1" ] || shift
set ..?* "$#"; [ -e "$1" ] || shift
if [ $# -eq 0 ]; then echo 1>&2 "empty directory"; exit 1; fi
If you have OpenSSL available, you can use it to generate random bytes. If you don't but your system has /dev/urandom, replace the call to openssl by dd if=/dev/urandom bs=3 count=1 2>/dev/null. Here's a snippet that sets n to a random value between 1 and $#, taking care not to introduce a bias. This snippet assumes that $# is at most 2^23-1.
while
n=$(($(openssl rand 3 | od -An -t u4) + 1))
[ $n -gt $((16777216 / $# * $#)) ]
do :; done
n=$((n % $# + 1))
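Combined with the positional-parameter trick above (assuming set -- * has already been run as in the earlier snippet):
eval "file=\$$n"
echo "Processing $file"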
BusyBox (used on embedded devices) is usually configured to support $RANDOM but it doesn't have bash-style arrays or sort --random-sort or shuf. Hence the following:
#!/bin/sh
FILES="/usr/bin/*"
for f in $FILES; do echo "$RANDOM $f" ; done | sort -n | head -n1 | cut -d' ' -f2-
Note the trailing "-" in cut -f2-; this is required to avoid truncating file names that contain spaces (or whatever separator you want to use).
It won't handle filenames with embedded newlines correctly.
Put each line of output from the command 'ls' into an associative array named line and then choose one of those like so...
ls | awk '{ line[NR]=$0 } END { print line[(int(rand()*NR+1))]}'
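Note that many awk implementations produce the same sequence on every run unless the generator is seeded; a seeded variant of the same line (my tweak):
ls | awk '{ line[NR]=$0 } END { srand(); print line[int(rand()*NR) + 1] }'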
My 2 cents, with a version that should not break when filenames with special chars exist:
#!/bin/bash --
dir='some/directory'
let number_of_files=$(find "${dir}" -type f -print0 | grep -zc .)
let rand_index=$((1+(RANDOM % number_of_files)))
printf "the randomly-selected file is: "
find "${dir}" -type f -print0 | head -z -n "${rand_index}" | tail -z -n 1
printf "\n"
