Average word length of input file - shell

If I use
wc -m filename
it prints the number of characters,
and
wc -w filename
prints the number of words.
If I divide the number of characters by the number of words,
the result is misleading, because the character count includes spaces and punctuation.
Any advice?

The solution I came up with, without writing a script, was to pipe the file through a couple of commands like this:
<filename tr -d " \t\n\r\.\?\!" | wc -m
This removes all spacing (newlines, tabs, carriage returns, and ordinary spaces) as well as periods, question marks, and exclamation marks, and then counts what is left. To strip other punctuation, such as a colon, just add it to the list, for example \:
Hope that helps.

Subtract out the characters you do not want:
chars=$(tr -dc '[:alnum:]' < filename | wc -c)
words=$(wc -w < filename)
Now do your calculation. I fed the file to wc on standard input to avoid the extra "filename" in the output.
printf "%.2f\n" "$(echo "$chars/$words" | bc -l)"
Edit: thanks BMW
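The two counts can also be wrapped in a small function. This is only a sketch: it assumes a "word" is whatever wc -w counts and that only alphanumeric characters contribute to length. The name avg_word_length is made up for illustration, and awk stands in for bc to do the division.

```shell
# avg_word_length FILE
# Prints the average number of alphanumeric characters per word.
# (avg_word_length is a hypothetical helper name, not a standard tool.)
avg_word_length() {
    # Characters that count toward word length: letters and digits only.
    chars=$(tr -dc '[:alnum:]' < "$1" | wc -c)
    # Words as wc defines them: runs of non-whitespace characters.
    words=$(wc -w < "$1")
    # Avoid dividing by zero on an empty file.
    if [ "$words" -eq 0 ]; then
        echo "0.00"
        return
    fi
    awk -v c="$chars" -v w="$words" 'BEGIN { printf "%.2f\n", c / w }'
}
```

Calling avg_word_length filename prints a single number with two decimals.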

Related

Why is wc -l counting lines incorrectly? [duplicate]

I have a text file that is over 60 MB in size. It has entries on 5105043 lines, but wc -l reports only 5105042, one fewer than the actual count. Does anyone know why this happens?
Is it common when the file is large?
The last line does not end with a newline.
One trick to get the result you want would be:
sed -n '=' yourfile | wc -l
This tells sed just to print the line number of each line in your file which wc then counts. There are probably better solutions, but this works.
The last line in your file is probably missing a newline ending. IIRC, wc -l merely counts the number of newline characters in the file.
If you run cat -A file.txt | tail, does your last line end with a dollar sign ($)?
EDIT:
Assuming the last line in your file is lacking a newline character, you can append a newline character to correct it like this:
printf "\n" >> file.txt
The results of wc -l should now be consistent.
60 MB is a fairly big file, but for small files one option could be
cat -n file.txt
OR
cat -n sample.txt | cut -f1 | tail -1
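The difference between counting newline characters and counting lines can be sketched as follows. awk counts records, so a final line with no trailing newline is still counted; count_lines is a hypothetical helper name used only for this example.

```shell
# count_lines FILE
# Prints the number of lines, including a last line that lacks a newline.
# (count_lines is a made-up name for illustration.)
count_lines() {
    # NR holds the number of records read once awk reaches END.
    awk 'END { print NR }' "$1"
}

# Demonstration on a two-line file whose last line is unterminated.
demo=$(mktemp)
printf 'first\nsecond' > "$demo"
wc -l < "$demo"        # counts newline characters only
count_lines "$demo"    # counts the unterminated line as well
rm -f "$demo"
```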

Counting number of different words in a txt file in Bash

I'm new to Bash programming, and I'm struggling to write code that iterates over all the lines in a txt file and counts how many different words it contains.
Example: a txt file contains "Nory was a Catholic because her mother was a Catholic"
so the result must be 7.
$ grep -o '[^[:space:]]*' file | sort -u | wc -l
7
Sure. I assume you are ok with defining "words" as things that are separated by space? In which case, try something like this:
cat filename | sed -r -e "s/[ ]+/ /g" -e "s/ /\n/g" | sort -u | wc -l
This command says:
Dump contents of filename
Replace multiple spaces with a single space
Replace spaces with newline
Sort and "uniquify" the list
Print out the count of lines
Per the comment, you can technically get away without using cat if you'd like, with the following:
sed -r -e "s/[ ]+/ /g" -e "s/ /\n/g" filename | sort -u | wc -l
Further, from another comment, you could optionally use tr (importantly with its -s flag to handle repeated spaces) instead of sed with something like:
tr -s " " "\n" < filename | sort -u | wc -l
The moral of the story is there are several ways this kind of thing can be accomplished, not to mention the other full answers that are given here :-) My personal favorite answer at this point is Ed Morton's which I've upvoted accordingly.
You could also lowercase the text so that words compare equal regardless of case.
Also, filter words with the [:alnum:] character class rather than [a-zA-Z0-9_], which is only valid for US-ASCII and will fail dramatically with Greek or Turkish.
#!/usr/bin/env bash
echo "The uniq words are the words that appears at least once, regardless of casing." |
# Turn text to lowercase
tr '[:upper:]' '[:lower:]' |
# Split alphanumeric with newlines
tr -sc '[:alnum:]' '\n' |
# Sort uniq words
sort -u |
# Count lines of unique words
wc -l
I would do it like so, with comments:
echo "Nory was a Catholic because her mother was a Catholic" |
# tr replace
# -s - squeeze
# -c - complementary
# [a-zA-Z0-9_] - all letters, number and underscore
# but complementary set, so all non letters, not numbers and not underscores.
# replace them by newline
tr -sc '[a-zA-Z0-9_]' '\n' |
# and sort unique and display count
sort -u | wc -l
Tested on repl bash.
Decided to use [a-zA-Z0-9_], because this is how GNU sed \w extension matches a word.
cat yourfile.txt | xargs -n1 | sort | uniq -c > youroutputfile.txt
xargs -n1 = put one word per line
sort = sorts
uniq -c = counts occurrences of distinct values
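One caveat with xargs is that it treats quotes and backslashes specially, so text containing an apostrophe can make it fail. A sketch of the same per-word count that splits with tr instead is shown here; it assumes a word is a run of alphanumeric characters and that case should be ignored, and word_freq is a made-up name.

```shell
# word_freq FILE
# Prints "count word" pairs, most frequent first, ignoring case.
# (word_freq is a hypothetical helper name.)
word_freq() {
    tr '[:upper:]' '[:lower:]' < "$1" |
        # Turn every run of non-alphanumeric characters into one newline.
        tr -sc '[:alnum:]' '\n' |
        # Drop the empty line produced when the text starts with punctuation.
        sed '/^$/d' |
        sort | uniq -c | sort -rn
}
```

Piping the result through wc -l gives the number of distinct words.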

Bash: displaying wc with three digit output?

I am counting the entries in a directory:
ls | wc -l
If the output is "17", I would like it to display as "017".
I have played with | printf with little luck.
Any suggestions would be appreciated.
printf is the way to go to format numbers:
printf "There were %03d files\n" "$(ls | wc -l)"
ls | wc -l will tell you how many lines it encountered parsing the output of ls, which may not be the same as the number of (non-dot) filenames in the directory. What if a filename has a newline? One reliable way to get the number of files in a directory is
x=(*)
printf '%03d\n' "${#x[@]}"
But that will only work with a shell that supports arrays. If you want a POSIX compatible approach, use a shell function:
countargs() { printf '%03d\n' $#; }
countargs *
This works because when a glob expands the shell maintains the words in each member of the glob expansion, regardless of the characters in the filename. But when you pipe a filename the command on the other side of the pipe can't tell it's anything other than a normal string, so it can't do any special handling.
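One detail the glob approach leaves open is the empty directory: an unmatched glob normally stays as the literal string *, which would be counted as one entry. A minimal POSIX sh sketch that guards against this is below; count_files is a hypothetical name, and dotfiles are not counted, as with a plain *.

```shell
# count_files DIR
# Prints the number of non-dot entries in DIR, zero-padded to three digits.
# (count_files is a made-up helper name.)
count_files() {
    # Replace the function's arguments with the glob expansion.
    set -- "$1"/*
    # If nothing matched, the pattern is still literal and names no file;
    # the -L test keeps dangling symlinks from being mistaken for that case.
    [ -e "$1" ] || [ -L "$1" ] || set --
    printf '%03d\n' "$#"
}
```

For example, count_files . reports on the current directory.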
You could use sed.
ls | wc -l | sed 's/^17$/017/'
And this handles every two-digit number:
ls | wc -l | sed '/^[0-9][0-9]$/s/.*/0&/'

Counting commas in a line in bash

Sometimes I receive a CSV file which has a carriage return inside a cell. This is not an acceptable format to a program that will use it as input.
In order to detect if an input line is split, I determined that a bad line would not have the expected number of commas in it. Is there a bash or other common unix command line tool that would allow me to count the commas in the line? If necessary, I can write a Python or Perl program to do it, but if possible, I'd like to add a line or two to an existing bash script to cause it to fail if the comma count is wrong. Any ideas?
Strip everything but the commas, and then count number of characters left:
$ echo foo,bar,baz | tr -cd , | wc -c
2
To count the number of times a comma appears, you can use something like awk:
string='foo,bar,baz'  # a line of input from the CSV file
echo "$string" | awk -F "," '{print NF-1}'
But this really isn't sufficient to determine whether a field has carriage returns in it. Fields can have commas inside as long as they're surrounded by quotes.
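With that caveat in mind, the "fail if the comma count is wrong" check from the question could be sketched like this. It assumes no field contains a quoted comma, and check_commas is a hypothetical name, not an existing tool.

```shell
# check_commas FILE EXPECTED
# Reports every line whose comma count differs from EXPECTED and
# returns a non-zero status if any such line exists.
# (check_commas is a made-up helper; quoted commas are NOT handled.)
check_commas() {
    awk -F, -v want="$2" '
        # With "," as the separator, a line has NF - 1 commas.
        NF - 1 != want {
            printf "line %d: %d commas\n", NR, NF - 1
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

In an existing script this might be used as check_commas inputfile 2 || exit 1. Note that an empty line is reported with a count of -1, since it has no fields at all.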
What worked for me better than the other solutions was this. If test.txt has:
foo,bar,baz
baz,foo,foobar,bar
Then cat test.txt | xargs -I % sh -c 'echo % | tr -cd , | wc -c' produces
2
3
This works very well for streaming sources, or tailing logs, etc.
In pure Bash:
while IFS=, read -ra array
do
echo "$((${#array[@]} - 1))"
done < inputfile
or
while read -r line
do
count=${line//[^,]}
echo "${#count}"
done < inputfile
Try Perl:
$ perl -ne 'print 0+@{[/,/g]},"\n"'
a
0
a,a
1
a,a,a,a,a
4
Depending on what you are trying to do with the CSV data, it may be helpful to use a wrapper script like csvquote to temporarily replace the problematic newlines (and commas) inside quoted fields, then restore them. For instance:
csvquote inputfile.csv | wc -l
and
csvquote inputfile.csv | cut -d, -f1 | csvquote -u
may be the sort of thing you're looking for. See https://github.com/dbro/csvquote for the code and more information.
An example Python command you could run (since Python is installed on most modern systems) is:
python -c "import pathlib; print({l.count(',') for l in pathlib.Path('my_file.csv').read_text().splitlines()})"
This counts the number of commas per line, then makes a set from them (so if your lines all have the same number of commas in, you'll get a set with just that number in).
Just remove all of the carriage returns:
tr -d "\r" < old_file > new_file

UNIX wc -l with line length restriction

I need to count the number of lines in a file, in a UNIX shell script, but I need the number of lines under 80 characters, and if there are more than 80 characters, count it as multiple lines.
I know wc -l counts the number of lines, and I know there aren't any options to specify this kind of thing, so how would I do this?
Use fold to break lines > 80 characters and then pipe the output to wc, e.g.
$ fold file | wc -l
This may do what you want:
sed -r 's,(.{80}),\1\n,g' filename | wc -l
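The fold answer best fits the Unix way, but here is an awk alternative that rounds each line's length up to a whole number of 80-character chunks (an empty line still counts as one):
awk '{n += length($0) ? int((length($0)+79)/80) : 1} END {print n}' filename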
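The same count can be sketched with the width as a parameter. This is an illustrative helper only: wrapped_lines is a made-up name, and it assumes awk's notion of a character is the unit that matters. fold counts screen columns by default, so results can differ for lines containing tabs.

```shell
# wrapped_lines FILE [WIDTH]
# Prints how many lines FILE would occupy if every line longer than
# WIDTH (default 80) were wrapped. (wrapped_lines is a hypothetical name.)
wrapped_lines() {
    awk -v w="${2:-80}" '
        # A line of up to w characters counts once; a longer line counts
        # once per full or partial chunk of w characters.
        { n += (length($0) > w) ? int((length($0) + w - 1) / w) : 1 }
        END { print n + 0 }
    ' "$1"
}
```

For plain text without tabs, wrapped_lines file should agree with fold -w 80 file | wc -l.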

Resources