UNIX wc -l with line length restriction - shell

I need to count the number of lines in a file, in a UNIX shell script, but I need the number of lines under 80 characters, and if there are more than 80 characters, count it as multiple lines.
I know wc -l counts the number of lines, and I know there aren't any options to specify this kind of thing, so how would I do this?

Use fold to break lines > 80 characters and then pipe the output to wc, e.g.
$ fold file | wc -l

This may do what you want:
sed -r 's,(.{80}),\1\n,g' filename | wc -l

While the fold answer best fits the unix way:
awk '{n += 1+int(length/80)} END {print n}' filename

Related

Why is wc -l counting lines incorrectly? [duplicate]

I have a text file which has over 60MB size. It has got entries in 5105043 lines, but when I am doing wc -l it is giving only 5105042 results which is one less than actual. Does anyone have any idea why it is happening?
Is it a common thing when the file size is large?
Last line does not contain a new line.
One trick to get the result you want would be:
sed -n '=' <yourfile> | wc -l
This tells sed just to print the line number of each line in your file which wc then counts. There are probably better solutions, but this works.
The last line in your file is probably missing a newline ending. IIRC, wc -l merely counts the number of newline characters in the file.
If you try: cat -A file.txt | tail does your last line contain a trailing dollar sign ($)?
EDIT:
Assuming the last line in your file is lacking a newline character, you can append a newline character to correct it like this:
printf "\n" >> file.txt
The results of wc -l should now be consistent.
60 MB seems a bit big file but for small size files. One option could be
cat -n file.txt
OR
cat -n sample.txt | cut -f1 | tail -1

Average word length of input file

If i use
wc -m filename
it will generate the number of characters
and
wc -w filename
will generate number of words
if i used this info by dividing number of characters/number of words
it will give me misleading result as number of character will include spaces and punctuation
any advice ?
the solution that I came up with without writing a script was to pipe it through a couple of commands like this.
<filename tr -d " \t\n\r\.\?\!" | wc -m
This works to remove all of the spacing, like new line, tabs and normal spacing. A more rigorous tr command that included any sort of other punctuation like a colon can just be added to the list for example \:
Hope That Helps
Subtract out characters you do not want
chars=$(tr -dc '[:alnum:]' < filename | wc -c)
words=$(cat filename | wc -c)
Now do you calculation. I piped into wc to avoid the extra "filename" in output
printf "%.2f" $(echo "$chars/$words" | bc -l)
Edit: thanks BMW

How to print out the number of lines in an output command

For example:
cat /etc/passwd
What is the easiest way to count and display the number of lines the command outputs?
wc is the unix utility which counts characters, words, lines etc. Try man wc to learn more about it. The -l option makes it print only the number of lines (and not characters and other stuff).
So, wc -l <filename> will print the number of lines in the file <filename>.
You asked about how to count number of lines output from a command line program in general. To do that, you can use pipes in unix. So, you can pipe the output of any command to wc -l. In your example, cat /etc/password is the command line program you want to count. For that you should do:
cat /etc/password | wc -l

Counting commas in a line in bash

Sometimes I receive a CSV file which has a carriage return inside a cell. This is not an acceptable format to a program that will use it as input.
In order to detect if an input line is split, I determined that a bad line would not have the expected number of commas in it. Is there a bash or other common unix command line tool that would allow me to count the commas in the line? If necessary, I can write a Python or Perl program to do it, but if possible, I'd like to add a line or two to an existing bash script to cause it to fail if the comma count is wrong. Any ideas?
Strip everything but the commas, and then count number of characters left:
$ echo foo,bar,baz | tr -cd , | wc -c
2
To count the number of times a comma appears, you can use something like awk:
string=(line of input from CSV file)
echo "$string" | awk -F "," '{print NF-1}'
But this really isn't sufficient to determine whether a field has carriage returns in it. Fields can have commas inside as long as they're surrounded by quotes.
What worked for me better than the other solutions was this. If test.txt has:
foo,bar,baz
baz,foo,foobar,bar
Then cat test.txt | xargs -I % sh -c 'echo % | tr -cd , | wc -c' produces
2
3
This works very well for streaming sources, or tailing logs, etc.
In pure Bash:
while IFS=, read -ra array
do
echo "$((${#array[#]} - 1))"
done < inputfile
or
while read -r line
do
count=${line//[^,]}
echo "${#count}"
done < inputfile
Try Perl:
$ perl -ne 'print 0+#{[/,/g]},"\n"'
a
0
a,a
1
a,a,a,a,a
4
Depending on what you are trying to do with the CSV data, it may be helpful to use a wrapper script like csvquote to temporarily replace the problematic newlines (and commas) inside quoted fields, then restore them. For instance:
csvquote inputfile.csv | wc -l
and
csvquote inputfile.csv | cut -d, -f1 | csvquote -u
may be the sort of thing you're looking for. See [https://github.com/dbro/csvquote][1] for the code and more information
An example Python command you could run (since it's going to be installed on most modern shells) is:
python -c "import pathlib; print({l.count(',') for l in pathlib.Path('my_file.csv').read_text().splitlines()})"
This counts the number of commas per line, then makes a set from them (so if your lines all have the same number of commas in, you'll get a set with just that number in).
Just remove all of the carriage returns:
tr -d "\r" old_file > new_file

Can I grep only the first n lines of a file?

I have very long log files, is it possible to ask grep to only search the first 10 lines?
The magic of pipes;
head -10 log.txt | grep <whatever>
For folks who find this on Google, I needed to search the first n lines of multiple files, but to only print the matching filenames. I used
gawk 'FNR>10 {nextfile} /pattern/ { print FILENAME ; nextfile }' filenames
The FNR..nextfile stops processing a file once 10 lines have been seen. The //..{} prints the filename and moves on whenever the first match in a given file shows up. To quote the filenames for the benefit of other programs, use
gawk 'FNR>10 {nextfile} /pattern/ { print "\"" FILENAME "\"" ; nextfile }' filenames
Or use awk for a single process without |:
awk '/your_regexp/ && NR < 11' INPUTFILE
On each line, if your_regexp matches, and the number of records (lines) is less than 11, it executes the default action (which is printing the input line).
Or use sed:
sed -n '/your_regexp/p;10q' INPUTFILE
Checks your regexp and prints the line (-n means don't print the input, which is otherwise the default), and quits right after the 10th line.
You have a few options using programs along with grep. The simplest in my opinion is to use head:
head -n10 filename | grep ...
head will output the first 10 lines (using the -n option), and then you can pipe that output to grep.
grep "pattern" <(head -n 10 filename)
head -10 log.txt | grep -A 2 -B 2 pattern_to_search
-A 2: print two lines before the pattern.
-B 2: print two lines after the pattern.
head -10 log.txt # read the first 10 lines of the file.
You can use the following line:
head -n 10 /path/to/file | grep [...]
The output of head -10 file can be piped to grep in order to accomplish this:
head -10 file | grep …
Using Perl:
perl -ne 'last if $. > 10; print if /pattern/' file
An extension to Joachim Isaksson's answer: Quite often I need something from the middle of a long file, e.g. lines 5001 to 5020, in which case you can combine head with tail:
head -5020 file.txt | tail -20 | grep x
This gets the first 5020 lines, then shows only the last 20 of those, then pipes everything to grep.
(Edited: fencepost error in my example numbers, added pipe to grep)
grep -A 10 <Pattern>
This is to grab the pattern and the next 10 lines after the pattern. This would work well only for a known pattern, if you don't have a known pattern use the "head" suggestions.
grep -m6 "string" cov.txt
This searches only the first 6 lines for string

Resources