Why is wc -l counting lines incorrectly? [duplicate] - bash

I have a text file that is over 60 MB in size. It contains 5105043 lines, but when I run wc -l it reports only 5105042, which is one less than the actual count. Does anyone have any idea why this is happening?
Is this common when the file size is large?

The last line does not end with a newline character.
One trick to get the result you want would be:
sed -n '=' <yourfile> | wc -l
This tells sed to print just the line number of each line in your file, which wc then counts. There are probably better solutions, but this works.
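For example, on a two-line file whose last line lacks a trailing newline (a quick sketch; the demo file is made up):
$ printf 'first\nsecond' > demo.txt   # no newline after "second"
$ wc -l demo.txt
1 demo.txt
$ sed -n '=' demo.txt | wc -l
2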

The last line in your file is probably missing a newline ending. IIRC, wc -l merely counts the number of newline characters in the file.
If you try cat -A file.txt | tail, does your last line end with a trailing dollar sign ($)?
EDIT:
Assuming the last line in your file is lacking a newline character, you can append a newline character to correct it like this:
printf "\n" >> file.txt
The results of wc -l should now be consistent.
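If you run this repeatedly, you may want to append the newline only when it is actually missing, so a well-formed file doesn't gain a blank line. A minimal guard (a sketch, assuming a non-empty file):
# Append a newline only if the last byte of the file is not one.
# "$(tail -c 1 file.txt)" is empty when the last byte is a newline,
# because command substitution strips trailing newlines.
if [ -n "$(tail -c 1 file.txt)" ]; then
  printf '\n' >> file.txt
fi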

60 MB is a bit big for this, but for smaller files one option could be:
cat -n file.txt
or
cat -n file.txt | cut -f1 | tail -1
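Both number every line, including a final unterminated one, so the last printed number is the true line count. A quick check (the demo file is made up):
$ printf 'one\ntwo' > file.txt   # last line unterminated
$ cat -n file.txt | cut -f1 | tail -1
     2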

Related

Counting characters in a one-line huge file

I'm working with a huge file (4 GB). This file doesn't have a conventional line delimiter: instead of \n or the like, the line delimiter is a run of characters (#####).
On this annoying file, I would like to count the number of times the "#####" string occurs, to get the number of lines.
I tried:
grep -o "#####" MyHugeFile.txt | wc -l
awk -F'#####' '{print NF-1}' MyHugeFile.txt
awk '{print gsub(/#####/,"&")}' MyHugeFile.txt
But none of them worked: the process seems to die before it produces the number of occurrences.
Is there another way?
Thank you very much!
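One streaming approach not among the attempts above: GNU awk supports a multi-character record separator, so the file can be read as many small records instead of one 4 GB line. A sketch, assuming gawk (RT is a gawk extension holding the separator text that was actually matched):
# Each "#####" ends a record; counting records with a non-empty RT
# counts the occurrences of "#####" exactly, whether or not the
# file ends with a separator.
awk 'BEGIN { RS = "#####" } RT != "" { n++ } END { print n + 0 }' MyHugeFile.txt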

wc -l gives the wrong result

I got a wrong result from the wc -l command. After a long :( debugging session I found the core of the problem; here is a simulation:
$ echo "line with end" > file
$ echo -n "line without end" >>file
$ wc -l file
1 file
Here there are two lines, but the last "\n" is missing. Is there any easy solution?
For wc, a line is whatever ends with the "\n" character. One solution is to grep the lines instead: grep does not require the trailing newline.
e.g.
$ grep -c . file    # count lines containing at least one character
2
The above will not count empty lines. If you want to include them, use:
$ grep -c '^' file #count the beginnings of the lines
2
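A quick illustration of the difference (the three-line demo file is made up):
$ printf 'a\n\nb' > demo.txt   # empty middle line, unterminated last line
$ grep -c . demo.txt
2
$ grep -c '^' demo.txt
3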
From the man page of wc:
-l, --lines
print the newline counts
From the man page of echo:
-n do not output the trailing newline
So you have one newline in your file, and thus wc -l shows 1.
You can use the following awk command to count lines; awk also counts a final unterminated line as a record:
awk 'END{print NR}' file
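On the simulated file from the question:
$ awk 'END{print NR}' file
2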

Printing a line of a file given line number

Is it possible, in UNIX, to print a particular line of a file? For example I would like to print line 10 of file example.c. I tried with cat, ls, awk but apparently either these don't have the feature or I'm not able to properly read the man :-).
Using awk:
awk 'NR==10' file
Using sed:
sed '10!d' file
sed -n '10{p;q;}' example.c
will print the tenth line of example.c for you; the q makes sed quit right after printing it, instead of reading the rest of the file.
Try head and tail; you can specify the number of lines and where to start.
To get the third line:
head -n 3 yourfile.c | tail -n 1
Similarly, for the tenth line:
head -n 10 /tmp/asdf | tail -n 1
Unfortunately, all the other solutions that use head/tail will not work correctly if the line number provided is larger than the total number of lines in the file: they print the last line of the file instead of nothing.
This will print line number N or nothing if N is beyond the total number of lines (N is 20 in the example):
grep -n "" file | grep "^20:"
If you want to cut the line number from the output, pipe it through sed:
grep -n "" file | grep "^20:" | sed 's/^20://'
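With the line number in a shell variable instead of hard-coded (a sketch; n is a hypothetical variable):
n=20   # the line number you want
grep -n "" file | grep "^${n}:" | sed "s/^${n}://"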
Try this:
cat -n <yourfile> | grep "^[[:space:]]*<NUMBER>[[:space:]].*$"
cat -n numbers the file,
and the grep regex matches the numbered line ;-)
The original regex mismatched, as mentioned in the comments.
The current one looks for an exact match:
i.e. in this particular case we need a line starting with an arbitrary amount (*) of spaces, then <NUMBER>, followed by a space, followed by whatever (.*).
In case anyone stumbles over this regex and doesn't get it at all, here is a good tutorial to get you started: http://regex.learncodethehardway.org/book/ (it uses Python regexes as examples, though).
This might work for you:
sed '10q;d' file
The d deletes lines 1 through 9 without printing them; on line 10, q prints the pattern space (auto-print happens on quit) and exits.

Counting commas in a line in bash

Sometimes I receive a CSV file which has a carriage return inside a cell. This is not an acceptable format for the program that will use it as input.
In order to detect if an input line is split, I determined that a bad line would not have the expected number of commas in it. Is there a bash or other common unix command line tool that would allow me to count the commas in the line? If necessary, I can write a Python or Perl program to do it, but if possible, I'd like to add a line or two to an existing bash script to cause it to fail if the comma count is wrong. Any ideas?
Strip everything but the commas, and then count number of characters left:
$ echo foo,bar,baz | tr -cd , | wc -c
2
To count the number of times a comma appears, you can use something like awk:
string='foo,bar,baz'   # a line of input from the CSV file (made-up sample)
echo "$string" | awk -F "," '{print NF-1}'
But this really isn't sufficient to determine whether a field has carriage returns in it. Fields can have commas inside as long as they're surrounded by quotes.
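For example, a quoted field containing a comma inflates the count (the sample line is made up):
$ echo '"Smith, John",42' | awk -F, '{print NF-1}'
2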
What worked for me better than the other solutions was this. If test.txt has:
foo,bar,baz
baz,foo,foobar,bar
Then cat test.txt | xargs -I % sh -c 'echo % | tr -cd , | wc -c' produces
2
3
This works very well for streaming sources, or tailing logs, etc.
In pure Bash:
while IFS=, read -ra array
do
  echo "$(( ${#array[@]} - 1 ))"
done < inputfile
or
while read -r line
do
  count=${line//[^,]}
  echo "${#count}"
done < inputfile
Try Perl:
$ perl -ne 'print 0+@{[/,/g]}, "\n"'
a
0
a,a
1
a,a,a,a,a
4
Depending on what you are trying to do with the CSV data, it may be helpful to use a wrapper script like csvquote to temporarily replace the problematic newlines (and commas) inside quoted fields, then restore them. For instance:
csvquote inputfile.csv | wc -l
and
csvquote inputfile.csv | cut -d, -f1 | csvquote -u
may be the sort of thing you're looking for. See https://github.com/dbro/csvquote for the code and more information.
An example Python command you could run (since Python is installed on most modern systems) is:
python -c "import pathlib; print({l.count(',') for l in pathlib.Path('my_file.csv').read_text().splitlines()})"
This counts the number of commas per line, then makes a set from them (so if your lines all have the same number of commas in, you'll get a set with just that number in).
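To tie this back to making an existing bash script fail when a line has the wrong comma count, here is a minimal awk sketch (my_file.csv and the expected count of 2 are assumptions):
# Exit non-zero as soon as a line does not have exactly $expected commas.
expected=2
if ! awk -F, -v n="$expected" 'NF-1 != n { bad=1; exit } END { exit bad }' my_file.csv; then
  echo "bad comma count detected" >&2
  exit 1
fi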
Just remove all of the carriage returns:
tr -d "\r" < old_file > new_file

UNIX wc -l with line length restriction

I need to count the number of lines in a file in a UNIX shell script, but with lines capped at 80 characters: if a line has more than 80 characters, it should count as multiple lines.
I know wc -l counts the number of lines, and I know there aren't any options to specify this kind of thing, so how would I do this?
Use fold (which wraps at 80 columns by default) to break lines longer than 80 characters, then pipe the output to wc, e.g.
$ fold file | wc -l
This may do what you want:
sed -r 's,(.{80}),\1\n,g' filename | wc -l
While the fold answer best fits the Unix way, an awk alternative is:
awk '{n += 1+int(length/80)} END {print n}' filename
