I have a non-empty file (quite a big one, around 400 KB) that I can read with less.
But if I try to output the number of lines with wc -l /path/to/file, it outputs 0.
How is that possible?
You can verify for yourself that the file contains no newline/linefeed (ASCII 10) characters, which is what makes wc -l report 0 lines.
First, count the characters in your file:
wc -c /path/to/file
You should get a non-zero value.
Now, filter out everything that isn't a newline:
tr -dc '\n' < /path/to/file | wc -c
You should get back 0.
Or, delete the newlines and count what remains:
tr -d '\n' < /path/to/file | wc -c
You should get back the same value as in step 1.
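As an extra check, you can dump the first few bytes of the file to see what it actually contains; if there really are no newlines, no \n will appear anywhere in the output (same path as above):
od -c /path/to/file | head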
wc -l counts the number of '\n' characters in the file. Could it be that your file does not contain any?
Here is the GNU source:
https://www.gnu.org/software/cflow/manual/html_node/Source-of-wc-command.html
Look for the COUNT(c) macro.
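A quick shell illustration of the same point (not the GNU code itself): wc -l counts newline characters, so a final line without a trailing newline is not counted.
printf 'no trailing newline' | wc -l          # prints 0
printf 'one line\nand a partial one' | wc -l  # prints 1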
Here's one way it's possible. Make a 400k file with just nulls in it:
dd if=/dev/zero bs=1024 count=400 of=/tmp/nulls ; ls -log /tmp/nulls
Output shows the file exists:
400+0 records in
400+0 records out
409600 bytes (410 kB, 400 KiB) copied, 0.00343425 s, 119 MB/s
-rw-rw-r-- 1 409600 Feb 28 11:12 /tmp/nulls
Now count the lines:
wc -l /tmp/nulls
0 /tmp/nulls
This is possible if the file is minified HTML: the newline characters would have been removed during minification of the content.
Try the file command:
file filename.html
filename.html: HTML document text, UTF-8 Unicode text, with very long lines, with no line terminators
I am trying to count how many files contain words matching the pattern [Gg]reen.
#!/bin/bash
for File in `ls ./`
do
cat ./$File | egrep '[Gg]reen' | sed -n '$='
done
When I do this I get this output:
1
1
3
1
1
So I want to count the lines to get a total of 5. I tried using wc -l after the sed, but it didn't work: it counted the lines in all the files. I tried redirecting to >file.txt, but nothing was written to it. And when I use >> instead, it writes, but every time I execute the script it appends the lines again.
Since, according to your question, you want to know how many files contain a pattern, you are interested in the number of files, not the number of pattern occurrences.
For instance,
grep -l '[Gg]reen' * | wc -l
would produce the number of files that contain green or Green somewhere as a substring.
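If, on the other hand, you wanted the total number of matching lines across all files (7 in your example output), a sketch along these lines would do; each matching line is printed once, so counting the output lines gives the total:
grep '[Gg]reen' * | wc -l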
How do I get the first few lines from a gzipped file?
I tried zcat, but it's throwing an error:
zcat CONN.20111109.0057.gz|head
CONN.20111109.0057.gz.Z: A file or directory in the path name does not exist.
zcat(1) can be supplied by either compress(1) or by gzip(1). On your system, it appears to be compress(1) -- it is looking for a file with a .Z extension.
Switch to gzip -cd in place of zcat and your command should work fine:
gzip -cd CONN.20111109.0057.gz | head
Explanation
-c --stdout --to-stdout
Write output on standard output; keep original files unchanged. If there are several input files, the output consists of a sequence of independently compressed members. To obtain better compression, concatenate all input files before compressing them.
-d --decompress --uncompress
Decompress.
On some systems (e.g., Mac), you need to use gzcat.
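For example, on macOS the following should be equivalent to the gzip -cd command above:
gzcat CONN.20111109.0057.gz | head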
On a Mac you need to use < with zcat:
zcat < CONN.20111109.0057.gz|head
If a contiguous range of lines is needed, one option might be:
gunzip -c file.gz | sed -n '5,10p;11q' > subFile
where lines 5 through 10 (both inclusive) of file.gz are extracted into a new file, subFile. For the sed options, refer to the manual.
If every, say, 5th line is required:
gunzip -c file.gz | sed -n '1~5p;6q' > subFile
which prints the 1st line, skips the next 4 lines, prints the 6th, and then stops because of the 6q; drop the 6q to continue the same pattern through the whole file.
If you want to use zcat, this will show the first 10 lines:
zcat your_filename.gz | head
Let's say you want the first 16 lines:
zcat your_filename.gz | head -n 16
This awk snippet will let you show not only the first few lines, but any range you specify. It will also add line numbers, which I needed for debugging an error message pointing to a certain line way down in a gzipped file.
gunzip -c file.gz | awk -v from=10 -v to=20 'NR>=from { print NR,$0; if (NR>=to) exit 1}'
Here is the awk snippet used in the one-liner above. In awk, NR is a built-in variable (the number of records read so far), which is usually equivalent to the line number. The from and to variables are picked up from the command line via the -v options.
NR >= from {
    print NR, $0;
    if (NR >= to)
        exit 1
}
For example:
myCleanVar=$( wc -l < myFile )
myDirtVar=$( wc -l myFile )
echo $myCleanVar
9
echo $myDirtVar
9 myFile
why in "myCleanVar" I get an "integer" value from the "wc" command and in "myDirtVar" I get something like as: "9 file.txt"? I quoted "integer" because in know that in Bash shell by default all is treated as a string, but can't understand the differences of the behaviour of first and second expression. What is the particular effect of the redirection "<" in this case?
By default, wc lists the name of each file, which allows you to use it on more than one file (and get a result for each of them). If no filename is specified, "standard input" is used, which is usually the console input, and no file name is printed. The < specifies an "input redirection", that is, the input is read from the given file instead of from user input.
Put all this information together and you get the reason for wc's behavior in your example.
Question time: what would be the output of cat file | wc -l ?
The man page for wc says:
NAME
wc - print newline, word, and byte counts for each file
SYNOPSIS
wc [OPTION]... [FILE]...
wc [OPTION]... --files0-from=F
DESCRIPTION
Print newline, word, and byte counts for each FILE, and a total line if more than one FILE is specified.
So when you pass the file as an argument, wc also prints the name of the file, because you could pass more than one file for it to count lines in, and you need to know which file has which line count. Of course, when you use stdin instead, it does not know the name of the file, so it doesn't print it.
Example
$ wc -l FILE1 FILE2
2 FILE1
4 FILE2
6 total
My SSD was dying, so I tried to back up my /home with fsarchiver, but during the process I got a bunch of errors like: file has been truncated: padding with zeros.
Now I'm trying to locate those files, so I'm searching for a bash/python/perl... script that lets me find non-empty files whose last n bytes are 'padded with zeros'.
Thank you in advance for your help, and please excuse my English.
This script takes a list of files on the command line and reports the names of those whose last ten bytes are padded with zeros:
#!/bin/sh
for fname in "$@"
do
    if [ -s "$fname" -a "$(tail -c10 "$fname" | tr -d '\000' | wc -c)" -eq 0 ]
    then
        echo "Truncated file: $fname"
    fi
done
It works by first checking that the file is non-empty ([ -s "$fname" ]), then taking the last ten bytes of the file (tail -c10 "$fname"), removing any NUL bytes (tr -d '\000'), and counting how many bytes are left (wc -c). If all of the last ten bytes are zeros, then there will be no bytes left, and the file is reported as truncated.
If you want to use something other than 10 bytes in your test, adjust the tail option to suit.
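For instance, to check the last 512 bytes of a single suspect file by hand (just the core of the test, with a hypothetical path), a result of 0 means those bytes are all zeros:
tail -c512 /path/to/suspect/file | tr -d '\000' | wc -c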
If you want to test all the files in some directory or filesystem, find can assist. If the above script is in an executable file named padded.sh, then run:
find path/to/suspect/files -size +10c -exec padded.sh {} +
I am trying to copy part of a .txt file, from line number n to line number n+y (let's say 1000 to 1000000).
I tried with operators and sed, and it failed. Here's the command I tried:
sed -n "1000, 1000000p" path/first/file > path/second/file
If you know how many lines are in your source file (wc -l), you can do this. Assume 12000 lines and you want the 5000 lines from 2001 to 7000 in your new file:
tail -n 10000 myfile | head -n 5000 > newfile
Read the last 10000 lines, then take the first 5000 lines from those.
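A variant that does not need the total line count, sketched with the numbers from the original question: tail -n +1000 starts output at line 1000, and head keeps the next 999001 lines (1000 through 1000000 inclusive).
tail -n +1000 path/first/file | head -n 999001 > path/second/file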
The sed command should work fine; replace the double quotes with single quotes:
sed -n '1000, 1000000p' path/first/file > path/second/file