In loop cat file - echo name of file - count - bash

I'm trying to make a one-line command for the following: the folder "data" contains 570 files, each with some lines of text, named 1.txt through 570.txt.
I want to cat each file, grep for a word, and count how many times that word occurs.
At the moment I'm trying to get this using 'for':
for FILES in $(find /home/my/data/ -type f -print -exec cat {} \;) ; do echo $FILES; cat $FILES | grep word ; done | wc -l
but while this counts correctly, it does not display the name of the file each count belongs to.
I would like the output to look like this:
----> 1.txt <----
210
----> 2.txt <----
15
etc.
How can I get this?

grep -o word * | uniq -c
is practically all you need.
grep -o word * gives a line for each hit, but only prints the match itself, in this case "word". Each line is prefixed with the filename it was found in.
uniq -c then collapses these into one line per file, prefixed with the count (this works because grep's output is already grouped by file, so duplicate lines are adjacent).
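For instance, assuming 1.txt contains "word" three times and 2.txt twice, the intermediate output would look roughly like this:
$ grep -o word *
1.txt:word
1.txt:word
1.txt:word
2.txt:word
2.txt:word
$ grep -o word * | uniq -c
      3 1.txt:word
      2 2.txt:word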
You can further format it to your needs with awk or similar, for example like this:
grep -o word * | uniq -c | cut -f1 -d':' | awk '{print "File: " $2 " Count: " $1}'

You can try this:
for file in /path/to/folder/data/* ; do echo "----> $file <----" ; grep -c "word_to_count" "$file" ; done
The for loop will iterate over the files inside the folder "data".
For each of these files, it prints the name and then the number of occurrences of "word_to_count" (grep -c directly outputs a count of matching lines).
Be careful: if your search word occurs more than once on a line, this solution counts that line only once, since grep -c counts matching lines rather than matches; see the sketch below for a way around that.
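If you need every occurrence counted rather than every matching line, a minimal sketch (same hypothetical word_to_count) is to put each match on its own line with grep -o and count those lines:
for file in /path/to/folder/data/* ; do
    echo "----> $file <----"
    # -o prints every match on its own line, so wc -l counts occurrences, not lines
    grep -o "word_to_count" "$file" | wc -l
done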

Bit of awk should do it?
awk '{s+=$1} END {print s}' mydatafile
Note: some versions of awk behave oddly if the sum exceeds 2^31 (2147483647). One suggestion is to use printf rather than print:
awk '{s+=$1} END {printf "%.0f", s}' mydatafile
An equivalent Python one-liner that sums numbers from stdin:
$ python -c "import sys; print(sum(int(l) for l in sys.stdin))"
If you only want the total number of lines, you could use
find /home/my/data/ -type f -exec cat {} + | wc -l


Append the output of each iteration of a loop to the same file in bash

I have 44 files (2 for each chromosome), divided into two types: .vcf and .filtered.vcf.
I would like to run wc -l on each of them in a loop and append the output always to the same file. However, I would like that file to have 3 columns: chr[1-22], the wc -l of the .vcf, and the wc -l of the .filtered.vcf.
I've been running an independent wc -l for each file and pasting the two outputs for each chromosome together column-wise, but this is obviously not very efficient, because it generates a lot of unnecessary files. This is what I'm trying for the 22 pairs of files:
wc -l file1.vcf | cut -f 1 > out1.vcf
wc -l file1.filtered.vcf | cut -f 1 > out1.filtered.vcf
paste -d "\t" out1.vcf out1.filtered.vcf
I would like to have just one output file containing three columns:
Chromosome VCFCount FilteredVCFCount
chr1 out1 out1.filtered
chr2 out2 out2.filtered
Any help will be appreciated, thank you very much in advance :)
printf "%s\n" *.filtered.vcf |
cut -d. -f1 |
sort |
xargs -n1 sh -c 'printf "%s\t%s\t%s\n" "$1" "$(wc -l <"${1}.vcf")" "$(wc -l <"${1}.filtered.vcf")"' --
Output a newline-separated list of the files in the directory
Remove the extension with cut (probably something along the lines of xargs -I{} basename {} .filtered.vcf would be safer)
Sort it, for nicely sorted output (probably something along the lines of sort -t'r' -k2 -n would sort numerically by chromosome number and would be even better)
xargs -n1 - for each single file, execute the script sh -c
printf "%s\t%s\t%s\n" - output with a custom format string:
"$1" - the filename, and...
"$(wc -l <"${1}.vcf")" - the count of the lines in the .vcf file, and...
"$(wc -l <"${1}.filtered.vcf")" - the count of the lines in the .filtered.vcf file
Example:
> touch chr{1..3}{,.filtered}.vcf
> echo > chr1.filtered.vcf ; echo > chr2.vcf ;
> printf "%s\n" *.filtered.vcf |
> cut -d. -f1 |
> sort |
> xargs -n1 sh -c 'printf "%s\t%s\t%s\n" "$1" "$(wc -l <"${1}.vcf")" "$(wc -l <"${1}.filtered.vcf")"' --
chr1 0 1
chr2 1 0
chr3 0 0
To have nice looking table with headers, use column:
> .... | column -N Chromosome,VCFCount,FilteredVCFCount -t -o ' '
Chromosome VCFCount FilteredVCFCount
chr1 0 1
chr2 1 0
chr3 0 0
Maybe try this.
# Remove this line if you don't want the pesky header line
printf 'Chromosome\tVCFCount\tFilteredVCFCount\n' > counts.txt
for chr in chr*.vcf; do
    case $chr in *.filtered.vcf) continue ;; esac  # the glob also matches these; skip them
    base=${chr%.vcf}
    awk -v base="$base" 'BEGIN { OFS="\t" }
        FNR==1 && n { p=n }
        { n=FNR }
        END { print base, p, n }' "$chr" "$base.filtered.vcf"
done >> counts.txt
The very simple Awk script just collects the highest line number for each file (so we basically reimplement wc -l) and prints the collected numbers in the desired format. FNR is the line number in the current input file; we keep saving it, and when we switch to the second file (FNR starts over at 1), we copy the saved value to p so the count from the previous file is kept in a separate variable.
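A quick demonstration of FNR versus NR, using two throwaway two-line files just for illustration:
$ printf '1\n2\n' > a.txt; printf '3\n4\n' > b.txt
$ awk '{ print FILENAME, NR, FNR }' a.txt b.txt
a.txt 1 1
a.txt 2 2
b.txt 3 1
b.txt 4 2
NR keeps counting across files, while FNR resets to 1 for each new file; that reset is what lets the script above detect the switch.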
The shell parameter substitution ${variable%pattern} retrieves the value of variable with any suffix match on pattern removed. (There is also ${variable#pattern} to remove a prefix, and Bash has ## and %% to trim the longest pattern match instead of the shortest.)
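A few quick examples of these expansions, using a hypothetical filename:
$ f=chr7.filtered.vcf
$ echo "${f%.vcf}"   # shortest suffix match removed
chr7.filtered
$ echo "${f%%.*}"    # longest suffix match removed
chr7
$ echo "${f#chr}"    # shortest prefix match removed
7.filtered.vcf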
If efficiency is important, you could probably refactor all of the script into a single Awk script, but this way, all the pieces are simple and hopefully understandable.
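If you do want to try that, here is a rough single-pass sketch; it assumes every chr*.vcf has a matching .filtered.vcf and that no file is empty (an empty file never triggers the per-line rule, so it would be missing from the array):
awk 'BEGIN { OFS="\t"; print "Chromosome", "VCFCount", "FilteredVCFCount" }
     { n[FILENAME] = FNR }                  # remember the last line number seen per file
     END {
         for (f in n)
             if (f !~ /\.filtered\.vcf$/) { # emit one row per plain .vcf
                 base = f
                 sub(/\.vcf$/, "", base)
                 print base, n[f], n[base ".filtered.vcf"]
             }
     }' chr*.vcf > counts.txt
Note that for (f in n) visits keys in no particular order, so you may want to sort the result.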

Searching for .extension files recursively and printing the number of lines in the files found

I ran into a problem I can't solve without redoing the whole thing from the beginning. My script takes an extension and searches for every .extension file recursively, then outputs "filename:row #:word #". I would also like to print the total number of rows found in those files. Is there any way to do it using the existing code?
for i in $(find . -name "*.$1" | awk -F/ '{print $NF}')
do
echo "$i:`wc -l <$i|bc`:`wc -w <$i|bc`">>temp.txt
done
sort -r -t : -k3 temp.txt
cat temp.txt
I think you're almost there, unless I am missing something in your requirements:
#!/bin/bash
total=0
for f in `find . -name "*.$1"` ; do
lines=`wc -l < $f`
words=`wc -w < $f`
total=`echo "$lines+$total" | bc`
echo "* $f:$lines:$words"
done
echo "# Total: $total"
Edit:
Per recommendation of @Mark Setchell in the comments, this is a more refined version of the script above:
#!/bin/bash
total=0
for f in `find . -name "*.$1"` ; do
read lines words _ < <(wc -wl "$f")
total=$(($lines+$total))
echo "* $f:$lines:$words"
done
echo "# Total: $total"
Cheers
This is a one-liner printing the lines found per file, the path of the file and at the end the sum of all lines found in all the files:
find . -name "*.go" -exec wc -l {} \; | awk '{s+=$1} {print $1, $2} END {print s}'
In this example it will find all files ending in *.go, then execute wc -l on each one to get its number of lines and print the output to stdout. awk is then used to sum the values of column 1 into the variable s, which is printed only at the end: END {print s}
In case you would also like to get the words and the total sum at the end you could use:
find . -name "*.go" -exec wc {} \; | \
awk '{s+=$1; w+=$2} {print $1, $2, $4} END {print "Total:", s, w}'
Hope this can give you an idea about how to format, sum etc. your data based on the input.

One line command with variable, word count and zcat

I have many files on a server, each of which contains many lines:
201701010530.contentState.csv.gz
201701020530.contentState.csv.gz
201701030530.contentState.csv.gz
201701040530.contentState.csv.gz
With a one-line command, I would like this result:
170033|20170101
169865|20170102
170010|20170103
170715|20170104
The goal is to get the number of lines of each file, keeping only the date that is already in the filename.
I tried this, but the result for each file is on two lines instead of one:
for f in $(ls -1 2017*gz);do zcat $f | wc -l;echo $f | awk '{print substr($0,1,8)}';done
Thanks in advance guys.
Just use zcat file | wc -l to get the number of lines.
For the name, I understand it is enough to extract the first 8 characters:
$ t="201701030530.contentState.csv.gz"
$ echo "${t:0:8}"
20170103
All together:
for file in 2017*gz;
do
lines=$(zcat "$file" | wc -l)
printf "%s|%s\n" "$lines" "${file:0:8}"
done > myresult.csv
Note the usage of for file in 2017*gz to go through the files matching the 2017*gz pattern: this suffices; no need to parse ls! See the illustration below.
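A quick illustration of why the glob is safer, in a directory containing a single, hypothetical awkward filename:
$ touch '201701050530 copy.contentState.csv.gz'
$ for f in $(ls -1 2017*gz); do echo "[$f]"; done   # word splitting breaks the name apart
[201701050530]
[copy.contentState.csv.gz]
$ for f in 2017*gz; do echo "[$f]"; done            # the glob hands the name over intact
[201701050530 copy.contentState.csv.gz]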
Use zgrep -c ^ file to count the lines, here encapsulated in awk:
$ awk 'FNR==1{ "zgrep -c ^ " FILENAME | getline s; print s "|" substr(FILENAME,1,8) }' *.gz
12|20170101
The whole "zgrep -c ^ " FILENAME should probably be in a var (s) and then s | getline s.

Bash: grabbing the second line and last line of output (ls -lrS) only

I am looking to get the second line and the last line of what the ls -lrS command outputs. I've been using ls -lrS | (head -2 | tail -1) && (tail -n1), but it seems to get only one line, and I have to press Ctrl-C to stop it.
Another problem I am having is with the awk command; I wanted to grab just the file size and file name. If I were to get the correct lines (second and last), this is what I tried:
files=$(ls -lrS | (head -2 | tail -1) && (tail -n1) awk '{ print "%s", $5; "%s", $8; }' )
I was hoping it would print:
1234 file.abc
12345 file2.abc
Using the format-stable GNU stat command:
stat --format='%s %n' * | sort -n | sed -n '1p;$p'
If you're using BSD stat, adjust accordingly.
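For instance, on macOS/BSD the equivalent should look something like this (untested sketch; note the different option and format letters):
stat -f '%z %N' * | sort -n | sed -n '1p;$p'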
If you want a lot more control over what files go into this calculation, and arguably better portability, use find. In this example, I'm getting all non-dot files in the current directory:
find -maxdepth 1 -not -path '*/\.*' -printf '%s %p\n' | sort -n | sed -n '1p;$p'
And take care if your directory contains two or fewer entries, or if any of your entries have a newline in their name.
Using awk:
ls -lrS | awk 'NR==2 { print; } END { print; }'
It prints when the line number NR is 2 and again on the final line.
Note: As pointed out in the comments, $0 may or may not be available in an END block depending on your awk version.
whatever | awk 'NR==2{x=$0;next} {y=$0} END{if (x!="") print x; if (y!="") print y}'
You need that complexity (and more to be REALLY robust) to handle input that's less than 3 lines.
ls is not a reliable tool for this job: It can't represent all possible filenames (spaces are possible, but also newlines and other special characters -- all but NUL). One robust solution on a system with GNU tools is to use find:
{
    # read the first size and name
    IFS= read -r -d' ' first_size; IFS= read -r -d '' first_name
    # handle case where only one file exists
    last_size=$first_size; last_name=$first_name
    # continue reading "last" size and name, until one really is last
    while IFS= read -r -d' ' curr_size && IFS= read -r -d '' curr_name; do
        last_size=$curr_size; last_name=$curr_name
    done
} < <(find . -mindepth 1 -maxdepth 1 -type f -printf '%s %P\0' | sort -n -z)
The above puts results into variables $first_size, $first_name, $last_size and $last_name, usable thusly:
printf 'Smallest file is %d bytes, named %q\n' "$first_size" "$first_name"
printf 'Largest file is %d bytes, named %q\n' "$last_size" "$last_name"
In terms of how it works:
find ... -printf '%s %P\0'
...emits a stream of the following form from find:
<size> <name><NUL>
Running that stream through sort -n -z does a numeric sort on its NUL-delimited contents. IFS= read -r -d' ' first_size reads everything up to the first space; IFS= read -r -d '' first_name reads everything up to the first NUL; and then the loop continues to read and store additional size/name pairs until the last one is reached.

using cut command in bash [duplicate]

This question already has answers here: Get just the integer from wc in bash (19 answers). Closed 8 years ago.
I want to get only the number of lines in a file, so I do:
$ wc -l countlines.py
9 countlines.py
I do not want the filename, so I tried
$ wc -l countlines.py | cut -d ' ' -f1
but this just echoes an empty line. I just want the number 9 to be printed.
Use stdin and you won't have the issue of wc printing the filename:
wc -l < countlines.py
You can also use awk to count lines:
awk 'END { print NR }' countlines.py
where countlines.py is the file you want to count
If your file doesn't end with a \n (newline), wc -l gives a wrong result. Try it with the following simulated example:
echo "line1" > testfile     # correct line with a \n at the end
echo -n "line2" >> testfile # another line added, but without the \n
Then
$ wc -l < testfile
1
returns 1, because wc counts the number of newline characters (\n) in the file.
Therefore, for counting lines (and not \n characters) in a file, you should use
grep -c '' testfile
i.e. match the empty string in the file (which is true for every line) and count the occurrences with -c. For the above testfile it returns the correct 2.
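Checking both against the same testfile (the empty pattern matches the final, unterminated line too):
$ wc -l < testfile
1
$ grep -c '' testfile
2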
Additionally, if you want to count the non-empty lines, you can do it with
grep -c '.' file
Don't trust wc :)
PS: one of the strangest uses of wc is
grep 'pattern' file | wc -l
instead of
grep -c 'pattern' file
cut is being confused by the leading whitespace.
I'd use awk to print the 1st field here:
% wc -l countlines.py | awk '{ print $1 }'
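If you'd rather keep cut, stripping the leading padding first should also work (a sketch, assuming a wc that pads its output with leading spaces, as BSD/macOS wc does):
wc -l countlines.py | sed 's/^ *//' | cut -d ' ' -f1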
As an alternative, wc won't print the file name if its input comes from stdin:
$ cat countlines.py | wc -l
9
Yet another way:
cnt=$(wc -l < countlines.py)
echo "total is $cnt"
Redirecting the file into wc's stdin removes the name from the output; then translate away any remaining whitespace:
wc -l <countlines.py |tr -d ' '
Use awk like this:
wc -l countlines.py | awk '{print $1}'
