Counting Python files with bash and awk always returns zero - bash

I want to count the number of Python files on my desktop, and I wrote a small script for that. But the awk command does not work as I expected.
Here is the script:
ls -l | awk '{ if($NF=="*.py") print $NF; }' | wc -l
I know there are other ways to count Python files on a PC, but I just want to know what I am doing wrong here.

ls -l | awk '{ if($NF=="*.py") print $NF; }' | wc -l
Your code counts files literally named *.py. You should use awk's regex match operator ~ instead of the string equality ==; after fixing that, your code becomes
ls -l | awk '{ if($NF~/[.]py$/) print $NF; }' | wc -l
Note that [.] denotes a literal . and $ denotes end of string.
Your code can be simplified further: there is no need for if here, since awk's pattern-action form does the same thing:
ls -l | awk '$NF~/[.]py$/{ print $NF; }' | wc -l
Moreover, you can easily do the counting inside awk rather than piping to wc -l, as follows:
ls -l | awk '$NF~/[.]py$/{t+=1}END{print t}'
Here, t is incremented by 1 for every matching line, and after all input is processed, i.e. in the END block, it is printed. Note that there is no need to declare the variable t in awk; uninitialized variables spring into existence on first use.
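One caveat with this version: if no line matches, t is never set, so print t prints an empty line rather than 0. Adding +0 forces numeric output:
ls -l | awk '$NF~/[.]py$/{t+=1}END{print t+0}'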

Don't try to parse the output of ls, see https://mywiki.wooledge.org/ParsingLs.
Beyond that, your awk script is failing because $NF=="*.py" does a literal string comparison of the last whitespace-separated field against the five characters *.py, when you probably wanted a regexp comparison such as $NF~/\.py$/, and your print $NF would fail for any file names containing spaces.
If you really want to involve awk in this for some reason then, assuming the list of python files doesn't exceed ARG_MAX, it'd be:
awk 'BEGIN{print ARGC-1; exit}' *.py
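For instance, with three hypothetical files a.py, b.py and c.py in the current directory, the shell expands *.py into three arguments and awk just reports its argument count (ARGC includes the program name, hence the -1):
$ awk 'BEGIN{print ARGC-1; exit}' a.py b.py c.py
3
This is robust even for names containing spaces or newlines, since each name arrives as a single argument.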
but you could just do it in bash:
shopt -s nullglob
files=(*.py)
echo "${#files[#]}"
or if you want to have a pipe to wc -l for some reason and your files can't have newlines in their names then:
printf '%s\n' *.py | wc -l

Another option is to NUL-terminate the names with find (gfind is GNU find on systems that install it under that name) and have awk count either the NUL-separated records or, slurping all input as one record, the NUL-separated fields:
gfind . -maxdepth 1 -type f -name "*.py" -print0 |
{m,g}awk 'END { print NR }' RS='\0' FS='^$'
or
{m,g}awk 'END { print --NF }' RS='^$' FS='\0'
879

Related

How to get the nth recent file in the nth last modified subdirectory using pipes

I'm doing an exercise for an OS exam. It requires getting the 3rd most recent file of the 2nd last modified sub-directory inside the current directory. Then I have to print its lines in reverse order. I cannot use the tac command. The text suggests using (other than awk and sed): head, tail, wc.
I've succeeded in getting the filename of the requested file (but in a too complex way, I think). Now I have to print it in reverse. I think I can use this awk solution: https://stackoverflow.com/a/744093/11614625.
This is how I'm getting the filename:
ls -t | head | awk '{system("test -d \"" $0 "\" && echo \"" $0 "\"")}' | awk 'NR==2 {system("ls \"" $0 "\" | head")}' | awk 'NR==1'
How can I do better? And what if the 2nd directory or the 3rd file doesn't exist?
See https://mywiki.wooledge.org/ParsingLs. Also, awk '{system("test -d \"" $0 "\" && echo \"" $0 "\"")}' is calling a shell to call awk to call system to call a shell to call test, which is clearly a worse approach than just having the shell call test in the first place, if you were going to do that at all. Also, any solution that reads the whole file into memory (as any sed or naive awk solution would) will fail for large files, as they'll exceed available memory.
Unfortunately this is how to do what you want robustly:
dir="$(find . -mindepth 1 -maxdepth 1 -type d -printf '%T+\t%p\0' |
sort -rz |
awk -v RS='\0' 'NR==2{sub(/[^\t]+\t/,""); print; exit}')" &&
file="$(find "$dir" -mindepth 1 -maxdepth 1 -type f -printf '%T+\t%p\0' |
sort -z |
awk -v RS='\0' 'NR==3{sub(/[^\t]+\t/,""); print; exit}')" &&
cat -n "$file" | sort -rn | cut -f2-
If any of the commands in any of the pipes fail then the error message from the command that failed will be printed and then no other command will execute and the overall exit status will be the failure one from that failing command.
I used cat | sort | cut rather than awk or sed to print the file in reverse because awk (unless you write demand paging in it) or sed would have to read the whole file into memory at once, and so would fail for very large files. sort, on the other hand, is designed to handle large files, paging to tmp files as necessary and keeping only parts of the file in memory at a time, so it's limited only by how much free disk space you have on your device.
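As a quick illustration of that reversal idiom on a throwaway input:
$ printf 'a\nb\nc\n' | cat -n | sort -rn | cut -f2-
c
b
a
cat -n prepends a tab-separated line number, sort -rn sorts those numbers in reverse, and cut -f2- discards them again.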
The above requires GNU tools to provide/handle NUL line-endings - if you don't have those then change \0 to \n in the find command, remove the z from sort options, and remove -v RS='\0' from the awk command and be aware that the result will only work if your directory or file names don't contain newlines.

Sorting through numbered files for program execution

I have many files with the same format: mubunching-100302.0003.001_1c, mubunching-100302.0005.001_1c ...
I would like to feed a program many of these files that have a minimum value, e.g. only files with index *.0005.* and greater:
python Code.py mubunching-100302.0005.001_1c mubunching-100302.0008.001_1c ...
I am fairly new to bash and am not sure where to begin. Thanks for any help and suggestions!
You can get a list of all files matching your criteria like this:
ls | awk -F. '$2 >= 5 {print}'
This has awk compare the 2nd .-delimited field against 5, and only print out names for which this is true. If you want to then process these files with your Python script:
ls | awk -F. '$2 >= 5 {print}' | xargs python Code.py
For example, given a directory containing:
$ ls
mubunching-100302.0002.001_1c mubunching-100302.0005.001_1c
mubunching-100302.0003.001_1c mubunching-100302.0008.001_1c
The first command above will produce:
$ ls | awk -F. '$2 >= 5 {print}'
mubunching-100302.0005.001_1c
mubunching-100302.0008.001_1c
You could use find and awk to get the list of desired filenames:
find . -type f -name "mubunching*" | awk -F'[.]' '$(NF-1)>=5'
In order to pass the list to your program, use command substitution:
python Code.py $(find . -type f -name "mubunching*" | awk -F'[.]' '$(NF-1)>=5')
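If you'd rather not parse ls or find output at all, here is a pure-bash sketch, assuming the names follow the mubunching-NNNNNN.NNNN.NNN_Nc pattern from the question:
files=()
for f in mubunching-*; do
    idx=${f#*.}        # strip up to and including the first dot
    idx=${idx%%.*}     # keep the second dot-delimited field, e.g. 0005
    (( 10#$idx >= 5 )) && files+=("$f")   # 10# forces base 10 so 0008 isn't read as octal
done
python Code.py "${files[@]}"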

Bash/Shell - paths with spaces messing things up

I have a bash/shell function that is supposed to find files and then copy (via awk) the first file it finds to another directory. Unfortunately, if the directory that contains the file has spaces in its name, the whole thing fails, since the path gets truncated for some reason or another. How do I fix it?
If file.txt is in /path/to/search/spaces are bad/ it fails.
dir=/path/to/destination/ | find /path/to/search -name file.txt | head -n 1 | awk -v dir="$dir" '{printf "cp \"%s\" \"%s\"\n", $1, dir}' | sh
cp: /path/to/search/spaces: No such file or directory
If file.txt is in /path/to/search/spacesarebad/ it works, but notice there are no spaces. :-/
awk's default field separator is whitespace, so $1 stops at the first space in the path. Simply change the separator to something that can't appear in the line, such as a tab:
awk -F"\t" ...
Your script should look like:
dir=/path/to/destination/ | find /path/to/search -name file.txt | head -n 1 | awk -F"\t" -v dir="$dir" '{printf "cp \"%s\" \"%s\"\n", $1, dir}' | sh
As pointed by the comments, you don't really need all those steps, you could actually simply do (one-liner):
dir=/path/to/destination/ && path="$(find /path/to/search -name file.txt | head -n 1)" && cp "$path" "$dir"
Formatted code (which may look better, in this case ^^):
dir=/path/to/destination/
path="$(find /path/to/search -name file.txt | head -n 1)"
cp "$path" "$dir"
The "" are used to assign the entire content of the string to the variable, causing the separator IFS, which is a white space by default, not to be considered over the string.
If you think spaces are bad, wait till you get into trouble with newlines. Consider for example:
mkdir spaces\ are\ bad
touch spaces\ are\ bad/file.txt
mkdir newlines$'\n'are$'\n'even$'\n'worse
touch newlines$'\n'are$'\n'even$'\n'worse/file.txt
And:
find . -name file.txt
The head command assumes newline-delimited input. You can get around both the space and the newline issue with GNU find and GNU grep (maybe others; cp's -t option is a GNU extension too) by using \0 delimiters:
find . -name file.txt -print0 | grep -zm1 . | xargs -0 cp -t "$dir"
You could try this.
awk '{print substr($0, index($0,$9))}'
For example, this is the output of the ls command:
-rw-r--r--. 1 root root 73834496 Dec 6 10:55 File with spaces 2
If you use simple awk like this
# awk '{print $9}'
It returns only
# File
If used with the full command
# awk '{print substr($0, index($0,$9))}'
I get the whole output
File with spaces 2
Here,
substr(s, a, b): returns b characters of string s, starting at position a. The parameter b is optional; if it is omitted, everything from position a to the end of the string is returned.
For example, if the match is addr:192.168.1.133 and you use substr as follows
# awk '{print substr($2,6)}'
you get the IP, i.e. 192.168.1.133. Note that 6 is the position of the first character after addr:, counting from the a.
So in the full command, substr's first argument is $0 (the whole line) and index($0,$9) finds the position where $9 begins, so substr prints everything from field 9 onward. You can change it to index($0,$8) and see that the output changes to
# 10:55 File with spaces 2
`index(IN, FIND)'
This searches the string IN for the first occurrence of the string
FIND, and returns the position in characters where that occurrence
begins in the string IN.
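A minimal check of index() on its own:
$ echo "addr:192.168.1.133" | awk '{print index($0, ":")}'
5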
I hope it helps. Moreover, if you are assigning this value to a variable in a script, then you need to enclose the variable in double quotes; otherwise you will get errors when you do some other operation on the extracted file name.

Count number of occurrences of a specific regex on multiple files

I am trying to write up a bash script to count the number of times a specific pattern matches on a list of files.
I've googled for solutions but I've only found solutions for single files.
I know I can use egrep -o PATTERN file, but how do I generalize this to a list of files and output the sum at the end?
EDIT: Adding the script I am trying to write:
#! /bin/bash
egrep -o -c "\s*assert.*;" $1 | awk -F: '{sum+=$2} END{print sum}'
Running egrep directly on the command line works fine, but within a bash script it doesn't. Do I have to specially protect the RegEx?
You could use grep -c to count the matching lines within each file, and then use awk at the end to sum up the counts, e.g.:
grep -c PATTERN * | awk -F: '{sum+=$2} END{print sum}'
grep -o <pattern> file1 [file2 .. | *] |
uniq -c
If you want the total only:
grep -o <pattern> file1 [file2 .. | *] | wc -l
Edit: The sort seems unnecessary.
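For example, on a hypothetical two-line input, you can see the difference between counting matching lines and counting matches:
$ printf 'foo foo\nfoo\n' | grep -c foo
2
$ printf 'foo foo\nfoo\n' | grep -o foo | wc -l
3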
The accepted answer has a problem in that grep -c will count a line once even though PATTERN may appear more than once on it. Besides, one command does the job: read each file as a single record (RS is set to an improbable separator) and use the pattern itself as the field separator, so the number of matches is the field count minus one:
awk 'BEGIN{RS="\0777";FS="PATTERN"} { print NF-1 } ' file
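A per-line alternative that avoids the improbable-record-separator trick is to let gsub count its own substitutions; a sketch, with PATTERN standing in for your regex:
awk '{ n += gsub(/PATTERN/, "") } END { print n+0 }' file1 file2
gsub returns the number of substitutions it made, and n+0 prints 0 rather than an empty line when nothing matches.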

Linux commands to output part of input file's name and line count

What Linux commands would you use, successively, for a bunch of files, to count the number of lines in each file and write to an output file a line containing part of the corresponding input file's name? So, for example, if we were looking at the file LOG_Yellow and it had 28 lines, the output file would have a line like this (Yellow and 28 are tab separated):
Yellow 28
wc -l [filenames] | grep -v " total$" | sed s/[prefix]//
The wc -l generates the output in almost the right format; grep -v removes the "total" line that wc generates for you; sed strips the junk you don't want from the filenames.
wc -l * | head --lines=-1 > output.txt
produces output like this:
linecount1 filename1
linecount2 filename2
I think you should be able to work from here to extend to your needs.
edit: since I haven't seen the rules for your name extraction, I still leave the full name. However, unlike other answers I'd prefer to use head rather than grep, which not only should be slightly faster, but also avoids filtering out files whose names happen to end in " total".
edit2 (having read the comments): the following does the whole lot:
wc -l * | head --lines=-1 | sed s/LOG_// | awk '{print $2 "\t" $1}' > output.txt
wc -l * | grep -v " total$"
sends output like
28 LOG_Yellow
You can reverse it if you want (with awk, if you don't have spaces in file names):
wc -l * | egrep -v " total$" | sed s/[prefix]// | awk '{print $2 " " $1}'
Short of writing the script for you:
'for' for looping through your files
'echo -n' for printing the current file name
'wc -l' for finding out the line count
And don't forget to redirect (> or >>) your results to your output file; a sketch putting these together follows.
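Putting those pieces together, a minimal sketch, assuming the LOG_ prefix and the tab-separated output from the question:
for f in LOG_*; do
    printf '%s\t%d\n' "${f#LOG_}" "$(wc -l < "$f")"
done > output.txt
Here wc -l < "$f" reads from stdin so it prints only the number, and ${f#LOG_} strips the LOG_ prefix from the file name.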
