Counting number of occurrences in several files

I want to count the occurrences of a character, say '[', recursively in all the files of a directory that share an extension, e.g. *.c. I am working on Solaris (Unix).
I tried some of the solutions given in other posts, and the only one that works is the following, since grep -o is not available on this OS:
sed 's/[^x]//g' filename | tr -d '\012' | wc -c
Here x is the character I want to count. This works, but it is not recursive. Is there any way to make it recursive?

You can get a recursive listing from find and execute commands with its -exec argument.
I'd suggest something like:
find . -name '*.c' -exec cat {} \; | tr -c -d ']' | wc -c
Combined with -d, the -c argument tells tr to use the complement of the supplied set, i.e. in this case it deletes everything except ]. (Supply '[' instead to count opening brackets, as in the question.)
The . in the find command means to search in the current directory, but you can supply any other directory name there as well.
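As a rough, self-contained sketch of the same pipeline, the following builds a throwaway directory and counts [ across the *.c files under it. The file names are made up for illustration, and it assumes mktemp -d is available (it is on most Linux systems; I have not checked older Solaris releases). It uses the -exec ... {} + form so cat runs fewer times; the \; form shown above should give the same result.

```shell
# Build a small throwaway tree (file names are illustrative only).
demo=$(mktemp -d)
mkdir "$demo/sub"
printf 'int a[3];\nint b[2][2];\n' > "$demo/a.c"
printf 'char s[] = "x";\n' > "$demo/sub/b.c"
printf '[[[ wrong extension, must not be counted\n' > "$demo/notes.txt"

# Concatenate every *.c file, delete everything except '[', count what is left.
count=$(find "$demo" -name '*.c' -exec cat {} + | tr -cd '[' | wc -c)
count=$((count))   # normalise: some wc implementations pad the number with spaces
echo "$count"

rm -rf "$demo"
```

Because the files are concatenated before counting, this gives one total for the whole tree rather than a per-file count.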

I hope you have nawk installed. Then you can just:
nawk '{a+=gsub(/\]/,"x")}END{print a}' /path/*

You can also write a short awk snippet yourself. I suggest running the following:
awk '{for (i=1;i<=NF;i++) if ($i=="[") n++} END{print n}' *.c
This reads all .c files in the current directory (not recursively) and prints the number of whitespace-separated fields that are exactly "[". A [ embedded in a longer token, such as a[i], is not counted.

Related

filename group by a pattern and select only one from each group

I have the following files (as an example; there are actually 60000+), and all the log files follow this pattern:
analyse-ABC008795-84865-201911261249.log
analyse-ABC008795-84866-201911261249.log
analyse-ABC008795-84867-201911261249.log
analyse-ABC008795-84868-201911261249.log
analyse-ABC008795-84869-201911261249.log
analyse-ABC008796-84870-201911261249.log
analyse-ABC008796-84871-201911261249.log
analyse-ABC008796-84872-201911261249.log
analyse-ABC008796-84873-201911261249.log
Only the numbers change between log file names. I want to take one file from each category, where files are categorized by the ABC... number. So, as you can see, there are only two categories here:
analyse-ABC008795
analyse-ABC008796
So what I want is one file (let's say the first file) from each category. The output should look like this:
analyse-ABC008795-84865-201911261249.log
analyse-ABC008796-84870-201911261249.log
This should be done in a Bash/Linux environment, so that afterwards I can use grep to check whether my "searching string" is contained in those files:
ls -l | <what should I do to group and get one file from each category> | grep "searching string"

With bash and awk.
files=(*.log)
printf '%s\n' "${files[@]}" | awk -F- '!seen[$2]++'
Or use find instead of a bash array for a more portable approach.
find . -type f -name '*.log' | awk -F- '!seen[$2]++'
If your find has the -printf flag and you don't want the leading ./ in the filename, add this before the pipe |:
-printf '%f\n'
!seen[$2]++ prints a line only the first time its second field is seen, dropping later lines from the same group without having to sort the input first. $2 is the second field when the line is split on the - separator given by -F.
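To see the grouping in isolation, here is a small sketch that feeds a few of the names from the question straight into awk, so no files need to exist. The variable name first_per_group is just for illustration.

```shell
# Sample names taken from the question; printf stands in for the file listing.
first_per_group=$(printf '%s\n' \
  analyse-ABC008795-84865-201911261249.log \
  analyse-ABC008795-84866-201911261249.log \
  analyse-ABC008796-84870-201911261249.log \
  analyse-ABC008796-84871-201911261249.log |
  awk -F- '!seen[$2]++')

# One name per ABC... group: the first one encountered in the input.
echo "$first_per_group"
```

Which file counts as "first" depends on the order of the input, so sort the list beforehand if a specific one is wanted. If the names contain no whitespace, the result could then be piped to something like xargs grep -l "searching string", though that part is not exercised here.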

Applying awk pattern to all files with same name, outputting each to a new file

I'm trying to recursively find all files with the same name in a directory, apply an awk pattern to them, and then output to the directory where each of those files lives a new updated version of the file.
I thought it was better to use a for loop than xargs, but I don't know exactly how to make this work...
for f in $(find . -name FILENAME.txt );
do awk -F"\(corr\)" '{print $1,$2,$3,$4}' ./FILENAME.txt > ./newFILENAME.txt $f;
done
Ultimately I would like to be able to remove multiple strings from the file at once using -F, but also not sure how to do that using awk.
Also is there a way to remove "(cor*)" where the * represents a wildcard? Not sure how to do while keeping with the escape sequence for the parentheses
Thanks!

To use (corr*) as a field separator where * is a glob-style wildcard, try:
awk -F'[(]corr[^)]*[)]' '{print $1,$2,$3,$4}'
For example:
$ echo '1(corr)2(corrTwo)3(corrThree)4' | awk -F'[(]corr[^)]*[)]' '{print $1,$2,$3,$4}'
1 2 3 4
To apply this command to every file under the current directory named FILENAME.txt, use:
find . -name FILENAME.txt -execdir sh -c 'awk -F'\''[(]corr[^)]*[)]'\'' '\''{print $1,$2,$3,$4}'\'' "$1" > ./newFILENAME.txt' Awk {} \;
Notes
Don't use:
for f in $(find . -name FILENAME.txt ); do
If any file or directory has whitespace or other shell-active characters in it, the results will be an unpleasant surprise.
Handling both parens and square brackets as field separators
Consider this test file:
$ cat file.txt
1(corr)2(corrTwo)3[some]4
To eliminate both types of separators and print the first four columns:
$ awk -F'[(]corr[^)]*[)]|[[][^]]*[]]' '{print $1,$2,$3,$4}' file.txt
1 2 3 4
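If you still prefer a loop over one -execdir call per file, one option is to let find hand the paths to a small sh script as positional parameters, which avoids the word-splitting problem described in the notes above. This is only a sketch: the directory names are invented (one contains a space on purpose), and it assumes mktemp -d is available.

```shell
# Throwaway tree with a space in a directory name (names are illustrative).
demo=$(mktemp -d)
mkdir -p "$demo/dir one" "$demo/dir2"
printf '1(corr)2(corrTwo)3(corrThree)4\n' > "$demo/dir one/FILENAME.txt"
printf 'a(corrX)b(corr)c(corrY)d\n' > "$demo/dir2/FILENAME.txt"

# find passes the paths to sh as "$@", so names with spaces stay intact.
# ${f%/*} is the directory part of the path, so the new file lands next to the old one.
find "$demo" -name FILENAME.txt -exec sh -c '
  for f do
    awk -F"[(]corr[^)]*[)]" "{print \$1,\$2,\$3,\$4}" "$f" > "${f%/*}/newFILENAME.txt"
  done' sh {} +

out1=$(cat "$demo/dir one/newFILENAME.txt")
out2=$(cat "$demo/dir2/newFILENAME.txt")
echo "$out1"
echo "$out2"

rm -rf "$demo"
```

The second "sh" after the script is only a placeholder for $0; the file names start at $1.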

SHELL printing just right part after . (DOT)

I need to find just the extensions of all files in a directory (if the same extension occurs twice, it should be listed only once). I already have the file names, but the output of my script looks like this:
test.txt
test2.txt
hello.iso
bay.fds
hellllu.pdf
I'm using grep -e -e '.' and it just highlights the dots.
What I need is just the extensions, collected in one variable, like txt,iso,fds,pdf
Is there anyone who could help? I already had it working once, but that used an array. Today I found out it has to work in dash too.

You can use find with awk to get all unique extensions:
find . -type f -name '?*.?*' -print0 |
awk -F. -v RS='\0' '!seen[$NF]++{print $NF}'

This can be done with find as well, but I think this is easier:
for f in *.*; do echo "${f##*.}"; done | sort -u
If you want to assign a comma-separated list of the unique extensions to a variable, you can do this:
ext=$(for f in *.*; do echo "${f##*.}"; done | sort -u | paste -sd,)
echo "$ext"
csv,pdf,txt
Alternatively, with ls:
ls -1 *.* | rev | cut -d. -f1 | rev | sort -u | paste -sd,
The rev/rev pair is required if there is more than one dot in the filename, assuming the extension is whatever follows the last dot. For any other directory, simply change *.* to dirpath/*.* in all of the commands.
I'm not sure I understand your comment. If you don't assign to a variable, the result is printed to the output by default. If you want to pass the directory name to a script, put the code into a script file and replace dirpath with $1, assuming that will be the first argument to the script:
#!/bin/bash
# print unique extension in the directory passed as an argument, i.e.
ls -1 "$1"/*.* ...
If you have subdirectories whose names contain a dot, the commands above include them as well. To limit the result to regular files, replace ls ... with
find . -maxdepth 1 -type f -name "*.*" | ...
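Since the question mentions dash, here is a sketch that sticks to POSIX sh features (no arrays, no rev). The file names are made up for the demonstration, and it assumes mktemp -d is available. paste is given an explicit - operand because some implementations require a file argument.

```shell
# Throwaway directory with made-up file names.
demo=$(mktemp -d)
touch "$demo/test.txt" "$demo/test2.txt" "$demo/hello.iso" \
      "$demo/archive.tar.gz" "$demo/noext"

# ${f##*.} keeps only the text after the last dot; sort -u removes repeats;
# paste -s joins the remaining lines with commas.
ext=$(
  cd "$demo" || exit 1
  for f in *.*; do
    [ -f "$f" ] && printf '%s\n' "${f##*.}"
  done | sort -u | paste -sd, -
)
echo "$ext"

rm -rf "$demo"
```

The [ -f "$f" ] test skips directories and also covers the case where nothing matches and the pattern *.* is left unexpanded.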

Getting last element of a path (different from #10124314 as basename falls over)

I need to process a couple of thousand PDF files sorted alphabetically by filename, ideally from bash. So from my simple perspective I need to walk a tree of files, stripping off the path as I go, and then do various grepping, sorting etc.
Having seen an answer to a similar question I've tried doing a
tim@MERLIN:~/Documents/Scanned$ basename `find ./ -print`
but that gets messed up by some directory names which have spaces in them - e.g. there is one called General Letters which acts like a chicken-bone in the works and results in
basename: extra operand ‘Letters’
Try 'basename --help' for more information.
I can't see a way to get find to strip out the pathname and I would prefer to use find given its plethora of options to filter on age, size etc. Nor can I see any way to get basename to cope gracefully with spaces in this context.
I considered using cut but I can't work out how to get cut to give me the last field by doing something like cut -d/ <whatever> I'm sure there must be an easy way to do it: some sort of in-line sed or awk script?
I don't particularly want the buggeration of writing a perl/Python script to do it for me as I know I should be able to do it from the command line.
So any simple tips or suggestions?
Updated/Solved
Many thanks to Cyrus. The solution is:
tim@MERLIN:~/Documents/Scanned$ find . -name '*.pdf' -printf '%f\n' | sort

Try this:
find ./ -printf '%f\n'
%f: File's name with any leading directories removed (only the last element).

Here is a working solution using awk:
find ./ | awk -F'/' '{ print $NF }';
It simply uses / as delimiter and prints the last value of the line.
Or with grep:
find ./ | grep -oE "[^/]+$"

Through sed,
find ./ | sed 's/.*\/\(.*\)$/\1/g'
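If your find lacks -printf, the shell's own parameter expansion can strip the directory part as well. The sketch below recreates the problem case from the question (a directory called General Letters) in a throwaway tree; the PDF names are invented, and it assumes mktemp -d is available and that no file name contains a newline.

```shell
# Throwaway tree that includes a directory name with a space.
demo=$(mktemp -d)
mkdir "$demo/General Letters"
touch "$demo/General Letters/b.pdf" "$demo/a.pdf"

# read -r takes each path as one whole line, spaces included;
# ${p##*/} removes everything up to and including the last slash.
names=$(find "$demo" -name '*.pdf' | while IFS= read -r p; do
  printf '%s\n' "${p##*/}"
done | sort)
echo "$names"

rm -rf "$demo"
```

Unlike basename `find ...`, nothing here is split on whitespace, so General Letters causes no "extra operand" error.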

If you want to get a list of pathnames (recursively) but sort them by filename rather than by full path, you can use:
find . -printf '%f|%p\n' | sort -k 1 -t'|' | cut -d'|' -f2-
You need a GNU find for this. (Linux ok, not default in OS X).
Without the GNU find, you can do the above with:
find . -print | sed 's:\(.*\)/\(.*\)$:\2\|\1/\2:' | sort -k 1 -t'|' | cut -d'|' -f2-
(Assuming there is no \n in the filenames)

Force sort command to ignore folder names

I ran the following from a base folder ./
find . -name *.xvi.txt | sort
Which returns the following sort order:
./LT/filename.2004167.xvi.txt
./LT/filename.2004247.xvi.txt
./pred/2004186/filename.2004186.xvi.txt
./pred/2004202/filename.2004202.xvi.txt
./pred/2004222/filename.2004222.xvi.txt
As you can see, the filenames follow a regular structure, but the files themselves might be located in different parts of the directory structure. Is there a way of ignoring the folder names and/or directory structure so that the sort returns a list of folders/filenames based ONLY on the file names themselves? Like so:
./LT/filename.2004167.xvi.txt
./pred/2004186/filename.2004186.xvi.txt
./pred/2004202/filename.2004202.xvi.txt
./pred/2004222/filename.2004222.xvi.txt
./LT/filename.2004247.xvi.txt
I've tried a few different switches under the find and sort commands, but no luck. I could always copy everything out to a single folder and sort from there, but there are several hundred files, and I'm hoping that a more elegant option exists.
Thanks! Your help is appreciated.

If your find has -printf you can print both the base filename and the full filename. Sort by the first field, then strip it off.
find . -name '*.xvi.txt' -printf '%f %p\n' | sort -k1,1 | cut -f 2- -d ' '
I have chosen a space as a delimiter. If your filenames include spaces, you should choose another delimiter which is a character that's not in your filenames. If any filenames include newlines, you'll have to modify this because it won't work.
Note that the glob in the find command should be quoted.

If your find doesn't have -printf, you could use awk to accomplish the same thing:
find . -name '*.xvi.txt' | awk -F / '{ print $NF, $0 }' | sort | sed 's/.* //'
The same caveats about spaces that Dennis Williamson mentioned apply here. And for variety, I'm using sed to strip off the sort field, instead of cut.
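The pattern in these answers is "decorate, sort, undecorate": prepend the base name as a sort key, sort on it, then cut it off again. As a sketch, here it is applied to the exact paths from the question, with printf standing in for find and a tab as the delimiter (an assumption that the file names contain no tabs or newlines).

```shell
# A literal tab, used as the delimiter between sort key and path.
tab=$(printf '\t')

# Paths copied from the question; printf stands in for the find output.
# awk prepends the base name, sort orders by that key only, cut removes it.
sorted=$(printf '%s\n' \
  ./LT/filename.2004167.xvi.txt \
  ./LT/filename.2004247.xvi.txt \
  ./pred/2004186/filename.2004186.xvi.txt \
  ./pred/2004202/filename.2004202.xvi.txt \
  ./pred/2004222/filename.2004222.xvi.txt |
  awk -F/ -v OFS='\t' '{ print $NF, $0 }' |
  sort -t "$tab" -k1,1 |
  cut -f2-)
echo "$sorted"
```

A tab is less likely than a space to occur in a file name, which is why it is used here instead of the space delimiter.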

find . -name '*.xvi.txt' | sort -t'.' -k3 -n
will sort it as you want. The only problem arises if a file or directory name contains additional dots.
To avoid that, you can use:
find . -name '*.xvi.txt' | sed 's/[0-9]\+\.xvi\.txt$/\\&/' | sort -t'\' -k2 | sed 's/\\//'
