In bash how do I count the occurrence of each word in a set, in multiple files - bash

I have a big list of words (>1000) which are actually filenames, and a directory with a lot of source code files (>2000). I want, for each word (filename) in the list, to count its total occurrences in all the files of the directory. What I currently do is:
#!/bin/sh
SEARCHPATH=$1
for var in "${@:2}"
do
BASE=$( basename "$var" )
COUNT=$(grep -o "$BASE" "$SEARCHPATH"/* | wc -l)
echo -e "$BASE:" " $COUNT"
done
which works, but is inefficient because it searches the whole directory once for each word, and there are a lot of words. I am looking for a solution that scans the directory only once, accumulating the word counts.

Put all your words in a file. Then you can try this:
grep -ohFf wordsFile path/* | sort | uniq -c
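Since your words are really basenames, you could build wordsFile from the same arguments your current script already takes; a minimal sketch, assuming the search path is the first argument and the filenames follow:
#!/bin/sh
SEARCHPATH=$1
shift
# one basename per line; these become the fixed-string patterns
for f in "$@"; do
basename "$f"
done > wordsFile
# single pass: -o prints each match on its own line, -h drops file names,
# -F treats the patterns as fixed strings, -f reads them from wordsFile
grep -ohFf wordsFile "$SEARCHPATH"/* | sort | uniq -c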

Related

1. How to use the input, not including the first one. 2. Using grep and sed to find the pattern entered by the user, and how to create the next line

The command that I'm making takes a file as its first input and searches for how many times certain patterns occur within that file, using grep and sed.
Ex:
$ cat file1
oneonetwotwotwothreefourfive
Intended output:
$ ./command file1 one two three
one 2
two 3
three 1
The problem is that the file does not have any lines and is just one long string of letters. I'm trying to use sed to replace each pattern I'm looking for with "FIND" and move the rest of the string to the next line, and this continues until the end of the file. Then, use grep FIND to get the lines that contain FIND. Finally, use wc -l to find the number of lines. However, I cannot find the sed option to move the rest of the string to the next line.
Ex:
$cat file1
oneonetwosixone
Intended output:
FIND
FIND
twosixFIND
Another problem that I've been having is how to use the rest of the input, not including the file.
Failed attempt:
file=$1
for PATTERN in 2 3 4 5 ... N
do
variable=$(sed 's/$PATTERN/find/g' $file | grep FIND $file | wc -l)
echo $PATTERN $variable
exit
Another failed attempt:
file=$1
PATTERN=$($2,$3 ... $N)
for PATTERN in $*
do variable=$(sed 's/$PATTERN/FIND/g' $file | grep FIND $file | wc-1)
echo $PATTERN $variable
exit
Any suggestions and help will be greatly appreciated. Thank you in advance.
Non-portable solution with GNU grep:
file=$1
shift
for pattern in "$@"; do
echo "$pattern" $(grep -o -e "$pattern" <"$file" | wc -l)
done
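Run against the example from the question (with the script saved as command), this should produce the intended output:
$ ./command file1 one two three
one 2
two 3
three 1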
If you want to use sed and your "patterns" are actually fixed strings (which don't contain characters that have special meaning to sed), you could do something like:
file=$1
shift
for pattern in "$@"; do
echo "$pattern" $(
sed "s/$pattern/\n&\n/g" "$file" |\
grep -e "$pattern" | wc -l
)
done
Your code has several issues:
you should quote use of variables where word splitting may happen
don't use ALLCAPS variable names - they are reserved for use by the shell
if you put a string in single-quotes, variable expansion does not happen
if you give grep a file, it won't read standard input
your for loop has no terminating done
This might work for you (GNU bash, sed and uniq):
f(){ local file=$1;
shift;
local args="$*";
sed -E 's/'${args// /|}'/\n&\n/g
s/(\n\S+)\n\S+/\1/g
s/\n+/\n/g
s/.(.*)/echo "\1"|uniq -c/e
s/ *(\S+) (\S+)/\2 \1/mg' $file; }
Separate arguments into file and remaining arguments.
Apply arguments as alternation within a sed substitution command which splits words into lines separated by a newline either side.
Remove unwanted words and unwanted newlines.
Evaluate the manufactured file within a sed substitution using the uniq command with the -c option.
Rearrange the output and print the result.
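A minimal usage sketch, assuming GNU sed (the script relies on its e flag) and the sample file1 from the question:
f file1 one two three
If the function behaves as described, this should report the counts for one, two and three (2, 3 and 1 respectively).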
The problem is the file does not have any lines
Great! So the problem reduces to putting newlines.
func() {
file=$1
shift
rgx=$(printf "%s\\|" "$@" | sed 's#\\|$##');
# put the newline between words
sed 's/\('"$rgx"'\)/&\n/g' "$file" |
# it's just standard here
sort | uniq -c |
# filter only input - i.e. exclude fourfive
grep -xf <(printf " *[0-9]\+ %s\n" "$@")
};
func <(echo oneonetwotwotwothreefourfive) one two three
outputs:
2 one
1 three
3 two

grep from 7 GB text file OR many smaller ones

I have about two thousand text files in a folder.
I want to loop over each one and search for a specific word in its lines.
for file in "./*.txt";
do
cat $file | grep "banana"
done
I was wondering if joining all the text files into one would be faster.
The whole directory has about 7 GB.
You're not actually looping, you're calling cat just once on the string ./*.txt, i.e., your script is equivalent to
cat ./*.txt | grep 'banana'
This is not equivalent to
grep 'banana' ./*.txt
though, as the output for the latter would prefix the filename for each match; you could use
grep -h 'banana' ./*.txt
to suppress filenames.
The problem you could run into is that ./*.txt expands to something that is longer than the maximum command line length allowed; to prevent that, you could do something like
printf '%s\0' ./*.txt | xargs -0 grep -h 'banana'
which is safe both for file names containing blanks and for file names containing shell metacharacters, and calls grep as few times as possible [1].
This can even be parallelized; to run 4 grep processes in parallel, each handling 5 files at a time:
printf '%s\0' ./*.txt | xargs -0 -L 5 -P 4 grep -h 'banana'
What I think you intended to run is this:
for file in ./*.txt; do
cat "$file" | grep "banana"
done
which would call cat/grep once per file.
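As a side note, the cat in that loop is redundant; grep can read each file directly, which behaves the same but saves a process per file:
for file in ./*.txt; do
grep -h 'banana' "$file"
done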
[1] At first I thought that printf would run into trouble with command line length limitations as well, but it seems that as a shell built-in, it's exempt:
$ touch '%s\0' {1000000..10000000} > /dev/null
-bash: /usr/bin/touch: Argument list too long
$ printf '%s\0' {1000000..10000000} > /dev/null
$

Bash script to store list of files in an array with number of occurrences of each word in all files

So far, my bash script takes in two arguments: input, which can be a file or a directory, and output, which is the output file. It finds all files recursively, and if the input is a file it finds all occurrences of each word in all the files found and lists them in the output file, with the count on the left and the word on the right, sorted from greatest to least. Right now it is also counting numbers as words, which it shouldn't do. How can I have it find only occurrences of valid words and no numbers?
Also, in the last if statement, if the input is a directory, I am having trouble getting it to do the same thing I had it do for the file. It needs to find all files in that directory, and if there is another directory in that directory, it needs to find all files in it, and so on. Then it needs to count all occurrences of each word in all files and store them in the output file, just as in the case of a file. I was thinking of storing them in an array, but I'm not sure if it's the best way, and my syntax is off because it's not working. So I would like to know: how can I do this? Thanks!
#!/bin/bash
INPUT="$1"
OUTPUT="$2"
ARRAY=();
# Check that there are two arguments
if [ "$#" -ne 2 ]
then
echo "Usage: $0 {dir-name}";
exit 1
fi
# Check that INPUT is different from OUTPUT
if [ "$INPUT" = "$OUTPUT" ]
then
echo "$INPUT must be different from $OUTPUT";
fi
# Check if INPUT is a file...if so, find number of occurrences of each word
# and store in OUTPUT file sorted in greatest to least
if [ -f "$INPUT" ]
then
for name in $INPUT; do
if [ -f "$name" ]
then
xargs grep -hoP '\b\w+\b' < "$name" | sort | uniq -c | sort -n -r > "$OUTPUT"
fi
done
# If INPUT is a directory, find number of occurrences of each word
# and store in OUTPUT file sorted in greatest to least
elif [ -d "$INPUT" ]
then
find $name -type f > "${ARRAY[@]}"
for name in "${ARRAY[@]}"; do
if [ -f "$name" ]
then
xargs grep -hoP '\b\w+\b' < "$name" | sort | uniq -c | sort -n -r > "$OUTPUT"
fi
done
fi
I don't recommend specifying the output file, because you would have to do more validity checking for it, e.g.
the output shouldn't exist (if you don't want to allow overwriting)
if you do want to allow overwriting and the output exists, it must be a plain file
and so on...
It is also better to be able to pass more than one input directory/file as arguments.
Therefore it is better (and more bash-ish) to produce output on standard output, which you can redirect to a file at invocation, like
bash wordcounter.sh file_or_dir [more files or directories] > to_some_file
e.g.
bash wordcounter.sh some_dir >result.txt
#or
bash wordcounter.sh file1.txt file2.txt .... fileN.txt > result2.txt
#or
bash wordcounter.sh dir1 file1 dir2 file2 >result2.txt
The whole wordcounter.sh could be the following:
for arg
do
find "$arg" -type f -print0
done |xargs -0 grep -hoP '\b[[:alpha:]]+\b' |sort |uniq -c |sort -nr
where:
the find will search for plain files for all the arguments
and the counting pipeline will then run on the generated file list
The script still has some drawbacks, e.g. it will try to count words in image files too and the like; maybe in the next question in this series you will ask about that ;)
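If that drawback matters, one possible workaround (a sketch, not part of the original answer) is to tell GNU grep to treat binary files such as images as non-matching with -I:
for arg
do
find "$arg" -type f -print0
done |xargs -0 grep -hoIP '\b[[:alpha:]]+\b' |sort |uniq -c |sort -nr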
EDIT
If you really want a two-argument script, e.g. script where_to_search output (which isn't very bash-like), put the above pipeline into a function and do whatever you want, e.g.:
#!/bin/bash
wordcounter() {
for arg
do
find "$arg" -type f -print0
done |xargs -0 grep -hoP '\b[[:alpha:]]+\b' |sort |uniq -c |sort -nr
}
where="$1"
output="$2"
#do here the necessary checks
#...
#and run the function
wordcounter "$where" > "$output"
#end of script

Find the number of entries in a file and remove those entries using a shell script

I have the following code where I have collected all the file sizes greater than 40k from my system. I have stored all of this info into a text file. I need to process the file to read the number of times each entry is found in the text file and delete all of those entries.
I have the following code, but it does not seem to be working properly.
#! /bin/sh
rm -rf /home/b/Desktop/CalcfileSizeGreater40.txt
filename="/home/b/Desktop/fileSizeGreater40.txt"
cat $filename | while read line
do
number_of_times=`cat $filename | grep $line | wc -l`
echo $line:$number_of_times
echo $line : $number_of_times >> /home/b/Desktop/CalcfileSizeGreater40.txt
sed '/$line/d' $filename >tmp
mv tmp $filename
done
When I look at CalcfileSizeGreater40.txt I can see:
131072 : 4
65553 : 9
65553 : 9
65553 : 9
65553 : 9
65553 : 9
65553 : 9
131072 : 4
65553 : 9
65553 : 9
65553 : 9
Any ideas as to where I am going wrong?
You can simplify this line:
number_of_times=`cat $filename | grep $line | wc -l`
to:
number_of_times=$(grep -c "$line" "$filename")
The use of $(...) in place of back-quotes is extra beneficial when you need to nest command execution. You can count occurrences with grep, and you never needed to use cat. It is a good idea to get into the habit of enclosing variables that hold file names in double quotes, just in case the file names end up with spaces in them.
Editing the file that you are using cat on is not a good idea. Because of the way you are operating, the initial cat will echo every line of the original file in turn, completely ignoring any changes you make to a (different) file of the same name with the editing commands. This is why some of your names showed up a lot in the output.
However, what you are basically trying to do is count the number of occurrences of each line in the file. This is conventionally done with:
sort "$filename" |
uniq -c
The sort groups all identical sets of lines together in the file, and uniq -c counts the number of occurrences of each distinct line. It does, however, output the count before the line, so that has to be reversed — we can use sed for that. So, your script could be just:
sizefile="/home/b/Desktop/CalcfileSizeGreater40.txt"
rm -f "$sizefile"
filename="/home/b/Desktop/fileSizeGreater40.txt"
sort "$filename" |
uniq -c |
sed 's/^[ ]*\([0-9][0-9]*\)[ ]\(.*\)/\2 : \1/' > "$sizefile"
I'd be cautious about using rm -fr on your CalcfileSizeGreater40.txt; rm -f is sufficient for a file, and you probably don't want to remove stuff if it isn't a file but is a directory.
One pleasant side effect of this is that the code is a lot more efficient than the original as it makes one pass through the file (unless it is so big that sort has to split it up to handle it).
I am finding the sed code a little difficult to follow.
I should have explained that the [ ] bits are meant to represent a blank and a tab. On my machine, it appears that uniq only generates spaces, so you could simplify that to:
sed 's/^ *\([0-9][0-9]*\) \(.*\)/\2 : \1/'
The regex looks for the start of a line, any number of blanks, and then a number (which it remembers as \1 because of the \(...\) enclosing it), followed by a space and then 'everything else', which is also remembered (as '\2'). The replacement then prints the 'everything else' followed by a space, colon, space and the count.
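If the sed feels hard to read, the same swap of count and value can be sketched with awk instead (assuming, as here, that each line of the input file is a single size value):
sort "$filename" | uniq -c | awk '{print $2, ":", $1}'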
sort -g $filename | uniq -c
you will get (times number) on every line:
10 500000
1 10000
you just need to swap the two fields on every line:
sort -g $filename | uniq -c | while read a b; do echo $b $a ; done

Listing files in date order with spaces in filenames

I am starting with a file containing a list of hundreds of files (full paths) in a random order. I would like to list the details of the ten latest files in that list. This is my naive attempt:
$ ls -las -t `cat list-of-files.txt` | head -10
That works, so long as none of the files have spaces in, but fails if they do as those files are split up at the spaces and treated as separate files. File "hello world" gives me:
ls: hello: No such file or directory
ls: world: No such file or directory
I have tried quoting the files in the original list-of-files file, but the command substitution still splits the files up at the spaces in the filenames, treating the quotes as part of the filenames:
$ ls -las -t `awk '{print "\"" $0 "\""}' list-of-files.txt` | head -10
ls: "hello: No such file or directory
ls: world": No such file or directory
The only way I can think of doing this, is to ls each file individually (using xargs perhaps) and create an intermediate file with the file listings and the date in a sortable order as the first field in each line, then sort that intermediate file. However, that feels a bit cumbersome and inefficient (hundreds of ls commands rather than one or two). But that may be the only way to do it?
Is there any way to pass "ls" a list of files to process, where those files could contain spaces? It seems like it should be simple, but I'm stumped.
Instead of "one or more blank characters", you can force bash to use another field separator:
OIFS=$IFS
IFS=$'\n'
ls -las -t $(cat list-of-files.txt) | head -10
IFS=$OIFS
However, I don't think this code would be more efficient than doing a loop; in addition, that won't work if the number of files in list-of-files.txt exceeds the max number of arguments.
Try this:
xargs -d '\n' -a list-of-files.txt ls -last | head -n 10
I'm not sure whether this will work, but did you try escaping spaces with \? Using sed or something. sed "s/ /\\\\ /g" list-of-files.txt, for example.
This worked for me:
xargs -d\\n ls -last < list-of-files.txt | head -10
