Count unique words in all text files in directory, and delete those having less than 2? - bash

This gets me the count, but how do I delete the files whose count is < 2?
$ cat ./a1esso.doc | grep -o -E '\w+' | sort -u -f | wc --words
1
$ cat ./a1brit.doc | grep -o -E '\w+' | sort -u -f | wc --words
4
How do I grab the filenames of those that have fewer than 2 unique words, so we can delete them? I will be scanning millions of files. A find command can find all the files, but it seems the filename needs to be propagated through the pipeline so that rm can be applied at the far end.
Thanks for reading.
Update:
The correct answer will use an input pipeline to feed the filenames; this is not negotiable. This program is not meant for the single input file shown in the example, but for a dynamic list of many files.
The accepted answer will also contain a filter step that identifies the names of the files meeting the criterion; this is not negotiable either.

You could do this …
test $(grep -o -E '\w+' ./a1esso.doc | sort -u -f | wc --words) -lt 2 && rm ./a1esso.doc
Update: removed useless cat as per David's comment.
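For the pipeline-fed case in the update, here is a hedged sketch that lets find emit the filenames, filters them on the unique-word count, and deletes the ones below the threshold (assuming GNU find/xargs and the same GNU wc --words as above; the *.doc pattern is a placeholder for whatever the real files are called):
find . -type f -name '*.doc' -print0 |
while IFS= read -r -d '' f; do
    # count the unique words in this file, same pipeline as in the question
    words=$(grep -o -E '\w+' "$f" | sort -u -f | wc --words)
    # emit the filename, NUL-terminated, only if it has fewer than 2 unique words
    [ "$words" -lt 2 ] && printf '%s\0' "$f"
done |
xargs -0 -r rm --
The NUL separators keep filenames with blanks or newlines intact, and xargs -r avoids running rm when nothing qualifies.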

Related

grep from 7 GB text file OR many smaller ones

I have about two thousand text files in a folder.
I want to loop over each one and search for a specific word in its lines.
for file in "./*.txt";
do
cat $file | grep "banana"
done
I was wondering if joining all the text files into one file would be faster.
The whole directory has about 7 GB.
You're not actually looping, you're calling cat just once on the string ./*.txt, i.e., your script is equivalent to
cat ./*.txt | grep 'banana'
This is not equivalent to
grep 'banana' ./*.txt
though, as the output for the latter would prefix the filename for each match; you could use
grep -h 'banana' ./*.txt
to suppress filenames.
The problem you could run into is that ./*.txt expands to something that is longer than the maximum command line length allowed; to prevent that, you could do something like
printf '%s\0' ./*.txt | xargs -0 grep -h 'banana'
which is safe both for files containing blanks and for shell metacharacters, and calls grep as few times as possible¹.
This can even be parallelized; to run 4 grep processes in parallel, each handling 5 files at a time:
printf '%s\0' ./*.txt | xargs -0 -L 5 -P 4 grep -h 'banana'
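The same idea can also be driven by find instead of a shell glob; a sketch assuming GNU find and xargs (-maxdepth 1 mimics the non-recursive ./*.txt):
find . -maxdepth 1 -type f -name '*.txt' -print0 |
xargs -0 -P 4 -n 100 grep -h 'banana'
Here -n 100 hands each grep up to 100 files, and -P 4 runs four of them in parallel.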
What I think you intended to run is this:
for file in ./*.txt; do
cat "$file" | grep "banana"
done
which would call cat/grep once per file.
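As with the first question above, the cat is redundant there too; the loop body could simply be:
grep 'banana' "$file"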
¹ At first I thought that printf would run into trouble with command line length limitations as well, but it seems that, as a shell built-in, it's exempt:
$ touch '%s\0' {1000000..10000000} > /dev/null
-bash: /usr/bin/touch: Argument list too long
$ printf '%s\0' {1000000..10000000} > /dev/null
$

File Name comparison in Bash

I have two files containing lists of files. I need to check which files are missing from the list in the second file. The problem is that I do not have to match the full name, but only the last 19 characters of the file names.
E.g.
MyFile12343220150510230000.xlsx
and
MyFile99999620150510230000.xlsx
are same files.
This is a unique problem and I don't know how to start. Kindly help.
awk based solution:
$ awk '
{start=length($0) - 18;}
NR==FNR{a[substr($0, start)]++; next;} #save the last 19 characters of every line in file2.list
{if(!a[substr($0, start)]) print $0;} #if that key is not present in file2.list, print the line from file.list
' file2.list file.list
First you could use comm to match the exact file names and obtain a list of files not matching. Then you could use agrep. I've never used it, but you might find it useful.
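A sketch of the comm idea, but keyed on the last 19 characters rather than the exact names (assuming util-linux rev, and that file.list and file2.list, as in the awk answer, hold one name per line):
comm -23 <(rev file.list | cut -c -19 | rev | sort) \
         <(rev file2.list | cut -c -19 | rev | sort)
comm -23 prints the 19-character keys that occur in file.list but not in file2.list; note that it yields only the keys, not the full file names.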
Or, as last option, you can do a brute force and for every line in the first file search into the second:
#!/bin/bash
# Iterate through the first file
while read LINE; do
# Find the section of the filename that has to match in the other file
CHECK_SECTION="$(echo "$LINE" | sed -nre 's/^.*([0-9]{14})\.(.*)$/\1.\2/p')"
# Create a regex to match the filenames in the second file
SEARCH_REGEX="^.*$CHECK_SECTION$"
# Search...
egrep "$SEARCH_REGEX" inputFile_2.txt
done < inputFile_1.txt
Here I assumed the filenames end with 14 digits that must match in the other file and a file extension that can be different from file to file but that has to match too:
MyFile12343220150510230000.xlsx
| variable | 14digits |.ext
So, if the first file is FILE1 and the second file is FILE2 then if the intention is only to identify the files in FILE2 that don't exist in FILE1, the following should do:
tmp1=$(mktemp)
tmp2=$(mktemp)
cat $FILE1 | rev | cut -c -19 | sort | uniq > ${tmp1}
cat $FILE2 | rev | cut -c -19 | sort | uniq > ${tmp2}
diff ${tmp1} ${tmp2} | rev
rm ${tmp1} ${tmp2}
In a nutshell, this reverses the characters on each line, and extracts the part you're interested in, saving to a temporary file, for each list of files. The reversal of characters is done since you haven't said whether or not the length of filenames is guaranteed to be constant---the only thing we can rely on here is that the last 19 characters are of a fixed format (in this case, although the format is easily inferred, it isn't really relevant). The sort is important in order for the diff to show you what's not in the second file that is in the first.
If you're certain that there will only ever be files missing from FILE2 and not the other way around (that is, files in FILE2 that don't exist in FILE1), then you can clean things up by removing the cruft introduced by diff, so the last line becomes:
diff ${tmp1} ${tmp2} | rev | grep -i xlsx | sed 's/[[:space:]]\+.*//'
The grep limits the output to those lines with xlsx filenames, and the sed removes everything on a line from the first space encountered onwards.
Of course, technically this only tells you which timestamp-grouped sets of files exist in FILE1 but not in FILE2; as I understand it, that is what you're looking for (my reading of your problem description is that MyFile12343220150510230000.xlsx and MyFile99999620150510230000.xlsx would have identical content). If the file names are always the same length (as you subsequently affirmed), then there's no need for the rev round-trips and the cut commands can simply refer to fixed character positions (a sketch of that variant follows this answer).
In any case, to get the final list of files, you'll have to use the "cleaned up" output to filter the content of FILE1; so, modifying the script above so that it includes the "cleanup" command, we can filter the files you need with a grep, and the whole script then becomes:
tmp1=$(mktemp)
tmp2=$(mktemp)
missing=$(mktemp)
cat $FILE1 | rev | cut -c -19 | sort | uniq > ${tmp1}
cat $FILE2 | rev | cut -c -19 | sort | uniq > ${tmp2}
diff ${tmp1} ${tmp2} | rev | grep -i xlsx | sed 's/[[:space:]]\+.*//' > ${missing}
grep -E "("`echo $(<${missing}) | sed 's/[[:space:]]/|/g'`")" ${tmp1}
rm ${tmp1} ${tmp2} ${missing}
The extended grep command (-E) just builds up an "or" regular expression for each timestamp-plus-extension and applies it to the first file. Of course, this is all assuming that there will never be timestamp-groups that exist in FILE2 and not in FILE1--if this is the case, then the "diff output processing" bit needs to be a little more clever.
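For completeness, the fixed-position variant mentioned above, assuming every name follows the 31-character pattern of MyFile12343220150510230000.xlsx (so the last 19 characters sit in columns 13-31):
tmp1=$(mktemp)
tmp2=$(mktemp)
# extract the timestamp-plus-extension directly by column position
cut -c13-31 "$FILE1" | sort -u > "$tmp1"
cut -c13-31 "$FILE2" | sort -u > "$tmp2"
diff "$tmp1" "$tmp2"
rm "$tmp1" "$tmp2"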
Or you could use your standard coreutil tools:
for i in $(cat f1.txt f2.txt | sort | uniq -u); do
grep -q "$i" f1.txt && \
echo "f2 missing '$i'" || \
echo "f1 missing '$i'"
done
It will identify which non-common entries are missing from which file. You can also manipulate the non-common filenames in any way you like, e.g. parameter expansion/substring extraction, substring removal, or character indexes.
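For instance, a hedged pure-bash take on the substring idea, using parameter expansion and an associative array (assumes bash 4+, the file.list/file2.list names from the awk answer, and that every line is at least 19 characters long):
declare -A seen
while IFS= read -r name; do
    seen[${name: -19}]=1          # remember the last 19 characters of every name in file2.list
done < file2.list
while IFS= read -r name; do
    # print names from file.list whose last 19 characters were never seen in file2.list
    [[ -z ${seen[${name: -19}]} ]] && printf '%s\n' "$name"
done < file.list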

Faster grep in many files from several strings in a file

I have the following working script to grep, in a directory of many files, for some specific strings previously saved into a file.
I grep all files by their extension, since their names are random, and note that every string from my strings file has to be searched for in all the files.
Also, I cut the grep output, because it returns 2 or 3 lines per matched file and I only want the specific part that shows the filename.
I might be doing something redundant; how could it be made faster?
#!/bin/bash
#working but slow
cd /var/FILES_DIRECTORY
while read line
do
LC_ALL=C fgrep "$line" *.cps | cut -c1-27 >> /var/tmp/test_OUT.txt
done < "/var/tmp/test_STRINGS.txt"
grep -F -f /var/tmp/test_STRINGS.txt *.cps | cut -c1-27
Isn't this what you're looking for?
This should speed up your script:
#!/bin/bash
#working fast
cd /var/FILES_DIRECTORY
export LC_ALL=C
grep -f /var/tmp/test_STRINGS.txt *.cps | cut -c1-27 > /var/tmp/test_OUT.txt
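If the directory ever holds more *.cps files than fit on one command line, the printf/xargs technique from the 7 GB question above could be reused; a sketch with the same paths (assuming GNU grep, using -H so the filename prefix survives even if a batch contains a single file):
cd /var/FILES_DIRECTORY
export LC_ALL=C
printf '%s\0' *.cps |
xargs -0 grep -H -F -f /var/tmp/test_STRINGS.txt | cut -c1-27 > /var/tmp/test_OUT.txt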

Counting the number of occurrences of a character in multiple files with unix shell

I would like to help out my girlfriend - she needs the specific count of certain characters in around 200 files (per file).
I already found How can I use the UNIX shell to count the number of times a letter appears in a text file?, but that only shows the total count, not the number of occurrences per file. Basically, what I want is the following:
$ ls
test1 test2
$ cat test1
ddddnnnn
ddnnddnnnn
$ cat test2
ddnnddnnnn
$ grep -o 'n' * | wc -w
16
$ <insert command here>
test1 10
test2 6
$
or something similar regarding the output. As this will be on her university machine, I cannot code anything in Perl or the like; only the shell is allowed. My shell knowledge is a bit rusty, so I cannot come up with a better solution; maybe you could be of assistance.
grep -Ho n * | uniq -c
produces
10 test1:n
6 test2:n
If you want exactly your output:
grep -Ho n * | uniq -c | while read count file; do echo "${file%:n} $count"; done
It's not exactly elegant, but the most obvious solution is:
letter='n'
for file in *; do
count=`grep -o $letter "$file" | wc -w`
echo "$file contains $letter $count times"
done
Glen's answer is far better for the flavors of UNIX that support it. This will work on a UNIX that claims it is POSIX-compliant. This is meant for the poor folks for whom the other answer does not fly.
POSIX grep says nothing about -H or -o; see: http://pubs.opengroup.org/onlinepubs/009604499/utilities/grep.html
Get a list of the files you want and call it list.txt. I chose the character ^ (shift-6) for no particular reason.
while read fname
do
cnt=`tr -dc '^' < "$fname" | wc -c`
echo "$fname: $cnt"
done < list.txt
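If list.txt does not exist yet, one portable way to build it is with POSIX find (a sketch; the ! -name . -prune idiom keeps it from descending into subdirectories):
find . ! -name . -prune -type f > list.txt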

How to loop over files in natural order in Bash?

I am looping over all the files in a directory with the following command:
for i in *.fas; do some_code; done;
However, I get them in this order
vvchr1.fas
vvchr10.fas
vvchr11.fas
vvchr2.fas
...
instead of
vvchr1.fas
vvchr2.fas
vvchr3.fas
...
which is the natural order.
I have tried the sort command, but to no avail.
readarray -d '' entries < <(printf '%s\0' *.fas | sort -zV)
for entry in "${entries[@]}"; do
# do something with $entry
done
where printf '%s\0' *.fas yields a NUL separated list of directory entries with the extension .fas, and sort -zV sorts them in natural order.
Note that you need GNU sort installed in order for this to work.
With the option -g, sort compares according to general numerical value:
for FILE in `ls ./raw/ | sort -g`; do echo "$FILE"; done
0.log
1.log
2.log
...
10.log
11.log
This will only work if the names of the files are numerical. If they are strings you will get them in alphabetical order. E.g.:
for FILE in `ls ./raw/* | sort -g`; do echo "$FILE"; done
raw/0.log
raw/10.log
raw/11.log
...
raw/2.log
You will get the files in ASCII order. This means that vvchr10* comes before vvchr2*. I realise that you can not rename your files (my bioinformatician brain tells me they contain chromosome data, and we simply don't call chromosome 1 "chr01"), so here's another solution (not using sort -V which I can't find on any operating system I'm using):
ls *.fas | sed 's/^\([^0-9]*\)\([0-9]*\)/\1 \2/' | sort -k2,2n | tr -d ' ' |
while read filename; do
# do work with $filename
done
This is a bit convoluted and will not work with filenames containing spaces.
Another solution: Suppose we'd like to iterate over the files in size-order instead, which might be more appropriate for some bioinformatics tasks:
du *.fas | sort -k1,1n |
while read filesize filename; do
# do work with $filename
done
To reverse the sorting, just add r after -k1,1n (to get -k1,1nr).
You mean that files with the number 10 come before files with the number 3 in your list? That's because ls sorts its results very simply (as strings), so something-10.whatever sorts before something-3.whatever.
One solution is to rename all the files so they have the same number of digits (the files with a single-digit number get a leading 0), for example as sketched below.
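A hedged sketch of that rename, assuming the names really look like vvchrN.fas with no leading zeros yet and that two digits are enough:
for f in vvchr*.fas; do
    n=${f#vvchr}; n=${n%.fas}            # extract the numeric part
    new=$(printf 'vvchr%02d.fas' "$n")   # zero-pad it to two digits
    [ "$f" = "$new" ] || mv -n -- "$f" "$new"
done
After that, plain glob order already is the natural order.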
while IFS= read -r file ; do
ls -l "$file" # or whatever
done < <(find . -name '*.fas' 2>/dev/null | sed -r -e 's/([0-9]+)/ \1/' | sort -k 2 -n | sed -e 's/ //;')
Solves the problem, presuming the file naming stays consistent, doesn't rely on very-recent versions of GNU sort, does not rely on reading the output of ls and doesn't fall victim to the pipe-to-while problems.
Like @Kusalananda's solution (perhaps easier to remember?) but catering for all files(?):
array=($(ls | sed 's/[^0-9]*\([0-9]*\)\..*/\1 &/' | sort -n | sed 's/^[^ ]* //'))
for x in "${array[@]}"; do echo "$x"; done
In essence add a sort key, sort, remove sort key.
Use sort -rh and a while loop:
du -sh * | sort -rh | grep -P "avi$" | awk '{print $2}' | while read f; do fp=`pwd`/$f; echo "$fp"; done
