What is the fastest way to delete lines in a file which have no match in a second file? - ruby

I have two files, wordlist.txt and text.txt.
The first file, wordlist.txt, contains a huge list of words in Chinese, Japanese, and Korean, e.g.:
你
你们
我
The second file, text.txt, contains long passages, e.g.:
你们要去哪里?
卡拉OK好不好?
I want to create a new word list (wordsfound.txt), but it should only contain those lines from wordlist.txt which are found at least once within text.txt. The output file from the above should show this:
你
你们
"我" is not found in this list because it is never found in text.txt.
I want to find a very fast way to create this list which only contains lines from the first file that are found in the second.
I know a simple way in BASH to check each line in wordlist.txt and see if it is in text.txt using grep:
a=1
while read line
do
c=`grep -c $line text.txt`
if [ "$c" -ge 1 ]
then
echo $line >> wordsfound.txt
echo "Found" $a
fi
echo "Not found" $a
a=`expr $a + 1`
done < wordlist.txt
Unfortunately, as wordlist.txt is a very long list, this process takes many hours. There must be a faster solution. Here is one consideration:
As the files contain CJK letters, they can be thought of as a giant alphabet with about 8,000 letters. So nearly every word shares characters. E.g.:
我
我们
Due to this fact, if "我" is never found within text.txt, then it is quite logical that "我们" never appears either. A faster script might perhaps check "我" first, and upon finding that it is not present, would avoid checking every subsequent word in wordlist.txt that also contains "我". If there are about 8,000 unique characters found in wordlist.txt, then the script should not need to check so many lines.
What is the fastest way to create the list containing only those words from the first file that are also found somewhere within the second?
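One way to exploit that observation, sketched here as a rough shell pipeline (this is only an illustration; it assumes GNU grep with a UTF-8 locale, and text_chars.txt and missing_chars.txt are made-up intermediate files):
# every distinct character that occurs anywhere in text.txt, one per line
grep -o . text.txt | sort -u > text_chars.txt
# characters used by the word list but absent from the text
grep -o . wordlist.txt | sort -u | grep -v -x -F -f text_chars.txt > missing_chars.txt
# drop every word containing a missing character, then verify only the survivors
grep -v -F -f missing_chars.txt wordlist.txt |
while read -r word; do
grep -q -F "$word" text.txt && echo "$word"
done > wordsfound.txt
The final loop still runs one grep per word, but only over words whose characters all occur in the text, which is exactly the pruning described above.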

I grabbed the text of War and Peace from the Gutenberg project and wrote the following script. It prints all words in /usr/share/dict/words which are also in war_and_peace.txt. You can change that with:
perl findwords.pl --wordlist=/path/to/wordlist --text=/path/to/text > wordsfound.txt
On my computer, it takes just over a second to run.
use strict;
use warnings;
use utf8::all;
use Getopt::Long;
my $wordlist = '/usr/share/dict/words';
my $text = 'war_and_peace.txt';
GetOptions(
"worlist=s" => \$wordlist,
"text=s" => \$text,
);
open my $text_fh, '<', $text
or die "Cannot open '$text' for reading: $!";
my %is_in_text;
while ( my $line = <$text_fh> ) {
chomp($line);
# you will want to customize this line
my @words = grep { $_ } split /[[:punct:][:space:]]/ => $line;
next unless @words;
# This beasty uses the 'x' builtin in list context to assign
# the value of 1 to all keys (the words)
@is_in_text{@words} = (1) x @words;
}
open my $wordlist_fh, '<', $wordlist
or die "Cannot open '$wordlist' for reading: $!";
while ( my $word = <$wordlist_fh> ) {
chomp($word);
if ( $is_in_text{$word} ) {
print "$word\n";
}
}
And here's my timing:
[ovid] $ wc -w war_and_peace.txt
565450 war_and_peace.txt
[ovid] $ time perl findwords.pl > wordsfound.txt
real 0m1.081s
user 0m1.076s
sys 0m0.000s
[ovid] $ wc -w wordsfound.txt
15277 wordsfound.txt

Just use comm
http://unstableme.blogspot.com/2009/08/linux-comm-command-brief-tutorial.html
comm -1 wordlist.txt text.txt

This might work for you:
tr '[:punct:]' ' ' < text.txt | tr -s ' ' '\n' |sort -u | grep -f - wordlist.txt
Basically, create a new word list from text.txt and grep it against wordlist.txt file.
N.B. You may want to use the software you used to build the original wordlist.txt. In which case all you need is:
yoursoftware < text.txt > newwordlist.txt
grep -f newwordlist.txt wordlist.txt

Use grep with fixed-strings (-F) semantics, this will be fastest. Similarly, if you want to write it in Perl, use the index function instead of regex.
sort -u wordlist.txt > wordlist-unique.txt
grep -F -f wordlist-unique.txt text.txt
I'm surprised that there are already four answers, but no one posted this yet. People just don't know their toolbox anymore.

I would probably use Perl;
use strict;
my @aWordList = ();
open(WORDLIST, "< wordlist.txt") || die("Can't open wordlist.txt");
while(my $sWord = <WORDLIST>)
{
chomp($sWord);
push(@aWordList, $sWord);
}
close(WORDLIST);
open(TEXT, "< text.txt") || die("Can't open text.txt");
while(my $sText = <TEXT>)
{
foreach my $sWord (@aWordList)
{
if($sText =~ /$sWord/)
{
print("$sWord\n");
}
}
}
close(TEXT);
This won't be too slow, but if you could let us know the size of the files you're dealing with I could have a go at writing something much more clever with hash tables
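For reference, a hash-based pass of the kind hinted at could be sketched with awk (this is my own illustration, not the answerer's promised version, and unlike the substring regex above it only finds whitespace-delimited whole words):
# build a set of every whitespace-delimited token in text.txt,
# then print the wordlist lines that appear in that set
awk 'NR==FNR { for (i = 1; i <= NF; i++) seen[$i]; next }
($0 in seen)' text.txt wordlist.txt > wordsfound.txt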

Quite sure not the fastest solution, but at least a working one (I hope).
This solution needs Ruby 1.9; the text files are expected to be UTF-8.
#encoding: utf-8
#Get test data
$wordlist = File.readlines('wordlist.txt', :encoding => 'utf-8').map{|x| x.strip}
$txt = File.read('text.txt', :encoding => 'utf-8')
new_wordlist = []
$wordlist.each{|word|
new_wordlist << word if $txt.include?(word)
}
#Save the result
File.open('wordlist_new.txt', 'w:utf-8'){|f|
f << new_wordlist.join("\n")
}
Can you provide a bigger example to make some benchmark on different methods? (Perhaps some test files to download?)
Below a benchmark with four methods.
#encoding: utf-8
require 'benchmark'
N = 10_000 #Number of Test loops
#Get test data
$wordlist = File.readlines('wordlist.txt', :encoding => 'utf-8').map{|x| x.strip}
$txt = File.read('text.txt', :encoding => 'utf-8')
def solution_count
new_wordlist = []
$wordlist.each{|word|
new_wordlist << word if $txt.count(word) > 0
}
new_wordlist.sort
end
#Faster than count, it can stop after the first hit
def solution_include
new_wordlist = []
$wordlist.each{|word|
new_wordlist << word if $txt.include?(word)
}
new_wordlist.sort
end
def solution_combine()
#get biggest word size
max = 0
$wordlist.each{|word| max = word.size if word.size > max }
#Build list of all letter combination from text
words_in_txt = []
0.upto($txt.size){|i|
1.upto(max){|l|
words_in_txt << $txt[i,l]
}
}
(words_in_txt & $wordlist).sort
end
#Idea behind:
#- remove string if found.
#- the next comparison is faster, the search text is shorter.
#
#This will not work with overlapping words.
#Example:
# abcdef contains def.
# if we check bcd first, the 'd' of def will be deleted, def is not detected.
def solution_gsub
new_wordlist = []
txt = $txt.dup #avoid to manipulate data source for other methods
#We must start with the big words.
#If we start with small one, we destroy long words
$wordlist.sort_by{|x| x.size }.reverse.each{|word|
new_wordlist << word if txt.gsub!(word,'')
}
#Now we must add words which were already part of longer words
new_wordlist.dup.each{|neww|
$wordlist.each{|word|
new_wordlist << word if word != neww and neww.include?(word)
}
}
new_wordlist.sort
end
#Save the result
File.open('wordlist_new.txt', 'w:utf-8'){|f|
#~ f << solution_include.join("\n")
f << solution_combine.join("\n")
}
#Check the different results
if solution_count != solution_include
puts "Difference solution_count <> solution_include"
end
if solution_gsub != solution_include
puts "Difference solution_gsub <> solution_include"
end
if solution_combine != solution_include
puts "Difference solution_combine <> solution_include"
end
#Benchmark the solution
Benchmark.bmbm(10) {|b|
b.report('count') { N.times { solution_count } }
b.report('include') { N.times { solution_include } }
b.report('gsub') { N.times { solution_gsub } } #wrong results
b.report('combine') { N.times { solution_combine } }
} #Benchmark
I think the solution_gsub variant is not correct. See the comment in the method definition. If CJK allows this approach anyway, please give me feedback.
That variant is the slowest in my test, but perhaps it will improve with bigger examples.
And perhaps it can be tuned a bit.
The combine variant is also very slow, but it would be interesting to see what happens with a bigger example.

First TXR Lisp solution ( http://www.nongnu.org/txr ):
(defvar tg-hash (hash)) ;; tg == "trigraph"
(unless (= (len *args*) 2)
(put-line `arguments required: <wordfile> <textfile>`)
(exit nil))
(defvar wordfile [*args* 0])
(defvar textfile [*args* 1])
(mapcar (lambda (line)
(dotimes (i (len line))
(push line [tg-hash [line i..(succ i)]])
(push line [tg-hash [line i..(ssucc i)]])
(push line [tg-hash [line i..(sssucc i)]])))
(file-get-lines textfile))
(mapcar (lambda (word)
(if (< (len word) 4)
(if [tg-hash word]
(put-line word))
(if (find word [tg-hash [word 0..3]]
(op search-str #2 #1))
(put-line word))))
(file-get-lines wordfile))
The strategy here is to reduce the text corpus to a hash table which is indexed on the individual characters, digraphs and trigraphs occurring in the lines, associating these fragments with the lines. Then when we process the word list, this reduces the search effort.
Firstly, if the word is short, three characters or less (probably common in Chinese words), we can try to get an instant match in the hash table. If there is no match, the word is not in the corpus.
If the word is longer than three characters, we can try to get a match for the first three characters. That gives us a list of lines which contain a match for the trigraph. We can search those lines exhaustively to see which ones of them match the word. I suspect that this will greatly reduce the number of lines that have to be searched.
I would need your data, or something representative thereof, to be able to see what the behavior is like.
Sample run:
$ txr words.tl words.txt text.txt
water
fire
earth
the
$ cat words.txt
water
fire
earth
the
it
$ cat text.txt
Long ago people
believed that the four
elements were
just
water
fire
earth
(TXR reads UTF-8 and does all string manipulation in Unicode, so testing with ASCII characters is valid.)
The use of lazy lists means that we do not store the entire list of 300,000 words, for instance. Although we are using the Lisp mapcar function, the list is being generated on the fly and because we don't keep the reference to the head of the list, it is eligible for garbage collection.
Unfortunately we do have to keep the text corpus in memory because the hash table associates lines.
If that's a problem, the solution could be reversed. Scan all the words, and then process the text corpus lazily, tagging those words which occur. Then eliminate the rest. I will post such a solution also.
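Until that reversed solution is posted, here is a rough sketch of the idea using awk rather than TXR (my own illustration; it keeps only the word list in memory and checks every remaining word against every line, so it trades memory for a simpler but heavier scan):
# load the word list into memory, then stream the text once;
# a word is printed and dropped the first time a line contains it
awk 'NR==FNR { words[$0]; next }
{ for (w in words) if (index($0, w)) { print w; delete words[w] } }' wordlist.txt text.txt > wordsfound.txt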

new file newlist.txt
for each word in wordlist.txt:
check if word is in text.txt (I would use grep, if you're willing to use bash)
if yes:
append it to newlist.txt (probably echo word >> newlist.txt)
if no:
next word

Simplest way with bash script:
First preprocess text.txt with "tr" and "sort" to format it as one word per line and remove duplicate lines.
Do this:
cat wordlist.txt | while read i; do grep -E "^$i$" text.txt; done;
That's the list of words you want...
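The preprocessing step mentioned above is not shown; one possible rendering (an assumption about what was intended, with text_words.txt as a made-up name) is:
tr '[:punct:]' ' ' < text.txt | tr -s ' ' '\n' | sort -u > text_words.txt
The grep loop above would then be run against text_words.txt instead of text.txt.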

Try this:
cat wordlist.txt | while read line
do
if [ $(grep -wc "$line" text.txt) -gt 0 ]
then
echo $line
fi
done
Whatever you do, if you use grep you must use -w to match a whole word. Otherwise if you have foo in wordlist.txt and foobar in text.txt, you'll get a wrong match.
If the files are VERY big, and this loop takes too much time to run, you can convert text.txt to a list of words (easy with AWK), and use comm to find the words that are in both lists.
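A rough sketch of that AWK plus comm variant (my own illustration; text_words.txt is a made-up intermediate file, and like the -w grep it only finds whole-word matches):
# one word per line from text.txt, sorted and de-duplicated
awk '{ for (i = 1; i <= NF; i++) print $i }' text.txt | sort -u > text_words.txt
# comm needs both inputs sorted; -12 keeps only the lines common to both
comm -12 <(sort -u wordlist.txt) text_words.txt > wordsfound.txt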

This solution is in Perl, maintains your original semantics and uses the optimization you suggested.
#!/usr/bin/perl
@list=split("\n",`sort < ./wordlist.txt | uniq`);
$size=scalar(@list);
for ($i=0;$i<$size;++$i) { $list[$i]=quotemeta($list[$i]);}
for ($i=0;$i<$size;++$i) {
my $j = $i+1;
while ($list[$j]=~/^$list[$i]/) {
++$j;
}
$skip[$i]=($j-$i-1);
}
open IN,"<./text.txt" || die;
@text = (<IN>);
close IN;
foreach $c(@text) {
for ($i=0;$i<$size;++$i) {
if ($c=~/$list[$i]/) {
$found{$list[$i]}=1;
last;
}
else {
$i+=$skip[$i];
}
}
}
open OUT,">wordsfound.txt" ||die;
while ( my ($key, $value) = each(%found) ) {
print OUT "$key\n";
}
close OUT;
exit;

Use parallel processing to speed up the processing.
1) sort & uniq on wordlist.txt, then split it into several files (X)
Do some testing; X should equal the number of cores on your computer.
split -d -l <lines_per_file> wordlist.txt
2) use xargs -P X -n 1 script.sh x00 > output-x00.txt
to process the files in parallel
find ./splitted_files_dir -type f -name "x*" -print| xargs -P 20 -n 1 -I SPLITTED_FILE script.sh SPLITTED_FILE
3) cat output* > output.txt to concatenate the output files
This will speed up the processing enough, and you are able to use tools that you already understand, which keeps the maintenance "cost" down.
The script is almost identical to the one you used in the first place.
script.sh
FILE=$1
OUTPUTFILE="output-${FILE}.txt"
TEXT="text.txt"
a=1
while read line
do
c=`grep -c $line ${TEXT}`
if [ "$c" -ge 1 ]
then
echo $line >> ${OUTPUTFILE}
echo "Found" $a
fi
echo "Not found" $a
a=`expr $a + 1`
done < ${FILE}

Related

Unscramble words Challenge - improve my bash solution

There is a Capture the Flag challenge
I have two files; one with scrambled text like this with about 550 entries
dnaoyt
cinuertdso
bda
haey
tolpap
...
The second file is a dictionary with about 9,000 entries
radar
ccd
gcc
fcc
historical
...
The goal is to find the right, unscrambled version of the word, which is contained in the dictionary file.
My approach is to sort the characters from the first word from the first file and then look up if the first word from the second file has the same length. If so then sort that too and compare them.
This is my fully functional bash script, but it is very slow.
#!/bin/bash
while IFS="" read -r p || [ -n "$p" ]
do
var=0
ro=$(echo $p | perl -F -lane 'print sort @F')
len_ro=${#ro}
while IFS="" read -r o || [ -n "$o" ]
do
ro2=$(echo $o | perl -F -lane 'print sort @F')
len_ro2=${#ro2}
let "var+=1"
if [ $len_ro == $len_ro2 ]; then
if [ $ro == $ro2 ]; then
echo $o >> new.txt
echo $var >> whichline.txt
fi
fi
done < dictionary.txt
done < scrambled-words.txt
I have also tried converting all characters to ASCII integers and summing each word, but while comparing I realized that different character patterns may produce the same sum.
[edit]
For the record:
- no anagrams are contained in the dictionary
- to get the flag, you need to export the unscrambled words as one blob and make a SHA hash out of it (that's the flag)
- link to ctf for guy who wanted the files https://challenges.reply.com/tamtamy/user/login.action
You're better off creating a lookup dictionary (keyed by the sorted word) from the dictionary file.
Your loop body is executed 550 * 9,000 = 4,950,000 times (O(N*M)).
The solution I propose executes two loops of at most 9,000 passes each (O(N+M)).
Bonus: It finds all possible solutions at no cost.
#!/usr/bin/perl
use strict;
use warnings qw( all );
use feature qw( say );
my $dict_qfn = "dictionary.txt";
my $scrambled_qfn = "scrambled-words.txt";
sub key { join "", sort split //, $_[0] }
my %dict;
{
open(my $fh, "<", $dict_qfn)
or die("Can't open \"$dict_qfn\": $!\n");
while (<$fh>) {
chomp;
push @{ $dict{key($_)} }, $_;
}
}
{
open(my $fh, "<", $scrambled_qfn)
or die("Can't open \"$scrambled_qfn\": $!\n");
while (<$fh>) {
chomp;
my $matches = $dict{key($_)};
say "$_ matches #$matches" if $matches;
}
}
I wouldn't be surprised if this only takes one millionth of the time of your solution for the sizes you provided (and it scales so much better than yours if you were to increase the sizes).
I would do something like this with gawk
gawk '
NR == FNR {
dict[csort()] = $0
next
}
{
print dict[csort()]
}
function csort( chars, sorted) {
split($0, chars, "")
asort(chars)
for (i in chars)
sorted = sorted chars[i]
return sorted
}' dictionary.txt scrambled-words.txt
Here's a perl-free solution I came up with using sort and join:
sort_letters() {
# Splits each letter onto a line, sorts the letters, then joins them
# e.g. "hello" becomes "ehllo"
echo "${1}" | fold-b1 | sort | tr -d '\n'
}
# For each input file...
for input in "dict.txt" "words.txt"; do
# Convert each line to [sorted] [original]
# then sort and save the results with a .sorted extension
while read -r original; do
sorted=$(sort_letters "${original}")
echo "${sorted} ${original}"
done < "${input}" | sort > "${input}.sorted"
done
# Join the two files on the [sorted] word
# outputting the scrambled and unscrambed words
join -j 1 -o 1.2,2.2 "words.txt.sorted" "dict.txt.sorted"
I tried something very alike, but a bit different.
#!/bin/bash
exec 3<scrambled-words.txt
while read -r line <&3; do
printf "%s" ${line} | perl -F -lane 'print sort #F'
done>scrambled-words_sorted.txt
exec 3>&-
exec 3<dictionary.txt
while read -r line <&3; do
printf "%s" ${line} | perl -F -lane 'print sort #F'
done>dictionary_sorted.txt
exec 3>&-
printf "" > whichline.txt
exec 3<scrambled-words_sorted.txt
while read -r line <&3; do
counter="$((++counter))"
grep -n -e "^${line}$" dictionary_sorted.txt | cut -d ':' -f 1 | tr -d '\n' >>whichline.txt printf "\n" >>whichline.txt
done
exec 3>&-
As you can see I don't create a new.txt file; instead I only create whichline.txt with a blank line where the word doesn't match. You can easily paste them up to create new.txt.
The logic behind the script is nearly the logic behind yours, except that I call perl fewer times and I save two support files.
I think (but I am not sure) that creating them and cycling through only one file will be better than ~5 million calls of perl. This way perl is called "only" ~10k times.
Finally, I decided to use grep because it's (maybe) the fastest regex matcher, and when searching for the entire line the length check is intrinsic in the regex.
Please note that what @benjamin-w said is still valid; in that case grep will match incorrectly, and I did not handle it!
I hope this could help [:

Check if all of multiple strings or regexes exist in a file

I want to check if all of my strings exist in a text file. They could exist on the same line or on different lines. And partial matches should be OK. Like this:
...
string1
...
string2
...
string3
...
string1 string2
...
string1 string2 string3
...
string3 string1 string2
...
string2 string3
... and so on
In the above example, we could have regexes in place of strings.
For example, the following code checks if any of my strings exists in the file:
if grep -EFq "string1|string2|string3" file; then
# there is at least one match
fi
How to check if all of them exist? Since we are just interested in the presence of all matches, we should stop reading the file as soon as all strings are matched.
Is it possible to do it without having to invoke grep multiple times (which won't scale when the input file is large or if we have a large number of strings to match) or use a tool like awk or python?
Also, is there a solution for strings that can easily be extended for regexes?
Awk is the tool that the guys who invented grep, shell, etc. invented to do general text manipulation jobs like this, so I'm not sure why you'd want to try to avoid it.
In case brevity is what you're looking for, here's the GNU awk one-liner to do just what you asked for:
awk 'NR==FNR{a[$0];next} {for(s in a) if(!index($0,s)) exit 1}' strings RS='^$' file
And here's a bunch of other information and options:
Assuming you're really looking for strings, it'd be:
awk -v strings='string1 string2 string3' '
BEGIN {
numStrings = split(strings,tmp)
for (i in tmp) strs[tmp[i]]
}
numStrings == 0 { exit }
{
for (str in strs) {
if ( index($0,str) ) {
delete strs[str]
numStrings--
}
}
}
END { exit (numStrings ? 1 : 0) }
' file
the above will stop reading the file as soon as all strings have matched.
If you were looking for regexps instead of strings then with GNU awk for multi-char RS and retention of $0 in the END section you could do:
awk -v RS='^$' 'END{exit !(/regexp1/ && /regexp2/ && /regexp3/)}' file
Actually, even if it were strings you could do:
awk -v RS='^$' 'END{exit !(index($0,"string1") && index($0,"string2") && index($0,"string3"))}' file
The main issue with the above 2 GNU awk solutions is that, like @anubhava's GNU grep -P solution, the whole file has to be read into memory at one time whereas with the first awk script above, it'll work in any awk in any shell on any UNIX box and only stores one line of input at a time.
I see you've added a comment under your question to say you could have several thousand "patterns". Assuming you mean "strings" then instead of passing them as arguments to the script you could read them from a file, e.g. with GNU awk for multi-char RS and a file with one search string per line:
awk '
NR==FNR { strings[$0]; next }
{
for (string in strings)
if ( !index($0,string) )
exit 1
}
' file_of_strings RS='^$' file_to_be_searched
and for regexps it'd be:
awk '
NR==FNR { regexps[$0]; next }
{
for (regexp in regexps)
if ( $0 !~ regexp )
exit 1
}
' file_of_regexps RS='^$' file_to_be_searched
If you don't have GNU awk and your input file does not contain NUL characters then you can get the same effect as above by using RS='\0' instead of RS='^$', or by appending to a variable one line at a time as it's read and then processing that variable in the END section.
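A sketch of that append-to-a-variable fallback (my own rendering, not code from the original answer):
awk '
NR==FNR { strings[$0]; next }
{ text = text $0 "\n" }
END {
for (string in strings)
if ( !index(text,string) )
exit 1
}
' file_of_strings file_to_be_searched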
If your file_to_be_searched is too large to fit in memory then it'd be this for strings:
awk '
NR==FNR { strings[$0]; numStrings=NR; next }
numStrings == 0 { exit }
{
for (string in strings) {
if ( index($0,string) ) {
delete strings[string]
numStrings--
}
}
}
END { exit (numStrings ? 1 : 0) }
' file_of_strings file_to_be_searched
and the equivalent for regexps:
awk '
NR==FNR { regexps[$0]; numRegexps=NR; next }
numRegexps == 0 { exit }
{
for (regexp in regexps) {
if ( $0 ~ regexp ) {
delete regexps[regexp]
numRegexps--
}
}
}
END { exit (numRegexps ? 1 : 0) }
' file_of_regexps file_to_be_searched
git grep
Here is the syntax using git grep with multiple patterns:
git grep --all-match --no-index -l -e string1 -e string2 -e string3 file
You may also combine patterns with Boolean expressions such as --and, --or and --not.
Check man git-grep for help.
--all-match When giving multiple pattern expressions, this flag is specified to limit the match to files that have lines to match all of them.
--no-index Search files in the current directory that is not managed by Git.
-l/--files-with-matches/--name-only Show only the names of files.
-e The next parameter is the pattern. Default is to use basic regexp.
Other params to consider:
--threads Number of grep worker threads to use.
-q/--quiet/--silent Do not output matched lines; exit with status 0 when there is a match.
To change the pattern type, you may also use -G/--basic-regexp (default), -F/--fixed-strings, -E/--extended-regexp, -P/--perl-regexp, -f file, and other.
This gnu-awk script may work:
cat fileSearch.awk
re == "" {
exit
}
{
split($0, null, "\\<(" re "\\>)", b)
for (i=1; i<=length(b); i++)
gsub("\\<" b[i] "([|]|$)", "", re)
}
END {
exit (re != "")
}
Then use it as:
if awk -v re='string1|string2|string3' -f fileSearch.awk file; then
echo "all strings were found"
else
echo "all strings were not found"
fi
Alternatively, you can use this gnu grep solution with PCRE option:
grep -qzP '(?s)(?=.*\bstring1\b)(?=.*\bstring2\b)(?=.*\bstring3\b)' file
Using -z we make grep read the complete file into a single string.
We are using multiple lookahead assertions to assert that all the strings are present in the file.
The regex must use (?s), the DOTALL modifier, to make .* match across lines.
As per man grep:
-z, --null-data
Treat input and output data as sequences of lines, each terminated by a
zero byte (the ASCII NUL character) instead of a newline.
First, you probably want to use awk. Since you eliminated that option in the question statement, yes, it is possible to do and this provides a way to do it. It is likely MUCH slower than using awk, but if you want to do it anyway...
This is based on the following assumptions:
Invoking AWK is unacceptable
Invoking grep multiple times is unacceptable
The use of any other external tools are unacceptable
Invoking grep less than once is acceptable
It must return success if everything is found, failure when not
Using bash instead of external tools is acceptable
bash version is >= 3 for the regular expression version
This might meet all of your requirements: (the regex version is missing some comments; look at the string version instead)
#!/bin/bash
multimatch() {
filename="$1" # Filename is first parameter
shift # move it out of the way so that "$@" is useful
strings=( "$@" ) # search strings into an array
declare -a matches # Array to keep track which strings already match
# Initiate array tracking what we have matches for
for ((i=0;i<${#strings[@]};i++)); do
matches[$i]=0
done
while IFS= read -r line; do # Read file linewise
foundmatch=0 # Flag to indicate whether this line matched anything
for ((i=0;i<${#strings[@]};i++)); do # Loop through strings indexes
if [ "${matches[$i]}" -eq 0 ]; then # If no previous line matched this string yet
string="${strings[$i]}" # fetch the string
if [[ $line = *$string* ]]; then # check if it matches
matches[$i]=1 # mark that we have found this
foundmatch=1 # set the flag, we need to check whether we have something left
fi
fi
done
# If we found something, we need to check whether we
# can stop looking
if [ "$foundmatch" -eq 1 ]; then
somethingleft=0 # Flag to see if we still have unmatched strings
for ((i=0;i<${#matches[@]};i++)); do
if [ "${matches[$i]}" -eq 0 ]; then
somethingleft=1 # Something is still outstanding
break # no need check whether more strings are outstanding
fi
done
# If we didn't find anything unmatched, we have everything
if [ "$somethingleft" -eq 0 ]; then return 0; fi
fi
done < "$filename"
# If we get here, we didn't have everything in the file
return 1
}
multimatch_regex() {
filename="$1" # Filename is first parameter
shift # move it out of the way so that "$@" is useful
regexes=( "$@" ) # Regexes into an array
declare -a matches # Array to keep track which regexes already match
# Initiate array tracking what we have matches for
for ((i=0;i<${#regexes[@]};i++)); do
matches[$i]=0
done
while IFS= read -r line; do # Read file linewise
foundmatch=0 # Flag to indicate whether this line matched anything
for ((i=0;i<${#regexes[@]};i++)); do # Loop through regex indexes
if [ "${matches[$i]}" -eq 0 ]; then # If no previous line matched this string yet
regex="${regexes[$i]}" # Get regex from array
if [[ $line =~ $regex ]]; then # We use the bash regex operator here
matches[$i]=1 # mark that we have found this
foundmatch=1 # set the flag, we need to check whether we have something left
fi
fi
done
# If we found something, we need to check whether we
# can stop looking
if [ "$foundmatch" -eq 1 ]; then
somethingleft=0 # Flag to see if we still have unmatched strings
for ((i=0;i<${#matches[@]};i++)); do
if [ "${matches[$i]}" -eq 0 ]; then
somethingleft=1 # Something is still outstanding
break # no need check whether more strings are outstanding
fi
done
# If we didn't find anything unmatched, we have everything
if [ "$somethingleft" -eq 0 ]; then return 0; fi
fi
done < "$filename"
# If we get here, we didn't have everything in the file
return 1
}
if multimatch "filename" string1 string2 string3; then
echo "file has all strings"
else
echo "file miss one or more strings"
fi
if multimatch_regex "filename" "regex1" "regex2" "regex3"; then
echo "file match all regular expressions"
else
echo "file does not match all regular expressions"
fi
Benchmarks
I did some benchmarking, searching the .c, .h and .sh files in arch/arm/ from Linux 4.16.2 for the strings "void", "function", and "#define". (Shell wrappers were added / the code was tuned so that everything can be called as testname <filename> <searchstring> [...] and so that an if can be used to check the result.)
Results: (measured with time, real time rounded to closest half second)
multimatch: 49s
multimatch_regex: 55s
matchall: 10.5s
fileMatchesAllNames: 4s
awk (first version): 4s
agrep: 4.5s
Perl re (-r): 10.5s
Perl non-re: 9.5s
Perl non-re optimised: 5s (Removed Getopt::Std and regex support for faster startup)
Perl re optimised: 7s (Removed Getopt::Std and non-regex support for faster startup)
git grep: 3.5s
C version (no regex): 1.5s
(Invoking grep multiple times, especially with the recursive method, did better than I expected)
A recursive solution. Iterate over the files one by one. For each file, check if it matches the first pattern and break early (-m1: stop on first match); only if it matched the first pattern, search for the second pattern, and so on:
#!/bin/bash
patterns="$#"
fileMatchesAllNames () {
file=$1
if [[ $# -eq 1 ]]
then
echo "$file"
else
shift
pattern=$1
shift
grep -m1 -q "$pattern" "$file" && fileMatchesAllNames "$file" $#
fi
}
for file in *
do
test -f "$file" && fileMatchesAllNames "$file" $patterns
done
Usage:
./allfilter.sh cat filter java
test.sh
Searches in the current dir for the tokens "cat", "filter" and "java". Found them only in "test.sh".
So grep is invoked often in the worst case scenario (finding the first N-1 patterns in the last line of each file, except for the N-th pattern).
But with an informed ordering (rarely matching patterns first, early-matching patterns first) if possible, the solution should be reasonably fast, since many files are abandoned early because they didn't match the first keyword, or accepted early because they matched a keyword close to the top.
Example: You search a scala source file which contains tailrec (somewhat rarely used), mutable (rarely used, but if so, close to the top in import statements), main (rarely used, often not close to the top) and println (often used, unpredictable position); you would order them:
./allfilter.sh mutable tailrec main println
Performance:
ls *.scala | wc
89 89 2030
In 89 scala files, I have the keywords distribution:
for keyword in mutable tailrec main println; do grep -m 1 $keyword *.scala | wc -l ; done
16
34
41
71
Searching for them with a slightly modified version of the script, which allows a file pattern to be used as the first argument, takes about 0.2s:
time ./allfilter.sh "*.scala" mutable tailrec main println
Filepattern: *.scala Patterns: mutable tailrec main println
aoc21-2017-12-22_00:16:21.scala
aoc25.scala
CondenseString.scala
Partition.scala
StringCondense.scala
real 0m0.216s
user 0m0.024s
sys 0m0.028s
in close to 15,000 lines of code:
cat *.scala | wc
14913 81614 610893
update:
After reading in the comments to the question that we might be talking about thousands of patterns, handing them as arguments doesn't seem to be a clever idea; better to read them from a file, and pass the filename as an argument - maybe for the list of files to filter too:
#!/bin/bash
filelist="$1"
patternfile="$2"
patterns="$(< $patternfile)"
fileMatchesAllNames () {
file=$1
if [[ $# -eq 1 ]]
then
echo "$file"
else
shift
pattern=$1
shift
grep -m1 -q "$pattern" "$file" && fileMatchesAllNames "$file" $#
fi
}
echo -e "Filepattern: $filepattern\tPatterns: $patterns"
for file in $(< $filelist)
do
test -f "$file" && fileMatchesAllNames "$file" $patterns
done
If the number and length of patterns/files exceeds the possibilities of argument passing, the list of patterns could be split into many pattern files and processed in a loop (for example, with 20 pattern files):
for i in {1..20}
do
./allfilter2.sh file.$i.lst pattern.$i.lst > file.$((i+1)).lst
done
You can
make use of the -o|--only-matching option of grep (which forces grep to output only the matched parts of a matching line, with each such part on a separate output line),
then eliminate duplicate occurrences of matched strings with sort -u,
and finally check that the count of remaining lines equals the count of the input strings.
Demonstration:
$ cat input
...
string1
...
string2
...
string3
...
string1 string2
...
string1 string2 string3
...
string3 string1 string2
...
string2 string3
... and so on
$ grep -o -F $'string1\nstring2\nstring3' input|sort -u|wc -l
3
$ grep -o -F $'string1\nstring3' input|sort -u|wc -l
2
$ grep -o -F $'string1\nstring2\nfoo' input|sort -u|wc -l
2
One shortcoming with this solution (failing to meet the "partial matches should be OK" requirement) is that grep doesn't detect overlapping matches. For example, although the text abcd matches both abc and bcd, grep finds only one of them:
$ grep -o -F $'abc\nbcd' <<< abcd
abc
$ grep -o -F $'bcd\nabc' <<< abcd
abc
Note that this approach/solution works only for fixed strings. It cannot be extended for regexes, because a single regex can match multiple different strings and we cannot track which match corresponds to which regex. The best you can do is store the matches in a temporary file, and then run grep multiple times using one regex at a time.
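For instance, a hedged sketch of that temporary-file fallback (regexps.txt holding one regex per line is my assumption, not part of the original question):
# collect every matched fragment once, then verify each regex against the collected matches
tmpfile=$(mktemp)
grep -E -o -f regexps.txt input > "$tmpfile"
status=0
while IFS= read -r regexp; do
grep -E -q "$regexp" "$tmpfile" || { status=1; break; }
done < regexps.txt
rm -f "$tmpfile"
exit "$status"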
The solution implemented as a bash script:
matchall:
#!/usr/bin/env bash
if [ $# -lt 2 ]
then
echo "Usage: $(basename "$0") input_file string1 [string2 ...]"
exit 1
fi
function find_all_matches()
(
infile="$1"
shift
IFS=$'\n'
newline_separated_list_of_strings="$*"
grep -o -F "$newline_separated_list_of_strings" "$infile"
)
string_count=$(($# - 1))
matched_string_count=$(find_all_matches "$@"|sort -u|wc -l)
if [ "$matched_string_count" -eq "$string_count" ]
then
echo "ALL strings matched"
exit 0
else
echo "Some strings DID NOT match"
exit 1
fi
Demonstration:
$ ./matchall
Usage: matchall input_file string1 [string2 ...]
$ ./matchall input string1 string2 string3
ALL strings matched
$ ./matchall input string1 string2
ALL strings matched
$ ./matchall input string1 string2 foo
Some strings DID NOT match
The easiest way for me to check if the file has all three patterns is to get only matched patterns, output only unique parts and count lines.
Then you will be able to check it with a simple Test condition: test 3 -eq $grep_lines.
grep_lines=$(grep -Eo 'string1|string2|string3' file | sort -u | wc -l)
Regarding your second question, I don't think it's possible to stop reading the file as soon as more than one pattern is found. I've read the man page for grep and there are no options that could help you with that. You can only stop reading lines after a specific one with the option grep -m [number], which happens regardless of which patterns matched.
Pretty sure that a custom function is needed for that purpose.
It's an interesting problem, and there's nothing obvious in the grep man page to suggest an easy answer. There might be an insane regex that would do it, but it may be clearer with a straightforward chain of greps, even though that ends up scanning the file n times. At least the -q option has it bail at the first match each time, and the && will shortcut evaluation if one of the strings is not found.
$grep -Fq string1 t && grep -Fq string2 t && grep -Fq string3 t
$echo $?
0
$grep -Fq string1 t && grep -Fq blah t && grep -Fq string3 t
$echo $?
1
Perhaps with gnu sed
cat match_word.sh
sed -z '
/\b'"$2"'/!bA
/\b'"$3"'/!bA
/\b'"$4"'/!bA
/\b'"$5"'/!bA
s/.*/0\n/
q
:A
s/.*/1\n/
' "$1"
and you call it like that :
./match_word.sh infile string1 string2 string3
returns 0 if all matches are found, else 1
here you can look for 4 strings
if you want more, you can add lines like
/\b'"$x"'/!bA
Just for "solutions completeness", you can use a different tool and avoid multiple greps and awk/sed or big (and probably slow) shell loops; Such a tool is agrep.
agrep is actually a kind of egrep that also supports an and operation between patterns, using ; as a pattern separator.
Like egrep and like most of the well known tools, agrep is a tool that operates on records/lines and thus we still need a way to treat the whole file as a single record.
Moreover agrep provides a -d option to set your custom record delimiter.
Some tests:
$ cat file6
str4
str1
str2
str3
str1 str2
str1 str2 str3
str3 str1 str2
str2 str3
$ agrep -d '$$\n' 'str3;str2;str1;str4' file6;echo $?
str4
str1
str2
str3
str1 str2
str1 str2 str3
str3 str1 str2
str2 str3
0
$ agrep -d '$$\n' 'str3;str2;str1;str4;str5' file6;echo $?
1
$ agrep -p 'str3;str2;str1' file6 #-p prints lines containing all three patterns in any position
str1 str2 str3
str3 str1 str2
No tool is perfect, and agrep also has some limitations; you can't use a regex/pattern longer than 32 chars, and some options are not available when used with regexps - all of these are explained in the agrep man page.
Ignoring the "Is it possible to do it without ... or use a tool like awk or python?" requirement, you can do it with a Perl script:
(Use an appropriate shebang for your system or something like /bin/env perl)
#!/usr/bin/perl
use Getopt::Std; # option parsing
my %opts;
my $filename;
my @patterns;
getopts('rf:',\%opts); # Allowing -f <filename> and -r to enable regex processing
if ($opts{'f'}) { # if -f is given
$filename = $opts{'f'};
@patterns = @ARGV[0 .. $#ARGV]; # Use everything else as patterns
} else { # Otherwise
$filename = $ARGV[0]; # First parameter is filename
@patterns = @ARGV[1 .. $#ARGV]; # Rest is patterns
}
my $use_re= $opts{'r'}; # Flag on whether patterns are regex or not
open(INF,'<',$filename) or die("Can't open input file '$filename'");
while (my $line = <INF>) {
my @removal_list = (); # List of stuff that matched that we don't want to check again
for (my $i=0;$i <= $#patterns;$i++) {
my $pattern = $patterns[$i];
if (($use_re&& $line =~ /$pattern/) || # regex match
(!$use_re&& index($line,$pattern) >= 0)) { # or string search
push(@removal_list,$i); # Mark to be removed
}
}
# Now remove everything we found this time
# We need to work backwards to keep us from messing
# with the list while we're busy
for (my $i=$#removal_list;$i >= 0;$i--) {
splice(@patterns,$removal_list[$i],1);
}
if (scalar(@patterns) == 0) { # If we don't need to match anything anymore
close(INF) or warn("Error closing '$filename'");
exit(0); # We found everything
}
}
# End of file
close(INF) or die("Error closing '$filename'");
exit(1); # If we reach this, we haven't matched everything
Saved as matcher.pl, this will search for plain text strings:
./matcher filename string1 string2 string3 'complex string'
This will search for regular expressions:
./matcher -r filename regex1 'regex2' 'regex4'
(The filename can be given with -f instead):
./matcher -f filename -r string1 string2 string3 'complex string'
It is limited to single line matching patterns (due to dealing with the file linewise).
The performance, when calling for lots of files from a shell script, is slower than awk (But search patterns can contain spaces, unlike the ones passed space-separated in -v to awk). If converted to a function and called from Perl code (with a file containing a list of files to search), it should be much faster than most awk implementations. (When called on several smallish files, the perl startup time (parsing, etc of the script) dominates the timing)
It can be sped up significantly by hardcoding whether regular expressions are used or not, at the cost of flexibility. (See my benchmarks here to see what effect removing Getopt::Std has)
perl -lne '%m = (%m, map {$_ => 1} m!\b(string1|string2|string3)\b!g); END { print scalar keys %m == 3 ? "Match": "No Match"}' file
In python, using the fileinput module allows the files to be specified on the command line, or the text to be read line by line from stdin. You could hard-code the strings into a python list.
# Strings to match, must be valid regular expression patterns
# or be escaped when compiled into regex below.
strings = (
r'string1',
r'string2',
r'string3',
)
or read the strings from another file
import re
from fileinput import input, filename, nextfile, isfirstline
for line in input():
if isfirstline():
regexs = map(re.compile, strings) # new file, reload all strings
# keep only strings that have not been seen in this file
regexs = [rx for rx in regexs if not rx.match(line)]
if not regexs: # found all strings
print filename()
nextfile()
Assuming all your strings to check are in a file strings.txt, and the file you want to check in is input.txt, the following one liner will do :
Updated the answer based on comments :
$ diff <( sort -u strings.txt ) <( grep -o -f strings.txt input.txt | sort -u )
Explanation :
Use grep's -o option to match only the strings you are interested in. This gives all the strings that are present in the file input.txt. Then use diff to get the strings that are not found. If all the strings were found, the result would be nothing. Or, just check the exit code of diff.
What it does not do :
Exit as soon as all matches are found.
Extendible to regexes.
Overlapping matches.
What it does do :
Find all matches.
Single call to grep.
Does not use awk or python.
Many of these answers are fine as far as they go.
But if performance is an issue -- certainly possible if the input is large and you have many thousands of patterns -- then you'll get a large speedup using a tool like lex or flex that generates a true deterministic finite automaton as a recognizer rather than calling a regex interpreter once per pattern.
The finite automaton will execute a few machine instructions per input character regardless of the number of patterns.
A no-frills flex solution:
%{
void match(int);
%}
%option noyywrap
%%
"abc" match(0);
"ABC" match(1);
[0-9]+ match(2);
/* Continue adding regex and exact string patterns... */
[ \t\n] /* Do nothing with whitespace. */
. /* Do nothing with unknown characters. */
%%
// Total number of patterns.
#define N_PATTERNS 3
int n_matches = 0;
int counts[10000];
void match(int n) {
if (counts[n]++ == 0 && ++n_matches == N_PATTERNS) {
printf("All matched!\n");
exit(0);
}
}
int main(void) {
yyin = stdin;
yylex();
printf("Only matched %d patterns.\n", n_matches);
return 1;
}
A down side is that you'd have to build this for every given set of patterns. That's not too bad:
flex matcher.y
gcc -O lex.yy.c -o matcher
Now run it:
./matcher < input.txt
The following python script should do the trick. It kind of does call the equivalent of grep (re.search) multiple times for each line -- i.e. it searches for each pattern on each line, but since you are not forking out a process each time, it should be much more efficient. Also, it removes the patterns which have already been found and stops when all of them have been found.
#!/usr/bin/env python
import re
# the file to search
filename = '/path/to/your/file.txt'
# list of patterns -- can be read from a file or command line
# depending on the count
patterns = [r'py.*$', r'\s+open\s+', r'^import\s+']
patterns = map(re.compile, patterns)
with open(filename) as f:
for line in f:
# search for pattern matches
results = map(lambda x: x.search(line), patterns)
# remove the patterns that did match
results = zip(results, patterns)
results = filter(lambda x: x[0] == None, results)
patterns = map(lambda x: x[1], results)
# stop if no more patterns are left
if len(patterns) == 0:
break
# print the patterns which were not found
for p in patterns:
print p.pattern
You can add a separate check for plain strings (string in line) if you are dealing with plain (non-regex) strings -- will be slightly more efficient.
Does that solve your problem?
One more Perl variant - whenever all given strings match, even if the file is only read half through, the processing completes and just prints the results
> perl -lne ' /\b(string1|string2|string3)\b/ and $m{$1}++; eof if keys %m == 3; END { print keys %m == 3 ? "Match": "No Match"}' all_match.txt
Match
> perl -lne ' /\b(string1|string2|stringx)\b/ and $m{$1}++; eof if keys %m == 3; END { print keys %m == 3 ? "Match": "No Match"}' all_match.txt
No Match
First delete the line separators, and then use normal grep multiple times, once per pattern, as below.
Example: Let the file content be as below
PAT1
PAT2
PAT3
something
somethingelse
cat file | tr -d "\n" | grep "PAT1" | grep "PAT2" | grep -c "PAT3"
For plain speed, with no external tool limitations, and no regexes, this (crude) C version does a decent job. (Possibly Linux only, although it should work on all Unix-like systems with mmap)
#include <sys/mman.h>
#include <sys/stat.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
/* https://stackoverflow.com/a/8584708/1837991 */
static inline char *sstrstr(char *haystack, char *needle, size_t length)
{
size_t needle_length = strlen(needle);
size_t i;
for (i = 0; i < length; i++) {
if (i + needle_length > length) {
return NULL;
}
if (strncmp(&haystack[i], needle, needle_length) == 0) {
return &haystack[i];
}
}
return NULL;
}
int matcher(char * filename, char ** strings, unsigned int str_count)
{
int fd;
struct stat sb;
char *addr;
unsigned int i = 0; /* Used to keep us from running off the end of strings into SIGSEGV */
fd = open(filename, O_RDONLY);
if (fd == -1) {
fprintf(stderr,"Error '%s' with open on '%s'\n",strerror(errno),filename);
return 2;
}
if (fstat(fd, &sb) == -1) { /* To obtain file size */
fprintf(stderr,"Error '%s' with fstat on '%s'\n",strerror(errno),filename);
close(fd);
return 2;
}
if (sb.st_size <= 0) { /* zero byte file */
close(fd);
return 1; /* 0 byte files don't match anything */
}
/* mmap the file. */
addr = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (addr == MAP_FAILED) {
fprintf(stderr,"Error '%s' with mmap on '%s'\n",strerror(errno),filename);
close(fd);
return 2;
}
while (i++ < str_count) {
char * found = sstrstr(addr,strings[0],sb.st_size);
if (found == NULL) { /* If we haven't found this string, we can't find all of them */
munmap(addr, sb.st_size);
close(fd);
return 1; /* so give the user an error */
}
strings++;
}
munmap(addr, sb.st_size);
close(fd);
return 0; /* if we get here, we found everything */
}
int main(int argc, char *argv[])
{
char *filename;
char **strings;
unsigned int str_count;
if (argc < 3) { /* Lets count parameters at least... */
fprintf(stderr,"%i is not enough parameters!\n",argc);
return 2;
}
filename = argv[1]; /* First parameter is filename */
strings = argv + 2; /* Search strings start from 3rd parameter */
str_count = argc - 2; /* strings are two ($0 and filename) less than argc */
return matcher(filename,strings,str_count);
}
Compile it with:
gcc matcher.c -o matcher
Run it with:
./matcher filename needle1 needle2 needle3
Credits:
uses sstrstr
File handling mostly stolen from the mmap man page
Notes:
It will scan through the parts of the file preceding the matched strings multiple times - it will only open the file once though.
The entire file might end up loaded into memory, especially if a string doesn't match; the OS needs to decide that
regex support can probably be added by using the POSIX regex library (performance would likely be slightly better than grep - it should be based on the same library and you would gain reduced overhead from only opening the file once for searching for multiple regexes)
Files containing nulls should work, search strings with them not though...
All characters other than null should be searchable (\r, \n, etc)
I didn't see a simple counter among answers, so here is a counter oriented solution using awk that stops as soon as all matches are satisfied:
/string1/ { a = 1 }
/string2/ { b = 1 }
/string3/ { c = 1 }
{
if (c + a + b == 3) {
print "Found!";
exit;
}
}
A generic script
to expand usage through shell arguments:
#! /bin/sh
awk -v vars="$*" -v argc=$# '
BEGIN { split(vars, args); }
{
for (arg in args) {
if (!temp[arg] && $0 ~ args[arg]) {
inc++;
temp[arg] = 1;
}
}
if (inc == argc) {
print "Found!";
exit;
}
}
END { exit 1; }
' filename
Usage (in which you can pass Regular Expressions):
./script "str1?" "(wo)?men" str3
or to apply a string of patterns:
./script "str1? (wo)?men str3"
$ cat allstringsfile | tr '\n' ' ' | awk -f awkpattern1
Where allstringsfile is your text file, as in the original question.
awkpattern1 contains the string patterns, with && condition:
$ cat awkpattern1
/string1/ && /string2/ && /string3/

combining multiple grep searches and making my script more efficient

I have a file called Type1.txt, that looks like this:
$ cat Type1.txt
ID.580.G3C0
TTTTTTTTTTT
ID.580.G3C8
ATTATATC-AAA
ID.580.GXC16
ATTATTTC-ACG-TTTTTCCTA
ID.694.G9C3
ATTATATC-ACG-AAATCCTA
ID.694.G9C3
etc...
I want to write a bash script to count the instances of each ID and export it into another file that provides a summary, something like this:
ID.580 = 3
ID.694 = 1
etc...
So far the script is messy and unusable.
For the above I have the following:
#!/bin/bash
for Count in `grep -c "ID.580" Type1.txt`; do
echo $Count=ID.580
done > Result.txt #Allows to count only for that single ID.
I have over a thousand ID.XXX, making this code unusable since it's not plausible to add individual ID.XXX for each search. Thank you for the help!
Shell
The code below uses the standard UNIX utilities, and does not assume that the second part of the ID is exactly 3 characters, but will find ID.1.123123123 and ID.1234.123123 and properly take only the first two dot-delimited parts. As it is:
grep '^ID\.[0-9]' Type1.txt | cut -d . -f 1-2 | sort \
| uniq -c | awk '{ print $2" = "$1 }'
grep filters only lines beginning with ID. followed by 1 digit (at least)
cut uses . as the field delimiter, outputting only fields 1 and 2, thus removing
everything after and including the second . on the line.
sort sorts the lines for uniq to work
uniq prints each line from its input prefixed with a count
awk part reverses these fields and prints them separated with =.
If the first part of the ID can contain letters too, change the [0-9] at the end of the regular expression to [0-9A-Z], for example.
The pipeline outputs
ID.580 = 3
ID.694 = 2
Python
As Python is popular among biologists, you might want to hone your python skills instead:
from collections import Counter
counter = Counter()
with open('Type1.txt') as f:
for line in f:
if line.startswith('ID.'):
top_id = '.'.join(line.split('.', 2)[:2])
counter[top_id] += 1
for top_id, count in sorted(counter.items()):
print("%s = %d" % (top_id, count))
The results are exactly identical.
grep '^ID.[0-9][0-9][0-9]' input_file | cut -c1-6 | sort | uniq -c
works?
TL;DR
Given your particular corpus and grouping strategy, there's more than one way to get the results you need. Here are two alternative solutions, one in awk, and one in Ruby.
GNU awk
One way is to use GNU awk to perform the following steps:
match just the ID lines
split matching input lines into fields
select and print the fields you need
sort the lines in the filtered result
count the adjacent duplicates
perform any specialized formatting on the result
For example:
$ awk '/^ID/ {split($0, a, "."); print a[1] "." a[2]}' /tmp/foo |
sort | uniq --count | awk '{print $2 " = " $1}'
ID.580 = 3
ID.694 = 2
With the corpus you provided in your question, this takes an average of 8 ms on my system. A larger corpus will take longer, of course, but unless you have a really huge data set this should be fast enough for most purposes.
Ruby
Ruby offers what I consider a more elegant solution, but is in fact slower. The idea here is to store the relevant portion of your IDs as hash keys, and increment a counter each time you encounter a given ID. For example, consider this Ruby one-liner:
$ ruby -ne 'BEGIN { id = Hash.new(0) }
id[$&] += 1 if /\AID\.\d+/
END { id.each_pair do |k,v| puts "#{k} = #{v}" end }' /tmp/foo
ID.580 = 3
ID.694 = 2
This solution takes around 45 ms to process the same corpus, so I wouldn't recommend it over the awk pipeline just for transforming output. The main advantage to doing it this way is that you have an actual data structure (e.g. a Hash object) that you could manipulate in a more full-featured program.
Here is awk one liner:
$ awk -F. '$1=="ID"{a[$2,$3]++}END{for (i in a) {split(i,ind,SUBSEP); r[ind[1]]++}for (i in r) print "ID."i" = "r[i]}' file
ID.694 = 1
ID.580 = 3
And here is a pure bash solution:
#!/bin/bash
while IFS=. read -r pre id code rest
do
[[ $pre == ID ]] || continue
[[ ${a[$id]} =~ \."$code"\. ]] || {
a[$id]="${a[$id]}.$code."
((count[$id]++));
}
done < file
for i in "${!count[#]}"
do
echo "ID.$i = ${count[$i]}"
done
$ ./script.sh
ID.580 = 3
ID.694 = 1
awk might work too...
awk '/ID.580/{x++}END{print x}' test.txt
You can put this in a for loop
for i in ID.580 ID.694
do
awk '/'$i'/{x++}END{print x}' test.txt
done

Deleting characters from a column if they appear fewer than 20 times

I have a CSV file with two columns:
cat # c a t
dog # d o g
bat # b a t
To simplify communication, I've used English letters for this example, but I'm dealing with CJK in UTF-8.
I would like to delete any character appearing in the second column, which appears on fewer than 20 lines within the first column (characters could be anything from numbers, letters, to Chinese characters, and punctuation, but not spaces).
For e.g., if "o" appears on 15 lines in the first column, all appearances of "o" are deleted from the second column. If "a" appears on 35 lines in the first column, no change is made.
The first column must not be changed.
I don't need to count multiple appearances of a letter on a single line. For e.g. "robot" has 2 o's, but this detail is not important, only that "robot" has an "o", so that is counted as one line.
How can I delete the characters that appear less than 20 times?
Here is a script using awk. Change the var num to be your frequency cutoff point. I've set it to 1 to show how it works against a small sample file. Note how f is still deleted even though it shows up three times on a single line. Also, passing the same input file twice is not a typo.
awk -v num=1 '
BEGIN { OFS=FS="#" }
FNR==NR{
split($1,a,"")
for (x in a)
if(a[x] != " " && !c[a[x]]++)
l[a[x]]++
delete c
next
}
!flag++{
for (x in l)
if (l[x] <= num)
cclass = cclass x
}
{
gsub("["cclass"]", " " , $2)
}1' ./infile.csv ./infile.csv
Sample Input
$ cat ./infile
fff # f f f
cat # c a t
dog # d o g
bat # b a t
Output
$ ./delchar.sh
fff #
cat # a t
dog #
bat # a t
Perl solution:
#!/usr/bin/perl
use warnings;
use strict;
open my $IN, '<:utf8', $ARGV[0] or die $!;
my %chars;
while (<$IN>) {
chomp;
my @cols = split /#/;
my %linechars;
undef @linechars{ split //, $cols[0] };
$chars{$_}++ for keys %linechars;
}
seek $IN, 0, 0;
my @remove = grep $chars{$_} < 20, keys %chars;
my $remove_reg = '[' . join(q{}, @remove) . ']';
warn $remove_reg;
while (<$IN>) {
my @cols = split /#/;
$cols[1] =~ s/$remove_reg//g;
print join '#', @cols;
}
I am not sure how whitespace should be handled, so you might need to adjust the script.
the answer is:
cut -d " " -f #column $file | sed -e 's/\.//g' -e 's/\,//g' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr
where $file is your text file and $column is the column you need to look for its frequency. It gives you out the list of their frequency
Then you can go on looping over those results which have a count greater than your threshold, and grepping for the whole lines.
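A rough sketch of that follow-up loop (my own illustration; $threshold is assumed to hold your cutoff, e.g. 20):
cut -d " " -f $column $file | sort | uniq -c | sort -nr |
while read count word; do
[ "$count" -gt "$threshold" ] && grep "$word" $file
done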

What is an efficient way to replace list of strings with another list in Unix file?

Suppose I have two lists of strings (list A and list B) with the exact same number of entries, N, in each list, and I want to replace all occurrences of the nth element of A with the nth element of B in a file in Unix (ideally using Bash scripting).
What's the most efficient way to do this?
An inefficient way would be to make N calls to "sed s/stringA/stringB/g".
This will do it in one pass. It reads listA and listB into awk arrays, then for each line of the input, it examines each word and if the word is found in listA, the word is replaced by the corresponding word in listB.
awk '
FILENAME == ARGV[1] { listA[$1] = FNR; next }
FILENAME == ARGV[2] { listB[FNR] = $1; next }
{
for (i = 1; i <= NF; i++) {
if ($i in listA) {
$i = listB[listA[$i]]
}
}
print
}
' listA listB filename > filename.new
mv filename.new filename
I'm assuming the strings in listA do not contain whitespace (awk's default field separator)
Make one call to sed that writes the sed script, and another to use it? If your lists are in files listA and listB, then:
paste -d : listA listB | sed 's/\([^:]*\):\([^:]*\)/s%\1%\2%/' > sed.script
sed -f sed.script files.to.be.mapped.*
I'm making some sweeping assumptions about 'words' not containing either colon or percent symbols, but you can adapt around that. Some versions of sed have upper bounds on the number of commands that can be specified; if that's a problem because your word lists are big enough, then you may have to split the generated sed script into separate files which are applied - or change to use something without the limit (Perl, for example).
Another item to be aware of is sequence of changes. If you want to swap two words, you need to craft your word lists carefully. In general, if you map (1) wordA to wordB and (2) wordB to wordC, it matters whether the sed script does mapping (1) before or after mapping (2).
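A tiny illustration of that ordering effect, with made-up words:
# both one-liners contain the same two mappings, applied in different orders
echo foo | sed -e 's/foo/bar/' -e 's/bar/baz/' # prints "baz": foo was rewritten twice
echo foo | sed -e 's/bar/baz/' -e 's/foo/bar/' # prints "bar": only the intended mapping fired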
The script shown is not careful about word boundaries; you can make it careful about them in various ways, depending on the version of sed you are using and your criteria for what constitutes a word.
I needed to do something similar, and I wound up generating sed commands based on a map file:
$ cat file.map
abc => 123
def => 456
ghi => 789
$ cat stuff.txt
abc jdy kdt
kdb def gbk
qng pbf ghi
non non non
try one abc
$ sed `cat file.map | awk '{print "-e s/"$1"/"$3"/"}'`<<<"`cat stuff.txt`"
123 jdy kdt
kdb 456 gbk
qng pbf 789
non non non
try one 123
Make sure your shell supports as many parameters to sed as you have in your map.
This is fairly straightforward with Tcl:
set fA [open listA r]
set fB [open listB r]
set fin [open input.file r]
set fout [open output.file w]
# read listA and listB and create the mapping of corresponding lines
while {[gets $fA strA] != -1} {
set strB [gets $fB]
lappend map $strA $strB
}
# apply the mapping to the input file
puts $fout [string map $map [read $fin]]
# if the file is large, do it line by line instead
#while {[gets $fin line] != -1} {
# puts $fout [string map $map $line]
#}
close $fA
close $fB
close $fin
close $fout
file rename output.file input.file
You can do this in bash. Get your lists into arrays.
listA=(a b c)
listB=(d e f)
data=$(<file)
echo "${data//${listA[2]}/${listB[2]}}" #change the 3rd element. Redirect to file where necessary
