I have a file (file.txt) with a list of words:
apple
banana
cherry
orange
pineapples
I have a csv file (data.csv) that contains lots of data:
1,"tasty apples",3,5
23,"iphone app",5,12
1,"sour grapes",3,5
23,"banana apple smoothie",5,12
1,"cherries and orange shortage",3,5
23,"apple iphone orange cover",5,12
3,"pineapple cherry bubble gum",13,5
5,"pineapples are best frozen",22,33
I want to append the matches from file.txt to each row, like this (output.csv):
1,"tasty apples",3,5,""
23,"iphone app",5,12,""
1,"sour grapes",3,5,""
23,"banana apple smoothie",5,12,"apple+banana"
1,"cherries and orange shortage",3,5,"orange"
23,"apple iphone orange cover",5,12,"apple+orange"
3,"pineapple cherry bubble gum",13,5,"cherry"
5,"pineapples are best frozen",22,33,"pineapples"
I can do this with grep, but to do so I have to use a while loop with if statements to process the text files.
The problem is that file.txt has about 500 lines and data.csv has 330,000 lines. My script would work, but it might take days to complete.
Is there a more efficient way to do this than my method?
Perl to the rescue!
#!/usr/bin/perl
use warnings;
use strict;

use Text::CSV_XS qw{ csv };

open my $f1, '<', 'file.txt' or die $!;
my @fruits;
chomp, push @fruits, $_ while <$f1>;

my %order;
@order{@fruits} = 0 .. $#fruits;
my $regex = join '|', sort { length $b <=> length $a } @fruits;

csv(
    in    => 'data.csv',
    eol   => "\n",
    on_in => sub {
        my @matches;
        push @matches, $1 while $_[1][1] =~ /\b($regex)\b/g;
        push @{ $_[1] }, join '+',
                         sort { $order{$a} <=> $order{$b} }
                         @matches;
    },
);
Unfortunately, Text::CSV_XS can't quote the last field if it doesn't contain a special character (short of quoting all the fields). If file.txt doesn't contain double quotes or commas, though, you can add them easily:
perl ... | sed 's/,\([^,"]*\)$/,"\1"/'
Is there a reason you want that last field quoted? The "+" has no special meaning in CSV, so does not need quotation and neither does an empty field.
Text::CSV_XS does support quotation of empty fields or quotation of all fields, but not yet quotation of all non-numeric fields.
Based on choroba's answer (which allows the last field to be "apple+apple+orange"; the OP doesn't make clear whether duplicates are wanted), I'd write it like this:
use 5.14.1;
use warnings;
use Text::CSV_XS qw( csv );
use Data::Peek;

chomp (my @fruits = do { local @ARGV = "file.txt"; <> });
my %order;
@order{@fruits} = 0 .. $#fruits;
my $regex = join "|", sort { length $b <=> length $a } @fruits;

csv (
    in          => "data.csv",
    eol         => "\n",
    quote_empty => 1,
    on_in       => sub {
        push @{$_[1]}, join "+" =>
            sort { $order{$a} <=> $order{$b} }
            keys %{{map { $_ => 1 }
                ($_[1][1] =~ m/\b($regex)\b/g)}};
    },
);
Let's say I have a set of 100,000 strings, each one around 20-50 bytes on average.
Then let's say I have another set of 100,000,000 strings, each one also around 20-50 bytes on average.
I'd like to go through all 100,000,000 strings from the second set and check whether any of the strings from the first set exists in each string from the second set.
Example: string from first set: "abc", string from second set: "123abc123" = match!
I've tried using Perl's index(), but it's not fast enough. Is there a better way to do this type of matching?
I found Algorithm::AhoCorasick::XS on CPAN, which implements the classic, very efficient multiple-string-search Aho-Corasick algorithm (the same one used by grep -F), and it seems to be reasonably fast (about half the speed of an equivalent grep invocation):
Example script:
#!/usr/bin/env perl
use warnings;
use strict;
use autodie;
use feature qw/say/;
use Algorithm::AhoCorasick::XS;

open my $set1, "<", "set1.txt";
my @needles = <$set1>;
chomp @needles;
my $search = Algorithm::AhoCorasick::XS->new(\@needles);

open my $set2, "<", "set2.txt";
while (<$set2>) {
    chomp;
    say if defined $search->first_match($_);
}
and using it (with randomly-generated test files):
$ wc -l set1.txt set2.txt
10000 set1.txt
500000 set2.txt
510000 total
$ time perl test.pl | wc -l
458414
real 0m0.403s
user 0m0.359s
sys 0m0.031s
$ time grep -Ff set1.txt set2.txt | wc -l
458414
real 0m0.199s
user 0m0.188s
sys 0m0.031s
You should use a regex alternation, like:
my @string = qw/abc def ghi/;
my $matcher = qr/@{[join '|', map quotemeta, sort @string]}/;
This should be faster than using index. But it can be made faster yet:
Up to a certain limit, depending on both the length and number of the strings, perl will build a trie for efficient matching; see e.g. https://perlmonks.org/?node_id=670558. You will want to experiment with how many strings you can include in a single regex to generate an array of regexes. Then combine those separate regexes into a single one (untested):
my @search_strings = ...;
my @matchers;
my $string_limit = 3000; # a guess on my part
my @strings = sort @search_strings;
while (my @subset = splice @strings, 0, $string_limit) {
    push @matchers, qr/^.*?@{[join '|', map quotemeta, sort @subset]}/s;
}
my $matcher = '(?:' . join('|', map "(??{\$matchers[$_]})", 0..$#matchers) . ')';
$matcher = do { use re 'eval'; qr/$matcher/ };
/$matcher/ and print "line $. matched: $_" while <>;
The (??{...}) construct is needed to join the separate regexes; without it, the subregexes are all just interpolated and the joined regex compiled all together, removing the trie optimization. Each subregex starts with ^.*? so it searches the entire string; without that, the joined regex would have to invoke each subregex separately for each position in the string.
Using contrived data, I'm seeing about 3000 strings searched per second with this approach in a not very fast vm. Using the naive regex is less than 50 strings per second. Using grep, as suggested in a comment by Shawn, is faster (about 4200 strings per second for me) but gives you less control if you want to do things like identify which strings matched or at what positions.
You may want to have a look at https://github.com/leendo/hello-world.
Its parallel processing makes it really fast. Either type in all search terms individually or as || conjunctions, or (better) adapt it to run the second set programmatically in one go.
Here is an idea: you could partition the dictionary into lists of words that share the same 2- or 3-letter prefix. You would then iterate over the large set and, for each position in each string, extract the prefix and try to match the strings that have that prefix.
You would use a hashtable to store the lists, with O(1) lookup time.
If some words in the dictionary are shorter than the prefix length, you would have to special-case short words.
Making prefixes longer will make the hashtable larger but the lists shorter, improving the test time.
I have no clue if this can be implemented efficiently in Perl.
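Here is a rough Perl sketch of that idea, untested; the file names (dict.txt, corpus.txt) and the 3-letter prefix length are illustrative assumptions:
#!/usr/bin/perl
use strict;
use warnings;

my $PREFIX_LEN = 3;

# Build the hashtable: prefix => list of dictionary words starting with that prefix.
open my $dict, '<', 'dict.txt' or die "dict.txt: $!";
my %by_prefix;
while (my $word = <$dict>) {
    chomp $word;
    next if length($word) < $PREFIX_LEN;   # short words would need special-casing
    push @{ $by_prefix{ substr($word, 0, $PREFIX_LEN) } }, $word;
}
close $dict;

# Scan the large set: at each position, only test the words sharing that prefix.
open my $corpus, '<', 'corpus.txt' or die "corpus.txt: $!";
while (my $line = <$corpus>) {
    chomp $line;
    POS: for my $i (0 .. length($line) - $PREFIX_LEN) {
        my $candidates = $by_prefix{ substr($line, $i, $PREFIX_LEN) } or next;
        for my $word (@$candidates) {
            if (substr($line, $i, length $word) eq $word) {
                print "$.: $line\n";   # report the matching line once
                last POS;
            }
        }
    }
}
close $corpus;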
The task is quite simple and is probably performed on a daily basis around the globe.
Please see the following code snippet for one of many possible implementations.
use strict;
use warnings;
use feature 'say';

use Getopt::Long qw(GetOptions);
use Pod::Usage;

my %opt;
my $re;

GetOptions(
    'sample|s=s' => \$opt{sample},
    'debug|d'    => \$opt{debug},
    'help|h'     => \$opt{help},
    'man|m'      => \$opt{man}
) or pod2usage(2);

pod2usage(1) if $opt{help};
pod2usage(-verbose => 2) if $opt{man};
pod2usage("$0: No files given.") if ((@ARGV == 0) && (-t STDIN));

$re = read_re($opt{sample});
say "DEBUG: search pattern is ($re)" if $opt{debug};

find_in_file($re);

sub find_in_file {
    my $search = shift;

    while (<>) {
        chomp;
        next unless /$search/;
        say;
    }
}

sub read_re {
    my $filename = shift;

    open my $fh, '<', $filename
        or die "Couldn't open $filename";
    my @data = <$fh>;
    close $fh;

    chomp @data;

    my $re = join('|', @data);

    return $re;
}
__END__
=head1 NAME
file_in_file.pl - search strings of one file in other
=head1 SYNOPSIS
file_in_file.pl [options] -s sample.txt dbfile.txt
Options:
-s,--sample search pattern file
-d,--debug debug flag
-h,--help brief help message
-m,--man full documentation
=head1 OPTIONS
=over 4
=item B<-s,--sample>
Search pattern file
=item B<-d,--debug>
Print debug information.
=item B<-h,--help>
Prints a brief help message and exits.
=item B<-m,--man>
Prints the manual page and exits.
=back
B<This program> searches for patterns from the B<sample> file in B<dbfile>
=cut
OS: Windows Server 2012, so I don't have access to Unix utilities.
ActiveState Perl 5.16. Sorry, I cannot upgrade the OS or Perl; I'm stuck with it.
I did a Google search and read about 10 pages of results; I found similar problems, but not what I'm looking for.
I then did 3 searches here and found similar issues with SQL, R, and XSLT, but not what I'm looking for.
I'm actually not sure where to start, so I don't even have code yet.
I'd like to combine consecutive page numbers into a page range. Input will be a series of numbers in an array.
Input as an array of numbers: my @a = (1,2,5)
Output as a string: 1-2, 5
Input ex: (1,2,3,5,7)
Output ex: 1-3, 5, 7
Input ex: (100,101,102,103,115,120,121)
Output ex: 100-103,115,120-121
Thank you for your help!
This is the only code I have so far.
sub procpages_old
# $aref = array ref to list of page numbers.
# $model = used for debugging.
# $zpos = used for debugging only.
{my($aref,$model,$zpos)=@_;
my $procname=(caller(0))[3];
my @arr=@$aref; # Array of page numbers.
my @newarr=();
my $i=0;
my $np1=0; # Page 1 of possible range.
my $np2=0; # Page 2 of possible range.
my $p1=0; # Page number to test.
my $p2=0;
my $newpos=0;
while ($i<$#arr)
{
$np1=$arr[$i];
$np2=getdata($arr[$i+1],'');
$p1=$np1;
$p2=$np2;
while ($p2==($p1+1)) # Consecutive page numbers?
{
$i++;
$p1=$a[$i];
$p2=getdata($a[$i+1],'');
}
$newarr[$newpos]=$np1.'-'.$p2;
$newpos++;
# End of loop
$i++;
}
my $pages=join(', ',@arr);
return $pages;
}
That's called an intspan. Use Set::IntSpan::Fast::XS.
use Set::IntSpan::Fast::XS qw();
my $s = Set::IntSpan::Fast::XS->new;
$s->add(100,101,102,103,115,120,121);
$s->as_string; # 100-103,115,120-121
This seems to do what you want.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
while (<DATA>) {
    chomp;
    say rangify(split /,/);
}

sub rangify {
    my @nums = @_;

    my @range;

    for (0 .. $#nums) {
        if ($_ == 0 or $nums[$_] != $nums[$_ - 1] + 1) {
            push @range, [ $nums[$_] ];
        } else {
            push @{$range[-1]}, $nums[$_];
        }
    }

    for (@range) {
        if (@$_ == 1) {
            $_ = $_->[0];
        } else {
            $_ = "$_->[0]-$_->[-1]";
        }
    }

    return join ',', @range;
}
__DATA__
1,2,5
1,2,3,5,7
100,101,102,103,115,120,121
The rangify() function builds an array of arrays. It traverses your input list and, if a number is just one more than the previous number, it adds the new number to the second-level array that is currently at the end of the first-level array. If the new number is not sequential, it adds a new second-level array at the end of the first-level array.
Having built this data structure, we walk the first-level array, looking at each of the second-level arrays. If a second-level array contains only one element, then we know it's not a range, so we overwrite the value with the single number from the array. If it contains more than one element, then it's a range and we overwrite the value with the first and last elements separated by a hyphen.
So I managed to adjust this code to work for me. Pass your array of numbers into procpages() which will then call num2range().
######################################################################
# In:
# Out:
sub num2range
{
    local $_ = join ',' => @_;
    s/(?<!\d)(\d+)(?:,((??{ $+ + 1 }))(?!\d))+/$1-$+/g;
    tr/-,/, /;
    return $_;
}

######################################################################
# Concatenate consecutive page numbers in array.
# In: array like (1,2,5,7,99,100,101)
# Out: string like "1-2, 6, 7, 99-101"
sub procpages
{
    my ($aref, $model, $zpos) = @_;
    my $procname = (caller(0))[3];
    my @arr = @$aref;
    my $pages = num2range(@arr);
    $pages =~ s/,/-/g;   # Change comma to dash.
    $pages =~ s/ /, /g;  # Change space to comma and space.
    #$pages =~ s/,/, /g;
    return $pages;
}
You probably have the best solution already with the Set::IntSpan::Fast::XS module, but assuming you want to take the opportunity to learn Perl, here's another Perl-ish way to do it.
use strict;
use warnings;

my @nums = (1,2,5);

my $prev = -999;  # assuming you only use positive values, this will work
my @out = ();
for my $num (@nums) {
    # if we are continuing a sequence, add a hyphen unless we did last time
    if ($num == $prev + 1) {
        push(@out, '-') unless (@out and $out[-1] eq '-');
    }
    else {
        # if we are breaking a sequence (@out ends in '-'), add the previous number first
        if (@out and $out[-1] eq '-') {
            push(@out, $prev);
        }
        # then add the current number
        push(@out, $num);
    }
    # track the previous number
    $prev = $num;
}
# add the final number if necessary to close the sequence
push(@out, $prev) if (@out and $out[-1] eq '-');
# join all values with comma
my $pages = join(',', @out);
# flatten the ',-,' sequence to a single '-'
$pages =~ s/,-,/-/g;
print "$pages\n";
This is not super elegant or short, but is very simple to understand and debug.
I'm trying to read a CSV file in a Bash script. I achieved that successfully using gawk and specifying FPAT like:
gawk -v LOGFILE="${LOGFILE}" 'BEGIN {
FPAT = "([^,]+)|(\"[^\"]+\")"
}
NR == 1{
# doing some logic with header
}
NR >= 2{
# doing some logic with fields
}' <filename>
The problem here is, the file contains data like:
"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"
Now, with this data I'm getting wrong results, because the empty fields are skipped, which gives me the wrong position numbers for the extracted data.
For example, it reports that "7865431234" is at the 3rd position whereas it is at the 6th.
Can anyone suggest the changes to get the correct position of fields?
Your FPAT requires each field to contain at least one character, but you want to recognize empty fields with zero characters. Add an alternative to FPAT that allows zero characters:
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")|" }
{ printf "%d:%d:", NR, NF; for (i = 1; i <= NF; i++) printf("[%s]", $i); print "" }'
Note the extra | at the end of FPAT. The action simply identifies the record number, the number of fields, and surrounds the value of each field with square brackets.
When your data string is provided to that script, the output is:
1:8:["RAM"]["31st street, Bengaluru, India"][][][]["7865431234"][]["VALID"]
That shows the four empty fields quite clearly.
Now all you have to do is deal with:
"Mr ""Manipulator"", the Artisan","29th Street, Delhi, India",,,"",,,"INVALID"
where there are double quotes inside the quoted value. That's not dreadfully hard to manage:
gawk 'BEGIN { FPAT = "([^,]+)|(\"([^\"]|\"\")*\")[^,]*|" }
{ printf "%d:%d:", NR, NF; for (i = 1; i <= NF; i++) printf("%d[%s]", i, $i); print "" }' "$#"
The FPAT says that a field is:
a sequence of non-commas,
or it is a field started with a double quote, containing zero or more instances of either:
a non-quote, or
two double quotes
followed by a double quote and optional non-comma data
or it is empty
Note that the 'optional non-comma data' should be empty, and only appears in malformed CSV data.
Given input data:
"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"
"Mr ""Manipulator"", the Artisan","29th Street, Delhi, India",,,,,,"INVALID"
"Some","","Empty","",Fields "" Wrapped,"",in quotes
"Malformed" CSV,Data,"Note it has data after" a close quote,"and before a comma,",,"INVALID"
This produces:
1:8:1["RAM"]2["31st street, Bengaluru, India"]3[]4[]5[]6["7865431234"]7[]8["VALID"]
2:8:1["Mr ""Manipulator"", the Artisan"]2["29th Street, Delhi, India"]3[]4[]5[]6[]7[]8["INVALID"]
3:7:1["Some"]2[""]3["Empty"]4[""]5[Fields "" Wrapped]6[""]7[in quotes]
4:6:1["Malformed" CSV]2[Data]3["Note it has data after" a close quote]4["and before a comma,"]5[]6["INVALID"]
Note that the field numbers are included as a prefix to the bracketed data (so I tweaked the print format slightly).
About the only format this doesn't handle is one where newlines can be embedded in the data for a field — by the nature of the line-based input, it assumes that no field is split over multiple lines. (It also means it won't properly recognize a field that starts with a double quote and doesn't have a matching double quote before the end of the line. I suppose you could add an alternative to recognize that. It would be better just to make the data right.)
Note the advice in Sobrique's answer to use a tool designed to handle CSV for handling CSV. That is generally a good idea, and the more complex the sets of variations you have to deal with, the better an idea it is. This is close to as complicated a regex as you should consider using. Also note that although RFC 4180 defines a version of CSV formally and rigorously, there are multiple programs (including MS Office) that handle different but related formats.
If you have CSV that needs parsing, then whilst you can usually hack it with a regex, it's far easier to use a parser.
Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new;

open ( my $input, '<', 'flarg.csv' ) or die $!;

while ( my $row = $csv->getline($input) ) {
    if ( $. == 1 ) {
        # do first row stuff;
        print "Header: ", join ",", @$row, "\n";
    }
    else {
        print join "\n", @$row;
    }
}
Or, simpler yet, use Text::ParseWords, which is a core module.
#!/usr/bin/env perl
use strict;
use warnings;
use Text::ParseWords;

while ( my $line = <DATA> ) {
    my @fields = parse_line(',', 1, $line);
    print join "\n", @fields;
}

__DATA__
"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"
I have a dictionary with 250K words (a txt file). For each of those words I would like to come up with a script that will find all possible anagrams (each anagram should also be in the dictionary).
Ideally the script would output in this format:
word1: anagram1,anagram2...
word2: anagram1,anagram2...
Any help would be greatly appreciated.
Inspired by this, I would suggest you create a Trie.
Then, a trie with N levels will contain all possible anagrams (where N is the length of the original word). Now, to get different-sized words, I suggest you simply traverse the trie, i.e. for all 3-letter subwords, just take all strings that are 3 levels deep in the trie.
I'm not really sure of this, because I didn't test this, but it's an interesting challenge, and this suggestion would be how I would start tackling it.
Hope it helps a little =)
It must be anagram week.
I'm going to refer you to an answer I submitted to a prior question: https://stackoverflow.com/a/12811405/128421. It shows how to build a hash for quick searches of words that have common letters.
For your purpose of finding substrings/inner words, you will also want to find the possible inner words. Here's how to quickly locate unique combinations of letters of varying sizes, based on a starting word:
word = 'misses'
word_letters = word.downcase.split('').sort
3.upto(word.length) { |i| puts word_letters.combination(i).map(&:join).uniq }
eim
eis
ems
ess
ims
iss
mss
sss
eims
eiss
emss
esss
imss
isss
msss
eimss
eisss
emsss
imsss
eimsss
Once you have those combinations, split them (or don't do the join) and do look-ups in the hash my previous answer built.
What I tried so far in Perl:
use strict;
use warnings;
use Algorithm::Combinatorics qw(permutations);

die "First argument should be a dict\n" unless $ARGV[0] or die $!;

open my $fh, "<", $ARGV[0] or die $!;
my @arr = <$fh>;

my $h = {};
map { chomp; $h->{lc($_)} = [] } @arr;

foreach my $word (@arr) {
    $word = lc($word);
    my $chars = [ ( $word =~ m/./g ) ];
    my $it = permutations($chars);
    while ( my $p = $it->next ) {
        my $str = join "", @$p;
        if ($str ne $word && exists $h->{$str}) {
            push @{ $h->{$word} }, $str
                unless grep { /^$str$/ } @{ $h->{$word} };
        }
    }
    if (@{ $h->{$word} }) {
        print "$word\n";
        print "\t$_\n" for @{ $h->{$word} };
    }
}

END { close $fh; }
There's probably room for speed improvements, but it works.
I used the French dictionary from the Arch Linux words package.
EXAMPLE
$ perl annagrammes.pl /usr/share/dict/french
abaissent
absentais
abstenais
abaisser
baissera
baserais
rabaisse
(...)
NOTE
To install the Perl module:
cpan -i Algorithm::Combinatorics
h = Hash.new{[]}
array_of_words.each{|w| h[w.downcase.chars.sort].push(w)}
h.values
I have an input file with the following format
ant,1
bat,1
bat,2
cat,4
cat,1
cat,2
dog,4
I need to aggregate the col2 for each key (column1) so the result is:
ant,1
bat,3
cat,7
dog,4
Other considerations:
Assume that the input file is sorted
The input file is pretty large (about 1M rows), so I don't want to use an array and take up memory
Each input line should be processed as we read it, and move to the next line
I need to write the results to an outFile
I need to do this in Perl, but pseudo-code or an algorithm would help just as well.
Thanks!
This is what I came up with... I want to see if it can be written better/more elegantly.
open infile, outFile
prev_line = <infile>;
print_line = $prev_line;
while(<>){
curr_line = $_;
@prev_cols=split(',', $prev_line);
@curr_cols=split(',', $curr_line);
if ( $prev_cols[0] eq $curr_cols[0] ){
$prev_cols[1] += curr_cols[1];
$print_line = "$prev_cols[0],$prev_cols[1]\n";
$print_flag = 0;
}
else{
$print outFile "$print_line";
$print_flag = 1;
$print_line = $curr_line;
}
$prev_line = $curr_line;
}
if($print_flag = 1){
print outFile "$curr_line";
}
else{
print outFile "$print_line";
}
#!/usr/bin/perl
use warnings;
use strict;
use integer;
my %a;
while (<>) {
    my ($animal, $n) = /^\s*(\S+)\s*,\s*(\S+)/;
    $a{$animal} += $n if defined $n;
}
print "$_,$a{$_}\n" for sort keys %a;
This short code affords you the chance to learn Perl's excellent hash facility, as %a. Hashes are central to Perl. One really cannot write fluent Perl without them.
Observe incidentally that the code exercises Perl's interesting autovivification feature. The first time a particular animal is encountered in the input stream, no count exists, so Perl implicitly assumes a pre-existing count of zero. Thus, the += operator does not fail, even though it seems that it should. It just adds to zero in the first instance.
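As a minimal illustration of that zero-default behaviour (the hash and key names here are invented):
use strict;
use warnings;

my %count;               # empty hash, no counts yet
$count{ant} += 1;        # no pre-existing value: Perl treats the missing value as 0, so this stores 1
$count{ant} += 2;        # now 3
print "$count{ant}\n";   # prints 3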
On the other hand, it may happen that not only the number of data but the number of animals is so large that one would not like to store the hash %a. In this case, one can still calculate totals, provided only that the data are sorted by animal in the input, as they are in your example. In this case, something like the following might suit (though regrettably it is not nearly so neat as the above).
#!/usr/bin/perl
use warnings;
use strict;
use integer;

my $last_animal = undef;
my $total_for_the_last_animal = 0;

sub start_new_animal ($$) {
    my $next_animal = shift;
    my $n = shift;
    print "$last_animal,$total_for_the_last_animal\n"
        if defined $last_animal;
    $last_animal = $next_animal;
    $total_for_the_last_animal = $n;
}

while (<>) {
    my ($animal, $n) = /^\s*(\S+)\s*,\s*(\S+)/;
    if (
        defined($n) && defined($animal) && defined($last_animal)
        && $animal eq $last_animal
    ) { $total_for_the_last_animal += $n; }
    else { start_new_animal $animal, $n; }
}
start_new_animal undef, 0;
Use Perl’s awk mode.
-a
turns on autosplit mode when used with a -n or -p. An implicit split command to the @F array is done as the first thing inside the implicit while loop produced by the -n or -p.
perl -ane 'print pop(@F), "\n";'
is equivalent to
while (<>) {
    @F = split(' ');
    print pop(@F), "\n";
}
An alternate delimiter may be specified using -F.
All that’s left for you is to accumulate the sums in a hash and print them.
$ perl -F, -lane '$s{$F[0]} += $F[1];
END { print "$_,$s{$_}" for sort keys %s }' input
Output:
ant,1
bat,3
cat,7
dog,4
It's trivial in Perl. Loop over the file input. Split each input line on the comma. For each key in column one, keep a hash to which you add the value in column two. At the end of the file, print the list of hash keys and their values. It can be done in one line, but that would obfuscate the algorithm.
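A minimal sketch of that description (the input and output file names are placeholders):
#!/usr/bin/perl
use strict;
use warnings;

# Loop over the input, split on comma, total column two per key, then print the totals.
open my $in,  '<', 'input.csv' or die "input.csv: $!";
open my $out, '>', 'outFile'   or die "outFile: $!";

my %total;
while (my $line = <$in>) {
    chomp $line;
    my ($key, $value) = split /,/, $line;
    $total{$key} += $value;
}

print {$out} "$_,$total{$_}\n" for sort keys %total;

close $out;
Only one running total per key is kept in memory, so the 1M-row file is never held in an array.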