Efficient use of Perl hash - performance

I'm using a hash to abbreviate state names
%STATEABBRIVATE = ('ALABAMA' => 'AL',
...);
Some of my input sets already have abbreviated state names. Would it be more efficient to use an if defined $STATEABBRIVATE{$state} or to add another 51 matched pairs 'AL'=>'AL' to the hash?

If you want to verify that the state really exists, using AL => 'AL' might be the easiest way.
To keep your code DRY (Don't Repeat Yourself), you can just
my %STATEABBRIVATE = ( ALABAMA => 'AL',
...
);
my @abbrevs = values %STATEABBRIVATE;
@STATEABBRIVATE{@abbrevs} = @abbrevs;
If you're concerned about performance, the bottleneck is probably somewhere else:
#! /usr/bin/perl
use warnings;
use strict;
use Benchmark qw{ cmpthese };
use Test::More;
my %hash = qw( Alabama AL Alaska AK Arizona AZ Arkansas AR California CA
Colorado CO Connecticut CT Delaware DE Florida FL
Georgia GA Hawaii HI Idaho ID Illinois IL Indiana IN
Iowa IA Kansas KS Kentucky KY Louisiana LA Maine ME
Maryland MD Massachusetts MA Michigan MI Minnesota MN
Mississippi MS Missouri MO Montana MT Nebraska NE
Nevada NV Ohio OH Oklahoma OK Oregon OR Pennsylvania PA
Tennessee TN Texas TX Utah UT Vermont VT Virginia VA
Washington WA Wisconsin WI Wyoming WY );
$hash{'West Virginia'} = 'WV';
$hash{'South Dakota'} = 'SD';
$hash{'South Carolina'} = 'SC';
$hash{'Rhode Island'} = 'RI';
$hash{'North Dakota'} = 'ND';
$hash{'North Carolina'} = 'NC';
$hash{'New York'} = 'NY';
$hash{'New Mexico'} = 'NM';
$hash{'New Jersey'} = 'NJ';
$hash{'New Hampshire'} = 'NH';
my %larger = %hash;
@larger{ values %hash } = values %hash;
sub def {
my $state = shift;
return defined $hash{$state} ? $hash{$state} : $state
}
sub ex {
my $state = shift;
return exists $hash{$state} ? $hash{$state} : $state
}
sub hash {
my $state = shift;
return $larger{$state}
}
is(def($_), ex($_), "def-ex-$_") for keys %larger;
is(def($_), hash($_), "def-hash-$_") for keys %larger;
done_testing();
cmpthese(-1,
{ hash => sub { map hash($_), keys %larger },
ex => sub { map ex($_), keys %larger },
def => sub { map def($_), keys %larger },
});
Results:
        Rate  def   ex hash
def  27307/s   --  -2% -11%
ex   27926/s   2%   --  -9%
hash 30632/s  12%  10%   --

Both if defined $STATEABBRIVATE{$state} and any hash lookups are going to be constant time (i.e. O(1) operations). In fact, defined() probably uses a hash table lookup behind the scenes anyway. So, my prediction is that the difference in performance is going to be negligible, even with large data sets. This is, at best, an educated guess.

Related

Sort hash by values

This is not how I populated my hash. Just for easier reading, here are its contents, keys are on a fixed length string:
my %country_hash = (
"001 Sample Name New Zealand" => "NEW ZEALAND",
"002 Samp2 Nam2 Zimbabwe " => "ZIMBABWE",
"003 SSS NNN Australia " => "AUSTRALIA",
"004 John Sample Philippines" => "PHILIPPINES",
);
I want to get the sorted keys based on values. So my expectation:
"003 SSS NNN Australia "
"001 Sample Name New Zealand"
"004 John Sample Philippines"
"002 Samp2 Nam2 Zimbabwe "
What I did:
foreach my $line( sort {$country_hash{$a} <=> $country_hash{$b} or $a cmp $b} keys %country_hash ){
print "$line\n";
}
I also tried this (I doubted it would sort, but anyway):
my @sorted = sort { $country_hash{$a} <=> $country_hash{$b} } keys %country_hash;
foreach my $line(@sorted){
print "$line\n";
}
Neither of them sorted correctly. I hope someone could help.
If you had used warnings, you would have been told that <=> is the wrong operator; it is used for numeric comparison. Use cmp for string comparison instead. Refer to sort.
use warnings;
use strict;
my %country_hash = (
"001 Sample Name New Zealand" => "NEW ZEALAND",
"002 Samp2 Nam2 Zimbabwe " => "ZIMBABWE",
"003 SSS NNN Australia " => "AUSTRALIA",
"004 John Sample Philippines" => "PHILIPPINES",
);
my @sorted = sort { $country_hash{$a} cmp $country_hash{$b} } keys %country_hash;
foreach my $line(@sorted){
print "$line\n";
}
This prints:
003 SSS NNN Australia
001 Sample Name New Zealand
004 John Sample Philippines
002 Samp2 Nam2 Zimbabwe
This also works (without the extra array):
foreach my $line (sort {$country_hash{$a} cmp $country_hash{$b}} keys %country_hash) {
print "$line\n";
}

How can I separate a full name?

I have to take the right part, clean it, compare it with the middle part, and save it if they are equal.
#!/usr/bin/env ruby
require 'rubygems'
require 'levenshtein'
require 'csv'
# Extending String class for blank? method
class String
def blank?
self.strip.empty?
end
end
# In
lines = CSV.read('entrada.csv')
lines.each do |line|
id = line[0].upcase.strip
left = line[1].upcase.strip
right = line[2].upcase.strip
eduardo = line[2].upcase.split(' ','de')
line[0] = id
line[1] = left
line[2] = right
line[4] = eduardo[0]+eduardo[1]
distance = Levenshtein.distance left, right
line << 99 if (left.blank? or right.blank?)
line << distance unless (left.blank? or right.blank?)
end
# Out
# counter = 0
CSV.open('salida.csv', 'w') do |csv|
lines.each do |line|
# counter = counter + 1 if line[3] <= 3
csv << line
end
end
# p counter
The middle column is the correct one; the right column is the one I should correct.
Some examples:
Eduardo | Abner | Herrera | Herrera -> Eduardo Herrera
Angel | De | Leon -> Angel De Leon
Maira | Angelina | de | Leon -> Maira De Leon
Marquilla | Gutierrez | Petronilda |De | Leon -> Marquilla Petronilda
First order of business is to come up with some rules. Based on your examples, and Spanish naming customs, here's my stab at the rules.
A name has a forename, paternal surname, and optional maternal surname.
A forename can be multiple words.
A surname can be multiple words linked by a de, y, or e.
So ['Marquilla', 'Gutierrez', 'Petronilda', 'De', 'Leon'] should be { forename: 'Marquilla', paternal_surname: 'Gutierrez', maternal_surname: 'Petronilda de Leon' }
To simplify the process, I'd first join any composite surnames into one field. ['Marquilla', 'Gutierrez', 'Petronilda', 'De', 'Leon'] becomes ['Marquilla', 'Gutierrez', 'Petronilda De Leon']. Watch out for cases like ['Angel', 'De', 'Leon'] in which case the surname is probably De Leon.
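The joining step described above might be sketched as follows. This is only one reading of the rules: join_composite_surnames and LINKS are hypothetical names, and treating a particle that directly follows a lone forename as the start of the surname (the Angel De Leon case) is an assumption, not a settled rule.

```ruby
# Hypothetical list of linking particles, taken from the rules above.
LINKS = %w[de y e].freeze

# Hypothetical helper: glues a linking particle to its neighbours so that a
# composite surname ends up in a single field.
def join_composite_surnames(parts)
  joined = []
  glue_next = false
  parts.each do |part|
    if LINKS.include?(part.downcase)
      if joined.length > 1
        # Attach the particle to the preceding surname.
        joined[-1] = "#{joined[-1]} #{part}"
      else
        # Assumption: right after a lone forename the particle starts the surname.
        joined << part
      end
      glue_next = true
    elsif glue_next
      # The word after a particle belongs to the same field.
      joined[-1] = "#{joined[-1]} #{part}"
      glue_next = false
    else
      joined << part
    end
  end
  joined
end

p join_composite_surnames(%w[Marquilla Gutierrez Petronilda De Leon])
# prints ["Marquilla", "Gutierrez", "Petronilda De Leon"]
p join_composite_surnames(%w[Angel De Leon])
# prints ["Angel", "De Leon"]
```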
Once that's done, figuring out which part is which becomes easier.
name = {}
# With fewer than two parts there is nothing to split.
if parts.length < 2
  raise ArgumentError, "cannot split name: #{parts.inspect}"
# The special case of only two parts: forename paternal_surname
elsif parts.length == 2
  name = {
    forename: parts[0],
    paternal_surname: parts[1]
  }
# forename paternal_surname maternal_surname
else
  # The forename can have multiple parts, so work from the
  # end and whatever's left is their forename.
  name[:maternal_surname] = parts.pop
  name[:paternal_surname] = parts.pop
  name[:forename] = parts.join(" ")
end
There's a lot of ambiguity in Spanish naming, so this can only be an educated guess at what their actual name is. You'll probably have to tweak the rules as you learn more about the dataset. In particular, I'm pretty sure the handling of de is not that simple. For example:
One Leocadia Blanco Álvarez, married to a Pedro Pérez Montilla, may be addressed as Leocadia Blanco de Pérez or as Leocadia Blanco Álvarez de Pérez
In that case ['Marquilla', 'Gutierrez', 'Petronilda', 'De', 'Leon'] becomes ['Marquilla', 'Gutierrez', 'Petronilda', 'De Leon'] which is { forename: 'Marquilla', paternal_surname: 'Gutierrez', maternal_surname: 'Petronilda', married_to: 'Leon' }, or 'Marquilla Gutierrez Petronilda', who is married to someone whose paternal surname is Leon.
Good luck.
I would add more columns to the database, like last_name1, last_name2, last_name3, etc, and make them optional (don't put validations on those attributes). Hope that answers your question!

Ruby class with array attribute

I have a non-database-backed class in Ruby:
class User
attr_accessor :countries
end
I want countries to simply be an array of ISO country codes (US, GB, CA, AU, etc) and I don't want to build a separate model to hold each. Is there a magic way to make Ruby understand that :countries is an array and treat it accordingly, or do I need to write the countries and countries= methods?
I tried just setting the countries array with user.countries = ['US'], and I'm getting a NoMethodError.
The type of a variable doesn't matter in Ruby.
attr_accessor just creates getter and setter methods that set and return instance variables; @countries in this case. You can set the instance variable to your array, or use the setter:
class User
attr_accessor :countries
def initialize
@countries = %w[Foo Bar Baz]
# Or...
self.countries = %w[Foo Bar Baz]
end
end
> User.new.countries
=> ["Foo", "Bar", "Baz"]
Personally I prefer using the instance variable instead of self.xxx; it's too easy to forget the self. bit and you end up setting a local variable, leaving the instance variable nil. I also think it's ugly.
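The pitfall mentioned above can be seen in a small sketch. The method names assign_without_self and assign_with_self are made up for this illustration:

```ruby
class User
  attr_accessor :countries

  def assign_without_self
    # Ruby parses this as a local variable assignment, not as a call to the
    # countries= setter, so @countries is left untouched.
    countries = %w[US GB]
    countries
  end

  def assign_with_self
    # The explicit receiver forces a call to the countries= setter.
    self.countries = %w[US GB]
  end
end

user = User.new
user.assign_without_self
p user.countries # prints nil
user.assign_with_self
p user.countries # prints ["US", "GB"]
```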
If the countries won't be changing between instances, why not a constant?
Edit/Clarification
Tadman's point is well-taken, e.g., this diatribe on state. The circumstances under which I don't care about that are limited to small, self-controlled, stand-alone classes. There are inherent risks in making those assumptions; the level of those risks is project-dependent.
Looks like countries should be a constant:
class User
COUNTRIES = %w(
AF AX AL DZ AS AD AO AI AQ AG AR AM AW AU AT AZ BS BH BD BB BY BE BZ BJ BM
BT BO BQ BA BW BV BR IO BN BG BF BI KH CM CA CV KY CF TD CL CN CX CC CO KM
CG CD CK CR CI HR CU CW CY CZ DK DJ DM DO EC EG SV GQ ER EE ET FK FO FJ FI
FR GF PF TF GA GM GE DE GH GI GR GL GD GP GU GT GG GN GW GY HT HM VA HN HK
HU IS IN ID IR IQ IE IM IL IT JM JP JE JO KZ KE KI KP KR KW KG LA LV LB LS
LR LY LI LT LU MO MK MG MW MY MV ML MT MH MQ MR MU YT MX FM MD MC MN ME MS
MA MZ MM NA NR NP NL NC NZ NI NE NG NU NF MP NO OM PK PW PS PA PG PY PE PH
PN PL PT PR QA RE RO RU RW BL SH KN LC MF PM VC WS SM ST SA SN RS SC SL SG
SX SK SI SB SO ZA GS SS ES LK SD SR SJ SZ SE CH SY TW TJ TZ TH TL TG TK TO
TT TN TR TM TC TV UG UA AE GB US UM UY UZ VU VE VN VG VI WF EH YE ZM ZW
).freeze
end
User::COUNTRIES.include? "US" #=> true
freeze prevents modifications:
User::COUNTRIES.delete "US" #=> RuntimeError: can't modify frozen Array
Update
The problem here is that your countries array has to be persisted somehow. You are mentioning has_many so Rails seems to be involved. You can use ActiveRecord's serialize method:
class User < ActiveRecord::Base
serialize :countries
end
This will save the countries attribute to the database as an object and retrieve it as such:
u = User.new
u.countries = ["US", "CA"]
u.save
u = User.last
u.countries
#=> ["US", "CA"]
It's converted to and from YAML internally, so the users table looks like:
mysql> SELECT * FROM users;
+----+-------------------+---------------------+---------------------+
| id | countries | created_at | updated_at |
+----+-------------------+---------------------+---------------------+
| 1 | ---\n- US\n- CA\n | 2013-09-24 18:24:03 | 2013-09-24 18:24:03 |
+----+-------------------+---------------------+---------------------+
1 row in set (0,00 sec)

Parsing a dictionary text file in ruby

I am using ruby to try and parse a text file that has the form...
AAB eel bbc
ABA did eye non pap mom ere bob nun eve pip gig dad nan ana gog aha
mum sis ada ava ewe pop tit gag tat bub pup
eke ele hah huh pep sos tot wow aba ala
bib dud tnt
ABB all see off too ill add lee ass err xii ann fee vii inn egg odd bee dee goo
woo cnn pee fcc tee wee ebb edd gee ott ree vee ell orr rcc att boo cee cii
coo kee moo mss soo doo faa hee icc iss itt kii loo mee nee nuu ogg opp pii
tll upp voo zee
I need to be able to search by the first column, such as "AAB", and then search through all values that are associated with that key. I have tried to import the text file into a hash of arrays but could never get more than the first value to store. I have no preference as to how I search the file, whether that means storing the data in some data structure or just searching the text file every time; I just need to be able to do it. I am at a loss as to how to proceed with this and any help would be greatly appreciated. Thanks
-amc25114
This will read your dictionary file. I'm storing the content in a string, then
turning it into a StringIO object to let me pretend it's a file. You can use
File.readlines to read directly from the file itself:
require 'pp'
require 'stringio'
text = 'AAB eel bbc
ABA did eye non pap mom ere bob nun eve pip gig dad nan ana gog aha
mum sis ada ava ewe pop tit gag tat bub pup
eke ele hah huh pep sos tot wow aba ala
bib dud tnt
ABB all see off too ill add lee ass err xii ann fee vii inn egg odd bee dee goo
woo cnn pee fcc tee wee ebb edd gee ott ree vee ell orr rcc att boo cee cii
coo kee moo mss soo doo faa hee icc iss itt kii loo mee nee nuu ogg opp pii
tll upp voo zee
'
file = StringIO.new(text)
dictionary = Hash[
file.readlines.slice_before(/^\S/).map{ |ary|
key, *values = ary.map(&:strip).join(' ').split(' ')
[key, values]
}
]
dictionary is a hash looking like:
{
"AAB"=>[
"eel", "bbc"
],
"ABA"=>[
"did", "eye", "non", "pap", "mom", "ere", "bob", "nun", "eve", "pip",
"gig", "dad", "nan", "ana", "gog", "aha", "mum", "sis", "ada", "ava",
"ewe", "pop", "tit", "gag", "tat", "bub", "pup", "eke", "ele", "hah",
"huh", "pep", "sos", "tot", "wow", "aba", "ala", "bib", "dud", "tnt"
],
"ABB"=>[
"all", "see", "off", "too", "ill", "add", "lee", "ass", "err", "xii",
"ann", "fee", "vii", "inn", "egg", "odd", "bee", "dee", "goo", "woo",
"cnn", "pee", "fcc", "tee", "wee", "ebb", "edd", "gee", "ott", "ree",
"vee", "ell", "orr", "rcc", "att", "boo", "cee", "cii", "coo", "kee",
"moo", "mss", "soo", "doo", "faa", "hee", "icc", "iss", "itt", "kii",
"loo", "mee", "nee", "nuu", "ogg", "opp", "pii", "tll", "upp", "voo", "zee"
]
}
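To read from a file on disk instead of a StringIO, the same parsing could be wrapped in a small method. This is a sketch: load_dictionary and the file name in the usage comment are hypothetical, and it assumes, as the regex above does, that continuation lines start with whitespace.

```ruby
# Hypothetical wrapper around the parsing shown above.
def load_dictionary(path)
  File.readlines(path)
      .slice_before(/^\S/)                      # a non-indented line starts a new entry
      .map { |lines| lines.join(' ').split }    # flatten each entry into its words
      .reject(&:empty?)                         # skip entries made only of blank lines
      .map { |key, *values| [key, values] }     # first word is the key, the rest are values
      .to_h
end

# Usage (the file name is an assumption):
# dictionary = load_dictionary('dictionary.txt')
# dictionary['AAB'].include?('eel')
```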
You can look up using the keys:
dictionary['AAB']
=> ["eel", "bbc"]
And search inside the array using include?:
dictionary['AAB'].include?('eel')
=> true
dictionary['AAB'].include?('foo')
=> false
class A
def initialize
@h, key = readlines.inject({}) do |m, s|
a = s.split
m[key = a.shift] = [] if s =~ /^[^\s]/
m[key] += a
m
end
end
def lookup k, v # not sure what you really want to do here
p [k, v, (@h[k].index v)]
end
self
end.new.lookup 'ABA', 'wow'
My 2 cents:
file = File.open("/path_to_file_here")
recent_key = ""
results = Hash.new
while (line = file.gets)
key = line[/[A-Z]+/]
recent_key = key if key
line.scan(/[a-z]+/).each do |val|
results[recent_key.to_sym] = [] if !results[recent_key.to_sym]
results[recent_key.to_sym] << val
end
end
puts results
This will give you this output:
{
:AAB=>["eel", "bbc"],
:ABA=>["did", "eye", "non", "pap", "mom", "ere", "bob", "nun", "eve", "pip", "gig", "dad", "nan", "ana", "gog", "aha", "mum", "sis", "ada", "ava", "ewe", "pop", "tit", "gag", "tat", "bub", "pup", "eke", "ele", "hah", "huh", "pep", "sos", "tot", "wow", "aba", "ala", "bib", "dud", "tnt"],
:ABB=>["all", "see", "off", "too", "ill", "add", "lee", "ass", "err", "xii", "ann", "fee", "vii", "inn", "egg", "odd", "bee", "dee", "goo", "woo", "cnn", "pee", "fcc", "tee", "wee", "ebb", "edd", "gee", "ott", "ree", "vee", "ell", "orr", "rcc", "att", "boo", "cee", "cii", "coo", "kee", "moo", "mss", "soo", "doo", "faa", "hee", "icc", "iss", "itt", "kii", "loo", "mee", "nee", "nuu", "ogg", "opp", "pii", "tll", "upp", "voo", "zee"]
}

Household mail merge (code golf)

I wrote some mail merge code the other day and although it works I'm a bit turned off by the code. I'd like to see what it would look like in other languages.
So for the input the routine takes a list of contacts
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Erica,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Marge,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
Ted,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Raoul,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
It will then merge lines with the same address and surname into one record. (Assume the rows are unsorted.) The code should also be flexible enough that fields can be supplied in any order, so it will need to take field indexes as parameters. For a family of two it concatenates both first name fields. For a family of three or more the first name is set to "The" and the last name is set to "surname Family".
Erica and Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
The,Simpson Family,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
My C# implementation of this is:
var source = File.ReadAllLines(@"sample.csv").Select(l => l.Split(','));
var merged = HouseholdMerge(source, 0, 1, new[] {1, 2, 3, 4, 5});
public static IEnumerable<string[]> HouseholdMerge(IEnumerable<string[]> data, int fnIndex, int lnIndex, int[] groupIndexes)
{
Func<string[], string> groupby = fields => String.Join("", fields.Where((f, i) => groupIndexes.Contains(i)));
var groups = data.OrderBy(groupby).GroupBy(groupby);
foreach (var group in groups)
{
string[] result = group.First().ToArray();
if (group.Count() == 2)
{
result[fnIndex] += " and " + group.ElementAt(1)[fnIndex];
}
else if (group.Count() > 2)
{
result[fnIndex] = "The";
result[lnIndex] += " Family";
}
yield return result;
}
}
I don't like how I've had to do the groupby delegate. I'd like if C# had some way to convert a string expression to a delegate. e.g. Func groupby = f => "f[2] + f[3] + f[4] + f[5] + f[1];" I have a feeling something like this can probably be done in Lisp or Python. I look forward to seeing nicer implementation in other languages.
Edit: Where did the community wiki checkbox go? Some mod please fix that.
Ruby — 181 155
Name/surname indexes are set in the code as a and b. Input data is read from ARGF.
a,b=0,1
[*$<].map{|i|i.strip.split ?,}.group_by{|i|i.rotate(a).drop 1}.map{|i,j|k,l,m=j
k[a]+=' and '+l[a]if l
(k[a]='The';k[b]+=' Family')if m
puts k*','}
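For readers who find the golfed version hard to follow, an ungolfed sketch of the same idea might look like this. The names household_merge, fn_idx and ln_idx are chosen for this example, and, like the golfed code, it treats rows as one household when every field except the first name matches.

```ruby
# Ungolfed sketch; rows are arrays of fields, fn_idx/ln_idx are field indexes.
def household_merge(rows, fn_idx, ln_idx)
  # Group on every field except the first name.
  groups = rows.group_by do |row|
    row.each_with_index.reject { |_field, i| i == fn_idx }.map(&:first)
  end

  groups.values.map do |members|
    merged = members.first.dup
    if members.size == 2
      # Family of two: concatenate both first names.
      merged[fn_idx] = members.map { |m| m[fn_idx] }.join(' and ')
    elsif members.size > 2
      # Family of three or more: "The <surname> Family".
      merged[fn_idx] = 'The'
      merged[ln_idx] = "#{merged[ln_idx]} Family"
    end
    merged
  end
end

input = <<~CSV
  Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
  Erica,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
  Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
  Marge,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
  Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
  Ted,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
  Raoul,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
CSV

# The -1 limit keeps empty fields such as the blank second address line.
rows = input.lines.map { |line| line.chomp.split(',', -1) }
household_merge(rows, 0, 1).each { |row| puts row.join(',') }
```

Because Ruby hashes keep insertion order, households come out in the order in which their first member appears in the input.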
Python - not golfed
I'm not sure what the order of the rows should be if the indices are not 0 and 1 for the input file
import csv
from collections import defaultdict
class HouseHold(list):
    def __init__(self, fn_idx, ln_idx):
        self.fn_idx = fn_idx
        self.ln_idx = ln_idx
    def append(self, item):
        self.item = item
        list.append(self, item[self.fn_idx])
    def get_value(self):
        fn_idx = self.fn_idx
        ln_idx = self.ln_idx
        item = self.item
        addr = [j for i,j in enumerate(item) if i not in (fn_idx, ln_idx)]
        if len(self) < 3:
            fn, ln = " and ".join(self), item[ln_idx]
        else:
            fn, ln = "The", item[ln_idx]+" Family"
        return [fn, ln] + addr
def source(fname):
    with open(fname) as in_file:
        for item in csv.reader(in_file):
            yield item
def household_merge(src, fn_idx, ln_idx, groupby):
    res = defaultdict(lambda:HouseHold(fn_idx, ln_idx))
    for item in src:
        key = tuple(item[x] for x in groupby)
        res[key].append(item)
    return res.values()
data = household_merge(source("sample.csv"), 0, 1, [1,2,3,4,5,6,7])
with open("result.csv", "w") as out_file:
    csv.writer(out_file).writerows(item.get_value() for item in data)
Python - 178 chars
import sys
d={}
for x in sys.stdin:F,c,A=x.partition(',');d[A]=d.get(A,[])+[F]
print"".join([" and ".join(v)+c+A,"The"+c+A.replace(c,' Family,',1)][2<len(v)]for A,v in d.items())
Output
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
The,Simpson Family,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
Erica and Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Python 2.6.6 - 287 Characters
This assumes you can hard code a filename (named i). If you want to take input from command line, this goes up ~16 chars.
from itertools import*
for z,g in groupby(sorted([l.split(',')for l in open('i').readlines()],key=lambda x:x[1:]), lambda x:x[2:]):
 l=list(g);r=len(l);k=','.join(z);o=l[0]
 if r>2:print'The,'+o[1],"Family,"+k,
 elif r>1:print o[0],"and",l[1][0]+","+o[1]+","+k,
 else:print','.join(o),
Output
Erica and Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
The,Simpson Family,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
I'm sure this could be improved upon, but it is getting late.
Haskell - 341 321
(Changes as per comments).
Unfortunately Haskell has no standard split function which makes this rather long.
Input to stdin, output on stdout.
import List
import Data.Ord
main=interact$unlines.e.lines
s[]=[]
s(',':x)=s x
s l@(x:y)=let(h,i)=break(==k)l in h:(s i)
t[]=[]
t x=tail x
h=head
m=map
k=','
e l=m(t.(>>=(k:)))$(m c$groupBy g$sortBy(comparing t)$m s l)
c(x:[])=x
c(x:y:[])=(h x++" and "++h y):t x
c x="The":((h$t$h x)++" Family"):(t$t$h x)
g a b=t a==t b
Lua, 434 bytes
x,y=1,2 s,p,r,a=string.gsub,pairs,io.read,{}for j,b,c,d,e,f,g,h,i in r('*a'):gmatch('('..('([^,]*),'):rep(7)..'([^,]*))\n')
do k=s(s(s(j,b,''),c,''),'[,%s]','')for l,m in p(a)do if not m.f and (m[y]:match(c) and m[9]==k) then z=1
if m.d then m[x]="The"m[y]=m[y]..' family'm.f=1 else m[x]=m[x].." and "..b m.d=1 end end end if not z then
a[#a+1]={b,c,d,e,f,g,h,i,k} end z=nil end for k,v in p(a)do v[9]=nil print(table.concat(v,','))end
