I have to take the right-hand part, clean it up, compare it with the middle part, and save it if the two are equal.
#!/usr/bin/env ruby
require 'rubygems'
require 'levenshtein'
require 'csv'
# Extending String class for blank? method
class String
def blank?
self.strip.empty?
end
end
# In
lines = CSV.read('entrada.csv')
lines.each do |line|
id = line[0].upcase.strip
left = line[1].upcase.strip
right = line[2].upcase.strip
eduardo = line[2].upcase.split(' ','de')
line[0] = id
line[1] = left
line[2] = right
line[4] = eduardo[0]+eduardo[1]
distance = Levenshtein.distance left, right
line << 99 if (left.blank? or right.blank?)
line << distance unless (left.blank? or right.blank?)
end
# Out
# counter = 0
CSV.open('salida.csv', 'w') do |csv|
lines.each do |line|
# counter = counter + 1 if line[3] <= 3
csv << line
end
end
# p counter
The middle column is correct; the right column is the one I need to correct.
Some examples:
Eduardo | Abner | Herrera | Herrera -> Eduardo Herrera
Angel | De | Leon -> Angel De Leon
Maira | Angelina | de | Leon -> Maira De Leon
Marquilla | Gutierrez | Petronilda |De | Leon -> Marquilla Petronilda
First order of business is to come up with some rules. Based on your examples, and Spanish naming customs, here's my stab at the rules.
A name has a forename, paternal surname, and optional maternal surname.
A forename can be multiple words.
A surname can be multiple words linked by a de, y, or e.
So ['Marquilla', 'Gutierrez', 'Petronilda', 'De', 'Leon'] should be { forename: 'Marquilla', paternal_surname: 'Gutierrez', maternal_surname: 'Petronilda de Leon' }
To simplify the process, I'd first join any composite surnames into one field. ['Marquilla', 'Gutierrez', 'Petronilda', 'De', 'Leon'] becomes ['Marquilla', 'Gutierrez', 'Petronilda De Leon']. Watch out for cases like ['Angel', 'De', 'Leon'] in which case the surname is probably De Leon.
Once that's done, figuring out which part is which becomes easier.
name = {}
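# Fewer than two parts cannot be split into forename and surname
if parts.length < 2
raise ArgumentError, "expected at least a forename and a surname"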
# The special case of only two parts: forename paternal_surname
elsif parts.length == 2
name = {
forename: parts[0],
paternal_surname: parts[1]
}
# forename paternal_surname maternal_surname
else
# The forename can have multiple parts, so work from the
# end and whatever's left is their forename.
name[:maternal_surname] = parts.pop
name[:paternal_surname] = parts.pop
name[:forename] = parts.join(" ")
end
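The joining step described above can be sketched as follows, in Python for illustration. This is only a rough sketch under the rules guessed at in this answer: the function name `join_composite_surnames` and the `LINKERS` set are hypothetical, and the "linker right after the first word starts a new surname" rule is an assumption made to cover the ['Angel', 'De', 'Leon'] case.

```python
# Hypothetical helper: merge surname words joined by a linker ("de", "y", "e").
LINKERS = {"de", "y", "e"}

def join_composite_surnames(parts):
    joined = []
    i = 0
    while i < len(parts):
        word = parts[i]
        if word.lower() in LINKERS and i + 1 < len(parts):
            following = parts[i + 1]
            if len(joined) > 1:
                # Attach linker and next word to the previous surname word.
                joined[-1] = " ".join([joined[-1], word, following])
            else:
                # Assumption: never glue a linker onto the forename;
                # start a new surname such as "De Leon" instead.
                joined.append(" ".join([word, following]))
            i += 2
        else:
            joined.append(word)
            i += 1
    return joined

print(join_composite_surnames(["Marquilla", "Gutierrez", "Petronilda", "De", "Leon"]))
print(join_composite_surnames(["Angel", "De", "Leon"]))
```

The resulting list can then be fed to the forename/surname split shown above. Longer linkers such as "de la" are not handled here.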
There's a lot of ambiguity in Spanish naming, so this can only be an educated guess at what their actual name is. You'll probably have to tweak the rules as you learn more about the dataset. In particular, I'm pretty sure the handling of de is not that simple. For example:
One Leocadia Blanco Álvarez, married to a Pedro Pérez Montilla, may be addressed as Leocadia Blanco de Pérez or as Leocadia Blanco Álvarez de Pérez
In that case ['Marquilla', 'Gutierrez', 'Petronilda', 'De', 'Leon'] becomes ['Marquilla', 'Gutierrez', 'Petronilda', 'De Leon'], which is { forename: 'Marquilla', paternal_surname: 'Gutierrez', maternal_surname: 'Petronilda', married_to: 'Leon' }, or 'Marquilla Gutierrez Petronilda', who is married to someone whose paternal surname is Leon.
Good luck.
I would add more columns to the database, like last_name1, last_name2, last_name3, etc, and make them optional (don't put validations on those attributes). Hope that answers your question!
I'm using a hash to abbreviate state names
%STATEABBRIVATE = ('ALABAMA' => 'AL',
...);
Some of my input sets already have abbreviated state names. Would it be more efficient to use an if defined $STATEABBRIVATE{$state} or to add another 51 matched pairs 'AL'=>'AL' to the hash?
If you want to verify that the state really exists, using AL => 'AL' might be the easiest way.
To keep your code DRY (Don't Repeat Yourself), you can just
my %STATEABBRIVATE = ( ALABAMA => 'AL',
...
);
my @abbrevs = values %STATEABBRIVATE;
@STATEABBRIVATE{@abbrevs} = @abbrevs;
If you're concerned about performance, the bottleneck is probably somewhere else:
#! /usr/bin/perl
use warnings;
use strict;
use Benchmark qw{ cmpthese };
use Test::More;
my %hash = qw( Alabama AL Alaska AK Arizona AZ Arkansas AR California CA
Colorado CO Connecticut CT Delaware DE Florida FL
Georgia GA Hawaii HI Idaho ID Illinois IL Indiana IN
Iowa IA Kansas KS Kentucky KY Louisiana LA Maine ME
Maryland MD Massachusetts MA Michigan MI Minnesota MN
Mississippi MS Missouri MO Montana MT Nebraska NE
Nevada NV Ohio OH Oklahoma OK Oregon OR Pennsylvania PA
Tennessee TN Texas TX Utah UT Vermont VT Virginia VA
Washington WA Wisconsin WI Wyoming WY );
$hash{'West Virginia'} = 'WV';
$hash{'South Dakota'} = 'SD';
$hash{'South Carolina'} = 'SC';
$hash{'Rhode Island'} = 'RI';
$hash{'North Dakota'} = 'ND';
$hash{'North Carolina'} = 'NC';
$hash{'New York'} = 'NY';
$hash{'New Mexico'} = 'NM';
$hash{'New Jersey'} = 'NJ';
$hash{'New Hampshire'} = 'NH';
my %larger = %hash;
@larger{ values %hash } = values %hash;
sub def {
my $state = shift;
return defined $hash{$state} ? $hash{$state} : $state
}
sub ex {
my $state = shift;
return exists $hash{$state} ? $hash{$state} : $state
}
sub hash {
my $state = shift;
return $larger{$state}
}
is(def($_), ex($_), "def-ex-$_") for keys %larger;
is(def($_), hash($_), "def-hash-$_") for keys %larger;
done_testing();
cmpthese(-1,
{ hash => sub { map hash($_), keys %larger },
ex => sub { map ex($_), keys %larger },
def => sub { map def($_), keys %larger },
});
Results:
        Rate  def   ex hash
def  27307/s   --  -2% -11%
ex   27926/s   2%   --  -9%
hash 30632/s  12%  10%   --
Both if defined $STATEABBRIVATE{$state} and any hash lookups are going to be constant time (i.e. O(1) operations). In fact, defined() probably uses a hash table lookup behind the scenes anyway. So, my prediction is that the difference in performance is going to be negligible, even with large data sets. This is, at best, an educated guess.
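For comparison, here is a minimal sketch of the two strategies from this thread, written in Python rather than Perl purely for illustration. The table is deliberately truncated to two states, and the names `STATE_ABBREVIATIONS`, `abbreviate_with_fallback`, and `abbreviate_strict` are hypothetical. Both variants are a single hash lookup, which is why the performance difference is expected to be small.

```python
# Truncated table, for illustration only.
STATE_ABBREVIATIONS = {"ALABAMA": "AL", "ALASKA": "AK"}

def abbreviate_with_fallback(state):
    # Strategy 1: look up the full name, pass anything else through unchanged.
    # Note that this also passes invalid input through silently.
    return STATE_ABBREVIATIONS.get(state, state)

# Strategy 2: add identity entries ('AL' => 'AL') without retyping them.
WITH_IDENTITY = dict(STATE_ABBREVIATIONS)
WITH_IDENTITY.update({abbr: abbr for abbr in STATE_ABBREVIATIONS.values()})

def abbreviate_strict(state):
    # Raises KeyError for anything that is neither a known name nor a known abbreviation.
    return WITH_IDENTITY[state]

print(abbreviate_with_fallback("ALABAMA"), abbreviate_with_fallback("AL"))
print(abbreviate_strict("ALASKA"), abbreviate_strict("AK"))
```

The strict variant is the one that also validates the input, which matches the point made above about verifying that the state really exists.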
I'm a ruby newcomer who's trying to read a text file (a Valgrind simulation output) like this:
--------------------------------------------------------------------------------
Profile data file 'temp/gt_1024_2_16.out'
--------------------------------------------------------------------------------
I1 cache: 1024 B, 16 B, 2-way associative
D1 cache: 32768 B, 64 B, 8-way associative
LL cache: 3145728 B, 64 B, 12-way associative
Profiled target: bash run.sh
Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds: 99 0 0 0 0 0 0 0 0
Include dirs:
User annotated:
Auto-annotation: off
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
1,894,017 246,981 2,448 519,124 4,691 2,792 337,817 1,846 1,672 PROGRAM TOTALS
// other data
I want to extract the PROGRAM TOTALS table and put it into a hash. Something like...
myHash = { :Ir => 1894017, :I1mr => 246981, :ILmr => 2448, ..., :DLmw => 1672 }
What are the best options for doing this? Could the CSV classes help me out? Thanks a bunch.
My current code:
file = File.open(fileName, "r")
while header = file.gets
if header =~ / Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw /
# Found the header
file.gets # skip the ---- line
values = file.gets
puts "Header: " + header
puts " Data: " + values
break
end
end
I've got this output:
Header: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Data: 1,894,017 246,981 2,448 519,124 4,691 2,792 337,817 1,846 1,672 PROGRAM TOTALS
How could I join these two strings into a hash?
look at:
NAMES_INDEX = 16 # the line number of Ir I1mr ILmr Dr ...
NUMBERS_INDEX = 18 # the line number of 1,894,017 246,981 2,448 ...
FILE_NAME= "temp/gt_1024_2_16.out" # the file name
f = File.readlines(FILE_NAME)
names = f[NAMES_INDEX].split
numbers = f[NUMBERS_INDEX].split[0..-3].map{|a| a.delete(",").to_i}
h = Hash[names.zip numbers]
p h
It looks like your column names are fixed, since you search for them to find the data line.
This is how I would do it
data = nil
names = %w/ Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw /
open('E:\Perl\source\valgrind.txt', 'r') do |f|
f.each_line do |line|
if line =~ /PROGRAM TOTALS/
values = line.scan(/[\d\,]+/).map { |num| num.tr(',', '').to_i }
data = Hash[ names.zip(values) ]
break
end
end
end
p data
output
{"Ir"=>1894017, "I1mr"=>246981, "ILmr"=>2448, "Dr"=>519124, "D1mr"=>4691, "DLmr"=>2792, "Dw"=>337817, "D1mw"=>1846, "DLmw"=>1672}
I would write the code like this:
file_path, lines_with_data = 'data.txt', [16,18]
header, data = File.readlines(file_path)
.values_at(*lines_with_data)
.map{|line| line.strip.gsub(',','')
.split(/\s+/)}
data.map!(&:to_i)
p Hash[header.zip(data)] # => {"Ir"=>1894017, "I1mr"=>246981, "ILmr"=>2448, "Dr"=>519124, "D1mr"=>4691, "DLmr"=>2792, "Dw"=>337817, "D1mw"=>1846, "DLmw"=>1672}
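The answers above either hard-code line numbers or hard-code the column names. A possible middle ground, sketched here in Python for illustration, is to take the names from the header row that starts with Ir and pair them with the PROGRAM TOTALS row. The function name `parse_program_totals` is hypothetical, and the sketch assumes the header row always begins with the Ir column, as it does in the sample output.

```python
def parse_program_totals(lines):
    """Return {event name: count} for the PROGRAM TOTALS row, or None if absent."""
    header = None
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "Ir":
            # Column header row, e.g. "Ir I1mr ILmr Dr ..."
            header = fields
        elif header and line.rstrip().endswith("PROGRAM TOTALS"):
            # Keep one number per header column and strip thousands separators.
            numbers = [int(f.replace(",", "")) for f in fields[:len(header)]]
            return dict(zip(header, numbers))
    return None

sample = [
    "Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw",
    "Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw",
    "----------------",
    "1,894,017 246,981 2,448 519,124 4,691 2,792 337,817 1,846 1,672 PROGRAM TOTALS",
]
print(parse_program_totals(sample))
```

With a real file, the list of lines would come from something like `open(path).readlines()`.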
I wrote some mail merge code the other day and although it works I'm a bit turned off by the code. I'd like to see what it would look like in other languages.
So for the input the routine takes a list of contacts
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Erica,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Marge,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
Ted,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Raoul,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
It will then merge lines with the same address and surname into one record (assume the rows are unsorted). The code should also be flexible enough that fields can be supplied in any order (so it will need to take field indexes as parameters). For a family of two it concatenates both first name fields. For a family of three or more the first name is set to "The" and the last name is set to "surname Family".
Erica and Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
The,Simpson Family,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
My C# implementation of this is:
var source = File.ReadAllLines(@"sample.csv").Select(l => l.Split(','));
var merged = HouseholdMerge(source, 0, 1, new[] {1, 2, 3, 4, 5});
public static IEnumerable<string[]> HouseholdMerge(IEnumerable<string[]> data, int fnIndex, int lnIndex, int[] groupIndexes)
{
Func<string[], string> groupby = fields => String.Join("", fields.Where((f, i) => groupIndexes.Contains(i)));
var groups = data.OrderBy(groupby).GroupBy(groupby);
foreach (var group in groups)
{
string[] result = group.First().ToArray();
if (group.Count() == 2)
{
result[fnIndex] += " and " + group.ElementAt(1)[fnIndex];
}
else if (group.Count() > 2)
{
result[fnIndex] = "The";
result[lnIndex] += " Family";
}
yield return result;
}
}
I don't like how I've had to do the groupby delegate. I'd like it if C# had some way to convert a string expression to a delegate, e.g. Func groupby = f => "f[2] + f[3] + f[4] + f[5] + f[1];" I have a feeling something like this can probably be done in Lisp or Python. I look forward to seeing nicer implementations in other languages.
Edit: Where did the community wiki checkbox go? Some mod please fix that.
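On the point about building the grouping key from field indexes: in Python, `operator.itemgetter` turns a list of indexes directly into a key function, so no string concatenation is needed. The following is a minimal sketch of the merge rules described above, not a definitive implementation; the function name `household_merge` mirrors the C# version, and keeping groups in first-seen order is an assumption, since the question does not specify an output order.

```python
from collections import OrderedDict
from operator import itemgetter

def household_merge(rows, fn_index, ln_index, group_indexes):
    # itemgetter(1, 2, 3) returns a function mapping a row to (row[1], row[2], row[3]).
    key_of = itemgetter(*group_indexes)
    groups = OrderedDict()
    for row in rows:
        groups.setdefault(key_of(row), []).append(row)
    for members in groups.values():
        result = list(members[0])
        if len(members) == 2:
            result[fn_index] += " and " + members[1][fn_index]
        elif len(members) > 2:
            result[fn_index] = "The"
            result[ln_index] += " Family"
        yield result

contacts = """Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Erica,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Marge,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
Ted,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Raoul,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6"""

rows = [line.split(",") for line in contacts.splitlines()]
merged = list(household_merge(rows, 0, 1, [1, 2, 3, 4, 5]))
for record in merged:
    print(",".join(record))
```

Because the key is a tuple of fields rather than a joined string, two different addresses cannot collide by accident when their concatenations happen to match.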
Ruby - 155 characters (previously 181)
The name and surname indexes are set in the code as a and b. Input data is read from ARGF.
a,b=0,1
[*$<].map{|i|i.strip.split ?,}.group_by{|i|i.rotate(a).drop 1}.map{|i,j|k,l,m=j
k[a]+=' and '+l[a]if l
(k[a]='The';k[b]+=' Family')if m
puts k*','}
Python - not golfed
I'm not sure what the order of the rows should be if the indices are not 0 and 1 for the input file
import csv
from collections import defaultdict

class HouseHold(list):
    def __init__(self, fn_idx, ln_idx):
        self.fn_idx = fn_idx
        self.ln_idx = ln_idx

    def append(self, item):
        self.item = item
        list.append(self, item[self.fn_idx])

    def get_value(self):
        fn_idx = self.fn_idx
        ln_idx = self.ln_idx
        item = self.item
        addr = [j for i,j in enumerate(item) if i not in (fn_idx, ln_idx)]
        if len(self) < 3:
            fn, ln = " and ".join(self), item[ln_idx]
        else:
            fn, ln = "The", item[ln_idx]+" Family"
        return [fn, ln] + addr

def source(fname):
    with open(fname) as in_file:
        for item in csv.reader(in_file):
            yield item

def household_merge(src, fn_idx, ln_idx, groupby):
    res = defaultdict(lambda:HouseHold(fn_idx, ln_idx))
    for item in src:
        key = tuple(item[x] for x in groupby)
        res[key].append(item)
    return res.values()

data = household_merge(source("sample.csv"), 0, 1, [1,2,3,4,5,6,7])
with open("result.csv", "w") as out_file:
    csv.writer(out_file).writerows(item.get_value() for item in data)
Python - 178 chars
import sys
d={}
for x in sys.stdin:F,c,A=x.partition(',');d[A]=d.get(A,[])+[F]
print"".join([" and ".join(v)+c+A,"The"+c+A.replace(c,' Family,',1)][2<len(v)]for A,v in d.items())
Output
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
The,Simpson Family,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
Erica and Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Python 2.6.6 - 287 Characters
This assumes you can hard code a filename (named i). If you want to take input from command line, this goes up ~16 chars.
from itertools import*
for z,g in groupby(sorted([l.split(',')for l in open('i').readlines()],key=lambda x:x[1:]), lambda x:x[2:]):
 l=list(g);r=len(l);k=','.join(z);o=l[0]
 if r>2:print'The,'+o[1],"Family,"+k,
 elif r>1:print o[0],"and",l[1][0]+","+o[1]+","+k,
 else:print','.join(o),
Output
Erica and Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
The,Simpson Family,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
I'm sure this could be improved upon, but it is getting late.
Haskell - 321 characters (previously 341)
(Changes as per comments).
Unfortunately Haskell has no standard split function which makes this rather long.
Input to stdin, output on stdout.
import List
import Data.Ord
main=interact$unlines.e.lines
s[]=[]
s(',':x)=s x
s l@(x:y)=let(h,i)=break(==k)l in h:(s i)
t[]=[]
t x=tail x
h=head
m=map
k=','
e l=m(t.(>>=(k:)))$(m c$groupBy g$sortBy(comparing t)$m s l)
c(x:[])=x
c(x:y:[])=(h x++" and "++h y):t x
c x="The":((h$t$h x)++" Family"):(t$t$h x)
g a b=t a==t b
Lua, 434 bytes
x,y=1,2 s,p,r,a=string.gsub,pairs,io.read,{}for j,b,c,d,e,f,g,h,i in r('*a'):gmatch('('..('([^,]*),'):rep(7)..'([^,]*))\n')
do k=s(s(s(j,b,''),c,''),'[,%s]','')for l,m in p(a)do if not m.f and (m[y]:match(c) and m[9]==k) then z=1
if m.d then m[x]="The"m[y]=m[y]..' family'm.f=1 else m[x]=m[x].." and "..b m.d=1 end end end if not z then
a[#a+1]={b,c,d,e,f,g,h,i,k} end z=nil end for k,v in p(a)do v[9]=nil print(table.concat(v,','))end