How can I separate a full name? - ruby

I have to take the right-hand part, clean it up, then compare it with the middle part and save the result if they are equal.
#!/usr/bin/env ruby
require 'rubygems'
require 'levenshtein'
require 'csv'
# Extending String class for blank? method
class String
  def blank?
    self.strip.empty?
  end
end
# In
lines = CSV.read('entrada.csv')
lines.each do |line|
  id = line[0].upcase.strip
  left = line[1].upcase.strip
  right = line[2].upcase.strip
  eduardo = line[2].upcase.split(' ','de')
  line[0] = id
  line[1] = left
  line[2] = right
  line[4] = eduardo[0]+eduardo[1]
  distance = Levenshtein.distance left, right
  line << 99 if (left.blank? or right.blank?)
  line << distance unless (left.blank? or right.blank?)
end
# Out
# counter = 0
CSV.open('salida.csv', 'w') do |csv|
  lines.each do |line|
    # counter = counter + 1 if line[3] <= 3
    csv << line
  end
end
# p counter
The middle part is the correct one; the right part is the one I should correct.
Some examples:
Eduardo | Abner | Herrera | Herrera -> Eduardo Herrera
Angel | De | Leon -> Angel De Leon
Maira | Angelina | de | Leon -> Maira De Leon
Marquilla | Gutierrez | Petronilda | De | Leon -> Marquilla Petronilda

First order of business is to come up with some rules. Based on your examples, and Spanish naming customs, here's my stab at the rules.
A name has a forename, paternal surname, and optional maternal surname.
A forename can be multiple words.
A surname can be multiple words linked by a de, y, or e.
So ['Marquilla', 'Gutierrez', 'Petronilda', 'De', 'Leon'] should be { forename: 'Marquilla', paternal_surname: 'Gutierrez', maternal_surname: 'Petronilda de Leon' }
To simplify the process, I'd first join any composite surnames into one field. ['Marquilla', 'Gutierrez', 'Petronilda', 'De', 'Leon'] becomes ['Marquilla', 'Gutierrez', 'Petronilda De Leon']. Watch out for cases like ['Angel', 'De', 'Leon'] in which case the surname is probably De Leon.
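A minimal sketch of that joining step, assuming the name has already been split into words. The method name join_composite_surnames and the CONNECTORS list are made up for illustration, and the rule "if only the forename precedes the connector, don't swallow it" is just one guess at how to handle the Angel De Leon case:

```ruby
# Words that can link the pieces of a composite surname (assumption:
# only the three connectors listed in the rules above).
CONNECTORS = %w[de y e].freeze

# Hypothetical helper: merges "X de Y" style surnames into a single part.
def join_composite_surnames(parts)
  joined = []
  i = 0
  while i < parts.length
    word = parts[i]
    following = parts[i + 1]
    if CONNECTORS.include?(word.downcase) && following
      if joined.length > 1
        # Glue the previous word, the connector and the next word together.
        joined[-1] = "#{joined[-1]} #{word} #{following}"
      else
        # Only the forename (or nothing) precedes the connector, so the
        # connector starts the surname instead of extending the forename.
        joined << "#{word} #{following}"
      end
      i += 2
    else
      joined << word
      i += 1
    end
  end
  joined
end

p join_composite_surnames(%w[Marquilla Gutierrez Petronilda De Leon])
# => ["Marquilla", "Gutierrez", "Petronilda De Leon"]
p join_composite_surnames(%w[Angel De Leon])
# => ["Angel", "De Leon"]
```

Chains such as "de la" are not handled here; that would need another rule.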
Once that's done, figuring out which part is which becomes easier.
name = {}
if parts.length < 2
  # A single word cannot be split into forename and surname.
  raise ArgumentError, "need at least a forename and a surname"
# The special case of only two parts: forename paternal_surname
elsif parts.length == 2
  name = {
    forename: parts[0],
    paternal_surname: parts[1]
  }
# forename paternal_surname maternal_surname
else
  # The forename can have multiple parts, so work from the
  # end and whatever's left is their forename.
  name[:maternal_surname] = parts.pop
  name[:paternal_surname] = parts.pop
  name[:forename] = parts.join(" ")
end
There's a lot of ambiguity in Spanish naming, so this can only be an educated guess at what their actual name is. You'll probably have to tweak the rules as you learn more about the dataset. In particular, I'm pretty sure handling of de is not that simple. For example:
One Leocadia Blanco Álvarez, married to a Pedro Pérez Montilla, may be addressed as Leocadia Blanco de Pérez or as Leocadia Blanco Álvarez de Pérez
In that case ['Marquilla', 'Gutierrez', 'Petronilda', 'De', 'Leon'] becomes ['Marquilla', 'Gutierrez', 'Petronilda', 'De Leon'], which is { forename: 'Marquilla', paternal_surname: 'Gutierrez', maternal_surname: 'Petronilda', married_to: 'Leon' }, or "Marquilla Gutierrez Petronilda, who is married to someone whose paternal surname is Leon".
Good luck.

I would add more columns to the database, like last_name1, last_name2, last_name3, etc, and make them optional (don't put validations on those attributes). Hope that answers your question!

Related

Reversing and Splitting in Python

I have a file "names.txt". The contents are
"Smith,RobJones,MikeJane,SallyPetel,Brian"
and I want to read "names.txt" and make a new file "names2.txt" that looks like:
"Rob Smith Mike Jones Sally Jane Brian Petel"
I know I should be using .rstrip('\n') and .split(',')
So far I have:
namesfile = input('Enter name of file: ') #open names.txt
openfile = open(namesfile, 'r')
This will do exactly that. You might be able to polish this and make it more elegant and I encourage you to do so:
import re
with open('names.txt') as f:
    # Split the names
    names = re.sub(r'([A-Z])(?![A-Z])', r',\1', f.read()).split(',')
    # Filter empty results
    names = [n for n in names if n != '']
    # Swap pairs with each other
    for i in range(len(names)):
        if (i + 1) % 2 == 0:
            names[i], names[i-1] = names[i-1], names[i]
    print(' '.join(names))

Moving chunks of data in a file with awk

I'm moving my bookmarks from kippt.com to pinboard.in.
I exported my bookmarks from Kippt and for some reason, they were storing tags (preceded by #) and description within the same field. Pinboard keeps tags and description separated.
This is what a Kippt bookmark looks like after export:
<DT>This is a title
<DD>#tag1 #tag2 This is a description
This is what it should look like before importing into Pinboard:
<DT>This is a title
<DD>This is a description
So basically, I need to replace #tag1 #tag2 by TAGS="tag1,tag2" and move it on the first line within <A>.
I've been reading about moving chunks of data here: sed or awk to move one chunk of text betwen first pattern pair into second pair?
I haven't been able to come up with a good recipe so far. Any insight?
Edit:
Here's an actual example of what the input file looks like (3 entries out of 3500):
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
This might not be the most beautiful solution, but since it seems to be a one-time-thing it should be sufficient.
import re
dt = re.compile('^<DT>')
dd = re.compile('^<DD>')
# Start empty so that lines before the first <DT> don't raise a NameError
current_dt = current_dd = ""
with open('bookmarks.xml', 'r') as f:
    for line in f:
        if re.match(dt, line):
            current_dt = line.strip()
        elif re.match(dd, line):
            current_dd = line
            tags = [w for w in line[4:].split(' ') if w.startswith('#')]
            current_dt = re.sub('(<A[^>]+)>', '\\1 TAGS="' + ','.join([t[1:] for t in tags]) + '">', current_dt)
            for t in tags:
                current_dd = current_dd.replace(t + ' ', '')
            if current_dd.strip() == '<DD>':
                current_dd = ""
        else:
            print(current_dt)
            print(current_dd)
            current_dt = ""
            current_dd = ""
    print(current_dt)
    print(current_dd)
If some parts of the code are not clear, just tell me. You can of course use python to write the lines to a file instead of printing them, or even modify the original file.
Edit: Added if-clause so that empty <DD> lines won't show up in the result.
script.awk
BEGIN{FS="#"}
/^<DT>/{
    if(d==1) print "<DT>"s # for printing lines with no tags
    s=substr($0,5);tags="" # Copying the line after "<DT>". You'll know why
    d=1
}
/^<DD>/{
    d=0
    m=match(s,/>/) # Find the end of the HREF descriptor: first match of ">"
    for(i=2;i<=NF;i++){sub(/ $/,"",$i);tags=tags","$i} # Concatenate tags
    td=match(tags,/ /) # Parse for tag description (marked by a preceding space).
    if(td==0){ # No description exists
        tags=substr(tags,2)
        tagdes=""
    }
    else{ # Description exists
        tagdes=substr(tags,td)
        tags=substr(tags,2,td-2)
    }
    print "<DT>" substr(s,1,m-1) ", TAGS=\"" tags "\"" substr(s,m)
    print "<DD>" tagdes
}
awk -f script.awk kippt > pinboard
INPUT
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
OUTPUT:
<DT>Phabricator
<DD>
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD> Self-driving tour of Iceland

How to extract string from large file only if specific string appears previous using Ruby?

I am trying to extract information from a large file and cannot figure out how to extract strings from file lines only when a previous line in the same record within the file has been matched by regex. An example of one record in the file is as follows:
*NEW RECORD
RECTYPE = D
MH = Informed Consent
AQ = ES HI LJ PX SN ST
ENTRY = Consent, Informed
MN = N03.706.437.650.312
MN = N03.706.535.489
FX = Disclosure
FX = Mental Competency
FX = Therapeutic Misconception
FX = Treatment Refusal
ST = T058
ST = T078
AN = competency to consent: coordinate IM with MENTAL COMPETENCY (IM)
PI = Jurisprudence (1966-1970)
PI = Physician-Patient Relations (1966-1970)
MS = Voluntary authorization, by a patient or research subject, etc,...
This file contains over 20,000 records like this example. I want to identify a small percent of those records using the "MH" field. In this example, I want to find "Informed Consent", and then use regex to extract the information in the FX, AN, and MS fields only within that record. So far, I have opened the file, accessed the hash that the MH terms are stored in, and been able to extract those terms from the records in the file. I also have a functioning regex that identifies the content in the "FX" field.
File.open('mesh_descriptor.bin').each do |file_line|
  file_line = file_line.chomp
  # read each key of candidate_descriptor_keys
  candidate_descriptor_keys.each do |cand_term|
    if file_line =~ /^MH\s=\s(#{cand_term})$/
      mesh_header = $1
      puts "MH from Mesh Descriptor file is: #{mesh_header}"
      if file_line =~ /^FX\s=\s(.*)$/
        see_also = $1
        puts " See_Also from Descriptor file is: #{see_also}"
      end
    end
  end
end
The hash contains the following MH (keys):
candidate_descriptor_keys = ["Body Weight", "Obesity", "Thinness", "Fetal Weight", "Overweight"]
I had success extracting "FX" when I put the statement outside of the "if" statement to extract "MH", but all of the "FX" from the whole file were retrieved - not what I need. I thought putting the "if" statement for "FX" within the previous "if" statement would restrict the results to only those found when the first statement is true, but I am getting no results (also no errors) with this strategy. What I would like as a result is:
> Informed Consent
> Disclosure
> Mental Competency
> Therapeutic Misconception
> Treatment Refusal
as well as the strings within the "AN" and "MS" fields for only those records matching "MH". Any suggestions would be helpful!
I think this may be what you are looking for, but if not, let me know and I will change it. Look especially at the very end to see if that is the sort of output you want (for input having two records, both with an "MH" field). I will also add an "explanation" section at the end once I am sure I have understood your question correctly.
I have assumed that each record begins with
*NEW RECORD
and you wish to identify all lines beginning "MH" whose field is one of the elements of:
candidate_descriptor_keys =
["Body Weight", "Obesity", "Thinness", "Informed Consent"]
and for each match, you would like to print the contents of the lines for the same record that begin with "FX", "AN" and "MS".
Code
NEW_RECORD_MARKER = "*NEW RECORD"
def getem(fname, candidate_descriptor_keys)
  line = 0
  found_mh = false
  File.open(fname).each do |file_line|
    file_line = file_line.strip
    case
    when file_line == NEW_RECORD_MARKER
      puts # space between records
      found_mh = false
    when found_mh == false
      candidate_descriptor_keys.each do |cand_term|
        if file_line =~ /^MH\s=\s(#{cand_term})$/
          found_mh = true
          puts "MH from line #{line} of file is: #{cand_term}"
          break
        end
      end
    when found_mh
      ["FX", "AN", "MS"].each do |des|
        if file_line =~ /^#{des}\s=\s(.*)$/
          see_also = $1
          puts " Line #{line} of file is: #{des}: #{see_also}"
        end
      end
    end
    line += 1
  end
end
Example
Let's begin by creating a file, starting with a "here document" that contains two records:
records =<<_
*NEW RECORD
RECTYPE = D
MH = Informed Consent
AQ = ES HI LJ PX SN ST
ENTRY = Consent, Informed
MN = N03.706.437.650.312
MN = N03.706.535.489
FX = Disclosure
FX = Mental Competency
FX = Therapeutic Misconception
FX = Treatment Refusal
ST = T058
ST = T078
AN = competency to consent
PI = Jurisprudence (1966-1970)
PI = Physician-Patient Relations (1966-1970)
MS = Voluntary authorization
*NEW RECORD
MH = Obesity
AQ = ES HI LJ PX SN ST
ENTRY = Obesity
MN = N03.706.437.650.312
MN = N03.706.535.489
FX = 1st FX
FX = 2nd FX
AN = Only AN
PI = Jurisprudence (1966-1970)
PI = Physician-Patient Relations (1966-1970)
MS = Only MS
_
If you puts records you will see it is just a string. (You'll see that I shortened two of them.) Now write it to a file:
File.write('mesh_descriptor', records)
If you wish to confirm the file contents, you could do this:
puts File.read('mesh_descriptor')
We also need to define the array candidate_descriptor_keys:
candidate_descriptor_keys =
["Body Weight", "Obesity", "Thinness", "Informed Consent"]
We can now execute the method getem:
getem('mesh_descriptor', candidate_descriptor_keys)
MH from line 2 of file is: Informed Consent
Line 7 of file is: FX: Disclosure
Line 8 of file is: FX: Mental Competency
Line 9 of file is: FX: Therapeutic Misconception
Line 10 of file is: FX: Treatment Refusal
Line 13 of file is: AN: competency to consent
Line 16 of file is: MS: Voluntary authorization
MH from line 18 of file is: Obesity
Line 23 of file is: FX: 1st FX
Line 24 of file is: FX: 2nd FX
Line 25 of file is: AN: Only AN
Line 28 of file is: MS: Only MS
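A different way to get the same information, sketched here as an alternative rather than as what getem does: first group the lines of each record into a hash, then filter on "MH". The helper names parse_records and matching_fields are made up for this example, and the sketch assumes every field line looks like "NAME = value" with a single-word field name.

```ruby
# Hypothetical helper (not from the answer above): turns the lines of a
# descriptor file into one hash per record, mapping field name => values.
def parse_records(lines)
  records = []
  lines.each do |raw|
    line = raw.strip
    if line == "*NEW RECORD"
      # Each record collects repeated fields (FX, MN, ...) in arrays.
      records << Hash.new { |hash, key| hash[key] = [] }
    elsif !records.empty? && (match = line.match(/\A(\w+)\s=\s(.*)\z/))
      records.last[match[1]] << match[2]
    end
  end
  records
end

# Hypothetical helper: keeps only records whose MH value is wanted and
# returns the requested fields for each of them.
def matching_fields(records, wanted_headings, fields = %w[FX AN MS])
  selected = records.select { |record| (record["MH"] & wanted_headings).any? }
  selected.map do |record|
    picked = { "MH" => record["MH"].first }
    fields.each { |field| picked[field] = record[field] }
    picked
  end
end

# Usage (File.foreach streams the file line by line):
# records = parse_records(File.foreach("mesh_descriptor"))
# matching_fields(records, ["Obesity", "Informed Consent"]).each { |r| p r }
```

Because the comparison is done on plain strings, a heading containing regex metacharacters needs no escaping here. The trade-off is that all parsed records are held in memory at once.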

Join array of strings into 1 or more strings each within a certain char limit (+ prepend and append texts)

Let's say I have an array of Twitter account names:
twitter_accounts = %w[example1 example2 example3 example4 example5 example6 example7 example8 example9 example10 example11 example12 example13 example14 example15 example16 example17 example18 example19 example20]
And a prepend and append variable:
prepend = 'Check out these cool people: '
append = ' #FollowFriday'
How can I turn this into an array of as few strings as possible each with a maximum length of 140 characters, starting with the prepend text, ending with the append text, and in between the Twitter account names all starting with an #-sign and separated with a space. Like this:
tweets = ['Check out these cool people: #example1 #example2 #example3 #example4 #example5 #example6 #example7 #example8 #example9 #FollowFriday', 'Check out these cool people: #example10 #example11 #example12 #example13 #example14 #example15 #example16 #example17 #FollowFriday', 'Check out these cool people: #example18 #example19 #example20 #FollowFriday']
(The order of the accounts isn't important so theoretically you could try and find the best order to make the most use of the available space, but that's not required.)
Any suggestions? I'm thinking I should use the scan method, but haven't figured out the right way yet.
It's pretty easy using a bunch of loops, but I'm guessing that won't be necessary when using the right Ruby methods. Here's what I came up with so far:
# Create one long string of #usernames separated by a space
tmp = twitter_accounts.map!{|a| a.insert(0, '#')}.join(' ')
# alternative: tmp = '#' + twitter_accounts.join(' #')
# Number of characters left for mentioning the Twitter accounts
length = 140 - (prepend + append).length
# This method would split a string into multiple strings
# each with a maximum length of 'length' and it will only split on empty spaces (' ')
# ideally strip that space as well (although .map(&:strip) could be used too)
tweets = tmp.some_method(' ', length)
# Prepend and append
tweets.map!{|t| prepend + t + append}
P.S.
If anyone has a suggestion for a better title let me know. I had a difficult time summarizing my question.
The String rindex method has an optional parameter where you can specify where to start searching backwards in a string:
arr = %w[example1 example2 example3 example4 example5 example6 example7 example8 example9 example10 example11 example12 example13 example14 example15 example16 example17 example18 example19 example20]
str = arr.map{|name|"##{name}"}.join(' ')
prepend = 'Check out these cool people: '
append = ' #FollowFriday'
max_chars = 140 - prepend.size - append.size
until str.size <= max_chars do
  p str.slice!(0, str.rindex(" ", max_chars))
  str.lstrip! #get rid of the leading space
end
p str unless str.empty?
I'd make use of reduce for this:
string = %w[example1 example2 example3 example4 example5 example6 example7 example8 example9 example10 example11 example12 example13 example14 example15 example16 example17 example18 example19 example20]
prepend = 'Check out these cool people:'
append = '#FollowFriday'
# Extra -1 is for the space before `append`
max_content_length = 140 - prepend.length - append.length - 1
content_strings = string.reduce([""]) { |result, target|
  result.push("") if result[-1].length + target.length + 2 > max_content_length
  result[-1] += " ##{target}"
  result
}
tweets = content_strings.map { |s| "#{prepend}#{s} #{append}" }
Which would yield:
"Check out these cool people: #example1 #example2 #example3 #example4 #example5 #example6 #example7 #example8 #example9 #FollowFriday"
"Check out these cool people: #example10 #example11 #example12 #example13 #example14 #example15 #example16 #example17 #FollowFriday"
"Check out these cool people: #example18 #example19 #example20 #FollowFriday"
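The same greedy idea can be wrapped in a method so the limit is easy to check afterwards. This is only a sketch: pack_tweets is a made-up name, it assumes prepend and append already carry their own spaces (as in the question), and it keeps the names in their original order instead of searching for the tightest packing.

```ruby
# Hypothetical helper: packs names greedily into as few messages as
# possible without reordering them.
def pack_tweets(names, prepend, append, limit = 140)
  # Characters left for the "#name #name ..." part of each message.
  room = limit - prepend.length - append.length
  chunks = names.each_with_object([]) do |name, acc|
    tag = "##{name}"
    if !acc.empty? && acc.last.length + 1 + tag.length <= room
      # Still fits: append to the current chunk, separated by a space.
      acc.last << " " << tag
    else
      # Does not fit (or first name): start a new chunk.
      acc << tag.dup
    end
  end
  chunks.map { |chunk| "#{prepend}#{chunk}#{append}" }
end

names = (1..20).map { |n| "example#{n}" }
tweets = pack_tweets(names, 'Check out these cool people: ', ' #FollowFriday')
tweets.each { |t| puts "#{t.length}: #{t}" }
```

A single name that is longer than the available room still gets a message of its own, which would then exceed the limit; real input may need a check for that.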

Household mail merge (code golf)

I wrote some mail merge code the other day and although it works I'm a bit turned off by the code. I'd like to see what it would look like in other languages.
So for the input the routine takes a list of contacts
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Erica,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Marge,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
Ted,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Raoul,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
It will then merge lines with the same address and surname into one record. (Assume the rows are unsorted.) The code should also be flexible enough that fields can be supplied in any order (so it will need to take field indexes as parameters). For a family of two it concatenates both first name fields. For a family of three or more the first name is set to "The" and the last name is set to "<surname> Family".
Erica and Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
The,Simpson Family,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
My C# implementation of this is:
var source = File.ReadAllLines(@"sample.csv").Select(l => l.Split(','));
var merged = HouseholdMerge(source, 0, 1, new[] {1, 2, 3, 4, 5});
public static IEnumerable<string[]> HouseholdMerge(IEnumerable<string[]> data, int fnIndex, int lnIndex, int[] groupIndexes)
{
    Func<string[], string> groupby = fields => String.Join("", fields.Where((f, i) => groupIndexes.Contains(i)));
    var groups = data.OrderBy(groupby).GroupBy(groupby);
    foreach (var group in groups)
    {
        string[] result = group.First().ToArray();
        if (group.Count() == 2)
        {
            result[fnIndex] += " and " + group.ElementAt(1)[fnIndex];
        }
        else if (group.Count() > 2)
        {
            result[fnIndex] = "The";
            result[lnIndex] += " Family";
        }
        yield return result;
    }
}
I don't like how I've had to do the groupby delegate. I'd like if C# had some way to convert a string expression to a delegate. e.g. Func groupby = f => "f[2] + f[3] + f[4] + f[5] + f[1];" I have a feeling something like this can probably be done in Lisp or Python. I look forward to seeing nicer implementation in other languages.
Edit: Where did the community wiki checkbox go? Some mod please fix that.
Ruby — 181 155
Name/surname indexes are set in the code as a and b. Input data is read from ARGF.
a,b=0,1
[*$<].map{|i|i.strip.split ?,}.group_by{|i|i.rotate(a).drop 1}.map{|i,j|k,l,m=j
k[a]+=' and '+l[a]if l
(k[a]='The';k[b]+=' Family')if m
puts k*','}
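For comparison, an ungolfed Ruby sketch of the same grouping idea. The method name household_merge and its parameter order mirror the C# signature from the question but are otherwise my own choice, and the sketch assumes the rows are already split into arrays of fields.

```ruby
# Hypothetical helper: rows is an array of field arrays, group_indexes
# lists the columns that must match for two rows to share a household.
def household_merge(rows, fn_index, ln_index, group_indexes)
  rows.group_by { |row| row.values_at(*group_indexes) }.map do |_key, members|
    merged = members.first.dup
    if members.length == 2
      # Two people: join both first names.
      merged[fn_index] = members.map { |m| m[fn_index] }.join(" and ")
    elsif members.length > 2
      # Three or more: address the whole family.
      merged[fn_index] = "The"
      merged[ln_index] = "#{merged[ln_index]} Family"
    end
    merged
  end
end

input = [
  "Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004",
  "Erica,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004",
  "Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004",
  "Marge,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6",
  "Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7",
  "Ted,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6",
  "Raoul,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6"
]
rows = input.map { |line| line.split(",") }
household_merge(rows, 0, 1, [1, 2, 3, 4, 5, 6, 7]).each { |row| puts row.join(",") }
```

group_by keeps groups in first-seen order, so no sorting is needed. Splitting on "," is only good enough for this sample; quoted fields would call for a real CSV parser.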
Python - not golfed
I'm not sure what the order of the rows should be if the indices are not 0 and 1 for the input file
import csv
from collections import defaultdict
class HouseHold(list):
    def __init__(self, fn_idx, ln_idx):
        self.fn_idx = fn_idx
        self.ln_idx = ln_idx
    def append(self, item):
        self.item = item
        list.append(self, item[self.fn_idx])
    def get_value(self):
        fn_idx = self.fn_idx
        ln_idx = self.ln_idx
        item = self.item
        addr = [j for i,j in enumerate(item) if i not in (fn_idx, ln_idx)]
        if len(self) < 3:
            fn, ln = " and ".join(self), item[ln_idx]
        else:
            fn, ln = "The", item[ln_idx]+" Family"
        return [fn, ln] + addr
def source(fname):
    with open(fname) as in_file:
        for item in csv.reader(in_file):
            yield item
def household_merge(src, fn_idx, ln_idx, groupby):
    res = defaultdict(lambda:HouseHold(fn_idx, ln_idx))
    for item in src:
        key = tuple(item[x] for x in groupby)
        res[key].append(item)
    return res.values()
data = household_merge(source("sample.csv"), 0, 1, [1,2,3,4,5,6,7])
with open("result.csv", "w") as out_file:
    csv.writer(out_file).writerows(item.get_value() for item in data)
Python - 178 chars
import sys
d={}
for x in sys.stdin:F,c,A=x.partition(',');d[A]=d.get(A,[])+[F]
print"".join([" and ".join(v)+c+A,"The"+c+A.replace(c,' Family,',1)][2<len(v)]for A,v in d.items())
Output
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
The,Simpson Family,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
Erica and Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Python 2.6.6 - 287 Characters
This assumes you can hard code a filename (named i). If you want to take input from command line, this goes up ~16 chars.
from itertools import*
for z,g in groupby(sorted([l.split(',')for l in open('i').readlines()],key=lambda x:x[1:]), lambda x:x[2:]):
 l=list(g);r=len(l);k=','.join(z);o=l[0]
 if r>2:print'The,'+o[1],"Family,"+k,
 elif r>1:print o[0],"and",l[1][0]+","+o[1]+","+k,
 else:print','.join(o),
Output
Erica and Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
The,Simpson Family,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
I'm sure this could be improved upon, but it is getting late.
Haskell - 341 321
(Changes as per comments).
Unfortunately Haskell has no standard split function which makes this rather long.
Input to stdin, output on stdout.
import List
import Data.Ord
main=interact$unlines.e.lines
s[]=[]
s(',':x)=s x
s l@(x:y)=let(h,i)=break(==k)l in h:(s i)
t[]=[]
t x=tail x
h=head
m=map
k=','
e l=m(t.(>>=(k:)))$(m c$groupBy g$sortBy(comparing t)$m s l)
c(x:[])=x
c(x:y:[])=(h x++" and "++h y):t x
c x="The":((h$t$h x)++" Family"):(t$t$h x)
g a b=t a==t b
Lua, 434 bytes
x,y=1,2 s,p,r,a=string.gsub,pairs,io.read,{}for j,b,c,d,e,f,g,h,i in r('*a'):gmatch('('..('([^,]*),'):rep(7)..'([^,]*))\n')
do k=s(s(s(j,b,''),c,''),'[,%s]','')for l,m in p(a)do if not m.f and (m[y]:match(c) and m[9]==k) then z=1
if m.d then m[x]="The"m[y]=m[y]..' family'm.f=1 else m[x]=m[x].." and "..b m.d=1 end end end if not z then
a[#a+1]={b,c,d,e,f,g,h,i,k} end z=nil end for k,v in p(a)do v[9]=nil print(table.concat(v,','))end
