What's a good way to group this list of names:
Doctor Watson.
Dr. John Watson.
Dr. J Watson.
Watson.
J Watson.
Sherlock.
Mr. Holmes.
S Holmes.
Holmes.
Sherlock Holmes.
Into a grouped list of unique and complete names:
Dr. John Watson.
Mr. Sherlock Holmes.
Also interesting:
Mr Watson
Watson
Mrs Watson
Watson
John Watson
Since the algorithm doesn't need to make inferences about whether the first Watson is a Mr (likely) or a Mrs, but only group the names uniquely, the only problem here is that John Watson obviously belongs with Mr Watson and not Mrs Watson. Without a dictionary of given names for each gender, this can't be deduced.
So far I've thought of iterating through the list and checking each item against the remaining items. At each match you group the two and start from the beginning again; you stop on the first pass where no grouping occurs.
Here's some rough (and still untested) Python. You'd call it with a list of names.
def groupedNames(ns):
    if len(ns) > 1:
        # First item is the query; the rest are target names to try matching
        q = ns[0]
        # For storing unmatched names, passed on later
        unmatched = []
        for i in range(1, len(ns)):
            t = ns[i]
            if areMatchingNames(q, t):
                # groupNames() groups two names into one, retaining all info
                return groupedNames([groupNames(q, t)] + unmatched + ns[i+1:])
            else:
                unmatched.append(t)
        # q matched nothing: keep it and carry on grouping the rest
        return [q] + groupedNames(unmatched)
    # When matching is finished
    return ns
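For example, a hypothetical call (areMatchingNames() and groupNames() are assumed helpers for plain-string names):

names = ["Doctor Watson", "Dr. John Watson", "Sherlock Holmes", "S Holmes"]
print(groupedNames(names))
# With suitable helpers this might print: ['Dr. John Watson', 'Sherlock Holmes']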
If your names are always of the form [honorific][first name or initial]LastName, then you can start by extracting and sorting by the last name. If some names have the form LastName[,[honorific][first name or initial]], you can parse them and convert to the first form. Or, you might want to convert everything to some other form.
In any case, you put the names into some canonical form and then sort by last name. Your problem is greatly reduced. You can then sort by first name and honorific within a last name group and then go sequentially through them to extract the complete names from the fragments.
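For illustration, here is a minimal sketch of such a canonicalization in Python, assuming the two forms above and a small, hypothetical honorific list:

HONORIFICS = {'mr', 'mrs', 'ms', 'dr', 'doctor'}  # illustrative, not exhaustive

def canonical(name):
    # Convert "LastName, [honorific] [first name or initial]" to the first form
    if ',' in name:
        last, rest = [p.strip() for p in name.split(',', 1)]
        parts = rest.split() + [last]
    else:
        parts = name.strip().rstrip('.').split()
    honorific = ''
    if parts and parts[0].rstrip('.').lower() in HONORIFICS:
        honorific = parts.pop(0).rstrip('.').lower()
    last = parts[-1].lower() if parts else ''
    first = parts[0].rstrip('.').lower() if len(parts) > 1 else ''
    return (last, first, honorific)

# Sorting by the canonical form clusters each last name together
names = ['Dr. J Watson', 'Watson, John', 'Sherlock Holmes', 'Watson']
for n in sorted(names, key=canonical):
    print(n, canonical(n))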
As you noted, there are some ambiguities that you'll have to resolve. For example, you might have:
John Watson
Jane Watson
Dr. J. Watson
There's not enough information to say which of the two (if either!) is the doctor. And, as you pointed out, without information about the gender of names, you can't resolve Mr. J. Watson or Mrs. J. Watson.
I suggest using hashing here.
Define a hash function that interprets each word as a base-26 number, with a = 0 through z = 25.
Now just hash the individual words. So
h(sherlock holmes) = h(sherlock) + h(holmes) = h(holmes) + h(sherlock).
Using this you can easily identify names like:
John Watson and Watson John
For ambiguities like Dr. John Watson and Mr John Watson, you can define the hash values for Mr and Dr to be the same.
To resolve conflicts like J. Watson and John Watson, you can hash just the first letter of the given name along with the last name. You can extend the idea to similar conflicts.
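Here is a rough sketch of the whole scheme in Python, assuming names of the form [honorific] First Last (the honorific list is hypothetical):

HONORIFICS = {'mr', 'mrs', 'ms', 'dr', 'doctor'}  # illustrative, not exhaustive

def word_hash(word):
    # Interpret the word as a base-26 number with a = 0, ..., z = 25
    h = 0
    for c in word.lower():
        if 'a' <= c <= 'z':
            h = h * 26 + (ord(c) - ord('a'))
    return h

def name_hash(name):
    # Drop honorifics entirely, so Dr./Mr./Mrs. variants hash alike
    words = [w.rstrip('.').lower() for w in name.split()]
    words = [w for w in words if w not in HONORIFICS]
    # Keep only the initial of the given name, so "J. Watson" and
    # "John Watson" collide
    if len(words) > 1:
        words[0] = words[0][0]
    # Summing the word hashes makes the result order-independent
    return sum(word_hash(w) for w in words)

assert name_hash('Dr. John Watson') == name_hash('Mr J. Watson')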
Related
For the problem of named entity recognition,
After tokenizing the sentences, how do you set up the columns? It looks like one column in the documentation is the POS tag, but where do these come from? Am I supposed to tag the POS myself, or is there a tool to generate these?
What does the next column represent? A class like PERSON, LOCATION, etc.? And does it have to be in any particular format?
Is there any example of a completed training file and template for NER?
You can find example training and test data in the crf++ repo here. The training data for noun phrase chunking looks like this:
Confidence NN B
in IN O
the DT B
pound NN I
is VBZ O
widely RB O
expected VBN O
... etc ...
The columns are arbitrary in that they can be anything. CRF++ requires that every line have the same number of columns (or be blank, to separate sentences); not all CRF packages require that. You will have to provide the data values yourself; they are the data the classifier learns from.
While anything can go in the various columns, one convention you should know is IOB Format. To deal with potentially multi-token entities, you mark them as Inside/Outside/Beginning. It may be useful to give an example. Pretend we are training a classifier to detect names - for compactness I'll write this on one line:
John/B Smith/I ate/O an/O apple/O ./O
In columnar format it would look like this:
John B
Smith I
ate O
an O
apple O
. O
With these tags, B (beginning) means the word is the first in an entity, I means a word is inside an entity (it comes after a B tag), and O means the word is not an entity. If you have more than one type of entity it's typical to use labels like B-PERSON or I-PLACE.
The reason for using IOB tags is so that the classifier can learn different transition probabilities for starting, continuing, and ending entities. So if you're learning company names, it'll learn that Inc./I-COMPANY usually transitions to an O label, because Inc. is usually the last part of a company name.
Templates are another problem and CRF++ uses its own special format, but again, there are examples in the source distribution you can look at. Also see this question.
To answer the comment on my answer, you can generate POS tags using any POS tagger. You don't even have to provide POS tags at all, though they're usually helpful. The other labels can be added by hand or automatically; for example, you can use a list of known nouns as a starting point. Here's an example using spaCy for a simple name detector:
import spacy

nlp = spacy.load('en')
names = ['John', 'Jane']  # ... your list of known names
text = nlp("John ate an apple.")
for word in text:
    person = 'O'  # default: not a person
    if str(word) in names:
        person = 'B-PERSON'
    print(str(word), word.pos_, person)
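If you need proper IOB output for multi-token names, a small extension of the same sketch (the names list is still hypothetical) can track whether the previous token was part of a name:

import spacy

nlp = spacy.load('en')
names = ['John', 'Smith']  # hypothetical list of known name tokens
text = nlp("John Smith ate an apple.")
prev_was_name = False
for word in text:
    if str(word) in names:
        # First token of a name gets B-PERSON, later tokens I-PERSON
        person = 'B-PERSON' if not prev_was_name else 'I-PERSON'
        prev_was_name = True
    else:
        person = 'O'
        prev_was_name = False
    print(str(word), word.pos_, person)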
I had an interview today and wanted input on how you would solve this issue that came up. I answered the question, but in my mind I was thinking there is a better way.
Here is the scenario: you have two files that you need to compare. In the first file you have a list of NFL team abbreviations as strings, for example:
ARI
CHIC
GB
NYG
DET
WASH
PHL
PITT
STL
SF
CLEV
IND
DAL
KC
In the second file you would have the following information as a hash or JSON, for example:
"data":
{"description": name: "CLEV","totfd":26,"totyds":396,"pyds":282,"ryds":114,"pen":4,"penyds":24,
"trnovr":0,"pt":4,"ptyds":163,"ptavg":36,"top":"37:05"}},"players":null}
How would you take the strings in the first file (the abbreviations) and see if each abbreviation is included somewhere in the data of the second file? So, for example, I want to see if CLEV, ARI, WASH, and so on appear anywhere in the second file. If an abbreviation is included, I would want to extract information based on that abbreviation.
Here was my answer:
I would iterate over each abbreviation looking for that specific abbreviation inside the second file.
I felt my answer was poor, but I wanted to see if others had a good idea on what they would do.
You should ask questions in your interview. Some questions I'd ask:
Will the hash/json include duplicate data for teams? Meaning, will CLEV have multiple records in there? If not, now you know you have unique data so there's no need to group anything ahead of time.
If it's not unique, I'd get a list of all the names that exist in the hash, so you can do a comparison between the array given and the other file.
Building the list of names is a single O(n) pass, with an O(1) hash lookup for each value:
hash = [{'description': 'some team', 'name': 'CLEV','totfd':26,'totyds':396,'pyds':282 },
{'description': 'some team', 'name': 'PHL','totfd':26,'totyds':396,'pyds':282 }]
hash_names = hash.map { |team| team[:name] }
Now that we have a list of names from the hash, we can find where the two lists overlap. We can add the two arrays together and see which teams show up more than once. There are many ways to do that, but we should keep to our O(n) run time:
list = ["ARI","CHIC","GB","NYG","DET","WASH","PHL","PITT","STL","SF","CLEV","IND","DAL"]
teams_in_both = (list + hash_names).group_by { |team| team }.keep_if { |_, occ| occ.size > 1 }.map(&:first)
Now we have a list of:
["PHL", "CLEV"]
We know enough to say who's important to us and can fetch the remaining data accordingly.
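If you only need the overlap, an alternative with the same linear cost is to put hash_names into a Set and test each abbreviation for membership, which is O(1) per lookup.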
I'm tinkering with Prolog and ran into the following problem.
Assume I want to create a little knowledge base about the courses of an university.
I need the following two relation schemes:
relation scheme for lecturer: lecturer(Name,Surname)
relation scheme for course: course(Topic,Lecturer,Date,Location).
I have a lecturer, John Doe:
lecturer(doe,john).
John Doe teaches the complexity class:
course(complexity,lecturer(doe,john),monday,roomA).
Now I have a redundancy in the information - not good!
Is there any way to achieve something like this:
l1 = lecturer(doe,john).
course(complexity,l1,monday,roomA).
Many thanks in advance!
The same normalization possibilities as in databases apply:
id_firstname_surname(1, john, doe).
and:
course_day_room_lecturer(complexity, monday, 'A', 1).
That is, we have introduced a unique ID for each lecturer, and use that to refer to the person.
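To put the information back together, you join the two relations on the ID, e.g. with the conjunction course_day_room_lecturer(Topic, Day, Room, Id), id_firstname_surname(Id, First, Last).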
Is it possible to use WordNet to rewrite a sentence so that the semantic meaning of the sentence still stays the same (or mostly the same)?
Let's say I have this sentence:
Obama met with Putin last week.
Is it possible to use WordNet to rephrase the sentence into alternatives like:
Obama and Putin met the previous week.
Obama and Putin met each other a week ago.
If changing the sentence structure is not possible, can WordNet be used to replace only the relevant synonyms?
For example:
Obama met Putin the previous week.
If the question is whether WordNet can be used to paraphrase sentences: it is possible, but only with substantial grammatical/syntactic components on top. You would need a system that:
First gets the individual semantics of the tokens and parses the sentence for its syntax.
Then understands the overall semantics of the composite sentence (especially if it's metaphorical).
Then regenerates the sentence with some grammatical generator.
So far the only parser/generator I know of that can do something like this is ACE, though it takes a LOT of hacking to make it work as a paraphrase generator: http://sweaglesw.org/linguistics/ace/
So to answer your questions,
Is it possible to use WordNet to rephrase the sentence into alternatives? Sadly, WordNet isn't a silver bullet. You will need more than semantics for a paraphrase task.
If changing the sentence structure is not possible, can WordNet be used to replace only the relevant synonyms? Yes, this is possible. BUT figuring out which synonyms are replaceable is hard... And you would also need a morphology/syntax component.
First you will run into a problem of multiple senses per word:
from nltk.corpus import wordnet as wn

sent = "Obama met Putin the previous week"
for i in sent.split():
    possible_senses = wn.synsets(i)
    print i, len(possible_senses), possible_senses
[out]:
Obama 0 []
met 13 [Synset('meet.v.01'), Synset('meet.v.02'), Synset('converge.v.01'), Synset('meet.v.04'), Synset('meet.v.05'), Synset('meet.v.06'), Synset('meet.v.07'), Synset('meet.v.08'), Synset('meet.v.09'), Synset('meet.v.10'), Synset('meet.v.11'), Synset('suffer.v.10'), Synset('touch.v.05')]
Putin 1 [Synset('putin.n.01')]
the 0 []
previous 3 [Synset('previous.s.01'), Synset('former.s.03'), Synset('previous.s.03')]
week 3 [Synset('week.n.01'), Synset('workweek.n.01'), Synset('week.n.03')]
Then, even if you know the sense (let's say the first sense), you get multiple words per sense, and not every word can be replaced in the sentence. Moreover, they are in lemma form rather than surface form (e.g. verbs are in their base form and nouns are in the singular):
from nltk.corpus import wordnet as wn

sent = "Obama met Putin the previous week"
for i in sent.split():
    possible_senses = wn.synsets(i)
    if possible_senses:
        print i, possible_senses[0].lemma_names
    else:
        print i
[out]:
Obama
met ['meet', 'run_into', 'encounter', 'run_across', 'come_across', 'see']
Putin ['Putin', 'Vladimir_Putin', 'Vladimir_Vladimirovich_Putin']
the
previous ['previous', 'old']
week ['week', 'hebdomad']
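For instance, a naive substitution helper along these lines assumes the first sense is the intended one (often wrong without word sense disambiguation) and ignores morphology, so you would need to lemmatize "met" to "meet" before the lookup and re-inflect afterwards:

from nltk.corpus import wordnet as wn

def naive_synonyms(lemma, pos=wn.VERB):
    possible_senses = wn.synsets(lemma, pos=pos)
    if not possible_senses:
        return [lemma]
    # Note: lemma_names is a method (.lemma_names()) in newer NLTK versions
    return [l.replace('_', ' ') for l in possible_senses[0].lemma_names]

print naive_synonyms('meet')
# ['meet', 'run into', 'encounter', 'run across', 'come across', 'see']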
One approach is grammatical analysis with NLTK: after the analysis, convert your sentence into active or passive voice.
Consider a passage (~400 characters) stored in a text column of a database table, like:
There is only one more week to Easter. I have already started my
holiday. The idea of visiting my uncle during this Easter is
wonderful. His farm is in this village down in Cornwall. This village
is very peaceful and beautiful. I have asked my aunt if I can bring
Sam, my dog, with me. I promise her I will keep him under control. He
attacked and he ate some animals from her farm in October. But he is
part of the family and I cannot leave him behind.
but I need to retrieve only a limited number of characters from it, say ~150:
There is only one more week to Easter. I have already started my
holiday. The idea of visiting my uncle during this Easter is
wonderful. His farm is in this village down in Cornwall. This village
is very peaceful...
Is there any function in Rails for that output, or is truncate(text, length: ..., omission: ...) the only option?
Assuming you have a model Passage with a text field, you can select specific fields (and use SQL functions within the select) like this:
passages = Passage.select("id, LEFT(text,10) as text_short, CHAR_LENGTH(text) as text_length")
# => [#<Passage id: 1>, #<Passage id: 2>, #<Passage id: 3>]
passages.first.id
# => 1
passages.first.text_short
# => "There is o"
passages.first.text_length
# => 453
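Note that LEFT and CHAR_LENGTH are evaluated by the database itself (both exist in MySQL and PostgreSQL, for example), so only the shortened text travels to Rails.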
Why not get the whole string and only use the first 150 characters? I doubt it will slow things down much at all.
somehow_access_string[0...150] + '...'