Grouping all named entities in a document - n-gram

I would like to group all named entities in a given document.
For example:
**Barack Hussein Obama** II is the 44th and current President of the United States, and the first African American to hold the office.
I do not want to use the OpenNLP APIs, as they might not be able to recognize all named entities.
Is there any way to generate such n-grams using other services, or maybe a way to group all noun terms together?

If you want to avoid using NER, you could use a sentence chunker or parser, which will extract noun phrases generically. OpenNLP has both a chunker and a parser, but if you are for some reason averse to using OpenNLP, you can try others.
Here is some code that extracts noun phrases using the OpenNLP chunker. You will need to download the models from SourceForge here:
http://opennlp.sourceforge.net/models-1.5/
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

/**
 * Extracts noun phrases from a sentence. To create sentences using OpenNLP use
 * the SentenceDetector classes.
 */
public class OpenNLPNounPhraseExtractor {

    static final int N = 2;

    public static void main(String[] args) {
        try {
            String modelPath = "c:\\temp\\opennlpmodels\\";
            TokenizerModel tm = new TokenizerModel(new FileInputStream(new File(modelPath + "en-token.bin")));
            TokenizerME wordBreaker = new TokenizerME(tm);
            POSModel pm = new POSModel(new FileInputStream(new File(modelPath + "en-pos-maxent.bin")));
            POSTaggerME posme = new POSTaggerME(pm);
            InputStream modelIn = new FileInputStream(modelPath + "en-chunker.bin");
            ChunkerModel chunkerModel = new ChunkerModel(modelIn);
            ChunkerME chunkerME = new ChunkerME(chunkerModel);
            // this is your sentence
            String sentence = "Barack Hussein Obama II is the 44th and current President of the United States, and the first African American to hold the office.";
            // words is the tokenized sentence
            String[] words = wordBreaker.tokenize(sentence);
            // posTags are the parts of speech of every word in the sentence (the chunker needs this info, of course)
            String[] posTags = posme.tag(words);
            // chunks are the start/end "spans" indices into the words array
            Span[] chunks = chunkerME.chunkAsSpans(words, posTags);
            // chunkStrings are the actual chunks
            String[] chunkStrings = Span.spansToStrings(chunks, words);
            for (int i = 0; i < chunks.length; i++) {
                if (chunks[i].getType().equals("NP")) {
                    System.out.println("NP: \n\t" + chunkStrings[i]);
                    String[] split = chunkStrings[i].split(" ");
                    List<String> ngrams = ngram(Arrays.asList(split), N, " ");
                    System.out.println("ngrams:");
                    for (String gram : ngrams) {
                        System.out.println("\t" + gram);
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static List<String> ngram(List<String> input, int n, String separator) {
        // a phrase shorter than n cannot produce a full n-gram; return the words themselves
        if (input.size() < n) {
            return input;
        }
        List<String> outGrams = new ArrayList<String>();
        for (int i = 0; i + n <= input.size(); i++) {
            StringBuilder gram = new StringBuilder();
            for (int x = i; x < i + n; x++) {
                if (x > i) {
                    gram.append(separator);
                }
                gram.append(input.get(x));
            }
            outGrams.add(gram.toString());
        }
        return outGrams;
    }
}
The output I get with your sentence is this (with N set to 2, i.e. bigrams):
NP:
Barack Hussein Obama II
ngrams:
Barack Hussein
Hussein Obama
Obama II
NP:
the 44th and current President
ngrams:
the 44th
44th and
and current
current President
NP:
the United States
ngrams:
the United
United States
NP:
the first African American
ngrams:
the first
first African
African American
NP:
the office
ngrams:
the office
This does not explicitly handle the case where an adjective falls outside of the NP; if you need that, you can get the information from the POS tags and integrate it. What I gave you should send you in the right direction.
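If your input is a whole document rather than a single sentence, you can split it into sentences first with the SentenceDetector classes mentioned in the code comment above. A minimal sketch, assuming the en-sent.bin model from the same models-1.5 page:
import java.io.FileInputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class OpenNLPSentenceSplitter {

    public static void main(String[] args) throws Exception {
        // same model directory as in the extractor above
        String modelPath = "c:\\temp\\opennlpmodels\\";
        SentenceModel sm = new SentenceModel(new FileInputStream(modelPath + "en-sent.bin"));
        SentenceDetectorME sentenceDetector = new SentenceDetectorME(sm);
        // split the document, then feed each sentence to the noun phrase extractor above
        String document = "Barack Hussein Obama II is the 44th President of the United States. He was the first African American to hold the office.";
        for (String sentence : sentenceDetector.sentDetect(document)) {
            System.out.println(sentence);
        }
    }
}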

Pig sum fails with +ve and -ve values

I have the data below:
primary,first,second
1,393440.09,354096.08
1,4410533.33,3969479.99
1,-4803973.41,-4323576.07
I have to aggregate and sum the first and second columns. Below is the script I am executing:
data_load = load <filelocation> using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') As (primary:double, first:double, second:double);
dataAgrr = group data_load by primary;
sumData = FOREACH dataAgrr GENERATE
    group as data,
    SUM(data_load.first) as first,
    SUM(data_load.second) as second,
    SUM(data_load.primary) as primary;
After executing, the output below is produced:
(1.0,0.009999999951105565,-5.820766091346741E-11,3.0)
But manually adding the values in the second column (354096.08 + 3969479.99 - 4323576.07) gives 0.
Pig uses Java "double" internally.
Testing with the sample code below:
import java.math.BigDecimal;

public class TestSum {
    public static void main(String[] args) {
        double d1 = 354096.08;
        double d2 = 3969479.99;
        double d3 = -4323576.07;
        System.err.println("Total in double is " + ((d3 + d2) + d1));

        BigDecimal bd1 = new BigDecimal("354096.08");
        BigDecimal bd2 = new BigDecimal("3969479.99");
        BigDecimal bd3 = new BigDecimal("-4323576.07");
        System.err.println("Total in BigDecimal is " + bd3.add(bd2).add(bd1));
    }
}
This produces:
Total in double is -5.820766091346741E-11
Total in BigDecimal is 0.00
If you need better precision, you may want to try using "bigdecimal" instead of "double" in your script.
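For example, here is a sketch of the same load statement with bigdecimal fields; this assumes Pig 0.12 or later, where the bigdecimal type was introduced:
data_load = load <filelocation> using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') As (primary:bigdecimal, first:bigdecimal, second:bigdecimal);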

Context Free Grammar for English Sounding Names

I am currently writing an application that will generate random data; specifically, random names. I have made some decent progress, but am not satisfied with many of the generated names. The problem lies in my production rules, which I've attached to the bottom of this post.
The basic idea is: consonant, vowel, consonant, vowel, but some consonants themselves map to vowels (such as b<VO>).
I have not fully created the rules yet, but the final idea would follow the format shown below. However, rather than finishing it, I would like to make a better basis for the production rules.
I have tried to find a reference that discusses either: a CFG already created for English-sounding words, or an English reference that disassembles the basic format of letter combinations for words. Unfortunately, I have not been able to find a useful resource to help me advance farther than I already have. Does anyone know of a place I should look, or a reference I can look at?
ALSO: in your opinion, do you believe a context-sensitive grammar might work better?
//the following will deal with single vowels and consonants
var CO = ['b','c','d','f','g','h','j','k','l','m','n','p','qu','r','s','t','v','w','x','y','z'];
CO.probabilities = [2.41,4.49,6.87,3.59,3.25,9.84,0.24,1.24,6.5,3.88,10.9,3.11,0.153,9.67,10.2,14.6,1.58,3.81,0.242,3.19,0.12];
CO.name = "CO";
var VO = ['a','e','i','o','u'];
VO.probabilities = [21.43,33.33,18.28,19.7,7.23];
VO.name = "VO";
var LETTER = ['<VO>','<CO>'];
LETTER.probabilities = [38.1,61.9];
LETTER.name = "LETTER";
//the following deal with connsonant pairs
var BH = ['c','p','r','s','t']; //the first part of a th, ph, sh pair (before H)
BH.probabilities = [20,10,20,25,25];
BH.name = "BH";
var BL = ['b','c','f','g','p','s']; //before letter l
BL.probabilities = [10,20,10,10,25,25];
BL.name = "BL";
var COP = ['<BH>h','<BL>l']; //consonant pairs
COP.probabilities = [50,50];
COP.name = "COP";
//this is a generic syllable, that does not take grammar rules into consideration
var SYL = ['<CO><VO>','<VO><CO>','<CO><VO><VO>'];
SYL.probabilities = [50,20,30];
SYL.name = "SYL";
//the following deal with mid-word syllables
var CLOSED = ['<CO><VO><CO>','<CO><VO><CO><CO>'];
CLOSED.probabilities = [75,25];
CLOSED.name = "CLOSED";
var OPEN = ['<CO><VO>','<CO><CO><VO>'];
OPEN.probabilities = [60,40];
OPEN.name = "OPEN";
var VR = ['<VO>r']; //vowel-r
VR.probabilities = [100];
VR.name = "VR";
var MID = ['<CLOSED>','<OPEN>','<VR>'];
MID.probabilities = [33,33,33];
MID.name = "MID";
//the following will deal with ending syllables
var VCE = ['<VO><CO>e','<LETTER><VO><CO>e'];
VCE.probabilities = [75,25];
VCE.name = "VCE";
var CLE = ['<CO>le'];
CLE.probabilities = [100];
CLE.name = "CLE";
var OE = ['tion','age','ive']; //other endings
OE.probabilities = [33,33,33];
OE.name = "OE";
var ES = ['<VCE>','<CLE>','<OE>','<VR>']; //contains all ending syllables
ES.probabilities = [40,30,20,10]; //one weight per production
ES.name = "ES";
var rules = [CO,VO,BH,BL,COP,LETTER,SYL,CLOSED,OPEN,VR,MID,VCE,CLE,OE,ES];
//These are some highly-defined production rules
var streetSuffix = ['road','street','way','avenue','drive','grove','lane','gardens','place','crescent','close','square','hill','circus','mews','vale','rise','mead'];
streetSuffix.probabilities = [15,15,5,10,5,2.7,2.7,2.7,2.7,2.7,2.7,2.7,2.7,2.7,2.7,2.7,2.7,2.7];
var states = ['Alabama','Alaska','American Samoa','Arizona','Arkansas','California','Colorado','Connecticut','Delaware','Florida','Georgia','Guam','Hawaii','Idaho','Illinois','Indiana','Iowa','Kansas','Kentucky','Louisiana','Maine','Marshall Islands','Maryland','Massachusetts','Michigan','Minnesota','Mississippi','Missouri','Montana','Nebraska','Nevada','New Hampshire','New Jersey','New Mexico','New York','North Carolina','North Dakota','Ohio','Oklahoma','Oregon','Palau','Pennsylvania','Puerto Rico','Rhode Island','South Carolina','South Dakota','Tennessee','Texas','Utah','Vermont','Virgin Island','Virginia','Washington','West Virginia','Wisconsin','Wyoming'];
var cityNewWordSuffix = ['city','town',''];
var cityEndWordSuffix = ['polis','ville','ford','furt','forth','shire','berg','gurg','borough','brough','field','kirk','bury','stadt',''];
var siteSuffix = ['com','org','net','edu'];
/**
This will generate a random name
*/
function generateRandomName() {
    //string will be random length of CO VO pattern for now
    var result = "<COP><VO><MID><VO><ES>";
    while (hasNonTerminal(result)) {
        result = replaceFirstNonTerminal(result);
    }
    return result;
}
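The loop above assumes two helpers that are not shown in the post. Here are minimal sketches of what hasNonTerminal and replaceFirstNonTerminal might look like, given the rules array above (the weighted pick is illustrative):
//NOTE: these are sketches, not part of the original generator
function hasNonTerminal(s) {
    return /<[A-Z]+>/.test(s);
}
function replaceFirstNonTerminal(s) {
    var match = s.match(/<([A-Z]+)>/);
    //find the rule whose name matches the non-terminal
    var rule = rules.filter(function(r) { return r.name === match[1]; })[0];
    //weighted random pick over the rule's probabilities
    var roll = Math.random() * rule.probabilities.reduce(function(a, b) { return a + b; }, 0);
    var idx = 0;
    while (idx < rule.length - 1 && roll >= rule.probabilities[idx]) {
        roll -= rule.probabilities[idx];
        idx++;
    }
    return s.replace(match[0], rule[idx]);
}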
Here are a few words generated by the machine in its current state:
"cheiroene",
"sloeraase",
"sledehgeute",
"rhaorenone",
"rheerisute",
"chaereehe",
"sletraoege",
"sluureese",
"chaheyleete",
"chierauhe",
"ploclooate",
"glawofhaice",
"thanisgoage",
"slelaodose",
"blaereode",
"shihudeife",
"slaereene",
"pleheaele",
"rhepicsaile",
"ploeruoge",
"sliareuhe",
"thaereafe",
"thaaraeke",
"cheoreate",
"shofetniote",
"phiraoese",
"clilniueye",
"slepceikede",
"cligloueohe",
"phitleoime",

How do I get the score distribution value for CoreNLP Sentiment?

I have set up a CoreNLP server on my Ubuntu instance and it works OK. I am more interested in the sentiment module, and currently all I get is
{
sentimentValue: "2",
sentiment: "Neutral"
}
What I need is the score distribution, as you see here: http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
"scoreDistr": [0.1685, 0.7187, 0.0903, 0.0157, 0.0068]
What am I missing, or how do I get such data?
Thanks
You need to get a tree object via SentimentCoreAnnotations.SentimentAnnotatedTree.class from your annotated sentence. Then you can get the predictions through the RNNCoreAnnotations class. I wrote the self-contained demo code below that shows how to get the scores for each label of a CoreNLP sentiment prediction.
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.ejml.simple.SimpleMatrix;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.util.CoreMap;

public class DemoSentiment {
    public static void main(String[] args) {
        final List<String> texts = Arrays.asList("I am happy.", "This is a neutral sentence.", "I am very angry.");

        final Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
        final StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        for (String text : texts) {
            final Annotation doc = new Annotation(text);
            pipeline.annotate(doc);

            for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
                final Tree tree = sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class);
                final SimpleMatrix sm = RNNCoreAnnotations.getPredictions(tree);
                final String sentiment = sentence.get(SentimentCoreAnnotations.SentimentClass.class);
                System.out.println("sentence: " + sentence);
                System.out.println("sentiment: " + sentiment);
                System.out.println("matrix: " + sm);
            }
        }
    }
}
The output will be similar to what is below (floating-point rounding or updated models might change the scores slightly).
For the first sentence, I am happy., you can see that the sentiment is Positive, and the highest value in the returned matrix is 0.618, at the fourth position when reading the matrix as an ordered list.
The second sentence, This is a neutral sentence., has its highest score in the middle, at 0.952, hence the Neutral sentiment.
The last sentence correspondingly has a Negative sentiment, with its highest score of 0.652 at the second position.
sentence: I am happy.
sentiment: Positive
matrix: Type = dense , numRows = 5 , numCols = 1
0.016
0.037
0.132
0.618
0.196
sentence: This is a neutral sentence.
sentiment: Neutral
matrix: Type = dense , numRows = 5 , numCols = 1
0.001
0.007
0.952
0.039
0.001
sentence: I am very angry.
sentiment: Negative
matrix: Type = dense , numRows = 5 , numCols = 1
0.166
0.652
0.142
0.028
0.012
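If you want the distribution itself, like the scoreDistr array from the online demo, you can read the prediction matrix entry by entry. A small sketch that would sit inside the sentence loop above; the row order (very negative to very positive) follows the sentiment model's five classes:
// rows 0..4 of the prediction matrix correspond to:
// very negative, negative, neutral, positive, very positive
double[] scoreDistr = new double[5];
for (int i = 0; i < 5; i++) {
    scoreDistr[i] = sm.get(i, 0);
}
System.out.println("scoreDistr: " + Arrays.toString(scoreDistr));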

Why does the POS tagging algorithm tag `can't` as separate words?

I'm using the Stanford Log-linear Part-Of-Speech Tagger, and here is the sample sentence that I tag:
He can't do that
When tagged I get this result:
He_PRP ca_MD n't_RB do_VB that_DT
As you can see, can't is split into two tokens: ca is marked as a modal (MD) and n't is marked as an adverb (RB).
I actually get the same result if I use can not separately: can is MD and not is RB. So is this way of breaking up expected, rather than, say, can_MD and 't_RB?
Note: This is not a perfect answer.
I think the problem originates from the tokenizer used in the Stanford POS Tagger, not from the tagger itself. The tokenizer (PTBTokenizer) cannot handle apostrophes properly; see:
1- Stanford PTBTokenizer token's split delimiter.
2- Stanford coreNLP - split words ignoring apostrophe.
As mentioned in the Stanford Tokenizer documentation, the PTBTokenizer will tokenize the sentence:
"Oh, no," she's saying, "our $400 blender can't handle something this
hard!"
to:
...... our $ 400 blender ca n't handle something
Try to find a suitable tokenization method, apply it, and pass the pre-tokenized words to the tagger as follows:
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class Test {
    public static void main(String[] args) throws Exception {
        String model = "F:/code/stanford-postagger-2015-04-20/models/english-left3words-distsim.tagger";
        MaxentTagger tagger = new MaxentTagger(model);

        List<HasWord> sent;
        sent = Sentence.toWordList("He", "can", "'t", "do", "that", ".");
        //sent = Sentence.toWordList("He", "can't", "do", "that", ".");

        List<TaggedWord> taggedSent = tagger.tagSentence(sent);
        for (TaggedWord tw : taggedSent) {
            System.out.print(tw.word() + "=" + tw.tag() + " , ");
        }
    }
}
output:
He=PRP , can=MD , 't=VB , do=VB , that=DT , .=. ,
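For comparison, letting the tagger run its default PTB-style tokenization on the raw string reproduces the split from the question. A quick check, reusing the tagger instance from above:
// the default tokenization splits "can't" into "ca" and "n't",
// tagged as ca_MD n't_RB, exactly as in the question
System.out.println(tagger.tagString("He can't do that."));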

Household mail merge (code golf)

I wrote some mail merge code the other day and although it works, I'm a bit turned off by the code. I'd like to see what it would look like in other languages.
So for the input, the routine takes a list of contacts:
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Erica,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Marge,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
Ted,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Raoul,Simpson,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
It will then merge lines with the same address and surname into one record (assume the rows are unsorted). The code should also be flexible enough that fields can be supplied in any order, so it will need to take field indexes as parameters. For a family of two it concatenates both first name fields. For a family of three or more, the first name is set to "The" and the last name is set to the surname plus " Family".
Erica and Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
The,Simpson Family,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
My C# implementation of this is:
var source = File.ReadAllLines(@"sample.csv").Select(l => l.Split(','));
var merged = HouseholdMerge(source, 0, 1, new[] {1, 2, 3, 4, 5});
public static IEnumerable<string[]> HouseholdMerge(IEnumerable<string[]> data, int fnIndex, int lnIndex, int[] groupIndexes)
{
    Func<string[], string> groupby = fields => String.Join("", fields.Where((f, i) => groupIndexes.Contains(i)));
    var groups = data.OrderBy(groupby).GroupBy(groupby);
    foreach (var group in groups)
    {
        string[] result = group.First().ToArray();
        if (group.Count() == 2)
        {
            result[fnIndex] += " and " + group.ElementAt(1)[fnIndex];
        }
        else if (group.Count() > 2)
        {
            result[fnIndex] = "The";
            result[lnIndex] += " Family";
        }
        yield return result;
    }
}
I don't like how I've had to do the groupby delegate. I'd like it if C# had some way to convert a string expression to a delegate, e.g. Func groupby = f => "f[2] + f[3] + f[4] + f[5] + f[1];". I have a feeling something like this can probably be done in Lisp or Python. I look forward to seeing nicer implementations in other languages.
Ruby — 181 155
Name/surname indexes are in a and b. Input data is from ARGF.
a,b=0,1
[*$<].map{|i|i.strip.split ?,}.group_by{|i|i.rotate(a).drop 1}.map{|i,j|k,l,m=j
k[a]+=' and '+l[a]if l
(k[a]='The';k[b]+=' Family')if m
puts k*','}
Python - not golfed
I'm not sure what the order of the rows should be if the indices are not 0 and 1 for the input file.
import csv
from collections import defaultdict

class HouseHold(list):
    def __init__(self, fn_idx, ln_idx):
        self.fn_idx = fn_idx
        self.ln_idx = ln_idx

    def append(self, item):
        self.item = item
        list.append(self, item[self.fn_idx])

    def get_value(self):
        fn_idx = self.fn_idx
        ln_idx = self.ln_idx
        item = self.item
        addr = [j for i, j in enumerate(item) if i not in (fn_idx, ln_idx)]
        if len(self) < 3:
            fn, ln = " and ".join(self), item[ln_idx]
        else:
            fn, ln = "The", item[ln_idx] + " Family"
        return [fn, ln] + addr

def source(fname):
    with open(fname) as in_file:
        for item in csv.reader(in_file):
            yield item

def household_merge(src, fn_idx, ln_idx, groupby):
    res = defaultdict(lambda: HouseHold(fn_idx, ln_idx))
    for item in src:
        key = tuple(item[x] for x in groupby)
        res[key].append(item)
    return res.values()

data = household_merge(source("sample.csv"), 0, 1, [1, 2, 3, 4, 5, 6, 7])
with open("result.csv", "w") as out_file:
    csv.writer(out_file).writerows(item.get_value() for item in data)
Python - 178 chars
import sys
d={}
for x in sys.stdin:F,c,A=x.partition(',');d[A]=d.get(A,[])+[F]
print"".join([" and ".join(v)+c+A,"The"+c+A.replace(c,' Family,',1)][2<len(v)]for A,v in d.items())
Output
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
The,Simpson Family,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
Erica and Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Python 2.6.6 - 287 Characters
This assumes you can hard-code a filename (named i). If you want to take input from the command line, this goes up ~16 chars.
from itertools import*
for z,g in groupby(sorted([l.split(',')for l in open('i').readlines()],key=lambda x:x[1:]), lambda x:x[2:]):
 l=list(g);r=len(l);k=','.join(z);o=l[0]
 if r>2:print'The,'+o[1],"Family,"+k,
 elif r>1:print o[0],"and",l[1][0]+","+o[1]+","+k,
 else:print','.join(o),
Output
Erica and Abraham,Johnson,2681 Eagle Peak,,Bellevue,Washington,United States,98004
Larry,Lyon,52560 Free Street,,Toronto,Ontario,Canada,M4B 1V7
The,Simpson Family,6388 Lake City Way,,Burnaby,British Columbia,Canada,V5A 3A6
Jim,Smith,2681 Eagle Peak,,Bellevue,Washington,United States,98004
I'm sure this could be improved upon, but it is getting late.
Haskell - 341 321
(Changes as per comments).
Unfortunately, Haskell has no standard split function, which makes this rather long.
Input to stdin, output on stdout.
import List
import Data.Ord
main=interact$unlines.e.lines
s[]=[]
s(',':x)=s x
s l@(x:y)=let(h,i)=break(==k)l in h:(s i)
t[]=[]
t x=tail x
h=head
m=map
k=','
e l=m(t.(>>=(k:)))$(m c$groupBy g$sortBy(comparing t)$m s l)
c(x:[])=x
c(x:y:[])=(h x++" and "++h y):t x
c x="The":((h$t$h x)++" Family"):(t$t$h x)
g a b=t a==t b
Lua, 434 bytes
x,y=1,2 s,p,r,a=string.gsub,pairs,io.read,{}for j,b,c,d,e,f,g,h,i in r('*a'):gmatch('('..('([^,]*),'):rep(7)..'([^,]*))\n')
do k=s(s(s(j,b,''),c,''),'[,%s]','')for l,m in p(a)do if not m.f and (m[y]:match(c) and m[9]==k) then z=1
if m.d then m[x]="The"m[y]=m[y]..' family'm.f=1 else m[x]=m[x].." and "..b m.d=1 end end end if not z then
a[#a+1]={b,c,d,e,f,g,h,i,k} end z=nil end for k,v in p(a)do v[9]=nil print(table.concat(v,','))end
